Conference Agenda

Overview and details of the sessions of IIPC WAC 2024.

Please note that all times are shown in the time zone of the conference (CEST).

Session Overview

Session: SESSION #01: Artificial Intelligence & Machine Learning
Time: Thursday, 25/Apr/2024, 11:20am - 12:40pm
Session Chair: Andrea Goethals, National Library of New Zealand
Location: Grand Auditorium [François-Mitterrand site]


Presentations
11:20am - 11:40am

Re-imagining Large-Scale Search & Discovery for the Library of Congress’s .gov Holdings

Benjamin Lee

Library of Congress, United States of America

Longstanding efforts by the Library of Congress over the past two-plus decades have yielded enormously rich web archives. These web archives – especially the .gov holdings – represent an unparalleled opportunity to study the history of the past 25 years. However, scholars and the public alike face a persistent challenge of scale: how to navigate and analyze the .gov domain archives, which contain billions of webpage snapshots yet offer limited affordances for searching them. Given the centrality of these archives to understanding the digital revolution and broader society in the 21st century, addressing this challenge of searchability at scale is all the more important.

I will present progress on an interdisciplinary machine learning research project to re-envision search and discovery for these .gov web archives within the Library of Congress’s holdings, with the goal of understanding the U.S. government’s evolving online presence. My talk will focus on two primary areas. First, I will detail my work to incorporate recent developments in human-AI interaction and interactive machine learning into new search affordances beyond standard keyword search. In particular, I will discuss my in-progress work on multimodal, user-adaptable search, which enables end-users to interactively search not only over text but also over images and visual features, according to facets and concepts of interest. I will present a short demo of these affordances. Second, I will describe how such affordances can be used by end-users interested in studying the online presence of the United States government at scale. Here, I will build on existing work on scholarly use of web archives to describe next steps for evaluation. I will also detail new collaborations with scholars in other disciplines who will use these affordances to answer research questions.
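
As a rough illustration of the multimodal search idea (not the author's implementation), the following minimal Python sketch embeds page screenshots and page text into a shared space with an off-the-shelf CLIP model via the sentence-transformers library; the model choice, file names, and sample texts are assumptions for illustration only.

    # Minimal sketch: joint text/image similarity search over archived pages.
    # The CLIP model, file names, and sample texts are illustrative assumptions.
    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("clip-ViT-B-32")  # embeds both images and text

    # Hypothetical corpus: screenshots of archived .gov pages plus extracted text.
    page_images = [Image.open(p) for p in ["snapshot_001.png", "snapshot_002.png"]]
    page_texts = ["Press release on the 2000 census results...",
                  "Agency homepage with navigation links and contact details..."]

    image_embeddings = model.encode(page_images, convert_to_tensor=True)
    text_embeddings = model.encode(page_texts, convert_to_tensor=True)

    # A free-text query can be matched against either modality.
    query = model.encode("charts about population statistics", convert_to_tensor=True)
    print(util.cos_sim(query, image_embeddings))  # similarity to page screenshots
    print(util.cos_sim(query, text_embeddings))   # similarity to page text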



11:40am - 12:00pm

Extending Classification Models with Bibliographic Metadata: Datasets and Results

Mark Phillips1, Cornelia Caragea2, Seo Yeon Park2, Praneeth Rikka1, Saran Pandi2

1University of North Texas, United States of America; 2University of Illinois Chicago, United States of America

The University of North Texas and the University of Illinois Chicago have been working on a series of projects and experiments focused on the use of machine learning models to assist in the classification of high-value publications from the web. The ultimate goal of this work is to create methods for identifying publications that can be added to existing digital library infrastructures that support discovery, access, and further preservation of these resources.

During the first round of research, the team developed datasets to support this effort, including state documents from the texas.gov domain, scholarly publications from the unt.edu domain, and technical reports from the usda.gov domain. These datasets are manually labeled as either “in-scope” or “not in-scope” of the collection development plans for local collections of publications. Additionally, the research team developed datasets containing positive-only samples to augment the labeled datasets and provide more potential training data.

In the second round of research, additional datasets were created to test new approaches for incorporating bibliographic metadata into model building. A dataset of publications from the state of Michigan and its michigan.gov domain was created and labeled with the same “in-scope” and “not in-scope” labels. Next, several metadata-only datasets were created to test the applicability of leveraging existing bibliographic metadata in model building.
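
As a rough illustration of how bibliographic metadata might be folded into such a classifier (not the project's actual models), the following minimal scikit-learn sketch concatenates a metadata string onto the document text before vectorization; the field names, sample records, and model choice are illustrative assumptions.

    # Minimal sketch: binary "in-scope" classification with document text plus
    # bibliographic metadata folded in as extra text features. Field names and
    # sample records are illustrative; the project's actual datasets and models differ.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    records = [
        {"text": "Annual report of the state water board ...",
         "metadata": "title: Water Board Annual Report; publisher: State of Michigan",
         "label": 1},   # in-scope
        {"text": "Subscribe to our newsletter for updates ...",
         "metadata": "title: Newsletter signup; publisher: unknown",
         "label": 0},   # not in-scope
    ]

    # One simple way to extend a text classifier with metadata: concatenate the
    # metadata string onto the document text before vectorization.
    X = [r["text"] + " " + r["metadata"] for r in records]
    y = [r["label"] for r in records]

    classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
    classifier.fit(X, y)
    print(classifier.predict(["Technical bulletin issued by the department of agriculture"]))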

Finally, collections of unlabeled PDF content from these various web archives were generated to provide large collections of data for experimenting with models that require larger amounts of data to work successfully.

This presentation will describe how various web archives were used to create these datasets. The process for labeling the datasets will be discussed, as well as the need for the additional positive-only datasets that were created for the project. We will present findings on the utility of existing bibliographic metadata for assisting in the training of models for document classification, and report on the design and results of experiments completed to help answer the question of how existing bibliographic metadata can be leveraged to assist in the automatic selection of high-value publications from web archives. This presentation will provide concrete examples of how web archives can be used to develop datasets that contribute to research projects spanning the fields of machine learning and information science. We hope that the processes used in this research project will be applicable to similar projects.



12:00pm - 12:20pm

Utilizing Large Language Models for Semantic Search and Summarization of International Television News Archives

Sawood Alam1, Mark Graham1, Roger Macdonald1, Kalev Leetaru2

1Internet Archive, United States of America; 2GDELT Project, United States of America

Among many different media types, the Internet Archive also preserves television news from various international TV channels in many different languages. The GDELT project leverages some Google Cloud services to transcribe and translate these archived TV news collections and makes them more accessible. However, the amount of transcribed and translated text produced daily can be overwhelming for human consumption in its raw form. In this work we leverage Large Language Models (LLMs) to summarize daily news and facilitate semantic search and question answering against the longitudinal index of the TV news archive.

The end-to-end pipeline of this process includes TV stream archiving, audio extraction, transcription, translation, chunking, vectorization, clustering, sampling, summarization, and representation. Translated transcripts are split into smaller chunks of about 30 seconds (a tunable parameter), on the assumption that this duration is neither so long that a chunk spans multiple concepts nor so short that it captures only a partial concept discussed on TV. These chunks are treated as independent documents for which vector representations are retrieved from a Generative Pre-trained Transformer (GPT) model. The generated vectors are clustered using algorithms like KNN or DBSCAN to identify pieces of transcripts throughout the day that repeat similar concepts. The centroid of each cluster is selected as the representative sample for its topic. GPT models are then leveraged to summarize each sample. We have crafted a prompt that instructs the GPT model to synthesize the most prominent headlines, their descriptions, various types of classifications, and keywords/entities from the provided transcripts.
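
As a rough illustration of the chunk/embed/cluster/sample/summarize steps (the deployed pipeline, open-sourced at https://github.com/internetarchive/newsum, differs in detail), the following minimal Python sketch uses the OpenAI embeddings and chat APIs with scikit-learn's DBSCAN; the model names, thresholds, sample chunks, and prompt are placeholders.

    # Minimal sketch of the chunk -> embed -> cluster -> sample -> summarize steps.
    # Model names, thresholds, sample chunks, and the prompt are placeholders.
    import numpy as np
    from openai import OpenAI
    from sklearn.cluster import DBSCAN

    client = OpenAI()

    # Hypothetical ~30-second translated transcript chunks for one day of one channel.
    chunks = ["... flooding reported across the northern provinces ...",
              "... officials announced new flood relief measures today ...",
              "... in sports, the national team won the qualifier 2-1 ..."]

    response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    vectors = np.array([item.embedding for item in response.data])

    # Cluster chunks that repeat similar concepts throughout the day.
    labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(vectors)

    # Pick the chunk nearest each cluster's centroid as its representative sample.
    samples = []
    for label in set(labels) - {-1}:                      # -1 marks DBSCAN noise
        members = np.where(labels == label)[0]
        centroid = vectors[members].mean(axis=0)
        nearest = members[np.argmin(np.linalg.norm(vectors[members] - centroid, axis=1))]
        samples.append(chunks[nearest])

    # Summarize each representative sample with an instruction prompt.
    for sample in samples:
        completion = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": "Synthesize the main headline, a short description, "
                                  "and keywords from this transcript:\n" + sample}])
        print(completion.choices[0].message.content)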

We classify clusters to identify whether they represent ads or local news that might not be of interest to an international audience. After excluding unnecessary clusters, the interactive summary of each headline is rendered in a web application. We also maintain metadata for each chunk (video IDs and timestamps), which we use in the representation to embed the corresponding small part of the archived video for reference.

Furthermore, valuable chunks of transcripts and associated metadata are stored in a vector database to facilitate semantic search and LLM-powered question answering. The vector database is queried with the search question to identify, based on vector similarity, the most relevant transcript chunks that might help answer the question. The returned documents are then used in LLM APIs with suitable prompts to generate answers.
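
As a rough illustration of the retrieval-and-answer step (not the project's deployed stack), the following minimal sketch uses an in-memory chromadb collection as the vector database and the OpenAI chat API for answer generation; the collection name, sample chunks, metadata fields, and prompt are illustrative assumptions.

    # Minimal sketch: semantic search over stored transcript chunks, followed by
    # LLM question answering. chromadb and the OpenAI API are assumptions here.
    import chromadb
    from openai import OpenAI

    client = OpenAI()
    collection = chromadb.Client().create_collection("tv_news_chunks")

    # Index hypothetical chunks with their metadata (video ID and timestamp).
    collection.add(
        ids=["chunk-1", "chunk-2"],
        documents=["... officials announced new flood relief measures today ...",
                   "... the central bank kept interest rates unchanged ..."],
        metadatas=[{"video_id": "channel1-20240401", "start_seconds": 3600},
                   {"video_id": "channel2-20240401", "start_seconds": 120}],
    )

    question = "What relief measures were announced after the floods?"
    hits = collection.query(query_texts=[question], n_results=2)
    context = "\n".join(hits["documents"][0])

    answer = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": "Answer the question using only these transcripts:\n"
                              + context + "\n\nQuestion: " + question}])
    print(answer.choices[0].message.content)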

We have deployed a test instance of our experiment and open-sourced our implementation (https://github.com/internetarchive/newsum).



12:20pm - 12:40pm

MeshWARC: Exploring the Semantic Space of the Web Archive

Amr Sheta2, Mohab Yousry2, Youssef Eldakar1

1Bibliotheca Alexandrina, Egypt; 2Alexandria University, Egypt

The web is known as a network of webpages connected through hyperlinks, but to what degree are hyperlinks consistent with semantic relationships? Everyday web browsing experience shows that hyperlinks do not always follow semantics, as in the case of ads, so people mostly resort to search engines for navigating the web. This sparks the need for an alternative way of linking resources in the web archive, both for a better navigation experience and to enhance the search process in the future. We introduce meshWARC, a novel technique for constructing a network representation of web archives based on the semantic similarity between webpages. This method transforms the textual content of pages into vector embeddings using a multilingual sentence transformer and then constructs a graph based on a similarity measure between each pair of pages. The graph is further enriched with topic modeling to group pages of the same topic into clusters and assign a suitable title to each cluster.

The process begins with the elimination of irrelevant content from the WARC files and the filtering out of all non-HTML resources. For this we use the DBSCAN clustering algorithm, as we anticipate, based on observations from experimenting with the data, that pages with no actual textual content, e.g., “soft 404” pages and possibly homepages, will have closely related vector embeddings. A graph is then constructed by computing the cosine similarity between each pair of pages and connecting pairs whose similarity score is higher than a certain threshold.
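
As a rough illustration of this graph-construction step (not the meshWARC implementation itself), the following minimal Python sketch reads HTML records from a WARC file with warcio, embeds the extracted text with a multilingual sentence transformer, and links page pairs whose cosine similarity clears a threshold; the file name, model, and threshold are assumptions.

    # Minimal sketch: embed archived HTML pages and connect pairs whose cosine
    # similarity clears a threshold. File name, model, and threshold are assumptions.
    import networkx as nx
    from bs4 import BeautifulSoup
    from sentence_transformers import SentenceTransformer, util
    from warcio.archiveiterator import ArchiveIterator

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    urls, texts = [], []
    with open("example.warc.gz", "rb") as stream:          # hypothetical WARC file
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue                                   # keep only HTML resources
            html = record.content_stream().read()
            urls.append(record.rec_headers.get_header("WARC-Target-URI"))
            texts.append(BeautifulSoup(html, "html.parser").get_text(" ", strip=True))

    embeddings = model.encode(texts, convert_to_tensor=True)
    similarity = util.cos_sim(embeddings, embeddings)

    THRESHOLD = 0.6                                        # tunable similarity cutoff
    graph = nx.Graph()
    graph.add_nodes_from(urls)
    for i in range(len(urls)):
        for j in range(i + 1, len(urls)):
            if similarity[i][j] >= THRESHOLD:
                graph.add_edge(urls[i], urls[j], weight=float(similarity[i][j]))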

For topic modeling, we developed an enhanced version of BERTopic, which incorporates our new clustering algorithm to generate clusters and identify the remaining noise from our previous technique. It also uses the attention values of each word in the document to highlight the most important words and reduce the document size. The resulting clusters are each labeled with a generated topic title, providing a comprehensive and semantically meaningful representation of the cluster.
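
For orientation, a minimal sketch of off-the-shelf BERTopic over the page texts gathered in the previous sketch is shown below; the authors' enhanced version (custom clustering and attention-based word weighting) is not reproduced here.

    # Minimal sketch: off-the-shelf BERTopic over the page texts from the sketch
    # above; the enhanced clustering and attention-based word weighting described
    # in the abstract are not reproduced here.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    topic_model = BERTopic(embedding_model=embedding_model)

    # `texts` is the list of page texts extracted in the previous sketch; topic
    # modeling needs a reasonably large corpus to produce meaningful clusters.
    topics, probabilities = topic_model.fit_transform(texts)
    print(topic_model.get_topic_info())   # cluster sizes and generated topic labels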

To expand on this work in the future, a search engine can be created by representing the search text as a vector embedding and comparing it to the centers of the clusters, which narrows down the search space. A page's rank can then be derived from its similarity to the search terms and from how central the page is within its cluster. We are also interested in assessing how much the graph constructed from semantic similarity has in common with the hyperlink graph.
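
As a rough sketch of this ranking idea, one could combine a page's similarity to the query with its closeness to its cluster center; the weighting and the centrality definition below are illustrative assumptions, not part of meshWARC.

    # Minimal sketch of the ranking idea: combine a page's similarity to the query
    # with how central the page is within its cluster. The 0.7/0.3 weighting and
    # the centrality definition are illustrative assumptions, not part of meshWARC.
    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def rank_pages(query_vec, page_vecs, cluster_centers, cluster_ids, alpha=0.7):
        scores = []
        for vec, cid in zip(page_vecs, cluster_ids):
            relevance = cosine(query_vec, vec)              # similarity to the search text
            centrality = cosine(vec, cluster_centers[cid])  # closeness to cluster center
            scores.append(alpha * relevance + (1 - alpha) * centrality)
        return np.argsort(scores)[::-1]                     # page indices, best first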


