Conference Agenda

Overview and details of the sessions of this conference.

Please note that all times are shown in the time zone of the conference (CEST).

Sessions including 'WS'

IIPC Technical Training
Time: Wednesday, 24/Apr/2024: 3:45pm - 5:15pm

Session Chair: Gil Hoggarth, The British Library
Location: Salle 70 [François-Mitterrand site]


15:45-15:50 Introductions
15:50-16:40 Presentation: Using a Shell Script to Start Browsertrix in a Docker Container [Antares Reich, Austrian National Library]

  • Approach & the Script
  • Monitoring
  • Profile Validation
  • Rewriting and ingest into NetArchiveSuite
  • Q&A

16:40-17:10 Technical training: brainstorming session
17:10-17:15 Wrap-up




Keynote panel: Here Ya Free! Crossed views on Skyblog, the French pioneer of digital social networks
Time: Thursday, 25/Apr/2024: 10:15am - 11:15am

Session Chair: Emmanuelle Bermès, Ecole des chartes
Location: Grand Auditorium [François-Mitterrand site]




WORKSHOP #06: Browser-Based Crawling For All: Introduction to Quality Assurance with Browsertrix Cloud
Time: Thursday, 25/Apr/2024: 1:40pm - 3:00pm

Location: Salle 70 [François-Mitterrand site]




Presentations including 'WS'

SESSION #01: Artificial Intelligence & Machine Learning
Time: 25/Apr/2024: 11:20am-12:40pm · Location: Grand Auditorium [François-Mitterrand site]

12:00pm - 12:20pm

Utilizing Large Language Models for Semantic Search and Summarization of International Television News Archives

Sawood Alam¹, Mark Graham¹, Roger Macdonald¹, Kalev Leetaru²

¹Internet Archive, United States of America; ²GDELT Project, United States of America

Among many different media types, the Internet Archive also preserves television news from various international TV channels in many different languages. The GDELT project leverages some Google Cloud services to transcribe and translate these archived TV news collections and makes them more accessible. However, the amount of transcribed and translated text produced daily can be overwhelming for human consumption in its raw form. In this work we leverage Large Language Models (LLMs) to summarize daily news and facilitate semantic search and question answering against the longitudinal index of the TV news archive.

The end-to-end pipeline of this process includes tasks of TV stream archiving, audio extraction, transcription, translation, chunking, vectorization, clustering, sampling, summarization, and representation. Translated transcripts are split into smaller chunks of about 30 seconds (a tunable parameter), on the assumption that this duration is long enough to capture a complete concept discussed on TV, yet not so long that it spans multiple concepts. These chunks are treated as independent documents for which vector representations are retrieved from a Generative Pre-trained Transformer (GPT) model. Generated vectors are clustered using algorithms like KNN or DBSCAN to identify pieces of transcripts throughout the day that are repetitions of similar concepts. The centroid of each cluster is selected as the representative sample for its topic. GPT models are leveraged to summarize each sample. We have crafted a prompt that instructs the GPT model to synthesize the most prominent headlines, their descriptions, various types of classifications, and keywords/entities from the provided transcripts.
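
A minimal sketch of the embedding, clustering, and summarization stages might look as follows, assuming the OpenAI API and scikit-learn's DBSCAN; the model names, prompt, and parameters are placeholders, not the authors' actual configuration:

    # Hypothetical sketch of the chunk -> embed -> cluster -> summarize stages.
    # Model names, eps, and the prompt are assumptions, not the authors' settings.
    import numpy as np
    from openai import OpenAI
    from sklearn.cluster import DBSCAN

    client = OpenAI()

    def embed(chunks):
        # One embedding vector per ~30-second transcript chunk.
        resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
        return np.array([d.embedding for d in resp.data])

    def representative_chunks(chunks, vectors):
        # Cluster chunk vectors; keep the chunk nearest each cluster centroid.
        labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(vectors)
        reps = []
        for label in set(labels) - {-1}:  # -1 marks noise points in DBSCAN
            members = np.where(labels == label)[0]
            centroid = vectors[members].mean(axis=0)
            nearest = members[np.argmin(np.linalg.norm(vectors[members] - centroid, axis=1))]
            reps.append(chunks[nearest])
        return reps

    def summarize(transcript):
        # Ask a GPT model for headline, description, and keywords (illustrative prompt).
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content":
                       "Synthesize the most prominent headline, a short description, "
                       "and keywords/entities from this TV transcript:\n\n" + transcript}])
        return resp.choices[0].message.content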

We classify clusters to identify whether they represent ads or local news that might not be of interest to an international audience. After excluding unnecessary clusters, the interactive summary of each headline is rendered in a web application. We also maintain metadata for each chunk (video IDs and timestamps), which we use in the representation to embed the corresponding small part of the archived video for reference.

Furthermore, valuable chunks of transcripts and associated metadata are stored in a vector database to facilitate semantic search and LLM-powered question answering. The vector database is queried with the search question to identify, by vector similarity, the stored transcript chunks most likely to help answer it. The returned documents are then passed to LLM APIs with suitable prompts to generate answers.
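
A rough sketch of this retrieval-and-answer loop, with Chroma standing in for the unnamed vector database (the collection name, model, and prompt are illustrative assumptions, not the authors' setup):

    # Hypothetical retrieval-augmented question answering over archived transcripts.
    import chromadb
    from openai import OpenAI

    llm = OpenAI()
    collection = chromadb.Client().create_collection("tv_news_chunks")  # name is illustrative

    def index_chunk(chunk_id, text, video_id, timestamp):
        # Store a chunk with the video ID/timestamp metadata kept for video playback.
        collection.add(ids=[chunk_id], documents=[text],
                       metadatas=[{"video_id": video_id, "timestamp": timestamp}])

    def answer(question, k=5):
        # Retrieve the k most similar chunks, then prompt an LLM with them as context.
        hits = collection.query(query_texts=[question], n_results=k)
        context = "\n---\n".join(hits["documents"][0])
        resp = llm.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content":
                       "Answer using only these transcripts:\n" + context +
                       "\n\nQuestion: " + question}])
        return resp.choices[0].message.content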

We have deployed a test instance of our experiment and open-sourced our implementation (https://github.com/internetarchive/newsum).



WORKSHOP #06: Browser-Based Crawling For All: Introduction to Quality Assurance with Browsertrix Cloud
Time: 25/Apr/2024: 1:40pm-3:00pm · Location: Salle 70 [François-Mitterrand site]

Browser-Based Crawling For All: Introduction to Quality Assurance with Browsertrix Cloud

Andrew Jackson¹, Anders Klindt Myrvoll², Ilya Kreymer³

¹Digital Preservation Coalition, United Kingdom; ²Royal Danish Library; ³Webrecorder, United States of America

Through the IIPC-funded “Browser-based crawling system for all” project (https://netpreserve.org/projects/browser-based-crawling/), members have been working with Webrecorder to support the development of their new crawling tools: Browsertrix Crawler (a crawl system that automates web browsers to crawl the web) and Browsertrix Cloud (a user interface to control multiple Browsertrix Crawler crawls). The IIPC funding is particularly focussed on making sure IIPC members can make use of these tools.

This workshop will be led by Webrecorder, who will invite you to try out an instance of Browsertrix Cloud and explore how it might work for you, and how the latest QA features might help. IIPC members that have been part of the project will be on hand to discuss institutional issues like deployment and integration.

The workshop will start with a brief introduction to high-fidelity web archiving with Browsertrix Cloud. Participants will be given accounts and be able to create their own high-fidelity web archives using the Browsertrix Cloud crawling system during the workshop. Users will be presented with an interactive user interface to create crawls, allowing them to configure crawling options. We will discuss the crawling configuration options and how these affect the result of the crawl. Users will have the opportunity to run a crawl of their own for at least 30 minutes and to see the results.
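
For readers unfamiliar with the underlying tooling: Browsertrix Cloud drives Browsertrix Crawler, so the workflow options in its interface correspond roughly to the crawler's command-line flags. A hedged sketch of an equivalent local run (the flags are real browsertrix-crawler options; the values are examples, not workshop settings):

    # Illustrative local equivalent of a Browsertrix Cloud crawl workflow.
    import os
    import subprocess

    subprocess.run([
        "docker", "run", "-v", os.getcwd() + "/crawls:/crawls/",
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", "https://example.com/",  # seed URL
        "--scopeType", "prefix",          # stay within the seed's path prefix
        "--depth", "2",                   # limit link depth from the seed
        "--workers", "2",                 # parallel browser windows
        "--generateWACZ",                 # package the result as a WACZ file
        "--collection", "wac-workshop",
    ], check=True)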

After a quick break, we will then explore the latest Quality Assurance features of Browsertrix Cloud. This includes ‘patch crawling’: using the ArchiveWeb.page browser extension to archive difficult pages and then integrating those results into a Browsertrix Cloud collection.

In the final wrap-up of the workshop, we will discuss challenges, what worked and what didn’t, what still needs improvement, etc. We will also outline how participants can provide access to the web archives they created, either using standalone tools or by integrating them into their existing web archive collections. IIPC member institutions that have been testing Browsertrix Cloud thus far will share their experiences in this area.

The format of the workshop will be as follows:

  • Introduction to Browsertrix Cloud - 10 min
  • Use Cases and Examples by IIPC project partners - 10 min
  • Break - 5 min
  • Hands-On: Setup and Crawling with Browsertrix Cloud (including Q&A / help while crawls are running) - 20 min
  • Break - 5 min
  • Hands-On: Quality Assurance with Browsertrix Cloud - 10 min
  • Wrap-Up: Final Q&A / Discuss Access & Integration of Browsertrix Cloud into Existing Web Archiving Workflows with IIPC project partners - 20 min

Webrecorder will lead the workshop, including the hands-on portions. The IIPC project partners will present and answer questions about use cases at the beginning, and about integration into their workflows at the end.

Participants should come prepared with a laptop and a couple of sites that they would like to crawl, especially sites that are generally difficult to crawl by other means and require a ‘high fidelity’ approach (examples include social media sites, sites behind a paywall, etc.). Ideally, the sites should be crawlable within about 30 minutes (though crawls can be interrupted if they run too long).

This workshop is intended for curators and anyone else wishing to create and use web archives who are ready to try Browsertrix Cloud on their own with guidance from the Webrecorder team. The workshop does not require any technical expertise beyond basic familiarity with web archiving. The participants’ experiences and feedback will help shape not only the remainder of the IIPC project, but the long-term future of this new crawling toolset.

The workshop should be able to accommodate up to 70 participants.



LIGHTNING TALKS
Time: 25/Apr/2024: 4:40pm-5:20pm · Location: Grand Auditorium [François-Mitterrand site]

Generative AI In Streamlining Web Archiving Workflows

Lok Hei Lui

University of Toronto, Canada



POSTER SESSION #1
Time: 25/Apr/2024: 5:30pm-6:30pm · Location: Foyer [François-Mitterrand site]

Podcasts Collection At The Bibliothèque Nationale de France: From Experimentation To The Implementation Of a Functional Harvest

Nola N'Diaye, Clara Wiatrowski

National Library of France



SESSION #08: Tools
Time: 26/Apr/2024: 10:00am-11:20am · Location: Salle 70 [François-Mitterrand site]

11:00am - 11:20am

Machine-Assisted Quality Assurance (QA) in Browsertrix Cloud

Tessa Walsh, Ilya Kreymer

Webrecorder, Canada

Manual and software-assisted workflows for quality assurance (QA) of crawls are an important and underdeveloped area of work for web archives (Reyes Ayala et al., 2014; Taylor, 2016). Presentations at previous Web Archiving Conferences, such as last year’s talks by The National Archives (UK) and the Library of Congress, have focused on institutions’ internal practices and tools to facilitate understanding and assuring the quality of captures of websites and social media (Feissali & Bickford, 2023; Bicho et al., 2023). A similar conversation was facilitated by the Digital Preservation Coalition in their 2023 online event “Web Archiving: How to Achieve Effective Quality Assurance.” These presentations and discussions show that there is a great deal of interest in and perceived need for tools to assist with performing quality assurance of crawls in the web archiving community.

This talk will discuss machine-assisted quality assurance features built into the open source Browsertrix Cloud project that have the potential to help a wide range of web archivists across different institutions understand the quality and content of their crawls. We will discuss the goals of automated QA as an iterative process: first helping users understand which pages may require intervention, and then how those pages might be fixed automatically. The talk will outline assisted QA features as they exist in Browsertrix Cloud at the time of the presentation, such as indicators of capture completeness, whether failed page captures resulted from issues with the website or the crawler, and how/if they could be fixed automatically. The talk will provide examples of the types of crawling issues that may be discovered and how they are surfaced to the user for possible intervention, discuss lessons learned in collecting user stories for and implementing the QA features, and point to possible next steps for further improvement.
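
As a generic illustration of one such completeness signal (a sketch of the idea, not Browsertrix Cloud's actual implementation), a QA pass might compare a screenshot taken at crawl time with one taken at replay and flag pages whose similarity falls below a threshold:

    # Generic sketch of a screenshot-comparison QA signal; the threshold and
    # file paths are illustrative, not Browsertrix Cloud internals.
    import numpy as np
    from PIL import Image

    def screenshot_match(crawl_png, replay_png):
        # Return a 0..1 similarity score between two same-page screenshots.
        a = np.asarray(Image.open(crawl_png).convert("L"), dtype=np.float32)
        b = np.asarray(Image.open(replay_png).convert("L"), dtype=np.float32)
        h, w = min(a.shape[0], b.shape[0]), min(a.shape[1], b.shape[1])
        a, b = a[:h, :w], b[:h, :w]  # compare the overlapping region only
        # Mean absolute pixel difference, normalized; 1.0 means identical.
        return 1.0 - float(np.abs(a - b).mean()) / 255.0

    # Pages scoring below the threshold are flagged for review or patch crawling.
    needs_review = screenshot_match("crawl/page1.png", "replay/page1.png") < 0.9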

As the presentation will discuss the assistive possibilities of software in aiding traditionally manual processes in web archiving, lessons learned are likely to apply widely to all such assistive uses of technology, including other conference themes such as the use of artificial intelligence and machine learning technologies in web archives.

References:

Bicho, Grace; Lyon, Meghan & Lehman, Amanda. The Human in the Machine: Sustaining a Quality Assurance Lifecycle at the Library of Congress. Presentation, May 11, 2023. University of North Texas Libraries, UNT Digital Library. https://digital.library.unt.edu/ark:/67531/metadc2143888/ (accessed September 21, 2023).

Feissali, Kourosh & Bickford, Jake. Open Auto QA at UK Government Web Archive. Presentation, May 11, 2023. University of North Texas Libraries, UNT Digital Library. https://digital.library.unt.edu/ark:/67531/metadc2143893/ (accessed September 21, 2023).

Reyes Ayala, Brenda; Phillips, Mark Edward & Ko, Lauren. Current Quality Assurance Practices in Web Archiving. Paper, August 19, 2014. University of North Texas Libraries, UNT Digital Library. https://digital.library.unt.edu/ark:/67531/metadc333026/ (accessed September 21, 2023).

Taylor, Nicholas. Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability. Presentation, August 4, 2016. https://nullhandle.org/pdf/2016-08-04_rethinking_web_archiving_quality_assurance.pdf (accessed September 21, 2023).



SESSION #12: Innovative Harvesting
Time: 26/Apr/2024: 1:50pm-3:10pm · Location: Salle 70 [François-Mitterrand site]

2:30pm - 2:50pm

A Test of Browser-based Crawls of Streaming Services' Interfaces

Andreas Lenander Ægidius

Royal Danish Library

This paper presents a test of browser-based web crawling on a sample of streaming services’ websites and web players. We are especially interested in their graphical user interfaces, since the Royal Danish Library collects most of the content by other means. In a legal deposit setting, and for the purposes of this test, we argue that streaming services consist of three main parts: their catalogue, metadata, and the graphical user interfaces. We find that collecting all three parts is essential in order to preserve and play back what we could call 'the streaming experience'. The goal of the test is to see if we can capture a representative sample of the contemporary streaming experience, from the initial login to (momentary) playback of the contents.

Currently, the Danish Web archive (Netarkivet) implements browser-based crawl systems to optimize its collection of the Danish Web sphere (Myrvoll et al., n.d.). The test will run on a local instance of Browsertrix (Webrecorder, n.d.), which will let us log in to services that require a local IP address. Our sample includes streaming services for books, music, TV series, and gaming.
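
For illustration, such a logged-in crawl with a local Browsertrix Crawler instance might be set up as below; the URLs and paths are placeholders, and create-login-profile and crawl are standard browsertrix-crawler commands:

    # Sketch: record a logged-in browser profile once, then crawl with it.
    import os
    import subprocess

    vol = os.getcwd() + "/crawls:/crawls"

    # Step 1: log in interactively; the session is saved as a reusable profile.
    subprocess.run([
        "docker", "run", "-it", "-v", vol,
        "webrecorder/browsertrix-crawler", "create-login-profile",
        "--url", "https://streaming.example/login",
    ], check=True)

    # Step 2: crawl the service's interface with the saved profile.
    subprocess.run([
        "docker", "run", "-v", vol,
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", "https://streaming.example/",
        "--profile", "/crawls/profiles/profile.tar.gz",
        "--generateWACZ",
    ], check=True)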

In the streaming era, the very features that define streaming threaten to impede access to important media history and cultural heritage. Streaming services are transnational and paywalled, while their content catalogues and interfaces change constantly (Colbjørnsen et al., 2021). This challenges the collection and preservation of how they present and play back the available content. On a daily basis, Danes stream more TV (47 pct.) than they watch flow-TV (37 pct.), and six out of 10 Danes subscribe to Netflix (Kantar-Gallup, 2022). Streaming is a standard for many and no longer a first-mover activity, at least in the Nordic region of Europe (Lüders et al., 2021).

The Danish Web archive collects websites of streaming services as part of its quarterly cross-sectional crawls of the Danish Web sphere (The Royal Danish Library, n.d.). A recent analysis of its collection of websites and interfaces concluded that the automated collection process provides insufficient documentation of the Danish streaming services (Aegidius and Andersen, in review).

This paper presents findings from a test of browser-based crawls of streaming services’ interfaces. We will discuss the most prominent sources of errors and how we may optimize the collection of national and international streaming services.

Selected References

Aegidius, A. L. & Andersen, M. M. T. (in review) Collecting streaming services. Convergence: The International Journal of Research into New Media Technologies.

Colbjørnsen, T., Tallerås, K., & Øfsti, M. (2021) Contingent availability: a case-based approach to understanding availability in streaming services and cultural policy implications. International Journal of Cultural Policy, 27:7, 936-951. DOI: 10.1080/10286632.2020.1860030

Lüders, M., Sundet, V. S., & Colbjørnsen, T. (2021) Towards streaming as a dominant mode of media use? A user typology approach to music and television streaming. Nordicom Review, 42(1), 35–57. https://doi.org/10.2478/nor-2021-0011

Myrvoll, A. K., Jackson, A., O'Brien, B., et al. (n.d.) Browser-based crawling system for all. Available at: https://netpreserve.org/projects/browser-based-crawling/ (accessed 26 May 2023).
