Conference Agenda

Session Overview
Session
SESSION #08: Tools
Time: Friday, 26/Apr/2024, 10:00am - 11:20am

Session Chair: Sawood Alam, Internet Archive
Location: Salle 70 [François-Mitterrand site]


Presentations
10:00am - 10:20am

The BnL’s Migration From OpenWayback To A Hybrid PyWb-SolrWayback Engine Powered By S3 Storage

László Tóth

National Library of Luxembourg

During 2023, the National Library of Luxembourg (BnL) undertook the task of migrating its existing web archives, totaling around 300 TB of compressed WARC files served by OpenWayback using a static CDX index, to a high-performance hybrid infrastructure consisting of PyWb and SolrWayback using an OutbackCDX index server and state-of-the-art S3 object storage. The goal of this migration was threefold: to modernize the BnL’s offer to users in terms of accessibility, search speed and overall end-user experience, to use the latest web archiving tools and workflows available to date, and to provide a highly efficient and responsive storage solution for our web archives. Thus, we improve the three main pillars of our web archives: user experience, software and hardware. Our final solution sits atop 4 high-performance servers, two of which were initially used for indexing our collections. These machines, which have more than 2.5 TB of RAM and 192 cores in total, are geographically distributed, host the Solr cluster, SolrWayback, PyWb and OutbackCDX applications, and are connected to our governmental S3 object storage network. In order to efficiently retrieve data from this storage, we developed custom modules for SolrWayback and PyWb that are able to stream WARC records directly from S3 storage starting at any given offset up to any given length, without any additional I/O delay related to stream skipping. Thus, playback is as fast as it would be if the archives were read from local storage. To conclude, these features, though not available by default within the aforementioned applications, have been developed in an open-source spirit and are made freely available online for anyone to use.
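The offset-and-length retrieval described above can be illustrated with a minimal Python sketch. It is not the BnL's module: `byte_range_header`, `read_record` and the two-record archive are hypothetical names and data, and an in-memory byte string stands in for the object store. It assumes each WARC record is compressed as its own gzip member and that the index supplies the record's offset and compressed length, so a single ranged read returns exactly one record.

```python
import gzip


def byte_range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value covering `length` bytes from `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    return f"bytes={offset}-{offset + length - 1}"


def read_record(archive: bytes, offset: int, length: int) -> bytes:
    """Stand-in for a ranged object-store read: slice one gzip member and decompress it."""
    return gzip.decompress(archive[offset:offset + length])


# Hypothetical archive: two records, each compressed as a separate gzip member.
first = gzip.compress(b"WARC/1.1\r\nWARC-Type: warcinfo\r\n\r\n")
second = gzip.compress(b"WARC/1.1\r\nWARC-Type: response\r\n\r\n")
archive = first + second

# The index would supply these two numbers for the second record.
offset, length = len(first), len(second)
print(byte_range_header(offset, length))
print(read_record(archive, offset, length).decode())
```

Because only the requested byte range is transferred, the reader never has to skip through the preceding part of the file, which is the I/O delay the abstract refers to.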



10:20am - 10:40am

WARC-GPT: An Open-Source Tool for Exploring Web Archives Using AI

Matteo Cargnelutti, Kristi Mukk, Clare Stanton

Library Innovation Lab, United States of America

This year the team at the Library Innovation Lab (home of Perma.cc) has been exploring how artificial intelligence changes our relationship to knowledge. Partially inspired by colleagues at last year’s IIPC conference, one of our initial questions was: “Can the techniques used to ground and augment the responses provided by Large Language Models be used to help explore web archive collections?”

That question led us to develop and release WARC-GPT: an experimental open-source Retrieval Augmented Generation tool for exploring collections of WARC files using AI.

WARC-GPT functions as a highly customizable boilerplate the web archiving community can use to explore the intersection between web archiving and AI. Specifically, WARC-GPT is a Retrieval Augmented Generation (RAG) pipeline: it builds a knowledge base from a set of WARC files, which is later used to help answer questions put to a Large Language Model (LLM) of the user’s choosing.
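The retrieval step of such a pipeline can be sketched in a few lines of Python. This is a toy illustration, not WARC-GPT's code: word-count vectors and cosine similarity are swapped in for an embedding model and vector store, and `vectorize`, `retrieve`, `build_prompt` and the sample chunks are all hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = vectorize(question)
    return sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the text that would be sent to the LLM of the user's choosing."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


# Hypothetical text chunks, as if extracted from WARC response records.
chunks = [
    "The library opens at nine on weekdays.",
    "Web archives are stored as WARC files.",
]
question = "How are web archives stored?"
print(build_prompt(question, retrieve(question, chunks)))
```

The point of the design is that the model only sees excerpts retrieved from the collection, which grounds its answer in what was actually captured.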

What would it mean for your team’s process if it could interact with a chatbot that had insight into what you’ve captured from the web? Would a chatbot with knowledge of your collection be of use for description work? While still an experimental tool, WARC-GPT is a step towards answering questions like these. Our team will share our experience so far testing things out, the decisions we've made around tools, and how other organizations can do the same.

The Perma team believes the expansion of our project beyond the technology and service we’ve offered for many years is of interest to the IIPC community. All of our work is still rooted in our original service, focusing on authenticity, fidelity, and provenance - but built to be more expansive.



10:40am - 11:00am

R You Validating These WARCs? Automating Our Validation- And Policy-Checking Processes With R

Lotte Wijsman, Jacob Takema

National Archives of the Netherlands

As the National Archives of the Netherlands, we receive web archives from one or more producers instead of harvesting these ourselves. Consequently, the quality of the web archives can differ. To be able to ensure consistent quality and the long-term preservation of the web archives, we must ask ourselves if the archive is complete, technically sound, and conforms to our guidelines. Since 2021, national government agencies have needed to comply with the guideline on archiving governmental websites (2018) when harvesting websites. Amongst other requirements, the web archives should be harvested daily, conform to the ISO 28500 standard, contain full and incremental harvests, and have a maximum size of 1GB. To ensure conformance to the WARC standard, we validate the WARC files with tools such as JHOVE.

Previously, we presented our work on validation and the web archiving guideline at the WAC. In 2023, we had a poster on WARC validation that also included our future ambitions, because there is still much to improve. The output of WARC validation tools is not always easy to work with, especially when working with a lot of files, and the tools don’t always check everything we want checked (such as the size of the WARC). Furthermore, we considered conformance checking to be the next step beyond validation. This is why we went searching for a way not only to meet all our needs concerning validation and conformance checking, but also to automate these processes.

To accomplish this work, we have started to use R, a programming language suited for data analysis. Using R, we have worked on building an automated conformance checker, which not only validates mandatory properties of the WARC standard, but also optional yet (for us) important properties (e.g. payload digests and file size), and important aspects for long-term preservation (e.g. embedded file formats). Furthermore, we have implemented an automated conformance check to see if web archives conform to our guideline for archiving Dutch governmental websites (e.g. harvest frequency, full- and incremental harvests).
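Two of the guideline checks mentioned above, maximum file size and daily harvest frequency, can be sketched as follows. The authors' prototype is written in R; Python is used here purely for illustration. The function names, file names and dates are hypothetical, and reading "1GB" as 10^9 bytes is an assumption.

```python
from datetime import date, timedelta

MAX_WARC_BYTES = 10**9  # assumption: the guideline's "1GB" read as 10^9 bytes


def oversized(sizes: dict[str, int], limit: int = MAX_WARC_BYTES) -> list[str]:
    """Return the names of files whose size exceeds the limit, sorted."""
    return sorted(name for name, size in sizes.items() if size > limit)


def missing_days(harvest_dates: list[date]) -> list[date]:
    """Return the days between the first and last harvest on which nothing was harvested."""
    if not harvest_dates:
        return []
    seen = set(harvest_dates)
    day, last = min(seen), max(seen)
    gaps = []
    while day <= last:
        if day not in seen:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


# Hypothetical delivery from a producer.
sizes = {"site-20240401.warc.gz": 750_000_000, "site-20240403.warc.gz": 1_200_000_000}
dates = [date(2024, 4, 1), date(2024, 4, 3)]
print(oversized(sizes))
print(missing_days(dates))
```

Returning lists of offending files and dates, rather than a single pass/fail flag, keeps the output easy to aggregate when many files are checked at once.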

In our presentation, we will first provide the attendees with some background information on our previous work on the web archiving guideline and WARC-validation. Subsequently, we will share how we came to automated conformance checking and we will give a demo using R to show our prototype conformance checker.



11:00am - 11:20am

Machine-Assisted Quality Assurance (QA) in Browsertrix Cloud

Tessa Walsh, Ilya Kreymer

Webrecorder, Canada

Manual and software-assisted workflows for quality assurance (QA) of crawls are an important and underdeveloped area of work for web archives (Reyes Ayala et al., 2014; Taylor, 2016). Presentations at previous Web Archiving Conferences, such as last year’s talks by The National Archives (UK) and Library of Congress, have focused on institutions’ internal practices and tools to facilitate understanding and assuring the quality of captures of websites and social media (Feissali & Bickford, 2023; Bicho et al., 2023). A similar conversation was facilitated by the Digital Preservation Coalition in their 2023 online event “Web Archiving: How to Achieve Effective Quality Assurance.” These presentations and discussions show that there is a great deal of interest in and perceived need for tools to assist with performing quality assurance of crawls in the web archiving community.

This talk will discuss machine-assisted quality assurance features built in the open source Browsertrix Cloud project that have the potential to help a wide range of web archivists across different institutions gain an understanding of the quality and content of their crawls. We will discuss the goals of automated QA as an iterative process, first, to help users understand which pages may require user intervention, and then, how those might be fixed automatically. The talk will outline assisted QA features as they exist in Browsertrix Cloud at the time of the presentation, such as indicators of capture completeness, whether failed page captures resulted from issues with the website or crawler, and how/if they could be automatically fixed. The talk will provide examples of the types of issues in crawling that may be discovered and how they are surfaced to the user for possible intervention, discuss lessons learned in collecting user stories for and implementing the QA features, and point to possible next steps for further improvement.

As the presentation will discuss the assistive possibilities of software in aiding traditionally manual processes in web archiving, lessons learned are likely to apply widely to all such assistive uses of technology, including other conference themes such as the use of artificial intelligence and machine learning technologies in web archives.

References:

Bicho, Grace; Lyon, Meghan & Lehman, Amanda. The Human in the Machine: Sustaining a Quality Assurance Lifecycle at the Library of Congress, presentation, May 11, 2023; (https://digital.library.unt.edu/ark:/67531/metadc2143888/: accessed September 21, 2023), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting International Internet Preservation Consortium.

Feissali, Kourosh & Bickford, Jake. Open Auto QA at UK Government Web Archive, presentation, May 11, 2023; (https://digital.library.unt.edu/ark:/67531/metadc2143893/: accessed September 21, 2023), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting International Internet Preservation Consortium.

Reyes Ayala, Brenda; Phillips, Mark Edward & Ko, Lauren. Current Quality Assurance Practices in Web Archiving, paper, August 19, 2014; (https://digital.library.unt.edu/ark:/67531/metadc333026/: accessed September 21, 2023), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu.

Taylor, Nicholas. Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability, presentation, August 4, 2016; (https://nullhandle.org/pdf/2016-08-04_rethinking_web_archiving_quality_assurance.pdf: accessed September 21, 2023).



 
Conference: IIPC WAC 2024