Conference Agenda

Overview and details of the sessions of this conference. Select a date or location to show only the sessions held on that day or at that location. Select a single session for a detailed view.

 
 
Session Overview
Session: SESSION #08: Handling What You Captured
Time: Thursday, 10/Apr/2025, 2:15pm - 3:40pm

Session Chair: Meghan Lyon, Library of Congress
Location: Store Auditorium (ground floor), main entrance at street level

Presentations
2:15pm - 2:35pm

So You’ve Got a WACZ: How Archives Become Verifiable Evidence

Basile Simon, Lindsay Walker

Starling Lab for Data Integrity, Stanford-USC, United States of America

This talk will present a workflow and toolkit, developed by the Starling Lab for Data Integrity, for collecting and organizing web archives alongside integrity and provenance data.

Co-founded by Stanford and USC, Starling supports investigators, be they journalists, lawyers, or human rights defenders, in their collection of information and evidence. In addition to using Browsertrix to crawl (and test) large sets of web archive data, we have built a downstream integration so that data flows into Authenticated Attributes (AA), our cryptographically signed, append-only database.

AA extends Browsertrix's utility by enabling archivists to attach and verify provenance claims, including context-critical metadata about the archived content, in a secure and decentralized manner. It allows provenance data to be added, preserved, and shared while facilitating efficient organization, searchability, and integration with other tools. Through AA, web archives and metadata become accessible to other applications and verification workflows, e.g. OSINT investigations.
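
AA's actual interface and schema are not described in this abstract. As a rough sketch of the underlying idea, the snippet below binds a signed, timestamped metadata claim to a WACZ file's content hash using an Ed25519 key (via the PyNaCl library); the helper name make_claim and the claim fields are illustrative assumptions, not AA's API.

    # Illustrative sketch only: AA's real API and schema are not public here.
    # The idea: a provenance claim is bound to the WACZ's content hash and
    # signed, so anyone holding the archive can verify both later.
    import hashlib
    import json
    from datetime import datetime, timezone

    from nacl.signing import SigningKey  # pip install pynacl

    def make_claim(wacz_path: str, metadata: dict, key: SigningKey) -> dict:
        """Hypothetical helper: hash the WACZ, wrap with metadata, sign."""
        with open(wacz_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        claim = {
            "wacz_sha256": digest,
            "metadata": metadata,  # e.g. capture time, operator, source URL
            "asserted_at": datetime.now(timezone.utc).isoformat(),
        }
        payload = json.dumps(claim, sort_keys=True).encode()
        return {
            "claim": claim,
            "signature": key.sign(payload).signature.hex(),
            "public_key": key.verify_key.encode().hex(),
        }

A verifier re-hashes the WACZ and checks the signature against the public key; an append-only log of such claims then provides the tamper evidence described above.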

In this presentation, we will showcase case studies and projects with our collaborators, including the Atlantic Council's DFRLab and conflict monitors.



2:35pm - 2:55pm

Warc-Safe: An Open-Source WARC Virus Checker and NSFW (Not-Safe-For-Work) Content Detection Tool

László Tóth

National Library of Luxembourg, Luxembourg

We present warc-safe, the first open-source WARC virus checker and NSFW (Not-Safe-For-Work) content detection tool. Built with particular emphasis on usability and integration with existing workflows, the application detects harmful material and inappropriate content in WARC records. It uses the open-source ClamAV antimalware toolkit for threat detection and a specially trained AI model to analyze WARC image records. The model supports several image formats (JPG, PNG, TIFF, WEBP, …) and produces a score between 0 (completely safe) and 1 (certainly unsafe), making it easy to classify images and decide what to do with those that exceed a given threshold.

warc-safe was developed with ease of use in mind and can be run in two modes: test mode (scan WARC files on the command line) and server mode (for easy integration with existing workflows). In server mode, clients can access several features over an API, such as scanning a WARC file for viruses, for NSFW content, or both. This makes the tool easy to use alongside popular web archiving software.

To illustrate this, we present a case study in which warc-safe was integrated into SolrWayback and the UK Web Archive's warc-indexer. The integration enriched the metadata indexed from WARC files by extending the existing Solr schema with several new fields for virus- and NSFW-test results, allowing advanced searching and statistical analysis. Finally, we discuss how warc-safe could be used within an institutional framework, for instance by scanning newly harvested WARC files from large-scale harvesting campaigns and by including it in existing indexing workflows.
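
warc-safe's source is not reproduced in this abstract. The sketch below, written against the warcio library, illustrates the scanning pattern described above: iterate a WARC's image response records, score each one, and flag those above a threshold. The nsfw_score function and the 0.8 cut-off are placeholders, not the tool's actual model or default.

    # Sketch of the scanning pattern, not warc-safe itself (pip install warcio).
    from warcio.archiveiterator import ArchiveIterator

    IMAGE_TYPES = ("image/jpeg", "image/png", "image/tiff", "image/webp")
    THRESHOLD = 0.8  # assumed cut-off; warc-safe's default is not stated

    def nsfw_score(image_bytes: bytes) -> float:
        """Placeholder for the trained model: 0 (completely safe) .. 1 (unsafe)."""
        raise NotImplementedError

    def flag_images(warc_path: str):
        """Yield (URL, score) for image records scoring above the threshold."""
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response" or not record.http_headers:
                    continue
                ctype = record.http_headers.get_header("Content-Type", "")
                if not ctype.startswith(IMAGE_TYPES):
                    continue
                score = nsfw_score(record.content_stream().read())
                if score >= THRESHOLD:
                    yield record.rec_headers.get_header("WARC-Target-URI"), score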



2:55pm - 3:15pm

Detecting and Diagnosing Errors in Replaying Archived Web Pages

Jingyuan Zhu1, Huanchen Sun2, Harsha Madhyastha2

1University of Michigan, United States of America; 2University of Southern California, United States of America

When a user loads an archived page from a web archive, the archive must ensure that the user’s browser fetches all resources on the page from the archive, not from the original website. To achieve this, archives rewrite references to page resources that are embedded within crawled HTMLs, stylesheets, and scripts.
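
As a toy illustration of static rewriting (production rewriters such as pywb's handle relative URLs, srcset, CSS url(...) references, and much more), the snippet below prefixes absolute href/src URLs in crawled HTML with an assumed replay prefix:

    # Toy illustration of static link rewriting; the archive prefix is assumed.
    import re

    ARCHIVE_PREFIX = "https://archive.example.org/web/20250410/"

    def rewrite_static_links(html: str) -> str:
        """Point absolute href/src URLs back at the archive, not the live web."""
        return re.sub(
            r'(href|src)="(https?://[^"]+)"',
            lambda m: f'{m.group(1)}="{ARCHIVE_PREFIX}{m.group(2)}"',
            html,
        )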

Unfortunately, the widespread use of JavaScript on modern web pages has made page rewriting challenging. Beyond rewriting static links, archives now also need to ensure that requests generated dynamically during JavaScript execution are intercepted and rewritten. Given the diversity of scripts on the web, rewriting them often results in fidelity violations: when a user loads an archived page, even if all resources on the page were crawled and saved, either some of the content that appeared on the original page is missing or some functionality that ought to work on archived pages (e.g., menus, changing the page theme) does not.

To verify whether the replay of an archived page preserves fidelity, archival systems currently compare either screenshots of the page taken during recording and replay or the errors encountered in both loads (e.g., https://docs.browsertrix.com/user-guide/review/). These methods have several significant drawbacks. First, modern web pages often include dynamic components, such as animations or carousels, so screenshots of the same page copy can vary across loads. Second, incorrect replay does not always result in additional script execution or resource fetch errors, nor does the presence of such errors indicate user-visible problems. Lastly, even if an archived page does differ from the original page, existing methods cannot pinpoint which inaccuracies in page rewriting led to the problem.

In this talk, we will describe our work on developing a new approach for a) more reliably detecting whether the replay of an archived page violates fidelity, and b) pinpointing the cause when it does. Fundamental to our approach is that we do not focus only on the externally visible outcomes of page loads (e.g., pixels rendered and runtime/fetch errors). Instead, during both recording and replay, we capture every visible element in the browser DOM tree, including its location on the screen and its dimensions, as well as the JavaScript writes that produce visible effects. This fine-grained representation of page loads also enables us to precisely identify the rewritten source code that led to fidelity violations. The fix ultimately has to be determined by a human developer. However, we are able to validate the root cause we identify by either inserting only the problematic rewrite into the original page or selectively rolling back that edit from the rewritten archived page and examining the corresponding effects.
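
The authors' instrumentation is not published in this abstract. As a hedged sketch of the element-level comparison idea (geometry only, omitting the JavaScript-write tracking they describe), the snippet below uses Playwright to snapshot each visible element's bounding box so that recording and replay loads can be diffed:

    # Sketch of the element-level fidelity check: snapshot visible elements'
    # geometry in a load, then diff snapshots from recording and replay.
    # Requires: pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright

    def snapshot(url: str) -> dict:
        """Map a crude element key (tag#id) to its on-screen bounding box."""
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            boxes = page.evaluate("""() =>
                [...document.querySelectorAll('*')]
                  .map(el => ({k: el.tagName + '#' + el.id,
                               r: el.getBoundingClientRect()}))
                  .filter(e => e.r.width > 0 && e.r.height > 0)
                  .map(e => [e.k, [e.r.x, e.r.y, e.r.width, e.r.height]])""")
            browser.close()
        return dict(boxes)

    def fidelity_diff(recorded: dict, replayed: dict) -> list:
        """Keys of elements missing from, or moved in, the replayed load."""
        return [k for k, box in recorded.items() if replayed.get(k) != box]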

In our study across tens of thousands of diverse pages, we have found that pywb (version 2.8.3) fails to accurately replay archived copies of approximately 15–17% of pages. Importantly, compared to relying on screenshots and errors to detect low fidelity replay, our approach reduces false positives by as much as 5x.



3:15pm - 3:35pm

Building a Toolchain for Screen Recording-Based Web Archiving of SVOD Platforms

Alexis Di Lisi

Institut national de l'audiovisuel (INA), France

As Subscription Video on Demand (SVOD) platforms expand, preserving DRM-protected content has become a critical challenge for web archivists. Traditional methods often fall short due to Digital Rights Management (DRM) restrictions, necessitating more adaptable solutions. This presentation covers the ongoing development of a generic, screen-recording-based toolchain designed to address DRM restrictions, capture high-quality content, and scale efficiently.

The project is structured into two main phases. Phase One focuses on developing a system that automatically checks the quality of screen recordings. By monitoring key metrics such as frame rate, resolution, and bit rate, the system should ensure that recordings match the original content’s quality as closely as possible. This phase addresses several technical challenges, including video glitches, frame drops, low resolution, and audio syncing issues. These problems arise from varying network conditions, software performance issues, and hardware limitations. To refine and validate the toolchain, over 100 hours of competition footage from the Paris 2024 Olympic Games have been collected and are being used to assess the system’s performance. This dataset is crucial for ensuring that the toolchain can handle high-quality recordings effectively.
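
INA's checker is not described in code in this abstract. A minimal sketch of that kind of automated check, assuming ffprobe is installed and using made-up thresholds, could read a recording's stream metadata and report metrics that fall short:

    # Minimal sketch of an automated recording-quality check via ffprobe.
    # The thresholds are assumptions, not INA's actual acceptance criteria.
    import json
    import subprocess

    def probe(path: str) -> dict:
        out = subprocess.run(
            ["ffprobe", "-v", "error", "-select_streams", "v:0",
             "-show_entries", "stream=width,height,avg_frame_rate,bit_rate",
             "-of", "json", path],
            capture_output=True, check=True, text=True,
        ).stdout
        return json.loads(out)["streams"][0]

    def quality_issues(path: str, min_height=1080, min_fps=25.0) -> list:
        s = probe(path)
        issues = []
        num, den = map(int, s["avg_frame_rate"].split("/"))
        if den == 0 or num / den < min_fps:
            issues.append(f"low frame rate: {s['avg_frame_rate']}")
        if int(s["height"]) < min_height:
            issues.append(f"low resolution: {s['width']}x{s['height']}")
        return issues

A bit-rate check would follow the same pattern; detecting frame drops and audio desynchronization requires deeper inspection (e.g., timestamp analysis) than this sketch shows.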

Phase Two tackles the specific challenges posed by DRM restrictions. Level 1 DRM, which involves a trusted environment and hardware restrictions, uses hardware acceleration that causes black screens when video playback and screen recording are attempted simultaneously. Additionally, many SVOD platforms limit high-resolution playback on Linux systems, complicating the capture of high-quality content. To circumvent these issues, playback should be handled on remote machines running Windows, macOS, or Chrome OS, environments where the high-resolution limitations do not apply, while recording is performed on Linux systems. For HD video content, which generally involves Level 3 DRM with only software restrictions, Linux can be used directly for both playback and recording without encountering black screen issues.

The toolchain will utilize Docker to scale the recording process by virtualizing hardware components such as display and sound cards. Docker should enable the system to manage multiple recordings concurrently, improving efficiency and reducing the time required for large-scale archiving. FFmpeg will be employed for recording, while Xvfb and ALSA will be used to virtualize the display and sound cards, respectively. By leveraging Docker for virtualization and managing workloads across various instances, the system is expected to scale effectively and accelerate the archiving process.
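
As a rough sketch of one such capture worker (the display number, geometry, and encoder settings are assumptions, not INA's configuration), a container could start a virtual X display with Xvfb and point FFmpeg's x11grab and ALSA inputs at it:

    # Rough sketch of one virtualized capture worker; settings are assumed.
    import subprocess

    DISPLAY = ":99"  # virtual display number (assumption)

    # Start a virtual framebuffer for the playback client to render into.
    xvfb = subprocess.Popen(["Xvfb", DISPLAY, "-screen", "0", "1920x1080x24"])
    try:
        # ... launch the playback client against DISPLAY here ...
        subprocess.run([
            "ffmpeg",
            "-f", "x11grab", "-video_size", "1920x1080", "-framerate", "30",
            "-i", DISPLAY,
            "-f", "alsa", "-i", "default",   # virtualized ALSA device
            "-c:v", "libx264", "-preset", "veryfast",
            "-t", "3600",                    # stop after one hour
            "capture.mkv",
        ], check=True)
    finally:
        xvfb.terminate()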

This ongoing work aims to provide a robust and scalable solution for capturing DRM-protected content when direct downloading is not possible. The toolchain should be adaptable to various SVOD platforms and DRM systems, offering a flexible fallback method. The presentation will offer insights into the technical challenges being addressed, the strategies being developed to bypass DRM restrictions, and how the toolchain should evolve to manage large-scale content archiving effectively. Attendees will gain an understanding of the methods used to overcome DRM challenges, the role of Docker in scaling, and the practical applications of this toolchain in preserving valuable web content.



 