Conference Agenda
Overview and details of the sessions of this conference.
Session Overview
QUALITY ASSURANCE & DEDUPLICATION
Presentations
11:15am - 11:37am
Deduplication in browser-based crawling with Browsertrix
Webrecorder

This talk discusses new deduplication capabilities recently added to Browsertrix, a widely used open-source browser-based crawler and crawl-management platform, in relation to sustainable web archiving. Browsertrix Crawler originally did not include support for deduplication, but we have recently added it as an option at the request of our users. This presentation will discuss why Browsertrix and Browsertrix Crawler did not originally support deduplication, the trade-offs introduced by adding deduplication support, and the unique challenges and opportunities related to deduplication with browser-based crawling. These trade-offs will be discussed in relation to storage efficiency and sustainability in web archiving programs.

The talk will begin with some background on the early principles and capabilities of Browsertrix and why deduplication support was not previously added, including discussion of the complexities deduplication introduces in terms of inter-crawl dependencies, and the tension between this complexity and the goal of creating portable, self-contained web archives.

Next, the presentation will give a high-level overview of the deduplication capabilities that have been added to Browsertrix and Browsertrix Crawler. This will include our flexible model for configuring an index built from collections of previous crawls as a deduplication source of truth, how deduplication has been implemented in crawls, and the consequences this introduces for replay, sharing web archives, and other post-crawl activities. Also discussed will be how browser-based crawling allows new experimental approaches to deduplication that can potentially yield efficiency gains in crawling time as well as storage.
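For readers unfamiliar with crawl-time deduplication, the core idea the abstract describes is to index a digest of each captured payload and, when the same content is seen again, record a pointer to the earlier capture instead of storing the bytes a second time (in WARC terms, a "revisit" record). A minimal sketch, assuming a simple in-memory index; the class and record names here are hypothetical and are not Browsertrix's actual implementation:

```python
import hashlib


class DedupIndex:
    """Toy deduplication index mapping payload digests to the first
    capture seen. Hypothetical illustration only: a real crawler's
    index would be built from collections of previous crawls and
    persisted, not kept in a dict."""

    def __init__(self):
        self._seen = {}  # sha256 hex digest -> (url, record_id)

    def add(self, url, payload, record_id):
        """Return ('response', url, id) for new content, or a
        ('revisit', original_url, original_id) pointer when the
        payload duplicates an earlier capture."""
        digest = hashlib.sha256(payload).hexdigest()
        if digest in self._seen:
            orig_url, orig_id = self._seen[digest]
            return ("revisit", orig_url, orig_id)
        self._seen[digest] = (url, record_id)
        return ("response", url, record_id)


index = DedupIndex()
# Same image payload fetched under two different URLs:
first = index.add("https://example.org/logo.png", b"<png bytes>", "rec-1")
second = index.add("https://example.org/en/logo.png", b"<png bytes>", "rec-2")
print(first[0], second[0])  # response revisit
```

The inter-crawl dependency the talk mentions falls out directly: a revisit record is only useful if the crawl holding the original payload remains available, which is exactly the tension with self-contained archives.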
The remainder of the presentation will provide thoughts on when deduplication may or may not be appropriate, using use cases to help illustrate how deduplication relates to institutions’ efforts to ensure their web archiving programs are efficient and sustainable, as well as the trade-offs that users will need to consider.

11:37am - 11:59am
Efficient quality assurance of deduplicated web archives with Browsertrix
National Library of Luxembourg, Luxembourg

This presentation focuses on the quality assurance of archived websites using Browsertrix at a national institutional level. In the second half of 2025, our institution completed the migration and expansion of its internal web harvesting infrastructure to the latest version of Browsertrix, including the crawler, the management interface, and the quality assurance workflows. We introduced several enhancements to these modules, which we will discuss in this presentation, with a particular emphasis on quality control.

In particular, we propose a system that makes the QA process more efficient by limiting the number of pages (or samples) analyzed in each batch. This provides a good indication of the overall quality of a harvest without needing to check all of its pages, often many hundreds or thousands. Together with our crawler’s cross-crawl deduplication feature, this makes it possible to archive and analyze many terabytes of web content on a regular basis.

We also present, in detail, our system architecture and the design choices we made during the migration. This includes our Kubernetes deployment, hybrid storage solution, custom registry, and multi-node setup. Our workflow is separated into three dedicated nodes, making it possible to harvest, manage, and perform QA separately for: (1) behind-the-paywall news media content, (2) websites of national importance, and (3) ad-hoc collections.

Our results show that Browsertrix offers many unique advantages compared to the alternatives our institution has used previously. Furthermore, our enhanced quality assurance workflow provides an efficient, scalable means to monitor, manage, and maintain regular harvests on a daily, weekly, and monthly basis.

11:59am - 12:20pm
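The batch-sampling idea in the Luxembourg abstract above, reviewing a random subset of pages and extrapolating a crawl-wide pass rate, can be illustrated with a short sketch. This is a hypothetical illustration, not the institution's actual workflow: the function names, the fixed sample size, and the 95% normal-approximation confidence interval are all assumptions.

```python
import math
import random


def sample_pages(page_urls, sample_size, seed=0):
    """Draw a reproducible random sample of pages to review manually."""
    rng = random.Random(seed)
    return rng.sample(page_urls, min(sample_size, len(page_urls)))


def quality_estimate(passed, sampled, z=1.96):
    """Estimate the crawl-wide pass rate from a reviewed sample,
    with a normal-approximation 95% confidence interval."""
    p = passed / sampled
    margin = z * math.sqrt(p * (1 - p) / sampled)
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical harvest of 5000 pages; review only a 100-page batch.
pages = [f"https://example.lu/page/{i}" for i in range(5000)]
batch = sample_pages(pages, 100)
# Suppose manual review found 92 of the 100 sampled pages render correctly:
rate, low, high = quality_estimate(92, len(batch))
print(f"estimated pass rate {rate:.0%} (95% CI {low:.0%}-{high:.0%})")
```

The design point is that the review cost is fixed by the batch size, not by the harvest size, which is what makes regular terabyte-scale QA tractable.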
A browser-based approach to measuring completeness in archived websites
University of Alberta, Canada

The Internet Archive, the world’s most prominent web archiving institution, has created Archive-It (AIT), a popular web archiving subscription service used by hundreds of institutions around the world to preserve their digital cultural heritage. AIT clients can employ an AIT tool called Wayback QA to perform Quality Assurance (QA) on their archived websites (Archive-It, 2025). However, for institutions that do not use AIT, or for whom Wayback QA might not scale, the QA process has remained largely manual.

To address this issue, we present a browser-based approach to measuring the completeness of a collection of archived websites. First, we establish a definition of completeness in terms of the network requests that a browser executes in order to properly load a website. We assume the live website is the “gold standard” against which the archived website must be measured: a fully complete archived website executes all of the same network requests that are executed when loading the original live website. The completeness of an archived website is thus the fraction of original network requests that are successfully executed in the archived version.

Our approach operates by comparing the network requests of the live website to those of the archived website and generating a measure of similarity. The approach includes an open-source command-line tool that can be deployed without needing to manually inspect each archived website in a browser. The work presented here is meant to provide a simple way to quickly assess the quality of a web archive collection; it does not preclude the use of other web archiving tools to capture, display, or analyze web archives. The audience for this tool is composed of web archivists looking to carry out QA on their archived websites. Researchers studying web archives could also employ it to gauge the quality of an archived web collection at a glance.

The accompanying tool is written in Python, runs from the Linux command line, and is available to download and use on GitHub. It was written to be as modular as possible, with each step producing an output that is then used as input for the following step. The approach presented here has the following advantages over previous approaches:

– It does not require web archivists to manually interact with each site they have archived, saving time and resources.

– Additional information such as screenshots, WARC files, or crawler logs is not needed. As input, it only requires the URL of the archived website and its live counterpart.

– It is an open-source tool and not proprietary. As such, it is open to further improvements and contributions from the web archiving community, and an AIT subscription is not necessary to use it.

– Because the approach is browser-based rather than crawler-based, it is more focused on the user experience of archived websites.

References

Archive-It: How to patch crawl with the Wayback QA tool (2025), https://support.archive-it.org/hc/en-us/articles/115004144786-How-to-patch-crawl-with-the-Wayback-QA-tool
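The completeness measure defined in the Alberta abstract above, the fraction of the live site's network requests that are successfully executed in the archived version, reduces to a set comparison once both request lists are collected. In practice the tool would gather these lists from a browser's network log and would need to normalize rewritten archive URLs; this simplified sketch (a hypothetical illustration, not the authors' actual tool) assumes both lists are already normalized:

```python
def completeness(live_requests, archived_requests):
    """Fraction of the live site's network requests that were also
    successfully executed when loading the archived version.
    Assumes both inputs are normalized request URLs; collecting and
    normalizing them from a browser's network log is not shown."""
    live = set(live_requests)
    if not live:
        return 1.0  # nothing to replay counts as fully complete
    replayed = live & set(archived_requests)
    return len(replayed) / len(live)


# Hypothetical request logs for a live page and its archived copy:
live = [
    "https://example.ca/",
    "https://example.ca/app.js",
    "https://example.ca/style.css",
    "https://example.ca/logo.png",
]
archived = [
    "https://example.ca/",
    "https://example.ca/app.js",
    "https://example.ca/style.css",
]
print(completeness(live, archived))  # 0.75
```

A score of 1.0 means every live-site request replayed successfully; here the missing `logo.png` drops the page to 0.75.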
