Conference Agenda
Overview and details of the sessions of this conference. Please select a date or location to show only sessions held on that day or at that location. Please select a single session for a detailed view.
Session Overview
TECHNICAL INNOVATION AND STRATEGIES
Presentations
11:15am - 11:37am
Archiving websites and social media of national movements: best practices of ADVN
ADVN | archive for national movements, Belgium

In 2018, our archive decided to expand its collection of online publications and started by harvesting the websites of our archival creators to preserve their online heritage for future research. The web is constantly changing, and content is quickly modified, removed or made inaccessible, which makes archiving it a necessity. During the coronavirus pandemic we realised that the rise of social media could no longer be ignored. This was the starting point for capturing, recording, scraping and downloading social media archives as well, but we were exposed to many challenges, including technical barriers (API limitations, platform restrictions), legal and ethical issues, … which require continuous monitoring and specific strategies for effective preservation. Over the years we have developed a sustainable policy and now regularly monitor more than 5,000 channels created by our archival community.
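To make the harvesting step concrete: a single-page capture into the standard WARC format might look like the sketch below, which uses the warcio library. This is an illustration only, not ADVN's actual tooling (the abstract does not name one); the target URL and output filename are placeholders, and a real crawl would add link discovery, politeness delays, and error handling.

```python
# Minimal harvesting sketch using warcio (https://github.com/webrecorder/warcio).
# Illustrative only; URL and output path are placeholders.
from warcio.capture_http import capture_http
import requests  # must be imported after warcio.capture_http for capture to work

def harvest(url: str, warc_path: str) -> None:
    """Fetch a single page and record the HTTP exchange in a WARC file."""
    with capture_http(warc_path):
        requests.get(url)

if __name__ == "__main__":
    harvest("https://example.org/", "snapshot.warc.gz")
```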
11:37am - 11:59am
Combining browser-based and browserless crawling for better fidelity vs. efficiency tradeoffs
University of Michigan, United States of America; University of Southern California, United States of America

Operators of web archives can crawl pages from the web using either dynamic browser-based crawlers (such as Brozzler and Browsertrix) or static browserless crawlers (such as Heritrix). Static crawlers are more lightweight and can therefore crawl pages at a faster rate: in our measurements, 16x faster than with a dynamic crawler. However, static crawlers miss page resources which are fetched only when JavaScript is executed; we repeatedly crawled 10K pages (spread across the top 1 million domains) both statically and dynamically for 16 weeks, and found that only 55% of statically crawled snapshots visually and functionally match the corresponding dynamically crawled snapshots.

In this talk, we will present our study on how to combine dynamic and static crawling so as to serve page snapshots at high fidelity while minimizing the computational resources needed to support high crawling throughput. First, we quantified the utility of a practice common in web archives: reusing crawled resources, either across snapshots of multiple pages or across multiple snapshots of the same page. When an archive receives a request for a resource, it serves the copy it captured closest in time to the page snapshot being served. If no resource with the requested URL is found, the archive returns a resource with an approximately matching URL. We estimated the utility of these simple measures assuming that the frequency with which an archive crawls pages matches the availability of page snapshots on the Wayback Machine. We find that, compared to crawling all pages statically, crawling 9% of snapshots with a browser suffices to increase the fraction of statically crawled snapshots which can be served without loss of fidelity from 55% to 96%. Second, to fix the fidelity issues associated with the remaining static crawls, we studied two methods for augmenting them using other dynamically crawled snapshots.
Put together, we estimate that these two measures will further increase the fraction of statically crawled page snapshots which can be served without loss of fidelity to 99%. By communicating our findings to the IIPC audience, we hope that developers of web crawlers will help translate them into practice.
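A minimal sketch of the resource-reuse policy the abstract describes (serve the capture closest in time, fall back to an approximately matching URL), assuming a simple in-memory index and treating "approximately the same URL" as equality after dropping the query string and fragment; the abstract does not specify the actual matching rule, so these data structures and the canonicalization step are assumptions, not the authors' implementation.

```python
# Sketch of nearest-in-time capture selection with a fuzzy-URL fallback.
from dataclasses import dataclass
from urllib.parse import urlsplit, urlunsplit

@dataclass
class Capture:
    url: str        # exact URL that was crawled
    timestamp: int  # capture time, e.g. seconds since the epoch
    body: bytes     # archived payload

def _canonical(url: str) -> str:
    """Approximate-match key: drop the query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def pick_capture(captures: list[Capture], url: str, snapshot_ts: int) -> Capture | None:
    """Return the capture closest in time to snapshot_ts, preferring exact URL matches."""
    exact = [c for c in captures if c.url == url]
    candidates = exact or [c for c in captures if _canonical(c.url) == _canonical(url)]
    if not candidates:
        return None
    return min(candidates, key=lambda c: abs(c.timestamp - snapshot_ts))
```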
11:59am - 12:20pm
The Wasteback Machine: measuring the environmental impact of the past web
The University of Edinburgh, United Kingdom
This paper introduces the Wasteback Machine, a JavaScript library that repurposes web archives to analyse historical web page size and composition. It addresses a key limitation in current approaches to web sustainability assessment, which rely on live measurements and therefore obscure the cumulative environmental effects of long-term digital growth. By making web archives amenable to quantitative analysis, the Wasteback Machine enables new forms of historical inquiry into the evolution of page size and composition and their environmental implications. In doing so, it demonstrates how web archives can function as analytical resources rather than merely records of cultural memory.
This paper will demonstrate the capabilities of the Wasteback Machine, examine representative analyses of historical web development, and situate its contributions within wider debates in web archiving and sustainability. It will further consider the reuse of “reborn” digital materials for quantitative inquiry, the long-term ecological implications of persistent web expansion, and the challenges and responsibilities facing the future of web archives.
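For readers who want to experiment before the talk: historical page sizes of the kind the Wasteback Machine analyses can be probed through the Wayback Machine's public CDX API. The Python sketch below is an independent illustration, not the Wasteback Machine itself (which is a JavaScript library); note that the CDX "length" field is the compressed size of the archived record for the page's HTML alone, not the full page weight including subresources.

```python
# Probe how an archived page's size has changed over time via the CDX API.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def yearly_capture_sizes(url: str) -> dict[str, int]:
    """Map year -> compressed record size (bytes) of one HTTP-200 capture per year."""
    query = urlencode({
        "url": url,
        "output": "json",
        "fl": "timestamp,length",
        "filter": "statuscode:200",
        "collapse": "timestamp:4",  # at most one capture per calendar year
    })
    with urlopen(f"{CDX_ENDPOINT}?{query}") as resp:
        rows = json.load(resp)
    # First row is the field-name header; remaining rows are [timestamp, length].
    return {ts[:4]: int(length) for ts, length in rows[1:]}

if __name__ == "__main__":
    for year, size in sorted(yearly_capture_sizes("example.com").items()):
        print(year, size)
```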
