Conference Agenda
Overview and details of the sessions of this conference. Please select a date or location to show only sessions held on that day or at that location. Please select a single session for a detailed view.
Session Overview
TECHNICAL INNOVATION AND STRATEGIES
Presentations
11:15am - 11:37am
Archiving websites and social media of national movements: best practices of ADVN
ADVN | archive for national movements, Belgium

In 2018, our archive decided to expand its collection of online publications and started by harvesting the websites of our archival creators to preserve their online heritage for future research. The web is constantly changing, and content is quickly modified, removed or made inaccessible, which makes archiving it a necessity. During the coronavirus pandemic we realised that the rise of social media could no longer be ignored. This was the starting point for capturing, recording, scraping and downloading social media archives as well, but we were exposed to many challenges, including technical barriers (API limitations, platform restrictions), legal and ethical issues, … which require continuous monitoring and specific strategies for effective preservation. Over the years we have developed a sustainable policy and now regularly monitor more than 5,000 channels created by our archival community.
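To make the harvesting step concrete: a single-page capture into the standard WARC format might look like the sketch below, which uses the warcio library. This is an illustration only, not ADVN's actual tooling (the abstract does not name one); the target URL and output filename are placeholders, and a real crawl would add link discovery, politeness delays, and error handling.

```python
# Minimal harvesting sketch using warcio (https://github.com/webrecorder/warcio).
# Illustrative only; URL and output path are placeholders.
from warcio.capture_http import capture_http
import requests  # must be imported after warcio.capture_http for capture to work

def harvest(url: str, warc_path: str) -> None:
    """Fetch a single page and record the HTTP exchange in a WARC file."""
    with capture_http(warc_path):
        requests.get(url)

if __name__ == "__main__":
    harvest("https://example.org/", "snapshot.warc.gz")
```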
11:37am - 11:59am
Combining browser-based and browserless crawling for better fidelity vs. efficiency tradeoffs
University of Michigan, United States of America; University of Southern California, United States of America

Operators of web archives can crawl pages from the web using either dynamic browser-based crawlers (such as Brozzler and Browsertrix) or static browserless crawlers (such as Heritrix). Static crawlers are more lightweight and can therefore crawl pages at a faster rate: in our measurements, 16x faster than with a dynamic crawler. However, static crawlers miss page resources which are fetched only when JavaScript is executed; we repeatedly crawled 10K pages (spread across the top 1 million domains) both statically and dynamically for 16 weeks, and found that only 55% of statically crawled snapshots visually and functionally match the corresponding dynamically crawled snapshots.

In this talk, we will present our study on how to combine dynamic and static crawling so as to serve page snapshots at high fidelity while minimizing the computational resources needed to support high crawling throughput. First, we quantified the utility of a practice common in web archives: reusing crawled resources, either across snapshots of multiple pages or across multiple snapshots of the same page. When an archive receives a request for a resource, it serves the copy it captured closest in time to the page snapshot being served. If no resource with the requested URL is found, the archive returns a resource with an approximately matching URL. We estimated the utility of these simple measures assuming that the frequency with which an archive crawls pages matches the availability of page snapshots on the Wayback Machine. We find that, compared to crawling all pages statically, crawling 9% of snapshots with a browser suffices to increase the fraction of statically crawled snapshots which can be served without loss of fidelity from 55% to 96%. Second, to fix the fidelity issues associated with the remaining static crawls, we studied two methods for augmenting them using other dynamically crawled snapshots.
Put together, we estimate that these two measures will further increase the fraction of statically crawled page snapshots which can be served without loss of fidelity to 99%. By communicating our findings to the IIPC audience, we hope that developers of web crawlers will help translate them into practice.
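A minimal sketch of the resource-reuse policy the abstract describes (serve the capture closest in time, fall back to an approximately matching URL), assuming a simple in-memory index and treating "approximately the same URL" as equality after dropping the query string and fragment; the abstract does not specify the actual matching rule, so these data structures and the canonicalization step are assumptions, not the authors' implementation.

```python
# Sketch of nearest-in-time capture selection with a fuzzy-URL fallback.
from dataclasses import dataclass
from urllib.parse import urlsplit, urlunsplit

@dataclass
class Capture:
    url: str        # exact URL that was crawled
    timestamp: int  # capture time, e.g. seconds since the epoch
    body: bytes     # archived payload

def _canonical(url: str) -> str:
    """Approximate-match key: drop the query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def pick_capture(captures: list[Capture], url: str, snapshot_ts: int) -> Capture | None:
    """Return the capture closest in time to snapshot_ts, preferring exact URL matches."""
    exact = [c for c in captures if c.url == url]
    candidates = exact or [c for c in captures if _canonical(c.url) == _canonical(url)]
    if not candidates:
        return None
    return min(candidates, key=lambda c: abs(c.timestamp - snapshot_ts))
```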
11:59am - 12:20pm
The Wasteback Machine: measuring the environmental impact of the past web
The University of Edinburgh, United Kingdom
This paper introduces the Wasteback Machine, a JavaScript library that repurposes web archives to analyse historical web page size and composition. It addresses a key limitation in current approaches to web sustainability assessment, which rely on live measurements and therefore obscure the cumulative environmental effects of long-term digital growth. By making web archives amenable to quantitative analysis, the Wasteback Machine enables new forms of historical inquiry into the evolution of page size and composition and their environmental implications. In doing so, it demonstrates how web archives can function as analytical resources rather than merely records of cultural memory.
This paper will demonstrate the capabilities of the Wasteback Machine, examine representative analyses of historical web development, and situate its contributions within wider debates in web archiving and sustainability. It will further consider the reuse of “reborn” digital materials for quantitative inquiry, the long-term ecological implications of persistent web expansion, and the challenges and responsibilities facing the future of web archives.
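For readers who want to experiment before the talk: historical page sizes of the kind the Wasteback Machine analyses can be probed through the Wayback Machine's public CDX API. The Python sketch below is an independent illustration, not the Wasteback Machine itself (which is a JavaScript library); note that the CDX "length" field is the compressed size of the archived record for the page's HTML alone, not the full page weight including subresources.

```python
# Probe how an archived page's size has changed over time via the CDX API.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def yearly_capture_sizes(url: str) -> dict[str, int]:
    """Map year -> compressed record size (bytes) of one HTTP-200 capture per year."""
    query = urlencode({
        "url": url,
        "output": "json",
        "fl": "timestamp,length",
        "filter": "statuscode:200",
        "collapse": "timestamp:4",  # at most one capture per calendar year
    })
    with urlopen(f"{CDX_ENDPOINT}?{query}") as resp:
        rows = json.load(resp)
    # First row is the field-name header; remaining rows are [timestamp, length].
    return {ts[:4]: int(length) for ts, length in rows[1:]}

if __name__ == "__main__":
    for year, size in sorted(yearly_capture_sizes("example.com").items()):
        print(year, size)
```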
