Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only sessions at that day or location. Please select a single session for detailed view (with abstracts and downloads if available). To only see the sessions for 3 May's Online Day, select "Online" for location.

Please note that all times are shown in the time zone of the conference (CEST).

Session Overview

Session

SES-13: CRAWLING, PLAYBACK, SUSTAINABILITY

Time:

Friday, 12/May/2023:

10:30am - 12:00pm

Session Chair: Laura Wrubel, Stanford University

Location: Theatre 2

These presentations will be followed by a 10 min Q&A.

Presentations

10:30am - 10:50am

Developer Update for Browsertrix Crawler and Browsertrix Cloud

Ilya Kreymer, Tessa Walsh

Webrecorder, United States of America

This presentation will provide a technical update on the latest features implemented in Browsertrix Cloud and Browsertrix Crawler, Webrecorder's open source automated web archiving tools. It will also give a brief introduction to Browsertrix Cloud and to the ongoing collaboration between Webrecorder and the IIPC partners testing the tool.

We will outline the next phase of development of these tools and discuss ongoing challenges in high-fidelity web archiving and how we may mitigate them in the future. We will also cover lessons learned thus far.

We will end with a brief Q&A to answer any questions about the Browsertrix Crawler and Cloud systems, including how others may contribute to testing and development of these open source tools.

10:50am - 11:10am

Opportunities and Challenges of Client-Side Playback

Clare Stanton, Matteo Cargnelutti

Library Innovation Lab, United States of America

The team working on Perma.cc at the Library Innovation Lab has been using the open-source technologies developed by Webrecorder in production for many years, and has built custom software around those core services. Recently, in exploring applications for client-side playback of web archives via replayweb.page, we have learned lessons about the security, performance and reliability profile of this technology. This has deepened our understanding of the opportunities it presents and the challenges it poses. We have since developed an experimental boilerplate for testing out variations of this technology and have sought partners within the Harvard Library community to iterate with, test our learnings, and explore some of the interactive experiences that client-side playback makes possible.

warc-embed is LIL's experimental client-side playback integration boilerplate, which can be used to test out and explore this new technology. It consists of: a cookie-cutter web server configuration for storing, proxying, caching and serving web archive files; a pre-configured "embed" page, serving an instance of replayweb.page aimed at a given archive file; as well as a two-way communication layer allowing the embedding website to safely communicate with the embedded archive. These unique features allow for a thorough exploration of this new technology from a technical and security standpoint.

This is one of two proposals relating to this work. We believe there is an IIPC audience who could attend either or both sessions based on their interests. This session will dive into the technical research conducted at the lab and present those findings.

Combined with the emergence of the WACZ packaging format, client-side playback is a radically different and novel take on web archive playback which allows for the implementation of previously unachievable embedding scenarios. This session will explore the technical opportunities and challenges client-side playback presents from a performance, security, ease-of-access and programmability perspective by going over concrete implementation examples of this technology on Perma.cc and warc-embed.

11:10am - 11:30am

Sustaining pywb through community engagement and renewal: recent roadmapping and development as a case study in open source web archiving tool sustainability

Tessa Walsh, Ilya Kreymer

Webrecorder

IIPC’s adoption of pywb as the “go to” open source web archive replay system for its members, along with Webrecorder’s support for transitioning to pywb from other “wayback machine” replay systems, brings a large new user base to pywb. In the interests of ensuring pywb continues to sustainably meet the needs of IIPC members and the greater web archiving community, Webrecorder has been investing in maintenance and new releases for the current 2.x release series of pywb as well as engaging in the early stages of a significant 3.0 rewrite of pywb. These changes are being driven by a community roadmapping exercise with members of the IIPC oh-sos (Online Hours: Supporting Open Source) group and other pywb community stakeholders.

This talk will outline some of the recent feature and maintenance work done in pywb 2.7, including a new interactive timeline banner which aims to promote easier navigation and discovery within web archive collections. It will go on to discuss the community roadmapping process for pywb 3.0 and give an overview of the proposed new architecture, perhaps even showing an early demo if development is far enough along by May 2023.

The talk will aim to not only share specific information about pywb and the efforts being put into its sustainability and maintenance by both Webrecorder and the IIPC community, but also to use pywb as a case study to discuss the resilience, sustainability, and renewal of open source software tools that enable web archiving for all. pywb as a codebase is after all nearly a decade old itself and has gone through several rounds of significant rewrites as well as eight years of regular maintenance by Webrecorder staff and open source contributors to get to its current state, making it a prime example of how ongoing effort and community involvement make all the difference in building sustainable open source web archiving tools.

11:30am - 11:50am

Addressing the Adverse Impacts of JavaScript on Web Archives

Ayush Goel¹, Jingyuan Zhu¹, Ravi Netravali², Harsha V. Madhyastha¹

¹University of Michigan, United States of America; ²Princeton University, United States of America

Over the last decade, the presence of JavaScript code on web pages has dramatically increased. While JavaScript enables websites to offer a more dynamic user experience, its increasing use adversely impacts the fidelity of archived web pages. For example, when we load snapshots of JavaScript-heavy pages from the Internet Archive, we find that many are missing important images, and that JavaScript execution errors are common.

In this talk, we will describe the takeaways from our research on how to archive and serve pages that are heavily reliant on JavaScript. Via fine-grained analysis of JavaScript execution on 3000 pages spread across 300 sites, we find that the root cause of the poor fidelity of archived page copies is that the execution of JavaScript code on the web often depends on the characteristics of the client device on which it is executed. For example, JavaScript on a page can execute differently based on whether the page is loaded on a smartphone or on a laptop, or whether the browser used is Chrome or Safari; even subtle differences, such as whether the user's network connection is over 3G or WiFi, can affect JavaScript execution. As a result, when a user loads an archived copy of a page in their browser, JavaScript on the page might attempt to fetch a different set of embedded resources (e.g., images and stylesheets) from those fetched when this copy was crawled. Since a web archive is unable to serve resources that it did not crawl, the user sees an improperly rendered page because of both missing content and JavaScript runtime errors.
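The failure mode described above can be illustrated with a small sketch. It is a toy model only: the function `resources_requested`, the environment keys, and the URLs are all hypothetical stand-ins for a page script, not code from the research presented in this talk.

```python
# Hypothetical stand-in for a page script whose requests depend on the client.
def resources_requested(env):
    # Pick an image variant from the viewport width and the browser name.
    size = "small" if env["viewport_width"] < 768 else "large"
    fmt = "webp" if env["browser"] == "Chrome" else "jpg"
    return {"/style.css", f"/hero-{size}.{fmt}"}

# Assumed environments: the crawler ran as desktop Chrome,
# while the later visitor uses Safari on a phone.
crawl_env = {"viewport_width": 1440, "browser": "Chrome"}
replay_env = {"viewport_width": 390, "browser": "Safari"}

# The archive holds only what the crawler fetched.
archived = resources_requested(crawl_env)

# At replay time the same script asks for a resource that was never crawled.
missing = resources_requested(replay_env) - archived
print(sorted(missing))  # ['/hero-small.jpg']
```

In this toy case the stylesheet is served from the archive, but the image request misses, which corresponds to the missing content the abstract describes.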

A web archive cannot account for these sources of non-deterministic JavaScript execution by crawling every page in all possible execution environments (client devices, browsers, etc.), as doing so would significantly inflate the cost of archiving. Instead, if we augment archived JavaScript so that the code on any archived page always executes exactly as it did when the page was crawled, we can ensure that all archived pages match their original versions on the web, both visually and functionally.
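The abstract does not say how the archived JavaScript is augmented. As one possible reading, the sketch below shows a generic record-and-replay idea: log every environment value the script reads at crawl time, then answer the same reads from that log at replay time. The classes `RecordingEnv` and `ReplayEnv`, the function `page_script`, and the URLs are hypothetical and are not the authors' implementation.

```python
class RecordingEnv:
    """Wraps the crawler's real environment and logs every value read."""

    def __init__(self, real):
        self.real = real
        self.log = {}

    def read(self, key):
        self.log[key] = self.real[key]
        return self.log[key]


class ReplayEnv:
    """Answers reads from the crawl-time log, ignoring the actual client."""

    def __init__(self, log):
        self.log = log

    def read(self, key):
        return self.log[key]


def page_script(env):
    # Hypothetical page logic: choose an image based on the viewport width.
    size = "small" if env.read("viewport_width") < 768 else "large"
    return f"/hero-{size}.jpg"


# Crawl time: the script runs against the crawler's real environment.
recorder = RecordingEnv({"viewport_width": 1440})
crawled_url = page_script(recorder)

# Replay time: whatever device the visitor has, reads return recorded values,
# so the script requests the same resource the crawler saved.
replayed_url = page_script(ReplayEnv(recorder.log))
print(crawled_url, replayed_url)  # /hero-large.jpg /hero-large.jpg
```

The design choice here is to pin the inputs to the script rather than to crawl under every possible input, which keeps archiving cost at one crawl per page.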


Conference: IIPC WAC 2023