IIPC General Assembly and Web Archiving Conference 2023

Supporting Computational Research on Web Archives with the Archives Research Compute Hub (ARCH)

Jefferson Bailey, Kody Willis, Helge Holzmann, Alex Dempsey

Internet Archive, United States of America

Coordinators:

Jefferson Bailey, Director of Archiving & Data Services, Internet Archive
Kody Willis, Product Operations Manager, Archiving & Data Services, Internet Archive
Helge Holzmann, Senior Data Engineer, Archiving & Data Services, Internet Archive
An Archives Unleashed member may also coordinate/participate

Format: 90- or 120-minute workshop and tutorial

Target Audience: Professionals in digital library services who collect, manage, or provide access to web archives; scholars who use web archives and other digital collections in their work; library professionals who support computational access to digital collections; and digital library technical staff.

Anticipated Number of Participants: 25

Technical Requirements: A meeting room with wireless internet access, power outlets, and a projector or video display. Participants must bring laptop computers. The coordinators will handle preliminary activities over email and provide some technical support beforehand for building or accessing the web archives used in the workshop.

Abstract: Each year, a growing number of scholars conduct research on terabytes and even petabytes of digital library and archive collections using computational methods such as data mining, natural language processing, and machine learning. Web archives are of particular interest to these researchers because of their contemporaneity, size, and multi-format nature, and because they can represent different thematic, demographic, disciplinary, and other characteristics. Web archives also have longitudinal complexity: content changes frequently (and often appears or disappears) even at the same URL, and captures carry abundant content-based and transactional metadata. These characteristics make them highly suitable for data mining and computational analysis. Supporting computational use of web archives, however, poses many technical, operational, and procedural challenges for libraries. Similarly, while platforms exist for supporting computational scholarship on homogeneous collections (such as digitized texts, images, or structured data), none handle the vagaries of web archive collections while also providing a high level of automation, a seamless user experience, and support for both technical and non-technical users.

In 2020, Internet Archive Research Services and the Archives Unleashed project received funding for joint technology development and community building. The goal is to combine their respective tools for computational analysis of web and digital archives into an end-to-end platform supporting data mining of web archives. The program is simultaneously building a community of computational researchers through cohort teams of scholars who receive direct technical support for their projects. The beta platform, Archives Research Compute Hub (ARCH), is currently being used by dozens of researchers in the digital humanities and the social and computer sciences, and by dozens of libraries and archives interested in supporting local researchers and sharing datasets derived from their web collections in support of large-scale digital research methods.

ARCH lowers the barriers to conducting research on web archives, using data processing operations to generate 16 different derivatives from WARC files. These derivatives support uses ranging from graph analysis and text mining to file format extraction, and ARCH makes it possible to visualize, download, and integrate these datasets into third-party tools for more advanced study. ARCH enables analysis of the more than 20,000 web archive collections (over 3 PB of data) collected by over 1,000 institutions using Archive-It, which cover a broad range of subjects and events. ARCH also includes various portions of the overall Wayback Machine global web archive, which totals 50+ PB and goes back to 1996.

This workshop will be a hands-on training covering the full lifecycle of supporting computational research on web archives. The agenda will include an overview of the conceptual challenges researchers face when working with web archives and the procedural challenges librarians face in making web archives available for computational use. Most importantly, it will provide an in-depth tutorial on using the ARCH platform and its suite of data analysis, dataset generation, data visualization, and data publishing tools from the perspectives of a collection manager, a research services librarian, and a computational scholar. Workshop attendees will be able to build small web archive collections beforehand or will be granted access to existing web archive collections to use during the workshop. All participants will also have access to any datasets and data visualizations created as part of the workshop.

Anticipated Learning Outcomes:

Given the conference audience, we expect attendees to be primarily web archivists, collection managers, digital librarians, and other library and archives staff. After the workshop, attendees will:

Understand the full lifecycle of making web and digital archives available for computational use by researchers, scholars, and others. This includes gaining knowledge of outreach and promotion strategies to engage research communities, how to handle computational research requests, how to work with researchers to scope and refine their requests, how to make collections available as data, how to work with internal technical teams facilitating requests, dataset formats and delivery methods, and how to support researchers in ongoing data analysis and publishing.
Gain knowledge of the specific types of data analysis and datasets that are possible with web archive collections, including data formats, digital methods, tools, infrastructure requirements, and the related methodological affordances and limitations for scholarship related to working with web archives as data.
Receive hands-on training on using the ARCH platform to explore and analyze web archive collections, from both the perspective of a collection manager and that of a researcher.
Be able to use the ARCH platform to generate derivative datasets, create corresponding data visualizations, publish these datasets to open-access repositories, and conduct further analysis with additional data mining tools.
Have tangible experience with datasets and related technologies in order to perform specific analytic tasks on web archives, such as exploring graph networks of domains and hyperlinks, extracting and visualizing images and other specific formats, and performing textual analysis and other interpretive functions.
Have insights into digital methods through their exposure to a variety of different active, real-life use cases from scholars and research teams currently using the ARCH platform for digital humanities and similar work.
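
As a rough illustration of the graph-analysis task mentioned in the outcomes above, the following Python sketch ranks the domains that receive the most inbound links in a domain-graph style derivative. The column names (source, target, count), the sample rows, and the function name top_linked_domains are assumptions made for illustration only; they are not ARCH's documented export schema or API.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for a domain-graph derivative.
# The column names (source, target, count) are assumed for illustration.
SAMPLE_CSV = """source,target,count
example.org,archive.org,12
example.org,wikipedia.org,5
news.example.com,archive.org,7
blog.example.net,wikipedia.org,1
"""


def top_linked_domains(csv_file, limit=10):
    """Return (domain, total inbound links) pairs, most linked first."""
    inbound = Counter()
    for row in csv.DictReader(csv_file):
        # Sum the link counts pointing at each target domain.
        inbound[row["target"]] += int(row["count"])
    return inbound.most_common(limit)


if __name__ == "__main__":
    for domain, links in top_linked_domains(io.StringIO(SAMPLE_CSV)):
        print(f"{domain}\t{links}")
```

To try this on a real export, pass an open file handle in place of the sample data and adjust the column names to match the downloaded file.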
