Session | ||
WKSHP-05: SUPPORTING COMPUTATIONAL RESEARCH ON WEB ARCHIVES WITH THE ARCHIVE RESEARCH COMPUTE HUB (ARCH)
Pre-registration required for this event.
| ||
Presentations | ||
Supporting Computational Research on Web Archives with the Archive Research Compute Hub (ARCH) Internet Archive, United States of America Coordinators:
Format: 90 or 120-minute workshop and tutorial Target Audience: The target audience is professionals working in digital library services that are collecting, managing, or providing access to web archives, scholars using web archives and other digital collections in their work, library professionals working to support computational access to digital collections, and digital library technical staff. Anticipated Number of Participants: 25 Abstract: Every year more and more scholars are conducting research on terabytes and even petabytes of digital library and archive collections using computational methods such as data mining, natural language processing, and machine learning. Web archives are a significant collection of interest for these researchers, especially due to their contemporaneity, size, multi-format nature, and how they can represent different thematic, demographic, disciplinary, and other characteristics. Web archives also have longitudinal complexity, with frequent changes in content (and often state of existence) even at the same URL, gobs of metadata both content-based and transactional, and many characteristics that make them highly suitable for data mining and computational analysis. Supporting computational use of web archives, however, poses many technical, operational, and procedural challenges for libraries. Similarly, while platforms exist for supporting computational scholarship on homogenous collections (such as digitized texts, images, or structured data), none exist that handle the vagaries of web archive collections while also providing a high level of automation, seamless user experience, and support for both technical and non-technical users. In 2020, Internet Archive Research Services and the Archives Unleashed received funding for joint technology development and community building to combine their respective tools that enable computational analysis of web and digital archives in order to build an end-to-end platform supporting data mining of web archives. The program also simultaneously is building out a community of computational researchers doing scholarly projects via a program supporting cohort teams of scholars that receive direct technical support for their projects. The beta platform, Archives Research Compute Hub (ARCH), is currently being used by dozens of researchers in the digital humanities, social and computer science researchers, and by dozens of libraries and archives that are interested in supporting local researchers and sharing datasets derived from their web collection in support of large-scale digital research methods. ARCH lowers the barriers for conducting research of web archives, using data processing operations to generate 16 different derivatives from WARC files. Derivatives range in use from graph analysis, text mining, and file format extraction, and ARCH makes it possible to visualize, download, and integrate these datasets into third-party tools for more advanced study. ARCH enables analysis of the more than 20,000 web archive collections - over 3 PB of data - collected by over 1,000 institutions using Archive-It that cover a broad range of subjects and events and ARCH also includes various portions of the overall Wayback Machine global web archive totalling 50+ PB and going back to 1996. This workshop will be a hands-on training covering the full lifecycle of supporting computational research on web archives. The agenda will include an overview of the conceptual challenges researchers face when working with web archives, the procedural challenges that librarians face in making web archives available for computational use, and most importantly, will provide an in-depth tutorial on using the ARCH platform and its suite of data analysis, dataset generation, data visualization, and data publishing tools, both from the perspective of a collection manager, a research services librarian, and a computational scholar. Workshop attendees will be able to build small web archive collections beforehand or will be granted access to existing web archive collections to use during the workshop. All participants will also have access to any datasets and data visualizations created as part of the workshop. Anticipated Learning Outcomes: Given the conference, we expect the attendees primarily to be web archivists, collection managers, digital librarians, and other library and archives staff. After the workshop, attendees will:
|