Conference Agenda

Overview and details of the sessions of this conference.

Please note that all times are shown in the time zone of the conference (CEST).

Session Overview
SESSION #12: Innovative Harvesting
Time: Friday, 26/Apr/2024, 1:50pm - 3:10pm

Session Chair: Bert Wendland, Bibliothèque nationale de France
Location: Salle 70 [François-Mitterrand site]


Presentations
1:50pm - 2:10pm

Decentralized Web Archiving and Replay via InterPlanetary Archival Record Object (IPARO)

Sawood Alam

Internet Archive, United States of America

We propose the InterPlanetary Archival Record Object (IPARO), a decentralized version-tracking system built on the existing primitives of the InterPlanetary File System (IPFS) and the InterPlanetary Name System (IPNS). While we focus primarily on the web archiving use case, the approach can serve other applications that require versioning, such as a wiki or a collaborative code-tracking system. The proposed system does not rely on any centralized servers for archiving or replay, enabling any individual or organization to participate in the web archiving ecosystem and be discovered without having to handle unnecessary traffic. The system also continues to let Memento aggregators play their role, from which both large and small archives can benefit and flourish.

An earlier attempt at decentralized web archiving was realized as InterPlanetary Wayback (IPWB), which ingested WARC response records into an IPFS store and indexed their Content Identifiers (CIDs) in CDXJ files, using CIDs for decentralized storage retrieval in place of the WARC file name, byte offset, and byte size used in traditional archival playback systems. The primary limitation of this system was its centralized index, which had to be maintained locally to discover archived data in IPFS stores. Proposals to make IPNS history-aware required changes to the underlying systems and/or additional infrastructure, and failed to mobilize any implementations or adoption.

IPARO takes a self-contained linking approach to facilitate storage, discovery, and versioning while operating within the existing architecture of IPFS and IPNS. An IPARO is a container for a single archival observation that can be looked up and replayed independently. These objects contain an extensible set of headers, the data in a supported archival format (e.g., WARC, WACZ, or HAR), and optional trailers. The headers identify the media type of the data, establish relationships with other objects, describe how the data and any trailers should be interpreted, and hold other associated metadata. By storing the CIDs of one or more prior versions of the same resource in the header, we form a linked list of IPAROs that can be traversed backward in time, allowing discovery of prior versions from the most recent version. The most recent memento (and a version at a known time) can be discovered by querying IPNS for specific URIs. Multiple prior links in each IPARO make discovery more resilient to a missing or unretrievable record and enable more efficient reconstruction of TimeMaps.
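To make the linking model concrete, here is a minimal sketch of an IPARO-like container and a backward traversal of its version chain. The field names and the in-memory store are illustrative assumptions; the abstract does not specify a wire format.

```python
# Minimal sketch of an IPARO-like container; field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class IPARO:
    media_type: str                # e.g., "application/warc"
    prior_cids: list[str]          # CIDs of one or more earlier versions
    data: bytes                    # payload in a supported format (WARC, WACZ, HAR)
    metadata: dict = field(default_factory=dict)  # other header fields

def walk_versions(store: dict[str, IPARO], head_cid: str):
    """Traverse the linked list backward in time from the most recent CID.

    `store` stands in for IPFS block retrieval by CID.
    """
    cid = head_cid
    while cid:
        obj = store[cid]
        yield cid, obj
        # Multiple prior links add resilience; we follow the first one here.
        cid = obj.prior_cids[0] if obj.prior_cids else None
```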

Moreover, IPFS allows custom block partitioning when creating the Merkle tree for the underlying storage, which means that slicing IPAROs at strategic places before storage can leverage the built-in deduplication capabilities of IPFS. This can be done by identifying resource payloads and headers that change infrequently or have a finite number of permutations, and isolating them from the parts of the data or metadata that change often.
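As a rough illustration of that idea (an assumption, not the authors' implementation), the sketch below cuts a serialized IPARO at caller-chosen offsets so that stable slices hash identically across versions and therefore deduplicate under content addressing:

```python
import hashlib

def partition(iparo_bytes: bytes, boundaries: list[int]) -> list[bytes]:
    """Slice at strategic offsets rather than fixed-size chunks."""
    cuts = [0, *sorted(boundaries), len(iparo_bytes)]
    return [iparo_bytes[a:b] for a, b in zip(cuts, cuts[1:])]

def block_id(block: bytes) -> str:
    # Stand-in for a real IPFS CID (which uses multihash/CIDv1, not bare hex).
    return hashlib.sha256(block).hexdigest()

# Identical slices across two versions produce identical block IDs,
# so the store keeps only one copy of each.
```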

Furthermore, a trailer can be added to include a suitable nonce that forces generation of CIDs containing desired substrings. This can be helpful for grouping objects by certain tags, names, types, etc., per the application's needs.
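A hedged sketch of what such nonce grinding might look like; a real implementation would compute an actual IPFS CID rather than the bare SHA-256 hex digest used here as a stand-in:

```python
import hashlib
from itertools import count

def find_nonce(body: bytes, tag: str, limit: int = 10_000_000) -> bytes:
    """Brute-force a trailer nonce so the object's digest contains `tag`.

    `tag` must use hex characters for this digest-based stand-in.
    """
    for n in count():
        if n >= limit:
            raise RuntimeError("no nonce found within limit")
        trailer = b"nonce:%d" % n
        if tag in hashlib.sha256(body + trailer).hexdigest():
            return trailer
```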



2:10pm - 2:30pm

Server-Side Web Archiving with ReproZip-Web

Katherine Boss (1), Vicky Rampin (1), Rémi Rampin (2), Ilya Kreymer (3)

(1) New York University Libraries, United States of America; (2) New York University, United States of America; (3) Webrecorder, United States of America

Complex websites pose many archiving challenges. Even with high-fidelity capture tools, many sites remain difficult to crawl because they make highly dynamic, non-deterministic network requests, for example to provide site-wide search. To fully archive such sites, encapsulating the web server so that it can be recreated in an emulated environment may be the highest-fidelity option. But encapsulating a single web server is often not enough, as most sites load resources from multiple servers or include external embeds from services like MapBox, Google Maps, or YouTube. To fully archive sites with dynamic server- and client-side components, we present an integrated tool that overlays high-fidelity server emulation on a high-fidelity web archive.

ReproZip-Web is an open-source, grant-funded [1] web archiving tool capable of server-side web archiving. It builds on Webrecorder's capture and client-side replay tools to capture the front end of a website, and on the reproducibility software ReproZip to encapsulate the back-end dynamic web server software and its dependencies. The output is a self-contained, isolated, preservation-ready bundle, an .rpz file, with all the information needed to replay a website, including the source code, the computational environment (e.g., the operating system and software libraries), and the files used by the app (e.g., data and static files). Its lightweight nature makes it ideal for distribution and preservation.
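As a rough sketch of the trace-and-pack step, the ReproZip CLI can be driven as below; `./run_server.sh` is a hypothetical command that starts the site's back end, and the exact ReproZip-Web workflow may differ:

```python
import subprocess

# Trace the running server to record the files, libraries, and processes
# it touches.
subprocess.run(["reprozip", "trace", "./run_server.sh"], check=True)

# Pack the traced environment into a self-contained .rpz bundle.
subprocess.run(["reprozip", "pack", "site-backend.rpz"], check=True)
```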

This presentation will discuss the strengths and limitations of ReproZip-Web, outline ideal use cases for the tool, and demonstrate how to trace and pack a dynamic site. We will also highlight new features in the Webrecorder tools (ArchiveWeb.page and ReplayWeb.page) that let capture and replay differentiate and merge content from a live server and a WACZ file, overlaying a preserved server on a traditional web archive. Finally, we will discuss the infrastructure memory institutions need in order to provide long-term access to these archived works.

[1] Institute of Museum and Library Services, “Preserving the Dynamic Web: Building a Production-level Tool to Save Data Journalism and Interactive Scholarship,” NLG-L Recipient, LG-250049-OLS-21, Aug. 2021. http://www.imls.gov/grants/awarded/lg-250049-ols-21.



2:30pm - 2:50pm

A Test of Browser-based Crawls of Streaming Services' Interfaces

Andreas Lenander Ægidius

Royal Danish Library

This paper presents a test of browser-based web crawling on a sample of streaming services' websites and web players. We are especially interested in their graphical user interfaces, since the Royal Danish Library collects most of the content by other means. In a legal deposit setting, and for the purposes of this test, we argue that streaming services consist of three main parts: their catalogue, their metadata, and their graphical user interfaces. We find that collecting all three parts is essential in order to preserve and play back what we could call 'the streaming experience'. The goal of the test is to see whether we can capture a representative sample of the contemporary streaming experience, from the initial login to (momentary) playback of the contents.

Currently, the Danish Web archive (Netarkivet) implements browser-based crawl systems to optimize its collection of the Danish web sphere (Myrvoll et al., n.d.). The test will run on a local instance of Browsertrix (Webrecorder, n.d.), which will let us log in to services that require a local IP address. Our sample includes streaming services for books, music, TV series, and gaming.
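For illustration, a single crawl job with a pre-recorded login profile might be launched as below; the service URL and file paths are hypothetical, though `--url`, `--profile`, and `--generateWACZ` are real Browsertrix Crawler options:

```python
import os
import subprocess

subprocess.run([
    "docker", "run", "-v", f"{os.getcwd()}/crawls:/crawls",
    "webrecorder/browsertrix-crawler", "crawl",
    "--url", "https://streaming.example.dk/",      # hypothetical service
    "--profile", "/crawls/profiles/login.tar.gz",  # saved login session
    "--generateWACZ",                              # package output as WACZ
], check=True)
```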

In the streaming era, the very thing that defines it is what threatens to impede access to important media history and cultural heritage. Streaming services are transnational and sit behind paywalls, while content catalogues and interfaces change constantly (Colbjørnsen et al., 2021). This challenges the collection and preservation of how they present and play back the available content. On a daily basis, Danes stream more TV (47 pct.) than they watch flow TV (37 pct.), and six out of ten Danes subscribe to Netflix (Kantar-Gallup, 2022). Streaming is a standard for many and no longer a first-mover activity, at least in the Nordic region of Europe (Lüders et al., 2021).

The Danish Web archive collects the websites of streaming services as part of its quarterly cross-sectional crawls of the Danish web sphere (The Royal Danish Library, n.d.). A recent analysis of its collection of websites and interfaces concluded that the automated collection process provides insufficient documentation of the Danish streaming services (Aegidius and Andersen, in review).

This paper presents findings from a test of browser-based crawls of streaming services’ interfaces. We will discuss the most prominent sources of errors and how we may optimize the collection of national and international streaming services.

Selected References

Aegidius, A. L., & Andersen, M. M. T. (in review) Collecting streaming services. Convergence: The International Journal of Research into New Media Technologies.

Colbjørnsen, T., Tallerås K., & Øfsti, M. (2021) Contingent availability: a case-based approach to understanding availability in streaming services and cultural policy implications, International Journal of Cultural Policy, 27:7, 936-951, DOI: 10.1080/10286632.2020.1860030

Lüders, M., Sundet, V. S., & Colbjørnsen, T. (2021) Towards streaming as a dominant mode of media use? A user typology approach to music and television streaming. Nordicom Review, 42(1), 35–57. https://doi.org/10.2478/nor-2021-0011

Myrvoll, A. K., Jackson, A., O'Brien, B., et al. (n.d.) Browser-based crawling system for all. Available at: https://netpreserve.org/projects/browser-based-crawling/ (accessed 26 May 2023).



2:50pm - 3:10pm

Crawling Toward Preservation of References in Digital Scholarship: ETDs to URLs to WACZs

Lauren Ko, Mark Phillips

University of North Texas Libraries, United States of America

The University of North Texas has required born-digital Electronic Theses and Dissertations (ETDs) from its students since 1999. Since then, over 9,000 of these documents have been archived in the UNT Digital Library for access and preservation.

Motivated by discussions at the 2023 Web Archiving Conference about the need to better curate works of digital scholarship along with the URL-based references they contain, the UNT Libraries set out to address this problem for newly submitted ETDs. Mindful of the burdens already upon students submitting works of scholarship in attainment of a degree, we opted to implement a solution that adds no additional requirements for authors and that can be repeated with each semester's new ETDs in a mostly automated way.

We began to experiment with identifying URLs in the ETDs that could be archived and made a permanent part of the package added to the digital library, allowing future users to better understand the context of some of the references in the document. In the first step of our workflow, we extracted URLs from the submitted PDF documents. This required experimentation with different programmatic approaches to converting the documents to plain text or HTML and parsing URLs from the resulting text. Some methods were more successful than others, but all were challenged by the many ways a URL can present itself (e.g., split over multiple lines, across pages, in footnotes, etc.). Next, using Browsertrix Crawler, we archived the extracted URLs, saving the results for each ETD in a separate WACZ file. This WACZ file was added to the ETD's preservation package and submitted to the UNT Digital Library. To view the archived URL content, a user can download the WACZ file and open it with a service like ReplayWeb.page (https://replayweb.page/). The UNT Libraries are experimenting with an integrated viewer for WACZ content in their existing digital library infrastructure and with how to make this option available to users in the near future.
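A condensed sketch of the two workflow steps, assuming pdfminer.six for text extraction; the regex is deliberately simple and, as noted above, misses URLs split across lines or pages, and the Browsertrix Crawler invocation is illustrative:

```python
import pathlib
import re
import subprocess

from pdfminer.high_level import extract_text  # assumes pdfminer.six

def extract_urls(pdf_path: str) -> list[str]:
    """Pull candidate URLs out of an ETD's PDF text."""
    text = extract_text(pdf_path)
    return sorted(set(re.findall(r"https?://[^\s)>\]]+", text)))

def archive_etd(etd_id: str, urls: list[str]) -> None:
    """Crawl one ETD's URLs into its own WACZ via Browsertrix Crawler."""
    crawls = pathlib.Path("crawls").resolve()
    crawls.mkdir(exist_ok=True)
    (crawls / f"{etd_id}-seeds.txt").write_text("\n".join(urls))
    subprocess.run([
        "docker", "run", "-v", f"{crawls}:/crawls",
        "webrecorder/browsertrix-crawler", "crawl",
        "--seedFile", f"/crawls/{etd_id}-seeds.txt",
        "--scopeType", "page",    # capture the cited pages themselves
        "--collection", etd_id,
        "--generateWACZ",
    ], check=True)
```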

In this presentation we expound on our workflow for building these web archives in the context of the ETDs from which the URLs are extracted, as they are packaged for preservation and viewed in our digital library alongside their originating documents. By sharing this work, we hope to continue the discussion on how best to preserve URLs within works of scholarship and to offer steps that conference attendees may want to implement at their own institutions.


