Conference Agenda
Session Overview
COLLECTIONS AS DATA: WORKFLOWS & USE CASES
Presentations
9:20am - 9:42am
Web archives of tragedy: ethical, sustainable access and research use for 9/11 collections
University of Waterloo, Canada

During and after the September 11, 2001 (“9/11”) attacks, web users exchanged tens of thousands of emails, listserv posts, BlackBerry messages, and blog comments. Much of this material was captured in exceptional crawls by the Internet Archive and the Library of Congress, or later collected by the September 11 Digital Archive. Read together, these sources enable a minute-by-minute social history in which unity and care coexisted with fear, backlash, and hate, patterns further shaped by platform affordances and moderation practices. Yet this evidentiary base remains fragmented across crawls, platforms, file types, and how information was arranged and presented.

This talk presents a practical model for sustainable access and research use by constructing a releasable, reusable dataset that harmonizes multiple September 11–related web-archival collections (e.g., Yahoo! Groups and web-hosted listservs), totaling tens of thousands of messages. The workflow covers content-hash deduplication; date-time normalization to Eastern Time (anchored to verifiable real-world events); thread reconstruction where possible; and a common schema that structures headers, body text, and related paratext (e.g., moderation notes) into designated fields. The resulting datasets are packaged as CSV and Parquet for straightforward download and reuse and are currently hosted as private collections on Hugging Face pending release decisions.

Many items have effectively enjoyed privacy-by-obscurity in the Wayback Machine or as archive objects not exposed to search engines. When harmonized and made machine-indexable, they become trivially discoverable, including personally identifiable information. A user who posted under their own name to a public list in 2001 could not reasonably anticipate the 2025 search environment or large-scale text mining.
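The harmonization steps the abstract describes (content hashing, timezone normalization, a common schema, deduplication) could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the field names, the schema, and the sample messages are assumptions made for the example.

```python
import hashlib
from datetime import datetime, timezone, timedelta

# Illustrative only: September 2001 was Eastern Daylight Time (UTC-4).
EASTERN = timezone(timedelta(hours=-4))

def normalize_record(subject, body, sent_utc_iso):
    """Map one raw message into a hypothetical common schema,
    with its timestamp converted from UTC to Eastern Time."""
    sent_utc = datetime.fromisoformat(sent_utc_iso).replace(tzinfo=timezone.utc)
    clean_body = body.strip()
    return {
        "subject": subject.strip(),
        "body": clean_body,
        "sent_eastern": sent_utc.astimezone(EASTERN).isoformat(),
        # Content hash over the normalized body enables deduplication
        # across collections that captured the same message.
        "content_hash": hashlib.sha256(clean_body.encode("utf-8")).hexdigest(),
    }

def deduplicate(records):
    """Keep the first record seen for each content hash."""
    seen, unique = set(), []
    for rec in records:
        if rec["content_hash"] not in seen:
            seen.add(rec["content_hash"])
            unique.append(rec)
    return unique

# The same message body captured twice (e.g., a list post and its forward)
# collapses to a single record.
msgs = [
    normalize_record("Re: checking in", "Is everyone ok?", "2001-09-11T13:05:00"),
    normalize_record("Fwd: checking in", "Is everyone ok?", "2001-09-11T13:09:00"),
]
print(len(deduplicate(msgs)))  # → 1
```

A real pipeline would of course also handle header parsing, threading, and Parquet serialization; this sketch only shows the dedup-and-normalize core.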
While case-by-case review, which can attend to the context of creation, reasonable expectations of privacy, and the purposes of reuse, can guide my own individual decisions, it does not scale to tens of thousands of messages. At IIPC, therefore, I hope to seek community input on what to release, how, and with what documentation, and to share my own best practices. In my own work, I have imposed quoting thresholds on records that should not be identifiable, anonymized names and email addresses in some files, and documented provenance and processing choices so downstream users can determine what to use.

9:42am - 10:03am
Creative access - Lessons from the Digital Ghosts exhibition
University of Edinburgh, United Kingdom

This paper presents the lessons learned from the Digital Ghosts exhibition, a practice-based research project exploring how artistic and creative methods can enhance public engagement with web archives. Centred on the Scotland on the Internet curated collection, the project investigated how visualisation, data enrichment, and storytelling can improve awareness and usability of archived web content among non-specialist audiences. The exhibition showcased collaborative works created by an interdisciplinary team of archivists, data scientists, artists, and informatics students. Through data-driven artworks and interactive interfaces, the exhibition translated web archive metadata into tangible and visually engaging forms that encouraged visitors to reflect on digital presence, disappearance, and collective memory. Public engagement activities, including a panel discussion and participatory workshops, further enabled dialogue between archivists, artists, and users on issues of selection, loss, and representation of [redacted] online heritage.

A key component of the project was the preparation and enrichment of a dataset derived from the Scotland on the Internet collection, used both for artistic interpretation and as an educational resource. The process of structuring and visualising this web archive metadata offered an entry point for students and artists to engage with the complexities of humanities data, such as gaps, inconsistencies, and ethical and legal considerations. By integrating web archive material into data science teaching, the project aimed to familiarise future data users with the interpretive and contextual challenges of GLAM datasets, while exploring use cases to encourage the future utilisation of web archive data.
To assess the impact of these creative interventions, the project incorporated user research in the form of visitor surveys and focus groups conducted with exhibition visitors, workshop participants, and student groups. Based on the results of the user research and through documenting this interdisciplinary process, the paper argues that creativity is not merely an outreach tool but a sustainable access strategy that bridges preservation and access, facilitates communication between archivists, outreach specialists, researchers, and users, and supports web archives literacy. Situated within the Access and Research Use track, the paper offers conference attendees a tried and tested framework for integrating data enrichment, as well as creative and participatory methods, into web archive engagement.

10:03am - 10:24am
Developing a sustainable workflow for UK Web Archive collections as data
British Library, United Kingdom

The UK Web Archive collects and preserves websites published in the UK, encompassing a broad spectrum of topics. The entire collection amounts to approximately 2 petabytes (PB) of data. The archive includes curated or thematic collections that cover a diverse array of subjects and events, ranging from General Elections, blogs, and the UEFA Women’s Euros, to Live Art, the History of the Book, and the French community. 2026 is a special year for the UK Web Archive, as it is celebrating its 21st year curating web archive collections. In the early years these collections followed a simple structure: a title and a list of related websites, subsections of websites, individual web pages, and documents published on the web. The implementation of curation software in 2013 enabled the use of hierarchical structures to curate collections. Most of the hierarchical collections have one or two subsections, but some collections have up to four. The UK Web Archive provides an essential resource for studying the evolution of web publishing formats and for accessing a comprehensive record of content published on the web.

Due to limitations of the Legal Deposit Regulations, creating datasets of web archive content poses both technical and legal challenges. However, the metadata created by UK Web Archive collaborators sits outside the limitations outlined by the Legal Deposit Regulations and could be repurposed to create datasets for further research. To date, we have published the metadata of a number of our curated collections as data through the British Library Research Repository. Metadata was extracted from backups of the curation management tool. The first tranche of collections as data was extracted from a backup of our curation software in July 2023, at which point there were 173,961 curated records in the collection.
The second tranche was extracted from a backup of our curation software from October 2023, which contained 181,551 curated records. This presentation runs through a number of the processes involved and the lessons learnt from developing these new workflows.

It is hoped that this presentation can enable further discussion on publishing collections as data within the web archive community. These discussions will then help to develop best practice for enabling reuse of web archives within the research community.

10:24am - 10:45am
Bridging the Web Archive and the Library: a Linked‑Data Model for FAIR Web Archive Integration
German National Library, Germany

Our library makes its data available as linked open data. Since 2012, we have operated a web archive, which is currently being redeveloped in-house with an open-source approach to increase capacity. Furthermore, the aim is to integrate the web archive with the overall digital library architecture, which involves ingest through the library's digital object import pipeline, cataloguing of the digital objects into the integrated library system, and storage in a common repository for digital objects. Thus far, the metadata of the web archive is converted into the library's internal data format. However, it has become apparent that current bibliographic standards cannot capture the complexity and characteristics of web resources. Additionally, the web archive should provide sufficient metadata to allow data-based research on the digital holdings in a way that is adapted to the web medium. The overall architecture of the web archive involves several components that produce metadata about the digital objects, which is collected, and others that require the data as input or enrich it. These components include seed selection, the crawlers, file format checkers, quality assurance, the metadata extractor, the subject indexer, the CDX indexer, and the playback system.
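As a rough illustration of what archived-website metadata might look like when expressed as linked data, the sketch below builds a single JSON-LD record using Dublin Core and schema.org terms. The identifiers, vocabulary choices, and field values are assumptions made for the example, not the library's actual data model.

```python
import json

# Hypothetical JSON-LD record for one archived website snapshot.
# All URIs and values below are illustrative placeholders.
record = {
    "@context": {
        "dcterms": "http://purl.org/dc/terms/",
        "schema": "https://schema.org/",
    },
    "@id": "https://example.org/webarchive/snapshot/12345",
    "@type": "schema:WebPage",
    "dcterms:title": "Example seed site",
    "dcterms:source": "http://example.com/",          # live-web origin (seed)
    "schema:dateCreated": "2024-05-01T12:00:00Z",      # crawl timestamp
    "schema:encodingFormat": "text/html",              # from a file format checker
    "dcterms:isPartOf": "https://example.org/webarchive/collection/news",
}

print(json.dumps(record, indent=2))
```

Serializing such records with a shared `@context` is one way components like the crawler, format checker, and subject indexer could each contribute fields to a single machine-readable description, in line with FAIR principles.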