Conference Agenda
Session Overview
COLLECTIONS AS DATA: WORKFLOWS & USE CASES
Presentations
9:20am - 9:42am
Web archives of tragedy: ethical, sustainable access and research use for 9/11 collections
University of Waterloo, Canada

During and after the September 11, 2001 (“9/11”) attacks, web users exchanged tens of thousands of emails, listserv posts, BlackBerry messages, and blog comments. Much of this material was captured in exceptional crawls by the Internet Archive and the Library of Congress, or later collected by the September 11 Digital Archive. Read together, these sources enable a minute-by-minute social history in which unity and care coexisted with fear, backlash, and hate, patterns further shaped by platform affordances and moderation practices. Yet this evidentiary base remains fragmented across crawls, platforms, file types, and how information was arranged and presented.

This talk presents a practical model for sustainable access and research use by constructing a releasable, reusable dataset that harmonizes multiple September 11–related web-archival collections (e.g., Yahoo! Groups and web-hosted listservs), totaling tens of thousands of messages. The workflow covers content-hash deduplication; date-time normalization to Eastern Time (anchored to verifiable real-world events); thread reconstruction where possible; and a common schema that structures headers, body text, and related paratext (e.g., moderation notes) into designated fields. The resulting datasets are packaged as CSV and Parquet for straightforward download and reuse and are currently hosted as private collections on Hugging Face pending release decisions.

Many items have effectively enjoyed privacy-by-obscurity in the Wayback Machine or as archive objects not exposed to search engines. When harmonized and made machine-indexable, they become trivially discoverable, including personally identifiable information. A user who posted under their own name to a public list in 2001 could not reasonably anticipate the 2025 search environment or large-scale text mining.
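The harmonization steps the abstract describes (content hashing, timezone normalization, a common schema, deduplication) could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the field names, the schema, and the sample messages are assumptions made for the example.

```python
import hashlib
from datetime import datetime, timezone, timedelta

# Illustrative only: September 2001 was Eastern Daylight Time (UTC-4).
EASTERN = timezone(timedelta(hours=-4))

def normalize_record(subject, body, sent_utc_iso):
    """Map one raw message into a hypothetical common schema,
    with its timestamp converted from UTC to Eastern Time."""
    sent_utc = datetime.fromisoformat(sent_utc_iso).replace(tzinfo=timezone.utc)
    clean_body = body.strip()
    return {
        "subject": subject.strip(),
        "body": clean_body,
        "sent_eastern": sent_utc.astimezone(EASTERN).isoformat(),
        # Content hash over the normalized body enables deduplication
        # across collections that captured the same message.
        "content_hash": hashlib.sha256(clean_body.encode("utf-8")).hexdigest(),
    }

def deduplicate(records):
    """Keep the first record seen for each content hash."""
    seen, unique = set(), []
    for rec in records:
        if rec["content_hash"] not in seen:
            seen.add(rec["content_hash"])
            unique.append(rec)
    return unique

# The same message body captured twice (e.g., a list post and its forward)
# collapses to a single record.
msgs = [
    normalize_record("Re: checking in", "Is everyone ok?", "2001-09-11T13:05:00"),
    normalize_record("Fwd: checking in", "Is everyone ok?", "2001-09-11T13:09:00"),
]
print(len(deduplicate(msgs)))  # → 1
```

A real pipeline would of course also handle header parsing, threading, and Parquet serialization; this sketch only shows the dedup-and-normalize core.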
While case-by-case review, which can attend to the context of creation, reasonable expectations of privacy, and the purposes of reuse, can guide my own individual decisions, it does not scale to tens of thousands of messages. At IIPC, therefore, I hope to seek community input on what to release, how, and with what documentation, and to share my own best practices. In my own work, I have imposed quoting thresholds on records that should not be identifiable, anonymized names and email addresses in some files, and documented provenance and processing choices so downstream users can determine what to use.

9:42am - 10:03am
Creative access - Lessons from the Digital Ghosts exhibition
University of Edinburgh, United Kingdom

This paper presents the lessons learned from the Digital Ghosts exhibition, a practice-based research project exploring how artistic and creative methods can enhance public engagement with web archives. Centred on the Scotland on the Internet curated collection, the project investigated how visualisation, data enrichment, and storytelling can improve awareness and usability of archived web content among non-specialist audiences. The exhibition showcased collaborative works created by an interdisciplinary team of archivists, data scientists, artists, and informatics students. Through data-driven artworks and interactive interfaces, the exhibition translated web archive metadata into tangible and visually engaging forms that encouraged visitors to reflect on digital presence, disappearance, and collective memory. Public engagement activities, including a panel discussion and participatory workshops, further enabled dialogue between archivists, artists, and users on issues of selection, loss, and representation of [redacted] online heritage.

A key component of the project was the preparation and enrichment of a dataset derived from the Scotland on the Internet collection, used both for artistic interpretation and as an educational resource. The process of structuring and visualising this web archive metadata offered an entry point for students and artists to engage with the complexities of humanities data, such as gaps, inconsistencies, and ethical and legal considerations. By integrating web archive material into data science teaching, the project aimed to familiarise future data users with the interpretive and contextual challenges of GLAM datasets, while exploring use cases to encourage the future utilisation of web archive data.
To assess the impact of these creative interventions, the project incorporated user research in the form of visitor surveys and focus groups conducted with exhibition visitors, workshop participants, and student groups. Based on the results of the user research and through documenting this interdisciplinary process, the paper argues that creativity is not merely an outreach tool but a sustainable access strategy that bridges preservation and access, facilitates communication between archivists, outreach specialists, researchers, and users, and supports web archives literacy. Situated within the Access and Research Use track, the paper offers conference attendees a tried and tested framework for integrating data enrichment, as well as creative and participatory methods, into web archive engagement.

10:03am - 10:24am
Developing a sustainable workflow for UK Web Archive collections as data
British Library, United Kingdom

The UK Web Archive collects and preserves websites published in the UK, encompassing a broad spectrum of topics. The entire collection amounts to approximately 2 petabytes (PB) of data. The archive includes curated or thematic collections that cover a diverse array of subjects and events, ranging from General Elections, blogs, and the UEFA Women’s Euros, to Live Art, the History of the Book, and the French community. 2026 is a special year for the UK Web Archive, as it is celebrating its 21st year curating web archive collections. In the early years these collections followed a simple structure: a title and a list of related websites, subsections of websites, individual web pages, and documents published on the web. The implementation of curation software in 2013 enabled the use of hierarchical structures to curate collections. Most of the hierarchical collections have one or two subsections, but some collections have up to four. The UK Web Archive provides an essential resource for studying the evolution of web publishing formats and for accessing a comprehensive record of content published on the web.

Due to limitations of the Legal Deposit Regulations, creating datasets of web archive content poses both technical and legal challenges. However, the metadata created by UK Web Archive collaborators sits outside the limitations outlined by the Legal Deposit Regulations and could be repurposed to create datasets for further research. To date, we have published the metadata of a number of our curated collections as data through the British Library Research Repository. Metadata was extracted from backups of the curation management tool. The first tranche of collections as data was extracted from a backup of our curation software in July 2023, at which point there were 173,961 curated records in the collection.
The second tranche was extracted from a backup of our curation software from October 2023, which contained 181,551 curated records. This presentation runs through a number of the processes involved and the lessons learnt from developing these new workflows.

It is hoped that this presentation can enable further discussion on publishing collections as data within the web archive community. These discussions will then help to develop best practice for enabling reuse of web archives within the research community.

10:24am - 10:45am
Bridging the Web Archive and the Library: a Linked‑Data Model for FAIR Web Archive Integration
German National Library, Germany

Our library makes its data available as linked open data. Since 2012, we have operated a web archive, which is currently being redeveloped in-house with an open-source approach to increase capacity. Furthermore, the aim is to integrate the web archive with the overall digital library architecture, which involves ingest through the library's digital object import pipeline, cataloguing of the digital objects into the integrated library system, and storage in a common repository for digital objects. Thus far, the metadata of the web archive is converted into the library's internal data format. However, it has become apparent that current bibliographic standards cannot capture the complexity and characteristics of web resources. Additionally, the web archive should provide sufficient metadata to allow data-based research on the digital holdings in a way that is adapted to the web medium. The overall architecture of the web archive involves several components that produce metadata about the digital objects, which is collected, and others that require the data as input or enrich it. These components include seed selection, the crawlers, file format checkers, quality assurance, the metadata extractor, the subject indexer, the CDX indexer, and the playback system.
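As a rough illustration of what archived-website metadata might look like when expressed as linked data, the sketch below builds a single JSON-LD record using Dublin Core and schema.org terms. The identifiers, vocabulary choices, and field values are assumptions made for the example, not the library's actual data model.

```python
import json

# Hypothetical JSON-LD record for one archived website snapshot.
# All URIs and values below are illustrative placeholders.
record = {
    "@context": {
        "dcterms": "http://purl.org/dc/terms/",
        "schema": "https://schema.org/",
    },
    "@id": "https://example.org/webarchive/snapshot/12345",
    "@type": "schema:WebPage",
    "dcterms:title": "Example seed site",
    "dcterms:source": "http://example.com/",          # live-web origin (seed)
    "schema:dateCreated": "2024-05-01T12:00:00Z",      # crawl timestamp
    "schema:encodingFormat": "text/html",              # from a file format checker
    "dcterms:isPartOf": "https://example.org/webarchive/collection/news",
}

print(json.dumps(record, indent=2))
```

Serializing such records with a shared `@context` is one way components like the crawler, format checker, and subject indexer could each contribute fields to a single machine-readable description, in line with FAIR principles.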