Conference Agenda
Overview and details of the sessions of this conference. Please select a date or location to show only the sessions held on that day or at that location. Please select a single session for a detailed view.
Session Overview |
| Date: Monday, 20/Apr/2026 | |
| 9:30am - 10:00am | REGISTRATION: BELGICAWEB & WORKSHOPS AND COFFEE Location: PANORAMA: FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6). |
| 9:30am - 10:00am | EARLY SCHOLARS SPRING SCHOOL ON WEB ARCHIVES [PART 1] Location: KRANTEN / JOURNAUX [0] Registration for this event begins at 9:00, followed by an icebreaker. |
| 10:00am - 11:00am | BELGICAWEB SYMPOSIUM Location: PANORAMA [+6] |
Digital Heritage for the Future - the BelgicaWeb Story 1KBR, Royal Library of Belgium; 2Ghent University; 3University of Namur This presentation will showcase the concluding results of BelgicaWeb (2024-2026), an innovative research project led by the Royal Library of Belgium (KBR) and funded by BELSPO. The project aimed to preserve and provide access to Belgium’s born-digital heritage through a multilingual, user-friendly platform and an API, ensuring FAIR principles (Findable, Accessible, Interoperable, Reusable). Key achievements include sustainable access strategies, robust data infrastructure, metadata enrichment, and legal framework analysis. The session will highlight how BelgicaWeb promotes Belgium’s digital heritage. The project brought together a consortium of partners with diverse expertise: CRIDS at UNamur for legal issues and IDLab, GhentCDH, and MICT at Ghent University for data enrichment, user engagement, and outreach. KBR coordinates the project, focusing on platform and API development and data enrichment. |
| 10:00am - 11:00am | - Location: KRANTEN / JOURNAUX [0] |
| 11:00am - 11:30am | BREAK Location: PANORAMA: FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6) |
| 11:30am - 12:30pm | BELGICAWEB SYMPOSIUM PANEL: USER NEEDS Location: PANORAMA [+6] |
Users first: (Re)designing Web Archives around real Needs 1Ghent University; 2Bibliothèque nationale de France; 3National Library of Norway; 4University of Edinburgh; 5UK National Archives As web archives make the transition from niche repositories to essential infrastructure for digital history, one critical disconnect remains: the gap between technical capture and scholarly utility. While archivists battle with dynamic content and platform APIs, web archives too often remain a ‘black box’ for researchers, one that does not meet their methodological requirements. This panel addresses the central research question: “What are user requirements for web and social media archives?”. Based on recent empirical work such as survey data, workshop results, and exploratory user testing, this session discusses the needs, expectations, and practices of diverse user groups such as researchers, heritage professionals, journalists, and policy analysts. Moving beyond a ‘capture-first’ mentality to a ‘use-centric’ approach, this panel will analyse key areas of conflict and friction. This session will be in an interactive format, inviting the panel members and audience to vote on a series of statements. |
| 11:30am - 12:30pm | BELGICAWEB SYMPOSIUM PANEL: PROGRAMS Location: CONCERT [+4] |
From pilot to program: cultivating institutional web archiving practices for sustainability 1IIPC, United States of America; 2Library of Congress, United States; 3National Library of Australia, Australia; 4British Library, United Kingdom Transitioning an archiving pilot into a resilient, long-term program is a perennial challenge in the web archiving field. Moving beyond initial proof-of-concept and pilot projects requires strategic investment in technical infrastructure and human capital, as well as secure funding. Featuring experts from libraries with more than 20 years of experience running web archiving programs and collaborating internationally, this panel explores roadmaps for sustainable growth. Discussion will focus on the practical challenges and solutions regarding long-term staffing, infrastructure, and collection management practices that move a program from "temporary project" to "enduring program". |
| 11:30am - 12:30pm | BELGICAWEB SYMPOSIUM PANEL: AI Location: ATELIER [+2] |
AI in web and social media archives 1KBR - Royal Library of Belgium, Belgium; 2Internet Archive; 3Bibliothèque nationale de France; 4Aarhus University This panel brings together experts to discuss the practical use of AI in web and social media archives, from leveraging machine learning for the curation, preservation, and discovery of massive, ephemeral datasets, to GenAI for supporting users in navigating the archives. The discussion will not focus solely on efficiency and the tools used, but also on the archival conversation around stewardship and ethics, because with these great affordances come the challenges of ensuring data privacy, managing biases, and establishing transparency and provenance for AI-generated (meta)data. As AI becomes integral to archival practice, how do we balance innovation with accountability? |
| 11:30am - 12:30pm | BELGICAWEB SYMPOSIUM PANEL: LEGAL Location: AQUARIUM [+2] |
The legal challenges of web archiving and open collections 1University of Namur; 2Sciences Po Paris Law School; 3Vrije Universiteit Amsterdam; 4Reprobel The panel will provide an opportunity to address topics that are particularly critical in the field of web archiving, such as considerations related to copyright (exceptions, Text and Data Mining practices, extended collective licensing), to open data, and to the challenges of data enrichment, including AI-driven enrichment. |
| 11:30am - 12:30pm | SPRING SCHOOL [PART 2] Location: KRANTEN / JOURNAUX [0] |
| 12:30pm - 1:30pm | LUNCH Location: PANORAMA: FOYER [+6] 🍴 Lunch will be served in Panorama Foyer (Floor +6).
🏛️ GUIDED TOUR KBR: If you signed up for a guided tour of KBR, please be in front of the three main elevators on Floor -2 at 12:30 [1st tour] or 13:00 [2nd tour]. To know if you signed up for a tour, check your registration details in ConfTool. |
| 1:30pm - 3:00pm | WORKSHOP: BELGICAWEB [PART 1] Location: PANORAMA [+6] |
Web Archiving, with a little help from my LLM friends. 1Ghent University, Belgium; 2KBR Royal Library of Belgium; 3Université de Namur, Belgium BelgicaWeb is a two-year BRAIN 2.0 project funded by BELSPO that aims to safeguard and promote Belgium’s born-digital heritage by making it FAIR—Findable, Accessible, Interoperable, and Reusable. Through the development of a user-friendly access platform and API, the project addresses sustainable access, data enrichment using technologies such as Linked Data and NLP, and legal frameworks around data sharing, AI, and privacy. It brings together experts from KBR, Ghent University, and the University of Namur, and actively engages users to shape its design and functionality. In this tutorial, we will demonstrate a complete web archiving pipeline and show how it can be augmented through AI-based methods, mainly large language models (LLMs), to extend existing workflows. Participants will first see how our current BelgicaWeb pipeline automatically creates and replays web archives using SolrWayback. The resulting WARC files will be processed with an LLM-based data cleaning pipeline that turns the raw data into structured Linked Data. The same raw data can also be explored using retrieval-augmented generation (RAG) to make the mapping process more interactive. In this way, we demonstrate that data exploration can be carried out through multiple complementary approaches (Linked Data, full-text search, and RAG). The session will conclude with a discussion on the legal and ethical dimensions of applying AI in web archiving, including GDPR and compliance with EU AI regulations. Three Hands-on Sessions The tutorial consists of three parts, each addressing a different stage in the BelgicaWeb workflow: (i) data harvesting; (ii) AI-based cleaning, enrichment, and exploration; (iii) legal reflection. Participants can follow along in (partially) pre-filled Colab notebooks or just observe the demonstrations; the pre-filled notebooks and data snapshots will keep everyone in sync and guarantee the progress of the workshop.
The first session focuses on the practical aspects of harvesting web data. Participants will explore the automated pipeline (using Heritrix) that generates WARC files from a defined set of seeds and replays them in SolrWayback. An explanatory diagram of the harvesting pipeline will be shared with the audience. The session will conclude by demonstrating how structured metadata (which will also be generated in the next session) can be re-integrated into SolrWayback to enhance search and browsing.
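To make this pipeline stage concrete, here is a minimal, illustrative sketch of the kind of WARC processing involved, using warcio and BeautifulSoup. The file name is a placeholder and the output format is an assumption for the example, not the BelgicaWeb code itself.

```python
# Illustrative sketch (not the BelgicaWeb pipeline): iterate over a WARC file,
# extract plain text from HTML responses, and emit simple records that a
# Solr-based index such as SolrWayback's could ingest.
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def extract_text_records(warc_path):
    """Yield (url, timestamp, text) tuples for HTML response records."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            timestamp = record.rec_headers.get_header("WARC-Date")
            html = record.content_stream().read()
            text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
            yield url, timestamp, text

if __name__ == "__main__":
    for url, ts, text in extract_text_records("example.warc.gz"):  # placeholder file
        print(url, ts, text[:80])
```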
The second session introduces the concept of enhancing traditional web archiving workflows with LLMs. They will be introduced at two points in the workflow:
Finally, we will show how AI-based exploration and traditional SPARQL querying can be intertwined and used for complementary insights. Participants will use ready-to-use Jupyter notebooks in Google Colab, which are partially filled in, to lower the barrier of entry for less technical users. Each step will be guided, allowing participants to experiment safely with retrieval, vector databases, chunking, and summarization.
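As a rough illustration of how cleaned records can be expressed as Linked Data and then queried with SPARQL alongside AI-based exploration, here is a minimal sketch using rdflib; the namespace and properties are hypothetical, not the BelgicaWeb schema.

```python
# Illustrative sketch: express one cleaned record (e.g. produced by an
# LLM-based cleaning step) as RDF triples and query them with SPARQL.
# The namespace and properties are hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

EX = Namespace("http://example.org/belgicaweb/")  # hypothetical namespace

g = Graph()
page = URIRef("http://example.org/page/1")
g.add((page, RDF.type, EX.ArchivedPage))
g.add((page, DCTERMS.source, Literal("https://www.example.be/news/item")))
g.add((page, DCTERMS.title, Literal("Voorbeeldartikel", lang="nl")))
g.add((page, DCTERMS.date, Literal("2025-03-01")))

# A simple SPARQL query over the enriched graph.
results = g.query("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?page ?title WHERE { ?page dcterms:title ?title . }
""")
for row in results:
    print(row.page, row.title)
```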
The final session addresses the legal and ethical dimensions of AI-assisted web archiving. Questions that can arise are: Is what we did safe? What safety measures are required to ensure we remain within legal boundaries? This presentation will address more specifically the legal and ethical implementation issues relating to data protection, copyright, FAIR/CARE principles and the responsible use of AI in web archiving. This will take the form of a Q&A based on concrete cases that inspired our reflections. Participants will be invited to participate in the debate. Format The tutorial is designed as a 3-hour technical session with three modular components:
Target Audience This tutorial targets professionals in the field of web archiving, particularly developers, digital experts, and others interested in the technical aspects of web archiving and AI. Familiarity with Python and Linked Data will be very helpful, but as we provide a lot of code samples, less experienced participants will also benefit from this workshop. Participants may follow along actively in Colab or simply observe the demonstrations. We anticipate a group of 25–40 participants, allowing for interaction and guided support during the hands-on session. Expected Learning Outcomes By the end of the tutorial, participants will be able to:
Technical Requirements
Main Topic AI-enabled workflows for web archiving. Keywords Web archiving, SolrWayback, Large Language Models, Data Cleaning, Information Retrieval, Linked Data |
| 1:30pm - 3:00pm | WORKSHOP: SUSTAINABLE HARVESTING Location: STUDIO [+6] |
Web harvesting in an environmentally sustainable way 1Netherlands Institute for Sound and Vision, Netherlands, The; 2The National Archives (UK), United Kingdom; 3National Archives of the Netherlands, Netherlands; 4University of London, United Kingdom; 5Publications Office of the European Union As web harvesting grows in scale and frequency, so does its environmental impact. Crawlers use bandwidth and computing power, and all that harvested data takes energy to store and maintain. Web archiving plays an essential role in preserving our digital culture and supporting research, but it also leaves a considerable carbon footprint that cannot be ignored. As our reliance on digital preservation increases, finding ways to make these processes more efficient and environmentally responsible has become an important collective challenge. This workshop invites IIPC members and the wider community to talk about how we can make web harvesting more environmentally sustainable. From smarter crawling techniques to collaboration that cuts down on duplication, we’ll explore how the web archiving community can align its work with broader sustainability goals without compromising the quality and integrity of our web archives. Sustainability has become a growing priority for libraries, archives, and research institutions. As organizations move toward net-zero targets, web archiving programs should also start to examine their own energy use and storage practices more closely. This workshop responds to a pressing need to explore current sustainability practices and experiments and aims to identify opportunities to reduce energy use during crawling and storage. It offers a space to share what people are already trying, what’s working, and where we see opportunities to reduce our footprint, whether that’s through more efficient crawls, less redundant storage, or greener preservation strategies. The workshop will kick off with the authors revisiting the talks that were given on this topic during last year’s Web Archiving Conference and highlighting the developments that have taken place since then. Coming from different institutional and professional backgrounds, the authors will demonstrate how approaches to green web harvesting vary across contexts while also showing the value of sharing insights and experiences. After this introduction, the participants will form small breakout groups to discuss key aspects of sustainability in web harvesting. Topics include ideas for running crawlers more efficiently - like optimizing scope and timing - and strategies for storing and managing data. We’ll explore ways to collaborate across institutions to reduce overlap, and how to measure and report the environmental cost of our work. Additionally, we’ll consider the ethical and policy questions that come with balancing preservation goals and sustainability. By the end, this session aims to build a shared understanding of what “sustainable web harvesting” can look like in practice. Together, we will explore current approaches to making web harvesting more sustainable and discuss best practices. We’ll gather practical ideas and recommendations - technical, organizational, and policy-related - and use the notes and key takeaways from the session as input for a set of community guidelines on sustainable web archiving, to be shared post-event. We hope the discussion will inspire interest in forming a small working group or shared resource on green web archiving, helping the conversation continue beyond the conference.
Above all, the session will bring together people who care deeply about both preserving the web and environmental responsibility, fostering new collaborations and long-term awareness of sustainability within the web archiving community. |
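One concrete angle on the "less redundant storage" theme is payload-level deduplication, the idea that also underlies WARC revisit records. Below is a minimal, illustrative sketch (not code from the presenting institutions; file names are placeholders) of finding duplicate payloads across crawls.

```python
# Illustrative sketch: find duplicate payloads across WARC files by hashing
# response bodies, so that repeated captures could be stored once.
import hashlib
from collections import defaultdict
from warcio.archiveiterator import ArchiveIterator

def payload_digests(warc_path):
    """Yield (sha1_digest, url) for every response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            payload = record.content_stream().read()
            yield (hashlib.sha1(payload).hexdigest(),
                   record.rec_headers.get_header("WARC-Target-URI"))

seen = defaultdict(list)
for warc in ["crawl-2024.warc.gz", "crawl-2025.warc.gz"]:  # placeholder names
    for digest, url in payload_digests(warc):
        seen[digest].append(url)

duplicates = {d: urls for d, urls in seen.items() if len(urls) > 1}
print(f"{len(duplicates)} payloads were captured more than once")
```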
| 1:30pm - 3:00pm | WORKSHOP: SOLRWAYBACK [PART 1] Location: ATELIER [+2] |
Run your own full-stack SolrWayback, collaborate & unlock the potential of archived data 1Royal Danish Library, Denmark; 2The National Library of New Zealand; 3Aarhus University; 4National Library of Norway; 5National Library of Luxembourg An updated version of the '21, '23 and '24 IIPC WAC workshops “Run your own full stack SolrWayback”, with added use cases showing SolrWayback’s growing resilience, robustness and sustainability, how to contribute, and an open discussion to conclude the workshop. Background
As an open source software project, SolrWayback is growing. This can be seen in the diversity of the contributors on GitHub. Since the last IIPC workshop on SolrWayback in 2024, the software has moved in multiple directions. Among the contributions worth mentioning: the Memento protocol has been implemented for better interoperability between archives, the frontend framework has undergone a major rework and upgrade, and playback of old ARC files has been improved. The workshop consists of:
By the end of the workshop, participants will have a working installation of SolrWayback on their local computers and will have learned how to install and interact with the software. During the workshop, participants are also introduced to how SolrWayback can support collaborative work with the archived web as a source for children's history, as well as a more technical case on tracking pixels. These specific cases act as examples of research use, and through them participants will gain an understanding of how SolrWayback can be integrated into their research practices and support their exploration of the archived web. Attendees will also learn how to contribute ideas and code through GitHub, and the discussion at the end of the workshop will inspire attendees and leave them with ideas and impulses to act on, improve, and sustain SolrWayback. Prerequisites:
Support During the conference there will be focused support for SolrWayback in a dedicated Slack channel run by the facilitators of the workshop. Target audience Web archivists and researchers with intermediate knowledge of web archiving and of tools for exploring web archives. Basic technical knowledge of starting a program from the command line is required, as this is currently the only way to start the program. However, the SolrWayback bundle is designed for easy deployment, so terminal interaction will be kept to a minimum. Coordinator(s)/facilitator(s) All five authors, plus other attendees who might chime in. |
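For attendees who want a feel for the kind of interaction the workshop covers, here is a minimal sketch of querying a Solr index of archived web content through Solr's standard select API; the core name and field names are assumptions for the example, not necessarily SolrWayback's actual schema.

```python
# Illustrative sketch: full-text query against a local Solr core of archived
# web content using Solr's standard select endpoint. Core and field names
# are hypothetical.
import requests

SOLR_SELECT = "http://localhost:8983/solr/netarchivebuilder/select"  # assumed core name

params = {
    "q": 'content:"tracking pixel"',   # full-text query over page content
    "rows": 10,
    "fl": "url,crawl_date",            # fields to return (assumed names)
    "wt": "json",
}
response = requests.get(SOLR_SELECT, params=params, timeout=30)
response.raise_for_status()

for doc in response.json()["response"]["docs"]:
    print(doc.get("crawl_date"), doc.get("url"))
```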
| 1:30pm - 3:00pm | WORKSHOP: GLAM LABS & JUPYTER NOTEBOOKS [PART 1] Location: AQUARIUM [+2] |
Explore the use of the GLAM Labs Checklist, Datasheets, and Jupyter Notebooks for digitized and born-digital collections 1British Library; 2University of Alicante, Spain; 3National Library of Norway; 4International Internet Preservation Consortium; 5National Library of New Zealand There are often significant barriers to accessing and using collections as data in the GLAM sector, which typically demand technical expertise and suitable IT infrastructure. Although training in digital research skills is becoming more widespread, GLAM institutions still face the challenge of determining how best to provide access to their digital collections in ways that encourage the use of these skills. Jupyter Notebooks are an increasingly popular form of hybrid tooling that combines data and code to make digital collections more accessible, particularly for less technical users. GLAM institutions have started to employ Jupyter Notebooks as a new approach to demonstrate how users can access and experiment with datasets derived from their collections [1]. Projects like the GLAM Workbench [2] illustrate their utility across various types of collections, including both digitized collections and web archives. They offer interactive and reproducible environments [3] for exploring and analyzing collections of data. This workshop will help participants explore digitised and born-digital collections using reproducible code and Jupyter Notebooks. These collections will be placed in the context of “datasheets for datasets,” which provide structured documentation about how a dataset was created. Notebooks and datasheets are two key steps in the “Checklist to Publish Collections as Data in GLAM Institutions” (glamlabs.io/checklist). Expert facilitators will help users explore the possibilities of Notebooks, focusing on three areas: 1) working on one specific topic using data from digitised and born-digital collections (e.g. news), 2) using and creating reproducible notebooks, and 3) understanding existing infrastructures, cloud services, and workflows for publishing computationally ready datasets. Use cases and discussion will also address preservation challenges and future reuse of notebooks and datasheets. Format The workshop will begin with short presentations on the GLAM Labs Checklist, datasheets for datasets, and the framework for creating a collection of Jupyter Notebooks [3]. These will include examples based on digitized and born-digital collections, and guidance on how to get started using a Jupyter Notebook. The main part of the workshop will involve participants using and exploring the datasets with one or more of the available Jupyter Notebooks. Data research infrastructures and cloud services to run Jupyter Notebooks will be presented. The session will wrap up with a discussion on the preservation challenges of the notebooks and datasheets. Learning Outcomes The workshop aims to provide the following outcomes:
References
Acknowledgments This workshop builds on the work of the GLAM Labs community and the Web Archives as Data workshops delivered at various conferences, most recently at the Digital Humanities in the Nordic and Baltic Countries (DHNB) 2025 Conference in Reykjavík and the Web Archiving Conference (WAC) in Oslo. |
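As a small illustration of the notebook-based exploration this workshop describes, a cell like the following could summarise a derived dataset; the file name and column names are hypothetical, not an actual workshop dataset.

```python
# Illustrative notebook-style cell (not from the GLAM Workbench): load a
# hypothetical derived dataset of archived news pages and summarise it.
import pandas as pd

df = pd.read_csv("news_collection_derivative.csv")  # hypothetical dataset
print(df.head())

# How many captures per year and per language?
summary = (
    df.groupby(["year", "language"])       # assumed column names
      .size()
      .reset_index(name="captures")
      .sort_values("captures", ascending=False)
)
print(summary.head(10))
```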
| 1:30pm - 3:00pm | SPRING SCHOOL [PART 3] Location: KRANTEN / JOURNAUX [0] |
| 3:00pm - 3:30pm | BREAK Location: PANORAMA: FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6). |
| 3:30pm - 5:00pm | WORKSHOP: BELGICAWEB [PART 2] Location: PANORAMA [+6] |
| 3:30pm - 5:00pm | WORKSHOP: AI4LAM Location: STUDIO [+6] |
From problem to practice: a collaborative use case workshop on AI-driven management and reuse of web archive content 1AI4LAM; 2IIPC This workshop is designed to explore the transformative potential of artificial intelligence in managing and reusing internet cultural heritage content preserved in web archives of IIPC institutions. As digital heritage grows exponentially, institutions face mounting challenges in accessing, organizing, and repurposing archived web data. Participants will engage with cutting-edge AI tools to develop innovative solutions for enhancing discoverability and enabling creative reuse of archived web content. The event invites developers, researchers, archivists, and digital humanists to collaborate on prototypes that address real-world needs: from semantic enrichment and automated classification to visualization, summarization, and cross-archive interoperability. By bridging technical innovation with cultural preservation, this workshop aims to unlock new pathways for engaging with the web’s historical layers and ensuring their relevance for future generations. It is developed in close cooperation between the IIPC and AI4LAM teams to ensure optimal planning, preparation of web content, and effective outreach. This collaboration will help align technical capabilities with community needs and maximize the impact of the event. Purpose This workshop brings together participants who want to explore, challenge, and strengthen their real‑world use cases while examining collaborative efforts between IIPC and AI4LAM to better provide access and reuse, and to address ethical issues in the use of AI on harvested content. The focus is on thoughtful discussion, critical debate, and collaborative refinement to surface high‑quality insights that will contribute to strategic pathways after the event. Format Participants are invited to bring their real-life use cases to examine — whether emerging, partially formed, or already in practice. Through guided sessions and structured debate, each use case will serve as a basis for mapping further areas of work. The workshop emphasizes clarity, feasibility, impact, and alignment with broader strategic or technological themes of IIPC and AI4LAM. Activities
Max number of participants: Up to 30 persons. Technical requirements: Participants are encouraged to bring their real‑world use cases. Expected outcomes: white paper/recommendations/draft strategy for further work on the subject. |
| 3:30pm - 5:00pm | WORKSHOP: SOLRWAYBACK [PART 2] Location: ATELIER [+2] |
| 3:30pm - 5:00pm | WORKSHOP: GLAM LABS & JUPYTER NOTEBOOKS [PART 2] Location: AQUARIUM [+2] |
| 3:30pm - 5:00pm | SPRING SCHOOL [PART 4] Location: KRANTEN / JOURNAUX [0] |
| 5:00pm - 5:30pm | - |
| 5:00pm - 5:30pm | - |
| 5:00pm - 5:30pm | - |
| 5:00pm - 5:30pm | - |
| 5:00pm - 5:30pm | SPRING SCHOOL [WRAP-UP] Location: KRANTEN / JOURNAUX [0] |
| Date: Tuesday, 21/Apr/2026 | |
| 8:30am - 9:15am | REGISTRATION AND COFFEE Location: AUDITORIUM [-2] ☕️🥐 Drinks and snacks will be served in Galerie (Floor -2, next to Auditorium). |
| 9:15am - 9:30am | OPENING REMARKS Location: AUDITORIUM [-2] |
| 9:30am - 11:00am | OPENING KEYNOTE PANEL Location: AUDITORIUM [-2] |
| 11:00am - 11:30am | BREAK Location: GALERIE [-2] ☕️🥐 Drinks and snacks will be served in Galerie (Floor -2, next to the Auditorium).
🏛️ GUIDED TOUR KBR: If you signed up for a guided tour of KBR, please be in front of the three main elevators on Floor -2 at 11:00. To know if you signed up for a tour, check your registration details in ConfTool. |
| 11:30am - 12:35pm | POSITIVE + NEGATIVE IMPACT OF AI Location: AUDITORIUM [-2] |
11:30am - 11:52am
Why ask WAAI: A sustainable approach to exploring web archiving artificial intelligence (WAAI) Internet Archive, United States of America Beyond the media hype, financial bubble, and general social freakout over Artificial Intelligence (AI), the emergence of machine learning (ML) and AI technologies merit impartial consideration of the potential for these innovations to benefit many aspects of the overall web archiving endeavour. Much as digitization and the internet itself radically changed how libraries and heritage institutions approached professional practices like acquisition and access, ML/AI may have the potential to address longstanding challenges in web archiving related to harvesting, collection management, and search and discovery. Of course ML/AI tools could also prove too immature, too unreliable, too expensive, or too unwieldy to provide a suitable return on investment for web archive collections that can measure in the hundreds of terabytes, if not petabytes. Thus, ML/AI explorations in web archives need a different methodology of research, testing, and assessment than more traditional, more narrowly focused technologies specific only to certain areas of web archiving practice or infrastructure. This talk will approach the challenge of incorporating ML/AI tools in web archives from a “why ask why” perspective, emphasizing small, low-stakes, and well scoped experimentation across all aspects of the web archiving lifecycle instead of rigorously planned, ambitiously conceived, large scale projects or more formal and ornate methods of research and development. The presentation will thus lay out a general framework for advancing AI-based work in web archiving based on practical examples, use cases, and findings from pursuing such an approach within a large web archiving institution that has been conducting internal AI projects on multiple parts of its web archiving processes. The talk will cover both managerial and practical aspects of exploring ML/AI for web archiving, such as staffing, infrastructure, tools, costs, program/product development, and engineering practices, and will link these with specific completed or in-progress work on leveraging ML/AI tools for various areas of web archiving, such as appraisal, collection, description, quality assurance, and search. By bridging practical details and results with specific areas of professional practice and wrapping both in a framework that emphasizes experimentation and action over procedural, policy, or administrative plodding, the talk hopes to advocate for a “sustainable” approach to exploring ML/AI in web archiving that proves doable, cost-effective, and user-driven. This presentation will propose a method, detail results from implementing that method in a large web archiving organization, and share results and findings intended to help other web archiving institutions pursue ML/AI work that will be sustainable, productive, and successful. 11:52am - 12:14pm
Understanding and mitigating anti-bot technologies' impact on archival web crawling 1MirrorWeb Limited, United Kingdom; 2Library of Congress, United States of America The proliferation of AI bot prevention technologies has created an unprecedented challenge for institutional web archiving programs. Website owners, administrators, and hosting providers—particularly those serving large organisations and government entities—have implemented increasingly aggressive safeguards to protect against AI agents harvesting training data. While well-intentioned, these measures inadvertently block legitimate preservation crawlers, threatening the completeness and quality of web archive collections. This research addresses a critical gap in understanding how anti-bot technologies affect large-scale web archiving operations. Even when securing appropriate crawling permissions per institutional policies, standard preservation tools like Heritrix are increasingly mistaken for malicious bots or AI scrapers, resulting in blocked access to nominated content. While quality assurance teams have documented this issue on individual seeds and domains, no comprehensive analysis of its scale and impact has been conducted. Our investigation analyses data from institutional crawling operations, and aims to enable systematic identification of blocking patterns, affected content types, and the scope of collection gaps caused by anti-bot technologies. This work extends existing guidance (such as robots.txt configuration advice) to address the complex landscape of modern bot prevention technologies. By documenting the real-world impact of these systems on institutional collecting and developing evidence-based mitigation strategies, this presentation is intended to aid web archiving programs maintaining collection quality while minimising resource-intensive manual interventions with individual website owners. The findings will aim to inform both technical approaches to crawling at scale and strategic communication with the broader web archiving community, website creators, and technology providers. Ultimately, this research aims to bridge the gap between legitimate preservation activities and necessary web security measures, ensuring cultural heritage institutions can fulfil their missions in an increasingly bot-hostile web environment. 12:14pm - 12:35pm
AI-powered search to sustain IIPC conference knowledge 1Bibliotheca Alexandrina, Egypt; 2Alamein International University; 3Egypt-Japan University of Science and Technology The IIPC Web Archiving Conference often receives high ratings in surveys from the community for being recognized as a platform for sharing knowledge and experience among web archiving practitioners and researchers. The output from this annual event is kept and made accessible via an online repository, courtesy of the University of North Texas. With today's advancement in Artificial Intelligence (AI) technology, an opportunity presents itself to render the wealth of information stored within the IIPC's repository of conference materials into more accessible knowledge. The IIPC Assistant supports the sustainable preservation and accessibility of the International Internet Preservation Consortium (IIPC) conference materials through an AI-powered search frontend that enables natural-language exploration of conference contributions archived in the online repository. By integrating vector embeddings with generative AI, the system delivers contextually accurate answers grounded in content that has been through a review process and was presented at the conference, contributing to the long-term usability and enhanced accessibility of the material that periodically documents the work done in the area of web archiving. The project began with metadata harvesting via the OAI-PMH API to consolidate creators, titles, subjects, and textual content from IIPC presentations and transcripts into a unified dataset. Because the materials were not designed for interactive querying, a Retrieval-Augmented Generation (RAG) approach was adopted to enable dynamic, source-grounded responses without retraining large models, an approach that promotes computational efficiency and sustainable reuse of existing data. Challenges in data consistency and semantic coherence were addressed by employing generative AI through the Gemini API to restructure fragmented text and enhance contextual quality. The retrieval pipeline was further refined to group and rank documents based on relevance, ensuring balanced coverage and interpretability. Built with a React + TypeScript frontend, Flask backend, and FAISS vector database, the implementation emphasizes scalability and efficiency. By advancing sustainable methods for information retrieval, the IIPC Assistant demonstrates how an AI-powered access interface can broaden the potential of a repository of valuable content accumulated over the history of the organization, thus transforming static collections into an interactive, reusable knowledge resource that supports ongoing research and global collaboration in the domain of web archiving. |
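As a rough sketch of the retrieval step in a RAG pipeline of the kind the IIPC Assistant describes (illustrative only; the embedding model and example texts are assumptions, not the project's actual implementation):

```python
# Illustrative sketch: embed text chunks, index them in FAISS, and retrieve
# the most relevant chunks for a question, the core retrieval step of RAG.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model would do

chunks = [
    "Talk on browser-based crawling with Browsertrix at WAC 2024.",
    "Workshop on SolrWayback full-text search over WARC collections.",
    "Panel on legal aspects of web archiving and text and data mining.",
]
embeddings = model.encode(chunks, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product = cosine here
index.add(embeddings)

query = model.encode(["Which sessions covered full-text search?"],
                     normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {chunks[i]}")
```

The retrieved chunks would then be passed, together with the question, to a generative model to produce a source-grounded answer.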
| 11:30am - 12:35pm | SHORT TALKS Location: PANORAMA [+6] |
11:30am - 11:41am
A Toolbox to foster Web Archives Use and Reuse National Library of France, France Web Archives represent an immense reservoir of data, with diverse and evolving possibilities for use and reuse that will undoubtedly continue to grow in the coming decades. As a national library, over the past 10 years we have faced a wide variety of requests, particularly for extracting, recovering, and replaying web-archived materials for research, institutional, and personal use. All these requests have enabled us to develop a range of services and a set of tools. We will focus on three real-life use cases and the technical solutions we have developed to answer the needs of:
Our presentation will cover how, starting from specific user needs and questions, we have progressively developed and consolidated a generic and sustainable set of tools, integrated into a toolbox, to extract and transform archived data and websites into various formats such as metadata, HTML, text, and images, and various outputs such as file lists, tree structures, or derivative WARC files. 11:41am - 11:49am
Constructing and sharing historical web link graphs from web archives Arquivo.pt, Portugal At our organisation, we have been developing a new text search platform based on Apache Solr to replace our legacy system, which depends on outdated and unsupported technologies. As part of this major upgrade, we undertook the task of reindexing all archived collections to align with the new, more flexible indexing schema. This large-scale reindexing effort provided us with a unique opportunity: the chance to extract additional insights from our historical web data. In particular, we focused on capturing link relationships between webpages. From this process, we generated and published a dataset of web link graphs that document the structure of hyperlinks across a significant portion of the web as preserved by our web archive. The published dataset contains information on over 139 million webpage URLs and the collections chosen for this dataset range from 1996 to 2021, allowing researchers to study the evolution of webgraphs over time. This type of data can be particularly valuable for researchers in areas such as web science, digital preservation, search engine technology, and network analysis. Furthermore, the code used to generate this dataset has been made publicly available. This allows others to apply the same approach to their own web archives and produce comparable link graph datasets from their WARC files. We believe this makes our work a reusable and extensible contribution to the web archiving and research communities.
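As a rough illustration of the kind of link extraction such a dataset involves (a sketch only, not Arquivo.pt's published code; the file name is a placeholder):

```python
# Illustrative sketch: build a simple (source, target) hyperlink edge list
# from the HTML response records in a WARC file.
from urllib.parse import urljoin
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

edges = []
with open("collection-1996.warc.gz", "rb") as stream:  # placeholder file
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        content_type = record.http_headers.get_header("Content-Type") or ""
        if "text/html" not in content_type:
            continue
        source = record.rec_headers.get_header("WARC-Target-URI")
        soup = BeautifulSoup(record.content_stream().read(), "html.parser")
        for anchor in soup.find_all("a", href=True):
            edges.append((source, urljoin(source, anchor["href"])))

print(f"{len(edges)} hyperlink edges extracted")
```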
In this lightning talk we aim to provide an overview of how the dataset was created and the structure and format of the data itself. 11:49am - 11:57am
Lossy and porous archives: Sustainability and collaborative models of LAC and the Internet Archive University of Copenhagen, Denmark As of 2024, Library and Archives Canada (LAC) and the Internet Archive (IA) have partnered to digitize and scan up to 80,000 out-of-copyright Canadian publications. Six “Scribes” workstations created by the Internet Archive were installed in LAC’s Gatineau facility, run by LAC staff (Library and Archives Canada, 2025; Internet Archive Canada, 2024). This co-created project is a reflection of the porous boundaries of democratic digital knowledge ecosystems. This paper will compare both LAC’s and the IA’s sustainability models through the IA’s digital resources and through interviews with Library and Archives Canada. It presents a brief overview of mandates and of accountability to publics vs. donors, and compares the overlap and (in)dependence of national and transnational digital archiving. The analysis draws on theories of data loss to engage with the porous and lossy boundaries of digital memory infrastructure. Both IA and LAC have gaps and absences, but their losses result in different absences and silences. Both the IA and LAC are infrastructure within the ecologies of digital archiving but diverge in mandate and logics. LAC is mandated to produce Canadian cultural and governmental memory, and is accountable to Canadian governmental policy, whereas the IA is a transnational nonprofit that exerts control through providing web infrastructure. They are bound by copyright law but are politically focused on increasing access to data through different, highly visible projects. I will use the construction of 'Scribes' as a focus to present the porous nature of digital memory institutions. This comparative analysis contributes to conversations around the tensions of digital national futures, and how the process of transnational archiving can complicate or support national archival agendas. References Internet Archive Canada. (2024, July 1). Internet Archive Canada launches digitization project with Library and Archives Canada. https://internetarchivecanada.org/2024/07/01/internet-archive-canada-launches-digitization-project-with-library-and-archives-canada/ Library and Archives Canada. (2025, August 1). The plan to scan: digitizing out-of-copyright publications. Government of Canada. https://www.canada.ca/en/library-archives/corporate/updates/2025/the-plan-to-scan-digitizing-out-of-copyright-publications.html Star, S. L., & Ruhleder, K. (1996). Steps toward an ecology of infrastructure: Design and access for large information spaces. Information Systems Research, 7(1), 111–134. https://doi.org/10.1287/isre.7.1.111 11:57am - 12:05pm
Ten years of websites and born-digital archiving in Slovakia University Library in Bratislava, Slovak Republic Electronic documents and websites should be preserved similarly to physical objects of lasting value. In 2015, our institution became involved in a project on digital resources. The goal of the project was to create the technological and organisational infrastructure for systematic and controlled web harvesting and born-digital archiving. We archive national websites and born-digital content (electronic monographs and electronic serials). The project has now moved beyond its sustainability phase, and all activities are carried out by a specialised department. During the pilot phase a complex information system for harvesting, identification, management and long-term preservation of web resources and born-digital documents was established. Our information system consists of specialised open source software modules (Heritrix, OpenWayback, Solr, etc.). The application is supported by powerful hardware infrastructure. The system management is optimized for parallel web harvesting. This makes it possible to complete a full domain harvest with the required politeness in an acceptable time. One of the useful system features is the identical parallel testing environment. The web archiving system has 800 TB of storage. A substantial part of the system is the catalogue of websites, which is regularly updated during the automated survey of the national domain. Some domains that match our policy criteria are added to the catalogue manually (.org, .net, .com, .eu…). Since 2016, our department has performed seven full-domain harvests of the national domain, as well as multiple selective and thematic harvests. Electronic publications with assigned ISSN are archived in cooperation with the National ISSN Centre by upload or by harvest. Access to the archived data is provided in OpenWayback. Only a limited number of archived websites and electronic publications are publicly available due to copyright restrictions. All archived resources are available locally in the institution. This contribution traces the path of archiving national websites and born-digital documents in the digital resources archive. Over these ten years, it has encountered several opportunities, and it is now a recognized source, partly supported by national legislation (archiving of news portals). 12:05pm - 12:13pm
Climate change captured: collaborative, complex crawling & collecting - learnings from a cross-institutional pilot on climate change reactions Royal Danish Library, Denmark As part of a national, cross-institutional, pilot initiative documenting public reactions to climate change, a recent thematic web collection focused on online debates and reflections surrounding water levels, flooding, and environmental adaptation. Within this pilot, an effort led almost entirely by a single curator resulted in the collection of over 1.6 million unique web pages—more than 5 terabytes of data—including embedded videos, dynamic rich media, and selected social media content. The collection was conducted using Browsertrix, a browser-based crawling technology that proved essential for capturing complex, media-rich web content that traditional crawlers often miss. The setup included both cloud-based and local installations, allowing flexible scaling and testing of workflows. Browsertrix enabled efficient harvesting within a limited timeframe while significantly improving the fidelity of the captures, particularly for sites relying heavily on dynamic or embedded content. This presentation will share key learnings from the pilot, focusing on technical, curatorial, and collaborative dimensions. On the technical side, challenges included resource demands, blocked access to social media “walled gardens,” and maintaining crawl stability across diverse sites. From a curatorial perspective, the project demonstrated the value of close cooperation with domain experts on climate change, whose insights were crucial for identifying emerging debates and relevant sources, as well as the value of inspiration from the other institutions participating in the pilot, which collected non-web media or physical objects. The user-friendly GUI of Browsertrix, partly developed during the IIPC-funded project "Browser based crawling system for all" (https://netpreserve.org/projects/browser-based-crawling), empowered curators to crawl and make informed decisions in a fast and intuitive manner; monitoring crawls at run time helped identify important sites that could be crawled in more depth later. However, the experience also revealed the need for broader outreach and participatory workshops in future large-scale efforts, to ensure diverse and inclusive input across sectors. The pilot underscored how browser-based harvesting tools can transform national web archiving by bridging gaps in multimedia and interactive content capture. At the same time, it highlighted the limits of current approaches—particularly the need for dedicated development to handle advanced social media and video platforms. The forthcoming main project, pending acceptance of funding applications, aims to build on these lessons, exploring how combining existing infrastructures with newer tools like Browsertrix can enhance thematic, rapid-response collections. With modest resources but focused technical and curatorial innovation, it is possible to add substantial cultural and research value to national web archives documenting societal reactions to climate change. 12:13pm - 12:21pm
Bridging local and international communities: Web archiving outreach and collaboration 1Aix Marseille University, France; 2Humathèque, Campus Condorcet, France; 3MMSH, CNRS, Aix Marseille University, France This lightning talk aims to present three community-building and outreach initiatives that brought together long-time web-archiving specialists and newcomers to the field in 2025. The first one is a community-building initiative that resulted in the drafting of a memorandum of understanding between the xxx and xxx. In this declaration, they commit to: creating a shared ecosystem to foster new cooperation projects, conducting collective work on the methodology for stabilizing and archiving web data corpora, strengthening links between existing institutions with expertise in collecting, analyzing and archiving web data, and reflecting on how to create a reproducible pipeline to collect, curate, consult and conserve web data corpora for SSH research. The second initiative is the co-organization of a monthly research seminar entitled “The Web and Web archives for research in the humanities and social sciences: knowledge, methods, and tools for the collection, analysis, and preservation of online corpora”. The third initiative is an event: a hackathon called “Building a corpus with web data” involving SSH researchers and research library professionals from xxx and xxx, but also other significant local players in web archiving. xxx and xxx are pooling their expertise to transform research practices through knowledge creation, training, awareness-raising, and the sharing of common tools for web archiving. Together, they want to build bridges between the international web-archiving communities (RESAW, IIPC) and local specialists and enthusiasts. 12:21pm - 12:29pm
Best practices for collaboration: Managing themed harvests with external partners National Library of Finland, Finland Themed harvests are a substantial part of the web archiving at the National Library of Finland. Beyond the yearly crawl of the Finnish domains ending in the .fi or .ax country codes, online content is collected through continuous harvests and themed harvests with varied subjects and content types. The most recent collection plan, for 2025-2028, requires more emphasis on themed harvests that involve collaboration or cooperation with different groups, third-party organisations, and other participants interested in suggesting content or otherwise contributing to the Finnish Web Archive. This lightning talk will provide insight into how collaborative themed harvests are usually managed and how they have developed in recent years. As harvests may cover subjects in which the legal deposit services team that curates the archived online content does not itself have the required expertise, the role of external partners is crucial. The presentation will include several themed harvests from recent years that involved cooperation or collaboration with external partners. Most of these collaborative themed harvests have been organized with institutions and organizations specialized in or representing language minorities or under-recognized groups, but the findings presented are also applicable to other kinds of external partners. Over the years, we have learned to improve the management of different types of cooperative and collaborative themed harvests. Collecting projects may be sparked by external suggestions or may be based on a set of online content already put together by a third party. Managing these kinds of projects usually turns out to be fairly different from projects that require reaching out for expertise beyond the National Library. Organizing themed harvests, especially with minorities and under-recognized groups, means that the collaborating participants are not just providers of suggestions but also have knowledge of and a say in other aspects of the project (e.g., cataloguing and communicating to peers). Based on our experiences with these kinds of themed harvests, we have produced internal guidelines on how to manage collaborative collecting projects. |
| 12:35pm - 1:35pm | LUNCH Location: GALERIE [-2] & PANORAMA FOYER [+6] 🍴 Lunch will be served in Panorama Foyer (Floor +6) and in Galerie (Floor -2, next to the Auditorium).
🏛️ GUIDED KBR MUSEUM TOUR (ENGLISH): If you signed up for a guided tour, please be by the Museum entrance on Floor 0 at 12:35 [1st tour] or 13:05 [2nd tour]. To know if you signed up for a tour, check your registration details in ConfTool. |
| 1:35pm - 3:00pm | SOCIAL MEDIA Location: AUDITORIUM [-2] |
1:35pm - 2:40pm
Digital Democracy: archiving government social media content 1The National Archives of the Netherlands, Netherlands, The; 2The National Archives, UK; 3National Library of Singapore; 4Bodleian Libraries General abstract For government organisations, the use of social media platforms is a great way to get in direct contact with citizens. For example, local organisations can use social media to ask the public for direct input on new initiatives regarding the environment in their municipality, and national ministries can highlight new policies and regulations. However, archiving social media after we have used it is a totally different story. We understand the need to archive the material, the legal basis national archives have to do so, and the limitations as we do not own the platforms, but how does that work in practice? During this panel national archives and libraries from all over the world will share their experiences with safeguarding this public discourse on social media for the long term. The panelists will each explain their own situation and briefly touch upon relevant legislation from their country. Furthermore, during an interactive panel discussion, the panelists will touch upon topics such as best and worst practices, user experiences, accessibility, the ongoing debate on whether to include or exclude comments and direct messages, how to handle donated material, long-term preservation, and the file formats in which social media is archived. Questions and use cases from the audience are very much appreciated. Short abstract per institution Organisation 1 At [organisation], we are not responsible for archiving social media on behalf of the entire [country] government. Each government organisation is responsible for managing its own social media archives. However, all currently active government social media accounts have been designated as information that must be permanently preserved. This means their archives will eventually be transferred to [organisation] within the next 10 to 20 years. To support this process, [organisation] has developed guidelines for archiving social media and is actively contributing to the creation of a government-wide policy. We have also defined the essential properties of social media archives that must be safeguarded to ensure their long-term preservation. National Library of Singapore With the growing significance and usage of social media, the National Library of Singapore (NLS) developed and included this new format as part of our collection policy in 2024. The policy covers both private organisations/individuals and government accounts. In close cooperation with the National Archives of Singapore, it was made mandatory for government agencies to transfer their social media accounts to the NLS. NLS is currently archiving Singapore's political office holders’ social media accounts, with plans to collect government agencies’ accounts in the near future. Our future plans also include ingesting the social media collection into NLS' digital preservation system and exploring access ideas. Organisation 3 [organisation] is archiving [country] Government social media at scale, using automated harvesting methods. This activity is supported by the Public Records Act 1958, which defines public records broadly as ‘not only written records, but records conveying information by any means whatsoever’, which includes social media. Currently, we are harvesting a limited number of platforms and we are exploring ways to expand our coverage, including direct transfer of accounts.
Archived material is publicly accessible via our Social Media Archive. Organisation 4 - TBC [organisation] collects social media in the context of thematic and/or curated research and data collections. [country] legislation recognizes the Internet as published (subject to Legal Deposit) and therefore the material is not archival. This means that [organisation] does not collect the web or social media of federal departments as formal archival records. However, as a published supplement to a formal archival fonds, [organisation] does accept important federal and non-federal social media data exports on a case-by-case basis. 2:40pm - 3:00pm
High-fidelity social media archiving: current state of the art Webrecorder How to archive social media remains one of the most frequently asked questions, and sometimes one of the biggest challenges, in web archiving. Social media platforms are vast and quickly evolving, while web archiving tools are always playing catch up. Can web archiving tools be used to archive social media at high fidelity, i.e. accurately to their users’ experience? What makes archiving social media so difficult, and what are the key aspects of web archiving that apply to social media? This talk will share some of our experience in the field, as well as the latest state of the art (which sometimes changes daily). We’ll cover the major platforms, such as Facebook, Instagram, Twitter/X, TikTok, YouTube, Telegram and LinkedIn, their current state, and how archiving some of these platforms has changed over the years. We’ll discuss browser profiles and paywalls, challenges of session information and rate limiting, custom behaviors, and how all of these factors affect capture and replay. We’ll discuss what has consistently worked and why, and what hasn’t, what requires more work and maintenance, and what trade-offs may be necessary. We’ll also provide a real world use case of social media archiving workflows that others could perhaps use. The presentation will discuss how we've approached social media archiving across key open source tools, including Browsertrix/Browsertrix Crawler, ArchiveWeb.page, and ReplayWeb.page. We hope to end with a discussion on the subject of how to make social media archiving a sustainable practice within the web archiving field, and what can be done collaboratively for the benefit of all. |
| 1:35pm - 3:00pm | WORKFLOWS FOR BUILDING AND ANALYSING DATA Location: PANORAMA [+6] |
1:35pm - 1:57pm
Digital Diaspora: mapping the Jewish internet The National Library of Israel, Israel Methods are being developed to systematically detect and archive Jewish web content on a large scale, capturing the evolving, multilingual digital expression of diasporic culture. This presentation outlines new procedures for the systematic detection and collection of Jewish web materials. Building on earlier curatorial approaches, this phase of the project focuses on automating the identification of thematically relevant websites through content-based analysis. Drawing on linguistic markers, semantic clustering, and metadata extraction, the process generates an expansive and continuously updated registry of Jewish web domains. To expand the detection of thematically relevant web content, the workflow integrates automated site aggregation with multilingual linguistic modeling. The system applies cross-lingual text analysis, semantic clustering, and metadata extraction according to defined selection criteria, enabling the identification of recurring cultural, historical, and communal markers across diverse digital sources. Detecting websites by thematic relevance rather than by technical metadata or domain structures presents a distinct challenge, as cultural or communal identity is often conveyed implicitly through language, visual and textual cues, and context rather than explicit tags or classifications. However, by embedding these computational processes within curatorial practice, the project broadens how the Jewish digital sphere is identified and delineated, ensuring that content produced in multiple languages and regions is systematically recognized and incorporated into the resulting archive. The presentation will address the conceptual design and technical aspects of this workflow, including criteria for data selection, the balance between automation and curatorial oversight, and methods for verifying the alignment of collected materials with the intended thematic focus. Beyond its technical contribution, the project reflects on the broader questions of how such workflows might inform other initiatives seeking to create expansive, thematically driven web collections, and how these systems can remain adaptable as online content and communities evolve. By presenting this next phase, the project invites further dialogue on how national and thematic archives can responsibly automate the preservation of networked, transnational cultural spheres. 1:57pm - 2:18pm
Improved language identification for web crawl data Common Crawl Foundation, United Kingdom Identifying the languages contained in crawl data is a fundamental step in exploring the multilinguality of web archives. However, this task is far from straightforward: language annotations contained in webpage metadata are often unreliable or missing, and existing language identification systems are limited in their ability to handle large-scale diverse web crawl data well. Specifically, common language identification systems used for web crawls (e.g. CLD2) only cover a small number of languages well and are not reliable for many under-served language varieties. At the same time, more recent high-coverage language identification systems (e.g. GlotLID) are too computationally expensive for large-scale pipelines and often lack robustness when dealing with the heterogeneity inherent in web data. We therefore identify five desiderata for a language identification system suitable for annotating web crawls: it must be fast, computationally lightweight, adapted to the web domain, able to handle multilingual input, and easily extensible to additional language varieties. In this talk, we present a new language identification system designed for web crawl data that meets all these requirements. Our solution is implemented in Rust and so is performant enough to process large amounts of web data in a reasonable time. It is designed from scratch for the web domain, including identifying multilingual web pages. The initial model is able to identify around 200 language varieties, but is easy to extend to additional language varieties given sufficient training data. We benchmark our system’s performance against popular existing language identification models, measuring computational performance and language identification fidelity. We finish with a discussion of the potential impact of our system on downstream language technologies, with a particular focus on under-served languages. Our language identification model is released under a permissive open source license to enable easy adoption and extension by the community.
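As a generic illustration of per-record language annotation over crawl data (not the authors' Rust system), the sketch below pairs warcio with the off-the-shelf fastText lid.176 model; the file paths and confidence threshold are placeholder assumptions, and a real pipeline would strip HTML markup before prediction.

```python
# Generic illustration, not the authors' Rust implementation: annotate WARC
# response records with a language guess using warcio and the pre-trained
# fastText lid.176 model. Paths and the threshold are placeholders.
import fasttext
from warcio.archiveiterator import ArchiveIterator

model = fasttext.load_model("lid.176.bin")

def detect_languages(warc_path, min_confidence=0.5):
    results = []
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "html" not in content_type:
                continue
            text = record.content_stream().read().decode("utf-8", errors="replace")
            # In practice, markup would be stripped first; fastText also expects
            # single-line input, so newlines are replaced here.
            labels, scores = model.predict(text.replace("\n", " "), k=1)
            if scores[0] >= min_confidence:
                uri = record.rec_headers.get_header("WARC-Target-URI")
                results.append((uri, labels[0].replace("__label__", ""), float(scores[0])))
    return results
```

2:18pm - 2:39pm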
Hyperlinked homeland: A historical hyperlink analysis of 200 Dutch LGBT+ websites University of Groningen, Netherlands, The Over the past years, scholars have increasingly emphasized that queer cultures intrinsically transcend national borders (Bayramoğlu et al., 2024). The transnational connections that LGBT+ people establish online, among others through hyperlinks (Kiel & Osterbur, 2017), are often presented as a case in point (e.g., Gonsalves & Velasco, 2022). My presentation, however, demonstrates that the nation still matters greatly. It builds on the interdisciplinary project I conducted as Researcher-in-Residence at, and in close collaboration with, the National Library of the Netherlands (KB), drawing from the fields queer internet studies, web archive studies and network analysis. Using historical hyperlink analysis, I analyzed the special LGBT+ web collection of the KB. This collection is unique in size and richness, comprising archived websites of hundreds of LGBT+ organizations and individuals, each of which has been harvested once annually. However, the collection has not yet been researched by others. The talk focuses on the 200 LGBT+ websites that were harvested in 2020 (for pragmatic reasons: in terms of size and quality of the LGBT+ collection, this is the best year to scrutinize). To identify the (trans)national queer networks they formed that year, I extracted and scrutinized all hyperlinks of these websites. After all, hyperlinks are not merely the constitutive elements of the Web, they are ‘conscious acts of connectivity’ (Milligan, 2022, p. 132) that yield insights into ‘hyperlinked identities’ (Szulc, 2015, p. 121). I specifically concentrate on the hyperlinks that directed to LGBT+ websites – not necessarily the 200 websites, but to any website, Dutch or non-Dutch, that catered to LGBT+ people. I will detail this bottom-up approach that combines distant and close reading, and will show that there was a distinctly Dutch queer web sphere. For instance, 49 of the 50 websites that were most frequently hyperlinked to (or: targeted) were websites of Dutch organizations, in Dutch. In fact, many were hosted by local or regional groups, which suggests that, as far as geographical focus is concerned, internet historians should perhaps zoom in rather than out. Moreover, most of the target websites had ‘.nl’ as a top-level domain (TLD), whereas ‘.amsterdam’ was also relatively popular. These findings challenge the assumption that queer online cultures are inherently transnational. This talk connects to the conference regarding both the topic (e.g., ‘underrepresented voices and marginalised communities’) and applied method (‘Derived and statistical data for distant reading’). It is designed to resonate with every conference participant. It goes beyond simply demonstrating—through practical examples—how collaboration between researchers and web archivists can deepen our insights into critical societal and historical issues. Additionally, it explores the workflows the KB and I created for building and analyzing datasets, which could inspire future research and ultimately encourage greater engagement with web archives. By showcasing how hyperlink analysis can reveal hidden local networks, this talk offers a replicable, data-driven approach for archivists and researchers to assess and enrich collections of underrepresented groups—directly addressing the conference’s call for inclusive and sustainable web archiving practices. 2:39pm - 3:00pm
WARCbench: A swiss army knife for WARC processing Harvard Library Innovation Lab, United States of America WARCbench is an open-source Python library and command-line utility designed for exploring, analyzing, transforming, recombining, and extracting data from WARC files in all their variety. Inspired by the ad hoc snippets of code the team at the Library Innovation Lab repeatedly reaches for while operating Perma.cc, WARCbench is a new addition to our suite of open-source web-archiving tools. It offers a resilient, highly configurable toolkit for experienced technologists, alongside easy-to-use commands for quickly exploring the contents of a WARC without writing any code. In running a production-scale web archive, we’re always finding new anomalies to investigate, emerging patterns to study, and new use cases to explore. Though a broad array of tools and libraries exists for working with WARC files, most are understandably optimized for the well-known, frequently encountered tasks of web archiving rather than for empowering learning and discovery, supporting ad hoc scripting, and enabling users to quickly and easily explore novel problem spaces. WARCbench was created with these non-standard uses in mind and with an eye toward best practices: clear, thorough documentation; robust error handling; and an architecture that makes custom extension and introspection straightforward. Our goals for this project were to:
Our session aims to spark dialogue about common practices in ad hoc WARC processing and future tooling needs. Attendees will learn practical, repeatable approaches for inspecting and handling even "difficult" WARC files using WARCbench, and we’ll demonstrate both typical and edge-case scenarios ranging from simple inspection to transformation and extraction. Because it’s open source and modular, WARCbench lowers barriers to adoption, invites community iteration, and supports tool longevity — a critical factor for sustainable web archiving.
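For context, the snippet below shows the kind of ad hoc inspection code the abstract refers to, written with warcio purely for illustration; it does not show WARCbench's actual API, and the file name and size threshold are invented.

```python
# The kind of throwaway inspection snippet the abstract alludes to, shown here
# with warcio for illustration only; WARCbench's own API is not reproduced here.
from collections import Counter
from warcio.archiveiterator import ArchiveIterator

def summarize_warc(path, large_payload_bytes=10_000_000):
    record_types = Counter()
    oversized = []
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            record_types[record.rec_type] += 1
            length = int(record.rec_headers.get_header("Content-Length") or 0)
            if length > large_payload_bytes:
                oversized.append((record.rec_headers.get_header("WARC-Target-URI"), length))
    return record_types, oversized

types, big = summarize_warc("example.warc.gz")  # placeholder file name
print(dict(types))
for uri, size in big:
    print(f"large record: {size} bytes at {uri}")
```
|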
| 3:00pm - 3:30pm | BREAK Location: GALERIE [-2] & PANORAMA FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6) and in Galerie (Floor -2, next to the Auditorium). |
| 3:30pm - 4:00pm | POSTER SLAM Location: AUDITORIUM [-2] |
|
|
A survey on data-access methods for an open web archive Common Crawl Foundation, France With the ever-growing interest in web data and web archives being driven by Large Language Models (LLMs), Artificial Intelligence (AI) and Retrieval-Augmented Generation (RAG), web archivists managing open repositories are faced with an unprecedented volume of download requests. Given that web archiving infrastructures are sometimes constrained in resources, the increased traffic has made it difficult to serve and fulfill all of these incoming requests properly without saturating the infrastructure. This problem is compounded by users who employ far too aggressive retry policies, often unknowingly, when they try to access open archives. To deal with these issues in relation to our own archives, we introduced an official, open-source tool over a year ago to facilitate sustainable user access. We developed it to be cross-platform, dependency-free and user-friendly to ensure easy adoption by the community. It implements and supports polite retry strategies like exponential backoff and jitter, while also allowing for parallelization. In this talk, we present the results of a comprehensive study of the tool's impact over the span of a year: how users have been accessing our archive and how this affects our infrastructure. We use a defined standard user agent for our official tool to track usage, and investigate how our tool has been adopted over time, and whether its introduction has simplified access to our open web archives for users. We also compare our official tool to other standard access methods employed by our users and study how the introduction of a polite access tool has affected the load on our infrastructure. Finally, we propose some strategies that other web archiving institutions can use to simplify access to their archives, providing users with polite tooling inspired by our findings and allowing them to reduce the load on their infrastructures. Linkra – application for archiving and creating citations of web resources in scientific texts National Library of the Czech Republic, Czech Republic Linkra, a newly developed archiving and citation service, is designed to store web resources cited in scientific and professional texts. It addresses the problem of link rot – the loss of referenced web content, which threatens the credibility of the texts that cite it. The application allows users to save cited resources to a web archive, obtain archive URLs, and create citation records. In addition to preserving cited resources, it encourages researchers to include archival copies in their academic citations in accordance with the ISO 690 standard. The application uses a harvesting method based on the open-source Scoop tool, allowing fast access to archived data. Working with the application involves several steps. Users first insert the web sources they want to preserve into the application, which starts the harvesting process. They then receive a unique address through which they can return to their request. After the harvesting is complete, they receive shortened URLs that will redirect to archived copies after indexing. Finally, they can use the built-in generator to prepare citations of web sources for publication in professional texts. They can either use pre-prepared templates designed according to common citation standards or create their own, for example according to the specific requirements of a professional journal. Citation records prepared in this way can be exported in bulk.
The Linkra application is being developed as an open-source tool as part of institutional research. It was preceded by research focused on disappearing web content and on the possibilities of citing web resources and their archive versions. The aim of the application is to preserve the sources of scientific works while also expanding the existing acquisition strategies of the web archive of the National Library of the Czech Republic. As part of the poster presentation, we will introduce the goals of our project, describe the technical solution, discuss the challenges encountered during development, and demonstrate how to use the application. Application of AI to Social Media Archiving at the National Library of China National Library of China, China, People's Republic of The evolution of Artificial Intelligence (AI) has offered a new paradigm for web archiving. Based on over two decades of practical experience, our library is actively exploring the innovative application of AI and AI agents across all stages of the archiving, preservation, and management processes. In practice, the library has achieved successful outcomes in applying AI to social media archiving and, using the DeepSeek large model, has made breakthrough progress in identifying archiving targets, analyzing archived content, and cataloging metadata. The library has expanded the scope of its web archiving to social media, focusing on articles published on WeChat official accounts. The deepseek-r1:14B model assists in determining archiving targets by filtering search results against specified search conditions and automatically extracting the titles and URLs of the WeChat articles to be crawled. Drawing on the model's capacity to learn, understand, and analyze text, it also assists in the full-text analysis of crawled WeChat articles. Trained on the cataloging records of historical articles and refined through multiple rounds of optimization, the model now produces precise descriptions of key information such as full-text summaries, keywords, and data sources for WeChat articles. AI thus provides an effective tool for web archiving. Archiving and Analyzing YouTube Recommendations during the Paris 2024 Olympic Games 1Université Sorbonne-Nouvelle; 2Université Rennes 2; 3National Library of France, France; 4Inria, Rennes; 5Université de Lille; 6LAAS-CNRS Though profoundly shaping and personalizing our experiences of the web and our access to information, algorithmic recommendations remain largely absent from institutional web archives, raising critical questions about how to capture and preserve a long-term record of algorithmic activity. This poster presents the preliminary results of a multidisciplinary research project that brings together a national library and experts from computer science, information science, social psychology, and sports history. The project’s goal is to capture and analyze the videos recommended by YouTube’s algorithm to different user profiles during the Paris 2024 Olympic Games, in order to determine whether these algorithmic recommendations reflect different narratives or perspectives on the Olympics, and whether they promote distinct values related to sports and the Olympic spirit. This poster will outline the initial findings of this exploratory approach, including the methodology and the resulting dataset.
Using bots with diverse browsing histories, we collected over 21 million video recommendations across 19 user archetype profiles over a 45-day period. We complemented this approach by constituting a corpus of 18k videos related to the Paris Games published during the events and monitored daily from the time of their publication. We refer to this as an "objective corpus", which we used as a reference to analyze the personalized recommendation datasets. We will present preliminary quantitative insights from the data collected, in particular by focusing on recommendations of videos from our "objective corpus". We found considerable variations of the bots exposure to corpus videos depending on their profile; in particular, bots with a media consumption are more exposed than bots with a sport consumption, which might appear surprising given the nature of the event. We will share the first results from a qualitative analysis of the subjective representations associated with the “Paris Olympics” event in the most frequently recommended videos. We analyzed variations in values expressed in these videos to compare different personalization regimes and value systems. Finally, we aim to spark a discussion on several open questions: How can such a large dataset be preserved and made accessible? How to construct a "representative" personalization ? How might algorithmic recommendations be integrated into existing web archiving practices, and how can their capture be developed into a reproducible and sustainable process? Can these recommendations help build an archive that reflects diverse perspectives on the same event? Detecting and managing challenging web crawls at scale MirrorWeb Limited, United Kingdom Web archiving at scale presents significant operational challenges, particularly in identifying crawls that deviate from expected behaviour. Whilst standard monitoring systems report binary "running" or "stopped" states, they fail to detect more subtle problems: crawls that exceed their intended scope, enter infinite loops on dynamic content, or silently stall whilst appearing active. By the time such issues are manually identified, substantial computational resources have been consumed, and service level agreements may be compromised. This poster presents [REDACTED]; a proactive monitoring application developed to address these detection gaps. The system leverages historical crawl data to establish profile-based performance baselines for different crawl configurations. By continuously comparing current crawl duration against expected averages, the application automatically flags potentially problematic crawls for investigation before they escalate into resource-intensive failures. The application integrates multiple data sources including AWS EC2 instance metadata, MySQL profile databases, Redis queue systems, and Heritrix REST API endpoints. When a crawl exceeds its baseline duration, the system gathers comprehensive diagnostics: status, queue metrics, actively processing URLs, and recent log entries. This diagnostic information is automatically posted to associated ticketing systems with stakeholder notifications, enabling rapid response. Operational deployment has demonstrated significant benefits including early problem detection (hours rather than days), reduced manual oversight requirements, improved response times through automated stakeholder notification, and enhanced organisational knowledge capture through documented diagnostics. 
The profile-based approach proves particularly effective for organisations managing diverse crawl types across multiple clients, where manual monitoring becomes impractical. This work highlights the importance of monitoring strategies that extend beyond simple status checks. As web archiving operations scale, institutions require intelligent detection mechanisms that understand normal crawl behaviour and can identify deviations before they impact service delivery. The poster will demonstrate the system's architecture, detection methodology, and practical implementation considerations for institutions seeking to enhance their crawl monitoring capabilities. Mapping duplicate images in a web archive using perceptual hashing National Library of Norway, Norway Images have been part of the web since its early beginnings [1] and today most webpages have some form of image content. Since the early 2000s, the National Library of Norway has harvested web data from the Norwegian top-level domain, storing time-stamped records of web content, including text, audio, video and images. A large portion of the stored data is images and finding ways to sort through the images, link together related images and remove duplicates is crucial for researchers to be able to find what they are looking for. Image files spread quickly online. The same image can be downloaded multiple times and reuploaded to different websites. As a result, duplicates of an image can be hosted at multiple domains and the link between the image instances is not always preserved in the process. Further, as content management services often compress and resize images automatically upon upload, instances of the same image might also exist with different sizes or compression levels which means that they are different at the byte level. This poster will present our ongoing work and preliminary results from a deduplication study to detect duplicate images in a web archive. By using perceptual hashing algorithms [2,3], we detect and flag perceptual duplicates in a subset of the archived data. Moreover, to estimate the performance of this perceptual hashing algorithm, we evaluate the detection accuracy for several simulated image degradation transforms. Similarly, we use pixel-level comparison on a random subset of the images to probe the hashing algorithm for false positives. Our initial findings suggest this approach is promising and has two potential benefits: 1) Allowing scholars to track the use and reuse of an image across multiple pages. 2) Reducing unnecessary computation, if two files represent the same image with only minor differences in resolution or compression, there is no need to perform expensive computation twice. We will present our work so far, what lessons we have learned and how these lessons will inform how the National Library of Norway processes and disseminates web archive image data in the future. [1]: Tim Berners-Lee and Mark Fischetti. 1999. Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web. HarperCollins Publishers. [2]: Farid, H. 2021. An Overview of Perceptual Hashing. Journal of Online Trust and Safety. 1, 1 (Oct. 2021). DOI:https://doi.org/10.54501/jots.v1i1.24. [3]: Meta 2019. The TMK+PDQF Video-Hashing Algorithm and the PDQ Image-Hashing Algorithm. 
https://github.com/facebook/ThreatExchange/blob/main/hashing/hashing.pdf (Retrieved 2025-10-13) Migration of Croatia's Web Archive's selective web harvesting system: transitioning to sustainable and interoperable solutions National and University Library in Zagreb, Croatia Preserving online publications presents growing challenges due to the increasing volume of digital content, rapid technological change, and the need to ensure compatibility with international web archiving initiatives. The current system used for selective web harvesting has reached both infrastructural and functional limitations, which prompted a shift toward a more modern and sustainable solution. The proposed approach focuses on migrating the existing selective web archiving system to the Web Curator Tool (WCT), an open-source platform designed for managing complex web harvesting and curation workflows. This migration entails comprehensive technical and functional transformations, including the conversion of harvesting parameters, metadata migration, and reconfiguration of harvesting schedules to accommodate new system capabilities. In preparation for the migration, archived publications are being thoroughly assessed to determine appropriate capture frequencies, verify the quality and integrity of harvested instances, and identify materials unsuitable for migration—such as publications not available in standard HTML format. This careful evaluation ensures that only relevant content is retained in the new web archiving system. The poster will outline the advantages of adopting standardized and widely supported tool, such as improved scalability, interoperability, and alignment with international web archiving best practices. It will also address potential challenges, including the need for significant resources during the migration process and the potential loss of certain legacy functionalities that cannot be replicated in the new environment. The overall goal is to establish a sustainable, scalable, and interoperable selective web archiving system that ensures the long-term preservation and accessibility of the nation’s online publishing heritage. Using data to challenge negativity bias in quality assurance workflows Library of Congress, United States of America This poster will describe how institutional staff are disproving a common expectation of poor results in web archives. Our institution includes over 5 PBs of data for event-based and thematic collections, however a hyper-targeted capture remediation approach by the quality assurance (QA) team leads to perceptions of low success and high failure rates of captured content. The team is motivated to find a sustainable workflow that balances large-scale quality assurance data and individualized attention to specific captures to glean a clearer, more positive image of web archive capture health. The poster will touch on staff's ongoing developments to make their quality assurance workflow sustainable. It will also briefly discuss how we process the data gathered through this workflow. Data from a standardized qualitative rubric for capture assessment indicates that a majority of captures are successful. This rubric is based on correspondence of the live site and web archive browsing experience, following criteria developed by Dr. Brenda Reyes-Ayala (1). Priority captures for remediation and triage by the quality assurance team are indicated by scoring on the rubric. 
Scoring data and major categorical issues from this rubric are then visualized in Tableau and reveal a range of positive and negative capture assessments. This Tableau data is critical as our QA staff time is focused on troubleshooting the negative assessments. New visibility of positive assessments through the Tableau dashboard highlights value within the collections and builds team morale in a challenging QA environment. The positive data allowed the team to update the QA workflow to funnel only the high-priority, actionable assessments through the process. Through data collection and visualization, we hope we can better understand and manage the myriad collections at our large collecting institution. Using data to communicate a transparent understanding of crawl health could help onboard new staff and support a morale boost for staff performing quality assurance work long-term. This poster shares steps on our journey towards stable and enduring web archiving capture assessment and remediation work. (1) "Correspondence as the primary measure of quality for web archives: A human-centered grounded theory study" International Journal of Digital Libraries, 2022 Sustainable web archiving: a living and participative poster 1C2DH, University of Luxembourg, Luxembourg; 2Ecole Nationale des Chartes - PSL, France How can we think about the sustainability of web archiving while respecting its vitality, creativity, and diversity of approaches and uses, and while encouraging co-shaping, interdisciplinarity, and participation? During WAC26, which will address this question and provide many answers, our participatory poster will offer an additional tool: a collective, living and co-constructed poster. The poster is both an installation and a performance running throughout the conference. Rather than a fixed, finished object, it is a living surface that grows day by day, shaped by the contributions of WAC26 participants. It takes the form of one or two long rolls of recycled kraft paper, several meters in length, fixed to a wall or unrolled across a few tables to invite collaborative contributions. On this surface, participants are invited to draw, annotate, question, collage, and connect, challenge, highlight ideas they developed or found interesting and exciting in sessions, using various materials (markers, old magazines, scraps of paper, threads of yarn to create links, etc.). In this way, the poster becomes both a shared reflection space and a creative archive in itself. We will prepare the very first layers of the installation (if possible with early scholars during the spring school to be held on April 20, just prior to the launch of WAC26) : this will take the form of a partial canvas including a mind map on sustainable web archiving and a set of open questions handwritten on kraft paper. From there, the surface becomes a collective palimpsest: enriched through participants’ sketches, and reactions to conference sessions. This process turns the poster into a space of dialogue and imagination, where sustainability is explored as a social, creative, playful, material, and collaborative practice. This living poster is at once a reflective poster, a participatory “artwork”, and a sustainable experiment in reimagining how we present and co-construct knowledge in the field of web archiving. 
Expected outcomes are a process of documentation (photos and notes throughout the event), presentation through a lightning talk and a final blogpost on netpreserve.org, including images of the evolving poster (and eventually audio comments), to preserve and share this experimental form of knowledge-making. The technologies of an in-house seed handling tool National Library of Finland This poster is an overview of the technologies used in an in-house developed tool that is used to create and manage collections based on harvested online materials, and to automate some harvesting and preservation related tasks. It has been in development since 2018 and is still being updated based on the users' needs. Virtual Mucem: from web archives to a museum remediation of an ethnological websites collection National Library of France, France The Museum of European and Mediterranean Civilizations (Mucem) is a major French ethnology museum located in Marseille. It opened in June 2013, inheriting the collections of the former National Museum of Popular Arts and Traditions (MNATP). This transfer of a national museum to a regional location was the first of its kind in France. The new museum implemented a multidisciplinary project and expanded its collections to the Mediterranean basin by launching new ethnological surveys. Between its official creation in 2005 and its opening to the public in June 2013, the museum developed an online strategy and launched eight original thematic websites. These websites were editorial projects in their own right and were used as a key means of promoting ethnological collections, researches and surveys. The websites were hosted on the French Ministry of Culture servers and were taken offline at the end of 2020 due to technical obsolescence and an extensive use of Adobe Flash technology. The disappearance led to an awareness of their importance. Some of them offered scientific descriptions of collections, which were more complete than the museum’s databases. Others reflected the museum's new stance on contemporary issues, such as gender, and preceded important exhibitions. The aim of the Virtual Mucem project carried out in 2024 was to experiment with a form of remediation by using web archives of a national library. The work was both documentary and technical. On one hand, the project team searched local archives and conducted oral surveys with the producers of the websites. On the other hand, a tool has been developed to enable the project team to extract and package the library web archives in order to produce derivative WARC files as complete as possible for each one of the websites. Following these two tasks, which were carried out simultaneously, the project team set up an editorial interface for remediating the websites and integrating the derivative web archives, which can be consulted within the walls of the Mucem's Conservation and Resource Center with a local installation of SolrWayback. This remediation project has a collegial and experimental dimension. Over the course of a year, it brought together more than fifteen people, including archivists, documentalists, librarians, IT specialists, and historians, as well as curators, ethnologists, and technical teams involved in the production of some of the sites. This poster will present the challenges and results of this remediation project. First, it will highlight the collaboration between a museum and a national library that can inspire new projects in the future. 
It will provide information about the process of creating derivative WARCs. Finally, it will question the remediation itself and some of the main issues: technical but also documentary obsolescence of the content, possible deficiency of the web archives, technical choice and network security, public display. WebData: Building a Research Infrastructure for the Norwegian Web Archive National Library of Norway, Norway Researchers have addressed the need for dedicated research infrastructures to study web archives. In response, the WebData project is building a research infrastructure for the National Library of Norway's web archive, enabling large-scale access to nearly 25 years of archived material. This poster will present the project's status, lessons learned so far, and findings from a needs assessment conducted with a relatively large number of scholars, mapping their needs.[1] The project started in 2025, with four key objectives:
Further, the poster will present findings from surveying researchers’ needs within four areas: a) access, b) interfaces and functionality, c) data and d) metadata. In addition to sharing scholarly needs, we examine how we plan to address this over the next 4 years. This involves traditional rule-based programming, identifying specific attributes in archived items, as well as machine-learning-based systems to enrich WARC data with additional metadata. The WebData consortium is led by the National Library of Norway, with the Norwegian Computing Center, University of Oslo and University of Tromsø as partners. Project development runs until 2029, while the infrastructure will operate until at least 2035. The project is funded by the Research Council of Norway. -- [1]: Brügger, N. (2021): ‘The Need for Research Infrastructures for the Study of Web Archives’. In The Past Web: Exploring Web Archives, edited by Daniel Gomes, et al. Springer International Publishing. https://doi.org/10.1007/978-3-030-63291-5_17; “About WebData” (2025), WebData. [2]: https://webdata.nb.no Kaʻohipōhaku: Community social media archiving in Hawaiʻi University of Hawaiʻi Since 2019, librarians in the University of Hawaiʻi System have been working towards developing an archive of social media content that documents significant historical events in Hawaiʻi, like the Kū Kiaʻi Mauna movement and the wildfires on Maui. After years of trial and error, we received a grant to establish a stronger foundation for this archive by bringing together Kānaka (Native Hawaiians) and web/social media archivists for the first time to exchange ideas, knowledge, and perspectives on what an ethical social media archive could look like. Through an Advisory Board, an online survey geared toward content creators dedicated to uplifting aloha ʻāina (love for the land), and community consultations, Kaʻohipōhaku will explore what an archive, rooted in Kānaka values and ʻāina (land), could look like in hopes of setting an example for the rest of Hawaiʻi and other Indigenous communities. Our poster will share our project goals, activities, and preliminary data from our findings. The project name, Kaʻohipōhaku, means to gather or collect stones, which is the first step in any utilization of pōhaku (stones, rocks). While pōhaku could be seen as immovable or fixed, all pōhaku can move when the time is right. Besides practical use (structures, cooking, tools,etc.), pōhaku are also believed to retain mana (energy, power). Our vision of building this archive is similar to building a hale (structure) in which our histories will be stored and preserved for the future generations. This name is also a play on the mele, Kaulana Nā Pua, in which the composer says they would rather eat rocks than be under the governance of foreigners. Doing humanities with web archiving: an oral history of web archiving practices in academia and the making of digital culture 1Aix Marseille University, TELEMMe Laboratory, France; 2MMSH, Aix Marseille University, CNRS, France Doing Humanities with Web Archiving: An Oral History of Web Archiving Practices in Academia and the Making of Digital Culture This project stems from the observation of a widening gap between, on the one hand, a small community of researchers and teachers who have developed expertise in web archiving, and, on the other, the vast majority of academics who occasionally need to archive the web as part of their work. The latter often rely on improvised, artisanal solutions to preserve or cite born-digital sources. 
While the experts are engaged with international initiatives and explore innovative methodologies linked to the digital humanities, most researchers remain unaware of this body of work and continue to “make do,” adjusting their practices as they go. How not to build a web archive in two weeks Texas State University, United States of America In 2025, a university library started a web archive. Getting to this point represented two years of education and advocacy to secure the necessary resources to start a program, aligning web archives with the larger mission and scope of the library. Given limited in-house development support, Archive-It was chosen as the university’s first web archiving tool and the university’s web presence as the first collecting area. Delays in contracts and purchasings resulted in little time to capture seeds before the data budget for the year would be reset. Determined to use as much of the data budget as possible before it expired, and after a self-given crash course in Archive-It, the presenter set out to capture as much as possible in as thoughtful a manner as time would allow before the end of the fiscal year. This poster will explore what went well, what went wrong, and lessons learned from this compressed timeline for starting a web archive. It will consider the work of implementing web archiving best practices, how the library is moving forward to grow a more robust and sustainable web archiving program, and the importance of advocacy and community in supporting institutions and sustaining the work of web archiving. In addition to doing the internal work of establishing repeatable workflows, refining regular crawl schedules, and considering the long-term preservation needs of their WARC files, the presenter is also actively restarting a regional web archiving interest group to build a local support network that can help foster their own and others' web archiving work in the area. As well, understanding that growth of the web archives will require continued support and increased resources from their institution, they are also leveraging the current attention their web archive has amongst leadership to promote the efforts of the library and advocate for the importance to the university of web archiving and preservation work. Through these efforts, the presenter aims to grow what started as a rough-and-ready little web archive into a sustainable web archiving program, expanding both upon its collecting scope and the archiving technology used. The presenter hopes the poster will prompt conversations around good (and not so good) practices in starting in web archives, successful approaches for advocating for web archiving resources, and the importance of web archiving communities in sustaining the work. Linking the awesome: Building a Community Knowledge Graph for Web Archiving Resources German National Library, Germany Web archiving is a highly technical endeavor involving a lot of tools. These tools are developed by a broad community and mostly as open source software. The open source software development allows participants of the community to exchange tools and improve them in a cooperative and collaborative way. The web archiving is technologically and from a community perspective embedded in the World Wide Web, which as well is mostly based on open source software and open protocols and standards. Likewise in web archives open protocols and standards, like WARC and CDX, play a fundamental role and allow the interoperability of components. 
The International Internet Preservation Consortium (IIPC) serves as a hub to foster communication among web archiving institutions, to support standardization processes and software development. The “Awesome Web Archiving” list follows the idea of awesome lists (https://awesome.re/). Awesome lists are common on GitHub, maintained as a Markdown document, and provide a low-barrier, accessible index of resources that are relevant for a certain community; contributors are able to suggest new entries as pull requests. Among other things, this involves links to software tools and standards documents. Within the “Awesome Web Archiving” list the entries are assigned to categories, while individual entries can fit into more than one category. The referenced projects are sometimes under active development, while others become unmaintained over time. To improve the quality of the “Awesome Web Archiving” list, and as such its value for the web archiving community, recency and information richness are relevant factors. The entries in the list are often links to git repositories or projects on GitHub. From these project pages, additional information about the current development status and the self-description of the projects can be gathered. To interlink the information gained through the crowdsourced approach of maintaining an awesome list with the information available on the project pages, linked data is a good and web-native format to encode information in a structured way. The SPARQL Anything tool (https://sparql-anything.readthedocs.io/) provides access to Markdown documents (https://sparql-anything.readthedocs.io/stable/formats/Markdown/) with the standardized SPARQL 1.1 Query language (https://www.w3.org/TR/sparql11-query/). With these tools it is possible to create a knowledge graph – the Web Archive Awesome Graph (WAAG) – of information resources relevant to the web archiving community (https://github.com/white-gecko/webarchiving-awesome-graph). This graph can serve as an integration point for structured or semi-structured contributions to the tool collection, for information enrichment, and to model interconnections between listed resources, such as tools and libraries, and libraries and standards. Finally, the graph's information can be browsed in a graph-like manner and rendered back into an awesome-list document. The tools involved are still under development and the approach requires discussion within the community. The poster should serve as a catalyst for such a discussion. Revisiting a statistical approach for measuring Solr query performance National Library of Norway, Norway Popular in the web archiving community, Solr allows for fast free-text search within a web archive. When working with large indexes, one soon faces the limits of one’s own infrastructure, and query response times increase. At that point, there are many measures that can be taken, so it is useful to know the effects of each measure, or which setting gives the best performance. This is when having tools for evaluating query performance comes in handy. This poster sheds light on a handy method of measuring and visualizing Solr query performance. Imagine for instance that you want to improve the query response time of your Solr index, and have a theory that it will help to split a large collection into multiple shards. To check whether this is the case, it is first important to be aware that a query with few hits typically has a shorter response time than a query with very many hits.
It is therefore insightful to check performance across groups of queries with, say, 10-100 hits, 100-1000 hits, 10K-100K hits and so on. There is also the question of caching. If a specific query has been made before, the response time is shorter and might give a misleading idea of a Solr instance’s performance. Consequently, one needs to do many queries, which results in a set of valuable statistics. If these tests are run before and after the shard split, the results can be compared and the performance gain becomes very visible. The method was used many years ago in presentations at previous IIPC conferences, but does not seem to be actively used today. The presenting organization is currently indexing on new infrastructure, and the method has been very useful in making decisions in this process, which is why we would like to highlight it in this poster. Sustaining web archiving through instruction New York University Libraries, United States of America According to the National Digital Stewardship Alliance (NDSA) 2022 Web Archiving Survey, "few organizations dedicate more than one, full-time employee to web archiving." American organizations’ staffing for web archiving has stagnated, with the majority of practitioners working on it for only a quarter of their professional time, in line with the results from the 2017 survey. With very little staff time devoted to web archiving, building and sustaining a program can be difficult and leaves no room for developing practices in the field. Over the last nine years, conversations around quality assurance, ethics, access and description for web archives have also fallen by the wayside in the United States in favor of similar conversations around event-based collecting and technical developments. But once these events are over, web archiving practitioners are still needed to maintain these collections over the long term. By providing training and instruction that covers not just the basics of web archiving but also workflows and policies, we can build up the understanding that web archiving is not a “set it and forget it” activity and needs more than a single staff member working at 25% of their time. This poster will focus on best practices for training students and professionals in web archiving, including quality assurance, how to use the tools, maintenance, preservation, and access, so that web archiving moves from being an extension of someone’s work to a sustainable practice in their institution and a community of practice with more people with the expertise to do better and more innovative work. The DOWARC notebook: modelling web archiving artefacts as RDF graphs in Jupyter 1The National Archives, UK; 2King's College London, UK This poster presents a local and small-scale implementation of Semantic technologies in web archiving processes and builds on the research collaboration we conducted in 2024, which delivered the draft version of the DOWARC domain ontology presented in a lightning talk at IIPC WAC 2025. To effectively manage the capture of the changes that affect live websites and webpages, web archiving practices lead to the creation of datasets composed of snapshots of web resources. Because each snapshot essentially recaptures the entirety of the archived web data object packaged into WARC files, significant issues of duplication inevitably arise over time, rendering versioning difficult to manage.
Furthermore, as each snapshot provides an instantaneous representation of the live web resource captured in a specific moment of its existence, issues of context also arise, particularly with regard to the relationship between different versions of the same resource. Such issues have an impact on the long-term sustainability of web archiving practices and can also affect future reuse of web archives, by engendering contextual ambiguities. Our research explores affordances of Semantic technologies in tackling versioning and context-related issues in web archiving practices. Although Semantic technologies such as RDF and Linked Data are being implemented by web archives to enrich discovery-of/access-to a web archives’ collections, and/or support distant-reading of primary web resources (e.g., mapping and profiling of web communities), currently they are neither being used to support sustainable versioning and address issues of context, nor are considered useful in tackling the preservation challenges specifically presented by web resources. Our implementation aims to fill this gap and demonstrate the potential effectiveness of Semantic technologies and Knowledge Engineering techniques in providing effective means to automate the mitigation of versioning and context-related ambiguities, over large and dynamic web archived datasets. The implementation we present processes web archive data in a portable Jupyter environment and visualises it as an RDF graph. Using OS standard tools such as WARCIO and FastWARC, we extract data objects from WARC and CDX files, which we index in a database and provide with URIs. The WARC and CDX objects we then annotate and describe using DOWARC are represented as an interactive network graph. Our notebook is configured as a sandbox environment, to test and assess affordances and bottlenecks of automation when annotating Real World web archiving artefacts using the DOWARC ontology. By presenting our work to the web archiving and digital preservation community, we would like to gather community feedback on our sandbox implementation, on the specific affordances offered by Semantic technologies that we have demonstrated, but also on the limitations we have encountered and successfully/unsuccessfully tackled. We aim to identify interested institutional partners to further explore scalable implementation of Semantic technologies to support sustainable and accessible archiving and preservation of web content. Where is Hyves? Preparing hyperlinks for distant reading KB | National Library of the Netherlands, Netherlands, The Link graphs, word clouds and keyword search are frequently based on derivative data, but more than often it is unclear how this data was prepared. In this presentation I will argue that because a website is such a container source, it is important as a researcher and as an archiving institution to be clear which data is in the index and how it was pre-processed. Through several collaborative research projects I have found that preprocessing the data, in this example hyperlinks, has a lot of consequences for the subsequent analysis. Being transparent about how data is pre-processed for tooling is therefor important for the academic community. To illustrate this point I will discuss two use cases: Hyves (a former social media platform) and XS4ALL (one of the first public internet service providers in [COUNTRY]) analysed within the SolrWayback and a custom linkanalysis script. 
Both platforms have a similar subdomain construction, causing them either to disappear from link graphs or to be grouped together into one major node in which the individual websites disappear. This raises the question: what should a single node in the link graph represent? I argue that the level of granularity depends on the research question, and on the importance of explaining to researchers that they should take this into consideration when performing their research. It is also important to know which hyperlinks are displayed within the graph. Hyperlinks can be found throughout a website: there are embedded content and anchor hyperlinks, but also scripts and fonts. Differentiating the kinds of hyperlinks within a visualisation is as important as knowing how they are cropped. When a tool or analysis does not differentiate this, the bigger platforms will always come out on top, eclipsing smaller but perhaps more important individual websites, because they have a stake in every type of hyperlink. More importantly, researching website networks based on the content of websites requires different hyperlinks than researching, for example, the techniques used to build a website. When visualizing link graphs with these thoughts in mind, you can enhance research results. Working this way can also be applied to other elements of websites, such as text or images. Text, for example, should also be differentiated into header text, footer text, article text, menu items and so forth. This brings more meaning to analysis tools and visualisations. Moreover, within a website the text is already coded through HTML, so why not use this? With this, archiving institutions can emphasize to researchers that the website is a container of many types of information and that they should be aware of this. Selecting which parts of a website to use can enhance their research, and this choice should be made wisely.
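To illustrate how the choice of node granularity reshapes a link graph, the sketch below counts the same set of extracted hyperlink targets at two levels: by full hostname, which keeps platform subdomains (such as individual user pages) distinct, and by registered domain, which collapses them into a single node. The URLs are invented examples, and the tldextract package is only one possible way to derive registered domains.

```python
# Minimal sketch with invented URLs: the same extracted hyperlinks aggregated at
# two levels of granularity, showing how the node definition changes the graph.
from collections import Counter
from urllib.parse import urlparse
import tldextract

links = [
    "http://userpage1.hyves.nl/photos",      # invented platform subdomains
    "http://userpage2.hyves.nl/",
    "http://homepages.xs4all.nl/~someuser/",
    "http://www.example.nl/about",
]

by_host = Counter(urlparse(url).hostname for url in links)

def registered_domain(url):
    ext = tldextract.extract(url)
    return f"{ext.domain}.{ext.suffix}"

by_domain = Counter(registered_domain(url) for url in links)

print(by_host)    # every subdomain (e.g. each user page) is its own node
print(by_domain)  # whole platforms collapse into single nodes such as 'hyves.nl'
```
|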
| 4:00pm - 5:20pm | POSTER SESSIONS Location: GALERIE [-2] |
| 6:30pm - 9:30pm | Dinner at Le Cercle Des Voyageurs |
| Date: Wednesday, 22/Apr/2026 | |
| 9:00am - 9:20am | MORNING COFFEE Location: GALERIE [-2] ☕️🥐 Drinks and snacks will be served in Galerie (Floor -2, next to the Auditorium). |
| 9:20am - 10:45am | SOCIAL MEDIA Location: AUDITORIUM [-2] |
|
|
9:20am - 10:25am
Social media archiving in small institutions: working alone together 1Vlaams Architectuurinstituut, Belgium; 2ADVN, Belgium; 3Amsab-isg, Belgium; 4meemoo, Belgium; 5KBR-UGent, Belgium; 6Regionaal Historisch Centrum Eindhoven, Netherlands; 7UGent, Belgium This panel takes a critical look at the work of several private cultural archives. In recent years, the findings of a joint research project on social media archiving have been incorporated into the organisations. To make progress with limited resources a specific approach to social media archiving was rolled out. The cultural archives are subsidised at regional level and build collections that complement the national collections of [INSTITUTION] and the [ARCHIVES]. These small private institutions work on a different scale and from a different perspective on social media archiving. During this session, three cultural archives will briefly present concrete steps that have been taken in the areas of selection, knowledge sharing and archiving to further embed the project in regular operations supported by a community of practice. This will be followed by a panel discussion of 40 minutes that will delve deeper into the challenges for each component and evaluate the steps taken. Based on propositions, experts from different backgrounds (research, regional public archive abroad, technical profile, national heritage institution) and institutions will engage in a discussion that will (hopefully) yield new insights. The presenters (different private archives) will moderate the panel. Some example propositions: - The small steps being taken by cultural archives, alongside those of national heritage institutions, are valuable. Social media must be archived at various levels by heritage institutions (national, regional, local). (What should be the role of large archives and libraries? Should there be coordination and how?) - It is more important to secure and preserve the data than to make it available. (Should we be concerned about our ecological footprint?) - It is not worthwhile to archive comments on posts. They mainly contain nonsense and rarely relevant information. - Archiving incomplete datasets is not worthwhile and therefore irresponsible. (What minimum criteria should heritage institutions use to determine what is worthwhile?) - We must ask permission from all parties involved before archiving. - We must better convince our archive creators to export their data themselves. (What are the arguments for and against? How do we do that? ) Small scale selection of social media (presentation) When you are a small archive with only half to one digital archivist, you have to be happy with small steps. After all, that archivist is responsible for setting up a digital preservation system, acquiring, preserving and giving access to a multitude of complex digital file formats. Despite the many tasks, it is necessary to start archiving social media before the data becomes inaccessible. A first step is to map and select the social media you want to archive. We recently started drawing up a seed list and establishing selection criteria. We use our own collection plan, websites and the MOSCOW principles to determine priorities. In a short presentation some examples illustrate this approach, the challenges (i.e. deduplication) and the gaps (i.e. randomness and bias) to feed later panel discussions. 
The community of social media archiving in practice (presentation) A community of practice on social media archiving developed various initiatives to safeguard its knowledge and experiences. Working groups were set up to share best practices (Twitter/X research) and test results of replay tools (SolrWayback). We organized edit-a-thons to update existing manuals and created new ones for a diverse range of archiving tools. Developing a sustainable network helps us to ensure our knowledge and expertise is not lost but can be embedded within our small private archival institutions. But what is the balance between effort and output? What roles do we take as an institution and archivist within that network? The inherent incompleteness of archived social media data (presentation) Regardless of the method used to preserve social media content, archived datasets will almost always be incomplete or imperfect. With participatory archiving – where the archival creator uses the platform’s export function to obtain a copy of their own data – significant contextual information is lost. For example, we only receive the archival creator's own comments, without the surrounding interactions that give them meaning. Web scraping methods also lead to imperfect archived datasets. For instance, depending on the tools used, the visual appearance and user experience of the original platform are often not preserved. Certain elements, such as comments or embedded media, are in practice also difficult or impossible to capture in full. These limitations are not solely technical; human factors also contribute. Delays in initiating the archiving process, particularly in event-driven archiving, can result in the loss of valuable content that has already been removed from the web. This raises a difficult question for web archivists: how should we address these imperfect conditions? By examining a series of cases where the archiving process went wrong, we propose a pragmatic approach that demonstrates how even flawed or partial efforts can still yield historically valuable data. Panel of external voices from different organisations and backgrounds (names were removed on request of the WAC program committee) They are available in the remarks for the program committee and chair:
10:25am - 10:45am
Making 1.2 billion social media posts accessible: a user-centric search interface for large-scale Twitter archives INA - Institut national de l'audiovisuel, France Archiving social media platforms represents a major scientific, documentary, and civic challenge. In order to secure our digital heritage, our institution has undertaken the task of collecting and archiving content from Twitter and, more recently, Bluesky. Over the past decade, the chosen strategy has resulted in an archive of 1.2 billion tweets and posts from 16,000 accounts and 3,200 thematic hashtags, accompanied by 25 million archived videos. While the resulting massive scale of these archives creates a multitude of opportunities, it also comes with new challenges. How do we design access systems that remain sustainable as archives scale from millions to billions of items? How can such a vast archive be made accessible, intelligible, and useful? Researchers require sophisticated filtering capabilities to construct meaningful corpora, as simple keyword searches on collections of this magnitude return overwhelming and unusable results. The general public needs intuitive and reliable tools to explore topics of interest, such as media events, cultural trends, and political and societal discussions. This presentation demonstrates a production-ready consultation interface designed to address these challenges. Built as a JavaScript web application with an Elasticsearch cluster backend, it provides multiple access points tailored to diverse research methodologies: - Faceted Search Engine: Full-text search combined with progressive filters for media type, language, hashtags, emojis, and engagement metrics (likes, retweets, replies, citations), enabling users to refine queries across multiple dimensions simultaneously. The presentation will include a live demonstration highlighting real research use cases that illustrate how preserved archived content enables important scholarly investigations. Beyond demonstrating the interface, this contribution aims to foster discussion about broader sustainability challenges in social media archiving. Platform migrations — such as the ongoing transition from Twitter to Bluesky — raise further fundamental questions: how can we design interfaces and data models that adapt to evolving platform ecosystems while maintaining data integrity and access? How can we ensure these archives serve as sustainable tools for research communities and the public? |
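To make the faceted model concrete, here is a minimal sketch of the shape of one query such an interface might send to its Elasticsearch backend. The index name ("tweets"), all field names, and the endpoint are illustrative assumptions, not INA's actual data model.

```python
# A hypothetical faceted query against an Elasticsearch index of posts.
# Index and field names are assumptions for illustration only.
import requests

ES_URL = "http://localhost:9200"  # assumed cluster address

query = {
    "query": {
        "bool": {
            "must": [{"match": {"text": "climate"}}],        # full-text search
            "filter": [
                {"term": {"lang": "fr"}},                    # language facet
                {"range": {"retweet_count": {"gte": 100}}},  # engagement facet
            ],
        }
    },
    "aggs": {  # facet counts displayed alongside the result list
        "hashtags": {"terms": {"field": "hashtags", "size": 20}},
        "media_type": {"terms": {"field": "media_type"}},
    },
    "size": 25,
}

resp = requests.post(f"{ES_URL}/tweets/_search", json=query, timeout=30)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("text", "")[:80])
```

Each added filter narrows both the hit list and the aggregation counts, which is what lets users refine a billion-item corpus progressively instead of relying on a single keyword query.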
| 9:20am - 10:45am | COLLECTIONS AS DATA: WORKFLOWS & USE CASES Location: PANORAMA [+6] |
|
|
9:20am - 9:42am
Web archives of tragedy: ethical, sustainable access and research use for 9/11 collections University of Waterloo, Canada During and after the September 11, 2001 (“9/11”) attacks, web users exchanged tens of thousands of emails, listserv posts, BlackBerry messages, and blog comments. Much of this material was captured in exceptional crawls by the Internet Archive and the Library of Congress, or later collected by the September 11 Digital Archive. Read together, these sources enable a minute-by-minute social history in which unity and care coexisted with fear, backlash, and hate, patterns further shaped by platform affordances and moderation practices. Yet this evidentiary base remains fragmented across crawls, platforms, file types, and modes of arrangement and presentation. This talk presents a practical model for sustainable access and research use by constructing a releasable, reusable dataset that harmonizes multiple September 11–related web-archival collections (e.g., Yahoo! Groups and web-hosted listservs), totaling tens of thousands of messages. The workflow covers content-hash deduplication; date-time normalization to Eastern Time (anchored to verifiable real-world events); thread reconstruction when possible; and a common schema that structures headers, body text, and related paratext (e.g., moderation notes) into designated fields. The resulting datasets are packaged as CSV and Parquet for straightforward download and reuse and are currently hosted as private collections on Hugging Face pending release decisions. Many items have effectively enjoyed privacy-by-obscurity in the Wayback Machine or as archive objects not exposed to search engines. When harmonized and made machine-indexable, they become trivially discoverable, including personally identifiable information. A user who posted under their own name to a public list in 2001 could not reasonably anticipate the 2025 search environment or large-scale text mining. While case-by-case review, which can attend to the context of creation, reasonable expectations of privacy, and the purposes of reuse, can guide my own individual decisions, it does not scale to tens of thousands of messages. At IIPC, therefore, I hope to gather community input on what to release, how, and with what documentation, and to share my own best practices. In my own work, I have imposed quoting thresholds on records that should not be identifiable, anonymized names and email addresses in some files, and documented provenance and processing choices so downstream users can determine what to use.
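Two of the workflow steps named above, content-hash deduplication and date-time normalization to Eastern Time, are simple to illustrate. The sketch below assumes a hypothetical message layout ("body", "sent"); it is not the presenter's actual pipeline or schema.

```python
# Minimal sketch: hash lightly normalized bodies to drop duplicates, and
# re-express timestamps in Eastern Time. Field names are assumptions.
import hashlib
from datetime import datetime
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")

def content_hash(body: str) -> str:
    # Collapse whitespace and case so trivial variants hash identically.
    canonical = " ".join(body.split()).lower()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def deduplicate(messages: list[dict]) -> list[dict]:
    seen, unique = set(), []
    for msg in messages:
        digest = content_hash(msg["body"])
        if digest in seen:
            continue  # duplicate body; keep only the first capture
        seen.add(digest)
        dt = datetime.fromisoformat(msg["sent"])
        msg["sent_eastern"] = dt.astimezone(EASTERN).isoformat()
        unique.append(msg)
    return unique

msgs = [
    {"body": "Is everyone okay?", "sent": "2001-09-11T13:05:00+00:00"},
    {"body": "Is  everyone   okay?", "sent": "2001-09-11T13:06:00+00:00"},
]
print(len(deduplicate(msgs)))  # -> 1
```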
9:42am - 10:03am
Creative access - Lessons from the Digital Ghosts exhibition University of Edinburgh, United Kingdom This paper presents the lessons learned from the Digital Ghosts exhibition, a practice-based research project exploring how artistic and creative methods can enhance public engagement with web archives. Centred on the Scotland on the Internet curated collection, the project investigated how visualisation, data enrichment, and storytelling can improve awareness and usability of archived web content among non-specialist audiences. The exhibition showcased collaborative works created by an interdisciplinary team of archivists, data scientists, artists, and informatics students. Through data-driven artworks and interactive interfaces, the exhibition translated web archive metadata into tangible and visually engaging forms that encouraged visitors to reflect on digital presence, disappearance, and collective memory. Public engagement activities, including a panel discussion and participatory workshops, further enabled dialogue between archivists, artists, and users on issues of selection, loss, and representation of [redacted] online heritage. A key component of the project was the preparation and enrichment of a dataset derived from the Scotland on the Internet collection, used both for artistic interpretation and as an educational resource. The process of structuring and visualising this web archive metadata offered an entry point for students and artists to engage with the complexities of humanities data, such as gaps, inconsistencies, and ethical and legal considerations. By integrating web archive material into data science teaching, the project aimed to familiarise future data users with the interpretive and contextual challenges of GLAM datasets, while exploring use cases to encourage the future utilisation of web archive data. To assess the impact of these creative interventions, the project incorporated user research in the form of visitor surveys and focus groups conducted with exhibition visitors, workshop participants, and student groups. Based on the results of the user research and through documenting this interdisciplinary process, the paper argues that creativity is not merely an outreach tool but a sustainable access strategy that bridges preservation and access, facilitates communication between archivists, outreach specialists, researchers, and users, and supports web archives literacy. Situated within the Access and Research Use track, the paper offers conference attendees a tried and tested framework for integrating data enrichment, as well as creative and participatory methods, into web archive engagement.
10:03am - 10:24am
Developing a sustainable workflow for UK Web Archive collections as data British Library, United Kingdom The UK Web Archive collects and preserves websites published in the UK, encompassing a broad spectrum of topics. The entire collection amounts to approximately 2 petabytes (PB) of data. The archive includes curated or thematic collections that cover a diverse array of subjects and events, ranging from General Elections, blogs, and the UEFA Women’s Euros, to Live Art, the History of the Book, and the French community. 2026 is a special year for the UK Web Archive, as it is celebrating its 21st year of curating web archive collections. In the early years these collections followed a simple structure: a title and a list of related websites, subsections of websites, individual web pages and documents published on the web. The implementation of the curation software in 2013 enabled the use of hierarchical structures to curate collections. Most of the hierarchical collections have one or two subsections, but some have up to four. The UK Web Archive provides an essential resource for studying the evolution of web publishing formats and for accessing a comprehensive record of content published on the web. Due to limitations of the Legal Deposit Regulations, creating datasets of web archive content poses both technical and legal challenges. However, the metadata created by UK Web Archive collaborators sits outside the limitations outlined by the Legal Deposit Regulations and can be repurposed to create datasets for further research. To date, we have published the metadata of a number of our curated collections as data through the British Library Research Repository. The metadata was extracted from backups of the curation management tool. The first tranche of collections as data was extracted from a backup of our curation software in July 2023, at which point there were 173,961 curated records in the collection. The second tranche was extracted from a backup of our curation software from October 2023, which held 181,551 curated records. This presentation runs through a number of the processes involved and the lessons learnt from developing these new workflows.
It is hoped that this presentation can enable further discussion on publishing collections as data within the web archive community. These discussions will then help to develop best practice for enabling reuse of web archives within the research community.
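As an illustration of what publishing collection metadata "as data" can involve, the sketch below flattens a hierarchical collection structure of the kind described above into a flat CSV. The record layout is a simplified assumption, not the curation tool's real data model.

```python
# Minimal sketch: walk nested collection records and emit one CSV row per
# curated website, carrying the full subsection path for context.
import csv

collections = [{
    "title": "General Election",               # invented sample data
    "records": [{"title": "Party A site", "url": "https://example.org/a"}],
    "subsections": [{
        "title": "Candidates",
        "records": [{"title": "Candidate blog", "url": "https://example.org/b"}],
        "subsections": [],
    }],
}]

def walk(node, path, rows):
    here = path + [node["title"]]
    for rec in node.get("records", []):
        rows.append({"collection_path": " > ".join(here), **rec})
    for sub in node.get("subsections", []):
        walk(sub, here, rows)

rows = []
for coll in collections:
    walk(coll, [], rows)

with open("collections_as_data.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.DictWriter(fh, fieldnames=["collection_path", "title", "url"])
    writer.writeheader()
    writer.writerows(rows)
```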
10:24am - 10:45am
Bridging the Web Archive and the Library: a Linked‑Data Model for FAIR Web Archive Integration German National Library, Germany Our library makes its data available as linked open data. Since 2012, we have operated a web archive, which is currently being redeveloped in-house with an open‑source approach to increase capacity. Furthermore, the web archive is being integrated with the overall digital library architecture, which involves ingest through the library's digital object import pipeline, cataloguing of the digital objects in the integrated library system, and storage in a common repository for digital objects. Thus far, the metadata of the web archive has been converted into the library's internal data format. However, it has become apparent that current bibliographic standards cannot capture the complexity and characteristics of web resources. Additionally, the web archive should provide sufficient metadata to allow data-based research on the digital holdings in a way that is adapted to the web medium. The overall architecture of the web archive involves several components that produce metadata about the digital objects, and others that require that data as input or enrich it. These components include seed selection, crawlers, file format checkers, quality assurance, metadata extraction, subject indexing, CDX indexing, and the playback system. |
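As a rough illustration of the direction described, the sketch below models a single archived website as linked data using the rdflib library and generic Dublin Core terms. The namespace and property choices are assumptions made for illustration; the library's actual model and vocabularies may differ.

```python
# Minimal sketch: describe one web archive capture as RDF triples.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF, XSD

ARCHIVE = Namespace("https://example.org/webarchive/")  # hypothetical namespace

g = Graph()
g.bind("dcterms", DCTERMS)

capture = URIRef(ARCHIVE["capture/20260420/example.de"])
g.add((capture, RDF.type, DCTERMS.BibliographicResource))
g.add((capture, DCTERMS.title, Literal("Example.de homepage")))
g.add((capture, DCTERMS.source, URIRef("https://example.de/")))
g.add((capture, DCTERMS.created,
       Literal("2026-04-20T10:00:00", datatype=XSD.dateTime)))
g.add((capture, DCTERMS.format, Literal("application/warc")))

print(g.serialize(format="turtle"))
```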
| 9:20am - 10:45am | SHORT TALKS Location: CONCERT [+4] |
|
|
9:20am - 9:30am
Environmentally-friendly digital preservation policies and infrastructure at the National Library of Norway National Library of Norway, Norway The National Library of Norway has been a certified environmental “lighthouse” organization since 2015, indicating that it complies with a defined set of environmental criteria. This has required the library to implement and sustain many environmentally-friendly policies, including several related to digital preservation and storage, that may be of interest to the international community. One core aspect of this work is energy efficiency. The library’s digital collections currently total more than 18 petabytes of data. This data is regularly checked for bit rot and is preserved using the 3-2-1 standard of digital preservation, wherein we preserve 3 copies of each file, on 2 different storage technologies, with 1 file copy stored at a different geographical location. To reduce our energy use in this work, the library uses an energy-efficient technology for our disk systems called MAID (Massive Array of Idle Disks). This storage technology reduces power consumption by only allowing disks to spin when they are in active use, so that most hard drives are kept inactive and turned off to save energy and extend their lifespan. Although it affects application performance during data access, MAID is effective for storing data that is rarely used, such as archival data that does not change and is rarely accessed. This provides almost 60% energy savings. Another aspect of the library’s sustainable data storage practices focuses on data minimization. The library stores material in file formats that meet international standards and that can also be compressed to reduce the total volume of information we store, such as the JPEG2000 file format. Our data is also stored in what is often referred to as a “cold climate” data storage facility: the National Library is based in the northern city of Mo i Rana, 30 kilometers south of the Arctic Circle, and the storage facilities are built into the side of Mofjellet mountain. For seven months of the year, the monthly average temperature is below 0 degrees Celsius. This stable, even, cold climate requires less energy to keep the storage servers cool. Finally, the library uses 98% renewable energy sources, including wind and hydroelectric sources, to maintain this infrastructure. There are still more measures the library can take to improve sustainability in our operations. For example, we soon plan to further optimize our energy use by recycling heat from the data center to warm buildings. Another area for improvement is our file integrity checking, which is not as efficient as it could be. We use checksum technology to check for bit rot: all preserved files are assigned a checksum, or fingerprint, and computing power is needed every time a check is run to confirm that a file has not changed. We compare the stored checksum against the calculated checksum for a file each time it is retrieved from our digital preservation system, but this is processing that could be avoided if we used technology that more effectively maintained the integrity of a file.
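The checksum workflow described above fits in a few lines of code. A minimal sketch, with placeholder paths and digests:

```python
# Recompute a file's SHA-256 fingerprint in streaming fashion and compare
# it with the stored value; a mismatch signals possible bit rot.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)  # stream in 1 MiB chunks; never load the whole file
    return h.hexdigest()

def verify(path: Path, stored_digest: str) -> bool:
    """True if the file still matches its recorded fingerprint."""
    return sha256_of(path) == stored_digest

# Usage (placeholder path and digest):
# if not verify(Path("master/file0001.jp2"), "9f86d081884c7d65..."):
#     print("fixity failure - restore from one of the other two copies")
```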
9:30am - 9:38am
Environmental Issues on the Web: Building and Promoting a Thematic Archive National Library of France, France In 2020, our institution took part in the Climate Change IIPC collaborative collection and drew inspiration from this initiative to set up its own collection on environmental issues. We felt it was essential to include these major issues for our contemporary society in our collections. That is why, since 2020, we have been running an annual collection entitled ‘Environmental Issues’. The aim of this collection is to highlight expressions, reactions, actions, representations and reflections relating to environmental issues on the internet. It comprises eight themes, in order to cover the multiple aspects of these issues (scientific, economic, artistic, etc.) as well as the different types of website producers. It currently has more than 800 selections made internally by librarians, as well as by partner libraries in the regions. In this lightning talk, we would like to present this collaborative collection on a national scale, as well as the various initiatives implemented to promote it to the public. In December 2023, we published a thematic, edited selection of archived pages (also known as a “guided tour”) about “The environment on the web”. This tour is divided into 14 themes such as “Issues, Concepts and Theories”, “Biodiversity and Species Extinction”, “Urban Planning and Land Use”, and “Everyday Citizen Action”. As our collections can only be accessed within the research rooms of our library, we have also published on our website the seed list of this collection, as well as a version of the tour with screenshots, for which we sought the website owners’ authorization. This collection and its promotion are a good example of how we build and develop a thematic collection in our library and how we can help the public better understand the challenges posed by climate change.
9:38am - 9:46am
Storing URLs, targets, and other time-varying entities in a database as a path to sustainable recordkeeping Hungarian National Museum Public Collection Centre National Széchényi Library, Hungary A recurring problem with mass web archiving, e.g., at the level of an entire top-level domain, is how to record the targeted content and the changes in the associated URL(s) over time. This issue is related to seed list maintenance: in the case of larger harvests, it is necessary to exclude websites that were previously saved but are no longer functional, meaning that there is no longer any content behind a given URL, or it no longer belongs to that website. The lightning talk presents a flexible concept that can be used to manage the relationships between URLs of different structures (with or without the http or https protocol, with or without www), their changes over time, and their connection to the website as an entity. The essence of the solution is an entity-based SQL database that is capable of recording all changes over time in a non-redundant manner by ensuring Third Normal Form (3NF). The main entities stored in the database, such as target and URL, are linked to each other, to themselves, and to tables containing information about them using junction tables. This solution ensures scalability: the information stored about each entity can be expanded arbitrarily, and the 'date_from' and 'date_to' fields in the junction tables can be used to record when the given relations were valid. Linking the entity tables to themselves allows us, for example, to link alternative URLs to each other in time. The information stored about each entity allows for complex queries: for example, in the case of the target, the type (website, web page, file, etc.), or in the case of URLs, the status code is stored in a separate table. The junction tables also ensure that changes over time are recorded, so that, for example, it is possible to query which URL belonged to a given entity (e.g., a file on a website) during a given period. All this contributes greatly to sustainability, as it provides a much more economical, easier to use, and more flexible query solution than previous data storage methods, such as Google Sheets spreadsheets.
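A minimal sketch of this entity/junction design, using SQLite, might look like the following. The table and column names follow the description above (target, url, date_from/date_to junctions), but the exact schema is an assumption.

```python
# Minimal sketch: entities in 3NF, linked by a junction table whose
# date_from/date_to columns record when each relation was valid.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE target (
    target_id INTEGER PRIMARY KEY,
    target_type TEXT NOT NULL          -- website, web page, file, ...
);
CREATE TABLE url (
    url_id INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE           -- scheme/www variants are separate rows
);
CREATE TABLE target_url (              -- junction table with validity interval
    target_id INTEGER REFERENCES target(target_id),
    url_id    INTEGER REFERENCES url(url_id),
    date_from TEXT NOT NULL,
    date_to   TEXT                     -- NULL = still valid
);
""")

con.execute("INSERT INTO target VALUES (1, 'website')")
con.execute("INSERT INTO url VALUES (1, 'http://example.hu/'), (2, 'https://www.example.hu/')")
con.execute("INSERT INTO target_url VALUES (1, 1, '2015-01-01', '2019-06-30')")
con.execute("INSERT INTO target_url VALUES (1, 2, '2019-07-01', NULL)")

# Which URL belonged to target 1 on a given date?
row = con.execute("""
    SELECT u.url FROM target_url tu JOIN url u USING (url_id)
    WHERE tu.target_id = 1
      AND tu.date_from <= '2020-05-01'
      AND (tu.date_to IS NULL OR tu.date_to >= '2020-05-01')
""").fetchone()
print(row[0])  # -> https://www.example.hu/
```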
9:46am - 9:54am
Web archiving automation at the Mexico Digital Preservation Group: error assessment and quality control 1National Library of Mexico, Mexico; 2Digital Preservation Group, Mexico In Mexico, progress continues to be made in web archiving, which has become a fundamental strategy for preserving digital heritage, especially given the volatile and ephemeral nature of online content. In this context, the Digital Preservation Group of Mexico (GPD) has experimented with an automated web archiving system to capture, store, and preserve digital resources relevant to the country's collective memory. This study focuses on detecting errors during the capture process and on the strategies applied to ensure the quality of the resulting archives. Using an empirical, applied approach combining observation and experimentation to address practical problems, the automated tool Browsertrix (from Webrecorder) was used, along with systematic reviews of the files generated in WARC format. Twenty-four websites were captured in 2025, including catalogs, databases, and repositories. The analysis focused on the frequency, type, and cause of detected errors (e.g., broken links, missing sitemaps, uncaptured dynamic content, JavaScript issues, or multimedia format problems) and the effectiveness of the applied quality control mechanisms. The results reveal that while automation allows for a significant increase in archiving coverage, it also introduces considerable technical challenges, which we will discuss in the lightning talk. Recurring error patterns were identified, linked to highly dynamic sites with complex structures, highlighting the need for specialized configurations and iterative validation processes. The importance of establishing contextualized quality criteria, beyond purely technical parameters, is also discussed, integrating aspects of cultural, institutional, and legal relevance. The lightning talk concludes with a series of practical recommendations for similar projects in Latin American contexts, emphasizing the importance of a flexible technical infrastructure, automated monitoring capabilities, and a clear policy for collaborative digital preservation. This work contributes to the development of standards and best practices for institutional web archiving in the region, and opens the door to future research on automated curation and preservation of emerging content such as social networks, alternative media and ephemeral resources.
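One QA step described above, reviewing the generated WARC files for capture errors, can be sketched as follows, assuming the warcio library. The reporting is illustrative rather than a reproduction of GPD's actual tooling.

```python
# Minimal sketch: tally non-2xx response records in a WARC file.
from collections import Counter
from warcio.archiveiterator import ArchiveIterator

def scan_warc(path: str) -> Counter:
    errors = Counter()
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            status = record.http_headers.get_statuscode()
            if status and not status.startswith("2"):
                url = record.rec_headers.get_header("WARC-Target-URI")
                errors[status] += 1
                print(status, url)   # candidate for re-crawl or patching
    return errors

# Usage: print(scan_warc("crawl-2025.warc.gz"))
# -> e.g. Counter({'404': 12, '503': 3})
```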
9:54am - 10:02am
Sustainable and systematic: building a search index of research and practice in web archiving and digital preservation 1Digital Preservation Coalition, United Kingdom; 2IIPC, United States of America; 3Cartlann Digital Services, Ireland Over the years, through events such as the IIPC Web Archiving Conference, iPRES - International Conference on Digital Preservation, and various collaborative projects, the digital preservation and web archiving communities have built an extensive repository of knowledge. However, a persistent challenge has been to provide a single, citable point of access to these dispersed resources. Our project introduces the Awesome Indexer [1], which brings together digital preservation and web archiving resources into a single search interface and database. Our key argument is that centralised discovery is crucial for the long-term sustainability of these resources, encouraging reuse of and investment in those resources rather than attempting to replace them. The tool works by accepting a range of standardised bookmark and bibliographic sources, such as Awesome Lists, Zotero [2], and Zenodo collections. Zotero is a particularly powerful source, as the established tools and workflows around Zotero collection management make it easy to pull in records from a wide range of sources, from traditional publisher websites through to YouTube playlists and content hosted by digital libraries [3]. The Awesome Indexer combines the data from these sources to generate a dedicated faceted search system, built using off-the-shelf tools and packaged as a simple static website. It also creates SQLite and Apache Parquet versions of the same data, allowing richer exploration and analysis of the sources in the index. The Indexer is an open-source tool that can be used by anyone to build their own index. This “work-in-progress” short talk will briefly trace the development of the Indexer, detailing the steps it required and the challenges posed by its underlying resources. The current version of the Digital Preservation Publications Index (DPPI) will be demonstrated to highlight how the Indexer consolidates decades of content from across multiple platforms into a single, comprehensive entry point. This significantly improves discoverability, facilitates citation, contributes to training, and maximises the impact of our collective knowledge for practitioners and researchers. References: [3] An example of a web archiving collection hosted by the University of North Texas Digital Library: https://digital.library.unt.edu/explore/partners/IIPC/
10:02am - 10:10am
Querying the archived web with an AI assistant 1Aarhus University, Denmark; 2Macquarie University, Australia The archived web is an indescribably rich primary source for contemporary history. However, only a handful of historians have started including the archived web in their source material when investigating phenomena from the 1990s and 2000s (Mackinnon, 2022; Millward, 2025; Winters, 2017). This lightning talk presents exploratory work on exploring and discovering content from web archives through an *AI Research Assistant*, guided by research questions from the discipline of history.
10:10am - 10:18am
Online annotation platform for web archives Arquivo.pt, Portugal Search engine evaluation relies heavily on high-quality test collections that reflect user information needs and relevance judgments. However, building such collections is resource-intensive, requiring systematic annotation of queries and results. The service is a web-based platform designed to streamline this process by enabling the annotation of search engine results in a user-friendly and collaborative environment. The tool allows assessors to annotate retrieved documents according to predefined relevance criteria, supporting the creation of standardized datasets for training, tuning, and benchmarking retrieval models. Our web archive is a research infrastructure that provides tools to preserve and exploit data from the web to meet the needs of scientists and ordinary citizens, and our mission is to provide digital infrastructures to support the academic and scientific community. However, until now, our web archive has focused on collecting data from websites hosted under the .PT domain, which is not enough to guarantee the preservation of relevant content for the academic and scientific community. Our web archive provides a “Google-like” service that enables searching pages and images collected from the web since the 1990s. Note that our web archive search complements live-web search engines because it enables temporal search over information that is no longer available online on its original websites. Developed within the context of our web archive, the service facilitates the generation of reliable ground-truth data, while remaining adaptable to different domains and languages. By lowering the barriers to annotation, this platform contributes to the reproducibility, scalability, and improvement of search technologies. The main objective is to provide, in the future, a dataset with public access to support researchers. This will make it possible to compare users’ search behavior between live-web and web-archive search engines.
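For a sense of what such annotations might look like once exported, the sketch below writes judgments in the widely used TREC qrels format. The sample query and document identifiers are invented; only the file format itself is standard.

```python
# Minimal sketch: export graded relevance judgments as TREC qrels lines
# (query-id, iteration, document-id, relevance grade).
judgments = [
    ("q001", "19961013000000/http://example.pt/", 2),   # highly relevant
    ("q001", "20010305000000/http://example.pt/x", 0),  # not relevant
]

with open("relevance.qrels", "w", encoding="utf-8") as fh:
    for query_id, doc_id, grade in judgments:
        # The second column is the unused "iteration" field, by convention 0.
        fh.write(f"{query_id} 0 {doc_id} {grade}\n")
```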
10:18am - 10:26am
Warc School - fellowship & training program update 1College of Wooster Libraries, United States of America; 2Shift Collective Archiving the Black Web was founded on a commitment to create pathways for underrepresented voices and marginalized communities to access web archiving skills, knowledge, and networks. Our work addresses not only “ensuring equitable access to archived web content,” but also ensuring equitable access to who gets to participate in the practice of web archiving and what gets privileged to be part of a web archive collection. At IIPC WAC 2024, Archiving the Black Web shared details about our project’s efforts to reduce these disparities with the upcoming launch of our fellowship and training program, Warc School. Developed for memory workers dedicated to collecting and preserving Black history and culture online, the fellowship offers web archiving training to enhance their memory work or digital content creation practice. In April 2025, Warc School welcomed 22 fellows representing traditional archives, community-based archives, Historically Black Colleges and Universities, public libraries, and independent scholars and creators to complete our 10-month training program, which includes five courses and a practicum. In this session, join Archiving the Black Web for a brief update on lessons learned while developing a training program and its curriculum, recruiting fellows and faculty, as well as highlights from student practicum projects. Attendees will also hear about our new initiative to strengthen social sustainability, with details about the launch of our second cohort. This cohort will include fellowship opportunities not only for memory workers but also for journalists at Black newspapers interested in digital preservation through web archiving training. Information integrity and ethical considerations related to artificial intelligence will be incorporated into the 2026 Warc School curriculum.
10:26am - 10:34am
Organizing the 'Social Mess': a comprehensive Tool for Social Media and Instant Messaging Archiving 1University of Pavia, Italy; 2University of Bologna, Italy The exponential growth of digital content through social media and instant messaging platforms presents critical challenges for digital preservation. Born-digital communications—created in fragmented, proprietary environments where personal and public spheres overlap—remain largely excluded from systematic archival practices despite their historical and cultural significance. Within the national archival context, there are no comprehensive tools to preserve and manage these materials for individuals, institutions, or public figures whose digital traces hold substantial value for future research. This gap affects personal archives of political and institutional figures and collections of broader cultural relevance. As part of a collaborative research initiative on preserving contemporary digital archives, we are developing a software tool for individual users and institutional archivists. This collaborative effort, which draws on our professional experience, highlights an urgent need to address technical and methodological shortcomings in this field. Existing tools—typically command-line utilities or platform-specific applications—allow for the separate management of content from social media, messaging services, email, and so on, but do not provide integrated support within a unified solution. Our framework, in contrast, is comprehensive in its capacity to manage the complete spectrum of digital materials: traditional files alongside social media content, instant messages, and emails within a unified environment. This comprehensive approach addresses the complexity of contemporary digital archives. The software enables users to reorganize their materials systematically, making it valuable in a variety of contexts: individuals managing personal digital heritage, prominent figures preparing materials for donation, or institutions controlling and facilitating access to collections. Our Java-based solution integrates core modules, ensuring usability and data integrity. Operating through manual download and ingest processes—not APIs—it provides user control while supporting standard formats (JSON, CSV) for interoperability. The embedded database and exclusive use of open-source libraries enable platform-independent installation without external dependencies. Key functionalities include AES-256 encryption, automatic backups, metadata extraction, device synchronization, and granular permissions. Critically, access settings apply at both file and individual message levels—essential for managing diverse privacy requirements and enabling selective disclosure within complex digital collections. Currently under active development, the project aims to support institutions in visualizing and managing heterogeneous digital materials, enhance accessibility for researchers through reorganization and categorization tools, and foster inter-institutional collaboration. This session will provide participants—particularly archivists and records managers—with an overview of a collaborative project and its outcomes, highlighting an integrated approach that offers significant advancements for digital preservation practice and academic scholarship.
10:34am - 10:42am
Social media archiving, right now Digital Preservation Coalition, United Kingdom As funding cuts bite, some organisations have had to shut down offices and services at very short notice. These closures put history at risk, especially where social media is concerned. An organisation's interactions with its patrons and the wider public are a crucial part of the function of any modern organisation, and the content, comments and context are important historical records. These should not be lost simply because the funding has been pulled at short notice. Unfortunately, in situations like this, already cash-strapped archives are rapidly swamped, and are struggling to cope with the deluge of digital records and requests for assistance. The individuals with access to the social media accounts are often not the archivists themselves, nor do they have the archival or technical skills required to archive the material alone. Short of time and resources, what should they do? And with little hope of booming budgets anytime soon, what are the most sustainable approaches for the safekeeping of these complex records? This presentation will share a wide range of lessons learned while attempting to assist organisations as they rush to capture what they can from Facebook, Instagram, LinkedIn, X/Twitter and Flickr. This investigation considered and experimented with a range of strategies, including direct web archiving, API access, third-party archiving services and data exports, combined with tools like Browsertrix, ArchiveBox and wget. The advantages and limitations of these approaches will be explored and compared, highlighting the gaps between what is possible and what is practical, in the context of an urgent shutdown operation. |
| 10:45am - 11:15am | BREAK Location: GALERIE [-2] & PANORAMA FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6) and in Galerie (Floor -2, next to the Auditorium).
🏛️ GUIDED TOUR KBR: If you signed up for a guided tour of KBR, please be in front of the three main elevators on Floor -2 at 10:45. To know if you signed up for a tour, check your registration details in ConfTool. |
| 11:15am - 12:20pm | QUALITY ASSURANCE & DEDUPLICATION Location: AUDITORIUM [-2] |
|
|
11:15am - 11:37am
Deduplication in browser-based crawling with Browsertrix Webrecorder This talk discusses new deduplication capabilities recently added to Browsertrix, a widely-used open-source browser-based crawler and crawl management platform, in relation to sustainable web archiving. Browsertrix Crawler originally did not include support for deduplication, but we have recently added it as an option at the request of our users. This presentation will discuss why Browsertrix and Browsertrix Crawler did not originally support deduplication, the trade-offs introduced by adding deduplication support, and the unique challenges and opportunities related to deduplication with browser-based crawling. These trade-offs will be discussed in relation to storage efficiency and sustainability in web archiving programs. The talk will begin with some background on the early principles and capabilities of Browsertrix, and why deduplication support had not previously been added. This will include some discussion of the complexities deduplication introduces in terms of inter-crawl dependencies, and the tension between this complexity and the goal of being able to create portable, self-contained web archives. Next, the presentation will give a high-level overview of the deduplication capabilities that have been added to Browsertrix and Browsertrix Crawler. This will include our flexible model for configuring an index as a deduplication source of truth using collections of previous crawls, how deduplication has been implemented in crawls, and the consequences this introduces for replay, sharing web archives, and other post-crawl activities. Also discussed will be how browser-based crawling allows for new experimental approaches to deduplication that can potentially result in efficiency gains in crawling time in addition to storage. The remainder of the presentation will provide thoughts on when deduplication may or may not be appropriate, using use cases to help illustrate how deduplication relates to institutions’ efforts to ensure their web archiving programs are efficient and sustainable, as well as the trade-offs that users will need to consider.
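For readers unfamiliar with the mechanics, the sketch below illustrates the general idea behind digest-based deduplication: an identical payload is stored once, and later captures become lightweight "revisit" pointers to the original. This is a conceptual illustration only, not Browsertrix's actual implementation.

```python
# Minimal sketch of URL-agnostic payload deduplication.
import hashlib

dedup_index: dict[str, dict] = {}  # payload digest -> first capture
archive: list[dict] = []

def store(url: str, timestamp: str, payload: bytes) -> dict:
    digest = "sha256:" + hashlib.sha256(payload).hexdigest()
    original = dedup_index.get(digest)
    if original is not None:
        # Identical payload seen before: record a pointer, not the bytes.
        record = {"type": "revisit", "url": url, "timestamp": timestamp,
                  "refers_to": original["url"]}
    else:
        record = {"type": "response", "url": url, "timestamp": timestamp,
                  "digest": digest, "payload": payload}
        dedup_index[digest] = record
    archive.append(record)
    return record

store("https://example.org/logo.png", "20260420100000", b"\x89PNG...")
r = store("https://example.org/en/logo.png", "20260421100000", b"\x89PNG...")
print(r["type"])  # -> revisit
```

The `refers_to` pointer is precisely the inter-crawl dependency mentioned above: a crawl containing revisit records is no longer self-contained unless the referenced crawl travels with it.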
11:37am - 11:59am
Efficient quality assurance of deduplicated web archives with Browsertrix National Library of Luxembourg, Luxembourg This presentation focuses on the quality assurance of archived websites using Browsertrix at a national institutional level. In the second half of 2025, our institution completed the migration and expansion of its internal web harvesting infrastructure to the latest version of Browsertrix. This includes the crawler, the management interface and the quality assurance workflows. We introduce several enhancements to these modules, which we will discuss in this presentation, with a particular emphasis on quality control. In particular, we propose a system for making the QA process more efficient by limiting the number of pages (or samples) that are analyzed in each batch. This process provides a good indication of the overall quality of a harvest without needing to check all (often many hundreds or thousands) of its pages. Together with our crawler’s cross-crawl deduplication feature, this makes it possible to archive and analyze many terabytes of web content on a regular basis. We also present, in detail, the system architecture and design choices we made during the migration process. This includes our Kubernetes deployment, hybrid storage solution, custom registry, and multi-node setup. Our workflow is separated into three dedicated nodes, making it possible to harvest, manage and perform QA separately for: (1) behind-the-paywall news media content, (2) websites of national importance, and (3) ad-hoc collections. Our results show that Browsertrix offers many unique advantages compared to the alternatives that our institution has used previously. Furthermore, our enhanced quality assurance workflow provides an efficient, scalable means to monitor, manage, and maintain regular harvests on a daily, weekly, and monthly basis.
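One standard way to size such a QA sample is a proportion estimate with a finite-population correction, sketched below. The confidence level and margin of error are illustrative assumptions, not the library's actual thresholds.

```python
# Minimal sketch: how many pages to inspect so the observed error rate is
# within `margin` of the true rate at ~95% confidence (z = 1.96).
import math

def qa_sample_size(total_pages: int, margin: float = 0.05,
                   z: float = 1.96, p: float = 0.5) -> int:
    n0 = (z ** 2) * p * (1 - p) / margin ** 2  # infinite-population size; p=0.5 is worst case
    n = n0 / (1 + (n0 - 1) / total_pages)      # finite-population correction
    return math.ceil(n)

print(qa_sample_size(10_000))  # -> 370: inspect ~370 pages, not all 10,000
```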
11:59am - 12:20pm
A browser-based approach to measuring completeness in archived websites University of Alberta, Canada The Internet Archive, the world’s most prominent web archiving institution, has created Archive-It (AIT), a popular web archiving subscription service used by hundreds of institutions around the world to preserve their digital cultural heritage. AIT clients can choose to employ an AIT tool called Wayback QA to perform Quality Assurance (QA) on their archived websites (Archive-It, 2025). However, for those institutions that do not use AIT, or for whom Wayback QA might not scale, the QA process has remained largely manual. To address this issue, we present a browser-based approach to measuring the completeness of a collection of archived websites. First, we establish a definition of completeness, which we define in terms of the network requests that are executed by a browser in order to properly load a website. We assume the live website is the “gold standard” against which the archived website must be measured. Therefore, a fully complete archived website executes all of the same network requests that are executed when loading the original live website. The completeness of an archived website thus becomes the fraction of original network requests that are successfully executed in the archived version. Our approach operates by comparing the network requests of the live website to those of the archived website and generating a measure of similarity. The approach includes an open-source command-line tool that can be deployed without needing to manually inspect each archived website in a browser. The work presented here is meant to provide a simple way to quickly assess the quality of a web archive collection. It does not preclude the use of other web archiving tools to capture, display, or analyze web archives. The audience for this tool is composed of web archivists looking to carry out QA on their archived websites. Researchers studying web archives could also employ this tool to gauge the quality of an archived web collection at a glance. The accompanying tool was written in Python, runs from the Linux command line, and is available to download and use on the GitHub platform. It was written to be as modular as possible, with each step producing an output that is then used as input for the following step. The approach presented here has the following advantages over previous approaches: – It does not require web archivists to manually interact with each site they have archived, saving time and resources. – Additional information such as screenshots, WARC files, or crawler logs is not needed. As input, it only requires the URL of the archived website and its live counterpart. – It is an open-source tool and not proprietary. As such, it is open to further improvements and contributions from the web archiving community, and an AIT subscription is not necessary to use it. – Because the approach is browser-based rather than crawler-based, it is more focused on the user experience of archived websites. References Archive-It: How to patch crawl with the Wayback QA tool (2025), https://support.archive-it.org/hc/en-us/articles/115004144786-How-to-patch-crawl-with-the-Wayback-QA-tool |
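Under the definition above, the completeness measure reduces to a set comparison over request URLs, as in the sketch below. Collecting the request lists from a browser is out of scope here, and the URLs are invented.

```python
# Minimal sketch: completeness = fraction of the live page's network
# requests that are successfully replayed in the archived version.
def completeness(live_requests: set[str], archived_ok: set[str]) -> float:
    if not live_requests:
        return 1.0
    return len(live_requests & archived_ok) / len(live_requests)

live = {"https://example.org/", "https://example.org/app.js",
        "https://cdn.example.org/hero.jpg"}
archived = {"https://example.org/", "https://example.org/app.js"}

print(f"{completeness(live, archived):.0%}")  # -> 67%
```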
| 11:15am - 12:20pm | TECHNICAL INNOVATION AND STRATEGIES Location: PANORAMA [+6] |
|
|
11:15am - 11:37am
Archiving websites and social media of national movements: best practices of ADVN | Archives of national movements ADVN | archive for national movements, Belgium In 2018, our archive decided to expand its collection of online publications and started harvesting the websites of our archival creators to preserve their online heritage for future research. The web is constantly changing, and content is quickly modified, removed or made inaccessible, which makes archiving it a necessity. During the coronavirus pandemic we realised that the rise of social media could no longer be ignored. That was the starting point for capturing, recording, scraping and downloading social media archives as well, but we were exposed to many challenges, including technical barriers (API limitations, platform restrictions) and legal and ethical issues, which require continuous monitoring and specific strategies for effective preservation. Over these years we have developed a sustainable policy and now regularly monitor more than 5,000 channels created by our archival community.
11:37am - 11:59am
Combining browser-based and browserless crawling for better fidelity vs. efficiency tradeoffs 1University of Michigan, United States of America; 2University of Southern California, United States of America Operators of web archives can crawl pages from the web using either dynamic browser-based crawlers (such as Brozzler and Browsertrix) or static browserless crawlers (such as Heritrix). Static crawlers are more lightweight and, hence, can crawl pages at a faster rate: in our measurements, 16x faster than with a dynamic crawler. However, static crawlers miss page resources which are fetched only when JavaScript is executed; we repeatedly crawled 10K pages (spread across the top 1 million domains) both statically and dynamically for 16 weeks, and found that only 55% of statically crawled snapshots visually and functionally match the corresponding dynamically crawled snapshots. In this talk, we will present our study on how to combine dynamic and static crawling so as to serve page snapshots at high fidelity while minimizing the computational resources needed to support high crawling throughput. First, we quantified the utility of a practice which is common in web archives: reusing crawled resources either across snapshots of multiple pages or across multiple snapshots of the same page. When an archive receives a request for a resource, it serves the copy which it captured closest in time to the page snapshot it is serving. If no resource with the requested URL is found, the archive returns a resource which has approximately the same URL. We estimated the utility of these simple measures when the frequency with which an archive crawls pages matches the availability of page snapshots on the Wayback Machine. We find that, compared to crawling all pages statically, crawling 9% of snapshots with a browser suffices to increase the fraction of statically crawled snapshots which can be served without loss of fidelity from 55% to 96%. Second, to fix the fidelity issues associated with the remaining static crawls, we studied two methods for augmenting them using other dynamically crawled snapshots.
Put together, we estimate that these two measures will further increase the fraction of statically crawled page snapshots which can be served without loss of fidelity to 99%. By communicating our findings to the IIPC audience, we hope that developers of web crawlers will help translate our findings into practice.
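The "closest in time" serving rule described above can be sketched as a search over sorted capture timestamps. The data below is invented; a real archive would resolve this through its CDX index.

```python
# Minimal sketch: pick the capture of a URL nearest to the requested time.
import bisect
from datetime import datetime

captures = {  # URL -> sorted capture times (invented)
    "https://example.org/app.js": [
        datetime(2025, 1, 3), datetime(2025, 6, 1), datetime(2025, 11, 20),
    ],
}

def closest_capture(url: str, requested: datetime) -> datetime | None:
    times = captures.get(url)
    if not times:
        return None  # fall back to approximate URL matching, as described
    i = bisect.bisect_left(times, requested)
    candidates = times[max(0, i - 1):i + 1]  # neighbours around the request
    return min(candidates, key=lambda t: abs(t - requested))

print(closest_capture("https://example.org/app.js", datetime(2025, 5, 1)))
# -> 2025-06-01 00:00:00 (31 days away, vs. 118 days for the January capture)
```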
11:59am - 12:20pm
The Wasteback Machine: measuring the environmental impact of the past web The University of Edinburgh, United Kingdom
This paper introduces the Wasteback Machine, a JavaScript library that repurposes web archives to analyse historical web page size and composition. It addresses a key limitation in current approaches to web sustainability assessment, which rely on live measurements and therefore obscure the cumulative environmental effects of long-term digital growth. By making web archives amenable to quantitative analysis, the Wasteback Machine enables new forms of historical inquiry into the evolution of page size and composition and their environmental implications. In doing so, it demonstrates how web archives can function as analytical resources rather than merely records of cultural memory.
This paper will demonstrate the capabilities of the Wasteback Machine, examine representative analyses of historical web development, and situate its contributions within wider debates in web archiving and sustainability. It will further consider the reuse of “reborn” digital materials for quantitative inquiry, the long-term ecological implications of persistent web expansion, and the challenges and responsibilities facing the future of web archives.
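As a taste of this kind of analysis, the sketch below (in Python rather than the talk's JavaScript) queries the public Wayback CDX API and prints the recorded size of successive captures of one page. The "length" field is the compressed size stored in the index, so this is only a rough proxy for page weight, and it covers the HTML resource alone.

```python
# Minimal sketch: historical capture sizes for one URL via the CDX API.
import json
from urllib.request import urlopen

URL = ("http://web.archive.org/cdx/search/cdx"
       "?url=example.com&output=json&fl=timestamp,length&limit=50")

with urlopen(URL, timeout=30) as resp:
    rows = json.load(resp)

for timestamp, length in rows[1:]:  # rows[0] is the header row
    if not length.isdigit():
        continue  # some index rows lack a recorded size
    print(timestamp[:4], f"{int(length):>8} bytes (compressed, HTML only)")
```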
|
| 11:15am - 12:20pm | ACCESS & REUSE Location: CONCERT [+4] |
|
|
11:15am - 11:37am
Unlocking the web: online access to the National Library of Singapore’s web archives collection National Library Board Singapore, Singapore In 2019, the National Library of Singapore's (NLS) legislation was updated to empower it to archive websites ending in “.sg” without the need for written permission. This allowed NLS to comprehensively collect and preserve Singapore's Internet landscape by conducting large-scale domain crawling of .sg websites. Since then, about 80,000 websites have been archived every year and made available on the WebArchiveSG portal. However, due to copyright laws, these .sg websites could only be viewed at the NLS on a designated computer terminal; permission had to be given by the website owners to make them accessible online. This meant that about 88% of the collection had access restrictions, which greatly impeded the use and visibility of the collection, as library users needed to visit the NLS to view it. After five years of growing the collection, and with greater awareness of and support for web archiving, NLS set out in 2024 to explore how it could make the collection more accessible to users. A discussion with its legal team was initiated to revisit the copyright laws and study how online access could be applied to the web archives collection. This led to the creation of online access criteria for websites based on the fair use principle that the archived website is not a 100% replica of the live website. A quick takedown policy was also put in place to handle public requests promptly. Under these new criteria, the bulk of the domain crawl collection could be released for online access. Websites whose owners had previously specified onsite access, and undesirable websites (e.g. adult and gambling websites), would remain accessible at the NLS only. NLS implemented online access to its collection in the 4th quarter of 2025. This presentation will cover NLS' online access criteria for websites, its application to the web archives collection, the operational changes made to allow online access via WebArchiveSG, as well as learning points from this experience.
11:37am - 11:59am
Unlocking the Web Archive: understanding researcher needs The National Archives UK, United Kingdom Our web archive contains more than 8 billion digital objects. It holds the record of over twenty-five years of government information released to the public, yet we face significant challenges encouraging research engagement and use of this resource. Barriers to increased access to the web archive include practical constraints (which limit our ability to release the dataset to potential researchers) and the Takedown Policy (a reclosure policy which allows for the removal of sensitive content at any time). Another challenge is our own incomplete understanding of what researchers need and want from the archive, as well as a lack of understanding by users of the complexities and limitations of the web archiving process. This presentation will introduce a project conducted at our institution designed to investigate and understand researcher needs. The project was funded by the Archives’ own Strategic Research Fund, an internal funding scheme reserved to make disruptive research possible and promote inclusive practice. In October 2025, workshops were hosted to determine what researchers want from our web archive, and subsequently we are able to share some of our hopes and plans for the future. This was our first project focused specifically on research users, rather than web archive users in general. We asked potential researchers what they need from the web archive in order to succeed, and introduced the ethical constraints that we face when sharing our own data. This enabled us to make recommendations for future work with the web archive that take into account practical and ethical constraints around the release of datasets, as well as increase researcher understanding of what the web archive is and how they can use it. The workshops aimed to engage both web archive users and those curious about the potential of web archives. We invited both groups in the hope of responding to the need for equitable access in public sector web archives (Hartland, 2024)[1] and a desire to follow the UN principles of good governance, which include being “participatory … equitable and inclusive” (Schafer & Winters, 2021)[2], in web archives more generally. This presentation will discuss the future access scenarios that were proposed in the workshops, scaled from least to most computationally and resource intensive. By examining what researchers both need and want from future digital preservation infrastructures, we will explore where they draw the line on computational intensity. The findings offer insight into how our web archive can evolve to meet the demands of its research community, balancing ambition with sustainability. We hope sharing both our findings and methodological approach can be useful to other web archiving institutions. [1] Nicole Hartland, ‘Web Archives for All? Towards Equitable Access to UK Public Sector Web Archives,’ iPRES (Online, 2024). [2] Valérie Schafer & Jane Winters, ‘The Value of Web Archives,’ International Journal of Digital Humanities (Springer, 2021).
11:59am - 12:20pm
Text Mining Analysis of the discourse on ‘Archive Silences and Democracy’ 1International Hellenic University, Greece; 2Department of Library, Archival & Information Studies, International Hellenic University, Greece; 3Department of Production Engineering and Management, International Hellenic University, Greece Foucauldian discourse analysis examines how language, power, and knowledge intersect to influence what is considered "true" and shape individual and societal identities. Analogously, deconstruction theory involves identifying binary oppositions (like truth/error), reversing the traditional privilege of one term, and revealing their interdependence in the discourse (known as ‘violent hierarchies’). However, privileging certain terms or silencing others is a dangerous practice that may have a direct impact on democratic institutions. The internet constitutes today’s digital public sphere, and an interdisciplinary range of scientists is trying to identify and develop best practices for selecting, collecting, preserving and providing access to its content. Archive silences refer to the absent or distorted documentation of certain groups, stories, and perspectives within historical records, leading to gaps in the collective memory and understanding of the past. In this paper we argue that archive silences in the digital public sphere either result from, or reflect, power relations that privilege certain terms, and that this has a major detrimental impact on democratic institutions. We will try to establish whether and how this relation between archive silences and democracy is manifested. To this end, a text mining process is employed to analyze the results of the query 'Archive Silences and Democracy' within the large volume of information contained in the 40 most popular pages returned by the Google (US) search engine. Artificial intelligence algorithms are used to examine the correlations between these terms, create clusters of concepts, and determine the terms that may strongly mediate meanings between such groups of concepts. Finally, the results are graphically represented in network form, where influential words are depicted as nodes and the strong interconnections between them are represented as edges, using the InfraNodus software. Results show that archive silences are strongly related to state political censoring (even in democracies, e.g. during transitions from dictatorships). Thus, they impose selective perspectives on the construction of social memory. They are also used both in uncovering and in silencing history (colonialism, immigration), and they are usually a result of corrupted autocracies. Archive silences exist with respect to human rights violations and freedom of the press; they may be gender-based, they may hinder the quest for accountability and justice, or they can be related to infrastructure inadequacies in disasters. As shown in the constructed network, these silences are not accidental but result from factors like biased collection practices, structural inequalities, and the inherent limitations of institutions, which can inadvertently or purposefully exclude certain voices, with an obvious negative impact on democracy. Addressing archival silences involves critically examining the history presented, recognizing the power dynamics involved, and seeking out the marginalized narratives that remain unheard.
We believe that the proposed methodology contributes towards all the above, as the text-to-network transformation and graph metrics avoid subjectivity and distortion of concepts, without imposing external semantic structures. Moreover, they can be especially helpful in bringing out potential conceptual gaps, which are highlighted in the transformed geometrical space. |
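The text-to-network step described above can be sketched as follows: words become nodes, co-occurrence within a sliding window becomes weighted edges, and betweenness centrality surfaces candidate "mediating" terms. The window size and toy texts are assumptions; the paper's actual analysis uses InfraNodus.

```python
# Minimal sketch: co-occurrence graph plus betweenness centrality.
import networkx as nx

docs = [
    "archive silences distort collective memory and weaken democracy",
    "state censorship produces archive silences in the public sphere",
]

G = nx.Graph()
WINDOW = 3  # pair each word with the next two words (assumed window)

for doc in docs:
    words = doc.split()
    for i, w in enumerate(words):
        for v in words[i + 1:i + WINDOW]:
            if w != v:
                prior = G.get_edge_data(w, v, {"weight": 0})["weight"]
                G.add_edge(w, v, weight=prior + 1)

centrality = nx.betweenness_centrality(G)
for word, score in sorted(centrality.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{word:12s} {score:.3f}")  # high scores mediate between clusters
```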
| 12:20pm - 1:25pm | LUNCH Location: GALERIE [-2] & PANORAMA FOYER [+6] 🍴 Lunch will be served in Panorama Foyer (Floor +6) and in Galerie (Floor -2, next to the Auditorium).
🏛️ GUIDED KBR MUSEUM TOUR: If you signed up for a guided tour, please be by the entrance to the Museum on Floor 0 at 12:20 [1st tour] or 12:50 [2nd tour]. To know if you signed up for a tour, check your registration details in ConfTool. |
| 1:25pm - 2:30pm | ARCHIVING BLOGS AND NEWS Location: AUDITORIUM [-2] |
|
|
1:25pm - 1:47pm
Blogs to digital heritage: a British Library case study British Library, United Kingdom In 2025, the British Library undertook a time-sensitive initiative to preserve its institutional blogs hosted on the Typepad platform. The library blogs represent over a decade of research, curatorial insight, and public engagement, making them a crucial component of the institution’s digital heritage. This project aimed to preserve the content while ensuring continuity of user access and long-term discoverability through the UK Web Archive. The blogs were hosted across two domains, with Cloudflare protections active on only one. This configuration presented several challenges for crawling, including blocked requests, redirects, and embedded content across multiple subdomains. To address these issues, crawler user agents were whitelisted by the domain owners and manual crawls were conducted for content outside Cloudflare. The team compiled seed lists for manual crawling using a combination of internal metadata, Screaming Frog exports, and curated inputs. Approximately 160,000 URLs were initially identified, which were refined to around 90,000 unique URLs representing individual blog posts and associated media. Browsertrix was used for targeted crawls of these posts, and separate crawls captured embedded assets such as images, audio, and documents. After crawling, further challenges arose in consolidating the content captured from two different domains into a single, coherent viewer. Quality assurance was particularly complex, as some captures were not traditional failures but rather pages returning HTTP 503 errors instead of the expected blog content. These recurring 503 captures had to be identified and re-crawled manually to ensure every post and associated media item was fully preserved, requiring careful review and iterative verification across both domains. Throughout this project a strong focus was placed on user access and experience. The current solution includes a bespoke workflow, with support from Browsertrix, which provides a temporary route for public access until the blogs are fully integrated into the UK Web Archive. Redirects were planned at the top-level domain to route users to archived versions, with documentation, including a LibGuide, to guide navigation and citation. The team explored how archived content could later be integrated into the Web Archive’s discovery systems to ensure sustainable long-term accessibility. This presentation will discuss the workflows, technical challenges, and collaborative strategies employed to preserve both content and access. Particular attention will be given to overcoming Cloudflare restrictions, managing URL redirects, coordinating cross-departmental teams, and designing user support resources to make the archived blogs usable and discoverable. The case study demonstrates that, under platform constraints, institutions can successfully safeguard digital heritage while prioritising accessibility, discoverability, and usability for researchers and the public.
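The 503-detection step lends itself to simple automation. The sketch below scans a CDXJ index of the kind produced alongside Browsertrix crawls and collects the URLs of 503 captures into a re-crawl seed list; the file names and exact field layout are assumptions about this particular workflow.

```python
# Minimal sketch: harvest URLs whose captures returned HTTP 503.
import json

recrawl = set()
with open("index.cdxj", encoding="utf-8") as fh:
    for line in fh:
        # CDXJ convention: "<searchable-url> <timestamp> <json block>"
        _, _, blob = line.split(" ", 2)
        fields = json.loads(blob)
        if fields.get("status") == "503":
            recrawl.add(fields["url"])

with open("recrawl-seeds.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(recrawl)) + "\n")
print(f"{len(recrawl)} URLs queued for re-crawling")
```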
1:47pm - 2:09pm
The taste of blogging: towards sensible and ethical approaches to web archives 1École nationale des chartes, France; 2Bibliothèque nationale de France Archives of the early vernacular web hold a great deal of sensitive content: personal photos, texts created by children, viral memes remixing personal and copyrighted material… Blogs and social networks are not made only of text and images: they encompass intimate, individual stories. Within those pages we come across confidences from marginalized people, mothers grieving for a child, photos of late-night parties, fantasies worded as fanfictions. What can be told about them without betraying the intimacy these authors have placed in their blogs? Based on the massive collection, carried out with the National Library of France (BnF), of 12.6 million blogs, mainly French-speaking and mostly created in the early 2000s, we will discuss how research teams and cultural institutions can implement sensible approaches to this peculiar kind of corpus. Our projects SkyTaste and Skybox build on a platform of tools and data for researchers, designed by the BnF to promote the visibility of this archive. Our goal is to capture the unique atmosphere of those blogs and to design ways of reconveying this heritage to its stakeholder community. Within our projects, we define sensible approaches to web archives as epistemological methods designed to interact with sensitive content from the vernacular web in a way that respects ethical principles. In France, web archives fall under legal deposit and can only be accessed by researchers on the premises of a few institutions. If we want to use this content for an exhibition or a scientific paper, we have to ask rights holders for authorization. However, most of the content on this blog platform was posted under pseudonyms, and much of it, especially within fandoms, is reused content that is difficult to trace. Furthermore, even when we can find these authors, they are not keen to authorize the display of their intimate content. Finally, some materials are so sensitive that we may feel reluctant to expose them even when we are allowed to. Yet if, when telling the stories of these blogs, we only show low-risk content, either authorized or already available, we run a significant risk of presenting a biased version of the platform and missing the purpose of cultural heritage: stirring emotions. Sensible approaches to web archives include acknowledging intellectual property rights, being mindful of people's privacy and intimacy, taking cultural diversity into account, and protecting stakeholders (including researchers) from potentially harmful information. Such approaches may involve navigating between distant and close reading, avoiding blind spots, building research processes together with communities, and mobilizing art-based research as a catalyst for the emotions we experience as web archivists, or as researchers, in front of the archive. Thanks to the synergy that has emerged around these projects, researchers and students are working with web archivists to build this ethical framework for navigating personal web archives. This is the main goal of two workshops we are organizing in the fall of 2025; we will synthesize our results for this presentation.
2:09pm - 2:30pm
Capturing the flow of online news: complementary approaches to web archiving and legal deposit in Sweden National Library of Sweden, Sweden The National Library of Sweden has engaged in large-scale web archiving since 1997, when domain-level crawls of the Swedish web were first initiated as part of the national web harvesting program. In 2002–2003, this effort was expanded to include daily crawls of Swedish news media websites, in recognition of the need to capture the rapid publication cycles and dynamic content characteristic of online journalism. These crawls have since documented the structure, evolution, and visual presentation of Sweden's digital news ecosystem across both national and regional outlets. The harvested material is available for on-site consultation at the library and forms a cornerstone of the National Library of Sweden's long-term digital preservation holdings. The introduction of electronic legal deposit legislation in 2012 significantly expanded the library's collecting mandate, establishing a legal basis for requiring publishers to deliver digital content, including material distributed exclusively online and behind paywalls. Building on this framework, the National Library of Sweden launched a new, more granular collection process for news media in 2015: focused harvesting based on RSS feeds supplied by publishers in accordance with technical specifications developed by the library. These feeds expose article-level content and metadata, including updated versions of published articles, thereby enabling the systematic and high-frequency collection of born-digital news items (a minimal sketch of such feed-based harvesting follows this abstract). This targeted, metadata-rich approach complements the broader but less structured coverage achieved through traditional web crawls. This presentation will examine the operational and curatorial relationship between these two collection streams: comprehensive web harvesting and RSS-based electronic legal deposit. It will discuss differences in scope, temporal resolution, and metadata granularity, as well as efforts to align descriptive and technical metadata across systems to enable cross-collection discovery and analysis. Particular attention is given to the challenges of integrating large-scale WARC-based collections with structured, feed-based article data, and to access conditions: while the web-harvested material is available to users on-site, the legal deposit corpus remains restricted due to current legal and technical constraints. The presentation will also outline future directions for harmonizing workflows, enhancing metadata interoperability, and leveraging these complementary datasets for large-scale research use in digital news studies and computational journalism. |
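As a rough illustration of the RSS-based legal deposit stream described in the last abstract above, the following sketch polls a publisher feed and collects article-level content and metadata. It is a minimal sketch, assuming the feedparser and requests libraries; the feed URL and field handling are illustrative, not the National Library of Sweden's actual specification.

```python
# Sketch: harvest article-level items from a publisher-supplied RSS feed.
import feedparser
import requests

FEED_URL = "https://publisher.example.se/legal-deposit/rss"  # hypothetical

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # Article-level metadata exposed by the feed; updated versions of
    # an article are re-delivered with the same id and a new timestamp.
    record = {
        "id": entry.get("id"),
        "title": entry.get("title"),
        "url": entry.get("link"),
        "updated": entry.get("updated"),
    }
    # Fetch the born-digital article itself for ingest.
    response = requests.get(record["url"], timeout=30)
    response.raise_for_status()
    # ...hand payload and metadata to the preservation pipeline here.
    print(record["id"], record["updated"], len(response.content), "bytes")
```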
| 1:25pm - 2:30pm | RESPONSIBLE STRATEGIES Location: PANORAMA [+6] |
|
|
1:25pm - 1:47pm
End of Term Web Archive: Harmonizing WARC contributions from multiple crawling partners 1University of North Texas Libraries, United States of America; 2Internet Archive, United States of America Every four years, the End of Term (EOT) Web Archive documents the transition in the executive branch of the United States federal government by harvesting federal .gov and public .mil domains. The most recent transition, from the Biden to the Trump administration, resulted in the largest data collection yet, with over 2.3PB of content crawled by six different crawling partners. From the beginning of the EOT Web Archive project, the diversity of approaches that different crawling partners have taken to crawling and curating portions of the overall project has been seen as a benefit: it allowed experimentation with different crawling strategies while letting partners focus on the content their organizations were willing and able to collect. In the EOT-2024 process, however, this diversity of collecting institutions resulted in a wide range of implementations of the WARC format and required the project team to decide how best to harmonize the data and make it available to researchers for computational use. The variations included WARC files created using record-at-a-time gzip compression, WARC files packaged in the Web Archive Collection Zipped (WACZ) format, WARC data compressed with the Zstandard compression algorithm, and WARC files packaged in the BagIt format, comprising file headers with the payloads stored alongside the WARC files themselves. To provide a consistent file format and access paradigm for end users who might not be familiar with these variations of the WARC format and their nuances, the EOT team decided to normalize all streams of WARC data into individual WARC files with record-at-a-time gzip compression (a minimal normalization sketch follows this abstract). Normalizing several of these formats presented non-trivial challenges during the process. While the data for the public dataset was normalized, the originally contributed formats are archived as deposited at the Internet Archive, where they are served by the Wayback Machine. The resulting dataset should provide end users with an easily accessible set of files that can be used for a variety of future projects. This presentation offers a novel focus on normalizing heterogeneous WARC files in order to provide a consistent set of interactions for end users who are not primarily web archivists. It will give a brief introduction to the EOT collection process but focus predominantly on the different tools and the resulting WARC implementations generated in the most recent round of this effort, along with the decisions the EOT team made to normalize these WARC records and the technical approaches used throughout the dataset-creation portion of the project.
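The normalization target named above, one gzip member per WARC record, maps directly onto a short warcio loop. This is a minimal sketch under the assumption that WACZ and BagIt packages have already been unpacked to plain WARC files and Zstandard-compressed data already decompressed; the file names are illustrative.

```python
# Sketch: rewrite a contributed WARC as record-at-a-time gzip.
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

def normalise(in_path: str, out_path: str) -> None:
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        writer = WARCWriter(dst, gzip=True)  # one gzip member per record
        for record in ArchiveIterator(src):
            writer.write_record(record)

normalise("contributed.warc", "normalised.warc.gz")  # illustrative names
```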
1:47pm - 2:09pm
Crawl, cloud, carbon: measuring and reducing emissions for web archivists Tailpipe, United Kingdom This talk offers a walkthrough of a novel methodology for precisely estimating the carbon emissions generated by cloud computing, contextualised within a case study in which the emissions of a major web archiving platform were measured. The presentation begins with an explanation of the process by which cloud computing generates carbon emissions: a chain connecting the cloud service user to the datacentre that processes their requests, the power station that fuels the datacentre, and the energy source that generates the necessary electricity. This chain is illustrated with data from the emissions assessment of the aforementioned web archiving platform. The emissions intensity of web archiving is also highlighted: it is a compute- and storage-intensive process, reliant on a vast network of cloud storage, which consumes a significant amount of power and thereby generates material quantities of carbon emissions. Next, the methodology for estimating cloud computing emissions is detailed step by step. It begins with an assessment of the power draw of the hardware components that host cloud services. This dataset is combined with measured processor utilisation data to determine the overall power draw of a user's or organisation's use of cloud services. The carbon emissions of this power draw are then calculated from regional grid-mix carbon intensity data, accounting for regional power transmission losses (a simplified worked example follows this abstract). Alongside these 'operational' emissions, the methodology is expanded to encompass other elements of the cloud computing infrastructure's lifecycle, including manufacture, shipping, and disposal. The methodology is accompanied by examples from the web archiving case study, covering the types of hardware used by web archivists, the types of cloud services used to host web archiving, and the carbon intensity of the datacentres that most commonly host web archive data. Results from empirical testing will also be shown to demonstrate the precision of the estimated power and emissions calculations, and areas where additional refinements can be made in the future will be presented. The presentation concludes with recommendations to help web archivists reduce the carbon emissions generated by their processes, including migrating services to datacentres in low-carbon-intensity regions and maximising the efficiency of web archiving software hosted on cloud services.
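To make the arithmetic of the methodology concrete, here is a simplified worked example. All numbers are illustrative placeholders, not figures from the case study, and the function sketches the operational-emissions step only.

```python
# Sketch: operational cloud emissions = (power draw x utilisation x time),
# grossed up for transmission losses, times the regional grid carbon intensity.

def operational_emissions_kg(
    power_watts: float,               # assessed power draw of the hardware
    utilisation: float,               # measured fraction of capacity in use (0-1)
    hours: float,                     # duration of the workload
    grid_kg_per_kwh: float,           # regional grid-mix carbon intensity
    transmission_loss: float = 0.05,  # illustrative regional loss rate
) -> float:
    consumed_kwh = power_watts * utilisation * hours / 1000.0
    # Losses mean more electricity must be generated than is consumed.
    generated_kwh = consumed_kwh / (1.0 - transmission_loss)
    return generated_kwh * grid_kg_per_kwh

# Example: a crawler node drawing 250 W at 60% utilisation for a 12-hour
# crawl on a grid emitting 0.25 kgCO2e/kWh, plus an amortised 'embodied'
# allowance (manufacture, shipping, disposal) of 0.02 kgCO2e per hour.
operational = operational_emissions_kg(250, 0.6, 12, 0.25)
total = operational + 0.02 * 12
print(f"{total:.2f} kgCO2e")  # ~0.71 kgCO2e
```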
2:09pm - 2:30pm
How the “M” service contributes to reducing the carbon footprint Arquivo.pt, Portugal This presentation provides an overview of seven years of “M”, a service offered to the community since 2018 that allows organizations to shut down old websites while keeping their content accessible, thereby reducing their carbon footprint. Organizations create websites for a wide variety of purposes and sometimes end up maintaining dozens of small websites without updating them; universities, for example, create websites dedicated to events, conferences, and research projects. What to do? Shut down the websites and lose interesting information? This is where the “M” service comes in. We consider the service from three perspectives: 1) how it works; 2) how it adds value to organizations; 3) community involvement. We conclude by outlining the next steps for expanding the service. 1) How it works. The “M” service consists of redirecting a domain to a historical version preserved in the “Web Archive” (a minimal redirect sketch follows this abstract). The workflow begins with a request from the organization that owns the website. The “Web Archive” makes a high-quality recording of the website; the website owner only has to maintain the domain and redirect it. The “Web Archive”, in turn, generates an SSL certificate and provides access to the archived content, and a landing page informs users that this is a historical version. The process involves collaboration between the “Web Archive” team and staff from the entities that have joined the “M” service. 2) How it adds value to organizations. In communicating the service to the community (external advocacy), we highlight the value of “M” in terms of energy savings, CO2 reduction, and therefore a smaller carbon footprint. A second value of the service, important to IT teams, is that it helps eliminate security flaws: websites that are not updated become targets for attacks. Instead of deleting websites with content that is useful to the community, IT teams and decision-makers can use the web archive to keep providing access to that content. 3) Community involvement. By 2025, the “M” service had reached approximately 284 websites from 26 institutions. Over the years, 50 websites were removed due to domain maintenance issues or lapsed collaborations. Processes have been improved and the service is poised for growth; for example, SSL certificate generation has been automated. External advocacy remains a priority, as the preservation of websites in web archive format is not widely known. The next step in expanding the “M” service is to use the same workflow and structure to provide a rapid-response service in the event of cyber attacks on the websites of important organizations, such as universities: the “Web Archive” must be prepared to provide the latest archived version to such an entity. We believe that redirecting to the “Web Archive”, as the “M” service does, is an important contribution to disaster recovery processes. The presentation concludes with the “Web Archive”'s vision of creating services for the community. It is essential to offer services that 1) demonstrate the usefulness of web archives to organizations and 2) show their contribution to sustainability goals. |
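The redirect mechanism at the heart of the “M” service can be pictured with a very small sketch: the retired domain is pointed at a redirector that forwards every request to the archived snapshot. This is a minimal illustration, assuming Arquivo.pt's wayback-style replay URLs; the domain, snapshot timestamp, and port are hypothetical.

```python
# Sketch: forward all traffic for a retired site to its archived version.
from http.server import BaseHTTPRequestHandler, HTTPServer

ARCHIVE_PREFIX = "https://arquivo.pt/wayback/20180101000000/"  # illustrative snapshot
OLD_SITE = "http://conference2018.example.org"                 # hypothetical retired site

class MRedirect(BaseHTTPRequestHandler):
    def do_GET(self):
        # Permanent redirect to the archived copy; the archive's own
        # landing banner identifies it as a historical version.
        self.send_response(301)
        self.send_header("Location", ARCHIVE_PREFIX + OLD_SITE + self.path)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), MRedirect).serve_forever()
```

In production the same effect would typically be achieved with a web-server rewrite rule rather than a standalone process; the sketch simply makes the redirection logic explicit.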
| 1:25pm - 2:30pm | COLLABORATION & OUTREACH Location: CONCERT [+4] |
|
|
1:25pm - 1:47pm
A web archiving training program for Latin America 1Universidad Nacional Autónoma de México, Mexico; 2Webrecorder, United States of America Web preservation is a contemporary practice that began this century. Like many practices, promoting and supporting web archiving has been challenging given limited time and resources. However, the urgency and ephemeral nature of online content have made increasingly clear the gap between countries that have adopted web archiving initiatives and those still unaware of its importance, highlighting the pressing need for action. In Latin America, web archiving is an archival technique that has rarely been applied. Formal web archiving projects are known to exist in Chile and Mexico, though communities and other organizations have also made significant contributions. Many have attempted to archive the web using the limited support, resources, and documentation available both from within the web archiving community and from their own local contexts. For this reason, a Spanish-language web archiving training program is being developed within the Library and Information Research Institute (IIBI) of the National Autonomous University of Mexico (UNAM), with the goal of preparing new generations of archivists who can identify, preserve, and provide access to web pages of social, archival, and political value. The program is first being developed internally within IIBI to evaluate workflows, logic, and vocabulary, with the goal of then expanding and disseminating these resources as part of the cultural heritage of our countries and communities. This presentation proposes a training and professional development program designed as a collaboration between UNAM, a public university in Mexico, and the developers of open-source web archiving tools. As the program takes shape, we invite the broader web archiving community to join the conversation and share how they would have liked to begin their own journeys, offering input that can help shape a more accessible and impactful initiative.
1:47pm - 2:09pm
Modeling CARE: Sustainable web archiving across languages 1Indiana University, United States of America; 2ESRI This presentation describes a collaborative web archiving project funded by the Mellon Foundation (2020-present) and the National Endowment for the Humanities (2023-2025). It employs the decolonial practice of post-custodial archiving to record the stories of mutual aid organizations and individuals responding to disasters that have impacted Puerto Rico in recent years, including hurricanes, earthquakes, and COVID-19. Over the course of two weeks in September 2017, Puerto Rico was struck by a category 5 and a category 4 hurricane, Hurricanes Irma and María. The disaster, however, was not simply the hurricanes but also the events that followed. Notably, the disaster-response methods used (prioritization of urban centers, slow distribution of resources, and strains on infrastructure) placed Puerto Rico under duress by leaving most people to fend for themselves. As a result, Puerto Ricans' survival largely depended upon community-based groups and their use of local traditions, oral knowledge, and community organizing. Our team works with these community organizations to preserve and archive their stories. We are committed to decolonial web archiving practices that build reciprocal relationships with and for our communities. Linda Tuhiwai Smith asserts that “the intellectual project of decolonizing has to set out ways to proceed through a colonizing world. It needs a radical compassion that reaches out, that seeks collaboration, and that is open to possibilities that can only be imagined as other things fall into place.” For our team, decolonial praxis means “rejecting extractive forms of knowledge acquisition by relegating authority and control of collection processes, material selection, and dissemination strategies to the participating community organizations.” This approach includes adherence to the CARE principles (collective benefit, authority to control, responsibility, and ethics). One way we live out these values is through participatory, user-centric design of not only the project's collections but also the platform in which they are housed. In particular, AREPR has developed a multilingual, open-source Omeka S theme that is freely available for other groups to use. This theme, and the corresponding Omeka S modules we produced, simplifies the process of developing and sustaining multilingual projects by providing free, easy-to-use tools for displaying archival materials across languages. Using these software extensions, we built a collection of over 800 bilingual disaster-response artifacts and oral histories, and we work with our community partners to sustain the tools through training, documentation, and knowledge transfer. Offering a case study in how to use and sustain these software extensions, the presentation will demonstrate how CARE approaches to archiving and tool development enable collaborative and mutually beneficial knowledge production, and it will draw attention to how web archiving practices can be reimagined to create new opportunities for community engagement and sustainable praxis.
2:09pm - 2:30pm
Approaches towards archiving digital Islam University of Edinburgh, United Kingdom This presentation reports on the experience of the "Digital Islam Across Europe" project, in which digital archiving constitutes a core methodological component. It explores how data were selected, archived, and visualised by teams of academic specialists. Although the team possessed technical competence and a general awareness of computer technologies, none of its members had specific expertise in digital archiving. The presentation will therefore illustrate the team's experiences, including processes of experimentation and trial and error. The project's focus on archiving prompted a steep and ongoing learning curve aimed at developing sustainable, narrated, and open-source digital archives that capture multiple dimensions of digital Muslim expression. Sustainability and accessibility have been integrated into the project's design through the use of tools provided by Archive-It and ARCH. In doing so, the team seeks to establish good practices that are transferable to other disciplines and to encourage similar projects based on the methodological frameworks developed through this work. The initiative represents one of the earliest systematic efforts to archive religious expression, identities, and related issues, specifically those associated with Muslim communities and Islam, through multidisciplinary and interdisciplinary approaches informed by the diverse expertise of the participating teams. "Digital Islam Across Europe: Understanding Muslims' Participation in Online Islamic Environments" (DigitIslam) examines the social and religious impact of Online Islamic Environments (OIEs) on Europe's diverse Muslim communities. The project is funded by the Collaboration of Humanities and Social Sciences in Europe (CHANSE) and involves research teams working across five European countries (the United Kingdom, Poland, Sweden, Spain, and Lithuania), with the University of Edinburgh serving as the lead institution. The archives draw on specific contextual Muslim interests reflecting national concerns within the partner countries, while also highlighting transnational networks and shared themes. Each country team contributed subject-specific expertise, particularly in the development of metadata. Content was translated into the respective partner languages, which required refinements to the archiving tools. Although DigitIslam's archives remain under development, they already constitute a significant research resource at a critical juncture in the study of European Muslim life and digital engagement. Online: https://blogs.ed.ac.uk/digitalislameurope/ X: @digitislam Bluesky: @digitislam.bsky.social Facebook: Chanse DigitIslam |
| 2:30pm - 3:00pm | BREAK Location: GALERIE [-2] & PANORAMA FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6) and in Galerie (Floor -2, next to the Auditorium). |
| 3:00pm - 4:30pm | CLOSING KEYNOTE PANEL Location: AUDITORIUM [-2] |
| 4:30pm - 4:45pm | CLOSING REMARKS Location: AUDITORIUM [-2] |
| 4:45pm - 6:00pm | CLOSING RECEPTION Location: ROTONDE Drinks and nibbles will be served in Rotonde. Volunteers will guide you to floor +3 from where the historical part of the Library can be accessed. |
| Date: Thursday, 23/Apr/2026 | |
| 9:00am - 9:30am | MORNING COFFEE Location: PANORAMA: FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6). |
| 9:30am - 9:40am | GENERAL ASSEMBLY: OPENING REMARKS Location: PANORAMA [+6] |
| 9:40am - 9:50am | CHAIR ADDRESS Location: PANORAMA [+6] |
| 9:50am - 10:30am | TBC Location: PANORAMA [+6] |
| 10:30am - 11:00am | COFFEE BREAK Location: PANORAMA: FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6). |
| 11:00am - 12:00pm | TOOLS Location: PANORAMA [+6] |
| 11:00am - 12:00pm | TBC Location: STUDIO [+6] |
| 11:00am - 12:00pm | CONTENT DEVELOPMENT WORKING GROUP Location: AQUARIUM [+2] |
| 12:00pm - 1:00pm | LUNCH Location: PANORAMA: FOYER [+6] 🍴 Lunch will be served in Panorama Foyer (Floor +6). |
| 1:00pm - 3:00pm | TOOLS Location: PANORAMA [+6] |
| 1:00pm - 3:00pm | TRAINING WORKING GROUP Location: STUDIO [+6] |
| 1:00pm - 3:00pm | RESEARCH WORKING GROUP Location: ATELIER [+2] |
| 1:00pm - 3:00pm | CONTENT DEVELOPMENT WORKING GROUP Location: AQUARIUM [+2] |
| 3:00pm - 3:30pm | COFFEE BREAK Location: PANORAMA: FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6). |
| 3:30pm - 4:00pm | GENERAL ASSEMBLY: CLOSING SESSION Location: PANORAMA [+6] |
