Conference Agenda
Overview and details of the sessions of this conference. Please select a date or location to show only the sessions held on that day or at that location. Please select a single session for a detailed view.
Session Overview |
| Date: Monday, 20/Apr/2026 | |
| 9:30am - 10:00am | REGISTRATION: BELGICAWEB & WORKSHOPS AND COFFEE Location: PANORAMA: FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6). |
| 9:30am - 10:00am | EARLY SCHOLARS SPRING SCHOOL ON WEB ARCHIVES [PART 1] Location: KRANTEN / JOURNAUX [0] Registration for this event begins at 9:00, followed by an icebreaker. |
| 10:00am - 11:00am | BELGICAWEB SYMPOSIUM Location: PANORAMA [+6] |
Digital Heritage for the Future - the BelgicaWeb Story 1KBR, Royal Library of Belgium; 2Ghent University; 3University of Namur This presentation will showcase the concluding results of BelgicaWeb (2024-2026), an innovative research project led by the Royal Library of Belgium (KBR) and funded by BELSPO. The project aimed to preserve and provide access to Belgium’s born-digital heritage through a multilingual, user-friendly platform and an API, ensuring FAIR principles (Findable, Accessible, Interoperable, Reusable). Key achievements include sustainable access strategies, robust data infrastructure, metadata enrichment, and legal framework analysis. The session will highlight how BelgicaWeb promotes Belgium’s digital heritage. The project brought together a consortium of partners with diverse expertise: CRIDS at UNamur for legal issues and IDLab, GhentCDH, and MICT at Ghent University for data enrichment, user engagement, and outreach. KBR coordinates the project, focusing on platform and API development and data enrichment. |
| 10:00am - 11:00am | - Location: KRANTEN / JOURNAUX [0] |
| 11:00am - 11:30am | BREAK Location: PANORAMA: FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6) |
| 11:30am - 12:30pm | BELGICAWEB SYMPOSIUM PANEL: USER NEEDS Location: PANORAMA [+6] |
Users first: (Re)designing Web Archives around real Needs 1Ghent University; 2Bibliothèque nationale de France; 3National Library of Norway; 4University of Edinburgh; 5UK National Archives As web archives make the transition from niche repositories to essential infrastructure for digital history, one critical disconnect remains: the gap between technical capture and scholarly utility. While archivists battle with dynamic content and platform APIs, web archives too often remain a ‘black box’ for researchers, one that does not meet their methodological requirements. This panel addresses the central research question: “What are user requirements for web and social media archives?”. Based on recent empirical work such as survey data, workshop results, and exploratory user testing, this session discusses the needs, expectations, and practices of diverse user groups such as researchers, heritage professionals, journalists, and policy analysts. Moving beyond a ‘capture-first’ mentality to a ‘use-centric’ approach, this panel will analyse key areas of conflict and friction. This session will be in an interactive format, inviting the panel members and audience to vote on a series of statements. |
| 11:30am - 12:30pm | BELGICAWEB SYMPOSIUM PANEL: PROGRAMS Location: CONCERT [+4] |
From pilot to program: cultivating institutional web archiving practices for sustainability 1IIPC, United States of America; 2Library of Congress, United States; 3National Library of Australia, Australia; 4British Library, United Kingdom Transitioning an archiving pilot into a resilient, long-term program is a perennial challenge in the web archiving field. Moving beyond initial proof-of-concept and pilot projects requires strategic investment in technical infrastructure and human capital, as well as secure funding. Featuring experts from libraries with more than 20 years of experience running web archiving programs and collaborating internationally, this panel explores roadmaps for sustainable growth. Discussion will focus on the practical challenges and solutions regarding long-term staffing, infrastructure, and collection management practices that move a program from "temporary project" to "enduring program". |
| 11:30am - 12:30pm | BELGICAWEB SYMPOSIUM PANEL: AI Location: ATELIER [+2] |
AI in web and social media archives 1KBR - Royal Library of Belgium, Belgium; 2Internet Archive; 3Bibliothèque nationale de France; 4Aarhus University This panel brings together experts to discuss the practical use of AI in web and social media archives, from leveraging machine learning for the curation, preservation, and discovery of massive, ephemeral datasets, to GenAI for supporting users in navigating the archives. The discussion will not focus solely on efficiency and the tools used, but also on the archival conversation around stewardship and ethics, because with these great affordances come the challenges of ensuring data privacy, managing biases, and establishing transparency and provenance for AI-generated (meta)data. As AI becomes integral to archival practice, how do we balance innovation with accountability? |
| 11:30am - 12:30pm | BELGICAWEB SYMPOSIUM PANEL: LEGAL Location: AQUARIUM [+2] |
The legal challenges of web archiving and open collections 1University of Namur; 2Sciences Po Paris Law School; 3Vrije Universiteit Amsterdam; 4Reprobel The panel will provide an opportunity to address topics that are particularly critical in the field of web archiving, such as considerations related to copyright (exceptions, Text and Data Mining practices, extended collective licensing), to open data, and to the challenges of data enrichment, including AI-driven enrichment. |
| 11:30am - 12:30pm | SPRING SCHOOL [PART 2] Location: KRANTEN / JOURNAUX [0] |
| 12:30pm - 1:30pm | LUNCH Location: PANORAMA: FOYER [+6] 🍴 Lunch will be served in Panorama Foyer (Floor +6).
🏛️ GUIDED TOUR KBR: If you signed up for a guided tour of KBR, please be in front of the three main elevators on Floor -2 at 12:30 [1st tour] or 13:00 [2nd tour]. To know if you signed up for a tour, check your registration details in ConfTool. |
| 1:30pm - 3:00pm | WORKSHOP: BELGICAWEB [PART 1] Location: PANORAMA [+6] |
Web Archiving, with a little help from my LLM friends. 1Ghent University, Belgium; 2KBR Royal Library of Belgium; 3Université de Namur, Belgium BelgicaWeb is a two-year BRAIN 2.0 project funded by BELSPO that aims to safeguard and promote Belgium’s born-digital heritage by making it FAIR—Findable, Accessible, Interoperable, and Reusable. Through the development of a user-friendly access platform and API, the project addresses sustainable access, data enrichment using technologies such as Linked Data and NLP, and legal frameworks around data sharing, AI, and privacy. It brings together experts from KBR, Ghent University, and the University of Namur, and actively engages users to shape its design and functionality. In this tutorial, we will demonstrate a complete web archiving pipeline and show how it can be augmented through AI-based methods, mainly large language models (LLMs), to extend existing workflows. Participants will first see how our current BelgicaWeb pipeline automatically creates and replays web archives using SolrWayback. The resulting WARC files will be processed with an LLM-based data cleaning pipeline that turns the raw data into structured Linked Data. The same raw data can also be explored using retrieval-augmented generation (RAG) to make the mapping process more interactive. In this way, we demonstrate that data exploration can be carried out through multiple complementary approaches (Linked Data, full-text search, and RAG). The session will conclude with a discussion on the legal and ethical dimensions of applying AI in web archiving, including GDPR and compliance with EU AI regulations. Three Hands-on Sessions The tutorial consists of three parts, each addressing a different stage in the BelgicaWeb workflow: (i) data harvesting; (ii) AI-based cleaning, enrichment, and exploration; (iii) legal reflection. Participants can follow along in (partially) pre-filled Colab notebooks or just observe the demonstrations; the pre-filled notebooks and data snapshots will keep everyone in sync and guarantee the progress of the workshop.
The first session focuses on the practical aspects of harvesting web data. Participants will explore the automated pipeline (using Heritrix) that generates WARC files from a defined set of seeds and replays them in SolrWayback. An explanatory diagram of the harvesting pipeline will be shared with the audience. The session will conclude by demonstrating how structured metadata (which will also be generated in the next session) can be re-integrated into SolrWayback to enhance search and browsing.
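To make this pipeline stage concrete, here is a minimal, illustrative sketch of the kind of WARC processing involved, using warcio and BeautifulSoup. The file name is a placeholder and the output format is an assumption for the example, not the BelgicaWeb code itself.

```python
# Illustrative sketch (not the BelgicaWeb pipeline): iterate over a WARC file,
# extract plain text from HTML responses, and emit simple records that a
# Solr-based index such as SolrWayback's could ingest.
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def extract_text_records(warc_path):
    """Yield (url, timestamp, text) tuples for HTML response records."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            timestamp = record.rec_headers.get_header("WARC-Date")
            html = record.content_stream().read()
            text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
            yield url, timestamp, text

if __name__ == "__main__":
    for url, ts, text in extract_text_records("example.warc.gz"):  # placeholder file
        print(url, ts, text[:80])
```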
The second session introduces the concept of enhancing traditional web archiving workflows with LLMs. They will be introduced at two points in the workflow:
Finally, we will show how AI-based exploration and traditional SPARQL querying can be intertwined and used for complementary insights. Participants will use ready-to-use Jupyter notebooks in Google Colab, which are partially filled in, to lower the barrier of entry for less technical users. Each step will be guided, allowing participants to experiment safely with retrieval, vector databases, chunking, and summarization.
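As a rough illustration of how cleaned records can be expressed as Linked Data and then queried with SPARQL alongside AI-based exploration, here is a minimal sketch using rdflib; the namespace and properties are hypothetical, not the BelgicaWeb schema.

```python
# Illustrative sketch: express one cleaned record (e.g. produced by an
# LLM-based cleaning step) as RDF triples and query them with SPARQL.
# The namespace and properties are hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

EX = Namespace("http://example.org/belgicaweb/")  # hypothetical namespace

g = Graph()
page = URIRef("http://example.org/page/1")
g.add((page, RDF.type, EX.ArchivedPage))
g.add((page, DCTERMS.source, Literal("https://www.example.be/news/item")))
g.add((page, DCTERMS.title, Literal("Voorbeeldartikel", lang="nl")))
g.add((page, DCTERMS.date, Literal("2025-03-01")))

# A simple SPARQL query over the enriched graph.
results = g.query("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?page ?title WHERE { ?page dcterms:title ?title . }
""")
for row in results:
    print(row.page, row.title)
```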
The final session addresses the legal and ethical dimensions of AI-assisted web archiving. Questions that can arise are: Is what we did safe? What safety measures are required to ensure we remain within legal boundaries? This presentation will address more specifically the legal and ethical implementation issues relating to data protection, copyright, FAIR/CARE principles and the responsible use of AI in web archiving. This will take the form of a Q&A based on concrete cases that inspired our reflections. Participants will be invited to participate in the debate. Format The tutorial is designed as a 3-hour technical session with three modular components:
Target Audience This tutorial targets professionals in the field of web archiving, particularly developers, digital experts, and others interested in the technical aspects of web archiving and AI. Familiarity with Python and Linked Data will be very helpful, but as we provide a lot of code samples, less experienced participants will also benefit from this workshop. Participants may follow along actively in Colab or simply observe the demonstrations. We anticipate a group of 25–40 participants, allowing for interaction and guided support during the hands-on session. Expected Learning Outcomes By the end of the tutorial, participants will be able to:
Technical Requirements
Main Topic AI-enabled workflows for web archiving. Keywords Web archiving, SolrWayback, Large Language Models, Data Cleaning, Information Retrieval, Linked Data |
| 1:30pm - 3:00pm | WORKSHOP: SUSTAINABLE HARVESTING Location: STUDIO [+6] |
Web harvesting in an environmentally sustainable way 1Netherlands Institute for Sound and Vision, Netherlands, The; 2The National Archives (UK), United Kingdom; 3National Archives of the Netherlands, Netherlands; 4University of London, United Kingdom; 5Publications Office of the European Union As web harvesting grows in scale and frequency, so does its environmental impact. Crawlers use bandwidth and computing power, and all that harvested data takes energy to store and maintain. Web archiving plays an essential role in preserving our digital culture and supporting research, but it also leaves a considerable carbon footprint that cannot be ignored. As our reliance on digital preservation increases, finding ways to make these processes more efficient and environmentally responsible has become an important collective challenge. This workshop invites IIPC members and the wider community to talk about how we can make web harvesting more environmentally sustainable. From smarter crawling techniques to collaboration that cuts down on duplication, we’ll explore how the web archiving community can align its work with broader sustainability goals without compromising the quality and integrity of our web archives. Sustainability has become a growing priority for libraries, archives, and research institutions. As organizations move toward net-zero targets, web archiving programs should also start to examine their own energy use and storage practices more closely. This workshop responds to a pressing need to explore current sustainability practices and experiments and aims to identify opportunities to reduce energy use during crawling and storage. It offers a space to share what people are already trying, what’s working, and where we see opportunities to reduce our footprint, whether that’s through more efficient crawls, less redundant storage, or greener preservation strategies. The workshop will kick off with the authors revisiting the talks that were given on this topic during last year’s Web Archiving Conference and highlighting the developments that have taken place since then. Coming from different institutional and professional backgrounds, the authors will demonstrate how approaches to green web harvesting vary across contexts while also showing the value of sharing insights and experiences. After this introduction, the participants will form small breakout groups to discuss key aspects of sustainability in web harvesting. Topics include ideas for running crawlers more efficiently - like optimizing scope and timing - and strategies for storing and managing data. We’ll explore ways to collaborate across institutions to reduce overlap, and how to measure and report the environmental cost of our work. Additionally, we’ll consider the ethical and policy questions that come with balancing preservation goals and sustainability. By the end, this session aims to build a shared understanding of what “sustainable web harvesting” can look like in practice. Together, we will explore current approaches to making web harvesting more sustainable and discuss best practices. We’ll gather practical ideas and recommendations - technical, organizational, and policy-related - and use the notes and key takeaways from the session as input for a set of community guidelines on sustainable web archiving, to be shared post-event. We hope the discussion will inspire interest in forming a small working group or shared resource on green web archiving, helping the conversation continue beyond the conference.
Above all, the session will bring together people who care deeply about both preserving the web and environmental responsibility, fostering new collaborations and long-term awareness of sustainability within the web archiving community. |
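One concrete angle on the "less redundant storage" theme is payload-level deduplication, the idea that also underlies WARC revisit records. Below is a minimal, illustrative sketch (not code from the presenting institutions; file names are placeholders) of finding duplicate payloads across crawls.

```python
# Illustrative sketch: find duplicate payloads across WARC files by hashing
# response bodies, so that repeated captures could be stored once.
import hashlib
from collections import defaultdict
from warcio.archiveiterator import ArchiveIterator

def payload_digests(warc_path):
    """Yield (sha1_digest, url) for every response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            payload = record.content_stream().read()
            yield (hashlib.sha1(payload).hexdigest(),
                   record.rec_headers.get_header("WARC-Target-URI"))

seen = defaultdict(list)
for warc in ["crawl-2024.warc.gz", "crawl-2025.warc.gz"]:  # placeholder names
    for digest, url in payload_digests(warc):
        seen[digest].append(url)

duplicates = {d: urls for d, urls in seen.items() if len(urls) > 1}
print(f"{len(duplicates)} payloads were captured more than once")
```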
| 1:30pm - 3:00pm | WORKSHOP: SOLRWAYBACK [PART 1] Location: ATELIER [+2] |
Run your own full-stack SolrWayback, collaborate & unlock the potential of archived data 1Royal Danish Library, Denmark; 2The National Library of New Zealand; 3Aarhus University; 4National Library of Norway; 5National Library of Luxembourg An updated version of the '21, '23 and '24 IIPC WAC workshops “Run your own full stack SolrWayback”, with added use cases showing SolrWayback’s growing resilience, robustness and sustainability, how to contribute, and an open discussion to conclude the workshop. Background
As an open source software project, SolrWayback is growing. This can be seen in the diversity of the contributors on GitHub. Since the last IIPC workshop on SolrWayback in 2024, the software has moved in multiple directions. Among the contributions worth mentioning: the Memento protocol has been implemented for better interoperability between archives, the frontend framework has undergone a major rework and upgrade, and playback of old ARC files has been improved. The workshop consists of:
By the end of the workshop, participants will have a working installation of SolrWayback on their local computers and will have learned how to install and interact with the software. During the workshop, participants are also introduced to how SolrWayback can support collaborative work with the archived web as a source for children's history, as well as a more technical case on tracking pixels. These specific cases act as examples of research use, and through them participants will gain an understanding of how SolrWayback can be integrated into their research practices and support their exploration of the archived web. Attendees will also learn how to contribute ideas and code through GitHub, and the discussion at the end of the workshop will inspire attendees and leave them with ideas and impulses to act on, improve, and sustain SolrWayback. Prerequisites:
Support During the conference there will be focused support for SolrWayback in a dedicated Slack channel run by the facilitators of the workshop. Target audience Web archivists and researchers with intermediate knowledge of web archiving and of tools for exploring web archives. Basic technical knowledge of starting a program from the command line is required, as this is currently the only way to start the program. However, the SolrWayback bundle is designed for easy deployment, so terminal interaction will be kept to a minimum. Coordinator(s)/facilitator(s) All five authors, plus other attendees who might chime in. |
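For attendees who want a feel for the kind of interaction the workshop covers, here is a minimal sketch of querying a Solr index of archived web content through Solr's standard select API; the core name and field names are assumptions for the example, not necessarily SolrWayback's actual schema.

```python
# Illustrative sketch: full-text query against a local Solr core of archived
# web content using Solr's standard select endpoint. Core and field names
# are hypothetical.
import requests

SOLR_SELECT = "http://localhost:8983/solr/netarchivebuilder/select"  # assumed core name

params = {
    "q": 'content:"tracking pixel"',   # full-text query over page content
    "rows": 10,
    "fl": "url,crawl_date",            # fields to return (assumed names)
    "wt": "json",
}
response = requests.get(SOLR_SELECT, params=params, timeout=30)
response.raise_for_status()

for doc in response.json()["response"]["docs"]:
    print(doc.get("crawl_date"), doc.get("url"))
```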
| 1:30pm - 3:00pm | WORKSHOP: GLAM LABS & JUPYTER NOTEBOOKS [PART 1] Location: AQUARIUM [+2] |
Explore the use of the GLAM Labs Checklist, Datasheets, and Jupyter Notebooks for digitized and born-digital collections 1British Library; 2University of Alicante, Spain; 3National Library of Norway; 4International Internet Preservation Consortium; 5National Library of New Zealand There are often significant barriers to accessing and using collections as data in the GLAM sector, which typically demand technical expertise and suitable IT infrastructure. Although training in digital research skills is becoming more widespread, GLAM institutions still face the challenge of determining how best to provide access to their digital collections in ways that encourage the use of these skills. Jupyter Notebooks are an increasingly popular form of hybrid tooling that combines data and code to make digital collections more accessible, particularly for less technical users. GLAM institutions have started to employ Jupyter Notebooks as a new approach to demonstrate how users can access and experiment with datasets derived from their collections [1]. Projects like the GLAM Workbench [2] illustrate their utility across various types of collections, including both digitized collections and web archives. They offer interactive and reproducible environments [3] for exploring and analyzing collections of data. This workshop will help participants explore digitised and born-digital collections using reproducible code and Jupyter Notebooks. These collections will be placed in the context of “datasheets for datasets,” which provide structured documentation about how a dataset was created. Notebooks and datasheets are two key steps in the “Checklist to Publish Collections as Data in GLAM Institutions” (glamlabs.io/checklist). Expert facilitators will help users explore the possibilities of Notebooks, focusing on three areas: 1) working on one specific topic using data from digitised and born-digital collections (e.g. news), 2) using and creating reproducible notebooks, and 3) understanding existing infrastructures, cloud services, and workflows for publishing computationally ready datasets. Use cases and discussion will also address preservation challenges and future reuse of notebooks and datasheets. Format The workshop will begin with short presentations on the GLAM Labs Checklist, datasheets for datasets, and the framework for creating a collection of Jupyter Notebooks [3]. These will include examples based on digitized and born-digital collections, and guidance on how to get started using a Jupyter Notebook. The main part of the workshop will involve participants using and exploring the datasets with one or more of the available Jupyter Notebooks. Data research infrastructures and cloud services to run Jupyter Notebooks will be presented. The session will wrap up with a discussion on the preservation challenges of the notebooks and datasheets. Learning Outcomes The workshop aims to provide the following outcomes:
References
Acknowledgments This workshop builds on the work of the GLAM Labs community and the Web Archives as Data workshops delivered at various conferences, most recently at the Digital Humanities in the Nordic and Baltic Countries (DHNB) 2025 Conference in Reykjavík and the Web Archiving Conference (WAC) in Oslo. |
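As a small illustration of the notebook-based exploration this workshop describes, a cell like the following could summarise a derived dataset; the file name and column names are hypothetical, not an actual workshop dataset.

```python
# Illustrative notebook-style cell (not from the GLAM Workbench): load a
# hypothetical derived dataset of archived news pages and summarise it.
import pandas as pd

df = pd.read_csv("news_collection_derivative.csv")  # hypothetical dataset
print(df.head())

# How many captures per year and per language?
summary = (
    df.groupby(["year", "language"])       # assumed column names
      .size()
      .reset_index(name="captures")
      .sort_values("captures", ascending=False)
)
print(summary.head(10))
```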
| 1:30pm - 3:00pm | SPRING SCHOOL [PART 3] Location: KRANTEN / JOURNAUX [0] |
| 3:00pm - 3:30pm | BREAK Location: PANORAMA: FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6). |
| 3:30pm - 5:00pm | WORKSHOP: BELGICAWEB [PART 2] Location: PANORAMA [+6] |
| 3:30pm - 5:00pm | WORKSHOP: AI4LAM Location: STUDIO [+6] |
From problem to practice: a collaborative use case workshop on AI-driven management and reuse of web archive content 1AI4LAM; 2IIPC This workshop is designed to explore the transformative potential of artificial intelligence in managing and reusing internet cultural heritage content preserved in web archives of IIPC institutions. As digital heritage grows exponentially, institutions face mounting challenges in accessing, organizing, and repurposing archived web data. Participants will engage with cutting-edge AI tools to develop innovative solutions for enhancing discoverability and enabling creative reuse of archived web content. The event invites developers, researchers, archivists, and digital humanists to collaborate on prototypes that address real-world needs: from semantic enrichment and automated classification to visualization, summarization, and cross-archive interoperability. By bridging technical innovation with cultural preservation, this workshop aims to unlock new pathways for engaging with the web’s historical layers and ensuring their relevance for future generations. It is developed in close cooperation between the IIPC and AI4LAM teams to ensure optimal planning, preparation of web content, and effective outreach. This collaboration will help align technical capabilities with community needs and maximize the impact of the event. Purpose This workshop brings together participants who want to explore, challenge, and strengthen their real‑world use cases while examining collaborative efforts between IIPC and AI4LAM to better provide access and reuse, and to address ethical issues in the use of AI on harvested content. The focus is on thoughtful discussion, critical debate, and collaborative refinement to surface high‑quality insights that will contribute to strategic pathways after the event. Format Participants are invited to bring their real-life use cases to examine — whether emerging, partially formed, or already in practice. Through guided sessions and structured debate, each use case will serve as a basis for mapping further areas of work. The workshop emphasizes clarity, feasibility, impact, and alignment with broader strategic or technological themes of IIPC and AI4LAM. Activities
Max number of participants: Up to 30 persons. Technical requirements: Participants are encouraged to bring their real‑world use cases. Expected outcomes: white paper/recommendations/draft strategy for further work on the subject. |
| 3:30pm - 5:00pm | WORKSHOP: SOLRWAYBACK [PART 2] Location: ATELIER [+2] |
| 3:30pm - 5:00pm | WORKSHOP: GLAM LABS & JUPYTER NOTEBOOKS [PART 2] Location: AQUARIUM [+2] |
| 3:30pm - 5:00pm | SPRING SCHOOL [PART 4] Location: KRANTEN / JOURNAUX [0] |
| 5:00pm - 5:30pm | - |
| 5:00pm - 5:30pm | - |
| 5:00pm - 5:30pm | - |
| 5:00pm - 5:30pm | - |
| 5:00pm - 5:30pm | SPRING SCHOOL [WRAP-UP] Location: KRANTEN / JOURNAUX [0] |
| Date: Tuesday, 21/Apr/2026 | |
| 8:30am - 9:15am | REGISTRATION AND COFFEE Location: AUDITORIUM [-2] ☕️🥐 Drinks and snacks will be served in Galerie (Floor -2, next to Auditorium). |
| 9:15am - 9:30am | OPENING REMARKS Location: AUDITORIUM [-2] |
| 9:30am - 11:00am | OPENING KEYNOTE PANEL Location: AUDITORIUM [-2] |
| 11:00am - 11:30am | BREAK Location: GALERIE [-2] ☕️🥐 Drinks and snacks will be served in Galerie (Floor -2, next to the Auditorium).
🏛️ GUIDED TOUR KBR: If you signed up for a guided tour of KBR, please be in front of the three main elevators on Floor -2 at 11:00. To know if you signed up for a tour, check your registration details in ConfTool. |
| 11:30am - 12:35pm | POSITIVE + NEGATIVE IMPACT OF AI Location: AUDITORIUM [-2] |
11:30am - 11:52am
Why ask WAAI: A sustainable approach to exploring web archiving artificial intelligence (WAAI) Internet Archive, United States of America Beyond the media hype, financial bubble, and general social freakout over Artificial Intelligence (AI), the emergence of machine learning (ML) and AI technologies merit impartial consideration of the potential for these innovations to benefit many aspects of the overall web archiving endeavour. Much as digitization and the internet itself radically changed how libraries and heritage institutions approached professional practices like acquisition and access, ML/AI may have the potential to address longstanding challenges in web archiving related to harvesting, collection management, and search and discovery. Of course ML/AI tools could also prove too immature, too unreliable, too expensive, or too unwieldy to provide a suitable return on investment for web archive collections that can measure in the hundreds of terabytes, if not petabytes. Thus, ML/AI explorations in web archives need a different methodology of research, testing, and assessment than more traditional, more narrowly focused technologies specific only to certain areas of web archiving practice or infrastructure. This talk will approach the challenge of incorporating ML/AI tools in web archives from a “why ask why” perspective, emphasizing small, low-stakes, and well scoped experimentation across all aspects of the web archiving lifecycle instead of rigorously planned, ambitiously conceived, large scale projects or more formal and ornate methods of research and development. The presentation will thus lay out a general framework for advancing AI-based work in web archiving based on practical examples, use cases, and findings from pursuing such an approach within a large web archiving institution that has been conducting internal AI projects on multiple parts of its web archiving processes. The talk will cover both managerial and practical aspects of exploring ML/AI for web archiving, such as staffing, infrastructure, tools, costs, program/product development, and engineering practices, and will link these with specific completed or in-progress work on leveraging ML/AI tools for various areas of web archiving, such as appraisal, collection, description, quality assurance, and search. By bridging practical details and results with specific areas of professional practice and wrapping both in a framework that emphasizes experimentation and action over procedural, policy, or administrative plodding, the talk hopes to advocate for a “sustainable” approach to exploring ML/AI in web archiving that proves doable, cost-effective, and user-driven. This presentation will propose a method, detail results from implementing that method in a large web archiving organization, and share results and findings intended to help other web archiving institutions pursue ML/AI work that will be sustainable, productive, and successful. 11:52am - 12:14pm
Understanding and mitigating anti-bot technologies' impact on archival web crawling 1MirrorWeb Limited, United Kingdom; 2Library of Congress, United States of America The proliferation of AI bot prevention technologies has created an unprecedented challenge for institutional web archiving programs. Website owners, administrators, and hosting providers—particularly those serving large organisations and government entities—have implemented increasingly aggressive safeguards to protect against AI agents harvesting training data. While well-intentioned, these measures inadvertently block legitimate preservation crawlers, threatening the completeness and quality of web archive collections. This research addresses a critical gap in understanding how anti-bot technologies affect large-scale web archiving operations. Even when securing appropriate crawling permissions per institutional policies, standard preservation tools like Heritrix are increasingly mistaken for malicious bots or AI scrapers, resulting in blocked access to nominated content. While quality assurance teams have documented this issue on individual seeds and domains, no comprehensive analysis of its scale and impact has been conducted. Our investigation analyses data from institutional crawling operations, and aims to enable systematic identification of blocking patterns, affected content types, and the scope of collection gaps caused by anti-bot technologies. This work extends existing guidance (such as robots.txt configuration advice) to address the complex landscape of modern bot prevention technologies. By documenting the real-world impact of these systems on institutional collecting and developing evidence-based mitigation strategies, this presentation is intended to aid web archiving programs maintaining collection quality while minimising resource-intensive manual interventions with individual website owners. The findings will aim to inform both technical approaches to crawling at scale and strategic communication with the broader web archiving community, website creators, and technology providers. Ultimately, this research aims to bridge the gap between legitimate preservation activities and necessary web security measures, ensuring cultural heritage institutions can fulfil their missions in an increasingly bot-hostile web environment. 12:14pm - 12:35pm
AI-powered search to sustain IIPC conference knowledge 1Bibliotheca Alexandrina, Egypt; 2Alamein International University; 3Egypt-Japan University of Science and Technology The IIPC Web Archiving Conference often receives high ratings in surveys from the community for being recognized as a platform for sharing knowledge and experience among web archiving practitioners and researchers. The output from this annual event is kept and made accessible via an online repository, courtesy of the University of North Texas. With today's advancement in Artificial Intelligence (AI) technology, an opportunity presents itself to render the wealth of information stored within the IIPC's repository of conference materials into more accessible knowledge. The IIPC Assistant supports the sustainable preservation and accessibility of the International Internet Preservation Consortium (IIPC) conference materials through an AI-powered search frontend that enables natural-language exploration of conference contributions archived in the online repository. By integrating vector embeddings with generative AI, the system delivers contextually accurate answers grounded in content that has been through a review process and was presented at the conference, contributing to the long-term usability and enhanced accessibility of the material that periodically documents the work done in the area of web archiving. The project began with metadata harvesting via the OAI-PMH API to consolidate creators, titles, subjects, and textual content from IIPC presentations and transcripts into a unified dataset. Because the materials were not designed for interactive querying, a Retrieval-Augmented Generation (RAG) approach was adopted to enable dynamic, source-grounded responses without retraining large models, an approach that promotes computational efficiency and sustainable reuse of existing data. Challenges in data consistency and semantic coherence were addressed by employing generative AI through the Gemini API to restructure fragmented text and enhance contextual quality. The retrieval pipeline was further refined to group and rank documents based on relevance, ensuring balanced coverage and interpretability. Built with a React + TypeScript frontend, Flask backend, and FAISS vector database, the implementation emphasizes scalability and efficiency. By advancing sustainable methods for information retrieval, the IIPC Assistant demonstrates how an AI-powered access interface can broaden the potential of a repository of valuable content accumulated over the history of the organization, thus transforming static collections into an interactive, reusable knowledge resource that supports ongoing research and global collaboration in the domain of web archiving. |
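As a rough sketch of the retrieval step in a RAG pipeline of the kind the IIPC Assistant describes (illustrative only; the embedding model and example texts are assumptions, not the project's actual implementation):

```python
# Illustrative sketch: embed text chunks, index them in FAISS, and retrieve
# the most relevant chunks for a question, the core retrieval step of RAG.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model would do

chunks = [
    "Talk on browser-based crawling with Browsertrix at WAC 2024.",
    "Workshop on SolrWayback full-text search over WARC collections.",
    "Panel on legal aspects of web archiving and text and data mining.",
]
embeddings = model.encode(chunks, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product = cosine here
index.add(embeddings)

query = model.encode(["Which sessions covered full-text search?"],
                     normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {chunks[i]}")
```

The retrieved chunks would then be passed, together with the question, to a generative model to produce a source-grounded answer.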
| 11:30am - 12:35pm | SHORT TALKS Location: PANORAMA [+6] |
11:30am - 11:41am
A Toolbox to foster Web Archives Use and Reuse National Library of France, France Web Archives represent an immense reservoir of data, with diverse and evolving possibilities for use and reuse that will undoubtedly continue to grow in the coming decades. As a national library, over the past 10 years we have faced a wide variety of requests, particularly for extracting, recovering, and replaying web-archived materials for research, institutional, and personal use. All these requests have enabled us to develop a range of services and a set of tools. We will focus on three real-life use cases and the technical solutions we have developed to answer the needs of:
Our presentation will cover how, starting from specific user needs and questions, we have progressively developed and consolidated a generic and sustainable set of tools, integrated into a toolbox, to extract and transform archived data and websites into various formats such as metadata, HTML, text, and images, and various outputs such as file lists, tree structures, or derivative WARC files. 11:41am - 11:49am
Constructing and sharing historical web link graphs from web archives Arquivo.pt, Portugal At our organisation, we have been developing a new text search platform based on Apache Solr to replace our legacy system, which depends on outdated and unsupported technologies. As part of this major upgrade, we undertook the task of reindexing all archived collections to align with the new, more flexible indexing schema. This large-scale reindexing effort provided us with a unique opportunity: the chance to extract additional insights from our historical web data. In particular, we focused on capturing link relationships between webpages. From this process, we generated and published a dataset of web link graphs that document the structure of hyperlinks across a significant portion of the web as preserved by our web archive. The published dataset contains information on over 139 million webpage URLs and the collections chosen for this dataset range from 1996 to 2021, allowing researchers to study the evolution of webgraphs over time. This type of data can be particularly valuable for researchers in areas such as web science, digital preservation, search engine technology, and network analysis. Furthermore, the code used to generate this dataset has been made publicly available. This allows others to apply the same approach to their own web archives and produce comparable link graph datasets from their WARC files. We believe this makes our work a reusable and extensible contribution to the web archiving and research communities.
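As a rough illustration of the kind of link extraction such a dataset involves (a sketch only, not Arquivo.pt's published code; the file name is a placeholder):

```python
# Illustrative sketch: build a simple (source, target) hyperlink edge list
# from the HTML response records in a WARC file.
from urllib.parse import urljoin
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

edges = []
with open("collection-1996.warc.gz", "rb") as stream:  # placeholder file
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        content_type = record.http_headers.get_header("Content-Type") or ""
        if "text/html" not in content_type:
            continue
        source = record.rec_headers.get_header("WARC-Target-URI")
        soup = BeautifulSoup(record.content_stream().read(), "html.parser")
        for anchor in soup.find_all("a", href=True):
            edges.append((source, urljoin(source, anchor["href"])))

print(f"{len(edges)} hyperlink edges extracted")
```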
In this lightning talk we aim to provide an overview of how the dataset was created and the structure and format of the data itself. 11:49am - 11:57am
Lossy and porous archives: Sustainability and collaborative models of LAC and the Internet Archive University of Copenhagen, Denmark As of 2024, Library and Archives Canada (LAC) and the Internet Archive (IA) have partnered to digitize and scan up to 80,000 out-of-copyright Canadian publications. Six “Scribes” workstations created by the Internet Archive were installed in LAC’s Gatineau facility, run by LAC staff (Library and Archives Canada, 2025; Internet Archive Canada, 2024). This co-created project is a reflection of the porous boundaries of democratic digital knowledge ecosystems. This paper will compare both LAC’s and the IA’s sustainability models through the IA’s digital resources and through interviews with Library and Archives Canada. It presents a brief overview of mandates and of accountability to publics vs. donors, and compares the overlap and (in)dependence of national and transnational digital archiving. The analysis draws on theories of data loss to engage with the porous and lossy boundaries of digital memory infrastructure. Both IA and LAC have gaps and absences, but their losses result in different absences and silences. Both the IA and LAC are infrastructure within the ecologies of digital archiving but diverge in mandate and logics. LAC is mandated to produce Canadian cultural and governmental memory, and is accountable to Canadian governmental policy, whereas the IA is a transnational nonprofit that exerts control through providing web infrastructure. They are bound by copyright law but are politically focused on increasing access to data through different, highly visible projects. I will use the construction of 'Scribes' as a focus to present the porous nature of digital memory institutions. This comparative analysis contributes to conversations around the tensions of digital national futures, and how the process of transnational archiving can complicate or support national archival agendas. References Internet Archive Canada. (2024, July 1). Internet Archive Canada launches digitization project with Library and Archives Canada. https://internetarchivecanada.org/2024/07/01/internet-archive-canada-launches-digitization-project-with-library-and-archives-canada/ Library and Archives Canada. (2025, August 1). The plan to scan: digitizing out-of-copyright publications. Government of Canada. https://www.canada.ca/en/library-archives/corporate/updates/2025/the-plan-to-scan-digitizing-out-of-copyright-publications.html Star, S. L., & Ruhleder, K. (1996). Steps toward an ecology of infrastructure: Design and access for large information spaces. Information Systems Research, 7(1), 111–134. https://doi.org/10.1287/isre.7.1.111 11:57am - 12:05pm
Ten years of websites and born-digital archiving in Slovakia University Library in Bratislava, Slovak Republic Electronic documents and websites should be preserved similarly to physical objects of lasting value. In 2015, our institution became involved in a project on digital resources. The goal of the project was to create the technological and organisational infrastructure for systematic and controlled web harvesting and born-digital archiving. We archive national websites and born-digital content (electronic monographs and electronic serials). The project has now moved beyond its sustainability phase, and all activities are carried out by a specialised department. During the pilot phase a complex information system for harvesting, identification, management and long-term preservation of web resources and born-digital documents was established. Our information system consists of specialised open source software modules (Heritrix, OpenWayback, Solr, etc.). The application is supported by powerful hardware infrastructure. The system management is optimized for parallel web harvesting. This makes it possible to complete a full domain harvest with the required politeness in an acceptable time. One of the useful system features is the identical parallel testing environment. The web archiving system has 800 TB of storage. A substantial part of the system is the catalogue of websites, which is regularly updated during the automated survey of the national domain. Some domains that match our policy criteria are added to the catalogue manually (.org, .net, .com, .eu…). Since 2016, our department has performed seven full-domain harvests of the national domain, as well as multiple selective and thematic harvests. Electronic publications with assigned ISSN are archived in cooperation with the National ISSN Centre by upload or by harvest. Access to the archived data is provided in OpenWayback. Only a limited number of archived websites and electronic publications are publicly available due to copyright restrictions. All archived resources are available locally in the institution. This contribution traces the path of archiving national websites and born-digital documents in the digital resources archive. Over these ten years, it has encountered several opportunities, and it is now a recognized source, partly supported by national legislation (archiving of news portals). 12:05pm - 12:13pm
Climate change captured: collaborative, complex crawling & collecting - learnings from a cross-institutional pilot on climate change reactions Royal Danish Library, Denmark As part of a national, cross-institutional, pilot initiative documenting public reactions to climate change, a recent thematic web collection focused on online debates and reflections surrounding water levels, flooding, and environmental adaptation. Within this pilot, an effort led almost entirely by a single curator resulted in the collection of over 1.6 million unique web pages—more than 5 terabytes of data—including embedded videos, dynamic rich media, and selected social media content. The collection was conducted using Browsertrix, a browser-based crawling technology that proved essential for capturing complex, media-rich web content that traditional crawlers often miss. The setup included both cloud-based and local installations, allowing flexible scaling and testing of workflows. Browsertrix enabled efficient harvesting within a limited timeframe while significantly improving the fidelity of the captures, particularly for sites relying heavily on dynamic or embedded content. This presentation will share key learnings from the pilot, focusing on technical, curatorial, and collaborative dimensions. On the technical side, challenges included resource demands, blocked access to social media “walled gardens,” and maintaining crawl stability across diverse sites. From a curatorial perspective, the project demonstrated the value of close cooperation with domain experts on climate change, whose insights were crucial for identifying emerging debates and relevant sources, as well as the value of inspiration from the other institutions participating in the pilot, which collected non-web media or physical objects. The user-friendly GUI of Browsertrix, partly developed during the IIPC-funded project "Browser based crawling system for all" (https://netpreserve.org/projects/browser-based-crawling), empowered curators to crawl and make informed decisions in a fast and intuitive manner; monitoring crawls at run time helped identify important sites that could be crawled in more depth later. However, the experience also revealed the need for broader outreach and participatory workshops in future large-scale efforts, to ensure diverse and inclusive input across sectors. The pilot underscored how browser-based harvesting tools can transform national web archiving by bridging gaps in multimedia and interactive content capture. At the same time, it highlighted the limits of current approaches—particularly the need for dedicated development to handle advanced social media and video platforms. The forthcoming main project, pending acceptance of funding applications, aims to build on these lessons, exploring how combining existing infrastructures with newer tools like Browsertrix can enhance thematic, rapid-response collections. With modest resources but focused technical and curatorial innovation, it is possible to add substantial cultural and research value to national web archives documenting societal reactions to climate change. 12:13pm - 12:21pm
Bridging local and international communities: Web archiving outreach and collaboration 1Aix Marseille University, France; 2Humathèque, Campus Condorcet, France; 3MMSH, CNRS, Aix Marseille University, France This lightning talk aims to present three community-building and outreach initiatives that brought together long-time web-archiving specialists and newcomers to the field in 2025. The first one is a community-building initiative that resulted in the drafting of a memorandum of understanding between the xxx and xxx. In this declaration, they commit to: creating a shared ecosystem to foster new cooperation projects, conducting collective work on the methodology for stabilizing and archiving web data corpora, strengthening links between existing institutions with expertise in collecting, analyzing and archiving web data, and reflecting on how to create a reproducible pipeline to collect, curate, consult and conserve web data corpora for SSH research. The second initiative is the co-organization of a monthly research seminar entitled “The Web and Web archives for research in the humanities and social sciences: knowledge, methods, and tools for the collection, analysis, and preservation of online corpora”. The third initiative is an event: a hackathon called “Building a corpus with web data” involving SSH researchers and research library professionals from xxx and xxx, but also other significant local players in web archiving. xxx and xxx are pooling their expertise to transform research practices through knowledge creation, training, awareness-raising, and the sharing of common tools for web archiving. Together, they want to build bridges between the international web-archiving communities (RESAW, IIPC) and local specialists and enthusiasts. 12:21pm - 12:29pm
Best practices for collaboration: Managing themed harvests with external partners National Library of Finland, Finland Themed harvests are a substantial part of the web archiving at the National Library of Finland. Beyond the yearly crawl of the Finnish domains ending in the .fi or .ax country codes, online content is collected through continuous harvests and themed harvests with varied subjects and content types. The most recent collection plan, for 2025-2028, requires more emphasis on themed harvests that involve collaboration or cooperation with different groups, third-party organisations, and other participants interested in suggesting content or otherwise contributing to the Finnish Web Archive. This lightning talk will provide insight into how collaborative themed harvests are usually managed and how they have developed in recent years. As harvests may cover subjects in which the legal deposit services team that curates the archived online content does not itself have the required expertise, the role of external partners is crucial. The presentation will include several themed harvests from recent years that involved cooperation or collaboration with external partners. Most of these collaborative themed harvests have been organized with institutions and organizations specialized in or representing language minorities or under-recognized groups, but the findings presented are also applicable to other kinds of external partners. Over the years, we have learned to improve the management of different types of cooperative and collaborative themed harvests. Collecting projects may be sparked by external suggestions or may be based on a set of online content already put together by a third party. Managing these kinds of projects usually turns out to be fairly different from projects that require reaching out for expertise beyond the National Library. Organizing themed harvests, especially with minorities and under-recognized groups, means that the collaborating participants are not just providers of suggestions but also have knowledge of and a say in other aspects of the project (e.g., cataloguing and communicating to peers). Based on our experiences with these kinds of themed harvests, we have produced internal guidelines on how to manage collaborative collecting projects. |
| 12:35pm - 1:35pm | LUNCH Location: GALERIE [-2] & PANORAMA FOYER [+6] 🍴 Lunch will be served in Panorama Foyer (Floor +6) and in Galerie (Floor -2, next to the Auditorium).
🏛️ GUIDED KBR MUSEUM TOUR (ENGLISH): If you signed up for a guided tour, please be by the Museum entrance on Floor 0 at 12:35 [1st tour] or 13:05 [2nd tour]. To know if you signed up for a tour, check your registration details in ConfTool. |
| 1:35pm - 3:00pm | SOCIAL MEDIA Location: AUDITORIUM [-2] |
1:35pm - 2:40pm
Digital Democracy: archiving government social media content 1The National Archives of the Netherlands, Netherlands, The; 2The National Archives, UK; 3National Library of Singapore; 4Bodleian Libraries General abstract For government organisations, the use of social media platforms is a great way to get in direct contact with citizens. For example, local organisations can use social media to ask the public for direct input on new initiatives regarding the environment in their municipality, and national ministries can highlight new policies and regulations. However, archiving social media after we have used it is a totally different story. We understand the need to archive the material, the legal basis national archives have to do so, and the limitations as we do not own the platforms, but how does that work in practice? During this panel national archives and libraries from all over the world will share their experiences with safeguarding this public discourse on social media for the long term. The panelists will each explain their own situation and briefly touch upon relevant legislation from their country. Furthermore, during an interactive panel discussion, the panelists will touch upon topics such as best and worst practices, user experiences, accessibility, the ongoing debate on whether to include or exclude comments and direct messages, how to handle donated material, long-term preservation, and the file formats in which social media is archived. Questions and use cases from the audience are very much appreciated. Short abstract per institution Organisation 1 At [organisation], we are not responsible for archiving social media on behalf of the entire [country] government. Each government organisation is responsible for managing its own social media archives. However, all currently active government social media accounts have been designated as information that must be permanently preserved. This means their archives will eventually be transferred to [organisation] within the next 10 to 20 years. To support this process, [organisation] has developed guidelines for archiving social media and is actively contributing to the creation of a government-wide policy. We have also defined the essential properties of social media archives that must be safeguarded to ensure their long-term preservation. National Library of Singapore With the growing significance and usage of social media, the National Library of Singapore (NLS) developed and included this new format as part of our collection policy in 2024. The policy covers both private organisations/individuals and government accounts. In close cooperation with the National Archives of Singapore, it was made mandatory for government agencies to transfer their social media accounts to the NLS. NLS is currently archiving Singapore's political office holders’ social media accounts, with plans to collect government agencies’ accounts in the near future. Our future plans also include ingesting the social media collection into NLS' digital preservation system and exploring access ideas. Organisation 3 [organisation] is archiving [country] Government social media at scale, using automated harvesting methods. This activity is supported by the Public Records Act 1958, which defines public records broadly as ‘not only written records, but records conveying information by any means whatsoever’, which includes social media. Currently, we are harvesting a limited number of platforms and we are exploring ways to expand our coverage, including direct transfer of accounts.
Archived material is publicly accessible via our Social Media Archive. Organisation 4 - TBC [organisation] collects social media in the context of thematic and/or curated research and data collections. [country] legislation recognizes the Internet as published (subject to Legal Deposit) and therefore the material is not archival. This means that [organisation] does not collect the web or social media of federal departments as formal archival records. However, as a published supplement to a formal archival fonds, [organisation] does accept important federal and non-federal social media data exports on a case-by-case basis. 2:40pm - 3:00pm
High-fidelity social media archiving: current state of the art Webrecorder How to archive social media remains one of the most frequently asked questions, and sometimes one of the biggest challenges, in web archiving. Social media platforms are vast and quickly evolving, while web archiving tools are always playing catch up. Can web archiving tools be used to archive social media at high fidelity, i.e. accurately to their users’ experience? What makes archiving social media so difficult, and what are the key aspects of web archiving that apply to social media? This talk will share some of our experience in the field, as well as the latest state of the art (which sometimes changes daily). We’ll cover the major platforms, such as Facebook, Instagram, Twitter/X, TikTok, YouTube, Telegram and LinkedIn, their current state, and how archiving some of these platforms has changed over the years. We’ll discuss browser profiles and paywalls, challenges of session information and rate limiting, custom behaviors, and how all of these factors affect capture and replay. We’ll discuss what has consistently worked and why, and what hasn’t, what requires more work and maintenance, and what trade-offs may be necessary. We’ll also provide a real world use case of social media archiving workflows that others could perhaps use. The presentation will discuss how we've approached social media archiving across key open source tools, including Browsertrix/Browsertrix Crawler, ArchiveWeb.page, and ReplayWeb.page. We hope to end with a discussion on the subject of how to make social media archiving a sustainable practice within the web archiving field, and what can be done collaboratively for the benefit of all. |
| 1:35pm - 3:00pm | WORKFLOWS FOR BUILDING AND ANALYSING DATA Location: PANORAMA [+6] |
1:35pm - 1:57pm
Digital Diaspora: mapping the Jewish internet The National Library of Israel, Israel Methods are being developed to systematically detect and archive Jewish web content on a large scale, capturing the evolving, multilingual digital expression of diasporic culture. This presentation outlines new procedures for the systematic detection and collection of Jewish web materials. Building on earlier curatorial approaches, this phase of the project focuses on automating the identification of thematically relevant websites through content-based analysis. Drawing on linguistic markers, semantic clustering, and metadata extraction, the process generates an expansive and continuously updated registry of Jewish web domains. To expand the detection of thematically relevant web content, the workflow integrates automated site aggregation with multilingual linguistic modeling. The system applies cross-lingual text analysis, semantic clustering, and metadata extraction according to defined selection criteria, enabling the identification of recurring cultural, historical, and communal markers across diverse digital sources. Detecting websites by thematic relevance rather than by technical metadata or domain structures presents a distinct challenge, as cultural or communal identity is often conveyed implicitly through language, visual and textual cues, and context rather than explicit tags or classifications. However, by embedding these computational processes within curatorial practice, the project broadens how the Jewish digital sphere is identified and delineated, ensuring that content produced in multiple languages and regions is systematically recognized and incorporated into the resulting archive. The presentation will address the conceptual design and technical aspects of this workflow, including criteria for data selection, the balance between automation and curatorial oversight, and methods for verifying the alignment of collected materials with the intended thematic focus. Beyond its technical contribution, the project reflects on the broader questions of how such workflows might inform other initiatives seeking to create expansive, thematically driven web collections, and how these systems can remain adaptable as online content and communities evolve. By presenting this next phase, the project invites further dialogue on how national and thematic archives can responsibly automate the preservation of networked, transnational cultural spheres. 1:57pm - 2:18pm
Improved language identification for web crawl data Common Crawl Foundation, United Kingdom Identifying the languages contained in crawl data is a fundamental step in exploring the multilinguality of web archives. However, this task is far from straightforward: language annotations contained in webpage metadata are often unreliable or missing, and existing language identification systems are limited in their ability to handle large-scale diverse web crawl data well. Specifically, common language identification systems used for web crawls (e.g. CLD2) only cover a small number of languages well and are not reliable for many under-served language varieties. At the same time, more recent high-coverage language identification systems (e.g. GlotLID) are too computationally expensive for large-scale pipelines and often lack robustness when dealing with the heterogeneity inherent in web data. We therefore identify five desiderata for a language identification system suitable for annotating web crawls: it must be fast, computationally lightweight, adapted to the web domain, able to handle multilingual input, and easily extensible to additional language varieties. In this talk, we present a new language identification system designed for web crawl data that meets all these requirements. Our solution is implemented in Rust and so is performant enough to process large amounts of web data in a reasonable time. It is designed from scratch for the web domain, including identifying multilingual web pages. The initial model is able to identify around 200 language varieties, but is easy to extend to additional language varieties given sufficient training data. We benchmark our system’s performance against popular existing language identification models, measuring computational performance and language identification fidelity. We finish with a discussion of the potential impact of our system on downstream language technologies, with a particular focus on under-served languages. Our language identification model is released under a permissive open source license to enable easy adoption and extension by the community.
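As a generic illustration of per-record language annotation over crawl data (not the authors' Rust system), the sketch below pairs warcio with the off-the-shelf fastText lid.176 model; the file paths and confidence threshold are placeholder assumptions, and a real pipeline would strip HTML markup before prediction.

```python
# Generic illustration, not the authors' Rust implementation: annotate WARC
# response records with a language guess using warcio and the pre-trained
# fastText lid.176 model. Paths and the threshold are placeholders.
import fasttext
from warcio.archiveiterator import ArchiveIterator

model = fasttext.load_model("lid.176.bin")

def detect_languages(warc_path, min_confidence=0.5):
    results = []
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "html" not in content_type:
                continue
            text = record.content_stream().read().decode("utf-8", errors="replace")
            # In practice, markup would be stripped first; fastText also expects
            # single-line input, so newlines are replaced here.
            labels, scores = model.predict(text.replace("\n", " "), k=1)
            if scores[0] >= min_confidence:
                uri = record.rec_headers.get_header("WARC-Target-URI")
                results.append((uri, labels[0].replace("__label__", ""), float(scores[0])))
    return results
```

2:18pm - 2:39pm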
Hyperlinked homeland: A historical hyperlink analysis of 200 Dutch LGBT+ websites University of Groningen, Netherlands, The Over the past years, scholars have increasingly emphasized that queer cultures intrinsically transcend national borders (Bayramoğlu et al., 2024). The transnational connections that LGBT+ people establish online, among others through hyperlinks (Kiel & Osterbur, 2017), are often presented as a case in point (e.g., Gonsalves & Velasco, 2022). My presentation, however, demonstrates that the nation still matters greatly. It builds on the interdisciplinary project I conducted as Researcher-in-Residence at, and in close collaboration with, the National Library of the Netherlands (KB), drawing from the fields queer internet studies, web archive studies and network analysis. Using historical hyperlink analysis, I analyzed the special LGBT+ web collection of the KB. This collection is unique in size and richness, comprising archived websites of hundreds of LGBT+ organizations and individuals, each of which has been harvested once annually. However, the collection has not yet been researched by others. The talk focuses on the 200 LGBT+ websites that were harvested in 2020 (for pragmatic reasons: in terms of size and quality of the LGBT+ collection, this is the best year to scrutinize). To identify the (trans)national queer networks they formed that year, I extracted and scrutinized all hyperlinks of these websites. After all, hyperlinks are not merely the constitutive elements of the Web, they are ‘conscious acts of connectivity’ (Milligan, 2022, p. 132) that yield insights into ‘hyperlinked identities’ (Szulc, 2015, p. 121). I specifically concentrate on the hyperlinks that directed to LGBT+ websites – not necessarily the 200 websites, but to any website, Dutch or non-Dutch, that catered to LGBT+ people. I will detail this bottom-up approach that combines distant and close reading, and will show that there was a distinctly Dutch queer web sphere. For instance, 49 of the 50 websites that were most frequently hyperlinked to (or: targeted) were websites of Dutch organizations, in Dutch. In fact, many were hosted by local or regional groups, which suggests that, as far as geographical focus is concerned, internet historians should perhaps zoom in rather than out. Moreover, most of the target websites had ‘.nl’ as a top-level domain (TLD), whereas ‘.amsterdam’ was also relatively popular. These findings challenge the assumption that queer online cultures are inherently transnational. This talk connects to the conference regarding both the topic (e.g., ‘underrepresented voices and marginalised communities’) and applied method (‘Derived and statistical data for distant reading’). It is designed to resonate with every conference participant. It goes beyond simply demonstrating—through practical examples—how collaboration between researchers and web archivists can deepen our insights into critical societal and historical issues. Additionally, it explores the workflows the KB and I created for building and analyzing datasets, which could inspire future research and ultimately encourage greater engagement with web archives. By showcasing how hyperlink analysis can reveal hidden local networks, this talk offers a replicable, data-driven approach for archivists and researchers to assess and enrich collections of underrepresented groups—directly addressing the conference’s call for inclusive and sustainable web archiving practices. 2:39pm - 3:00pm
WARCbench: A swiss army knife for WARC processing Harvard Library Innovation Lab, United States of America WARCbench is an open-source Python library and command-line utility designed for exploring, analyzing, transforming, recombining, and extracting data from WARC files in all their variety. Inspired by the ad hoc snippets of code the team at the Library Innovation Lab repeatedly reaches for while operating Perma.cc, WARCbench is a new addition to our suite of open-source web-archiving tools. It offers a resilient, highly configurable toolkit for experienced technologists, alongside easy-to-use commands for quickly exploring the contents of a WARC without writing any code. In running a production-scale web archive, we’re always finding new anomalies to investigate, emerging patterns to study, and new use cases to explore. Though a broad array of tools and libraries exists for working with WARC files, most are understandably optimized for the well-known, frequently encountered tasks of web archiving rather than for empowering learning and discovery, supporting ad hoc scripting, and enabling users to quickly and easily explore novel problem spaces. WARCbench was created with these non-standard uses in mind and with an eye toward best practices: clear, thorough documentation; robust error handling; and an architecture that makes custom extension and introspection straightforward. Our goals for this project were to:
Our session aims to spark dialogue about common practices in ad hoc WARC processing and future tooling needs. Attendees will learn practical, repeatable approaches for inspecting and handling even "difficult" WARC files using WARCbench, and we’ll demonstrate both typical and edge-case scenarios ranging from simple inspection to transformation and extraction. Because it’s open source and modular, WARCbench lowers barriers to adoption, invites community iteration, and supports tool longevity — a critical factor for sustainable web archiving.
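For context, the snippet below shows the kind of ad hoc inspection code the abstract refers to, written with warcio purely for illustration; it does not show WARCbench's actual API, and the file name and size threshold are invented.

```python
# The kind of throwaway inspection snippet the abstract alludes to, shown here
# with warcio for illustration only; WARCbench's own API is not reproduced here.
from collections import Counter
from warcio.archiveiterator import ArchiveIterator

def summarize_warc(path, large_payload_bytes=10_000_000):
    record_types = Counter()
    oversized = []
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            record_types[record.rec_type] += 1
            length = int(record.rec_headers.get_header("Content-Length") or 0)
            if length > large_payload_bytes:
                oversized.append((record.rec_headers.get_header("WARC-Target-URI"), length))
    return record_types, oversized

types, big = summarize_warc("example.warc.gz")  # placeholder file name
print(dict(types))
for uri, size in big:
    print(f"large record: {size} bytes at {uri}")
```
|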
| 3:00pm - 3:30pm | BREAK Location: GALERIE [-2] & PANORAMA FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6) and in Galerie (Floor -2, next to the Auditorium). |
| 3:30pm - 4:00pm | POSTER SLAM Location: AUDITORIUM [-2] |
|
|
A survey on data-access methods for an open web archive Common Crawl Foundation, France With the ever-growing interest in web data and web archives being driven by Large Language Models (LLMs), Artificial Intelligence (AI) and Retrieval-Augmented Generation (RAG), web archivists managing open repositories are faced with an unprecedented volume of download requests. Given that web archiving infrastructures are sometimes constrained in resources, the increased traffic has made it difficult to serve and fulfill all of these incoming requests properly without saturating the infrastructure. This problem is compounded by users who employ far too aggressive retry policies, often unknowingly, when they try to access open archives. To deal with these issues in relation to our own archives, we introduced an official, open-source tool over a year ago to facilitate sustainable user access. We developed it to be cross-platform, dependency-free and user-friendly to ensure easy adoption by the community. It implements and supports polite retry strategies like exponential backoff and jitter, while also allowing for parallelization. In this talk, we present the results of a comprehensive study of the tool's impact over the span of a year: how users have been accessing our archive and how this affects our infrastructure. We use a defined standard user agent for our official tool to track usage, and investigate how our tool has been adopted over time, and whether its introduction has simplified access to our open web archives for users. We also compare our official tool to other standard access methods employed by our users and study how the introduction of a polite access tool has affected the load on our infrastructure. Finally, we propose some strategies that other web archiving institutions can use to simplify access to their archives, providing users with polite tooling inspired by our findings and allowing them to reduce the load on their infrastructures. Linkra – application for archiving and creating citations of web resources in scientific texts National Library of the Czech Republic, Czech Republic Linkra, a newly developed archiving and citation service, is designed to store web resources cited in scientific and professional texts. It addresses the problem of link rot – the loss of referenced web content, which threatens the credibility of the texts that cite it. The application allows users to save cited resources to a web archive, obtain archive URLs, and create citation records. In addition to preserving cited resources, it encourages researchers to include archival copies in their academic citations in accordance with the ISO 690 standard. The application uses a harvesting method based on the open-source Scoop tool, allowing fast access to archived data. Working with the application involves several steps. Users first insert the web sources they want to preserve into the application, which starts the harvesting process. They then receive a unique address through which they can return to their request. After the harvesting is complete, they receive shortened URLs that will redirect to archived copies after indexing. Finally, they can use the built-in generator to prepare citations of web sources for publication in professional texts. They can either use pre-prepared templates designed according to common citation standards or create their own, for example according to the specific requirements of a professional journal. Citation records prepared in this way can be exported in bulk.
The Linkra application is being developed as an open-source tool as part of institutional research. It was preceded by research focused on disappearing web content and on the possibilities of citing web resources and their archive versions. The aim of the application is to preserve the sources of scientific works while also expanding the existing acquisition strategies of the web archive of the National Library of the Czech Republic. As part of the poster presentation, we will introduce the goals of our project, describe the technical solution, discuss the challenges encountered during development, and demonstrate how to use the application. Application of AI to Social Media Archiving at the National Library of China National Library of China, China, People's Republic of The evolution of Artificial Intelligence (AI) has offered a new paradigm for web archiving. Based on over two decades of practical experience, our library is actively exploring the innovative application of AI and AI agents across all stages of the archiving, preservation, and management processes. In practice, the library has achieved successful outcomes in applying AI to social media archiving and, using the DeepSeek large model, has made breakthrough progress in identifying archiving targets, analyzing archived content, and cataloging metadata. The library has expanded the scope of its web archiving to social media, focusing on articles published on WeChat official accounts. The deepseek-r1:14B model assists in determining archiving targets by filtering search results against specified search conditions and automatically extracting the titles and URLs of the WeChat articles to be crawled. Drawing on the model's capacity to learn, understand, and analyze text, it also assists in the full-text analysis of crawled WeChat articles. Trained on the cataloging records of historical articles and refined through multiple rounds of optimization, the model now produces precise descriptions of key information such as full-text summaries, keywords, and data sources for WeChat articles. AI thus provides an effective tool for web archiving. Archiving and Analyzing YouTube Recommendations during the Paris 2024 Olympic Games 1Université Sorbonne-Nouvelle; 2Université Rennes 2; 3National Library of France, France; 4Inria, Rennes; 5Université de Lille; 6LAAS-CNRS Though profoundly shaping and personalizing our experiences of the web and our access to information, algorithmic recommendations remain largely absent from institutional web archives, raising critical questions about how to capture and preserve a long-term record of algorithmic activity. This poster presents the preliminary results of a multidisciplinary research project that brings together a national library and experts from computer science, information science, social psychology, and sports history. The project’s goal is to capture and analyze the videos recommended by YouTube’s algorithm to different user profiles during the Paris 2024 Olympic Games, in order to determine whether these algorithmic recommendations reflect different narratives or perspectives on the Olympics, and whether they promote distinct values related to sports and the Olympic spirit. This poster will outline the initial findings of this exploratory approach, including the methodology and the resulting dataset.
Using bots with diverse browsing histories, we collected over 21 million video recommendations across 19 user archetype profiles over a 45-day period. We complemented this approach by constituting a corpus of 18k videos related to the Paris Games published during the events and monitored daily from the time of their publication. We refer to this as an "objective corpus", which we used as a reference to analyze the personalized recommendation datasets. We will present preliminary quantitative insights from the data collected, in particular by focusing on recommendations of videos from our "objective corpus". We found considerable variations of the bots exposure to corpus videos depending on their profile; in particular, bots with a media consumption are more exposed than bots with a sport consumption, which might appear surprising given the nature of the event. We will share the first results from a qualitative analysis of the subjective representations associated with the “Paris Olympics” event in the most frequently recommended videos. We analyzed variations in values expressed in these videos to compare different personalization regimes and value systems. Finally, we aim to spark a discussion on several open questions: How can such a large dataset be preserved and made accessible? How to construct a "representative" personalization ? How might algorithmic recommendations be integrated into existing web archiving practices, and how can their capture be developed into a reproducible and sustainable process? Can these recommendations help build an archive that reflects diverse perspectives on the same event? Detecting and managing challenging web crawls at scale MirrorWeb Limited, United Kingdom Web archiving at scale presents significant operational challenges, particularly in identifying crawls that deviate from expected behaviour. Whilst standard monitoring systems report binary "running" or "stopped" states, they fail to detect more subtle problems: crawls that exceed their intended scope, enter infinite loops on dynamic content, or silently stall whilst appearing active. By the time such issues are manually identified, substantial computational resources have been consumed, and service level agreements may be compromised. This poster presents [REDACTED]; a proactive monitoring application developed to address these detection gaps. The system leverages historical crawl data to establish profile-based performance baselines for different crawl configurations. By continuously comparing current crawl duration against expected averages, the application automatically flags potentially problematic crawls for investigation before they escalate into resource-intensive failures. The application integrates multiple data sources including AWS EC2 instance metadata, MySQL profile databases, Redis queue systems, and Heritrix REST API endpoints. When a crawl exceeds its baseline duration, the system gathers comprehensive diagnostics: status, queue metrics, actively processing URLs, and recent log entries. This diagnostic information is automatically posted to associated ticketing systems with stakeholder notifications, enabling rapid response. Operational deployment has demonstrated significant benefits including early problem detection (hours rather than days), reduced manual oversight requirements, improved response times through automated stakeholder notification, and enhanced organisational knowledge capture through documented diagnostics. 
The profile-based approach proves particularly effective for organisations managing diverse crawl types across multiple clients, where manual monitoring becomes impractical. This work highlights the importance of monitoring strategies that extend beyond simple status checks. As web archiving operations scale, institutions require intelligent detection mechanisms that understand normal crawl behaviour and can identify deviations before they impact service delivery. The poster will demonstrate the system's architecture, detection methodology, and practical implementation considerations for institutions seeking to enhance their crawl monitoring capabilities. Mapping duplicate images in a web archive using perceptual hashing National Library of Norway, Norway Images have been part of the web since its early beginnings [1] and today most webpages have some form of image content. Since the early 2000s, the National Library of Norway has harvested web data from the Norwegian top-level domain, storing time-stamped records of web content, including text, audio, video and images. A large portion of the stored data is images and finding ways to sort through the images, link together related images and remove duplicates is crucial for researchers to be able to find what they are looking for. Image files spread quickly online. The same image can be downloaded multiple times and reuploaded to different websites. As a result, duplicates of an image can be hosted at multiple domains and the link between the image instances is not always preserved in the process. Further, as content management services often compress and resize images automatically upon upload, instances of the same image might also exist with different sizes or compression levels which means that they are different at the byte level. This poster will present our ongoing work and preliminary results from a deduplication study to detect duplicate images in a web archive. By using perceptual hashing algorithms [2,3], we detect and flag perceptual duplicates in a subset of the archived data. Moreover, to estimate the performance of this perceptual hashing algorithm, we evaluate the detection accuracy for several simulated image degradation transforms. Similarly, we use pixel-level comparison on a random subset of the images to probe the hashing algorithm for false positives. Our initial findings suggest this approach is promising and has two potential benefits: 1) Allowing scholars to track the use and reuse of an image across multiple pages. 2) Reducing unnecessary computation, if two files represent the same image with only minor differences in resolution or compression, there is no need to perform expensive computation twice. We will present our work so far, what lessons we have learned and how these lessons will inform how the National Library of Norway processes and disseminates web archive image data in the future. [1]: Tim Berners-Lee and Mark Fischetti. 1999. Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web. HarperCollins Publishers. [2]: Farid, H. 2021. An Overview of Perceptual Hashing. Journal of Online Trust and Safety. 1, 1 (Oct. 2021). DOI:https://doi.org/10.54501/jots.v1i1.24. [3]: Meta 2019. The TMK+PDQF Video-Hashing Algorithm and the PDQ Image-Hashing Algorithm. 
https://github.com/facebook/ThreatExchange/blob/main/hashing/hashing.pdf (Retrieved 2025-10-13) Migration of Croatia's Web Archive's selective web harvesting system: transitioning to sustainable and interoperable solutions National and University Library in Zagreb, Croatia Preserving online publications presents growing challenges due to the increasing volume of digital content, rapid technological change, and the need to ensure compatibility with international web archiving initiatives. The current system used for selective web harvesting has reached both infrastructural and functional limitations, which prompted a shift toward a more modern and sustainable solution. The proposed approach focuses on migrating the existing selective web archiving system to the Web Curator Tool (WCT), an open-source platform designed for managing complex web harvesting and curation workflows. This migration entails comprehensive technical and functional transformations, including the conversion of harvesting parameters, metadata migration, and reconfiguration of harvesting schedules to accommodate new system capabilities. In preparation for the migration, archived publications are being thoroughly assessed to determine appropriate capture frequencies, verify the quality and integrity of harvested instances, and identify materials unsuitable for migration—such as publications not available in standard HTML format. This careful evaluation ensures that only relevant content is retained in the new web archiving system. The poster will outline the advantages of adopting standardized and widely supported tool, such as improved scalability, interoperability, and alignment with international web archiving best practices. It will also address potential challenges, including the need for significant resources during the migration process and the potential loss of certain legacy functionalities that cannot be replicated in the new environment. The overall goal is to establish a sustainable, scalable, and interoperable selective web archiving system that ensures the long-term preservation and accessibility of the nation’s online publishing heritage. Using data to challenge negativity bias in quality assurance workflows Library of Congress, United States of America This poster will describe how institutional staff are disproving a common expectation of poor results in web archives. Our institution includes over 5 PBs of data for event-based and thematic collections, however a hyper-targeted capture remediation approach by the quality assurance (QA) team leads to perceptions of low success and high failure rates of captured content. The team is motivated to find a sustainable workflow that balances large-scale quality assurance data and individualized attention to specific captures to glean a clearer, more positive image of web archive capture health. The poster will touch on staff's ongoing developments to make their quality assurance workflow sustainable. It will also briefly discuss how we process the data gathered through this workflow. Data from a standardized qualitative rubric for capture assessment indicates that a majority of captures are successful. This rubric is based on correspondence of the live site and web archive browsing experience, following criteria developed by Dr. Brenda Reyes-Ayala (1). Priority captures for remediation and triage by the quality assurance team are indicated by scoring on the rubric. 
Scoring data and major categorical issues from this rubric are then visualized in Tableau and reveal a range of positive and negative capture assessments. This Tableau data is critical as our QA staff time is focused on troubleshooting the negative assessments. New visibility of positive assessments through the Tableau dashboard highlights value within the collections and builds team morale in a challenging QA environment. The positive data allowed the team to update the QA workflow to funnel only the high-priority, actionable assessments through the process. Through data collection and visualization, we hope we can better understand and manage the myriad collections at our large collecting institution. Using data to communicate a transparent understanding of crawl health could help onboard new staff and support a morale boost for staff performing quality assurance work long-term. This poster shares steps on our journey towards stable and enduring web archiving capture assessment and remediation work. (1) "Correspondence as the primary measure of quality for web archives: A human-centered grounded theory study" International Journal of Digital Libraries, 2022 Sustainable web archiving: a living and participative poster 1C2DH, University of Luxembourg, Luxembourg; 2Ecole Nationale des Chartes - PSL, France How can we think about the sustainability of web archiving while respecting its vitality, creativity, and diversity of approaches and uses, and while encouraging co-shaping, interdisciplinarity, and participation? During WAC26, which will address this question and provide many answers, our participatory poster will offer an additional tool: a collective, living and co-constructed poster. The poster is both an installation and a performance running throughout the conference. Rather than a fixed, finished object, it is a living surface that grows day by day, shaped by the contributions of WAC26 participants. It takes the form of one or two long rolls of recycled kraft paper, several meters in length, fixed to a wall or unrolled across a few tables to invite collaborative contributions. On this surface, participants are invited to draw, annotate, question, collage, and connect, challenge, highlight ideas they developed or found interesting and exciting in sessions, using various materials (markers, old magazines, scraps of paper, threads of yarn to create links, etc.). In this way, the poster becomes both a shared reflection space and a creative archive in itself. We will prepare the very first layers of the installation (if possible with early scholars during the spring school to be held on April 20, just prior to the launch of WAC26) : this will take the form of a partial canvas including a mind map on sustainable web archiving and a set of open questions handwritten on kraft paper. From there, the surface becomes a collective palimpsest: enriched through participants’ sketches, and reactions to conference sessions. This process turns the poster into a space of dialogue and imagination, where sustainability is explored as a social, creative, playful, material, and collaborative practice. This living poster is at once a reflective poster, a participatory “artwork”, and a sustainable experiment in reimagining how we present and co-construct knowledge in the field of web archiving. 
Expected outcomes are a process of documentation (photos and notes throughout the event), presentation through a lightning talk and a final blogpost on netpreserve.org, including images of the evolving poster (and eventually audio comments), to preserve and share this experimental form of knowledge-making. The technologies of an in-house seed handling tool National Library of Finland This poster is an overview of the technologies used in an in-house developed tool that is used to create and manage collections based on harvested online materials, and to automate some harvesting and preservation related tasks. It has been in development since 2018 and is still being updated based on the users' needs. Virtual Mucem: from web archives to a museum remediation of an ethnological websites collection National Library of France, France The Museum of European and Mediterranean Civilizations (Mucem) is a major French ethnology museum located in Marseille. It opened in June 2013, inheriting the collections of the former National Museum of Popular Arts and Traditions (MNATP). This transfer of a national museum to a regional location was the first of its kind in France. The new museum implemented a multidisciplinary project and expanded its collections to the Mediterranean basin by launching new ethnological surveys. Between its official creation in 2005 and its opening to the public in June 2013, the museum developed an online strategy and launched eight original thematic websites. These websites were editorial projects in their own right and were used as a key means of promoting ethnological collections, researches and surveys. The websites were hosted on the French Ministry of Culture servers and were taken offline at the end of 2020 due to technical obsolescence and an extensive use of Adobe Flash technology. The disappearance led to an awareness of their importance. Some of them offered scientific descriptions of collections, which were more complete than the museum’s databases. Others reflected the museum's new stance on contemporary issues, such as gender, and preceded important exhibitions. The aim of the Virtual Mucem project carried out in 2024 was to experiment with a form of remediation by using web archives of a national library. The work was both documentary and technical. On one hand, the project team searched local archives and conducted oral surveys with the producers of the websites. On the other hand, a tool has been developed to enable the project team to extract and package the library web archives in order to produce derivative WARC files as complete as possible for each one of the websites. Following these two tasks, which were carried out simultaneously, the project team set up an editorial interface for remediating the websites and integrating the derivative web archives, which can be consulted within the walls of the Mucem's Conservation and Resource Center with a local installation of SolrWayback. This remediation project has a collegial and experimental dimension. Over the course of a year, it brought together more than fifteen people, including archivists, documentalists, librarians, IT specialists, and historians, as well as curators, ethnologists, and technical teams involved in the production of some of the sites. This poster will present the challenges and results of this remediation project. First, it will highlight the collaboration between a museum and a national library that can inspire new projects in the future. 
It will provide information about the process of creating derivative WARCs. Finally, it will question the remediation itself and some of the main issues: technical but also documentary obsolescence of the content, possible deficiency of the web archives, technical choice and network security, public display. WebData: Building a Research Infrastructure for the Norwegian Web Archive National Library of Norway, Norway Researchers have addressed the need for dedicated research infrastructures to study web archives. In response, the WebData project is building a research infrastructure for the National Library of Norway's web archive, enabling large-scale access to nearly 25 years of archived material. This poster will present the project's status, lessons learned so far, and findings from a needs assessment conducted with a relatively large number of scholars, mapping their needs.[1] The project started in 2025, with four key objectives:
Further, the poster will present findings from surveying researchers’ needs within four areas: a) access, b) interfaces and functionality, c) data and d) metadata. In addition to sharing scholarly needs, we examine how we plan to address this over the next 4 years. This involves traditional rule-based programming, identifying specific attributes in archived items, as well as machine-learning-based systems to enrich WARC data with additional metadata. The WebData consortium is led by the National Library of Norway, with the Norwegian Computing Center, University of Oslo and University of Tromsø as partners. Project development runs until 2029, while the infrastructure will operate until at least 2035. The project is funded by the Research Council of Norway. -- [1]: Brügger, N. (2021): ‘The Need for Research Infrastructures for the Study of Web Archives’. In The Past Web: Exploring Web Archives, edited by Daniel Gomes, et al. Springer International Publishing. https://doi.org/10.1007/978-3-030-63291-5_17; “About WebData” (2025), WebData. [2]: https://webdata.nb.no Kaʻohipōhaku: Community social media archiving in Hawaiʻi University of Hawaiʻi Since 2019, librarians in the University of Hawaiʻi System have been working towards developing an archive of social media content that documents significant historical events in Hawaiʻi, like the Kū Kiaʻi Mauna movement and the wildfires on Maui. After years of trial and error, we received a grant to establish a stronger foundation for this archive by bringing together Kānaka (Native Hawaiians) and web/social media archivists for the first time to exchange ideas, knowledge, and perspectives on what an ethical social media archive could look like. Through an Advisory Board, an online survey geared toward content creators dedicated to uplifting aloha ʻāina (love for the land), and community consultations, Kaʻohipōhaku will explore what an archive, rooted in Kānaka values and ʻāina (land), could look like in hopes of setting an example for the rest of Hawaiʻi and other Indigenous communities. Our poster will share our project goals, activities, and preliminary data from our findings. The project name, Kaʻohipōhaku, means to gather or collect stones, which is the first step in any utilization of pōhaku (stones, rocks). While pōhaku could be seen as immovable or fixed, all pōhaku can move when the time is right. Besides practical use (structures, cooking, tools,etc.), pōhaku are also believed to retain mana (energy, power). Our vision of building this archive is similar to building a hale (structure) in which our histories will be stored and preserved for the future generations. This name is also a play on the mele, Kaulana Nā Pua, in which the composer says they would rather eat rocks than be under the governance of foreigners. Doing humanities with web archiving: an oral history of web archiving practices in academia and the making of digital culture 1Aix Marseille University, TELEMMe Laboratory, France; 2MMSH, Aix Marseille University, CNRS, France Doing Humanities with Web Archiving: An Oral History of Web Archiving Practices in Academia and the Making of Digital Culture This project stems from the observation of a widening gap between, on the one hand, a small community of researchers and teachers who have developed expertise in web archiving, and, on the other, the vast majority of academics who occasionally need to archive the web as part of their work. The latter often rely on improvised, artisanal solutions to preserve or cite born-digital sources. 
While the experts are engaged with international initiatives and explore innovative methodologies linked to the digital humanities, most researchers remain unaware of this body of work and continue to “make do,” adjusting their practices as they go. How not to build a web archive in two weeks Texas State University, United States of America In 2025, a university library started a web archive. Getting to this point represented two years of education and advocacy to secure the necessary resources to start a program, aligning web archives with the larger mission and scope of the library. Given limited in-house development support, Archive-It was chosen as the university’s first web archiving tool and the university’s web presence as the first collecting area. Delays in contracts and purchasings resulted in little time to capture seeds before the data budget for the year would be reset. Determined to use as much of the data budget as possible before it expired, and after a self-given crash course in Archive-It, the presenter set out to capture as much as possible in as thoughtful a manner as time would allow before the end of the fiscal year. This poster will explore what went well, what went wrong, and lessons learned from this compressed timeline for starting a web archive. It will consider the work of implementing web archiving best practices, how the library is moving forward to grow a more robust and sustainable web archiving program, and the importance of advocacy and community in supporting institutions and sustaining the work of web archiving. In addition to doing the internal work of establishing repeatable workflows, refining regular crawl schedules, and considering the long-term preservation needs of their WARC files, the presenter is also actively restarting a regional web archiving interest group to build a local support network that can help foster their own and others' web archiving work in the area. As well, understanding that growth of the web archives will require continued support and increased resources from their institution, they are also leveraging the current attention their web archive has amongst leadership to promote the efforts of the library and advocate for the importance to the university of web archiving and preservation work. Through these efforts, the presenter aims to grow what started as a rough-and-ready little web archive into a sustainable web archiving program, expanding both upon its collecting scope and the archiving technology used. The presenter hopes the poster will prompt conversations around good (and not so good) practices in starting in web archives, successful approaches for advocating for web archiving resources, and the importance of web archiving communities in sustaining the work. Linking the awesome: Building a Community Knowledge Graph for Web Archiving Resources German National Library, Germany Web archiving is a highly technical endeavor involving a lot of tools. These tools are developed by a broad community and mostly as open source software. The open source software development allows participants of the community to exchange tools and improve them in a cooperative and collaborative way. The web archiving is technologically and from a community perspective embedded in the World Wide Web, which as well is mostly based on open source software and open protocols and standards. Likewise in web archives open protocols and standards, like WARC and CDX, play a fundamental role and allow the interoperability of components. 
The International Internet Preservation Consortium (IIPC) serves as a hub to foster communication among web archiving institutions, to support standardization processes and software development. The “Awesome Web Archiving” list follows the idea of awesome lists (https://awesome.re/). Awesome lists are common on GitHub, maintained as a Markdown document, and provide a low-barrier, accessible index of resources that are relevant for a certain community; contributors are able to suggest new entries as pull requests. Among other things, this involves links to software tools and standards documents. Within the “Awesome Web Archiving” list the entries are assigned to categories, while individual entries can fit into more than one category. The referenced projects are sometimes under active development, while others become unmaintained over time. To improve the quality of the “Awesome Web Archiving” list, and as such its value for the web archiving community, recency and information richness are relevant factors. The entries in the list are often links to git repositories or projects on GitHub. From these project pages, additional information about the current development status and the self-description of the projects can be gathered. To interlink the information gained through the crowdsourced approach of maintaining an awesome list with the information available on the project pages, linked data is a good and web-native format to encode information in a structured way. The SPARQL Anything tool (https://sparql-anything.readthedocs.io/) provides access to Markdown documents (https://sparql-anything.readthedocs.io/stable/formats/Markdown/) with the standardized SPARQL 1.1 Query language (https://www.w3.org/TR/sparql11-query/). With these tools it is possible to create a knowledge graph – the Web Archive Awesome Graph (WAAG) – of information resources relevant to the web archiving community (https://github.com/white-gecko/webarchiving-awesome-graph). This graph can serve as an integration point for structured or semi-structured contributions to the tool collection, for information enrichment, and to model interconnections between listed resources, such as tools and libraries, and libraries and standards. Finally, the graph's information can be browsed in a graph-like manner and rendered back into an awesome-list document. The tools involved are still under development and the approach requires discussion within the community. The poster should serve as a catalyst for such a discussion. Revisiting a statistical approach for measuring Solr query performance National Library of Norway, Norway Popular in the web archiving community, Solr allows for fast free-text search within a web archive. When working with large indexes, one soon faces the limits of one’s own infrastructure, and query response times increase. At that point, there are many measures that can be taken, so it is useful to know the effects of each measure, or which setting gives the best performance. This is when having tools for evaluating query performance comes in handy. This poster sheds light on a handy method of measuring and visualizing Solr query performance. Imagine for instance that you want to improve the query response time of your Solr index, and have a theory that it will help to split a large collection into multiple shards. To check whether this is the case, it is first important to be aware that a query with few hits typically has a shorter response time than a query with very many hits.
It is therefore insightful to check performance across groups of queries with, say, 10-100 hits, 100-1000 hits, 10K-100K hits and so on. There is also the question of caching. If a specific query has been made before, the response time is shorter and might give a misleading idea of a Solr instance’s performance. Consequently, one needs to do many queries, which results in a set of valuable statistics. If these tests are run before and after the shard split, the results can be compared and the performance gain becomes very visible. The method was used many years ago in presentations at previous IIPC conferences, but does not seem to be actively used today. The presenting organization is currently indexing on new infrastructure, and the method has been very useful in making decisions in this process, which is why we would like to highlight it in this poster. Sustaining web archiving through instruction New York University Libraries, United States of America According to the National Digital Stewardship Alliance (NDSA) 2022 Web Archiving Survey, "few organizations dedicate more than one, full-time employee to web archiving." American organizations’ staffing for web archiving has stagnated, with the majority of practitioners working on it for only a quarter of their professional time, in line with the results from the 2017 survey. With very little staff time devoted to web archiving, building and sustaining a program can be difficult and leaves no room for developing practices in the field. Over the last nine years, conversations around quality assurance, ethics, access and description for web archives have also fallen by the wayside in the United States in favor of similar conversations around event-based collecting and technical developments. But once these events are over, web archiving practitioners are still needed to maintain these collections over the long term. By providing training and instruction that covers not just the basics of web archiving but also workflows and policies, we can build up the understanding that web archiving is not a “set it and forget it” activity and needs more than a single staff member working at 25% of their time. This poster will focus on best practices for training students and professionals in web archiving, including quality assurance, how to use the tools, maintenance, preservation, and access, so that web archiving moves from being an extension of someone’s work to a sustainable practice in their institution and a community of practice with more people with the expertise to do better and more innovative work. The DOWARC notebook: modelling web archiving artefacts as RDF graphs in Jupyter 1The National Archives, UK; 2King's College London, UK This poster presents a local and small-scale implementation of Semantic technologies in web archiving processes and builds on the research collaboration we conducted in 2024, which delivered the draft version of the DOWARC domain ontology presented in a lightning talk at IIPC WAC 2025. To effectively manage the capture of the changes that affect live websites and webpages, web archiving practices lead to the creation of datasets composed of snapshots of web resources. Because each snapshot essentially recaptures the entirety of the archived web data object packaged into WARC files, significant issues of duplication inevitably arise over time, rendering versioning difficult to manage.
Furthermore, as each snapshot provides an instantaneous representation of the live web resource captured in a specific moment of its existence, issues of context also arise, particularly with regard to the relationship between different versions of the same resource. Such issues have an impact on the long-term sustainability of web archiving practices and can also affect future reuse of web archives, by engendering contextual ambiguities. Our research explores affordances of Semantic technologies in tackling versioning and context-related issues in web archiving practices. Although Semantic technologies such as RDF and Linked Data are being implemented by web archives to enrich discovery-of/access-to a web archives’ collections, and/or support distant-reading of primary web resources (e.g., mapping and profiling of web communities), currently they are neither being used to support sustainable versioning and address issues of context, nor are considered useful in tackling the preservation challenges specifically presented by web resources. Our implementation aims to fill this gap and demonstrate the potential effectiveness of Semantic technologies and Knowledge Engineering techniques in providing effective means to automate the mitigation of versioning and context-related ambiguities, over large and dynamic web archived datasets. The implementation we present processes web archive data in a portable Jupyter environment and visualises it as an RDF graph. Using OS standard tools such as WARCIO and FastWARC, we extract data objects from WARC and CDX files, which we index in a database and provide with URIs. The WARC and CDX objects we then annotate and describe using DOWARC are represented as an interactive network graph. Our notebook is configured as a sandbox environment, to test and assess affordances and bottlenecks of automation when annotating Real World web archiving artefacts using the DOWARC ontology. By presenting our work to the web archiving and digital preservation community, we would like to gather community feedback on our sandbox implementation, on the specific affordances offered by Semantic technologies that we have demonstrated, but also on the limitations we have encountered and successfully/unsuccessfully tackled. We aim to identify interested institutional partners to further explore scalable implementation of Semantic technologies to support sustainable and accessible archiving and preservation of web content. Where is Hyves? Preparing hyperlinks for distant reading KB | National Library of the Netherlands, Netherlands, The Link graphs, word clouds and keyword search are frequently based on derivative data, but more than often it is unclear how this data was prepared. In this presentation I will argue that because a website is such a container source, it is important as a researcher and as an archiving institution to be clear which data is in the index and how it was pre-processed. Through several collaborative research projects I have found that preprocessing the data, in this example hyperlinks, has a lot of consequences for the subsequent analysis. Being transparent about how data is pre-processed for tooling is therefor important for the academic community. To illustrate this point I will discuss two use cases: Hyves (a former social media platform) and XS4ALL (one of the first public internet service providers in [COUNTRY]) analysed within the SolrWayback and a custom linkanalysis script. 
Both platforms have a similar subdomain construction, causing them either to disappear from link graphs or to be grouped together into one major node in which the individual websites disappear. This raises the question: what should a single node in the link graph represent? I argue that the level of granularity depends on the research question, and on the importance of explaining to researchers that they should take this into consideration when performing their research. It is also important to know which hyperlinks are displayed within the graph. Hyperlinks can be found throughout a website: there are embedded content and anchor hyperlinks, but also scripts and fonts. Differentiating the kinds of hyperlinks within a visualisation is as important as knowing how they are cropped. When a tool or analysis does not differentiate this, the bigger platforms will always come out on top, eclipsing smaller but perhaps more important individual websites, because they have a stake in every type of hyperlink. More importantly, researching website networks based on the content of websites requires different hyperlinks than researching, for example, the techniques used to build a website. When visualizing link graphs with these thoughts in mind, you can enhance research results. Working this way can also be applied to other elements of websites, such as text or images. Text, for example, should also be differentiated into header text, footer text, article text, menu items and so forth. This brings more meaning to analysis tools and visualisations. Moreover, within a website the text is already coded through HTML, so why not use this? With this, archiving institutions can emphasize to researchers that the website is a container of many types of information and that they should be aware of this. Selecting which parts of a website to use can enhance their research, and this choice should be made wisely.
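To illustrate how the choice of node granularity reshapes a link graph, the sketch below counts the same set of extracted hyperlink targets at two levels: by full hostname, which keeps platform subdomains (such as individual user pages) distinct, and by registered domain, which collapses them into a single node. The URLs are invented examples, and the tldextract package is only one possible way to derive registered domains.

```python
# Minimal sketch with invented URLs: the same extracted hyperlinks aggregated at
# two levels of granularity, showing how the node definition changes the graph.
from collections import Counter
from urllib.parse import urlparse
import tldextract

links = [
    "http://userpage1.hyves.nl/photos",      # invented platform subdomains
    "http://userpage2.hyves.nl/",
    "http://homepages.xs4all.nl/~someuser/",
    "http://www.example.nl/about",
]

by_host = Counter(urlparse(url).hostname for url in links)

def registered_domain(url):
    ext = tldextract.extract(url)
    return f"{ext.domain}.{ext.suffix}"

by_domain = Counter(registered_domain(url) for url in links)

print(by_host)    # every subdomain (e.g. each user page) is its own node
print(by_domain)  # whole platforms collapse into single nodes such as 'hyves.nl'
```
|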
| 4:00pm - 5:20pm | POSTER SESSIONS Location: GALERIE [-2] |
| 6:30pm - 9:30pm | Dinner at Le Cercle Des Voyageurs |
| Date: Wednesday, 22/Apr/2026 | |
| 9:00am - 9:20am | MORNING COFFEE Location: GALERIE [-2] ☕️🥐 Drinks and snacks will be served in Galerie (Floor -2, next to the Auditorium). |
| 9:20am - 10:45am | SOCIAL MEDIA Location: AUDITORIUM [-2] |
|
|
9:20am - 10:25am
Social media archiving in small institutions: working alone together 1Vlaams Architectuurinstituut, Belgium; 2ADVN, Belgium; 3Amsab-isg, Belgium; 4meemoo, Belgium; 5KBR-UGent, Belgium; 6Regionaal Historisch Centrum Eindhoven, Netherlands; 7UGent, Belgium This panel takes a critical look at the work of several private cultural archives. In recent years, the findings of a joint research project on social media archiving have been incorporated into the organisations. To make progress with limited resources a specific approach to social media archiving was rolled out. The cultural archives are subsidised at regional level and build collections that complement the national collections of [INSTITUTION] and the [ARCHIVES]. These small private institutions work on a different scale and from a different perspective on social media archiving. During this session, three cultural archives will briefly present concrete steps that have been taken in the areas of selection, knowledge sharing and archiving to further embed the project in regular operations supported by a community of practice. This will be followed by a panel discussion of 40 minutes that will delve deeper into the challenges for each component and evaluate the steps taken. Based on propositions, experts from different backgrounds (research, regional public archive abroad, technical profile, national heritage institution) and institutions will engage in a discussion that will (hopefully) yield new insights. The presenters (different private archives) will moderate the panel. Some example propositions: - The small steps being taken by cultural archives, alongside those of national heritage institutions, are valuable. Social media must be archived at various levels by heritage institutions (national, regional, local). (What should be the role of large archives and libraries? Should there be coordination and how?) - It is more important to secure and preserve the data than to make it available. (Should we be concerned about our ecological footprint?) - It is not worthwhile to archive comments on posts. They mainly contain nonsense and rarely relevant information. - Archiving incomplete datasets is not worthwhile and therefore irresponsible. (What minimum criteria should heritage institutions use to determine what is worthwhile?) - We must ask permission from all parties involved before archiving. - We must better convince our archive creators to export their data themselves. (What are the arguments for and against? How do we do that? ) Small scale selection of social media (presentation) When you are a small archive with only half to one digital archivist, you have to be happy with small steps. After all, that archivist is responsible for setting up a digital preservation system, acquiring, preserving and giving access to a multitude of complex digital file formats. Despite the many tasks, it is necessary to start archiving social media before the data becomes inaccessible. A first step is to map and select the social media you want to archive. We recently started drawing up a seed list and establishing selection criteria. We use our own collection plan, websites and the MOSCOW principles to determine priorities. In a short presentation some examples illustrate this approach, the challenges (i.e. deduplication) and the gaps (i.e. randomness and bias) to feed later panel discussions. 
The community of social media archiving in practice (presentation) A community of practice on social media archiving developed various initiatives to safeguard its knowledge and experiences. Working groups were set up to share best practices (Twitter/X research) and test results of replay tools (SolrWayback). We organized edit-a-thons to update existing manuals and created new ones for a diverse range of archiving tools. Developing a sustainable network helps us to ensure our knowledge and expertise is not lost but can be embedded within our small private archival institutions. But what is the balance between effort and output? What roles do we take as an institution and archivist within that network? The inherent incompleteness of archived social media data (presentation) Regardless of the method used to preserve social media content, archived datasets will almost always be incomplete or imperfect. With participatory archiving – where the archival creator uses the platform’s export function to obtain a copy of their own data – significant contextual information is lost. For example, we only receive the archival creator's own comments, without the surrounding interactions that give them meaning. Web scraping methods also lead to imperfect archived datasets. For instance, depending on the tools used, the visual appearance and user experience of the original platform are often not preserved. Certain elements, such as comments or embedded media, are in practice also difficult or impossible to capture in full. These limitations are not solely technical; human factors also contribute. Delays in initiating the archiving process, particularly in event-driven archiving, can result in the loss of valuable content that has already been removed from the web. This raises a difficult question for web archivists: how should we address these imperfect conditions? By examining a series of cases where the archiving process went wrong, we propose a pragmatic approach that demonstrates how even flawed or partial efforts can still yield historically valuable data. Panel of external voices from different organisations and backgrounds (names were removed on request of the WAC program committee) They are available in the remarks for the program committee and chair:
10:25am - 10:45am
Making 1.2 billion social media posts accessible: a user-centric search interface for large-scale Twitter archives INA - Institut national de l'audiovisuel, France Archiving social media platforms represents a major scientific, documentary, and civic challenge. In order to secure our digital heritage, our institution has undertaken the task of collecting and archiving content from Twitter and, more recently, Bluesky. Over the past decade, the chosen strategy has resulted in an archive of 1.2 billion tweets and posts from 16,000 accounts and 3,200 thematic hashtags, accompanied by 25 million archived videos. While the resulting massive scale of these archives creates a multitude of opportunities, it also comes with new challenges. How do we design access systems that remain sustainable as archives scale from millions to billions of items? How can such a vast archive be made accessible, intelligible, and useful? Researchers require sophisticated filtering capabilities to construct meaningful corpora, as simple keyword searches on collections of this magnitude return overwhelming and unusable results. The general public needs intuitive and reliable tools to explore topics of interest, such as media events, cultural trends, and political and societal discussions. This presentation demonstrates a production-ready consultation interface designed to address these challenges. Built as a JavaScript web application with an Elasticsearch cluster backend, it provides multiple access points tailored to diverse research methodologies: - Faceted Search Engine: Full-text search combined with progressive filters for media type, language, hashtags, emojis, and engagement metrics (likes, retweets, replies, citations), enabling users to refine queries across multiple dimensions simultaneously. The presentation will include a live demonstration highlighting real research use cases that illustrate how preserved archived content enables important scholarly investigations. Beyond demonstrating the interface, this contribution aims to foster discussion about broader sustainability challenges in social media archiving. Platform migrations — such as the ongoing transition from Twitter to Bluesky — raise further fundamental questions: how can we design interfaces and data models that adapt to evolving platform ecosystems while maintaining data integrity and access? How can we ensure these archives serve as sustainable tools for research communities and the public? |
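To make the faceted model concrete, here is a minimal sketch of the shape of one query such an interface might send to its Elasticsearch backend. The index name ("tweets"), all field names, and the endpoint are illustrative assumptions, not INA's actual data model.

```python
# A hypothetical faceted query against an Elasticsearch index of posts.
# Index and field names are assumptions for illustration only.
import requests

ES_URL = "http://localhost:9200"  # assumed cluster address

query = {
    "query": {
        "bool": {
            "must": [{"match": {"text": "climate"}}],        # full-text search
            "filter": [
                {"term": {"lang": "fr"}},                    # language facet
                {"range": {"retweet_count": {"gte": 100}}},  # engagement facet
            ],
        }
    },
    "aggs": {  # facet counts displayed alongside the result list
        "hashtags": {"terms": {"field": "hashtags", "size": 20}},
        "media_type": {"terms": {"field": "media_type"}},
    },
    "size": 25,
}

resp = requests.post(f"{ES_URL}/tweets/_search", json=query, timeout=30)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("text", "")[:80])
```

Each added filter narrows both the hit list and the aggregation counts, which is what lets users refine a billion-item corpus progressively instead of relying on a single keyword query.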
| 9:20am - 10:45am | COLLECTIONS AS DATA: WORKFLOWS & USE CASES Location: PANORAMA [+6] |
|
|
9:20am - 9:42am
Web archives of tragedy: ethical, sustainable access and research use for 9/11 collections University of Waterloo, Canada During and after the September 11, 2001 (“9/11”) attacks, web users exchanged tens of thousands of emails, listserv posts, BlackBerry messages, and blog comments. Much of this material was captured in exceptional crawls by the Internet Archive and the Library of Congress, or later collected by the September 11 Digital Archive. Read together, these sources enable a minute-by-minute social history in which unity and care coexisted with fear, backlash, and hate, patterns further shaped by platform affordances and moderation practices. Yet this evidentiary base remains fragmented across crawls, platforms, file types, and modes of arrangement and presentation. This talk presents a practical model for sustainable access and research use by constructing a releasable, reusable dataset that harmonizes multiple September 11–related web-archival collections (e.g., Yahoo! Groups and web-hosted listservs), totaling tens of thousands of messages. The workflow covers content-hash deduplication; date-time normalization to Eastern Time (anchored to verifiable real-world events); thread reconstruction when possible; and a common schema that structures headers, body text, and related paratext (e.g., moderation notes) into designated fields. The resulting datasets are packaged as CSV and Parquet for straightforward download and reuse and are currently hosted as private collections on Hugging Face pending release decisions. Many items have effectively enjoyed privacy-by-obscurity in the Wayback Machine or as archive objects not exposed to search engines. When harmonized and made machine-indexable, they become trivially discoverable, including personally identifiable information. A user who posted under their own name to a public list in 2001 could not reasonably anticipate the 2025 search environment or large-scale text mining. While case-by-case review, which can attend to the context of creation, reasonable expectations of privacy, and the purposes of reuse, can guide my own individual decisions, it does not scale to tens of thousands of messages. At IIPC, therefore, I hope to gather community input on what to release, how, and with what documentation, and to share my own best practices. In my own work, I have imposed quoting thresholds on records that should not be identifiable, anonymized names and email addresses in some files, and documented provenance and processing choices so downstream users can determine what to use.
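Two of the workflow steps named above, content-hash deduplication and date-time normalization to Eastern Time, are simple to illustrate. The sketch below assumes a hypothetical message layout ("body", "sent"); it is not the presenter's actual pipeline or schema.

```python
# Minimal sketch: hash lightly normalized bodies to drop duplicates, and
# re-express timestamps in Eastern Time. Field names are assumptions.
import hashlib
from datetime import datetime
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")

def content_hash(body: str) -> str:
    # Collapse whitespace and case so trivial variants hash identically.
    canonical = " ".join(body.split()).lower()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def deduplicate(messages: list[dict]) -> list[dict]:
    seen, unique = set(), []
    for msg in messages:
        digest = content_hash(msg["body"])
        if digest in seen:
            continue  # duplicate body; keep only the first capture
        seen.add(digest)
        dt = datetime.fromisoformat(msg["sent"])
        msg["sent_eastern"] = dt.astimezone(EASTERN).isoformat()
        unique.append(msg)
    return unique

msgs = [
    {"body": "Is everyone okay?", "sent": "2001-09-11T13:05:00+00:00"},
    {"body": "Is  everyone   okay?", "sent": "2001-09-11T13:06:00+00:00"},
]
print(len(deduplicate(msgs)))  # -> 1
```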
9:42am - 10:03am
Creative access - Lessons from the Digital Ghosts exhibition University of Edinburgh, United Kingdom This paper presents the lessons learned from the Digital Ghosts exhibition, a practice-based research project exploring how artistic and creative methods can enhance public engagement with web archives. Centred on the Scotland on the Internet curated collection, the project investigated how visualisation, data enrichment, and storytelling can improve awareness and usability of archived web content among non-specialist audiences. The exhibition showcased collaborative works created by an interdisciplinary team of archivists, data scientists, artists, and informatics students. Through data-driven artworks and interactive interfaces, the exhibition translated web archive metadata into tangible and visually engaging forms that encouraged visitors to reflect on digital presence, disappearance, and collective memory. Public engagement activities, including a panel discussion and participatory workshops, further enabled dialogue between archivists, artists, and users on issues of selection, loss, and representation of [redacted] online heritage. A key component of the project was the preparation and enrichment of a dataset derived from the Scotland on the Internet collection, used both for artistic interpretation and as an educational resource. The process of structuring and visualising this web archive metadata offered an entry point for students and artists to engage with the complexities of humanities data, such as gaps, inconsistencies, and ethical and legal considerations. By integrating web archive material into data science teaching, the project aimed to familiarise future data users with the interpretive and contextual challenges of GLAM datasets, while exploring use cases to encourage the future utilisation of web archive data. To assess the impact of these creative interventions, the project incorporated user research in the form of visitor surveys and focus groups conducted with exhibition visitors, workshop participants, and student groups. Based on the results of the user research and through documenting this interdisciplinary process, the paper argues that creativity is not merely an outreach tool but a sustainable access strategy that bridges preservation and access, facilitates communication between archivists, outreach specialists, researchers, and users, and supports web archives literacy. Situated within the Access and Research Use track, the paper offers conference attendees a tried and tested framework for integrating data enrichment, as well as creative and participatory methods, into web archive engagement.
10:03am - 10:24am
Developing a sustainable workflow for UK Web Archive collections as data British Library, United Kingdom The UK Web Archive collects and preserves websites published in the UK, encompassing a broad spectrum of topics. The entire collection amounts to approximately 2 petabytes (PB) of data. The archive includes curated or thematic collections that cover a diverse array of subjects and events, ranging from General Elections, blogs, and the UEFA Women’s Euros, to Live Art, the History of the Book, and the French community. 2026 is a special year for the UK Web Archive, as it is celebrating its 21st year of curating web archive collections. In the early years these collections followed a simple structure: a title and a list of related websites, subsections of websites, individual web pages and documents published on the web. The implementation of the curation software in 2013 enabled the use of hierarchical structures to curate collections. Most of the hierarchical collections have one or two subsections, but some have up to four. The UK Web Archive provides an essential resource for studying the evolution of web publishing formats and for accessing a comprehensive record of content published on the web. Due to limitations of the Legal Deposit Regulations, creating datasets of web archive content poses both technical and legal challenges. However, the metadata created by UK Web Archive collaborators sits outside the limitations outlined by the Legal Deposit Regulations and can be repurposed to create datasets for further research. To date, we have published the metadata of a number of our curated collections as data through the British Library Research Repository. The metadata was extracted from backups of the curation management tool. The first tranche of collections as data was extracted from a backup of our curation software in July 2023, at which point there were 173,961 curated records in the collection. The second tranche was extracted from a backup of our curation software from October 2023, which held 181,551 curated records. This presentation runs through a number of the processes involved and the lessons learnt from developing these new workflows.
It is hoped that this presentation can enable further discussion on publishing collections as data within the web archive community. These discussions will then help to develop best practice for enabling reuse of web archives within the research community.
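As an illustration of what publishing collection metadata "as data" can involve, the sketch below flattens a hierarchical collection structure of the kind described above into a flat CSV. The record layout is a simplified assumption, not the curation tool's real data model.

```python
# Minimal sketch: walk nested collection records and emit one CSV row per
# curated website, carrying the full subsection path for context.
import csv

collections = [{
    "title": "General Election",               # invented sample data
    "records": [{"title": "Party A site", "url": "https://example.org/a"}],
    "subsections": [{
        "title": "Candidates",
        "records": [{"title": "Candidate blog", "url": "https://example.org/b"}],
        "subsections": [],
    }],
}]

def walk(node, path, rows):
    here = path + [node["title"]]
    for rec in node.get("records", []):
        rows.append({"collection_path": " > ".join(here), **rec})
    for sub in node.get("subsections", []):
        walk(sub, here, rows)

rows = []
for coll in collections:
    walk(coll, [], rows)

with open("collections_as_data.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.DictWriter(fh, fieldnames=["collection_path", "title", "url"])
    writer.writeheader()
    writer.writerows(rows)
```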
10:24am - 10:45am
Bridging the Web Archive and the Library: a Linked‑Data Model for FAIR Web Archive Integration German National Library, Germany Our library makes its data available as linked open data. Since 2012, we have operated a web archive, which is currently being redeveloped in-house with an open‑source approach to increase capacity. Furthermore, the web archive is being integrated with the overall digital library architecture, which involves ingest through the library's digital object import pipeline, cataloguing of the digital objects in the integrated library system, and storage in a common repository for digital objects. Thus far, the metadata of the web archive has been converted into the library's internal data format. However, it has become apparent that current bibliographic standards cannot capture the complexity and characteristics of web resources. Additionally, the web archive should provide sufficient metadata to allow data-based research on the digital holdings in a way that is adapted to the web medium. The overall architecture of the web archive involves several components that produce metadata about the digital objects, and others that require that data as input or enrich it. These components include seed selection, crawlers, file format checkers, quality assurance, metadata extraction, subject indexing, CDX indexing, and the playback system. |
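As a rough illustration of the direction described, the sketch below models a single archived website as linked data using the rdflib library and generic Dublin Core terms. The namespace and property choices are assumptions made for illustration; the library's actual model and vocabularies may differ.

```python
# Minimal sketch: describe one web archive capture as RDF triples.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF, XSD

ARCHIVE = Namespace("https://example.org/webarchive/")  # hypothetical namespace

g = Graph()
g.bind("dcterms", DCTERMS)

capture = URIRef(ARCHIVE["capture/20260420/example.de"])
g.add((capture, RDF.type, DCTERMS.BibliographicResource))
g.add((capture, DCTERMS.title, Literal("Example.de homepage")))
g.add((capture, DCTERMS.source, URIRef("https://example.de/")))
g.add((capture, DCTERMS.created,
       Literal("2026-04-20T10:00:00", datatype=XSD.dateTime)))
g.add((capture, DCTERMS.format, Literal("application/warc")))

print(g.serialize(format="turtle"))
```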
| 9:20am - 10:45am | SHORT TALKS Location: CONCERT [+4] |
|
|
9:20am - 9:30am
Environmentally-friendly digital preservation policies and infrastructure at the National Library of Norway National Library of Norway, Norway The National Library of Norway has been a certified environmental “lighthouse” organization since 2015, indicating that it complies with a defined set of environmental criteria. This has required the library to implement and sustain many environmentally-friendly policies, including several related to digital preservation and storage, that may be of interest to the international community. One core aspect of this work is energy efficiency. The library’s digital collections currently total more than 18 petabytes of data. This data is regularly checked for bit rot and is preserved using the 3-2-1 standard of digital preservation, wherein we preserve 3 copies of each file, on 2 different storage technologies, with 1 file copy stored at a different geographical location. To reduce our energy use in this work, the library uses an energy-efficient technology for our disk systems called MAID (Massive Array of Idle Disks). This storage technology reduces power consumption by only allowing disks to spin when they are in active use, so that most hard drives are kept inactive and turned off to save energy and extend their lifespan. Although it affects application performance during data access, MAID is effective for storing data that is rarely used, such as archival data that does not change and is rarely accessed. This provides almost 60% energy savings. Another aspect of the library’s sustainable data storage practices focuses on data minimization. The library stores material in file formats that meet international standards and that can also be compressed to reduce the total volume of information we store, such as the JPEG2000 file format. Our data is also stored in what is often referred to as a “cold climate” data storage facility: the National Library is based in the northern city of Mo i Rana, 30 kilometers south of the Arctic Circle, and the storage facilities are built into the side of Mofjellet mountain. For seven months of the year, the monthly average temperature is below 0 degrees Celsius. This stable, even, cold climate requires less energy to keep the storage servers cool. Finally, the library uses 98% renewable energy sources, including wind and hydroelectric sources, to maintain this infrastructure. There are still more measures the library can take to improve sustainability in our operations. For example, we soon plan to further optimize our energy use by recycling heat from the data center to warm buildings. Another area for improvement is our file integrity checking, which is not as efficient as it could be. We use checksum technology to check for bit rot: all preserved files are assigned a checksum, or fingerprint, and computing power is needed every time a check is run to confirm that a file has not changed. We compare the stored checksum against the calculated checksum for a file each time it is retrieved from our digital preservation system, but this is processing that could be avoided if we used technology that more effectively maintained the integrity of a file.
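The checksum workflow described above fits in a few lines of code. A minimal sketch, with placeholder paths and digests:

```python
# Recompute a file's SHA-256 fingerprint in streaming fashion and compare
# it with the stored value; a mismatch signals possible bit rot.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)  # stream in 1 MiB chunks; never load the whole file
    return h.hexdigest()

def verify(path: Path, stored_digest: str) -> bool:
    """True if the file still matches its recorded fingerprint."""
    return sha256_of(path) == stored_digest

# Usage (placeholder path and digest):
# if not verify(Path("master/file0001.jp2"), "9f86d081884c7d65..."):
#     print("fixity failure - restore from one of the other two copies")
```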
9:30am - 9:38am
Environmental Issues on the Web: Building and Promoting a Thematic Archive National Library of France, France In 2020, our institution took part in the Climate Change IIPC collaborative collection and drew inspiration from this initiative to set up its own collection on environmental issues. We felt it was essential to include these major issues for our contemporary society in our collections. That is why, since 2020, we have been running an annual collection entitled ‘Environmental Issues’. The aim of this collection is to highlight expressions, reactions, actions, representations and reflections relating to environmental issues on the internet. It comprises eight themes, in order to cover the multiple aspects of these issues (scientific, economic, artistic, etc.) as well as the different types of website producers. It currently has more than 800 selections made internally by librarians, as well as by partner libraries in the regions. In this lightning talk, we would like to present this collaborative collection on a national scale, as well as the various initiatives implemented to promote it to the public. In December 2023, we published a thematic, edited selection of archived pages (also known as a “guided tour”) about “The environment on the web”. This tour is divided into 14 themes such as “Issues, Concepts and Theories”, “Biodiversity and Species Extinction”, “Urban Planning and Land Use”, and “Everyday Citizen Action”. As our collections can only be accessed within the research rooms of our library, we have also published on our website the seed list of this collection, as well as a version of the tour with screenshots, for which we sought the website owners’ authorization. This collection and its promotion are a good example of how we build and develop a thematic collection in our library and how we can help the public better understand the challenges posed by climate change.
9:38am - 9:46am
Storing URLs, targets, and other time-varying entities in a database as a path to sustainable recordkeeping Hungarian National Museum Public Collection Centre National Széchényi Library, Hungary A recurring problem with mass web archiving, e.g., at the level of an entire top-level domain, is how to record the targeted content and the changes in the associated URL(s) over time. This issue is related to seed list maintenance: in the case of larger harvests, it is necessary to exclude websites that were previously saved but are no longer functional, meaning that there is no longer any content behind a given URL, or it no longer belongs to that website. The lightning talk presents a flexible concept that can be used to manage the relationships between URLs of different structures (with or without the http or https protocol, with or without www), their changes over time, and their connection to the website as an entity. The essence of the solution is an entity-based SQL database that is capable of recording all changes over time in a non-redundant manner by ensuring Third Normal Form (3NF). The main entities stored in the database, such as target and URL, are linked to each other, to themselves, and to tables containing information about them using junction tables. This solution ensures scalability: the information stored about each entity can be expanded arbitrarily, and the 'date_from' and 'date_to' fields in the junction tables can be used to record when the given relations were valid. Linking the entity tables to themselves allows us, for example, to link alternative URLs to each other in time. The information stored about each entity allows for complex queries: for example, in the case of the target, the type (website, web page, file, etc.), or in the case of URLs, the status code is stored in a separate table. The junction tables also ensure that changes over time are recorded, so that, for example, it is possible to query which URL belonged to a given entity (e.g., a file on a website) during a given period. All this contributes greatly to sustainability, as it provides a much more economical, easier to use, and more flexible query solution than previous data storage methods, such as Google Sheets spreadsheets.
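A minimal sketch of this entity/junction design, using SQLite, might look like the following. The table and column names follow the description above (target, url, date_from/date_to junctions), but the exact schema is an assumption.

```python
# Minimal sketch: entities in 3NF, linked by a junction table whose
# date_from/date_to columns record when each relation was valid.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE target (
    target_id INTEGER PRIMARY KEY,
    target_type TEXT NOT NULL          -- website, web page, file, ...
);
CREATE TABLE url (
    url_id INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE           -- scheme/www variants are separate rows
);
CREATE TABLE target_url (              -- junction table with validity interval
    target_id INTEGER REFERENCES target(target_id),
    url_id    INTEGER REFERENCES url(url_id),
    date_from TEXT NOT NULL,
    date_to   TEXT                     -- NULL = still valid
);
""")

con.execute("INSERT INTO target VALUES (1, 'website')")
con.execute("INSERT INTO url VALUES (1, 'http://example.hu/'), (2, 'https://www.example.hu/')")
con.execute("INSERT INTO target_url VALUES (1, 1, '2015-01-01', '2019-06-30')")
con.execute("INSERT INTO target_url VALUES (1, 2, '2019-07-01', NULL)")

# Which URL belonged to target 1 on a given date?
row = con.execute("""
    SELECT u.url FROM target_url tu JOIN url u USING (url_id)
    WHERE tu.target_id = 1
      AND tu.date_from <= '2020-05-01'
      AND (tu.date_to IS NULL OR tu.date_to >= '2020-05-01')
""").fetchone()
print(row[0])  # -> https://www.example.hu/
```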
9:46am - 9:54am
Web archiving automation at the Mexico Digital Preservation Group: error assessment and quality control 1National Library of Mexico, Mexico; 2Digital Preservation Group, Mexico In Mexico, progress continues to be made in web archiving, which has become a fundamental strategy for preserving digital heritage, especially given the volatile and ephemeral nature of online content. In this context, the Digital Preservation Group of Mexico (GPD) has experimented with an automated web archiving system to capture, store, and preserve digital resources relevant to the country's collective memory. This study focuses on detecting errors during the capture process and on the strategies applied to ensure the quality of the resulting archives. Using an empirical, applied approach combining observation and experimentation to address practical problems, the automated tool Browsertrix (from Webrecorder) was used, along with systematic reviews of the files generated in WARC format. Twenty-four websites were captured in 2025, including catalogs, databases, and repositories. The analysis focused on the frequency, type, and cause of detected errors (e.g., broken links, missing sitemaps, uncaptured dynamic content, JavaScript issues, or multimedia format problems) and the effectiveness of the applied quality control mechanisms. The results reveal that while automation allows for a significant increase in archiving coverage, it also introduces considerable technical challenges, which we will discuss in the lightning talk. Recurring error patterns were identified, linked to highly dynamic sites with complex structures, highlighting the need for specialized configurations and iterative validation processes. The importance of establishing contextualized quality criteria, beyond purely technical parameters, is also discussed, integrating aspects of cultural, institutional, and legal relevance. The lightning talk concludes with a series of practical recommendations for similar projects in Latin American contexts, emphasizing the importance of a flexible technical infrastructure, automated monitoring capabilities, and a clear policy for collaborative digital preservation. This work contributes to the development of standards and best practices for institutional web archiving in the region, and opens the door to future research on automated curation and preservation of emerging content such as social networks, alternative media and ephemeral resources.
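One QA step described above, reviewing the generated WARC files for capture errors, can be sketched as follows, assuming the warcio library. The reporting is illustrative rather than a reproduction of GPD's actual tooling.

```python
# Minimal sketch: tally non-2xx response records in a WARC file.
from collections import Counter
from warcio.archiveiterator import ArchiveIterator

def scan_warc(path: str) -> Counter:
    errors = Counter()
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            status = record.http_headers.get_statuscode()
            if status and not status.startswith("2"):
                url = record.rec_headers.get_header("WARC-Target-URI")
                errors[status] += 1
                print(status, url)   # candidate for re-crawl or patching
    return errors

# Usage: print(scan_warc("crawl-2025.warc.gz"))
# -> e.g. Counter({'404': 12, '503': 3})
```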
9:54am - 10:02am
Sustainable and systematic: building a search index of research and practice in web archiving and digital preservation 1Digital Preservation Coalition, United Kingdom; 2IIPC, United States of America; 3Cartlann Digital Services, Ireland Over the years, through events such as the IIPC Web Archiving Conference, iPRES - International Conference on Digital Preservation, and various collaborative projects, the digital preservation and web archiving communities have built an extensive repository of knowledge. However, a persistent challenge has been to provide a single, citable point of access to these dispersed resources. Our project introduces the Awesome Indexer [1], which brings together digital preservation and web archiving resources into a single search interface and database. Our key argument is that centralised discovery is crucial for the long-term sustainability of these resources, encouraging reuse of and investment in those resources rather than attempting to replace them. The tool works by accepting a range of standardised bookmark and bibliographic sources, such as Awesome Lists, Zotero [2], and Zenodo collections. Zotero is a particularly powerful source, as the established tools and workflows around Zotero collection management make it easy to pull in records from a wide range of sources, from traditional publisher websites through to YouTube playlists and content hosted by digital libraries [3]. The Awesome Indexer combines the data from these sources to generate a dedicated faceted search system, built using off-the-shelf tools and packaged as a simple static website. It also creates SQLite and Apache Parquet versions of the same data, allowing richer exploration and analysis of the sources in the index. The Indexer is an open-source tool that can be used by anyone to build their own index. This “work-in-progress” short talk will briefly trace the development of the Indexer, detailing the steps it required and the challenges posed by its underlying resources. The current version of the Digital Preservation Publications Index (DPPI) will be demonstrated to highlight how the Indexer consolidates decades of content from across multiple platforms into a single, comprehensive entry point. This significantly improves discoverability, facilitates citation, contributes to training, and maximises the impact of our collective knowledge for practitioners and researchers. References: [3] An example of a web archiving collection hosted by the University of North Texas Digital Library: https://digital.library.unt.edu/explore/partners/IIPC/
10:02am - 10:10am
Querying the archived web with an AI assistant 1Aarhus University, Denmark; 2Macquarie University, Australia The archived web is an indescribably rich primary source for contemporary history. However, only a handful of historians have started including the archived web in their source material when investigating phenomena from the 1990s and 2000s (Mackinnon, 2022; Millward, 2025; Winters, 2017). This lightning talk presents exploratory work on exploring and discovering content from web archives through an *AI Research Assistant*, guided by research questions from the discipline of history.
10:10am - 10:18am
Online annotation platform for web archives Arquivo.pt, Portugal Search engine evaluation relies heavily on high-quality test collections that reflect user information needs and relevance judgments. However, building such collections is resource-intensive, requiring systematic annotation of queries and results. The service is a web-based platform designed to streamline this process by enabling the annotation of search engine results in a user-friendly and collaborative environment. The tool allows assessors to annotate retrieved documents according to predefined relevance criteria, supporting the creation of standardized datasets for training, tuning, and benchmarking retrieval models. Our web archive is a research infrastructure that provides tools to preserve and exploit data from the web to meet the needs of scientists and ordinary citizens, and our mission is to provide digital infrastructures to support the academic and scientific community. However, until now, our web archive has focused on collecting data from websites hosted under the .PT domain, which is not enough to guarantee the preservation of relevant content for the academic and scientific community. Our web archive provides a “Google-like” service that enables searching pages and images collected from the web since the 1990s. Note that our web archive search complements live-web search engines because it enables temporal search over information that is no longer available online on its original websites. Developed within the context of our web archive, the service facilitates the generation of reliable ground-truth data, while remaining adaptable to different domains and languages. By lowering the barriers to annotation, this platform contributes to the reproducibility, scalability, and improvement of search technologies. The main objective is to provide, in the future, a dataset with public access to support researchers. This will make it possible to compare users’ search behavior between live-web and web-archive search engines.
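For a sense of what such annotations might look like once exported, the sketch below writes judgments in the widely used TREC qrels format. The sample query and document identifiers are invented; only the file format itself is standard.

```python
# Minimal sketch: export graded relevance judgments as TREC qrels lines
# (query-id, iteration, document-id, relevance grade).
judgments = [
    ("q001", "19961013000000/http://example.pt/", 2),   # highly relevant
    ("q001", "20010305000000/http://example.pt/x", 0),  # not relevant
]

with open("relevance.qrels", "w", encoding="utf-8") as fh:
    for query_id, doc_id, grade in judgments:
        # The second column is the unused "iteration" field, by convention 0.
        fh.write(f"{query_id} 0 {doc_id} {grade}\n")
```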
10:18am - 10:26am
Warc School - fellowship & training program update 1College of Wooster Libraries, United States of America; 2Shift Collective Archiving the Black Web was founded on a commitment to create pathways for underrepresented voices and marginalized communities to access web archiving skills, knowledge, and networks. Our work addresses not only “ensuring equitable access to archived web content,” but also ensuring equitable access to who gets to participate in the practice of web archiving and what gets privileged to be part of a web archive collection. At IIPC WAC 2024, Archiving the Black Web shared details about our project’s efforts to reduce these disparities with the upcoming launch of our fellowship and training program, Warc School. Developed for memory workers dedicated to collecting and preserving Black history and culture online, the fellowship offers web archiving training to enhance their memory work or digital content creation practice. In April 2025, Warc School welcomed 22 fellows representing traditional archives, community-based archives, Historically Black Colleges and Universities, public libraries, and independent scholars and creators to complete our 10-month training program, which includes five courses and a practicum. In this session, join Archiving the Black Web for a brief update on lessons learned while developing a training program and its curriculum, recruiting fellows and faculty, as well as highlights from student practicum projects. Attendees will also hear about our new initiative to strengthen social sustainability, with details about the launch of our second cohort. This cohort will include fellowship opportunities not only for memory workers but also for journalists at Black newspapers interested in digital preservation through web archiving training. Information integrity and ethical considerations related to artificial intelligence will be incorporated into the 2026 Warc School curriculum.
10:26am - 10:34am
Organizing the 'Social Mess': a comprehensive Tool for Social Media and Instant Messaging Archiving 1University of Pavia, Italy; 2University of Bologna, Italy The exponential growth of digital content through social media and instant messaging platforms presents critical challenges for digital preservation. Born-digital communications—created in fragmented, proprietary environments where personal and public spheres overlap—remain largely excluded from systematic archival practices despite their historical and cultural significance. Within the national archival context, there are no comprehensive tools to preserve and manage these materials for individuals, institutions, or public figures whose digital traces hold substantial value for future research. This gap affects personal archives of political and institutional figures and collections of broader cultural relevance. As part of a collaborative research initiative on preserving contemporary digital archives, we are developing a software tool for individual users and institutional archivists. This collaborative effort, which draws on our professional experience, highlights an urgent need to address technical and methodological shortcomings in this field. Existing tools—typically command-line utilities or platform-specific applications—allow for the separate management of content from social media, messaging services, email, and so on, but do not provide integrated support within a unified solution. Our framework, in contrast, is comprehensive in its capacity to manage the complete spectrum of digital materials: traditional files alongside social media content, instant messages, and emails within a unified environment. This comprehensive approach addresses the complexity of contemporary digital archives. The software enables users to reorganize their materials systematically, making it valuable in a variety of contexts: individuals managing personal digital heritage, prominent figures preparing materials for donation, or institutions controlling and facilitating access to collections. Our Java-based solution integrates core modules, ensuring usability and data integrity. Operating through manual download and ingest processes—not APIs—it provides user control while supporting standard formats (JSON, CSV) for interoperability. The embedded database and exclusive use of open-source libraries enable platform-independent installation without external dependencies. Key functionalities include AES-256 encryption, automatic backups, metadata extraction, device synchronization, and granular permissions. Critically, access settings apply at both file and individual message levels—essential for managing diverse privacy requirements and enabling selective disclosure within complex digital collections. Currently under active development, the project aims to support institutions in visualizing and managing heterogeneous digital materials, enhance accessibility for researchers through reorganization and categorization tools, and foster inter-institutional collaboration. This session will provide participants—particularly archivists and records managers—with an overview of a collaborative project and its outcomes, highlighting an integrated approach that offers significant advancements for digital preservation practice and academic scholarship.
10:34am - 10:42am
Social media archiving, right now Digital Preservation Coalition, United Kingdom As funding cuts bite, some organisations have had to shut down offices and services at very short notice. These closures put history at risk, especially where social media is concerned. An organisation's interactions with its patrons and the wider public are a crucial part of the function of any modern organisation, and the content, comments and context are important historical records. These should not be lost simply because the funding has been pulled at short notice. Unfortunately, in situations like this, already cash-strapped archives are rapidly swamped, and are struggling to cope with the deluge of digital records and requests for assistance. The individuals with access to the social media accounts are often not the archivists themselves, nor do they have the archival or technical skills required to archive the material alone. Short of time and resources, what should they do? And with little hope of booming budgets anytime soon, what are the most sustainable approaches for the safekeeping of these complex records? This presentation will share a wide range of lessons learned while attempting to assist organisations as they rush to capture what they can from Facebook, Instagram, LinkedIn, X/Twitter and Flickr. This investigation considered and experimented with a range of strategies, including direct web archiving, API access, third-party archiving services and data exports, combined with tools like Browsertrix, ArchiveBox and wget. The advantages and limitations of these approaches will be explored and compared, highlighting the gaps between what is possible and what is practical, in the context of an urgent shutdown operation. |
| 10:45am - 11:15am | BREAK Location: GALERIE [-2] & PANORAMA FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6) and in Galerie (Floor -2, next to the Auditorium).
🏛️ GUIDED TOUR KBR: If you signed up for a guided tour of KBR, please be in front of the three main elevators on Floor -2 at 10:45. To know if you signed up for a tour, check your registration details in ConfTool. |
| 11:15am - 12:20pm | QUALITY ASSURANCE & DEDUPLICATION Location: AUDITORIUM [-2] |
|
|
11:15am - 11:37am
Deduplication in browser-based crawling with Browsertrix Webrecorder This talk discusses new deduplication capabilities recently added to Browsertrix, a widely-used open-source browser-based crawler and crawl management platform, in relation to sustainable web archiving. Browsertrix Crawler originally did not include support for deduplication, but we have recently added it as an option at the request of our users. This presentation will discuss why Browsertrix and Browsertrix Crawler did not originally support deduplication, the trade-offs introduced by adding deduplication support, and the unique challenges and opportunities related to deduplication with browser-based crawling. These trade-offs will be discussed in relation to storage efficiency and sustainability in web archiving programs. The talk will begin with some background on the early principles and capabilities of Browsertrix, and why deduplication support had not previously been added. This will include some discussion of the complexities deduplication introduces in terms of inter-crawl dependencies, and the tension between this complexity and the goal of being able to create portable, self-contained web archives. Next, the presentation will give a high-level overview of the deduplication capabilities that have been added to Browsertrix and Browsertrix Crawler. This will include our flexible model for configuring an index as a deduplication source of truth using collections of previous crawls, how deduplication has been implemented in crawls, and the consequences this introduces for replay, sharing web archives, and other post-crawl activities. Also discussed will be how browser-based crawling allows for new experimental approaches to deduplication that can potentially result in efficiency gains in crawling time in addition to storage. The remainder of the presentation will provide thoughts on when deduplication may or may not be appropriate, using use cases to help illustrate how deduplication relates to institutions’ efforts to ensure their web archiving programs are efficient and sustainable, as well as the trade-offs that users will need to consider.
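For readers unfamiliar with the mechanics, the sketch below illustrates the general idea behind digest-based deduplication: an identical payload is stored once, and later captures become lightweight "revisit" pointers to the original. This is a conceptual illustration only, not Browsertrix's actual implementation.

```python
# Minimal sketch of URL-agnostic payload deduplication.
import hashlib

dedup_index: dict[str, dict] = {}  # payload digest -> first capture
archive: list[dict] = []

def store(url: str, timestamp: str, payload: bytes) -> dict:
    digest = "sha256:" + hashlib.sha256(payload).hexdigest()
    original = dedup_index.get(digest)
    if original is not None:
        # Identical payload seen before: record a pointer, not the bytes.
        record = {"type": "revisit", "url": url, "timestamp": timestamp,
                  "refers_to": original["url"]}
    else:
        record = {"type": "response", "url": url, "timestamp": timestamp,
                  "digest": digest, "payload": payload}
        dedup_index[digest] = record
    archive.append(record)
    return record

store("https://example.org/logo.png", "20260420100000", b"\x89PNG...")
r = store("https://example.org/en/logo.png", "20260421100000", b"\x89PNG...")
print(r["type"])  # -> revisit
```

The `refers_to` pointer is precisely the inter-crawl dependency mentioned above: a crawl containing revisit records is no longer self-contained unless the referenced crawl travels with it.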
11:37am - 11:59am
Efficient quality assurance of deduplicated web archives with Browsertrix National Library of Luxembourg, Luxembourg This presentation focuses on the quality assurance of archived websites using Browsertrix at a national institutional level. In the second half of 2025, our institution completed the migration and expansion of its internal web harvesting infrastructure to the latest version of Browsertrix. This includes the crawler, the management interface and the quality assurance workflows. We introduce several enhancements to these modules, which we will discuss in this presentation, with a particular emphasis on quality control. In particular, we propose a system for making the QA process more efficient by limiting the number of pages (or samples) that are analyzed in each batch. This process provides a good indication of the overall quality of a harvest without needing to check all (often many hundreds or thousands) of its pages. Together with our crawler’s cross-crawl deduplication feature, this makes it possible to archive and analyze many terabytes of web content on a regular basis. We also present, in detail, the system architecture and design choices we made during the migration process. This includes our Kubernetes deployment, hybrid storage solution, custom registry, and multi-node setup. Our workflow is separated into three dedicated nodes, making it possible to harvest, manage and perform QA separately for: (1) behind-the-paywall news media content, (2) websites of national importance, and (3) ad-hoc collections. Our results show that Browsertrix offers many unique advantages compared to the alternatives that our institution has used previously. Furthermore, our enhanced quality assurance workflow provides an efficient, scalable means to monitor, manage, and maintain regular harvests on a daily, weekly, and monthly basis.
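One standard way to size such a QA sample is a proportion estimate with a finite-population correction, sketched below. The confidence level and margin of error are illustrative assumptions, not the library's actual thresholds.

```python
# Minimal sketch: how many pages to inspect so the observed error rate is
# within `margin` of the true rate at ~95% confidence (z = 1.96).
import math

def qa_sample_size(total_pages: int, margin: float = 0.05,
                   z: float = 1.96, p: float = 0.5) -> int:
    n0 = (z ** 2) * p * (1 - p) / margin ** 2  # infinite-population size; p=0.5 is worst case
    n = n0 / (1 + (n0 - 1) / total_pages)      # finite-population correction
    return math.ceil(n)

print(qa_sample_size(10_000))  # -> 370: inspect ~370 pages, not all 10,000
```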
11:59am - 12:20pm
A browser-based approach to measuring completeness in archived websites University of Alberta, Canada The Internet Archive, the world’s most prominent web archiving institution, has created Archive-It (AIT), a popular web archiving subscription service used by hundreds of institutions around the world to preserve their digital cultural heritage. AIT clients can choose to employ an AIT tool called Wayback QA to perform Quality Assurance (QA) on their archived websites (Archive-It, 2025). However, for those institutions that do not use AIT, or for whom Wayback QA might not scale, the QA process has remained largely manual. To address this issue, we present a browser-based approach to measuring the completeness of a collection of archived websites. First, we establish a definition of completeness, which we define in terms of the network requests that are executed by a browser in order to properly load a website. We assume the live website is the “gold standard” against which the archived website must be measured. Therefore, a fully complete archived website executes all of the same network requests that are executed when loading the original live website. The completeness of an archived website thus becomes the fraction of original network requests that are successfully executed in the archived version. Our approach operates by comparing the network requests of the live website to those of the archived website and generating a measure of similarity. The approach includes an open-source command-line tool that can be deployed without needing to manually inspect each archived website in a browser. The work presented here is meant to provide a simple way to quickly assess the quality of a web archive collection. It does not preclude the use of other web archiving tools to capture, display, or analyze web archives. The audience for this tool is composed of web archivists looking to carry out QA on their archived websites. Researchers studying web archives could also employ this tool to gauge the quality of an archived web collection at a glance. The accompanying tool was written in Python, runs from the Linux command line, and is available to download and use on the GitHub platform. It was written to be as modular as possible, with each step producing an output that is then used as input for the following step. The approach presented here has the following advantages over previous approaches: – It does not require web archivists to manually interact with each site they have archived, saving time and resources. – Additional information such as screenshots, WARC files, or crawler logs is not needed. As input, it only requires the URL of the archived website and its live counterpart. – It is an open-source tool and not proprietary. As such, it is open to further improvements and contributions from the web archiving community, and an AIT subscription is not necessary to use it. – Because the approach is browser-based rather than crawler-based, it is more focused on the user experience of archived websites. References Archive-It: How to patch crawl with the Wayback QA tool (2025), https://support.archive-it.org/hc/en-us/articles/115004144786-How-to-patch-crawl-with-the-Wayback-QA-tool |
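Under the definition above, the completeness measure reduces to a set comparison over request URLs, as in the sketch below. Collecting the request lists from a browser is out of scope here, and the URLs are invented.

```python
# Minimal sketch: completeness = fraction of the live page's network
# requests that are successfully replayed in the archived version.
def completeness(live_requests: set[str], archived_ok: set[str]) -> float:
    if not live_requests:
        return 1.0
    return len(live_requests & archived_ok) / len(live_requests)

live = {"https://example.org/", "https://example.org/app.js",
        "https://cdn.example.org/hero.jpg"}
archived = {"https://example.org/", "https://example.org/app.js"}

print(f"{completeness(live, archived):.0%}")  # -> 67%
```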
| 11:15am - 12:20pm | TECHNICAL INNOVATION AND STRATEGIES Location: PANORAMA [+6] |
|
|
11:15am - 11:37am
Archiving websites and social media of national movements: best practices of ADVN | Archives of national movements ADVN | archive for national movements, Belgium In 2018, our archive decided to expand its collection of online publications and started harvesting the websites of our archival creators to preserve their online heritage for future research. The web is constantly changing, and content is quickly modified, removed or made inaccessible, which makes archiving it a necessity. During the coronavirus pandemic we realised that the rise of social media could no longer be ignored. That was the starting point for capturing, recording, scraping and downloading social media archives as well, but we were exposed to many challenges, including technical barriers (API limitations, platform restrictions) and legal and ethical issues, which require continuous monitoring and specific strategies for effective preservation. Over these years we have developed a sustainable policy and now regularly monitor more than 5,000 channels created by our archival community.
11:37am - 11:59am
Combining browser-based and browserless crawling for better fidelity vs. efficiency tradeoffs 1University of Michigan, United States of America; 2University of Southern California, United States of America Operators of web archives can crawl pages from the web using either dynamic browser-based crawlers (such as Brozzler and Browsertrix) or static browserless crawlers (such as Heritrix). Static crawlers are more lightweight and, hence, can crawl pages at a faster rate: in our measurements, 16x faster than with a dynamic crawler. However, static crawlers miss page resources which are fetched only when JavaScript is executed; we repeatedly crawled 10K pages (spread across the top 1 million domains) both statically and dynamically for 16 weeks, and found that only 55% of statically crawled snapshots visually and functionally match the corresponding dynamically crawled snapshots. In this talk, we will present our study on how to combine dynamic and static crawling so as to serve page snapshots at high fidelity while minimizing the computational resources needed to support high crawling throughput. First, we quantified the utility of a practice which is common in web archives: reusing crawled resources either across snapshots of multiple pages or across multiple snapshots of the same page. When an archive receives a request for a resource, it serves the copy which it captured closest in time to the page snapshot it is serving. If no resource with the requested URL is found, the archive returns a resource which has approximately the same URL. We estimated the utility of these simple measures when the frequency with which an archive crawls pages matches the availability of page snapshots on the Wayback Machine. We find that, compared to crawling all pages statically, crawling 9% of snapshots with a browser suffices to increase the fraction of statically crawled snapshots which can be served without loss of fidelity from 55% to 96%. Second, to fix the fidelity issues associated with the remaining static crawls, we studied two methods for augmenting them using other dynamically crawled snapshots.
Put together, we estimate that these two measures will further increase the fraction of statically crawled page snapshots which can be served without loss of fidelity to 99%. By communicating our findings to the IIPC audience, we hope that developers of web crawlers will help translate our findings into practice.
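The "closest in time" serving rule described above can be sketched as a search over sorted capture timestamps. The data below is invented; a real archive would resolve this through its CDX index.

```python
# Minimal sketch: pick the capture of a URL nearest to the requested time.
import bisect
from datetime import datetime

captures = {  # URL -> sorted capture times (invented)
    "https://example.org/app.js": [
        datetime(2025, 1, 3), datetime(2025, 6, 1), datetime(2025, 11, 20),
    ],
}

def closest_capture(url: str, requested: datetime) -> datetime | None:
    times = captures.get(url)
    if not times:
        return None  # fall back to approximate URL matching, as described
    i = bisect.bisect_left(times, requested)
    candidates = times[max(0, i - 1):i + 1]  # neighbours around the request
    return min(candidates, key=lambda t: abs(t - requested))

print(closest_capture("https://example.org/app.js", datetime(2025, 5, 1)))
# -> 2025-06-01 00:00:00 (31 days away, vs. 118 days for the January capture)
```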
11:59am - 12:20pm
The Wasteback Machine: measuring the environmental impact of the past web The University of Edinburgh, United Kingdom
This paper introduces the Wasteback Machine, a JavaScript library that repurposes web archives to analyse historical web page size and composition. It addresses a key limitation in current approaches to web sustainability assessment, which rely on live measurements and therefore obscure the cumulative environmental effects of long-term digital growth. By making web archives amenable to quantitative analysis, the Wasteback Machine enables new forms of historical inquiry into the evolution of page size and composition and their environmental implications. In doing so, it demonstrates how web archives can function as analytical resources rather than merely records of cultural memory.
This paper will demonstrate the capabilities of the Wasteback Machine, examine representative analyses of historical web development, and situate its contributions within wider debates in web archiving and sustainability. It will further consider the reuse of “reborn” digital materials for quantitative inquiry, the long-term ecological implications of persistent web expansion, and the challenges and responsibilities facing the future of web archives.
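As a taste of this kind of analysis, the sketch below (in Python rather than the talk's JavaScript) queries the public Wayback CDX API and prints the recorded size of successive captures of one page. The "length" field is the compressed size stored in the index, so this is only a rough proxy for page weight, and it covers the HTML resource alone.

```python
# Minimal sketch: historical capture sizes for one URL via the CDX API.
import json
from urllib.request import urlopen

URL = ("http://web.archive.org/cdx/search/cdx"
       "?url=example.com&output=json&fl=timestamp,length&limit=50")

with urlopen(URL, timeout=30) as resp:
    rows = json.load(resp)

for timestamp, length in rows[1:]:  # rows[0] is the header row
    if not length.isdigit():
        continue  # some index rows lack a recorded size
    print(timestamp[:4], f"{int(length):>8} bytes (compressed, HTML only)")
```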
|
| 11:15am - 12:20pm | ACCESS & REUSE Location: CONCERT [+4] |
|
|
11:15am - 11:37am
Unlocking the web: online access to the National Library of Singapore’s web archives collection National Library Board Singapore, Singapore In 2019, the National Library of Singapore's (NLS) legislation was updated to empower it to archive websites ending in “.sg” without the need for written permission. This allowed NLS to comprehensively collect and preserve Singapore's Internet landscape by conducting large-scale domain crawling of .sg websites. Since then, about 80,000 websites have been archived every year and made available on the WebArchiveSG portal. However, due to copyright laws, these .sg websites could only be viewed at the NLS on a designated computer terminal; permission had to be given by the website owners to make them accessible online. This meant that about 88% of the collection had access restrictions, which greatly impeded the use and visibility of the collection, as library users needed to visit the NLS to view it. After five years of growing the collection, and with greater awareness of and support for web archiving, NLS set out in 2024 to explore how it could make the collection more accessible to users. A discussion with its legal team was initiated to revisit the copyright laws and study how online access could be applied to the web archives collection. This led to the creation of online access criteria for websites based on the fair use principle that the archived website is not a 100% replica of the live website. A quick takedown policy was also put in place to handle public requests promptly. Under these new criteria, the bulk of the domain crawl collection could be released for online access. Websites whose owners had previously specified onsite access, and undesirable websites (e.g. adult and gambling websites), would remain accessible at the NLS only. NLS implemented online access to its collection in the 4th quarter of 2025. This presentation will cover NLS' online access criteria for websites, its application to the web archives collection, the operational changes made to allow online access via WebArchiveSG, as well as learning points from this experience.
11:37am - 11:59am
Unlocking the Web Archive: understanding researcher needs The National Archives UK, United Kingdom Our web archive contains more than 8 billion digital objects. It holds the record of over twenty-five years of government information released to the public, yet we face significant challenges encouraging research engagement and use of this resource. Barriers to increased access to the web archive include practical constraints (which limit our ability to release the dataset to potential researchers) and the Takedown Policy (a reclosure policy which allows for the removal of sensitive content at any time). Another challenge is our own incomplete understanding of what researchers need and want from the archive, as well as a lack of understanding by users of the complexities and limitations of the web archiving process. This presentation will introduce a project conducted at our institution designed to investigate and understand researcher needs. The project was funded by the Archives’ own Strategic Research Fund, an internal funding scheme reserved to make disruptive research possible and promote inclusive practice. In October 2025, workshops were hosted to determine what researchers want from our web archive, and subsequently we are able to share some of our hopes and plans for the future. This was our first project focused specifically on research users, rather than web archive users in general. We asked potential researchers what they need from the web archive in order to succeed, and introduced the ethical constraints that we face when sharing our own data. This enabled us to make recommendations for future work with the web archive that take into account practical and ethical constraints around the release of datasets, as well as increase researcher understanding of what the web archive is and how they can use it. The workshops aimed to engage both web archive users and those curious about the potential of web archives. We invited both groups in the hope of responding to the need for equitable access in public sector web archives (Hartland, 2024)[1] and a desire to follow the UN principles of good governance, which include being “participatory … equitable and inclusive” (Schafer & Winters, 2021)[2], in web archives more generally. This presentation will discuss the future access scenarios that were proposed in the workshops, scaled from least to most computationally and resource intensive. By examining what researchers both need and want from future digital preservation infrastructures, we will explore where they draw the line on computational intensity. The findings offer insight into how our web archive can evolve to meet the demands of its research community, balancing ambition with sustainability. We hope sharing both our findings and methodological approach can be useful to other web archiving institutions. [1] Nicole Hartland, ‘Web Archives for All? Towards Equitable Access to UK Public Sector Web Archives,’ iPRES (Online, 2024). [2] Valérie Schafer & Jane Winters, ‘The Value of Web Archives,’ International Journal of Digital Humanities (Springer, 2021).
11:59am - 12:20pm
Text Mining Analysis of the discourse on ‘Archive Silences and Democracy’ 1International Hellenic University, Greece; 2Department of Library, Archival & Information Studies, International Hellenic University, Greece; 3Department of Production Engineering and Management, International Hellenic University, Greece Foucauldian discourse analysis examines how language, power, and knowledge intersect to influence what is considered "true" and shape individual and societal identities. Analogously, deconstruction theory involves identifying binary oppositions (like truth/error), reversing the traditional privilege of one term, and revealing their interdependence in the discourse (known as ‘violent hierarchies’). However, privileging certain terms or silencing others is a dangerous practice that may have a direct impact on democratic institutions. The internet constitutes today’s digital public sphere, and an interdisciplinary range of scientists is trying to identify and develop best practices for selecting, collecting, preserving and providing access to its content. Archive silences refer to the absent or distorted documentation of certain groups, stories, and perspectives within historical records, leading to gaps in the collective memory and understanding of the past. In this paper we argue that archive silences in the digital public sphere either result from, or reflect, power relations that privilege certain terms, and that this has a major detrimental impact on democratic institutions. We will try to establish whether and how this relation between archive silences and democracy is manifested. To this end, a text mining process is employed to analyze the results of the query 'Archive Silences and Democracy' within the large volume of information contained in the 40 most popular pages returned by the Google (US) search engine. Artificial intelligence algorithms are used to examine the correlations between these terms, create clusters of concepts, and determine the terms that may strongly mediate meanings between such groups of concepts. Finally, the results are graphically represented in network form, where influential words are depicted as nodes and the strong interconnections between them are represented as edges, using the InfraNodus software. Results show that archive silences are strongly related to state political censoring (even in democracies, e.g. during transitions from dictatorships). Thus, they impose selective perspectives on the construction of social memory. They are also used both in uncovering and in silencing history (colonialism, immigration), and they are usually a result of corrupted autocracies. Archive silences exist with respect to human rights violations and freedom of the press; they may be gender-based, they may hinder the quest for accountability and justice, or they can be related to infrastructure inadequacies in disasters. As shown in the constructed network, these silences are not accidental but result from factors like biased collection practices, structural inequalities, and the inherent limitations of institutions, which can inadvertently or purposefully exclude certain voices, with an obvious negative impact on democracy. Addressing archival silences involves critically examining the history presented, recognizing the power dynamics involved, and seeking out the marginalized narratives that remain unheard.
We believe that the proposed methodology contributes towards all the above, as the text-to-network transformation and graph metrics avoid subjectivity and distortion of concepts, without imposing external semantic structures. Moreover, they can be especially helpful in bringing out potential conceptual gaps, which are highlighted in the transformed geometrical space. |
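The text-to-network step described above can be sketched as follows: words become nodes, co-occurrence within a sliding window becomes weighted edges, and betweenness centrality surfaces candidate "mediating" terms. The window size and toy texts are assumptions; the paper's actual analysis uses InfraNodus.

```python
# Minimal sketch: co-occurrence graph plus betweenness centrality.
import networkx as nx

docs = [
    "archive silences distort collective memory and weaken democracy",
    "state censorship produces archive silences in the public sphere",
]

G = nx.Graph()
WINDOW = 3  # pair each word with the next two words (assumed window)

for doc in docs:
    words = doc.split()
    for i, w in enumerate(words):
        for v in words[i + 1:i + WINDOW]:
            if w != v:
                prior = G.get_edge_data(w, v, {"weight": 0})["weight"]
                G.add_edge(w, v, weight=prior + 1)

centrality = nx.betweenness_centrality(G)
for word, score in sorted(centrality.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{word:12s} {score:.3f}")  # high scores mediate between clusters
```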
| 12:20pm - 1:25pm | LUNCH Location: GALERIE [-2] & PANORAMA FOYER [+6] 🍴 Lunch will be served in Panorama Foyer (Floor +6) and in Galerie (Floor -2, next to the Auditorium).
🏛️ GUIDED KBR MUSEUM TOUR: If you signed up for a guided tour, please be by the entrance to the Museum on Floor 0 at 12:20 [1st tour] or 12:50 [2nd tour]. To know if you signed up for a tour, check your registration details in ConfTool. |
| 1:25pm - 2:30pm | ARCHIVING BLOGS AND NEWS Location: AUDITORIUM [-2] |
|
|
1:25pm - 1:47pm
Blogs to digital heritage: a British Library case study British Library, United Kingdom In 2025, the British Library undertook a time-sensitive initiative to preserve its institutional blogs hosted on the Typepad platform. The library blogs represent over a decade of research, curatorial insight, and public engagement, making them a crucial component of the institution’s digital heritage. This project aimed to preserve the content while ensuring continuity of user access and long-term discoverability through the UK Web Archive. The blogs were hosted across two domains, with Cloudflare protections active on only one. This configuration presented several challenges for crawling, including blocked requests, redirects, and embedded content across multiple subdomains. To address these issues, crawler user agents were whitelisted by the domain owners and manual crawls were conducted for content outside Cloudflare. The team compiled seed lists for manual crawling using a combination of internal metadata, Screaming Frog exports, and curated inputs. Approximately 160,000 URLs were initially identified, which were refined to around 90,000 unique URLs representing individual blog posts and associated media. Browsertrix was used for targeted crawls of these posts, and separate crawls captured embedded assets such as images, audio, and documents. After crawling, further challenges arose in consolidating the content captured from two different domains into a single, coherent viewer. Quality assurance was particularly complex, as some captures were not traditional failures but rather pages returning HTTP 503 errors instead of the expected blog content. These recurring 503 captures had to be identified and re-crawled manually to ensure every post and associated media item was fully preserved, requiring careful review and iterative verification across both domains. Throughout this project a strong focus was placed on user access and experience. The current solution includes a bespoke workflow, with support from Browsertrix, which provides a temporary route for public access until the blogs are fully integrated into the UK Web Archive. Redirects were planned at the top-level domain to route users to archived versions, with documentation, including a LibGuide, to guide navigation and citation. The team explored how archived content could later be integrated into the Web Archive’s discovery systems to ensure sustainable long-term accessibility. This presentation will discuss the workflows, technical challenges, and collaborative strategies employed to preserve both content and access. Particular attention will be given to overcoming Cloudflare restrictions, managing URL redirects, coordinating cross-departmental teams, and designing user support resources to make the archived blogs usable and discoverable. The case study demonstrates that, under platform constraints, institutions can successfully safeguard digital heritage while prioritising accessibility, discoverability, and usability for researchers and the public.
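The 503-detection step lends itself to simple automation. The sketch below scans a CDXJ index of the kind produced alongside Browsertrix crawls and collects the URLs of 503 captures into a re-crawl seed list; the file names and exact field layout are assumptions about this particular workflow.

```python
# Minimal sketch: harvest URLs whose captures returned HTTP 503.
import json

recrawl = set()
with open("index.cdxj", encoding="utf-8") as fh:
    for line in fh:
        # CDXJ convention: "<searchable-url> <timestamp> <json block>"
        _, _, blob = line.split(" ", 2)
        fields = json.loads(blob)
        if fields.get("status") == "503":
            recrawl.add(fields["url"])

with open("recrawl-seeds.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(recrawl)) + "\n")
print(f"{len(recrawl)} URLs queued for re-crawling")
```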
1:47pm - 2:09pm
The taste of blogging: towards sensible and ethical approaches to web archives 1École nationale des chartes, France; 2Bibliothèque nationale de France Archives of the early vernacular web hold a great deal of sensitive content: personal photos, texts created by children, viral memes remixing personal and copyrighted material… Blogs and social networks are not made only of text and images: they encompass intimate, individual stories. Within those pages we come across confidences from marginalized people, mothers grieving for a child, photos of late-night parties, fantasies worded as fanfictions. What can be told about them without betraying the intimacy these authors have placed in their blogs? Based on the massive collection, carried out with the National Library of France (BnF), of 12.6 million blogs, mainly French-speaking and mostly created in the early 2000s, we will discuss how research teams and cultural institutions can implement sensible approaches to this peculiar kind of corpus. Our projects SkyTaste and Skybox build on a platform of tools and data for researchers, designed by the BnF to promote the visibility of this archive. Our goal is to capture the unique atmosphere of those blogs and to design ways of reconveying this heritage to its stakeholder community. Within our projects, we define sensible approaches to web archives as epistemological methods designed to interact with sensitive content from the vernacular web in a way that respects ethical principles. In France, web archives fall under legal deposit and can only be accessed by researchers on the premises of a few institutions. If we want to use this content for an exhibition or a scientific paper, we have to ask rights holders for authorization. However, most of the content on this blog platform was posted under pseudonyms, and much of it, especially within fandoms, is reused content that is difficult to trace. Furthermore, even when we can find these authors, they are not keen to authorize the display of their intimate content. Finally, some materials are so sensitive that we may feel reluctant to expose them even when we are allowed to. Yet if, when telling the stories of these blogs, we only show low-risk content, either authorized or already available, we run a significant risk of presenting a biased version of the platform and missing the purpose of cultural heritage: stirring emotions. Sensible approaches to web archives include acknowledging intellectual property rights, being mindful of people's privacy and intimacy, taking cultural diversity into account, and protecting stakeholders (including researchers) from potentially harmful information. Such approaches may involve navigating between distant and close reading, avoiding blind spots, building research processes together with communities, and mobilizing art-based research as a catalyst for the emotions we experience as web archivists, or as researchers, in front of the archive. Thanks to the synergy that has emerged around these projects, researchers and students are working with web archivists to build this ethical framework for navigating personal web archives. This is the main goal of two workshops we are organizing in the fall of 2025; we will synthesize our results for this presentation.
2:09pm - 2:30pm
Capturing the flow of online news: complementary approaches to web archiving and legal deposit in Sweden National Library of Sweden, Sweden The National Library of Sweden has engaged in large-scale web archiving since 1997, when domain-level crawls of the Swedish web were first initiated as part of the national web harvesting program. In 2002–2003, this effort was expanded to include daily crawls of Swedish news media websites, in recognition of the need to capture the rapid publication cycles and dynamic content characteristic of online journalism. These crawls have since documented the structure, evolution, and visual presentation of Sweden's digital news ecosystem across both national and regional outlets. The harvested material is available for on-site consultation at the library and forms a cornerstone of the National Library of Sweden's long-term digital preservation holdings. The introduction of electronic legal deposit legislation in 2012 significantly expanded the library's collecting mandate, establishing a legal basis for requiring publishers to deliver digital content, including material distributed exclusively online and behind paywalls. Building on this framework, the National Library of Sweden launched a new, more granular collection process for news media in 2015: focused harvesting based on RSS feeds supplied by publishers in accordance with technical specifications developed by the library. These feeds expose article-level content and metadata, including updated versions of published articles, thereby enabling the systematic and high-frequency collection of born-digital news items (a minimal sketch of such feed-based harvesting follows this abstract). This targeted, metadata-rich approach complements the broader but less structured coverage achieved through traditional web crawls. This presentation will examine the operational and curatorial relationship between these two collection streams: comprehensive web harvesting and RSS-based electronic legal deposit. It will discuss differences in scope, temporal resolution, and metadata granularity, as well as efforts to align descriptive and technical metadata across systems to enable cross-collection discovery and analysis. Particular attention is given to the challenges of integrating large-scale WARC-based collections with structured, feed-based article data, and to access conditions: while the web-harvested material is available to users on-site, the legal deposit corpus remains restricted due to current legal and technical constraints. The presentation will also outline future directions for harmonizing workflows, enhancing metadata interoperability, and leveraging these complementary datasets for large-scale research use in digital news studies and computational journalism. |
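As a rough illustration of the RSS-based legal deposit stream described in the last abstract above, the following sketch polls a publisher feed and collects article-level content and metadata. It is a minimal sketch, assuming the feedparser and requests libraries; the feed URL and field handling are illustrative, not the National Library of Sweden's actual specification.

```python
# Sketch: harvest article-level items from a publisher-supplied RSS feed.
import feedparser
import requests

FEED_URL = "https://publisher.example.se/legal-deposit/rss"  # hypothetical

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # Article-level metadata exposed by the feed; updated versions of
    # an article are re-delivered with the same id and a new timestamp.
    record = {
        "id": entry.get("id"),
        "title": entry.get("title"),
        "url": entry.get("link"),
        "updated": entry.get("updated"),
    }
    # Fetch the born-digital article itself for ingest.
    response = requests.get(record["url"], timeout=30)
    response.raise_for_status()
    # ...hand payload and metadata to the preservation pipeline here.
    print(record["id"], record["updated"], len(response.content), "bytes")
```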
| 1:25pm - 2:30pm | RESPONSIBLE STRATEGIES Location: PANORAMA [+6] |
|
|
1:25pm - 1:47pm
End of Term Web Archive: Harmonizing WARC contributions from multiple crawling partners 1University of North Texas Libraries, United States of America; 2Internet Archive, United States of America Every four years, the End of Term (EOT) Web Archive documents the transition in the executive branch of the United States federal government by harvesting federal .gov and public .mil domains. The most recent transition, from the Biden to the Trump administration, resulted in the largest data collection yet, with over 2.3PB of content crawled by six different crawling partners. From the beginning of the EOT Web Archive project, the diversity of approaches that different crawling partners have taken to crawling and curating portions of the overall project has been seen as a benefit: it allowed experimentation with different crawling strategies while letting partners focus on the content their organizations were willing and able to collect. In the EOT-2024 process, however, this diversity of collecting institutions resulted in a wide range of implementations of the WARC format and required the project team to decide how best to harmonize the data and make it available to researchers for computational use. The variations included WARC files created using record-at-a-time gzip compression, WARC files packaged in the Web Archive Collection Zipped (WACZ) format, WARC data compressed with the Zstandard compression algorithm, and WARC files packaged in the BagIt format, comprising file headers with the payloads stored alongside the WARC files themselves. To provide a consistent file format and access paradigm for end users who might not be familiar with these variations of the WARC format and their nuances, the EOT team decided to normalize all streams of WARC data into individual WARC files with record-at-a-time gzip compression (a minimal normalization sketch follows this abstract). Normalizing several of these formats presented non-trivial challenges during the process. While the data for the public dataset was normalized, the originally contributed formats are archived as deposited at the Internet Archive, where they are served by the Wayback Machine. The resulting dataset should provide end users with an easily accessible set of files that can be used for a variety of future projects. This presentation offers a novel focus on normalizing heterogeneous WARC files in order to provide a consistent set of interactions for end users who are not primarily web archivists. It will give a brief introduction to the EOT collection process but focus predominantly on the different tools and the resulting WARC implementations generated in the most recent round of this effort, along with the decisions the EOT team made to normalize these WARC records and the technical approaches used throughout the dataset-creation portion of the project.
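The normalization target named above, one gzip member per WARC record, maps directly onto a short warcio loop. This is a minimal sketch under the assumption that WACZ and BagIt packages have already been unpacked to plain WARC files and Zstandard-compressed data already decompressed; the file names are illustrative.

```python
# Sketch: rewrite a contributed WARC as record-at-a-time gzip.
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

def normalise(in_path: str, out_path: str) -> None:
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        writer = WARCWriter(dst, gzip=True)  # one gzip member per record
        for record in ArchiveIterator(src):
            writer.write_record(record)

normalise("contributed.warc", "normalised.warc.gz")  # illustrative names
```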
1:47pm - 2:09pm
Crawl, cloud, carbon: measuring and reducing emissions for web archivists Tailpipe, United Kingdom This talk offers a walkthrough of a novel methodology for precisely estimating the carbon emissions generated by cloud computing, contextualised within a case study in which the emissions of a major web archiving platform were measured. The presentation begins with an explanation of the process by which cloud computing generates carbon emissions: a chain connecting the cloud service user to the datacentre that processes their requests, the power station that fuels the datacentre, and the energy source that generates the necessary electricity. This chain is illustrated with data from the emissions assessment of the aforementioned web archiving platform. The emissions intensity of web archiving is also highlighted: it is a compute- and storage-intensive process, reliant on a vast network of cloud storage, which consumes a significant amount of power and thereby generates material quantities of carbon emissions. Next, the methodology for estimating cloud computing emissions is detailed step by step. It begins with an assessment of the power draw of the hardware components that host cloud services. This dataset is combined with measured processor utilisation data to determine the overall power draw of a user's or organisation's use of cloud services. The carbon emissions of this power draw are then calculated from regional grid-mix carbon intensity data, accounting for regional power transmission losses (a simplified worked example follows this abstract). Alongside these 'operational' emissions, the methodology is expanded to encompass other elements of the cloud computing infrastructure's lifecycle, including manufacture, shipping, and disposal. The methodology is accompanied by examples from the web archiving case study, covering the types of hardware used by web archivists, the types of cloud services used to host web archiving, and the carbon intensity of the datacentres that most commonly host web archive data. Results from empirical testing will also be shown to demonstrate the precision of the estimated power and emissions calculations, and areas where additional refinements can be made in the future will be presented. The presentation concludes with recommendations to help web archivists reduce the carbon emissions generated by their processes, including migrating services to datacentres in low-carbon-intensity regions and maximising the efficiency of web archiving software hosted on cloud services.
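To make the arithmetic of the methodology concrete, here is a simplified worked example. All numbers are illustrative placeholders, not figures from the case study, and the function sketches the operational-emissions step only.

```python
# Sketch: operational cloud emissions = (power draw x utilisation x time),
# grossed up for transmission losses, times the regional grid carbon intensity.

def operational_emissions_kg(
    power_watts: float,               # assessed power draw of the hardware
    utilisation: float,               # measured fraction of capacity in use (0-1)
    hours: float,                     # duration of the workload
    grid_kg_per_kwh: float,           # regional grid-mix carbon intensity
    transmission_loss: float = 0.05,  # illustrative regional loss rate
) -> float:
    consumed_kwh = power_watts * utilisation * hours / 1000.0
    # Losses mean more electricity must be generated than is consumed.
    generated_kwh = consumed_kwh / (1.0 - transmission_loss)
    return generated_kwh * grid_kg_per_kwh

# Example: a crawler node drawing 250 W at 60% utilisation for a 12-hour
# crawl on a grid emitting 0.25 kgCO2e/kWh, plus an amortised 'embodied'
# allowance (manufacture, shipping, disposal) of 0.02 kgCO2e per hour.
operational = operational_emissions_kg(250, 0.6, 12, 0.25)
total = operational + 0.02 * 12
print(f"{total:.2f} kgCO2e")  # ~0.71 kgCO2e
```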
2:09pm - 2:30pm
How the “M” service contributes to reducing the carbon footprint Arquivo.pt, Portugal This presentation provides an overview of seven years of “M”, a service offered to the community since 2018 that allows organizations to shut down old websites while keeping their content accessible, thereby reducing their carbon footprint. Organizations create websites for a wide variety of purposes and sometimes end up maintaining dozens of small websites without updating them; universities, for example, create websites dedicated to events, conferences, and research projects. What to do? Shut down the websites and lose interesting information? This is where the “M” service comes in. We consider the service from three perspectives: 1) how it works; 2) how it adds value to organizations; 3) community involvement. We conclude by outlining the next steps for expanding the service. 1) How it works. The “M” service consists of redirecting a domain to a historical version preserved in the “Web Archive” (a minimal redirect sketch follows this abstract). The workflow begins with a request from the organization that owns the website. The “Web Archive” makes a high-quality recording of the website; the website owner only has to maintain the domain and redirect it. The “Web Archive”, in turn, generates an SSL certificate and provides access to the archived content, and a landing page informs users that this is a historical version. The process involves collaboration between the “Web Archive” team and staff from the entities that have joined the “M” service. 2) How it adds value to organizations. In communicating the service to the community (external advocacy), we highlight the value of “M” in terms of energy savings, CO2 reduction, and therefore a smaller carbon footprint. A second value of the service, important to IT teams, is that it helps eliminate security flaws: websites that are not updated become targets for attacks. Instead of deleting websites with content that is useful to the community, IT teams and decision-makers can use the web archive to keep providing access to that content. 3) Community involvement. By 2025, the “M” service had reached approximately 284 websites from 26 institutions. Over the years, 50 websites were removed due to domain maintenance issues or lapsed collaborations. Processes have been improved and the service is poised for growth; for example, SSL certificate generation has been automated. External advocacy remains a priority, as the preservation of websites in web archive format is not widely known. The next step in expanding the “M” service is to use the same workflow and structure to provide a rapid-response service in the event of cyber attacks on the websites of important organizations, such as universities: the “Web Archive” must be prepared to provide the latest archived version to such an entity. We believe that redirecting to the “Web Archive”, as the “M” service does, is an important contribution to disaster recovery processes. The presentation concludes with the “Web Archive”'s vision of creating services for the community. It is essential to offer services that 1) demonstrate the usefulness of web archives to organizations and 2) show their contribution to sustainability goals. |
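The redirect mechanism at the heart of the “M” service can be pictured with a very small sketch: the retired domain is pointed at a redirector that forwards every request to the archived snapshot. This is a minimal illustration, assuming Arquivo.pt's wayback-style replay URLs; the domain, snapshot timestamp, and port are hypothetical.

```python
# Sketch: forward all traffic for a retired site to its archived version.
from http.server import BaseHTTPRequestHandler, HTTPServer

ARCHIVE_PREFIX = "https://arquivo.pt/wayback/20180101000000/"  # illustrative snapshot
OLD_SITE = "http://conference2018.example.org"                 # hypothetical retired site

class MRedirect(BaseHTTPRequestHandler):
    def do_GET(self):
        # Permanent redirect to the archived copy; the archive's own
        # landing banner identifies it as a historical version.
        self.send_response(301)
        self.send_header("Location", ARCHIVE_PREFIX + OLD_SITE + self.path)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), MRedirect).serve_forever()
```

In production the same effect would typically be achieved with a web-server rewrite rule rather than a standalone process; the sketch simply makes the redirection logic explicit.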
| 1:25pm - 2:30pm | COLLABORATION & OUTREACH Location: CONCERT [+4] |
|
|
1:25pm - 1:47pm
A web archiving training program for Latin America 1Universidad Nacional Autónoma de México, Mexico; 2Webrecorder, United States of America Web preservation is a contemporary practice that began this century. Like many practices, promoting and supporting web archiving has been challenging given limited time and resources. However, the urgency and ephemeral nature of online content have made increasingly clear the gap between countries that have adopted web archiving initiatives and those still unaware of its importance, highlighting the pressing need for action. In Latin America, web archiving is an archival technique that has rarely been applied. Formal web archiving projects are known to exist in Chile and Mexico, though communities and other organizations have also made significant contributions. Many have attempted to archive the web using the limited support, resources, and documentation available both from within the web archiving community and from their own local contexts. For this reason, a Spanish-language web archiving training program is being developed within the Library and Information Research Institute (IIBI) of the National Autonomous University of Mexico (UNAM), with the goal of preparing new generations of archivists who can identify, preserve, and provide access to web pages of social, archival, and political value. The program is first being developed internally within IIBI to evaluate workflows, logic, and vocabulary, with the goal of then expanding and disseminating these resources as part of the cultural heritage of our countries and communities. This presentation proposes a training and professional development program designed as a collaboration between UNAM, a public university in Mexico, and the developers of open-source web archiving tools. As the program takes shape, we invite the broader web archiving community to join the conversation and share how they would have liked to begin their own journeys, offering input that can help shape a more accessible and impactful initiative.
1:47pm - 2:09pm
Modeling CARE: Sustainable web archiving across languages 1Indiana University, United States of America; 2ESRI This presentation describes a collaborative web archiving project funded by the Mellon Foundation (2020-present) and the National Endowment for the Humanities (2023-2025). It employs the decolonial practice of post-custodial archiving to record the stories of mutual aid organizations and individuals responding to disasters that have impacted Puerto Rico in recent years, including hurricanes, earthquakes, and COVID-19. Over the course of two weeks in September 2017, Puerto Rico was struck by a category 5 and a category 4 hurricane, Hurricanes Irma and María. The disaster, however, was not simply the hurricanes but also the events that followed. Notably, the disaster-response methods used (prioritization of urban centers, slow distribution of resources, and strains on infrastructure) placed Puerto Rico under duress by leaving most people to fend for themselves. As a result, Puerto Ricans' survival largely depended upon community-based groups and their use of local traditions, oral knowledge, and community organizing. Our team works with these community organizations to preserve and archive their stories. We are committed to decolonial web archiving practices that build reciprocal relationships with and for our communities. Linda Tuhiwai Smith asserts that “the intellectual project of decolonizing has to set out ways to proceed through a colonizing world. It needs a radical compassion that reaches out, that seeks collaboration, and that is open to possibilities that can only be imagined as other things fall into place.” For our team, decolonial praxis means “rejecting extractive forms of knowledge acquisition by relegating authority and control of collection processes, material selection, and dissemination strategies to the participating community organizations.” This approach includes adherence to the CARE principles (collective benefit, authority to control, responsibility, and ethics). One way we live out these values is through participatory, user-centric design of not only the project's collections but also the platform in which they are housed. In particular, AREPR has developed a multilingual, open-source Omeka S theme that is freely available for other groups to use. This theme, and the corresponding Omeka S modules we produced, simplifies the process of developing and sustaining multilingual projects by providing free, easy-to-use tools for displaying archival materials across languages. Using these software extensions, we built a collection of over 800 bilingual disaster-response artifacts and oral histories, and we work with our community partners to sustain the tools through training, documentation, and knowledge transfer. Offering a case study in how to use and sustain these software extensions, the presentation will demonstrate how CARE approaches to archiving and tool development enable collaborative and mutually beneficial knowledge production, and it will draw attention to how web archiving practices can be reimagined to create new opportunities for community engagement and sustainable praxis.
2:09pm - 2:30pm
Approaches towards archiving digital Islam University of Edinburgh, United Kingdom This presentation reports on the experience of the "Digital Islam Across Europe" project, in which digital archiving constitutes a core methodological component. It explores how data were selected, archived, and visualised by teams of academic specialists. Although the team possessed technical competence and a general awareness of computer technologies, none of its members had specific expertise in digital archiving. The presentation will therefore illustrate the team's experiences, including processes of experimentation and trial and error. The project's focus on archiving prompted a steep and ongoing learning curve aimed at developing sustainable, narrated, and open-source digital archives that capture multiple dimensions of digital Muslim expression. Sustainability and accessibility have been integrated into the project's design through the use of tools provided by Archive-It and ARCH. In doing so, the team seeks to establish good practices that are transferable to other disciplines and to encourage similar projects based on the methodological frameworks developed through this work. The initiative represents one of the earliest systematic efforts to archive religious expression, identities, and related issues, specifically those associated with Muslim communities and Islam, through multidisciplinary and interdisciplinary approaches informed by the diverse expertise of the participating teams. "Digital Islam Across Europe: Understanding Muslims' Participation in Online Islamic Environments" (DigitIslam) examines the social and religious impact of Online Islamic Environments (OIEs) on Europe's diverse Muslim communities. The project is funded by the Collaboration of Humanities and Social Sciences in Europe (CHANSE) and involves research teams working across five European countries (the United Kingdom, Poland, Sweden, Spain, and Lithuania), with the University of Edinburgh serving as the lead institution. The archives draw on specific contextual Muslim interests reflecting national concerns within the partner countries, while also highlighting transnational networks and shared themes. Each country team contributed subject-specific expertise, particularly in the development of metadata. Content was translated into the respective partner languages, which required refinements to the archiving tools. Although DigitIslam's archives remain under development, they already constitute a significant research resource at a critical juncture in the study of European Muslim life and digital engagement. Online: https://blogs.ed.ac.uk/digitalislameurope/ X: @digitislam Bluesky: @digitislam.bsky.social Facebook: Chanse DigitIslam |
| 2:30pm - 3:00pm | BREAK Location: GALERIE [-2] & PANORAMA FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6) and in Galerie (Floor -2, next to the Auditorium). |
| 3:00pm - 4:30pm | CLOSING KEYNOTE PANEL Location: AUDITORIUM [-2] |
| 4:30pm - 4:45pm | CLOSING REMARKS Location: AUDITORIUM [-2] |
| 4:45pm - 6:00pm | CLOSING RECEPTION Location: ROTONDE Drinks and nibbles will be served in Rotonde. Volunteers will guide you to floor +3 from where the historical part of the Library can be accessed. |
| Date: Thursday, 23/Apr/2026 | |
| 9:00am - 9:30am | MORNING COFFEE Location: PANORAMA: FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6). |
| 9:30am - 9:40am | GENERAL ASSEMBLY: OPENING REMARKS Location: PANORAMA [+6] |
| 9:40am - 9:50am | CHAIR ADDRESS Location: PANORAMA [+6] |
| 9:50am - 10:30am | TBC Location: PANORAMA [+6] |
| 10:30am - 11:00am | COFFEE BREAK Location: PANORAMA: FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6). |
| 11:00am - 12:00pm | TOOLS Location: PANORAMA [+6] |
| 11:00am - 12:00pm | TBC Location: STUDIO [+6] |
| 11:00am - 12:00pm | CONTENT DEVELOPMENT WORKING GROUP Location: AQUARIUM [+2] |
| 12:00pm - 1:00pm | LUNCH Location: PANORAMA: FOYER [+6] 🍴 Lunch will be served in Panorama Foyer (Floor +6). |
| 1:00pm - 3:00pm | TOOLS Location: PANORAMA [+6] |
| 1:00pm - 3:00pm | TRAINING WORKING GROUP Location: STUDIO [+6] |
| 1:00pm - 3:00pm | RESEARCH WORKING GROUP Location: ATELIER [+2] |
| 1:00pm - 3:00pm | CONTENT DEVELOPMENT WORKING GROUP Location: AQUARIUM [+2] |
| 3:00pm - 3:30pm | COFFEE BREAK Location: PANORAMA: FOYER [+6] ☕️🥐 Drinks and snacks will be served in Panorama Foyer (Floor +6). |
| 3:30pm - 4:00pm | GENERAL ASSEMBLY: CLOSING SESSION Location: PANORAMA [+6] |
