Conference Agenda
Overview and details of the sessions of this conference.
|
Session Overview |
Date: Tuesday, 08/Apr/2025 | |
9:00am - 9:40am | REGISTRATION: General Assembly (For IIPC members only) |
9:40am - 9:50am | Opening Remarks Location: Målstova (upstairs) |
9:50am - 10:00am | Chair Address Location: Målstova (upstairs) |
10:00am - 10:45am | IIPC Strategic Plan 2026-2030 Location: Målstova (upstairs) |
10:45am - 11:15am | BREAK Location: Folkestova (upstairs) If you signed up for a guided exhibition tour, please be in the exhibition room at 10:45. To know if you signed up for a tour, check your registration details in ConfTool. |
11:15am - 12:45pm | Framework for Tools Sustainability Location: Målstova (upstairs) |
11:15am - 12:45pm | Content Development Working Group Meeting Location: Slottsbiblioteket (ground floor) |
11:15am - 12:45pm | TBC Location: VIP - rommet (upstairs) |
12:45pm - 2:00pm | LUNCH Location: CREDO Restaurant | Kantine (downstairs) If you signed up for a guided exhibition tour, please be in the exhibition room at 12:50. To know if you signed up for a tour, check your registration details in ConfTool. |
2:00pm - 3:30pm | Research Working Group Meeting Location: Målstova (upstairs) |
2:00pm - 3:30pm | Training Working Group Meeting Location: Slottsbiblioteket (ground floor) Actual session length: 60 minutes |
2:00pm - 3:30pm | TBC Location: VIP - rommet (upstairs) |
3:30pm - 4:00pm | BREAK Location: Folkestova (upstairs) If you signed up for a guided exhibition tour, please be in the exhibition room at 3:30. To know if you signed up for a tour, check your registration details in ConfTool. |
4:00pm - 5:30pm | Crawling National Domain: Towards Best Practices Location: Målstova (upstairs) |
4:00pm - 5:30pm | TWG WORKSHOP: Case Studies ‘Write-a-thon’ - Documenting Best Practices Location: Slottsbiblioteket (ground floor) |
|
Case Studies ‘Write-a-thon’ - Documenting Best Practices 1The National Archives (UK), United Kingdom; 2Library of Congress, United States of America; 3Internet Archive, United States of America |
4:00pm - 5:30pm | TBC Location: VIP - rommet (upstairs) |
7:00pm - 9:00pm | WELCOME RECEPTION Location: Folkestova (upstairs) [IIPC Members Only] Includes light refreshments and drinks. Attendees are encouraged to have dinner beforehand. |
Date: Wednesday, 09/Apr/2025 | |
9:00am - 9:40am | REGISTRATION: Web Archiving Conference (WAC) |
9:40am - 9:50am | Opening Remarks Location: Målstova (upstairs) Streamed to Store Auditorium. |
9:50am - 10:45am | Opening Keynote: Libraries, Copyright, and Language Models Location: Målstova (upstairs) Session Chair: Andrew Jackson, Digital Preservation Coalition Streamed to Store Auditorium. |
10:45am - 10:55am | SHORT BREAK Streaming video from Målstova to Store Auditorium ends. Lightning Talk Session 2 will begin in the Store Auditorium after the break. |
10:55am - 11:00am | LIGHTNING TALK SESSION 1: INTRODUCTION Location: Målstova (upstairs) Session Chair: Ben Els, National Library of Luxembourg |
10:55am - 11:00am | LIGHTNING TALK SESSION 2: INTRODUCTION Location: Store Auditorium (ground floor) Session Chair: Sawood Alam, Internet Archive |
11:00am - 11:25am | LIGHTNING TALK SESSION 1 Location: Målstova (upstairs) Session Chair: Ben Els, National Library of Luxembourg |
|
11:00am - 11:05am
Strategies and Challenges in the Preservation of Mexico’s Web Heritage: First Steps National Library of Mexico, Mexico
11:05am - 11:10am
Arquivo.pt Toolkit for Web Archiving Arquivo.pt, Portugal
11:10am - 11:15am
Tracking the Political Representations of Life: Methodological Challenges of Exploring the BnF Web Archives 1Centre de recherches politiques de Sciences Po (CEVIPOF, CNRS), France; 2Bibliothèque nationale de France, France
11:15am - 11:20am
Collaborative Curatorial Approaches of the Czech Web Archive Using the Example of Thematic Literary Collections National Library of the Czech Republic, Czech Republic |
11:00am - 11:25am | LIGHTNING TALK SESSION 2 Location: Store Auditorium (ground floor) Session Chair: Sawood Alam, Internet Archive |
|
11:00am - 11:05am
Modelling Archived Web Objects as Semantic Entities to Manage Contextual and Versioning Issues 1The National Archives (UK), United Kingdom; 2King's College London, United Kingdom
11:05am - 11:10am
Modernizing Web Archives: The Bumpy Road Towards a General ARC2WARC Conversion Tool Common Crawl Foundation, United States of America
11:10am - 11:15am
Poking Around in Podcast Preservation Netherlands Institute for Sound and Vision, Netherlands
11:15am - 11:20am
Automatic Clustering of Domains by Industry for Effective Curation Royal Danish Library, Denmark
11:20am - 11:25am
Best Practice of Preserving Posts from Social Media Feeds Arkiwera wcrify AB, Sweden |
11:25am - 11:55am | BREAK Location: Folkestova (upstairs) Participants in the 2025 Mentoring Program can meet at the top of the old granite stairs outside of Målstova. Seating is available in the cafeteria/bar (upstairs) and in the library hallways (upstairs and ground floor). If the weather is nice, there are also small parks immediately in front of and behind the National Library building. |
11:55am - 1:00pm | PANEL #01: Engaging Audiences Location: Målstova (upstairs) Session Chair: Eveline Vlassenroot, University of Ghent |
|
Beyond Preservation: Engaging Audiences and Researchers with Web Archives 1University of Ghent, Belgium; 2KBR - Royal Library of Belgium, Belgium; 3University of Sheffield, United Kingdom; 4Bodleian Libraries, United Kingdom; 5Royal Danish Library, Denmark; 6National Library of Scotland, United Kingdom |
11:55am - 1:00pm | SESSION #01: Tools Under Construction: Lessons Learned (National Library Perspective) Location: Store Auditorium (ground floor) Session Chair: Katherine Boss, National Library of Norway |
|
11:55am - 12:15pm
Embedding the Web Archive in an Overall Preservation System Swiss National Library, Switzerland The Swiss National Library (SNL) is building a new digital long-term archive that will go live in spring 2025. This system is designed as an overall system that covers all the processes involved in handling the digital objects of all the SNL's collections, including the web archive. This starts with the delivery of the objects by producers or the collection of the objects by the SNL itself, includes the preparation for archiving and cataloguing, administration and preservation, and ends with the provision to users. The first part of the presentation will describe the architecture and functionality of the overall system, which consists of three different areas and uses a mixture of standard components and individual developments.
The second part of the presentation will show how the Swiss Web Archive and its specific processes have been integrated into the overall system. Special precautions had to be taken particularly in the Pre-Ingest and Access areas. In Pre-Ingest, a distinct processing channel was created for the web archive. This makes it possible to register the websites for collection (and automated periodic snapshots), collect them, check their quality and improve it if necessary, and ensure that they are virus-free. Access makes the web archive accessible via a full-text search, for which special precautions had to be taken when generating the hit lists. Otherwise, the hits from the other collections would be lost among the numerous hits from the web archive (an illustrative grouping sketch follows this session block). In addition, one of the showcases will provide an unexpected approach to the web archive. The presentation will conclude by addressing some of the specific challenges of integrating the web archive into an overall preservation system and the lessons learnt.
12:15pm - 12:35pm
UKWA Rebuild British Library, United Kingdom The British Library suffered a major service outage following a cyber-attack on all technical systems in late October, 2023. What followed was a complete rebuild of all services with security baked in. This short presentation provides an overview of how the UK Web Archive was affected, how the new operational technology landscape of the British Library changed, and describes the work being undertaken to return UKWA as a public service and to begin crawling again from on-premise servers. It will also describe how the internal systems of UKWA are changing to meet the new infrastructure and policies.* The challenges faced should be important to all web archiving institutions. The necessary changes made by the British Library to ensure the new services are secure by design will have a major impact on the UK Web Archive systems, but these could be challenges and changes imposed on any web archive. The size of the UK Web Archive, approaching 2PiB and an estimated 18 billion files, also creates challenges in itself which will be familiar to many web archives - the redesign of UKWA includes distant storage and aims to establish shared functions and resources across the Legal Deposit Libraries in the future. Ways of discovering content within the UK Web Archive have been significantly reduced by the cyber-attack. Previously, a full text search service was available using Apache Solr. However, the return of a 'discovery service' has been delayed by the necessity of rebuilding all systems from scratch. The future planning for a discovery service, and a user service, will also be outlined in the presentation. * As of mid-August 2024, no technology infrastructure or systems have been released for the UKWA rebuild work. Consequently, the content of this presentation may change from this paper submission and the conference date. 12:35pm - 12:55pm
Under Construction: Web Archive of the German National Library German National Library, Germany Our institution has been running a web archive since 2012, in cooperation with an external contractor and on closed-source software. Most recently we have started the shift towards an in-house open source web archiving system that shall be integrated with the overall data management infrastructure of our institution. During a first migration process the whole setup was moved in-house. The migration allowed us to gain some control over the operation, while the development and support are still performed by the contractor. In our experience over the last decade, we have identified a number of limitations with the current web archive setup: the crawling capacity is limited to a maximum of 12,000 snapshots per annum, the non-modular system complicates the implementation of new requirements, and we cannot directly benefit from the progress of the thriving open source web archiving community in regard to new features and the implementation of web archiving standards. In parallel to the web archiving activities, our institution has developed an overarching data management infrastructure for the acquisition, digital preservation, and provisioning of electronic resources, such as e-books, e-journals, and most recently audio files. In order to gain increased maintainability, flexibility and control over the web archiving activity, our aim is to implement a new system in-house, to integrate it with the well-established in-house workflows for electronic resources, and to align it with and base it on the current open source state of the art and the standards of the web archiving community. During the presentation we take you on the journey of our institution towards the implementation of an in-house and open source web archive. We try to answer the questions: How do we understand the environment? How do we put our team together? Where do we want to go? How do we decide which paths to take? Which gear do we need? And finally, what are our lessons learned? |
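The hit-list problem described in the Swiss National Library abstract above, where web archive captures would otherwise drown out results from other collections, can be illustrated with standard Solr result grouping. The sketch below is not the SNL's implementation: the endpoint, core name, and the `collection` field are assumptions made purely for illustration.

```python
"""Illustrative only: cap the number of web archive hits per result page by
grouping on a collection field. Endpoint, core and field names are
hypothetical; the SNL's actual search backend is not described here."""
import requests

SOLR_SELECT = "http://localhost:8983/solr/catalog/select"  # hypothetical core

def grouped_search(query: str, per_collection: int = 5) -> dict:
    params = {
        "q": query,
        "wt": "json",
        # Standard Solr result grouping: one group per collection value,
        # each capped at `per_collection` hits, so web archive captures
        # cannot crowd out books, journals, and other collections.
        "group": "true",
        "group.field": "collection",   # hypothetical field name
        "group.limit": per_collection,
        "rows": 20,
    }
    response = requests.get(SOLR_SELECT, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    results = grouped_search("federal council")
    for group in results["grouped"]["collection"]["groups"]:
        print(group["groupValue"], group["doclist"]["numFound"])
```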
11:55am - 1:00pm | WORKSHOP #01: Exploring Dilemmas in the Archiving of Legacy Webportals: An Exercise in Reflective Questioning Location: Slottsbiblioteket (ground floor) |
|
Exploring Dilemmas in the Archiving of Legacy Webportals: An Exercise in Reflective Questioning National Library of the Netherlands, Netherlands Since 2023, the National Library of the Netherlands (KBNL) has been proud to curate a digital collection that has become UNESCO world heritage: the Digital City (De Digitale Stad, henceforth: DDS). Material belonging to this collection consists of an original freeze from 1996, as well as two student projects and miscellaneous material that was contributed by users and founders over the course of multiple events. The two student projects were the first attempt to revive the portal of DDS and store it as a disk image. The two groups of students used two methods for this revival: one based on emulation, the other based on migration. But what choices were made during restoration and which version is more authentic? Furthermore, KBNL has several websites, scientific articles and newspaper clippings in its collections that might serve as context information. Do we consider this context information crucial for understanding DDS or do we rather leave users to find these resources by themselves if they are interested? As can be seen from this description, there is a lot of complexity when we consider archiving DDS and making it accessible to our users. We can think of a lot of difficult dilemmas when making decisions on what to archive and how to present it. Do we want users to experience what it is like to create a homepage in DDS or do we want to present a historically correct picture of the homepages existing at the time? What should be considered part of the object and what part of the context? Is the migrated or the emulated version more authentic? What is more important, the privacy of the original users or providing full access to researchers? What do we consider part of DDS and what not? Only the HTML? Or also any news group material that might still be online but isn’t part of the archival material? Do users want a real authentic experience or rather a convenient way of viewing the content? Even though DDS was a Dutch portal, it was based on the software of the American Free-nets and inspired other cities in Europe and Asia. Therefore, we think this case might have a lot of recognizable features that also apply to the archiving of other legacy portals. Arguably, there are no right or wrong answers. They are typically dilemmas where multiple options have both benefits and drawbacks. In our workshop we want to present a couple of these real-world dilemmas to participants to stimulate discussion based on the idea of opposing values. In web archiving and web archaeology, tough decisions sometimes have to be made. In the above description we can already perceive some opposing options, for instance whether to prioritize interactivity or historical accuracy. Another example would be the opposition between privacy and openness. How do we weigh these options in practice? What values are important to us and how do they interact? Through principles of reflective questioning and open dialogue we will try to create awareness about the idea of value prioritization as part of the decision-making process. The idea is that we present a number of dilemmas, based on our collection material, for participants to discuss in groups. Participants may also choose an example that illustrates the same dilemma from their own collection. Each group has to choose a preferred solution and present their reasoning to the group.
People are encouraged to explore the reasons for choosing one or the other, for instance by reflecting on their own organizational context or personal assumptions regarding digital preservation. We try to stay away from providing clear cut answers or guidance but rather provide participants with the opportunity to explore these questions together. Participants will learn how to ask the right questions to delve deeper into their own reasoning process during decision making, based on our method of reflective questioning. Participants should be able to apply this method and the cases presented to benefit their own curatorial decision-making process regarding legacy webportals in their own collections. For KBNL, the group discussions may provide important community input and food for thought on some of the decisions we are going to be making regarding DDS in the near future. |
1:00pm - 2:00pm | LUNCH Location: CREDO Restaurant | Kantine (downstairs) If you signed up for a guided exhibition tour, please be in the exhibition room at 13:05. To know if you signed up for a tour, check your registration details in ConfTool. |
2:05pm - 3:40pm | SESSION #02: Crawling Tools Location: Målstova (upstairs) Session Chair: László Tóth, National Library of Luxembourg |
|
2:05pm - 2:25pm
Lessons Learned Building a Crawler From Scratch: The Development and Implementation of Veidemann National Library of Norway, Norway Over the past two decades, web content has become increasingly dynamic. While a long-standing harvesting technology like Heritrix effectively captures static web content, it has huge limitations capturing and discovering links in dynamic content. As a response to this, the Web Archive at the National Library of Norway in 2015 set out to develop a new browser-based web crawler. This talk will present our experiences and lessons learned from building Veidemann. There are so many factors to consider when building a tool from scratch, and we will try to outline some of the decisions we were faced with during the process, unexpected issues and how we are addressing them. The talk will present:
The full cost/benefit analysis of taking on a project of this size and scale is, by the nature of the work, not fully knowable at the start. After nearly a decade in the making, the story of Veidemann is one of pride, hope, hardship and lessons learned. While it is still being used in production at our institution and harvesting roughly 1TB per week (of deduplicated content), other similar tools, such as Browsertrix, have distinct advantages in their approach. While the future of Veidemann is uncertain we would love to share what we have learned so far with the broader community. 2:25pm - 2:45pm
Experiences of Using in-House Developed Collecting Tool ELK National Library of Finland, Finland ELK (an acronym for Elonleikkuukone, which means harvester in Finnish) is a tool built in the National Library of Finland’s Legal Deposit Services to aid in collecting, managing, and harvesting online materials for the Web Archive. Legal Deposit Services started to use ELK in 2018, and since then we have updated ELK several times to better suit the needs of collectors and harvesters of web materials. Features of ELK include a back catalog of former thematic web harvests, covering the web materials (also known as seeds), cataloging information and keywords, as well as tools to manage thematic web harvests that are currently being made. Features have been developed in collaboration between the collectors and the developers who also work on harvesting the web materials. The aim is to create a tool where collectors can easily categorize different web materials, add notes on how to harvest different materials, and keep track of what has been collected and what has not. Collectors can also harvest single web pages themselves for quality control. This is to make sure that pages with dynamic elements can be viewed as they were meant to be in the web archive. ELK is also used as a documentation platform. The easiest way to see the curatorial choices, keywords and history of the thematic web harvests is to gather them in one platform. When that platform is used for everything related to web archiving, we can easily see what themes have been harvested, what sort of materials were collected previously and, in the best cases, the curatorial decisions that were made in those harvests. By sharing our experiences with an in-house developed tool for collecting web materials, we can help other libraries in their efforts. We will discuss the advantages and disadvantages in curating and managing our web collections, and where we would like to see our collections go in the future now that we have used the tool for a while.
2:45pm - 3:05pm
Better Together: Building a Scalable Multi-Crawler Web Harvesting Toolkit Internet Archive, United States of America The web is as nearly infinite in its expanse as it is in its diversity. As its volume and complexity continues to grow, high-quality, efficient, and scalable web harvesting methods are more essential than ever. The numerous and varied challenges of web archiving are well known to this community, so it’s not surprising there isn’t one tool that can perfectly harvest it all. But through open source software collaboration we can build a scalable toolkit to meet some of these challenges. In the presentation, we will outline some of the many lessons and best practices our institution has learned from the challenges, requirements, research, and practical experience from collaborating with other memory institutions for over 25 years to meet the harvesting needs of the preservation community. To demonstrate how some of those challenges can be overcome, we will then discuss a fictional large-scale domain harvest use case presenting common issues. With each new challenge encountered we will introduce concepts in web harvesting while demonstrating approaches to solve them. Sometimes the best approach is a configuration option in Heritrix, and sometimes it’s including another open source software to incrementally improve the quality and scale of the campaign. Nothing is perfect, so we’ll also cover some things to consider when deciding to employ an additional tool. Some of the challenges we’ll address are: Heritrix makes a great base for large-scale web crawling, and many in the IIPC community already use it for their web harvests. The presentation will demonstrate tools that complement Heritrix, and should be easy to try as an add-on to a reliable implementation, but the concepts—and often the tools themselves—are web crawler agnostic. The presentation is geared to a wide range of experience. Anyone who is curious about what it takes to run a large web harvest will leave with a better understanding, and experienced practitioners will acquire insights into some technical improvements and strategies for improving their own harvesting infrastructures. 3:05pm - 3:25pm
Lowering Barriers to Use, Crawling, and Curation: Recent Browsertrix Developments Webrecorder, United States of America As the web continues to evolve and web archiving programs develop in their practices and face new challenges, so too must the tools that support web archiving continue to develop alongside them. This talk will provide updates on new features and changes in Browsertrix since last year’s conference that enable web archiving practitioners to capture, curate, and replay important web content better than ever before. One key new feature that will be discussed is crawling through proxies. Browsertrix now supports the ability to crawl through SOCKS5 proxies which can be located anywhere in the world, regardless of where Browsertrix itself is deployed. With this feature, it is possible for users to crawl sites from an IP address located in a particular country or even from an institutional IP range, setting crawl workflows to use different proxies as desired. This feature allows web archiving programs to satisfy geolocation requirements for crawling while still taking advantage of the benefits of using cloud-hosted Browsertrix. Proxies may also have other concrete use cases for web archivists, including avoiding anti-crawling measures and being able to provide a static IP address for crawling to publishers. Similarly, the presentation will discuss changes made that enable users of Browsertrix to configure and use their own S3 buckets for storage. Like proxies, this feature lowers the barriers to using cloud-hosted Browsertrix by enabling institutions to use their own storage infrastructure and meet data jurisdiction requirements without needing to deploy and maintain a self-hosted local instance of Browsertrix. Other developments will also be discussed, such as improvements to collection features in Browsertrix which better enable web archiving practitioners to curate and share their archives with end users, user interface improvements which make it easier for anyone to get started with web archiving, and improvements to Browsertrix Crawler to ensure websites are crawled at their fullest possible fidelity. |
2:05pm - 3:40pm | SESSION #03: Advocacy & User Engagement Location: Store Auditorium (ground floor) Session Chair: Mark Phillips, University of North Texas Libraries |
|
2:05pm - 2:25pm
Insufficiency of Human-Centric Ethical Guidelines in the Age of AI: Considering Implications of Making Legacy Web Content Openly Accessible Computer History Museum Slovenia (Računališki muzej), Slovenia While the preservation of web history is crucial for maintaining a cultural and informational record of our age, reconstructing and resurfacing legacy content without appropriate context nowadays presents new ethical concerns. Legacy content may be misleading to users when consumed in isolation, as it often reflects outdated norms, technologies, and information that are no longer relevant. Moreover, individuals featured in such content may be unfairly subjected to scrutiny based on past actions or statements that, in today's context, could harm their personal or professional reputation. The consequences of resurfacing this content without adequate contextualization are amplified when AI technologies are involved. AI’s ability to synthesize and amplify such data across platforms can create a ripple effect, where even content that does not explicitly reveal personal information can still have far-reaching consequences. By connecting disparate data points, AI may draw conclusions or inferences about individuals, influencing public perception and potentially affecting career prospects or even legal outcomes. Unlike a human reader, who can contextually infer that a piece of reconstructed online content is part of a legacy web segment presented as a historical monument to the online world of times past, AI will not be able to distinguish such content from contemporary sources and will give it misplaced weight in its analysis. The ethical challenge here lies not just in the publication of legacy content and archival access, but in AI’s ability to endlessly circulate and reinterpret it in ways that were never intended by the original authors. This proposal explores the delicate balance between the preservation of historical digital records and respecting individuals' right to be forgotten (RTBF) in the age of AI. It seeks to question how AI-powered tools that reshape the reading and presentation of web archives challenge existing ethical norms. By examining potential frameworks for responsible digital archiving, the proposal aims to identify solutions that mitigate the risks posed by AI-driven resurfacing of legacy content in the public domain.
2:25pm - 2:45pm
Web Archives for Music Research Royal Danish Library, Denmark The Royal Danish Library has set a strategic goal to make more of its cultural heritage materials accessible and engaging for researchers by 2027. In this paper, we present findings from an advocacy initiative targeted at researchers at national universities in music-related fields. The national web archive provides primary sources and contextual information relevant to music researchers as they engage with our music collections. However, there is room for improvement in the connection between these collections and our understanding of user needs. Reports by Healy et al. (2022) and Healy & Byrne (2023) explore the challenges researchers face when using web archives, highlighting the ongoing need to examine the skills, tools, and methods associated with web archiving. Additionally, the sounds of the web—from MIDI to streaming—are an integral part of its history, yet this aspect is often overlooked by tools like the Internet Archive's Wayback Machine (Morris, 2019). Through semi-structured interviews with fellow curators and music researchers at universities, we identify current barriers to access and user requirements for improved utilization of web archival resources. Our advocacy initiative also allows us to summarize current research trends as feedback for web curators. In conclusion, we describe how the web curators processed our findings into suggestions for updates and refinements to web crawling strategies and the built-in tools in the SolrWayBack installation. References Healy, S., & Byrne, H. (2023). Scholarly Use of Web Archives Across Ireland: The Past, Present & Future(s) (WARCnet Special Reports). Aarhus University. https://cc.au.dk/fileadmin/dac/Projekter/WARCnet/Healy_Byrne_Scholarly_Use_01.pdf Healy, S., Byrne, H., Schmid, K., Bingham, N., Holownia, O., Kurzmeier, M., & Jansma, R. (2022). Skills, Tools, and Knowledge Ecologies in Web Archive Research (WARCnet Special Reports). Aarhus University. https://cc.au.dk/fileadmin/dac/Projekter/WARCnet/Healy_et_al_Skills_Tools_and_Knowledge_Ecologies.pdf . Morris, J. W. (2019). Hearing the Past: The Sonic Web from MIDI to Music Streaming. In N. Brügger & I. Milligan (Eds.), The SAGE Handbook of Web History (pp. 491–510). Sage. 2:45pm - 3:05pm
IXP History Collection: Recording the Early Development of the Core of the Public Internet 1Independent Researcher, Ireland; 2University of Barcelona, Spain The IXP History Collection is an ongoing project which seeks to record and document histories of the Internet exchange points (IXPs) which form the core of the Internet’s topology. An IXP is the point at which Internet Service Providers and Content Delivery Networks connect and exchange data with each other (“peering”). IXPs form the topological core of the Internet backbone, their histories are inextricably linked to the commercialization of the Internet, and their development is a significant milestone in the global history of media and communications. Efforts should therefore be made to ensure that we preserve IXP histories for future generations. The main purpose of the project is to collect and preserve networking and IXP histories due to valid concerns that these histories will be lost from the global record unless attempts are made to start preserving them now. In particular, the project is concerned with the fragility of electronic information and born digital documents, records, and multimedia, otherwise known as born digital heritage. As a starting point, the project utilizes the Internet Exchange Directory which is maintained by Packet Clearing House, an intergovernmental treaty organization responsible for providing operational support and security to critical Internet infrastructure, including Internet exchange points. The PCH IX Directory is one of the earliest organized efforts to develop and maintain a database for recording and tracking the establishment, development and global growth of IXPs. The project then focuses on documenting IXP histories through as many online sources as possible (e.g., websites/pages, reports, journals, magazines/newspaper articles, old emails on public mail lists). The project relies on the use of web archives as a research tool for tracing IXP histories, as well as a preservation tool using the Save Page functions in the Wayback Machine and Arquivo.pt. In this presentation we discuss our approach and methodology for developing the collection and making it available online as a reference resource, and we offer an overview of the importance of using web archives for documenting and preserving Internet and IXP histories. By presenting our approach, we hope to offer a case study that demonstrates how web archive research can be integrated with traditional research methods (Healy et al., 2022), and promote more widespread use of web archives as research tools for historical inquiry, and the long-term preservation of digital research (Byrne et al., 2024). Resources: Arquivo.pt: https://arquivo.pt/ IXP History Collection - Information Directory | Zotero: https://www.zotero.org/groups/4944209/ixp_history_collection_-_information_directory/library Packet Clearing House, Internet Exchange Directory: https://www.pch.net/ixp/dir Wayback Machine: https://web.archive.org/ References: Healy, S., Byrne, H., Schmid, K., Bingham, N., Holownia, O., Kurzmeier, M. and Jansma, R. (2022). Skills, Tools, and Knowledge Ecologies in Web Archive Research. WARCnet Special Report, Aarhus, Denmark: https://web.archive.org/web/20221003215455/https://cc.au.dk/fileadmin/dac/Projekter/WARCnet/Healy_et_al_Skills_Tools_and_Knowledge_Ecologies.pdf Byrne, H., Boté-Vericad, J-J, and Healy, S. (2024) Exploring Skills and Training Requirements for the Web Archiving Community. In: Aasman, S., Ben-David, A., and Brügger, N., eds. 
The Routledge Companion to Transnational Web Archive Studies. Routledge. 3:05pm - 3:25pm
Lost, but Preserved - A Web Archiving Perspective on the Ephemeral Web Internet Archive, United States of America The World Wide Web, our era's most dynamic information ecosystem, is characterized by its transient nature. Recent studies have highlighted the alarming rate at which web content disappears or changes, a phenomenon known as "link-rot". A 2024 Pew Research Center study revealed that 38% of webpages from 2013 were inaccessible a decade later. Even more striking, Ahrefs, an SEO company, reported that at least 66.5% of links to sites created in the last nine years are now dead. These findings echo earlier research by Zittrain et al., which uncovered significant link-rot in journalistic references from New York Times articles. While these statistics paint a grim picture of digital impermanence, they often overlook a crucial factor: the role of web archives. This talk aims to reframe the link-rot discussion by considering the preservation efforts of various web archiving institutions. Our research revisiting the Pew dataset yielded a surprising discovery: only one in nine URLs from the original study were truly missing, the remaining bulk had at least one capture in a web archive. This finding suggests that the digital landscape, when viewed through the lens of web archiving, may be less ephemeral than commonly perceived. Key points we will explore: 1. The state of link-rot: We will review recent studies and their methodologies, discussing the implications of their findings for digital scholarship, journalism, and information access. 2. Web archives as digital preservationists: We will introduce major web archiving initiatives and explain their crucial role in maintaining the continuity of online information. 3. Reassessing link rot with archives in mind: We will present our methodology and findings from reexamining the Pew dataset, demonstrating how web archives mitigate content loss. 4. Challenges and limitations of web archiving: Despite their importance, web archives face significant technical, legal, and resource constraints. We will discuss these challenges and their impact on preservation efforts. 5. The future of web preservation: We will explore emerging technologies and strategies in web archiving, including machine learning approaches to capture dynamic content and efforts to preserve the context of web pages. 6. Call to action: We will emphasize the importance of supporting and expanding web archiving efforts, discussing how researchers, institutions, and individuals can contribute to these initiatives. This talk aims to provide a more nuanced understanding of digital impermanence and preservation. While acknowledging the real challenges posed by link-rot, we will highlight the often-overlooked role of web archives in maintaining our digital heritage. By doing so, we hope to foster greater appreciation for web archiving efforts and encourage increased support for these crucial initiatives. Our goal is to leave the audience with a renewed perspective on the state of the web's preservability and a clear understanding of why supporting web archiving is essential for ensuring the longevity and accessibility of our shared digital knowledge. As we navigate an increasingly digital world, recognizing that much of what seems lost may actually be preserved is vital for researchers, educators, journalists, lawyers, and anyone who values the continuity of online information. |
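The "Lost, but Preserved" re-analysis above rests on one basic operation: checking whether a URL that is dead on the live web has at least one capture in a web archive. The sketch below approximates that check against the Internet Archive's public availability API; it is an illustration of the idea only, consults a single archive, and is not the authors' methodology.

```python
"""Check whether URLs that are dead on the live web still have at least one
capture in the Wayback Machine (illustrative; not the study's methodology)."""
import requests

AVAILABILITY_API = "https://archive.org/wayback/available"

def has_archived_copy(url: str) -> bool:
    """Return True if the Wayback Machine reports a closest available snapshot."""
    resp = requests.get(AVAILABILITY_API, params={"url": url}, timeout=30)
    resp.raise_for_status()
    snapshot = resp.json().get("archived_snapshots", {}).get("closest")
    return bool(snapshot and snapshot.get("available"))

if __name__ == "__main__":
    sample = [
        "http://example.com/",               # placeholder URLs; a real study
        "http://example.org/some/old/page",  # would iterate over its dataset
    ]
    for url in sample:
        print(url, "archived" if has_archived_copy(url) else "not found")
```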
2:05pm - 3:40pm | WORKSHOP #02: Web Archive Collections As Data Location: Slottsbiblioteket (ground floor) |
|
Web Archive Collections as Data 1University of Alicante, Spain; 2Library of Congress, United States of America; 3IIPC, United States of America; 4National Library of Norway, Norway; 5British Library, UK; 6University of Illinois Urbana-Champaign, United States of America GLAM (Galleries, Libraries, Archives and Museums) have started to make available their digital collections suitable for computational use following the Collections as Data principles[1]. The International GLAM Labs Community[2] has explored innovative and creative ways to publish and reuse the content provided by cultural heritage institutions. As part of their work, and as a collaborative-led effort, a checklist[3] was defined and focused on the publication of collections as data. The checklist provides a set of steps that can be used for creating and evaluating digital collections suitable for computational use. While web archiving institutions and initiatives have been providing access to their collections - ranging from sharing seedlists to derivatives to “cleaned” WARC files - there is currently no standardised checklist to prepare those collections for researchers. This workshop aims to involve web archiving practitioners and researchers in reevaluating whether the GLAM Labs checklist can be adapted for web archive collections. The first part of the workshop will introduce the GLAM checklist, followed by two use cases that show how the web archiving teams have been working with their institutions’ Labs to prepare large data packages and corpora for researchers. In the second part of the workshop, we want to involve the audience in identifying the main challenges to implementing the GLAM checklist and determining which steps require modifications so that it can be used successfully for web archive collections. First use case The UK Web Archive has recently started to publish the metadata to some of our inactive curated collections as data. This project developed new workflows by using the Datasheets for Datasets framework to provide provenance information on the individual collections that were published as data. In this presentation, we will highlight how participants can:
Second use case Our library recently launched its first Web News Corpus, making more than 1.5 million texts from 268 news websites available for computational analysis through an API. The aim is to facilitate text analysis at scale.[4] This presentation will provide a brief description of “warc2corpus”, our workflow for turning WARCs into text corpora, aiming to satisfy the FAIR principles, while also taking immaterial rights into account.[5] (A generic, illustrative WARC-to-text sketch follows this workshop block.) In this presentation, we will showcase how users can:
Third use case Our library has been working to refine and improve workflows that enable creation and publishing of web archive data packages for computational research use. With a recently hired Senior Digital Collections Data Librarian, and working with our institution’s Labs, web archiving staff have prepared new data packages for web archive data in response to recent research requests. We will provide some background into this work and developments that led to the creation of the data librarian role, and will share details about how we are creating our data packages and sharing derivative datasets with researchers. Using a recent data package release, we will compare local practices in providing data to researchers with the GLAM checklist and talk through ways in which our institution does or does not comply. REFERENCES: [1] Padilla, T. (2017). “On a Collections as Data Imperative”. UC Santa Barbara. pp. 1–8; [2] https://glamlabs.io/ [3] Candela, G. et al. (2023), "A checklist to publish collections as data in GLAM institutions", Global Knowledge, Memory and Communication. https://doi.org/10.1108/GKMC-06-2023-0195 [4]: Tønnessen, J. (2024). “Web News Corpus”. National Library of Norway. https://www.nb.no/en/collection/web-archive/research/web-news-corpus/ [5]: Tønnessen J., Birkenes M., Bremnes T. (2024). “corpus-build”. GitHub. National Library of Norway. https://github.com/nlnwa/corpus-build; Birkenes M., Johnsen, L., Kåsen, A. (2023). “NB DH-LAB: a corpus infrastructure for social sciences and humanities computing.” CLARIN Annual Conference Proceedings. [6]: “dhlab documentation”. National Library of Norway. https://dhlab.readthedocs.io/en/latest/ |
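Several contributions in this workshop involve turning WARC files into research-ready text, as in the warc2corpus use case above. The sketch below shows one generic way to do this with the warcio and BeautifulSoup libraries, writing one JSON line per archived HTML response; it is not the warc2corpus pipeline, and the output fields are illustrative.

```python
"""Generic WARC-to-text extraction (illustrative; not NB's warc2corpus).
Writes one JSON object per successful HTML response: URL, capture time, text."""
import json
import sys

from bs4 import BeautifulSoup
from warcio.archiveiterator import ArchiveIterator

def warc_to_jsonl(warc_path: str, out_path: str) -> None:
    with open(warc_path, "rb") as stream, open(out_path, "w", encoding="utf-8") as out:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            http = record.http_headers
            if http is None or http.get_statuscode() != "200":
                continue
            if "text/html" not in (http.get_header("Content-Type") or ""):
                continue
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            doc = {
                "url": record.rec_headers.get_header("WARC-Target-URI"),
                "capture_time": record.rec_headers.get_header("WARC-Date"),
                "title": soup.title.get_text(strip=True) if soup.title else None,
                "text": soup.get_text(" ", strip=True),
            }
            out.write(json.dumps(doc, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    warc_to_jsonl(sys.argv[1], sys.argv[2])
```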
3:40pm - 4:10pm | BREAK Location: Folkestova (upstairs) Participants in the 2025 Mentoring Program can meet at the top of the old granite stairs outside of Målstova. Seating is available in the cafeteria/bar (upstairs) and in the library hallways (upstairs and ground floor). If the weather is nice, there are also small parks immediately in front of and behind the National Library building. |
4:10pm - 4:20pm | POSTER SLAM INTRO Location: Målstova (upstairs) Session Chair: Olga Holownia, IIPC Streamed to Store Auditorium. |
4:20pm - 4:40pm | POSTER SLAM Location: Målstova (upstairs) Session Chair: Olga Holownia, IIPC Streamed to Store Auditorium. |
|
4:20pm - 4:21pm
‘We Are Now Entering the Pre-election Period’: Experimental Twitter Capture at The National Archives The National Archives (UK), United Kingdom
4:21pm - 4:22pm
The BnF DataLab Services and Tools for Researchers Working on Web Archives Bibliothèque nationale de France, France
4:22pm - 4:23pm
Designing Art Student Web Archives The New School, United States of America
4:23pm - 4:24pm
Next Steps Towards A Formal Registry Of Web Archives For Persistent And Sustainable Identification Royal Danish Library, Denmark
4:24pm - 4:25pm
Using Web Archives to Construct the History of an Academic Field University of Bergen, Norway
4:25pm - 4:26pm
Consortium on Electronic Literature (CELL) University of Bergen, Norway
4:26pm - 4:27pm
Arquivo.pt Annual Awards: A Glimpse Arquivo.pt, Portugal
4:27pm - 4:28pm
Arquivo.pt API/Bulk Access and Its Usage Arquivo.pt, Portugal
4:28pm - 4:29pm
Failed Capture or Playback Woes? A Case Study in Highly Interactive Web Based Experiences Smithsonian Libraries and Archives, United States of America
4:29pm - 4:30pm
HAWathon: Participants Experience National and University Library in Zagreb, Croatia
4:30pm - 4:31pm
Supporting Best Practices for Archiving Social Media by Heritage Institutions in Flanders (and Beyond) 1meemoo, Flemish Institute for Archives, Belgium; 2KADOC at Catholic University of Leuven, Belgium
4:31pm - 4:32pm
Planning Web Archiving Within a Four-Year Scope: Making the New Collection Plan for the Years 2025-2028 in the National Library of Finland National Library of Finland, Finland
4:32pm - 4:33pm
Redirects Unraveled: From Lost Links to Rickrolls 1Old Dominion University, United States of America; 2Internet Archive, United States of America; 3Filecoin Foundation, Netherlands
4:33pm - 4:34pm
Use of Screenshots as a Harvesting Tool for Dynamic Content and Use of AI for Later Data Analysis Computer History Museum Slovenia (Računališki muzej), Slovenia
4:34pm - 4:35pm
Asynchronous and Modular Pipelines for Fast WARC Annotation Common Crawl Foundation, United States of America
4:35pm - 4:36pm
Politely Downloading Millions of WARC Files Without Burning the Servers Down Common Crawl Foundation, United States of America
4:36pm - 4:37pm
Robots.txt and Crawler Politeness in the Age of Generative AI Common Crawl Foundation, United States of America
4:37pm - 4:38pm
Experiences Switching an Archiving Web Crawler to Support HTTP/2 Common Crawl Foundation, United States of America |
4:40pm - 6:00pm | POSTER SESSION Location: Folkestova (upstairs) |
7:30pm - 9:30pm | DINNER Location: CREDO Restaurant | Kantine (downstairs) |
Date: Thursday, 10/Apr/2025 | |
9:00am - 9:20am | MORNING COFFEE Location: Folkestova (upstairs) |
9:20am - 9:25am | LIGHTNING TALK SESSION 3: INTRODUCTION Location: Målstova (upstairs) Session Chair: Helena Byrne, British Library |
9:20am - 9:25am | LIGHTNING TALK SESSION 4: INTRODUCTION Location: Store Auditorium (ground floor) Session Chair: Dorothée Benhamou-Suesser, National Library of France |
9:25am - 9:55am | LIGHTNING TALK SESSION 3 Location: Målstova (upstairs) Session Chair: Helena Byrne, British Library |
|
9:25am - 9:30am
The Practice of Web Archiving Statistics and Quality Evaluation Based on the Localization of ISO/TR 14873:2013(E): A Case Study of the NSL-WebArchive Platform 1National Science Library, Chinese Academy of Sciences, China; 2Zhejiang Economic & Information Center, China; 3Zhejiang Economic & Information Development Co., Ltd, China
9:30am - 9:35am
Modifying ePADD for Entity Extraction in Non-English Languages National Library of Norway, Norway
9:35am - 9:40am
Arquivo.pt Query Logs Arquivo.pt, Portugal
9:40am - 9:45am
What You See No One Saw 1Drexel University, United States of America; 2Old Dominion University, United States of America |
9:25am - 9:55am | LIGHTNING TALK SESSION 4 Location: Store Auditorium (ground floor) Session Chair: Dorothée Benhamou-Suesser, National Library of France |
|
9:25am - 9:30am
Collaborative Collections at Arquivo.pt: Four Years of Recordings from the City of Sines (Portugal) Arquivo.pt, Portugal
9:30am - 9:35am
Participatory Web Archiving: The Tensions Between the Instrumental Benefits and Democratic Value 1University of Sheffield, United Kingdom; 2Institute for Web Science and Technologies (WeST), Germany; 3Bodleian Libraries, United Kingdom
9:35am - 9:40am
A Minimal Computing Approach for Web Archive Research 1University of Victoria, Canada; 2Universidad Autónoma del Estado de México, Mexico
9:40am - 9:45am
Where Fashion Meets Science: Collecting and Curating a Creative Web Archive University of the Arts London, United Kingdom |
9:55am - 10:05am | SHORT BREAK |
10:05am - 11:15am | SESSION #04: Discovery & Access (News/Newspapers) Location: Målstova (upstairs) Session Chair: Tita Enstad, National Library of Norway |
|
10:05am - 10:25am
Unlocking the Archive: Open Access to News Content as Corpora National Library of Norway, Norway The content of web archives is potentially highly valuable to research and knowledge production. However, most web archives have strict access regimes to their collections, and with good reason: archived content is often subject to copyright restrictions and potentially also data protection laws. When moving towards best practices, a key question is how to improve access, while also maintaining legal and ethical commitments. [1] This presentation will show how the National Library of Norway (NB) has worked to provide open access to a corpus of more than 1.5 million news articles in the web archive. By providing the collection as data - scoping it across the typical crawl job-oriented segmentation - anyone gets access to computational text analysis at scale. By serving metadata and snippets of content through a REST API and keeping the full content in-house, we align with FAIR principles while accounting for immaterial rights and data protection laws. [2] The key steps in building the news corpora will be walked through, such as: Further, we will demonstrate how anyone can tailor corpora for their own use and analyse news text at scale - either with user-friendly apps, or with computational notebooks via API. [3] The demonstration highlights some of the limitations, but also the great possibilities for allowing distant reading of web archives. We will discuss how the approach to collections as data provides broader access and new perspectives for researchers. Open access further allows for utilisation in new contexts, such as higher education, government and commercial business. With easy-to-use web applications on top, the threshold for non-technical users is lowered, potentially increasing the use of web archives vastly. We also reflect on how interdisciplinary cooperation and user-orientation have been vital in designing and building the solution. 10:25am - 10:45am
Recently Orphaned Newspapers: From Archived Webpages to Reusable Datasets and Research Outlooks 1Academia Sinica, Taiwan; 2National Yang Ming Chiao Tung University, Taiwan We report on our progress in converting the web archives of a recently orphaned newspaper into accessible article collections in IPTC (International Press Telecommunications Council) standard format for news representation. After the conversion, old articles extracted from a defunct news website are now reincarnated as research datasets meeting the FAIR data principles. Specifically, we focus on Taiwan's Apple Daily and work on the WARC files built by the Archive Team in September 2022 at a time when the future of the newspaper seemed dim [0]. We convert these WARC files into de-duplicated collections of pure text in ninjs (News in JSON) format [1]. The Apple Daily in Taiwan had been in publication since 2003 but discontinued its print edition in May 2021. By August 2022, its online edition was no longer being updated, and the entire news website has been inaccessible since March 2023. The fate of Taiwan's Apple Daily followed that of its (elder) sister publication in Hong Kong. The Apple Daily in Hong Kong was forced to cease its entire operation after midnight June 23, 2021 [2]. Its pro-democracy founder, Jimmy Lai (黎智英) [3], was arrested under Hong Kong's security law the year before. Being orphaned and offline, past reports and commentaries from the newspapers on contemporary events (e.g. the Sunflower Movement in Taiwan and the Umbrella Movement in Hong Kong) become unavailable to the general public. Such inaccessibility has impacts on education (e.g. fewer news sources to be edited into Wikipedia), research (e.g. fewer materials to study the early 2000s zeitgeist in Hong Kong and Taiwan), and knowledge production (e.g. fewer traditional Chinese corpora to work with). Our work in transforming the WARC records into ninjs objects produces a collection of 953,175 unique news articles totaling 4.3 GB. The articles are grouped by the day/month/year they were published, so it is convenient to look up a specific date for the news published on that day. Metadata about each article — headline(s), subject(s), original URI, unique ID, among others — are mapped into the corresponding fields in the ninjs object for ready access (a minimal mapping sketch follows this session block). (For figures, please access them at this dataset [4].) Figure 1 shows the ninjs object derived from a news article that was published on 2014-03-19, archived on 2021-09-29, and converted by us on 2024-02-17. Figure 2 is a screenshot of the webpage where the news was originally published. Figure 3 displays the text file of the ninjs object in Figure 1. Currently the images and videos accompanying the news article have not been extracted. Another process is planned to preserve and link to these media files in the produced ninjs objects. In our presentation, we shall elaborate on technical details (such as the accuracy and coverage of the conversion) and exemplary use cases of the collection. We will touch on the roles of public research organizations in preserving and making available materials that are deemed out of commerce and circulation. [0] https://wiki.archiveteam.org/index.php/Apple_Daily#Apple_Daily_Taiwan [3] https://en.wikipedia.org/wiki/Jimmy_Lai [4] https://pid.depositar.io/ark:37281/k5p3h9k37
10:45am - 11:05am
NewsWARC: Analyzing News Over Time in the Web Archive 1Bibliotheca Alexandrina, Egypt; 2Alamein International University, Egypt News consumption, as studies generally suggest, is quite common globally. Today, individuals, wherever there is an Internet connection, access news predominantly online. On the web, news websites rank relatively high by number of visits. Considering the history of the web, the news media industry was one domain of society to adopt the web as technology very early on. Being of such significance, news content on the web is one to particularly investigate, using the web archive as data source. We present NewsWARC, a tool, developed as an internship project, for aiding researchers to explore news content in a web archive collection over time. NewsWARC consists of two components: the data analyzer and the viewer. The data analyzer is code that runs on data in the collection and uses machine learning to get information about each news article or post, namely, sentiment, named entities, and category, and store that into a database for access via the second component that serves as the interface for querying and visualizing the pre-analyzed data. We report on our experience processing data from the Common Crawl news collection to use in testing, including comparing performance of the data analyzer running on different hardware configurations. We show examples of queries and trend visualizations that the viewer offers, such as examining how the sentiment of articles in health-related news varies over the course of a pandemic. In developing this initial prototype, while we narrowed our focus with regard to information that the analyzer returns to sentiment, named entities, and category, there exists a wider range of analyses to include in future work, such as topic modeling, keyword and keyphrase extraction, measuring readability and complexity, and fact vs. opinion classification. Also as future work, this overall functionality can be deployed as a service for an alternative interface to supplement researcher access to web archives. 11:05am - 11:10am
Zombie E-Journals and the National Library of Spain Biblioteca Nacional de España, Spain A "zombie e-journal" refers to an electronic journal that has become inaccessible, but for which a web archive has preserved a copy, although sometimes that copy is not perfectly accurate. It is widely recognized that, each year, a significant number of e-journals disappear without existing in print, resulting in the loss of their content on a global scale. This constitutes a substantial loss of economic investment, scholarly knowledge, and cultural heritage. While many universities maintain institutional repositories to safeguard publications, a large number of e-journals lack sustainable preservation methods due to financial constraints. In response to this challenge, the Spanish Web Archive initiated efforts to explore potential solutions. A key question was posed: is it feasible to ensure the long-term preservation of more than 10,000 open-access e-journals in Spain? The National Library of Spain, which serves as the National Centre for ISSN assignment, maintains a catalogue that includes all e-journals registered with an ISSN. The first phase of this initiative started in 2020, when the Spanish Web Archive implemented an annual broad crawl encompassing all URLs associated with electronic journals in Spain. This proactive approach significantly increases the likelihood of locating missing e-journals in the future. Currently, the project has entered its second phase, during which e-journals that became inaccessible between 2009 and 2023 have been identified. To date, over 500 zombie e-journals have been recovered through consultations with the Spanish Web Archive. The full list of these journals is publicly available through the project’s website and integrated into the National Library’s catalogue. In the forthcoming third phase, the identified e-journals will be formally declared out-of-commerce works, according to Directive (EU) 2019/790, thus facilitating open access to their content. This step will allow users to once again access and benefit from these resources. Additionally, a comprehensive system has been developed to detect missing e-journals, conduct quality assurance (QA) processes on the captured content, and integrate access to these journals through the library's website and catalogue. The broad crawl has proven effective in identifying missing e-journals, and following quality assurance, the recovered information is systematically incorporated into the catalogue. |
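The Apple Daily conversion described above maps extracted articles onto ninjs (News in JSON) properties and files them by publication date. The snippet below is a minimal sketch of that mapping step; the input keys are hypothetical and the exact ninjs profile and directory layout used by the project may differ.

```python
"""Minimal sketch: map an extracted article onto ninjs-style properties and
file it under year/month/day. Input keys ("url", "id", "published", "title",
"subjects", "text") are hypothetical; the project's actual schema may differ."""
import json
from pathlib import Path

def to_ninjs(article: dict) -> dict:
    # Field names follow common ninjs conventions (uri, headline, subject,
    # versioncreated, body_text); the project's exact profile may differ.
    return {
        "uri": article["url"],
        "versioncreated": article["published"],   # e.g. "2014-03-19T08:00:00Z"
        "headline": article["title"],
        "subject": [{"name": s} for s in article.get("subjects", [])],
        "body_text": article["text"],
    }

def store(article: dict, root: str = "corpus") -> Path:
    """Write one ninjs object per article under <root>/YYYY/MM/DD/ for lookup by date."""
    year, month, day = article["published"][:10].split("-")
    out_dir = Path(root) / year / month / day
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{article['id']}.json"
    out_file.write_text(json.dumps(to_ninjs(article), ensure_ascii=False, indent=2),
                        encoding="utf-8")
    return out_file
```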
10:05am - 11:15am | SESSION #05: Sustainability Location: Store Auditorium (ground floor) Session Chair: Bjarne Andersen, Royal Danish Library |
|
10:05am - 10:25am
42 Tips to Diminish the CO2 Impact of Websites 1National Archives of the Netherlands, Netherlands; 2Dutch Digital Heritage Network, Netherlands; 3Netherlands Institute for Sound and Vision, Netherlands; 4Van Heijst Information Consulting, Netherlands The internet has become indispensable to modern life, yet its environmental impact is often overlooked. Despite terms like "virtual" and "cloud" suggesting a minimal footprint, the global internet is a significant energy consumer. In 2020, it accounted for approximately 4% of global energy consumption, and if usage trends persist, this figure could rise to 14% by 2040. Archiving even a small number of websites contributes to the growing carbon footprint of digital archives, which compounds over time. To address this, the Dutch Digital Heritage Network commissioned research to assess the CO2 impact of current websites across various heritage organizations. The study provided practical recommendations to reduce this impact, such as optimizing image sizes, employing green hosting, and streamlining unnecessary code. These strategies not only benefit the public-facing side of websites but also hold potential for the backend, such as in the harvesting process for archiving. In our presentation, we will share these research findings and highlight actionable steps organizations can take to create more energy-efficient digital archives. Additionally, we will explore the question of what should be archived: Is every aspect of a website equally essential for long-term preservation? Lastly, we are investigating incremental archiving as a solution to reduce both storage needs and emissions. This approach, which focuses on capturing specific updates rather than performing full harvests, offers a more sustainable alternative for digital preservation. 10:25am - 10:45am
Building Towards Environmentally Sustainable Web Archiving: The UK Government Web Archive and Beyond 1University of London, United Kingdom; 2The National Archives (UK), United Kingdom There is an urgent need for the fostering of more environmentally sustainable archival methods and approaches that place sustainability frameworks at the centre of archival practice, aiding archiving institutions in their ambitions to achieve Net Zero. This will involve sector-wide collaboration to develop new ways of working and the rethinking of long-established best practice in order to define and adopt ways of working that are ‘good enough’. The challenge is particularly urgent for born-digital archives, which form an increasingly significant (and rapidly growing) part of the archival record. Pendergrass et al. 2019 have argued for fundamental change in ‘practices for appraisal, permanence, and availability of digital content’ (p. 4), and the Digital Preservation Coalition has similarly called for a re-evaluation of all aspects of digital preservation (Kilbride 2023). This paper will discuss one approach to the development of a framework for more environmentally sustainable web archiving, using the UK Government Web Archive as a case study. First, it will present the findings of a workshop on ‘Archives and the environment’, which was held at The UK National Archives in 2023. One of the main strands of discussion was the environmental cost both of creating and preserving born-digital and digitised archives and of the digital infrastructure, tools and methods used to analyse them. Recommendations arising from the event and subsequent report have informed an action plan for the UK Government Web Archive (UKGWA) as it begins to explore its environmental footprint. The UKGWA action plan involves four main strands of work: establishing, as far as possible, the current environmental impact of the web archive, drawing on a range of metrics; identifying those aspects of the web archiving workflows that may be streamlined or redeveloped in order to reduce that impact; designing and prototyping new and more sustainable processes within the UKGWA; and producing recommendations for good practice that may be adopted and/or adapted by other national and international web archives. The planned research is concerned not just with environmentally sustainable practice within the UKGWA but also with Scope 3 carbon emissions (that is, emissions that are produced not by an organisation itself but by those for whom it is indirectly responsible, in this case users and suppliers). The research is at an early stage, but we hope that the development of an extensible and customisable framework, accompanied by a toolkit that builds on the work of the Digital Humanities Climate Coalition, will provide an opportunity for wider collaboration. The work presented here is grounded in the experience and practice of the UK Government Web Archive, but it will benefit enormously from being placed in dialogue with the work of the IIPC and other national and institutional web archives concerned with the impact of climate change on digital archival practice and of digital archiving and preservation on climate change. K. Pendergrass et al., ‘Toward environmentally sustainable digital preservation’, The American Archivist (2019), 82:1, 165-206 W. Kilbride, ‘The Anthropocene remembered: digital memory after the climate crisis’, Digital Preservation Coalition Blog (2019) 10:45am - 11:05am
Preservation of Historical Data: Using Warchaeology to Process 20 Years of Harvesting National Library of Norway, Norway The National Library of Norway has been harvesting the internet since the beginning of the millennium, with a primary focus and priority on the collection and storage of data. Over 25 years, web harvesting methods and preservation systems have changed. Consequently, the collection is composed of various file types, including ARC, WARC, and files produced by NEDLIB[1]. In more recent years our focus has shifted towards access and quality assurance, and the need to include the older data has increased. But how do we utilize this data, which by now is poorly structured, has little to no documentation, and is hard for modern software to read? To address these issues and move toward the ultimate goal of making the collections fully discoverable and available, the National Library of Norway developed an open-source tool, Warchaeology[2], capable of converting, validating and deduplicating web archive collection data. This presentation will outline how we have used this tool to process 2PB of data, harvested since 2001. The objective is better management and preservation: identifying collections and groupings of data, parsing and sorting metadata, identifying formats and determining how they should be processed or converted, deduplicating files, and gathering general insight into the collections.
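Before conversion and deduplication, a first step like the one described above is simply taking stock of what a legacy harvest directory contains. The sketch below is a generic Python inventory pass under assumed file-naming conventions; it does not reproduce Warchaeology's own interface, which is the tool actually used for conversion, validation and deduplication.

```python
from collections import defaultdict
from pathlib import Path

# Rough inventory of a legacy harvest directory: group files by assumed container
# format so that ARC/NEDLIB material can be routed to conversion and WARCs to
# validation and deduplication. Purely illustrative.
FORMAT_HINTS = {".arc": "arc", ".arc.gz": "arc", ".warc": "warc", ".warc.gz": "warc"}

def classify(path: Path) -> str:
    name = path.name.lower()
    for suffix, fmt in FORMAT_HINTS.items():
        if name.endswith(suffix):
            return fmt
    return "other"  # e.g. NEDLIB output or unknown legacy files needing closer inspection

def inventory(root: str) -> dict[str, dict[str, float]]:
    stats: dict[str, dict[str, float]] = defaultdict(lambda: {"files": 0, "gib": 0.0})
    for path in Path(root).rglob("*"):
        if path.is_file():
            fmt = classify(path)
            stats[fmt]["files"] += 1
            stats[fmt]["gib"] += path.stat().st_size / 2**30
    return stats

if __name__ == "__main__":
    for fmt, s in inventory("/data/legacy-harvests").items():
        print(f"{fmt:6s} {int(s['files']):>8d} files {s['gib']:,.1f} GiB")
```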
11:05am - 11:10am
Analysing the Publications Office of the European Union Web Archive for the Rationalisation of Digital Content Generation Publications Office of the European Union, Luxembourg More and more information from EU institutions, bodies and agencies is only made available on their public websites. However, web content often has a short lifespan, and this information is at risk of getting lost when websites are updated, substantially redesigned or taken offline. As part of its different preservation activities, the Publications Office of the EU crawls, curates and preserves the content and design of these websites, making them available for current and future generations. We are also preparing the ingestion of this collection into our digital archive, to ensure its long-term preservation. We have recently performed a full export of the most recent crawls from our web archive collection, spanning from March 2019 to September 2024, as a set of WARC files. We have extracted relevant information regarding all the "response" and "revisit" records in the collection and inserted it into a relational database, allowing efficient custom analyses. In this presentation, we will show various interesting statistics we have generated about the content of our web archive. These include the analysis of large response payloads (more than 100 MB), as well as the relative footprint of crawled video files. We also investigate the amount of duplication among records: duplication avoided through 'revisit' records, as well as duplicate 'response' records that are still present in the archive. We also explain how we have used this information to refine our crawling strategies in order to rationalise our digital content generation going forward. Finally, we define potential policies to curate the existing archive prior to ingestion in a long-term digital repository, where the impact on the carbon footprint may be even more significant. |
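A minimal sketch of the kind of extraction described above, using the warcio library and SQLite: per-record metadata for 'response' and 'revisit' records goes into a relational table that can then be queried for payload sizes, MIME types or duplicate digests. The table layout and field choices are illustrative assumptions, not the Publications Office's actual schema.

```python
import sqlite3
from warcio.archiveiterator import ArchiveIterator

def load_warc(warc_path: str, db_path: str = "webarchive.db") -> None:
    """Insert basic metadata for 'response' and 'revisit' records into SQLite."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS records (
                     warc_file TEXT, rec_type TEXT, url TEXT, date TEXT,
                     mime TEXT, block_bytes INTEGER, digest TEXT)""")
    with open(warc_path, "rb") as fh:
        for rec in ArchiveIterator(fh):
            if rec.rec_type not in ("response", "revisit"):
                continue
            h = rec.rec_headers
            mime = rec.http_headers.get_header("Content-Type") if rec.http_headers else None
            con.execute("INSERT INTO records VALUES (?, ?, ?, ?, ?, ?, ?)",
                        (warc_path, rec.rec_type,
                         h.get_header("WARC-Target-URI"), h.get_header("WARC-Date"),
                         mime,
                         # WARC Content-Length is the record block length
                         # (HTTP headers plus payload for response records).
                         int(h.get_header("Content-Length") or 0),
                         h.get_header("WARC-Payload-Digest")))
    con.commit()
    con.close()

# Example analysis from the abstract: find large response records (> 100 MB).
# SELECT url, block_bytes FROM records
#   WHERE rec_type = 'response' AND block_bytes > 100 * 1024 * 1024;
```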
10:05am - 11:15am | WORKSHOP #03: Introduction to Web Graphs Location: Slottsbiblioteket (ground floor) |
|
Introduction to Web Graphs Common Crawl Foundation, United States of America The workshop will begin with a brief introduction to the concept of the webgraph or hyperlink graph - a directed graph whose nodes correspond to web pages and whose edges correspond to hyperlinks from one web page to another. We will also look at aggregations of the page-level webgraph at the level of Internet hosts or pay-level domains. The host-level and domain-level graphs are at least an order of magnitude smaller than the original page-level graph, which makes them easier to study. To represent and process webgraphs, we utilize the WebGraph framework, which was developed at the Laboratory of Web Algorithms (LAW) of the University of Milano. As a "framework for graph compression aimed at studying web graphs," it allows very large webgraphs to be stored and accessed efficiently. Even on a laptop computer, it's possible to store and explore a graph with 100 million nodes and more than 1 billion edges. The WebGraph framework is also used to compress other types of graphs, such as social network graphs or software dependency graphs. In addition, the framework and related software projects include tools for the analysis of web graphs and the computation of their statistical and topological properties. The WebGraph framework implements a number of graph algorithms, including PageRank and other centrality measures. It is an open-source Java project, but a re-implementation in the Rust language has recently been released. Over the past two decades, the WebGraph format has been widely used by researchers, for example those at LAW or Web Data Commons, to distribute graph dumps. It has also been used by open data initiatives, including the Common Crawl Foundation and the Software Heritage project. The workshop focuses on interactive exploration of one of the precompiled and publicly available webgraphs. We look at graph properties and metrics, learn how to map node identifiers (just numbers) and node labels (URLs), and compute the shortest path between two nodes. We also show how to detect "cliques", i.e. densely connected subgraphs, or how to run PageRank and related centrality algorithms to rank the nodes of our graph. We share our experiments on how these applications are used for collection curation: how cliques can be used to discover sites with content in a regional language, how link spam is detected or how global domain ranks are used to select a representative sample of websites. Finally, we will build a small webgraph from scratch using crawl data. Participants will learn how to explore webgraphs (even large ones) in an interactive way and learn how graphs can be used to curate collections. Basic programming skills and basic knowledge of the Java programming language are a plus but not required. Since this is an interactive workshop, attendees should bring their own laptops, preferably with the Java 11 (or higher) JDK and Maven installed. Nevertheless, it will be possible to follow the steps and explanations without having to type them into a laptop. We will provide download and installation instructions, as well as all teaching materials, prior to the workshop. |
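To make the host-level aggregation and ranking ideas concrete, the toy sketch below builds a small host graph from page-level links in Python with networkx and runs PageRank and a shortest-path query. The workshop itself uses the WebGraph framework (Java, with a recent Rust re-implementation), which is what actually scales to graphs with billions of edges; this sketch only mirrors the concepts.

```python
from urllib.parse import urlsplit
import networkx as nx

# A handful of page-level hyperlinks standing in for crawl data.
page_links = [
    ("https://a.example/page1", "https://b.example/post"),
    ("https://b.example/post", "https://c.example/"),
    ("https://c.example/", "https://a.example/page2"),
    ("https://a.example/page2", "https://c.example/about"),
]

# Aggregate the page-level graph to the host level, dropping intra-host links.
host_graph = nx.DiGraph()
for src, dst in page_links:
    src_host, dst_host = urlsplit(src).hostname, urlsplit(dst).hostname
    if src_host != dst_host:
        host_graph.add_edge(src_host, dst_host)

print("PageRank:", nx.pagerank(host_graph))
print("Shortest path a->c:", nx.shortest_path(host_graph, "a.example", "c.example"))
```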
11:15am - 11:45am | BREAK Location: Folkestova (upstairs) |
11:45am - 1:15pm | PANEL #02: Cross-Institutional Collaborations Location: Målstova (upstairs) Session Chair: Abbie Grotke, Library of Congress |
|
Past, Present & Future of Cross-Institutional Collaboration in Web Archiving: Insights from the Norwegian and Danish Web Archive, the NetArchiveSuite Community, & Beyond 1Royal Danish Library, Denmark; 2National Library of Norway, Norway; 3Bibliothèque nationale de France, France; 4Biblioteca Nacional de España, Spain; 5Analysis & Numbers, Denmark; 6Library of Congress, United States of America |
11:45am - 1:15pm | SESSION #06: Curating Social Media Location: Store Auditorium (ground floor) Session Chair: Tom Smyth, Library and Archives Canada |
|
11:45am - 12:05pm
Developing Social Media Archiving Guidelines at the National Archives of the Netherlands National Archives of the Netherlands, Netherlands At the beginning of 2024, we started a project to develop a nationwide guideline for archiving public social media content. This project aimed to address the increasing use of social media by Dutch governments and the current lack of archiving. Our presentation at the Web Archiving Conference 2025 will focus on the process of creating this guideline and presenting the final version. The primary target audience for this guideline is information professionals, who play a vital role in managing and preserving the archived social media content. However, we also recognise communication professionals as an important target audience, given their role in setting up and using the accounts. The guideline is structured into six modules:
This module provides a definition of social media and identifies what constitutes public information on the various platforms.
In this module, we examine the Dutch and European legal requirements and constraints related to archiving public social media content. Understanding the legal landscape is essential to ensure compliance and address any legal challenges.
This module provides practical recommendations on using social media in a way that facilitates easier archiving. Aimed at those managing the social media accounts, it includes tips on account settings and content creation.
This module addresses how content can be appraised and selected, and how to ensure that historically important information will eventually be transferred to the Dutch National Archives.
In this module we establish quality criteria for archiving social media content and explore various techniques to archive social media. Methods discussed include screen capturing and API usage. This module aims to equip professionals with the knowledge to choose the most effective archiving methods.
The final module presents real-world examples from the Netherlands and abroad. These case studies illustrate diverse methods and results, providing practical insights and lessons learned from other practitioners in the field. The creation of this guideline was a collaborative and intensive year-long process. We systematically engaged with a wide range of stakeholders and incorporated their feedback to ensure the guideline is comprehensive and practical. Our goal is to support government agencies in archiving their social media communications effectively. We are excited to share our journey and the outcomes of this project with our colleagues at the Web Archiving Conference. By presenting our experiences and insights, we hope to contribute to the ongoing discourse on social media archiving and inspire others in the field. 12:05pm - 12:25pm
Archiving the Social Media Profiles of Members of Government National Library of Luxembourg, Luxembourg As part of the 2023 national elections, the National Library of Luxembourg, in collaboration with the National Archives and the Ministry of State, launched a pilot project to archive the social media profiles of members of the government. The technical obstacles to archiving social platforms are becoming increasingly problematic, with the result that none of the major platforms can currently be archived effectively by our harvesters and service providers. Since most social media platforms are practically inaccessible to web crawlers and conventional web archiving methods, we decided to try a more direct approach: asking the members of government directly to download the data from their profiles and hand it over to the National Library and National Archives. With the help of the Ministry of State, we sent out a call for participation, with specific guidelines for exporting datasets from the social networks, to the archive delegates and communication departments of each ministry, as well as to the ministers themselves. The response to this first call for participation was very positive, despite the time pressure between the election and the formation of a new government, and the high chance of many ministers leaving office. In addition to elaborating the guidelines for downloading datasets from different platforms, we offered direct technical support to the people involved in the ministries, and even to the members of government themselves, and retrieved the data individually on site. We were able to retrieve the majority of the government's profiles, covering the five years of their term. This pilot project represents a direct and effective method to secure the data of profiles of high public interest. The National Library and National Archives of Luxembourg are looking to repeat the same collection process by the end of 2024 and hope to move to a regular operation after that. This presentation will cover the different steps of the collection process, the lessons learned from the pilot project and the second operation at the end of 2024. We will conclude with an outlook on the changes we hope to implement in the future, a possible extension of the collection scope and our plans in terms of public access to the collections. 12:25pm - 12:45pm
From Posts to Archives: The National Library of Singapore’s Journey in Collecting Social Media National Library Board Singapore, Singapore Social media plays a huge role in our everyday life today. It is used for a myriad of activities such as communication, entertainment, business, and even as personal diaries. In Singapore, about 85% of the population uses social media, the most popular ones being Facebook, Instagram, YouTube, and TikTok. Besides individuals, many organisations have also turned to social media to engage and communicate with their followers. With such prevalent use, social media is becoming an important source of information about the lives and stories of our country and people. Recognising this, the National Library of Singapore (NLS) began looking at collecting social media. Our journey started in 2017, and the initial years focused on research and experiments, such as conducting an environmental scan of other heritage institutions’ experiences in collecting social media, proof-of-concept work using web archiving and available APIs, and trialling commercial vendors’ solutions. Our experience was similar to that of many institutions around the world. Collecting social media is complex and poses many technical, legal, and ethical challenges such as limited access to APIs and needing to manage personal data and third-party content. Despite these challenges, we knew that we had to start collecting social media given its increasing significance. This was not only to meet our mandate of collecting and preserving our country’s digital memories, but also to gain practical experience in how to collect, organise, and manage this format. Putting together what we have learnt, we developed a social media collecting framework in 2023 to provide guidance on how to collect social media amidst these challenges while ensuring that a representative set of social media content can be collected for future generations and research. Our framework covered the selection criteria, the collecting methods, and our collecting approach for key social media platforms that are widely used in Singapore. We piloted our first social media collecting effort in the same year, under NLS’ new 2-year project to collect contemporary materials on Singapore food and youth. The purpose was to assess individuals’ and organisations’ receptiveness to contributing their social media accounts to us, which was greater than we anticipated. In 2024, we made collecting social media part of our operational work. Our collection strategy was three-pronged: 1) outsourcing the archiving of significant persons/organisations’ social media accounts to a commercial vendor; 2) approaching identified organisations based on subjects to contribute their social media accounts; and 3) engaging and promoting social media collecting through advocates and an annual public call to nominate favourite Singapore social media accounts, YouTube and TikTok videos, as well as websites. This presentation will highlight NLS’ journey in collecting social media, our collecting framework and strategy, as well as learning points and future plans. 12:45pm - 1:05pm
Innovative Web Archiving Amid Crisis: Leveraging Browsertrix and Hybrid Working Models to Capture the UK General Election 2024 British Library, United Kingdom The British Library, in collaboration with the National Libraries of Scotland and Wales, the Bodleian Library and Cambridge University Library, has created collections of archived websites for all UK general elections since 2005. This time series shows how internet use in political communication has evolved, and how the fortunes of political parties have changed. The 2024 general election was called unexpectedly on May 22nd, and took place on July 4th, at a time when the UK Web Archive was inaccessible, and our Web Archiving and Curation Tool was unavailable following a devastating ransomware attack on the British Library on October 29th 2023. Working together, we nevertheless created a collection of 2253 archived websites covering candidates' campaign sites, social media feeds of significant politicians and journalists, local and national party sites, comment by think tanks, community engagement, news sources, and manifestos of a plethora of interest groups seeking to influence the new government. To facilitate use by researchers tracking change over time, we have organised the material into these same sub-collections since 2005. We collected campaign websites for a sample of English candidates for the same counties and urban areas as we have covered since 2005, but all Scottish and Welsh candidates’ sites were gathered as numbers are manageable. We also targeted marginal constituencies which had increased in numbers dramatically since 2019. The 2024 general election saw the rise of formerly minor parties such as Reform UK to national prominence, a Liberal Democrat resurgence, growing influence of independent candidates, and the rise of identity politics with groups encouraged to vote as a bloc on issues such as the war in Gaza, and an increasingly sophisticated use of social media. The technical outage caused by the ransomware attack necessitated a unique approach due to the disruption in our usual workflows. Despite the challenges, websites continued to be archived using Heritrix on AWS servers rather than the Library's in-house infrastructure. This shift required a new workflow, involving the use of simple spreadsheets and collaborative efforts to quickly refine metadata definitions and crawl scope, aiming to replicate our existing curatorial software as closely as possible. This experience introduced library staff to working within data and time constraints, enhancing our understanding of how to effectively scope crawls, monitor them in real-time, and implement new quality assurance practices. The project resulted in a hybrid collecting model, utilising both Heritrix and Browsertrix for the same thematic collection. The presentation will discuss the challenges and opportunities encountered during this project, providing valuable insights for those interested in Browsertrix’s capabilities and in executing web archiving with a mixed-model approach across different institutions with diverse interests and expertise in unusually challenging circumstances within the framework provided by a historic time series. |
11:45am - 1:15pm | WORKSHOP #04: How to Develop a New Browsertrix Behavior Location: Slottsbiblioteket (ground floor) |
|
How to Develop a New Browsertrix Behavior Webrecorder, United States of America Behaviors are a key part of Browsertrix and Browsertrix Crawler, as they make it possible to automatically have the crawler browsers take certain actions on web pages to help capture important content. This tutorial will walk attendees through the process of creating a new behavior and using it with Browsertrix Crawler. Browsertrix Crawler includes a suite of standard behaviors, including auto-scrolling pages, auto-playing videos, and capturing posts and comments on particular social media sites. By default, all of the standard set of behaviors are enabled for each crawl. Users have the ability to instead disable behaviors entirely or select only a subset of the standard set of behaviors to use on a crawl. At times, users may need additional custom behaviors to navigate and interact with a site in specific ways automatically during crawling if they want the resulting web archive and replay to reflect the full experience of the live site. For instance, a new behavior could click on interactive buttons in a particular order, “drive” interactive components on a page, or open up posts sequentially on a new social media site and load comments. This tutorial will walk through the process of creating a new behavior step by step, using the existing written tutorial for creating new behaviors on GitHub as a model. In addition to demonstrating how to write a behavior’s code (using JavaScript), the tutorial will also discuss how to know when a behavior is the appropriate solution for a given crawling problem, how to test behaviors during development, how to use custom behaviors with Browsertrix Crawler running locally in Docker, and finally how to use custom behaviors from the Browsertrix web interface (a feature that is currently planned and will be completed by the conference date). Participants will not be expected to write any code or follow along on their own laptops in real time during the tutorial. The purpose is instead to demonstrate how one would approach developing a new behavior, lower the barrier to entry for developers and practitioners who may be interested in doing so, and to give attendees the opportunity to ask questions of Webrecorder developers in real time. We would additionally love to foster a conversation about how to develop a community library of available behaviors moving forward to make it easier than ever for users to find and use behaviors that meet their needs. The tutorial will be led by Ilya Kreymer and Tessa Walsh, developers at Webrecorder with intimate knowledge of the Browsertrix ecosystem. The target audience is technically-minded web archiving practitioners and developers - in other words, people who could either themselves write new custom behaviors or communicate the salient points to developers at their institutions. Because this is not a hackathon-style workshop, the tutorial could have as many participants as the venue allows. By the conclusion of the tutorial, attendees should understand the concept of how Browsertrix Behaviors work, when developing a new behavior is a good solution to their problems, the steps involved in developing and testing a new behavior, and where to find additional resources to help them along the way. Our hope is to foster a decentralized community of practice around behaviors to the entire IIPC community’s benefit. |
1:15pm - 2:15pm | LUNCH Location: CREDO Restaurant | Kantine (downstairs) If you signed up for a guided exhibition tour, please be in the exhibition room at 13:20. To know if you signed up for a tour, check your registration details in ConfTool. |
2:15pm - 3:40pm | SESSION #07: Research & Access Location: Målstova (upstairs) Session Chair: Marie Roald, National Library of Norway |
|
2:15pm - 2:35pm
From Pages to People: Tailoring Web Archives for Different Use Cases 1Cambridge University Libraries, United Kingdom; 2National Library of Scotland, United Kingdom Our paper explores different modes of reaching the three distinct audiences identified in previous work with the National Archives UK: readers, data users, and the digitally curious. Building on the examples of our work conducted at Cambridge University Libraries and the National Library of Scotland, our paper gives recommendations and demonstrates good practices for designing web archives for different audience needs while ensuring wide access. Firstly, to improve the experience of general readers, we employ exploratory and gamified interfaces and public outreach events, such as exhibitions, to raise library users' awareness of the available web archive resources. Secondly, to serve the data user community, we put an emphasis on curating metadata datasets and the Datasheets for Data documentation, encouraging quantitative research on the web archive collections. This work also involves outreach events, such as data visualisation calls, which can later be incorporated into the resources for general readers. Finally, to overcome the obstacle of the digital skills gap, we tailored in-library workshops for the digitally curious - those who recognise the potential of web archives but lack advanced computational skills. We expect that upskilling the digitally curious can spark their interest in exploring and using the web archive collections. To sum up, our paper introduces the work we have been doing to improve the usability of the UK Web Archive within our institutions by developing additional materials (datasets, interfaces) and planning outreach events (exhibitions, calls, workshops) to ensure we meet the expectations of readers, data users, and the digitally curious. 2:35pm - 2:55pm
Making Research Data Published to the Web FAIR University of Sheffield, United Kingdom The University of Sheffield’s vision for research is that our distinctive and innovative research will be world-leading and world-changing. We will produce the highest quality research to drive intellectual advances and address global challenges. https://www.sheffield.ac.uk/openresearch/university-statement-open-research Research data published to the web can offer opportunities for wider discovery and access to your research outputs. However, it also presents risks: there is no assurance that such discovery and access will remain available for as long as the need for it remains. Websites are an inherently fragile medium, which makes it difficult to guarantee that we can evidence our research impact over time. This includes potentially wanting to submit sites as part of a UK Research Excellence Framework submission (the next is scheduled for 2029). Funding requirements may also stipulate how long the outputs are expected to remain accessible. Years of work, including work undertaken with public funding, could disappear if there is no intervention. In addition, publishing research data to the web cannot provide assurances in terms of meeting the University of Sheffield’s commitment to FAIR principles (findable, accessible, interoperable and reusable) and Open Research and Open Data practices. At the University of Sheffield, colleagues in our Research Data Management (RDM) team have also noticed a trend of researchers depositing in the Institutional Repository (ORDA) links to the URLs where the data is situated. In some cases, the website is the research output in its entirety, so its maintenance falls outside of the RDM team’s remit and we cannot provide the usual assurances about preserving that deposit. This paper will discuss the work undertaken by the University of Sheffield’s Library to mitigate potential data loss from research published online. It will include a case study of capturing a research group’s website for deposit in our institutional data repository, the collaborative creation of guidance for researchers and research data managers, and the embedding of good practice at the University to ensure that Open Research and Open Data remain open and FAIR. 2:55pm - 3:15pm
Enhancing Accessibility to Belgian Born-Digital Heritage: The BelgicaWeb Project Royal Library of Belgium (KBR), Belgium The BelgicaWeb project aims to make Belgian digital heritage more FAIR (i.e. Findable, Accessible, Interoperable and Reusable) to a wide audience. BelgicaWeb is a BRAIN 2.0 project funded by BELSPO, the Belgian Science Policy Office. It is a collaboration between CRIDS (University of Namur), who provide expertise on the relevant legal issues; IDLab, GhentCDH and MICT (Ghent University), who will work on data enrichment, user engagement and evaluation, and outreach to the research community, respectively; and KBR (Royal Library of Belgium), who act as project coordinator and work on the development of the access platform and API and on data enrichment. By leveraging web and social media archiving tools, the project focuses on creating comprehensive collections, developing a multilingual access platform, and providing a robust API enabling data-level access. At the heart of the project is a reference group of experts who provide iterative input on selection, development of the API and access platform, data enrichment, quality control and usability. Therefore, the project contributes to moving towards best practices for search and discovery. The project goes beyond data collection by means of open-source tools by enriching and aggregating (meta)data associated with these collections using innovative technologies such as Linked Data and Natural Language Processing (NLP). This approach enhances search capabilities, yielding more relevant results for both researchers and the general public. In this presentation, we will provide an overview of the BelgicaWeb project’s system architecture, the technical challenges we encountered, and the solutions we implemented. We will demonstrate how the access platform and API offer powerful, relevant, and user-friendly search functionalities, making it a valuable tool for accessing Belgium’s digital heritage. Attendees will gain insights into our development process, the technologies employed, and the benefits of our open-source approach for the web archiving and, by extension, the digital preservation communities. 3:15pm - 3:35pm
Using Generative AI to Interrogate the UK Government Web Archive The National Archives (UK), United Kingdom Our project seeks to make the contents of Web Archives more easily discoverable and interrogable, through the use of Generative AI (Gen-AI). It explores the feasibility of setting up a chatbot, and using UK Government Web Archive data to inform its responses. We believe that, if this approach proves successful, it could lead to a step-change in the discoverability and accessibility of Web Archives. Background Gen-AIs like ChatGPT and Copilot have impressive capabilities, but are notoriously prone to “hallucinations”. They can generate confident-sounding, but demonstrably false responses – even to the point of inventing non-existent academic papers, complete with fictitious DOI numbers. Retrieval-Augmented Generation (RAG) seeks to address this. It supplements Gen-AI with an additional database, queried whenever a response is generated. This approach aims to significantly reduce the chance of hallucination, while also enabling chatbots to provide specific references to the original sources. Additionally, any approach used would need to take into account the occasional need to remove individual records (in line with The National Archives’ takedown policy: https://www.nationalarchives.gov.uk/legal/takedown-and-reclosure-policy/). In traditional Neural Networks, “forgetting” data is currently an intractable problem. However, it should be possible to set up RAG databases such that removal of specific documents is straightforward. Approach Our project is focused on two open-source tools, both of which allow for RAG based on Web Archive records. The first is WARC-GPT, a lightweight tool developed by a team at Harvard designed to ingest Web Archive documents, feed them to a RAG database, and provide a chat-bot to interrogate the results. While the tool’s creators have demonstrated its capabilities on a small number of documents, we have attempted to test it at a larger scale, on a corpus of ~22,000 resources. The second, more sophisticated tool is Microsoft’s GraphRAG. GraphRAG identifies the “entities” referenced in documents, and builds a data structure representing the relationships between them. This data structure should allow a chat-bot to carry out more in-depth “reasoning” about the contents of the original documents, and potentially provide better answers about information aggregated across multiple documents. Results Our initial findings suggest that WARC-GPT produces impressive responses when queried about topics covered in a single document. It quickly discovers which one of the documents in its database best answers the prompt. It summarises relevant information from that document, and provides its URL. Additionally, with a few minor tweaks to the underlying source code, it is possible to remove individual documents from its database. However, WARC-GPT’s responses fare poorly when attempting to aggregate information from multiple documents. Our experiments with GraphRAG suggest that it outperforms WARC-GPT in aggregating information. However, while GraphRAG is reasonably quick to generate these responses, it is significantly slower and more expensive to set up than WARC-GPT. Additionally, removing individual records from GraphRAG, while possible, is computationally expensive. |
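For readers unfamiliar with the pattern, the sketch below shows the bare bones of Retrieval-Augmented Generation over archived documents: embed the documents, retrieve the closest matches to a question, and hand only those snippets (with their archive URLs) to a language model as grounding context. The embedding here is a deliberately naive stand-in, and none of this is WARC-GPT's or GraphRAG's actual code.

```python
import numpy as np

def embed_text(text: str) -> np.ndarray:
    """Toy character-frequency embedding; swap in a real embedding model."""
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def retrieve(question: str, docs: list[dict], k: int = 3) -> list[dict]:
    """docs: [{'url': ..., 'text': ...}] extracted from WARC 'response' records."""
    doc_vecs = np.stack([embed_text(d["text"]) for d in docs])
    scores = doc_vecs @ embed_text(question)          # cosine-like similarity (unit vectors)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(question: str, sources: list[dict]) -> str:
    # Keep only the retrieved snippets, with their archive URLs, as context.
    context = "\n\n".join(f"[{d['url']}]\n{d['text'][:2000]}" for d in sources)
    return (f"Answer using only the sources below and cite their URLs.\n\n"
            f"{context}\n\nQuestion: {question}")
```

The returned prompt would then be sent to whichever chat model is in use; grounding the answer in retrieved records and citing their URLs is what lets the chatbot point users back to the archived sources rather than hallucinate.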
2:15pm - 3:40pm | SESSION #08: Handling What You Captured Location: Store Auditorium (ground floor) Session Chair: Meghan Lyon, Library of Congress |
|
2:15pm - 2:35pm
So You’ve Got a WACZ: How Archives Become Verifiable Evidence Starling Lab for Data Integrity, Stanford-USC, United States of America This talk will present a workflow and toolkit, developed by the Starling Lab for Data Integrity, for collecting and organizing web archives alongside integrity and provenance data. Co-founded by Stanford and USC, Starling supports investigators–be they journalists, lawyers, or human rights defenders–in their collection of information and evidence. In addition to using Browsertrix to crawl (and test) large sets of web archive data, we have built a downstream integration, so data flows into our cryptographically-signed and append-only database called Authenticated Attributes (AA). AA extends Browsertrix’s utility by enabling archivists to securely attach and verify the provenance of claims that include context-critical metadata about the archived content in a secure and decentralized manner. It allows for the addition, preservation, and sharing of provenance data while facilitating efficient organization, searchability, and integration with other tools. Through AA, web archives and metadata become accessible for other applications and verification workflows, e.g. OSINT investigations. In this presentation, we will showcase case studies and projects with our collaborators including the Atlantic Council’s DFRLab and conflict monitors. 2:35pm - 2:55pm
Warc-Safe: An Open-Source WARC Virus Checker and NSFW (Not-Safe-For-Work) Content Detection Tool National Library of Luxembourg, Luxembourg We present warc-safe, the first open-source WARC virus checker and NSFW (Not-Safe-For-Work) content detection tool. Built with particular emphasis on usability and integration within existing workflows, this application detects harmful material and inappropriate content in WARC records. The tool uses the open-source ClamAV antimalware toolkit for threat detection and a specially trained AI model to analyze WARC image records. Several image formats are supported by the model (JPG, PNG, TIFF, WEBP, …), which produces a score between 0 (completely safe) and 1 (surely unsafe). This approach makes it easy to classify images and determine what to do with those that exceed a certain threshold. The warc-safe tool was developed with ease of use in mind; thus, it can be run in two modes: test mode (scan WARC files on the command line) or server mode (for easy integration with existing workflows). Server mode allows the client to use several features over an API, such as scanning a WARC file for viruses, scanning for NSFW content, or both. This makes it easy to use together with popular web archiving tools. To illustrate this, we present a case study where warc-safe was integrated into SolrWayback and the UK Web Archive’s warc-indexer. This integration made it possible to enrich the metadata indexed from WARC files, by extending the existing Solr schema with several new fields related to virus- and NSFW-test results, allowing for advanced searching and statistical analysis. Finally, we discuss how warc-safe could be used within an institutional framework, for instance by scanning newly harvested WARC files resulting from large-scale harvesting campaigns as well as including it within existing indexing workflows. 2:55pm - 3:15pm
Detecting and Diagnosing Errors in Replaying Archived Web Pages 1University of Michigan, United States of America; 2University of Southern California, United States of America When a user loads an archived page from a web archive, the archive must ensure that the user’s browser fetches all resources on the page from the archive, not from the original website. To achieve this, archives rewrite references to page resources that are embedded within crawled HTMLs, stylesheets, and scripts. Unfortunately, the widespread use of JavaScript on modern web pages has made page rewriting challenging. Beyond rewriting static links, archives now also need to ensure that dynamically generated requests during JavaScript execution are intercepted and rewritten. Given the diversity of scripts on the web, rewriting them often results in fidelity violations, i.e., when a user loads an archived page, even if all resources on the page had been crawled and saved, either some of the content that appeared on the original page is missing or some functionality that ought to work on archived pages (e.g., menus, changing the page theme) does not. To verify if the replay of an archived page preserves fidelity, archival systems currently compare either screenshots of the page taken during recording and replay or errors encountered in both loads (e.g., https://docs.browsertrix.com/user-guide/review/). These methods have several significant drawbacks. First, modern web pages often include dynamic components, such as animations or carousels, so screenshots of the same page copy can vary across loads. Second, neither does incorrect replay always result in additional script execution or resource fetch errors, nor does the presence of such errors indicate the existence of user-visible problems. Lastly, even if an archived page does differ from the original page, existing methods cannot pinpoint what inaccuracies in page rewriting led to this problem. In this talk, we will describe our work in developing a new approach for a) more reliably detecting whether the replay of an archived page violates fidelity, and b) pinpointing the cause when this occurs. Fundamental to our approach is that we do not focus only on the externally visible outcomes of page loads (e.g., pixels rendered and runtime/fetch errors). Instead, both during recording and replay, we capture each visible element in the browser DOM tree, including its location on the screen and dimensions, and the JavaScript writes that produce visible effects. Our fine-grained representation of page loads also enables us to precisely identify the rewritten source code that led to fidelity violations. The fix ultimately has to be determined by a human developer. However, we are able to validate the root cause we identify by either inserting only the problematic rewrite into the original page or by selectively rolling back that edit from the rewritten archived page and examining the corresponding effects. In our study across tens of thousands of diverse pages, we have found that pywb (version 2.8.3) fails to accurately replay archived copies of approximately 15–17% of pages. Importantly, compared to relying on screenshots and errors to detect low fidelity replay, our approach reduces false positives by as much as 5x.
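The sketch below illustrates the element-level comparison idea in its simplest form: given per-element records (tag, position, size) captured from the browser DOM during recording and replay, flag elements that are missing or have moved. The record format is invented for illustration; the authors' instrumentation is considerably richer, as it also tracks JavaScript writes with visible effects and ties violations back to the rewritten source code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VisibleElement:
    """One visible DOM element observed during a page load (illustrative format)."""
    xpath: str
    tag: str
    x: float
    y: float
    width: float
    height: float

def fidelity_violations(recorded: list[VisibleElement],
                        replayed: list[VisibleElement],
                        tolerance_px: float = 2.0) -> list[str]:
    """Compare recording vs replay snapshots and report missing or displaced elements."""
    replay_by_xpath = {el.xpath: el for el in replayed}
    issues = []
    for el in recorded:
        other = replay_by_xpath.get(el.xpath)
        if other is None:
            issues.append(f"missing element {el.tag} at {el.xpath}")
        elif (abs(el.x - other.x) > tolerance_px or abs(el.y - other.y) > tolerance_px
              or abs(el.width - other.width) > tolerance_px
              or abs(el.height - other.height) > tolerance_px):
            issues.append(f"moved/resized element {el.tag} at {el.xpath}")
    return issues
```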
3:15pm - 3:35pm
Building a Toolchain for Screen Recording-Based Web Archiving of SVOD Platforms Institut national de l'audiovisuel (INA), France As Subscription Video on Demand (SVOD) platforms expand, preserving DRM-protected content has become a critical challenge for web archivists. Traditional methods often fall short due to Digital Rights Management (DRM) restrictions, necessitating more adaptable solutions. This presentation covers the ongoing development of a generic toolchain based on screen recording, designed to effectively address DRM restrictions, capture high-quality content, and scale efficiently. The project is structured into two main phases. Phase One focuses on developing a system that automatically checks the quality of screen recordings. By monitoring key metrics such as frame rate, resolution, and bit rate, the system should ensure that recordings match the original content’s quality as closely as possible. This phase addresses several technical challenges, including video glitches, frame drops, low resolution, and audio syncing issues. These problems arise from varying network conditions, software performance issues, and hardware limitations. To refine and validate the toolchain, over 100 hours of competition footage from the Paris 2024 Olympic Games have been collected and are being used to assess the system’s performance. This dataset is crucial for ensuring that the toolchain can handle high-quality recordings effectively. Phase Two tackles the specific challenges posed by DRM restrictions. Level 1 DRM, which involves a trusted environment and hardware restrictions, uses hardware acceleration that causes black screens when video playback and screen recording are attempted simultaneously. Additionally, many SVOD platforms limit high-resolution playback on Linux systems, complicating the capture of high-quality content. To circumvent these issues, playback should be handled on remote machines running Windows, Mac, or Chrome OS (environments where high-resolution limitations do not apply), while recording is performed on Linux systems. For HD video content, which generally involves Level 3 DRM with only software restrictions, Linux can be used directly for both playback and recording without encountering black screen issues. The toolchain will utilize Docker to scale the recording process by virtualizing hardware components such as display and sound cards. Docker should enable the system to manage multiple recordings concurrently, improving efficiency and reducing the time required for large-scale archiving. FFmpeg will be employed for recording, while Xvfb and ALSA will be used to virtualize the display and sound cards, respectively. By leveraging Docker for virtualization and managing workloads across various instances, the system is expected to scale effectively and accelerate the archiving process. |
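As a rough sketch of the virtual-display recording step, the following Python snippet starts a headless X server (Xvfb) and records it with FFmpeg's x11grab input, taking audio from an ALSA device. The display number, resolution, duration and ALSA device name are assumptions; the toolchain described above wraps this kind of step in Docker and adds quality monitoring (frame rate, resolution, bit rate) on top.

```python
import subprocess
import time

DISPLAY = ":99"        # assumed virtual display number
SIZE = "1920x1080"     # assumed capture resolution

# Start a headless X server to host the playback browser.
xvfb = subprocess.Popen(["Xvfb", DISPLAY, "-screen", "0", f"{SIZE}x24"])
time.sleep(2)  # give the virtual display time to come up

# A browser playing the SVOD content would be launched here with DISPLAY=:99.

# Record the virtual display with FFmpeg; the ALSA device depends on the loopback setup.
ffmpeg = subprocess.Popen([
    "ffmpeg",
    "-f", "x11grab", "-video_size", SIZE, "-framerate", "25", "-i", f"{DISPLAY}.0",
    "-f", "alsa", "-i", "default",
    "-c:v", "libx264", "-preset", "veryfast",
    "capture.mp4",
])

time.sleep(60)          # record for one minute in this toy example
ffmpeg.terminate()
ffmpeg.wait()
xvfb.terminate()
```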
2:15pm - 3:40pm | PANEL #03: Cross-Institutional Collaboration: the End of Term Archive Location: Slottsbiblioteket (ground floor) Session Chair: Jeffrey van der Hoeven, National Library of the Netherlands (KB) |
|
Coordinating, Capturing, and Curating the 2024 United States End of Term Web Archive 1University of North Texas, United States of America; 2Internet Archive, United States of America; 3Stanford University, United States of America; 4Webrecorder, United States of America |
3:40pm - 4:10pm | BREAK Location: Folkestova (upstairs) |
4:10pm - 5:05pm | Closing Keynote: Quantifying Complexity: Using Web Data to Decode Online Public Debate Location: Målstova (upstairs) Session Chair: Jon Carlstedt Tønnessen, National Library of Norway Streamed to Store Auditorium. |
5:05pm - 5:30pm | Closing Remarks: Closing Remarks Location: Målstova (upstairs) Streamed to Store Auditorium. |