Conference Agenda
Overview and details of the sessions of this conference.
|
Session Overview |
Date: Tuesday, 08/Apr/2025 | |
9:00am - 9:40am | REGISTRATION: General Assembly (For IIPC members only) |
9:40am - 9:50am | Opening Remarks Location: Målstova (upstairs) |
9:50am - 10:00am | Chair Address Location: Målstova (upstairs) |
10:00am - 10:45am | IIPC Strategic Plan 2026-2030 Location: Målstova (upstairs) |
10:45am - 11:15am | BREAK Location: Folkestova (upstairs) If you signed up for a guided exhibition tour, please be in the exhibition room at 10:45. To know if you signed up for a tour, check your registration details in ConfTool. |
11:15am - 12:45pm | Framework for Tools Sustainability Location: Målstova (upstairs) |
11:15am - 12:45pm | Content Development Working Group Meeting Location: Slottsbiblioteket (ground floor) |
11:15am - 12:45pm | TBC Location: VIP - rommet (upstairs) |
12:45pm - 2:00pm | LUNCH Location: CREDO Restaurant | Kantine (downstairs) If you signed up for a guided exhibition tour, please be in the exhibition room at 12:50. To know if you signed up for a tour, check your registration details in ConfTool. |
2:00pm - 3:30pm | Research Working Group Meeting Location: Målstova (upstairs) |
2:00pm - 3:30pm | Training Working Group Meeting Location: Slottsbiblioteket (ground floor) Actual session length: 60 minutes |
2:00pm - 3:30pm | TBC Location: VIP - rommet (upstairs) |
3:30pm - 4:00pm | BREAK Location: Folkestova (upstairs) If you signed up for a guided exhibition tour, please be in the exhibition room at 3:30. To know if you signed up for a tour, check your registration details in ConfTool. |
4:00pm - 5:30pm | Crawling National Domain: Towards Best Practices Location: Målstova (upstairs) |
4:00pm - 5:30pm | TWG WORKSHOP: Case Studies ‘Write-a-thon’ - Documenting Best Practices Location: Slottsbiblioteket (ground floor) |
|
Case Studies ‘Write-a-thon’ - Documenting Best Practices 1The National Archives (UK), United Kingdom; 2Library of Congress, United States of America; 3Internet Archive, United States of America |
4:00pm - 5:30pm | TBC Location: VIP - rommet (upstairs) |
7:00pm - 9:00pm | WELCOME RECEPTION Location: Folkestova (upstairs) [IIPC Members Only] Includes light refreshments and drinks. Attendees are encouraged to have dinner beforehand. |
Date: Wednesday, 09/Apr/2025 | |
9:00am - 9:40am | REGISTRATION: Web Archiving Conference (WAC) |
9:40am - 9:50am | Opening Remarks Location: Målstova (upstairs) Streamed to Store Auditorium. |
9:50am - 10:45am | Opening Keynote: Libraries, Copyright, and Language Models Location: Målstova (upstairs) Session Chair: Andrew Jackson, Digital Preservation Coalition Streamed to Store Auditorium. |
10:45am - 10:55am | SHORT BREAK Streaming video from Målstova to Store Auditorium ends. Lightning Talk Session 2 will begin in the Store Auditorium after the break. |
10:55am - 11:00am | LIGHTNING TALK SESSION 1: INTRODUCTION Location: Målstova (upstairs) Session Chair: Ben Els, National Library of Luxembourg |
10:55am - 11:00am | LIGHTNING TALK SESSION 2: INTRODUCTION Location: Store Auditorium (ground floor) Session Chair: Sawood Alam, Internet Archive |
11:00am - 11:25am | LIGHTNING TALK SESSION 1 Location: Målstova (upstairs) Session Chair: Ben Els, National Library of Luxembourg |
|
11:00am - 11:05am
Strategies and Challenges in the Preservation of Mexico’s Web Heritage: First Steps National Library of Mexico, Mexico
11:05am - 11:10am
Arquivo.pt Toolkit for Web Archiving Arquivo.pt, Portugal
11:10am - 11:15am
Tracking the Political Representations of Life: Methodological Challenges of Exploring the BnF Web Archives 1Centre de recherches politiques de Sciences Po (CEVIPOF, CNRS), France; 2Bibliothèque nationale de France, France
11:15am - 11:20am
Collaborative Curatorial Approaches of the Czech Web Archive Using the Example of Thematic Literary Collections National Library of the Czech Republic, Czech Republic |
11:00am - 11:25am | LIGHTNING TALK SESSION 2 Location: Store Auditorium (ground floor) Session Chair: Sawood Alam, Internet Archive |
|
11:00am - 11:05am
Modelling Archived Web Objects as Semantic Entities to Manage Contextual and Versioning Issues 1The National Archives (UK), United Kingdom; 2King's College London, United Kingdom
11:05am - 11:10am
Modernizing Web Archives: The Bumpy Road Towards a General ARC2WARC Conversion Tool Common Crawl Foundation, United States of America
11:10am - 11:15am
Poking Around in Podcast Preservation Netherlands Institute for Sound and Vision, Netherlands
11:15am - 11:20am
Automatic Clustering of Domains by Industry for Effective Curation Royal Danish Library, Denmark
11:20am - 11:25am
Best Practice of Preserving Posts from Social Media Feeds Arkiwera wcrify AB, Sweden |
11:25am - 11:55am | BREAK Location: Folkestova (upstairs) Participants in the 2025 Mentoring Program can meet at the top of the old granite stairs outside of Målstova. Seating is available in the cafeteria/bar (upstairs) and in the library hallways (upstairs and ground floor). If the weather is nice, there are also small parks immediately in front of and behind the National Library building. |
11:55am - 1:00pm | PANEL #01: Engaging Audiences Location: Målstova (upstairs) Session Chair: Eveline Vlassenroot, University of Ghent |
|
Beyond Preservation: Engaging Audiences and Researchers with Web Archives 1University of Ghent, Belgium; 2KBR - Royal Library of Belgium, Belgium; 3University of Sheffield, United Kingdom; 4Bodleian Libraries, United Kingdom; 5Royal Danish Library, Denmark; 6National Library of Scotland, United Kingdom |
11:55am - 1:00pm | SESSION #01: Tools Under Construction: Lessons Learned (National Library Perspective) Location: Store Auditorium (ground floor) Session Chair: Katherine Boss, National Library of Norway |
|
11:55am - 12:15pm
Embedding the Web Archive in an Overall Preservation System Swiss National Library, Switzerland The Swiss National Library (SNL) is building a new digital long-term archive that will go live in spring 2025. This system is designed as an overall system that covers all the processes involved in handling the digital objects of all the SNL's collections, including the web archive. This starts with the delivery of the objects by producers or the collection of the objects by the SNL itself, includes the preparation for archiving and cataloguing, administration and preservation, and ends with the provision to users. The first part of the presentation will describe the architecture and functionality of the overall system, which consists of three different areas and uses a mixture of standard components and individual developments.
The second part of the presentation will show how the Swiss Web Archive and its specific processes have been integrated into the overall system. Special precautions had to be taken particularly in the Pre-Ingest and Access areas. In Pre-Ingest, a distinct processing channel was created for the web archive. This makes it possible to register the websites for collection (and automated periodic snapshots), collect them, check their quality and improve it if necessary, and ensure that they are virus-free. Access makes the web archive accessible via a full-text search, for which special precautions had to be taken when generating the hit lists. Otherwise, the hits from the other collections would be lost among the numerous hits from the web archive (an illustrative grouping sketch follows this session block). In addition, one of the showcases will provide an unexpected approach to the web archive. The presentation will conclude by addressing some of the specific challenges of integrating the web archive into an overall preservation system and the lessons learnt.
12:15pm - 12:35pm
UKWA Rebuild British Library, United Kingdom The British Library suffered a major service outage following a cyber-attack on all technical systems in late October, 2023. What followed was a complete rebuild of all services with security baked in. This short presentation provides an overview of how the UK Web Archive was affected, how the new operational technology landscape of the British Library changed, and describes the work being undertaken to return UKWA as a public service and to begin crawling again from on-premise servers. It will also describe how the internal systems of UKWA are changing to meet the new infrastructure and policies.* The challenges faced should be important to all web archiving institutions. The necessary changes made by the British Library to ensure the new services are secure by design will have a major impact on the UK Web Archive systems, but these could be challenges and changes imposed on any web archive. The size of the UK Web Archive, approaching 2PiB and an estimated 18 billion files, also creates challenges in itself which will be familiar to many web archives - the redesign of UKWA includes distant storage and aims to establish shared functions and resources across the Legal Deposit Libraries in the future. Ways of discovering content within the UK Web Archive have been significantly reduced by the cyber-attack. Previously, a full text search service was available using Apache Solr. However, the return of a 'discovery service' has been delayed by the necessity of rebuilding all systems from scratch. The future planning for a discovery service, and a user service, will also be outlined in the presentation. * As of mid-August 2024, no technology infrastructure or systems have been released for the UKWA rebuild work. Consequently, the content of this presentation may change from this paper submission and the conference date. 12:35pm - 12:55pm
Under Construction: Web Archive of the German National Library German National Library, Germany Our institution has been running a web archive since 2012, in cooperation with an external contractor and on closed-source software. Most recently we have started the shift towards an in-house open source web archiving system that shall be integrated with the overall data management infrastructure of our institution. During a first migration process the whole setup was moved in-house. The migration allowed us to gain some control over the operation, while the development and support are still performed by the contractor. In our experience over the last decade, we have identified a number of limitations with the current web archive setup: the crawling capacity is limited to a maximum of 12,000 snapshots per annum, the non-modular system complicates the implementation of new requirements, and we cannot directly benefit from the progress of the thriving open source web archiving community in regard to new features and the implementation of web archiving standards. In parallel to the web archiving activities, our institution has developed an overarching data management infrastructure for the acquisition, digital preservation, and provisioning of electronic resources, such as e-books, e-journals, and most recently audio files. In order to gain increased maintainability, flexibility and control over the web archiving activity, our aim is to implement a new system in-house, to integrate it with the well-established in-house workflows for electronic resources, and to align it with and base it on the current open source state of the art and the standards of the web archiving community. During the presentation we take you on the journey of our institution towards the implementation of an in-house and open source web archive. We try to answer the questions: How do we understand the environment? How do we put our team together? Where do we want to go? How do we decide which paths to take? Which gear do we need? And finally, what are our lessons learned? |
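The hit-list problem described in the Swiss National Library abstract above, where web archive captures would otherwise drown out results from other collections, can be illustrated with standard Solr result grouping. The sketch below is not the SNL's implementation: the endpoint, core name, and the `collection` field are assumptions made purely for illustration.

```python
"""Illustrative only: cap the number of web archive hits per result page by
grouping on a collection field. Endpoint, core and field names are
hypothetical; the SNL's actual search backend is not described here."""
import requests

SOLR_SELECT = "http://localhost:8983/solr/catalog/select"  # hypothetical core

def grouped_search(query: str, per_collection: int = 5) -> dict:
    params = {
        "q": query,
        "wt": "json",
        # Standard Solr result grouping: one group per collection value,
        # each capped at `per_collection` hits, so web archive captures
        # cannot crowd out books, journals, and other collections.
        "group": "true",
        "group.field": "collection",   # hypothetical field name
        "group.limit": per_collection,
        "rows": 20,
    }
    response = requests.get(SOLR_SELECT, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    results = grouped_search("federal council")
    for group in results["grouped"]["collection"]["groups"]:
        print(group["groupValue"], group["doclist"]["numFound"])
```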
11:55am - 1:00pm | WORKSHOP #01: Exploring Dilemmas in the Archiving of Legacy Webportals: An Exercise in Reflective Questioning Location: Slottsbiblioteket (ground floor) |
|
Exploring Dilemmas in the Archiving of Legacy Webportals: An Exercise in Reflective Questioning National Library of the Netherlands, Netherlands Since 2023, the National Library of the Netherlands (KBNL) has been proud to curate a digital collection that has become UNESCO world heritage: the Digital City (De Digitale Stad, henceforth: DDS). Material belonging to this collection consists of an original freeze from 1996, as well as two student projects and miscellaneous material that was contributed by users and founders over the course of multiple events. The two student projects were the first attempt to revive the portal of DDS and store it as a disk image. The two groups of students used two methods for this revival: one based on emulation, the other based on migration. But what choices were made during restoration and which version is more authentic? Furthermore, KBNL has several websites, scientific articles and newspaper clippings in its collections that might serve as context information. Do we consider this context information crucial for understanding DDS or do we rather leave users to find these resources by themselves if they are interested? As can be seen from this description, there is a lot of complexity when we consider archiving DDS and making it accessible to our users. We can think of a lot of difficult dilemmas when making decisions on what to archive and how to present it. Do we want users to experience what it is like to create a homepage in DDS or do we want to present a historically correct picture of the homepages existing at the time? What should be considered part of the object and what part of the context? Is the migrated or the emulated version more authentic? What is more important, the privacy of the original users or providing full access to researchers? What do we consider part of DDS and what not? Only the HTML? Or also any news group material that might still be online but isn’t part of the archival material? Do users want a real authentic experience or rather a convenient way of viewing the content? Even though DDS was a Dutch portal, it was based on the software of the American Free-nets and inspired other cities in Europe and Asia. Therefore, we think this case might have a lot of recognizable features that also apply to the archiving of other legacy portals. Arguably, there are no right or wrong answers. They are typically dilemmas where multiple options have both benefits and drawbacks. In our workshop we want to present a couple of these real-world dilemmas to participants to stimulate discussion based on the idea of opposing values. In web archiving and web archaeology, tough decisions sometimes have to be made. In the above description we can already perceive some opposing options, for instance whether to prioritize interactivity or historical accuracy. Another example would be the opposition between privacy and openness. How do we weigh these options in practice? What values are important to us and how do they interact? Through principles of reflective questioning and open dialogue we will try to create awareness about the idea of value prioritization as part of the decision-making process. The idea is that we present a number of dilemmas, based on our collection material, for participants to discuss in groups. Participants may also choose an example that illustrates the same dilemma from their own collection. Each group has to choose a preferred solution and present their reasoning to the group.
People are encouraged to explore the reasons for choosing one or the other, for instance by reflecting on their own organizational context or personal assumptions regarding digital preservation. We try to stay away from providing clear cut answers or guidance but rather provide participants with the opportunity to explore these questions together. Participants will learn how to ask the right questions to delve deeper into their own reasoning process during decision making, based on our method of reflective questioning. Participants should be able to apply this method and the cases presented to benefit their own curatorial decision-making process regarding legacy webportals in their own collections. For KBNL, the group discussions may provide important community input and food for thought on some of the decisions we are going to be making regarding DDS in the near future. |
1:00pm - 2:00pm | LUNCH Location: CREDO Restaurant | Kantine (downstairs) If you signed up for a guided exhibition tour, please be in the exhibition room at 13:05. To know if you signed up for a tour, check your registration details in ConfTool. |
2:05pm - 3:40pm | SESSION #02: Crawling Tools Location: Målstova (upstairs) Session Chair: László Tóth, National Library of Luxembourg |
|
2:05pm - 2:25pm
Lessons Learned Building a Crawler From Scratch: The Development and Implementation of Veidemann National Library of Norway, Norway Over the past two decades, web content has become increasingly dynamic. While a long-standing harvesting technology like Heritrix effectively captures static web content, it has huge limitations capturing and discovering links in dynamic content. As a response to this, the Web Archive at the National Library of Norway in 2015 set out to develop a new browser-based web crawler. This talk will present our experiences and lessons learned from building Veidemann. There are so many factors to consider when building a tool from scratch, and we will try to outline some of the decisions we were faced with during the process, unexpected issues and how we are addressing them. The talk will present:
The full cost/benefit analysis of taking on a project of this size and scale is, by the nature of the work, not fully knowable at the start. After nearly a decade in the making, the story of Veidemann is one of pride, hope, hardship and lessons learned. While it is still being used in production at our institution and harvesting roughly 1TB per week (of deduplicated content), other similar tools, such as Browsertrix, have distinct advantages in their approach. While the future of Veidemann is uncertain we would love to share what we have learned so far with the broader community. 2:25pm - 2:45pm
Experiences of Using in-House Developed Collecting Tool ELK National Library of Finland, Finland ELK (an acronym for Elonleikkuukone, which means harvester in Finnish) is a tool built in the National Library of Finland’s Legal Deposit Services to aid in collecting, managing, and harvesting online materials for the Web Archive. Legal Deposit Services started to use ELK in 2018, and since then we have updated ELK several times to better suit the needs of collectors and harvesters of web materials. Features of ELK include a back catalog of former thematic web harvests, covering the web materials (also known as seeds), cataloging information and keywords, as well as tools to manage thematic web harvests that are currently being made. Features have been developed in collaboration between the collectors and the developers who also work on harvesting the web materials. The aim is to create a tool where collectors can easily categorize different web materials, add notes on how to harvest different materials, and keep track of what has been collected and what has not. Collectors can also harvest single web pages themselves for quality control. This is to make sure that pages with dynamic elements can be viewed as they were meant to be in the web archive. ELK is also used as a documentation platform. The easiest way to see the curatorial choices, keywords and history of the thematic web harvests is to gather them in one platform. When that platform is used for everything related to web archiving, we can easily see what themes have been harvested, what sort of materials were collected previously and, in the best cases, the curatorial decisions that were made in those harvests. By sharing our experiences with an in-house developed tool for collecting web materials, we can help other libraries in their efforts. We will discuss the advantages and disadvantages in curating and managing our web collections, and where we would like to see our collections go in the future now that we have used the tool for a while.
2:45pm - 3:05pm
Better Together: Building a Scalable Multi-Crawler Web Harvesting Toolkit Internet Archive, United States of America The web is as nearly infinite in its expanse as it is in its diversity. As its volume and complexity continues to grow, high-quality, efficient, and scalable web harvesting methods are more essential than ever. The numerous and varied challenges of web archiving are well known to this community, so it’s not surprising there isn’t one tool that can perfectly harvest it all. But through open source software collaboration we can build a scalable toolkit to meet some of these challenges. In the presentation, we will outline some of the many lessons and best practices our institution has learned from the challenges, requirements, research, and practical experience from collaborating with other memory institutions for over 25 years to meet the harvesting needs of the preservation community. To demonstrate how some of those challenges can be overcome, we will then discuss a fictional large-scale domain harvest use case presenting common issues. With each new challenge encountered we will introduce concepts in web harvesting while demonstrating approaches to solve them. Sometimes the best approach is a configuration option in Heritrix, and sometimes it’s including another open source software to incrementally improve the quality and scale of the campaign. Nothing is perfect, so we’ll also cover some things to consider when deciding to employ an additional tool. Some of the challenges we’ll address are: Heritrix makes a great base for large-scale web crawling, and many in the IIPC community already use it for their web harvests. The presentation will demonstrate tools that complement Heritrix, and should be easy to try as an add-on to a reliable implementation, but the concepts—and often the tools themselves—are web crawler agnostic. The presentation is geared to a wide range of experience. Anyone who is curious about what it takes to run a large web harvest will leave with a better understanding, and experienced practitioners will acquire insights into some technical improvements and strategies for improving their own harvesting infrastructures. 3:05pm - 3:25pm
Lowering Barriers to Use, Crawling, and Curation: Recent Browsertrix Developments Webrecorder, United States of America As the web continues to evolve and web archiving programs develop in their practices and face new challenges, so too must the tools that support web archiving continue to develop alongside them. This talk will provide updates on new features and changes in Browsertrix since last year’s conference that enable web archiving practitioners to capture, curate, and replay important web content better than ever before. One key new feature that will be discussed is crawling through proxies. Browsertrix now supports the ability to crawl through SOCKS5 proxies which can be located anywhere in the world, regardless of where Browsertrix itself is deployed. With this feature, it is possible for users to crawl sites from an IP address located in a particular country or even from an institutional IP range, setting crawl workflows to use different proxies as desired. This feature allows web archiving programs to satisfy geolocation requirements for crawling while still taking advantage of the benefits of using cloud-hosted Browsertrix. Proxies may also have other concrete use cases for web archivists, including avoiding anti-crawling measures and being able to provide a static IP address for crawling to publishers. Similarly, the presentation will discuss changes made that enable users of Browsertrix to configure and use their own S3 buckets for storage. Like proxies, this feature lowers the barriers to using cloud-hosted Browsertrix by enabling institutions to use their own storage infrastructure and meet data jurisdiction requirements without needing to deploy and maintain a self-hosted local instance of Browsertrix. Other developments will also be discussed, such as improvements to collection features in Browsertrix which better enable web archiving practitioners to curate and share their archives with end users, user interface improvements which make it easier for anyone to get started with web archiving, and improvements to Browsertrix Crawler to ensure websites are crawled at their fullest possible fidelity. |
2:05pm - 3:40pm | SESSION #03: Advocacy & User Engagement Location: Store Auditorium (ground floor) Session Chair: Mark Phillips, University of North Texas Libraries |
|
2:05pm - 2:25pm
Insufficiency of Human-Centric Ethical Guidelines in the Age of AI: Considering Implications of Making Legacy Web Content Openly Accessible Computer History Museum Slovenia (Računališki muzej), Slovenia While the preservation of web history is crucial for maintaining a cultural and informational record of our age, reconstructing and resurfacing legacy content without appropriate context nowadays presents new ethical concerns. Legacy content may be misleading to users when consumed in isolation, as it often reflects outdated norms, technologies, and information that are no longer relevant. Moreover, individuals featured in such content may be unfairly subjected to scrutiny based on past actions or statements that, in today's context, could harm their personal or professional reputation. The consequences of resurfacing this content without adequate contextualization are amplified when AI technologies are involved. AI’s ability to synthesize and amplify such data across platforms can create a ripple effect, where even content that does not explicitly reveal personal information can still have far-reaching consequences. By connecting disparate data points, AI may draw conclusions or inferences about individuals, influencing public perception and potentially affecting career prospects or even legal outcomes. Unlike a human reader, who can contextually infer that a piece of reconstructed online content is part of a legacy web segment presented as a historical monument to the online world of times past, AI will not be able to distinguish such content from contemporary sources and will give it misplaced weight in its analysis. The ethical challenge here lies not just in the publication of legacy content and archival access, but in AI’s ability to endlessly circulate and reinterpret it in ways that were never intended by the original authors. This proposal explores the delicate balance between the preservation of historical digital records and respecting individuals' right to be forgotten (RTBF) in the age of AI. It seeks to question how AI-powered tools that reshape the reading and presentation of web archives challenge existing ethical norms. By examining potential frameworks for responsible digital archiving, the proposal aims to identify solutions that mitigate the risks posed by AI-driven resurfacing of legacy content in the public domain.
2:25pm - 2:45pm
Web Archives for Music Research Royal Danish Library, Denmark The Royal Danish Library has set a strategic goal to make more of its cultural heritage materials accessible and engaging for researchers by 2027. In this paper, we present findings from an advocacy initiative targeted at researchers at national universities in music-related fields. The national web archive provides primary sources and contextual information relevant to music researchers as they engage with our music collections. However, there is room for improvement in the connection between these collections and our understanding of user needs. Reports by Healy et al. (2022) and Healy & Byrne (2023) explore the challenges researchers face when using web archives, highlighting the ongoing need to examine the skills, tools, and methods associated with web archiving. Additionally, the sounds of the web—from MIDI to streaming—are an integral part of its history, yet this aspect is often overlooked by tools like the Internet Archive's Wayback Machine (Morris, 2019). Through semi-structured interviews with fellow curators and music researchers at universities, we identify current barriers to access and user requirements for improved utilization of web archival resources. Our advocacy initiative also allows us to summarize current research trends as feedback for web curators. In conclusion, we describe how the web curators processed our findings into suggestions for updates and refinements to web crawling strategies and the built-in tools in the SolrWayBack installation. References Healy, S., & Byrne, H. (2023). Scholarly Use of Web Archives Across Ireland: The Past, Present & Future(s) (WARCnet Special Reports). Aarhus University. https://cc.au.dk/fileadmin/dac/Projekter/WARCnet/Healy_Byrne_Scholarly_Use_01.pdf Healy, S., Byrne, H., Schmid, K., Bingham, N., Holownia, O., Kurzmeier, M., & Jansma, R. (2022). Skills, Tools, and Knowledge Ecologies in Web Archive Research (WARCnet Special Reports). Aarhus University. https://cc.au.dk/fileadmin/dac/Projekter/WARCnet/Healy_et_al_Skills_Tools_and_Knowledge_Ecologies.pdf . Morris, J. W. (2019). Hearing the Past: The Sonic Web from MIDI to Music Streaming. In N. Brügger & I. Milligan (Eds.), The SAGE Handbook of Web History (pp. 491–510). Sage. 2:45pm - 3:05pm
IXP History Collection: Recording the Early Development of the Core of the Public Internet 1Independent Researcher, Ireland; 2University of Barcelona, Spain The IXP History Collection is an ongoing project which seeks to record and document histories of the Internet exchange points (IXPs) which form the core of the Internet’s topology. An IXP is the point at which Internet Service Providers and Content Delivery Networks connect and exchange data with each other (“peering”). IXPs form the topological core of the Internet backbone, their histories are inextricably linked to the commercialization of the Internet, and their development is a significant milestone in the global history of media and communications. Efforts should therefore be made to ensure that we preserve IXP histories for future generations. The main purpose of the project is to collect and preserve networking and IXP histories due to valid concerns that these histories will be lost from the global record unless attempts are made to start preserving them now. In particular, the project is concerned with the fragility of electronic information and born digital documents, records, and multimedia, otherwise known as born digital heritage. As a starting point, the project utilizes the Internet Exchange Directory which is maintained by Packet Clearing House, an intergovernmental treaty organization responsible for providing operational support and security to critical Internet infrastructure, including Internet exchange points. The PCH IX Directory is one of the earliest organized efforts to develop and maintain a database for recording and tracking the establishment, development and global growth of IXPs. The project then focuses on documenting IXP histories through as many online sources as possible (e.g., websites/pages, reports, journals, magazines/newspaper articles, old emails on public mail lists). The project relies on the use of web archives as a research tool for tracing IXP histories, as well as a preservation tool using the Save Page functions in the Wayback Machine and Arquivo.pt. In this presentation we discuss our approach and methodology for developing the collection and making it available online as a reference resource, and we offer an overview of the importance of using web archives for documenting and preserving Internet and IXP histories. By presenting our approach, we hope to offer a case study that demonstrates how web archive research can be integrated with traditional research methods (Healy et al., 2022), and promote more widespread use of web archives as research tools for historical inquiry, and the long-term preservation of digital research (Byrne et al., 2024). Resources: Arquivo.pt: https://arquivo.pt/ IXP History Collection - Information Directory | Zotero: https://www.zotero.org/groups/4944209/ixp_history_collection_-_information_directory/library Packet Clearing House, Internet Exchange Directory: https://www.pch.net/ixp/dir Wayback Machine: https://web.archive.org/ References: Healy, S., Byrne, H., Schmid, K., Bingham, N., Holownia, O., Kurzmeier, M. and Jansma, R. (2022). Skills, Tools, and Knowledge Ecologies in Web Archive Research. WARCnet Special Report, Aarhus, Denmark: https://web.archive.org/web/20221003215455/https://cc.au.dk/fileadmin/dac/Projekter/WARCnet/Healy_et_al_Skills_Tools_and_Knowledge_Ecologies.pdf Byrne, H., Boté-Vericad, J-J, and Healy, S. (2024) Exploring Skills and Training Requirements for the Web Archiving Community. In: Aasman, S., Ben-David, A., and Brügger, N., eds. 
The Routledge Companion to Transnational Web Archive Studies. Routledge. 3:05pm - 3:25pm
Lost, but Preserved - A Web Archiving Perspective on the Ephemeral Web Internet Archive, United States of America The World Wide Web, our era's most dynamic information ecosystem, is characterized by its transient nature. Recent studies have highlighted the alarming rate at which web content disappears or changes, a phenomenon known as "link-rot". A 2024 Pew Research Center study revealed that 38% of webpages from 2013 were inaccessible a decade later. Even more striking, Ahrefs, an SEO company, reported that at least 66.5% of links to sites created in the last nine years are now dead. These findings echo earlier research by Zittrain et al., which uncovered significant link-rot in journalistic references from New York Times articles. While these statistics paint a grim picture of digital impermanence, they often overlook a crucial factor: the role of web archives. This talk aims to reframe the link-rot discussion by considering the preservation efforts of various web archiving institutions. Our research revisiting the Pew dataset yielded a surprising discovery: only one in nine URLs from the original study were truly missing, the remaining bulk had at least one capture in a web archive. This finding suggests that the digital landscape, when viewed through the lens of web archiving, may be less ephemeral than commonly perceived. Key points we will explore: 1. The state of link-rot: We will review recent studies and their methodologies, discussing the implications of their findings for digital scholarship, journalism, and information access. 2. Web archives as digital preservationists: We will introduce major web archiving initiatives and explain their crucial role in maintaining the continuity of online information. 3. Reassessing link rot with archives in mind: We will present our methodology and findings from reexamining the Pew dataset, demonstrating how web archives mitigate content loss. 4. Challenges and limitations of web archiving: Despite their importance, web archives face significant technical, legal, and resource constraints. We will discuss these challenges and their impact on preservation efforts. 5. The future of web preservation: We will explore emerging technologies and strategies in web archiving, including machine learning approaches to capture dynamic content and efforts to preserve the context of web pages. 6. Call to action: We will emphasize the importance of supporting and expanding web archiving efforts, discussing how researchers, institutions, and individuals can contribute to these initiatives. This talk aims to provide a more nuanced understanding of digital impermanence and preservation. While acknowledging the real challenges posed by link-rot, we will highlight the often-overlooked role of web archives in maintaining our digital heritage. By doing so, we hope to foster greater appreciation for web archiving efforts and encourage increased support for these crucial initiatives. Our goal is to leave the audience with a renewed perspective on the state of the web's preservability and a clear understanding of why supporting web archiving is essential for ensuring the longevity and accessibility of our shared digital knowledge. As we navigate an increasingly digital world, recognizing that much of what seems lost may actually be preserved is vital for researchers, educators, journalists, lawyers, and anyone who values the continuity of online information. |
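The "Lost, but Preserved" re-analysis above rests on one basic operation: checking whether a URL that is dead on the live web has at least one capture in a web archive. The sketch below approximates that check against the Internet Archive's public availability API; it is an illustration of the idea only, consults a single archive, and is not the authors' methodology.

```python
"""Check whether URLs that are dead on the live web still have at least one
capture in the Wayback Machine (illustrative; not the study's methodology)."""
import requests

AVAILABILITY_API = "https://archive.org/wayback/available"

def has_archived_copy(url: str) -> bool:
    """Return True if the Wayback Machine reports a closest available snapshot."""
    resp = requests.get(AVAILABILITY_API, params={"url": url}, timeout=30)
    resp.raise_for_status()
    snapshot = resp.json().get("archived_snapshots", {}).get("closest")
    return bool(snapshot and snapshot.get("available"))

if __name__ == "__main__":
    sample = [
        "http://example.com/",               # placeholder URLs; a real study
        "http://example.org/some/old/page",  # would iterate over its dataset
    ]
    for url in sample:
        print(url, "archived" if has_archived_copy(url) else "not found")
```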
2:05pm - 3:40pm | WORKSHOP #02: Web Archive Collections As Data Location: Slottsbiblioteket (ground floor) |
|
Web Archive Collections as Data 1University of Alicante, Spain; 2Library of Congress, United States of America; 3IIPC, United States of America; 4National Library of Norway, Norway; 5British Library, UK; 6University of Illinois Urbana-Champaign, United States of America GLAM (Galleries, Libraries, Archives and Museums) have started to make available their digital collections suitable for computational use following the Collections as Data principles[1]. The International GLAM Labs Community[2] has explored innovative and creative ways to publish and reuse the content provided by cultural heritage institutions. As part of their work, and as a collaborative-led effort, a checklist[3] was defined and focused on the publication of collections as data. The checklist provides a set of steps that can be used for creating and evaluating digital collections suitable for computational use. While web archiving institutions and initiatives have been providing access to their collections - ranging from sharing seedlists to derivatives to “cleaned” WARC files - there is currently no standardised checklist to prepare those collections for researchers. This workshop aims to involve web archiving practitioners and researchers in reevaluating whether the GLAM Labs checklist can be adapted for web archive collections. The first part of the workshop will introduce the GLAM checklist, followed by two use cases that show how the web archiving teams have been working with their institutions’ Labs to prepare large data packages and corpora for researchers. In the second part of the workshop, we want to involve the audience in identifying the main challenges to implementing the GLAM checklist and determining which steps require modifications so that it can be used successfully for web archive collections. First use case The UK Web Archive has recently started to publish the metadata to some of our inactive curated collections as data. This project developed new workflows by using the Datasheets for Datasets framework to provide provenance information on the individual collections that were published as data. In this presentation, we will highlight how participants can:
Second use case Our library recently launched its first Web News Corpus, making more than 1.5 million texts from 268 news websites available for computational analysis through an API. The aim is to facilitate text analysis at scale.[4] This presentation will provide a brief description of “warc2corpus”, our workflow for turning WARCs into text corpora, aiming to satisfy the FAIR principles, while also taking immaterial rights into account.[5] (A generic, illustrative WARC-to-text sketch follows this workshop block.) In this presentation, we will showcase how users can:
Third use case Our library has been working to refine and improve workflows that enable creation and publishing of web archive data packages for computational research use. With a recently hired Senior Digital Collections Data Librarian, and working with our institution’s Labs, web archiving staff have prepared new data packages for web archive data in response to recent research requests. We will provide some background into this work and developments that led to the creation of the data librarian role, and will share details about how we are creating our data packages and sharing derivative datasets with researchers. Using a recent data package release, we will compare local practices in providing data to researchers with the GLAM checklist and talk through ways in which our institution does or does not comply. REFERENCES: [1] Padilla, T. (2017). “On a Collections as Data Imperative”. UC Santa Barbara. pp. 1–8; [2] https://glamlabs.io/ [3] Candela, G. et al. (2023), "A checklist to publish collections as data in GLAM institutions", Global Knowledge, Memory and Communication. https://doi.org/10.1108/GKMC-06-2023-0195 [4]: Tønnessen, J. (2024). “Web News Corpus”. National Library of Norway. https://www.nb.no/en/collection/web-archive/research/web-news-corpus/ [5]: Tønnessen J., Birkenes M., Bremnes T. (2024). “corpus-build”. GitHub. National Library of Norway. https://github.com/nlnwa/corpus-build; Birkenes M., Johnsen, L., Kåsen, A. (2023). “NB DH-LAB: a corpus infrastructure for social sciences and humanities computing.” CLARIN Annual Conference Proceedings. [6]: “dhlab documentation”. National Library of Norway. https://dhlab.readthedocs.io/en/latest/ |
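Several contributions in this workshop involve turning WARC files into research-ready text, as in the warc2corpus use case above. The sketch below shows one generic way to do this with the warcio and BeautifulSoup libraries, writing one JSON line per archived HTML response; it is not the warc2corpus pipeline, and the output fields are illustrative.

```python
"""Generic WARC-to-text extraction (illustrative; not NB's warc2corpus).
Writes one JSON object per successful HTML response: URL, capture time, text."""
import json
import sys

from bs4 import BeautifulSoup
from warcio.archiveiterator import ArchiveIterator

def warc_to_jsonl(warc_path: str, out_path: str) -> None:
    with open(warc_path, "rb") as stream, open(out_path, "w", encoding="utf-8") as out:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            http = record.http_headers
            if http is None or http.get_statuscode() != "200":
                continue
            if "text/html" not in (http.get_header("Content-Type") or ""):
                continue
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            doc = {
                "url": record.rec_headers.get_header("WARC-Target-URI"),
                "capture_time": record.rec_headers.get_header("WARC-Date"),
                "title": soup.title.get_text(strip=True) if soup.title else None,
                "text": soup.get_text(" ", strip=True),
            }
            out.write(json.dumps(doc, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    warc_to_jsonl(sys.argv[1], sys.argv[2])
```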
3:40pm - 4:10pm | BREAK Location: Folkestova (upstairs) Participants in the 2025 Mentoring Program can meet at the top of the old granite stairs outside of Målstova. Seating is available in the cafeteria/bar (upstairs) and in the library hallways (upstairs and ground floor). If the weather is nice, there are also small parks immediately in front of and behind the National Library building. |
4:10pm - 4:20pm | POSTER SLAM INTRO Location: Målstova (upstairs) Session Chair: Olga Holownia, IIPC Streamed to Store Auditorium. |
4:20pm - 4:40pm | POSTER SLAM Location: Målstova (upstairs) Session Chair: Olga Holownia, IIPC Streamed to Store Auditorium. |
|
4:20pm - 4:21pm
‘We Are Now Entering the Pre-election Period’: Experimental Twitter Capture at The National Archives The National Archives (UK), United Kingdom
4:21pm - 4:22pm
The BnF DataLab Services and Tools for Researchers Working on Web Archives Bibliothèque nationale de France, France
4:22pm - 4:23pm
Designing Art Student Web Archives The New School, United States of America
4:23pm - 4:24pm
Next Steps Towards A Formal Registry Of Web Archives For Persistent And Sustainable Identification Royal Danish Library, Denmark
4:24pm - 4:25pm
Using Web Archives to Construct the History of an Academic Field University of Bergen, Norway
4:25pm - 4:26pm
Consortium on Electronic Literature (CELL) University of Bergen, Norway
4:26pm - 4:27pm
Arquivo.pt Annual Awards: A Glimpse Arquivo.pt, Portugal
4:27pm - 4:28pm
Arquivo.pt API/Bulk Access and Its Usage Arquivo.pt, Portugal
4:28pm - 4:29pm
Failed Capture or Playback Woes? A Case Study in Highly Interactive Web Based Experiences Smithsonian Libraries and Archives, United States of America
4:29pm - 4:30pm
HAWathon: Participants Experience National and University Library in Zagreb, Croatia
4:30pm - 4:31pm
Supporting Best Practices for Archiving Social Media by Heritage Institutions in Flanders (and Beyond) 1meemoo, Flemish Institute for Archives, Belgium; 2KADOC at Catholic University of Leuven, Belgium
4:31pm - 4:32pm
Planning Web Archiving Within a Four-Year Scope: Making the New Collection Plan for the Years 2025-2028 in the National Library of Finland National Library of Finland, Finland
4:32pm - 4:33pm
Redirects Unraveled: From Lost Links to Rickrolls 1Old Dominion University, United States of America; 2Internet Archive, United States of America; 3Filecoin Foundation, Netherlands
4:33pm - 4:34pm
Use of Screenshots as a Harvesting Tool for Dynamic Content and Use of AI for Later Data Analysis Computer History Museum Slovenia (Računališki muzej), Slovenia
4:34pm - 4:35pm
Asynchronous and Modular Pipelines for Fast WARC Annotation Common Crawl Foundation, United States of America
4:35pm - 4:36pm
Politely Downloading Millions of WARC Files Without Burning the Servers Down Common Crawl Foundation, United States of America
4:36pm - 4:37pm
Robots.txt and Crawler Politeness in the Age of Generative AI Common Crawl Foundation, United States of America
4:37pm - 4:38pm
Experiences Switching an Archiving Web Crawler to Support HTTP/2 Common Crawl Foundation, United States of America |
4:40pm - 6:00pm | POSTER SESSION Location: Folkestova (upstairs) |
7:30pm - 9:30pm | DINNER Location: CREDO Restaurant | Kantine (downstairs) |
Date: Thursday, 10/Apr/2025 | |
9:00am - 9:20am | MORNING COFFEE Location: Folkestova (upstairs) |
9:20am - 9:25am | LIGHTNING TALK SESSION 3: INTRODUCTION Location: Målstova (upstairs) Session Chair: Helena Byrne, British Library |
9:20am - 9:25am | LIGHTNING TALK SESSION 4: INTRODUCTION Location: Store Auditorium (ground floor) Session Chair: Dorothée Benhamou-Suesser, National Library of France |
9:25am - 9:55am | LIGHTNING TALK SESSION 3 Location: Målstova (upstairs) Session Chair: Helena Byrne, British Library |
|
9:25am - 9:30am
The Practice of Web Archiving Statistics and Quality Evaluation Based on the Localization of ISO/TR 14873:2013(E): A Case Study of the NSL-WebArchive Platform 1National Science Library, Chinese Academy of Sciences, China; 2Zhejiang Economic & Information Center, China; 3Zhejiang Economic & Information Development Co., Ltd, China
9:30am - 9:35am
Modifying ePADD for Entity Extraction in Non-English Languages National Library of Norway, Norway
9:35am - 9:40am
Arquivo.pt Query Logs Arquivo.pt, Portugal
9:40am - 9:45am
What You See No One Saw 1Drexel University, United States of America; 2Old Dominion University, United States of America |
9:25am - 9:55am | LIGHTNING TALK SESSION 4 Location: Store Auditorium (ground floor) Session Chair: Dorothée Benhamou-Suesser, National Library of France |
|
9:25am - 9:30am
Collaborative Collections at Arquivo.pt: Four Years of Recordings from the City of Sines (Portugal) Arquivo.pt, Portugal
9:30am - 9:35am
Participatory Web Archiving: The Tensions Between the Instrumental Benefits and Democratic Value 1University of Sheffield, United Kingdom; 2Institute for Web Science and Technologies (WeST), Germany; 3Bodleian Libraries, United Kingdom
9:35am - 9:40am
A Minimal Computing Approach for Web Archive Research 1University of Victoria, Canada; 2Universidad Autónoma del Estado de México, Mexico
9:40am - 9:45am
Where Fashion Meets Science: Collecting and Curating a Creative Web Archive University of the Arts London, United Kingdom |
9:55am - 10:05am | SHORT BREAK |
10:05am - 11:15am | SESSION #04: Discovery & Access (News/Newspapers) Location: Målstova (upstairs) Session Chair: Tita Enstad, National Library of Norway |
|
10:05am - 10:25am
Unlocking the Archive: Open Access to News Content as Corpora National Library of Norway, Norway The content of web archives is potentially highly valuable to research and knowledge production. However, most web archives have strict access regimes to their collections, and with good reason: archived content is often subject to copyright restrictions and potentially also data protection laws. When moving towards best practices, a key question is how to improve access, while also maintaining legal and ethical commitments. [1] This presentation will show how the National Library of Norway (NB) has worked to provide open access to a corpus of more than 1.5 million news articles in the web archive. By providing the collection as data - scoping it across the typical crawl job-oriented segmentation - anyone gets access to computational text analysis at scale. By serving metadata and snippets of content through a REST API and keeping the full content in-house, we align with FAIR principles while accounting for immaterial rights and data protection laws. [2] The key steps in building the news corpora will be walked through, such as: Further, we will demonstrate how anyone can tailor corpora for their own use and analyse news text at scale - either with user-friendly apps, or with computational notebooks via API. [3] The demonstration highlights some of the limitations, but also the great possibilities for allowing distant reading of web archives. We will discuss how the approach to collections as data provides broader access and new perspectives for researchers. Open access further allows for utilisation in new contexts, such as higher education, government and commercial business. With easy-to-use web applications on top, the threshold for non-technical users is lowered, potentially increasing the use of web archives vastly. We also reflect on how interdisciplinary cooperation and user-orientation have been vital in designing and building the solution. 10:25am - 10:45am
Recently Orphaned Newspapers: From Archived Webpages to Reusable Datasets and Research Outlooks 1Academia Sinica, Taiwan; 2National Yang Ming Chiao Tung University, Taiwan We report on our progress in converting the web archives of a recently orphaned newspaper into accessible article collections in IPTC (International Press Telecommunications Council) standard format for news representation. After the conversion, old articles extracted from a defunct news website are now reincarnated as research datasets meeting the FAIR data principles. Specifically, we focus on Taiwan's Apple Daily and work on the WARC files built by the Archive Team in September 2022 at a time when the future of the newspaper seemed dim [0]. We convert these WARC files into de-duplicated collections of pure text in ninjs (News in JSON) format [1]. The Apple Daily in Taiwan had been in publication since 2003 but discontinued its print edition in May 2021. By August 2022, its online edition was no longer being updated, and the entire news website has been inaccessible since March 2023. The fate of Taiwan's Apple Daily followed that of its (elder) sister publication in Hong Kong. The Apple Daily in Hong Kong was forced to cease its entire operation after midnight June 23, 2021 [2]. Its pro-democracy founder, Jimmy Lai (黎智英) [3], was arrested under Hong Kong's security law the year before. Being orphaned and offline, past reports and commentaries from the newspapers on contemporary events (e.g. the Sunflower Movement in Taiwan and the Umbrella Movement in Hong Kong) become unavailable to the general public. Such inaccessibility has impacts on education (e.g. fewer news sources to be edited into Wikipedia), research (e.g. fewer materials to study the early 2000s zeitgeist in Hong Kong and Taiwan), and knowledge production (e.g. fewer traditional Chinese corpora to work with). Our work in transforming the WARC records into ninjs objects produces a collection of 953,175 unique news articles totaling 4.3 GB. The articles are grouped by the day/month/year they were published, so it is convenient to look up a specific date for the news published on that day. Metadata about each article — headline(s), subject(s), original URI, unique ID, among others — are mapped into the corresponding fields in the ninjs object for ready access (a minimal mapping sketch follows this session block). (For figures, please access them at this dataset [4].) Figure 1 shows the ninjs object derived from a news article that was published on 2014-03-19, archived on 2021-09-29, and converted by us on 2024-02-17. Figure 2 is a screenshot of the webpage where the news was originally published. Figure 3 displays the text file of the ninjs object in Figure 1. Currently the images and videos accompanying the news article have not been extracted. Another process is planned to preserve and link to these media files in the produced ninjs objects. In our presentation, we shall elaborate on technical details (such as the accuracy and coverage of the conversion) and exemplary use cases of the collection. We will touch on the roles of public research organizations in preserving and making available materials that are deemed out of commerce and circulation. [0] https://wiki.archiveteam.org/index.php/Apple_Daily#Apple_Daily_Taiwan [3] https://en.wikipedia.org/wiki/Jimmy_Lai [4] https://pid.depositar.io/ark:37281/k5p3h9k37
10:45am - 11:05am
NewsWARC: Analyzing News Over Time in the Web Archive 1Bibliotheca Alexandrina, Egypt; 2Alamein International University, Egypt News consumption, as studies generally suggest, is quite common globally. Today, individuals, wherever there is an Internet connection, access news predominantly online. On the web, news websites rank relatively high by number of visits. Considering the history of the web, the news media industry was one domain of society to adopt the web as technology very early on. Being of such significance, news content on the web is one to particularly investigate, using the web archive as data source. We present NewsWARC, a tool, developed as an internship project, for aiding researchers to explore news content in a web archive collection over time. NewsWARC consists of two components: the data analyzer and the viewer. The data analyzer is code that runs on data in the collection and uses machine learning to get information about each news article or post, namely, sentiment, named entities, and category, and store that into a database for access via the second component that serves as the interface for querying and visualizing the pre-analyzed data. We report on our experience processing data from the Common Crawl news collection to use in testing, including comparing performance of the data analyzer running on different hardware configurations. We show examples of queries and trend visualizations that the viewer offers, such as examining how the sentiment of articles in health-related news varies over the course of a pandemic. In developing this initial prototype, while we narrowed our focus with regard to information that the analyzer returns to sentiment, named entities, and category, there exists a wider range of analyses to include in future work, such as topic modeling, keyword and keyphrase extraction, measuring readability and complexity, and fact vs. opinion classification. Also as future work, this overall functionality can be deployed as a service for an alternative interface to supplement researcher access to web archives. 11:05am - 11:10am
Zombie E-Journals and the National Library of Spain Biblioteca Nacional de España, Spain A "zombie e-journal" refers to an electronic journal that has become inaccessible, but for which a web archive has preserved a copy, although sometimes that copy is not perfectly accurate. It is widely recognized that, each year, a significant number of e-journals disappear without existing in print, resulting in the loss of their content on a global scale. This constitutes a substantial loss of economic investment, scholarly knowledge, and cultural heritage. While many universities maintain institutional repositories to safeguard publications, a large number of e-journals lack sustainable preservation methods due to financial constraints. In response to this challenge, the Spanish Web Archive initiated efforts to explore potential solutions. A key question was posed: is it feasible to ensure the long-term preservation of more than 10,000 open-access e-journals in Spain? The National Library of Spain, which serves as the National Centre for ISSN assignment, maintains a catalogue that includes all e-journals registered with an ISSN. The first phase of this initiative started in 2020, when the Spanish Web Archive implemented an annual broad crawl encompassing all URLs associated with electronic journals in Spain. This proactive approach significantly increases the likelihood of locating missing e-journals in the future. Currently, the project has entered its second phase, during which e-journals that became inaccessible between 2009 and 2023 have been identified. To date, over 500 zombie e-journals have been recovered through consultations with the Spanish Web Archive. The full list of these journals is publicly available through the project’s website and integrated into the National Library’s catalogue. In the forthcoming third phase, the identified e-journals will be formally declared out-of-commerce works, according to Directive (EU) 2019/790, thus facilitating open access to their content. This step will allow users to once again access and benefit from these resources. Additionally, a comprehensive system has been developed to detect missing e-journals, conduct quality assurance (QA) processes on the captured content, and integrate access to these journals through the library's website and catalogue. The broad crawl has proven effective in identifying missing e-journals, and following quality assurance, the recovered information is systematically incorporated into the catalogue. |
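The Apple Daily conversion described above maps extracted articles onto ninjs (News in JSON) properties and files them by publication date. The snippet below is a minimal sketch of that mapping step; the input keys are hypothetical and the exact ninjs profile and directory layout used by the project may differ.

```python
"""Minimal sketch: map an extracted article onto ninjs-style properties and
file it under year/month/day. Input keys ("url", "id", "published", "title",
"subjects", "text") are hypothetical; the project's actual schema may differ."""
import json
from pathlib import Path

def to_ninjs(article: dict) -> dict:
    # Field names follow common ninjs conventions (uri, headline, subject,
    # versioncreated, body_text); the project's exact profile may differ.
    return {
        "uri": article["url"],
        "versioncreated": article["published"],   # e.g. "2014-03-19T08:00:00Z"
        "headline": article["title"],
        "subject": [{"name": s} for s in article.get("subjects", [])],
        "body_text": article["text"],
    }

def store(article: dict, root: str = "corpus") -> Path:
    """Write one ninjs object per article under <root>/YYYY/MM/DD/ for lookup by date."""
    year, month, day = article["published"][:10].split("-")
    out_dir = Path(root) / year / month / day
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{article['id']}.json"
    out_file.write_text(json.dumps(to_ninjs(article), ensure_ascii=False, indent=2),
                        encoding="utf-8")
    return out_file
```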
10:05am - 11:15am | SESSION #05: Sustainability Location: Store Auditorium (ground floor) Session Chair: Bjarne Andersen, Royal Danish Library |
|
10:05am - 10:25am
42 Tips to Diminish the CO2 Impact of Websites 1National Archives of the Netherlands, Netherlands; 2Dutch Digital Heritage Network, Netherlands; 3Netherlands Institute for Sound and Vision, Netherlands; 4Van Heijst Information Consulting, Netherlands The internet has become indispensable to modern life, yet its environmental impact is often overlooked. Despite terms like "virtual" and "cloud" suggesting a minimal footprint, the global internet is a significant energy consumer. In 2020, it accounted for approximately 4% of global energy consumption, and if usage trends persist, this figure could rise to 14% by 2040. Archiving even a small number of websites contributes to the growing carbon footprint of digital archives, which compounds over time. To address this, the Dutch Digital Heritage Network commissioned research to assess the CO2 impact of current websites across various heritage organizations. The study provided practical recommendations to reduce this impact, such as optimizing image sizes, employing green hosting, and streamlining unnecessary code. These strategies not only benefit the public-facing side of websites but also hold potential for the backend, such as in the harvesting process for archiving. In our presentation, we will share these research findings and highlight actionable steps organizations can take to create more energy-efficient digital archives. Additionally, we will explore the question of what should be archived: Is every aspect of a website equally essential for long-term preservation? Lastly, we are investigating incremental archiving as a solution to reduce both storage needs and emissions. This approach, which focuses on capturing specific updates rather than performing full harvests, offers a more sustainable alternative for digital preservation. 10:25am - 10:45am
Building Towards Environmentally Sustainable Web Archiving: The UK Government Web Archive and Beyond 1University of London, United Kingdom; 2The National Archives (UK), United Kingdom There is an urgent need for the fostering of more environmentally sustainable archival methods and approaches that place sustainability frameworks at the centre of archival practice, aiding archiving institutions in their ambitions to achieve Net Zero. This will involve sector-wide collaboration to develop new ways of working and the rethinking of long-established best practice in order to define and adopt ways of working that are ‘good enough’. The challenge is particularly urgent for born-digital archives, which form an increasingly significant (and rapidly growing) part of the archival record. Pendergrass et al. 2019 have argued for fundamental change in ‘practices for appraisal, permanence, and availability of digital content’ (p. 4), and the Digital Preservation Coalition has similarly called for a re-evaluation of all aspects of digital preservation (Kilbride 2023). This paper will discuss one approach to the development of a framework for more environmentally sustainable web archiving, using the UK Government Web Archive as a case study. First, it will present the findings of a workshop on ‘Archives and the environment’, which was held at The UK National Archives in 2023. One of the main strands of discussion was the environmental cost both of creating and preserving born-digital and digitised archives and of the digital infrastructure, tools and methods used to analyse them. Recommendations arising from the event and subsequent report have informed an action plan for the UK Government Web Archive (UKGWA) as it begins to explore its environmental footprint. The UKGWA action plan involves four main strands of work: establishing, as far as possible, the current environmental impact of the web archive, drawing on a range of metrics; identifying those aspects of the web archiving workflows that may be streamlined or redeveloped in order to reduce that impact; designing and prototyping new and more sustainable processes within the UKGWA; and producing recommendations for good practice that may be adopted and/or adapted by other national and international web archives. The planned research is concerned not just with environmentally sustainable practice within the UKGWA but also with Scope 3 carbon emissions (that is, emissions that are produced not by an organisation itself but by those for whom it is indirectly responsible, in this case users and suppliers). The research is at an early stage, but we hope that the development of an extensible and customisable framework, accompanied by a toolkit that builds on the work of the Digital Humanities Climate Coalition, will provide an opportunity for wider collaboration. The work presented here is grounded in the experience and practice of the UK Government Web Archive, but it will benefit enormously from being placed in dialogue with the work of the IIPC and other national and institutional web archives concerned with the impact of climate change on digital archival practice and of digital archiving and preservation on climate change. K. Pendergrass et al., ‘Toward environmentally sustainable digital preservation’, The American Archivist (2019), 82:1, 165-206 W. Kilbride, ‘The Anthropocene remembered: digital memory after the climate crisis’, Digital Preservation Coalition Blog (2019) 10:45am - 11:05am
Preservation of Historical Data: Using Warchaeology to Process 20 Years of Harvesting National Library of Norway, Norway The National Library of Norway has been harvesting the internet since the beginning of the millennium, with a primary focus and priority on the collection and storage of data. Over 25 years, web harvesting methods and preservation systems have changed. Consequently, the collection is composed of various file types, including ARC, WARC, and files produced by NEDLIB[1]. In more recent years our focus has shifted towards access and quality assurance, and the need to include the older data has increased. But how do we utilize this data, which by now is poorly structured, has little to no documentation, and is hard for modern software to read? To address these issues and move toward the ultimate goal of making the collections fully discoverable and available, the National Library of Norway developed an open-source tool, Warchaeology[2], capable of converting, validating and deduplicating web archive collection data. This presentation will outline how we have used this tool to process 2PB of data, harvested since 2001. The objective is better management and preservation: identifying collections and groupings of data, parsing and sorting metadata, identifying formats and determining how they should be processed or converted, deduplicating files, and gathering general insight into the collections.
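Before conversion and deduplication, a first step like the one described above is simply taking stock of what a legacy harvest directory contains. The sketch below is a generic Python inventory pass under assumed file-naming conventions; it does not reproduce Warchaeology's own interface, which is the tool actually used for conversion, validation and deduplication.

```python
from collections import defaultdict
from pathlib import Path

# Rough inventory of a legacy harvest directory: group files by assumed container
# format so that ARC/NEDLIB material can be routed to conversion and WARCs to
# validation and deduplication. Purely illustrative.
FORMAT_HINTS = {".arc": "arc", ".arc.gz": "arc", ".warc": "warc", ".warc.gz": "warc"}

def classify(path: Path) -> str:
    name = path.name.lower()
    for suffix, fmt in FORMAT_HINTS.items():
        if name.endswith(suffix):
            return fmt
    return "other"  # e.g. NEDLIB output or unknown legacy files needing closer inspection

def inventory(root: str) -> dict[str, dict[str, float]]:
    stats: dict[str, dict[str, float]] = defaultdict(lambda: {"files": 0, "gib": 0.0})
    for path in Path(root).rglob("*"):
        if path.is_file():
            fmt = classify(path)
            stats[fmt]["files"] += 1
            stats[fmt]["gib"] += path.stat().st_size / 2**30
    return stats

if __name__ == "__main__":
    for fmt, s in inventory("/data/legacy-harvests").items():
        print(f"{fmt:6s} {int(s['files']):>8d} files {s['gib']:,.1f} GiB")
```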
11:05am - 11:10am
Analysing the Publications Office of the European Union Web Archive for the Rationalisation of Digital Content Generation Publications Office of the European Union, Luxembourg More and more information from EU institutions, bodies and agencies is only made available on their public websites. However, web content often has a short lifespan, and this information is at risk of getting lost when websites are updated, substantially redesigned or taken offline. As part of its different preservation activities, the Publications Office of the EU crawls, curates and preserves the content and design of these websites, making them available for current and future generations. We are also preparing the ingestion of this collection into our digital archive, to ensure its long-term preservation. We have recently performed a full export of the most recent crawls from our web archive collection, spanning from March 2019 to September 2024, as a set of WARC files. We have extracted relevant information regarding all the "response" and "revisit" records in the collection and inserted it into a relational database, allowing efficient custom analyses. In this presentation, we will show various interesting statistics we have generated about the content of our web archive. These include the analysis of large response payloads (more than 100 MB), as well as the relative footprint of crawled video files. We also investigate the amount of duplication among records: duplication avoided through 'revisit' records, as well as duplicate 'response' records that are still present in the archive. We also explain how we have used this information to refine our crawling strategies in order to rationalise our digital content generation going forward. Finally, we define potential policies to curate the existing archive prior to ingestion in a long-term digital repository, where the impact on the carbon footprint may be even more significant. |
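A minimal sketch of the kind of extraction described above, using the warcio library and SQLite: per-record metadata for 'response' and 'revisit' records goes into a relational table that can then be queried for payload sizes, MIME types or duplicate digests. The table layout and field choices are illustrative assumptions, not the Publications Office's actual schema.

```python
import sqlite3
from warcio.archiveiterator import ArchiveIterator

def load_warc(warc_path: str, db_path: str = "webarchive.db") -> None:
    """Insert basic metadata for 'response' and 'revisit' records into SQLite."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS records (
                     warc_file TEXT, rec_type TEXT, url TEXT, date TEXT,
                     mime TEXT, block_bytes INTEGER, digest TEXT)""")
    with open(warc_path, "rb") as fh:
        for rec in ArchiveIterator(fh):
            if rec.rec_type not in ("response", "revisit"):
                continue
            h = rec.rec_headers
            mime = rec.http_headers.get_header("Content-Type") if rec.http_headers else None
            con.execute("INSERT INTO records VALUES (?, ?, ?, ?, ?, ?, ?)",
                        (warc_path, rec.rec_type,
                         h.get_header("WARC-Target-URI"), h.get_header("WARC-Date"),
                         mime,
                         # WARC Content-Length is the record block length
                         # (HTTP headers plus payload for response records).
                         int(h.get_header("Content-Length") or 0),
                         h.get_header("WARC-Payload-Digest")))
    con.commit()
    con.close()

# Example analysis from the abstract: find large response records (> 100 MB).
# SELECT url, block_bytes FROM records
#   WHERE rec_type = 'response' AND block_bytes > 100 * 1024 * 1024;
```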
10:05am - 11:15am | WORKSHOP #03: Introduction to Web Graphs Location: Slottsbiblioteket (ground floor) |
|
Introduction to Web Graphs Common Crawl Foundation, United States of America The workshop will begin with a brief introduction to the concept of the webgraph or hyperlink graph - a directed graph whose nodes correspond to web pages and whose edges correspond to hyperlinks from one web page to another. We will also look at aggregations of the page-level webgraph at the level of Internet hosts or pay-level domains. The host-level and domain-level graphs are at least an order of magnitude smaller than the original page-level graph, which makes them easier to study. To represent and process webgraphs, we utilize the WebGraph framework, which was developed at the Laboratory of Web Algorithms (LAW) of the University of Milano. As a "framework for graph compression aimed at studying web graphs," it allows very large webgraphs to be stored and accessed efficiently. Even on a laptop computer, it's possible to store and explore a graph with 100 million nodes and more than 1 billion edges. The WebGraph framework is also used to compress other types of graphs, such as social network graphs or software dependency graphs. In addition, the framework and related software projects include tools for the analysis of web graphs and the computation of their statistical and topological properties. The WebGraph framework implements a number of graph algorithms, including PageRank and other centrality measures. It is an open-source Java project, but a re-implementation in the Rust language has recently been released. Over the past two decades, the WebGraph format has been widely used by researchers, for example those at LAW or Web Data Commons, to distribute graph dumps. It has also been used by open data initiatives, including the Common Crawl Foundation and the Software Heritage project. The workshop focuses on interactive exploration of one of the precompiled and publicly available webgraphs. We look at graph properties and metrics, learn how to map node identifiers (just numbers) and node labels (URLs), and compute the shortest path between two nodes. We also show how to detect "cliques", i.e. densely connected subgraphs, or how to run PageRank and related centrality algorithms to rank the nodes of our graph. We share our experiments on how these applications are used for collection curation: how cliques can be used to discover sites with content in a regional language, how link spam is detected or how global domain ranks are used to select a representative sample of websites. Finally, we will build a small webgraph from scratch using crawl data. Participants will learn how to explore webgraphs (even large ones) in an interactive way and learn how graphs can be used to curate collections. Basic programming skills and basic knowledge of the Java programming language are a plus but not required. Since this is an interactive workshop, attendees should bring their own laptops, preferably with the Java 11 (or higher) JDK and Maven installed. Nevertheless, it will be possible to follow the steps and explanations without having to type them into a laptop. We will provide download and installation instructions, as well as all teaching materials, prior to the workshop. |
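To make the host-level aggregation and ranking ideas concrete, the toy sketch below builds a small host graph from page-level links in Python with networkx and runs PageRank and a shortest-path query. The workshop itself uses the WebGraph framework (Java, with a recent Rust re-implementation), which is what actually scales to graphs with billions of edges; this sketch only mirrors the concepts.

```python
from urllib.parse import urlsplit
import networkx as nx

# A handful of page-level hyperlinks standing in for crawl data.
page_links = [
    ("https://a.example/page1", "https://b.example/post"),
    ("https://b.example/post", "https://c.example/"),
    ("https://c.example/", "https://a.example/page2"),
    ("https://a.example/page2", "https://c.example/about"),
]

# Aggregate the page-level graph to the host level, dropping intra-host links.
host_graph = nx.DiGraph()
for src, dst in page_links:
    src_host, dst_host = urlsplit(src).hostname, urlsplit(dst).hostname
    if src_host != dst_host:
        host_graph.add_edge(src_host, dst_host)

print("PageRank:", nx.pagerank(host_graph))
print("Shortest path a->c:", nx.shortest_path(host_graph, "a.example", "c.example"))
```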
11:15am - 11:45am | BREAK Location: Folkestova (upstairs) |
11:45am - 1:15pm | PANEL #02: Cross-Institutional Collaborations Location: Målstova (upstairs) Session Chair: Abbie Grotke, Library of Congress |
|
Past, Present & Future of Cross-Institutional Collaboration in Web Archiving: Insights from the Norwegian and Danish Web Archive, the NetArchiveSuite Community, & Beyond 1Royal Danish Library, Denmark; 2National Library of Norway, Norway; 3Bibliothèque nationale de France, France; 4Biblioteca Nacional de España, Spain; 5Analysis & Numbers, Denmark; 6Library of Congress, United States of America |
11:45am - 1:15pm | SESSION #06: Curating Social Media Location: Store Auditorium (ground floor) Session Chair: Tom Smyth, Library and Archives Canada |
|
11:45am - 12:05pm
Developing Social Media Archiving Guidelines at the National Archives of the Netherlands National Archives of the Netherlands, Netherlands At the beginning of 2024, we started a project to develop a nationwide guideline for archiving public social media content. This project aimed to address the increasing use of social media by Dutch governments and the current lack of archiving. Our presentation at the Web Archiving Conference 2025 will focus on the process of creating this guideline and presenting the final version. The primary target audience for this guideline is information professionals, who play a vital role in managing and preserving the archived social media content. However, we also recognise communication professionals as an important target audience, given their role in setting up and using the accounts. The guideline is structured into six modules:
This module provides a definition of social media and identifies what constitutes public information on the various platforms.
In this module, we examine the Dutch and European legal requirements and constraints related to archiving public social media content. Understanding the legal landscape is essential to ensure compliance and address any legal challenges.
This module provides practical recommendations on using social media in a way that facilitates easier archiving. Aimed at those managing the social media accounts, it includes tips on account settings and content creation.
This module addresses how content can be appraised and selected, and how to ensure that historically important information will eventually be transferred to the Dutch National Archives.
In this module we establish quality criteria for archiving social media content and explore various techniques to archive social media. Methods discussed include screen capturing and API usage. This module aims to equip professionals with the knowledge to choose the most effective archiving methods.
The final module presents real-world examples from the Netherlands and abroad. These case studies illustrate diverse methods and results, providing practical insights and lessons learned from other practitioners in the field. The creation of this guideline was a collaborative and intensive year-long process. We systematically engaged with a wide range of stakeholders and incorporated their feedback to ensure the guideline is comprehensive and practical. Our goal is to support government agencies in archiving their social media communications effectively. We are excited to share our journey and the outcomes of this project with our colleagues at the Web Archiving Conference. By presenting our experiences and insights, we hope to contribute to the ongoing discourse on social media archiving and inspire others in the field. 12:05pm - 12:25pm
Archiving the Social Media Profiles of Members of Government National Library of Luxembourg, Luxembourg As part of the 2023 national elections, the National Library of Luxembourg, in collaboration with the National Archives and the Ministry of State, launched a pilot project to archive the social media profiles of members of the government. The technical obstacles to archiving social platforms are becoming increasingly problematic, with the result that none of the major platforms can currently be archived effectively by our harvesters and service providers. Since most social media platforms are practically inaccessible to web crawlers and conventional web archiving methods, we decided to try a more direct approach: asking the members of government directly to download the data from their profiles and hand it over to the National Library and National Archives. With the help of the Ministry of State, we sent out a call for participation, with specific guidelines for exporting datasets from the social networks, to the archive delegates and communication departments of each ministry, as well as to the ministers themselves. The response to this first call for participation was very positive, despite the time pressure between the election and the formation of a new government, and the high chance of many ministers leaving office. In addition to elaborating the guidelines for downloading datasets from different platforms, we offered direct technical support to the people involved in the ministries, and even to the members of government themselves, and retrieved the data individually on site. We were able to retrieve the majority of the government's profiles, covering the five years of their term. This pilot project represents a direct and effective method to secure the data of profiles of high public interest. The National Library and National Archives of Luxembourg are looking to repeat the same collection process by the end of 2024 and hope to move to a regular operation after that. This presentation will cover the different steps of the collection process, the lessons learned from the pilot project and the second operation at the end of 2024. We will conclude with an outlook on the changes we hope to implement in the future, a possible extension of the collection scope and our plans in terms of public access to the collections. 12:25pm - 12:45pm
From Posts to Archives: The National Library of Singapore’s Journey in Collecting Social Media National Library Board Singapore, Singapore Social media plays a huge role in our everyday life today. It is used for a myriad of activities such as communication, entertainment, business, and even as personal diaries. In Singapore, about 85% of the population uses social media, the most popular ones being Facebook, Instagram, YouTube, and TikTok. Besides individuals, many organisations have also turned to social media to engage and communicate with their followers. With such prevalent use, social media is becoming an important source of information about the lives and stories of our country and people. Recognising this, the National Library of Singapore (NLS) began looking at collecting social media. Our journey started in 2017, and the initial years focused on research and experiments, such as conducting an environmental scan of other heritage institutions’ experiences in collecting social media, proof-of-concept work using web archiving and available APIs, and trialling commercial vendors’ solutions. Our experience was similar to that of many institutions around the world. Collecting social media is complex and poses many technical, legal, and ethical challenges such as limited access to APIs and needing to manage personal data and third-party content. Despite these challenges, we knew that we had to start collecting social media given its increasing significance. This was not only to meet our mandate of collecting and preserving our country’s digital memories, but also to gain practical experience in how to collect, organise, and manage this format. Putting together what we have learnt, we developed a social media collecting framework in 2023 to provide guidance on how to collect social media amidst these challenges while ensuring that a representative set of social media content can be collected for future generations and research. Our framework covered the selection criteria, the collecting methods, and our collecting approach for key social media platforms that are widely used in Singapore. We piloted our first social media collecting effort in the same year, under NLS’ new 2-year project to collect contemporary materials on Singapore food and youth. The purpose was to assess individuals’ and organisations’ receptiveness to contributing their social media accounts to us, which was greater than we anticipated. In 2024, we made collecting social media part of our operational work. Our collection strategy was three-pronged: 1) outsourcing the archiving of significant persons/organisations’ social media accounts to a commercial vendor; 2) approaching identified organisations based on subjects to contribute their social media accounts; and 3) engaging and promoting social media collecting through advocates and an annual public call to nominate favourite Singapore social media accounts, YouTube and TikTok videos, as well as websites. This presentation will highlight NLS’ journey in collecting social media, our collecting framework and strategy, as well as learning points and future plans. 12:45pm - 1:05pm
Innovative Web Archiving Amid Crisis: Leveraging Browsertrix and Hybrid Working Models to Capture the UK General Election 2024 British Library, United Kingdom The British Library, in collaboration with the National Libraries of Scotland and Wales, the Bodleian Library and Cambridge University Library, has created collections of archived websites for all UK general elections since 2005. This time series shows how internet use in political communication has evolved, and how the fortunes of political parties have changed. The 2024 general election was called unexpectedly on May 22nd, and took place on July 4th, at a time when the UK Web Archive was inaccessible, and our Web Archiving and Curation Tool was unavailable following a devastating ransomware attack on the British Library on October 29th 2023. Working together, we nevertheless created a collection of 2253 archived websites covering candidates' campaign sites, social media feeds of significant politicians and journalists, local and national party sites, comment by think tanks, community engagement, news sources, and manifestos of a plethora of interest groups seeking to influence the new government. To facilitate use by researchers tracking change over time, we have organised the material into these same sub-collections since 2005. We collected campaign websites for a sample of English candidates for the same counties and urban areas as we have covered since 2005, but all Scottish and Welsh candidates’ sites were gathered as numbers are manageable. We also targeted marginal constituencies which had increased in numbers dramatically since 2019. The 2024 general election saw the rise of formerly minor parties such as Reform UK to national prominence, a Liberal Democrat resurgence, growing influence of independent candidates, and the rise of identity politics with groups encouraged to vote as a bloc on issues such as the war in Gaza, and an increasingly sophisticated use of social media. The technical outage caused by the ransomware attack necessitated a unique approach due to the disruption in our usual workflows. Despite the challenges, websites continued to be archived using Heritrix on AWS servers rather than the Library's in-house infrastructure. This shift required a new workflow, involving the use of simple spreadsheets and collaborative efforts to quickly refine metadata definitions and crawl scope, aiming to replicate our existing curatorial software as closely as possible. This experience introduced library staff to working within data and time constraints, enhancing our understanding of how to effectively scope crawls, monitor them in real-time, and implement new quality assurance practices. The project resulted in a hybrid collecting model, utilising both Heritrix and Browsertrix for the same thematic collection. The presentation will discuss the challenges and opportunities encountered during this project, providing valuable insights for those interested in Browsertrix’s capabilities and in executing web archiving with a mixed-model approach across different institutions with diverse interests and expertise in unusually challenging circumstances within the framework provided by a historic time series. |
11:45am - 1:15pm | WORKSHOP #04: How to Develop a New Browsertrix Behavior Location: Slottsbiblioteket (ground floor) |
|
How to Develop a New Browsertrix Behavior Webrecorder, United States of America Behaviors are a key part of Browsertrix and Browsertrix Crawler, as they make it possible to automatically have the crawler browsers take certain actions on web pages to help capture important content. This tutorial will walk attendees through the process of creating a new behavior and using it with Browsertrix Crawler. Browsertrix Crawler includes a suite of standard behaviors, including auto-scrolling pages, auto-playing videos, and capturing posts and comments on particular social media sites. By default, all of the standard set of behaviors are enabled for each crawl. Users have the ability to instead disable behaviors entirely or select only a subset of the standard set of behaviors to use on a crawl. At times, users may need additional custom behaviors to navigate and interact with a site in specific ways automatically during crawling if they want the resulting web archive and replay to reflect the full experience of the live site. For instance, a new behavior could click on interactive buttons in a particular order, “drive” interactive components on a page, or open up posts sequentially on a new social media site and load comments. This tutorial will walk through the process of creating a new behavior step by step, using the existing written tutorial for creating new behaviors on GitHub as a model. In addition to demonstrating how to write a behavior’s code (using JavaScript), the tutorial will also discuss how to know when a behavior is the appropriate solution for a given crawling problem, how to test behaviors during development, how to use custom behaviors with Browsertrix Crawler running locally in Docker, and finally how to use custom behaviors from the Browsertrix web interface (a feature that is currently planned and will be completed by the conference date). Participants will not be expected to write any code or follow along on their own laptops in real time during the tutorial. The purpose is instead to demonstrate how one would approach developing a new behavior, lower the barrier to entry for developers and practitioners who may be interested in doing so, and to give attendees the opportunity to ask questions of Webrecorder developers in real time. We would additionally love to foster a conversation about how to develop a community library of available behaviors moving forward to make it easier than ever for users to find and use behaviors that meet their needs. The tutorial will be led by Ilya Kreymer and Tessa Walsh, developers at Webrecorder with intimate knowledge of the Browsertrix ecosystem. The target audience is technically-minded web archiving practitioners and developers - in other words, people who could either themselves write new custom behaviors or communicate the salient points to developers at their institutions. Because this is not a hackathon-style workshop, the tutorial could have as many participants as the venue allows. By the conclusion of the tutorial, attendees should understand the concept of how Browsertrix Behaviors work, when developing a new behavior is a good solution to their problems, the steps involved in developing and testing a new behavior, and where to find additional resources to help them along the way. Our hope is to foster a decentralized community of practice around behaviors to the entire IIPC community’s benefit. |
1:15pm - 2:15pm | LUNCH Location: CREDO Restaurant | Kantine (downstairs) If you signed up for a guided exhibition tour, please be in the exhibition room at 13:20. To know if you signed up for a tour, check your registration details in ConfTool. |
2:15pm - 3:40pm | SESSION #07: Research & Access Location: Målstova (upstairs) Session Chair: Marie Roald, National Library of Norway |
|
2:15pm - 2:35pm
From Pages to People: Tailoring Web Archives for Different Use Cases 1Cambridge University Libraries, United Kingdom; 2National Library of Scotland, United Kingdom Our paper explores different modes of reaching the three distinct audiences identified in previous work with the National Archives UK: readers, data users, and the digitally curious. Building on the examples of our work conducted at Cambridge University Libraries and the National Library of Scotland, our paper gives recommendations and demonstrates good practices for designing web archives for different audience needs while ensuring wide access. Firstly, to improve the experience of general readers, we employ exploratory and gamified interfaces and public outreach events, such as exhibitions, to raise library users' awareness of the available web archive resources. Secondly, to serve the data user community, we put an emphasis on curating metadata datasets and the Datasheets for Data documentation, encouraging quantitative research on the web archive collections. This work also involves outreach events, such as data visualisation calls, which can later be incorporated into the resources for general readers. Finally, to overcome the obstacle of the digital skills gap, we tailored in-library workshops for the digitally curious - those who recognise the potential of web archives but lack advanced computational skills. We expect that upskilling the digitally curious can spark their interest in exploring and using the web archive collections. To sum up, our paper introduces the work we have been doing to improve the usability of the UK Web Archive within our institutions by developing additional materials (datasets, interfaces) and planning outreach events (exhibitions, calls, workshops) to ensure we meet the expectations of readers, data users, and the digitally curious. 2:35pm - 2:55pm
Making Research Data Published to the Web FAIR University of Sheffield, United Kingdom The University of Sheffield’s vision for research is that our distinctive and innovative research will be world-leading and world-changing. We will produce the highest quality research to drive intellectual advances and address global challenges. https://www.sheffield.ac.uk/openresearch/university-statement-open-research Research data published to the web can offer opportunities for wider discovery and access to your research outputs. However, it also presents risks: there is no assurance that such discovery and access will remain available for as long as the need for it remains. Websites are an inherently fragile medium, which makes it difficult to guarantee that we can evidence our research impact over time. This includes potentially wanting to submit sites as part of a UK Research Excellence Framework submission (the next is scheduled for 2029). Funding requirements may also stipulate how long the outputs are expected to remain accessible. Years of work, including work undertaken with public funding, could disappear if there is no intervention. In addition, publishing research data to the web cannot provide assurances in terms of meeting the University of Sheffield’s commitment to FAIR principles (findable, accessible, interoperable and reusable) and Open Research and Open Data practices. At the University of Sheffield, colleagues in our Research Data Management (RDM) team have also noticed a trend of researchers depositing in the Institutional Repository (ORDA) links to the URLs where the data is situated. In some cases, the website is the research output in its entirety, so its maintenance falls outside of the RDM team’s remit and we cannot provide the usual assurances about preserving that deposit. This paper will discuss the work undertaken by the University of Sheffield’s Library to mitigate potential data loss from research published online. It will include a case study of capturing a research group’s website for deposit in our institutional data repository, the collaborative creation of guidance for researchers and research data managers, and the embedding of good practice at the University to ensure that Open Research and Open Data remain open and FAIR. 2:55pm - 3:15pm
Enhancing Accessibility to Belgian Born-Digital Heritage: The BelgicaWeb Project Royal Library of Belgium (KBR), Belgium The BelgicaWeb project aims to make Belgian digital heritage more FAIR (i.e. Findable, Accessible, Interoperable and Reusable) to a wide audience. BelgicaWeb is a BRAIN 2.0 project funded by BELSPO, the Belgian Science Policy Office. It is a collaboration between CRIDS (University of Namur), who provide expertise on the relevant legal issues; IDLab, GhentCDH and MICT (Ghent University), who will work on data enrichment, user engagement and evaluation, and outreach to the research community, respectively; and KBR (Royal Library of Belgium), who act as project coordinator and work on the development of the access platform and API and on data enrichment. By leveraging web and social media archiving tools, the project focuses on creating comprehensive collections, developing a multilingual access platform, and providing a robust API enabling data-level access. At the heart of the project is a reference group of experts who provide iterative input on selection, development of the API and access platform, data enrichment, quality control and usability. Therefore, the project contributes to moving towards best practices for search and discovery. The project goes beyond data collection by means of open-source tools by enriching and aggregating (meta)data associated with these collections using innovative technologies such as Linked Data and Natural Language Processing (NLP). This approach enhances search capabilities, yielding more relevant results for both researchers and the general public. In this presentation, we will provide an overview of the BelgicaWeb project’s system architecture, the technical challenges we encountered, and the solutions we implemented. We will demonstrate how the access platform and API offer powerful, relevant, and user-friendly search functionalities, making it a valuable tool for accessing Belgium’s digital heritage. Attendees will gain insights into our development process, the technologies employed, and the benefits of our open-source approach for the web archiving and, by extension, the digital preservation communities. 3:15pm - 3:35pm
Using Generative AI to Interrogate the UK Government Web Archive The National Archives (UK), United Kingdom Our project seeks to make the contents of Web Archives more easily discoverable and interrogable, through the use of Generative AI (Gen-AI). It explores the feasibility of setting up a chatbot, and using UK Government Web Archive data to inform its responses. We believe that, if this approach proves successful, it could lead to a step-change in the discoverability and accessibility of Web Archives. Background Gen-AIs like ChatGPT and Copilot have impressive capabilities, but are notoriously prone to “hallucinations”. They can generate confident-sounding, but demonstrably false responses – even to the point of inventing non-existent academic papers, complete with fictitious DOI numbers. Retrieval-Augmented Generation (RAG) seeks to address this. It supplements Gen-AI with an additional database, queried whenever a response is generated. This approach aims to significantly reduce the chance of hallucination, while also enabling chatbots to provide specific references to the original sources. Additionally, any approach used would need to take into account the occasional need to remove individual records (in line with The National Archives’ takedown policy: https://www.nationalarchives.gov.uk/legal/takedown-and-reclosure-policy/). In traditional Neural Networks, “forgetting” data is currently an intractable problem. However, it should be possible to set up RAG databases such that removal of specific documents is straightforward. Approach Our project is focused on two open-source tools, both of which allow for RAG based on Web Archive records. The first is WARC-GPT, a lightweight tool developed by a team at Harvard designed to ingest Web Archive documents, feed them to a RAG database, and provide a chat-bot to interrogate the results. While the tool’s creators have demonstrated its capabilities on a small number of documents, we have attempted to test it at a larger scale, on a corpus of ~22,000 resources. The second, more sophisticated tool is Microsoft’s GraphRAG. GraphRAG identifies the “entities” referenced in documents, and builds a data structure representing the relationships between them. This data structure should allow a chat-bot to carry out more in-depth “reasoning” about the contents of the original documents, and potentially provide better answers about information aggregated across multiple documents. Results Our initial findings suggest that WARC-GPT produces impressive responses when queried about topics covered in a single document. It quickly discovers which one of the documents in its database best answers the prompt. It summarises relevant information from that document, and provides its URL. Additionally, with a few minor tweaks to the underlying source code, it is possible to remove individual documents from its database. However, WARC-GPT’s responses fare poorly when attempting to aggregate information from multiple documents. Our experiments with GraphRAG suggest that it outperforms WARC-GPT in aggregating information. However, while GraphRAG is reasonably quick to generate these responses, it is significantly slower and more expensive to set up than WARC-GPT. Additionally, removing individual records from GraphRAG, while possible, is computationally expensive. |
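For readers unfamiliar with the pattern, the sketch below shows the bare bones of Retrieval-Augmented Generation over archived documents: embed the documents, retrieve the closest matches to a question, and hand only those snippets (with their archive URLs) to a language model as grounding context. The embedding here is a deliberately naive stand-in, and none of this is WARC-GPT's or GraphRAG's actual code.

```python
import numpy as np

def embed_text(text: str) -> np.ndarray:
    """Toy character-frequency embedding; swap in a real embedding model."""
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def retrieve(question: str, docs: list[dict], k: int = 3) -> list[dict]:
    """docs: [{'url': ..., 'text': ...}] extracted from WARC 'response' records."""
    doc_vecs = np.stack([embed_text(d["text"]) for d in docs])
    scores = doc_vecs @ embed_text(question)          # cosine-like similarity (unit vectors)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(question: str, sources: list[dict]) -> str:
    # Keep only the retrieved snippets, with their archive URLs, as context.
    context = "\n\n".join(f"[{d['url']}]\n{d['text'][:2000]}" for d in sources)
    return (f"Answer using only the sources below and cite their URLs.\n\n"
            f"{context}\n\nQuestion: {question}")
```

The returned prompt would then be sent to whichever chat model is in use; grounding the answer in retrieved records and citing their URLs is what lets the chatbot point users back to the archived sources rather than hallucinate.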
2:15pm - 3:40pm | SESSION #08: Handling What You Captured Location: Store Auditorium (ground floor) Session Chair: Meghan Lyon, Library of Congress |
|
2:15pm - 2:35pm
So You’ve Got a WACZ: How Archives Become Verifiable Evidence Starling Lab for Data Integrity, Stanford-USC, United States of America This talk will present a workflow and toolkit, developed by the Starling Lab for Data Integrity, for collecting and organizing web archives alongside integrity and provenance data. Co-founded by Stanford and USC, Starling supports investigators–be they journalists, lawyers, or human rights defenders–in their collection of information and evidence. In addition to using Browsertrix to crawl (and test) large sets of web archive data, we have built a downstream integration, so data flows into our cryptographically-signed and append-only database called Authenticated Attributes (AA). AA extends Browsertrix’s utility by enabling archivists to securely attach and verify the provenance of claims that include context-critical metadata about the archived content in a secure and decentralized manner. It allows for the addition, preservation, and sharing of provenance data while facilitating efficient organization, searchability, and integration with other tools. Through AA, web archives and metadata become accessible for other applications and verification workflows, e.g. OSINT investigations. In this presentation, we will showcase case studies and projects with our collaborators including the Atlantic Council’s DFRLab and conflict monitors. 2:35pm - 2:55pm
Warc-Safe: An Open-Source WARC Virus Checker and NSFW (Not-Safe-For-Work) Content Detection Tool National Library of Luxembourg, Luxembourg We present warc-safe, the first open-source WARC virus checker and NSFW (Not-Safe-For-Work) content detection tool. Built with particular emphasis on usability and integration within existing workflows, this application detects harmful material and inappropriate content in WARC records. The tool uses the open-source ClamAV antimalware toolkit for threat detection and a specially trained AI model to analyze WARC image records. Several image formats are supported by the model (JPG, PNG, TIFF, WEBP, …), which produces a score between 0 (completely safe) and 1 (surely unsafe). This approach makes it easy to classify images and determine what to do with those that exceed a certain threshold. The warc-safe tool was developed with ease of use in mind; thus, it can be run in two modes: test mode (scan WARC files on the command line) or server mode (for easy integration with existing workflows). Server mode allows the client to use several features over an API, such as scanning a WARC file for viruses, scanning for NSFW content, or both. This makes it easy to use together with popular web archiving tools. To illustrate this, we present a case study where warc-safe was integrated into SolrWayback and the UK Web Archive’s warc-indexer. This integration made it possible to enrich the metadata indexed from WARC files, by extending the existing Solr schema with several new fields related to virus- and NSFW-test results, allowing for advanced searching and statistical analysis. Finally, we discuss how warc-safe could be used within an institutional framework, for instance by scanning newly harvested WARC files resulting from large-scale harvesting campaigns as well as including it within existing indexing workflows. 2:55pm - 3:15pm
Detecting and Diagnosing Errors in Replaying Archived Web Pages 1University of Michigan, United States of America; 2University of Southern California, United States of America When a user loads an archived page from a web archive, the archive must ensure that the user’s browser fetches all resources on the page from the archive, not from the original website. To achieve this, archives rewrite references to page resources that are embedded within crawled HTMLs, stylesheets, and scripts. Unfortunately, the widespread use of JavaScript on modern web pages has made page rewriting challenging. Beyond rewriting static links, archives now also need to ensure that dynamically generated requests during JavaScript execution are intercepted and rewritten. Given the diversity of scripts on the web, rewriting them often results in fidelity violations, i.e., when a user loads an archived page, even if all resources on the page had been crawled and saved, either some of the content that appeared on the original page is missing or some functionality that ought to work on archived pages (e.g., menus, changing the page theme) does not. To verify if the replay of an archived page preserves fidelity, archival systems currently compare either screenshots of the page taken during recording and replay or errors encountered in both loads (e.g., https://docs.browsertrix.com/user-guide/review/). These methods have several significant drawbacks. First, modern web pages often include dynamic components, such as animations or carousels, so screenshots of the same page copy can vary across loads. Second, neither does incorrect replay always result in additional script execution or resource fetch errors, nor does the presence of such errors indicate the existence of user-visible problems. Lastly, even if an archived page does differ from the original page, existing methods cannot pinpoint what inaccuracies in page rewriting led to this problem. In this talk, we will describe our work in developing a new approach for a) more reliably detecting whether the replay of an archived page violates fidelity, and b) pinpointing the cause when this occurs. Fundamental to our approach is that we do not focus only on the externally visible outcomes of page loads (e.g., pixels rendered and runtime/fetch errors). Instead, both during recording and replay, we capture each visible element in the browser DOM tree, including its location on the screen and dimensions, and the JavaScript writes that produce visible effects. Our fine-grained representation of page loads also enables us to precisely identify the rewritten source code that led to fidelity violations. The fix ultimately has to be determined by a human developer. However, we are able to validate the root cause we identify by either inserting only the problematic rewrite into the original page or by selectively rolling back that edit from the rewritten archived page and examining the corresponding effects. In our study across tens of thousands of diverse pages, we have found that pywb (version 2.8.3) fails to accurately replay archived copies of approximately 15–17% of pages. Importantly, compared to relying on screenshots and errors to detect low fidelity replay, our approach reduces false positives by as much as 5x.
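The sketch below illustrates the element-level comparison idea in its simplest form: given per-element records (tag, position, size) captured from the browser DOM during recording and replay, flag elements that are missing or have moved. The record format is invented for illustration; the authors' instrumentation is considerably richer, as it also tracks JavaScript writes with visible effects and ties violations back to the rewritten source code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VisibleElement:
    """One visible DOM element observed during a page load (illustrative format)."""
    xpath: str
    tag: str
    x: float
    y: float
    width: float
    height: float

def fidelity_violations(recorded: list[VisibleElement],
                        replayed: list[VisibleElement],
                        tolerance_px: float = 2.0) -> list[str]:
    """Compare recording vs replay snapshots and report missing or displaced elements."""
    replay_by_xpath = {el.xpath: el for el in replayed}
    issues = []
    for el in recorded:
        other = replay_by_xpath.get(el.xpath)
        if other is None:
            issues.append(f"missing element {el.tag} at {el.xpath}")
        elif (abs(el.x - other.x) > tolerance_px or abs(el.y - other.y) > tolerance_px
              or abs(el.width - other.width) > tolerance_px
              or abs(el.height - other.height) > tolerance_px):
            issues.append(f"moved/resized element {el.tag} at {el.xpath}")
    return issues
```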
3:15pm - 3:35pm
Building a Toolchain for Screen Recording-Based Web Archiving of SVOD Platforms Institut national de l'audiovisuel (INA), France As Subscription Video on Demand (SVOD) platforms expand, preserving DRM-protected content has become a critical challenge for web archivists. Traditional methods often fall short due to Digital Rights Management (DRM) restrictions, necessitating more adaptable solutions. This presentation covers the ongoing development of a generic toolchain based on screen recording, designed to effectively address DRM restrictions, capture high-quality content, and scale efficiently. The project is structured into two main phases. Phase One focuses on developing a system that automatically checks the quality of screen recordings. By monitoring key metrics such as frame rate, resolution, and bit rate, the system should ensure that recordings match the original content’s quality as closely as possible. This phase addresses several technical challenges, including video glitches, frame drops, low resolution, and audio syncing issues. These problems arise from varying network conditions, software performance issues, and hardware limitations. To refine and validate the toolchain, over 100 hours of competition footage from the Paris 2024 Olympic Games have been collected and are being used to assess the system’s performance. This dataset is crucial for ensuring that the toolchain can handle high-quality recordings effectively. Phase Two tackles the specific challenges posed by DRM restrictions. Level 1 DRM, which involves a trusted environment and hardware restrictions, uses hardware acceleration that causes black screens when video playback and screen recording are attempted simultaneously. Additionally, many SVOD platforms limit high-resolution playback on Linux systems, complicating the capture of high-quality content. To circumvent these issues, playback should be handled on remote machines running Windows, Mac, or Chrome OS (environments where high-resolution limitations do not apply), while recording is performed on Linux systems. For HD video content, which generally involves Level 3 DRM with only software restrictions, Linux can be used directly for both playback and recording without encountering black screen issues. The toolchain will utilize Docker to scale the recording process by virtualizing hardware components such as display and sound cards. Docker should enable the system to manage multiple recordings concurrently, improving efficiency and reducing the time required for large-scale archiving. FFmpeg will be employed for recording, while Xvfb and ALSA will be used to virtualize the display and sound cards, respectively. By leveraging Docker for virtualization and managing workloads across various instances, the system is expected to scale effectively and accelerate the archiving process. |
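As a rough sketch of the virtual-display recording step, the following Python snippet starts a headless X server (Xvfb) and records it with FFmpeg's x11grab input, taking audio from an ALSA device. The display number, resolution, duration and ALSA device name are assumptions; the toolchain described above wraps this kind of step in Docker and adds quality monitoring (frame rate, resolution, bit rate) on top.

```python
import subprocess
import time

DISPLAY = ":99"        # assumed virtual display number
SIZE = "1920x1080"     # assumed capture resolution

# Start a headless X server to host the playback browser.
xvfb = subprocess.Popen(["Xvfb", DISPLAY, "-screen", "0", f"{SIZE}x24"])
time.sleep(2)  # give the virtual display time to come up

# A browser playing the SVOD content would be launched here with DISPLAY=:99.

# Record the virtual display with FFmpeg; the ALSA device depends on the loopback setup.
ffmpeg = subprocess.Popen([
    "ffmpeg",
    "-f", "x11grab", "-video_size", SIZE, "-framerate", "25", "-i", f"{DISPLAY}.0",
    "-f", "alsa", "-i", "default",
    "-c:v", "libx264", "-preset", "veryfast",
    "capture.mp4",
])

time.sleep(60)          # record for one minute in this toy example
ffmpeg.terminate()
ffmpeg.wait()
xvfb.terminate()
```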
2:15pm - 3:40pm | PANEL #03: Cross-Institutional Collaboration: the End of Term Archive Location: Slottsbiblioteket (ground floor) Session Chair: Jeffrey van der Hoeven, National Library of the Netherlands (KB) |
|
Coordinating, Capturing, and Curating the 2024 United States End of Term Web Archive 1University of North Texas, United States of America; 2Internet Archive, United States of America; 3Stanford University, United States of America; 4Webrecorder, United States of America |
3:40pm - 4:10pm | BREAK Location: Folkestova (upstairs) |
4:10pm - 5:05pm | Closing Keynote: Quantifying Complexity: Using Web Data to Decode Online Public Debate Location: Målstova (upstairs) Session Chair: Jon Carlstedt Tønnessen, National Library of Norway Streamed to Store Auditorium. |
5:05pm - 5:30pm | Closing Remarks: Closing Remarks Location: Målstova (upstairs) Streamed to Store Auditorium. |