JavaScript is Disabled
Your browser's JavaScript functionality is disabled. It has to be enabled to use this function of ConfTool.
Here you can find information on how to enable JavaScript
If you have any problems, please contact the organizers at events@netpreserve.org.

Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only sessions at that day or location. Please select a single session for detailed view.

Daily Overview

Session

POSTER SLAM

Time:

Tuesday, 21/Apr/2026:

3:30pm - 4:00pm

Session Chair: Olga Holownia, IIPC

Location: AUDITORIUM [-2]

Floor: -2 [Ground Floor | Main entrance]

Presentations

A survey on data-access methods for an open web archive

Pedro Ortiz Suarez, Laurie Burchell

Common Crawl Foundation, France

With the ever growing interest in web data and web archives being driven by Large Language Models (LLMs), Artificial Intelligence (AI) and Retrieval-Augmented Generation (RAG), web archivists managing open repositories are faced with an unprecedented volume of download and requests. Given that web archiving infrastructures are sometimes constrained in resources, the increased traffic has made it difficult to serve and fulfill all of these incoming requests properly without saturating the infrastructure. This problem is compounded by users who employ far too aggressive retry policies, often unknowingly, when they try to access open archives.

To deal with these issues in relation to our own archives, we introduced an official, open-source tool over a year ago to facilitate sustainable user access. We developed it to be cross-platform, dependency-free and user-friendly to ensure easy adoption by the community. It implements and supports polite retry-strategies like exponential backoff and jitter, while also allowing for parallelization.

In this talk, we present the results of a comprehensive study into the impact of our new tool over the span of a year on how users have been accessing our archive and how this impacts our infrastructure. We use a defined standard user agent for our official tool to track usage, and investigate how our tool has been adopted throughout time, and if its introduction has simplified access to our open web archives for users. We also compare our official tool to other standard access methods employed by our users and study how the introduction of a polite access tool has impacted the load of our infrastructure. Finally, we propose some strategies that other web archiving institutions can use to simplify access to their archives, providing users with polite tooling inspired by our findings and allowing them to reduce the load in their infrastructures.

Linkra – application for archiving and creating citations of web resources in scientific texts

Václav Dragoun, Marie Haškovcová, Markéta Hrdličková, Luboš Svoboda

National Library of the Czech Republic, Czech Republic

A newly developed archiving and citation service Linkra is designed to store web resources cited in scientific and professional texts. It addresses the problem of link rot – the loss of referenced web content that threatens their credibility. The application allows users to save cited resources to a web archive, obtain archive URLs, and create citation records. In addition to preserving cited resources, it encourages researchers to include archival copies in their academic citations in accordance with the ISO 690 standard.

The application uses a harvesting method based on the open-source Scoop tool, allowing fast access to archived data. Working with the application involves several steps. Users first insert the web sources they want to preserve into the application, which starts the harvesting process. They then receive a unique address through which they can return to their assignment. After the harvesting is complete, they receive shortened URLs that will redirect to archived copies after indexing. Finally, they can use the built-in generator to prepare citations of web sources for publication in professional texts. They can either use pre-prepared templates designed according to common citation standards or create their own, for example according to the specific requirements of a professional journal. Citation records prepared in this way can be exported in bulk.

The Linkra application is being developed as part of institutional research as an open-source tool. It was preceded by research focused on disappearing web content and on the possibilities of citing web resources and their archive versions. The aim of the application is to preserve the sources of scientific works while also expanding the existing acquisition strategies of the web archive of the National Library of the Czech Republic. As part of the poster presentation, we will introduce the goals of our project, describe the technical solution, discuss the challenges encountered during development, and demonstrate how to use the application.

Application of AI at Social Media Archiving in the National Library of China

Shiyan JI, Danyan Zhao, Qian Sun

National Library of China, China, People's Republic of

The evolution of Artificial Intelligence (AI) has offered a new paradigm for web archiving. Based on over two decades of practical experience, our library is actively exploring the innovative application of AI and AI-Agents across all stages of the archiving, preservation, and management processes. Practice has demonstrated that our library has achieved successful outcomes in applying AI technology to social media archiving, and has made breakthrough progress in identifying archiving targets, analyzing archiving content, and cataloging metadata by utilizing the DeepSeek large model.

Our library has expanded the scope of web archiving, focusing on the social media archiving (articles published on WeChat official accounts), the deepseek-r1:14B model is used to assist in determining the archiving targets, filtering reasonable search results according to specified search conditions, and automatically extracting the titles and URL addresses of WeChat articles to be crawled. Combining the learning, understanding, and analysis capabilities of AI, it assists in the full-text analysis of crawled WeChat articles. Based on the cataloging results of historical articles and through multiple rounds of optimization and training of the AI model, the AI has ultimately achieved precise description of key information such as full-text summaries, keywords, and data sources of Wechat articles. The application of AI provides an effective tool for the web archiving.

Archiving and Analyzing YouTube Recommendations during the Paris 2024 Olympic Games

Yvette Assilaméhou-Kunz¹, Michaël Attali², Dorothée Benhamou-Suesser³, Amélia Ferreira^3,6, Erwan Le Merrer⁴, Julien Mésangeau⁵, Gilles Tredan⁶

¹Université Sorbonne-Nouvelle; ²Université Rennes 2; ³Bibliothèque nationale de France, France; ⁴Inria, Rennes; ⁵Université de Lille; ⁶Laboratory for Analysis and Architecture of Systems – CNRS

Though profoundly shaping and personalizing our experiences of the web and our access to information, algorithmic recommendations remain largely absent from institutional web archives, raising critical questions about how to capture and preserve a long-term record of algorithmic activity. This poster presents the preliminary results of a multidisciplinary research project that brings together a national library and experts from computer science, information science, social psychology, and sports history. The project’s goal is to capture and analyze the videos recommended by YouTube’s algorithm to different user profiles during the Paris 2024 Olympic Games, in order to determine whether these algorithmic recommendations reflect different narratives or perspectives on the Olympics, and whether they promote distinct values related to sports and the Olympic spirit. This poster will outline the initial findings of this exploratory approach, including the methodology and the resulting dataset. Using bots with diverse browsing histories, we collected over 21 million video recommendations across 19 user archetype profiles over a 45-day period. We complemented this approach by constituting a corpus of 18k videos related to the Paris Games published during the events and monitored daily from the time of their publication. We refer to this as an "objective corpus", which we used as a reference to analyze the personalized recommendation datasets. We will present preliminary quantitative insights from the data collected, in particular by focusing on recommendations of videos from our "objective corpus". We found considerable variations of the bots exposure to corpus videos depending on their profile; in particular, bots with a media consumption are more exposed than bots with a sport consumption, which might appear surprising given the nature of the event. We will share the first results from a qualitative analysis of the subjective representations associated with the “Paris Olympics” event in the most frequently recommended videos. We analyzed variations in values expressed in these videos to compare different personalization regimes and value systems. Finally, we aim to spark a discussion on several open questions: How can such a large dataset be preserved and made accessible? How to construct a "representative" personalization ? How might algorithmic recommendations be integrated into existing web archiving practices, and how can their capture be developed into a reproducible and sustainable process? Can these recommendations help build an archive that reflects diverse perspectives on the same event?

Detecting and managing challenging web crawls at scale

Chris Doyle

MirrorWeb Limited, United Kingdom

Web archiving at scale presents significant operational challenges, particularly in identifying crawls that deviate from expected behaviour. Whilst standard monitoring systems report binary "running" or "stopped" states, they fail to detect more subtle problems: crawls that exceed their intended scope, enter infinite loops on dynamic content, or silently stall whilst appearing active. By the time such issues are manually identified, substantial computational resources have been consumed, and service level agreements may be compromised.

This poster presents [REDACTED]; a proactive monitoring application developed to address these detection gaps. The system leverages historical crawl data to establish profile-based performance baselines for different crawl configurations. By continuously comparing current crawl duration against expected averages, the application automatically flags potentially problematic crawls for investigation before they escalate into resource-intensive failures.

The application integrates multiple data sources including AWS EC2 instance metadata, MySQL profile databases, Redis queue systems, and Heritrix REST API endpoints. When a crawl exceeds its baseline duration, the system gathers comprehensive diagnostics: status, queue metrics, actively processing URLs, and recent log entries. This diagnostic information is automatically posted to associated ticketing systems with stakeholder notifications, enabling rapid response.

Operational deployment has demonstrated significant benefits including early problem detection (hours rather than days), reduced manual oversight requirements, improved response times through automated stakeholder notification, and enhanced organisational knowledge capture through documented diagnostics. The profile-based approach proves particularly effective for organisations managing diverse crawl types across multiple clients, where manual monitoring becomes impractical.

This work highlights the importance of monitoring strategies that extend beyond simple status checks. As web archiving operations scale, institutions require intelligent detection mechanisms that understand normal crawl behaviour and can identify deviations before they impact service delivery. The poster will demonstrate the system's architecture, detection methodology, and practical implementation considerations for institutions seeking to enhance their crawl monitoring capabilities.

Mapping duplicate images in a web archive using perceptual hashing

Marie Roald

National Library of Norway, Norway

Images have been part of the web since its early beginnings [1] and today most webpages have some form of image content. Since the early 2000s, the National Library of Norway has harvested web data from the Norwegian top-level domain, storing time-stamped records of web content, including text, audio, video and images. A large portion of the stored data is images and finding ways to sort through the images, link together related images and remove duplicates is crucial for researchers to be able to find what they are looking for.

Image files spread quickly online. The same image can be downloaded multiple times and reuploaded to different websites. As a result, duplicates of an image can be hosted at multiple domains and the link between the image instances is not always preserved in the process. Further, as content management services often compress and resize images automatically upon upload, instances of the same image might also exist with different sizes or compression levels which means that they are different at the byte level.

This poster will present our ongoing work and preliminary results from a deduplication study to detect duplicate images in a web archive. By using perceptual hashing algorithms [2,3], we detect and flag perceptual duplicates in a subset of the archived data. Moreover, to estimate the performance of this perceptual hashing algorithm, we evaluate the detection accuracy for several simulated image degradation transforms. Similarly, we use pixel-level comparison on a random subset of the images to probe the hashing algorithm for false positives.

Our initial findings suggest this approach is promising and has two potential benefits:

1) Allowing scholars to track the use and reuse of an image across multiple pages.

2) Reducing unnecessary computation, if two files represent the same image with only minor differences in resolution or compression, there is no need to perform expensive computation twice.

We will present our work so far, what lessons we have learned and how these lessons will inform how the National Library of Norway processes and disseminates web archive image data in the future.

[1]: Tim Berners-Lee and Mark Fischetti. 1999. Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web. HarperCollins Publishers.

[2]: Farid, H. 2021. An Overview of Perceptual Hashing. Journal of Online Trust and Safety. 1, 1 (Oct. 2021). DOI:https://doi.org/10.54501/jots.v1i1.24.

[3]: Meta 2019. The TMK+PDQF Video-Hashing Algorithm and the PDQ Image-Hashing Algorithm. https://github.com/facebook/ThreatExchange/blob/main/hashing/hashing.pdf (Retrieved 2025-10-13)

Migration of Croatia's Web Archive's selective web harvesting system: transitioning to sustainable and interoperable solutions

Anamarija Ljubek, Inge Rudomino

National and University Library in Zagreb, Croatia

Preserving online publications presents growing challenges due to the increasing volume of digital content, rapid technological change, and the need to ensure compatibility with international web archiving initiatives. The current system used for selective web harvesting has reached both infrastructural and functional limitations, which prompted a shift toward a more modern and sustainable solution.

The proposed approach focuses on migrating the existing selective web archiving system to the Web Curator Tool (WCT), an open-source platform designed for managing complex web harvesting and curation workflows. This migration entails comprehensive technical and functional transformations, including the conversion of harvesting parameters, metadata migration, and reconfiguration of harvesting schedules to accommodate new system capabilities.

In preparation for the migration, archived publications are being thoroughly assessed to determine appropriate capture frequencies, verify the quality and integrity of harvested instances, and identify materials unsuitable for migration—such as publications not available in standard HTML format. This careful evaluation ensures that only relevant content is retained in the new web archiving system.

The poster will outline the advantages of adopting standardized and widely supported tool, such as improved scalability, interoperability, and alignment with international web archiving best practices. It will also address potential challenges, including the need for significant resources during the migration process and the potential loss of certain legacy functionalities that cannot be replicated in the new environment.

The overall goal is to establish a sustainable, scalable, and interoperable selective web archiving system that ensures the long-term preservation and accessibility of the nation’s online publishing heritage.

Using data to challenge negativity bias in quality assurance workflows

Abbie Grotke, Amanda Lehman, Carly Boerrigter

Library of Congress, United States of America

This poster will describe how institutional staff are disproving a common expectation of poor results in web archives. Our institution includes over 5 PBs of data for event-based and thematic collections, however a hyper-targeted capture remediation approach by the quality assurance (QA) team leads to perceptions of low success and high failure rates of captured content. The team is motivated to find a sustainable workflow that balances large-scale quality assurance data and individualized attention to specific captures to glean a clearer, more positive image of web archive capture health.

The poster will touch on staff's ongoing developments to make their quality assurance workflow sustainable. It will also briefly discuss how we process the data gathered through this workflow. Data from a standardized qualitative rubric for capture assessment indicates that a majority of captures are successful. This rubric is based on correspondence of the live site and web archive browsing experience, following criteria developed by Dr. Brenda Reyes-Ayala (1). Priority captures for remediation and triage by the quality assurance team are indicated by scoring on the rubric.

Scoring data and major categorical issues from this rubric are then visualized in Tableau and reveal a range of positive and negative capture assessments. This Tableau data is critical as our QA staff time is focused on troubleshooting the negative assessments. New visibility of positive assessments through the Tableau dashboard highlights value within the collections and builds team morale in a challenging QA environment. The positive data allowed the team to update the QA workflow to funnel only the high-priority, actionable assessments through the process.

Through data collection and visualization, we hope we can better understand and manage the myriad collections at our large collecting institution. Using data to communicate a transparent understanding of crawl health could help onboard new staff and support a morale boost for staff performing quality assurance work long-term. This poster shares steps on our journey towards stable and enduring web archiving capture assessment and remediation work.

(1) "Correspondence as the primary measure of quality for web archives: A human-centered grounded theory study" International Journal of Digital Libraries, 2022

Sustainable web archiving: a living and participative poster

Emmanuelle Bermès², Valérie Schafer¹

¹C2DH, University of Luxembourg, Luxembourg; ²Ecole Nationale des Chartes - PSL, France

How can we think about the sustainability of web archiving while respecting its vitality, creativity, and diversity of approaches and uses, and while encouraging co-shaping, interdisciplinarity, and participation? During WAC26, which will address this question and provide many answers, our participatory poster will offer an additional tool: a collective, living and co-constructed poster.

The poster is both an installation and a performance running throughout the conference. Rather than a fixed, finished object, it is a living surface that grows day by day, shaped by the contributions of WAC26 participants. It takes the form of one or two long rolls of recycled kraft paper, several meters in length, fixed to a wall or unrolled across a few tables to invite collaborative contributions. On this surface, participants are invited to draw, annotate, question, collage, and connect, challenge, highlight ideas they developed or found interesting and exciting in sessions, using various materials (markers, old magazines, scraps of paper, threads of yarn to create links, etc.). In this way, the poster becomes both a shared reflection space and a creative archive in itself.

We will prepare the very first layers of the installation (if possible with early scholars during the spring school to be held on April 20, just prior to the launch of WAC26) : this will take the form of a partial canvas including a mind map on sustainable web archiving and a set of open questions handwritten on kraft paper. From there, the surface becomes a collective palimpsest: enriched through participants’ sketches, and reactions to conference sessions. This process turns the poster into a space of dialogue and imagination, where sustainability is explored as a social, creative, playful, material, and collaborative practice.

This living poster is at once a reflective poster, a participatory “artwork”, and a sustainable experiment in reimagining how we present and co-construct knowledge in the field of web archiving. Expected outcomes are a process of documentation (photos and notes throughout the event), presentation through a lightning talk and a final blogpost on netpreserve.org, including images of the evolving poster (and eventually audio comments), to preserve and share this experimental form of knowledge-making.

The technologies of an in-house seed handling tool

Mikko Merioksa

National Library of Finland

This poster is an overview of the technologies used in an in-house developed tool that is used to create and manage collections based on harvested online materials, and to automate some harvesting and preservation related tasks. It has been in development since 2018 and is still being updated based on the users' needs.
With the tool, users can collaboratively create collections (called "thematic harvests"), add URLs as different types of seeds (e.g. webpage or YouTube video), and publish the collections for use at the legal deposit workstations throughout the country. The tool is a PHP software created with the CodeIgniter framework. With the PHP app users can create collections, modify comments relating to them, and pull their metadata from a separate cataloguing system.
For handling the adding, modifying and harvesting of URLs, the tool has a separate Javascript frontend that uses the jQuery library for rendering. Seeds can be added one at a time or in bulk, and they always have certain parameters attached to them such as the type of the seed which determines the harvesting tool used, and the depth parameter used to determine how a given web page should be harvested.
Most of the harvesting is done by hand by our technical specialists based on the attributes set in the tool, but some parts of the harvesting are also automated. Simple harvests (e.g. a single page) are sent to a separate python script that launches different tools based on the type of seed given. The system currently supports harvesting web pages by launching Browsertrix containers and YouTube videos using the yt-dlp software. Currently the tool talks to the harvesting script by creating simple JSON input files for the script that contain info about the seed that is to be harvested, such as the URL, the tool that is going to be used to harvest it, and the collection that the harvested items should be added to. A separate file of the output results is created so the results can be displayed for the user in the tool's UI.
After all the seeds of a collection have been succesfully harvested a process of indexing the harvested materials can be started from the tool's UI. This uses a collection of shell scripts that enter info about the materials into a SOLR and OutbackCDX indexes. After the indexing is done the collection can be marked as ready, after which the materials are publicly available.

Virtual Mucem: from web archives to a museum remediation of an ethnological websites collection

Alexandre Faye, Sara Aubry

Bibliothèque nationale de France, France

The Museum of European and Mediterranean Civilizations (Mucem) is a major French ethnology museum located in Marseille. It opened in June 2013, inheriting the collections of the former National Museum of Popular Arts and Traditions (MNATP). This transfer of a national museum to a regional location was the first of its kind in France. The new museum implemented a multidisciplinary project and expanded its collections to the Mediterranean basin by launching new ethnological surveys.

Between its official creation in 2005 and its opening to the public in June 2013, the museum developed an online strategy and launched eight original thematic websites. These websites were editorial projects in their own right and were used as a key means of promoting ethnological collections, researches and surveys. The websites were hosted on the French Ministry of Culture servers and were taken offline at the end of 2020 due to technical obsolescence and an extensive use of Adobe Flash technology. The disappearance led to an awareness of their importance. Some of them offered scientific descriptions of collections, which were more complete than the museum’s databases. Others reflected the museum's new stance on contemporary issues, such as gender, and preceded important exhibitions.

The aim of the Virtual Mucem project carried out in 2024 was to experiment with a form of remediation by using web archives of a national library. The work was both documentary and technical. On one hand, the project team searched local archives and conducted oral surveys with the producers of the websites. On the other hand, a tool has been developed to enable the project team to extract and package the library web archives in order to produce derivative WARC files as complete as possible for each one of the websites. Following these two tasks, which were carried out simultaneously, the project team set up an editorial interface for remediating the websites and integrating the derivative web archives, which can be consulted within the walls of the Mucem's Conservation and Resource Center with a local installation of SolrWayback.

This remediation project has a collegial and experimental dimension. Over the course of a year, it brought together more than fifteen people, including archivists, documentalists, librarians, IT specialists, and historians, as well as curators, ethnologists, and technical teams involved in the production of some of the sites.

This poster will present the challenges and results of this remediation project. First, it will highlight the collaboration between a museum and a national library that can inspire new projects in the future. It will provide information about the process of creating derivative WARCs. Finally, it will question the remediation itself and some of the main issues: technical but also documentary obsolescence of the content, possible deficiency of the web archives, technical choice and network security, public display.

WebData: Building a Research Infrastructure for the Norwegian Web Archive

Jon Carlstedt Tønnessen

National Library of Norway, Norway

Researchers have addressed the need for dedicated research infrastructures to study web archives. In response, the WebData project is building a research infrastructure for the National Library of Norway's web archive, enabling large-scale access to nearly 25 years of archived material. This poster will present the project's status, lessons learned so far, and findings from a needs assessment conducted with a relatively large number of scholars, mapping their needs.[1]

The project started in 2025, with four key objectives:

Build a research platform for searching, exploring and retrieving data
Automatically classify and pseudonymise texts containing (sensitive) personal data
Annotate data in order to provide analytical services,
Develop the infrastructure in close collaboration with the research community through needs and representation studies. [2]

Further, the poster will present findings from surveying researchers’ needs within four areas: a) access, b) interfaces and functionality, c) data and d) metadata. In addition to sharing scholarly needs, we examine how we plan to address this over the next 4 years. This involves traditional rule-based programming, identifying specific attributes in archived items, as well as machine-learning-based systems to enrich WARC data with additional metadata.

The WebData consortium is led by the National Library of Norway, with the Norwegian Computing Center, University of Oslo and University of Tromsø as partners. Project development runs until 2029, while the infrastructure will operate until at least 2035. The project is funded by the Research Council of Norway.

[1]: Brügger, N. (2021): ‘The Need for Research Infrastructures for the Study of Web Archives’. In The Past Web: Exploring Web Archives, edited by Daniel Gomes, et al. Springer International Publishing. https://doi.org/10.1007/978-3-030-63291-5_17; “About WebData” (2025), WebData.

[2]: https://webdata.nb.no

Doing humanities with web archiving: an oral history of web archiving practices in academia and the making of digital culture

Sophie Gebeil¹, Véronique Ginouvès²

¹Aix Marseille University, TELEMMe Laboratory, France; ²Mediterranean House of Human and Social Sciences, Aix Marseille University, CNRS, France

Doing Humanities with Web Archiving: An Oral History of Web Archiving Practices in Academia and the Making of Digital Culture

This project stems from the observation of a widening gap between, on the one hand, a small community of researchers and teachers who have developed expertise in web archiving, and, on the other, the vast majority of academics who occasionally need to archive the web as part of their work. The latter often rely on improvised, artisanal solutions to preserve or cite born-digital sources. While the experts are engaged with international initiatives and explore innovative methodologies linked to the digital humanities, most researchers remain unaware of this body of work and continue to “make do,” adjusting their practices as they go.
An historian who completed a PhD based on web archives and an archivist who trains doctoral students in managing their research data have both observed that, despite the field’s vitality, web archiving practices have not yet permeated the broader landscape of research and teaching in the humanities and social sciences. One likely reason is the particularly high entry cost for scholars in these fields: web archiving requires not only solid digital literacy but also the integration of a new theoretical and methodological framework that reshapes our relationship to archives and redefines research methodologies.
The proposed oral history project aims to document and analyze web archiving practices—both expert and improvised—among academics. By focusing on lived experience and everyday practice rather than top-down models, it seeks to contribute to an epistemology of the humanities and social sciences through practice. About fifty interviews will be conducted with individuals who are currently building or have previously built web archive corpora. These testimonies will shed light on how such corpora evolve over time, revealing continuities, shifts in focus, and changes in the objects of study. A heuristic map will guide the interview framework and adapt questions to diverse uses of the web.
Archiving is integral to the project’s logic: each interview will be described in a dedicated archival finding aid, with standardized metadata, keywords, and access criteria defined by the interviewee, who will choose the appropriate level of confidentiality or public release. The resulting corpus will serve as a foundation for cumulative future research on web archiving practices.
Through this collection of oral testimonies, we aim to capture a momentary yet representative picture of how researchers and teachers use and archive the web in their professional and pedagogical activities. Approaching web archives through life stories offers an opportunity to rethink our relationship to digital sources and research methodologies. By archiving these oral histories as research sound archives, in line with open science principles, we hope to make them accessible and reusable, within the boundaries of ethical and legal standards.

How not to build a web archive in two weeks

Shannon Willis

Texas State University, United States of America

In 2025, a university library started a web archive. Getting to this point represented two years of education and advocacy to secure the necessary resources to start a program, aligning web archives with the larger mission and scope of the library. Given limited in-house development support, Archive-It was chosen as the university’s first web archiving tool and the university’s web presence as the first collecting area. Delays in contracts and purchasings resulted in little time to capture seeds before the data budget for the year would be reset. Determined to use as much of the data budget as possible before it expired, and after a self-given crash course in Archive-It, the presenter set out to capture as much as possible in as thoughtful a manner as time would allow before the end of the fiscal year. This poster will explore what went well, what went wrong, and lessons learned from this compressed timeline for starting a web archive. It will consider the work of implementing web archiving best practices, how the library is moving forward to grow a more robust and sustainable web archiving program, and the importance of advocacy and community in supporting institutions and sustaining the work of web archiving.

In addition to doing the internal work of establishing repeatable workflows, refining regular crawl schedules, and considering the long-term preservation needs of their WARC files, the presenter is also actively restarting a regional web archiving interest group to build a local support network that can help foster their own and others' web archiving work in the area. As well, understanding that growth of the web archives will require continued support and increased resources from their institution, they are also leveraging the current attention their web archive has amongst leadership to promote the efforts of the library and advocate for the importance to the university of web archiving and preservation work. Through these efforts, the presenter aims to grow what started as a rough-and-ready little web archive into a sustainable web archiving program, expanding both upon its collecting scope and the archiving technology used. The presenter hopes the poster will prompt conversations around good (and not so good) practices in starting in web archives, successful approaches for advocating for web archiving resources, and the importance of web archiving communities in sustaining the work.

Linking the awesome: Building a Community Knowledge Graph for Web Archiving Resources

Natanael Arndt

German National Library, Germany

Web archiving is a highly technical endeavor involving a lot of tools. These tools are developed by a broad community and mostly as open source software. The open source software development allows participants of the community to exchange tools and improve them in a cooperative and collaborative way. The web archiving is technologically and from a community perspective embedded in the World Wide Web, which as well is mostly based on open source software and open protocols and standards. Likewise in web archives open protocols and standards, like WARC and CDX, play a fundamental role and allow the interoperability of components.

The International Internet Preservation Consortium (IIPC) serves as a hub to foster communication among web archiving institutes, to support the standardization processes and the software development. The “Awesome Web Archiving” list follows the idea of awesome lists (https://awesome.re/). Awesome lists are common on GitHub, maintained as a Markdown document, and provide a low profile accessible index of resources that are relevant for a certain community, contributors are able to suggest new entries as pull-requests. Among others, this involves links to software tools and standard documents. Within the “Awesome Web Archiving” list the entries are assigned to categories, while individual entries can fit into more than on category. The referenced projects are sometimes under vivid development, while others get unmaintained over time. To improve the quality of the “Awesome Web Archiving” list and as such their value for the web archiving community the recency and information richness are relevant factors. The entries in the list are often links to git repositories or projects on GitHub. From these project pages, additional information about the current development status and the self-description of the projects can be gathered.

To interlink the information that are gained through the crowdsourced approach of maintaining an awesome list with the information available at the project pages, linked data is a good and web native format to encode information in a structured way. The SPARQL Anything (https://sparql-anything.readthedocs.io/) tool provides access to Markdown documents (https://sparql-anything.readthedocs.io/stable/formats/Markdown/) with the standardized SPARQL 1.1 Query language (https://www.w3.org/TR/sparql11-query/). With these tools it is possible to create a knowledge graph–The Web Archive Awesome Graph (WAAG)–of information resources relevant to the web archiving community (https://github.com/white-gecko/webarchiving-awesome-graph). This graph can serve as integration point for structured or semi structured contributions to the tool collection, for information enrichment, and to model interconnections between listed resources, such as tools and libraries, and libraries and standards. Finally, the graph's information can be rendered browsed in a graph like manner and can be again rendered to an awesome list document.

The tools involved are still under development and the approach requires discussion within the community. The poster should serve as a catalyst for such a discussion.

Revisiting a statistical approach for measuring Solr query performance

Jørgen Johan Antonsen

National Library of Norway, Norway

Popular in the web archiving community, Solr allows for fast free-text search within a web archive. When working with large indexes, one soon faces the limits of one’s own infrastructure, and query response times increase. At that point, there are many measures that can be taken, so it is useful to know the effects of each measure, or which setting that gives the best performance. This is when having tools for evaluating query performance comes in handy. This poster sheds light on a handy method of measuring and visualizing Solr query performance.

Imagine for instance that you want to improve the query response time of your Solr index, and have a theory that it will help to split a large collection into multiple shards. To qualitatively check if this is the case, it is first important to be aware that a query with few hits typically has a shorter response time than a query with very many hits. It therefore gives insight to check the performance across groups of queries, with say, 10-100 hits, 100-1000 hits, 10K-100K hits and so on.

There is also the question of caching. If a specific query has been made before, the response time is shorter and might give a misleading idea of a Solr instance’ performance. Consequently, one needs to do many queries, which results in a set of valuable statistics. If these tests are run before and after the shard split, the results can be compared and the performance gain becomes very visible.

The method has been used many years ago in presentations on previous IIPC conferences, but does not seem to be actively used today. The presenting organization is currently indexing on new infrastructure, and the method has been very useful in making decisions in this process, which is why we would like to highlight it in this poster.

Sustaining web archiving through instruction

Nicole Greenhouse

New York University Libraries, United States of America

According to the National Digital Stewardship Alliance (NDSA) 2022 Web Archiving Survey, "few organizations dedicate more than one, full-time employee to web archiving." American organizations’ staffing in regards to web archiving have stagnated, with the majority of practitioners of web archiving only working on it a quarter of their professional time, in line with the results from the 2017. With very little staff time devoted to web archiving, building and sustaining a program can be difficult and leaves no room for development in practices in the field. Over the last nine years, conversations around quality assurance, ethics, access and description for web archives have also gone by the way side in the United States in favor of similar conversations around event based collecting and technical developments. But once these events are over, web archiving practitioners are still needed to maintain these collections into the long-term. By providing training and instruction that does not just cover the basics of web archiving, but rather workflows and policies we can build up the knowledge that web archiving is not a “set it and forget it” that needs more than just a staff member 25% of the time. This poster will focus on best practices for training students and professionals in web archiving, including quality assurance, how to use the tools, maintenance, preservation, and access so we can move away from web archiving as an extension of someone’s work and part of a sustainable practice in their institution and a community of practice with more people with expertise to do better and innovating work.

The DOWARC notebook: modelling web archiving artefacts as RDF graphs in Jupyter

Tom Storrar¹, Manuela Pallotto Strickland²

¹The National Archives, UK; ²King's College London, UK

This poster presents a local and small-scale implementation of Semantic technologies in web archiving processes and builds on the research collaboration we conducted in 2024, which delivered the draft version of the DOWARC domain ontology presented in a lighting-talk at IIPC WAC 2025.

To effectively manage the capture of the changes that affect live websites and webpages, web archiving practices lead to the creation of datasets composed of snapshots of web resources. Because each snapshot essentially recaptures the entirety of the archived web data object packaged into WARC files, significant issues of duplication inevitably arise over time, rendering versioning difficult to manage. Furthermore, as each snapshot provides an instantaneous representation of the live web resource captured in a specific moment of its existence, issues of context also arise, particularly with regard to the relationship between different versions of the same resource. Such issues have an impact on the long-term sustainability of web archiving practices and can also affect future reuse of web archives, by engendering contextual ambiguities.

Our research explores affordances of Semantic technologies in tackling versioning and context-related issues in web archiving practices. Although Semantic technologies such as RDF and Linked Data are being implemented by web archives to enrich discovery-of/access-to a web archives’ collections, and/or support distant-reading of primary web resources (e.g., mapping and profiling of web communities), currently they are neither being used to support sustainable versioning and address issues of context, nor are considered useful in tackling the preservation challenges specifically presented by web resources. Our implementation aims to fill this gap and demonstrate the potential effectiveness of Semantic technologies and Knowledge Engineering techniques in providing effective means to automate the mitigation of versioning and context-related ambiguities, over large and dynamic web archived datasets.

The implementation we present processes web archive data in a portable Jupyter environment and visualises it as an RDF graph. Using OS standard tools such as WARCIO and FastWARC, we extract data objects from WARC and CDX files, which we index in a database and provide with URIs. The WARC and CDX objects we then annotate and describe using DOWARC are represented as an interactive network graph.

Our notebook is configured as a sandbox environment, to test and assess affordances and bottlenecks of automation when annotating Real World web archiving artefacts using the DOWARC ontology. By presenting our work to the web archiving and digital preservation community, we would like to gather community feedback on our sandbox implementation, on the specific affordances offered by Semantic technologies that we have demonstrated, but also on the limitations we have encountered and successfully/unsuccessfully tackled. We aim to identify interested institutional partners to further explore scalable implementation of Semantic technologies to support sustainable and accessible archiving and preservation of web content.

Where is Hyves? Preparing hyperlinks for distant reading

Iris Geldermans

KB | National Library of the Netherlands, Netherlands, The

Link graphs, word clouds and keyword search are frequently based on derivative data, but more than often it is unclear how this data was prepared. In this presentation I will argue that because a website is such a container source, it is important as a researcher and as an archiving institution to be clear which data is in the index and how it was pre-processed. Through several collaborative research projects I have found that preprocessing the data, in this example hyperlinks, has a lot of consequences for the subsequent analysis. Being transparent about how data is pre-processed for tooling is therefor important for the academic community.

To illustrate this point I will discuss two use cases: Hyves (a former social media platform) and XS4ALL (one of the first public internet service providers in [COUNTRY]) analysed within the SolrWayback and a custom linkanalysis script. Both platforms have a similar sub domain construction causing them to either disappear from link graphs, or get grouped together into one major node causing the individual websites to disappear. It brings the question forward: what should a singular node be in the link graph? I argue that the level of granularity depends on the research question and the importance to explain to researchers that they should take this into consideration when preforming their research.

It is also important to know which hyperlinks are displayed within the graph. Hyperlinks can be found throughout a website: there is embedded content, anchor hyperlinks but also scripts and fonts. Differentiating the different kinds of hyperlinks within a visualisation is as important as knowing how they are cropped. When a tool or analysis does not differentiate this, the bigger platforms will always come out on top, eclipsing smaller but perhaps more important individual websites, because they have a stake in every type of hyperlink. But more importantly when researching website-networks based on the content of websites requires different hyperlinks, than when researching for example techniques used to build a website.

When visualizing link graphs with these thoughts in mind, you can enhance research results. And working this way can also be used for other elements on websites like text or images. Text, for example, should also be differentiated into header-text, footer text, article text, menu items and so forth. This to bring more meaning to analysis tools and visualisations. Moreover, within a website the text is already coded through html, so why not use this? With this archiving institutions can emphasize to researchers that the website is a container of many types of information and that they should be aware of this. Selecting which part of the website they want to use can enhance their research and should be chosen wisely.