Conference Agenda
Overview and details of the sessions of this conference. Please select a date or location to show only sessions on that day or at that location. Please select a single session for a detailed view.
Session Overview
SHORT TALKS
Presentations
9:20am - 9:30am
Environmentally-friendly digital preservation policies and infrastructure at the National Library of Norway
National Library of Norway, Norway
The National Library of Norway has been a certified environmental “lighthouse” organization since 2015, indicating that it complies with a set of environmentally-friendly criteria. This has required the library to implement and sustain many environmentally-friendly policies, including several related to digital preservation and storage, that may be of interest to the international community. One core aspect of this work is energy efficiency. The library’s digital collections currently total more than 18 petabytes of data. This data is regularly checked for bit rot and is preserved using the 3-2-1 standard of digital preservation, wherein we preserve 3 copies of each file on 2 different storage technologies, with 1 copy stored at a different geographical location. To reduce our energy use in this work, the library uses an energy-efficient technology for our disk systems called MAID (Massive Array of Idle Disks). This storage technology reduces power consumption by allowing disks to spin only when they are in active use, so that most hard drives are kept inactive and turned off to save energy and extend their lifespan. Although it affects application performance during data access, MAID is effective for storing data that is rarely used, such as archival data that does not change and is rarely accessed. This provides almost 60% energy savings. Another aspect of the library’s sustainable data storage practices focuses on data minimization. The library stores material in file types that meet international standards and that can also be compressed to reduce the total volume of information we store, such as the JPEG 2000 file format. Our data is also stored in what is often referred to as a “cold climate” data storage facility.
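The bit-rot checking mentioned above relies on fixity checksums. A minimal sketch of the idea in Python (the function names, chunk size, and choice of SHA-256 are illustrative assumptions, not the library's actual implementation):

```python
import hashlib
from pathlib import Path

def compute_checksum(path: Path, algorithm: str = "sha256") -> str:
    """Compute a fixity checksum ("fingerprint") for a file, reading in chunks."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_fixity(path: Path, stored_checksum: str) -> bool:
    """Bit-rot check: compare the stored checksum against a freshly computed one."""
    return compute_checksum(path) == stored_checksum
```

A preservation system would store the checksum at ingest and re-run the comparison on each scheduled check or retrieval, as the abstract describes.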
The National Library's northern storage site is in Mo i Rana, a city 30 kilometers south of the Arctic Circle, where the storage facilities are built into the side of Mofjellet mountain. For seven months of the year, the monthly average temperature is below 0 degrees Celsius. This stable, cold climate requires less energy to keep the storage servers cool. Finally, the library uses 98% renewable energy, including wind and hydroelectric sources, to maintain this infrastructure. There are still more measures the library can take to improve sustainability in our operations. For example, we soon plan to further optimize our energy use by recycling heat from the data center to warm buildings. Another area of improvement is our file degradation checking, which is not as efficient as it could be. We use checksum technology to check for bit rot: all preserved files are assigned a checksum, or fingerprint, and computing power is needed every time a check is run to confirm that a file has not changed. We compare the stored checksum against the calculated checksum for a file each time it is retrieved from our digital preservation system, but this is processing that could be avoided if we used technology that more effectively maintained the integrity of a file.
9:30am - 9:38am
Environmental Issues on the Web: Building and Promoting a Thematic Archive
National Library of France, France
In 2020, our institution took part in the IIPC Climate Change collaborative collection and drew inspiration from this initiative to set up its own collection on environmental issues. We felt it was essential to include these major issues for our contemporary society in our collections. That is why, since 2020, we have been launching an annual collection entitled ‘Environmental Issues’. The aim of this collection is to highlight expressions, reactions, actions, representations, or reflections relating to environmental issues on the internet. It comprises eight themes, in order to cover the multiple aspects of these issues (scientific, economic, artistic, etc.) as well as the different types of website producers. It currently has more than 800 selections made internally by librarians, as well as by partner libraries in the regions. In this lightning talk, we would like to present this collaborative collection on a national scale, as well as the various initiatives implemented to promote it to the public. In December 2023, we published a thematic and edited selection of archived pages (also known as a “guided tour”) about “The environment on the web”. This tour is divided into 14 themes such as “Issues, Concepts and Theories”, “Biodiversity and Species Extinction”, “Urban Planning and Land Use”, and “Everyday Citizen Action”. As our collections can only be accessed within the research rooms of our library, we have also published on our website the seed list of this collection as well as a version of the tour with screenshots, for which we asked the website owners' authorization. This collection and its promotion are a good example of how we build and develop a thematic collection in our library and how we can help the public better understand the challenges posed by climate change.
9:38am - 9:46am
Storing URLs, targets, and other time-varying entities in a database as a path to sustainable recordkeeping
Hungarian National Museum Public Collection Centre, National Széchényi Library, Hungary
A recurring problem with mass web archiving, e.g., at the top-level-domain scale, is how to record the targeted content and the changes in the associated URL(s) over time. This issue is related to seed list maintenance: in the case of larger harvests, it is necessary to exclude websites that were previously saved but are no longer functional, meaning that there is no longer any content behind a given URL, or it no longer belongs to that website. The lightning talk presents a flexible concept that can be used to manage the relationships between URLs of different structures (with or without the http or https protocol, with or without www), their changes over time, and their connection to the website as an entity. The essence of the solution is an entity-based SQL database that is capable of recording all changes over time in a non-redundant manner by ensuring Third Normal Form (3NF). The main entities stored in the database, such as target and URL, are linked to each other, to themselves, and to tables containing information about them using junction tables. This solution ensures scalability: the information stored about each entity can be expanded arbitrarily, and the 'date_from' and 'date_to' fields in the junction tables can be used to record when the given relations were valid. Linking the entity tables to themselves allows us, for example, to link alternative URLs to each other over time. The information stored about each entity allows for complex queries; for example, in the case of the target, the type (website, web page, file, etc.) is stored in a separate table, as is the status code in the case of URLs.
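The entity-and-junction-table design described above can be sketched with SQLite; apart from the 'date_from' and 'date_to' fields named in the abstract, the table and column names below are illustrative assumptions, not the actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE target (
    target_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    type      TEXT NOT NULL           -- website, web page, file, ...
);
CREATE TABLE url (
    url_id  INTEGER PRIMARY KEY,
    address TEXT NOT NULL UNIQUE      -- with/without protocol or www
);
-- Junction table linking targets to URLs, with a validity interval
CREATE TABLE target_url (
    target_id INTEGER REFERENCES target(target_id),
    url_id    INTEGER REFERENCES url(url_id),
    date_from TEXT NOT NULL,
    date_to   TEXT                    -- NULL = relation still valid
);
""")

def urls_for_target(conn, target_id, on_date):
    """Which URL(s) belonged to a given target during a given period?"""
    rows = conn.execute("""
        SELECT u.address FROM url u
        JOIN target_url tu ON tu.url_id = u.url_id
        WHERE tu.target_id = ?
          AND tu.date_from <= ?
          AND (tu.date_to IS NULL OR tu.date_to >= ?)
    """, (target_id, on_date, on_date)).fetchall()
    return [r[0] for r in rows]
```

Because each fact (target, URL, relation validity) lives in exactly one table, changes over time can be recorded without redundancy, as 3NF requires.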
The junction tables also ensure that changes over time are recorded, so that, for example, it is possible to query which URL belonged to a given entity (e.g., a file on a website) during a given period. All this contributes greatly to sustainability, as it provides a much more economical, easier-to-use, and more flexible query solution than previous data storage methods, such as Google Sheets spreadsheets.
9:46am - 9:54am
Web archiving automation at the Mexico Digital Preservation Group: error assessment and quality control
National Library of Mexico, Mexico; Digital Preservation Group, Mexico
In Mexico, progress continues to be made in web archiving, which has become a fundamental strategy for preserving digital heritage, especially given the volatile and ephemeral nature of online content. In this context, the Digital Preservation Group of Mexico (GPD) has experimented with an automated web archiving system to capture, store, and preserve digital resources relevant to the country's collective memory. This study focuses on detecting errors during the capture process and on the strategies applied to ensure the quality of the resulting archives. Using an empirical, applied approach that combines observation and experimentation to address practical problems, the automated tool Browsertrix (from Webrecorder) was used, along with systematic reviews of the files generated in WARC format. Twenty-four websites were captured in 2025, including catalogs, databases, and repositories. The analysis focused on the frequency, type, and cause of detected errors (e.g., broken links, missing sitemaps, uncaptured dynamic content, JavaScript issues, or multimedia format problems) and on the effectiveness of the applied quality control mechanisms. The results reveal that while automation allows for a significant increase in archiving coverage, it also introduces considerable technical challenges, which we will discuss in the lightning talk. Recurring error patterns were identified, linked to highly dynamic sites with complex structures, highlighting the need for specialized configurations and iterative validation processes. The importance of establishing contextualized quality criteria, beyond purely technical parameters, is also discussed, integrating aspects of cultural, institutional, and legal relevance.
The lightning talk concludes with a series of practical recommendations for similar projects in Latin American contexts, emphasizing the importance of a flexible technical infrastructure, automated monitoring capabilities, and a clear policy for collaborative digital preservation. This work contributes to the development of standards and best practices for institutional web archiving in the region, and opens the door to future research on the automated curation and preservation of emerging content such as social networks, alternative media, and ephemeral resources.
9:54am - 10:02am
Sustainable and systematic: building a search index of research and practice in web archiving and digital preservation
Digital Preservation Coalition, United Kingdom; IIPC, United States of America; Cartlann Digital Services, Ireland
Over the years, through events such as the IIPC Web Archiving Conference, iPRES (the International Conference on Digital Preservation), and various collaborative projects, the digital preservation and web archiving communities have built an extensive repository of knowledge. However, a persistent challenge has been to provide a single, citable point of access to these dispersed resources. Our project introduces the Awesome Indexer [1], which brings together digital preservation and web archiving resources into a single search interface and database. Our key argument is that centralised discovery is crucial for the long-term sustainability of these resources, encouraging reuse of and investment in those resources rather than attempting to replace them. This tool works by accepting a range of standardised bookmark and bibliographic sources, such as Awesome Lists, Zotero [2], and Zenodo collections. Zotero is a particularly powerful source, as the established tools and workflows around Zotero collection management make it easy to pull in records from a wide range of sources, from traditional publisher websites through to YouTube playlists and content hosted by digital libraries [3]. The Awesome Indexer combines the data from these sources to generate a dedicated faceted search system, built using off-the-shelf tools and packaged as a simple static website. It also creates SQLite and Apache Parquet versions of the same data, allowing richer exploration and analysis of the sources in the index. The Indexer is an open-source tool that can be used by anyone to build their own index. This “work-in-progress” short talk will briefly trace the development of the Indexer, detailing the steps it required and the challenges posed by its underlying resources.
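The consolidation idea (merging records from several bookmark and bibliographic sources into one searchable SQLite database) can be illustrated with a minimal Python sketch; the record fields and the facet on 'source' are assumptions for illustration, not the Awesome Indexer's actual data model:

```python
import sqlite3

def build_index(records):
    """Merge records from multiple bookmark/bibliographic sources into one
    SQLite table: a simplified sketch of the consolidation idea."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (title TEXT, source TEXT, url TEXT)")
    conn.executemany("INSERT INTO docs VALUES (?, ?, ?)",
                     [(r["title"], r["source"], r["url"]) for r in records])
    return conn

def search(conn, keyword, source=None):
    """Keyword search with an optional 'source' facet filter."""
    sql = "SELECT title FROM docs WHERE lower(title) LIKE ?"
    params = [f"%{keyword.lower()}%"]
    if source is not None:
        sql += " AND source = ?"
        params.append(source)
    return [row[0] for row in conn.execute(sql, params)]
```

The same merged table could then be exported as the SQLite and Parquet artefacts the abstract mentions, or fed to a static faceted-search front end.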
The current version of the Digital Preservation Publications Index (DPPI) will be demonstrated to highlight how the Indexer consolidates decades of content from across multiple platforms into a single, comprehensive entry point. This significantly improves discoverability, facilitates citation, contributes to training, and maximises the impact of our collective knowledge for practitioners and researchers.
References
[3] An example of a web archiving collection hosted by the University of North Texas Digital Library: https://digital.library.unt.edu/explore/partners/IIPC/
10:02am - 10:10am
Querying the archived web with an AI assistant
Aarhus University, Denmark; Macquarie University, Australia
The archived web is an indescribably rich primary source for contemporary history. However, only a handful of historians have started including the archived web in their source material when investigating phenomena from the 1990s and 2000s (Mackinnon, 2022; Millward, 2025; Winters, 2017). This lightning talk presents exploratory work on discovering and exploring content from web archives through an *AI Research Assistant* and research questions from the discipline of history.
10:10am - 10:18am
Online annotation platform for web archives
Arquivo.pt, Portugal
Search engine evaluation relies heavily on high-quality test collections that reflect user information needs and relevance judgments. However, building such collections is resource-intensive, requiring systematic annotation of queries and results. The service is a web-based platform designed to streamline this process by enabling the annotation of search engine results in a user-friendly and collaborative environment. The tool allows assessors to annotate retrieved documents according to predefined relevance criteria, supporting the creation of standardized datasets for training, tuning, and benchmarking retrieval models. Our web archive is a research infrastructure that provides tools to preserve and exploit data from the web to meet the needs of scientists and ordinary citizens, and our mission is to provide digital infrastructures to support the academic and scientific community. However, until now, our web archive has focused on collecting data from websites hosted under the .PT domain, which is not enough to guarantee the preservation of relevant content for the academic and scientific community. Our web archive provides a “Google-like” service that enables searching pages and images collected from the web since the 1990s. Note that web archive search complements live-web search engines because it enables temporal search over information that is no longer available online on its original websites. Developed within the context of our web archive, the service facilitates the generation of reliable ground-truth data, while remaining adaptable to different domains and languages. By lowering the barriers to annotation, this platform contributes to the reproducibility, scalability, and improvement of search technologies.
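Relevance judgments of the kind such a platform collects are conventionally exported in TREC-style "qrels" form for benchmarking retrieval models. A minimal sketch (the field names and the 0-2 relevance scale are illustrative assumptions, not the platform's actual format):

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    """One assessor's relevance judgment for a (query, document) pair."""
    query_id: str
    document_url: str
    relevance: int  # e.g. 0 = not relevant, 1 = partially relevant, 2 = relevant

def to_qrels(judgments):
    """Serialize judgments as TREC-style qrels lines: 'query_id 0 doc relevance'."""
    return "\n".join(
        f"{j.query_id} 0 {j.document_url} {j.relevance}" for j in judgments
    )

def precision_at_k(judgments, ranked_urls, k, threshold=1):
    """Fraction of the top-k results judged at or above the relevance threshold."""
    relevant = {j.document_url for j in judgments if j.relevance >= threshold}
    return sum(1 for url in ranked_urls[:k] if url in relevant) / k
```

A ground-truth dataset in this shape is what makes benchmarking and tuning of web-archive retrieval models reproducible.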
The main objective is to provide, in the future, a dataset with public access to support researchers. This contributes to comparing users' search behavior between live-web and web-archive search engines.
10:18am - 10:26am
Warc School - fellowship & training program update
College of Wooster Libraries, United States of America; Shift Collective
Archiving the Black Web was founded on a commitment to create pathways for underrepresented voices and marginalized communities to access web archiving skills, knowledge, and networks. Our work addresses not only “ensuring equitable access to archived web content,” but also ensuring equitable access to who gets to participate in the practice of web archiving and what gets privileged to be part of a web archive collection. At IIPC WAC 2024, Archiving the Black Web shared details about our project’s efforts to reduce these disparities with the upcoming launch of our fellowship and training program, Warc School. Developed for memory workers dedicated to collecting and preserving Black history and culture online, the fellowship offers web archiving training to enhance their memory work or digital content creation practice. In April 2025, Warc School welcomed 22 fellows representing traditional archives, community-based archives, Historically Black Colleges and Universities, public libraries, and independent scholars and creators to complete our 10-month training program, which includes five courses and a practicum. In this session, join Archiving the Black Web for a brief update on lessons learned while developing a training program and its curriculum and recruiting fellows and faculty, as well as highlights from student practicum projects. Attendees will also hear about our new initiative to strengthen social sustainability, with details about the launch of our second cohort. This cohort will include fellowship opportunities not only for memory workers but also for journalists at Black newspapers interested in digital preservation through web archiving training. Information integrity and ethical considerations related to artificial intelligence will be incorporated into the 2026 Warc School curriculum.
10:26am - 10:34am
Organizing the 'Social Mess': a comprehensive Tool for Social Media and Instant Messaging Archiving
University of Pavia, Italy; University of Bologna, Italy
The exponential growth of digital content through social media and instant messaging platforms presents critical challenges for digital preservation. Born-digital communications—created in fragmented, proprietary environments where personal and public spheres overlap—remain largely excluded from systematic archival practices despite their historical and cultural significance. Within the national archival context, there are no comprehensive tools to preserve and manage these materials for individuals, institutions, or public figures whose digital traces hold substantial value for future research. This gap affects personal archives of political and institutional figures and collections of broader cultural relevance. As part of a collaborative research initiative on preserving contemporary digital archives, we are developing a software tool for individual users and institutional archivists. This collaborative effort, which draws on our professional experience, highlights an urgent need to address technical and methodological shortcomings in this field. Existing tools—typically command-line utilities or platform-specific applications—allow for the separate management of content from social media, messaging services, email, etc., but do not provide integrated support within a unified solution. Our framework, in contrast, is comprehensive in its capacity to manage the complete spectrum of digital materials: traditional files alongside social media content, instant messages, and emails within a unified environment. This comprehensive approach addresses the complexity of contemporary digital archives. The software enables users to reorganize their materials systematically, making it valuable for a variety of contexts.
Use cases include individuals managing personal digital heritage, prominent figures preparing materials for donation, and institutions controlling and facilitating access to collections. Our Java-based solution integrates core modules, ensuring usability and data integrity. Operating through manual download and ingest processes rather than APIs, it provides user control while supporting standard formats (JSON, CSV) for interoperability. The embedded database and exclusive use of open-source libraries enable platform-independent installation without external dependencies. Key functionalities include AES-256 encryption, automatic backups, metadata extraction, device synchronization, and granular permissions. Critically, access settings apply at both the file and individual-message levels, which is essential for managing diverse privacy requirements and enabling selective disclosure within complex digital collections. Currently under active development, the project aims to support institutions in visualizing and managing heterogeneous digital materials, to enhance accessibility for researchers through reorganization and categorization tools, and to foster inter-institutional collaboration. This session will provide participants, particularly archivists and records managers, with an overview of a collaborative project and its outcomes, highlighting an integrated approach that offers significant advancements for digital preservation practice and academic scholarship.
10:34am - 10:42am
Social media archiving, right now
Digital Preservation Coalition, United Kingdom
As funding cuts bite, some organisations have had to shut down offices and services at very short notice. These closures put history at risk, especially where social media is concerned. Interactions with patrons and the wider public are a crucial part of the function of any modern organisation, and their content, comments, and context are important historical records. These should not be lost simply because funding has been pulled at short notice. Unfortunately, in situations like this, already cash-strapped archives are rapidly swamped and struggling to cope with the deluge of digital records and requests for assistance. The individuals with access to the social media accounts are often not the archivists themselves, nor do they have the archival or technical skills required to archive things alone. Short of time and resources, what should they do? And with little hope of booming budgets anytime soon, what are the most sustainable approaches for the safekeeping of these complex records? This presentation will share a wide range of lessons learned while attempting to assist organisations as they rush to capture what they can from Facebook, Instagram, LinkedIn, X/Twitter, and Flickr. The investigation considered and experimented with a range of strategies, including direct web archiving, API access, third-party archiving services, and data exports, combined with tools like Browsertrix, ArchiveBox, and wget. The advantages and limitations of these approaches will be explored and compared, highlighting the gaps between what is possible and what is practical in the context of an urgent shutdown operation.
