Conference Agenda
Overview and details of the sessions of this conference.
Session Overview
Session: SHORT TALKS
Presentations
11:30am - 11:41am
A Toolbox to foster Web Archives Use and Reuse
National Library of France, France
Web Archives represent an immense reservoir of data, with diverse and evolving possibilities for use and reuse that will undoubtedly continue to grow in the coming decades. As a national library, over the past 10 years we have received a wide variety of requests, particularly for extracting, recovering, and replaying web-archived materials for research, institutional, and personal use. These requests have led us to develop a range of services and a set of tools. We will focus on three real-life use cases and the technical solutions we have developed to answer the needs of:
Our presentation will cover how we progressively developed and consolidated tools, moving from specific user needs and questions to a generic and sustainable toolbox that extracts and transforms archived data and websites into various formats (metadata, HTML, text, images) and various outputs (file lists, tree structures, or derivative WARC files).
11:41am - 11:49am
Constructing and sharing historical web link graphs from web archives
Arquivo.pt, Portugal
At our organisation, we have been developing a new text search platform based on Apache Solr to replace our legacy system, which depends on outdated and unsupported technologies. As part of this major upgrade, we undertook the task of reindexing all archived collections to align with the new, more flexible indexing schema. This large-scale reindexing effort provided us with a unique opportunity: the chance to extract additional insights from our historical web data. In particular, we focused on capturing link relationships between webpages. From this process, we generated and published a dataset of web link graphs that document the structure of hyperlinks across a significant portion of the web as preserved by our web archive. The published dataset contains information on over 139 million webpage URLs, and the collections chosen for this dataset range from 1996 to 2021, allowing researchers to study the evolution of webgraphs over time. This type of data can be particularly valuable for researchers in areas such as web science, digital preservation, search engine technology, and network analysis. Furthermore, the code used to generate this dataset has been made publicly available. This allows others to apply the same approach to their own web archives and produce comparable link graph datasets from their WARC files. We believe this makes our work a reusable and extensible contribution to the web archiving and research communities.
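As a rough illustration of what capturing link relationships between webpages involves, the sketch below extracts hyperlinks from archived HTML and builds a small adjacency list. It is not the code published by the authors: the function and class names (`LinkExtractor`, `extract_edges`, `build_link_graph`) and the sample pages are hypothetical, and the pages are supplied as in-memory strings rather than read from WARC files, which is assumed to happen in a separate step.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_edges(page_url, html):
    """Return sorted (source, target) pairs for one archived page."""
    parser = LinkExtractor()
    parser.feed(html)
    edges = set()
    for href in parser.hrefs:
        # Resolve relative links against the page URL and drop #fragments.
        target, _fragment = urldefrag(urljoin(page_url, href))
        # Keep only web links; skip mailto:, javascript:, and self-links.
        if target.startswith(("http://", "https://")) and target != page_url:
            edges.add((page_url, target))
    return sorted(edges)


def build_link_graph(pages):
    """Map each page URL to the sorted list of URLs it links to."""
    return {
        url: [target for _source, target in extract_edges(url, html)]
        for url, html in pages.items()
    }


# Hypothetical sample input: URL -> HTML payload of the archived capture.
SAMPLE_PAGES = {
    "http://example.org/": (
        '<a href="/about.html">About</a> '
        '<a href="http://example.com/#top">External</a>'
    ),
    "http://example.org/about.html": (
        '<a href="./">Home</a> <a href="mailto:x@example.org">Mail</a>'
    ),
}

graph = build_link_graph(SAMPLE_PAGES)
for source, targets in graph.items():
    print(source, "->", targets)
```

A real pipeline would also record the capture timestamp of each page so that graphs from different years can be compared; that detail is omitted here.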
In this lightning talk we aim to provide an overview of how the dataset was created and of the structure and format of the data itself.
11:49am - 11:57am
Lossy and porous archives: Sustainability and collaborative models of LAC and the Internet Archive
University of Copenhagen, Denmark
Since 2024, Library and Archives Canada (LAC) and the Internet Archive (IA) have partnered to digitize and scan up to 80,000 out-of-copyright Canadian publications. Six “Scribes” workstations, created by the Internet Archive, were installed in LAC’s Gatineau facility and are run by LAC staff (Library and Archives Canada, 2025; Internet Archive Canada, 2024). This co-created project reflects the porous boundaries of democratic digital knowledge ecosystems. This paper compares the sustainability models of LAC and the IA, drawing on the IA’s digital resources and on interviews with Library and Archives Canada. It presents a brief overview of mandates and of accountability to publics versus donors, and compares the overlap and (in)dependence of national and transnational digital archiving. The analysis draws on theories of data loss to engage with the porous and lossy boundaries of digital memory infrastructure. Both IA and LAC have gaps and absences, but their losses result in different absences and silences. Both the IA and LAC are infrastructure within the ecologies of digital archiving but diverge in mandate and logics. LAC is mandated to produce Canadian cultural and governmental memory and is accountable to Canadian governmental policy, whereas the IA is a transnational nonprofit that exercises control by providing web infrastructure. They are bound by copyright law but are politically focused on increasing access to data through different, highly visible projects. I will use the construction of 'Scribes' as a focus to present the porous nature of digital memory institutions. This comparative analysis contributes to conversations around the tensions of digital national futures, and how the process of transnational archiving can complicate or support national archival agendas.
References
Internet Archive Canada. (2024, July 1). Internet Archive Canada launches digitization project with Library and Archives Canada. https://internetarchivecanada.org/2024/07/01/internet-archive-canada-launches-digitization-project-with-library-and-archives-canada/
Library and Archives Canada. (2025, August 1). The plan to scan: digitizing out-of-copyright publications. Government of Canada. https://www.canada.ca/en/library-archives/corporate/updates/2025/the-plan-to-scan-digitizing-out-of-copyright-publications.html
Star, S. L., & Ruhleder, K. (1996). Steps toward an ecology of infrastructure: Design and access for large information spaces. Information Systems Research, 7(1), 111–134. https://doi.org/10.1287/isre.7.1.111
11:57am - 12:05pm
Ten years of websites and born-digital archiving in Slovakia
University Library in Bratislava, Slovak Republic
Electronic documents and websites should be preserved similarly to physical objects of lasting value. In 2015 our institution became involved in a project on digital resources. The goal of the project was to create the technological and organisational infrastructure for systematic and controlled web harvesting and born-digital archiving. We archive national websites and born-digital content (electronic monographs and electronic serials). The project has now completed its sustainability phase, and all activities are carried out by a specialised department. During the pilot phase, a complex information system for harvesting, identification, management and long-term preservation of web resources and born-digital documents was established. Our information system consists of specialised open source software modules (Heritrix, OpenWayback, Solr, etc.). The application is supported by a powerful hardware infrastructure. The system management is optimized for parallel web harvesting. This makes it possible to complete a full domain harvest with the required politeness in an acceptable time. One useful system feature is an identical parallel testing environment. The web archiving system has 800 TB of storage at its disposal. A substantial part of the system is the catalogue of websites, which is regularly updated during the automated survey of the national domain. Some domains that match our policy criteria are added to the catalogue manually (.org, .net, .com, .eu…). Since 2016 our department has performed seven full-domain harvests of the national domain, as well as multiple selective and thematic harvests. Electronic publications with an assigned ISSN are archived in cooperation with the National ISSN Centre, by upload or by harvest. Access to the archived data is provided through OpenWayback.
Owing to copyright restrictions, only a limited number of archived websites and electronic publications are publicly available. All archived resources are available locally in the institution. This contribution focuses on the path taken in archiving national websites and born-digital documents in the digital resources archive. Over ten years, the archive has encountered several opportunities, and it is now a recognized source, partly supported in national legislation (archiving of news portals).
12:05pm - 12:13pm
Climate change captured: collaborative, complex crawling & collecting - learnings from a cross-institutional pilot on climate change reactions
Royal Danish Library, Denmark
As part of a national, cross-institutional pilot initiative documenting public reactions to climate change, a recent thematic web collection focused on online debates and reflections surrounding water levels, flooding, and environmental adaptation. Within this pilot, an effort led almost entirely by a single curator resulted in the collection of over 1.6 million unique web pages (more than 5 terabytes of data), including embedded videos, dynamic rich media and selected social media content. The collection was conducted using Browsertrix, a browser-based crawling technology that proved essential for capturing complex, media-rich web content that traditional crawlers often miss. The setup included both cloud-based and local installations, allowing flexible scaling and testing of workflows. Browsertrix enabled efficient harvesting within a limited timeframe while significantly improving the fidelity of the captures, particularly for sites relying heavily on dynamic or embedded content. This presentation will share key learnings from the pilot, focusing on technical, curatorial, and collaborative dimensions. On the technical side, challenges included resource demands, blocked access to social media “walled gardens,” and maintaining crawl stability across diverse sites. From a curatorial perspective, the project demonstrated the value of close cooperation with domain experts on climate change, whose insights were crucial for identifying emerging debates and relevant sources, as well as the value of inspiration from the other institutions participating in the pilot, which collected non-web media or physical objects.
The user-friendly GUI of Browsertrix, partly developed during the IIPC-funded project "Browser based crawling system for all" (https://netpreserve.org/projects/browser-based-crawling), empowered curators to crawl and make informed decisions quickly and intuitively. Monitoring crawls at run time helped identify important sites that could be crawled in more depth later. However, the experience also revealed the need for broader outreach and participatory workshops in future large-scale efforts, to ensure diverse and inclusive input across sectors. The pilot underscored how browser-based harvesting tools can transform national web archiving by bridging gaps in multimedia and interactive content capture. At the same time, it highlighted the limits of current approaches, particularly the need for dedicated development to handle advanced social media and video platforms. The forthcoming main project, pending acceptance of funding applications, aims to build on these lessons, exploring how combining existing infrastructures with newer tools like Browsertrix can enhance thematic, rapid-response collections. With modest resources but focused technical and curatorial innovation, it is possible to add substantial cultural and research value to national web archives documenting societal reactions to climate change.
12:13pm - 12:21pm
Bridging local and international communities: Web archiving outreach and collaboration
1Aix Marseille University, France; 2Humathèque, Campus Condorcet, France; 3MMSH, CNRS, Aix Marseille University, France
This lightning talk presents three community-building and outreach initiatives that brought together long-time web-archiving specialists and newcomers to the field in 2025. The first is a community-building initiative that resulted in the drafting of a memorandum of understanding between the xxx and xxx. In this declaration, they commit to: creating a shared ecosystem to foster new cooperation projects; conducting collective work on the methodology for stabilizing and archiving web data corpora; strengthening links between existing institutions with expertise in collecting, analyzing and archiving web data; and reflecting on how to create a reproducible pipeline to collect, curate, consult and conserve web data corpora for SSH research. The second initiative is the co-organization of a monthly research seminar entitled “The Web and Web archives for research in the humanities and social sciences: knowledge, methods, and tools for the collection, analysis, and preservation of online corpora”. The third initiative is an event: a hackathon called “Building a corpus with web data” involving SSH researchers and research library professionals from xxx and xxx, as well as other significant local players in web archiving. xxx and xxx are pooling their expertise to transform research practices through knowledge creation, training, awareness-raising, and the sharing of common tools for web archiving. Together, they want to build bridges between the international web-archiving communities (RESAW, IIPC) and local specialists and enthusiasts.
12:21pm - 12:29pm
Best practices for collaboration: Managing themed harvests with external partners
National Library of Finland, Finland
Themed harvests are a substantial part of web archiving at the National Library of Finland. Beyond the yearly crawl of Finnish domains ending in the .fi or .ax country codes, online content is crawled through continuous harvests and themed harvests, which vary in subject and content type. The most recent collection plan, for 2025-2028, calls for more emphasis on themed harvests that involve collaboration or cooperation with different groups, third-party organisations and other participants interested in suggesting content or otherwise participating in web archiving for the Finnish Web Archive. This lightning talk will provide insight into how collaborative themed harvests are usually managed and how this management has developed in recent years. As harvests may cover subjects in which the legal deposit services team that curates the archived online content lacks the required expertise, the role of external partners is crucial. The presentation will include several themed harvests from recent years that involved cooperation or collaboration with external partners. Most of the collaborative themed harvests in recent years have been organized with institutions and organizations specializing in or representing language minorities or underrecognized groups, but the findings presented are also applicable to other kinds of external partners. Over the years, we have learned to improve the management of different types of cooperative and collaborative themed harvests. Collecting projects may be sparked by external suggestions or may be based on a set of online content already assembled by a third party. Managing these kinds of projects usually turns out to be fairly different from managing projects that require reaching out for expertise beyond the National Library.
Themed harvests organized with minorities and underrecognized groups in particular have the feature that the collaborating participants are not just providers of suggestions but also have knowledge of, and a say in, other aspects of the project (e.g., cataloguing and communicating to peers). Based on our experiences with these kinds of themed harvests, we have produced internal guidelines on how to manage collaborative collecting projects.