Conference Agenda


 
 
Session Overview
Session
SESSION #01: Tools Under Construction: Lessons Learned (National Library Perspective)
Time:
Wednesday, 09/Apr/2025:
11:55am - 1:00pm

Session Chair: Katherine Boss, National Library of Norway
Location: Store Auditorium (ground floor)

main entrance at street level

Presentations
11:55am - 12:15pm

Embedding the Web Archive in an Overall Preservation System

Hansueli Locher

Swiss National Library, Switzerland

The Swiss National Library (SNL) is building a new digital long-term archive that will go live in spring 2025. This system is designed as an overall system that covers all the processes involved in handling the digital objects of all the SNL's collections, including the web archive. This starts with the delivery of the objects by producers or the collection of the objects by the SNL itself, includes the preparation for archiving and cataloguing, administration and preservation, and ends with the provision to users.

The first part of the presentation will describe the architecture and functionality of the overall system, which consists of three different areas and uses a mixture of standard components and individual developments.

  • A modular pre-ingest area provides so-called processing channels for different types of collection objects. With the help of said channels the objects and their metadata are prepared in such a way that they can be transferred to the ingest process of the digital archive.
  • The Digital Archive contains the core system for managing and archiving digital collection objects. It also provides risk and preservation management functionality.
  • An access system allows users to access the digital collections. It provides a full-text search, access control and server-based viewers for the most common data formats. In addition, selected parts of the collection can be presented to users in a curated form via so-called showcases.

The second part of the presentation will show how the Swiss Web Archive and its specific processes have been integrated into the overall system. Special precautions had to be taken particularly in the Pre-Ingest and Access areas.

In Pre-Ingest, a distinct processing channel was created for the web archive. This makes it possible to register the websites for collection (and automated periodic snapshots), collect them, check their quality and improve it if necessary, and ensure that they are virus-free.

Access makes the web archive accessible via a full-text search, for which special precautions had to be taken when generating the hit lists. Otherwise, the hits from the other collections would be lost among the numerous hits from the web archive. In addition, one of the showcases will provide an unexpected approach to the web archive.

The presentation will conclude by addressing some of the specific challenges of integrating the web archive into an overall preservation system and the lessons learnt.



12:15pm - 12:35pm

UKWA Rebuild

Gil Hoggarth

British Library, United Kingdom

The British Library suffered a major service outage following a cyber-attack on all technical systems in late October 2023. What followed was a complete rebuild of all services with security baked in. This short presentation provides an overview of how the UK Web Archive was affected and how the operational technology landscape of the British Library has changed, and describes the work being undertaken to return UKWA as a public service and to begin crawling again from on-premise servers. It will also describe how the internal systems of UKWA are changing to meet the new infrastructure and policies.*

The challenges faced should be relevant to all web archiving institutions. The necessary changes made by the British Library to ensure the new services are secure by design will have a major impact on the UK Web Archive systems, but these could be challenges and changes imposed on any web archive. The size of the UK Web Archive, approaching 2 PiB and an estimated 18 billion files, also creates challenges in itself which will be familiar to many web archives. The redesign of UKWA includes distant storage and aims to establish shared functions and resources across the Legal Deposit Libraries in the future.

Ways of discovering content within the UK Web Archive have been significantly reduced by the cyber-attack. Previously, a full text search service was available using Apache Solr. However, the return of a 'discovery service' has been delayed by the necessity of rebuilding all systems from scratch. The future planning for a discovery service, and a user service, will also be outlined in the presentation.

* As of mid-August 2024, no technology infrastructure or systems have been released for the UKWA rebuild work. Consequently, the content of this presentation may change between this paper submission and the conference date.



12:35pm - 12:55pm

Under Construction: Web Archive of the German National Library

Natanael Arndt

German National Library, Germany

Our institution has been running a web archive since 2012, in cooperation with an external contractor and on closed-source software. Most recently we have started the shift towards an in-house open source web archiving system that will be integrated with the overall data management infrastructure of our institution. During a first migration process the whole setup was moved in-house. The migration allowed us to gain some control over the operation, while development and support are still performed by the contractor. Over the last decade, we have identified a number of limitations with the current web archive setup: the crawling capacity is limited to a maximum of 12,000 snapshots per annum, the non-modular system complicates the implementation of new requirements, and we cannot directly benefit from the progress of the thriving open source web archiving community with regard to new features and the implementation of web archiving standards. In parallel to the web archiving activities, our institution has developed an overarching data management infrastructure for the acquisition, digital preservation, and provisioning of electronic resources, such as e-books, e-journals, and most recently audio files. To gain increased maintainability, flexibility and control over the web archiving activity, our aim is to implement a new system in-house, to integrate it with the well-established in-house workflows for electronic resources, and to align it with and base it on the current open source state of the art and the standards of the web archiving community.

During the presentation we take you on the journey of our institution towards the implementation of an in-house and open source web archive. We try to answer the questions: How do we understand the environment? How do we assemble our team? Where do we want to go? How do we decide which paths to take? Which gear do we need? And finally, what are our lessons learned?