IIPC WAC 2025 Conference Agenda

Overview and details of the sessions of this conference.

 
 
Session Overview

SESSION #07: Research & Access
Time: Thursday, 10/Apr/2025, 2:15pm - 3:40pm
Session Chair: Marie Roald, National Library of Norway
Location: Målstova (upstairs, 1 level up from ground floor)
Presentations
2:15pm - 2:35pm

From Pages to People: Tailoring Web Archives for Different Use Cases

Andrea Kocsis², Leontien Talboom¹

¹Cambridge University Libraries, United Kingdom; ²National Library of Scotland, United Kingdom

Our paper explores different modes of reaching the three distinct audiences identified in previous work with The National Archives UK: readers, data users, and the digitally curious. Building on examples of our work at Cambridge University Libraries and the National Library of Scotland, our paper gives recommendations and demonstrates good practice for designing web archives for different audience needs while ensuring wide access.

Firstly, to improve the experience of general readers, we employ exploratory and gamified interfaces and public outreach events, such as exhibitions, to raise library users' awareness of the available web archive resources. Secondly, to serve the data user community, we place an emphasis on curating metadata datasets and Datasheets for Data documentation, encouraging quantitative research on the web archive collections. This work also involves outreach events, such as data visualisation calls, which can later be incorporated into the resources for general readers. Finally, to overcome the digital skills gap, we have tailored in-library workshops for the digitally curious: those who recognise the potential of web archives but lack advanced computational skills. We expect that upskilling the digitally curious will spark their interest in exploring and using the web archive collections.

To sum up, our paper introduces the work we have been doing to improve the usability of the UK Web Archive within our institutions by developing additional materials (datasets, interfaces) and planning outreach events (exhibitions, calls, workshops) to ensure we meet the expectations of readers, data users, and the digitally curious.



2:35pm - 2:55pm

Making Research Data Published to the Web FAIR

Bryony Hooper, Ric Campbell

University of Sheffield, United Kingdom

The University of Sheffield’s vision for research is that our distinctive and innovative research will be world-leading and world-changing. We will produce the highest quality research to drive intellectual advances and address global challenges (https://www.sheffield.ac.uk/openresearch/university-statement-open-research).

Research data published to the web can offer opportunities for wider discovery of, and access to, research outputs. However, it also presents risk: there is no assurance that such discovery and access will remain possible for as long as they are needed. Websites are an inherently fragile medium, which makes it difficult to guarantee that we can evidence our research impact over time, including the potential need to submit sites as part of the UK’s Research Excellence Framework (the next submission is scheduled for 2029).

Funders may also stipulate how long they expect outputs to remain accessible. Years of work, including work undertaken with public funding, could disappear without intervention. In addition, simply publishing research data to the web cannot provide assurances that the University of Sheffield’s commitment to the FAIR principles (Findable, Accessible, Interoperable and Reusable) and to Open Research and Open Data practices is being met.

At the University of Sheffield, colleagues in our Research Data Management (RDM) team have also noticed a trend of researchers depositing, in the institutional repository (ORDA), links to external URLs where the data actually resides. In some cases the website is the research output in its entirety; its maintenance then falls outside the RDM team’s remit, so we cannot provide the usual assurances about preserving that deposit.

This paper will discuss the work undertaken by the University of Sheffield’s Library to mitigate potential data loss from research published online. It will include a case study of capturing a research group’s website for deposit in our institutional data repository, the collaborative creation of guidance for researchers and research data managers, and the embedding of good practice at the University so that Open Research and Open Data remain open and FAIR.
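As a rough illustration of what such a capture step can look like, the sketch below drives a wget crawl that writes WARC output. The URL, filenames and flags are placeholders chosen for this example, not the Library's actual workflow.

```python
# Minimal capture sketch: crawl a research group's site into a WARC file that
# can then be deposited, with its metadata, in an institutional repository.
# Assumes wget (built with WARC support) is installed; the URL and names are
# illustrative placeholders, not the workflow described in the paper.
import subprocess

def capture_site(url: str, warc_name: str) -> None:
    subprocess.run(
        [
            "wget",
            "--mirror",                  # recursive crawl of the site
            "--page-requisites",         # fetch images, CSS and JS needed to render pages
            f"--warc-file={warc_name}",  # write the crawl to a WARC file
            url,
        ],
        check=True,
    )

capture_site("https://research-group.example.ac.uk/", "research-group-2025")
```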



2:55pm - 3:15pm

Enhancing Accessibility to Belgian Born-Digital Heritage: The BelgicaWeb Project

Christina Vandendyck

Royal Library of Belgium (KBR), Belgium

The BelgicaWeb project aims to make Belgian digital heritage more FAIR (i.e. Findable, Accessible, Interoperable and Reusable) for a wide audience. BelgicaWeb is a BRAIN 2.0 project funded by BELSPO, the Belgian Science Policy Office. It is a collaboration between CRIDS (University of Namur), which provides expertise on the relevant legal issues; IDLab, GhentCDH and MICT (Ghent University), which work on data enrichment, user engagement and evaluation, and outreach to the research community, respectively; and KBR (Royal Library of Belgium), which acts as project coordinator and works on the development of the access platform and API as well as on data enrichment.

By leveraging web and social media archiving tools, the project focuses on creating comprehensive collections, developing a multilingual access platform, and providing a robust API that enables data-level access. At the heart of the project is a reference group of experts who provide iterative input on selection, the development of the API and access platform, data enrichment, quality control, and usability. In this way, the project contributes to establishing best practices for search and discovery.

The project goes beyond collecting data with open-source tools: it enriches and aggregates the (meta)data associated with these collections using innovative technologies such as Linked Data and Natural Language Processing (NLP). This approach enhances search capabilities, yielding more relevant results for both researchers and the general public.
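As a small, hypothetical sketch of this kind of NLP enrichment (not the BelgicaWeb pipeline itself), the snippet below runs named entity recognition over the text of an archived page and attaches the entities as extra metadata. The spaCy model and the output fields are assumptions made for illustration.

```python
# Illustrative NLP enrichment: extract named entities from an archived page
# and attach them as searchable metadata. The spaCy model and field names are
# assumptions for this sketch, not BelgicaWeb's actual implementation.
import spacy

nlp = spacy.load("en_core_web_sm")  # a Dutch or French model would suit much Belgian content

def enrich_record(url: str, text: str) -> dict:
    doc = nlp(text)
    entities = sorted({(ent.text, ent.label_) for ent in doc.ents})
    return {
        "url": url,
        "entities": [{"value": value, "type": label} for value, label in entities],
    }

record = enrich_record(
    "https://example.be/page",
    "KBR coordinates the BelgicaWeb project with Ghent University and the University of Namur.",
)
print(record["entities"])  # entities that can be indexed or linked to authority files
```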

In this presentation, we will provide an overview of the BelgicaWeb project’s system architecture, the technical challenges we encountered, and the solutions we implemented. We will demonstrate how the access platform and API offer powerful, relevant, and user-friendly search functionality, making them a valuable tool for accessing Belgium’s digital heritage. Attendees will gain insights into our development process, the technologies employed, and the benefits of our open-source approach for the web archiving community and, by extension, the digital preservation community.



3:15pm - 3:35pm

Using Generative AI to Interrogate the UK Government Web Archive

Chris Royds, Tom Storrar

The National Archives (UK), United Kingdom

Our project seeks to make the contents of Web Archives more easily discoverable and interrogable through the use of Generative AI (Gen-AI). It explores the feasibility of setting up a chatbot and using UK Government Web Archive data to inform its responses. We believe that, if this approach proves successful, it could lead to a step change in the discoverability and accessibility of Web Archives.

Background

Gen-AIs like ChatGPT and Copilot have impressive capabilities, but are notoriously prone to “hallucinations”: they can generate confident-sounding but demonstrably false responses, even to the point of inventing non-existent academic papers, complete with fictitious DOI numbers.

Retrieval-Augmented Generation (RAG) seeks to address this. It supplements a Gen-AI model with an additional document database that is queried whenever a response is generated, and the retrieved passages are fed into the model’s prompt so that its answer is grounded in them. This approach aims to significantly reduce the chance of hallucination, while also enabling chatbots to provide specific references to the original sources.
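A minimal, toy sketch of this retrieve-then-ground pattern is shown below. Bag-of-words vectors stand in for real embeddings and the LLM call itself is omitted, so everything here is illustrative rather than the project’s implementation.

```python
# Toy retrieval-augmented generation loop: retrieve the most similar archived
# documents, then build a grounded prompt that cites their URLs. Bag-of-words
# vectors stand in for real embeddings; the actual LLM call is omitted.
from collections import Counter
import math

def vectorise(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

store = {  # doc_id -> (url, text); in practice this would be a vector database
    "doc1": ("https://webarchive.example/a", "Guidance on flood defence funding for councils."),
    "doc2": ("https://webarchive.example/b", "Statistics on school meal uptake in 2019."),
}

def retrieve(query: str, k: int = 1):
    q = vectorise(query)
    ranked = sorted(store.items(), key=lambda item: cosine(q, vectorise(item[1][1])), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    context = "\n".join(f"[{url}] {text}" for _, (url, text) in retrieve(query))
    return f"Answer using only the sources below and cite their URLs.\n{context}\n\nQuestion: {query}"

print(build_prompt("How was flood defence funded?"))
```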

Additionally, any approach used would need to take into account the occasional need to remove individual records (in line with The National Archives’ takedown policy: https://www.nationalarchives.gov.uk/legal/takedown-and-reclosure-policy/). In traditional Neural Networks, “forgetting” data is currently an intractable problem. However, it should be possible to set up RAG databases such that removal of specific documents is straightforward.
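The sketch below illustrates why such removal can be straightforward: if the store is keyed by document identifier, a takedown is simply a deletion, with no retraining involved. The store and identifiers here are hypothetical.

```python
# Hypothetical ID-keyed RAG store: honouring a takedown request means deleting
# the stored text and any precomputed embedding; no model retraining is needed.
store = {"doc-001": "text of document one", "doc-002": "text of document two"}
embeddings = {"doc-001": [0.1, 0.2], "doc-002": [0.3, 0.4]}

def take_down(doc_id: str) -> None:
    store.pop(doc_id, None)
    embeddings.pop(doc_id, None)

take_down("doc-002")
assert "doc-002" not in store and "doc-002" not in embeddings
```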

Approach

Our project is focused on two open-source tools, both of which allow for RAG based on Web Archive records.

The first is WARC-GPT, a lightweight tool developed by a team at Harvard and designed to ingest Web Archive documents, feed them into a RAG database, and provide a chatbot to interrogate the results. While the tool’s creators have demonstrated its capabilities on a small number of documents, we have attempted to test it at a larger scale, on a corpus of ~22,000 resources.
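To give a sense of what this kind of ingestion involves (this is not WARC-GPT’s own code), the sketch below iterates over a WARC file with warcio, keeps HTML response records, and extracts plain text ready for chunking and embedding. The filename is a placeholder.

```python
# Sketch of a WARC ingestion step: read response records, keep HTML payloads,
# and extract plain text for later chunking and embedding. The filename is a
# placeholder and this is not WARC-GPT's implementation.
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def extract_pages(warc_path: str):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            content_type = (record.http_headers.get_header("Content-Type") or "") if record.http_headers else ""
            if "html" not in content_type:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
            yield {"url": url, "text": text}

for page in extract_pages("crawl.warc.gz"):
    print(page["url"], len(page["text"]), "characters of text to embed")
```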

The second, more sophisticated tool is Microsoft’s GraphRAG. GraphRAG identifies the “entities” referenced in documents and builds a data structure representing the relationships between them. This data structure should allow a chatbot to carry out more in-depth “reasoning” about the contents of the original documents, and potentially to provide better answers about information aggregated across multiple documents.
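As a small illustration of the entity-graph idea (not Microsoft’s implementation, and with entity extraction stubbed out), the sketch below links entities that co-occur in the same document, so that a question spanning two documents can be answered by following a path through a shared entity.

```python
# Toy entity graph: entities that co-occur in a document are linked, so
# relationships spanning several documents can be traversed. Entity extraction
# is stubbed with a fixed dict; GraphRAG itself extracts entities with an LLM.
from itertools import combinations
import networkx as nx

doc_entities = {
    "doc1": ["Department A", "Programme X", "Region B"],
    "doc2": ["Programme X", "Minister C", "Report Y"],
}

graph = nx.Graph()
for doc_id, entities in doc_entities.items():
    for a, b in combinations(sorted(set(entities)), 2):
        if graph.has_edge(a, b):
            graph[a][b]["docs"].add(doc_id)
        else:
            graph.add_edge(a, b, docs={doc_id})

# A question touching both documents can follow the path through the shared entity.
print(nx.shortest_path(graph, "Department A", "Minister C"))
# ['Department A', 'Programme X', 'Minister C']
```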

Results

Our initial findings suggest that WARC-GPT produces impressive responses when queried about topics covered in a single document. It quickly identifies which of the documents in its database best answers the prompt, summarises the relevant information from that document, and provides its URL. Additionally, with a few minor tweaks to the underlying source code, it is possible to remove individual documents from its database. However, WARC-GPT’s responses fare poorly when it attempts to aggregate information from multiple documents.

Our experiments with GraphRAG suggest that it outperforms WARC-GPT in aggregating information. However, while GraphRAG is reasonably quick to generate these responses, it is significantly slower and more expensive to set up than WARC-GPT. Additionally, removing individual records from GraphRAG, while possible, is computationally expensive.



 