2:15pm - 2:35pm
From Pages to People: Tailoring Web Archives for Different Use Cases
Andrea Kocsis2, Leontien Talboom1
1Cambridge University Libraries, United Kingdom; 2National Library of Scotland, United Kingdom
Our paper explores different modes of reaching the three distinct audiences identified in previous work with the National Archives UK: readers, data users, and the digitally curious. Building on examples of our work at Cambridge University Libraries and the National Library of Scotland, our paper gives recommendations and demonstrates good practices for designing web archives around different audience needs while ensuring wide access.
Firstly, to improve the experience of general readers, we employ exploratory and gamified interfaces and public outreach events, such as exhibitions, to raise library users’ awareness of the available web archive resources. Secondly, to serve the data user community, we place an emphasis on curating metadata datasets and Datasheets for Datasets documentation, encouraging quantitative research on the web archive collections. This work also involves outreach events, such as data visualisation calls, whose outputs can later be incorporated into the resources for general readers. Finally, to overcome the digital skills gap, we tailored in-library workshops for the digitally curious: those who recognise the potential of web archives but lack advanced computational skills. We expect that upskilling the digitally curious can spark their interest in exploring and using the web archive collections.
To sum up, our paper introduces the work we have been doing to improve the usability of the UK Web Archive within our institutions by developing additional materials (datasets, interfaces) and planning outreach events (exhibitions, calls, workshops) to ensure we meet the expectations of readers, data users, and the digitally curious.
2:35pm - 2:55pm
Making Research Data Published to the Web FAIR
Bryony Hooper, Ric Campbell
University of Sheffield, United Kingdom
The University of Sheffield’s vision for research is that “our distinctive and innovative research will be world-leading and world-changing. We will produce the highest quality research to drive intellectual advances and address global challenges” (https://www.sheffield.ac.uk/openresearch/university-statement-open-research).
Research data published to the web can offer opportunities for wider discovery of, and access to, research outputs. However, websites are an inherently fragile medium, and there is no guarantee that discovery and access will persist for as long as the need for them remains, or that we will be able to evidence our research impact over time. This matters, for example, if we want to submit sites as part of a UK Research Excellence Framework submission (the next is scheduled for 2029).
Funding requirements may also stipulate how long outputs are expected to remain accessible. Years of work, including work undertaken with public funding, could disappear without intervention. In addition, publishing research data to the web alone cannot meet the University of Sheffield’s commitment to the FAIR principles (findable, accessible, interoperable and reusable) and to Open Research and Open Data practices.
At the University of Sheffield, colleagues in our Research Data Management (RDM) team have also noticed a trend of researchers depositing links in the Institutional Repository (ORDA) to the external URLs where their data actually sits. In some cases, the website is the research output in its entirety; its maintenance then falls outside the RDM team’s remit, and we cannot provide our usual assurances about preserving the deposit.
This paper will discuss the work undertaken by the University of Sheffield’s Library to mitigate potential loss of research data published online. It will include a case study of capturing a research group’s website for deposit in our institutional data repository, the collaborative creation of guidance for researchers and research data managers, and the embedding of good practice at the University so that Open Research and Open Data remain open and FAIR.
2:55pm - 3:15pm
Enhancing Accessibility to Belgian Born-Digital Heritage: The BelgicaWeb Project
Christina Vandendyck
Royal Library of Belgium (KBR), Belgium
The BelgicaWeb project aims to make Belgian digital heritage more FAIR (i.e. Findable, Accessible, Interoperable and Reusable) to a wide audience. BelgicaWeb is a BRAIN 2.0 project funded by BELSPO, the Belgian Science Policy Office. It is a collaboration between CRIDS (University of Namur), who provide expertise on the relevant legal issues; IDLab, GhentCDH and MICT (Ghent University), who work on data enrichment, user engagement and evaluation, and outreach to the research community, respectively; and KBR (Royal Library of Belgium), who act as project coordinator and work on the development of the access platform and API as well as data enrichment.
By leveraging web and social media archiving tools, the project focuses on creating comprehensive collections, developing a multilingual access platform, and providing a robust API enabling data-level access. At the heart of the project is a reference group of experts who provide iterative input on selection, the development of the API and access platform, data enrichment, quality control, and usability. In this way, the project contributes to establishing best practices for search and discovery.
The project goes beyond data collection by means of open-source tools by enriching and aggregating (meta)data associated with these collections using innovative technologies such as Linked Data and Natural Language Processing (NLP). This approach enhances search capabilities, yielding more relevant results for both researchers and the general public.
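As a simple illustration of what such enrichment can look like (the project’s actual pipeline is not described here), the sketch below assumes spaCy for named entity recognition and rdflib for emitting Linked Data triples; the minted entity URIs are hypothetical placeholders, where a real pipeline would link to external vocabularies such as Wikidata.

```python
# Illustrative sketch only, not BelgicaWeb's actual pipeline: extract named
# entities from archived page text and emit Linked Data triples about them.
import spacy
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

SCHEMA = Namespace("http://schema.org/")

# A multilingual model would better suit Belgian collections; the small
# English model is used here only to keep the sketch self-contained.
nlp = spacy.load("en_core_web_sm")

def enrich(page_uri: str, text: str) -> Graph:
    """Return an RDF graph describing the entities mentioned on a page."""
    g = Graph()
    g.bind("schema", SCHEMA)
    for ent in nlp(text).ents:
        # Hypothetical entity URI; a production system would reconcile
        # against an external vocabulary instead of minting local URIs.
        entity = URIRef(f"http://example.org/entity/{ent.text.replace(' ', '_')}")
        g.add((URIRef(page_uri), SCHEMA.mentions, entity))
        g.add((entity, RDFS.label, Literal(ent.text)))
        g.add((entity, RDF.type, SCHEMA.Thing))
    return g
```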
In this presentation, we will provide an overview of the BelgicaWeb project’s system architecture, the technical challenges we encountered, and the solutions we implemented. We will demonstrate how the access platform and API offer powerful, relevant, and user-friendly search functionalities, making it a valuable tool for accessing Belgium’s digital heritage. Attendees will gain insights into our development process, the technologies employed, and the benefits of our open-source approach for the web archiving community and, by extension, the digital preservation community.
3:15pm - 3:35pm
Using Generative AI to Interrogate the UK Government Web Archive
Chris Royds, Tom Storrar
The National Archives (UK), United Kingdom
Our project seeks to make the contents of web archives more easily discoverable and interrogable through the use of Generative AI (Gen-AI). It explores the feasibility of setting up a chatbot and using UK Government Web Archive data to inform its responses. We believe that, if this approach proves successful, it could lead to a step change in the discoverability and accessibility of web archives.
Background
Gen-AIs like ChatGPT and Copilot have impressive capabilities, but are notoriously prone to “hallucinations”: they can generate confident-sounding but demonstrably false responses, even to the point of inventing non-existent academic papers, complete with fictitious DOIs.
Retrieval-Augmented Generation (RAG) seeks to address this. It supplements Gen-AI with an additional database, queried whenever a response is generated. This approach aims to significantly reduce the chance of hallucination, while also enabling chatbots to provide specific references to the original sources.
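As a rough sketch of the idea (not the project’s implementation), the following shows the core RAG loop: retrieve the documents most similar to the question, then ground the model’s prompt in them. The embed() and generate() functions are hypothetical stand-ins for whatever embedding model and LLM API a deployment uses; only the retrieval logic is shown concretely.

```python
# Minimal RAG sketch: retrieve top-k documents by cosine similarity, then
# instruct the LLM to answer only from those retrieved sources.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in: return an embedding vector for `text`."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Hypothetical stand-in: call an LLM and return its completion."""
    raise NotImplementedError

def rag_answer(question: str, documents: list[str], k: int = 3) -> str:
    # Embed the corpus and the question, then rank documents by cosine
    # similarity to the question.
    doc_vecs = np.stack([embed(d) for d in documents])
    q_vec = embed(question)
    sims = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    context = "\n\n".join(documents[i] for i in np.argsort(sims)[-k:])
    # Grounding the prompt in retrieved sources is what reduces
    # hallucination and lets the chatbot cite where an answer came from.
    prompt = (
        "Answer using only the sources below, and cite them.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```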
Additionally, any approach used would need to take into account the occasional need to remove individual records (in line with The National Archives’ takedown policy: https://www.nationalarchives.gov.uk/legal/takedown-and-reclosure-policy/). In traditional Neural Networks, “forgetting” data is currently an intractable problem. However, it should be possible to set up RAG databases such that removal of specific documents is straightforward.
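To illustrate why takedown is tractable in this setting, consider the hypothetical store layout below (not any specific tool’s schema): each document’s embedding is keyed by its identifier, so honouring a takedown request is a single deletion, with no model retraining.

```python
# Sketch of a RAG store keyed by document ID. Once a record is deleted it
# can no longer be retrieved, and therefore can no longer inform any
# generated response; contrast with "forgetting" data baked into a trained
# neural network's weights.
class VectorStore:
    def __init__(self):
        self.records = {}  # doc_id -> (embedding, text, source_url)

    def add(self, doc_id, embedding, text, source_url):
        self.records[doc_id] = (embedding, text, source_url)

    def remove(self, doc_id):
        # Takedown: a straightforward keyed delete.
        self.records.pop(doc_id, None)
```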
Approach
Our project is focused on two open-source tools, both of which allow for RAG based on Web Archive records.
The first is WARC-GPT, a lightweight tool developed by a team at Harvard, designed to ingest web archive documents, feed them into a RAG database, and provide a chatbot for interrogating the results. While the tool’s creators have demonstrated its capabilities on a small number of documents, we have attempted to test it at a larger scale, on a corpus of ~22,000 resources.
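For readers unfamiliar with the ingestion step, the sketch below shows how HTML responses can be pulled out of a WARC file using the warcio library. It illustrates the kind of extraction such a tool performs before embedding records into a RAG database; it is not WARC-GPT’s actual code.

```python
# Sketch of WARC ingestion with warcio: yield the URL and raw HTML of each
# HTML response record in a web archive file.
from warcio.archiveiterator import ArchiveIterator

def iter_html_records(warc_path: str):
    """Yield (url, raw_html) pairs for each HTML response in the WARC."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            yield url, record.content_stream().read()
```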
The second, more sophisticated tool is Microsoft’s GraphRAG. GraphRAG identifies the “entities” referenced in documents and builds a data structure representing the relationships between them. This data structure should allow a chatbot to carry out more in-depth “reasoning” about the contents of the original documents, and potentially provide better answers to questions that aggregate information across multiple documents.
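The following toy sketch conveys the intuition behind the graph-based approach (it is not Microsoft’s implementation): entities that co-occur in documents are linked in a graph, and aggregate questions are answered by walking that graph rather than retrieving a single best-matching document. The extract_entities() function stands in for an LLM-based extraction step.

```python
# Toy entity graph in the spirit of GraphRAG: co-occurring entities are
# connected, with edges recording how often they co-occur and which
# documents support the relationship.
import itertools
import networkx as nx

def extract_entities(text: str) -> list[str]:
    """Hypothetical stand-in: return the entities mentioned in `text`."""
    raise NotImplementedError

def build_entity_graph(documents: dict[str, str]) -> nx.Graph:
    g = nx.Graph()
    for doc_id, text in documents.items():
        entities = set(extract_entities(text))
        for a, b in itertools.combinations(sorted(entities), 2):
            if g.has_edge(a, b):
                g[a][b]["weight"] += 1
                g[a][b]["sources"].append(doc_id)
            else:
                g.add_edge(a, b, weight=1, sources=[doc_id])
    return g

def related_entities(g: nx.Graph, entity: str, top: int = 5):
    """Rank an entity's neighbours, aggregating evidence across documents."""
    neighbours = g[entity]
    ranked = sorted(neighbours, key=lambda n: neighbours[n]["weight"], reverse=True)
    return [(n, neighbours[n]["sources"]) for n in ranked[:top]]
```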
Results
Our initial findings suggest that WARC-GPT produces impressive responses when queried about topics covered in a single document. It quickly identifies which of the documents in its database best answers the prompt, summarises the relevant information from that document, and provides its URL. Additionally, with a few minor tweaks to the underlying source code, it is possible to remove individual documents from its database. However, WARC-GPT’s responses fare poorly when attempting to aggregate information from multiple documents.
Our experiments with GraphRAG suggest that it outperforms WARC-GPT in aggregating information. However, while GraphRAG is reasonably quick to generate these responses, it is significantly slower and more expensive to set up than WARC-GPT. Additionally, removing individual records from GraphRAG, while possible, is computationally expensive.