Conference Agenda
Session Overview
WORKSHOP: BELGICAWEB [PART 1]
Presentations
Web Archiving, with a little help from my LLM friends

Ghent University, Belgium; KBR Royal Library of Belgium; Université de Namur, Belgium

BelgicaWeb is a two-year BRAIN 2.0 project funded by BELSPO that aims to safeguard and promote Belgium’s born-digital heritage by making it FAIR (Findable, Accessible, Interoperable, and Reusable). Through the development of a user-friendly access platform and API, the project addresses sustainable access, data enrichment using technologies such as Linked Data and NLP, and legal frameworks around data sharing, AI, and privacy. It brings together experts from KBR, Ghent University, and the University of Namur, and actively engages users to shape its design and functionality.

In this tutorial, we will demonstrate a complete web archiving pipeline and show how it can be augmented with AI-based methods, mainly large language models (LLMs), to extend existing workflows. Participants will first see how our current BelgicaWeb pipeline automatically creates and replays web archives using SolrWayback. The resulting WARC files will then be processed with an LLM-based data-cleaning pipeline that turns the raw data into structured Linked Data. The same raw data can also be explored using retrieval-augmented generation (RAG), making the mapping process more interactive. With this approach, we demonstrate that data exploration can be carried out through multiple complementary methods: Linked Data, full-text search, and RAG. The session will conclude with a discussion of the legal and ethical dimensions of applying AI in web archiving, including the GDPR and compliance with EU AI regulations.

Three Hands-on Sessions

The tutorial consists of three parts, each addressing a different stage in the BelgicaWeb workflow: (i) data harvesting; (ii) AI-based cleaning, enrichment, and exploration; (iii) legal reflection. Participants can follow along in partially pre-filled Colab notebooks or simply observe the demonstrations; the shared notebooks and data snapshots will keep everyone in sync and guarantee the progress of the workshop.
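To give a feel for the kind of structured Linked Data the cleaning pipeline targets, the sketch below builds a tiny RDF description of an archived page with rdflib. The vocabulary (Dublin Core terms), the example URL, and the property choices are illustrative assumptions, not the project's actual data model.

```python
# Minimal sketch of a Linked Data target format, assuming Dublin Core
# terms as the vocabulary; the BelgicaWeb data model may differ.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, XSD

g = Graph()
g.bind("dcterms", DCTERMS)

# Hypothetical archived page (illustrative URL, not from the project)
page = URIRef("https://example.be/page")
g.add((page, DCTERMS.title, Literal("Example page", lang="en")))
g.add((page, DCTERMS.language, Literal("nl")))
g.add((page, DCTERMS.date, Literal("2024-05-01", datatype=XSD.date)))

print(g.serialize(format="turtle"))
```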
The first session focuses on the practical aspects of harvesting web data. Participants will explore the automated pipeline (using Heritrix) that generates WARC files from a defined set of seeds and replays them in SolrWayback. An explanatory diagram of the harvesting pipeline will be shared with the audience. The session will conclude by demonstrating how structured metadata (which will be generated in the next session) can be re-integrated into SolrWayback to enhance search and browsing.
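As a companion to the pipeline diagram, the snippet below shows one common way to inspect Heritrix-produced WARC files in Python with the warcio library. This is a generic sketch rather than the BelgicaWeb pipeline code, and the filename is a placeholder.

```python
# Iterate over the response records in a (gzipped) WARC file with warcio.
# "crawl.warc.gz" is a placeholder filename.
from warcio.archiveiterator import ArchiveIterator

with open("crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            payload = record.content_stream().read()
            print(url, len(payload))
```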
The second session introduces the idea of enhancing traditional web archiving workflows with LLMs, which enter the workflow at two points: first, an LLM-based cleaning pipeline that turns the raw WARC data into structured Linked Data; second, retrieval-augmented generation (RAG) over the same data for interactive exploration (see the sketch below).
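The following sketch illustrates the first insertion point: prompting an LLM to turn raw page text into structured metadata. It assumes an OpenAI-compatible client; the model name, prompt, and output fields are illustrative, and the workshop's actual setup may differ.

```python
# Hedged sketch: LLM-based cleaning of raw page text into structured metadata.
# Assumes an OpenAI-compatible endpoint; model name and prompt are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_metadata(page_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Extract the title, language, and publication date "
                        "from the page text. Reply with a JSON object."},
            {"role": "user", "content": page_text[:4000]},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```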
Finally, we will show how AI-based exploration and traditional SPARQL querying can be intertwined to yield complementary insights. Participants will use ready-to-use, partially pre-filled Jupyter notebooks in Google Colab, lowering the barrier to entry for less technical users. Each step will be guided, allowing participants to experiment safely with retrieval, vector databases, chunking, and summarization.
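As a rough illustration of the retrieval side of RAG, the sketch below chunks a text, embeds the chunks with sentence-transformers, and ranks them against a query by cosine similarity. The embedding model and chunk size are assumptions for the example; the workshop notebooks may use a different vector database and embedding model.

```python
# Minimal retrieval sketch for RAG: chunk, embed, rank by cosine similarity.
# The embedding model and chunk size are illustrative choices.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model

def chunk(text: str, size: int = 500) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def top_chunks(query: str, text: str, k: int = 3) -> list[str]:
    chunks = chunk(text)
    # normalize_embeddings=True makes dot products equal cosine similarity
    emb = model.encode(chunks, normalize_embeddings=True)
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = emb @ q
    return [chunks[i] for i in np.argsort(-scores)[:k]]
```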
The final session addresses the legal and ethical dimensions of AI-assisted web archiving. Questions that arise include: Is what we did safe? What safeguards are required to stay within legal boundaries? This presentation will specifically address the legal and ethical implementation issues relating to data protection, copyright, the FAIR/CARE principles, and the responsible use of AI in web archiving. It will take the form of a Q&A based on concrete cases that inspired our reflections, and participants will be invited to join the debate.

Format

The tutorial is designed as a 3-hour technical session with three modular components: (i) data harvesting; (ii) AI-based cleaning, enrichment, and exploration; (iii) legal and ethical reflection.
Target Audience

This tutorial targets professionals in the field of web archiving, particularly developers, digital experts, and others interested in the technical aspects of web archiving and AI. Familiarity with Python and Linked Data will be very helpful, but since we provide many code samples, less experienced participants will also benefit from this workshop. Participants may follow along actively in Colab or simply observe the demonstrations. We anticipate a group of 25–40 participants, allowing for interaction and guided support during the hands-on sessions.

Expected Learning Outcomes

By the end of the tutorial, participants will be able to: (i) run an automated pipeline that harvests web data into WARC files and replays it in SolrWayback; (ii) apply LLM-based methods to clean, enrich, and explore web archive data; and (iii) recognize the legal and ethical considerations of applying AI in web archiving.
Technical Requirements
Main Topic

AI-enabled workflows for web archiving.

Keywords

Web archiving, SolrWayback, Large Language Models, Data Cleaning, Information Retrieval, Linked Data
