Conference Agenda
Session Overview
WORKSHOP: BELGICAWEB [PART 1]
Presentations
Web Archiving, with a little help from my LLM friends

Ghent University, Belgium; KBR Royal Library of Belgium; Université de Namur, Belgium

BelgicaWeb is a two-year BRAIN 2.0 project funded by BELSPO that aims to safeguard and promote Belgium’s born-digital heritage by making it FAIR (Findable, Accessible, Interoperable, and Reusable). Through the development of a user-friendly access platform and API, the project addresses sustainable access, data enrichment using technologies such as Linked Data and NLP, and legal frameworks around data sharing, AI, and privacy. It brings together experts from KBR, Ghent University, and the University of Namur, and actively engages users to shape its design and functionality.

In this tutorial, we will demonstrate a complete web archiving pipeline and show how it can be augmented with AI-based methods, mainly large language models (LLMs), to extend existing workflows. Participants will first see how our current BelgicaWeb pipeline automatically creates and replays web archives using SolrWayback. The resulting WARC files will then be processed with an LLM-based data-cleaning pipeline that turns the raw data into structured Linked Data. The same raw data can also be explored using retrieval-augmented generation (RAG), making the mapping process more interactive. With this approach, we demonstrate that data exploration can be carried out through multiple complementary methods: Linked Data, full-text search, and RAG. The session will conclude with a discussion of the legal and ethical dimensions of applying AI in web archiving, including the GDPR and compliance with EU AI regulations.

Three Hands-on Sessions

The tutorial consists of three parts, each addressing a different stage in the BelgicaWeb workflow: (i) data harvesting; (ii) AI-based cleaning, enrichment, and exploration; (iii) legal reflection. Participants can follow along in partially pre-filled Colab notebooks or simply observe the demonstrations; the shared notebooks and data snapshots will keep everyone in sync and guarantee the progress of the workshop.
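To give a feel for the kind of structured Linked Data the cleaning pipeline targets, the sketch below builds a tiny RDF description of an archived page with rdflib. The vocabulary (Dublin Core terms), the example URL, and the property choices are illustrative assumptions, not the project's actual data model.

```python
# Minimal sketch of a Linked Data target format, assuming Dublin Core
# terms as the vocabulary; the BelgicaWeb data model may differ.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, XSD

g = Graph()
g.bind("dcterms", DCTERMS)

# Hypothetical archived page (illustrative URL, not from the project)
page = URIRef("https://example.be/page")
g.add((page, DCTERMS.title, Literal("Example page", lang="en")))
g.add((page, DCTERMS.language, Literal("nl")))
g.add((page, DCTERMS.date, Literal("2024-05-01", datatype=XSD.date)))

print(g.serialize(format="turtle"))
```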
The first session focuses on the practical aspects of harvesting web data. Participants will explore the automated pipeline (using Heritrix) that generates WARC files from a defined set of seeds and replays them in SolrWayback. An explanatory diagram of the harvesting pipeline will be shared with the audience. The session will conclude by demonstrating how structured metadata (which will be generated in the next session) can be re-integrated into SolrWayback to enhance search and browsing.
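As a companion to the pipeline diagram, the snippet below shows one common way to inspect Heritrix-produced WARC files in Python with the warcio library. This is a generic sketch rather than the BelgicaWeb pipeline code, and the filename is a placeholder.

```python
# Iterate over the response records in a (gzipped) WARC file with warcio.
# "crawl.warc.gz" is a placeholder filename.
from warcio.archiveiterator import ArchiveIterator

with open("crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            payload = record.content_stream().read()
            print(url, len(payload))
```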
The second session introduces the idea of enhancing traditional web archiving workflows with LLMs, which enter the workflow at two points: first, an LLM-based cleaning pipeline that turns the raw WARC data into structured Linked Data; second, retrieval-augmented generation (RAG) over the same data for interactive exploration (see the sketch below).
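The following sketch illustrates the first insertion point: prompting an LLM to turn raw page text into structured metadata. It assumes an OpenAI-compatible client; the model name, prompt, and output fields are illustrative, and the workshop's actual setup may differ.

```python
# Hedged sketch: LLM-based cleaning of raw page text into structured metadata.
# Assumes an OpenAI-compatible endpoint; model name and prompt are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_metadata(page_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Extract the title, language, and publication date "
                        "from the page text. Reply with a JSON object."},
            {"role": "user", "content": page_text[:4000]},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```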
Finally, we will show how AI-based exploration and traditional SPARQL querying can be intertwined to yield complementary insights. Participants will use ready-to-use, partially pre-filled Jupyter notebooks in Google Colab, lowering the barrier to entry for less technical users. Each step will be guided, allowing participants to experiment safely with retrieval, vector databases, chunking, and summarization.
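As a rough illustration of the retrieval side of RAG, the sketch below chunks a text, embeds the chunks with sentence-transformers, and ranks them against a query by cosine similarity. The embedding model and chunk size are assumptions for the example; the workshop notebooks may use a different vector database and embedding model.

```python
# Minimal retrieval sketch for RAG: chunk, embed, rank by cosine similarity.
# The embedding model and chunk size are illustrative choices.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model

def chunk(text: str, size: int = 500) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def top_chunks(query: str, text: str, k: int = 3) -> list[str]:
    chunks = chunk(text)
    # normalize_embeddings=True makes dot products equal cosine similarity
    emb = model.encode(chunks, normalize_embeddings=True)
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = emb @ q
    return [chunks[i] for i in np.argsort(-scores)[:k]]
```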
The final session addresses the legal and ethical dimensions of AI-assisted web archiving. Questions that arise include: Is what we did safe? What safeguards are required to stay within legal boundaries? This presentation will specifically address the legal and ethical implementation issues relating to data protection, copyright, the FAIR/CARE principles, and the responsible use of AI in web archiving. It will take the form of a Q&A based on concrete cases that inspired our reflections, and participants will be invited to join the debate.

Format

The tutorial is designed as a 3-hour technical session with three modular components: (i) data harvesting; (ii) AI-based cleaning, enrichment, and exploration; (iii) legal and ethical reflection.
Target Audience

This tutorial targets professionals in the field of web archiving, particularly developers, digital experts, and others interested in the technical aspects of web archiving and AI. Familiarity with Python and Linked Data will be very helpful, but since we provide many code samples, less experienced participants will also benefit from this workshop. Participants may follow along actively in Colab or simply observe the demonstrations. We anticipate a group of 25–40 participants, allowing for interaction and guided support during the hands-on sessions.

Expected Learning Outcomes

By the end of the tutorial, participants will be able to: (i) run an automated pipeline that harvests web data into WARC files and replays it in SolrWayback; (ii) apply LLM-based methods to clean, enrich, and explore web archive data; and (iii) recognize the legal and ethical considerations of applying AI in web archiving.
Technical Requirements
Main Topic

AI-enabled workflows for web archiving.

Keywords

Web archiving, SolrWayback, Large Language Models, Data Cleaning, Information Retrieval, Linked Data
