Conference Agenda

Overview and details of the sessions of this conference. Select a date or location to show only the sessions on that day or at that location. Select a single session for a detailed view.

 
 
Session Overview
Session
SESSION #02: Crawling Tools
Time:
Wednesday, 09/Apr/2025:
2:05pm - 3:40pm

Session Chair: László Tóth, National Library of Luxembourg
Location: Målstova (upstairs)

1 level up from ground floor

Presentations
2:05pm - 2:25pm

Lessons Learned Building a Crawler From Scratch: The Development and Implementation of Veidemann

Marius André Elsfjordstrand Beck

National Library of Norway, Norway

Over the past two decades, web content has become increasingly dynamic. While a long-standing harvesting technology like Heritrix effectively captures static web content, it has significant limitations in capturing dynamic content and discovering the links within it. In response, the Web Archive at the National Library of Norway set out in 2015 to develop a new browser-based web crawler.

This talk will present our experiences and lessons learned from building Veidemann. Building a tool from scratch involves many factors, and we will outline some of the decisions we faced during the process, the unexpected issues that arose, and how we are addressing them.

The talk will present:

  • A high-level view of the design of Veidemann and the factors that influenced it

  • How Veidemann compares to similar projects

  • The pros and cons of using a container-based platform

  • The main issues with the current implementation and possible solutions to them

  • Unexpected results

  • An idea for a different paradigm in the design of such a system

The full cost/benefit analysis of taking on a project of this size and scale is, by the nature of the work, not fully knowable at the start. After nearly a decade in the making, the story of Veidemann is one of pride, hope, hardship, and lessons learned. It is still in production at our institution, harvesting roughly 1 TB of deduplicated content per week, but other similar tools, such as Browsertrix, have distinct advantages in their approach. While the future of Veidemann is uncertain, we would love to share what we have learned so far with the broader community.
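Deduplication of the kind mentioned above is commonly implemented in web archives by recording a digest of each captured payload and writing a lightweight revisit record instead of storing a duplicate copy. A minimal, hypothetical sketch of digest-based deduplication (the function name is illustrative, not Veidemann's actual API):

```python
import hashlib

def make_dedup_filter():
    """Track SHA-1 payload digests seen so far; report whether a
    payload is a duplicate that can be stored as a revisit record."""
    seen = set()

    def is_duplicate(payload: bytes) -> bool:
        digest = hashlib.sha1(payload).hexdigest()
        if digest in seen:
            return True   # already archived: write a revisit record instead
        seen.add(digest)
        return False      # first capture: store the full payload

    return is_duplicate
```

In WARC terms, the second capture of an identical payload would be written as a `revisit` record pointing at the original, which is how weekly harvest volumes stay manageable.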



2:25pm - 2:45pm

Experiences of Using the In-House Developed Collecting Tool ELK

Lauri Ojanen

National Library of Finland, Finland

ELK (an acronym for Elonleikkuukone, which means "harvester" in Finnish) is a tool built in the National Library of Finland's Legal Deposit Services to aid in collecting, managing, and harvesting online materials for the Web Archive. Legal Deposit Services began using ELK in 2018, and since then we have updated it several times to better suit the needs of collectors and harvesters of web materials.

ELK's features include a back catalog of former thematic web harvests, covering the web materials (also known as seeds), cataloging information, and keywords, as well as tools to manage thematic web harvests currently in progress. These features have been developed in collaboration between the collectors and the developers who also harvest the web materials. The aim is a tool in which collectors can easily categorize different web materials, leave notes on how particular materials should be harvested, and keep track of what has and has not been collected. Collectors can also harvest single web pages themselves for quality control, to make sure that pages with dynamic elements can be viewed in the web archive as intended.

ELK is also used as a documentation platform. The easiest way to see curatorial choices, keywords, and the history of the thematic web harvests is to gather them in one place. When that platform is used for everything related to web archiving, we can easily see which themes have been harvested, what sorts of materials were collected previously, and, in the best cases, the curatorial decisions made in those harvests.

By sharing our experiences with an in-house developed tool for collecting web materials, we can help other libraries in their efforts. We will discuss the advantages and disadvantages this approach brings to curating and managing our web collections, and where we would like to see our collections go in the future now that we have used the tool for a while.



2:45pm - 3:05pm

Better Together: Building a Scalable Multi-Crawler Web Harvesting Toolkit

Alex Dempsey, Adam Miller, Kyrie Whitsett

Internet Archive, United States of America

The web is nearly as infinite in its expanse as it is in its diversity. As its volume and complexity continue to grow, high-quality, efficient, and scalable web harvesting methods are more essential than ever. The numerous and varied challenges of web archiving are well known to this community, so it is not surprising that no single tool can perfectly harvest it all. But through open source software collaboration we can build a scalable toolkit that meets some of these challenges.

In this presentation, we will outline some of the many lessons and best practices our institution has drawn from the challenges, requirements, research, and practical experience of collaborating with other memory institutions for over 25 years to meet the harvesting needs of the preservation community.

To demonstrate how some of those challenges can be overcome, we will then walk through a fictional large-scale domain harvest that presents common issues. With each new challenge encountered, we will introduce web harvesting concepts while demonstrating approaches to solving them. Sometimes the best approach is a configuration option in Heritrix, and sometimes it means adding another open source tool to incrementally improve the quality and scale of the campaign. Nothing is perfect, so we will also cover some things to consider when deciding to employ an additional tool.

Some of the challenges we’ll address are:
• Scaling crawls to multiple machines
• How to avoid accidental crawler traps
• Efficiently layering-in browser assisted web crawling
• Handling rich media like video and PDFs
• And more
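Crawler traps, one of the challenges listed above, are often caught with simple URL heuristics before a URI is ever queued. A hypothetical sketch (thresholds and function name are illustrative, not taken from Heritrix) that flags very deep paths or paths repeating the same segment, the classic symptom of calendar and session-ID traps:

```python
from collections import Counter
from urllib.parse import urlparse

def looks_like_trap(url: str, max_repeats: int = 3, max_depth: int = 20) -> bool:
    """Heuristic trap check: reject URLs whose path is implausibly deep
    or repeats the same segment too often (calendar/session-ID traps)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    counts = Counter(segments)
    return any(n > max_repeats for n in counts.values())
```

In Heritrix the same idea is typically expressed declaratively through scope decide rules; a standalone predicate like this is crawler-agnostic and can sit in front of any frontier.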

Heritrix makes a great base for large-scale web crawling, and many in the IIPC community already use it for their web harvests. The presentation will demonstrate tools that complement Heritrix and should be easy to try as add-ons to an existing deployment, but the concepts, and often the tools themselves, are crawler-agnostic.

The presentation is geared to a wide range of experience levels. Anyone curious about what it takes to run a large web harvest will leave with a better understanding, and experienced practitioners will gain insights into technical improvements and strategies for their own harvesting infrastructures.



3:05pm - 3:25pm

Lowering Barriers to Use, Crawling, and Curation: Recent Browsertrix Developments

Tessa Walsh, Ilya Kreymer

Webrecorder, United States of America

As the web continues to evolve and web archiving programs develop in their practices and face new challenges, so too must the tools that support web archiving continue to develop alongside them. This talk will provide updates on new features and changes in Browsertrix since last year’s conference that enable web archiving practitioners to capture, curate, and replay important web content better than ever before.

One key new feature that will be discussed is crawling through proxies. Browsertrix now supports crawling through SOCKS5 proxies, which can be located anywhere in the world, regardless of where Browsertrix itself is deployed. With this feature, users can crawl sites from an IP address located in a particular country, or even from an institutional IP range, configuring crawl workflows to use different proxies as desired. This allows web archiving programs to satisfy geolocation requirements for crawling while still taking advantage of cloud-hosted Browsertrix. Proxies also have other concrete uses for web archivists, including avoiding anti-crawling measures and providing publishers with a static IP address for crawling.
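The per-workflow proxy selection described above can be sketched as a simple mapping from workflow to SOCKS5 proxy URL; the workflow names and proxy hosts below are invented for illustration and are not Browsertrix's internal configuration:

```python
from typing import Optional

# Hypothetical mapping of crawl workflows to SOCKS5 proxies.
# Names and hosts are illustrative only.
PROXIES = {
    "domain-crawl-no": "socks5://proxy-oslo.example.org:1080",
    "news-sites":      "socks5://proxy-institution.example.org:1080",
}

def proxy_for_workflow(workflow: str, default: Optional[str] = None) -> Optional[str]:
    """Return the proxy URL a given crawl workflow should use, if any;
    workflows without an entry fall back to the default (direct crawling)."""
    return PROXIES.get(workflow, default)
```

The crawler for each workflow would then be launched with whatever proxy the lookup returns, so one deployment can crawl from several geographic vantage points.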

Similarly, the presentation will discuss changes made that enable users of Browsertrix to configure and use their own S3 buckets for storage. Like proxies, this feature lowers the barriers to using cloud-hosted Browsertrix by enabling institutions to use their own storage infrastructure and meet data jurisdiction requirements without needing to deploy and maintain a self-hosted local instance of Browsertrix.
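Bringing your own bucket typically means supplying an S3 endpoint URL, bucket name, and credentials. A hypothetical sketch of assembling and validating such a configuration (the field names are illustrative, not Browsertrix's actual settings schema):

```python
from urllib.parse import urlparse

def make_s3_storage_config(endpoint_url: str, bucket: str,
                           access_key: str, secret_key: str) -> dict:
    """Assemble a custom S3 storage configuration, validating the
    endpoint URL up front so a misconfigured bucket fails early."""
    parsed = urlparse(endpoint_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"invalid S3 endpoint: {endpoint_url!r}")
    return {
        "endpoint_url": endpoint_url,  # e.g. an institution-run MinIO or S3 service
        "bucket": bucket,
        "access_key": access_key,
        "secret_key": secret_key,
    }
```

Validating the endpoint at configuration time, rather than at upload time, is what lets an institution keep WACZ output inside its own jurisdiction without running the whole application on-premises.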

Other developments will also be discussed, such as improvements to collection features in Browsertrix which better enable web archiving practitioners to curate and share their archives with end users, user interface improvements which make it easier for anyone to get started with web archiving, and improvements to Browsertrix Crawler to ensure websites are crawled at their fullest possible fidelity.



 
Contact and Legal Notice · Contact Address:
Privacy Statement · Conference: IIPC WAC 2025
Conference Software: ConfTool Pro 2.6.153
© 2001–2025 by Dr. H. Weinreich, Hamburg, Germany