Conference Agenda


Please note that all times are shown in the time zone of the conference.

Session Overview
OL-SES-08: PANEL: BROWSER-BASED CRAWLING FOR ALL: THE STORY SO FAR
Time: Wednesday, 03/May/2023, 8:00pm - 9:00pm

Session Chair: Meghan Lyon, Library of Congress
Virtual location: Online


Presentations

Browser-Based Crawling For All: The Story So Far

Anders Klindt Myrvoll (1), Andrew Jackson (2), Ben O'Brien (3), Sholto Duncan (3), Ilya Kreymer (4), Lauren Ko (5), Jasmine Mulliken (6), Antares Reich (7), Andreas Predikaka (7)

(1) Royal Danish Library; (2) The British Library, United Kingdom; (3) National Library of New Zealand | Te Puna Mātauranga o Aotearoa; (4) Webrecorder; (5) UNT; (6) Stanford; (7) Austrian National Library

Through the IIPC-funded “Browser-based crawling system for all” project (https://netpreserve.org/projects/browser-based-crawling/), members have been working with Webrecorder to support the development of their new crawling tools: Browsertrix Crawler (a crawl system that automates web browsers to crawl the web) and Browsertrix Cloud (a user interface for controlling multiple Browsertrix Crawler crawls). The IIPC funding is particularly focused on making sure IIPC members can use these tools.
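As a rough, illustrative sketch of what "automates web browsers to crawl the web" means in practice: Browsertrix Crawler is distributed as a Docker image and driven from the command line. The image name and flags below follow Webrecorder's public documentation at the time of writing, but they should be treated as assumptions and checked against the release you deploy; this is a minimal wrapper, not the project's recommended workflow.

```python
# Minimal sketch: launching a single Browsertrix Crawler crawl from Python.
# Assumptions: Docker is available, and the image name "webrecorder/browsertrix-crawler"
# plus the flags --url / --collection / --generateWACZ match the installed release.
import subprocess
from pathlib import Path

def run_browser_based_crawl(seed_url: str, collection: str, crawl_dir: str = "crawls") -> None:
    """Crawl seed_url with a real browser and package the result as a WACZ file."""
    out = Path(crawl_dir).resolve()
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{out}:/crawls/",              # crawler writes WARC/WACZ output here
            "webrecorder/browsertrix-crawler", "crawl",
            "--url", seed_url,                    # seed page to start from
            "--collection", collection,           # name of the output collection
            "--generateWACZ",                     # produce a portable WACZ archive
        ],
        check=True,
    )

if __name__ == "__main__":
    run_browser_based_crawl("https://example.com/", "example-crawl")
```

Browsertrix Cloud then layers a web interface, user accounts and scheduling on top of runs like this, so curators do not have to touch the command line at all.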

This online panel will provide an update on the project, emphasizing the experiences of IIPC members who have been experimenting with the tools. Four institutions that have been exploring Browsertrix Cloud in detail will present their experiences so far: what works well, what works less well, how the development process has gone, and what the longer-term issues might be. The Q&A session will be used to explore the issues raised and to encourage wider engagement and feedback from IIPC members.

Project Update: Anders Klindt Myrvoll & Ilya Kreymer

Anders will present an update from the project leads on what has been achieved since the project started and what the next steps are. We will look at the broad picture as well as the goals, outcomes and deliverables described in the IIPC project description: https://netpreserve.org/projects/browser-based-crawling/

On behalf of Webrecorder, Ilya will outline the wider context, give an update on the status of the project, and share any immediate feedback from the workshop session.

User experience 1 (NZ) Sholto Duncan

Testing Browsertrix Cloud at NLNZ

In recent years the selective web harvesting programme at the National Library of New Zealand has broadened its crawling tools of choice in order to use the best one for the job: from primarily using Heritrix, through WCT, to now also regularly crawling with Webrecorder and Archive-It. This has allowed us to get the best capture possible, but it still falls short when harvesting some of the richer, dynamic, modern websites that are becoming more commonplace.

Other areas within the Library that often use web archiving processes for capturing web content have seen this same need for improved crawling tools. This has provided a range of users and diverse use cases for our Browsertrix Cloud testing. During this presentation we will cover our user experience during this testing.

User experience 2 (UNT) Lauren Ko

Improving the Web Archive Experience

Since 2005, UNT Libraries has mostly carried out harvesting with Heritrix, with a focus on collecting the expiring websites of defunct federal government commissions, carrying out biannual crawls of its own subdomains, and participating in event-based crawling projects. However, in recent years, attempts to better archive increasingly challenging websites and social media have led to supplementing this crawling with a more manual approach using pywb's record mode. Now hosting an instance of Browsertrix Cloud, UNT Libraries hopes to reduce the time spent archiving content that requires browser-based crawling. Additionally, the libraries expect the friendlier user interface that Browsertrix Cloud provides to facilitate its use by more staff in the library, as a teaching tool in a web archiving course in the College of Information, and in a project collaborating with external contributors.

User experience 3 (Stanford) Jasmine Mulliken

Crawling the Complex

Web-based digital scholarship, like the kind produced under Stanford University Press’s Mellon-funded digital publishing initiative (http://supdigital.org), is especially resistant to standard web archiving. Scholars choosing to publish outside the bounds of the print book are finding it challenging to defend their innovatively formatted scholarly research outputs to tenure committees, for example, because of the perceived ephemerality of web-based content. SUP is supporting such scholars by providing a pathway to publication that also ensures the longevity of their work in the scholarly record. This is in part achieved by SUP’s partnership with Webrecorder (https://blog.supdigital.org/sup-webrecorder-partnership/), which has now, using Browsertrix Cloud, produced web-archived versions of all eleven of SUP’s complex, interactive, monograph-length scholarly projects (https://archive.supdigital.org/). These archived publications represent an important use case for Browsertrix Cloud: they speak to the needs of creators of web content who rely on web archiving tools to add lasting value to the work they contribute to the evolving, innovative shape of the scholarly record.

User experience 4 (Austrian National Library) Andreas Predikaka & Antares Reich

Integrating Browsertrix

Since the beginning of its web archiving project in 2008, the Austrian National Library has been using the Heritrix crawler integrated into NetarchiveSuite. For many websites in daily crawls, Heritrix is no longer sufficient, and it is necessary to improve the quality of our crawls. Tests quickly showed that Browsertrix does a very good job of fulfilling this requirement. But it is also important to us that the results of Browsertrix crawls are integrated into our overall working process. By using the Browsertrix API, it was possible to create a proof of concept of the steps necessary for this use case.
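To make the kind of integration described above concrete, the sketch below shows how a workflow step might start a crawl through Browsertrix Cloud's REST API and then poll for the finished WACZ files. The base URL, endpoint paths and JSON field names here are illustrative assumptions, not the documented API of the proof of concept; a real integration should follow the API reference of the deployed Browsertrix version.

```python
# Hypothetical sketch of driving Browsertrix Cloud from an institution's workflow.
# The base URL, endpoint paths and JSON fields are assumptions for illustration;
# consult the API documentation of your own Browsertrix deployment.
import time
import requests

BASE = "https://browsertrix.example.org/api"   # assumed deployment URL
TOKEN = "REPLACE_WITH_ACCESS_TOKEN"            # token obtained after logging in
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def start_crawl(org_id: str, workflow_id: str) -> str:
    """Ask the service to run an existing crawl workflow; return the new crawl id."""
    resp = requests.post(f"{BASE}/orgs/{org_id}/crawlconfigs/{workflow_id}/run", headers=HEADERS)
    resp.raise_for_status()
    return resp.json()["started"]

def wait_for_wacz(org_id: str, crawl_id: str, poll_seconds: int = 60) -> list[str]:
    """Poll until the crawl reaches a final state, then return its WACZ file paths."""
    while True:
        resp = requests.get(f"{BASE}/orgs/{org_id}/crawls/{crawl_id}", headers=HEADERS)
        resp.raise_for_status()
        crawl = resp.json()
        if crawl.get("state") in ("complete", "failed", "canceled"):
            return [res["path"] for res in crawl.get("resources", [])]
        time.sleep(poll_seconds)
```

The returned WACZ paths could then be handed to the library's existing ingest and quality-assurance steps, which is the kind of hand-off the proof of concept explores.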


 