
Conference Agenda

Session

WKSHP-04: BROWSER-BASED CRAWLING FOR ALL: GETTING STARTED WITH BROWSERTRIX CLOUD

Time: Friday, 12/May/2023, 8:30am - 10:00am

Location: Theatre 2

Pre-registration required for this event.

Presentations

Browser-Based Crawling For All: Getting Started with Browsertrix Cloud

Andrew N. Jackson¹, Anders Klindt Myrvoll², Ilya Kreymer³

¹The British Library, United Kingdom; ²Royal Danish Library; ³Webrecorder

Through the IIPC-funded “Browser-based crawling system for all” project (https://netpreserve.org/projects/browser-based-crawling/), members have been working with Webrecorder to support the development of their new crawling tools: Browsertrix Crawler (a crawl system that automates web browsers to crawl the web) and Browsertrix Cloud (a user interface to control multiple Browsertrix Crawler crawls). The IIPC funding is particularly focussed on making sure IIPC members can make use of these tools.

This workshop will be led by Webrecorder, who will invite you to try out an instance of Browsertrix Cloud and explore how it might work for you. IIPC members that have been part of the project will be on hand to discuss institutional issues like deployment and integration.

The workshop will start with a brief introduction to high-fidelity web archiving with Browsertrix Cloud. Participants will be given accounts and be able to create their own high-fidelity web archives using the Browsertrix Cloud crawling system during the workshop. Users will be presented with an interactive user interface to create crawls, allowing them to configure crawling options. We will discuss the crawling configuration options and how these affect the result of the crawl. Users will have the opportunity to run a crawl of their own for at least 30 minutes and to see the results. We will then discuss and reflect on the results.

After a quick break, we will discuss how the web archives can be accessed and shared with others, using the ReplayWeb.page viewer. Participants will be able to download the contents of their crawls (as WACZ files) and load them on their own machines. We will also present options for sharing the outputs with others directly, by uploading to an easy-to-use hosting option such as Glitch or our custom WACZ Uploader. Either method will produce a URL which participants can then share with others, in and outside the workshop, to show the results of their crawl. We will discuss how, once complete, the resulting archive is no longer dependent on the crawler infrastructure, but can be treated like any other static file, and, as such, can be added to existing digital preservation repositories.

In the final wrap-up of the workshop, we will discuss challenges, what worked and what didn’t, what still needs improvement, etc. We will also discuss how participants can add the web archives they created into existing web archives that they may already have, and how Browsertrix Cloud can fit into and augment existing web archiving workflows at participants' institutions. IIPC member institutions that have been testing Browsertrix Cloud thus far will share their experiences in this area.

The format of the workshop will be as follows:

Introduction to Browsertrix Cloud - 10 min
Use Cases and Examples by IIPC project partners - 10 min
Break - 5 min
Hands-On: Setup and Crawling with Browsertrix Cloud (Including Q&A / help while crawls are running) - 30 min
Break - 5 min
Hands-On: Replaying and Sharing Web Archives - 10 min
Wrap-Up: Final Q&A / Discuss Integration of Browsertrix Cloud Into Existing Web Archiving Workflows with IIPC project partners - 20 min

Webrecorder will lead the workshop, including its hands-on portions. The IIPC project partners will present and answer questions about use cases at the beginning and about integration into their workflows at the end.

Participants should come prepared with a laptop and a couple of sites that they would like to crawl, especially those that are generally difficult to crawl by other means and require a ‘high fidelity approach’. (Examples include social media sites, sites that are behind a paywall, etc.) Ideally, the sites can be crawled within 30 minutes (though crawls can be interrupted if they run for too long).

This workshop is intended for curators and anyone else who wishes to create and use web archives and is ready to try Browsertrix Cloud on their own with guidance from the Webrecorder team. The workshop does not require any technical expertise besides basic familiarity with web archiving. The participants’ experiences and feedback will help shape not only the remainder of the IIPC project, but also the long-term future of this new crawling toolset.

The workshop should be able to accommodate up to 50 participants.

Conference: IIPC WAC 2023