Conference Agenda

Session: WORKSHOP #06: Browser-Based Crawling For All: Introduction to Quality Assurance with Browsertrix Cloud
Time: Thursday, 25/Apr/2024, 1:40pm - 3:00pm
Location: Salle 70 [François-Mitterrand site]


Presentations

Browser-Based Crawling For All: Introduction to Quality Assurance with Browsertrix Cloud

Andrew Jackson (1), Anders Klindt Myrvoll (2), Ilya Kreymer (3)

(1) Digital Preservation Coalition, United Kingdom; (2) Royal Danish Library; (3) Webrecorder, United States of America

Through the IIPC-funded “Browser-based crawling system for all” project (https://netpreserve.org/projects/browser-based-crawling/), members have been working with Webrecorder to support the development of their new crawling tools: Browsertrix Crawler (a crawl system that automates web browsers to crawl the web) and Browsertrix Cloud (a user interface to control multiple Browsertrix Crawler crawls). The IIPC funding is particularly focussed on making sure IIPC members can make use of these tools.

This workshop will be led by Webrecorder, who will invite you to try out an instance of Browsertrix Cloud and explore how it might work for you, and how the latest QA features might help. IIPC members that have been part of the project will be on hand to discuss institutional issues like deployment and integration.

The workshop will start with a brief introduction to high-fidelity web archiving with Browsertrix Cloud. Participants will be given accounts and be able to create their own high-fidelity web archives using the Browsertrix Cloud crawling system during the workshop. Users will be presented with an interactive user interface to create crawls, allowing them to configure crawling options. We will discuss the crawling configuration options and how these affect the result of the crawl. Users will have the opportunity to run a crawl of their own for at least 30 minutes and to see the results.

After a quick break, we will then explore the latest Quality Assurance features of Browsertrix Cloud. This includes ‘patch crawling’: using the ArchiveWeb.Page browser extension to archive difficult pages, and then integrating those results into a Browsertrix Cloud collection.

In the final wrap-up of the workshop, we will discuss challenges, what worked and what didn’t, and what still needs improvement. We will also outline how participants can provide access to the web archives they created, either using standalone tools or by integrating them into their existing web archive collections. IIPC member institutions that have been testing Browsertrix Cloud thus far will share their experiences in this area.

The format of the workshop will be as follows:

  • Introduction to Browsertrix Cloud - 10 min

  • Use Cases and Examples by IIPC project partners - 10 min

  • Break - 5 min

  • Hands-On: Setup and Crawling with Browsertrix Cloud (Including Q&A / help while crawls are running) - 20 min

  • Break - 5 min

  • Hands-On: Quality Assurance with Browsertrix Cloud - 10 min

  • Wrap-Up: Final Q&A / Discuss Access & Integration of Browsertrix Cloud Into Existing Web Archiving Workflows with IIPC project partners - 20 min

Webrecorder will lead the workshop and the hands-on portions of the workshop. The IIPC project partners will present and answer questions about use-cases at the beginning and around integration into their workflows at the end.

Participants should come prepared with a laptop and a couple of sites that they would like to crawl, especially those that are generally difficult to crawl by other means and require a ‘high fidelity approach’ (for example, social media sites or sites that are behind a paywall). Ideally, the sites can be crawled within 30 minutes, though crawls can be interrupted if they run for too long.

This workshop is intended for curators and anyone else who wishes to create and use web archives and is ready to try Browsertrix Cloud on their own with guidance from the Webrecorder team. The workshop does not require any technical expertise beyond basic familiarity with web archiving. The participants’ experiences and feedback will help shape not only the remainder of the IIPC project, but also the long-term future of this new crawling toolset.

The workshop should be able to accommodate up to 70 participants.



 
Conference: IIPC WAC 2024