Conference Agenda

Please note that all times are shown in the time zone of the conference.

Session Overview

Session

SES-08: QUALITY ASSURANCE

Time: Thursday, 11/May/2023, 4:20pm - 5:30pm

Session Chair: Arnoud Goos, Netherlands Institute for Sound & Vision

Location: Theatre 2

These presentations will be followed by a 10 min Q&A.

Presentations

4:20pm - 4:40pm

The Auto QA process at UK Government Web Archive

Kourosh Feissali, Jake Bickford

The National Archives, United Kingdom

The UK Government Web Archive’s (UKGWA) Auto QA process allows us to carry out enhanced data-driven QA almost completely automatically. This is particularly useful for websites that are high-profile or sites that are about to close. Our Auto QA has several advantages over solely visual QA. The advantages enable us to:

1) Identify problems that are not obvious at the visual QA stage.

2) Identify Heritrix errors during the crawl. These include -2 and -6 errors. Once identified, we re-run Heritrix on the affected URIs.

3) Identify and patch URIs that Heritrix could not discover.

4) Identify, test, and patch hyperlinks inside PDFs. Many PDFs contain hyperlinks to a page on the parent website or to other websites, and sometimes the only way to reach those pages is through a link in a PDF, which most crawlers can't normally access.

Auto QA consists of three separate processes:

1) ‘Crawl Log Analysis’ (CLA), which runs automatically on every crawl. CLA examines Heritrix crawl logs and looks for errors. It then tests the URIs that produced those errors against the live web.

2) ‘Diffex’, which compares what Heritrix discovered with the output of another crawler, such as Screaming Frog. This identifies URIs that Heritrix did not discover. Diffex then tests those URIs against the live web and, if they are valid, adds them to a patchlist.

3) ‘PDFflash’, which extracts PDF URIs from Heritrix crawl logs. It then parses the PDFs, looks for hyperlinks within them, and tests those hyperlinks against the live web, our web archives, and our in-scope domains. If a hyperlink’s target serves a 404, it is added to our patchlist, provided it meets certain conditions such as scoping criteria.
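The first two processes can be illustrated with a minimal Python sketch. It is not the UKGWA implementation, which the abstract does not show. It assumes the usual whitespace-separated Heritrix crawl.log layout, in which the second field is the fetch status code and the fourth is the URI. The function names failed_uris and undiscovered are hypothetical, and treating -2 and -6 as the set of codes to retry is an assumption taken from the list of advantages above.

```python
from typing import Iterable, List, Set

# Status codes named in the abstract; using them as the retry set is an assumption.
RETRY_CODES: Set[int] = {-2, -6}


def failed_uris(log_lines: Iterable[str], codes: Set[int] = RETRY_CODES) -> List[str]:
    """Return URIs whose fetch status code is in `codes`, in first-seen order.

    Assumes each crawl.log line is whitespace-separated, with the status
    code in the second field and the URI in the fourth.
    """
    seen: Set[str] = set()
    result: List[str] = []
    for line in log_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        try:
            status = int(fields[1])
        except ValueError:
            continue  # skip lines whose second field is not a number
        uri = fields[3]
        if status in codes and uri not in seen:
            seen.add(uri)
            result.append(uri)
    return result


def undiscovered(heritrix_uris: Iterable[str], other_uris: Iterable[str]) -> List[str]:
    """Return URIs found by the second crawler but absent from the Heritrix set."""
    return sorted(set(other_uris) - set(heritrix_uris))
```

Both functions only produce candidate lists. Testing each candidate against the live web before adding it to a patchlist, as the abstract describes, would be a separate step and is left out here.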

UKGWA’s Auto QA is a highly efficient and scalable system that complements visual QA, and we are in the process of making it open source.

4:40pm - 5:00pm

The Human in the Machine: Sustaining a Quality Assurance Lifecycle at the Library of Congress

Grace Bicho, Meghan Lyon, Amanda Lehman

Library of Congress, United States of America

This talk will build upon information shared during the IIPC WAC 2022 session "Building a Sustainable Quality Assurance Lifecycle at the Library of Congress" (Thomas and Lyon).

The work to develop a sustainable and effective quality assurance (QA) ecosystem is ongoing and the Library of Congress Web Archiving Team (WAT) is constantly working to improve and streamline workflows. The Library’s web archiving QA goals are structured around Dr. Reyes Ayala’s framework for quality measurements of web archives based in Grounded Theory (Reyes Ayala). During last year’s session, we described how the WAT satisfies the two dimensions of Relevance and Archivability, with some automated processes built in to help the team do its work. We also introduced our idea for Capture Assessment to satisfy the Correspondence dimension of Dr. Reyes Ayala’s framework.

In July 2022, the WAT launched the Capture Assessment workflow internally and invited curators of web archives content at the Library to review captures of their selected content. To best communicate issues of Correspondence quality between the curatorial librarians and the WAT, we instituted a rubric where curatorial librarians can ascribe a numeric value to convey quality information from various angles about a particular web capture, alongside a checklist of common issues to easily note.

The WAT held an optional training alongside the launch, and since then, there have been over 90 responses from a handful of curatorial librarians, including one power user. The WAT has found responses to be mostly actionable for correction in future crawls. We’ve also seen that Capture Assessments are performed on captures that wouldn’t necessarily be flagged via other QA workflows, which gives us confidence that a wider swath of the archive is being reviewed for quality.

The session will share more details about the Capture Assessment workflow and, in time for the 2023 WAC session, we intend to complete a small, early analysis of the Capture Assessment responses to share with the wider web archiving community.

Reyes Ayala, B. Correspondence as the primary measure of information quality for web archives: a human-centered grounded theory study. Int J Digit Libr 23, 19–31 (2022). https://doi.org/10.1007/s00799-021-00314-x