The UK Government Web Archive’s (UKGWA) Auto QA process allows us to carry out enhanced, data-driven QA almost completely automatically. This is particularly useful for high-profile websites and for sites that are about to close. Auto QA has several advantages over visual QA alone. It enables us to:
1) Identify problems that are not obvious at the visual QA stage.
2) Identify errors that Heritrix encountered during the crawl, such as the -2 and -6 status codes. Once these are identified, we re-run Heritrix on the affected URIs.
3) Identify and patch URIs that Heritrix could not discover.
4) Identify, test, and patch hyperlinks inside PDFs. Many PDFs contain hyperlinks to pages on the parent website or on other websites, and sometimes the only way to reach those pages is through a link inside a PDF, which most crawlers cannot normally access.
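As an illustration of point 2, the sketch below shows one way that failed fetches could be pulled out of a Heritrix crawl log. It is not UKGWA's implementation. It assumes the default Heritrix 3 crawl.log layout, in which whitespace-separated fields carry the fetch status code in the second position and the URI in the fourth. The function name, the sample log lines, and the example.gov.uk URIs are invented for illustration, and the code descriptions in the comments follow the Heritrix status code documentation.

```python
from collections import defaultdict

# Status codes of interest, with meanings as given in the Heritrix
# documentation: -2 = HTTP connect failed,
# -6 = prerequisite domain lookup failed.
RETRY_CODES = (-2, -6)


def find_failed_uris(log_lines, codes=RETRY_CODES):
    """Group URIs by fetch status code, keeping only the given codes.

    Assumes the default crawl.log layout: field 2 is the status code
    and field 4 is the URI.
    """
    failed = defaultdict(list)
    for line in log_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        try:
            status = int(fields[1])
        except ValueError:
            continue  # skip lines whose second field is not a number
        if status in codes:
            failed[status].append(fields[3])
    return dict(failed)


# Invented sample lines in the assumed layout.
sample_log = [
    "2024-01-01T10:00:00.000Z 200 1299 https://www.example.gov.uk/ - - "
    "text/html #001 20240101100000000+120 sha1:AAAA - -",
    "2024-01-01T10:00:01.000Z -2 - https://www.example.gov.uk/report.pdf L "
    "https://www.example.gov.uk/ unknown #002 - - - -",
    "2024-01-01T10:00:02.000Z -6 - https://assets.example.gov.uk/logo.png E "
    "https://www.example.gov.uk/ unknown #003 - - - -",
]

retry_list = find_failed_uris(sample_log)
```

The resulting dictionary maps each status code to the URIs that could be queued for a repeat crawl.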
Auto QA consists of three separate processes:
1) ‘Crawl Log Analysis’ (CLA), which runs automatically on every crawl. CLA examines the Heritrix crawl logs for errors, then tests the affected URIs against the live web.
2) ‘Diffex’, which compares what Heritrix discovered with the output of another crawler, such as Screaming Frog, to identify the URIs that Heritrix did not discover. Diffex then tests those URIs against the live web and adds the valid ones to a patchlist.
3) ‘PDFflash’, which extracts PDF URIs from the Heritrix crawl logs, parses the PDFs, and looks for hyperlinks within them. It tests those hyperlinks against the live web, our web archives, and our in-scope domains. If a hyperlink’s target returns a 404, it is added to our patchlist, provided it meets certain conditions such as scoping criteria.
UKGWA’s Auto QA is a highly efficient and scalable system that complements visual QA, and we are in the process of making it open source.