This lightning talk aims to tempt and challenge participants of the IIPC Web Archiving Conference 2023 to exchange ideas, assumptions and knowledge about validating WARC files and using WARC validation tools.
In 2021 we wrote an information sheet about WARC validation. During our (desk) research it became clear that most of our (inter)national colleagues who archive websites do not use WARC validation tools. Why not?
Most heritage institutions, national libraries and archives focus on safeguarding as much online content as possible before it disappears, based on an organizational selection policy. Their other goal is to give access to the captured information as completely and quickly as possible, both to general users and to researchers. Both goals are of course at the core of web archiving initiatives!
It seems that little attention is given to quality-control aspects such as checking the technical validity of WARC files. Or are there other reasons not to pay much attention to this aspect?
We would like to share some of our findings after deploying several tools for processing WARC files: JHOVE, JWAT, Warcat and Warcio. More tools are available, but in our opinion these four are the most commonly used, mature and actively maintained tools that can check or validate WARC files.
In our research into WARC validation, we noticed that some tools are validation tools that check conformance to WARC standard ISO 28500 and others ‘only’ check block and/or payload digests. Most tools support version 1.0 of the WARC standard (of 2009). Few support version 1.1 (of 2017).
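To make the difference with full conformance checking concrete, a digest check on its own can be sketched in a few lines of Python. This is a minimal illustration and not how any of the four tools is implemented. It assumes that the declared digest has the labelled form `algorithm:value`, with the value encoded as Base32 or as hexadecimal, and it only handles SHA-1 and SHA-256. The function names are hypothetical and chosen for this example only.

```python
import base64
import hashlib

# Assumption: only these two algorithms are handled in this sketch.
SUPPORTED = {"sha1", "sha256"}


def warc_style_digest(data: bytes, algorithm: str = "sha1") -> str:
    """Return a labelled digest of the form 'sha1:<BASE32>' for the given bytes."""
    raw = hashlib.new(algorithm, data).digest()
    return f"{algorithm}:{base64.b32encode(raw).decode('ascii')}"


def digest_matches(declared: str, data: bytes) -> bool:
    """Recompute the digest named in `declared` and compare it with the declared value."""
    algorithm, _, value = declared.partition(":")
    algorithm = algorithm.lower()
    if not value or algorithm not in SUPPORTED:
        return False  # missing label or unsupported algorithm: cannot be verified here
    raw = hashlib.new(algorithm, data).digest()
    # Accept both encodings assumed above: Base32 and hexadecimal.
    candidates = {base64.b32encode(raw).decode("ascii"), raw.hex().upper()}
    return value.upper() in candidates


if __name__ == "__main__":
    block = b"example record block"
    declared = warc_style_digest(block)
    print(declared, digest_matches(declared, block))
```

A check like this only tells us whether the bytes still match the digest that was written at capture time. It says nothing about whether the record headers and structure conform to ISO 28500, which is why the two kinds of tools complement each other.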
Another conclusion is that there is no one WARC validation tool ‘to rule them all’, so using a combination of tools will probably be the best strategy for now.