11:50am - 12:10pm
The Potentials and Challenges for Researchers and Web Archives Using the Persistent Web IDentifier (PWID)
Caroline Nyvang, Eld Zierau
Royal Danish Library
To live up to good research practice, researchers need to be able to make persistent references to content in web archives. In some cases, different types of persistent identifiers can be used. However, for references to web archive pages or elements, which need to be resolvable for more than 50 years, the Persistent Web IDentifier (PWID) is often the best choice. Many referencing guidelines and standards recommend that references to web archives be made via an archived URL. This is a challenge not only for closed web archives, but also for web archives that change the addresses of their web archive data. For instance, this happened when the Irish web archive migrated its holdings from an Internet Memory Foundation (IMF) platform (http://collection.europarchive.org/nli/) to the Archive-It web archiving service (https://archive-it.org/home/nli). It will also be the case for web archives that change their archive URLs due to changes related to the Wayback Machine.
The PWID resolves many of the known issues with common identifiers because it is based on basic web archive metadata: the web archive, the archival time of the web element, the archived URL of the web element, and the precision or inherited interpretation of the PWID, such as page or part/file. Thus, once the web archive is identified, the archival time and archived URL can be used to find the resource, since these metadata are present in WARC files. Finally, the information about the interpretation/precision of the resource can be used to choose the manifestation of the page and the access to the resource. This means that resolving a PWID does not rely on a separate registry of the contents of a web archive (which can be huge), since the WARC metadata can be indexed (e.g. in CDX or Solr) and this index can support the resolution. Furthermore, the design of the PWID is the result of bridge building between digital humanities researchers, web archivists, persistent identifier experts, Internet experts and others, in order to meet the requirements of being human readable, persistent, technology agnostic, global, algorithmically resolvable and accepted as a URN.
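To make the four metadata items concrete, the following is a minimal Python sketch of how a PWID could be split into its parts and turned into a lookup key for an index. The field order (`urn:pwid:` followed by archive, archival time, precision, archived URL), the example value, the names `Pwid`, `parse_pwid` and `cdx_lookup_key`, and the 14-digit timestamp form are all assumptions made for illustration; none of them is quoted from the PWID definition or from any particular archive's index.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone

# Assumed layout: urn:pwid:<archive>:<UTC time>:<precision>:<archived URL>
# This layout is an illustration, not the normative PWID syntax.
PWID_RE = re.compile(
    r"^urn:pwid:(?P<archive>[^:]+):"
    r"(?P<time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z):"
    r"(?P<precision>page|part):"
    r"(?P<url>.+)$"
)


@dataclass(frozen=True)
class Pwid:
    archive_id: str          # which web archive holds the element
    archival_time: datetime  # when the element was archived (UTC)
    precision: str           # "page" or "part" (hypothetical value set)
    archived_url: str        # the URL that was archived


def parse_pwid(urn: str) -> Pwid:
    """Split a PWID-like URN into its four metadata items."""
    m = PWID_RE.match(urn)
    if m is None:
        raise ValueError(f"not a PWID in the assumed layout: {urn!r}")
    when = datetime.strptime(m["time"], "%Y-%m-%dT%H:%M:%SZ")
    return Pwid(
        archive_id=m["archive"],
        archival_time=when.replace(tzinfo=timezone.utc),
        precision=m["precision"],
        archived_url=m["url"],
    )


def cdx_lookup_key(pwid: Pwid) -> tuple:
    """Return (URL, 14-digit timestamp), a key shape assumed for a CDX-style index."""
    return (pwid.archived_url, pwid.archival_time.strftime("%Y%m%d%H%M%S"))
```

Because the archive, time and URL are all carried in the identifier itself, a resolver along these lines would only need the archive's existing index, not a separate registry.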
Using the PWID, researchers gain a sustainable way to persistently address web elements. Web archives can benefit from the PWID too, both with regard to implementing support for researchers and in the creation of the web archive when there are several manifestations of a web page. For example, the British Library web archive uses the PWID when archiving snapshots of web pages. Furthermore, since a PWID URN is a URI, it can be used as is wherever a URI identifier is required, e.g. for WARC identifiers. The PWID can become even more useful for researchers when it is incorporated into reference tools such as Zotero.
The presenters will discuss their different perspectives as a researcher within the humanities and as a computer scientist and web archivist. The presentation will cover challenges and experiences from each perspective, as well as future potential in terms of support for, and expansion of, the PWID URN definition.
12:10pm - 12:30pm
Arquivo.pt CitationSaver: Preserving Citations for Online Documents
Pedro Gomes, Daniel Gomes
Arquivo.pt, Portugal
Scientific documents, whether books or articles, reference web addresses (URLs) to cite documents published online. In the case of scientific articles, the importance of these citations is even greater in order to maintain the integrity of an investigation because they often reference fundamental information to allow the reproducibility of an experiment or analysis. For example, links in a scientific article can cite datasets, software or web news that supported the research and which are not included in the text of the scientific article.
However, documents published online disappear very quickly. This means that the citations contained in a scientific document, which are fundamental to guaranteeing its scientific validity, become invalid.
Arquivo.pt is a research infrastructure that provides tools to preserve and exploit data from the web to meet the needs of scientists and ordinary citizens. Our mission is to provide digital infrastructures to support the academic and scientific community. However, until now, Arquivo.pt has focused on collecting data from websites hosted under the .PT domain, which is not enough to guarantee the preservation of relevant content for the academic and scientific community.
In response to the need to preserve the integrity of scientific documents and other documents that cite documents published online, Arquivo.pt has created a new project called CitationSaver. CitationSaver, available at arquivo.pt/citationsaver, automatically extracts the links cited in a document and preserves their content (e.g. web pages cited in a book) so that they can be retrieved later from Arquivo.pt.
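The link extraction step can be illustrated with a minimal Python sketch. It is not the CitationSaver implementation: the regular expression, the trailing-punctuation rule and the function name `extract_cited_urls` are assumptions chosen for illustration, and the text of the document is assumed to have been extracted already.

```python
import re
from typing import List

# Assumed pattern: http(s) URLs running up to whitespace, quotes or closing brackets.
URL_RE = re.compile(r"""https?://[^\s<>"')\]]+""", re.IGNORECASE)

# Punctuation that usually belongs to the sentence rather than to the URL.
TRAILING_PUNCTUATION = ".,;:!?"


def extract_cited_urls(text: str) -> List[str]:
    """Return the distinct URLs found in text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_RE.finditer(text):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Each URL returned by such a step could then be submitted for archiving, so that the cited content remains retrievable after the live page changes or disappears.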
This presentation will detail the context that led to the need to create the CitationSaver service and how it works. The CitationSaver service allows users to help select and immediately preserve relevant information published online before it is altered.
In addition, we will demonstrate how, using the APIs provided by the Open Science ecosystem, we can automatically identify scientific documents and data published online to be preserved. For instance, we used the API from RCAAP (Open Access Scientific Repositories in Portugal) to get all scientific publications and used our system to extract more than 10 million URLs.
12:30pm - 12:50pm
Integration of Bit Preservation for Web Archives, Using the Open Source BitRepository.org Framework
Rasmus Kristensen, Mathias Jensen, Colin Rosenthal, Eld Zierau
Royal Danish Library
The Royal Danish Library has used the open source BitRepository.org framework as the basis for bit preservation of Danish cultural heritage for more than ten years. Until 2022, this was the case for all digital materials except the web archive materials, whose bit preservation relied on the NetarchiveSuite archival module. However, this module had several disadvantages, since it only supported bit preservation with two online copies and one checksum copy. A third copy therefore had to be a backup copy, detached from the active bit preservation in which copies can be regularly checked and compared (via checksums). In the late 2010s, the library wanted to modernize the bit preservation platform for the Danish web archive, which resulted in an integration of NetarchiveSuite and the BitRepository.org framework. This made it possible to have only one online copy of the web archive, to have three copies all included in the active bit preservation, to have numerous checksum copies, and to achieve better independence between the copies, and thus a reduced risk of incidents destroying all copies.
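The idea of regularly checking and comparing copies via checksums can be sketched in a few lines of Python. This is an illustration only, not how the BitRepository.org framework is implemented: the majority-vote rule, the choice of SHA-256, and the names `sha256_of` and `audit_copies` are assumptions.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def sha256_of(data: bytes) -> str:
    """Checksum of one stored copy (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def audit_copies(checksums: Dict[str, str]) -> List[str]:
    """Return the names of copies whose checksum disagrees with the majority.

    checksums maps a copy name (hypothetical, e.g. "copy_a") to the checksum
    reported for that copy. A strict majority is required; otherwise no copy
    can be trusted automatically and an error is raised.
    """
    if not checksums:
        return []
    counts = Counter(checksums.values())
    majority_value, votes = counts.most_common(1)[0]
    if votes * 2 <= len(checksums):
        raise ValueError("no majority checksum; manual inspection needed")
    return sorted(name for name, value in checksums.items() if value != majority_value)
```

With only two copies, a disagreement cannot be settled by comparison alone, which is one way to see why having three or more copies in the active bit preservation, plus additional checksum copies, reduces risk.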
This presentation will present the capabilities of the BitRepository.org framework and how it can support advanced active bit preservation for web archives in general. The main theme of the presentation will be bit preservation: how BitRepository.org enables storage of copies on all types of current and future media, and how it is technology agnostic in the sense that software and media technologies can be changed rather easily over time. It will also be presented how the BitRepository.org framework supports daily bit preservation operations, and how it enables setups with high access possibilities as well as providing a basis for high operational security at all levels. Furthermore, the experiences of using the flexibility of the setup will be presented, as well as the experiences of the integration with NetarchiveSuite, and how it now supports S3 and can thus be integrated with many other web collection solutions.