GLAM institutions (Galleries, Libraries, Archives and Museums) have started to make their digital collections available in forms suitable for computational use, following the Collections as Data principles[1]. The International GLAM Labs Community[2] has explored innovative and creative ways to publish and reuse the content provided by cultural heritage institutions. As part of this work, and as a collaborative effort, the community defined a checklist[3] focused on the publication of collections as data. The checklist provides a set of steps that can be used for creating and evaluating digital collections suitable for computational use. While web archiving institutions and initiatives have been providing access to their collections, ranging from sharing seedlists to derivatives to “cleaned” WARC files, there is currently no standardised checklist for preparing those collections for researchers.
This workshop aims to involve web archiving practitioners and researchers in evaluating whether the GLAM Labs checklist can be adapted for web archive collections. The first part of the workshop will introduce the GLAM checklist, followed by three use cases that show how web archiving teams have been working with their institutions’ Labs to prepare large data packages and corpora for researchers. In the second part of the workshop, we want to involve the audience in identifying the main challenges to implementing the GLAM checklist and in determining which steps require modification so that it can be used successfully for web archive collections.
First use case
The UK Web Archive has recently started to publish the metadata for some of its inactive curated collections as data. This project developed new workflows that use the Datasheets for Datasets framework to provide provenance information on the individual collections that were published as data.
In this presentation, we will highlight how participants can:
- Use Datasheets for Datasets to describe their collections.
- Identify potential research uses for the data sets that were published.
- Gain insights from the lessons learnt phase of the project.
Second use case
Our library recently launched its first Web News Corpus, making more than 1.5 million texts from 268 news websites available for computational analysis through an API. The aim is to facilitate text analysis at scale.[4] This presentation will briefly describe “warc2corpus”, our workflow for turning WARCs into text corpora, which aims to satisfy the FAIR principles while also taking immaterial rights into account.[5]
In this presentation, we will showcase how users can:
- tailor research corpora based on keywords and various metadata,
- visualise general insights,
- exercise different types of ‘distant reading’, both with the Library Labs package for Python and with user-friendly web applications.[6]
Third use case
Our library has been working to refine and improve workflows that enable the creation and publication of web archive data packages for computational research use. With a recently hired Senior Digital Collections Data Librarian, and working with our institution’s Labs, web archiving staff have prepared new data packages for web archive data in response to recent research requests. We will provide some background on this work and on the developments that led to the creation of the data librarian role, and will share details about how we are creating our data packages and sharing derivative datasets with researchers. Using a recent data package release, we will compare local practices in providing data to researchers with the GLAM checklist and talk through the ways in which our institution does or does not comply.
REFERENCES:
[1] Padilla, T. (2017). “On a Collections as Data Imperative”. UC Santa Barbara. pp. 1–8.
[2] https://glamlabs.io/
[3] Candela, G. et al. (2023). “A checklist to publish collections as data in GLAM institutions”. Global Knowledge, Memory and Communication. https://doi.org/10.1108/GKMC-06-2023-0195
[4] Tønnessen, J. (2024). “Web News Corpus”. National Library of Norway. https://www.nb.no/en/collection/web-archive/research/web-news-corpus/
[5] Tønnessen, J., Birkenes, M., Bremnes, T. (2024). “corpus-build”. GitHub. National Library of Norway. https://github.com/nlnwa/corpus-build; Birkenes, M., Johnsen, L., Kåsen, A. (2023). “NB DH-LAB: a corpus infrastructure for social sciences and humanities computing”. CLARIN Annual Conference Proceedings.
[6] “dhlab documentation”. National Library of Norway. https://dhlab.readthedocs.io/en/latest/