IIPC General Assembly and Web Archiving Conference 2023

Session

SES-05: COVID-19 COLLECTIONS

Time:

Thursday, 11/May/2023:

2:40pm - 3:50pm

Session Chair: Kees Teszelszky, KB, National Library of the Netherlands

Location: Theatre 1

These presentations will be followed by a 10 min Q&A.

Presentations

2:40pm - 3:00pm

The UK Government Web Archive (UKGWA): Measuring the impact of our response to the COVID-19 pandemic

Tom Storrar

The National Archives, United Kingdom

The COVID-19 pandemic, the first pandemic of the digital age, has presented an enormous challenge to our web archiving practice. As the official archive of the UK government, we were tasked with building a comprehensive archive of the UK government's online response to the emergency. To meet this challenge we have devised new archiving strategies ranging from supplementary broad, keyword-driven crawling to focused, data-driven, daily captures of the UK’s official “Coronavirus (COVID-19) in the UK” data dashboard. We have also massively increased our rates of capture. The challenge has demanded creativity, adaptation and a great deal of effort.

All of this work prompted us to think of a number of questions that we’d like to answer: How complete is the record we captured in our web archive and how much is this a result of the extra effort we made? How could we perform meaningful analysis on the enormous numbers of HTML and non-HTML resources? What contributions have these innovations made to this outcome and how can these inform our practice going forward?

To tackle these questions we needed to analyse millions of captured resources in our web archive. It soon became clear that we’d only be able to achieve the level of insight needed by developing an entire end-to-end analysis system. The resulting pipeline we designed and built uses a combination of familiar and novel concepts and approaches; we used the WARC file content, along with CDX APIs, but we also developed a set of heuristics, and custom algorithms, all ultimately populating a database that allowed us to run queries to give us the answers we sought. Running an entirely cloud-based system enabled this work as we were at that time unable to reliably access our office.
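As a rough illustration of the kind of pipeline described above, the sketch below loads CDX-style index records into a database and runs two simple queries. It is not the UKGWA system: the seven-field, space-separated record layout, the sample records under example.org, the table name `captures` and the function names are all assumptions made for this example.

```python
import sqlite3

# Hypothetical CDX-style records (assumed layout, space-separated):
# urlkey timestamp original mimetype statuscode digest length
SAMPLE_CDX = """\
org,example)/covid 20200401120000 https://example.org/covid text/html 200 AAA111 5120
org,example)/covid 20200402120000 https://example.org/covid text/html 200 BBB222 5300
org,example)/covid 20200402180000 https://example.org/covid text/html 200 BBB222 5300
org,example)/guidance.pdf 20200402130000 https://example.org/guidance.pdf application/pdf 200 CCC333 88000
"""


def build_db(cdx_text):
    """Parse CDX-style lines and load them into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE captures (urlkey TEXT, ts TEXT, original TEXT, "
        "mimetype TEXT, status INTEGER, digest TEXT, length INTEGER)"
    )
    rows = []
    for line in cdx_text.splitlines():
        parts = line.split()
        if len(parts) != 7:
            continue  # skip blank or malformed lines
        urlkey, ts, original, mimetype, status, digest, length = parts
        rows.append((urlkey, ts, original, mimetype, int(status), digest, int(length)))
    conn.executemany("INSERT INTO captures VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def captures_per_day(conn):
    """Count captures per calendar day (first 8 digits of the timestamp)."""
    return conn.execute(
        "SELECT substr(ts, 1, 8), COUNT(*) FROM captures "
        "GROUP BY substr(ts, 1, 8) ORDER BY substr(ts, 1, 8)"
    ).fetchall()


def distinct_versions(conn):
    """Count distinct content digests per URL, a rough proxy for how often a page changed."""
    return conn.execute(
        "SELECT original, COUNT(DISTINCT digest) FROM captures "
        "GROUP BY original ORDER BY original"
    ).fetchall()


if __name__ == "__main__":
    db = build_db(SAMPLE_CDX)
    print(captures_per_day(db))
    print(distinct_versions(db))
```

Comparing capture counts with distinct digests is one simple way to separate capture effort from actual change in the record; the heuristics and custom algorithms mentioned in the abstract are not specified there and are not reproduced here.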

This presentation will provide an overview of the approaches used, the results we found and the areas for further development. We believe that these tools can be applied to our overall web archive collections and hope that other institutions will find our experience useful when thinking about analysing their own collection and quantifying the impact of their efforts.

3:00pm - 3:20pm

Women and COVID through Web Archives. How to explore the pandemic through a collaborative, interdisciplinary research approach

Susan Aasman¹, Karin de Wild², Joshgun Sirajzade³, Fréderic Clavert³, Valerie Schafer³, Sophie Gebeil⁴, Niels Brügger⁵

¹University of Groningen, The Netherlands; ²Leiden University, The Netherlands; ³University of Luxembourg, Luxembourg; ⁴Aix-Marseille University, France; ⁵Aarhus University, Denmark

The COVID crisis has been a shared, collective, worldwide experience since March 2020, and many voices have echoed one another, whether on grief, lockdown, masks and vaccines, or homeschooling. However, this unprecedented crisis has also deepened asymmetries and failures within societies in terms of occupational fields, economic inequalities, and access to health and sanitary services, and the inventory of hidden and more visible gaps reinforced during the crisis could be extended further. Women and gender were also at stake in this health crisis, whether in discussions of the better management of the crisis by female politicians, domestic violence during lockdown, the decreasing production of papers by female research scientists, or homeschooling and the mental load on women.

As a cohort team within the Archives Unleashed Team (AUT) program, the European research AWAC2 team benefited from privileged access to this collection, thanks to Archive-It and through ARCH, and from regular mentorship by the AUT team. This allowed us to investigate and analyse this huge collection of 5.3 TB, with 161,757 lines in the domain frequency CSV and 8,738,751 lines in the CSV of web page plain text. In December 2021, our AWAC2 team submitted several topics to the IIPC (International Internet Preservation Consortium) community and invited the international organization to select one of them that the team would investigate in depth, based on the unique IIPC COVID collection of web archives. Women, gender, and COVID was the winning topic.

Accepting the challenge, the AWAC2 team organized a datathon in March 2022 in Luxembourg to investigate and retrieve the many traces of women, gender and COVID in web archives, while mixing close and distant reading. Since then, the team has been working on the dataset to further explore the opportunities for computational methods for reading at scale. In this presentation, we will reflect on technical, epistemological, and methodological challenges and present some results as well.
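A minimal sketch of what a first distant-reading pass over a plain-text derivative might look like is given below. It is not the AWAC2 team's method: the column names `domain` and `content`, the keyword list, the function name and the three sample rows are assumptions invented for illustration, and a real derivative file would be streamed from disk rather than held in a string.

```python
import csv
import io
import re
from collections import Counter

# Invented sample rows; the column names are assumed, not taken from the dataset.
SAMPLE_CSV = """domain,content
example.org,"Women reported a heavier mental load during lockdown."
example.org,"Vaccination centres open on Monday."
example.net,"Gender gaps in research output widened; women published less."
"""

# Hypothetical search terms for the topic.
KEYWORDS = ("women", "gender")


def count_matching_pages(csv_text, keywords):
    """Count, per domain, the pages whose text mentions at least one keyword."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    per_domain = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if pattern.search(row["content"]):
            per_domain[row["domain"]] += 1
    return per_domain


if __name__ == "__main__":
    for domain, pages in count_matching_pages(SAMPLE_CSV, KEYWORDS).most_common():
        print(domain, pages)
```

Counts of this kind only point to where relevant material may be concentrated; the pages they surface would still need close reading.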

3:20pm - 3:40pm

Surveying the landscape of COVID-19 web collections in European GLAM institutions

Nicola Bingham¹, Friedel Geeraert², Caroline Nyvang³, Karin de Wild⁴

¹British Library, United Kingdom; ²KBR (Royal Library of Belgium); ³Royal Danish Library; ⁴Leiden University

The aim of the WARCnet network [https://cc.au.dk/en/warcnet/about] is to promote high-quality national and transnational research that will help us to understand the history of (trans)national web domains and of transnational events on the web, drawing on the increasingly important digital cultural heritage held in national web archives. Within the context of this network, a survey was conducted to see how cultural heritage institutions are capturing the COVID-19 crisis for future generations. The aim of the survey was to map the scope and collection strategies of COVID-19 Web collections with a main focus on Europe. The survey was managed by the British Library and was conducted by means of the Snap survey platform. It circulated between June and September 2022 among mainly European GLAM institutions and 61 responses were obtained.

The purpose of this presentation is to provide an overview of the different collection development practices used when curating COVID-19 collections. On the one hand, the results may help GLAM institutions gain further insight into how to curate COVID-19 Web collections or identify potential partners. On the other hand, revealing the scope of these Web collections may also encourage humanists and data scientists to unlock the potential of these archived Web sources to further understand international developments on the Web during the COVID-19 pandemic.

More concretely, the presentation will provide further insight into the local, regional, national or global scopes of the different COVID-19 collections, the type of content that is included in the collections, the available metadata, the selection criteria that were used when curating the collections and the efforts that were made to create inclusive collections. The temporality of the collections will also be discussed by highlighting the start, and, if applicable, end dates of the collections and the capture frequency. Quality control and long-term preservation are two further elements that will be discussed during the presentation.
