Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only sessions at that day or location. Please select a single session for detailed view (with abstracts and downloads if available). To only see the sessions for 3 May's Online Day, select "Online" for location.

Only Sessions at Location/Venue 
Session Overview
Thursday, 11/May/2023:
1:30pm - 3:30pm

Location: Labs Room 1 (workshops)

Pre-registration required for this event.

Show help for 'Increase or decrease the abstract text size'

Describing Collections with Datasheets for Datasets

Emily Maemura1, Helena Byrne2

1University of Illinois; 2British Library, United Kingdom

Significant work in web archives scholarship has focused on addressing the description and provenance of collections and their data. For example, Dooley et al. (2018) propose recommendations for descriptive metadata, and Maemura et al. (2018) develop a framework for documenting elements of a collection’s provenance. Additionally, documentation of the data processing and curation steps towards generating a corpus for computational analysis are described extensively in Brügger (2021), Brügger, Laursen & Nielsen (2019) and Brügger, N., Nielsen, J., & Laursen, D. (2020). However, looking beyond libraries, archives, or cultural heritage settings provides alternative forms for the description of data. One approach to the challenge of describing large datasets comes from the field of machine learning where Gebru et al. (2018, 2021) propose developing “Datasheets for Datasets,” a form of short document answering a standard set of questions arranged by stages of the data lifecycle.

This workshop explores how web archives collections can be described using the framework provided by Datasheets for Datasets. Specifically, this work builds on the template for datasheets developed by Gebru et al. that is arranged into seven sections: Motivation; Composition; Collection Process; Preprocessing/Cleaning/Labeling; Use; Distribution; and, Maintenance. The workflow they present includes a total of 57 questions to answer about a dataset, focusing on the specific needs of machine learning researchers. We consider how these questions can be adopted for the purposes of describing web archives datasets. Participants will consider and assess how each question might be adapted and applied to describe datasets from the UK Web Archive curated collections. After a brief description of the Datasheets for Datasets framework, we will break into small groups to perform a card-sorting exercise. Each group will evaluate a set of questions from the Datasheets framework and assess them using the MoSCoW technique, sorting questions into categories of Must, Should, Can’t, and Won’t have. Groups will then describe their findings from the card-sorting exercise in order to generate a broader discussion of priorities and resources available for generating descriptive metadata and documentation for public web archives datasets.

Format:120 minute workshop where participants will do a card sorting activity in small groups to review the practicalities of the Datasheets for Datasets Framework when applied to web archives. Ideally participants can prepare by reading through questions prior to the workshop.

We anticipate the following schedule:

  • 5 min: Introduction

  • 15 min: Overview of Datasheets for Datasets

  • 5 min: Overview of UKWA Datasets

  • 60 min: Card-sorting Exercise in small groups

  • 5 min: Comfort Break

  • 20 min: Discussion of small group findings

  • 5 min: Conclusion and Wrap-up

Target Audience: Web Archivists, Researchers

Anticipated number of participants: 12-16

Technical requirements: overhead projector with computer and large tables for a big card sorting activity.

Learning outcomes:

  • Raise awareness of the Datasheets for Datasets Framework in the web archiving community.

  • Understand what type of descriptive metadata web archive experts think should accompany web archive collections published as data.

  • Generate discussion and promote communication between web archivists and research users on priorities for documentation.

Coordinators: Emily Maemura (University of Illinois), Helena Byrne (British Library)

Emily Maemura is an Assistant Professor in the School of Information Sciences at the University of Illinois Urbana-Champaign. Her research focuses on data practices and the activities of curation, description, characterization, and re-use of archived web data. She completed her PhD at the University of Toronto's Faculty of Information, with a dissertation exploring the practices of collecting and curating web pages and websites for future use by researchers in the social sciences and humanities.

Helena Byrne is the Curator of Web Archives at the British Library. She was the Lead Curator on the IIPC Content Development Group 2022, 2018 and 2016 Olympic and Paralympic collections. Helena completed a Master’s in Library and Information Studies at University College Dublin, Ireland in 2015. Previously she worked as an English language teacher in Turkey, South Korea, and Ireland. Helena is also an independent researcher that focuses on the history of women's football in Ireland. Her previous publications cover both web archives and sports history.


Brügger, N. (2021). Digital humanities and web archives: Possible new paths for combining datasets. International Journal of Digital Humanities.

Brügger, N., Laursen, D., & Nielsen, J. (2019). Establishing a corpus of the archived web: The case of the Danish web from 2005 to 2015. In N. Brügger & D. Laursen (Eds.), The historical web and digital humanities: The case of national web domains (pp. 124–142). Routledge/Taylor & Francis Group.

Brügger, N., Nielsen, J., & Laursen, D. (2020). Big data experiments with the archived Web: Methodological reflections on studying the development of a nation’s Web. First Monday.

Dooley, J., & Bowers, K. (2018). Descriptive Metadata for Web Archiving: Recommendations of the OCLC Research Library Partnership Web Archiving Metadata Working Group (p. ). OCLC Research.

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2018). Datasheets for Datasets. ArXiv:1803.09010 [Cs].

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., III, H. D., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92.

Maemura, E., Worby, N., Milligan, I., & Becker, C. (2018). If These Crawls Could Talk: Studying and Documenting Web Archives Provenance. Journal of the Association for Information Science and Technology, 69(10), 1223–1233.