Conference Agenda

Overview and details of the sessions of this conference. Select a date or location to show only the sessions on that day or at that location. Select a single session for a detailed view (with abstracts and downloads, if available). To see only the sessions for 3 May's Online Day, select "Online" as the location.

Please note that all times are shown in the time zone of the conference (CEST).

 
 
 
Session Overview
Location: Labs Room 1 (workshops)
Date: Thursday, 11/May/2023
1:30pm - 3:30pm WKSHP-01: DESCRIBING COLLECTIONS WITH DATASHEETS FOR DATASETS
Location: Labs Room 1 (workshops)
Pre-registration required for this event.
 

Describing Collections with Datasheets for Datasets

Emily Maemura1, Helena Byrne2

1University of Illinois; 2British Library, United Kingdom

Significant work in web archives scholarship has focused on addressing the description and provenance of collections and their data. For example, Dooley et al. (2018) propose recommendations for descriptive metadata, and Maemura et al. (2018) develop a framework for documenting elements of a collection’s provenance. Additionally, documentation of the data processing and curation steps towards generating a corpus for computational analysis is described extensively in Brügger (2021), Brügger, Laursen, & Nielsen (2019), and Brügger, Nielsen, & Laursen (2020). However, looking beyond libraries, archives, or cultural heritage settings provides alternative forms for the description of data. One approach to the challenge of describing large datasets comes from the field of machine learning, where Gebru et al. (2018, 2021) propose developing “Datasheets for Datasets,” a form of short document answering a standard set of questions arranged by stages of the data lifecycle.

This workshop explores how web archives collections can be described using the framework provided by Datasheets for Datasets. Specifically, this work builds on the template for datasheets developed by Gebru et al. that is arranged into seven sections: Motivation; Composition; Collection Process; Preprocessing/Cleaning/Labeling; Use; Distribution; and Maintenance. The workflow they present includes a total of 57 questions to answer about a dataset, focusing on the specific needs of machine learning researchers. We consider how these questions can be adapted for the purposes of describing web archives datasets. Participants will consider and assess how each question might be adapted and applied to describe datasets from the UK Web Archive curated collections. After a brief description of the Datasheets for Datasets framework, we will break into small groups to perform a card-sorting exercise. Each group will evaluate a set of questions from the Datasheets framework and assess them using the MoSCoW technique, sorting questions into categories of Must, Should, Could, and Won't have. Groups will then describe their findings from the card-sorting exercise in order to generate a broader discussion of priorities and resources available for generating descriptive metadata and documentation for public web archives datasets.
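The card sorting itself happens on paper in the workshop; purely as an illustration of the MoSCoW aggregation step, the sketch below tallies per-group sorts into per-question vote counts. The question text and vote data are hypothetical.

```python
from collections import Counter

# MoSCoW categories used in the card-sorting exercise.
CATEGORIES = ("Must", "Should", "Could", "Won't")

def tally_sorts(group_sorts):
    """Aggregate per-group card sorts (question -> category) into
    per-question vote counts across all groups."""
    tallies = {}
    for sort in group_sorts:
        for question, category in sort.items():
            tallies.setdefault(question, Counter())[category] += 1
    return tallies

# Hypothetical results from two groups sorting one datasheet question:
groups = [
    {"Q1: Why was the dataset created?": "Must"},
    {"Q1: Why was the dataset created?": "Should"},
]
print(tally_sorts(groups))
```

Questions with a clear majority of "Must" votes would then seed the group discussion on documentation priorities.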

Format: 120-minute workshop in which participants will do a card-sorting activity in small groups to review the practicalities of the Datasheets for Datasets framework when applied to web archives. Ideally, participants will prepare by reading through the questions prior to the workshop.

We anticipate the following schedule:

  • 5 min: Introduction

  • 15 min: Overview of Datasheets for Datasets

  • 5 min: Overview of UKWA Datasets

  • 60 min: Card-sorting Exercise in small groups

  • 5 min: Comfort Break

  • 20 min: Discussion of small group findings

  • 5 min: Conclusion and Wrap-up

Target Audience: Web Archivists, Researchers

Anticipated number of participants: 12-16

Technical requirements: overhead projector with computer, and large tables for the card-sorting activity.

Learning outcomes:

  • Raise awareness of the Datasheets for Datasets Framework in the web archiving community.

  • Understand what type of descriptive metadata web archive experts think should accompany web archive collections published as data.

  • Generate discussion and promote communication between web archivists and research users on priorities for documentation.

Coordinators: Emily Maemura (University of Illinois), Helena Byrne (British Library)

Emily Maemura is an Assistant Professor in the School of Information Sciences at the University of Illinois Urbana-Champaign. Her research focuses on data practices and the activities of curation, description, characterization, and re-use of archived web data. She completed her PhD at the University of Toronto's Faculty of Information, with a dissertation exploring the practices of collecting and curating web pages and websites for future use by researchers in the social sciences and humanities.

Helena Byrne is the Curator of Web Archives at the British Library. She was the Lead Curator on the IIPC Content Development Group 2022, 2018 and 2016 Olympic and Paralympic collections. Helena completed a Master’s in Library and Information Studies at University College Dublin, Ireland in 2015. Previously she worked as an English language teacher in Turkey, South Korea, and Ireland. Helena is also an independent researcher who focuses on the history of women's football in Ireland. Her previous publications cover both web archives and sports history.

References

Brügger, N. (2021). Digital humanities and web archives: Possible new paths for combining datasets. International Journal of Digital Humanities. https://doi.org/10.1007/s42803-021-00038-z

Brügger, N., Laursen, D., & Nielsen, J. (2019). Establishing a corpus of the archived web: The case of the Danish web from 2005 to 2015. In N. Brügger & D. Laursen (Eds.), The historical web and digital humanities: The case of national web domains (pp. 124–142). Routledge/Taylor & Francis Group.

Brügger, N., Nielsen, J., & Laursen, D. (2020). Big data experiments with the archived Web: Methodological reflections on studying the development of a nation’s Web. First Monday. https://doi.org/10.5210/fm.v25i3.10384

Dooley, J., & Bowers, K. (2018). Descriptive Metadata for Web Archiving: Recommendations of the OCLC Research Library Partnership Web Archiving Metadata Working Group. OCLC Research. https://doi.org/10.25333/C3005C

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2018). Datasheets for Datasets. ArXiv:1803.09010 [Cs]. http://arxiv.org/abs/1803.09010

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92. https://doi.org/10.1145/3458723

Maemura, E., Worby, N., Milligan, I., & Becker, C. (2018). If These Crawls Could Talk: Studying and Documenting Web Archives Provenance. Journal of the Association for Information Science and Technology, 69(10), 1223–1233. https://doi.org/10.1002/asi.24048

 
4:20pm - 5:30pm WKSHP-02: A PROPOSED FRAMEWORK FOR USING AI WITH WEB ARCHIVES IN LAMS
Location: Labs Room 1 (workshops)
Pre-registration required for this event.
 

A proposed framework for using AI with web archives in LAMs

Abigail Potter

Library of Congress, United States of America

There is tremendous promise in using artificial intelligence, and specifically machine learning techniques, to help curators, collections managers, and users understand, use, steward, and preserve web archives. Libraries, archives, museums, and other public cultural heritage organizations that manage web archives share challenges in operationalizing AI technologies and have unique requirements for managing digital heritage collections at a very large scale. Through research, experimentation, and collaboration, the LC Labs team has developed a set of tools to document, analyze, prioritize, and assess AI technologies in a LAM context. This framework is in draft form and in need of additional use cases and perspectives, especially web archives use cases. The facilitators will introduce the framework and ask participants to use it to evaluate their own proposed or in-process ML or AI use case that increases understanding of and access to web archives.

Sharing the framework elements, gathering feedback, and documenting web archives use cases are the goals of the workshop.
Sample elements and prompts from the framework:

- Organizational Profile: How will or does your organization want to use AI or machine learning?

- Define the problem you are trying to solve.

- Write a user story about the AI/ML task or system you are planning or doing.

- Risks and Benefits: What are the benefits and risks to users, staff, and the organization when an AI/ML technology is or will be used?

- What systems or policies will or do the AI/ML task or system impact or touch?

- What are the limitations of future use of any training, target, validation, or derived data?

- Data Processing Plan: What documentation do or will you require when using AI or ML technologies?

- What existing open source or commercial platforms offer pathways into use of AI?

- What are the success metrics and measures for the AI/ML task?

- What are the quality benchmarks for the AI/ML output?

- What could come next?

 
Date: Friday, 12/May/2023
8:30am - 10:00am WKSHP-03: FAKE IT TILL YOU MAKE IT: SOCIAL MEDIA ARCHIVING AT DIFFERENT ORGANIZATIONS FOR DIFFERENT PURPOSES
Location: Labs Room 1 (workshops)
Pre-registration required for this event.
 

Fake it Till You Make it: Social Media Archiving at Different Organizations for Different Purposes

Susanne van den Eijkel1, Zefi Kavvadia2, Lotte Wijsman3

1KB, National Library of the Netherlands; 2International Institute for Social History; 3National Archives of the Netherlands

Abstract

Different organizations, different business rules, different choices. That seems obvious. However, different perspectives can alter the choices that you make, and therefore the results you get, when you are archiving social media. In this tutorial, we would like to zoom in on the different perspectives an organization can have. A perspective can be shaped by the mandate or type of organization, the designated community of an institution, or a specific tool that you use. Therefore, we would like to highlight these influences and how they can affect the results that you get.

When you start with social media archiving, you won’t get the best results right away. It is really a process of trial and error, where you aim for good practice and not necessarily best practice (and is there such a thing as best practice?). With a practical assignment we want to showcase the importance of collaboration between different organizations. What are the worst practices that we have seen so far? What’s best to avoid, and why? What could be a solution? And why is it a good idea to involve other institutions at an early stage?

This tutorial relates to the conference topics of community, research, and tools. It builds on previous work from the Dutch Digital Heritage Network and the BeSocial project from the National Library of Belgium. Furthermore, different tools will be highlighted, and it will be made clear why different tooling can produce different results.

Format

In-person tutorial, 90 minutes.

  • Introduction: who are the speakers, where do they work, introduction on practices related to different organizations.

  • Assignment: participants will do a practical assignment related to social media archiving. They’ll receive personas for different institutions (library, government, archive) and ask themselves the question: how does your own organization's perspective influence the choices you make? We will gather the results on post-its and end with a discussion.

  • Wrap-up: conclusions of discussion.

Target audience

This tutorial is aimed at those who want to learn more about doing social media archiving at their organizations. It is mainly meant for starters in social media archiving, but not necessarily complete beginners (even though they are definitely welcome too!). Potential participants could be archivists, librarians, repository managers, curators, metadata specialists, (research) data specialists, and generally anyone who is or could be involved in the collection and preservation of social media content for their organization.

Expected number of participants: 20-25.

Expected learning outcome(s)

Participants will understand:

  1. Why social media archiving is different from web archiving;
  2. Why different perspectives lead to different choices and results;
  3. How tools can affect the potential perspectives you can work with.

In addition, participants will get insight into:

  1. The different perspectives from which you can do social media archiving;
  2. How different organizations (could) work on social media archiving.

Coordinators

Susanne van den Eijkel is a metadata specialist for digital preservation at the National Library of the Netherlands. She is responsible for all the preservation metadata, writing policies and implementing them. Her main focus is born-digital collections, especially the web archives. She focuses on web material after it has been harvested, rather than on selection and tools, and is therefore more concerned with which metadata and context information is available and relevant for preservation. In addition, she works on the communication strategy of her department, is actively involved in the Dutch Digital Heritage Network, and provides guest lectures on digital preservation and web archiving.

Zefi Kavvadia is a digital archivist at the International Institute of Social History in Amsterdam, the Netherlands. She is part of the institute’s Collections Department, where she is responsible for processing of digital archival collections. She is also actively contributing to research, planning, and improving of the IISH digital collections workflows. While her work covers potentially any type of digital material, she is especially interested in the preservation of born-digital content and is currently the person responsible for web archiving at IISH. Her research interests range from digital preservation and archives, to web and social media archiving, and research data management, with a special focus on how these different but overlapping domains can learn and work together. She is active in the web archiving expert group of the Dutch Digital Heritage Network and the digital preservation interest group of the International Association of Labour History Institutions.

Lotte Wijsman is the Preservation Researcher at the National Archives in The Hague. In her role she researches how we can further develop preservation at the National Archives of the Netherlands and how we can innovate the archival field in general. This includes considering our current practices and evaluating how we can improve these with e.g. new practices and tools. Currently, Lotte is active in research projects concerning subjects as social media archiving, AI, a supra-organizational Preservation Watch function, and environmentally sustainable digital preservation. Furthermore, she is a guest teacher at the Archiefschool and Reinwardt Academy (Amsterdam University of the Arts).

 
10:30am - 12:00pm WKSHP-05: SUPPORTING COMPUTATIONAL RESEARCH ON WEB ARCHIVES WITH THE ARCHIVE RESEARCH COMPUTE HUB (ARCH)
Location: Labs Room 1 (workshops)
Pre-registration required for this event.
 

Supporting Computational Research on Web Archives with the Archive Research Compute Hub (ARCH)

Jefferson Bailey, Kody Willis, Helge Holzmann, Alex Dempsey

Internet Archive, United States of America

Coordinators:

  • Jefferson Bailey, Director of Archiving & Data Services, Internet Archive

  • Kody Willis, Product Operations Manager, Archiving & Data Services, Internet Archive

  • Helge Holzmann, Senior Data Engineer, Archiving & Data Services, Internet Archive

  • An Archives Unleashed member may also coordinate/participate

Format: 90 or 120-minute workshop and tutorial

Target Audience: The target audience is professionals working in digital library services that are collecting, managing, or providing access to web archives, scholars using web archives and other digital collections in their work, library professionals working to support computational access to digital collections, and digital library technical staff.

Anticipated Number of Participants: 25
Technical Requirements: A meeting room with wireless internet access and a projector or video display. Participants must bring laptop computers and there should be power outlets. The coordinators will handle preliminary activities over email and provide some technical support beforehand as far as building or accessing web archives for use in the workshop.

Abstract: Every year more and more scholars are conducting research on terabytes and even petabytes of digital library and archive collections using computational methods such as data mining, natural language processing, and machine learning. Web archives are a significant collection of interest for these researchers, especially due to their contemporaneity, size, multi-format nature, and how they can represent different thematic, demographic, disciplinary, and other characteristics. Web archives also have longitudinal complexity, with frequent changes in content (and often state of existence) even at the same URL, extensive metadata both content-based and transactional, and many characteristics that make them highly suitable for data mining and computational analysis. Supporting computational use of web archives, however, poses many technical, operational, and procedural challenges for libraries. Similarly, while platforms exist for supporting computational scholarship on homogeneous collections (such as digitized texts, images, or structured data), none exist that handle the vagaries of web archive collections while also providing a high level of automation, seamless user experience, and support for both technical and non-technical users.

In 2020, Internet Archive Research Services and the Archives Unleashed project received funding for joint technology development and community building to combine their respective tools for computational analysis of web and digital archives into an end-to-end platform supporting data mining of web archives. The program is also simultaneously building out a community of computational researchers by supporting cohort teams of scholars who receive direct technical support for their projects. The beta platform, Archives Research Compute Hub (ARCH), is currently being used by dozens of researchers in the digital humanities and the social and computer sciences, and by dozens of libraries and archives interested in supporting local researchers and sharing datasets derived from their web collections in support of large-scale digital research methods.

ARCH lowers the barriers to conducting research on web archives, using data processing operations to generate 16 different derivatives from WARC files. Derivatives range in use from graph analysis and text mining to file format extraction, and ARCH makes it possible to visualize, download, and integrate these datasets into third-party tools for more advanced study. ARCH enables analysis of the more than 20,000 web archive collections - over 3 PB of data - collected by over 1,000 institutions using Archive-It, which cover a broad range of subjects and events; ARCH also includes various portions of the overall Wayback Machine global web archive, totalling 50+ PB and going back to 1996.
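ARCH's actual derivative formats are not detailed here, but the general pattern — reducing raw WARC records to a compact, analysis-ready table — can be sketched with standard-library Python. The input below is a list of archived target URIs (the function name and input shape are illustrative, not ARCH's schema):

```python
from collections import Counter
from urllib.parse import urlparse

def domain_frequency(target_uris):
    """Reduce a stream of archived target URIs to a domain-frequency
    table, one common 'derivative' shape for graph and trend analysis."""
    counts = Counter()
    for uri in target_uris:
        host = urlparse(uri).hostname
        if host:
            counts[host] += 1
    return counts

# Illustrative input: WARC-Target-URI values from a small crawl.
uris = [
    "https://example.org/index.html",
    "https://example.org/about",
    "https://iipc.example.net/wac2023",
]
print(domain_frequency(uris).most_common())
```

A table like this, exported as CSV, is the kind of dataset that can then be visualized or loaded into third-party network-analysis tools.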

This workshop will be a hands-on training covering the full lifecycle of supporting computational research on web archives. The agenda will include an overview of the conceptual challenges researchers face when working with web archives, the procedural challenges that librarians face in making web archives available for computational use, and most importantly, will provide an in-depth tutorial on using the ARCH platform and its suite of data analysis, dataset generation, data visualization, and data publishing tools, from the perspectives of a collection manager, a research services librarian, and a computational scholar. Workshop attendees will be able to build small web archive collections beforehand or will be granted access to existing web archive collections to use during the workshop. All participants will also have access to any datasets and data visualizations created as part of the workshop.

Anticipated Learning Outcomes:

Given the conference, we expect the attendees primarily to be web archivists, collection managers, digital librarians, and other library and archives staff. After the workshop, attendees will:

  • Understand the full lifecycle of making web and digital archives available for computational use by researchers, scholars, and others. This includes gaining knowledge of outreach and promotion strategies to engage research communities, how to handle computational research requests, how to work with researchers to scope and refine their requests, how to make collections available as data, how to work with internal technical teams facilitating requests, dataset formats and delivery methods, and how to support researchers in ongoing data analysis and publishing.

  • Gain knowledge of the specific types of data analysis and datasets that are possible with web archive collections, including data formats, digital methods, tools, infrastructure requirements, and the related methodological affordances and limitations for scholarship related to working with web archives as data.

  • Receive hands-on training on using the ARCH platform to explore and analyze web archive collections, from both the perspective of a collection manager and that of a researcher.

  • Be able to use the ARCH platform to generate derivative datasets, create corresponding data visualizations, publish these datasets to open-access repositories, and conduct further analysis with additional data mining tools.

  • Have tangible experience with datasets and related technologies in order to perform specific analytic tasks on web archives, such as exploring graph networks of domains and hyperlinks, extracting and visualizing images and other specific formats, and performing textual analysis and other interpretive functions.

  • Have insights into digital methods through their exposure to a variety of different active, real-life use cases from scholars and research teams currently using the ARCH platform for digital humanities and similar work.

 
1:00pm - 3:00pm WKSHP-06: RUN YOUR OWN FULL STACK SOLRWAYBACK
Location: Labs Room 1 (workshops)
Pre-registration required for this event.
 

Run your own full stack SolrWayback

Thomas Egense, Toke Eskildsen, Jørn Thøgersen, Anders Klindt Myrvoll

Royal Danish Library, Denmark

An in-person, updated version of the ‘21 WAC workshop Run your own full stack SolrWayback:
https://netpreserve.org/event/wac2021-solrwayback-1/

This workshop will

  1. Explain the ecosystem for SolrWayback 4 (https://github.com/netarchivesuite/solrwayback)

  2. Perform a walkthrough of installing and running the SolrWayback bundle. Participants are expected to mirror the process on their own computer and there will be time for solving installation problems

  3. Leave participants with a fully working stack for index, discovery and playback of WARC files

  4. End with open discussion of SolrWayback configuration and features.

Prerequisites:

  • Participants should have a Linux, Mac, or Windows computer with Java 8 or Java 11 installed. To check that Java is installed, type this in a terminal: java -version

  • Downloading the latest release of the SolrWayback Bundle from https://github.com/netarchivesuite/solrwayback/releases beforehand is recommended.

  • Having institutional WARC files available is a plus, but sample files can be downloaded from https://archive.org/download/testWARCfiles

  • A mix of WARC files from different harvests/years will showcase SolrWayback's capabilities in the best way possible.
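The Java prerequisite above can also be checked in a script; the sketch below parses the first line of `java -version` output to extract the major version, handling both the legacy "1.8.x" and modern "11.x" numbering schemes. It is shown against sample strings rather than a live JVM, since version-string formats vary across vendors.

```python
import re

def java_major(version_line):
    """Extract the major Java version from the first line of
    `java -version` output (e.g. 'openjdk version "11.0.2" ...')."""
    m = re.search(r'version "(\d+)(?:\.(\d+))?', version_line)
    if not m:
        return None
    major = int(m.group(1))
    # Legacy scheme: "1.8.0_292" means Java 8.
    if major == 1 and m.group(2):
        major = int(m.group(2))
    return major

# SolrWayback requires Java 8 or Java 11:
for line in ('openjdk version "1.8.0_292"',
             'openjdk version "11.0.2" 2019-01-15'):
    print(java_major(line), java_major(line) in (8, 11))
```

In practice the input would come from `java -version 2>&1` (the version banner is printed to stderr).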

Target audience:

Web archivists and researchers with medium knowledge of web archiving and tools for exploring web archives. Basic technical knowledge of starting a program from the command line is required; the SolrWayback bundle is designed for easy deployment.
Maximum number of participants: 30


Background

SolrWayback 4 (https://github.com/netarchivesuite/solrwayback) is a major rewrite with a strong focus on improving usability. It provides real-time full text search, discovery, statistics extraction & visualisation, data export, and playback of web archive material. SolrWayback uses Solr (https://solr.apache.org/) as the underlying search engine. The index is populated using Web Archive Discovery (https://github.com/ukwa/webarchive-discovery). The full stack is open source and freely available. A live demo is available at https://webadmin.oszk.hu/solrwayback/

During the conference there will be focused support for SolrWayback in a dedicated Slack channel by Thomas Egense and Toke Eskildsen.

 

 
Contact and Legal Notice · Contact Address:
Privacy Statement · Conference: IIPC WAC 2023
Conference Software: ConfTool Pro 2.6.149
© 2001–2024 by Dr. H. Weinreich, Hamburg, Germany