Conference Agenda

Overview and details of the sessions of this conference.

Please note that all times are shown in the time zone of the conference (CEST).

 
 
 
Session Overview
Date: Thursday, 11/May/2023
8:30am - 9:30am REGISTRATION/COFFEE
9:30am - 9:45am OPENING REMARKS: Eppo van Nispen, Sound & Vision
Location: Theatre 1
9:45am - 10:45am KEYNOTE: Eliot Higgins, Bellingcat. Introduced and chaired by Johan Oomen, Sound & Vision
Location: Theatre 1
10:45am - 11:00am BREAK
11:00am - 12:30pm SES-01: RESEARCH & ACCESS
Location: Theatre 1
Session Chair: Ditte Laursen, Royal Danish Library
These presentations will be followed by a 10 min Q&A.
 
11:00am - 11:20am

Through the ARCHway: Opportunities to Support Access, Exploration, and Engagement with Web Archives

Samantha Fritz

Archives Unleashed Project, University of Waterloo, Canada

For nearly three decades, memory institutions have consciously archived the web to preserve born-digital heritage. Now, web archive collections range into the petabytes, significantly expanding the scope and scale of data for scholars. Yet research communities face many acute challenges, from the limited availability of analytical tools and community infrastructure to inaccessible research interfaces. The core objective of the Archives Unleashed Project is to lower these barriers and burdens for conducting scalable research with web archives.

Following a successful series of datathon events (2017-2020), Archives Unleashed launched the cohort program (2021-2023) to facilitate opportunities to improve access, exploration and research engagement with web archives.

Borrowing from the hacking genre of events often found within the tech industry, Archives Unleashed datathons were designed to provide an immersive and uninterrupted period of time for participants to work collaboratively on projects and gain hands-on experience working with web archive data. The datathon series cultivated community formation and empowered scholars to build confidence and the skills needed to work with web archives. However, given the short-term nature of datathons, the focused energy and time devoted to research projects diminished once meetings concluded.

Launched in 2021, the Archives Unleashed cohort program was developed as a matured evolution of the datathon model to support research projects. The program ran two iterative cycles and hosted 46 international researchers from 21 unique institutions. Programmatically, researchers engaged in a year-long collaboration project, with web archives featured as a primary data source. The mentorship model has been a defining feature, including direct one-on-one consultation from Archives Unleashed, connections to field experts, and opportunities for peer-to-peer support.

This presentation will reflect on the experiences of engaging with scholars to build scalable analytical tools and deliver a mentorship program to facilitate research with web archives. The cohort program asked researchers to step into an unfamiliar environment with complex data, and they did so with curiosity while embracing opportunities to access, explore, and engage with web archive collections. While the program highlights a broad range of use cases, we seek to inspire the adoption of web archives for scholarly inquiry more commonly across disciplines.



11:20am - 11:40am

‘Research-ready’ collections: challenges and opportunities in making web archive material accessible

Leontien Talboom1, Mark Simon Haydn2

1Cambridge University Libraries, United Kingdom; 2National Library of Scotland, United Kingdom

The Archive of Tomorrow is a collaborative, multi-institutional project led by the National Library of Scotland and funded by the Wellcome Trust collecting information and misinformation around health in the online public space. One of the aims of this project is to create a ‘research-ready’ collection which would make it possible for researchers to access and reuse the themed collections of materials for further research. However, there are many challenges around making this a reality, especially around the legislative framework governing collection of and access to web archives in the UK, and technical difficulties stemming from the emerging platforms and schemas used to catalogue websites.

This talk primarily addresses IIPC 2023's Access and Research themes, while also touching on the Collections and Operations strands in its discussion of a short-term project promising to deliver technical improvements and expanded access to web archives collections by 2023. The presentation explores the difficulties the project encountered and the different ways it offered into the material: exposing insights that can be generated from working with metadata exports outside of collecting platforms; detailing the project’s work in surfacing web archives in traditional library discovery settings through metadata crosswalks; and exploring further possibilities around the use of Jupyter Notebooks for data exploration and the documentation and dissemination of datasets.
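As a toy illustration of the metadata-crosswalk idea, the sketch below maps fields from a hypothetical collection export onto Dublin Core elements; the source field names are invented for this sketch and are not the project's actual schema.

```python
# Hypothetical crosswalk: export field -> Dublin Core element.
# The source field names are invented; real exports will differ.
CROSSWALK = {
    "seed_url": "dc:identifier",
    "title": "dc:title",
    "collection_name": "dcterms:isPartOf",
    "first_capture": "dc:date",
    "curator_notes": "dc:description",
}

def crosswalk(record: dict) -> dict:
    """Translate one export record into Dublin Core key/value pairs."""
    return {dc: record[src] for src, dc in CROSSWALK.items() if record.get(src)}

record = {
    "seed_url": "https://example-health-forum.org/",
    "title": "Example health forum",
    "collection_name": "Archive of Tomorrow",
    "first_capture": "2022-03-01",
}
print(crosswalk(record))
```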

The intended deliverables of this session are to present the tools developed within the project to make web archive material suitable and useful for research; to share frameworks used by the project’s web archivists when navigating the challenges of archiving personal and political health information online; and to discuss the barriers to access around collecting web archive and social media material in a UK context.



11:40am - 12:00pm

Developing new academic uses of web archives collections: challenges and lessons learned from the experimental service deployed at the University of Lille during the ResPaDon Project

Jennifer Morival1, Sara Aubry2, Dorothée Benhamou-Suesser2

1Université de Lille, France; 2Bibliothèque nationale de France, France

2022 marks the second year of the ResPaDon project, undertaken by the BnF (National Library of France) and the University of Lille, in partnership with Sciences Po and Campus Condorcet. The project brings together researchers and librarians to promote and facilitate a broader academic use of web archives by demonstrating the value of web archives and by reducing the technical and methodological barriers researchers may encounter when discovering this source for the first time or when working with such complex materials.

One of the ways to meet the challenges and address new ways of doing research is the implementation of an experimental remote access point to the web archives at the University of Lille. The project team has renewed the offer of tools and conducted outreach to new groups of potential web archive users.

The remote access point to web archives has been deployed in two university libraries in Lille: this service allows both consultation of the web archives in their entirety (44 billion documents, 1.7 PB of data) and exploration of a collection, "The 2002 presidential and local elections", the first collection constituted in-house by the BnF 20 years ago. This collection is now accessible through various tools for data mining, analysis, and data visualization, and the use of those tools is accompanied by guides, reports, examples, and use cases: multiple types of supporting documentation that will also be evaluated on their usefulness as part of the experimentation.

The presentation will focus on the implementation of this access point from both technical and practical aspects. It will address the training of the team of 6 mediators responsible for accompanying the researchers in Lille, as well as the collaboration between the teams in Lille and at the BnF. It will also tackle the challenges of outreach and the path we have taken to communicate within the academic community to find researcher-testers.

We will share the results and lessons learned from this experimentation: the first tests conducted with the researchers have allowed us to obtain feedback on the tools deployed and the improvements to be made to this experimental service.

 
11:00am - 12:30pm SES-02: FINDING MEANING IN WEB ARCHIVES
Location: Theatre 2
Session Chair: Vladimir Tybin, Bibliothèque nationale de France
These presentations will be followed by a 10 min Q&A.
 
11:00am - 11:20am

Leveraging Existing Bibliographic Metadata to Improve Automatic Document Identification in Web Archives

Mark Phillips1, Cornelia Caragea2, Praneeth Rikka1

1University of North Texas, United States of America; 2University of Illinois Chicago, United States of America

The University of North Texas Libraries, partnering with the University of Illinois Chicago (UIC) Computer Science Department, has been awarded a research and development grant (LG-252349-OLS-22) from the Institute of Museum and Library Services in the United States to continue work from previously awarded projects (LG-71-17-0202-17) related to the identification and extraction of high-value publications from large web archives. This work will investigate the potential of using existing bibliographic metadata from library catalogs and digital library collections to better train machine learning models that can assist librarians and information professionals in identifying and classifying high-value publications in large web archives. The project will focus on extracting publications related to state government document collections from the states of Texas and Michigan, with the hope that this approach will enable other institutions to leverage their existing web archives to build traditional digital collections with these publications. This presentation will give an overview of the project, with a description of the approaches the research team is exploring to leverage existing bibliographic metadata in building machine learning models for publication identification from web archives. Early findings from the first year of research will be shared, along with next steps and ways institutions can apply this research to their own web archives.
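As a rough sketch of this idea (not the project's actual models or features), the snippet below trains a simple text classifier on labeled catalog metadata and scores candidate titles from a web archive; the file name and column names are assumptions.

```python
# A minimal sketch, assuming a hypothetical training_metadata.csv with
# "title" and "is_publication" columns; features and model choice are
# illustrative, not the project's actual approach.
import csv

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def load_examples(path):
    """Load (text, label) pairs from a labeled metadata export."""
    texts, labels = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            texts.append(row["title"])
            labels.append(int(row["is_publication"]))
    return texts, labels

texts, labels = load_examples("training_metadata.csv")
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # word and bigram features
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# Score candidate documents found in the web archive.
candidates = ["Annual Financial Report of the State of Texas 2021", "Contact us"]
for title, score in zip(candidates, model.predict_proba(candidates)[:, 1]):
    print(f"{score:.2f}  {title}")
```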



11:20am - 11:40am

Conceptual Modeling of the Web Archiving Domain

Illyria Brejchová

Masaryk University, Czech Republic

Web archives collect and preserve complex digital objects. This complexity, along with the large scope of archived websites and the dynamic nature of web content, makes sustainable and detailed metadata description challenging. Different institutions have taken various approaches to metadata description within the web archiving community, yet this diversity complicates interoperability. The OCLC Research Library Partnership Web Archiving Metadata Working Group took a significant step forward in publishing user-centered descriptive metadata recommendations applicable across common metadata formats. However, there is no shared conceptual model for understanding web archive collections. In my research, I examine three conceptual models from within the GLAM domain, IFLA-LRM created by the library community, CIDOC-CRM originating from the museum community, and RiC-CM stemming from the archive community. I will discuss what insight they bring to understanding the content within web archives and their potential for supporting metadata practices that are flexible, scalable, meet the requirements of the end users, and are interoperable between web archives as well as the broader cultural heritage domain.

This approach sheds light on common problems encountered in metadata description practice in a bibliographic context by modeling archived web resources according to IFLA-LRM and showing how constraints within RDA introduce complexity without providing tools for feasibly representing this complexity in MARC 21. On the other hand, object-oriented models, such as CIDOC-CRM, can represent at least the same complexity of concepts as IFLA-LRM but without many of the aforementioned limitations. By mapping our current descriptive metadata and automatically generated administrative metadata to a single comprehensive model and publishing it as open linked data, we can not only more easily exchange metadata but also provide a powerful tool for researchers to make inferences about the past live web by reconstructing the web harvesting process using log files and available metadata.
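To make the linked-data idea concrete, here is a toy rdflib sketch modeling an archived resource and the crawl that produced it in CIDOC-CRM terms; the class and property choices are simplified assumptions, not the mapping argued for in the research.

```python
# A toy sketch, not the author's actual mapping: an archived web resource
# as a CIDOC-CRM E73 Information Object created by an E65 Creation event
# (the crawl). Property usage is simplified; e.g. P4 strictly takes an
# E52 Time-Span rather than a date literal.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS, XSD

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
g = Graph()
g.bind("crm", CRM)

snapshot = URIRef("https://archive.example.org/resource/20230511/home")  # hypothetical
crawl = URIRef("https://archive.example.org/event/crawl-2023-05")        # hypothetical

g.add((snapshot, RDF.type, CRM["E73_Information_Object"]))
g.add((snapshot, RDFS.label, Literal("Archived homepage snapshot")))
g.add((crawl, RDF.type, CRM["E65_Creation"]))
g.add((crawl, CRM["P94_has_created"], snapshot))
g.add((crawl, CRM["P4_has_time-span"], Literal("2023-05-11", datatype=XSD.date)))

print(g.serialize(format="turtle"))
```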

While the work presented is theoretical, it provides a clearer understanding of the web archiving domain. It can be used to develop even better tools for managing and exploring web archive collections.



11:40am - 12:00pm

Web Archives & Machine Learning: Practices, Procedures, Ethics

Jefferson Bailey

Internet Archive, United States of America

Given their size, complexity, and heterogeneity, web archives are uniquely suited to leverage and enable machine learning techniques for a variety of purposes. On the one hand, web collections increasingly represent a larger portion of the recent historical record and are characterized by longitudinality, format diversity, and large data volumes; this makes them highly valuable in computational research by scholars, scientists, and industry professionals using machine learning for scholarship, analysis, and tool development. Few institutions, however, are yet facilitating this type of access or pursuing these types of partnerships and projects, given the specialized practices, skills, and resources required. At the same time, machine learning tools also have the potential to improve internal procedures and workflows related to web collections management by custodial institutions, from description to discovery to quality assurance. Applying machine learning to web archive workflows, however, also remains a nascent, if promising, area of work for libraries. There is also a “virtuous loop” possible between these two functional areas of access support and collections management, wherein researchers utilizing machine learning tools on web archive collections can create technologies that then have internal benefits for the custodial institutions that granted access to their collections. Finally, spanning both external researcher uses and internal workflow applications is an intricate set of ethical questions posed by machine learning techniques. Internet Archive has been partnering with both academic and industry research projects to support the use of web archives in machine learning projects by these communities. Simultaneously, IA has also explored prototype work applying machine learning to internal workflows for improving the curation and stewardship of web archives. This presentation will cover the role of machine learning in supporting data-driven research, the successes and failures of applying these tools to various internal processes, and the ethical dimensions of deploying this emerging technology in digital library and archival services.



12:00pm - 12:20pm

From Small to Scale: Lessons Learned on the Requirements of Coordinated Selective Web Archiving and Its Applications

Balázs Indig1,2, Zsófia Sárközi-Lindner1,2, Mihály Nagy1,2

1Eötvös Loránd University, Department of Digital Humanities, Budapest, Hungary; 2National Laboratory for Digital Humanities, Budapest, Hungary

Today, web archiving is measured on an increasingly large scale, pressuring newcomers and independent researchers to keep up with the pace of development and maintain an expensive ecosystem of expertise and machinery. These dynamics involve a fast and broad collection phase, resulting in a large pool of data, followed by a slower enrichment phase consisting of cleaning, deduplication and annotation.

Our streamlined methodology for specific web archiving use cases combines mainstream practices with new open-source tools. Our custom crawler conducts selective web archiving for portals (e.g. blogs, forums, currently applied to Hungarian news providers), using the taxonomy of the given portal to systematically extract all articles exclusively into portal-specific WARC files. As articles have uniform portal-dependent structure, they can be transformed into a portal-independent TEI XML format individually. This methodology enables assets (e.g. video) to be archived separately on demand.

We focus on textual content, which, if extracted from traditional web archives, would require resource-intensive filtering. Alternatives like trafilatura are limited to automatic content extraction, often yielding invalid TEI or incomplete metadata, unlike our semi-automatic method. The resulting data are deposited by grouping portals under specific DOIs, enabling fine-grained access and version control.
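A minimal sketch of the WARC-to-TEI step might look like the following; warcio and lxml are real libraries, but the portal-specific selectors and the TEI skeleton are simplified assumptions rather than the project's actual converter.

```python
# A simplified sketch: read HTML responses from a portal-specific WARC
# and emit a bare-bones TEI document per article. Selectors and the TEI
# skeleton are assumptions; the real pipeline is richer than this.
from warcio.archiveiterator import ArchiveIterator
from lxml import etree, html

def extract_article(doc):
    """Portal-specific extraction; the XPath selectors here are hypothetical."""
    title = (doc.xpath("string(//h1)") or "").strip()
    paras = [p.text_content().strip() for p in doc.xpath("//article//p")]
    return title, [p for p in paras if p]

def to_tei(title, paragraphs):
    tei = etree.Element("TEI", xmlns="http://www.tei-c.org/ns/1.0")
    header = etree.SubElement(tei, "teiHeader")
    file_desc = etree.SubElement(header, "fileDesc")
    title_stmt = etree.SubElement(file_desc, "titleStmt")
    etree.SubElement(title_stmt, "title").text = title
    body = etree.SubElement(etree.SubElement(tei, "text"), "body")
    for p in paragraphs:
        etree.SubElement(body, "p").text = p
    return etree.tostring(tei, pretty_print=True, encoding="unicode")

with open("portal.warc.gz", "rb") as stream:  # hypothetical portal-specific WARC
    for record in ArchiveIterator(stream):
        if record.rec_type != "response" or record.http_headers is None:
            continue
        if "text/html" not in (record.http_headers.get_header("Content-Type") or ""):
            continue
        doc = html.fromstring(record.content_stream().read())
        title, paras = extract_article(doc)
        if title and paras:
            print(to_tei(title, paras))
```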

With almost 3 million articles from more than 20 portals, we developed a library for executing common tasks on these files, including NLP and format conversion, to overcome the difficulties of interacting with the TEI standard. To provide access to our archive and gain insights through faceted search, we created a lightweight trend viewer application to visualize text and descriptive metadata.

Our collaborations with researchers have shown that our approach makes it easy to merge coordinated separate crawls, promoting small archives created by different researchers, who may have lower technical skills, into a comprehensive collection that can in some respects serve as an alternative to mainstream archives.

Balázs Indig, Zsófia Sárközi-Lindner, and Mihály Nagy. 2022. Use the Metadata, Luke! – An Experimental Joint Metadata Search and N-gram Trend Viewer for Personal Web Archives. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, pages 47–52, Taipei, Taiwan. Association for Computational Linguistics.

 
12:30pm - 1:30pm LUNCH
1:30pm - 2:30pm SES-03 (PANEL): INSTITUTIONAL WEB ARCHIVING INITIATIVES TO SUPPORT DIGITAL SCHOLARSHIP
Location: Theatre 1
Session Chair: Martin Klein, Los Alamos National Laboratory
 

Institutional Web Archiving Initiatives to Support Digital Scholarship

Martin Klein1, Emily Escamilla2, Sarah Potvin3, Vicky Rampin4, Talya Cooper4

1Los Alamos National Laboratory, United States of America; 2Old Dominion University, United States of America; 3Texas A&M University, United States of America; 4New York University, United States of America

Panel description:
Scholarship happens on the web but unlike more traditional output such as scientific papers in PDF format, we are still lacking comprehensive institutional web archiving approaches to capture increasingly prominent scholarly artifacts such as source code, datasets, workflows, and protocols. This panel will feature scholars from three different institutions - Old Dominion University, Texas A&M University, and New York University - that will provide an overview of their explorations in investigating the use of scholarly artifacts and their (in-)accessibility on the live web. The panelists will further outline how these findings inform institutional collection policies regarding such artifacts, web archiving efforts aligned with institutional infrastructure, and outreach and education opportunities for students and faculty. The panel will conclude with an interactive discussion while welcoming input and feedback from the WAC audience.

Individual:

Emily:

Title: Source Code Archiving for Scholarly Publications

Abstract:

Git Hosting Platforms (GHPs) are commonly used by software developers and scholars to host source code and data to make them available for collaboration and reuse. However, GHPs and their content are not permanent. Gitorious and Google Code are examples of GHPs that are no longer available even though users deposited their code expecting an element of permanence. Scholarly publications are well-preserved due to current archiving efforts by organizations like LOCKSS, CLOCKSS, and Portico; however, no analogous effort has yet emerged to preserve the data and code referenced in publications, particularly the scholarly code hosted online in GHPs. The Software Heritage Foundation is working to archive public source code, but issue threads, pull requests, wikis, and other features that add context to the source code are not currently preserved. Institutional repositories seek to preserve all research outputs which include data, source code, and ephemera; however, current publicly available implementations do not preserve source code and its associated ephemera, which presents a problem for scholarly projects where reproducibility matters. To discuss the importance of institutions archiving scholarly content like source code, we first need to understand the prevalence of source code within scholarly publications and electronic theses and dissertations (ETDs). We analyzed over 2.6 million publications across three categories of sources: preprints, peer-reviewed journals, and ETDs. We found that authors are increasingly referencing the Web in their scholarly publications with an average of five URIs per publication in 2021, and one in five arXiv articles included at least one link to a GHP. In this panel, we will discuss some of the questions that result from these findings such as: Are these GHP URIs still available on the live Web? Are they available in Software Heritage? Are they available in web archives and if so, how often and how well are they archived?
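The link-extraction step behind such counts can be sketched as follows; the domain list, regex, and sample text are illustrative assumptions, not the study's actual tooling.

```python
# A minimal sketch of counting Git hosting platform (GHP) references in
# publication text; the domain list and sample text are assumptions.
import re
from collections import Counter

GHP_DOMAINS = ("github.com", "gitlab.com", "bitbucket.org", "sourceforge.net")
URI_RE = re.compile(r"https?://[^\s)\]>\"']+")

def ghp_uris(text):
    """Return the GHP URIs found in a publication's full text."""
    return [u for u in URI_RE.findall(text)
            if any(d in u.lower() for d in GHP_DOMAINS)]

sample = ("Code is available at https://github.com/example/repo "
          "(see also https://doi.org/10.1000/xyz).")
counts = Counter(u.split("/")[2].lower() for u in ghp_uris(sample))
print(counts)  # Counter({'github.com': 1})
```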

Sarah:

Title: Designing a Sociotechnical Intervention for Reference Rot in Electronic Theses

Abstract:

Intertwined publication and preservation practices have become widespread in the establishment of institutional digital repositories and libraries’ stewardship of institutional research output, including open educational resources and electronic theses and dissertations. Most digital preservation work seeks to preserve a whole text, like a dissertation, in a digital form. This presentation reports on an ongoing research effort - a collaboration with Klein, Potvin, Katherine Anders, and Tina Budzise-Weaver - intended to prevent potential information loss within the thesis, through interventions that can be integrated into trainings and thesis management tools. This approach draws on research into graduate training and citation practices, web archiving, open source software development, and digital collection stewardship with a goal of recommending systematized sociotechnical interventions to prevent reference rot in institutionally-hosted graduate theses. Findings from qualitative surveys and interviews conducted at Texas A&M University on graduate student perceptions of reference rot will be detailed.

Vicky/Talya

Title: Collaborating on Software Archiving for Institutions

Abstract:

Inarguably, software and code are part of our scholarly record. Software preservation is a necessary prerequisite for long-term access and reuse of computational research across many fields of study. Open research software is shared on the Web most commonly via Git hosting platforms (GHPs), which are excellent for fostering open source communities and research transparency, and which add useful features on top such as wikis, continuous integration, and merge requests and issue threads. However, the source code and the useful scholarly ephemera (e.g. wikis) are archived separately, often by “breadth over depth” approaches. I’ll discuss the Collaborative Software Archiving for Institutions (CoSAI) project from NYU, LANL, ODU, and OCCAM, which is addressing this pressing need to provide machine-repeatable, human-understandable workflows for preserving web-based scholarship, scholarly code in particular, alongside the components that make it most useful. I’ll present the results of ongoing efforts in the three main streams of work: 1) technical development of open source, community-led tools for collecting, curating, and preserving open scholarship with a focus on research software; 2) community building around open scholarship, software collection and curation, and archiving of open scholarship; and 3) optimizing workflows for archiving open scholarship with ephemera, via machine-actionable and manual workflows.

 
1:30pm - 2:30pm SES-04 (PANEL): SOLRWAYBACK: BEST PRACTICE, COMMUNITY USAGE & ENGAGEMENT
Location: Theatre 2
Session Chair: Thomas Langvann, National Library of Norway
 

SolrWayback: Best practice, community usage and engagement

Thomas Egense1, László Tóth2, Youssef Eldakar3, Sara Aubry4, Anders Klindt Myrvoll1

1Royal Danish Library (KB); 2National Library of Luxembourg (BnL); 3Bibliotheca Alexandrina (BA); 4National Library of France (BnF)

Panel description

This panel will focus on the status quo of SolrWayback, implementations of SolrWayback, and where it is heading in the future, including the growing open source community adopting SolrWayback and contributing to the tool's development, making it more resilient.

Thomas Egense will give an update on the current development and the flourishing user community and some thoughts on making SolrWayback even more resilient in the future.

László Tóth will talk about the National Library of Luxembourg (BnL)'s development of a fully automated archiving workflow comprising the capture, indexing and playback of Luxembourgish news websites. The solution combines the powerful features of SolrWayback, such as full-text search, wildcard search, category search and more, with the high playback quality of PyWb.

Youssef Eldakar will present how SolrWayback has enhanced the way researchers can search and view content in the 18 IIPC special collections, and bring up some considerations about scaling the system.

Sara Aubry will present how the National Library of France (BnF) has been using SolrWayback to give research teams the possibility to explore, analyze and visualize specific collections. She will also share how BnF contributed to the application's development, including the extension of its data visualization features.

Thomas Egense: Increasing community interactions and the near future of SolrWayback

During the last year, the number of community interactions, such as direct email questions and bug or feature requests posted on GitHub and Jira, has increased every week. It is indeed good news that so many libraries, institutions and researchers have already embraced SolrWayback, but to keep up this momentum, more community engagement will be welcome for this open source project.

By submitting a feature request or bug report on GitHub, you will help prioritize the changes that will benefit users the most, so do not hold back. More programmers for the backend (Java) or frontend (GUI) would speed up the development of SolrWayback.

Recently, BnF helped improve some of the visualization tools by allowing shorter time intervals instead of years. For newly established collections this is a much more useful visualization. It is a good example of how the needs of a new collection, just one year old, differ from those of a collection with 25 years of web harvests; the improvement was not in our focus, yet it proved very useful.

In the very near future, I expect that more time will be spent supporting new users attempting to implement SolrWayback. The hybrid setup combining SolrWayback with PyWb for playback also seems to be the direction many choose to go. Finally, large collections will run into a Solr scaling problem that can be solved by switching to SolrCloud. There is a need for better documentation and workflow support in the SolrWayback bundle for this scaling issue.

László Tóth: A Hybrid SolrWayback-PyWb playback system with parallel indexing using the Camunda Workflow Engine

Within the framework of its web archiving programme, the National Library of Luxembourg (BnL) has developed a fully automated archiving workflow comprising the capture, indexing and playback of Luxembourgish news websites.

Our workflow design takes into account several key features such as the efficiency of crawls (both in time and space) and of the indexing processes, all while providing high quality end user experience. In particular, we have chosen a hybrid approach for the playback of our archived content, making use of several well-known technologies in the field.


Our solution combines the powerful features of SolrWayback such as full-text search, wildcard search, category search and so forth, with the high playback quality of PyWb (for instance its ability to handle complex websites, in particular with respect to POST requests). Thus, once a website is harvested, the corresponding WARC files are indexed in both systems. Users are then able to perform fine-tuned searches using SolrWayback and view the chosen pages using PyWb. This also means that we need to store our indexes in two different places: the first is within an OutbackCDX indexing server connected to our PyWb instance, the second is a larger Solr ecosystem put in place specifically for SolrWayback. This parallel indexing process, together with the handling of the entire workflow from start to finish, is handled by the Camunda Workflow Engine, which we have configured in a highly flexible manner.
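A rough sketch of this dual registration (leaving out the Camunda orchestration) might look like the following; the tool invocations, flags, and endpoints are assumptions rather than BnL's actual configuration.

```python
# A sketch of indexing one WARC into both systems: CDX lines go to
# OutbackCDX (whose HTTP API accepts POSTed CDX records), and the same
# WARC is handed to a Solr indexing job for SolrWayback. Tool names and
# flags below are assumptions; BnL orchestrates these steps with Camunda.
import subprocess
import requests

WARC = "news-site-2023-05-11.warc.gz"  # hypothetical capture

# 1) Generate CDX for the capture and register it with OutbackCDX,
#    so PyWb (backed by OutbackCDX) can replay the pages.
cdx = subprocess.run(
    ["cdxj-indexer", "--cdx11", WARC],  # flag is an assumption; any CDX generator works
    capture_output=True, text=True, check=True,
).stdout
requests.post("http://localhost:8080/news", data=cdx.encode("utf-8")).raise_for_status()

# 2) Index the same WARC into the Solr core that SolrWayback searches
#    (here via the webarchive-discovery warc-indexer, invoked externally).
subprocess.run(
    ["java", "-jar", "warc-indexer.jar",
     "-s", "http://localhost:8983/solr/netarchivebuilder", WARC],
    check=True,
)
```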


This way, we can quickly respond to new requirements, or even to small adjustments such as new site-specific behaviors. All of our updates, including new productive tasks or workflows, can be deployed on-the-fly without needing any downtime. This combination of technologies allows us to provide a seamless and automated workflow together with an enjoyable user experience. We will present the integrated workflow with Camunda and how users interact with the whole system.


Youssef Eldakar: Where We Are a Year Later with the IIPC Collections and Researcher Access through SolrWayback

One year ago, we presented a joint effort, spanning the IIPC Research Working Group, the IIPC Content Development Working Group, and Bibliotheca Alexandrina, to republish the IIPC collections for researcher access through alternative interfaces, namely, LinkGate and SolrWayback.


This effort aims to re-host the IIPC collections, originally harvested on Archive-It, at Bibliotheca Alexandrina with the purpose of offering researchers the added value of being able to explore a web archive collection as a temporal graph with the data indexed in LinkGate, as well as search the full text of a web archive collection and run other types of analyses with the data indexed in SolrWayback.


At the time of last year's presentation, the indexing of the 18 collections, with a total compressed size of approximately 30 TB, for publishing through both LinkGate and SolrWayback was at an early stage. As part of this panel on SolrWayback, one year later, we present an update on what is now available to researchers after the progress made on indexing and tuning the deployment, focusing on showcasing access to the data through the different tools found in the SolrWayback user interface.


We also present a brief technical overview of how the underlying deployment has changed to meet the demands of scaling up to the growing volume of data, and we share thoughts on next steps. See the republished collections at https://iipc-collections.bibalex.org/ and the presentation from 2022.

Sara Aubry: SolrWayback at the National Library of France (BnF): an exploration tool for researchers and the web archiving team's engagement in its evolution

With the opening of its DataLab in October 2021 and the ResPaDon project (which will also be presented during the WAC), the BnF web archiving team is currently concentrating on the development of services, tools, methods and documentation to ease the understanding and appropriation of web archives for research. The underlying objective is to provide the research community, along with information professionals, with a diversity of tools dedicated to building, exploring and analyzing web corpora. Among all the tools we have tested with researchers, SolrWayback holds a particular place because it is simple to handle and offers rich functionality. Beyond a first contact with the web archives, it allows researchers to question and analyze the focused collections to which it gives access. This presentation will focus on researcher feedback on SolrWayback, how the application promotes the development of skills with web archives, and how we accompany researchers in the use of this application. We will also present how research use and feedback have led us to contribute to the development of this open source tool.


 
1:30pm - 3:30pm WKSHP-01: DESCRIBING COLLECTIONS WITH DATASHEETS FOR DATASETS
Location: Labs Room 1 (workshops)
Pre-registration required for this event.
 

Describing Collections with Datasheets for Datasets

Emily Maemura1, Helena Byrne2

1University of Illinois; 2British Library, United Kingdom

Significant work in web archives scholarship has focused on addressing the description and provenance of collections and their data. For example, Dooley et al. (2018) propose recommendations for descriptive metadata, and Maemura et al. (2018) develop a framework for documenting elements of a collection’s provenance. Additionally, documentation of the data processing and curation steps towards generating a corpus for computational analysis is described extensively in Brügger (2021), Brügger, Laursen & Nielsen (2019), and Brügger, Nielsen & Laursen (2020). However, looking beyond libraries, archives, and cultural heritage settings provides alternative forms for the description of data. One approach to the challenge of describing large datasets comes from the field of machine learning, where Gebru et al. (2018, 2021) propose “Datasheets for Datasets,” a short document answering a standard set of questions arranged by stages of the data lifecycle.

This workshop explores how web archives collections can be described using the framework provided by Datasheets for Datasets. Specifically, this work builds on the template for datasheets developed by Gebru et al., which is arranged into seven sections: Motivation; Composition; Collection Process; Preprocessing/Cleaning/Labeling; Use; Distribution; and Maintenance. The workflow they present includes a total of 57 questions to answer about a dataset, focusing on the specific needs of machine learning researchers. We consider how these questions can be adopted for the purposes of describing web archives datasets. Participants will consider and assess how each question might be adapted and applied to describe datasets from the UK Web Archive curated collections. After a brief description of the Datasheets for Datasets framework, we will break into small groups to perform a card-sorting exercise. Each group will evaluate a set of questions from the Datasheets framework and assess them using the MoSCoW technique, sorting questions into categories of Must have, Should have, Could have, and Won't have. Groups will then describe their findings from the card-sorting exercise in order to generate a broader discussion of priorities and resources available for generating descriptive metadata and documentation for public web archives datasets.

Format: 120-minute workshop where participants will do a card-sorting activity in small groups to review the practicalities of the Datasheets for Datasets framework when applied to web archives. Ideally, participants can prepare by reading through the questions prior to the workshop.

We anticipate the following schedule:

  • 5 min: Introduction

  • 15 min: Overview of Datasheets for Datasets

  • 5 min: Overview of UKWA Datasets

  • 60 min: Card-sorting Exercise in small groups

  • 5 min: Comfort Break

  • 20 min: Discussion of small group findings

  • 5 min: Conclusion and Wrap-up

Target Audience: Web Archivists, Researchers

Anticipated number of participants: 12-16

Technical requirements: overhead projector with computer and large tables for a big card sorting activity.

Learning outcomes:

  • Raise awareness of the Datasheets for Datasets Framework in the web archiving community.

  • Understand what type of descriptive metadata web archive experts think should accompany web archive collections published as data.

  • Generate discussion and promote communication between web archivists and research users on priorities for documentation.

Coordinators: Emily Maemura (University of Illinois), Helena Byrne (British Library)

Emily Maemura is an Assistant Professor in the School of Information Sciences at the University of Illinois Urbana-Champaign. Her research focuses on data practices and the activities of curation, description, characterization, and re-use of archived web data. She completed her PhD at the University of Toronto's Faculty of Information, with a dissertation exploring the practices of collecting and curating web pages and websites for future use by researchers in the social sciences and humanities.

Helena Byrne is the Curator of Web Archives at the British Library. She was the Lead Curator on the IIPC Content Development Group 2022, 2018 and 2016 Olympic and Paralympic collections. Helena completed a Master’s in Library and Information Studies at University College Dublin, Ireland in 2015. Previously she worked as an English language teacher in Turkey, South Korea, and Ireland. Helena is also an independent researcher that focuses on the history of women's football in Ireland. Her previous publications cover both web archives and sports history.

References

Brügger, N. (2021). Digital humanities and web archives: Possible new paths for combining datasets. International Journal of Digital Humanities. https://doi.org/10.1007/s42803-021-00038-z

Brügger, N., Laursen, D., & Nielsen, J. (2019). Establishing a corpus of the archived web: The case of the Danish web from 2005 to 2015. In N. Brügger & D. Laursen (Eds.), The historical web and digital humanities: The case of national web domains (pp. 124–142). Routledge/Taylor & Francis Group.

Brügger, N., Nielsen, J., & Laursen, D. (2020). Big data experiments with the archived Web: Methodological reflections on studying the development of a nation’s Web. First Monday. https://doi.org/10.5210/fm.v25i3.10384

Dooley, J., & Bowers, K. (2018). Descriptive Metadata for Web Archiving: Recommendations of the OCLC Research Library Partnership Web Archiving Metadata Working Group. OCLC Research. https://doi.org/10.25333/C3005C

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2018). Datasheets for Datasets. ArXiv:1803.09010 [Cs]. http://arxiv.org/abs/1803.09010

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92. https://doi.org/10.1145/3458723

Maemura, E., Worby, N., Milligan, I., & Becker, C. (2018). If These Crawls Could Talk: Studying and Documenting Web Archives Provenance. Journal of the Association for Information Science and Technology, 69(10), 1223–1233. https://doi.org/10.1002/asi.24048

 
2:30pm - 2:40pm BREAK
2:40pm - 3:50pm SES-05: COVID-19 COLLECTIONS
Location: Theatre 1
Session Chair: Kees Teszelszky, KB, National Library of the Netherlands
These presentations will be followed by a 10 min Q&A.
 
2:40pm - 3:00pm

The UK Government Web Archive (UKGWA): Measuring the impact of our response to the COVID-19 pandemic

Tom Storrar

The National Archives, United Kingdom

The COVID-19 pandemic, the first pandemic of the digital age, has presented an enormous challenge to our web archiving practice. As the official archive of the UK government, we were tasked with building a comprehensive archive of the UK government's online response to the emergency. To meet this challenge we have devised new archiving strategies, ranging from supplementary broad, keyword-driven crawling to focused, data-driven, daily captures of the UK’s official “Coronavirus (COVID-19) in the UK” data dashboard. We have also massively increased our rates of capture. The challenge has demanded creativity, adaptation and a great deal of effort.

All of this work prompted us to think of a number of questions that we’d like to answer: How complete is the record we captured in our web archive and how much is this a result of the extra effort we made? How could we perform meaningful analysis on the enormous numbers of HTML and non-HTML resources? What contributions have these innovations made to this outcome and how can these inform our practice going forward?

To tackle these questions we needed to analyse millions of captured resources in our web archive. It soon became clear that we’d only be able to achieve the level of insight needed by developing an entire end-to-end analysis system. The resulting pipeline we designed and built uses a combination of familiar and novel concepts and approaches; we used the WARC file content, along with CDX APIs, but we also developed a set of heuristics, and custom algorithms, all ultimately populating a database that allowed us to run queries to give us the answers we sought. Running an entirely cloud-based system enabled this work as we were at that time unable to reliably access our office.
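One building block of such a pipeline can be sketched as follows: pulling capture records from an IA-style CDX API into a local database and querying them. The endpoint, parameters, and schema are illustrative assumptions, not UKGWA's actual system.

```python
# A minimal sketch: fetch CDX records for a URL prefix and load them into
# SQLite for ad-hoc analysis. Endpoint and field names follow the common
# IA-style CDX JSON convention and are assumptions here.
import sqlite3
import requests

CDX_API = "https://example-archive.org/cdx"  # hypothetical endpoint
params = {"url": "gov.uk/coronavirus", "matchType": "prefix",
          "output": "json", "from": "2020", "to": "2021", "limit": 1000}
rows = requests.get(CDX_API, params=params, timeout=60).json()
header, records = rows[0], rows[1:]  # first row lists the field names

db = sqlite3.connect("captures.db")
db.execute("""CREATE TABLE IF NOT EXISTS captures
              (urlkey TEXT, timestamp TEXT, url TEXT, mime TEXT, status TEXT)""")
for rec in records:
    row = dict(zip(header, rec))
    db.execute("INSERT INTO captures VALUES (?, ?, ?, ?, ?)",
               (row["urlkey"], row["timestamp"], row["original"],
                row["mimetype"], row["statuscode"]))
db.commit()

# Example question: how many distinct resources were captured per month?
for month, n in db.execute(
        "SELECT substr(timestamp, 1, 6) AS month, COUNT(DISTINCT urlkey) "
        "FROM captures GROUP BY month ORDER BY month"):
    print(month, n)
```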

This presentation will provide an overview of the approaches used, the results we found and the areas for further development. We believe that these tools can be applied to our overall web archive collections and hope that other institutions will find our experience useful when thinking about analysing their own collection and quantifying the impact of their efforts.



3:00pm - 3:20pm

Women and COVID through Web Archives. How to explore the pandemic through a collaborative, interdisciplinary research approach

Susan Aasman1, Karin de Wild2, Joshgun Sirajzade3, Fréderic Clavert3, Valerie Schafer3, Sophie Gebeil4, Niels Brügger5

1University of Groningen, The Netherlands; 2Leiden University, The Netherlands; 3University of Luxembourg, Luxembourg; 4Aix-Marseille University, France; 5Aarhus University, Denmark

The COVID crisis has been a shared worldwide and collective experience since March 2020, and a lot of voices have echoed each other, be it in relation to grief, lockdowns, masks and vaccines, homeschooling, etc. However, this unprecedented crisis has also deepened asymmetries and failures within societies, in terms of occupational fields, economic inequalities, and health and sanitary access, and we could extend the inventory of these hidden and more visible gaps that were reinforced during the crisis. Women and gender were also at stake in this sanitary crisis, be it in discussions of the better management of the crisis by female politicians, domestic violence during lockdowns, the decreasing production of papers by female research scientists, homeschooling and the mental load of women, etc.

As a cohort team within the Archives Unleashed Team (AUT) program, the European research AWAC2 team benefited from privileged access to this collection, thanks to Archive-It and through ARCH, and from regular mentorship by the AUT team. This allowed us to investigate and analyse a huge collection: 5.3 TB of data, a domain frequency CSV of 161,757 lines, and a CSV of the plain text of web pages with 8,738,751 lines. In December 2021, our AWAC2 team submitted several topics to the IIPC (International Internet Preservation Consortium) community and invited the international organization to select one that the team would investigate in depth, based on the unique IIPC COVID collection of web archives. Women, gender, and COVID was the winning topic.
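A first pass over derivatives like these is often plain tabular analysis; the sketch below assumes a hypothetical domain-frequency CSV with domain and count columns, which may differ from the actual ARCH export.

```python
# A minimal sketch over an ARCH-style domain frequency derivative;
# the file name and column names are assumptions.
import pandas as pd

df = pd.read_csv("domain-frequency.csv")  # hypothetical columns: domain, count
top = df.sort_values("count", ascending=False).head(20)
print(top.to_string(index=False))

# Share of all captures accounted for by the 20 most frequent domains.
print(f"top-20 share: {top['count'].sum() / df['count'].sum():.1%}")
```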

Accepting the challenge, the AWAC2 team organized a datathon in March 2022 in Luxembourg to investigate and retrieve the many traces of women, gender and COVID in web archives, while mixing close and distant reading. Since then, the team has been working on the dataset to further explore the opportunities for computational methods for reading at scale. In this presentation, we will reflect on technical, epistemological, and methodological challenges and present some results as well.



3:20pm - 3:40pm

Surveying the landscape of COVID-19 web collections in European GLAM institutions

Nicola Bingham1, Friedel Geeraert2, Caroline Nyvang3, Karin de Wild4

1British Library, United Kingdom; 2KBR (Royal Library of Belgium); 3Royal Danish Library; 4Leiden University

The aim of the WARCnet network [https://cc.au.dk/en/warcnet/about] is to promote high-quality national and transnational research that will help us to understand the history of (trans)national web domains and of transnational events on the web, drawing on the increasingly important digital cultural heritage held in national web archives. Within the context of this network, a survey was conducted to see how cultural heritage institutions are capturing the COVID-19 crisis for future generations. The aim of the survey was to map the scope and collection strategies of COVID-19 web collections, with a main focus on Europe. The survey was managed by the British Library and was conducted by means of the Snap survey platform. It circulated between June and September 2022, mainly among European GLAM institutions, and 61 responses were obtained.

The purpose of this presentation is to provide an overview of the different collection development practices when curating COVID-19 collections. On the one hand, the results may support GLAM institutions in gaining further insights into how to curate COVID-19 web collections or identify potential partners. On the other hand, revealing the scope of these web collections may also encourage humanists and data scientists to unlock the potential of these archived web sources to further understand international developments on the web during the COVID-19 pandemic.

More concretely, the presentation will provide further insight into the local, regional, national or global scopes of the different COVID-19 collections, the type of content that is included in the collections, the available metadata, the selection criteria that were used when curating the collections and the efforts that were made to create inclusive collections. The temporality of the collections will also be discussed by highlighting the start, and, if applicable, end dates of the collections and the capture frequency. Quality control and long-term preservation are two further elements that will be discussed during the presentation.

 
2:40pm - 3:50pm SES-06: SOCIAL MEDIA & PLAYBACK: COLLABORATIVE APPROACHES
Location: Theatre 2
Session Chair: Susanne van den Eijkel, KB, National Library of the Netherlands
These presentations will be followed by a 10 min Q&A.
 
2:40pm - 3:00pm

Archiving social media in Flemish cultural or private archives: (how) is it possible?

Katrien Weyns1, Ellen Van Keer2

1KADOC-KU Leuven, Belgium; 2meemoo, Belgium

Social media are increasingly replacing other forms of communication. In doing so, they are also becoming an important source to archive in order to preserve the diverse voices in society for the long term. However, few Flemish archival institutions currently archive this type of content. To remedy this situation, a number of private archival institutions in Flanders started research on sustainable approaches and methods to capture and preserve social media archives. Confronted with the complex reality of this new landscape however, this turned out to be a rather challenging undertaking.

Through the lens of our project 'Best practices for social media archiving in Flanders and Brussels', we’ll look at the lessons learned and the central challenges that remain for social media archiving in private archival institutions in Flanders. Many of these lessons and challenges transcend this project and concern the broader web archiving community and cultural heritage sector.

Unsurprisingly, for a lot of (often smaller) private archival institutions in Belgium, archiving social media remains a major challenge, whether because of a lack of (new) digital archiving competencies or the limited availability of (often expensive and quickly outdated) technical solutions in heritage institutions. On top of that, there are major legal challenges. For one, these archives cannot fall back on archival law or legal deposit law as a legal basis. In addition, the quickly evolving European and national privacy and copyright regulations form a maze of rules and exceptions they have to find their way in and keep up with.

One last stumbling block is proving particularly hard to overcome. It concerns the legal and technical restrictions the social media platforms themselves impose on users. These make it practically impossible for heritage institutions to capture and preserve the integrity of social media content in a sustainable way. We believe this problem is best addressed by the international web archiving, research and heritage community as a whole.

This is only one of the recommendations we’re proposing to improve the situation as part of the set of ‘best practices’ we developed and which we would like to present here in more detail.



3:00pm - 3:20pm

Searching for a Little Help From My Friends: Reporting on the Efforts to Create an (Inter)national Distributed Collaborative Social Media Archiving Structure

Zefi Kavvadia1, Katrien Weyns2, Mirjam Schaap3, Sophie Ham4

1International Institute of Social History; 2KADOC Documentation and Research Centre on Religion, Culture, and Society; 3Amsterdam City Archives; 4KB, National Library of the Netherlands

Social media archiving in cultural heritage and government is still at an experimental stage with regard to organizational readiness for and sustainability of initiatives. The many different tools, the variety of platforms, and the intricate legal and ethical issues surrounding social media do not readily allow for immediate progress and uptake by organizations interested or mandated to preserve social media content for the long term.

In Belgium and the Netherlands, the last three years have seen a series of promising projects on building social media archiving capacity, mostly focusing on heritage and research. One of their most important findings is that the multiple needs and requirements of successful social media archiving are difficult for any one organization to tackle; efforts to propose good practices or establish guidelines often run into the reality of the many, and sometimes clashing, priorities of different domains, e.g. archives, libraries, local and national government, and research. Faced with little time and increasing costs, managers and funders are generally reluctant to support social media archiving as an integral part of collecting activity, as it is seen as a nice-to-have but not crucial part of their already demanding core business.

Against this background, we set out to bring together representatives of different organizations from different sectors in Belgium and the Netherlands to research the possibilities for what a distributed collaborative approach to social media archiving could look like, including requirements for sharing knowledge and experiences systematically and efficiently, sharing infrastructure and human and technical resources, prioritization, and future-proofing the initiative. In order to do this, we look into:

  • Wishes, demands, and obstacles to doing social media archiving at different types of organizations in Belgium and the Netherlands

  • Aligning the heritage, research, and governmental perspectives

  • Learning from existing collective organizational structures

  • First steps for the allocation of roles and responsibilities

Through interviews with staff and managers of interested organizations, we want to find out if there is potential in thinking about social media archiving as a truly collaborative venture. We would like to discuss the progress of this research and the ideas and challenges we have come up against.



3:20pm - 3:40pm

Collaborating On The Cutting Edge: Client Side Playback

Clare Stanton, Matteo Cargnelutti

Library Innovation Lab, United States of America

Perma.cc is a project of the Library Innovation Lab, which is based within the Harvard Law School Library and exists as a unit of a large academic institution. Our work has in the past focused mainly on the application of web archiving technology as it relates to citation in legal and scholarly writing. However, we have also explored expansive topics in the web archiving world, oftentimes via close collaboration with the Webrecorder project, and most recently have built tools leveraging new client-side playback technology made available by replayweb.page.

warc-embed is LIL's experimental client-side playback integration boilerplate, which can be used to test out and explore this new technology, along with its potential new applications. It consists of: a simple web server configuration that provides web archive playback; a preconfigured “embed” page that can be easily implemented to interact with replayweb.page; and a two-way communication layer that allows the replay to reliably and safely communicate with the archive. These features are replicable for a relatively non-technical audience, and thus we sought to explore small-scale applications of the tool outside of our group.
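To give a flavor of the pattern, here is an illustrative sketch (not the warc-embed code itself) of a tiny server exposing a WARC file and an embed page that loads the replayweb.page web component; a production embed also needs the replayweb.page service worker, which this sketch omits.

```python
# An illustrative sketch in the spirit of warc-embed: serve a WARC file
# plus a minimal "embed" page using the <replay-web-page> component.
# Paths and the WARC file name are assumptions; see the replayweb.page
# embedding docs for the service-worker setup a real deployment needs.
from http.server import HTTPServer, SimpleHTTPRequestHandler

EMBED_PAGE = b"""<!doctype html>
<script src="https://cdn.jsdelivr.net/npm/replaywebpage/ui.js"></script>
<replay-web-page source="/example.warc.gz" url="https://example.com/">
</replay-web-page>"""

class Handler(SimpleHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/embed":
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.end_headers()
            self.wfile.write(EMBED_PAGE)
        else:
            # Serve files (e.g. example.warc.gz) from the current directory.
            super().do_GET()

HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```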

This is one of two proposals relating to this work. We believe there is an IIPC audience who could attend either or both sessions based on their interests. They explore separate topics relating to the core technology. This session will look into user applications of the tool and institutional user feedback from the Harvard Library community.

Our colleagues at Harvard use the Internet Archive’s Archive-It across the board for the majority of their web archiving collections and access. As an experiment, we have worked with some of them to host and serve their .warcs via warc-embed. We scoped work based on their needs and made adjustments based on their ability to apply the technology. One example of this is a refresh of the software to be able to mesh with WordPress, which was more easily managed directly by the team. This session will explore a breakdown of roadblocks, design strategies, and wins from this collaboration. It will focus on the end-user results and applications of the technology.

 
3:50pm - 4:20pm BREAK
4:20pm - 5:30pm SES-07: COLLABORATIONS & OUTREACH
Location: Theatre 1
Session Chair: Ben Els, National Library of Luxembourg
These presentations will be followed by a 10 min Q&A.
 
4:20pm - 4:40pm

Linking web archiving with arts and humanities: the collaboration between ROSSIO and Arquivo.pt

Ricardo Basílio

Arquivo.pt - Fundação para a Ciência e Tecnologia, I.P., Portugal

ROSSIO and Arquivo.pt developed collaborative activities between 2018 and 2022 with the goal of connecting web archiving, the arts, and digital humanities. How can web archives be made useful and accessible to digital humanities researchers, and by extension to citizens? This challenge was answered in three ways: training, dissemination, and collaborative curation of websites. This presentation aims to describe those collaborative activities and share what we have learned from them.

ROSSIO is a Portuguese infrastructure for the Social Sciences, Arts and Humanities (https://rossio.fcsh.unl.pt/). Its mission is to aggregate, contextualize, enrich and disseminate digital content. It is based at the Faculty of Social and Human Sciences of the NOVA University of Lisbon (FCSH-NOVA) and involves several institutions that provide content. Arquivo.pt's mission (https://arquivo.pt) is to preserve the Portuguese web and make content from the web since 1996 available to everyone, from ordinary citizens to researchers.

ROSSIO contributed human resources, namely, a web curator, a community manager, a web developer, and researchers who used Arquivo.pt in their work. Arquivo.pt in turn contributed its know-how, created new services (e.g., the SavePageNow) and made available open data sets.

Below, we describe the activities carried out in collaboration and their results.

First, regarding training, we refer to face-to-face and online sessions held with ROSSIO partners and their communities. We highlight the initiative "Café with Arquivo.pt" (https://arquivo.pt/cafe) and the webinars held during the pandemic, because they strengthened the connection between Arquivo.pt and distant communities (e.g., in 2021 they drew 538 participants and an 84% satisfaction rate).

Second, continuous dissemination across the social networks and groups of the ROSSIO partners helped make Arquivo.pt better known (e.g., 7,300 new users accessed the service between 2018 and 2021).

Third, researchers from ROSSIO collaborated in curating websites, which resulted in documentation for studies and online exhibitions (e.g., "Times of illness, times of healing" at FCSH NOVA, and "art festivals memory" at the Gulbenkian Art Library).

We conclude this presentation by sharing what we learned from participating in ROSSIO and the challenges that lie ahead in creating a community of practice among arts and humanities researchers.



4:40pm - 5:00pm

Building collaborative collections: the experience of the Croatian Web Archive

Inge Rudomino, Dolores Mumelaš

National and University Library in Zagreb, Croatia

In Croatia, the only institution that archives the web is the National and University Library in Zagreb. The Library established the Croatian Web Archive (HAW) and began archiving Croatian web sources in 2004. Since then, we have developed several approaches to web archiving: selective archiving, .hr domain crawls, thematic crawls, local history collections, and social media archiving. To broaden its collections and raise public awareness as much as possible, the Croatian Web Archive is opening up to collaboration with other libraries as well as with interested citizens.

One example is the Building Local History Web project. In 2020, the Croatian Web Archive began collaborating with public libraries to archive web resources related to a specific area or homeland. The content relates to a specific locality, with the aim of presenting, and ensuring long-term access to, local materials that are available only on the web and that complement and popularize the local history collections of public libraries.

In addition to collaborating with public libraries, the Croatian Web Archive has connected with the User Service Department of the National and University Library in Zagreb to involve citizens in the creation of thematic collections through citizen science. In this way, the thematic collection “Bees, life, people” was created using a crowdsourcing method, in collaboration with a public library, citizens (high school students), and other library departments.

This presentation will discuss developing a collection policy, collaboration, and the working process of building local history and citizen science collections.

The lessons learned through collaboration with citizens and public libraries are a great encouragement to expand the existing scope of archiving, to involve more libraries and citizens, and to raise awareness of information literacy and the importance of archiving web content.



5:00pm - 5:20pm

Your Software Development Internship in Web Archiving

Youssef Eldakar

Bibliotheca Alexandrina, Egypt

A summer internship project is an opportunity for the intern to gain real-world practice and for the host institution to make extra progress on program objectives, while also engaging with the community. Since 2019, Bibliotheca Alexandrina’s IT team has been running a summer internship series for undergraduate students of computing, with several of the internship projects having a connection to web archiving.

Throughout this experience, our mentors have found the young interns deeply intrigued by the technology involved in archiving the web. From a computing perspective, besides preserving a significant information medium, web archiving is an activity where a number of sub-domains of computing come together. A software project in web archiving may involve, for instance, big data management, to keep pace with how the web, and consequently an archive of it, continues to grow in volume; parallel computing, to achieve the capacity for harvesting and processing data at that scale; machine learning, to answer questions about the datasets that can be extracted from a web archive; or network theory and graph analytics, to arrive at more understandable representations of the heavily interlinked data.

In this presentation, we invite you to join us on a virtual visit to the home of the IT team at Bibliotheca Alexandrina for a look into our archive of past internship projects in web archiving. These projects include investigating alternative graph analytics backends for implementing new features in web archive graph visualization, repurposing the WARC format for use in the library’s digital book portal, and crawling the web for text to train language models. For each project, we will review the specific objective, how the problem was addressed, and the outcome. Finally, to reflect on the overall experience, we will share lessons learned and discuss how interaction with the community through internships is also an opportunity to raise awareness about web archiving, the technology involved, and the work of the International Internet Preservation Consortium (IIPC).

 
4:20pm - 5:30pmSES-08: QUALITY ASSURANCE
Location: Theatre 2
Session Chair: Arnoud Goos, Netherlands Institute for Sound & Vision
These presentations will be followed by a 10 min Q&A.
 
4:20pm - 4:40pm

The Auto QA process at UK Government Web Archive

Kourosh Feissali, Jake Bickford

The National Archives, United Kingdom

The UK Government Web Archive’s (UKGWA) Auto QA process allows us to carry out enhanced, data-driven QA almost completely automatically. This is particularly useful for high-profile websites or sites that are about to close. Auto QA has several advantages over purely visual QA, enabling us to:

1) Identify problems that are not obvious at the visual QA stage.

2) Identify Heritrix errors during the crawl, including -2 and -6 status codes. Once identified, we re-run Heritrix on the affected URIs.

3) Identify and patch URIs that Heritrix could not discover.

4) Identify, test, and patch hyperlinks inside PDFs. Many PDFs contain hyperlinks to pages on the parent website or to other websites, and sometimes the only way to reach those pages is through a link in a PDF, which most crawlers cannot normally follow.

Auto QA consists of three separate processes:

1) ‘Crawl Log Analysis’ (CLA), which runs automatically on every crawl. CLA examines Heritrix crawl logs, looks for errors, and then tests the affected URIs against the live web.

2) ‘Diffex’, which compares what Heritrix discovered with the output of another crawler, such as Screaming Frog, to identify what Heritrix did not discover. Diffex then tests those URIs against the live web and, if they are valid, adds them to a patchlist.

3) ‘PDFflash’, which extracts PDF URIs from Heritrix crawl logs, parses the PDFs, and looks for hyperlinks within them; it then tests those hyperlinks against the live web, our web archives, and our in-scope domains. If a hyperlink’s target returns a 404, it is added to our patchlist, provided it meets certain conditions, such as scoping criteria.

UKGWA’s Auto QA is a highly efficient and scalable system that complements visual QA, and we are in the process of making it open source.
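
As a rough illustration of the Diffex step, the following minimal Python sketch (our own illustration, not UKGWA’s actual code) diffs the URIs recorded in a Heritrix crawl log against a URL export from a second crawler and tests the URIs Heritrix missed against the live web. The file names are hypothetical, and the sketch assumes Heritrix’s default crawl.log layout, in which the fetched URI is the fourth whitespace-separated field.

import requests

def heritrix_uris(crawl_log_path):
    # In Heritrix's default crawl.log layout, the 4th field is the URI.
    uris = set()
    with open(crawl_log_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) > 3:
                uris.add(parts[3])
    return uris

def other_crawler_uris(url_list_path):
    # Assumes a plain-text export with one URL per line (e.g. from Screaming Frog).
    with open(url_list_path) as f:
        return {line.strip() for line in f if line.strip()}

def build_patchlist(crawl_log_path, url_list_path):
    # URIs the second crawler found but Heritrix did not.
    missing = other_crawler_uris(url_list_path) - heritrix_uris(crawl_log_path)
    patchlist = []
    for uri in sorted(missing):
        try:
            # A HEAD request keeps the live-web check lightweight; only URIs
            # still valid on the live web are queued for patching.
            r = requests.head(uri, allow_redirects=True, timeout=10)
            if r.status_code == 200:
                patchlist.append(uri)
        except requests.RequestException:
            pass  # unreachable on the live web; nothing to patch
    return patchlist

if __name__ == '__main__':
    for uri in build_patchlist('crawl.log', 'screaming_frog_urls.txt'):
        print(uri)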



4:40pm - 5:00pm

The Human in the Machine: Sustaining a Quality Assurance Lifecycle at the Library of Congress

Grace Bicho, Meghan Lyon, Amanda Lehman

Library of Congress, United States of America

This talk will build upon information shared during the IIPC WAC 2022 session “Building a Sustainable Quality Assurance Lifecycle at the Library of Congress” (Thomas and Lyon).

The work to develop a sustainable and effective quality assurance (QA) ecosystem is ongoing, and the Library of Congress Web Archiving Team (WAT) is constantly working to improve and streamline workflows. The Library’s web archiving QA goals are structured around Dr. Reyes Ayala’s grounded-theory framework for measuring the quality of web archives (Reyes Ayala). During last year’s session, we described how the WAT satisfies the framework’s Relevance and Archivability dimensions, with some automated processes built in to help the team do its work. We also introduced our idea for Capture Assessment to satisfy the framework’s Correspondence dimension.

In July 2022, the WAT launched the Capture Assessment workflow internally and invited curators of web archive content at the Library to review captures of their selected content. To better communicate Correspondence quality issues between the curatorial librarians and the WAT, we instituted a rubric in which curatorial librarians assign numeric values that convey quality information about a particular web capture from various angles, alongside a checklist for easily noting common issues.

The WAT held an optional training session alongside the launch, and since then there have been over 90 responses from a handful of curatorial librarians, including one power user. The WAT has found the responses mostly actionable for correction in future crawls. We have also seen that Capture Assessments are performed on captures that would not necessarily be flagged via other QA workflows, which gives us confidence that a wider swath of the archive is being reviewed for quality.

The session will share more details about the Capture Assessment workflow and, in time for the 2023 WAC session, we intend to complete a small, early analysis of the Capture Assessment responses to share with the wider web archiving community.

Reyes Ayala, B. Correspondence as the primary measure of information quality for web archives: a human-centered grounded theory study. Int J Digit Libr 23, 19–31 (2022). https://doi.org/10.1007/s00799-021-00314-x

 
4:20pm - 5:30pmWKSHP-02: A PROPOSED FRAMEWORK FOR USING AI WITH WEB ARCHIVES IN LAMS
Location: Labs Room 1 (workshops)
Pre-registration required for this event.
 

A proposed framework for using AI with web archives in LAMs

Abigail Potter

Library of Congress, United States of America

There is tremendous promise in using artificial intelligence, and specifically machine learning techniques, to help curators, collections managers, and users understand, use, steward, and preserve web archives. Libraries, archives, museums, and other public cultural heritage organizations (LAMs) that manage web archives share challenges in operationalizing AI technologies and have unique requirements for managing digital heritage collections at very large scale. Through research, experimentation, and collaboration, the LC Labs team has developed a set of tools to document, analyze, prioritize, and assess AI technologies in a LAM context. The framework is in draft form and in need of additional use cases and perspectives, especially web archives use cases. The facilitators will introduce the framework and ask participants to use it to evaluate their own proposed or in-progress ML or AI use case for increasing understanding of and access to web archives.

Sharing the framework elements, gathering feedback, and documenting web archives use cases are the goals of the workshop.
Sample Elements and Prompts from the framework:
- Organizational Profile: How will or does your organization want to use AI or machine learning?

- Define the problem you are trying to solve.

- Write a user story about the AI/ML task or system you are planning or implementing.

- Risks and Benefits: What are the benefits and risks to users, staff, and the organization when an AI/ML technology is or will be used?

- What systems or policies does or will the AI/ML task or system impact or touch?

- What are the limitations on future use of any training, target, validation, or derived data?

- Data Processing Plan: What documentation do or will you require when using AI or ML technologies?

- What existing open-source or commercial platforms offer pathways into the use of AI?

- What are the success metrics and measures for the AI/ML task?

- What are the quality benchmarks for the AI/ML output?

- What could come next?

 
5:30pm - 6:10pmPOS-1: LIGHTNING & DROP-IN TALKS
Location: Theatre 1
Session Chair: Abbie Grotke, Library of Congress
1-minute drop-in talks will immediately follow the lightning talks. After the session ends, lightning talk presenters will be available for questions in the atrium, where their posters will be on display.

Drop-in talk schedule:

Quick Overview of Perma Tools List​
Clare Stanton​, Perma.cc

Engineering Updates from Internet Archive ​
Alex Dempsey​, Internet Archive

Mapping News in the Norwegian Web Archive​
Jon Carlstedt Tønnessen, National Library of Norway
 

Memory in Uncertainty – The Implications of Gathering, Storing, Sharing and Navigating Browser-based Archives

Cade Diehm, Benjamin Royer

New Design Congress, Germany

How do we save the past in a violent present for an uncertain future? As societal digitisation accelerates, so too have the belligerence of state and corporate power, the democratisation of targeted harassment, and the collapse of consent among communities plagued by ongoing (and often unwanted) datafication. Drawing on political forecasts and participatory consultation with practitioners and communities, this research examines the physical safety of data centres, the socio-technical issues of the diverse practice of web-based archiving, and the physical and mental health of archive practitioners and of communities subjected to archiving. The research identifies and documents issues of ethics, consent, digital security, colonialism, resilience, custodianship, and tool complexity. Despite the systemic challenges identified, and the broad lag in response from tool makers and other actors within the web archiving discipline, there are compelling reasons to remain optimistic. Emergent technologies, stronger socio-technical literacy among archivists, and critical interventions in the colonial structures of digital systems offer immediate points of intervention. By acknowledging the shortcomings of cybernetics, resisting the desire to apply software solutionism at scale, and developing a nuanced and informed understanding of the realities of archiving in digitised societies, a broad surface of opportunities can emerge to develop resilient, considered, safe, and context-sensitive archival technologies and practices for our uncertain world.



To preserve this memory, click here. Real-time public engagement with personal digital archives

Marije Miedema, Susan Aasman, Sabrina Sauer

University of Groningen, Centre for Media and Journalism Studies

Digital collections aim to reflect our personal and collective histories, which are shaped by and concurrently shape our memories. While advances are being made in developing web archival practices in the public domain, personal digital material is mostly preserved with commercially driven technologies. This is worrying: although it may seem that these privately owned cloud services are spaces where our precious pictures will exist forever, we know that long-term sustainable archiving practices are not these service providers’ primary concern. This demo is part of the first stage of fieldwork in a PhD project that explores alternative approaches to sustainable everyday archival data management. Through participatory research methods, such as co-designing prototypes, we aim to establish a public-private-civic collaboration to rethink our relationship with the personal digital archive. Turning to the question of which digital material we throw away, discard, or forget about, we want to contribute to existing knowledge on how to manage the growing amount of digital stuff.

Translating this question into an interactive installation, the demo combines human and technological performativity, employing participatory, playful methods to let conference participants materialize their reflections on engagement with their digital archives from both professional and personal perspectives. The demo invites conference participants to actively engage with the question of responsibility for the future of our personal digital past: is there a role for public institutions alongside individuals’ commitment to commercially driven storage technologies? The researchers will safeguard participants’ privacy throughout the demo. Through the demo, the community of (web) archivists is involved in the early stages of the project’s co-creative research practices, and the project aims to build lasting connections with these important stakeholders.



Participatory Web Archiving: A Roadmap for Knowledge Sharing

Cui Cui1,2

1Bodleian Libraries University of Oxford, United Kingdom; 2Information School University of Sheffield

In recent years, community participation seems to have become a desirable step in developing web archives. Participatory practices in the cultural heritage sector are not new (Benoit & Eveleigh, 2019), and working in collaboration with different community partners to build archives is already underway in conventional archives (Cook, 2013). Indeed, it has now become one of the main themes of web archival development, on both theoretical and practical levels.

Although involving wider communities is often regarded as an approach to democratise practices, whether community participation can lead to improved representation has been debated. At the same time, the significant impact that participatory practices have on creating and sharing knowledge should not be underestimated. My current PhD research examines how participatory practices have been deployed in web archiving, their mechanisms, and their impacts.

Since April 2022, I have worked as a web archivist for the Archive of Tomorrow project, developing various sub-collections on topics relating to cancer, Covid-19, food, diet, nutrition, and wellbeing. The project, funded by the Wellcome Trust, explores and preserves online information and misinformation about health and the Covid-19 pandemic. Started in February 2022, the project runs for 14 months and will form a ‘Talking about Health’ collection within the UK Web Archive, giving researchers and members of the public access to a wide representation of diverse online sources.

For this project, I have attempted to link theory with practice and have applied various participatory methods in developing the collection, such as engaging with subject librarians, delivering a workshop to co-curate a sub-collection, consulting academics to identify archiving priorities, co-curating a sub-collection with students from an internship scheme, and collaborating with a local patient support group. This poster reflects on how the different approaches were deployed and the lessons learned. It will highlight the transformative impact of participatory practices on sharing, creating, and reconstructing knowledge.

References

Benoit, E., & Eveleigh, A. (2019). Defining and framing participatory archives in archival science. In E. Benoit & A. Eveleigh (Eds.), Participatory archives: theory and practice (pp. 1–12). London.

Cook, T. (2013). Evidence, memory, identity, and community: Four shifting archival paradigms. Archival Science, 13(2–3), 95–120. https://doi.org/10.1007/s10502-012-9180-7

 
5:30pm - 6:10pmPOS-2: LIGHTNING & DROP-IN TALKS
Location: Theatre 2
Session Chair: Martin Klein, Los Alamos National Laboratory
1-minute drop-in talks will immediately follow the lightning talks. After the session ends, lightning talk presenters will be available for questions in the atrium, where their posters will be on display.

Drop-in talk schedule:

Persistent Web IDentifier (PWID) also as URN​
Eld Zierau, Royal Danish Library

Crowdsourcing German Twitter ​
Britta Woldering, German National Library

At the end of the rainbow. Examining the Dutch LGBT+ web archive using NER and hyperlink analyses
Jesper Verhoef, Erasmus University Rotterdam
 

Sunsetting a digital institution: Web archiving and the International Museum of Women

Marie Chant

The Feminist Institute, United States of America

The Feminist Institute’s (TFI) partnership program helps feminist organizations sunset mission-aligned digital projects, using web archiving technology and ethnographic preservation to contextualize and honor the labor contributed to ephemeral digital initiatives. In 2021, The Feminist Institute partnered with Global Fund for Women to preserve the International Museum of Women (I.M.O.W.). This digital, social-change museum built award-winning digital exhibitions that explored women’s contributions to society. I.M.O.W. initially aimed to build a physical space but shifted to a digital-only presence in 2005, opting to democratize access to the museum’s work. I.M.O.W.’s first exhibition, Imagining Ourselves: A Global Generation of Women, engaged and connected more than a million participants worldwide. After launching several successful digital collections, I.M.O.W. merged with Global Fund for Women in 2014. The organization did not have the means to continually migrate and maintain the websites as the underlying technology became obsolete, leaving gaps in functionality and access. Working directly with stakeholders from Global Fund for Women and the International Museum of Women, TFI developed a multi-pronged preservation plan that included capturing I.M.O.W.’s digital exhibitions using Webrecorder’s Browsertrix Crawler, harvesting and converting Adobe Flash assets, conducting oral histories with I.M.O.W. staff and external developers, and providing access through the TFI Digital Archive.



Visualizing web harvests with the WAVA tool

Ben O'Brien1, Frank Lee1, Hanna Koppelaar2, Sophie Ham2

1National Library of New Zealand, New Zealand; 2National Library of the Netherlands, Netherlands

Between 2020 and 2021, the National Library of New Zealand (NLNZ) and the National Library of the Netherlands (KB-NL) developed a new harvest visualization feature within the Web Curator Tool (WCT). The feature was demonstrated during a presentation at the 2021 IIPC WAC titled “Improving the quality of web harvests using Web Curator Tool”. During development it was recognised that the visualization tool could be beneficial to the web archiving community beyond WCT, which was also reflected in feedback received after the 2021 IIPC WAC.

The feature has now been ported to a stand-alone companion application, the WAVA tool (Web Archive Visualization and Analysis). This is a stripped-down version that contains the web harvest analysis and visualization without the WCT-dependent functionality, such as patching.

The WCT harvest visualization is designed primarily for performing quality assurance on web archives. To avoid the traditional tangle of links and nodes when visualizing URLs, the tool abstracts the data to the domain level. Aggregating URLs into groups by domain gives a higher-level overview of a crawl and allows quicker analysis of the relationships between content in a harvest. The visualization consists of an interactive network graph of links and nodes that can be inspected, allowing a user to drill down to the URL level for deeper analysis.

NLNZ and KB-NL believe the WAVA tool can be of use to the web archiving community in many ways. It lowers the barrier to investigating and understanding the relationships and structure of the web content that we crawl: what can we discover in our crawls that might improve the quality of future web harvests? The WAVA tool also removes technical steps that have previously been a barrier for researchers visualizing web archive data. How many future research questions could be aided by its use?
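
To make the domain-level aggregation concrete, here is a minimal sketch (our illustration, not WAVA’s implementation) that collapses URL-level link pairs from a crawl into a weighted domain graph using networkx; the input edge list and its format are assumptions for illustration.

from urllib.parse import urlparse
import networkx as nx

def domain(url):
    # Reduce a URL to its domain, the level at which the graph is drawn.
    return urlparse(url).netloc.lower()

def domain_graph(url_edges):
    # url_edges: iterable of (source_url, target_url) pairs from a crawl.
    g = nx.DiGraph()
    for src, dst in url_edges:
        s, d = domain(src), domain(dst)
        if g.has_edge(s, d):
            g[s][d]['weight'] += 1  # count URL-level links between the two domains
        else:
            g.add_edge(s, d, weight=1)
    return g

edges = [
    ('https://example.org/a', 'https://example.org/b'),
    ('https://example.org/a', 'https://example.com/page'),
    ('https://example.com/page', 'https://example.org/b'),
]
g = domain_graph(edges)
for s, d, data in g.edges(data=True):
    print(f'{s} -> {d} ({data["weight"]} links)')

Weighting each domain-to-domain edge by the number of underlying URL-level links preserves a sense of how strongly two sites are connected while keeping the graph small enough to inspect interactively.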



WARC validation, why not?

Antal Posthumus, Jacob Takema

Nationaal Archief, The Netherlands

This lightning talk aims to tempt and challenge the participants of the IIPC Web Archiving Conference 2023 to engage in an exchange of ideas, assumptions, and knowledge about validating WARC files and using WARC validation tools.

In 2021 we wrote an information sheet about WARC validation. During our desk research it became clear that most national and international colleagues who archive websites do not use WARC validation tools. Why not?

Most heritage institutions, national libraries, and archives focus on safeguarding as much online content as possible before it disappears, based on an organizational selection policy. Their other goal is to give both general users and researchers access to the captured information as completely and quickly as possible. Both goals are, of course, at the core of web archiving initiatives.

Little attention seems to be given to an aspect of quality control such as checking the technical validity of WARC files. Or are there other reasons not to pay much attention to this aspect?

We would like to share some of our findings from deploying several tools for processing WARC files: JHOVE, JWAT, Warcat, and Warcio. More tools are available, but in our opinion these four are the most commonly used, mature, and actively maintained tools that can check or validate WARC files.

In our research into WARC validation, we noticed that some tools check conformance to the WARC standard, ISO 28500, while others ‘only’ check block and/or payload digests. Most tools support version 1.0 of the WARC standard (2009); few support version 1.1 (2017).

Another conclusion is that there is no single WARC validation tool ‘to rule them all’, so using a combination of tools will probably be the best strategy for now.
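
As a starting point for the kind of digest checking mentioned above, the sketch below (our illustration, assuming warcio’s check_digests option and that the library raises an exception on parse or digest problems, which may vary by version) reads every record in a WARC file and reports failures; warcio also provides a command-line equivalent (warcio check).

import sys
from warcio.archiveiterator import ArchiveIterator

def check_warc(path):
    # Parse every record; with check_digests=True, warcio verifies
    # block/payload digests where the records carry them.
    records = 0
    try:
        with open(path, 'rb') as stream:
            for record in ArchiveIterator(stream, check_digests=True):
                record.content_stream().read()  # consume payload so digests are computed
                records += 1
    except Exception as e:  # e.g. warcio's ArchiveLoadFailed on malformed input
        print(f'{path}: problem after {records} records: {e}')
        return False
    print(f'{path}: OK ({records} records)')
    return True

if __name__ == '__main__':
    check_warc(sys.argv[1])

Note that this is a basic validity pass, not the full ISO 28500 conformance checking that tools such as JHOVE or JWAT attempt.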

 
7:00pm - 9:00pmDINNER
Pre-registration required for this event.

 