Conference Agenda

Overview and details of the sessions of this conference.

Please note that all times are shown in the time zone of the conference (CEST).

 
 
 
Session Overview
Date: Friday, 12/May/2023
8:00am - 8:30am ARRIVAL/COFFEE
8:30am - 10:00am SES-11: COLLECTION BUILDING
Location: Theatre 1
Session Chair: Lauren Baker, Library of Congress
These presentations will be followed by a 10 min Q&A.
 
8:30am - 8:50am

20 years of archiving the French electoral web

Dorothée Benhamou-Suesser, Anaïs Crinière-Boizet

Bibliothèque nationale de France, France

In 2022, the BnF celebrated the 20th anniversary of its electoral crawls. On this occasion, we would like to trace the history of 20 years of electoral crawls, which cover 20 elections of all types (presidential, parliamentary, local, departmental, European) and represent more than 30 TiB of data. The 2002 presidential election crawl was the first in-house crawl conducted by the BnF, a founding moment for experimenting with a legal, technical and library policy framework. As a heritage institution, we are accountable for the first electoral collections, which are emblematic and representative of our workflows in several respects: harvesting, selection, and outreach.

First, from a technical point of view, electoral crawls were an opportunity to set up crawling tools and to develop adaptive techniques to keep pace with the evolution of the web and meet the challenge of archiving it. We have experimented and improved our archiving processes for each new election, with particular attention to means of communication (e.g. forums, Twitter accounts, YouTube channels and, more recently, Instagram accounts and TikTok content).

Secondly, electoral crawls have led the BnF to set up and organise a network of contributors and the means of selection. In 2002, contributions came from BnF librarians. In 2004, partner libraries in different regions and overseas territories contributed to selecting content for the regional elections. In 2012, we initiated the development of a collaborative curation tool. Throughout the years, we have also built a document typology that has remained stable to guarantee the coherence of the collections.

Thirdly, electoral crawls led us to set up ways to promote web archives to the public and the research community. To promote the use of a collection with such historical consistency, of high interest for the study of political life, we designed guided tours (thematic and edited selections of archived pages made by librarians). The BnF also engaged in organizing scientific events, and in several collaborative outreach initiatives.



8:50am - 9:10am

Archiving the Web for FIFA World Cup Qatar 2022™

Arif Shaon, Carol Ann Daul Elhindi, Marcin Werla

Qatar National Library, Qatar

The core mission of Qatar National Library is to “spread knowledge, nurture imagination, cultivate creativity, and preserve the nation’s heritage for the future.” To fulfil this mission, the Library commits to collecting, preserving and providing access to both local and global knowledge, including heritage-related content relevant to Qatar and the region. Web resources of cultural importance could assist future generations in interpreting events and may not be extant anywhere else. Archiving such websites is an important initiative within the wider mission of the Library to support Qatar on its journey towards a knowledge-based economy.

The 2022 FIFA World Cup will be the first World Cup ever to be held in the Arab world, and hence is considered a landmark event in Qatar’s history. Qatar’s journey towards hosting the 2022 World Cup has been covered by all types of local and international websites and news portals, and the coverage is expected to increase significantly in the weeks leading up to, during and after the World Cup. The information published by these websites will truly reflect the journey towards, and experience of, the event from a variety of perspectives, including the fans, the organizers, the players, and members of the public. Capturing and preserving such information for the long term enables future generations to also share the experience and appreciate the astounding effort required to host a massive, culturally important global event in Qatar.

In this talk, we describe the Library’s approach to capturing and preserving websites related to the World Cup 2022, to guarantee access to the content for future generations. We also highlight the challenges associated with developing archived websites as collections for researchers in the context of Qatari copyright law.



9:10am - 9:30am

Museums on the Web: Exploring the past for the future

Karin de Wild

Leiden University, The Netherlands

This presentation will celebrate the launch of the special collection ‘Museums on the Web’ at the KB, National Library of the Netherlands. This evolving collection unlocks the largest, and an essential, sub-collection within the KB web archive. It contains more than 800 museum websites and offers the potential to research histories of museums on the web within the Netherlands.

Accessing web archives requires special tools, and this presentation will therefore demonstrate a variety of entry points. The collection features a selection of curated archived websites that can be viewed page by page. It will also be the first KB special collection accessible through a SolrWayback search engine, which enables requesting derived datasets and exploring the collection through a series of dashboards. This offers the opportunity to study histories of museums on the web in the Netherlands, combining methods from history and data science and drawing on computational analysis of web archive data.

The presentation will conclude by highlighting some significant case studies to showcase the diversity of museum websites and the research potential to uncover a Dutch history of museums on the web. The advent of online technologies has changed the way museums manage and provide access to collections, shape exhibitions, and build communities. By engaging with the past, we can enhance our understanding of how museums function today and offer new perspectives for future developments.

This paper coincides with the release of a Double Special Issue “Museums on the Web: Exploring the past for the future” in the journal Internet Histories: Digital Technology, Culture and Society (Routledge/Taylor & Francis).



9:30am - 9:50am

Unsustainability and Retrenchment in American University Web Archives Programs

Gregory Wiedeman1, Amanda Greenwood2

1University at Albany, United States of America; 2Union College

This presentation will give an overview of the expansion and later retrenchment of UAlbany’s web archives program due to a lack of permanently funded staff. UAlbany began its web archives program in 2013 in response to state records laws requiring it to preserve university records on the web. The department that housed the program had strong existing collecting programs in New York State politics and capital punishment. Since much of current politics and activism now happens online, it was natural and necessary to expand the web archives program to ensure we were effectively documenting these important spaces for the long-term future. However, we will show how the increasing complexity of the web and of collecting techniques means that the scoping needs for ongoing collecting seem to require significantly more testing and labor over time. Thus, despite the need to expand the web archives program to meet our department’s mission, we will describe the painful process of reducing our web archives collecting scope. With the NDSA Web Archiving in the United States surveys reporting 71-83% of respondents devoting 0.5 FTE or less to web archiving, maintenance inflation like this is catastrophic to many web archives programs. Most alarmingly, we will describe how the web archives labor situation at American universities is likely to get worse. The UAlbany Libraries, which houses the web archives program, has permanently lost over 30% of FTE since 2020 and almost 50% of FTE since 2000. Peer assessment studies, ARL staffing surveys, and the University of California, Berkeley’s recent announcement of library closures show that UAlbany’s example is more typical than exceptional. Finally, we will show how these cuts are not the result of a misunderstanding or a lack of appreciation for web archives or libraries by university administrators, but of the fact that our web archives program conflicts with UAlbany’s overall organizational mission and the business model of American higher education.

 
8:30am - 10:00am WKSHP-04: BROWSER-BASED CRAWLING FOR ALL: GETTING STARTED WITH BROWSERTRIX CLOUD
Location: Theatre 2
Pre-registration required for this event.
 

Browser-Based Crawling For All: Getting Started with Browsertrix Cloud

Andrew N. Jackson1, Anders Klindt Myrvoll2, Ilya Kreymer3

1The British Library, United Kingdom; 2Royal Danish Library; 3Webrecorder

Through the IIPC-funded “Browser-based crawling system for all” project (https://netpreserve.org/projects/browser-based-crawling/), members have been working with Webrecorder to support the development of their new crawling tools: Browsertrix Crawler (a crawl system that automates web browsers to crawl the web) and Browsertrix Cloud (a user interface to control multiple Browsertrix Crawler crawls). The IIPC funding is particularly focussed on making sure IIPC members can make use of these tools.

This workshop will be led by Webrecorder, who will invite you to try out an instance of Browsertrix Cloud and explore how it might work for you. IIPC members that have been part of the project will be on hand to discuss institutional issues like deployment and integration.

The workshop will start with a brief introduction to high-fidelity web archiving with Browsertrix Cloud. Participants will be given accounts and be able to create their own high-fidelity web archives using the Browsertrix Cloud crawling system during the workshop. Users will be presented with an interactive user interface to create crawls, allowing them to configure crawling options. We will discuss the crawling configuration options and how these affect the result of the crawl. Users will have the opportunity to run a crawl of their own for at least 30 minutes and to see the results. We will then discuss and reflect on the results.

After a quick break, we will discuss how the web archives can be accessed and shared with others, using the ReplayWeb.page viewer. Participants will be able to download the contents of their crawls (as WACZ files) and load them on their own machines. We will also present options for sharing the outputs with others directly, by uploading to an easy-to-use hosting option such as Glitch or our custom WACZ Uploader. Either method will produce a URL which participants can then share with others, in and outside the workshop, to show the results of their crawl. We will discuss how, once complete, the resulting archive is no longer dependent on the crawler infrastructure, but can be treated like any other static file, and, as such, can be added to existing digital preservation repositories.
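As the previous paragraph notes, the downloaded WACZ is just a static file. Below is a minimal Python sketch (not part of the workshop materials) showing that it is an ordinary ZIP whose datapackage.json manifest can be inspected with the standard library; the file name is a placeholder.

```python
# Inspect a WACZ downloaded from Browsertrix Cloud with the standard library.
# The member names (datapackage.json, archive/, indexes/, pages/) follow the
# WACZ specification; "my-crawl.wacz" is a placeholder path.
import json
import zipfile

with zipfile.ZipFile("my-crawl.wacz") as wacz:
    for name in wacz.namelist():
        print(name)                      # WARCs, indexes, pages, manifest

    # The manifest lists every packaged resource with its size and hash,
    # which is useful when depositing the file in a preservation repository.
    manifest = json.loads(wacz.read("datapackage.json"))
    for resource in manifest.get("resources", []):
        print(resource.get("path"), resource.get("bytes"))
```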

In the final wrap-up of the workshop, we will discuss challenges, what worked and what didn’t, what still needs improvement, etc. We will also discuss how participants can add the web archives they created into existing web archives that they may already have, and how Browsertrix Cloud can fit into and augment existing web archiving workflows at participants' institutions. IIPC member institutions that have been testing Browsertrix Cloud thus far will share their experiences in this area.

The format of the workshop will be as follows:

  • Introduction to Browsertrix Cloud - 10 min

  • Use Cases and Examples by IIPC project partners - 10 min

  • Break - 5 min

  • Hands-On: Setup and Crawling with Browsertrix Cloud (Including Q&A / help while crawls are running) - 30 min

  • Break - 5 min

  • Hands-On: Replaying and Sharing Web Archives - 10 min

  • Wrap-Up: Final Q&A / Discuss Integration of Browsertrix Cloud Into Existing Web Archiving Workflows with IIPC project partners - 20 min

Webrecorder will lead the workshop and the hands-on portions of the workshop. The IIPC project partners will present and answer questions about use-cases at the beginning and around integration into their workflows at the end.

Participants should come prepared with a laptop and a couple of sites that they would like to crawl, especially those that are generally difficult to crawl by other means and require a ‘high fidelity approach’. (Examples include social media sites, sites that are behind a paywall, etc.) Ideally, the sites can be crawled within 30 minutes (though crawls can be interrupted if they run for too long).

This workshop is intended for curators and anyone wishing to create and use web archives who are ready to try Browsertrix Cloud on their own with guidance from the Webrecorder team. The workshop does not require any technical expertise besides basic familiarity with web archiving. The participants’ experiences and feedback will help shape not only the remainder of the IIPC project, but the long-term future of this new crawling toolset.

The workshop should be able to accommodate up to 50 participants.

 
8:30am - 10:00am WKSHP-03: FAKE IT TILL YOU MAKE IT: SOCIAL MEDIA ARCHIVING AT DIFFERENT ORGANIZATIONS FOR DIFFERENT PURPOSES
Location: Labs Room 1 (workshops)
Pre-registration required for this event.
 

Fake it Till You Make it: Social Media Archiving at Different Organizations for Different Purposes

Susanne van den Eijkel1, Zefi Kavvadia2, Lotte Wijsman3

1KB, National Library of the Netherlands; 2International Institute for Social History; 3National Archives of the Netherlands

Abstract

Different organizations, different business rules, different choices. That seems obvious. However, different perspectives can alter the choices that you make and therefore the results you get when you’re archiving social media. In this tutorial, we would like to zoom in on the different perspectives an organization can have. A perspective can be shaped by a mandate or type of organization, the designated community of an institution, or a specific tool that you use. We would therefore like to highlight these influences and how they can affect the results that you get.

When you start with social media archiving, you won’t get the best results right away. It is really a process of trial and error, where you aim for good practice and not necessarily best practice (and is there such a thing as best practice?). With a practical assignment we want to showcase the importance of collaboration between different organizations. What are the worst practices that we have seen so far? What’s best to avoid, and why? What could be a solution? And why is it a good idea to involve other institutions at an early stage?

This tutorial relates to the conference topics of community, research and tools. It builds on previous work from the Dutch Digital Heritage Network and the BeSocial project from the National Library of Belgium. Furthermore, different tools will be highlighted and it will be made clear why different tooling can lead to different results.

Format

In-person tutorial, 90 minutes.

  • Introduction: who are the speakers, where do they work, introduction on practices related to different organizations.

  • Assignment: participants will do a practical assignment related to social media archiving. They’ll receive personas for different institutions (library, government, archive) and ask themselves the question: how does your own organization's perspective influence the choices you make? We will gather the results on post-its and end with a discussion.

  • Wrap-up: conclusions of discussion.

Target audience

This tutorial is aimed at those who want to learn more about doing social media archiving at their organizations. It is mainly meant for starters in social media archiving, but not necessarily complete beginners (even though they are definitely welcome too!). Potential participants could be archivists, librarians, repository managers, curators, metadata specialists, (research) data specialists, and generally anyone who is or could be involved in the collection and preservation of social media content for their organization.

Expected number of participants: 20-25.

Expected learning outcome(s)

Participants will understand:

  1. Why social media archiving is different from web archiving;
  2. Why different perspectives lead to different choices and results;
  3. How tools can affect the potential perspectives you can work with.

In addition, participants will get insight into:

  1. The different perspectives from which you can approach social media archiving;
  2. How different organizations (could) work on social media archiving.

Coordinators

Susanne van den Eijkel is a metadata specialist for digital preservation at the National Library of the Netherlands. She is responsible for all the preservation metadata, writing policies and implementing them. Her main focus is born-digital collections, especially the web archives. She focuses on web material after it has been harvested, rather than on selection and tools, and is therefore more concerned with which metadata and context information is available and relevant for preservation. In addition, she works on the communication strategy of her department, is actively involved in the Dutch Digital Heritage Network and provides guest lectures on digital preservation and web archiving.

Zefi Kavvadia is a digital archivist at the International Institute of Social History in Amsterdam, the Netherlands. She is part of the institute’s Collections Department, where she is responsible for the processing of digital archival collections. She also actively contributes to researching, planning, and improving the IISH digital collections workflows. While her work covers potentially any type of digital material, she is especially interested in the preservation of born-digital content and is currently the person responsible for web archiving at IISH. Her research interests range from digital preservation and archives to web and social media archiving and research data management, with a special focus on how these different but overlapping domains can learn and work together. She is active in the web archiving expert group of the Dutch Digital Heritage Network and the digital preservation interest group of the International Association of Labour History Institutions.

Lotte Wijsman is the Preservation Researcher at the National Archives in The Hague. In her role she researches how we can further develop preservation at the National Archives of the Netherlands and how we can innovate the archival field in general. This includes considering our current practices and evaluating how we can improve these with, for example, new practices and tools. Currently, Lotte is active in research projects concerning subjects such as social media archiving, AI, a supra-organizational Preservation Watch function, and environmentally sustainable digital preservation. Furthermore, she is a guest teacher at the Archiefschool and Reinwardt Academy (Amsterdam University of the Arts).

 
10:00am - 10:30am BREAK
10:30am - 12:00pm SES-12: DOMAIN CRAWLS
Location: Theatre 1
Session Chair: Grace Bicho, Library of Congress
These presentations will be followed by a 10 min Q&A.
 
10:30am - 10:50am

Discovering and Archiving the Frisian Web. Preparing for a National Domain Crawl.

Susanne van den Eijkel, Iris Geldermans

KB, National Library of the Netherlands

In the past years KB, National Library of the Netherlands (KBNL), conducted a pilot for a national domain crawl. KBNL has been harvesting websites with the Web Curator Tool (a web interface with the Heritrix crawler) since 2007, on a selective basis focused on Dutch history, culture and language. Information on the web can be short-lived but of vital importance for researchers now and in the future. Furthermore, KBNL outlined in its content strategy that it is the ambition of the library to collect everything that was published in and about the Netherlands, websites included. As more libraries around the world were collecting their national domains, KBNL also expressed the wish to execute a national domain crawl. Before we were able to do that, we had to form a multidisciplinary web archiving team, decide on a new tool for domain harvests and start an intensive testing phase. For this pilot a regional domain, the Frisian, was selected. Since we were new to domain harvesting, we used a selective approach. Curators of digital collections from KBNL were in close contact with Frisian researchers, to help define which websites needed to be included in the regional domain. During the pilot we also gathered more knowledge about Heritrix, as we were using NetarchiveSuite (also a web interface with the Heritrix crawler) for crawls.

Now that the results are in, we can share our lessons learned, such as the technical and legal challenges and the related policies that are needed for web collections. We will also go into detail about the crawler software settings that were tested and how such information can be used as context information.

This presentation is related to the conference topics collections, community and program operations, as we want to share best practices for executing a (regional) domain crawl and lessons learned in preparation for a national domain crawl. Furthermore, we will focus on the next steps after completion of the pilot. Other institutions that are harvesting websites can learn from it, and those that want to start with web archiving can be better prepared.



10:50am - 11:10am

Back to Class: Capturing the University of Cambridge Domain

Caylin Smith, Leontien Talboom

Cambridge University Libraries, United Kingdom

The University Archives of Cambridge University, based at the University Library (UL), is responsible for the selection, transfer, and preservation of the internal administrative records of the University, dating from 1266 to the present. These records are increasingly created in digital formats, including common ‘office’ formats (Word, Excel, PDF), and increasingly for the web.

The question “How do you preserve an entire online ecosystem in which scholars collaborate, discover and share new knowledge?” about the digital scholarly record posed by Cramer et al. (2022) equally applies to online learning and teaching materials as well as the day-to-day business records of a university.

Capturing this online ecosystem as comprehensively, rather than selectively, as possible is an undertaking that involves many stakeholders and moving parts.

As a UK Legal Deposit Library, the UL is a partner in the UK Web Archive and Cambridge University websites are captured annually; however, some online content needs to be captured more frequently, does not have an identifiable UK address, or is behind a log-in screen.

To improve this capturing, the UL is working on the following:

  • Engaging with content creators and/or University Information Services, which supports the University’s Drupal platform.
  • Working directly with the University Archivist as well as creating a web archiving working group with additional Library staff to identify what University websites need to be captured manually or were captured only in an annual domain crawl but need to be captured more frequently.
  • Becoming a stakeholder in web transformation initiatives to communicate requirements for creating preservable websites and quality checking new web templates from an archival perspective.
  • Identifying potential tools for capturing online content behind login screens. So far WebRecorder.io has been a successful tool to capture this material; however, this is a time-consuming and manual process that would be improved if automated. The automation of this process is currently being explored.

Our presentation will walk WAC2023 attendees through our current workflow as well as highlight ongoing challenges we are working to resolve so that attendees based at universities can take these into account for archiving content on their university’s domains.



11:10am - 11:30am

Laboratory not Found? Analyzing LANL’s Web Domain Crawl

Martin Klein, Lyudmila Balakireva

Los Alamos National Laboratory, United States of America

Institutions, regardless of whether they identify as for-profit, nonprofit, academic, or government, are invested in maintaining and curating their representation on the web. The organizational website is often the top-ranked on search engine result pages and commonly used as a platform to communicate organizational news, highlights, and policy changes. Individual web pages from this site are often distributed via organization-wide email channels, included in news articles, and shared via social media. Institutions are therefore motivated to ensure the long-term accessibility of their content. However, resources on the web frequently disappear, leading to the known detriment of link rot. Beyond the inconvenience of the encounter with a “404 - Page not Found” error, there may be legal implications when published government resources are missing, trust issues when academic institutions fail to provide content, and even national security concerns when taxpayer-funded federal research organizations such as Los Alamos National Laboratory show deficient stewardship of their digital content.

We therefore conducted a web crawl of the lanl.gov domain with the motivation to investigate the scale of missing resources within the canonical website representing the institution. We found a noticeable number of broken links, including a significant number of special cases of link rot commonly known as “soft404s” as well as potential transient errors. We further evaluated the recovery rate of missing resources from more than twenty public web archives via the Memento TimeTravel federated search service. Somewhat surprisingly, our results show little success in recovering missing web pages.
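By way of illustration only (not the authors' actual tooling), the sketch below queries the public Memento TimeTravel JSON API for archived copies of a hypothetical missing lanl.gov page; the endpoint pattern and response fields are assumptions based on the public API, and the example URL is made up.

```python
# Ask the Memento TimeTravel aggregator whether any public web archive holds
# a copy of a (hypothetical) missing page. A 404 from the API means no
# memento was found in the aggregated archives.
import requests

def find_memento(uri, timestamp="20200101"):
    api = f"http://timetravel.mementoweb.org/api/json/{timestamp}/{uri}"
    resp = requests.get(api, timeout=30)
    if resp.status_code == 404:
        return None                      # no archive holds a copy
    resp.raise_for_status()
    closest = resp.json().get("mementos", {}).get("closest", {})
    return closest.get("uri")            # URI-M(s) of the closest capture

print(find_memento("https://www.lanl.gov/some-removed-page.html"))
```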

These observations lead us to argue that, as an institution, we could be a better steward of our web content and that establishing an institutional web archive would be a significant step towards this goal. We therefore implemented a pilot LANL web archive in support of highlighting the availability and authenticity of web resources.

In this presentation, I will motivate the project, outline our workflow, highlight our findings, and demonstrate the implemented pilot LANL web archive. The goal is to showcase an example of an institutional web crawl that, in conjunction with the evaluation, can serve as a blueprint for other interested parties.



11:30am - 11:50am

Public policies for governmental web archiving in Brazil

Jonas Ferrigolo Melo1, Moisés Rockembach2

1University of Porto, Portugal; 2Federal University of Rio Grande do Sul, Brazil

The scientific, cultural, and intellectual relevance of web archiving has been widely recognized since the 1990s. The preservation of the web has been addressed in several studies, ranging from its specific theories and practices, such as methodological approaches and the ethical aspects of preserving web pages, to subjects that permeate the Digital Humanities and its use as a primary source.

This study aims to identify the documents and actions related to the development of a web archiving policy in Brazil. The methodology used was bibliographic and documentary research, drawing on literature on government web archiving and legislation regarding public policies.

Brazil has a variety of technical resources and legislation that address the need to preserve government documents; however, websites have not yet been included in the records management practices of Brazilian institutions. Until recently, the country did not have a website preservation policy. However, there are currently two government actions under development.

A bill that has been under consideration in the National Congress since July 2015 provides for institutional digital public heritage on the web. The bill has been with the Constitution and Justice and Citizenship Commission (CCJC) of the Brazilian National Congress since December 2022.

Another action comes from the National Council of Archives – Brazil (CONARQ), which established a technical chamber to define guidelines for the elaboration of studies, proposals, and solutions for the preservation of websites and social media. Based on its general goals, the technical chamber has produced two documents: (i) the Website and Social Media Preservation Policy; and (ii) a recommendation of basic elements for the digital preservation of websites and social media. The documents were approved in December 2022 and will be published as a federal resolution.

The actions described show that efforts for the state to take a proactive role in promoting and leading this technological innovation are under way in Brazil. The definition of a web archiving policy, as well as of the requirements for selecting preservation and archiving methods, technologies, and the content to be archived, can already be considered a reality in Brazil.

 
10:30am - 12:00pm SES-13: CRAWLING, PLAYBACK, SUSTAINABILITY
Location: Theatre 2
Session Chair: Laura Wrubel, Stanford University
These presentations will be followed by a 10 min Q&A.
 
10:30am - 10:50am

Developer Update for Browsertrix Crawler and Browsertrix Cloud

Ilya Kreymer, Tessa Walsh

Webrecorder, United States of America

This presentation will provide a technical and feature update on the latest features implemented in Browsertrix Cloud and Browsertrix Crawler, Webrecorder's open source automated web archiving tools. The presentation will provide a brief intro to Browsertrix Cloud and the ongoing collaboration between Webrecorder and IIPC partners testing the tool.

We will present an outline for the next phase of development of these tools and discuss current / ongoing challenges in high fidelity web archiving, and how we may mitigate them in the future. We will also cover any lessons learned thus far.

We will end with a brief Q&A to answer any questions about the Browsertrix Crawler and Cloud systems, including how others may contribute to testing and development of these open source tools.



10:50am - 11:10am

Opportunities and Challenges of Client-Side Playback

Clare Stanton, Matteo Cargnelutti

Library Innovation Lab, United States of America

The team working on Perma.cc at the Library Innovation Lab has been using the open-source technologies developed by Webrecorder in production for many years, and has subsequently built custom software around those core services. Recently, in exploring applications for client-side playback of web archives via replayweb.page, we have learned lessons about the security, performance and reliability profile of this technology. This has deepened our understanding of the opportunities it presents and challenges it poses. Subsequently, we have developed an experimental boilerplate for testing out variations of this technology and have sought partners within the Harvard Library community to iterate with, test our learnings, and explore some of the interactive experiences that client-side playback makes possible.

warc-embed is LIL's experimental client-side playback integration boilerplate, which can be used to test out and explore this new technology. It consists of: a cookie-cutter web server configuration for storing, proxying, caching and serving web archive files; a pre-configured "embed" page, serving an instance of replayweb.page aimed at a given archive file; as well as a two-way communication layer allowing the embedding website to safely communicate with the embedded archive. These unique features allow for a thorough exploration of this new technology from a technical and security standpoint.

This is one of two proposals relating to this work. We believe there is an IIPC audience who could attend either or both sessions based on their interests. This session will dive into the technical research conducted at the lab and present those findings.

Combined with the emergence of the WACZ packaging format, client-side playback is a radically different and novel take on web archive playback which allows for the implementation of previously unachievable embedding scenarios. This session will explore the technical opportunities and challenges client-side playback presents from a performance, security, ease-of-access and programmability perspective by going over concrete implementation examples of this technology on Perma.cc and warc-embed.



11:10am - 11:30am

Sustaining pywb through community engagement and renewal: recent roadmapping and development as a case study in open source web archiving tool sustainability

Tessa Walsh, Ilya Kreymer

Webrecorder

IIPC’s adoption of pywb as the “go to” open source web archive replay system for its members, along with Webrecorder’s support for transitioning to pywb from other “wayback machine” replay systems, brings a large new user base to pywb. In the interests of ensuring pywb continues to sustainably meet the needs of IIPC members and the greater web archiving community, Webrecorder has been investing in maintenance and new releases for the current 2.x release series of pywb as well as engaging in the early stages of a significant 3.0 rewrite of pywb. These changes are being driven by a community roadmapping exercise with members of the IIPC oh-sos (Online Hours: Supporting Open Source) group and other pywb community stakeholders.

This talk will outline some of the recent feature and maintenance work done in pywb 2.7, including a new interactive timeline banner which aims to promote easier navigation and discovery within web archive collections. It will go on to discuss the community roadmapping process for pywb 3.0 and give an overview of the proposed new architecture, perhaps even showing an early demo if development has progressed far enough by May 2023 to support doing so.

The talk will aim to not only share specific information about pywb and the efforts being put into its sustainability and maintenance by both Webrecorder and the IIPC community, but also to use pywb as a case study to discuss the resilience, sustainability, and renewal of open source software tools that enable web archiving for all. pywb as a codebase is after all nearly a decade old itself and has gone through several rounds of significant rewrites as well as eight years of regular maintenance by Webrecorder staff and open source contributors to get to its current state, making it a prime example of how ongoing effort and community involvement make all the difference in building sustainable open source web archiving tools.



11:30am - 11:50am

Addressing the Adverse Impacts of JavaScript on Web Archives

Ayush Goel1, Jingyuan Zhu1, Ravi Netravali2, Harsha V. Madhyastha1

1University of Michigan, United States of America; 2Princeton University, United States of America

Over the last decade, the presence of JavaScript code on web pages has dramatically increased. While JavaScript enables websites to offer a more dynamic user experience, its increasing use adversely impacts the fidelity of archived web pages. For example, when we load snapshots of JavaScript-heavy pages from the Internet Archive, we find that many are missing important images and JavaScript execution errors are common.

In this talk, we will describe the takeaways from our research on how to archive and serve pages that are heavily reliant on JavaScript. Via fine-grained analysis of JavaScript execution on 3000 pages spread across 300 sites, we find that the root cause of the poor fidelity of archived page copies is that the execution of JavaScript code that appears on the web is often dependent on the characteristics of the client device on which it is executed. For example, JavaScript on a page can execute differently based on whether the page is loaded on a smartphone or on a laptop, or whether the browser used is Chrome or Safari; even subtle differences like whether the user's network connection is over 3G or WiFi can affect JavaScript execution. As a result, when a user loads an archived copy of a page in their browser, JavaScript on the page might attempt to fetch a different set of embedded resources (i.e., images, stylesheets, etc.) as compared to those fetched when this copy was crawled. Since a web archive is unable to serve resources that it did not crawl, the user sees an improperly rendered page both because of missing content and JavaScript runtime errors.

To account for the sources of non-deterministic JavaScript execution, a web archive cannot crawl every page in all possible execution environments (client devices, browsers, etc), as doing so would significantly inflate the cost of archiving. Instead, if we augment archived JavaScript such that the code on any archived page will always execute exactly how it did when the page was crawled, we are able to ensure that all archived pages match their original versions on the web, both visually and functionally.

 
10:30am - 12:00pm WKSHP-05: SUPPORTING COMPUTATIONAL RESEARCH ON WEB ARCHIVES WITH THE ARCHIVE RESEARCH COMPUTE HUB (ARCH)
Location: Labs Room 1 (workshops)
Pre-registration required for this event.
 

Supporting Computational Research on Web Archives with the Archive Research Compute Hub (ARCH)

Jefferson Bailey, Kody Willis, Helge Holzmann, Alex Dempsey

Internet Archive, United States of America

Coordinators:

  • Jefferson Bailey, Director of Archiving & Data Services, Internet Archive

  • Kody Willis, Product Operations Manager, Archiving & Data Services, Internet Archive

  • Helge Holzmann, Senior Data Engineer, Archiving & Data Services, Internet Archive

  • An Archives Unleashed member may also coordinate/participate

Format: 90 or 120-minute workshop and tutorial

Target Audience: The target audience is professionals working in digital library services that are collecting, managing, or providing access to web archives, scholars using web archives and other digital collections in their work, library professionals working to support computational access to digital collections, and digital library technical staff.

Anticipated Number of Participants: 25
Technical Requirements: A meeting room with wireless internet access and a projector or video display. Participants must bring laptop computers and there should be power outlets. The coordinators will handle preliminary activities over email and provide some technical support beforehand as far as building or accessing web archives for use in the workshop.

Abstract: Every year more and more scholars are conducting research on terabytes and even petabytes of digital library and archive collections using computational methods such as data mining, natural language processing, and machine learning. Web archives are a significant collection of interest for these researchers, especially due to their contemporaneity, size, multi-format nature, and how they can represent different thematic, demographic, disciplinary, and other characteristics. Web archives also have longitudinal complexity, with frequent changes in content (and often state of existence) even at the same URL, gobs of metadata both content-based and transactional, and many characteristics that make them highly suitable for data mining and computational analysis. Supporting computational use of web archives, however, poses many technical, operational, and procedural challenges for libraries. Similarly, while platforms exist for supporting computational scholarship on homogenous collections (such as digitized texts, images, or structured data), none exist that handle the vagaries of web archive collections while also providing a high level of automation, seamless user experience, and support for both technical and non-technical users.

In 2020, Internet Archive Research Services and the Archives Unleashed project received funding for joint technology development and community building to combine their respective tools that enable computational analysis of web and digital archives in order to build an end-to-end platform supporting data mining of web archives. The program is also simultaneously building out a community of computational researchers doing scholarly projects via a program supporting cohort teams of scholars that receive direct technical support for their projects. The beta platform, Archives Research Compute Hub (ARCH), is currently being used by dozens of researchers in the digital humanities and the social and computer sciences, and by dozens of libraries and archives that are interested in supporting local researchers and sharing datasets derived from their web collections in support of large-scale digital research methods.

ARCH lowers the barriers to conducting research with web archives, using data processing operations to generate 16 different derivatives from WARC files. Derivatives support uses ranging from graph analysis and text mining to file format extraction, and ARCH makes it possible to visualize, download, and integrate these datasets into third-party tools for more advanced study. ARCH enables analysis of the more than 20,000 web archive collections - over 3 PB of data - collected by over 1,000 institutions using Archive-It, which cover a broad range of subjects and events. ARCH also includes various portions of the overall Wayback Machine global web archive, totalling 50+ PB and going back to 1996.
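As a hedged illustration of what integrating such derivatives into third-party tools can look like in practice, the snippet below loads a domain-graph style derivative into pandas; the file name and column names are hypothetical, not the actual ARCH output schema.

```python
# Load a (hypothetical) domain-graph derivative and list the most frequently
# linked-to domains. File name and column names are assumptions for the sketch.
import pandas as pd

edges = pd.read_csv("example-collection-domain-graph.csv")
# assumed columns: crawl_date, source_domain, target_domain, count

top_targets = (
    edges.groupby("target_domain")["count"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
print(top_targets)
```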

This workshop will be a hands-on training covering the full lifecycle of supporting computational research on web archives. The agenda will include an overview of the conceptual challenges researchers face when working with web archives, the procedural challenges that librarians face in making web archives available for computational use, and most importantly, will provide an in-depth tutorial on using the ARCH platform and its suite of data analysis, dataset generation, data visualization, and data publishing tools, from the perspectives of a collection manager, a research services librarian, and a computational scholar. Workshop attendees will be able to build small web archive collections beforehand or will be granted access to existing web archive collections to use during the workshop. All participants will also have access to any datasets and data visualizations created as part of the workshop.

Anticipated Learning Outcomes:

Given the conference, we expect the attendees primarily to be web archivists, collection managers, digital librarians, and other library and archives staff. After the workshop, attendees will:

  • Understand the full lifecycle of making web and digital archives available for computational use by researchers, scholars, and others. This includes gaining knowledge of outreach and promotion strategies to engage research communities, how to handle computational research requests, how to work with researchers to scope and refine their requests, how to make collections available as data, how to work with internal technical teams facilitating requests, dataset formats and delivery methods, and how to support researchers in ongoing data analysis and publishing.

  • Gain knowledge of the specific types of data analysis and datasets that are possible with web archive collections, including data formats, digital methods, tools, infrastructure requirements, and the related methodological affordances and limitations for scholarship related to working with web archives as data.

  • Receive hands-on training on using the ARCH platform to explore and analyze web archive collections, from both the perspective of a collection manager and that of a researcher.

  • Be able to use the ARCH platform to generate derivative datasets, create corresponding data visualizations, publish these datasets to open-access repositories, and conduct further analysis with additional data mining tools.

  • Have tangible experience with datasets and related technologies in order to perform specific analytic tasks on web archives, such as exploring graph networks of domains and hyperlinks, extracting and visualizing images and other specific formats, and performing textual analysis and other interpretive functions.

  • Have insights into digital methods through their exposure to a variety of different active, real-life use cases from scholars and research teams currently using the ARCH platform for digital humanities and similar work.

 
12:00pm - 1:00pm LUNCH
1:00pm - 2:00pm SES-14 (PANEL): INCLUSIVE REPRESENTATION AND PRACTICES IN WEB ARCHIVING
Location: Theatre 1
Session Chair: Daniel Steinmeier, KB National Library of the Netherlands
 

Renewal in Web Archiving: Towards More Inclusive Representation and Practices

Makiba Foster1, Bergis Jules2, Zakiya Collier3

1The College of Wooster; 2Archiving The Black Web; 3Shift Collective

“The future is already here, it's just not very equally distributed, yet” - William Gibson
In this session you will learn about a growing community of practice of independent yet interconnected projects whose work converges as an intervention to critically engage the practice of web archiving to be more inclusive in terms of what gets web archived and who gets to build web archives. These projects reimagine a future for web archiving that distributes the practice and diversifies the collections.

Presentation 1- Archiving The Black Web

Author/Presenter: Makiba Foster, The College of Wooster and Bergis Jules, Archiving the Black Web

Abstract: Unactualized web archiving opportunities for Black knowledge collecting institutions interested in documenting web-based Black history and culture have reached critical levels due to the expansive growth of content produced about the Black experience by Black digital creators. Founded in 2019, Archiving The Black Web (ATBW) works to establish more equitable, accessible, and inclusive web archiving practices to diversify not only collection practices but also its practitioners. ATBW's creators will discuss the collaborative catalyst for the creation and launch of this important DEI initiative within web archiving. In this panel session, attendees will learn more about ATBW’s mission to address web archiving disparities. ATBW envisions a future that includes cultivating a community of practice for Black collecting institutions, developing training opportunities to diversify the practice of web archiving, and expanding the scope of web archives to include culturally relevant web content.

Presentation 2 - Schomburg Syllabus

Author/Presenter: Zakiya Collier, Shift Collective

Abstract: From 2017 to 2019 the Schomburg Center for Research in Black Culture participated in the Internet Archive’s Community Webs program, becoming the first Black collecting institution to create a web archiving program centering web-based Black history and culture. Recognizing that content in crowdsourced hashtag syllabi could be lost to the ephemerality of the Web, the #HashtagSyllabusMovement collection was created to archive online educational material related to publicly produced, crowdsourced content highlighting race, police violence, and other social justice issues within the Black community. Both the first of its kind in focus and within The New York Public Library system, the Schomburg Center’s web archiving program faced challenges including but not limited to identifying ways to introduce the concept of web archiving to Schomburg Center researchers and community members, demonstrating the necessity of a well-supported web archiving program to Library administration, and expressing the urgency needed in centering Black content on the web that may be especially ephemeral, such as content associated with struggles for social justice. It was necessary for the Schomburg Center not only to continue its web archiving efforts with the #Syllabus and other web archive collections, but also to develop strategies to invoke the same sense of urgency and value for Black web archive collections that we now see demonstrated in the collection of analog records documenting Black history, culture and activism, especially as social justice organizing efforts increasingly have online components.

As a result, the #SchomburgSyllabus project was developed to merge web archives and analog resources from the Schomburg Center in celebration of Black people's longstanding self-organized educational efforts. #SchomburgSyllabus uniquely organizes primary and secondary sources into a 27-themed web-based resource guide that can be used for classroom curriculum, collective study, self-directed education, and social media and internet research. Tethering web-archived resources to the Schomburg Center’s world-renowned physical collections of Black diasporic history has proven key in garnering support for the Schomburg’s web archiving program and enthusiasm for the preservation of the Black web, as demonstrated by the #SchomburgSyllabus’ use in classrooms, inclusion in journal articles, and features in cultural/educational TV programs.

 
1:00pm - 2:10pm SES-15: DATA CONSIDERATIONS
Location: Theatre 2
Session Chair: Sophie Ham, Koninklijke Bibliotheek
These presentations will be followed by a 10 min Q&A.
 
1:00pm - 1:20pm

What if GitHub disappeared tomorrow?

Emily Escamilla, Michele Weigle, Michael Nelson

Old Dominion University, United States of America

Research is reproducible when the methodology and data originally presented by the researchers can be used to reproduce the results found. Reproducibility is critical for verifying and building on results, both of which benefit the scientific community. The correct implementation of the original methodology and access to the original data are the lynchpin of reproducibility. Researchers are putting the exact implementation of their methodology in online repositories like GitHub. In our previous work, we analyzed arXiv and PubMed Central (PMC) corpora and found 219,961 URIs to GitHub in scholarly publications. Additionally, in 2021, one in five arXiv publications contained at least one link to GitHub. These findings indicate the increasing reliance of researchers on the holdings of GitHub to support their research. So, what if GitHub disappeared tomorrow? Where could we find archived versions of the source code referenced in scholarly publications? Internet Archive, Zenodo, and Software Heritage are three different digital libraries that may contain archived versions of a given repository. However, they are not guaranteed to contain a given repository, and the method for accessing the code from the repository varies across the three digital libraries. Additionally, Internet Archive, Zenodo, and Software Heritage all approach archiving from different perspectives and with different use cases in mind, which may impact reproducibility. Internet Archive is a Web archive; therefore, the crawler archives the GitHub repository as a Web page and not specifically as a code repository. Zenodo allows researchers to publish source code and data and to share them with a DOI. Software Heritage allows researchers to preserve source code and issues permalinks for individual files and even lines of code. In this presentation, we will answer the questions: What if GitHub disappeared tomorrow? What percentage of scholarly repositories are in Internet Archive, Zenodo, and Software Heritage? What percentage of scholarly repositories would be lost? Do the archived copies available in these three digital libraries facilitate reproducibility? How can other researchers access source code in these digital libraries?
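By way of illustration only (not the authors' methodology), the short Python sketch below checks a hypothetical GitHub repository URL against two of the digital libraries mentioned, using the public Wayback Machine availability API and the Software Heritage origin API.

```python
# Check whether a (hypothetical) GitHub repository URL has an archived copy
# in the Wayback Machine and has been ingested by Software Heritage.
import requests

repo = "https://github.com/example-org/example-repo"   # placeholder URL

# Internet Archive: closest Wayback Machine capture of the repository page.
ia = requests.get("https://archive.org/wayback/available",
                  params={"url": repo}, timeout=30).json()
print("Wayback snapshot:", ia.get("archived_snapshots", {}).get("closest"))

# Software Heritage: has the repository been archived as a software origin?
swh = requests.get(
    f"https://archive.softwareheritage.org/api/1/origin/{repo}/get/",
    timeout=30)
print("Software Heritage:", "archived" if swh.ok else "not found")
```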



1:20pm - 1:40pm

Web archives and FAIR data: exploring the challenges for Research Data Management (RDM)

Sharon Healy1, Ulrich Karstoft Have2, Sally Chambers3, Ditte Laursen4, Eld Zierau4, Susan Aasman5, Olga Holownia6, Beatrice Cannelli7

1Maynooth University; 2NetLab; 3KBR & Ghent Centre for Digital Humanities; 4Royal Danish Library; 5University of Groningen; 6IIPC; 7School of Advanced Study, University of London

The FAIR principles imply “that all research objects should be Findable, Accessible, Interoperable and Reusable (FAIR) both for machines and for people” (Wilkinson et al., 2016). These principles present varying degrees of technical, legal, and ethical challenges in different countries when it comes to access and the reusability of research data. This equally applies to data in web archives (Boté & Térmens, 2019; Truter, 2021). In this presentation we examine the challenges for the use and reuse of data from web archives from both the perspectives of web archive curators and users, and we assess how these challenges influence the application of FAIR principles to such data.

Researchers' use of web archives has increased steadily in recent years, across a multitude of disciplines, using multiple methods (Maemura, 2022; Gomes et al., 2021; Brügger & Milligan, 2019). This development would imply that there is a diversity of requirements regarding the RDM lifecycle for the use and reuse of web archive data. Nonetheless, there has been very little research examining the challenges for researchers in the application of FAIR principles to the data they use from web archives.

To better understand current research practices and RDM challenges for this type of data, a series of semi-structured interviews were undertaken with both researchers who use web or social media archives for their research and cultural heritage institutions interested in improving the access of their born-digital archives for research.

Through an analysis of the interviews we offer an overview of several aspects which present challenges for the application of FAIR principles to web archive data. We assess how current RDM practices transfer to such data from both a researcher and archival perspective, including an examination of how FAIR web archives are (Chambers, 2020). We also look at the legal and ethical challenges experienced by creators and users of web archives, and how they impact on the application of FAIR principles and cross-border data sharing. Finally, we explore some of the technical challenges, and discuss methods for the extraction of datasets from web archives using reproducible workflows (Have, 2020).
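As a minimal, hedged sketch of the kind of reproducible extraction workflow referred to above (the input file name is a placeholder, and the warcio library is one common option rather than necessarily the tooling used in the cited work), the snippet below derives a small tabular dataset from a WARC file.

```python
# Derive a simple CSV dataset (URL, capture time, MIME type) from a WARC file.
import csv
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream, \
     open("responses.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["url", "capture_time", "mime_type"])
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":          # skip requests, metadata
            writer.writerow([
                record.rec_headers.get_header("WARC-Target-URI"),
                record.rec_headers.get_header("WARC-Date"),
                record.http_headers.get_header("Content-Type"),
            ])
```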



1:40pm - 2:00pm

Lessons Learned in Hosting the End of Term Web Archive in the Cloud

Mark Phillips1, Sawood Alam2

1University of North Texas, United States of America; 2Internet Archive, United States of America

The End of Term (EOT) Web Archive is composed of member institutions across the United States who have come together every four years since 2008 to complete a large-scale crawl of the .gov and .mil domains, documenting the transition in the Executive Branch of the Federal Government of the United States. In years when a presidential transition did not occur, these crawls served as a systematic crawl of the .gov domain in what has become a longitudinal dataset of crawls. In 2022 the EOT team from the UNT Libraries and the Internet Archive moved nearly 700TB of primary WARC content and derivative formats into the cloud. The goal of this work was to provide easier computational access to the web archive by hosting a copy of the WARC files and derivative WAT, WET, and CDXJ files in the Amazon S3 Storage Service as part of Amazon’s Open Data Sponsorship Program. In addition to these common formats in the web archive community, the EOT team modeled our work on the structure and layout of the Common Crawl datasets, including their use of the columnar storage format Parquet to represent CDX data in a way that enables access with query languages like SQL. This presentation will discuss the lessons learned in staging and moving these web archives into AWS, and the layout used to organize the crawl data into 2008, 2012, 2016, and 2020 datasets and further into different groups based on the original crawling institution. We will give examples of how content staged in this manner can be used by researchers both inside and outside of a collecting institution to answer questions that had previously been challenging to answer about these web archives. The EOT team will discuss the documentation and training efforts underway to help researchers incorporate these datasets into their work.
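As a hedged sketch of the style of access this layout enables (the file path and column names below are assumptions modeled loosely on Common Crawl's columnar index, not the actual End of Term schema), capture metadata stored as Parquet can be queried with SQL, for example via DuckDB:

```python
# Query (hypothetical) Parquet CDX data with SQL: top MIME types among
# successful captures. Path and column names are placeholders.
import duckdb

query = """
    SELECT content_mime_type, COUNT(*) AS captures
    FROM read_parquet('eot-2020-cdx/*.parquet')
    WHERE fetch_status = 200
    GROUP BY content_mime_type
    ORDER BY captures DESC
    LIMIT 10
"""
print(duckdb.sql(query).df())
```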

 
1:00pm - 3:00pmWKSHP-06: RUN YOUR OWN FULL STACK SOLRWAYBACK
Location: Labs Room 1 (workshops)
Pre-registration required for this event.
 

Run your own full stack SolrWayback

Thomas Egense, Toke Eskildsen, Jørn Thøgersen, Anders Klindt Myrvoll

Royal Danish Library, Denmark

An in-person, updated version of the 2021 WAC workshop "Run your own full stack SolrWayback":
https://netpreserve.org/event/wac2021-solrwayback-1/

This workshop will

  1. Explain the ecosystem for SolrWayback 4 (https://github.com/netarchivesuite/solrwayback)

  2. Perform a walkthrough of installing and running the SolrWayback bundle. Participants are expected to mirror the process on their own computers, and there will be time to solve installation problems

  3. Leave participants with a fully working stack for indexing, discovery and playback of WARC files

  4. End with open discussion of SolrWayback configuration and features.

Prerequisites:

  • Participants should have a Linux, Mac or Windows computer with Java 8 or Java 11 installed. To check that Java is installed, type this in a terminal: java -version

  • Downloading the latest release of the SolrWayback bundle from https://github.com/netarchivesuite/solrwayback/releases beforehand is recommended.

  • Having institutional WARC files available is a plus, but sample files can be downloaded from https://archive.org/download/testWARCfiles

  • A mix of WARC files from different harvests/years will showcase SolrWayback's capabilities in the best possible way.

Target audience:

Web archivists and researchers with medium knowledge of web archiving and tools for exploring web archives. Basic technical knowledge of starting a program from the command line is required; the SolrWayback bundle is designed for easy deployment.
Maximum number of participants: 30


Background

SolrWayback 4 (https://github.com/netarchivesuite/solrwayback) is a major rewrite with a strong focus on improving usability. It provides real-time full-text search, discovery, statistics extraction & visualisation, data export and playback of web archive material. SolrWayback uses Solr (https://solr.apache.org/) as the underlying search engine. The index is populated using Web Archive Discovery (https://github.com/ukwa/webarchive-discovery). The full stack is open source and freely available. A live demo is available at https://webadmin.oszk.hu/solrwayback/

During the conference there will be focused support for SolrWayback in a dedicated Slack channel by Thomas Egense and Toke Eskildsen.
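As a rough sketch of how the stack fits together once the bundle is running, the Python snippet below queries the underlying Solr index over HTTP; the port, collection name ('netarchivebuilder') and field names follow common SolrWayback bundle defaults and the webarchive-discovery schema, but should be treated as assumptions for any given installation.

import requests

# Standard Solr select query against the index that SolrWayback searches.
resp = requests.get(
    "http://localhost:8983/solr/netarchivebuilder/select",  # assumed default core name
    params={"q": "climate change", "rows": 5, "wt": "json"},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()

print("hits:", data["response"]["numFound"])
for doc in data["response"]["docs"]:
    # 'url' and 'crawl_date' are fields from the webarchive-discovery schema (assumed here).
    print(doc.get("crawl_date"), doc.get("url"))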

 
2:00pm - 2:20pmBREAK
2:20pm - 3:50pmSES-16: PRESERVATION & COMPLEX DIGITAL PUBLICATIONS
Location: Theatre 1
Session Chair: Kiki Lennaerts, Sound & Vision
These presentations will be followed by a 10 min Q&A.
 
2:20pm - 2:40pm

Preservability and Preservation of Digital Scholarly Editions

Michael Kurzmeier1, James O'Sullivan1, Mike Pidd2, Orla Murphy1, Bridgette Wessels3

1University College Cork, Ireland; 2University of Sheffield; 3University of Glasgow

Digital Scholarly Editions (DSEs) are web resources and are thus subject to data loss. While DSEs are usually the result of funded research, their longevity and preservation are uncertain. DSEs might be partially or completely captured during web archiving crawls, in some cases making web archives the only remaining publicly available source of information about a DSE. Patrick Sahle's Catalogue of DSEs (2020) lists roughly 800 URLs referring to DSEs, of which 46 point to the Internet Archive. This shows the overlap between DSEs and web archives and highlights the need for a closer look at the longevity and archiving of these important resources. This presentation will introduce a recent study on the availability and longevity of DSEs and introduce different preservation models and examples specific to DSEs. Examples of lost and partially preserved editions will be used to illustrate the problem of preservation and preservability of DSEs. This presentation will also outline the specific challenges of archiving DSEs.

The C21 Editions project is a three-year international research collaboration investigating the state of the art and the future of DSEs. As part of the project output, this presentation will introduce the main data sources on DSEs and demonstrate the workflow used to assess DSE availability over time. It will illustrate the role web archives play in the preservation of DSEs as well as highlight specific challenges DSEs present to web archiving. As DSEs are complex projects, featuring multiple layers of data, transcription and annotation, their full preservation usually requires ongoing maintenance of an often custom-built backend system. Once project funding ends, these structures are very prone to deterioration and loss. Besides ongoing maintenance, other preservation models exist, generally reducing the archiving scope in order to reduce the ongoing work required (Dillen 2019; Pierazzo 2019; Sahle and Kronenwett 2016). Editions that use standard rather than bespoke solutions are more likely to be fully preserved. Other approaches include 'preservability by design' through minimal computing (Elwert n.d.) or standardization through existing services such as DARIAH or GitHub. The presentation will outline these models using examples of successful preservation as well as lost editions.

This presentation is part of the larger C21 Editions project, a three-year international collaboration jointly funded by the Arts & Humanities Research Council (AH/W001489/1) and Irish Research Council (IRC/W001489/1).



2:40pm - 3:00pm

Collecting and presenting complex digital publications

Ian Cooke, Giulia Carla Rossi

The British Library, United Kingdom

'Emerging Formats' is a term used by UK legal deposit libraries to describe experimental and innovative digital publications for which there are no collection management solutions that can operate at scale. They are important to the libraries and their users because they document a period of creativity and rapid change, often include authors and experiences that are less well represented in mainstream publications, and are at high risk of loss. For six years, the UK legal deposit libraries have been working collaboratively and experimentally both to survey the types of publications and to test approaches to collection that will support preservation, discovery and access. An important concept in this work has been 'contextual collecting', which seeks to preserve the best possible archival instance of a work alongside information that documents how the work was created and how it was experienced by users.

Web archiving has formed an important part of this work, both in providing practical tools to support collection management, including access, and in supporting the collection of contextual information. An example of this can be seen in the New Media Writing Prize thematic collection: https://www.webarchive.org.uk/en/ukwa/collection/2912

In this presentation, we will step back from specific examples, and talk about what we have learned so far from our work as a whole. We will outline how this work, including user research and engagement, has shaped policy at the British Library, through the creation of our 'Content Development Plan' for Emerging Formats, and the role of web archiving within that plan.

This presentation contributes to the Collections themes of 'blurring the boundaries between web archives and other born digital collections' and 'reuse of web archived materials for other born digital collections'. It builds on previous presentations at the Web Archiving Conference, which focused on specific challenges related to collecting complex digital publications, to demonstrate how this research has informed the policy direction at the British Library and how web archiving infrastructure will be built into efforts to collect, assess and make accessible new publications.



3:00pm - 3:20pm

What can web archiving history tell us about preservation risks?

Susanne van den Eijkel, Daniel Steinmeier

KB, National Library of the Netherlands

When people talk about the necessity of preservation, the first thing that comes to mind is the supposed risk of file format obsolescence. Within the preservation community, however, voices have raised the concern that this might not be the most pressing risk. If we are solving the wrong problem, we are neglecting the real one. It is therefore important to know that the solutions we create address demonstrably real problems. Web archiving could be a great source of information for researching the most urgent risks, because developments and standards on the web are very fluid. There are examples of file formats on the web, such as Flash, that are no longer supported by modern browsers. However, these formats can still be rendered using widely available software. We have also seen that website owners migrated their content from Flash to HTML5. So, can we really say that obsolescence has resulted in loss of data? How can we find out more about this? And more importantly, can we find out which risks are actually more relevant?

At the National Library of the Netherlands, we have been building a web collection since 2007. By looking at a few historical webpages we will illustrate where to look for answers and how to formulate better preservation risks using source data and context information. At iPRES 2022 we presented a short paper on the importance of context information for web collections. This information helps us understand the scope and the creation process of the archived website. In this presentation, we will demonstrate how we use this context information to identify sustainability risks for web collections. This will also give us insight into sustainability risks in general, so we can create better informed preservation strategies.



3:20pm - 3:40pm

Towards an effective long-term preservation of the web. The case of the Publications Office of the EU

Corinne Frappart

Publications Office of the European Union, Luxembourg

Much is written about web archiving in general: new and improved methods to capture the World Wide Web and to facilitate access to the resulting archives are constantly being described and shared. But when it comes to the long-term preservation of websites, i.e. safeguarding the ARC/WARC files with proper planning of preservation actions beyond simple bit preservation, the literature is much less abundant.

The Publications Office of the EU is responsible for the preservation of the websites authored by the EU institutions. In addition to our activities in harvesting content and making it accessible through our public web archive (https://op.europa.eu/en/web/euwebarchive), we have started to delve more deeply into the management of content preserved for the long term.

Our reflection focused on long-term risks such as obsolescence or loss of file usability, and on the availability of a disaster recovery mechanism for the platform providing access to the web archive. Ingesting web archive files into a long-term preservation system raises many questions:

  • Should we expect different difficulties with ARC and WARC files? Is it worth migrating the ARC files to WARC to obtain a consistent collection to which the same tools can be applied? (A minimal migration sketch follows this abstract.)
  • Does ARC/WARC file compression impact the storage, the processing time, the preservation actions?
  • What is the best granularity for the preservation of a web archive?
  • Should the characterization of the numerous files embedded in ARC/WARC files occur during or after ingestion, and what impact does this choice have on preservation actions?
  • How can descriptive, technical and provenance metadata be enriched, possibly automatically, and where can they be stored?
  • What kind of information about the context of the crawls, the format description and the data structure should also be preserved to help future users understand the content of the ARC/WARC files?

To get advice on these and other questions, the Publications Office commissioned a study of published and grey literature, supplemented by a series of interviews with leading institutions in the field of web archiving. This paper presents the findings and offers recommendations on how to answer the questions above.
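As a minimal sketch, not the Publications Office's actual procedure, the Python snippet below shows one way an ARC file could be migrated to WARC with the warcio library (warcio also ships a 'warcio recompress' command line that covers this case); the file names are placeholders.

from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

def arc_to_warc(arc_path, warc_path):
    with open(arc_path, 'rb') as src, open(warc_path, 'wb') as dst:
        writer = WARCWriter(dst, gzip=True)
        # arc2warc=True makes the iterator expose legacy ARC records as WARC records,
        # which the writer then serializes into a new, consistent WARC file.
        for record in ArchiveIterator(src, arc2warc=True):
            writer.write_record(record)

arc_to_warc('legacy-crawl-2004.arc.gz', 'legacy-crawl-2004.warc.gz')  # placeholder file names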

 
2:20pm - 3:50pmSES-17: PROGRAM INFRASTRUCTURE
Location: Theatre 2
Session Chair: René Voorburg, KB, National Library of the Netherlands
These presentations will be followed by a 10 min Q&A.
 
2:20pm - 2:40pm

Maintenance Practices for Web Archives

Ed Summers, Laura Wrubel

Stanford University, United States of America

What makes a web archive an archive? Why don't we call them web collections instead, since they are resources that have been collected from the web and made available again on the web? Perhaps one reason that the term archive has stuck is that it entails a commitment to preserving the collected web resources over time and to providing continued access to them. Just like the brick-and-mortar buildings that must be maintained to house traditional archives, web archives are supported by software and hardware infrastructure that must be cared for to ensure that they remain accessible. In this talk we will present some examples of what this maintenance work looks like in practice, drawing on experiences at Stanford University Libraries (SUL).

While many organizations actively use third-party services like Archive-It, PageFreezer, and ArchiveSocial to create web archives, it is less common for them to retrieve the collected data and make it available outside that service platform. Since 2012, SUL has been building web archive collections as part of its general digital collections using tools such as HTTrack, CDL's Web Archiving Service, Archive-It and, more recently, Webrecorder. These collections were made available using the OpenWayback software, but in 2022 SUL switched to the pywb application.

We will discuss some of the reasons why Stanford initially found it important to host its own web archive replay service and what factors led to switching to pywb. Work such as reindexing and quality assurance testing was integral to the move to pywb, which in turn generated new knowledge about the web archive records, as well as new practices for transitioning them into the new software environment. The acquisition and preservation of, and access to, web archives have been incorporated into the microservice architecture of the Stanford Digital Repository. One key benefit of this mainstreaming is shared terminology, infrastructure and maintenance practices for web archives, which is essential for sustaining the service. We will conclude with some consideration of what these local findings suggest about successfully maintaining open source web archiving software as a community.
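As a rough illustration of the kind of per-record data a reindexing step has to recover, the Python sketch below walks a WARC file with warcio; it is an assumed workflow, not SUL's actual pipeline, and real CDXJ generation (SURT keys, record offsets, digests) would normally be delegated to a dedicated indexer such as pywb's cdxj-indexer.

from warcio.archiveiterator import ArchiveIterator

def summarize_warc(path):
    """Yield (timestamp, status, url) for each response record in a WARC file."""
    with open(path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            url = record.rec_headers.get_header('WARC-Target-URI')
            date = record.rec_headers.get_header('WARC-Date')  # e.g. 2022-01-31T12:00:00Z
            timestamp = ''.join(ch for ch in date if ch.isdigit())[:14]  # 14-digit CDX-style timestamp
            status = record.http_headers.get_statuscode() if record.http_headers else ''
            yield timestamp, status, url

for ts, status, url in summarize_warc('example.warc.gz'):  # placeholder file name
    print(ts, status, url)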



2:40pm - 3:00pm

Radical incrementalism and the resilience and renewal of the National Library of Australia's web archiving infrastructure

Alex Osborne, Paul Koerbin

National Library of Australia, Australia

The National Library of Australia's web archiving program is one of the world's earliest established and longest continually sustained operations. From its inception it focused on establishing and delivering a functional operation as soon as feasible. This work historically included the development of policy, procedures and guidelines, together with much effort working through the changing legal landscape, from a permissions-based operation to one based on a legal deposit warrant.

Changes to the Copyright Act (1968) in 2016, which extended legal deposit to online materials, gave impetus to the NLA's strategic priorities to pursue more comprehensive collecting objectives and to expand open access to its entire web archive corpus. This also had significant implications for the NLA's online collecting infrastructure. In part this involved confronting and dealing with a large legacy of web content collected with various tools and structured in disparate forms; in part it involved rebuilding the collecting workflow infrastructure while sustaining and redeveloping existing collaborative collecting processes.

After establishing this historical context, this presentation will focus attention on the NLA's approach to the development of its web archiving infrastructure, an approach described as radical incrementalism: taking small, pragmatic steps that over time achieve major objectives. While effective in providing a way to achieve strategic objectives, this approach can also build a legacy of infrastructural dead weight that needs to be dealt with in order to continue to sustain and renew the dynamic and challenging task of web archiving. With a radical team restructure and an agile, iterative approach to development, the NLA has recently made significant progress in moving from a legacy infrastructure to one of renewed sustainability and flexibility.

This presentation will highlight some of the recent developments in the NLA's web archiving infrastructure, including the web archive collection management system (comprising 'Bamboo' and 'OutbackCDX') and the web archive workflow management tool, 'PANDAS'.



3:00pm - 3:20pm

Arquivo.pt behind the curtains

Daniel Gomes

FCT: Arquivo.pt, Portugal

Arquivo.pt is a governmental service that enables search of and access to historical information preserved from the Web since the 1990s. The search services provided by Arquivo.pt include full-text search, image search, version history listing, advanced search and application programming interfaces (APIs). Arquivo.pt has been running as an official public service since 2013, but in the same year its system totally collapsed due to a severe hardware failure and an over-optimistic architectural design. Since then, Arquivo.pt has been completely renewed to improve its resilience. At the same time, Arquivo.pt has been widening the scope of its activities by improving the quality of the acquired web data and deploying online services of general interest to public administration institutions, such as the Memorial, which preserves the information of historical websites, and Arquivo404, which fixes broken links on live websites. These innovative offerings require the delivery of resilient services that are constantly available.

The Arquivo.pt hardware infrastructure is hosted at its own data centre and is managed by full-time dedicated staff. The preservation workflow is performed through a large-scale information system distributed over about 100 servers. This presentation will describe the software and hardware architectures adopted to maintain the quality and resilience of Arquivo.pt. These architectures were 'designed to fail', following a 'share-nothing' paradigm. Continuous integration tools and processes are essential to assure the resilience of the service. The Arquivo.pt online services are supported by 14 micro-services that must be kept permanently available. The Arquivo.pt software architecture is composed of 8 systems that host 35 components, and the hardware architecture is composed of 9 server profiles. The average availability of the online services provided by Arquivo.pt in 2021 was 99.998%. Web archives must urgently assume their role in digital societies as memory keepers of the 21st century. The objective of this presentation is to share our lessons learned at a technical level so that other initiatives may be developed at a faster pace using the most adequate technologies and architectures.
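As a small usage sketch of the APIs mentioned above, the Python snippet below calls the Arquivo.pt full-text search endpoint; the URL, parameters and response field names follow the publicly documented TextSearch API but should be treated here as assumptions rather than a definitive reference.

import requests

resp = requests.get(
    'https://arquivo.pt/textsearch',          # documented full-text search endpoint (assumed)
    params={'q': 'eleições', 'maxItems': 5},  # query term and result count (assumed parameter names)
    timeout=30,
)
resp.raise_for_status()

for item in resp.json().get('response_items', []):
    # tstamp, originalURL and linkToArchive are field names assumed from the API documentation.
    print(item.get('tstamp'), item.get('originalURL'), item.get('linkToArchive'))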



3:20pm - 3:40pm

Implementing access to and management of archived websites at the National Archives of the Netherlands

Antal Posthumus

Nationaal Archief, The Netherlands

The National Archives of the Netherlands, as a permanent government agency and official archive for the Central Government, has the legal duty, laid down in the Archiefwet, to secure the future of the government record. This proposal focuses on how we developed the infrastructure and processes of our trusted digital repository (TDR) for ingesting, storing, managing, preserving and providing access to archived public websites of the Dutch Central Government.

In 2018 we’ve issued a very well received guideline on archiving websites (2018), We tried to involve our producers in the drafting process of the guidelines in their development. Part of which was to organize a public review. We received no less than 600 comments from 30 different organizations, which enabled us to improve the guidelines and immediately bring them to the attention of potential future users.

These guidelines were also used as part of the requirements of a public European tender (2021). The objective of the tender was to realize a central harvesting platform (hosted at https://www.archiefweb.eu/openbare-webarchieven-rijksoverheid/) to structurally harvest circa 1,500 public websites of the Central Government. This enabled us, as an archival institution, to influence the desired outcome of the harvesting process for these 1,500 websites, owned by all Ministries and most of their agencies.

A main challenge was that our off-the-shelf version of the OpenWayback viewer was not a complete version of the software and therefore could not render increments or provide a calendar view, one of the key elements of the minimum viable product we aimed for. We opted for pywb based on what we learned through the IIPC community about the transition from OpenWayback to pywb.

Our technical team found the installation of pywb very simple. One issue we did encounter is that the TDR software does not support linkage with this (or any) external viewer, which forces us to copy all WARC files from our TDR into the viewer. This deviates from our current workflow and, in effect, doubles the disk space we need.
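As a minimal sketch of what that copy step could look like, assuming a hypothetical export directory and collection name rather than the Nationaal Archief's actual workflow, the Python snippet below registers WARC files exported from the TDR in a local pywb collection using the wb-manager command that ships with pywb.

import subprocess
from pathlib import Path

EXPORT_DIR = Path('/data/tdr-export/warcs')   # hypothetical folder holding WARCs copied out of the TDR
COLLECTION = 'rijksoverheid-websites'         # hypothetical pywb collection name

# Create the pywb collection; this fails harmlessly if it already exists.
subprocess.run(['wb-manager', 'init', COLLECTION], check=False)

for warc in sorted(EXPORT_DIR.glob('*.warc.gz')):
    # wb-manager copies the WARC into collections/<name>/archive and updates the CDXJ index,
    # which is what drives pywb's calendar view of captures.
    subprocess.run(['wb-manager', 'add', COLLECTION, str(warc)], check=True)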

 
3:50pm - 4:00pmSHORT BREAK
4:00pm - 5:00pmKEYNOTE: Marleen Stikker. Introduced and chaired by Martijn Kleppe, KB
Location: Theatre 1
Session Chair: Martijn Kleppe, KB, National Library of the Netherlands
5:00pm - 5:15pmCLOSING REMARKS: Jeffrey van der Hoeven, KB, National Library of the Netherlands
Location: Theatre 1
Session Chair: Jeffrey van der Hoeven, KB, National Library of the Netherlands

 