Conference Agenda

Overview and details of the sessions of this conference.

Please note that all times are shown in the time zone of the conference.

 
 
 
Session Overview
Location: Theatre 2
Date: Thursday, 11/May/2023
11:00am - 12:30pm SES-02: FINDING MEANING IN WEB ARCHIVES
Location: Theatre 2
Session Chair: Vladimir Tybin, Bibliothèque nationale de France
These presentations will be followed by a 10 min Q&A.
 
11:00am - 11:20am

Leveraging Existing Bibliographic Metadata to Improve Automatic Document Identification in Web Archives.

Mark Phillips1, Cornelia Caragea2, Praneeth Rikka1

1University of North Texas, United States of America; 2University of Illinois Chicago, United States of America

The University of North Texas Libraries, partnering with the University of Illinois Chicago (UIC) Computer Science Department, has been awarded a research and development grant (LG-252349-OLS-22) from the Institute of Museum and Library Services in the United States to continue work from previously awarded projects (LG-71-17-0202-17) related to the identification and extraction of high-value publications from large web archives. This work will investigate the potential of using existing bibliographic metadata from library catalogs and digital library collections to better train machine learning models that can assist librarians and information professionals in identifying and classifying high-value publications in large web archives. The project will focus on extracting publications related to state government document collections from the states of Texas and Michigan, with the hope that this approach will enable other institutions to leverage their existing web archives in building traditional digital collections from these publications. This presentation will give an overview of the project and describe the approaches the research team is exploring to leverage existing bibliographic metadata in building machine learning models for publication identification from web archives. It will also share early findings from the first year of research, next steps, and how institutions can apply this research to their own web archives.



11:20am - 11:40am

Conceptual Modeling of the Web Archiving Domain

Illyria Brejchová

Masaryk University, Czech Republic

Web archives collect and preserve complex digital objects. This complexity, along with the large scope of archived websites and the dynamic nature of web content, makes sustainable and detailed metadata description challenging. Different institutions have taken various approaches to metadata description within the web archiving community, yet this diversity complicates interoperability. The OCLC Research Library Partnership Web Archiving Metadata Working Group took a significant step forward in publishing user-centered descriptive metadata recommendations applicable across common metadata formats. However, there is no shared conceptual model for understanding web archive collections. In my research, I examine three conceptual models from within the GLAM domain: IFLA-LRM, created by the library community; CIDOC-CRM, originating from the museum community; and RiC-CM, stemming from the archive community. I will discuss what insight they bring to understanding the content within web archives and their potential for supporting metadata practices that are flexible, scalable, meet the requirements of the end users, and are interoperable between web archives as well as the broader cultural heritage domain.

This approach sheds light on common problems encountered in metadata description practice in a bibliographic context by modeling archived web resources according to IFLA-LRM and showing how constraints within RDA introduce complexity without providing tools for feasibly representing this complexity in MARC 21. On the other hand, object-oriented models, such as CIDOC-CRM, can represent at least the same complexity of concepts as IFLA-LRM but without many of the aforementioned limitations. By mapping our current descriptive metadata and automatically generated administrative metadata to a single comprehensive model and publishing it as open linked data, we can not only more easily exchange metadata but also provide a powerful tool for researchers to make inferences about the past live web by reconstructing the web harvesting process using log files and available metadata.

While the work presented is theoretical, it provides a clearer understanding of the web archiving domain. It can be used to develop even better tools for managing and exploring web archive collections.



11:40am - 12:00pm

Web Archives & Machine Learning: Practices, Procedures, Ethics

Jefferson Bailey

Internet Archive, United States of America

Given their size, complexity, and heterogeneity, web archives are uniquely suited to leverage and enable machine learning techniques for a variety of purposes. On the one hand, web collections increasingly represent a larger portion of the recent historical record and are characterized by longitudinality, format diversity, and large data volumes; this makes them highly valuable in computational research by scholars, scientists, and industry professionals using machine learning for scholarship, analysis, and tool development. Few institutions, however, are yet facilitating this type of access or pursuing these types of partnerships and projects given the specialized practices, skills, and resources required. At the same time, machine learning tools also have the potential to improve internal procedures and workflows related to web collections management by custodial institutions, from description to discovery to quality assurance. Projects applying machine learning to web archive workflows, however, also remain a nascent, if promising, area of work for libraries. There is also a “virtuous loop” possible between these two functional areas of access support and collections management, wherein researchers utilizing machine learning tools on web archive collections can create technologies that then have internal benefits to the custodial institutions that granted access to their collections. Finally, spanning both external researcher uses and internal workflow applications is an intricate set of ethical questions posed by machine learning techniques. Internet Archive has been partnering with both academic and industry research projects to support the use of web archives in machine learning projects by these communities. Simultaneously, IA has also explored prototype work applying machine learning to internal workflows for improving the curation and stewardship of web archives. This presentation will cover the role of machine learning in supporting data-driven research, the successes and failures of applying these tools to various internal processes, and the ethical dimensions of deploying this emerging technology in digital library and archival services.



12:00pm - 12:20pm

From Small to Scale: Lessons Learned on the Requirements of Coordinated Selective Web Archiving and Its Applications

Balázs Indig1,2, Zsófia Sárközi-Lindner1,2, Mihály Nagy1,2

1Eötvös Loránd University, Department of Digital Humanities, Budapest, Hungary; 2National laboratory for Digital Humanities, Budapest, Hungary

Today, web archiving is measured on an increasingly large scale, pressuring newcomers and independent researchers to keep up with the pace of development and maintain an expensive ecosystem of expertise and machinery. These dynamics involve a fast and broad collection phase, resulting in a large pool of data, followed by a slower enrichment phase consisting of cleaning, deduplication and annotation.

Our streamlined methodology for specific web archiving use cases combines mainstream practices with new open-source tools. Our custom crawler conducts selective web archiving for portals (e.g. blogs, forums, currently applied to Hungarian news providers), using the taxonomy of the given portal to systematically extract all articles exclusively into portal-specific WARC files. As articles have uniform portal-dependent structure, they can be transformed into a portal-independent TEI XML format individually. This methodology enables assets (e.g. video) to be archived separately on demand.
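For illustration only, a minimal sketch of the WARC-to-TEI step described above, assuming warcio and BeautifulSoup; the CSS selectors and the TEI skeleton are hypothetical placeholders rather than the project's actual schema or code.

```python
# Sketch: extract articles from a portal-specific WARC and wrap them in a
# bare-bones TEI XML skeleton. The selectors and TEI layout are illustrative
# assumptions, not the project's real configuration.
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup
from xml.sax.saxutils import escape

def warc_to_tei(warc_path, title_selector="h1", body_selector="article"):
    """Yield (uri, tei_string) pairs for HTML response records in a WARC."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type", "") if record.http_headers else ""
            if "html" not in ctype:
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            title = soup.select_one(title_selector)
            body = soup.select_one(body_selector)
            if not (title and body):
                continue  # page does not match the portal's article layout
            tei = (
                '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
                f"<teiHeader><fileDesc><titleStmt><title>{escape(title.get_text(strip=True))}</title></titleStmt>"
                f"<sourceDesc><p>{escape(uri)}</p></sourceDesc></fileDesc></teiHeader>"
                f"<text><body><p>{escape(body.get_text(' ', strip=True))}</p></body></text></TEI>"
            )
            yield uri, tei

# Usage (hypothetical file name):
# for uri, tei in warc_to_tei("portal-example.warc.gz"):
#     print(uri, len(tei))
```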

We focus on textual content, which, if taken from traditional web archives, would require resource-intensive filtering. Alternatives such as trafilatura are limited to automatic content extraction, often yielding invalid TEI or incomplete metadata, unlike our semi-automatic method. The resulting data are deposited by grouping portals under specific DOIs, enabling fine-grained access and version control.

With almost 3 million articles from more than 20 portals, we developed a library for executing common tasks on these files, including NLP and format conversion, to overcome the difficulties of interacting with the TEI standard. To provide access to our archive and gain insights through faceted search, we created a lightweight trend viewer application to visualize text and descriptive metadata.

Our collaborations with researchers have shown that our approach makes it easy to merge coordinated but separate crawls, promoting small archives created by different researchers, who may have limited technical skills, into a comprehensive collection that can in some respects serve as an alternative to mainstream archives.

Balázs Indig, Zsófia Sárközi-Lindner, and Mihály Nagy. 2022. Use the Metadata, Luke! – An Experimental Joint Metadata Search and N-gram Trend Viewer for Personal Web Archives. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, pages 47–52, Taipei, Taiwan. Association for Computational Linguistics.

 
1:30pm - 2:30pm SES-04 (PANEL): SOLRWAYBACK: BEST PRACTICE, COMMUNITY USAGE & ENGAGEMENT
Location: Theatre 2
Session Chair: Thomas Langvann, National Library of Norway
 

SolrWayback: Best practice, community usage and engagement

Thomas Egense1, László Tóth2, Youssef Eldakar3, Sara Aubry4, Anders Klindt Myrvoll1

1Royal Danish Library (KB); 2National Library of Luxembourg (BnL); 3Bibliotheca Alexandrina (BA); 4National Library of France (BnF)

Panel description

This panel will focus on the status quo of SolrWayback, implementations of SolrWayback and where it is heading in the future, including the growing open source community adopting SolrWayback and contributing to the development of the tool, making it more resilient.

Thomas Egense will give an update on the current development and the flourishing user community and some thoughts on making SolrWayback even more resilient in the future.

László Tóth will talk about the National Library of Luxembourg's (BnL) development of a fully automated archiving workflow comprising the capture, indexing and playback of Luxembourgish news websites. The solution combines the powerful features of SolrWayback, such as full-text search, wildcard search, category search and more, with the high playback quality of PyWb.

Youssef Eldakar will present how SolrWayback has enhanced the way researchers can search for content and view the 18 IIPC special collections, and will also bring up some considerations about scaling the system.

Sara Aubry will present how the National Library of France (BnF) has been using SolrWayback to give research teams the possibility to explore, analyze and visualize specific collections. She will also share how BnF has contributed to the application's development, including the extension of its data visualisation features.

Thomas Egense: Increasing community interactions and the near future of SolrWayback

During the last year, the number of community interactions, such as direct email questions and bug/feature requests posted on GitHub and Jira, has increased every week. It is indeed good news that so many libraries, institutions and researchers have already embraced SolrWayback, but to keep up this momentum more community engagement will be welcomed for this open source project.

By submitting a feature request or bug report on GitHub, you will help prioritize the changes that will benefit users the most, so do not hold back. More programmers for the backend (Java) or frontend (GUI) would speed up the development of SolrWayback.

Recently, BnF helped improve some of the visualization tools by allowing shorter time intervals instead of years. For newly established collections this is a much more useful visualization. It is a good example of how the needs of a collection just one year old differ from those of a collection with 25 years of web harvests: it was not in our focus, yet it proved a very useful improvement.

In the very near future I expect that more time will be spent supporting new users attempting to implement SolrWayback. Also, the hybrid approach of SolrWayback combined with PyWb for playback seems to be the direction many choose to go. Finally, large collections will run into a Solr scaling problem that can be solved by switching to SolrCloud. There is a need for better documentation and workflow support in the SolrWayback bundle for this scaling issue.

László Tóth: A Hybrid SolrWayback-PyWb playback system with parallel indexing using the Camunda Workflow Engine

Within the framework of its web archiving programme, the National Library of Luxembourg (BnL) has developed a fully automated archiving workflow comprising the capture, indexing and playback of Luxembourgish news websites.

Our workflow design takes into account several key features such as the efficiency of crawls (both in time and space) and of the indexing processes, all while providing a high-quality end-user experience. In particular, we have chosen a hybrid approach for the playback of our archived content, making use of several well-known technologies in the field.


Our solution combines the powerful features of SolrWayback such as full-text search, wildcard search, category search and so forth, with the high playback quality of PyWb (for instance its ability to handle complex websites, in particular with respect to POST requests). Thus, once a website is harvested, the corresponding WARC files are indexed in both systems. Users are then able to perform fine-tuned searches using SolrWayback and view the chosen pages using PyWb. This also means that we need to store our indexes in two different places: the first is within an OutbackCDX indexing server connected to our PyWb instance, the second is a larger Solr ecosystem put in place specifically for SolrWayback. This parallel indexing process, together with the handling of the entire workflow from start to finish, is handled by the Camunda Workflow Engine, which we have configured in a highly flexible manner.
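As a rough sketch of what one such parallel indexing step could look like outside Camunda, the following assumes pywb's cdxj-indexer, an OutbackCDX ingest endpoint and webarchive-discovery's warc-indexer; all paths, URLs and command-line flags are assumptions, not the BnL configuration.

```python
# Sketch: index the same WARC into OutbackCDX (for PyWb playback) and into
# Solr via the Java warc-indexer (for SolrWayback). In the setup described
# above this logic runs inside Camunda service tasks; endpoints, jar path and
# flags below are illustrative assumptions.
import subprocess
import requests

OUTBACKCDX_COLLECTION = "http://localhost:8901/luxweb"            # hypothetical
WARC_INDEXER_JAR = "/opt/warc-indexer/warc-indexer.jar"           # hypothetical
SOLR_URL = "http://localhost:8983/solr/netarchivebuilder"         # hypothetical

def index_for_pywb(warc_path):
    """Generate a CDX index with pywb's cdxj-indexer and POST it to OutbackCDX.
    Check which CDX flavour your OutbackCDX version expects before relying on this."""
    cdx = subprocess.run(
        ["cdxj-indexer", warc_path],
        check=True, capture_output=True, text=True,
    ).stdout
    requests.post(OUTBACKCDX_COLLECTION, data=cdx.encode("utf-8")).raise_for_status()

def index_for_solrwayback(warc_path):
    """Run the Java warc-indexer against the WARC; flags are illustrative only."""
    subprocess.run(
        ["java", "-jar", WARC_INDEXER_JAR, "--solr", SOLR_URL, warc_path],
        check=True,
    )

def index_warc(warc_path):
    index_for_pywb(warc_path)
    index_for_solrwayback(warc_path)

if __name__ == "__main__":
    index_warc("example-20230511.warc.gz")  # hypothetical file name
```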


This way, we can quickly respond to new requirements, or even to small adjustments such as new site-specific behaviors. All of our updates, including new productive tasks or workflows, can be deployed on-the-fly without needing any downtime. This combination of technologies allows us to provide a seamless and automated workflow together with an enjoyable user experience. We will present the integrated workflow with Camunda and how users interact with the whole system.


Youssef Eldakar: Where We Are a Year Later with the IIPC Collections and Researcher Access through SolrWayback

One year ago, we presented a joint effort, spanning the IIPC Research Working Group, the IIPC Content Development Working Group, and Bibliotheca Alexandrina, to republish the IIPC collections for researcher access through alternative interfaces, namely, LinkGate and SolrWayback.


This effort aims to re-host the IIPC collections, originally harvested on Archive-It, at Bibliotheca Alexandrina with the purpose of offering researchers the added value of being able to explore a web archive collection as a temporal graph with the data indexed in LinkGate, as well as search the full text of a web archive collection and run other types of analyses with the data indexed in SolrWayback.


At the time of last year's presentation, the indexing of 18 collections and a total compressed size of approximately 30 TB for publishing through both LinkGate and SolrWayback was at its early stage. As part of this panel on SolrWayback, one year later, we present an update of what is now available to researchers after the progress made on indexing and tuning of the deployment, focusing on showcasing access to the data through the different tools found in the SolrWayback user interface.


We also present a brief technical overview of how the underlying deployment has changed to meet the demands of scaling up to the growing volume of data. We finally share thoughts on next steps. See the republished collections at https://iipc-collections.bibalex.org/ and the presentation from 2022.

Sara Aubry: SolrWayback at the National Library of France (BnF): an exploration tool for researchers and the web archiving team's engagement in contributing to its evolution

With the opening of its DataLab in October 2021 and the Respadon project (which will also be presented during the WAC), the BnF web archiving team is currently concentrating on the development of services, tools, methods and documentation to ease the understanding and appropriation of web archives for research. The underlying objective is to provide the research community, along with information professionals, with a diversity of tools dedicated to building, exploring and analyzing web corpora. Among all the tools we have tested with researchers, SolrWayback holds a particular place because it is simple to use and rich in functionality. Beyond a first contact with the web archives, it allows researchers to question and analyze the focused collections to which it gives access. This presentation will focus on researchers' feedback from using SolrWayback, how the application promotes the development of skills around web archives, and how we support researchers in the use of this application. We will also present how research use and feedback have led us to contribute to the development of this open source tool.


 
2:40pm - 3:50pm SES-06: SOCIAL MEDIA & PLAYBACK: COLLABORATIVE APPROACHES
Location: Theatre 2
Session Chair: Susanne van den Eijkel, KB, National Library of the Netherlands
These presentations will be followed by a 10 min Q&A.
 
2:40pm - 3:00pm

Archiving social media in Flemish cultural or private archives: (how) is it possible?

Katrien Weyns1, Ellen Van Keer2

1KADOC-KU Leuven, Belgium; 2meemoo, Belgium

Social media are increasingly replacing other forms of communication. In doing so, they are also becoming an important source to archive in order to preserve the diverse voices in society for the long term. However, few Flemish archival institutions currently archive this type of content. To remedy this situation, a number of private archival institutions in Flanders started research on sustainable approaches and methods to capture and preserve social media archives. Confronted with the complex reality of this new landscape however, this turned out to be a rather challenging undertaking.

Through the lens of our project 'Best practices for social media archiving in Flanders and Brussels', we’ll look at the lessons learned and the central challenges that remain for social media archiving in private archival institutions in Flanders. Many of these lessons and challenges transcend this project and concern the broader web archiving community and cultural heritage sector.

Unsurprisingly, for many (often smaller) private archival institutions in Belgium, archiving social media remains a major challenge, whether because of a lack of (new) digital archiving competencies or the limited availability of (often expensive and quickly outdated) technical solutions in heritage institutions. On top of that, there are major legal challenges. For one, these archives cannot fall back on archival law or legal deposit law as a legal basis. In addition, the quickly evolving European and national privacy and copyright regulations form a maze of rules and exceptions that they have to find their way through and keep up with.

One last stumbling block is proving particularly hard to overcome. It concerns the legal and technical restrictions the social media platforms themselves impose on users. These make it practically impossible for heritage institutions to capture and preserve the integrity of social media content in a sustainable way. We believe this problem is best addressed by the international web archiving, research and heritage community as a whole.

This is only one of the recommendations we’re proposing to improve the situation as part of the set of ‘best practices’ we developed and which we would like to present here in more detail.



3:00pm - 3:20pm

Searching for a Little Help From My Friends: Reporting on the Efforts to Create an (Inter)national Distributed Collaborative Social Media Archiving Structure

Zefi Kavvadia1, Katrien Weyns2, Mirjam Schaap3, Sophie Ham4

1International Institute of Social History; 2KADOC Documentation and Research Centre on Religion, Culture, and Society; 3Amsterdam City Archives; 4KB, National Library of the Netherlands

Social media archiving in cultural heritage and government is still at an experimental stage with regard to organizational readiness for and sustainability of initiatives. The many different tools, the variety of platforms, and the intricate legal and ethical issues surrounding social media do not readily allow for immediate progress and uptake by organizations interested or mandated to preserve social media content for the long term.

In Belgium and the Netherlands, the last three years have seen a series of promising projects on building social media archiving capacity, mostly focusing on heritage and research. One of their most important findings is that the multiple needs and requirements of successful social media archiving are difficult for any one organization to tackle; efforts to propose good practices or establish guidelines often run into the reality of the many, and sometimes clashing, priorities of different domains, e.g. archives, libraries, local and national government, and research. Faced with little time and increasing costs, managers and funders are generally reluctant to support social media archiving as an integral part of collecting activity, as it is seen as a nice-to-have but not a crucial part of their already demanding core business.

Against this background, we set out to bring together representatives of different organizations from different sectors in Belgium and the Netherlands to research the possibilities for what a distributed collaborative approach to social media archiving could look like, including requirements for sharing knowledge and experiences systematically and efficiently, sharing infrastructure and human and technical resources, prioritization, and future-proofing the initiative. In order to do this, we look into:

  • Wishes, demands, and obstacles related to doing social media archiving at different types of organizations in Belgium and the Netherlands

  • Aligning the heritage, research, and governmental perspectives

  • Learning from existing collective organizational structures

  • First steps for the allocation of roles and responsibilities

Through interviews with staff and managers of interested organizations, we want to find out if there is potential in thinking about social media archiving as a truly collaborative venture. We would like to discuss the progress of this research and the ideas and challenges we have come up against.



3:20pm - 3:40pm

Collaborating On The Cutting Edge: Client Side Playback

Clare Stanton, Matteo Cargnelutti

Library Innovation Lab, United States of America

Perma.cc is a project of the Library Innovation Lab, which is based within the Harvard Law School Library and exists as a unit of a large academic institution. Our work has been focused in the past mainly on the application of web archiving technology as it relates to citation in legal and scholarly writing. However, we also have spent time exploring expansive topics in the web archiving world - oftentimes via close collaboration with the Webrecorder project - and most recently have built tools leveraging new client-side playback technology made available by replayweb.page.

warc-embed is LIL's experimental client-side playback integration boilerplate, which can be used to test out and explore this new technology, along with its potential new applications. It consists of: a simple web server configuration that provides web archive playback; a preconfigured “embed” page that can be easily implemented to interact with replayweb.page; and a two-way communication layer that allows the replay to reliably and safely communicate with the archive. These features are replicable for a relatively non-technical audience and thus we sought to explore small scale applications of it outside of our group.

This is one of two proposals relating to this work. We believe there is an IIPC audience who could attend either or both sessions based on their interests. They explore separate topics relating to the core technology. This session will look into user applications of the tool and institutional user feedback from the Harvard Library community.

Our colleagues at Harvard use the Internet Archive’s Archive-It across the board for the majority of their web archiving collections and access. As an experiment, we have worked with some of them to host and serve their .warcs via warc-embed. We scoped work based on their needs and made adjustments based on their ability to apply the technology. One example of this is a refresh of the software to be able to mesh with WordPress, which was more easily managed directly by the team. This session will explore a breakdown of roadblocks, design strategies, and wins from this collaboration. It will focus on the end-user results and applications of the technology.

 
4:20pm - 5:30pm SES-08: QUALITY ASSURANCE
Location: Theatre 2
Session Chair: Arnoud Goos, Netherlands Institute for Sound & Vision
These presentations will be followed by a 10 min Q&A.
 
4:20pm - 4:40pm

The Auto QA process at UK Government Web Archive

Kourosh Feissali, Jake Bickford

The National Archives, United Kingdom

The UK Government Web Archive's (UKGWA) Auto QA process allows us to carry out enhanced, data-driven QA almost completely automatically. This is particularly useful for websites that are high-profile or about to close. Our Auto QA has several advantages over solely visual QA. It enables us to:

1) Identify problems that are not obvious at the visual QA stage.

2) Identify Heritrix errors during the crawl. These include -2 and -6 errors. Once identified, we re-run Heritrix on the affected URIs.

3) Identify and patch URIs that Heritrix could not discover.

4) Identify, test, and patch hyperlinks inside PDFs. Many PDFs contain hyperlinks to a page on the parent website or to other websites, and sometimes the only way to access those pages is through a link in a PDF, which most crawlers can't normally follow.

Auto QA consists of three separate processes:

1) ‘Crawl Log Analysis’ (CLA), which runs on every crawl automatically. CLA examines Heritrix crawl logs and looks for errors, then tests the affected URIs against the live web (a minimal sketch of this step follows the list).

2) ‘Diffex’, which compares what Heritrix discovered with the output of another crawler such as Screaming Frog. This identifies what Heritrix did not discover. Diffex then tests those URIs against the live web and, if they are valid, adds them to a patch list.

3) ‘PDFflash’, which extracts PDF URIs from Heritrix crawl logs, parses the PDFs and looks for hyperlinks within them, then tests those hyperlinks against the live web, our web archives, and our in-scope domains. If a hyperlink's target returns a 404, it is added to our patch list provided it meets certain conditions, such as scoping criteria.
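A minimal sketch of the Crawl Log Analysis idea referenced in item 1, assuming the standard Heritrix crawl.log layout (fetch status in the second field, URI in the fourth); patching and scoping logic are omitted and this is not the UKGWA code.

```python
# Sketch: scan a Heritrix crawl.log for failed fetches (e.g. -2, -6) and
# re-test those URIs against the live web to build a candidate patch list.
import requests

FAILURE_CODES = {"-2", "-6"}  # Heritrix codes the abstract says are re-run

def failed_uris(crawl_log_path):
    """Yield (status, uri) pairs for log lines whose fetch status is a failure code."""
    with open(crawl_log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.split()
            if len(fields) >= 4 and fields[1] in FAILURE_CODES:
                yield fields[1], fields[3]

def still_live(uri, timeout=10):
    """Return True if the URI responds with a non-error status on the live web."""
    try:
        resp = requests.head(uri, allow_redirects=True, timeout=timeout)
        return resp.status_code < 400
    except requests.RequestException:
        return False

def build_patchlist(crawl_log_path):
    """URIs that failed during the crawl but are alive on the web: re-crawl candidates."""
    return sorted({uri for _, uri in failed_uris(crawl_log_path) if still_live(uri)})

# Usage (hypothetical path):
# print(build_patchlist("crawl.log"))
```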

UKGWA's Auto QA is a highly efficient and scalable system that complements visual QA, and we are in the process of making it open source.



4:40pm - 5:00pm

The Human in the Machine: Sustaining a Quality Assurance Lifecycle at the Library of Congress

Grace Bicho, Meghan Lyon, Amanda Lehman

Library of Congress, United States of America

This talk will build upon information shared during the IIPC WAC 2022 session Building a Sustainable Quality Assurance Lifecycle at the Library of Congress (Thomas and Lyon).

The work to develop a sustainable and effective quality assurance (QA) ecosystem is ongoing and the Library of Congress Web Archiving Team (WAT) is constantly working to improve and streamline workflows. The Library's web archiving QA goals are structured around Dr. Reyes Ayala's framework for measuring the quality of web archives, which is based on Grounded Theory (Reyes Ayala). During last year's session, we described how the WAT satisfies the two dimensions of Relevance and Archivability, with some automated processes built in to help the team do its work. We also introduced our idea for Capture Assessment to satisfy the Correspondence dimension of Dr. Reyes Ayala's framework.

In July 2022, the WAT launched the Capture Assessment workflow internally and invited curators of web archives content at the Library to review captures of their selected content. To best communicate issues of Correspondence quality between the curatorial librarians and the WAT, we instituted a rubric where curatorial librarians can ascribe a numeric value to convey quality information from various angles about a particular web capture, alongside a checklist of common issues to easily note.

The WAT held an optional training alongside the launch, and since then, there have been over 90 responses from a handful of curatorial librarians, including one power user. The WAT has found responses to be mostly actionable for correction in future crawls. We’ve also seen that Capture Assessments are performed on captures that wouldn’t necessarily be flagged via other QA workflows, which gives us confidence that a wider swath of the archive is being reviewed for quality.

The session will share more details about the Capture Assessment workflow and, in time for the 2023 WAC session, we intend to complete a small, early analysis of the Capture Assessment responses to share with the wider web archiving community.

Reyes Ayala, B. Correspondence as the primary measure of information quality for web archives: a human-centered grounded theory study. Int J Digit Libr 23, 19–31 (2022). https://doi.org/10.1007/s00799-021-00314-x

 
5:30pm - 6:10pm POS-2: LIGHTNING & DROP-IN TALKS
Location: Theatre 2
Session Chair: Martin Klein, Los Alamos National Laboratory
1 minute drop-in talks will immediately follow lightning talks. After the session ends, lightning talk presenters will be available for questions in the atrium, where their posters will be on display.

Drop-in talk schedule:

Persistent Web IDentifier (PWID) also as URN
Eld Zierau, Royal Danish Library

Crowdsourcing German Twitter
Britta Woldering, German National Library

At the end of the rainbow. Examining the Dutch LGBT+ web archive using NER and hyperlink analyses
Jesper Verhoef, Erasmus University Rotterdam
 

Sunsetting a digital institution: Web archiving and the International Museum of Women

Marie Chant

The Feminist Institute, United States of America

The Feminist Institute's (TFI) partnership program helps feminist organizations sunset mission-aligned digital projects utilizing web archiving technology and ethnographic preservation to contextualize and honor the labor contributed to ephemeral digital initiatives. In 2021, The Feminist Institute partnered with Global Fund for Women to preserve the International Museum of Women (I.M.O.W.). This digital, social change museum built award-winning digital exhibitions that explored women's contributions to society. I.M.O.W. initially aimed to build a physical space but shifted to a digital-only presence in 2005, opting to democratize access to the museum's work. I.M.O.W.'s first exhibition, Imagining Ourselves: A Global Generation of Women, engaged and connected more than a million participants worldwide. After launching several successful digital collections, I.M.O.W. merged with Global Fund for Women in 2014. The organization did not have the means to continually migrate and maintain the websites as technology was deprecated, leaving gaps in functionality and access. Working directly with stakeholders from Global Fund for Women and the International Museum of Women, TFI developed a multi-pronged preservation plan that included capturing I.M.O.W.'s digital exhibitions using Webrecorder's Browsertrix Crawler, harvesting and converting Adobe Flash assets, conducting oral histories with I.M.O.W. staff and external developers, and providing access through the TFI Digital Archive.



Visualizing web harvests with the WAVA tool

Ben O'Brien1, Frank Lee1, Hanna Koppelaar2, Sophie Ham2

1National Library of New Zealand, New Zealand; 2National Library of the Netherlands, Netherlands

Between 2020 and 2021, the National Library of New Zealand (NLNZ) and the National Library of the Netherlands (KB-NL) developed a new harvest visualization feature within the Web Curator Tool (WCT). This feature was demonstrated during a presentation at the 2021 IIPC WAC titled Improving the quality of web harvests using Web Curator Tool. During development it was recognised that the visualization tool could be beneficial to the web archiving community beyond WCT. This was also reflected in feedback received after the 2021 IIPC WAC.

The feature has now been ported to an accompanying stand-alone application called the WAVA tool (Web Archive Visualization and Analysis). This is a stripped-down version that contains the web harvest analysis and visualization without the WCT-dependent functionality, such as patching.

The WCT harvest visualization has been designed primarily for performing quality assurance on web archives. To avoid the traditional mess of links and nodes when visualizing URLs, the tool abstracts the data to a domain level. Aggregating URLs into groups of domains gives a higher-level overview of a crawl and allows for quicker analysis of the relationships between content in a harvest. The visualization consists of an interactive network graph of links and nodes that can be inspected, allowing a user to drill down to the URL level for deeper analysis.
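Not the WAVA code itself, but a minimal sketch of the domain-level aggregation idea, assuming URL-to-URL link pairs extracted from a harvest and the networkx library.

```python
# Sketch: collapse URL-level links from a crawl into a weighted domain-level
# graph, which is far easier to read than a raw URL link graph.
from urllib.parse import urlparse
import networkx as nx

def domain(url):
    return urlparse(url).hostname or "unknown"

def build_domain_graph(url_links):
    """url_links: iterable of (source_url, target_url) pairs from a harvest."""
    graph = nx.DiGraph()
    for src, dst in url_links:
        s, d = domain(src), domain(dst)
        if graph.has_edge(s, d):
            graph[s][d]["weight"] += 1  # count URL-level links per domain pair
        else:
            graph.add_edge(s, d, weight=1)
    return graph

# Usage with hypothetical link pairs:
links = [
    ("https://example.org/a", "https://example.org/b"),
    ("https://example.org/a", "https://cdn.example.net/logo.png"),
]
print(build_domain_graph(links).edges(data=True))
```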

NLNZ and KB-NL believe the WAVA tool can have many uses for the web archiving community. It lowers the barrier to investigating and understanding the relationships and structure of the web content that we crawl. What can we discover in our crawls that might improve the quality of future web harvests? The WAVA tool also removes technical steps that have in the past been a barrier to researchers visualizing web archive data. How many future research questions could be aided by its use?



WARC validation, why not?

Antal Posthumus, Jacob Takema

Nationaal Archief, The Netherlands

This lightning talk would like to tempt and to challenge the participants of the IIPC Web Archiving Conference 2023 to engage in an exchange of ideas, assumptions and knowledge about the subject of validating WARC-files and the use of WARC validation tools.

In 2021 we wrote an information sheet about WARC validation. During our (desk) research it became clear that most (inter)national colleagues who archive websites more often than not do not use WARC validation tools. Why not?

Most heritage institutions, national libraries and archives focus on safeguarding as much online content as possible before it disappears, based on an organizational selection policy. The other goal is to give access to the captured information as completely and quickly as possible, both to general users and to researchers. Both goals are, of course, at the core of web archiving initiatives!

It seems as though little attention is given to an aspect of quality control such as the checking of the technical validity of WARC-files. Or are there other reasons not to pay much attention to this aspect?

We would like to share some of our findings after deploying several tools for processing WARC-files: JHOVE, JWAT, Warcat and Warcio. More tools are available, but in our opinion these four are the most commonly used, mature and actively maintained tools that can check or validate WARC files.

In our research into WARC validation, we noticed that some tools are validation tools that check conformance to WARC standard ISO 28500 and others ‘only’ check block and/or payload digests. Most tools support version 1.0 of the WARC standard (of 2009). Few support version 1.1 (of 2017).
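As a small illustration of what a digest check involves (one narrow slice of what the tools above do), here is a sketch using warcio that recomputes block digests; it assumes sha1/base32 digests and is nowhere near a full ISO 28500 conformance check.

```python
# Sketch: recompute the SHA-1 over each record's content block and compare it
# with the WARC-Block-Digest header. Assumes sha1/base32 digests (the common
# case); it is only a fragment of what a real WARC validator checks.
import base64
import hashlib
from warcio.archiveiterator import ArchiveIterator

def check_block_digests(warc_path):
    mismatches = []
    with open(warc_path, "rb") as stream:
        # no_record_parse=True keeps the content block raw (HTTP headers included),
        # which is what WARC-Block-Digest is computed over.
        for record in ArchiveIterator(stream, no_record_parse=True):
            stated = record.rec_headers.get_header("WARC-Block-Digest")
            if not stated or not stated.startswith("sha1:"):
                continue  # no digest, or an algorithm this sketch does not cover
            block = record.content_stream().read()
            computed = "sha1:" + base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")
            if computed != stated:
                mismatches.append(record.rec_headers.get_header("WARC-Target-URI"))
    return mismatches

# Usage (hypothetical file name):
# print(check_block_digests("example.warc.gz"))
```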

Another conclusion is that there is no one WARC validation tool ‘to rule them all’, so using a combination of tools will probably be the best strategy for now.

 
Date: Friday, 12/May/2023
8:30am - 10:00am WKSHP-04: BROWSER-BASED CRAWLING FOR ALL: GETTING STARTED WITH BROWSERTRIX CLOUD
Location: Theatre 2
Pre-registration required for this event.
 

Browser-Based Crawling For All: Getting Started with Browsertrix Cloud

Andrew N. Jackson1, Anders Klindt Myrvoll2, Ilya Kreymer3

1The British Library, United Kingdom; 2Royal Danish Library; 3Webrecorder

Through the IIPC-funded “Browser-based crawling system for all” project (https://netpreserve.org/projects/browser-based-crawling/), members have been working with Webrecorder to support the development of their new crawling tools: Browsertrix Crawler (a crawl system that automates web browsers to crawl the web) and Browsertrix Cloud (a user interface to control multiple Browsertrix Crawler crawls). The IIPC funding is particularly focussed on making sure IIPC members can make use of these tools.

This workshop will be led by Webrecorder, who will invite you to try out an instance of Browsertrix Cloud and explore how it might work for you. IIPC members that have been part of the project will be on hand to discuss institutional issues like deployment and integration.

The workshop will start with a brief introduction to high-fidelity web archiving with Browsertrix Cloud. Participants will be given accounts and be able to create their own high-fidelity web archives using the Browsertrix Cloud crawling system during the workshop. Users will be presented with an interactive user interface to create crawls, allowing them to configure crawling options. We will discuss the crawling configuration options and how these affect the result of the crawl. Users will have the opportunity to run a crawl of their own for at least 30 minutes and to see the results. We will then discuss and reflect on the results.

After a quick break, we will discuss how the web archives can be accessed and shared with others, using the ReplayWeb.page viewer. Participants will be able to download the contents of their crawls (as WACZ files) and load them on their own machines. We will also present options for sharing the outputs with others directly, by uploading to an easy-to-use hosting option such as Glitch or our custom WACZ Uploader. Either method will produce a URL which participants can then share with others, in and outside the workshop, to show the results of their crawl. We will discuss how, once complete, the resulting archive is no longer dependent on the crawler infrastructure, but can be treated like any other static file, and, as such, can be added to existing digital preservation repositories.

In the final wrap-up of the workshop, we will discuss challenges, what worked and what didn't, and what still needs improvement. We will also discuss how participants can add the web archives they created into existing web archives that they may already have, and how Browsertrix Cloud can fit into and augment existing web archiving workflows at participants' institutions. IIPC member institutions that have been testing Browsertrix Cloud thus far will share their experiences in this area.

The format of the workshop will be as follows:

  • Introduction to Browsertrix Cloud - 10 min

  • Use Cases and Examples by IIPC project partners - 10 min

  • Break - 5

  • Hands-On: Setup and Crawling with Browsertrix Cloud (Including Q&A / help while crawls are running) - 30 min

  • Break - 5 min

  • Hands-On: Replaying and Sharing Web Archives - 10 min

  • Wrap-Up: Final Q&A / Discuss Integration of Browsertrix Cloud Into Existing Web Archiving Workflows with IIPC project partners - 20 min

Webrecorder will lead the workshop and the hands-on portions of the workshop. The IIPC project partners will present and answer questions about use-cases at the beginning and around integration into their workflows at the end.

Participants should come prepared with a laptop and a couple of sites that they would like to crawl, especially those that are generally difficult to crawl by other means and require a ‘high fidelity’ approach (examples include social media sites, sites that are behind a paywall, etc.). Ideally, the sites can be crawled during the course of 30 minutes (though crawls can be interrupted if they run for too long).

This workshop is intended for curators and anyone wishing to create and use web archives who are ready to try Browsertrix Cloud on their own with guidance from the Webrecorder team. The workshop does not require any technical expertise besides basic familiarity with web archiving. The participants' experiences and feedback will help shape not only the remainder of the IIPC project, but the long-term future of this new crawling toolset.

The workshop should be able to accommodate up to 50 participants.

 
10:30am - 12:00pm SES-13: CRAWLING, PLAYBACK, SUSTAINABILITY
Location: Theatre 2
Session Chair: Laura Wrubel, Stanford University
These presentations will be followed by a 10 min Q&A.
 
10:30am - 10:50am

Developer Update for Browsertrix Crawler and Browsertrix Cloud

Ilya Kreymer, Tessa Walsh

Webrecorder, United States of America

This presentation will provide a technical and feature update on the latest features implemented in Browsertrix Cloud and Browsertrix Crawler, Webrecorder's open source automated web archiving tools. The presentation will provide a brief intro to Browsertrix Cloud and the ongoing collaboration between Webrecorder and IIPC partners testing the tool.

We will present an outline for the next phase of development of these tools and discuss current / ongoing challenges in high fidelity web archiving, and how we may mitigate them in the future. We will also cover any lessons learned thus far.

We will end with a brief Q&A to answer any questions about the Browsertrix Crawler and Cloud systems, including how others may contribute to testing and development of these open source tools.



10:50am - 11:10am

Opportunities and Challenges of Client-Side Playback

Clare Stanton, Matteo Cargnelutti

Library Innovation Lab, United States of America

The team working on Perma.cc at the Library Innovation Lab has been using the open-source technologies developed by Webrecorder in production for many years, and has subsequently built custom software around those core services. Recently, in exploring applications for client-side playback of web archives via replayweb.page, we have learned lessons about the security, performance and reliability profile of this technology. This has deepened our understanding of the opportunities it presents and challenges it poses. Subsequently, we have developed an experimental boilerplate for testing out variations of this technology and have sought partners within the Harvard Library community to iterate with, test our learnings, and explore some of the interactive experiences that client-side playback makes possible.

warc-embed is LIL's experimental client-side playback integration boilerplate, which can be used to test out and explore this new technology. It consists of: a cookie-cutter web server configuration for storing, proxying, caching and serving web archive files; a pre-configured "embed" page, serving an instance of replayweb.page aimed at a given archive file; as well as a two-way communication layer allowing the embedding website to safely communicate with the embedded archive. These unique features allow for a thorough exploration of this new technology from a technical and security standpoint.

This is one of two proposals relating to this work. We believe there is an IIPC audience who could attend either or both sessions based on their interests. This session will dive into the technical research conducted at the lab and present those findings.

Combined with the emergence of the WACZ packaging format, client-side playback is a radically different and novel take on web archive playback which allows for the implementation of previously unachievable embedding scenarios. This session will explore the technical opportunities and challenges client-side playback presents from a performance, security, ease-of-access and programmability perspective by going over concrete implementation examples of this technology on Perma.cc and warc-embed.



11:10am - 11:30am

Sustaining pywb through community engagement and renewal: recent roadmapping and development as a case study in open source web archiving tool sustainability

Tessa Walsh, Ilya Kreymer

Webrecorder

IIPC’s adoption of pywb as the “go to” open source web archive replay system for its members, along with Webrecorder’s support for transitioning to pywb from other “wayback machine” replay systems, brings a large new user base to pywb. In the interests of ensuring pywb continues to sustainably meet the needs of IIPC members and the greater web archiving community, Webrecorder has been investing in maintenance and new releases for the current 2.x release series of pywb as well as engaging in the early stages of a significant 3.0 rewrite of pywb. These changes are being driven by a community roadmapping exercise with members of the IIPC oh-sos (Online Hours: Supporting Open Source) group and other pywb community stakeholders.

This talk will outline some of the recent feature and maintenance work done in pywb 2.7, including a new interactive timeline banner which aims to promote easier navigation and discovery within web archive collections. It will go on to discuss the community roadmapping process for pywb 3.0 and an overview of the proposed new architecture, perhaps even showing an early demo if development is in a state by May 2023 to support doing so.

The talk will aim to not only share specific information about pywb and the efforts being put into its sustainability and maintenance by both Webrecorder and the IIPC community, but also to use pywb as a case study to discuss the resilience, sustainability, and renewal of open source software tools that enable web archiving for all. pywb as a codebase is after all nearly a decade old itself and has gone through several rounds of significant rewrites as well as eight years of regular maintenance by Webrecorder staff and open source contributors to get to its current state, making it a prime example of how ongoing effort and community involvement make all the difference in building sustainable open source web archiving tools.



11:30am - 11:50am

Addressing the Adverse Impacts of JavaScript on Web Archives

Ayush Goel1, Jingyuan Zhu1, Ravi Netravali2, Harsha V. Madhyastha1

1University of Michigan, United States of America; 2Princeton University, United States of America

Over the last decade, the presence of JavaScript code on web pages has dramatically increased. While JavaScript enables websites to offer a more dynamic user experience, its increasing use adversely impacts the fidelity of archived web pages. For example, when we load snapshots of JavaScript-heavy pages from the Internet Archive, we find that many are missing important images and JavaScript execution errors are common.

In this talk, we will describe the takeaways from our research on how to archive and serve pages that are heavily reliant on JavaScript. Via fine-grained analysis of JavaScript execution on 3000 pages spread across 300 sites, we find that the root cause for the poor fidelity of archived page copies is because the execution of JavaScript code that appears on the web is often dependent on the characteristics of the client device on which it is executed. For example, JavaScript on a page can execute differently based on whether the page is loaded on a smartphone or on a laptop, or whether the browser used is Chrome or Safari; even subtle differences like whether the user's network connection is over 3G or WiFi can affect JavaScript execution. As a result, when a user loads an archived copy of a page in their browser, JavaScript on the page might attempt to fetch a different set of embedded resources (i.e., images, stylesheets, etc.) as compared to those fetched when this copy was crawled. Since a web archive is unable to serve resources that it did not crawl, the user sees an improperly rendered page both because of missing content and JavaScript runtime errors.

To account for the sources of non-deterministic JavaScript execution, a web archive cannot crawl every page in all possible execution environments (client devices, browsers, etc), as doing so would significantly inflate the cost of archiving. Instead, if we augment archived JavaScript such that the code on any archived page will always execute exactly how it did when the page was crawled, we are able to ensure that all archived pages match their original versions on the web, both visually and functionally.

 
1:00pm - 2:10pm SES-15: DATA CONSIDERATIONS
Location: Theatre 2
Session Chair: Sophie Ham, Koninklijke Bibliotheek
These presentations will be followed by a 10 min Q&A.
 
1:00pm - 1:20pm

What if GitHub disappeared tomorrow?

Emily Escamilla, Michele Weigle, Michael Nelson

Old Dominion University, United States of America

Research is reproducible when the methodology and data originally presented by the researchers can be used to reproduce the results found. Reproducibility is critical for verifying and building on results; both of which benefit the scientific community. The correct implementation of the original methodology and access to the original data are the lynchpin of reproducibility. Researchers are putting the exact implementation of their methodology in online repositories like GitHub. In our previous work, we analyzed arXiv and PubMed Central (PMC) corpora and found 219,961 URIs to GitHub in scholarly publications. Additionally, in 2021, one in five arXiv publications contained at least one link to GitHub. These findings indicate the increasing reliance of researchers on the holdings of GitHub to support their research. So, what if GitHub disappeared tomorrow? Where could we find archived versions of the source code referenced in scholarly publications? Internet Archive, Zenodo, and Software Heritage are three different digital libraries that may contain archived versions of a given repository. However, they are not guaranteed to contain a given repository and the method for accessing the code from the repository will vary across the three digital libraries. Additionally, Internet Archive, Zenodo, and Software Heritage all approach archiving from different perspectives and different use cases that may impact reproducibility. Internet Archive is a Web archive; therefore, the crawler archives the GitHub repository as a Web page and not specifically as a code repository. Zenodo allows researchers to publish source code and data and to share them with a DOI. Software Heritage allows researchers to preserve source code and issues permalinks for individual files and even lines of code. In this presentation, we will answer the questions: What if GitHub disappeared tomorrow? What percentage of scholarly repositories are in Internet Archive, Zenodo, and Software Heritage? What percentage of scholarly repositories would be lost? Do the archived copies available in these three digital libraries facilitate reproducibility? How can other researchers access source code in these digital libraries?
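As an illustration of how such a survey could begin, here is a sketch that checks two of the three digital libraries for a given repository URL; the endpoint shapes are assumptions based on the public Wayback availability and Software Heritage APIs and should be verified against current documentation, and Zenodo is omitted because its deposits are keyed by DOI rather than repository URL.

```python
# Sketch: check whether archived copies of a GitHub repository appear to exist
# in the Internet Archive's Wayback Machine and in Software Heritage.
# Endpoints below are assumptions about the public APIs; verify before use.
import requests

def in_wayback(repo_url):
    """True if the Wayback Machine availability API reports at least one snapshot."""
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": repo_url}, timeout=30)
    resp.raise_for_status()
    return bool(resp.json().get("archived_snapshots"))

def in_software_heritage(repo_url):
    """True if Software Heritage knows the repository as an origin."""
    resp = requests.get(
        f"https://archive.softwareheritage.org/api/1/origin/{repo_url}/get/",
        timeout=30)
    return resp.status_code == 200

if __name__ == "__main__":
    repo = "https://github.com/webrecorder/pywb"  # example repository
    print("Wayback Machine:", in_wayback(repo))
    print("Software Heritage:", in_software_heritage(repo))
```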



1:20pm - 1:40pm

Web archives and FAIR data: exploring the challenges for Research Data Management (RDM)

Sharon Healy1, Ulrich Karstoft Have2, Sally Chambers3, Ditte Laursen4, Eld Zierau4, Susan Aasman5, Olga Holownia6, Beatrice Cannelli7

1Maynooth University; 2NetLab; 3KBR & Ghent Centre for Digital Humanities; 4Royal Danish Library; 5University of Groningen; 6IIPC; 7School of Advanced Study, University of London

The FAIR principles imply “that all research objects should be Findable, Accessible, Interoperable and Reusable (FAIR) both for machines and for people” (Wilkinson et al., 2016). These principles present varying degrees of technical, legal, and ethical challenges in different countries when it comes to access and the reusability of research data. This equally applies to data in web archives (Boté & Térmens, 2019; Truter, 2021). In this presentation we examine the challenges for the use and reuse of data from web archives from both the perspectives of web archive curators and users, and we assess how these challenges influence the application of FAIR principles to such data.

Researchers' use of web archives has increased steadily in recent years, across a multitude of disciplines, using multiple methods (Maemura, 2022; Gomes et al., 2021; Brügger & Milligan, 2019). This development would imply that there are a diversity of requirements regarding the RDM lifecycle for the use and reuse of web archive data. Nonetheless there has been very little research conducted which examines the challenges for researchers in the application of FAIR principles to the data they use from web archives.

To better understand current research practices and RDM challenges for this type of data, a series of semi-structured interviews were undertaken with both researchers who use web or social media archives for their research and cultural heritage institutions interested in improving the access of their born-digital archives for research.

Through an analysis of the interviews we offer an overview of several aspects which present challenges for the application of FAIR principles to web archive data. We assess how current RDM practices transfer to such data from both a researcher and archival perspective, including an examination of how FAIR web archives are (Chambers, 2020). We also look at the legal and ethical challenges experienced by creators and users of web archives, and how they impact on the application of FAIR principles and cross-border data sharing. Finally, we explore some of the technical challenges, and discuss methods for the extraction of datasets from web archives using reproducible workflows (Have, 2020).



1:40pm - 2:00pm

Lessons Learned in Hosting the End of Term Web Archive in the Cloud

Mark Phillips1, Sawood Alam2

1University of North Texas, United States of America; 2Internet Archive, United States of America

The End of Term (EOT) Web Archive is composed of member institutions across the United States who have come together every four years since 2008 to complete a large-scale crawl of the .gov and .mil domains and document the transition in the Executive Branch of the Federal Government of the United States. In years when a presidential transition did not occur, these crawls served as a systematic crawl of the .gov domain, in what has become a longitudinal dataset of crawls. In 2022 the EOT team from the UNT Libraries and the Internet Archive moved nearly 700 TB of primary WARC content and derivative formats into the cloud. The goal of this work was to provide easier computational access to the web archive by hosting a copy of the WARC files and derivative WAT, WET, and CDXJ files in the Amazon S3 storage service as part of Amazon's Open Data Sponsorship Program. In addition to these common formats in the web archive community, the EOT team modeled our work on the structure and layout of the Common Crawl datasets, including their use of the columnar storage format Parquet to represent CDX data in a way that enables access with query languages like SQL. This presentation will discuss the lessons learned in staging and moving these web archives into AWS, and the layout used to organize the crawl data into 2008, 2012, 2016, and 2020 datasets and further into different groups based on the original crawling institution. We will give examples of how content staged in this manner can be used by researchers both inside and outside of a collecting institution to answer questions that had previously been challenging to answer about these web archives. The EOT team will discuss the documentation and training efforts underway to help researchers incorporate these datasets into their work.
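As a sketch of the kind of access this layout enables, the following queries the Parquet CDX data with DuckDB; the bucket path and column names are hypothetical (modeled on Common Crawl's columnar index, which the abstract says the EOT layout follows), so consult the EOT documentation for the real locations and schema.

```python
# Sketch: run SQL over a Parquet CDX index in S3 with DuckDB, without
# downloading any WARC files. Bucket, prefix and column names are assumptions.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")              # enables reading s3:// paths
con.execute("SET s3_region='us-east-1';")  # assumption

QUERY = """
SELECT content_mime_type, COUNT(*) AS records
FROM read_parquet('s3://eot-example-bucket/2020/cdx-parquet/*.parquet')
WHERE url LIKE 'https://www.nasa.gov/%'
GROUP BY content_mime_type
ORDER BY records DESC
LIMIT 10
"""

print(con.execute(QUERY).fetchall())
```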

 
2:20pm - 3:50pmSES-17: PROGRAM INFRASTRUCTURE
Location: Theatre 2
Session Chair: René Voorburg, KB, National Library of the Netherlands
These presentations will be followed by a 10 min Q&A.
 
2:20pm - 2:40pm

Maintenance Practices for Web Archives

Ed Summers, Laura Wrubel

Stanford University, United States of America

What makes a web archive an archive? Why don’t we call them web collections instead, since they are resources that have been collected from the web and made available again on the web? Perhaps one reason the term archive has stuck is that it entails a commitment to preserving the collected web resources over time and to providing continued access to them. Just like the brick-and-mortar buildings that must be maintained to house traditional archives, web archives rely on software and hardware infrastructure that must be cared for to ensure they remain accessible. In this talk we will present some examples of what this maintenance work looks like in practice, drawing on experiences at Stanford University Libraries (SUL).

While many organizations actively use third-party services like Archive-It, PageFreezer, and ArchiveSocial to create web archives, it is less common for them to retrieve the collected data and make it available outside that service platform. Since 2012, SUL has been building web archive collections as part of its general digital collections using tools such as HTTrack, CDL’s Web Archiving Service, Archive-It, and more recently Webrecorder. These collections were made available using the OpenWayback software, but in 2022 SUL switched to the PyWB application.

We will discuss some of the reasons why Stanford initially found it important to host its own web archive replay service and what factors led to the switch to PyWB. Work such as reindexing and quality assurance testing was integral to the move to PyWB, which in turn generated new knowledge about the web archive records, as well as new practices for transitioning them into the new software environment. The acquisition and preservation of, and access to, web archives have been incorporated into the microservice architecture of the Stanford Digital Repository. One key benefit of this mainstreaming is shared terminology, infrastructure, and maintenance practices for web archives, which is essential for sustaining the service. We will conclude with some consideration of what these local findings suggest about successfully maintaining open source web archiving software as a community.
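As a rough illustration of what such reindexing can involve, the sketch below converts classic 11-field CDX lines into the CDXJ format that PyWB prefers. The field order is an assumption based on the widely used CDX convention, and this is not Stanford's actual tooling; in practice PyWB's own utilities (for example wb-manager reindex) generate CDXJ directly from WARC files.

import json
import sys

# Assumed classic CDX field order; this varies between tools, so verify it
# against the header line of the source index before relying on it.
FIELDS = ["urlkey", "timestamp", "original", "mimetype", "statuscode",
          "digest", "redirect", "robotflags", "length", "offset", "filename"]

def cdx_to_cdxj(line):
    parts = line.rstrip("\n").split(" ")
    if line.startswith(" CDX") or len(parts) != len(FIELDS):
        return None  # skip the header or malformed lines
    rec = dict(zip(FIELDS, parts))
    body = {"url": rec["original"], "mime": rec["mimetype"],
            "status": rec["statuscode"], "digest": rec["digest"],
            "length": rec["length"], "offset": rec["offset"],
            "filename": rec["filename"]}
    return "{} {} {}".format(rec["urlkey"], rec["timestamp"], json.dumps(body))

# Reads CDX on stdin and writes CDXJ on stdout; the output would still need to
# be sorted by urlkey and timestamp before PyWB could binary-search it.
for cdx_line in sys.stdin:
    converted = cdx_to_cdxj(cdx_line)
    if converted:
        print(converted)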



2:40pm - 3:00pm

Radical incrementalism and the resilience and renewal of the National Library of Australia's web archiving infrastructure

Alex Osborne, Paul Koerbin

National Library of Australia, Australia

The National Library of Australia’s web archiving program is one of the world’s earliest established and longest continually sustained operations. From its inception it focused on establishing and delivering a functional operation as soon as feasible. This work historically included the development of policy, procedures and guidelines, together with much effort working through the changing legal landscape, from a permissions-based operation to one based on a legal deposit warrant.

Changes to the Copyright Act 1968 in 2016, which extended legal deposit to online materials, gave impetus to the NLA’s strategic priorities of more comprehensive collecting and of expanding open access to its entire web archive corpus. This also had significant implications for the NLA’s online collecting infrastructure. In part this involved confronting and dealing with a large legacy of web content collected by various tools and structured in disparate forms; in part it involved rebuilding the collecting workflow infrastructure while sustaining and redeveloping existing collaborative collecting processes.

After establishing this historical context, the presentation will focus on the NLA’s approach to the development of its web archiving infrastructure, an approach described as radical incrementalism: taking small, pragmatic steps that over time add up to major objectives. While effective in achieving strategic objectives, this approach can also build up a legacy of infrastructural dead weight that must be dealt with in order to sustain and renew the dynamic and challenging task of web archiving. With a radical team restructure and an agile, iterative approach to development, the NLA has recently made significant progress in moving from a legacy infrastructure to one of renewed sustainability and flexibility.

This presentation will highlight some of the recent developments in the NLA’s web archiving infrastructure, including its web archive collection management system (comprising ‘Bamboo’ and ‘OutbackCDX’) and its web archive workflow management tool, ‘PANDAS’.
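Since OutbackCDX is named here, a brief hedged sketch of how a replay or curation tool might query it over HTTP may be useful; the host, port, index name and parameters are illustrative assumptions, and the OutbackCDX README remains the authoritative reference.

import requests

# Hypothetical local OutbackCDX index endpoint.
INDEX_URL = "http://localhost:8080/agwa"

# Ask the index for captures of a given URL; OutbackCDX answers with
# plain-text CDX-style lines, one capture per line.
resp = requests.get(INDEX_URL, params={"url": "http://example.gov.au/"}, timeout=30)
resp.raise_for_status()

for line in resp.text.splitlines():
    print(line)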



3:00pm - 3:20pm

Arquivo.pt behind the curtains

Daniel Gomes

FCT: Arquivo.pt, Portugal

Arquivo.pt is a governmental service that enables search and access to historical information preserved from the Web since the 1990s. The search services provided by Arquivo.pt include full-text search, image search, version history listing, advanced search and application programming interfaces (APIs). Arquivo.pt has been running as an official public service since 2013, but that same year its system collapsed completely due to a severe hardware failure and an over-optimistic architectural design. Since then, Arquivo.pt has been completely renewed to improve its resilience. At the same time, Arquivo.pt has been widening the scope of its activities by improving the quality of the acquired web data and deploying online services of general interest to public administration institutions, such as the Memorial, which preserves the information of historical websites, and Arquivo404, which fixes broken links on live websites. These offerings require the delivery of resilient services that are constantly available.

The Arquivo.pt hardware infrastructure is hosted in its own data centre and managed by dedicated full-time staff. The preservation workflow is performed through a large-scale information system distributed over about 100 servers. This presentation will describe the software and hardware architectures adopted to maintain the quality and resilience of Arquivo.pt. These architectures were “designed to fail”, following a “shared-nothing” paradigm. Continuous integration tools and processes are essential to assure the resilience of the service. The Arquivo.pt online services are supported by 14 micro-services that must be kept permanently available. The Arquivo.pt software architecture is composed of 8 systems hosting 35 components, and the hardware architecture is composed of 9 server profiles. The average availability of the online services provided by Arquivo.pt in 2021 was 99.998%. Web archives must urgently assume their role in digital societies as memory keepers of the 21st century. The objective of this presentation is to share the lessons we learned at a technical level so that other initiatives may be developed at a faster pace, using the most suitable technologies and architectures.
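As a small, hedged example of the API access mentioned above, the following Python snippet calls the Arquivo.pt full-text search endpoint; the parameter and response field names reflect the publicly documented TextSearch API and should be treated as assumptions to verify against the documentation at arquivo.pt before use.

import requests

resp = requests.get(
    "https://arquivo.pt/textsearch",
    params={"q": "web archiving", "maxItems": 5},
    timeout=30,
)
resp.raise_for_status()

# Each result item is expected to include a capture timestamp, the original
# URL and a link to the archived version.
for item in resp.json().get("response_items", []):
    print(item.get("tstamp"), item.get("originalURL"), item.get("linkToArchive"))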



3:20pm - 3:40pm

Implementing access to and management of archived websites at the National Archives of the Netherlands

Antal Posthumus

Nationaal Archief, The Netherlands

The National Archives of the Netherlands, as a permanent government agency and the official archive for the Central Government, has the legal duty, laid down in the Archiefwet, to secure the future of the government record. This proposal focuses on how we developed the infrastructure and processes of our trusted digital repository (TDR) for ingesting, storing, managing, preserving and providing access to archived public websites of the Dutch Central Government.

In 2018 we issued a well-received guideline on archiving websites. We involved our producers in the drafting process, part of which was organizing a public review. We received no fewer than 600 comments from 30 different organizations, which enabled us to improve the guidelines and immediately bring them to the attention of potential future users.

These guidelines were also used as part of the requirements of a public European tender (2021). The objective of the tender was to realize a central harvesting platform (hosted at https://www.archiefweb.eu/openbare-webarchieven-rijksoverheid/) to structurally harvest circa 1,500 public websites of the Central Government. This enabled us, as an archival institution, to influence the desired outcome of the harvesting process for these websites, which are owned by all Ministries and most of their agencies.

A main challenge was that our off-the-shelf version of the OpenWayback viewer was not a complete version of the software and therefore could not render increments or provide a calendar function, one of the key elements of the minimum viable product we aimed for.
We opted for pywb based on what we learned through the IIPC community about the transition from OpenWayback to pywb.
Our technical team found the installation of pywb very simple. One issue we did encounter was that the TDR software does not support linkage with this (or any) external viewer, which forces us to copy all WARC files from our TDR into the viewer. This deviates from our current workflow and also means we need roughly twice as much disk space.
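A minimal Python sketch of the copy step described here, assuming WARC files exported from the TDR are validated and then placed in a pywb collection's archive directory; the paths and collection name are hypothetical, and in practice pywb's wb-manager add command performs the equivalent copy plus indexing.

import shutil
from pathlib import Path

from warcio.archiveiterator import ArchiveIterator

TDR_EXPORT = Path("/data/tdr-export")                     # hypothetical TDR export location
PYWB_ARCHIVE = Path("collections/rijksoverheid/archive")  # hypothetical pywb collection
PYWB_ARCHIVE.mkdir(parents=True, exist_ok=True)

for warc_path in sorted(TDR_EXPORT.glob("*.warc.gz")):
    # Read every record once so unreadable files are caught before copying.
    with open(warc_path, "rb") as stream:
        record_count = sum(1 for _ in ArchiveIterator(stream))
    shutil.copy2(warc_path, PYWB_ARCHIVE / warc_path.name)
    print("copied", warc_path.name, "with", record_count, "records")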

 

 