Conference Agenda

Overview and details of the sessions of this conference.

Please note that all times are shown in the time zone of the conference (CEST).

 
 
 
Session Overview
Location: Theatre 1
Date: Wednesday, 10/May/2023
4:00pm - 5:30pm PUBLIC EVENT: BUILDING DIGITAL HERITAGE TOGETHER: DUTCH AND TRANSNATIONAL PERSPECTIVES
Location: Theatre 1
This public event, hosted by the Netherlands Institute for Sound and Vision (NISV) and co-organised by KB – National Library of the Netherlands and IIPC, will feature presentations on the Netherlands UNESCO projects as well as an introduction to collaborative, transnational web archiving. Presentations will be followed by a panel discussion moderated by Tamara van Zwol, Dutch Digital Heritage Network.
Pre-registration is required for this event.
Date: Thursday, 11/May/2023
9:30am - 9:45am OPENING REMARKS: Eppo van Nispen, Sound & Vision
Location: Theatre 1
9:45am - 10:45am KEYNOTE: Eliot Higgins, Bellingcat. Introduced and chaired by Johan Oomen, Sound & Vision
Location: Theatre 1
11:00am - 12:30pm SES-01: RESEARCH & ACCESS
Location: Theatre 1
Session Chair: Ditte Laursen, Royal Danish Library
These presentations will be followed by a 10 min Q&A.
 
11:00am - 11:20am

Through the ARCHway: Opportunities to Support Access, Exploration, and Engagement with Web Archives

Samantha Fritz

Archives Unleashed Project, University of Waterloo, Canada

For nearly three decades, memory institutions have consciously archived the web to preserve born-digital heritage. Now, web archive collections range into the petabytes, significantly expanding the scope and scale of data for scholars. Yet research communities face many acute challenges, from the availability of analytical tools and community infrastructure to inaccessible research interfaces. The core objective of the Archives Unleashed Project is to lower these barriers and burdens for conducting scalable research with web archives.

Following a successful series of datathon events (2017-2020), Archives Unleashed launched the cohort program (2021-2023) to facilitate opportunities to improve access, exploration and research engagement with web archives.

Borrowing from the hacking genre of events often found within the tech industry, Archives Unleashed datathons were designed to provide an immersive and uninterrupted period of time for participants to work collaboratively on projects and gain hands-on experience working with web archive data. The datathon series cultivated community formation and empowered scholars to build confidence and the skills needed to work with web archives. However, because of the short-term nature of datathons, the focused energy and time devoted to research projects ultimately diminished once meetings concluded.

Launched in 2021, the Archives Unleashed cohort program was developed as a matured evolution of the datathon model to support research projects. The program ran two iterative cycles and hosted 46 international researchers from 21 unique institutions. Programmatically, researchers engaged in a year-long collaboration project, with web archives featured as a primary data source. The mentorship model has been a defining feature, including direct one-on-one consultation from Archives Unleashed, connections to field experts, and opportunities for peer-to-peer support.

This presentation will reflect on the experiences of engaging with scholars to build scalable analytical tools and deliver a mentorship program to facilitate research with web archives. The cohort program asked researchers to step into an unfamiliar environment with complex data, and they did so with curiosity while embracing opportunities to access, explore, and engage with web archive collections. While the program highlights a broad range of use cases, we seek to inspire the adoption of web archives for scholarly inquiry more commonly across disciplines.



11:20am - 11:40am

‘Research-ready’ collections: challenges and opportunities in making web archive material accessible

Leontien Talboom¹, Mark Simon Haydn²

¹Cambridge University Libraries, United Kingdom; ²National Library of Scotland, United Kingdom

The Archive of Tomorrow is a collaborative, multi-institutional project led by the National Library of Scotland and funded by the Wellcome Trust collecting information and misinformation around health in the online public space. One of the aims of this project is to create a ‘research-ready’ collection which would make it possible for researchers to access and reuse the themed collections of materials for further research. However, there are many challenges around making this a reality, especially around the legislative framework governing collection of and access to web archives in the UK, and technical difficulties stemming from the emerging platforms and schemas used to catalogue websites.

This talk primarily addresses IIPC 2023's Access and Research themes, while also touching on the Collections and Operations strands in its discussion of a short-term project promising to deliver technical improvements and expanded access to web archives collections by 2023. The presentation will explore the difficulties the project encountered by offering different ways into the material, including exposing insights that can be generated from working with metadata exports outside of collecting platforms; detailing the project’s work in surfacing web archives in traditional library discovery settings through metadata crosswalks; and exploring further possibilities around the use of Jupyter Notebooks for data exploration and the documentation and dissemination of datasets.

The intended deliverables of this session are to present the tools developed within the project to make web archive material suitable and useful for research; to share frameworks used by the project’s web archivists when navigating the challenges of archiving personal and political health information online; and to discuss the barriers to access around collecting web archive and social media material in a UK context.



11:40am - 12:00pm

Developing new academic uses of web archives collections: challenges and lessons learned from the experimental service deployed at the University of Lille during the ResPaDon Project

Jennifer Morival¹, Sara Aubry², Dorothée Benhamou-Suesser²

¹Université de Lille, France; ²Bibliothèque nationale de France, France

2022 marks the second year of the ResPaDon project, undertaken by the BnF (National Library of France) and the University of Lille, in partnership with Sciences Po and Campus Condorcet. The project brings together researchers and librarians to promote and facilitate a broader academic use of web archives by demonstrating the value of web archives and by reducing the technical and methodological barriers researchers may encounter when discovering this source for the first time or when working with such complex materials.

One of the ways to meet the challenges and address new ways of doing research is the implementation of an experimental remote access point to the web archives at the University of Lille. The project team has renewed the offer of tools and conducted outreach to new groups of potential web archive users.

The remote access point to web archives has been deployed in two university libraries in Lille: this service allows both for consultation of the web archives in their entirety (44 billion documents, 1.7 PB of data) and for exploring a collection, "The 2002 presidential and local elections", which was the first collection constituted in-house by the BnF 20 years ago. This collection is now accessible through various tools for data mining, analysis, and data visualization. The use of those tools is accompanied by guides, reports, examples, and use cases: multiple types of supporting documentation that will also be evaluated on their usefulness as part of the experimentation.

The presentation will focus on the implementation of this access point from both technical and practical aspects. It will address the training of the team of 6 mediators responsible for accompanying the researchers in Lille, as well as the collaboration between the teams in Lille and at the BnF. It will also tackle the challenges of outreach and the path we have taken to communicate within the academic community to find researcher-testers.

We will share the results and lessons learned from this experimentation: the first tests conducted with the researchers have allowed us to obtain feedback on the tools deployed and the improvements to be made to this experimental service.

 
1:30pm - 2:30pm SES-03 (PANEL): INSTITUTIONAL WEB ARCHIVING INITIATIVES TO SUPPORT DIGITAL SCHOLARSHIP
Location: Theatre 1
Session Chair: Martin Klein, Los Alamos National Laboratory
 

Institutional Web Archiving Initiatives to Support Digital Scholarship

Martin Klein¹, Emily Escamilla², Sarah Potvin³, Vicky Rampin⁴, Talya Cooper⁴

¹Los Alamos National Laboratory, United States of America; ²Old Dominion University, United States of America; ³Texas A&M University, United States of America; ⁴New York University, United States of America

Panel description:
Scholarship happens on the web but, unlike more traditional output such as scientific papers in PDF format, we are still lacking comprehensive institutional web archiving approaches to capture increasingly prominent scholarly artifacts such as source code, datasets, workflows, and protocols. This panel will feature scholars from three different institutions - Old Dominion University, Texas A&M University, and New York University - who will provide an overview of their explorations in investigating the use of scholarly artifacts and their (in-)accessibility on the live web. The panelists will further outline how these findings inform institutional collection policies regarding such artifacts, web archiving efforts aligned with institutional infrastructure, and outreach and education opportunities for students and faculty. The panel will conclude with an interactive discussion while welcoming input and feedback from the WAC audience.

Individual:

Emily:

Title: Source Code Archiving for Scholarly Publications

Abstract:

Git Hosting Platforms (GHPs) are commonly used by software developers and scholars to host source code and data to make them available for collaboration and reuse. However, GHPs and their content are not permanent. Gitorious and Google Code are examples of GHPs that are no longer available even though users deposited their code expecting an element of permanence. Scholarly publications are well-preserved due to current archiving efforts by organizations like LOCKSS, CLOCKSS, and Portico; however, no analogous effort has yet emerged to preserve the data and code referenced in publications, particularly the scholarly code hosted online in GHPs. The Software Heritage Foundation is working to archive public source code, but issue threads, pull requests, wikis, and other features that add context to the source code are not currently preserved. Institutional repositories seek to preserve all research outputs which include data, source code, and ephemera; however, current publicly available implementations do not preserve source code and its associated ephemera, which presents a problem for scholarly projects where reproducibility matters. To discuss the importance of institutions archiving scholarly content like source code, we first need to understand the prevalence of source code within scholarly publications and electronic theses and dissertations (ETDs). We analyzed over 2.6 million publications across three categories of sources: preprints, peer-reviewed journals, and ETDs. We found that authors are increasingly referencing the Web in their scholarly publications with an average of five URIs per publication in 2021, and one in five arXiv articles included at least one link to a GHP. In this panel, we will discuss some of the questions that result from these findings such as: Are these GHP URIs still available on the live Web? Are they available in Software Heritage? Are they available in web archives and if so, how often and how well are they archived?

Sarah:

Title: Designing a Sociotechnical Intervention for Reference Rot in Electronic Theses

Abstract:

Intertwined publication and preservation practices have become widespread in the establishment of institutional digital repositories and libraries’ stewardship of institutional research output, including open educational resources and electronic theses and dissertations. Most digital preservation work seeks to preserve a whole text, like a dissertation, in a digital form. This presentation reports on an ongoing research effort - a collaboration with Klein, Potvin, Katherine Anders, and Tina Budzise-Weaver - intended to prevent potential information loss within the thesis, through interventions that can be integrated into trainings and thesis management tools. This approach draws on research into graduate training and citation practices, web archiving, open source software development, and digital collection stewardship with a goal of recommending systematized sociotechnical interventions to prevent reference rot in institutionally-hosted graduate theses. Findings from qualitative surveys and interviews conducted at Texas A&M University on graduate student perceptions of reference rot will be detailed.

Vicky/Talya:

Title: Collaborating on Software Archiving for Institutions

Abstract:

Inarguably, software and code are part of our scholarly record. Software preservation is a necessary prerequisite for long-term access and reuse of computational research, across many fields of study. Open research software is shared on the Web most commonly via Git hosting platforms (GHPs), which are excellent for fostering open source communities and transparency of research, and which add useful features on top such as wikis, continuous integration, merge requests, and issue threads. However, the source code and the useful scholarly ephemera (e.g. wikis) are archived separately, often by “breadth over depth” approaches. I’ll discuss the Collaborative Software Archiving for Institutions (CoSAI) project from NYU, LANL, ODU, and OCCAM, which is addressing this pressing need to provide machine-repeatable, human-understandable workflows for preserving web-based scholarship, scholarly code in particular, alongside the components that make it most useful. I’ll present the results of ongoing efforts in the three main streams of work: 1) technical development on open source, community-led tools for collecting, curating, and preserving open scholarship with a focus on research software, 2) community building around open scholarship, software collection and curation, and archiving of open scholarship, and 3) optimizing workflows for archiving open scholarship with ephemera, via machine-actionable and manual workflows.

 
2:40pm - 3:50pm SES-05: COVID-19 COLLECTIONS
Location: Theatre 1
Session Chair: Kees Teszelszky, KB, National Library of the Netherlands
These presentations will be followed by a 10 min Q&A.
 
2:40pm - 3:00pm

The UK Government Web Archive (UKGWA): Measuring the impact of our response to the COVID-19 pandemic

Tom Storrar

The National Archives, United Kingdom

The COVID-19 pandemic, the first pandemic of the digital age, has presented an enormous challenge to our web archiving practice. As the official archive of the UK government, we were tasked with building a comprehensive archive of the UK government's online response to the emergency. To meet this challenge we have devised new archiving strategies ranging from supplementary broad, keyword-driven crawling to focused, data-driven, daily captures of the UK’s official “Coronavirus (COVID-19) in the UK” data dashboard. We have also massively increased our rates of capture. The challenge has demanded creativity, adaptation and a great deal of effort.

All of this work prompted us to think of a number of questions that we’d like to answer: How complete is the record we captured in our web archive and how much is this a result of the extra effort we made? How could we perform meaningful analysis on the enormous numbers of HTML and non-HTML resources? What contributions have these innovations made to this outcome and how can these inform our practice going forward?

To tackle these questions we needed to analyse millions of captured resources in our web archive. It soon became clear that we’d only be able to achieve the level of insight needed by developing an entire end-to-end analysis system. The resulting pipeline we designed and built uses a combination of familiar and novel concepts and approaches; we used the WARC file content, along with CDX APIs, but we also developed a set of heuristics, and custom algorithms, all ultimately populating a database that allowed us to run queries to give us the answers we sought. Running an entirely cloud-based system enabled this work as we were at that time unable to reliably access our office.

This presentation will provide an overview of the approaches used, the results we found and the areas for further development. We believe that these tools can be applied to our overall web archive collections and hope that other institutions will find our experience useful when thinking about analysing their own collection and quantifying the impact of their efforts.



3:00pm - 3:20pm

Women and COVID through Web Archives. How to explore the pandemic through a collaborative, interdisciplinary research approach

Susan Aasman¹, Karin de Wild², Joshgun Sirajzade³, Fréderic Clavert³, Valerie Schafer³, Sophie Gebeil⁴, Niels Brügger⁵

¹University of Groningen, The Netherlands; ²Leiden University, The Netherlands; ³University of Luxembourg, Luxembourg; ⁴Aix-Marseille University, France; ⁵Aarhus University, Denmark

The COVID crisis has been a shared, collective, worldwide experience since March 2020, and many voices have echoed each other, whether on grief, lockdown, masks and vaccines, homeschooling, etc. However, this unprecedented crisis has also deepened asymmetries and failures within societies, in terms of occupational fields, economic inequalities, and health and sanitary access, and we could extend the inventory of these hidden and more visible gaps that were reinforced during the crisis. Women and gender were also at stake in this health crisis, whether in discussions of the better management of the crisis by female politicians, domestic violence during the lockdown, the decreasing production of papers by female research scientists, homeschooling and the mental load of women, etc.

As a cohort team within the Archives Unleashed Team (AUT) program, the European research AWAC2 team benefited from privileged access to this collection, thanks to Archive-It and through ARCH, and from regular mentorship by the AUT team. This allowed us to investigate and analyse this huge collection of 5.3 TB, with 161,757 lines in the CSV on domain frequency and 8,738,751 lines in the CSV of the plain text of web pages. In December 2021, our AWAC2 team submitted several topics to the IIPC (International Internet Preservation Consortium) community and invited the international organization to select one of them that the team would investigate in depth, based on the unique IIPC COVID collection of web archives. Women, gender, and COVID was the winning topic.

Accepting the challenge, the AWAC2 team organized a datathon in March 2022 in Luxembourg to investigate and retrieve the many traces of women, gender and COVID in web archives, while mixing close and distant reading. Since then, the team has been working on the dataset to further explore the opportunities for computational methods for reading at scale. In this presentation, we will reflect on technical, epistemological, and methodological challenges and present some results as well.



3:20pm - 3:40pm

Surveying the landscape of COVID-19 web collections in European GLAM institutions

Nicola Bingham¹, Friedel Geeraert², Caroline Nyvang³, Karin de Wild⁴

¹British Library, United Kingdom; ²KBR (Royal Library of Belgium); ³Royal Danish Library; ⁴Leiden University

The aim of the WARCnet network [https://cc.au.dk/en/warcnet/about] is to promote high-quality national and transnational research that will help us to understand the history of (trans)national web domains and of transnational events on the web, drawing on the increasingly important digital cultural heritage held in national web archives. Within the context of this network, a survey was conducted to see how cultural heritage institutions are capturing the COVID-19 crisis for future generations. The aim of the survey was to map the scope and collection strategies of COVID-19 Web collections with a main focus on Europe. The survey was managed by the British Library and was conducted by means of the Snap survey platform. It circulated between June and September 2022 among mainly European GLAM institutions and 61 responses were obtained.

The purpose of this presentation is to provide an overview of the different collection development practices when curating COVID-19 collections. On the one hand, the results may help GLAM institutions to gain further insights into how to curate COVID-19 Web collections or identify potential partners. On the other hand, revealing the scope of these Web collections may also encourage humanists and data scientists to unlock the potential of these archived Web sources to further understand international developments on the Web during the COVID-19 pandemic.

More concretely, the presentation will provide further insight into the local, regional, national or global scopes of the different COVID-19 collections, the type of content that is included in the collections, the available metadata, the selection criteria that were used when curating the collections and the efforts that were made to create inclusive collections. The temporality of the collections will also be discussed by highlighting the start, and, if applicable, end dates of the collections and the capture frequency. Quality control and long-term preservation are two further elements that will be discussed during the presentation.

 
4:20pm - 5:30pm SES-07: COLLABORATIONS & OUTREACH
Location: Theatre 1
Session Chair: Ben Els, National Library of Luxembourg
These presentations will be followed by a 10 min Q&A.
 
4:20pm - 4:40pm

Linking web archiving with arts and humanities: the collaboration between ROSSIO and Arquivo.pt

Ricardo Basílio

Arquivo.pt - Fundação para a Ciência e Tecnologia, I.P., Portugal

Between 2018 and 2022, ROSSIO and Arquivo.pt developed collaborative activities with the goal of connecting web archiving, arts and digital humanities. How can Web archives be made useful and accessible to digital humanities researchers, and by extension to citizens? This challenge was answered in three ways: training, dissemination, and collaborative curation of websites. This presentation aims to describe those collaborative activities and share what we’ve learned from them.

ROSSIO is a Portuguese infrastructure for the Social Sciences, Arts and Humanities (https://rossio.fcsh.unl.pt/). Its mission is to aggregate, contextualize, enrich and disseminate digital content. It is based at the Faculty of Social and Human Sciences of the NOVA University of Lisbon (FCSH-NOVA) and involves several institutions that provide content. Arquivo.pt's mission (https://arquivo.pt) is to preserve the Portuguese Web and make available contents from the Web since 1996 to everyone, from simple citizens to researchers.

ROSSIO contributed human resources, namely, a web curator, a community manager, a web developer, and researchers who used Arquivo.pt in their work. Arquivo.pt in turn contributed its know-how, created new services (e.g., the SavePageNow) and made available open data sets.

Therefore, we describe the activities carried out in collaboration and their results.

First, regarding training, we refer to face-to-face and online sessions held with ROSSIO partners and their communities. We highlight the initiative "Café with Arquivo.pt" (https://arquivo.pt/cafe) and the webinars held during the pandemic, because they strengthened the connection between Arquivo.pt and distant communities (e.g., in 2021 they had 538 participants and an 84% satisfaction rate).

Second, continuous dissemination in the social networks and groups of the ROSSIO partners helped to make Arquivo.pt better known (e.g., 7,300 new users accessed the service between 2018 and 2021).

Third, researchers from ROSSIO collaborated in curating websites, which resulted in documentation for studies and online exhibitions (e.g. “Times of illness, times of healing” at the FCSH NOVA; and "art festivals memory" at the Gulbenkian Art Library).

We conclude this presentation by sharing what we learned from participating in ROSSIO, and the challenges that lie ahead for creating a community of practice among art and humanities researchers.



4:40pm - 5:00pm

Building collaborative collections: experience of the Croatian Web Archive

Inge Rudomino, Dolores Mumelaš

National and University Library in Zagreb, Croatia

In Croatia, the only institution that archives the web is the National and University Library in Zagreb. The library established the Croatian Web Archive (HAW) and began archiving Croatian web sources in 2004. From then until today, we have developed several approaches to web archiving: selective, .hr crawls, thematic crawls, building local history collections and social media archiving. In order to broaden our collections and raise public awareness as much as possible the Croatian Web Archive is opening up to collaboration with other libraries, as well as all interested citizens.

One of the examples is the Building Local History Web project from 2020. That year, the Croatian Web Archive began collaboration with public libraries for the purpose of archiving web resources related to a specific area or homeland. The contents are related to a specific locality with the aim of presenting and ensuring long-term access to local materials that are available only on the web and complement and popularize the local history collection of the public library.

In addition to collaboration with public libraries, the Croatian Web Archive has connected with the User Service Department of the National and University Library in Zagreb, in order to involve citizens in the creation of thematic collections through citizen science. In that way the thematic collection “Bees, life, people” was created, using the crowdsourcing method, in collaboration with the public library, citizens (high school students) and other library departments.

This presentation will discuss developing a collection policy, collaboration and working process in building local history and citizen science collections.

The lessons learned through collaboration with citizens and public libraries are a great encouragement to expand the existing scope of archiving and to involve other libraries and citizens in raising awareness of information literacy and the importance of archiving web content.



5:00pm - 5:20pm

Your Software Development Internship in Web Archiving

Youssef Eldakar

Bibliotheca Alexandrina, Egypt

A summer internship project is an opportunity for the intern to practice in the real world as well as for the host institution to make extra progress on program objectives, while also engaging with the community. Since 2019, Bibliotheca Alexandrina's IT team has been running a summer internship series for undergraduate students of computing, with several of the internship projects having a connection to web archiving.

Throughout this experience, our mentors have been finding the young interns much intrigued by the technology involved in archiving the web. From a computing perspective, aside from serving to preserve a quite significant information medium, web archiving is an activity where a number of sub-domains of computing come together. A software project in web archiving will involve, for instance, management of big data to keep pace with how the web and consequently an archive thereof continues to expand in volume, parallel computing to achieve the capacity for both data harvesting and processing at that level of scale, machine learning to find answers to questions about the datasets that can be extracted from a web archive, or network theory and graph analytics to come to more understandable representations of the heavily interlinked data.

In this presentation, we invite you to join us on a virtual visit to the home of the IT team at Bibliotheca Alexandrina for a look into our archive of past internship projects in web archiving. These projects include the investigation of alternative graph analytics backends for the implementation of new features in web archive graph visualization, repurposing of the WARC format for use in the library's digital book portal, and crawling the web for text for language model training. For each project, we will review the specific objective, how the problem was addressed, and the outcome. Finally, to reflect on the overall experience, we will share lessons learned as well as discuss how the interaction with the community through internships is additionally an opportunity to raise awareness about web archiving, the technology involved, and the work of the International Internet Preservation Consortium (IIPC).

 
5:30pm - 6:10pm POS-1: LIGHTNING & DROP-IN TALKS
Location: Theatre 1
Session Chair: Abbie Grotke, Library of Congress
1 minute drop-in talks will immediately follow lightning talks. After the session ends, lightning talk presenters will be available for questions in the atrium, where their posters will be on display.

Drop-in talk schedule:

Quick Overview of Perma Tools List
Clare Stanton, Perma.cc

Engineering Updates from Internet Archive
Alex Dempsey, Internet Archive

Mapping News in the Norwegian Web Archive
Jon Carlstedt Tønnessen, National Library of Norway
 

Memory in Uncertainty – The Implications of Gathering, Storing, Sharing and Navigating Browser-based Archives

Cade Diehm, Benjamin Royer

New Design Congress, Germany

How do we save the past in a violent present for an uncertain future? As societal digitisation accelerates, so too has the belligerence of state and corporate power, the democratisation of targeted harassment, and the collapse of consent by communities plagued by ongoing (and often unwanted) datafication. Drawing from political forecasts and participatory consultation with practitioners and communities, this research examines the physical safety of data centres, the socio-technical issues of the diverse practice of web-based archiving, and the physical and mental health of archive practitioners and communities subjected to archiving. This research identifies and documents issues of ethics, consent, digital security, colonialism, resilience, custodianship and tool complexity. Despite the systemic challenges identified in the research, and the broad lag in response from tool makers and other actors within the web archiving discipline, there exist compelling reasons to remain optimistic. Emergent technologies, stronger socio-technical literacy amongst archivists, and critical interventions in the colonial structures of digital systems offer immediate points of intervention. By acknowledging the shortcomings of cybernetics, resisting the desire to apply software solutionism at scale, and developing a nuanced and informed understanding of the realities of archiving in digitised societies, a broad surface of opportunities can emerge to develop resilient, considered, safe and context-sensitive archival technologies and practice for our uncertain world.



To preserve this memory, click here. Real-time public engagement with personal digital archives

Marije Miedema, Susan Aasman, Sabrina Sauer

University of Groningen, Centre for Media and Journalism Studies

Digital collections aim to reflect our personal and collective histories, which are shaped by and concurrently shape our memories. While advancements are made to develop web archival practices in the public domain, personal digital material is mostly preserved with commercially driven technologies. This is worrying, for although it may seem that these privately-owned cloud services are spaces where our precious pictures will exist forever, we know that long-term sustainable archiving practices are not these service providers’ primary concern. This demo is part of the first stages in the fieldwork of a PhD project that explores alternative approaches to sustainable everyday archival data management. Through participatory research methods, such as co-designing prototypes, we aim to establish a public-private-civic collaboration to rethink our relationship with the personal digital archive. Moving towards the question of what digital material do we throw away, discard, or forget about, we want to contribute to existing knowledge on how to manage the growing amount of digital stuff.

Translating this question into an interactive installation, the demo combines human and technological performativity, employing participatory, playful methods to let conference participants materialize their reflections on their engagement with their digital archives, from both their professional and personal perspectives. This demo invites conference participants to actively engage with the question of responsibility regarding the future of our personal digital past: is there a role to play for public institutions next to the commitment of individuals to commercially driven storage technologies? The researchers will consider the privacy of the participants throughout the duration of the demo. Through this demo, the community of (web) archivists is involved in the early stages of the project’s co-creative research practices, and the project aims to build lasting connections with these important stakeholders.



Participatory Web Archiving: A Roadmap for Knowledge Sharing

Cui Cui1,2

1Bodleian Libraries University of Oxford, United Kingdom; 2Information School University of Sheffield

In recent years, community participation seems to have become a desirable step in developing web archives. Participatory practices in the cultural heritage sector are not new (Benoit & Eveleigh, 2019). The practice of working in collaboration with different community partners to build archives is underway in conventional archives (Cook, 2013). Indeed, it has now become one of the main themes of web archival development on both theoretical and practical levels.

Although involving wider communities is often regarded as an approach to democratise practices, it has been debated if community participation can lead to improved representation. At the same time, the significant impact that participatory practices have on creating and sharing knowledge should not be underestimated. My current PhD research is to understand how participatory practices have been deployed in web archiving, their mechanisms and impacts.

Since April 2022, I have worked as a web archivist for the Archive of Tomorrow project, developing various sub-collections on topics relating to cancer, Covid-19, food, diet, nutrition, and wellbeing. The project, funded by the Wellcome Trust, is to explore and preserve online information and misinformation about health and the Covid-19 pandemic. Started in February 2022, the project runs for 14 months and will form a 'Talking about Health' collection within the UK Web Archive, giving researchers and members of the public access to a wide representation of diverse online sources.

For this project, I have attempted to link theories with practices and applied various participatory methods in developing the collection, such as engaging with subject librarians, delivering a workshop co-curating a sub-collection, consulting academics to identify archiving priorities, co-curating a sub-collection with students from an internship scheme, and collaborating with a local patient support group. This poster reflects on how the different approaches have been deployed and the lessons learned. It will highlight the transformative impact of participatory practices on sharing, creating and reconstructing knowledge.

References

Benoit, E., & Eveleigh, A. (2019). Defining and framing participatory archives in archival science. In E. Benoit & A. Eveleigh (Eds.), Participatory archives: theory and practice (pp. 1–12). London.

Cook, T. (2013). Evidence, memory, identity, and community: Four shifting archival paradigms. Archival Science, 13(2–3), 95–120. https://doi.org/10.1007/s10502-012-9180-7

 
Date: Friday, 12/May/2023
8:30am - 10:00amSES-11: COLLECTION BUILDING
Location: Theatre 1
Session Chair: Lauren Baker, Library of Congress
These presentations will be followed by a 10 min Q&A.
 
8:30am - 8:50am

20 years of archiving the French electoral web

Dorothée Benhamou-Suesser, Anaïs Crinière-Boizet

Bibliothèque nationale de France, France

In 2022, BnF is celebrating the 20th anniversary of its electoral crawls. On this occasion, we would like to trace the history of 20 years of electoral crawls, which cover 20 elections of all types (presidential, parliamentary, local, departmental, European), and represent more than 30 TiB of data. The 2002 presidential election crawl was the first in-house crawl conducted by the BnF, a founding moment for experimenting with a legal, technical and library policy framework. We, as a heritage institution, are accountable for the first electoral collections, which are emblematic and representative of our workflows in several respects: harvest, selection, and outreach.

First, from a technical point of view, electoral crawls were an opportunity to set up crawling tools and to develop adaptive techniques to keep pace with the evolution of the Web and meet the challenge of archiving it. With each new election we have experimented with and improved our archiving processes, paying particular attention to the means of communication in use (e.g. forums, Twitter accounts, YouTube channels and, more recently, Instagram accounts and TikTok content).

Secondly, electoral crawls have led the BnF to set up and organise a network of contributors and the means of selection. In 2002, contributions were from BnF librarians. In 2004, partner libraries in different regions and overseas territories helped select content for the regional elections. In 2012, we initiated the development of a collaborative curation tool. Throughout the years, we have also built a document typology that has remained stable to guarantee the coherence of the collections.

Thirdly, electoral crawls led us to set up ways to promote web archives to the public and the research community. To promote the use of a collection with such historical consistency, of high interest for the study of political life, we designed guided tours (thematic and edited selections of archived pages made by librarians). The BnF also engaged in organizing scientific events, and in several collaborative outreach initiatives.



8:50am - 9:10am

Archiving the Web for FIFA World Cup Qatar 2022™

Arif Shaon, Carol Ann Daul Elhindi, Marcin Werla

Qatar National Library, Qatar

The core mission of Qatar National Library is to “spread knowledge, nurture imagination, cultivate creativity, and preserve the nation’s heritage for the future.” To fulfil this mission, the Library commits to collecting, preserving and providing access to both local and global knowledge, including heritage-related content relevant to Qatar and the region. Web resources of cultural importance could assist future generations in the interpretation of events that may not be extant anywhere else. Archiving such websites is an important initiative within the wider mission of the Library to support Qatar on its journey towards a knowledge-based economy.

The 2022 FIFA World Cup will be the first World Cup ever to be held in the Arab world, and hence is considered a landmark event in Qatar’s history. Qatar’s journey towards hosting the 2022 World Cup has been covered by all types of local and international websites and news portals, and the coverage is expected to increase significantly in the weeks leading to, during and post-World Cup. The information published by these websites will truly reflect the journey towards, and experience of, the event from a variety of perspectives, including the fans, the organizers, the players, and members of the public. Capturing and preserving such information for the long-term enables future generations to also share the experience and appreciate the astounding effort required to host a massive, culturally important global event in Qatar.

In this talk, we describe the Library’s approach to capturing and preserving websites related to the World Cup 2022, to guarantee access to the content for the future generations. We also highlight the challenges associated with developing archived websites as collections for researchers in the context of the Qatari copyright law.



9:10am - 9:30am

Museums on the Web: Exploring the past for the future

Karin de Wild

Leiden University, The Netherlands

This presentation will celebrate the launch of the special collection ‘Museums on the Web’ at the KB, National Library of the Netherlands. This evolving collection unlocks an essential and the largest sub-collection within the KB Web archive. It contains more than 800 museum websites and offers the potential to research histories of museums on the Web within the Netherlands.

Accessing Web archives requires special tools, and therefore this presentation will demonstrate a variety of entry points. The collection features a selection of curated archived websites that can be viewed page-by-page. It will also be the first KB special collection that is accessible through a SOLR Wayback search engine, which enables users to request derived datasets and explore the collection through a series of dashboards. This offers the opportunity to study histories of museums on the Web in The Netherlands, combining methods from history and data science and drawing on a computational analysis of Web archive data.

The presentation will conclude by highlighting some significant case studies to showcase the diversity of museum websites and the research potential to uncover a Dutch history of museums on the Web. The advent of online technologies has changed the way museums manage collections and access them, shape exhibitions, and build communities. By engaging with the past, we can enhance our understanding of how museums are functioning today and offer new perspectives for future developments.

This paper coincides with the release of a Double Special Issue “Museums on the Web: Exploring the past for the future” in the journal Internet Histories: Digital Technology, Culture and Society (Routledge/Taylor & Francis).



9:30am - 9:50am

Unsustainability and Retrenchment in American University Web Archives Programs

Gregory Wiedeman1, Amanda Greenwood2

1University at Albany, United States of America; 2Union College

This presentation will overview the expansion and later retrenchment of UAlbany’s web archives program due to a lack of permanently funded staff. UAlbany began its web archives program in 2013 in response to state records laws requiring it to preserve university records on the web. The department that housed the program had strong existing collecting programs in New York State politics and capital punishment. Since much of current politics and activism now happens online, it was natural and necessary to expand the web archives program to ensure we were effectively documenting these important spaces for the long-term future. However, we will show how the increasing complexity of the web and collecting techniques means that the scoping needs for ongoing collecting seem to require significantly more testing and labor over time. Thus, despite the need to expand the web archives program to meet our department’s mission, we will describe the painful process of reducing our web archives collecting scope. With the NDSA Web Archiving in the United States surveys reporting 71-83% of respondents devoting 0.5 or less FTE to web archiving, maintenance inflation like this is catastrophic to many web archives programs. Most alarmingly, we will overview how the web archives labor situation at American universities is likely to get worse. The UAlbany Libraries, which houses the web archives program, has permanently lost over 30% of FTE since 2020 and almost 50% of FTE since 2000. Peer assessment studies, ARL staffing surveys, and the University of California, Berkeley’s recent announcement of library closures show that UAlbany’s example is more typical than exceptional. Finally, we will show how these cuts are not the result of a misunderstanding or a lack of value for web archives or libraries by university administrators, but because our web archives program conflicts with UAlbany’s overall organizational mission and the business model of American higher education.

 
10:30am - 12:00pmSES-12: DOMAIN CRAWLS
Location: Theatre 1
Session Chair: Grace Bicho, Library of Congress
These presentations will be followed by a 10 min Q&A.
 
10:30am - 10:50am

Discovering and Archiving the Frisian Web. Preparing for a National Domain Crawl.

Susanne van den Eijkel, Iris Geldermans

KB, National Library of the Netherlands

In recent years, KB, National Library of the Netherlands (KBNL), conducted a pilot for a national domain crawl. KBNL has been harvesting websites with the Web Curator Tool (a web interface with Heritrix crawler) since 2007 on a selective basis, focusing on Dutch history, culture and language. Information on the web can be brief in existence but can be of vital importance for researchers now and in the future. Furthermore, KBNL outlined in its content strategy that it is the ambition of the library to collect everything that was published in and about the Netherlands, websites included. As more libraries around the world were collecting a national domain, KBNL also expressed the wish to execute a national domain crawl. Before we were able to do that, we had to form a multidisciplinary web archiving team, decide on a new tool for domain harvests and start an intensive testing phase. For this pilot a regional domain, the Frisian, was selected. Since we were new to domain harvesting, we used a selective approach. Curators of digital collections from KBNL were in close contact with Frisian researchers, to help define which websites needed to be included in the regional domain. During the pilot we also gathered more knowledge about Heritrix as we were using NetarchiveSuite (also a web interface with Heritrix crawler) for crawls.

Now that the results are in, we can share our lessons learned, like challenges on technical and legal aspects and related policies that are needed for web collections. Also, we will go into detail about the crawler software settings that were tested and how we can use such information as context information.

This presentation is related to the conference topics collections, community and program operations, as we want to share the best practices for executing a (regional) domain crawl and lessons learned in preparation for a national domain crawl. Furthermore, we will focus on the next steps after completion of the pilot. Other institutions that are harvesting websites can learn from it and those that want to start with web archiving can be more prepared.



10:50am - 11:10am

Back to Class: Capturing the University of Cambridge Domain

Caylin Smith, Leontien Talboom

Cambridge University Libraries, United Kingdom

The University Archives of Cambridge University, based at the University Library (UL), is responsible for the selection, transfer, and preservation of the internal administrative records of the University, dating from 1266 to the present. These records are increasingly created in digital formats, including common ‘office’ formats (Word, Excel, PDF) and, increasingly, formats created for the web.

The question “How do you preserve an entire online ecosystem in which scholars collaborate, discover and share new knowledge?” about the digital scholarly record posed by Cramer et al. (2022) equally applies to online learning and teaching materials as well as the day-to-day business records of a university.

Capturing this online ecosystem as comprehensively, rather than selectively, as possible is an undertaking that involves many stakeholders and moving parts.

As a UK Legal Deposit Library, the UL is a partner in the UK Web Archive and Cambridge University websites are captured annually; however, some online content needs to be captured more frequently, does not have an identifiable UK address, or is behind a log-in screen.

To improve this capturing, the UL is working on the following:

  • Engaging with content creators and/or University Information Services, which supports the University’s Drupal platform.
  • Working directly with the University Archivist as well as creating a web archiving working group with additional Library staff to identify what University websites need to be captured manually or were captured only in an annual domain crawl but need to be captured more frequently.
  • Becoming a stakeholder in web transformation initiatives to communicate requirements for creating preservable websites and quality checking new web templates from an archival perspective.
  • Identifying potential tools for capturing online content behind login screens. So far WebRecorder.io has been a successful tool to capture this material; however, this is a time-consuming and manual process that would be improved if automated. The automation of this process is currently being explored.

Our presentation will walk WAC2023 attendees through our current workflow as well as highlight ongoing challenges we are working to resolve so that attendees based at universities can take these into account for archiving content on their university’s domains.



11:10am - 11:30am

Laboratory not Found? Analyzing LANL’s Web Domain Crawl

Martin Klein, Lyudmila Balakireva

Los Alamos National Laboratory, United States of America

Institutions, regardless of whether they identify as for-profit, nonprofit, academic, or government, are invested in maintaining and curating their representation on the web. The organizational website is often the top-ranked on search engine result pages and commonly used as a platform to communicate organizational news, highlights, and policy changes. Individual web pages from this site are often distributed via organization-wide email channels, included in new articles, and shared via social media. Institutions are therefore motivated to ensure the long-term accessibility of their content. However, resources on the web frequently disappear, leading to the known detriment of link rot. Beyond the inconvenience of the encounter with a “404 - Page not Found” error, there may be legal implications when published government resources are missing, trust issues when academic institutions fail to provide content, and even national security concerns when taxpayer-funded federal research organizations such as Los Alamos National Laboratory show deficient stewardship of their digital content.

We therefore conducted a web crawl of the lanl.gov domain with the motivation to investigate the scale of missing resources within the canonical website representing the institution. We found a noticeable number of broken links, including a significant number of special cases of link rot commonly known as “soft404s” as well as potential transient errors. We further evaluated the recovery rate of missing resources from more than twenty public web archives via the Memento TimeTravel federated search service. Somewhat surprisingly, our results show little success in recovering missing web pages.

These observations lead us to argue that, as an institution, we could be a better steward of our web content and establishing an institutional web archive would be a significant step towards this goal. We therefore implemented a pilot LANL web archive in support of highlighting the availability and authenticity of web resources.

In this presentation, I will motivate the project, outline our workflow, highlight our findings, and demonstrate the implemented pilot LANL web archive. The goal is to showcase an example of an institutional web crawl that, in conjunction with the evaluation, can serve as a blueprint for other interested parties.
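
The categories of link failure named in the abstract above (broken links, “soft404s”, and potential transient errors) can be illustrated with a small classification sketch. This is not the presenters' method: the status-code groupings, the phrase list, and the names `Fetch`, `SOFT_404_PHRASES`, and `classify` are all assumptions made for illustration, and a real study would likely use more robust soft-404 detection than phrase matching.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical phrases that suggest an error page served with HTTP 200.
SOFT_404_PHRASES = ("page not found", "no longer available", "does not exist")

# Assumed grouping: status codes that often indicate a temporary problem.
TRANSIENT_STATUSES = {429, 502, 503, 504}


@dataclass
class Fetch:
    """Result of requesting one URL; status is None if the connection failed."""
    status: Optional[int]
    body: str = ""


def classify(fetch: Fetch) -> str:
    """Sort a fetch result into ok / broken / soft404 / transient / other."""
    if fetch.status is None or fetch.status in TRANSIENT_STATUSES:
        return "transient"  # may succeed if retried later
    if fetch.status in (404, 410):
        return "broken"  # the server reports the resource as missing
    if fetch.status == 200:
        text = fetch.body.lower()
        if any(phrase in text for phrase in SOFT_404_PHRASES):
            return "soft404"  # "success" status, but error-page content
        return "ok"
    return "other"  # redirects and remaining codes need separate handling


if __name__ == "__main__":
    samples = [Fetch(200, "<h1>Page Not Found</h1>"), Fetch(404), Fetch(None)]
    print([classify(s) for s in samples])
```

Separating transient errors from hard failures matters for a study of this kind because only the latter should count towards link rot; URLs in the transient group would typically be re-checked before being reported as missing.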



11:30am - 11:50am

Public policies for governmental web archiving in Brazil

Jonas Ferrigolo Melo1, Moisés Rockembach2

1University of Porto, Portugal; 2Federal University of Rio Grande do Sul, Brazil

Scientific, cultural, and intellectual relevance of web archiving has been widely recognized since the 1990s. The preservation of the web has been appreciated in several studies ranging from its specific theories and practices, such as its methodological approaches, specific ethical aspects of preserving web pages, to subjects that permeate the Digital Humanities and their uses as a primary source.

This study aims to identify the documents and actions that are related to the development of the web archive policy in Brazil. The methodology used was bibliographic and documental research, using literature on government web archiving, and legislation regarding public policies.

Brazil has a variety of technical resources and legislation that address the need to preserve government documents; however, websites have not yet been included in the records management practices of Brazilian institutions. Until recently, the country did not have a website preservation policy. However, there are currently two government actions under development.

A bill that has been under consideration in the National Congress since July 2015 addresses institutional digital public heritage on the web. It has been with the Constitution and Justice and Citizenship Commission (CCJC) of the Brazilian National Congress since December 2022.

Another action comes from the National Council of Archives – Brazil (CONARQ), which established a technical chamber to define guidelines for the elaboration of studies, proposals, and solutions for the preservation of websites and social media. Based on its general goals, the technical chamber has produced two documents: (i) the Website and Social Media Preservation Policy; and, (ii) the recommendation of basic elements for websites and social media’s digital preservation. The documents were approved in December 2022 and will be published as a federal resolution.

The actions identified show that efforts are under way in Brazil for the state to take a proactive role in promoting and leading this technological innovation. The definition of a web archiving policy, as well as the requirements for the selection of preservation and archiving methods, technologies, and contents that will be archived, can already be considered a reality in Brazil.

 
1:00pm - 2:00pmSES-14 (PANEL): INCLUSIVE REPRESENTATION AND PRACTICES IN WEB ARCHIVING
Location: Theatre 1
Session Chair: Daniel Steinmeier, KB National Library of the Netherlands
 

Renewal in Web Archiving: Towards More Inclusive Representation and Practices

Makiba Foster1, Bergis Jules2, Zakiya Collier3

1The College of Wooster; 2Archiving The Black Web; 3Shift Collective

“The future is already here, it's just not very equally distributed, yet” - William Gibson
In this session you will learn about a growing community of practice of independent yet interconnected projects whose work converges as an intervention to critically engage the practice of web archiving to be more inclusive in terms of what gets web archived and who gets to build web archives. These projects reimagine a future for web archiving that distributes the practice and diversifies the collections.

Presentation 1- Archiving The Black Web

Author/Presenter: Makiba Foster, The College of Wooster and Bergis Jules, Archiving the Black Web

Abstract: Unactualized web archiving opportunities for Black knowledge collecting institutions interested in documenting web-based Black history and culture have reached critical levels due to the expansive growth of content produced about the Black experience by Black digital creators. Archiving The Black Web (ATBW) works to establish more equitable, accessible, and inclusive web archiving practices to diversify not only collection practices but also its practitioners. Founded in 2019, ATBW's creators will discuss the collaborative catalyst for the creation and launch of this important DEI initiative within web archiving. In this panel session, attendees will learn more about ATBW’s mission to address web archiving disparities. ATBW envisions a future that includes cultivating a community of practice for Black collecting institutions, developing training opportunities to diversify the practice of web archiving, and expanding the scope of web archives to include culturally relevant web content.

Presentation 2 - Schomburg Syllabus

Author/Presenter: Zakiya Collier, Shift Collective

Abstract: From 2017-2019 the Schomburg Center for Research in Black Culture participated in the Internet Archive’s Community Webs program, becoming the first Black collecting institution to create a web archiving program centering web-based Black history and culture. Recognizing that content in crowdsourced hashtag syllabi could be lost to the ephemerality of the Web, the #HashtagSyllabusMovement collection was created to archive online educational material related to publicly produced, crowdsourced content highlighting race, police violence, and other social justice issues within the Black community. Both the first of its kind in focus and within The New York Public Library system, the Schomburg Center’s web archiving program faced challenges including but not limited to identifying ways to introduce the concept of web archiving to Schomburg Center researchers and community members, demonstrating the necessity of a web supported web archiving program to Library administration, and expressing the urgency needed in centering Black content on the web that may be especially ephemeral like those associated with struggles for social justice. It was necessary for the Schomburg Center to not only continue their web archiving efforts with the #Syllabus and other web archive collections, but also develop strategies to invoke the same sense of urgency and value for Black web archive collections that we now see demonstrated in the collection of analog records documenting Black history, culture and activism— especially as social justice organizing efforts increasingly have online components.

As a result, the #SchomburgSyllabus project was developed to merge web-archives and analog resources from the Schomburg Center in celebration of Black people's longstanding self-organized educational efforts. #SchomburgSyllabus uniquely organizes primary and secondary sources into a 27-themed web-based resource guide that can be used for classroom curriculum, collective study, self-directed education, and social media and internet research. Tethering web-archived resources to the Schomburg Center’s world-renowned physical collections of Black diasporic history has proven key in garnering support for the Schomburg’s web archiving program and enthusiasm for the preservation of the Black web as demonstrated by the #SchomburgSyllabus’ use in classrooms, inclusion in journal articles, and features in cultural/educational TV programs.

 
2:20pm - 3:50pmSES-16: PRESERVATION & COMPLEX DIGITAL PUBLICATIONS
Location: Theatre 1
Session Chair: Kiki Lennaerts, Sound & Vision
These presentations will be followed by a 10 min Q&A.
 
2:20pm - 2:40pm

Preservability and Preservation of Digital Scholarly Editions

Michael Kurzmeier1, James O'Sullivan1, Mike Pidd2, Orla Murphy1, Bridgette Wessels3

1University College Cork, Ireland; 2University of Sheffield; 3University of Glasgow

Digital Scholarly Editions (DSEs) are web resources, and thus subject to data loss. While DSEs are usually the result of funded research, their longevity and preservation are uncertain. DSEs might be partially or completely captured during web archiving crawls, in some cases making web archives the only remaining publicly available source of information about a DSE. Patrick Sahle’s Catalogue of DSEs (2020) lists ~800 URLs referring to DSEs, of which 46 refer to the Internet Archive. This shows the overlap between DSEs and web archives and highlights the need for a closer look at the longevity and archiving of these important resources. This presentation will introduce a recent study on the availability and longevity of DSEs and introduce different preservation models and examples specific to DSEs. Examples of lost and partially preserved editions will be used to illustrate the problem of preservation and preservability of DSEs. This presentation will also outline the specific challenges of archiving DSEs.

The C21 Editions project is a three-year international research collaboration researching the state of the art and the future of DSEs. As part of the project output, this presentation will introduce the main data sources on DSEs and demonstrate the workflow to assess DSE availability over time. It will illustrate the role web archives play in the preservation of DSEs as well as highlight specific challenges DSEs present to web archiving. As DSEs are complex projects, featuring multiple layers of data, transcription and annotation, their full preservation usually includes ongoing maintenance of the often custom-built backend system. Once project funding ends, these structures are very prone to deterioration and loss. Besides ongoing maintenance, other preservation models exist, generally reducing the archiving scope in order to reduce the ongoing work required (Dillen 2019; Pierazzo 2019; Sahle and Kronenwett 2016). Such editions using compatible rather than bespoke solutions are more likely to be fully preserved. Other approaches include a “preservability by design” approach through minimal computing (Elwert n.d.) or standardization through existing services such as DARIAH or GitHub. The presentation will outline these models using examples of successful preservation as well as lost editions.

This presentation is part of the larger C21 Editions project, a three-year international collaboration jointly funded by the Arts & Humanities Research Council (AH/W001489/1) and Irish Research Council (IRC/W001489/1).



2:40pm - 3:00pm

Collecting and presenting complex digital publications

Ian Cooke, Giulia Carla Rossi

The British Library, United Kingdom

'Emerging Formats' is a term that is used by UK legal deposit libraries to describe experimental and innovative digital publications, for which there are no collection management solutions that can operate at scale. They are important to the libraries, and their users, as they document a period of creativity and rapid change, and often include authors and experiences that are less well represented in more mainstream publications, and are at high risk of loss. For six years, the UK legal deposit libraries have been working collaboratively and experimentally to both survey the types of publications, and to test approaches to collection that will support preservation, discovery and access. An important concept in this work has been 'contextual collecting', that seeks to preserve the best possible archival instance of a work, alongside information that documents how a work was created, and how it was experienced by users.

Web archiving has formed an important part of this work, both in providing practical tools to support collection management, including access, and also in supporting the collection of contextual information. An example of this can be seen in the New Media Writing Prize thematic collection https://www.webarchive.org.uk/en/ukwa/collection/2912

In this presentation, we will step back from specific examples, and talk about what we have learned so far from our work as a whole. We will outline how this work, including user research and engagement, has shaped policy at the British Library, through the creation of our 'Content Development Plan' for Emerging Formats, and the role of web archiving within that plan.

This presentation contributes to the Collections themes of 'blurring the boundaries between web archives and other born digital collections' and 'reuse of web archived materials for other born digital collections'. It builds on previous presentations to the Web Archiving Conference, which have focused on specific challenges related to collecting complex digital publications, to demonstrate how this research has informed the policy direction at the British Library and how web archiving infrastructure will be built into efforts to collect, assess and make accessible new publications.



3:00pm - 3:20pm

What can web archiving history tell us about preservation risks?

Susanne van den Eijkel, Daniel Steinmeier

KB, National Library of the Netherlands

When people talk about the necessity of preservation, the first thing that comes to mind is the supposed risk of file format obsolescence. Within the preservation community there have been voices raising the concern that this might not be the most pressing risk. If we are actually solving the wrong problem, this means we neglect the real problem. Therefore, it is important to know that the solutions we create are solving demonstrably real problems. Web archiving could be a great source of information for researching the most urgent risks, because developments and standards on the web are very fluid. There are examples of file formats on the web, such as Flash, that are not supported anymore by modern browsers. However, these formats can still be rendered using widely available software. We have also seen that website owners migrated their content from Flash to HTML5. So, can we really say that obsolescence has resulted in loss of data? How can we find out more about this? And more importantly, can we find out which risks are actually more relevant?

At the National Library of the Netherlands, we have been working on building a web collection since 2007. By looking at a few historical webpages we will illustrate where to look for answers and how to formulate better preservation risks using source data and context information. At iPRES 2022 we presented a short paper on the importance of context information for web collections. This information helps us in understanding the scope and the creation process of the archived website. In this presentation, we will demonstrate how we use this context information to search out sustainability risks for web collections. This will also give us insight into sustainability risks in general so we can create better informed preservation strategies.



3:20pm - 3:40pm

Towards an effective long-term preservation of the web. The case of the Publications Office of the EU

Corinne Frappart

Publications Office of the European Union, Luxembourg

Much is being written about web archiving in general: new and improved methods to capture the World Wide Web and to facilitate access to the resulting archives are constantly being described and shared. But when it comes to the long-term preservation of websites, i.e. safeguarding the ARC/WARC files with proper planning of preservation actions beyond simple bit preservation, the literature is much less abundant.

The Publications Office of the EU is responsible for the preservation of the websites authored by the EU institutions. In addition to our activities in harvesting and making accessible the content through our public web archive (https://op.europa.eu/en/web/euwebarchive), we started to delve more deeply into the management of content preserved for the long term.

Our reflection focused on long-term risks such as obsolescence or loss of file usability, and on the availability of a disaster recovery mechanism for the platform providing access to the web archive. Ingesting web archive files into a long-term preservation system raises many questions:

  • Should we expect different difficulties with ARC and WARC files? Is it worth migrating the ARC files to WARC files, and having a consistent collection on which the same tools can be applied?
  • Does ARC/WARC file compression affect storage, processing time, and preservation actions?
  • What is the best granularity for the preservation of web archives?
  • Should the characterization of the numerous files embedded in ARC/WARC files occur during or after ingestion? With what impact on the preservation actions?
  • How can descriptive, technical and provenance metadata be enriched, possibly automatically, and where can they be stored?
  • What kind of information about the context of the crawls, the format description and the data structure should also be preserved to help future users understand the content of the ARC/WARC files?

To get some advice about all these questions and others, the Publications Office commissioned a study of published and grey literature, supplemented by a series of interviews with leading institutions in the field of web archiving. This paper presents the findings and offers recommendations on how to answer the questions above.

 
4:00pm - 5:00pmKEYNOTE: Marleen Stikker. Introduced and chaired by Martijn Kleppe, KB
Location: Theatre 1
Session Chair: Martijn Kleppe, KB, National Library of the Netherlands
5:00pm - 5:15pmCLOSING REMARKS: Jeffrey van der Hoeven, KB, National Library of the Netherlands
Location: Theatre 1
Session Chair: Jeffrey van der Hoeven, KB, National Library of the Netherlands

 
Conference: IIPC WAC 2023