Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only the sessions on that day or at that location, or select a single session for a detailed view (with abstracts and downloads, if available). To see only the sessions of the Online Day on 3 May, select "Online" as the location.

Please note that all times are shown in the time zone of the conference (CEST).

 
Session Overview
Date: Wednesday, 03/May/2023
1:00pm - 1:15pm WELCOME & HOUSEKEEPING
Virtual location: Online
1:15pm - 2:00pm OL-SES-01: Q&A: RECONSTRUCTIONS IN NATIONAL DOMAINS: HISTORY, COLLECTIONS & CORPORA
Virtual location: Online
Session Chair: Susanne van den Eijkel, KB, National Library of the Netherlands
Session Chair: Sophie Ham, Koninklijke Bibliotheek
 

Uncovering the (paper) traces of the early Belgian web

Bas Vercruysse1, Julie Birkholz1,2, Friedel Geeraert2

1Ghent University, Belgium; 2KBR (Royal Library of Belgium)

The Belgian web began in June 1988 when EARN and Eunet introduced the .be domain. In December 1993 the first .be domain names were registered and in 1994 there were a total of 129 registered .be names.[1] Documentation about the early Belgian web is scarce and the topic has not yet been researched in depth. In other European countries such as France and The Netherlands, specific projects have been set up to document the early national web. [2] This study of the early Belgian web therefore helps to complete the history of the early web in Europe and understand the specific dynamics that led to the emergence of the web in Belgium.

Records of the early Belgian web include: published lists of domain names of interest to Belgians held in the collections of KBR (e.g. publications such as the Belgian Web Directory, published from 1997 to 1998, and the Web Directory, published from 1998 to 2000), archived early Belgian websites preserved in the Wayback Machine since 1996, and the archives of organisations such as DNS Belgium (the registry for the .be, .brussels and .vlaanderen domains).

This archival information provides a slice of the information needed to understand the emergence of the early web in Belgium, yet it is clear that the social actors who played key roles in developing the Belgian web are not always recorded in the few archival records that remain. By combining these “paper traces” of the early Belgian web with semi-structured interviews with key actors in Belgium (e.g. long-time employees of DNS Belgium, instigators of the .be domain name, and the first users of and researchers on the web), we are able to reconstruct the history of the start of the web in Belgium.

In this presentation, we will report on this research, which stitches together the first traces of the early Belgian web.

[1] DNS Belgium. (2019). De historiek van DNS Belgium. Available online at: https://www.dnsbelgium.be/nl/over-dns-belgium/de-historiek-van-dns-belgium.

[2] De Bode, P., Teszelszky, K. (2018). Web collection internet archaeology Euronet-Internet (1994-2017). Available online at: https://lab.kb.nl/dataset/web-collection-internet-archaeology-euronet-internet-1994-2017; Bibliothèque nationale de France. (2018). Web90 - Patrimoine, Mémoires et Histoire du Web dans les années 1990. Available online at: https://web90.hypotheses.org/tag/bnf.



The Lifranum research project: building a collection on French-speaking literature

Christian Cote2, Alexandre Faye1, Christine Genin1, Kevin Locoh-Donou1

1French national library, France; 2University of Lyon 3 (Jean Moulin), France

Many amateur and professional writers have taken to the web since its very beginning to share their writings and personal diaries and to engage in the first forums. These practices increased with the rise of blogging platforms in the 2000s. Authors have used the possibilities of hypertext links to develop a new digital sociability and a common transnational creative network.

The Lifranum research project brings together researchers from several disciplines. Its objective is to provide an original platform presenting a thematic web archive as research corpora and to develop enhanced search features. The indexing scheme takes into account advances in automatic style analysis. In this context, researchers and librarians have defined complementary needs regarding the web archive collection to be built and have tested new methods to design the corpora and carry out the crawls.

During this presentation, we will share the challenges we encountered and the experience we gained while building this large thematic corpus, from the selection phase to the crawl processes. The following aspects will be discussed:

- text indexing and text analysis issues;

- methods for building large thematic corpora using Hyphe, a tool developed by SciencesPo for exploring the web, building corpora, analyzing links between websites and adding annotations;

- managing quantity and quality on blogging platforms;

- documenting the choices made when processing the data.

The presentation will also compare web archive logics with scientific approaches focused on a specific type of data (text, image, video) that is exposed through APIs and easier to analyze. We will question the contributions and limits of this type of collection, launched in partnership within the framework of a research project, which enriches the archives through more methodical exploration of the web, anticipated quality controls and the production of reusable documentation.

The video will be available in French with English subtitles.



Developing a Reborn Digital Archival Edition as an Approach for the Collection, Organisation, and Analysis of Web Archive Sources

Sharon Healy1, Juan-José Boté-Vericad2, Helena Byrne3

1Maynooth University; 2Universitat de Barcelona; 3British Library

In this presentation, we explore the development of a reborn digital archival edition (RDAE) as a hybrid approach for the collection, organisation, and analysis of reborn digital materials (Brügger, 2018; 2016) that are accessible through public web archives. Brügger (2016) describes reborn digital media as media that has been collected and preserved and has undergone a change due to this process, such as emulations of computer games or materials in a web archive. Further to this, we explore the potential of an RDAE as a method to enable the sharing and reuse of such data. As part of this, we use a case study of the press/media statements of the Irish politician, poet, and sociologist Michael D. Higgins from 2002-2011. For the most part, these press statements were once available on the website of Michael D. Higgins, who has served as Irish President since 2011. Higgins's website disappeared from the live web sometime after the 2011 Presidential Election took place (27 October 2011) and sometime before Higgins was inaugurated (11 November 2011). Using the NLI Web Archive (National Library of Ireland) and the Wayback Machine (Internet Archive), this project sought to find and collect traces of these press statements and bring them together as an RDAE. In doing so, we use the Zotero open-source citation management software for collecting, organising, and analysing the data (archived web pages). We extract the text and use screenshot software to capture an image of each archived web page. Thereafter, we utilise the Omeka open-source software as a platform for presenting the data (screenshot/metadata/transcription) as a curated thematic collection of reborn digital materials, offering search and discovery functions through free-text search, metadata fields and subject headings. To end, we use the DROID open-source software for organising the data for long-term preservation, and the Open Science Framework as a platform for sharing derivative materials and datasets.

References:

Brügger, N. (2016). Digital Humanities in the 21st Century: Digital Material as a Driving Force. Digital Humanities Quarterly, 10(3). Retrieved from http://www.digitalhumanities.org/dhq/vol/10/3/000256/000256.html

Brügger, N. (2018). The Archived Web: Doing History in the Digital Age. The MIT Press.
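As a rough illustration of the most scriptable step of the workflow described above (locating archived copies and extracting their text before curation in Zotero and Omeka), the sketch below queries the Internet Archive's CDX API and pulls plain text from a few captures. It is a minimal sketch, not the authors' code: the seed URL is a placeholder, and the libraries used (requests, BeautifulSoup) are assumptions rather than the project's actual tooling.

    # Minimal sketch, not the authors' code: list Wayback Machine captures of a
    # placeholder seed URL and extract the text of the earliest ones.
    import requests
    from bs4 import BeautifulSoup

    SEED = "http://example.org/press-statements/"        # placeholder seed URL
    CDX = "http://web.archive.org/cdx/search/cdx"

    params = {"url": SEED, "output": "json",
              "fl": "timestamp,original", "filter": "statuscode:200"}
    rows = requests.get(CDX, params=params, timeout=30).json()
    captures = rows[1:]                                   # first row is the field header

    for timestamp, original in captures[:3]:
        # The "id_" flag returns the capture without the Wayback toolbar/rewriting.
        snapshot = f"http://web.archive.org/web/{timestamp}id_/{original}"
        html = requests.get(snapshot, timeout=30).text
        text = BeautifulSoup(html, "html.parser").get_text("\n", strip=True)
        print(timestamp, text[:200])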

 
2:00pm - 2:15pm OL-SES-02: Q&A: BARRIERS TO WEB ARCHIVING IN LATIN AMERICA
Virtual location: Online
Session Chair: Eilidh MacGlone, National Library of Scotland
 

Web Archiving en español: Barriers to Accessing and Using Web Archives in Latin America

Alan Colin-Arce1, Sylvia Fernández-Quintanilla2, Rosario Rogel-Salazar1, Verónica Benítez-Pérez1, Abraham García-Monroy1

1Universidad Autónoma del Estado de México, Mexico; 2University of Texas at San Antonio

Web archives have been growing in popularity in Global North countries as a way of preserving a part of their political, cultural, and social life carried out online. However, their spread to other regions has been slower because of several technical, economic, and social barriers. In this presentation, we will discuss the main limitations in the uptake of web archiving in Spanish-speaking Latin American countries and the implications for the access and use of web archives.

The first barrier to web archiving in these countries is the lack of awareness of web archives among librarians and archivists. According to Scopus data from 2022, out of 909 documents with the words “web archiv*” in the title, abstract, or keywords, only 10 papers are from Spanish-speaking Latin American countries. In worldwide web archiving surveys, there are no initiatives from Latin America yet (D. Gomes et al., 2011; P. Gomes, 2020), and we could only identify 5 Latin American institutions on Archive-It, none of which had active public collections in 2022.

Another barrier is the cost of web archiving services like Archive-It, which can be unaffordable for many institutions in the region. Even if institutions can afford these services or use free tools, most web archiving software is available only in English and does not smoothly support multilingual collections or collections in languages other than English (for example, the default metadata fields for collections and seeds are in English, and adding them in other languages is not straightforward).

This unequal access to web archives between the Global North and South comes with the risk that Global North websites get preserved, organized, accessed, and used, while Latin America and other regions continue depending solely on third parties like the Internet Archive to preserve their websites.

A possible solution for raising awareness of web archives is developing workshops as well as mentorship programs for Latin American librarians and digital humanists looking to start with web archiving. For the linguistic barrier, translating the documentation of web archiving tools to other languages can be a first step to encourage their use in Latin America.

References

Gomes, D., Miranda, J., & Costa, M. (2011). A Survey on Web Archiving Initiatives. In S. Gradmann, F. Borri, C. Meghini, & H. Schuldt (Eds.), Research and Advanced Technology for Digital Libraries (pp. 408–420). Springer. https://doi.org/10.1007/978-3-642-24469-8_41

Gomes, P. (2020). Map of Web archiving initiatives. Own work. https://commons.wikimedia.org/wiki/File:Map_of_Web_archiving_initiatives_58632FRT.jpg

 
2:15pm - 2:30pm OL BREAK
Virtual location: Online
2:30pm - 3:00pm OL-SES-03: Q&A: RESEARCHING WEB ARCHIVES
Virtual location: Online
Session Chair: Ben Els, National Library of Luxembourg
 

All Our Yesterdays: A toolkit to explore web archives in Colab

Tim Ribaric, Sam Langdon

Brock University, Canada

The rise of Jupyter notebooks, and particularly Google Colab, has created an easy-to-use and accessible platform for those interested in exploring computational methods. This is especially the case when performing research with web archives. However, the question remains: how to start? In particular, for those without an extensive background in programming, this might seem an insurmountable challenge. Enter the All Our Yesterdays Toolkit (AOY-TK). This suite of notebooks and associated code provides a scaffolded introduction to opening, analyzing, and generating insights from web archives. With tight integration with Google Drive and text analysis tools, it provides a comprehensive answer to that very question of how to start. Development of AOY-TK is made possible by grant funding, and this session will discuss progress to date and provide some brief case-study examples of the types of analysis possible using the toolkit.



Using Web Archives to Model Academic Migration and Identify Brain Drain

Mat Kelly, Deanna Zarrillo, Erjia Yan

Drexel University, United States of America

Academic faculty members may change their institutional affiliation over the course of their career. In the case of Historically Black Colleges and Universities (HBCUs) in the United States, which make substantial contributions to the preparation of Black professionals, keeping the most talented Black students and faculty from moving to non-HBCUs (thus preventing “brain drain”) is often a losing battle. This project seeks to investigate the effects of academic mobility at the institutional and individual level, measuring the potential brain drain from HBCUs. To accomplish this, we consult web archives to identify captures of academic institutions and their departments in the past and to extract faculty names, titles, and affiliations at various points in time. By analyzing the HBCUs' lists of faculty over time, we will be able to model academic migration and quantify the degree of brain drain.

This NSF-sponsored project is in the early stages of execution and is a collaboration between Drexel University, Howard University, University of Tennessee - Knoxville, and University of Wisconsin - Madison. We are currently in the data collection stage, which entails leveraging an open-source Memento aggregator to consult international web archives in order to improve the quality and quantity of captures of past versions of HBCU sites. In this initial stage, we have encountered caveats in the process of efficient extraction, established a systematic methodology for utilizing this approach beyond our initial use cases, and identified potential ethical dilemmas of individuals' information from the past being uncovered and highlighted without their explicit consent. During the first year of the project, we have refined our approach to facilitate better data quality for subsequent steps in the process and to emphasize recall. This presentation will describe some of these nuances of our collaborative project as well as highlight the next steps for identifying brain drain from HBCUs by utilizing web archives.
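To give a sense of what this data collection step can look like in practice, the sketch below lists the archived captures (mementos) of a hypothetical department page using the Memento TimeMap link format. The Internet Archive endpoint shown is real, and a Memento aggregator such as MemGator returns the same format combined across many archives; the target URL and the parsing details are illustrative assumptions, not the project's actual pipeline.

    # Illustrative sketch: enumerate mementos of a placeholder faculty page via a
    # Memento TimeMap (application/link-format), as used in the data collection stage.
    import re
    import requests

    TARGET = "http://example.edu/chemistry/faculty.html"   # placeholder, not a real HBCU page
    TIMEMAP = f"http://web.archive.org/web/timemap/link/{TARGET}"

    for line in requests.get(TIMEMAP, timeout=30).text.splitlines():
        # Memento entries carry rel="memento" (or "first/last memento") and a datetime.
        if "memento" in line and 'datetime="' in line:
            uri = re.search(r"<([^>]+)>", line).group(1)
            when = re.search(r'datetime="([^"]+)"', line).group(1)
            print(when, uri)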

 
3:00pm - 3:05pm OL BREAK
Virtual location: Online
3:05pm - 3:35pm OL-SES-04: Q&A: SAMPLING THE HISTORICAL WEB & TEMPORAL RESILIENCE OF WEB PAGES
Virtual location: Online
Session Chair: Laura Wrubel, Stanford University
 

Lessons Learned From the Longitudinal Sampling of a Large Web Archive

Kritika Garg1, Sawood Alam2, Michele Weigle1, Michael Nelson1, Corentin Barreau2, Mark Graham2, Dietrich Ayala3

1Old Dominion University, Norfolk, Virginia - USA; 2Internet Archive, San Francisco, California - USA; 3Protocol Labs, San Francisco, California - USA

We document the strategies and lessons learned from sampling the web by collecting 27.3 million URLs with 3.8 billion archived pages spanning 26 years of the Internet Archive's holdings (1996–2021). Our overall project goal is to revisit fundamental questions regarding the size, nature, and prevalence of the publicly archivable web, and in particular, to reconsider the question, "how long does a web page last?" Addressing this question requires obtaining a "representative sample of the web." We proposed several orthogonal dimensions to sample URLs using the archived web: time of the first archive, HTML vs. MIME types, URL depth (top-level pages vs. deep links), and TLD. We sampled 285 million URLs from IA's ZipNum index file that contains every 6000th line of the CDX index. These include URLs of embedded resources, such as images, CSS, and JavaScript. To limit our samples to web pages, we filtered the URLs for likely HTML pages (based on filename extensions). We determined the time of the first archive and MIME type using IA's CDX API. We grouped the 92 million URLs with "text/html" MIME types based on the year of the first archive. Archiving speed and capacity have significantly increased, so we found fewer URLs archived in the early years than in later years. Hence, we adjusted our goal of 1 million URLs per year and clustered the early years (1996-2000) to reach that size (1.2 million URLs). We noticed an increase in deep links archived over the years. We extracted the top-level URLs from the deep links to upsample the earlier years. We found that popular domains like Yahoo and Twitter were over-represented in the IA. We performed logarithmic-scale downsampling based on the number of URLs sharing a domain. Given the collection size, we employed various sampling strategies to ensure fairness in the domain and temporal representations. Our final dataset contains TimeMaps of 27.3 million URLs comprising 3.8 billion archived pages from 1996 to 2021. We convey the lessons learned from sampling the archived web, which could inform other studies that sample from web archives.
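The filtering and first-capture lookup described above can be sketched briefly. In the sketch below, the extension list and field choices are assumptions made for illustration rather than the project's exact rules, though the CDX API parameters used (limit, fl, output) are real.

    # Sketch only: keep likely HTML pages by filename extension, then ask the
    # Internet Archive CDX API for each URL's first capture time and MIME type.
    import requests
    from urllib.parse import urlsplit

    LIKELY_HTML = {"", ".html", ".htm", ".shtml", ".php", ".asp", ".aspx"}   # assumed list

    def looks_like_html(url: str) -> bool:
        path = urlsplit(url).path
        if path == "" or path.endswith("/"):
            return True                       # directory-style URLs are treated as pages
        name = path.rsplit("/", 1)[-1]
        ext = "." + name.rsplit(".", 1)[-1].lower() if "." in name else ""
        return ext in LIKELY_HTML

    def first_capture(url: str):
        # limit=1 returns the earliest capture; fl selects just the fields we need.
        params = {"url": url, "limit": 1, "fl": "timestamp,mimetype", "output": "json"}
        rows = requests.get("http://web.archive.org/cdx/search/cdx",
                            params=params, timeout=30).json()
        return tuple(rows[1]) if len(rows) > 1 else None     # rows[0] is the header

    for url in ["http://example.com/", "http://example.com/images/logo.gif"]:
        if looks_like_html(url):
            print(url, first_capture(url))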



TrendMachine: Temporal Resilience of Web Pages

Sawood Alam1, Mark Graham1, Kritika Garg2, Michele Weigle2, Michael Nelson2, Dietrich Ayala3

1Internet Archive, San Francisco, California - USA; 2Old Dominion University, Norfolk, Virginia - USA; 3Protocol Labs, San Francisco, California - USA

"How long does a web page last?" is commonly answered with "40 to 100 days", with sources dating back to the late 1990s. The web has since evolved from mostly static pages to dynamically generated pages that rely heavily on client-side scripts and user-contributed content. Before we revisit this question, there are additional questions to explore. For example, is it fair to call a page dead if it returns a 404, versus one whose domain name no longer resolves? Is a web page alive if it returns content but has drifted away from its original topic? How should we assess the lifespan of pages from the perspective of fixity, across a spectrum ranging from content-addressable pages to tweets, home pages of news websites, weather report pages, push notifications, and streaming media? To quantify the resilience of a page, we developed a mathematical model that calculates a normalized score as time-series data based on the archived versions of the page. It uses sigmoid functions to increase or decrease the score slowly on the first few observations of the same class; the score changes significantly if the observations remain consistent over time, and there are tunable parameters for each class of observation (e.g., HTTP status codes, no archival activity, and content fixity). Our model has many potential applications, such as identifying points of interest in the TimeMap of densely archived web resources, identifying dead links (in wiki pages or any other website) that can be replaced with archived copies, and aggregated analysis of sections of large websites. We implemented an open-source interactive tool [1] powered by this model to analyze URIs against any CDX data source. Our tool has given interesting insights on various sites, such as the day when "cs.odu.edu" was configured to redirect to "odu.edu/compsci", the two and a half years during which "example.com" was redirected to "iana.org", the time when ODU's website had downtime due to a cyber attack, or the year when Hampton Public Library's domain name was drop-catched to host a fake NSFW store.

[1] https://github.com/internetarchive/webpage_resilience
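To make the idea concrete, here is a toy, sigmoid-damped score update in the spirit of the model described above. The actual formula, observation classes, and parameters are those in the linked TrendMachine repository, so treat this purely as an illustration under assumed values.

    # Toy sketch (not the TrendMachine formula): a normalized resilience score in
    # [0, 1] that moves slowly on the first observations of a class and more
    # strongly as the same class of observation repeats.
    import math

    def sigmoid(x: float) -> float:
        return 1.0 / (1.0 + math.exp(-x))

    def update(score: float, streak: int, good: bool, rate: float = 0.25) -> float:
        step = sigmoid(rate * streak) - 0.5          # ~0 at first, grows with consistency
        score = score + step if good else score - step
        return min(1.0, max(0.0, score))             # keep the score normalized

    # Example: five consistent "200 OK" captures, then five consecutive 404s.
    score, streak, prev = 0.5, 0, None
    for status in ["200"] * 5 + ["404"] * 5:
        streak = streak + 1 if status == prev else 1
        prev = status
        score = update(score, streak, good=(status == "200"))
        print(status, round(score, 3))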

 
3:35pm - 6:00pm OL BREAK
Virtual location: Online
6:00pm - 6:05pm WELCOME & HOUSEKEEPING
Virtual location: Online
6:05pm - 6:35pm OL-SES-05: Q&A: PRESERVING SOCIAL MEDIA & VIDEO GAMES
Virtual location: Online
Session Chair: Sawood Alam, Internet Archive
 

A Gift to Another Age: Evaluating Virtual Machines for the Preservation of Video Games at MoMA

Kirk Mudle

New York University and the Museum of Modern Art in New York, United States of America

This preservation project investigates the use of virtual machines for the preservation of video games. From MoMA’s collection, Rand and Robyn Miller’s classic adventure game Myst (1993) is used as a sample record to evaluate the performance of three different virtualization options for the Mac OS 9 operating system—SheepShaver, Qemu, and Yale’s Emulation-as-a-Service-Infrastructure (EaaSI). Serving as the control for the experiment, Myst is first documented running natively on an original PowerMac G4 at MoMA. The native performance is then compared with each virtualization software. Finally, a fully configured virtual machine is packaged as a single file and tested in different contemporary computing environments. More generally, this project clarifies the risks and challenges that arise when using virtual machines for the long-term preservation of computer and software-based art.


Experiences from archiving information from social media

Magdalena Sjödahl, Stefan Jacobson

Arkiwera, Sweden

Social media has enabled an expanded dialogue in the public space. News spreads faster than ever, politicians and leaders can communicate directly with a great number of people, and everyone can become an influencer or create a public movement. For the archival institutions and governmental organisations active on different social media platforms, this creates new challenges and questions. As a consultancy firm working with digital preservation, we couldn't just ask these questions without also finding some answers. This led to the beginning of developing the system that we today call Arkiwera.

With more than 10 years' experience of web archiving, we started to look at solutions for preserving posts, including comments and reactions, from different social media platforms about 4-5 years ago. Since we couldn't find any “out of the box” solutions that we could simply apply and refer our customers to, we started to develop our own. This has been a long and interesting journey with many lessons learned that we would love to share at the conference, connecting to several of the themes presented, e.g. Research, Tools and Access.

Our lecture introduces you to the circumstances of the Swedish archival context and the choices we have made from archival, regulatory, and ethical perspectives when developing the archival platform Arkiwera – today used by a large number of organisations.

 
6:35pm - 6:40pm OL BREAK
Virtual location: Online
6:40pm - 7:10pm OL-SES-06: Q&A: COLLABORATIVE WEB ARCHIVING
Virtual location: Online
Session Chair: Lauren Ko, University of North Texas
 

Empowering Bibliographers to Build Collections: The Browsertrix Cloud Pilot at Stanford Libraries

Quinn Dombrowski, Ed Summers, Laura Wrubel, Peter Chan

Stanford University, United States of America

The purview of subject-area librarians has expanded in the 21st century from primarily focusing on books and print subscriptions to a much larger set of materials, including digital subscription packages and data sets (distributed using a variety of media, for purchase or lease). Through this process, subject-area librarians are increasingly exposed to complex issues around copyright, license terms, and privacy/ethical concerns, where both norms and laws can vary significantly among different countries and communities. While it is nearly impossible for subject-area librarians in any field to treat “data” as outside the scope of their collecting efforts in 2022, the same does not hold true for web archives. Many libraries have at least some access to web archiving tools, although this access may primarily be in the hands of a limited number of users, sometimes associated with library technical services or special collections / university archives (e.g. for institutions whose focus of web archiving is primarily their own digital resources).

In late 2022, the web archiving task force at Stanford Libraries – a cross-functional team that brought together the web archivist, technical staff, and embedded digital humanities staff – set out to shift this dynamic by empowering disciplinary librarians to add web archiving to their toolkit for building the university’s collections. By partnering with Webrecorder, Stanford Libraries set up an instance of Browsertrix Cloud, and provided access to a pilot group of bibliographers and other subject-matter experts as part of a short-term pilot. The goals of this pilot were to see how, and how much, bibliographers would engage with web archiving for collection-building if given unfettered access to easy-to-use tools. What materials would they prioritize? What challenges would they encounter? What technical (e.g. storage) and support (e.g. training, debugging, community engagement) resources would be necessary for them to be successful? This pilot was also intended to inform the strategic direction for web archiving at Stanford moving forward.

In this talk, we will briefly present how we designed the pilot, hear perspectives from bibliographers who participated, and share the pilot outcomes and future directions.



What next? An update from SUCHO

Quinn Dombrowski1, Anna Kijas2, Sebastian Majstorovic3, Ed Summers1, Andreas Segerberg4

1Stanford University, United States of America; 2Tufts University, United States of America; 3Austrian Center for Digital Humanities and Cultural Heritage, Austria; 4University of Gothenburg, Sweden

Saving Ukrainian Cultural Heritage Online (SUCHO) made headlines as an international, volunteer-run initiative archiving Ukrainian cultural heritage websites in the wake of Russia’s invasion in February 2022. Through SUCHO, over 1,500 volunteers around the world – from technologists and librarians, to retirees and children – were involved in a large-scale, rapid-response web archiving effort that developed a collection of over 5,000 websites and 50 TB of data. As a non-institutional project with the primary goal of digital repatriation, creating this collection and ensuring its security through a network of mirrors was not enough. The motivation for SUCHO was not to create a permanent archive of Ukraine that could be used as research data for scholars as the country was destroyed; instead, the hope was to hold onto the data only until the cultural heritage sector in Ukraine was ready to rebuild.

The initial web archiving phase of SUCHO's work happened between March and August 2022. The archives came from a variety of sources: created on volunteers' laptops using the command-line Browsertrix software, created with Browsertrix Cloud, or even uploaded as individual, highly interactive page archives captured with the Browsertrix Chrome plugin. In addition, while the project mostly worked from a single list of sites, the work was done in haste, and status metadata (e.g. “in progress”, “done”, “problem”) was not always accurately documented. Furthermore, while the project had full DNS records for these sites, that metadata was stored separately from the spreadsheet – as was information about site uptime and downtime over the course of the project. Creating the web archives was challenging, but it quickly became apparent that the bigger challenge would be curation.

This talk will follow up on our 2022 IIPC presentation on SUCHO, confronting the question of “What next?” for SUCHO. It will bring together a number of volunteers to discuss different facets of this curation process, including reuniting archives with different kinds of metadata, our efforts in extracting data from the archives that could be used as the foundation for rebuilding websites, and other work to curate and present what our volunteer community accomplished.

 
7:10pm - 7:25pm OL BREAK
Virtual location: Online
7:25pm - 7:55pm OL-SES-07: Q&A: LEGAL & ETHICAL CONSIDERATIONS
Virtual location: Online
Session Chair: Tom Smyth, Library and Archives Canada
 

Querying Queer Web Archives

Di Yoong1, Filipa Calado1, Corey Clawson2

1The Graduate Center, CUNY, USA; 2Rutgers University, USA

Our paper explores the intersections of querying and queerness as they interact with and are informed by web spaces and their development across time. Working with hundreds of gigabytes of web archival records on queer and queer-ish online spaces, we are developing new methods for search and discovery, as well as for the ethical access and use of web archives. This paper reflects on our process of pursuing methodologies that accommodate diverse perspectives for querying web-based datasets and embrace the qualities of play and pliancy to respond to a host of research questions and investments.

For example, one central concern explores ethical methods for cleaning web archival data to maintain privacy and anonymity. While queer spaces have historically existed in the margins, confidential information is easily shared and retained in the process of collecting data. Given that we are looking into queer spaces across roughly 30 years, we are also mindful of the ethical considerations for privacy and anonymity in two respects: first, the sense of anonymity, which has shifted since the early days of the internet; and second, the uses of collected sites in repositories. For example, in 1995 only 0.4% of the world's population had access to the internet (Mendel, 2012), compared to 60% in 2020 (The World Bank, n.d.). The sense of anonymity and the smaller internet community meant that users were likely to share more private information than they might share today. Our research therefore has to consider how to remove private information at scale using tools such as bulk_extractor (Garfinkel, 2013) and bulk_reviewer (Walsh & Baggett, 2019). In addition, we also work with repositories of archived websites whose original collection was obtained through informed consent. This means that while we may have the ability to access the collection, ethical secondary use requires additional consideration. Given the small size of the collection, we have been able to reach out to the original creators, but this approach will need to be reconsidered for larger collections.
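As a simplified illustration of the kind of feature scanning the project delegates to bulk_extractor and bulk_reviewer, the toy sketch below flags and redacts e-mail addresses in text already extracted from archived pages. The real tools operate on disk images and raw streams and recognise many more feature types, so this only gestures at the idea; the sample text is invented.

    # Toy sketch only: redact e-mail addresses from extracted page text before
    # secondary use; bulk_extractor/bulk_reviewer do this (and much more) at scale.
    import re

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def redact(text: str):
        hits = EMAIL.findall(text)
        return EMAIL.sub("[redacted e-mail]", text), hits

    sample = "Contact me at someone@example.org for the zine swap."   # made-up sample text
    clean, hits = redact(sample)
    print(clean)    # Contact me at [redacted e-mail] for the zine swap.
    print(hits)     # ['someone@example.org']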



Beyond the Affidavit: Towards Better Standards for Web Archive Evidence

Nicholas Taylor

nullhandle.org

The Internet Archive (IA) standard legal affidavit is used in litigation both frequently and reliably for the authentication and admission of evidence from the IA Wayback Machine (IAWM). While the affidavit has enabled the regular and relatively confident application of IAWM evidence by the legal community, that community's understanding of the contingencies of web archives - including qualifications to which the affidavit itself calls attention - is limited.

The tendency to conflate IA's attestation as to the authenticity of IAWM /records/ with the authenticity of /historical webpages/ will eventually have material consequences in litigation, which we may reasonably suppose will undermine confidence in the trustworthiness of web archives generally, and to a greater extent than is likely merited. The ever-increasing complexity of the web and the unfortunately growing investment in disinformation only increase the probability that this will happen sooner rather than later.

In response to the looming (or present, but as yet undiscovered) threat to the current IA affidavit-favored regime for authentication of IAWM evidence, the web archiving community would do well to champion better, more institutionally-agnostic standards for evaluating and affirming the authenticity of archived web content. Some modest efforts have been made on this front, and there are a few places we can consult for tacitly indicated frameworks. Collectively, these include judicial precedents, e-discovery community guidance, and the marketing of services by commercial archiving companies. I would argue that these do not get us far enough, though.

To that end, I would like to elaborate a more expansive set of criteria that could serve as a basis for the authenticity of web archives for evidentiary purposes. Some of these traits are foundational to web archiving in the main, and help to distinguish web archives from other forms of web content capture. Some reflect the affordances of our standards and tools that we as a community already have in place. Some reflect under-addressed technical challenges, for which continued investment in mitigation will be necessary to maintain the trustworthiness of our archives for legal use. Together, they may better provide for the sustained and trustworthy use of web archives for evidentiary purposes.

 
7:55pm - 8:00pm OL BREAK
Virtual location: Online
8:00pm - 9:00pm OL-SES-08: PANEL: BROWSER-BASED CRAWLING FOR ALL: THE STORY SO FAR
Virtual location: Online
Session Chair: Meghan Lyon, Library of Congress
 

Browser-Based Crawling For All: The Story So Far

Anders Klindt Myrvoll1, Andrew Jackson2, Ben O'Brien3, Sholto Duncan3, Ilya Kreymer4, Lauren Ko5, Jasmine Mulliken6, Antares Reich7, Andreas Predikaka7

1Royal Danish Library; 2The British Library, United Kingdom; 3National Library of New Zealand | Te Puna Mātauranga o Aotearoa; 4Webrecorder; 5UNT; 6Stanford; 7Austrian National Library

Through the IIPC-funded “Browser-based crawling system for all” project (https://netpreserve.org/projects/browser-based-crawling/), members have been working with Webrecorder to support the development of their new crawling tools: Browsertrix Crawler (a crawl system that automates web browsers to crawl the web) and Browsertrix Cloud (a user interface to control multiple Browsertrix Crawler crawls). The IIPC funding is particularly focussed on making sure IIPC members can use these tools.

This online panel will provide an update on the project, emphasizing the experiences of IIPC members who have been experimenting with the tools. IIPC members who have been exploring Browsertrix Cloud in detail will present their experiences so far: what works well, what works less well, how the development process has been, and what the longer-term issues might be. The Q&A session will be used to explore the issues raised and encourage wider engagement and feedback from IIPC members.

Project Update: Anders Klindt Myrvoll & Ilya Kreymer

Anders will present an update from the project leads on what has been achieved since we started the project and what the next steps are. We will look at the broad picture as well as the goals, outcomes and deliverables as described in the IIPC project description: https://netpreserve.org/projects/browser-based-crawling/

On behalf of Webrecorder, Ilya will outline the wider context, update on the status of the project, and include any immediate feedback from the workshop session.

User experience 1 (NZ) Sholto Duncan

Testing Browsertrix Cloud at NLNZ

In recent years the selective web harvesting programme at the National Library of New Zealand has broadened its crawling tools of choice in order to use the best one for the job: from primarily using Heritrix, through WCT, to now also regularly crawling with Webrecorder and Archive-It. This has allowed us to get the best capture possible, but it unfortunately still falls short in harvesting some of those richer, more dynamic, modern websites that are becoming more commonplace.

Other areas within the Library that often use web archiving processes for capturing web content have seen this same need for improved crawling tools. This has provided a range of users and diverse use cases for our Browsertrix Cloud testing. During this presentation we will cover our user experience during this testing.

User experience 2 (UNT) Lauren Ko

Improving the Web Archive Experience

With a focus on collecting the expiring websites of defunct federal government commissions, carrying out biannual crawls of its own subdomains, and participating in event-based crawling projects, since 2005 UNT Libraries has mostly carried out harvesting with Heritrix. However, in recent years, attempts to better archive increasingly challenging websites and social media have led to supplementing this crawling with a more manual approach using pywb's record mode. Now hosting an instance of Browsertrix Cloud, UNT Libraries hopes to reduce the time spent on archiving such content that requires browser-based crawling. Additionally, the libraries expect the friendlier user interface Browsertrix Cloud provides to facilitate its use by more staff in the library, as a teaching tool in a web archiving course in the College of Information, and in a project collaborating with external contributors.

User experience 3 (Stanford) Jasmine Mulliken

Crawling the Complex

Web-based digital scholarship, like the kind produced under Stanford University Press’s Mellon-funded digital publishing initiative (http://supdigital.org), is especially resistant to standard web archiving. Scholars choosing to publish outside the bounds of the print book are finding it challenging to defend their innovatively formatted scholarly research outputs to tenure committees, for example, because of the perceived ephemerality of web-based content. SUP is supporting such scholars by providing a pathway to publication that also ensures the longevity of their work in the scholarly record. This is in part achieved by SUP’s partnership with Webrecorder (https://blog.supdigital.org/sup-webrecorder-partnership/), which has now, using Browsertrix Cloud, produced web-archived versions of all eleven of SUP’s complex, interactive, monograph-length scholarly projects (https://archive.supdigital.org/). These archived publications represent an important use case for Browsertrix Cloud that speaks to the needs of creators of web content who rely on web archiving tools as an added measure of value for the work they are contributing to the evolving innovative shape of the scholarly record.

User experience 4 (Austrian National Library) Andreas Predikaka & Antares Reich

Integrating Browsertrix

Since the beginning of its web archiving project in 2008, the Austrian National Library has been using the Heritrix crawler integrated in NetarchiveSuite. For many websites in daily crawls, the use of Heritrix is no longer sufficient, and it is necessary to improve the quality of our crawls. Tests showed very quickly that Browsertrix does a very good job of fulfilling this requirement. But for us it is also important that the results of Browsertrix crawls are integrated into our overall working process. By using the Browsertrix API, it was possible to create a proof of concept of the necessary steps for this use case.
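A hypothetical sketch of what such an API-driven proof of concept can look like is shown below: start a browser-based crawl, wait for it to finish, and hand the resulting archive to the library's existing ingest steps. The host, endpoint paths, and payload fields are placeholders invented for illustration and are not the actual Browsertrix Cloud API; only the overall pattern reflects the integration described above.

    # Hypothetical sketch: the routes and fields below are placeholders, NOT the
    # real Browsertrix Cloud API; they only illustrate an API-driven crawl workflow.
    import time
    import requests

    API = "https://browsertrix.example.org/api"          # placeholder host
    AUTH = {"Authorization": "Bearer <token>"}           # placeholder credentials

    # 1. Create and start a crawl (placeholder route and fields).
    config = {"name": "daily-site", "seeds": ["https://example.at/"]}
    crawl = requests.post(f"{API}/crawls", json=config, headers=AUTH).json()

    # 2. Poll until the crawl reaches a finished state (placeholder field names).
    while True:
        status = requests.get(f"{API}/crawls/{crawl['id']}", headers=AUTH).json()
        if status["state"] in ("complete", "failed"):
            break
        time.sleep(30)

    # 3. Download the finished WACZ so it can enter the library's usual
    #    quality-assurance and ingest process.
    if status["state"] == "complete":
        with open("crawl.wacz", "wb") as fh:
            fh.write(requests.get(status["downloadUrl"], headers=AUTH).content)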
 

 