Conference Agenda
Overview and details of the sessions of this conference. All times are shown in the time zone of the conference (CEST).
|
Session Overview |
Date: Wednesday, 03/May/2023 | |
1:00pm - 1:15pm | WELCOME & HOUSEKEEPING Virtual location: Online |
1:15pm - 2:00pm | OL-SES-01: Q&A: RECONSTRUCTIONS IN NATIONAL DOMAINS: HISTORY, COLLECTIONS & CORPORA Virtual location: Online Session Chair: Susanne van den Eijkel, KB, National Library of the Netherlands Session Chair: Sophie Ham, Koninklijke Bibliotheek |
|
Uncovering the (paper) traces of the early Belgian web 1Ghent University, Belgium; 2KBR (Royal Library of Belgium) The Belgian web began in June 1988 when EARN and Eunet introduced the .be domain. In December 1993 the first .be domain names were registered and in 1994 there were a total of 129 registered .be names.[1] Documentation about the early Belgian web is scarce and the topic has not yet been researched in depth. In other European countries such as France and the Netherlands, specific projects have been set up to document the early national web.[2] This study of the early Belgian web therefore helps to complete the history of the early web in Europe and understand the specific dynamics that led to the emergence of the web in Belgium. Records of the early Belgian web include: published lists of domain names of interest to Belgians held in the collections of KBR (e.g. publications such as the Belgian Web Directory published from 1997 to 1998 and the Web Directory published from 1998 to 2000), archived early Belgian websites preserved in the Wayback Machine since 1996, archives of organisations such as DNS Belgium (the registry for the .be, .brussels and .vlaanderen domains) etc. This archival information provides a slice of the information needed to understand the emergence of the early web in Belgium, yet it is clear that social actors who played key roles in developing the Belgian web are not always recorded in the few archival records that remain. By combining these “paper traces” of the early Belgian web with semi-structured interviews with key actors in Belgium (e.g. long-time employees of DNS Belgium, instigators of the .be domain name, first users of and researchers on the web) we are able to reconstruct the history of the start of the web in Belgium. In this presentation, we will report on this research that stitches together the first traces of the early Belgian web. [1] DNS Belgium. (2019). De historiek van DNS Belgium.
Available online at: https://www.dnsbelgium.be/nl/over-dns-belgium/de-historiek-van-dns-belgium. [2] De Bode, P., Teszelszky, K. (2018). Web collection internet archaeology Euronet-Internet (1994-2017). Available online at: https://lab.kb.nl/dataset/web-collection-internet-archaeology-euronet-internet-1994-2017; Bibliothèque nationale de France. (2018). Web90 - Patrimoine, Mémoires et Histoire du Web dans les années 1990. Available online at: https://web90.hypotheses.org/tag/bnf. The Lifranum research project: building a collection on French-speaking literature 1French national library, France; 2University of Lyon 3 (Jean Moulin), France Many amateur and professional writers have taken to the web since its very beginning to share their writings and personal diaries and to engage in the first forums. These practices increased with the rise of blogging platforms in the 2000s. Authors have used the possibilities of hypertext links to develop a new digital sociability and a common transnational creative network. The Lifranum research project brings together researchers from several disciplines. Its objective is to provide an original platform within a thematic web archive as corpora and to develop enhanced search features. The indexing scheme takes into account advances in automatic style analysis. In this context, researchers and librarians have defined complementary needs regarding the web archive collection to be built and have tested new methods to design the corpora and carry out the crawls. During this presentation, we will share the challenges we encountered and the experience we gained while building this large thematic corpus, from the selection phase to the crawl processes.
The following aspects will be discussed:
- text indexing and text analysis issues;
- methods for building large thematic corpora using Hyphe, a tool developed by SciencesPo for exploring the web, building corpora, analyzing links between websites, and adding annotations;
- managing quantity and quality on blogging platforms;
- documenting the choices made in processing data.
The presentation will also compare web archive logics to scientific approaches focused on a specific type of data (text, image, video) that are exposed using APIs and easier to analyze. We will question the contributions and limits of this type of collection launched in partnership within the framework of a research project, which enriches the archives due to more methodical explorations of the web, anticipated qualitative controls and production of reusable documentation. The video will be available in French with English subtitles. Developing a Reborn Digital Archival Edition as an Approach for the Collection, Organisation, and Analysis of Web Archive Sources 1Maynooth University; 2Universitat de Barcelona; 3British Library In this presentation, we explore the development of a reborn digital archival edition (RDAE) as a hybrid approach for the collection, organisation, and analysis of reborn digital materials (Brügger, 2018; 2016) that are accessible through public web archives. Brügger (2016) describes reborn digital media as media that has been collected and preserved and has undergone a change due to this process, such as emulations of computer games or materials in a web archive. Further to this, we explore the potential of an RDAE as a method to enable the sharing and reuse of such data. As part of this, we use a case study of the press/media statements of Irish politician, poet, and sociologist Michael D. Higgins from 2002-2011. For the most part, these press statements were once available on the website of Michael D. Higgins, who has served as President of Ireland since 2011.
Higgins’s website disappeared from the live web sometime after the 2011 Presidential Election took place (27 October 2011) and sometime before Higgins was inaugurated (11 November 2011). Using the NLI Web Archive (National Library of Ireland) and the Wayback Machine (Internet Archive), this project sought to find and collect traces of these press statements and bring them together as an RDAE. In doing so, we use Zotero open-source citation management software for collecting, organising, and analysing the data (archived web pages). We extract the text and use screenshot software to capture an image of the archived web page. Thereafter, we utilise Omeka open-source software as a platform for presenting the data (screenshot/metadata/transcription) as a curated thematic collection of reborn digital materials, offering search and discovery functions through free text search, metadata fields and subject headings. Finally, we use DROID open-source software for organising the data for long-term preservation, and Open Science Framework as a platform for sharing derivative materials and datasets. References: Brügger, N. (2016). Digital Humanities in the 21st Century: Digital Material as a Driving Force. Digital Humanities Quarterly, 10(3). Retrieved from http://www.digitalhumanities.org/dhq/vol/10/3/000256/000256.html Brügger, N. (2018). The Archived Web: Doing History in the Digital Age. The MIT Press. |
2:00pm - 2:15pm | OL-SES-02: Q&A: BARRIERS TO WEB ARCHIVING IN LATIN AMERICA Virtual location: Online Session Chair: Eilidh MacGlone, National Library of Scotland |
|
Web Archiving en español: Barriers to Accessing and Using Web Archives in Latin America 1Universidad Autónoma del Estado de México, Mexico; 2University of Texas at San Antonio Web archives have been growing in popularity in Global North countries as a way of preserving a part of their political, cultural, and social life carried out online. However, their spread to other regions has been slower because of several technical, economic, and social barriers. In this presentation, we will discuss the main limitations in the uptake of web archiving in Spanish-speaking Latin American countries and its implications for the access and use of web archives. The first barrier to web archiving in these countries is the lack of awareness of web archives among librarians and archivists. According to Scopus data from 2022, out of 909 documents with the words “web archiv*” in the title, abstract, or keywords, only 10 papers are from Spanish-speaking Latin American countries. In worldwide web archiving surveys, there are no initiatives from Latin America yet (D. Gomes et al., 2011; P. Gomes, 2020), and we could only identify 5 Latin American institutions on Archive-It, none of which had active public collections in 2022. Another barrier is the cost of web archiving services like Archive-It, which can be unaffordable for many institutions in the region. Even if they can afford these services or use free tools, most web archiving software is created only in English and does not smoothly support multilingual collections or collections in languages other than English (for example, the default metadata fields for collections and seeds are in English and adding them in other languages is not straightforward).
This unequal access to web archives between the Global North and South comes with the risk that Global North websites get preserved, organized, accessed, and used, while Latin America and other regions continue depending solely on third parties like the Internet Archive to preserve their websites. A possible solution for raising awareness of web archives is developing workshops as well as mentorship programs for Latin American librarians and digital humanists looking to start with web archiving. For the linguistic barrier, translating the documentation of web archiving tools to other languages can be a first step to encourage their use in Latin America. References Gomes, D., Miranda, J., & Costa, M. (2011). A Survey on Web Archiving Initiatives. In S. Gradmann, F. Borri, C. Meghini, & H. Schuldt (Eds.), Research and Advanced Technology for Digital Libraries (pp. 408–420). Springer. https://doi.org/10.1007/978-3-642-24469-8_41 Gomes, P. (2020). Map of Web archiving initiatives. Own work. https://commons.wikimedia.org/wiki/File:Map_of_Web_archiving_initiatives_58632FR T.jpg |
2:15pm - 2:30pm | OL BREAK Virtual location: Online |
2:30pm - 3:00pm | OL-SES-03: Q&A: RESEARCHING WEB ARCHIVES Virtual location: Online Session Chair: Ben Els, National Library of Luxembourg |
|
All Our Yesterdays: A toolkit to explore web archives in Colab Brock University, Canada The rise of Jupyter notebooks and particularly Google Colab has created an easy-to-use and accessible platform for those interested in exploring computational methods. This is especially the case with performing research using web archives. However, the question remains: how to start? In particular, for those without an extensive background in programming this might be an insurmountable challenge. Enter the All Our Yesterdays Toolkit (AOY-TK). This suite of notebooks and associated code provides a scaffolded introduction to opening, analyzing, and generating insights with web archives. With tight integration to Google Drive and text analysis tools, it provides a comprehensive solution to that very question of how to start. Development of AOY-TK is made possible by grant funding and this session will discuss progress to date and provide some brief case study examples of the types of analysis possible using the toolkit. Using Web Archives to Model Academic Migration and Identify Brain Drain Drexel University, United States of America Academic faculty members may change their institution affiliation over the course of their career. In the case of Historically Black Colleges and Universities (HBCUs) in the United States, which make substantial contributions to the preparation of Black professionals, keeping the most talented Black students and faculty from moving to non-HBCUs (thus preventing “brain drain”) is often a losing battle. This project seeks to investigate the effects of academic mobility at the institutional and individual level, measuring the potential brain drain from HBCUs. To accomplish this, we consult web archives to identify captures of academic institutions and their departments in the past to extract faculty names, titles, and affiliations at various points in time.
By analyzing the HBCUs’ lists of faculty over time, we will be able to model academic migration and quantify the degree of brain drain. This NSF-sponsored project is in the early stages of execution and is a collaboration between Drexel University, Howard University, University of Tennessee - Knoxville, and University of Wisconsin - Madison. We are currently in the data collection stage, which entails leveraging an open source Memento aggregator to consult international sources of web archives to potentially improve the quality and quantity of captures of past versions of HBCU sites. In this initial stage, we have encountered caveats in the process of efficient extraction, established a systematic methodology for applying this approach beyond our initial use cases, and identified potential ethical dilemmas arising when individuals’ past information is uncovered and highlighted without their explicit consent. During the first year of the project, we have refined our approach to facilitate better data quality for subsequent steps in the process and to emphasize recall. This presentation will both describe some of these nuances of our collaborative project and highlight the next steps for identifying brain drain from HBCUs by utilizing web archives. |
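As an illustration of the data collection step described in this abstract, the following is a minimal sketch of reading a Memento TimeMap in link format (RFC 7089) and keeping the earliest capture per year. It is not the project's code: the sample TimeMap, all example URLs, and the helper names `parse_mementos` and `one_capture_per_year` are hypothetical, and a real TimeMap would be fetched from a Memento aggregator.

```python
import re
from email.utils import parsedate_to_datetime

# Hypothetical link-format TimeMap excerpt (all URLs are made up).
SAMPLE_TIMEMAP = '''<http://example.edu/faculty>; rel="original",
<http://archive.example/timemap/link/http://example.edu/faculty>; rel="self"; type="application/link-format",
<http://archive.example/web/20050301000000/http://example.edu/faculty>; rel="first memento"; datetime="Tue, 01 Mar 2005 00:00:00 GMT",
<http://other.example/20120615120000/http://example.edu/faculty>; rel="memento"; datetime="Fri, 15 Jun 2012 12:00:00 GMT",
<http://archive.example/web/20200110080000/http://example.edu/faculty>; rel="last memento"; datetime="Fri, 10 Jan 2020 08:00:00 GMT"'''

# One entry is "<uri>; attr="value"; ..."; attributes run until the next "<".
ENTRY = re.compile(r'<([^>]+)>\s*;\s*([^<]*)')
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_mementos(timemap_text):
    """Return (capture datetime, memento URI) pairs sorted by time."""
    mementos = []
    for uri, params in ENTRY.findall(timemap_text):
        attrs = dict(ATTR.findall(params))
        rels = attrs.get("rel", "").split()
        # Only entries whose rel includes "memento" are archived captures.
        if "memento" in rels and "datetime" in attrs:
            mementos.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return sorted(mementos)


def one_capture_per_year(mementos):
    """Keep the earliest capture for each calendar year."""
    by_year = {}
    for when, uri in mementos:
        by_year.setdefault(when.year, uri)
    return by_year


if __name__ == "__main__":
    for year, uri in one_capture_per_year(parse_mementos(SAMPLE_TIMEMAP)).items():
        print(year, uri)
```

Picking one capture per year is only one possible way to sample "various points in time"; a study emphasizing recall might instead keep every capture.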
3:00pm - 3:05pm | OL BREAK Virtual location: Online |
3:05pm - 3:35pm | OL-SES-04: Q&A: SAMPLING THE HISTORICAL WEB & TEMPORAL RESILIENCE OF WEB PAGES Virtual location: Online Session Chair: Laura Wrubel, Stanford University |
|
Lessons Learned From the Longitudinal Sampling of a Large Web Archive 1Old Dominion University, Norfolk, Virginia - USA; 2Internet Archive, San Francisco, California - USA; 3Protocol Labs, San Francisco, California - USA We document the strategies and lessons learned from sampling the web by collecting 27.3 million URLs with 3.8 billion archived pages spanning 26 years of the Internet Archive's holdings (1996–2021). Our overall project goal is to revisit fundamental questions regarding the size, nature, and prevalence of the publicly archivable web, and in particular, to reconsider the question, "how long does a web page last?" Addressing this question requires obtaining a "representative sample of the web." We proposed several orthogonal dimensions to sample URLs using the archived web: time of the first archive, HTML vs. MIME types, URL depth (top-level pages vs. deep links), and TLD. We sampled 285 million URLs from IA's ZipNum index file that contains every 6000th line of the CDX index. These include URLs of embedded resources, such as images, CSS, and JavaScript. To limit our samples to web pages, we filtered the URLs for likely HTML pages (based on filename extensions). We determined the time of the first archive and MIME type using IA's CDX API. We grouped the 92 million URLs with "text/html" MIME types based on the year of the first archive. Archiving speed and capacity have significantly increased, so we found fewer URLs archived in the early years than in later years. Hence, we adjusted our goal of 1 million URLs per year and clustered the early years (1996-2000) to reach that size (1.2 million URLs). We noticed an increase in deep links archived over the years. We extracted the top-level URLs from the deep links to upsample the earlier years. We found that popular domains like Yahoo and Twitter were over-represented in the IA. We performed logarithmic-scale downsampling based on the number of URLs sharing a domain. 
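The filtering and downsampling steps described above can be sketched as follows. This is an illustrative assumption, not the authors' implementation: the extension list, the use of the host name as the "domain", and the rule of keeping 1 + floor(log10(n)) URLs for a domain with n URLs are all hypothetical choices made for the example.

```python
import math
import random
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed set of filename extensions treated as "likely HTML" ("" = no extension).
HTML_EXTENSIONS = ("", ".html", ".htm", ".php", ".asp", ".aspx", ".jsp")


def is_likely_html(url):
    """Guess from the filename extension whether a URL is a web page."""
    last_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    if "." in last_segment:
        ext = "." + last_segment.rsplit(".", 1)[-1].lower()
    else:
        ext = ""
    return ext in HTML_EXTENSIONS


def top_level_url(url):
    """Reduce a deep link to the top-level page of its host."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


def log_downsample(urls, seed=0):
    """Keep 1 + floor(log10(n)) random URLs per host that has n URLs."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for url in urls:
        by_domain[urlsplit(url).netloc.lower()].append(url)
    sample = []
    for domain in sorted(by_domain):
        group = by_domain[domain]
        keep = 1 + int(math.log10(len(group)))
        sample.extend(rng.sample(group, keep))
    return sample


if __name__ == "__main__":
    urls = [f"http://big.example/page{i}.html" for i in range(1000)]
    urls.append("http://small.example/a/b.html")
    print(len(log_downsample(urls)))
```

With this rule a host with 1,000 sampled URLs contributes 4 URLs while a host with a single URL still contributes 1, which is the kind of flattening of over-represented domains the abstract describes.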
Given the collection size, we employed various sampling strategies to ensure fairness in the domain and temporal representations. Our final dataset contains TimeMaps of 27.3 million URLs comprising 3.8 billion archived pages from 1996 to 2021. We convey the lessons learned from sampling the archived web, which could inform other studies that sample from web archives. TrendMachine: Temporal Resilience of Web Pages 1Internet Archive, San Francisco, California - USA; 2Old Dominion University, Norfolk, Virginia - USA; 3Protocol Labs, San Francisco, California - USA "How long does a web page last?" is commonly answered with "40 to 100 days", with sources dating back to the late 1990s. The web has since evolved from mostly static pages to dynamically-generated pages that heavily rely on client-side scripts and user contributed content. Before we revisit this question, there are additional questions to explore. For example, is it fair to call a page dead that returns a 404 vs. one whose domain name no longer resolves? Is a web page alive if it returns content, but has drifted away from its original topic? How to assess the lifespan of pages from the perspective of fixity with the spectrum of content-addressable pages to tweets to home pages of news websites to weather report pages to push notifications to streaming media? To quantify the resilience of a page, we developed a mathematical model that calculates a normalized score as time-series data based on the archived versions of the page. It uses Sigmoid functions to increase or decrease the score slowly on the first few observations of the same class. The score changes significantly if the observations remain consistent over time, and there are tunable parameters for each class of observation (e.g., HTTP status codes, no archival activities, and content fixity). 
Our model has many potential applications, such as identifying points of interest in the TimeMap of densely archived web resources, identifying dead links (in wiki pages or any other website) that can be replaced with archived copies, and aggregated analysis of sections of large websites. We implemented an open-source interactive tool [1] powered by this model to analyze URIs against any CDX data source. Our tool gave interesting insights on various sites, such as the day when "cs.odu.edu" was configured to redirect to "odu.edu/compsci", the two and a half years during which "example.com" was being redirected to "iana.org", the time when ODU’s website had downtime due to a cyber attack, or the year when Hampton Public Library’s domain name was drop-caught to host a fake NSFW store. [1] https://github.com/internetarchive/webpage_resilience |
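To make the idea of a sigmoid-damped score concrete, here is a minimal sketch under stated assumptions. It is not the TrendMachine model or the code in the repository above: the observation classes, the parameter values, and the update rule (a step whose size follows a sigmoid of the current streak length) are hypothetical and only illustrate how a score can move slowly on the first few observations of a class and faster once observations stay consistent.

```python
import math

# Hypothetical tunable parameters per observation class: direction of change,
# maximum step size, and the streak length at which the step is half of that.
CLASS_PARAMS = {
    "ok":         {"direction": +1, "max_step": 0.20, "midpoint": 3},
    "not_found":  {"direction": -1, "max_step": 0.30, "midpoint": 3},
    "no_capture": {"direction": -1, "max_step": 0.05, "midpoint": 5},
}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def resilience_series(observations, start=0.5):
    """Return one score in [0, 1] per observation, in order."""
    score, streak, previous = start, 0, None
    series = []
    for obs in observations:
        # Count consecutive observations of the same class.
        streak = streak + 1 if obs == previous else 1
        params = CLASS_PARAMS[obs]
        # Short streaks yield small steps; long streaks approach max_step.
        step = params["max_step"] * sigmoid(streak - params["midpoint"])
        score = min(1.0, max(0.0, score + params["direction"] * step))
        series.append(score)
        previous = obs
    return series


if __name__ == "__main__":
    history = ["ok"] * 4 + ["not_found"] + ["ok"] * 2 + ["not_found"] * 6
    for obs, value in zip(history, resilience_series(history)):
        print(f"{obs:10s} {value:.3f}")
```

In this sketch a single 404 between successful captures barely moves the score, while a sustained run of 404s drives it down quickly, which mirrors the behaviour the abstract describes in prose.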
3:35pm - 6:00pm | OL BREAK Virtual location: Online |
6:00pm - 6:05pm | WELCOME & HOUSEKEEPING Virtual location: Online |
6:05pm - 6:35pm | OL-SES-05: Q&A: PRESERVING SOCIAL MEDIA & VIDEO GAMES Virtual location: Online Session Chair: Sawood Alam, Internet Archive |
|
A Gift to Another Age: Evaluating Virtual Machines for the Preservation of Video Games at MoMA New York University and the Museum of Modern Art in New York, United States of America
This preservation project investigates the use of virtual machines for the preservation of video games. From MoMA’s collection, Rand and Robyn Miller’s classic adventure game Myst (1993) is used as a sample record to evaluate the performance of three different virtualization options for the Mac OS 9 operating system—SheepShaver, Qemu, and Yale’s Emulation-as-a-Service-Infrastructure (EaaSI). Serving as the control for the experiment, Myst is first documented running natively on an original PowerMac G4 at MoMA. The native performance is then compared with each virtualization software. Finally, a fully configured virtual machine is packaged as a single file and tested in different contemporary computing environments. More generally, this project clarifies the risks and challenges that arise when using virtual machines for the long-term preservation of computer and software-based art.
Experiences from archiving information from social media Arkiwera, Sweden Social media has enabled an expanded dialogue in the public space. News spreads faster than ever, politicians and leaders can communicate directly with a great number of people, and everyone can become an influencer or create a public movement. For the archival institutions and governmental organisations active on different social media platforms, this creates new challenges and questions. As a consultancy firm working with digital preservation, we couldn’t just ask these questions without also finding some answers. This led us to begin developing the system that we today call Arkiwera. With more than 10 years’ experience of web archiving, we started to look at solutions to preserve posts, including comments and reactions, from different social media platforms about 4-5 years ago. Since we couldn’t find any “out of the box” solutions that we could just apply and refer our customers to, we started to develop our own solution. This has been a long and interesting journey with many lessons learned that we would love to share at the conference, connecting to several of the themes presented, e.g. Research, Tools and Access. Our lecture introduces you to the circumstances offered within the Swedish archival context and the choices we have made from an archival, regulatory, and ethical perspective when developing the archival platform Arkiwera – today used by a large number of organisations. |
6:35pm - 6:40pm | OL BREAK Virtual location: Online |
6:40pm - 7:10pm | OL-SES-06: Q&A: COLLABORATIVE WEB ARCHIVING Virtual location: Online Session Chair: Lauren Ko, University of North Texas |
|
Empowering Bibliographers to Build Collections: The Browsertrix Cloud Pilot at Stanford Libraries Stanford University, United States of America The purview of subject-area librarians has expanded in the 21st century from primarily focusing on books and print subscriptions to a much larger set of materials, including digital subscription packages and data sets (distributed using a variety of media, for purchase or lease). Through this process, subject-area librarians are increasingly exposed to complex issues around copyright, license terms, and privacy/ethical concerns, where both norms and laws can vary significantly among different countries and communities. While it is nearly impossible for subject-area librarians in any field to treat “data” as outside the scope of their collecting efforts in 2022, the same does not hold true for web archives. Many libraries have at least some access to web archiving tools, although this access may primarily be in the hands of a limited number of users, sometimes associated with library technical services or special collections / university archives (e.g. for institutions whose focus of web archiving is primarily their own digital resources). In late 2022, the web archiving task force at Stanford Libraries – a cross-functional team that brought together the web archivist, technical staff, and embedded digital humanities staff – set out to shift this dynamic by empowering disciplinary librarians to add web archiving to their toolkit for building the university’s collections. By partnering with Webrecorder, Stanford Libraries set up an instance of Browsertrix Cloud, and provided access to a pilot group of bibliographers and other subject-matter experts as part of a short-term pilot. The goals of this pilot were to see how, and how much, bibliographers would engage with web archiving for collection-building if given unfettered access to easy-to-use tools. What materials would they prioritize? What challenges would they encounter? 
What technical (e.g. storage) and support (e.g. training, debugging, community engagement) resources would be necessary for them to be successful? This pilot was also intended to inform the strategic direction for web archiving at Stanford moving forward. In this talk, we will briefly present how we designed the pilot, hear perspectives from bibliographers who participated, and share the pilot outcomes and future directions. What next? An update from SUCHO 1Stanford University, United States of America; 2Tufts University, United States of America; 3Austrian Center for Digital Humanities and Cultural Heritage, Austria; 4University of Gothenburg, Sweden Saving Ukrainian Cultural Heritage Online (SUCHO) made headlines as an international, volunteer-run initiative archiving Ukrainian cultural heritage websites in the wake of Russia’s invasion in February 2022. Through SUCHO, over 1,500 volunteers around the world – from technologists and librarians, to retirees and children – were involved in a large-scale, rapid-response web archiving effort that developed a collection of over 5,000 websites and 50 TB of data. As a non-institutional project with the primary goal of digital repatriation, creating this collection and ensuring its security through a network of mirrors was not enough. The motivation for SUCHO was not to create a permanent archive of Ukraine that could be used as research data for scholars as the country was destroyed; instead, the hope was to hold onto the data only until the cultural heritage sector in Ukraine was ready to rebuild. The initial web archiving phase of SUCHO’s work happened between March and August 2022. The archives came from a variety of sources: they were created on volunteers’ laptops using the command-line Browsertrix software, created using Browsertrix Cloud, or even uploaded as individual, highly interactive page archives made with the Browsertrix Chrome plugin.
In addition, while the project mostly worked from a single list of sites, the work was done in haste, and status metadata (e.g. “in progress”, “done”, “problem”) was not always accurately documented. Furthermore, while the project had full DNS records for these sites, that metadata was stored separately from the spreadsheet – as was information about site uptime and downtime over the course of the project. Creating the web archives was challenging, but it quickly became apparent that the bigger challenge would be curation. This talk will follow up on our 2022 IIPC presentation on SUCHO, confronting the question of “What next?” for SUCHO. It will bring together a number of volunteers to discuss different facets of this curation process, including reuniting archives with different kinds of metadata, our efforts in extracting data from the archives that could be used as the foundation for rebuilding websites, and other work to curate and present what our volunteer community accomplished. |
7:10pm - 7:25pm | OL BREAK Virtual location: Online |
7:25pm - 7:55pm | OL-SES-07: Q&A: LEGAL & ETHICAL CONSIDERATIONS Virtual location: Online Session Chair: Tom Smyth, Libraries and Archives Canada |
|
Querying Queer Web Archives 1The Graduate Center, CUNY, USA; 2Rutgers University, USA Our paper explores the intersections of querying and queerness as it interacts with and is informed by web spaces and their development across time. Working with hundreds of gigabytes of web archival records on queer and queer-ish online spaces, we are developing new methods for search and discovery, as well as for the ethical access and use of web archives. This paper reflects on our process pursuing methodologies that accommodate diverse perspectives for querying web-based datasets and embrace the qualities of play and pliancy to respond to a host of research questions and investments. For example, one central concern explores ethical methods for cleaning web archival data to maintain privacy and anonymity. While queer spaces have historically existed in the margins, confidential information is easily shared and retained in the process of collecting data. Given that we are looking into queer spaces across 30 or so years, we are also mindful of the ethical considerations for privacy and anonymity in two respects: first, in the sense of anonymity that has shifted since early internet days; and second, in the uses of collected sites in repositories. For example, in 1995 only 0.4% of the world population had access to the internet (Mendel, 2012), compared to 60% in 2020 (The World Bank, n.d.). The sense of anonymity and the smaller internet community mean that users were likely to share more private information than they might share today. Our research therefore has to consider how to remove private information in large amounts using tools such as bulk_extractor (Garfinkel, 2013) and bulk_reviewer (Walsh & Baggett, 2019). In addition, we also work with repositories of archived websites whose original collection was obtained through informed consent. This means that while we may have the ability to access the collection, ethical secondary use requires additional consideration.
Given the small size of the collection, we have been able to reach out to the original creators, but this approach will need to be reconsidered for larger collections. Beyond the Affidavit: Towards Better Standards for Web Archive Evidence nullhandle.org The Internet Archive (IA) standard legal affidavit is used in litigation both frequently and reliably for the authentication and admission of evidence from the Wayback Machine (WM). While the affidavit has enabled the regular and relatively confident application of IAWM evidence by the legal community, their understanding of the contingencies of web archives - including qualifications to which the affidavit itself calls attention - is limited. The tendency to conflate IA's attestation as to the authenticity of IAWM /records/ with the authenticity of /historical webpages/ will eventually have material consequences in litigation, which we may reasonably suppose will undermine confidence in the trustworthiness of web archives generally and to a greater extent than is likely merited. The ever-increasing complexity of the web and the unfortunately growing investment in disinformation only increase the probability that this will happen sooner rather than later. In response to the looming (or present, but as yet undiscovered) threat to the current IA affidavit-favored regime for authentication of IAWM evidence, the web archiving community would do well to champion better, more institutionally-agnostic standards for evaluating and affirming the authenticity of archived web content. Some modest efforts have been made on this front, and there are a few places we can consult for tacitly indicated frameworks. Collectively, these include judicial precedents, e-discovery community guidance, and the marketing of services by commercial archiving companies. I would argue that these do not get us far enough, though.
To that end, I would like to elaborate a more expansive set of criteria that could serve as a basis for the authenticity of web archives for evidentiary purposes. Some of these traits are foundational to web archiving in the main, and help to distinguish web archives from other forms of web content capture. Some reflect the affordances of our standards and tools that we as a community already have in place. Some reflect under-addressed technical challenges, for which continued investment in mitigation will be necessary to maintain the trustworthiness of our archives for legal use. Together, they may better provide for the sustained and trustworthy use of web archives for evidentiary purposes. |
7:55pm - 8:00pm | OL BREAK Virtual location: Online |
8:00pm - 9:00pm | OL-SES-08: PANEL: BROWSER-BASED CRAWLING FOR ALL: THE STORY SO FAR Virtual location: Online Session Chair: Meghan Lyon, Library of Congress |
|
Browser-Based Crawling For All: The Story So Far 1Royal Danish Library; 2The British Library, United Kingdom; 3National Library of New Zealand | Te Puna Mātauranga o Aotearoa; 4Webrecorder; 5UNT; 6Stanford; 7Austrian National Library Through the IIPC-funded “Browser-based crawling system for all” project (https://netpreserve.org/projects/browser-based-crawling/), members have been working with Webrecorder to support the development of their new crawling tools: Browsertrix Crawler (a crawl system that automates web browsers to crawl the web) and Browsertrix Cloud (a user interface to control multiple Browsertrix Crawler crawls). The IIPC funding is particularly focussed on making sure IIPC members can use these tools. This online panel will provide an update on the project, emphasizing the experiences of IIPC members who have been experimenting with the tools. Three IIPC members who have been exploring Browsertrix Cloud in detail will present their experiences so far: what works well, what works less well, how the development process has been, and what the longer-term issues might be. The Q&A session will be used to explore the issues raised and encourage wider engagement and feedback from IIPC members. Project Update: Anders Klindt Myrvoll & Ilya Kreymer Anders will present an update from the project leads on what has been achieved since we started the project and what the next steps are. 
We will look at the broad picture as well as the goals, outcomes and deliverables as described in the IIPC project description: https://netpreserve.org/projects/browser-based-crawling/ On behalf of Webrecorder, Ilya will outline the wider context and give an update on the status of the project, including any immediate feedback from the Workshop session. User experience 1 (NZ) Sholto Duncan Testing Browsertrix Cloud at NLNZ In recent years the selective web harvesting programme at the National Library of New Zealand has broadened its crawling tools of choice in order to use the best one for the job. We have moved from primarily using Heritrix, through WCT, to now also regularly crawling with Webrecorder and Archive-It. This has allowed us to get the best capture possible, but unfortunately it still falls short in harvesting some of the richer, dynamic, modern websites that are becoming more commonplace. Other areas within the Library that often use web archiving processes for capturing web content have seen this same need for improved crawling tools. This has provided a range of users and diverse use cases for our Browsertrix Cloud testing. During this presentation we will cover our user experience during this testing. User experience 2 (UNT) Lauren Ko Improving the Web Archive Experience With a focus on collecting the expiring websites of defunct federal government commissions, carrying out biannual crawls of its own subdomains, and participating in event-based crawling projects, since 2005 UNT Libraries has mostly carried out harvesting with Heritrix. However, in recent years, attempts to better archive increasingly challenging websites and social media have led to supplementing this crawling with a more manual approach using pywb's record mode. Now hosting an instance of Browsertrix Cloud, UNT Libraries hopes to reduce the time spent on archiving such content that requires browser-based crawling. 
Additionally, the libraries expect the friendlier user interface Browsertrix Cloud provides to facilitate its use by more staff in the library, as a teaching tool in a web archiving course in the College of Information, and in a project collaborating with external contributors. User experience 3 (Stanford) Jasmine Mulliken Crawling the Complex Web-based digital scholarship, like the kind produced under Stanford University Press’s Mellon-funded digital publishing initiative (http://supdigital.org), is especially resistant to standard web archiving. Scholars choosing to publish outside the bounds of the print book are finding it challenging to defend their innovatively formatted scholarly research outputs to tenure committees, for example, because of the perceived ephemerality of web-based content. SUP is supporting such scholars by providing a pathway to publication that also ensures the longevity of their work in the scholarly record. This is in part achieved by SUP’s partnership with Webrecorder (https://blog.supdigital.org/sup-webrecorder-partnership/), which has now, using Browsertrix Cloud, produced web-archived versions of all eleven of SUP’s complex, interactive, monograph-length scholarly projects (https://archive.supdigital.org/). These archived publications represent an important use case for Browsertrix Cloud that speaks to the needs of creators of web content who rely on web archiving tools as an added measure of value for the work they are contributing to the evolving innovative shape of the scholarly record. User experience 4 (Austrian National Library) Andreas Predikaka & Antares Reich Integrating Browsertrix Since the beginning of the web archiving project in 2008, Austrian National Library has been using the crawler Heritrix integrated in Netarchivesuite. For many websites in daily crawls, the use of Heritrix is no longer sufficient and it is necessary to improve the quality of our crawls. 
Tests showed very quickly that Browsertrix does a very good job of fulfilling this requirement. But for us it is also important that the results of Browsertrix crawls are integrated into our overall working process. By using the API of Browsertrix, it was possible to create a proof of concept of necessary steps for this use case. |
Date: Wednesday, 10/May/2023 | |
4:00pm - 5:30pm | PUBLIC EVENT: BUILDING DIGITAL HERITAGE TOGETHER: DUTCH AND TRANSNATIONAL PERSPECTIVES Location: Theatre 1 This public event, hosted by the Netherlands Institute for Sound and Vision (NISV) and co-organised by KB – National Library of the Netherlands and IIPC, will feature presentations on the Netherlands UNESCO projects as well as an introduction to collaborative, transnational web archiving. Presentations will be followed by a panel discussion moderated by Tamara van Zwol, Dutch Digital Heritage Network.
Pre-registration is required for this event. |
5:30pm - 7:00pm | WELCOME RECEPTION & NETWORKING EVENT We look forward to welcoming conference delegates and public event attendees to a welcome reception and networking event in Sound & Vision’s atrium following the public event Building Digital Heritage Together: Dutch And Transnational Perspectives.
This event will feature drinks, small bites, and a chance to network with other attendees and partner organizations. Partner Organizations: Council on Library and Information Resources (CLIR) Dutch Digital Heritage Network (DDHN) Open Preservation Foundation (OPF) Pre-registration is required for this event. |
Date: Thursday, 11/May/2023 | |
8:30am - 9:30am | REGISTRATION/COFFEE |
9:30am - 9:45am | OPENING REMARKS: Eppo van Nispen, Sound & Vision Location: Theatre 1 |
9:45am - 10:45am | KEYNOTE: Eliot Higgins, Bellingcat. Introduced and chaired by Johan Oomen, Sound & Vision Location: Theatre 1 |
10:45am - 11:00am | BREAK |
11:00am - 12:30pm | SES-01: RESEARCH & ACCESS Location: Theatre 1 Session Chair: Ditte Laursen, Royal Danish Library These presentations will be followed by a 10 min Q&A. |
|
11:00am - 11:20am
Through the ARCHway: Opportunities to Support Access, Exploration, and Engagement with Web Archives Archives Unleashed Project, University of Waterloo, Canada For nearly three decades, memory institutions have consciously archived the web to preserve born-digital heritage. Now, web archive collections range into the petabytes, significantly expanding the scope and scale of data for scholars. Yet there are many acute challenges research communities face, from the availability of analytical tools, community infrastructure, and inaccessible research interfaces. The core objective of the Archives Unleashed Project is to lower these barriers and burdens for conducting scalable research with web archives. Following a successful series of datathon events (2017-2020), Archives Unleashed launched the cohort program (2021-2023) to facilitate opportunities to improve access, exploration and research engagement with web archives. Borrowing from the hacking genre of events often found within the tech industry, Archives Unleashed datathons were designed to provide an immersive and uninterrupted period of time for participants to work collaboratively on projects and gain hands-on experience working with web archive data. The datathon series cultivated community formation and empowered scholars to build confidence and the skills needed to work with web archives. However, the short-term nature of datathons ultimately saw focused energy and time to research projects diminish once meetings concluded. Launched in 2021, the Archives Unleashed cohort program was developed as a matured evolution of the datathon model to support research projects. The program ran two iterative cycles and hosted 46 international researchers from 21 unique institutions. Programmatically, researchers engaged in a year-long collaboration project, with web archives featured as a primary data source. 
The mentorship model has been a defining feature, including direct one-on-one consultation from Archives Unleashed, connections to field experts, and opportunities for peer-to-peer support. This presentation will reflect on the experiences of engaging with scholars to build scalable analytical tools and deliver a mentorship program to facilitate research with web archives. The cohort program asked researchers to step into an unfamiliar environment with complex data, and they did so with curiosity while embracing opportunities to access, explore, and engage with web archive collections. While the program highlights a broad range of use cases, we seek to inspire the adoption of web archives for scholarly inquiry more commonly across disciplines. 11:20am - 11:40am
‘Research-ready’ collections: challenges and opportunities in making web archive material accessible 1Cambridge University Libraries, United Kingdom; 2National Library of Scotland, United Kingdom The Archive of Tomorrow is a collaborative, multi-institutional project led by the National Library of Scotland and funded by the Wellcome Trust collecting information and misinformation around health in the online public space. One of the aims of this project is to create a ‘research-ready’ collection which would make it possible for researchers to access and reuse the themed collections of materials for further research. However, there are many challenges around making this a reality, especially around the legislative framework governing collection of and access to web archives in the UK, and technical difficulties stemming from the emerging platforms and schemas used to catalogue websites. This talk would primarily address IIPC 2023's Access and Research themes, while also touching on the Collections and Operations strands in its discussion of a short-term project promising to deliver technical improvements and expanded access to web archives collections by 2023. The presentation would like to challenge and explore the difficulties the project encountered by offering different ways into the material, including exposing insights that can be generated from working with metadata exports outside of collecting platforms; detailing the project’s work in surfacing web archives in traditional library discovery settings through metadata crosswalks; and exploring further possibilities around the use of Jupyter Notebooks for data exploration and the documentation and dissemination of datasets. 
The intended deliverables of this session are to present the tools developed within the project to make web archive material suitable and useful for research; to share frameworks used by the project’s web archivists when navigating the challenges of archiving personal and political health information online; and to discuss the barriers to access around collecting web archive and social media material in a UK context. 11:40am - 12:00pm
Developing new academic uses of web archives collections: challenges and lessons learned from the experimental service deployed at the University of Lille during the ResPaDon Project 1Université de Lille, France; 2Bibliothèque nationale de France, France 2022 marks the second year of the ResPaDon project, undertaken by the BnF (National Library of France) and the University of Lille, in partnership with Sciences Po and Campus Condorcet. The project brings together researchers and librarians to promote and facilitate a broader academic use of web archives by demonstrating the value of web archives and by reducing the technical and methodological barriers researchers may encounter when discovering this source for the first time or when working with such complex materials. One of the ways to meet the challenges and address new ways of doing research is the implementation of an experimental remote access point to the web archives at the University of Lille. The project team has renewed the offer of tools and conducted outreach to new groups of potential web archive users. The remote access point to web archives has been deployed in two university libraries in Lille: this service allows for both consultation of the web archives in their entirety (44 billion documents, 1.7 PB of data) and for exploring a collection, "The 2002 presidential and local elections", which was the first collection constituted in-house by the BnF 20 years ago. This collection is now accessible through various tools for data mining, analysis, and data visualization. The use of those tools is accompanied by guides, reports, examples, and use cases - multiple types of supporting documentation that will also be evaluated on their usefulness as part of the experimentation. The presentation will focus on the implementation of this access point from both technical and practical aspects. 
It will address the training of the team of 6 mediators responsible for accompanying the researchers in Lille, as well as the collaboration between the teams in Lille and at the BnF. It will also tackle the challenges of outreach and the path we have taken to communicate within the academic community to find researcher-testers. We will share the results and lessons learned from this experimentation: the first tests conducted with the researchers have allowed us to obtain feedback on the tools deployed and the improvements to be made to this experimental service. |
11:00am - 12:30pm | SES-02: FINDING MEANING IN WEB ARCHIVES Location: Theatre 2 Session Chair: Vladimir Tybin, Bibliothèque nationale de France These presentations will be followed by a 10 min Q&A. |
|
11:00am - 11:20am
Leveraging Existing Bibliographic Metadata to Improve Automatic Document Identification in Web Archives. 1University of North Texas, United States of America; 2University of Illinois Chicago, United States of America The University of North Texas Libraries, partnering with the University of Illinois Chicago (UIC) Computer Science Department, has been awarded a research and development grant (LG-252349-OLS-22) from the Institute of Museum and Library Services in the United States to continue work from previously awarded projects (LG-71-17-0202-17) related to identification and extraction of high-value publications from large web archives. This work will investigate the potential of using existing bibliographic metadata from library catalogs and digital library collections to better train machine learning models that can assist librarians and information professionals in identifying and classifying high-value publications from large web archives. The project will focus on extracting publications related to state government document collections from the states of Texas and Michigan with the hopes that this approach will enable other institutions interested in leveraging their existing web archives to assist in building traditional digital collections with these publications. This presentation will present an overview of the project with a description of the approaches the research team is exploring to leverage existing bibliographic metadata to assist in building machine learning models for publication identification from web archives. It will also cover early findings from the first year of research, next steps, and how institutions can apply this research to their own web archives. 11:20am - 11:40am
Conceptual Modeling of the Web Archiving Domain Masaryk University, Czech Republic Web archives collect and preserve complex digital objects. This complexity, along with the large scope of archived websites and the dynamic nature of web content, makes sustainable and detailed metadata description challenging. Different institutions have taken various approaches to metadata description within the web archiving community, yet this diversity complicates interoperability. The OCLC Research Library Partnership Web Archiving Metadata Working Group took a significant step forward in publishing user-centered descriptive metadata recommendations applicable across common metadata formats. However, there is no shared conceptual model for understanding web archive collections. In my research, I examine three conceptual models from within the GLAM domain, IFLA-LRM created by the library community, CIDOC-CRM originating from the museum community, and RiC-CM stemming from the archive community. I will discuss what insight they bring to understanding the content within web archives and their potential for supporting metadata practices that are flexible, scalable, meet the requirements of the end users, and are interoperable between web archives as well as the broader cultural heritage domain. This approach sheds light on common problems encountered in metadata description practice in a bibliographic context by modeling archived web resources according to IFLA-LRM and showing how constraints within RDA introduce complexity without providing tools for feasibly representing this complexity in MARC 21. On the other hand, object-oriented models, such as CIDOC-CRM, can represent at least the same complexity of concepts as IFLA-LRM but without many of the aforementioned limitations. 
By mapping our current descriptive metadata and automatically generated administrative metadata to a single comprehensive model and publishing it as open linked data, we can not only more easily exchange metadata but also provide a powerful tool for researchers to make inferences about the past live web by reconstructing the web harvesting process using log files and available metadata. While the work presented is theoretical, it provides a clearer understanding of the web archiving domain. It can be used to develop even better tools for managing and exploring web archive collections. 11:40am - 12:00pm
Web Archives & Machine Learning: Practices, Procedures, Ethics Internet Archive, United States of America Given their size, complexity, and heterogeneity, web archives are uniquely suited to leverage and enable machine learning techniques for a variety of purposes. On the one hand, web collections increasingly represent a larger portion of the recent historical record and are characterized by longitudinality, format diversity, and large data volumes; this makes them highly valuable in computational research by scholars, scientists, and industry professionals using machine learning for scholarship, analysis, and tool development. Few institutions, however, are yet facilitating this type of access or pursuing these types of partnerships and projects given the specialized practices, skills, and resources required. At the same time, machine learning tools also have the potential to improve internal procedures and workflows related to web collections management by custodial institutions, from description to discovery to quality assurance. Projects applying machine learning to web archive workflows, however, also remain a nascent, if promising, area of work for libraries. There is also a “virtuous loop” possible between these two functional areas of access support and collections management, wherein researchers utilizing machine learning tools on web archive collections can create technologies that then have internal benefits to the custodial institutions that granted access to their collections. Finally, spanning both external researcher uses and internal workflow applications is an intricate set of ethical questions posed by machine learning techniques. Internet Archive has been partnering with both academic and industry research projects to support the use of web archives in machine learning projects by these communities. 
Simultaneously, IA has also explored prototype work applying machine learning to internal workflows for improving the curation and stewardship of web archives. This presentation will cover the role of machine learning in supporting data-driven research, the successes and failures of applying these tools to various internal processes, and the ethical dimensions of deploying this emerging technology in digital library and archival services. 12:00pm - 12:20pm
From Small to Scale: Lessons Learned on the Requirements of Coordinated Selective Web Archiving and Its Applications 1Eötvös Loránd University, Department of Digital Humanities, Budapest, Hungary; 2National laboratory for Digital Humanities, Budapest, Hungary Today, web archiving is measured on an increasingly large scale, pressuring newcomers and independent researchers to keep up with the pace of development and maintain an expensive ecosystem of expertise and machinery. These dynamics involve a fast and broad collection phase, resulting in a large pool of data, followed by a slower enrichment phase consisting of cleaning, deduplication and annotation. Our streamlined methodology for specific web archiving use cases combines mainstream practices with new open-source tools. Our custom crawler conducts selective web archiving for portals (e.g. blogs, forums, currently applied to Hungarian news providers), using the taxonomy of the given portal to systematically extract all articles exclusively into portal-specific WARC files. As articles have a uniform portal-dependent structure, they can be transformed into a portal-independent TEI XML format individually. This methodology enables assets (e.g. video) to be archived separately on demand. We focus on textual content, which, with traditional web archives, would require resource-intensive filtering. Alternatives like trafilatura are limited to automatic content extraction, often yielding invalid TEI or incomplete metadata, unlike our semi-automatic method. Resulting data are deposited by grouping portals under specific DOIs, enabling fine-grained access and version control. With almost 3 million articles from more than 20 portals, we developed a library for executing common tasks on these files, including NLP and format conversion, to overcome the difficulties of interacting with the TEI standard. 
To provide access to our archive and gain insights through faceted search, we created a light-weight trend viewer application to visualize text and descriptive metadata. Our collaborations with researchers have shown that our approach makes it easy to merge coordinated separate crawls promoting small archives created by different researchers, who may have lower technical skills, into a comprehensive collection that can in some respects serve as an alternative to mainstream archives. Balázs Indig, Zsófia Sárközi-Lindner, and Mihály Nagy. 2022. Use the Metadata, Luke! – An Experimental Joint Metadata Search and N-gram Trend Viewer for Personal Web Archives. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, pages 47–52, Taipei, Taiwan. Association for Computational Linguistics. |
12:30pm - 1:30pm | LUNCH |
1:30pm - 2:30pm | SES-03 (PANEL): INSTITUTIONAL WEB ARCHIVING INITIATIVES TO SUPPORT DIGITAL SCHOLARSHIP Location: Theatre 1 Session Chair: Martin Klein, Los Alamos National Laboratory |
|
Institutional Web Archiving Initiatives to Support Digital Scholarship 1Los Alamos National Laboratory, United States of America; 2Old Dominion University, United States of America; 3Texas A&M University, United States of America; 4New York University, United States of America Panel description: Individual: Emily: Title: Source Code Archiving for Scholarly Publications Abstract: Git Hosting Platforms (GHPs) are commonly used by software developers and scholars to host source code and data to make them available for collaboration and reuse. However, GHPs and their content are not permanent. Gitorious and Google Code are examples of GHPs that are no longer available even though users deposited their code expecting an element of permanence. Scholarly publications are well-preserved due to current archiving efforts by organizations like LOCKSS, CLOCKSS, and Portico; however, no analogous effort has yet emerged to preserve the data and code referenced in publications, particularly the scholarly code hosted online in GHPs. The Software Heritage Foundation is working to archive public source code, but issue threads, pull requests, wikis, and other features that add context to the source code are not currently preserved. Institutional repositories seek to preserve all research outputs which include data, source code, and ephemera; however, current publicly available implementations do not preserve source code and its associated ephemera, which presents a problem for scholarly projects where reproducibility matters. To discuss the importance of institutions archiving scholarly content like source code, we first need to understand the prevalence of source code within scholarly publications and electronic theses and dissertations (ETDs). We analyzed over 2.6 million publications across three categories of sources: preprints, peer-reviewed journals, and ETDs. 
We found that authors are increasingly referencing the Web in their scholarly publications with an average of five URIs per publication in 2021, and one in five arXiv articles included at least one link to a GHP. In this panel, we will discuss some of the questions that result from these findings such as: Are these GHP URIs still available on the live Web? Are they available in Software Heritage? Are they available in web archives and if so, how often and how well are they archived? Sarah: Title: Designing a Sociotechnical Intervention for Reference Rot in Electronic Theses Abstract: Intertwined publication and preservation practices have become widespread in the establishment of institutional digital repositories and libraries’ stewardship of institutional research output, including open educational resources and electronic theses and dissertations. Most digital preservation work seeks to preserve a whole text, like a dissertation, in a digital form. This presentation reports on an ongoing research effort - a collaboration with Klein, Potvin, Katherine Anders, and Tina Budzise-Weaver - intended to prevent potential information loss within the thesis, through interventions that can be integrated into trainings and thesis management tools. This approach draws on research into graduate training and citation practices, web archiving, open source software development, and digital collection stewardship with a goal of recommending systematized sociotechnical interventions to prevent reference rot in institutionally-hosted graduate theses. Findings from qualitative surveys and interviews conducted at Texas A&M University on graduate student perceptions of reference rot will be detailed. Vicky/Talya Title: Collaborating on Software Archiving for Institutions Abstract: Inarguably, software and code are part of our scholarly record. Software preservation is a necessary prerequisite for long-term access and reuse of computational research, across many fields of study. 
Open research software is shared on the Web most commonly via Git hosting platforms (GHPs), which are excellent for fostering open source communities and transparency of research, and which add useful features on top such as wikis, continuous integration, merge requests, and issue threads. However, the source code and the useful scholarly ephemera (e.g. wikis) are archived separately, often by “breadth over depth” approaches. I’ll discuss the Collaborative Software Archiving for Institutions (CoSAI) project from NYU, LANL, ODU, and OCCAM, which is addressing this pressing need to provide machine-repeatable, human-understandable workflows for preserving web-based scholarship, scholarly code in particular, alongside the components that make it most useful. I’ll present the results of ongoing efforts in the three main streams of work: 1) technical development on open source, community-led tools for collecting, curating, and preserving open scholarship with a focus on research software, 2) community building around open scholarship, software collection and curation, and archiving of open scholarship, and 3) optimizing workflows for archiving open scholarship with ephemera, via machine-actionable and manual workflows. |
1:30pm - 2:30pm | SES-04 (PANEL): SOLRWAYBACK: BEST PRACTICE, COMMUNITY USAGE & ENGAGEMENT Location: Theatre 2 Session Chair: Thomas Langvann, National Library of Norway |
|
SolrWayback: Best practice, community usage and engagement 1Royal Danish Library (KB); 2National Library of Luxembourg (BnL); 3Bibliotheca Alexandrina (BA); 4National Library of France (BnF) Panel description This panel will focus on the status quo of SolrWayback, implementations of SolrWayback and where it's heading in the future, including the growing open source community adapting SolrWayback and contributing to developing the tool, making it more resilient. Thomas Egense will give an update on the current development and the flourishing user community and some thoughts on making SolrWayback even more resilient in the future. László Tóth will talk about the National Library of Luxembourg's (BnL) development of a fully automated archiving workflow comprised of the capture, indexing and playback of Luxembourgish news websites. The solution combines the powerful features of SolrWayback such as full-text search, wildcard search, category search and more, with the high playback quality of PyWb. Youssef Eldakar will present the way SolrWayback has enhanced the way researchers can search for content and view the 18 IIPC special collections and also bring up some considerations about scaling the system. Sara Aubry will present how the National Library of France (BnF) has been using SolrWayback to give researcher teams the possibility to explore, analyze and visualize specific collections. She will also share how BnF contributed to the application development, including the extension of data visualisation features. Thomas Egense: Increasing community interactions and the near future of SolrWayback During the last year, the number of community interactions such as direct email questions, bugs/feature requests posted on github jira, has increased every week. It is indeed good news that so many Libraries/Institutions or researchers already have embraced SolrWayback, but to keep up this momentum more community engagement will be welcomed for this open source project. 
By submitting a feature request or bug report on GitHub, you will help prioritize what will benefit the most, so do not hold back. More programmers for backend (Java) or frontend (GUI) would speed up the development of SolrWayback. Recently BnF helped improve some of the visualization tools by allowing shorter time intervals instead of years. For newly established collections this is a much more useful visualization. It is a good example of the different needs of new collections just 1 year old compared to collections with 25 years of web harvests. It was not in our focus, though it was a very useful improvement. In the very near future I expect that more time will be used on supporting new users attempting to implement SolrWayback. Also the hybrid SolrWayback combined with PyWb for playback seems to be the direction many choose to go. And finally large collections will run into a Solr scaling problem that can be solved by switching to SolrCloud. There is a need for better documentation and workflow support in the SolrWayback bundle for this scaling issue. László Tóth: A Hybrid SolrWayback-PyWb playback system with parallel indexing using the Camunda Workflow Engine Within the framework of its web archiving programme, the National Library of Luxembourg (BnL) develops a fully automated archiving workflow comprised of the capture, indexing and playback of Luxembourgish news websites. Our workflow design takes into account several key features such as the efficiency of crawls (both in time and space) and of the indexing processes, all while providing high quality end user experience. In particular, we have chosen a hybrid approach for the playback of our archived content, making use of several well-known technologies in the field.
One year ago, we presented a joint effort, spanning the IIPC Research Working Group, the IIPC Content Development Working Group, and Bibliotheca Alexandrina, to republish the IIPC collections for researcher access through alternative interfaces, namely, LinkGate and SolrWayback.
Sara Aubry: SolrWayback at the National Library of France (BnF) : an exploration tool for researchers and the web archiving team engagement to contribute to its evolution With the opening of its DataLab in October 2021 and the Respadon project (which will also be presented during the WAC), BnF web archiving team is currently concentrating on the development of services, tools, methods and documentation to ease the understanding and appropriation of web archives for research. The underlying objective is to provide the research community, along with information professionals, with a diversity of tools dedicated to the building, exploring and analysis of web corpora. Among all tools we have tested with researchers, SolrWayback has a particular place because of its simplicity to handle and its rich functionalities. Beyond a first contact with the web archives, it allows researchers to question and analyze the focused collections to which it gives access. This presentation will focus on researcher feedback using SolrWayback, how the application promotes the development of skills on web archives, and how we accompany researchers in the use of this application. We will also present how research use and feedback has led us to contribute to the development of this open source tool. |
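As an illustration of the full-text, wildcard and filtered searches mentioned in the panel description above, the following is a minimal sketch of how a query against a Solr index might be assembled. It is not taken from SolrWayback itself. The `/select` handler and the `q`, `fq`, `rows` and `wt` parameters are standard Solr; the base URL, the collection name and the field names `domain` and `crawl_year` are assumptions made for this example and should be checked against the schema of the index in use.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_solr_query_url(base_url, text, domain=None, year=None, rows=10):
    """Build a Solr /select URL for a text query with optional filter queries.

    `domain` and `crawl_year` are assumed field names, not confirmed ones.
    """
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    if domain is not None:
        # Filter queries narrow the result set without affecting scoring.
        params.append(("fq", f'domain:"{domain}"'))
    if year is not None:
        params.append(("fq", f"crawl_year:{year}"))
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


# Hypothetical Solr base URL and collection name.
url = build_solr_query_url(
    "http://localhost:8983/solr/mycollection", "covid*", domain="example.lu", year=2022
)
decoded = parse_qs(urlparse(url).query)
print(decoded["q"], decoded["fq"])
```

The trailing `*` in `covid*` is Solr's standard wildcard syntax; the URL is only constructed here, not fetched, so the sketch runs without a Solr server.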
1:30pm - 3:30pm | WKSHP-01: DESCRIBING COLLECTIONS WITH DATASHEETS FOR DATASETS Location: Labs Room 1 (workshops) Pre-registration required for this event. |
|
Describing Collections with Datasheets for Datasets 1University of Illinois; 2British Library, United Kingdom Significant work in web archives scholarship has focused on addressing the description and provenance of collections and their data. For example, Dooley et al. (2018) propose recommendations for descriptive metadata, and Maemura et al. (2018) develop a framework for documenting elements of a collection’s provenance. Additionally, documentation of the data processing and curation steps towards generating a corpus for computational analysis are described extensively in Brügger (2021), Brügger, Laursen & Nielsen (2019) and Brügger, N., Nielsen, J., & Laursen, D. (2020). However, looking beyond libraries, archives, or cultural heritage settings provides alternative forms for the description of data. One approach to the challenge of describing large datasets comes from the field of machine learning where Gebru et al. (2018, 2021) propose developing “Datasheets for Datasets,” a form of short document answering a standard set of questions arranged by stages of the data lifecycle. This workshop explores how web archives collections can be described using the framework provided by Datasheets for Datasets. Specifically, this work builds on the template for datasheets developed by Gebru et al. that is arranged into seven sections: Motivation; Composition; Collection Process; Preprocessing/Cleaning/Labeling; Use; Distribution; and, Maintenance. The workflow they present includes a total of 57 questions to answer about a dataset, focusing on the specific needs of machine learning researchers. We consider how these questions can be adopted for the purposes of describing web archives datasets. Participants will consider and assess how each question might be adapted and applied to describe datasets from the UK Web Archive curated collections. 
After a brief description of the Datasheets for Datasets framework, we will break into small groups to perform a card-sorting exercise. Each group will evaluate a set of questions from the Datasheets framework and assess them using the MoSCoW technique, sorting questions into categories of Must, Should, Could, and Won’t have. Groups will then describe their findings from the card-sorting exercise in order to generate a broader discussion of priorities and resources available for generating descriptive metadata and documentation for public web archives datasets. Format: 120-minute workshop where participants will do a card-sorting activity in small groups to review the practicalities of the Datasheets for Datasets framework when applied to web archives. Ideally participants can prepare by reading through questions prior to the workshop. We anticipate the following schedule:
Target Audience: Web Archivists, Researchers Anticipated number of participants: 12-16 Technical requirements: overhead projector with computer and large tables for a big card sorting activity. Learning outcomes:
Coordinators: Emily Maemura (University of Illinois), Helena Byrne (British Library) Emily Maemura is an Assistant Professor in the School of Information Sciences at the University of Illinois Urbana-Champaign. Her research focuses on data practices and the activities of curation, description, characterization, and re-use of archived web data. She completed her PhD at the University of Toronto's Faculty of Information, with a dissertation exploring the practices of collecting and curating web pages and websites for future use by researchers in the social sciences and humanities. Helena Byrne is the Curator of Web Archives at the British Library. She was the Lead Curator on the IIPC Content Development Group 2022, 2018 and 2016 Olympic and Paralympic collections. Helena completed a Master’s in Library and Information Studies at University College Dublin, Ireland in 2015. Previously she worked as an English language teacher in Turkey, South Korea, and Ireland. Helena is also an independent researcher that focuses on the history of women's football in Ireland. Her previous publications cover both web archives and sports history. References Brügger, N. (2021). Digital humanities and web archives: Possible new paths for combining datasets. International Journal of Digital Humanities. https://doi.org/10.1007/s42803-021-00038-z Brügger, N., Laursen, D., & Nielsen, J. (2019). Establishing a corpus of the archived web: The case of the Danish web from 2005 to 2015. In N. Brügger & D. Laursen (Eds.), The historical web and digital humanities: The case of national web domains (pp. 124–142). Routledge/Taylor & Francis Group. Brügger, N., Nielsen, J., & Laursen, D. (2020). Big data experiments with the archived Web: Methodological reflections on studying the development of a nation’s Web. First Monday. https://doi.org/10.5210/fm.v25i3.10384 Dooley, J., & Bowers, K. (2018). 
Descriptive Metadata for Web Archiving: Recommendations of the OCLC Research Library Partnership Web Archiving Metadata Working Group. OCLC Research. https://doi.org/10.25333/C3005C Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2018). Datasheets for Datasets. ArXiv:1803.09010 [Cs]. http://arxiv.org/abs/1803.09010 Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92. https://doi.org/10.1145/3458723 Maemura, E., Worby, N., Milligan, I., & Becker, C. (2018). If These Crawls Could Talk: Studying and Documenting Web Archives Provenance. Journal of the Association for Information Science and Technology, 69(10), 1223–1233. https://doi.org/10.1002/asi.24048 |
2:30pm - 2:40pm | BREAK |
2:40pm - 3:50pm | SES-05: COVID-19 COLLECTIONS Location: Theatre 1 Session Chair: Kees Teszelszky, KB, National Library of the Netherlands These presentations will be followed by a 10 min Q&A. |
|
2:40pm - 3:00pm
The UK Government Web Archive (UKGWA): Measuring the impact of our response to the COVID-19 pandemic The National Archives, United Kingdom The COVID-19 pandemic, the first pandemic of the digital age, has presented an enormous challenge to our web archiving practice. As the official archive of the UK government, we were tasked with building a comprehensive archive of the UK government's online response to the emergency. To meet this challenge we have devised new archiving strategies ranging from supplementary broad, keyword-driven crawling to focused, data-driven, daily captures of the UK’s official “Coronavirus (COVID-19) in the UK” data dashboard. We have also massively increased our rates of capture. The challenge has demanded creativity, adaptation and a great deal of effort. All of this work prompted us to think of a number of questions that we’d like to answer: How complete is the record we captured in our web archive and how much is this a result of the extra effort we made? How could we perform meaningful analysis on the enormous numbers of HTML and non-HTML resources? What contributions have these innovations made to this outcome and how can these inform our practice going forward? To tackle these questions we needed to analyse millions of captured resources in our web archive. It soon became clear that we’d only be able to achieve the level of insight needed by developing an entire end-to-end analysis system. The resulting pipeline we designed and built uses a combination of familiar and novel concepts and approaches; we used the WARC file content, along with CDX APIs, but we also developed a set of heuristics, and custom algorithms, all ultimately populating a database that allowed us to run queries to give us the answers we sought. Running an entirely cloud-based system enabled this work as we were at that time unable to reliably access our office. 
This presentation will provide an overview of the approaches used, the results we found and the areas for further development. We believe that these tools can be applied to our overall web archive collections and hope that other institutions will find our experience useful when thinking about analysing their own collection and quantifying the impact of their efforts. 3:00pm - 3:20pm
Women and COVID through Web Archives. How to explore the pandemic through a collaborative, interdisciplinary research approach 1University of Groningen, Netherlands, The; 2Leiden University, The Netherlands; 3University of Luxembourg, Luxembourg; 4Aix-Marseille University, France; 5Aarhus University, Denmark The COVID crisis has been a shared worldwide and collective experience since March 2020, and a lot of voices have echoed each other, be it in relation to grief, lockdown, masks and vaccines, homeschooling, etc. However, this unprecedented crisis has also deepened asymmetries and failures within societies, in terms of occupational fields, economic inequalities, health and sanitary access, and we could extend the inventory of these hidden and more visible gaps that were reinforced during the crisis. Women and gender were also at stake when it came to this sanitary crisis, be it in discussions of the better management of the crisis by female politicians, domestic violence during the lockdown, decreasing production of papers by female research scientists, homeschooling and mental load of women, etc. As a cohort team within the Archives Unleashed Team (AUT) program, the European research AWAC2 team benefited from a privileged access to this collection, thanks to Archive-It and through ARCH, and from regular mentorship by the AUT team. It allowed us to investigate and analyse this huge collection of 5.3 TB, with 161,757 lines in the domain-frequency CSV and 8,738,751 lines in the CSV of web page plain text. In December 2021, our AWAC2 team submitted several topics to the IIPC (International Internet Preservation Consortium) community and invited the international organization to select one of them that the team would investigate in depth, based on the unique IIPC COVID collection of web archives. Women, gender, and COVID was the winning topic. 
Accepting the challenge, the AWAC2 team organized a datathon in March 2022 in Luxembourg to investigate and retrieve the many traces of women, gender and COVID in web archives, while mixing close and distant reading. Since then, the team has been working on the dataset to further explore the opportunities for computational methods for reading at scale. In this presentation, we will reflect on technical, epistemological, and methodological challenges and present some results as well. 3:20pm - 3:40pm
Surveying the landscape of COVID-19 web collections in European GLAM institutions 1British Library, United Kingdom; 2KBR (Royal Library of Belgium); 3Royal Danish Library; 4Leiden University The aim of the WARCnet network [https://cc.au.dk/en/warcnet/about] is to promote high-quality national and transnational research that will help us to understand the history of (trans)national web domains and of transnational events on the web, drawing on the increasingly important digital cultural heritage held in national web archives. Within the context of this network, a survey was conducted to see how cultural heritage institutions are capturing the COVID-19 crisis for future generations. The aim of the survey was to map the scope and collection strategies of COVID-19 Web collections with a main focus on Europe. The survey was managed by the British Library and was conducted by means of the Snap survey platform. It circulated between June and September 2022 among mainly European GLAM institutions and 61 responses were obtained. The purpose of this presentation is to provide an overview of the different collection development practices when curating COVID-19 collections. On the one hand, the results may support GLAM institutions to gain further insights into how to curate COVID-19 Web collections or identify potential partners. On the other hand, revealing the scope of these Web collections may also encourage humanists and data scientists to unlock the potential of these archived Web sources to further understand international developments on the Web during the COVID-19 pandemic. More concretely, the presentation will provide further insight into the local, regional, national or global scopes of the different COVID-19 collections, the type of content that is included in the collections, the available metadata, the selection criteria that were used when curating the collections and the efforts that were made to create inclusive collections. 
The temporality of the collections will also be discussed by highlighting the start, and, if applicable, end dates of the collections and the capture frequency. Quality control and long-term preservation are two further elements that will be discussed during the presentation. |
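The UKGWA abstract earlier in this session describes a pipeline that draws on CDX data and populates a database for querying. A minimal sketch of that general idea, not the UKGWA implementation, is given below. It assumes space-separated CDX lines whose first five fields are URL key, 14-digit timestamp, original URL, MIME type and HTTP status; the sample lines, table name and column names are invented for illustration.

```python
import sqlite3

# Made-up sample lines in an assumed CDX layout:
# urlkey timestamp original mimetype status digest length
SAMPLE_CDX = """\
org,example)/coronavirus 20200401120000 https://example.org/coronavirus text/html 200 AAAA 1234
org,example)/coronavirus 20200402120000 https://example.org/coronavirus text/html 200 BBBB 1250
org,example)/guidance.pdf 20200402130000 https://example.org/guidance.pdf application/pdf 200 CCCC 90210
org,example)/missing 20200503090000 https://example.org/missing text/html 404 DDDD 512
"""


def load_cdx(conn, cdx_text):
    """Insert parseable CDX lines into a `captures` table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS captures "
        "(urlkey TEXT, ts TEXT, url TEXT, mime TEXT, status INTEGER)"
    )
    rows = []
    for line in cdx_text.splitlines():
        fields = line.split()
        # Skip header or malformed lines rather than guessing at their meaning.
        if len(fields) < 5 or not fields[4].isdigit():
            continue
        rows.append((fields[0], fields[1], fields[2], fields[3], int(fields[4])))
    conn.executemany("INSERT INTO captures VALUES (?, ?, ?, ?, ?)", rows)
    return len(rows)


def successful_captures_per_month(conn):
    """Count status-200 captures grouped by month (YYYYMM) and MIME type."""
    cur = conn.execute(
        "SELECT substr(ts, 1, 6) AS month, mime, COUNT(*) FROM captures "
        "WHERE status = 200 GROUP BY month, mime ORDER BY month, mime"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
loaded = load_cdx(conn, SAMPLE_CDX)
summary = successful_captures_per_month(conn)
print(loaded, summary)
```

Grouping by month and MIME type is one simple way to compare capture rates before and during an event and to separate HTML from non-HTML resources; the heuristics and custom algorithms mentioned in the abstract are not reproduced here.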
2:40pm - 3:50pm | SES-06: SOCIAL MEDIA & PLAYBACK: COLLABORATIVE APPROACHES Location: Theatre 2 Session Chair: Susanne van den Eijkel, KB, National Library of the Netherlands These presentations will be followed by a 10 min Q&A. |
|
2:40pm - 3:00pm
Archiving social media in Flemish cultural or private archives, (how) is it possible 1KADOC-KU Leuven, Belgium; 2meemoo, Belgium Social media are increasingly replacing other forms of communication. In doing so, they are also becoming an important source to archive in order to preserve the diverse voices in society for the long term. However, few Flemish archival institutions currently archive this type of content. To remedy this situation, a number of private archival institutions in Flanders started research on sustainable approaches and methods to capture and preserve social media archives. Confronted with the complex reality of this new landscape however, this turned out to be a rather challenging undertaking. Through the lens of our project 'Best practices for social media archiving in Flanders and Brussels', we’ll look at the lessons learned and the central challenges that remain for social media archiving in private archival institutions in Flanders. Many of these lessons and challenges transcend this project and concern the broader web archiving community and cultural heritage sector. Unsurprisingly, to a lot of (often smaller) private archival institutions in Belgium archiving social media remains a major challenge either because of a lack of (new) digital archiving competencies or the availability of (often expensive and quickly outdated) technical solutions in heritage institutions. On top of that, there are major legal challenges. For one, these archives cannot fall back on archival law or legal deposit law as a legal basis. In addition, the quickly evolving European and national privacy and copyright regulations form a maze of rules and exceptions they have to find their way in and keep up with. One last stumbling block is proving particularly hard to overcome. It concerns the legal and technical restrictions the social media platforms themselves impose on users. 
These make it practically impossible for heritage institutions to capture and preserve the integrity of social media content in a sustainable way. We believe this problem is best to be addressed by the international web archiving, research and heritage community as a whole. This is only one of the recommendations we’re proposing to improve the situation as part of the set of ‘best practices’ we developed and which we would like to present here in more detail. 3:00pm - 3:20pm
Searching for a Little Help From My Friends: Reporting on the Efforts to Create an (Inter)national Distributed Collaborative Social Media Archiving Structure 1International Institute of Social History; 2KADOC Documentation and Research Centre on Religion, Culture, and Society; 3Amsterdam City Archives; 4KB, National Library of the Netherlands Social media archiving in cultural heritage and government is still at an experimental stage with regard to organizational readiness for and sustainability of initiatives. The many different tools, the variety of platforms, and the intricate legal and ethical issues surrounding social media do not readily allow for immediate progress and uptake by organizations interested or mandated to preserve social media content for the long term. In Belgium and the Netherlands, the last three years have seen a series of promising projects on building social media archiving capacity, mostly focusing on heritage and research. One of their most important findings is that the multiple needs and requirements of successful social media archiving are difficult for any one organization to tackle; efforts to propose good practices or establish guidelines often run into the reality of the many and sometimes clashing priorities of different domains, e.g. archives, libraries, local and national government, research. Faced with little time and increasing costs, managers and funders are generally reluctant to support social media archiving as an integral part of collecting activity, as it is seen as a nice-to-have but not crucial part of their already demanding core business. 
Against this background, we set out to bring together representatives of different organizations from different sectors in Belgium and the Netherlands to research the possibilities for what a distributed collaborative approach to social media archiving could look like, including requirements for sharing knowledge and experiences systematically and efficiently, sharing infrastructure and human and technical resources, prioritization, and future-proofing the initiative. In order to do this, we look into:
Through interviews with staff and managers of interested organizations, we want to find out if there is potential in thinking about social media archiving as a truly collaborative venture. We would like to discuss the progress of this research and the ideas and challenges we have come up against. 3:20pm - 3:40pm
Collaborating On The Cutting Edge: Client Side Playback Library Innovation Lab, United States of America Perma.cc is a project of the Library Innovation Lab, which is based within the Harvard Law School Library and exists as a unit of a large academic institution. Our work has been focused in the past mainly on the application of web archiving technology as it relates to citation in legal and scholarly writing. However, we also have spent time exploring expansive topics in the web archiving world - oftentimes via close collaboration with the Webrecorder project - and most recently have built tools leveraging new client-side playback technology made available by replayweb.page. warc-embed is LIL's experimental client-side playback integration boilerplate, which can be used to test out and explore this new technology, along with its potential new applications. It consists of: a simple web server configuration that provides web archive playback; a preconfigured “embed” page that can be easily implemented to interact with replayweb.page; and a two-way communication layer that allows the replay to reliably and safely communicate with the archive. These features are replicable for a relatively non-technical audience and thus we sought to explore small scale applications of it outside of our group. This is one of two proposals relating to this work. We believe there is an IIPC audience who could attend either or both sessions based on their interests. They explore separate topics relating to the core technology. This session will look into user applications of the tool and institutional user feedback from the Harvard Library community. Our colleagues at Harvard use the Internet Archive’s Archive-It across the board for the majority of their web archiving collections and access. As an experiment, we have worked with some of them to host and serve their .warcs via warc-embed. 
We scoped work based on their needs and made adjustments based on their ability to apply the technology. One example of this is a refresh of the software to be able to mesh with WordPress, which was more easily managed directly by the team. This session will explore a breakdown of roadblocks, design strategies, and wins from this collaboration. It will focus on the end-user results and applications of the technology. |
3:50pm - 4:20pm | BREAK |
4:20pm - 5:30pm | SES-07: COLLABORATIONS & OUTREACH Location: Theatre 1 Session Chair: Ben Els, National Library of Luxembourg These presentations will be followed by a 10 min Q&A. |
|
4:20pm - 4:40pm
Linking web archiving with arts and humanities: the collaboration between ROSSIO and Arquivo.pt Arquivo.pt - Fundação para a Ciência e Tecnologia, I.P., Portugal ROSSIO and Arquivo.pt developed collaborative activities with the goal of connecting web archiving, arts and digital humanities, between 2018 and 2022. How to make Web archives useful and accessible to digital humanities researchers, and by extension to citizens? This challenge was answered in three ways: training, dissemination, and collaborative curation of websites. This presentation aims to describe those collaborative activities and share what we’ve learned from them. ROSSIO is a Portuguese infrastructure for the Social Sciences, Arts and Humanities (https://rossio.fcsh.unl.pt/). Its mission is to aggregate, contextualize, enrich and disseminate digital content. It is based at the Faculty of Social and Human Sciences of the NOVA University of Lisbon (FCSH-NOVA) and involves several institutions that provide content. Arquivo.pt's mission (https://arquivo.pt) is to preserve the Portuguese Web and make available contents from the Web since 1996 to everyone, from ordinary citizens to researchers. ROSSIO contributed human resources, namely, a web curator, a community manager, a web developer, and researchers who used Arquivo.pt in their work. Arquivo.pt in turn contributed its know-how, created new services (e.g., the SavePageNow) and made available open data sets. Therefore, we describe the activities carried out in collaboration and their results. First, regarding training, we refer to face-to-face and online sessions held with ROSSIO partners and their communities. We highlight the initiative "Café with Arquivo.pt" (https://arquivo.pt/cafe) and the webinars held during the pandemic, because they strengthened the connection between Arquivo.pt and distant communities (e.g., in 2021 they had 538 participants and an 84% satisfaction rate). 
Second, continuous dissemination through the social networks and groups of the ROSSIO partners helped to make Arquivo.pt better known (e.g., 7,300 new users accessed the service between 2018 and 2021). Third, ROSSIO researchers collaborated in curating websites, which resulted in documentation for studies and online exhibitions (e.g. “Times of illness, times of healing” at the FCSH NOVA; and "art festivals memory" at the Gulbenkian Art Library). We conclude this presentation by sharing what we learned from participating in ROSSIO, and the challenges that lie ahead for creating a community of practice among art and humanities researchers. 4:40pm - 5:00pm
Building collaborative collections : experience of the Croatian Web Archive National and University Library in Zagreb, Croatia In Croatia, the only institution that archives the web is the National and University Library in Zagreb. The library established the Croatian Web Archive (HAW) and began archiving Croatian web sources in 2004. From then until today, we have developed several approaches to web archiving: selective, .hr crawls, thematic crawls, building local history collections and social media archiving. In order to broaden our collections and raise public awareness as much as possible the Croatian Web Archive is opening up to collaboration with other libraries, as well as all interested citizens. One of the examples is the Building Local History Web project from 2020. That year, the Croatian Web Archive began collaboration with public libraries for the purpose of archiving web resources related to a specific area or homeland. The contents are related to a specific locality with the aim of presenting and ensuring long-term access to local materials that are available only on the web and complement and popularize the local history collection of the public library. In addition to collaboration with public libraries, the Croatian Web Archive has connected with the User Service Department of the National and University Library in Zagreb, in order to involve citizens in the creation of thematic collections through citizen science. In that way the thematic collection “Bees, life, people” was created, using the crowdsourcing method, in collaboration with the public library, citizens (high school students) and other library departments. This presentation will discuss developing a collection policy, collaboration and working process in building local history and citizen science collections. 
The lessons learned throughout collaboration with citizens and public libraries are great encouragement to expand the existing scope of archiving as well as involvement of other libraries and citizens in raising awareness of information literacy and the importance of archiving web content. 5:00pm - 5:20pm
Your Software Development Internship in Web Archiving Bibliotheca Alexandrina, Egypt A summer internship project is an opportunity for the intern to practice in the real world as well as for the host institution to make extra progress on program objectives, while also engaging with the community. Since 2019, Bibliotheca Alexandrina's IT team has been running a summer internship series for undergraduate students of computing, with several of the internship projects having a connection to web archiving. Throughout this experience, our mentors have been finding the young interns much intrigued by the technology involved in archiving the web. From a computing perspective, aside from serving to preserve a quite significant information medium, web archiving is an activity where a number of sub-domains of computing come together. A software project in web archiving will involve, for instance, management of big data to keep pace with how the web and consequently an archive thereof continues to expand in volume, parallel computing to achieve the capacity for both data harvesting and processing at that level of scale, machine learning to find answers to questions about the datasets that can be extracted from a web archive, or network theory and graph analytics to come to more understandable representations of the heavily interlinked data. In this presentation, we invite you to join us on a virtual visit to the home of the IT team at Bibliotheca Alexandrina for a look into our archive of past internship projects in web archiving. These projects include the investigation of alternative graph analytics backends for the implementation of new features in web archive graph visualization, repurposing of the WARC format for use in the library's digital book portal, and crawling the web for text for language model training. For each project, we will review the specific objective, how the problem was addressed, and the outcome. 
Finally, to reflect on the overall experience, we will share lessons learned as well as discuss how the interaction with the community through internships is additionally an opportunity to raise awareness about web archiving, the technology involved, and the work of the International Internet Preservation Consortium (IIPC). |
4:20pm - 5:30pm | SES-08: QUALITY ASSURANCE Location: Theatre 2 Session Chair: Arnoud Goos, Netherlands Institute for Sound & Vision These presentations will be followed by a 10 min Q&A. |
|
4:20pm - 4:40pm
The Auto QA process at UK Government Web Archive The National Archives, United Kingdom The UK Government Web Archive’s (UKGWA) Auto QA process allows us to carry out enhanced data-driven QA almost completely automatically. This is particularly useful for websites that are high-profile or sites that are about to close. Our Auto QA has several advantages over solely visual QA. The advantages enable us to: 1) Identify problems that are not obvious at the visual QA stage. 2) Identify Heritrix errors during the crawl. These include -2 and -6 errors. Once identified, we re-run Heritrix on the affected URIs. 3) Identify and patch URIs that Heritrix could not discover. 4) Identify, test, and patch hyperlinks inside PDFs. Many PDFs contain hyperlinks to a page on the parent website or to other websites. And sometimes the only way to access those pages is through a link in a PDF which most crawlers can't normally access. Auto QA consists of three separate processes: 1) ‘Crawl Log Analysis’ (CLA), which runs on every crawl automatically. CLA examines Heritrix crawl logs and looks for errors. It then tests those errors against the live web. 2) ‘Diffex’, which compares what Heritrix discovered with the output of another crawler such as Screaming Frog. This will identify what Heritrix did not discover. Diffex then tests those URIs against the live web and if they are valid, they are added to a patchlist. 3) ‘PDFflash’, which extracts PDF URIs from Heritrix crawl logs. It then parses them and looks for hyperlinks within PDFs; tests those hyperlinks against the live web, our web archives, and against our in-scope domains. If a hyperlink’s target serves 404 it will be added to our patchlist provided it meets certain conditions such as scoping criteria. UKGWA’s Auto QA is a highly efficient and scalable system that complements visual QA; and we are in the process of making it open source. 4:40pm - 5:00pm
The Human in the Machine: Sustaining a Quality Assurance Lifecycle at the Library of Congress Library of Congress, United States of America This talk will build upon information shared during the IIPC WAC 2022 session Building a Sustainable Quality Assurance Lifecycle at the Library of Congress (Thomas and Lyon). The work to develop a sustainable and effective quality assurance (QA) ecosystem is ongoing and the Library of Congress Web Archiving Team (WAT) is constantly working to improve and streamline workflows. The Library’s web archiving QA goals are structured around Dr. Reyes Ayala’s framework for quality measurements of web archives based in Grounded Theory (Reyes Ayala). During last year’s session, we described how the WAT satisfies the two dimensions of Relevance and Archivability, with some automated processes built in to help the team do its work. We also introduced our idea for Capture Assessment to satisfy the Correspondence dimension of Dr. Reyes Ayala’s framework. In July 2022, the WAT launched the Capture Assessment workflow internally and invited curators of web archives content at the Library to review captures of their selected content. To best communicate issues of Correspondence quality between the curatorial librarians and the WAT, we instituted a rubric where curatorial librarians can ascribe a numeric value to convey quality information from various angles about a particular web capture, alongside a checklist of common issues to easily note. The WAT held an optional training alongside the launch, and since then, there have been over 90 responses from a handful of curatorial librarians, including one power user. The WAT has found responses to be mostly actionable for correction in future crawls. We’ve also seen that Capture Assessments are performed on captures that wouldn’t necessarily be flagged via other QA workflows, which gives us confidence that a wider swath of the archive is being reviewed for quality. 
The session will share more details about the Capture Assessment workflow and, in time for the 2023 WAC session, we intend to complete a small, early analysis of the Capture Assessment responses to share with the wider web archiving community. Reyes Ayala, B. Correspondence as the primary measure of information quality for web archives: a human-centered grounded theory study. Int J Digit Libr 23, 19–31 (2022). https://doi.org/10.1007/s00799-021-00314-x |
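As an illustration of the ‘Diffex’ step described in the UKGWA Auto QA abstract in this session, the comparison between two crawlers' URI lists can be sketched as a set difference followed by a liveness filter. This is a minimal sketch and not UKGWA's implementation: the function names `normalise`, `missing_uris` and `build_patchlist` are hypothetical, the normalisation rules are an assumption, and the live-web check is passed in as a callable so that no network access is implied.

```python
from urllib.parse import urlsplit


def normalise(uri: str) -> str:
    """Lower-case scheme and host and drop the fragment (assumed rules)."""
    parts = urlsplit(uri.strip())
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def missing_uris(heritrix_uris, other_crawler_uris):
    """Return, sorted, the URIs the second crawler found that Heritrix did not."""
    seen = {normalise(u) for u in heritrix_uris}
    return sorted({normalise(u) for u in other_crawler_uris} - seen)


def build_patchlist(candidates, is_live):
    """Keep only the candidates accepted by the supplied liveness check."""
    return [u for u in candidates if is_live(u)]


if __name__ == "__main__":
    heritrix = ["https://example.gov.uk/", "https://example.gov.uk/a"]
    other = ["https://example.gov.uk/a#top", "https://example.gov.uk/b"]
    # Stand-in for a real live-web test (hypothetical: accept everything).
    print(build_patchlist(missing_uris(heritrix, other), lambda u: True))
```

Passing the liveness check as a parameter keeps the comparison logic testable offline; a real check would also need the scoping conditions the abstract mentions.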
4:20pm - 5:30pm | WKSHP-02: A PROPOSED FRAMEWORK FOR USING AI WITH WEB ARCHIVES IN LAMS Location: Labs Room 1 (workshops) Pre-registration required for this event. |
|
A proposed framework for using AI with web archives in LAMs Library of Congress, United States of America There is tremendous promise in using artificial intelligence, and specifically machine learning techniques, to help curators, collections managers and users to understand, use, steward and preserve web archives. Libraries, archives, museums and other public cultural heritage organizations who manage web archives have shared challenges in operationalizing AI technologies and unique requirements for managing digital heritage collections at a very large scale. Through research, experimentation and collaboration the LC Labs team has developed a set of tools to document, analyze, prioritize and assess AI technologies in a LAM context. This framework is in draft form and in need of additional use cases and perspectives, especially web archives use cases. The facilitators will introduce the framework and ask participants to use it to evaluate their own proposed or in-process ML or AI use case that increases understanding of and access to web archives. Sharing the framework elements, gathering feedback, and documenting web archives use cases are the goals of the workshop. - Define the problem you are trying to solve. - Write a user story about the AI/ML task or system you are planning/doing. - Risks and Benefits: What are the benefits and risks to users, staff and the organization when an AI/ML technology is/will be used? - What systems or policies will/do the AI/ML task or system impact or touch? - What are the limitations of future use of any training, target, validation or derived data? - What are the success metrics and measures for the AI/ML task? - What are the quality benchmarks for the AI/ML output? - What could come next? |
5:30pm - 6:10pm | POS-1: LIGHTNING & DROP-IN TALKS Location: Theatre 1 Session Chair: Abbie Grotke, Library of Congress 1 minute drop-in talks will immediately follow lightning talks. After the session ends, lightning talk presenters will be available for questions in the atrium, where their posters will be on display. Drop-in talk schedule: Quick Overview of Perma Tools List Clare Stanton, Perma.cc Engineering Updates from Internet Archive Alex Dempsey, Internet Archive Mapping News in the Norwegian Web Archive Jon Carlstedt Tønnessen, National Library of Norway |
|
Memory in Uncertainty – The Implications of Gathering, Storing, Sharing and Navigating Browser-based Archives New Design Congress, Germany How do we save the past in a violent present for an uncertain future? As societal digitisation accelerates, so too has the belligerence of state and corporate power, the democratisation of targeted harassment, and the collapse of consent by communities plagued by ongoing (and often unwanted) datafication. Drawing from political forecasts and participatory consultation with practitioners and communities, this research examines the physical safety of data centres, the socio-technical issues of the diverse practice of web-based archiving, and the physical and mental health of archive practitioners and communities subjected to archiving. This research identifies and documents issues of ethics, consent, digital security, colonialism, resilience, custodianship and tool complexity. Despite the systemic challenges identified in the research, and the broad lag in response from tool makers and other actors within the web archiving discipline, there exist compelling reasons to remain optimistic. Emergent technologies, stronger socio-technical literacy amongst archivists, and critical interventions in the colonial structures of digital systems offer immediate points of intervention. By acknowledging the shortcomings of cybernetics, resisting the desire to apply software solutionism at scale, and developing a nuanced and informed understanding of the realities of archiving in digitised societies, a broad surface of opportunities can emerge to develop resilient, considered, safe and context-sensitive archival technologies and practice for our uncertain world. To preserve this memory, click here. 
Real-time public engagement with personal digital archives University of Groningen, Centre for Media and Journalism Studies Digital collections aim to reflect our personal and collective histories, which are shaped by and concurrently shape our memories. While advancements are made to develop web archival practices in the public domain, personal digital material is mostly preserved with commercially driven technologies. This is worrying, for although it may seem that these privately-owned cloud services are spaces where our precious pictures will exist forever, we know that long-term sustainable archiving practices are not these service providers’ primary concern. This demo is part of the first stages in the fieldwork of a PhD project that explores alternative approaches to sustainable everyday archival data management. Through participatory research methods, such as co-designing prototypes, we aim to establish a public-private-civic collaboration to rethink our relationship with the personal digital archive. Moving towards the question of what digital material do we throw away, discard, or forget about, we want to contribute to existing knowledge on how to manage the growing amount of digital stuff. Translating this question into an interactive installation, the demo combines human and technological performativity employing participatory, playful methods to let conference participants materialize their reflections on their engagement with their digital archives, from their professional and personal perspective. This demo invites conference participants to actively engage with the question of responsibility regarding the future of our personal digital past; is there a role to play for public institutions next to the commitment of individuals to commercially driven storage technologies? The researchers will consider the privacy of the participants throughout the duration of the demo. 
Through this demo, the community of (web) archivists is involved in the early stages of the project’s co-creative research practices, and the project aims to build lasting connections with these important stakeholders. Participatory Web Archiving: A Roadmap for Knowledge Sharing 1Bodleian Libraries University of Oxford, United Kingdom; 2Information School University of Sheffield In recent years, community participation seems to have become a desirable step in developing web archives. Participatory practices in the cultural heritage sector are not new (Benoit & Eveleigh, 2019). The practice of working in collaboration with different community partners to build archives is underway in conventional archives (Cook, 2013). Indeed, it has now become one of the main themes of web archival development on both theoretical and practical levels. Although involving wider communities is often regarded as an approach to democratise practices, it has been debated whether community participation can lead to improved representation. At the same time, the significant impact that participatory practices have on creating and sharing knowledge should not be underestimated. My current PhD research aims to understand how participatory practices have been deployed in web archiving, their mechanisms and impacts. Since April 2022, I have worked as a web archivist for the Archive of Tomorrow project, developing various sub-collections on topics relating to cancer, Covid-19, food, diet, nutrition, and wellbeing. The project, funded by the Wellcome Trust, explores and preserves online information and misinformation about health and the Covid-19 pandemic. Started in February 2022, the project runs for 14 months and will form a 'Talking about Health' collection within the UK Web Archive, giving researchers and members of the public access to a wide representation of diverse online sources. 
For this project, I have attempted to link theories with practices and applied various participatory methods in developing the collection, such as engaging with subject librarians, delivering a workshop co-curating a sub-collection, consulting academics to identify archiving priorities, co-curating a sub-collection with students from an internship scheme, and collaborating with a local patient support group. This poster reflects on how different approaches have been deployed and the lessons learned. It will highlight the transformative impact of participatory practices on sharing, creating and reconstructing knowledge. References Benoit, E., & Eveleigh, A. (2019). Defining and framing participatory archives in archival science. In E. Benoit & A. Eveleigh (Eds.), Participatory archives: theory and practice (pp. 1–12). London. Cook, T. (2013). Evidence, memory, identity, and community: Four shifting archival paradigms. Archival Science, 13(2–3), 95–120. https://doi.org/10.1007/s10502-012-9180-7 |
5:30pm - 6:10pm | POS-2: LIGHTNING & DROP-IN TALKS Location: Theatre 2 Session Chair: Martin Klein, Los Alamos National Laboratory 1 minute drop-in talks will immediately follow lightning talks. After the session ends, lightning talk presenters will be available for questions in the atrium, where their posters will be on display. Drop-in talk schedule: Persistent Web IDentifier (PWID) also as URN Eld Zierau, Royal Danish Library Crowdsourcing German Twitter Britta Woldering, German National Library At the end of the rainbow. Examining the Dutch LGBT+ web archive using NER and hyperlink analyses Jesper Verhoef, Erasmus University Rotterdam |
|
Sunsetting a digital institution: Web archiving and the International Museum of Women The Feminist Institute, United States of America The Feminist Institute’s (TFI) partnership program helps feminist organizations sunset mission-aligned digital projects utilizing web archiving technology and ethnographic preservation to contextualize and honor the labor contributed to ephemeral digital initiatives. In 2021, The Feminist Institute partnered with Global Fund for Women to preserve the International Museum of Women (I.M.O.W.). This digital, social change museum built award-winning digital exhibitions that explored women’s contributions to society. I.M.O.W. initially aimed to build a physical space but shifted to a digital-only presence in 2005, opting to democratize access to the museum’s work. I.M.O.W.’s first exhibition, Imagining Ourselves: A Global Generation of Women, engaged and connected more than a million participants worldwide. After launching several successful digital collections, I.M.O.W. merged with Global Fund for Women in 2014. The organization did not have the means to continually migrate and maintain the websites as technology depreciated, leaving gaps in functionality and access. Working directly with stakeholders from Global Fund for Women and the International Museum of Women, TFI developed a multi-pronged preservation plan that included capturing I.M.O.W.’s digital exhibitions using Webrecorder’s Browsertrix Crawler, harvesting and converting Adobe Flash assets, conducting oral histories with I.M.O.W. staff and external developers, and providing access through the TFI Digital Archive. Visualizing web harvests with the WAVA tool 1National Library of New Zealand, New Zealand; 2National Library of the Netherlands, Netherlands Between 2020 and 2021, the National Library of New Zealand (NLNZ) and the National Library of the Netherlands (KB-NL) developed a new harvest visualization feature within the Web Curator Tool (WCT). 
This feature was demonstrated during a presentation at the 2021 IIPC WAC titled Improving the quality of web harvests using Web Curator Tool. During development it was recognised that the visualization tool could be beneficial to the web archiving community beyond WCT. This was also reflected in feedback received after the 2021 IIPC WAC. The feature has now been ported to an accompanying stand-alone application called the WAVA tool (Web Archive Visualization and Analysis). This is a stripped-down version that contains the web harvest analysis and visualization without the WCT-dependent functionality, such as patching. The WCT harvest visualization has been designed primarily for performing quality assurance on web archives. To avoid the traditional mess of links and nodes when visualizing URLs, the tool abstracts the data to a domain level. Aggregating URLs into groups of domains gives a higher-level overview of a crawl and allows for quicker analysis of the relationships between content in a harvest. The visualization consists of an interactive network graph of links and nodes that can be inspected, allowing a user to drill down to the URL level for deeper analysis. NLNZ and KB-NL believe the WAVA tool can have many uses for the web archiving community. It lowers the barrier to investigating and understanding the relationships and structure of the web content that we crawl. What can we discover in our crawls that might improve the quality of future web harvests? The WAVA tool also removes technical steps that have been a barrier in the past to researchers visualizing web archive data. How many future research questions can be aided by its use? WARC validation, why not? Nationaal Archief, The Netherlands This lightning talk would like to tempt and to challenge the participants of the IIPC Web Archiving Conference 2023 to engage in an exchange of ideas, assumptions and knowledge about the subject of validating WARC-files and the use of WARC validation tools. 
In 2021 we wrote an information sheet about WARC validation. During our (desk) research it became clear that most (inter)national colleagues who archive websites more often than not don’t use WARC validation tools. Why not? Most heritage institutions, national libraries and archives focus on safeguarding as much online content as possible before it disappears, based on an organizational selection policy. The other goal is to give access to the captured information as completely and quickly as possible, both to general users and to researchers. Both goals are at the core of web archiving initiatives, of course! It seems as though little attention is given to an aspect of quality control such as checking the technical validity of WARC-files. Or are there other reasons not to pay much attention to this aspect? We would like to share some of our findings after deploying several tools for processing WARC-files: JHOVE, JWAT, Warcat and Warcio. More tools are available, but in our opinion these four are the most commonly used, mature and actively maintained tools that can check or validate WARC files. In our research into WARC validation, we noticed that some tools are validation tools that check conformance to the WARC standard ISO 28500, while others ‘only’ check block and/or payload digests. Most tools support version 1.0 of the WARC standard (of 2009). Few support version 1.1 (of 2017). Another conclusion is that there is no one WARC validation tool ‘to rule them all’, so using a combination of tools will probably be the best strategy for now. |
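To make concrete what it means for a tool to ‘only’ check payload digests, as the WARC validation abstract in this session puts it, the check can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions and is not how JHOVE, JWAT, Warcat or Warcio implement it: it assumes the digest is a labelled value of the form `sha1:` followed by the Base32-encoded SHA-1 hash, a form commonly written by crawlers, and the function names `sha1_base32` and `digest_matches` are hypothetical.

```python
import base64
import hashlib


def sha1_base32(payload: bytes) -> str:
    """Return a labelled digest 'sha1:<BASE32>' for the given bytes."""
    digest = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(digest).decode("ascii")


def digest_matches(header_value: str, payload: bytes) -> bool:
    """Compare a recorded digest header value with a freshly computed one.

    Only SHA-1 labelled digests are handled in this sketch; any other
    algorithm label raises ValueError rather than being silently accepted.
    """
    algorithm, _, expected = header_value.partition(":")
    if algorithm.lower() != "sha1":
        raise ValueError(f"unsupported digest algorithm: {algorithm!r}")
    computed = sha1_base32(payload).partition(":")[2]
    return computed == expected.strip().upper()


if __name__ == "__main__":
    body = b"<html>archived page</html>"
    recorded = sha1_base32(body)           # value as it would appear in a header
    print(digest_matches(recorded, body))  # payload unchanged
    print(digest_matches(recorded, body + b" "))  # payload altered
```

A digest check of this kind detects altered or truncated content but says nothing about whether the record structure conforms to ISO 28500, which is why the abstract suggests combining tools.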
7:00pm - 9:00pm | DINNER Pre-registration required for this event. |
Date: Friday, 12/May/2023 | |
8:00am - 8:30am | ARRIVAL/COFFEE |
8:30am - 10:00am | SES-11: COLLECTION BUILDING Location: Theatre 1 Session Chair: Lauren Baker, Library of Congress These presentations will be followed by a 10 min Q&A. |
|
8:30am - 8:50am
20 years of archiving the French electoral web Bibliothèque nationale de France, France In 2022, BnF is celebrating the 20th anniversary of its electoral crawls. On this occasion, we would like to trace the history of 20 years of electoral crawls, which cover 20 elections of all types (presidential, parliamentary, local, departmental, European), and represent more than 30 TiB of data. The 2002 presidential election crawl was the first in-house crawl conducted by the BnF, a founding moment for experimenting with a legal, technical and library policy framework. We, as a heritage institution, are accountable for the first electoral collections, which are emblematic and representative of our workflows in several respects: harvest, selection, and outreach. First, from a technical point of view, electoral crawls were an opportunity to set up crawling tools and to develop adaptive techniques to face the evolution of the Web and meet the challenge of archiving it. We have experimented and made improvements in our archiving processes for each new election, with a specific look at the means of communication (e.g. forums, Twitter accounts, YouTube channels and more recently Instagram accounts and TikTok content). Secondly, electoral crawls have led the BnF to set up and organise a network of contributors and the means of selection. In 2002, contributions were from BnF librarians. In 2004, partner libraries in different regions and overseas territories contributed to select content for the regional elections. In 2012, we initiated the development of a collaborative curation tool. Throughout the years, we have also built a document typology that has remained stable to guarantee the coherence of the collections. Thirdly, electoral crawls led us to set up ways to promote web archives to the public and the research community. 
To promote the use of a collection with such historical consistency, of high interest for the study of political life, we designed guided tours (thematic and edited selections of archived pages made by librarians). The BnF also engaged in organizing scientific events, and in several collaborative outreach initiatives. 8:50am - 9:10am
Archiving the Web for FIFA World Cup Qatar 2022™ Qatar National Library, Qatar The core mission of Qatar National Library is to “spread knowledge, nurture imagination, cultivate creativity, and preserve the nation’s heritage for the future.” To fulfil this mission, the Library commits to collecting, preserving and providing access to both local and global knowledge, including heritage-related content relevant to Qatar and the region. Web resources of cultural importance could assist future generations in the interpretation of events that may not be extant anywhere else. Archiving such websites is an important initiative within the wider mission of the Library to support Qatar on its journey towards a knowledge-based economy. The 2022 FIFA World Cup will be the first World Cup ever to be held in the Arab world, and hence is considered a landmark event in Qatar’s history. Qatar’s journey towards hosting the 2022 World Cup has been covered by all types of local and international websites and news portals, and the coverage is expected to increase significantly in the weeks leading up to, during and after the World Cup. The information published by these websites will truly reflect the journey towards, and experience of, the event from a variety of perspectives, including the fans, the organizers, the players, and members of the public. Capturing and preserving such information for the long term enables future generations to also share the experience and appreciate the astounding effort required to host a massive, culturally important global event in Qatar. In this talk, we describe the Library’s approach to capturing and preserving websites related to the World Cup 2022, to guarantee access to the content for future generations. We also highlight the challenges associated with developing archived websites as collections for researchers in the context of the Qatari copyright law. 9:10am - 9:30am
Museums on the Web: Exploring the past for the future Leiden University, The Netherlands This presentation will celebrate the launch of the special collection ‘Museums on the Web’ at the KB, National Library of the Netherlands. This evolving collection unlocks an essential sub-collection, and the largest, within the KB Web archive. It contains more than 800 museum websites and offers the potential to research histories of museums on the Web within the Netherlands. Accessing Web archives requires special tools, and therefore this presentation will demonstrate a variety of entry points. It features a selection of curated archived websites that can be viewed page-by-page. It will also be the first KB special collection that is accessible through a SOLR Wayback search engine, which enables users to request derived datasets and explore the collection through a series of dashboards. This offers the opportunity to study histories of museums on the Web in The Netherlands, combining methods from history and data science and drawing on a computational analysis of Web archive data. The presentation will conclude by highlighting some significant case studies to showcase the diversity of museum websites and the research potential to uncover a Dutch history of museums on the Web. The advent of online technologies has changed the way museums manage collections and access them, shape exhibitions, and build communities. By engaging with the past, we can enhance our understanding of how museums are functioning today and offer new perspectives for future developments. This paper coincides with the release of a Double Special Issue “Museums on the Web: Exploring the past for the future” in the journal Internet Histories: Digital Technology, Culture and Society (Routledge/Taylor & Francis). 9:30am - 9:50am
Unsustainability and Retrenchment in American University Web Archives Programs 1University at Albany, United States of America; 2Union College This presentation will overview the expansion and later retrenchment of UAlbany’s web archives program due to a lack of permanently funded staff. UAlbany began its web archives program in 2013 in response to state records laws requiring it to preserve university records on the web. The department that housed the program had strong existing collecting programs in New York State politics and capital punishment. Since much of current politics and activism now happens online, it was natural and necessary to expand the web archives program to ensure we were effectively documenting these important spaces for the long-term future. However, we will show how the increasing complexity of the web and collecting techniques means that the scoping needs for ongoing collecting seem to require significantly more testing and labor over time. Thus, despite the need to expand the web archives program to meet our department’s mission, we will describe the painful process of reducing our web archives collecting scope. With the NDSA Web Archiving in the United States surveys reporting 71-83% of respondents devoting 0.5 or less FTE to web archiving, maintenance inflation like this is catastrophic to many web archives programs. Most alarmingly, we will overview how the web archives labor situation at American universities is likely to get worse. The UAlbany Libraries, which houses the web archives program, has permanently lost over 30% of FTE since 2020 and almost 50% of FTE since 2000. Peer assessment studies, ARL staffing surveys, and the University of California, Berkeley’s recent announcement of library closures show that UAlbany’s example is more typical than exceptional. 
Finally, we will show how these cuts are not the result of a misunderstanding or a lack of value for web archives or libraries by university administrators, but because our web archives program conflicts with UAlbany’s overall organizational mission and the business model of American higher education. |
8:30am - 10:00am | WKSHP-04: BROWSER-BASED CRAWLING FOR ALL: GETTING STARTED WITH BROWSERTRIX CLOUD Location: Theatre 2 Pre-registration required for this event. |
|
Browser-Based Crawling For All: Getting Started with Browsertrix Cloud 1The British Library, United Kingdom; 2Royal Danish Library; 3Webrecorder Through the IIPC-funded “Browser-based crawling system for all” project (https://netpreserve.org/projects/browser-based-crawling/), members have been working with Webrecorder to support the development of their new crawling tools: Browsertrix Crawler (a crawl system that automates web browsers to crawl the web) and Browsertrix Cloud (a user interface to control multiple Browsertrix Crawler crawls). The IIPC funding is particularly focussed on making sure IIPC members can make use of these tools. This workshop will be led by Webrecorder, who will invite you to try out an instance of Browsertrix Cloud and explore how it might work for you. IIPC members that have been part of the project will be on hand to discuss institutional issues like deployment and integration. The workshop will start with a brief introduction to high-fidelity web archiving with Browsertrix Cloud. Participants will be given accounts and be able to create their own high-fidelity web archives using the Browsertrix Cloud crawling system during the workshop. Users will be presented with an interactive user interface to create crawls, allowing them to configure crawling options. We will discuss the crawling configuration options and how these affect the result of the crawl. Users will have the opportunity to run a crawl of their own for at least 30 minutes and to see the results. We will then discuss and reflect on the results. After a quick break, we will discuss how the web archives can be accessed and shared with others, using the ReplayWeb.page viewer. Participants will be able to download the contents of their crawls (as WACZ files) and load them on their own machines. We will also present options for sharing the outputs with others directly, by uploading to an easy-to-use hosting option such as Glitch or our custom WACZ Uploader. 
Either method will produce a URL which participants can then share with others, in and outside the workshop, to show the results of their crawl. We will discuss how, once complete, the resulting archive is no longer dependent on the crawler infrastructure, but can be treated like any other static file, and, as such, can be added to existing digital preservation repositories. In the final wrap-up of the workshop, we will discuss challenges, what worked and what didn’t, what still needs improvement, etc. We will also discuss how participants can add the web archives they created into existing web archives that they may already have, and how Browsertrix Cloud can fit into and augment existing web archiving workflows at participants' institutions. IIPC member institutions that have been testing Browsertrix Cloud thus far will share their experiences in this area. The format of the workshop will be as follows:
Webrecorder will lead the workshop and the hands-on portions of the workshop. The IIPC project partners will present and answer questions about use-cases at the beginning and around integration into their workflows at the end. Participants should come prepared with a laptop and a couple of sites that they would like to crawl, especially those that are generally difficult to crawl by other means and require a ‘high fidelity approach’. (Examples include social media sites, sites that are behind a paywall, etc.) Ideally, the sites can be crawled during the course of 30 mins (though crawls can be interrupted if they run for too long). This workshop is intended for curators and anyone wishing to create and use web archives who is ready to try Browsertrix Cloud on their own with guidance from the Webrecorder team. The workshop does not require any technical expertise besides basic familiarity with web archiving. The participants’ experiences and feedback will help shape not only the remainder of the IIPC project, but the long-term future of this new crawling toolset. The workshop should be able to accommodate up to 50 participants. |
8:30am - 10:00am | WKSHP-03: FAKE IT TILL YOU MAKE IT: SOCIAL MEDIA ARCHIVING AT DIFFERENT ORGANIZATIONS FOR DIFFERENT PURPOSES Location: Labs Room 1 (workshops) Pre-registration required for this event. |
|
Fake it Till You Make it: Social Media Archiving at Different Organizations for Different Purposes 1KB, National Library of the Netherlands; 2International Institute for Social History; 3National Archives of the Netherlands Abstract Different organizations, different business rules, different choices. That seems obvious. However, different perspectives can alter the choices that you make and therefore the results you get when you’re archiving Social Media. In this tutorial, we would like to zoom in on the different perspectives an organization can have. A perspective can be shaped by the mandate or type of organization, the designated community of an institution, or a specific tool that you use. Therefore, we would like to highlight these influences and how they can affect the results that you get. When you start with Social Media archiving, you won’t get the best results right away. It is really a process of trial and error, where you aim for good practice and not necessarily best practice (and is there such a thing as best practice?). With a practical assignment we want to showcase the importance of collaboration between different organizations. What are the worst practices that we have seen so far? What’s best to avoid, and why? What could be a solution? And why is it a good idea to involve other institutions at an early stage? This tutorial relates to the conference topics of community, research and tools. It builds on previous work from the Dutch Digital Heritage Network and the BeSocial project from the National Library of Belgium. Furthermore, different tools will be highlighted and it will be made clear why different tooling can lead to different results. Format In-person tutorial, 90 minutes.
Target audience This tutorial is aimed at those who want to learn more about doing social media archiving at their organizations. It is mainly meant for starters in social media archiving, but not necessarily complete beginners (even though they are definitely welcome too!). Potential participants could be archivists, librarians, repository managers, curators, metadata specialists, (research) data specialists, and generally anyone who is or could be involved in the collection and preservation of social media content for their organization. Expected number of participants: 20-25. Expected learning outcome(s) Participants will understand:
In addition, participants will get insight into:
Coordinators Susanne van den Eijkel is a metadata specialist for digital preservation at the National Library of the Netherlands. She is responsible for all the preservation metadata, writing policies and implementing them. Her main focus is born-digital collections, especially the web archives. She focuses on web material after it has been harvested, and not so much on selection and tools, and is therefore more involved with which metadata and context information is available and relevant for preservation. In addition, she works on the communication strategy of her department, is actively involved in the Dutch Digital Heritage Network and provides guest lectures on digital preservation and web archiving. Zefi Kavvadia is a digital archivist at the International Institute of Social History in Amsterdam, the Netherlands. She is part of the institute’s Collections Department, where she is responsible for the processing of digital archival collections. She is also actively contributing to research, planning, and improvement of the IISH digital collections workflows. While her work covers potentially any type of digital material, she is especially interested in the preservation of born-digital content and is currently the person responsible for web archiving at IISH. Her research interests range from digital preservation and archives, to web and social media archiving, and research data management, with a special focus on how these different but overlapping domains can learn and work together. She is active in the web archiving expert group of the Dutch Digital Heritage Network and the digital preservation interest group of the International Association of Labour History Institutions. Lotte Wijsman is the Preservation Researcher at the National Archives in The Hague. In her role she researches how we can further develop preservation at the National Archives of the Netherlands and how we can innovate the archival field in general. 
This includes considering our current practices and evaluating how we can improve these with e.g. new practices and tools. Currently, Lotte is active in research projects concerning subjects as social media archiving, AI, a supra-organizational Preservation Watch function, and environmentally sustainable digital preservation. Furthermore, she is a guest teacher at the Archiefschool and Reinwardt Academy (Amsterdam University of the Arts). |
10:00am - 10:30am | BREAK |
10:30am - 12:00pm | SES-12: DOMAIN CRAWLS Location: Theatre 1 Session Chair: Grace Bicho, Library of Congress These presentations will be followed by a 10 min Q&A. |
|
10:30am - 10:50am
Discovering and Archiving the Frisian Web. Preparing for a National Domain Crawl. KB, National Library of the Netherlands In the past years KB, National Library of the Netherlands (KBNL), conducted a pilot for a national domain crawl. Since 2007, KBNL has been harvesting websites with the Web Curator Tool (a web interface with Heritrix crawler) on a selective basis, focusing on Dutch history, culture and language. Information on the web can be short-lived, yet it can be of vital importance to researchers now and in the future. Furthermore, KBNL outlined in its content strategy the ambition to collect everything published in and about the Netherlands, websites included. As more libraries around the world were collecting their national domains, KBNL also expressed the wish to carry out a national domain crawl. Before we were able to do that, we had to form a multidisciplinary web archiving team, decide on a new tool for domain harvests and start an intensive testing phase. For this pilot a regional domain, the Frisian web, was selected. Since we were new to domain harvesting, we used a selective approach. Curators of digital collections from KBNL were in close contact with Frisian researchers to help define which websites needed to be included in the regional domain. During the pilot we also gathered more knowledge about Heritrix, as we were using NetarchiveSuite (also a web interface with Heritrix crawler) for crawls. Now that the results are in, we can share our lessons learned, such as the technical and legal challenges and the related policies that are needed for web collections. We will also go into detail about the crawler software settings that were tested and how we can use such information as context information. 
This presentation is related to the conference topics collections, community and program operations, as we want to share the best practices for executing a (regional) domain crawl and lessons learned in preparation for a national domain crawl. Furthermore, we will focus on the next steps after completion of the pilot. Other institutions that are harvesting websites can learn from it and those that want to start with web archiving can be more prepared. 10:50am - 11:10am
Back to Class: Capturing the University of Cambridge Domain Cambridge University Libraries, United Kingdom The University Archives of Cambridge University, based at the University Library (UL), is responsible for the selection, transfer, and preservation of the internal administrative records of the University, dating from 1266 to the present. These records are increasingly created in digital formats, including common ‘office’ formats (Word, Excel, PDF), and increasingly for the web. The question “How do you preserve an entire online ecosystem in which scholars collaborate, discover and share new knowledge?”, posed by Cramer et al. (2022) about the digital scholarly record, applies equally to online learning and teaching materials and to the day-to-day business records of a university. Capturing this online ecosystem as comprehensively as possible, rather than selectively, is an undertaking that involves many stakeholders and moving parts. As a UK Legal Deposit Library, the UL is a partner in the UK Web Archive and Cambridge University websites are captured annually; however, some online content needs to be captured more frequently, does not have an identifiable UK address, or is behind a log-in screen. To improve the capture of this content, the UL is working on the following:
Our presentation will walk WAC2023 attendees through our current workflow as well as highlight ongoing challenges we are working to resolve so that attendees based at universities can take these into account for archiving content on their university’s domains. 11:10am - 11:30am
Laboratory not Found? Analyzing LANL’s Web Domain Crawl Los Alamos National Laboratory, United States of America Institutions, regardless of whether they identify as for-profit, nonprofit, academic, or government, are invested in maintaining and curating their representation on the web. The organizational website is often the top-ranked on search engine result pages and commonly used as a platform to communicate organizational news, highlights, and policy changes. Individual web pages from this site are often distributed via organization-wide email channels, included in new articles, and shared via social media. Institutions are therefore motivated to ensure the long-term accessibility of their content. However, resources on the web frequently disappear, leading to the known detriment of link rot. Beyond the inconvenience of the encounter with a “404 - Page not Found” error, there may be legal implications when published government resources are missing, trust issues when academic institutions fail to provide content, and even national security concerns when taxpayer-funded federal research organizations such as Los Alamos National Laboratory show deficient stewardship of their digital content. We therefore conducted a web crawl of the lanl.gov domain with the motivation to investigate the scale of missing resources within the canonical website representing the institution. We found a noticeable number of broken links, including a significant number of special cases of link rot commonly known as “soft404s” as well as potential transient errors. We further evaluated the recovery rate of missing resources from more than twenty public web archives via the Memento TimeTravel federated search service. Somewhat surprisingly, our results show little success in recovering missing web pages. 
These observations lead us to argue that, as an institution, we could be a better steward of our web content and that establishing an institutional web archive would be a significant step towards this goal. We therefore implemented a pilot LANL web archive in support of highlighting the availability and authenticity of web resources. In this presentation, I will motivate the project, outline our workflow, highlight our findings, and demonstrate the implemented pilot LANL web archive. The goal is to showcase an example of an institutional web crawl that, in conjunction with the evaluation, can serve as a blueprint for other interested parties. 11:30am - 11:50am
Public policies for governmental web archiving in Brazil 1University of Porto, Portugal; 2Federal University of Rio Grande do Sul, Brazil The scientific, cultural, and intellectual relevance of web archiving has been widely recognized since the 1990s. The preservation of the web has been addressed in several studies, ranging from its specific theories and practices (such as methodological approaches and the ethical aspects of preserving web pages) to subjects that permeate the Digital Humanities and the use of web archives as a primary source. This study aims to identify the documents and actions related to the development of web archive policy in Brazil. The methodology was bibliographic and documentary research, drawing on the literature on government web archiving and on legislation regarding public policies. Brazil has a variety of technical resources and legislation that address the need to preserve government documents; however, websites have not yet been included in the records management practices of Brazilian institutions. Until recently, the country did not have a website preservation policy. However, there are currently two government actions under development. The first is a Bill, under consideration in the National Congress since July 2015, that provides for the institutional digital public heritage on the World Wide Web. It has been with the Constitution and Justice and Citizenship Commission (CCJC) of the Brazilian National Congress since December 2022. The second action comes from the National Council of Archives – Brazil (CONARQ), which established a technical chamber to define guidelines for the elaboration of studies, proposals, and solutions for the preservation of websites and social media. Based on its general goals, the technical chamber has produced two documents: (i) the Website and Social Media Preservation Policy; and (ii) the recommendation of basic elements for the digital preservation of websites and social media. 
The documents were approved in December 2022 and will be published as a federal resolution. The actions identified show that efforts for the state to take a proactive role in promoting and leading this technological innovation are under way in Brazil. The definition of a web archiving policy, as well as the requirements for the selection of preservation and archiving methods, technologies, and contents that will be archived, can already be considered a reality in Brazil. |
10:30am - 12:00pm | SES-13: CRAWLING, PLAYBACK, SUSTAINABILITY Location: Theatre 2 Session Chair: Laura Wrubel, Stanford University These presentations will be followed by a 10 min Q&A. |
|
10:30am - 10:50am
Developer Update for Browsertrix Crawler and Browsertrix Cloud Webrecorder, United States of America This presentation will provide a technical and feature update on the latest features implemented in Browsertrix Cloud and Browsertrix Crawler, Webrecorder's open source automated web archiving tools. The presentation will provide a brief intro to Browsertrix Cloud and the ongoing collaboration between Webrecorder and IIPC partners testing the tool. We will present an outline for the next phase of development of these tools and discuss current / ongoing challenges in high fidelity web archiving, and how we may mitigate them in the future. We will also cover any lessons learned thus far. We will end with a brief Q&A to answer any questions about the Browsertrix Crawler and Cloud systems, including how others may contribute to testing and development of these open source tools. 10:50am - 11:10am
Opportunities and Challenges of Client-Side Playback Library Innovation Lab, United States of America The team working on Perma.cc at the Library Innovation Lab has been using the open-source technologies developed by Webrecorder in production for many years, and has subsequently built custom software around those core services. Recently, in exploring applications for client-side playback of web archives via replayweb.page, we have learned lessons about the security, performance and reliability profile of this technology. This has deepened our understanding of the opportunities it presents and challenges it poses. Subsequently, we have developed an experimental boilerplate for testing out variations of this technology and have sought partners within the Harvard Library community to iterate with, test our learnings, and explore some of the interactive experiences that client-side playback makes possible. warc-embed is LIL's experimental client-side playback integration boilerplate, which can be used to test out and explore this new technology. It consists of: a cookie-cutter web server configuration for storing, proxying, caching and serving web archive files; a pre-configured "embed" page, serving an instance of replayweb.page aimed at a given archive file; as well as a two-way communication layer allowing the embedding website to safely communicate with the embedded archive. These unique features allow for a thorough exploration of this new technology from a technical and security standpoint. This is one of two proposals relating to this work. We believe there is an IIPC audience who could attend either or both sessions based on their interests. This session will dive into the technical research conducted at the lab and present those findings. 
Combined with the emergence of the WACZ packaging format, client-side playback is a radically different and novel take on web archive playback which allows for the implementation of previously unachievable embedding scenarios. This session will explore the technical opportunities and challenges client-side playback presents from a performance, security, ease-of-access and programmability perspective by going over concrete implementation examples of this technology on Perma.cc and warc-embed. 11:10am - 11:30am
Sustaining pywb through community engagement and renewal: recent roadmapping and development as a case study in open source web archiving tool sustainability Webrecorder IIPC’s adoption of pywb as the “go to” open source web archive replay system for its members, along with Webrecorder’s support for transitioning to pywb from other “wayback machine” replay systems, brings a large new user base to pywb. In the interests of ensuring pywb continues to sustainably meet the needs of IIPC members and the greater web archiving community, Webrecorder has been investing in maintenance and new releases for the current 2.x release series of pywb as well as engaging in the early stages of a significant 3.0 rewrite of pywb. These changes are being driven by a community roadmapping exercise with members of the IIPC oh-sos (Online Hours: Supporting Open Source) group and other pywb community stakeholders. This talk will outline some of the recent feature and maintenance work done in pywb 2.7, including a new interactive timeline banner which aims to promote easier navigation and discovery within web archive collections. It will go on to discuss the community roadmapping process for pywb 3.0 and an overview of the proposed new architecture, perhaps even showing an early demo if development is in a state by May 2023 to support doing so. The talk will aim to not only share specific information about pywb and the efforts being put into its sustainability and maintenance by both Webrecorder and the IIPC community, but also to use pywb as a case study to discuss the resilience, sustainability, and renewal of open source software tools that enable web archiving for all. 
pywb as a codebase is after all nearly a decade old itself and has gone through several rounds of significant rewrites as well as eight years of regular maintenance by Webrecorder staff and open source contributors to get to its current state, making it a prime example of how ongoing effort and community involvement make all the difference in building sustainable open source web archiving tools. 11:30am - 11:50am
Addressing the Adverse Impacts of JavaScript on Web Archives 1University of Michigan, United States of America; 2Princeton University, United States of America Over the last decade, the presence of JavaScript code on web pages has dramatically increased. While JavaScript enables websites to offer a more dynamic user experience, its increasing use adversely impacts the fidelity of archived web pages. For example, when we load snapshots of JavaScript-heavy pages from the Internet Archive, we find that many are missing important images and that JavaScript execution errors are common. In this talk, we will describe the takeaways from our research on how to archive and serve pages that are heavily reliant on JavaScript. Via fine-grained analysis of JavaScript execution on 3000 pages spread across 300 sites, we find that the root cause of the poor fidelity of archived page copies is that the execution of JavaScript code that appears on the web often depends on the characteristics of the client device on which it is executed. For example, JavaScript on a page can execute differently based on whether the page is loaded on a smartphone or on a laptop, or whether the browser used is Chrome or Safari; even subtle differences like whether the user's network connection is over 3G or WiFi can affect JavaScript execution. As a result, when a user loads an archived copy of a page in their browser, JavaScript on the page might attempt to fetch a different set of embedded resources (i.e., images, stylesheets, etc.) from those fetched when this copy was crawled. Since a web archive is unable to serve resources that it did not crawl, the user sees an improperly rendered page because of both missing content and JavaScript runtime errors. 
To account for the sources of non-deterministic JavaScript execution, a web archive cannot crawl every page in all possible execution environments (client devices, browsers, etc), as doing so would significantly inflate the cost of archiving. Instead, if we augment archived JavaScript such that the code on any archived page will always execute exactly how it did when the page was crawled, we are able to ensure that all archived pages match their original versions on the web, both visually and functionally. |
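The mismatch described in this abstract can be illustrated with a toy model. The sketch below is not the authors' system: `ClientEnv`, `resources_requested`, and the URL patterns are hypothetical stand-ins for page JavaScript whose fetches depend on the client. It only shows why replaying in a different environment asks for resources the archive never stored, and why pinning the crawl-time environment avoids that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientEnv:
    """Hypothetical client characteristics that page scripts may inspect."""
    device: str   # "laptop" or "smartphone"
    network: str  # "wifi" or "3g"


def resources_requested(env):
    """Stand-in for page JavaScript: the URLs fetched depend on the client."""
    size = "small" if env.device == "smartphone" else "large"
    quality = "low" if env.network == "3g" else "high"
    return {"/app.js", f"/img/hero-{size}-{quality}.jpg"}


def crawl(env):
    """Record the crawl-time environment and the resources fetched under it."""
    return {"env": env, "stored": resources_requested(env)}


def missing_on_replay(snapshot, user_env, pin_env=False):
    """Return requested resources that the archive cannot serve.

    With pin_env=True the script sees the crawl-time environment instead of
    the visitor's, which is the idea of making archived code execute as it
    did when the page was crawled.
    """
    env = snapshot["env"] if pin_env else user_env
    return resources_requested(env) - snapshot["stored"]


snapshot = crawl(ClientEnv(device="laptop", network="wifi"))
visitor = ClientEnv(device="smartphone", network="3g")
unpinned = missing_on_replay(snapshot, visitor)
pinned = missing_on_replay(snapshot, visitor, pin_env=True)
```

In this model the unpinned replay requests one image that was never crawled, while the pinned replay requests only stored resources.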
10:30am - 12:00pm | WKSHP-05: SUPPORTING COMPUTATIONAL RESEARCH ON WEB ARCHIVES WITH THE ARCHIVE RESEARCH COMPUTE HUB (ARCH) Location: Labs Room 1 (workshops) Pre-registration required for this event. |
|
Supporting Computational Research on Web Archives with the Archive Research Compute Hub (ARCH) Internet Archive, United States of America Coordinators:
Format: 90 or 120-minute workshop and tutorial
Target Audience: The target audience is professionals working in digital library services who are collecting, managing, or providing access to web archives, scholars using web archives and other digital collections in their work, library professionals working to support computational access to digital collections, and digital library technical staff.
Anticipated Number of Participants: 25
Abstract: Every year more and more scholars are conducting research on terabytes and even petabytes of digital library and archive collections using computational methods such as data mining, natural language processing, and machine learning. Web archives are a significant collection of interest for these researchers, especially due to their contemporaneity, size, multi-format nature, and how they can represent different thematic, demographic, disciplinary, and other characteristics. Web archives also have longitudinal complexity, with frequent changes in content (and often state of existence) even at the same URL, large volumes of metadata both content-based and transactional, and many characteristics that make them highly suitable for data mining and computational analysis. Supporting computational use of web archives, however, poses many technical, operational, and procedural challenges for libraries. Similarly, while platforms exist for supporting computational scholarship on homogenous collections (such as digitized texts, images, or structured data), none exist that handle the vagaries of web archive collections while also providing a high level of automation, seamless user experience, and support for both technical and non-technical users. 
In 2020, Internet Archive Research Services and the Archives Unleashed project received funding for joint technology development and community building to combine their respective tools that enable computational analysis of web and digital archives in order to build an end-to-end platform supporting data mining of web archives. The program is simultaneously building out a community of computational researchers doing scholarly projects, via a program supporting cohort teams of scholars that receive direct technical support for their projects. The beta platform, Archive Research Compute Hub (ARCH), is currently being used by dozens of researchers in the digital humanities and the social and computer sciences, and by dozens of libraries and archives that are interested in supporting local researchers and sharing datasets derived from their web collections in support of large-scale digital research methods. ARCH lowers the barriers to conducting research on web archives, using data processing operations to generate 16 different derivatives from WARC files. Derivatives range in use from graph analysis to text mining and file format extraction, and ARCH makes it possible to visualize, download, and integrate these datasets into third-party tools for more advanced study. ARCH enables analysis of the more than 20,000 web archive collections (over 3 PB of data) collected by over 1,000 institutions using Archive-It, which cover a broad range of subjects and events. ARCH also includes various portions of the overall Wayback Machine global web archive, totalling 50+ PB and going back to 1996. This workshop will be a hands-on training covering the full lifecycle of supporting computational research on web archives. 
The agenda will include an overview of the conceptual challenges researchers face when working with web archives and the procedural challenges that librarians face in making web archives available for computational use. Most importantly, it will provide an in-depth tutorial on using the ARCH platform and its suite of data analysis, dataset generation, data visualization, and data publishing tools, from the perspectives of a collection manager, a research services librarian, and a computational scholar. Workshop attendees will be able to build small web archive collections beforehand or will be granted access to existing web archive collections to use during the workshop. All participants will also have access to any datasets and data visualizations created as part of the workshop.
Anticipated Learning Outcomes: Given the conference, we expect the attendees primarily to be web archivists, collection managers, digital librarians, and other library and archives staff. After the workshop, attendees will:
|
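To give a sense of what a graph-analysis derivative might look like, here is a minimal sketch that reduces page-level links to a weighted domain graph. It is not ARCH code and does not reflect ARCH's actual derivative formats; the function name `domain_graph` and the sample links are hypothetical, and the input is assumed to be (source URL, target URL) pairs already extracted from archived pages.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_graph(links):
    """Count links between distinct hosts, ignoring links within one host."""
    edges = Counter()
    for source, target in links:
        src = urlsplit(source).hostname
        dst = urlsplit(target).hostname
        # Skip malformed URLs and same-host navigation links.
        if src and dst and src != dst:
            edges[(src, dst)] += 1
    return edges


# Made-up sample links for illustration only.
sample_links = [
    ("https://a.example/page1", "https://b.example/x"),
    ("https://a.example/page2", "https://b.example/y"),
    ("https://a.example/", "https://a.example/about"),
    ("https://b.example/", "https://c.example/"),
]
graph = domain_graph(sample_links)
```

The resulting edge counts could then be loaded into a third-party graph tool for visualization or further analysis.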
12:00pm - 1:00pm | LUNCH |
1:00pm - 2:00pm | SES-14 (PANEL): INCLUSIVE REPRESENTATION AND PRACTICES IN WEB ARCHIVING Location: Theatre 1 Session Chair: Daniel Steinmeier, KB National Library of the Netherlands |
|
Renewal in Web Archiving: Towards More Inclusive Representation and Practices 1The College of Wooster; 2Archiving The Black Web; 3Shift Collective
“The future is already here, it's just not very equally distributed, yet” - William Gibson
Presentation 1 - Archiving The Black Web
Author/Presenter: Makiba Foster, The College of Wooster and Bergis Jules, Archiving the Black Web
Abstract: Unactualized web archiving opportunities for Black knowledge collecting institutions interested in documenting web-based Black history and culture have reached critical levels due to the expansive growth of content produced about the Black experience by Black digital creators. Archiving The Black Web (ATBW) works to establish more equitable, accessible, and inclusive web archiving practices to diversify not only collection practices but also its practitioners. ATBW was founded in 2019, and its creators will discuss the collaborative catalyst for the creation and launch of this important DEI initiative within web archiving. In this panel session, attendees will learn more about ATBW’s mission to address web archiving disparities. ATBW envisions a future that includes cultivating a community of practice for Black collecting institutions, developing training opportunities to diversify the practice of web archiving, and expanding the scope of web archives to include culturally relevant web content.
Presentation 2 - Schomburg Syllabus
Author/Presenter: Zakiya Collier, Shift Collective
Abstract: From 2017 to 2019 the Schomburg Center for Research in Black Culture participated in the Internet Archive’s Community Webs program, becoming the first Black collecting institution to create a web archiving program centering web-based Black history and culture. 
Recognizing that content in crowdsourced hashtag syllabi could be lost to the ephemerality of the Web, the #HashtagSyllabusMovement collection was created to archive online educational material related to publicly produced, crowdsourced content highlighting race, police violence, and other social justice issues within the Black community. Both the first of its kind in focus and within The New York Public Library system, the Schomburg Center’s web archiving program faced challenges including but not limited to identifying ways to introduce the concept of web archiving to Schomburg Center researchers and community members, demonstrating the necessity of a web supported web archiving program to Library administration, and expressing the urgency needed in centering Black content on the web that may be especially ephemeral like those associated with struggles for social justice. It was necessary for the Schomburg Center to not only continue their web archiving efforts with the #Syllabus and other web archive collections, but also develop strategies to invoke the same sense of urgency and value for Black web archive collections that we now see demonstrated in the collection of analog records documenting Black history, culture and activism— especially as social justice organizing efforts increasingly have online components. As a result, the #SchomburgSyllabus project was developed to merge web-archives and analog resources from the Schomburg Center in celebration of Black people's longstanding self-organized educational efforts. #SchomburgSyllabus uniquely organizes primary and secondary sources into a 27-themed web-based resource guide that can be used for classroom curriculum, collective study, self-directed education, and social media and internet research. 
Tethering web-archived resources to the Schomburg Center’s world-renowned physical collections of Black diasporic history has proven key in garnering support for the Schomburg’s web archiving program and enthusiasm for the preservation of the Black web, as demonstrated by the #SchomburgSyllabus’ use in classrooms, inclusion in journal articles, and features in cultural/educational TV programs. |
1:00pm - 2:10pm | SES-15: DATA CONSIDERATIONS Location: Theatre 2 Session Chair: Sophie Ham, Koninklijke Bibliotheek These presentations will be followed by a 10 min Q&A. |
|
1:00pm - 1:20pm
What if GitHub disappeared tomorrow? Old Dominion University, United States of America Research is reproducible when the methodology and data originally presented by the researchers can be used to reproduce the results found. Reproducibility is critical for verifying and building on results; both of which benefit the scientific community. The correct implementation of the original methodology and access to the original data are the lynchpin of reproducibility. Researchers are putting the exact implementation of their methodology in online repositories like GitHub. In our previous work, we analyzed arXiv and PubMed Central (PMC) corpora and found 219,961 URIs to GitHub in scholarly publications. Additionally, in 2021, one in five arXiv publications contained at least one link to GitHub. These findings indicate the increasing reliance of researchers on the holdings of GitHub to support their research. So, what if GitHub disappeared tomorrow? Where could we find archived versions of the source code referenced in scholarly publications? Internet Archive, Zenodo, and Software Heritage are three different digital libraries that may contain archived versions of a given repository. However, they are not guaranteed to contain a given repository and the method for accessing the code from the repository will vary across the three digital libraries. Additionally, Internet Archive, Zenodo, and Software Heritage all approach archiving from different perspectives and different use cases that may impact reproducibility. Internet Archive is a Web archive; therefore, the crawler archives the GitHub repository as a Web page and not specifically as a code repository. Zenodo allows researchers to publish source code and data and to share them with a DOI. Software Heritage allows researchers to preserve source code and issues permalinks for individual files and even lines of code. In this presentation, we will answer the questions: What if GitHub disappeared tomorrow? 
What percentage of scholarly repositories are in Internet Archive, Zenodo, and Software Heritage? What percentage of scholarly repositories would be lost? Do the archived copies available in these three digital libraries facilitate reproducibility? How can other researchers access source code in these digital libraries? 1:20pm - 1:40pm
Web archives and FAIR data: exploring the challenges for Research Data Management (RDM) 1Maynooth University; 2NetLab; 3KBR & Ghent Centre for Digital Humanities; 4Royal Danish Library; 5University of Groningen; 6IIPC; 7School of Advanced Study, University of London The FAIR principles imply “that all research objects should be Findable, Accessible, Interoperable and Reusable (FAIR) both for machines and for people” (Wilkinson et al., 2016). These principles present varying degrees of technical, legal, and ethical challenges in different countries when it comes to access and the reusability of research data. This equally applies to data in web archives (Boté & Térmens, 2019; Truter, 2021). In this presentation we examine the challenges for the use and reuse of data from web archives from the perspectives of both web archive curators and users, and we assess how these challenges influence the application of FAIR principles to such data. Researchers' use of web archives has increased steadily in recent years, across a multitude of disciplines, using multiple methods (Maemura, 2022; Gomes et al., 2021; Brügger & Milligan, 2019). This development would imply that there is a diversity of requirements regarding the RDM lifecycle for the use and reuse of web archive data. Nonetheless, very little research has been conducted that examines the challenges researchers face in applying FAIR principles to the data they use from web archives. To better understand current research practices and RDM challenges for this type of data, a series of semi-structured interviews was undertaken with both researchers who use web or social media archives for their research and cultural heritage institutions interested in improving access to their born-digital archives for research. Through an analysis of the interviews we offer an overview of several aspects which present challenges for the application of FAIR principles to web archive data. 
We assess how current RDM practices transfer to such data from both a researcher and archival perspective, including an examination of how FAIR web archives are (Chambers, 2020). We also look at the legal and ethical challenges experienced by creators and users of web archives, and how they impact on the application of FAIR principles and cross-border data sharing. Finally, we explore some of the technical challenges, and discuss methods for the extraction of datasets from web archives using reproducible workflows (Have, 2020). 1:40pm - 2:00pm
Lessons Learned in Hosting the End of Term Web Archive in the Cloud 1University of North Texas, United States of America; 2Internet Archive, United States of America The End of Term (EOT) Web Archive is composed of member institutions across the United States that have come together every four years since 2008 to complete a large-scale crawl of the .gov and .mil domains, documenting the transition in the Executive Branch of the Federal Government of the United States. In years when a presidential transition did not occur, these crawls served as a systematic crawl of the .gov domain, in what has become a longitudinal dataset of crawls. In 2022 the EOT team from the UNT Libraries and the Internet Archive moved nearly 700TB of primary WARC content and derivative formats into the cloud. The goal of this work was to provide easier computational access to the web archive by hosting a copy of the WARC files and derivative WAT, WET, and CDXJ files in the Amazon S3 Storage Service as part of Amazon’s Open Data Sponsorship Program. In addition to these common formats in the web archive community, the EOT team modeled its work on the structure and layout of the Common Crawl datasets, including their use of the columnar storage format Parquet to represent CDX data in a way that enables access with query languages like SQL. This presentation will discuss the lessons learned in staging and moving these web archives into AWS, and the layout used to organize the crawl data into 2008, 2012, 2016, and 2020 datasets and further into different groups based on the original crawling institution. It will also give examples of how content staged in this manner can be used by researchers both inside and outside of a collecting institution to answer questions that had previously been challenging to answer about these web archives. The EOT team will discuss the documentation and training efforts underway to help researchers incorporate these datasets into their work. |
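As a rough illustration of what SQL access to CDX-style index data makes possible, the sketch below counts successful captures per crawl and host. It makes several assumptions: an in-memory SQLite table stands in for a Parquet-aware query engine, the column names (`crawl`, `host`, `status`, `mime`) are hypothetical and not the EOT schema, and the rows are made-up sample data.

```python
import sqlite3

# Made-up sample rows: (crawl, host, status, mime). Not real EOT data.
SAMPLE_ROWS = [
    ("2008", "example.gov", 200, "text/html"),
    ("2008", "example.gov", 404, "text/html"),
    ("2012", "example.gov", 200, "application/pdf"),
    ("2012", "sample.mil", 200, "text/html"),
    ("2012", "sample.mil", 200, "text/html"),
]


def successful_captures(rows):
    """Count HTTP 200 captures per (crawl, host) with a plain SQL query."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE cdx (crawl TEXT, host TEXT, status INTEGER, mime TEXT)"
        )
        con.executemany("INSERT INTO cdx VALUES (?, ?, ?, ?)", rows)
        return con.execute(
            "SELECT crawl, host, COUNT(*) FROM cdx "
            "WHERE status = 200 "
            "GROUP BY crawl, host ORDER BY crawl, host"
        ).fetchall()
    finally:
        con.close()


counts = successful_captures(SAMPLE_ROWS)
```

The same kind of query, run against a columnar index rather than this toy table, is what allows longitudinal questions to be asked across crawl years without reading the WARC files themselves.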
1:00pm - 3:00pm | WKSHP-06: RUN YOUR OWN FULL STACK SOLRWAYBACK Location: Labs Room 1 (workshops) Pre-registration required for this event. |
|
Run your own full stack SolrWayback Royal Danish Library, Denmark An in-person, updated version of the ‘21 WAC workshop Run your own full stack SolrWayback: This workshop will
Prerequisites:
Target audience: Web archivists and researchers with medium knowledge of web archiving and tools for exploring web archives. Basic technical knowledge of starting a program from the command line is required; the SolrWayback bundle is designed for easy deployment.
SolrWayback 4 (https://github.com/netarchivesuite/solrwayback) is a major rewrite with a strong focus on improving usability. It provides real-time full-text search, discovery, statistics extraction & visualisation, data export and playback of web archive material. SolrWayback uses Solr (https://solr.apache.org/) as the underlying search engine. The index is populated using Web Archive Discovery (https://github.com/ukwa/webarchive-discovery). The full stack is open source and freely available. A live demo is available at https://webadmin.oszk.hu/solrwayback/. During the conference there will be focused support for SolrWayback in a dedicated Slack channel by Thomas Egense and Toke Eskildsen. |
2:00pm - 2:20pm | BREAK |
2:20pm - 3:50pm | SES-16: PRESERVATION & COMPLEX DIGITAL PUBLICATIONS Location: Theatre 1 Session Chair: Kiki Lennaerts, Sound & Vision These presentations will be followed by a 10 min Q&A. |
|
2:20pm - 2:40pm
Preservability and Preservation of Digital Scholarly Editions 1University College Cork, Ireland; 2University of Sheffield; 3University of Glasgow Digital Scholarly Editions (DSE) are web resources, thus subject to data loss. While DSEs are usually the result of funded research, their longevity and preservation are uncertain. DSEs might be partially or completely captured during web archiving crawls, in some cases making web archives the only remaining publicly available source of information about a DSE. Patrick Sahle’s Catalogue of DSEs (2020) lists ~800 URLs referring to DSEs, of which 46 refer to the Internet Archive. This shows the overlap between DSEs and web archives and highlights the need for a closer look at the longevity and archiving of these important resources. This presentation will introduce a recent study on the availability and longevity of DSEs and present different preservation models and examples specific to DSEs. Examples of lost and partially preserved editions will be used to illustrate the problem of preservation and preservability of DSEs. This presentation will also outline the specific challenges of archiving DSEs. The C21 Editions project is a three-year international research collaboration researching the state of the art and the future of DSEs. As part of the project output, this presentation will introduce the main data sources on DSEs and demonstrate the workflow to assess DSE availability over time. It will illustrate the role web archives play in the preservation of DSEs as well as highlight specific challenges DSEs present to web archiving. As DSEs are complex projects, featuring multiple layers of data, transcription and annotation, their full preservation usually includes ongoing maintenance of the often custom-built backend system. Once project funding ends, these structures are very prone to deterioration and loss.
Besides ongoing maintenance, other preservation models exist, generally reducing the archiving scope in order to reduce the ongoing work required (Dillen 2019; Pierazzo 2019; Sahle and Kronenwett 2016). Such editions using compatible rather than bespoke solutions are more likely to be fully preserved. Other approaches include a “preservability by design” approach through minimal computing (Elwert n.d.) or standardization through existing services such as DARIAH or GitHub. The presentation will outline these models using examples of successful preservation as well as lost editions. This presentation is part of the larger C21 Editions project, a three-year international collaboration jointly funded by the Arts & Humanities Research Council (AH/W001489/1) and Irish Research Council (IRC/W001489/1).
2:40pm - 3:00pm
Collecting and presenting complex digital publications The British Library, United Kingdom 'Emerging Formats' is a term that is used by UK legal deposit libraries to describe experimental and innovative digital publications, for which there are no collection management solutions that can operate at scale. They are important to the libraries, and their users, as they document a period of creativity and rapid change, and often include authors and experiences that are less well represented in more mainstream publications, and are at high risk of loss. For six years, the UK legal deposit libraries have been working collaboratively and experimentally to both survey the types of publications, and to test approaches to collection that will support preservation, discovery and access. An important concept in this work has been 'contextual collecting', that seeks to preserve the best possible archival instance of a work, alongside information that documents how a work was created, and how it was experienced by users. Web archiving has formed an important part of this work, both in providing practical tools to support collection management, including access, and also in supporting the collection of contextual information. An example of this can be seen in the New Media Writing Prize thematic collection https://www.webarchive.org.uk/en/ukwa/collection/2912 In this presentation, we will step back from specific examples, and talk about what we have learned so far from our work as a whole. We will outline how this work, including user research and engagement, has shaped policy at the British Library, through the creation of our 'Content Development Plan' for Emerging Formats, and the role of web archiving within that plan. This presentation contributes to the Collections themes of 'blurring the boundaries between web archives and other born digital collections' and 'reuse of web archived materials for other born digital collections'. 
It builds on previous presentations to the Web Archiving Conference, which have focused on specific challenges related to collecting complex digital publications, to demonstrate how this research has informed the policy direction at the British Library and how web archiving infrastructure will be built into efforts to collect, assess and make accessible new publications.
3:00pm - 3:20pm
What can web archiving history tell us about preservation risks? KB, National Library of the Netherlands When people talk about the necessity of preservation, the first thing that comes to mind is the supposed risk of file format obsolescence. Within the preservation community there have been voices raising the concern that this might not be the most pressing risk. If we are actually solving the wrong problem, this means we neglect the real problem. Therefore, it is important to know that the solutions we create are solving demonstrably real problems. Web archiving could be a great source of information for researching the most urgent risks, because developments and standards on the web are very fluid. There are examples of file formats on the web, such as Flash, that are not supported anymore by modern browsers. However, these formats can still be rendered using widely available software. We have also seen that website owners migrated their content from Flash to HTML5. So, can we really say that obsolescence has resulted in loss of data? How can we find out more about this? And more importantly, can we find out which risks are actually more relevant? At the National Library of the Netherlands, we have been working on building a web collection since 2007. By looking at a few historical webpages we will illustrate where to look for answers and how to formulate better preservation risks using source data and context information. At iPRES 2022 we presented a short paper on the importance of context information for web collections. This information helps us in understanding the scope and the creation process of the archived website. In this presentation, we will demonstrate how we use this context information to search out sustainability risks for web collections. This will also give us insight into sustainability risks in general so we can create better informed preservation strategies.
3:20pm - 3:40pm
Towards an effective long-term preservation of the web. The case of the Publications Office of the EU Publications Office of the European Union, Luxembourg Much is being written about web archiving in general, where new and improved methods to capture the World Wide Web and to facilitate access to the resulting archives are constantly being described and shared. But when it comes to the long-term preservation of web sites, i.e. safeguarding the ARC/WARC files with proper planning of preservation actions beyond simple bit preservation, the literature is much less abundant. The Publications Office of the EU is responsible for the preservation of the websites authored by the EU institutions. In addition to our activities in harvesting and making accessible the content through our public web archive (https://op.europa.eu/en/web/euwebarchive), we started to delve more deeply into the management of content preserved for the long term. Our reflection focused on long-term risks such as obsolescence or loss of file usability, and on the availability of a disaster recovery mechanism for the platform providing access to the web archive. Ingesting web archive files into a long-term preservation system raises many questions:
To get some advice about all these questions and others, the Publications Office commissioned a study looking at published and grey literature, supplemented by a series of interviews conducted with leading institutions in the field of web archiving. This paper presents the findings and offers recommendations on how to answer the questions above. |
2:20pm - 3:50pm | SES-17: PROGRAM INFRASTRUCTURE Location: Theatre 2 Session Chair: René Voorburg, KB, National Library of the Netherlands These presentations will be followed by a 10 min Q&A. |
|
2:20pm - 2:40pm
Maintenance Practices for Web Archives Stanford University, United States of America What makes a web archive an archive? Why don’t we call them web collections instead, since they are resources that have been collected from the web and made available again on the web? Perhaps one reason that the term archive has stuck is that it entails a commitment to preserving the collected web resources over time, and making continued access to them available. Just like the brick and mortar buildings that must be maintained to house traditional archives, web archives are supported by software and hardware infrastructure that must be cared for in order to ensure that the web archives remain accessible. In this talk we will present some examples of what this maintenance work looks like in practice, drawing from experiences at Stanford University Libraries (SUL). While many organizations actively use third party services like Archive-It, PageFreezer, and ArchiveSocial to create web archives, it is less common for them to retrieve the collected data and make it available outside that service platform. Since 2012 SUL has been engaged in building web archive collections as part of its general digital collections using tools such as httrack, CDL’s Web Archiving Service, Archive-It and more recently Webrecorder. These collections have been made available using the OpenWayback software, but in 2022 SUL switched to using the PyWB application. We will discuss some of the reasons why Stanford initially found it important to host its own web archiving replay service and what factors led to switching to PyWB. Work such as reindexing and quality assurance testing was integral to moving to PyWB, which in turn generated new knowledge about the web archive records, as well as new practices for transitioning them into the new software environment.
The acquisition and preservation of web archives, and access to them, have been incorporated into the microservice architecture of the Stanford Digital Repository. One key benefit of this mainstreaming is shared terminology, infrastructure and maintenance practices for web archives, which is essential for sustaining the service. We will conclude with some consideration of what these local findings suggest about successfully maintaining open source web archiving software as a community.
2:40pm - 3:00pm
Radical incrementalism and the resilience and renewal of the National Library of Australia's web archiving infrastructure 1National Library of Australia, Australia; 2National Library of Australia, Australia The National Library of Australia’s web archiving program is one of the world’s earliest established and longest continually sustained operations. From its inception it was focused on establishing and delivering a functional operation as soon as feasible. This work historically included the development of policy, procedures and guidelines, together with much effort working through the changing legal landscape, from a permissions-based operation to one based on legal deposit warrant. Changes to the Copyright Act (1968) in 2016, which extended legal deposit to online materials, gave impetus to the NLA’s strategic priorities to increase comprehensive collecting objectives and to expand open access to its entire web archive corpus. This also had significant implications for the NLA’s online collecting infrastructure. In part this involved confronting and dealing with a large legacy of web content collected by various tools and structured in disparate forms; and in part it involved a rebuild of the collecting workflow infrastructure while sustaining and redeveloping existing collaborative collecting processes. After establishing this historic context, this presentation will focus attention on the NLA’s approach to the development of its web archiving infrastructure – an approach described as radical incrementalism: taking small, pragmatic steps that lead over time to achieving major objectives. While effective in providing the way to achieve strategic objectives, this approach can also build a legacy of infrastructural dead-weight that needs to be dealt with in order to continue to sustain and renew the dynamic and challenging task of web archiving.
With a radical team restructure and an agile and iterative approach to development, the NLA has made significant progress in recent times in moving from a legacy infrastructure to one of renewed sustainability and flexibility in application. This presentation will highlight some of the recent developments in the NLA’s web archiving infrastructure, including the web archive collection management system (including ‘Bamboo’ and ‘OutbackCDX’) and the web archive workflow management tool, ‘PANDAS’.
3:00pm - 3:20pm
Arquivo.pt behind the curtains FCT: Arquivo.pt, Portugal Arquivo.pt is a governmental service that enables search and access to historical information preserved from the Web since the 1990s. The search services provided by Arquivo.pt include full-text search, image search, version history listing, advanced search and application programming interfaces (API). Arquivo.pt has been running as an official public service since 2013, but in the same year its system totally collapsed due to a severe hardware failure and over-optimistic architectural design. Since then, Arquivo.pt has been completely renewed to improve its resilience. At the same time, Arquivo.pt has been widening the scope of its activities by improving the quality of the acquired web data and deploying online services of general interest to public administration institutions, such as the Memorial that preserves the information of historical websites or Arquivo404 that fixes broken links in live websites. These innovative offers require the delivery of resilient services that are constantly available. The Arquivo.pt hardware infrastructure is hosted at its own data centre and is managed by full-time dedicated staff. The preservation workflow is performed through a large-scale information system distributed over about 100 servers. This presentation will describe the software and hardware architectures adopted to maintain the quality and resilience of Arquivo.pt. These architectures were “designed-to-fail” following a “share-nothing” paradigm. Continuous integration tools and processes are essential to assure the resilience of the service. The Arquivo.pt online services are supported by 14 micro-services that must be kept permanently available. The Arquivo.pt software architecture is composed of 8 systems that host 35 components and the hardware architecture is composed of 9 server profiles. The average availability of the online services provided by Arquivo.pt in 2021 was 99.998%.
Web archives must urgently assume their role in digital societies as memory keepers of the 21st century. The objective of this presentation is to share our lessons learned at a technical level so that other initiatives may be developed at a faster pace using the most adequate technologies and architectures.
3:20pm - 3:40pm
Implementing access to and management of archived websites at the National Archives of the Netherlands Nationaal Archief, The Netherlands The National Archives of the Netherlands, as a permanent government agency and official archive for the Central Government, has the legal duty, laid down in the Archiefwet, to secure the future of the government record. In the case of this proposal the focus is on how we worked on the infrastructure and processes of our trusted digital repository (TDR for short) relating to the ingestion, storage, management and preservation of, and the provision of access to, archived public websites of the Dutch Central Government. In 2018 we issued a very well received guideline on archiving websites (2018). We tried to involve our producers in drafting the guidelines, part of which was to organize a public review. We received no less than 600 comments from 30 different organizations, which enabled us to improve the guidelines and immediately bring them to the attention of potential future users. These guidelines were also used as part of the requirements of a public European tender (2021). The objective of the tender was to realize a central harvesting platform (hosted by https://www.archiefweb.eu/openbare-webarchieven-rijksoverheid/) to structurally harvest circa 1500 public websites of the Central Government. This enabled us as an archival institution to influence the desired outcome of the harvesting process for these 1500 websites owned by at least all Ministries and most of their agencies. A main challenge was that our off-the-shelf version of the OpenWayback viewer wasn’t a complete version of the software and therefore wasn’t able to render increments or provide a calendar function, one of the key elements of the minimum viable product we aimed at. |
3:50pm - 4:00pm | SHORT BREAK |
4:00pm - 5:00pm | KEYNOTE: Marleen Stikker. Introduced and chaired by Martijn Kleppe, KB Location: Theatre 1 Session Chair: Martijn Kleppe, KB, National Library of the Netherlands |
5:00pm - 5:15pm | CLOSING REMARKS: Jeffrey van der Hoeven, KB, National Library of the Netherlands Location: Theatre 1 Session Chair: Jeffrey van der Hoeven, KB, National Library of the Netherlands |
Conference: IIPC WAC 2023 |