Conference Agenda

Overview and details of the sessions of the IIPC Web Archiving Conference (WAC) 2023. All times are shown in the time zone of the conference (CEST).

 
Session Overview
SES-12: DOMAIN CRAWLS
Time: Friday, 12 May 2023, 10:30am - 12:00pm
Session Chair: Grace Bicho, Library of Congress
Location: Theatre 1


These presentations will be followed by a 10-minute Q&A.

Presentations
10:30am - 10:50am

Discovering and Archiving the Frisian Web. Preparing for a National Domain Crawl.

Susanne van den Eijkel, Iris Geldermans

KB, National Library of the Netherlands

Over the past few years, KB, the National Library of the Netherlands (KBNL), conducted a pilot for a national domain crawl. KBNL has been harvesting websites with the Web Curator Tool (a web interface to the Heritrix crawler) since 2007, on a selective basis focused on Dutch history, culture and language. Information on the web can be short-lived, yet of vital importance to researchers now and in the future. Furthermore, KBNL's content strategy states the library's ambition to collect everything published in and about the Netherlands, websites included. As more libraries around the world began collecting their national domains, KBNL also expressed the wish to carry out a national domain crawl. Before we were able to do that, we had to form a multidisciplinary web archiving team, decide on a new tool for domain harvests and start an intensive testing phase. For this pilot, a regional domain, the Frisian web, was selected. Since we were new to domain harvesting, we used a selective approach. Curators of digital collections at KBNL were in close contact with Frisian researchers to help define which websites needed to be included in the regional domain. During the pilot we also gathered more knowledge about Heritrix, as we were using NetarchiveSuite (also a web interface to the Heritrix crawler) for these crawls.

Now that the results are in, we can share our lessons learned, such as the technical and legal challenges and the related policies that are needed for web collections. We will also go into detail about the crawler software settings that were tested and how such information can be used as contextual information for the resulting collection.

This presentation relates to the conference topics of collections, community, and program operations, as we want to share best practices for executing a (regional) domain crawl and lessons learned in preparation for a national domain crawl. Furthermore, we will focus on the next steps after completion of the pilot. Other institutions that already harvest websites can learn from our experience, and those that want to start web archiving can be better prepared.



10:50am - 11:10am

Back to Class: Capturing the University of Cambridge Domain

Caylin Smith, Leontien Talboom

Cambridge University Libraries, United Kingdom

The University Archives of Cambridge University, based at the University Library (UL), is responsible for the selection, transfer, and preservation of the internal administrative records of the University, dating from 1266 to the present. These records are increasingly created in digital formats, including common ‘office’ formats (Word, Excel, PDF) as well as content on the web.

The question posed by Cramer et al. (2022) about the digital scholarly record, “How do you preserve an entire online ecosystem in which scholars collaborate, discover and share new knowledge?”, applies equally to online learning and teaching materials and to the day-to-day business records of a university.

Capturing this online ecosystem as comprehensively as possible, rather than selectively, is an undertaking that involves many stakeholders and moving parts.

As a UK Legal Deposit Library, the UL is a partner in the UK Web Archive, and Cambridge University websites are captured annually; however, some online content needs to be captured more frequently, does not have an identifiable UK web address, or is behind a log-in screen.

To improve this capture, the UL is working on the following:

  • Engaging with content creators and/or University Information Services, which supports the University’s Drupal platform.
  • Working directly with the University Archivist, and creating a web archiving working group with additional Library staff, to identify which University websites need to be captured manually, or are captured only in the annual domain crawl but need to be captured more frequently.
  • Becoming a stakeholder in web transformation initiatives to communicate requirements for creating preservable websites and quality checking new web templates from an archival perspective.
  • Identifying potential tools for capturing online content behind login screens. So far, WebRecorder.io has been a successful tool for capturing this material; however, this is a time-consuming, manual process that would be improved if automated. The automation of this process is currently being explored (see the sketch after this list).
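
The sketch below is an illustration only, not part of the UL's workflow. It shows how an authenticated capture might be automated with a headless browser using Playwright for Python; the login URL, form selectors, credentials, and target pages are placeholders, and a production setup would record traffic to WARC (for example by routing the browser through a recording proxy) rather than saving plain HTML snapshots.

    # Hypothetical sketch: automate a login and capture pages behind it with
    # Playwright. All URLs, selectors, and credentials below are placeholders.
    from pathlib import Path
    from playwright.sync_api import sync_playwright

    LOGIN_URL = "https://example.cam.ac.uk/login"          # placeholder
    TARGET_URLS = ["https://example.cam.ac.uk/minutes"]    # placeholders

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        # Authenticate once; the selectors depend on the site's login form.
        page.goto(LOGIN_URL)
        page.fill("#username", "service-account")          # placeholder
        page.fill("#password", "********")                 # placeholder
        page.click("button[type=submit]")

        # Save a simple HTML snapshot of each page in the authenticated session.
        for url in TARGET_URLS:
            page.goto(url, wait_until="networkidle")
            Path(url.rstrip("/").split("/")[-1] + ".html").write_text(
                page.content(), encoding="utf-8"
            )

        browser.close()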

Our presentation will walk WAC2023 attendees through our current workflow and highlight ongoing challenges we are working to resolve, so that attendees based at universities can take these into account when archiving content on their own university’s domains.



11:10am - 11:30am

Laboratory not Found? Analyzing LANL’s Web Domain Crawl

Martin Klein, Lyudmila Balakireva

Los Alamos National Laboratory, United States of America

Institutions, regardless of whether they identify as for-profit, nonprofit, academic, or government, are invested in maintaining and curating their representation on the web. The organizational website is often the top-ranked result on search engine result pages and is commonly used as a platform to communicate organizational news, highlights, and policy changes. Individual web pages from this site are often distributed via organization-wide email channels, included in news articles, and shared via social media. Institutions are therefore motivated to ensure the long-term accessibility of their content. However, resources on the web frequently disappear, leading to the well-known problem of link rot. Beyond the inconvenience of encountering a “404 - Page Not Found” error, there may be legal implications when published government resources are missing, trust issues when academic institutions fail to provide content, and even national security concerns when taxpayer-funded federal research organizations such as Los Alamos National Laboratory show deficient stewardship of their digital content.

We therefore conducted a web crawl of the lanl.gov domain, motivated by the desire to investigate the scale of missing resources within the canonical website representing the institution. We found a noticeable number of broken links, including a significant number of the special cases of link rot commonly known as “soft 404s”, as well as potential transient errors. We further evaluated the recovery rate of missing resources from more than twenty public web archives via the Memento TimeTravel federated search service. Somewhat surprisingly, our results show little success in recovering missing web pages.
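
As an illustration only (not the authors' actual pipeline), the sketch below shows the general shape of such a check in Python: test whether a URL still resolves and, if it appears to be missing, query the public Memento TimeTravel JSON API to see whether any web archive holds a copy. The URL list is a placeholder, and detecting soft 404s would require heuristics beyond the status code.

    # Illustrative sketch: flag apparently missing URLs and look for archived
    # copies via the Memento TimeTravel API. URLs below are placeholders.
    import requests

    TIMETRAVEL = "http://timetravel.mementoweb.org/api/json/{ts}/{url}"

    def is_missing(url):
        # A plain status check; soft 404s (pages that return 200 but say
        # "not found") would need content-based heuristics on top of this.
        try:
            resp = requests.head(url, allow_redirects=True, timeout=10)
            return resp.status_code >= 400
        except requests.RequestException:
            return True  # could also be a transient network error

    def closest_memento(url, ts="20230101"):
        # TimeTravel responds with 404 when no archive holds a copy.
        resp = requests.get(TIMETRAVEL.format(ts=ts, url=url), timeout=30)
        if resp.status_code != 200:
            return None
        closest = resp.json().get("mementos", {}).get("closest", {})
        uris = closest.get("uri", [])
        return uris[0] if uris else None

    for url in ["https://www.lanl.gov/some/old/page.html"]:  # placeholder
        if is_missing(url):
            print(url, "->", closest_memento(url) or "no archived copy found")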

These observations lead us to argue that, as an institution, we could be a better steward of our web content and that establishing an institutional web archive would be a significant step towards this goal. We therefore implemented a pilot LANL web archive to highlight the availability and authenticity of our web resources.

In this presentation, I will motivate the project, outline our workflow, highlight our findings, and demonstrate the implemented pilot LANL web archive. The goal is to showcase an example of an institutional web crawl that, in conjunction with the evaluation, can serve as a blueprint for other interested parties.



11:30am - 11:50am

Public policies for governmental web archiving in Brazil

Jonas Ferrigolo Melo¹, Moisés Rockembach²

¹University of Porto, Portugal; ²Federal University of Rio Grande do Sul, Brazil

The scientific, cultural, and intellectual relevance of web archiving has been widely recognized since the 1990s. The preservation of the web has been examined in numerous studies, ranging from its specific theories and practices, such as methodological approaches and the ethical aspects of preserving web pages, to subjects that permeate the Digital Humanities and the use of web archives as a primary source.

This study aims to identify the documents and actions related to the development of a web archiving policy in Brazil. The methodology used was bibliographic and documentary research, drawing on the literature on government web archiving and on legislation regarding public policies.

Brazil has a variety of technical resources and legislation addressing the need to preserve government documents; however, websites have not yet been included in the records management practices of Brazilian institutions. Until recently, the country did not have a website preservation policy, but there are currently two government actions under development.

A bill that has been under consideration in the National Congress since July 2015 provides for institutional digital public heritage on the web. This bill has been before the Constitution and Justice and Citizenship Commission (CCJC) of the Brazilian National Congress since December 2022.

Another action comes from the Brazilian National Council of Archives (CONARQ), which established a technical chamber to define guidelines for the elaboration of studies, proposals, and solutions for the preservation of websites and social media. Based on its general goals, the technical chamber has produced two documents: (i) the Website and Social Media Preservation Policy; and (ii) a recommendation of basic elements for the digital preservation of websites and social media. The documents were approved in December 2022 and will be published as a federal resolution.

These actions show that efforts are under way in Brazil for the state to take a proactive role in promoting and leading this technological innovation. The definition of a web archiving policy, as well as of the requirements for selecting preservation and archiving methods, technologies, and the content to be archived, can already be considered a reality in Brazil.



 