Conference Agenda
Overview and details of the sessions of this conference.
Session Overview

Session: RESPONSIBLE STRATEGIES

Presentations
1:25pm - 1:47pm

End of Term Web Archive: Harmonizing WARC contributions from multiple crawling partners

1: University of North Texas Libraries, United States of America; 2: Internet Archive, United States of America

Every four years, the End of Term (EOT) Web Archive documents the transition in the executive branch of the United States federal web space by harvesting federal .gov and public .mil domains. The most recent transition, from the Biden to the Trump administration, resulted in the largest data collection yet, with over 2.3 PB of content crawled by six different crawling partners. From the beginning of the EOT Web Archive project, the diversity of approaches to crawling and curating portions of the overall project has been seen as a benefit: it allowed partners to experiment with different crawling strategies while focusing on the content their organizations were willing and able to collect. In the EOT-2024 process, this diversity of collecting institutions resulted in a wide range of implementations of the WARC format and required the project team to decide how best to harmonize the data and make it available to researchers for computational use. The variations include WARC files created using record-at-a-time gzip compression, WARC files packaged in the Web Archive Collection Zipped (WACZ) format, WARC data compressed using the Zstandard data compression algorithm, and WARC files packaged in the BagIt format, with bag metadata files stored alongside the WARC payload files themselves. To provide a consistent file format and access paradigm to end users who may not be familiar with this range of WARC variations and their nuances, the EOT team decided to normalize all streams of WARC data into individual WARC files with record-at-a-time gzip compression.
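The target format named above, record-at-a-time gzip, can be illustrated with a short sketch: each WARC record is compressed as its own gzip member, and the members are simply concatenated into one .warc.gz file. Standard gzip tooling reads the concatenation as a single stream, while an index of byte offsets lets replay tools decompress a single record in isolation. The records below are simplified placeholders, not full WARC records.

```python
import gzip
import io

# Two minimal, illustrative WARC-style records (placeholders, not valid WARC syntax)
records = [
    b"WARC/1.1\r\nWARC-Type: warcinfo\r\n\r\npayload one\r\n\r\n",
    b"WARC/1.1\r\nWARC-Type: response\r\n\r\npayload two\r\n\r\n",
]

# Record-at-a-time gzip: compress each record as its own gzip member,
# then concatenate the members into one file.
buf = io.BytesIO()
for rec in records:
    buf.write(gzip.compress(rec))
warc_gz = buf.getvalue()

# A compliant gzip reader decompresses the concatenated members as one stream,
# while per-record offsets allow seeking to and decompressing a single record.
assert gzip.decompress(warc_gz) == b"".join(records)
```

This is why the format suits both bulk computational use (stream the whole file) and replay (seek to one record); libraries such as warcio expose the same idea through a per-record gzip writer.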
Normalizing several of these formats presented non-trivial challenges. While the data for the public dataset was normalized, the originally contributed formats are archived as deposited at the Internet Archive, where they are served by the Wayback Machine. The resulting dataset will hopefully provide end users with an easily accessible set of files that can be used for a variety of future projects. This presentation provides a novel focus on normalizing heterogeneous WARC files in order to provide a consistent set of interactions for end users who are not primarily web archivists. It will give a brief introduction to the EOT collection process, but concentrates on the different tools and resulting WARC implementations generated in the most recent round of this effort, the decisions the EOT team made to normalize these WARC records, and the technical approaches used throughout the dataset-creation portion of the project.

1:47pm - 2:09pm
Crawl, cloud, carbon: measuring and reducing emissions for web archivists

Tailpipe, United Kingdom

A walkthrough of a novel methodology for precisely estimating the carbon emissions generated by cloud computing, contextualised within a case study in which the emissions of a major web archiving platform were measured. The presentation begins with an explanation of how cloud computing generates carbon emissions: a chain that connects the cloud service user to the datacentre that processes their requests, to the power station that fuels the datacentre, to the energy source that generates the necessary electricity. This chain is illustrated with data from the emissions assessment of the aforementioned web archiving platform. The emissions intensity of web archiving is also highlighted: as a compute- and storage-intensive process reliant on a vast network of cloud storage, it consumes a significant amount of power and thereby generates material quantities of carbon emissions.

Next, the methodology for estimating cloud computing emissions is detailed. The step-by-step explanation begins with an assessment of the power draw of the hardware components that host cloud services. This dataset is combined with measured processor utilisation data to determine the overall power draw of a user's or organisation's use of cloud services. The carbon emissions of this power draw are then calculated using regional carbon intensity grid-mix data, accounting for regional power transmission losses. Alongside these 'operational' emissions, the methodology is expanded to encompass other elements of the cloud computing infrastructure's lifecycle, including manufacture, shipping and disposal.
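The operational-emissions steps described above can be sketched as a worked example. Every figure here is an illustrative assumption, not measured data from the case study, and the real methodology draws on per-component and per-region datasets rather than single constants.

```python
# Hypothetical worked example of an operational-emissions estimate.
# All figures below are illustrative assumptions, not measured data.
server_power_w = 350.0    # assumed full-load power draw of the host server
utilization = 0.40        # assumed measured share of the server's capacity used
hours = 24.0              # one day of crawling/storage activity

# Energy attributable to the workload, in kWh
energy_kwh = server_power_w * utilization * hours / 1000.0

# Account for regional transmission losses between power station and datacentre
transmission_loss = 0.05  # assumed 5% regional loss
generated_kwh = energy_kwh / (1.0 - transmission_loss)

# Convert to emissions using regional grid carbon intensity (gCO2e per kWh)
grid_intensity = 250.0    # assumed regional grid mix
operational_gco2e = generated_kwh * grid_intensity

print(round(operational_gco2e, 1))  # → 884.2
```

Lifecycle ('embodied') emissions from manufacture, shipping and disposal would be added on top of this operational figure, typically amortised over the hardware's service life.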
This methodology is accompanied by examples from the web archiving case study, covering the types of hardware used by web archivists, the cloud services used to host web archiving, and the carbon intensity of the datacentres that most commonly host web archive data. Results from empirical testing will demonstrate the precision of the estimated power and emissions calculations, and areas for further refinement will be highlighted. The presentation concludes with recommendations to help web archivists reduce the carbon emissions generated by their processes, including migrating services to datacentres in low-carbon-intensity regions and maximising the efficiency of web archiving software hosted on cloud services.

2:09pm - 2:30pm
How the “M” service contributes to reducing the carbon footprint

Arquivo.pt, Portugal

This presentation provides an overview of seven years of “M”, a service offered to the community since 2018 that allows organizations to shut down old websites while keeping their content accessible, thereby reducing their carbon footprint. Organizations create websites for a wide variety of purposes and sometimes end up maintaining dozens of small sites without updating them; universities, for example, create websites dedicated to events, conferences and research projects. What to do: shut down the websites and lose interesting information? This is where the “M” service comes in. We consider the service from three perspectives: 1) how it works; 2) how it adds value to organizations; 3) community involvement. We conclude by outlining the next steps for expanding the service.

1) How it works. “M” essentially redirects a domain to a historical version preserved in the “Web Archive”. The workflow begins with a request from the organization that owns the website. The “Web Archive” makes a high-quality recording of the website; the website owner only has to maintain and redirect the domain. The “Web Archive”, in turn, generates an SSL certificate and provides access to the archived content, with a landing page informing users that this is a historical version. The process involves collaboration between the “Web Archive” team and staff at the entities that have joined the “M” service.

2) How it adds value to organizations. In communicating the service to the community (external advocacy), we highlight the value of “M” in terms of energy savings, CO2 reduction and, therefore, a smaller carbon footprint. A second value, important to IT teams, is that it helps eliminate security flaws: websites that are no longer updated become targets for attacks.
Instead of eliminating websites whose content is useful to the community, IT teams and decision-makers can use the web archive to continue providing access to it.

3) Community involvement. In 2025, the “M” service reached approximately 284 websites from 26 institutions. Over the years, 50 websites were removed due to domain maintenance issues or discontinued collaboration. Processes have been improved and the service is poised for growth; for example, SSL certificate generation has been automated. External advocacy is a priority, as preserving websites in web archive format is not widely known. The next step in expanding the “M” service is to use the same workflow and structure to provide a rapid-response service in the event of cyber attacks on the websites of important organizations, such as universities: the “Web Archive” must be prepared to serve the latest archived version to such an entity. We believe that redirecting to the “Web Archive”, as the “M” service does, is an important contribution to disaster recovery processes. The presentation concludes with the “Web Archive” vision of creating services for the community. It is essential to offer services: 1) to demonstrate the usefulness of web archives to organizations; 2) to point out their contribution to sustainability goals.
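The redirect at the heart of the workflow in (1) can be sketched as a URL mapping: the retired domain keeps resolving, but every request is sent to the archived snapshot. The base URL and timestamp format below are assumptions modelled on common Wayback-style replay endpoints, not the service's actual configuration.

```python
# Minimal sketch of the "M" redirect idea: map a retired site's URL onto its
# archived snapshot. ARCHIVE_BASE and SNAPSHOT_TS are illustrative assumptions
# modelled on Wayback-style replay URLs, not Arquivo.pt's real configuration.
ARCHIVE_BASE = "https://arquivo.pt/wayback"  # assumed replay endpoint
SNAPSHOT_TS = "20180615000000"               # hypothetical crawl timestamp

def archived_url(original_url: str) -> str:
    """Build the replay URL that the retired domain should redirect to."""
    return f"{ARCHIVE_BASE}/{SNAPSHOT_TS}/{original_url}"

print(archived_url("http://conference2017.example.org/programme"))
```

In practice the owner points the domain's DNS at the archive, which serves the certificate and the landing page; the mapping above is what the redirect ultimately resolves to.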
