Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only the sessions on that day or at that location. Select a single session for a detailed view (with abstracts and downloads, if available). To see only the sessions of the Online Day on 3 May, select "Online" as the location.

Please note that all times are shown in the time zone of the conference.

Session Overview
Session
OL-SES-04: Q&A: SAMPLING THE HISTORICAL WEB & TEMPORAL RESILIENCE OF WEB PAGES
Time:
Wednesday, 03/May/2023:
3:05pm - 3:35pm

Session Chair: Laura Wrubel, Stanford University
Virtual location: Online


Presentations

Lessons Learned From the Longitudinal Sampling of a Large Web Archive

Kritika Garg1, Sawood Alam2, Michele Weigle1, Michael Nelson1, Corentin Barreau2, Mark Graham2, Dietrich Ayala3

1Old Dominion University, Norfolk, Virginia - USA; 2Internet Archive, San Francisco, California - USA; 3Protocol Labs, San Francisco, California - USA

We document the strategies and lessons learned from sampling the web by collecting 27.3 million URLs with 3.8 billion archived pages spanning 26 years of the Internet Archive's holdings (1996–2021). Our overall project goal is to revisit fundamental questions regarding the size, nature, and prevalence of the publicly archivable web, and in particular, to reconsider the question, "How long does a web page last?" Addressing this question requires obtaining a "representative sample of the web." We proposed several orthogonal dimensions for sampling URLs from the archived web: time of the first archive, MIME type (HTML vs. other types), URL depth (top-level pages vs. deep links), and TLD. We sampled 285 million URLs from IA's ZipNum index file, which contains every 6,000th line of the CDX index. These included URLs of embedded resources, such as images, CSS, and JavaScript. To limit our samples to web pages, we filtered the URLs for likely HTML pages (based on filename extensions). We determined the time of the first archive and the MIME type using IA's CDX API, and grouped the 92 million URLs with "text/html" MIME types by the year of their first archive. Because archiving speed and capacity have increased significantly over time, we found fewer URLs archived in the early years than in later years. Hence, we adjusted our goal of 1 million URLs per year and clustered the early years (1996–2000) to reach that size (1.2 million URLs). We also noticed an increase in archived deep links over the years, so we extracted the top-level URLs from the deep links to upsample the earlier years. We found that popular domains like Yahoo and Twitter were over-represented in the IA, so we performed logarithmic-scale downsampling based on the number of URLs sharing a domain. Given the collection size, we employed various sampling strategies to ensure fair domain and temporal representation. Our final dataset contains TimeMaps of 27.3 million URLs comprising 3.8 billion archived pages from 1996 to 2021. We convey the lessons learned from sampling the archived web, which could inform other studies that sample from web archives.
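To make two of the steps above concrete, here is a minimal Python sketch of (a) filtering URLs for likely HTML pages by filename extension and (b) logarithmic-scale downsampling of URLs that share a domain. The extension allow-list, the logarithm base, and the helper names are illustrative assumptions, not details taken from the paper, and the sketch does not touch the ZipNum index or the CDX API.

```python
import math
import random
from collections import defaultdict
from urllib.parse import urlparse

# Extensions treated as "likely HTML"; the paper filters on filename
# extensions, but this particular allow-list is an assumption.
LIKELY_HTML_SUFFIXES = ("", ".html", ".htm", ".php", ".asp", ".aspx", ".jsp")

def is_likely_html(url: str) -> bool:
    """Heuristically keep URLs whose path looks like a web page rather
    than an embedded resource (image, CSS, JavaScript)."""
    path = urlparse(url).path.lower()
    if path.endswith("/"):
        return True
    last_segment = path.rsplit("/", 1)[-1]
    dot = last_segment.rfind(".")
    suffix = last_segment[dot:] if dot != -1 else ""
    return suffix in LIKELY_HTML_SUFFIXES

def downsample_by_domain(urls, base=10, seed=42):
    """Logarithmic-scale downsampling: a domain with n URLs keeps roughly
    1 + log_base(n) of them, so heavily archived domains (e.g., yahoo.com)
    no longer dominate the sample."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for url in urls:
        by_domain[urlparse(url).netloc.lower()].append(url)
    sample = []
    for domain_urls in by_domain.values():
        keep = 1 + int(math.log(len(domain_urls), base))
        sample.extend(rng.sample(domain_urls, min(keep, len(domain_urls))))
    return sample

if __name__ == "__main__":
    urls = [f"https://www.yahoo.com/page{i}.html" for i in range(1000)]
    urls += ["https://example.org/", "https://example.org/logo.png"]
    pages = [u for u in urls if is_likely_html(u)]          # drops logo.png
    print(len(downsample_by_domain(pages)))                 # a few yahoo.com URLs plus example.org
```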



TrendMachine: Temporal Resilience of Web Pages

Sawood Alam1, Mark Graham1, Kritika Garg2, Michele Weigle2, Michael Nelson2, Dietrich Ayala3

1Internet Archive, San Francisco, California - USA; 2Old Dominion University, Norfolk, Virginia - USA; 3Protocol Labs, San Francisco, California - USA

"How long does a web page last?" is commonly answered with "40 to 100 days", with sources dating back to the late 1990s. The web has since evolved from mostly static pages to dynamically-generated pages that heavily rely on client-side scripts and user contributed content. Before we revisit this question, there are additional questions to explore. For example, is it fair to call a page dead that returns a 404 vs. one whose domain name no longer resolves? Is a web page alive if it returns content, but has drifted away from its original topic? How to assess the lifespan of pages from the perspective of fixity with the spectrum of content-addressable pages to tweets to home pages of news websites to weather report pages to push notifications to streaming media? To quantify the resilience of a page, we developed a mathematical model that calculates a normalized score as time-series data based on the archived versions of the page. It uses Sigmoid functions to increase or decrease the score slowly on the first few observations of the same class. The score changes significantly if the observations remain consistent over time, and there are tunable parameters for each class of observation (e.g., HTTP status codes, no archival activities, and content fixity). Our model has many potential applications, such as identifying points of interest in the TimeMap of densely archived web resources, identifying dead links (in wiki pages or any other website) that can be replaced with archived copies, and aggregated analysis of sections of large websites. We implemented an open-source interactive tool [1] powered by this model to analyze URIs against any CDX data source. Our tool gave interesting insights on various sites, such as, the day when "cs.odu.edu" was configured to redirect to "odu.edu/compsci", the two and a half years of duration when "example.com" was being redirected to "iana.org", the time when ODU’s website had downtime due to a cyber attack, or the year when Hampton Public Library’s domain name was drop-catched to host a fake NSFW store.

[1] https://github.com/internetarchive/webpage_resilience
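The following Python sketch illustrates the general idea described above: a normalized score that a streak of same-class observations moves slowly at first and more decisively once the streak is sustained, with tunable parameters per observation class. The class names, parameter values, and update rule here are illustrative assumptions and are not the authors' actual TrendMachine formula; see the tool [1] for the real model.

```python
import math
from dataclasses import dataclass

def sigmoid(x: float) -> float:
    """Standard logistic function; a streak's effect grows slowly at first."""
    return 1.0 / (1.0 + math.exp(-x))

@dataclass
class ClassParams:
    """Tunable parameters for one observation class. Values are illustrative."""
    direction: float   # +1 pushes the score up, -1 pushes it down
    steepness: float   # how quickly a streak starts to matter
    midpoint: float    # streak length at which the effect is half strength

# Hypothetical observation classes and parameterization.
PARAMS = {
    "ok_same_content": ClassParams(+1.0, 1.0, 3.0),
    "ok_changed_content": ClassParams(-0.3, 1.0, 3.0),
    "http_error": ClassParams(-1.0, 1.5, 2.0),
    "no_capture": ClassParams(-0.5, 0.8, 4.0),
}

def resilience_series(observations, step=0.05):
    """Return a normalized [0, 1] score after each observation.

    A run of observations of the same class moves the score by an amount
    scaled by a sigmoid of the current streak length, so the first one or
    two observations nudge the score gently while a sustained, consistent
    run moves it decisively."""
    score, streak, prev = 0.5, 0, None
    series = []
    for obs in observations:
        streak = streak + 1 if obs == prev else 1
        p = PARAMS[obs]
        weight = sigmoid(p.steepness * (streak - p.midpoint))
        score = min(1.0, max(0.0, score + p.direction * step * weight))
        series.append(round(score, 3))
        prev = obs
    return series

if __name__ == "__main__":
    history = ["ok_same_content"] * 5 + ["no_capture"] * 3 + ["http_error"] * 4
    print(resilience_series(history))  # rises, dips gently, then drops decisively
```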


