Conference Agenda
Overview and details of the sessions of this conference.
Session Overview

Session: WORKFLOWS FOR BUILDING AND ANALYSING DATA

Presentations
1:35pm - 1:57pm
Digital Diaspora: mapping the Jewish internet
The National Library of Israel, Israel

Methods are being developed to systematically detect and archive Jewish web content on a large scale, capturing the evolving, multilingual digital expression of diasporic culture. This presentation outlines new procedures for the systematic detection and collection of Jewish web materials. Building on earlier curatorial approaches, this phase of the project focuses on automating the identification of thematically relevant websites through content-based analysis. Drawing on linguistic markers, semantic clustering, and metadata extraction, the process generates an expansive and continuously updated registry of Jewish web domains.

To expand the detection of thematically relevant web content, the workflow integrates automated site aggregation with multilingual linguistic modeling. The system applies cross-lingual text analysis, semantic clustering, and metadata extraction according to defined selection criteria, enabling the identification of recurring cultural, historical, and communal markers across diverse digital sources. Detecting websites by thematic relevance rather than by technical metadata or domain structures presents a distinct challenge, as cultural or communal identity is often conveyed implicitly through language, visual and textual cues, and context rather than explicit tags or classifications. However, by embedding these computational processes within curatorial practice, the project broadens how the Jewish digital sphere is identified and delineated, ensuring that content produced in multiple languages and regions is systematically recognized and incorporated into the resulting archive.

The presentation will address the conceptual design and technical aspects of this workflow, including criteria for data selection, the balance between automation and curatorial oversight, and methods for verifying the alignment of collected materials with the intended thematic focus.
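The general shape of such content-based detection can be illustrated with a small sketch. Everything here is an assumption for illustration only: the marker lists, the scoring function, and the threshold stand in for the project's actual (far richer) selection criteria, which combine linguistic markers with semantic clustering and metadata extraction.

```python
# Hypothetical sketch of marker-based thematic detection. The MARKERS sets,
# the scoring rule, and the 0.05 threshold are illustrative assumptions, not
# the project's actual selection criteria.
import re
from collections import Counter

# Illustrative multilingual marker terms; a real registry would be much
# larger and curated by subject specialists.
MARKERS = {
    "en": {"synagogue", "kosher", "diaspora", "shabbat"},
    "he": {"שבת", "כשר", "קהילה"},
}

def marker_score(text: str) -> float:
    """Fraction of tokens that match any marker term, across languages."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[term] for lang in MARKERS for term in MARKERS[lang])
    return hits / len(tokens)

def select_candidates(pages: dict[str, str], threshold: float = 0.05) -> list[str]:
    """Return domains whose sampled text exceeds the relevance threshold,
    sorted by descending score; borderline cases would go to curators."""
    scored = {domain: marker_score(text) for domain, text in pages.items()}
    return sorted((d for d, s in scored.items() if s >= threshold),
                  key=lambda d: -scored[d])
```

A pipeline like the one described would replace the keyword match with learned, cross-lingual representations, but the overall flow of score, threshold, and curatorial review stays the same.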
Beyond its technical contribution, the project reflects on the broader questions of how such workflows might inform other initiatives seeking to create expansive, thematically driven web collections, and how these systems can remain adaptable as online content and communities evolve. By presenting this next phase, the project invites further dialogue on how national and thematic archives can responsibly automate the preservation of networked, transnational cultural spheres.

1:57pm - 2:18pm
Improved language identification for web crawl data
Common Crawl Foundation, United Kingdom

Identifying the languages contained in crawl data is a fundamental step in exploring the multilinguality of web archives. However, this task is far from straightforward: language annotations contained in webpage metadata are often unreliable or missing, and existing language identification systems are limited in their ability to handle large-scale, diverse web crawl data. Specifically, common language identification systems used for web crawls (e.g. CLD2) cover only a small number of languages well and are not reliable for many under-served language varieties. At the same time, more recent high-coverage language identification systems (e.g. GlotLID) are too computationally expensive for large-scale pipelines and often lack robustness when dealing with the heterogeneity inherent in web data.

We therefore identify five desiderata for a language identification system suitable for annotating web crawls: it must be fast, computationally lightweight, adapted to the web domain, able to handle multilingual input, and easily extensible to additional language varieties. In this talk, we present a new language identification system designed for web crawl data that meets all these requirements. Our solution is implemented in Rust and is therefore performant enough to process large amounts of web data in a reasonable time. It is designed from scratch for the web domain, including identifying multilingual web pages. The initial model is able to identify around 200 language varieties, but is easy to extend to additional varieties given sufficient training data.

We benchmark our system’s performance against popular existing language identification models, measuring computational performance and language identification fidelity. We finish with a discussion of the potential impact of our system on downstream language technologies, with a particular focus on under-served languages.
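To make the task concrete, here is a toy character-trigram identifier, the classic lightweight approach behind many fast LID systems. It is emphatically not the system presented in the talk (which is a purpose-built Rust implementation covering ~200 varieties); the two-language training data and the overlap scoring are assumptions chosen only to show the technique.

```python
# Toy character-trigram language identifier. This is NOT the presented
# system; the training samples and scoring are illustrative assumptions.
from collections import Counter

def trigrams(text: str) -> Counter:
    """Character trigrams of the lowercased, space-padded text."""
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

class NgramLID:
    def __init__(self, samples: dict[str, str]):
        # One trigram profile per language, built from training text.
        self.profiles = {lang: trigrams(txt) for lang, txt in samples.items()}

    def identify(self, text: str) -> str:
        """Pick the language whose profile shares the most trigram mass."""
        query = trigrams(text)
        def overlap(profile: Counter) -> int:
            return sum(min(n, profile[g]) for g, n in query.items())
        return max(self.profiles, key=lambda lang: overlap(self.profiles[lang]))
```

Extending such a model to a new variety only requires adding a training sample, which mirrors the extensibility desideratum above; the hard parts the talk addresses are scale, web-domain robustness, and multilingual pages.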
Our language identification model is released under a permissive open source license to enable easy adoption and extension by the community.

2:18pm - 2:39pm
Hyperlinked homeland: A historical hyperlink analysis of 200 Dutch LGBT+ websites
University of Groningen, The Netherlands

Over the past years, scholars have increasingly emphasized that queer cultures intrinsically transcend national borders (Bayramoğlu et al., 2024). The transnational connections that LGBT+ people establish online, among others through hyperlinks (Kiel & Osterbur, 2017), are often presented as a case in point (e.g., Gonsalves & Velasco, 2022). My presentation, however, demonstrates that the nation still matters greatly. It builds on the interdisciplinary project I conducted as Researcher-in-Residence at, and in close collaboration with, the National Library of the Netherlands (KB), drawing from the fields of queer internet studies, web archive studies and network analysis.

Using historical hyperlink analysis, I analyzed the special LGBT+ web collection of the KB. This collection is unique in size and richness, comprising archived websites of hundreds of LGBT+ organizations and individuals, each of which has been harvested once annually. However, the collection has not yet been researched by others. The talk focuses on the 200 LGBT+ websites that were harvested in 2020 (for pragmatic reasons: in terms of size and quality of the LGBT+ collection, this is the best year to scrutinize). To identify the (trans)national queer networks they formed that year, I extracted and scrutinized all hyperlinks of these websites. After all, hyperlinks are not merely the constitutive elements of the Web, they are ‘conscious acts of connectivity’ (Milligan, 2022, p. 132) that yield insights into ‘hyperlinked identities’ (Szulc, 2015, p. 121). I specifically concentrate on the hyperlinks that directed to LGBT+ websites – not necessarily the 200 websites, but to any website, Dutch or non-Dutch, that catered to LGBT+ people. I will detail this bottom-up approach, which combines distant and close reading, and will show that there was a distinctly Dutch queer web sphere.
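The extraction-and-tallying step underlying this kind of analysis can be sketched in a few lines. The HTML, host names, and tallying choices below are illustrative assumptions; the actual study worked on the KB's archived snapshots with its own workflow.

```python
# Minimal sketch of hyperlink extraction and target tallying. Input HTML and
# host names are illustrative; this is not the study's actual pipeline.
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkExtractor(HTMLParser):
    """Collect absolute hrefs from <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith("http"):
                self.links.append(href)

def link_targets(html: str) -> tuple[Counter, Counter]:
    """Count target hosts and (link-weighted) top-level domains in a page."""
    parser = LinkExtractor()
    parser.feed(html)
    hosts = Counter(urlparse(url).hostname for url in parser.links)
    tlds = Counter(h.rsplit(".", 1)[-1] for h in hosts.elements() if h)
    return hosts, tlds
```

Aggregating such per-page counts across a year's snapshots is what makes findings like the dominance of ‘.nl’ targets, discussed next, measurable at collection scale.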
For instance, 49 of the 50 websites that were most frequently hyperlinked to (or: targeted) were websites of Dutch organizations, in Dutch. In fact, many were hosted by local or regional groups, which suggests that, as far as geographical focus is concerned, internet historians should perhaps zoom in rather than out. Moreover, most of the target websites had ‘.nl’ as a top-level domain (TLD), whereas ‘.amsterdam’ was also relatively popular. These findings challenge the assumption that queer online cultures are inherently transnational.

This talk connects to the conference regarding both the topic (e.g., ‘underrepresented voices and marginalised communities’) and the applied method (‘Derived and statistical data for distant reading’). It is designed to resonate with every conference participant. It goes beyond simply demonstrating, through practical examples, how collaboration between researchers and web archivists can deepen our insights into critical societal and historical issues. Additionally, it explores the workflows the KB and I created for building and analyzing datasets, which could inspire future research and ultimately encourage greater engagement with web archives. By showcasing how hyperlink analysis can reveal hidden local networks, this talk offers a replicable, data-driven approach for archivists and researchers to assess and enrich collections of underrepresented groups, directly addressing the conference’s call for inclusive and sustainable web archiving practices.

2:39pm - 3:00pm
WARCbench: A Swiss army knife for WARC processing
Harvard Library Innovation Lab, United States of America

WARCbench is an open-source Python library and command-line utility designed for exploring, analyzing, transforming, recombining, and extracting data from WARC files in all their variety. Inspired by the ad hoc snippets of code the team at the Library Innovation Lab repeatedly reaches for while operating Perma.cc, WARCbench is a new addition to our suite of open-source web-archiving tools. It offers a resilient, highly configurable toolkit for experienced technologists, alongside easy-to-use commands for quickly exploring the contents of a WARC without writing any code.

In running a production-scale web archive, we’re always finding new anomalies to investigate, emerging patterns to study, and new use cases to explore. Though a broad array of tools and libraries exists for working with WARC files, most are understandably optimized for the well-known, frequently encountered tasks of web archiving rather than for empowering learning and discovery, supporting ad hoc scripting, and enabling users to quickly and easily explore novel problem spaces. WARCbench was created with these non-standard uses in mind and with an eye toward best practices: clear, thorough documentation; robust error handling; and an architecture that makes custom extension and introspection straightforward. Our goals for this project were to:
Our session aims to spark dialogue about common practices in ad hoc WARC processing and future tooling needs. Attendees will learn practical, repeatable approaches for inspecting and handling even "difficult" WARC files using WARCbench, and we’ll demonstrate both typical and edge-case scenarios, ranging from simple inspection to transformation and extraction. Because it’s open source and modular, WARCbench lowers barriers to adoption, invites community iteration, and supports tool longevity, a critical factor for sustainable web archiving.
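As a flavor of the ad hoc snippets this kind of tooling replaces, here is a bare-bones record walker for an uncompressed WARC. This is not WARCbench's API, just a generic illustration of the WARC framing (version line, headers, blank line, Content-Length bytes of block) that such tools handle robustly on your behalf.

```python
# Generic pure-Python WARC record walker. This is NOT WARCbench's interface;
# it only illustrates the raw record framing defined by the WARC format.
import io

def iter_warc_records(stream: io.BufferedIOBase):
    """Yield (headers, body) for each record in an uncompressed WARC stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of file
        if not version.startswith(b"WARC/"):
            continue  # skip the blank lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        body = stream.read(int(headers.get("Content-Length", 0)))
        yield headers, body
```

A real tool also has to cope with gzip-per-record compression, malformed lengths, and truncated files, which is exactly the kind of robustness and error handling described above.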
