Conference Agenda
ARCHIVING BLOGS AND NEWS
Presentations
1:25pm - 1:47pm
Blogs to digital heritage: a British Library case study
British Library, United Kingdom
In 2025 the British Library undertook a time-sensitive initiative to preserve its institutional blogs hosted on the Typepad platform. The library blogs represent over a decade of research, curatorial insight, and public engagement, making them a crucial component of the institution’s digital heritage. This project aimed to preserve the content while ensuring continuity of user access and long-term discoverability through the UK Web Archive. The blogs were hosted across two domains, with Cloudflare protections active on only one. This configuration presented several challenges for crawling, including blocked requests, redirects, and embedded content across multiple subdomains. To address these issues, crawler user agents were whitelisted by the domain owners and manual crawls were conducted for content outside Cloudflare. The team therefore compiled seed lists for manual crawling using a combination of internal metadata, Screaming Frog exports, and curated inputs. Approximately 160,000 URLs were initially identified and then refined to around 90,000 unique URLs representing individual blog posts and associated media. Browsertrix was used for targeted crawls of these posts, and separate crawls captured embedded assets such as images, audio, and documents. After crawling, further challenges arose in consolidating the content captured from two different domains into a single, coherent viewer. Quality assurance was particularly complex because some problematic captures were not outright crawl failures but pages returning HTTP 503 errors instead of the expected blog content. These recurring 503 captures had to be identified and re-crawled manually to ensure every post and its associated media were fully preserved, requiring careful review and iterative verification across both domains. Throughout this project, a strong focus was placed on user access and experience.
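As an illustration of the quality-assurance step described above, the check for captures that returned HTTP 503 instead of blog content could be sketched as follows. This is a minimal sketch under stated assumptions, not the team's actual workflow: it assumes capture records are already available as (URL, HTTP status) pairs, for example exported from a crawl index, and the function name and example URLs are hypothetical.

```python
from collections import defaultdict


def urls_needing_recrawl(captures):
    """Return URLs with at least one HTTP 503 capture and no HTTP 200 capture.

    `captures` is an iterable of (url, status) pairs. This input shape is an
    assumption for illustration; real records would come from a crawl index.
    """
    statuses = defaultdict(set)
    for url, status in captures:
        statuses[url].add(status)
    # A URL is flagged only if no successful capture exists for it yet.
    return sorted(
        url for url, seen in statuses.items()
        if 503 in seen and 200 not in seen
    )


# Hypothetical capture records, not data from the project.
captures = [
    ("https://blogs.example.org/post-1", 200),
    ("https://blogs.example.org/post-2", 503),
    ("https://blogs.example.org/post-3", 503),
    ("https://blogs.example.org/post-3", 200),  # re-crawl already succeeded
]

print(urls_needing_recrawl(captures))
```

Re-running such a check after each manual re-crawl gives a shrinking list, which matches the iterative verification the abstract describes.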
The current solution includes a bespoke workflow with support from Browsertrix, which provides a temporary route for public access until the blogs are fully integrated into the UK Web Archive. Redirects were planned at the top-level domain to route users to archived versions, with documentation including a LibGuide to guide navigation and citation. The team explored how archived content could later be integrated into the Web Archive’s discovery systems to ensure sustainable long-term accessibility. This presentation will discuss the workflows, technical challenges, and collaborative strategies employed to preserve both content and access. Particular attention will be given to overcoming Cloudflare restrictions, managing URL redirects, coordinating cross-departmental teams, and designing user support resources to make the archived blogs usable and discoverable. The case study demonstrates that under platform constraints, institutions can successfully safeguard digital heritage while prioritising accessibility, discoverability, and usability for researchers and the public. This presentation will illustrate how careful planning, cross-team collaboration, and targeted technical strategies enabled the preservation of content while prioritising user access. It will highlight approaches to overcoming platform limitations, ensuring discoverability, and supporting users in navigating archived blogs.
1:47pm - 2:09pm
The taste of blogging: towards sensible and ethical approaches to web archives
¹École nationale des chartes, France; ²Bibliothèque nationale de France
Archives of the early vernacular web hold a lot of sensitive content: personal photos, texts created by children, viral memes remixing personal and copyrighted content… Blogs and social networks are not only made of text or images: they encompass intimate, individual stories. Within those pages, we come across confidences from marginalized people, mothers grieving for their child, photos of late-night parties, fantasies worded as fanfictions. What can be told about them without betraying the intimacy these authors have placed in their blogs? Drawing on the large-scale collection, carried out with the National Library of France (BnF), of 12.6 million blogs, mainly French-speaking and mostly created in the early 2000s, we will discuss how research teams and cultural institutions can implement sensible approaches to this peculiar kind of corpus. Our projects SkyTaste and Skybox build on a platform of tools and data for researchers designed by the BnF in order to promote the visibility of this archive. Our goal is to capture the unique atmosphere of those blogs and to design ways of conveying this heritage back to its stakeholder community. Within our project, we define sensible approaches to web archives as epistemological methods designed to interact with sensitive content from the vernacular web in a way that is respectful of ethical principles. In France, web archives fall under legal deposit and can only be accessed by researchers on the premises of a few institutions. If we want to use this content for an exhibition or a scientific paper, we have to ask for authorizations from rights holders. However, most of the content on this blog platform was posted under pseudonyms and most of it, especially within fandoms, is composed of reused content, making it difficult to trace.
Furthermore, even if we can find some of these authors, they are not keen to provide display authorizations for their intimate content. Finally, there may be cases when the materials are so sensitive that we may feel reluctant to expose them, even if we are allowed to. However, when telling the stories of these blogs, if we only show low-risk content, either authorized or already available, there is a significant risk that we end up representing a biased version of the platform and missing the purpose of cultural heritage: stirring emotions. Sensible approaches to web archives include acknowledging intellectual property rights, being mindful of people’s privacy and intimacy, taking into account cultural diversity, and protecting stakeholders (including researchers) from potentially harmful information. Such approaches may include navigating between distant and close reading, avoiding blind spots and building research processes along with communities, and mobilizing art-based research as a catalyst for the emotions that we experience as web archivists or as researchers in front of the archive. Thanks to the synergy that emerged around the aforementioned projects, researchers and students work together with web archivists to build this ethical framework for navigating personal web archives. This is our main goal for two workshops we are organizing in fall 2025; we will synthesize our results for this presentation.
2:09pm - 2:30pm
Capturing the flow of online news: complementary approaches to web archiving and legal deposit in Sweden
National Library of Sweden, Sweden
The National Library of Sweden has engaged in large-scale web archiving since 1997, when domain-level crawls of the Swedish web were first initiated as part of the national web harvesting program. In 2002–2003, this effort was expanded to include daily crawls of Swedish news media websites, recognizing the need to capture the rapid publication cycles and dynamic content characteristic of online journalism. These crawls have since documented the structure, evolution, and visual presentation of Sweden's digital news ecosystem across both national and regional outlets. The harvested material is available for on-site consultation at the library and forms a cornerstone of the National Library of Sweden's long-term digital preservation holdings. The introduction of electronic legal deposit legislation in 2012 significantly expanded the National Library of Sweden's collecting mandate, establishing a legal basis for requiring publishers to deliver digital content, including material distributed exclusively online and behind paywalls. Building on this framework, the National Library of Sweden in 2015 launched a new and more granular collection process for news media: focused harvesting based on RSS feeds supplied by publishers in accordance with technical specifications developed by the library. These feeds expose article-level content and metadata, including updated versions of published articles, thereby enabling the systematic and high-frequency collection of born-digital news items. This targeted, metadata-rich approach complements the broader but less structured coverage achieved through traditional web crawls. This presentation will examine the operational and curatorial relationship between these two collection streams: comprehensive web harvesting and RSS-based electronic legal deposit.
It will discuss differences in scope, temporal resolution, and metadata granularity, as well as efforts to align descriptive and technical metadata across systems to enable cross-collection discovery and analysis. Particular attention is given to challenges in integrating large-scale WARC-based collections with structured, feed-based article data, and to access conditions: while the web-harvested material is available to users on-site, the legal deposit corpus remains restricted due to current legal and technical constraints. The presentation will also outline future directions for harmonizing workflows, enhancing metadata interoperability, and leveraging these complementary datasets for large-scale research use in digital news studies and computational journalism.
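To make the feed-based collection stream described in this abstract more concrete, the following minimal sketch shows how article-level items might be read from an RSS feed and how new or updated articles could be selected for harvesting. It is an illustration under stated assumptions, not the library's specification: it assumes plain RSS 2.0 elements (`guid`, `link`, `title`, `pubDate`), treats a changed `pubDate` as the signal for an updated version, and uses hypothetical function names, feed content, and URLs.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; the publisher specification used by the
# library is not reproduced here.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example News</title>
<item><guid>article-1</guid><link>https://news.example.se/a1</link>
<title>First article</title><pubDate>Mon, 05 Jan 2015 08:00:00 GMT</pubDate></item>
<item><guid>article-2</guid><link>https://news.example.se/a2</link>
<title>Second article (updated)</title><pubDate>Mon, 05 Jan 2015 09:30:00 GMT</pubDate></item>
</channel></rss>"""


def parse_feed_items(rss_xml):
    """Extract article-level metadata from every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    fields = ("guid", "link", "title", "pubDate")
    return [
        {name: item.findtext(name, default="") for name in fields}
        for item in root.iter("item")
    ]


def items_to_harvest(items, seen):
    """Select items that are new or whose pubDate differs from the last harvest.

    `seen` maps guid -> pubDate recorded at the previous harvest. Using
    pubDate as the update signal is an assumption made for this sketch.
    """
    return [item for item in items if seen.get(item["guid"]) != item["pubDate"]]


# State from a hypothetical earlier harvest: article-2 had an older timestamp.
seen = {
    "article-1": "Mon, 05 Jan 2015 08:00:00 GMT",
    "article-2": "Mon, 05 Jan 2015 09:00:00 GMT",
}

for item in items_to_harvest(parse_feed_items(SAMPLE_FEED), seen):
    print(item["guid"], item["link"])
```

Keeping the identifier and timestamp alongside each harvested article is one way such feed data could later be aligned with WARC-based captures of the same URL.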
