10:05am - 10:25am
Unlocking the Archive: Open Access to News Content as Corpora
Jon Carlstedt Tønnessen, Magnus Breder Birkenes
National Library of Norway, Norway
The content of web archives is potentially highly valuable to research and knowledge production. However, most web archives impose strict access regimes on their collections, and with good reason: archived content is often subject to copyright restrictions and potentially also to data protection laws. In moving towards best practices, a key question is how to improve access while maintaining legal and ethical commitments. [1]
This presentation will show how the National Library of Norway (NB) has worked to provide open access to a corpus of more than 1.5 million news articles in the web archive. By providing the collection as data, scoped across the typical crawl-job-oriented segmentation, anyone gains access to computational text analysis at scale. By serving metadata and snippets of content through a REST API while keeping the full content in-house, we align with FAIR principles while accounting for intellectual property rights and data protection laws. [2]
We will walk through the key steps in building the news corpus: a) extracting data from WARC files, b) removing boilerplate content for purposes of Natural Language Processing (NLP), c) curating and filtering across crawl-oriented collections, d) tokenising the full text for computational analysis, and e) quality assessment before publishing. A sketch of steps (a) and (b) follows below.
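To illustrate steps (a) and (b), here is a minimal sketch of extracting HTML responses from a WARC file and stripping boilerplate before NLP. The library choices (warcio, trafilatura) are our assumptions for illustration; NB's actual pipeline is published at https://github.com/nlnwa/corpus-build/.

```python
# Minimal sketch of WARC extraction and boilerplate removal.
# Library choices are illustrative, not NB's actual pipeline.
from warcio.archiveiterator import ArchiveIterator
import trafilatura

def extract_articles(warc_path):
    """Yield (url, main_text) pairs for HTML responses in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            content_type = record.http_headers.get_header("Content-Type", "")
            if "text/html" not in content_type:
                continue
            html = record.content_stream().read()
            # Boilerplate removal: keep the main article text, drop
            # navigation menus, ads and footers.
            text = trafilatura.extract(html)
            if text:
                yield record.rec_headers.get_header("WARC-Target-URI"), text
```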
Further, we will demonstrate how anyone can tailor corpora for their own use and analyse news text at scale, either with user-friendly apps or with computational notebooks via the API. [3]
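As a minimal sketch of the notebook route (the endpoint and parameter names below are invented for illustration and do not represent NB's documented API), tailoring a sub-corpus and retrieving snippets could look like this:

```python
# Hypothetical sketch of querying a news-corpus REST API from a notebook.
# The endpoint and parameters are illustrative assumptions, not NB's API.
import requests

API = "https://api.example.nb.no/newscorpus"  # placeholder base URL

resp = requests.get(f"{API}/search", params={
    "q": "klimakrise",        # query term (Norwegian: "climate crisis")
    "from": "2020-01-01",     # restrict by publication date
    "to": "2020-12-31",
    "limit": 100,
})
resp.raise_for_status()
for hit in resp.json()["hits"]:
    # Only metadata and snippets are served; the full text stays in-house.
    print(hit["url"], hit["snippet"])
```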
The demonstration highlights some of the limitations, but also the great possibilities, of allowing distant reading of web archives. We will discuss how the collections-as-data approach provides broader access and new perspectives for researchers. Open access further allows for utilisation in new contexts, such as higher education, government and commercial business. With easy-to-use web applications on top, the threshold for non-technical users is lowered, potentially increasing the use of web archives vastly. We also reflect on how interdisciplinary cooperation and user orientation have been vital in designing and building the solution.
--
[1]: Caroline Nyvang and Eld Zierau, "Untangling Nordic Web Archives", in The Nordic Model of Digital Archiving (Routledge, 2023), 191–92; Niels Brügger and Ralph Schroeder, The Web as History: Using Web Archives to Understand the Past and the Present (London: UCL Press, 2017), 10.
[2]: Magnus Breder Birkenes and Jon Carlstedt Tønnessen. (2024). "corpus-build". GitHub. National Library of Norway. https://github.com/nlnwa/corpus-build/; Thomas Padilla. (2017). "On a Collections as Data Imperative". UC Santa Barbara. pp. 1–8; Sally Chambers. (2021). "Collections as Data: Interdisciplinary Experiments with KBR's Digitised Historical Newspapers: a Belgian Case Study". DH Benelux: The Humanities in a Digital World. 1–3; Magnus Breder Birkenes, Lars Johnsen, and Andre Kåsen. (2023). "NB DH-LAB: a corpus infrastructure for social sciences and humanities computing." CLARIN Annual Conference Proceedings.
[3]: Apps and notebooks will be available as open-source code by the end of November 2024. For similar services for digitised content, see "Apper fra DH-LAB" [Apps from DH-LAB]. (2024). National Library of Norway. https://www.nb.no/dh-lab/apper/; "Digital tekstanalyse" [Digital text analysis]. (2024). National Library of Norway. https://www.nb.no/dh-lab/digital-tekstanalyse/
10:25am - 10:45am
Recently Orphaned Newspapers: From Archived Webpages to Reusable Datasets and Research Outlooks
Tyng-Ruey Chuang1, Chia-Hsun Wang1, Hung-Yen Wu1,2
1Academia Sinica, Taiwan; 2National Yang Ming Chiao Tung University, Taiwan
We report on our progress in converting the web archives of a recently orphaned newspaper into accessible article collections in the IPTC (International Press Telecommunications Council) standard format for news representation. After the conversion, old articles extracted from a defunct news website are reincarnated as research datasets meeting the FAIR data principles. Specifically, we focus on Taiwan's Apple Daily and work on the WARC files built by the Archive Team in September 2022, at a time when the future of the newspaper seemed dim [0]. We convert these WARC files into de-duplicated collections of plain text in ninjs (News in JSON) format [1].
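As a minimal illustration of the de-duplication step (one simple approach under our assumptions, not necessarily the authors' exact method), repeated captures of the same article can be collapsed by hashing the extracted text:

```python
# Hypothetical sketch: collapse duplicate captures of the same article
# by hashing its extracted plain text; only the first occurrence is kept.
import hashlib

seen = set()

def is_duplicate(body_text: str) -> bool:
    """Return True if an article with identical text was seen before."""
    digest = hashlib.sha256(body_text.encode("utf-8")).hexdigest()
    if digest in seen:
        return True
    seen.add(digest)
    return False
```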
The Apple Daily in Taiwan had been in publication since 2003 but discontinued its print edition in May 2021. In August 2022, its online edition stopped being updated, and the entire news website has been inaccessible since March 2023. The fate of Taiwan's Apple Daily followed that of its elder sister publication in Hong Kong: the Apple Daily in Hong Kong was forced to cease its entire operation after midnight on June 23, 2021 [2]. Its pro-democracy founder, Jimmy Lai (黎智英) [3], had been arrested under Hong Kong's security law the year before.
Being orphaned and offline, past reports and commentaries from the newspapers on contemporary events (e.g. the Sunflower Movement in Taiwan and the Umbrella Movement in Hong Kong) have become unavailable to the general public. Such inaccessibility affects education (e.g. fewer news sources to be edited into Wikipedia), research (e.g. fewer materials for studying the early 2000s zeitgeist in Hong Kong and Taiwan), and knowledge production (e.g. fewer traditional Chinese corpora to work with).
Our work transforming the WARC records into ninjs objects produces a collection of 953,175 unique news articles totalling 4.3 GB. The articles are grouped by the day/month/year they were published, making it convenient to look up the news published on a specific date. Metadata about each article (headline(s), subject(s), original URI, unique ID, among others) is mapped into the corresponding fields of the ninjs object for ready access.
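As an illustration of this mapping (a minimal sketch: the values and identifier scheme are invented, while the property names follow ninjs 1.x), a converted article might be assembled like this:

```python
# Illustrative sketch of mapping extracted article data into a
# ninjs-style object; values and the ID scheme are invented examples.
import json
from pathlib import Path

article = {
    "uri": "https://tw.appledaily.com/headline/20140319/EXAMPLE/",  # original URI
    "type": "text",
    "language": "zh-Hant",
    "versioncreated": "2014-03-19T00:00:00+08:00",  # publication date
    "headline": "Example headline",                  # extracted headline
    "subject": [{"name": "politics"}],               # subject(s)
    "body_text": "…",                                # de-duplicated plain text
}

# Group output by the day/month/year of publication.
out = Path("2014/03/19") / "example.json"
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(article, ensure_ascii=False, indent=2),
               encoding="utf-8")
```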
(For figures, please access them at this dataset [4].) Figure 1 shows the ninjs object derived from a news article that was published on 2014-03-19, archived on 2021-09-29, and converted by us on 2024-02-17. Figure 2 is a screenshot of the webpage where the news was originally published. Figure 3 displays the text file of the ninjs object in Figure 1. Currently, the images and videos accompanying a news article are not extracted; a separate process is planned to preserve these media files and link to them from the produced ninjs objects.
In our presentation, we shall elaborate on technical details (such as the accuracy and coverage of the conversion) and exemplary use cases of the collection. We will touch on the roles of public research organizations in preserving and making available materials that are deemed out of commerce and circulation.
[0] https://wiki.archiveteam.org/index.php/Apple_Daily#Apple_Daily_Taiwan
[1] https://iptc.org/standards/ninjs/
[2] https://web.archive.org/web/20210623212350/https://goodbye.appledaily.com/
[3] https://en.wikipedia.org/wiki/Jimmy_Lai
[4] https://pid.depositar.io/ark:37281/k5p3h9k37
10:45am - 11:05am
NewsWARC: Analyzing News Over Time in the Web Archive
Amr Emara2, Khaled Ezz2, Shaden Hazem2, Youssef Eldakar1
1Bibliotheca Alexandrina, Egypt; 2Alamein International University, Egypt
News consumption, as studies generally suggest, is common across the globe. Today, wherever there is an Internet connection, individuals access news predominantly online, and news websites rank relatively high by number of visits. Considering the history of the web, the news media industry was among the earliest sectors of society to adopt the web. Given this significance, news content on the web particularly merits investigation, with the web archive as data source.
We present NewsWARC, a tool developed as an internship project for helping researchers explore news content in a web archive collection over time. NewsWARC consists of two components: the data analyzer and the viewer. The data analyzer runs over the collection and uses machine learning to extract information about each news article or post (namely sentiment, named entities, and category) and stores it in a database. The viewer provides the interface for querying and visualizing the pre-analyzed data. We report on our experience processing data from the Common Crawl news collection for testing, including a comparison of the data analyzer's performance on different hardware configurations. We show examples of queries and trend visualizations that the viewer offers, such as examining how the sentiment of articles in health-related news varies over the course of a pandemic.
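As an illustrative sketch of the analyzer stage (a reconstruction under our assumptions; the models and database schema are not necessarily NewsWARC's), sentiment and named entities could be extracted and stored like this:

```python
# Hypothetical sketch of an analyzer pass: sentiment analysis and NER
# over article text, with results stored in SQLite for a viewer to query.
import sqlite3
import spacy                       # requires: python -m spacy download en_core_web_sm
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")           # named-entity recognition
sentiment = pipeline("sentiment-analysis")   # default Hugging Face model

conn = sqlite3.connect("newswarc.db")
conn.execute("""CREATE TABLE IF NOT EXISTS articles
                (url TEXT, published TEXT, label TEXT, score REAL,
                 entities TEXT)""")

def analyze(url, published, text):
    """Store sentiment and named entities for one article."""
    doc = nlp(text[:5000])                   # truncate very long articles
    ents = ";".join(f"{e.text}|{e.label_}" for e in doc.ents)
    s = sentiment(text[:512])[0]             # respect model input limit
    conn.execute("INSERT INTO articles VALUES (?,?,?,?,?)",
                 (url, published, s["label"], s["score"], ents))
    conn.commit()
```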
In developing this initial prototype, we narrowed the information that the analyzer returns to sentiment, named entities, and category; a wider range of analyses could be included in future work, such as topic modeling, keyword and keyphrase extraction, measuring readability and complexity, and fact vs. opinion classification. Also as future work, this functionality could be deployed as a service, supplementing researcher access to web archives through an alternative interface.
11:05am - 11:10am
Zombie E-Journals and the National Library of Spain
José Carlos Cerdán Medina
Biblioteca Nacional de España, Spain
A "zombie e-journal" refers to an electronic journal that has become inaccessible, but for which a web archive has preserved a copy, sometime this one is not perfectly accurate. It is widely recognized that, each year, a significant number of e-journals disappear without existing in print, resulting in the loss of their content on a global scale. This constitutes a substantial loss of economic investment, scholarly knowledge, and cultural heritage. While many universities maintain institutional repositories to safeguard publications, a large number of e-journals lack sustainable preservation methods due to financial constraints.
In response to this challenge, the Spanish Web Archive initiated efforts to explore potential solutions. A key question was posed: is it feasible to ensure the long-term preservation of more than 10,000 open-access e-journals in Spain? The National Library of Spain, which serves as the National Centre for ISSN assignment, maintains a catalogue that includes all e-journals registered with an ISSN.
The first phase of this initiative started in 2020, when the Spanish Web Archive implemented an annual broad crawl encompassing all URLs associated with electronic journals in Spain. This proactive approach significantly increases the likelihood of locating missing e-journals in the future.
Currently, the project has entered its second phase, during which e-journals that became inaccessible between 2009 and 2023 have been identified. To date, over 500 zombie e-journals have been recovered through consultations with the Spanish Web Archive. The full list of these journals is publicly available through the project’s website and integrated into the National Library’s catalogue.
In the forthcoming third phase, the identified e-journals will be formally declared out-of-commerce works, in accordance with Directive (EU) 2019/790, thus facilitating open access to their content. This step will allow users to once again access and benefit from these resources.
Additionally, a comprehensive system has been developed to detect missing e-journals, conduct quality assurance (QA) on the captured content, and integrate access to these journals through the library's website and catalogue. The broad crawl has proven effective in identifying missing e-journals, and following quality assurance, the recovered information is systematically incorporated into the catalogue.
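As a minimal sketch of how such detection might work (the catalogue layout and probing policy are our assumptions, not BNE's implementation), one could periodically probe each ISSN-registered journal URL and flag dead ones as candidates for recovery from the web archive:

```python
# Hypothetical sketch: probe e-journal URLs from an ISSN catalogue and
# flag those that appear dead as candidates for recovery from the web
# archive. The CSV layout and retry policy are illustrative assumptions.
import csv
import requests

def probe(url, timeout=10):
    """Return True if the journal URL still responds with a non-error status."""
    try:
        r = requests.head(url, timeout=timeout, allow_redirects=True)
        if r.status_code == 405:  # some servers reject HEAD requests
            r = requests.get(url, timeout=timeout, stream=True)
        return r.status_code < 400
    except requests.RequestException:
        return False

with open("issn_journals.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):  # expects columns: issn, title, url
        if not probe(row["url"]):
            print(f"candidate zombie: {row['issn']} {row['title']}")
```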