Conference Agenda

Overview and details of the sessions of this conference.

 
Session Overview
Session
SES-02: FINDING MEANING IN WEB ARCHIVES
Time:
Thursday, 11/May/2023:
11:00am - 12:30pm

Session Chair: Vladimir Tybin, Bibliothèque nationale de France
Location: Theatre 2


These presentations will be followed by a 10 min Q&A.

Presentations
11:00am - 11:20am

Leveraging Existing Bibliographic Metadata to Improve Automatic Document Identification in Web Archives.

Mark Phillips1, Cornelia Caragea2, Praneeth Rikka1

1University of North Texas, United States of America; 2University of Illinois Chicago, United States of America

The University of North Texas Libraries, partnering with the University of Illinois Chicago (UIC) Computer Science Department, has been awarded a research and development grant (LG-252349-OLS-22) from the Institute of Museum and Library Services in the United States to continue work from previously awarded projects (LG-71-17-0202-17) related to identification and extraction of high-value publications from large web archives. This work will investigate the potential of using existing bibliographic metadata from library catalogs and digital library collections to better train machine learning models that can assist librarians and information professionals in identifying and classifying high-value publications from large web archives. The project will focus on extracting publications related to state government document collections from the states of Texas and Michigan, with the hope that this approach will enable other institutions to leverage their existing web archives in building traditional digital collections with these publications. This presentation will give an overview of the project and describe the approaches the research team is exploring to leverage existing bibliographic metadata in building machine learning models for publication identification from web archives. It will also cover early findings from the first year of research, next steps, and how institutions can apply this research to their own web archives.



11:20am - 11:40am

Conceptual Modeling of the Web Archiving Domain

Illyria Brejchová

Masaryk University, Czech Republic

Web archives collect and preserve complex digital objects. This complexity, along with the large scope of archived websites and the dynamic nature of web content, makes sustainable and detailed metadata description challenging. Different institutions have taken various approaches to metadata description within the web archiving community, yet this diversity complicates interoperability. The OCLC Research Library Partnership Web Archiving Metadata Working Group took a significant step forward in publishing user-centered descriptive metadata recommendations applicable across common metadata formats. However, there is no shared conceptual model for understanding web archive collections. In my research, I examine three conceptual models from within the GLAM domain: IFLA-LRM, created by the library community; CIDOC-CRM, originating from the museum community; and RiC-CM, stemming from the archive community. I will discuss what insight they bring to understanding the content within web archives and their potential for supporting metadata practices that are flexible, scalable, meet the requirements of the end users, and are interoperable between web archives as well as the broader cultural heritage domain.

This approach sheds light on common problems encountered in metadata description practice in a bibliographic context by modeling archived web resources according to IFLA-LRM and showing how constraints within RDA introduce complexity without providing tools for feasibly representing this complexity in MARC 21. On the other hand, object-oriented models, such as CIDOC-CRM, can represent at least the same complexity of concepts as IFLA-LRM but without many of the aforementioned limitations. By mapping our current descriptive metadata and automatically generated administrative metadata to a single comprehensive model and publishing it as open linked data, we can not only more easily exchange metadata but also provide a powerful tool for researchers to make inferences about the past live web by reconstructing the web harvesting process using log files and available metadata.

While the work presented is theoretical, it provides a clearer understanding of the web archiving domain. It can be used to develop even better tools for managing and exploring web archive collections.



11:40am - 12:00pm

Web Archives & Machine Learning: Practices, Procedures, Ethics

Jefferson Bailey

Internet Archive, United States of America

Given their size, complexity, and heterogeneity, web archives are uniquely suited to leverage and enable machine learning techniques for a variety of purposes. On the one hand, web collections increasingly represent a larger portion of the recent historical record and are characterized by longitudinality, format diversity, and large data volumes; this makes them highly valuable in computational research by scholars, scientists, and industry professionals using machine learning for scholarship, analysis, and tool development. Few institutions, however, are yet facilitating this type of access or pursuing these types of partnerships and projects given the specialized practices, skills, and resources required. At the same time, machine learning tools also have the potential to improve internal procedures and workflows related to web collections management by custodial institutions, from description to discovery to quality assurance. Projects applying machine learning to web archive workflows, however, also remain a nascent, if promising, area of work for libraries. There is also a “virtuous loop” possible between these two functional areas of access support and collections management, wherein researchers utilizing machine learning tools on web archive collections can create technologies that then have internal benefits to the custodial institutions that granted access to their collections. Finally, spanning both external researcher uses and internal workflow applications is an intricate set of ethical questions posed by machine learning techniques. Internet Archive has been partnering with both academic and industry research projects to support the use of web archives in machine learning projects by these communities. Simultaneously, IA has also explored prototype work applying machine learning to internal workflows for improving the curation and stewardship of web archives.
This presentation will cover the role of machine learning in supporting data-driven research, the successes and failures of applying these tools to various internal processes, and the ethical dimensions of deploying this emerging technology in digital library and archival services.



12:00pm - 12:20pm

From Small to Scale: Lessons Learned on the Requirements of Coordinated Selective Web Archiving and Its Applications

Balázs Indig1,2, Zsófia Sárközi-Lindner1,2, Mihály Nagy1,2

1Eötvös Loránd University, Department of Digital Humanities, Budapest, Hungary; 2National laboratory for Digital Humanities, Budapest, Hungary

Today, web archiving is measured on an increasingly large scale, putting pressure on newcomers and independent researchers to keep up with the pace of development and maintain an expensive ecosystem of expertise and machinery. These dynamics involve a fast and broad collection phase, resulting in a large pool of data, followed by a slower enrichment phase consisting of cleaning, deduplication and annotation.

Our streamlined methodology for specific web archiving use cases combines mainstream practices with new open-source tools. Our custom crawler conducts selective web archiving for portals (e.g. blogs, forums, currently applied to Hungarian news providers), using the taxonomy of the given portal to systematically extract all articles exclusively into portal-specific WARC files. As articles have uniform portal-dependent structure, they can be transformed into a portal-independent TEI XML format individually. This methodology enables assets (e.g. video) to be archived separately on demand.
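
As a rough illustration of the step from portal-dependent article structure to a portal-independent TEI XML file, the following minimal Python sketch wraps one already-extracted article in a skeletal TEI document. The function name `article_to_tei`, its input fields, and the choice of TEI elements are assumptions made for illustration; they are not the authors' tooling or schema, and the output is not validated against the TEI schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def article_to_tei(title, author, date, portal, paragraphs):
    """Build a skeletal TEI document for one extracted article.

    Hypothetical helper: the fields and element choices are illustrative only.
    """
    # Serialize TEI elements with a default namespace instead of an ns0: prefix.
    ET.register_namespace("", TEI_NS)

    def tag(name):
        return f"{{{TEI_NS}}}{name}"

    tei = ET.Element(tag("TEI"))

    # Descriptive metadata goes into the TEI header.
    header = ET.SubElement(tei, tag("teiHeader"))
    file_desc = ET.SubElement(header, tag("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tag("titleStmt"))
    ET.SubElement(title_stmt, tag("title")).text = title
    ET.SubElement(title_stmt, tag("author")).text = author
    pub_stmt = ET.SubElement(file_desc, tag("publicationStmt"))
    ET.SubElement(pub_stmt, tag("publisher")).text = portal
    ET.SubElement(pub_stmt, tag("date"), {"when": date})
    source_desc = ET.SubElement(file_desc, tag("sourceDesc"))
    ET.SubElement(source_desc, tag("p")).text = "Extracted from an archived web page."

    # The article text becomes one <p> per extracted paragraph.
    text = ET.SubElement(tei, tag("text"))
    body = ET.SubElement(text, tag("body"))
    for para in paragraphs:
        ET.SubElement(body, tag("p")).text = para

    return ET.tostring(tei, encoding="unicode")
```

In such a setup, only the extraction of `title`, `author`, `date`, and `paragraphs` would be portal-specific; the conversion itself stays the same for every portal.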

We focus on textual content, which with traditional web archives would require resource-intensive filtering. Alternatives like trafilatura are limited to automatic content extraction, often yielding invalid TEI or incomplete metadata, unlike our semi-automatic method. The resulting data are deposited by grouping portals under specific DOIs, enabling fine-grained access and version control.

With almost 3 million articles from more than 20 portals, we developed a library for executing common tasks on these files, including NLP and format conversion, to overcome the difficulties of interacting with the TEI standard. To provide access to our archive and gain insights through faceted search, we created a lightweight trend viewer application to visualize text and descriptive metadata.
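
To give a sense of the kind of aggregation a trend viewer might draw on, the sketch below computes the relative frequency of a term per publication year. It is a simplified illustration under stated assumptions: the function `yearly_term_frequency`, the `(iso_date, text)` input shape, and the whitespace tokenization are hypothetical and do not describe the authors' application.

```python
from collections import Counter


def yearly_term_frequency(articles, term):
    """Return {year: share of tokens equal to `term`} for (iso_date, text) pairs.

    Hypothetical helper: uses naive lowercase whitespace tokenization.
    """
    hits = Counter()    # occurrences of the term per year
    totals = Counter()  # all tokens per year
    term = term.lower()
    for iso_date, text in articles:
        year = iso_date[:4]  # assumes dates start with a four-digit year
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(term)
    # Skip years without any tokens to avoid dividing by zero.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}
```

A real viewer would likely combine such counts with facets from the descriptive metadata (for example portal or author) before plotting them.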

Our collaborations with researchers have shown that our approach makes it easy to merge coordinated separate crawls, promoting small archives created by different researchers, who may have lower technical skills, into a comprehensive collection that can in some respects serve as an alternative to mainstream archives.

Balázs Indig, Zsófia Sárközi-Lindner, and Mihály Nagy. 2022. Use the Metadata, Luke! – An Experimental Joint Metadata Search and N-gram Trend Viewer for Personal Web Archives. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, pages 47–52, Taipei, Taiwan. Association for Computational Linguistics.