Conference Agenda


Please note that all times are shown in the time zone of the conference (CEST).

 
 
 
Session Overview
Session
SES-15: DATA CONSIDERATIONS
Time: Friday, 12/May/2023, 1:00pm - 2:10pm

Session Chair: Sophie Ham, Koninklijke Bibliotheek
Location: Theatre 2


These presentations will be followed by a 10 min Q&A.

Presentations
1:00pm - 1:20pm

What if GitHub disappeared tomorrow?

Emily Escamilla, Michele Weigle, Michael Nelson

Old Dominion University, United States of America

Research is reproducible when the methodology and data originally presented by the researchers can be used to reproduce the results found. Reproducibility is critical for verifying and building on results, both of which benefit the scientific community. The correct implementation of the original methodology and access to the original data are the linchpins of reproducibility. Researchers are putting the exact implementation of their methodology in online repositories like GitHub. In our previous work, we analyzed arXiv and PubMed Central (PMC) corpora and found 219,961 URIs to GitHub in scholarly publications. Additionally, in 2021, one in five arXiv publications contained at least one link to GitHub. These findings indicate the increasing reliance of researchers on the holdings of GitHub to support their research. So, what if GitHub disappeared tomorrow? Where could we find archived versions of the source code referenced in scholarly publications? Internet Archive, Zenodo, and Software Heritage are three different digital libraries that may contain archived versions of a given repository. However, they are not guaranteed to contain a given repository, and the method for accessing the code from the repository will vary across the three digital libraries. Additionally, Internet Archive, Zenodo, and Software Heritage all approach archiving from different perspectives and with different use cases that may impact reproducibility. Internet Archive is a Web archive; therefore, the crawler archives the GitHub repository as a Web page and not specifically as a code repository. Zenodo allows researchers to publish source code and data and to share them with a DOI. Software Heritage allows researchers to preserve source code and issues permalinks for individual files and even lines of code. In this presentation, we will answer the questions: What if GitHub disappeared tomorrow? What percentage of scholarly repositories are in Internet Archive, Zenodo, and Software Heritage? What percentage of scholarly repositories would be lost? Do the archived copies available in these three digital libraries facilitate reproducibility? How can other researchers access source code in these digital libraries?



1:20pm - 1:40pm

Web archives and FAIR data: exploring the challenges for Research Data Management (RDM)

Sharon Healy¹, Ulrich Karstoft Have², Sally Chambers³, Ditte Laursen⁴, Eld Zierau⁴, Susan Aasman⁵, Olga Holownia⁶, Beatrice Cannelli⁷

¹Maynooth University; ²NetLab; ³KBR & Ghent Centre for Digital Humanities; ⁴Royal Danish Library; ⁵University of Groningen; ⁶IIPC; ⁷School of Advanced Study, University of London

The FAIR principles imply “that all research objects should be Findable, Accessible, Interoperable and Reusable (FAIR) both for machines and for people” (Wilkinson et al., 2016). These principles present varying degrees of technical, legal, and ethical challenges in different countries when it comes to access and the reusability of research data. This equally applies to data in web archives (Boté & Térmens, 2019; Truter, 2021). In this presentation we examine the challenges for the use and reuse of data from web archives from both the perspectives of web archive curators and users, and we assess how these challenges influence the application of FAIR principles to such data.

Researchers' use of web archives has increased steadily in recent years, across a multitude of disciplines, using multiple methods (Maemura, 2022; Gomes et al., 2021; Brügger & Milligan, 2019). This development would imply that there is a diversity of requirements regarding the RDM lifecycle for the use and reuse of web archive data. Nonetheless, there has been very little research examining the challenges researchers face in applying FAIR principles to the data they use from web archives.

To better understand current research practices and RDM challenges for this type of data, a series of semi-structured interviews was undertaken with both researchers who use web or social media archives for their research and cultural heritage institutions interested in improving access to their born-digital archives for research.

Through an analysis of the interviews we offer an overview of several aspects which present challenges for the application of FAIR principles to web archive data. We assess how current RDM practices transfer to such data from both a researcher and archival perspective, including an examination of how FAIR web archives are (Chambers, 2020). We also look at the legal and ethical challenges experienced by creators and users of web archives, and how they impact on the application of FAIR principles and cross-border data sharing. Finally, we explore some of the technical challenges, and discuss methods for the extraction of datasets from web archives using reproducible workflows (Have, 2020).



1:40pm - 2:00pm

Lessons Learned in Hosting the End of Term Web Archive in the Cloud

Mark Phillips¹, Sawood Alam²

¹University of North Texas, United States of America; ²Internet Archive, United States of America

The End of Term (EOT) Web Archive is composed of member institutions across the United States that have come together every four years since 2008 to complete a large-scale crawl of the .gov and .mil domains, documenting the transition in the Executive Branch of the Federal Government in the United States. In years when a presidential transition did not occur, these crawls served as a systematic crawl of the .gov domain in what has become a longitudinal dataset of crawls. In 2022 the EOT team from the UNT Libraries and the Internet Archive moved nearly 700TB of primary WARC content and derivative formats into the cloud. The goal of this work was to provide easier computational access to the web archive by hosting a copy of the WARC files and derivative WAT, WET, and CDXJ files in the Amazon S3 Storage Service as part of Amazon’s Open Data Sponsorship Program. In addition to these common formats in the web archive community, the EOT team modeled its work on the structure and layout of the Common Crawl datasets, including their use of the columnar storage format Parquet to represent CDX data in a way that enables access with query languages like SQL. This presentation will discuss the lessons learned in staging and moving these web archives into AWS, and the layout used to organize the crawl data into 2008, 2012, 2016, and 2020 datasets and further into different groups based on the original crawling institution. It will also present examples of how content staged in this manner can be used by researchers both inside and outside of a collecting institution to answer questions that had previously been challenging to answer about these web archives. The EOT team will discuss the documentation and training efforts underway to help researchers incorporate these datasets into their work.



 
Conference: IIPC WAC 2023