ID: 174
/ LP 7: 1
Long Paper
Keywords: digital philology, semantic web technology, digital scholarly edition
Integrating TEI XML with Existing Semantic Web Practices for Enhanced Accessibility and Interoperability in Scholarly Editions
Z. Fellegi
HUN-REN Research Centre for the Humanities, Hungary
In recent years, the integration of the semantic web and linked open data has emerged as a pivotal area in digital philology. This discussion has produced numerous applications of semantic web technologies and graph data models for modeling the data of scholarly editions. Notably, there has been a shift away from the traditional TEI XML format, originally tailored to the needs of digital scholarly editing.
My presentation will propose a refined architecture that aims to overcome the limitations observed in recent digital philological experiments. The prevalent approach—developing bespoke ontologies, data structures, and corresponding software—tends to isolate digital philology from broader scholarly engagement, inadvertently perpetuating the exclusivity of each edition.
Our framework, hosted at DigiPhil (digiphil.hu), leverages TEI XML to ensure the accessibility of texts to a diverse academic audience, including historians and linguists. This accessibility is not limited to traditional close reading but extends to computer-assisted distant reading. We are actively developing tools that facilitate the conversion of XML into various data formats, such as plain text, CSV, and LaTeX.
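To make this conversion step concrete, the following minimal sketch (not the DigiPhil tooling itself; the file name is a placeholder) shows how the running text of a TEI document's body might be extracted with Python and lxml as a first step towards plain-text or CSV output:

```python
# Minimal sketch of an XML-to-plain-text conversion step (illustrative only).
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def tei_to_plain_text(path: str) -> str:
    """Return the running text of the <body> of a TEI document."""
    tree = etree.parse(path)
    body = tree.find(".//tei:body", namespaces=TEI_NS)
    if body is None:
        return ""
    # itertext() walks all text nodes in document order, ignoring the markup.
    return " ".join(" ".join(body.itertext()).split())

if __name__ == "__main__":
    # "edition.xml" is a placeholder file name.
    print(tei_to_plain_text("edition.xml")[:500])
```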
While TEI XML inherently possesses semantic properties, it does not alone bridge the gap between a document network and a data network as envisaged by Berners-Lee. Previous efforts to integrate TEI XML with the semantic web have fallen short, failing to elevate the use of philological data beyond its original academic confines to a broader cultural and scientific arena.
Our proposed architecture emphasizes the integration of services and software such as Wikidata, GitHub, Zenodo.org, Wikibase, and Invenio RDM, cornerstones of the open data philosophy. However, linking these platforms is not straightforward. This presentation will outline the entire workflow, from the initial editing of scholarly texts to the publishing of semantically rich linked data, describing the relational metadata network of larger text units and the practices of semantic annotation and linking of smaller text segments.
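As an illustration of what linking smaller text segments to the semantic web can look like in practice, the following hedged sketch (the fragment and the Wikidata item are placeholders, and this is not DigiPhil's implementation) resolves a TEI ref attribute pointing to a Wikidata entity via the public Wikidata API:

```python
# Illustrative sketch only: given a TEI fragment whose <name ref="..."> points to
# a Wikidata item, fetch the item's label via the public Wikidata API.
# The fragment and the QID below are placeholders.
import json
import urllib.request

from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

fragment = """<p xmlns="http://www.tei-c.org/ns/1.0">
  <name ref="http://www.wikidata.org/entity/Q1">example entity</name>
</p>"""

def wikidata_label(qid: str, lang: str = "en") -> str:
    url = ("https://www.wikidata.org/w/api.php"
           f"?action=wbgetentities&ids={qid}&props=labels"
           f"&languages={lang}&format=json")
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return data["entities"][qid]["labels"][lang]["value"]

root = etree.fromstring(fragment)
for name in root.findall(".//tei:name", namespaces=TEI_NS):
    qid = name.get("ref").rsplit("/", 1)[-1]
    print(name.text, "->", qid, "->", wikidata_label(qid))
```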
Ultimately, I will showcase the DigiPhil infrastructure and the comprehensive workflow we employ, from inception to both print and digital publication, through a case study of a multilingual (Latin and English) digital scholarly edition.
References
• Ries, Thorsten; Palkó, Gábor: Born-digital archives. In International Journal of Digital Humanities 2019/1, p. 1–11. DOI: https://doi.org/10.1007/s42803-019-00011-x
• Palkó, Gábor: The Phenomenon of “Linked Data” from a Media Archaeological Perspective. In The (Web)sites of Memory: Cultural Heritage in the Digital Age, ed. Morse, Donald E.; O. Réti, Zsófia; Takács, Miklós; 2018, p. 23–31. Handle: http://hdl.handle.net/2437/280285
• Fellegi, Zsófia: Digital Philology on the Semantic Web: Publishing Hungarian Avant-garde Magazines. In The (Web)sites of Memory: Cultural Heritage in the Digital Age, ed. Morse, Donald E.; O. Réti, Zsófia; Takács, Miklós; 2018, p. 105–116. Handle: http://hdl.handle.net/2437/280285
• Graph Data-Models and Semantic Web Technologies in Scholarly Digital Editing. Ed. Spadini, Elena; Tomasi, Francesca; Vogeler, Georg, 2021. URL: https://kups.ub.uni-koeln.de/54580/1/SpadiniTomasi.pdf
ID: 143
/ LP 7: 2
Long Paper
Keywords: cmc, TEI guidelines, post, archive, correspondence
Can we apply the new CMC chapter to the TEI Listserv Archives? An experiment with TEI for Correspondence and Computer-Mediated Communication
E. Beshero-Bondar1, S. Bauman2
1Penn State Erie, United States of America; 2Northeastern University, United States of America
The TEI Technical Council is working with the Computer-Mediated Communication (CMC) special interest group (SIG) on introducing a new chapter on CMC for the TEI Guidelines; they expect the new chapter to be released either a few months before or a month after the conference. We propose to test the applicability of the new module to e-mail by encoding a subset of the TEI Listserv archive, and reporting on our successes, failures, and problems.
The authors have been involved in a project to transfer the TEI mailing lists currently hosted on the Brown University Listserv server, many of them dating from the 1990s, to a Listserv server at Penn State University. Simultaneously, we have been reviewing and working on the introduction of the draft CMC chapter in its late stages of development in 2024. It is not clear to us how well the TEI Guidelines with the new CMC chapter would apply to the archiving of e-mail in general, and in particular to e-mail from a mailing list. At the time of this writing, the draft CMC chapter primarily treats the kinds of dialogic, conversational messages we encounter in chat forums, social media platforms, and discussion boards as amenable to TEI encoding. These are media that the CMC draft authors Michael Beißwenger and Harald Lüngen described as requiring packaging into “products,” or “posts,” prior to transmission over a network.
While the draft chapter does not specifically address e-mail, we want to determine how much of the encoding proposed for CMC could apply to e-mail messages posted in the conversational space of an e-mail listserv. We expect that e-mail messages can be handled by the correspondence encoding already introduced to the Guidelines, but we also wonder to what extent the CMC encoding can be blended with the TEI correspondence encoding to represent metadata about the transmission and distribution of messages. We expect that Listserv messages shared with a community of recipients are better served by the CMC encoding, which seems better suited to documenting the distribution of a post, while an e-mail archive representing an individual's personal communications over time may be better represented by the TEI correspondence encoding.
The authors propose to explore the new CMC encoding introduced to the TEI Guidelines by applying it at scale to a subset of the TEI Listserv archives. We will work with the Listserv archive format currently supplied by Brown and Penn State Universities (versions 16.5 and 17.0 respectively). We will document our steps in transforming the Listserv archive format to TEI using text manipulation tools like Perl, Python, and XSLT. The authors expect that they will each have distinct ideas about how to apply the CMC and correspondence encoding to the data we are working with. We will document and present where our perspectives diverge, and we will seek the thoughts of TEI conference attendees on elaborating best practices for the encoding of e-mail list archives.
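By way of illustration only (this is not the authors' pipeline, and the message below is invented), one such transformation step might parse e-mail headers with Python's standard email module and emit a minimal skeleton built from the TEI correspondence module:

```python
# Sketch of one possible transformation step: parse the headers of a single
# e-mail and emit a minimal TEI <correspDesc> skeleton. The message is made up.
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime
from xml.sax.saxutils import escape

raw_message = """\
From: Jane Example <jane@example.org>
To: TEI-L@LISTSERV.BROWN.EDU
Date: Mon, 01 Jan 1996 12:00:00 -0500
Subject: Encoding question

How should I encode this?
"""

msg = message_from_string(raw_message)
sender_name, sender_addr = parseaddr(msg["From"])
sent = parsedate_to_datetime(msg["Date"]).date().isoformat()

corresp_desc = f"""<correspDesc>
  <correspAction type="sent">
    <persName>{escape(sender_name)}</persName>
    <email>{escape(sender_addr)}</email>
    <date when="{sent}"/>
  </correspAction>
  <correspAction type="received">
    <email>{escape(msg["To"])}</email>
  </correspAction>
</correspDesc>"""
print(corresp_desc)
```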
ID: 114
/ LP 7: 3
Long Paper
Keywords: large language model, text encoding, generative artificial intelligence, drama
Towards a LLM-powered encoding workflow for plays / Hacia un flujo de trabajo de codificación para obras de teatro impulsado por LLM
L. Giovannini1,2, D. Skorinkin1
1University of Potsdam; 2University of Padua
Encoding new texts in TEI-XML format plays a central role in established research projects such as DraCor (Fischer et al. 2019), a major computational infrastructure hosting ‘programmable corpora’ comprising thousands of dramatic texts. As outlined by Giovannini et al. 2023, current DraCor production workflows vary according to the initial markup of the texts to be onboarded: while computational transformations (mostly Python or XSLT scripts) are applied to sources with basic (HTML) or advanced (XML) markup, texts with no markup are usually encoded by applying the lightweight markup language easydrama (Skorinkin 2024) followed by a scripted conversion.
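To give a sense of what such a scripted conversion involves, the following sketch uses a deliberately simplified speaker-prefixed plain-text format, which is not the actual easydrama syntax, and turns it into TEI <sp> elements:

```python
# Hypothetical illustration of a rule-based conversion pass. The input format
# (SPEAKER: speech) is a simplified stand-in, NOT the real easydrama syntax.
import re
from xml.sax.saxutils import escape

sample = """\
HAMLET: To be, or not to be.
OPHELIA: Good my lord.
"""

SPEECH = re.compile(r"^(?P<speaker>[A-Z][A-Z ]+):\s*(?P<text>.+)$")

def to_tei_sp(text: str) -> str:
    sps = []
    for line in text.splitlines():
        m = SPEECH.match(line)
        if not m:
            continue
        speaker = m["speaker"].strip()
        who = speaker.lower().replace(" ", "_")
        sps.append(
            f'<sp who="#{who}">\n'
            f"  <speaker>{escape(speaker.title())}</speaker>\n"
            f"  <p>{escape(m['text'])}</p>\n"
            "</sp>"
        )
    return "\n".join(sps)

print(to_tei_sp(sample))
```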
The rise of large language models (LLMs), however, promises to further automate such encoding tasks, and scholars have already been exploring the potential of generative AI by developing advanced prompt-engineering techniques to enhance outputs (Czmiel et al. 2024, Pollin 2023). Most efforts, however, seem to have been devoted to shorter textual forms, like letters (e.g. Pollin, Steiner, and Zach 2023), which present comparatively fewer encoding challenges than longer texts like plays (inter alia because many models tend to shorten long outputs).
In this contribution, we present a proof of concept demonstrating how LLMs can be efficiently integrated into the corpus-building pipeline of a standard DraCor corpus. To this aim, we conduct a series of experiments with several state-of-the-art models, including both commercial (OpenAI’s GPT, Google’s Gemini) and open-source (Meta’s Llama, Mistral AI’s Mixtral) products, to assess to what extent a largely automated TEI encoding of plays is possible. We then discuss the strengths and limitations of this approach and propose a set of tailored LLM prompts which can be used to generate partially ‘DraCor-ready’ files from raw text input. In future developments, we also envision the possibility of fine-tuning existing LLMs with specific DraCor data to improve their performance.
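A hedged sketch of the kind of prompting involved (not the authors' actual prompts; the model name and prompt wording are assumptions, and any output would still need schema validation before entering a corpus) might look like this with the OpenAI Python client:

```python
# Illustrative sketch only: ask a chat model to wrap a short scene in TEI drama
# markup. Model name and prompt wording are assumptions, not the authors' setup.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SYSTEM_PROMPT = (
    "You are a TEI encoder. Convert the play text you receive into TEI XML, "
    "using <sp>, <speaker>, <stage> and <p>. Return only the XML and do not "
    "shorten or paraphrase the text."
)

raw_scene = """HAMLET: To be, or not to be.
(He pauses.)
OPHELIA: Good my lord."""

response = client.chat.completions.create(
    model="gpt-4o-mini",   # placeholder model name
    temperature=0,         # keep output as reproducible as possible
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": raw_scene},
    ],
)
print(response.choices[0].message.content)
```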