Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only sessions held on that day or at that location. Select a single session for a detailed view (with abstracts and downloads, if available).

 
 
Session Overview
Session
Long Papers 8
Time:
Thursday, 10/Oct/2024:
3:00pm - 4:30pm

Session Chair: Gustavo Fernández Riva, University of Heidelberg
Location: Aula 3 - Primer piso

Rectorado

Encoding and analysis 3

Presentations
ID: 130 / LP 8: 1
Long Paper
Keywords: retrocomputing, born digital heritage, TEI encoding

Editing Early-Born Digital Text in TEI

T. Roeder

University of Würzburg, Germany

With the increasing prevalence of digital estates and digital forms of literary publication, digital cultural heritage is gradually coming into the purview of editorial philology. Because they are stored electronically, "born digital" resources possess no fixed material form, only temporary visualizations. Their materiality is thus displaced onto devices and storage media, whose era-specific technological design determines how data is stored. A closer examination of digital remnants from the 1980s makes clear how fundamentally the technology has changed in just a few decades: neither the software nor the files are accessible on today's common systems, and handling original storage media and devices requires care and expertise. Emulation is only possible if digital images of the original storage media are available. For the scholarly editing of early digital material, it follows that digital is not equal to digital: digitality takes on different manifestations depending on the respective digital environment. On a ZX Spectrum it looked entirely different than on an Apple II or a Commodore Amiga 4000. This in turn creates different conditions for textual criticism, depending on the specific historical, system-specific concept of digitality.

To approach these conditions, it seems reasonable to examine the electronic representation of text more closely. In the simplest case, there is a direct, symbolic connection between the binary code on a storage medium and the rendering on a display device, defined by a standard such as ASCII (later subsumed into Unicode). Depending on the historical environment, however, a multi-tiered chain must be expected: companies like Commodore maintained proprietary standards in which even the outward appearance of the character inventory could be individually customized, while standard formats for character sets, or documentation of them, did not exist. Correctly mapping a byte sequence to the semantics of individual characters therefore requires sufficient knowledge of these adaptations; otherwise, neither can the intended character be inferred from the encoding, nor the encoding from the outward form. This also means that text cannot be transmitted without transmitting the system environment. The same applies to historical compression algorithms, which were often applied because of limited storage space but must now be reconstructed, given the lack of specifications, before the original digital text can be accessed.
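A minimal sketch in Python, with an entirely hypothetical character table, may illustrate why the byte sequence alone does not constitute the text:

# Decoding a raw byte dump through a system-specific character table.
# The table is a hypothetical stand-in for a customized 8-bit set
# (e.g. a modified PETSCII); real mappings have to be recovered from
# the original system environment.

CUSTOM_CHARSET = {
    0x01: "A", 0x02: "B", 0x03: "C",  # hypothetical letter block
    0x20: " ",                        # space
    0x5B: "ä", 0x5C: "ö", 0x5D: "ü",  # hypothetical customized slots
}

def decode_dump(raw: bytes, table: dict[int, str]) -> str:
    """Map each byte to a character; flag bytes with unknown semantics."""
    return "".join(table.get(b, "\ufffd") for b in raw)

dump = bytes([0x01, 0x02, 0x20, 0x5B])
print(decode_dump(dump, CUSTOM_CHARSET))  # "AB ä" under this table;
# interpreted as plain ASCII, the same bytes would read "\x01\x02 [".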

But if the relationship between the displayed character and the underlying encoding only fully functions in the context of the original environment, what then is the text to be edited in the digital realm? How can this relationship be meaningfully incorporated into textual criticism? Might emulations serve a facsimile function? And finally, the crucial question: what does the TEI need, first, to capture the necessary information in the metadata and, second, to encode the different digital text layers? The contribution discusses these questions theoretically and with reference to several examples from a current editing project.



ID: 173 / LP 8: 2
Long Paper
Keywords: handwritten text recognition, deep learning, large language models, 19th-century documents in Hungarian

Integrating Deep Learning and Philology: Challenges and Opportunities in the Digital Processing of János Arany’s Legacy

G. Palkó

HUN-REN Research Centre for the Humanities, Hungary

In the first decades of the 21st century, we are witnessing two parallel and closely related trends in the fields of culture and science. On the one hand, Artificial Intelligence is transforming and replacing established cultural practices at an unforeseeable pace. On the other hand, partly through the digitization of cultural heritage and partly through the huge volume of 'born digital' materials being produced, data sets and networks of unprecedented scale are coming into being.

Within the discourse of digital heritage, however, alongside easily processed and published printed or 'born digital' materials, the "real" – that is, handwritten – manuscripts tend to be overshadowed, as they cannot be made searchable with general models that ignore the specific characteristics of the document group in question. A particular problem is that AI tools perform better in major world languages spoken by hundreds of millions; Hungarian handwritten documents, for instance, are therefore exceptionally underrepresented in digital cultural heritage as a whole.

Addressing this issue is one of the main goals of the National Laboratory for Digital Heritage project. Led and professionally guided by digital humanities experts from the ELTE Faculty of Humanities, this collaboration of public collections and research institutions sees its primary task as applying AI tools optimized for the Hungarian language in public collections. One of the most significant achievements of this work is a handwriting recognition model that has made János Arany's official documents searchable, opening up an extremely valuable document corpus to researchers and the general public.

The present lecture, in addition to briefly outlining the HTR processing of a significant portion—approximately 30,000 pages—of János Arany's legacy, focuses on two issues. The first concerns the potential and risks of deep-learning technologies such as synthetic handwriting generation and large language models (LLMs) for "improving" the text quality of a corpus too large to be checked manually, and for making it researchable. The second, which I discuss in detail in my presentation, addresses how a corpus converted into text by artificial intelligence can be integrated into the framework of critical text editions created by philologists. Specifically, it asks whether the TEI XML markup language is optimal for publishing uncorrected HTR output, and how philologically rigorous digital scholarly editions can coexist on a single platform with documents "read" by machines. I will present these issues in the context of specific IT developments.
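A minimal sketch, under assumptions of our own rather than the project's documented pipeline, of how raw HTR line output could be wrapped in a TEI sourceDoc while preserving the engine's per-line confidence, so that machine readings remain distinguishable from verified text:

# Sketch only: wraps (text, confidence) pairs from an HTR engine in a
# minimal TEI sourceDoc/surface/line structure; numeric @cert values
# record machine confidence, not an editorial judgement.
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"

def htr_lines_to_tei(lines):
    """lines: iterable of (text, confidence) pairs from an HTR engine."""
    tei = etree.Element("{%s}sourceDoc" % TEI_NS, nsmap={None: TEI_NS})
    surface = etree.SubElement(tei, "{%s}surface" % TEI_NS)
    for text, conf in lines:
        line = etree.SubElement(surface, "{%s}line" % TEI_NS)
        line.text = text
        line.set("cert", f"{conf:.2f}")
    return tei

doc = htr_lines_to_tei([("Tisztelt Igazgató Úr!", 0.94), ("Arany János", 0.71)])
print(etree.tostring(doc, pretty_print=True, encoding="unicode"))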

References:

- Palkó, Gábor – Szekrényes, István – Bobák, Barbara 2023. A Digitális Örökség Nemzeti Laboratórium webszolgáltatásai automatikus kézírás-felismertetéshez [Online Services of the National Laboratory for Digital Heritage for Automatic Handwritten Text Recognition]. In: Tick, József – Kokas, Károly – Holl, András (eds.): Új technológiákkal, új tartalmakkal a jövő digitális transzformációja felé. Budapest: Hungarnet, pp. 164–169. https://doi.org/10.31915/NWS.2023.24

- Li, Minghao et al. 2022. TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models. https://doi.org/10.48550/arXiv.2109.10282



ID: 107 / LP 8: 3
Long Paper
Keywords: Novella, Umberto Eco, Literary Forgery, Annotation, Analysis

Chasing ‘Carmen Nova’: Encoding and Analysis of a TEI Version of the Crime Novella Allegedly Written by Umberto Eco

F. Fischer1, D. C. Çakir1, V. J. Illmer1, N. Penke2, M. Schwindt1, L. Welz1

1Freie Universität Berlin, Germany; 2University of Siegen, Germany

On the first pages of Umberto Eco’s world-famous 1980 novel Il nome della rosa, we read about a book that the narrator finds in an antiquarian bookshop on Avenida Corrientes in Buenos Aires, which sets the plot in motion. With this contribution to the TEI 2024 Conference, we want to bring Umberto Eco back to Buenos Aires under circumstances that are just as intricate as in Eco’s novel. Using a TEI document, our project aims to thoroughly examine an unresolved case concerning the authorship of a crime novella:

At the end of 2022, the literary scholar Niels Penke discovered a book titled ‘Carmen Nova’ on eBay that named Umberto Eco as its author; the afterword was allegedly written by Roland Barthes. Several copies of the book, published by a fictitious Swiss publishing house, are now known to exist. As it turns out, however, the 64-page volume is a literary forgery: no such work by Umberto Eco is known. The novella disguises itself as a German translation of an alleged Italian original. The plot revolves around the detective-style search for a certain Carmen, who appears not as a concrete person but as a world-literary concept originating in Mérimée’s novella Carmen, published in 1845, which in turn became the basis for Bizet’s opera of the same name, premiered in 1875. The Carmen Nova of the novella is one of these ‘fluctuating individuals’ (Eco 2009, pp. 86–89).

Since Niels Penke’s discovery, a community of scholars, interested readers and journalists has formed to find out more about the author (or authors) of this literary forgery, which must have been written in the early 1980s. It is still not known who wrote, printed and circulated this book.

The State and University Library Bremen (SuUB) made a digital scan of its copy in 2023. To be able to carry out digital analyses on the full text of the work, we converted the PDF to plain text via OCR and, after a round of corrections, encoded it in TEI (Araneda Lavín et al. 2023). We published the encoded file with the help of the JavaScript library CETEIcean (Cayless & Viglianti 2018). Among other things, we annotated mentions of persons as well as obvious spelling mistakes, clues that may help uncover the nature of the text. We will present our annotation strategy and initial analysis results. Using a Jupyter notebook that employs the lxml library to extract information from the TEI-encoded version, we generated quantitative results that provide new insights into the making of this mysterious text.
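A minimal sketch of the kind of query such a notebook runs (the filename is hypothetical, and the element names assume standard persName and choice/sic encodings):

# Extracting annotation counts from the TEI file with lxml.
from collections import Counter
from lxml import etree

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
tree = etree.parse("carmen_nova.xml")  # hypothetical filename

# Count annotated person mentions across the novella.
persons = Counter(
    el.xpath("string()").strip()
    for el in tree.xpath("//tei:persName", namespaces=NS)
)
print(persons.most_common(5))

# Tally encoded spelling mistakes (sic readings inside choice elements).
errors = tree.xpath("count(//tei:choice/tei:sic)", namespaces=NS)
print(f"{int(errors)} marked spelling mistakes")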

We have released both the TEI and the Python code as open source (see bibliography) and hope that the presentation will arouse lively interest among the TEI community in this mystery, which has so far been confined to German-speaking circles. Last but not least, it is a nice touch that, through the TEI 2024 Conference, a pseudo-version of Umberto Eco finds his way to Buenos Aires via winding paths.



 