Conference Agenda


 
 
Session Overview
Session
Short Papers 1
Time:
Wednesday, 09/Oct/2024:
11:30am - 1:00pm

Session Chair: Hugh Cayless, Duke University
Location: Aula 2 - Primer piso

Rectorado

TEI workflows

Presentations
ID: 109 / SP1: 1
Short Paper
Keywords: Python library, format conversion, NLP, genres

Lessons learned from developing a customisable tool for TEI processing and handling of various TEI schemas

B. Indig1,2, M. Nagy2,3, L. Horváth4

1Eötvös Loránd University, Department of Digital Humanities; 2National Laboratory for Digital Humanities; 3Eötvös Loránd University, Atelier Department of Interdisciplinary History; 4Eötvös Loránd University, Doctoral School of Informatics

While maintaining and systematically extending five (currently medium-sized) corpora from different genres encoded in TEI schemas, we have observed limitations that non-technical researchers face when handling and enriching the documents. This bottleneck in further document processing steps has hindered our efforts to attract a larger user base among students and researchers.

In pursuing our goal of standardising the common processing steps that connect to the already standardised data storage format (i.e. the TEI schema), we developed a lightweight Python library for solving conversion, linguistic annotation, and metadata extraction tasks in a unified manner. Intended for users with minimal technical knowledge, our tool provides a high-level API for a range of TEI-XML-related tasks, including validation, format conversion, text and metadata extraction for downstream tasks, and TEI-compatible linguistic annotation.

We distinguish TEI schemas (e.g. for poems, dramas, novels, folk songs, news articles) as genres, where each genre represents a unique (valid) TEI document structure. Our library (teiutils) consists of an API skeleton that handles built-in genres and allows custom bundles to be developed easily and attached as Python modules without further restrictions. This approach creates a standardised framework that can be extended to numerous genres.

Our library allows using multiple NLP pipelines to accommodate different languages, while supporting conversion to common output formats (JSONL, customisable HTML, sentence per line, and the vertical XML format for the Sketch Engine corpus query framework) so that our corpora can be used outside of TEI. We have also defined different TEI schema levels to fit NLP and genre-specific annotations while adhering to the original text. This enables users to generate different annotation levels from raw TEI documents and then batch-convert them into another format programmatically with only a few API calls.
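The kind of TEI-to-JSONL conversion described above can be illustrated with a minimal sketch. It deliberately uses only the Python standard library and is not the teiutils API, which the abstract does not specify. The function names `tei_to_record` and `to_jsonl`, the sample document, and the choice to treat `<l>` and `<p>` as text units are all assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The TEI namespace is standard; everything else here is illustrative.
TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}

# Hypothetical sample document, not taken from the corpora in the abstract.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample poem</title></titleStmt>
      <publicationStmt><p>Unpublished sample</p></publicationStmt>
      <sourceDesc><p>Born digital</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <lg>
        <l>First line of verse</l>
        <l>Second line of <hi>verse</hi></l>
      </lg>
    </body>
  </text>
</TEI>"""


def tei_to_record(xml_string):
    """Extract the title and the text units (<l>, <p>) of the body as a dict."""
    root = ET.fromstring(xml_string)
    title = root.findtext(
        ".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS
    )
    body = root.find(".//tei:text/tei:body", TEI_NS)
    unit_tags = {f"{{{TEI_URI}}}l", f"{{{TEI_URI}}}p"}
    lines = []
    if body is not None:
        for element in body.iter():
            if element.tag in unit_tags:
                # Flatten inline markup and normalise whitespace.
                text = " ".join("".join(element.itertext()).split())
                if text:
                    lines.append(text)
    return {"title": title, "lines": lines}


def to_jsonl(records):
    """Serialise records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


record = tei_to_record(SAMPLE_TEI)
print(to_jsonl([record]))
```

A genre-specific bundle of the kind the abstract mentions would presumably replace the fixed set of text-unit elements with rules suited to its schema; that part is left out here because its design is not described.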

Furthermore, we present our observations and experiences and developed best practices regarding compatibility between annotations, TEI structures, and fidelity to original source texts.



ID: 133 / SP1: 2
Short Paper
Keywords: text encoding, markdown, encoding tools, drama

EasyDrama: a lightweight solution for encoding plays in TEI/XML

D. Skorinkin

Digital Humanities Potsdam

Although in many cases TEI/XML markup can be automated, a lot of TEI/XML documents are still encoded manually due to limitations of technology, complexity of the annotated phenomena, or the desire of the researcher(s) to stay close to the material and be in control. In such cases, the entry threshold for manual TEI encoding becomes a challenge. To turn raw text into TEI, one has to familiarise oneself with XML and learn heavy-weight annotation tools like Oxygen or CATMA. When it comes to markup workshops with non-digital scholars, one must spend considerable time getting the participants familiar with the tools and the format.

In the DraCor (dracor.org) project, it is important to enable people without technical background to encode drama in TEI/XML. Therefore, we are working on lowering the encoding threshold. One approach is EasyDrama (github.com/dracor-org/ezdrama) — a markdown-like language to encode the main structural elements of drama. In EasyDrama, speeches (TEI element <sp>), speakers (<speaker>), stage directions (<stage>), as well as acts and scenes (nested <div>-s) are encoded with just a handful of metasymbols (#@$%\n). This encoding is automatically translated to TEI/XML following a deterministic procedure.
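To make the idea of a deterministic translation concrete, here is a toy converter for a markdown-like drama notation. It is not the EasyDrama implementation, and the symbol mapping is an assumption made for this sketch: `#` marks nested divisions (one `#` per level), `@` a speaker, `$` a stage direction, and any other line a prose speech paragraph. The role of `%` is not stated in the abstract, so it is not handled. The function name `toy_markup_to_tei` is hypothetical.

```python
from xml.sax.saxutils import escape


def toy_markup_to_tei(source):
    """Translate a toy line-based drama markup into a TEI-like XML fragment.

    Assumed mapping (illustrative only, not the real EasyDrama grammar):
      '#', '##', ...  -> nested <div> with a <head>
      '@'             -> new <sp> with a <speaker>
      '$'             -> <stage>
      anything else   -> <p>
    """
    out = []
    open_divs = 0   # how many <div> elements are currently open
    in_sp = False   # whether an <sp> element is currently open

    def close_sp():
        nonlocal in_sp
        if in_sp:
            out.append("</sp>")
            in_sp = False

    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            close_sp()
            # Close divisions at the same or a deeper level first.
            while open_divs >= level:
                out.append("</div>")
                open_divs -= 1
            out.append("<div>")
            open_divs += 1
            out.append(f"<head>{escape(line[level:].strip())}</head>")
        elif line.startswith("@"):
            close_sp()
            out.append("<sp>")
            in_sp = True
            out.append(f"<speaker>{escape(line[1:].strip())}</speaker>")
        elif line.startswith("$"):
            out.append(f"<stage>{escape(line[1:].strip())}</stage>")
        else:
            out.append(f"<p>{escape(line)}</p>")

    close_sp()
    while open_divs > 0:
        out.append("</div>")
        open_divs -= 1
    return "\n".join(out)


SAMPLE = """
# Act 1
## Scene 1
$ A room.
@ ANNA
Who is there?
@ BORIS
It is I.
"""

tei_fragment = toy_markup_to_tei(SAMPLE)
print(tei_fragment)
```

Because every line is classified by its first character alone, the same input always yields the same output, which is the property that makes this style of markup easy to produce with search-and-replace or regular expressions.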

EasyDrama has become popular within the DraCor community. It is sometimes preferred even by people with technical skills and knowledge of XML. The balance between the simplicity of the markup and its unambiguous translation to TEI/XML seems to appeal to encoders. Depending on the uniformity of the source, markup can be accelerated with simple search-and-replace operations, regexes, or LLMs. It is easy to few-shot prompt LLMs to output EasyDrama, and this still gives more control than end-to-end generation.

While EasyDrama is a niche solution and does not replace other tools in the TEI/XML ecosystem, it can serve as an example of interface simplification that increases the speed of drama encoding and lowers the threshold for encoders.



ID: 118 / SP1: 3
Short Paper
Keywords: Calderón de la Barca, Character Annotation, DraCor, Natural Language Processing, Theater

From Annotations in TEI to Natural Language Processing: A Computational Analysis of Characters in Calderón Drama Corpus

H. Ehrlicher1, A. Rojas Castro1, S. Padó2, K. Jung2, A. Keith2

1Eberhard Karls Universität Tübingen, Germany; 2Universität Stuttgart, Germany

The TEI-encoded Calderón Drama Corpus (https://dracor.org/cal) represents an important milestone by enabling the use of digital methods for investigating Calderón’s work, such as the extent to which rules or genre conventions are followed (Ehrlicher et al. 2020; Lehmann & Padó 2022). An aspect of this corpus that is highly promising for future research concerns the treatment of characters and character types such as the ‘gracioso’. However, character-level information such as gender, social role, honorifics, or character types is scarce: some of it can be recovered from set lists, some from secondary literature, but regardless of the source, normalization and representation remain a challenge. On this poster, we report on our studies in which we enhance the TEI encoding of the Calderón Drama Corpus with character information, where available, and outline how this information can be used to recognize other characters within the vast corpus that fit into these archetypes, based on their speech and social relations, with minimal manual intervention, employing large language models and automatic classification.

Bibliographic references

Lehmann, Jörg & Sebastian Padó. «Clasificación de tragedias y comedias en las comedias nuevas de Calderón de la Barca». Revista de Humanidades Digitales 7 (27 November 2022): 80-103. https://doi.org/10.5944/rhd.vol.7.2022.34588.



ID: 179 / SP1: 4
Short Paper
Keywords: TEI-C, github, organization, metrics, community

C is for Co(nsortium|uncil|llaboration|ntributors|mmunity), or What Can GitHub Issues Tell Us About the TEI?

J. Takeda

Simon Fraser University, Canada

The TEI’s GitHub organization is the central home for development of the TEI Guidelines, Stylesheets, and many other associated tools, projects, and working groups. Among other things, each repository contains every version of every source file, a full log of every change committed, a list of all releases and their source files, and a list of all completed and outstanding issues (or “tickets”). These issues are key to the distributed, asynchronous, and transparent work of the TEI Technical Council and, much like the TEI listserv, the repositories provide an important channel for the TEI community to propose and suggest changes, raise issues, and ask questions.

But they also serve as an incredibly useful record of the TEI’s development work over the last decade or so (since the migration of the TEI’s codebase from SourceForge in 2015). Every bug or feature request is (theoretically) logged in the repository, which also (theoretically) chronicles a history of that particular issue: who raised it, who responded, who resolved it, when it was closed, and under what circumstances.

Drawing on recent research and initiatives into evaluating and measuring the “health” of open source code and communities, this paper investigates what an analysis of the TEI’s GitHub data might yield for understanding the relationship between the various groups that make up the TEI-C: council, consortium, contributors, collaborators, and community. Using metrics defined by the Linux Foundation’s “Community Health Analytics in Open Source Software” (CHAOSS) project (e.g. time to first response, time to close, and “bus factor”), this paper will present a critical analysis of the issues raised on the TEI’s two primary GitHub repositories (TEIC/TEI and TEIC/Stylesheets) and a discussion of what, if anything, these metrics can tell us about the past, present, and future of the TEI.
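The three metrics named above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's analysis: the issue records are invented, the field names `first_response_at` and `closed_by` are hypothetical (they are not claimed to match any GitHub API response), and "bus factor" is approximated as the smallest number of people who closed at least half of the closed issues.

```python
from collections import Counter
from datetime import datetime
from statistics import median

# Invented sample records; field names are assumptions for this sketch.
ISSUES = [
    {"number": 1, "created_at": "2024-01-01T00:00:00Z",
     "first_response_at": "2024-01-01T06:00:00Z",
     "closed_at": "2024-01-03T00:00:00Z", "closed_by": "bob"},
    {"number": 2, "created_at": "2024-01-02T00:00:00Z",
     "first_response_at": "2024-01-02T02:00:00Z",
     "closed_at": "2024-01-02T12:00:00Z", "closed_by": "bob"},
    {"number": 3, "created_at": "2024-01-05T00:00:00Z",
     "first_response_at": "2024-01-06T00:00:00Z",
     "closed_at": "2024-01-10T00:00:00Z", "closed_by": "carol"},
    # Still open and unanswered: ignored by all three metrics below.
    {"number": 4, "created_at": "2024-01-07T00:00:00Z",
     "first_response_at": None, "closed_at": None, "closed_by": None},
]


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp with a trailing 'Z' as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def median_hours(issues, end_field):
    """Median hours from issue creation to the event stored in end_field."""
    durations = [
        (parse_timestamp(i[end_field]) - parse_timestamp(i["created_at"]))
        .total_seconds() / 3600
        for i in issues
        if i.get(end_field)
    ]
    return median(durations) if durations else None


def bus_factor(issues, share=0.5):
    """Smallest number of closers who account for `share` of closed issues."""
    counts = Counter(i["closed_by"] for i in issues if i.get("closed_by"))
    total = sum(counts.values())
    covered = 0
    for rank, (_, closed) in enumerate(counts.most_common(), start=1):
        covered += closed
        if covered >= share * total:
            return rank
    return 0  # no closed issues at all


time_to_first_response = median_hours(ISSUES, "first_response_at")
time_to_close = median_hours(ISSUES, "closed_at")
print(time_to_first_response, time_to_close, bus_factor(ISSUES))
```

Using the median rather than the mean is a design choice of this sketch: issue lifetimes tend to be heavily skewed by a few very old tickets, and the median is less sensitive to them.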



ID: 175 / SP1: 5
Short Paper
Keywords: digital edition, commentary, latin-polish translation, metadata

Neolatina Sarmatica - from Web 1.0 to Web 3.0

I. Grabska-Gradzińska, G. Urban-Godziek

Jagiellonian University, Poland

The original resource and reason for the creation of the Neolatina Sarmatica project is Cochanovius Latinus, a collection of the complete Latin works of Jan Kochanowski, published electronically (2006-2011) with the first contemporary translation and commentary. This attempt to present Kochanowski's Latin in digital form was based on static HTML page structures, and the commentary reproduced the approach of the print editions in its structure and visualisation.

As new tools were developed and became more widespread, a new edition based on the TEI Publisher engine was proposed in 2024, with the aim of enriching it with metadata and consequently increasing the reader's ability to search and interact with the main text and supplementary texts (critical apparatus, commentaries), supporting texts, comparisons with other texts, and comparisons of different translations. This approach makes the texts more accessible to readers with different philological competences: scholars who are well versed in the original language and the historical and cultural context of the works, as well as less specialised readers.

Opening the texts to different types of readers has made it possible to update the commentaries and adapt them to the requirements of the modern reader, as well as to vary both the scope of the texts and the way in which they are visualised.

Work is currently in progress to utilise the metadata collected for editing in an ontology being developed, enabling a semantic approach to the text as data.



 
Conference: TEI 2024