ID: 159 / LP 2: 1
Long Paper
Keywords: large language models, generative AI, letter edition
Empowering Text Encoding with Large Language Models: Benefits and Challenges
M. Scholger, C. Pollin, S. Strutz
University of Graz, Austria
This contribution will discuss how Large Language Models (LLMs) can be used to support and enhance text encoding with the Text Encoding Initiative (TEI) standard, demonstrating an exemplary workflow in the context of a letter edition, from model creation through data extraction and analysis to presentation.
ID: 163 / LP 2: 2
Long Paper
Keywords: Knowledge acquisition, Technical documentation, RAG, LLM, Information retrieval
Enhancing Technical Knowledge Acquisition with RAG Systems: the TEI use case
M. Khemakhem1, H. E. Rekik2, O. Bouaziz1
1MandaNetwork, France; 2ENSTA Paris, France
In an era dominated by an explosion of technical documentation across diverse domains, effective knowledge acquisition mechanisms have become paramount. The assimilation of the “Text Encoding Initiative” (TEI) guidelines, for instance, presents challenges for organizations and individuals seeking to adopt its encoding principles effectively. Retrieval-Augmented Generation (RAG) systems emerge as a promising paradigm to address this challenge, seamlessly integrating information retrieval with natural language generation to facilitate the acquisition of technical knowledge from large bodies of documentation. In this publication, we explore how RAG systems can mitigate these challenges while maximizing the benefits of TEI adoption, particularly in the context of learning and implementation.
Challenges in learning and adopting TEI guidelines revolve around the complexity of the markup language and the diverse skill levels of users. Mastering TEI requires familiarity with its intricate syntax, encoding conventions, and domain-specific applications, posing a steep learning curve for novices. Furthermore, the extensive volume of its published guidelines poses another challenge, even for experienced users, in efficiently retrieving relevant information.
RAG systems provide a novel approach to technical knowledge acquisition by seamlessly integrating the power of Large Language Models (LLMs) and specialized knowledge. The logic behind a RAG system lies in its ability to leverage pre-trained LLMs to generate informative and contextually relevant responses based on retrieved information. This approach has emerged to tackle the hallucination issue observed in generative models. It achieves this by enriching the context necessary for these models with knowledge sourced directly from relevant documents.
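The retrieval-then-generation logic described above can be sketched in a few lines. In this toy example, a bag-of-words cosine similarity stands in for a real embedding model, and the guideline snippets, queries, and function names are all illustrative, not part of the system described in the abstract:

```python
# Minimal sketch of the retrieval step in a RAG pipeline over TEI guideline
# chunks. A bag-of-words cosine similarity substitutes for a real embedding
# model; chunk texts and names are illustrative only.
from collections import Counter
from math import sqrt

def similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    return sorted(chunks, key=lambda c: similarity(query, c), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Ground the LLM's answer in the retrieved guideline passages."""
    return "Answer using only this context:\n" + "\n---\n".join(context) + f"\nQuestion: {query}"

chunks = [
    "The <persName> element contains a proper noun referring to a person.",
    "The <placeName> element contains an absolute or relative place name.",
    "The <teiHeader> element supplies descriptive metadata for the TEI file.",
]
context = retrieve("How do I encode a person's name?", chunks)
print(build_prompt("How do I encode a person's name?", context))
```

Enriching the prompt with retrieved passages in this way is what lets the generative model answer from the documentation rather than from its parametric memory alone, which is the hallucination mitigation the paragraph above refers to.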
Consequently, RAG systems offer a promising solution to TEI adoption and learning challenges by providing more adaptive and interactive content. Through advanced natural language generation capabilities, RAG systems can generate tailored explanations, examples, and walkthroughs of TEI encoding practices, catering to the specific needs and skill levels of users. By leveraging retrieval mechanisms, they can retrieve relevant TEI guidelines and examples from the extensive published documentation, facilitating self-paced learning and knowledge acquisition. Such interactive systems can empower users by assisting in the creation of TEI-compliant markup, streamlining the encoding process, and thereby reducing errors and inconsistencies. Furthermore, RAG-generated summaries and explanations elucidate the rationale behind TEI encoding decisions, enhancing transparency and reproducibility in digital humanities research.
Upon experimentation with state-of-the-art models, we observed that some technical challenges persist on the way to this goal. First, pre-processing of the documentation is necessary to overcome issues related to tokenization. Moreover, a chunking strategy for such rich documentation has to be carefully defined to enable more precise information retrieval and complete responses from the AI assistant. In addition, choosing the right prompt remains crucial for framing the context and the expected outcome in order to generate accurate responses.
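One possible chunking strategy of the kind mentioned above can be sketched as follows. The paragraph-based packing, the one-paragraph overlap, and the `max_chars` value are assumptions chosen for illustration, not the configuration the authors actually used:

```python
# Hedged sketch of a chunking strategy for long guideline text: split on
# paragraph breaks, pack paragraphs into chunks of at most max_chars, and
# carry the last paragraph of each chunk into the next one as overlap, so
# that an element description is never cut off from its surrounding context.
def chunk_paragraphs(text: str, max_chars: int = 200) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for p in paragraphs:
        if current and len("\n\n".join(current + [p])) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-1:]  # overlap: repeat the last paragraph
        current.append(p)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The overlap trades a little index redundancy for retrieval robustness: a query matching the end of one chunk will also match the start of the next, so the assistant is less likely to receive a truncated passage.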
In conclusion, RAG systems offer a transformative approach to learning and adopting TEI guidelines, mitigating challenges and maximizing benefits for knowledge acquisition. By leveraging RAG systems to facilitate TEI learning, organizations can empower users to unlock the full potential of TEI for data interoperability, scholarly communication and digital humanities research.
ID: 137 / LP 2: 3
Long Paper
Keywords: Digital edition, TEI-XML markup, Named Entity Recognition (NER), standardized metadata
From Catullus to Wikidata: Language Models, Metadata Schemes, and Ontologies in a Digital Edition in XML TEI
C. Nusch1,2,3, G. Calarco2, G. del Rio Riande2,3, L. C. Cagnina2,4, M. L. Errecalde2,4, L. Antonelli1
1UNLP; 2CONICET; 3AAHD; 4UNSL
This paper details various tasks of markup and natural language processing conducted in Aetatis Amoris, a project dedicated to exploring love poetry throughout literary history. The project focuses on the enriched digital edition of works by classical Latin poets such as Gaius Valerius Catullus, Albius Tibullus, and Sextus Propertius, using the TEI-XML standard for encoding the texts. Initial tasks included the automatic generation of the main document elements, such as header and body, and the counting and automatic tagging of verses and stanzas.
Subsequently, using LatinCy, the advanced spaCy model for Latin, names of people and places were automatically extracted and tagged with the corresponding labels. In addition, searches for standardized metadata were carried out using external APIs, consulting databases such as the Virtual International Authority File (VIAF), the Pleiades project, and Wikidata. This allowed for the retrieval of standardized identifiers, rich and curated information, and images of historical places and figures. In conclusion, while the automatic tools significantly facilitated the digital editing process, the vast amount of information retrieved also posed significant challenges in data curation and quality assessment, redefining the role of the digital scholarly editor in the process.
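The tagging step described above can be illustrated with a short sketch. The entity spans here are hard-coded stand-ins for the output an NER model such as LatinCy might produce, the function names are hypothetical, and the `ref` URI is a placeholder rather than a real VIAF identifier:

```python
# Illustrative sketch: wrap recognized entity spans in the corresponding TEI
# elements, attaching a normalized identifier (e.g. a VIAF or Pleiades URI
# retrieved earlier) via @ref. Spans are hard-coded stand-ins for NER output.
TEI_TAG = {"PERSON": "persName", "LOC": "placeName"}

def tag_entities(text: str, entities: list[tuple[int, int, str, str]]) -> str:
    """entities: (start, end, label, reference URI); spans must not overlap."""
    out, pos = [], 0
    for start, end, label, ref in sorted(entities):
        tag = TEI_TAG[label]
        out.append(text[pos:start])
        out.append(f'<{tag} ref="{ref}">{text[start:end]}</{tag}>')
        pos = end
    out.append(text[pos:])
    return "".join(out)

verse = "Vivamus mea Lesbia, atque amemus"
# Placeholder URI, not a real VIAF record for Lesbia.
entities = [(12, 18, "PERSON", "https://viaf.org/viaf/0000")]
print(tag_entities(verse, entities))
```

Emitting the markup from character offsets keeps the original verse text byte-for-byte intact, which matters when line and stanza counts have already been encoded.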