Procesamiento del Lenguaje Natural — Open Journal Systems feed
http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/issue/feed
Feed updated: 2024-09-26T16:17:54+02:00. Contact: Eugenio Martínez Cámara (emcamara@ujaen.es).

The main objective of the journal is to offer researchers in Natural Language Processing (NLP) an opportunity to present new work, communicate results, and discuss problems and obstacles encountered in the course of their research. It also aims to enable an exchange of views on future directions for basic research and for the applications foreseen by experts, contrasting them with the real needs of the market, and to foster in-depth reflection and debate on highly topical subjects such as information extraction, information retrieval, and the evaluation of natural language processing systems.

The journal is published twice a year (March and September), with each issue collecting the latest advances in NLP.

The journal holds the quality seal of the Fundación Española para Ciencia y Tecnología (FECyT), which certifies it as a journal of excellence and includes it in the Spanish repository of scientific journals (RECyT, Repositorio Español de Ciencia y Tecnología): http://recyt.fecyt.es/index.php/PLN. The journal has also received the ISO9001 quality seal, accrediting it as excellent for a three-year period (14 March 2012 to 14 March 2015).

Procesamiento del Lenguaje Natural (print edition). ISSN: 1135-5948.
Procesamiento del Lenguaje Natural (electronic edition). ISSN: 1989-7553.

Número 73 (published 2024-09-21). Copyright (c) 2024 Procesamiento del Lenguaje Natural.
http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6596

http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6597
Are rule-based approaches a thing of the past? The case of anaphora resolution
Ruslan Mitkov (r.mitkov@lancaster.ac.uk), Le An Ha
In this paper, we evaluate and compare new variants of a popular rule-based anaphora resolution algorithm with the original version. We seek to establish whether configurations that benefit from Deep Learning, LLMs and eye-tracking data (always) outperform the original rule-based algorithm. The results of this study suggest that while algorithms based on Deep Learning and LLMs usually perform better than rule-based ones, this is not always the case, and we argue that rule-based approaches still have a place in today’s research.
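As a hedged illustration of what a knowledge-poor, rule-based resolver of this family looks like, the sketch below scores candidate antecedents with hand-written agreement filters and salience preferences. The features, weights and example data are invented for illustration; this is not the algorithm evaluated in the paper.

```python
# A toy, knowledge-poor anaphora resolver in the spirit of rule-based,
# scoring approaches. Features, weights and the example data are
# illustrative only, not the algorithm studied in the paper.
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    gender: str              # "m", "f" or "n"
    number: str              # "sg" or "pl"
    sentence_distance: int   # sentences between candidate and pronoun
    is_subject: bool
    mention_count: int       # how often the entity has been mentioned so far

def score(pronoun_gender: str, pronoun_number: str, c: Candidate) -> float:
    # Hard agreement filters first, then additive salience preferences.
    if c.gender != pronoun_gender or c.number != pronoun_number:
        return float("-inf")
    s = 0.0
    s += 2.0 if c.is_subject else 0.0       # syntactic salience
    s += max(0, 2 - c.sentence_distance)    # recency
    s += min(c.mention_count, 3) * 0.5      # referential frequency
    return s

def resolve(pronoun_gender, pronoun_number, candidates):
    scored = [(score(pronoun_gender, pronoun_number, c), c) for c in candidates]
    best_score, best_cand = max(scored, key=lambda x: x[0])
    return best_cand if best_score != float("-inf") else None

if __name__ == "__main__":
    cands = [
        Candidate("the programmer", "f", "sg", 1, True, 2),
        Candidate("the printer", "n", "sg", 0, False, 1),
    ]
    print(resolve("f", "sg", cands).text)  # -> "the programmer"
```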
http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6598
Multi-label Discourse Function Classification of Lexical Bundles in Basque and Spanish via transformer-based models
Josu Goikoetxea (josu.goikoetxea@ehu.eus), Markel Etxabe, Marcos García, Eleonora Guzzi, Margarita Alonso
This paper explores the effectiveness of transformer-based models in multi-label discourse function classification of lexical bundles in two languages, Basque and Spanish. The study has a dual focus: first, to evaluate the impact of manually and automatically annotated datasets on fine-tuning for this task; second, to demonstrate the efficiency of multilingual language models in a cross-lingual transfer learning setting. Our findings reveal the models’ ability to generalize discourse function classification of lexical bundles beyond specific word sequences, in both monolingual and cross-lingual transfer learning contexts. In the monolingual setting, the research highlights the superiority of manually annotated datasets over automatically annotated ones, provided the dataset is sufficiently large. In the cross-lingual case, despite the transfer occurring between two typologically different languages, the results also suggest the superiority of manually annotated datasets, along with the capability to surpass the monolingual results when the ratios of target- and source-language training and fine-tuning corpora are balanced.

http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6599
Toxicity in Spanish News Comments and its Relationship with Constructiveness
Pilar López-Úbeda (p.lopez@htmedica.com), Flor Miriam Plaza-del-Arco, Manuel-Carlos Díaz-Galiano, M. Teresa Martín-Valdivia
Online news comments are a critical source of information and opinion, but they often become a breeding ground for toxic discourse and incivility. Detecting toxicity in these comments is essential to understanding and mitigating this problem. This paper presents a corpus of Spanish news comments labeled with toxicity (NECOS-TOX) and conducts a series of experiments using several machine learning algorithms, including different transformer-based language models. Our findings show that Spanish language models, such as BETO, are capable of detecting toxicity in Spanish news comments. Additionally, we investigated the relationship between toxicity and constructiveness in these comments and found no clear correlation between the two factors. These results provide insights into the complexities of online discourse and highlight the need for further research to better understand the relationship between toxicity and constructiveness in Spanish news comments.
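As a rough sketch of the kind of experiment described above, the snippet below trains a simple non-transformer baseline on a toy toxicity-labelled comment set and checks the association between toxicity and constructiveness annotations. The comments, labels and column choices are invented for illustration and are not NECOS-TOX data.

```python
# Illustrative baseline only: TF-IDF + logistic regression for toxicity,
# plus a simple correlation check between toxicity and constructiveness
# labels. The comments and labels below are invented, not NECOS-TOX data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from scipy.stats import pearsonr

comments = [
    "Gran artículo, muy bien explicado",
    "Eres un inútil, cállate ya",
    "No estoy de acuerdo, pero el argumento es interesante",
    "Menuda basura de periódico",
    "Gracias por el análisis tan detallado",
    "Qué vergüenza de comentario, idiota",
]
toxic        = [0, 1, 0, 1, 0, 1]   # 1 = toxic
constructive = [1, 0, 1, 0, 0, 0]   # 1 = constructive

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(comments, toxic)
print(clf.predict(["Este comentario es una tontería absoluta"]))

# Pearson (point-biserial) correlation between the two binary annotations.
r, p = pearsonr(toxic, constructive)
print(f"toxicity-constructiveness correlation: r={r:.2f}, p={p:.3f}")
```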
http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6600
Semi-Supervised Learning in the Field of Conversational Agents and Motivational Interviewing
Gergana Rosenova (gergana.rosenova@rai.usc.es), Marcos Fernández-Pichel, Selina Meyer, David E. Losada
The exploitation of Motivational Interviewing concepts for text analysis contributes to gaining valuable insights into individuals’ perspectives and attitudes towards behaviour change. The scarcity of labelled user data poses a persistent challenge and impedes technical advances in non-English language scenarios. To address the limitations of manual data labelling, we propose a semi-supervised learning method to augment an existing training corpus. Our approach leverages machine-translated user-generated data sourced from social media communities and employs self-training techniques for annotation. To that end, we consider various source contexts and evaluate multiple classifiers trained on the augmented datasets. The results indicate that this weak labelling approach does not improve the overall classification capabilities of the models; however, notable enhancements are observed for the minority classes. We conclude that several factors, including the quality of machine translation, can bias the pseudo-labelling models, and that the imbalanced nature of the data and the impact of a strict pre-filtering threshold must be taken into account as inhibiting factors.

http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6601
Comparison of Clustering Algorithms for Knowledge Discovery in Social Media Publications: A Case Study of Mental Health Analysis
Manuel Couto (manuel.couto.pintos@usc.es), Javier Parapar, David E. Losada
In the age of social media, user-generated content is critical for detecting early signs of mental disorders. In this study, we use thematic clustering to analyze the content of the social media platform Reddit. Our primary goal is to use clustering techniques for comprehensive topic discovery, with a focus on identifying common themes among user groups suffering from mental illnesses such as depression, anorexia, gambling addiction, and self-harm. Our findings show that certain clusters are more cohesive, e.g., with a higher proportion of texts indicating depression. Furthermore, we discovered subreddits that are strongly linked to texts from the depressed user group. These findings shed light on how online interactions and subreddit themes may impact users’ mental health, paving the way for future research and more targeted interventions in the field of online mental health.

http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6602
Inductive Graph Neural Network for Job-Skill Framework Analysis
Hermenegildo Fabregat (hermenegildo.fabregat@avature.net), Rus Poves, Lucas Alvarez Lacasa, Federico Retyk, Laura García-Sardiña, Rabih Zbib
The analysis of skills and their relationship to different occupations is an area of special attention within human capital management processes, and increasing job specialization has made it ever more important. In this paper, we address two main tasks: the retrieval of similar jobs and the retrieval of skills related to a given job. We develop a system that combines the encoding of textual information with a graph neural network, thus mitigating the limitations of a system that relies on either of these alone. We present experiments showing that the proposed system is able to take advantage of both the encoded textual information and the structured relationships between job titles and skills represented by the graph. We also show the robustness of the proposed model in modeling unseen entities by evaluating its performance in simulated cold-recommendation scenarios in which a percentage of the skills under study are removed during training.
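The combination described in the last abstract (a text encoder feeding a graph neural network over a job-skill graph) can be pictured with a minimal inductive encoder such as the one below. It is a sketch built on PyTorch Geometric, with random vectors standing in for real text embeddings; the node counts, dimensions and edges are made up, and it is not the authors’ model.

```python
# Minimal sketch of an inductive job-skill encoder: text-derived node
# features refined by a two-layer GraphSAGE, then cosine similarity for
# retrieval. Dimensions, edges and "text embeddings" are placeholders.
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class JobSkillEncoder(torch.nn.Module):
    def __init__(self, text_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.conv1 = SAGEConv(text_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        # x: [num_nodes, text_dim] text embeddings of job titles and skills
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

# 4 nodes: 0-1 are jobs, 2-3 are skills; edges connect jobs to their skills.
x = torch.randn(4, 32)                      # stand-in for sentence embeddings
edge_index = torch.tensor([[0, 2, 0, 3, 1, 3],
                           [2, 0, 3, 0, 3, 1]])  # undirected job-skill links

model = JobSkillEncoder(text_dim=32, hidden_dim=64, out_dim=16)
emb = model(x, edge_index)

# Retrieve skills for job 0 by cosine similarity in the learned space.
job, skills = emb[0], emb[2:]
print(F.cosine_similarity(job.unsqueeze(0), skills))
```

Because GraphSAGE aggregates neighbour features rather than learning one embedding per node, an encoder of this shape can, in principle, produce embeddings for job titles or skills unseen during training, which is the property the cold-recommendation experiments probe.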
http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6603
Open Conversational LLMs do not know most Spanish words
Javier Conde (javier.conde.diaz@upm.es), Miguel González, Nina Melero, Raquel Ferrando, Gonzalo Martínez, Elena Merino-Gómez, José Alberto Hernández, Pedro Reviriego
The growing interest in Large Language Models (LLMs), and in particular in conversational models with which users can interact, has led to the development of a large number of open chat LLMs. These models are evaluated on a wide range of benchmarks to assess their capabilities in answering questions or solving problems on almost any possible topic, or to test their ability to reason or interpret texts. In contrast, the evaluation of these models’ knowledge of languages, for example the words they can recognize and use in different languages, has received much less attention. In this paper, we evaluate the knowledge that open chat LLMs have of Spanish words by testing a sample of words from a reference dictionary. The results show that open chat LLMs produce incorrect meanings for a significant fraction of the words and are unable to use most of them correctly to write sentences in context. These results show how Spanish is being left behind in the LLM race and highlight the need to push for linguistic fairness in conversational LLMs, ensuring that they provide similar performance across languages.

http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6604
Intent Classification Methods for Human Resources Chatbots
Lucas Alvarez Lacasa (lucas.lacasa@avature.net), Martín Dévora-Pajares, Rabih Zbib, Hermenegildo Fabregat
With the rapid development of artificial intelligence, conversational agents have become prevalent on most mainstream platforms. To accomplish the user’s goal, the system needs to determine their intention, detect emotions and extract key entities from the conversational utterances. This paper presents a comprehensive analysis of intent classification techniques applied to Human Resources chatbots. First, unsupervised learning methods are explored, using pre-trained encoders and Zero-Shot Classification pipelines. Then, we investigate supervised approaches and Large Language Models using Retrieval Augmented Generation. Finally, we propose a two-stage retrieval pipeline that trains a coarse-grained classifier and uses its prediction to retrieve the fine-grained intent with Approximate Nearest Neighbor search. The results reveal that while fully supervised methods yield superior models, unsupervised approaches demonstrate competitive performance and have the advantage of allowing new intents to be added without retraining the model. Our proposed two-stage approach combines the benefit of better performance with this added flexibility.
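The two-stage pipeline proposed in the last abstract can be sketched roughly as follows: a coarse classifier narrows the search space, and the fine-grained intent is then retrieved by nearest-neighbour search over the example utterances of that coarse group. Everything in the sketch (intents, utterances, the TF-IDF stand-in for a pre-trained encoder, and exact rather than approximate search) is an assumption made for illustration, not the authors’ implementation.

```python
# Rough sketch of a two-stage intent pipeline: a coarse-grained classifier
# picks a group of intents, then nearest-neighbour search over example
# utterances of that group returns the fine-grained intent.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# (utterance, coarse intent, fine-grained intent) -- invented examples.
examples = [
    ("how many vacation days do I have left", "time_off", "vacation_balance"),
    ("I want to request next Friday off",     "time_off", "request_leave"),
    ("when is payday this month",             "payroll",  "pay_date"),
    ("my last payslip looks wrong",           "payroll",  "payslip_issue"),
]
texts  = [t for t, _, _ in examples]
coarse = [c for _, c, _ in examples]

encoder = TfidfVectorizer().fit(texts)      # stand-in for a sentence encoder
X = encoder.transform(texts)

# Stage 1: coarse-grained classifier.
coarse_clf = LogisticRegression(max_iter=1000).fit(X, coarse)

# Stage 2: one nearest-neighbour index per coarse group.
indexes = {}
for group in set(coarse):
    rows = [i for i, c in enumerate(coarse) if c == group]
    indexes[group] = (rows, NearestNeighbors(n_neighbors=1).fit(X[rows]))

def predict_intent(utterance: str) -> str:
    q = encoder.transform([utterance])
    group = coarse_clf.predict(q)[0]
    rows, nn = indexes[group]
    _, idx = nn.kneighbors(q)
    return examples[rows[idx[0][0]]][2]

print(predict_intent("can I take two days off next week"))
```

In a production setting, the per-group exact search would typically be replaced by an Approximate Nearest Neighbor index, and new fine-grained intents can be added by indexing their example utterances without retraining the coarse classifier.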
http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6605
EuSQuAD: Automatically Translated and Aligned SQuAD2.0 for Basque
Aitor García-Pablos (agarciap@vicomtech.org), Naiara Perez, Montse Cuadros, Jaione Bengoetxea
The widespread availability of Question Answering (QA) datasets in English has greatly facilitated the advancement of the Natural Language Processing (NLP) field. However, the scarcity of such resources for minority languages, such as Basque, poses a substantial challenge for these communities. In this context, the translation and alignment of existing QA datasets play a crucial role in narrowing this technological gap. This work presents EuSQuAD, the first initiative dedicated to automatically translating and aligning SQuAD2.0 into Basque, resulting in more than 142k QA examples. We demonstrate EuSQuAD’s value through extensive qualitative analysis and QA experiments that use EuSQuAD as training data. These experiments are evaluated with a new human-annotated dataset.
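To make the QA evaluation mentioned above concrete, the snippet below computes SQuAD-style exact match and token-level F1 for predicted answers against gold answers. The normalisation is simplified (the official SQuAD script also removes articles) and the example pairs are invented; they are not EuSQuAD data.

```python
# Simplified SQuAD-style metrics: exact match and token-overlap F1.
# Normalisation is reduced to lowercasing, punctuation stripping and
# whitespace collapsing; the example pairs below are invented.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

pairs = [  # (predicted answer, gold answer) -- invented Basque-style strings
    ("Bilbon", "Bilbon"),
    ("1936ko urriaren 1ean", "1936an"),
]
em = sum(exact_match(p, g) for p, g in pairs) / len(pairs)
f1_avg = sum(f1(p, g) for p, g in pairs) / len(pairs)
print(f"EM={em:.2f}  F1={f1_avg:.2f}")
```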