A Dimensionality Reduction Approach for Text Vectorization in Detecting Human and Machine-Generated Texts
Abstract
Distinguishing between human and machine-generated texts has recently attracted interest in Natural Language Processing (NLP), especially given the malicious use of Large Language Models (LLMs). As a result, several state-of-the-art methods have been proposed, with promising results. However, some of them do not reliably explain how features influence the detection of human and machine-generated texts. In this sense, previous studies have explored the effectiveness of traditional machine learning algorithms using lexical features based on ASCII code characters. Nevertheless, not all of these features are used, which may make the task more difficult. Therefore, in this paper, we propose a dimensionality reduction of these features to improve the performance of this text vectorization with traditional machine learning algorithms. The proposed dimensionality reduction has been tested on English and Spanish documents from the AuTexTification task. According to the results, the dimensionality reduction improves the performance of machine learning algorithms, and the resulting vectorization can serve as input to more advanced machine learning algorithms.
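To make the idea of ASCII-based lexical features and their reduction concrete, the following is a minimal sketch, not the method evaluated in this paper. It assumes 7-bit ASCII codes (0 to 127), relative character frequencies as features, and a simple reduction that drops codes never observed in the training documents. The function names (`ascii_frequency_vector`, `used_codes`, `reduce_vectors`) and the two toy documents are hypothetical and chosen only for illustration.

```python
from collections import Counter

ASCII_SIZE = 128  # assumption: 7-bit ASCII codes 0..127


def ascii_frequency_vector(text):
    """Relative frequency of each ASCII code in `text` (non-ASCII characters are ignored)."""
    counts = Counter(ord(ch) for ch in text if ord(ch) < ASCII_SIZE)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * ASCII_SIZE
    return [counts.get(code, 0) / total for code in range(ASCII_SIZE)]


def used_codes(vectors):
    """Indices of ASCII codes that occur in at least one document."""
    return [code for code in range(ASCII_SIZE) if any(v[code] > 0 for v in vectors)]


def reduce_vectors(vectors, keep):
    """Project each vector onto the retained ASCII codes."""
    return [[v[code] for code in keep] for v in vectors]


# Toy documents (hypothetical), one English and one Spanish.
docs = ["Hello, world!", "Hola, mundo!"]
full = [ascii_frequency_vector(d) for d in docs]
keep = used_codes(full)  # in practice, fitted on training documents only
reduced = reduce_vectors(full, keep)
print(len(full[0]), "->", len(reduced[0]))
```

Dropping unused codes is only one possible reduction. The retained index list would be computed on the training split and reused unchanged on test documents, so that both share the same feature space.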