Linguistic features integration for text classification tasks in Spanish

  1. García Díaz, José Antonio
Supervised by:
  1. Rafael Valencia García Director

Defence university: Universidad de Murcia

Defence date: 05 July 2022

Committee:
  1. María del Pilar Salas-Zárate Chair
  2. Miguel Ángel Rodríguez García Secretary
  3. Salud M. Jiménez Zafra Committee member

Type: Thesis

Abstract

Objectives: We define the following research hypotheses concerning the inclusion of linguistic features in automatic classification systems: (RH1) the inclusion of linguistic features improves the performance of automatic text classification systems in Spanish, and (RH2) the inclusion of linguistic features can provide interpretability to the models. To test these hypotheses, we define the following objectives:
• OB1. Obtaining a taxonomy of the different linguistic features of Spanish.
• OB2. The development of the UMUTextStats tool and the related lexicons for each feature within the taxonomy.
• OB3. The development of the UMUCorpusClassifier tool for the compilation and annotation of Spanish corpora.
• OB4. Validation of the UMUTextStats tool in different scenarios.
• OB5. Compilation and annotation of linguistic corpora in Spanish to conduct automatic document classification in different domains.

Methodology: The methodology followed is described below. First, we surveyed existing tools similar to those we intended to build. In particular, LIWC is the de facto tool for feature extraction in Spanish, and we identified a series of shortcomings in it, such as the fact that it does not account for certain characteristics of Spanish. Second, a taxonomy for classifying linguistic features into different categories was proposed: phonetics, morphosyntax, correction and style, semantics, pragmatics, stylometry, lexis, and social media jargon. Third, the UMUTextStats tool was developed: the dictionaries for each dimension were compiled, and software classes were developed for each type of linguistic feature. Fourth, the UMUCorpusClassifier tool was built, which is used to compile and label linguistic corpora automatically or semi-automatically. Finally, the features obtained were used to build automatic classification systems for sentiment analysis, emotion analysis, author profiling, and satire detection, among other tasks.
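The dictionary-based extraction step described in the methodology can be illustrated with a minimal sketch. It is not the UMUTextStats implementation: the two toy lexicons, the category names, the `extract_features` function, and the choice to report each feature as a percentage of tokens are all assumptions made for illustration.

```python
import re

# Hypothetical toy lexicons; the dictionaries compiled for the thesis
# are not reproduced here and are certainly much larger.
LEXICONS = {
    "negation": {"no", "nunca", "jamás", "tampoco"},
    "first_person": {"yo", "me", "mi", "nosotros"},
}

# \w matches Unicode letters in Python 3, so accented Spanish words stay whole.
TOKEN_RE = re.compile(r"\w+")


def extract_features(text, lexicons=LEXICONS):
    """Return, for each lexicon, the percentage of tokens found in it."""
    tokens = TOKEN_RE.findall(text.lower())
    total = len(tokens)
    features = {}
    for name, words in lexicons.items():
        hits = sum(1 for token in tokens if token in words)
        # Guard against empty documents to avoid dividing by zero.
        features[name] = 100.0 * hits / total if total else 0.0
    return features


if __name__ == "__main__":
    print(extract_features("Yo no quiero nunca eso"))
```

Each document is mapped to a fixed-length vector with one value per category, which is the kind of representation that can later be fed to a classifier or analysed feature by feature.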
Results: Meeting the objectives set in this doctoral thesis has allowed us to publish our methods and results in high-impact scientific journals and to participate in international congresses and conferences. The main results obtained are presented in this doctoral thesis as a compendium:
• García-Díaz, J. A., Cánovas-García, M., & Valencia-García, R. (2020). Ontology-driven aspect-based sentiment analysis classification: An infodemiological case study regarding infectious diseases in Latin America. Future Generation Computer Systems, 112, 641-657.
• García-Díaz, J. A., Cánovas-García, M., Colomo-Palacios, R., & Valencia-García, R. (2021). Detecting misogyny in Spanish tweets. An approach based on linguistics features and word embeddings. Future Generation Computer Systems, 114, 506-518.
• García-Díaz, J. A., Colomo-Palacios, R., & Valencia-García, R. (2022). Psychographic traits identification based on political ideology: An author analysis study on Spanish politicians’ tweets posted in 2020. Future Generation Computer Systems, 130, 59-74.
• García-Díaz, J. A., & Valencia-García, R. (2022). Compilation and evaluation of the Spanish SatiCorpus 2021 for satire identification using linguistic features and transformers. Complex & Intelligent Systems, 1-14.
In addition, we describe our participation in different international workshops, such as IberLEF, SemEval, and FIRE, in which we evaluate the linguistic features both separately and combined with Transformers and traditional machine-learning methods. These shared tasks involve hate-speech detection, emotion analysis, humour detection, and source-code profiling, among others, and we achieved competitive results in almost all of them.

Conclusions: In this doctoral thesis, we have presented the development and evaluation of a set of linguistic features for Spanish that have proven to be effective in automatic classification tasks. These features are extracted with UMUTextStats, a tool that was developed during this doctoral thesis and that is available to the research community. Two research hypotheses were raised in this thesis: first, whether the inclusion of linguistic features improves the performance of automatic text classification systems in Spanish, and second, whether the inclusion of linguistic features can provide interpretability to the models. For the first research hypothesis, we have shown that the linguistic features can be combined easily with state-of-the-art Transformers or traditional machine-learning models, and that the combination outperforms the results achieved by each separately. It is worth mentioning that the performance of the linguistic features depends considerably on the task and the domain to which they are applied. For the second research hypothesis, we measured the Mutual Information between the linguistic features and the target class in several domains, including infodemiology, hate-speech and misogyny detection, and emotion analysis, to name but a few. For instance, we found that lexical and morphosyntactic features were strongly correlated with the target class in author profiling, whereas these kinds of features were less important for authorship attribution, for which stylometric features were more relevant.

We will continue with the development and validation of UMUTextStats for different languages and domains. We are currently adapting the taxonomy for English and other languages. We expect that the release of the tool to the scientific community will make it easier to validate and extend it. In addition, we plan to facilitate the integration of this tool with other NLP tools apart from Stanza, so that it becomes easier to combine and use other NER and PoS models that extend the number of available labels.
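The Mutual Information analysis mentioned in the conclusions can be sketched as follows. This is an illustrative computation under stated assumptions, not the procedure used in the thesis: the feature is discretised by thresholding at its mean (the `binarize` helper is hypothetical), the estimate uses plain empirical frequencies, and the toy values and labels are invented for the example.

```python
import math
from collections import Counter


def binarize(values):
    """Map each value to 1 if above the mean, else 0 (assumed discretisation)."""
    mean = sum(values) / len(values)
    return [1 if v > mean else 0 for v in values]


def mutual_information(feature, labels):
    """Empirical mutual information, in bits, between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    count_x = Counter(feature)
    count_y = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y).
        mi += p_xy * math.log2(count * n / (count_x[x] * count_y[y]))
    return mi


if __name__ == "__main__":
    # Invented toy data: one linguistic feature value per document.
    negation_pct = [40.0, 35.0, 0.0, 5.0]
    labels = ["negative", "negative", "positive", "positive"]
    print(mutual_information(binarize(negation_pct), labels))
```

Ranking the features by this score for a given task gives a rough indication of which linguistic categories carry information about the class, which is the sense in which the features support interpretability.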