Machine Learning approaches for Topic and Sentiment Analysis in multilingual opinions and low-resource languages: From English to Guarani

  1. Agüero Torales, Marvin Matías
Supervised by:
  1. Antonio Gabriel López Herrera Director

Defence university: Universidad de Granada

Defence date: 04 February 2022

Committee:
  1. Enrique Herrera Viedma Chair
  2. Carlos Gustavo Porcel Gallego Secretary
  3. Jesús Serrano Guerrero Committee member
  4. María José del Jesús Díaz Committee member
  5. Salud M. Jiménez Zafra Committee member

Type: Thesis

Abstract

This dissertation focused on the study of machine learning techniques for sentiment analysis and topic modeling in social media texts. It places special emphasis on approaches and methods for handling low-resource languages, i.e., languages lacking large monolingual or parallel corpora and/or manually crafted linguistic resources sufficient for building Natural Language Processing (NLP) applications, and on the application of these approaches and methods to multilingual scenarios such as code-switching (i.e., alternating between two or more languages or language varieties within a phrase or word). First, we presented a data science workflow for building machine learning models for social media texts written in low-resource languages, even when these texts exhibit code-switching. The proposed workflow is able to handle the various difficulties involved (for example, web text collection, dealing with unbalanced classes, or implementing cross-lingual models). Next, we described how to build machine learning models to perform topic modeling on large volumes of short social media texts written in Spanish, as well as a number of sentiment analysis tasks for Guarani (a South American indigenous language) and Jopara (i.e., a Guarani-Spanish mixture), namely polarity classification, emotion recognition, humor detection, and offensive and toxic language identification. Emphasis was also placed on noisy and short texts coming from social media. Experiments with the corpora created and the evaluation of the machine learning models built show the robustness of the approaches and methods proposed in this dissertation in monolingual, multilingual, and code-switching settings. The contributions presented in this dissertation may be useful for both the Spanish-speaking community and the Guarani-speaking community.
Many use cases in different areas and disciplines can benefit from the insights produced by the approaches presented in this thesis. These approaches therefore have a number of possible applications for the democratization of low-resource languages, such as less biased monitoring of social networks in multilingual environments or the automatic extraction of the knowledge available in non-dominant languages.