Toxicity in Spanish News Comments and its Relationship with Constructiveness

  1. López-Úbeda, Pilar
  2. Plaza-del-Arco, Flor Miriam
  3. Díaz-Galiano, Manuel-Carlos
  4. Martín-Valdivia, M. Teresa
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2024

Issue: 73

Pages: 43-53

Type: Article

Abstract

Comments on digital news platforms are an essential source of information and opinion. However, they often become hotbeds of toxic discourse and incivility. Detecting toxicity in these comments is fundamental to understanding and mitigating this problem. This article introduces a corpus of Spanish news comments annotated for toxicity (NECOS-TOX) and reports a series of experiments using several machine learning algorithms, including language models based on the transformer architecture. The results show that Spanish-specific language models, such as BETO, are able to identify toxicity in Spanish news comments. In addition, the relationship between toxicity and constructiveness in these comments was explored, with the conclusion that no clear correlation between the two is apparent. These findings shed light on the complexities inherent in online discourse and underscore the pressing need for further research to understand more deeply the relationship between toxicity and constructiveness in Spanish news comments.

References

  • Bose, R., I. Perera, and B. Dorr. 2023. Detoxifying online discourse: A guided response generation approach for reducing toxicity in user-generated text. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023), pages 9–14, Toronto, Canada, July. Association for Computational Linguistics.
  • Burges, C. 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery.
  • Cañete, J., G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, and J. Pérez. 2020. Spanish pre-trained BERT model and evaluation data. PML4DC at ICLR, 2020(2020):1–10.
  • Chvasta, A., A. Lees, J. Sorensen, L. Vasserman, and N. Goyal. 2022. Lost in distillation: A case study in toxicity modeling. In Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH), pages 92–101, Seattle, Washington (Hybrid), July. Association for Computational Linguistics.
  • Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1):37–46.
  • Conneau, A., K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. 2019. Unsupervised crosslingual representation learning at scale. CoRR, abs/1911.02116.
  • Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.
  • Dinkov, Y., I. Koychev, and P. Nakov. 2019. Detecting toxicity in news articles: Application to Bulgarian. arXiv preprint arXiv:1908.09785.
  • Garlapati, A., N. Malisetty, and G. Narayanan. 2022. Classification of toxicity in comments using NLP and LSTM. In 2022 8th International Conference on Advanced Computing and Communication Systems (ICACCS), volume 1, pages 16–21. IEEE.
  • Kogilavani, S., S. Malliga, K. Jaiabinaya, M. Malini, and M. M. Kokila. 2021. Characterization and mechanical properties of offensive language taxonomy and detection techniques. Materials Today: Proceedings.
  • Kolhatkar, V. and M. Taboada. 2017. Constructive language in news comments. In Proceedings of the first workshop on abusive language online, pages 11–17.
  • Kolhatkar, V., N. Thain, J. Sorensen, L. Dixon, and M. Taboada. 2020a. Classifying constructive comments. arXiv preprint arXiv:2004.05476.
  • Kolhatkar, V., H. Wu, L. Cavasso, E. Francis, K. Shukla, and M. Taboada. 2020b. The SFU opinion and comments corpus: A corpus for the analysis of online news comments. Corpus Pragmatics, 4:155–190.
  • la Rosa, J. D., E. G. Ponferrada, M. Romero, P. Villegas, P. G. de Prado Salas, and M. Grandury. 2022. BERTIN: Efficient pretraining of a Spanish language model using perplexity sampling. Procesamiento del Lenguaje Natural, 68(0):13–23.
  • Lample, G. and A. Conneau. 2019. Crosslingual language model pretraining.
  • Landis, J. R. and G. G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, pages 159–174.
  • Li, H., W. Mao, and H. Liu. 2019. Toxic comment detection and classification. In CS299 Machine Learning. Stanford University.
  • López-Úbeda, P., F. M. Plaza-del-Arco, M. C. Díaz-Galiano, and M. T. Martín-Valdivia. 2021. NECOS: An annotated corpus to identify constructive news comments in Spanish. Procesamiento del Lenguaje Natural, 66:41–51.
  • Narang, K., A. M. Davani, L. Mathias, B. Vidgen, and Z. Talat. 2022. Proceedings of the sixth workshop on online abuse and harms (WOAH). In Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH).
  • Nguyen, L. T., K. Van Nguyen, and N. L.-T. Nguyen. 2021. Constructive and toxic speech detection for open-domain social media comments in Vietnamese. In Advances and Trends in Artificial Intelligence. Artificial Intelligence Practices: 34th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2021, Kuala Lumpur, Malaysia, July 26–29, 2021, Proceedings, Part I 34, pages 572–583. Springer.
  • Nobata, C., J. Tetreault, A. Thomas, Y. Mehdad, and Y. Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th international conference on world wide web, pages 145–153.
  • Peters, M. E., M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. 2018. Deep contextualized word representations.
  • Pires, T., E. Schlinger, and D. Garrette. 2019. How multilingual is multilingual BERT?
  • Plaza-del-Arco, F. M., M. D. Molina-González, L. A. Ureña-López, and M. T. Martín-Valdivia. 2021. SINAI at IberLEF-2021 DETOXIS task: Exploring features as tasks in a multi-task learning approach to detecting toxic comments. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), XXXVII International Conference of the Spanish Society for Natural Language Processing, Málaga, Spain, September, 2021, volume 2943 of CEUR Workshop Proceedings, pages 580–590. CEUR-WS.org.
  • Plaza-del-Arco, F. M., M. D. Molina-González, L. A. Ureña-López, and M.-T. Martín-Valdivia. 2022. Integrating implicit and explicit linguistic phenomena via multi-task learning for offensive language detection. Knowledge-Based Systems, 258:109965.
  • Risch, J. and R. Krestel. 2020. Toxic comment detection in online discussions. Deep learning-based approaches for sentiment analysis, pages 85–109.
  • Salton, G. and C. Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information processing & management, 24(5):513–523.
  • Subies, G. G. 2021. Guillemgsubies at IberLEF-2021 DETOXIS task: Detecting toxicity with Spanish BERT. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), XXXVII International Conference of the Spanish Society for Natural Language Processing, Málaga, Spain, September, 2021, volume 2943 of CEUR Workshop Proceedings, pages 591–598. CEUR-WS.org.
  • Taulé, M., A. Ariza, M. Nofre, E. Amigó, and P. Rosso. 2021. Overview of DETOXIS at IberLEF 2021: detection of toxicity in comments in Spanish. Procesamiento del lenguaje natural, 67:209–221.
  • Taulé, M., M. Nofre, V. Bargiela, and X. Bonet. 2024. NewsCom-TOX: a corpus of comments on news articles annotated for toxicity in Spanish. Language Resources and Evaluation, pages 1–41.
  • Xenos, A., J. Pavlopoulos, and I. Androutsopoulos. 2021. Context sensitivity estimation in toxicity detection. In Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), pages 140–145, Online, August. Association for Computational Linguistics.
  • Zaheri, S., J. Leath, and D. Stroud. 2020. Toxic comment classification. SMU Data Science Review, 3(1):13.
  • Zhao, Z., Z. Zhang, and F. Hopfgartner. 2021. A comparative study of using pretrained language models for toxic comment classification. In Companion Proceedings of the Web Conference 2021, pages 500–507.