An approach to lexicon filtering for author profiling

Ortiz Zambrano, Jenny Alexandra; Montejo Ráez, Arturo; Espin Riofrio, César

Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2023

Issue: 71

Pages: 75-86

Type: Article

Abstract

This paper studies how a general Spanish lexicon and a domain-specific lexicon affect a text classification problem. Specifically, we address the impact of lexicon choice on user modelling. To do so, we identify gender and profession as demographic traits, and political ideology as a psychographic trait, from a set of tweets. We experimented with supervised machine learning methods to build a prediction model, which we used to evaluate our domain-specific lexicon. Our results show that the selection or construction of lexicons for this task can follow a defined strategy, characterised by the domain of the lexicon and the type of words it contains.

Data source: Dialnet