Transformers for Lexical Complexity Prediction in Spanish Language

  1. Ortiz Zambrano, Jenny Alexandra
  2. Espin-Riofrio, César
  3. Montejo Ráez, Arturo
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2022

Issue: 69

Pages: 177-188

Type: Article

Abstract

In this article we present a contribution to predicting the complexity of single words in Spanish, based on combining a large number of features of different kinds. We obtained the results by running fine-tuned Transformer-based models, built on the pre-trained models BERT, XLM-RoBERTa, and RoBERTa-large-BNE, on different Spanish datasets and pairing them with several regression algorithms. The evaluation showed good performance, with a Mean Absolute Error (MAE) of 0.1598 and a Pearson correlation of 0.9883, achieved by training and evaluating the Random Forest Regressor algorithm on the fine-tuned BERT model. As a possible alternative for better lexical complexity prediction, we are very interested in continuing to experiment with Spanish datasets using state-of-the-art Transformer models.
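
The two evaluation metrics reported in the abstract, MAE and the Pearson correlation, can be illustrated with a minimal sketch. The function names and the `gold` and `pred` scores below are hypothetical and chosen only for illustration; they are not the article's code or data, and the sketch uses only the Python standard library.

```python
from math import sqrt


def mean_absolute_error(gold, pred):
    """Average absolute difference between gold and predicted scores."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


def pearson_r(gold, pred):
    """Pearson correlation coefficient between gold and predicted scores."""
    if len(gold) != len(pred) or len(gold) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(gold)
    mean_g = sum(gold) / n
    mean_p = sum(pred) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, pred))
    var_g = sum((g - mean_g) ** 2 for g in gold)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_g == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_g * var_p)


# Hypothetical complexity scores (assumed to lie in [0, 1]), not real data.
gold = [0.10, 0.35, 0.50, 0.80]
pred = [0.15, 0.30, 0.55, 0.70]

mae = mean_absolute_error(gold, pred)
r = pearson_r(gold, pred)
print(f"MAE = {mae:.4f}, Pearson = {r:.4f}")
```

A lower MAE means the predictions are closer to the annotated scores on average, while a Pearson value near 1 means the predictions rise and fall with the annotations. The two metrics can disagree, which is presumably why the abstract reports both.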
