LegalEc: Un nuevo corpus para la investigación de la identificación de palabras complejas en los estudios de Derecho en español ecuatoriano

Espin-Riofrio, César; Montejo Ráez, Arturo; Ortiz Zambrano, Jenny Alexandra

Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2023

Issue: 71

Pages: 247-259

Type: Article

Abstract

In this work, we present LegalEc, a new corpus annotated for complex vocabulary and built from legal texts in Ecuadorian Spanish. We describe how the corpus was compiled and annotated. To provide baselines for the research community, we ran several complex-word prediction experiments on the corpus. We extracted 23 linguistic features and combined them with the encodings generated by models such as XLM-RoBERTa and RoBERTa-BNE (from the MarIA project). The evaluation shows that combining these features notably improves lexical complexity prediction.
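
The combination of handcrafted features with model encodings described in the abstract can be illustrated with a minimal sketch. The abstract does not list the 23 features or say how the combination is performed, so everything here is an assumption: the three features, the helper names `handcrafted_features` and `combine`, simple concatenation as the combination step, and the zero-filled placeholder standing in for an encoder output such as XLM-RoBERTa's.

```python
from typing import List, Sequence

# Spanish vowels, including accented forms (assumption: vowel count as a feature).
VOWELS = set("aeiouáéíóúü")


def handcrafted_features(word: str, sentence: str) -> List[float]:
    """Return three illustrative features for a target word.

    These are hypothetical stand-ins; the paper's 23 features are not
    enumerated in the abstract.
    """
    w = word.lower()
    tokens = sentence.lower().split()
    if w in tokens:
        # Relative position of the word in the sentence, in [0, 1).
        position = tokens.index(w) / len(tokens)
    else:
        position = -1.0  # sentinel: word not found as a whitespace token
    return [
        float(len(w)),                         # length in characters
        float(sum(ch in VOWELS for ch in w)),  # number of vowels
        position,
    ]


def combine(features: Sequence[float], embedding: Sequence[float]) -> List[float]:
    """Concatenate handcrafted features with a contextual embedding."""
    return list(features) + list(embedding)


# Placeholder embedding (assumption): a real one would come from an encoder.
placeholder_embedding = [0.0] * 8
vector = combine(
    handcrafted_features("jurisprudencia", "la jurisprudencia es vinculante"),
    placeholder_embedding,
)
```

The resulting `vector` could then be passed to any classifier or regressor. Concatenation is only one way to merge the two sources of information and is chosen here for simplicity.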

Data source: Dialnet