Towards Quality Benchmarking in Question Answering over Tabular Data in Spanish

  1. Osés Grijalba, Jorge
  2. Ureña López, Luis Alfonso
  3. Camacho-Collados, Jose
  4. Martínez Cámara, Eugenio
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2024

Issue: 73

Pages: 283-296

Type: Article

Other publications in: Procesamiento del lenguaje natural

Abstract

The rapid and incessant progress of the language understanding and generation capabilities of large language models (LLMs) is accompanied by the discovery of new capabilities. The research community must provide evaluation benchmarks to assess these emerging capabilities by studying, analysing and comparing different LLMs under fair and realistic settings. Question answering over tabular data is an important task that still lacks reliable evaluation benchmarks for assessing LLMs in distinct scenarios, particularly in Spanish. Hence, in this paper we present Spa-DataBench, an evaluation benchmark composed of ten datasets about different topics of Spanish society. Each dataset is linked to a set of questions written in Spanish and their corresponding answers. These questions are used to assess LLMs and to analyse their capacity for answering questions that involve a single column or multiple columns of different data types, and for generating source code to resolve the questions. We evaluate six LLMs on Spa-DataBench and compare their performance using both Spanish and English prompts. The results on Spa-DataBench show that LLMs are able to reason over tabular data, but their performance in Spanish is worse, which indicates that there is still room for improvement of LLMs in the Spanish language.
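To illustrate the task the abstract describes, the following is a minimal sketch, assuming a code-generation setup in which the model is given a table and a question in Spanish and must produce an executable pandas expression whose result is compared with the gold answer. The dataset, column names, example question and gold value here are hypothetical, not taken from Spa-DataBench.

```python
# Minimal sketch of question answering over tabular data via code generation:
# an LLM receives the table schema and a Spanish question, and returns a
# pandas expression that is executed and checked against the gold answer.
import pandas as pd

# Hypothetical survey table with columns of different data types.
df = pd.DataFrame({
    "edad": [23, 35, 41, 29, 52],
    "comunidad": ["Andalucía", "Madrid", "Galicia", "Madrid", "Cataluña"],
    "satisfaccion": [3.5, 4.0, 2.5, 4.5, 3.0],
})

# Question (Spanish): "¿Cuál es la edad media de los encuestados de Madrid?"
# Code an LLM might generate to resolve it, involving two columns:
generated_code = 'df[df["comunidad"] == "Madrid"]["edad"].mean()'

# Execute the generated snippet and compare the result with the gold answer.
predicted = eval(generated_code, {"df": df})
gold = 32.0
print(predicted, predicted == gold)
```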
