Multi-label Discourse Function Classification of Lexical Bundles in Basque and Spanish via transformer-based models

Goikoetxea, Josu; Etxabe, Markel; García, Marcos; Guzzi, Eleonora; Alonso, Margarita

Journal:

Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2024

Issue: 73

Pages: 29-41

Type: Article

Abstract

This article explores the effectiveness of transformer-based models in the multi-label classification of the discourse functions of lexical bundles in two languages, Basque and Spanish. The study has a twofold aim: first, to evaluate the impact of manually and automatically annotated datasets on fine-tuning for this task; second, to demonstrate the efficiency of multilingual language models in a cross-lingual transfer learning setting. Our results show that transformers can generalize the classification of discourse functions of lexical bundles beyond specific word-sequence forms, in both monolingual and cross-lingual transfer learning settings. In the monolingual setting, the study highlights the superiority of manually annotated datasets over automatically annotated ones, provided that the dataset is large enough. In the cross-lingual setting, even though transfer takes place between two typologically different languages, the results also suggest the superiority of manually annotated datasets, as well as the ability to surpass the monolingual results when the proportions of the training and fine-tuning corpora in the target and source languages are balanced.
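As an illustration only, and not the authors' implementation, the sketch below shows what "multi-label" means at the output layer of a classifier: each label gets its own independent sigmoid probability, so a single lexical bundle can receive zero, one, or several discourse functions. The label names, the 0.5 threshold, and the example scores are all hypothetical; the abstract does not specify the label inventory, the decision rule, or the loss used in the article.

```python
import math

# Hypothetical label inventory: the abstract does not list the actual
# discourse-function labels, so these names are placeholders.
LABELS = ["label_a", "label_b", "label_c"]


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Return every label whose independent probability reaches the threshold.

    Unlike a softmax over classes, the labels do not compete with each
    other, so the result may be empty or contain several labels.
    The 0.5 threshold is an assumption made for this sketch.
    """
    return [label for label, z in zip(LABELS, logits) if sigmoid(z) >= threshold]


def binary_cross_entropy(logits, targets):
    """Mean per-label binary cross-entropy, a common multi-label training loss."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return total / len(logits)


if __name__ == "__main__":
    scores = [2.0, -1.0, 0.3]  # made-up scores for one lexical bundle
    print(predict_labels(scores))
    print(binary_cross_entropy(scores, [1, 0, 1]))
```

In a fine-tuning setup of the kind the abstract describes, the scores would come from a classification head on top of a pretrained transformer rather than being written by hand as they are here.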

Data source: Dialnet