Natural Language Processing and Language Technologies for the Basque Language

  1. Gonzalez-Dios, Itziar
  2. Altuna, Begoña (1)
  (1) University of the Basque Country, Spain
Journal:
Cuadernos europeos de Deusto

ISSN: 1130-8354

Year of publication: 2022

Issue title: SPECIAL ISSUE. Linguas minoritarias e futuro de Europa. Minority Languages and the Future of Europe

Issue number: 4

Pages: 203-230

Type: Article

DOI: 10.18543/CED.2477 (open access at the publisher)

Abstract

The presence of a language in the digital domain is crucial for its survival, as online communication and digital language resources have become the standard in recent decades and will gain more importance in the coming years. To develop the advanced systems now considered basic for efficient digital communication (e.g. machine translation systems, text-to-speech and speech-to-text converters, and digital assistants), it is necessary to digitalise linguistic resources and create tools. In the case of Basque, scholars have spent the last forty years studying the creation of digital linguistic resources and the tools that allow those systems to be developed. In this paper, we present an overview of the natural language processing and language technology resources developed for Basque, their impact on the process of making Basque a “digital language”, and the applications and challenges in multilingual communication. More precisely, we present the well-known products for Basque, the basic tools, and the resources behind the products we use every day. We also hope that this survey will serve as a guide for other minority languages that are making their way to digitalisation.

Received: 5 April 2022. Accepted: 20 May 2022.

References

  • Aitor Gonzalez-Agirre, Egoitz Laparra, and German Rigau, “Multilingual Central Repository version 3.0”, in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12) (European Language Resources Association, 2012)
  • Alan Akbik, Duncan Blythe, and Roland Vollgraf, “Contextual String Embeddings for Sequence Labeling”, in Proceedings of the 27th International Conference on Computational Linguistics (Association for Computational Linguistics, 2018), 1638-1649
  • Ander Soraluze et al., “EUSKOR: End-to-end Coreference Resolution System for Basque”, PLOS ONE, 14 (2019): e0221801, https://doi.org/10.1371/journal.pone.0221801
  • Arantza Otegi et al., “Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque”, in Proceedings of The 12th Language Resources and Evaluation Conference (European Language Resources Association, 2020), 436-442
  • Arantza Otegi et al., “A Modular Chain of NLP Tools for Basque”, in International Conference on Text, Speech, and Dialogue, (Springer, 2016), 93-100
  • Christoph Pan and Beate Sibylle Pfeil, Minderheitenrechte in Europa: Handbuch der europäischen Volksgruppen, Band 2 (Wien: Braumüller, 2002)
  • Claudia Soria, “Decolonizing Minority Language Technology”, State of the Internet’s Languages Report (2022)
  • Damián Blasi, Antonios Anastasopoulos and Graham Neubig, “Systematic Inequalities in Language Technology Performance across the World’s Languages”, arXiv preprint arXiv:2110.06733, (2021)
  • Eli Pociello, Eneko Agirre, and Izaskun Aldezabal, “Methodology and Construction of the Basque WordNet”, Language resources and evaluation, 45, 2 (2011): 121-142.
  • Emily M. Bender et al., “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”, in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, (Association for Computing Machinery, 2021), 610-623
  • Emma Strubell, Ananya Ganesh, and Andrew McCallum, “Energy and Policy Considerations for Deep Learning in NLP”, in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, (Association for Computational Linguistics, 2019), 3645-3650
  • Eneko Agirre and Aitor Soroa, “Personalizing PageRank for Word Sense Disambiguation”, in Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), (Association for Computational Linguistics, 2009), 33-41
  • Eneko Agirre et al., “XUXEN: A Spelling Checker/Corrector for Basque based on Two-Level Morphology”, in proceedings of the third conference on applied natural language processing (1992), 119-125
  • European Language Resource Coordination, ELRC WHITE PAPER. Sustainable Language Data Sharing to Support Language Equality in Multilingual Europe. Why Language Data Matters (ELRC Consortium, 2019), ISBN: 978-3-943853-05-6
  • Eusko Jaurlaritzaren Argitalpen Zerbitzu Nagusia, Euskal Herriko parte-hartze kulturalari buruzko inkesta 2019. Emaitzen txostena (Eusko Jaurlaritzaren Argitalpen Zerbitzu Nagusia, 2019)
  • Georg Rehm et al., “The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe”, in Proceedings of the 12th Language Resources and Evaluation Conference (European Language Resources Association, 2020), 3322-3332
  • Gurrutxaga and Ceberio, “Basque-a Digital Language?”
  • Haritz Salaberri, Olatz Arregi, and Beñat Zapirain, “bRol: The Parser of Syntactic and Semantic Dependencies for Basque”, in Proceedings of the International Conference Recent Advances in Natural Language Processing (INCOMA Ltd. Shoumen, 2015), 555-56
  • Iakes Goenaga, “ASKHi: Analisi sintaktiko konputazional hibridoa paradigma esberdinen konbinazioan oinarrituta”, (Doctoral dissertation, University of the Basque Country (UPV/EHU), 2017)
  • Igone Zabala et al., “GARATERM: euskararen erregistro akademikoen garapenaren ikerketarako lan-ingurunea”, in Ugarteburu terminologia jardunaldiak (V). Terminologia naturala eta terminologia planifikatua euskararen normalizazioari begira (Bilbao: Publishing Service of the UPV/EHU, 2013), 98-114
  • Igone Zabala, “The Elaboration of Basque in Academic and Professional Domain”, in Linguistic Minorities in Europe Online, eds. Miren Lourdes Oñederra and Iván Igartua, (De Gruyter, 2019)
  • Inma Hernáez et al., “Description of the AHOTTS System for the Basque Language”, in 4th ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis, (2001)
  • Inma Hernáez et al., Euskara Aro Digitalean - The Basque Language in the Digital Age (Springer, 2012), https://link.springer.com/book/10.1007/978-3-642-30796-6
  • Iñaki Alegria et al., “Design and Development of a Named Entity Recognizer for an Agglutinative Language”, in First International Joint Conference on NLP (IJCNLP-04). Workshop on Named Entity Recognition, (Berlin, Heidelberg: Springer, 2004)
  • Iñaki Alegria et al., “Robustness and Customisation in an Analyser/lemmatiser for Basque”, in LREC-2002 Customizing Knowledge in NLP Applications workshop (European Language Resources Association, 2002), 1-6
  • Iñaki San Vicente, Xabier Saralegi, and Rodrigo Agerri, “Real Time Monitoring of Social Media and Digital Press”, arXiv e-prints, arXiv-1810 (2018)
  • Itziar Aduriz et al., “Finite State Applications for Basque”, in EACL’2003 Workshop on Finite-State Methods in Natural Language Processing (Association for Computational Linguistics, 2003), 3-11
  • Itziar Aduriz et al., “A Cascaded Syntactic Analyser for Basque”, in International Conference on Intelligent Text Processing and Computational Linguistics, (Berlin, Heidelberg: Springer, 2004), 124-134
  • Itziar Gonzalez-Dios et al., “Detecting Apposition for Text Simplification in Basque”, in International Conference on Intelligent Text Processing and Computational Linguistics, (Berlin, Heidelberg: Springer, 2013), 513-524
  • Izaskun Aldezabal et al., “EDBL: A General Lexical Basis for the Automatic Processing of Basque”, in proceedings of the IRCS Workshop on linguistic databases, (IRCS Workshop on linguistic databases, 2001).
  • Izaskun Aldezabal et al., “Basque e-lexicographic Resources: Linguistic Basis, Development, and Future Perspectives”, in Workshop on eLexicography: Between Digital Humanities and Artificial Intelligence, (2018)
  • Jacob Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), (Association for Computational Linguistics, 2018), 4171-4186
  • Jose Mari Arriola et al., “Reusing the CG-2 Grammar for Processing Basque Complex Postpositions”, in Actas del XXIX Congreso de la Sociedad Española del Procesamiento del Lenguaje Natural (SEPLN 2013), (2013), 20-27
  • Josu Goikoetxea, Aitor Soroa, and Eneko Agirre, “Bilingual Embeddings with Random Walks over Multilingual Wordnets”, Knowledge-Based Systems, 150, (2018): 218-230, https://doi.org/10.1016/j.knosys.2018.03.017
  • Justyna Olko and Julia Sallabank, eds., Revitalizing Endangered Languages: A Practical Guide (Cambridge: Cambridge University Press, 2021), https://doi.org/10.1017/9781108641142
  • Kepa Sarasola et al., D1.4 Report on the Basque Language (ELE Consortium, 2022)
  • Kepa Bengoetxea and Itziar Gonzalez-Dios, “MultiAzterTest: a Multilingual Analyzer on Multiple Levels of Language for Readability Assessment”, arXiv preprint arXiv:2109.04870, (2021)
  • Kepa Bengoetxea, “Estaldura zabaleko euskararako analizatzaile sintaktiko estatistikoa”, (Doctoral dissertation, University of the Basque Country (UPV/EHU), 2014)
  • María Jesús Aranzabe, “Dependentzia-ereduan oinarritutako baliabide sintaktikoak: zuhaitz-bankua eta gramatika konputazionala” (Doctoral dissertation, University of the Basque Country (UPV/EHU), 2008). https://dialnet.unirioja.es/servlet/tesis?codigo=177705
  • María Jesús Aranzabe, Arantza Díaz de Ilarraza, and Itziar Gonzalez-Dios, “Transforming Complex Sentences using Dependency trees for Automatic Text Simplification in Basque”, Procesamiento del lenguaje natural, 50 (2013): 61-68.
  • Mikel Artetxe et al., “Does Corpus Quality Really Matter for Low-Resource Languages?”, arXiv preprint arXiv:2203.08111, (2022)
  • George A. Miller, “WordNet: A Lexical Database for English”, Communications of the ACM, 38, 11 (1995): 39-41
  • Olatz Ansa et al., “Ihardetsi: a Basque Question Answering System at QA@CLEF 2008”, in Workshop of the Cross-Language Evaluation Forum for European Languages, (Berlin, Heidelberg: Springer, 2008), 369-376
  • Piotr Bojanowski et al., “Enriching Word Vectors with Subword Information”, Transactions of the Association for Computational Linguistics 5 (2017): 135-146
  • Plan de Impulso de las Tecnologías del Lenguaje, Ministerio de Turismo, Energia y Agenda Digital, 2015
  • Rodrigo Agerri et al., “Give your Text Representation Models some Love: the Case for Basque”, in Proceedings of the 12th Language Resources and Evaluation Conference, (European Language Resources Association, 2020), 4781-4788.
  • Rodrigo Agerri, Josu Bermudez, and German Rigau, “IXA pipeline: Efficient and Ready to Use Multilingual NLP tools”, in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), (European Language Resources Association, 2014), 3823-3828
  • Vera Ferreira and Peter Bouda, Language Documentation and Conservation in Europe (whole volume) (Honolulu: University of Hawai‘i Press, 2016)
  • Xabier Gomez Guinovart et al., “Multilingual Central Repository: a Cross-lingual Framework for Developing Wordnets”, arXiv preprint arXiv:2107.00333, (2021)
  • Zuhaitz Beloki et al., “Grammatical Error Correction for Basque through a Seq2seq Neural Architecture and Synthetic Examples”, Procesamiento del Lenguaje Natural, 65 (2020): 13-20
  • Christiane Fellbaum, ed. WordNet: An Electronic Lexical Database (Cambridge, MA: MIT Press, 1998).
  • Xabier Arregi et al., “TZOS: An On-line System for Terminology Service”, in Actualizaciones en Comunicación Social. (Centro de Lingüística Aplicada, 2013), 400-404