Natural Language Processing and Language Technologies for the Basque Language

  1. Gonzalez-Dios, Itziar
  2. Altuna, Begoña [1]

  [1] University of the Basque Country, Spain
Journal:
Cuadernos europeos de Deusto

ISSN: 1130-8354

Year of publication: 2022

Issue title: SPECIAL ISSUE. Linguas minoritarias e futuro de Europa. Minority Languages and the Future of Europe

Issue: 4

Pages: 203-230

Type: Article

DOI: 10.18543/CED.2477 (open access, publisher)


Abstract

A presence in the digital sphere is now crucial for a language's survival: over recent decades, online communication and digital language resources have become part of our daily lives, and they will be used even more in the coming years. Developing the systems considered necessary for efficient digital communication (for example, machine translation systems, text-to-speech and speech-to-text converters, or digital assistants) requires digitising linguistic resources and creating suitable tools. In the case of Basque, developing these resources and systems has been of primary interest to academics for the last forty years. In this article, we present an overview of the natural language processing and language technology resources that have been developed for Basque, their impact on the process of making Basque a "digital language", and the applications and challenges in multilingual communication scenarios. Specifically, we present the best-known products and the basic tools and resources that underpin them. We also intend this study to serve as a guide for other minority languages that are on their own path to digitalisation.

Received: 05 April 2022. Accepted: 20 May 2022.

References

  • Aitor Gonzalez-Agirre, Egoitz Laparra, and German Rigau, “Multilingual Central Repository version 3.0”, in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), (European Language Resources Association, 2012)
  • Alan Akbik, Duncan Blythe and Roland Vollgraf, “Contextual String Embeddings for Sequence Labeling”, in Proceedings of the 27th International Conference on Computational Linguistics, (Association for Computational Linguistics, 2018), 1638-1649
  • Ander Soraluze et al., “EUSKOR: End-to-end Coreference Resolution System for Basque”, Plos one, 14 (2019): e0221801, https://doi.org/10.1371/journal.pone.0221801
  • Arantza Otegi et al., “Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque”, in Proceedings of The 12th Language Resources and Evaluation Conference (European Language Resources Association, 2020), 436-442
  • Arantza Otegi et al., “A Modular Chain of NLP Tools for Basque”, in International Conference on Text, Speech, and Dialogue, (Springer, 2016), 93-100
  • Christoph Pan and Beate Sibylle Pfeil, Minderheitenrechte in Europa: Handbuch der europäischen Volksgruppen, Band 2 (Wien: Braumüller, 2002)
  • Claudia Soria, “Decolonizing Minority Language Technology”, State of the Internet’s Languages Report, (2022)
  • Damián Blasi, Antonios Anastasopoulos and Graham Neubig, “Systematic Inequalities in Language Technology Performance across the World’s Languages”, arXiv preprint arXiv:2110.06733, (2021)
  • Eli Pociello, Eneko Agirre, and Izaskun Aldezabal, “Methodology and Construction of the Basque WordNet”, Language resources and evaluation, 45, 2 (2011): 121-142.
  • Emily M. Bender et al., “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”, in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, (Association for Computing Machinery, 2021), 610-623
  • Emma Strubell, Ananya Ganesh, and Andrew McCallum, “Energy and Policy Considerations for Deep Learning in NLP”, in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, (Association for Computational Linguistics, 2019), 3645-3650
  • Eneko Agirre and Aitor Soroa, “Personalizing PageRank for Word Sense Disambiguation”, in Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), (Association for Computational Linguistics, 2009), 33-41
  • Eneko Agirre et al., “XUXEN: A Spelling Checker/Corrector for Basque based on Two-Level Morphology”, in Proceedings of the Third Conference on Applied Natural Language Processing (1992), 119-125
  • European Language Resource Coordination, ELRC WHITE PAPER. Sustainable Language Data Sharing to Support Language Equality in Multilingual Europe. Why Language Data Matters (ELRC Consortium, 2019), ISBN: 978-3-943853-05-6
  • Eusko Jaurlaritzaren Argitalpen Zerbitzu Nagusia, Euskal Herriko parte-hartze kulturalari buruzko inkesta 2019. Emaitzen txostena. (Eusko Jaurlaritzaren Argitalpen Zerbitzu Nagusia, 2019)
  • Georg Rehm et al., “The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe”, in Proceedings of the 12th Language Resources and Evaluation Conference (European Language Resources Association, 2020), 3322-3332
  • Gurrutxaga and Ceberio, “Basque-a Digital Language?”
  • Haritz Salaberri, Olatz Arregi, and Beñat Zapirain, “bRol: The Parser of Syntactic and Semantic Dependencies for Basque”, in Proceedings of the International Conference Recent Advances in Natural Language Processing (INCOMA Ltd. Shoumen, 2015), 555-56
  • Iakes Goenaga, “ASKHi: Analisi sintaktiko konputazional hibridoa paradigma esberdinen konbinazioan oinarrituta”, (Doctoral dissertation, University of the Basque Country (UPV/EHU), 2017)
  • Igone Zabala et al., “GARATERM: euskararen erregistro akademikoen garapenaren ikerketarako lan-ingurunea”, in Ugarteburu terminologia jardunaldiak (V). Terminologia naturala eta terminologia planifikatua euskararen normalizazioari begira (Bilbao: Publishing Service of the UPV/EHU, 2013), 98-114
  • Igone Zabala, “The Elaboration of Basque in Academic and Professional Domain”, in Linguistic Minorities in Europe Online, eds. Miren Lourdes Oñederra and Iván Igartua, (De Gruyter, 2019)
  • Inma Hernáez et al., “Description of the AHOTTS System for the Basque Language”, in 4th ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis, (2001)
  • Inma Hernáez et al., Euskara Aro Digitalean - The Basque Language in the Digital Age (Springer, 2012), https://link.springer.com/book/10.1007/978-3-642-30796-6
  • Iñaki Alegria et al., “Design and Development of a Named Entity Recognizer for an Agglutinative Language”, in First International Joint Conference on NLP (IJCNLP-04). Workshop on Named Entity Recognition, (Berlin, Heidelberg: Springer, 2004)
  • Iñaki Alegria et al., “Robustness and Customisation in an Analyser/lemmatiser for Basque”, in LREC-2002 Customizing Knowledge in NLP Applications workshop (European Language Resources Association, 2002), 1-6
  • Iñaki San Vicente, Xabier Saralegi, and Rodrigo Agerri, “Real Time Monitoring of Social Media and Digital Press”, arXiv e-prints, arXiv-1810 (2018)
  • Itziar Aduriz et al., “Finite State Applications for Basque”, in EACL’2003 Workshop on Finite-State Methods in Natural Language Processing (Association for Computational Linguistics, 2003), 3-11
  • Itziar Aduriz et al., “A Cascaded Syntactic Analyser for Basque”, in International Conference on Intelligent Text Processing and Computational Linguistics, (Berlin, Heidelberg: Springer, 2004), 124-134
  • Itziar Gonzalez-Dios et al., “Detecting Apposition for Text Simplification in Basque”, in International Conference on Intelligent Text Processing and Computational Linguistics, (Berlin, Heidelberg: Springer, 2013), 513-524
  • Izaskun Aldezabal et al., “EDBL: A General Lexical Basis for the Automatic Processing of Basque”, in Proceedings of the IRCS Workshop on Linguistic Databases, (IRCS Workshop on Linguistic Databases, 2001)
  • Izaskun Aldezabal et al., “Basque e-lexicographic Resources: Linguistic Basis, Development, and Future Perspectives”, in Workshop on eLexicography: Between Digital Humanities and Artificial Intelligence, (2018)
  • Jacob Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), (Association for Computational Linguistics, 2019), 4171-4186
  • Jose Mari Arriola et al., “Reusing the CG-2 Grammar for Processing Basque Complex Postpositions”, in Actas del XXIX Congreso de la Sociedad Española del Procesamiento del Lenguaje Natural (SEPLN 2013), (2013), 20-27
  • Josu Goikoetxea, Aitor Soroa, and Eneko Agirre, “Bilingual Embeddings with Random Walks over Multilingual Wordnets”, Knowledge-Based Systems, 150, (2018): 218-230, https://doi.org/10.1016/j.knosys.2018.03.017
  • Justyna Olko and Julia Sallabank, eds., Revitalizing Endangered Languages: A Practical Guide (Cambridge: Cambridge University Press, 2021), https://doi.org/10.1017/9781108641142
  • Kepa Sarasola et al., D1.4 Report on the Basque Language (ELE Consortium, 2022)
  • Kepa Bengoetxea and Itziar Gonzalez-Dios, “MultiAzterTest: a Multilingual Analyzer on Multiple Levels of Language for Readability Assessment”, arXiv preprint arXiv:2109.04870, (2021)
  • Kepa Bengoetxea, “Estaldura zabaleko euskararako analizatzaile sintaktiko estatistikoa”, (Doctoral dissertation, University of the Basque Country (UPV/EHU), 2014)
  • María Jesús Aranzabe, “Dependentzia-ereduan oinarritutako baliabide sintaktikoak: zuhaitz-bankua eta gramatika konputazionala” (Doctoral dissertation, University of the Basque Country (UPV/EHU), 2008). https://dialnet.unirioja.es/servlet/tesis?codigo=177705
  • María Jesús Aranzabe, Arantza Díaz de Ilarraza, and Itziar Gonzalez-Dios, “Transforming Complex Sentences using Dependency Trees for Automatic Text Simplification in Basque”, Procesamiento del Lenguaje Natural, 50 (2013): 61-68
  • Mikel Artetxe et al., “Does Corpus Quality Really Matter for Low-Resource Languages?”, arXiv preprint arXiv:2203.08111, (2022)
  • George A. Miller, “WordNet: A Lexical Database for English”, Communications of the ACM, 38, no. 11 (1995): 39-41
  • Olatz Ansa et al., “Ihardetsi: a Basque Question Answering System at QA@CLEF 2008”, in Workshop of the Cross-Language Evaluation Forum for European Languages, (Berlin, Heidelberg: Springer, 2008), 369-376
  • Piotr Bojanowski et al., “Enriching Word Vectors with Subword Information”, Transactions of the Association for Computational Linguistics 5 (2017): 135-146
  • Plan de Impulso de las Tecnologías del Lenguaje, Ministerio de Turismo, Energia y Agenda Digital, 2015
  • Rodrigo Agerri et al., “Give your Text Representation Models some Love: the Case for Basque”, in Proceedings of the 12th Language Resources and Evaluation Conference, (European Language Resources Association, 2020), 4781-4788.
  • Rodrigo Agerri, Josu Bermudez, and German Rigau, “IXA pipeline: Efficient and Ready to Use Multilingual NLP tools”, in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), (European Language Resources Association, 2014), 3823-3828
  • Vera Ferreira and Peter Bouda, Language Documentation and Conservation in Europe (whole volume) (Honolulu: University of Hawai‘i Press, 2016)
  • Xabier Gomez Guinovart et al., “Multilingual Central Repository: a Cross-lingual Framework for Developing Wordnets”, arXiv preprint arXiv:2107.00333, (2021)
  • Zuhaitz Beloki et al., “Grammatical Error Correction for Basque through a Seq2seq Neural Architecture and Synthetic Examples”, Procesamiento del Lenguaje Natural, 65 (2020): 13-20
  • Christiane Fellbaum, ed. WordNet: An Electronic Lexical Database (Cambridge, MA: MIT Press, 1998).
  • Xabier Arregi et al., “TZOS: An On-line System for Terminology Service”, in Actualizaciones en Comunicación Social. (Centro de Lingüística Aplicada, 2013), 400-404