Cálculo de distancia lingüística para textos históricos en euskera (Computing linguistic distance for historical texts in Basque)

  1. Padilla, Manuel
  2. Soraluze, Ander
  3. Estarrona Ibarloza, Ainara
  4. Etxeberria Uztarroz, Izaskun
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2023

Issue: 70

Pages: 53-61

Type: Article


Measuring the distance between languages, dialects and language varieties, both synchronically and diachronically, is a topic of growing interest in NLP. Drawing on our Syntactically Annotated Historical COrpus in BAsque (SAHCOBA) and on the perplexity-based language distance proposed by Gamallo, Pichel and Alegria (2017, 2020), we compared historical corpora with present-day texts in the standard variety and calculated the language distances between them. Because standard Basque is based on the central dialects, our starting hypothesis was that the oldest texts and the dialects at the extremes would be the most distant. The results largely confirm the view of traditional dialectology: peripheral dialects are strongly idiosyncratic and more distant from the rest.
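To make the idea of a perplexity-based distance concrete, the following is a minimal sketch and not the authors' implementation. It assumes a character trigram model with add-one smoothing and averages the perplexity measured in both directions; the function names, the n-gram order and the smoothing scheme are all illustrative assumptions, and the cited work by Gamallo, Pichel and Alegria describes its own configuration.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return one character n-gram per character of text, left-padded with spaces."""
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


def train(text, n):
    """Count n-grams and their (n-1)-character contexts (hypothetical helper)."""
    grams = char_ngrams(text, n)
    ngram_counts = Counter(grams)
    context_counts = Counter(g[:-1] for g in grams)
    return ngram_counts, context_counts, set(text)


def perplexity(model, text, n):
    """Perplexity of text under an add-one-smoothed character n-gram model."""
    ngram_counts, context_counts, vocab = model
    v = len(vocab) + 1  # assumption: one extra slot reserved for unseen characters
    grams = char_ngrams(text, n)
    log_prob = 0.0
    for g in grams:
        p = (ngram_counts[g] + 1) / (context_counts[g[:-1]] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(grams))


def distance(text_a, text_b, n=3):
    """Mean of the two cross-perplexities (A model on B, B model on A)."""
    pp_ab = perplexity(train(text_a, n), text_b, n)
    pp_ba = perplexity(train(text_b, n), text_a, n)
    return (pp_ab + pp_ba) / 2
```

Under these assumptions, a lower value means the two samples are more predictable from one another's character sequences, which is the sense in which an older or peripheral variety would count as more distant from the standard.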

Bibliographic References

  • Aduriz, I., M. J. Aranzabe, J. M. Arriola, A. Atutxa, A. D. de Ilarraza, N. Ezeiza, K. Gojenola, M. Oronoz, A. Soroa, and R. Urizar. 2006. Methodology and steps towards the construction of EPEC, a corpus of written Basque tagged at morphological and syntactic levels for automatic processing. In Corpus linguistics around the world. Brill, pages 1–15
  • Asgari, E. and M. R. K. Mofrad. 2016. Comparing fifty natural languages and twelve genetic languages using word embedding language divergence (WELD) as a quantitative measure of language distance. In Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP, pages 65–74, San Diego, California
  • Aurrekoetxea, G. 1992. Nafarroako euskara: azterketa dialektometrikoa. Uztaro, 5:59–109
  • Aurrekoetxea, G., I. Gaminde, J. L. Ormaetxea, and C. Videgain. 2019. Euskalkien sailkapen berria. UPV/EHU, Bilbao
  • Aurrekoetxea, G. and C. Videgain. 2009. Le projet Bourciez: traitement geolinguistique d’un corpus dialectal de 1895. Dialectologia, 2:81–111
  • Barrault, L., O. Bojar, M. R. Costa-Jussa, C. Federmann, M. Fishel, and Y. Graham. 2019. Findings of the 2019 conference on machine translation (WMT19). Association for Computational Linguistics (ACL)
  • Camino, I. 2008. Dialektologiaren alderdi kronologikoaz. Fontes Linguae Vasconum (FLV), 108:209–247
  • Chavula, C. and H. Suleman. 2020. Intercomprehension in retrieval: User perspectives on six related scarce resource languages. In Proceedings of the 2020 conference on human information interaction and retrieval, pages 263–272
  • Claridge, C. 2009. Historical corpora. In Corpus linguistics. An International Handbook, pages 242–259, Berlin, Germany
  • de Rijk, R. 1969. Is Basque an SOV language? Fontes Linguae Vasconum (FLV), 1:319–351
  • Degaetano-Ortlieb, S. and E. Teich. 2018. Using relative entropy for detection and analysis of periods of diachronic linguistic change. In Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 22–33
  • Estarrona, A., I. Etxeberria, R. Etxepare, M. Padilla-Moyano, and A. Soraluze. 2021. The first annotated corpus of historical Basque. Digital Scholarship in the Humanities, 37(2):391–404
  • Gamallo, P., I. Alegria, J. R. Pichel, and M. Agirrezabal. 2016. Comparing two basic methods for discriminating between similar languages and varieties. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 170–177
  • Gamallo, P., J. R. Pichel, and I. Alegria. 2017a. From language identification to language distance. Physica A: Statistical Mechanics and its Applications, 484:152–162
  • Gamallo, P., J. R. Pichel, and I. Alegria. 2017b. A perplexity-based method for similar languages discrimination. VarDial 2017, page 109
  • Gamallo, P., J. R. Pichel, and I. Alegria. 2020. Measuring language distance of isolated European Languages. Information, 11(4):181
  • Gao, Y., W. Liang, Y. Shi, and Q. Huang. 2014. Comparison of directed and weighted co-occurrence networks of six languages. Physica A: Statistical Mechanics and its Applications, 393(C):579–589
  • Gorrochategui, J., I. Igartua, and J. A. Lakarra. 2018. Historia de la lengua vasca
  • Kondrak, G. 2005. N-gram similarity and distance. In International symposium on string processing and information retrieval, pages 115–126. Springer
  • Lacombe, G. 1924. La langue basque. In Les langues du monde, pages 255–270, Paris
  • Laka, I. 1996. A brief grammar of Euskara, the Basque language. UPV/EHU, Bilbao
  • Lakarra, J. A. 1997. Euskararen historia eta filologia: arazo zahar, bide berri. ASJU, 31(2):447–535
  • Liu, H. and J. Cong. 2013. Language clustering with word co-occurrence networks based on parallel texts. Chinese Science Bulletin, 58(10):1139–1144
  • Michelena, L. 1964. Textos Arcaicos Vascos
  • Mitxelena, K. 1981. Lengua común y dialectos vascos. International Journal of Basque Linguistics and Philology, 15:289–313
  • Pagola, R. M. 2006. Lazarragaren eskuizkribua: grafiak, hotsak eta hitzak. In Lingüística Vasco-Románica. I Jornadas = Euskal-Erromantze Linguistika. I. Jardunaldiak, pages 539–561, Donostia
  • Pichel, J. R., P. Gamallo, and I. Alegria. 2018. Measuring language distance among historical varieties using perplexity. Application to European Portuguese. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pages 145–155
  • Pichel, J. R., P. Gamallo, and I. Alegria. 2020. Measuring diachronic language distance using perplexity: Application to English, Portuguese, and Spanish. Natural Language Engineering, 26(4):433–454
  • Sant’Anna, A. A. and L. Weller. 2020. The threat of communism during the cold war: A constraint to income inequality? Comparative Politics, 52(3):359–393
  • Sarasola, I. 1983. Contribución al estudio y edición de textos antiguos vascos. ASJU, pages 69–212
  • Satrustegi, J. M. 1987. Euskal Testu Zaharrak
  • Scherrer, Y., T. Samardzic, and E. Glaser. 2019. Digitising Swiss German: how to process and study a polycentric spoken language. Language Resources and Evaluation, 53(4):735–769
  • Singh, A. K. and H. Surana. 2007. Can corpus based measures be used for comparative study of languages? In Proceedings of ninth meeting of the ACL special interest group in computational morphology and phonology, pages 40–47. Association for Computational Linguistics
  • Seguy, J. 1973. La dialectometrie dans l’Atlas linguistique de la Gascogne. Revue de Linguistique Romane (RLiR), 37:1–24
  • Ulibarri, K. 2013. Testuak kokatuz dialektologia historikoan: egiteetatik metodologiara. In Koldo Mitxelena Katedraren III Biltzarra / III Congreso de la Catedra Luis Michelena / 3rd Conference of the Luis Michelena Chair, pages 511–532, Vitoria-Gasteiz
  • Zuazo, K. 2014. Euskalkiak. Elkar
  • Zugarini, A., M. Tiezzi, and M. Maggini. 2020. Vulgaris: Analysis of a corpus for middle-age varieties of Italian language. arXiv preprint arXiv:2010.05993