Cálculo de distancia lingüística para textos históricos en euskera

Padilla, Manuel; Soraluze, Ander; Estarrona Ibarloza, Ainara; Etxeberria Uztarroz, Izaskun

Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2023

Issue: 70

Pages: 53-61

Type: Article

Abstract

Measuring the distance between languages, dialects and language varieties, both synchronically and diachronically, is a topic of growing interest in NLP. Building on our Syntactically Annotated Historical COrpus in BAsque (SAHCOBA) and on the perplexity-based language distance proposed by Gamallo, Pichel and Alegria (2017, 2020), we compared historical corpora with present-day texts in the standard variety and calculated the language distances between them. Because standard Basque is based on the central dialects, our starting hypothesis was that the oldest texts and the dialects at the extremes would be the most distant. The results largely confirm the thesis of traditional dialectology: peripheral dialects are strongly idiosyncratic and more distant from the rest.
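
To make the idea of a perplexity-based distance concrete, the following is a minimal sketch, not the authors' implementation. It assumes character trigrams, add-one smoothing, and a symmetric distance defined as the mean of the two cross-perplexities; the function names (`train`, `perplexity`, `distance`) and the toy sentences are hypothetical, and the configuration used in the cited work may differ.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return one character n-gram per character of text (left-padded with spaces)."""
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


def train(text, n=3):
    """Count n-grams and their (n-1)-character contexts in the training text."""
    grams = Counter(ngrams(text, n))
    contexts = Counter()
    for gram, count in grams.items():
        contexts[gram[:-1]] += count
    vocab = set(text)
    return grams, contexts, vocab, n


def perplexity(model, text):
    """Perplexity of text under an add-one-smoothed character n-gram model."""
    grams, contexts, vocab, n = model
    v = len(vocab) + 1  # one extra slot reserved for unseen characters (assumption)
    test = ngrams(text, n)
    log_sum = 0.0
    for gram in test:
        p = (grams[gram] + 1) / (contexts[gram[:-1]] + v)
        log_sum += math.log(p)
    return math.exp(-log_sum / len(test))


def distance(a, b, n=3):
    """Symmetric distance: mean of the perplexity of b given a and of a given b."""
    return (perplexity(train(a, n), b) + perplexity(train(b, n), a)) / 2


# Toy inputs only; real use would involve full corpora for each variety.
basque = "etxea eta mendia eta itsasoa"
english = "the house and the mountain and the sea"
print(distance(basque, english), distance(basque, basque))
```

Under these assumptions, a text scored against a model trained on a similar variety yields a lower perplexity, and therefore a smaller distance, than one scored against a dissimilar variety.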

Bibliographic References

Aduriz, I., M. J. Aranzabe, J. M. Arriola, A. Atutxa, A. D. de Ilarraza, N. Ezeiza, K. Gojenola, M. Oronoz, A. Soroa, and R. Urizar. 2006. Methodology and steps towards the construction of EPEC, a corpus of written Basque tagged at morphological and syntactic levels for automatic processing. In Corpus linguistics around the world. Brill, pages 1–15
Asgari, E. and M. R. K. Mofrad. 2016. Comparing fifty natural languages and twelve genetic languages using word embedding language divergence (WELD) as a quantitative measure of language distance. In Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP, pages 65–74, San Diego, California
Aurrekoetxea, G. 1992. Nafarroako euskara: azterketa dialektometrikoa. Uztaro, 5:59–109
Aurrekoetxea, G., I. Gaminde, J. L. Ormaetxea, and C. Videgain. 2019. Euskalkien sailkapen berria. UPV/EHU, Bilbao
Aurrekoetxea, G. and C. Videgain. 2009. Le projet Bourciez: traitement geolinguistique d’un corpus dialectal de 1895. Dialectologia, 2:81–111
Barrault, L., O. Bojar, M. R. Costa-Jussa, C. Federmann, M. Fishel, and Y. Graham. 2019. Findings of the 2019 conference on machine translation (WMT19). Association for Computational Linguistics (ACL)
Camino, I. 2008. Dialektologiaren alderdi kronologikoaz. Fontes Linguae Vasconum (FLV), 108:209–247
Chavula, C. and H. Suleman. 2020. Intercomprehension in retrieval: User perspectives on six related scarce resource languages. In Proceedings of the 2020 conference on human information interaction and retrieval, pages 263–272
Claridge, C. 2009. Historical corpora. In Corpus linguistics. An International Handbook, pages 242–259, Berlin, Germany
de Rijk, R. 1969. Is Basque an SOV language? Fontes Linguae Vasconum (FLV), 1:319–351
Degaetano-Ortlieb, S. and E. Teich. 2018. Using relative entropy for detection and analysis of periods of diachronic linguistic change. In Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 22–33
Estarrona, A., I. Etxeberria, R. Etxepare, M. Padilla-Moyano, and A. Soraluze. 2021. The first annotated corpus of historical Basque. Digital Scholarship in the Humanities, 37(2):391–404
Gamallo, P., I. Alegria, J. R. Pichel, and M. Agirrezabal. 2016. Comparing two basic methods for discriminating between similar languages and varieties. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 170–177
Gamallo, P., J. R. Pichel, and I. Alegria. 2017a. From language identification to language distance. Physica A: Statistical Mechanics and its Applications, 484:152–162
Gamallo, P., J. R. Pichel, and I. Alegria. 2017b. A perplexity-based method for similar languages discrimination. VarDial 2017, page 109
Gamallo, P., J. R. Pichel, and I. Alegria. 2020. Measuring language distance of isolated European Languages. Information, 11(4):181
Gao, Y., W. Liang, Y. Shi, and Q. Huang. 2014. Comparison of directed and weighted co-occurrence networks of six languages. Physica A: Statistical Mechanics and its Applications, 393(C):579–589
Gorrochategui, J., I. Igartua, and J. A. Lakarra. 2018. Historia de la lengua vasca
Kondrak, G. 2005. N-gram similarity and distance. In International symposium on string processing and information retrieval, pages 115–126. Springer
Lacombe, G. 1924. La langue basque. In Les langues du monde, pages 255–270, Paris
Laka, I. 1996. A brief grammar of Euskara, the Basque language. UPV/EHU, Bilbao
Lakarra, J. A. 1997. Euskararen historia eta filologia: arazo zahar, bide berri. ASJU, 31(2):447–535
Liu, H. and J. Cong. 2013. Language clustering with word co-occurrence networks based on parallel texts. Chinese Science Bulletin, 58(10):1139–1144
Michelena, L. 1964. Textos Arcaicos Vascos
Mitxelena, K. 1981. Lengua común y dialectos vascos. International Journal of Basque Linguistics and Philology, 15:289–313
Pagola, R. M. 2006. Lazarragaren eskuizkribua: grafiak, hotsak eta hitzak. In Lingüística Vasco-Románica. I Jornadas = Euskal-Erromantze Linguistika. I. Jardunaldiak, pages 539–561, Donostia
Pichel, J. R., P. Gamallo, and I. Alegria. 2018. Measuring language distance among historical varieties using perplexity. Application to European Portuguese. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pages 145–155
Pichel, J. R., P. Gamallo, and I. Alegria. 2020. Measuring diachronic language distance using perplexity: Application to English, Portuguese, and Spanish. Natural Language Engineering, 26(4):433–454
Sant’Anna, A. A. and L. Weller. 2020. The threat of communism during the cold war: A constraint to income inequality? Comparative Politics, 52(3):359–393
Sarasola, I. 1983. Contribución al estudio y edición de textos antiguos vascos. ASJU, pages 69–212
Satrustegi, J. M. 1987. Euskal Testu Zaharrak
Scherrer, Y., T. Samardzic, and E. Glaser. 2019. Digitising Swiss German: how to process and study a polycentric spoken language. Language Resources and Evaluation, 53(4):735–769
Singh, A. K. and H. Surana. 2007. Can corpus based measures be used for comparative study of languages? In Proceedings of ninth meeting of the ACL special interest group in computational morphology and phonology, pages 40–47. Association for Computational Linguistics
Seguy, J. 1973. La dialectometrie dans l’Atlas linguistique de la Gascogne. Revue de Linguistique Romane (RLiR), 37:1–24
Ulibarri, K. 2013. Testuak kokatuz dialektologia historikoan: egiteetatik metodologiara. In Koldo Mitxelena Katedraren III Biltzarra / III Congreso de la Cátedra Luis Michelena / 3rd Conference of the Luis Michelena Chair, pages 511–532, Vitoria-Gasteiz
Zuazo, K. 2014. Euskalkiak. Elkar
Zugarini, A., M. Tiezzi, and M. Maggini. 2020. Vulgaris: Analysis of a corpus for middle-age varieties of Italian language. arXiv preprint arXiv:2010.05993

Data source: Dialnet