Berri Corpus Manager: A Corpus Analysis Tool Using MongoDB Technology

Hugo Sanjurjo-González

Department of Information Technology, Electronics & Communications, University of Deusto, Spain

Book:

Human Language Technologies – The Baltic Perspective: Proceedings of the Ninth International Conference Baltic HLT 2020

Andrius Utka (ed.)
Jurgita Vaičenonienė (ed.)
Jolanta Kovalevskaitė (ed.)
Danguolė Kalinauskaitė (ed.)

Publisher: IOS Press

ISBN: 978-1-64368-116-0, 978-1-64368-117-7

Year of publication: 2024

Pages: 166-173

Type: Book chapter

Abstract

Many tools are now available for corpus linguistic analysis, and they take different approaches to corpus storage: some are based on SQL databases, some are dedicated implementations such as CQP/CWB, and others work with plain-text corpora. NoSQL databases have been widely used for big data, data mining and even sentiment analysis. However, to our knowledge, there is no widespread concordancer or consolidated framework that uses the MongoDB architecture for the purposes of corpus linguistics. This paper describes the architecture of a software tool that allows users to analyse monolingual and bilingual parallel corpora with grammatical annotation using MongoDB technology. Our premise is that MongoDB is well suited to unstructured data and provides high flexibility and scalability, so it may also be useful for corpus linguistic research. We analyse MongoDB functionalities such as text search indexes and the query format in order to examine its suitability.
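
For illustration only, since the abstract does not give the tool's actual schema, the sketch below shows how one sentence of a grammatically annotated parallel corpus might be stored as a MongoDB document and how query filters could be phrased. All field names (`lang`, `aligned_id`, `text`, `tokens`, `word`, `lemma`, `pos`) and the three helper functions are hypothetical; `$text`, `$search` and `$elemMatch` are standard MongoDB query operators.

```python
# Hypothetical sketch: field names and helpers are assumptions,
# not the schema used by Berri Corpus Manager.


def make_sentence_doc(sent_id, lang, tokens, aligned_id=None):
    """Build a document for one annotated sentence.

    `tokens` is a list of (word, lemma, pos) triples.
    """
    return {
        "_id": sent_id,
        "lang": lang,
        "aligned_id": aligned_id,  # id of the aligned sentence in the other language
        "text": " ".join(word for word, _, _ in tokens),
        "tokens": [{"word": w, "lemma": l, "pos": p} for w, l, p in tokens],
    }


def text_search_filter(phrase, lang=None):
    """Filter for a $text query; needs a text index on the collection."""
    query = {"$text": {"$search": phrase}}
    if lang is not None:
        query["lang"] = lang
    return query


def annotation_filter(lemma, pos):
    """Filter for sentences with a single token that has this lemma and POS tag."""
    return {"tokens": {"$elemMatch": {"lemma": lemma, "pos": pos}}}


doc = make_sentence_doc(
    "en-1",
    "en",
    [("The", "the", "DET"), ("cats", "cat", "NOUN"), ("sleep", "sleep", "VERB")],
    aligned_id="es-1",
)

# With a running MongoDB server and the pymongo driver, and `coll` being a
# pymongo collection, these could be used roughly as follows:
#   coll.create_index([("text", "text")])   # text search index on `text`
#   coll.insert_one(doc)
#   coll.find(text_search_filter("cats", lang="en"))
#   coll.find(annotation_filter("cat", "NOUN"))
```

Keeping the raw sentence in `text` lets a text index serve word searches, while the `tokens` array supports queries on the annotation layer; this split is one possible design, not necessarily the one the paper adopts.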

Data source: Dialnet