Un detector de la unidad central de un texto basado en técnicas de aprendizaje automático en textos científicos para el euskera [A machine-learning-based detector of the central unit of a text for Basque scientific texts]

Atutxa Salazar, Aitziber; Iruskieta Quintian, Mikel; Bengoetxea Kortazar, Kepa

Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2017

Issue: 58

Pages: 37-44

Type: Article

Abstract

This paper presents an automatic detector of the discourse central unit (CU) in scientific abstracts, based on machine learning techniques. After a text has been segmented into its elementary discourse units, detecting the central unit is a crucial step toward robustly building discourse trees under Rhetorical Structure Theory (RST). CU detection may also be useful in automatic summarization, question answering, and sentiment analysis tasks. Results show that CU detection using machine learning techniques on Basque scientific abstracts outperforms rule-based techniques, even on a small corpus covering different domains. This leads us to think that there is still room for improvement.

Data source: Dialnet