Multilingual acquisition of large scale knowledge resources

Cuadros Oller, Montserrat

Supervised by:

Lluís Padró Cirera (supervisor)
Germán Rigau Claramunt (supervisor)

University of defense: Universitat Politècnica de Catalunya (UPC)

Date of defense: 22 November 2011

Committee:

Horacio Rodríguez Hontoria (chair)
Irene Castellón Masalles (secretary)
Arantza Díaz de Ilarraza Sánchez (member)
Roberto Navigli (member)
Piek Vossen (member)

Type: Doctoral thesis

Teseo: 113184 DIALNET

Abstract

Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that attempts to process human language automatically. NLP systems now seem to have reached an upper bound with existing resources and techniques. There is broad consensus in the research community that systems need to integrate larger amounts of semantic and world knowledge in order to improve the quality of current results. Nevertheless, building adequate semantic resources is very difficult and remains an open research problem. Many efforts have been devoted to building knowledge repositories in the past decades, producing a wide range of knowledge bases that offer different levels of granularity or address different aspects of knowledge representation. Among them, Princeton WordNet [Fellbaum98] (WN) is by far the most widely used semantic resource in the NLP area.

The main goal of the research presented in this thesis is to devise new methods and tools to automatically create new semantic relations between WordNet senses; that is, to accurately increase, by automatic means, the knowledge represented in WordNet. The proposed process uses the current content of WordNet as the minimal knowledge base required to start a cyclic acquisition approach. First, the process acquires from corpora the relevant terms associated with each WordNet sense. Second, the identification stage uses the knowledge present in WordNet to establish the appropriate sense of each of these terms, yielding large amounts of new semantic relations among WordNet senses.

In particular, our research focuses on devising new methods and tools for:

* Acquiring relevant words from general or domain corpora for a specific WordNet word sense.
* Identifying the implicit word senses of the acquired relevant words with respect to an existing knowledge base (in particular, WordNet).
* Empirically evaluating the quality of the resulting new semantic relations in a controlled multilingual evaluation framework.

Thus, our research goals cover the automatic acquisition, identification, integration, and evaluation of large amounts of semantic relations among WordNet senses captured from general or domain-specific corpora. In this way, the resulting knowledge net, or KnowNet (KN), should be an extensible, large, accurate and useful knowledge base, derived automatically from text collections. Furthermore, because the new semantic knowledge is represented at the semantic level, we expect that knowledge acquired from text in one language can also be useful in other languages.
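
The two-stage process described in the abstract (acquire relevant terms for a sense, then identify the sense of each acquired term) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy sense inventory `SENSES` stands in for WordNet, the tiny `CORPUS` stands in for real text collections, and the frequency ranking and gloss-overlap heuristic are simple placeholders rather than the methods developed in the thesis. The function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy sense inventory standing in for WordNet:
# sense id -> set of gloss words.
SENSES = {
    "bank#1": {"financial", "institution", "money", "deposit"},
    "bank#2": {"sloping", "land", "river", "water"},
    "money#1": {"currency", "financial", "exchange", "deposit"},
    "loan#1": {"money", "lent", "interest", "financial"},
    "interest#1": {"charge", "borrowing", "money", "loan"},
    "interest#2": {"curiosity", "attention", "hobby"},
}

# Hypothetical toy corpus: sentences assumed to be associated with a sense.
CORPUS = {
    "bank#1": [
        "the bank approved the loan with low interest",
        "money on deposit at the bank earns interest",
        "the bank lent money as a loan",
    ],
}

STOPWORDS = {"the", "a", "as", "at", "on", "with", "low"}


def acquire_terms(sense, top_n=3):
    """Stage 1: rank words co-occurring with the sense by raw frequency."""
    lemma = sense.split("#")[0]
    counts = Counter(
        word
        for sentence in CORPUS[sense]
        for word in sentence.split()
        if word not in STOPWORDS and word != lemma
    )
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


def identify_sense(term, context):
    """Stage 2: pick the candidate sense whose gloss overlaps the context most."""
    candidates = [s for s in SENSES if s.split("#")[0] == term]
    if not candidates:
        return None  # term is not in the toy inventory
    return max(candidates, key=lambda s: len(SENSES[s] & context))


def build_relations(sense, top_n=3):
    """Link the source sense to the identified sense of each acquired term."""
    terms = acquire_terms(sense, top_n)
    context = set(terms) | SENSES[sense]
    relations = []
    for term in terms:
        target = identify_sense(term, context)
        if target is not None:
            relations.append((sense, target))
    return relations


if __name__ == "__main__":
    for source, target in build_relations("bank#1"):
        print(source, "->", target)
```

In this toy setting the ambiguous term "interest" is resolved to the financial sense because its gloss shares words with the context built from the source sense and the other acquired terms. Each resulting pair corresponds to one new sense-to-sense relation of the kind the abstract describes.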