Resolving named entity problems: from recognition and discrimination to semantic class learning

  1. Petrova Kozareva, Zornitsa
Supervised by:
  1. Manuel Palomar Sanz Director
  2. Andrés Montoyo Guijarro Director

Defence university: Universitat d'Alacant / Universidad de Alicante

Defence date: 18 February 2009

Committee:
  1. María Felisa Verdejo Maíllo Chair
  2. Rafael Muñoz Guillena Secretary
  3. Eduard Hovy Committee member
  4. Bernardo Magnini Committee member
  5. Germán Rigau Claramunt Committee member

Type: Thesis

Teseo: 192561 DIALNET

Abstract

Contributions

1. Named Entity Recognition

Among the first NER systems are the rule-based ones, which use a set of hand-crafted rules and grammars to identify and classify the NEs in a text. These systems perform robustly, but they are tedious to create and maintain. Their performance depends on the knowledge of their human creator, and the rule set often fails to capture all possible cases and representations of NEs in texts, so the rules frequently fail to fire. To overcome these problems, researchers turned to the development of data-driven NER systems. Most such approaches study the type of machine-learning classifier suitable for the task or the set of features that can be encoded. Over the years, data-driven systems have shown that they can reach performance as high and robust as that of rule-based systems. However, the biggest bottleneck for such systems is the availability of labeled training data from which they can learn when ported to a new domain or language. Creating labeled data requires experts to perform the annotation, and the labeling process is time-consuming. In this thesis, we propose a data-driven NER system which can function with labeled training data but is also capable of starting the learning process from a few annotated seeds, which during learning convert large quantities of unlabeled data into labeled data. For this purpose we use semi-supervised machine learning techniques such as self-training and co-training. Whereas most NER systems use manually created gazetteer lists, we propose an automatic pattern validation and graph exploration algorithm which harvests unstructured and unlabeled texts to generate person and location gazetteer entries. Our final contribution in NER is the development of a feature set which is easy to generate and adapt to a new language.
The feature set does not depend on any language-specific resources and uses morphological, contextual, orthographic, gazetteer and trigger word information. The main objectives and contributions are supported by a comparative study and evaluation on Spanish and Italian NE data sets.

2. Named Entity Discrimination

A named entity discrimination system aims at finding the number of underlying entities given a name. Current systems group the snippets of ambiguous names or map the content of the documents to the disambiguated pages in Wikipedia. Our contribution in this NE subtask is the development of a system which groups snippets into clusters and also assigns category labels to each cluster. The category labels represent the sense or the topic of the cluster. For instance, "Jerry Hobbs" the serial killer is mapped to the category label "CRIME", while "Jerry Hobbs" the professor is mapped to "SCIENCE". The system also generates descriptive and discriminative labels. The descriptive labels are words which are typical of a given sense but can also be shared by other clusters. For instance, the two senses of "Jerry Hobbs", the taekwondo teacher and the computational linguistics professor, share words like lecture, class, pupils, students, homework and exercise. The discriminative labels, in contrast, are words which are specific to a given sense and are not shared by other individuals. For instance, though the two "Jerry Hobbs" are related to science, they live at specific addresses, are married to different wives and have different children. These characteristics provide a more thorough representation and explanation of the formed clusters. Our approach is evaluated according to different criteria, such as the number of conflated names, the size of the examples that have to be disambiguated, and a disparity factor which estimates the difficulty of disambiguating names which share similar properties, among other criteria.
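The distinction between descriptive and discriminative labels described above can be illustrated with a minimal sketch. It assumes the snippets have already been clustered and tokenized; the function name `cluster_labels`, the frequency-based ranking and the toy snippets are assumptions made for illustration and are not the labeling method of the thesis.

```python
from collections import Counter


def cluster_labels(clusters, top_k=2):
    """Return descriptive and discriminative labels for each cluster.

    clusters: dict mapping a cluster name to a list of tokenized snippets.
    Hypothetical helper: words are ranked by raw frequency only.
    """
    # Word frequencies per cluster.
    counts = {
        name: Counter(tok for snippet in snippets for tok in snippet)
        for name, snippets in clusters.items()
    }
    labels = {}
    for name, cnt in counts.items():
        # Vocabulary of all the other clusters.
        other_vocab = set()
        for other, other_cnt in counts.items():
            if other != name:
                other_vocab.update(other_cnt)
        # Descriptive: frequent in this cluster, possibly shared with others.
        descriptive = [w for w, _ in cnt.most_common(top_k)]
        # Discriminative: frequent in this cluster and absent from the others.
        discriminative = [
            w for w, _ in cnt.most_common() if w not in other_vocab
        ][:top_k]
        labels[name] = {
            "descriptive": descriptive,
            "discriminative": discriminative,
        }
    return labels


# Toy snippets for two senses of the same name (invented data).
clusters = {
    "professor": [
        ["lecture", "students", "linguistics"],
        ["lecture", "students", "discourse"],
        ["lecture"],
    ],
    "teacher": [
        ["lecture", "students", "taekwondo"],
        ["lecture", "belt"],
    ],
}
labels = cluster_labels(clusters)
```

In this toy data, "lecture" and "students" occur in both clusters and so can only be descriptive, while words such as "linguistics" or "taekwondo" occur in a single cluster and qualify as discriminative.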
Since our approach is based on the similarity of context, it was adapted to different languages. We have conducted experiments in Spanish, English, Romanian and Bulgarian. We have evaluated the approach on different named entity categories such as rivers, organizations, capitals and racers, among others. We have carried out a comparative study with other disambiguation approaches, and the obtained results show that the proposed disambiguation approach achieves high performance.

3. Semantic Class Learning

According to Artiles et al. (2005), 30% of web queries are related to named entities. To improve the performance of search engines, fine-grained NER is necessary. The creation of such systems is impeded by a number of factors, such as the creation of annotated data, the type of categories that have to be learned, and the handling of dynamic information, as NEs change categories over time. Rather than learning such classifiers, researchers have focused on the development of automatic, web-based knowledge extractors and relation harvesters. For instance, Pasca (2004) and Etzioni et al. (2005) proposed instance-based harvesting algorithms; later, Pasca (2007b) and Pasca & Durme (2007) continued the work towards attribute generation for each NE class, while Etzioni et al. continued the work towards web-based machine reading (Etzioni, 2008) and open information extraction (Banko et al., 2007). Our contribution in this NE subtask is the development of a doubly-anchored pattern (DAP) of the type "superordinate such as subordinate1 and subordinate2" which, when instantiated with two of its three elements, is able to learn new instances, categories and the relations among them. To steer the learning process, the algorithm is embedded in a bootstrapping procedure which learns these three kinds of information in alternating cycles. To guide bootstrapping, we have proposed the usage of hyponym pattern linkage graphs.
On these graphs, different graph algorithms rank the extracted information and separate relevant from irrelevant concepts. We have shown that the DAP can be used not only as a knowledge harvester but also as a validation mechanism which positions a category as a super- or subordinate concept with respect to another category. This concept positioning test selects the category which should be incorporated in the bootstrapping process and thus expands the search space. To validate the correctness of the harvested categories, we introduced a criterion reflecting how people separate and organize the concepts in the domain "people". This is also our first approximation to taxonomizing the acquired information. The proposed approach is evaluated on open and closed semantic classes. We have presented a comparative study with other knowledge miners such as Etzioni et al. (2005) and Pasca (2007a).
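The doubly-anchored pattern and the linkage graph described above can be sketched on a toy scale. The sketch assumes a small in-memory list of sentences instead of web search results, a single capitalized word as the open slot, and in-degree as the ranking score; the names `dap_regex`, `harvest` and `rank_by_in_degree`, the corpus and the choice of in-degree are illustrative assumptions, not the implementation evaluated in the thesis.

```python
import re
from collections import defaultdict, deque


def dap_regex(class_name, seed):
    # "<class> such as <seed> and <X>": the class and the seed are the two
    # anchors, X is the open slot (assumed to be one capitalized word).
    return re.compile(
        rf"\b{re.escape(class_name)} such as {re.escape(seed)} and ([A-Z][a-z]+)"
    )


def harvest(corpus, class_name, seed, max_rounds=3):
    """Bootstrap instances of class_name starting from a single seed.

    Returns a linkage graph: instance used as anchor -> instances it found.
    """
    graph = defaultdict(set)
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth >= max_rounds:
            continue
        pattern = dap_regex(class_name, current)
        for sentence in corpus:
            for found in pattern.findall(sentence):
                graph[current].add(found)
                # Each newly learned instance becomes an anchor later on.
                if found not in seen:
                    seen.add(found)
                    queue.append((found, depth + 1))
    return graph


def rank_by_in_degree(graph):
    # Instances discovered from many different anchors are ranked first.
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    return sorted(in_degree.items(), key=lambda kv: (-kv[1], kv[0]))


# Invented sentences; the last one yields a wrong instance on purpose.
corpus = [
    "He visited countries such as Spain and Italy last year.",
    "Many countries such as Italy and France produce wine.",
    "Tourists love countries such as Spain and France.",
    "Exports grew in countries such as France and Italy.",
    "Some say countries such as Spain and Europe differ.",
]
graph = harvest(corpus, "countries", "Spain")
ranking = rank_by_in_degree(graph)
```

In this toy corpus the noisy candidate "Europe" is reached from only one anchor, whereas "Italy" and "France" are each reached from two, so a degree-based ranking places the noisy candidate last.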