Talukdar 2006 a context pattern induction method for named entity extraction

Citation

A Context Pattern Induction Method for Named Entity Extraction, by P. P. Talukdar, T. Brants, M. Liberman, and F. Pereira. In the Tenth Conference on Computational Natural Language Learning (CoNLL-X), 2006.

Online Version

Here is the online version of the paper.

Summary

This paper presents a novel context pattern induction method for Named Entity Extraction, with which the authors extended several classes of seed entity lists into much larger high-precision lists. The authors explored the utility of partial entity lists and massive amounts of unlabeled data for Named Entity Extraction. Three hypotheses are tested in this paper:

  • Starting with a few seed entities, it is possible to induce high-precision context patterns by exploiting entity context redundancy.
  • New entity instances of the same category can be extracted from unlabeled data with the induced patterns to create high-precision extensions of the seed lists.
  • Features derived from token membership in the extended lists improve the accuracy of learned named-entity taggers.

The main advance in the present method is the combination of grammatical induction and statistical techniques to create high-precision patterns.

Brief description of the method

The overall method for inducing entity context patterns and extending entity lists is as follows:

  1. Let E = the seed entity set and D = the unlabeled text corpus.
  2. Find the contexts C of entities in E in the corpus D.
  3. Select trigger words from C.
  4. For each trigger word, induce a pattern automaton.
  5. Use the induced patterns P to extract more entities E'.
  6. Rank P and E'.
  7. If needed, add high-scoring entities in E' to E and return to step 2. Otherwise, terminate with the patterns P and the extended entity list E ∪ E' as results.

Steps 2 through 6 in the above method are sub-methods, which are explained below.

Extracting Context

First, the occurrences of seed entities are found in the unlabeled data. For each such occurrence, a fixed number of tokens immediately preceding and immediately following the matched entity is extracted, and all entity tokens are replaced by the single token -ENT-. This token now represents a slot in which an entity can occur. The set of extracted contexts is denoted by C.
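
A minimal Python sketch of this step (the function name, the tokenized-corpus representation, and the window size are illustrative assumptions, not the paper's exact setup):

  def extract_contexts(entities, corpus_tokens, window=3):
      """Collect fixed-width contexts around each seed occurrence,
      replacing the matched entity tokens with the slot token -ENT-."""
      contexts = []
      for ent in entities:
          ent_toks = ent.split()
          n = len(ent_toks)
          for i in range(len(corpus_tokens) - n + 1):
              if corpus_tokens[i:i + n] == ent_toks:
                  left = corpus_tokens[max(0, i - window):i]
                  right = corpus_tokens[i + n:i + n + window]
                  contexts.append(left + ["-ENT-"] + right)
      return contexts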

Trigger Word Selection

Some tokens are more specific to particular entity classes than others. Whenever one comes across such a token in text, the probability of finding an entity (of the corresponding entity class) in its vicinity is high. Such starting tokens are called trigger words. The authors used IDF (Inverse Document Frequency) as a term-weighting method to rank candidate trigger words from entity contexts. For each context segment c ∈ C, a dominating word w_d(c) is given by

  w_d(c) = argmax_{w ∈ c} IDF(w)

where IDF(w) is the IDF weight for the word w. There is exactly one dominating word for each context c. All dominating words for contexts in C form the multiset W. Let m(w) be the multiplicity of the dominating word w in W. W is sorted by decreasing m(w), and the top n tokens from this list are selected as potential trigger words. Selection criteria based on dominating word frequency work better than criteria based on simple term weight, because high term weight words may be rare in the extracted contexts.
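
A sketch of trigger word selection, assuming the standard IDF definition IDF(w) = log(N / df(w)); the paper's exact weighting and the value of top_n may differ:

  import math
  from collections import Counter

  def idf_weights(documents):
      """IDF(w) = log(N / df(w)) over a collection of token lists."""
      N = len(documents)
      df = Counter()
      for doc in documents:
          df.update(set(doc))
      return {w: math.log(N / df[w]) for w in df}

  def select_triggers(contexts, idf, top_n=500):
      """Count each context's dominating (highest-IDF) word and keep
      the top_n most frequent dominating words as trigger words."""
      dominating = Counter()
      for c in contexts:
          words = [w for w in c if w != "-ENT-" and w in idf]
          if words:
              dominating[max(words, key=lambda w: idf[w])] += 1
      return [w for w, _ in dominating.most_common(top_n)]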

Automata Induction

For each trigger word, the contexts starting with the word are listed. The predictive context can lie to the left or right of the slot -ENT- and a single token is retained on the left or right to mark the slot's left or right boundary, respectively. Similar contexts are prepared for each trigger word. The context set for each trigger word is then summarized by a pattern automaton with transitions that match the trigger word and also the wildcard -ENT-.

Context segments are short and typically do not involve recursive structures. Hence, a 1-reversible automaton A was chosen to represent each set of contexts. Each transition in A corresponds to a bigram (w1, w2) in the contexts used to create A. Each transition is assigned the probability

  P(w1 → w2) = C(w1, w2) / Σ_w C(w1, w)

where C(w1, w2) is the number of occurrences of the bigram (w1, w2) in the contexts for A.
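
A simplified sketch of the transition probability estimation; as a shortcut it identifies automaton states with tokens instead of performing full 1-reversible state merging:

  from collections import Counter

  def induce_transition_probs(contexts):
      """Estimate P(w1 -> w2) = C(w1, w2) / sum_w C(w1, w) from the
      bigrams of a trigger word's context segments."""
      bigram, starts = Counter(), Counter()
      for c in contexts:
          for w1, w2 in zip(c, c[1:]):
              bigram[(w1, w2)] += 1
              starts[w1] += 1
      return {(w1, w2): n / starts[w1] for (w1, w2), n in bigram.items()}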

The initially induced automata need to be pruned to remove transitions with weak evidence, so as to increase match precision. Only transitions that are used in relatively many probable paths through the automaton are kept. The probability of a path p = (s_0 → s_1 → … → s_n) is

  P(p) = ∏_{i=1}^{n} P(s_{i-1} → s_i)

Then the posterior probability of an edge (u → v) is

  P(u → v) = Σ_{p ∋ (u → v)} P(p) / Σ_p P(p)

which can be computed by the forward-backward algorithm. Transitions whose posterior probability is lower than a threshold λ can now be removed, where λ controls the degree of pruning, with higher λ forcing more pruning. All induced and pruned automata are trimmed to remove unreachable states.
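
An illustrative sketch of edge posteriors and pruning. The paper computes the posteriors with the forward-backward algorithm; this sketch enumerates paths outright, which is workable only because induced context automata are small (max_len also guards against cycles). The threshold name lam and the start/end state handling are assumptions:

  def edge_posteriors(probs, start, end, max_len=8):
      """P(edge) = total probability of start-to-end paths using the
      edge, divided by the total probability of all start-to-end paths."""
      succ = {}
      for (u, v), p in probs.items():
          succ.setdefault(u, []).append((v, p))
      paths = []

      def walk(node, prob, edges):
          if node == end:
              paths.append((prob, list(edges)))
              return
          if len(edges) >= max_len:
              return
          for v, p in succ.get(node, []):
              edges.append((node, v))
              walk(v, prob * p, edges)
              edges.pop()

      walk(start, 1.0, [])
      total = sum(p for p, _ in paths)
      if total == 0:
          return {}
      post = {}
      for p, edges in paths:
          for e in set(edges):
              post[e] = post.get(e, 0.0) + p / total
      return post

  def prune_automaton(probs, start, end, lam=0.1):
      """Drop transitions whose posterior probability falls below lam."""
      post = edge_posteriors(probs, start, end)
      return {e: p for e, p in probs.items() if post.get(e, 0.0) >= lam}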

Automata as Extractor

Each automaton represents high-precision patterns that start with a given trigger word. Text segments are extracted by scanning the unlabeled data with these patterns. Each token in an extracted text segment is labeled either keep (K) or droppable (D). By default, a token is labeled K. A token is labeled D if it satisfies one of the droppable criteria: the token is present in a stopword list, it is non-capitalized, or it is a number. The longest token sequence matching the regular expression K[DK]*K is retained and considered a final extraction.
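
A sketch of this trimming step; the stopword list here is an illustrative stub, not the paper's:

  import re

  STOPWORDS = {"the", "of", "and", "in", "on", "for"}  # illustrative stub

  def trim_extraction(tokens):
      """Label tokens keep (K) or droppable (D), then retain the longest
      span matching K[DK]*K."""
      labels = "".join(
          "D" if (t.lower() in STOPWORDS or not t[:1].isupper() or t.isdigit())
          else "K"
          for t in tokens
      )
      m = re.search(r"K[DK]*K", labels)
      return tokens[m.start():m.end()] if m else []

  # e.g. trim_extraction(["of", "New", "York", "City", "council", "2004"])
  # -> ["New", "York", "City"]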

Ranking Patterns and Entities

Seed instances of one class are considered negative instances for the other classes. A pattern is penalized if it extracts entities which belong to the seed lists of the other classes. Let pos(p) and neg(p) be, respectively, the numbers of distinct positive and negative seeds extracted by pattern p. All patterns with positive neg(p), as well as patterns whose distinct positive seed extraction count pos(p) is less than a certain threshold, are discarded. The reason for such conservative scoring is that the authors are primarily interested in precision.

Let R be the set of patterns retained by the filtering scheme described above. Also, let B(e, p) be an indicator function which takes the value 1 when entity e is extracted by pattern p and 0 otherwise. The score of an entity e, S(e), is given by

  S(e) = Σ_{p ∈ R} B(e, p)

This whole process can be iterated by adding extracted entities whose score is greater than or equal to a certain threshold to the seed list.
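
A sketch of the filtering and scoring, where extractions maps each pattern to the set of entities it extracted and min_pos stands in for the paper's unspecified positive-seed threshold:

  def filter_patterns(extractions, pos_seeds, neg_seeds, min_pos=2):
      """Keep patterns that hit no other-class seeds (neg(p) = 0) and at
      least min_pos distinct positive seeds."""
      return [p for p, ents in extractions.items()
              if not (ents & neg_seeds) and len(ents & pos_seeds) >= min_pos]

  def score_entities(extractions, retained):
      """S(e) = number of retained patterns that extract entity e."""
      scores = {}
      for p in retained:
          for e in extractions.get(p, set()):
              scores[e] = scores.get(e, 0) + 1
      return sorted(scores.items(), key=lambda kv: -kv[1])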

Experimental Result

The authors used 18 billion tokens (31 million documents) of news data as the source of unlabeled data. They experimented with 500 and 1000 trigger words. The results presented were obtained after a single iteration of the context pattern induction algorithm. Subsets of the entity lists provided with the CoNLL-2003 shared task data were used as seed sets. Only multi-token entries were included in the seed lists of the respective categories (location (LOC), person (PER), and organization (ORG)). Seed list sizes and experimental results are shown in Table 1. The precision numbers in Table 1 were obtained by manually evaluating 100 randomly selected instances from each of the extended lists.

Table 1

In the next experiment, the authors used the automatically generated entity lists as additional features in a supervised tagger. They started with a Conditional Random Field (CRF) tagger with a competitive baseline. The baseline tagger was trained on the full CoNLL-2003 shared task data. The authors experimented with the LOC, ORG, and PER lists that were automatically generated in the previous experiment. Table 2 shows the accuracy of the tagger for the entity types for which they had induced lists.

Table 2

Table 3 shows the accuracy on the full CoNLL task (four entity types) without lists, with seed list only, and with the three induced lists.

Table 3

Incorporating token membership in the extended lists as additional features led to improvements across categories and at all sizes of training data. This also shows that the extended lists are of good quality, since the tagger is able to extract useful evidence from them. Relatively small training sets pose an interesting learning situation that is common in practical applications, and the list features lead to significant improvements in such cases. Also, as can be seen from Tables 2 and 3, these lists are effective even with mature taggers trained on large amounts of labeled data.

Related papers

The method in this paper is similar to some previous work on context pattern induction (Riloff and Jones, 1999; Agichtein and Gravano, 2000; Etzioni et al., 2005). Agichtein and Gravano (2000) focus on relation extraction, while the pattern learning methods of Riloff and Jones (1999) and the generic extraction patterns of Etzioni et al. (2005) use language-specific information.