Talukdar 2006 a context pattern induction method for named entity extraction
Citation

A context pattern induction method for named entity extraction, by P. P. Talukdar, T. Brants, M. Liberman, and F. Pereira. In the Tenth Conference on Computational Natural Language Learning (CoNLL-X), 2006.

Online Version

Here is the online version of the paper.

Summary

This paper presents a novel context pattern induction method for Named Entity Extraction, which the authors use to extend several classes of seed entity lists into much larger high-precision lists. The authors explore the utility of partial entity lists and large amounts of unlabeled data for Named Entity Extraction. Three hypotheses are tested in this paper:

  • Starting with a few seed entities, it is possible to induce high-precision context patterns by exploiting entity context redundancy.
  • New entity instances of the same category can be extracted from unlabeled data with the induced patterns to create high-precision extensions of the seed lists.
  • Features derived from token membership in the extended lists improve the accuracy of learned named-entity taggers.

The main advance in the present method is the combination of grammatical induction and statistical techniques to create high-precision patterns.

Brief description of the method

The overall method for inducing entity context patterns and extending entity lists is as follows:

  1. Let <math>E</math> = seed set, <math>T</math> = text corpus.
  2. Find the contexts <math>C</math> of entities in <math>E</math> in the corpus <math>T</math>
  3. Select trigger words from <math>C</math>
  4. For each trigger word, induce a pattern automaton
  5. Use induced patterns <math>P</math> to extract more entities <math>E'</math>
  6. Rank <math>P</math> and <math>E'</math>
  7. If needed, add high-scoring entities in <math>E'</math> to <math>E</math> and return to step 2. Otherwise, terminate with patterns <math>P</math> and the extended entity list <math>E \cup E'</math> as results.

Steps 2-6 of this method are sub-methods, which are explained below.
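
A minimal Python sketch of the outer loop is given below. The helper functions (find_contexts, select_trigger_words, induce_automaton, extract_entities, rank), the iteration cap, and the score threshold are hypothetical stand-ins for the sub-methods described in the following sections, not code from the paper.

<pre>
def induce_patterns_and_entities(seeds, corpus, max_iters=3, threshold=0.9):
    """Skeleton of the induction loop; the helpers are placeholders for
    the sub-methods described in the sections below."""
    entities = set(seeds)                                   # step 1
    patterns = []
    for _ in range(max_iters):
        contexts = find_contexts(entities, corpus)          # step 2
        triggers = select_trigger_words(contexts)           # step 3
        patterns = [induce_automaton(t, contexts)           # step 4
                    for t in triggers]
        extracted = extract_entities(patterns, corpus)      # step 5
        scored = rank(patterns, extracted)                  # step 6
        new = {e for e, s in scored if s >= threshold} - entities  # step 7
        if not new:
            break
        entities |= new
    return patterns, entities
</pre>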

Extracting Context

First, the occurrences of seed entities are found in the unlabeled data. For each such occurrence, a fixed number of tokens immediately preceding and immediately following the matched entity is extracted, and all entity tokens are replaced by the single token -ENT-. This token now represents a slot in which an entity can occur. The set of extracted contexts is denoted by <math>C</math>.
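
A minimal sketch of this step, assuming pre-tokenized text and single-token seed entities (the paper handles multi-token entities; the window size here is illustrative):

<pre>
def extract_contexts(tokens, seeds, window=3):
    """Collect fixed-width contexts around seed mentions, replacing the
    matched entity with the placeholder token -ENT-."""
    contexts = []
    for i, tok in enumerate(tokens):
        if tok in seeds:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            contexts.append(left + ["-ENT-"] + right)
    return contexts

# extract_contexts("flights to Boston on Friday".split(), {"Boston"}, window=2)
# -> [['flights', 'to', '-ENT-', 'on', 'Friday']]
</pre>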

Trigger Word Selection

Some tokens are more specific to particular entity classes than others. Whenever such a token is encountered in text, the probability of finding an entity (of the corresponding entity class) in its vicinity is high. Such starting tokens are called trigger words. The authors used IDF (Inverse Document Frequency) as a term-weighting method to rank candidate trigger words from entity contexts. For each context segment <math>c</math>, a dominating word <math>w_d(c)</math> is given by

<center><math>w_d(c) = \arg\max_{w \in c} f_w,</math></center>

where <math>f_w</math> is the IDF weight for the word <math>w</math>. There is exactly one dominating word for each context segment <math>c</math>. All dominating words for contexts in <math>C</math> form a multiset <math>W</math>. Let <math>m(w)</math> be the multiplicity of the dominating word <math>w</math> in <math>W</math>. <math>W</math> is sorted by decreasing <math>m(w)</math>, and the top <math>n</math> tokens from this list are selected as potential trigger words. Selection criteria based on dominating word frequency work better than criteria based on simple term weight, because high term weight words may be rare in the extracted contexts.
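
The selection can be sketched as follows; the IDF variant f_w = log(N / n_w) is an assumption (the paper only states that IDF weighting is used), and top_n is illustrative:

<pre>
import math
from collections import Counter

def idf_weights(documents):
    """f_w = log(N / n_w), with n_w = number of documents containing w."""
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    return {w: math.log(n / df) for w, df in doc_freq.items()}

def select_trigger_words(contexts, idf, top_n=10):
    """Count each segment's dominating (highest-IDF) word and keep the
    top_n most frequent dominating words as trigger candidates."""
    dominating = Counter()
    for c in contexts:
        words = [w for w in c if w != "-ENT-"]
        if words:
            dominating[max(words, key=lambda w: idf.get(w, 0.0))] += 1
    return [w for w, _ in dominating.most_common(top_n)]
</pre>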

Automata Induction

For each trigger word, the contexts starting with that word are listed. The predictive context can lie to the left or the right of the slot -ENT-, and a single token is retained on the left or right to mark the slot's left or right boundary, respectively. The context set prepared in this way for each trigger word is then summarized by a pattern automaton with transitions that match the trigger word and also the wildcard -ENT-.

Context segments are short and typically do not involve recursive structures. Hence, a 1-reversible automaton <math>A</math> was chosen to represent each set of contexts. Each transition <math>(v,w)</math> in the 1-reversible automaton <math>A</math> corresponds to a bigram <math>vw</math> in the contexts used to create <math>A</math>. Each transition is assigned the probability

<center><math>P(w|v) = \frac{C(v,w)}{\sum_{w'} C(v,w')},</math></center>

where <math>C(v,w)</math> is the number of occurrences of the bigram <math>vw</math> in contexts for <math>A</math>.
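
A sketch of this transition probability estimate; identifying automaton states directly with tokens is a simplification of the 1-reversible construction:

<pre>
from collections import Counter, defaultdict

def induce_transition_probs(context_segments):
    """Estimate P(w|v) = C(v,w) / sum over w' of C(v,w') from bigram
    counts in the trigger word's context segments."""
    counts = defaultdict(Counter)
    for seg in context_segments:
        for v, w in zip(seg, seg[1:]):
            counts[v][w] += 1
    return {v: {w: c / sum(succ.values()) for w, c in succ.items()}
            for v, succ in counts.items()}
</pre>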

The initially induced automata need to be pruned to remove transitions with weak evidence so as to increase match precision. Only those transitions that are used in relatively many probable paths through the automaton are kept. The probability of a path <math>p</math> is <math>P(p) = \prod_{(v,w) \in p} P(w|v)</math>. Then the posterior probability of edge <math>(v,w)</math> is

<center><math>P(v,w) = \frac{\sum_{(v,w) \in p} P(p)}{\sum_{p} P(p)},</math></center>

which can be computed by the forward-backward algorithm. Transitions leaving state <math>v</math> whose posterior probability is lower than <math>p_v = k(\max_w P(v,w))</math> can then be removed, where <math>0 < k \leq 1</math> controls the degree of pruning, with higher <math>k</math> forcing more pruning. All induced and pruned automata are trimmed to remove unreachable states.
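
The posterior and the pruning rule can be sketched as below. Instead of forward-backward, this toy version enumerates all accepting paths, which yields the same posteriors but is feasible only for small acyclic automata; the automaton representation (state -> {successor: probability}) is illustrative:

<pre>
def edge_posteriors(start, finals, probs):
    """P(v,w) = (sum of P(p) over accepting paths p containing (v,w))
    divided by the total path mass; probs maps v -> {w: P(w|v)}."""
    total, mass = 0.0, {}

    def walk(state, p, edges):
        nonlocal total
        if state in finals:
            total += p
            for e in edges:
                mass[e] = mass.get(e, 0.0) + p
        for w, pr in probs.get(state, {}).items():
            walk(w, p * pr, edges + [(state, w)])

    walk(start, 1.0, [])
    return {e: m / total for e, m in mass.items()}

def prune(probs, posteriors, k=0.5):
    """Remove transitions leaving v whose posterior is below k * max_w P(v,w)."""
    pruned = {}
    for v, succ in probs.items():
        best = max(posteriors.get((v, w), 0.0) for w in succ)
        kept = {w: p for w, p in succ.items()
                if posteriors.get((v, w), 0.0) >= k * best}
        if kept:
            pruned[v] = kept
    return pruned
</pre>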

Automata as Extractor

Each automaton represents high-precision patterns that start with a given trigger word. Text segments are extracted by scanning the unlabeled data with these patterns. Each token in an extracted text segment is labeled either ''keep'' (K) or ''droppable'' (D). By default, a token is labeled K. A token is labeled D if it satisfies one of the droppable criteria: it is present in a stopword list, it is not capitalized, or it is a number. The longest token sequence matching the regular expression K[DK]*K is retained and is considered a final extraction.
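
A sketch of this trimming step; the stopword list is illustrative:

<pre>
import re

STOPWORDS = {"the", "of", "and", "in", "a"}   # illustrative stopword list

def label(token):
    """K = keep; D = droppable (stopword, non-capitalized, or a number)."""
    if token.lower() in STOPWORDS or not token[:1].isupper() or token.isdigit():
        return "D"
    return "K"

def trim_extraction(tokens):
    """Keep the longest span whose label string matches K[DK]*K; since the
    labels contain only K and D, a greedy match runs from the first K to
    the last K, trimming droppable tokens at either end."""
    labels = "".join(label(t) for t in tokens)
    m = re.search(r"K[DK]*K", labels)
    return tokens[m.start():m.end()] if m else []

# trim_extraction(["the", "University", "of", "Pennsylvania", "campus"])
# labels "DKDKD" -> match "KDK" -> ['University', 'of', 'Pennsylvania']
</pre>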

Ranking Patterns and Entities

Experimental Results

Related papers