Talukdar 2006 a context pattern induction method for named entity extraction
Citation

A context pattern induction method for named entity extraction, by P. P. Talukdar, T. Brants, M. Liberman, and F. Pereira. In the Tenth Conference on Computational Natural Language Learning (CoNLL-X), 2006.

Online Version

Here is the online version of the paper.

Summary

This paper presents a novel context pattern induction method for Named Entity Extraction, which the authors use to extend several classes of seed entity lists into much larger high-precision lists. The authors explore the utility of partial entity lists and large amounts of unlabeled data for Named Entity Extraction. Three hypotheses are tested in this paper:

  • Starting with a few seed entities, it is possible to induce high-precision context patterns by exploiting entity context redundancy.
  • New entity instances of the same category can be extracted from unlabeled data with the induced patterns to create high-precision extensions of the seed lists.
  • Features derived from token membership in the extended lists improve the accuracy of learned named-entity taggers.

The main advance in the present method is the combination of grammatical induction and statistical techniques to create high-precision patterns.

Brief description of the method

The overall method for inducing entity context patterns and extending entity lists is as follows:

  1. Let <math>E</math> = seed set, <math>T</math> = text corpus.
  2. Find the contexts <math>C</math> of entities in <math>E</math> in the corpus <math>T</math>
  3. Select trigger words from <math>C</math>
  4. For each trigger word, induce a pattern automaton
  5. Use induced patterns <math>P</math> to extract more entities <math>E'</math>
  6. Rank <math>P</math> and <math>E'</math>
  7. If needed, add high-scoring entities in <math>E'</math> to <math>E</math> and return to step 2. Otherwise, terminate with patterns <math>P</math> and the extended entity list <math>E \cup E'</math> as results.

Steps 2-6 of this method are sub-methods, which are explained below.
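
A minimal Python sketch of the outer loop is given below. The helper functions (find_contexts, select_trigger_words, induce_automaton, extract_entities, rank), the iteration cap, and the score threshold are hypothetical stand-ins for the sub-methods described in the following sections, not code from the paper.

<pre>
def induce_patterns_and_entities(seeds, corpus, max_iters=3, threshold=0.9):
    """Skeleton of the induction loop; the helpers are placeholders for
    the sub-methods described in the sections below."""
    entities = set(seeds)                                   # step 1
    patterns = []
    for _ in range(max_iters):
        contexts = find_contexts(entities, corpus)          # step 2
        triggers = select_trigger_words(contexts)           # step 3
        patterns = [induce_automaton(t, contexts)           # step 4
                    for t in triggers]
        extracted = extract_entities(patterns, corpus)      # step 5
        scored = rank(patterns, extracted)                  # step 6
        new = {e for e, s in scored if s >= threshold} - entities  # step 7
        if not new:
            break
        entities |= new
    return patterns, entities
</pre>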

Extracting Context

First, the occurrences of seed entities are found in the unlabeled data. For each such occurrence, a fixed number of tokens immediately preceding and immediately following the matched entity is extracted, and all entity tokens are replaced by the single token -ENT-. This token now represents a slot in which an entity can occur. The set of extracted contexts is denoted by <math>C</math>.
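
A minimal sketch of this step, assuming pre-tokenized text and single-token seed entities (the paper handles multi-token entities; the window size here is illustrative):

<pre>
def extract_contexts(tokens, seeds, window=3):
    """Collect fixed-width contexts around seed mentions, replacing the
    matched entity with the placeholder token -ENT-."""
    contexts = []
    for i, tok in enumerate(tokens):
        if tok in seeds:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            contexts.append(left + ["-ENT-"] + right)
    return contexts

# extract_contexts("flights to Boston on Friday".split(), {"Boston"}, window=2)
# -> [['flights', 'to', '-ENT-', 'on', 'Friday']]
</pre>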

Trigger Word Selection

Some tokens are more specific to particular entity classes than others. Whenever such a token is encountered in text, the probability of finding an entity (of the corresponding entity class) in its vicinity is high. Such starting tokens are called trigger words. The authors used IDF (Inverse Document Frequency) as a term-weighting method to rank candidate trigger words from entity contexts. For each context segment <math>c</math>, a dominating word <math>w_d(c)</math> is given by

<center><math>w_d(c) = \arg\max_{w \in c} f_w,</math></center>

where <math>f_w</math> is the IDF weight for the word <math>w</math>. There is exactly one dominating word for each context segment <math>c</math>. All dominating words for contexts in <math>C</math> form a multiset <math>W</math>. Let <math>m(w)</math> be the multiplicity of the dominating word <math>w</math> in <math>W</math>. <math>W</math> is sorted by decreasing <math>m(w)</math>, and the top <math>n</math> tokens from this list are selected as potential trigger words. Selection criteria based on dominating word frequency work better than criteria based on simple term weight, because high term weight words may be rare in the extracted contexts.
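
The selection can be sketched as follows; the IDF variant f_w = log(N / n_w) is an assumption (the paper only states that IDF weighting is used), and top_n is illustrative:

<pre>
import math
from collections import Counter

def idf_weights(documents):
    """f_w = log(N / n_w), with n_w = number of documents containing w."""
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    return {w: math.log(n / df) for w, df in doc_freq.items()}

def select_trigger_words(contexts, idf, top_n=10):
    """Count each segment's dominating (highest-IDF) word and keep the
    top_n most frequent dominating words as trigger candidates."""
    dominating = Counter()
    for c in contexts:
        words = [w for w in c if w != "-ENT-"]
        if words:
            dominating[max(words, key=lambda w: idf.get(w, 0.0))] += 1
    return [w for w, _ in dominating.most_common(top_n)]
</pre>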

Automata Induction

For each trigger word, the contexts starting with that word are listed. The predictive context can lie to the left or the right of the slot -ENT-, and a single token is retained on the left or right to mark the slot's left or right boundary, respectively. The context set prepared in this way for each trigger word is then summarized by a pattern automaton with transitions that match the trigger word and also the wildcard -ENT-.

Context segments are short and typically do not involve recursive structures. Hence, a 1-reversible automaton <math>A</math> was chosen to represent each set of contexts. Each transition <math>(v,w)</math> in the 1-reversible automaton <math>A</math> corresponds to a bigram <math>vw</math> in the contexts used to create <math>A</math>. Each transition is assigned the probability

<center><math>P(w|v) = \frac{C(v,w)}{\sum_{w'} C(v,w')},</math></center>

where <math>C(v,w)</math> is the number of occurrences of the bigram <math>vw</math> in contexts for <math>A</math>.
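
A sketch of this transition probability estimate; identifying automaton states directly with tokens is a simplification of the 1-reversible construction:

<pre>
from collections import Counter, defaultdict

def induce_transition_probs(context_segments):
    """Estimate P(w|v) = C(v,w) / sum over w' of C(v,w') from bigram
    counts in the trigger word's context segments."""
    counts = defaultdict(Counter)
    for seg in context_segments:
        for v, w in zip(seg, seg[1:]):
            counts[v][w] += 1
    return {v: {w: c / sum(succ.values()) for w, c in succ.items()}
            for v, succ in counts.items()}
</pre>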

The initially induced automata need to be pruned to remove transitions with weak evidence so as to increase match precision. Only those transitions that are used in relatively many probable paths through the automaton are kept. The probability of a path <math>p</math> is <math>P(p) = \prod_{(v,w) \in p} P(w|v)</math>. Then the posterior probability of edge <math>(v,w)</math> is

<center><math>P(v,w) = \frac{\sum_{(v,w) \in p} P(p)}{\sum_{p} P(p)},</math></center>

which can be computed by the forward-backward algorithm. Transitions leaving state <math>v</math> whose posterior probability is lower than <math>p_v = k(\max_w P(v,w))</math> can then be removed, where <math>0 < k \leq 1</math> controls the degree of pruning, with higher <math>k</math> forcing more pruning. All induced and pruned automata are trimmed to remove unreachable states.
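
The posterior and the pruning rule can be sketched as below. Instead of forward-backward, this toy version enumerates all accepting paths, which yields the same posteriors but is feasible only for small acyclic automata; the automaton representation (state -> {successor: probability}) is illustrative:

<pre>
def edge_posteriors(start, finals, probs):
    """P(v,w) = (sum of P(p) over accepting paths p containing (v,w))
    divided by the total path mass; probs maps v -> {w: P(w|v)}."""
    total, mass = 0.0, {}

    def walk(state, p, edges):
        nonlocal total
        if state in finals:
            total += p
            for e in edges:
                mass[e] = mass.get(e, 0.0) + p
        for w, pr in probs.get(state, {}).items():
            walk(w, p * pr, edges + [(state, w)])

    walk(start, 1.0, [])
    return {e: m / total for e, m in mass.items()}

def prune(probs, posteriors, k=0.5):
    """Remove transitions leaving v whose posterior is below k * max_w P(v,w)."""
    pruned = {}
    for v, succ in probs.items():
        best = max(posteriors.get((v, w), 0.0) for w in succ)
        kept = {w: p for w, p in succ.items()
                if posteriors.get((v, w), 0.0) >= k * best}
        if kept:
            pruned[v] = kept
    return pruned
</pre>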

Automata as Extractor

Each automaton represents high-precision patterns that start with a given trigger word. Text segments are extracted by scanning the unlabeled data with these patterns. Each token in an extracted text segment is labeled either ''keep'' (K) or ''droppable'' (D). By default, a token is labeled K. A token is labeled D if it satisfies one of the droppable criteria: it is present in a stopword list, it is not capitalized, or it is a number. The longest token sequence matching the regular expression K[DK]*K is retained and is considered a final extraction.
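
A sketch of this trimming step; the stopword list is illustrative:

<pre>
import re

STOPWORDS = {"the", "of", "and", "in", "a"}   # illustrative stopword list

def label(token):
    """K = keep; D = droppable (stopword, non-capitalized, or a number)."""
    if token.lower() in STOPWORDS or not token[:1].isupper() or token.isdigit():
        return "D"
    return "K"

def trim_extraction(tokens):
    """Keep the longest span whose label string matches K[DK]*K; since the
    labels contain only K and D, a greedy match runs from the first K to
    the last K, trimming droppable tokens at either end."""
    labels = "".join(label(t) for t in tokens)
    m = re.search(r"K[DK]*K", labels)
    return tokens[m.start():m.end()] if m else []

# trim_extraction(["the", "University", "of", "Pennsylvania", "campus"])
# labels "DKDKD" -> match "KDK" -> ['University', 'of', 'Pennsylvania']
</pre>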

Ranking Patterns and Entities

Experimental Results

Related papers