Talukdar 2006 a context pattern induction method for named entity extraction

Citation

A Context Pattern Induction Method for Named Entity Extraction, by P. P. Talukdar, T. Brants, M. Liberman, and F. Pereira. In the Tenth Conference on Computational Natural Language Learning (CoNLL-X), 2006.

Online Version

Here is the online version of the paper.

Summary

This paper presents a novel context pattern induction method for Named Entity Extraction, with which the authors extended several classes of seed entity lists into much larger high-precision lists. The authors explored the utility of partial entity lists and massive amounts of unlabeled data for Named Entity Extraction. Three hypotheses are tested in this paper:

  • Starting with a few seed entities, it is possible to induce high-precision context patterns by exploiting entity context redundancy.
  • New entity instances of the same category can be extracted from unlabeled data with the induced patterns to create high-precision extensions of the seed lists.
  • Features derived from token membership in the extended lists improve the accuracy of learned named-entity taggers.

The main advance in the present method is the combination of grammatical induction and statistical techniques to create high-precision patterns.

Brief description of the method

The overall method for inducing entity context patterns and extending entity lists is as follows:

  1. Let E = the seed entity set and D = the unlabeled text corpus.
  2. Find the contexts C of entities in E in the corpus D.
  3. Select trigger words from C.
  4. For each trigger word, induce a pattern automaton.
  5. Use the induced patterns P to extract more entities E'.
  6. Rank P and E'.
  7. If needed, add high-scoring entities in E' to E and return to step 2. Otherwise, terminate with the patterns P and the extended entity list E ∪ E' as results.

Steps 2 through 6 in the above method are sub-methods, which are explained below.

Extracting Context

First, the occurrences of seed entities are found in the unlabeled data. For each such occurrence, a fixed number of tokens immediately preceding and immediately following the matched entity is extracted, and all entity tokens are replaced by the single token -ENT-. This token now represents a slot in which an entity can occur. The set of extracted contexts is denoted by C.
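
A minimal Python sketch of this step (the function name, the tokenized-corpus representation, and the window size are illustrative assumptions, not the paper's exact setup):

  def extract_contexts(entities, corpus_tokens, window=3):
      """Collect fixed-width contexts around each seed occurrence,
      replacing the matched entity tokens with the slot token -ENT-."""
      contexts = []
      for ent in entities:
          ent_toks = ent.split()
          n = len(ent_toks)
          for i in range(len(corpus_tokens) - n + 1):
              if corpus_tokens[i:i + n] == ent_toks:
                  left = corpus_tokens[max(0, i - window):i]
                  right = corpus_tokens[i + n:i + n + window]
                  contexts.append(left + ["-ENT-"] + right)
      return contexts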

Trigger Word Selection

Some tokens are more specific to particular entity classes than others. Whenever one comes across such a token in text, the probability of finding an entity (of the corresponding entity class) in its vicinity is high. Such starting tokens are called trigger words. The authors used IDF (Inverse Document Frequency) as a term-weighting method to rank candidate trigger words from entity contexts. For each context segment c ∈ C, a dominating word w_d(c) is given by

  w_d(c) = argmax_{w ∈ c} IDF(w)

where IDF(w) is the IDF weight for the word w. There is exactly one dominating word for each context c. All dominating words for contexts in C form the multiset W. Let m(w) be the multiplicity of the dominating word w in W. W is sorted by decreasing m(w), and the top n tokens from this list are selected as potential trigger words. Selection criteria based on dominating word frequency work better than criteria based on simple term weight, because high term weight words may be rare in the extracted contexts.
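
A sketch of trigger word selection, assuming the standard IDF definition IDF(w) = log(N / df(w)); the paper's exact weighting and the value of top_n may differ:

  import math
  from collections import Counter

  def idf_weights(documents):
      """IDF(w) = log(N / df(w)) over a collection of token lists."""
      N = len(documents)
      df = Counter()
      for doc in documents:
          df.update(set(doc))
      return {w: math.log(N / df[w]) for w in df}

  def select_triggers(contexts, idf, top_n=500):
      """Count each context's dominating (highest-IDF) word and keep
      the top_n most frequent dominating words as trigger words."""
      dominating = Counter()
      for c in contexts:
          words = [w for w in c if w != "-ENT-" and w in idf]
          if words:
              dominating[max(words, key=lambda w: idf[w])] += 1
      return [w for w, _ in dominating.most_common(top_n)]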

Automata Induction

For each trigger word, the contexts starting with the word are listed. The predictive context can lie to the left or right of the slot -ENT- and a single token is retained on the left or right to mark the slot's left or right boundary, respectively. Similar contexts are prepared for each trigger word. The context set for each trigger word is then summarized by a pattern automaton with transitions that match the trigger word and also the wildcard -ENT-.

Context segments are short and typically do not involve recursive structures. Hence, a 1-reversible automaton A was chosen to represent each set of contexts. Each transition in A corresponds to a bigram (w1, w2) in the contexts used to create A. Each transition is assigned the probability

  P(w1 → w2) = C(w1, w2) / Σ_w C(w1, w)

where C(w1, w2) is the number of occurrences of the bigram (w1, w2) in the contexts for A.
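
A simplified sketch of the transition probability estimation; as a shortcut it identifies automaton states with tokens instead of performing full 1-reversible state merging:

  from collections import Counter

  def induce_transition_probs(contexts):
      """Estimate P(w1 -> w2) = C(w1, w2) / sum_w C(w1, w) from the
      bigrams of a trigger word's context segments."""
      bigram, starts = Counter(), Counter()
      for c in contexts:
          for w1, w2 in zip(c, c[1:]):
              bigram[(w1, w2)] += 1
              starts[w1] += 1
      return {(w1, w2): n / starts[w1] for (w1, w2), n in bigram.items()}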

The initially induced automata need to be pruned to remove transitions with weak evidence, so as to increase match precision. Only transitions that are used in relatively many probable paths through the automaton are kept. The probability of a path p = (s_0 → s_1 → … → s_n) is

  P(p) = ∏_{i=1}^{n} P(s_{i-1} → s_i)

Then the posterior probability of an edge (u → v) is

  P(u → v) = Σ_{p ∋ (u → v)} P(p) / Σ_p P(p)

which can be computed by the forward-backward algorithm. Transitions whose posterior probability is lower than a threshold λ can now be removed, where λ controls the degree of pruning, with higher λ forcing more pruning. All induced and pruned automata are trimmed to remove unreachable states.
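
An illustrative sketch of edge posteriors and pruning. The paper computes the posteriors with the forward-backward algorithm; this sketch enumerates paths outright, which is workable only because induced context automata are small (max_len also guards against cycles). The threshold name lam and the start/end state handling are assumptions:

  def edge_posteriors(probs, start, end, max_len=8):
      """P(edge) = total probability of start-to-end paths using the
      edge, divided by the total probability of all start-to-end paths."""
      succ = {}
      for (u, v), p in probs.items():
          succ.setdefault(u, []).append((v, p))
      paths = []

      def walk(node, prob, edges):
          if node == end:
              paths.append((prob, list(edges)))
              return
          if len(edges) >= max_len:
              return
          for v, p in succ.get(node, []):
              edges.append((node, v))
              walk(v, prob * p, edges)
              edges.pop()

      walk(start, 1.0, [])
      total = sum(p for p, _ in paths)
      if total == 0:
          return {}
      post = {}
      for p, edges in paths:
          for e in set(edges):
              post[e] = post.get(e, 0.0) + p / total
      return post

  def prune_automaton(probs, start, end, lam=0.1):
      """Drop transitions whose posterior probability falls below lam."""
      post = edge_posteriors(probs, start, end)
      return {e: p for e, p in probs.items() if post.get(e, 0.0) >= lam}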

Automata as Extractor

Each automaton represents high-precision patterns that start with a given trigger word. Text segments are extracted by scanning the unlabeled data with these patterns. Each token in an extracted text segment is labeled either keep (K) or droppable (D). By default, a token is labeled K. A token is labeled D if it satisfies one of the droppable criteria: the token is present in a stopword list, it is non-capitalized, or it is a number. The longest token sequence matching the regular expression K[DK]*K is retained and considered a final extraction.
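
A sketch of this trimming step; the stopword list here is an illustrative stub, not the paper's:

  import re

  STOPWORDS = {"the", "of", "and", "in", "on", "for"}  # illustrative stub

  def trim_extraction(tokens):
      """Label tokens keep (K) or droppable (D), then retain the longest
      span matching K[DK]*K."""
      labels = "".join(
          "D" if (t.lower() in STOPWORDS or not t[:1].isupper() or t.isdigit())
          else "K"
          for t in tokens
      )
      m = re.search(r"K[DK]*K", labels)
      return tokens[m.start():m.end()] if m else []

  # e.g. trim_extraction(["of", "New", "York", "City", "council", "2004"])
  # -> ["New", "York", "City"]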

Ranking Patterns and Entities

Seed instances of one class are considered negative instances for the other classes. A pattern is penalized if it extracts entities which belong to the seed lists of the other classes. Let pos(p) and neg(p) be, respectively, the numbers of distinct positive and negative seeds extracted by pattern p. All patterns with positive neg(p), as well as patterns whose distinct positive seed extraction count pos(p) is less than a certain threshold, are discarded. The reason for such conservative scoring is that the authors are primarily interested in precision.

Let R be the set of patterns retained by the filtering scheme described above. Also, let B(e, p) be an indicator function which takes the value 1 when entity e is extracted by pattern p and 0 otherwise. The score of an entity e, S(e), is given by

  S(e) = Σ_{p ∈ R} B(e, p)

This whole process can be iterated by adding extracted entities whose score is greater than or equal to a certain threshold to the seed list.
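
A sketch of the filtering and scoring, where extractions maps each pattern to the set of entities it extracted and min_pos stands in for the paper's unspecified positive-seed threshold:

  def filter_patterns(extractions, pos_seeds, neg_seeds, min_pos=2):
      """Keep patterns that hit no other-class seeds (neg(p) = 0) and at
      least min_pos distinct positive seeds."""
      return [p for p, ents in extractions.items()
              if not (ents & neg_seeds) and len(ents & pos_seeds) >= min_pos]

  def score_entities(extractions, retained):
      """S(e) = number of retained patterns that extract entity e."""
      scores = {}
      for p in retained:
          for e in extractions.get(p, set()):
              scores[e] = scores.get(e, 0) + 1
      return sorted(scores.items(), key=lambda kv: -kv[1])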

Experimental Result

The authors used 18 billion tokens (31 million documents) of news data as the source of unlabeled data. They experimented with 500 and 1000 trigger words. The results presented were obtained after a single iteration of the context pattern induction algorithm. Subsets of the entity lists provided with the CoNLL-2003 shared task data were used as seed sets. Only multi-token entries were included in the seed lists of the respective categories (location (LOC), person (PER), and organization (ORG)). Seed list sizes and experimental results are shown in Table 1. The precision numbers in Table 1 were obtained by manually evaluating 100 randomly selected instances from each of the extended lists.

Table 1

In the next experiment, the authors used the automatically generated entity lists as additional features in a supervised tagger. They started with a Conditional Random Field (CRF) tagger with a competitive baseline. The baseline tagger was trained on the full CoNLL-2003 shared task data. The authors experimented with the LOC, ORG, and PER lists that were automatically generated in the previous experiment. Table 2 shows the accuracy of the tagger for the entity types for which they had induced lists.

Table 2

Table 3 shows the accuracy on the full CoNLL task (four entity types) without lists, with seed list only, and with the three induced lists.

Table 3

Incorporating token membership in the extended lists as additional features led to improvements across categories and at all sizes of training data. This also shows that the extended lists are of good quality, since the tagger is able to extract useful evidence from them. Relatively small training sets pose an interesting learning situation that is common in practical applications, and the list features lead to significant improvements in such cases. Also, as can be seen from Tables 2 and 3, these lists are effective even with mature taggers trained on large amounts of labeled data.

Related papers

The method in this paper is similar to some previous work on context pattern induction (Riloff and Jones, 1999; Agichtein and Gravano, 2000; Etzioni et al., 2005). Agichtein and Gravano (2000) focus on relation extraction, while the pattern learning methods of Riloff and Jones (1999) and the generic extraction patterns of Etzioni et al. (2005) use language-specific information.