Talukdar et al CoNLL 2006

From Cohen Courses
Revision as of 08:29, 1 December 2010 by PastStudents (talk | contribs)

Citation

Talukdar, P. P., Brants, T., Liberman, M. and Pereira, F. "A Context Pattern Induction Method for Named Entity Extraction." Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), 2006.

Online Version

[1]

Summary

This paper extends previous pattern induction methods and uses the induced patterns to find new instances of interest, which in turn assist named entity recognition. This is a form of semi-supervised learning in which unlabeled data is used to derive new features. The method is language independent: it relies on word and transition frequencies rather than chunking or parsing information.

The method starts with seed instances and collects the contexts that frequently surround them. Rather than using the contexts directly, it identifies trigger words: words that are rare in the corpus overall, as measured by IDF, yet frequent in the seed contexts. These dominating words are later used to define patterns. Using IDF alone, without accounting for how often a word occurs in relevant contexts, would lead to lower precision.
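
The trigger word selection can be illustrated with a minimal sketch. It assumes the score of a word is its context frequency multiplied by its corpus IDF; that exact formula, the function name rank_trigger_words, and the toy data are assumptions for illustration and are not taken from the paper.

```python
import math
from collections import Counter

def rank_trigger_words(contexts, corpus_docs):
    """Rank words by (number of seed contexts containing them) x (corpus IDF).

    Hypothetical scoring for illustration; the paper's exact weighting may differ.
    """
    n_docs = len(corpus_docs)
    # Document frequency of every word in the whole corpus.
    doc_freq = Counter()
    for doc in corpus_docs:
        doc_freq.update(set(doc))
    # How many seed contexts each word appears in.
    ctx_freq = Counter()
    for ctx in contexts:
        ctx_freq.update(set(ctx))
    scores = {}
    for word, cf in ctx_freq.items():
        df = max(1, doc_freq[word])  # guard against words unseen in the corpus
        scores[word] = cf * math.log(n_docs / df)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy corpus and toy contexts collected around seed entities.
corpus = [
    ["the", "ceo", "of", "acme", "said"],
    ["the", "weather", "is", "nice"],
    ["the", "president", "of", "france"],
    ["the", "ceo", "of", "globex", "resigned"],
]
contexts = [
    ["the", "ceo", "of"],
    ["the", "ceo", "of"],
    ["said", "the", "ceo"],
]
ranked = rank_trigger_words(contexts, corpus)
```

In this toy data "said" has the highest IDF because it occurs in only one document, but "ceo" ranks first because it is also frequent in the seed contexts. The word "the" scores zero since it appears in every document.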

The dominating words mark the start of phrases surrounding the entity of interest. To generalize beyond the observed phrases, the method induces finite state automata from them. The automata are then pruned by removing transitions that few paths pass through, rather than transitions that merely have a low local weight.
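
As a rough sketch of path-based pruning, the code below builds a simple prefix-tree automaton from token phrases, counts how many phrases traverse each transition, and drops transitions below a threshold. The prefix-tree construction, the names build_automaton and prune, and the threshold are simplifying assumptions; the automata induced in the paper are more general.

```python
from collections import defaultdict

def build_automaton(phrases):
    """Build a prefix-tree automaton.

    States are token prefixes (tuples); each edge (state, token, next_state)
    maps to the number of phrases whose path uses that edge.
    """
    counts = defaultdict(int)
    for phrase in phrases:
        state = ()
        for tok in phrase:
            nxt = state + (tok,)
            counts[(state, tok, nxt)] += 1
            state = nxt
    return dict(counts)

def prune(edges, min_paths):
    """Keep only transitions used by at least min_paths phrases.

    In a prefix tree, counts never increase along a path, so every kept
    edge remains reachable from the start state.
    """
    return {edge: c for edge, c in edges.items() if c >= min_paths}

phrases = [
    ("ceo", "of"),
    ("ceo", "of"),
    ("ceo", "at"),
    ("chairman", "of"),
    ("ceo", "of"),
]
edges = build_automaton(phrases)
pruned = prune(edges, min_paths=2)
```

Here only the path "ceo of" survives, because the transitions for "at" and "chairman" are each used by a single phrase.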

The patterns produced by the automata are used to find new entity instances, which populate lists. During this process, the patterns are filtered further to favor precision at the cost of recall. High-quality entities extracted by high-quality patterns are added to the seed lists, and the procedure starts over.
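
The loop described above can be sketched as follows. Here a pattern is reduced to a (left word, right word) pair around a single-token entity, and a pattern is kept only if enough of its matches are already known entities. Both simplifications, the precision threshold, and all function names are assumptions for illustration, not the paper's filtering criteria.

```python
def induce_patterns(known, sentences):
    """Collect (left, right) word pairs that surround known entities."""
    patterns = set()
    for sent in sentences:
        for i in range(1, len(sent) - 1):
            if sent[i] in known:
                patterns.add((sent[i - 1], sent[i + 1]))
    return patterns

def extract(pattern, sentences):
    """Return every token that appears between the pattern's two words."""
    left, right = pattern
    found = set()
    for sent in sentences:
        for i in range(1, len(sent) - 1):
            if sent[i - 1] == left and sent[i + 1] == right:
                found.add(sent[i])
    return found

def bootstrap(seeds, sentences, min_precision=0.5, rounds=3):
    """Grow the entity list, trusting only patterns that mostly match known entities."""
    known = set(seeds)
    for _ in range(rounds):
        new = set()
        for pat in induce_patterns(known, sentences):
            # Never empty: the pattern matches at least the entity it came from.
            matches = extract(pat, sentences)
            precision = len(matches & known) / len(matches)
            if precision >= min_precision:
                new |= matches
        if new <= known:  # nothing new was found, so stop early
            break
        known |= new
    return known

sentences = [
    ["flew", "to", "paris", "on", "monday"],
    ["flew", "to", "london", "on", "friday"],
    ["flew", "to", "tokyo", "on", "sunday"],
    ["in", "paris", "and", "rome"],
    ["in", "summer", "and", "winter"],
    ["in", "red", "and", "blue"],
]
result = bootstrap({"paris", "london"}, sentences)
```

The pattern ("to", "on") matches mostly known cities and contributes "tokyo", while ("in", "and") matches too many unknown tokens and is filtered out, trading recall for precision.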

The induced lists were used as features to improve the performance of CRF-based entity taggers. The authors showed that inducing lists from extra unlabeled data improved the generalization performance of the taggers. When lists were taken only from the training data, there was a strong tendency to overfit.
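
One common way to expose entity lists to a sequence tagger is as per-token membership features. The sketch below produces plain feature dictionaries and does not call any CRF library; the feature names, the max_len limit, and the example lists are hypothetical and are not the paper's feature set.

```python
def list_features(tokens, lists, max_len=3):
    """Return one feature dict per token, flagging tokens covered by a list entry.

    lists maps a list name to a set of lowercased, space-joined entries.
    """
    feats = [{"word": tok.lower()} for tok in tokens]
    for name, entries in lists.items():
        for start in range(len(tokens)):
            for length in range(1, max_len + 1):
                end = start + length
                if end > len(tokens):
                    break
                span = " ".join(tokens[start:end]).lower()
                if span in entries:
                    # Mark every token inside the matched span.
                    for i in range(start, end):
                        feats[i]["in_list_" + name] = True
    return feats

tokens = ["Mark", "Liberman", "visited", "New", "York"]
lists = {
    "person": {"mark liberman"},
    "location": {"new york"},
}
feats = list_features(tokens, lists)
```

Each dictionary in feats could then be passed to a tagger alongside its usual lexical features; tokens outside any list entry, such as "visited", carry only the word feature.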

Related Papers

Riloff and Jones, NCAI 1999 and Etzioni, AIJ 2005 use pattern induction with noun phrases, which makes their approaches more language dependent than this method.

Agichtein and Gravano, ICDL 2000 also induce patterns, but apply them to relation extraction.

Wang and Cohen, ICDM 2007 introduce a set expansion method that is also language independent; it relies on lists in the pages it extracts from.