Riloff and Jones 1999

== Citation ==

Riloff, E. and Jones, R. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99). 1999.

== Online Version ==

[1]

== Summary ==
  
This [[Category::paper]] was one of the earliest uses of [[UsesMethod::bootstrapping]] to expand entity lists given only a few seed instances. It leverages unlabeled data to iteratively find patterns around the seeds, use those patterns to find new entities, and repeat to find more patterns and entities. This exploits redundancy in the unlabeled data, using the repeated presence of patterns to infer entities and vice versa.
  
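To make the moving parts concrete, here is a minimal sketch (in Python) of the data the loop operates on: each candidate extraction pattern is mapped to the set of noun phrases it pulls from the unlabeled corpus, and the lexicon starts as a handful of seeds. The names and example entries below are hypothetical illustrations, not the authors' implementation.

<pre>
from typing import Dict, Set

# Candidate extraction patterns -> noun phrases they extract from the unlabeled
# corpus (e.g., produced by running an AutoSlog-style pattern generator over
# every noun phrase in the texts). Example entries are illustrative only.
pattern_extractions: Dict[str, Set[str]] = {
    "traveled to <x>": {"japan", "the city", "the region"},
    "shot down <x>":   {"the plane", "a helicopter"},
    "offices in <x>":  {"japan", "bolivia", "downtown"},
}

# Seed instances for the semantic category being learned (here: locations).
seed_lexicon: Set[str] = {"japan", "bolivia", "city"}
</pre>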
 
Prior to this work, most entity extraction systems required extensive training data to learn reliable patterns, and lexicons were hand-constructed. This work constructs both patterns and lexicons, using very limited training data, by a mutual bootstrapping procedure over unlabeled data. Candidate patterns generated by a program called AutoSlog are scored by how many of the seed instances they extract; the top-ranked patterns are then used to extract new entities, which in turn cause other patterns to become highly ranked, and so on.
  
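A rough sketch of that ranking step, reusing the hypothetical pattern_extractions and seed_lexicon structures above. The score rewards patterns that extract many known instances and mostly known instances (the paper uses an RlogF-style score of this general form); this is a simplified stand-in, not the authors' code.

<pre>
from math import log2
from typing import Dict, List, Set

def score_pattern(extractions: Set[str], lexicon: Set[str]) -> float:
    """Score a candidate pattern by how well it matches the current lexicon.

    F = number of lexicon members among the pattern's extractions,
    R = F divided by the total number of extractions (its reliability).
    """
    if not extractions:
        return 0.0
    f = len(extractions & lexicon)
    if f == 0:
        return 0.0
    return (f / len(extractions)) * log2(f)

def rank_patterns(pattern_extractions: Dict[str, Set[str]],
                  lexicon: Set[str]) -> List[str]:
    """Return candidate patterns ordered from highest to lowest score."""
    return sorted(pattern_extractions,
                  key=lambda p: score_pattern(pattern_extractions[p], lexicon),
                  reverse=True)
</pre>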
 
When bootstrapping data like this, there is a risk that the original concept of the list becomes distorted as its membership drifts in the wrong direction. To maintain the quality of the expanding lists, two stages of filtering are used. On each iteration of the inner procedure, only the highest scoring pattern is used to infer new entities; however, every entity that pattern extracts is added to the list. The expanded list is then used to re-rank the patterns, and the cycle repeats.
 
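One possible rendering of that inner loop, again with the hypothetical structures above and with the scoring function passed in as a parameter (e.g., the score_pattern sketch). Only the single best unused pattern is trusted per iteration, but every noun phrase it extracts joins the temporary lexicon before the patterns are re-ranked.

<pre>
from typing import Callable, Dict, Set, Tuple

def mutual_bootstrap(pattern_extractions: Dict[str, Set[str]],
                     seeds: Set[str],
                     score: Callable[[Set[str], Set[str]], float],
                     iterations: int = 10) -> Tuple[Set[str], Set[str]]:
    """One run of the inner (mutual) bootstrapping procedure.

    Returns the expanded temporary lexicon and the patterns that were selected.
    """
    lexicon = set(seeds)
    selected: Set[str] = set()
    for _ in range(iterations):
        # Re-rank all candidate patterns against the current lexicon,
        # ignoring patterns already selected on earlier iterations.
        candidates = [p for p in pattern_extractions if p not in selected]
        if not candidates:
            break
        best = max(candidates,
                   key=lambda p: score(pattern_extractions[p], lexicon))
        selected.add(best)
        # Trust only this one pattern, but take *all* of its extractions.
        lexicon |= pattern_extractions[best]
    return lexicon, selected
</pre>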
 
To further improve the quality of the sets, there is an outer, meta-bootstrapping procedure. The expanded lists from the inner bootstrap are filtered to only keep the five best instances, as measured by the number of different patterns that extract those instances. These five are added to the seed set, and the entire process starts anew.
 
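Sketched with the same hypothetical pieces, the outer loop might look as follows: after each inner run, only the five new noun phrases extracted by the most distinct selected patterns are promoted into the permanent seed set, and everything else is discarded before the process restarts.

<pre>
from typing import Callable, Dict, Set, Tuple

def meta_bootstrap(pattern_extractions: Dict[str, Set[str]],
                   seeds: Set[str],
                   run_inner: Callable[[Set[str]], Tuple[Set[str], Set[str]]],
                   outer_iterations: int = 50,
                   keep: int = 5) -> Set[str]:
    """Outer meta-bootstrapping loop.

    run_inner(seeds) performs one inner mutual-bootstrapping run and returns
    (expanded_lexicon, selected_patterns), as in the sketch above.
    """
    permanent = set(seeds)
    for _ in range(outer_iterations):
        expanded, selected = run_inner(permanent)
        new_entries = expanded - permanent
        # Rank each new noun phrase by how many *different* selected patterns
        # extract it; entries supported by several patterns are more reliable.
        def support(np: str) -> int:
            return sum(1 for p in selected if np in pattern_extractions[p])
        best = sorted(new_entries, key=support, reverse=True)[:keep]
        if not best:
            break
        permanent |= set(best)  # only the most reliable instances survive
    return permanent

# Example wiring with the sketches above:
#   inner = lambda s: mutual_bootstrap(pattern_extractions, s, score_pattern)
#   final_lexicon = meta_bootstrap(pattern_extractions, seed_lexicon, inner)
</pre>

Promoting candidates by the number of different patterns that extract them helps keep a single overly general pattern from flooding the permanent seed set.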
 
The lists created were found to vary significantly depending on the domain on which they were trained. A list for vehicles, when trained on terrorism news articles, expanded to include weapons, as vehicles are often used as weapons in this area.  
  
 
== Related Papers ==
 
  
[[RelatedPaper::Hearst, COLING 1992]] similarly uses patterns between entities of interest to extract facts.
 
 
This style of iterative self-training shows up repeatedly, for example in [[Collins and Singer, EMNLP 1999]], who use it with co-training, and [[Brin, 1998]], who uses it for relation extraction from the web.
