Riloff and Jones 1999
Citation
Riloff, E. and Jones, R. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99). 1999.
Online Version
Summary
This paper was one of the earliest uses of bootstrapping to expand entity lists given only a few seed instances. It leverages unlabeled data to iteratively find patterns around the seeds, use those patterns to find new entities, and repeat to find more patterns and entities. This exploits redundancy in the unlabeled data, using the repeated presence of patterns to infer entities and vice versa.
Prior to this, most work in entity extraction required extensive training data to learn reliable patterns, and lexicons were constructed by hand. This work constructs both patterns and lexicons from very limited training data, using a mutual bootstrapping procedure over unlabeled data. Candidate patterns generated by a program called AutoSlog are assessed by how many of the seed instances they extract. The top patterns are used to extract new entities, which in turn cause other patterns to become highly ranked, and so on.
When bootstrapping like this, there is a risk that the original concept of the list becomes distorted as its membership drifts in the wrong direction. To maintain the quality of the expanding lists, two stages of filtering are used. On each iteration of the inner procedure, only the highest-scoring pattern is used to infer new entities. However, every entity that pattern extracts is added to the list. The expanded list is then used to rescore the remaining patterns, and the cycle repeats.
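The inner loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `extractions` table stands in for AutoSlog's output, the pattern strings and phrases are invented for the example, and patterns are scored by a raw count of the lexicon members they extract, whereas the paper's scoring function is more refined.

```python
def mutual_bootstrap(extractions, seeds, iterations=3):
    """Inner bootstrapping loop (illustrative sketch).

    extractions: dict mapping a pattern to the set of phrases it extracts
                 (a hypothetical stand-in for AutoSlog output).
    seeds:       initial set of known category members.
    """
    lexicon = set(seeds)
    used = []  # patterns already selected, in order of selection
    for _ in range(iterations):
        best, best_score = None, 0
        for pattern, phrases in extractions.items():
            if pattern in used:
                continue
            # Assumed score: how many current lexicon members the pattern extracts.
            score = len(phrases & lexicon)
            if score > best_score:
                best, best_score = pattern, score
        if best is None:
            break  # no remaining pattern overlaps the lexicon
        used.append(best)
        # Everything the single best pattern extracts joins the lexicon.
        lexicon |= extractions[best]
    return lexicon, used


# Invented toy data for a "location" category.
extractions = {
    "headquartered in <x>": {"nicaragua", "managua", "san salvador"},
    "offices in <x>": {"san salvador", "bogota", "colombia"},
    "exploded in <x>": {"bogota", "the morning"},
    "said <x>": {"officials"},
}
seeds = {"nicaragua", "managua"}

lexicon2, used2 = mutual_bootstrap(extractions, seeds, iterations=2)
lexicon3, used3 = mutual_bootstrap(extractions, seeds, iterations=3)
```

In this toy data, two iterations add only locations, while the third selects "exploded in <x>" and pulls "the morning" into the list, a small instance of the drift that the filtering is meant to limit.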
To further improve the quality of the sets, there is an outer, meta-bootstrapping procedure. The expanded lists from the inner bootstrap are filtered to only keep the five best instances, as measured by the number of different patterns that extract those instances. These five are added to the seed set, and the entire process starts anew.
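The outer filtering step can be sketched as follows. This is an assumption-laden illustration rather than the paper's code: the function name `keep_best`, the pattern labels, and the alphabetical tie-breaking are all hypothetical, and each candidate is ranked simply by the number of distinct patterns that extract it.

```python
from collections import Counter


def keep_best(extractions, candidates, seeds, k=5):
    """Keep the k candidates extracted by the most distinct patterns.

    extractions: dict mapping a pattern to the set of phrases it extracts.
    candidates:  phrases proposed by the inner bootstrap.
    seeds:       the current seed set, which is always retained.
    """
    counts = Counter()
    for phrases in extractions.values():
        for phrase in phrases & candidates:
            counts[phrase] += 1
    # Rank new candidates by pattern count; ties are broken alphabetically
    # (an assumption made here only to keep the result deterministic).
    new = candidates - seeds
    ranked = sorted(new, key=lambda p: (-counts[p], p))
    return set(seeds) | set(ranked[:k])


# Invented toy data; k=2 keeps the example short (the summary above uses five).
extractions = {
    "p1": {"bogota", "lima", "the morning"},
    "p2": {"bogota", "lima"},
    "p3": {"bogota"},
}
candidates = {"bogota", "lima", "the morning"}
new_seeds = keep_best(extractions, candidates, seeds={"managua"}, k=2)
```

Here "the morning" is extracted by only one pattern and is dropped, so the next round of the inner bootstrap restarts from a seed set that contains only well-supported entries.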
The lists created were found to vary significantly depending on the domain on which they were trained. A list for vehicles, when trained on terrorism news articles, expanded to include weapons, as vehicles are often used as weapons in this area.
Related Papers
Hearst, COLING 1992 similarly uses patterns between entities of interest to extract facts.
These iterative self-training methods show up repeatedly, such as with Collins and Singer, EMNLP 1999, who combine them with co-training, and Brin, WebDB 1998, who applies them to relation extraction from the web.