Rbosaghz writeup of Etzioni 2004
This is a review of etzioni_2004_methods_for_domain_independent_information_extraction_from_the_web_an_experimental_comparison by user:Rbosaghz.
This is the first bootstrapping system we're reading about. Like Tom Mitchell's Never-Ending Language Learner, Etzioni's KnowItAll system takes a few seed instances, say a few city names, and goes out on the web to find similar instances (if the seed instances were cities, the system would find more cities).
This is done by looking for patterns that are reliable indicators of what the seed instances represent. For example, starting from a few city names as seeds, the system would find the pattern "Cities such as ____", then look for other nouns that fill that pattern in web data, and repeat this process to find more city names.
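The loop described above can be sketched in a few lines of Python. This is only a toy illustration, not KnowItAll's implementation: the corpus, the function names, the two-word left-context "pattern", and the capitalized-word test for candidate instances are all assumptions made for the example.

```python
import re

def tokens(sentence):
    # Split a sentence into alphabetic words (toy tokenizer, an assumption).
    return re.findall(r"[A-Za-z]+", sentence)

def learn_patterns(corpus, instances, width=2):
    # A "pattern" here is the lowercased `width` words preceding a known instance.
    patterns = set()
    for sentence in corpus:
        toks = tokens(sentence)
        for i, tok in enumerate(toks):
            if tok in instances and i >= width:
                patterns.add(tuple(t.lower() for t in toks[i - width:i]))
    return patterns

def apply_patterns(corpus, patterns, width=2):
    # Any capitalized word that follows a learned pattern becomes a candidate.
    found = set()
    for sentence in corpus:
        toks = tokens(sentence)
        for i in range(width, len(toks)):
            context = tuple(t.lower() for t in toks[i - width:i])
            if context in patterns and toks[i][0].isupper():
                found.add(toks[i])
    return found

def bootstrap(corpus, seeds, rounds=2):
    # Alternate between learning patterns and harvesting new instances.
    instances = set(seeds)
    for _ in range(rounds):
        patterns = learn_patterns(corpus, instances)
        instances |= apply_patterns(corpus, patterns)
    return instances

# Hypothetical miniature "web" for the example.
CORPUS = [
    "Cities such as Paris attract many visitors.",
    "Cities such as Tokyo are densely populated.",
    "She flew to Paris last week.",
    "He flew to Berlin in March.",
    "They flew to Mars in the novel.",
]

result = bootstrap(CORPUS, {"Paris"})
print(sorted(result))
```

Starting from the single seed "Paris", the sketch picks up "Tokyo" and "Berlin", but also "Mars", because the loosely learned pattern "flew to" is not specific to cities. Nothing in this toy version scores or filters patterns, which is the step a real system needs.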
To decide which patterns are best, the authors use pointwise mutual information (PMI), which is a measure of how well a pattern and an instance support one another.
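For reference, the textbook definition is PMI(p, i) = log2( P(p, i) / (P(p) P(i)) ): it is positive when a pattern p and an instance i co-occur more often than chance would predict, and zero when they are independent. The sketch below computes it from raw counts. The function name and the counts are made up for illustration, and the exact variant used in the paper, which works from web search hit counts, may differ in form.

```python
import math

def pmi(joint_count, pattern_count, instance_count, total):
    # Textbook PMI from co-occurrence counts; not necessarily the paper's variant.
    if joint_count == 0:
        return float("-inf")  # never seen together
    p_joint = joint_count / total
    p_pattern = pattern_count / total
    p_instance = instance_count / total
    return math.log2(p_joint / (p_pattern * p_instance))

# Hypothetical counts out of 1000 observations.
strong = pmi(joint_count=10, pattern_count=50, instance_count=20, total=1000)
chance = pmi(joint_count=1, pattern_count=50, instance_count=20, total=1000)
print(strong, chance)
```

In the first call the pair co-occurs ten times more often than independence would predict, so the score is clearly positive; in the second the co-occurrence is exactly at chance level, so the score is approximately zero.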
I like this paper because the idea of growing lists of seed instances into much larger lists is exciting. However, the problem of semantic drift, where errors in early extractions pull later iterations away from the original class, is rather disappointing.