Selen writesup Etzioni 2004
This is a review of etzioni_2004_methods_for_domain_independent_information_extraction_from_the_web_an_experimental_comparison by user:Selen.
This paper is about KnowItAll, a bootstrapping system that extracts information from the web automatically and in a domain-independent manner (the only domain-specific input is the set of classes and relations that make up its focus). Its first release extracted 50,000 facts; in this work the goal is to increase recall without sacrificing precision.
In order to do that they add three methods:
- Rule Learning
- Subclass Extraction
- List Extraction
In its basic form KnowItAll has three components: Extractor, Search Engine and Assessor. The Extractor instantiates rules; for a given class name the Search Engine generates queries using those rules; and the Assessor validates the correctness of each extraction using a PMI (pointwise mutual information) score.
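To make the Assessor step concrete, here is a minimal sketch of a hit-count style PMI score. It assumes the score is the ratio of search hits for a discriminator phrase containing the instance to the hits for the instance alone; the function names, the "city of {}" discriminator and the toy hit counts are all made up for illustration and are not taken from the paper.

```python
def pmi(hits_with_discriminator: int, hits_instance: int) -> float:
    """Hit-count ratio used here as a PMI-style score (assumed form)."""
    if hits_instance == 0:
        # No evidence for the instance at all: treat the score as zero.
        return 0.0
    return hits_with_discriminator / hits_instance


def score(instance: str, discriminator: str, hits: dict) -> float:
    """Score an instance against one discriminator template.

    `hits` stands in for a search engine: it maps a query string to a
    hit count (hypothetical data, not real search results).
    """
    phrase = discriminator.format(instance)
    return pmi(hits.get(phrase, 0), hits.get(instance, 0))


# Toy hit counts, invented for the example.
toy_hits = {
    "Paris": 1000,
    "city of Paris": 200,
    "Foobar": 50,
}

paris_score = score("Paris", "city of {}", toy_hits)
foobar_score = score("Foobar", "city of {}", toy_hits)
```

With these toy counts, "Paris" scores higher than "Foobar", which is the kind of separation an assessor would threshold on.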
Among the additional methods, rule learning tailors the rule set by adding domain-specific rules, which increases recall by extracting additional facts. As examples they give "<film> starring" and "headquarters in <city>".
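As a rough illustration of how a pattern like "headquarters in <city>" could be applied to text, the sketch below turns it into a regular expression. The capitalized-word heuristic for the <city> slot and the function name are assumptions made for the example; the paper's actual rule representation may differ.

```python
import re

# "headquarters in <city>", with <city> approximated as a run of
# capitalized words (an assumption for this sketch).
CITY_RULE = re.compile(r"headquarters in ((?:[A-Z][a-z]+)(?: [A-Z][a-z]+)*)")


def extract_cities(text: str) -> list:
    """Return every candidate city matched by the rule, in order."""
    return CITY_RULE.findall(text)


sample = "Acme has its headquarters in New York and offices abroad."
candidates = extract_cities(sample)
```

Candidates found this way would still need to be validated, for example by the Assessor.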
Subclass extraction is based on the observation that not every instance of a class is referred to by its superclass name but often by a smaller subclass; for instance, a scientist John Doe is not necessarily called "scientist John Doe" but rather "biologist John Doe". This module extracts subclasses for a given class and feeds them back to KnowItAll. Here they also perform assessment and morphology checking to capture keywords such as <microbiologist>.
Finally, the web contains many formatted lists, which are a good source of information; list extraction is developed to find good lists using the high-probability instances found by the baseline.
They compare their system and each additional method to the baseline. Although recall increases significantly, precision does not vary with recall as much as one would expect. The other thing that bothered me is that the extraction rate (number of unique instances per query) is not very impressive.
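For reference, the extraction rate as defined above (unique instances per query) can be computed as in the following sketch. Treating instances as equal after trimming whitespace and lowercasing is my own assumption about what "unique" means; the function name is hypothetical.

```python
def extraction_rate(extracted: list, num_queries: int) -> float:
    """Unique extracted instances divided by the number of queries issued."""
    if num_queries <= 0:
        raise ValueError("num_queries must be positive")
    # Normalize before deduplicating (assumed notion of uniqueness).
    unique = {item.strip().lower() for item in extracted}
    return len(unique) / num_queries


# Three extractions, two of them duplicates after normalization, from 4 queries.
rate = extraction_rate(["Paris", "paris ", "Rome"], 4)
```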
I like this paper, but it seems to me that they could have come up with a more original way of bootstrapping instead of tuning the method here and there by adding modules. They also do not say how they picked the "best" parameters. They criticize DIPRE, but it came out well before this paper. Finally, using the Google API slows down their system by putting constraints on retrieval, and this is the biggest issue with KnowItAll.