KeisukeKamataki writeup of Etzioni 2004
This is a review of Etzioni_2004_methods_for_domain_independent_information_extraction_from_the_web_an_experimental_comparison by user:KeisukeKamataki.
Summary: The authors try to improve the recall and extraction rate of KNOWITALL, their earlier unsupervised IE system, which relies on pointwise mutual information (PMI) statistics computed over the web as a corpus. Three methods were tested: Rule Learning (RL), Subclass Extraction (SE), and List Extraction (LE).
The Assessor is an important component of KNOWITALL that estimates the likelihood that an extraction is correct. It uses PMI between each extracted instance I (such as "Tokyo") and an automatically generated discriminator phrase D for the class (such as "city of"). The score is PMI(I,D) = |Hits(D+I)|/|Hits(I)|.
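To make the ratio above concrete, here is a minimal sketch in Python. The function name `pmi_score`, the handling of zero hit counts, and all hit counts are my own assumptions for illustration; they are not taken from the paper, and a real system would obtain the counts from search engine queries.

```python
def pmi_score(hits_discriminator_with_instance: int, hits_instance: int) -> float:
    """Ratio |Hits(D + I)| / |Hits(I)| as described in this review."""
    if hits_instance <= 0:
        # Assumption: an instance with no hits gets score 0 instead of raising an error.
        return 0.0
    return hits_discriminator_with_instance / hits_instance


# Hypothetical hit counts, made up for illustration only.
hit_counts = {
    "Tokyo": {"alone": 1000, "with_discriminator": 50},
    "Banana": {"alone": 2000, "with_discriminator": 2},
}

scores = {
    instance: pmi_score(counts["with_discriminator"], counts["alone"])
    for instance, counts in hit_counts.items()
}
```

With these made-up counts, "Tokyo" scores higher than "Banana" for the discriminator "city of", which is the kind of signal the Assessor relies on.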
RL adds domain-specific rules that capture frequent patterns in which the target nouns occur. It works like Brin's DIPRE, but the seed set is generated automatically in KNOWITALL's bootstrapping phase. It uses two heuristics to achieve good performance: prefer substrings that appear in multiple context strings, and prefer substrings with high estimated precision.
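The two heuristics can be sketched roughly as follows. This is a simplification under my own assumptions: it only looks at the words to the left of an instance, it takes the precision estimates as a given dictionary, and the names `candidate_substrings`, `select_rules`, `min_contexts`, and `min_precision` as well as all numbers are hypothetical rather than taken from the paper.

```python
from collections import Counter


def candidate_substrings(left_contexts, max_words=3):
    """Count how many context strings end with each trailing word n-gram."""
    counts = Counter()
    for context in left_contexts:
        words = context.split()
        # Trailing n-grams are the words immediately before the extracted instance.
        seen = {
            " ".join(words[-n:])
            for n in range(1, min(max_words, len(words)) + 1)
        }
        counts.update(seen)  # each context string contributes at most once
    return counts


def select_rules(left_contexts, estimated_precision, min_contexts=2, min_precision=0.5):
    """Keep substrings seen in several contexts and with high estimated precision."""
    counts = candidate_substrings(left_contexts)
    kept = [
        s for s, c in counts.items()
        if c >= min_contexts and estimated_precision.get(s, 0.0) >= min_precision
    ]
    # Rank by estimated precision first, then by number of supporting contexts.
    return sorted(
        kept,
        key=lambda s: (estimated_precision.get(s, 0.0), counts[s]),
        reverse=True,
    )


# Hypothetical context strings and precision estimates, for illustration only.
left_contexts = [
    "flights to the city of",
    "hotels in the city of",
    "weather for the city of",
    "cheap hotels in",
]
estimated_precision = {"the city of": 0.95, "city of": 0.9, "of": 0.2, "in": 0.1}

rules = select_rules(left_contexts, estimated_precision)
```

Here "of" is frequent but dropped for low precision, and "in" is dropped on both counts, so only the two "city of" variants survive.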
SE identifies subclasses (such as physicists and chemists for the class scientists) and uses them to expand the extraction patterns. Subclass candidates are extracted as common nouns (not capitalized) and checked with WordNet and morphological analysis. They are also scored by the Assessor to assign a probability.
LE detects regularly formatted lists on the web, using the HTML parse trees of the pages returned by a search engine, and extracts instances from them. After extraction, it estimates the likelihood that each instance is correct from its frequency of occurrence. The original Assessor was also used to rank the extractions, giving the "LE+A" model. LE is related to Google Sets, and the authors argue that LE's recall is much higher.
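The frequency-based ranking in LE could look something like the sketch below. Counting each instance once per list is my own assumption about how "frequency of occurrence" is measured, and the function name `rank_by_list_frequency` and the example lists are hypothetical.

```python
from collections import Counter


def rank_by_list_frequency(extracted_lists):
    """Rank instances by the number of extracted lists in which they appear."""
    counts = Counter()
    for items in extracted_lists:
        # Assumption: count an instance once per list, so repeats inside
        # a single list do not inflate its score.
        counts.update(set(items))
    return counts.most_common()


# Hypothetical lists, standing in for lists pulled from HTML parse trees.
extracted_lists = [
    ["Tokyo", "Paris"],
    ["Tokyo", "Paris", "Tokyo"],
    ["Tokyo", "Widget"],
]

ranking = rank_by_list_frequency(extracted_lists)
```

In the "LE+A" setting, the top of such a ranking would then be passed to the Assessor rather than trusted directly.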
Although all the additional methods seem to help improve performance, it is hard to say which one is particularly useful, because the size of the improvement varies widely across domains (LE+A worked well for City and Film; SE worked well for Scientists).
I didn't understand: Although they argue that their method is domain-independent, the performance still depends on the combination of method and domain. It would be great if the experiment were applied to more domains to explore the effectiveness of each (or combined) method in detail. I also wanted them to discuss a drawback of their approach: it requires a lot of queries and web document downloads. It might be better if they reported performance as a function of the number of queries and/or downloaded documents.