Nlao writeup of Etzioni 2004
This is a review of Etzioni_2004_methods_for_domain_independent_information_extraction_from_the_web_an_experimental_comparison by user:Nlao.
This paper presents a very important problem (OIE) with several reasonable approaches: domain-independent patterns, domain-specific rules, web page tables, and subclass patterns. These approaches work together in a bootstrapping manner. However, the implementation of these approaches can be improved.
One major problem is the use of existing web Search Engines (SE), which only support keyword search and impose bandwidth limits. Texts need to be processed (NLP parsing, table extraction, pattern matching) after retrieval. This might not be a bad choice if only a small fraction of pages contain interesting information. However, if that is not the case, it is preferable to first process all the documents, index them, and then start the bootstrapping, because a richer retrieval language can be used. For example, instead of relying on keyword search, we can directly retrieve text that matches a regular expression containing POS information.
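A minimal sketch of what such a POS-aware query could look like, assuming the corpus has already been tagged at indexing time and stored as "token/TAG" strings with Penn Treebank tags. The sample sentence, the pattern, and the function name extract_instances are illustrative assumptions, not anything from the paper.

```python
import re

# Hypothetical pre-tagged sentence in "token/TAG" form (Penn Treebank tags).
# In a real pipeline the tags would come from a POS tagger run once at indexing time.
TAGGED = "cities/NNS such/JJ as/IN Paris/NNP ,/, London/NNP and/CC Rome/NNP ./."

# One proper-noun token, and a list of them joined by commas and/or "and"/"or".
NNP = r"\S+/NNPS?"
NNP_LIST = rf"{NNP}(?:(?: ,/,)?(?: (?:and|or)/CC)? {NNP})*"

# "<plural noun> such as <proper noun list>", expressed over tokens and tags at once.
PATTERN = re.compile(rf"(?P<cls>\S+)/NNS such/JJ as/IN (?P<items>{NNP_LIST})")


def extract_instances(tagged_text):
    """Return (class word, [instance, ...]) pairs matched by PATTERN."""
    results = []
    for m in PATTERN.finditer(tagged_text):
        items = []
        for tok in m.group("items").split():
            word, tag = tok.rsplit("/", 1)
            if tag.startswith("NNP"):  # skip commas and conjunctions
                items.append(word)
        results.append((m.group("cls"), items))
    return results


print(extract_instances(TAGGED))
```

A keyword-only search engine cannot express the tag constraints, so the same filtering would have to happen after downloading every candidate page.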
For the RL method, the choice of context is far too arbitrary: k words before and after the seed word. This might be the main reason why most of the context patterns are very weak. It makes more sense to me to use subtrees of the dependency parse that contain the seed word as patterns.
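To make the contrast concrete, here is a rough sketch comparing a k-word window with one simple kind of dependency-based context: the chain of heads from the seed word up to the root. The parse below is written by hand for illustration (in practice it would come from a dependency parser), and the function names are my own assumptions.

```python
# Hand-written example parse; heads[i] is the index of token i's head, -1 for the root.
TOKENS = ["The", "mayor", "of", "Seattle", "visited", "Boston", "yesterday"]
HEADS = [1, 4, 3, 1, -1, 4, 4]
LABELS = ["det", "nsubj", "case", "nmod", "root", "obj", "obl"]


def window_context(tokens, i, k):
    """Context as in the k-word window: k tokens to the left and right of position i."""
    left = tokens[max(0, i - k):i]
    right = tokens[i + 1:i + 1 + k]
    return left, right


def head_chain(tokens, heads, labels, i):
    """Context from the parse: (token, relation) pairs from position i up to the root."""
    chain = []
    while i != -1:
        chain.append((tokens[i], labels[i]))
        i = heads[i]
    return chain


seed = TOKENS.index("Seattle")
print(window_context(TOKENS, seed, 2))
print(head_chain(TOKENS, HEADS, LABELS, seed))
```

With k = 2 the window pulls in "visited Boston", which says nothing about Seattle, while the head chain keeps only "mayor" and "visited", the words Seattle is syntactically attached to.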
For the LE method, it is inefficient to generate and execute 5k to 10k queries from the seed set. Again, had the tables been preprocessed and indexed, we would only need to send one query and get back all the tables ordered by their similarity to the seed set.
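A small sketch of the single-query idea, assuming tables have already been extracted and reduced to lists of cell strings. Similarity here is just the number of seeds a table contains, which is my simplification. The table data and the names build_index and rank_tables are hypothetical.

```python
from collections import defaultdict

# Hypothetical preprocessed tables: table id -> cell values.
TABLES = {
    "t1": ["Paris", "London", "Rome", "Berlin"],
    "t2": ["Paris", "Texas", "Ohio"],
    "t3": ["apple", "pear"],
}


def build_index(tables):
    """Inverted index: lowercased cell value -> set of table ids containing it."""
    index = defaultdict(set)
    for table_id, cells in tables.items():
        for cell in cells:
            index[cell.lower()].add(table_id)
    return index


def rank_tables(index, seeds):
    """Rank tables by how many distinct seeds they contain (ties broken by id)."""
    hits = defaultdict(int)
    for seed in {s.lower() for s in seeds}:
        for table_id in index.get(seed, ()):
            hits[table_id] += 1
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


index = build_index(TABLES)
print(rank_tables(index, ["Paris", "London", "Rome"]))
```

One pass over the seed set touches only the index entries for those seeds, rather than issuing a separate search engine query per seed combination.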
In conclusion, promising direction, but toy-like implementation.
[minor points]