Yandongl writeup of Yates 2009
This is a review of Etzioni_2004_methods_for_domain_independent_information_extraction_from_the_web_an_experimental_comparison by user:Yandongl.
This paper introduces KnowItAll, a fact extraction system. KnowItAll, which does not require a set of manually tagged training examples, extracted more than 50,000 facts in its first major run with high precision, but recall is probably low. The authors then introduce three ways to boost recall:
- Rule Learning: KnowItAll's baseline extraction uses generic rules. Adding domain-specific rules can greatly improve the system's recall. Rule Learning starts with a set of seed instances and issues queries to a search engine. For each result page retrieved, it records the best string, i.e. one that can extract new class instances with high precision.
- Subclass Extraction: By identifying subclasses of a known class and incorporating them into the KnowItAll system, the number of instances the system recognizes increases greatly (by more than a factor of ten). However, the subclasses found need to be verified to maintain high quality. There are various ways of doing this, such as WordNet lookups, string prefix checks, etc.
- List Extraction (LE): In order to extract information from data that is not in natural language form (numbers, times, etc.), three stages are proposed: 1) finding lists, 2) extracting list elements, and 3) assessing their accuracy. LE downloads the HTML page and converts it to an HTML parse tree, so it only deals with well-formed HTML documents. It learns a wrapper, but immediately abandons it and only returns the list of instances. There are several possibilities for the assessment step, such as a Naive Bayes classifier, EM, or simply sorting instances by the number of lists each one appears in, which is what the system actually uses.
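The list-count assessment described in the last bullet can be sketched as follows. This is only an illustration of the idea, not the authors' implementation: the function name `rank_by_list_count`, the sample data, and the lowercasing/whitespace normalization are all assumptions made for the example.

```python
from collections import Counter


def rank_by_list_count(lists):
    """Rank candidate instances by how many distinct lists contain them.

    Hypothetical helper: `lists` is a sequence of extracted lists, each a
    sequence of candidate strings.
    """
    counts = Counter()
    for items in lists:
        # Count each candidate at most once per list, so repeats inside a
        # single list do not inflate its score. Lowercasing is an assumed
        # normalization, not something the paper specifies.
        unique = {item.strip().lower() for item in items if item.strip()}
        counts.update(unique)
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Made-up example: three lists that were extracted for the class City.
lists = [
    ["Paris", "London", "Tokyo"],
    ["Paris", "Tokyo", "Home"],
    ["paris", "Berlin"],
]
ranked = rank_by_list_count(lists)
for name, n_lists in ranked:
    print(name, n_lists)
```

Candidates that show up in many independent lists (here "paris") rise to the top, while noise such as "home" that appears in a single list stays near the bottom, where a cutoff could remove it.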
Experiments show that all three new methods can significantly increase the number of new instances extracted while retaining high precision (0.9).
All the techniques mentioned are heuristic-based, but they seem to work well. One of my concerns is that all the example categories (City, Scientist, and Film) are relatively easy to extract. Performance on less well-defined classes (e.g. nice food?) is unknown. Also, the methods rely heavily on a search engine, which can be a bottleneck.