Sgardine writesup Etzioni 2004
This is a review of etzioni_2004_methods_for_domain_independent_information_extraction_from_the_web_an_experimental_comparison by user:Sgardine.
Summary
The goal is to improve the recall of a previously presented system -- KnowItAll -- while maintaining high precision. The paper first describes the system as a whole and then the improvements.
KnowItAll
KnowItAll begins with a set of domain-independent extraction patterns and a set of classes, and bootstraps a set of domain-specific extraction patterns. The Extractor learns patterns from webpages, using the Brill tagger. The patterns are sent as queries to search engines; for speed (with a fig leaf of politeness) the authors plan to integrate their own search engine into the system. The Assessor uses PMI (pointwise mutual information) to decide whether extracted rules are correct.
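To make the Assessor's use of PMI concrete, here is a minimal sketch. It assumes the score is a ratio of search-engine hit counts: hits for the candidate instance together with a discriminator phrase, divided by hits for the instance alone. The function name, the zero-hit convention, and the hit counts in the usage note are illustrative assumptions, not taken from the paper.

```python
def pmi_score(hits_with_discriminator: int, hits_instance: int) -> float:
    """Hit-count ratio used as a PMI-style score (assumed form).

    hits_with_discriminator: hits for a query combining the candidate
        instance with a discriminator phrase (e.g. "city of <instance>").
    hits_instance: hits for the candidate instance on its own.
    """
    if hits_instance == 0:
        # No evidence at all for the instance; treat the score as zero
        # (an assumption of this sketch).
        return 0.0
    return hits_with_discriminator / hits_instance


def passes_threshold(score: float, threshold: float) -> bool:
    # A single discriminator is turned into a boolean feature by
    # comparing its score against a threshold (hypothetical value).
    return score >= threshold
```

With made-up counts, `pmi_score(50, 1000)` gives a ratio of 50 over 1000; a real assessor would combine several such discriminator features rather than rely on one.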
Recall Enhancements
First, in order to seek better domain-specific rules, the system mines the contexts in which seed instances appear in the search results. Pruning rules that match only a single seed gets rid of most candidates; further, the precision of each remaining rule is estimated by counting the rules as negative examples of unmatched tuples (and smoothing with a constant). Rules with high estimated precision are retained. Second, the subclass extractor tries to induce a subclass hierarchy in order to apply patterns for subclasses to the root class. Candidate subclass relationships are evaluated using WordNet, morphology, and (as in the system writ large) PMI. Finally, the authors attempt to leverage large lists of entities on the web in which existing tuples appear in parallel, i.e. they are found with structural rather than contextual patterns.
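The rule-selection step of the first enhancement can be sketched roughly as follows. This is only one reading of the description: it assumes each candidate rule carries a count of positive matches and a count of negative examples, and that the smoothing is a Laplace-style correction. The class `RuleStats`, the constants `k` and `m`, and both thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RuleStats:
    pattern: str          # the learned context pattern
    seeds_matched: int    # number of distinct seeds the pattern matched
    positives: int        # matches counted as correct
    negatives: int        # matches counted as negative examples


def estimated_precision(positives: int, negatives: int,
                        k: float = 1.0, m: float = 2.0) -> float:
    # Smoothed precision estimate; k and m are assumed constants that keep
    # rules with very few observations from scoring 0 or 1 outright.
    return (positives + k) / (positives + negatives + m)


def select_rules(rules: List[RuleStats], min_seeds: int = 2,
                 min_precision: float = 0.7) -> List[RuleStats]:
    kept = []
    for rule in rules:
        # Heuristic 1: drop rules that matched only a single seed.
        if rule.seeds_matched < min_seeds:
            continue
        # Heuristic 2: keep only rules with high estimated precision.
        if estimated_precision(rule.positives, rule.negatives) >= min_precision:
            kept.append(rule)
    return kept
```

The smoothing matters mostly for rarely observed rules: a rule seen once and never wrong gets an estimate well below 1, so it does not outrank a rule that has been right many times.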
The enhancements did improve the system, with most of the improvement coming from mining lists.
Commentary
They seem to be saying that an advantage of their system is that you supply a class, rather than a list of instances, as the seed. But then in order to know which class names might do well, you'd have to run exploratory searches for them anyway -- which usually would produce a partial list of usable seed instances. It might be nice to be able to use both -- I'd be disinclined to give the system a classname, wait 36 hours, and then find out that my classname could have been improved by using a synonym.