Wka writeup of Etzioni 2004

This is a review of etzioni_2004_methods_for_domain_independent_information_extraction_from_the_web_an_experimental_comparison by user:wka.

The authors present 3 additions to their baseline KnowItAll system that increase its recall while maintaining its high precision. KnowItAll uses Hearst's domain-independent extraction patterns to generate candidate facts and classes, then uses pointwise mutual information (PMI) statistics computed from search-engine hit counts to assign a probability to each candidate; this bootstrapping, together with the scale and redundancy of the web, lets KnowItAll start up without hand-labeled seeds. Because their corpus is the whole web, they redefine recall as the number of unique facts extracted; their goal is higher recall on "large" classes, like City and Film, while keeping high precision.

The system consists of 3 modules: the Extractor, the Search Engine Interface, and the Assessor. The authors heavily emphasize the PMI basis of the Assessor and use it at various stages in the process; to assess whether a candidate is true or not, the PMI scores are used as input features to a Naive Bayes classifier.
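
For concreteness, the paper's PMI score is a ratio of search-engine hit counts: the instance co-occurring with a class discriminator phrase versus the instance alone. A minimal sketch, where the hits() helper is a hypothetical stand-in for their Search Engine Interface:

 def pmi_score(instance, discriminator, hits):
     # PMI(I, D) = Hits(D + I) / Hits(I), estimated from
     # search-engine hit counts; several such scores are then
     # fed to the Naive Bayes classifier as features.
     joint = hits(f'"{discriminator} {instance}"')  # e.g. "city of Paris"
     alone = hits(f'"{instance}"')                  # e.g. "Paris"
     return joint / alone if alone else 0.0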

The 3 new methods they add to their system are:

  • (Domain-specific) Rule Learning (first sketch after this list):
    • A rule is a context string built from a 4-word prefix and a 4-word suffix around the extraction slot.
    • Rule selection favors precision over recall; 2 heuristics are used:
      • prefer substrings that appear in contexts across different seeds; ban all substrings that match only 1 seed
      • treat instances of other classes as negative examples for this class, and use them to penalize rules that extract false positives
  • Subclass Extraction (WordNet sketch below):
    • A proper noun is an instance candidate; a common noun is a subclass candidate
    • Uses WordNet to check whether the subclass candidate is a hyponym of the class
    • If the candidate is not in WordNet, falls back on the word's morphology (especially successful with the class Scientist)
  • List Extraction (last sketch below):
    • Select k random seeds and generate a query; repeat 5,000 to 10,000 times, with k = 4 to increase recall
    • Create a few wrappers per retrieved page, use them to extract more instances, then discard them
    • Assessment ranks each instance by the number of lists it occurs in
    • The main bottleneck is the number of queries that must be issued to Google
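
A minimal sketch of the Rule Learning step, assuming single-token seeds and whitespace tokenization (both simplifications; the function name is my own): collect the 4-word prefix/suffix contexts around seed occurrences and keep only contexts supported by more than 1 seed, per the first heuristic.

 from collections import defaultdict

 def learn_context_rules(sentences, seeds, w=4):
     # Map each (prefix, suffix) context string to the set of seeds
     # it surrounds; contexts that match only 1 seed are banned.
     support = defaultdict(set)
     for sent in sentences:
         tokens = sent.split()
         for i, tok in enumerate(tokens):
             if tok in seeds:
                 prefix = tuple(tokens[max(0, i - w):i])
                 suffix = tuple(tokens[i + 1:i + 1 + w])
                 support[(prefix, suffix)].add(tok)
     return [ctx for ctx, matched in support.items() if len(matched) > 1]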
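
The WordNet hyponym test in Subclass Extraction can be approximated with NLTK's WordNet interface (a sketch, not the authors' code; requires the nltk package and its wordnet data):

 from nltk.corpus import wordnet as wn

 def is_hyponym(candidate, class_name):
     # True if any noun sense of `candidate` has a noun sense of
     # `class_name` somewhere on its hypernym paths.
     class_synsets = set(wn.synsets(class_name, pos=wn.NOUN))
     for syn in wn.synsets(candidate, pos=wn.NOUN):
         for path in syn.hypernym_paths():  # root-to-synset chains
             if class_synsets & set(path):
                 return True
     return False

 # is_hyponym("physicist", "scientist") -> True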
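
Finally, List Extraction's query generation and occurrence-based assessment, sketched under the same simplifying assumptions (the wrapper induction over retrieved pages is omitted; function names are my own):

 import random
 from collections import Counter

 def generate_queries(seed_instances, k=4, n_queries=5000):
     # Each query is k randomly sampled known instances; the paper
     # repeats this 5,000 to 10,000 times, with k = 4.
     return [" ".join(random.sample(seed_instances, k))
             for _ in range(n_queries)]

 def rank_instances(extracted_lists):
     # Assessment: rank each instance by the number of distinct
     # lists it appears on.
     counts = Counter()
     for lst in extracted_lists:
         counts.update(set(lst))
     return counts.most_common()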

Comments:

  • Impressive results for List Extraction with PMI Assessment (LE+A): 2 orders of magnitude faster, and more instances extracted (except vs Subclass Extraction for Scientist)
    • Little advantage to combining all 3 methods over LE+A alone, except for the generality of the solution; "All" naturally performs best across all classes, including Scientist
  • It's not clear how they specify/differentiate between classes in KnowItAll's "focus" (section 2)
  • In Rule Learning, if classes are discovered in a domain-independent fashion, there is a good chance that 2 classes are not mutually exclusive. Hence, using the instances of one class as negative examples for another is not always warranted. They probably avoid discovered subclasses in this comparison (but they don't state this)
  • In Subclass Extraction, they say "if either test holds, then the Assessor assigns the subclass a probability that is close to one." How is that probability calculated? Especially since it is later thresholded at 0.85. Too many knobs to turn!
  • They mention that they (naturally) optimized the parameters to get these results; they could have presented tests of their ideas in different settings. Subsequent research by others showed further success with List Extraction