Philgoo Han writeup of Etzioni, Cafarella, Downey, Popescu, Shaked, Soderland, Weld and Yates

From Cohen Courses

This is a review of Etzioni_2004_methods_for_domain_independent_information_extraction_from_the_web_an_experimental_comparison by user:Ironfoot.

  • KnowItAll baseline, with Rule Learning, Subclass Extraction, and List Extraction as improvements
  • KnowItAll
    • Bootstrap phase: generic extraction pattern with 'focus' keywords
      • Can generic patterns be precise? With many generics would it be scalable?
    • Submit the instantiated patterns as queries to general-purpose search engines
    • Assess extractions with PMI (pointwise mutual information)
  • Rule Learning
    • Finds domain-specific patterns
    • From text extracted with the generic patterns, find new patterns by applying a context window.
    • H1: select only patterns that occur in multiple contexts (trades recall for precision)
    • H2: select rules with higher 'only if' precision
  • Subclass Extraction
    • WordNet
    • prefix, suffix
    • Other rules (A, B and C)
  • List Extraction
    • Search with a random subset of high-probability instances
    • Find HTML lists in the results
    • Naive Bayes classifier to identify high-probability lists
  • Results
    • Large improvement in recall (at fixed precision)
    • Very high extraction rate with LE
  • This method issues a large number of web queries.
  • The model does a whole run for each new focus; can it be used for real-time queries?
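
The bootstrap and assessment steps noted under KnowItAll can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the PMI score is the ratio of hit counts for a discriminator phrase combined with the instance to hit counts for the instance alone. The pattern template, the names instantiate, pmi and toy_hits, and all hit counts are hypothetical stand-ins for a real search engine.

```python
from typing import Callable


def instantiate(pattern: str, class_label: str) -> str:
    """Fill a generic pattern template with a class label (the 'focus')."""
    return pattern.format(cls=class_label)


def pmi(instance: str, discriminator: str, hits: Callable[[str], int]) -> float:
    """PMI-style score: hits(discriminator + instance) / hits(instance)."""
    instance_hits = hits(instance)
    if instance_hits == 0:
        return 0.0  # no evidence at all for this instance
    return hits(f"{discriminator} {instance}") / instance_hits


# Made-up hit counts standing in for search engine results (illustrative only).
TOY_HITS = {
    "Paris": 1000,
    "city of Paris": 200,
    "Banana": 500,
    "city of Banana": 1,
}


def toy_hits(query: str) -> int:
    """Hypothetical hit-count lookup; a real system would query the web."""
    return TOY_HITS.get(query, 0)


# Bootstrap: instantiate a generic pattern with the focus class.
query = instantiate("{cls} such as", "cities")

# Assessment: a plausible instance should score higher than an implausible one.
score_paris = pmi("Paris", "city of", toy_hits)
score_banana = pmi("Banana", "city of", toy_hits)
```

Each call to pmi needs two hit counts, so every candidate instance costs several search queries per discriminator, which is one way to see why the query volume noted above becomes a concern.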