Sgopal1 writeup Domain independent extraction from the web


This is a review of Etzioni_2004_methods_for_domain_independent_information_extraction_from_the_web_an_experimental_comparison by user:sgopal1.

This paper proposes a method (KnowItAll) for extracting facts from the web in a domain-independent manner. The authors identify three ways to improve its performance: Rule Learning, Subclass Extraction, and List Extraction. The method assumes access to a search engine that can run queries and return results for parsing.

  • Rule learning : During this phase, KnowItAll tries to learn extraction patterns from the web in a domain-independent way. It issues search engine queries and learns rules from the retrieved pages. The authors propose two heuristics for effective rule selection: ban all substrings that occur exactly once, and keep only the high-precision rules.
  • Subclass extraction : They propose ways to extract subclasses of generic superclasses (e.g. microbiologist is a type of biologist). The same method as above is used, and the extracted candidates are subjected to somewhat more rigorous tests, such as checking whether the candidate is a common noun and checking its morphological variations.
  • List extraction : The goal is to extract lists of entities, such as researchers with home pages. A search engine is used to retrieve a fixed number of pages, a random subset of them is analyzed, and the HTML structure is used to identify potential entities. The instances are then simply sorted by the number of lists they appear in, and a cut-off is applied.
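The final ranking step of list extraction is simple enough to sketch. The snippet below is only an illustration of "sort by the number of lists an instance appears in and apply a cut-off", not the paper's implementation: the function name `rank_by_list_count`, the `min_lists` threshold, the lowercase normalization, and the toy input lists are all assumptions made here, and the HTML parsing that would produce the lists is left out.

```python
from collections import Counter


def rank_by_list_count(lists, min_lists=2):
    """Rank candidate instances by how many distinct lists contain them.

    `lists` is a sequence of already-extracted lists of strings
    (hypothetical input; extracting them from HTML is not shown).
    `min_lists` is an assumed cut-off, not a value from the paper.
    """
    counts = Counter()
    for items in lists:
        # Use a set so an instance repeated inside one list counts once.
        counts.update({item.strip().lower() for item in items if item.strip()})
    # Sort by descending list count, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, n) for name, n in ranked if n >= min_lists]


# Toy data: three "lists" scraped from different pages (made up).
lists = [
    ["Paris", "London", "Berlin"],
    ["paris", "Berlin", "Contact us"],
    ["Paris", "Home", "Berlin", "Berlin"],
]
result = rank_by_list_count(lists, min_lists=2)
```

With this toy input, navigation debris such as "Home" or "Contact us" appears in only one list and falls below the cut-off, which is the intuition behind ranking by list frequency.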

The results show a significant increase in the number of instances extracted.

  • The paper states that "we added another extraction-verification step" to improve recall during subclass extraction, but it is not clear what exactly this step does.
  • Is it generally considered acceptable to rely on an underlying search engine?
  • It might be possible to pipeline several different IE systems to achieve better performance. Has any work been done along these lines?