Liuliu writeup of Etzioni 2004

From Cohen Courses
Revision as of 10:42, 3 September 2010 by WikiAdmin (talk | contribs) (1 revision)

This is a review of Etzioni_2004_methods_for_domain_independent_information_extraction_from_the_web_an_experimental_comparison by user:Liuliu.

This paper proposes three methods to increase the recall and extraction rate of the KnowItAll system: Rule Learning, Subclass Extraction, and List Extraction.

KnowItAll is an information extraction system that extracts facts from the web. It follows a generate-and-test design with two parts, an Extractor and an Assessor: the Extractor generates a set of candidate facts, and the Assessor ranks them using pointwise mutual information (PMI). The system is largely domain independent: it relies on generic extraction patterns and is not a supervised system. KnowItAll achieves very high precision but low recall, so the question this paper addresses is how to increase recall while keeping precision high.
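The test half of generate-and-test can be illustrated with a minimal sketch. It assumes PMI is approximated by a ratio of search hit counts: pages that mention a candidate together with a class phrase, divided by pages that mention the candidate at all. The writeup only states that PMI is used, so this ratio, the function names, and the hit counts are illustrative assumptions rather than the system's actual assessor.

```python
def pmi_score(hits_with_phrase: int, hits_alone: int) -> float:
    """Hit-count ratio used here as a stand-in for PMI (an assumption)."""
    if hits_alone <= 0:
        return 0.0  # a candidate that is never seen gets the lowest score
    return hits_with_phrase / hits_alone


def rank_candidates(hit_counts):
    """Sort candidates from highest to lowest score."""
    scored = [
        (name, pmi_score(with_phrase, alone))
        for name, (with_phrase, alone) in hit_counts.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up hit counts: (hits with a class phrase such as "cities", hits alone).
HYPOTHETICAL_HITS = {
    "Paris": (900, 10000),
    "Tokyo": (400, 8000),
    "Banana": (1, 5000),
}

ranking = rank_candidates(HYPOTHETICAL_HITS)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

A candidate that rarely co-occurs with the class phrase, relative to how often it appears overall, falls to the bottom of the ranking. This is how a ranking step can keep precision high even when the extractor over-generates.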

  • Rule Learning - how to add more domain-specific rules to the system
  - uses the context strings around known instances to extract context patterns
  - selects the best patterns based on two heuristics
  • Subclass Extraction
  - as the results show, subclass extraction only works well for decomposable classes, where text usually refers to their named subclasses.
  • List Extraction
  - This is the most effective of the three methods at increasing recall.
  - It extracts regularly formatted lists that consist of class members.
  - Extracted elements are evaluated based on the correlation between accurate lists and accurate elements.
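The List Extraction assessment described in the last bullet can be sketched roughly as follows. The sketch assumes a simple proxy for the correlation between accurate lists and accurate elements: an element found in many independently extracted lists is more likely to be a true class member. The scoring rule, the function name, and the sample lists are hypothetical and are not taken from the paper.

```python
from collections import Counter


def score_by_list_support(lists):
    """Rank items by the number of distinct lists that contain them."""
    counts = Counter()
    for items in lists:
        for item in set(items):  # each list counts at most once per item
            counts[item] += 1
    # Highest support first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up lists, as if extracted from three regularly formatted web pages.
HYPOTHETICAL_LISTS = [
    ["Chicago", "Boston", "Seattle"],
    ["Boston", "Chicago", "Denver"],
    ["Chicago", "Contact us"],
]

ranked = score_by_list_support(HYPOTHETICAL_LISTS)
for item, support in ranked:
    print(item, support)
```

Under this proxy, navigation debris such as "Contact us" appears in a single list and scores low, while genuine class members recur across lists. A fuller assessor would presumably also weight each list by its own estimated accuracy.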


The experiments show how much recall each of the three methods adds to the system. The results also indicate that the effect of each method depends on properties of the domain, such as whether the members of a class are frequently listed and whether the class is decomposable.