Mnduong writeup of Etzioni et al. 2004

From Cohen Courses
Revision as of 10:42, 3 September 2010 by WikiAdmin (talk | contribs) (1 revision)

This is a review of Etzioni_2004_methods_for_domain_independent_information_extraction_from_the_web_an_experimental_comparison by user:mnduong.

  • This paper introduces three methods to improve the recall of KnowItAll, a system that extracts facts from the Web.
  • The existing KnowItAll system takes a set of domain-independent extraction patterns (e.g. "<class name> such as ...") and instantiates them with the target class names to obtain a set of seed instances. It then ranks these instances using Pointwise Mutual Information.
  • The three methods are Rule Learning, Subclass Extraction, and List Extraction. Each aims to improve the recall of the system without hurting precision.
  • Rule Learning learns domain-specific rules, starting from the domain-independent ones. It extracts substrings of the context surrounding the seed instances and the class names. To ensure high precision, it uses an estimated precision measure, which heuristically classifies an instance as a negative example of a class if it is a positive example of any other class. It keeps only substrings that appear in multiple contexts.
  • Subclass Extraction aims to extract subclasses of known classes, e.g. physicist and chemist are subclasses of scientist. It uses two heuristics: the first looks for hyponyms of the known class; the second uses the same extraction patterns as the bootstrapping phase but extracts common nouns instead of proper nouns.
  • Finally, List Extraction targets formatted lists in HTML documents, using a random subset of seed instances as keywords in search queries. The intuition behind finding good lists and instances is mutually reinforcing: the more lists an instance appears in, the better the instance is, and the more good instances a list contains, the better the list is.
  • All methods were shown to outperform the existing KnowItAll system.
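
The List Extraction intuition above (good instances appear in many lists, good lists contain many good instances) can be sketched as a simple mutual-reinforcement iteration. This is only an illustration of that intuition, not necessarily the scoring used in the paper; the function name `score_lists_and_instances`, the normalization, and the toy data are all assumptions made for the example.

```python
from math import sqrt


def _normalize(scores):
    """Scale a score dict to unit Euclidean length so values stay bounded."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {key: value / norm for key, value in scores.items()}


def score_lists_and_instances(lists, iterations=20):
    """Hypothetical mutual-reinforcement scoring.

    lists: dict mapping a list id to the set of instances found in that list.
    Returns (instance_score, list_score).
    """
    instance_score = {inst: 1.0 for members in lists.values() for inst in members}
    list_score = {list_id: 1.0 for list_id in lists}

    for _ in range(iterations):
        # A list is as good as the instances it contains.
        list_score = _normalize({
            list_id: sum(instance_score[inst] for inst in members)
            for list_id, members in lists.items()
        })
        # An instance is as good as the lists it appears in.
        updated = {inst: 0.0 for inst in instance_score}
        for list_id, members in lists.items():
            for inst in members:
                updated[inst] += list_score[list_id]
        instance_score = _normalize(updated)

    return instance_score, list_score


# Toy data (invented for illustration): three lists returned by searches for cities.
toy_lists = {
    "a": {"Paris", "London", "Berlin"},
    "b": {"Paris", "London", "Tokyo"},
    "c": {"Paris", "Click here"},
}
instance_score, list_score = score_lists_and_instances(toy_lists)
ranking = sorted(instance_score, key=instance_score.get, reverse=True)
```

In this toy run, "Paris" ranks first because it appears in every list, while "Click here" ranks below "Berlin" even though both appear in a single list, because the list containing "Berlin" holds more well-supported instances.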

Questions/Comments:

  • Treating the positive examples of one class as negative examples of all other classes does not seem to work for certain pairs of classes, such as cities and sports teams, or countries and organizations.
  • In the Subclass Extraction system, "If either test holds, then the Assessor assigns the subclass a probability that is close to one." I'm not clear how the exact probability is assigned.
  • Overall, the paper gives a very detailed and easy-to-understand description of the system, together with well-motivated methods. It also provides an extensive discussion of related work for every aspect of these methods.
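
The first comment can be made concrete with a small sketch. It assumes one literal reading of the heuristic (an extraction counts as a negative for the target class whenever it is a known positive of any other class) and a plain positives-over-labeled ratio; the paper's actual estimated precision formula may differ. The function name `estimated_precision`, the class names, and the seed sets are hypothetical.

```python
def estimated_precision(extractions, target, seeds):
    """Estimate a rule's precision for `target` from known seeds only.

    seeds: dict mapping a class name to its set of known positive instances.
    Assumption: an extraction that is a positive of any other class is a negative.
    Returns None when no extraction can be labeled either way.
    """
    other_positives = set().union(
        *(members for cls, members in seeds.items() if cls != target)
    )
    negatives = [x for x in extractions if x in other_positives]
    positives = [
        x for x in extractions
        if x in seeds[target] and x not in other_positives
    ]
    labeled = len(positives) + len(negatives)
    return len(positives) / labeled if labeled else None


# A rule for City that extracts three genuine cities.
extractions = ["Chicago", "Boston", "Denver"]

# Overlapping classes: teams are often referred to by their city name.
overlapping = {
    "City": {"Chicago", "Boston", "Denver"},
    "SportsTeam": {"Chicago", "Boston", "Lakers"},
}
# Disjoint classes: no seed is shared.
disjoint = {
    "City": {"Chicago", "Boston", "Denver"},
    "SportsTeam": {"Lakers", "Red Sox"},
}

p_overlap = estimated_precision(extractions, "City", overlapping)
p_disjoint = estimated_precision(extractions, "City", disjoint)
```

Under these assumptions the same rule scores 1.0 with disjoint seeds but only about 0.33 when the two classes share instances, even though every extraction is a correct city.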