Apappu writeup on Etzioni '04

From Cohen Courses
Revision as of 11:38, 2 November 2009 by Apappu (talk | contribs)

This is a review of Etzioni_2004_methods_for_domain_independent_information_extraction_from_the_web_an_experimental_comparison by user:Apappu.

  • This paper discusses three techniques for fact mining from multi-domain unstructured data.
  • Domain-dependent rules are learned after an initial bootstrapping phase, in which generic domain-independent extraction rules (Hearst-pattern-like) are used to induce domain-dependent seeds.
  • At the time of this work, very few systems could get away without manually seeded domain-dependent data. That said, their heuristic 2 (to control false positives) seems dubious, because it could miss finer categories of an instance (e.g., Chicago the county vs. Chicago the city).

  • As a next step beyond rule learning, the paper discusses the discovery of subclasses. Subclass discovery has also been addressed by Marius Pasca and Razvan Bunescu (2004), who use snippet-based extraction techniques to discover instances and subclasses.
  • A good deal of work followed along this line in subsequent years, e.g., Turning Web Text and Search Queries into Factual Knowledge: Hierarchical Class Attribute Extraction (Pasca '08) and then Outclassing Wikipedia in Open-Domain Information Extraction: Weakly-Supervised Acquisition of Attributes over Conceptual Hierarchies (Pasca '09).
  • The next task is extracting enumerations of instances. List extraction is a Google Sets-style problem, where regularly or semi-regularly formatted text contains a list of instances that belong to a class. The authors lay down certain checkpoints to verify the membership of an instance in a class.
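To make the bootstrapping bullet above concrete, here is a minimal sketch of a Hearst-style pattern matcher. This is an illustration of the general technique, not KnowItAll's actual implementation; the pattern string and helper names are my own.

```python
import re

# One Hearst-style rule: "<class> such as NP, NP and NP" yields
# candidate (class, instance) pairs. Real systems use many such
# patterns plus a capitalization/noun-phrase chunker.
PATTERN = re.compile(
    r"(?P<cls>\w+) such as "
    r"(?P<items>[A-Z]\w*(?:, [A-Z]\w*)*(?:,? and [A-Z]\w*)?)"
)

def extract_instances(text):
    """Return (class, instance) pairs matched by the Hearst pattern."""
    pairs = []
    for m in PATTERN.finditer(text):
        cls = m.group("cls")
        # Split the conjoined noun phrases on ", " or "(,) and ".
        items = re.split(r",? and |, ", m.group("items"))
        pairs.extend((cls, item) for item in items)
    return pairs

print(extract_instances("We visited cities such as Chicago, Seattle and Boston."))
# → [('cities', 'Chicago'), ('cities', 'Seattle'), ('cities', 'Boston')]
```

In the bootstrapping setting, the extracted pairs would then serve as domain-dependent seeds for learning more specific rules.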
  • Comments & Questions:
 * How do they manage to adjust instance categorization temporally or topically? They claim to work at the domain level, 
   which is still a coarse level of discrimination.
 * They didn't talk about grouping multi-word entities into equivalence classes, e.g., putting Pittsburgh, PA and Pittsburgh into the same bin.
 * I couldn't understand why they have to rely on WordNet to assess subclasses (for candidature as a hyponym) when they could use 
   Wikipedia, which provides somewhat noisy but decent subclass-candidate information.
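On the equivalence-class question raised above, a toy sketch of what such grouping might look like: normalize each mention by stripping a trailing state qualifier, then bin mentions by the normalized key. The heuristic, the state list, and the function names are all hypothetical, not anything from the paper.

```python
from collections import defaultdict

US_STATES = {"PA", "CA", "IL", "WA"}  # illustrative subset, not exhaustive

def normalize(mention):
    """Strip a trailing ', <state>' qualifier so variant surface forms
    map to one canonical key (a toy heuristic, not the paper's method)."""
    head, sep, tail = mention.rpartition(", ")
    if sep and tail in US_STATES:
        return head
    return mention

def group_mentions(mentions):
    """Bin surface forms into equivalence classes keyed by canonical form."""
    bins = defaultdict(list)
    for m in mentions:
        bins[normalize(m)].append(m)
    return dict(bins)

print(group_mentions(["Pittsburgh, PA", "Pittsburgh", "Chicago, IL"]))
# → {'Pittsburgh': ['Pittsburgh, PA', 'Pittsburgh'], 'Chicago': ['Chicago, IL']}
```

A real system would of course need fuzzier matching (abbreviations, aliases, misspellings), but even this trivial canonicalization would merge the Pittsburgh variants mentioned above.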