Difference between revisions of "Apappu writeup on Etzioni '04"

Latest revision as of 11:42, 3 September 2010

This is a review of Etzioni_2004_methods_for_domain_independent_information_extraction_from_the_web_an_experimental_comparison by user:Apappu.

This paper discusses three techniques for mining facts from multi-domain unstructured data: rule learning, subclass extraction, and list extraction.

Domain-dependent rules are learned after an initial bootstrapping phase, in which generic, domain-independent patterns (Hearst-like patterns) are used to induce domain-dependent seeds.
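To make the bootstrapping idea concrete, here is a minimal sketch of one Hearst-like pattern ("<class> such as X, Y and Z") applied to raw text. The function name `hearst_such_as`, the regular expression, and the sample sentence are illustrative assumptions, not the patterns or code used in the paper; the point is only that the class name is the sole domain-specific input.

```python
import re

def hearst_such_as(class_plural, text):
    """Return capitalized phrases that follow '<class_plural> such as'.

    Hypothetical helper: one generic template, instantiated with a class name.
    """
    # A candidate instance: one or more capitalized words ("New York").
    name = r"[A-Z][\w-]*(?: [A-Z][\w-]*)*"
    pattern = re.compile(
        rf"\b{re.escape(class_plural)} such as "
        rf"((?:{name})(?:(?:, and | and |, ){name})*)"
    )
    instances = []
    for match in pattern.finditer(text):
        # Split the enumerated phrase on its separators.
        for item in re.split(r", and | and |, ", match.group(1)):
            if item and item not in instances:
                instances.append(item)
    return instances

sample = "We visited cities such as Chicago, New York and Pittsburgh last year."
print(hearst_such_as("cities", sample))
```

The extracted instances could then serve as seeds from which more specific, domain-dependent rules are learned.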

At the time of this work, very few systems could get away without manually seeded, domain-dependent data. That said, their heuristic 2 (to control false positives) seems dubious, because it could miss finer-grained categories of an instance (for example, Chicago the county vs. Chicago the city).

As a next step after rule learning, the paper discusses the discovery of subclasses. Subclass discovery has also been addressed by Marius Pasca and Razvan Bunescu 2004, who use snippet-based extraction techniques to discover instances and subclasses.

A lot of work followed along this line in subsequent years, e.g. (Turning Web Text and Search Queries into Factual Knowledge: Hierarchical Class Attribute Extraction, Pasca '08) and then (Outclassing Wikipedia in Open-Domain Information Extraction: Weakly-Supervised Acquisition of Attributes over Conceptual Hierarchies, Pasca '09).

The next task is extracting enumerations of instances. List extraction is a Google Sets kind of problem, where regularly or semi-regularly formatted text contains a list of instances that belong to a class. The authors lay down certain checks to verify the membership of an instance in a class.
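As a rough illustration of the list-extraction setting, the sketch below assumes the lists have already been pulled out of the pages and that a few known instances (seeds) are available. The function `extract_from_lists`, the `min_seed_hits` threshold, and the voting scheme are assumptions made for this example; they are not the membership checks defined in the paper.

```python
from collections import Counter

def extract_from_lists(candidate_lists, seeds, min_seed_hits=2):
    """Score non-seed items by how many seed-bearing lists contain them."""
    seeds = set(seeds)
    votes = Counter()
    for items in candidate_lists:
        unique_items = set(items)
        # Trust a list only if enough known instances appear in it.
        if len(unique_items & seeds) >= min_seed_hits:
            votes.update(unique_items - seeds)
    return votes

lists = [
    ["Chicago", "Boston", "Denver"],
    ["Chicago", "Boston", "Seattle", "Denver"],
    ["Chicago", "Bulls", "Bears"],  # only one seed, so this list is ignored
]
votes = extract_from_lists(lists, seeds=["Chicago", "Boston"])
print(votes.most_common())
```

Items that recur across several seed-bearing lists get more votes, which gives a simple signal for class membership.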

Comments & Questions:

 * How do they manage to adjust instance categorization temporally or topically? They claim to work at the domain level, 
   but that is still a coarse level of discrimination.
 * They don't discuss multi-word entities or grouping them into equivalence classes; e.g. "Pittsburgh, PA" and "Pittsburgh" should fall into the same bin.
 * I couldn't understand why they have to rely on WordNet to assess subclasses (candidacy as a hyponym) when they could use 
   Wikipedia, which provides slightly noisy but decent subclass candidate information.

Revision as of 12:38, 2 November 2009 by Apappu; latest revision as of 11:42, 3 September 2010 by WikiAdmin (minor edit, 1 revision).