Apappu writeup on Etzioni '04
This is a review of Etzioni_2004_methods_for_domain_independent_information_extraction_from_the_web_an_experimental_comparison by user:Apappu.
- This paper discusses three techniques for fact mining from multi-domain unstructured data.
- Domain-dependent rules are learned after an initial bootstrapping phase in which generic, domain-independent patterns (Hearst-style patterns) are used to induce domain-dependent seeds (a toy sketch of this step appears at the end of this writeup).
- At the time of this work, very few systems could get by without manually seeded, domain-dependent data. That said, their heuristic 2 (meant to control false positives) seems dubious because it could miss finer categories of an instance (say, Chicago the county vs. Chicago the city).
- As a next step beyond rule learning, the paper discusses the discovery of subclasses. Subclass discovery has also been addressed by Marius Pasca and Razvan Bunescu (2004), who use snippet-based extraction techniques to discover instances and subclasses.
- A lot of work followed along this line in subsequent years, e.g., "Turning Web Text and Search Queries into Factual Knowledge: Hierarchical Class Attribute Extraction" (Pasca '08) and then "Outclassing Wikipedia in Open-Domain Information Extraction: Weakly-Supervised Acquisition of Attributes over Conceptual Hierarchies" (Pasca '09).
- The next task is extracting enumerations of instances. List extraction is a Google Sets kind of problem, where regularly or semi-regularly formatted text contains a list of instances that belong to a class. The authors lay down certain checks to verify the membership of an instance in a class (see the second sketch at the end of this writeup).
- Comments & Questions:
* How do they manage to adjust instance categorization temporally or topically? They claim to work at the domain level, but that is still a coarse level of discrimination.
* They did not talk about multi-word entities or about grouping them into equivalence classes, e.g., putting "Pittsburgh, PA" and "Pittsburgh" into the same bin.
* I could not understand why they have to rely on WordNet to assess subclasses (for candidature as a hyponym) when they could use Wikipedia, which provides somewhat noisy but decent subclass-candidate information.
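To make the bootstrapping step in the summary above concrete, here is a minimal sketch of how generic, domain-independent Hearst-style patterns can be instantiated with a class name to yield domain-dependent rules whose matches become candidate seeds. This is not the paper's actual KnowItAll implementation: the patterns, function names, and toy corpus below are illustrative assumptions.

```python
import re

# Two generic, domain-independent Hearst-style patterns.  "{cls}" is filled
# in with the (plural) surface form of a class to yield domain-dependent
# extraction rules; the captured NP is a candidate instance (seed).
# For brevity, only the first capitalised NP after the cue is captured;
# a real extractor would also split coordinated lists ("X, Y, and Z").
GENERIC_PATTERNS = [
    r"(?i:{cls})\s+(?:such as|like|including)\s+(?P<inst>[A-Z][\w.&-]*(?:\s+[A-Z][\w.&-]*)*)",
    r"(?P<inst>[A-Z][\w.&-]*(?:\s+[A-Z][\w.&-]*)*)\s+and\s+other\s+(?i:{cls})",
]

def induce_seeds(class_plural, sentences):
    """Instantiate the generic patterns for one class and collect the
    candidate instances (domain-dependent seeds) they extract."""
    rules = [re.compile(p.format(cls=re.escape(class_plural)))
             for p in GENERIC_PATTERNS]
    seeds = set()
    for sent in sentences:
        for rule in rules:
            for match in rule.finditer(sent):
                seeds.add(match.group("inst"))
    return seeds

if __name__ == "__main__":
    corpus = [
        "Cities such as Seattle saw record rainfall this year.",
        "He visited Chicago and other cities in the Midwest.",
        "Scientists like Einstein changed physics forever.",
    ]
    print(induce_seeds("cities", corpus))      # e.g. {'Seattle', 'Chicago'}
    print(induce_seeds("scientists", corpus))  # e.g. {'Einstein'}
```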
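The membership checks mentioned in the list-extraction bullet can be sketched in a similarly simplified way. Candidates harvested from a semi-regularly formatted list are scored by how often they co-occur with class "discriminator" phrases, in the spirit of the paper's PMI-IR style assessment, except that search-engine hit counts are replaced here by substring counts over a toy corpus. The discriminator phrases, threshold, and corpus are assumptions for illustration.

```python
import re

# Toy stand-in for the web: real systems would use search-engine hit counts.
CORPUS = " ".join([
    "The city of Seattle and the city of Boston are on opposite coasts.",
    "Tokyo is a large city in Japan.",
    "Ctrl is a key on the keyboard, not a city.",
    "Seattle Boston Tokyo Ctrl appear together in this noisy list page.",
]).lower()

def count(phrase):
    """Number of occurrences of a phrase in the toy corpus."""
    return len(re.findall(re.escape(phrase.lower()), CORPUS))

def pmi_ir_score(instance, discriminator):
    """Ratio of joint occurrences to instance occurrences, in the spirit of
    a PMI-IR assessor, e.g. discriminator 'city of {x}' -> 'city of seattle'."""
    joint = count(discriminator.format(x=instance))
    alone = count(instance)
    return joint / alone if alone else 0.0

def assess_list(candidates, discriminators, threshold=0.2):
    """Keep list-extracted candidates whose average discriminator score
    clears a hand-set threshold."""
    kept = []
    for cand in candidates:
        score = sum(pmi_ir_score(cand, d) for d in discriminators) / len(discriminators)
        if score >= threshold:
            kept.append((cand, round(score, 2)))
    return kept

if __name__ == "__main__":
    # Candidates harvested from a semi-regularly formatted list ("Google Sets" style).
    candidates = ["Seattle", "Boston", "Tokyo", "Ctrl"]
    discriminators = ["city of {x}", "{x} is a large city"]
    print(assess_list(candidates, discriminators))
    # -> [('Seattle', 0.25), ('Boston', 0.25), ('Tokyo', 0.25)]; 'Ctrl' is rejected.
```

The point of the check is simply that items which merely co-occur on a list page, but never appear with class-indicative phrases elsewhere, get filtered out.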