Mnduong writeup of Wang & Cohen 2009

From Cohen Courses
Revision as of 10:42, 3 September 2010 by WikiAdmin (talk | contribs) (1 revision)

This is a review of Wang_2009_automatic_set_instance_extraction_using_the_web by user:mnduong.

  • This paper introduces ASIA, which extends SEAL so that it can extract a set of instances given only the name of their semantic class, rather than a few seed instances. Starting from the class name, it builds queries from hyponym patterns, which are the only language-dependent part of the system; for English, it uses the hyponym phrases introduced by Hearst '92. The system sends the queries to a search engine and extracts noisy candidate seeds from the returned snippets. Candidates are ranked by their snippet frequency, their excerpt frequency (each snippet contains multiple excerpts), and their distance from the hyponym phrase. If no results are found, the backoff strategy is to use the class name as the hyponym phrase.
  • The system then uses a noise-resistant version of SEAL to extract further instances, ranking them with Random Walk with Restart. It differs from the original SEAL in four ways: (1) the Fetcher queries with every pair of seeds instead of all seeds at once, to limit the effect of noisy seeds; (2) it adds a hint word (instantiated as the class name) to the seeds in each query; (3) the Extractor requires the contexts to bracket at least two seeds instead of one; and (4) a larger set of seeds is used for extracting wrappers than for fetching documents.
  • Finally, the system uses bootstrapping to iterate the process, each time adding the highest-scoring extracted candidate to the set of seeds and capping at 4 seeds.
  • The system was shown to outperform a method by Kozareva et al. on one task and achieve comparable results on two others, while using less information (only the class name, whereas the other method uses the class name and one seed) and running much faster (3 minutes vs. overnight). It also outperformed a system by Pasca that uses country names as seeds.
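
The two query-construction steps summarized above (hyponym-pattern queries built from the class name, and the noise-resistant Fetcher's pairwise seed queries with a hint word) can be sketched roughly as follows. This is only an illustration, not the paper's implementation: the function names, the quoting of query terms, and the particular subset of Hearst-style patterns are assumptions, since the writeup does not list the exact patterns or the query syntax.

```python
from itertools import combinations

# Illustrative subset of Hearst (1992) hyponym patterns; "{c}" marks where
# the class name goes. Assumption: the exact pattern list used by the
# system is not given in this writeup.
HYPONYM_PATTERNS = [
    "{c} such as",
    "such {c} as",
    "{c} including",
    "{c} especially",
    "and other {c}",
    "or other {c}",
]


def hyponym_queries(class_name):
    """Build one quoted phrase query per hyponym pattern (hypothetical helper)."""
    return ['"' + pattern.format(c=class_name) + '"' for pattern in HYPONYM_PATTERNS]


def pairwise_seed_queries(seeds, hint):
    """Build one query per unordered pair of seeds, plus the hint word.

    Querying with pairs rather than all seeds at once means a single noisy
    seed only affects the queries it appears in (hypothetical helper).
    """
    return [f'"{a}" "{b}" {hint}' for a, b in combinations(seeds, 2)]


if __name__ == "__main__":
    for query in hyponym_queries("car makers"):
        print(query)
    for query in pairwise_seed_queries(["ford", "nissan", "toyota"], "car makers"):
        print(query)
```

With three seeds this yields three pairwise queries; with n seeds it yields n(n-1)/2, so the number of fetches grows quadratically with the number of seeds used for fetching.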

Questions/Comments:

  • For other languages where we don't know any hyponym patterns, would it be fair to assume a dictionary, use the English hyponym patterns to find some candidate seeds, and then use their translations as seeds? The idea is that seeds are assumed to be noisy anyway, so we don't necessarily need perfect translations.