Nschneid writeup of Wang ACL 2009
This is Nschneid's review of Wang_2009_automatic_set_instance_extraction_using_the_web
This paper describes the ASIA system for automatic set extraction given a category name in any language. In essence, it works as follows:
- Inputs:
- Hearst-style patterns, manually listed for the language of interest. These patterns should generalize to most categories, e.g. the pattern "C such as I", where C is the category name and I is an instance.
- A category name, e.g. cars.
- Procedure:
- Use the Hearst-style patterns to find a seed set of instances of the category.
- Use SEAL to find more instances based on this seed set (iteratively look for contextual patterns for these instances in the HTML sources of Web pages).
- Output:
- Instances of the category
- One of the nice features of ASIA is that it does not rely on language-specific tools (beyond the short list of patterns), or even on many seed examples.
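To make the first step of the procedure concrete, here is a minimal Python sketch of seed finding with a single Hearst-style pattern ("C such as I"). It is not the ASIA implementation: the function name `find_seed_instances`, the `top_k` parameter, and the frequency-based ranking are all assumptions for illustration, and the input is a plain list of text snippets rather than Web search results. The SEAL expansion step is left out entirely.

```python
import re
from collections import Counter


def find_seed_instances(category, snippets, top_k=5):
    """Rank candidate instances matched by "<category> such as A, B and C".

    Hypothetical helper: candidates are ranked by raw match count,
    which is an assumption and not the scoring used in the paper.
    """
    # Simplification: the enumeration is assumed to run up to the next
    # sentence-level punctuation mark, so trailing context is not trimmed.
    pattern = re.compile(
        rf"\b{re.escape(category)}\s+such\s+as\s+([^.;:!?]+)",
        re.IGNORECASE,
    )
    counts = Counter()
    for text in snippets:
        for match in pattern.finditer(text):
            # Split the enumeration "A, B and C" into single candidates.
            for item in re.split(r",|\band\b|\bor\b", match.group(1)):
                item = item.strip()
                if item:
                    counts[item.lower()] += 1
    return [instance for instance, _ in counts.most_common(top_k)]


if __name__ == "__main__":
    toy_snippets = [
        "We sell cars such as Toyota, Honda and Ford.",
        "He likes cars such as Honda and BMW.",
    ]
    print(find_seed_instances("cars", toy_snippets))
```

In a fuller pipeline, the returned seeds would be handed to a set expander such as SEAL; only one pattern and one language are covered here.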
- Evaluations: ASIA compares favorably to previous approaches.
- We evaluated our approach using the evaluation set presented in (Wang and Cohen, 2007), which contains 36 manually constructed lists across three different languages: English, Chinese, and Japanese (12 lists per language). ...
- We also compare ASIA ... to the extended Wordnet 2.1 produced by Snow et al (Snow et al., 2006), and show that for these twelve sets, ASIA produces more than five times as many set instances with much higher precision (98% versus 70%).
- An interesting observation about the specificity of category names:
- for the three classes: movie, person, and video game, ASIA did not initially converge to the correct instance list given the most natural concept name. Given “movies”, ASIA returns as instances strings like “comedy”, “action”, “drama”, and other kinds of movies. Given “video games”, it returns “PSP”, “Xbox”, “Wii”, etc. Given “people”, it returns “musicians”, “artists”, “politicians”, etc. We addressed this problem by simply re-running ASIA with a more specific class name (i.e., the first one returned); however, the result suggests that future work is needed to support automatic construction of hypernym hierarchy using semi-structured web documents.
- I wonder if a variant of this technique could be used to find the hypernym hierarchy: e.g., for each instance returned for a category, see if that in turn seems to be a category for many other instances.
- Another idea: use named entity information (if available) to distinguish categories from entities.
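The hypernym-hierarchy idea above could be sketched roughly as follows. This is only an illustration of the reviewer's suggestion, not something from the paper: `build_hierarchy`, `min_children`, and `max_depth` are hypothetical names, and `extract` stands in for any instance extractor (for example, a call into an ASIA-like system). The toy extractor uses made-up placeholder data.

```python
def build_hierarchy(category, extract, min_children=3, max_depth=2):
    """Treat a returned instance as a subcategory if it has many instances.

    `extract` is an assumed callable mapping a category name to a list of
    instance strings. `min_children` is an arbitrary threshold.
    """
    tree = {}
    for instance in extract(category):
        # An instance counts as a category when re-running extraction on it
        # yields at least `min_children` results (and depth remains).
        is_category = max_depth > 1 and len(extract(instance)) >= min_children
        if is_category:
            tree[instance] = build_hierarchy(
                instance, extract, min_children, max_depth - 1
            )
        else:
            tree[instance] = {}
    return tree


if __name__ == "__main__":
    # Placeholder data standing in for real extraction results.
    toy_results = {
        "movies": ["comedy", "drama"],
        "comedy": ["film A", "film B", "film C"],
        "drama": ["film D"],
    }
    print(build_hierarchy("movies", lambda name: toy_results.get(name, [])))
```

A count threshold alone would likely be noisy in practice; named entity information, as suggested above, could serve as an additional signal for deciding whether a string is a category or an entity.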