Sgardine writesup Wang Cohen 2007


This is a review of Wang_2007_language_independent_set_expansion_of_named_entities_using_the_web by user:Sgardine.

Summary

SEAL is a system for expanding a set of seed instances into a larger set of instances from some desired class containing all seed instances. The Fetcher retrieves documents containing all seeds (via Google). The Extractor learns one or more wrappers for each page and executes them to produce additional mentions of potential instances. The Ranker builds a graph of nodes (seeds ∪ docs ∪ wrappers ∪ mentions) with edges defined by the relations used to discover the nodes (and their inverses), then simulates a large number of random walks over that graph. Mentions are ranked by their weights after the random walks. The system is language-independent.
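For my own reference, here is roughly how I read the Ranker's random-walk idea: a minimal sketch assuming a simple undirected adjacency-list graph, uniform transition probabilities, and a fixed restart probability. The node names, parameters, and visit-count scoring below are illustrative, not the paper's actual implementation.

 import random
 from collections import defaultdict
 
 def rank_nodes(edges, seeds, walks=10000, restart=0.15, max_steps=50):
     """Rank nodes by how often random walks started at the seeds visit them.
 
     edges: iterable of (src, dst) pairs; the inverse edge is added as well,
     mirroring the use of each discovery relation and its inverse.
     """
     graph = defaultdict(list)
     for src, dst in edges:
         graph[src].append(dst)
         graph[dst].append(src)  # inverse relation
 
     visits = defaultdict(int)
     for _ in range(walks):
         node = random.choice(seeds)
         for _ in range(max_steps):
             # Ending the walk here is equivalent to restarting, since every
             # walk begins at a (uniformly chosen) seed node.
             if random.random() < restart or not graph[node]:
                 break
             node = random.choice(graph[node])
             visits[node] += 1
 
     total = sum(visits.values()) or 1
     return sorted(((n, c / total) for n, c in visits.items()),
                   key=lambda kv: -kv[1])
 
 # Toy usage: seeds connect to docs, docs to wrappers, wrappers to mentions.
 # In SEAL one would keep only the "mention:" nodes from the ranked output.
 edges = [
     ("seed:ford", "doc:d1"), ("seed:nissan", "doc:d1"),
     ("doc:d1", "wrapper:w1"),
     ("wrapper:w1", "mention:toyota"), ("wrapper:w1", "mention:honda"),
 ]
 print(rank_nodes(edges, ["seed:ford", "seed:nissan"]))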

The system is evaluated on English, Japanese, and Chinese against the baseline of Google Sets (English only), as well as against presumably weaker configurations of the system: one with ranks determined by extraction frequency rather than by the graph walk, and one further handicapped with a simpler wrapper builder. All configurations outperform Google Sets handily, quite apart from handling multiple languages. Most of the gain over Google Sets comes from adopting the explicit ranking structure, but a fair amount still comes from the better wrappers and from the graph walk. Minor improvements come from retrieving more query results, though these diminish quickly (a greater improvement from increasing 100 to 200 results than from 200 to 300).

Commentary

Minor point: what about the system E1+GW? Its performance relative to E2+EF would tell us how much work the graph walk stuff is doing relative to the better wrappers.

Something about the whole problem seems imprecise. Given a set of seeds S and a target set T*, we have that S ⊆ T*, but we also have that S ⊆ T for every superset T ⊃ T*, and S ⊆ T' for many subsets T' ⊂ T*. It seems like the ability to be extracted by a SEAL-like system is itself an interesting characteristic of the set T*: some alternative supersets of S are obviously contrived (U.S. cities not starting with the letter Q), some are equally plausible (a list of children's movies is also a list of movies), and some are more natural (a list of U.S. state capitals is quite likely to be (mis-)interpreted as a list of U.S. cities).

The flip side of this is that constructing a set of seeds becomes non-trivial, since you have to make sure to (try to) include representatives of most "reasonable" partitions of your target concept (e.g. if your target is a list of cities in the world, making sure you represent a variety of countries, continents, hemispheres, etc.).

I guess a way to characterize the sets would be something along the lines of: choosing the complete set T*, sampling a set of seeds from it, running the algorithm on the seeds, and repeating over several seed samples. This should track fairly closely with SEAL's performance on the different sets while depending less on any one initial choice of seeds; a sketch of the protocol follows.
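A minimal sketch of that protocol, assuming a callable expand(seeds) that stands in for a SEAL-like expander and using precision at |T*| as the score; all names and parameters here are hypothetical.

 import random
 
 def expandability(target_set, expand, trials=20, num_seeds=3):
     """Estimate how recoverable a complete set T* is from small seed samples.
 
     target_set: the full set T* of instances.
     expand: a callable standing in for a SEAL-like expander; takes a list of
         seeds and returns a ranked list of candidate instances (hypothetical).
     Returns mean precision@|T*| over repeated random seed draws, so the score
     reflects the set itself rather than one lucky or unlucky seed choice.
     """
     target = set(target_set)
     scores = []
     for _ in range(trials):
         seeds = random.sample(sorted(target), num_seeds)
         ranked = expand(seeds)[: len(target)]
         hits = sum(1 for candidate in ranked if candidate in target)
         scores.append(hits / len(target))
     return sum(scores) / len(scores)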