Sgardine writesup Wang Cohen 2008

This is a review of Wang_2008_iterative_set_expansion_of_named_entities_using_the_web by user:Sgardine

Summary

The SEAL system, presented previously, performs worse when given more seeds, since it relies on finding pages that contain all of the seeds. Iterative SEAL (iSEAL) addresses this by calling SEAL as a subroutine over several iterations, each with a small seed set. In the iterative supervised expansion variant, a random subset of the supervised seeds is used in each iteration. Two subvariants differ in seed-set size: Fixed Seed Size (FSS) keeps the size fixed at 2, while Increasing Seed Size (ISS) lets it grow up to a bound, introducing only one new random instance from the supervised seeds in each iteration. Bootstrapping instead uses the top two newly discovered instances as seeds in each iteration. Several rankers are evaluated: the graph-based random walk of SEAL, PageRank on the undirected graph, Bayesian Sets, and a simple ad hoc ranking scheme (Wrapper Length) that considers longer extracted wrappers to be better.
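The two seed-selection schemes can be sketched roughly as follows. This is a minimal illustration based only on the description above, not the authors' code: the function names, the starting size of 2 for the increasing variant, and the default bound of 4 are assumptions.

```python
import random
from typing import List, Sequence


def fss_seeds(pool: Sequence[str], rng: random.Random, size: int = 2) -> List[str]:
    """Fixed seed size: draw `size` random seeds from the pool each iteration."""
    return rng.sample(list(pool), min(size, len(pool)))


def iss_seeds(used: List[str], pool: Sequence[str], rng: random.Random,
              max_size: int = 4) -> List[str]:
    """Increasing seed size: introduce one new seed per iteration, up to a bound.

    `used` holds the seeds introduced so far and is updated in place.
    Assumption: the first iteration starts with two seeds.
    """
    unused = [s for s in pool if s not in used]
    n_new = 2 if not used else 1
    new = rng.sample(unused, min(n_new, len(unused)))
    # Fill the remaining slots with previously used seeds, respecting the bound.
    old = rng.sample(used, min(len(used), max_size - len(new)))
    used.extend(new)
    return old + new


if __name__ == "__main__":
    rng = random.Random(0)
    pool = ["ford", "toyota", "honda", "nissan", "bmw", "audi"]
    used: List[str] = []
    for _ in range(5):
        print(len(iss_seeds(used, pool, rng)), len(fss_seeds(pool, rng)))
```

In the bootstrapping condition the pool would be extended with newly extracted instances rather than drawn from supervised seeds; that step is omitted here because it depends on the ranker.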

The system is evaluated on the same sets as previously, again by mean average precision (MAP). In the supervised condition, FSS improves more quickly than ISS, presumably since it introduces more seeds sooner. In the bootstrapping condition, FSS fails to improve quickly, presumably because it boldly introduces learned instances that may include noise. ISS, in contrast, chooses only one newly learned instance per iteration, thus introducing noise with lower probability. The simple, lightweight Wrapper-Length scheme also did fairly well.
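For reference, the textbook form of MAP can be computed as in the sketch below. This is the standard definition; the paper's exact variant (for instance, how it treats multiple surface forms of the same entity) may differ, and the function names are illustrative.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Iterable[str]) -> float:
    """Average of precision@k over the ranks k at which a relevant item appears."""
    relevant_set: Set[str] = set(relevant)
    if not relevant_set:
        return 0.0
    seen: Set[str] = set()
    hits = 0
    total = 0.0
    for k, item in enumerate(ranked, start=1):
        # Count each relevant item once, even if the ranking repeats it.
        if item in relevant_set and item not in seen:
            seen.add(item)
            hits += 1
            total += hits / k
    return total / len(relevant_set)


def mean_average_precision(runs: List[Tuple[Sequence[str], Iterable[str]]]) -> float:
    """Mean of average precision over (ranked list, gold set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, g) for r, g in runs) / len(runs)
```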

Commentary

The approach assumes that the Fetcher is immutable. If we were instead to retrieve documents containing any seed and then rank them by how many seeds they contain, we could approach the motivating problem (degraded performance with more seeds) more directly.
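A minimal sketch of that alternative fetcher is given below, assuming the candidate pages are already available as a mapping from URL to text. The function name and the case-insensitive substring matching are illustrative assumptions, not part of SEAL.

```python
from typing import Dict, List, Sequence, Tuple


def rank_by_seed_coverage(documents: Dict[str, str],
                          seeds: Sequence[str]) -> List[Tuple[str, int]]:
    """Keep documents containing at least one seed, ranked by distinct seeds matched."""
    lowered = {s.lower() for s in seeds}
    scored: List[Tuple[str, int]] = []
    for url, text in documents.items():
        body = text.lower()
        count = sum(1 for s in lowered if s in body)
        if count > 0:
            scored.append((url, count))
    # Most seeds first; ties broken by URL so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored
```

Pages that mention every seed still rank first, but pages that mention only some of them are no longer discarded outright.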

The number of instances typically listed on a single page was experimentally determined to average about 4; this average, however, is global across all target sets. As an alternative to ISS and FSS, we might consider a scheme in which the seed set grows to a maximum determined by the particular target: e.g. stop growing the seed set when our query results start getting starved (falling below some threshold, which could be determined by the same experiments that yielded the magic number of 4).
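The proposed stopping rule might look something like the sketch below. Here `count_results` is a hypothetical callback that reports how many pages a query for the given seeds returns, and `min_results` is the starvation threshold; both are assumptions introduced for illustration.

```python
from typing import Callable, List, Sequence


def grow_seed_set(candidates: Sequence[str],
                  count_results: Callable[[Sequence[str]], int],
                  min_results: int,
                  start_size: int = 2) -> List[str]:
    """Add seeds one at a time until the query would return too few results."""
    seeds = list(candidates[:start_size])
    for candidate in candidates[start_size:]:
        trial = seeds + [candidate]
        if count_results(trial) < min_results:
            break  # query is starved: keep the last seed set that was not
        seeds = trial
    return seeds
```

With this rule the maximum seed-set size is chosen per target set rather than fixed globally.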