Mnduong writeup of Wang & Cohen 2008

From Cohen Courses
Jump to navigationJump to search

This is a review of Wang_2008_iterative_set_expansion_of_named_entities_using_the_web by user:mnduong.

  • This paper introduces solutions to a problem in SEAL, where the performance declines as the number of given seeds increases past 4. The problem is caused by the fact that it's hard to find documents containing all the provided seeds. The two proposed solutions both circumvent this problem by iteratively finding candidates using different sets of seeds, all having at most 4 seeds.
  • The two approaches are supervised expansion and bootstrapping. Supervised expansion always gets its seeds from a set of supervised seeds, provided by human in the beginning. Bootstrapping uses the highest scoring seed(s) the previous iteration.
  • For each method, there are two ways to change the seed set after each iteration, Fixed Seed Size and Increasing Seed Size. FSS always uses 2 seeds, and ISS increases the seed set size by 1 until a maximum number is reached (chosen empirically to be 4), after which point the size remains unchanged.
  • The paper also compares different ranking algorithms of the candidate sets. Random walk with restart is similar to the lazy random walk method used in the original SEAL, but instead of using a probability of staying at each node, it uses a small probability of teleporting to the original seed. Walks are taken until convergence. PageRank is similar to the algorithm introduced by Page & Brin '98, on the undirected version of the same graph used in RW. Bayesian Set is an approach that formulates the set expansion problem as a Bayesian inference problem. The paper also experimented with two fast methods, using the Wrapper Length (longer is better) and Wrapper Frequency to rank.
  • The paper shows that for Supervised Expansion, FSS works best, and both Bayesian Set and Random Walk give good results. For Bootstrapping, ISS works better because it is more conservative, using only the highest scoring candidate from the previous iteration, as opposed to the top 2. For bootstrapping, Random Walk was shown to perform the best, and to be more robust to noise, being the only one to improve (even though only slightly) after 10 iterations.

Comments/Questions:

  • How does the baseline Wrapper Frequency deal with ties? It's obviously more likely to encounter ties if the only information used for ranking is the number of wrappers.
  • I'm not sure if in PageRank, the walker is allowed to teleport anywhere, as in the original algorithm, or only to the beginning, similar to Random Walk with Restart.
  • In random graph walk methods in general, and in this version of Random Walk with Restart, have people experimented with any different ways to assign probabilities to edges, other than assuming uniform distributions at every step.