Liuy writesup Wang Cohen 2008

From Cohen Courses
Revision as of 10:42, 3 September 2010 by WikiAdmin (talk | contribs) (1 revision)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigationJump to search

This is a review of Wang_2008_iterative_set_expansion_of_named_entities_using_the_web by user:Liuy

Summary

SEAL expands a partial set of seed entities into a complete one, using the web. The number o entities can affect the performance. It is much more challenging to handle more seeds. Up to ten seeds, iSEAL is proposed in this paper, to make set expansion still possible. It calls SEAL multiple times, each one of which only handle a small number of seeds. A bootstrapping scheme allows mixing seeds that are provided by the users and that are generated by the system; and thus can do better than SEAL, with less user-provided seeds. Also because of this bootstrapping scheme, different ranking methods has a significant impact on the performance of iSEAL. The conclusive empirical observation is that with a mixture of seeds that are provided by the users and that are generated by the system, through bootstrapping, Random walk with Restart does the best. Even with purely user-provided seeds, it is not much worse than Bayesian Sets.



Commentary

Several ranking methods are proposed : Wrapper frequency, Bayasian setd, random walk with restart, Page rank. It is empirically shown on several datasets that, Random walk with the bootstrap scheme is better than Bayesian Sets, mainly because of the robustness the boostrape introduced in the presence of noise. It gathers alternative versions of the same statistic (essentially to get the distribution of it), from alternatives sample from only a small part of the whole population. However, there is no theoretical guarantees that the method will achieve certain level of precision with finite sample, although, it is asymptotically consistent. And it usually assumes the independence of samples, which may not be proper for the task of set expansion. Also I am interested in knowing whether it is verifiable that at the point where the algorithm halts, it gives the generalized error rate at certain level; and also the number of iteration is typically needed for the task involving more than 10 seeds by iSEAL.