KeisukeKamataki writeup of Wang 2009
This is a review of Wang_2009_automatic_set_instance_extraction_using_the_web by user:KeisukeKamataki.
Summary: This paper tackles the problem of set instance extraction by extending SEAL. The system is called ASIA, and its main components are the Noisy Instance Generator, the Reranker and the Bootstrapper.
Noisy Instance Generator and SEAL: The generator creates rough candidates for the set using hyponym patterns (for example "A such as B" and "A, i.e. B") and also gives the candidates a rough ranking (any ranking method would be fine, since the candidates are re-ranked later). SEAL then expands the seeds of the set. Random walk with restart is used to rank the expanded instances.
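To make the hyponym-pattern idea concrete, here is a minimal sketch of how candidates could be pulled from text snippets and roughly ranked by frequency. It is not the paper's implementation: the function name `extract_candidates`, the two regular-expression patterns, and the frequency-based ranking are all assumptions made for illustration.

```python
import re
from collections import Counter

def extract_candidates(class_name, snippets):
    """Count phrases that follow '<class> such as ...' or '<class>, i.e. ...'.

    Hypothetical helper: the patterns and the frequency ranking are
    illustrative assumptions, not the patterns used in the paper.
    """
    pattern = re.compile(
        rf"{re.escape(class_name)},?\s+(?:such as|i\.e\.,?)\s+([^.;]+)",
        re.IGNORECASE,
    )
    # Split an enumeration like "A, B, and C" or "A and B" into items.
    splitter = re.compile(r",\s*(?:and\s+|or\s+)?|\s+and\s+|\s+or\s+")
    counts = Counter()
    for text in snippets:
        for match in pattern.finditer(text):
            for item in splitter.split(match.group(1)):
                item = item.strip()
                if item:
                    counts[item] += 1
    return counts

snippets = [
    "Car makers such as Toyota, Honda, and Ford.",
    "We like car makers such as Toyota and Nissan.",
]
candidates = extract_candidates("car makers", snippets)
ranked = [name for name, _ in candidates.most_common()]
```

On real web snippets this kind of extraction returns many wrong or truncated phrases, which is why the candidates are only treated as noisy input for the later re-ranking step.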
Reranker: This component performs the ranking with a modified version of the original SEAL, called noise-resistant SEAL. The main idea is that, since the seeds coming from the generator are very noisy, SEAL is tuned to alleviate the noise. It modifies the Fetcher and the Extractor, and also makes use of hint words when generating queries for the search engine. It also increases the size of the wrapper seed set to 15.
Bootstrapper: This works similarly to ISS from the 2008 paper. The difference is that it keeps the highly ranked instances from the previous iteration instead of choosing the new seeds randomly.
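The seed-selection idea can be sketched as a small loop in which each iteration takes the top-ranked instances of the previous iteration as its seeds. Everything here is a toy under stated assumptions: `bootstrap`, `toy_expand`, `TOY_LISTS`, the overlap-based scoring, and the way the seed set grows by `growth` per iteration are hypothetical and stand in for the real SEAL expansion, which works on web pages.

```python
# Toy "web lists" standing in for pages that an expander would fetch.
TOY_LISTS = [
    ["Toyota", "Honda", "Nissan"],
    ["Honda", "Nissan", "Mazda"],
    ["Toyota", "Apple"],
]

def toy_expand(seeds):
    """Hypothetical expander: score items by how many seeds share a list with them."""
    scores = {}
    for lst in TOY_LISTS:
        overlap = len(set(seeds) & set(lst))
        if overlap:
            for item in lst:
                scores[item] = scores.get(item, 0) + overlap
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

def bootstrap(initial_ranked, expand, iterations=2, seed_size=2, growth=2):
    """Re-expand repeatedly, seeding each round with the previous top-ranked items."""
    ranked = list(initial_ranked)
    for i in range(iterations):
        k = seed_size + i * growth   # assumed schedule: seed set grows each round
        seeds = ranked[:k]           # keep top-ranked instances, no random choice
        ranked = expand(seeds)
    return ranked

result = bootstrap(["Toyota", "Apple"], toy_expand)
```

In this toy run the noisy initial seed "Apple" drops below the car makers after the second iteration, which is the kind of effect that keeping highly ranked instances is meant to produce.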
As for the results, while the Noisy Instance Generator alone performs extremely poorly, the Reranker and the Bootstrapper greatly increase performance in terms of MAP. In comparison with other methods, the system achieves results comparable to previous related systems, which are domain-specific and work only for English.
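Since the results are reported in MAP (mean average precision), a short sketch of the metric may help. This follows the common definition in which average precision is normalized by the number of gold instances; whether the paper uses exactly this normalization is an assumption, and the function names are chosen here for illustration.

```python
def average_precision(ranked, gold):
    """Average of precision@rank over the ranks where a new gold item appears."""
    hits = 0
    total = 0.0
    seen = set()
    for rank, item in enumerate(ranked, start=1):
        if item in gold and item not in seen:
            hits += 1
            total += hits / rank
        seen.add(item)  # duplicates are not credited twice
    return total / len(gold) if gold else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranked list, gold set) pairs."""
    return sum(average_precision(r, g) for r, g in runs) / len(runs)

runs = [
    (["a", "x", "b"], {"a", "b"}),  # AP = (1/1 + 2/3) / 2
    (["x", "a"], {"a"}),            # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
```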
Unclear/I want to know: It would be useful to know how much each heuristic tuning for handling noisy data (such as the modifications to SEAL) contributed to the performance improvement. It would also be interesting to see an error analysis of the back-off strategy data (why NFL-team set extraction worked much better than MLB and NBA in Japanese, or vice versa in Chinese).