Yandongl writeup of Wang 2007

From Cohen Courses

This is a review of Wang_2007_language_independent_set_expansion_of_named_entities_using_the_web by user:Yandongl.

This paper introduces a character-based set expansion algorithm that leverages documents on the Web. The core idea is to learn a set of wrappers from HTML documents by locating the seed instances and extracting the context strings around them. The context strings used in this paper are the longest left (prefix) and right (suffix) character strings that bracket at least one occurrence of every seed on a page.
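To make the wrapper idea concrete, here is a minimal brute-force sketch, assuming a wrapper must bracket at least one occurrence of each seed on a page. It is not the paper's implementation: the function names (`learn_wrappers`, `extract`) and the toy HTML page are hypothetical, and unlike a real system it enumerates every combination of seed occurrences and does not filter out wrappers that are not maximally long.

```python
import re
from itertools import product

def common_prefix(strings):
    # The common prefix of all strings equals that of the lexicographic min and max.
    first, last = min(strings), max(strings)
    n = 0
    for a, b in zip(first, last):
        if a != b:
            break
        n += 1
    return first[:n]

def common_suffix(strings):
    # Reverse each string, take the common prefix, reverse back.
    return common_prefix([s[::-1] for s in strings])[::-1]

def learn_wrappers(doc, seeds):
    """Return (left, right) context pairs shared by one occurrence of every seed."""
    positions = [[m.start() for m in re.finditer(re.escape(s), doc)] for s in seeds]
    if any(not p for p in positions):
        return set()  # some seed does not appear on this page
    wrappers = set()
    for combo in product(*positions):  # one occurrence per seed (brute force)
        lefts = [doc[:i] for i in combo]
        rights = [doc[i + len(s):] for i, s in zip(combo, seeds)]
        left, right = common_suffix(lefts), common_prefix(rights)
        if left and right:
            wrappers.add((left, right))
    return wrappers

def extract(doc, wrappers):
    """Return every string that appears between a wrapper's left and right context."""
    found = set()
    for left, right in wrappers:
        # Lookarounds let neighbouring matches share context characters.
        pattern = "(?<=" + re.escape(left) + ")(.+?)(?=" + re.escape(right) + ")"
        found.update(m.group(1) for m in re.finditer(pattern, doc))
    return found

# Hypothetical page: a table of car makers and countries.
page = ("<table><tr><td>Ford</td><td>USA</td></tr>"
        "<tr><td>Honda</td><td>Japan</td></tr>"
        "<tr><td>Toyota</td><td>Japan</td></tr>"
        "<tr><td>Nissan</td><td>Japan</td></tr></table>")
wrappers = learn_wrappers(page, ["Ford", "Toyota"])
instances = extract(page, wrappers)
```

Because the contexts are compared character by character, nothing in the sketch depends on tokenization or on the language of the page, which is what makes the approach language-independent.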

The overall architecture is as follows. Starting from a handful of user-provided seeds, the Fetcher downloads the top URLs returned by Google. The Extractor, using the learning algorithm described above, builds wrappers for the seed instances (is the number of wrappers mentioned in the paper?) and then applies them to extract more instances from the downloaded Web pages. Finally, the Ranker sorts the extracted instances using one of several approaches, such as frequency or a random walk on a graph.
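The graph-based ranking step can be illustrated with a small random walk with restart. This is only a sketch under stated assumptions: the graph layout (a single seed node linked to documents, documents to wrappers, wrappers to extracted instances), the undirected edges, the restart probability of 0.15, and all node names are hypothetical choices for illustration, not values taken from the paper.

```python
from collections import defaultdict

def random_walk_scores(edges, start_nodes, restart=0.15, iterations=50):
    """Approximate random-walk-with-restart scores by power iteration."""
    neighbors = defaultdict(set)
    for a, b in edges:  # treat every edge as undirected
        neighbors[a].add(b)
        neighbors[b].add(a)
    start = {n: 1.0 / len(start_nodes) for n in start_nodes}
    scores = dict(start)
    for _ in range(iterations):
        nxt = defaultdict(float)
        for node, mass in scores.items():
            # Spread the non-restart mass evenly over the node's neighbours.
            for nb in neighbors[node]:
                nxt[nb] += (1 - restart) * mass / len(neighbors[node])
        for node, mass in start.items():
            # Return the restart mass to the start nodes.
            nxt[node] += restart * mass
        scores = nxt
    return dict(scores)

# Hypothetical graph: two documents, one wrapper each, three extracted strings.
edges = [
    ("seeds", "doc1"), ("seeds", "doc2"),
    ("doc1", "wrapper1"), ("doc2", "wrapper2"),
    ("wrapper1", "Honda"), ("wrapper1", "Nissan"),
    ("wrapper2", "Honda"), ("wrapper2", "Buy now"),
]
scores = random_walk_scores(edges, ["seeds"])
ranked = sorted(["Honda", "Nissan", "Buy now"], key=scores.get, reverse=True)
```

In this toy graph "Honda" is extracted by both wrappers, so it receives probability mass along two paths and ranks above strings that only one wrapper produced. A frequency-based ranker would also put it first here; the walk additionally takes into account how the documents and wrappers are connected.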

The baseline system is Google Sets. An alternative wrapper-learning method, which uses the common suffix/prefix instead of the longest ones, is also introduced. Counter-intuitively to me, this alternative approach didn't work as well, at least in terms of overall accuracy. Not surprisingly, graph-based ranking works nicely and beats the other, more naive approaches.

Overall, this is a very clear paper. The approaches suggested are simple yet very efficient and effective. Language-independent extraction is a very appealing feature, and the experiments show that it works well in practice.