Siddharth writeup of Wang & Cohen ICDM 07
This is a review of Wang_2007_language_independent_set_expansion_of_named_entities_using_the_web by user:sgopal1.
This paper proposes a general-purpose, language-independent method for extracting sets from the web given a few seed elements of the set. The proposed method consists of three stages: Fetcher, Extractor and Ranker. The Fetcher retrieves the top n pages from Google that contain all of the seed entities. The Extractor derives page-dependent left and right contexts from the seed occurrences and uses them to identify possible elements of the set in the given page. The extracted elements are passed on to the Ranker. The Ranker creates a multi-type graph in which seeds, pages, patterns and mentions act as nodes, with edges defined between them as appropriate. It then uses a PageRank-like (although not identical) method to score the importance of each node in the graph structure. The evaluation shows an improvement over several datasets.
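To make the Extractor step concrete, here is a minimal sketch of the left/right context idea in Python. It is a simplification and not the paper's algorithm: it looks only at the first occurrence of each seed on a page, takes the longest shared left and right context, and keeps every string bracketed by that pair. The function names (`induce_wrapper`, `extract_mentions`) and the sample page are hypothetical.

```python
import os
import re


def common_prefix(strings):
    """Longest string that every input starts with (character-wise)."""
    return os.path.commonprefix(strings)


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_strings = [s[::-1] for s in strings]
    return os.path.commonprefix(reversed_strings)[::-1]


def induce_wrapper(page, seeds):
    """Return a (left, right) context pair shared by all seeds, or None.

    Assumption: only the first occurrence of each seed is considered.
    """
    lefts, rights = [], []
    for seed in seeds:
        pos = page.find(seed)
        if pos < 0:
            return None  # the page must contain every seed
        lefts.append(page[:pos])
        rights.append(page[pos + len(seed):])
    left, right = common_suffix(lefts), common_prefix(rights)
    if not left or not right:
        return None  # no usable page-dependent context
    return left, right


def extract_mentions(page, seeds):
    """Extract every string on the page bracketed by the induced contexts."""
    wrapper = induce_wrapper(page, seeds)
    if wrapper is None:
        return []
    left, right = wrapper
    # The right context is a lookahead so that adjacent items, whose
    # contexts overlap in the page text, can all be matched.
    pattern = re.escape(left) + r"(.+?)(?=" + re.escape(right) + r")"
    return re.findall(pattern, page)


page = "<ul>\n<li>ford</li>\n<li>honda</li>\n<li>toyota</li>\n<li>nissan</li>\n</ul>"
print(extract_mentions(page, ["ford", "nissan"]))
```

Because the contexts are induced per page, nothing here depends on the language of the page or on a particular markup, which is the property the paper emphasizes.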
I think one more advantage of this solution is that it can be easily parallelized across different machines. Both the ranking computation and the extraction can be done using MapReduce.
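As a rough illustration of why the ranking fits MapReduce, the sketch below writes one iteration of a generic PageRank-style walk with restart at the seeds as a map step and a reduce step, in plain Python with no actual MapReduce framework. This is a swapped-in stand-in, not the paper's exact ranker, and the graph is simplified to page and mention nodes only (pattern nodes are omitted). All names, the damping value and the sample data are assumptions.

```python
from collections import defaultdict


def build_graph(page_mentions):
    """Undirected page-mention graph (simplified: no pattern nodes)."""
    graph = defaultdict(set)
    for page, mentions in page_mentions.items():
        for mention in mentions:
            graph[page].add(mention)
            graph[mention].add(page)
    return graph


def map_contributions(score, neighbors):
    """Map step: a node splits its current score evenly among its neighbors."""
    share = score / len(neighbors)
    for neighbor in neighbors:
        yield neighbor, share


def reduce_scores(contributions, restart, damping):
    """Reduce step: sum the shares per node, then mix in the restart mass."""
    totals = defaultdict(float)
    for node, share in contributions:
        totals[node] += share
    return {
        node: damping * totals[node] + (1 - damping) * mass
        for node, mass in restart.items()
    }


def walk_with_restart(graph, seeds, damping=0.85, iterations=20):
    """Iterate map and reduce; the walk restarts uniformly at the seeds."""
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    scores = dict(restart)
    for _ in range(iterations):
        contributions = []
        # Each node's map call is independent of the others, so in a real
        # deployment these calls could run on different machines.
        for node, neighbors in graph.items():
            contributions.extend(map_contributions(scores[node], neighbors))
        scores = reduce_scores(contributions, restart, damping)
    return scores


page_mentions = {
    "page:1": ["ford", "toyota", "honda"],
    "page:2": ["ford", "toyota", "honda", "banana"],
}
scores = walk_with_restart(build_graph(page_mentions), {"ford", "toyota"})
print(sorted(scores, key=scores.get, reverse=True))
```

The extraction is even simpler to distribute, since each fetched page can be processed by a separate mapper with no shared state.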
It is funny that the paper says "unfortunately, however, Google Sets is a proprietary method that may be changed at any time, so research results based on Google Sets cannot be reliably replicated" but uses search results from Google anyway.