Selen writeup of Wang Cohen 2007

From Cohen Courses
Jump to navigationJump to search

This is a review of Wang_2007_language_independent_set_expansion_of_named_entities_using_the_web by user:Selen.

In this paper they are introducing a set expansion system using a few seeds. Their system works on semi-structured documents and it is human and markup language independent. Their approach is to find semi-structured pages that has some list of items and then by combining such lists returning a more relevant subset. Lists in the webpages are found by constructing wrappers for each page that contains the seed. By wrapper they specify left and right context that is necessary to extract the seed from the page. The context is the longest prefix and suffix that encapsulates the seed, although I am not sure how they decide the length, what is the limit?

To purify the result set, they build a graph that contains all the seeds wrappers and candidate entities. and similarity between entities are ranked by a measure aggregated over 10000 random walk over the graph constructed.

The SEAL system has three components: Fetcher, Extractor and Ranker. As its name reveals, given a set of seed fetcher retrieves 100 web pages that contain all of the seeds (which is a downside of the algorithm I think). Extractor module extracts wrappers for each page that is found by fetcher and finally ranker constructs a graph and applies random walk to rank the results.

Unlike other IR set expansion systems such as KnowItAll SEAL considers small sets, and I wonder what would happen if the user query was like "politicians". Would their algorithm be fast and precise? Also what would happen if the user provides seemingly unrelated seeds, then the fetcher module might not retrieve a lot of webpages containing all the seeds.

The downside of their system is they only consider seeds within quotes, in other words they consider exact matches. They are iterating SEAL over the results that they found by feeding the seeds 5 times, they could have used a better bootstrapping approach. Other than that their results seem like a significant improvement over Google sets.