Liuliu writeup of Wang 2007
This is a review of Wang_2007_language_independent_set_expansion_of_named_entities_using_the_web by user:Liuliu.
This paper gives a very detailed introduction to the set-expansion system SEAL. Highlights include that it is language independent and domain independent, and that it requires no annotated data but instead learns its wrappers on the fly.
SEAL contains three components:
- Fetcher: It downloads the web pages that contain the seeds.
- Extractor: It learns wrappers on the fly. Each wrapper is page-dependent and works at the character level. The basic idea is that a wrapper consists of the maximally long left and right context strings that bracket at least one instance of every seed on the page.
- Ranker: To filter out noisy entities, the ranker ranks the extracted entities by their similarity to the seeds, in two steps: (1) build a graph containing the four kinds of elements in the system (seeds, documents, wrappers, and mentions), with edges based on the relationships between them; (2) run a random walk on the graph and calculate the probability of reaching each entity from the seeds.
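The Extractor and Ranker steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`learn_wrappers`, `extract`, `random_walk_scores`) are hypothetical, the wrapper search brute-forces every combination of one occurrence per seed, and the walk is a plain random walk with restart over undirected edges, with `restart` and `steps` chosen arbitrarily.

```python
import itertools
import re

def common_prefix(strings):
    """Longest string that every input starts with."""
    prefix = []
    for chars in zip(*strings):
        if len(set(chars)) != 1:
            break
        prefix.append(chars[0])
    return "".join(prefix)

def common_suffix(strings):
    """Longest string that every input ends with."""
    return common_prefix([s[::-1] for s in strings])[::-1]

def learn_wrappers(page, seeds):
    """Return (left, right) context pairs that bracket one occurrence of every seed.

    Assumption: brute force over all combinations of one occurrence per seed.
    """
    spans = [[(m.start(), m.end()) for m in re.finditer(re.escape(s), page)]
             for s in seeds]
    wrappers = set()
    for combo in itertools.product(*spans):
        # Character-level contexts shared by the chosen occurrences.
        left = common_suffix([page[:start] for start, _ in combo])
        right = common_prefix([page[end:] for _, end in combo])
        if left and right:
            wrappers.add((left, right))
    return wrappers

def extract(page, wrapper):
    """Return every string found between the wrapper's left and right context."""
    left, right = wrapper
    # The lookahead leaves the right context unconsumed, so neighbouring
    # items whose contexts overlap can still be matched.
    pattern = re.escape(left) + r"(.+?)(?=" + re.escape(right) + r")"
    return set(re.findall(pattern, page))

def random_walk_scores(edges, seeds, restart=0.15, steps=50):
    """Score graph nodes by a random walk with restart from the seed nodes.

    Assumptions: edges are undirected (a, b) pairs, every seed appears in
    at least one edge, and restart/steps are illustrative values.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in neighbors}
    prob = dict(start)
    for _ in range(steps):
        # With probability `restart` jump back to a seed; otherwise move
        # to a uniformly chosen neighbour.
        nxt = {n: restart * start[n] for n in neighbors}
        for node, p in prob.items():
            share = (1.0 - restart) * p / len(neighbors[node])
            for nb in neighbors[node]:
                nxt[nb] += share
        prob = nxt
    return prob
```

For instance, on a three-item HTML list of car makers with the seeds Ford and Toyota, the learned wrapper also extracts the unseeded Honda. In the walk, a mention extracted by several wrappers ends up with a higher score than one extracted by a single wrapper, which is the intuition behind using the graph to filter noise.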
They compare the system with both Google Sets and alternative methods that use different extractors and rankers, in terms of mean average precision. The SEAL system performs much better than the other methods.
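As a reminder of how the evaluation metric works, here is a minimal sketch of mean average precision using the standard definition. The function names are hypothetical, and the sketch assumes each ranked list contains no duplicate entities and that the full set of correct entities is known for each run.

```python
def average_precision(ranked, relevant):
    """Average of precision@rank over the ranks where a correct entity appears.

    Dividing by len(relevant) penalises correct entities that are never returned.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)

def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

A ranked list that places all correct entities first scores 1.0, while noisy entities ranked above correct ones pull the score down, so the metric rewards both the extractor's coverage and the ranker's ordering.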
What I like: As the authors mention in the paper, "the SEAL method is simple enough to be easily described and replicated, and is independent of the human language from which the seeds are taken".