Bbd writeup of Language Independent Set Expansion

From Cohen Courses

This is a review of Wang_2007_language_independent_set_expansion_of_named_entities_using_the_web by user:Bbd.

This paper describes a technique for extracting a set of entities belonging to a class, given a small number of seeds of that class. Because the extraction relies on wrappers learned from the structure of the text, such as HTML/XML tags, the technique is independent of the language in which the document is written. The system consists of three main components:

 - Fetcher : Queries standard search engines such as Google with the seeds as query words and retrieves the pages in which all seeds are present.
 - Extractor : Learns one or more wrappers for each retrieved page and extracts more entities using these wrappers.
 - Ranker : Builds a graph containing the seeds, retrieved pages, wrappers and extracted entities, and ranks the entities using a random walk on that graph.
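
To make the Extractor and Ranker steps concrete, here is a minimal Python sketch on two toy pages. It rests on assumptions that are not taken from the paper: a wrapper is modelled as the longest left and right character contexts shared by the first occurrence of every seed on a page, and the ranking is a random walk with restart over an undirected seeds/pages/wrappers/entities graph. The function names, the restart probability and the toy pages are all hypothetical; the paper's actual procedure may differ.

```python
import re
from collections import defaultdict
from os.path import commonprefix


def learn_wrapper(page, seeds):
    """Return (left, right) contexts shared by the first occurrence of every seed."""
    lefts, rights = [], []
    for seed in seeds:
        pos = page.find(seed)
        if pos < 0:
            return None  # this page does not contain every seed
        lefts.append(page[:pos][::-1])  # reversed, so a common suffix becomes a common prefix
        rights.append(page[pos + len(seed):])
    left, right = commonprefix(lefts)[::-1], commonprefix(rights)
    return (left, right) if left and right else None


def apply_wrapper(page, wrapper):
    """Extract every string bracketed by the wrapper's left and right contexts."""
    left, right = wrapper
    # Lookbehind/lookahead let neighbouring items share their context characters.
    pattern = "(?<=" + re.escape(left) + ")(.+?)(?=" + re.escape(right) + ")"
    return re.findall(pattern, page)


def rank_entities(edges, start, entities, restart=0.15, steps=50):
    """Rank entities by the mass of a random walk with restart (assumed variant)."""
    neighbours = defaultdict(set)
    for a, b in edges:  # treat every edge as undirected
        neighbours[a].add(b)
        neighbours[b].add(a)
    score = {node: 0.0 for node in neighbours}
    score[start] = 1.0
    for _ in range(steps):
        nxt = {node: 0.0 for node in neighbours}
        nxt[start] = restart  # the walker jumps back to the seed node
        for node, mass in score.items():
            share = (1.0 - restart) * mass / len(neighbours[node])
            for other in neighbours[node]:
                nxt[other] += share
        score = nxt
    return sorted(entities, key=lambda e: score[e], reverse=True)


# Toy pages (hypothetical): the same class listed with two different markups.
rows = ["Ford", "Toyota", "Honda", "Nissan"]
page1 = "<table>\n" + "\n".join("<tr><td>%s</td></tr>" % r for r in rows) + "\n</table>"
page2 = "<ul><li>Ford</li><li>Toyota</li><li>Nissan</li></ul>"
seeds = ["Ford", "Nissan"]

edges, extracted = [], set()
for i, page in enumerate([page1, page2]):
    wrapper = learn_wrapper(page, seeds)
    if wrapper is None:
        continue
    edges += [("seeds", "page%d" % i), ("page%d" % i, "wrapper%d" % i)]
    for mention in apply_wrapper(page, wrapper):
        edges.append(("wrapper%d" % i, mention))
        extracted.add(mention)

candidates = sorted(extracted - set(seeds))
ranking = rank_entities(edges, "seeds", candidates)
print(ranking)
```

In this toy setup the wrappers never look at the words themselves, only at the markup around the seeds, which is why the same code would work for pages in any language. A candidate reached through both wrappers ends up ahead of one reached through a single wrapper.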

They compare their approach with Google Sets and Bayesian Sets and show an improvement over both.

Selecting the right seed set plays an important role, since the system only extracts pages that contain all the seeds. This automatically limits the number of seeds that can be used: the more seeds there are, the fewer pages contain all of them.
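
The effect of the seed count on the number of usable pages can be illustrated with a small Python sketch. The four-page corpus and the helper `pages_containing_all` are hypothetical, and plain substring matching stands in for the search engine query; the only point is that the set of pages containing every seed can never grow as seeds are added.

```python
def pages_containing_all(pages, seeds):
    """Return the names of the pages whose text mentions every seed."""
    return [name for name, text in pages.items()
            if all(seed in text for seed in seeds)]


# Hypothetical corpus: page name -> page text.
pages = {
    "a.html": "Ford Toyota Honda Nissan",
    "b.html": "Ford Toyota Nissan",
    "c.html": "Ford Toyota",
    "d.html": "Ford Apple",
}

all_seeds = ["Ford", "Toyota", "Nissan", "Honda"]
# Number of matching pages when using the first 1, 2, 3 and 4 seeds.
counts = [len(pages_containing_all(pages, all_seeds[:k])) for k in (1, 2, 3, 4)]
print(counts)
```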