Liuy writeup Wang Cohen 2007

From Cohen Courses
Revision as of 10:42, 3 September 2010 by WikiAdmin

This is a review of Wang_2007_language_independent_set_expansion_of_named_entities_using_the_web by user:Liuy.

Summary

This paper addresses set expansion: given a partial set of objects, in particular named entities, expand it into a more complete set. The approach is widely applicable in the sense that it works on semi-structured documents written in any language. The proposed system, SEAL, shows an empirical advantage over another web-based set expansion system, Google Sets. The datasets used in the empirical studies cover three languages. The authors suggest that further performance gains could come from running several rounds of set expansion followed by bootstrapping of named entities on top of the current procedure, from re-ranking web documents using the mentions extracted in previous rounds, and from taking into account the hierarchy or graph structure of the sets being expanded.
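The wrapper-based extraction at the heart of this kind of set expansion can be sketched as follows. This is only a minimal illustration, not the paper's actual implementation: the toy document, the helper names, and the assumption of one occurrence per seed are mine.

```python
# Sketch: learn the longest left/right contexts shared by all seeds in a
# semi-structured page, then extract every other string those contexts
# bracket. Illustrative only, not the paper's implementation.
import re

def common_prefix(strings):
    out = []
    for chars in zip(*strings):
        if len(set(chars)) == 1:
            out.append(chars[0])
        else:
            break
    return "".join(out)

def common_suffix(strings):
    return common_prefix([s[::-1] for s in strings])[::-1]

def expand(doc, seeds):
    lefts, rights = [], []
    for seed in seeds:
        i = doc.find(seed)
        if i < 0:
            return []  # a wrapper requires every seed to appear in the page
        lefts.append(doc[:i])
        rights.append(doc[i + len(seed):])
    left = common_suffix(lefts)    # longest common left context
    right = common_prefix(rights)  # longest common right context
    if not left or not right:
        return []
    # extract every string bracketed by the learned context pair
    pattern = re.escape(left) + r"(.+?)(?=" + re.escape(right) + ")"
    return re.findall(pattern, doc)

doc = "<li>Ford</li><li>Honda</li><li>Toyota</li><li>Nissan</li><li>Mazda</li>"
print(expand(doc, ["Ford", "Honda"]))  # ['Ford', 'Honda', 'Toyota', 'Nissan']
```

Note that "Mazda" is missed because the learned right context `</li><li>` does not follow the last list item; the real system aggregates candidates across many pages and wrappers, so single-page misses like this wash out.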

Commentary

1. Nine classes are constructed in all three languages (not culture-specific), and nine classes are constructed in one language only (culture-specific). However, the Chinese and Japanese datasets may share many characters with the same shape and similar meanings, so results on those two languages may not be fully independent.

2. The alternative method used for comparison makes several simplifications. First, it searches for common suffixes of the left context L and common prefixes of the right context R across all seed instances, instead of searching for context pairs that bracket at least one instance of every seed. Second, ranking based on a graph walk is compared with a simplified ranking approach: ranking entity mentions by the frequency with which they are extracted. Although these simplifications are intuitively sensible, it would be better to discuss how much they affect performance and why the loss is tolerable. Do these simplifications dramatically reduce the time complexity of the algorithm?
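The two rankings being contrasted here can be sketched roughly as follows. This is an illustration under simplifying assumptions of my own, not the paper's implementation; in particular, I use a uniform restart distribution, whereas a random walk with restart for set expansion would naturally restart at the seed nodes.

```python
# Sketch of the two ranking schemes: the simplified baseline counts how
# often each mention is extracted; the graph-walk alternative runs a
# random walk with restart on the bipartite wrapper <-> mention graph.
# Illustrative only; uniform restart is my assumption.
from collections import Counter, defaultdict

def rank_by_frequency(extractions):
    """Baseline: rank mentions by how often any wrapper extracted them."""
    counts = Counter(mention for _, mention in extractions)
    return [mention for mention, _ in counts.most_common()]

def rank_by_random_walk(extractions, restart=0.15, iters=50):
    """Graph-walk ranking on the wrapper <-> mention bipartite graph."""
    nbrs = defaultdict(set)
    for wrapper, mention in extractions:
        nbrs[("w", wrapper)].add(("m", mention))
        nbrs[("m", mention)].add(("w", wrapper))
    nodes = list(nbrs)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # each node keeps the restart mass, then spreads the rest evenly
        nxt = {n: restart / len(nodes) for n in nodes}
        for n, s in score.items():
            share = (1.0 - restart) * s / len(nbrs[n])
            for v in nbrs[n]:
                nxt[v] += share
        score = nxt
    mentions = [(n[1], s) for n, s in score.items() if n[0] == "m"]
    return [m for m, _ in sorted(mentions, key=lambda x: -x[1])]

pairs = [("w1", "Ford"), ("w1", "Honda"), ("w2", "Ford"), ("w2", "Toyota")]
print(rank_by_frequency(pairs))    # Ford extracted twice, so it ranks first
print(rank_by_random_walk(pairs))  # Ford touches both wrappers, ranks first
```

On a small example the two orderings agree; the interesting question raised above is how far they diverge at web scale, where a mention extracted many times by a single low-quality wrapper can dominate the frequency count but not the walk.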