Wu et al KDD 2008

Revision as of 20:23, 29 September 2011 by Aanavas

Citation

Wu, F., Hoffmann, R. and Weld, D. 2008. Information Extraction from Wikipedia: Moving Down the Long Tail. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 731–739, ACM, New York.

Online version

University of Washington

Summary

Most articles in Wikipedia come from the "long tail of sparse classes", that is, article types with only a small number of instances (82% of the classes have fewer than 100 articles). This paper introduces three techniques to improve the recall of structured information extraction from those classes: shrinkage over a refined ontology, retraining using open information extractors, and supplementing results by extracting from the general Web.

Shrinkage, a general statistical technique, is applied when training the extractor for a sparse class: training data is aggregated from the class's parent and children classes. The subsumption hierarchy needed for this task comes from a previously developed system called KOG (Kylin Ontology Generator). For a given class $C$, the shrinkage procedure collects the related class set:

$$S_C = \{\, C_i \mid \mathrm{path}(C, C_i) \le l \,\}$$

where $l$ is the threshold for path length. Then, the set of attributes from this related class set is used to aggregate training data, and the CRF extractors for $C$ are trained on the augmented dataset.
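The collection and aggregation steps can be illustrated with a minimal Python sketch. It is not the Kylin or KOG implementation: the toy ontology, the class names, and the helpers `related_classes` and `aggregate_training_data` are hypothetical, and the sketch assumes that only strict ancestor and descendant paths count (siblings are excluded) and that pooled examples are left unweighted.

```python
from collections import defaultdict


def related_classes(parent_of, target, max_path_len):
    """Map each class within max_path_len edges of target to its path length.

    Only strict ancestor or descendant paths are followed (an assumption of
    this sketch), so sibling classes are never included.
    """
    # Invert the child -> parent map so we can also walk downwards.
    children_of = defaultdict(list)
    for child, parent in parent_of.items():
        children_of[parent].append(child)

    related = {target: 0}

    # Walk up through ancestors until the threshold or the root is reached.
    node, dist = target, 0
    while node in parent_of and dist < max_path_len:
        node = parent_of[node]
        dist += 1
        related[node] = dist

    # Walk down through descendants, one level per iteration.
    frontier = [target]
    for dist in range(1, max_path_len + 1):
        frontier = [c for n in frontier for c in children_of[n]]
        for cls in frontier:
            related[cls] = dist
    return related


def aggregate_training_data(examples_by_class, related):
    """Pool training examples from every related class, nearest classes first."""
    pooled = []
    for cls in sorted(related, key=lambda c: (related[c], c)):
        pooled.extend(examples_by_class.get(cls, []))
    return pooled


# Hypothetical toy ontology: child class -> parent class.
parent_of = {
    "performer": "person",
    "scientist": "person",
    "actor": "performer",
    "comedian": "performer",
}
# Hypothetical training examples per class (placeholders for labeled sentences).
examples_by_class = {
    "performer": ["p1"],
    "person": ["x1", "x2"],
    "actor": ["a1"],
    "scientist": ["s1"],
}

related = related_classes(parent_of, "performer", max_path_len=1)
augmented = aggregate_training_data(examples_by_class, related)
```

In this toy setup the sparse class "performer" borrows examples from its parent "person" and its child "actor", while the sibling "scientist" contributes nothing. An extractor for "performer" would then be trained on `augmented` rather than on its own examples alone.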

Experimental results

...

Related papers

This paper improves the performance of Kylin, a self-supervised information extractor first described in Wu and Weld CIKM 2007. The shrinkage technique uses a cleanly structured ontology, the output of KOG, a system presented in Wu and Weld WWW 2008. The retraining technique uses TextRunner, which is described in Banko et al IJCAI 2007, but it could use any open information extractor, such as the ones described in Wu and Weld ACL 2010 and Fader et al EMNLP 2011.