Wu et al KDD 2008

From Cohen Courses
Revision as of 00:19, 28 September 2011 by Aanavas

Citation

Wu, F., Hoffmann, R. and Weld, D. 2008. Information Extraction from Wikipedia: Moving Down the Long Tail. In Proceedings of the 14th International Conference on Knowledge Discovery and Data Mining, pp. 731–739, ACM, New York.

Online version

University of Washington

Summary

Most articles in Wikipedia come from the "long tail of sparse classes", article types with a small number of instances (82% of the classes have fewer than 100 articles). This paper introduces three techniques for increasing recall of information extraction from those classes: shrinkage over a refined ontology, retraining using open information extractors and supplementing results by extracting from the general Web. These techniques are used to improve the performance of a previously developed information extractor called Kylin.

When training the extractor for a sparse class, the first technique, shrinkage, aggregates training data from the class's parent and children classes. The subsumption hierarchy needed for this task comes from a previously developed system called KOG (Kylin Ontology Generator). The shrinkage procedure searches upwards and downwards through the KOG ontology:

- Given a class C, query KOG to collect the related class set: SC = {Ci | path(C, Ci) ≤ l}, where l is the preset threshold for path length. Currently Kylin only searches strict parent/children paths without considering siblings. Take the “Performer” class as an example: its parent “Person” and children “Actor” and “Comedian” could be included in SC.
- For each attribute C.a (e.g., Performer.loc) of C:
-- Query KOG for the mapped attribute Ci.aj (e.g., Person.birth_plc) for each Ci.
-- Assign weight wij to the training examples from Ci.aj and add them to the training dataset for C.a. Note that wij may be a function of the target attribute C.a, the related class Ci, and Ci’s mapped attribute Ci.aj.
- Train the CRF extractors for C on the new training set.
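The data-collection steps of the procedure above can be sketched in Python. This is only an illustrative sketch, not Kylin's or KOG's actual code: the toy ontology `PARENT`, the mapping table `ATTRIBUTE_MAP`, the attribute name `birthplace`, the example strings, and the distance-based weight `decay ** dist` are all assumptions made for the example. The summary only says that wij may depend on C.a, Ci, and Ci.aj, so the weighting here is a placeholder, and the final CRF training step is left out.

```python
from collections import deque

# Hypothetical toy ontology, stored as child -> parent (stands in for KOG output).
PARENT = {
    "Performer": "Person",
    "Actor": "Performer",
    "Comedian": "Performer",
    "Scientist": "Person",  # sibling of Performer; must not be collected
}

# Hypothetical attribute mapping: (C, a) -> {Ci: aj}.
ATTRIBUTE_MAP = {
    ("Performer", "loc"): {
        "Person": "birth_plc",
        "Actor": "birthplace",
        "Comedian": "birthplace",
    },
}

# Hypothetical training examples, keyed by (class, attribute).
EXAMPLES = {
    ("Performer", "loc"): ["p1"],
    ("Person", "birth_plc"): ["q1", "q2"],
    ("Actor", "birthplace"): ["a1"],
    ("Scientist", "birth_plc"): ["s1"],
}


def related_classes(cls, parent, max_len):
    """Return {Ci: path length} for ancestors and descendants within max_len.

    Only strict upward or strict downward paths are followed, so siblings
    (reachable only by going up and then down) are never included.
    """
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    related = {}
    # Walk upward through ancestors.
    node, dist = cls, 0
    while node in parent and dist < max_len:
        node, dist = parent[node], dist + 1
        related[node] = dist
    # Walk downward through descendants, breadth-first.
    queue = deque([(cls, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_len:
            continue
        for child in children.get(node, []):
            related[child] = dist + 1
            queue.append((child, dist + 1))
    return related


def build_training_set(cls, attr, examples, parent, attr_map, max_len=1, decay=0.5):
    """Collect weighted (example, weight) pairs for the extractor of cls.attr."""
    # The class's own examples keep full weight.
    weighted = [(ex, 1.0) for ex in examples.get((cls, attr), [])]
    mapping = attr_map.get((cls, attr), {})
    for rel, dist in related_classes(cls, parent, max_len).items():
        mapped = mapping.get(rel)
        if mapped is None:
            continue  # no mapped attribute for this related class
        weight = decay ** dist  # placeholder for wij (an assumption)
        weighted.extend((ex, weight) for ex in examples.get((rel, mapped), []))
    return weighted


training_set = build_training_set("Performer", "loc", EXAMPLES, PARENT, ATTRIBUTE_MAP)
```

With `max_len=1`, the related set for “Performer” is its parent “Person” and its children “Actor” and “Comedian”; the sibling “Scientist” is skipped even for larger thresholds, because the search never goes up and then down. The resulting `training_set` would then be passed, with its weights, to whatever CRF trainer is in use.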

Experimental results

...

Related papers

This paper improves Kylin, a self-supervised information extractor first described in Wu and Weld CIKM 2007. The shrinkage technique uses a cleanly structured ontology, the output of KOG, a system presented in Wu and Weld WWW 2008. The retraining technique uses TextRunner, an open information extractor described in Banko et al IJCAI 2007.