Weld et al SIGMOD 2009


Citation

Weld, D. S., Hoffmann, R., and Wu, F. 2009. Using Wikipedia to bootstrap open information extraction. SIGMOD Rec. 37, 4 (Mar. 2009), 62-68.

Online version

ACM Digital Library

Summary

This is a recent paper that addresses the Open Information Extraction problem. The authors use a self-supervised learning prototype, KYLIN (Wu and Weld, CIKM 2007), trained on Wikipedia. The proposed solution has three components:

  1. Self Learning
    • The infoboxes of Wikipedia pages are used to determine the class of each page and the attributes of that class.
    • Training data for the extractors are constructed from these Wikipedia pages using heuristics: first a heuristic document classifier assigns documents to classes, then a sentence classifier (MaxEnt with bagging) determines whether a sentence expresses one of the class's relations, and finally a CRF model extracts the values (the second entities) of those relations (see the labeling sketch after this list).
    • Shrinkage is used to improve recall, with an automatic ontology generator that combines the infobox classes with WordNet. The resulting ontology gives a hierarchy of classes and lets a subclass's extractors be trained with data from its superclasses (see the shrinkage sketch below).
  2. Bootstrapping
    • More training data are harvested from the Web using TEXTRUNNER (Banko et al., IJCAI 2007).
    • Web pages are weighted by an estimate of their relevance to the target relation (see the weighting sketch below).
  3. Correction
    • An interface encourages the community to make corrections, so that more training data can be collected.
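
The heuristic labeling in component 1 can be pictured with a short sketch. This is not KYLIN's actual code: the function name, the exact string-matching rule, and the sample infobox are assumptions made for illustration. The idea is that a sentence mentioning an infobox attribute's value becomes a positive example for that attribute's sentence classifier and CRF extractor, while unmatched sentences serve as negatives.

  import re

  def build_training_data(article_sentences, infobox):
      """Heuristically label sentences with the infobox attributes whose
      values they mention; unmatched sentences become negative examples."""
      positives, negatives = [], []
      for sentence in article_sentences:
          matched = False
          for attribute, value in infobox.items():
              # A sentence containing the attribute value verbatim is treated
              # as a positive example for that attribute's extractor.
              if value and re.search(re.escape(value), sentence, re.IGNORECASE):
                  positives.append((sentence, attribute, value))
                  matched = True
          if not matched:
              negatives.append(sentence)
      return positives, negatives

  # Example with a made-up fragment of a city article and its infobox.
  infobox = {"country": "Poland", "population": "1,790,658"}
  sentences = [
      "Warsaw is the capital and largest city of Poland.",
      "Its population is estimated at 1,790,658 residents.",
      "The city is a major cultural and political hub.",
  ]
  pos, neg = build_training_data(sentences, infobox)
  # pos pairs each matching sentence with the attribute it supports;
  # neg holds the remaining, unmatched sentence.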
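
The shrinkage idea from component 1 can be sketched in the same spirit. The function below is an assumption, not the paper's implementation: it pools a sparse subclass's training sentences with down-weighted sentences from its ancestor classes in the automatically generated ontology, and the decay factor of 0.5 is arbitrary.

  def pooled_training_sentences(class_name, parent_of, sentences_by_class):
      """Shrinkage sketch: augment a subclass's training sentences with
      down-weighted sentences inherited from its ancestors in the ontology."""
      pooled = [(s, 1.0) for s in sentences_by_class.get(class_name, [])]
      weight = 0.5  # assumed decay per ontology level
      parent = parent_of.get(class_name)
      while parent is not None:
          pooled += [(s, weight) for s in sentences_by_class.get(parent, [])]
          weight *= 0.5
          parent = parent_of.get(parent)
      return pooled

  # Example: "performer" is a subclass of "person" in the infobox/WordNet
  # ontology, so a performer extractor can also learn from person data.
  parent_of = {"performer": "person", "person": None}
  sentences_by_class = {
      "performer": ["She starred in her first film in 1994."],
      "person": ["He was born in Hawaii in 1961.", "She died in Paris in 1867."],
  }
  examples = pooled_training_sentences("performer", parent_of, sentences_by_class)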
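
The page weighting in component 2 can also be illustrated with a small, assumed sketch: each sentence harvested from the Web keeps a weight taken from its source page's estimated relevance, so low-relevance pages contribute less when the extractors are retrained. The relevance scores, URLs, and tuple layout here are invented.

  def weight_web_sentences(harvested, page_relevance):
      """Attach a per-example weight, taken from the source page's estimated
      relevance, to each sentence harvested from the Web."""
      weighted = []
      for page_url, sentence, attribute, value in harvested:
          weight = page_relevance.get(page_url, 0.0)
          if weight > 0.0:
              weighted.append((sentence, attribute, value, weight))
      return weighted

  # Invented relevance estimates (e.g. from a retrieval score).
  page_relevance = {"http://example.org/bio": 0.9, "http://example.org/blog": 0.2}
  harvested = [
      ("http://example.org/bio", "She was born in Krakow.", "birth_place", "Krakow"),
      ("http://example.org/blog", "Krakow is lovely in autumn.", "birth_place", "Krakow"),
  ]
  weighted_examples = weight_web_sentences(harvested, page_relevance)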

Related papers

More details on KYLIN can be found in Wu and Weld, CIKM 2007, where it is applied to the task of refining Wikipedia. A follow-up paper (Wu and Weld, ACL 2010) refines the solution by adding dependency-parse features to the models.