Difference between revisions of "Weld et al SIGMOD 2009"
From Cohen Courses
Revision as of 13:24, 26 September 2010
Citation
Weld, D. S., Hoffmann, R., and Wu, F. 2009. Using Wikipedia to bootstrap open information extraction. SIGMOD Rec. 37, 4 (Mar. 2009), 62-68.
Online version
Summary
This is a recent paper that addresses the Open Information Extraction problem. The authors use KYLIN (Wu and Weld, CIKM 2007), a self-supervised learning prototype trained on Wikipedia. The proposed solution has three components:
- Self Learning
  - The infoboxes of Wikipedia pages are used to determine the class of each page and the attributes of that class.
  - Training data for the extractors is constructed from these Wikipedia pages using heuristics. First, a heuristic document classifier assigns documents to classes; then a sentence classifier (MaxEnt with bagging) determines whether a sentence contains a given relation. Finally, a CRF model extracts the values (second entities) of the relations.
  - Shrinkage is used to improve recall, supported by an automatic ontology generator that combines the infobox classes with WordNet. The resulting ontology gives a hierarchy of classes, so a subclass can be trained with the data of its superclass as well as its own.
- Bootstrapping
  - More training data is harvested from the Web using TEXTRUNNER (Banko et al, IJCAI 2007).
  - Web pages are weighted by an estimate of their relevance to the relation.
- Correction
  - An interface encourages the community to make corrections, so that more training data can be collected.
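
The Self Learning step can be illustrated with a minimal sketch. It is not KYLIN's actual implementation: the verbatim string match stands in for the paper's labeling heuristics, the per-level decay of 0.5 is an assumed weighting for shrinkage, and the names Article, label_sentences and training_set_with_shrinkage are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    infobox_class: str  # class taken from the infobox template (hypothetical field)
    infobox: dict       # attribute -> value, as listed in the infobox
    sentences: list     # plain-text sentences of the article body


def label_sentences(article):
    """Heuristic labeling (assumed): a sentence becomes a positive example for
    an attribute if it contains that attribute's infobox value verbatim."""
    examples = []
    for sentence in article.sentences:
        lowered = sentence.lower()
        for attribute, value in article.infobox.items():
            if value.lower() in lowered:
                examples.append(
                    (article.infobox_class, attribute, sentence, value)
                )
    return examples


def training_set_with_shrinkage(examples, target_class, parents, decay=0.5):
    """Pool examples of target_class with those of its ancestors.

    `parents` maps a class to its superclass and is assumed to be a tree.
    Each step up the hierarchy multiplies the example weight by `decay`
    (an assumed scheme, chosen only for illustration).
    """
    weights = {}
    cls, weight = target_class, 1.0
    while cls is not None:
        weights[cls] = weight
        cls = parents.get(cls)
        weight *= decay
    return [(ex, weights[ex[0]]) for ex in examples if ex[0] in weights]


# Made-up data: one "performer" page and one page of its superclass "person".
articles = [
    Article("Ada Example", "performer", {"birthplace": "Springfield"},
            ["Ada Example was born in Springfield.", "She later moved abroad."]),
    Article("Bob Example", "person", {"birthplace": "Shelbyville"},
            ["Bob Example grew up in Shelbyville."]),
]
examples = [ex for article in articles for ex in label_sentences(article)]
pooled = training_set_with_shrinkage(
    examples, "performer", {"performer": "person"}
)
```

In this sketch the weighted examples in pooled would then feed the sentence classifier and the CRF extractor for the subclass; sentences labeled on superclass pages contribute with reduced weight rather than being discarded.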
Related papers
More details of KYLIN can be found in Wu and Weld, CIKM 2007, which describes it in the context of refining Wikipedia. A follow-up paper (Wu and Weld, ACL 2010) refines the solution by adding dependency-parsing features when training the model.