Difference between revisions of "Weld et al SIGMOD 2009"

From Cohen Courses
 
(6 intermediate revisions by 2 users not shown)
Line 9: Line 9:
 
== Summary ==
 
== Summary ==
  
This is a recent paper [[Category::paper]] that addressed the [[AddressesProblem::Open Information Extraction]] problem. Authors used a self supervised learning prototype, KYLIN ([http://malt.ml.cmu.edu/mw/index.php/%22Wu_and_Weld_CIKM_2007%22 Wu and Weld, CIKM 2007]), trained using [[UsesDataset::Wikipedia|Wikipedia]]. There are three components in the proposed solution:
+
This is a recent paper [[Category::paper]] that addressed the [[AddressesProblem::Open Information Extraction]] problem. Authors used a self supervised learning prototype, KYLIN ([[Wu_and_Weld_CIKM_2007]]), trained using [[UsesDataset::Wikipedia|Wikipedia]]. There are three components in the proposed solution:
  
 
# Self Learning
 
# Self Learning
 
#* The infobox of Wikipedia pages are used to determine the class of the page and attributes of the class.
 
#* The infobox of Wikipedia pages are used to determine the class of the page and attributes of the class.
#* Training data for the extraction were constructed from these Wiki pages using heuristics. First a heuristic document classifier will classify documents into classes, then sentence classifier ([[UsesMethod::Maximum Entropy model|MaxEnt]] with bagging [[UsesMethod::bagging|bagging]]) determine if a sentence contains the relations. After that a [[UsesMethod::Conditional Random fields|CRF]] model will extract the values (second entities) of relations.
+
#* Training data for the extraction were constructed from these Wiki pages using heuristics. First a heuristic document classifier will classify documents into classes, then sentence classifier ([[UsesMethod::Maximum Entropy model|MaxEnt]] with [[UsesMethod::bagging|bagging]]) determines if a sentence contains the relations. After that a [[UsesMethod::Conditional Random fields|CRF]] model will extract the values (second entities) of relations.
#* Shrinkage was used to improve the recall with a automatic ontology generator which combine the infobox classes with WordNet. This ontology gives a hierarchy of classes and facilitate the training of a subclass with the data of super class.  
+
#* Shrinkage was used to improve the recall with an [[RelatedPaper::Wu and Weld WWW 2008|automatic ontology generator]] which combines the infobox classes with WordNet. This ontology gives a hierarchy of classes and facilitates the training of a subclass with the data of super class.  
 
# Bootstrapping
 
# Bootstrapping
#* More training data were harvest from Web using TEXTRUNNER ([http://malt.ml.cmu.edu/mw/index.php/Banko_et_al_IJCAI_2007 Banko et al, IJCAI 2007]).  
+
#* More training data were harvest from Web using TEXTRUNNER ([[Banko_et_al_IJCAI_2007]]).  
 
#* Web pages were weighted using the estimate of their relevance to the relation.
 
#* Web pages were weighted using the estimate of their relevance to the relation.
 
# Correction
 
# Correction
#* An interface to encourage community to make correction, so more training data will be collected.
+
#* An interface to encourage community to make corrections, so more training data will be collected.
 +
 
 
== Related papers ==
 
== Related papers ==
More details of KYLIN can be found in [http://malt.ml.cmu.edu/mw/index.php/%22Wu_and_Weld_CIKM_2007%22 Wu and Weld, CIKM 2007] in the task of refining Wikipedia. A follow up paper ([http://malt.ml.cmu.edu/mw/index.php/Wu_and_Weld_ACL_2010 Wu and Weld, ACL 2010]) refine the solution by adding dependency parsing features to train model.
+
More details of KYLIN can be found in [[RelatedPaper::Wu_and_Weld_CIKM_2007]] in the task of completing infoboxs in Wikipedia pages. A follow up paper ([[RelatedPaper::Wu_and_Weld_ACL_2010]]) refines the solution by adding dependency parsing features to train the model.

Latest revision as of 11:45, 29 September 2011

Citation

Weld, D. S., Hoffmann, R., and Wu, F. 2009. Using Wikipedia to bootstrap open information extraction. SIGMOD Rec. 37, 4 (Mar. 2009), 62-68.

Online version

ACM Digital Library

Summary

This paper addressed the Open Information Extraction problem. The authors used a self-supervised learning prototype, KYLIN (Wu_and_Weld_CIKM_2007), trained on Wikipedia. The proposed solution has three components:

  1. Self Learning
    • The infoboxes of Wikipedia pages were used to determine the class of each page and the attributes of that class.
    • Training data for the extractors were constructed from these Wikipedia pages using heuristics. First, a heuristic document classifier assigned documents to classes; then a sentence classifier (MaxEnt with bagging) determined whether a sentence contained a relation. Finally, a CRF model extracted the values (second entities) of the relations.
    • Shrinkage was used to improve recall, with an automatic ontology generator that combines the infobox classes with WordNet. The resulting ontology gives a hierarchy of classes and makes it possible to train the extractor for a subclass with data from its superclass.
  2. Bootstrapping
    • More training data were harvested from the Web using TEXTRUNNER (Banko_et_al_IJCAI_2007).
    • Web pages were weighted by an estimate of their relevance to the relation.
  3. Correction
    • An interface encouraged the community to make corrections, so that more training data could be collected.
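
The shrinkage step in the Self Learning component can be illustrated with a small sketch. This is not KYLIN's implementation: the function names, the example class names, and the geometric `decay` weighting are all assumptions made for illustration. The sketch only shows the general idea of borrowing training sentences from superclasses in the ontology and giving them less weight the further they are from the target class.

```python
from typing import Dict, List, Optional, Tuple


def ancestors(cls: str, parent_of: Dict[str, Optional[str]]) -> List[Tuple[str, int]]:
    """Return (class, distance) pairs from cls up to the root, starting with (cls, 0)."""
    chain: List[Tuple[str, int]] = []
    seen = set()
    current: Optional[str] = cls
    distance = 0
    # Walk up the hierarchy; the `seen` set guards against accidental cycles.
    while current is not None and current not in seen:
        chain.append((current, distance))
        seen.add(current)
        current = parent_of.get(current)
        distance += 1
    return chain


def shrinkage_training_set(
    target: str,
    parent_of: Dict[str, Optional[str]],
    examples_by_class: Dict[str, List[str]],
    decay: float = 0.5,  # hypothetical weighting scheme, not taken from the paper
) -> List[Tuple[str, float]]:
    """Pool examples from the target class and its superclasses.

    Examples from a class `d` steps above the target get weight decay ** d,
    so the target's own examples always have weight 1.0.
    """
    weighted: List[Tuple[str, float]] = []
    for cls, distance in ancestors(target, parent_of):
        weight = decay ** distance
        for sentence in examples_by_class.get(cls, []):
            weighted.append((sentence, weight))
    return weighted


if __name__ == "__main__":
    # Hypothetical two-level ontology: "performer" is a subclass of "person".
    parent_of = {"performer": "person", "person": None}
    examples = {
        "performer": ["She was born in Chicago."],
        "person": ["He was born in 1950.", "He grew up in Ohio."],
    }
    for sentence, weight in shrinkage_training_set("performer", parent_of, examples):
        print(f"{weight:.2f}  {sentence}")
```

A learner that accepts per-example weights could then consume these pairs, so that a sparse subclass benefits from superclass data without being dominated by it.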

Related papers

More details of KYLIN can be found in Wu_and_Weld_CIKM_2007, which addresses the task of completing infoboxes in Wikipedia pages. A follow-up paper (Wu_and_Weld_ACL_2010) refines the solution by adding dependency parsing features to train the model.