Weld et al SIGMOD 2009
From Cohen Courses
Citation
Weld, D. S., Hoffmann, R., and Wu, F. 2009. Using Wikipedia to bootstrap open information extraction. SIGMOD Rec. 37, 4 (Mar. 2009), 62-68.
Online version
Summary
This paper addresses the Open Information Extraction problem. The authors use a self-supervised learning prototype, KYLIN (Wu_and_Weld_CIKM_2007), trained using Wikipedia. The proposed solution has three components:
1. Self Learning
   - The infoboxes of Wikipedia pages are used to determine the class of each page and the attributes of that class.
   - Training data for extraction were constructed from these Wikipedia pages using heuristics. First, a heuristic document classifier assigns documents to classes; then a sentence classifier (MaxEnt with bagging) decides whether a sentence expresses one of the relations; finally, a CRF model extracts the values (the second entities) of the relations. A minimal sketch of this pipeline is given after this list.
   - Shrinkage was used to improve recall, using an automatic ontology generator that combines the infobox classes with WordNet. This ontology gives a hierarchy of classes and lets a subclass be trained with the data of its superclass (second sketch below).
2. Bootstrapping
   - More training data were harvested from the Web using TEXTRUNNER (Banko_et_al_IJCAI_2007).
   - Web pages were weighted using an estimate of their relevance to the relation (third sketch below).
3. Correction
   - An interface encourages the community to make corrections, so that more training data can be collected.
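The self-learning pipeline can be illustrated with a small sketch. The toy sentences, the `label_sentences` heuristic, and the use of scikit-learn are assumptions made here for illustration; the paper's system uses a MaxEnt sentence classifier with bagging followed by a CRF extractor, which this sketch only approximates with a plain logistic-regression (MaxEnt) model.

```python
# Minimal sketch of KYLIN-style self-supervised training data construction and
# sentence classification, assuming scikit-learn. All names and toy data are
# hypothetical; the actual system adds bagging to the MaxEnt classifier and a
# CRF to extract the attribute value from positive sentences.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression  # logistic regression = MaxEnt
from sklearn.pipeline import make_pipeline

def label_sentences(sentences, infobox_value):
    """Heuristic labelling: a sentence that mentions the infobox attribute
    value is taken as a positive training example for that relation."""
    return [(s, int(infobox_value in s)) for s in sentences]

article_sentences = [
    "Pittsburgh is a city in Pennsylvania.",
    "The city is home to several universities.",
    "It was founded in 1758 at the confluence of two rivers.",
    "Western Pennsylvania surrounds the city.",
]
texts, labels = zip(*label_sentences(article_sentences, "Pennsylvania"))

# Sentence classifier over the heuristically labelled data.
sentence_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
sentence_clf.fit(texts, labels)
print(sentence_clf.predict(["Pittsburgh lies in western Pennsylvania."]))
```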
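Shrinkage over the generated ontology can be pictured as pooling training data up the class hierarchy, so that a sparsely populated subclass borrows (down-weighted) examples from its superclasses. The toy hierarchy, class names, and decay factor below are hypothetical; in the paper the ontology is generated automatically from infobox classes and WordNet.

```python
# Hypothetical sketch of shrinkage: a subclass is trained with its own data
# plus progressively down-weighted data from its ancestors in the ontology.
parent = {                       # toy class hierarchy (subclass -> superclass)
    "performer.actor": "performer",
    "performer": "person",
}

training_data = {                # toy per-class training sentences
    "performer.actor": ["She starred in the 1999 film."],
    "performer": ["He performed at the festival.", "She toured in 2001."],
    "person": ["He was born in Vienna."],
}

def pooled_examples(cls, decay=0.5):
    """Collect examples for `cls` and all of its ancestors, weighting ancestor
    data less the further up the hierarchy it comes from."""
    weight, examples = 1.0, []
    while cls is not None:
        examples += [(sentence, weight) for sentence in training_data.get(cls, [])]
        cls, weight = parent.get(cls), weight * decay
    return examples

for sentence, weight in pooled_examples("performer.actor"):
    print(f"{weight:.2f}  {sentence}")
```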
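For the bootstrapping step, sentences harvested from the Web can enter training with per-example weights proportional to the estimated relevance of their source page. The relevance scores and toy data below are hypothetical; the sketch only shows how such weights could be passed to a classifier via scikit-learn's sample_weight argument.

```python
# Hypothetical sketch: Web-harvested examples are added to the Wikipedia-derived
# training set with weights given by the estimated relevance of their pages.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

wiki_texts = ["Pittsburgh is a city in Pennsylvania.", "It hosts several universities."]
wiki_labels = [1, 0]

web_texts = ["Pittsburgh, Pennsylvania travel guide.", "Best restaurants downtown."]
web_labels = [1, 0]
page_relevance = [0.8, 0.3]      # hypothetical relevance estimates for the two pages

texts = wiki_texts + web_texts
labels = wiki_labels + web_labels
weights = [1.0] * len(wiki_texts) + page_relevance   # Wikipedia examples keep full weight

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels, sample_weight=weights)
print(clf.predict(vectorizer.transform(["A city in Pennsylvania."])))
```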
Related papers
More details of KYLIN can be found in Wu_and_Weld_CIKM_2007, which applies it to the task of completing infoboxes in Wikipedia pages. A follow-up paper (Wu_and_Weld_ACL_2010) refines the solution by adding dependency-parsing features to train the model.