Latest revision as of 23:34, 29 September 2011
== Citation ==
Wu, F., Hoffmann, R. and Weld, D. 2008. Information Extraction from Wikipedia: Moving Down the Long Tail. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 731–739, ACM, New York.
== Online version ==
== Summary ==
Most Wikipedia articles come from the "long tail of sparse classes", article types with a small number of instances (82% of the classes have fewer than 100 articles). This [[Category::paper]] introduces three techniques to improve the recall of [[AddressesProblem::Infobox completion | structured information extraction]] from those classes: '''shrinkage''' over a refined ontology, '''retraining''' using open information extractors, and '''supplementing results''' by extracting from the general Web.
A general statistical technique called shrinkage is applied when training the extractor of a sparse class, aggregating data from its parent and child classes. The subsumption hierarchy needed for this task comes from a previously developed system called [[RelatedPaper::Wu and Weld WWW 2008|KOG]] (Kylin Ontology Generator). For a given class <math>C</math>, the shrinkage procedure collects the related class set <math>S_{C} = \{C_{i} \mid \text{path}(C, C_{i}) \le l\}</math>, where <math>l</math> is the threshold on path length. Then, the set of attributes from this related class set is used to aggregate training data, and the [[UsesMethod::Conditional Random Fields | CRF]] extractors for <math>C</math> are trained on the augmented dataset.
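The related-class collection step can be sketched as a bounded breadth-first search over the subsumption hierarchy (a minimal illustration only; the toy ontology and class names below are hypothetical, not from the paper):

```python
from collections import deque

# Toy subsumption hierarchy: each class maps to its direct neighbors
# (parent and children), as a KOG-like ontology would provide.
ONTOLOGY = {
    "Person": ["Performer", "Writer"],
    "Performer": ["Person", "Actor"],
    "Actor": ["Performer"],
    "Writer": ["Person"],
}

def related_classes(c, l):
    """Collect S_C = {C_i | path(C, C_i) <= l} by breadth-first search."""
    seen = {c: 0}           # class -> path length from c
    queue = deque([c])
    while queue:
        node = queue.popleft()
        if seen[node] == l:  # do not expand past the path-length threshold
            continue
        for neighbor in ONTOLOGY.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    # Exclude c itself; keep classes within distance l.
    return {ci for ci, dist in seen.items() if 0 < dist <= l}

# Training data for the sparse class is then aggregated from S_C,
# e.g. related_classes("Performer", 1) -> {"Person", "Actor"}
```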
The main idea of the second technique is to use additional training data from the output of an open information extraction system. Open IE systems extract a set of relations <math>\{r \mid r = \langle obj_{1}, predicate, obj_{2} \rangle \}</math>, and Wikipedia infoboxes implicitly define triples of the type <math>\{t \mid t = \langle subject, attribute, value \rangle \}</math>, where the subject is the title of the article. The retrainer iterates through each attribute <math>C.a</math> of an infobox class <math>C</math> and gets the related set of triples <math>T = \{t \mid t.attribute = C.a \}</math>. Then, it iterates through <math>T</math> to get a set of potential matches from the Open IE system, <math>R(C.a) = \{r \mid \exists t \in T : r.obj_{1} = t.subject, r.obj_{2} = t.value \}</math>, which is used to augment and clean the training data for <math>C</math>'s extractors.
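The matching step above can be sketched as follows (an illustrative sketch; the tuple fields and the sample newspaper data are assumptions, not the paper's actual data structures):

```python
from collections import namedtuple

# Open IE relations and infobox-derived triples.
Relation = namedtuple("Relation", ["obj1", "predicate", "obj2"])
Triple = namedtuple("Triple", ["subject", "attribute", "value"])

def matches_for_attribute(relations, triples, attr):
    """R(C.a): Open IE relations whose arguments align with some
    infobox triple for attribute attr."""
    t_set = [t for t in triples if t.attribute == attr]
    return [r for r in relations
            if any(r.obj1 == t.subject and r.obj2 == t.value
                   for t in t_set)]

# Hypothetical example data:
triples = [Triple("Dublin Evening Mail", "founded", "1823")]
relations = [
    Relation("Dublin Evening Mail", "was founded in", "1823"),
    Relation("Dublin Evening Mail", "was published by", "Tim Harrington"),
]
# Only the first relation aligns with the infobox triple for "founded".
```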
Finally, the general Web extraction technique used to supplement additional information is modeled as an information retrieval problem. It is solved by a module which generates a set of queries for a general Web search engine, downloads the top-k pages, and splits their text into sentences. The queries are generated using the article title and the predicates learned during retraining (e.g. <tt>"andrew murray" was born in</tt> for the attribute <tt>birthdate</tt>). Each extracted sentence then goes through the same extraction process as the Wikipedia sentences, but the results are weighted using two features: the distance to the closest sentence containing the title of the article and the rank of the page according to the search engine.
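The query generation and sentence weighting can be sketched like this (a minimal sketch under stated assumptions: the function names are hypothetical, and the weighting shown is one illustrative way to combine the two features, not the paper's actual formula):

```python
def make_queries(title, predicates):
    """Build Web search queries from the article title and the
    predicates learned during retraining."""
    return ['"%s" %s' % (title.lower(), p) for p in predicates]

def sentence_weight(rank, distance, max_rank=10, max_distance=5):
    """Combine the search-engine rank of the page and the distance to
    the closest sentence mentioning the article title (smaller is
    better for both) into a single score in (0, 1]."""
    rank_score = 1.0 - min(rank, max_rank) / float(max_rank + 1)
    dist_score = 1.0 - min(distance, max_distance) / float(max_distance + 1)
    return rank_score * dist_score

queries = make_queries("Andrew Murray", ["was born in"])
# -> ['"andrew murray" was born in']
```

A sentence from the top-ranked page that sits next to a mention of the article title scores highest; sentences from low-ranked pages or far from any title mention are discounted.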
== Experimental results ==
The 07/16/2007 snapshot of [[UsesDataset::Wikipedia]] was used as the source dataset, and four classes were tested: "Irish newspaper" (with 20 infobox-containing instance articles), "Performer" (44), "Baseball stadium" (163), and "Writer" (2213), representing different degrees of sparsity. The following table shows the cumulative improvements in the area under the precision-recall curve (AUC) for the three techniques.

[[File:results.png]]
The experiments showed that each of these methods is effective individually; however, shrinkage primarily addresses the long-tail challenge of sparse classes, while the other two primarily address the challenge of short articles.
== Related papers ==
This paper improves the performance of Kylin, a self-supervised information extractor first described in [[RelatedPaper::Wu and Weld CIKM 2007]]. The shrinkage technique uses a cleanly-structured ontology, the output of KOG, a system presented in [[RelatedPaper::Wu and Weld WWW 2008]]. The retraining technique uses TextRunner, which is described in [[RelatedPaper::Banko et al IJCAI 2007]], but it could use any open information extractor, like the ones described in [[RelatedPaper::Wu and Weld ACL 2010]] and [[RelatedPaper::Fader et al EMNLP 2011]].