Difference between revisions of "Wu and Weld WWW 2008"

Revision as of 23:30, 25 September 2011

Citation

Wu, F. and Weld, D. 2008. Automatically Refining the Wikipedia Infobox Ontology. In Proceedings of the 17th International Conference on World Wide Web, pp. 635-644, ACM, New York.

Online version

University of Washington

Summary

This paper introduces an autonomous system for refining Wikipedia’s infobox schema into a cleanly structured ontology. The authors show that a refined ontology enables advanced queries, improved information extractors, and semiautomatic generation of new infobox templates. The ontology refinement problem is solved using both Support Vector Machines and a more powerful joint-inference approach expressed in Markov Logic Networks.

The autonomous system, called the Kylin Ontology Generator (KOG), comprises three modules:

a schema cleaner, which merges duplicate classes and attributes and prunes rarely-used ones;
a subsumption detector, which identifies is-a relations between infobox classes (e.g. "volleyball player" is-a "athlete");
and a schema mapper, which builds attribute mappings between related infobox classes.
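
As a rough illustration of the pruning half of the schema cleaner, the sketch below drops attributes that appear in too few instances of a class. The function name `prune_rare_attributes`, the `min_fraction` threshold, and the sample data are all hypothetical; the summary above does not say which criterion or cutoff KOG actually uses.

```python
from collections import Counter


def prune_rare_attributes(instances, min_fraction=0.1):
    """Return the attributes used by at least min_fraction of the instances.

    instances: list of dicts, one per article infobox (attribute -> value).
    min_fraction: hypothetical cutoff, not a value taken from the paper.
    """
    if not instances:
        return set()
    # Count in how many infoboxes each attribute name occurs.
    usage = Counter(attr for inst in instances for attr in inst)
    cutoff = min_fraction * len(instances)
    return {attr for attr, count in usage.items() if count >= cutoff}


# Made-up infoboxes for a single class.
infoboxes = [
    {"name": "A", "height": "1.9 m", "team": "X"},
    {"name": "B", "height": "2.0 m"},
    {"name": "C", "team": "Y", "nickname": "Ace"},
    {"name": "D", "height": "1.8 m", "team": "Z"},
]
kept = prune_rare_attributes(infoboxes, min_fraction=0.5)
```

With these four sample infoboxes and a cutoff of one half, "nickname" is the only attribute that falls below the threshold.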

The subsumption detection task is modeled as a binary classification problem and several intuitive indicators are used as features to train the classifiers:

Similarity measure: the similarity between two infobox classes, measured using the TF/IDF scores between bags of words taken from their attribute set, the first sentence of each of their instances (articles) and their category tags.
Class-name string inclusion: whether the name of a class is a substring of another one (e.g. "English public school" is-a "public school").
Category tags: whether the name of a class is found in the infobox template category tag.
Edit history: the edit pattern of an instance, because a Wikipedia author tends to specialize rather than generalize when changing the type of an article.
Hearst patterns: the number of Google hits for phrases of the form "Class1, like Class2" or "Class1 such as Class2" (e.g. "...scientists such as chemists, physicists...").
WordNet mapping: a set of heuristics computes a mapping between each infobox class and a WordNet node; whether the other class also has a corresponding WordNet node is used as a further feature for classification.
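
Three of the indicators above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`tfidf_vectors`, `cosine`, `name_includes`, `hearst_queries`), the smoothed IDF formula, and the toy bags of words are all hypothetical, and the Hearst helper only builds the query phrases without contacting any search engine.

```python
import math
from collections import Counter


def tfidf_vectors(bags):
    """Map each class name to a sparse TF-IDF vector (token -> weight).

    bags: dict of class name -> list of tokens drawn from attributes,
    first sentences and category tags. The smoothed IDF is an assumption.
    """
    n = len(bags)
    df = Counter(tok for tokens in bags.values() for tok in set(tokens))
    vectors = {}
    for name, tokens in bags.items():
        tf = Counter(tokens)
        vectors[name] = {
            tok: count * (math.log((1 + n) / (1 + df[tok])) + 1.0)
            for tok, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def name_includes(sub_name, super_name):
    """True if super_name is a proper substring of sub_name (case-insensitive)."""
    sub, sup = sub_name.lower(), super_name.lower()
    return sub != sup and sup in sub


def hearst_queries(sub_name, super_name):
    """Build the two quoted phrases whose hit counts would serve as features."""
    return [
        f'"{super_name}, like {sub_name}"',
        f'"{super_name} such as {sub_name}"',
    ]


# Toy bags of words (made up for illustration).
bags = {
    "athlete": ["name", "team", "height", "sport"],
    "volleyball player": ["name", "team", "height", "position"],
    "city": ["name", "population", "mayor", "area"],
}
vecs = tfidf_vectors(bags)
sim_close = cosine(vecs["athlete"], vecs["volleyball player"])
sim_far = cosine(vecs["athlete"], vecs["city"])
```

In this toy data, "athlete" scores higher against "volleyball player" than against "city" because the first pair shares three attribute tokens and the second only one. A real feature vector for the classifier would combine such scores with the remaining indicators.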

Experimental result

...

Related papers

This paper is based on Wu and Weld CIKM 2007.

