Wu and Weld WWW 2008

From Cohen Courses
Revision as of 00:57, 26 September 2011 by Aanavas (talk | contribs) (→‎Summary)

Citation

Wu, F. and Weld, D. 2008. Automatically Refining the Wikipedia Infobox Ontology. In Proceedings of the 17th Conference of the World Wide Web, pp. 635-644, ACM, New York.

Online version

University of Washington

Summary

This paper introduces an autonomous system that refines Wikipedia’s infobox schema into a cleanly structured ontology. The authors present advanced query capability, improved information extractors, and semiautomatic generation of new infobox templates as the advantages of a refined ontology. The ontology refinement problem is solved using both Support Vector Machines (SVM) and a more powerful joint-inference approach expressed in Markov Logic Networks (MLN).

The system, called the Kylin Ontology Generator (KOG), comprises three modules:

  • a schema cleaner, which merges duplicate classes and attributes and prunes rarely-used ones;
  • a subsumption detector, which identifies is-a relations between infobox classes (e.g. "volleyball player" is-a "athlete");
  • and a schema mapper, which builds attribute mappings between related infobox classes.
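As a rough illustration of what the schema cleaner's merge-and-prune step might look like, here is a minimal sketch. The normalization rule, the `min_uses` threshold, and the function names are assumptions made for this example; the summary above does not say how KOG actually decides that two names are duplicates.

```python
from collections import Counter
from typing import Dict


def normalize(name: str) -> str:
    """Canonicalize a name so trivially different spellings collide (assumed rule)."""
    return name.lower().replace("_", " ").strip()


def clean_schema(usage: Dict[str, int], min_uses: int = 2) -> Dict[str, int]:
    """Merge names that share a normalized form, then prune rarely used ones.

    `usage` maps a raw class or attribute name to the number of times it is used.
    """
    merged = Counter()
    for raw_name, count in usage.items():
        merged[normalize(raw_name)] += count
    # Keep only names whose merged usage count reaches the (hypothetical) threshold.
    return {name: n for name, n in merged.items() if n >= min_uses}


usage = {"Birth_Date": 40, "birth date": 12, "favourite colour": 1}
print(clean_schema(usage))  # {'birth date': 52}
```

Here the two spellings of "birth date" are merged and their counts added, while the attribute used only once is pruned.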

The subsumption detection task is modeled as a binary classification problem, and several intuitive indicators are used as features to train the classifiers:

  • Similarity measure: the similarity between two infobox classes, computed from TF/IDF-weighted bags of words drawn from their attribute sets, the first sentences of their instances (articles), and their category tags.
  • Class-name string inclusion: whether the name of one class is contained in the name of another (e.g. "English public school" is-a "public school").
  • Category tags: whether the name of a class is found in the infobox template category tag.
  • Edit history: the edit pattern of an instance, because a Wikipedia author tends to specialize rather than generalize when changing the type of an article.
  • Hearst patterns: the number of Google hits for match phrases of the form "Class1, like Class2" or "Class1 such as Class2" (e.g. "...scientists such as chemists...").
  • WordNet mapping: a set of heuristics maps each infobox class to a WordNet node, and the relation between the WordNet nodes mapped to the two classes is also used as a feature for classification.
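Three of the features above are simple enough to sketch in a few lines. The code below is only an illustration under stated assumptions: the smoothed IDF formula, the function names, and the toy attribute bags are invented for this example, and no search engine is actually queried (the Hearst-pattern function only builds the phrases whose hit counts would be looked up).

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(bags: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Build one TF-IDF vector per class from its bag of words."""
    n_classes = len(bags)
    # Document frequency: in how many classes' bags does each word occur?
    doc_freq = Counter(word for words in bags.values() for word in set(words))
    vectors = {}
    for name, words in bags.items():
        term_freq = Counter(words)
        vectors[name] = {
            # The "+ 1.0" keeps words shared by every class from vanishing (assumed variant).
            word: tf * (math.log(n_classes / doc_freq[word]) + 1.0)
            for word, tf in term_freq.items()
        }
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def name_inclusion(sub_name: str, super_name: str) -> bool:
    """True if the candidate parent's name is properly contained in the child's name."""
    sub, sup = sub_name.lower(), super_name.lower()
    return sub != sup and sup in sub


def hearst_phrases(sub_plural: str, super_plural: str) -> List[str]:
    """Quoted phrases whose web hit counts would serve as the Hearst-pattern feature."""
    return [
        f'"{super_plural}, like {sub_plural}"',
        f'"{super_plural} such as {sub_plural}"',
    ]


# Toy bags of attribute names (invented for illustration).
bags = {
    "volleyball player": ["name", "team", "position", "height"],
    "athlete": ["name", "team", "height", "sport"],
    "city": ["name", "population", "mayor"],
}
vectors = tfidf_vectors(bags)
print(cosine(vectors["volleyball player"], vectors["athlete"]))
print(cosine(vectors["volleyball player"], vectors["city"]))
print(name_inclusion("English public school", "public school"))  # True
print(hearst_phrases("chemists", "scientists"))
```

On this toy data the "volleyball player" vector is closer to "athlete" than to "city", which is the kind of signal the similarity feature is meant to give the classifier.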

Both the SVM classifier and the MLN model are trained using the features above, but the MLN classifier exploits two additional sources of information. First, if "Class1 is-a Class2" and "Class2 is-a Class3", then it is likely that "Class1 is-a Class3". Second, the SVM approach treats the WordNet mapping and the is-a classification as separate problems, although evidence from either one can help to reduce the uncertainty of the other. This additional knowledge is represented as additional logical formulas in the MLN classifier:

  • transitivity: if c1 is-a c2 and c2 is-a c3, then c1 is-a c3 (the intuition that is-a is transitive);
  • WordNet consistency: if c1 and c2 have correct WordNet mappings and the mapped nodes are is-a according to WordNet, then c1 is-a c2.
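In first-order notation, these two formulas can be sketched as follows. The predicate names isa, map, and isa_WN are assumptions chosen for illustration and are not necessarily the paper's own notation. In an MLN each formula carries a weight, so it acts as a soft constraint rather than a hard rule.

```latex
\begin{align*}
\mathit{isa}(c_1, c_2) \wedge \mathit{isa}(c_2, c_3)
  &\Rightarrow \mathit{isa}(c_1, c_3) \\
\mathit{map}(c_1, n_1) \wedge \mathit{map}(c_2, n_2) \wedge \mathit{isa}_{\mathrm{WN}}(n_1, n_2)
  &\Rightarrow \mathit{isa}(c_1, c_2)
\end{align*}
```

Here c1, c2, c3 range over infobox classes and n1, n2 over WordNet nodes.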

Experimental results

...

Related papers

This paper is based on Wu and Weld CIKM 2007.