Wu and Weld CIKM 2007


Citation

Wu, F. and Weld, D. S. 2007. Autonomously semantifying Wikipedia. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (Lisbon, Portugal, November 06 - 10, 2007). CIKM '07. ACM, New York, NY, 41-50.

Online version

ACM Digital Library

Summary

This paper describes KYLIN, an information extraction prototype, applied to the task of Wikipedia refinement. Wikipedia is an accurate source of data, but it still suffers from incompleteness, duplicates, and ambiguities. The paper addresses two problems: 1) infobox completion, and 2) link generation. The authors use a three-step procedure to tackle the infobox completion task:

  • Preprocessing
    • The infobox schema for a class was defined by first grouping articles that use the same infobox template name and then selecting the most common attributes (those used in more than 15% of the articles).
    • Training data was generated by heuristically selecting, for each attribute, a unique sentence in the article that mentions the attribute; this sentence is the positive sample. The remaining sentences in the article are used as negative samples.
  • Document & Sentence Classification
    • Candidate documents are identified heuristically: 1) find list pages that match the infobox class keywords, and 2) classify the articles on those list pages based on their category tags.
    • Candidate sentences are identified using a MaxEnt classifier with bagging; the features are words and their POS tags.
  • Attribute Extraction
    • Negative training examples are dropped if the sentence classifier from the previous step labeled them as candidate sentences.
    • Attribute values are identified using CRFs, with one CRF model trained per attribute.
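
The schema-selection and negative-filtering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not KYLIN's actual code: the function names (infer_schema, filter_negatives), the input layout (pairs of template name and attribute set), and the toy data are all hypothetical. Only the 15% threshold and the rule of dropping negatives that the sentence classifier accepts come from the summary.

```python
from collections import Counter


def infer_schema(articles, min_fraction=0.15):
    """Pick the common attributes for each infobox class.

    `articles` is a list of (template_name, attribute_names) pairs; this
    layout is an assumption made for the sketch. An attribute is kept when
    it appears in more than `min_fraction` of the articles that share the
    same infobox template name.
    """
    # Group the attribute sets of articles by infobox template name.
    by_class = {}
    for template, attrs in articles:
        by_class.setdefault(template, []).append(set(attrs))

    schema = {}
    for template, attr_sets in by_class.items():
        # Count in how many articles of this class each attribute occurs.
        counts = Counter(a for attrs in attr_sets for a in attrs)
        total = len(attr_sets)
        schema[template] = sorted(
            a for a, c in counts.items() if c / total > min_fraction
        )
    return schema


def filter_negatives(negative_sentences, is_candidate):
    """Drop negative examples that the sentence classifier marks as candidates.

    `is_candidate` stands in for the trained sentence classifier; here it is
    any callable that returns True for a candidate sentence.
    """
    return [s for s in negative_sentences if not is_candidate(s)]


# Toy data (hypothetical): ten articles that share the "county" template.
articles = (
    [("county", {"name", "population"})] * 4
    + [("county", {"name"})] * 5
    + [("county", {"name", "motto"})]
)
schema = infer_schema(articles)

# A stand-in classifier that treats any sentence containing "population"
# as a candidate sentence.
kept = filter_negatives(
    ["The population was 5,000.", "It has a mild climate."],
    lambda sentence: "population" in sentence,
)
```

In the toy data, "motto" appears in only 1 of 10 articles (10%) and so falls below the threshold, while "name" and "population" are kept. Filtering the negatives this way keeps sentences that probably do mention an attribute from being used as negative examples when the per-attribute extractors are trained.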

Link generation was also done largely heuristically. The evaluation used Wikipedia data from 2007.02.06.

Related papers

This prototype was later used for the more general task of open-domain information extraction in Wu_and_Weld_ACL_2010.