"Wu and Weld CIKM 2007"
Citation
Wu, F. and Weld, D. S. 2007. Autonomously semantifying Wikipedia. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (Lisbon, Portugal, November 06 - 10, 2007). CIKM '07. ACM, New York, NY, 41-50.
Online version
Summary
This paper describes KYLIN, an information extraction prototype, applied to the task of Wikipedia refinement. Wikipedia is an accurate source of data, but it still suffers from incompleteness, duplicates, and ambiguities. The paper addresses two problems: 1) infobox completion and 2) link generation. The authors used a three-step procedure to tackle the infobox completion task:
- Preprocessing
  - The infobox schema for a class was defined by first grouping articles with the same infobox template name and then selecting the most common attributes (those used in >15% of the articles).
  - Training data was generated by heuristically selecting a unique sentence in each document that contains an attribute's value as the positive sample. The rest of the sentences in the document are used as negative samples.
- Document & Sentence Classification
  - A candidate document is identified using a heuristic approach: 1) find list pages that match the infobox class keywords, and 2) classify the articles from those list pages based on their category tags.
  - A candidate sentence is identified using a maximum entropy (MaxEnt) classifier with bagging, with words and their POS tags as features.
- Attribute Extraction
  - Negative training examples are ignored if the sentence was classified as a candidate sentence in the previous step.
  - Attribute values are identified using CRFs, with one extractor trained for each attribute.
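The preprocessing step above can be sketched roughly as follows. This is a minimal illustration, not KYLIN's actual code: the function names `refine_schema` and `label_sentences`, the dict-based infobox representation, and the case-insensitive substring match standing in for the paper's sentence-matching heuristics are all assumptions made for the example.

```python
from collections import Counter


def refine_schema(infoboxes, min_fraction=0.15):
    """Keep attributes used in more than `min_fraction` of the infoboxes.

    `infoboxes` is a list of dicts (attribute name -> value), one per
    article that shares the same infobox template name.
    """
    counts = Counter(attr for box in infoboxes for attr in box)
    threshold = min_fraction * len(infoboxes)
    return sorted(attr for attr, n in counts.items() if n > threshold)


def label_sentences(sentences, infobox, schema):
    """Build per-attribute training labels for one article.

    Assumed heuristic: an attribute yields training data only when exactly
    one sentence contains its value (case-insensitive substring match).
    That sentence is the positive sample; all others are negatives.
    """
    examples = {}
    for attr in schema:
        value = infobox.get(attr)
        if not value:
            continue
        matches = [i for i, s in enumerate(sentences)
                   if value.lower() in s.lower()]
        if len(matches) != 1:
            continue  # no unique sentence, so skip this attribute
        positive = matches[0]
        examples[attr] = [(s, i == positive) for i, s in enumerate(sentences)]
    return examples
```

Given an article whose infobox has `population` and `area`, `label_sentences` returns, for each attribute with a unique matching sentence, a list of `(sentence, is_positive)` pairs that a sentence classifier could then be trained on. Attributes whose value appears in zero or several sentences are dropped rather than guessed at.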
Link generation was also done rather heuristically. The evaluation used Wikipedia data from 2007.02.06.
Related papers
This prototype was later used for the more general task of open-domain information extraction in Wu and Weld, ACL 2010.