D. Lange et al., CIKM 2010


Citation

Dustin Lange, Christoph Böhm, Felix Naumann. 2010. Extracting structured information from Wikipedia articles to populate infoboxes. In CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management.

Online version

ACM Digital Library

Summary

Ipopulatorexample.png

This paper introduces the iPopulator system, which automatically populates the infoboxes of Wikipedia articles by extracting attribute values from the article text. (This task is also known as the infobox completion problem.)

System Architecture

iPopulator's extraction workflow contains four steps:

IPopulatorExtractionProcess.png

  • Structure Analysis: For each attribute of the infobox template, we analyze its values given in the training articles' infoboxes to determine a structure that represents the attribute's syntactical characteristics.
  • Training Data Creation: For this step, we use articles that specify a value for an attribute as training data. Occurrences of attribute values within the training article texts are labeled.
  • Value Extractor Creation: The labeled training data are used to generate extractors for as many attributes as possible. We employ Conditional Random Fields (CRFs) to generate attribute value extractors. These extractors are automatically evaluated, so that ineffective extractors can be discarded.
  • Attribute Value Extraction: The extractors can then be applied to all articles to find missing attribute values for existing infoboxes.
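The evaluate-and-discard part of Value Extractor Creation can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the exact-match precision measure, and the `min_precision` cutoff of 0.75 are all assumptions made for the example.

```python
def precision(predicted, gold):
    """Fraction of extracted values that equal the gold value.

    Articles where the extractor returned nothing (None) are skipped,
    because they affect recall rather than precision.
    """
    pairs = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    if not pairs:
        return 0.0
    return sum(p == g for p, g in pairs) / len(pairs)


def keep_effective(extractor_outputs, gold_values, min_precision=0.75):
    """Return {attribute: precision} for extractors that pass the cutoff.

    `extractor_outputs` and `gold_values` map an attribute name to a list
    of values, one per held-out article (hypothetical data layout).
    """
    kept = {}
    for attribute, predicted in extractor_outputs.items():
        score = precision(predicted, gold_values[attribute])
        if score >= min_precision:
            kept[attribute] = score
    return kept


outputs = {"population": ["10", None, "30"], "motto": ["a", "b", "c"]}
gold = {"population": ["10", "20", "30"], "motto": ["x", "b", "y"]}
effective = keep_effective(outputs, gold)
```

Here the hypothetical `motto` extractor is dropped because only one of its three extractions is correct, while `population` is kept.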

The accompanying figure (IPopulatorExtractionProcess.png) illustrates these steps.

Brief description of the method

Training Data Creation

Article Paragraph Filtering: Many Wikipedia articles are rather long and contain much information that is irrelevant for the Infobox Population Problem. We conducted an occurrence analysis and concluded that the first paragraphs of an article are sufficient for extracting many infobox attributes. To limit the corpus size while still examining the useful text passages, we use only the first few paragraphs.
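A minimal sketch of this filtering step is shown below. The helper name, the blank-line paragraph delimiter, and the default cutoff of five paragraphs are assumptions for illustration; the summary above only says "the first few paragraphs".

```python
def first_paragraphs(article_text: str, max_paragraphs: int = 5) -> list[str]:
    """Return the first `max_paragraphs` non-empty paragraphs of an article.

    Paragraphs are assumed to be separated by blank lines.
    """
    paragraphs = [p.strip() for p in article_text.split("\n\n")]
    paragraphs = [p for p in paragraphs if p]  # drop empty fragments
    return paragraphs[:max_paragraphs]


sample = "Intro paragraph.\n\nSecond paragraph.\n\n\n\nThird paragraph."
lead = first_paragraphs(sample, max_paragraphs=2)
```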

Fuzzy Matching: We apply a simple heuristic to discover fuzzy matches of attribute values. We distinguish between numbers and all other strings.
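The summary does not spell out the heuristic, so the following is only one plausible sketch of a matcher that treats numbers and other strings differently. The relative tolerance for numbers, the `difflib` similarity ratio for strings, and every name and threshold here are assumptions, not the method from the paper.

```python
import re
from difflib import SequenceMatcher

# Digits with optional thousands separators and an optional decimal part.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_number(text):
    """Parse strings such as '1,500' or '3.14'; return None for non-numbers."""
    stripped = text.strip()
    if not NUMBER_RE.fullmatch(stripped):
        return None
    return float(stripped.replace(",", ""))


def fuzzy_find(value, text, rel_tol=0.01, min_ratio=0.8):
    """Return the text span that approximately matches `value`, or None."""
    target = parse_number(value)
    if target is not None:
        # Numbers: compare numerically, ignoring formatting differences.
        for match in NUMBER_RE.finditer(text):
            candidate = parse_number(match.group())
            if abs(candidate - target) <= rel_tol * abs(target):
                return match.group()
        return None
    # Other strings: slide a window with the same token count over the text
    # and keep the most similar window above the threshold.
    tokens = text.split()
    width = len(value.split())
    best, best_ratio = None, min_ratio
    for i in range(len(tokens) - width + 1):
        window = " ".join(tokens[i:i + width])
        ratio = SequenceMatcher(None, value.lower(), window.lower()).ratio()
        if ratio >= best_ratio:
            best, best_ratio = window, ratio
    return best
```

For example, `fuzzy_find("1,500", "The town has 1500 inhabitants.")` matches the differently formatted number, while a string value is matched against token windows of the same length.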

Labeling Value Parts: All attribute values are divided into several parts according to the corresponding attribute value structure. Each part of the value structure is labeled separately. To retain the identity of the value part that is being labeled, a number is assigned to each structure part and used as the actual label.
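The numbering scheme can be illustrated with a small sketch. It assumes the value has already been split into parts and that the parts occur contiguously and exactly in the tokenized text; both the function name and the `"O"` label for unrelated tokens are assumptions borrowed from common sequence-labeling practice, not details from the paper.

```python
def label_tokens(tokens, value_parts):
    """Label each token with the 1-based number of the value part it matches.

    Tokens that do not belong to an occurrence of the value get the label "O".
    """
    labels = ["O"] * len(tokens)
    width = len(value_parts)
    for start in range(len(tokens) - width + 1):
        if tokens[start:start + width] == value_parts:
            for offset in range(width):
                labels[start + offset] = str(offset + 1)
    return labels


tokens = "The river is 120 km long".split()
labels = label_tokens(tokens, ["120", "km"])
```

In this example the number part `120` receives label `1` and the unit part `km` receives label `2`, so a sequence model trained on such labels can learn to extract each part of the structure separately.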

Experimental Result

Overall Performance

Evaluating on 1,727 distinct infobox template attributes, the authors claim to achieve precision [...] for 1,521 attributes and [...] for 1,127 attributes. The overall averages are precision 0.91 and recall 0.66, giving an F-measure of [...].
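Precision and recall are commonly combined into an F-measure, their weighted harmonic mean. The sketch below shows the standard formula. Whether the paper's overall figure is computed from the averaged precision and recall or as an average of per-attribute F-measures is not stated here, so the function is illustrative only.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Harmonic mean of the averages quoted above (an assumption about how
# the two numbers would be combined, not a figure from the paper).
combined = f_measure(0.91, 0.66)
```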

Comparison with related work

Compared with the related systems Kylin and its successor K2, iPopulator outperforms both. The results are summarized in the table below:

Ipopulatorexperimentresult.png

Related papers

The experimental results are compared with those of the Kylin IE system.

A similar infobox population problem is also discussed in Wu and Weld, ACL 2010 (Open Information Extraction system) and Hoffmann et al., ACL 2010 (LUCHS system).

Results generated by iPopulator are also used in the DBpedia knowledge base.