Difference between revisions of "D. Lange et al., CIKM 2010"

From Cohen Courses
Line 17: Line 17:
 
* '''Training Data Creation:''' For this step, we use articles that specify a value for an attribute as training data. Occurrences of attribute values within the training article texts are labeled.
  
* '''Value Extractor Creation:''' The labeled training data are used to generate extractors for as many attributes as possible. We employ Conditional Random Fields (CRFs) to generate attribute value extractors. These extractors are automatically evaluated, so that ineffective extractors can be discarded.
+
* '''Value Extractor Creation:''' The labeled training data are used to generate extractors for as many attributes as possible. We employ [[Conditional_Random_Fields|Conditional Random Fields (CRFs)]] to generate attribute value extractors. These extractors are automatically evaluated, so that ineffective extractors can be discarded.
  
 
* '''Attribute Value Extraction:''' The extractors can then be applied to all articles to find missing attribute values for existing infoboxes.

Revision as of 22:23, 29 September 2011

Citation

Dustin Lange, Christoph Böhm, Felix Naumann. 2010. Extracting structured information from Wikipedia articles to populate infoboxes. In CIKM '10: Proceedings of the 19th ACM International Conference on Information and Knowledge Management.

Online version

ACM Digital Library

Summary

This paper introduces the iPopulator system, which automatically populates the infoboxes of Wikipedia articles by extracting attribute values from the article text (a task also known as the infobox completion problem).

iPopulator's extraction workflow contains four steps:

  • Structure Analysis: For each attribute of the infobox template, we analyze its values given in the training articles' infoboxes to determine a structure that represents the attribute's syntactical characteristics.
  • Training Data Creation: For this step, we use articles that specify a value for an attribute as training data. Occurrences of attribute values within the training article texts are labeled.
  • Value Extractor Creation: The labeled training data are used to generate extractors for as many attributes as possible. We employ Conditional Random Fields (CRFs) to generate attribute value extractors. These extractors are automatically evaluated, so that ineffective extractors can be discarded.
  • Attribute Value Extraction: The extractors can then be applied to all articles to find missing attribute values for existing infoboxes.

The steps above are illustrated in the following figure:

IPopulatorExtractionProcess.png
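
The first two steps of the workflow can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the paper's implementation: the coarse token classes (NUM, WORD, OTHER), the regular-expression tokenizer, the exact token matching, and the B/I/O label names are all hypothetical choices made for this example, and the function names are invented here. The resulting token/label pairs are the kind of input a sequence labeler such as a CRF could be trained on in the third step.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    # Split into runs of word characters and single punctuation symbols.
    return re.findall(r"\w+|[^\w\s]", text)


def token_class(token: str) -> str:
    # Hypothetical coarse classes; the paper's structure model may differ.
    if token.isdigit():
        return "NUM"
    if token.isalpha():
        return "WORD"
    return "OTHER"


def dominant_structure(values: List[str]) -> Tuple[str, ...]:
    # Structure analysis (simplified): describe each known attribute value
    # as a sequence of token classes and keep the most frequent sequence.
    patterns = Counter(
        tuple(token_class(t) for t in tokenize(v)) for v in values
    )
    return patterns.most_common(1)[0][0]


def label_occurrences(text: str, value: str) -> List[Tuple[str, str]]:
    # Training data creation (simplified): mark exact occurrences of the
    # attribute value in the article text with B (begin) and I (inside);
    # every other token is labeled O (outside).
    tokens = tokenize(text)
    target = tokenize(value)
    labels = ["O"] * len(tokens)
    n = len(target)
    i = 0
    while n > 0 and i <= len(tokens) - n:
        if tokens[i:i + n] == target:
            labels[i] = "B"
            for j in range(i + 1, i + n):
                labels[j] = "I"
            i += n
        else:
            i += 1
    return list(zip(tokens, labels))


if __name__ == "__main__":
    # Made-up infobox values for a hypothetical "birth_date" attribute.
    birth_dates = ["12 May 1960", "3 June 1975", "1980"]
    print(dominant_structure(birth_dates))
    print(label_occurrences("She was born on 12 May 1960 in Berlin.", "12 May 1960"))
```

Exact matching is the simplest possible choice; values in infoboxes often differ slightly from their mentions in the text, so a real system would likely need a more tolerant matching strategy.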

Brief description of the method

Experimental Result

Related papers