D. Lange et al., CIKM 2010

From Cohen Courses
 
Revision as of 17:19, 30 September 2011

== Citation ==

Dustin Lange, Christoph Böhm, and Felix Naumann. 2010. Extracting structured information from Wikipedia articles to populate infoboxes. In ''CIKM '10: Proceedings of the 19th ACM International Conference on Information and Knowledge Management''.

== Online version ==

ACM Digital Library

== Summary ==

This paper introduces the iPopulator system, which automatically populates infoboxes of Wikipedia articles by extracting attribute values from the article's text (a task also known as the infobox completion problem).

[[File:Ipopulatorexample.png|480px]]

== System Architecture and Applied Methods ==

iPopulator's extraction workflow contains four steps:

=== Structure Analysis ===

For each attribute of the infobox template, the system analyzes the values given in the training articles' infoboxes to determine a structure that represents the attribute's syntactical characteristics. For example, a value of the '''infobox_company''' attribute '''number_of_employees''' might be ''12,500 (2003)'', meaning that 12,500 people were employed in the year 2003. The authors employ a heuristic method to derive a regular expression that captures the structure of an attribute's values.
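As an illustration, the character-class generalization behind such a heuristic can be sketched as follows. This is a much-simplified stand-in for the paper's actual structure analysis; the token classes and pattern choices are illustrative assumptions:

```python
import re

def derive_pattern(value):
    """Generalize one attribute value into a regular expression:
    digit runs become \\d[\\d,.]*, letter runs become [A-Za-z]+,
    whitespace becomes \\s+, and other symbols are kept literally."""
    parts = []
    for tok in re.finditer(r"\d[\d,.]*|[A-Za-z]+|\s+|\S", value):
        t = tok.group()
        if t[0].isdigit():
            parts.append(r"\d[\d,.]*")
        elif t[0].isalpha():
            parts.append(r"[A-Za-z]+")
        elif t.isspace():
            parts.append(r"\s+")
        else:
            parts.append(re.escape(t))
    return "".join(parts)

# The pattern derived from one example value also matches
# other values with the same structure:
pattern = derive_pattern("12,500 (2003)")
```

The real system aggregates patterns over many training values per attribute; this sketch generalizes from a single value.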

=== Training Data Creation ===

This step is similar to the ''Matcher'' in the [[Hoffmann_et_al.,_ACL_2010|Hoffmann et al., ACL 2010]] paper, which matches infobox attribute values against the corresponding Wikipedia article text. Because attribute values usually do not occur verbatim in the text, the paper proposes several ideas to address this issue:

* '''Article Paragraph Filtering:''' Only the first few paragraphs are considered; the rest are ignored.
* '''Fuzzy Matching:''' A simple heuristic is applied to discover fuzzy matches of attribute values; numbers are processed separately from all other strings.
* '''Labeling Value Parts:''' All attribute values are divided into several parts according to the corresponding attribute value structure. Each part of the value structure is labeled separately. To retain the identity of the value part that is being labeled, a number is assigned to each structure part and used as the actual label.

=== Value Extractor Creation ===

The labeled training data are used to generate extractors for as many attributes as possible. The authors employ [[Conditional_Random_Fields|Conditional Random Fields (CRFs)]] to generate attribute value extractors. These extractors are automatically evaluated so that ineffective ones can be discarded.
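The paper's exact CRF feature set is not reproduced here; the following sketch shows the kind of per-token feature map typically fed to a CRF toolkit for this task. All feature names are illustrative assumptions:

```python
def token_features(tokens, i):
    """Build a feature dictionary for token i of a tokenized
    sentence, in the style commonly passed to CRF toolkits.
    Illustrative only -- not iPopulator's actual feature set."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        # Numeric tokens matter for attributes like number_of_employees.
        "is_digit": tok.replace(",", "").isdigit(),
        "is_title": tok.istitle(),
        # Context features: neighboring tokens (with sentence-boundary markers).
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

feats = token_features("The company has 12,500 employees".split(), 3)
```

A CRF trained on such features, with the numbered part labels described above as targets, yields one extractor per attribute.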

=== Attribute Value Extraction ===

The extractors can then be applied to all articles to find missing attribute values for existing infoboxes.
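A minimal sketch of this application step, using plain regular expressions as stand-ins for the trained CRF extractors (the function and attribute names are hypothetical):

```python
import re

def populate_infobox(infobox, text, extractors):
    """Fill only the missing fields of an infobox by running the
    per-attribute extractors over the article text. Plain regexes
    stand in here for the trained CRF extractors."""
    completed = dict(infobox)
    for attr, pattern in extractors.items():
        if completed.get(attr):
            continue  # skip attributes that already have a value
        match = re.search(pattern, text)
        if match:
            completed[attr] = match.group()
    return completed

extractors = {"number_of_employees": r"\d[\d,]*\s+employees"}
infobox = {"name": "Acme Corp", "number_of_employees": ""}
text = "Acme Corp has 12,500 employees worldwide."
result = populate_infobox(infobox, text, extractors)
# result fills number_of_employees from the text; name is untouched
```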

The overall process is illustrated in the following figure:

[[File:IPopulatorExtractionProcess.png]]

== Brief description of the method ==

=== Training Data Creation ===

'''Article Paragraph Filtering:''' Many Wikipedia articles are rather long and contain much information that is irrelevant to the infobox population problem. An occurrence analysis showed that the first paragraphs of an article are sufficient for extracting many infobox attributes, so only the first few paragraphs are examined; this restricts the corpus size while still covering the most useful text passages.
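A minimal sketch of such a paragraph filter (the cutoff ''k'' is an illustrative choice, not the paper's exact setting):

```python
def first_paragraphs(article_text, k=3):
    """Keep only the first k blank-line-separated paragraphs of an
    article, since early paragraphs suffice for many attributes.
    k=3 is an assumed default, not the paper's cutoff."""
    paragraphs = [p for p in article_text.split("\n\n") if p.strip()]
    return "\n\n".join(paragraphs[:k])
```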

'''Fuzzy Matching:''' A simple heuristic is applied to discover fuzzy matches of attribute values in the text, distinguishing numbers from all other strings.
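One way such a heuristic could look; the 0.85 similarity threshold and the use of `difflib` are assumptions, not the paper's actual method:

```python
from difflib import SequenceMatcher

def fuzzy_match(value, candidate):
    """Decide whether a text span is a fuzzy match for an attribute
    value. Numbers match if equal after stripping formatting; other
    strings match above a similarity threshold (0.85 is assumed)."""
    v, c = value.replace(",", ""), candidate.replace(",", "")
    if v.replace(".", "").isdigit() and c.replace(".", "").isdigit():
        return float(v) == float(c)
    sim = SequenceMatcher(None, value.lower(), candidate.lower()).ratio()
    return sim >= 0.85
```

Handling numbers separately lets ''12,500'' in an infobox match ''12500'' in running text, which pure string similarity would miss.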

'''Labeling Value Parts:''' All attribute values are divided into several parts according to the corresponding attribute value structure. Each part of the value structure is labeled separately. To retain the identity of the value part that is being labeled, a number is assigned to each structure part and used as the actual label.
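This numbered-label scheme can be sketched as follows; the hand-written part patterns stand in for the automatically derived value structure:

```python
import re

def label_value_parts(value, part_patterns):
    """Split an attribute value into its structure parts and assign
    each part its structure-part number as the label. The patterns
    here are hand-written stand-ins for the derived structure."""
    labeled, rest = [], value
    for number, pattern in enumerate(part_patterns):
        m = re.search(pattern, rest)
        if m:
            labeled.append((m.group(), number))
            rest = rest[m.end():]  # continue after the matched part
    return labeled

# "12,500 (2003)" has two structure parts: a number and a year in parentheses.
parts = label_value_parts("12,500 (2003)", [r"\d[\d,]*", r"\(\d{4}\)"])
```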

== Experimental Result ==

=== Overall Performance ===

Evaluating on 1,727 distinct infobox template attributes, the authors report high precision for 1,521 attributes and high overall quality for 1,127 attributes. Averaged over all attributes, the system achieves a precision of 0.91 and a recall of 0.66.
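For reference, the F-measure implied by these averages is the harmonic mean of precision and recall; note that if the paper averages its measures per attribute, its reported F-measure need not equal this value:

```python
def f_measure(precision, recall):
    # F1 score: harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

implied_f = round(f_measure(0.91, 0.66), 2)  # → 0.77
```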

=== Comparison with related work ===

Compared with the related systems Kylin and its successor K2, iPopulator outperforms both. The results are summarized in the figure below:

[[File:Ipopulatorexperimentresult.png]]

== Related papers ==

The experimental results are compared with those of the Kylin information extraction system.

The similar problem of infobox population is also addressed by the Open Information Extraction system of Wu and Weld, ACL 2010, and by the LUCHS system of [[Hoffmann_et_al.,_ACL_2010|Hoffmann et al., ACL 2010]].

Results generated by iPopulator are also used in the DBpedia knowledge base.