D. Lange et al., CIKM 2010
Citation
Dustin Lange, Christoph Böhm, Felix Naumann. 2010. Extracting structured information from Wikipedia articles to populate infoboxes. In CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management.
Online version
Summary
This paper introduces the iPopulator system, which automatically populates the infoboxes of Wikipedia articles by extracting attribute values from each article's text. This task is also known as the infobox completion problem.
System Architecture and Applied Methods
iPopulator's extraction workflow contains four steps:
Structure Analysis
For each attribute of an infobox template, the system analyzes the values given in the training articles' infoboxes to determine a structure that represents the attribute's syntactic characteristics. For example, a value of the number_of_employees attribute in infobox_company might be 12,500 (2003), meaning that 12,500 people were employed in the year 2003. The authors use a heuristic method to derive a regular expression that captures the structure of these infobox values.
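As a rough illustration of the idea, the sketch below maps each training value to a coarse token pattern and keeps the most frequent pattern as the attribute's structure. The token classes (NUM, WORD), the helper names value_pattern and dominant_structure, and the sample values are assumptions made for this example; they are not the heuristic used in the paper.

```python
import re
from collections import Counter

# Coarse tokenizer: numbers (with separators), alphabetic words, or any other
# single non-space character. These classes are illustrative assumptions.
TOKEN_RE = re.compile(r"\d[\d,.]*|[^\W\d_]+|\S")


def value_pattern(value):
    """Map an attribute value to a sequence of coarse token classes."""
    parts = []
    for tok in TOKEN_RE.findall(value):
        if tok[0].isdigit():
            parts.append("NUM")
        elif tok[0].isalpha():
            parts.append("WORD")
        else:
            parts.append(tok)  # keep punctuation such as '(' literally
    return " ".join(parts)


def dominant_structure(values):
    """Return the most frequent pattern among the training values."""
    return Counter(value_pattern(v) for v in values).most_common(1)[0][0]


# Hypothetical number_of_employees values taken from training infoboxes.
values = ["12,500 (2003)", "3,200 (2008)", "about 900", "48,000 (2009)"]
print(dominant_structure(values))
```

With these sample values, three of the four share the pattern of a number followed by a parenthesized number, so that pattern is selected as the attribute's structure.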
Training Data Creation
This step is similar to the Matcher in the Hoffmann et al., ACL 2010 paper, which matches infobox attributes to the text of the corresponding Wikipedia article. Because attribute values usually do not occur verbatim in the text, the paper proposes several ideas to address the issue:
- Article Paragraph Filtering: Only the first few paragraphs of an article are used; the rest are ignored;
- Fuzzy Matching: A simple heuristic is applied to discover fuzzy matches of attribute values, and numbers are processed separately from all other strings;
- Labeling Value Parts: All attribute values are divided into several parts according to the corresponding attribute value structure. Each part of the value structure is labeled separately. To retain the identity of the value part that is being labeled, a number is assigned to each structure part and used as the actual label.
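The labeling of value parts can be sketched as follows. Each token of the article text receives the number of the value part it matches, or O if it matches none. The function name label_tokens, the tokenizer, the O label, and the example sentence are assumptions for illustration; for brevity the sketch uses exact matching, whereas the paper relies on fuzzy matches.

```python
import re


def label_tokens(text, value_parts):
    """Label each token with the 1-based index of the value part it equals,
    or 'O' when it equals none (exact match only, to keep the sketch short)."""
    part_ids = {part: str(i) for i, part in enumerate(value_parts, start=1)}
    # Numbers with separators stay in one token; punctuation is split off.
    tokens = re.findall(r"\d[\d,.]*\d|\w+|[^\w\s]", text)
    return [(tok, part_ids.get(tok, "O")) for tok in tokens]


# Value '12,500 (2003)' split into its two structure parts.
sentence = "In 2003, the company employed 12,500 people."
for token, label in label_tokens(sentence, ["12,500", "2003"]):
    print(token, label)
```

Using the part number as the label lets a sequence model learn each part separately, so the extracted parts can later be reassembled according to the attribute's value structure.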
Value Extractor Creation
The labeled training data are used to generate extractors for as many attributes as possible. The authors employ Conditional Random Fields (CRFs) to generate the attribute value extractors. These extractors are automatically evaluated, so that ineffective extractors can be discarded.
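The automatic evaluation step might look like the sketch below, which scores each extractor on held-out articles and keeps only those above a precision threshold. The threshold of 0.75, the function names, and the sample results are assumptions for illustration; the CRF training itself is omitted.

```python
def precision(predicted, expected):
    """Fraction of extracted values that equal the expected infobox value.
    A prediction of None means the extractor returned nothing."""
    extracted = [(p, e) for p, e in zip(predicted, expected) if p is not None]
    if not extracted:
        return 0.0
    return sum(p == e for p, e in extracted) / len(extracted)


def keep_effective(results, min_precision=0.75):
    """Return the attributes whose extractor reaches the assumed threshold."""
    return sorted(
        attr
        for attr, (pred, gold) in results.items()
        if precision(pred, gold) >= min_precision
    )


# Hypothetical held-out results: attribute -> (extracted values, infobox values).
results = {
    "number_of_employees": (["12,500", "900", None, "48,000"],
                            ["12,500", "900", "75", "48,000"]),
    "founder": (["J. Doe", "A. Smith", "B. Lee", None],
                ["John Doe", "A. Smith", "R. Roe", "C. Poe"]),
}
print(keep_effective(results))
```

In this made-up data the number_of_employees extractor is correct on every value it extracts and is kept, while the founder extractor is correct on only one of three and is discarded.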
Attribute Value Extraction
The extractors can then be applied to all articles to find missing attribute values for existing infoboxes.
The workflow described above is illustrated in the following figure:
Brief description of the method
Training Data Creation
Article Paragraph Filtering: Many Wikipedia articles are rather long and contain much information that is irrelevant for the Infobox Population Problem. The authors conducted an occurrence analysis and concluded that the first paragraphs of an article are sufficient for extracting many infobox attributes. To restrict the corpus size while still examining the useful text passages, only the first few paragraphs are used.
Fuzzy Matching: A simple heuristic is applied to discover fuzzy matches of attribute values. It distinguishes between numbers and all other strings.
Labeling Value Parts: All attribute values are divided into several parts according to the corresponding attribute value structure. Each part of the value structure is labeled separately. To retain the identity of the value part that is being labeled, a number is assigned to each structure part and used as the actual label.
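The distinction between numbers and other strings in fuzzy matching can be sketched as follows. Numbers are compared within a relative tolerance, and other strings by a similarity ratio from Python's difflib. Both thresholds and the function names are assumptions for illustration, not the heuristic from the paper.

```python
import re
from difflib import SequenceMatcher

NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def to_number(text):
    """Parse '12,500' or '3.5' into a float; return None for non-numbers."""
    if NUMBER_RE.fullmatch(text):
        return float(text.replace(",", ""))
    return None


def fuzzy_match(value, candidate, tolerance=0.05, min_ratio=0.8):
    """Numbers match within a relative tolerance; other strings match when
    their similarity ratio is high enough. Both thresholds are assumptions."""
    a, b = to_number(value), to_number(candidate)
    if a is not None and b is not None:
        return abs(a - b) <= tolerance * max(abs(a), abs(b))
    if a is not None or b is not None:
        return False  # never match a number against a non-number
    ratio = SequenceMatcher(None, value.lower(), candidate.lower()).ratio()
    return ratio >= min_ratio


print(fuzzy_match("12,500", "12,400"))              # close numbers
print(fuzzy_match("Felix Naumann", "Felix Nauman"))  # near-identical strings
print(fuzzy_match("12,500", "Felix Naumann"))       # number vs. string
```

Treating numbers separately avoids judging 12,500 and 12,400 by character overlap alone, which would say little about how close the quantities are.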
Experimental Result
Overall Performance
Evaluating on 1,727 distinct infobox template attributes, the authors claim to achieve precision [...] for 1,521 attributes and [...] for 1,127 attributes. Overall, the average precision is 0.91 and the average recall is 0.66, giving an F-measure of [...].
Compared with the related systems Kylin and its successor K2, iPopulator outperforms both; the results are summarized in the table below:
Related papers
The experimental results are compared with those of the Kylin information extraction system.
A similar infobox population problem is also discussed in connection with the Open Information Extraction system of Wu and Weld ACL 2010 and the LUCHS system of Hoffmann et al., ACL 2010.
Results generated by iPopulator are also used in the DBpedia knowledge base.