Adar, E. et al, WSDM 2009
Citation
Eytan Adar, Michael Skinner, Daniel S. Weld, Information arbitrage across multi-lingual Wikipedia, Proceedings of the Second ACM International Conference on Web Search and Data Mining, February 09-12, 2009, Barcelona, Spain.
Online version
Summary
This paper presents an automated system that aligns Wikipedia infoboxes across four languages (English, Spanish, French, and German). The system creates infoboxes where necessary, resolves discrepancies between parallel pages written in different languages, and fills in missing information. This kind of extraction is particularly useful because the globalization of Wikipedia produces a rapidly growing parallel corpus over many languages; the pages are updated quickly, so not every parallel document is likely to be current. The system uses an additive logistic regression classifier to match attribute pairs in a self-supervised setting.
Brief description of the method
Page Alignment
First, pages that describe the same concept in different languages are aligned. These pages are clustered, and each cluster is assigned a unique concept ID.
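The clustering step can be sketched with a union-find structure. This is only an illustration: the summary does not say how pages are linked, so the assumption here is that pairs of equivalent pages (for example, taken from cross-language links) are available. The function name `assign_concept_ids` and the `(language, title)` page keys are hypothetical.

```python
def assign_concept_ids(links):
    """Cluster pages joined by cross-language links; return {page: concept_id}.

    `links` is an iterable of (page_a, page_b) pairs, where each page is a
    hashable key such as (language, title). Assumed input format.
    """
    parent = {}

    def find(page):
        # Find the cluster representative, shortening paths along the way.
        parent.setdefault(page, page)
        while parent[page] != page:
            parent[page] = parent[parent[page]]
            page = parent[page]
        return page

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    # Give every cluster a small integer ID, in a deterministic order.
    root_to_id = {}
    concept_ids = {}
    for page in sorted(parent):
        root = find(page)
        if root not in root_to_id:
            root_to_id[root] = len(root_to_id)
        concept_ids[page] = root_to_id[root]
    return concept_ids


# Illustrative links only, not data from the paper.
links = [
    (("en", "Barcelona"), ("es", "Barcelona")),
    (("es", "Barcelona"), ("fr", "Barcelone")),
    (("en", "Berlin"), ("de", "Berlin")),
]
ids = assign_concept_ids(links)
```

Pages that are only transitively connected (here the English and French pages for Barcelona) end up with the same concept ID.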
Infobox Alignment
A classifier decides whether a given pair of infobox attributes is a match. The additive logistic regression classifier uses the following features:
(1) Equality features: Exact matches of attribute names, infobox classes, and infobox values are strong indications of a match. Infobox values are matched in several normalized forms (lowercased, with everything but digits removed, and with everything but alphabetical characters removed).
(2) Word features: Partial matches between infobox values are captured by the Dice coefficient (= 2*|X intersect Y| / (|X| + |Y|)) and the raw number of overlapping terms.
(3) n-gram Features: Some languages share roots, so corresponding words look similar. This is captured by comparing character n-grams: character 3-grams are generated and compared using the Dice coefficient and the number of overlapping n-grams.
(4) Cluster ID Features: The previous phase clusters pages that are written in different languages but describe the same concept. This information is used to check whether values listed in infoboxes in different languages in fact refer to the same 'concept'.
(5) Language Feature: An indicator variable for which pair of languages (e.g. German/English) is being tested.
(6) Correlation Features: These compare numerical values, where n-grams and exact matches help little for several reasons (values measured at different times, unit conversions, etc.). The Pearson product-moment correlation is used here.
(7) Translation Features: Language resources can be used to find evidence of a match when there is no textual similarity. The authors translate the infobox class, attribute name, and attribute values, and use the number and ratio of matching translations as features.
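Several of the features above are simple enough to sketch directly. The code below is a minimal illustration of the equality, word, n-gram, and correlation features, not the authors' implementation; all function names are hypothetical, and the exact normalization rules and the way numeric values are paired up for the correlation are assumptions.

```python
import math
import re


def normalized_forms(value):
    """Three normalized views of an infobox value (feature 1)."""
    lowered = value.lower()
    return {
        "lower": lowered,
        "digits": re.sub(r"[^0-9]", "", lowered),
        "letters": "".join(ch for ch in lowered if ch.isalpha()),
    }


def equality_features(a, b):
    """One boolean per normalized form; empty forms never count as a match."""
    fa, fb = normalized_forms(a), normalized_forms(b)
    return {name: bool(fa[name]) and fa[name] == fb[name] for name in fa}


def dice(x, y):
    """Dice coefficient 2*|X intersect Y| / (|X| + |Y|) over two collections."""
    x, y = set(x), set(y)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased string (feature 3)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant input
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only.
eq = equality_features("1,234 km", "1234 KM")
word_dice = dice("new york city".split(), "york city".split())
ngram_dice = dice(char_ngrams("nation"), char_ngrams("nacion"))
corr = pearson([1, 2, 3], [2, 4, 6])
```

In the example, the two distance strings differ as lowercased text but agree once reduced to digits or to letters, which is the kind of case the normalized forms are meant to catch. For real data, `scipy.stats.pearsonr` is a common alternative to the hand-written correlation.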
Completing infoboxes
Given a set of parallel, multilingual documents and a document to be modified, a set of candidate infobox classes is guessed by a weighted count of infobox co-occurrences. The best guess is then used, and the system attempts to fill in the values of the attributes that the class can have. The pairwise similarity scores computed in the previous step are used to find a maximum matching, assuming 1-1 mappings of infobox attributes.
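To make the 1-1 maximum matching concrete, here is a brute-force sketch that tries every one-to-one assignment and keeps the one with the highest total similarity. The summary does not say which matching algorithm the authors use; exhaustive search is swapped in here only because it is short and obviously correct for small inputs. The function name and the score matrix are hypothetical.

```python
from itertools import permutations


def best_one_to_one(scores):
    """Return (matching, total) maximizing the summed similarity.

    scores[i][j] is the similarity between source attribute i and target
    attribute j. Every source attribute is matched to a distinct target
    attribute, so the number of sources must not exceed the number of targets.
    """
    n_src, n_tgt = len(scores), len(scores[0])
    if n_src > n_tgt:
        raise ValueError("need at least as many target as source attributes")
    best_total, best_cols = float("-inf"), None
    for cols in permutations(range(n_tgt), n_src):
        total = sum(scores[i][j] for i, j in enumerate(cols))
        if total > best_total:
            best_total, best_cols = total, cols
    return list(enumerate(best_cols)), best_total


# Illustrative similarity scores, not values from the paper.
scores = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.10],
]
matching, total = best_one_to_one(scores)
```

A greedy choice would pair source 0 with target 0 (score 0.90) and leave source 1 with a poor partner; the maximum matching instead pairs 0 with 1 and 1 with 0. Exhaustive search grows factorially, so for realistic infobox sizes a polynomial-time assignment solver such as `scipy.optimize.linear_sum_assignment` with `maximize=True` would be the usual choice.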
Data
Because Wikipedia infoboxes are difficult to parse, the authors use the DBpedia dump instead. The DBpedia dump contains all the parsed infobox information from Wikipedia.
Generating a labeled training/test set
All hyperlinked values are replaced by their concept IDs from the page alignment stage. This makes it easy to generate positive examples: it suffices to match concept IDs, and the original values are already in different languages, different formats, etc. By counting how many values match for each attribute pair, the authors select the 4000 highest-scoring pairs and generate 20K examples from them. Producing negative examples is trivial: one element of a positive pair is changed to something else.
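The negative-example step can be sketched as follows. The data layout (a list of matching value pairs) and the choice to borrow the replacement from another positive pair are assumptions made for illustration; the summary only says that one element of a positive pair is changed.

```python
import random


def make_negatives(positives, rng):
    """Build one negative per positive by swapping in a non-matching value.

    `positives` is a list of (value_a, value_b) pairs known to match.
    The second element is replaced by the second element of a different
    pair, skipping candidates that happen to equal the true value.
    """
    negatives = []
    for i, (a, b) in enumerate(positives):
        others = [vb for j, (_, vb) in enumerate(positives) if j != i and vb != b]
        if not others:
            continue  # no usable replacement for this pair
        negatives.append((a, rng.choice(others)))
    return negatives


# Illustrative pairs only.
positives = [("Berlin", "Berlín"), ("1999", "1999"), ("Paris", "París")]
negatives = make_negatives(positives, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the generated set reproducible across runs.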
Experimental Result
The classifier accuracy was 90.7% under 10-fold cross-validation. Because the languages used in the experiment (English, Spanish, French, and German) are similar, the translation features did not seem to help much. The system appeared to generate infoboxes for articles well, which shows up as a higher gain in the later part of the plot, where the growth of existing entries slows. There was also a notable difference in gain between languages (English vs. French in the plot), which shows that the system could leverage infoboxes in other languages that had more information.
The plot is generated by calculating the average number of entries for each infobox class. Classes are sorted in decreasing order of average number of entries, and the plot is then drawn cumulatively.
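The computation behind such a cumulative curve can be sketched as below. The class names and averages are made up for illustration, and the function name is hypothetical; only the sort-then-accumulate procedure comes from the description above.

```python
from itertools import accumulate


def cumulative_entries(avg_entries_by_class):
    """Sort classes by average entry count (descending) and accumulate."""
    ordered = sorted(avg_entries_by_class.items(), key=lambda kv: kv[1], reverse=True)
    classes = [name for name, _ in ordered]
    totals = list(accumulate(value for _, value in ordered))
    return classes, totals


# Made-up averages, not results from the paper.
classes, totals = cumulative_entries({"city": 12.0, "person": 20.0, "film": 8.0})
```

Plotting `totals` against the class rank, once for existing entries and once for entries after completion, would give two curves whose gap is the gain discussed above.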