Adar, E. et al, WSDM 2009

Revision as of 12:22, 30 September 2011 by Daegunw (talk | contribs)

Citation

Eytan Adar, Michael Skinner, Daniel S. Weld, Information arbitrage across multi-lingual Wikipedia, Proceedings of the Second ACM International Conference on Web Search and Data Mining, February 09-12, 2009, Barcelona, Spain.

Online version

link to PDF

Summary

This paper presents an automated system that aligns Wikipedia infoboxes across four languages (English, Spanish, French and German). The system creates infoboxes where necessary, resolves discrepancies between parallel pages written in different languages, and fills in missing information. This way of extracting new information is particularly useful because the globalization of Wikipedia is creating a rapidly growing parallel corpus over many languages.

Brief description of the method

Page Alignment

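First, they align pages across languages: pages written in different languages that describe the same concept are clustered together, and each cluster receives a concept ID that is reused in the infobox alignment stage.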

Infobox Alignment

A classifier decides whether a given pair of infobox attributes is a match. The additive logistic regression classifier uses the following features:

(1) Equality features: Exact matches of attribute names, infobox classes, and infobox values are strong indications of a match. Infobox values are compared in several normalized forms (lowercased, with everything but digits removed, and with everything but alphabetical characters removed).

(2) Word features: Partial matches between infobox values are captured by the Dice coefficient over their word sets X and Y (= 2*|X intersect Y| / (|X| + |Y|)) and by the raw number of overlapping terms.

(3) n-gram Features: Some languages share roots, so corresponding words look similar. This is captured by comparing character n-grams: character 3-grams are generated for each value and compared using the Dice coefficient and the number of overlapping n-grams.

(4) Cluster ID Features: The page alignment phase clusters pages that are written in different languages but describe the same concept. These cluster IDs are used to check whether values listed in infoboxes in different languages in fact refer to the same 'concept'.

(5) Language Feature: An indicator variable for the pair of languages being tested (e.g. German/English).

(6) Correlation Features: These compare numerical values, where n-grams and exact matches help little because the numbers can differ for many reasons (measured at different times, unit conversion, etc.). The Pearson product-moment correlation is used here.

(7) Translation Features: Language resources can reveal a match when there is no textual similarity. The authors translate the infobox class, attribute name, and attribute values, and use the number and the ratio of matched translations as features.
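
To make features (1), (2), (3) and (6) concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names (normalized_forms, dice, char_ngrams, pearson, pair_features), whitespace tokenization, and the treatment of empty normalized forms as non-matches are all assumptions made for illustration. The Pearson helper is assumed to be applied to the numeric values that two attributes take across a set of aligned articles.

```python
import math


def normalized_forms(value):
    """Normalized variants of an infobox value (assumed set of forms)."""
    lowered = value.lower()
    return {
        "lower": lowered,
        "digits": "".join(ch for ch in lowered if ch.isdigit()),
        "alpha": "".join(ch for ch in lowered if ch.isalpha()),
    }


def dice(x, y):
    """Dice coefficient 2|X n Y| / (|X| + |Y|) over two collections."""
    x, y = set(x), set(y)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def char_ngrams(text, n=3):
    """Set of character n-grams of the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: undefined correlation is reported as 0
    return cov / math.sqrt(var_x * var_y)


def pair_features(value_a, value_b):
    """Equality, word, and n-gram features for one pair of infobox values."""
    forms_a, forms_b = normalized_forms(value_a), normalized_forms(value_b)
    words_a, words_b = set(value_a.lower().split()), set(value_b.lower().split())
    grams_a, grams_b = char_ngrams(value_a), char_ngrams(value_b)
    # Assumption: an empty normalized form never counts as an equality match.
    features = {
        "eq_" + name: float(forms_a[name] != "" and forms_a[name] == forms_b[name])
        for name in forms_a
    }
    features["word_dice"] = dice(words_a, words_b)
    features["word_overlap"] = len(words_a & words_b)
    features["ngram_dice"] = dice(grams_a, grams_b)
    features["ngram_overlap"] = len(grams_a & grams_b)
    return features
```

For example, pair_features("Barcelona, Spain", "Barcelona, España") shares one of two whitespace-separated tokens on each side, so its word-level Dice coefficient is 0.5, while the character 3-gram features also pick up the shared "spa" fragment.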

Data

Because Wikipedia infoboxes are difficult to parse, they use the DBpedia dump instead. The DBpedia dump contains the parsed infobox information from Wikipedia.

Generating a labeled training/test set

All hyperlinked values are replaced by the concept IDs obtained in the page alignment stage. This makes it easy to generate positive examples: it is enough to match concept IDs, and the original values behind them are already in different languages, different formats, etc. The authors count how many values match for each attribute pair, take the 4000 highest-scoring pairs, and generate 20K examples from them. Producing negative examples is trivial: one element of a positive pair is replaced with something else.
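
The procedure above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: each infobox is assumed to be a dict from attribute name to concept ID, and the function names top_attribute_pairs and make_negatives are hypothetical. Negatives are produced here by swapping the second attribute of a positive pair for one that does not form a known positive pair.

```python
import random
from collections import Counter


def top_attribute_pairs(aligned_infoboxes, k):
    """Rank cross-language attribute pairs by how often their concept IDs match.

    aligned_infoboxes: iterable of (box_a, box_b), where each box is an
    assumed dict mapping attribute name -> concept ID.
    """
    counts = Counter()
    for box_a, box_b in aligned_infoboxes:
        for attr_a, concept_a in box_a.items():
            for attr_b, concept_b in box_b.items():
                if concept_a == concept_b:
                    counts[(attr_a, attr_b)] += 1
    return [pair for pair, _ in counts.most_common(k)]


def make_negatives(positives, rng):
    """Corrupt each positive pair by replacing its second attribute."""
    positive_set = set(positives)
    right_attrs = sorted({attr_b for _, attr_b in positives})
    negatives = []
    for attr_a, _ in positives:
        # Only keep replacements that are not themselves known positives.
        choices = [b for b in right_attrs if (attr_a, b) not in positive_set]
        if choices:
            negatives.append((attr_a, rng.choice(choices)))
    return negatives


aligned = [
    ({"country": "C1", "capital": "C2"}, {"land": "C1", "hauptstadt": "C2"}),
    ({"country": "C3"}, {"land": "C3"}),
]
positives = top_attribute_pairs(aligned, k=2)
negatives = make_negatives(positives, random.Random(0))
```

In the paper's setting k would be 4000; the toy data here only shows the shape of the inputs and outputs.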

Experimental Result

Wikipedia.


Related papers

[1] Bunescu and Mooney, ACL 2004

[2] Finkel et al, ACL 2005