Adar, E. et al, WSDM 2009
 
== Summary ==
 
This [[Category::paper]] presents an automated system that aligns Wikipedia infoboxes across four language editions (English, Spanish, French and German). The system creates infoboxes when necessary, resolves discrepancies across parallel pages written in different languages, and fills in missing information. This way of extracting new information is particularly useful because the globalization of [[UsesDataset::Wikipedia]] creates a rapidly growing parallel corpus over many languages, and because pages are updated at different rates, not all parallel documents are likely to be equally up to date. The system uses an additive [[UsesMethod::Logistic Regression]] classifier to match attribute pairs in a self-supervised setting.
  
 
== Brief description of the method ==
 
=== Page Alignment ===
  
First, pages describing the same article in different languages are aligned. These pages are clustered, and each cluster is assigned a unique concept ID.
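A minimal sketch of this alignment step, assuming pages are connected by pairwise interlanguage links (the page titles and link pairs below are hypothetical, not from the paper):

```python
# Cluster pages connected by interlanguage links and assign each
# cluster a unique concept ID, using a simple union-find structure.

def assign_concept_ids(pages, links):
    parent = {p: p for p in pages}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b in links:  # each interlanguage link merges two pages
        parent[find(a)] = find(b)

    roots = {}
    concept_of = {}
    for p in pages:
        r = find(p)
        if r not in roots:
            roots[r] = len(roots)  # next unused concept ID
        concept_of[p] = roots[r]
    return concept_of
```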
  
=== Infobox Alignment ===
  
A classifier decides whether a given pair of infobox attributes is a match. The additive logistic regression classifier uses the following features:
  
* Equality features: Exact matches of attribute names, infobox classes, and infobox values are strong indications of a match. Infobox values are compared in several normalized forms (lowercased, stripped of everything but numbers, stripped of everything but alphabetic characters).
  
* Word features: Partially matching infobox values are captured by the Dice coefficient (2|X ∩ Y| / (|X| + |Y|)) and the raw number of overlapping terms.
  
* n-gram Features: Related languages share roots, so corresponding values often look similar. Character 3-grams are generated for each value and compared using the Dice coefficient and the number of overlapping n-grams.
  
* Cluster ID Features: In the page-alignment phase, pages written in different languages but describing the same article are clustered. This feature checks whether values listed in infoboxes in different languages in fact refer to the same concept.

* Language Feature: An indicator variable for which language pair (e.g. German/English) is being compared.

* Correlation Features: These compare numerical values, where n-grams and exact matches do not help much for many reasons (measurements taken at different times, unit conversions, etc.). The Pearson product-moment correlation is used here.

* Translation Features: Language resources can reveal a match even when there is no textual similarity. The authors translate the infobox class, attribute name, and attribute values, and use the number and ratio of matched translations.
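The word, n-gram, and correlation features above can be sketched as follows (a simplified illustration; the function names are ours, not the paper's):

```python
def dice(x, y):
    """Dice coefficient between two sets: 2|X ∩ Y| / (|X| + |Y|)."""
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))

def word_features(a, b):
    """Word-level overlap between two infobox values."""
    wa, wb = set(a.split()), set(b.split())
    return dice(wa, wb), len(wa & wb)

def ngram_features(a, b, n=3):
    """Character n-gram overlap (n=3, as in the paper)."""
    ga = {a[i:i + n] for i in range(len(a) - n + 1)}
    gb = {b[i:i + n] for i in range(len(b) - n + 1)}
    return dice(ga, gb), len(ga & gb)

def pearson(xs, ys):
    """Pearson product-moment correlation for paired numeric values.

    Assumes neither sequence is constant (nonzero variance)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)
```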
=== Completing infoboxes ===
Given a set of parallel multilingual documents and a document to be completed, a set of candidate infobox classes is guessed using a weighted count of infobox co-occurrences. The best guess is then used, and the system attempts to fill in values for the attributes of that class. The pairwise similarity scores computed in the previous step are used to find a maximum matching, assuming a 1-1 mapping of infobox attributes.
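A minimal sketch of the 1-1 attribute matching step, here done greedily by descending similarity score rather than with the exact maximum-matching computation the paper uses (the attribute names and scores below are made up):

```python
def greedy_match(scores):
    """Greedily build a 1-1 mapping between source and target attributes.

    scores: dict mapping (src_attr, tgt_attr) -> similarity score.
    Returns a dict src_attr -> tgt_attr. This only approximates the
    exact maximum matching described in the paper.
    """
    mapping = {}
    used_tgt = set()
    for (src, tgt), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if src not in mapping and tgt not in used_tgt:
            mapping[src] = tgt
            used_tgt.add(tgt)
    return mapping
```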
== Data ==
Due to the difficulty of parsing [[UsesDataset::Wikipedia]] infoboxes, the authors use a [[UsesDataset::DBpedia]] dump instead; the dump contains the infobox information from [[UsesDataset::Wikipedia]] in already-parsed form.
  
=== Generating a labeled training/test set ===
  
All hyperlinked values are replaced by their concept IDs in the page-alignment stage. This makes it easy to generate positive examples: matching concept IDs yields value pairs that are already in different languages, formats, etc. By counting how many values match for each attribute pair, the authors take the top 4000 highest-scoring pairs and generate 20K positive examples from them. Producing negative examples is trivial: one element of a positive pair is simply replaced with something else. 40K negative examples are used for training.
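The example-generation step might look roughly like this (a sketch; the record layout of `(attribute, concept_id)` tuples is our assumption, not the paper's data format):

```python
import random

def make_examples(values_a, values_b, seed=0):
    """Build positive and negative attribute-value pairs.

    values_a, values_b: lists of (attribute, concept_id) from two
    language editions. A shared concept ID yields a positive pair;
    corrupting one side of a positive pair yields a negative one.
    """
    positives = [(va, vb) for va in values_a for vb in values_b
                 if va[1] == vb[1]]
    rng = random.Random(seed)
    negatives = []
    for va, vb in positives:
        # swap the target for a value with a mismatched concept ID
        other = rng.choice([v for v in values_b if v[1] != va[1]])
        negatives.append((va, other))
    return positives, negatives
```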
  
== Experimental Result ==
The classifier's accuracy was 90.7% under 10-fold cross-validation. Because the languages used in the experiments (English, Spanish, French and German) are similar, the translation feature did not seem to help much. The system generated infoboxes for articles well, which shows up as a higher gain in the later part of the plot below, where growth in existing entries slows down. There was also a notable difference in gain across languages (English vs. French in the plot), showing that the system could leverage infoboxes in other languages that had more information.
  
[[File:Adar_et_al_Pic.png]]
  
The plot is generated by computing the average number of entries for each infobox class. Classes are sorted in decreasing order of this average, and the plot is then drawn cumulatively.
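The plot's construction can be sketched as follows (the class names and counts are illustrative only):

```python
def cumulative_curve(entry_counts):
    """entry_counts: dict mapping infobox class -> list of per-article
    entry counts.

    Returns the classes sorted by decreasing average number of entries,
    together with the cumulative sum of those averages."""
    avgs = {c: sum(v) / len(v) for c, v in entry_counts.items()}
    ordered = sorted(avgs, key=avgs.get, reverse=True)
    cum, total = [], 0.0
    for c in ordered:
        total += avgs[c]
        cum.append(total)
    return ordered, cum
```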
  
== Related papers ==
  
* [[RelatedPaper::Wu and Weld WWW 2008]]
* [[RelatedPaper::Wu et al KDD 2008]]

Latest revision as of 13:14, 30 September 2011

== Citation ==

Eytan Adar, Michael Skinner, Daniel S. Weld. Information arbitrage across multi-lingual Wikipedia. Proceedings of the Second ACM International Conference on Web Search and Data Mining (WSDM 2009), February 9-12, 2009, Barcelona, Spain.

== Online version ==

link to PDF
