Adar, E. et al, WSDM 2009

== Citation ==

Eytan Adar, Michael Skinner, and Daniel S. Weld. Information arbitrage across multi-lingual Wikipedia. In Proceedings of the Second ACM International Conference on Web Search and Data Mining (WSDM 2009), February 9-12, 2009, Barcelona, Spain.

== Online version ==

link to PDF
 
== Summary ==
 
This [[Category::paper]] presents an automated system that aligns Wikipedia infoboxes across four language editions (English, Spanish, French, and German). The system creates infoboxes where they are missing, resolves discrepancies across parallel pages written in different languages, and fills in missing attribute values. Extracting new information this way is particularly attractive because the globalization of [[UsesDataset::Wikipedia]] creates a rapidly growing parallel corpus over many languages.
  
 
== Brief description of the method ==
 
=== Page Alignment ===
First of all, they align pages: pages written in different languages but describing the same concept are clustered together, and each cluster is assigned a concept ID (used later by the Cluster ID features). A rough sketch of such a clustering pass follows.
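
The section does not spell out the alignment mechanism, so the following is only a minimal sketch under an assumption of ours: that pages connected by Wikipedia's interlanguage links describe the same concept and can be grouped with a union-find pass. All function names are hypothetical, not the paper's.

def align_pages(language_links):
    # language_links: iterable of ((lang_a, title_a), (lang_b, title_b))
    # pairs taken from Wikipedia's interlanguage links (our assumption).
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in language_links:
        union(a, b)

    # Assign one integer concept ID per cluster root.
    ids, concept_of = {}, {}
    for page in parent:
        root = find(page)
        concept_of[page] = ids.setdefault(root, len(ids))
    return concept_of

links = [(("en", "Berlin"), ("de", "Berlin")),
         (("de", "Berlin"), ("fr", "Berlin")),
         (("en", "Paris"), ("fr", "Paris"))]
print(align_pages(links))  # the three Berlin pages share one concept ID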
  
=== Infobox Alignment ===
  
They train a classifier that decides whether a given pair of infobox attributes is a match. The following features are used by the additive logistic regression classifier (a sketch of a few of the feature computations follows the list):
  
(1) Equality features: exact matches of attribute names, infobox classes, and infobox values are strong indications of a match. Infobox values are also compared in normalized forms (lowercased, stripped to digits only, stripped to alphabetical characters only).
  
(2) Word features: partial matches between infobox values are captured by the Dice coefficient, Dice(X, Y) = 2|X ∩ Y| / (|X| + |Y|), and by the raw number of overlapping terms.
  
(3) n-gram features: some languages share common roots, so corresponding values often look similar even when they are not identical. This is captured by generating character 3-grams of the values and comparing them using the Dice coefficient and the number of overlapping n-grams.
  
(4) Cluster ID features: the page alignment phase clusters pages that are written in different languages but describe the same concept. These cluster IDs are used to check whether hyperlinked values in infoboxes of different languages in fact refer to the same concept.
 
(5) Language feature: an indicator for which pair of languages (e.g., German/English) is being compared.
 
(6) Correlation features: these compare numerical values, where n-gram and match features help little for many reasons (values measured at different times, unit conversions, etc.). The Pearson product-moment correlation is used here.
 
(7) Translation features: language resources can reveal a match even when there is no textual similarity. The authors translate the infobox class, attribute name, and attribute values, and use the number and ratio of matched translations as features.
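
A minimal sketch of how a few of the feature computations above could look; the helper names are ours, not the paper's, and the comments map them to feature groups (1), (2), (3), and (6) in the list:

import re
from math import sqrt

def dice(x, y):
    # Dice coefficient over two sets: 2|X ∩ Y| / (|X| + |Y|).
    return 2 * len(x & y) / (len(x) + len(y)) if (x or y) else 0.0

def equality_forms(value):
    # (1) Equality features: compare values in normalized forms.
    return {"lower": value.lower(),
            "digits": re.sub(r"[^0-9]", "", value),
            "alpha": re.sub(r"[^A-Za-z]", "", value)}

def word_features(a, b):
    # (2) Word features: Dice and raw overlap over whole terms.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return dice(ta, tb), len(ta & tb)

def ngram_features(a, b, n=3):
    # (3) n-gram features: character 3-grams capture shared roots.
    ga = {a[i:i + n] for i in range(len(a) - n + 1)}
    gb = {b[i:i + n] for i in range(len(b) - n + 1)}
    return dice(ga, gb), len(ga & gb)

def pearson(xs, ys):
    # (6) Correlation feature: Pearson product-moment correlation
    # between two equal-length lists of numeric attribute values.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

print(word_features("United States of America", "United States"))
print(ngram_features("population", "población"))
print(pearson([1.0, 2.0, 3.0], [2.2, 4.1, 5.9]))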
 
== Data ==
 
Due to the difficulty of parsing [[UsesDataset::Wikipedia]] infoboxes, they use a [[UsesDataset::DBpedia]] dump instead. The DBpedia dump contains all the parsed infobox information from Wikipedia.
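
DBpedia distributes its infobox extraction as RDF triples; assuming an N-Triples dump file (the format DBpedia used around the time of this paper), a minimal sketch of reading infobox property rows, with a hypothetical file name:

import re

# One N-Triples line looks like:  <subject> <property> "value"@en .
TRIPLE = re.compile(r'<([^>]*)> <([^>]*)> (.*) \.\s*$')

def read_infobox_triples(path):
    # Yields (subject, property, value) rows from a DBpedia dump file.
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = TRIPLE.match(line)
            if m:
                yield m.group(1), m.group(2), m.group(3)

# Hypothetical usage:
# for subj, prop, value in read_infobox_triples("infobox_properties_en.nt"):
#     ...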
 
=== Generating a labeled training/test set ===
 
All hyperlinked values are replaced by their concept IDs in the page alignment stage. This makes it easy to generate positive examples: one only needs to match the concept IDs, and the original values are already in different languages, different formats, and so on. The authors score attribute pairs by how many of their values match, take the top 4,000 high-scoring pairs, and generate 20K examples from them. Producing negative examples is trivial: simply modify one element of a positive pair to something else (see the sketch below).
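
A minimal sketch of this construction on toy data; en_values, de_values, and all helper names are our assumptions, and GradientBoostingClassifier is only a rough stand-in for the paper's additive logistic regression classifier:

from sklearn.ensemble import GradientBoostingClassifier

def make_pairs(en_values, de_values):
    # en_values / de_values: hypothetical maps from a
    # (concept_id, attribute) key to a value string.
    positives, negatives = [], []
    for (concept, _attr), v_en in en_values.items():
        for (concept_de, _attr_de), v_de in de_values.items():
            if concept == concept_de:
                # Same concept ID on both sides: a positive example.
                positives.append((v_en, v_de, 1))
    # A negative example swaps one element of a positive pair
    # for a value belonging to some other concept.
    for v_en, v_de, _ in positives:
        for wrong in de_values.values():
            if wrong != v_de:
                negatives.append((v_en, wrong, 0))
                break
    return positives + negatives

def featurize(a, b):
    # A toy 4-dimensional feature vector (word Dice, exact match,
    # 3-gram Dice, word overlap); a real one would cover all seven
    # feature groups described above.
    dice = lambda x, y: 2 * len(x & y) / (len(x) + len(y)) if (x or y) else 0.0
    ta, tb = set(a.lower().split()), set(b.lower().split())
    ga = {a[i:i + 3] for i in range(len(a) - 2)}
    gb = {b[i:i + 3] for i in range(len(b) - 2)}
    return [dice(ta, tb), float(a == b), dice(ga, gb), float(len(ta & tb))]

pairs = make_pairs(
    {(0, "capital"): "Berlin", (1, "capital"): "Paris"},
    {(0, "Hauptstadt"): "Berlin", (1, "Hauptstadt"): "Paris"})
X = [featurize(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
clf = GradientBoostingClassifier().fit(X, y)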
  
 
== Experimental Result ==
 
The system is evaluated on [[UsesDataset::Wikipedia]].
 
  
 
  
 
== Related papers ==
 
[1] [[RelatedPaper::Bunescu and Mooney, ACL 2004]]

[2] [[RelatedPaper::Finkel et al, ACL 2005]]
 