Difference between revisions of "DeNero et al, EMNLP 2008"

From Cohen Courses
Jump to navigationJump to search
 
(40 intermediate revisions by 2 users not shown)
Line 5: Line 5:
 
== Online version ==
 
== Online version ==
  
[http://www.cs.berkeley.edu/~bouchard/pub/mtSubmission.pdf ACM]
+
[http://www.cs.berkeley.edu/~bouchard/pub/mtSubmission.pdf pdf]
  
 
== Summary ==
 
== Summary ==
Line 11: Line 11:
 
Unlike word-to-phrase alignments, computing the alignment and count expectations for phrase-to-phrase alignments is generally intractable due to the exponential growth of possible combination of phrases and alignments. Thus, previous attempts for building a fully joint phrase alignment model have been unsuccessful.
 
Unlike word-to-phrase alignments, computing the alignment and count expectations for phrase-to-phrase alignments is generally intractable due to the exponential growth of possible combination of phrases and alignments. Thus, previous attempts for building a fully joint phrase alignment model have been unsuccessful.
  
This paper describes the first tractable phrase-to-phrase Alignment Model, which relies on Gibbs Sampling to tackle the intractability issues.
+
This [[Category::paper]] describes the first tractable phrase-to-phrase Alignment Model, which relies on [[UsesMethod::Gibbs sampling]] to address the intractability issues.
  
Tests show translation improvements over Machine Translation Systems build using conventional methods.
+
Tests show translation improvements over [[AddressesProblem::Machine Translation]] Systems build using traditional methods ([[Word Alignments]] + [[Phrase Extraction]]).
  
== Phrase Alignment Model ==
+
== Description of the Method ==
The phrase-to-phrase alignment model used in this work is presented in [[Marcus and Wong, EMNLP 2002]]. An extension is made to allow null phrases, where a phrase does not have an equivalent in the other language. This is done by introducing another multinomial distribution <math>\theta_N</math>, from where unaligned phrase pairs are drawn. A free parameter <math>P_{N}</math> is set, defining the probability of a null phrase. This parameter is set to <math>10^{-10}</math> to discourage unaligned sentences.
+
The phrase-to-phrase alignment model used in this work is presented in [[Marcus and Wong, EMNLP 2002]]. An extension is made to allow null phrases, where a phrase does not have an equivalent in the other language. This is done by introducing another multinomial distribution <math>\theta_N</math>, from where unaligned phrase pairs are drawn.
  
This model involves the observed sentence pairs <math>x</math>, the latent phrase segmentations and alignments <math>z</math> and the parameters <math>\theta</math>(<math>\theta_N</math> and <math>\theta_J</math>), from where phrases are drawn. This model can be used to compute the expected counts for phrase pairs for a fixed <math>\theta</math> under <math>P(z|x,\theta)</math>, which could, in turn, be used to calculate translation features in [[Machine Translation]] Models. However, computing the expected counts this way would involve computing the phrase pair counts under all possible segmentations and alignments <math>z</math>, which is too large to be computed one by one. This was the main drawback of the work in [[Marcus and Wong, EMNLP 2002]].  
+
This model involves the observed sentence pairs <math>x</math>, the latent phrase segmentations and alignments <math>z</math> and the parameters <math>\theta</math>(<math>\theta_N</math> and <math>\theta_J</math>), from where phrase pairs are drawn. As with other alignment models, the [[UsesMethod::Expectation Maximization]] Algorithm is used, where during the E-step the expected counts for phrase pairs are calculated under <math>P(z|x,\theta)</math> by fixing <math>\theta</math> and during the M-step a new <math>\theta</math> is calculated based on these counts. However, computing the expected counts this way would involve computing the expected phrase pair counts under all possible segmentations and alignments <math>z</math>, which is too large to be computed one by one. This was the main drawback of the work in [[Marcus and Wong, EMNLP 2002]].  
  
This work tackles this problem using [[Gibbs sampling]]. Rather iterating over all possible <math>z</math>, this work generates a sequence of samples that approximate the conditional distribution <math>P(z|x,\theta)</math>. This means that the average counts for phrase pairs in the samples converges to the same counts under the model.  
+
This work addresses this problem using [[UsesMethod::Gibbs sampling]]. Rather than iterating over all possible <math>z</math>, this work generates a sequence of samples that approximate the conditional distribution <math>P(z|x,\theta)</math>. Then, by averaging counts for phrase pairs using those samples, an approximation of the expected counts under <math>P(z|x,\theta)</math> can be computed. In this work, these counts are directly used to calculate the Translation features, rather than considering the Word Alignments as a separate step.
  
== Gibbs Sampler ==
+
The Gibbs Sampler presented in this work starts with the initial state <math>z_0</math>, which sets a initial configuration for the phrase segmentation and alignment. Then, by applying a set of local changes starting from <math>z_0</math>, a sequence of samples <math>z_i,z_{i+1},\dots</math> are produced. These samples are guaranteed to approximate the conditional distribution <math>P(z|x,\theta)</math>.  
The Gibbs Sampler presented in this work starts with the initial state <math>z_0</math>, which sets a initial configuration for the phrase segmentation and alignment. Then, by applying a set of local changes starting from <math>z_0</math>, a sequence of samples <math>z_i,z_{i+1},\dots</math> are produced. These samples are guaranteed to approximate the conditional distribution <math>P(z|x,\theta)</math>. This means that the average counts for phrase pairs in the samples converge to the same counts under the model. Thus, these can be used to produce estimates for translation features such as the translation probability <math>\phi(t|s)</math>.
 
  
This work defines a set of operators that can be applied to a state <math>z_i</math> to generate a new state <math>z_{i+1}</math>. Each operate freezes the state <math>z_i</math>, except for a local region, and chooses a reconfiguration, between all possible reconfigurations, of that region according to the conditional probability each reconfiguration, given the frozen region of the state. This frozen region of the state is called Markov blanket.
+
This work defines a set of operators that can be applied to a state <math>z_i</math> to generate a new state <math>z_{i+1}</math>. Each operator freezes the state <math>z_i</math>, except for a local region, and chooses a reconfiguration, between all possible reconfigurations of that region, according to the conditional probability of each reconfiguration. This is done so that the sequence of generated samples approximate <math>P(z|x,\theta)</math>.
  
The SWAP operator freezes all phrases, and then chooses two target phrases <math>t_1</math> and <math>t_2</math>. All alignments are also frozen, except for the alignments <math>[t_1,s_1]</math> and <math>[t_2,s_2]</math>. The operator must choose to keep these alignments or switch to the alignments <math>[t_2,s_1]</math> and <math>[t_1,s_2]</math>
+
Finally, every phrase-to-phrase alignment model has to deal with degeneracy issues, since the joint model penalizes segmentations that use many small phrases, because additional phrases incur additional generation and distortion penalties. This means that the maximum likelihood estimation of the model puts weight on segmentations with only one phrase pair that translates the whole source sentence to the target sentence, which perfectly translates the sentences in the training corpus but are unlikely to be useful for decoding sentences that are not in the training corpus. This work addresses this problem by defining a Dirichlet process prior over phrase pairs, and selecting a base measure that prefers shorter phrase pairs and only uses longer phrase pairs when there is clear evidence for them.
 +
 
 +
== Experimental Results ==
 +
 
 +
This work was tested using the setup from the [http://www.statmt.org/wmt07/shared-task.html Statistical Machine Translation Workshop shared task], where the model is trained using the [[UsesDataset::EUROPARL]] dataset for Spanish-English. Using the expected counts to calculate the translation features, considerable improvements observed in terms of translation quality, measured with BLEU and METEOR, over the baseline trained using [http://www.statmt.org/moses/ Moses], where features are calculated using the [[Word Alignments]].
 +
 
 +
== Related Work ==
 +
 
 +
The work in [[Marcus and Wong, EMNLP 2002]], describes a joint probability distribution, which is used and extended in this work.
 +
 
 +
Similar approaches to address the problem of the intractability of all possible phrase segmentations and alignments have been tested using algorithms such as [[Hill Climbing]] in
 +
[[Birch et al, StatMT 2006]].
 +
 
 +
 
 +
== Comment ==
 +
 
 +
Hey, if you want to see more work in word alignment, this recent paper is interesting.  It may be an interesting contrast because it not use Gibbs sampling for inference.  C. Dyer, J. H. Clark, A. Lavie, and N. A. Smith. [http://www.aclweb.org/anthology/P/P11/P11-1042.pdf Unsupervised Word Alignment with Arbitrary Features.]  --[[User:Brendan|Brendan]] 18:36, 13 October 2011 (UTC)

Latest revision as of 13:37, 13 October 2011

Citation

Denero, J., Bouchard-ct, R., & Klein, D.(2008). Sampling alignment structure under a Bayesian translation model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 314-323.

Online version

pdf

Summary

Unlike word-to-phrase alignments, computing the alignment and count expectations for phrase-to-phrase alignments is generally intractable due to the exponential growth of possible combination of phrases and alignments. Thus, previous attempts for building a fully joint phrase alignment model have been unsuccessful.

This paper describes the first tractable phrase-to-phrase Alignment Model, which relies on Gibbs sampling to address the intractability issues.

Tests show translation improvements over Machine Translation Systems build using traditional methods (Word Alignments + Phrase Extraction).

Description of the Method

The phrase-to-phrase alignment model used in this work is presented in Marcus and Wong, EMNLP 2002. An extension is made to allow null phrases, where a phrase does not have an equivalent in the other language. This is done by introducing another multinomial distribution , from where unaligned phrase pairs are drawn.

This model involves the observed sentence pairs , the latent phrase segmentations and alignments and the parameters ( and ), from where phrase pairs are drawn. As with other alignment models, the Expectation Maximization Algorithm is used, where during the E-step the expected counts for phrase pairs are calculated under by fixing and during the M-step a new is calculated based on these counts. However, computing the expected counts this way would involve computing the expected phrase pair counts under all possible segmentations and alignments , which is too large to be computed one by one. This was the main drawback of the work in Marcus and Wong, EMNLP 2002.

This work addresses this problem using Gibbs sampling. Rather than iterating over all possible , this work generates a sequence of samples that approximate the conditional distribution . Then, by averaging counts for phrase pairs using those samples, an approximation of the expected counts under can be computed. In this work, these counts are directly used to calculate the Translation features, rather than considering the Word Alignments as a separate step.

The Gibbs Sampler presented in this work starts with the initial state , which sets a initial configuration for the phrase segmentation and alignment. Then, by applying a set of local changes starting from , a sequence of samples are produced. These samples are guaranteed to approximate the conditional distribution .

This work defines a set of operators that can be applied to a state to generate a new state . Each operator freezes the state , except for a local region, and chooses a reconfiguration, between all possible reconfigurations of that region, according to the conditional probability of each reconfiguration. This is done so that the sequence of generated samples approximate .

Finally, every phrase-to-phrase alignment model has to deal with degeneracy issues, since the joint model penalizes segmentations that use many small phrases, because additional phrases incur additional generation and distortion penalties. This means that the maximum likelihood estimation of the model puts weight on segmentations with only one phrase pair that translates the whole source sentence to the target sentence, which perfectly translates the sentences in the training corpus but are unlikely to be useful for decoding sentences that are not in the training corpus. This work addresses this problem by defining a Dirichlet process prior over phrase pairs, and selecting a base measure that prefers shorter phrase pairs and only uses longer phrase pairs when there is clear evidence for them.

Experimental Results

This work was tested using the setup from the Statistical Machine Translation Workshop shared task, where the model is trained using the EUROPARL dataset for Spanish-English. Using the expected counts to calculate the translation features, considerable improvements observed in terms of translation quality, measured with BLEU and METEOR, over the baseline trained using Moses, where features are calculated using the Word Alignments.

Related Work

The work in Marcus and Wong, EMNLP 2002, describes a joint probability distribution, which is used and extended in this work.

Similar approaches to address the problem of the intractability of all possible phrase segmentations and alignments have been tested using algorithms such as Hill Climbing in Birch et al, StatMT 2006.


Comment

Hey, if you want to see more work in word alignment, this recent paper is interesting. It may be an interesting contrast because it not use Gibbs sampling for inference. C. Dyer, J. H. Clark, A. Lavie, and N. A. Smith. Unsupervised Word Alignment with Arbitrary Features. --Brendan 18:36, 13 October 2011 (UTC)