Difference between revisions of "Vogal et al, COLING 1996"

From Cohen Courses
Jump to navigationJump to search
 
(25 intermediate revisions by the same user not shown)
Line 5: Line 5:
 
== Online version ==
 
== Online version ==
  
[http://dl.acm.org/ft_gateway.cfm?id=993313&type=pdf&CFID=44190696&CFTOKEN=85108373 ACM]
+
[http://dl.acm.org/ft_gateway.cfm?id=993313&type=pdf&CFID=44190696&CFTOKEN=85108373 pdf]
  
 
== Summary ==
 
== Summary ==
  
[[Word Alignments]] map the word correspondence between two parallel sentences in different languages.
+
This is a highly influential [[Category::paper]] on [[AddressesProblem::Word Alignments]]. This work extends [[IBM Model 1]] and [[IBM Model 2]], which models lexical translation probabilities and absolute distortion probabilities, by also modeling relative distortion.  
  
This is a highly influential work on [[Word Alignments]]. This work IBM models 1 and 2, which models lexical translation probabilities and absolute distortion probabilities, by also modeling relative distortion.  
+
The relative distortion is modeled by applying a first-order [[UsesMethod::Hidden Markov Model]], where each alignment probability is dependent on the distortion of the previous alignment.
  
The relative distortion is modeled by applying a first-order HMM, where each alignment probabilities are dependent on the distortion of the previous alignment.
+
Results indicate that Modeling the relative distortion can improve the overall quality of the Word Alignments.
 
 
Results seem to indicate that Modeling the relative distortion can improve the overall quality of the Word Alignments.
 
 
 
== Previous work ==
 
IBM Model 1 defines the probability of a sentence <math>s_1^J</math>, with length <math>J</math>, being translated to a sentence <math>t_1^I</math>, with length <math>I</math>, with the alignment <math>a_1^J</math> as:
 
 
 
<math>
 
Pr(t,a|s) = \frac{\epsilon}{(J+1)^{I}}\prod_{j=1}^{J}{tr(t_j|s_{a(j)})}
 
</math>
 
 
 
Where the alignment <math>a_1^J</math> is a function that maps each word <math>t_j</math> to a word <math>s_i</math>, by their indexes. These alignments can be viewed as an object for indicating the corresponding words in a parallel text. We can see that the sentence translation probability <math>Pr(t,s)</math>, is decomposed into the product of the lexical translation probabilities <math>tr(t_j|s_{a(j)})</math> of each word in the target <math>t_1,...,t_{J}</math> with the word that it is aligned to in the source <math>s_{a(j)}</math>. Additionally, target words that are not aligned with any source word are aligned with the null token, with the a lexical translation probability given by <math>tr(t_j|null)</math>. These are referred as null insertions. The normalizing factor <math>\frac{\epsilon}{(I+1)^{J}}</math> ensures that <math>Pr(t,a|s)</math> is a probability and is normalized over all possible alignments <math>a</math> and all possible translations <math>t</math>.
 
 
 
One of the problems of the IBM Model 1 is that it is very weak to reordering, since <math>p(f,a|s)</math> is calculated using only the lexical translation probabilities <math>tr(t|s)</math>. Because of this, if the model is presented with 2 translations candidates <math>t_1</math> and <math>t_2</math> with the same lexical translations, but with different reordering of the translated words, the model scores both translations with the same score.
 
 
 
Mixture-based Alignment models~(IBM Model 2) addresses this problem by modeling the absolute distortion in the word positioning between the 2 languages, introducing an alignment probability distribution <math>Pr_a(i|j,J,I)</math>, where <math>i</math> and <math>j</math> are the word positions in the source and target sentences. Thus the equation for <math>Pr(t,a|s)</math> becomes:
 
 
 
<math>
 
Pr(t,a|s) = \frac{\epsilon}{(J+1)^{I}}\prod_{j=1}^{J}{tr(t_j|s_{a(j)}) Pr_a(a(j)|j,J,I)}
 
</math>
 
 
 
Where the alignment probability distribution <math>Pr_a(a(j)|j,J,I)</math> models the probability of a word in the position <math>i</math> in the source sentence of being reordered into the position <math>j</math> in the target sentence.
 
  
 
== Model ==
 
== Model ==
 +
[[IBM Model 2]] attempts to model the absolute distortion of words in sentence pairs. However, alignments have a strong tendency to maintain the local neighborhood after translation.
  
While IBM Model 2 attempts to model the absolute distortion of words in sentence pairs <math>Pr_a(i|j,J,I)</math>, alignments have a strong tendency to maintain the local neighborhood after translation.
+
This model uses a first order [[Hidden Markov Model]] to restructure the alignment model <math>Pr(t,a|s)</math> used in [[IBM Model 2]] to include first order alignment dependencies. It defines the probability of a sentence <math>s_1^J</math>, with length <math>J</math>, being translated to a sentence <math>t_1^I</math>, with length <math>I</math>, with the alignment <math>a_1^J</math> as:
 
 
This uses a first order Hidden Markov Model to restructure the alignment model <math>Pr(t,a|s)</math> to include first order alignment dependencies. Thus:
 
  
 
<math>
 
<math>
Line 62: Line 40:
 
</math>
 
</math>
  
The Viterbi alignment is the alignment with the highest <math>Pr(a|t,s)</math>. While in previous alignment models, the Viterbi alignment could be determined in polynomial time, by maximizing the alignment probability for each target word, due to the independence assumptions that are made, finding the optimum alignment for the HMM-based model is more complex, due to the first order dependencies between alignments. This can still be calculated in polynomial time, with complexity <math>O(I^2J)</math>, using the dynamic programming algorithm, similar to [[Viterbi Decoding]] proposed this work. This algorithm defines the partial alignment probability <math>Q(i,j)</math>, which is defined as
+
The Viterbi alignment is the alignment with the highest <math>Pr(a|t,s)</math> and is widely used in statistical [[AddressesProblem::Machine Translation]] systems. While in previous alignment models, the Viterbi alignment could be determined in polynomial time, by maximizing the alignment probability for each target word, due to the independence assumptions that are made, finding the optimum alignment for the HMM-based model is more complex, due to the first order dependencies between alignments. This can still be calculated in polynomial time, with complexity <math>O(I^2J)</math>, using the dynamic programming algorithm, similar to [[Viterbi Decoding]] proposed this work. This algorithm defines the partial alignment probability <math>Q(i,j)</math>, which is defined as
  
 
<math>
 
<math>
Line 78: Line 56:
 
! Words
 
! Words
 
! Vocabulary
 
! Vocabulary
 +
! Description
 
|-
 
|-
| [[Avalanche Bulletins]]
+
| [[UsesDataset::Avalanche Bulletins]]
 
| French-German
 
| French-German
 
| French:62849 German:44805
 
| French:62849 German:44805
 
| French:1993 German:2265
 
| French:1993 German:2265
 +
| Avalanche Bulletins published by the Swiss Federal Institute for Snow and Avalanche Research
 
|-
 
|-
| [[Verbmobil Corpus]]
+
| [[UsesDataset::Verbmobil]]
 
| Spanish-English
 
| Spanish-English
 
| Spanish:13768 English:15888
 
| Spanish:13768 English:15888
 
| Spanish2008, English:1830
 
| Spanish2008, English:1830
 +
| Spontaneous spoken dialog in the domain of appointment scheduling
 
|-
 
|-
| [[EuTrans Corpus]]
+
| [[UsesDataset::EuTrans]]
 
| German-English
 
| German-English
 
| German:150279, English:154727
 
| German:150279, English:154727
 
| German:4017, English:2443
 
| German:4017, English:2443
 +
| Typical phrases in the tourism and travel domain
 
|}
 
|}
  
Line 99: Line 81:
 
This work compares the HMM-based alignment model with IBM model 2. The training setup for both models start with 10 EM iterations using IBM model 1, to obtain the initial distribution for the lexical translation probabilities <math>tr(t_j|s_{a(j)})</math>. This was used to initialize both the IBM model 2 and the HMM-based model. Next, 5 EM iterations were run for the IBM Model 2 and the HMM-based Model.
 
This work compares the HMM-based alignment model with IBM model 2. The training setup for both models start with 10 EM iterations using IBM model 1, to obtain the initial distribution for the lexical translation probabilities <math>tr(t_j|s_{a(j)})</math>. This was used to initialize both the IBM model 2 and the HMM-based model. Next, 5 EM iterations were run for the IBM Model 2 and the HMM-based Model.
  
== Results ==
+
== Experimental Results ==
 
The quality of the alignments produced by each model is measured in terms of the translation, alignment and total perplexity:
 
The quality of the alignments produced by each model is measured in terms of the translation, alignment and total perplexity:
 
{| class="wikitable" border="1"
 
{| class="wikitable" border="1"
 
|-
 
|-
! [[Avalanche Bulletins]]
+
! Avalanche Bulletins
 
! Translation
 
! Translation
 
! Alignment
 
! Alignment
Line 121: Line 103:
 
{| class="wikitable" border="1"
 
{| class="wikitable" border="1"
 
|-
 
|-
! [[EuTrans Corpus]]
+
! EuTrans Corpus
 
! Translation
 
! Translation
 
! Alignment
 
! Alignment
Line 139: Line 121:
 
{| class="wikitable" border="1"
 
{| class="wikitable" border="1"
 
|-
 
|-
! [[Verbmobil Corpus]]
+
! Verbmobil Corpus
 
! Translation
 
! Translation
 
! Alignment
 
! Alignment
Line 155: Line 137:
 
|}
 
|}
  
From these results, it is concluded that IBM Model 2 gives slightly better results for the perplexity of translation probabilities, while the HMM-based Model gives better perplexity values for alignment probabilities. This is explained by the fact that in some cases the relative distortion used in HMM-based model gives more accurate results than the absolute distortion used in IBM Model 2 and viceversa.
+
From these results, it is concluded that IBM Model 2 gives slightly better results for the perplexity of translation probabilities, while the HMM-based Model gives better perplexity values for alignment probabilities. This is explained by the fact that in some cases the relative distortion used in HMM-based model gives more accurate results than the absolute distortion used in IBM Model 2 and vice-versa.
 +
 
 +
== Related Papers ==
 +
This work defines a relative reordering model for Word Alignments. A similar work is done in [[IBM Model 4]], which has a more robust model. However, the HMM-based model has the advantage that the inference needed to obtain the Viterbi Alignment can be performed in polynomial time, while [[IBM Model 4]] can only sample a subspace of all possible alignments, using [[Hill Climbing]].

Latest revision as of 09:39, 30 September 2011

Citation

Vogel, S., Ney, H., & Tillmann, C. (1996). Hmm-based word alignment in statistical translation. In Proceedings of the 16th conference on Computational linguistics - Volume 2, COLING ’96, pp. 836–841, Stroudsburg, PA, USA. Association for Computational Linguistics.

Online version

pdf

Summary

This is a highly influential paper on Word Alignments. This work extends IBM Model 1 and IBM Model 2, which models lexical translation probabilities and absolute distortion probabilities, by also modeling relative distortion.

The relative distortion is modeled by applying a first-order Hidden Markov Model, where each alignment probability is dependent on the distortion of the previous alignment.

Results indicate that Modeling the relative distortion can improve the overall quality of the Word Alignments.

Model

IBM Model 2 attempts to model the absolute distortion of words in sentence pairs. However, alignments have a strong tendency to maintain the local neighborhood after translation.

This model uses a first order Hidden Markov Model to restructure the alignment model used in IBM Model 2 to include first order alignment dependencies. It defines the probability of a sentence , with length , being translated to a sentence , with length , with the alignment as:

Where the alignment probability is calculated as:

In this formulation, the distortion probability does not depend on the word positions but in the jump width (i-i').

Viterbi Alignment

The alignment probability for a given sentence pair is given by:

The Viterbi alignment is the alignment with the highest and is widely used in statistical Machine Translation systems. While in previous alignment models, the Viterbi alignment could be determined in polynomial time, by maximizing the alignment probability for each target word, due to the independence assumptions that are made, finding the optimum alignment for the HMM-based model is more complex, due to the first order dependencies between alignments. This can still be calculated in polynomial time, with complexity , using the dynamic programming algorithm, similar to Viterbi Decoding proposed this work. This algorithm defines the partial alignment probability , which is defined as

can be seen as the Viterbi alignment from the partial target sentence from to , that contains the word alignment from to . This can be done because each word alignment is only dependent on the previous alignment.

Corpora

Tests were performed using the following corpora:

Corpora Language Pair Words Vocabulary Description
Avalanche Bulletins French-German French:62849 German:44805 French:1993 German:2265 Avalanche Bulletins published by the Swiss Federal Institute for Snow and Avalanche Research
Verbmobil Spanish-English Spanish:13768 English:15888 Spanish2008, English:1830 Spontaneous spoken dialog in the domain of appointment scheduling
EuTrans German-English German:150279, English:154727 German:4017, English:2443 Typical phrases in the tourism and travel domain

Training

This work compares the HMM-based alignment model with IBM model 2. The training setup for both models start with 10 EM iterations using IBM model 1, to obtain the initial distribution for the lexical translation probabilities . This was used to initialize both the IBM model 2 and the HMM-based model. Next, 5 EM iterations were run for the IBM Model 2 and the HMM-based Model.

Experimental Results

The quality of the alignments produced by each model is measured in terms of the translation, alignment and total perplexity:

Avalanche Bulletins Translation Alignment Total
IBM Model 2 3.18 10.05 32.00
HMM Model 3.45 5.84 20.18
EuTrans Corpus Translation Alignment Total
IBM Model 2 2.44 4.00 9.78
HMM Model 2.46 3.93 9.69
Verbmobil Corpus Translation Alignment Total
IBM Model 2 4.70 6.54 30.71
HMM Model 4.86 5.42 26.50

From these results, it is concluded that IBM Model 2 gives slightly better results for the perplexity of translation probabilities, while the HMM-based Model gives better perplexity values for alignment probabilities. This is explained by the fact that in some cases the relative distortion used in HMM-based model gives more accurate results than the absolute distortion used in IBM Model 2 and vice-versa.

Related Papers

This work defines a relative reordering model for Word Alignments. A similar work is done in IBM Model 4, which has a more robust model. However, the HMM-based model has the advantage that the inference needed to obtain the Viterbi Alignment can be performed in polynomial time, while IBM Model 4 can only sample a subspace of all possible alignments, using Hill Climbing.