Bansal et al, ACL 2011

== Citation ==

M. Bansal, C. Quirk, and R. Moore. 2011. Gappy phrasal alignment by agreement. In Proceedings of ACL.

== Online version ==

pdf
== Summary ==

This [[Category::paper | work]] defines a phrase-to-phrase alignment model for Statistical [[AddressesProblem::Machine Translation]]. A model based on [[usesMethod::Hidden Markov Model | HMMs]] is defined, extending the work presented in [[Vogal et al, COLING 1996]] to allow both contiguous and discontinuous (gappy) phrases.
  
The quality of the alignments is further improved by employing the alignment by agreement approach described in [http://dl.acm.org/ft_gateway.cfm?id=1220849&type=pdf&CFID=49698289&CFTOKEN=66367019 Liang et al., 2006], where bidirectional alignments are trained with a joint objective function rather than through heuristic symmetrization.
  
Experimental results show improvements in terms of AER (Alignment Error Rate) over the work in [http://dl.acm.org/ft_gateway.cfm?id=1220849&type=pdf&CFID=49698289&CFTOKEN=66367019 Liang et al., 2006], which does not allow discontinuous phrases. Translation quality, evaluated using BLEU, also improves over the same baseline.
  
 
== Description of the Method ==

This work builds on the word-to-phrase alignment model presented in [[Vogal et al, COLING 1996]]. Two extensions to this model are proposed.

The first extension allows phrasal alignments, where multiple source words can be aligned with multiple target words. This makes the model semi-Markov, since each state (an alignment between phrases) can emit more than one observation (target word) at each time step; in the previous work, which uses a regular HMM, each target word can be aligned with at most one source word.

The second extension allows alignments with gappy phrases to be modeled, where a phrase with a gap is a sequence <math>w_s * w_f</math>, in which <math>w_s</math> is the starting word, <math>w_f</math> is the final word, and "*" stands for any number of intervening words. Furthermore, the alignment agreement work presented in [http://dl.acm.org/ft_gateway.cfm?id=1220849&type=pdf&CFID=49698289&CFTOKEN=66367019 Liang et al., 2006] was employed and extended to the new space of alignments (alignments including gappy phrases), which substantially reduces overfitting.

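To make the two phrase types concrete, here is a minimal sketch of matching a contiguous versus a gappy phrase against a source span. The helper function and example sentence are invented for illustration and are not from the paper:

```python
def matches(span, phrase, gappy):
    """Check whether a token span realizes a phrase.

    A contiguous phrase must match the span word for word; a gappy
    phrase "w_s * w_f" only fixes the first and last word, and "*"
    stands for any number of intervening words.
    """
    if not gappy:
        return list(span) == list(phrase)
    # Gappy phrase: only the start word w_s and final word w_f are fixed.
    w_s, w_f = phrase
    return len(span) >= 2 and span[0] == w_s and span[-1] == w_f

source = ["je", "ne", "peux", "pas", "partir"]
print(matches(source[1:4], ["ne", "peux", "pas"], gappy=False))  # True
print(matches(source[1:4], ["ne", "pas"], gappy=True))           # True: "ne * pas"
```

The same source span ("ne peux pas") can thus be covered either by a contiguous three-word phrase or by the gappy phrase "ne * pas".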
Combining these extensions, the generative model takes the following form:

<math>
p(A,L,O|S) = p_l(J|I)p_f(K|J)\prod_{k=1}^K p_j(a_k|a_{k-1}) p_t(l_k,o_{l_{k-1}+1}^{l_k}|S[a_k],l_{k-1})
</math>
 
where <math>p_l(J|I)</math> is a uniform distribution over the length <math>J</math> of the observation sequence, given the number of words on the state side (source words).
 
<math>p_f(K|J)</math> is a distribution over the number of states given the number of observation words (in other words, over how target words are grouped into phrases). It is modeled as <math>\eta^{(J-K)}</math>, where the parameter <math>\eta</math> penalizes shorter state sequences with long phrases, since the number of phrases <math>K</math> is smaller than the number of target words <math>J</math>.
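The direction of this penalty can be checked numerically. With <math>\eta < 1</math>, segmentations that pack the <math>J</math> target words into fewer (longer) phrases receive exponentially less mass; the value of <math>\eta</math> below is invented for illustration:

```python
eta = 0.5  # assumed penalty parameter, eta < 1
J = 6      # number of target words
for K in (6, 3, 2):  # number of phrases in the segmentation
    # p_f(K|J) is proportional to eta^(J-K): fewer, longer phrases cost more
    print(K, eta ** (J - K))
```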
 
<math>p_j(a_k|a_{k-1})</math> is a probability distribution for state transitions with a first-order Markov assumption.
 
Finally, <math>p_t(l_k,o_{l_{k-1}+1}^{l_k}|S[a_k],l_{k-1})</math> is the translation probability of the target phrase <math>o_{l_{k-1}+1}^{l_k}</math>, starting in position <math>l_{k-1}+1</math> and ending in position <math>l_k</math>, given the position <math>l_{k-1}</math> where the previous phrase ends and the aligned source phrase <math>S[a_k]</math>. The alignment variable <math>a</math> is defined as a triple <math>(i,j,g)</math>, where <math>i</math> and <math>j</math> are the starting and ending positions of the source words the target phrase is aligned to, and <math>g</math> defines whether the source phrase from <math>i</math> to <math>j</math> is contiguous or gappy. For instance, <math>S[2,4,CONTIG]</math> can represent the phrase "ne peux pas", while <math>S[2,4,GAP]</math> represents "ne * pas", where "*" is a gap.
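The factorization above can be evaluated directly for a toy alignment. All component probabilities below are invented for illustration; the point is only how the factors combine:

```python
# Toy evaluation of p(A,L,O|S) = p_l(J|I) p_f(K|J) * prod_k p_j(a_k|a_{k-1}) p_t(...)
I, J, K = 4, 5, 3          # source words, target words, target phrases
eta = 0.5                  # assumed segmentation penalty parameter

p_l = 1.0 / 10             # uniform length model (assumed support of 10 lengths)
p_f = eta ** (J - K)       # segmentation model, eta^(J-K)

# One transition and one translation probability per emitted phrase k = 1..K.
p_j = [0.4, 0.3, 0.5]      # p_j(a_k | a_{k-1}), invented values
p_t = [0.2, 0.1, 0.3]      # p_t(phrase_k | S[a_k], ...), invented values

prob = p_l * p_f
for jump, trans in zip(p_j, p_t):
    prob *= jump * trans
print(prob)
```

In practice these quantities are of course learned (via EM) rather than set by hand, and inference sums or maximizes over all segmentations and alignments.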
  
 
== Experimental Results ==

Tests evaluated both the quality of the produced alignments, using AER (Alignment Error Rate), and the translation quality, using BLEU.

Two datasets were used. For the English-French pair, the [[UsesDataset::Hansards]] dataset was used, which contains around 1.1 million training sentence pairs; the system was tested on the NAACL 2003 shared-task data. The [[UsesDataset::EUROPARL]] German-English data, which contains around 1.6 million training sentence pairs, was also used, with translation quality evaluated on the WMT 2010 translation task data.

The baseline used for this work is the system described in [http://dl.acm.org/ft_gateway.cfm?id=1220849&type=pdf&CFID=49698289&CFTOKEN=66367019 Liang et al., 2006].

In terms of AER, the inclusion of contiguous phrases shows consistent improvements, and some additional gains are observed by including gappy phrases. This holds using both Viterbi and posterior decoding to extract alignments from the model's expectations.

{| class="wikitable" border="1"
|-
! Data
! Decoding
! Word-to-word
! +Contig phrases
! +Gappy phrases
|-
| [[Hansards]]
| Viterbi
| 94.1
| 94.3
| 94.3
|-
| [[Hansards]]
| Posterior (threshold = 0.1)
| 94.2
| 94.4
| 94.5
|-
| [[EUROPARL]]
| Viterbi
| 83.0
| 85.2
| 85.6
|-
| [[EUROPARL]]
| Posterior (threshold = 0.1)
| 83.7
| 85.3
| 85.7
|}
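AER itself can be sketched as follows, in the standard Och-and-Ney formulation over sure links <math>S</math> and possible links <math>P</math> (as an error rate, lower is better; the table above reports scores where higher is better). The link sets here are invented for illustration:

```python
def aer(predicted, sure, possible):
    """Alignment Error Rate: 1 - (|A & S| + |A & P|) / (|A| + |S|).

    `sure` links must be recovered; `possible` links (a superset of the
    sure links in the usual annotation) are not penalized either way.
    """
    a = set(predicted)
    return 1.0 - (len(a & sure) + len(a & possible)) / (len(a) + len(sure))

sure = {(1, 1), (2, 3)}
possible = sure | {(2, 2)}
pred = [(1, 1), (2, 2), (2, 3)]
print(aer(pred, sure, possible))  # 0.0: every predicted link is possible, all sure links found
```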

In terms of BLEU, consistent improvements are also observed using the alignments with gappy phrases.

{| class="wikitable" border="1"
|-
! Data
! Word-to-word
! +Gappy phrases
|-
| [[Hansards]]
| 34.0
| 34.5
|-
| [[EUROPARL]]
| 19.3
| 19.8
|}
 
== Related Work ==

The work in [[Marcus and Wong, EMNLP 2002]] describes a joint probability distribution for phrasal alignments.

In [[Vogal et al, COLING 1996]], an HMM-based alignment model is described, in which first-order Markov dependencies between alignments are modeled.

This work uses the system in [http://dl.acm.org/ft_gateway.cfm?id=1220849&type=pdf&CFID=49698289&CFTOKEN=66367019 Liang et al., 2006] as a baseline, and extends the alignment agreement algorithm to the alignment space with gappy phrases.
