Bansal et al, ACL 2011
Latest revision as of 21:00, 30 October 2011
Citation
M. Bansal, C. Quirk, and R. Moore. 2011. Gappy phrasal alignment by agreement. In Proceedings of ACL.
Online version
Summary
This work defines a phrase-to-phrase alignment model for Statistical Machine Translation. The model is based on HMMs, building on the work presented in Vogel et al, COLING 1996, and extends it to allow both contiguous and discontinuous (gappy) phrases.
The quality of the alignments is further improved by employing the alignment agreement described in [Liang et al, 2006], where bidirectional alignments are trained with a joint objective function rather than combined afterwards by symmetrization.
Experimental results show improvements in terms of AER (Alignment Error Rate) over the work in [Liang et al, 2006]. Translation quality was evaluated using BLEU and also showed improvements over the same baseline.
Description of the Method
The method extends the work in Vogel et al, COLING 1996, where a word-to-phrase alignment model was presented. Two extensions to this model are proposed.
The first extension allows phrasal alignments, where multiple source words can be aligned with multiple target words. This makes the model semi-Markov, since each state (an alignment between phrases) can emit more than one observation (target word) at each time step, as opposed to the previous work using a regular HMM, where each target word can be aligned with at most one source word.
The second extension allows alignments using phrases with gaps to be modeled, where a phrase with a gap is the sequence <math>w_s * w_f</math>, where <math>w_s</math> is the starting word, <math>w_f</math> is the final word, and "*" can be any number of words. Furthermore, the alignment agreement work presented in [Liang et al, 2006] was employed and extended to the new space of alignments (alignments including gappy phrases), to substantially reduce overfitting.
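Thus, the generative model takes the following form:
<math>
p(A,L,O|S) = p_l(J|I)p_f(K|J)\prod_{k=1}^K p_j(a_k|a_{k-1}) p_t(l_k,o_{l_{k-1}+1}^{l_k}|S[a_k],l_{k-1})
</math>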
Here <math>p_l(J|I)</math> is a uniform distribution modeling the length of the observation sequence <math>J</math> based on the number of words <math>I</math> on the state side (source words).
<math>p_f(K|J)</math> is a distribution that models the number of states <math>K</math> given the number of observation words <math>J</math> (in other words, how target words are grouped into phrases). This distribution is modeled by <math>\eta^{(J-K)}</math>, where a penalty parametrized by <math>\eta</math> is given to shorter state sequences that use long phrases, since in such sequences the number of phrases <math>K</math> is much smaller than the number of target words <math>J</math>.
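The effect of the penalty can be illustrated with a small sketch. It treats <math>\eta^{(J-K)}</math> as an unnormalized weight, which is an assumption (the text above only says the distribution is "modeled by" this term), and the function name and the value 0.5 for <math>\eta</math> are hypothetical.

```python
def state_count_weight(J: int, K: int, eta: float) -> float:
    """Unnormalized weight eta**(J - K) for grouping J target words into K phrases.

    Assumption: no normalization over K is applied here.
    """
    if not 1 <= K <= J:
        raise ValueError("expected 1 <= K <= J")
    return eta ** (J - K)


# Hypothetical setting: 4 target words, eta = 0.5.
# With eta < 1, segmentations with fewer (longer) phrases get a smaller weight.
weights = {K: state_count_weight(J=4, K=K, eta=0.5) for K in range(1, 5)}
```

With these illustrative values, the all-single-word segmentation (K = 4) receives the largest weight and the single-phrase segmentation (K = 1) the smallest.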
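<math>p_j(a_k|a_{k-1})</math> is a probability distribution for state transitions with a first-order Markov assumption.
Finally, <math>p_t(l_k,o_{l_{k-1}+1}^{l_k}|S[a_k],l_{k-1})</math> is the translation probability of the target phrase <math>o_{l_{k-1}+1}^{l_k}</math> starting in position <math>l_{k-1}+1</math> and ending in position <math>l_k</math>, given that the previous phrase ends in position <math>l_{k-1}</math> and given the aligned source phrase <math>S[a_k]</math>. The alignment variable <math>a</math> is defined as <math>(i,j,g)</math>, where <math>i</math> and <math>j</math> are the starting and ending positions of the source words the target phrase is aligned to, and <math>g</math> defines whether the source phrase from <math>i</math> to <math>j</math> is a contiguous or a gappy phrase. For instance, the phrase <math>S[2,4,CONTIG]</math> can represent the phrase "ne peux pas", while the phrase <math>S[2,4,GAP]</math> represents "ne * pas", where "*" is a gap.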
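As a rough illustration of how the factors combine, the sketch below represents an alignment state as a triple (i, j, g) and multiplies the four factors for one fixed segmentation. It is not the authors' implementation: the class and function names, the 1-based positions, the example sentences, and the constant toy distributions are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

CONTIG, GAP = "CONTIG", "GAP"


@dataclass(frozen=True)
class Alignment:
    i: int  # 1-based start position in the source sentence (assumed convention)
    j: int  # 1-based end position in the source sentence
    g: str  # CONTIG or GAP


def source_phrase(source: Sequence[str], a: Alignment) -> List[str]:
    """Return S[a]: the full span if contiguous, else first word, gap, last word."""
    words = list(source[a.i - 1 : a.j])
    if a.g == CONTIG:
        return words
    return [words[0], "*", words[-1]]


def joint_probability(
    source: Sequence[str],
    target: Sequence[str],
    alignments: Sequence[Alignment],  # a_1 .. a_K
    ends: Sequence[int],              # l_1 .. l_K, 1-based end positions, l_0 = 0
    p_l: Callable[[int, int], float],
    p_f: Callable[[int, int], float],
    p_j: Callable[[Alignment, Optional[Alignment]], float],
    p_t: Callable[[int, List[str], List[str], int], float],
) -> float:
    """Multiply p_l(J|I) p_f(K|J) and the per-state transition/translation terms."""
    I, J, K = len(source), len(target), len(alignments)
    if len(ends) != K or (K > 0 and ends[-1] != J):
        raise ValueError("ends must have one entry per state and cover the target")
    prob = p_l(J, I) * p_f(K, J)
    prev_a: Optional[Alignment] = None
    prev_l = 0
    for a, l in zip(alignments, ends):
        target_phrase = list(target[prev_l:l])  # words l_{k-1}+1 .. l_k
        prob *= p_j(a, prev_a) * p_t(l, target_phrase, source_phrase(source, a), prev_l)
        prev_a, prev_l = a, l
    return prob


# Hypothetical example sentence pair and constant toy distributions.
source = "je ne peux pas".split()
target = "I can not".split()
contig = source_phrase(source, Alignment(2, 4, CONTIG))  # ne peux pas
gappy = source_phrase(source, Alignment(2, 4, GAP))      # ne * pas
toy = joint_probability(
    source, target,
    alignments=[Alignment(1, 1, CONTIG), Alignment(2, 4, CONTIG)],
    ends=[1, 3],
    p_l=lambda J, I: 0.1,
    p_f=lambda K, J: 0.5 ** (J - K),
    p_j=lambda a, prev: 0.5,
    p_t=lambda l, o, s, prev_l: 0.25,
)
```

In a real model the four callables would be learned distributions; constants are used here only so the product is easy to follow.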
Experimental Results
The model was evaluated on the quality of the produced alignments, using AER (Alignment Error Rate), and on translation quality, using BLEU.
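For reference, AER is commonly defined over a set of predicted links A, gold "sure" links S, and gold "possible" links P (with S contained in P) as 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The sketch below follows that common definition; the function name and the example links are hypothetical, and this summary does not state the exact evaluation script used in the paper.

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # (source position, target position)


def alignment_error_rate(predicted: Set[Link], sure: Set[Link], possible: Set[Link]) -> float:
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), lower is better."""
    possible = possible | sure  # sure links are also possible links
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        return 0.0  # assumed convention when there is nothing to score
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denominator


# Hypothetical links for a single sentence pair.
aer = alignment_error_rate(
    predicted={(1, 1), (2, 2), (3, 4)},
    sure={(1, 1), (2, 2)},
    possible={(3, 3)},
)
```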
Two datasets were used. For the English-French pair, the Hansards dataset was used, which contains around 1.1 million training sentence pairs, and the system was tested using the NAACL 2003 shared-task dataset. The EUROPARL German-English data was also used, which contains around 1.6 million training sentences, and the translation quality was evaluated using the WMT2010 translation task data.
The baseline used for this work is the system described in [Liang et al, 2006].
In terms of AER, the inclusion of contiguous segments showed consistent improvements, and some additional gains are observed by including gappy phrases. This holds for both Viterbi and posterior decoding.
Data | Decoding | Word-to-word | +Contig phrases | +Gappy phrases |
---|---|---|---|---|
Hansards | Viterbi | 94.1 | 94.3 | 94.3 |
Hansards | Posterior (threshold = 0.1) | 94.2 | 94.4 | 94.5 |
EUROPARL | Viterbi | 83.0 | 85.2 | 85.6 |
EUROPARL | Posterior (threshold = 0.1) | 83.7 | 85.3 | 85.7 |
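Posterior decoding with a threshold, as in the "Posterior (threshold = 0.1)" rows above, can be sketched as keeping every link whose posterior probability reaches the threshold. This is a minimal sketch under assumptions: the function name, the 0-based positions, the inclusive comparison, and the example matrix are hypothetical, and how the posteriors are computed from the model is not covered here.

```python
from typing import List, Set, Tuple


def posterior_decode(posteriors: List[List[float]], threshold: float = 0.1) -> Set[Tuple[int, int]]:
    """Keep link (i, j) when its posterior is at least the threshold.

    posteriors[i][j] is the (assumed given) posterior that source position i
    is aligned to target position j, both 0-based.
    """
    return {
        (i, j)
        for i, row in enumerate(posteriors)
        for j, p in enumerate(row)
        if p >= threshold
    }


# Hypothetical 2x2 posterior matrix.
links = posterior_decode([[0.9, 0.05], [0.1, 0.6]], threshold=0.1)
```

Unlike Viterbi decoding, which returns the single best alignment, thresholding lets a position take part in several links or in none.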
In terms of BLEU, consistent improvements can also be observed using the alignments with gappy phrases.
Data | Word-to-word | +Gappy phrases |
---|---|---|
Hansards | 34.0 | 34.5 |
EUROPARL | 19.3 | 19.8 |
Related Work
The work in Marcu and Wong, EMNLP 2002 describes a joint probability distribution for phrasal alignments.
In Vogel et al, COLING 1996, an HMM-based alignment model is described, where first-order Markov dependencies are modeled.
This work uses the system in [Liang et al, 2006] as a baseline, and extends the alignment agreement algorithm to the alignment space with gappy phrases.