Bansal et al, ACL 2011

Citation

M. Bansal, C. Quirk, and R. Moore. 2011. Gappy phrasal alignment by agreement. In Proceedings of ACL.

Online version

Summary

This work defines a phrase-to-phrase alignment model for statistical machine translation. The model builds on the HMM-based alignment model of Vogel et al, COLING 1996, and extends it to allow both contiguous and discontinuous (gappy) phrases.

The quality of the alignments is further improved by employing the alignment agreement described in [Liang et al, 2006], where the two directional alignment models are trained with a joint objective function rather than being combined by symmetrization.
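As a rough illustration of the agreement idea, the sketch below combines the link posteriors of two directional models by an element-wise product, which is how the word-to-word case is usually described. The function name, the matrix layout and the toy numbers are assumptions made for this example, and the extension to gappy phrases is not reproduced.

```python
def agreement_posteriors(post_st, post_ts):
    """Combine two directional link posteriors by element-wise product.

    post_st[i][j]: posterior of the link (source i, target j) under the
                   source-to-target model.
    post_ts[j][i]: posterior of the same link under the target-to-source model.
    Both layouts are assumptions made for this sketch.
    """
    n_src = len(post_st)
    n_tgt = len(post_ts)
    return [
        [post_st[i][j] * post_ts[j][i] for j in range(n_tgt)]
        for i in range(n_src)
    ]


# Toy posteriors for a 2-word source and a 2-word target (made-up numbers).
post_st = [[0.9, 0.1], [0.2, 0.8]]
post_ts = [[0.8, 0.3], [0.1, 0.7]]
agreed = agreement_posteriors(post_st, post_ts)
```

A link keeps a high score only when both directions support it, which is the intuition behind training the two models jointly.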

Experimental results show improvements in AER (Alignment Error Rate) over the work in [Liang et al, 2006]. Translation quality, evaluated using BLEU, also improves over the same baseline.

Description of the Method

The method extends the work in Vogel et al, COLING 1996, where a word-to-phrase alignment model was presented. Two extensions to this model are proposed.

The first extension allows phrasal alignments, where multiple source words can be aligned with multiple target words. This makes the model semi-Markov, since each state (an alignment between phrases) can emit more than one observation (target word) at each time step. In the previous work, which uses a regular HMM, each target word can be aligned with at most one source word.

The second extension allows alignments that use phrases with gaps. A phrase with a gap is the sequence $w_{s}*w_{f}$ , where $w_{s}$ is the starting word, $w_{f}$ is the final word and "*" can be any number of words. Furthermore, the alignment agreement work presented in [Liang et al, 2006] was employed and extended to the new space of alignments (alignments including gappy phrases), to substantially reduce overfitting.

Thus, the generative model takes the following form:

$p(A,L,O|S)=p_{l}(J|I)p_{f}(K|J)\prod _{k=1}^{K}p_{j}(a_{k}|a_{k-1})p_{t}(l_{k},o_{l_{k-1}+1}^{l_{k}}|S[a_{k}],l_{k-1})$

Here $p_{l}(J|I)$ is a uniform distribution over the length $J$ of the observation sequence, given the number $I$ of words on the state side (source words).

$p_{f}(K|J)$ is a distribution modeling the number of states $K$ given the number of observation words $J$ (in other words, how the target words are grouped into phrases). This distribution is modeled by $\eta ^{(J-K)}$ , where the parameter $\eta$ (for $\eta < 1$) penalizes shorter state sequences with long phrases: the fewer phrases $K$ are used to cover the $J$ target words, the larger the exponent $J-K$ and the stronger the penalty.
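To make the effect of $\eta ^{(J-K)}$ concrete, the following sketch enumerates every way of grouping $J$ target words into contiguous phrases and prints the unnormalized weight of each grouping. The function names and the value $\eta = 0.5$ are assumptions chosen for illustration; the normalization of $p_{f}$ is left out.

```python
from itertools import combinations


def segmentations(J):
    """Yield every split of target positions 1..J into contiguous phrases.

    Each split is returned as the list of phrase end positions
    l_1 < ... < l_K, with l_K = J.
    """
    for k in range(1, J + 1):
        # Choose the k-1 internal boundaries among positions 1..J-1.
        for cuts in combinations(range(1, J), k - 1):
            yield list(cuts) + [J]


def phrase_count_weight(J, K, eta):
    """Unnormalized weight eta**(J-K) for using K phrases over J words."""
    return eta ** (J - K)


# eta = 0.5 is an arbitrary illustrative value, not taken from the paper.
for ends in segmentations(3):
    print(ends, phrase_count_weight(3, len(ends), eta=0.5))
```

With $J = 3$ there are four groupings. The single three-word phrase receives the smallest weight and the word-by-word grouping receives weight 1, so long phrases have to earn their place through the translation probabilities.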

$p_{j}(a_{k}|a_{k-1})$ is a probability distribution for state transitions with a first-order Markov assumption.

Finally, $p_{t}(l_{k},o_{l_{k-1}+1}^{l_{k}}|S[a_{k}],l_{k-1})$ is the translation probability of the target phrase $o_{l_{k-1}+1}^{l_{k}}$ , which starts in position $l_{k-1}+1$ and ends in position $l_{k}$ , given that the previous phrase ends in position $l_{k-1}$ and given the aligned source phrase $S[a_{k}]$ . The alignment variable $a$ is defined as $(i,j,g)$ , where $i$ and $j$ are the starting and ending positions of the source words the target phrase is aligned to, and $g$ defines whether the source phrase from $i$ to $j$ is a contiguous or a gappy phrase. For instance, the phrase $S[2,4,CONTIG]$ can represent the phrase "ne peux pas", while the phrase $S[2,4,GAP]$ represents "ne * pas", where "*" is a gap.
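The alignment variable $(i,j,g)$ can be illustrated with a small helper that renders the source phrase it selects. This is only a sketch: the helper name, the 1-based inclusive indexing and the example sentence "je ne peux pas" are assumptions chosen to match the example above.

```python
CONTIG, GAP = "CONTIG", "GAP"


def source_phrase(words, i, j, g):
    """Render S[i, j, g] using 1-based, inclusive word positions i <= j."""
    if not 1 <= i <= j <= len(words):
        raise ValueError("positions must satisfy 1 <= i <= j <= len(words)")
    if g == CONTIG:
        # Every word from position i to position j.
        return " ".join(words[i - 1:j])
    if g == GAP:
        if i == j:
            raise ValueError("a gappy phrase needs distinct start and end words")
        # Only the boundary words are kept; the middle is the gap.
        return f"{words[i - 1]} * {words[j - 1]}"
    raise ValueError(f"unknown phrase type: {g}")


sentence = ["je", "ne", "peux", "pas"]  # hypothetical source sentence
print(source_phrase(sentence, 2, 4, CONTIG))
print(source_phrase(sentence, 2, 4, GAP))
```

Each HMM state is one such triple, so allowing gaps adds states without introducing a separate alignment mechanism.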

Experimental Results

The quality of the produced alignments was evaluated using AER (Alignment Error Rate), and the translation quality was evaluated using BLEU.
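For reference, AER is commonly computed from a set of predicted links and gold-standard "sure" and "possible" links as 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The sketch below follows that common definition; the function name and the toy link sets are assumptions, and the exact evaluation setup of the paper is not reproduced.

```python
def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    predicted, sure and possible are sets of (source, target) link pairs.
    Sure links are treated as a subset of the possible links.
    """
    possible = possible | sure
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        raise ValueError("AER is undefined when both link sets are empty")
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denominator


# Toy example with made-up links: two correct links and one spurious link.
predicted = {(1, 1), (2, 2), (3, 4)}
sure = {(1, 1), (2, 2)}
possible = {(3, 3)}
aer = alignment_error_rate(predicted, sure, possible)
```

Lower values are better, and an alignment that reproduces exactly the sure links has an AER of 0.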

Two datasets were used. For the English-French pair, the Hansards dataset was used, which contains around 1.1 million training sentence pairs, and the system was tested using the NAACL 2003 shared-task dataset. The EUROPARL German-English data was also used, which contains around 1.6 million training sentences, and the translation quality was evaluated using the WMT2010 translation task data.

The baseline used for this work is the system described in [Liang et al, 2006].

In terms of AER, the inclusion of contiguous phrases showed consistent improvements, and some additional gains are observed by including gappy phrases. This holds for both Viterbi and posterior decoding.

Data	Decoding	Word-to-word	+Contig phrases	+Gappy phrases
Hansards	Viterbi	94.1	94.3	94.3
Hansards	Posterior (threshold = 0.1)	94.2	94.4	94.5
EUROPARL	Viterbi	83.0	85.2	85.6
EUROPARL	Posterior (threshold = 0.1)	83.7	85.3	85.7
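The posterior decoding rows above use a threshold of 0.1. A minimal sketch of threshold-based posterior decoding is given below, assuming link posteriors are available as a matrix and that a link is kept when its posterior is at least the threshold. The function name, the 0-based indexing and the use of ">=" rather than ">" are assumptions.

```python
def posterior_decode(posteriors, threshold=0.1):
    """Return the set of links whose posterior reaches the threshold.

    posteriors[i][j] is the posterior probability of linking source
    position i to target position j (0-based in this sketch).
    """
    return {
        (i, j)
        for i, row in enumerate(posteriors)
        for j, p in enumerate(row)
        if p >= threshold
    }


# Made-up posteriors for a 2x2 sentence pair.
toy_posteriors = [[0.72, 0.01], [0.06, 0.56]]
links = posterior_decode(toy_posteriors, threshold=0.1)
```

Unlike Viterbi decoding, which returns the single best state sequence, this keeps every link with enough posterior mass, so lowering the threshold trades precision for recall.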

In terms of BLEU, consistent improvements can also be observed using the alignments with gappy phrases.

Data	Word-to-word	+Gappy phrases
Hansards	34.0	34.5
EUROPARL	19.3	19.8

Related Work

The work in Marcu and Wong, EMNLP 2002, describes a joint probability distribution for phrasal alignments.

In Vogel et al, COLING 1996, an HMM-based alignment model is described, where first-order Markov dependencies are modeled.

This work uses the system in [Liang et al, 2006] as a baseline, and extends the alignment agreement algorithm to the alignment space with gappy phrases.
