Bansal et al, ACL 2011

Note

still incomplete...

Citation

M. Bansal, C. Quirk, and R. Moore. 2011. Gappy phrasal alignment by agreement. In Proceedings of ACL.

Online version

pdf

Summary

This work defines a phrase-to-phrase alignment model for Statistical Machine Translation. An HMM-based model is built on the work presented in Vogel et al, COLING 1996, extending it to allow both contiguous and discontiguous (gappy) phrases.

The quality of the alignments is further improved by employing the alignment agreement described in [Liang et al., 2006], where the two directional alignment models are trained with a joint objective function rather than combined afterwards by Symmetrization.

Experimental results show improvements in terms of AER (Alignment Error Rate) over the work in [Liang et al., 2006]. Translation quality, measured with BLEU, also improves over the same baseline.
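
For reference, here is a minimal sketch of the AER metric (the standard definition over sure links S and possible links P, with S ⊆ P); this is illustrative code, not code from the paper:

```python
def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|); lower is better."""
    a, s = set(predicted), set(sure)
    p = set(possible) | s  # sure links are possible links by definition
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

# Toy example: predicting exactly the sure links yields a perfect score
sure = {(0, 0), (1, 2)}
possible = sure | {(2, 1)}
print(alignment_error_rate({(0, 0), (1, 2)}, sure, possible))  # 0.0
```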

Description of the Method

The method extends the word-to-phrase alignment model presented in Vogel et al, COLING 1996. Two extensions to this model are proposed.

The first extension allows phrasal alignments, where multiple source words can be aligned with multiple target words. This makes the model semi-Markov, since each state (an alignment between phrases) can emit more than one observation (target word) at each time step, as opposed to the previous work using a regular HMM, where each target word can be aligned with at most one source word.
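
To illustrate the semi-Markov dynamics, here is a minimal sketch of the forward recursion in which a state emits a whole target phrase per step; the parameterization (trans, emit_phrase, the uniform start, max_len) is assumed for illustration and is not the paper's model:

```python
import numpy as np

def semi_markov_forward(trans, emit_phrase, J, I, max_len=3):
    # alpha[j, i]: probability of generating the first j target words,
    # with the most recent phrase emitted by source word i
    alpha = np.zeros((J + 1, I))
    for j in range(1, J + 1):
        for i in range(I):
            # unlike a regular HMM, a state may emit 1..max_len words
            for l in range(1, min(max_len, j) + 1):
                if j - l == 0:
                    prev = 1.0 / I  # uniform initial alignment (assumption)
                else:
                    prev = alpha[j - l, :] @ trans[:, i]
                alpha[j, i] += prev * emit_phrase(i, j - l, l)
    return alpha  # sentence probability = alpha[J, :].sum()

# Toy usage: 3 source words, 4 target words, dummy parameters
rng = np.random.default_rng(0)
I, J = 3, 4
trans = rng.dirichlet(np.ones(I), size=I)  # jump probabilities, rows sum to 1
emit_phrase = lambda i, j, l: 0.1 ** l     # toy phrase-emission score
print(semi_markov_forward(trans, emit_phrase, J, I)[J].sum())
```

Setting max_len to 1 recovers the regular HMM forward pass, where each state emits exactly one target word.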

The second extension allows alignments using phrases with gaps to be modeled, where a phrase with a gap is a sequence $w_i * w_j$, where $w_i$ is the starting word, $w_j$ is the final word, and $*$ can match any number of intervening words (for example, French "ne * pas" aligning to English "not"). Furthermore, the alignment agreement work presented in [Liang et al., 2006] was employed and extended to the new space of alignments (alignments that may include gappy phrases), which substantially reduces overfitting.
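
The agreement step can be sketched at the word level: each directional model produces posterior link probabilities, and the heuristic E-step keeps their element-wise product, so probability mass survives only on links both models support. This is a simplified illustration of the Liang et al. agreement idea, not the paper's extension of it to gappy phrases:

```python
import numpy as np

# post_ef[i, j]: posterior that source word i links to target word j under
# the source-to-target model; post_fe holds the reverse model's posteriors
# mapped into the same orientation (both matrices are hypothetical values)
post_ef = np.array([[0.9, 0.1, 0.0],
                    [0.1, 0.1, 0.8]])
post_fe = np.array([[0.8, 0.2, 0.0],
                    [0.0, 0.3, 0.7]])

joint = post_ef * post_fe  # product of posteriors: the agreement E-step
links = {(int(i), int(j)) for i, j in zip(*np.nonzero(joint > 0.5))}
print(links)  # {(0, 0), (1, 2)}: only the links both directions believe in
```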

Thus, the generative model takes the following form:

$p(e_1^J, K, a_1^K \mid f_1^I) = p(J \mid I)\; p(K \mid J) \prod_{k=1}^{K} p(a_k \mid a_{k-1})\; p(\bar{e}_k \mid f_{a_k})$

where $p(J \mid I)$ is a uniform distribution modeling the length of the observation sequence given the number of words on the state side (the source words), and $p(K \mid J)$ is a distribution modeling the number of states given the number of observation words (in other words, how the target words are grouped into phrases). The latter is modeled as $\eta^K$ with $\eta \ge 1$, to discourage short state sequences built from long phrases.
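
To make the factorization concrete, the sketch below scores a single derivation (a segmentation of the target into K phrases plus their alignments) under the factors above; the uniform length stand-in, the initial-jump handling, and the toy parameters are assumptions for illustration:

```python
import math

def derivation_log_prob(phrases, alignments, I, trans_lp, emit_lp,
                        eta=1.1, j_max=100):
    J = sum(len(p) for p in phrases)  # number of observation (target) words
    K = len(phrases)                  # number of states (phrases)
    lp = -math.log(j_max)             # p(J | I): uniform stand-in
    lp += K * math.log(eta)           # p(K | J) proportional to eta^K
    prev = None
    for phrase, a in zip(phrases, alignments):
        # uniform initial alignment, then jumps p(a_k | a_{k-1})
        lp += -math.log(I) if prev is None else trans_lp(prev, a)
        lp += emit_lp(phrase, a)      # phrase emission p(e_k | f_{a_k})
        prev = a
    return lp

# Toy usage: two target phrases aligned to source positions 2 and 1
trans_lp = lambda a_prev, a: -0.5 * abs(a - a_prev)  # toy jump penalty
emit_lp = lambda phrase, a: -1.0 * len(phrase)       # toy emission score
print(derivation_log_prob([["not"], ["going"]], [2, 1], I=3,
                          trans_lp=trans_lp, emit_lp=emit_lp))
```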

Experimental Results

Related Work

The work in Marcu and Wong, EMNLP 2002, describes a joint probability model over phrases, which is used and extended in this work.