DeNero et al, EMNLP 2008


Citation

DeNero, J., Bouchard-Côté, A., & Klein, D. (2008). Sampling alignment structure under a Bayesian translation model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 314-323.

Online version

ACM

Summary

Unlike word-to-phrase alignments, computing the alignment expectations for phrase-to-phrase alignments is generally intractable due to the exponentially large number of possible phrase segmentations and alignments. Because of this, previous attempts to build a joint phrase alignment model have been unsuccessful.

This paper describes the first tractable phrase-to-phrase Alignment Model, which relies on Gibbs Sampling to tackle the intractability problem.

Tests show translation improvements over Machine Translation Systems built using conventional methods.

Previous Work

Most alignment models cannot model many-to-many alignments, since they restrict each word in the target sentence to be aligned with at most one word in the source language. Thus, these models can only produce one-to-many alignments, where each source word can be aligned to multiple target words but not the other way around. An efficient way to generate many-to-many alignments is to combine one-to-many alignments with many-to-one alignments, a procedure called Symmetrization. In this case, we build one-to-many alignments from the source sentences to the target sentences and many-to-one alignments in the inverse direction, and combine the two to obtain many-to-many alignments. A simple combination method, though not widely used nowadays, is to take the union of the alignments, where two words are aligned if they are aligned in either of the two directional alignments.
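
As a concrete illustration of the union heuristic, the following minimal sketch (hypothetical helper names, not from the paper) merges two directional word alignments represented as sets of word-index pairs:

    # Symmetrization by union: two words are aligned if they are linked
    # in either directional alignment. Alignments are sets of
    # (source_index, target_index) pairs; the target-to-source direction
    # is flipped before merging.
    def union_symmetrize(src_to_tgt, tgt_to_src):
        flipped = {(s, t) for (t, s) in tgt_to_src}
        return src_to_tgt | flipped

    src_to_tgt = {(0, 0), (1, 1), (1, 2)}  # one-to-many: source 1 -> targets 1, 2
    tgt_to_src = {(0, 0), (0, 1)}          # many-to-one: target 0 <- sources 0, 1
    print(sorted(union_symmetrize(src_to_tgt, tgt_to_src)))
    # [(0, 0), (1, 0), (1, 1), (1, 2)]  -- a many-to-many alignment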

These alignments are used in the Phrase Extraction Algorithm, where phrase pairs are extracted based on heuristics, such as the alignment template defined by (Och et al., 1999). The phrase translation features are then calculated for each unique phrase pair, based on the available data such as phrase pair counts and alignments. An example of such a feature is the translation probability \phi(t \mid s), which is calculated as the phrase pair count ratio:

\phi(t \mid s) = \frac{c(s, t)}{\sum_{t'} c(s, t')}

where c(s,t) is the number of occurrences of the phrase pair ⟨s,t⟩ with source phrase s and target phrase t.
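
A minimal sketch of this relative-frequency estimate, assuming the phrase pair counts have already been collected (the counts and phrase names below are illustrative only):

    from collections import Counter

    # Phrase pair counts c(s, t), e.g. gathered by the phrase
    # extraction algorithm from the aligned training corpus.
    counts = Counter({("casa", "house"): 8, ("casa", "home"): 2})

    def phi(t, s, counts):
        # phi(t|s) = c(s, t) / sum over t' of c(s, t')
        total = sum(c for (s2, _), c in counts.items() if s2 == s)
        return counts[(s, t)] / total if total > 0 else 0.0

    print(phi("house", "casa", counts))  # 0.8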

The main difference between this model and previous work is that, while previous work uses fixed word alignments and extracts phrase pairs using heuristics, this work estimates the phrase translation features using an inference procedure that is not restricted by fixed alignments or heuristics.

Previous Phrase Alignment Model

The phrase-to-phrase alignment model presented in this work is built upon the work in (Marcu and Wong, 2002). In that model, words are clustered into phrases by a generative process, which constructs an ordered set of phrases t = (t_1, ..., t_ℓ) in the target language, an ordered set of phrases s = (s_1, ..., s_ℓ) in the source language, and an alignment a between them, where each link in a indicates that the phrase pair with target t_j and source s_k was generated together. The process is composed of 2 steps:

  • First, the number of phrase pairs ℓ is chosen, and each of the ℓ phrase pairs is generated independently.
  • Then, an ordering for the source phrases is chosen, and the source and target phrases are aligned one-to-one (a sampling sketch of this process is given below).
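
The two-step generative story can be made concrete with a small sampling sketch; the phrase table and parameter value here are toy assumptions, not from the paper:

    import random

    # Toy joint phrase-pair distribution theta_J over (source, target)
    # phrases; the entries and probabilities are illustrative only.
    theta_J = {("la casa", "the house"): 0.5,
               ("es", "is"): 0.3,
               ("azul", "blue"): 0.2}

    def generate(theta_J, p_stop=0.1):
        # Step 1: draw the number of phrase pairs l ~ Geometric(p_stop),
        # then draw each of the l pairs independently from theta_J.
        l = 1
        while random.random() > p_stop:
            l += 1
        pairs = random.choices(list(theta_J), weights=list(theta_J.values()), k=l)
        # Step 2: choose an ordering for the source phrases; the pairs
        # themselves are aligned one-to-one by construction.
        source_order = random.sample(range(l), k=l)
        return pairs, source_order

    print(generate(theta_J))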

The choice of ℓ is parametrized using a geometric distribution P_G with stop parameter p_$:

P_G(\ell; p_\$) = p_\$ \cdot (1 - p_\$)^{\ell - 1}

Each phrase pair ⟨s_k, t_k⟩ is drawn from an unknown multinomial distribution θ_J.

A simple position-based distortion model is used, where each aligned phrase pair incurs a penalty that decays exponentially with how far the source phrase is displaced from the position of its aligned target phrase:

\delta(\langle s_k, t_k \rangle) = b^{d(s_k, t_k)}

where d(s_k, t_k) is the position-based displacement of the pair and b < 1 is a decay parameter.

Finally, the joint probability model for an aligned sentence pair consisting of ℓ phrase pairs is given by:

P(s, t, a) = P_G(\ell; p_\$) \prod_{k=1}^{\ell} \theta_J(\langle s_k, t_k \rangle) \, \delta(\langle s_k, t_k \rangle)

In the experiments, the parameters p_$ and b were set to 0.1 and 0.85, respectively.
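
Putting the pieces together, here is a hedged sketch of scoring a segmented and aligned sentence pair under this joint model; the displacement function used below (absolute difference of phrase positions) is a stand-in for the paper's exact position-based penalty:

    import math

    def log_joint(pairs, positions, theta_J, p_stop=0.1, b=0.85):
        """Log of P_G(l; p_$) * prod_k theta_J(pair_k) * delta(pair_k).

        pairs:     list of (source_phrase, target_phrase) tuples
        positions: list of (source_pos, target_pos) phrase positions,
                   used by the illustrative displacement penalty
        """
        l = len(pairs)
        # Geometric prior on the number of phrase pairs.
        logp = math.log(p_stop) + (l - 1) * math.log(1.0 - p_stop)
        for pair, (s_pos, t_pos) in zip(pairs, positions):
            logp += math.log(theta_J[pair])           # phrase pair probability
            logp += abs(s_pos - t_pos) * math.log(b)  # distortion penalty b^d
        return logp

    theta_J = {("la casa", "the house"): 0.5, ("es", "is"): 0.3}
    print(log_joint([("la casa", "the house"), ("es", "is")],
                    [(0, 0), (1, 1)], theta_J))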

Proposed Phrase Alignment Model

In the previous model, each source phrase had to be aligned with a target phrase and vice versa. However, the two sides of a sentence pair do not always contain exactly the same information, so it should be expected that some phrases remain unaligned. This work extends the previous alignment model by modeling unaligned phrases. This is done by introducing another multinomial distribution θ_N, from which unaligned phrases (phrase pairs with a null phrase on one side) are drawn. A free parameter p_∅ defines the probability of generating such a null-aligned phrase; it is set to a very small value to discourage unaligned phrases.

Thus, the new model is defined as:

P(s, t, a) = P_G(\ell; p_\$) \prod_{k=1}^{\ell} P_M(\langle s_k, t_k \rangle) \, \delta(\langle s_k, t_k \rangle)

where P_M is defined as:

P_M(\langle s, t \rangle) = \begin{cases} p_\varnothing \, \theta_N(\langle s, t \rangle) & \text{if } s \text{ or } t \text{ is the null phrase} \\ (1 - p_\varnothing) \, \theta_J(\langle s, t \rangle) & \text{otherwise} \end{cases}

with the distortion term δ applying only to pairs in which both phrases are non-null.
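
A minimal sketch of this mixture, using None for the null phrase; the distributions and the tiny default for p_null are illustrative assumptions, not the paper's values:

    # P_M mixes the joint distribution theta_J over aligned pairs with
    # theta_N over null-aligned phrases, gated by the null probability.
    def P_M(pair, theta_J, theta_N, p_null=1e-10):  # p_null: illustrative tiny value
        s, t = pair
        if s is None or t is None:                  # unaligned phrase
            return p_null * theta_N.get(pair, 0.0)
        return (1.0 - p_null) * theta_J.get(pair, 0.0)

    theta_J = {("casa", "house"): 0.8}
    theta_N = {("pues", None): 0.1}
    print(P_M(("casa", "house"), theta_J, theta_N))  # ~0.8
    print(P_M(("pues", None), theta_J, theta_N))     # 1e-11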

This model involves the observed sentence pairs (s, t), the latent phrase segmentations and alignments a, and the parameters θ (θ_J and θ_N). The model can be used to compute the expected counts for phrase pairs under P(a | s, t; θ) for a fixed θ, which could, in turn, be used in the E-Step of EM, similarly to Word Alignments. However, in the case of Phrase Alignments the space of possible segmentations and alignments a is too large to be enumerated with the dynamic programs used for Word Alignments. Thus, this work tackles the problem by defining a Gibbs Sampler.

Gibbs Sampler

The Gibbs Sampler presented in this work starts with an initial state a_0, which sets an initial configuration for the phrase segmentation and alignment. Then, by applying a set of local changes starting from a_0, a sequence of samples a_1, ..., a_N is produced. These samples are guaranteed to approximate the conditional distribution P(a | s, t; θ). This means that the average counts of phrase pairs in the samples converge to the expected counts under the model, and can therefore be used to produce estimates of translation features such as the translation probability \phi(t \mid s).
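
In outline, the sampler repeatedly applies the local operators and averages phrase pair counts over the sampled states. The schematic sketch below assumes hypothetical stand-ins (apply_operators for one sweep of the paper's operators, state.aligned_pairs() for reading off the current phrase pairs), and the burn-in period is common sampling practice rather than a detail from the paper:

    from collections import Counter

    def estimate_counts(initial_state, apply_operators, n_samples, burn_in=100):
        # Run the Gibbs sampler from the initial segmentation/alignment
        # a_0 and average phrase pair counts over the sampled states.
        state = initial_state
        totals = Counter()
        for i in range(burn_in + n_samples):
            state = apply_operators(state)   # one sweep of local moves
            if i >= burn_in:
                for pair in state.aligned_pairs():
                    totals[pair] += 1
        # Approximate expected count of each pair under P(a | s, t; theta).
        return {pair: c / n_samples for pair, c in totals.items()}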

This work defines a set of operators that can be applied to a state a_i to generate a new state a_{i+1}. Each operator freezes the state a_i except for a local region, and chooses a reconfiguration of that region, among all possible reconfigurations, according to the conditional probability of each reconfiguration given the frozen part of the state. This frozen part of the state is called the Markov Blanket.
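
Schematically, each such operator enumerates the possible local reconfigurations, scores each one against the frozen remainder of the state, and samples one in proportion to those scores; all names in this sketch are illustrative, not the paper's API:

    import random

    def resample_local_region(reconfigurations, score):
        """One Gibbs move over a local region of the state.

        reconfigurations: all ways to rebuild the unfrozen local region
        score(r):         unnormalized conditional probability of r given
                          the Markov blanket (the frozen rest of the state)
        """
        weights = [score(r) for r in reconfigurations]
        # Sampling proportionally to the weights is equivalent to sampling
        # from the normalized conditional distribution.
        return random.choices(reconfigurations, weights=weights, k=1)[0]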