# A Discriminative Latent Variable Model for SMT

## Citation

A Discriminative Latent Variable Model for Statistical Machine Translation, by P. Blunsom, T. Cohn, and M. Osborne. In Proceedings of ACL-08: HLT, 2008.

This paper is available online [1].

### Background

Current state-of-the-art approaches in statistical machine translation are often phrase-based (e.g., the Moses decoder) and/or syntactically motivated (e.g., hierarchical MT with the Joshua decoder). While these models have achieved strong results, they are still limited in a number of ways. In particular, the notion of "features" is very limited in existing MT work. Och (2003) presented MERT, a discriminative technique for tuning the weights of the language, phrase, reordering, and word-penalty models with respect to each other, but this technique only works well with a small number (around 10 or fewer) of non-overlapping features. In addition, the features in this model are still built in a generative manner, using frequency counts and maximum-likelihood estimation. Approaches with an arbitrary number of overlapping features have been rare in MT (Liang et al., ACL 2006, is one prior work that does this).

Discriminative models are especially difficult in machine translation because of the "multiple derivation" aspect of the problem: there are many ways to go from a source sentence to a target sentence (the term "derivation" arises because, in an SCFG, an output sentence is produced through a sequence of SCFG rule applications). Since there are no "reference derivations" to update against, one would ideally marginalize out all derivations when finding the best translation, but doing so exactly is NP-complete. Previous approaches side-step this problem by choosing a simple model with simple features, or by treating the best derivation as the best translation (i.e., not marginalizing over all possible derivations).

### Summary

In this work, the authors incorporate a large number of overlapping features into a hierarchical machine translation system. Concretely, they featurize each rule of a synchronous context-free grammar (SCFG), in a setup the authors call "discriminative synchronous transduction". They also model the derivation as a latent/hidden variable, which they marginalize out in both training and decoding.

### Main Approach

The authors focus on the translation model. They define a log-linear translation model giving the conditional probability distribution over target translations of a given source sentence, with derivations modeled as a latent variable. In particular, the conditional probability of a derivation is:

${\displaystyle p_{\Lambda }({\textbf {d,e}}|{\textbf {f}})={\frac {\exp \sum _{k}\lambda _{k}H_{k}({\textbf {d,e,f}})}{Z_{\Lambda }({\textbf {f}})}}}$ where d is the derivation, e is the target sentence, and f is the source sentence. ${\displaystyle k}$ indexes the model's features, and the ${\displaystyle H_{k}}$ are feature functions defined as sums over the rules ${\displaystyle r}$ in the derivation: ${\displaystyle H_{k}({\textbf {d,e,f}})=\sum _{r\in {\textbf {d}}}h_{k}({\textbf {f}},r)}$. Also, ${\displaystyle Z_{\Lambda }({\textbf {f}})}$ is the partition function that globally normalizes the conditional probabilities. The conditional probability of a target sentence given a source sentence is then obtained by marginalizing over derivations:

${\displaystyle p_{\Lambda }({\textbf {e}}|{\textbf {f}})=\sum _{{\textbf {d}}\in \Delta ({\textbf {e,f}})}p_{\Lambda }({\textbf {d,e}}|{\textbf {f}})}$ where ${\displaystyle \Delta ({\textbf {e,f}})}$ is the set of all derivations from source to target.
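As a concrete illustration, the following sketch enumerates a handful of derivations for one source sentence, scores each with the log-linear model, and marginalizes the latent derivation out to obtain p(e|f). The grammar rules, feature names, and weights here are toy assumptions, not the paper's actual feature set:

```python
import math
from collections import defaultdict

# Hypothetical toy example: two derivations yield the same target
# sentence, a third yields a different one. Each derivation carries
# rule-level feature counts already summed into H_k(d, e, f).
derivations = [
    ("la maison", "the house", {"rule:S->NP": 1, "lex:la->the": 1}),
    ("la maison", "the house", {"rule:S->X X": 1, "lex:maison->house": 1}),
    ("la maison", "house the", {"rule:S->X X": 1, "swap": 1}),
]
weights = {"rule:S->NP": 0.5, "rule:S->X X": 0.2,
           "lex:la->the": 1.0, "lex:maison->house": 1.0, "swap": -2.0}

def score(features):
    """Unnormalized log-linear score: exp(sum_k lambda_k * H_k(d, e, f))."""
    return math.exp(sum(weights.get(k, 0.0) * v for k, v in features.items()))

# Partition function Z(f): sum over every derivation of every target.
Z = sum(score(feats) for _, _, feats in derivations)

# p(e|f): marginalize the latent derivation d out of p(d, e|f).
p_target = defaultdict(float)
for _, target, feats in derivations:
    p_target[target] += score(feats) / Z

for e, p in sorted(p_target.items()):
    print(f"p({e!r} | 'la maison') = {p:.3f}")
```

Note how "the house" accumulates mass from two distinct derivations; picking only the single best derivation would ignore this reinforcement, which is exactly what marginalization captures.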

To train, the authors rely on MAP estimation, which adds a prior that can be thought of as a regularization term. They use a zero-mean Gaussian prior, ${\displaystyle p_{0}(\lambda _{k})\propto \exp(-\lambda _{k}^{2}/2\sigma ^{2})}$, and maximize the posterior ${\displaystyle \Lambda ^{*}=\arg \max _{\Lambda }p_{\Lambda }({\mathcal {D}})\,p_{0}(\Lambda )}$, where ${\displaystyle {\mathcal {D}}}$ is the training data (a parallel sentence corpus). The log of this objective is maximized using L-BFGS, which requires efficient computation of the objective value and its gradient; both are obtained by inside-outside inference over the SCFG parse chart of the input sentence f, with the full derivation chart produced by CYK parsing. Their formulation is similar to previous work on parsing with a log-linear model and latent variables. Note that in training, if the reference translation for a training instance is not contained in the model's hypothesis space, that unreachable training instance is discarded.
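On a toy instance small enough to enumerate every derivation, the MAP objective and its gradient (expected feature counts under the reference derivations minus those under all derivations, plus the prior term) can be sketched as follows. The feature vectors and numbers are illustrative assumptions, the expectations would normally come from inside-outside over the parse chart, and plain gradient ascent stands in here for L-BFGS:

```python
import math

# Toy instance with fully enumerable derivations. FEATS[i] holds the
# summed feature vector H_k(d, e, f) of derivation i; GOLD indexes the
# derivations whose target matches the reference translation e*.
FEATS = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
GOLD = {0, 1}
SIGMA2 = 1.0  # variance of the zero-mean Gaussian prior

def objective_and_grad(lam):
    """log p(e*|f) + log p0(Lambda), and its gradient w.r.t. Lambda."""
    scores = [math.exp(sum(l * h for l, h in zip(lam, H))) for H in FEATS]
    Z = sum(scores)                        # partition over all derivations
    Zgold = sum(scores[i] for i in GOLD)   # mass on reference derivations
    obj = math.log(Zgold / Z) - sum(l * l for l in lam) / (2 * SIGMA2)
    grad = []
    for k in range(len(lam)):
        exp_gold = sum(scores[i] * FEATS[i][k] for i in GOLD) / Zgold
        exp_all = sum(s * H[k] for s, H in zip(scores, FEATS)) / Z
        grad.append(exp_gold - exp_all - lam[k] / SIGMA2)
    return obj, grad

# Plain gradient ascent as a stand-in for L-BFGS in this sketch.
lam = [0.0, 0.0]
for _ in range(500):
    _, g = objective_and_grad(lam)
    lam = [l + 0.1 * gk for l, gk in zip(lam, g)]
```

The prior's contribution to the gradient, $-\lambda_k/\sigma^2$, is what pulls weights toward zero and produces the regularization effect examined in the experiments below.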

For decoding, the authors use beam search to approximate the sum over all derivations, in order to handle the exponential number of derivations for a given source-target pair. This is similar to previous work on decoding with an SCFG intersected with an n-gram language model.
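A minimal sketch of that idea, assuming a toy monotone translation lattice in place of a real SCFG chart intersected with a language model: partial hypotheses that yield the same target string are merged by summing their scores (the approximate marginalization over derivations), and the beam keeps only the top few merged hypotheses:

```python
import heapq
from collections import defaultdict

# Hypothetical lattice: each source position offers several
# (target word, weight) options, standing in for competing SCFG rules.
# Two entries producing "house" mimic two derivations of one string.
LATTICE = [
    [("the", 0.6), ("a", 0.4)],
    [("house", 0.7), ("home", 0.7), ("house", 0.2)],
]
BEAM = 3

hyps = {(): 1.0}  # target prefix -> summed score of its derivations
for options in LATTICE:
    merged = defaultdict(float)
    for prefix, p in hyps.items():
        for word, w in options:
            merged[prefix + (word,)] += p * w   # sum, don't max
    # Prune to the top-BEAM prefixes by accumulated (summed) score.
    hyps = dict(heapq.nlargest(BEAM, merged.items(), key=lambda kv: kv[1]))

best_e, best_p = max(hyps.items(), key=lambda kv: kv[1])
print(" ".join(best_e), best_p)
```

Replacing the `+=` with `max` would recover the simpler max-derivation decoder that the paper compares against.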

### Baseline & Results

The authors evaluated their model on four fronts: 1) maximizing translations (marginalizing derivations) vs. maximizing derivations in training and decoding; 2) regularized vs. unregularized maximum-likelihood training; 3) comparison with frequency-count-based systems; and 4) translation performance as the number of training examples scales up. All experiments were done on Europarl V2 (French-English); the training corpus consisted of 170k sentence pairs, the tuning set of 315, and the test set of 1,164.

First off, there is a huge amount of derivational ambiguity in the data: a figure in the paper shows the number of derivations growing exponentially in the source sentence length (its y-axis is log-scale).

Next, the following table shows that training on all derivations rather than the single best derivation ("All Derivations" vs. "Single Derivation"), and decoding for the best translation rather than the best derivation ("translation" vs. "derivation"), gives the best results. The authors also report how beam width affects translation quality, showing that even a relatively tight beam yields decent results. The table also shows the effect of regularization (or rather the lack thereof, in the last line): the unregularized model lags well behind the regularized one.

The authors also test on their held-out test set. In the table below, the first and third approaches are the ones proposed in this work. The second is a full Hiero system, but without reverse translation probabilities and reverse lexical probabilities; this is a fairer comparison to the proposed method, since the two systems share the same parameter space and differ only in how it is estimated. The additional Hiero scores are obtained with MERT training on the full set of Hiero features (the last line is with a language model, the second-to-last without). The authors point out that their work cannot really be compared with these methods, since that would require incorporating the reverse features and a language model into their approach.

Lastly, the authors show the scalability of their model in terms of accuracy, i.e., the learning curve as training data increases.

### Related Work

• Percy Liang's end-to-end approach to discriminative MT (Liang et al., 2006) was one of the first works to use a large number of overlapping features in MT. It used a phrase-based system (Pharaoh, the predecessor to Moses).
• The most popular way to tune a small set of non-overlapping feature weights is Minimum Error Rate Training (the MERT paper, Och 2003).
• The hierarchical MT model was first proposed by David Chiang; that paper won the ACL best paper award in 2005.