A Discriminative Latent Variable Model for SMT
Citation
A Discriminative Latent Variable Model for Statistical Machine Translation, by P. Blunsom, T. Cohn, M. Osborne. In Proceedings of ACL-08: HLT, 2008.
This paper is available online [1].
Background
Current state-of-the-art approaches in statistical machine translation are often phrase-based (e.g., the Moses decoder) and/or syntactically motivated (e.g., hierarchical MT with the Joshua decoder). While these models have achieved a great deal, they are still limited in a number of ways. In particular, the notion of "features" is very limited in existing MT work. Och (2003) presented MERT, a discriminative technique for tuning the weights of the language, phrase, reordering, and word-penalty models with respect to each other, but this technique only works well with a limited number (roughly 10 or fewer) of non-overlapping features. In addition, the features in such a model are still built generatively, from frequency counts and maximum-likelihood estimation. Approaches with an arbitrary number of overlapping features have been limited in MT (Liang et al., ACL 2006 is one prior work that does this).
Discriminative models are especially difficult in machine translation because of the "multiple derivation" aspect of the problem: there are many ways to get from a source sentence to a target sentence (the term "derivation" arises because, in an SCFG, producing an output sentence involves a sequence of SCFG rule applications). Since we have no "reference derivations" to update against, one would ideally marginalize out all derivations when computing the best translation, but doing so exactly is NP-complete. Previous approaches side-step this problem either by choosing a simple model with simple features, or by treating the best derivation as the best translation (i.e., not marginalizing over all possible derivations).
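The gap between the two decoding criteria can be seen in a toy sketch (the derivations and probabilities below are invented for illustration, not taken from the paper):

```python
# Toy example: when a translation has multiple derivations, the single best
# derivation need not belong to the most probable translation.
from collections import defaultdict

# (derivation id, translation it yields, model probability) -- invented numbers
derivations = [
    ("d1", "e1", 0.40),  # e1 has one high-probability derivation
    ("d2", "e2", 0.35),  # e2's probability mass is split across derivations
    ("d3", "e2", 0.25),
]

# Max-derivation decoding: output the translation of the single best derivation.
best_derivation = max(derivations, key=lambda x: x[2])
max_deriv_translation = best_derivation[1]  # -> "e1"

# Max-translation decoding: marginalize (sum) over derivations first.
translation_prob = defaultdict(float)
for _, e, p in derivations:
    translation_prob[e] += p
max_trans_translation = max(translation_prob, key=translation_prob.get)  # -> "e2"

print(max_deriv_translation, max_trans_translation)
```

Here max-derivation decoding picks e1, but summing over derivations shows e2 is actually the more probable translation (0.6 vs. 0.4).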
Summary
In this work, the authors manage to incorporate a large number of overlapping features into a hierarchical machine translation system: they featurize each rule in a synchronous context-free grammar (SCFG), an approach they call "discriminative synchronous transduction". The authors also model the derivation as a latent/hidden variable, which they marginalize out in both training and decoding.
Main Approach
The authors focus on the translation model. They define a log-linear translation model, which gives the conditional probability distribution over target translations of a given source sentence; derivations are modeled as a latent variable. In particular, the conditional probability of a derivation can be expressed as:

$p_\Lambda(d, e \mid f) = \frac{1}{Z_\Lambda(f)} \exp \sum_k \lambda_k H_k(d, e, f)$, with $H_k(d, e, f) = \sum_{r \in d} h_k(f, r)$

where d is the derivation, e is the target sentence, and f is the source sentence. $\lambda_k$ is indexed over the model's features, and $h_k$ are the feature functions defined over rules $r$. $Z_\Lambda(f)$ is the partition function that globally normalizes the conditional probabilities. The conditional probability of a target sentence given a source is then obtained by marginalizing over derivations:

$p_\Lambda(e \mid f) = \sum_{d \in \Delta(e, f)} p_\Lambda(d, e \mid f)$

where $\Delta(e, f)$ is the set of all derivations yielding target e from source f.
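A tiny worked instance of this model, with invented derivations, feature vectors $H_k$, and weights $\lambda_k$, might look like:

```python
import math
from collections import defaultdict

# Hypothetical hypothesis space: derivation -> (translation it yields, H(d, e, f)).
# All feature vectors and weights are invented for illustration.
derivations = {
    "d1": ("e1", [1.0, 0.0]),
    "d2": ("e2", [0.5, 1.0]),
    "d3": ("e2", [0.0, 2.0]),
}
lam = [0.7, 0.3]  # model weights lambda_k

def score(h):
    # lambda . H(d, e, f)
    return sum(l * x for l, x in zip(lam, h))

# Partition function Z(f): sum of exponentiated scores over ALL derivations of f.
Z = sum(math.exp(score(h)) for _, h in derivations.values())

# p(d, e | f) for each derivation; p(e | f) marginalizes over derivations.
p_e = defaultdict(float)
for e, h in derivations.values():
    p_e[e] += math.exp(score(h)) / Z
```

Because $Z_\Lambda(f)$ sums over every derivation, the translation probabilities `p_e` form a proper distribution over target sentences.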
To train, the authors use MAP estimation: $\Lambda^* = \arg\max_\Lambda p_\Lambda(\mathcal{D})\, p_0(\Lambda)$, where $\mathcal{D}$ is the training data (a parallel sentence corpus). The prior $p_0$ can be thought of as a regularization term; here it is a zero-mean Gaussian, $p_0(\lambda_k) \propto \exp(-\lambda_k^2 / 2\sigma^2)$. The log of the posterior is maximized using L-BFGS, which requires efficient computation of the objective value and its gradient; this is achieved by inside-outside inference over the SCFG parse chart of the input sentence f. The formulation is similar to previous work on parsing with log-linear models and latent variables. Note that if the reference translation of a training instance is not contained in the model's hypothesis space, that unreachable instance is discarded.
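For intuition, the gradient L-BFGS needs is a difference of two feature expectations (one conditioned on derivations yielding the reference translation, one over all derivations) minus the prior term. A toy sketch over an enumerated hypothesis space, with all feature values invented and no parse chart involved:

```python
import math

# One training pair (f, e*): derivations as (yielded translation, H(d, e, f)).
# Invented feature vectors; a real system computes these over an SCFG chart.
derivations = [
    ("e1", [1.0, 0.0]),
    ("e2", [0.5, 1.0]),
    ("e2", [0.0, 2.0]),
]
reference = "e2"   # the reference translation e*
sigma2 = 1.0       # variance of the zero-mean Gaussian prior

def objective_and_grad(lam):
    """MAP objective log p(e*|f) - |lam|^2 / (2 sigma^2) and its gradient."""
    def sc(h):
        return sum(l * x for l, x in zip(lam, h))
    exp_all = [math.exp(sc(h)) for _, h in derivations]
    Z = sum(exp_all)
    ref = [(w, h) for (e, h), w in zip(derivations, exp_all) if e == reference]
    Z_ref = sum(w for w, _ in ref)
    obj = math.log(Z_ref) - math.log(Z) - sum(l * l for l in lam) / (2 * sigma2)
    grad = []
    for k in range(len(lam)):
        e_ref = sum(w * h[k] for w, h in ref) / Z_ref            # E[H_k | e*]
        e_all = sum(w * hv[k] for w, (_, hv) in zip(exp_all, derivations)) / Z
        grad.append(e_ref - e_all - lam[k] / sigma2)             # minus prior term
    return obj, grad

obj, grad = objective_and_grad([0.7, 0.3])
```

Both expectations here come from brute-force enumeration; inside-outside inference computes the same quantities efficiently over the packed chart.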
For decoding, the authors use beam search to approximate the sum over all derivations, since the number of derivations for a given source-target pair is exponential. This is similar to previous work on decoding with an SCFG intersected with an n-gram language model.
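A minimal sketch of the idea, using a hypothetical word-level hypothesis space rather than a real SCFG chart: prune partial derivations to a beam, then approximate p(e | f) by summing the scores of surviving derivations that share a yield.

```python
import math
from collections import defaultdict

# Invented per-step translation options: (word or empty string, log-score delta).
# The empty string lets two different derivations yield the same translation.
options = [
    [("the", 0.0), ("", -0.05)],
    [("the", -0.1), ("", 0.0)],
    [("cat", 0.0)],
]

def decode(beam=4):
    hyps = [([], 0.0)]  # (partial derivation as word list, accumulated log score)
    for opts in options:
        expanded = [(words + [w], s + ds) for words, s in hyps for w, ds in opts]
        hyps = sorted(expanded, key=lambda x: -x[1])[:beam]  # prune to beam width
    # Approximate p(e | f): sum exp-scores of surviving derivations per yield.
    totals = defaultdict(float)
    for words, s in hyps:
        totals[" ".join(w for w in words if w)] += math.exp(s)
    return max(totals, key=totals.get)

print(decode())  # with these toy scores, "the cat" wins
```

With the default beam, "the cat" collects mass from two surviving derivations (one using each empty-string option), illustrating how the beam approximates the marginal over derivations.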
Baseline & Results
The authors evaluated their model on four fronts: 1) maximizing translations (marginalizing over derivations) vs. maximizing derivations in training and decoding; 2) regularization vs. an unregularized maximum-likelihood model; 3) comparison with frequency-count-based systems; and 4) translation performance as the number of training examples scales up. All experiments were done on Europarl V2 (French-English); the training corpus consisted of 170k sentence pairs, the tuning set of 315, and the test set of 1164.