A Discriminative Latent Variable Model for SMT
Citation
A Discriminative Latent Variable Model for Statistical Machine Translation, by P. Blunsom, T. Cohn, and M. Osborne. In Proceedings of ACL-08: HLT, 2008.
This paper is available online [1].
Background
Current state-of-the-art approaches in statistical machine translation are often phrase-based (e.g., the Moses decoder) and/or syntactically motivated (e.g., hierarchical MT with the Joshua decoder). While these models have been successful, they are still limited in a number of ways. In particular, the notion of "features" is very restricted in existing MT work. Och (2003) presented MERT, a discriminative technique for tuning the weights of the language, phrase, reordering and word penalty models with respect to each other, but this technique only works well with a small number (10 or fewer) of non-overlapping features. In addition, the component models themselves are still estimated generatively, using frequency counts and maximum likelihood. Approaches that use an arbitrary number of overlapping features have been rare in MT (Liang et al., ACL 2006 is one prior work that does this).
Discriminative models in machine translation are especially difficult because of the "multiple derivation" aspect of the problem: there are many ways to get from a source sentence to a target sentence. (The term "derivation" comes from the fact that, in an SCFG, an output sentence is produced through a sequence of SCFG rule applications.) Since there are no "reference derivations" to update against, one would ideally marginalize out all derivations when searching for the best translation, but doing so exactly is NP-complete. Previous approaches side-step this problem by choosing a simple model with simple features, or by treating the best derivation as the best translation (i.e., they do not marginalize over all possible derivations).
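The difference between picking the best derivation and picking the best translation can be illustrated with a small sketch. The derivation names, target strings, and probabilities below are invented for illustration and do not come from the paper; the point is only that several derivations can yield the same target string, so summing over them can change the winner.

```python
from collections import defaultdict

# Hypothetical toy distribution p(d, e | f) for a single source sentence f.
# Each entry is (derivation id, target string e, probability); all values are made up.
derivations = [
    ("d1", "the house is small", 0.35),
    ("d2", "the home is small", 0.25),
    ("d3", "the home is small", 0.22),
    ("d4", "the house is little", 0.18),
]

def max_derivation(derivs):
    """Return the target string of the single most probable derivation."""
    return max(derivs, key=lambda item: item[2])[1]

def max_translation(derivs):
    """Sum probabilities of derivations that share a target string, then pick the best string."""
    totals = defaultdict(float)
    for _, target, prob in derivs:
        totals[target] += prob
    best = max(totals, key=totals.get)
    return best, dict(totals)

if __name__ == "__main__":
    print("max-derivation :", max_derivation(derivations))
    print("max-translation:", max_translation(derivations)[0])
```

In this toy case the single best derivation yields "the house is small", while marginalizing over derivations favors "the home is small". Enumerating derivations explicitly is only feasible for a toy example; with a real grammar the number of derivations grows exponentially with sentence length.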
Summary
In this work, the authors incorporate a large number of overlapping features into a hierarchical machine translation system. Concretely, they featurize each rule in a synchronous context-free grammar (SCFG), an approach they call "Discriminative Synchronous Transduction". The authors also model the derivation as a latent (hidden) variable, which they marginalize out in both training and decoding.
Main Approach
The authors focus on the translation model. They propose a log-linear translation model that defines a conditional probability distribution over target translations of a given source sentence, with the derivation modeled as a latent variable. In particular, the conditional probability of a derivation is:
$$p_\Lambda(d, e \mid f) = \frac{\exp\left(\sum_k \lambda_k H_k(d, e, f)\right)}{Z_\Lambda(f)}$$
where $d$ is the derivation, $e$ is the target sentence, and $f$ is the source sentence. $k$ is indexed over the model's features, $\lambda_k$ are the corresponding feature weights, $Z_\Lambda(f)$ is the normalizing constant, and $H_k$ are the feature functions defined over rules: $H_k(d, e, f) = \sum_{r \in d} h_k(f, r)$.
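A minimal sketch of such a log-linear model over derivations is given below. It assumes the usual log-linear form: a derivation's score is the weighted sum of its rule-level feature values, and probabilities are obtained by exponentiating and normalizing over all derivations of the source sentence. The feature names, weights, and derivations are hypothetical, and the normalizer is computed by explicit enumeration, which only works for a toy example.

```python
import math
from collections import defaultdict

# Hypothetical feature weights (one weight per feature index k).
weights = {"rule_count": -0.5, "lex_match": 1.2, "reorder": -0.3}

# Hypothetical derivations of one source sentence. Each derivation is a list of
# rules, and each rule is a dict of local feature values for that rule.
derivations = {
    "d1": {"target": "the house is small",
           "rules": [{"rule_count": 1, "lex_match": 1},
                     {"rule_count": 1, "lex_match": 1}]},
    "d2": {"target": "the home is small",
           "rules": [{"rule_count": 1, "lex_match": 1, "reorder": 1}]},
    "d3": {"target": "the home is small",
           "rules": [{"rule_count": 1},
                     {"rule_count": 1, "lex_match": 1}]},
}

def score(rules, weights):
    """Weighted sum of feature values, accumulated over the rules of a derivation."""
    return sum(weights.get(name, 0.0) * value
               for rule in rules for name, value in rule.items())

def derivation_probs(derivs, weights):
    """Normalize exponentiated scores over all enumerated derivations."""
    scores = {d: score(info["rules"], weights) for d, info in derivs.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {d: math.exp(s - log_z) for d, s in scores.items()}

def translation_probs(derivs, weights):
    """Marginalize out the latent derivation: sum over derivations with the same target."""
    totals = defaultdict(float)
    for d, p in derivation_probs(derivs, weights).items():
        totals[derivs[d]["target"]] += p
    return dict(totals)

if __name__ == "__main__":
    for target, p in translation_probs(derivations, weights).items():
        print(f"{p:.3f}  {target}")
```

Because the features decompose over rules, the same sums could in principle be computed with dynamic programming over a parse chart rather than by enumeration; the sketch above deliberately leaves that out.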