Watanabe et al., EMNLP 2007. Online Large-Margin Training for Statistical Machine Translation
Contents
Citation
Taro Watanabe, Jun Suzuki, Hajime Tsukada, Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In Proceedings of EMNLP-CoNLL. pp 764–773
Online Version
Online large-margin training for statistical machine translation
Summary
This paper introduces an online discriminative large-margin training approach to statistical machine translation. The authors achieved then state-of-the-art performance on an Arabic-to-English translation task by tuning a combination of millions of features in an MT system. In doing so, they also addressed the problem of scaling machine translation systems to feature sets on the order of millions.
Method
The paper presents a method to estimate a large number of parameters, on the order of millions, using an online training algorithm for machine translation. The algorithm used in this work is the Margin Infused Relaxed Algorithm (MIRA), which has been successfully employed for many structured natural language processing tasks such as dependency parsing and joint labeling/chunking. The method is applied to an enhanced hierarchical phrase-based machine translation model.
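The core of MIRA is a margin-based update: after decoding, the weights are moved just enough that an oracle (high-BLEU) translation outscores the current best hypothesis by at least the loss between them. The following is a minimal sketch of a single update, not the authors' implementation; the sparse-dict feature representation, the `loss` value (assumed to be a BLEU-based loss), and the cap `C` on the step size are assumptions for illustration.

```python
def dot(w, feats):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

def mira_update(w, oracle_feats, hyp_feats, loss, C=0.01):
    """One MIRA step: make the oracle outscore the hypothesis by `loss`.

    `loss` is the (e.g. BLEU-based) loss of the hypothesis relative to
    the oracle; `C` caps the step size (slack), as in MIRA's
    passive-aggressive formulation.
    """
    # Feature difference between the oracle and the current hypothesis.
    delta = {k: oracle_feats.get(k, 0.0) - hyp_feats.get(k, 0.0)
             for k in set(oracle_feats) | set(hyp_feats)}
    margin = dot(w, delta)          # current score advantage of the oracle
    norm = sum(v * v for v in delta.values())
    if norm == 0.0:
        return w                    # identical features: nothing to update
    # Smallest step achieving margin >= loss, capped by C.
    tau = min(C, max(0.0, (loss - margin) / norm))
    for k, v in delta.items():
        w[k] = w.get(k, 0.0) + tau * v
    return w
```

In the paper the update is computed over k-best oracle and hypothesis lists rather than a single pair, but the single-pair step above captures the direction and step-size logic.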
Hierarchical Phrase-based SMT
Chiang (2005) introduced the hierarchical phrase-based translation approach, in which non-terminals are embedded in each phrase. A translation is generated by hierarchically combining phrases using the non-terminals. Such a quasi-syntactic structure can naturally capture the reordering of phrases that is not directly modeled by a conventional phrase-based approach.
Each production rule in the hierarchical phrase-based translation model is given by:

X → ⟨γ, α, ∼⟩

where X is a non-terminal, γ is a source-side string of arbitrary terminals and/or non-terminals, α is the corresponding target-side string of terminals and/or non-terminals, and ∼ defines a one-to-one mapping between the non-terminals in γ and α.
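A toy illustration (not from the paper) of how such a rule captures reordering: the indexed non-terminals on the source side map one-to-one to slots on the target side, so translating the sub-phrases and substituting them into the target template reorders them. The rule and vocabulary below are hypothetical.

```python
import re

def apply_rule(target_template, fillers):
    """Substitute translated sub-phrases into target non-terminal slots.

    `target_template` is the target side of a rule, e.g. "X2 of X1";
    `fillers` holds the translations of the source non-terminals,
    fillers[0] for X1, fillers[1] for X2, etc.
    """
    return re.sub(r"X(\d)",
                  lambda m: fillers[int(m.group(1)) - 1],
                  target_template)

# Hypothetical rule X -> <"X1 de X2", "X2 of X1">: the two sub-phrases
# swap order between source and target.
print(apply_rule("X2 of X1", ["the committee", "the chairman"]))
```

Because the non-terminal mapping is part of the rule, this reordering falls out of ordinary rule application, with no separate distortion model.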
Features
The authors build an enhanced translation model on top of the baseline hierarchical phrase-based model. They introduce a very large number of binary features based on word alignments, dependency structures, and context.
Baseline
Please refer to Chiang (2005) for the baseline features.
Sparse Features
Sparse features are binary indicators of the form:

h_k(f, e) = 1 if a particular pattern is observed in the sentence pair (f, e); 0 otherwise
These features are categorized as:
- Word pair features using word alignments within a standard phrase pair.
- Insertion features to account for spurious words on the target side.
- Target bigram features over consecutive target words.
- Hierarchical features to capture dependencies between parent and child words on the source and target sides.
The authors also apply various kinds of normalization so that the feature set generalizes better.