Watanabe et al., EMNLP 2007. Online Large-Margin Training for Statistical Machine Translation
Citation
Taro Watanabe, Jun Suzuki, Hajime Tsukada, Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In Proceedings of EMNLP-CoNLL, pp. 764–773.
Online Version
Online large-margin training for statistical machine translation
Summary
This paper introduces a discriminative online large-margin training approach for statistical machine translation. By tuning a combination of millions of features in an MT system, the authors achieved the then state-of-the-art performance on an Arabic-English translation task. The approach also addresses the problem of scaling machine translation systems to feature sets on the order of millions.
Method
The paper presents a method for estimating a large number of parameters, on the order of millions, using an online training algorithm for machine translation. The algorithm used in this work is the Margin Infused Relaxed Algorithm (MIRA) (Crammer et al., 2006), which has been successfully employed for many structured natural language processing tasks such as dependency parsing and joint labeling/chunking. The method is applied to an enhanced hierarchical phrase-based machine translation model.
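To make the online large-margin idea concrete, the following is a minimal sketch of a single-constraint, MIRA-style update over sparse feature vectors. It is a simplification for illustration, not the paper's exact optimisation: the function names (`dot`, `mira_update`), the dictionary representation, and the use of one oracle/hypothesis pair per update are all assumptions made here.

```python
from typing import Dict

Features = Dict[str, float]  # sparse feature vector: feature name -> value


def dot(w: Features, h: Features) -> float:
    """Sparse dot product between weights and a feature vector."""
    return sum(w.get(k, 0.0) * v for k, v in h.items())


def mira_update(w: Features, h_oracle: Features, h_hyp: Features,
                loss: float, C: float = 1.0) -> Features:
    """One hypothetical single-constraint MIRA-style step.

    Moves w as little as possible so that the oracle outscores the
    hypothesis by a margin of at least `loss`, with step size capped by C.
    """
    # Feature difference between the oracle and the model's hypothesis.
    diff = dict(h_oracle)
    for k, v in h_hyp.items():
        diff[k] = diff.get(k, 0.0) - v
    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0.0:
        return dict(w)  # identical feature vectors: nothing to learn
    margin = dot(w, diff)
    # Closed-form step size, clipped to the interval [0, C].
    tau = max(0.0, min(C, (loss - margin) / norm_sq))
    new_w = dict(w)
    for k, v in diff.items():
        new_w[k] = new_w.get(k, 0.0) + tau * v
    return new_w
```

Starting from empty weights, `mira_update({}, {"a": 1.0}, {"b": 1.0}, loss=1.0)` raises the weight of the oracle's feature and lowers that of the hypothesis's feature just enough for the margin to equal the loss. Because only features that fire in the two candidates are touched, each update stays cheap even when the full feature set is very large.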
Hierarchical Phrase-based SMT
Chiang 2005 introduced the hierarchical phrase-based translation approach, in which non-terminals are embedded in each phrase. A translation is generated by hierarchically combining phrases using the non-terminals. Such a quasi-syntactic structure can naturally capture the reordering of phrases that is not directly modeled by a conventional phrase-based approach.
Each production rule in the hierarchical phrase-based translation model is given by:
$$X \rightarrow \langle \gamma, \bar{b}\beta, \sim \rangle$$
where $X$ is a non-terminal and $\gamma$ is a source-side string of arbitrary terminals and/or non-terminals. $\bar{b}\beta$ is the corresponding target side, where $\bar{b}$ is a string of terminals, or a phrase, and $\beta$ is a (possibly empty) string of non-terminals. $\sim$ defines a one-to-one mapping between the non-terminals in $\gamma$ and $\beta$.
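As an illustration of this rule shape, the sketch below represents a production as a source-side string mixing terminals and non-terminals, plus a target side made of a terminal phrase followed by non-terminals. The class name `Rule`, the helper `realize_target`, and the convention of writing non-terminals as integer indices are assumptions made for this example only; the placeholder words `f1`, `e1`, and so on are not from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Symbol = Union[str, int]  # str = terminal word, int = non-terminal index


@dataclass(frozen=True)
class Rule:
    source: Tuple[Symbol, ...]      # terminals and/or non-terminals
    target_phrase: Tuple[str, ...]  # target terminals (the phrase)
    target_nts: Tuple[int, ...]     # target non-terminals, possibly empty

    def is_well_formed(self) -> bool:
        """Check the one-to-one mapping between source and target non-terminals."""
        src_nts = [s for s in self.source if isinstance(s, int)]
        return (len(set(src_nts)) == len(src_nts)
                and sorted(src_nts) == sorted(self.target_nts)
                and all(isinstance(t, str) for t in self.target_phrase))


def realize_target(rule: Rule, children: Dict[int, List[str]]) -> List[str]:
    """Build the target string: the phrase first, then each child translation."""
    out = list(rule.target_phrase)
    for idx in rule.target_nts:
        out.extend(children[idx])
    return out


# Source "f1 X1 f2 X2" paired with target "e1 e2 X2 X1" (children reordered).
rule = Rule(source=("f1", 1, "f2", 2), target_phrase=("e1", "e2"), target_nts=(2, 1))
print(rule.is_well_formed())
print(realize_target(rule, {1: ["a"], 2: ["b", "c"]}))
```

The order of `target_nts` is what lets a rule reorder its children relative to the source side, which is the reordering behaviour the hierarchical model is meant to capture.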
Features
The authors build an enhanced translation model on top of the baseline hierarchical phrase-based model. They introduced a very large number of binary features based on word alignments, dependency structures and context.
Baseline
Please refer to Chiang (2005) for the baseline features.
Sparse Features
Sparse features are binary indicator features, categorized as:
- Word pair features using word alignments within a standard phrase pair.
- Insertion features to account for spurious words on the target side.
- Target bigram features of words.
- Hierarchical features to capture dependencies between parent and child words on the source and target sides.
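The first three feature categories above can be sketched as a small extraction function over one aligned phrase pair. This is an illustrative approximation under stated assumptions: the function name `sparse_features`, the feature-name prefixes, and the treatment of every unaligned target word as an insertion are choices made here, not the paper's definitions. Hierarchical features are omitted because they require the derivation structure.

```python
from collections import Counter
from typing import List, Tuple


def sparse_features(src: List[str], tgt: List[str],
                    alignment: List[Tuple[int, int]]) -> Counter:
    """Count how often each indicator feature fires on one phrase pair.

    `alignment` holds (source index, target index) links.
    """
    feats: Counter = Counter()
    # Word pair features: one per aligned source/target word pair.
    for i, j in alignment:
        feats[f"pair:{src[i]}|{tgt[j]}"] += 1
    # Insertion features: target words with no alignment link (assumed form).
    aligned_tgt = {j for _, j in alignment}
    for j, word in enumerate(tgt):
        if j not in aligned_tgt:
            feats[f"ins:{word}"] += 1
    # Target bigram features: adjacent target word pairs.
    for left, right in zip(tgt, tgt[1:]):
        feats[f"bigram:{left}|{right}"] += 1
    return feats


feats = sparse_features(["f1", "f2"], ["e1", "the", "e2"], [(0, 0), (1, 2)])
for name in sorted(feats):
    print(name, feats[name])
```

Since each distinct word pair or bigram gets its own feature name, the feature space grows with the vocabulary, which is how such templates reach millions of features on a large corpus.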
The authors also perform various kinds of normalization to make the feature set more generalized.
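One way to picture such normalization is to map each surface token to several coarser forms, so that a feature can fire on the coarse form when the exact word is rare. The schemes in this sketch (fixed-length prefix, fixed-length suffix, digit masking) and the name `normalize_token` are hypothetical choices for illustration; this summary does not list the exact schemes used in the paper.

```python
import re
from typing import Dict


def normalize_token(token: str, n: int = 4) -> Dict[str, str]:
    """Return a few coarser views of a token (hypothetical schemes)."""
    return {
        "surface": token,                       # the word as written
        "prefix": token[:n],                    # first n characters
        "suffix": token[-n:],                   # last n characters
        "digits": re.sub(r"\d", "@", token),    # every digit masked as '@'
    }


print(normalize_token("translation"))
print(normalize_token("2007"))
```

A word pair feature defined over the `prefix` view, for instance, would be shared by all words beginning with the same four characters, trading precision for coverage.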
Experiments and Results
Datasets
The online large-margin training procedure was applied to an Arabic-to-English translation task. The training data was extracted from the Arabic-English news/UN bilingual corpora supplied by LDC and amounts to nearly 3.8M sentences. Parameter tuning was carried out on the NIST MT03 evaluation set, containing 663 sentences, and final evaluation was done on the news-domain NIST test sets MT04 and MT05, consisting of 707 and 1056 sentences, respectively.
Evaluation Metric
Results were compared against the baseline using the BLEU and NIST metrics for MT evaluation.
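For readers unfamiliar with BLEU, the following is a minimal sketch of corpus-level BLEU: clipped n-gram precisions up to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. It assumes a single reference per sentence and no smoothing, which is a simplification; official scoring scripts handle multiple references and tokenization details. The function names `ngrams` and `corpus_bleu` are chosen here for illustration.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hyps: List[List[str]], refs: List[List[str]], max_n: int = 4) -> float:
    """Unsmoothed corpus BLEU with one reference per hypothesis (simplified)."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            total[n - 1] += sum(h.values())
    if min(matched) == 0:
        return 0.0  # some n-gram order has no match: unsmoothed BLEU is zero
    log_prec = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)
```

Counts are accumulated over the whole test set before the precisions are computed, so the score is a corpus-level quantity rather than an average of per-sentence scores.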
Results
Table 1 shows the results of incrementally adding the structural features discussed above. Target bigram features account for the fluency of the target side without considering the source/target correspondence, and including them leads to overfitting on the development data. This problem is addressed by adding insertion features, which can take into account an agreement with the source side that is not directly captured by word pair features. Hierarchical features also help somewhat in boosting MT05 BLEU scores by considering the dependency structure of the source side.
- Table 1: Experimental results obtained by incrementally adding structural features.
Table 2 summarizes the results obtained by varying the normalized tokens used alongside the surface form.
- Table 2: Experimental results obtained by varying normalized tokens used with surface form.