Watanabe et al., EMNLP 2007. Online Large-Margin Training for Statistical Machine Translation
Contents
Citation
Taro Watanabe, Jun Suzuki, Hajime Tsukada, Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In Proceedings of EMNLP-CoNLL. pp 764–773
Online Version
Online large-margin training for statistical machine translation
Summary
This paper introduces an online discriminative large-margin training approach to statistical machine translation. The authors achieved then state-of-the-art performance on an Arabic-to-English translation task by tuning a combination of millions of features in an MT system. In doing so, they also addressed the problem of scaling machine translation systems to feature sets on the order of millions.
Method
The paper presents a method to estimate a large number of parameters, on the order of millions, using an online training algorithm for machine translation. The algorithm used in this work is the Margin Infused Relaxed Algorithm (MIRA), which has been successfully employed for many structured natural language processing tasks such as dependency parsing and joint labeling/chunking. The method is applied to an enhanced hierarchical phrase-based machine translation model.
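The core of MIRA is a margin-based update: after decoding, the weights are moved just enough that an oracle (high-BLEU) translation outscores the current best hypothesis by at least the loss between them. The following is a minimal sketch of a single update, not the authors' implementation; the sparse-dict feature representation, the `loss` value (assumed to be a BLEU-based loss), and the cap `C` on the step size are assumptions for illustration.

```python
def dot(w, feats):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

def mira_update(w, oracle_feats, hyp_feats, loss, C=0.01):
    """One MIRA step: make the oracle outscore the hypothesis by `loss`.

    `loss` is the (e.g. BLEU-based) loss of the hypothesis relative to
    the oracle; `C` caps the step size (slack), as in MIRA's
    passive-aggressive formulation.
    """
    # Feature difference between the oracle and the current hypothesis.
    delta = {k: oracle_feats.get(k, 0.0) - hyp_feats.get(k, 0.0)
             for k in set(oracle_feats) | set(hyp_feats)}
    margin = dot(w, delta)          # current score advantage of the oracle
    norm = sum(v * v for v in delta.values())
    if norm == 0.0:
        return w                    # identical features: nothing to update
    # Smallest step achieving margin >= loss, capped by C.
    tau = min(C, max(0.0, (loss - margin) / norm))
    for k, v in delta.items():
        w[k] = w.get(k, 0.0) + tau * v
    return w
```

In the paper the update is computed over k-best oracle and hypothesis lists rather than a single pair, but the single-pair step above captures the direction and step-size logic.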
Hierarchical Phrase-based SMT
Chiang (2005) introduced the hierarchical phrase-based translation approach, in which non-terminals are embedded in each phrase. A translation is generated by hierarchically combining phrases using the non-terminals. Such a quasi-syntactic structure can naturally capture the reordering of phrases that is not directly modeled by a conventional phrase-based approach.
Each production rule in the hierarchical phrase-based translation model is given by:

X → ⟨γ, α, ∼⟩

where X is a non-terminal, γ is a source-side string of arbitrary terminals and/or non-terminals, α is the corresponding target-side string of terminals and/or non-terminals, and ∼ defines a one-to-one mapping between the non-terminals in γ and α.
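A toy illustration (not from the paper) of how such a rule captures reordering: the indexed non-terminals on the source side map one-to-one to slots on the target side, so translating the sub-phrases and substituting them into the target template reorders them. The rule and vocabulary below are hypothetical.

```python
import re

def apply_rule(target_template, fillers):
    """Substitute translated sub-phrases into target non-terminal slots.

    `target_template` is the target side of a rule, e.g. "X2 of X1";
    `fillers` holds the translations of the source non-terminals,
    fillers[0] for X1, fillers[1] for X2, etc.
    """
    return re.sub(r"X(\d)",
                  lambda m: fillers[int(m.group(1)) - 1],
                  target_template)

# Hypothetical rule X -> <"X1 de X2", "X2 of X1">: the two sub-phrases
# swap order between source and target.
print(apply_rule("X2 of X1", ["the committee", "the chairman"]))
```

Because the non-terminal mapping is part of the rule, this reordering falls out of ordinary rule application, with no separate distortion model.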
Features
The authors build an enhanced translation model on top of the baseline hierarchical phrase-based model. They introduce a very large number of binary features based on word alignments, dependency structures, and context.
Baseline
Please refer to Chiang (2005) for the baseline features.
Sparse Features
Sparse features are binary indicators of the form:

h_k(f, e) = 1 if a particular pattern is observed in the sentence pair (f, e); 0 otherwise
These features are categorized as:
- Word pair features using word alignments within a standard phrase pair.
- Insertion features to account for spurious words on the target side.
- Target bigram features over consecutive target words.
- Hierarchical features to capture dependencies between parent and child words on the source and target sides.
The authors also apply various kinds of normalization so that the feature set generalizes better.