Marcu and Wong, EMNLP 2002

From Cohen Courses
Revision as of 15:12, 26 November 2011 by Lingwang (talk | contribs)

Citation

Marcu, D., & Wong, W. (2002). A phrase-based, joint probability model for statistical machine translation. In Proceedings of EMNLP, pp. 133–139.

Online version

pdf

Summary

This work presents a phrase-to-phrase alignment model for Statistical Machine Translation. Alignment models are generally word-to-phrase, meaning that each target word can be aligned with at most one source word. This work removes that restriction and allows n-to-m alignments between words in the source and words in the target. The main contribution is showing that the model outperforms IBM Model 4 in terms of translation quality when used in machine translation systems. The main drawback is the high cost of the training procedure.

Model

In this work, words are clustered into phrases by a generative process that constructs an ordered set of phrases in the target language, an ordered set of phrases in the source language, and an alignment between the phrases, where each alignment entry indicates which source phrase is paired with which target phrase. The process consists of two steps:

  • First, the number of components is chosen, and each of the phrase pairs is generated independently.
  • Then, an ordering for the source phrases is chosen, and all the source and target phrases are aligned one to one.
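
The two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_phrase_pairs`, the toy phrase table, and the uniform shuffle used to pick the source ordering are all hypothetical. In particular, the page describes a position-based distortion model for the ordering, and a uniform shuffle stands in for it here.

```python
import random


def sample_phrase_pairs(phrase_pair_probs, stop_prob, rng):
    """Sample target phrases, source phrases, and a one-to-one alignment.

    phrase_pair_probs maps (target_phrase, source_phrase) -> probability.
    """
    # Step 1a: choose the number of components by stopping with
    # probability stop_prob after each generated pair.
    num_pairs = 1
    while rng.random() >= stop_prob:
        num_pairs += 1

    # Step 1b: generate each phrase pair independently from the multinomial.
    candidates = list(phrase_pair_probs)
    weights = [phrase_pair_probs[pair] for pair in candidates]
    drawn = rng.choices(candidates, weights=weights, k=num_pairs)

    # Step 2: choose an ordering for the source phrases.
    # ASSUMPTION: a uniform shuffle stands in for the distortion model.
    alignment = list(range(num_pairs))
    rng.shuffle(alignment)  # alignment[k] = source position of pair k

    target = [t for t, _ in drawn]
    source = [None] * num_pairs
    for k, position in enumerate(alignment):
        source[position] = drawn[k][1]
    return target, source, alignment


# Hypothetical toy phrase table, for illustration only.
toy_table = {
    ("the house", "la maison"): 0.5,
    ("is", "est"): 0.3,
    ("small", "petite"): 0.2,
}
target, source, alignment = sample_phrase_pairs(toy_table, 0.1, random.Random(0))
print(target, source, alignment)
```

In this sketch, `target[k]` is always paired with `source[alignment[k]]`, which mirrors the one-to-one alignment between phrases described in the second step.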

The number of components, written here as $\ell$, is parametrized using a geometric distribution with the stop parameter $p_\$$: $P(\ell) = p_\$ (1 - p_\$)^{\ell - 1}$.
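
As a small numeric illustration of the geometric choice, the sketch below evaluates the probability of generating a given number of phrase pairs. It assumes the distribution is supported on 1, 2, 3, ... (at least one phrase pair) and uses the value 0.1 reported for the experiments as the stop parameter; the function name `geometric_pmf` is hypothetical.

```python
def geometric_pmf(num_pairs, stop_prob):
    """Probability of stopping after exactly num_pairs phrase pairs."""
    if num_pairs < 1:
        return 0.0  # ASSUMPTION: at least one phrase pair is always generated
    # Continue (num_pairs - 1) times, then stop once.
    return stop_prob * (1.0 - stop_prob) ** (num_pairs - 1)


for n in (1, 2, 3):
    print(n, geometric_pmf(n, 0.1))
```

With a stop parameter of 0.1 the probabilities decay slowly (roughly 0.1, 0.09, 0.081 for one, two, and three pairs), and the expected number of phrase pairs is 1 / 0.1 = 10.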

Phrase pairs are drawn from a multinomial distribution whose parameters are unknown and must be learned.

A simple position-based distortion model is used to score the chosen ordering of the phrases.

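Finally, the joint probability model for aligning sentences consisting of phrase pairs combines the three components above: the geometric distribution over the number of phrase pairs, the multinomial distribution over phrase pairs, and the distortion model.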
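
To make the structure of the joint model concrete, here is a hedged sketch that multiplies (in log space) a geometric term for the number of phrase pairs, one multinomial term per phrase pair, and one distortion term per aligned pair. The factorization into exactly these three pieces is an assumption based on the description on this page. The distortion formula is not reproduced here, so `toy_distortion` is a purely illustrative, unnormalized penalty passed in as a callable, which means the result is a log score rather than a normalized log probability. All names and the toy table are hypothetical.

```python
import math


def joint_log_prob(pairs, alignment, phrase_pair_probs, stop_prob, distortion):
    """Log of: geometric term * phrase-pair probabilities * distortion terms."""
    num_pairs = len(pairs)
    # Geometric term for the number of phrase pairs (support 1, 2, ...).
    log_p = math.log(stop_prob) + (num_pairs - 1) * math.log(1.0 - stop_prob)
    # Each phrase pair is drawn independently from the multinomial.
    for pair in pairs:
        log_p += math.log(phrase_pair_probs[pair])
    # One distortion term per pair; alignment[k] is the source position of pair k.
    for k, position in enumerate(alignment):
        log_p += math.log(distortion(k, position))
    return log_p


# ASSUMPTION: illustrative unnormalized penalty, not the formula from the paper.
def toy_distortion(k, position):
    return 0.85 ** abs(k - position)


# Hypothetical toy phrase table, for illustration only.
toy_table = {
    ("the house", "la maison"): 0.5,
    ("is", "est"): 0.3,
    ("small", "petite"): 0.2,
}
pairs = [("the house", "la maison"), ("is", "est")]
monotone = joint_log_prob(pairs, [0, 1], toy_table, 0.1, toy_distortion)
swapped = joint_log_prob(pairs, [1, 0], toy_table, 0.1, toy_distortion)
print(monotone, swapped)
```

Under this stand-in penalty, swapping the two source phrases lowers the score relative to the monotone ordering, which is the qualitative behaviour one would expect from a position-based distortion model.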

In the experiments, the stop parameter and the distortion parameter were set to 0.1 and 0.85, respectively.

Experiments

The model was evaluated by measuring the translation quality of phrase-based machine translation systems, using BLEU as the evaluation score.

The Hansards dataset was used, which contains around 1.1 million training sentence pairs; 500 unseen test sentences were used to test the system. A limit of 20 characters was imposed on the length of the sentences in the training corpora.

The model described in this paper is compared with IBM Model 4.

Dataset     IBM Model 4    Phrase-to-phrase
Hansards    34.0           34.5
EUROPARL    19.3           19.8

The phrase-to-phrase model outperforms IBM Model 4 by 0.5 BLEU points on both datasets.

Related Work