Difference between revisions of "Koehn et al, ACL 2003"

Latest revision as of 09:19, 29 November 2011

Citation

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of HLT-NAACL 2003, pages 127–133. http://www.aclweb.org/anthology-new/N/N03/N03-1017.pdf

Summary

In this paper, the authors propose a new framework for explaining why phrase-based models outperform word-based models in machine translation.

Within this framework (a phrase-based translation model and a decoding algorithm), the authors carry out experiments comparing three methods for learning phrase translations: from word alignments, from syntactic information, and from "pure" phrase alignments. They also examine the effect of the maximum phrase length, of lexical weighting, and of the choice of language pair on the overall BLEU score.

The results confirm the previously established hypothesis that phrase-based translation outperforms word-based methods, and add that phrases of up to three words are sufficient to do so. Moreover, the authors conclude that lexical weighting of phrase translations improves the results, whereas syntactic restrictions hurt them.

Evaluation Framework

The phrase translation model used in the proposed framework is based on the noisy channel model. The best English output sentence $e_{best}$ given a foreign input sentence $f$ is given by:

$e_{best}=\arg \max _{e}p(e|f)=\arg \max _{e}p(f|e)p_{LM}(e)\omega ^{length(e)}$

where:

$p(f|e)$ is the translation model (see below);
$p_{LM}(e)$ is a trigram language model;
and $\omega$ is a factor that calibrates the output length ($\omega > 1$ biases the model toward longer output).
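The decision rule above can be sketched in log space, which avoids numerical underflow when probabilities are multiplied. This is a minimal illustration only: the function names, the candidate list, and all probability values are hypothetical, and a real decoder searches the space of translations instead of enumerating a fixed list.

```python
import math


def noisy_channel_score(log_tm, log_lm, length_e, omega):
    """Return log( p(f|e) * p_LM(e) * omega ** length(e) )."""
    return log_tm + log_lm + length_e * math.log(omega)


def best_translation(candidates, omega):
    """Pick the English candidate with the highest noisy-channel score.

    candidates: list of (english_tokens, log p(f|e), log p_LM(e)).
    """
    best = max(
        candidates,
        key=lambda c: noisy_channel_score(c[1], c[2], len(c[0]), omega),
    )
    return best[0]


# Hypothetical candidates with made-up model probabilities.
candidates = [
    (["the", "house"], math.log(0.20), math.log(0.010)),
    (["the", "small", "house"], math.log(0.18), math.log(0.008)),
]

# With omega = 1 the length factor has no effect; a larger omega
# shifts the decision toward the longer candidate.
print(best_translation(candidates, omega=1.0))
print(best_translation(candidates, omega=1.5))
```

With these made-up numbers the shorter candidate wins when $\omega = 1$, and the longer one wins once $\omega$ is large enough, which is the calibration effect described above.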

The translation model $p(f|e)$ can be decomposed into:

$p({\bar {f}}_{1}^{I}|{\bar {e}}_{1}^{I})=\prod _{i=1}^{I}\phi ({\bar {f}}_{i}|{\bar {e}}_{i})d(a_{i}-b_{i-1})$

where:

${\bar {f}}_{1}^{I}$ is the sequence of $I$ phrases into which the input sentence $f$ is segmented;
$\phi ({\bar {f}}_{i}|{\bar {e}}_{i})$ is a probability distribution that models the phrase translation;
and $d$ is a relative distortion probability distribution defined over the start position of the foreign phrase that was translated into the $i$th English phrase ($a_{i}$) and the end position of the foreign phrase translated into the $(i-1)$th English phrase ($b_{i-1}$).
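The decomposition above can be illustrated with a short sketch. Two points are assumptions of this sketch and are not stated in the text above: the exponential form $d(a_{i}-b_{i-1})=\alpha ^{|a_{i}-b_{i-1}-1|}$ for the distortion model, and the convention $b_{0}=0$ for the first phrase. The phrase table `phi` and its values are hypothetical.

```python
def distortion(a_i, b_prev, alpha=0.5):
    """Assumed distortion model: alpha ** |a_i - b_{i-1} - 1|.

    Monotone translation (a_i == b_prev + 1) gets probability 1.
    """
    return alpha ** abs(a_i - b_prev - 1)


def translation_model_prob(phrase_pairs, phi, alpha=0.5):
    """Compute prod_i phi(f_i | e_i) * d(a_i - b_{i-1}).

    phrase_pairs: list of (f_phrase, e_phrase, start, end), given in the
    order of the English output; start and end are the 1-based positions
    of the foreign phrase in the input sentence.
    """
    prob = 1.0
    b_prev = 0  # assumed convention: b_0 = 0
    for f_phrase, e_phrase, start, end in phrase_pairs:
        prob *= phi[(f_phrase, e_phrase)] * distortion(start, b_prev, alpha)
        b_prev = end
    return prob


# Hypothetical phrase translation probabilities.
phi = {("das", "the"): 0.5, ("kleine Haus", "small house"): 0.4}

# Monotone segmentation of "das kleine Haus": no distortion penalty.
monotone = [("das", "the", 1, 1), ("kleine Haus", "small house", 2, 3)]

# Same phrases produced in reverse order: both steps are penalized.
reordered = [("kleine Haus", "small house", 2, 3), ("das", "the", 1, 1)]

print(translation_model_prob(monotone, phi))
print(translation_model_prob(reordered, phi))
```

Under this assumed distortion model the reordered derivation scores lower than the monotone one, because each jump away from the position directly after the previous phrase is penalized by a factor of $\alpha$ per skipped position.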

The decoder adopted in the framework uses a beam search algorithm.

Methods for Learning Phrase Translation

In this work, the authors compare three methods for building phrase translation probability tables. The first builds the phrase alignments from word alignment information, i.e., every phrase pair considered must be consistent with the word alignment.
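The consistency requirement of the first method can be sketched as follows: a phrase pair is kept only if no word inside it is aligned to a word outside it. This is a simplified illustration, not the authors' implementation. It takes the smallest English span linked to each foreign span and does not extend phrases over unaligned words; the function name, the `max_len` parameter, and the example sentence pair are hypothetical.

```python
def extract_phrase_pairs(f_words, e_words, alignment, max_len=3):
    """Collect phrase pairs that are consistent with a word alignment.

    alignment: set of (f_index, e_index) pairs, 0-based.
    max_len: maximum phrase length (in words) on either side.
    """
    pairs = set()
    for f_start in range(len(f_words)):
        f_stop = min(f_start + max_len, len(f_words))
        for f_end in range(f_start, f_stop):
            # English positions linked to any word in the foreign span.
            e_linked = [e for (f, e) in alignment if f_start <= f <= f_end]
            if not e_linked:
                continue
            e_start, e_end = min(e_linked), max(e_linked)
            if e_end - e_start + 1 > max_len:
                continue
            # Consistency check: no English word in the span may be
            # aligned to a foreign word outside the foreign span.
            violates = any(
                e_start <= e <= e_end and not (f_start <= f <= f_end)
                for (f, e) in alignment
            )
            if violates:
                continue
            pairs.add((
                " ".join(f_words[f_start:f_end + 1]),
                " ".join(e_words[e_start:e_end + 1]),
            ))
    return pairs


f_words = ["das", "kleine", "Haus"]
e_words = ["the", "small", "house"]
alignment = {(0, 0), (1, 1), (2, 2)}

for pair in sorted(extract_phrase_pairs(f_words, e_words, alignment)):
    print(pair)
```

With a crossing alignment, some spans are rejected: if the second foreign word were aligned to the third English word and vice versa, the first two foreign words would no longer form a consistent phrase pair with any English span.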

The second method acts as a filter on the previous set of phrase pairs, restricting the possible phrases to syntactically correct ones.

Finally, the last method follows the approach of Marcu and Wong, EMNLP 2002, learning phrase-level alignments directly from the parallel corpus.

Experimental Results

The authors used the EUROPARL corpus for the German-English language pair.

The first result compares the three methods described in the previous section. The corresponding figure (Koehncoremethods.png) plots the BLEU score against the size of the training corpus for each of the three approaches: phrases based on word alignments (AP), phrases under syntactic restrictions (Syn), and "pure" phrase alignments (Joint). The results obtained with IBM Model 4 are also plotted.

The second result concerns the maximum phrase length that should be considered when learning phrase translations. The corresponding figure (Koehnphraselen.png) compares several length limits and shows that a limit of 3 words is enough, achieving BLEU scores similar to those obtained with higher limits.

The last result is presented in a table (Koehnlangpairs.png). First, the authors show that lexical weighting, i.e., taking into account how well the words within a phrase translation pair translate to each other, consistently improves the results. Second, the authors show that their approach achieves better BLEU scores than IBM Model 4 for several language pairs.
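The idea behind lexical weighting can be sketched as follows. The formula used here is an assumption of this sketch and is not given in the text above: for each foreign word in the phrase, average the word translation probabilities $w(f_{i}|e_{j})$ over the English words it is aligned to, then multiply these averages over the phrase. Unaligned foreign words are paired with a special NULL token. The function name and all probability values are hypothetical.

```python
def lexical_weight(f_words, e_words, alignment, w):
    """Score how well the words of a phrase pair translate to each other.

    alignment: set of (i, j) meaning f_words[i] is aligned to e_words[j].
    w: dict mapping (foreign_word, english_word) to a lexical
       translation probability; missing entries count as 0.
    """
    weight = 1.0
    for i, f in enumerate(f_words):
        linked = [e_words[j] for (fi, j) in alignment if fi == i]
        if not linked:
            linked = ["NULL"]  # assumed handling of unaligned words
        total = sum(w.get((f, e), 0.0) for e in linked)
        weight *= total / len(linked)
    return weight


# Hypothetical lexical translation probabilities.
w = {
    ("kleine", "small"): 0.8,
    ("Haus", "house"): 0.9,
    ("nicht", "does"): 0.2,
    ("nicht", "not"): 0.6,
}

# One-to-one alignment: the weight is the product of both word scores.
print(lexical_weight(["kleine", "Haus"], ["small", "house"], {(0, 0), (1, 1)}, w))

# One foreign word aligned to two English words: the scores are averaged.
print(lexical_weight(["nicht"], ["does", "not"], {(0, 0), (0, 1)}, w))
```

A phrase pair whose words are individually good translations of each other receives a high weight, so this score can act as a sanity check on phrase pairs that were observed only a few times.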

