Ravi and Knight, ACL 2011


== Citation ==

S. Ravi and K. Knight. 2011. Deciphering Foreign Language. In Proceedings of ACL.

== Online version ==

pdf

== Summary ==

This [[Category::paper | work]] addresses the [[AddressesProblem::Machine Translation]] problem without resorting to parallel training data.

This is done by looking at the Machine Translation task from a decipherment perspective, where a sentence in the source language is viewed as a target-language sentence encoded in some unknown symbols.

Experiments showed that, while results obtained with monolingual data are considerably lower than those obtained with the same amount of bilingual data, large amounts of monolingual data can be used to build models that perform comparably to systems trained on smaller amounts of bilingual data. This is encouraging, since bilingual data is a scarce resource for most language pairs and domains, while monolingual data is much more abundant.

== Description of the Method ==

Learning [[Word Alignments]] from parallel corpora is viewed as a maximization problem over latent word alignments <math>a</math> for a set of sentence pairs <math>(s,t)</math>, given by:

<math>
argmax_\theta \prod_{(s,t)} \sum_a P_\theta (s,a|t)
</math>

where <math>\theta</math> are the translation parameters of the model.
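
To make the marginalization over alignments concrete, the following minimal sketch (not taken from the paper) computes <math>\sum_a P_\theta (s,a|t)</math> for a toy IBM-Model-1-style channel in which every source word aligns uniformly to one target word; the translation table values are invented for illustration.

<pre>
import itertools

# Illustrative translation table t(s_word | t_word); the values are made up.
T_TABLE = {
    ("casa", "house"): 0.8, ("casa", "blue"): 0.1,
    ("azul", "house"): 0.1, ("azul", "blue"): 0.7,
}

def t_prob(s_word, t_word):
    return T_TABLE.get((s_word, t_word), 0.01)  # small floor for unseen pairs

def marginal_likelihood_bruteforce(source, target):
    """Sum P(s, a | t) over all alignments a (each source word -> one target word)."""
    total = 0.0
    for alignment in itertools.product(range(len(target)), repeat=len(source)):
        p = 1.0
        for i, j in enumerate(alignment):
            p *= (1.0 / len(target)) * t_prob(source[i], target[j])
        total += p
    return total

def marginal_likelihood_factored(source, target):
    """Model-1 factorization: the sum over alignments becomes a product of sums."""
    p = 1.0
    for s_word in source:
        p *= sum((1.0 / len(target)) * t_prob(s_word, t_word) for t_word in target)
    return p

s, t = ["casa", "azul"], ["blue", "house"]
print(marginal_likelihood_bruteforce(s, t))  # both ways give the same value
print(marginal_likelihood_factored(s, t))
</pre>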

When only monolingual corpora are used, there is no target sentence aligned to each source sentence <math>s</math>. In this case, the source sentence is viewed as a ciphered target sentence that must be deciphered; this process is called decipherment. Like the word alignments, this work views the hidden target sentence as an additional latent variable. Hence, the previous equation can be rewritten as:

<math>
argmax_\theta \prod_{s} \sum_t P(t) \sum_a P_\theta (s,a|t)
</math>

where <math>P(t)</math> is the probability of a target sentence <math>t</math>, modeled by a language model. The large number of possible latent variable configurations generated by this model is tackled using [[usesMethod::Gibbs sampling]].
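
As a rough illustration of the Gibbs sampling idea (not the paper's actual sampler, which operates over richer IBM-Model-3-style structures), the sketch below resamples one hidden target word at a time, conditioned on its neighbours under a toy bigram language model and on the observed source word under a toy substitution channel; all probability tables are invented.

<pre>
import random

# Toy bigram LM P(t_i | t_{i-1}) and channel P(s_i | t_i); values are invented.
TARGET_VOCAB = ["the", "blue", "house"]
LM = {("<s>", "the"): 0.9, ("the", "blue"): 0.5, ("the", "house"): 0.4,
      ("blue", "house"): 0.8, ("house", "</s>"): 0.9}
CHANNEL = {("la", "the"): 0.9, ("azul", "blue"): 0.8, ("casa", "house"): 0.8}

def lm(prev, cur):
    return LM.get((prev, cur), 0.01)

def channel(s_word, t_word):
    return CHANNEL.get((s_word, t_word), 0.05)

def gibbs_decipher(source, n_iters=200, seed=0):
    """Resample each hidden target word given its Markov blanket."""
    rng = random.Random(seed)
    target = [rng.choice(TARGET_VOCAB) for _ in source]  # random initialization
    for _ in range(n_iters):
        for i in range(len(source)):
            prev = target[i - 1] if i > 0 else "<s>"
            nxt = target[i + 1] if i + 1 < len(source) else "</s>"
            weights = [lm(prev, cand) * lm(cand, nxt) * channel(source[i], cand)
                       for cand in TARGET_VOCAB]
            r, acc = rng.random() * sum(weights), 0.0
            for cand, w in zip(TARGET_VOCAB, weights):
                acc += w
                if r <= acc:
                    target[i] = cand
                    break
    return target

print(gibbs_decipher(["la", "casa", "azul"]))
</pre>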

As for the translation model <math>P_\theta (s,a|t)</math>, two models are presented.

The first model is a simple one that accounts for word substitutions, insertions, deletions and relative distortion, but does not incorporate word fertility and absolute distortion as IBM Model 3 does. Fertility and absolute distortion are left out because the resulting parameter tables and derivation lattices would be too large, making EM training intractable.
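
As an illustration of such a simplified channel, the following hypothetical scoring function combines substitution, insertion, deletion and a relative-distortion penalty for a given alignment; the parameter values and the exact penalty form are assumptions for this sketch, not taken from the paper.

<pre>
# Hypothetical parameters for a simplified channel model (values invented).
P_SUB = {("casa", "house"): 0.7, ("azul", "blue"): 0.6}  # substitution table
P_INS = 0.05       # probability of a spurious source word (aligned to nothing)
P_DEL = 0.05       # probability of leaving a target word uncovered
DISTORTION = 0.7   # geometric penalty base for relative jumps

def score(source, alignment, target):
    """P(s, a | t) under substitution, insertion, deletion and relative distortion.
    alignment[i] is the target index for source word i, or None for an insertion."""
    p = 1.0
    prev_j = -1
    for i, j in enumerate(alignment):
        if j is None:
            p *= P_INS                                     # insertion
        else:
            p *= P_SUB.get((source[i], target[j]), 0.01)   # substitution
            p *= DISTORTION ** abs(j - prev_j - 1)         # relative distortion
            prev_j = j
    covered = {j for j in alignment if j is not None}
    p *= P_DEL ** (len(target) - len(covered))             # deletions
    return p

print(score(["casa", "azul"], [1, 0], ["blue", "house"]))
</pre>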

The second model accounts for the same phenomena addressed in IBM Model 3, and a Bayesian method is used instead of EM (which would be intractable). In this method, model parameters are learned using a [[Chinese Restaurant Process]] rather than expected counts. For instance, the translation parameter <math>t_\theta(s_i|t_i)</math>, which is generally calculated as the ratio between the expected number of observations of <math>s_i</math> together with <math>t_i</math>, <math>C_e(t_i,s_i)</math>, and the expected number of observations of <math>t_i</math>, <math>C_e(t_i)</math>, is now formulated as:

<math>
t_\theta(s_i|t_i) = \frac{\alpha P_0(s_i|t_i) + C_{history}(t_i,s_i)}{\alpha + C_{history}(t_i)}
</math>

where <math>P_0(s_i|t_i)</math> is a base distribution (set to uniform) and <math>C_{history}</math> counts the corresponding events in the sampling history. Finally, a Dirichlet prior, parametrized by <math>\alpha</math>, is applied.
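
The following small sketch implements the estimate above directly; the vocabulary, counts and <math>\alpha</math> value are invented for the example.

<pre>
from collections import Counter

ALPHA = 0.5            # Dirichlet concentration parameter
SOURCE_VOCAB_SIZE = 3  # used for the uniform base distribution P0

# Counts of (target word, source word) events already in the sampling history (invented).
history_pair = Counter({("house", "casa"): 4, ("house", "hogar"): 1, ("blue", "azul"): 2})
history_target = Counter({"house": 5, "blue": 2})

def t_prob(s_word, t_word):
    """CRP-style estimate: (alpha * P0 + C_hist(t, s)) / (alpha + C_hist(t))."""
    p0 = 1.0 / SOURCE_VOCAB_SIZE  # uniform base distribution
    return (ALPHA * p0 + history_pair[(t_word, s_word)]) / (ALPHA + history_target[t_word])

print(t_prob("casa", "house"))  # dominated by the cached counts
print(t_prob("hogar", "blue"))  # unseen pair: falls back to alpha * P0 / (alpha + count)
</pre>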

== Experimental Results ==

Tests were conducted on a self-built Time corpus, built by mining newswire text on the Web: 295k temporal expressions were collected (e.g. "Last Year", "The Fourth Quarter", "In Jan 1968") and translated to Spanish. Another set of tests was conducted on the [[UsesDataset::OPUS]] movie subtitle corpus, from which 19,770 training sentences and 13,181 test sentences were extracted.

Results were compared against [http://www.statmt.org/moses/ Moses] and against a system using IBM Model 3 without distortion, both of which are trained on parallel corpora. Translation quality is measured with BLEU (higher is better) and normalized edit distance (lower is better).

{| class="wikitable" border="1"
|-
! Corpus (BLEU - edit distance)
! Parallel training (Moses)
! Parallel training (IBM 3 without distortion)
! Decipherment (EM)
! Decipherment (Bayesian IBM 3)
|-
| Time corpus
| 85.6 - 5.6
| 78.9 - 10.1
| 44.6 - 37.6
| 34.0 - 34.0
|-
| OPUS subtitles
| 63.6 - 26.8
| 59.6 - 29.9
| 15.3 - 67.2
| 15.1 - 66.6
|}
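
BLEU is the standard n-gram precision metric for MT. The normalized edit distance can be computed as the word-level Levenshtein distance divided by the reference length (one common normalization, assumed here since the page does not spell it out), as in the sketch below.

<pre>
def normalized_edit_distance(hypothesis, reference):
    """Word-level Levenshtein distance between two token lists,
    normalized by the reference length (one common convention)."""
    h, r = hypothesis, reference
    # dp[i][j] = edit distance between h[:i] and r[:j]
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return 100.0 * dp[len(h)][len(r)] / max(len(r), 1)

print(normalized_edit_distance("the house blue".split(), "the blue house".split()))
</pre>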

While the results obtained with parallel corpora are better than those of the decipherment models, this work shows that it is possible to build MT systems from monolingual corpora alone, with quality comparable to that of systems trained on smaller amounts of parallel data. This is encouraging, given the large amounts of monolingual corpora available for different languages.

A study is also performed to measure the value of parallel data versus non-parallel data for the MT task. It shows that, for the Time corpus, a system using 10,000 monolingual sentences achieves the same translation quality as a system trained on 200-500 parallel sentences.

== Related Work ==

The work in [http://www.aclweb.org/anthology/P/P08/P08-1088.pdf Haghighi et al., 2008] extracts translation lexicons from non-parallel corpora.

This work contrasts with [http://www.cog.brown.edu/~mj/papers/naacl06-self-train.pdf McClosky et al., 2006], in which a parallel seed is required.

== Comment ==

Another interesting related work: [http://www.cs.berkeley.edu/~tberg/papers/emnlp2011.pdf Berg-Kirkpatrick and Klein, EMNLP 2011] --[[User:Brendan|Brendan]] 17:58, 29 November 2011 (UTC)