Training SMT Systems with Latent SVMs

From Cohen Courses
(was '''Improving SMT word alignment with binary feedback''')

== Team Member(s) ==
 
 
* [[User:Asaluja|Avneesh Saluja]]
* [[User:Jmflanig|Jeff Flanigan]]
== Proposal ==
Large-scale discriminative training of MT systems has been a long-standing goal in statistical machine translation.  One of the first attempts (Liang et al 2006) used the structured perceptron to train weights for each phrase in a phrase-based system, as well as features shared between phrases.  The approach can be viewed as an instance of the Latent Variable SVM (ICML 2009), but with no regularizer and no cost function.  Regularization has been shown to be important in large-scale discriminative training of SMT systems (2008).  We propose to generalize the perceptron training of SMT systems to Latent Variable SVMs to allow for a regularizer and a cost function, and to apply the method to syntactic SMT systems as well as a phrase-based system.
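As a rough sketch (the symbols below are our notation, not taken from any system): in the latent structural SVM of Yu and Joachims (ICML 2009), with x_i a source sentence, y_i its reference translation, h a latent derivation (e.g. a phrase segmentation and alignment), Phi the joint feature map, and Delta a cost function such as a sentence-level BLEU loss, the training objective has roughly the form

```latex
\min_{w}\; \frac{1}{2}\|w\|^{2}
  + C \sum_{i=1}^{n} \Big[
      \max_{\hat{y},\hat{h}} \big( w^{\top}\Phi(x_i,\hat{y},\hat{h}) + \Delta(y_i,\hat{y}) \big)
      - \max_{h}\; w^{\top}\Phi(x_i, y_i, h)
    \Big]
```

Dropping the regularizer and the cost term, and optimizing by stochastic (sub)gradient steps, recovers perceptron-style updates with latent derivations, which is the sense in which this generalizes the Liang et al 2006 approach.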
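To make the perceptron starting point concrete, here is a toy sketch (not the actual decoder, and all function and variable names are ours) of the latent structured perceptron update in the style of Liang et al 2006: decode with the current weights over all candidate (translation, derivation) pairs, find the best-scoring derivation of the reference translation, and update the weights toward the reference's features and away from the prediction's.

```python
def dot(w, feats):
    """Score a sparse feature vector (dict) against the weight vector (dict)."""
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def perceptron_update(w, candidates, reference, lr=1.0):
    """One latent structured perceptron step (toy sketch).

    candidates: list of (translation, features) pairs, one per derivation
    reference:  the gold translation string
    """
    # Model's best-scoring candidate over translations AND latent derivations.
    pred_trans, pred_feats = max(candidates, key=lambda c: dot(w, c[1]))
    # Derivations that yield the reference translation (the latent "oracle").
    gold = [c for c in candidates if c[0] == reference]
    if not gold or pred_trans == reference:
        return w  # reference unreachable, or prediction already correct
    _, gold_feats = max(gold, key=lambda c: dot(w, c[1]))
    # Standard perceptron update: toward the gold features, away from the prediction.
    for f, v in gold_feats.items():
        w[f] = w.get(f, 0.0) + lr * v
    for f, v in pred_feats.items():
        w[f] = w.get(f, 0.0) - lr * v
    return w
```

In a real system the max over candidates is computed by the decoder rather than by enumeration; the sketch only shows the shape of the update that the latent SVM objective would add a regularizer and cost function to.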
== Dataset(s) ==
We will primarily use one dataset for the purposes of this project: the [http://www.mt-archive.info/IWSLT-2009-Paul.pdf IWSLT 2009 Chinese-English BTEC task parallel corpus].

* The training set contains more than 500,000 parallel sentences.
* There are 9 development (tuning) sets, each with ~500 sentences (4,250 sentences in total).
* The test set consists of 200 aligned sentences.

Of course, we can always decide to use one of the tuning sets as a test set and vice versa.

== Baseline System ==
The baseline systems will be a phrase-based system and a Hiero system, optimized using MERT with the usual gamut of features.
== Related Work ==
[http://cs.stanford.edu/~pliang/papers/discriminative-mt-acl2006.pdf Liang et al 2006]
[http://www.cs.cornell.edu/~cnyu/papers/icml09_latentssvm.pdf Yu and Joachims 2009]

Revision as of 23:22, 18 October 2011
