Improving SMT word alignment with binary feedback

Team Member(s)

  • Avneesh Saluja
  • I am more than happy to partner with 1 or 2 other people on this project. Please contact me if you're interested!

Proposal

Word alignment is an important sub-problem within machine translation. It addresses the problem of aligning words or phrases between sentences in two languages, which ranges from a relatively simple task for language pairs with similar structure (e.g., English-Spanish) to a fairly difficult one for pairs such as English-Chinese or English-Japanese. Alignment models are used when training SMT systems to extract phrase pairs from a parallel corpus, as well as in the decoding stage. Hence, it is reasonable to assume that errors in the hypotheses produced by an MT system can often be attributed to errors in the alignment model.
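Concretely, a word alignment is usually represented as a set of index pairs linking positions in the source sentence to positions in the target sentence. The following minimal sketch uses a made-up English-Spanish sentence pair purely for illustration:

    # Illustrative only: a word alignment as a set of (source_index, target_index)
    # pairs for a hypothetical English-Spanish sentence pair.
    source = ["the", "green", "house"]
    target = ["la", "casa", "verde"]

    # Each pair (i, j) asserts that source word i corresponds to target word j.
    # Note the reordering: "green" (1) aligns to "verde" (2), "house" (2) to "casa" (1).
    alignment = {(0, 0), (1, 2), (2, 1)}

    for i, j in sorted(alignment):
        print(source[i], "->", target[j])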

The idea behind this project is to improve SMT performance (as evaluated by BLEU, METEOR, or another end-to-end MT metric) through binary feedback given by a user. In this setting, the MT system produces a hypothesis which the user then judges as either a "good translation" or a "bad translation". The challenge is to incorporate this coarse form of feedback into the various models that constitute an MT system. Given our hypothesis above, it makes sense to attempt to correct these errors by adjusting the alignment model.

An initial approach can be based on J-LIS (Chang et al., ICML 2010; see Related Work). While the particular problem instance here is word alignment, a principled approach could be generalized to the broader problem of incorporating binary labels, online, into structured output predictors; that generalization will most likely be left as future work.

Evaluation

Evaluation for this project will be based on two metrics:

  • BLEU: a commonly used metric for evaluating machine translation quality.
  • Alignment Error Rate (AER): an alignment-specific metric (see the sketch after this list). AER can only be computed on parallel data annotated with which words in the source sentence correspond to which words in the target sentence, so whether we can use it depends on whether hand-annotated alignments are available for our tuning and test data.
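For reference, AER is computed from a set A of hypothesized alignment links, a set S of sure gold links, and a set P of possible gold links (with S a subset of P), following the standard definition. The sketch below is a minimal implementation of that definition; the toy link sets are invented for illustration:

    # Minimal AER computation: A = hypothesized links, S = sure gold links,
    # P = possible gold links; all are sets of (source_index, target_index) pairs.
    def aer(A, S, P):
        # The precision-like term uses possible links, the recall-like term uses sure links.
        return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))

    # Toy example with invented link sets (S is a subset of P).
    S = {(0, 0), (1, 2)}
    P = S | {(2, 1)}
    A = {(0, 0), (1, 2), (2, 2)}
    print(aer(A, S, P))  # 1 - (2 + 2) / (3 + 2) = 0.2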

Dataset(s)

We will primarily use one dataset for this project: the IWSLT 2009 Chinese-English BTEC task parallel corpus (http://www.mt-archive.info/IWSLT-2009-Paul.pdf).

  • The training set contains more than 500,000 parallel sentences.
  • There are 9 development (tuning) sets, each with roughly 500 sentences (4,250 sentences in total).
  • The test set consists of 200 aligned sentences.

Of course, we can always decide to use one of the tuning sets as a test set and vice versa.

If we also want to evaluate AER, we can use the Hansards dataset (http://www.isi.edu/natural-language/download/hansard/), which provides validation (39 sentence pairs) and test (447 sentence pairs) sets with the hand-annotated alignments required to compute AER. It should be noted that the link between AER and MT quality metrics like BLEU is not well established, so these experiments will be undertaken only for comparison with previous work on word alignment.

Baseline System

The baseline system will be trained on 75% of the parallel corpus data. We will use the remaining 25% of the training data to "simulate" positive and negative feedback: we train a baseline MT model, generate output for the held-out 25%, and, since we have reference translations for those sentences, score each hypothesis against its reference with an F1 criterion. Sentences scoring below a threshold are treated as negative feedback and those scoring above it as positive feedback (a sketch of this step is given below).
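A minimal sketch of this simulation step, assuming a simple bag-of-words F1 between each hypothesis and its reference and an illustrative threshold (the threshold value, tokenization, and function names here are placeholders, not fixed design choices):

    from collections import Counter

    def f1(hyp_tokens, ref_tokens):
        """Bag-of-words F1 between a hypothesis and its reference translation."""
        overlap = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(hyp_tokens)
        recall = overlap / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)

    def simulate_feedback(hypotheses, references, threshold=0.5):
        """Label each held-out sentence as +1 (positive) or -1 (negative) feedback."""
        return [+1 if f1(h.split(), r.split()) >= threshold else -1
                for h, r in zip(hypotheses, references)]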

We will then retrain the alignment model with the positive and negative data and measure what impact the binary feedback has on results. Retraining will be done within the J-LIS framework, which provides a method for incorporating binary feedback into the training of a structured predictor (a schematic sketch of the binary-example loss is given below).
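Schematically, and based on our reading of Chang et al. (2010), J-LIS augments the usual structured training objective with a loss over binary-labeled examples in which only the score of the best structure matters: a positively labeled sentence pair should have some high-scoring alignment, while a negatively labeled one should have none. The sketch below shows that binary-example loss under the simplifying (and generally unrealistic) assumption that candidate alignments can be enumerated; in practice the max would come from the alignment model's inference procedure, and the exact loss form in the paper may differ:

    import numpy as np

    def binary_example_loss(w, candidate_features, z):
        """Margin loss on the best-scoring structure for one binary-labeled example.

        w: weight vector of the alignment model.
        candidate_features: feature vectors, one per candidate alignment
                            (enumerated here only for illustration).
        z: +1 for "good translation" feedback, -1 for "bad translation" feedback.
        """
        best_score = max(float(np.dot(w, phi)) for phi in candidate_features)
        # Positive feedback: the best alignment should score above the margin.
        # Negative feedback: even the best alignment should score below it.
        return max(0.0, 1.0 - z * best_score) ** 2  # squared hinge, one possible choice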

Related Work

  • Structured Output Learning with Indirect Supervision, M. Chang et al., ICML 2010 (http://www.icml2010.org/papers/522.pdf). In this work, positive and negative feedback on structured outputs is incorporated into the training process to produce better POS taggers and named-entity transliteration models. The approach has not been applied to MT or word alignment.