Improving SMT word alignment with binary feedback
Word alignment is an important sub-problem within machine translation. It addresses the issue of aligning words or phrases between a sentence and its translation, which ranges from a relatively simple task for language pairs with similar structure (e.g., English-Spanish) to a fairly difficult problem for other pairs, such as English-Chinese or English-Japanese. Alignment models are used in the training of SMT systems when extracting phrase pairs from a parallel corpus, as well as in the decoding stage. Hence, it is reasonable to assume that errors in the hypotheses produced by an MT system can often be attributed to errors in the alignment model.
The idea behind this project is to improve SMT performance (as evaluated by BLEU, METEOR, or another end-to-end MT metric) through binary feedback given by a user. In this case, the MT system produces a hypothesis which the user then judges as either a "good translation" or a "bad translation". The challenge is to incorporate this coarse form of feedback into the various models that constitute an MT system. Given our hypothesis above, it makes sense to attempt to correct these errors through adjusting the alignment model.
An initial approach can be based on J-LIS (Chang et al., ICML 2010, see related work). The basic idea is to train a structural SVM on the supervised training instances and to add a term to the learning objective that incorporates a loss for the binary prediction:

$$\min_{w} \; \frac{\|w\|^2}{2} + C_1 \sum_{i \in S} L_S(x_i, h_i, w) + C_2 \sum_{i \in B} L_B(x_i, y_i, w)$$

where $L_S$ is the loss function of the structure prediction (e.g. hinge loss for structured SVM), $S$ is the fully labeled dataset, $L_B$ is the loss for binary prediction, $B$ is the binary labels dataset, and $C_1$ and $C_2$ weight the two loss terms.
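As a rough illustration of the joint objective described above, the following sketch evaluates a regularized sum of a structured hinge loss over fully labeled examples and a binary hinge loss over binary-labeled examples. Everything concrete here is an assumption made for illustration: candidate structures are enumerated explicitly as feature vectors, the per-candidate cost is supplied by the caller, a binary example is scored by its best-scoring structure, and the names `structured_hinge`, `binary_hinge`, `joint_objective`, `c1`, and `c2` are hypothetical. It is a simplified sketch, not the J-LIS implementation.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(w: Vector, phi: Vector) -> float:
    """Inner product of a weight vector and a feature vector."""
    return sum(wi * pi for wi, pi in zip(w, phi))


def structured_hinge(w: Vector, gold_phi: Vector,
                     candidates: List[Tuple[Vector, float]]) -> float:
    """Margin-rescaled hinge loss for one fully labeled example.

    candidates holds (features, cost) for every candidate structure,
    where cost measures how far that structure is from the gold one.
    """
    gold_score = dot(w, gold_phi)
    worst = max(cost + dot(w, phi) for phi, cost in candidates)
    return max(0.0, worst - gold_score)


def binary_hinge(w: Vector, label: int, candidate_phis: List[Vector]) -> float:
    """Hinge loss for one binary-labeled example (label is +1 or -1).

    The example is scored by its best-scoring candidate structure.
    """
    best = max(dot(w, phi) for phi in candidate_phis)
    return max(0.0, 1.0 - label * best)


def joint_objective(w: Vector, structured, binary,
                    c1: float = 1.0, c2: float = 1.0) -> float:
    """Regularizer plus weighted structured and binary losses."""
    reg = 0.5 * dot(w, w)
    loss_s = sum(structured_hinge(w, gold, cands) for gold, cands in structured)
    loss_b = sum(binary_hinge(w, z, phis) for z, phis in binary)
    return reg + c1 * loss_s + c2 * loss_b
```

Raising `c2` relative to `c1` makes the binary feedback count for more in the objective, which is the knob that would control how strongly "good"/"bad" judgments influence the alignment model.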
To optimize this objective, we will implement a cutting-plane algorithm to add constraints, and a dual coordinate descent algorithm to optimize with those constraints. More details are given in Chang et al., ICML 2010.
While the particular problem instance in this case is word alignment, a principled approach can be generalized to tackle the broader problem of incorporating binary labeling, online, in structured output predictors, but that will most likely be future work.
Comments: I'm a little unclear how you're going to construct a hinge loss for word alignment. That needs more thought. Also, given the way you've defined your framework, doesn't an online learning algorithm fit better than a batch/cutting planes-style approach? It will also be easier to implement, and you can incorporate the tradeoff between improving translation and word alignment by randomly choosing between the two when you consider each example in turn. I'm also worried that there are many steps between word alignment and creating your decoder, and they are expensive -- so having a word alignment model trained and then having to decode will involve a lot of intermediate steps to incorporate your updated word alignment model's beliefs into your decoder. Finally, I'd advise you take a look at work on active learning for MT -- I think there's work by Vamshi Ambati at CMU as well as people elsewhere. --Nasmith 21:20, 9 October 2011 (UTC)
Evaluation for this project will be based on two metrics:
- BLEU: a commonly used metric to evaluate machine translation quality.
- Alignment Error Rate (AER): an alignment-specific metric. It can only be computed when gold-standard word alignments are available, i.e., annotations marking which source words in each sentence pair correspond to which target words. Whether we can report it therefore depends on having hand-annotated alignments for our tuning and testing data.
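For reference, AER can be computed as in the sketch below, which follows the standard definition of Och and Ney: AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the hypothesized alignment, S the "sure" gold links, and P the "possible" gold links (with S a subset of P). The assumption that our annotations distinguish sure from possible links, the representation of links as index pairs, and the function name are illustrative choices, not part of the project setup.

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # (source word index, target word index)


def alignment_error_rate(hypothesis: Set[Link],
                         sure: Set[Link],
                         possible: Set[Link]) -> float:
    """Alignment Error Rate for one sentence pair or a pooled corpus."""
    # Sure links always count as possible links as well.
    possible = possible | sure
    if not hypothesis and not sure:
        return 0.0  # nothing to align and nothing predicted
    matched = len(hypothesis & sure) + len(hypothesis & possible)
    return 1.0 - matched / (len(hypothesis) + len(sure))
```

If the gold data contains only one kind of link, passing the same set as both `sure` and `possible` reduces the score to one minus the balanced F-measure of the predicted links.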
We will primarily use one dataset for this project: the IWSLT 2009 Chinese-English BTEC task parallel corpus.
- The training set contains more than 500,000 parallel sentences.
- There are 9 development (tuning) sets, each with ~500 sentences (total of 4,250 sentences).
- The test set consists of 200 aligned sentences.
Of course, we can always decide to use one of the tuning sets as a test set and vice versa.
If we also want to evaluate AER, we can use the Hansards dataset, which provides validation (39 sentence pairs) and test (447 sentence pairs) sets with the labeled data required to compute it. It should be noted that the relationship between AER and MT quality metrics like BLEU is not well established, so these experiments will be undertaken only for comparison with previous work on word alignment.
The baseline system will be trained on 75% of the parallel corpus data. We will use the remaining 25% of the training data to "simulate" negative and positive feedback, by training a baseline MT model and generating output from this model. Since we have reference translations for the held-out 25% of the training data, we can judge whether each output is "good" or "bad" by computing an F1 score between reference and hypothesis, treating sentences that score below a threshold as negative examples and sentences that score above a threshold as positive examples.
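The feedback simulation described above could look roughly like the following sketch. The exact F1 variant is not fixed here, so this version assumes a unigram (bag-of-words) F1 over whitespace tokens; the two cutoffs `low` and `high`, their default values, and the function names are likewise hypothetical. Sentences scoring between the two cutoffs are left unlabeled, and setting `low` equal to `high` gives a single threshold.

```python
from collections import Counter
from typing import List, Tuple


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Bag-of-words F1 between a hypothesis and its reference."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    overlap = sum((hyp & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def simulate_feedback(pairs: List[Tuple[str, str]],
                      low: float = 0.3,
                      high: float = 0.7) -> List[Tuple[str, int]]:
    """Turn (hypothesis, reference) pairs into simulated binary judgments.

    Returns (hypothesis, +1) for "good" and (hypothesis, -1) for "bad".
    """
    labeled = []
    for hyp, ref in pairs:
        score = unigram_f1(hyp, ref)
        if score >= high:
            labeled.append((hyp, 1))
        elif score <= low:
            labeled.append((hyp, -1))
    return labeled
```

Leaving a gap between the cutoffs discards borderline hypotheses, which may give cleaner simulated judgments at the cost of fewer labeled sentences.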
We will then retrain the alignment model with the positive and negative data and see what impact the binary feedback has on results. Retraining will be done through the J-LIS framework, which provides us a method to incorporate binary feedback data.
We will compare our results against the IBM alignment models, HMM-based word alignment (Vogel et al., 1996), and perhaps discriminative word alignment models, if code for such models is readily available.
As far as tools are concerned, we aim to use GIZA++ and the Geppetto phrase extraction toolkit (thanks to classmate Wang Ling, who has helped in developing the code). For decoding, we can use the Moses decoder.
- Structured Output Learning with Indirect Supervision, M. Chang et al, ICML 2010. In this work, positive and negative feedback on structured outputs is incorporated in the training process to produce better POS taggers and named entity transliteration models. The approach has not been applied to MT or word alignment.