Improving SMT word alignment with binary feedback
Team Member(s)
Proposal
Word alignment is an important sub-problem within machine translation: aligning word or phrase pairs between sentences in different languages. This varies from a relatively simple task for languages with similar structure (e.g., English and Spanish) to a fairly difficult problem for pairs like English-Chinese or English-Japanese. Alignment models are used when training SMT systems to extract phrase pairs from a parallel corpus, as well as in the decoding stage. Hence, it is reasonable to assume that errors in the hypotheses produced by an MT system can often be attributed to errors in the alignment model.
The idea behind this project is to improve SMT performance (as evaluated by BLEU, METEOR, or another end-to-end MT metric) through binary feedback given by a user. The MT system produces a hypothesis, which the user then judges as either a "good translation" or a "bad translation". The challenge is to incorporate this coarse form of feedback into the various models that constitute an MT system. Given our hypothesis above, it makes sense to attempt to correct these errors by adjusting the alignment model.
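The protocol above can be sketched as a simple collection loop. This is an illustrative sketch only: the component names (`mt_system.translate`, `get_user_judgment`) are placeholders, not part of any toolkit we plan to use; in practice the translations would come from a Moses/GIZA++ pipeline and the labeled pairs would feed a J-LIS-style learner.

```python
def feedback_loop(mt_system, source_sentences, get_user_judgment):
    """One pass of the proposed protocol: translate each source sentence,
    collect a binary user judgment on the hypothesis, and bank the labeled
    (source, hypothesis) pairs for later alignment-model retraining.

    `mt_system` and `get_user_judgment` are hypothetical stand-ins for the
    real decoder and the (possibly simulated) human judge.
    """
    positives, negatives = [], []
    for src in source_sentences:
        hyp = mt_system.translate(src)
        if get_user_judgment(src, hyp):   # True = "good translation"
            positives.append((src, hyp))
        else:
            negatives.append((src, hyp))
    return positives, negatives
```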
An initial approach can be based on J-LIS (Chang et al., ICML 2010; see Related Work). While our particular problem instance is word alignment, a principled approach could generalize to the broader problem of incorporating binary labels, online, into structured output predictors; that, however, is most likely future work.
Evaluation
Evaluation for this project will be based on two metrics:
- BLEU: a commonly used metric to evaluate machine translation quality.
- Alignment Error Rate (AER): an alignment-specific metric. This metric requires gold-standard alignment annotations on parallel corpora, i.e., for each sentence pair, which source words correspond to which target words. Whether we can use it therefore depends on whether hand-annotated alignments are available for our tuning and test data.
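For concreteness, AER is computed from the predicted alignment A against annotated "sure" links S and "possible" links P (with S a subset of P), as AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|). A minimal implementation, assuming alignments are represented as sets of (source index, target index) pairs:

```python
def aer(predicted, sure, possible):
    """Alignment Error Rate (Och & Ney): lower is better, 0 is perfect.

    predicted, sure, possible are sets of (source_index, target_index)
    links; the sure links are assumed to be a subset of the possible links.
    """
    a_and_s = len(predicted & sure)
    a_and_p = len(predicted & possible)
    return 1.0 - (a_and_s + a_and_p) / (len(predicted) + len(sure))
```

For example, predicting exactly the sure links yields an AER of 0, while each wrong or missing link raises the score toward 1.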
Dataset(s)
We will primarily use one dataset for the purposes of this project, which is the IWSLT 2009 Chinese-English btec task parallel corpus.
- The training set contains more than 500,000 parallel sentences.
- There are 9 development (tuning) sets, each with ~500 sentences (4,250 sentences in total).
- The test set consists of 200 aligned sentences.
Of course, we can always decide to use one of the tuning sets as a test set and vice versa.
If we also want to evaluate AER, we can use the Hansards dataset, which provides validation (39 sentence pairs) and test (447 sentence pairs) sets with the labeled alignments required for AER. It should be noted that the link between AER and MT quality metrics like BLEU is not well established, so these experiments will be undertaken only for comparison with previous work on word alignment.
Baseline System
The baseline system will be trained on 75% of the parallel corpus data. We will use the remaining 25% of the training data to "simulate" positive and negative feedback by generating output from this baseline MT model. Since we have reference translations for the held-out 25%, we can judge each output as "good" or "bad" using an F1 score between hypothesis and reference: sentences scoring below a threshold are labeled negative, and those above a threshold positive.
We will then retrain the alignment model with the positive and negative data and measure the impact of the binary feedback on our results. Retraining will be done through the J-LIS framework, which provides a principled way to incorporate binary feedback.
We will compare our results against the IBM alignment models and HMM-based word alignment (Vogel et al., 1996), and perhaps against discriminative word alignment models if code for such models is readily available.
As far as tools are concerned, we aim to use GIZA++ along with the Geppetto phrase extraction toolkit (thanks to classmate Wang Ling, who has helped develop the code). For decoding, we can use the Moses decoder.
Related Work
- Structured Output Learning with Indirect Supervision, M. Chang et al., ICML 2010. In this work, positive and negative feedback on structured outputs is incorporated into the training process to produce better POS taggers and named entity transliteration models. The approach has not been applied to MT or word alignment.