Word Alignments using an HMM-based model

From Cohen Courses
Revision as of 20:13, 5 October 2011 by Lingwang (talk | contribs)

Summary

Word alignments are an important notion introduced in word-based machine translation, and they are commonly employed in phrase-based machine translation. Parallel corpora are aligned sentence by sentence, not word by word, so it is not trivial to break a sentence pair into smaller translation units. A word alignment maps each word in the source sentence to an equivalent word in the target sentence.

The goal of this project is to implement a Word Alignment Model where the relative word distortion is modeled using a Hidden Markov Model. This task is similar to [1], and the resulting model will be used as the baseline.
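As a rough illustration of what modeling relative distortion with an HMM can look like, the sketch below decodes the most likely alignment with Viterbi. It is a minimal sketch under stated assumptions, not the project's implementation: the hidden states are target positions, the observations are source words, the transition probability depends only on the jump between consecutive aligned positions, and the initial distribution is uniform. The function name `viterbi_alignment` and the tables `lex_prob` and `jump_prob` are hypothetical; in practice they would be estimated with EM.

```python
import math


def viterbi_alignment(source, target, lex_prob, jump_prob, floor=1e-12):
    """Return, for each source word, the index of the target word it aligns to.

    lex_prob[(s, t)] -- assumed lexical table p(source word s | target word t)
    jump_prob[d]     -- assumed distortion table p(jump of d target positions)
    Entries missing from either table fall back to `floor`.
    """
    n_states = len(target)  # hidden states: positions in the target sentence

    def emit(j, i):
        # log p(source[j] | target[i])
        return math.log(lex_prob.get((source[j], target[i]), floor))

    def jump(prev, cur):
        # Renormalise over the jumps that stay inside the target sentence.
        z = sum(jump_prob.get(k - prev, floor) for k in range(n_states))
        return math.log(jump_prob.get(cur - prev, floor) / z)

    # Uniform initial distribution over target positions (an assumption).
    scores = [[-math.log(n_states) + emit(0, i) for i in range(n_states)]]
    backpointers = []
    for j in range(1, len(source)):
        row, ptr = [], []
        for i in range(n_states):
            best = max(range(n_states),
                       key=lambda p: scores[-1][p] + jump(p, i))
            row.append(scores[-1][best] + jump(best, i) + emit(j, i))
            ptr.append(best)
        scores.append(row)
        backpointers.append(ptr)

    # Follow the back-pointers from the best final state.
    state = max(range(n_states), key=lambda p: scores[-1][p])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Toy tables, invented purely for illustration.
lex_prob = {("das", "the"): 0.9, ("das", "house"): 0.1,
            ("haus", "the"): 0.1, ("haus", "house"): 0.9}
jump_prob = {-1: 0.1, 0: 0.2, 1: 0.7}
alignment = viterbi_alignment(["das", "haus"], ["the", "house"],
                              lex_prob, jump_prob)
```

Because the transition depends only on the jump width rather than on absolute positions, the same small distortion table is shared across sentences of any length.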

We will extend the HMM-based word-to-phrase alignment model to phrase-to-phrase alignments in a way similar to Bansal et al., ACL 2011, but we will use the Posterior Regularization framework to deal with the tractability issues of phrase-to-phrase alignments, which arise from the large space of latent variables during EM.

The quality of the alignments can be tested using a gold-standard corpus, where the Word Alignments are produced by human linguists. One example of such a corpus is the Hansards corpus [2]. Another evaluation method is to use the produced alignments in a Machine Translation system and check whether the improved alignments yield better translation quality.
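One common way to score alignments against a gold standard is the Alignment Error Rate (AER). The sketch below assumes that the gold annotation distinguishes sure links from possible links, and that every sure link also counts as possible; the page itself does not name a metric, so AER is a choice made here for illustration. The function name `alignment_error_rate` and the link sets are hypothetical.

```python
def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|); lower is better.

    Each argument is an iterable of (source_index, target_index) links.
    Sure links are treated as a subset of the possible links.
    """
    a = set(predicted)
    s = set(sure)
    p = set(possible) | s  # enforce S as a subset of P
    if not a and not s:
        return 0.0  # nothing to predict and nothing predicted
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Toy links, invented purely for illustration.
gold_sure = {(0, 0), (1, 1)}
gold_possible = {(2, 1)}
aer_perfect = alignment_error_rate({(0, 0), (1, 1), (2, 1)},
                                   gold_sure, gold_possible)
aer_partial = alignment_error_rate({(0, 0), (1, 2)},
                                   gold_sure, gold_possible)
```

Predicting a possible link is not penalized, while missing a sure link or predicting a link outside both sets raises the error.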

Proposed by: Wang Ling