Word Alignments using an HMM-based model

Summary

Word alignments are an important notion introduced in word-based Machine Translation, and are commonly employed in phrase-based Machine Translation. In parallel corpora, sentences in different languages are aligned sentence by sentence rather than word by word, so it is not trivial to fragment a sentence pair into smaller translation units. Word alignments map each word in the source sentence to an equivalent word in the target sentence.

The goal of this project is to implement a Word Alignment Model where the relative word distortion is modeled using a Hidden Markov Model. This task is similar to [http://dl.acm.org/citation.cfm?id=993313]. This model will be used as the baseline.
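
As a rough illustration of how an HMM models relative distortion, the sketch below computes a Viterbi alignment in which the transition score depends only on the jump width between the source positions of consecutive target words. The <code>t_prob</code> and <code>jump_prob</code> tables are hypothetical inputs that would come from EM training; they are not part of the original write-up.

<pre>
import numpy as np

def viterbi_align(src, tgt, t_prob, jump_prob):
    """Most likely alignment of each target word to a source position.

    t_prob[f][e]  -- lexical translation probability p(f | e)
    jump_prob[d]  -- probability of a jump of signed width d between the
                     source positions of consecutive target words
    Both tables are assumed to come from EM training of the HMM model.
    """
    I, J = len(src), len(tgt)
    delta = np.full((J, I), -np.inf)   # best log-prob ending in state i at step j
    back = np.zeros((J, I), dtype=int)
    for i in range(I):                 # uniform initial state distribution
        delta[0, i] = np.log(t_prob[tgt[0]].get(src[i], 1e-12))
    for j in range(1, J):
        for i in range(I):
            scores = [delta[j - 1, k] + np.log(jump_prob.get(i - k, 1e-12))
                      for k in range(I)]
            back[j, i] = int(np.argmax(scores))
            delta[j, i] = scores[back[j, i]] + np.log(t_prob[tgt[j]].get(src[i], 1e-12))
    path = [int(np.argmax(delta[J - 1]))]
    for j in range(J - 1, 0, -1):      # trace back the best state sequence
        path.append(int(back[j, path[-1]]))
    return list(reversed(path))        # path[j] = source position of tgt[j]
</pre>

Replacing the max with a sum (i.e. running forward-backward instead of Viterbi) yields the alignment posteriors needed in the E-step of EM training.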

We will extend the HMM-based word-to-phrase alignment model to phrase-to-phrase alignments in a way similar to the model in [[Bansal_et_al,_ACL_2011]]. One problem with phrase-to-phrase alignment models is their intractability: the space of latent segmentations and alignments whose expectations must be computed during the E-step in [[Expectation Maximization | EM]] is very large. Another problem is model degeneration: the model is biased towards longer phrases, rather than combining shorter phrases to form longer ones, since longer phrases incur smaller distortion and generation penalties. Previously, the first problem has been addressed using [[Gibbs sampling | Gibbs Sampling]], and the second has been dealt with by defining a [[Dirichlet distribution]] over the phrase pair distribution.
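
To see why the E-step becomes intractable, note that a length-n sentence alone admits 2^(n-1) phrase segmentations, and the joint space pairs segmentations of both sentences with an alignment between the resulting phrases. A small counting sketch (illustrative only; the phrase-length limit is an assumed parameter):

<pre>
from functools import lru_cache

def num_segmentations(n, max_phrase_len):
    """Count the ways to segment a length-n sentence into phrases of at
    most max_phrase_len words (equals 2**(n-1) when max_phrase_len >= n)."""
    @lru_cache(maxsize=None)
    def f(i):
        if i == n:
            return 1
        return sum(f(i + l) for l in range(1, min(max_phrase_len, n - i) + 1))
    return f(0)

print(num_segmentations(20, 20))  # 524288 = 2**19, for one side alone
</pre>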

We will attempt to use [[Posterior Regularization]] to address these two problems. First, we will define constraints that limit the search space of possible latent variables during the E-step by excluding unlikely alignments and segmentations. Second, we will try to avoid the degenerate behavior of phrase-to-phrase models by defining constraints so that longer phrases are only selected when their expectations are high enough.
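
In Posterior Regularization, the E-step replaces the model posterior p with its KL-projection onto a constraint set, typically of the form {q : E_q[phi(z)] &lt;= b}. The projection has the closed form q(z) proportional to p(z) exp(-lam * phi(z)) with lam &gt;= 0 chosen via the dual. The sketch below finds lam by projected gradient ascent for a discrete posterior; the feature phi (e.g. phrase length) and the bound b are illustrative assumptions, not values from the original proposal.

<pre>
import numpy as np

def pr_project(log_p, phi, b, lr=0.5, iters=500):
    """KL-project a discrete posterior onto {q : E_q[phi] <= b}.

    log_p -- log-probabilities of the model posterior over latent values z
    phi   -- feature phi(z) per latent value (e.g. candidate phrase length)
    b     -- bound on the expected feature value
    """
    lam = 0.0
    for _ in range(iters):
        logits = log_p - lam * phi
        q = np.exp(logits - logits.max())
        q /= q.sum()
        # dual gradient is E_q[phi] - b; keep lam non-negative
        lam = max(0.0, lam + lr * (q @ phi - b))
    logits = log_p - lam * phi
    q = np.exp(logits - logits.max())
    return q / q.sum()

# Toy usage: E_p[phi] = 1.7 violates b = 1.5, so mass shifts to small phi.
p = np.array([0.5, 0.3, 0.2])
phi = np.array([1.0, 2.0, 3.0])
print(pr_project(np.log(p), phi, b=1.5))
</pre>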

The quality of the alignments can be tested against a gold standard corpus in which the Word Alignments are produced by human linguists; one example is the Hansards corpus [2]. Another evaluation method is to use the produced alignments in a Machine Translation system and verify that translation quality improves when the improved alignments are used.
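
The standard intrinsic metric for comparing predicted alignments against such a gold standard is Alignment Error Rate (AER), which scores a predicted link set A against gold sure links S and possible links P (with S a subset of P). A minimal sketch:

<pre>
def aer(predicted, sure, possible):
    """Alignment Error Rate: 1 - (|A&S| + |A&P|) / (|A| + |S|).

    predicted, sure, possible -- sets of (source_index, target_index) links;
    every sure link also counts as possible (S is a subset of P).
    """
    a = set(predicted)
    p = set(possible) | set(sure)
    return 1.0 - (len(a & set(sure)) + len(a & p)) / (len(a) + len(sure))
</pre>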

Baseline

We will use a traditional pipeline for phrase-based Machine Translation. We will build the Word Alignments and the Translation Models using the Geppetto toolkit, then tune the parameters of our model using MERT (Minimum Error Rate Training) and decode using Moses.

Proposed by: Wang Ling