Unsupervised Word Alignment with Arbitrary Features


Citation

Unsupervised Word Alignment with Arbitrary Features, by C. Dyer, J. Clark, A. Lavie, N.A. Smith. In Proceedings of ACL, 2011.

This paper is available online [1].

Summary

The authors present a discriminatively trained, globally normalized log-linear model of lexical translation in a statistical machine translation system (i.e., word alignment). Unlike previous discriminative approaches, the authors do not need supervised or gold-standard word alignments; the only supervision comes from bilingual parallel sentence corpora. And unlike generative approaches that use EM for unsupervised word alignment, the proposed method can incorporate arbitrary, overlapping features that help improve word alignment. The authors compare their model to IBM Model 4: they propose two intrinsic metrics, and also evaluate the end-to-end performance of their system with BLEU, METEOR, and TER, finding improvements.

Approach

The proposed model aims to maximize the conditional likelihood p(t | s), i.e., the probability of a target sentence t given a source sentence s. The authors include a random variable n for the length of the target sentence, and thus write p(t | s) = p(n | s) · p(t | s, n), i.e., decomposing the likelihood into two models, a translation model and a length model. We introduce a latent variable a for the alignment in the translation model, i.e., p(t | s, n) = Σ_a p(t, a | s, n). Unlike Brown et al.'s (1993) version, where p(t, a | s, n) is further broken down with independence assumptions, the authors use a log-linear model to model it directly: p(t, a | s, n) = exp(θ · H(t, a, s, n)) / Z(s, n), where H is a feature function vector depending on the alignment, the source and target sentences, and the length of the target sentence, and Z(s, n) is the partition function, which under reasonable assumptions is finite. The length model is irrelevant here, since the model is used for alignment, where both source and target lengths are observed and completely determined. To make things tractable, the feature function vector is assumed to decompose linearly over the cliques of the graph formed by the random variables, similar to the original CRF formulation (wiki writeup link).
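
To make this concrete, below is a minimal, brute-force sketch of such a log-linear model over (target string, alignment) pairs, with a toy word-association feature and a tiny vocabulary so that the partition function Z(s, n) can be enumerated exactly. The function and feature names are illustrative, not the authors' implementation.

```python
import itertools
import math

def feats(src, tgt, a):
    """Toy stand-in for the feature vector H(t, a, s, n): word-association
    indicator counts for each aligned (source word, target word) pair."""
    out = {}
    for j, i in enumerate(a):                      # target position j aligned to source position i
        key = ("assoc", src[i], tgt[j])
        out[key] = out.get(key, 0.0) + 1.0
    return out

def weight(theta, f):
    """Unnormalized log-linear score exp(theta . H)."""
    return math.exp(sum(theta.get(k, 0.0) * v for k, v in f.items()))

def partition(theta, src, vocab, n):
    """Z(s, n): sum over all length-n target strings from `vocab` and all
    alignments; exponential, so only feasible for toy examples."""
    return sum(weight(theta, feats(src, t, a))
               for t in itertools.product(vocab, repeat=n)
               for a in itertools.product(range(len(src)), repeat=n))

def p_t_a_given_s(theta, src, tgt, a, vocab):
    """p(t, a | s, n) = exp(theta . H(t, a, s, n)) / Z(s, n)."""
    return weight(theta, feats(src, tgt, a)) / partition(theta, src, vocab, len(tgt))

# Tiny example: with positive weights on the correct word pairs, the correct
# target string and alignment receive most of the probability mass.
theta = {("assoc", "la", "the"): 2.0, ("assoc", "maison", "house"): 2.0}
src, vocab = ["la", "maison"], ["the", "house"]
print(p_t_a_given_s(theta, src, ["the", "house"], (0, 1), vocab))
```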

To learn parameters, we add a regularization term and maximize the conditional log-likelihood of the training data. The gradient boils down to an expected difference between feature function values: in one term the alignment variable a varies over all possible alignments while the observed target sentence is held fixed, and in the other both the translation output t and the alignment a vary over all possible values, i.e., the gradient for each weight is E_{p(a | s, t)}[H_k] − E_{p(t, a | s)}[H_k], plus the gradient of the regularizer. Training is an expensive process, though, since it involves discriminating against this full set of alternatives (the discriminative neighborhood); although the neighborhood can be pruned (see the next paragraph), other techniques can also be used to keep it manageable, e.g., contrastive estimation (wiki link here).
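
The sketch below illustrates this gradient by brute-force enumeration on a toy example: the expectation with the observed target fixed (only the alignment varies) minus the expectation with both the target string and the alignment free, together with an assumed quadratic regularization penalty. The names and the exact penalty are illustrative, not taken from the paper.

```python
import itertools
import math

def feats(src, tgt, a):
    """Toy feature map H: word-association indicator counts for aligned pairs."""
    out = {}
    for j, i in enumerate(a):
        key = ("assoc", src[i], tgt[j])
        out[key] = out.get(key, 0.0) + 1.0
    return out

def expected_feats(theta, configs):
    """Expectation of H under the distribution proportional to exp(theta . H)
    over a finite list of feature maps."""
    scores = [math.exp(sum(theta.get(k, 0.0) * v for k, v in f.items())) for f in configs]
    Z = sum(scores)
    exp = {}
    for f, s in zip(configs, scores):
        for k, v in f.items():
            exp[k] = exp.get(k, 0.0) + (s / Z) * v
    return exp

def gradient(theta, src, tgt, vocab, l2=0.1):
    """Gradient of the regularized conditional log-likelihood for one sentence pair:
    E[H | target observed] - E[H | target and alignment free] - l2 * theta.
    The quadratic penalty is an assumed, illustrative choice of regularizer."""
    n, m = len(tgt), len(src)
    clamped = [feats(src, tgt, a) for a in itertools.product(range(m), repeat=n)]
    free = [feats(src, t, a)
            for t in itertools.product(vocab, repeat=n)
            for a in itertools.product(range(m), repeat=n)]
    e_c, e_f = expected_feats(theta, clamped), expected_feats(theta, free)
    keys = set(e_c) | set(e_f) | set(theta)
    return {k: e_c.get(k, 0.0) - e_f.get(k, 0.0) - l2 * theta.get(k, 0.0) for k in keys}

# At theta = 0, word pairs seen in the observed target get positive gradient
# relative to pairs involving other vocabulary items.
g = gradient({}, ["la", "maison"], ["the", "house"], ["the", "house", "chat"])
print(sorted(g.items()))
```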

Inference is done by modeling the translation search space as a WFSA (weighted finite-state automaton). The forward-backward algorithm is used to compute the expectation terms in the gradient. However, the WFSA is very large, especially since one of the expectations is taken with the translation output allowed to vary. Thus, the number of edges in the WFSA needs to be reduced, and the authors try four different ways of doing this, ranging from restricting the set of possible translation outputs to only those target words that co-occur in sentence pairs containing the source sentence, to using empirical Bayes and variational inference to prune.
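
As a rough illustration of how such expectations are computed without enumeration, here is a minimal forward-backward sketch over a small fully connected alignment lattice (one state per source position at each target position). The lattice layout and the edge weights are simplifications for illustration, not the authors' WFSA construction.

```python
def forward_backward(start, trans):
    """Forward-backward on a small alignment lattice.
    start[i]        : weight of aligning target position 0 to source position i
    trans[j][ip][i] : weight of aligning target position j+1 to source position i,
                      given position j was aligned to ip
    Returns the partition value Z and posterior edge probabilities."""
    m = len(start)
    n = len(trans) + 1
    alpha = [list(start)] + [[0.0] * m for _ in range(n - 1)]
    for j in range(1, n):
        for i in range(m):
            alpha[j][i] = sum(alpha[j - 1][ip] * trans[j - 1][ip][i] for ip in range(m))
    beta = [[0.0] * m for _ in range(n)]
    beta[n - 1] = [1.0] * m
    for j in range(n - 2, -1, -1):
        for ip in range(m):
            beta[j][ip] = sum(trans[j][ip][i] * beta[j + 1][i] for i in range(m))
    Z = sum(alpha[n - 1])
    edge_post = [[[alpha[j - 1][ip] * trans[j - 1][ip][i] * beta[j][i] / Z
                   for i in range(m)] for ip in range(m)]
                 for j in range(1, n)]
    return Z, edge_post

# Toy lattice: 2 source positions, 3 target positions. In the real model each
# edge weight would be exp(theta . h(edge)); here the numbers simply favor a
# monotone path. Expected feature values are sums of edge posteriors times the
# local feature values on those edges.
start = [2.0, 1.0]
trans = [[[2.0, 1.0], [1.0, 2.0]],        # transition weights into position 1
         [[2.0, 1.0], [1.0, 2.0]]]        # transition weights into position 2
Z, post = forward_backward(start, trans)
print(Z, post[0][0][0])                   # posterior mass on staying at source position 0
```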

A significant component of this paper is the set of features the authors include to help elicit good word alignments. On top of word association features (indicator functions for word pairs), they also use orthographic features (which are obviously of limited use for Chinese-English, one of the language pairs they used), positional features (i.e., how far an aligned word pair lies from the alignment matrix diagonal), source features (to capture words in the source language that are purely functional and have no lexical content), source path features (features that assess the goodness of the alignment path through the source sentence, e.g., measuring jumps in the sequential alignment of source positions), and target string features (e.g., whether a word translates to itself). Lastly, the authors also include high-level, coarse features, for example the Model 1 probabilities and the Dice coefficient.
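
A rough sketch of what a per-link feature extractor along these lines might look like is given below; the concrete definitions (character-overlap orthography, normalized distance from the diagonal, a precomputed Dice table) are illustrative stand-ins rather than the paper's exact features.

```python
def dice(cooc, c_src, c_tgt):
    """Dice coefficient from corpus counts: 2 * c(s, t) / (c(s) + c(t))."""
    return 2.0 * cooc / (c_src + c_tgt) if (c_src + c_tgt) else 0.0

def link_features(src, tgt, i, j, dice_table=None):
    """Features for linking source position i to target position j.
    Feature names and definitions are illustrative, not the paper's exact set."""
    s, t = src[i], tgt[j]
    f = {("assoc", s, t): 1.0}                                   # word-association indicator
    if s == t:
        f[("self_translation",)] = 1.0                           # target word equals source word
    f[("orth_overlap",)] = len(set(s) & set(t)) / max(len(s), len(t))  # crude orthographic similarity
    f[("diag_distance",)] = abs(i / len(src) - j / len(tgt))     # distance from the alignment-matrix diagonal
    if dice_table is not None:
        f[("dice",)] = dice_table.get((s, t), 0.0)               # coarse, precomputed association score
    return f

src, tgt = ["la", "maison", "bleue"], ["the", "blue", "house"]
print(link_features(src, tgt, 1, 2, dice_table={("maison", "house"): 0.8}))
```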

Baseline & Results

The authors tested their model on three language pairs: Chinese-English (travel/tourism domain), Czech-English (news commentary), and Urdu-English (NIST 2009 OpenMT evaluation); each pair poses distinct issues for SMT. English was treated as both source and target in the experiments. For the baseline, GIZA++ was used to learn Model 4, and the alignments were symmetrized with the grow-diag-final-and heuristic. Gold-standard word alignments were available only in the Czech-English case. The authors also propose two additional intrinsic measures: the average alignment fertility of source words that occur only once in the training data (Model 4 has a tendency to align many target words to such singletons), and the number of rule types learned in grammar induction that match the translation test sets, which suggests better coverage.
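
A small sketch of the first intrinsic measure, the average fertility of singleton source words, is given below; the data representation (alignments as sets of (source index, target index) pairs) is an assumption for illustration.

```python
from collections import Counter

def singleton_fertility(corpus_src, corpus_align):
    """Average number of target words aligned to source words that occur exactly
    once in the training data. Model 4 tends to give such singletons very high
    fertility, so lower values usually indicate healthier alignments.
    corpus_align[k] is a set of (source index, target index) pairs for sentence k."""
    counts = Counter(w for sent in corpus_src for w in sent)
    links, singletons = 0, 0
    for src, align in zip(corpus_src, corpus_align):
        fert = Counter(i for i, _ in align)
        for i, w in enumerate(src):
            if counts[w] == 1:
                singletons += 1
                links += fert[i]
    return links / singletons if singletons else 0.0

corpus_src = [["la", "maison"], ["un", "chat"]]
corpus_align = [{(0, 0), (1, 1)}, {(0, 0), (1, 1), (1, 2)}]
print(singleton_fertility(corpus_src, corpus_align))   # every source word here is a singleton
```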

Czech-English results: [Czen.png]

Chinese-English results: [Zhen.png]

Urdu-English results: [Uren.png]

Overall, across the board, the proposed model improves upon Model 4 in terms of AER (where available), the average fertility of singleton source words, and the number of rule types learned that match the test sets. The proposed model also improves upon Model 4 on the extrinsic measures, and the most noticeable gains come when Model 4 and the proposed model are combined for translation. An analysis of the highest-weighted features for each language pair shows that these weights clearly reflect the characteristics of the individual language pairs.

Related Work

Generative model-based approaches to word alignment rely primarily on the IBM Models of Brown et al., CL 1993. These models make a host of independence assumptions, which limits the ways features and additional information can be incorporated. Discriminative model-based word alignment approaches, e.g., Taskar et al., HLT/EMNLP 2005, need gold-standard word alignments for training, which are very difficult to obtain in practice.