Citation
Unsupervised Word Alignment with Arbitrary Features, by C. Dyer, J. Clark, A. Lavie, N.A. Smith. In Proceedings of ACL, 2011.
This paper is available online [1].
Summary
The authors present a discriminatively trained, globally normalized log-linear model for lexical translation in a statistical machine translation system (i.e., word alignment). Unlike previous discriminative approaches, the authors do not need supervised or gold-standard word alignments; the only supervision is a bilingual parallel sentence corpus. And unlike generative approaches that use EM for unsupervised word alignment, the proposed method can incorporate arbitrary, overlapping features that help improve word alignment. The authors compare their model to IBM Model 4: they propose two intrinsic metrics, and also evaluate the end-to-end performance of their system with BLEU, METEOR, and TER, finding improvements.
Approach
The proposed model aims to maximize the conditional likelihood <math>p(t | s)</math>, i.e., the probability of a target sentence given a source sentence. The authors include a random variable for the length of the target sentence, and thus write <math>p(t|s) = p(t, n | s) = p(t | s, n) \times p(n|s)</math>, decomposing the likelihood into two models, a translation model and a length model. A latent variable for the alignment is introduced into the translation model, i.e., <math>p(t | s, n) = \sum_a p(t, a | s, n)</math>. Unlike Brown et al.'s (1993) version, where <math>p(t, a | s, n)</math> is further broken down, the authors model it directly with a log-linear model: <math>p_{\theta} (t, a | s, n) = \frac{\exp(\theta^T H(t, a, s, n))}{Z_\theta (s,n)}</math>, where <math>H</math> is a feature function vector that depends on the alignment, the source and target sentences, and the length of the target sentence, and <math>Z_{\theta} (s,n)</math> is the partition function, which is finite under reasonable assumptions. The length model is irrelevant here, since the model is used for alignment, where both source and target lengths are observed and completely determined. To make things tractable, the feature function vector is assumed to decompose linearly over the cliques of the graph formed by the random variables (similar to the original CRF formulation).
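To make the form of the model concrete, here is a minimal sketch (not the authors' implementation): it scores a configuration with a sparse feature dictionary and normalizes by brute force over a toy space of alignment vectors, which is all that is needed when both sentences are observed. The function <code>feature_fn</code> stands in for the feature vector <math>H</math> and is an assumption of this sketch.

<syntaxhighlight lang="python">
import math
from itertools import product


def log_linear_score(theta, features):
    """Unnormalized log score theta^T H(t, a, s, n), using sparse dicts."""
    return sum(theta.get(name, 0.0) * value for name, value in features.items())


def enumerate_alignments(src_len, tgt_len):
    """All alignment vectors a, where a[j] indexes a source word (0 = NULL)."""
    return list(product(range(src_len + 1), repeat=tgt_len))


def p_alignment_given_sentences(theta, s, t, feature_fn):
    """p_theta(a | s, t, n) for every alignment a, by brute-force
    normalization over the (tiny) space of alignment vectors."""
    n = len(t)
    alignments = enumerate_alignments(len(s), n)
    scores = [log_linear_score(theta, feature_fn(t, a, s, n)) for a in alignments]
    m = max(scores)
    weights = [math.exp(x - m) for x in scores]
    z = sum(weights)
    return {a: w / z for a, w in zip(alignments, weights)}
</syntaxhighlight>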
To learn the parameters, the authors add an <math>\ell_1</math> regularization term and maximize the conditional log-likelihood of the training data. The gradient, as in CRFs, boils down to the difference between two expected feature vectors: in one, the alignment random variable vector <math>a</math> varies over all possible alignments (with <math>t</math> observed), and in the other, both the translation output <math>t</math> and <math>a</math> vary over all possible values:

<math>\frac{\partial \mathcal{L}}{\partial \theta} = \sum_{\langle s, t \rangle \in \mathcal{T}} \mathbb{E}_{p_{\theta}(a | s, t, n)} [H(\cdot)] - \mathbb{E}_{p_{\theta}(t, a | s, n)} [H(\cdot)]</math>
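As a rough illustration of this gradient, the following sketch computes the clamped and free expectations by brute force over explicitly enumerated outcome sets; this is only a didactic stand-in for the WFSA-based computation the paper actually uses, and the handling of the <math>\ell_1</math> penalty is left out.

<syntaxhighlight lang="python">
import math
from collections import defaultdict


def expected_features(theta, outcomes, feature_fn):
    """E_p[H(.)] under the log-linear model over an explicit list of
    outcomes; each outcome is a (t, a) pair and feature_fn returns a
    sparse feature dict."""
    scores = [sum(theta.get(f, 0.0) * v for f, v in feature_fn(t, a).items())
              for t, a in outcomes]
    m = max(scores)
    weights = [math.exp(x - m) for x in scores]
    z = sum(weights)
    expectation = defaultdict(float)
    for (t, a), w in zip(outcomes, weights):
        for f, v in feature_fn(t, a).items():
            expectation[f] += (w / z) * v
    return expectation


def gradient_for_pair(theta, clamped, free, feature_fn):
    """Gradient contribution of one sentence pair: clamped expectation
    (t fixed, a varies) minus free expectation (t and a both vary).
    The l1 penalty would be applied separately, e.g. by truncation."""
    e_clamped = expected_features(theta, clamped, feature_fn)
    e_free = expected_features(theta, free, feature_fn)
    return {f: e_clamped.get(f, 0.0) - e_free.get(f, 0.0)
            for f in set(e_clamped) | set(e_free)}
</syntaxhighlight>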
Inference is done by modeling the translation search space as a WFSA. The forward-backward algorithm is used to compute the expectation terms in the gradient. However, the WFSA is very large, especially since one of the expectations is taken over a space in which the translation output also varies. The number of edges in the WFSA therefore needs to be reduced, and the authors try four different ways of doing this, ranging from restricting the set of possible translation outputs to the target words co-occurring in sentence pairs containing the source sentence, to pruning with an empirical Bayes method and variational inference.
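A generic sketch of how forward-backward yields the arc posteriors needed for these expectations is given below; it assumes a toy acyclic lattice representation and does not reproduce the paper's actual WFSA machinery or its pruning strategies.

<syntaxhighlight lang="python">
from collections import defaultdict


def forward_backward(lattice):
    """Arc posteriors in a small acyclic lattice standing in for the WFSA.
    lattice[j] is a list of (from_state, to_state, weight) arcs consumed at
    target position j; weights are exp(theta^T h(arc)). Returns, for each
    position, a list of (from_state, to_state, posterior) triples plus Z."""
    T = len(lattice)
    alpha = [defaultdict(float) for _ in range(T + 1)]
    alpha[0][0] = 1.0                        # single start state 0
    for j in range(T):                       # forward pass
        for q, r, w in lattice[j]:
            alpha[j + 1][r] += alpha[j][q] * w
    beta = [defaultdict(float) for _ in range(T + 1)]
    for r in alpha[T]:                       # reachable final states are accepting
        beta[T][r] = 1.0
    for j in reversed(range(T)):             # backward pass
        for q, r, w in lattice[j]:
            beta[j][q] += w * beta[j + 1][r]
    z = sum(alpha[T][r] for r in alpha[T])   # partition function
    posteriors = []
    for j in range(T):
        posteriors.append([(q, r, alpha[j][q] * w * beta[j + 1][r] / z)
                           for q, r, w in lattice[j]])
    return posteriors, z
</syntaxhighlight>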
A significant component of this paper is the set of features the authors include to elicit good word alignments. On top of word association features (indicator functions for word pairs), they use orthographic features (of limited use for Chinese-English, one of the language pairs they evaluate), positional features (how far an alignment link is from the diagonal of the alignment matrix), source features (to capture source-language words that are purely functional and have no lexical content), source path features (which assess the goodness of the alignment path through the source sentence, e.g., by measuring jumps in the sequential alignment), and target string features (e.g., whether a word translates to itself). A toy feature extractor in this spirit is sketched below.
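The following sketch illustrates the flavor of these features; the feature names and templates are this sketch's own and not the paper's exact templates.

<syntaxhighlight lang="python">
from collections import defaultdict


def alignment_features(t, a, s, n):
    """Toy sparse feature map H(t, a, s, n). a[j] indexes a source word,
    with 0 meaning the NULL word; feature names are illustrative only."""
    feats = defaultdict(float)
    for j, i in enumerate(a):
        if i == 0:
            feats["null_alignment"] += 1.0
            continue
        src, tgt = s[i - 1], t[j]
        feats["assoc:%s|%s" % (src, tgt)] += 1.0   # word association indicator
        if src.lower() == tgt.lower():
            feats["identical_word"] += 1.0         # orthographic / self-translation
        # positional feature: distance from the alignment-matrix diagonal
        feats["diagonal_distance"] += abs(j / max(n, 1) - (i - 1) / max(len(s), 1))
    return dict(feats)
</syntaxhighlight>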
Baseline & Results
The authors tested their model on three language pairs: Chinese-English (travel/tourism domain), Czech-English (news commentary), and Urdu-English (NIST 2009 OpenMT evaluation). The three language pairs each present distinct issues for SMT. English was treated as both source and target in the experiments. For the baseline, Giza++ was used to learn Model 4, and alignments were symmetrized with the grow-diag-final-and heuristic. Gold-standard word alignments were available only for Czech-English. The authors also propose two additional intrinsic measures: the average alignment fertility of source words that occur only once in the training data (Model 4 tends to align many target words to such words), and the number of rule types learned in grammar induction that match the translation test sets, which suggests better coverage; a sketch of the first measure follows.
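A minimal sketch of the singleton-fertility measure, assuming alignments are given as lists of (source index, target index) links per sentence pair (the data structures here are this sketch's assumption, not the paper's):

<syntaxhighlight lang="python">
from collections import Counter


def singleton_fertility(sources, alignments):
    """Average number of target words aligned to source word tokens whose
    type occurs exactly once in the training data."""
    type_counts = Counter(w for sentence in sources for w in sentence)
    links_to_singletons, singleton_tokens = 0, 0
    for sentence, links in zip(sources, alignments):
        for i, w in enumerate(sentence):
            if type_counts[w] == 1:
                singleton_tokens += 1
                links_to_singletons += sum(1 for si, tj in links if si == i)
    return links_to_singletons / singleton_tokens if singleton_tokens else 0.0
</syntaxhighlight>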
Czech-English results:
Chinese-English results:
Urdu-English results:
Related Work
Generative model-based approaches to word alignment primarily rely on the IBM Models of Brown et al. (CL 1993). These models make a host of independence assumptions, which limit the ways features and additional information can be incorporated. Discriminative word alignment approaches, e.g., Taskar et al. (HLT/EMNLP 2005), need gold-standard word alignments for training, which are very difficult to obtain in practice.