Dyer et al, ACL 2011

Being edited by Rui Correia

Citation

C. Dyer, J. Clark, A. Lavie, and N. A. Smith. 2011. Unsupervised Word Alignment with Arbitrary Features. In Proceedings of ACL-HLT 2011, Volume 1, pp. 409–419.

Summary

In this paper, the authors address the word alignment problem in an unsupervised fashion, removing the need to manually develop a gold standard, which is difficult and expensive to create and dependent on the task at hand, especially for resource-scarce languages. The model they introduce is discriminatively trained and globally normalized; it is a variant of IBM Model 1 that allows the incorporation of non-independent features.

The paper focuses on the proposed model and on the features used to generate the word alignments. The authors report results for several language pairs, comparing their approach with IBM Model 4 in terms of BLEU, METEOR and TER scores. They also examine how the different language pairs make different use of the designed features, and how these preferences reflect properties of each language.

Model

The proposed conditional model assigns probabilities to a target sentence $t$ of length $n$, given a source sentence $s$ of length $m$. Using the chain rule, the authors factor $p(t|s)$ into a translation model $p(t|s,n)$ and a length model $p(n|s)$, i.e.,

$p(t|s)=p(t,n|s)=p(t|s,n)\times p(n|s)$

For the translation model, the authors assume that each word of the target sentence is the translation of a single word in the source sentence or of a special null token. This introduces a latent alignment variable $a=\langle a_{1},a_{2},...,a_{n}\rangle \in [0,m]^{n}$, where $a_{i}=0$ denotes alignment to the null token, and the translation model marginalizes over it, i.e.,

$p(t|s,n)=\sum _{a}p(t,a|s,n)$
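The marginalization over alignments can be illustrated with a brute-force sketch. The joint distribution used here is a toy stand-in in the style of IBM Model 1 (a uniform alignment prior times lexical probabilities), not the paper's log-linear model, and the table `LEX` and the helper names are hypothetical.

```python
from itertools import product
from math import isclose, prod

NULL = "<null>"

# Hypothetical lexical table p(target word | source word); illustrative values only.
LEX = {
    ("the", "la"): 0.7, ("house", "la"): 0.1,
    ("the", "casa"): 0.1, ("house", "casa"): 0.8,
    ("the", NULL): 0.2, ("house", NULL): 0.1,
}

def all_alignments(m, n):
    """Every a = <a_1, ..., a_n> with a_i in {0, ..., m}; 0 is the null token."""
    return product(range(m + 1), repeat=n)

def toy_joint(t, a, s):
    """Toy p(t, a | s, n): uniform alignment prior times lexical probabilities."""
    src = [NULL] + list(s)  # index 0 is the null token
    m = len(s)
    return prod(LEX.get((t_i, src[a_i]), 0.0) / (m + 1) for t_i, a_i in zip(t, a))

def marginal(t, s):
    """p(t | s, n): sum of p(t, a | s, n) over all (m + 1) ** n alignments."""
    return sum(toy_joint(t, a, s) for a in all_alignments(len(s), len(t)))

s = ["la", "casa"]
t = ["the", "house"]
print(len(list(all_alignments(len(s), len(t)))))  # number of alignments, (m + 1) ** n
print(marginal(t, s))
```

The number of alignments grows as $(m+1)^{n}$, so explicit enumeration like this is only feasible for very short sentences.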

It is at this point that the model diverges from the approach of Brown et al. Instead of decomposing further with the chain rule, the authors propose a log-linear model with parameters $\theta \in \mathbb {R} ^{k}$ and a feature vector function $H$ that maps each tuple $\langle a,s,t,n\rangle$ into $\mathbb {R} ^{k}$, in order to model $p(t,a|s,n)$ directly:

$p_{\theta }(t,a|s,n)={\frac {\exp \theta ^{\top }H(t,a,s,n)}{Z_{\theta }(s,n)}}$

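where $Z_{\theta }$ is the partition function, which sums over every target string $t'$ of length $n$ over the target vocabulary $\Omega$ and every alignment $a'$: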

$Z_{\theta }(s,n)=\sum _{t'\in \Omega ^{n}}\sum _{a'}\exp \theta ^{\top }H(t',a',s,n)$
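To make the global normalization concrete, the following minimal sketch computes $Z_{\theta }(s,n)$ by exhaustive enumeration for a two-word vocabulary. The vocabulary `VOCAB`, the weights `THETA` and the feature function `features` are hypothetical toy choices, and the enumeration is meant only to illustrate the definition, not as a practical inference procedure.

```python
from itertools import product
from math import exp, isclose

NULL = "<null>"
VOCAB = ["the", "house"]  # hypothetical target vocabulary (Omega)

# Hypothetical weights theta, stored sparsely by feature name.
THETA = {"pair:the|la": 1.5, "pair:house|casa": 2.0, "null": -1.0}

def features(t, a, s):
    """Toy H(t, a, s, n): counts of word-pair features and null alignments."""
    src = [NULL] + list(s)  # index 0 is the null token
    h = {}
    for t_i, a_i in zip(t, a):
        key = f"pair:{t_i}|{src[a_i]}"
        h[key] = h.get(key, 0.0) + 1.0
        if a_i == 0:
            h["null"] = h.get("null", 0.0) + 1.0
    return h

def score(t, a, s):
    """Unnormalized score exp(theta^T H(t, a, s, n))."""
    return exp(sum(THETA.get(k, 0.0) * v for k, v in features(t, a, s).items()))

def partition(s, n):
    """Z_theta(s, n): sum over all length-n target strings and all alignments."""
    m = len(s)
    return sum(
        score(t, a, s)
        for t in product(VOCAB, repeat=n)
        for a in product(range(m + 1), repeat=n)
    )

def prob(t, a, s):
    """p_theta(t, a | s, n) = score / Z_theta(s, n)."""
    return score(t, a, s) / partition(s, len(t))

s = ["la", "casa"]
print(prob(("the", "house"), (1, 2), s))
```

Because the normalizer ranges over target strings as well as alignments, features may look at any part of $t$, $a$ and $s$ jointly without breaking normalization; the price is that the sum has $|\Omega |^{n}(m+1)^{n}$ terms when enumerated naively.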

Features

The authors group the features into five categories.

The first group comprises Word Association Features, which are common to all lexical translation models. It contains features such as fine-grained boolean indicators, orthographic similarity measures, and class pair indicator features.

The second group, Positional Features, captures closeness to the diagonal of the alignment matrix. When conjoined with the class indicators from the previous group, these features can represent the typical location of certain word classes in a sentence.

The third group, Source Features, represents the need to translate certain words more than others (e.g., functional elements of the source language).

The fourth group is composed of Source Path Features, which capture typical sentence structure and word ordering. Combined with word classes, for instance, they can represent the usual position of an adjective relative to a noun in a given language.

The last group, Target String Features, captures the phenomenon of multiple values in the predicted target string. For example, one such feature fires when a word translates as itself in a given position but is then translated again as something else in the previous or next position.
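As a rough illustration of a positional feature, the sketch below measures the distance of each alignment point from the diagonal as $|a_{i}/m-i/n|$ and also emits a copy conjoined with a word class. This particular formula, the class map `target_class` and the feature names are assumptions made for the example, not definitions taken from the paper.

```python
def diagonal_distance(i, a_i, n, m):
    """|a_i/m - i/n|: 0 on the alignment-matrix diagonal, larger away from it."""
    return abs(a_i / m - i / n)

def positional_features(t, a, s, target_class):
    """Sparse positional features, plain and conjoined with the target word class."""
    n, m = len(t), len(s)
    h = {}
    for i, (t_i, a_i) in enumerate(zip(t, a), start=1):
        if a_i == 0:  # null-aligned words have no source position
            continue
        d = diagonal_distance(i, a_i, n, m)
        h["diag"] = h.get("diag", 0.0) + d
        key = f"diag&class={target_class[t_i]}"
        h[key] = h.get(key, 0.0) + d
    return h

# Hypothetical word classes for the target words.
target_class = {"the": "DET", "house": "NOUN"}
s = ["la", "casa"]
t = ["the", "house"]

print(positional_features(t, (1, 2), s, target_class))  # monotone alignment
print(positional_features(t, (2, 1), s, target_class))  # swapped alignment
```

With a negative weight on `diag`, a log-linear model would prefer near-monotone alignments, while the class-conjoined copies let that preference differ by word class.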
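One way to picture a source path feature is as a function of consecutive alignment points $(a_{i-1},a_{i})$, for example the jump between them, optionally conjoined with the classes of the two source words. The sketch below follows that reading; the feature templates and the class map `source_class` are hypothetical and chosen only to illustrate the adjective-noun case.

```python
def source_path_features(a, s, source_class):
    """Count jump features over consecutive alignment points (a_{i-1}, a_i)."""
    cls = [None] + [source_class[w] for w in s]  # index 0 is the null token
    h = {}
    for prev, cur in zip(a, a[1:]):
        if prev == 0 or cur == 0:  # skip steps that involve the null token
            continue
        jump = cur - prev
        for key in (f"jump={jump}", f"jump={jump}&classes={cls[prev]}>{cls[cur]}"):
            h[key] = h.get(key, 0) + 1
    return h

# Hypothetical source word classes: Spanish places the adjective after the noun.
source_class = {"casa": "NOUN", "blanca": "ADJ"}
s = ["casa", "blanca"]
a = (2, 1)  # "white" aligns to "blanca", "house" aligns to "casa"

print(source_path_features(a, s, source_class))
```

A positive weight on a feature such as `jump=-1&classes=ADJ>NOUN` would encode that, when reading the target left to right, the path through the source tends to step back from an adjective to the noun before it.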

Experimental Results

To assess their model, the authors used corpora collected from different sources. For the Chinese-English language pair, they used a tourism-domain corpus from [1]. The results in terms of BLEU, METEOR and TER are shown in the table below, where the new model is compared with IBM Model 4 and with a combination of the two.

For the Czech-English and Urdu-English language pairs, the authors used the NIST MT dataset. The results for these two language pairs can be found in the following tables:
