Dyer et al, ACL 2011

Being edited by Rui Correia

Citation

C. Dyer, J. Clark, A. Lavie, and N. A. Smith. 2011. Unsupervised Word Alignment with Arbitrary Features. In Proceedings of HLT-ACL 2011, Volume 1, pp 409–419.

Summary

In this paper the authors address the word alignment problem in an unsupervised fashion. This avoids the need for a manually built gold standard, which is difficult and expensive to create and depends on the task at hand, especially for languages with scarce resources. The model they introduce is discriminatively trained and globally normalized; it is a variant of IBM Model 1 that allows the incorporation of non-independent features.

The paper focuses on the proposed model and on the features used to generate the word alignments. The authors report results for several language pairs, comparing their approach with IBM Model 4 in terms of BLEU, METEOR and TER scores. They also examine how the different language pairs make use of the designed features, and how these preferences reflect properties of each language.

Model

The proposed conditional model assigns probabilities to a target sentence $t$ of length $n$, given a source language sentence $s$ of length $m$. Since $n$ is determined by $t$, $p(t|s)=p(t,n|s)$, and the chain rule lets the authors factor $p(t|s)$ into a translation model $p(t|s,n)$ and a length model $p(n|s)$, i.e.,

$p(t|s)=p(t,n|s)=p(t|s,n)\times p(n|s)$

For the translation model, the authors assume that each word of the target sentence is the translation of a single word in the source sentence or of a special null token. This introduces a latent alignment variable $a=\langle a_{1},a_{2},...,a_{n}\rangle \in [0,m]^{n}$, where $a_{j}=0$ denotes alignment to the null token, which is marginalized out, i.e.,

$p(t|s,n)=\sum _{a}p(t,a|s,n)$
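The sum over alignments can be made concrete with a small sketch. It enumerates every $a \in [0,m]^{n}$ for a toy sentence pair and adds up the joint probabilities. The joint used here is an assumption for illustration only: an IBM Model 1 style stand-in with a uniform alignment prior, not the model proposed in the paper. The lexical table `lex`, the token `NULL` and all function names are hypothetical. Under this stand-in the brute-force sum coincides with a per-word factored form, which the sketch also computes.

```python
from itertools import product
import math

NULL = "<null>"  # hypothetical null token, placed at source position 0


def joint_model1(target, alignment, source, lex):
    """Stand-in for p(t, a | s, n): IBM Model 1 style, uniform alignment prior."""
    padded = [NULL] + list(source)
    prob = 1.0
    for word, a_j in zip(target, alignment):
        # 1 / (m + 1) for choosing position a_j, times a lexical probability
        prob *= lex.get((word, padded[a_j]), 0.0) / len(padded)
    return prob


def marginal_brute_force(target, source, joint, lex):
    """p(t | s, n) as an explicit sum over all (m + 1)^n alignments."""
    m, n = len(source), len(target)
    return sum(
        joint(target, a, source, lex)
        for a in product(range(m + 1), repeat=n)
    )


def marginal_factored(target, source, lex):
    """Same quantity, using the per-word factorization valid for the stand-in."""
    padded = [NULL] + list(source)
    prob = 1.0
    for word in target:
        prob *= sum(lex.get((word, s), 0.0) for s in padded) / len(padded)
    return prob


# Toy data (made-up numbers, for illustration only).
source = ["das", "haus"]
target = ["the", "house"]
lex = {
    ("the", "das"): 0.7, ("house", "das"): 0.1,
    ("the", "haus"): 0.1, ("house", "haus"): 0.8,
    ("the", NULL): 0.2, ("house", NULL): 0.1,
}

brute = marginal_brute_force(target, source, joint_model1, lex)
factored = marginal_factored(target, source, lex)
print(brute, factored)
```

The brute-force sum costs $(m+1)^{n}$ evaluations of the joint. The shortcut in `marginal_factored` is available only because the stand-in joint factorizes over target positions.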

It is at this point that the model diverges from the approach of Brown et al. Instead of applying the chain rule further, the authors propose a log-linear model with parameters $\theta \in \mathbb {R} ^{k}$ and a feature vector function $H$ that maps each tuple $\langle a,s,t,n\rangle$ into $\mathbb {R} ^{k}$, which models $p(t,a|s,n)$ directly: