Lacoste-Julien et al, NAACL 2006

Being edited by Rui Correia

Citation

Simon Lacoste-Julien, Ben Taskar, Dan Klein, and Michael I. Jordan. Word alignment via quadratic assignment. In Human Language Technology–North American Association for Computational Linguistics, New York, NY, 2006. On-line

Summary

In this paper, the authors address two major limitations of the discriminative word alignment method of Taskar et al. (2005): it does not model fertility, and it does not capture first-order interactions.

The authors show how to extend a discriminative approach to word alignment so that it models fertility and captures first-order interactions between the alignments of consecutive words, which increases the expressive power of the approach. The extended model can capture monotonicity, local inversion, and contiguous fertility trends, all of which are highly informative for alignment, while remaining computationally efficient.

Their best model achieves a relative reduction in alignment error rate (AER) of 25% over the basic matching formulation, beating intersected IBM Model 4 without using any compute-intensive features. The authors claim that their contribution is particularly promising for phrase-based systems, since such systems perform better when the alignments have higher recall.

HMM

Baseline Method

The authors base their work on a previous discriminative approach to word alignment by Taskar et al. (2005). In this model, nodes $V^{s}$ and $V^{t}$ correspond to words in the source and target language, respectively, and edges $\varepsilon =\{jk:j\in V^{s},k\in V^{t}\}$ correspond to alignments between words. The edge weights $s_{jk}$ represent the degree to which word $j$ in one sentence can be translated by word $k$ in the other sentence. The predicted alignment is then chosen by maximizing the sum of the edge scores, which can be formulated as the following linear program:

$\max _{0\leq z\leq 1}\sum _{jk\in \varepsilon }s_{jk}z_{jk}$

       $s.t.\sum _{j\in V^{s}}z_{jk}\leq 1,\forall k\in V^{t};$

           $\sum _{k\in V^{t}}z_{jk}\leq 1,\forall j\in V^{s},$

where the $z_{jk}$ are relaxations of the binary variables that indicate whether $j$ is assigned to $k$.
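As a rough illustration of the program above, the following sketch finds the best alignment for a toy sentence pair by enumerating every 0/1 setting of the $z_{jk}$. This is not the authors' solver: it is a brute-force stand-in that relies on the standard fact that the bipartite matching linear program has an integral optimum, so searching only binary solutions gives the same objective value. The function name `best_matching` and the score matrix are hypothetical.

```python
from itertools import product


def best_matching(scores):
    """Return (objective value, aligned pairs) for a tiny sentence pair.

    scores[j][k] plays the role of s_jk. Every 0/1 assignment of the z_jk is
    enumerated, so this is only practical for toy inputs.
    """
    n_src, n_tgt = len(scores), len(scores[0])
    edges = [(j, k) for j in range(n_src) for k in range(n_tgt)]
    best_value, best_edges = 0.0, []  # the empty alignment is always feasible

    for bits in product((0, 1), repeat=len(edges)):
        chosen = [edge for edge, bit in zip(edges, bits) if bit]
        # Constraints: each source word and each target word has at most one edge.
        sources = [j for j, _ in chosen]
        targets = [k for _, k in chosen]
        if len(set(sources)) < len(sources) or len(set(targets)) < len(targets):
            continue
        value = sum(scores[j][k] for j, k in chosen)
        if value > best_value:
            best_value, best_edges = value, chosen

    return best_value, set(best_edges)


# Hypothetical scores for a 3-word source and a 3-word target sentence.
toy_scores = [
    [2.0, 0.5, -1.0],
    [0.25, 1.5, 0.25],
    [-0.5, 0.5, 1.0],
]
value, alignment = best_matching(toy_scores)
print(value, sorted(alignment))
```

With a single source word and two target words, for example `best_matching([[1.0, 0.75]])`, only one edge can be selected, which is the limitation that fertility modeling addresses.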

Fertility

In the baseline model, a word can align to at most one word in the translation, which is inadequate in cases such as English backbone and French épine dorsale. The first approach that comes to mind is to increase the right-hand side of the constraints in the baseline model from $1$ to $D$, where $D$ is the maximum allowed fertility. However, the maximum-weight solutions would then assign every word a fertility of either $0$ or $D$.

Instead, the authors encourage the common case of low fertility and allow higher fertilities only when they are licensed, by introducing penalties for higher fertilities into the model. These penalties are modeled with variables $z_{dj\bullet }$ and $z_{d\bullet k}$, which indicate that node $j$ (or $k$) has fertility of at least $d$. The resulting linear program is:

$\max _{0\leq z\leq 1}\sum _{jk\in \varepsilon }{s_{jk}z_{jk}}-\sum _{j\in V^{s},2\leq d\leq D}{s_{dj\bullet }z_{dj\bullet }}-\sum _{k\in V^{t},2\leq d\leq D}{s_{d\bullet k}z_{d\bullet k}}$

       $s.t.\sum _{j\in V^{s}}z_{jk}\leq 1+\sum _{2\leq d\leq D}{z_{d\bullet k}},\forall k\in V^{t};$

           $\sum _{k\in V^{t}}z_{jk}\leq 1+\sum _{2\leq d\leq D}{z_{dj\bullet }},\forall j\in V^{s},$

where $\sum _{2\leq d\leq D}{s_{dj\bullet }z_{dj\bullet }}$ represents the penalty for fertility of node $j$, where each $s_{dj\bullet }\geq 0$ is the penalty increment from increasing the fertility from $d-1$ to $d$.
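The effect of the penalty increments can be seen on a toy version of the backbone example. The sketch below again enumerates binary alignments rather than solving the linear program, so it only searches integral solutions. It makes two simplifying assumptions that are not in the paper summary: all words share one list of penalty increments, and a word of fertility $g$ pays the $g-1$ cheapest increments (which equals $\sum_{2\leq d\leq g}s_{d}$ when the increments are nondecreasing in $d$). The function name and all numbers are hypothetical.

```python
from itertools import product


def best_alignment_with_fertility(scores, increments):
    """Return (objective value, aligned pairs) under fertility penalties.

    scores[j][k] plays the role of s_jk. increments holds the penalty
    increments for d = 2..D, shared by all words (a simplifying assumption),
    so the maximum fertility is D = len(increments) + 1.
    """
    max_fertility = len(increments) + 1
    cheapest_first = sorted(increments)

    def penalty(degree):
        # A word with `degree` edges must switch on degree - 1 fertility variables.
        return sum(cheapest_first[:max(0, degree - 1)])

    n_src, n_tgt = len(scores), len(scores[0])
    edges = [(j, k) for j in range(n_src) for k in range(n_tgt)]
    best_value, best_edges = 0.0, []  # the empty alignment is always feasible

    for bits in product((0, 1), repeat=len(edges)):
        chosen = [edge for edge, bit in zip(edges, bits) if bit]
        src_deg = [sum(1 for j, _ in chosen if j == node) for node in range(n_src)]
        tgt_deg = [sum(1 for _, k in chosen if k == node) for node in range(n_tgt)]
        degrees = src_deg + tgt_deg
        if max(degrees) > max_fertility:
            continue  # violates the relaxed degree constraints
        value = sum(scores[j][k] for j, k in chosen) - sum(penalty(g) for g in degrees)
        if value > best_value:
            best_value, best_edges = value, chosen

    return best_value, set(best_edges)


# One source word ("backbone") and two target words ("épine", "dorsale").
backbone_scores = [[1.0, 0.75]]
print(best_alignment_with_fertility(backbone_scores, [0.25]))  # cheap second link
print(best_alignment_with_fertility(backbone_scores, [1.0]))   # costly second link
```

In this toy setting, a small increment lets the source word align to both target words, while a large increment makes the single best edge preferable, which is the trade-off the penalties are meant to control.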

First-Order Interaction

Experimental Results

The model was tested on the MUC-6 dataset, a collection of 30 Wall Street Journal documents. The authors compared their model with the best named-entity (NE) system so far, a rule-based system. They also tested their solution on different types of input material (mixed case, upper case and speech form) and on a different language (Spanish). The results are shown in the table below:

In the mixed-case setting, the best rule-based system performed better than the solution proposed in the paper, although the difference is not statistically significant and does not justify the effort of having experts maintain the set of rules. The authors attribute the low score for Spanish to the low quantity and quality (inconsistencies) of the training data for the Spanish model.

Another finding of this work is that 100k words of training data seem to suffice to obtain state-of-the-art results on the NE task.
