Hopkins and May, EMNLP 2011. Tuning as Ranking

Citation

Mark Hopkins and Jonathan May. 2011. Tuning as Ranking. In Proceedings of EMNLP-2011.

Online Version

Tuning as Ranking

Summary

This paper presents a simple and scalable method for statistical machine translation parameter tuning based on the pairwise approach to ranking. This pairwise ranking optimization (PRO) method has advantages over MERT (Och, 2003), as it is not limited to a handful of parameters and can easily handle systems with thousands of features. In addition, unlike recent approaches built on the MIRA algorithm of Crammer et al., 2006 (e.g., Watanabe et al., 2007), PRO is easy to implement.

Method

Although MERT is well understood, easy to implement, and runs quickly, it can behave erratically and does not scale beyond a handful of features. This is a major bottleneck for working with richer feature representations and structures.

Hence, the authors propose an approach to tuning that is simpler than MIRA yet similarly scales to high-dimensional feature spaces. Tuning is treated as a ranking problem (Chen et al., 2009), where the explicit goal is to learn to correctly rank candidate translations. The authors describe a pairwise approach to ranking, in which the ranking problem is reduced to the binary classification task of deciding between candidate translation pairs.

Approach

The goal of tuning is to learn a weight vector $\mathbf{w}$ such that $H_w(p)$ assigns a high score to good translations and a low score to bad ones. $H_w(p)$ is given by:

$H_w(p) = \sum_{i \in I} h_w(i, p(i))$

where the score of candidate translation $j$ for source sentence $i$ is linear in its feature vector: $h_w(i, j) = \mathbf{w} \cdot \mathbf{x}(i, j)$.
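Concretely, scoring a full candidate translation vector under this linear model is just a dot product per sentence, summed over the corpus. A minimal sketch (the feature vectors and weights below are made up for illustration):

```python
# Sketch of the linear scoring model: each candidate translation p(i)
# for source sentence i has a feature vector x(i, p(i)); its score is
# the dot product with the weight vector w, summed over all sentences.

def h(w, x):
    """Score one candidate: h_w(i, j) = w . x(i, j)."""
    return sum(wk * xk for wk, xk in zip(w, x))

def H(w, p, x):
    """Corpus-level score H_w(p) = sum over i of h_w(i, p(i))."""
    return sum(h(w, x[i][p[i]]) for i in range(len(p)))

# Toy example: 2 source sentences, 2 candidates each, 3 features.
x = [
    [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]],  # candidates for sentence 0
    [[0.0, 1.0, 1.0], [2.0, 0.0, 0.5]],  # candidates for sentence 1
]
w = [0.4, -0.3, 0.2]
p = [0, 1]  # pick candidate 0 for sentence 0, candidate 1 for sentence 1

print(H(w, p, x))
```

Tuning searches for the $\mathbf{w}$ that makes the highest-scoring candidates under $H_w$ also the best translations under the gold metric.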


Optimization via Pairwise Ranking

MIRA scales well to high-dimensional candidate spaces, but its architecture is complex and differs considerably from MERT's. PRO instead assumes a gold scoring function $G$ that decomposes across sentences in the following way:

$G(p) = \sum_{i \in I} g(i, p(i))$, where $g(i,j)$ is a local scoring function that scores each candidate translation $e(i,j)$.

In a pairwise ranking approach, the learning task is framed as the classification of candidate pairs into two categories: correctly ordered and incorrectly ordered. For each pair of translation candidates $e(i,j)$ and $e(i,j')$, the model score should agree with the gold score:

$g(i,j) > g(i,j') \Leftrightarrow h_w(i,j) > h_w(i,j')$

Since $h_w(i,j) = \mathbf{w} \cdot \mathbf{x}(i,j)$, this condition simplifies to:

$g(i,j) > g(i,j') \Leftrightarrow \mathbf{w} \cdot (\mathbf{x}(i,j) - \mathbf{x}(i,j')) > 0$

Hence, the optimization problem reduces to binary classification. The generated training data (difference vectors with $\pm 1$ labels) is fed directly to any off-the-shelf tool that returns a linear classifier, yielding a weight vector $\mathbf{w}$ that optimizes the condition above. The exact loss function $l_{s'}(H_w, G)$ optimized depends on the choice of classifier.
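Since the paper leaves the classifier open, the reduction can be sketched with any linear learner; here a plain perceptron stands in for the "off-the-shelf" tool, and the candidate features and gold scores are invented toy data:

```python
# Sketch: reduce pairwise ranking to binary classification. For each
# candidate pair (j, j') of the same sentence, the difference vector
# x(i,j) - x(i,j') is labeled +1 if the gold score prefers j, and the
# reversed difference is labeled -1. A linear classifier then learns w.

def diff(a, b):
    return [ai - bi for ai, bi in zip(a, b)]

def make_pairs(cands):
    """cands: per-sentence list of (feature_vector, gold_score) tuples."""
    data = []
    for sent in cands:
        for xj, gj in sent:
            for xk, gk in sent:
                if gj > gk:  # candidate j is the better one
                    data.append((diff(xj, xk), +1))
                    data.append((diff(xk, xj), -1))
    return data

def perceptron(data, epochs=20):
    """A minimal linear classifier standing in for the off-the-shelf tool."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# Toy data: the first feature correlates with the gold score.
cands = [
    [([1.0, 0.0], 0.9), ([0.0, 1.0], 0.2)],
    [([0.8, 0.1], 0.7), ([0.1, 0.9], 0.1)],
]
w = perceptron(make_pairs(cands))
# w should now rank the first candidate of each sentence above the second
```

Swapping in a regularized logistic-regression or SVM solver changes only the loss $l_{s'}(H_w, G)$ being optimized, not the reduction itself.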

Scalability

Standard approaches to pairwise ranking enumerate all difference vectors as training data. For tuning, this typically means on the order of $|I| \cdot J_{max}^2$ vectors, where $J_{max}$ is the cardinality of the largest candidate set $J(i)$. For realistic tuning sets this number is intractably large, so a sampler is used to draw from the space of difference vectors in order to make training tractable.

For each source sentence $i$, the sampler generates $\Gamma$ candidate translation pairs $\langle j, j' \rangle$, and accepts each pair with probability $\alpha_i(|g(i,j) - g(i,j')|)$. Among the accepted pairs, it keeps the $\Xi$ with greatest $g$ differential, and adds their difference vectors to the training data. The sampling algorithm is outlined in the figure below:

[Figure: pseudocode of the pair-sampling algorithm]
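The sampler can be sketched as follows; the step-function acceptance probability, threshold, and parameter values below are illustrative choices, not the paper's exact settings:

```python
import random

# Sketch of the pair sampler for one source sentence i: draw gamma
# candidate pairs uniformly, accept each with a probability alpha that
# depends on the gold-score gap (a step function here, as an
# illustrative choice), and keep the xi accepted pairs with the
# largest gap as training examples.

def sample_pairs(scores, gamma=5000, xi=50, threshold=0.05, seed=0):
    """scores: list of gold scores g(i, j), one per candidate j."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(gamma):
        j = rng.randrange(len(scores))
        k = rng.randrange(len(scores))
        gap = abs(scores[j] - scores[k])
        # alpha_i(gap): accept only pairs whose gold scores clearly differ
        if gap > threshold:
            accepted.append((gap, j, k))
    accepted.sort(reverse=True)  # largest gold differential first
    return [(j, k) for _, j, k in accepted[:xi]]

kept = sample_pairs([0.1, 0.5, 0.52, 0.9], gamma=200, xi=10)
print(len(kept))  # at most 10 pairs, ordered by gold-score gap
```

Filtering out near-ties keeps the classifier focused on pairs whose ordering the gold metric is confident about.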

Features

For each system, the authors identify two feature sets: baseline, which corresponds to the typical small feature set reported in current MT literature, and extended, a superset of baseline that adds hundreds or thousands of features. They use 15 baseline features for phrase-based MT, similar to the baseline features described by Watanabe et al., 2007, and 19 baseline features for syntax-based MT, similar to the baseline features described by Chiang et al., 2008.

They use the following feature classes for both syntax-based and phrase-based extended scenarios:

  • Discount features for rule frequency bins
  • Target word insertion features


Syntax-based MT specific features:

  • Rule overlap features
  • Node count features


Phrase-based MT specific features:

  • Unigram word pair features for the 80 most frequent words in both languages plus tokens for unaligned and all other words.
  • Source, target, and joint phrase length features from 1 to 7, e.g. "tgt=4", "src=2", and "src/tgt=2,4".
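The phrase length features above can be sketched as simple indicator counts; the feature names follow the examples given in the text ("tgt=4", "src=2", "src/tgt=2,4"), while the function and its cap handling are assumptions for illustration:

```python
from collections import Counter

# Sketch of the phrase length features: for each phrase pair used in a
# derivation, fire indicator features for the source length, target
# length, and joint length, with lengths capped at 7 as in the text.

def phrase_length_features(phrase_pairs, max_len=7):
    """phrase_pairs: list of (source_tokens, target_tokens) pairs."""
    feats = Counter()
    for src, tgt in phrase_pairs:
        s = min(len(src), max_len)
        t = min(len(tgt), max_len)
        feats["src=%d" % s] += 1
        feats["tgt=%d" % t] += 1
        feats["src/tgt=%d,%d" % (s, t)] += 1
    return feats

pairs = [(["la", "maison"], ["the", "house"]),
         (["bleue"], ["blue"])]
print(phrase_length_features(pairs))
```

Each such indicator becomes one dimension of $\mathbf{x}(i,j)$, which is why the extended feature sets grow into the hundreds or thousands.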

Experiments and Results

The authors conducted experiments for Urdu-English, Arabic-English and Chinese-English translation tasks and compared the performance of MERT, MIRA and PRO learning algorithms. They evaluated performance on phrase-based and syntax-based MT systems.

Dataset

[Table: tuning and test set sizes for the Urdu-English, Arabic-English, and Chinese-English experiments]

The table above shows the sizes of the datasets used in the experiments reported in the paper.

Results

The table below shows machine translation performance for the experiments listed in this paper. Scores are case-sensitive IBM BLEU. For every choice of system, language pair, and feature set, PRO performs comparably with the other methods.

[Table: case-sensitive IBM BLEU scores for MERT, MIRA, and PRO across systems, language pairs, and feature sets]

Related Papers

[1] Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160-167, Sapporo, Japan, July

[2] Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. Online Large-Margin Training for Statistical Machine Translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

[3] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research, 7(Mar):551-585

[4] Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. 2009. Ranking measures and loss functions in learning to rank. In Advances in Neural Information Processing Systems 22, pages 315–323

[5] David Chiang, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 224–233, Honolulu, HI, October