Difference between revisions of "Minimum error rate training"

Revision as of 13:18, 13 November 2011

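Minimum error rate training (or MERT) is a method for tuning the parameters of a statistical machine translation model so that they directly minimize the error measured by the evaluation metric. This is a work in progress by Francis Keith.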

Citation

MERT was originally proposed in the paper “Minimum Error Rate Training in Statistical Machine Translation”, Franz Josef Och, ACL, 2003, pp. 160-167. (found here [1])

Background

When training a model, it is often beneficial to take into account the evaluation method that will actually be applied to that model. Many training methods do not. MERT trains models for statistical machine translation by optimizing the model parameters against a more complex evaluation measure than a simple count of incorrect translations. In essence, the model is trained on the same criterion that will later be used to evaluate it.

Optimization Problem

The goal of MERT, as the name suggests, is to find the parameter values that minimize the error count on a corpus, given:

$f_{1}^{S}$ , a representative corpus of $S$ source sentences
$r_{1}^{S}$ , the reference translations
$K$ candidate translations $C_{s}=\{e_{s,1},...,e_{s,K}\}$ for each source sentence $f_{s}$
$M$ feature functions $h_{m}(e,f)$
$M$ model parameters $\lambda _{m}$ , one per feature function

For a given parameter setting, the model selects the highest-scoring candidate for each source sentence:

${\hat {e}}(f_{s};\lambda _{1}^{M})={\underset {e\in C_{s}}{\operatorname {argmax} }}\{\sum _{m=1}^{M}\lambda _{m}h_{m}(e,f_{s})\}$
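The selection rule above can be illustrated with a minimal Python sketch. It assumes each candidate is already represented by its vector of feature values; the names `score` and `best_candidate` and the numbers are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear model score: the sum over m of lambda_m * h_m."""
    return sum(w * h for w, h in zip(weights, features))


def best_candidate(weights: Sequence[float],
                   candidates: List[Sequence[float]]) -> int:
    """Return the index k of the highest-scoring candidate in C_s."""
    return max(range(len(candidates)),
               key=lambda k: score(weights, candidates[k]))


# Hypothetical feature vectors for K = 3 candidates and M = 2 features.
candidates = [
    [-2.0, -1.0],
    [-1.0, -3.0],
    [-1.5, -2.0],
]

print(best_candidate([1.0, 1.0], candidates))  # prints 0
print(best_candidate([1.0, 0.1], candidates))  # prints 1
```

Changing the weights changes which candidate wins, and that choice is the only way the parameters influence the error count.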

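MERT then chooses the parameters that minimize the total error count of the selected translations over the corpus:

${\hat {\lambda }}_{1}^{M}={\underset {\lambda _{1}^{M}}{\operatorname {argmin} }}\{\sum _{s=1}^{S}E(r_{s},{\hat {e}}(f_{s};\lambda _{1}^{M}))\}$

$={\underset {\lambda _{1}^{M}}{\operatorname {argmin} }}\{\sum _{s=1}^{S}\sum _{k=1}^{K}E(r_{s},e_{s,k})\delta ({\hat {e}}(f_{s};\lambda _{1}^{M}),e_{s,k})\}$

Here $E(r,e)$ is the error count of translation $e$ with respect to reference $r$ , and $\delta (a,b)$ is 1 if $a=b$ and 0 otherwise.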
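As a rough sketch of this objective, the Python below sums the error of the top-scoring candidate for each sentence and then picks the best weights from a small grid. The exhaustive grid search is a stand-in chosen for brevity, not the search procedure of the cited paper, and the corpus, error values, and function names are all hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

# Each sentence is a list of (feature_vector, error_count) pairs, one per
# candidate. The error counts E(r_s, e_{s,k}) are assumed to be precomputed.
Sentence = List[Tuple[Sequence[float], float]]


def corpus_error(weights: Sequence[float], corpus: List[Sentence]) -> float:
    """Sum the error of the top-scoring candidate of every sentence."""
    total = 0.0
    for sentence in corpus:
        _, err = max(
            sentence,
            key=lambda cand: sum(w * h for w, h in zip(weights, cand[0])),
        )
        total += err
    return total


def grid_search(corpus: List[Sentence],
                grid: Sequence[float]) -> Tuple[float, ...]:
    """Naive search: try every weight combination on the grid."""
    num_features = len(corpus[0][0][0])
    return min(product(grid, repeat=num_features),
               key=lambda w: corpus_error(w, corpus))


# Hypothetical corpus: S = 2 sentences, M = 2 features.
corpus = [
    [([-2.0, -1.0], 0.0), ([-1.0, -3.0], 2.0), ([-1.5, -2.0], 1.0)],
    [([-1.0, -2.0], 1.0), ([-3.0, -0.5], 0.0)],
]

print(corpus_error((1.0, 1.0), corpus))  # prints 1.0
print(corpus_error((1.0, 0.0), corpus))  # prints 3.0
best = grid_search(corpus, [0.0, 0.5, 1.0])
print(best, corpus_error(best, corpus))  # prints (0.0, 0.5) 0.0
```

The error only changes when the top-scoring candidate of some sentence changes, so the objective is piecewise constant in the weights and cannot be minimized by following a gradient.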

Drawbacks

MERT, while very powerful (and currently the popular approach to training MT models), has some drawbacks:

It tends to overfit
It does not work well with large feature sets
Results vary widely across runs because the objective has many local optima

