== Citation ==

MERT is a [[Category::method]] that was originally proposed in the paper [[RelatedPaper::Och ACL 2003|“Minimum Error Rate Training in Statistical Machine Translation”, Franz Josef Och, ACL, 2003, pp. 160-167.]] (found here: [http://acl.ldc.upenn.edu/acl2003/main/pdfs/Och.pdf])

== Background ==

When training a model, it is often beneficial to take into account the evaluation method that will actually be used to judge it, yet many training methods do not. Minimum error rate training (herein referred to as MERT) is a method for [[AddressesProblem::Model Training|training models]] for [[AddressesProblem::Machine Translation|statistical machine translation]]. It optimizes the model parameters against a more complex evaluation measure than a simple count of incorrect translations; in effect, the model is trained on the same criterion that will later be used to evaluate it.

== Optimization Problem ==

The goal of MERT, as the name suggests, is to find the model parameters that minimize the error count, given:
* <math>f_{1}^{S}</math>, a representative corpus of source sentences
* <math>r_{1}^{S}</math>, the corresponding reference translations
* <math>K</math>, the number of candidate translations per sentence
** <math>C_{s} = \{e_{s,1},...,e_{s,K}\}</math> for each <math>f_{s}</math>
* <math>M</math> feature functions <math>h_m(e,f)</math>
* <math>M</math> model parameters <math>\lambda_m</math>

For a given parameter setting, the decision rule selects the highest-scoring candidate translation:

<math>\hat{e}(f_s;\lambda_1^M) = \underset{e \in C_s}{\operatorname{argmax}}\{\sum_{m=1}^{M} \lambda_{m}h_m(e|f_s)\}</math>

The training criterion is then to choose the parameters that minimize the total error count over the corpus:

<math>\hat{\lambda}_1^M = \underset{\lambda_1^M}{\operatorname{argmin}}\{\sum_{s=1}^S E(r_s,\hat{e}(f_s;\lambda_1^M))\}</math>

<math>= \underset{\lambda_1^M}{\operatorname{argmin}}\{\sum_{s=1}^S \sum_{k=1}^K E(r_s,e_{s,k})\delta(\hat{e}(f_s;\lambda_1^M),e_{s,k})\}</math>
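
To make the setup concrete, here is a small sketch (hypothetical Python, with a toy candidate list and a placeholder 0/1 error function standing in for <math>E</math>; none of the names come from the paper) of the decision rule and of the corpus error count that MERT minimizes:

<pre>
# Sketch of the MERT objective (illustrative only; data layout and names are made up).
# Each candidate translation for a source sentence carries its feature vector h(e, f_s);
# the model score is the dot product with the parameter vector lambda_1^M.

def model_score(lambdas, features):
    # sum_m lambda_m * h_m(e, f_s)
    return sum(l * h for l, h in zip(lambdas, features))

def decide(lambdas, candidates):
    # Decision rule: pick the candidate in C_s with the highest model score.
    return max(candidates, key=lambda c: model_score(lambdas, c["features"]))

def corpus_error(lambdas, corpus, error_fn):
    # Training criterion: sum over sentences of E(r_s, e_hat(f_s; lambda_1^M)).
    return sum(error_fn(sent["reference"], decide(lambdas, sent["candidates"])["translation"])
               for sent in corpus)

# Toy usage: one sentence, two candidates, 0/1 sentence-level error against the reference.
toy_corpus = [
    {"reference": "the house is small",
     "candidates": [
         {"translation": "the house is small", "features": [1.0, 0.2]},
         {"translation": "small the house",    "features": [0.8, 0.9]},
     ]},
]
zero_one = lambda ref, hyp: 0 if ref == hyp else 1
print(corpus_error([1.0, -0.5], toy_corpus, zero_one))   # -> 0
</pre>

Because <math>\hat{e}</math> changes discontinuously as the parameters move, this objective is piecewise constant in <math>\lambda_1^M</math>, which is what motivates the specialized line search described next.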

== Training Algorithm ==

The key to the optimization is an exact line search along the line:

<math>\lambda_1^M + \gamma \cdot d_1^M</math>

where <math>d_1^M</math> is a search direction. For a fixed direction, the candidate chosen for a sentence <math>f</math> at a given <math>\gamma</math> is:

<math>\hat{e}(f;\gamma) = \underset{e \in C}{\operatorname{argmax}} \{ t(e,f) + \gamma \cdot m(e,f)\}</math>

where <math>t(e,f) = \sum_{m=1}^{M} \lambda_m h_m(e|f)</math> is the score under the current parameters and <math>m(e,f) = \sum_{m=1}^{M} d_m h_m(e|f)</math> is its rate of change along the direction. Since each candidate's score <math>t(e,f) + \gamma \cdot m(e,f)</math> is a linear function of <math>\gamma</math>, the maximum over all candidates defines a piecewise linear function <math>f'</math>:

<math>f'(\gamma;f) = \underset{e \in C}{\operatorname{max}} \{t(e,f) + \gamma \cdot m(e,f)\}</math>
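
One standard way to compute this envelope for a single sentence is a sweep over the candidate score lines sorted by slope. The sketch below is an illustrative implementation, not the paper's own pseudocode; it returns the candidates that are ever maximal, together with the <math>\gamma</math> values at which the maximizing candidate changes:

<pre>
# Sketch: upper envelope of the candidate score lines for one sentence.
# Each candidate is a tuple (t, m, translation), i.e. the line score(gamma) = t + gamma * m.

def upper_envelope(candidates):
    # Returns (hull, breaks): hull[0] is maximal for gamma < breaks[0],
    # hull[i] for breaks[i-1] < gamma < breaks[i], hull[-1] for gamma > breaks[-1].
    by_slope = {}
    for cand in sorted(candidates, key=lambda c: (c[1], c[0])):
        by_slope[cand[1]] = cand          # for equal slopes keep only the highest line
    lines = sorted(by_slope.values(), key=lambda c: c[1])

    hull, breaks = [], []
    for t, m, trans in lines:
        while hull:
            t0, m0, _ = hull[-1]
            gamma = (t0 - t) / (m - m0)   # where the new line overtakes the current top
            if breaks and gamma <= breaks[-1]:
                hull.pop(); breaks.pop()  # the current top is never maximal: discard it
            else:
                breaks.append(gamma)
                break
        hull.append((t, m, trans))
    return hull, breaks

# Toy usage: candidate "b" is never on the upper envelope.
hull, breaks = upper_envelope([(1.0, 0.0, "a"), (0.0, 0.5, "b"), (-1.0, 1.0, "c")])
print([c[2] for c in hull], breaks)       # -> ['a', 'c'] [2.0]
</pre>

Between two consecutive breakpoints the chosen translation, and therefore the sentence's contribution to the error count, is constant; only the breakpoints themselves matter for the optimization.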

The algorithm is as follows:

* Compute <math>f'(\gamma;f)</math> for every sentence <math>f</math>, along with the incremental change in error count at each point where the maximizing candidate changes
** This yields a sequence <math>\gamma_1^f < \gamma_2^f < ... < \gamma_{N_f}^f</math>
** Each boundary <math>\gamma_n^f</math> carries a change in error count <math>\Delta E_n^f</math>
* Find the optimal <math>\gamma</math> by merging these boundaries across all sentences and traversing the resulting intervals while updating the running error count (a sketch of this sweep follows the list)
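
The following sketch illustrates the final sweep. It assumes the per-sentence breakpoints <math>\gamma_n^f</math> and error deltas <math>\Delta E_n^f</math> have already been computed (for example from an envelope construction like the one above, scoring each segment's candidate against the reference); the input layout and function name are made up for illustration:

<pre>
# Sketch: choose the best gamma along one search direction (illustrative layout).
# events[s] is the list of (gamma_n, delta_E_n) for sentence s: at gamma_n the chosen
# translation for s changes and its error contribution changes by delta_E_n.

def best_gamma(events, baseline_error):
    # baseline_error is the total corpus error as gamma -> -infinity.
    merged = sorted((g, d) for per_sentence in events for (g, d) in per_sentence)
    best_err = baseline_error
    best_g = merged[0][0] - 1.0           # any gamma left of the first breakpoint
    err = baseline_error
    for i, (g, d) in enumerate(merged):
        err += d                          # running error after crossing this breakpoint
        # The error is constant on the open interval up to the next breakpoint,
        # so take its midpoint as a representative gamma.
        nxt = merged[i + 1][0] if i + 1 < len(merged) else g + 2.0
        if err < best_err:
            best_err, best_g = err, (g + nxt) / 2.0
    return best_g, best_err

# Toy usage: two sentences whose errors each drop by 1, at gamma = 0.5 and gamma = 1.5.
print(best_gamma([[(0.5, -1)], [(1.5, -1)]], baseline_error=2))   # -> (2.5, 0)
</pre>

Once the best <math>\gamma</math> is found, the parameters are updated to <math>\lambda_1^M + \gamma \cdot d_1^M</math>, and the procedure repeats with new search directions until the corpus error count no longer decreases.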

== Benefits ==

MERT has several benefits, most of which are specific to MT problems:
* The line search finds the global optimum along each search direction
** This is important because the error count is not smoothed, which means it can have far more local optima than a smoothed objective
* Convergence is guaranteed
** This is also important because the unsmoothed error count is not a convex function

== Drawbacks ==

MERT, while very powerful (and currently the most popular approach to training MT models), has some drawbacks:
* It tends to overfit
* It does not work well with large feature sets
* It exhibits high variance across runs, due to the many local optima
  
== Related Work ==

MERT is more or less the current standard for training MT models; as a result, it is used in many MT papers.
  
* [[RelatedPaper::An End-to-End Discriminative Approach to Machine Translation Liang et al 2006|An End-to-End Discriminative Approach to Machine Translation, Liang et al., 2006.]] - Provides an improvement to MERT that allows for much larger feature sets
* [[RelatedPaper::Hierarchical Phrase-Based Translation Chiang 2007|Hierarchical Phrase-Based Translation, Chiang, 2007.]] - Uses MERT to tune a hierarchical phrase-based translation system
