Empirical Risk Minimization
This is a [[Category::method]] proposed by [[RelatedPaper::Bahl et al. 1988 A new algorithm for the estimation of hidden Markov model parameters]].
In graphical models, the true distribution is always unknown. Instead of maximizing the likelihood on the training data when estimating the model parameter <math>\theta</math>, we can alternatively minimize the empirical risk, i.e., the average of a loss <math>l</math> over the training data; this estimation strategy is known as Empirical Risk Minimization (ERM). ERM has been widely used in speech recognition (Bahl et al., 1988) and machine translation (Och, 2003). The ERM estimation method has the following advantages (a schematic comparison with the maximum likelihood objective is sketched after the list):
* Maximum likelihood does not guarantee better accuracy and may overfit to the training distribution; ERM can help prevent overfitting the training data.
* Summing and averaging the local conditional likelihoods can be more resilient to errors than taking the product of the conditional likelihoods.
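As a rough illustration, and with notation assumed here rather than taken from the cited papers: given training pairs <math>(x_i, y_i)</math>, <math>i = 1, \dots, N</math>, maximum likelihood estimation maximizes the training log likelihood, whereas ERM minimizes the average loss of the decoded outputs:

:<math>\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \sum_{i=1}^{N} \log p(y_i \mid x_i; \theta), \qquad \hat{\theta}_{\mathrm{ERM}} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} l\left(y_i, \hat{y}(x_i; \theta)\right),</math>

where <math>\hat{y}(x_i; \theta)</math> denotes the output decoded from the model for input <math>x_i</math>.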
== Motivation ==
A standard training method for probabilistic graphical models often involves using Expectation Maximization (EM) for Maximum a Posteriori (MAP) training, together with approximate inference and approximate decoding.
However, using approximate inference with the same update equations as in the exact case can cause the learner to diverge (Kulesza and Pereira, 2008). Second, the structure of the model itself might be too simple, so that the true distribution cannot be characterized by any setting of the model parameter <math>\theta</math>. Moreover, even if the model structure is correct, MAP training on the training data might not recover the correct <math>\theta</math>.
ERM argues that minimizing the risk is the most appropriate way of training, since the ultimate goal of the task is to directly optimize performance on the true evaluation. In addition, studies (Smith and Eisner, 2006) have shown that maximizing the log likelihood with EM does not guarantee consistently high accuracy on NLP evaluations. As a result, minimizing the empirical risk (the observed loss on the training data) is an alternative method for training graphical models.
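The following is a minimal sketch of this idea in Python, not an implementation from any of the cited papers: the threshold decoder, the 0/1 loss, and the grid search over <math>\theta</math> are assumptions made for this toy example, standing in for the structured decoder and the task loss used in practice.

<syntaxhighlight lang="python">
import numpy as np

# Minimal sketch (not from the cited papers): estimate a single parameter
# theta by Empirical Risk Minimization, i.e. choose the theta that minimizes
# the average task loss on the training data, rather than the theta that
# maximizes the training likelihood.

def zero_one_loss(y_true, y_pred):
    """Task loss l(y, y_hat): 1 if the prediction is wrong, 0 otherwise."""
    return float(y_true != y_pred)

def decode(x, theta):
    """Hypothetical decoder: predict label 1 when the input score x exceeds theta."""
    return 1 if x > theta else 0

def empirical_risk(theta, data):
    """Average loss of the decoded outputs over the training data."""
    return np.mean([zero_one_loss(y, decode(x, theta)) for x, y in data])

# Toy training set of (score, label) pairs.
train = [(0.2, 0), (0.4, 0), (0.5, 1), (0.6, 1), (0.9, 1)]

# The 0/1 risk is piecewise constant in theta, so a grid search stands in
# for the smoothed or annealed optimizers used in practice
# (e.g. Och 2003; Smith and Eisner 2006).
grid = np.linspace(0.0, 1.0, 101)
best_theta = min(grid, key=lambda t: empirical_risk(t, train))
print("theta =", best_theta, "empirical risk =", empirical_risk(best_theta, train))
</syntaxhighlight>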
== Problem Formulation ==
== Empirical Risk Minimization ==
== Some Reflections ==
== Related Papers ==
* Bahl, Brown, de Souza, and Mercer (1988). A new algorithm for the estimation of hidden Markov model parameters.
* Och (2003). Minimum Error Rate Training in Statistical Machine Translation.
* Smith and Eisner (2006). Minimum Risk Annealing for Training Log-Linear Models.
* Kulesza and Pereira (2008). Structured Learning with Approximate Inference.