Empirical Risk Minimization

This is a [[Category::method]] proposed by [[RelatedPaper::Bahl et al. 1988 A new algorithm for the estimation of hidden Markov model parameters]].
 
In graphical models, the true distribution is always unknown. Instead of maximizing the likelihood on the training data when estimating the model parameter <math>\theta</math>, we can alternatively minimize the empirical risk, obtained by averaging the loss <math>l(x)</math> over the training data (Empirical Risk Minimization, ERM). ERM has been widely used in Speech Recognition (Bahl et al., 1988) and Machine Translation (Och, 2003). The ERM estimation method has the following advantages:
 
* Maximum likelihood might overfit to the training distribution; ERM can help prevent overfitting the training data.
* Log likelihood does not equal accuracy on the test set, whereas ERM directly optimizes the measure used for evaluation (the loss function <math>l(x)</math> can be the L1 loss, mean squared error, F-measure, conditional log-likelihood, or others; a short sketch of such losses follows this list).
* Summing up and averaging the local conditional likelihoods might be more resilient to errors than calculating the product of conditional likelihoods.
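
To make the choice of loss concrete, here is a minimal Python sketch (the function names are illustrative and not taken from the cited papers) of a few per-example losses that could play the role of <math>l(x)</math> in the empirical-risk average defined below:

<pre>
import math

# Illustrative per-example losses; any of them can serve as L(f_theta(x_i), y_i).
def l1_loss(pred, gold):
    return abs(pred - gold)

def squared_loss(pred, gold):
    return (pred - gold) ** 2

def neg_cond_log_likelihood(prob_of_gold):
    # Negative conditional log-likelihood of the gold label under the model;
    # minimizing this loss maximizes the conditional likelihood.
    return -math.log(prob_of_gold)

print(l1_loss(0.8, 1.0), squared_loss(0.8, 1.0), neg_cond_log_likelihood(0.8))
</pre>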
  
== Motivation ==

A standard training method for probabilistic graphical models often involves using Expectation Maximization (EM) for Maximum a Posteriori (MAP) training, approximate inference, and approximate decoding. However, using approximate inference with the same update equations as in the exact case might lead to divergence of the learner (Kulesza and Pereira, 2008). Secondly, the structure of the model itself might be too simple, so that no model parameter <math>\theta</math> can characterize the true distribution. Moreover, even if the model structure is correct, MAP training on the training data might not give us the correct <math>\theta</math>.

ERM argues that minimizing the risk is the most proper way of training, since the ultimate goal of the task is to directly optimize the performance on the true evaluation. In addition, studies (Smith and Eisner, 2006) have shown that maximizing the log likelihood using EM does not guarantee consistently high accuracy for evaluations in NLP tasks. As a result, minimizing the empirical risk (the observed errors on the training data) might be an alternative method for training graphical models.

== The Standard MLE Learning Method ==

Assume we use <math>\theta</math> to represent the model parameter. The task of training is to set the most appropriate <math>\theta</math> that represents the true distribution of the data. For graphical models, given the training data pairs <math>(x_i, y_i)</math>, the standard method is to maximize the following log likelihood:

: <math>\hat{\theta}_{\mbox{MLE}} = \arg \max_{\theta} \sum_{i=1}^n \log p_{\theta}(y_i \mid x_i),</math>

where <math>\log p_{\theta}(y_i \mid x_i)</math> represents the conditional log-likelihood of <math>(x_i, y_i)</math>.

== Empirical Risk Minimization ==

As we mentioned earlier, the true risk is unknown because the true distribution is unknown. As an alternative to maximum likelihood, we can calculate an empirical risk function by averaging the loss on the training set:
: <math>\! ER(\theta) = \frac{1}{n} \sum_{i=1}^n L(f_{\theta}(x_i), y_i).</math>
 
The idea of ERM for learning is to choose the parameter <math>\theta^{*}</math> that minimizes the empirical risk:
: <math>\theta^{*} = \arg \min_{\theta \in \Theta} ER(\theta).</math>
Thus the learning algorithm defined by the ERM principle consists in solving the above optimization problem. The function <math>ER(\theta)</math> is often differentiable, so we can use optimization methods such as gradient descent to find the parameter <math>\theta</math>. Note that the loss function is sometimes non-convex, in which case other optimization methods are needed.
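
To make the optimization step concrete, the following sketch minimizes the empirical risk by batch gradient descent, assuming a linear predictor <math>f_{\theta}(x) = \theta^{\top} x</math> and the squared loss (these choices are only for illustration; they keep <math>ER(\theta)</math> convex and differentiable and are not prescribed by the method):

<pre>
import numpy as np

def empirical_risk(theta, X, y):
    # ER(theta) = (1/n) * sum_i L(f_theta(x_i), y_i), with squared loss here.
    return np.mean((X @ theta - y) ** 2)

def erm_gradient_descent(X, y, lr=0.01, steps=1000):
    # Batch gradient descent on the differentiable empirical risk.
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        grad = (2.0 / n) * X.T @ (X @ theta - y)  # gradient of the squared-loss risk
        theta -= lr * grad
    return theta

# Toy usage: recover a parameter vector from noisy linear observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)
theta_star = erm_gradient_descent(X, y)
print(theta_star, empirical_risk(theta_star, X, y))
</pre>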
  
 
== Some Reflections ==
 
[[File:Em.jpg]]
(1) The above figure was taken from Smith (2006). The x-axis is the log-likelihood, while the y-axis is the accuracy on the test set. It shows that the EM algorithm with MAP estimation might suffer from local optima, and there is no guarantee that maximum likelihood estimation will lead to high accuracy. Charniak (1993) reports similar results. Compared to MLE, ERM can suffer from a similar non-convexity issue: some of the useful loss functions are non-convex, and minimizing them might be prone to getting stuck in local optima.
(2) People use different terminologies when describing risks. For example, in the classic Klein and Manning (2002) paper [[RelatedPaper::Klein 2002 conditional structure versus conditional estimation in nlp models]], the term "sum of conditional log-likelihoods" (SCL) is used to describe the risk, and they showed that SCL could obtain good results on their task. It is clear that ERM becomes an important alternative training method when the <math>\theta</math> estimated by MLE might overfit the training set.
(3) It is known that ERM with the 0-1 loss function is NP-hard (even so, people still use various convex optimization methods for ERM estimation in practice, typically by minimizing a convex surrogate of the 0-1 loss).
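
To illustrate point (3), the short sketch below (assuming the hinge loss as the convex surrogate, one common choice rather than anything specific to the papers above) compares the non-convex 0-1 loss with a convex upper bound on the same classification margins:

<pre>
import numpy as np

def zero_one_loss(margins):
    # 0-1 loss as a function of the margin y * f(x): 1 if misclassified, 0 otherwise.
    return (margins <= 0).astype(float)

def hinge_loss(margins):
    # Convex surrogate that upper-bounds the 0-1 loss, so it can be minimized
    # with standard convex optimization methods.
    return np.maximum(0.0, 1.0 - margins)

margins = np.array([-2.0, -0.5, 0.3, 1.5])
print("0-1 empirical risk:  ", zero_one_loss(margins).mean())
print("hinge empirical risk:", hinge_loss(margins).mean())
</pre>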
  
 
== Related Papers ==  
 
* [[RelatedPaper::Bahl et al. 1988 A new algorithm for the estimation of hidden Markov model parameters]] (The original ERM method was proposed to improve Automatic Speech Recognition tasks.)
* [[RelatedPaper::Klein 2002 conditional structure versus conditional estimation in nlp models]] (Klein and Manning discussed the notion of "sum of conditional likelihoods (SCL)" in their 2002 ACL paper. The idea of SCL is very similar to ERM.)
* [[RelatedPaper::Stoyanov et al. 2011: Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure]] (This paper is a comprehensive overview of ERM estimation techniques for probabilistic graphical models.)
