Stochastic Gradient Descent
 
== Summary ==
 
This is an optimization [[Category::method]], used in many algorithms such as [[AddressesProblem::Conditional Random Fields]] to efficiently optimize the objective function, especially in the online setting. Stochastic optimization methods like this one are known to be faster when trained on large, redundant data sets.
  
== Gradient Descent ==
  
 
Given a function to be minimized <math>F(\mathbf{x})</math> and a point <math>\mathbf{x}=\mathbf{a}</math>, let <math>\mathbf{b}=\mathbf{a}-\gamma\nabla F(\mathbf{a})</math>. Then we can say

<math>
F(\mathbf{a}) \geq F(\mathbf{b})
</math>

for some small enough <math>\gamma\quad</math>. Using this inequality, we can get a (local) minimum of the objective function using the following steps:

* Initialize <math>\mathbf{x}_{0}</math>
* Update <math>\mathbf{x}_{n+1}=\mathbf{x}_{n}-\gamma\nabla F(\mathbf{x}_{n})</math>
* Repeat the step above until the objective function converges to a local minimum
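As a quick illustration (a minimal sketch, not from the original article), the following Python snippet runs this update rule on the simple quadratic <math>F(\mathbf{x})=\tfrac{1}{2}\|\mathbf{x}\|^{2}</math>; the objective, the step size, and the stopping tolerance are all assumed for the example:

<source lang="python">
import numpy as np

# Minimal gradient descent sketch on F(x) = 0.5 * ||x||^2, whose
# gradient is simply x. The objective, the step size gamma, and the
# stopping tolerance are all illustrative assumptions.
def grad_F(x):
    return x  # gradient of 0.5 * ||x||^2

x = np.array([3.0, -2.0])  # initialize x_0
gamma = 0.1                # step size

for _ in range(1000):
    step = gamma * grad_F(x)
    x = x - step                     # x_{n+1} = x_n - gamma * grad F(x_n)
    if np.linalg.norm(step) < 1e-8:  # converged to a (local) minimum
        break

print(x)  # close to the minimizer [0, 0]
</source>

Because the gradient of this objective is just <math>\mathbf{x}</math>, each step shrinks the iterate toward the minimizer at the origin.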
== Stochastic Gradient Descent ==
 
One of the problems of the gradient descent method above is that calculating the gradient can be an expensive computation, depending on the objective function and the size of the data set. Suppose your objective function is <math>\mathcal{L}(\mathcal{D}, \mathbf{\theta})</math>. If the objective function can be decomposed as the following,
  
 
<math>
\mathcal{L}(\mathcal{D}, \mathbf{\theta}) = \sum_{i} \mathcal{L}(\mathcal{D}_{i}, \mathbf{\theta})
</math>
  
where <math>\mathcal{D}_{i}</math> indicates the <math>i\quad</math>-th example (sometimes <math>\mathcal{D}_{i}</math> is a batch instead of one example), we can make the process stochastic. To make each step computationally efficient, a subset of the summand functions is sampled at each step. The procedure can be described as the following pseudocode:
  
 
* Initialize <math>\mathbf{\theta}</math>
* Repeat until convergence
** Sample <math>n\quad</math> examples
** For each example <math>\mathcal{D}_{i}</math> sampled
*** <math>\mathbf{\theta}_{new}=\mathbf{\theta}-\alpha\nabla\mathcal{L}(\mathcal{D}_{i}, \mathbf{\theta})</math>
*** <math>\mathbf{\theta}=\mathbf{\theta}_{new}</math>
  
where <math>\alpha\quad</math> is the learning rate. Note that this method is no different from the plain gradient descent method when the batch size is equal to the number of examples. For computational efficiency, small batch sizes of around 5 to 20 examples turn out to be most efficient.
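To make the pseudocode concrete, here is a minimal Python sketch of minibatch SGD on an assumed least-squares objective <math>\mathcal{L}(\mathcal{D}_{i}, \mathbf{\theta})=\tfrac{1}{2}\sum_{j \in \mathcal{D}_{i}}(\mathbf{x}_{j}\cdot\mathbf{\theta}-y_{j})^{2}</math>; the synthetic data, batch size, learning rate, and epoch count are illustrative choices, not part of the original article:

<source lang="python">
import numpy as np

# Minibatch SGD sketch for an assumed least-squares objective
# L(D, theta) = sum_i 0.5 * (x_i . theta - y_i)^2.
# The synthetic data, batch size, learning rate, and epoch count
# are illustrative assumptions, not from the original article.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))   # 1000 examples with 5 features
true_theta = np.arange(1.0, 6.0)
y = X @ true_theta

theta = np.zeros(5)   # Initialize theta
alpha = 0.01          # learning rate
batch_size = 10       # small batches (roughly 5 to 20) per the note above

for epoch in range(100):                 # Repeat until convergence
    order = rng.permutation(len(X))      # Sample examples
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]   # one batch D_i
        residual = X[batch] @ theta - y[batch]
        grad = X[batch].T @ residual     # gradient of L(D_i, theta)
        theta = theta - alpha * grad     # theta_new = theta - alpha * grad

print(theta)  # approaches [1, 2, 3, 4, 5]
</source>

With the batch size set to the full data set, the inner loop performs exactly one plain gradient descent step per epoch, matching the note above.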
  
 
== Pros ==
 
This method is much faster when used on data sets that have redundant information among examples. It is also known to be more robust to noise.

== Cons ==

The convergence rate is slower than that of second-order gradient methods. It also tends to keep bouncing around the minimum unless the learning rate is reduced in the later iterations.
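As an aside (an illustrative sketch, not prescribed by the article), one common way to damp this bouncing is to decay the learning rate as iterations proceed; the schedule and constants below are assumptions:

<source lang="python">
# Illustrative decaying learning-rate schedule to damp the bouncing
# near the minimum; alpha0 and decay are assumed constants.
def learning_rate(t, alpha0=0.1, decay=0.01):
    """Return a step size that shrinks as the iteration count t grows."""
    return alpha0 / (1.0 + decay * t)

print([round(learning_rate(t), 4) for t in (0, 100, 1000)])
# [0.1, 0.05, 0.0091]
</source>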
 