== Summary ==
 
This is an optimization [[Category::method]], used in many algorithms such as [[AddressesProblem::Conditional Random Field]] to efficiently optimize the objective function, especially in the online setting. The benefit of stochastic gradient descent comes from the stochastic approximation of the objective function, which is computed on a small sample of the data at each step rather than on the entire dataset.

=== Gradient Descent ===

Given a function to be minimized <math>F(\mathbf{x})</math> and a point <math>\mathbf{x}=\mathbf{a}</math>, let <math>\mathbf{b}=\mathbf{a}-\gamma\nabla F(\mathbf{a})</math>. Then we can say

<math>
F(\mathbf{b}) \leq F(\mathbf{a})
</math>

for some small enough <math>\gamma > 0</math>. Using this inequality, we can find a (local) minimum of the objective function with the following steps (a sketch in code follows the list):

* Initialize <math>\mathbf{x}</math>
* Repeat the two updates below until the objective function converges to a local minimum
** <math>\mathbf{x}_{new} = \mathbf{x} - \gamma\nabla F(\mathbf{x})</math>
** <math>\mathbf{x} = \mathbf{x}_{new}</math>

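As a concrete illustration, this loop can be written as a minimal Python sketch. The helper name <code>grad_F</code> and the quadratic example are illustrative assumptions, not part of this page:

<pre>
import numpy as np

def gradient_descent(grad_F, x0, gamma=0.1, tol=1e-6, max_iters=1000):
    # grad_F(x): returns the gradient of F at x
    # gamma: step size, small enough that F(b) <= F(a) holds at each step
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        x_new = x - gamma * grad_F(x)        # x_new = x - gamma * grad F(x)
        if np.linalg.norm(x_new - x) < tol:  # updates have stalled: converged
            return x_new
        x = x_new
    return x

# Example: F(x) = x . x has gradient 2x and its minimum at the origin.
print(gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0]))
</pre>
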
== Stochastic Gradient Descent ==
One problem with gradient descent is that it cannot be run online. To make it online, we express (or approximate) the objective function as a sum of functions over batches of the data. Suppose the objective function is <math>\mathcal{L}(\mathbf{\theta};\mathcal{D})</math>. Then the objective function is decomposed as follows,

<math>
\mathcal{L}(\mathbf{\theta};\mathcal{D}) = \sum_{i=1}^{\vert\mathcal{D}\vert} {\mathcal{L}(\mathbf{\theta};\mathcal{D}_{i})}
</math>

where <math>\mathcal{D}_{i}</math> indicates the <math>i</math>-th example. Sometimes <math>\mathcal{D}_{i}</math> is a batch of examples instead of a single example. To make each step computationally efficient, a subset of the summand functions is sampled at each iteration. The procedure can be described by the following pseudocode (a sketch in code follows the list):

* Initialize <math>\mathbf{\theta}</math>
* Repeat until convergence
** Sample <math>n</math> examples
** For each sampled example <math>\mathcal{D}_{i}</math>
*** <math>\mathbf{\theta}_{new}=\mathbf{\theta} - \alpha\nabla\mathcal{L}(\mathbf{\theta};\mathcal{D}_{i})</math>
*** <math>\mathbf{\theta}=\mathbf{\theta}_{new}</math>

where <math>\alpha</math> is the learning rate.

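A minimal Python sketch of this procedure, assuming a per-example gradient function <code>grad_L</code> (a hypothetical name, not from this page):

<pre>
import numpy as np

rng = np.random.default_rng(0)

def sgd(grad_L, theta0, data, alpha=0.01, n=5, num_iters=500):
    # grad_L(theta, example): gradient of the loss on a single example
    # n: number of examples sampled per outer iteration
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_iters):
        for i in rng.choice(len(data), size=n, replace=False):
            theta = theta - alpha * grad_L(theta, data[i])  # the update step
    return theta

# Example: least-squares fit of y = theta . x. The loss on one example is
# (theta . x - y)^2, whose gradient with respect to theta is 2(theta . x - y)x.
data = [(np.array([1.0, x]), 3.0 + 2.0 * x) for x in np.linspace(-1.0, 1.0, 50)]
grad = lambda theta, ex: 2.0 * (theta @ ex[0] - ex[1]) * ex[0]
print(sgd(grad, theta0=[0.0, 0.0], data=data, alpha=0.05))  # approaches [3, 2]
</pre>
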
== Pros ==
This method is much faster on datasets that have redundant information among examples, since a small sample of examples already gives a good estimate of the full gradient. It is also known to be more robust to noise.

== Cons ==
The convergence rate is slower than that of second-order gradient methods. It also tends to keep bouncing around the minimum unless the learning rate is reduced in later iterations.
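A common remedy is to decay the learning rate as training proceeds. The sketch below assumes a simple <math>1/t</math>-style schedule; the schedule and its constants are illustrative, not prescribed by this page:

<pre>
def decayed_alpha(alpha0, t, decay=0.01):
    # 1/t-style schedule: large steps early on, small steps near the minimum
    return alpha0 / (1.0 + decay * t)

# Inside the SGD loop, use decayed_alpha(alpha0, t) in place of a fixed alpha,
# where t counts the updates made so far, so the iterates settle down instead
# of bouncing around the minimum.
</pre>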
  
 
== Problem formulation ==
 

== Forward-backward ==

== Related Concepts ==