Difference between revisions of "Stochastic Gradient Descent"

From Cohen Courses
Jump to navigationJump to search
(Created page with '(Draft coming soon from Daegun Won)')
 
 
(9 intermediate revisions by the same user not shown)
Line 1: Line 1:
(Draft coming soon from Daegun Won)
+
== Summary ==
 +
 
 +
This is an optimization [[Category::method]], used in many algorithms such as [[AddressesProblem::Support Vector Machines]] or [[AddressesProblem::Conditional Random Fields]] to efficiently optimize the objective function, especially in the online setting. Stochastic optimizations like this method are known to be faster when trained with large, redundant data sets.
 +
 
 +
== Gradient Descent ==
 +
 
 +
Given a function to be minimized <math>F(\mathbf{x})</math> and a point <math>\mathbf{x}=\mathbf{a}</math>, let <math>\mathbf{b}=\gamma\nabla F(\mathbf{a})</math> then we can say
 +
 
 +
<math>
 +
F(\mathbf{b}) \leq F(\mathbf{a})
 +
</math>
 +
 
 +
for some small enough <math>\gamma</math>. Using this inequality, we can get a (local) minimum of the objective function using the following steps:
 +
 
 +
* Initialize <math>\mathbf{x}</math>
 +
* Repeat the step above until the objective function converges to a local minimum
 +
** <math>\mathbf{x}_{new} = \mathbf{x} - \nabla F(\mathbf{a})</math>
 +
** <math>\mathbf{x} = \mathbf{x}_{new}</math>
 +
 
 +
== Stochastic Gradient Descent ==
 +
 
 +
One of the problems of the gradient descent method above is that calculating the gradient could be an expensive computation depending on the objective function or the size of the data set. Suppose your objective function is <math>\mathcal{L}(\mathcal{D}, \mathbf{\theta})</math>. If the objective function can be decomposed as the following,
 +
 
 +
<math>
 +
\mathcal{L}(\mathbf{\theta};\mathcal{D}) = \sum_{i=1}^{\vert\mathcal{D}\vert} {\mathcal{L}(\mathbf{\theta};\mathcal{D}_{i})}
 +
</math>
 +
 
 +
where <math>\mathcal{D}_{i}</math> indicates the <math>i\quad</math>-th example(sometimes <math>\mathcal{D}_{i}</math> is a batch instead of one example), we can make the process stochastic. To make each step computationally efficient, a subset of the summand function is sampled. The procedure can be described as the following pseudocode:
 +
 
 +
* Initialize <math>\mathbf{\theta}</math>
 +
* Repeat until convergence
 +
** Sample <math>n\quad</math> examples
 +
** For each example sampled <math>\mathcal{D}_{i}</math>
 +
*** <math>\mathbf{\theta}_{new}=\mathbf{\theta} - \alpha\nabla\mathcal{L}(\mathbf{\theta};\mathcal{D}_{i})</math>
 +
*** <math>\mathbf{\theta}=\mathbf{\theta}_{new}</math>
 +
 
 +
where <math>\alpha\quad</math> is the learning rate. Note that this method is no different from the plain gradient descent method when the batch size becomes the number of examples. For computational efficiency, small batch size around 5~20 turn out to be most efficient.
 +
 
 +
== Pros ==
 +
 
 +
When this method is used for very large data sets that has redundant information among examples, it is much faster than the plain gradient descent because it requires less computation each iteration. Also, it is known to be better with noisy data since it samples example to compute gradient.
 +
 
 +
== Cons ==
 +
 
 +
The convergence rate is slower than second-order gradient methods. However the speedup coming from computationally efficient iterations are usually greater and the method can converge faster if learning rate is adjusted as the procedure goes on. Also it tends to keep bouncing around the minimum unless the learning rate is reduced in the later iterations.
 +
 
 +
== Related Papers ==
 +
 
 +
* [[ RelatedPaper::Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods, Vishwanathan et al, ICML 2006 ]]
 +
* [[ RelatedPaper::Practical very large CRFs]]

Latest revision as of 18:08, 30 September 2011

Summary

This is an optimization method, used in many algorithms such as Support Vector Machines or Conditional Random Fields to efficiently optimize the objective function, especially in the online setting. Stochastic optimizations like this method are known to be faster when trained with large, redundant data sets.

Gradient Descent

Given a function to be minimized and a point , let then we can say

for some small enough . Using this inequality, we can get a (local) minimum of the objective function using the following steps:

  • Initialize
  • Repeat the step above until the objective function converges to a local minimum

Stochastic Gradient Descent

One of the problems of the gradient descent method above is that calculating the gradient could be an expensive computation depending on the objective function or the size of the data set. Suppose your objective function is . If the objective function can be decomposed as the following,

where indicates the -th example(sometimes is a batch instead of one example), we can make the process stochastic. To make each step computationally efficient, a subset of the summand function is sampled. The procedure can be described as the following pseudocode:

  • Initialize
  • Repeat until convergence
    • Sample examples
    • For each example sampled

where is the learning rate. Note that this method is no different from the plain gradient descent method when the batch size becomes the number of examples. For computational efficiency, small batch size around 5~20 turn out to be most efficient.

Pros

When this method is used for very large data sets that has redundant information among examples, it is much faster than the plain gradient descent because it requires less computation each iteration. Also, it is known to be better with noisy data since it samples example to compute gradient.

Cons

The convergence rate is slower than second-order gradient methods. However the speedup coming from computationally efficient iterations are usually greater and the method can converge faster if learning rate is adjusted as the procedure goes on. Also it tends to keep bouncing around the minimum unless the learning rate is reduced in the later iterations.

Related Papers