Stochastic Gradient Descent
== Summary ==
This is an optimization [[Category::method]], used in many algorithms such as [[AddressesProblem::Conditional Random Field]] to efficiently optimize the objective function, especially in the online setting. The benefit of stochastic gradient descent comes from the stochastic approximation of the objective function.
=== Gradient Descent ===

Given a function <math>F(\mathbf{x})</math> to be minimized and a point <math>\mathbf{x}=\mathbf{a}</math>, let <math>\mathbf{b}=\mathbf{a}-\gamma\nabla F(\mathbf{a})</math>. Then we can say

<math>
F(\mathbf{b}) \leq F(\mathbf{a})
</math>

for some small enough <math>\gamma > 0</math>. Using this inequality, we can reach a (local) minimum of the objective function with the following steps:

* Initialize <math>\mathbf{x}</math>
* Repeat the following updates until the objective function converges to a local minimum
** <math>\mathbf{x}_{new} = \mathbf{x} - \gamma\nabla F(\mathbf{x})</math>
** <math>\mathbf{x} = \mathbf{x}_{new}</math>

== Stochastic Gradient Descent ==

One problem with gradient descent is that it cannot be run online: each update requires the gradient over the entire dataset. To make it online, we express (or approximate) the objective function as a sum over examples (or batches) of the data. Suppose the objective function is <math>\mathcal{L}(\mathbf{\theta};\mathcal{D})</math>. Then it decomposes as

<math>
\mathcal{L}(\mathbf{\theta};\mathcal{D}) = \sum_{i=1}^{\vert\mathcal{D}\vert} {\mathcal{L}(\mathbf{\theta};\mathcal{D}_{i})}
</math>

where <math>\mathcal{D}_{i}</math> denotes the <math>i</math>-th example (sometimes a batch instead of a single example). To make each step computationally efficient, a subset of the summand functions is sampled at each iteration. The procedure can be described by the following pseudocode:

* Initialize <math>\mathbf{\theta}</math>
* Repeat until convergence
** Sample <math>n</math> examples
** For each sampled example <math>\mathcal{D}_{i}</math>
*** <math>\mathbf{\theta}_{new}=\mathbf{\theta} - \alpha\nabla\mathcal{L}(\mathbf{\theta};\mathcal{D}_{i})</math>
*** <math>\mathbf{\theta}=\mathbf{\theta}_{new}</math>

where <math>\alpha</math> is the learning rate.

== Pros ==

This method is much faster on datasets that have redundant information among examples, since a small sample of examples already gives a good estimate of the full gradient. It is also known to be more robust to noise.

== Cons ==

The convergence rate is slower than that of second-order gradient methods. The iterates also tend to keep bouncing around the minimum unless the learning rate is reduced in the later iterations.
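One way to reduce the learning rate in later iterations is a decaying schedule. The <math>1/t</math>-style form below is one standard choice, assumed here for illustration rather than prescribed by this page:

```python
def decayed_alpha(alpha0, t, decay=0.01):
    """1/t-style schedule: the step size shrinks as the iteration count t grows."""
    return alpha0 / (1.0 + decay * t)

# The step size decreases monotonically, damping the bouncing near the minimum.
steps = [decayed_alpha(0.1, t) for t in (0, 100, 1000)]
```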
== Problem formulation ==
Revision as of 17:08, 30 September 2011