Stochastic Gradient Descent
== Summary ==
This is an optimization [[Category::method]], used in many algorithms such as [[AddressesProblem::Conditional Random Field]] to efficiently optimize the objective function, especially in the online setting. The benefit of stochastic gradient descent comes from the stochastic approximation of the objective function.
=== Gradient Descent ===

Given a function <math>F(\mathbf{x})</math> to be minimized and a point <math>\mathbf{x}=\mathbf{a}</math>, let <math>\mathbf{b}=\mathbf{a}-\gamma\nabla F(\mathbf{a})</math>. Then we can say

<math>
F(\mathbf{b}) \leq F(\mathbf{a})
</math>

for some small enough <math>\gamma > 0</math>. Using this inequality, we can reach a (local) minimum of the objective function with the following steps:

* Initialize <math>\mathbf{x}</math>
* Repeat the following updates until the objective function converges to a local minimum
** <math>\mathbf{x}_{new} = \mathbf{x} - \gamma\nabla F(\mathbf{x})</math>
** <math>\mathbf{x} = \mathbf{x}_{new}</math>

== Stochastic Gradient Descent ==

One problem with gradient descent is that it cannot be run online: each update requires the gradient over the entire dataset. To make it online, we express (or approximate) the objective function as a sum over examples (or batches) of the data. Suppose the objective function is <math>\mathcal{L}(\mathbf{\theta};\mathcal{D})</math>. Then it decomposes as

<math>
\mathcal{L}(\mathbf{\theta};\mathcal{D}) = \sum_{i=1}^{\vert\mathcal{D}\vert} {\mathcal{L}(\mathbf{\theta};\mathcal{D}_{i})}
</math>

where <math>\mathcal{D}_{i}</math> denotes the <math>i</math>-th example (sometimes a batch instead of a single example). To make each step computationally efficient, a subset of the summand functions is sampled at each iteration. The procedure can be described by the following pseudocode:

* Initialize <math>\mathbf{\theta}</math>
* Repeat until convergence
** Sample <math>n</math> examples
** For each sampled example <math>\mathcal{D}_{i}</math>
*** <math>\mathbf{\theta}_{new}=\mathbf{\theta} - \alpha\nabla\mathcal{L}(\mathbf{\theta};\mathcal{D}_{i})</math>
*** <math>\mathbf{\theta}=\mathbf{\theta}_{new}</math>

where <math>\alpha</math> is the learning rate.

== Pros ==

This method is much faster on datasets that have redundant information among examples, since a small sample of examples already gives a good estimate of the full gradient. It is also known to be more robust to noise.

== Cons ==

The convergence rate is slower than that of second-order gradient methods. The iterates also tend to keep bouncing around the minimum unless the learning rate is reduced in the later iterations.
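One way to reduce the learning rate in later iterations is a decaying schedule. The <math>1/t</math>-style form below is one standard choice, assumed here for illustration rather than prescribed by this page:

```python
def decayed_alpha(alpha0, t, decay=0.01):
    """1/t-style schedule: the step size shrinks as the iteration count t grows."""
    return alpha0 / (1.0 + decay * t)

# The step size decreases monotonically, damping the bouncing near the minimum.
steps = [decayed_alpha(0.1, t) for t in (0, 100, 1000)]
```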
== Problem formulation ==
Revision as of 17:08, 30 September 2011