Stochastic Gradient Descent
Summary
Stochastic gradient descent (SGD) is an optimization method used in many algorithms, such as Conditional Random Fields, to efficiently optimize the objective function, especially in the online setting. The benefit of stochastic gradient descent comes from replacing the exact gradient of the objective function with a stochastic approximation computed from a small, randomly sampled subset of the data.
Gradient Descent
Given a function $F(\mathbf{x})$ to be minimized and a point $\mathbf{x}_0$, let $\mathbf{x}_1 = \mathbf{x}_0 - \eta \nabla F(\mathbf{x}_0)$; then we can say
$$F(\mathbf{x}_1) \le F(\mathbf{x}_0)$$
for some small enough step size $\eta > 0$. Using this inequality, we can reach a (local) minimum of the objective function with the following steps:
- Initialize $\mathbf{x}$
- Update $\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla F(\mathbf{x})$
- Repeat the step above until the objective function converges to a local minimum
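A minimal sketch of these steps in Python, assuming the gradient $\nabla F$ is supplied as a function grad_f and the step size $\eta$ is fixed (the function names and the quadratic example are illustrative, not from the original):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, tol=1e-8, max_iter=10_000):
    """Minimize a function given its gradient grad_f, starting from x0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_f(x)
        x = x - step                     # move against the gradient
        if np.linalg.norm(step) < tol:   # stop once updates become tiny
            break
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
print(x_min)  # approximately [3.]
```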
Stochastic Gradient Descent
One problem with gradient descent is that it cannot be run online. To make it online, we express (or approximate) the objective function as a sum of functions over batches of the data. Suppose the objective function is $F(\mathbf{x})$. Then it is decomposed as follows,
$$F(\mathbf{x}) = \sum_{i=1}^{n} F_i(\mathbf{x}),$$
where $F_i(\mathbf{x})$ is the contribution of the $i$-th example. Sometimes $F_i$ corresponds to a batch instead of a single example. To make each step computationally efficient, a subset of the summand functions is sampled. The procedure can be described with the following pseudocode:
- Initialize $\mathbf{x}$
- Repeat until convergence:
  - Sample $m$ examples
  - For each sampled example $i$, update $\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla F_i(\mathbf{x})$

where $\eta$ is the learning rate.
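A minimal sketch of this pseudocode, assuming the per-example gradients $\nabla F_i$ are available as a function grad_fi; the names sgd and grad_fi and the least-squares example are illustrative:

```python
import numpy as np

def sgd(grad_fi, data, x0, eta=0.01, n_epochs=50, batch_size=1, seed=0):
    """SGD for an objective F(x) = sum_i F_i(x), given per-example gradients."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_epochs):
        order = rng.permutation(len(data))                    # shuffle the examples
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]           # sampled examples
            g = np.mean([grad_fi(x, data[i]) for i in batch], axis=0)
            x = x - eta * g                                   # update with the noisy gradient
    return x

# Example: least squares, F_i(x) = (a_i * x - b_i)^2 for data points (a_i, b_i),
# so the per-example gradient is 2 * a_i * (a_i * x - b_i).
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
grad_fi = lambda x, p: 2.0 * p[0] * (p[0] * x - p[1])
print(sgd(grad_fi, points, x0=0.0, eta=0.01, n_epochs=200))  # close to 2.0
```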
Pros
This method is much faster on datasets that have redundant information among examples. It is also known to cope better with noise.
Cons
The convergence rate is slower than that of second-order methods. It also tends to keep bouncing around the minimum unless the learning rate is reduced in later iterations.
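As a hedged illustration of the second point, one common remedy is to shrink the learning rate as training progresses, for example with a $1/t$-style schedule (the schedule and parameter names here are illustrative):

```python
def decayed_eta(eta0, decay, t):
    """1/t-style schedule: the step size shrinks as the iteration count t grows."""
    return eta0 / (1.0 + decay * t)

# e.g. eta0=0.1, decay=0.01: t=0 -> 0.1, t=100 -> 0.05, t=1000 -> ~0.0091
for t in (0, 100, 1000):
    print(t, decayed_eta(0.1, 0.01, t))
```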