Stochastic Gradient Descent
Summary
Stochastic gradient descent (SGD) is an optimization method used in many algorithms, such as Conditional Random Fields, to efficiently optimize the objective function, especially in the online setting. The benefit of stochastic gradient descent comes from replacing the exact gradient of the objective function with a stochastic approximation computed from a small, randomly sampled subset of the data.
Gradient Descent
Given a function $F(\mathbf{x})$ to be minimized and a point $\mathbf{x}_0$, let $\mathbf{x}_1 = \mathbf{x}_0 - \eta \nabla F(\mathbf{x}_0)$; then we can say
$$F(\mathbf{x}_1) \le F(\mathbf{x}_0)$$
for some small enough step size $\eta > 0$. Using this inequality, we can reach a (local) minimum of the objective function with the following steps:
- Initialize $\mathbf{x}$
- Update $\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla F(\mathbf{x})$
- Repeat the step above until the objective function converges to a local minimum
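A minimal sketch of these steps in Python, assuming the gradient $\nabla F$ is supplied as a function grad_f and the step size $\eta$ is fixed (the function names and the quadratic example are illustrative, not from the original):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, tol=1e-8, max_iter=10_000):
    """Minimize a function given its gradient grad_f, starting from x0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_f(x)
        x = x - step                     # move against the gradient
        if np.linalg.norm(step) < tol:   # stop once updates become tiny
            break
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
print(x_min)  # approximately [3.]
```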
Stochastic Gradient Descent
One problem with gradient descent is that it cannot be run online. To make it online, we express (or approximate) the objective function as a sum of functions over batches of the data. Suppose the objective function is $F(\mathbf{x})$. Then it is decomposed as follows,
$$F(\mathbf{x}) = \sum_{i=1}^{n} F_i(\mathbf{x}),$$
where $F_i(\mathbf{x})$ is the contribution of the $i$-th example. Sometimes $F_i$ corresponds to a batch instead of a single example. To make each step computationally efficient, a subset of the summand functions is sampled. The procedure can be described with the following pseudocode:
- Initialize $\mathbf{x}$
- Repeat until convergence:
  - Sample $m$ examples
  - For each sampled example $i$, update $\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla F_i(\mathbf{x})$

where $\eta$ is the learning rate.
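A minimal sketch of this pseudocode, assuming the per-example gradients $\nabla F_i$ are available as a function grad_fi; the names sgd and grad_fi and the least-squares example are illustrative:

```python
import numpy as np

def sgd(grad_fi, data, x0, eta=0.01, n_epochs=50, batch_size=1, seed=0):
    """SGD for an objective F(x) = sum_i F_i(x), given per-example gradients."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_epochs):
        order = rng.permutation(len(data))                    # shuffle the examples
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]           # sampled examples
            g = np.mean([grad_fi(x, data[i]) for i in batch], axis=0)
            x = x - eta * g                                   # update with the noisy gradient
    return x

# Example: least squares, F_i(x) = (a_i * x - b_i)^2 for data points (a_i, b_i),
# so the per-example gradient is 2 * a_i * (a_i * x - b_i).
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
grad_fi = lambda x, p: 2.0 * p[0] * (p[0] * x - p[1])
print(sgd(grad_fi, points, x0=0.0, eta=0.01, n_epochs=200))  # close to 2.0
```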
Pros
This method is much faster on datasets that have redundant information among examples. It is also known to cope better with noise.
Cons
The convergence rate is slower than that of second-order methods. It also tends to keep bouncing around the minimum unless the learning rate is reduced in later iterations.
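As a hedged illustration of the second point, one common remedy is to shrink the learning rate as training progresses, for example with a $1/t$-style schedule (the schedule and parameter names here are illustrative):

```python
def decayed_eta(eta0, decay, t):
    """1/t-style schedule: the step size shrinks as the iteration count t grows."""
    return eta0 / (1.0 + decay * t)

# e.g. eta0=0.1, decay=0.01: t=0 -> 0.1, t=100 -> 0.05, t=1000 -> ~0.0091
for t in (0, 100, 1000):
    print(t, decayed_eta(0.1, 0.01, t))
```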