Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods, Vishwanathan et al, ICML 2006
Citation
Vishwanathan et al., 2006. Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods. In Proceedings of the 23rd International Conference on Machine Learning.
Online version
Summary
In this paper, the authors apply Stochastic Meta Descent (SMD) to accelerate the training of Conditional Random Field (CRF) models.
Brief description of the method
Stochastic Approximation of Gradients
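As in stochastic gradient descent, the objective (the regularized negative log-likelihood) is approximated by subsampling small batches of $b \ll n$ training examples:

$$L(\theta) = \sum_{t=0}^{n/b - 1} L_b(\theta, t),$$

where

$$L_b(\theta, t) = \frac{b\,\|\theta\|^2}{2 n \sigma^2} - \sum_{i=1}^{b} \left[ \langle \phi(x_{bt+i}, y_{bt+i}), \theta \rangle - z(\theta \mid x_{bt+i}) \right].$$

Here $\phi$ is the feature map, $z(\theta \mid x)$ is the log-partition function, and $\sigma^2$ is the variance of the Gaussian prior on $\theta$.

For an optimization step, we use the batch gradient $g_t := \partial L_b(\theta, t) / \partial \theta$, evaluated at $\theta_t$, in place of the full gradient.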
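The batch approximation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: a multiclass log-linear model (a CRF with a single output node) stands in for the chain CRF so that the log-partition function is a plain log-sum-exp, the regularizer is assumed to be split across batches in proportion to batch size, and the names `batch_objective`, `sigma2`, `X` and `Y` are hypothetical.

```python
import numpy as np

def batch_objective(theta, X, Y, t, b, sigma2):
    """Regularized negative log-likelihood of batch t, and its gradient.

    Toy stand-in (an assumption, not the paper's model): a multiclass
    log-linear classifier. theta has shape (n_classes, n_features),
    X has shape (n, n_features), Y holds integer class labels.
    """
    n = X.shape[0]
    xb, yb = X[t * b:(t + 1) * b], Y[t * b:(t + 1) * b]
    scores = xb @ theta.T                          # (batch, n_classes)
    m = scores.max(axis=1, keepdims=True)          # for a stable log-sum-exp
    log_z = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    probs = np.exp(scores - log_z[:, None])
    # Scale the regularizer by (batch size / n) so that the batch
    # objectives add up to the full objective.
    reg = len(xb) / n
    loss = reg * np.sum(theta ** 2) / (2 * sigma2) \
        - np.sum(scores[np.arange(len(yb)), yb] - log_z)
    onehot = np.eye(theta.shape[0])[yb]
    grad = reg * theta / sigma2 - (onehot - probs).T @ xb
    return loss, grad
```

Summing `batch_objective` over all batches reproduces the full-data objective and gradient, which is the property that lets a single batch gradient serve as a cheap, noisy estimate of the full gradient.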
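Plain stochastic gradient descent may fail to converge, or may converge too slowly. Gain vector adaptation from SMD improves the convergence speed by giving each parameter its own gain. With $g_t$ the stochastic gradient at step $t$, $\eta_t$ the gain vector, and $\cdot$ the element-wise product, the parameter update is:

$$\theta_{t+1} = \theta_t - \eta_t \cdot g_t$$

The gain vector is updated multiplicatively with meta-gain $\mu$:

$$\eta_{t+1} = \eta_t \cdot \max\left(\tfrac{1}{2},\ 1 - \mu\, g_{t+1} \cdot v_{t+1}\right)$$

The vector $v$ captures the long-term second-order dependence of the system parameters on the gains, with memory controlled by the decay factor $\lambda$:

$$v_{t+1} = \lambda v_t - \eta_t \cdot (g_t + \lambda H_t v_t)$$

The Hessian-vector product $H_t v_t$ is computed efficiently, without ever forming the Hessian $H_t$, as a directional derivative of the gradient:

$$H_t v_t = \left.\frac{\partial\, g(\theta_t + r v_t)}{\partial r}\right|_{r=0}$$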
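A minimal sketch of one SMD step may make the interplay of gain, meta-gain and decay factor concrete. It rests on assumptions: the update order follows the usual SMD formulation (adapt the gains, step the parameters, then update the auxiliary vector), the Hessian-vector product is approximated here by a central finite difference of the gradient rather than by the exact differentiation technique the summary refers to, and `grad_fn`, `smd_step`, `hessian_vector`, `mu` and `lam` are hypothetical names. In a stochastic setting, `grad_fn` should evaluate the same batch for both the gradient and the Hessian-vector product.

```python
import numpy as np

def hessian_vector(grad_fn, theta, v, eps=1e-6):
    """Approximate H v as d/dr grad(theta + r v) at r = 0.

    Stand-in: central finite difference (two extra gradient calls),
    not the exact differentiation-based product.
    """
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)

def smd_step(grad_fn, theta, eta, v, mu=0.1, lam=0.99):
    """One SMD update; theta, eta and v are arrays of the same shape."""
    g = grad_fn(theta)
    # Multiplicative gain update, with the shrink factor bounded below by 1/2.
    eta = eta * np.maximum(0.5, 1.0 - mu * g * v)
    hv = hessian_vector(grad_fn, theta, v)
    # Per-parameter gradient step.
    theta_new = theta - eta * g
    # Auxiliary vector with exponential memory controlled by lam.
    v_new = lam * v - eta * (g + lam * hv)
    return theta_new, eta, v_new
```

Starting from `v = 0`, the first step leaves the gains unchanged and reduces to an ordinary gradient step. Later steps grow a parameter's gain when successive gradients keep agreeing in sign and shrink it (by at most half per step) when they oscillate.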
Experimental Results
Experiments on 1D chain CRFs
The authors measured the convergence rate of a 1D chain CRF model on the CoNLL'00 and GENIA datasets. The Stochastic Meta Descent (SMD) method described in the previous section is compared with the following methods:
- Simple stochastic gradient descent (SGD) with a fixed gain
- Batch-only limited-memory BFGS algorithm
- Collins' (2002) perceptron (CP), fully online update
As shown in the graphs, SMD converged much faster than the other methods, especially BFGS and CP. Notably, on both datasets SGD and SMD reached a substantial F-score even before they had seen every example in the dataset once.
Experiments on 2D lattice CRFs
For 2D CRFs, four optimization methods are compared: SMD, SGD, BFGS, and annealed stochastic gradient descent (ASG), in which the gain decreases over time. These methods are used to optimize either the conditional likelihood, approximated by loopy belief propagation (LBP) or mean field (MF), or the pseudo-likelihood (PL). The authors tested 2D CRFs on a binary image denoising task and on a classification task that detects image patches containing "manmade structures".
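To illustrate what separates ASG from fixed-gain SGD, the sketch below shows one common annealing schedule. The specific form and the names `annealed_gain`, `eta0` and `tau` are assumptions for illustration; the summary does not state which schedule the authors used.

```python
def annealed_gain(eta0, t, tau):
    """Gain after t updates under an assumed 1/t-style schedule.

    Equals eta0 at t = 0, eta0 / 2 at t = tau, and decays roughly
    like eta0 * tau / t once t is much larger than tau.
    """
    return eta0 * tau / (tau + t)

def fixed_gain(eta0, t):
    """Plain SGD baseline: the gain never changes."""
    return eta0
```

A large `tau` keeps the gain close to `eta0` for many updates, so the behavior approaches fixed-gain SGD; a small `tau` damps the steps early on.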
When exact inference is possible (PL), SMD and SGD converged faster than BFGS. In the other cases, SMD and SGD converge faster in the early passes, but BFGS then catches up and even does better in later passes.
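The reason the pseudo-likelihood needs no approximate inference can be seen in a short sketch: each site is conditioned on its observed neighbors, so every term is a simple logistic function. The model below is an assumed Ising-style binary lattice with labels in {-1, +1}, a per-site field `h` (which in a CRF would be computed from the image and the parameters) and a single coupling `J`; these names and the model form are illustrative, not taken from the paper.

```python
import numpy as np

def neg_log_pseudolikelihood(y, h, J):
    """Negative log pseudo-likelihood of a binary 4-neighbor lattice.

    Assumed model: p(y) is proportional to
    exp(sum_i h_i * y_i + J * sum_{edges (i,j)} y_i * y_j),
    with y an (H, W) array of labels in {-1, +1}.
    """
    nbr = np.zeros_like(h, dtype=float)
    nbr[1:, :] += y[:-1, :]    # neighbor above
    nbr[:-1, :] += y[1:, :]    # neighbor below
    nbr[:, 1:] += y[:, :-1]    # neighbor to the left
    nbr[:, :-1] += y[:, 1:]    # neighbor to the right
    field = h + J * nbr
    # p(y_i | neighbors) = sigmoid(2 * y_i * field_i), hence
    # -log p = log(1 + exp(-2 * y_i * field_i)).
    return np.sum(np.logaddexp(0.0, -2.0 * y * field))
```

No partition function over the whole lattice appears, so the objective and its gradient are computed exactly in time linear in the number of pixels.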
Related papers
Comment
If you're further interested in active learning for NLP, you might want to see Burr Settles' review of active learning: http://active-learning.net/ --Brendan 22:51, 13 October 2011 (UTC)