Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods, Vishwanathan et al, ICML 2006

Citation

Vishwanathan et al, 2006. Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods. In Proceedings of the 23rd International Conference on Machine Learning (ICML 2006).

Online version

Link

Summary

In this paper, the authors apply Stochastic Meta-Descent (SMD) to accelerate the training of Conditional Random Fields (CRFs).


Brief description of the method

Stochastic Approximation of Gradients

As in Stochastic Gradient Descent, the log-likelihood is approximated by subsampling small batches of the training data, so that each optimization step only requires the gradient computed on the current batch rather than on the full training set.
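A minimal sketch of this batch-wise gradient approximation is given below. The function and variable names are illustrative, not from the paper; batch_gradient is assumed to return the gradient of the objective restricted to the given batch, and rng can be an instance of numpy.random.default_rng().

<pre>
import numpy as np

def stochastic_gradient(batch_gradient, theta, data, batch_size, rng):
    # Draw a small random batch of training instances.
    idx = rng.choice(len(data), size=batch_size, replace=False)
    batch = [data[i] for i in idx]
    # Scale up so the batch gradient approximates the gradient over the full data set.
    return (len(data) / batch_size) * batch_gradient(theta, batch)
</pre>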


If we use plain stochastic gradient descent, training may converge slowly or not at all. The gain vector adaptation of SMD improves the convergence speed by giving each parameter its own gain, which scales its update element-wise:

<math>
\boldsymbol\theta_{t+1}=\boldsymbol\theta_{t}-\boldsymbol{\eta}_{t}\cdot\mathbf{g}_{t}
</math>

The gain vector is updated by a multiplicative update with meta-gain <math>\mu</math>:

<math>
\boldsymbol{\eta}_{t+1}=\boldsymbol{\eta}_{t}\cdot \max\left(\frac{1}{2}, 1-\mu\mathbf{g}_{t+1}\cdot\mathbf{v}_{t+1}\right)
</math>

The vector <math>\mathbf{v}</math> indicates the long-term second-order dependence of the system parameters, with memory controlled by the decay factor <math>\lambda</math>:

<math>
\mathbf{v}_{t+1} = \lambda\mathbf{v}_{t} - \boldsymbol{\eta}_{t}\cdot\left(\mathbf{g}_{t}+\lambda\mathbf{H}_{t}\mathbf{v}_{t}\right)
</math>
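Putting the three updates together, one possible reading of a single SMD step is sketched below. Here grad and hess_vec are assumed callables returning the batch gradient of the objective being minimized and the Hessian-vector product, and mu and lam stand for the meta-gain <math>\mu</math> and decay factor <math>\lambda</math>; this is an illustrative sketch with hypothetical default values, not the authors' implementation.

<pre>
import numpy as np

def smd_step(theta, eta, v, grad, hess_vec, mu=0.1, lam=0.99):
    # mu and lam are illustrative default values.
    g = grad(theta)                                # stochastic gradient g_t on the current batch
    eta = eta * np.maximum(0.5, 1.0 - mu * g * v)  # multiplicative gain update with meta-gain mu
    hv = hess_vec(theta, v)                        # H_t v_t, computed as described below
    theta = theta - eta * g                        # per-parameter gains scale each update
    v = lam * v - eta * (g + lam * hv)             # long-term second-order dependence
    return theta, eta, v
</pre>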

The Hessian-vector product <math>\mathbf{H}_{t}\mathbf{v}_{t}</math> is calculated efficiently using the following complex-step technique. Since

<math>
d\mathbf{g}\left(\boldsymbol\theta\right)=\mathbf{H}(\boldsymbol\theta)\,d\boldsymbol\theta,
</math>

perturbing the parameters along the imaginary axis gives

<math>
\mathbf{g}\left(\boldsymbol\theta+i\epsilon\, d\boldsymbol\theta\right) = \mathbf{g}(\boldsymbol\theta)+O(\epsilon^{2})+i\epsilon\, d\mathbf{g}(\boldsymbol\theta)
= \mathbf{g}(\boldsymbol\theta)+O(\epsilon^{2})+i\epsilon\, \mathbf{H}(\boldsymbol\theta)\,d\boldsymbol\theta
</math>

and in particular

<math>
\mathbf{g}\left(\boldsymbol\theta+i\epsilon\mathbf{v}_{t}\right) = \mathbf{g}(\boldsymbol\theta)+O(\epsilon^{2})+i\epsilon\, \mathbf{H}(\boldsymbol\theta)\mathbf{v}_{t},
</math>

so <math>\mathbf{H}_{t}\mathbf{v}_{t}</math> can be read off from the imaginary part of a single gradient evaluation at a complex-perturbed parameter vector.
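A sketch of this complex-step computation is shown below, assuming the gradient code is written with NumPy operations that propagate complex values; eps and the function names are illustrative choices, not from the paper.

<pre>
import numpy as np

def hessian_vector_product(grad, theta, v, eps=1e-20):
    # Evaluate the gradient at a parameter vector perturbed along the imaginary axis.
    g_perturbed = grad(theta + 1j * eps * v)
    # Dividing the imaginary part by eps gives H(theta) v up to O(eps^2) truncation error;
    # there is no subtractive cancellation, so eps can be made very small.
    return np.imag(g_perturbed) / eps
</pre>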

Experimental Results

The authors tested this method on MUC-7 and the oncology part of the PennBioIE corpus. The base learner used for the experiment is a linear-chain Conditional Random Field. Features used are orthographic features (regexp patterns), lexical and morphological features (prefix, suffix, lemmatized tokens), and contextual features (features of neighbor tokens). In terms of the number of tokens that had to be labeled to reach the maximal F-score, SeSAL could save about 60% over FuSAL, and 80% over random sampling. Having high confidence was also important because it could save the model from making errors in the early stages.

Related papers


Comment

If you're further interested in active learning for NLP, you might want to see Burr Settles' review of active learning: http://active-learning.net/ --Brendan 22:51, 13 October 2011 (UTC)