Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods, Vishwanathan et al, ICML 2006
 
== Summary ==

In this [[Category::paper]], the authors apply [[UsesMethod::Stochastic Meta Descent]] (SMD) to accelerate the training of a [[UsesMethod::Conditional Random Fields]] model. Because SMD adapts a separate step size for each parameter, training converges faster in many cases, especially when exact inference is possible. When exact inference is not possible, SMD has an advantage only early in training; BFGS later catches up and can even do better.
  
 
== Brief description of the method ==

=== Stochastic Approximation of Gradients ===
  
As in [[UsesMethod::Stochastic Gradient Descent]], the log-likelihood is approximated by subsampling small batches of size <math>b \ll m</math>, where <math>m</math> is the number of training examples:
  
<math>
\mathcal{L}(\boldsymbol{\theta}) = \sum_{t=0}^{\frac{m}{b}-1} {\mathcal{L}_{b}(\boldsymbol{\theta},t)}
</math> where
  
<math>
\mathcal{L}_{b}(\boldsymbol{\theta},t) = \frac{b\vert\vert\boldsymbol{\theta}_{t}\vert\vert^{2}}{2m\sigma^{2}} - \sum_{i=1}^{b} {\left[\langle\phi\left(\mathbf{x}_{bt+i},\mathbf{y}_{bt+i}\right),\boldsymbol{\theta}_{t}\rangle-z\left(\boldsymbol{\theta}_{t}\vert\mathbf{x}_{bt+i}\right)\right]}
</math>
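The <math>b/m</math> scaling of the regularizer makes the batch losses sum exactly to the full objective. The following is a minimal pure-Python sketch of this decomposition on a toy multiclass linear model (the data, feature map, and hyperparameters are invented for illustration; this is not the authors' CRF implementation):

```python
import math

def score(theta, x, y):
    # <phi(x, y), theta>: dot product of class-y weights with features x
    return sum(w * xi for w, xi in zip(theta[y], x))

def log_partition(theta, x):
    # z(theta | x) = log sum_y exp <phi(x, y), theta>
    return math.log(sum(math.exp(score(theta, x, y)) for y in range(len(theta))))

def full_loss(theta, data, sigma2):
    reg = sum(w * w for row in theta for w in row) / (2.0 * sigma2)
    return reg - sum(score(theta, x, y) - log_partition(theta, x) for x, y in data)

def batch_loss(theta, batch, b, m, sigma2):
    # regularizer scaled by b/m, so the batch losses sum to the full loss
    reg = b * sum(w * w for row in theta for w in row) / (2.0 * m * sigma2)
    return reg - sum(score(theta, x, y) - log_partition(theta, x) for x, y in batch)

# toy data: m = 6 examples, 2 features, 3 classes, batch size b = 2
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 2),
        ([0.5, 0.2], 0), ([0.1, 0.9], 1), ([0.7, 0.7], 2)]
theta = [[0.3, -0.1], [-0.2, 0.4], [0.1, 0.1]]
b, m, sigma2 = 2, len(data), 1.0

total = sum(batch_loss(theta, data[t * b:(t + 1) * b], b, m, sigma2)
            for t in range(m // b))
assert abs(total - full_loss(theta, data, sigma2)) < 1e-9
```

Because each batch carries its <math>b/m</math> share of the regularizer, no batch is regularized twice and the decomposition is exact.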
  
At each optimization step, the gradient is computed from the batch objective <math>\mathcal{L}_{b}\left(\boldsymbol{\theta},t\right)</math>:
  
<math>
\mathbf{g}_{t} = \frac{\partial}{\partial\boldsymbol\theta}\mathcal{L}_{b}\left(\boldsymbol{\theta},t\right)
</math>
  
Plain stochastic gradient descent may converge slowly or not at all. SMD improves the convergence speed through gain vector adaptation: each parameter gets its own gain (step size):
  
<math>
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_{t} - \boldsymbol\eta_{t}\cdot\mathbf{g}_{t}
</math>
  
The gain vector <math>\boldsymbol{\eta}_{t}\in\mathbb{R}_{+}^{n}</math> is updated multiplicatively with meta-gain <math>\mu</math>:
  
<math>
\boldsymbol{\eta}_{t+1}=\boldsymbol{\eta}_{t}\cdot \max\left(\frac{1}{2}, 1-\mu\mathbf{g}_{t+1}\cdot\mathbf{v}_{t+1}\right)
</math>

The vector <math>\mathbf{v}</math> captures the long-term second-order dependence of the system parameters on their gains, with memory controlled by the decay factor <math>\lambda</math>:

<math>
\mathbf{v}_{t+1} = \lambda\mathbf{v}_{t} - \boldsymbol{\eta}_{t}\cdot\left(\mathbf{g}_{t}+\lambda\mathbf{H}_{t}\mathbf{v}_{t}\right)
</math>
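Putting the three updates together, the SMD loop can be sketched on a toy quadratic objective whose diagonal Hessian makes <math>\mathbf{H}_{t}\mathbf{v}_{t}</math> trivial to form (the objective and all constants below are illustrative assumptions, not the paper's experimental settings; the gain update is applied elementwise):

```python
# SMD on f(theta) = 0.5 * sum(h_i * theta_i^2), with gradient g = h * theta
# and Hessian H = diag(h).  All products below are elementwise.
h = [1.0, 3.0]                       # diagonal Hessian (toy problem)
theta = [1.0, 1.0]                   # parameters
eta = [0.05, 0.05]                   # per-parameter gains eta_t
v = [0.0, 0.0]                       # long-term dependence vector v_t
mu, lam = 0.02, 0.9                  # meta-gain mu and decay factor lambda

grad = lambda th: [hi * ti for hi, ti in zip(h, th)]

g = grad(theta)                      # g_t
for _ in range(200):
    # theta_{t+1} = theta_t - eta_t * g_t
    theta = [ti - ei * gi for ti, ei, gi in zip(theta, eta, g)]
    # v_{t+1} = lam * v_t - eta_t * (g_t + lam * H_t v_t)
    v = [lam * vi - ei * (gi + lam * hi * vi)
         for vi, ei, gi, hi in zip(v, eta, g, h)]
    g = grad(theta)                  # g_{t+1}
    # eta_{t+1} = eta_t * max(1/2, 1 - mu * g_{t+1} * v_{t+1})
    eta = [ei * max(0.5, 1.0 - mu * gi * vi)
           for ei, gi, vi in zip(eta, g, v)]

loss = 0.5 * sum(hi * ti * ti for hi, ti in zip(h, theta))
assert loss < 1e-6                   # converged close to the minimum at 0
```

Note the order of operations: <math>\boldsymbol{\theta}</math> and <math>\mathbf{v}</math> are advanced with the current gains, and only then is the gain vector adapted using the fresh gradient.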

The Hessian-vector product is computed efficiently with the following complex-step technique: since

<math>
d\mathbf{g}\left(\boldsymbol\theta\right)=\mathbf{H}(\boldsymbol\theta)d\boldsymbol\theta
</math>

we have

<math>
\mathbf{g}\left(\boldsymbol\theta+i\epsilon d\boldsymbol\theta\right) = \mathbf{g}(\boldsymbol\theta)+O(\epsilon^{2})+i\epsilon d\mathbf{g}(\boldsymbol\theta)
= \mathbf{g}(\boldsymbol\theta)+O(\epsilon^{2})+i\epsilon \mathbf{H}(\boldsymbol\theta)d \boldsymbol\theta
</math>

and in particular

<math>
\mathbf{g}\left(\boldsymbol\theta+i\epsilon\mathbf{v}_{t}\right) = \mathbf{g}(\boldsymbol\theta)+O(\epsilon^{2})+i\epsilon \mathbf{H}(\boldsymbol\theta)\mathbf{v}_{t}
</math>
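Because the imaginary part isolates the <math>i\epsilon \mathbf{H}(\boldsymbol\theta)\mathbf{v}_{t}</math> term with no subtractive cancellation, <math>\epsilon</math> can be taken extremely small. A self-contained sketch on a toy gradient (an assumed objective for illustration, not the CRF gradient):

```python
# Toy objective f(theta) = sum(theta_i^4) / 4, so the gradient is
# g(theta)_i = theta_i^3 and the Hessian is diag(3 * theta_i^2).
def grad(theta):
    return [t * t * t for t in theta]    # also works on complex inputs

def hessian_vector(theta, v, eps=1e-20):
    # g(theta + i*eps*v).imag / eps  ~=  H(theta) v, up to O(eps^2)
    gc = grad([t + 1j * eps * vi for t, vi in zip(theta, v)])
    return [gi.imag / eps for gi in gc]

theta, v = [1.0, 2.0], [1.0, 0.5]
hv = hessian_vector(theta, v)
exact = [3.0 * t * t * vi for t, vi in zip(theta, v)]    # [3.0, 6.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(hv, exact))
```

The same trick applies to any gradient routine written with operations that extend to complex arguments, which is why only one extra gradient evaluation per step is needed for SMD's <math>\mathbf{H}_{t}\mathbf{v}_{t}</math>.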
  
 
== Experimental Result ==

=== Experiments on 1D chain CRFs ===
  
The authors tested the convergence rate of a 1D chain CRF on the [[UsesDataset::CoNLL'00]] and [[UsesDataset::GENIA_dataset]] corpora. The Stochastic Meta Descent (SMD) method described in the previous section is compared with the following methods:

* Simple stochastic gradient descent (SGD) with a fixed gain
* Batch-only limited-memory BFGS algorithm
* Collins' (2002) perceptron (CP), fully online update

As shown in the graphs, SMD converged much faster than the other methods, especially BFGS and CP. It is worth noting that on both datasets, SGD and SMD reached a substantial F-score before they had even seen all the examples in the dataset.

[[File:Vishwanathan et al ICML2006 1.png|800px]]

=== Experiments on 2D lattice CRFs ===

For 2D CRFs, four optimization methods are compared: SMD, SGD, BFGS, and annealed stochastic gradient descent (ASG), whose gain <math>\eta_{t}=\eta_{0}/t</math> decays over time.
These methods are used to optimize the conditional likelihood approximated by loopy belief propagation (LBP) or mean field (MF), as well as the pseudo-likelihood (PL). The authors tested 2D CRFs on a binary image denoising task and on a classification task that recognizes image patches containing "manmade structures".
  
[[File:Vishwanathan et al ICML2006 2.png|400px]][[File:Vishwanathan et al ICML2006 3.png|400px]]

When exact computation of the objective is possible (PL), SMD and SGD converged faster. In the other cases, SMD and SGD converge faster in the beginning, but BFGS catches up and even does better in later passes.
  
== Related papers ==

* [[RelatedPaper::Lafferty_2001_Conditional_Random_Fields]]
* [[RelatedPaper::Sha_2003_shallow_parsing_with_conditional_random_fields]]

== Citation ==

Vishwanathan et al, 2006. Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods. In Proceedings of the 23rd International Conference on Machine Learning.

== Online version ==

Link
