Entropy Gradient for Semi-Supervised Conditional Random Fields

This [[Category::method]] is used by [[RelatedPaper::Mann and McCallum, 2007]] for efficient computation of the entropy gradient used as a regularizer to train [[AddressesProblem::semi-supervised]] [[UsesMethod::Conditional Random Fields | conditional random fields]]. The method is an improvement over the approach originally proposed by [[RelatedPaper::Jiao et al. 2006]] in terms of computing the gradient on the unlabeled part of the training data.

== Summary ==

Entropy regularization (ER) is a method applied to [[AddressesProblem::semi-supervised learning]] that augments a standard conditional likelihood objective function with an additional term that aims to minimize the predicted label entropy on unlabeled data. By insisting on peaked, confident predictions, ER guides the decision boundary away from dense regions of input space. Entropy regularization for semi-supervised learning was first proposed for classification tasks by [[RelatedPaper::Grandvalet and Bengio, 2004]].
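As a sketch of the idea (the notation here is illustrative and not taken from the paper), for a probabilistic classifier <math> p_{\theta}(y|x) </math> trained on labeled examples <math> (x_i, y_i) </math> and unlabeled examples <math> x_j </math>, the entropy-regularized objective has the form

<math>
\max_{\theta} \; \sum_{i} \log p_{\theta}(y_i|x_i) - \lambda \sum_{j} H\left( p_{\theta}(\cdot|x_j) \right),
\qquad
H\left( p_{\theta}(\cdot|x_j) \right) = - \sum_{y} p_{\theta}(y|x_j) \log p_{\theta}(y|x_j)
</math>

where the weight <math> \lambda </math> controls how strongly confident (low-entropy) predictions on the unlabeled data are rewarded.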

== Motivation ==

[[RelatedPaper::Jiao et al. 2006]] apply this method to linear-chain CRFs and demonstrate encouraging accuracy improvements on a gene-name-tagging task. However, the method they present for calculating the gradient of the entropy takes substantially more time than the traditional supervised-only gradient. Whereas supervised training requires only classic [[UsesMethod::Forward-Backward | forward-backward]] style algorithms, taking time <math> O(ns^{2}) </math> (sequence length times the square of the number of labels), their training method takes <math> O(n^{2}s^{3}) </math>, a factor of <math> O(ns) </math> more.

The method proposed in [[RelatedPaper::Mann and McCallum, 2007]] introduces a more efficient way to derive the entropy gradient, based on [[UsesMethod::dynamic programming]], that has the same asymptotic time complexity as supervised CRF training, <math> O(ns^{2}) </math>. The calculation introduces the concept of ''subsequence constrained entropy'': the entropy of a CRF for an observed data sequence when part of the label sequence is fixed. This makes the method especially useful for training CRFs on larger unannotated data sets.

== Semi-Supervised CRF Training ==

A standard linear-chain CRF is trained by maximizing the log-likelihood <math> L(\theta; D) </math> on a labeled dataset <math>D</math>. Gradient methods such as L-BFGS are commonly used to optimize the following objective function:

[[File:CRF-objective-function.jpg]]
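Written out (as a reconstruction of the figure's content, with <math> F_{k}(x,y) </math> denoting the feature functions summed over sequence positions), this is the standard conditional log-likelihood of a linear-chain CRF:

<math>
L(\theta; D) = \sum_{d} \log p_{\theta}(y^{(d)}|x^{(d)}),
\qquad
p_{\theta}(y|x) = \frac{1}{Z_{\theta}(x)} \exp\left( \sum_{k} \theta_{k} F_{k}(x,y) \right)
</math>
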
For semi-supervised training by entropy regularization, the objective function is augmented by adding the negative entropy of the unannotated data <math> U = \langle u_1..u_n \rangle </math>, as shown below. A Gaussian prior is also added to the function.

[[File:semi-CRF-objective-function.jpg]]
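As a reconstruction consistent with the description above (the trade-off weight <math> \lambda </math> and the prior variance <math> \sigma^{2} </math> are symbols introduced here, not copied from the figure), the augmented objective has the form

<math>
O(\theta; D, U) = \sum_{d} \log p_{\theta}(y^{(d)}|x^{(d)})
+ \lambda \sum_{j} \sum_{Y} p_{\theta}(Y|u_{j}) \log p_{\theta}(Y|u_{j})
- \sum_{k} \frac{\theta_{k}^{2}}{2\sigma^{2}}
</math>

where the middle term is the negative entropy of the label distribution on each unlabeled sequence <math> u_{j} </math> and the last term is the Gaussian prior.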

== Entropy Gradient Computation ==

[[RelatedPaper::Jiao et al. 2006]] perform the computation of the entropy gradient in the following manner:

[[File:Jiao-semi-CRF.jpg]]
As they show, the second term of the covariance is easy to compute. However, the first term requires the calculation of quadratic feature expectations. The algorithm they propose for this term has <math> O(n^{2}s^{3}) </math> time complexity, as it requires an extra nested loop inside the [[UsesMethod::Forward-Backward | forward-backward algorithm]].

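Concretely, for a single unlabeled sequence the gradient of the negative entropy with respect to <math> \theta_{k} </math> can be written in covariance form (a standard derivation restated here, not copied from the paper):

<math>
\frac{\partial}{\partial \theta_{k}} \sum_{Y} p_{\theta}(Y|x) \log p_{\theta}(Y|x)
= \sum_{j} \theta_{j} \left( E_{p_{\theta}(Y|x)}\left[ F_{j}(x,Y) F_{k}(x,Y) \right]
- E\left[ F_{j}(x,Y) \right] E\left[ F_{k}(x,Y) \right] \right)
</math>

Here the first term of the covariance is the quadratic feature expectation <math> E\left[ F_{j}F_{k} \right] </math>, while the second term is a product of marginal feature expectations that forward-backward already provides.
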
An alternative derivation proposed by [[RelatedPaper::Mann and McCallum, 2007]] is presented as follows:

[[File:Mann-and-McCallum-semi-CRF.jpg]]
The second term in the summation cancels out in the above equation because the inner sum over <math> Y^{\prime} </math> does not depend on <math> Y </math> and <math> \sum_{Y} p_{\theta}(Y|x) = 1 </math>, so that

<math>
\sum_{Y} p_{\theta}(Y|x) \sum_{Y^{\prime}} p_{\theta} (Y^{\prime}|x) F_{k}(x,Y^{\prime}) = \sum_{Y^{\prime}} p_{\theta} (Y^{\prime}|x) F_{k}(x,Y^{\prime})
</math>

Hence, the gradient reduces to

[[File:semi-CRF-second-term.jpg]]
The second term in the equation above can again be computed efficiently, given the feature expectations obtained by forward-backward and the entropy of the full sequence. The authors then show, through further analysis, that the first term can also be computed efficiently. It is first reduced to the following form:

[[File:first-term-reduced.jpg]]
To calculate this term, it is sufficient to compute <math> \sum_{Y_{-(i..i+1)}} p_{\theta}(Y|x) \log p_{\theta} (Y|x) </math> for all pairs <math> y_{i} </math>, <math> y_{i+1} </math>. This computation is performed using the ''subsequence constrained entropy'' introduced in the next section.

== Subsequence Constrained Entropy ==

Subsequence constrained entropy is defined as follows:

<math>
H_{\sigma} \left( Y_{-(a..b)}|y_{a..b},x \right) = \sum_{Y_{-(a..b)}} p_{\theta}(Y|x) \log p_{\theta} (Y|x)
</math>

The authors point out that this entropy can be factored given a linear-chain CRF of Markov order 1, since <math> Y_{i+2} </math> is independent of <math> Y_i </math> given <math> Y_{i+1} </math>.

[[File:sce-term.jpg]]
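In notation consistent with the definitions on this page (a reconstruction; the paper's exact presentation and sign conventions may differ), fixing the pair <math> y_{i}, y_{i+1} </math> and using the Markov property to split the remaining labels into a prefix and a suffix gives

<math>
\sum_{Y_{-(i..i+1)}} p_{\theta}(Y|x) \log p_{\theta}(Y|x)
= p_{\theta}(y_{i}, y_{i+1}|x) \left( \log p_{\theta}(y_{i}, y_{i+1}|x)
- H^{\alpha}(Y_{1..i-1}|y_{i},x) - H^{\beta}(Y_{i+2..n}|y_{i+1},x) \right)
</math>

where <math> H^{\alpha} </math> and <math> H^{\beta} </math> are the (non-negative) entropies of the prefix labels given <math> y_{i} </math> and of the suffix labels given <math> y_{i+1} </math>, respectively.
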
The <math> H^{\alpha}(\cdot) </math> and <math> H^{\beta}(\cdot) </math> lattices are computed from the probabilities obtained by forward-backward in the following way:

[[File:H-entropy-computation.jpg]]
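The recursions have the following general shape (a reconstruction based on the chain rule for entropy and the Markov property; the paper's indexing may differ):

<math>
H^{\alpha}(Y_{1..i}|y_{i+1},x) = \sum_{y_{i}} p_{\theta}(y_{i}|y_{i+1},x)
\left( H^{\alpha}(Y_{1..i-1}|y_{i},x) - \log p_{\theta}(y_{i}|y_{i+1},x) \right)
</math>

with base case <math> H^{\alpha}(\cdot|y_{1},x) = 0 </math>, and a symmetric right-to-left recursion for <math> H^{\beta} </math>. The conditional probabilities <math> p_{\theta}(y_{i}|y_{i+1},x) </math> are ratios of quantities already produced by forward-backward.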
  
For one label sequence, the computation requires a single run of forward-backward at <math> O(ns^{2}) </math>, an equivalent amount of time to fill the <math> H^{\alpha}(\cdot) </math> and <math> H^{\beta}(\cdot) </math> lattices, and a subsequent iteration over all label pairs at each position, which is again <math> O(ns^{2}) </math>.
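The following is a minimal, self-contained sketch of the whole computation on a toy linear-chain model with arbitrary potentials (it is not the authors' implementation; the variable names phi, psi, H_alpha, H_beta and the helper functions are introduced here for illustration). It builds the entropy lattices from forward-backward quantities and checks the dynamic-programming result against brute-force enumeration over all label sequences.

<pre>
import itertools
import numpy as np

# Toy linear-chain model: phi[i, y] are node potentials, psi[i, y, y'] are
# edge potentials, all strictly positive and chosen arbitrarily.
rng = np.random.default_rng(0)
n, s = 5, 3                                      # sequence length, label count
phi = rng.uniform(0.5, 2.0, size=(n, s))
psi = rng.uniform(0.5, 2.0, size=(n - 1, s, s))

# Standard forward-backward over the unnormalized chain.
alpha = np.zeros((n, s))
alpha[0] = phi[0]
for i in range(n - 1):
    alpha[i + 1] = (alpha[i] @ psi[i]) * phi[i + 1]
beta = np.ones((n, s))
for i in range(n - 2, -1, -1):
    beta[i] = psi[i] @ (phi[i + 1] * beta[i + 1])
Z = alpha[-1].sum()

# Entropy lattices, filled in O(n s^2) just like forward-backward.
# H_alpha[i, y]: entropy of the label prefix before position i, given Y_i = y.
# H_beta[i, y]:  entropy of the label suffix after position i, given Y_i = y.
H_alpha = np.zeros((n, s))
for i in range(n - 1):
    # Pb[y, y'] = p(Y_i = y | Y_{i+1} = y', x); each column sums to 1.
    Pb = alpha[i][:, None] * psi[i] * phi[i + 1][None, :] / alpha[i + 1][None, :]
    H_alpha[i + 1] = (Pb * (H_alpha[i][:, None] - np.log(Pb))).sum(axis=0)
H_beta = np.zeros((n, s))
for i in range(n - 2, -1, -1):
    # Pf[y, y'] = p(Y_{i+1} = y' | Y_i = y, x); each row sums to 1.
    Pf = psi[i] * (phi[i + 1] * beta[i + 1])[None, :] / beta[i][:, None]
    H_beta[i] = (Pf * (H_beta[i + 1][None, :] - np.log(Pf))).sum(axis=1)

def sce_dp(i, yi, yj):
    """Sum of p(Y|x) log p(Y|x) over Y with Y_i = yi, Y_{i+1} = yj, via lattices."""
    p_pair = alpha[i, yi] * psi[i, yi, yj] * phi[i + 1, yj] * beta[i + 1, yj] / Z
    return p_pair * (np.log(p_pair) - H_alpha[i, yi] - H_beta[i + 1, yj])

def sce_brute(i, yi, yj):
    """Same quantity by explicit enumeration over all s**n label sequences."""
    total = 0.0
    for Y in itertools.product(range(s), repeat=n):
        if Y[i] != yi or Y[i + 1] != yj:
            continue
        w = np.prod([phi[t, Y[t]] for t in range(n)])
        w *= np.prod([psi[t, Y[t], Y[t + 1]] for t in range(n - 1)])
        total += (w / Z) * np.log(w / Z)
    return total

# The dynamic-programming value matches brute force for every position and pair.
for i in range(n - 1):
    for yi in range(s):
        for yj in range(s):
            assert np.isclose(sce_dp(i, yi, yj), sce_brute(i, yi, yj))
print("subsequence constrained entropies verified")
</pre>

Both lattice sweeps and the final pass over all position and label-pair combinations stay within <math> O(ns^{2}) </math>, which matches the complexity argument above.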
  
 
== Related Papers ==