Entropy Gradient for Semi-Supervised Conditional Random Fields

From Cohen Courses
Jump to navigationJump to search

This method is used by Mann and McCallum, 2007 for efficient computation of the entropy gradient used as a regularizer to train semi-supervised conditional random fields. The method is an improvement over the original proposed approach by Jiao et al. 2006 in terms of computing the gradient on unlabeled part of the training data.


Summary

Entropy regularization (ER) is a method applied to semi-supervised learning that augments a standard conditional likelihood objective function with an additional term that aims to minimize the predicted label entropy on unlabeled data. By insisting on peaked, confident predictions, ER guides the decision boundary away from dense regions of input space. Entropy regularization for semi-supervised learning was first proposed for classification tasks by Grandvalet and Bengio, 2004.

Motivation

Jiao et al. 2006 apply this method to linear chain CRFs and demonstrate encouraging accuracy improvements on a gene-name-tagging task. However, the method they presented for calculating the gradient of the entropy takes substantially greater time than the raditional supervised-only gradient. Whereas supervised training requires only classic forward-backward style algorithms, taking time (sequence length times the square of the number of labels), their training method takes — a factor of more.

This method proposed in Mann and McCallum, 2007 introduces a more efficient way to derive entropy gradient based on dynamic programming that has the same asymptotic time complexity as that of a supervised CRF training process, . This calculation introduces the concept of subsequence constrained entropy — the entropy of a CRF for an observed data sequence when part of the label sequence is fixed. This method is especially useful for training CRFs on larger unannotated data sets.

Semi-Supervised CRF Training

A standard linear chain CRF is trained by maximizing the log-likelihood on a labeled dataset . Gradient methods like L-BFGS are commonly used to optimize the following objective function:

CRF-objective-function.jpg

For semi-supervised training by entropy regularization, the objective function is augmented by adding the negative entropy of the unannotated data as shown below. A Gaussian prior is also added to the function.

Semi-CRF-objective-function.jpg

Entropy Gradient Computation

Jiao et al. 2006 perform the computation of entropy gradient in the following manner:

Jiao-semi-CRF.jpg

As shown by them, the second term of the covariance is easy to compute. However, the first term requires calculation of quadratic feature expectations. The algorithm they proposed to compute this term involves time complexity as it requires an extra nested loop in the forward-backward algorithm.

An alternative derivation proposed by Mann and McCallum, 2007 is presented as follows:

Mann-and-McCallum-semi-CRF.jpg


The second term in the summation is shown to cancel out in the above equation because,


Hence, the gradient reduces to

Semi-CRF-second-term.jpg


The second term in the equation above can again be computed efficiently given the feature expectations obtained by forward/backward and the entropy for the sequence. The first term is now shown by the authors to be efficiently computable by performing further analysis. The first term is initially reduced to the following form:

First-term-reduced.jpg


In order to calculate this term, it is sufficient to compute for all pairs , . This computation is performed by the then-introduced subsequence constrained entropy method.

Subsequence Constrained Entropy

Subsequence constrained entropy is defined as follows:

The authors point that the entropy can be factored given a linear-chain CRF of Markov order 1, since is independent of given .

Sce-term.jpg


Computing the and lattices is performed using the probabilities obtained by forward-backward in the following way:

H-entropy-computation.jpg


The computational complexity of this calculation for one label sequence and subsequent iteration over all label pairs requires one run of forward-backward at , and equivalent time to calculate the lattices and .

Related Papers