Entropy Gradient for Semi-Supervised Conditional Random Fields
This method, introduced by Mann and McCallum, 2007, efficiently computes the gradient of the entropy regularizer used to train semi-supervised conditional random fields. It improves on the approach originally proposed by Jiao et al., 2006 by reducing the cost of computing the gradient on the unlabeled portion of the training data.
Summary
Entropy regularization (ER) is a semi-supervised learning method that augments a standard conditional likelihood objective function with an additional term that aims to minimize the predicted label entropy on unlabeled data. By insisting on peaked, confident predictions, ER guides the decision boundary away from dense regions of the input space. Entropy regularization for semi-supervised learning was first proposed for classification tasks by Grandvalet and Bengio, 2004.
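In the classification setting of Grandvalet and Bengio, 2004, the regularized criterion takes the following general form, where $L$ is the labeled set, $U$ the unlabeled set, and $\lambda \geq 0$ a trade-off hyperparameter (a standard presentation of the idea, not a quotation of the paper's notation):

$$ \max_{\theta} \; \sum_{i \in L} \log p_\theta(y_i \mid x_i) \;-\; \lambda \sum_{j \in U} H\big(p_\theta(\cdot \mid x_j)\big), \qquad H\big(p_\theta(\cdot \mid x)\big) = -\sum_{y} p_\theta(y \mid x) \log p_\theta(y \mid x) $$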
Motivation
Jiao et al., 2006 apply this method to linear-chain CRFs and demonstrate encouraging accuracy improvements on a gene-name-tagging task. However, the method they present for calculating the gradient of the entropy takes substantially more time than the traditional supervised-only gradient. Whereas supervised training requires only classic forward/backward style algorithms, taking time $O(ns^2)$ (sequence length times the square of the number of labels), their training method takes $O(n^2s^3)$, a factor of $O(ns)$ more.
The method proposed in Mann and McCallum, 2007 introduces a more efficient way to compute the entropy gradient, based on dynamic programming, that has the same $O(ns^2)$ asymptotic time complexity as supervised CRF training. The calculation introduces the concept of subsequence constrained entropy: the entropy of a CRF for an observed data sequence when part of the label sequence is fixed. This efficiency makes the method especially useful for training CRFs on larger unannotated data sets; a sketch of the underlying forward/backward machinery follows.
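The paper's dynamic program operates on lattices of subsequence constrained entropies, and reproducing it faithfully is beyond a short sketch. As a minimal illustration of why the entropy of a linear chain is computable at forward/backward cost, the sketch below uses the simpler identity $H(\mathbf{Y} \mid \mathbf{x}) = \log Z(\mathbf{x}) - \sum_t \mathbb{E}[\log M_t(y_{t-1}, y_t)]$, which holds for any chain with pairwise log-potentials $\log M_t$. The data layout and function names here are illustrative assumptions, not from the paper:

```python
# Entropy of a linear-chain distribution p(y | x) ∝ prod_t M_t(y_{t-1}, y_t),
# computed in O(n s^2) with forward/backward, then checked by brute force.
# This is an illustration of the complexity claim, NOT the subsequence-
# constrained dynamic program of Mann and McCallum, 2007.
import itertools
import numpy as np

def chain_entropy(logM):
    """logM: (n, s, s) log-potentials; logM[0] uses only dummy start row 0."""
    n, s, _ = logM.shape
    # Forward pass: alpha[t, y] = log-sum over prefixes ending in label y.
    alpha = np.full((n, s), -np.inf)
    alpha[0] = logM[0, 0]                      # dummy start state fixed to 0
    for t in range(1, n):
        alpha[t] = np.logaddexp.reduce(alpha[t - 1][:, None] + logM[t], axis=0)
    logZ = np.logaddexp.reduce(alpha[-1])
    # Backward pass: beta[t, y] = log-sum over suffixes following label y.
    beta = np.zeros((n, s))
    for t in range(n - 2, -1, -1):
        beta[t] = np.logaddexp.reduce(logM[t + 1] + beta[t + 1][None, :], axis=1)
    # H(Y|x) = log Z - sum_t E[log M_t], via initial and pairwise marginals.
    entropy = logZ
    p1 = np.exp(alpha[0] + beta[0] - logZ)     # marginal of the first label
    entropy -= np.sum(p1 * logM[0, 0])
    for t in range(1, n):
        logp = alpha[t - 1][:, None] + logM[t] + beta[t][None, :] - logZ
        entropy -= np.sum(np.exp(logp) * logM[t])
    return entropy

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, s = 5, 3
    logM = rng.normal(size=(n, s, s))
    # Brute-force check: enumerate all s^n label sequences.
    scores = np.array([sum(logM[0, 0, ys[0]] if t == 0 else logM[t, ys[t - 1], ys[t]]
                           for t in range(n))
                       for ys in itertools.product(range(s), repeat=n)])
    p = np.exp(scores - np.logaddexp.reduce(scores))
    print(chain_entropy(logM), -(p * np.log(p)).sum())  # should match
```

The forward and backward passes each touch $ns^2$ lattice entries, so the whole computation matches the asymptotic cost of supervised training.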
Semi-Supervised CRF Training
A standard linear-chain CRF is trained by maximizing the log-likelihood on a labeled data set $D = \{(\mathbf{x}^{(d)}, \mathbf{y}^{(d)})\}_{d=1}^{N}$. Gradient methods like L-BFGS are commonly used to optimize the following objective function:

$$ \mathcal{L}(\theta; D) = \sum_{d=1}^{N} \log p_\theta(\mathbf{y}^{(d)} \mid \mathbf{x}^{(d)}) = \sum_{d=1}^{N} \left[ \sum_{t} \sum_{k} \theta_k f_k(y_{t-1}^{(d)}, y_t^{(d)}, \mathbf{x}^{(d)}, t) - \log Z_\theta(\mathbf{x}^{(d)}) \right] $$
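For reference, the gradient consumed by L-BFGS is the standard difference between observed and expected feature counts, where $F_k(\mathbf{x}, \mathbf{y}) = \sum_t f_k(y_{t-1}, y_t, \mathbf{x}, t)$ is the total count of feature $k$ on a sequence (a textbook CRF identity; the expectations are exactly what the forward/backward pass computes in $O(ns^2)$):

$$ \frac{\partial \mathcal{L}}{\partial \theta_k} = \sum_{d=1}^{N} \Big( F_k(\mathbf{x}^{(d)}, \mathbf{y}^{(d)}) - \mathbb{E}_{p_\theta(\mathbf{y} \mid \mathbf{x}^{(d)})}\big[ F_k(\mathbf{x}^{(d)}, \mathbf{y}) \big] \Big) $$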
For semi-supervised training by entropy regularization, the objective function is augmented by adding the negative entropy of the model's predictions on the unannotated data $U = \{\mathbf{x}^{(u)}\}_{u=1}^{M}$, weighted by a coefficient $\gamma$, as shown below. A Gaussian prior over the parameters is also added to the function.

$$ \mathcal{O}(\theta; D, U) = \sum_{d=1}^{N} \log p_\theta(\mathbf{y}^{(d)} \mid \mathbf{x}^{(d)}) + \gamma \sum_{u=1}^{M} \sum_{\mathbf{y}} p_\theta(\mathbf{y} \mid \mathbf{x}^{(u)}) \log p_\theta(\mathbf{y} \mid \mathbf{x}^{(u)}) - \sum_{k} \frac{\theta_k^2}{2\sigma^2} $$
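To see where the computational difficulty lies, write $F_k$ for the total feature counts as above; a short exponential-family derivation (consistent with the covariance form used by Jiao et al., 2006) gives the gradient of the negative-entropy term as:

$$ \frac{\partial}{\partial \theta_k} \sum_{\mathbf{y}} p_\theta(\mathbf{y} \mid \mathbf{x}) \log p_\theta(\mathbf{y} \mid \mathbf{x}) = \sum_{j} \theta_j \, \mathrm{Cov}_{p_\theta(\mathbf{y} \mid \mathbf{x})}\big( F_j(\mathbf{x}, \mathbf{y}),\, F_k(\mathbf{x}, \mathbf{y}) \big) $$

Evaluating these pairwise feature covariances naively is what drives the $O(n^2s^3)$ cost of the original procedure; the subsequence constrained entropy dynamic program of Mann and McCallum, 2007 reorganizes the computation to reach $O(ns^2)$.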