Jiao et al COLING 2006

Citation

Jiao, F., Wang, S., Lee, C.H., Greiner, R., and Schuurmans, D. Semi-supervised conditional random fields for improved sequence segmentation and labeling. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. (2006) 209-216.

Online Version

http://acl.ldc.upenn.edu/P/P06/P06-1027.pdf

Summary

This paper presented a novel method for using a CRF in a semi-supervised learning setting. HMMs and other generative models easily incorporate unlabeled data using EM, but have difficulty with non-independent features. Semi-supervised discriminative approaches had been less well explored. By incorporating unlabeled data, the new technique improves accuracy over a baseline CRF trained only on labeled data. In tandem, the authors developed an efficient dynamic programming algorithm to calculate the covariance matrix of the features, which is needed to compute the gradient and perform iterative ascent.

The key idea is to minimize the conditional entropy of the model on the unlabeled data, thereby maximizing the certainty of the labelings and reinforcing the supervised labels. Equivalently, this can be viewed as maximizing a KL divergence, which pushes two distributions "farther" apart and decreases their overlap.
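The entropy and KL views can be checked numerically with a small sketch. The summary does not say which two distributions are meant; the reading assumed here compares the model's label distribution with the uniform distribution over K labels, for which the identity H(p) + KL(p || uniform) = log K holds. The function names and the two example distributions are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats; zero-probability labels contribute 0."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def kl_from_uniform(p):
    """KL(p || u), where u is uniform over len(p) labels (assumed reference)."""
    k = len(p)
    return sum(q * math.log(q * k) for q in p if q > 0.0)

# Hypothetical label distributions for one unlabeled example.
confident = [0.9, 0.05, 0.05]
uncertain = [0.4, 0.3, 0.3]

for p in (confident, uncertain):
    # H(p) + KL(p || u) equals log K, so lowering the entropy raises the KL term.
    print(entropy(p), kl_from_uniform(p), math.log(len(p)))
```

Under this reading, a more confident distribution has lower entropy and sits farther from uniform, which is the sense in which the two objectives coincide.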

The training objective to be maximized is the conditional log-likelihood of the labeled samples plus the negative conditional entropy of the unlabeled examples, minus a regularization term. The extra entropy term makes the objective non-concave. However, one can still attempt to improve on a fully supervised CRF by using its learned parameter values as the starting point for L-BFGS.
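To make the objective concrete, here is a minimal sketch for a toy unstructured log-linear classifier rather than a sequence CRF, so it is not the authors' implementation. The feature map `features`, the trade-off weight `gamma`, and the L2 penalty weight `l2` are assumptions introduced for illustration. The sketch also shows where a feature covariance matrix comes from: for a log-linear model, the gradient of the negative entropy term is the covariance of the features under the model, multiplied by the parameter vector.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def features(x, y, num_labels):
    """Hypothetical feature map f(x, y): copy x into the block for label y."""
    f = [0.0] * (len(x) * num_labels)
    f[y * len(x):(y + 1) * len(x)] = x
    return f

def posterior(theta, x, num_labels):
    """p(y | x) proportional to exp(theta . f(x, y))."""
    scores = [sum(t * v for t, v in zip(theta, features(x, y, num_labels)))
              for y in range(num_labels)]
    return softmax(scores)

def neg_entropy(theta, x, num_labels):
    """Negative conditional entropy: sum_y p(y|x) log p(y|x), always <= 0."""
    return sum(p * math.log(p) for p in posterior(theta, x, num_labels) if p > 0.0)

def objective(theta, labeled, unlabeled, num_labels, gamma, l2):
    """Log-likelihood + gamma * negative entropy - L2 penalty (assumed form)."""
    loglik = sum(math.log(posterior(theta, x, num_labels)[y]) for x, y in labeled)
    ent = sum(neg_entropy(theta, x, num_labels) for x in unlabeled)
    penalty = 0.5 * l2 * sum(t * t for t in theta)
    return loglik + gamma * ent - penalty

def neg_entropy_grad(theta, x, num_labels):
    """Gradient of neg_entropy with respect to theta, computed as Cov[f] theta."""
    p = posterior(theta, x, num_labels)
    feats = [features(x, y, num_labels) for y in range(num_labels)]
    n = len(theta)
    labels = range(num_labels)
    mean = [sum(p[y] * feats[y][i] for y in labels) for i in range(n)]
    # Covariance of the features under the model distribution p(y | x).
    cov = [[sum(p[y] * feats[y][i] * feats[y][j] for y in labels)
            - mean[i] * mean[j] for j in range(n)] for i in range(n)]
    return [sum(cov[i][j] * theta[j] for j in range(n)) for i in range(n)]
```

This toy model enumerates every label directly. In a CRF the same expectation runs over exponentially many label sequences, which is why a dynamic program is needed for the covariance. The sketch omits the optimizer; an off-the-shelf L-BFGS routine could be started from the supervised solution, as the summary describes.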

In an experiment on gene-name recognition, a named entity recognition task, the method generally gave much better recall and F-measure than the supervised baseline.


Related Papers

This form of minimum entropy regularization was first explored by Grandvalet and Bengio, NIPS 2004, for a single, unstructured variable.

CRFs were first proposed by Lafferty et al, ICML 2001.

The dataset analyzed was from McDonald et al 2005.