Mann and McCallum, ICML 2007


Citation

Mann, G. and McCallum, A. Simple, robust, scalable semi-supervised learning via expectation regularization. Proceedings of the 24th International Conference on Machine Learning. 2007.

Online Version

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.120.3681&rep=rep1&type=pdf

Summary

This paper develops a semi-supervised learning method for conditional probability models that constrains the model with prior beliefs. Whereas other semi-supervised methods such as entropy regularization and transductive SVMs constrain the model implicitly, by requiring decision boundaries to pass through regions of low density, expectation regularization constrains it explicitly, by requiring the model's expected output distribution to stay close to a given prior.

The prior can be over class labels in general, or over the conditional probability of a class given a specific feature; for instance, one can encode knowledge such as "capitalized words are locations 60% of the time". The paper's experiments use class label priors. The authors fit a maximum entropy model whose log likelihood is augmented with an expectation regularization term that penalizes the KL divergence between the class priors and the model's predicted class proportions on the unlabeled data. The resulting classifier generally outperformed other supervised and semi-supervised classifiers over a range of training set sizes and classification tasks, including part-of-speech tagging, NER, and relation extraction.
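Concretely, the label-regularization objective can be sketched as follows (the notation is this summary's, not the paper's: D is the labeled set, U the unlabeled set, p̃ the given class prior, and λ a weight on the regularizer):

\[
\max_{\theta} \; \sum_{(x,y)\in\mathcal{D}} \log p_\theta(y\mid x) \;-\; \lambda\, D_{\mathrm{KL}}\!\big(\tilde{p} \,\|\, \hat{p}_\theta\big),
\qquad
\hat{p}_\theta(y) \;=\; \frac{1}{|\mathcal{U}|}\sum_{x\in\mathcal{U}} p_\theta(y\mid x).
\]

The first term is the ordinary labeled-data log likelihood; the second pulls the model's average predicted class distribution on unlabeled data toward the prior.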

One surprising result is that highly accurate priors are not crucial: the method maintains accuracy over a fairly wide range of supplied priors. Priors can be obtained from expert knowledge or by examining labeled data. A further benefit of working with priors is that they are simple to interpret, in contrast to model parameters, which often interact with each other in non-obvious ways.

The technique is quite general: it can be applied with models other than maximum entropy models, with prior knowledge other than class priors, and with distance metrics other than KL divergence.
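As an illustration of that generality, here is a minimal sketch of the regularization term for a generic conditional model, with the distance metric left pluggable. This is not the authors' code; the function, its arguments, and the "l2" alternative are hypothetical choices made for this summary.

    import numpy as np

    def expectation_regularizer(pred_probs, prior, distance="kl", eps=1e-12):
        # pred_probs: (num_unlabeled, num_classes) model outputs p(y|x) on
        # unlabeled data, from any conditional model (e.g. maximum entropy).
        # prior: (num_classes,) prior class distribution, e.g. from expert
        # knowledge or from label proportions in a small labeled sample.
        model_dist = pred_probs.mean(axis=0)  # expected class proportions
        if distance == "kl":  # KL(prior || model), as in the paper
            return float(np.sum(prior * np.log((prior + eps) / (model_dist + eps))))
        if distance == "l2":  # an alternative distance, per the text above
            return float(np.sum((prior - model_dist) ** 2))
        raise ValueError("unknown distance: " + distance)

    # During training one would maximize
    #   labeled log likelihood - lambda * expectation_regularizer(...)
    # with respect to the model parameters.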

Related Papers

The authors produced several other papers exploring this general technique of generalized expectation. Mann, ACL 2008 and Druck, SIGIR 2008 apply it to semi-supervised CRFs, exploring the use of labeled features vs. labeled instances. A technical report describing the Generalized Expectation Criteria in broader terms is also available.

Liang, ICML 2009 formalizes the cost tradeoffs between using labeled features vs. labeled instances.