Soft Supervised Text Classification
Citation
Soft-Supervised Learning for Text Classification, by A. Subramanya, J.Bilmes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2008.
This Paper is available online [1].
Background
Graph-based semi-supervised learning (SSL) approaches aim to predict labels over unlabeled data, using the labeled datapoints and the structure of the graph. Often, the graph structure is represented by the graph Laplacian. In common formulations, the Laplacian is in a regularization term (to ensure smoothness over the graph when labeling unlabeled points), and the loss function (commonly quadratic loss) tries to ensure unlabeled points close to labeled points have similar labels as the labeled data.
Summary
This work concentrates on graph-based semi-supervised learning for cases where a datapoint can have multiple labels associated with it, for example in document classification, where a particular document can have multiple topic labels. Existing graph-based approaches are often sub-optimal in this regard, since many assume binary classification tasks and then generalize to multi-class problems by using some sort of one vs. rest strategy for classification, which means that the different classifiers are trained independent of one another.
This work attempts to solve that problem by proposing a graph-based SSL framework, where the loss function is based on minimizing the Kullback-Leibler Divergence between two distributions, and the regularizer is based on entropy (maximization). Using probability distributions on the vertices is better than existing approaches that use fixed integer labels (or relaxations consisting of real values in the interval), because we can generalize to multi-class classification easier, and also allows us to use information theory-based machinery, which the authors argue (through their results) is superior to Euclidean-based methods.
Main Approach
First some SSL-related preliminaries.