Soft Supervised Text Classification
Citation
Soft-Supervised Learning for Text Classification, by A. Subramanya, J.Bilmes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2008.
This Paper is available online [1].
Background
Graph-based semi-supervised learning (SSL) approaches aim to predict labels over unlabeled data, using the labeled datapoints and the structure of the graph. Often, the graph structure is represented by the graph Laplacian. In common formulations, the Laplacian is in a regularization term (to ensure smoothness over the graph when labeling unlabeled points), and the loss function (commonly quadratic loss) tries to ensure unlabeled points close to labeled points have similar labels as the labeled data.
Summary
This work concentrates on graph-based semi-supervised learning for cases where a datapoint can have multiple labels associated with it, for example in document classification, where a particular document can have multiple topic labels. Existing graph-based approaches are often sub-optimal in this regard, since many assume binary classification tasks and then generalize to multi-class problems by using some sort of one vs. rest strategy for classification, which means that the different classifiers are trained independent of one another.
This work attempts to solve that problem by proposing a framework where the loss function is based on minimizing the Kullback-Leibler Divergence between two distributions, and the regularizer is based on entropy (maximization). Using probability distributions on the vertices is better than existing approaches that use fixed integer labels (or relaxations consisting of real values in the interval), because we can generalize to multi-class classification easier, and also allows us to use information theory-based machinery, which the authors argue (through their results) is superior to Euclidean-based methods.
Main Approach
First some SSL-related preliminaries and notation.
- we have a training set , where is the "test" data in some sense (remember we are in a transductive setting)
- we have labeled datapoints, and unlabeled datapoints. Our datapoints come from an input space , and our outputs are , where , the cardinality of the output space, are the number of classes.
- we form an undirected weighted graph , where are the set of vertices, i.e., the datapoints, and are the set of edges between datapoints. represents the similarity between and .
- in this work, , the similarity matrix, is a dense similarity matrix between points that has been somewhat sparsified by taking the nearest neighbors for each point. Various similarity measures can be used between datapoints, for example RBF kernels applied to squared Euclidean distance, or cosine similarity. In this paper, cosine similarity is used
The main algorithm can be expressed as: , where where is the standard Shannon entropy for the distribution over vertex , is the KL-divergence between two distributions, and are hyperparameters. are the distributions over labeled points. If a point has just one label, then the distribution has entropy 0. Otherwise, we set the distribution to be uniform over the different labels.
The first term corresponds to labels over labeled data. The authors call their approach "soft-supervised learning" because we do not constrain the solution to have the same labels as what was provided (we simply minimize divergence between the probability distributions). This is particularly useful for a noisy labels scenario. The second term can be viewed as a graph regularization term, and the last term encourages the distribution to have maximal entropy if not preferred to the contrary by the first two terms (useful for handling disconnected components of the graph).
The authors propose to solve this optimization problem through a method known as alternating minimization (AM). Their problem is convex, but does not admit a closed form solution in either primal or dual space. AM was chosen since it has only a single additional optimization parameter, has a closed form solution that doesn't involve matrix inversion, and yields guaranteed convergence to a global optimum. Some massaging of the objective function above is required to get the optimization problem in a form amenable to AM. See the alternating minimization article for more details.
Baseline & Results
The authors conducted their experiments on the Web KB dataset and the Reuters 21578 dataset. Their evaluation measure was the precision-recall breakeven point (the value for which precision and recall are equal), and they compared with SVM (supervised), transductive SVM (TSVM, semi-supervised), spectral graph transduction (SGT, semi-supervised), and Label Propagation (LP, semi-supervised).
- Reuters data
- webKB data
Related Work
- Compared to graph-based SSL approaches with squared-loss objective functions(e.g., Zhu et al, ICML 2003), this work optimizes KL divergence over probability distributions. Results (above) show that this is better than squared loss. Several reasons are posited in the paper.
- Alternating minimization is similar to spectral graph transducers (Joachims, ICML 2003), but in this case AM yields an exact solution to a continuous optimization problem, and not an approximate solution to an NP-hard discrete optimization problem as is done in the Joachims paper.
- Some papers have used entropy minimization in their approaches, but they were inherently not graph-based. Other papers have presented optimization procedures similar to AM, but often with drawbacks (e.g., no closed-form solution in Corduneanu & Jaakkola, UAI 2003).