Soft Supervised Text Classification


Citation

Soft-Supervised Learning for Text Classification, by A. Subramanya and J. Bilmes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2008.

This paper is available online [1].

Background

Graph-based semi-supervised learning (SSL) approaches aim to predict labels for unlabeled data using the labeled datapoints together with the structure of the graph. Often, the graph structure is represented by the graph Laplacian. In common formulations, the Laplacian appears in a regularization term (to ensure that the labeling is smooth over the graph), while the loss function (commonly a quadratic loss) encourages unlabeled points that are close to labeled points to receive labels similar to those of the labeled data.
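As a point of reference (this is the standard formulation the paper contrasts with, not the paper's own objective), a typical quadratic-loss objective over label assignments <math>f</math> with graph Laplacian <math>L = D - W</math> can be written as

<math>\min_{f} \; \sum_{i=1}^{l} (f_i - y_i)^2 + \lambda f^{\top} L f, \qquad f^{\top} L f = \frac{1}{2}\sum_{i,j} w_{ij} (f_i - f_j)^2,</math>

where the first term fits the labeled data, the second penalizes label disagreement between strongly connected points, and <math>\lambda</math> is a trade-off weight (the symbols <math>f</math>, <math>L</math>, <math>D</math>, and <math>\lambda</math> here are illustrative, not notation from this paper).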

Summary

This work concentrates on graph-based semi-supervised learning for cases where a datapoint can have multiple labels associated with it, for example in document classification, where a particular document can have multiple topic labels. Existing graph-based approaches are often sub-optimal in this regard, since many assume a binary classification task and then generalize to multi-class problems via some one-vs-rest strategy, which means that the different classifiers are trained independently of one another.

This work attempts to solve that problem by proposing a framework in which the loss function minimizes the Kullback-Leibler divergence between two distributions and the regularizer is based on entropy (maximization). Using probability distributions on the vertices improves on existing approaches that use fixed integer labels (or relaxations of those labels to real values), because it generalizes more naturally to multi-class classification and allows the use of information-theoretic machinery, which the authors argue (through their results) is superior to Euclidean-distance-based methods.
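For reference, the two standard information-theoretic quantities used below are the Kullback-Leibler divergence between two label distributions and the Shannon entropy of a label distribution:

<math>D_{KL}(p_i \parallel q_i) = \sum_{y \in Y} p_i(y) \log \frac{p_i(y)}{q_i(y)}, \qquad H(p_i) = -\sum_{y \in Y} p_i(y) \log p_i(y)</math>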

Main Approach

First, some SSL-related preliminaries and notation:

* we have a training set <math>\mathcal{D} = \{\mathcal{D}_l, \mathcal{D}_u\}</math>, where <math>\mathcal{D}_u</math> is the "test" data in some sense (remember we are in a transductive setting)
* we have <math>l</math> labeled datapoints and <math>u</math> unlabeled datapoints.  Our datapoints <math>{\bf x}_i \in X</math> come from an input space <math>X</math>, and our outputs are <math>y_i \in Y</math>, where <math>|Y|</math>, the cardinality of the output space, is the number of classes.
* we form an undirected weighted graph <math>\mathcal{G} = (V,E)</math>, where <math>V</math> is the set of vertices, i.e., the datapoints, and <math>E</math> is the set of edges between datapoints.  <math>w_{ij}</math> represents the similarity between <math>{\bf x}_i</math> and <math>{\bf x}_j</math>.
* in this work, <math>{\bf W}</math>, the similarity matrix, is a dense similarity matrix between points that has been sparsified by keeping only the <math>k</math> nearest neighbors of each point.  Various similarity measures can be used between datapoints, for example RBF kernels applied to squared Euclidean distance, or [[Cosine_similarity | cosine similarity]].  In this paper, cosine similarity is used (a graph-construction sketch follows this list).
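Below is a minimal sketch (in Python with NumPy and scikit-learn, not the authors' code) of how such a k-nearest-neighbor cosine-similarity graph could be built; the function name <code>build_knn_graph</code> and the default <math>k</math> are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.neighbors import NearestNeighbors


def build_knn_graph(X, k=10):
    """Build a symmetrized k-NN cosine-similarity matrix W from feature vectors X.

    X: (n, d) array of document feature vectors (e.g., TF-IDF rows).
    Returns a dense (n, n) array with W[i, j] = cosine similarity if j is among
    the k nearest neighbors of i (or vice versa), and 0 otherwise.
    """
    n = X.shape[0]
    # scikit-learn's 'cosine' metric returns distance = 1 - cosine similarity
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(X)
    dist, idx = nn.kneighbors(X)  # each point's nearest neighbor is itself (distance 0)

    W = np.zeros((n, n))
    for i in range(n):
        for d, j in zip(dist[i], idx[i]):
            if j == i:
                continue  # skip self-loops
            W[i, j] = 1.0 - d  # convert cosine distance back to similarity

    # symmetrize so the graph is undirected
    return np.maximum(W, W.T)
</syntaxhighlight>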

The main algorithm can be expressed as the optimization problem <math>\min_{p} C_1(p)</math> over the collection <math>p = (p_1, \ldots, p_n)</math> of probability distributions on the vertices, where

<math>C_1(p) = \sum_{i=1}^{l} D_{KL} (r_i \parallel p_i) + \mu \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} D_{KL}(p_i \parallel p_j) - \nu \sum_{i=1}^{n} H(p_i)</math>

Here <math>r_i</math> is the reference (labeled) distribution for labeled point <math>i</math>, <math>H(\cdot)</math> is the Shannon entropy defined above, and <math>\mu</math> and <math>\nu</math> are hyperparameters trading off the graph-smoothness term against the entropy regularizer.
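A minimal sketch of evaluating <math>C_1(p)</math> with NumPy is shown below; the function names, the smoothing constant <code>EPS</code>, and the array layout are assumptions made for illustration, not the authors' implementation.

<syntaxhighlight lang="python">
import numpy as np

EPS = 1e-12  # avoids log(0) inside the KL and entropy terms


def kl_rows(p, q):
    """Row-wise KL divergence D_KL(p || q) for row-stochastic arrays of shape (m, |Y|)."""
    return np.sum(p * (np.log(p + EPS) - np.log(q + EPS)), axis=1)


def entropy_rows(p):
    """Row-wise Shannon entropy H(p) for a row-stochastic array of shape (m, |Y|)."""
    return -np.sum(p * np.log(p + EPS), axis=1)


def objective_C1(p, r, W, l, mu, nu):
    """Evaluate C_1(p): p and r are (n, |Y|) distributions (r is used only for the
    first l labeled points), W is the (n, n) similarity matrix, mu and nu are the
    trade-off hyperparameters."""
    n = p.shape[0]
    supervised = np.sum(kl_rows(r[:l], p[:l]))  # match labeled points to their reference distributions
    smoothness = sum(W[i, j] * kl_rows(p[i:i + 1], p[j:j + 1])[0]
                     for i in range(n) for j in range(n) if W[i, j] > 0)
    regularizer = np.sum(entropy_rows(p))       # subtracted below, so higher entropy is rewarded
    return supervised + mu * smoothness - nu * regularizer
</syntaxhighlight>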


Baseline & Results

Related Work