Soft Supervised Text Classification

From Cohen Courses
Revision as of 20:19, 25 September 2011

Citation

Soft-Supervised Learning for Text Classification, by A. Subramanya and J. Bilmes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2008.

This paper is available online [1].

Background

Graph-based semi-supervised learning (SSL) approaches aim to predict labels for unlabeled data using both the labeled datapoints and the structure of the graph. The graph structure is often represented by the graph Laplacian. In common formulations, the Laplacian appears in a regularization term that encourages the labeling to be smooth over the graph, while the loss function (commonly quadratic loss) keeps predictions at labeled points close to their given labels; together these push unlabeled points that lie near labeled points toward similar labels.
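To make the smoothness role of the Laplacian concrete, here is a toy sketch (illustrative only, not from the paper): the quadratic form <math>f^T L f</math> is small exactly when strongly connected nodes receive similar labels.

```python
import numpy as np

# Toy symmetric similarity (adjacency) matrix W over 3 datapoints.
W = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.5],
              [0.1, 0.5, 0.0]])

# Degree matrix D: each diagonal entry is the sum of a node's edge weights.
D = np.diag(W.sum(axis=1))

# Unnormalized graph Laplacian L = D - W.
L = D - W

# The Laplacian measures smoothness of a labeling f over the graph:
# f^T L f = 0.5 * sum_ij w_ij (f_i - f_j)^2,
# which is small when strongly connected nodes share labels.
f = np.array([1.0, 1.0, 0.0])
smoothness = f @ L @ f
```

Here the only label disagreements are across the weaker edges (0,2) and (1,2), so the penalty is 0.1 + 0.5 = 0.6.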

Summary

This work concentrates on graph-based semi-supervised learning for cases where a datapoint can have multiple labels associated with it, for example document classification, where a particular document can have multiple topic labels. Existing graph-based approaches are often sub-optimal in this regard: many assume a binary classification task and generalize to multi-class problems via some sort of one-vs-rest strategy, which means the different classifiers are trained independently of one another.

This work attempts to solve that problem by proposing a framework where the loss function is based on minimizing the Kullback-Leibler divergence between two distributions, and the regularizer is based on entropy (maximization). Using probability distributions on the vertices is better than existing approaches that use fixed integer labels (or relaxations thereof taking real values in an interval): it generalizes more easily to multi-class classification, and it allows the use of information-theoretic machinery, which the authors argue (through their results) is superior to Euclidean-distance-based methods.
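The two information-theoretic quantities the framework is built on can be sketched as small helpers (an illustrative sketch, not the authors' code):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions.

    Assumes q > 0 wherever p > 0; terms with p = 0 contribute nothing."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def entropy(p):
    """Shannon entropy H(p) in nats."""
    p = np.asarray(p, float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))

# A soft distribution over 3 topic labels vs. the uniform distribution.
p_i = [0.7, 0.2, 0.1]
uniform = [1 / 3, 1 / 3, 1 / 3]
```

KL divergence is zero iff the two distributions coincide, and entropy is maximized by the uniform distribution, which is exactly why minimizing KL acts as a loss and maximizing entropy acts as a regularizer over the vertex distributions.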

Main Approach

First some SSL-related preliminaries and notation.

* we have a training set <math>\mathcal{D} = \{\mathcal{D}_l, \mathcal{D}_u\}</math>, where <math>\mathcal{D}_u</math> is the "test" data in some sense (remember we are in a transductive setting)
* we have <math>l</math> labeled datapoints and <math>u</math> unlabeled datapoints. Our datapoints <math>\mathbf{x}_i \in X</math> come from an input space <math>X</math>, and our outputs are <math>y_i \in Y</math>, where <math>|Y|</math>, the cardinality of the output space, is the number of classes.
* we form an undirected weighted graph <math>\mathcal{G} = (V,E)</math>, where <math>V</math> is the set of vertices, i.e., the datapoints, and <math>E</math> is the set of edges between datapoints. <math>w_{ij}</math> represents the similarity between <math>\mathbf{x}_i</math> and <math>\mathbf{x}_j</math>.
* in this work, <math>{\bf W}</math>, the similarity matrix, is a dense similarity matrix between points that has been sparsified by keeping the <math>k</math> nearest neighbors of each point. Various similarity measures can be used between datapoints, for example RBF kernels applied to squared Euclidean distance, or [[Cosine_similarity | cosine similarity]]; this paper uses cosine similarity.
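The graph construction in the last bullet can be sketched as follows (a minimal version, assuming nonzero feature vectors; the paper's own pipeline may differ in details such as how ties and symmetrization are handled):

```python
import numpy as np

def knn_cosine_graph(X, k):
    """Build W: dense cosine similarities, sparsified by keeping only
    each point's k nearest neighbors, then symmetrized."""
    # Row-normalize so that dot products are cosine similarities.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)       # exclude self-similarity
    W = np.zeros_like(S)
    for i in range(len(S)):
        nn = np.argsort(S[i])[-k:]     # indices of the k most similar points
        W[i, nn] = S[i, nn]
    W = np.maximum(W, W.T)             # keep an edge if either endpoint chose it
    return W

# Four toy documents as 2-d feature vectors; keep k = 2 neighbors each.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
W = knn_cosine_graph(X, k=2)
```

Symmetrizing with an elementwise maximum keeps the graph undirected even though k-nearest-neighbor relations themselves are not symmetric.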

The main algorithm can be expressed as <math>\hat{p} = \arg\min_{p} C_1(p)</math>, where

<math>C_1(p) = \sum_{i=1}^l D_{KL} (r_i \parallel p_i) + \mu \sum_{i=1}^n \sum_{j} w_{ij} D_{KL}(p_i \parallel p_j) - \nu \sum_{i=1}^n H(p_i)</math>

Here <math>p_i</math> is the probability distribution over labels at vertex <math>i</math>, <math>r_i</math> is the reference label distribution for labeled point <math>i</math>, <math>H(\cdot)</math> is the Shannon entropy, and <math>\mu</math> and <math>\nu</math> are hyperparameters trading off graph smoothness against the entropy regularizer.
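A direct, unoptimized transcription of this objective (dense loops for clarity; the small `eps` smoothing and the toy shapes are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # Smoothed D_KL(p || q) so zero entries do not produce NaNs.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def entropy(p, eps=1e-12):
    return float(-np.sum(p * np.log(p + eps)))

def objective(p, r, W, l, mu, nu):
    """C_1(p): KL fit to the labeled seeds r, graph-smoothness penalty,
    and an entropy term (maximized, hence the minus sign)."""
    fit = sum(kl(r[i], p[i]) for i in range(l))
    n = len(p)
    smooth = sum(W[i, j] * kl(p[i], p[j])
                 for i in range(n) for j in range(n))
    ent = sum(entropy(p[i]) for i in range(n))
    return fit + mu * smooth - nu * ent

# Toy check: 3 points, 2 classes, 1 labeled point; p agrees with the
# seed everywhere, so fit and smoothness terms both vanish.
p = np.array([[1.0, 0.0]] * 3)
r = np.array([[1.0, 0.0]])
W = np.ones((3, 3)) - np.eye(3)
val = objective(p, r, W, l=1, mu=1.0, nu=0.1)
```

Any labeling that disagrees across an edge pays a KL smoothness penalty, so `val` is (up to the entropy term) a lower bound among labelings consistent with the seeds.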


Baseline & Results

Related Work