Measure Propagation
Contents
Citation
Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification, by A. Subramanya, J.Bilmes. In Neural Information Processing Systems, 2009.
Summary
Measure propagation is similar to label propagation, in that we propagate information from some vertices in a similarity graphs to other vertices via the edge weights (so information flows stronger if the edge weight is higher). However, measure propagation is fundamentally based on propagating discrete probability distributions through edges and not labels. The algorithm, while similar, is based on minimizing KL divergence between adjacent vertices instead of minimizing squared loss between adjacent vertices.
Details
The Method of Measure Propagation is a concept similar to LabelPropagation, in that information is propagated through the edges of vertices in a similarity or data adjacency graph. Most approaches to label propagation use a squared loss to penalize neighboring vertices that have different labels. After a bit of math, this can be shown to essentially labeling a point by averaging the labels of all of its neighbors. Other interpretations to label propagation rely on random walks and electrical networks or graph kernels (see Zhu et al, ICML 2003 for more details), but what is essentially happening is that we perform Markov random walks on the graph.
Label propagation relies on the concept of label prediction. These labels can either by integers, i.e., for binary problems, or more integers for multi-class problems, or they can be integer relaxations, i.e. real values, which are then matched against a threshold for classification. However, these methods are intrinsically based on the squared loss, which is more appropriate for regression rather than classification problems. In Soft Supervised Text Classification, Subramanya and Bilmes argue that a loss function based on KL divergence is more appropriate:
- the squared loss is fine for a Gaussian loss model but may not necessarily hold for classification purposes
- KL-divergence is based on the relative (relative to the distribution p) error and not an absolute error
- under certain conditions, KL divergence is asymptotically consistent with the underlying probability distributions over the vertices, which is what we're aiming for
In Measure Propagation, since we use an information theoretic loss-based loss function, we are actually predicting probability distributions over labels. In the algorithm, intuitively speaking, discrete probability measures are being propagated between vertices along the edges, instead of labels.