Entropy Minimization for Semi-supervised Learning

This method was introduced by [http://www.eprints.pascal-network.org/archive/00001978/01/grandvalet05.pdf Y. Grandvalet].
 
Minimum entropy regularization can be applied to any model of posterior distribution. For this technique, one assumption needed for unlabeled examples to be informative is that the classes are well apart, separated by a low-density area.

The learning set is denoted <math> \mathcal{L}_{n} = \{X^{(i)}, Z^{(i)}\}^{n}_{i=1} </math>, where <math> Z^{(i)} \in \{0,1\}^K </math>: if <math> X^{(i)} </math> is labeled as <math> w_{k} </math>, then <math> Z^{(i)}_{k} = 1 </math> and <math> Z^{(i)}_{l} = 0 </math> for <math> l \not= k </math>; if <math> X^{(i)} </math> is unlabeled, then <math> Z^{(i)}_{l} = 1 </math> for <math> l = 1 \dots K </math>.
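
As a concrete illustration (not part of the original write-up), the matrix <math>Z</math> can be built with a one-hot row for each labeled point and an all-ones row for each unlabeled point; the minimal NumPy sketch below uses hypothetical names of its own:

<pre>
import numpy as np

def build_label_matrix(labels, K):
    """Build Z in {0,1}^{n x K}: a one-hot row for each labeled point,
    an all-ones row for each unlabeled point (given here as None)."""
    Z = np.zeros((len(labels), K))
    for i, y in enumerate(labels):
        if y is None:        # unlabeled: Z^{(i)}_l = 1 for l = 1..K
            Z[i, :] = 1.0
        else:                # labeled as class y: one-hot encoding
            Z[i, y] = 1.0
    return Z

# Three points with K = 2 classes; the third point is unlabeled.
print(build_label_matrix([0, 1, None], K=2))
# [[1. 0.]
#  [0. 1.]
#  [1. 1.]]
</pre>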

The conditional entropy of the class labels conditioned on the observed variables is

<math>
H(Y|X,Z; \mathcal{L}_{n}) = -\frac{1}{n} \sum^{n}_{i=1} \sum^{K}_{k=1} P(Y^{(i)}=w_{k}|X^{(i)}, Z^{(i)}) \log P(Y^{(i)}=w_{k}|X^{(i)},Z^{(i)})
</math>
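
Given an <math>n \times K</math> array of estimated posteriors, this empirical conditional entropy can be computed directly; the sketch below is illustrative only, assumes NumPy, and uses names of its own:

<pre>
import numpy as np

def conditional_entropy(posteriors, eps=1e-12):
    """H(Y|X,Z; L_n): average entropy of the rows of an (n, K) array
    holding the conditional class probabilities P(Y=w_k | X, Z)."""
    p = np.clip(posteriors, eps, 1.0)             # guard against log(0)
    return -np.mean(np.sum(p * np.log(p), axis=1))

# Confident predictions give low entropy, uniform predictions give high entropy.
print(conditional_entropy(np.array([[0.99, 0.01], [0.98, 0.02]])))  # ~0.08
print(conditional_entropy(np.array([[0.50, 0.50], [0.50, 0.50]])))  # ln 2 ~ 0.69
</pre>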

Assuming that labels are missing at random, we have

<math>
P(Y^{(i)}=w_{k}|X^{(i)}, Z^{(i)}) = \frac{Z^{(i)}_{k}P(Y^{(i)}=w_{k}|X^{(i)})}{\sum^{K}_{l=1} Z^{(i)}_{l} P(Y^{(i)}=w_{l}|X^{(i)})}
</math>
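
In practice this amounts to masking the model posterior with <math>Z^{(i)}</math> and renormalizing: for a labeled point the result collapses onto the observed class, and for an unlabeled point it reduces to <math>P(Y^{(i)}=w_{k}|X^{(i)})</math>. A small illustrative sketch (the names are hypothetical):

<pre>
import numpy as np

def restricted_posterior(p_model, Z, eps=1e-12):
    """P(Y=w_k | X, Z) = Z_k P(Y=w_k | X) / sum_l Z_l P(Y=w_l | X),
    computed row-wise for an (n, K) posterior array and label matrix Z."""
    num = Z * p_model
    return num / np.maximum(num.sum(axis=1, keepdims=True), eps)

p_model = np.array([[0.7, 0.3],    # point labeled as the second class
                    [0.6, 0.4]])   # unlabeled point
Z = np.array([[0.0, 1.0],
              [1.0, 1.0]])
print(restricted_posterior(p_model, Z))
# [[0.  1. ]   <- collapses onto the observed label
#  [0.6 0.4]]  <- reduces to P(Y|X) for the unlabeled point
</pre>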

The criterion combines the conditional log-likelihood with an entropy regularization term:

<math>
\begin{alignat}{2}
C(\boldsymbol{\theta}, \lambda; \mathcal{L}_{n}) & = L(\boldsymbol{\theta}; \mathcal{L}_{n}) - \lambda H(Y|X,Z; \mathcal{L}_{n}) \\
& = \sum^{n}_{i=1} \log\left(\sum^{K}_{k=1} Z^{(i)}_{k}P(Y^{(i)}=w_{k}|X^{(i)})\right) +
\lambda \sum^{n}_{i=1} \sum_{k=1}^{K} P(Y^{(i)}=w_{k}|X^{(i)}, Z^{(i)}) \log P(Y^{(i)}=w_{k}|X^{(i)}, Z^{(i)})
\end{alignat}
</math>
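
As an illustration, the criterion can be evaluated for any model that outputs class posteriors. The sketch below is not from the referenced paper; it assumes NumPy, uses names of its own, and absorbs the <math>1/n</math> factor of the entropy into <math>\lambda</math>, following the expanded form above:

<pre>
import numpy as np

def criterion_C(p_model, Z, lam, eps=1e-12):
    """C(theta, lambda; L_n) = L(theta; L_n) - lambda H(Y|X,Z; L_n),
    with the 1/n factor of H absorbed into lambda (as in the expanded form).

    p_model : (n, K) array of model posteriors P(Y=w_k | X^{(i)}; theta)
    Z       : (n, K) 0/1 label matrix (one-hot rows if labeled, all-ones if not)
    lam     : regularization weight lambda >= 0
    """
    # Conditional log-likelihood: sum_i log( sum_k Z^{(i)}_k P(Y=w_k | X^{(i)}) )
    L = np.sum(np.log(np.maximum(np.sum(Z * p_model, axis=1), eps)))

    # Restricted posterior P(Y | X, Z) and the (negated) entropy term
    p_cond = Z * p_model
    p_cond = p_cond / np.maximum(p_cond.sum(axis=1, keepdims=True), eps)
    p_cond = np.clip(p_cond, eps, 1.0)
    neg_entropy = np.sum(p_cond * np.log(p_cond))

    return L + lam * neg_entropy
</pre>

Maximizing this quantity over the model parameters rewards confident (low-entropy) predictions on the unlabeled points, which is what pushes the decision boundary toward the low-density region assumed above.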

Minimum entropy regularizers have been used to encode learnability priors [http://www.merl.com/papers/docs/TR98-18.pdf M. Brand] and to learn weight function parameters in the context of transduction in manifold learning [http://www.learning.eng.cam.ac.uk/zoubin/papers/zgl.pdf Zhu et al.].
