Expectation Regularization
Latest revision as of 19:28, 30 November 2010
Expectation regularization is a method introduced by G. S. Mann and A. McCallum (ICML 2007). It is often used as a regularization term added to the likelihood function. In practice, humans often have some insight into the prior distribution over labels, and this method provides a way to take advantage of that prior knowledge.
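Let's denote the human-provided prior as <math>\tilde{p}</math> and the empirical label distribution as <math>\hat{p}</math>. The empirical label distribution is computed over the unlabeled data set <math>U</math>:

<math>
\hat{p}_{\theta}(y)=\frac{\sum_{x \in U} p_{\theta}(y|x)}{|U|}
</math>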
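As a minimal sketch of this average, assuming the model's posteriors <math>p_{\theta}(y|x)</math> are already available as plain lists of class probabilities, the empirical label distribution could be computed as below. The function name and the example posteriors are hypothetical and chosen only for illustration.

```python
def empirical_label_distribution(posteriors):
    """Average the per-instance posteriors p_theta(y|x) over the unlabeled set U.

    `posteriors` is a list with one entry per x in U; each entry is a list of
    class probabilities (an assumed input format, not part of the original text).
    """
    if not posteriors:
        raise ValueError("U must contain at least one instance")
    num_classes = len(posteriors[0])
    totals = [0.0] * num_classes
    for probs in posteriors:
        for y, p in enumerate(probs):
            totals[y] += p
    # Divide the summed posteriors by |U|.
    return [t / len(posteriors) for t in totals]


# Hypothetical posteriors for three unlabeled instances and two classes.
U_posteriors = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
p_hat = empirical_label_distribution(U_posteriors)
```

Because each row sums to one, the resulting average is itself a probability distribution over the labels.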
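We want to minimize the distance between <math>\tilde{p}</math> and <math>\hat{p}</math>, denoted as <math>\triangle(\tilde{p},\hat{p})</math>. The KL divergence is used here, so the regularization term becomes

<math>
D(\tilde{p}||\hat{p})=\sum_{y} \tilde{p}(y) \log \frac{\tilde{p}(y)}{\hat{p}(y)}=H(\tilde{p},\hat{p})-H(\tilde{p})
</math>

where <math>H(\tilde{p},\hat{p})</math> is the cross-entropy and <math>H(\tilde{p})</math> is the entropy of the prior. Since <math>H(\tilde{p})</math> does not depend on <math>\theta</math>, minimizing the KL divergence is equivalent to minimizing the cross-entropy.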
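The decomposition of the KL divergence into cross-entropy minus entropy can be checked numerically. The sketch below uses natural logarithms and assumes <math>\hat{p}(y) > 0</math> wherever <math>\tilde{p}(y) > 0</math>; the function names and the two example distributions are hypothetical.

```python
import math


def kl_divergence(p_tilde, p_hat):
    """D(p_tilde || p_hat); terms with p_tilde(y) = 0 contribute nothing."""
    return sum(pt * math.log(pt / ph) for pt, ph in zip(p_tilde, p_hat) if pt > 0)


def cross_entropy(p_tilde, p_hat):
    """H(p_tilde, p_hat) = -sum_y p_tilde(y) log p_hat(y)."""
    return -sum(pt * math.log(ph) for pt, ph in zip(p_tilde, p_hat) if pt > 0)


def entropy(p):
    """H(p) = -sum_y p(y) log p(y)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


# Hypothetical prior and empirical label distribution over two classes.
p_tilde = [0.8, 0.2]
p_hat = [0.6, 0.4]

kl = kl_divergence(p_tilde, p_hat)
decomposed = cross_entropy(p_tilde, p_hat) - entropy(p_tilde)
```

Here `kl` and `decomposed` agree up to floating-point error, and the divergence vanishes when the two distributions coincide.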
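For semi-supervised learning purposes, we can augment the objective function by adding this regularization term. For example, the new conditional log-likelihood of the data becomes

<math>
l(\theta; D, U)=\sum_{n}\log p_{\theta}(y^{(n)}|x^{(n)}) - \lambda \triangle(\tilde{p}, \hat{p})
</math>

where <math>D</math> is the labeled data set and <math>\lambda</math> controls the strength of the regularizer.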
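To make the augmented objective concrete, the following sketch evaluates it for fixed model outputs, using the KL divergence as the distance. It is an illustration under stated assumptions, not the authors' implementation: the function name, the input format (probabilities the model assigns to the true labels on <math>D</math>, and full posteriors on <math>U</math>), and all numeric values are hypothetical.

```python
import math


def regularized_log_likelihood(labeled_probs, unlabeled_posteriors, p_tilde, lam):
    """Labeled log-likelihood minus lam * D(p_tilde || p_hat).

    labeled_probs: p_theta(y^(n) | x^(n)) for each labeled example in D.
    unlabeled_posteriors: one list of class probabilities per x in U.
    p_tilde: human-provided label prior.
    lam: regularization weight (lambda).
    """
    # Supervised term: sum of log-probabilities of the true labels.
    log_lik = sum(math.log(p) for p in labeled_probs)

    # Empirical label distribution: average posterior over U.
    n_u = len(unlabeled_posteriors)
    p_hat = [
        sum(row[y] for row in unlabeled_posteriors) / n_u
        for y in range(len(p_tilde))
    ]

    # KL penalty between the prior and the empirical label distribution.
    penalty = sum(
        pt * math.log(pt / ph) for pt, ph in zip(p_tilde, p_hat) if pt > 0
    )
    return log_lik - lam * penalty


# Hypothetical model outputs.
labeled_probs = [0.9, 0.8]
U_posteriors = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]

unregularized = regularized_log_likelihood(labeled_probs, U_posteriors, [0.8, 0.2], 0.0)
regularized = regularized_log_likelihood(labeled_probs, U_posteriors, [0.8, 0.2], 1.0)
```

With `lam = 0` the value reduces to the ordinary labeled log-likelihood; a positive `lam` lowers the objective whenever the averaged predictions on <math>U</math> deviate from the prior.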
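Note that this is a global regularizer rather than a local one: it constrains the label distribution averaged over all of <math>U</math>, not the prediction for each individual instance. A local regularizer that matched every instance's prediction to the prior would assign all instances to the majority class.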