# Generalized Expectation Criteria

## Summary

Generalized expectation (GE) criteria form a parameter estimation method that can augment or replace traditional objectives such as maximum likelihood estimation. Traditional methods can be viewed as special cases of GE, and the flexibility of generalized expectation admits less conventional preferences that traditional objectives cannot express.

## Expectation

Let $X$ be a set of variables with assignments $\mathbf {x} \in {\mathcal {X}}$. Let $\theta$ be the parameters of a model that defines a probability distribution $p_{\theta }(X)$. The expectation of a function $f(X)$ according to the model is

$E_{\theta }[f(X)]=\sum _{\mathbf {x} \in {\mathcal {X}}}{p_{\theta }(\mathbf {x} )f(\mathbf {x} )}$

We can partition the variables into "input" variables $X$ and "output" variables $Y$ that are conditioned on the input variables. When assignments of the input variables ${\tilde {\mathcal {X}}}=\{\mathbf {x} _{1},\mathbf {x} _{2},\ldots \}$ are provided, the conditional expectation is

$E_{\theta }[f(X,Y)\vert {\tilde {\mathcal {X}}}]={1 \over \vert {\tilde {\mathcal {X}}}\vert }\sum _{\mathbf {x} \in {\tilde {\mathcal {X}}}}{\sum _{\mathbf {y} \in Y}{p_{\theta }(\mathbf {y} \vert \mathbf {x} )f(\mathbf {x} ,\mathbf {y} )}}$

## Generalized Expectation

A generalized expectation (GE) criterion is a function $G$ that takes the model's expectation of $f(X)$ as an argument and returns a scalar. The criterion is then added as a term to the parameter estimation objective function.

$G(E_{\theta }[f(X)])\rightarrow \mathbb {R}$

Alternatively, $G$ can be defined based on a distance to a target value for $E_{\theta }[f(X)]$. Let ${\tilde {f}}$ be the target value and $\Delta (\cdot ,\cdot )$ be some distance function; then we can define $G$ in the following way:

$G_{\tilde {f}}(E_{\theta }[f(X)])=-\Delta (E_{\theta }[f(X)],{\tilde {f}})$

## Use Cases
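As a concrete illustration of the distance-based criterion above, here is a minimal sketch for a small discrete model. The softmax parameterization, the specific numbers, and the squared-error choice of $\Delta$ are all illustrative assumptions, not part of the GE framework itself:

```python
import numpy as np

# Toy discrete model: p_theta(x) over four outcomes, via a softmax over
# illustrative parameters theta (not from any real application).
theta = np.array([0.5, -0.2, 0.1, 0.0])
p_theta = np.exp(theta) / np.exp(theta).sum()   # model distribution p_theta(x)

# Feature function f(x), one value per outcome x.
f = np.array([1.0, 2.0, 3.0, 4.0])

# Model expectation E_theta[f(X)] = sum_x p_theta(x) f(x)
expectation = p_theta @ f

# GE criterion with target value f_tilde and squared-error distance:
# G(E) = -Delta(E, f_tilde) = -(E - f_tilde)^2
f_tilde = 2.5
ge_term = -(expectation - f_tilde) ** 2

print(expectation, ge_term)
```

Maximizing `ge_term` (e.g. as one term of a larger objective, with gradients through `theta`) pulls the model's expectation of $f$ toward the target value.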

### Application to semi-supervised learning

Mann and McCallum (ICML 2007) describe an application of GE to a semi-supervised learning problem. The GE term used here encodes a preference (prior) about the marginal class distribution, either provided directly by a human expert or estimated from labeled data.
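A minimal numerical sketch of such a label-marginal GE term follows. The 3-class softmax classifier, the random unlabeled inputs, and the target marginal are all hypothetical stand-ins:

```python
import numpy as np

# Hypothetical setup: a 3-class softmax classifier over 2-d inputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))              # illustrative model parameters
X_unlabeled = rng.normal(size=(5, 2))    # unlabeled inputs x in X-tilde

logits = X_unlabeled @ W
p_y_given_x = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Model's expected label distribution over the unlabeled data:
# (1/|X-tilde|) * sum_x p_theta(y | x)
expected_dist = p_y_given_x.mean(axis=0)

# Target class marginal p-tilde(Y), e.g. supplied by a domain expert.
target_dist = np.array([0.5, 0.3, 0.2])

# GE term: negative KL divergence between target and predicted marginals.
ge_term = -np.sum(target_dist * np.log(target_dist / expected_dist))

print(expected_dist, ge_term)
```

Since KL divergence is non-negative, `ge_term` is at most zero and is maximized when the model's expected label marginal matches the target exactly.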

Let ${\tilde {\mathbf {f} }}={\tilde {p}}(Y)$ be the target distribution over class labels and $f(\mathbf {x} ,\mathbf {y} )={1 \over n}\sum _{i=1}^{n}{\vec {I}}(y_{i})$, where ${\vec {I}}$ denotes the vector indicator function on labels $y\in {\mathcal {Y}}$. Since the expectation of $f(\mathbf {x} ,\mathbf {y} )$ is the model's predicted distribution over labels, we can define a simple GE term as the negative KL divergence between the target distribution and the predicted distribution over the unlabeled data ${\tilde {\mathcal {X}}}$:

$-KLDiv({\tilde {\mathbf {f} }},{1 \over \vert {\tilde {\mathcal {X}}}\vert }\sum _{\mathbf {x} \in {\tilde {\mathcal {X}}}}{\sum _{\mathbf {y} \in Y}{p_{\theta }(\mathbf {y} \vert \mathbf {x} )f(\mathbf {x} ,\mathbf {y} )}})$

### Application to semi-supervised clustering

Suppose we have a representative 'prototype' instance $\mathbf {x} '$ for cluster $y$. We can encourage the model to assign high probability $p_{\theta }(\mathbf {x} \vert y)$ to instances similar to $\mathbf {x} '$ by adding a GE term like the following:

$\sum _{\mathbf {x} \in {\tilde {\mathcal {X}}}}{p_{\theta }(\mathbf {x} \vert y)\,sim(\mathbf {x} ,\mathbf {x} ')}$

where $sim$ is some similarity function, such as cosine similarity.
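This term can be sketched directly for a toy cluster. The instances, the prototype, and the fixed probabilities standing in for $p_{\theta }(\mathbf {x} \vert y)$ below are all illustrative; a real model would produce the probabilities itself:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative data: four instances in X-tilde and one prototype x'
# for the cluster of interest y.
X_tilde = np.array([[1.0, 0.0],
                    [0.9, 0.1],
                    [0.0, 1.0],
                    [0.1, 0.9]])
prototype = np.array([1.0, 0.05])

# Hypothetical model probabilities p_theta(x | y) (fixed here for the sketch).
p_x_given_y = np.array([0.4, 0.3, 0.2, 0.1])

# GE term: sum_x p_theta(x | y) * sim(x, x')
ge_term = sum(p * cosine_sim(x, prototype)
              for p, x in zip(p_x_given_y, X_tilde))

print(ge_term)
```

Maximizing this term pushes probability mass under $p_{\theta }(\mathbf {x} \vert y)$ toward the instances most similar to the prototype.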

### Others

GE has also been successfully applied to other settings such as active learning and transfer learning. For more details, see www.cs.umass.edu/~mccallum/papers/ge08note.pdf