# Generalized Expectation Criteria

## Summary

Generalized expectation (GE) criteria can be viewed as a parameter estimation method that augments or replaces traditional methods such as maximum likelihood estimation. Traditional methods can be viewed as special cases of GE, and the flexibility of the framework also admits less conventional estimation approaches.

## Expectation

Let ${\displaystyle X}$ be a set of variables and ${\displaystyle \mathbf {x} \in {\mathcal {X}}}$ an assignment of those variables. Let ${\displaystyle \theta }$ be the parameters of a model that defines a probability distribution ${\displaystyle p_{\theta }(X)}$. The expectation of a function ${\displaystyle f(X)}$ under the model is

${\displaystyle E_{\theta }[f(X)]=\sum _{\mathbf {x} \in {\mathcal {X}}}{p_{\theta }(\mathbf {x} )f(\mathbf {x} )}}$
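For a small discrete distribution this expectation is a direct weighted sum. The following is a minimal sketch; the distribution and feature values are toy examples, not part of the original formulation:

```python
# Model expectation E_theta[f(X)] = sum over assignments x of p(x) * f(x).
# The distribution p and feature values f below are illustrative toys.

def expectation(p, f):
    """Weighted sum of f over the assignments of X."""
    return sum(p[x] * f[x] for x in p)

# Toy distribution over three assignments and toy feature values.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
f = {"a": 1.0, "b": 2.0, "c": 4.0}

print(expectation(p, f))  # approximately 1.9 = 0.5*1 + 0.3*2 + 0.2*4
```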

We can partition the variables into "input" variables ${\displaystyle X}$ and "output" variables ${\displaystyle Y}$ that are conditioned on the input variables. When assignments of the input variables ${\displaystyle {\tilde {\mathcal {X}}}=\{\mathbf {x} _{1},\mathbf {x} _{2},...\}}$ are provided, the conditional expectation is

${\displaystyle E_{\theta }[f(X,Y)\vert {\tilde {\mathcal {X}}}]={1 \over \vert {\tilde {\mathcal {X}}}\vert }\sum _{\mathbf {x} \in {\tilde {\mathcal {X}}}}{\sum _{\mathbf {y} \in {\mathcal {Y}}}{p_{\theta }(\mathbf {y} \vert \mathbf {x} )f(\mathbf {x} ,\mathbf {y} )}}}$
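The conditional version averages, over the provided inputs, the model's expected feature value. A minimal sketch, assuming a hypothetical binary model whose conditional distribution is a sigmoid and a feature that indicates label 1:

```python
import math

def conditional_expectation(inputs, labels, p_y_given_x, f):
    """Average over the provided inputs of sum_y p(y|x) * f(x, y)."""
    total = 0.0
    for x in inputs:
        total += sum(p_y_given_x(y, x) * f(x, y) for y in labels)
    return total / len(inputs)

# Toy binary model (an assumption for illustration): p(y=1|x) = sigmoid(x).
def p(y, x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s if y == 1 else 1.0 - s

# Feature that indicates label 1.
f = lambda x, y: float(y == 1)

print(conditional_expectation([0.0, 0.0], [0, 1], p, f))  # 0.5
```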

## Generalized Expectation

A generalized expectation (GE) criterion is a function ${\displaystyle G}$ that takes the model's expectation of ${\displaystyle f(X)}$ as an argument and returns a scalar. The criterion is then added as a term to the parameter estimation objective function.

${\displaystyle G(E_{\theta }[f(X)])\rightarrow \mathbb {R} }$

Alternatively, ${\displaystyle G}$ can be defined in terms of a distance to a target value for ${\displaystyle E_{\theta }[f(X)]}$. Let ${\displaystyle {\tilde {f}}}$ be the target value and ${\displaystyle \Delta (\cdot ,\cdot )}$ some distance function; then ${\displaystyle G}$ can be defined as follows:

${\displaystyle G_{\tilde {f}}(E_{\theta }[f(X)])=-\Delta (E_{\theta }[f(X)],{\tilde {f}})}$
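As a concrete sketch, take a scalar expectation and a squared distance (a hypothetical choice of ${\displaystyle \Delta }$; the framework leaves the distance function open):

```python
def ge_term(model_expectation, target, distance):
    """GE criterion G = -distance(E_theta[f(X)], target).

    Maximizing this term pushes the model expectation toward the target.
    """
    return -distance(model_expectation, target)

# Squared distance, one possible (assumed) choice for Delta.
squared = lambda a, b: (a - b) ** 2

print(ge_term(0.7, 0.5, squared))  # approximately -0.04
```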

## Use Cases

### Application to semi-supervised learning

Mann and McCallum (ICML 2007) describe an application of GE to a semi-supervised learning problem. The GE term used there encodes a preference, or prior, about the marginal class distribution that is either provided directly by a human expert or estimated from labeled data.

Let ${\displaystyle {\tilde {\mathbf {f} }}={\tilde {p}}(Y)}$ be the target distribution over class labels and ${\displaystyle f(\mathbf {x} ,\mathbf {y} )={1 \over n}\sum _{i=1}^{n}{\vec {I}}(y_{i})}$, where ${\displaystyle {\vec {I}}}$ denotes the vector indicator function on labels ${\displaystyle y\in {\mathcal {Y}}}$. Since the expectation of ${\displaystyle f(\mathbf {x} ,\mathbf {y} )}$ is the model's predicted distribution over labels, we can define a simple GE term as the negative KL-divergence between the target distribution and the predicted distribution over the unlabeled data ${\displaystyle {\tilde {\mathcal {X}}}}$:

${\displaystyle -KLDiv({\tilde {\mathbf {f} }},{1 \over \vert {\tilde {\mathcal {X}}}\vert }\sum _{\mathbf {x} \in {\tilde {\mathcal {X}}}}{\sum _{\mathbf {y} \in {\mathcal {Y}}}{p_{\theta }(\mathbf {y} \vert \mathbf {x} )f(\mathbf {x} ,\mathbf {y} )}})}$
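A minimal sketch of this label-regularization term for a discrete label set, assuming the model's per-instance predicted label distributions are given as lists (the function name and data are illustrative):

```python
import math

def label_regularization_ge(target, predicted_per_x):
    """Negative KL(target || average predicted label distribution)."""
    n = len(predicted_per_x)
    k = len(target)
    # Model expectation of the label-indicator feature: the average
    # predicted distribution over the unlabeled inputs.
    avg = [sum(p[j] for p in predicted_per_x) / n for j in range(k)]
    return -sum(t * math.log(t / q) for t, q in zip(target, avg) if t > 0)

# Toy target and predictions: the average prediction matches the target,
# so the KL-divergence is zero and the GE term attains its maximum.
target = [0.5, 0.5]
predictions = [[0.9, 0.1], [0.1, 0.9]]
print(label_regularization_ge(target, predictions))
```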

### Application to semi-supervised clustering

Suppose we have a representative 'prototype' instance ${\displaystyle \mathbf {x} '}$ for cluster ${\displaystyle y}$. We can encourage the model to assign high probability to instances similar to this prototype by adding a GE term like the following:

${\displaystyle \sum _{\mathbf {x} \in {\tilde {\mathcal {X}}}}{p_{\theta }(\mathbf {x} \vert y)sim(\mathbf {x} ,\mathbf {x} ')}}$

where sim is some similarity function such as cosine similarity.
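A sketch of this prototype-based GE term with cosine similarity, assuming instances are dense vectors and ${\displaystyle p_{\theta }(\mathbf {x} \vert y)}$ is supplied by the model (all names and data here are hypothetical):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def prototype_ge(instances, p_x_given_y, prototype):
    """Sum over instances of p(x|y) * sim(x, prototype)."""
    return sum(p_x_given_y(x) * cosine(x, prototype) for x in instances)

# Toy data: two unit vectors, a uniform (assumed) p(x|y), and the first
# vector doubling as the cluster prototype.
instances = [(1.0, 0.0), (0.0, 1.0)]
uniform = lambda x: 0.5
print(prototype_ge(instances, uniform, (1.0, 0.0)))  # 0.5
```

Only the instance identical to the prototype contributes here, since the orthogonal instance has cosine similarity zero.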

### Others

GE has also been successfully applied to other settings such as Active Learning and Transfer Learning. For more details, see www.cs.umass.edu/~mccallum/papers/ge08note.pdf