Difference between revisions of "Generalized Expectation Criteria"

Revision as of 14:56, 2 November 2011

Summary

This can be viewed as a parameter estimation method that can augment/replace traditional parameter estimation methods such as maximum likelihood estimation. M

Support Vector Machines or Conditional Random Fields to efficiently optimize the objective function, especially in the online setting. Stochastic optimizations like this method are known to be faster when trained with large, redundant data sets.

Expectation

Let $X$ be some set of variables and their assignments be $\mathbf {x} \in {\mathcal {X}}$ . Let $\theta$ be the parameters of a model that defines a probability distribution $p_{\theta }(X)$ . The expectation of a function $f(X)$ according to the model is

$E_{\theta }[f(X)]=\sum _{\mathbf {x} \in {\mathcal {X}}}{p_{\theta }(\mathbf {x} )f(\mathbf {x} )}$
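As a minimal sketch of this sum, assume the assignment space is small enough to enumerate. The toy distribution `p_theta` and the feature function `count_ones` below are hypothetical and chosen only for illustration.

```python
def model_expectation(p, f):
    """Return E[f(X)] = sum over x of p(x) * f(x) for a finite assignment space.

    p: dict mapping each assignment x to its probability under the model
    f: function mapping an assignment to a real number
    """
    return sum(prob * f(x) for x, prob in p.items())


# Hypothetical toy model over two binary variables (probabilities sum to 1).
p_theta = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}


def count_ones(x):
    """Illustrative feature function: number of variables set to 1."""
    return sum(x)


expected_ones = model_expectation(p_theta, count_ones)
```

For realistic models the assignment space is too large to enumerate, so the expectation is usually computed by a model-specific inference procedure rather than by a direct sum like this one.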

We can partition the variables into "input" variables $X$ and "output" variables $Y$ that are conditioned on the input variables. When assignments of the input variables ${\tilde {\mathcal {X}}}=\{\mathbf {x} _{1},\mathbf {x} _{2},...\}$ are provided, the conditional expectation is

$E_{\theta }[f(X,Y)\vert {\tilde {\mathcal {X}}}]={1 \over \vert {\tilde {\mathcal {X}}}\vert }\sum _{\mathbf {x} \in {\tilde {\mathcal {X}}}}{\sum _{\mathbf {y} \in {\mathcal {Y}}}{p_{\theta }(\mathbf {y} \vert \mathbf {x} )f(\mathbf {x} ,\mathbf {y} )}}$
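The conditional expectation can be sketched in the same way: average, over the provided inputs, the inner sum over output assignments. The logistic model `cond_prob`, its parameter `THETA`, and the feature function `is_positive` are assumptions made for this example, not part of the original description.

```python
import math


def conditional_expectation(inputs, labels, cond_prob, f):
    """Average over the inputs of sum_y p(y | x) * f(x, y)."""
    total = 0.0
    for x in inputs:
        for y in labels:
            total += cond_prob(y, x) * f(x, y)
    return total / len(inputs)


THETA = 2.0  # hypothetical model parameter


def cond_prob(y, x):
    """Hypothetical binary logistic model p(y | x) with a scalar input x."""
    p_one = 1.0 / (1.0 + math.exp(-THETA * x))
    return p_one if y == 1 else 1.0 - p_one


def is_positive(x, y):
    """Illustrative feature function: indicator that the label is 1."""
    return 1.0 if y == 1 else 0.0


expected_positive = conditional_expectation(
    [-1.0, 1.0], [0, 1], cond_prob, is_positive
)
```

With an indicator feature such as `is_positive`, the result is the model's average predicted probability of that label over the provided inputs.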

Generalized Expectation

A generalized expectation (GE) criterion is a function $G$ that takes the model's expectation of $f(X)$ as an argument and returns a scalar. The criterion is then added as a term in the parameter estimation objective function.

$G(E_{\theta }[f(X)])\rightarrow \mathbb {R}$

Alternatively, $G$ can be defined in terms of a distance to a target value for $E_{\theta }[f(X)]$ . Let ${\tilde {f}}$ be the target value and $\Delta (\cdot ,\cdot )$ be some distance function; then we can define $G$ as follows:

$G_{\tilde {f}}(E_{\theta }[f(X)])=-\Delta (E_{\theta }[f(X)],{\tilde {f}})$
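A minimal sketch of the distance-based form follows, assuming squared Euclidean distance for $\Delta$. The definition above leaves the distance function open, so this choice and the names `squared_distance` and `ge_term` are illustrative only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def ge_term(expectation, target, distance=squared_distance):
    """Return G = -distance(expectation, target).

    The value is 0 when the model expectation matches the target and
    becomes more negative as the two move apart.
    """
    return -distance(expectation, target)


target = [0.5, 0.5]
score_far = ge_term([0.9, 0.1], target)
score_near = ge_term([0.55, 0.45], target)
score_exact = ge_term([0.5, 0.5], target)
```

Because the term is added to an objective that is maximized, parameters whose expectation lies closer to the target receive a higher score.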

Use Cases

Application to semi-supervised learning

Mann and McCallum (ICML 2007) describe an application of GE to a semi-supervised learning problem. The GE term used there expresses a preference (prior) about the marginal class distribution, which is either provided directly by a human expert or estimated from labeled data.

Let ${\tilde {\mathbf {f} }}={\tilde {p}}(Y)$ be the target distribution over class labels and $f(\mathbf {x} ,\mathbf {y} )={1 \over n}\sum _{i=1}^{n}{\vec {I}}(y_{i})$ ( ${\vec {I}}$ denotes the vector indicator function on labels $y\in {\mathcal {Y}}$ ). Since the expectation of $f(\mathbf {x} ,\mathbf {y} )$ is the model's predicted distribution over labels, we can define a simple GE term as the negative KL divergence between the target distribution and the predicted distribution:

$-\mathrm {KLDiv} ({\tilde {\mathbf {f} }},{1 \over \vert {\tilde {\mathcal {X}}}\vert }\sum _{\mathbf {x} \in {\tilde {\mathcal {X}}}}{\sum _{\mathbf {y} \in {\mathcal {Y}}}{p_{\theta }(\mathbf {y} \vert \mathbf {x} )f(\mathbf {x} ,\mathbf {y} )}})$
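The KL-based term can be sketched for the simplest case, assuming each input carries a single label ($n=1$), so that the expectation of the indicator feature reduces to the average of $p_{\theta }(y\vert \mathbf {x} )$ over the unlabeled inputs. The probabilities in `cond_probs` are made up for illustration, and the function names are hypothetical.

```python
import math


def predicted_label_distribution(cond_probs):
    """Average the per-input label distributions p(y | x) over all inputs."""
    n = len(cond_probs)
    k = len(cond_probs[0])
    return [sum(row[j] for row in cond_probs) / n for j in range(k)]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def label_regularization_term(target, cond_probs):
    """GE term: negative KL divergence from the target to the predicted distribution."""
    return -kl_divergence(target, predicted_label_distribution(cond_probs))


# Hypothetical model outputs p(y | x) for three unlabeled inputs, two classes.
cond_probs = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]

predicted = predicted_label_distribution(cond_probs)  # roughly [0.6, 0.4]
term_mismatch = label_regularization_term([0.5, 0.5], cond_probs)
term_match = label_regularization_term([0.6, 0.4], cond_probs)
```

The term is at most zero and reaches zero only when the predicted label distribution equals the target, so maximizing it pulls the model's marginal predictions toward the target proportions.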

