Generalized Expectation Criteria

From Cohen Courses
== Summary ==

This can be viewed as a parameter estimation [[Category::method]] that can augment or replace traditional parameter estimation methods such as maximum likelihood estimation. Traditional methods can be viewed as a special case of GE, and the flexibility of generalized expectation makes it possible to use more non-traditional approaches.
  
== Expectation ==
Let <math>X</math> be some set of variables and their assignments be <math>\mathbf{x}\in\mathcal{X}</math>. Let <math>\theta</math> be the parameters of a model that defines a probability distribution <math>p_{\theta}(X)</math>. The expectation of a function <math>f(X)</math> according to the model is

<math>
E_{\theta}[f(X)]
=
\sum_{\mathbf{x}\in\mathcal{X}}{p_{\theta}(\mathbf{x})f(\mathbf{x})}
</math>
  
We can partition the variables into "input" variables <math>X</math> and "output" variables <math>Y</math> that are conditioned on the input variables. When the assignments of the input variables <math>\tilde{\mathcal{X}}=\{\mathbf{x}_1,\mathbf{x}_2,...\}</math> are provided, the conditional expectation is

<math>
E_{\theta}[f(X,Y)\vert\tilde{\mathcal{X}}]
=
{1\over\vert\tilde{\mathcal{X}}\vert}\sum_{\mathbf{x}\in\tilde{\mathcal{X}}}{\sum_{\mathbf{y}\in Y}{p_{\theta}(\mathbf{y}\vert\mathbf{x})f(\mathbf{x},\mathbf{y})}}
</math>
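As an illustrative sketch (not part of the original article), the conditional expectation above can be computed by enumerating the outputs for every unlabeled input. The softmax model, the weights, and the feature function below are assumptions made up for this example:

```python
import math

# Hypothetical toy model: p_theta(y | x) over labels, parameterized by
# one weight per label on a scalar input x (softmax over label scores).
def p_y_given_x(theta, x):
    scores = [theta[y] * x for y in range(len(theta))]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

# E_theta[f(X, Y) | X~] = (1/|X~|) * sum_x sum_y p_theta(y | x) f(x, y)
def conditional_expectation(theta, inputs, f):
    total = 0.0
    for x in inputs:
        probs = p_y_given_x(theta, x)
        total += sum(p * f(x, y) for y, p in enumerate(probs))
    return total / len(inputs)

theta = [0.5, -0.5]        # assumed weights, one per label
inputs = [1.0, 2.0, 0.5]   # unlabeled inputs X~
f = lambda x, y: 1.0 if y == 0 else 0.0   # indicator feature: "label is 0"
exp_val = conditional_expectation(theta, inputs, f)
```

With an indicator feature function like this, the expectation is simply the model's average predicted probability of label 0 over the unlabeled data.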
== Generalized Expectation ==

A generalized expectation (GE) criterion is a function <math>G</math> that takes the model's expectation of <math>f(X)</math> as an argument and returns a scalar. The criterion is then added as a term in the parameter estimation objective function.

<math>
G(E_{\theta}[f(X)])\rightarrow\mathbb{R}
</math>
  
Or <math>G</math> can be defined based on a distance to a target value for <math>E_{\theta}[f(X)]</math>. Let <math>\tilde{f}</math> be the target value and <math>\Delta(\cdot,\cdot)</math> be some distance function; then we can define <math>G</math> in the following way:

<math>
G_{\tilde{f}}(E_{\theta}[f(X)])
=
-\Delta(E_{\theta}[f(X)],\tilde{f})
</math>
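A minimal sketch of such a distance-based GE term, assuming squared distance for <math>\Delta</math> and a scalar expectation and target (the numeric values below are made up):

```python
# Distance-based GE term: G_f~(E_theta[f(X)]) = -Delta(E_theta[f(X)], f~),
# with squared distance as an assumed choice of Delta.
def ge_term(model_expectation, target, delta=lambda a, b: (a - b) ** 2):
    return -delta(model_expectation, target)

score = ge_term(0.3, 0.7)   # model expectation 0.3 vs. target 0.7
```

The term is maximized (at zero) exactly when the model's expectation matches the target, so adding it to the objective pushes the parameters toward satisfying the expectation constraint.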
  
== Use Cases ==

=== Application to semi-supervised learning ===

[[RelatedPaper::Mann and McCallum, ICML 2007]] describes an application of GE to a semi-supervised learning problem. The GE terms used here indicate a preference/prior about the marginal class distribution that is either directly provided by a human expert or estimated from labeled data.
  
Let <math>\tilde{\mathbf{f}}=\tilde{p}(Y)</math> be the target distribution over class labels and <math>f(\mathbf{x},\mathbf{y})={1 \over n}\sum_{i}^{n}\vec{I}(y_i)</math> (<math>\vec{I}</math> denotes the vector indicator function on labels <math>y\in\mathcal{Y}</math>). Since the expectation of <math>f(\mathbf{x},\mathbf{y})</math> is the model's predicted distribution over labels, we can define a simple GE term as the negative KL-divergence between the predicted distribution and the target distribution over the unlabeled data <math>\tilde{\mathcal{X}}</math>:

<math>
-KLDiv(\tilde{\mathbf{f}}, {1\over\vert\tilde{\mathcal{X}}\vert}\sum_{\mathbf{x}\in\tilde{\mathcal{X}}}{\sum_{\mathbf{y}\in Y}{p_{\theta}(\mathbf{y}\vert\mathbf{x})f(\mathbf{x},\mathbf{y})}})
</math>
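To make this concrete, here is a small sketch of the KL-based GE term; the target prior and the per-instance predicted distributions <math>p_{\theta}(\mathbf{y}\vert\mathbf{x})</math> below are made up rather than produced by a real model:

```python
import math

def kl(p, q):
    # KL(p || q) for discrete distributions given as probability lists.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ge_label_marginal_term(target, predicted_dists):
    # The model's predicted label marginal is the average of the
    # per-instance predicted distributions over the unlabeled data.
    n = len(predicted_dists)
    k = len(target)
    marginal = [sum(d[j] for d in predicted_dists) / n for j in range(k)]
    # GE term: negative KL from the target prior to the predicted marginal.
    return -kl(target, marginal)

target = [0.7, 0.3]                            # assumed human-provided prior
preds = [[0.9, 0.1], [0.5, 0.5], [0.6, 0.4]]   # assumed p_theta(y|x) per x
term = ge_label_marginal_term(target, preds)
```

The term is zero when the predicted marginal equals the target prior and negative otherwise, so maximizing it pulls the model's label marginal toward the prior.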
  
=== Application to semi-supervised clustering ===

Suppose we have a representative 'prototype' instance <math>\mathbf{x}'</math> for cluster <math>y</math>. We can encourage the model to give high probability to instances similar to the prototype of cluster <math>y</math> by adding a GE term like the following:

<math>
\sum_{\mathbf{x}\in\tilde{\mathcal{X}}}{p_{\theta}(\mathbf{x}\vert y)sim(\mathbf{x},\mathbf{x}')}
</math>

where <math>sim</math> is some similarity function such as cosine similarity.
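A sketch of this clustering term with cosine similarity; the instance vectors and the <math>p_{\theta}(\mathbf{x}\vert y)</math> values below are made up for illustration:

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    return dot / (na * nb)

def cluster_ge_term(instances, probs, prototype):
    # sum over x in X~ of p_theta(x | y) * sim(x, x')
    return sum(p * cosine(x, prototype) for x, p in zip(instances, probs))

instances = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]   # unlabeled instances X~
probs = [0.5, 0.3, 0.2]        # assumed p_theta(x | y) for cluster y
prototype = [1.0, 0.0]         # prototype x' for cluster y
term = cluster_ge_term(instances, probs, prototype)
```

The term grows when the model concentrates probability (under cluster <math>y</math>) on instances that are similar to the prototype, which is exactly the preference the GE term encodes.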
=== Others ===

GE has also been successfully applied to other methods such as [[AddressesMethod::Active Learning]] and [[AddressesMethod::Transfer Learning]]. For more details, check [[www.cs.umass.edu/~mccallum/papers/ge08note.pdf]]
== Advantages ==

Instead of having a prior/expectation over the model parameters, which are often complex and hard to understand, we can easily state our preferences in terms of 'expectations'. This makes it much more flexible to add creative, non-traditional constraints on the model, even ones that cannot be represented in terms of model parameters.

* Expectations conditioned on different datasets -- useful when we have multiple datasets with different properties, missing data, source distributions, etc.
* Different coverage of parametric factors and constraint expectations -- the feature functions used in GE terms need not match the model's parametric factors.
* Flexible supervision -- preferences like "70% or more of the documents containing the word 'ice' should be about 'ice hockey' instead of 'baseball'" can be directly translated into GE terms.
  
== Disadvantages ==

It is often the case that GE terms alone under-specify the parameters, resulting in many different parameter settings that achieve a near-maximum of the objective function. Therefore, GE is usually used in addition to a standard objective function such as the log-likelihood.
  
 
== Related Papers ==

* [[RelatedPaper::Bellare_2009_generalized_expectation_criteria_for_bootstrapping_extractors_using_record_text_alignment]]
* [[RelatedPaper::Mann and McCallum, ICML 2007]]

Latest revision as of 15:50, 2 November 2011
