Generalized Expectation Criteria
Summary
Generalized expectation (GE) criteria can be viewed as a parameter estimation method that can augment or replace traditional parameter estimation methods such as maximum likelihood estimation. Traditional methods can be viewed as a special case of GE, and the flexibility of generalized expectation allows for more non-traditional approaches.
Expectation
Let $X$ be some set of variables and their assignments be $x \in \mathcal{X}$. Let $\theta$ be the parameters of a model that defines a probability distribution $p_\theta(X)$. The expectation of a function $f(X)$ according to the model is

$E_\theta[f(X)] = \sum_{x} p_\theta(x) f(x)$
We can partition the variables $X$ into "input" variables $I$ and "output" variables $O$, where $O$ is conditioned on the input variables. When an assignment $i$ of the input variables is provided, the conditional expectation is

$E_\theta[f(X) \mid i] = \sum_{o} p_\theta(o \mid i) f(i, o)$
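As a concrete illustration, here is a minimal Python sketch of computing a model expectation for a toy discrete model; the softmax parameterization and the particular function $f$ below are hypothetical choices for illustration, not part of the original formulation.

    import numpy as np

    # Toy model (hypothetical): a distribution over 3 assignments of a single
    # discrete variable X, parameterized by theta through a softmax.
    theta = np.array([0.2, -0.1, 0.5])
    p = np.exp(theta) / np.exp(theta).sum()   # p_theta(x)

    # Some function f(x) of interest; here an arbitrary real-valued feature.
    f = np.array([1.0, 0.0, 2.0])

    # Model expectation E_theta[f(X)] = sum_x p_theta(x) f(x)
    expectation = np.dot(p, f)
    print(expectation)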
Generalized Expectation
A generalized expectation (GE) criterion is a function $G$ that takes the model's expectation of $f(X)$ as an argument and returns a scalar, i.e. $G(E_\theta[f(X)])$. The criterion is then added as a term in the parameter estimation objective function.
Alternatively, $G$ can be defined based on a distance to a target value for $E_\theta[f(X)]$. Let $\hat{f}$ be the target value and $\Delta$ be some distance function; then we can define $G$ in the following way:

$G(E_\theta[f(X)]) = -\Delta\left(\hat{f}, E_\theta[f(X)]\right)$
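A minimal sketch of such a distance-based GE term, assuming a squared Euclidean distance for $\Delta$ (one common choice; the text does not commit to a particular distance function):

    import numpy as np

    def ge_term(expectation, f_hat):
        # G(E_theta[f(X)]) = -Delta(f_hat, E_theta[f(X)]),
        # here with Delta taken to be the squared L2 distance (an assumption).
        return -np.sum((np.asarray(f_hat) - np.asarray(expectation)) ** 2)

    # Example: the term is 0 when the model expectation matches the target,
    # and becomes increasingly negative as the two diverge.
    print(ge_term([0.5, 0.5], [0.4, 0.6]))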
Use Cases
Application to semi-supervised learning
Mann and McCallum (ICML 2007) describe an application of GE to a semi-supervised learning problem. The GE term used here encodes a preference/prior about the marginal class distribution, which is either provided directly by a human expert or estimated from labeled data.
Let $\tilde{p}$ be the target distribution over class labels and $f(y) = \mathbf{1}_y$ ($\mathbf{1}_y$ denotes the vector indicator function on labels $y$). Since the expectation of $f$ is the model's predicted distribution over labels, we can define a simple GE term as the negative KL-divergence between the predicted distribution and the target distribution over the unlabeled data $U$:

$G = -D_{\mathrm{KL}}\left(\tilde{p} \,\Big\|\, \frac{1}{|U|} \sum_{x \in U} p_\theta(y \mid x)\right)$
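A minimal Python sketch of this term; predict_proba is a hypothetical helper that returns the model's predicted label distribution $p_\theta(y \mid x)$ for a single instance:

    import numpy as np

    def label_regularization_ge(predict_proba, unlabeled_X, target_dist, eps=1e-12):
        target_dist = np.asarray(target_dist, dtype=float)
        # Model's expected label distribution over the unlabeled data:
        # (1/|U|) * sum_{x in U} p_theta(y | x)
        predicted = np.mean([predict_proba(x) for x in unlabeled_X], axis=0)
        # Negative KL(target || predicted); largest (zero) when the two match.
        return -np.sum(target_dist * np.log((target_dist + eps) / (predicted + eps)))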
Application to semi-supervised clustering
Suppose we have a representative 'prototype' instance $x_j$ for each cluster $j$. We can encourage the model to assign high probability of cluster $j$ to instances similar to the prototype $x_j$ by having a GE term like the following:

$G = \sum_{x \in U} \sum_{j} p_\theta(y = j \mid x) \, \mathrm{sim}(x, x_j)$
where sim is some similarity function such as cosine similarity.
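A minimal Python sketch of this prototype-based term, using cosine similarity; as above, predict_proba is a hypothetical helper returning $p_\theta(y \mid x)$, and instances and prototypes are assumed to be vectors:

    import numpy as np

    def cosine_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    def prototype_ge(predict_proba, X, prototypes):
        # Sum over instances and clusters of p_theta(y = j | x) * sim(x, x_j):
        # large when instances similar to prototype x_j are assigned to cluster j.
        total = 0.0
        for x in X:
            p = predict_proba(x)   # length = number of clusters
            for j, proto in enumerate(prototypes):
                total += p[j] * cosine_sim(x, proto)
        return total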
Others
GE has also been successfully applied to other settings such as Active Learning and Transfer Learning; for more details, see [www.cs.umass.edu/~mccallum/papers/ge08note.pdf].
Advantages
Instead of specifying a prior or expectation over the model parameters, which are often complex and hard to interpret, we can express our preferences directly as expectations over the model's behavior. This makes it much more flexible to add creative and non-traditional constraints to the model, even ones that cannot be represented in terms of the model parameters.
Disadvantages
GE criteria often under-specify the parameters, resulting in many different parameter settings that achieve a near-maximum of the objective function. Therefore, GE terms are usually used in addition to a standard objective function such as the log-likelihood.