This is a [[Category::method]] proposed by [[RelatedPaper::Smith and Eisner 2005:Contrastive Estimation: Training Log-Linear Models on Unlabeled Data]].
  
The proposed approach deals with the estimation of [[AddressesProblem::Log-linear Models]] (e.g. [[AddressesProblem::Conditional Random Fields]]) in an unsupervised fashion. The method focuses on the denominator of the log-linear model, the normalizing sum <math>\sum_{x',y'} u(x',y')</math> of unnormalized scores over competing input-output pairs (with <math>u</math> defined below), by exploiting the so-called ''implicit negative evidence'' in the probability mass.
  
 
== Motivation ==

[[File:Ce survey.png]]
 
In the Smith and Eisner (2005) paper, the authors survey different estimation techniques for probabilistic graphical models (see the Figure above). For HMMs, one usually optimizes the joint likelihood. For log-linear models, various methods have been proposed to optimize conditional probabilities. In addition, there are methods that directly maximize the classification accuracy, the sum of conditional likelihoods, or the expected local accuracy. However, none of these estimation techniques specifically focuses on the ''implicit negative evidence'' in the denominator of the standard log-linear model in an unsupervised setting.
 
== How it Works ==
  
 
Unlike the above methods, the contrastive estimation approach optimizes:

:<math>\prod_{i} p(X_i = x_i | X_i \in Neighbor(x_i), \theta)</math>

Here, <math>Neighbor(x_i)</math> denotes a set containing implicit negative examples together with <math>x_i</math> itself. The idea is to move probability mass from the neighborhood of <math>x_i</math> onto <math>x_i</math> itself. A well-chosen denominator in a log-linear model can thus not only improve task accuracy, but also reduce the computation spent on normalization, since the sum ranges over a small neighborhood rather than the full space.
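
To make the idea of implicit negative evidence more concrete, below is a minimal Python sketch of one possible neighborhood function for word sequences, in the spirit of a transpose-adjacent-words neighborhood such as the paper's TRANS1. The function name and data representation are illustrative choices, not part of the original formulation.

<pre>
# A minimal sketch (not the paper's implementation) of a neighborhood
# function that transposes adjacent words. Sentences are tuples of words.

def trans1_neighborhood(sentence):
    """Return the sentence itself plus every variant obtained by swapping
    one pair of adjacent words (the implicit negative examples)."""
    sentence = tuple(sentence)
    neighbors = [sentence]            # Neighbor(x_i) also contains x_i itself
    for j in range(len(sentence) - 1):
        swapped = list(sentence)
        swapped[j], swapped[j + 1] = swapped[j + 1], swapped[j]
        neighbors.append(tuple(swapped))
    return neighbors

# A 3-word sentence yields itself plus two transposed variants.
print(trans1_neighborhood(("the", "dog", "barks")))
</pre>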
  
== Problem Formulation and the Detailed Algorithm ==

Assume we have a log-linear model parameterized by <math>\theta</math>, where the input example is <math>x</math> and the output label is <math>y</math>. A standard log-linear model takes the form

:<math>p(x,y | \theta) \overset{\underset{\mathrm{def}}{}}{=} \frac{1}{Z(\theta)} \exp (\theta \cdot f(x,y))</math>

Here, we use <math>u(x,y|\theta)</math> to denote the unnormalized score <math>\exp (\theta \cdot f(x,y))</math>, where <math>f</math> is the feature vector. <math>Z(\theta) = \sum_{x',y'} u(x',y'|\theta)</math> is the partition function, which is hard to compute because it sums over a much larger space. Then, we can write the estimation objective in the general form

:<math> \prod_{i} \frac{\sum_{(x,y)\in A_{i}} u (x,y|\theta)}{\sum_{(x,y)\in B_{i}} u (x,y|\theta)}</math>

where <math>A_{i} \subset B_{i}</math> for each <math>i</math>.
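
To make the roles of <math>A_i</math> and <math>B_i</math> concrete, the familiar objectives arise (roughly) as special cases of this ratio: supervised joint likelihood takes <math>A_i = \{(x_i, y_i)\}</math> and <math>B_i = X \times Y</math>; supervised conditional likelihood takes <math>A_i = \{(x_i, y_i)\}</math> and <math>B_i = \{x_i\} \times Y</math>; contrastive estimation, given below, takes <math>A_i = \{x_i\} \times Y</math> and <math>B_i = Neighbor(x_i) \times Y</math>.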

In the unsupervised setting, the contrastive estimation method maximizes the neighborhood likelihood

:<math> \log \prod_{i} \frac{\sum_{y \in Y} u (x_i,y|\theta)}{\sum_{(x,y)\in Neighbor(x_i) \times Y} u (x,y|\theta)}</math>

To optimize this neighborhood likelihood, a standard numerical optimization method such as L-BFGS can be applied. There are also various neighborhood functions designed for sequential data, but they are not discussed further on this method page.
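
Below is a minimal, self-contained Python sketch of this objective on a toy tagging task, optimized with L-BFGS via SciPy as suggested above. The toy tag set, feature function, and adjacent-word-transposition neighborhood are illustrative assumptions, not the experimental setup of Smith and Eisner (2005); the gradient is approximated numerically for brevity.

<pre>
# Toy contrastive estimation: maximize the neighborhood likelihood
#   sum_i [ log sum_y u(x_i, y) - log sum_{(x, y) in Neighbor(x_i) x Y} u(x, y) ]
# with u(x, y) = exp(theta . f(x, y)), optimized by L-BFGS.

import itertools
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

TAGS = ("D", "N", "V")            # toy tag set (Y is all tag sequences)
VOCAB = ("the", "dog", "barks")   # toy vocabulary
DIM = len(VOCAB) * len(TAGS) + len(TAGS) * len(TAGS)

def features(x, y):
    """Word-tag emission counts plus tag-tag transition counts, as one vector."""
    emit = np.zeros((len(VOCAB), len(TAGS)))
    trans = np.zeros((len(TAGS), len(TAGS)))
    for word, tag in zip(x, y):
        emit[VOCAB.index(word), TAGS.index(tag)] += 1.0
    for a, b in zip(y, y[1:]):
        trans[TAGS.index(a), TAGS.index(b)] += 1.0
    return np.concatenate([emit.ravel(), trans.ravel()])

def neighborhood(x):
    """The sentence itself plus all adjacent-word transpositions."""
    x = tuple(x)
    out = [x]
    for j in range(len(x) - 1):
        swapped = list(x)
        swapped[j], swapped[j + 1] = swapped[j + 1], swapped[j]
        out.append(tuple(swapped))
    return out

def neg_neighborhood_log_likelihood(theta, corpus):
    """Negative contrastive estimation objective (to be minimized)."""
    total = 0.0
    for x in corpus:
        tag_seqs = list(itertools.product(TAGS, repeat=len(x)))
        num = logsumexp([theta @ features(x, y) for y in tag_seqs])
        den = logsumexp([theta @ features(xp, y)
                         for xp in neighborhood(x) for y in tag_seqs])
        total += num - den
    return -total

corpus = [("the", "dog", "barks")]                 # unlabeled data
rng = np.random.default_rng(0)
theta0 = 0.01 * rng.standard_normal(DIM)           # small random start
result = minimize(neg_neighborhood_log_likelihood, theta0, args=(corpus,),
                  method="L-BFGS-B")               # gradient by finite differences
print("initial objective:", -neg_neighborhood_log_likelihood(theta0, corpus))
print("final objective:  ", -result.fun)           # closer to 0 is better
</pre>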
  
 
== Some Reflections ==

(1) The hypothesis behind the contrastive estimation approach is that each training instance (the numerator of the log-linear model) is closely related to its neighbors (the denominator of the log-linear model), so that we can make the good instance more likely at the expense of the bad instances nearby.

(2) The denominator of a log-linear model affects not only the system accuracy but also its efficiency, since computing the full partition function <math>\mathbf{Z}</math> is expensive; summing over a neighborhood instead avoids this cost.

(3) The proposed Contrastive Estimation method is closely related to the following methods: [[UsesMethod::Expectation Maximization]], [[UsesMethod::Posterior Regularization for Expectation Maximization]], [[UsesMethod::Featurized HMM]] and [[UsesMethod::Conditional Random Fields]].
  
 
== Related Papers ==
* [[RelatedPaper::Smith and Eisner 2005:Contrastive Estimation: Training Log-Linear Models on Unlabeled Data]] (This is the original paper that proposes the contrastive estimation method.)
* [[RelatedPaper::Berg-Kirkpatrick et al, ACL 2010: Painless Unsupervised Learning with Features]] (This paper talks about incorporating log-linear models as local classifiers in HMMs. The estimation method is the [[forward-backward algorithm]].)
* [[RelatedPaper::Klein 2002 conditional structure versus conditional estimation in nlp models]] (This paper discusses various estimation methods for log-linear models, in particular the sum of conditional likelihoods (SCL) and maximum conditional likelihood (CL). The authors observe better performance from SCL estimation than from CL, even though the SCL objective function is non-convex.)

== Comments ==
It would be cool if someone figured out what the relationship is between contrastive estimation (this) vs. contrastive divergence (Hinton). Something like, you only look at little pieces of the partition function... --[[User:Brendan|Brendan]] 21:15, 13 October 2011 (UTC)
