== Citation ==

Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 2006.

== Online version ==

Mathew Beal's papers

== Summary ==

This paper proposes a nonparametric Bayes approach to deciding the number of mixture components in grouped data; the basic idea is:
 
* Develop analogs, for the [[UsesMethod:: Hierarchical Dirichlet process]], of both the stick-breaking and the "Chinese restaurant franchise" representations of the Dirichlet process.

* Use an [[UsesMethod:: MCMC algorithm for posterior inference under hierarchical Dirichlet process mixtures]].
== Methodology ==
 
A hierarchical Dirichlet process is a distribution over a set of random probability measures over a measurable space <math>(\Theta, \mathcal{B})</math>. The process defines a set of random probability measures <math>G_j</math>, one for each group, and a global random probability measure <math>G_0</math>. The global measure <math>G_0</math> is distributed as a Dirichlet process with concentration parameter <math>\gamma</math> and base probability measure <math>H</math>:
<math>G_0 | \gamma, H \sim DP(\gamma, H)</math>

and the random measures <math>G_j</math> are conditionally independent given <math>G_0</math>, with distributions given by a Dirichlet process with base probability measure <math>G_0</math>:

<math>G_j | \alpha_0, G_0 \sim DP(\alpha_0, G_0)</math>.
 
A hierarchical Dirichlet process can be used as the prior distribution over the factors for grouped data. For each group <math>j</math>, let <math>\theta_{j1}, \theta_{j2}, \ldots</math> be i.i.d. random variables distributed as <math>G_j</math>. Each <math>\theta_{ji}</math> is a factor corresponding to a single observation <math>x_{ji}</math>. The likelihood is given by:
<math>\theta_{ji} | G_j \sim G_j</math>

<math>x_{ji} | \theta_{ji} \sim F(\theta_{ji})</math>.
 
The hierarchical Dirichlet process can readily be extended to more than two levels. That is, the base measure H can itself be a draw from a DP, and the hierarchy can be extended for as many levels as are deemed useful.
* The stick-breaking construction

Given that the global measure <math>G_0</math> is distributed as a Dirichlet process, it can be expressed using a stick-breaking representation:

<math>G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k},</math>

where <math>\phi_k \sim H</math> independently and <math>\beta = (\beta_k)_{k=1}^{\infty} \sim GEM(\gamma)</math> are mutually independent. Since <math>G_0</math> has support at the points <math>\phi = (\phi_k)_{k=1}^{\infty}</math>, each <math>G_j</math> necessarily has support at these points as well, and can thus be written as:

<math>G_j = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\phi_k}</math>

Let <math>\pi_j = (\pi_{jk})_{k=1}^{\infty}</math>. Note that the weights <math>\pi_j</math> are independent given <math>\beta</math> (since the <math>G_j</math> are independent given <math>G_0</math>). These weights <math>\pi_j</math> are related to the global weights <math>\beta</math>.
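The stick-breaking construction can be simulated directly. The following sketch (illustrative only, not from the paper) draws approximate <math>GEM(\gamma)</math> weights under a finite truncation, taking a standard normal base measure <math>H</math> purely as an assumption for the atoms:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(gamma, truncation, rng):
    """Approximate beta ~ GEM(gamma) by breaking `truncation` sticks."""
    v = rng.beta(1.0, gamma, size=truncation)            # stick-breaking fractions
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining                                 # beta_k = v_k * prod_{l<k}(1 - v_l)

gamma = 1.0
K = 1000
beta = stick_breaking(gamma, K, rng)
phi = rng.normal(size=K)   # atoms phi_k ~ H (here H = N(0, 1), an arbitrary choice)

# G_0 is approximated by the discrete measure sum_k beta_k * delta_{phi_k};
# with a deep truncation the weights sum to nearly 1.
print(beta.sum())
```

With a truncation of 1000 sticks the leftover mass is negligible, so the finite weight vector is a practical stand-in for the infinite sequence.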
An equivalent representation of the hierarchical Dirichlet process mixture is:

<math>\beta | \gamma \sim GEM(\gamma)</math>

<math>\pi_j | \alpha_0, \beta \sim DP(\alpha_0, \beta)</math>

<math>z_{ji} | \pi_j \sim \pi_j</math>

<math>\phi_k | H \sim H</math>

<math>x_{ji} | z_{ji}, (\phi_k)_{k=1}^{\infty} \sim F(\phi_{z_{ji}})</math>.

After some derivation, the relation between the group weights <math>\pi_j</math> and the global weights <math>\beta</math> is:

<math>\frac{1}{1-\sum_{l=1}^{k-1} \pi_{jl}} \left(\pi_{jk}, \sum_{l=k+1}^{\infty} \pi_{jl}\right) \sim Dir\left(\alpha_0 \beta_k, \alpha_0 \sum_{l=k+1}^{\infty} \beta_l\right)</math>.
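This relation gives a direct recipe for sampling the group weights: the Dirichlet above is two-dimensional, so each rescaled <math>\pi_{jk}</math> is a Beta draw, and the group weights follow their own stick-breaking scheme. A minimal sketch (illustrative; the fixed finite <math>\beta</math> is an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)

def group_weights(beta, alpha0, rng):
    """Draw pi_j | beta ~ DP(alpha0, beta) over a finite set of atoms, using
    pi'_jk ~ Beta(alpha0 * beta_k, alpha0 * sum_{l>k} beta_l) and
    pi_jk  = pi'_jk * prod_{l<k} (1 - pi'_jl)."""
    K = len(beta)
    tail = np.append(np.cumsum(beta[::-1])[::-1][1:], 0.0)  # sum_{l>k} beta_l
    pi = np.empty(K)
    stick = 1.0                                             # remaining mass
    for k in range(K):
        b = alpha0 * tail[k]
        frac = rng.beta(alpha0 * beta[k], b) if b > 0 else 1.0
        pi[k] = stick * frac
        stick *= 1.0 - frac
    return pi

beta = np.array([0.5, 0.3, 0.2])   # a fixed, already-normalized global beta (toy values)
pi_j = group_weights(beta, alpha0=5.0, rng=rng)
print(pi_j, pi_j.sum())
```

A larger <math>\alpha_0</math> concentrates each <math>\pi_j</math> around <math>\beta</math>, which is exactly what sharing the base measure across groups is meant to achieve.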
* Chinese restaurant franchise

[[File:Crf.jpg | 200 px]]

The restaurants correspond to groups and the customers correspond to the factors <math>\theta_{ji}</math>. Let <math>\phi_1, \ldots, \phi_K</math> denote K i.i.d. random variables distributed according to H; this is the global menu of dishes. The variables <math>\psi_{jt}</math> represent the table-specific choice of dishes; in particular, <math>\psi_{jt}</math> is the dish served at table t in restaurant j. The notation <math>n_{jtk}</math> denotes the number of customers in restaurant j at table t eating dish k, and <math>m_{jk}</math> the number of tables in restaurant j serving dish k; marginal counts are written with dots (e.g., <math>n_{jt.}</math> is the number of customers at table t in restaurant j). Then,

<math>\theta_{ji} | \theta_{j1},..., \theta_{j,i-1}, \alpha_0, G_0 \sim \sum_{t=1}^{m_{j.}} \frac{n_{jt.}}{i-1+\alpha_0} \delta_{\psi_{jt}} + \frac{\alpha_0}{i-1+\alpha_0} G_0</math>,

<math>\psi_{jt} | \psi_{11},\psi_{12},...,\psi_{21},..., \psi_{j,t-1}, \gamma, H \sim \sum_{k=1}^{K} \frac{m_{.k}}{m_{..}+\gamma} \delta_{\phi_{k}} + \frac{\gamma}{m_{..}+\gamma} H</math>
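The two predictive rules above translate directly into a forward simulation of the franchise. The sketch below (illustrative only; the parameter values and restaurant/customer counts are arbitrary) seats customers table-by-table: an occupied table is chosen with probability proportional to its occupancy <math>n_{jt.}</math>, a new table with probability proportional to <math>\alpha_0</math>; a new table picks an existing dish with probability proportional to <math>m_{.k}</math> or a new dish with probability proportional to <math>\gamma</math>:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha0, gamma = 1.0, 1.0

table_counts = []   # table_counts[j][t] = n_jt.  (customers at table t of restaurant j)
table_dish = []     # table_dish[j][t]   = index of the dish served at that table
dish_tables = []    # dish_tables[k]     = m_.k   (tables serving dish k, all restaurants)

def seat_customer(j):
    """Seat one customer in restaurant j; return the index of the dish they eat."""
    while len(table_counts) <= j:
        table_counts.append([]); table_dish.append([])
    weights = np.array(table_counts[j] + [alpha0], dtype=float)
    t = rng.choice(len(weights), p=weights / weights.sum())
    if t == len(table_counts[j]):                 # new table: draw its dish
        dweights = np.array(dish_tables + [gamma], dtype=float)
        k = rng.choice(len(dweights), p=dweights / dweights.sum())
        if k == len(dish_tables):
            dish_tables.append(0)                 # brand-new dish on the global menu
        dish_tables[k] += 1
        table_counts[j].append(0); table_dish[j].append(k)
    table_counts[j][t] += 1
    return table_dish[j][t]

dishes = [seat_customer(j) for j in range(3) for _ in range(20)]
print(len(set(dishes)), "distinct dishes across 3 restaurants")
```

Because every restaurant draws dishes from the same global menu, components (dishes) are shared across groups while each group keeps its own seating pattern.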
== MCMC algorithm for posterior sampling in the Chinese restaurant franchise ==

The Chinese restaurant franchise yields a Gibbs sampling scheme for posterior sampling given observations x. Rather than dealing with the <math>\theta_{ji}</math>'s and <math>\psi_{jt}</math>'s directly, we sample their index variables <math>t_{ji}</math> and <math>k_{jt}</math> instead.
* '''Sampling t.''' The prior probability that <math>t_{ji}</math> takes on a particular previously used value t is proportional to <math>n_{jt.}^{-ji}</math>, whereas the probability that it takes on a new value (say <math>t^{new} = m_{j.} + 1</math>) is proportional to <math>\alpha_0</math>. The likelihood due to <math>x_{ji}</math> given <math>t_{ji} = t</math> for some previously used t is <math>f_{k_{jt}}^{-x_{ji}}(x_{ji})</math>, so

[[File:hdp_eq1.png]]
If the sampled value of <math>t_{ji}</math> is <math>t^{new}</math>, we obtain a sample of <math>k_{jt^{new}}</math> by sampling from

[[File:hdp_eq2.png]]

If as a result of updating <math>t_{ji}</math> some table t becomes unoccupied, i.e., <math>n_{jt.} = 0</math>, then the probability that this table will be reoccupied in the future is zero, since it is always proportional to <math>n_{jt.}</math>. As a result, we may delete the corresponding <math>k_{jt}</math> from the data structure. If as a result of deleting <math>k_{jt}</math> some mixture component k becomes unallocated, we delete this mixture component as well.

* '''Sampling k.''' Since changing <math>k_{jt}</math> actually changes the component membership of all data items at table t, the likelihood obtained by setting <math>k_{jt} = k</math> is given by <math>f_{k}^{-x_{jt}}(x_{jt})</math>, so that the conditional probability of <math>k_{jt}</math> is

[[File:hdp_eq3.png]]
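A single Gibbs update for <math>t_{ji}</math> combines the CRF prior with the conditional likelihoods: each occupied table t gets weight <math>n_{jt.}^{-ji} f_{k_{jt}}^{-x_{ji}}(x_{ji})</math>, and a new table gets weight <math>\alpha_0</math> times a prior-predictive mixture over existing and new dishes. The sketch below is schematic, not the paper's code: the likelihood values are stand-in numbers, whereas in a real sampler they would come from the model F and base measure H.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha0, gamma = 1.0, 1.0

# State for one restaurant j, with x_ji removed from the counts (toy values):
n_jt = np.array([3, 1])      # n_jt.^{-ji}: occupancy of the existing tables
k_jt = np.array([0, 1])      # dish served at each table
m_k = np.array([2, 1])       # m_.k: tables serving each existing dish (all restaurants)
f_k = np.array([0.4, 0.1])   # f_k^{-x_ji}(x_ji) for each existing dish (stand-ins)
f_new = 0.05                 # likelihood of x_ji under a brand-new dish (stand-in)

# Prior predictive of x_ji at a new table: mixture over existing and new dishes.
p_new_table = (m_k @ f_k + gamma * f_new) / (m_k.sum() + gamma)

# Unnormalized conditional over t_ji: existing tables, then the new table.
weights = np.append(n_jt * f_k[k_jt], alpha0 * p_new_table)
probs = weights / weights.sum()
t_ji = rng.choice(len(probs), p=probs)
print(probs, t_ji)
```

If the new table is chosen, the sampler would then draw its dish <math>k_{jt^{new}}</math> from the same mixture, as described above.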
  
 
== Data ==
[[UsesDataset:: Nematode biology abstracts]]

[[UsesDataset:: NIPS 1988-1999]]

[[UsesDataset:: Alice’s Adventures in Wonderland by Lewis Carroll]]
== Related Papers ==

[[RelatedPaper::Aldous, D. (1985), “Exchangeability and Related Topics,” in École d’Été de Probabilités de Saint-Flour XIII–1983, Springer, Berlin, pp. 1–198]].

[[RelatedPaper::Antoniak, C. (1974), “Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems,” Annals of Statistics, 2(6), pp. 1152–1174]].

