Difference between revisions of "Teh et, JASA2006"
(Created page with '== Citation == Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 2006 == Online version == [http://ww…') |
|||
(6 intermediate revisions by the same user not shown) | |||
Line 13: | Line 13: | ||
* Develop analogs for the [[UsesMethod:: Hierarchical Dirichlet process]] with representations of both a stick-breaking and a "Chinese restaurant franchise”. | * Develop analogs for the [[UsesMethod:: Hierarchical Dirichlet process]] with representations of both a stick-breaking and a "Chinese restaurant franchise”. | ||
− | * Use [[ | + | * Use [[UsesMethod:: MCMC algorithm for posterior inference under hierarchical Dirichlet process mixtures]]. |
== Methodology == | == Methodology == | ||
Line 19: | Line 19: | ||
A hierarchical Dirichlet process is a distribution over a set of random probability measures over <math>(\theta; B)</math>. The process defines a set of random probability measures <math>G_j</math>, one for each group, and a global random probability measure <math>G_0</math>. The global measure <math>G_0</math> is distributed as a Dirichlet process with concentration parameter and base probability measure H: | A hierarchical Dirichlet process is a distribution over a set of random probability measures over <math>(\theta; B)</math>. The process defines a set of random probability measures <math>G_j</math>, one for each group, and a global random probability measure <math>G_0</math>. The global measure <math>G_0</math> is distributed as a Dirichlet process with concentration parameter and base probability measure H: | ||
− | <math>G_0 | \gamma, H | + | <math>G_0 | \gamma, H \sim DP(\gamma, H)</math> |
− | and the random measures | + | and the random measures <math>G_j</math> are conditionally independent given G0, with distributions given by a Dirichlet process with base probability measure <math>G_0</math>: |
− | <math>G_j | \alpha_0, G_0 | + | <math>G_j | \alpha_0, G_0 \sim DP(\alpha_0, G_0)</math>. |
A hierarchical Dirichlet process can be used as the prior distribution over the factors for grouped data. For each j let <math>\theta_{j1}, \theta_{j2},...</math> be i.i.d. random variables distributed as <math>G_j</math> . Each <math>\theta_ji</math> is a factor corresponding to a single observation <math>x_{ji}</math>. The likelihood is given by: | A hierarchical Dirichlet process can be used as the prior distribution over the factors for grouped data. For each j let <math>\theta_{j1}, \theta_{j2},...</math> be i.i.d. random variables distributed as <math>G_j</math> . Each <math>\theta_ji</math> is a factor corresponding to a single observation <math>x_{ji}</math>. The likelihood is given by: | ||
− | <math>\theta_{ji} | G_j | + | <math>\theta_{ji} | G_j \sim G_j</math> |
− | <math>x_{ji} | \theta_{ji} | + | <math>x_{ji} | \theta_{ji} \sim F(\theta_{ji})</math>. |
The hierarchical Dirichlet process can readily be extended to more than two levels. That is, the base measure H can itself be a draw from a DP, and the hierarchy can be extended for as many levels as are deemed useful. | The hierarchical Dirichlet process can readily be extended to more than two levels. That is, the base measure H can itself be a draw from a DP, and the hierarchy can be extended for as many levels as are deemed useful. | ||
+ | * The stick-breaking construction | ||
+ | Given that the global measure <math>G_0</math> is distributed as a Dirichlet process, it can be expressed using a stick-breaking representation: | ||
+ | |||
+ | <math>G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k},</math> | ||
+ | |||
+ | where <math>\phi_k \sim H</math> independently and <math>\beta = (\beta_k)_{k=1}^{\infty} \sim GEM(\gamma)</math> are mutually independent. Since <math>G_0</math> has support at the points <math>\phi = (\phi_k)_{k=1}^{\infty}</math>, each <math>G_j</math> necessarily has support at these points as well, and can thus be written as: | ||
+ | |||
+ | <math>G_j = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\phi_k}</math> | ||
+ | |||
+ | Let <math>\pi_j = ((\pi_{jk})_{k=1}^{\infty})</math>. Note that the weights <math>\pi_j</math> are independent given <math>\beta</math> (since the <math>G_j</math> are independent given <math>G_0</math>). These weights <math>\pi_j</math> are related to the global weights <math>\beta</math>. | ||
+ | |||
+ | An equivalent representation of the hierarchical Dirichlet process mixture can be: | ||
+ | |||
+ | <math>\beta | \gamma \sim GEM(\gamma)</math> | ||
+ | |||
+ | <math>\pi_j | \alpha_0, \beta \sim DP(\alpha_0, \beta)</math> | ||
+ | |||
+ | <math>z_ji | \pi_j \sim \pi_j</math> | ||
+ | |||
+ | <math>\phi_k | H \sim H</math> | ||
+ | |||
+ | <math>x_{ji} | z_{ji}, (\phi_k)_{k=1}^{\infty} \sim F(\phi_{z_{ji}})</math>. | ||
+ | |||
+ | After some derivations, the relation between weights and <math>\beta</math> is: | ||
+ | |||
+ | <math>\frac{1}{1-\sum_{l=1}^{k-1} \pi_{jl}} (\pi_{jk}, \sum_{l=k+1}^{\infty} \pi_{jl}) \sim Dir (\alpha_0 \beta_k, \alpha_0 \sum_{l=k+1}^{\infty} \beta_l)</math>. | ||
+ | |||
+ | |||
+ | * Chinese restaurant franchise | ||
+ | |||
+ | [[File:Crf.jpg | 200 px]] | ||
+ | |||
+ | The restaurants correspond to groups and the customers correspond to the factors <math>\theta_{ji}</math>. Let <math>\phi_1,..., \phi_K</math> denote K i.i.d. random variables distributed according to H; this is the global menu of dishes. Vairables <math>\psi_{jt}</math> represent the table-specific choice of dishes; particular, <math>\psi_{jt}</math> is the dish served at table t and restaurant j. Use notation <math>n_{jtk}</math> to denote the number of maintain counts of customers and counts of tables. Then, | ||
+ | |||
+ | <math>\theta_{ji} | \theta_{j1},..., \theta_{j,i-1}, \alpha_0, G_0 \sim \sum_{t=1}^{m_{j.}} \frac{n_{jt.}}{i-1+\alpha_0} \delta_{\psi_{jt}} + \frac{\alpha_0}{i-1+\alpha_0} G_0</math>, | ||
+ | |||
+ | <math>\psi_{jt} | \psi_{11},\psi_{12},...,\psi_{21},..., \psi_{j,t-1}, \gamma, H \sim \sum_{k=1}^{K} \frac{m_{.k}}{m_{..}+\gamma} \delta_{\phi_{k}} + \frac{\gamma}{m_{..}+\gamma} H</math> | ||
+ | |||
+ | == MCMC algorithm for posterior sampling in the Chinese restaurant franchise== | ||
+ | |||
+ | The Chinese restaurant franchise can yield a Gibbs sampling scheme for posterior sampling given observations x. Rather than dealing with the <math>\theta_{ji}</math>'s and <math>\psi_{jt}</math>'s, we shall sample their index variables <math>t_{ji}</math> and <math>k_jt</math> instead. | ||
+ | |||
+ | |||
+ | * '''Sampling t.''' The prior probability that <math>t_{ji}</math> takes on a particular previously used value t is proportional to <math>n_{jt.}^{-ji}</math>, whereas the probability that it takes on a new value (say <math>t^{new} = m_{j.} + 1</math>) is proportional to <math>\alpha_0</math>. The likelihood due to <math>x_{ji}</math> given <math>t_{ji} = t</math> for some previously used t is <math>f_k^{-x_{ji}} (x_ji)</math>, then | ||
+ | |||
+ | [[File:hdp_eq1.png]] | ||
+ | |||
+ | If the sampled value of <math>t_ji</math> is <math>t^{new}</math>, we obtain a sample of <math>k_{jt^{new}}</math> by sampling from | ||
+ | |||
+ | [[File:hdp_eq2.png]] | ||
+ | |||
+ | If as a result of updating <math>t_{ji}</math> some table t becomes unoccupied, i.e., <math>n_{jt.} = 0</math>, then the probability that this table will be reoccupied in the future will be zero, since this is always proportional to <math>n_{jt.}</math>. As a result, we may delete the corresponding <math>k_{jt}</math> from the data structure. If as a result of deleting | ||
+ | <math>k_{jt}</math> some mixture component k becomes unallocated, we delete this mixture component as well. | ||
+ | |||
+ | * '''Sampling k'''. Since changing <math>k_{jt}</math> actually changes the component membership of all data items in table t, the likelihood obtained by setting <math>k_{jt} = k</math> is given by <math>f_{k}^{-x_{jt}}(x_{jt})</math>, so that the conditional | ||
+ | probability of <math>k_{jt}</math> is | ||
+ | |||
+ | [[File:hdp_eq3.png]] | ||
== Data == | == Data == | ||
+ | [[UsesDataset:: Nematode biology abstracts]] | ||
+ | |||
+ | [[UsesDataset:: NIPS 1988-1999]] | ||
+ | |||
+ | [[UsesDataset:: Alice’s Adventures in Wonderland by Lewis Carroll]] | ||
+ | |||
+ | == Related Papers == | ||
+ | |||
+ | [[RelatedPaper::Aldous, D. (1985), “Exchangeability and Related Topics,” in E´cole d’E´te´ de Probabilite´s de Saint-Flour XIII–1983, Springer, Berlin, pp. 1–198]]. | ||
− | + | [[RelatedPaper::Antoniak, C. (1974), “Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems,” Annals of Statistics, 2(6), pp. 1152–1174]]. |
Latest revision as of 22:08, 2 April 2011
Contents
Citation
Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 2006
Online version
Summary
This paper proposed a nonparametric Bayes approach to decide the number of mixture components in grouped data, the basic idea is:
- Develop analogs for the Hierarchical Dirichlet process with representations of both a stick-breaking and a "Chinese restaurant franchise”.
Methodology
A hierarchical Dirichlet process is a distribution over a set of random probability measures over . The process defines a set of random probability measures , one for each group, and a global random probability measure . The global measure is distributed as a Dirichlet process with concentration parameter and base probability measure H:
and the random measures are conditionally independent given G0, with distributions given by a Dirichlet process with base probability measure :
.
A hierarchical Dirichlet process can be used as the prior distribution over the factors for grouped data. For each j let be i.i.d. random variables distributed as . Each is a factor corresponding to a single observation . The likelihood is given by:
.
The hierarchical Dirichlet process can readily be extended to more than two levels. That is, the base measure H can itself be a draw from a DP, and the hierarchy can be extended for as many levels as are deemed useful.
- The stick-breaking construction
Given that the global measure is distributed as a Dirichlet process, it can be expressed using a stick-breaking representation:
where independently and are mutually independent. Since has support at the points , each necessarily has support at these points as well, and can thus be written as:
Let . Note that the weights are independent given (since the are independent given ). These weights are related to the global weights .
An equivalent representation of the hierarchical Dirichlet process mixture can be:
.
After some derivations, the relation between weights and is:
.
- Chinese restaurant franchise
The restaurants correspond to groups and the customers correspond to the factors . Let denote K i.i.d. random variables distributed according to H; this is the global menu of dishes. Vairables represent the table-specific choice of dishes; particular, is the dish served at table t and restaurant j. Use notation to denote the number of maintain counts of customers and counts of tables. Then,
,
MCMC algorithm for posterior sampling in the Chinese restaurant franchise
The Chinese restaurant franchise can yield a Gibbs sampling scheme for posterior sampling given observations x. Rather than dealing with the 's and 's, we shall sample their index variables and instead.
- Sampling t. The prior probability that takes on a particular previously used value t is proportional to , whereas the probability that it takes on a new value (say ) is proportional to . The likelihood due to given for some previously used t is , then
If the sampled value of is , we obtain a sample of by sampling from
If as a result of updating some table t becomes unoccupied, i.e., , then the probability that this table will be reoccupied in the future will be zero, since this is always proportional to . As a result, we may delete the corresponding from the data structure. If as a result of deleting some mixture component k becomes unallocated, we delete this mixture component as well.
- Sampling k. Since changing actually changes the component membership of all data items in table t, the likelihood obtained by setting is given by , so that the conditional
probability of is
Data
Alice’s Adventures in Wonderland by Lewis Carroll