# Online Inference of Topics with Latent Dirichlet Allocation

This a Paper discussed in Social Media Analysis 10-802 in Spring 2011.

## Citation

Online Inference of Topics with Latent Dirichlet Allocation. Canini, Shi and Griffiths. AISTATS 2009

## Summary

Traditional LDA inference methods only work in 'batch' mode, meaning they have to run over an entire document collection after they have been observed. Some collections, however, grow over time due to their nature, such as streaming news text. This paper introduces online inference model for LDA.

## Methodology

As we all know in LDA model each document is represented as a mixture of topics with each topic z taking a different weight. I'll skip the introduction of LDA here since I've covered it in my last writeup.

The idea of batch Gibbs sampler, which was first introduced by Griffiths and Steyvers, is that it repeatedly sample the topic of each observed word according to its conditional distribution:

${\displaystyle P(z_{j}|z_{N\setminus j},w_{N})\propto {\frac {n_{z_{j},N\setminus j}^{w_{j}}+\beta }{n_{z_{j},N\setminus j}^{(.)}+W\beta }}{\frac {n_{z_{j},N\setminus j}^{d_{j}}+\alpha }{n_{.,N\setminus j}^{(d_{j})}+T\alpha }}}$

while ${\displaystyle \alpha ,\beta }$ are the hyperparameters and ${\displaystyle W,T}$ are the size of vocabulary and topics respectively

Gibbs sampling procedure converges to the desired posterior distribution.

### O-LDA

Song et al (2005) proposed a new algorithm to have online inference on newly arrived documents. First, they apply batch Gibbs sampler on part of the full dataset, then sample tpics of each new word i by conditioning on words observed so far:

${\displaystyle P(z_{i}|z_{i-1},w_{i})\propto {\frac {n_{z_{i},i\setminus i}^{(w_{i})}+\beta }{n_{z_{j},i\setminus i}^{(.)}+W\beta }}{\frac {n_{z_{i},i\setminus i}^{(d_{i})}+\alpha }{n_{.,i\setminus i}^{(d_{i})}+T\alpha }}}$

Notice the difference of the subscript of n, which indicates when sampling each word, it only depends on the observations so far.

The complete algorithm is as follows:

Algorithm o_LDA:

• sample ${\displaystyle z_{\sigma }}$ using batch Gibbs sampler
• for i=${\displaystyle \sigma }$ +1, ..., N do
• sample ${\displaystyle z_{i}}$ from ${\displaystyle P(z_{i}|z_{i-1},w_{i})}$

As authors point out, this algorithm never resamples old topic variables so its performance heavily depends on the accuracy of topics obtained during the batch training phase.

### Incremental Gibbs sampler

In order to fix the issue above, subsequently, authors proposed their version of incremental Gibbs sampling, which, after each step i, resamples the topics of some of the previous words. The topic assignment z_j of each index j in the "rejuvenation sequence" is obtained on its conditional distribution. It's a lot like the one used above:

${\displaystyle P(z_{j}|z_{i\setminus j},w_{i})\propto {\frac {n_{z_{j},i\setminus j}^{(w_{j})}+\beta }{n_{z_{j},i\setminus j}^{(.)}+W\beta }}{\frac {n_{z_{j},i\setminus j}^{(d_{j})}+\alpha }{n_{.,i\setminus j}^{(d_{j})}+T\alpha }}}$

Authors say if this rejuvenation steps are performed often enough this incremental Gibbs sampler closely approximate the posterior distribution ${\displaystyle P(z_{j}|w_{i})}$.

This sampler is actually an instance of decayed MCMC framework introduced by Marthi et al.(2002)

The complete algorithm is as follows:

Algorithm: incremental Gibbs sampler for LDA

• for i=1, ..., N do
• sample ${\displaystyle z_{i}}$ from ${\displaystyle P(z_{i}|z_{i-1},w_{i})}$
• for j in R(i) do
• sample ${\displaystyle z_{j}}$ from ${\displaystyle P(z_{j}|z_{i\setminus j},w_{i})}$

## Evaluation

### datasets

A collection of categorized document were used, which consist of 4 subsets from 20 newsgroup corpus so that evaluation of topics become possible.

### methodology

10% of the data is used for o-LDA's initialization to get a set of starting topics. Since o-LDA does this once and offline, o-LDA is the fastest algorithm among three. Authors evaluated the result by computing how well the implies topics match the true categories which are known. To be specific, normalized mutual information is used to measure the similarity

## Results

The result is show in figure below. Particle filter and incremental Gibbs sampler worked equally well. o-LDA is consistently outperformed by the other two, which is expected.