# Topic Modeling: Beyond Bag-of-Words

This is a paper discussed in Social Media Analysis 10-802 in Spring 2011.

## Citation

Hanna M. Wallach. Topic Modeling: Beyond Bag-of-Words. ICML 2006.

## Summary

In the text analysis community, methods are broadly two-fold: language-modeling approaches that rely on n-gram statistics, and the more recently emerged topic models, which use the 'bag-of-words' assumption that word order does not matter. This work combines the two by proposing a hierarchical generative probabilistic model.

## Methodology

To develop a bigram language model, marginal and conditional word counts are determined from a corpus, e.g. ${\displaystyle f_{i}={\frac {N_{i}}{N}}}$ and ${\displaystyle f_{i|j}={\frac {N_{i|j}}{N_{j}}}}$. Since only a limited number of the possible word combinations are seen in practice, the bigram estimator is often smoothed by the marginal frequency estimator:

${\displaystyle P(w_{t}=i|w_{t-1}=j)=\lambda f_{i}+(1-\lambda )f_{i|j}}$
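The interpolation above can be sketched from raw counts as follows (a minimal illustration; the corpus, tokenization, and the fixed value of `lam` are my own assumptions, and in practice the interpolation weight is tuned):

```python
from collections import Counter

def smoothed_bigram(corpus, lam=0.5):
    """Interpolated bigram estimator: P(i|j) = lam * f_i + (1 - lam) * f_{i|j}.

    `corpus` is a list of token lists; `lam` is the interpolation weight
    (fixed here for illustration; normally tuned on held-out data).
    """
    unigram = Counter()   # N_i
    bigram = Counter()    # N_{i|j}, keyed by (j, i)
    context = Counter()   # N_j, counting j as a bigram context
    for doc in corpus:
        unigram.update(doc)
        for j, i in zip(doc, doc[1:]):
            bigram[(j, i)] += 1
            context[j] += 1
    total = sum(unigram.values())  # N

    def prob(i, j):
        f_i = unigram[i] / total
        f_ij = bigram[(j, i)] / context[j] if context[j] else 0.0
        return lam * f_i + (1 - lam) * f_ij

    return prob

p = smoothed_bigram([["the", "cat", "sat"], ["the", "dog", "sat"]], lam=0.5)
```

Note that unseen bigrams still receive nonzero probability through the marginal term, which is the point of the smoothing.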

Next is the widely applied LDA model, which makes the bag-of-words assumption when modeling text. This is computationally efficient but not realistic, since word order matters in applications such as text compression and speech recognition.

### Bigram Topic Model

The authors extend the LDA model by incorporating a notion of word order. Each word is generated from a conditional distribution ${\displaystyle P(w_{t}=i|w_{t-1}=j,z_{t}=k)}$ that depends not only on the topic but also on the previous word.

Now the likelihood becomes:

${\displaystyle p({\bar {w}},{\bar {z}}|\Phi ,\Theta )=\prod _{i}\prod _{j}\prod _{k}\prod _{d}\phi _{i|j,k}^{N_{i|j,k}}\theta _{k|d}^{N_{k|d}}}$

${\displaystyle N_{i|j,k}}$ is the number of times word i has been assigned topic k when preceded by word j.
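Given the count arrays, the log of the joint likelihood above is a simple sum; a sketch (the array layout and the tiny uniform example are my own, not from the paper):

```python
import math
import numpy as np

def log_likelihood(N_jki, N_dk, phi, theta):
    """Log of the joint likelihood p(w, z | Phi, Theta) (a sketch).

    N_jki[j, k, i] is N_{i|j,k}: times word i follows word j under topic k.
    N_dk[d, k]     is N_{k|d}:   times topic k is used in document d.
    phi[j, k]      is a distribution over next words i; theta[d] over topics k.
    """
    return float(np.sum(N_jki * np.log(phi)) + np.sum(N_dk * np.log(theta)))

# Tiny example: V=2 words, T=1 topic, D=1 document, uniform parameters.
phi = np.full((2, 1, 2), 0.5)
theta = np.full((1, 1), 1.0)
N_jki = np.array([[[1, 1]], [[1, 1]]])  # four word occurrences in total
N_dk = np.array([[4]])
ll = log_likelihood(N_jki, N_dk, phi, theta)
```

Each count simply exponentiates the corresponding parameter, so the log-likelihood is the count-weighted sum of log parameters.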

Prior over ${\displaystyle \Theta }$: the same as in LDA.

Prior over ${\displaystyle \Phi }$: more involved, since there are now many additional (previous word, topic) contexts. The authors propose two priors:

Prior 1: a single hyperparameter vector ${\displaystyle \beta m}$ shared between all j,k contexts:

${\displaystyle P(\phi |\beta m)=\prod _{j}\prod _{k}Dirichlet(\phi _{j,k}|\beta m)}$

Prior 2: T hyperparameter vectors, one for each topic k:

${\displaystyle P(\phi |{\beta _{k}m_{k}})=\prod _{j}\prod _{k}Dirichlet(\phi _{j,k}|\beta _{k}m_{k})}$

Having defined all distributions, the generative process becomes:

• For each topic k and previous word j:
  • Draw ${\displaystyle \phi _{j,k}}$ from the prior over ${\displaystyle \Phi }$
• For each document d in the corpus:
  • Draw a topic mixture ${\displaystyle \theta _{d}}$ for document d
  • For each position t in document d:
    • Draw a topic ${\displaystyle z_{t}}$ ~ Multi(${\displaystyle \theta _{d}}$)
    • Draw a word ${\displaystyle w_{t}}$ ~ Multi(${\displaystyle \phi _{w_{t-1},z_{t}}}$), conditioned on the previous word ${\displaystyle w_{t-1}}$ and the topic ${\displaystyle z_{t}}$
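The generative process above can be sketched in a few lines (the vocabulary size, topic count, hyperparameter values, and start word are all illustrative assumptions, not values from the paper; Prior 1, the shared symmetric Dirichlet, is used here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (hypothetical, not from the paper).
V, T, alpha, beta = 5, 3, 0.5, 0.1

# Prior 1: a single symmetric Dirichlet shared across all (j, k) contexts.
# phi[j, k] is a distribution over the next word given previous word j, topic k.
phi = rng.dirichlet(np.full(V, beta), size=(V, T))

def generate_document(length, start_word=0):
    """Sketch of the bigram topic model's generative process for one document."""
    theta = rng.dirichlet(np.full(T, alpha))  # topic mixture for this document
    words, prev = [], start_word
    for _ in range(length):
        z = rng.choice(T, p=theta)            # draw topic z_t ~ Multi(theta_d)
        w = rng.choice(V, p=phi[prev, z])     # draw word w_t ~ Multi(phi_{w_{t-1}, z_t})
        words.append(int(w))
        prev = w
    return words

doc = generate_document(10)
```

The only difference from LDA's generative process is the last draw: the word distribution is indexed by the previous word as well as the topic.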

Inference is based on an EM algorithm; the bulky math is omitted here. For details, please refer to the paper.

## Experiments

Datasets: The models were compared on two datasets: 150 abstracts from the Psychological Review Abstracts data, and 150 newsgroup postings drawn at random from the 20 Newsgroups data. In each dataset, 100 documents are used for parameter inference and the remaining 50 for evaluation.

The authors test from 1 topic to 120 topics, with 400 sampling iterations. Both new models outperform LDA, with prior 2 achieving the minimum perplexity. It is also worth noting that a larger number of topics helps prior 2, while LDA's perplexity is stationary after the first 20 topics.
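For reference, the perplexity used in such comparisons is computed from held-out log-likelihoods (this is the standard definition, not code from the paper):

```python
import math

def perplexity(log_probs, n_tokens):
    """Perplexity over held-out text: exp(-(1/N) * sum of token log-probabilities).

    Lower is better; a perplexity of P means the model is, on average,
    as uncertain as a uniform choice among P tokens.
    """
    return math.exp(-sum(log_probs) / n_tokens)

# e.g. four tokens, each assigned probability 1/4, give perplexity 4
pp = perplexity([math.log(0.25)] * 4, 4)
```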