Topic Modeling: Beyond Bag-of-Words

This is a [[Category::Paper]] discussed in Social Media Analysis 10-802 in Spring 2011.

== Citation ==

Hanna M. Wallach: Topic Modeling: Beyond Bag-of-Words. ICML 2006

== Online version ==

download here

== Summary ==

In the text analysis community, methods are roughly two-fold: either use n-gram statistics, as in language modeling, or use the more recently emerged topic models, which make a 'bag-of-words' assumption that word order does not matter. This work tries to combine both approaches by proposing a hierarchical generative probabilistic model.

== Methodology ==

To develop a bigram language model, marginal and conditional word counts are determined from the corpus as <math>f_i=\frac{N_i}{N}</math> and <math>f_{i|j} = \frac{N_{i|j}}{N_j}</math>. Since only a limited number of the possible word combinations are seen in practice, the bigram estimator is usually smoothed with the marginal frequency estimator:

<math>
P(w_t=i|w_{t-1}=j) = \lambda f_i + (1-\lambda)f_{i|j}
</math>
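
A minimal Python sketch of this interpolated estimator; the toy corpus and the value of <math>\lambda</math> are invented purely for the example:

<pre>
from collections import Counter

# Toy corpus and interpolation weight, invented for illustration.
corpus = "the cat sat on the mat the cat ate".split()
lam = 0.5

N = len(corpus)
unigram = Counter(corpus)                        # N_i
bigram = Counter(zip(corpus, corpus[1:]))        # N_{i|j}, keyed by (j, i)

def p_smoothed(word, prev):
    f_i = unigram[word] / N                      # marginal estimate f_i
    f_ij = bigram[(prev, word)] / unigram[prev]  # conditional estimate f_{i|j}
    return lam * f_i + (1 - lam) * f_ij          # lambda*f_i + (1-lambda)*f_{i|j}

print(p_smoothed("cat", "the"))                  # P(w_t = "cat" | w_{t-1} = "the")
</pre>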

The second building block is the widely used LDA model, which makes a bag-of-words assumption when modeling text. This is computationally efficient but not realistic, since word order matters in many applications such as text compression and speech recognition.

=== Bigram Topic Model ===

The authors extend the LDA model by incorporating a notion of word order: each word is generated from a distribution conditioned not only on its topic but also on the previous word,

<math>
P(w_t = i|w_{t-1}=j,z_t=k).
</math>
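
A minimal sketch of this generative step; the vocabulary size, topic count, and parameters below are invented for illustration, not taken from the paper:

<pre>
import numpy as np

V, K = 20, 3                                   # toy vocabulary size and number of topics
rng = np.random.default_rng(0)

# phi[j, k, i] = P(w_t = i | w_{t-1} = j, z_t = k); theta_d = topic proportions of one document
phi = rng.dirichlet(np.ones(V), size=(V, K))
theta_d = rng.dirichlet(np.ones(K))

def next_word(prev_word):
    z = rng.choice(K, p=theta_d)               # draw topic z_t from the document's theta
    w = rng.choice(V, p=phi[prev_word, z])     # draw w_t conditioned on both z_t and w_{t-1}
    return w, z
</pre>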

Now the likelihood becomes:

<math>
p(\bar{w}, \bar{z}|\Phi, \Theta) = \prod_i \prod_j \prod_k \prod_d \phi_{i|j,k}^{N_{i|j,k}} \theta_{k|d}^{N_{k|d}},
</math>

where <math>N_{i|j,k}</math> is the number of times word <math>i</math> has been assigned topic <math>k</math> when preceded by word <math>j</math>, and <math>N_{k|d}</math> is the number of times topic <math>k</math> has been used in document <math>d</math>.
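
A minimal sketch of evaluating (the log of) this likelihood from pre-tallied count arrays; the array names and shapes are illustrative assumptions, not from the paper:

<pre>
import numpy as np

# N_jki[j, k, i] : times word i was assigned topic k when preceded by word j  (N_{i|j,k})
# N_dk[d, k]     : times topic k was used in document d                       (N_{k|d})
# phi[j, k, i]   : P(w_t = i | w_{t-1} = j, z_t = k)
# theta[d, k]    : P(z_t = k | document d)
def log_likelihood(N_jki, N_dk, phi, theta):
    # Assumes phi and theta are strictly positive (true for Dirichlet draws).
    return np.sum(N_jki * np.log(phi)) + np.sum(N_dk * np.log(theta))
</pre>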

Prior over <math>\Theta</math>: the same Dirichlet prior as in LDA.

Prior over <math>\Phi</math>: more complicated, since there are now many more contexts (one word distribution for every previous-word/topic pair <math>j,k</math>). The author proposed two priors:

Prior 1: a single hyperparameter vector <math>\beta m</math> may be shared between all <math>j,k</math> contexts:

<math>
P(\Phi|\beta m) = \prod_j \prod_k Dirichlet(\phi_{j,k}|\beta m)
</math>
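
A minimal sketch of drawing <math>\Phi</math> under this first prior, where the single vector <math>\beta m</math> is shared by every <math>(j,k)</math> context; the sizes and hyperparameter values are invented for illustration:

<pre>
import numpy as np

V, K = 20, 3                                # toy vocabulary size and number of topics
beta = 2.0                                  # concentration parameter (assumed value)
m = np.full(V, 1.0 / V)                     # base measure, uniform over the vocabulary

rng = np.random.default_rng(0)
# One independent Dirichlet(beta * m) draw for every (previous word j, topic k) context.
phi = rng.dirichlet(beta * m, size=(V, K))  # phi[j, k] sums to 1 over the vocabulary
</pre>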