Latent Dirichlet Allocation

Latent Dirichlet Allocation (or LDA) is an unsupervised, generative method for determining the topics that produced a text. The idea is that there are K topics in total and each document was produced by some of these topics. Each topic specifies a unigram distribution over the words it can produce.

LDA is used extensively for clustering. One downside is that the number of topics must be set beforehand. If the number of topics is unknown, it may be better to use something like Brown clustering.

Method

This is mostly taken straight from Wikipedia, and we use its notation:

alpha is the hyperparameter for the Dirichlet prior on the per-document topic distributions
beta is the hyperparameter for the Dirichlet prior on the per-topic word distributions
theta_i is the topic distribution for a specific document i
phi_k is the word distribution for topic k
z_ij is the topic for the jth word in the ith document
w_ij is the jth word in the ith document

The topics are hidden (latent): only the words w_ij are observed, and we want to estimate both the topic assignments and each topic's unigram distribution.

The process is as follows:

theta_i is chosen from Dir(alpha) for each document i
phi_k is chosen from Dir(beta) for each topic k
For each word position (i, j), the topic z_ij is chosen from Multinomial(theta_i) and the word w_ij is chosen from Multinomial(phi_(z_ij))
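The generative process above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation, and it only samples a synthetic corpus; it does not do inference. It makes several assumptions not stated in the text: the priors are symmetric (alpha and beta are single scalars), every document has the same fixed length, and words are represented by integer ids. The function names (sample_dirichlet, sample_categorical, generate_corpus) are made up for this example.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one sample from Dir(concentration) by normalizing Gamma(a, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw a single index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_length, num_topics, vocab_size,
                    alpha, beta, seed=0):
    """Sample (theta, phi, z, w) from the LDA generative process.

    Assumptions for this sketch: symmetric scalar priors alpha and beta,
    a fixed document length, and integer word ids in range(vocab_size).
    """
    rng = random.Random(seed)

    # phi_k ~ Dir(beta): one word distribution per topic.
    phi = [sample_dirichlet([beta] * vocab_size, rng)
           for _ in range(num_topics)]

    theta, z, w = [], [], []
    for _ in range(num_docs):
        # theta_i ~ Dir(alpha): the topic distribution of this document.
        theta_i = sample_dirichlet([alpha] * num_topics, rng)
        # z_ij ~ Multinomial(theta_i): a topic for each word position.
        z_i = [sample_categorical(theta_i, rng) for _ in range(doc_length)]
        # w_ij ~ Multinomial(phi_(z_ij)): a word drawn from the chosen topic.
        w_i = [sample_categorical(phi[k], rng) for k in z_i]
        theta.append(theta_i)
        z.append(z_i)
        w.append(w_i)
    return theta, phi, z, w
```

For example, generate_corpus(3, 8, 4, 10, alpha=0.5, beta=0.1) returns three documents of eight word ids each, together with the topic assignments and distributions that produced them. Estimating theta, phi and z from observed words alone is the inference problem LDA actually solves, and it is not covered by this sketch.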

Sources

LDA from Wikipedia
Linguistic Structured Prediction page 127
Original paper on LDA: Blei et al., 2003
