Latent Dirichlet Allocation
Latent Dirichlet Allocation (or LDA) is an unsupervised, generative method for determining the topics that produced a text. The idea is that there are K topics in total and each document was produced by a mixture of some of these topics. Each topic specifies a unigram distribution over the words it can produce.
LDA is used extensively for clustering documents by topic. One downside is that the number of topics K must be set beforehand. If the number of topics is unknown, it may be better to go with something like Brown clustering.
This will mostly be straight from Wikipedia. We'll use their notation:
- alpha is the hyperparameter for the Dirichlet prior on the per-document topic distributions
- beta is the hyperparameter for the Dirichlet prior on the per-topic word distributions
- theta_i is the topic distribution for a specific document i
- phi_k is the word distribution for topic k
- z_ij is the topic for the jth word in the ith document
- w_ij is the jth word in the ith document
The idea is that the topics are hidden: we observe only the words w_ij, and we want to estimate both the topic distribution theta_i of each document and the unigram distribution phi_k of each topic.
The generative process is as follows:
- For each document i, theta_i is chosen from Dir(alpha)
- For each topic k, phi_k is chosen from Dir(beta)
- For each word position j in document i, the topic z_ij is chosen from Multinomial(theta_i), and then the word w_ij is chosen from Multinomial(phi_(z_ij))
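To make the generative process concrete, here is a minimal sketch in Python with NumPy that samples a toy corpus. It makes several assumptions that are not part of the notes above: alpha and beta are treated as symmetric scalars, every document has the same fixed length, and words are represented as integer ids. The function name generate_corpus and all of its parameters are made up for illustration. Note that this only runs the model forward; it does not do inference (estimating theta and phi from observed words).

```python
import numpy as np

def generate_corpus(n_docs, doc_length, n_topics, vocab_size, alpha, beta, seed=0):
    """Sample a toy corpus from the LDA generative process.

    Assumptions (illustrative only): symmetric scalar alpha and beta,
    a fixed document length, and words encoded as integer ids.
    """
    rng = np.random.default_rng(seed)

    # phi[k] ~ Dir(beta): word distribution for topic k, shape (n_topics, vocab_size)
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)

    # theta[i] ~ Dir(alpha): topic distribution for document i, shape (n_docs, n_topics)
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)

    z = np.empty((n_docs, doc_length), dtype=int)  # z[i, j]: topic of word j in doc i
    w = np.empty((n_docs, doc_length), dtype=int)  # w[i, j]: word id of word j in doc i

    for i in range(n_docs):
        for j in range(doc_length):
            # z_ij ~ Multinomial(theta_i)
            z[i, j] = rng.choice(n_topics, p=theta[i])
            # w_ij ~ Multinomial(phi_(z_ij))
            w[i, j] = rng.choice(vocab_size, p=phi[z[i, j]])

    return theta, phi, z, w

# Example: 5 documents of 30 words each, 3 topics, a 20-word vocabulary.
theta, phi, z, w = generate_corpus(
    n_docs=5, doc_length=30, n_topics=3, vocab_size=20, alpha=0.5, beta=0.5
)
```

In practice only w would be observed; theta, phi and z are the hidden quantities that an inference procedure would try to recover.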