Blei et al Latent Dirichlet Allocation
Citation
Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, March 2003. JMLR.org. http://dl.acm.org/citation.cfm?id=944919.944937
Online Version
Summary
This paper addresses the problem of document modeling. It introduces Latent Dirichlet Allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora, together with efficient approximate inference and parameter estimation algorithms.
LDA
LDA is a generative probabilistic model for discrete data such as text corpora. It is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying (latent) set of topics, and each topic is in turn characterized by a distribution over words. Each document <math>d</math> in the corpus is assumed to be generated by the following process:
1. The author chooses the number of words <math>N_d</math> in the document by drawing from a Poisson(<math>\xi</math>) distribution.
2. A topic-mixture vector <math>\theta_d</math> is then drawn from a Dirichlet(<math>\alpha</math>) distribution; it parametrizes a Multinomial(<math>\theta_d</math>) topic generator for the document.
3. For each of the <math>N_d</math> word positions <math>n</math>:
* A topic <math>z_{d,n}</math> is chosen from the Multinomial(<math>\theta_d</math>) distribution.
* The word <math>w_{d,n}</math> is then drawn from the topic-specific word distribution <math>p(w_{d,n} \mid z_{d,n}, \beta)</math>.
The parameters <math>\alpha</math> and <math>\beta</math> are corpus-level parameters, assumed to be sampled once in the process of generating the corpus. The variable <math>\theta_d</math> is a document-level variable, sampled once per document. Finally, the variables <math>z_{d,n}</math> and <math>w_{d,n}</math> are word-level variables, sampled once for each word in each document. A small simulation of this generative process is sketched below.
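The following is a minimal sketch of the generative process in Python; it is not code from the paper, and the vocabulary size, number of topics, and hyperparameter values are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (not taken from the paper).
V = 1000      # vocabulary size
K = 10        # number of topics
xi = 50       # Poisson parameter for document length
alpha = np.full(K, 0.1)                          # Dirichlet prior over topic mixtures
beta = rng.dirichlet(np.full(V, 0.01), size=K)   # topic-word distributions, shape (K, V)

def generate_document():
    """Sample one document (a list of word ids) from the LDA generative process."""
    N_d = rng.poisson(xi)                # 1. choose the document length
    theta_d = rng.dirichlet(alpha)       # 2. draw the document's topic mixture
    words = []
    for _ in range(N_d):                 # 3. for each word position
        z = rng.choice(K, p=theta_d)     #    choose a topic from Multinomial(theta_d)
        w = rng.choice(V, p=beta[z])     #    draw a word from that topic's distribution
        words.append(w)
    return words

corpus = [generate_document() for _ in range(5)]
</syntaxhighlight>

Here <math>\beta</math> is drawn at random only so the sketch has something to generate from; in the paper it is a corpus-level parameter to be estimated (the smoothed variant of LDA additionally places a Dirichlet prior on its rows).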
Inference
The posterior distribution of the hidden variables given a document is intractable to compute exactly. The paper therefore develops efficient approximate inference techniques based on variational methods, together with an EM algorithm for empirical Bayes parameter estimation.
The basic idea is to use Jensen's inequality to obtain an adjustable lower bound on the log likelihood. A family of lower bounds, indexed by a set of variational parameters, is considered, and the variational parameters are chosen by an optimization procedure that attempts to find the tightest possible lower bound; the bound is written out below.
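Concretely, for a single document with words <math>w_1, \dots, w_N</math>, a fully factorized variational distribution with Dirichlet parameter <math>\gamma</math> and multinomial parameters <math>\phi_n</math> is introduced, and Jensen's inequality gives (in the paper's notation):

<math>q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)</math>

<math>\log p(\mathbf{w} \mid \alpha, \beta) \;\ge\; E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z})] \;=\; L(\gamma, \phi; \alpha, \beta)</math>

Maximizing this bound alternately over the variational parameters and the model parameters leads to the following iterative (variational) EM algorithm: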
1. E step: For each document <math>d</math>, find the optimizing values of the variational parameters <math>\gamma_d^*</math> and <math>\phi_d^*</math> (a sketch of these per-document updates is given below).
2. M step: Maximize the resulting lower bound on the log likelihood with respect to the model parameters <math>\alpha</math> and <math>\beta</math>.
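A minimal sketch of the per-document E step, assuming <math>\alpha</math> and <math>\beta</math> are fixed: it implements the paper's coordinate-ascent updates <math>\phi_{ni} \propto \beta_{i w_n} \exp(\Psi(\gamma_i))</math> and <math>\gamma_i = \alpha_i + \sum_n \phi_{ni}</math>, but the function and variable names (and the fixed iteration count in place of a convergence test) are illustrative choices.

<syntaxhighlight lang="python">
import numpy as np
from scipy.special import digamma

def e_step(doc_words, alpha, beta, n_iter=50):
    """Variational E step for one document.

    doc_words : sequence of word ids (length N)
    alpha     : Dirichlet hyperparameter, shape (K,)
    beta      : topic-word probabilities, shape (K, V)
    Returns (gamma, phi): Dirichlet parameters (K,) and per-word
    topic responsibilities (N, K).
    """
    N, K = len(doc_words), len(alpha)
    phi = np.full((N, K), 1.0 / K)       # initialize responsibilities uniformly
    gamma = alpha + N / K                # initialization used in the paper
    for _ in range(n_iter):
        # phi_{ni} proportional to beta_{i, w_n} * exp(digamma(gamma_i))
        log_phi = np.log(beta[:, doc_words].T + 1e-100) + digamma(gamma)
        log_phi -= log_phi.max(axis=1, keepdims=True)    # numerical stability
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)            # normalize over topics
        # gamma_i = alpha_i + sum_n phi_{ni}
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi
</syntaxhighlight>

In the full variational EM, the per-document statistics <math>\phi</math> are accumulated in the M step to re-estimate <math>\beta</math>, and <math>\alpha</math> is updated by a Newton-Raphson procedure.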
Experiments
LDA is empirically evaluated in several problem domains: document modeling, document classification, and collaborative filtering.
Study Plan
2. Probabilistic Latent Semantic Indexing