Difference between revisions of "Blei et al Latent Dirichlet Allocation"

Revision as of 00:20, 2 October 2012

Citation

author = {Blei, David M. and Ng, Andrew Y. and Jordan, Michael I.},
title = {Latent dirichlet allocation},
journal = {J. Mach. Learn. Res.},
issue_date = {3/1/2003},
volume = {3},
month = mar,
year = {2003},
issn = {1532-4435},
pages = {993--1022},
numpages = {30},
url = {http://dl.acm.org/citation.cfm?id=944919.944937},
acmid = {944937},
publisher = {JMLR.org}

Online Version

Latent Dirichlet Allocation

Summary

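This paper addresses the problem of document modeling: finding a compact probabilistic description of each document in a corpus that preserves the statistical structure needed for tasks such as classification and collaborative filtering.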

LDA

LDA is a generative probabilistic model for collections of discrete data such as text corpora. It is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying (latent) set of topics, where each topic is characterized by a distribution over words. Each document $d$ is assumed to be generated using the following process:

 1. Choose the number of words  $N_{d}$  in the document by drawing from a Poisson( $\xi$ ) distribution
 2. Choose the topic probabilities  $\theta _{d}$  from a Dirichlet( $\alpha$ ) distribution
 3. For each of the  $N_{d}$  words  $w_{d,n}$ 
   a. Choose a topic  $z_{d,n}$  from a Multinomial( $\theta _{d}$ ) distribution
   b. Choose a word  $w_{d,n}$  from p( $w_{d,n}|z_{d,n},\beta$ ), which is a multinomial distribution conditioned on the topic  $z_{d,n}$

The parameters $\alpha$ and $\beta$ are corpus-level parameters, assumed to be sampled once in the process of generating a corpus. The variables $\theta _{d}$ are document-level variables, sampled once per document. Finally, the variables $z_{d,n}$ and $w_{d,n}$ are word-level variables and are sampled once for each word in each document.
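The generative process above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name `generate_document`, the toy values of `alpha`, `beta` and `xi`, and the choice of NumPy are all assumptions made for the example. Here `beta` is taken to be a K x V matrix whose row k is the word distribution of topic k.

```python
import numpy as np


def generate_document(alpha, beta, xi, rng):
    """Sample one document following the LDA generative process.

    alpha: (K,) Dirichlet parameter (corpus level)
    beta:  (K, V) matrix, row k = word distribution of topic k (corpus level)
    xi:    mean of the Poisson distribution over document length
    """
    n_topics, vocab_size = beta.shape
    # Step 1: document length N_d ~ Poisson(xi)
    n_words = rng.poisson(xi)
    # Step 2: topic probabilities theta_d ~ Dirichlet(alpha), once per document
    theta = rng.dirichlet(alpha)
    # Step 3a: one topic z_{d,n} ~ Multinomial(theta_d) per word
    topics = rng.choice(n_topics, size=n_words, p=theta)
    # Step 3b: one word w_{d,n} ~ p(w | z_{d,n}, beta) per word
    words = np.array(
        [rng.choice(vocab_size, p=beta[z]) for z in topics], dtype=int
    )
    return theta, topics, words


# Toy setup (hypothetical values): 2 topics over a 4-word vocabulary.
rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5])
beta = np.array([[0.5, 0.3, 0.1, 0.1],
                 [0.1, 0.1, 0.3, 0.5]])
theta, topics, words = generate_document(alpha, beta, xi=8, rng=rng)
print("theta:", theta)
print("topics:", topics)
print("words:", words)
```

Note how the sampling frequency mirrors the three levels of the model: `alpha` and `beta` are fixed for the whole corpus, `theta` is drawn once per call (per document), and a topic and a word are drawn once per word position.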

Inference

The posterior distribution of the hidden variables given a document is intractable. Efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation are provided.
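To make the source of the intractability concrete, the posterior for a single document can be written out as follows. The per-document topic probabilities are written here as $\theta_{d}$, and the bold symbols $\mathbf{w}_{d}$ and $\mathbf{z}_{d}$ (all words and all topic assignments of document $d$) are shorthand introduced for this note.

$$p(\theta_{d}, \mathbf{z}_{d} \mid \mathbf{w}_{d}, \alpha, \beta) = \frac{p(\theta_{d}, \mathbf{z}_{d}, \mathbf{w}_{d} \mid \alpha, \beta)}{p(\mathbf{w}_{d} \mid \alpha, \beta)}$$

$$p(\mathbf{w}_{d} \mid \alpha, \beta) = \int p(\theta_{d} \mid \alpha) \left( \prod_{n=1}^{N_{d}} \sum_{z_{d,n}} p(z_{d,n} \mid \theta_{d}) \, p(w_{d,n} \mid z_{d,n}, \beta) \right) d\theta_{d}$$

The numerator is easy to evaluate, but the normalizing constant in the second line couples $\theta_{d}$ and $\beta$ inside the product of sums over topics, so the integral has no closed form.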

The basic idea is to make use of Jensen’s inequality to obtain an adjustable lower bound on the log likelihood. A family of lower bounds, indexed by a set of variational parameters, is considered and the variational parameters are chosen by an optimization procedure that attempts to find the tightest possible lower bound. It leads to the following iterative EM algorithm:

 1. E step: For each document, find the optimizing values of the variational parameters
 2. M step: Maximize the resulting lower bound on the log likelihood with respect to the model parameters  $\alpha ,\beta$
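The two steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the names `e_step` and `m_step_beta` are hypothetical, and the variational parameters are taken to be a Dirichlet parameter `gamma` (one per document) and multinomial parameters `phi` (one row per word), updated with the coordinate-ascent rules given in the paper. The update of $\alpha$, which the paper handles with a Newton-Raphson method, is left out, so $\alpha$ is held fixed here.

```python
import numpy as np
from scipy.special import digamma


def e_step(word_ids, alpha, beta, n_iter=50, tol=1e-6):
    """Variational E step for one document.

    Returns gamma (K,) and phi (N, K), the variational parameters that
    tighten the lower bound for this document.
    """
    word_ids = np.asarray(word_ids, dtype=int)
    n_topics = alpha.shape[0]
    n_words = len(word_ids)
    # Initialization: uniform phi, gamma = alpha + N / K
    phi = np.full((n_words, n_topics), 1.0 / n_topics)
    gamma = alpha + n_words / n_topics
    for _ in range(n_iter):
        # phi[n, k] is proportional to beta[k, w_n] * exp(digamma(gamma[k]));
        # computed in log space, the tiny constant only guards against log(0)
        log_phi = np.log(beta[:, word_ids].T + 1e-100) + digamma(gamma)
        log_phi -= log_phi.max(axis=1, keepdims=True)
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma[k] = alpha[k] + sum over words of phi[n, k]
        new_gamma = alpha + phi.sum(axis=0)
        converged = np.max(np.abs(new_gamma - gamma)) < tol
        gamma = new_gamma
        if converged:
            break
    return gamma, phi


def m_step_beta(docs, phis, vocab_size):
    """M step for beta only: beta[k, v] is proportional to the expected
    number of times word v is assigned to topic k across the corpus."""
    n_topics = phis[0].shape[1]
    counts = np.zeros((vocab_size, n_topics))
    for word_ids, phi in zip(docs, phis):
        # Unbuffered add so repeated words in a document accumulate correctly
        np.add.at(counts, np.asarray(word_ids, dtype=int), phi)
    return counts.T / counts.sum(axis=0)[:, None]


# Toy corpus (hypothetical values): 2 topics, 4-word vocabulary, 2 documents.
alpha = np.array([0.5, 0.5])
beta = np.array([[0.5, 0.3, 0.1, 0.1],
                 [0.1, 0.1, 0.3, 0.5]])
docs = [np.array([0, 0, 1, 3]), np.array([2, 3, 3])]
for _ in range(10):  # a few EM iterations with alpha held fixed
    results = [e_step(doc, alpha, beta) for doc in docs]
    beta = m_step_beta(docs, [phi for _, phi in results], vocab_size=4)
print(beta)
```

Each EM iteration runs the E step independently per document and then pools the resulting `phi` values to re-estimate `beta`, which is why the E step parallelizes naturally over documents.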

Experiments

LDA is empirically evaluated in several problem domains: document modeling, document classification, and collaborative filtering.

Study Plan

1. [http://en.wikipedia.org/wiki/Mixture_model Mixture models]
2. [www.cs.brown.edu/~th/papers/Hofmann-SIGIR99.pdf Probabilistic Latent Semantic Indexing]
3. [http://en.wikipedia.org/wiki/Variational_Bayesian_methods Variational Bayesian Methods]
4. [http://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf Variational Inference lecture pdf by Blei]
