Difference between revisions of "Blei et al Latent Dirichlet Allocation"

From Cohen Courses
Jump to navigationJump to search
 
(10 intermediate revisions by the same user not shown)
Line 21: Line 21:
  
 
=== LDA ===
 
=== LDA ===
LDA is a generative probabilistic model for collections of discrete data such as text corpora. It is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying (latent) set of topics, where each topic is characterized by a distribution over words. Each document <math>d</math> is assumed to be generated using the following process:   
+
LDA is a generative probabilistic model for discrete data such as text corpora. It is a Bayesian model that consists of three hierarchies. Each item of the collection is modeled as a finite mixture i.e. modeled as being generated by an underlying (latent) set of topics, where each topic is characterized by a distribution over words. Each document <math>d</math> in the corpus is assumed to be generated using the following process:   
  
  1. Choose the number of words <math>N_d</math> in the document by drawing from a distribution Poisson(<math>\xi</math>)
+
1. The author chooses the number of words <math>N_d</math> in the document by drawing from a Poisson(<math>\xi</math>) distribution.
  2. Choose the topic probabilities <math>\theta_{d,n}</math> from a Dirichlet(<math>\alpha</math>) distribution
+
2. He then tosses a Dirichlet hypergenerator Dirichlet(<math>\alpha</math>) to get a <math>\theta_{d,n}</math> which is used to generate a Multinomial(<math>\theta_{d,n}</math>) topiv generator
  3. For each of the N words <math>w_{d,n}</math>
+
3. For each word <math>w_{d,n}</math> from the <math>N_d</math> words
    a. Choose a topic <math>z_{d,n}</math> from a Multinomial({<math>\theta{d,n}</math>) distrbution
+
  a. A topic <math>z_{d,n}</math> is chosen from a Multinomial(<math>\theta_{d,n}</math>) distribution
    b. Choose a word <math>w_{d,n}</math> from p(<math>w_{d,n} | z_{d,n}, \beta</math>) which is a multinomial distribution conditioned on the topic <math>z_{d,n}</math>
+
  b. A topic specific word generator parametrized by <math>z_{d,n}</math> and <math>\beta</math> is then tossed to get the word
  
The parameters <math>\alpha</math> and <math>\beta</math> are corpus level parameters, assumed to be sampled once in the process of generating a corpus. The variables <math>\theta_{d,n}</math> are document-level variables, sampled once per document. Finally, the variables <math>z_{d,n}</math> and <math>w_{d,n}</math> are word-level variables and are sampled once for each word in each document.
+
The parameters <math>\alpha</math> and <math>\beta</math> are corpus level parameters and are sampled only once in the process of generating a corpus. The variables <math>\theta_{d,n}</math> are sampled once per document. Finally, the variables <math>z_{d,n}</math> and <math>w_{d,n}</math> are word-level variables and are sampled once for each word in each document. What makes LDA unique is that it consists of three levels, and notably the topic node is sampled repeatedly within a document. This allows documents to be associated with multiple topics rather than just one.  
  
 
=== Inference ===  
 
=== Inference ===  
The posterior distribution of the hidden variables given a document is intractable. Efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation are provided.
+
The posterior distribution of the hidden variables given a document is, in general, intractable. However, many efficient approximate inference techniques can be used to estimate the posterior. The paper describes a convexity-based variational method involving EM algorithm for Bayes parameter estimation.
  
The basic idea is to make use of Jensen’s inequality to obtain an adjustable lower bound on the log likelihood. A family of lower bounds, indexed by a set of variational parameters, is considered and the variational parameters are chosen by an optimization procedure that attempts to find the tightest possible lower bound. It leads to the following iterative EM algorithm
+
The basic idea is to obtain a lower bound on the log likelihood parametrized by the variational parameters using the Jensen’s inequality. The variational parameters are chosen by an optimization procedure that attempts to find the tightest possible lower bound. The authors show that this requires choosing the parameters to minimize the KL divergence between the distribution under the variational parameters and the true posterior. This leads to a pair of interdependent update equations which can be solved via an iterative fixed-point method. 
  1. E step: For each document, find the optimizing values of the variational parameters
 
  2. M step: Maximize resulting lower bound on the log likelihood with respect to the model parameters <math>\alpha, \beta</math>
 
  
=== Experiments ===
+
=== Parameter Estimation ===
LDA is empirically evaluated in several problem domains -- document modeling, document classification, and collaborative filtering.
+
We now need to estimate the parameters <math>\alpha</math> and <math>\beta</math> of the LDA model. An empirical Bayes method for parameter estimation is provided. Given a corpus of documents D, we wish to find parameters <math>\alpha</math> and <math>\beta</math> that maximize the (marginal) log likelihood of the data:
  
== Study Plan ==
+
<math>l(\alpha, \beta) = \sum_{d \in D} log(p(d | \alpha, \beta)</math>
1. [[http://en.wikipedia.org/wiki/Mixture_model Mixture models]]
+
 
 +
This is again not directly solvable. However, variational inference technique provides a tractable lower bound on the log likelihood/ which can be maximized with respect to <math>\alpha</math> and <math>\beta</math>. Thus an alternating variational EM procedure can be used to solve the above:
 +
# '''E step''': For each document, find the optimizing values of the variational parameters
 +
# '''M step''': For fixed values of variational parameters, maximize resulting lower bound on the log likelihood with respect to the model parameters <math>\alpha, \beta</math>
 +
 
 +
== Experiments ==
 +
LDA is empirically evaluated in several problem domains -- document modeling, document classification, and collaborative filtering.
 +
 
 +
* '''Document Modeling'''
 +
The model is evaluated by computing, what the authors call, ''perplexity'' of a test set on the trained model. Perplexity is a measure of how surprised the trained model is on seeing a new document. Thus a lower perplexity score is better. Datasets used were [[UsesDataset::Elegans scientific corpus]] and a subset of the [[UsesDataset::TREC AP corpus]]. LDA is compared against smoothed unigram (every word from the same multinomial), smoothed mixture of unigrams (a topic chosen for the document and each word selected conditioned on the chosen topic) and Hoffman's [http://www.cs.brown.edu/~th/papers/Hofmann-SIGIR99.pdf pLSI model]. LDA performs better than each of the above. These models also suffer from overfitting while the LDA does not. 
  
2. [[http://www.cs.brown.edu/~th/papers/Hofmann-SIGIR99.pdf Probabilistic Latent Semantic Indexing]]
+
* '''Document classification'''
 +
The authors conducted binary classification experiments on the [[UsesDataset::Reuters-21578]] dataset. The parameters of an LDA model were estimated on all the documents, without their true class label. A SVM was then trained on the low-dimensional representations provided by LDA and was compared to an SVM trained on all the word features.
  
3. [[http://en.wikipedia.org/wiki/Variational_Bayesian_methods Variational Bayesian Methods]]
+
* '''Collaborative filtering'''
 +
A model is trained on a fully observed set of users. Then, for each unobserved user, all but one of the movies preferred by that user are shown and the task is to predict what the held-out movie is. The different algorithms are evaluated according to the likelihood they assign to the held-out movie. The [[UsesDataset::EachMovie]] collaborative filtering data was used.
  
4. [[http://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf Variational Inference lecture pdf by Blei]]
+
== Study Plan ==
 +
* [http://www.cs.brown.edu/~th/papers/Hofmann-SIGIR99.pdf Probabilistic Latent Semantic Indexing]
 +
** [http://en.wikipedia.org/wiki/Mixture_model Mixture models]
 +
* [http://en.wikipedia.org/wiki/Variational_Bayesian_methods Variational Bayesian Methods]
 +
** [http://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf Variational Inference lecture pdf by Blei]
 +
* [http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence KL divergence]

Latest revision as of 19:57, 3 October 2012

Citation

author = {Blei, David M. and Ng, Andrew Y. and Jordan, Michael I.},
title = {Latent dirichlet allocation},
journal = {J. Mach. Learn. Res.},
issue_date = {3/1/2003},
volume = {3},
month = mar,
year = {2003},
issn = {1532-4435},
pages = {993--1022},
numpages = {30},
url = {http://dl.acm.org/citation.cfm?id=944919.944937},
acmid = {944937},
publisher = {JMLR.org}

Online Version

Latent Dirichlet Allocation

Summary

This paper addresses the problem of document modeling

LDA

LDA is a generative probabilistic model for discrete data such as text corpora. It is a Bayesian model that consists of three hierarchies. Each item of the collection is modeled as a finite mixture i.e. modeled as being generated by an underlying (latent) set of topics, where each topic is characterized by a distribution over words. Each document in the corpus is assumed to be generated using the following process:

1. The author chooses the number of words  in the document by drawing from a Poisson() distribution. 
2. He then tosses a Dirichlet hypergenerator Dirichlet() to get a  which is used to generate a Multinomial() topiv generator
3. For each word  from the  words
  a. A topic  is chosen from a Multinomial() distribution
  b. A topic specific word generator parametrized by  and  is then tossed to get the word

The parameters and are corpus level parameters and are sampled only once in the process of generating a corpus. The variables are sampled once per document. Finally, the variables and are word-level variables and are sampled once for each word in each document. What makes LDA unique is that it consists of three levels, and notably the topic node is sampled repeatedly within a document. This allows documents to be associated with multiple topics rather than just one.

Inference

The posterior distribution of the hidden variables given a document is, in general, intractable. However, many efficient approximate inference techniques can be used to estimate the posterior. The paper describes a convexity-based variational method involving EM algorithm for Bayes parameter estimation.

The basic idea is to obtain a lower bound on the log likelihood parametrized by the variational parameters using the Jensen’s inequality. The variational parameters are chosen by an optimization procedure that attempts to find the tightest possible lower bound. The authors show that this requires choosing the parameters to minimize the KL divergence between the distribution under the variational parameters and the true posterior. This leads to a pair of interdependent update equations which can be solved via an iterative fixed-point method.

Parameter Estimation

We now need to estimate the parameters and of the LDA model. An empirical Bayes method for parameter estimation is provided. Given a corpus of documents D, we wish to find parameters and that maximize the (marginal) log likelihood of the data:

This is again not directly solvable. However, variational inference technique provides a tractable lower bound on the log likelihood/ which can be maximized with respect to and . Thus an alternating variational EM procedure can be used to solve the above:

  1. E step: For each document, find the optimizing values of the variational parameters
  2. M step: For fixed values of variational parameters, maximize resulting lower bound on the log likelihood with respect to the model parameters

Experiments

LDA is empirically evaluated in several problem domains -- document modeling, document classification, and collaborative filtering.

  • Document Modeling

The model is evaluated by computing, what the authors call, perplexity of a test set on the trained model. Perplexity is a measure of how surprised the trained model is on seeing a new document. Thus a lower perplexity score is better. Datasets used were Elegans scientific corpus and a subset of the TREC AP corpus. LDA is compared against smoothed unigram (every word from the same multinomial), smoothed mixture of unigrams (a topic chosen for the document and each word selected conditioned on the chosen topic) and Hoffman's pLSI model. LDA performs better than each of the above. These models also suffer from overfitting while the LDA does not.

  • Document classification

The authors conducted binary classification experiments on the Reuters-21578 dataset. The parameters of an LDA model were estimated on all the documents, without their true class label. A SVM was then trained on the low-dimensional representations provided by LDA and was compared to an SVM trained on all the word features.

  • Collaborative filtering

A model is trained on a fully observed set of users. Then, for each unobserved user, all but one of the movies preferred by that user are shown and the task is to predict what the held-out movie is. The different algorithms are evaluated according to the likelihood they assign to the held-out movie. The EachMovie collaborative filtering data was used.

Study Plan