Eisenstein et al 2011: Sparse Additive Generative Models of Text


Citation

Sparse Additive Generative Models of Text. Eisenstein, Ahmed and Xing. Proceedings of ICML 2011.

Online version

Eisenstein et al 2011

Summary

This paper presents a sparse, additive generative modeling approach to Topic modeling. It is an important alternative to Latent Dirichlet Allocation (LDA), which involves neither sparsity nor additive modeling in log-space.

Brief Description of the method

The paper first describes three major drawbacks of Latent Dirichlet Allocation: high inference cost, overparameterization, and the lack of a sparse representation. It then introduces SAGE, an additive generative model that avoids re-learning the background distribution for every topic; instead, each topic is represented as a sparse deviation that is added to the background in log-space.

The Generative Story

In contrast to the traditional multinomial parameterization of words in LDA, SAGE models log frequencies: the generative distribution over words in a document d of class (topic) k is

  p(w \mid y_d = k) = \frac{\exp(m_w + \eta_{k,w})}{\sum_i \exp(m_i + \eta_{k,i})},

where m is the background log-frequency vector and \eta_k is the log-frequency deviation that represents topic k. With this parameterization, the authors argue, SAGE can take advantage of sparsity-inducing priors on \eta_k to obtain additional robustness. The generative story of SAGE can be described as follows (a small simulation sketch is given after the list):

  • Draw the background distribution m from an uninformative prior
  • For each class k:
     -- For each term i:
          1. Draw \tau_{k,i} \sim \mathrm{Exponential}(\lambda)
          2. Draw \eta_{k,i} \sim \mathcal{N}(0, \tau_{k,i})
     -- Set \beta_{k,i} = \exp(\eta_{k,i} + m_i) / \sum_j \exp(\eta_{k,j} + m_j)
  • For each document d:
     -- Draw a class y_d from a uniform distribution
     -- For each word n, draw w_{d,n} \sim \mathrm{Multinomial}(\beta_{y_d})
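To make the generative story concrete, here is a minimal simulation sketch in NumPy. The values of V, K, D and \lambda, the Poisson document-length distribution, and the Dirichlet draw standing in for the uninformative prior on m are illustrative assumptions, not settings taken from the paper.

  import numpy as np

  rng = np.random.default_rng(0)
  V, K, D = 1000, 5, 100      # vocabulary size, classes, documents (illustrative)
  lam = 10.0                  # rate of the Exponential prior on tau (assumed value)

  # Background log-frequencies m: stand-in for a draw from an uninformative prior
  m = np.log(rng.dirichlet(np.ones(V)))

  # tau_{k,i} ~ Exponential(lam); eta_{k,i} ~ Normal(0, tau_{k,i})
  tau = rng.exponential(scale=1.0 / lam, size=(K, V))
  eta = rng.normal(loc=0.0, scale=np.sqrt(tau))

  # beta_k = softmax(eta_k + m), computed with the usual max-shift for stability
  logits = eta + m
  beta = np.exp(logits - logits.max(axis=1, keepdims=True))
  beta /= beta.sum(axis=1, keepdims=True)

  docs = []
  for d in range(D):
      y_d = rng.integers(K)                         # class drawn uniformly
      n_d = rng.poisson(50) + 1                     # document length (not part of the model)
      w_d = rng.choice(V, size=n_d, p=beta[y_d])    # each word ~ Multinomial(beta_{y_d})
      docs.append((y_d, w_d))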

Here, \mathrm{Exponential}(\lambda) denotes the exponential distribution. If we fit a variational distribution over the latent variables and optimize the resulting bound, we obtain (up to constants) the following likelihood objective for each \eta_k:

  \ell(\eta_k) = c_k^\top (\eta_k + m) - C_k \log \sum_i \exp(\eta_{k,i} + m_i) - \tfrac{1}{2}\, \eta_k^\top \mathrm{diag}(\langle \tau_k^{-1} \rangle)\, \eta_k,

where c_k is the vector of term counts aggregated over the documents of class k and C_k = \sum_i c_{k,i}.

Then, Newton's method can be used to solve this optimization problem.
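As a rough illustration of this objective and its gradient (a sketch under the reconstruction above, not the authors' code; the function name and the precomputed inputs c_k and \langle 1/\tau_{k,i} \rangle are assumptions):

  import numpy as np

  def sage_objective_and_grad(eta_k, m, c_k, inv_tau_mean):
      """Objective for eta_k (up to constants) and its gradient.

      eta_k        : (V,) log-frequency deviations for class k
      m            : (V,) background log-frequencies
      c_k          : (V,) term counts aggregated over documents of class k
      inv_tau_mean : (V,) variational expectations <1 / tau_{k,i}>
      """
      C_k = c_k.sum()
      logits = eta_k + m
      shift = logits.max()
      lse = shift + np.log(np.exp(logits - shift).sum())   # log sum_i exp(eta_{k,i} + m_i)
      beta_k = np.exp(logits - lse)                        # softmax(eta_k + m)

      obj = c_k @ logits - C_k * lse - 0.5 * eta_k @ (inv_tau_mean * eta_k)
      grad = c_k - C_k * beta_k - inv_tau_mean * eta_k
      return obj, grad

Because of the log-sum-exp term, setting this gradient to zero has no closed-form solution, which is why an iterative Newton scheme is needed.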

Parameter Estimation

For each component vector \eta_k, the gradient and Hessian H of the objective are computed; because H is the sum of a diagonal matrix and a rank-one term, the authors use the Sherman-Morrison formula to invert it efficiently. For the variances, the authors construct a fully-factored variational distribution Q(\tau_k) = \prod_i Q(\tau_{k,i}).
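Under the reconstructed objective above, the Hessian is H = -C_k(\mathrm{diag}(\beta_k) - \beta_k \beta_k^\top) - \mathrm{diag}(\langle \tau_k^{-1} \rangle), so -H is a diagonal matrix minus a rank-one term and its inverse can be applied to the gradient in O(V) time via Sherman-Morrison. A minimal sketch of one such Newton step follows (the function name and inputs are illustrative, not from the paper):

  import numpy as np

  def sage_newton_step(eta_k, m, c_k, inv_tau_mean):
      """One Newton ascent step for eta_k, applying (-H)^{-1} with Sherman-Morrison."""
      C_k = c_k.sum()
      logits = eta_k + m
      shift = logits.max()
      lse = shift + np.log(np.exp(logits - shift).sum())
      beta_k = np.exp(logits - lse)

      grad = c_k - C_k * beta_k - inv_tau_mean * eta_k

      # -H = A - u u^T, with A = C_k * diag(beta_k) + diag(<1/tau_k>) and u = sqrt(C_k) * beta_k
      A_diag = C_k * beta_k + inv_tau_mean
      u = np.sqrt(C_k) * beta_k

      # Sherman-Morrison: (A - u u^T)^{-1} g = A^{-1} g + A^{-1} u * (u^T A^{-1} g) / (1 - u^T A^{-1} u)
      Ainv_g = grad / A_diag
      Ainv_u = u / A_diag
      step = Ainv_g + Ainv_u * (u @ Ainv_g) / (1.0 - u @ Ainv_u)

      return eta_k + step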

Dataset and Experiment Settings

The authors perform four experiments on different datasets.

Experimental Results

The authors performed four major experiments. The first does not involve latent variables (supervised document classification); the second explores sparse topic models with latent variables; the last two concern multifaceted generative models that combine topics with an additional facet (ideology and geography, respectively).

Exp 1: Document classification

[Figure: Sage1.png]

SAGE outperforms the multinomial model in all settings. Notably, the gap between SAGE and the multinomial baseline was wider when training data was limited.

Exp 2: Sparse topic models

[Figure: Sage2.png]

In Figure 3, with both 25 and 50 topics, SAGE outperforms latent Dirichlet allocation. When the number of topics is increased, the number of non-zero weights decreases 5-fold. Figure 4 shows that SAGE admits very little topical variation for low-frequency words, whereas LDA displays little sensitivity to overall term frequency.

Exp 3: Topic and ideology

[Figure: Sage3.png]

The SAGE model achieved 69.6% accuracy when K = 30, outperforming Ahmed and Xing (2010).

Exp 4: Geolocation from text

[Figure: Sage4.png]

When predicting authors' locations from Twitter text, SAGE obtained the lowest mean error (in kilometers) among the compared models.

Some Reflections

(1) It is very interesting to see an additive, log-space generative model used for topic modeling.

(2) The sparsity is useful not only for robustness, but also for efficiency in parameter estimation.

(3) The additive formulation allows different components to be plugged in without much effort.

Related Papers

This paper is related to many papers in topic modeling, inference algorithms in graphical models and unsupervised learning.

(1) Topic modeling.

(2) Inference algorithms in graphical models.

(3) Unsupervised learning.