Sparse Additive Generative Models of Text
This paper is available online [1].
Summary
Sparse Additive Generative Models of Text (SAGE) is an interesting alternative to traditional generative models for text. The key insight of the paper is that latent classes or topics can be modeled as deviations, in log-frequency, from a constant background distribution. This formulation enforces sparsity, which the authors argue prevents overfitting. Additionally, generative facets can be combined through addition in log space, avoiding the need for switching variables.
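The core idea can be sketched in a few lines. This is an illustrative toy (the vocabulary, counts, and deviation values are made up, not from the paper): a class's word distribution is the background log-frequency plus a sparse per-class deviation, exponentiated and renormalized.

```python
import numpy as np

# Toy sketch of the SAGE formulation (all numbers are invented for
# illustration): p(w | class) is proportional to exp(m_w + eta_w),
# where m is a fixed background log-distribution and eta is sparse.

vocab = ["the", "of", "model", "topic", "sparse"]
counts = np.array([5000.0, 3000.0, 40.0, 30.0, 10.0])  # corpus counts
m = np.log(counts / counts.sum())                      # background log-probs

# Sparse deviation for one hypothetical class: most entries are exactly
# zero, so common words like "the" simply keep their background weight.
eta = np.array([0.0, 0.0, 1.5, 0.0, 2.0])

logits = m + eta
p = np.exp(logits - logits.max())  # subtract max for numerical stability
p /= p.sum()
print(dict(zip(vocab, np.round(p, 4))))
```

Only the handful of words with nonzero deviations shift away from their background probabilities; everything else is inherited from the corpus-wide distribution.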
Datasets
The paper evaluates several algorithms across a variety of datasets: the classic Twenty Newsgroups benchmark (20 Newsgroups), a benchmark NIPS dataset (NIPS), geotagged tweets (Geotagged Tweets), and political blogs from 2008 (2008 Political Blogs).
Methodology
The key insight of the paper is that a generative model's class-specific distributions can be thought of as deviations from a background distribution in log space. The authors propose their method to address four main problems they see in Dirichlet-multinomial generative models: overfitting, overparametrization, inference cost, and lack of sparsity. Overfitting is a perennial problem in machine learning, and SAGE attacks it by imposing sparsity through a zero-mean Laplace prior on the deviation parameters. Interestingly, they claim this has the same effect as an L1 regularizer but do not actually explain how the two are the same. This was the biggest issue with the paper: the prior does appear to enforce sparsity, but the paper never proves the equivalence to an L1 regularizer.
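Whatever one makes of the claimed equivalence, the sparsity-inducing mechanism of an L1-style penalty is easy to demonstrate on its own. A toy sketch (not the paper's algorithm; the values and the helper name are illustrative): the proximal operator of an L1 penalty is soft-thresholding, which snaps small coefficients to exactly zero.

```python
import numpy as np

# Illustrative only: soft-thresholding solves, elementwise,
#   argmin_z  0.5 * (z - x)^2 + lam * |z|
# and is the mechanism by which an L1 penalty yields exact zeros.

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

eta = np.array([0.05, -0.02, 1.3, -0.8, 0.01])
sparse_eta = soft_threshold(eta, lam=0.1)
print(sparse_eta)  # entries smaller than lam in magnitude become 0.0
```

Large deviations survive (slightly shrunk), while small ones are zeroed out entirely, which is exactly the sparsity pattern SAGE relies on.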
The paper also had an interesting take on inference costs. In particular, the authors note that incorporating multiple generative facets in many models usually requires an additional latent variable per token that acts as a switching variable. Because SAGE combines facets by adding log-probabilities, it has no need for switching variables.
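The facet combination is just vector addition in log space. A minimal sketch (facet names and values are invented for illustration): a topic deviation and an ideology deviation are summed with the background before normalizing, with no per-token switch deciding which facet generated each word.

```python
import numpy as np

# Illustrative sketch: combining two facets by adding their log-space
# deviations to the background, instead of a per-token switching variable.

def word_dist(m, *etas):
    """Normalize exp(background + sum of facet deviations)."""
    logits = m + sum(etas)
    p = np.exp(logits - logits.max())
    return p / p.sum()

m = np.log(np.array([0.5, 0.3, 0.15, 0.05]))   # background distribution
eta_topic = np.array([0.0, 0.0, 1.0, 0.0])     # hypothetical topic facet
eta_ideology = np.array([0.0, 0.0, 0.0, 2.0])  # hypothetical ideology facet

p = word_dist(m, eta_topic, eta_ideology)
print(np.round(p, 4))
```

Each facet contributes its deviation everywhere at once, so inference never has to decide, token by token, which facet is "responsible."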
The background distribution also addresses overparametrization: high-frequency tokens are absorbed into the background automatically and do not need to be learned per class. The authors give the example of the words "the" and "of", whose probabilities a traditional classifier has to relearn for every class.
In all of these cases, the main issues stem from the use of the Dirichlet-multinomial in generative models, which is popular because estimation is straightforward. SAGE is intended as a drop-in replacement: any model that uses a Dirichlet-multinomial (and others as well, but this is the focus) can use SAGE instead. Incorporating it changes the generative story a little. To start, a background distribution is drawn from a prior, normally uninformative. Then, instead of introducing a switching latent variable, each deviation parameter is drawn from a zero-mean normal distribution whose variance is itself drawn from an exponential distribution, a scale mixture that is equivalent to the Laplace prior. The downside of this method is that it is fairly complex, and each machine learning algorithm has to be completely rewritten in the SAGE framework.
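That normal-exponential construction can be sampled directly. A hedged sketch (the rate parameter and sample size are arbitrary choices, not from the paper): drawing a variance from an exponential distribution and then a deviation from a zero-mean normal with that variance yields a heavier-tailed, Laplace-like marginal.

```python
import numpy as np

# Sketch of the prior's scale-mixture form described above: variance
# tau ~ Exponential, then eta | tau ~ Normal(0, tau). Marginally, eta
# follows a Laplace distribution, which concentrates mass near zero.

rng = np.random.default_rng(0)

def sample_deviation(size, rate=1.0):
    tau = rng.exponential(scale=1.0 / rate, size=size)  # variances
    return rng.normal(loc=0.0, scale=np.sqrt(tau))      # eta given tau

eta = sample_deviation(100_000)
# The marginal is heavier-tailed than a normal with the same variance.
print(np.mean(np.abs(eta) > 3 * eta.std()))
```

The heavy tails plus the sharp peak at zero are what make the prior sparsity-friendly: most deviations stay near zero while a few can be large.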
Experimental Results
The first experiment the authors run is document classification on 20 Newsgroups against Naive Bayes. In all cases, SAGE beat NB.
The second experiment compared SAGE with Latent Dirichlet Allocation (LDA). For larger numbers of topics, SAGE achieved lower perplexity on the NIPS dataset. The authors also compared topic-term frequencies between the two methods, showing that SAGE only distinguishes topic-term frequencies when the counts are robust.
The third experiment compared multifaceted generative models, modeling both the topic and the ideology of political blogs. Here they came close to state-of-the-art results.
The fourth experiment looked at geo-tagging tweets, evaluating how close the predictions came, on average, to each tweet's actual location. SAGE lost slightly to state-of-the-art methods on median error but had the best results on mean error.