Reisinger et al 2010: Spherical Topic Models

Citation

Joseph Reisinger, Austin Waters, Bryan Silverthorn, and Raymond J. Mooney, "Spherical Topic Models", in Proceedings of the 27th International Conference on Machine Learning (ICML 2010), 2010.

Summary

This paper presents the Spherical Admixture Model together with a variational inference method, as an alternative to Latent Dirichlet Allocation (LDA), a Bayesian generative model for general problems in topic modeling. The highlight of this paper is that it models documents as data points on a high-dimensional spherical manifold. The model assumes the data are directional, so documents can be compared by cosine distance and other similarity measures from directional statistics. The authors claim that the spherical topic modeling approach outperforms existing models such as LDA.

Motivations

Traditional topic modeling methods, such as Latent Dirichlet Allocation (LDA), fail to model the presence and absence of words in the target document, because they assume a multinomial distribution for the document likelihood. To overcome this issue, the authors propose the Spherical Admixture Model, which models word frequency as well as word presence and absence. In addition, by assuming a von Mises-Fisher distribution, they hope to improve accuracy when modeling sparse, high-dimensional text data.

Brief Description of the method

This paper first introduces the advantages of the von Mises-Fisher distribution for text, then discusses the Spherical Admixture Model and the use of variational inference to approximate the posterior. In this section, we first summarize the basic characteristics of the von Mises-Fisher distribution that the authors assume, then introduce the definition of the proposed model, as well as the variational inference method.

von Mises-Fisher Distribution

In LDA, the multinomial distribution over words assigns probabilities to integer vectors of event counts, i.e., the raw counts of each word in a document, in ${\displaystyle \mathbb {N} ^{|V|}}$. In contrast to the multinomial distribution, the von Mises-Fisher (vMF) distribution is a probability distribution on the (d-1)-dimensional sphere in ${\displaystyle \mathbb {R} ^{d}}$, whose density function is

${\displaystyle f(\mathrm {x} ;\mu ,\kappa )=c_{d}(\kappa )\exp(\kappa \mu ^{T}\mathrm {x} )}$

where ${\displaystyle \mu }$ is the mean direction with ${\displaystyle ||\mu ||=1}$, and ${\displaystyle \kappa }$ is the concentration parameter. In addition,

${\displaystyle c_{d}(\kappa )={\frac {\kappa ^{d/2-1}}{(2\pi )^{d/2}I_{d/2-1}(\kappa )}}}$

is the normalization factor, where ${\displaystyle I_{r}(\cdot )}$ is the modified Bessel function of the first kind and order ${\displaystyle r}$.
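As a concrete sketch, the density above can be evaluated directly. The following Python snippet is illustrative, not the authors' code: it approximates the modified Bessel function by its power series and uses it to compute the vMF log-density.

```python
import math

def bessel_i(v, x, terms=60):
    """Modified Bessel function of the first kind, I_v(x), via its power series:
    I_v(x) = sum_k (x/2)^(2k+v) / (k! * Gamma(k+v+1))."""
    return sum((x / 2.0) ** (2 * k + v) / (math.factorial(k) * math.gamma(k + v + 1))
               for k in range(terms))

def vmf_log_density(x, mu, kappa):
    """log f(x; mu, kappa) for the vMF distribution on the unit sphere in R^d."""
    d = len(mu)
    log_c = ((d / 2.0 - 1) * math.log(kappa)            # kappa^(d/2 - 1)
             - (d / 2.0) * math.log(2 * math.pi)        # (2*pi)^(d/2)
             - math.log(bessel_i(d / 2.0 - 1, kappa)))  # I_{d/2-1}(kappa)
    return log_c + kappa * sum(xi * mi for xi, mi in zip(x, mu))
```

A useful sanity check: in d = 3 the normalizer has the closed form ${\displaystyle c_{3}(\kappa )=\kappa /(4\pi \sinh \kappa )}$, which the series-based computation reproduces.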

Intuitively, the vMF distribution can be considered a multivariate Gaussian with spherical covariance, parameterized by cosine distance rather than Euclidean distance. Cosine distance is commonly used in directional statistics: it compares the directions of ${\displaystyle l_{2}}$-normalized feature vectors and corresponds to the normalized correlation coefficient.

In this paper, the authors also argue that the vMF distribution is sensitive to the absence/presence of words, whereas the multinomial distribution is not. They give an example: if document D1 has count vector [1,1,1] and document D2 has [3,0,0], then under a multinomial with topic proportions ${\displaystyle \theta =[1/3,1/3,1/3]}$ the two documents are equivalent. In contrast, vMF would compute different cosine distances for them.
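This contrast can be checked numerically. The sketch below is purely illustrative, using the two toy count vectors from the text: the per-token multinomial likelihood (the word-by-word LDA view, ignoring the multinomial coefficient) is identical for both documents, while their cosine similarities to the uniform direction differ.

```python
import math

def per_token_likelihood(counts, theta):
    """Product of theta_w over word tokens, i.e., the word-by-word
    multinomial likelihood without the multinomial coefficient."""
    p = 1.0
    for w, c in enumerate(counts):
        p *= theta[w] ** c
    return p

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

d1, d2 = [1, 1, 1], [3, 0, 0]
theta = [1 / 3, 1 / 3, 1 / 3]

print(per_token_likelihood(d1, theta))   # (1/3)^3, same for both documents
print(per_token_likelihood(d2, theta))
print(cosine(d1, theta), cosine(d2, theta))  # 1.0 vs ~0.577: the directions differ
```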

The Spherical Admixture Model

The Spherical Admixture Model (SAM) differs from LDA in that it does not model each word given a topic distribution ${\displaystyle P(w|z)}$. Instead, it models the document as a whole, using a weighted directional average to combine topics. A simple generative story for SAM is:

• Draw a set of T topics ${\displaystyle \phi }$ on the unit hypersphere.
• For each document d, draw topic weights ${\displaystyle \theta _{d}}$ from Dirichlet ${\displaystyle \alpha }$.
• Draw a document vector ${\displaystyle v_{d}}$ from vMF with mean ${\displaystyle {\dot {\phi _{d}}}=Avg(\phi ,\theta _{d})}$.

The complete model can be represented as the following:

• ${\displaystyle \mu |\kappa _{0}\sim vMF(m,\kappa _{0})}$ (corpus mean)
• ${\displaystyle \phi _{t}|\mu ,\epsilon \sim vMF(\mu ,\epsilon ),t\in T}$ (topics)
• ${\displaystyle \theta _{d}|\alpha \sim Dirichlet(\alpha ),d\in D}$ (topic proportions)
• ${\displaystyle {\dot {\phi _{d}}}|\phi ,\theta _{d}=Avg(\phi ,\theta _{d}),d\in D}$ (spherical average)
• ${\displaystyle v_{d}|{\dot {\phi _{d}}},\kappa \sim vMF({\dot {\phi _{d}}},\kappa )}$ (documents)

Here, ${\displaystyle \mu }$ is the corpus mean direction, ${\displaystyle \epsilon }$ controls the concentration of the topics around ${\displaystyle \mu }$, the elements of ${\displaystyle \theta _{d}}$ are the mixing proportions for document d, ${\displaystyle \phi _{t}}$ is the mean of topic t, and ${\displaystyle v_{d}}$ is the observed vector for document d.
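The generative story above can be sketched in a few lines of Python. This is an illustration under stated simplifications, not the authors' implementation: the spherical average Avg is approximated by normalizing the theta-weighted sum of the topic vectors, and the vMF draw is replaced by a crude Gaussian perturbation of the mean (an exact vMF sampler, e.g. Wood's algorithm, would be needed in practice).

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dir_avg(topics, weights):
    """Approximate spherical average Avg(phi, theta): normalize the
    theta-weighted sum of the unit topic vectors."""
    d = len(topics[0])
    return normalize([sum(w * t[i] for w, t in zip(weights, topics)) for i in range(d)])

def sample_dirichlet(alpha):
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def sample_vmf_approx(mu, kappa):
    """CRUDE stand-in for a vMF draw: perturb the mean with Gaussian noise of
    scale 1/sqrt(kappa), then project back onto the unit sphere."""
    return normalize([m + random.gauss(0.0, 1.0 / math.sqrt(kappa)) for m in mu])

# Generate one document vector v_d, given T = 2 topics on the sphere in R^3.
phi = [normalize([1.0, 1.0, 0.0]), normalize([0.0, 0.0, 1.0])]
theta_d = sample_dirichlet([1.0, 1.0])
v_d = sample_vmf_approx(dir_avg(phi, theta_d), kappa=100.0)
```

Note that, by construction, every topic vector and every generated document vector lies on the unit sphere, matching the model's directional assumption.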

Variational Inference

In order to set the parameters of the above model, we need to infer the posterior distribution over the corpus mean, topics, and per-document topic proportions: ${\displaystyle p(\phi ,\theta ,\mu |v,\epsilon ,m,\alpha ,\kappa _{0},\kappa )}$. Exact inference is intractable, so the authors propose a variational mean-field method to infer the parameters approximately. In the mean-field approach, the true posterior is approximated by a simpler distribution with a fully factored parametric form, and an EM-style procedure is used to optimize the following approximation:

${\displaystyle q(\phi ,\theta ,\mu |{\tilde {\mu }},{\tilde {\alpha }},{\tilde {m}})=q(\phi |{\tilde {\mu }},\epsilon )q(\theta |{\tilde {\alpha }})q(\mu |{\tilde {m}},\kappa _{0})}$

In the E step of this EM procedure, the authors use gradient ascent to update the variational topic means ${\displaystyle {\tilde {\mu }}}$ and the per-document topic proportions ${\displaystyle {\tilde {\alpha }}_{d}}$.

Dataset and Experiment Settings

The authors conduct three experiments on three different datasets. In the first experiment, they use the CMU 20 Newsgroups collection to classify Usenet posts. In the second, the task is to detect thematic shifts in the Italian text of Niccolò Machiavelli's Il Principe, and in the last, to classify natural scenes in the 13-scene database. Four models are compared:

• LDA
• movMF - mixtures of vMF model by Banerjee et al., 2005
• SAM whose topic vectors may contain both positive and negative entries.
• SAM restricted to positive entries.

In addition to the three objective experiments, the authors also did a subjective evaluation of the topic interpretability.

Experimental Results

The authors performed four major experiments. The first experiment is the subjective evaluation of the interpretability of the topics. The second to fourth experiments are objective classification experiments.

Exp I: Topic Interpretability

In this experiment, raters correctly identified the intruder word in 67.1% of LDA cases (50 cases per model). With SAM, raters identified 82.7% using tf-idf features and 80.4% using tf only; both SAM results differed significantly from LDA. In the topic relevance evaluation, topics chosen by SAM were preferred roughly 3:2 over topics generated by LDA, indicating the advantage of SAM.

Exp II: CMU 20 Newsgroups

In this experiment, the authors conclude from the above figure that SAM finds better features than the other models and performs similarly to raw bag-of-words. When features from tf-idf SAM were combined with BoW, the overall accuracy reached 94.1% on news-20-different, a significant improvement over the BoW model alone; combining LDA features with BoW did not yield such an improvement.

Exp III: Thematic Scene Shifts

In the above figure, it is clear that SAM[S] discovers the best features in all settings. Compared to the other models, it cuts the relative classification error by 18.5%. The authors also note that performance becomes relatively stable for T > 50.

Exp IV: 13 Natural Scene Categories

In the last experiment, the authors found that SAM significantly outperforms all other models in all settings when 10% of the data is used to train a logistic regression classifier. With dense features, SAM provides only a small benefit over LDA (very similar results, not included in the table above).

Some Reflections

(1) It is justifiable to assume more complex distributions over words.

(2) Directional statistics may be useful under certain conditions.

(3) The model assumes many vMF distributions, which slows inference down.

Related Papers

This paper is related to prior work along three dimensions.

(1) Topic modeling and visualization. Topic models and visualization are extremely popular due to the growing trend of social media.

(2) Bayesian admixture models. The paper presents an interesting Bayesian admixture model that aligns with many other Bayesian models in different domains.

(3) Variational inference. The variational mean field method in this paper is a very interesting alternative inference method to infer parameters in graphical models.