# Rosen-Zvi et al., The Author-Topic Model for Authors and Documents

## Citation

Rosen-Zvi, M., Griffiths, T., Steyvers, M., and Smyth, P. The Author-Topic Model for Authors and Documents. UAI 2004.

## Online version

## Summary

This paper presents a probabilistic graphical model of document generation that accounts for the authors who created the document collection. Potential applications include finding authors with similar recurring research interests, quantifying an author's research interests as a distribution over topics, and discovering the topics present in a corpus. The basic ideas are:

- Use a corpus of articles that have author metadata associated with them.

- Build a topic model of the document generation process in which each author has their own mixture over topics, so the mixture weight of each topic depends on the author.

- Use Gibbs sampling to estimate the desired posterior distributions, drawing samples once the Markov chain has converged to its stationary distribution.
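The collapsed Gibbs sampler for this model resamples, for every word token, a joint (author, topic) assignment conditioned on all other assignments. Below is a minimal sketch of that sampler on a toy corpus; the corpus, hyperparameter values, and variable names are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a list of word ids plus a list of author ids.
docs = [[0, 1, 2, 1], [2, 3, 3, 0]]
doc_authors = [[0, 1], [1]]
V, A, T = 4, 2, 2          # vocabulary size, number of authors, number of topics
alpha, beta = 0.5, 0.01    # symmetric Dirichlet hyperparameters (illustrative)

# Count matrices and the current (author, topic) assignment of every token.
C_AT = np.zeros((A, T))    # author-topic counts
C_WT = np.zeros((V, T))    # word-topic counts
assign = []
for d, words in enumerate(docs):
    for w in words:
        a = rng.choice(doc_authors[d])
        t = rng.integers(T)
        C_AT[a, t] += 1; C_WT[w, t] += 1
        assign.append((d, w, a, t))

for _ in range(200):                      # collapsed Gibbs sweeps
    for i, (d, w, a, t) in enumerate(assign):
        C_AT[a, t] -= 1; C_WT[w, t] -= 1  # remove token i from the counts
        authors = doc_authors[d]
        # Joint conditional over (author, topic) pairs for this token.
        p = ((C_AT[authors] + alpha) /
             (C_AT[authors].sum(1, keepdims=True) + T * alpha) *
             (C_WT[w] + beta) / (C_WT.sum(0) + V * beta))
        p_flat = p.ravel() / p.sum()
        k = rng.choice(p_flat.size, p=p_flat)
        a, t = authors[k // T], k % T
        C_AT[a, t] += 1; C_WT[w, t] += 1
        assign[i] = (d, w, a, t)

# Point estimates of the author-topic and topic-word distributions.
theta = (C_AT + alpha) / (C_AT.sum(1, keepdims=True) + T * alpha)
phi = (C_WT + beta) / (C_WT.sum(0, keepdims=True) + V * beta)
```

Each row of `theta` is one author's estimated topic mixture, and each column of `phi` is one topic's distribution over the vocabulary.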

## Brief description of the method

The paper describes a generative probabilistic process in which each word is assigned a topic sampled from an author-specific topic distribution. This models an author's interests as a probability distribution over the topics in the corpus, and each author can have a different multinomial distribution over those topics. Each topic is in turn represented as a multinomial distribution over the corpus vocabulary. For each word in a document, an author is first sampled uniformly from the document's co-authors; a topic is then sampled from that author's topic distribution; finally, a word is sampled from the language model indexed by that topic.
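The three sampling steps above can be sketched as follows. The sizes, the random seed, and the function name are illustrative; in the model the rows of `theta` and `phi` are themselves drawn from Dirichlet priors, which the Dirichlet draws below stand in for:

```python
import numpy as np

rng = np.random.default_rng(1)

A, T, V = 3, 2, 5                          # authors, topics, vocabulary (toy sizes)
theta = rng.dirichlet([0.5] * T, size=A)   # one topic mixture per author
phi = rng.dirichlet([0.01] * V, size=T)    # one word distribution per topic

def generate_document(coauthors, n_words):
    """Generate one document by the author-topic generative process."""
    words = []
    for _ in range(n_words):
        a = rng.choice(coauthors)          # 1. author sampled uniformly
        z = rng.choice(T, p=theta[a])      # 2. topic from that author's mixture
        w = rng.choice(V, p=phi[z])        # 3. word from that topic's distribution
        words.append(w)
    return words

doc = generate_document([0, 2], n_words=10)
```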

## Experimental Result

The authors use perplexity to measure how well the author-topic model fits held-out data. The first perplexity evaluation was carried out on the NIPS collection and on abstracts from the CiteSeer network, with 10 samples from the Gibbs sampler used to compute a point estimate of perplexity. For NIPS, in addition to the training data (1,557 papers), a randomly chosen set of words from each test document (183 papers) was included in the training set, and this combined set was used to predict the remaining words of the test documents. The author model performed worse than both LDA and the author-topic model. With few observed words per test document, the author-topic model outperforms LDA with lower perplexity; as the number of observed words grows, the more flexible LDA model predicts better than the author-topic model.

A second evaluation task ranked candidate authors of unseen test documents by the perplexity they assign to each document. For large numbers of topics, especially 400, the correct author typically appeared in the top 20 when ranked by ascending perplexity. Finally, authors were compared using a symmetric version of KL divergence between their topic distributions, where the model gave intuitive similarity results.
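The symmetrized KL divergence used for the author-similarity comparison can be sketched as below; the example topic mixtures are illustrative, not values from the paper:

```python
import numpy as np

def symmetric_kl(p, q):
    """Symmetrized KL divergence: KL(p || q) + KL(q || p)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q) + q * np.log(q / p)))

# Hypothetical author-topic mixtures (one row per author, rows sum to 1).
theta = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.1, 0.8]])

# Authors 0 and 1 have similar interests; author 2 does not.
d01 = symmetric_kl(theta[0], theta[1])
d02 = symmetric_kl(theta[0], theta[2])
```

Lower values mean more similar topic mixtures, so here `d01 < d02`, matching the intuition that authors 0 and 1 share interests.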

## Related papers

An interesting related paper is Blei, D. M., Ng, A. Y., and Jordan, M. I., "Latent Dirichlet Allocation", Journal of Machine Learning Research, 2003, which proposes the LDA model that the author-topic model builds upon.