Citation

Rosen-Zvi et al., The Author-Topic Model for Authors and Documents

Online version

UAI'04

Summary

This paper presents a probabilistic graphical model of document generation that takes into account the authors who created the document collection. Potential applications include finding authors with similar recurring research interests, quantifying those interests as a probability distribution over topics conditioned on the author, and discovering the topics present in a corpus. The basic ideas are:

  • Use a corpus of articles that have author metadata associated with them.
  • Build a topic model of the document generation process in which each author has their own mixture over topics, so the mixture weight of each topic depends on the author.
  • Use Gibbs sampling to estimate the desired posterior probabilities by drawing samples from the converged posterior distribution (a sketch of one sampling update is given after this list).
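
Below is a minimal sketch of one collapsed Gibbs sampling update of the kind mentioned above, assuming symmetric Dirichlet priors alpha and beta and word-topic / author-topic count matrices maintained during sampling. The names (n_wt, n_at, sample_author_topic) are illustrative and are not taken from the paper.

  # Sketch of one collapsed Gibbs update for a single word token w in a document
  # with author set doc_authors. n_wt[w, t] counts word w assigned to topic t;
  # n_at[a, t] counts topic t assigned to author a (current token's counts removed).
  import numpy as np

  def sample_author_topic(w, doc_authors, n_wt, n_at, alpha, beta, rng):
      V, T = n_wt.shape
      pairs, probs = [], []
      for a in doc_authors:
          for t in range(T):
              p_word = (n_wt[w, t] + beta) / (n_wt[:, t].sum() + V * beta)
              p_topic = (n_at[a, t] + alpha) / (n_at[a].sum() + T * alpha)
              pairs.append((a, t))
              probs.append(p_word * p_topic)
      probs = np.asarray(probs)
      probs /= probs.sum()
      return pairs[rng.choice(len(pairs), p=probs)]  # new (author, topic) pair

After drawing the new pair, the counts for that author, topic, and word are incremented again; repeating this sweep over all tokens yields samples from the posterior.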

Brief description of the method

The paper describes a generative process in which each word is assigned a topic sampled from an author-conditional topic distribution. This allows an author's interests to be modeled as a probability distribution over the topics present in the corpus: each author has their own multinomial distribution over topics, and each topic is represented as a multinomial distribution over the vocabulary. For each word in a document, an author is first sampled uniformly from the document's co-authors; a topic is then sampled from that author's topic distribution; finally, the word is sampled from the language model indexed by that topic.
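
As a concrete illustration, here is a minimal sketch of that generative process, assuming symmetric Dirichlet priors over the author-topic distributions (theta) and topic-word distributions (phi). The function name and defaults are illustrative, not the paper's implementation.

  # Sketch of the author-topic generative process: for each word, pick an author
  # uniformly from the document's co-authors, then a topic from that author's
  # topic distribution, then a word from that topic's word distribution.
  import numpy as np

  def generate_corpus(author_lists, doc_lengths, n_topics, vocab_size, n_authors,
                      alpha=0.1, beta=0.01, seed=0):
      rng = np.random.default_rng(seed)
      theta = rng.dirichlet([alpha] * n_topics, size=n_authors)  # author -> topics
      phi = rng.dirichlet([beta] * vocab_size, size=n_topics)    # topic -> words
      corpus = []
      for authors, n_words in zip(author_lists, doc_lengths):
          doc = []
          for _ in range(n_words):
              a = rng.choice(authors)                # uniform over co-authors
              z = rng.choice(n_topics, p=theta[a])   # topic from the author's mixture
              w = rng.choice(vocab_size, p=phi[z])   # word from the topic's distribution
              doc.append(w)
          corpus.append(doc)
      return corpus, theta, phi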

Experimental Results

The authors used perplexity to measure how well the author-topic model fits the data. Perplexity was evaluated on the NIPS papers and on abstracts from the CiteSeer network, with 10 samples used to compute a point estimate of perplexity. In addition to the NIPS training data (1557 papers), a randomly chosen set of words from each of the 183 test documents was included in the training set, and this combined training set was used to predict the remaining words of the test documents. The author model performed worst, behind both LDA and the author-topic model. When few test-document words are observed, the author-topic model outperforms LDA (lower perplexity); however, as more words from the test documents are included in the training set, the more flexible LDA model predicts the remaining words better than the author-topic model.

Another evaluation task was to rank candidate authors of unseen test documents by perplexity. For a large number of topics, in particular 400 topics, the correct author was within the top 20 when authors were ranked by ascending perplexity. The authors were also compared with each other using a symmetric version of KL divergence between their topic distributions as a similarity measure, and the model gave intuitive results.
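
The author-similarity computation can be sketched as follows, assuming each author is summarized by an estimated topic distribution (e.g. a row of the learned theta matrix). The symmetrization shown here, summing the two directed divergences, is one common choice and may differ in detail from the paper's exact definition.

  # Sketch: symmetrized KL divergence between two authors' topic distributions.
  import numpy as np

  def kl(p, q, eps=1e-12):
      # Directed KL divergence with light smoothing to avoid log(0).
      p = np.asarray(p, dtype=float) + eps
      q = np.asarray(q, dtype=float) + eps
      p, q = p / p.sum(), q / q.sum()
      return float(np.sum(p * np.log(p / q)))

  def author_distance(theta_i, theta_j):
      # Symmetrize by summing both directions; smaller values mean more similar authors.
      return kl(theta_i, theta_j) + kl(theta_j, theta_i)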

Related papers

An interesting related paper is Cohn, D. ICML 2000, which proposes a latent variable model of citations.