Nallapati Cohen Link PLSA LDA


Citation

R. Nallapati and W. Cohen. 2008. Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs. In International Conference on Weblogs and Social Media.


Online Version

[1]


Summary

Topic discovery in text data is an important machine learning problem. Several methods have previously been used to discover topics in blog posts, which both contain text discussing assorted topics and are embedded in a network of other postings through hyperlinks, linking to other posts and being linked to in turn. Nallapati and Cohen advance on these earlier topic discovery methods by combining two prior approaches and adding an additional wrinkle, substantially improving topic-model predictions on a blog-posting dataset.


Background

Several methods for latent topic modeling for unsupervised topic discovery have been used before, e.g., PLSA (Hofmann 1999; Cohn & Hofmann 2001) and LDA (Blei, Ng, & Jordan 2003). In Cohn and Hofmann (2001), the authors used a model (referred to as Link-PLSA by Nallapati and Cohen) that assigns a probability for a topic to a document when that document is linked to by other blog posts discussing that topic.

Nallapati and Cohen also mention an offshoot of LDA (referred to in this paper as Link-LDA) that likewise assigns topic probabilities based on links to other blog posts, and additionally models the co-occurrence of words and hyperlinks within a blog posting.

Of note, both of these methods exploit only the co-occurrence of hyperlinks to estimate the probability of shared topics, rather than explicitly modeling the relationship between the linking and the linked documents.
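
To make the two baselines concrete, the Link-LDA view can be pictured generatively: each citing post draws a topic mixture, and both its words and its outgoing hyperlinks are then drawn from topic-specific multinomial distributions given that mixture. The toy sketch below only illustrates this idea; the vocabulary and collection sizes, the symbols beta (topic-word distributions) and omega (topic-citation distributions), and the Dirichlet hyperparameters are all assumptions made for the example, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed): V word types, C candidate cited posts, K topics, D citing posts.
V, C, K, D = 1000, 50, 5, 20

beta = rng.dirichlet(np.ones(V) * 0.1, size=K)   # per-topic word distributions
omega = rng.dirichlet(np.ones(C) * 0.1, size=K)  # per-topic distributions over cited posts

def generate_post(n_words=100, n_links=3, alpha=0.5):
    """Generate one citing post: words and hyperlinks share the same topic mixture."""
    theta = rng.dirichlet(np.ones(K) * alpha)            # topic proportions for this post
    word_topics = rng.choice(K, size=n_words, p=theta)   # a topic per word
    words = [rng.choice(V, p=beta[z]) for z in word_topics]
    link_topics = rng.choice(K, size=n_links, p=theta)   # a topic per hyperlink
    links = [rng.choice(C, p=omega[z]) for z in link_topics]
    return words, links

corpus = [generate_post() for _ in range(D)]
```

Link-PLSA-LDA keeps this citation mechanism but, as described under Methodology below, also models the text of the cited posts rather than treating them only as targets of hyperlinks.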


Data Used

A collection of 8,370,193 blogosphere postings collected by Nielsen Buzzmetrics between 07/04/2005 and 07/24/2005, a period of 20 days. After processing the dataset for usability and filtering on incoming and outgoing hyperlink counts, the authors arrived at a set of 1,777 posts with at least 2 incoming links each and a set of 2,248 posts with outgoing links.
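
As a rough picture of that filtering step, the sketch below counts incoming links over a hypothetical hyperlink adjacency map and keeps posts that clear the two-incoming-link threshold; the `posts` map and the outgoing-link criterion are placeholders for illustration, not the authors' preprocessing code.

```python
from collections import Counter

# Hypothetical adjacency map: post id -> ids of the posts it hyperlinks to.
posts = {
    "a": {"b", "c"},
    "b": {"c"},
    "c": {"a", "b"},
    "d": set(),
}

# Count incoming links for every post in the collection.
incoming = Counter(target for links in posts.values() for target in links)

# Cited set: posts with at least two incoming links (the threshold stated above).
cited = {p for p in posts if incoming[p] >= 2}

# Citing set: posts with outgoing links into the cited set
# (the exact outgoing-link criterion is an assumption here).
citing = {p for p, links in posts.items() if links & cited}
```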


Methodology

The authors retain the approach of Link-LDA and Link-PLSA, in which citations are modeled as samples from a topic-specific multinomial distribution over the hyperlinked documents. Unlike Link-LDA and Link-PLSA, which use only the citations of a document d' with respect to topic k when determining its influence, their model also takes the content of d' into account when determining the topics. The authors trained the model to discover topics on half of the dataset described above, then used the trained model to predict hyperlinks in the held-out half, and found that their method outperforms both Link-LDA and Link-PLSA.
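
One way to picture the hyperlink-prediction evaluation: estimate a held-out post's topic mixture from its words, then score every candidate cited post under the trained topic-specific citation multinomials. The sketch below is a hedged illustration of that idea, not the paper's inference procedure; `beta` and `omega` stand for trained topic-word and topic-citation distributions (as in the toy block above), and the folded-in EM loop is just one simple way to fit a topic mixture from text.

```python
import numpy as np

def predict_link_scores(word_ids, beta, omega, alpha=0.5, n_iters=50):
    """Score candidate cited posts for one held-out citing post.

    word_ids : indices of the words observed in the post
    beta     : (K, V) trained topic-word distributions
    omega    : (K, C) trained topic-citation distributions
    """
    K = beta.shape[0]
    theta = np.full(K, 1.0 / K)                    # start from a uniform topic mixture
    for _ in range(n_iters):
        # E-step: responsibility of each topic for each observed word
        resp = theta[:, None] * beta[:, word_ids]  # shape (K, n_words)
        resp /= resp.sum(axis=0, keepdims=True)
        # M-step: re-estimate the mixture, with a small Dirichlet-style smoother
        theta = resp.sum(axis=1) + alpha
        theta /= theta.sum()
    # P(cited post c | this post) = sum_k theta_k * omega[k, c]
    return theta @ omega

# Usage: rank candidates for a held-out post; higher score = more likely hyperlink.
# scores = predict_link_scores(word_ids=[5, 17, 302], beta=beta, omega=omega)
# ranked = np.argsort(-scores)
```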


Related Papers

D. Blei, A. Ng, and M. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3: 993–1022. [2]

D. Cohn and T. Hofmann. 2001. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13. [3]

T. Hofmann. 1999. Probabilistic latent semantic analysis. In Uncertainty in Artificial Intelligence.