Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs


Citation

Ramesh Nallapati and William Cohen. Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs. In Proc. of AAAI 2008.

Online Version

Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs.

Summary

This paper presents a novel unsupervised model of topics and topic-specific influence in blogs. It is intended to address two problems at once: topic discovery and modeling the topic-specific influence of blogs. The model is compared with Link-LDA and performs better.

When one blog cites another, this is viewed as a uni-dimensional link.

The model is not completely generative, because the set of hyperlinked documents is fixed.

Dataset

The data covers the period from July 4th, 2005 through July 24th, 2005. It is a set of over 8 million blog postings collected by Nielsen Buzzmetrics, and it was very noisy. A blog was kept in the final dataset only if it had at least 2 outgoing or at least 2 incoming hyperlinks within the corpus. From the more than 8 million initial blog posts, the final set had 1,777 blogs that had been cited at least twice from within the corpus and 2,248 blogs with at least 2 outgoing hyperlinks. Only 68 blogs were in both sets. The authors duplicated these 68 so that the resulting graph is perfectly bipartite. Common pruning methods were applied to the vocabulary. The dataset was then split evenly into two sets of 1,124. The paper does not explicitly state what happened to the 68 duplicated documents, which could be a potential source of overfitting.
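The filtering and duplication steps described above can be sketched in a few lines of Python. This is only an illustration and not the authors' preprocessing code: the function names `build_bipartite` and `split_evenly`, the `(citing, cited)` edge format, and the choice to count links in a single pass over the raw edge list are all assumptions.

```python
from collections import Counter


def build_bipartite(edges, min_links=2):
    """Filter blogs by link count and build a bipartite citing/cited graph.

    `edges` is an iterable of (citing_blog, cited_blog) pairs, assumed to
    contain only hyperlinks that stay within the corpus.
    """
    edges = list(edges)
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)

    # Blogs with enough outgoing links form the citing side,
    # blogs with enough incoming links form the cited side.
    citing = {blog for blog, n in out_deg.items() if n >= min_links}
    cited = {blog for blog, n in in_deg.items() if n >= min_links}
    both = citing & cited

    # Tagging each endpoint with its role "duplicates" any blog that is in
    # both sets: its citing copy and its cited copy become distinct nodes.
    bipartite_edges = [
        (("citing", src), ("cited", dst))
        for src, dst in edges
        if src in citing and dst in cited
    ]
    return citing, cited, both, bipartite_edges


def split_evenly(items):
    """Split a collection into two halves (assumed deterministic ordering)."""
    ordered = sorted(items)
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]
```

Applied to the real corpus, a procedure like this would be expected to yield citing and cited sets of the sizes reported above, with `both` holding the 68 overlapping blogs. Splitting a set of 2,248 items with `split_evenly` gives two halves of 1,124, which matches the split size quoted in the text.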

Model

Topic Discovery

Modeling Topic Specific Influence of Blogs

Experiments

Link-plsa-lda-01.png

Study Plan