Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs

Citation

Ramesh Nallapati and William Cohen. Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs. In Proc. of AAAI 2008.

Online Version

Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs.

Summary

This paper presents a novel, unsupervised model of topics and topic-specific influence in blogs. It is compared with Link-LDA and performs better. The model is intended to address two issues at once: topic discovery and modeling the topic-specific influence of blogs.

The paper tackles the link prediction problem and does so through distributions over words in a blog. When one blog cites another, this is viewed as a uni-dimensional link.

The paper presents a graphical model, derived from topic models, that looks at both the words in a blog and the links between blogs. Unlike many topic-model-derived models, this one is not completely generative, because the set of hyperlinked documents is fixed.

Dataset

The data was collected from July 4th, 2005 through July 24th, 2005. It is a set of over 8 million blog postings collected by Nielsen Buzzmetrics, and it is very noisy. A blog was kept in the final dataset only if it had at least 2 outgoing or at least 2 incoming hyperlinks within the corpus. From the more than 8 million initial blog posts, the final set had 1,777 blogs that had been cited at least twice from within the corpus and 2,248 with at least 2 outgoing links. Only 68 were in both sets. Because the model needs a bipartite graph, the authors duplicated those 68 to make the graph perfectly bipartite. Common pruning methods were applied to the vocabulary. The citing set was split evenly into two sets of 1,124. The paper doesn't explicitly state what happened to the duplicated 68 documents, which could be a potential source of overfitting.
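
The filtering and duplication step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `build_bipartite`, the `(citing_id, cited_id)` input format, and the choice to count link degrees once on the raw link list are all assumptions.

```python
from collections import Counter

def build_bipartite(links, min_links=2):
    """links: iterable of (citing_id, cited_id) pairs within the corpus.

    Hypothetical helper: degrees are counted once on the raw links,
    which is an assumption about the order of the filtering steps.
    """
    links = list(links)
    out_deg = Counter(src for src, _ in links)
    in_deg = Counter(dst for _, dst in links)

    # Keep documents with at least `min_links` outgoing / incoming links.
    citing = {d for d, n in out_deg.items() if n >= min_links}
    cited = {d for d, n in in_deg.items() if n >= min_links}

    # Documents in both sets (68 in the paper's data) get duplicated:
    # tagging each endpoint with its role makes them two distinct nodes.
    overlap = citing & cited
    edges = [(("citing", s), ("cited", t))
             for s, t in links if s in citing and t in cited]
    return citing, cited, overlap, edges
```

Tagging every node with its role is one simple way to guarantee that no node appears on both sides of the graph, which is the property the model relies on.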

Model

The model is a fairly standard generative graphical model (though not fully generative, due to the finite number of links), and it is described and presented using common conventions. Here is the notation:

Link plsa lda notation.png

And here is the given generative story of how documents and links are generated:

Link plsa lda generative.png

Which is represented by the standard plate notation graphical model below:

Link plsa lda model.png

Of particular note is Ω, whose entry for topic k and cited document d' is interpretable as the influence of document d' in topic k. The two plates representing citing and cited documents are the reason the graph must be bipartite. You can see this compared to Link-LDA:

Link lda model.png
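
As a rough illustration of the generative story, here is a small sampler. It is a sketch under stated assumptions, not a transcription of the paper's figure: it assumes the cited set is generated PLSA-style (each token picks a topic, then a cited document and a word) and each citing document LDA-style (a per-document topic mixture generates both words and links), as the model's name suggests. The names `pi`, `omega`, `beta`, `theta` and the function `sample_corpus` are illustrative, and the parameters are drawn at random only so the code runs.

```python
import random

def dirichlet(rng, alpha, n):
    # Symmetric Dirichlet draw via normalized Gamma variates.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    return [x / total for x in draws]

def categorical(rng, probs):
    # Draw one index with the given probabilities.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_corpus(K, V, n_cited, n_citing, n_cited_tokens,
                  words_per_doc, links_per_doc, alpha=1.0, seed=0):
    rng = random.Random(seed)
    # Random parameters for illustration only (assumed names).
    beta = [dirichlet(rng, 1.0, V) for _ in range(K)]         # topic -> word
    omega = [dirichlet(rng, 1.0, n_cited) for _ in range(K)]  # topic -> cited doc
    pi = dirichlet(rng, 1.0, K)                               # topics of cited set

    # Cited side (assumed PLSA-like): topic, then document and word.
    cited_tokens = []
    for _ in range(n_cited_tokens):
        z = categorical(rng, pi)
        cited_tokens.append((categorical(rng, omega[z]),
                             categorical(rng, beta[z])))

    # Citing side (assumed LDA-like): theta generates words and links.
    citing_docs = []
    for _ in range(n_citing):
        theta = dirichlet(rng, alpha, K)
        words = [categorical(rng, beta[categorical(rng, theta)])
                 for _ in range(words_per_doc)]
        links = [categorical(rng, omega[categorical(rng, theta)])
                 for _ in range(links_per_doc)]
        citing_docs.append((words, links))
    return cited_tokens, citing_docs
```

Note that in this sketch links can only point into the fixed set of `n_cited` documents, which mirrors why the citing and cited sets have to be separate.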


To learn the posterior for Link-PLSA-LDA, the authors use a mean-field variational approximation, represented by this graphical model:

Link plsa lda meanfield.png

The approximation is needed because exact inference is intractable, due to pairwise coupling among the model's variables.

Experiments

To evaluate the model's performance, the authors compare log-likelihood on unseen data; higher values are better. They first train the model's parameters using the entire set of cited postings and set I of the citing postings, then repeat the experiment with set II. The cumulative log-likelihood of the entire set of citing postings is obtained by summing the values from each experiment. Again, I'm curious what the results would look like without the 68 duplicated documents; I fear that they are inflating the log-likelihood. Results are shown in Figure 4.

Link-plsa-lda-01.png
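
The two-fold protocol described above can be sketched as below. The callables `fit` and `log_likelihood` are hypothetical placeholders for a model's training and scoring routines, and the sketch assumes that in each experiment the held-out data is the citing set that was not used for training.

```python
def two_fold_log_likelihood(cited, citing_I, citing_II, fit, log_likelihood):
    """Sum held-out log-likelihood over the two train/test assignments.

    `fit(cited, train_citing)` and `log_likelihood(model, held_out)`
    are hypothetical callables supplied by the caller.
    """
    total = 0.0
    for train, held_out in ((citing_I, citing_II), (citing_II, citing_I)):
        # All cited postings are always used for training.
        model = fit(cited, train)
        total += log_likelihood(model, held_out)
    return total
```

Because each citing posting is held out exactly once, the sum covers the entire set of citing postings.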

Qualitative analysis of the model:

Link-plsa-lda-02.png

The authors also evaluate the model on link prediction, arguing that this is an indicator of good topical influence analysis. The setup is very similar to the log-likelihood experiment and also uses two-fold cross-validation. The baseline is the Link-LDA model. They focus only on how well the model ranks the postings that are actually hyperlinked, and only on the worst case: they report RKL, the rank of the last relevant document, i.e. the rank of the most poorly ranked true citation. Here are the values of RKL:

Link-plsa-lda-03.png
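
For concreteness, RKL for a single citing posting can be computed as in this minimal sketch. The function name and the input format (a dict of model scores per candidate cited posting, plus the set of true citations) are assumptions, and tie handling is left to the sort order.

```python
def rank_of_last_relevant(scores, relevant):
    """scores: dict mapping each candidate cited posting to a model score.
    relevant: the postings that are actually hyperlinked.
    Returns the 1-based rank of the worst-ranked true citation.
    """
    # Rank candidates from highest to lowest score.
    ranking = sorted(scores, key=scores.get, reverse=True)
    position = {doc: i + 1 for i, doc in enumerate(ranking)}
    return max(position[d] for d in relevant)

scores = {"a": 0.9, "b": 0.5, "c": 0.1, "d": 0.3}
print(rank_of_last_relevant(scores, {"a", "d"}))  # prints 3
```

A smaller RKL means that even the most poorly ranked true citation sits near the top of the ranking.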

Study Plan