Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs

Citation

Ramesh Nallapati and William Cohen. Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs. In Proc. of AAAI 2008.

Online Version

Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs.

Summary

This paper presents a novel unsupervised model of topics and topic-specific influence in blogs. The model is intended to address two problems at once: topic discovery and modeling the topic-specific influence of blogs. It is compared against Link-LDA and performs better.

The paper tackles the link prediction problem and does so through distributions over words in a blog. When one blog cites another, this is viewed as a uni-dimensional link.

The paper presents a graphical model, derived from topic models, that looks at both the words in a blog and the links between blogs. Unlike many topic-model-derived models, this one is not completely generative, because the hyperlinked documents are fixed.

Dataset

The data was collected from July 4th, 2005 through July 24th, 2005. It was a set of over 8 million blog postings collected by Nielsen Buzzmetrics, but it was very noisy. A posting was kept in the final dataset only if it had at least 2 outgoing or at least 2 incoming hyperlinks within the corpus. From the more than 8 million initial postings, the final set had 1,777 postings that had been cited at least twice from within the corpus and 2,248 with at least 2 outgoing links. Only 68 were in both sets. The authors duplicated those 68 to make a perfectly bipartite graph, which the model requires. Common pruning methods were applied to the vocabulary. The 2,248 citing postings were split evenly into two sets of 1,124. The paper doesn't explicitly state what happened to the 68 duplicated documents, which could be a potential source of overfitting.
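The filtering and duplication steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `build_bipartite_sets`, the `(citing_id, cited_id)` pair format, and the choice to count link degrees once over the raw corpus are all assumptions made here.

```python
from collections import Counter

def build_bipartite_sets(links, min_links=2):
    """Split postings into citing and cited sets (hypothetical helper).

    links: iterable of (citing_id, cited_id) pairs within the corpus.
    Assumption: degrees are counted once on the raw link set.
    """
    links = set(links)  # ignore repeated links between the same pair
    out_deg = Counter(src for src, _ in links)
    in_deg = Counter(dst for _, dst in links)

    citing = {p for p, n in out_deg.items() if n >= min_links}
    cited = {p for p, n in in_deg.items() if n >= min_links}
    overlap = citing & cited  # postings that appear in both roles

    # Tagging each id with its role duplicates the overlapping postings,
    # so the two node sets are disjoint and the graph is bipartite.
    citing_nodes = {("citing", p) for p in citing}
    cited_nodes = {("cited", p) for p in cited}
    edges = {
        (("citing", s), ("cited", d))
        for s, d in links
        if s in citing and d in cited
    }
    return citing_nodes, cited_nodes, edges, overlap

toy_links = [("a", "x"), ("a", "y"), ("b", "x"),
             ("b", "a"), ("c", "a"), ("c", "y")]
citing_nodes, cited_nodes, edges, overlap = build_bipartite_sets(toy_links)
```

In the toy example, posting `a` both cites and is cited, so it ends up as two separate nodes, one on each side of the graph. This mirrors the duplication of the 68 overlapping documents.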

Model

The model is a fairly standard generative graphical model (though not fully generative due to the finite number of links), and the paper follows common conventions in describing and presenting it. Here is the notation:

And here is the given generative story of how documents and links are generated:

Which is represented by the standard plate notation graphical model below:

Of particular note is $\Omega_{kd'}$, which is interpretable as the influence of document $d'$ in topic $k$. The two plates representing citing and cited documents are the reason a bipartite graph is necessary. And you can see this compared to Link-LDA:

To learn the posterior for Link-PLSA-LDA, they use a mean-field variational approximation represented by this graphical model:

The approximation is needed because exact posterior inference is intractable due to the pairwise coupling of $\theta$, $\beta$, and $\Omega$.
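To make the reading of $\Omega_{kd'}$ as topic-specific influence concrete, here is a small sketch. It assumes that each row of $\Omega$ is a distribution over cited documents for one topic, and that a citing document with topic mixture $\theta_d$ picks a citation by first drawing a topic and then drawing a cited document from that topic's row. That form is an assumption of this sketch, as are the names `omega`, `theta`, `top_influential`, and `citation_probs` and the toy numbers.

```python
# Hypothetical Omega for 2 topics and 4 cited documents.
# Row k is assumed to be a distribution over cited documents for topic k.
omega = [
    [0.6, 0.3, 0.1, 0.0],
    [0.1, 0.1, 0.4, 0.4],
]

def top_influential(omega, k, n=2):
    """Indices of the n cited documents with the largest Omega[k][d']."""
    row = omega[k]
    return sorted(range(len(row)), key=lambda j: -row[j])[:n]

def citation_probs(theta, omega):
    """Marginal probability of citing each document: sum_k theta[k] * Omega[k][d']."""
    n_docs = len(omega[0])
    return [
        sum(theta[k] * omega[k][j] for k in range(len(theta)))
        for j in range(n_docs)
    ]

theta = [0.75, 0.25]  # toy topic mixture of one citing document
probs = citation_probs(theta, omega)
```

Under these assumptions, a document that is mostly about topic 0 assigns most of its citation probability to the documents that are influential in topic 0.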

Experiments

To evaluate the model's performance, the authors compare the log-likelihood on unseen data; higher values are better. They first train the model's parameters using the entire set of cited postings and set I of the citing postings, then repeat the experiment with set II. The cumulative log-likelihood of the entire set of citing postings is obtained by summing the values from the two experiments. Again, I'm curious what the results would look like without the 68 duplicated documents; I fear that they inflate the log-likelihood. Results are shown in Figure 4.

Qualitative analysis of the model:

The authors also evaluate the model on link prediction, arguing that this is an indicator of good topical influence analysis. The setup is very similar to the log-likelihood experiment and also uses two-fold cross-validation. The baseline is the Link-LDA model. They focus only on how well the model ranks the postings that are actually hyperlinked, and only on the worst case: RKL is the rank of the last relevant document, that is, the rank of a citing posting's most poorly ranked true citation. Here are the values of RKL:
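The RKL metric as described above can be computed with a few lines of code. This is a sketch under stated assumptions, not the authors' evaluation script: the function name `rank_of_last_relevant`, the dictionary of scores, and the tie handling (ties keep dictionary order) are all choices made here for illustration.

```python
def rank_of_last_relevant(scores, true_citations):
    """Rank (1 = best) of the worst-ranked true citation.

    scores: dict mapping cited posting id -> model score (higher is better).
    true_citations: non-empty collection of ids the posting actually cites.
    """
    # Sort cited postings from highest to lowest score.
    ranking = sorted(scores, key=scores.get, reverse=True)
    rank = {doc: i + 1 for i, doc in enumerate(ranking)}
    # The metric looks only at the most poorly ranked true citation.
    return max(rank[doc] for doc in true_citations)

toy_scores = {"p1": 0.475, "p2": 0.25, "p3": 0.175, "p4": 0.1}
rkl = rank_of_last_relevant(toy_scores, {"p1", "p3"})
```

In the toy example the true citations sit at ranks 1 and 3, so the RKL is 3. A smaller RKL means the model places all true citations nearer the top of its ranking.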

Study Plan

LDA
Cohn_Hofmann PHITS (dubbed Link-PLSA in this paper)
Erosheva_et_al (dubbed Link-LDA in this paper)
