Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs

== Citation ==

Ramesh Nallapati and William Cohen. Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs. In Proceedings of ICWSM 2008.

== Online Version ==

Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs.

== Summary ==

This paper presents a novel unsupervised model of topics and topic-specific influence in blogs. It is compared with [[Mixed_membership_models_of_scientific_publication | Link-LDA]] and performs better. The model is designed to address two problems at once: topic discovery and modeling the topic-specific influence of blog posts.

When one blog cites another, the citation is treated as a single uni-directional hyperlink from the citing post to the cited post.

The model is not completely generative, since the set of hyperlinked (cited) documents is fixed in advance rather than generated.
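To make the two-part structure concrete, here is a minimal sketch of the generative story as I read it from the paper: the cited side is PLSA-like (no Dirichlet prior, fixed document set), the citing side is LDA-like, and the two sides share the per-topic word distributions and the per-topic influence distributions over cited posts. All symbol names and sizes below are illustrative, not taken from the paper's experiments.

<pre>
import numpy as np

rng = np.random.default_rng(0)

K, V, M = 5, 1000, 50  # topics, vocabulary size, cited posts (illustrative)
alpha = np.full(K, 0.1)                    # Dirichlet prior for citing posts
beta = rng.dirichlet(np.ones(V), size=K)   # per-topic word distributions
pi = rng.dirichlet(np.ones(K))             # topic proportions of the cited side
Omega = rng.dirichlet(np.ones(M), size=K)  # per-topic influence over cited posts

def generate_cited_word():
    """Cited side (PLSA-like): a topic, then a cited post and a word."""
    z = rng.choice(K, p=pi)
    l = rng.choice(M, p=Omega[z])  # which cited post this word lands in
    w = rng.choice(V, p=beta[z])
    return l, w

def generate_citing_post(n_words=100, n_links=3):
    """Citing side (LDA-like): words and hyperlinks share one theta."""
    theta = rng.dirichlet(alpha)
    words = [rng.choice(V, p=beta[rng.choice(K, p=theta)]) for _ in range(n_words)]
    links = [rng.choice(M, p=Omega[rng.choice(K, p=theta)]) for _ in range(n_links)]
    return words, links
</pre>

Under this story, Omega[z][l] is the influence of cited post l within topic z, which is what the qualitative analysis below reads off.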

== Dataset ==

The data covers the period July 4, 2005 through July 24, 2005. It is a set of over 8 million blog postings collected by [http://www.nielsenbuzzmetrics.com Nielsen Buzzmetrics], and was very noisy. A post was kept in the final dataset only if it had at least 2 outgoing or at least 2 incoming hyperlinks within the corpus. From the more than 8 million initial posts, the final set had 1,777 posts that had been cited at least twice from within the corpus and 2,248 posts with at least two outgoing links; only 68 posts were in both sets. The authors duplicated those 68 so that the citation graph becomes perfectly bipartite, with citing posts on one side and cited posts on the other. Common pruning methods were applied to the vocabulary. The citing set was then split evenly into two sets (I and II) of 1,124 posts each (the paper doesn't explicitly state what happened to the 68 duplicated documents, which could be a potential source of overfitting).
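A minimal sketch of this filtering-and-duplication step, assuming the raw corpus is available as a mapping from post IDs to the set of in-corpus posts they cite (the function and argument names here are hypothetical):

<pre>
def build_bipartite(out_links):
    # out_links: dict mapping post_id -> set of post_ids it cites in-corpus
    in_degree = {}
    for src, targets in out_links.items():
        for tgt in targets:
            in_degree[tgt] = in_degree.get(tgt, 0) + 1

    citing = {p for p, targets in out_links.items() if len(targets) >= 2}
    cited = {p for p, deg in in_degree.items() if deg >= 2}
    # Posts in both sets (68 in the paper) are kept on both sides, which is
    # the duplication that makes the graph perfectly bipartite.
    return citing, cited
</pre>

On the paper's corpus this step yields the 2,248 citing and 1,777 cited posts described above.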

== Model ==

=== Topic Discovery ===

=== Modeling Topic Specific Influence of Blogs ===

== Experiments ==

To evaluate the model's performance, the authors compare the log-likelihood it assigns to unseen data; higher values are better. They first train the model's parameters on all of the cited postings plus set I of the citing postings and evaluate on set II, then repeat the experiment with the roles of the two sets swapped. The cumulative log-likelihood over the entire set of citing postings is obtained by summing the held-out values from the two runs. Again, I'm curious what the results would look like without the 68 duplicated documents; I fear they inflate the log-likelihood. Results are shown in Figure 4.

[[File:link-plsa-lda-01.png]]
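A minimal sketch of this two-fold protocol, assuming a hypothetical model class with fit and log_likelihood methods (neither name comes from the paper):

<pre>
def cumulative_heldout_loglik(model_cls, cited, citing_I, citing_II):
    # Train on all cited posts plus one citing fold, score the other fold,
    # then swap the folds and sum the two held-out log-likelihoods.
    total = 0.0
    for train_fold, test_fold in [(citing_I, citing_II), (citing_II, citing_I)]:
        model = model_cls()
        model.fit(cited_docs=cited, citing_docs=train_fold)  # hypothetical API
        total += model.log_likelihood(test_fold)             # higher is better
    return total
</pre>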

Qualitative analysis of the model:

[[File:link-plsa-lda-02.png]]

The authors also evaluate the model on a link prediction task, arguing that good link prediction is an indicator of good topical influence analysis. The setup is very similar to the log-likelihood experiment and also uses two-fold cross-validation. The baseline is the [[Mixed_membership_models_of_scientific_publication | Link-LDA model]]. They look only at how well the model ranks the postings that are actually hyperlinked, and only at the worst case: for each citing post they record the rank of the last relevant document, i.e., its most poorly ranked true citation, which they call RKL (lower is better). Here are the values of RKL:

[[File:link-plsa-lda-03.png]]
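For reference, a minimal sketch of the RKL metric, assuming the model returns the cited posts ranked from most to least likely for a given citing post (all names here are illustrative):

<pre>
def rkl(ranked_cited_ids, true_citations):
    # Rank of the last relevant document: the 1-based position of the most
    # poorly ranked true citation in the model's ranking (lower is better).
    positions = [rank for rank, doc in enumerate(ranked_cited_ids, start=1)
                 if doc in true_citations]
    return max(positions)

rkl(["d3", "d1", "d7", "d2"], {"d1", "d2"})  # -> 4
</pre>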

== Study Plan ==