Difference between revisions of "Compare Yano et al NAACL 2009 Link PLSA LDA"

From Cohen Courses

Revision as of 21:32, 1 December 2012

Papers

The papers are:

Comparison

Both of these papers are extensions of Link-LDA and use a blog dataset. Yano et al. try to predict which users will comment on a blog post, whereas Nallapati and Cohen try to predict which blog will link to another blog.

Method

Link-LDA is the basis for both papers; its graphical model is shown here:

Link lda model.png

Nallapati and Cohen extend it with this model:

Link plsa lda model.png

The biggest difference is that this models the text of the cited documents as well. It is worth noting that the same priors and are used for and d'.

Yano et al. extend Link-LDA this way:

Comment LDA.png

Here u indicates that a user commented on the blog post, and w' denotes the words in the comment. The extension over Link-LDA is that the words in the comments are also modeled (not just who will comment).

The differences between the two models are shown below (highlighted in orange). Note that β' models the words in comments; it is a separate hyperparameter (prior) from the one used for the text of the blog posts. Link-PLSA-LDA has no counterpart because it models only the words in blog posts.

Link differences.png
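
The role of the separate comment-word distribution can be illustrated with a toy sampler. This is a minimal sketch, not the authors' implementation: the number of topics, the vocabulary sizes, the symmetric Dirichlet value, and the per-topic user distribution (called gamma here) are all assumptions made for illustration. The only point it makes is that post words are drawn from beta while comment words are drawn from a different set of distributions, beta', with both sharing the post's topic mixture theta.

```python
import random

def dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalizing gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def categorical(probs, rng):
    """Sample an index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_post(n_post_words, n_comment_words, beta, beta_prime, gamma, alpha, rng):
    """Generate one post and its comments under the assumed toy process."""
    theta = dirichlet(alpha, len(beta), rng)  # topic mixture shared by post and comments
    post_words = []
    for _ in range(n_post_words):
        z = categorical(theta, rng)
        post_words.append(categorical(beta[z], rng))   # post word from beta
    comments = []
    for _ in range(n_comment_words):
        z = categorical(theta, rng)
        u = categorical(gamma[z], rng)                 # commenting user (gamma is assumed)
        w_prime = categorical(beta_prime[z], rng)      # comment word from beta'
        comments.append((u, w_prime))
    return post_words, comments

# Hypothetical sizes, chosen only to keep the example small.
K, V_POST, V_COMMENT, N_USERS = 3, 20, 15, 5
rng = random.Random(0)
beta = [dirichlet(0.5, V_POST, rng) for _ in range(K)]
beta_prime = [dirichlet(0.5, V_COMMENT, rng) for _ in range(K)]
gamma = [dirichlet(0.5, N_USERS, rng) for _ in range(K)]

post_words, comments = generate_post(8, 6, beta, beta_prime, gamma, 0.5, rng)
```

Dropping `beta_prime` and the comment-word draw leaves a sampler that generates only post words and user links, which is the contrast the figure above highlights.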

Datasets Used

  • Yano et al. use a corpus of blog posts from 40 different blog sites focusing on American politics, collected from November 2007 to October 2008 (right up to a presidential election). Five blogs were chosen for the final selection, with an emphasis on diversity in political leanings.
  • Nallapati and Cohen also use a corpus of blogs, but these were collected from July 2004 to July 2005. Initially, it was a noisy dataset with many broken links and much useless information. The authors kept only blogs with a minimum of 2 incoming or 2 outgoing links within the corpus. Unlike Yano et al., who worked with only 5 blog sites, they did not rely on posts coming from specific blog sites.
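
The link-count constraint described for the Nallapati and Cohen corpus can be sketched as a simple degree filter. This is an illustrative assumption about how such a filter might look, not the authors' preprocessing code: the function name, the edge-list input format, and the single-pass behavior are all hypothetical. In particular, the summary does not say whether link counts were recomputed after removing blogs, so this sketch does one pass only.

```python
from collections import Counter

def filter_by_link_count(links, min_links=2):
    """Return blogs with at least `min_links` incoming or outgoing links.

    `links` is an iterable of (source_blog, target_blog) pairs.
    """
    links = list(links)  # allow generators; we iterate twice
    out_deg = Counter(src for src, _ in links)
    in_deg = Counter(dst for _, dst in links)
    blogs = set(out_deg) | set(in_deg)
    return {b for b in blogs
            if in_deg[b] >= min_links or out_deg[b] >= min_links}

# Toy edge list (made up for illustration).
toy_links = [("a", "b"), ("a", "c"), ("d", "b"), ("e", "f")]
kept = filter_by_link_count(toy_links)
```

In the toy data, "a" is kept for its two outgoing links and "b" for its two incoming links; the remaining blogs have only one link each and are dropped.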

Problem

Big Idea

Other

Questions

  1. How much time did you spend reading the (new, non-wikified) paper you summarized? About 2 hours
  2. How much time did you spend reading the old wikified paper? About 2 hours
  3. How much time did you spend reading the summary of the old paper? About 15 min
  4. How much time did you spend reading background material? N/A. My final project for the class is in this area, so I've read a lot of background papers.
  5. Was there a study plan for the old paper? Yes
    1. if so, did you read any of the items suggested by the study plan? and how much time did you spend with reading them? I had actually read the papers before as it is directly related to my research with my advisor. I do a lot of Gibbs Sampling on graphical models (in particular topic-model derivatives) and that fits into the study plan
  6. Give us any additional feedback you might have about this assignment. I like this comparison. It was a nice way to view the papers in a different light and really made it stick in my memory. In general, I like the wikifying and used it extensively for the project (and probably will use this for my research in the future after the class is over).