Compare Yano et al NAACL 2009 Link PLSA LDA

From Cohen Courses

Papers

The papers are:

  • Yano et al., NAACL 2009: Predicting Response to Political Blog Posts with Topic Models
  • Nallapati and Cohen, ICWSM 2008: Link-PLSA-LDA: A New Unsupervised Model for Topics and Influence of Blogs

Comparison

Both of these papers are extensions of Link-LDA and use a blog dataset. Yano et al. try to predict which user will comment on a blog post, whereas Nallapati and Cohen try to predict which blog will link to another blog.

Method

Link-LDA is the basis for both papers, and its graphical model is shown here:

[Figure: Link lda model.png — the Link-LDA graphical model]
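
To make the shared starting point concrete, below is a minimal generative sketch of Link-LDA in Python. The topic count, vocabulary size, number of link targets, and hyper-parameter values are illustrative choices rather than values from either paper; each paper replaces the generic "link" variable with its own target type (commenting users or cited blogs).

 import numpy as np
 
 rng = np.random.default_rng(0)
 
 K = 5        # number of topics (illustrative)
 V = 1000     # vocabulary size (illustrative)
 T = 50       # number of possible link targets (illustrative)
 alpha = 0.1  # symmetric Dirichlet prior on per-document topic proportions
 eta = 0.01   # symmetric Dirichlet prior on topic-level distributions
 
 # Topic-specific distributions over words and over link targets.
 beta = rng.dirichlet(np.full(V, eta), size=K)    # K x V
 omega = rng.dirichlet(np.full(T, eta), size=K)   # K x T
 
 def generate_document(n_words=100, n_links=10):
     """Generate one document's words and links under Link-LDA."""
     theta = rng.dirichlet(np.full(K, alpha))     # per-document topic mixture
     words, links = [], []
     for _ in range(n_words):
         z = rng.choice(K, p=theta)               # topic for this word slot
         words.append(rng.choice(V, p=beta[z]))   # word drawn from topic z
     for _ in range(n_links):
         z = rng.choice(K, p=theta)               # topic for this link slot
         links.append(rng.choice(T, p=omega[z]))  # link target drawn from topic z
     return theta, words, links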

Nallapati and Cohen extend it with this model:

[Figure: Link plsa lda model.png — the Link-PLSA-LDA graphical model]

The notation is given here:

[Figure: Link plsa lda notation.png — notation for Link-PLSA-LDA]

The biggest difference is that this model also captures the text of the cited documents. It is worth noting that the same priors and topic distributions are shared between the citing documents d and the cited documents d′.
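
As a rough sketch of that structural point (not the paper's exact parametrization), the snippet below has the citing side generate words and hyperlinks Link-LDA-style, while the cited side's text is generated PLSA-style from the same shared topic-word and topic-to-cited-document distributions. All sizes and hyper-parameter values are illustrative.

 import numpy as np
 
 rng = np.random.default_rng(0)
 
 K, V, D_cited = 5, 1000, 50   # topics, vocabulary, number of cited documents (illustrative)
 alpha, eta = 0.1, 0.01        # symmetric Dirichlet hyper-parameters (illustrative)
 
 # Shared across both halves of the model.
 beta = rng.dirichlet(np.full(V, eta), size=K)         # topic -> word
 omega = rng.dirichlet(np.full(D_cited, eta), size=K)  # topic -> cited document
 pi = rng.dirichlet(np.ones(K))                        # corpus-level topic mixture for the cited side
 
 def generate_cited_side(n_tokens=500):
     """PLSA-style generation of word tokens in the cited documents (simplified)."""
     tokens = []
     for _ in range(n_tokens):
         z = rng.choice(K, p=pi)                    # topic from the corpus-level mixture
         d_prime = rng.choice(D_cited, p=omega[z])  # which cited document the token belongs to
         w = rng.choice(V, p=beta[z])               # word drawn from the shared topics
         tokens.append((d_prime, w))
     return tokens
 
 def generate_citing_document(n_words=100, n_links=10):
     """Link-LDA-style generation of a citing document's words and hyperlinks."""
     theta = rng.dirichlet(np.full(K, alpha))
     words = [rng.choice(V, p=beta[rng.choice(K, p=theta)]) for _ in range(n_words)]
     links = [rng.choice(D_cited, p=omega[rng.choice(K, p=theta)]) for _ in range(n_links)]
     return theta, words, links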

Yano et al. extend Link-LDA this way:

[Figure: Comment LDA.png — Yano et al.'s CommentLDA graphical model]

Here, u indicates that a user commented on the blog post, and w′ represents the words in the comment. The extension over Link-LDA is that the words in the comment are also modeled (not just who will comment).
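
Here is a minimal sketch of this extension, assuming the CommentLDA-style process in which each comment token draws a topic from the post's topic mixture and then both a commenter identity and a comment word from topic-specific distributions. Sizes and hyper-parameters are again illustrative.

 import numpy as np
 
 rng = np.random.default_rng(0)
 
 K, V, U = 5, 1000, 200   # topics, vocabulary size, number of users (illustrative)
 alpha, eta = 0.1, 0.01   # symmetric Dirichlet hyper-parameters (illustrative)
 
 beta = rng.dirichlet(np.full(V, eta), size=K)        # topic -> words in post bodies
 beta_prime = rng.dirichlet(np.full(V, eta), size=K)  # topic -> words in comments (separate distribution)
 gamma = rng.dirichlet(np.full(U, eta), size=K)       # topic -> commenting users
 
 def generate_post(n_body_words=100, n_comment_tokens=30):
     """Generate one post's body words and its comment tokens as (user, word) pairs."""
     theta = rng.dirichlet(np.full(K, alpha))          # topic mixture for this post
     body = [rng.choice(V, p=beta[rng.choice(K, p=theta)]) for _ in range(n_body_words)]
     comments = []
     for _ in range(n_comment_tokens):
         z = rng.choice(K, p=theta)            # topic for this comment token
         u = rng.choice(U, p=gamma[z])         # which user contributed it
         w = rng.choice(V, p=beta_prime[z])    # the comment word itself
         comments.append((u, w))
     return theta, body, comments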

The differences between the two models are shown below (highlighted in orange). Note that Yano et al. model the words in comments with a separate topic-word distribution, which has its own hyper-parameter and prior distinct from the one for the blog post text. Link-PLSA-LDA does not do this because it only models the words in blog posts.

[Figure: Link differences.png — differences between the two models, highlighted in orange]

Datasets Used

  • Yano et al. use a corpus of blog posts from 40 different blog sites focusing on American politics, spanning November 2007 to October 2008 (right up to a presidential election). Diversity in political leanings was emphasized, and five blogs were chosen for the final selection.
  • Nallapati and Cohen also use a corpus of blogs, but theirs was collected from July 2004 to July 2005. The initial dataset was noisy, with many broken links and useless information, so the authors restricted it to blogs with at least two incoming or two outgoing links within the corpus. Unlike Yano et al. (who used only five blogs), there was no restriction to specific blog sites.

Problem

Both of these papers deal with the link prediction problem, but they look at different kinds of links. Yano et al. define a link as a user commenting on a post on a specific blog site, while Nallapati and Cohen look at blogs linking to each other. Both are interesting takes on the common link prediction problem. It is interesting to model link prediction as a generative process with topic models, and both papers do a good job of this.

Big Idea

The big idea in both papers is that link prediction can be treated as a distribution over links and words. Conditioning the distribution over links on the words of the original document and of the linked documents (either the comments on a blog post or the linked blog) helps with this task.
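
Schematically, in generic notation rather than either paper's exact symbols, both models rank candidate link targets c for a document d (commenters or cited blogs) by marginalizing over the document's inferred topic proportions:

 \[
   p(c \mid \mathbf{w}_d) \;\approx\; \sum_{k=1}^{K} p(c \mid z = k)\,\hat{\theta}_{d,k},
   \qquad \hat{\theta}_{d,k} \approx p(z = k \mid \mathbf{w}_d)
 \]

where the topic mixture is inferred from the document's words and each p(c | z = k) is a topic-specific distribution over link targets.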

The interesting part of these models is what they tell us about the relationships between documents. This is useful across social media and in networks in general: we can describe likely facts about relationships and about the users who create them.

Other

I would say that both papers are similar but not necessarily influenced by each other. Since they share an author, I'm certain that Yano et al. were aware of the earlier paper, but they are tackling a different problem. The corpora are similar (at least in domain), but the goals of the link prediction problems differ. There is no way to directly compare the two methods because the problems are different. Both are interesting extensions of Link-LDA, but they are fundamentally different methods.

Questions

  1. How much time did you spend reading the (new, non-wikified) paper you summarized? About 2 hours
  2. How much time did you spend reading the old wikified paper? About 2 hours
  3. How much time did you spend reading the summary of the old paper? About 15 min
  4. How much time did you spend reading background material? N/A. My final project for the class is in this area, so I've read a lot of background papers.
  5. Was there a study plan for the old paper? Yes
    1. If so, did you read any of the items suggested by the study plan, and how much time did you spend reading them? I had actually read the papers before, as they are directly related to my research with my advisor. I do a lot of Gibbs sampling on graphical models (in particular, topic-model derivatives), and that fits into the study plan.
  6. Give us any additional feedback you might have about this assignment. I liked this comparison; it was a nice way to view the papers in a different light and really made them stick in my memory. In general, I like the wikifying and used it extensively for the project (and will probably use it for my research after the class is over).