Yano et al NAACL 2009
This [[Category::Paper]] is available online [http://www.cs.cmu.edu/~nasmith/papers/yano+cohen+smith.naacl09.pdf].

== Citation ==

Tae Yano, William W. Cohen, and Noah A. Smith. Predicting Response to Political Blog Posts with Topic Models. In Proc. of NAACL 2009.

== Online Version ==

[http://www.cs.cmu.edu/~nasmith/papers/yano+cohen+smith.naacl09.pdf Predicting Response to Political Blog Posts with Topic Models]
 
== Summary ==

This [[Category::Paper]] describes a [[UsesMethod::topic model]]-based approach to modeling the generation of blog text (both posts and comments). In essence, given the text content of a blog post, the model predicts which users will comment on that post.

== Brief description of the method ==

This paper expands upon LinkLDA, presented in [[RelatedPaper::Erosheva 2004 Mixed membership models of scientific publications|Erosheva et al. (2004)]].

[[Image:link_LDA.png|250px]]

Here, <math>\theta</math> is a distribution over topics, <math>\beta</math> is a multinomial distribution over post words, and <math>\gamma</math> is a multinomial distribution over (comment) users.
<math>z</math> is the topic for post words, while <math>z^{\prime}</math> is the topic for comment users.
<math>w</math> represents the specific post word, and <math>u</math> represents the user that commented on that particular post (hence, <math>w</math> and <math>u</math> are both observed during training).
<math>N</math> and <math>M</math> are the word counts in the post and in all of its comments, respectively.

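The generative story above can be summarized in a short sketch. This is not the authors' code, just a minimal illustration of LinkLDA's per-document sampling, with illustrative variable names (<code>alpha</code> is the Dirichlet prior over topics; <code>beta</code> and <code>gamma</code> are the per-topic word and user multinomials as matrices):

```python
import numpy as np

def generate_post(alpha, beta, gamma, N, M, rng):
    """Sketch of LinkLDA's generative process for one blog post."""
    # theta ~ Dirichlet(alpha): this post's distribution over K topics
    theta = rng.dirichlet(alpha)
    K = len(alpha)
    # Post words: for each of N word slots, draw topic z, then word w ~ beta[z]
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)
        words.append(rng.choice(beta.shape[1], p=beta[z]))
    # Comment users: for each of M slots, draw topic z', then user u ~ gamma[z']
    users = []
    for _ in range(M):
        z_prime = rng.choice(K, p=theta)
        users.append(rng.choice(gamma.shape[1], p=gamma[z_prime]))
    return words, users
```

Note that the post words and the commenting users share the same <math>\theta</math>, which is what ties users to the post's topics.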
Although LinkLDA can model which users are likely to respond to a post, it does not model the comment text they will write.
The authors expand on this by proposing CommentLDA, as shown below.

[[Image:comment_LDA.png|300px]]
<math>w^{\prime}</math> represents the specific comment word, and is also observed during training.
In CommentLDA, note that the comment text is modeled by the distribution over comment words given topics, <math>\beta^{\prime}</math>.

The authors provide three variations on how the comments are counted: by verbosity (one user token per comment word), by response (each commenter counted once per post), or by comments (each commenter counted once per comment).

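The extra generative step in CommentLDA can be sketched as follows. This is an illustration of the verbosity-counting variant, where each comment-word position carries its own user token; names like <code>beta_prime</code> are illustrative, not from the authors' code:

```python
import numpy as np

def generate_comments(theta, beta_prime, gamma, M, rng):
    """Sketch of CommentLDA's comment generation for one post."""
    # For each of M comment-word positions: draw topic z' from the post's
    # theta, then the commenting user u ~ gamma[z'] and the comment word
    # w' ~ beta'[z'] from that same topic.
    K = len(theta)
    users, comment_words = [], []
    for _ in range(M):
        z_prime = rng.choice(K, p=theta)
        users.append(rng.choice(gamma.shape[1], p=gamma[z_prime]))
        comment_words.append(rng.choice(beta_prime.shape[1], p=beta_prime[z_prime]))
    return users, comment_words
```

The key difference from LinkLDA is that each drawn topic now emits a comment word as well as a user, so the comment vocabulary informs the learned topics.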
== Experimental Result ==

'''Task''': given a training dataset consisting of a collection of blog posts with their commenters and comments, and an unseen test dataset from a later time period, predict who is going to comment on each new blog post in the test set.
The evaluation is done using precision (macro-averaged across posts) of the predictions at various cut-offs <math>n</math>.

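The metric is straightforward; a minimal sketch (illustrative function and argument names):

```python
def precision_at_n(ranked_users_per_post, true_commenters_per_post, n):
    """Macro-averaged precision at cut-off n.

    For each post, take the top-n predicted users and measure the fraction
    who actually commented, then average those fractions across posts.
    """
    precisions = []
    for ranked, truth in zip(ranked_users_per_post, true_commenters_per_post):
        top = ranked[:n]
        precisions.append(sum(u in truth for u in top) / n)
    return sum(precisions) / len(precisions)

# Example: two posts, cut-off n=2.
# Post 1 top-2 = [a, b], only a commented -> 0.5
# Post 2 top-2 = [b, a], both commented  -> 1.0
# Macro-average = 0.75
score = precision_at_n([["a", "b", "c"], ["b", "a", "c"]],
                       [{"a"}, {"a", "b"}], 2)
```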
The authors have released the [[UsesDataset::Yano & Smith blog dataset|Yano & Smith blog dataset]], which was used for this evaluation.
This dataset contains posts and comments from the following five sites: [http://www.thecarpetbaggerreport.com/ CarpetBagger] (CB), [http://www.dailykos.com/ Daily Kos] (DK), [http://matthewyglesias.theatlantic.com Matthew Yglesias] (MY), [http://www.redstate.com/ Red State] (RS), and [http://www.rightwingnews.com/ Right Wing News] (RWN).

The compared models were:
* Baseline: post-independent prediction that ranks users by their comment frequency
* Naive Bayes: with word counts in the post's main entry as features
* LinkLDA: 3 variations (verbosity, response, comments)
* CommentLDA: 3 variations (verbosity, response, comments)

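The baseline is simple enough to state in a few lines. A sketch, assuming the training data is available as (post, user) comment pairs (an assumed representation, not the paper's):

```python
from collections import Counter

def frequency_baseline(training_comments):
    """Post-independent baseline: rank users by training comment frequency.

    training_comments: iterable of (post_id, user_id) pairs. The returned
    ranking is the same for every test post, which is what makes this a
    baseline that ignores the post's content entirely.
    """
    counts = Counter(user for _post, user in training_comments)
    return [user for user, _count in counts.most_common()]
```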
'''Results''':

[[File:yano_results.png]]

* Some improvement over both the baseline and Naive Bayes for 3 out of the 5 sites
* LinkLDA usually works slightly better than CommentLDA
* Varying the counting method can bring as much as a 10% performance gain
* Results suggest that commenters on different sites behave differently

== Discussion ==
While the proposed method (CommentLDA) achieves mixed results on the prediction task, it provides a way to understand and summarize the data.
For example, when CommentLDA was applied to the MY data, the model clustered words quite reasonably into topics pertaining to religion and domestic policy (see figure below).

[[File:yano_topics.png]]

== Related Papers ==

* [[RelatedPaper::Erosheva 2004 Mixed membership models of scientific publications|Erosheva et al. (2004)]]
* [[RelatedPaper::Nallapati 2008 Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs|Nallapati et al. (2008)]]

== Study Plan ==

Papers/articles/blogs/videos you may want to read to understand this paper:

* David Blei, Andrew Ng, and Michael Jordan. "Latent Dirichlet Allocation". JMLR, vol. 3, pp. 993-1022 (2003). [http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf pdf]
** [http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation Wikipedia article on LDA]
** [http://videolectures.net/mlss09uk_blei_tm/ Topic Models - videolectures.net]
** [http://www.umiacs.umd.edu/~resnik/pubs/gibbs.pdf Gibbs Sampling for the Uninitiated]
** [http://www.arbylon.net/publications/text-est.pdf Parameter estimation for text analysis]
** [http://www.mblondel.org/journal/2010/08/21/latent-dirichlet-allocation-in-python/ Latent Dirichlet Allocation in Python]

Latest revision as of 08:14, 2 October 2012