Comparison mixed membership topic poisson

From Cohen Courses
Papers

Elena Erosheva, Stephen Fienberg, John Lafferty. Mixed Membership Models of Scientific Publications. PNAS (101) 2004.

Tae Yano and Noah A. Smith. What’s Worthy of Comment? Content and Comment Volume in Political Blogs. Proc of ICWSM 2010.

Problems

The two papers address different problems. The PNAS paper aims to model scientific publications with mixed membership models; its emphasis is on adding references as an additional source of information and modeling article contents and references simultaneously. The ICWSM paper, in contrast, focuses on predicting the volume of comments a blog post will receive. The problem is simplified to predicting whether a post will receive a higher-than-average comment volume, rather than predicting the absolute number of comments. Comment volume is measured either by word count or by number of comments.
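The simplification to a binary task in the ICWSM paper can be sketched in a few lines. This is only an illustration: the function name, the made-up comment counts, and the choice of a strict "greater than the mean" threshold computed over the same posts are assumptions, not details taken from the paper.

```python
from statistics import mean


def label_high_volume(volumes):
    """Return 1 for posts whose volume is strictly above the mean, else 0.

    Assumption: the threshold is the mean over the posts passed in.
    """
    threshold = mean(volumes)
    return [1 if v > threshold else 0 for v in volumes]


# Hypothetical comment counts for five posts (made-up numbers).
comment_counts = [3, 50, 12, 41, 4]
labels = label_high_volume(comment_counts)
```

The same function applies unchanged when volume is measured in words rather than in number of comments.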

Method

The methods in both papers are LDA or variations of it. In the PNAS paper, the authors use mixed membership models, which can be viewed as a generalization of LDA and of similar models from other areas. To incorporate references, the LDA model is extended to model article contents and citations at the same time. Furthermore, an alternative to the "bag of references" representation is proposed: references are drawn with a combination of multinomial and Bernoulli sampling. This is shown to be more consistent with the way references are actually selected.

In the ICWSM paper, the proposed Topic-Poisson model is based directly on LDA. For each blog post d, the text is generated in the same way as in LDA. After the text is generated, the comment volume, denoted v_d, is drawn from a mixture distribution whose mixture weights are the topic proportions of the post. The model parameters are estimated with Gibbs sampling.
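The generative story described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes, as the model's name suggests, that each mixture component is a Poisson distribution with a topic-specific rate, and all names and numbers (`alpha`, `topic_word`, `rates`, the toy vocabulary size) are hypothetical.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw topic proportions by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw one index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def generate_post(alpha, topic_word, rates, n_words, rng):
    """Generate one post: LDA-style words, then a comment volume v_d."""
    theta = sample_dirichlet(alpha, rng)          # topic proportions of the post
    words = []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)        # topic of this word
        words.append(sample_categorical(topic_word[z], rng))
    z_v = sample_categorical(theta, rng)          # mixture component for the volume
    volume = sample_poisson(rates[z_v], rng)      # assumed Poisson component
    return theta, words, volume


def expected_volume(theta, rates):
    """Mean of the mixture: topic proportions weight the per-topic rates."""
    return sum(t * r for t, r in zip(theta, rates))


# Toy setup: 2 topics over a 3-word vocabulary (all values made up).
rng = random.Random(0)
alpha = [0.5, 0.5]
topic_word = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
rates = [5.0, 30.0]
theta, words, volume = generate_post(alpha, topic_word, rates, 20, rng)
```

Under this reading, a post dominated by a topic with a high rate is expected to attract more comments, which `expected_volume(theta, rates)` makes explicit.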

Datasets

The PNAS paper uses the PNAS archive of Biological Sciences articles published between 1997 and 2001 as its dataset. The dataset contains a total of 11,981 articles and 77,115 unique references.

The ICWSM paper builds its dataset by collecting blog posts from two websites: Matthew Yglesias and RedState. Stop words are removed during text preprocessing. The mean comment volume is approximately 1424 words (35 comments) for Matthew Yglesias and 819 words (29 comments) for RedState.

Comments

- Similarity. The methods in both papers are derived from LDA. Beyond the basic textual content, both incorporate an additional form of information: references in the PNAS paper and comment volume in the ICWSM paper.

- Difference. Because comment volume prediction is formulated as a classification task, it is convenient to evaluate quantitatively and to compare methods on the same dataset and metrics. The ICWSM paper compares the proposed Topic-Poisson model with Naive Bayes and linear regression. In contrast, the PNAS paper conducts only qualitative evaluations, showing the discovered topics along with the high-probability words and references for individual aspects.

Questions

1. How much time did you spend reading the (new, non-wikified) paper you summarized? 2 hours

2. How much time did you spend reading the old wikified paper? 45 minutes

3. How much time did you spend reading the summary of the old paper? 10 minutes

4. How much time did you spend reading background material? 1 hour

5. Was there a study plan for the old paper? Yes

  5.1 If so, did you read any of the items suggested by the study plan?   Yes
  5.2 And how much time did you spend reading them?   1.5 hours