Mixed membership models of scientific publication

From Cohen Courses
Revision as of 21:27, 5 November 2012 by Ymiao

This is a Paper discussed in Social Media Analysis 10-802 in Fall 2012.

Citation

Elena Erosheva, Stephen Fienberg, John Lafferty. Mixed Membership Models of Scientific Publications. PNAS (101) 2004 pp. 5220-5227.

Online version

[www.cs.cmu.edu/~lafferty/pub/efl.pdf Mixed Membership Models of Scientific Publications]

Summary

This paper presents mixed membership models, which generalize various statistical models from genetics, social science, information retrieval and text mining. Although it emphasizes the generality of mixed membership models, the paper focuses specifically on the modeling of documents. In this setting, mixed membership means soft classification: each article is described by the proportions of its content that come from each category/topic. The authors describe the general form of mixed membership models in terms of four ingredients: population, subject, latent variable and sampling scheme. For the latent variables, there are two cases: in the first, the membership scores are treated as unknown constants; in the second, they are treated as realizations from a Dirichlet distribution.

For scientific publications, the authors employ a mixed membership model to describe the generation of contents and references. As a first attempt, the authors adopt the “bag of words” assumption for contents, treating each topical aspect as a multinomial distribution over words. Similarly, a “bag of references” can be assumed for a document, with each aspect being a distribution over references. Then, arguing that multinomial sampling is not a realistic description of how authors select references, the paper presents an alternative model for references.
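The generative process described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: it covers the Dirichlet case with the "bag of words" and "bag of references" assumptions, and it does not attempt the alternative reference model. The function names, the toy vocabularies, the reference labels and the value of `alpha` are all hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one vector from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, word_dists, ref_dists, n_words, n_refs, rng):
    """Generate one document's words and references.

    word_dists[k] and ref_dists[k] map items to probabilities for aspect k.
    """
    # Membership scores: proportions of the document drawn from each aspect.
    membership = sample_dirichlet(alpha, rng)
    aspects = list(range(len(alpha)))

    def draw(dists, count):
        items = []
        for _ in range(count):
            # Pick an aspect according to the membership scores ...
            k = rng.choices(aspects, weights=membership)[0]
            # ... then pick an item from that aspect's multinomial.
            vocab = list(dists[k])
            probs = [dists[k][v] for v in vocab]
            items.append(rng.choices(vocab, weights=probs)[0])
        return items

    return membership, draw(word_dists, n_words), draw(ref_dists, n_refs)


# Hypothetical toy aspects (two aspects, made-up words and references).
word_dists = [{"gene": 0.6, "protein": 0.4}, {"neuron": 0.7, "cortex": 0.3}]
ref_dists = [{"ref_A": 0.5, "ref_B": 0.5}, {"ref_C": 1.0}]

rng = random.Random(0)
membership, words, refs = generate_document(
    alpha=[0.5, 0.5], word_dists=word_dists, ref_dists=ref_dists,
    n_words=8, n_refs=3, rng=rng,
)
print(membership, words, refs)
```

Treating the membership scores as unknown constants instead would correspond to passing in a fixed `membership` vector rather than sampling it.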

A simple mathematical model is proposed to quantify selection bias. Using these bias values as features, the authors apply machine learning (classification) methods to distinguish experts from ordinary users. Experiments on the TurboTax dataset show that selection bias values are superior to other types of features derived from Z-scores or text analysis. Combining selection bias and text features further improves classification performance. A comparison of the classifiers shows that Gaussian classification performs consistently better than linear regression and logistic regression.

Dataset

The TurboTax dataset used in this paper was collected from TurboTax Live Community, a CQA (community question answering) site on the preparation of tax returns. Some statistics about the dataset are:

 - Questions: 633112; askers: 525143
 - Answers: 688390; answerers: 130770
 - 83 experts selected by TurboTax employees 
 - 1367 answerers have provided at least 10 answers

Evaluation

The authors adopt Precision, Recall and F-score as evaluation metrics. The following conclusions arise from their evaluations:

  • CQA experts tend to answer questions with low completeness, which makes their responses more valuable.
  • The selection bias scores modeled in this paper indicate whether a user is an expert. These bias scores prove to be effective features for identifying CQA experts.
  • On the task of expert identification, Gaussian classification achieves better results than linear regression and logistic regression.
  • Selection bias is not influenced by the dynamics of CQA sites and can be considered an intrinsic characteristic of CQA users.
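
For reference, the Precision, Recall and F-score metrics used in this evaluation can be computed as in the short sketch below. It assumes the balanced F-score (F1); the summary does not say which variant the authors use. The function name and the user ids are hypothetical.

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall and F1 for expert identification.

    predicted: set of user ids the classifier labels as experts.
    actual: set of user ids that are experts according to the ground truth.
    """
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 4 users predicted as experts, 3 true experts, 2 in common.
predicted = {"u1", "u2", "u3", "u4"}
actual = {"u1", "u2", "u5"}
precision, recall, f1 = precision_recall_f1(predicted, actual)
print(precision, recall, f1)
```

In this toy example precision is 2/4, recall is 2/3, and F1 is their harmonic mean, 4/7.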

Discussion

(+) plus points, (-) minus points

  • (+) This paper falls into the area of expert search, which is an important problem in CQA research. The authors present interesting observations on the selection bias of expert users in CQA. These findings are useful for question recommendation. For example, questions with low completeness (few answers) should be recommended to experts.
  • (-) The mathematical model for computing selection bias is fairly straightforward. Also, the authors rely on commonly used classifiers for expert identification rather than developing more sophisticated approaches. Thus, I would take this paper as an empirical study whose emphasis is on observations about the selection bias concept.
  • (-) Most of the work is based specifically on the TurboTax dataset, which may limit the applicability of the approach. For example, TurboTax has manual expert judgments, which are not available in other datasets. Without such judgments, expert identification cannot be cast as a classification problem.

Related papers

Here are two papers related to this work.

 - Give a detailed overview of CQA expert search
 - Simulate asking and answering behaviors using a generative model
 - Perform expert search based on user interests which are represented by latent topics
- Also an empirical study, focusing on user interaction and category characteristics
- Study user interests in terms of cross-category entropy, and show that this entropy highly correlates with expertise/rates
- Use the Yahoo Answers dataset, which is commonly used in CQA research