Mixed membership models of scientific publication
This is a paper discussed in Social Media Analysis 10-802 in Fall 2012.
Citation
Elena Erosheva, Stephen Fienberg, John Lafferty. Mixed Membership Models of Scientific Publications. PNAS (101) 2004 pp. 5220-5227.
Online version
[www.cs.cmu.edu/~lafferty/pub/efl.pdf Mixed Membership Models of Scientific Publications]
Summary
This paper presents mixed membership models, which generalize various statistical models from genetics, social science, information retrieval and text mining. Although it emphasizes the generality of mixed membership models, the paper focuses specifically on modeling documents. In this setting, mixed membership means a soft classification of each article, based on the proportions of the article’s content that come from each category/topic. The authors describe the general form of mixed membership models in terms of four ingredients: population, subject, latent variable and sampling scheme. For the latent variables, there are two cases: first, the membership scores are treated as unknown constants; second, the membership scores are treated as realizations from Dirichlet distributions.
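To make the second case concrete, here is a minimal sketch of drawing one article's membership scores from a Dirichlet distribution. It uses the standard construction of normalizing independent Gamma draws. The function name `sample_membership`, the number of topics and the hyperparameter values are illustrative assumptions, not taken from the paper.

```python
import random

def sample_membership(alpha, rng):
    """Draw membership scores from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
alpha = [0.5, 0.5, 0.5]  # hypothetical hyperparameters for 3 topics
theta = sample_membership(alpha, rng)

# theta is a soft classification: one non-negative score per topic, summing to 1.
print(theta)
```

Each entry of `theta` can be read as the proportion of the article attributed to one topic, which is the sense of "soft classification" used above.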
For scientific publications, the authors employ a mixed membership model of how contents and references are generated. As a first attempt, the authors adopt the “bag of words” assumption for contents, and each topical aspect is treated as a multinomial distribution over words. Similarly, a "bag of references" can be assumed for a document, with each aspect being a distribution over references. Then, arguing that multinomial sampling is not a realistic description of how authors select references, the paper presents an alternative model for references. In this alternative, a reference list is generated by a two-step combination of multinomial and Bernoulli draws.
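The generative story above can be sketched in a few lines. The word step follows the "bag of words" description directly: pick an aspect from the membership scores, then pick a word from that aspect's multinomial. The reference step is only one possible reading of the "two-step combination of multinomial and Bernoulli draws", since this summary does not spell out the details: for each candidate reference, an aspect is drawn, then a Bernoulli draw decides inclusion. All names, vocabularies and probabilities here are hypothetical.

```python
import random

def generate_words(theta, word_dists, n_words, rng):
    """Bag of words: draw an aspect per word, then a word from that aspect."""
    aspects = list(range(len(theta)))
    words = []
    for _ in range(n_words):
        k = rng.choices(aspects, weights=theta)[0]
        vocab = list(word_dists[k].keys())
        probs = list(word_dists[k].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return words

def generate_references(theta, ref_probs, rng):
    """Assumed two-step scheme: aspect draw, then Bernoulli inclusion per reference."""
    aspects = list(range(len(theta)))
    refs = []
    for r in range(len(ref_probs[0])):
        k = rng.choices(aspects, weights=theta)[0]   # multinomial step
        if rng.random() < ref_probs[k][r]:           # Bernoulli step
            refs.append(r)
    return refs

rng = random.Random(1)
theta = [0.7, 0.3]  # hypothetical membership scores for 2 aspects
word_dists = [
    {"species": 0.5, "genetic": 0.3, "evolution": 0.2},
    {"protein": 0.6, "binding": 0.4},
]
# ref_probs[k][r]: inclusion probability of reference r under aspect k (made up)
ref_probs = [[1.0, 0.0, 0.5], [1.0, 0.0, 0.2]]

words = generate_words(theta, word_dists, 10, rng)
refs = generate_references(theta, ref_probs, rng)
```

Unlike a multinomial over references, the Bernoulli step lets each reference appear at most once in the list, which is closer to how a real bibliography behaves.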
A simple mathematical model is proposed to quantitatively compute the selection bias. Using these bias values as features, the authors apply machine learning (classification) methods to distinguish experts from ordinary users. Experiments with the TurboTax dataset show that selection bias values are superior to other types of features derived from Z-scores or text analysis. Combining selection bias and text features further improves classification performance. A comparison of the classifiers shows that Gaussian classification performs consistently better than linear regression and logistic regression.
Dataset
The PNAS dataset used in this paper consists of Biological Sciences articles published between 1997 and 2001 in the Proceedings of the National Academy of Sciences. In preprocessing, the authors exclude corrections, commentaries, letters, and reviews because these are not traditional research reports. Articles without references or abstracts are also excluded. Some statistics about the dataset are:
- Articles - Unique Words - Unique References - Subtopics
Evaluation
Because topic models are generally difficult to evaluate, the authors present some qualitative experiments on the PNAS dataset. The number of aspects is set to 8. Some observations are:
- The paper reports the 15 highest-probability words for each aspect, with "stop words" filtered out. These high-probability words provide useful indications of the underlying semantics of individual aspects. For example, words such as "species", "genetic" and "evolution" suggest that the aspect concerns "molecular evolution".
- On the task of expert identification, Gaussian classification achieves better results than linear regression and logistic regression.
- Selection bias is not influenced by the dynamics of CQA sites, and can be considered an intrinsic characteristic of CQA users.
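The top-word listing described in the first observation above can be sketched as follows. This is a minimal illustration, assuming each aspect is available as a dictionary from words to probabilities; the function name `top_words`, the probabilities and the stop-word list are made up for the example and are not values from the paper.

```python
def top_words(word_probs, stop_words, n=15):
    """Return the n highest-probability words of one aspect, skipping stop words."""
    kept = [(w, p) for w, p in word_probs.items() if w not in stop_words]
    kept.sort(key=lambda wp: wp[1], reverse=True)
    return [w for w, _ in kept[:n]]

# Hypothetical aspect; real aspects would be estimated from the PNAS articles.
aspect = {
    "the": 0.20, "of": 0.15, "species": 0.10,
    "genetic": 0.08, "evolution": 0.07, "cell": 0.01,
}
stop_words = {"the", "of"}

print(top_words(aspect, stop_words, n=3))
# → ['species', 'genetic', 'evolution']
```

Reading such a list by eye is what lets an aspect be labelled, for instance as "molecular evolution".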
Discussion
+ plus points - minus point
- (+) This paper falls into the area of expert search, which is an important problem in CQA research. The authors present interesting observations on selection bias of expert users in CQA. These findings are useful for question recommendation. For example, we should recommend questions with low completeness (few answers) to experts.
- (-) The mathematical model for selection bias computation is pretty straightforward. Also, the authors rely on the commonly-used classifiers for expert identification, rather than come up with more sophisticated approaches. Thus, I would take this paper as an empirical study, whose emphasis is on the empirical observations of the selection bias concept.
- (-) Most of the work is based specifically on the TurboTax dataset, which may limit the applicability of the approach. For example, TurboTax has manual expert judgments, which are not available in other datasets. Without such judgments, expert identification cannot be cast as a classification problem.
Related papers
Here are two papers related to this work.
- Give a detailed overview of CQA expert search - Simulate asking and answering behaviors using a generative model - Perform expert search based on user interests which are represented by latent topics
- Also an empirical study, focusing on user interaction and category characteristics - Study user interests in terms of cross-category entropy, and show that this entropy highly correlates with expertise/rates - Use the Yahoo Answers dataset, which is commonly used in CQA research