Mixed membership models of scientific publication

This is a paper discussed in Social Media Analysis 10-802 in Fall 2012.

Citation

Elena Erosheva, Stephen Fienberg, John Lafferty. Mixed Membership Models of Scientific Publications. PNAS (101) 2004 pp. 5220-5227.

Online version

Mixed Membership Models of Scientific Publications

Summary

This paper presents mixed membership models, which generalize various statistical models from genetics, social science, information retrieval, and text mining. Although the paper emphasizes the generality of mixed membership models, it focuses specifically on modeling documents. In this setting, mixed membership means soft classification: each article is described by the proportions of its content that come from each category/topic. The authors describe the general form of a mixed membership model in terms of four ingredients: population, subject, latent variable, and sampling scheme. For the latent variables there are two cases: the membership scores are treated either as unknown constants or as realizations from a Dirichlet distribution.
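
The Dirichlet case described above can be illustrated with a minimal sketch of how one document might be generated. This is not the authors' code: the function names, the two toy aspects, and the Dirichlet parameters are all hypothetical, chosen only to make the sampling scheme concrete.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one point from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, topic_word, n_words, rng):
    """Generate one bag of words under a mixed membership model.

    alpha      -- Dirichlet parameters, one per aspect (hypothetical values)
    topic_word -- list of dicts, one per aspect, mapping word -> probability
    """
    theta = sample_dirichlet(alpha, rng)  # membership scores of this document
    aspects = range(len(alpha))
    words = []
    for _ in range(n_words):
        # Each word first picks an aspect according to the membership scores,
        # then picks a word from that aspect's multinomial distribution.
        z = rng.choices(aspects, weights=theta)[0]
        vocab = list(topic_word[z])
        probs = [topic_word[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Toy example with two made-up aspects.
rng = random.Random(0)
topic_word = [
    {"species": 0.5, "genetic": 0.3, "evolution": 0.2},
    {"protein": 0.6, "binding": 0.4},
]
theta, words = generate_document([0.5, 0.5], topic_word, 10, rng)
print(theta, words)
```

Treating the membership scores as unknown constants would correspond to passing a fixed `theta` per document instead of drawing it from the Dirichlet.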

For scientific publications, the authors employ a mixed membership model to describe the generation of contents and references. As a first attempt, the authors adopt the “bag of words” assumption for contents, and each topical aspect is treated as a multinomial distribution over words. Similarly, one can assume a "bag of references" for a document, with each aspect being a distribution over references. Arguing that multinomial sampling is not a realistic account of how authors select references, the paper then presents an alternative model for references, in which a reference list is generated by a two-step combination of multinomial and Bernoulli draws.
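
One possible reading of the two-step reference model is sketched below: multinomial draws build a pool of candidate references, and an independent Bernoulli decision then keeps or drops each candidate. The exact parameterization in the paper may differ; in particular, the single inclusion probability `keep_prob`, the pool size, and the placeholder reference names are assumptions made for illustration.

```python
import random

def generate_references(theta, aspect_refs, pool_size, keep_prob, rng):
    """Two-step sketch: multinomial pooling, then Bernoulli inclusion.

    theta       -- membership scores of the document (sums to 1)
    aspect_refs -- list of dicts, one per aspect, mapping reference -> probability
    pool_size   -- number of multinomial draws used to build the pool (assumed)
    keep_prob   -- probability of keeping each pooled reference (assumed constant)
    """
    aspects = range(len(theta))
    pool = set()
    # Step 1: multinomial draws collect a pool of candidate references.
    for _ in range(pool_size):
        z = rng.choices(aspects, weights=theta)[0]
        refs = list(aspect_refs[z])
        probs = [aspect_refs[z][r] for r in refs]
        pool.add(rng.choices(refs, weights=probs)[0])
    # Step 2: one Bernoulli decision per candidate, so no reference repeats.
    return [r for r in sorted(pool) if rng.random() < keep_prob]

# Toy example with placeholder reference names.
theta = [0.7, 0.3]
aspect_refs = [
    {"ref_A": 0.6, "ref_B": 0.4},
    {"ref_C": 0.5, "ref_D": 0.5},
]
refs = generate_references(theta, aspect_refs, 5, 0.8, random.Random(1))
print(refs)
```

Unlike pure multinomial sampling, this scheme cannot list the same reference twice, which is closer to how a bibliography is actually assembled.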

The experiments are conducted on the PNAS dataset, an archive of scientific publications. The authors qualitatively demonstrate the effectiveness of the mixed membership model in discovering latent aspects in the dataset. Adding references as an additional source also makes it possible to represent aspects in terms of references and to examine the characteristics of frequently cited references.

Dataset

The PNAS dataset used in this paper consists of Biological Sciences articles published between 1997 and 2001 in the Proceedings of the National Academy of Sciences. For preprocessing, the authors exclude corrections, commentaries, letters, and reviews because these are not traditional research reports. Articles without references or abstracts are also excluded. Some statistics about the dataset are:

 - Articles: 11,981
 - Unique words: 39,616
 - Unique references: 77,115
 - Subtopics: 19

Evaluation

Because topic models are generally difficult to evaluate, the authors present qualitative experiments on the PNAS dataset. The number of aspects is set to 8. Some observations are:

  • The paper reports the 15 highest-probability words for each aspect, with stop words filtered out. These words give useful indications of the underlying semantics of each aspect. For example, words such as "species", "genetic" and "evolution" suggest that the aspect concerns "molecular evolution".
  • The authors also show the 15 highest-probability references for each aspect. Interestingly, many of these are manuals, textbooks, and articles that describe a particular methodology.
  • Moreover, the authors examine the most frequent references among the 8 aspects. They observe that these were either co-authored or contributed by a distinguished member of the National Academy of Sciences.
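
Listing the highest-probability words of an aspect, as in the first observation, can be sketched as follows. The helper name `top_words`, the toy probabilities, and the tiny stop-word set are hypothetical and are not taken from the paper.

```python
def top_words(word_probs, n=15, stop_words=frozenset()):
    """Return the n most probable words of one aspect, skipping stop words."""
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked if word not in stop_words][:n]

# Toy aspect with made-up probabilities.
aspect = {"the": 0.30, "species": 0.25, "genetic": 0.20, "of": 0.15, "evolution": 0.10}
print(top_words(aspect, n=3, stop_words={"the", "of"}))
# → ['species', 'genetic', 'evolution']
```

The same helper would apply unchanged to an aspect's distribution over references.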

Discussion

(+) marks a plus point; (-) marks a minus point.

  • (+) This paper unifies various statistical models from different areas under the mixed membership framework. It describes two ways to treat the latent variables: as unknown constants, or as draws from a Dirichlet prior. For information retrieval and text mining, these two treatments correspond to PLSA and LDA respectively.
  • (+) An alternative method is proposed for modeling publication references. It combines the pooling of candidate references through multinomial sampling with Bernoulli decisions on those references. The authors argue that this two-stage combination of multinomial and Bernoulli draws is more consistent with how authors actually select references for their bibliography.
  • (-) The paper presents only qualitative evaluations, which makes it difficult to compare with other methods. For completeness, a quantitative evaluation using the perplexity metric could be conducted.
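
As a rough illustration of the perplexity metric suggested in the last point, the sketch below computes per-token perplexity when each document's membership scores and each aspect's word distribution are already known. The function name, the probability floor for unseen words, and the toy data are assumptions; a real evaluation would also have to infer the membership scores of held-out documents.

```python
import math

def perplexity(docs, thetas, topic_word, floor=1e-12):
    """Per-token perplexity of documents under a mixed membership model.

    docs       -- list of documents, each a list of word tokens
    thetas     -- membership scores per document (assumed already estimated)
    topic_word -- list of dicts, one per aspect, mapping word -> probability
    floor      -- small probability used for unseen words (an assumption)
    """
    log_lik = 0.0
    n_tokens = 0
    for words, theta in zip(docs, thetas):
        for w in words:
            # p(w | d) = sum over aspects of theta_k * beta_k(w)
            p = sum(t * beta.get(w, 0.0) for t, beta in zip(theta, topic_word))
            log_lik += math.log(max(p, floor))
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)

# Sanity check: one aspect that is uniform over 4 words gives perplexity 4.
uniform = [{"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}]
print(perplexity([["a", "b", "c"]], [[1.0]], uniform))
```

Lower perplexity means the model assigns higher probability to the observed words, which gives a single number for comparing models on the same held-out data.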

Related papers

Here are two papers related to this work.