This is a paper discussed in Social Media Analysis 10-802 in Fall 2012.

== Citation ==

Elena Erosheva, Stephen Fienberg, John Lafferty. Mixed Membership Models of Scientific Publications. PNAS (101), 2004, pp. 5220-5227.
 
== Online version ==

[http://www.cs.cmu.edu/~lafferty/pub/efl.pdf Mixed Membership Models of Scientific Publications]
  
 
== Summary ==

This [[Category::Paper|paper]] presents mixed membership models, which serve as a generalization of various statistical models from genetics, social science, information retrieval, and text mining. Although it emphasizes the generality of mixed membership models, the paper focuses specifically on the modeling of documents. In this scenario, mixed membership essentially means a soft classification of each article according to the proportions of the article's content coming from each category/topic. The authors describe the general form of mixed membership models in terms of four ingredients: population, subject, latent variables, and sampling scheme. For the latent variables, two cases are considered: first, the membership scores are treated as unknown constants; second, the membership scores are treated as realizations from the [[Dirichlet distribution]].
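
To make the two treatments concrete, here is a sketch in generic notation rather than the paper's own symbols: <math>\theta = (\theta_1,\dots,\theta_K)</math> are a subject's membership scores over <math>K</math> aspects, <math>\beta_k</math> are the aspect parameters, and <math>x_1,\dots,x_J</math> are the subject's observed attributes, assumed conditionally independent given <math>\theta</math>. In the first case the scores are fixed unknown parameters,

<math>p(x_1,\dots,x_J \mid \theta) = \prod_{j=1}^{J} \sum_{k=1}^{K} \theta_k \, p(x_j \mid \beta_k),</math>

while in the second case they are integrated out under a Dirichlet prior with hyperparameter <math>\alpha</math>,

<math>p(x_1,\dots,x_J \mid \alpha) = \int \prod_{j=1}^{J} \sum_{k=1}^{K} \theta_k \, p(x_j \mid \beta_k) \; \mathrm{Dir}(\theta \mid \alpha) \, d\theta.</math>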
  
 
For scientific publications, the authors employ a mixed membership model to simulate the generation of both contents and references. As a first attempt, they adopt the "bag of words" assumption for contents, and each topical aspect is treated as a multinomial distribution over words. Similarly, a "bag of references" assumption can be made for a document, with each aspect a distribution over references. Arguing that multinomial sampling is not a realistic model of the manner in which authors select references, the paper then presents an alternative model for references, in which a reference list is generated by a two-step combination of multinomial and Bernoulli draws.
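
The following is a minimal simulation of this generative view, assuming the plain multinomial treatment of words and one plausible reading of the two-step scheme for references; all sizes and names (<code>K</code>, <code>V</code>, <code>R</code>, <code>include_prob</code>) are illustrative and not taken from the paper.

<pre>
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration (not the paper's settings).
K, V, R = 8, 1000, 500                           # aspects, vocabulary size, candidate references
alpha = np.full(K, 0.1)                          # Dirichlet hyperparameters over aspects
beta_words = rng.dirichlet(np.ones(V), size=K)   # per-aspect multinomials over words
beta_refs = rng.dirichlet(np.ones(R), size=K)    # per-aspect multinomials over references

def generate_article(n_words=200, n_ref_draws=30, include_prob=0.5):
    """Simulate one article: a bag of words plus a reference list."""
    theta = rng.dirichlet(alpha)                 # membership scores for this article

    # Words: pick an aspect per token from theta, then a word from that aspect.
    z_w = rng.choice(K, size=n_words, p=theta)
    words = [int(rng.choice(V, p=beta_words[z])) for z in z_w]

    # References, read as a two-step scheme: multinomial draws pool candidate
    # references, then a Bernoulli decision keeps or drops each pooled candidate.
    # (The exact parameterization used in the paper may differ.)
    z_r = rng.choice(K, size=n_ref_draws, p=theta)
    pooled = {int(rng.choice(R, p=beta_refs[z])) for z in z_r}
    references = sorted(r for r in pooled if rng.random() < include_prob)

    return words, references

words, refs = generate_article()
print(len(words), "word tokens,", len(refs), "references")
</pre>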
  
The experiments are conducted on the [http://malt.ml.cmu.edu/mw/index.php/PNAS_Dataset PNAS] dataset, an archive of scientific publications. The authors qualitatively demonstrate the effectiveness of the mixed membership model in discovering latent aspects from the dataset. Adding references as an additional source also makes it possible to represent aspects in terms of references and to examine the characteristics of frequently cited references.
  
 
== Dataset ==

The [http://malt.ml.cmu.edu/mw/index.php/PNAS_Dataset PNAS] dataset used in this paper consists of Biological Sciences articles published between 1997 and 2001 in the Proceedings of the National Academy of Sciences. For preprocessing, the authors ignore corrections, commentaries, letters, and reviews, because these are not traditional research reports; articles without references or abstracts are also ignored. Some statistics about the dataset are:
* Articles: 11981
* Unique words: 39616
* Unique references: 77115
* Subtopics: 19
  
 
== Evaluation ==

Because the evaluation of topic models is generally difficult, the authors present qualitative experiments on the [http://malt.ml.cmu.edu/mw/index.php/PNAS_Dataset PNAS] dataset, with the number of aspects set to 8. Some observations are:
* The paper reports the 15 highest-probability words for each aspect, with stop words filtered out. These high-probability words provide useful indications of the underlying semantics of individual aspects; for example, words such as "species", "genetic", and "evolution" suggest that an aspect is about "molecular evolution". (A small sketch of how such a listing can be produced appears after this list.)
* The authors also show the 15 highest-probability references for each aspect. It is interesting to note that many of these high-probability references are manuals, textbooks, and articles that describe a particular methodology.
* Moreover, the authors examine the most frequent references across the 8 aspects and observe that these were co-authored or contributed by distinguished members of the National Academy of Sciences.
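
As an illustration of how such word lists can be produced, the following sketch ranks the 15 highest-probability non-stop words for each aspect from a fitted aspect-by-word probability matrix; the names (<code>beta_words</code>, <code>vocab</code>, <code>stop_words</code>) are placeholders and not taken from the paper.

<pre>
import numpy as np

def top_words_per_aspect(beta_words, vocab, stop_words, n_top=15):
    """Return the n_top highest-probability non-stop words for each aspect.

    beta_words : (K, V) array; row k is aspect k's distribution over the vocabulary
    vocab      : list of V word strings
    stop_words : set of words to exclude before ranking
    """
    keep = np.array([w not in stop_words for w in vocab])
    tops = []
    for row in beta_words:
        probs = np.where(keep, row, -1.0)        # push stop words below every real word
        idx = np.argsort(probs)[::-1][:n_top]    # indices of the highest probabilities
        tops.append([vocab[i] for i in idx])
    return tops
</pre>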
 
  
 
== Discussion ==

(+) plus points, (-) minus points

* (+) This paper generalizes various statistical models from different areas into the mixed membership framework. It describes two ways to deal with the latent variables: they can either be treated as unknown constants or be drawn from a Dirichlet prior. In information retrieval and text mining, these two treatments correspond to [[PLSA]] and [[LDA]], respectively.
* (+) An alternative method is proposed for modeling publication references, which considers both the pooling of possible references based on multinomial sampling and the Bernoulli decisions made on those references. This two-stage combination of multinomial and Bernoulli draws is more consistent with the actual manner in which authors select references for their bibliographies.
* (-) The paper only presents qualitative evaluations, which makes it difficult to compare the model with other methods. For completeness, quantitative evaluations using the [[perplexity]] metric could be conducted.
  
 
== Related papers ==

Here are two papers related to this work.
* [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.6843 Missing Link - A Probabilistic Model of Document Content]
* [http://dl.acm.org/citation.cfm?id=1553460 Topic-link LDA: Joint Models of Topic and Author Community]
 
