Mixed membership models of scientific publication

From Cohen Courses

Latest revision as of 00:58, 6 November 2012

This is a paper discussed in Social Media Analysis 10-802 in Fall 2012.

Citation

Elena Erosheva, Stephen Fienberg, John Lafferty. Mixed Membership Models of Scientific Publications. PNAS (101) 2004 pp. 5220-5227.

Online version

Mixed Membership Models of Scientific Publications

Summary

This paper presents mixed membership models, which generalize various statistical models from genetics, social science, information retrieval and text mining. Although it emphasizes the generality of mixed membership models, the paper focuses specifically on modeling documents. In this setting, mixed membership means a soft classification of each article, based on the proportions of the article’s content that come from each category/topic. The authors describe the general form of mixed membership models in terms of four ingredients: population, subject, latent variable and sampling scheme. The latent variables (membership scores) can be handled in two ways: first, as unknown constants; second, as realizations from a Dirichlet distribution.
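As a rough illustration of the second case, the sketch below draws one article's membership scores from a Dirichlet distribution by normalizing Gamma draws. The symmetric prior value 0.5 and the name `sample_membership` are assumptions made for this example and are not taken from the paper.

```python
import random


def sample_membership(alpha, rng):
    """Draw membership scores from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
alpha = [0.5] * 8  # hypothetical symmetric prior over 8 aspects
membership = sample_membership(alpha, rng)

# Each score is the share of the article attributed to one aspect.
for aspect, score in enumerate(membership):
    print(f"aspect {aspect}: {score:.3f}")
```

The scores are non-negative and sum to one, which is what makes the classification "soft": an article can belong partly to several aspects at once.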

For scientific publications, the authors employ a mixed membership model to simulate the generation of contents and references. As a first attempt, the authors adopt the “bag of words” assumption for contents, and treat each topical aspect as a multinomial distribution over words. Similarly, a document can be treated as a "bag of references", with each aspect being a distribution over references. Then, arguing that multinomial sampling is not a realistic description of how authors select references, the paper presents an alternative model for references, in which a reference list is generated by a two-step combination of multinomial and Bernoulli draws.
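The generative story above can be sketched in a few lines. This is only an illustration under stated assumptions: the summary does not give the exact form of the two-step reference model, so the pooling step, the fixed `pool_size`, the single `keep_prob`, and all toy distributions below are hypothetical choices made for this example.

```python
import random


def draw_categorical(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_words(membership, word_dists, n_words, rng):
    """Bag of words: pick an aspect per token, then a word from that aspect."""
    words = []
    for _ in range(n_words):
        aspect = draw_categorical(membership, rng)
        words.append(draw_categorical(word_dists[aspect], rng))
    return words


def generate_references(membership, ref_dists, pool_size, keep_prob, rng):
    """Two-step sketch: multinomial pooling, then a Bernoulli keep/drop decision."""
    pool = set()
    for _ in range(pool_size):  # step 1: pool candidate references
        aspect = draw_categorical(membership, rng)
        pool.add(draw_categorical(ref_dists[aspect], rng))
    # step 2: Bernoulli decision for each pooled candidate (keep_prob is assumed)
    return sorted(r for r in pool if rng.random() < keep_prob)


rng = random.Random(1)
membership = [0.7, 0.3]                       # toy membership over 2 aspects
word_dists = [[0.5, 0.5, 0.0, 0.0],           # toy aspect 0 over 4 words
              [0.0, 0.0, 0.5, 0.5]]           # toy aspect 1 over 4 words
ref_dists = [[0.6, 0.4, 0.0], [0.0, 0.2, 0.8]]  # toy aspects over 3 references

words = generate_words(membership, word_dists, n_words=10, rng=rng)
refs = generate_references(membership, ref_dists, pool_size=5, keep_prob=0.8, rng=rng)
print(words, refs)
```

Unlike pure multinomial sampling, the second step produces a reference list without repeats, since each pooled reference is either cited or not.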

The experiments are conducted on the PNAS dataset, which is an archive of scientific publications. The authors qualitatively demonstrate the effectiveness of the mixed membership model in discovering latent aspects from the dataset. Also, adding references as an additional source enables us to represent aspects in terms of references and to examine the characteristics of frequent references.

Dataset

The PNAS dataset used in this paper consists of Biological Sciences articles published between 1997 and 2001 in the Proceedings of the National Academy of Sciences. For preprocessing, the authors ignore corrections, commentaries, letters, and reviews because these are not traditional research reports. Articles without references or abstracts are also ignored. Some statistics about the dataset are:

 - Articles 11981
 - Unique Words 39616
 - Unique References 77115
 - Subtopics 19

Evaluation

Because evaluating topic models is generally difficult, the authors present some qualitative experiments on the PNAS dataset. The number of aspects is set to 8. Some observations are:

  • The paper reports the 15 highest-probability words for each aspect, with "stop words" filtered out. These high-probability words provide useful indications of the underlying semantics of individual aspects. For example, words such as "species", "genetic" and "evolution" suggest that the aspect is about "molecular evolution".
  • The authors also show the 15 highest-probability references for each aspect. Interestingly, many of these high-probability references are manuals, textbooks, and articles that describe a particular methodology.
  • Moreover, the authors examine the most frequent references among the 8 aspects. They observe that these were either co-authored or contributed by a distinguished member of the National Academy of Sciences.
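Reading off the highest-probability words (or references) of an aspect can be sketched as below. The function name `top_items`, the toy vocabulary, and the probabilities are made up for illustration; they are not values from the paper.

```python
def top_items(dist, vocab, k, stop_words=frozenset()):
    """Return the k highest-probability items of one aspect, skipping stop words."""
    ranked = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
    kept = [vocab[i] for i in ranked if vocab[i] not in stop_words]
    return kept[:k]


# Toy aspect over a 5-word vocabulary (hypothetical numbers).
vocab = ["the", "species", "genetic", "evolution", "protein"]
aspect_dist = [0.40, 0.25, 0.20, 0.10, 0.05]

top = top_items(aspect_dist, vocab, k=3, stop_words={"the"})
print(top)  # ['species', 'genetic', 'evolution']
```

The same function applies to references by passing a list of reference identifiers as `vocab` and an aspect's distribution over references as `dist`.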

Discussion

(+) plus points, (-) minus points

  • (+) This paper unifies various statistical models from different areas under the framework of mixed membership models. It describes two ways to deal with the latent variables: they can either be unknown constants or be drawn from a Dirichlet prior. For information retrieval and text mining, these two ways correspond to PLSA and LDA respectively.
  • (+) An alternative method is proposed for modeling publication references. This method considers both the pooling of possible references based on multinomial sampling and the Bernoulli decisions made on those references. This two-stage combination of multinomial and Bernoulli draws is more consistent with the way authors actually select references for their bibliography.
  • (-) The paper presents only qualitative evaluations, which makes it difficult to compare with other methods. For completeness, quantitative evaluations using the perplexity metric could be conducted.
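For reference, the kind of perplexity evaluation suggested in the last point could look like the minimal sketch below. It assumes that the membership scores of each evaluated document are already available (in practice they would have to be estimated for held-out documents), and it uses the common definition of perplexity as the exponential of the negative average log-likelihood per word token. The function and variable names are hypothetical.

```python
import math


def perplexity(docs, memberships, word_dists):
    """exp(-average log-likelihood per token) under a mixed membership model.

    docs: list of documents, each a list of word indices
    memberships: per-document membership scores over aspects
    word_dists: per-aspect distributions over the vocabulary
    """
    log_lik = 0.0
    n_tokens = 0
    for words, theta in zip(docs, memberships):
        for w in words:
            # p(w | document) = sum over aspects of membership * p(w | aspect)
            p = sum(t * dist[w] for t, dist in zip(theta, word_dists))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


# Sanity check: a uniform model over 4 words should have perplexity 4.
docs = [[0, 1, 2, 3]]
memberships = [[0.5, 0.5]]
word_dists = [[0.25] * 4, [0.25] * 4]
ppl = perplexity(docs, memberships, word_dists)
print(ppl)
```

Lower perplexity means the model assigns higher probability to the evaluated text, which gives a single number that can be compared across models trained on the same vocabulary.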

Related papers

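Here are two papers related to this work.

  • Missing Link - A Probabilistic Model of Document Content (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.6843)
  • Topic-link LDA: Joint Models of Topic and Author Community (http://dl.acm.org/citation.cfm?id=1553460)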