Detecting Topic Evolution in Scientific Literature: How Can Citations Help?

From Cohen Courses

Citation

@inproceedings{He:2009:DTE:1645953.1646076,

author = {He, Qi and Chen, Bi and Pei, Jian and Qiu, Baojun and Mitra, Prasenjit and Giles, Lee},
title = {Detecting topic evolution in scientific literature: how can citations help?},
booktitle = {Proceedings of the 18th ACM conference on Information and knowledge management},
series = {CIKM '09},
year = {2009},
isbn = {978-1-60558-512-3},
location = {Hong Kong, China},
pages = {957--966},
numpages = {10},
url = {http://doi.acm.org/10.1145/1645953.1646076},
doi = {10.1145/1645953.1646076},
acmid = {1646076},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {citations, inheritance topic model, topic evolution},

}


Abstract from the paper

Understanding how topics in scientific literature evolve is an interesting and important problem. Previous work simply models each paper as a bag of words and also considers the impact of authors. However, the impact of one document on another as captured by citations, one important inherent element in scientific literature, has not been considered. In this paper, we address the problem of understanding topic evolution by leveraging citations, and develop citation-aware approaches. We propose an iterative topic evolution learning framework by adapting the Latent Dirichlet Allocation model to the citation network and develop a novel inheritance topic model. We evaluate the effectiveness and efficiency of our approaches and compare with the state of the art approaches on a large collection of more than 650,000 research papers in the last 16 years and the citation network enabled by CiteSeerX. The results clearly show that citations can help to understand topic evolution better.

Online version

Summary

The authors investigate how citations in scientific papers can be used as dependencies to model the temporal evolution of topics.

  • The authors first review existing topic models that focus on detecting topic evolution. They note that these models treat documents at different timestamps purely as bags of words and miss an important factor for evolution analysis: the dependency between documents, which the citation relation among papers captures well.
  • Next, the authors build up the model step by step. They start with the simplest model, which considers neither dependency nor temporal information, then add a dependency on all previous papers, and finally propose the citation-aware approach, which models the citation relation in a generative process and tries to infer whether a term of a given topic is inherited from the cited papers or is autonomous.
  • Since the average number of citations per paper is normally bounded, the computational complexity of the model is acceptable. The same holds in social network analysis, where the average number of friends is bounded, and in Web page analysis, where the average number of hyperlinks per page is bounded.
  • Finally, the model is applied to a real-world dataset. The effectiveness of the proposed approach is evaluated, and some interesting observations are drawn from the results.
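
To make the complexity argument concrete, the following minimal sketch counts how many document-to-document dependencies each modeling choice has to consider. The function names and all the numbers are hypothetical and chosen only for illustration; they are not taken from the paper or its dataset.

```python
def accumulative_links(papers_per_year):
    """Dependencies when every paper depends on all papers from earlier years."""
    links = 0
    earlier = 0
    for count in papers_per_year:
        links += count * earlier  # each new paper links to every earlier paper
        earlier += count
    return links


def citation_links(papers_per_year, avg_citations):
    """Dependencies when a paper depends only on the papers it cites."""
    return sum(papers_per_year) * avg_citations


# Hypothetical corpus: 16 years, 1,000 papers per year, 20 citations per paper.
corpus = [1000] * 16
print(accumulative_links(corpus))
print(citation_links(corpus, 20))
```

With a bounded average number of citations, the second count grows linearly in the number of papers, whereas the first grows roughly quadratically.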

Modeling

The novelty of the paper is the proposed citation-aware approach, which is an extension of the LDA model.

  • Simple citation-aware method (c-LDA): the first change from the accumulative topic evolution model is that not all previous papers count as the topic dependency of the current paper; instead, the cited papers together constitute a sub-topic model.
  • c-LDA with Dirichlet prior smoothing
  • Inheritance topic model (ITM): the key assumption of this model is that a paper d is virtually separated into two parts, the inherited part d0 and the autonomous part d1, which are generated independently. The split between the two parts is modeled by a Bernoulli distribution (further coupled with a Beta distribution). The autonomous part follows the traditional LDA generative process; the inherited part is generated in the c-LDA fashion.
  • The whole model can be inferred with a variant of the collapsed Gibbs sampling algorithm.
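
The ITM description above can be illustrated with a small generative sketch. This is only one reading of the summary, not the authors' implementation: the function names, the hyperparameter values, and the uniform choice among cited papers are all assumptions made for illustration, and inference (collapsed Gibbs sampling) is not shown.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def sample_index(probs, rng):
    """Draw an index according to the given probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_paper(num_words, cited_topic_mixtures, topic_word, rng,
                   alpha=0.5, beta_a=1.0, beta_b=1.0):
    """Generate (word, topic, inherited) triples for one paper.

    cited_topic_mixtures: topic distributions of the cited papers.
    topic_word: one word distribution per topic.
    alpha, beta_a, beta_b: hypothetical hyperparameter values.
    """
    num_topics = len(topic_word)
    # Autonomous part: the paper's own topic mixture, as in plain LDA.
    own_mixture = sample_dirichlet(alpha, num_topics, rng)
    # Beta-distributed probability that a token is inherited.
    has_citations = len(cited_topic_mixtures) > 0
    inherit_prob = rng.betavariate(beta_a, beta_b) if has_citations else 0.0
    tokens = []
    for _ in range(num_words):
        # Bernoulli switch: inherited part or autonomous part.
        inherited = rng.random() < inherit_prob
        if inherited:
            # Assumption: the cited paper is picked uniformly at random.
            cited = rng.randrange(len(cited_topic_mixtures))
            mixture = cited_topic_mixtures[cited]
        else:
            mixture = own_mixture
        topic = sample_index(mixture, rng)
        word = sample_index(topic_word[topic], rng)
        tokens.append((word, topic, inherited))
    return tokens


# Toy example: 2 topics, 3-word vocabulary, 2 cited papers.
toy_topic_word = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
toy_cited = [[0.9, 0.1], [0.2, 0.8]]
toy_tokens = generate_paper(10, toy_cited, toy_topic_word, random.Random(0))
```

Each token records whether it came from the inherited or the autonomous part, which is the latent indicator the model tries to infer for real papers.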

Results

  • The evaluation uses a real dataset of more than 650,000 research papers from the last 16 years, together with the citation network provided by CiteSeerX, a scientific literature digital library and search engine focusing primarily on computer and information science.
  • The results clearly show that citations can help to understand topic evolution better, and that the proposed methods are effective and efficient.
  • In particular, the popular research area "machine learning" is used as a case study, which reveals interesting trends in the research topics of the machine learning community.

Further thoughts & Study Plan

This paper leverages citations for detecting topic evolution, but citations have already been widely used in other topic models, e.g., Link-LDA, to better estimate the topic distribution of a paper or an author. Such models should also be compared with this paper.

Topic evolution is also worth comparing with new topic detection.