Hall et, EMNLP2008

From Cohen Courses

Citation

David Hall, Daniel Jurafsky, and Christopher D. Manning. 2008. Studying the history of ideas using topic models. In Proceedings of Empirical Methods in Natural Language Processing, pages 363–371. ACL.

Online version

Daniel Jurafsky's papers

Summary

This paper applies unsupervised topic modeling to analyze historical trends in a scientific field (here, computational linguistics). The basic ideas are:

  • Fit LDA to the papers and compute, post hoc, the probability of each topic given the year
  • Apply the Jensen-Shannon divergence (JS divergence) between the conferences' topic distributions to test whether the conferences are converging in the topics they cover

Methodology

LDA does not explicitly model temporal relationships. There are two other common ways to capture temporal information: the Dynamic Topic Model (Blei and Lafferty, 2006), which represents each year's documents as generated from a normal-distribution centroid over topics, with the following year's centroid generated from the preceding year's; and the Topics over Time model (Wang and McCallum, 2006), which assumes that each document chooses its own time stamp based on a topic-specific beta distribution. Both of these models, however, impose constraints on the time periods: the Dynamic Topic Model penalizes large changes from year to year, while the beta distributions in the Topics over Time model are relatively inflexible.

So in this paper the authors first apply LDA (trained with Gibbs sampling), and then perform post hoc calculations based on the observed probability of each topic given the current year. Define \hat{p}(z \mid y) as the empirical probability that an arbitrary paper d written in year y was about topic z:

\hat{p}(z \mid y) = \sum_{d : t_d = y} \hat{p}(z \mid d)\, \hat{p}(d \mid y) = \frac{1}{C} \sum_{d : t_d = y} \hat{p}(z \mid d), \qquad \hat{p}(z \mid d) = \frac{1}{N_d} \sum_{i=1}^{N_d} \mathbb{I}(z_i = z),

where \mathbb{I}(\cdot) is the indicator function, z_i is the topic assigned to the i-th word of document d, t_d is the date document d was written, and \hat{p}(d \mid y) is set to a constant 1/C.
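The post hoc per-year estimate can be sketched in Python. This is a minimal illustration assuming a hypothetical input format (a list of (year, per-document topic proportions) pairs, with the proportions already estimated by the Gibbs sampler), not the authors' actual code:

```python
from collections import defaultdict

def empirical_topic_dist_by_year(docs, num_topics):
    """Post hoc estimate of p(z|y): average the per-document topic
    proportions p(z|d) over the documents written in year y.

    `docs` is a list of (year, topic_probs) pairs (hypothetical format),
    where topic_probs is the Gibbs-sampled p(z|d) for one document.
    """
    by_year = defaultdict(list)
    for year, topic_probs in docs:
        by_year[year].append(topic_probs)
    # p(d|y) is a constant 1/C, so p(z|y) reduces to a simple average
    # over the C documents from year y.
    return {
        year: [sum(p[z] for p in plist) / len(plist) for z in range(num_topics)]
        for year, plist in by_year.items()
    }
```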


Define \hat{p}(z \mid c, y) as the empirical distribution of a topic z at a conference c in year y:

\hat{p}(z \mid c, y) = \sum_{d : t_d = y,\, c_d = c} \hat{p}(z \mid d)\, \hat{p}(d \mid c, y) = \frac{1}{C'} \sum_{d : t_d = y,\, c_d = c} \hat{p}(z \mid d),

where c_d is the conference at which document d was published and \hat{p}(d \mid c, y) is again set to a constant 1/C'.


Define topic entropy to measure the breadth of a conference c in year y:

H(z \mid c, y) = -\sum_{i=1}^{K} \hat{p}(z_i \mid c, y) \log \hat{p}(z_i \mid c, y).
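Topic entropy is straightforward to compute; a minimal sketch (a broader conference, whose probability mass is spread more evenly across topics, gets a higher entropy):

```python
import math

def topic_entropy(p):
    """H = -sum_i p_i * log(p_i) over a topic distribution p.

    Zero-probability topics contribute nothing, so they are skipped
    to avoid log(0).
    """
    return -sum(q * math.log(q) for q in p if q > 0.0)
```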


Finally, use the Jensen-Shannon divergence (JS divergence) to investigate whether or not the topic distributions of the conferences are converging:

D_{JS}(P \parallel Q) = \frac{1}{2} D_{KL}(P \parallel R) + \frac{1}{2} D_{KL}(Q \parallel R), \qquad R = \frac{1}{2}(P + Q),

where P and Q are the topic distributions of two conferences in a given year and D_{KL} is the Kullback–Leibler divergence.

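A minimal sketch of the JS divergence between two conferences' topic distributions (the function names are illustrative, not the authors' code); lower values mean the two distributions are more similar, so a downward trend over the years indicates convergence:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(P || Q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js_divergence(p, q):
    """D_JS(P || Q) = 1/2 D_KL(P || R) + 1/2 D_KL(Q || R), R = (P + Q)/2.

    Unlike KL, this is symmetric and always finite for valid distributions,
    since R is nonzero wherever either P or Q is.
    """
    r = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, r) + 0.5 * kl(q, r)
```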
Data

The ACL Anthology

Experimental Results

  • Historical Trends in Computational Linguistics

To visualize the trends, they show the probability mass associated with various topics over time, plotted as (a smoothed version of) \hat{p}(z \mid y). Topics that have become more prominent include classification, probabilistic models, statistical parsing, statistical MT, and lexical semantics, while topics that have declined include computational semantics, conceptual semantics, and plan-based dialogue and discourse.


  • Is Computational Linguistics Becoming More Applied?

They looked at trends over time for application areas such as Machine Translation, Spelling Correction, and Dialogue Systems, and found a clear trend toward an increase in applications over time.


  • Differences and Similarities Among COLING, ACL and EMNLP

As inferred from topic entropy, COLING has historically been the broadest of the three conferences; ACL started with a fairly narrow focus, became nearly as broad as COLING during the 1990s, but has become narrower again in recent years; EMNLP's comparatively low entropy reflects its status as a "special interest" conference.

From the JS divergence, they showed that all three conferences are converging in the topics they cover.

Related papers

Blei and Lafferty, ICML2006: David Blei and John D. Lafferty. 2006. Dynamic topic models. ICML.

Wang and McCallum, KDD2006: Xuerui Wang and Andrew McCallum. 2006. Topics over time: a non-Markov continuous-time model of topical trends. In KDD, pages 424–433, New York, NY, USA. ACM.