Difference between revisions of "Hall emnlp2008"

From Cohen Courses
 
(17 intermediate revisions by one other user not shown)
Line 1: Line 1:
== Paper ==
+
== Citation ==
  
 
* Title : Studying the History of Ideas Using Topic Models
 
* Title : Studying the History of Ideas Using Topic Models
Line 6: Line 6:
  
 
== Summary ==  
 
== Summary ==  
This paper uses topic models to study the development of ideas over time for  
+
This [[Category::paper]] uses topic models to study the development of ideas over time for  
papers in computational linguistics conferences (ACL, COOLING, EMNLP, etc.)
+
papers in computational linguistics conferences.
 +
They also investigated differences and similarities among various CL conferences (ACL, EMNLP, and COLING). <br>
 +
Some of their findings include : <br>
 +
* There is an increase in research in probabilistic models  starting from late 80s (1988).
 +
* There is a decline in research in semantics between 1978 and 2001 (possibly trending again after 2001).
 +
* There is a steady increase in research in applications (MT, Speech Recognition, etc.) over time.
 +
* COLING has more diverse topics compared to ACL and EMNLP, but all three are becoming broader (in topics).
 +
* Topics in ACL, COLING, and EMNLP are converging (similarities of topics in these conferences are increasing over time).
  
 
== Dataset ==  
 
== Dataset ==  
ACL Anthology (~12,500 papers)
+
[[UsesDataset::ACL Anthology]] (they used ~12,500 papers)
  
 
== Model ==  
 
== Model ==  
LDA with post hoc analysis to calculate observed probability of topics in the current year
+
Instead of using dynamic topic models, they used static [[UsesMethod::Topic_model]] (vanilla LDA) with post hoc analysis to calculate the probability of topics in a particular year.<br>
 
+
The probability of a topic in a particular year is computed as follows: <br>
I is the indication function, t_d is the date document d was published, p(d|y) is a constant 1/C
+
<math>
 +
\hat{p}(z|y) = \sum_{d:t_d=y} \hat{p}(z|d) \hat{p}(d|y)
 +
</math><br>
 +
where <math>t_d</math> is the date of document d, y is the year, <math>t_d</math> = y means that year(<math>t_d</math>)=y, and z is a topic of interest.
  
 
== Experiments ==
 
== Experiments ==
Ran 100 topics LDA, took relevant 36 topics.
+
* Ran 100 topics LDA, took relevant 36 topics.
Seeded words for 10 more topics to improve coverage.
+
* Seeded words for 10 more topics to improve coverage.
Used these 36+10 topics as priors for new 100-topics run.
+
* Used these 36+10 topics as priors for new 100-topics run.
Picked 43 topics and manually labeled them.
+
* Picked 43 topics and manually labeled them.
  
 
== Results ==
 
== Results ==
* Trending topics in the CL community
+
This is only a subset of their results. There are more interesting plots in the paper.
  
* Declining topics in the CL community
+
* Trending topics in the CL community<br>
 +
** Classification, Probabilistic Models, Statistical Parsing, Statistical MT, and Lexical Semantics are trending.
 +
[[File:halltrend.png]]
 +
 
 +
* Declining topics in the CL community<br>
 +
** Semantics and Discourse are declining.
 +
[[File:halltdecline.png]]
  
 
* NLP applications
 
* NLP applications
They investigated whether CL is becoming more applied over time.
+
** They investigated whether CL is becoming more applied over time. <br>
They explored six applicatons : Machine Translation, Spelling Correction, Dialogue Systems, Call Routing, Speech Recognition, and Biomedical
+
** The results show gradual increase over time. <br>
 
+
[[File:hallapp.png]]
* ACL vs COLING vs EMNLP
 

Latest revision as of 22:09, 2 April 2011

Citation

  • Title : Studying the History of Ideas Using Topic Models
  • Authors : D. Hall, D. Jurafsky, and C. D. Manning
  • Venue : EMNLP 2008

Summary

This paper uses topic models to study how ideas develop over time in papers from computational linguistics conferences. The authors also investigate differences and similarities among various CL conferences (ACL, EMNLP, and COLING).
Some of their findings:

  • Research on probabilistic models increases starting in the late 1980s (1988).
  • There is a decline in research in semantics between 1978 and 2001 (possibly trending again after 2001).
  • There is a steady increase in research in applications (MT, Speech Recognition, etc.) over time.
  • COLING has more diverse topics compared to ACL and EMNLP, but all three are becoming broader (in topics).
  • Topics in ACL, COLING, and EMNLP are converging (similarities of topics in these conferences are increasing over time).

Dataset

ACL Anthology (they used ~12,500 papers)

Model

Instead of using dynamic topic models, they used a static topic model (vanilla LDA) with post hoc analysis to calculate the probability of topics in a particular year.
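The probability of a topic in a particular year is computed as follows:

<math>\hat{p}(z|y) = \sum_{d:t_d=y} \hat{p}(z|d) \hat{p}(d|y)</math>

where <math>t_d</math> is the date of document d, y is the year, <math>t_d = y</math> means that year(<math>t_d</math>) = y, and z is a topic of interest.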
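The post hoc step can be sketched in a few lines: for each year, take the per-document topic distributions p(z|d) of the documents published that year and weight them by p(d|y). The sketch below is illustrative only and is not the authors' code. It assumes p(d|y) is uniform over the documents of year y (so the sum becomes a plain average), and the function name `topic_prob_by_year` and the toy data are made up for the example.

```python
import numpy as np

def topic_prob_by_year(doc_topic, doc_years):
    """Return {year: vector of estimated p(z|y)}.

    doc_topic: array of shape (n_docs, n_topics); row d holds p(z|d),
               e.g. the document-topic proportions from a fitted LDA model.
    doc_years: length-n_docs sequence with each document's publication year.
    Assumption: p(d|y) is uniform over the documents published in year y.
    """
    doc_topic = np.asarray(doc_topic, dtype=float)
    doc_years = np.asarray(doc_years)
    result = {}
    for year in np.unique(doc_years):
        in_year = doc_years == year  # selects documents with t_d = y
        # sum over d with t_d = y of p(z|d) * (1 / number of docs in y)
        result[int(year)] = doc_topic[in_year].mean(axis=0)
    return result

# Toy data (hypothetical): 4 documents, 2 topics.
theta = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.4, 0.6]]
years = [1990, 1990, 2000, 2000]
probs = topic_prob_by_year(theta, years)
```

Because each row of `doc_topic` sums to 1 and the weights within a year sum to 1, each per-year vector is itself a distribution over topics, which is what lets the values be plotted as topic trends across years.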

Experiments

  • Ran LDA with 100 topics and kept the 36 relevant topics.
  • Seeded words for 10 more topics to improve coverage.
  • Used these 36+10 topics as priors for a new 100-topic run.
  • Picked 43 topics from that run and manually labeled them.

Results

This is only a subset of their results. There are more interesting plots in the paper.

  • Trending topics in the CL community
    • Classification, Probabilistic Models, Statistical Parsing, Statistical MT, and Lexical Semantics are trending.

[Figure: Halltrend.png]

  • Declining topics in the CL community
    • Semantics and Discourse are declining.

[Figure: Halltdecline.png]

  • NLP applications
    • They investigated whether CL is becoming more applied over time.
    • The results show a gradual increase in applied work over time.

[Figure: Hallapp.png]