Topics over Time

Latest revision as of 23:08, 27 November 2011

Summary

Topics over Time (TOT) is an LDA-style method for topic modeling that explicitly models word co-occurrences jointly with time. In other words, TOT captures both the low-dimensional structure of data and how that structure changes over time. This method is motivated by the realization that structure in data (in this case, topic co-occurrence patterns) is not static but dynamic. Patterns present in the early part of the data may not be in effect later. Topics rise and fall, split and merge, and change correlations over time. In TOT, each generated document has a mixture distribution over topics that is influenced jointly by word co-occurrences and the document's time stamp.

Procedure

For each generated document, the mixture distribution over topics in the document is jointly influenced by word co-occurrences and the document's time stamp.

[[File:TOT.png|250px]]

In LDA-style generative modeling, TOT first draws T multinomials <math>\phi_z\,\!</math> from a Dirichlet prior <math>\beta\,\!</math>, one for each topic z.

Then, for each document d, TOT draws a multinomial <math>\theta_d\,\!</math> (the distribution over topics specific to the document d) from a Dirichlet prior <math>\alpha\,\!</math>.

Then, for each word (the i th token in document d):

  • TOT draws a topic <math>z_{di}\,\!</math> (the topic associated with the i th token in document d) from the multinomial <math>\theta_d\,\!</math>
  • TOT draws a word <math>w_{di}\,\!</math> from the multinomial <math>\phi_{z_{di}}\,\!</math> (the distribution of words specific to topic z)
  • TOT draws a time stamp <math>t_{di}\,\!</math> (the time stamp associated with the i th token in the document d) from Beta <math>\psi_{z_{di}}\,\!</math> (the Beta distribution of time, specific to topic z)
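The generative process above can be sketched in a few lines of NumPy. This is a minimal toy sampler, not the authors' implementation: the dimensions, hyperparameter values, and variable names are illustrative assumptions, and time stamps are taken to lie in (0, 1) as in the paper's normalized-time setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model sizes and hyperparameters (illustrative, not from the paper).
T, V, n_docs, doc_len = 3, 20, 5, 50
alpha = np.ones(T)                     # Dirichlet prior on per-document topic mixtures
beta = np.ones(V)                      # Dirichlet prior on per-topic word multinomials
psi = rng.uniform(1, 5, size=(T, 2))   # per-topic Beta(a, b) parameters over time

phi = rng.dirichlet(beta, size=T)      # T word multinomials phi_z, one per topic

docs = []
for d in range(n_docs):
    theta = rng.dirichlet(alpha)       # theta_d: topic mixture for document d
    words, stamps = [], []
    for i in range(doc_len):
        z = rng.choice(T, p=theta)              # draw topic z_di from theta_d
        words.append(rng.choice(V, p=phi[z]))   # draw word w_di from phi_z
        stamps.append(rng.beta(*psi[z]))        # draw time stamp t_di from Beta(psi_z)
    docs.append((words, stamps))
```

Each document ends up with per-token time stamps drawn from the Beta distributions of the tokens' topics; in training on typical single-time-stamp documents, all of a document's tokens would share one observed stamp instead.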

In this generative process, a time stamp is associated with each word, which is ideal when different parts of a document discuss different time periods. In practice, however, documents typically carry only a single time stamp. Hence in training, each document's time stamp is copied to all of its words. An alternative model to TOT is one where a single time stamp is associated with each document (figure below):

[[File:TOT-alt.png|250px]]

Using this model, given the words in a document as input, we can output a prediction of the document's time stamp. To facilitate this prediction, time stamps are first discretized. Then, given a document, the model predicts its time stamp by maximizing the posterior, which is the product of the time stamp probabilities of all word tokens under their corresponding topic-wise Beta distributions over time: <math>\hat{t} = \arg\max_{t} \prod_{i=1}^{N_d} p(t \mid \psi_{z_{di}})\,\!</math>, where <math>N_d\,\!</math> is the number of word tokens in the document.
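This prediction step can be sketched as follows, assuming the tokens' topic assignments and the per-topic Beta parameters are already known. The function name, grid size, and parameter values are hypothetical; in log space, the product of Beta densities becomes a sum.

```python
import numpy as np
from math import lgamma, log

def log_beta_pdf(t, a, b):
    # Log density of Beta(a, b) evaluated at t in (0, 1).
    return (lgamma(a + b) - lgamma(a) - lgamma(b)
            + (a - 1) * log(t) + (b - 1) * log(1 - t))

def predict_time_stamp(token_topics, psi, n_bins=100):
    """Pick the discretized time stamp maximizing the product of each
    token's topic-wise Beta density (computed as a sum of log densities)."""
    grid = (np.arange(n_bins) + 0.5) / n_bins   # bin mid-points in (0, 1)
    scores = [sum(log_beta_pdf(t, *psi[z]) for z in token_topics)
              for t in grid]
    return grid[int(np.argmax(scores))]

# Toy usage with hypothetical parameters: topic 0 peaks early, topic 1 late.
psi = {0: (2.0, 8.0), 1: (8.0, 2.0)}
t_hat = predict_time_stamp([0, 0, 0, 1], psi)   # mostly early-topic tokens
```

With three tokens from the early-peaking topic and one from the late-peaking topic, the maximizing bin lands early in the (0, 1) range, as the product of densities is dominated by the early topic.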

Related Methods

The strength of TOT is in its joint modeling of time and word co-occurrence. Unlike TOT, other methods that model changes of topics over time do so non-jointly: either (1) they first fit a time-unaware topic model to the data and then order the documents in time, or (2) they divide the data into discrete time slices and fit a separate topic model to each slice. The disadvantage of such approaches is that they do not make full use of time information to improve topic discovery.

Unlike other topic modeling methods that treat the "meaning" (i.e., word distributions) of topics as changing over time, topics in TOT are modeled as constant over time. It is the occurrence and co-occurrence of topics that change over time, not their word distributions. The rise and fall of topics, and their splits and merges, are modeled as changes in topic occurrence and co-occurrence over time.

Another related method is the "burst of activity" model in Peak Detection, which uses a probabilistic infinite-state automaton with a Markov state structure in which high-activity states are reachable only by passing through lower-activity states. That method does not work directly with time stamps; rather, it uses data ordering as a proxy for time. TOT, by contrast, uses time stamps directly. Also, instead of employing a Markov assumption over time, TOT treats time as an observed continuous variable.

References / Links

  • Wang, X. and McCallum, A. Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. KDD 2006.

Relevant Papers