Project Draft Yandong Nitin

Team Members

Yandong Liu

Nitin Agarwa

Project Idea

Text mining is the task of finding patterns of interest by analyzing text, and it has received a great deal of research attention in recent years, especially with the wide application of topic models. However, many text mining methods, especially those involving topic models, assume that the semantic themes are static across the whole document collection, even when each document carries a timestamp. In many cases this assumption does not hold because of the nature of the documents. For example, in event detection over news article collections, especially in an unsupervised setting, topic models are applied to help define events. Assuming that topics are static over time produces a large number of topics over the whole collection, of which only a few are active in any given time frame. This large number of topics also introduces a lot of noise, which makes the detection task inherently more challenging. Another motivation comes from text sources whose subject matter keeps changing, such as personal blog posts, since a blogger's interests can change over time.

There is related work on this problem. Blei and Lafferty proposed dynamic topic models, an HMM-style LDA model that builds a state space of topics and implies that topics evolve step by step from one time slice to the next. Instead of assuming such a first-order Markov process, we would like to model the time aspect of the documents directly. Mei and Zhai analyzed how theme patterns evolve over time in large text collections, using language models and KL-divergence to model the transitions between topics. Their paper assumes a simple unigram mixture topic model for each document, which has been pointed out in the literature to cause overfitting. We would also like to model the transitions within the same model.
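As a rough illustration of the KL-divergence comparison mentioned above, the sketch below builds smoothed unigram language models for themes in different time slices and compares them. The word lists, the smoothing constant `alpha`, and the helper names are assumptions made for this example and are not taken from the cited paper; a low divergence would only suggest that a later theme continues an earlier one.

```python
import math
from collections import Counter


def unigram_lm(tokens, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) in nats; p and q must be defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


# Made-up word lists standing in for themes extracted from different time slices.
theme_t1 = "light sky bright light hover".split()
theme_t2 = "light sky triangle craft silent".split()
theme_t3 = "dog park walk ball fetch".split()

vocab = sorted(set(theme_t1) | set(theme_t2) | set(theme_t3))
p1, p2, p3 = (unigram_lm(t, vocab) for t in (theme_t1, theme_t2, theme_t3))

close = kl_divergence(p1, p2)  # themes sharing vocabulary
far = kl_divergence(p1, p3)    # unrelated themes

print(f"KL(t1 || t2) = {close:.4f}")
print(f"KL(t1 || t3) = {far:.4f}")
```

Smoothing keeps every probability positive, so the logarithm is always defined. In practice a threshold on the divergence would have to be chosen empirically to decide whether two themes are linked.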

Dataset

We found a very interesting dataset of UFO reports at http://www.nuforc.org/webreports/ndxevent.html, provided by the National UFO Reporting Center. The dataset comprises reports dating from as early as 1400 up to today, and each report is well structured, with facets including occurrence time, reporting time, posting time, location, shape, and duration, as well as a free-text description.
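One possible way to hold such a report in memory, and to split the collection into time slices by year of occurrence, is sketched below. The class name, field names, and sample records are hypothetical and chosen only for illustration; the actual layout of the NUFORC pages may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Report:
    """One report; the fields mirror the facets listed above (names are assumed)."""
    occurred: datetime
    reported: datetime
    posted: datetime
    location: str
    shape: str
    duration: str
    description: str


def group_by_year(reports):
    """Bucket reports by the year in which the sighting occurred."""
    slices = defaultdict(list)
    for report in reports:
        slices[report.occurred.year].append(report)
    return dict(slices)


# Made-up sample records, not real entries from the dataset.
reports = [
    Report(datetime(1995, 7, 4, 21, 30), datetime(1995, 7, 5), datetime(1995, 7, 10),
           "Pittsburgh, PA", "disk", "5 minutes", "Bright disk hovering over the river."),
    Report(datetime(2008, 1, 12, 19, 0), datetime(2008, 1, 12), datetime(2008, 1, 21),
           "Austin, TX", "triangle", "30 seconds", "Silent triangle with three lights."),
    Report(datetime(2008, 6, 2, 23, 15), datetime(2008, 6, 3), datetime(2008, 6, 12),
           "Reno, NV", "light", "2 minutes", "Orange light moving erratically."),
]

slices = group_by_year(reports)
for year in sorted(slices):
    print(year, len(slices[year]))
```

Slicing by occurrence time rather than posting time is a design choice here; either timestamp could serve as the time component of a model.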

Method

We will likely build graphical models with a time component to model the data and extract themes. If time permits, we would also like to visualize the extracted themes and see how they evolve over time.

References:

 D. Blei and J. Lafferty. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
 Qiaozhu Mei, ChengXiang Zhai. Discovering Evolutionary Theme Patterns from Text - An Exploration of Temporal Text Mining.