Leskovec et al KDD 09
This is a paper that appeared at the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
Citation
title={Meme-tracking and the dynamics of the news cycle}, author={Leskovec, J. and Backstrom, L. and Kleinberg, J.}, booktitle={Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining}, pages={497--506}, year={2009}, organization={ACM}
Online version
Meme-tracking and the dynamics of the news cycle
Summary
In this paper, the authors make a first attempt to analyze, at large scale, how memes spread in the news cycle and how they propagate from major mass media news sites to blogs and vice versa. From a data mining perspective, existing work does not cover this focus: it either identifies long-range trends in general topics over time (using probabilistic term mixtures) or tracks only short information cascades through the blogosphere. The authors place their task between these two perspectives and present a novel means of first clustering phrases that essentially correspond to the same "meme", then tracking those clusters over time, and finally modeling their evolution at both the global and the local scale (namely, examining how a meme evolves over time as a whole, as well as how it behaves at very localized points in time, especially during its rise).
In order to cluster phrases, the authors use a graph-based approach: they create a Directed Acyclic Graph (DAG) over all the phrases, in which nodes are phrases and an edge exists between two phrases if they are "close" in terms of edit distance. By partitioning this DAG, the authors identify clusters of phrases that correspond to the same meme.
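The phrase-graph idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the authors' exact procedure: the function names, the `max_unmatched` threshold, the sample phrases, the choice to direct edges from shorter phrases to longer ones (which keeps the graph acyclic), the use of "words of the shorter phrase not matched in order by the longer one" as a stand-in for word-level edit distance, and the simple partitioning rule that keeps one outgoing edge per phrase.

```python
from collections import defaultdict

def lcs_len(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def build_phrase_dag(phrases, max_unmatched=1):
    """Return {p: [(q, cost), ...]} with edges from shorter p to longer q.

    cost = number of words in p that cannot be matched, in order, inside q
    (an assumed proxy for "close in edit distance").
    """
    tokens = {p: p.lower().split() for p in phrases}
    edges = defaultdict(list)
    for p in phrases:
        for q in phrases:
            # Strictly shorter -> longer, so no cycles can form.
            if len(tokens[p]) < len(tokens[q]):
                cost = len(tokens[p]) - lcs_len(tokens[p], tokens[q])
                if cost <= max_unmatched:
                    edges[p].append((q, cost))
    return edges

def cluster_by_root(phrases, edges):
    """Keep each phrase's cheapest outgoing edge and group phrases by root."""
    def root(p):
        # Follow the cheapest edge until reaching a phrase with no outgoing edge.
        while edges.get(p):
            p = min(edges[p], key=lambda e: e[1])[0]
        return p

    clusters = defaultdict(list)
    for p in phrases:
        clusters[root(p)].append(p)
    return dict(clusters)

# Hypothetical phrases, not taken from the paper's dataset.
phrases = [
    "the quick brown fox jumps over the lazy dog",
    "quick brown fox jumps",
    "the quick brown fox",
    "a stitch in time saves nine",
    "stitch in time",
]
clusters = cluster_by_root(phrases, build_phrase_dag(phrases))
for root_phrase, members in clusters.items():
    print(root_phrase, "<-", members)
```

In this toy run the two short fox variants attach to the long fox phrase and "stitch in time" attaches to its longer form, giving two clusters. Very short phrases would connect to almost anything under this threshold, so a real pipeline would likely filter them out first.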
Data Analysis
The authors operate on data crawled from the web from August 1 to October 31, 2008. The dataset contains over 1 million documents per day, amounting to over 90 million articles in total. Documents come from both major news websites and blogs, and the total size of the dataset is 390GB.
Global Modeling:
Local Modeling: