Leskovec et al KDD 09


This is a Paper that appeared at the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2009.

Citation

 @inproceedings{leskovec2009memetracking,
  title={Meme-tracking and the dynamics of the news cycle},
  author={Leskovec, J. and Backstrom, L. and Kleinberg, J.},
  booktitle={Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining},
  pages={497--506},
  year={2009},
  organization={ACM}
 }

Online version

Meme-tracking and the dynamics of the news cycle

Summary

In this paper, the authors are the first to analyze, at such a large scale, how memes spread in the news cycle and how they propagate from major mass-media news sites to blogs and vice versa. From a data mining perspective, existing work does not cover the authors' focus: it either identifies long-range trends in general topics over time (using probabilistic term mixtures) or only tracks short information cascades through the blogosphere. The authors place their task between these two perspectives and present a novel approach that first clusters phrases that essentially correspond to the same "meme", then tracks those clusters over time, and finally models their evolution at both the global and the local scale (namely, how a meme evolves over its whole lifetime, as well as how it behaves at very localized points in time, especially during its rise).

In order to cluster phrases, the authors use a graph-based approach: they create a Directed Acyclic Graph (DAG) over all the phrases, where nodes are phrases and an edge connects two phrases that are "close" in terms of edit distance. By partitioning this DAG, the authors identify clusters of phrases that correspond to the same meme; a rough sketch of this construction follows below.
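
The following is a minimal sketch of such a phrase-graph construction, not the authors' actual implementation: it assumes a word-level edit-distance threshold (or a substring relation), directs edges from shorter to longer phrases so the graph stays acyclic, and uses connected components as a stand-in for the paper's partitioning heuristic. The example phrases and the threshold value are illustrative only.

 # Sketch: build a phrase DAG and cluster phrases into "memes".
 from itertools import combinations
 
 def edit_distance(a, b):
     """Word-level Levenshtein distance between two token lists."""
     m, n = len(a), len(b)
     dp = list(range(n + 1))
     for i in range(1, m + 1):
         prev, dp[0] = dp[0], i
         for j in range(1, n + 1):
             cur = dp[j]
             dp[j] = min(dp[j] + 1,                        # deletion
                         dp[j - 1] + 1,                    # insertion
                         prev + (a[i - 1] != b[j - 1]))    # substitution
             prev = cur
     return dp[n]
 
 def build_phrase_dag(phrases, max_dist=1):
     """Nodes are phrases; add an edge from the shorter to the longer phrase
     when they are within max_dist word edits or one contains the other.
     Directing edges by length (ties follow input order) keeps the graph acyclic."""
     edges = []
     for p, q in combinations(phrases, 2):
         tp, tq = p.split(), q.split()
         short, long_ = (p, q) if len(tp) <= len(tq) else (q, p)
         if edit_distance(tp, tq) <= max_dist or short in long_:
             edges.append((short, long_))
     return edges
 
 def cluster_phrases(phrases, edges):
     """Group phrases into memes via connected components (union-find)."""
     parent = {p: p for p in phrases}
     def find(x):
         while parent[x] != x:
             parent[x] = parent[parent[x]]
             x = parent[x]
         return x
     for u, v in edges:
         parent[find(u)] = find(v)
     clusters = {}
     for p in phrases:
         clusters.setdefault(find(p), []).append(p)
     return list(clusters.values())
 
 phrases = [
     "lipstick on a pig",
     "you can put lipstick on a pig",
     "put lipstick on a pig",
     "our entire economy is in danger",
 ]
 print(cluster_phrases(phrases, build_phrase_dag(phrases)))
 # -> the three "lipstick" variants form one cluster, the last phrase its own

The paper's actual heuristics for edge creation and for cutting the DAG into clusters are more involved; this sketch only illustrates the overall node/edge/partition structure described above.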

Data Analysis

The authors operate on data crawled from the web between August 1 and October 31, 2008. The dataset comprises roughly 1 million documents per day, amounting to over 90 million articles in total. Documents come from both major news websites and blogs, and the total size of the dataset is 390GB.

Global Modeling:

(Figure: Leskovec kdd 09 global.png)

Local Modeling:

(Figure: Leskovec kdd 09 local.png)

Discussion