Leskovec et al KDD 09
Latest revision as of 13:08, 1 October 2012
This is a Paper that appeared at the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (http://www.sigkdd.org/) 2009.
Citation
title={Meme-tracking and the dynamics of the news cycle}, author={Leskovec, J. and Backstrom, L. and Kleinberg, J.}, booktitle={Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining}, pages={497--506}, year={2009}, organization={ACM}
Online version
Meme-tracking and the dynamics of the news cycle
Summary
In this paper, the authors attempt to analyze, for the first time at such a large scale, how memes spread in the news cycle and how they propagate from major mass media news sites to blogs and vice versa. From a data mining perspective, existing work does not cover the authors' focus: it either identifies long-range trends in general topics over time (using probabilistic term mixtures) or tracks only short information cascades through the blogosphere. The authors place their task between these two perspectives. They present a novel means of clustering phrases that essentially correspond to the same "meme", tracking those clusters over time, and finally modeling their evolution at both the global and the local scale (namely, how a meme evolves over its whole lifetime, and how it behaves in a narrow window of time, especially during its rise).
To cluster phrases, the authors use a graph-based approach: they build a Directed Acyclic Graph (DAG) over all the phrases, in which nodes are phrases and an edge connects two phrases if they are "close" in terms of edit distance. By partitioning this DAG, the authors identify clusters of phrases that correspond to the same meme.
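The clustering idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the word-level edit distance, the `max_relative_distance` threshold, the rule of orienting edges from the shorter phrase to the longer one, and the greedy "keep the closest outgoing edge" partitioning are all choices made here for the example, and the phrases are toy inputs.

```python
from collections import defaultdict


def word_edit_distance(a, b):
    """Levenshtein distance between two token lists (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_phrase_dag(phrases, max_relative_distance=0.5):
    """Add an edge short -> long when two phrases are close in edit distance.

    Edges always point from a phrase to a strictly longer one, so the graph
    cannot contain a cycle. The threshold is a hypothetical value.
    """
    tokens = {p: p.lower().split() for p in phrases}
    edges = defaultdict(list)  # phrase -> [(longer phrase, distance), ...]
    for p in phrases:
        for q in phrases:
            if len(tokens[p]) < len(tokens[q]):
                d = word_edit_distance(tokens[p], tokens[q])
                if d <= max_relative_distance * len(tokens[q]):
                    edges[p].append((q, d))
    return edges


def partition(phrases, edges):
    """Greedy stand-in for DAG partitioning (an assumption of this sketch):
    each phrase keeps only its closest outgoing edge, and phrases are then
    grouped by the root (a phrase with no outgoing edge) that they reach."""
    parent = {
        p: min(edges[p], key=lambda e: e[1])[0] for p in phrases if edges.get(p)
    }
    clusters = defaultdict(list)
    for p in phrases:
        root = p
        while root in parent:
            root = parent[root]
        clusters[root].append(p)
    return dict(clusters)


# Toy phrases, invented for illustration.
phrases = [
    "lipstick on a pig",
    "you can put lipstick on a pig",
    "you can put lipstick on a pig but it is still a pig",
    "fundamentals of our economy are strong",
    "the fundamentals of our economy are strong",
]
clusters = partition(phrases, build_phrase_dag(phrases))
for root, members in clusters.items():
    print(root, "<-", members)
```

In this sketch each cluster is identified by its longest variant, and shorter variants attach to it through a chain of edges. A quadratic all-pairs comparison like this would not scale to a dataset of the size described below; it is meant only to make the graph construction concrete.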
Data Analysis
The authors operate on data crawled from the web between August 1 and October 31, 2008. The dataset contains over 1 million documents per day, amounting to over 90 million articles in total. Documents come from both major news websites and blogs, and the total size of the dataset is 390GB.
Global Modeling:
The authors model the global behavior of memes as they propagate over time. "Global" here means that they define a "thread" associated with a given phrase and attempt to reproduce its temporal evolution.
Details about the models can be found in the paper; in general, the figure below shows that their model is able to capture the global "thread" dynamics.
[Figure: Leskovec_kdd_09_global.png]
Local Modeling:
Apart from thread modeling, the authors also look at the data at a more fine-grained scale, specifically at the interval around a peak in a phrase's temporal evolution. To their surprise, they find that the rise toward such peaks is steeper than exponential, so more than one function is needed to model this phenomenon. The figure below shows the chosen modeling function plotted against a real spike; again, it demonstrates that their model succeeds in capturing these fine-grained dynamics.
[Figure: Leskovec_kdd_09_local.png]
Discussion
Probably the most interesting finding of this work is that there is a 2.5-hour lag between the spread of a thread on news sites and its spread on blogs (usually in this direction, but also in the opposite one). The authors justify this as follows: major news websites usually come up with a piece of information first; later (2.5 hours later on average), usually once its popularity/volume on the news websites has started to drop, it diffuses to the blogosphere, where it persists because more people are talking about it.
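One simple way to make the notion of a lag concrete is to compare the hour at which a thread peaks in news media with the hour at which it peaks in blogs, and average that difference over threads. The sketch below assumes hourly mention counts per thread; the function names and the counts are invented for illustration and are not the authors' data or necessarily their exact procedure.

```python
def peak_hour(volume_by_hour):
    """Return the index of the hour with the highest mention volume."""
    return max(range(len(volume_by_hour)), key=lambda h: volume_by_hour[h])


def peak_lag_hours(news_volume, blog_volume):
    """Hours by which the blog peak trails the news peak (negative if it leads)."""
    return peak_hour(blog_volume) - peak_hour(news_volume)


def mean_peak_lag(threads):
    """Average peak lag over a list of (news_volume, blog_volume) pairs."""
    lags = [peak_lag_hours(news, blogs) for news, blogs in threads]
    return sum(lags) / len(lags)


# Made-up hourly counts for two hypothetical threads.
threads = [
    ([1, 5, 9, 4, 2, 1, 0, 0], [0, 1, 2, 6, 3, 2, 1, 1]),  # blogs peak 1 hour later
    ([2, 8, 3, 1, 1, 0, 0, 0], [0, 0, 1, 2, 7, 4, 2, 1]),  # blogs peak 3 hours later
]
print(mean_peak_lag(threads))  # 2.0 for these toy numbers
```

With hourly bins, a fractional average such as 2.5 hours can only emerge from aggregating over many threads; finer time bins would be needed to measure a sub-hour lag for a single thread.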