Leskovec, J., L. Backstrom, and J. Kleinberg. 2009. Meme-tracking and the Dynamics of the News Cycle. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 497–506.


Online Version

An online version of this paper is available here: [1]

Summary

The authors develop a framework for tracking memes, which they define as short, distinctive phrases, as they spread and evolve over time. They use their approach to perform global- and local-level analyses of the news cycle and find a number of interesting patterns. On the global level, they find that temporal patterns arise naturally from a simple mathematical model in which news sources imitate each other's decisions about what to cover, subject to recency effects that penalize older content. On the local level, they find a typical lag of 2.5 hours between the peak of attention to a phrase in the news media and the corresponding peak in blogs.

Motivation

Prior work on analyzing the diffusion of highly dynamic online information has taken one of two main approaches:

  • Probabilistic term mixtures, which are successful at identifying long-range trends in general topics over time
  • Identifying hyperlinks between blogs and extracting rare named entities to track short information cascades through the blogosphere

The authors wish to analyze the continuous interaction of news, blogs, and websites on a daily basis, which lies somewhere between these two extremes. Specifically, they are interested in short units of text, short phrases, and “memes” that act as signatures of topics and events, and in how these propagate and diffuse over the web, from mainstream media to blogs and vice versa.

Dataset

The authors compile an extensive dataset consisting of three months of online mainstream and social media activity, from August 1 to October 31, 2008, with about 1 million documents per day. In total it consists of 90 million documents (blog posts and news articles) from 1.65 million different sites, obtained through the Spinn3r API.

Clustering Phrases

The authors start their study by producing phrase clusters, which are collections of phrases deemed to be close textual variants of one another. To do so, they first build a phrase graph in which each phrase is represented by a node and directed edges connect related phrases. They then partition this DAG so that its components form the phrase clusters. The general DAG partitioning problem is NP-hard, so they use a heuristic method instead, as sketched below.
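The exact graph construction and partitioning heuristics in the paper are more involved; as a rough, hedged sketch of the idea, the code below links each phrase to longer phrases that contain it (or lie within a small edit distance), keeps a single outgoing edge per node, and takes the resulting components as phrase clusters. The substring/edit-distance criterion, the edge-selection rule, and the example quotes are illustrative assumptions, not the paper's actual procedure.

```python
from collections import defaultdict

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def build_phrase_graph(phrases, max_edit=1):
    """Directed edges point from shorter phrases to longer related ones,
    so the graph is acyclic (a DAG) by construction."""
    edges = defaultdict(list)
    for p in phrases:
        for q in phrases:
            if len(p) < len(q) and (p in q or edit_distance(p, q) <= max_edit):
                edges[p].append(q)
    return edges

def cluster_phrases(phrases, max_edit=1):
    """Greedy partitioning: each phrase keeps one outgoing edge (to its
    shortest superphrase here); the connected components of the kept
    edges are returned as the phrase clusters."""
    edges = build_phrase_graph(phrases, max_edit)
    parent = {p: p for p in phrases}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for p, qs in edges.items():
        q = min(qs, key=len)          # keep a single outgoing edge per node
        parent[find(p)] = find(q)     # merge the two components

    clusters = defaultdict(list)
    for p in phrases:
        clusters[find(p)].append(p)
    return list(clusters.values())

if __name__ == "__main__":
    quotes = ["lipstick on a pig",
              "put lipstick on a pig",
              "you can put lipstick on a pig",
              "our entire economy is in danger"]
    for cluster in cluster_phrases(quotes):
        print(cluster)
```

Running the toy example groups the three "lipstick" variants into one cluster and leaves the unrelated quote in a cluster of its own, which is the qualitative behavior the phrase clusters are meant to capture.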

Global analysis

The authors define a thread associated with a given phrase cluster to be the set of all items (news articles or blog posts) containing some phrase from the cluster, and they track all threads over time, considering both their individual temporal dynamics and their interactions with one another. They then develop a model to explain the temporal variations in threads, and find that two minimal ingredients must be taken into account to produce such variations:

  • Different sources imitate one another, so that once a thread experiences significant volume, it is likely to persist and grow through adoption by others.
  • The second factor, which counteracts the first, is that threads are governed by strong recency effects, in which newer threads are favored over older ones.

They find that a model incorporating both factors produces variations very similar to those observed in actual threads from the dataset, while models that use either factor alone do not.
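The two ingredients are easy to simulate. The sketch below is one plausible reading of the model rather than the paper's exact formulation: at every time step each source covers one thread, chosen with probability proportional to the thread's current volume multiplied by a recency penalty that decays with its age, while a few new threads enter at each step. The power-law decay exponent, source count, and thread arrival rate are illustrative assumptions.

```python
import random
from collections import defaultdict

def simulate(n_sources=100, n_steps=200, new_threads_per_step=1,
             gamma=1.5, seed=0):
    """Toy imitation-plus-recency model: at every step each source covers one
    thread with probability proportional to volume(thread) * (age + 1) ** -gamma."""
    rng = random.Random(seed)
    birth = {}                   # thread id -> time step it appeared
    volume = defaultdict(int)    # thread id -> total number of items so far
    history = defaultdict(list)  # thread id -> per-step volume
    next_id = 0

    for t in range(n_steps):
        # A few fresh threads appear each step with a small seed volume.
        for _ in range(new_threads_per_step):
            birth[next_id] = t
            volume[next_id] = 1
            next_id += 1

        # Attractiveness = imitation (current volume) x recency penalty.
        threads = list(birth)
        weights = [volume[j] * (t - birth[j] + 1) ** -gamma for j in threads]

        step_counts = defaultdict(int)
        for _ in range(n_sources):
            j = rng.choices(threads, weights=weights)[0]
            step_counts[j] += 1
        for j, c in step_counts.items():
            volume[j] += c
        for j in threads:
            history[j].append(step_counts[j])

    return history

if __name__ == "__main__":
    history = simulate()
    # The per-step volume of the largest thread rises as sources imitate one
    # another and then collapses under the recency penalty.
    biggest = max(history, key=lambda j: sum(history[j]))
    print(history[biggest])
```

Dropping either ingredient changes the behavior in the way the authors describe: without imitation no thread ever dominates, and without the recency penalty the earliest popular threads never die out.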

Local Analysis

For local analysis, the authors focus on the temporal dynamics around the peak intensity of a typical thread, as well as the interplay between the news media and blogs in producing the structure of this peak. They find that:

  • One might expect the overall volume of a thread to be very low initially, rise as the mass media joins in, and slowly decay as it percolates to blogs and other media. Instead, they find that the behavior is quite different: threads are slow to reach their peak, and afterwards they decay very quickly.
  • They also find that the peak for a thread in the news media typically occurs first, with the corresponding peak among blogs following a median of 2.5 hours later (see the sketch after this list). Moreover, news volume not only rises faster and higher but also falls off more quickly than blog volume. They speculate that this is because the news media quickly adopt a quoted phrase heavily and are just as quick to drop it as they move on to new content, whereas bloggers pick up phrases from the news media with the 2.5-hour lag and then discuss them for much longer.
  • In observing the handoff of quoted phrases or memes from news media to blogs, they notice “heartbeat”-like dynamics in which the phrase “oscillates” between blogs and mainstream media.
  • While the majority of phrases first appear in the news media and then diffuse to blogs, where they are discussed for a longer time, some phrases propagate in the opposite direction, percolating through the blogosphere until they are picked up by the news media.
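As a concrete illustration of the kind of measurement behind the 2.5-hour figure, the sketch below bins timestamped mentions of a single thread into hourly counts, separately for news sites and blogs, and reports the offset between the two peaks. The mention representation, field names, and bin width are assumptions made here for illustration; the paper computes the lag over its full collection of threads.

```python
from collections import Counter
from datetime import datetime

def peak_hour(timestamps, bin_hours=1):
    """Return the start of the busiest time bin, in hours since the epoch."""
    counts = Counter(int(ts.timestamp()) // (3600 * bin_hours) for ts in timestamps)
    return max(counts, key=counts.get) * bin_hours

def news_blog_lag(mentions):
    """mentions: list of (datetime, source_type) with source_type 'news' or 'blog'.
    Returns blog peak minus news peak, in hours (positive = blogs lag the news)."""
    news = [ts for ts, kind in mentions if kind == "news"]
    blogs = [ts for ts, kind in mentions if kind == "blog"]
    return peak_hour(blogs) - peak_hour(news)

if __name__ == "__main__":
    d = datetime.fromisoformat
    mentions = [
        (d("2008-09-10 09:05"), "news"), (d("2008-09-10 09:40"), "news"),
        (d("2008-09-10 10:10"), "news"), (d("2008-09-10 11:30"), "blog"),
        (d("2008-09-10 12:15"), "blog"), (d("2008-09-10 12:45"), "blog"),
    ]
    print(news_blog_lag(mentions), "hour(s)")  # blogs peak after the news here
```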

Conclusion

The authors develop a framework for tracking short, distinctive phrases that travel relatively intact through online text, and they present scalable algorithms for identifying and clustering textual variants of such phrases in large collections of articles. Their work makes it possible to investigate many further issues, such as how best to characterize the dynamics of mutation within phrases, how information changes as it propagates, and how to model the way in which the essential “core” of a widespread quoted phrase emerges and enters popular discourse more generally.