Gruhl et al www 2004


Citation

Daniel Gruhl, R. Guha, David Liben-Nowell, Andrew Tomkins. Information diffusion through blogspace. Proceedings of the 13th International Conference on World Wide Web, May 17-20, 2004.

Online Version

http://dl.acm.org/citation.cfm?id=988739

Summary

This paper characterizes information diffusion through blogspace along two dimensions: topics and individuals. The model assumes that the activity on a topic is mainly composed of chatter and spikes. The paper explores snapshot topic models, which focus on short-term behavior (weeks or months). The features for topic extraction were mostly proper nouns, judged interesting using tf-cidf* with thresholds of tf > 10 and tf-cidf > 10. From the extracted terms, the authors manually identified 340 classical topics. They identify three types of topic patterns:

Spiky.png

  • Just Spike: Topics which became suddenly active and went back to being inactive again.
  • Spiky Chatter: Topics that react quickly and strongly to external events and have many spikes.
  • Mostly Chatter: Topics which are continuously discussed at relatively moderate levels.

Having characterized these patterns qualitatively, the authors were able to quantify each topic with two parameters, one corresponding to the chatter level and one to the spike pattern. Their results show that the model captures their intuition of spikes well.
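To make the chatter/spike split concrete, the sketch below flags spike days in a series of daily post counts and reports the remaining activity as a chatter level. The rule used here (a day is a spike when its count exceeds the mean by `k` standard deviations) and the names `spike_days` and `chatter_level` are assumptions chosen for illustration, not necessarily the paper's exact parameterization.

```python
from statistics import mean, pstdev


def spike_days(daily_counts, k=2.0):
    """Return indices of days whose post count exceeds mean + k * std.

    The mean + k * std rule is an illustrative assumption.
    """
    mu = mean(daily_counts)
    sigma = pstdev(daily_counts)
    threshold = mu + k * sigma
    return [day for day, count in enumerate(daily_counts) if count > threshold]


def chatter_level(daily_counts, k=2.0):
    """Average posts per day outside spike days (hypothetical chatter measure)."""
    spikes = set(spike_days(daily_counts, k))
    quiet = [c for d, c in enumerate(daily_counts) if d not in spikes]
    return mean(quiet) if quiet else 0.0


# Toy series: steady low-level chatter with one burst on day 4.
counts = [3, 4, 2, 3, 40, 5, 3, 4, 2, 3]
print(spike_days(counts))
print(round(chatter_level(counts), 2))
```

Under this reading, a "Just Spike" topic would have a chatter level near zero and one flagged region, while a "Mostly Chatter" topic would have a moderate chatter level and few or no flagged days.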

The next part focuses on modeling how a given topic spreads across individuals. The authors define a few predicates on topics, which they use to represent the life-cycle of a topic. The table below contains these predicates.

Lifecycle.png

The model is derived from the Independent Cascade model of Goldenberg et al. The interesting results from this analysis are:

  • Most users leave a topic with less energy than it arrived with, transmitting it to fewer than one additional person.
  • Some users provide a boost to every topic they post about.
  • There are critical linkages: some individuals tend to be more strongly associated with topics than others.
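The propagation idea behind these findings can be sketched with a plain independent cascade simulation: each newly active node gets one chance to activate each of its inactive neighbors. This is a minimal illustration, assuming a single transmission probability `prob` shared by all edges and a hypothetical adjacency-dict graph; it is not the paper's exact variant of the model.

```python
import random


def independent_cascade(graph, seeds, prob, rng=None):
    """Simulate one run of a basic independent cascade.

    graph: dict mapping a node to the list of nodes that read it (assumed format).
    seeds: nodes that post about the topic initially.
    prob:  probability that an active node activates a given neighbor
           (a single shared value, assumed for simplicity).
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, []):
                # Each active node gets exactly one attempt per inactive neighbor.
                if neighbor not in active and rng.random() < prob:
                    active.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return active


# Hypothetical blog graph: an edge a -> b means b reads a.
blog_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
reached = independent_cascade(blog_graph, ["a"], prob=0.5, rng=random.Random(0))
print(sorted(reached))
```

With a low `prob`, most runs die out after the seed, which mirrors the observation that a typical user passes a topic on to fewer than one additional person.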

chatter

Ongoing discussion whose subtopic flow is largely determined by the authors.

spikes

Short-term, high intensity discussion of real-world events that are relevant to the topic.


* tf: Term Frequency; cidf: Cumulative Inverse Document Frequency.
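As a rough illustration of how a tf-cidf filter might work, the sketch below scores a term on a given day by comparing its frequency that day with its average daily frequency over all earlier days, then applies the two thresholds quoted in the summary. The exact formula is an assumption made for this example, as are the names `tfcidf` and `interesting_days`; consult the paper for the authors' definition.

```python
def tfcidf(daily_tf, day):
    """Term frequency on `day` divided by the mean daily tf over earlier days.

    This formula is an assumed reading of "cumulative inverse document
    frequency", used here for illustration only.
    """
    history = daily_tf[:day]
    total = sum(history)
    if total == 0:
        # Term never seen before: treat any occurrence as maximally novel.
        return float("inf") if daily_tf[day] > 0 else 0.0
    return daily_tf[day] * len(history) / total


def interesting_days(daily_tf, min_tf=10, min_tfcidf=10):
    """Days on which the term passes both thresholds (tf > 10, tf-cidf > 10)."""
    return [
        day
        for day in range(1, len(daily_tf))
        if daily_tf[day] > min_tf and tfcidf(daily_tf, day) > min_tfcidf
    ]


# Toy daily counts for one proper noun: rare, then a sudden burst on day 4.
daily_tf = [1, 1, 1, 1, 50, 2]
print(tfcidf(daily_tf, 4))
print(interesting_days(daily_tf))
```

Requiring both thresholds filters out terms that are frequent every day (high tf, low tf-cidf) as well as rare terms whose ratio is inflated by tiny counts.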

Dataset

The corpus was formed by collecting daily crawls of 11,804 blog feeds, with between 2K and 10K blogs posting per day. For the corresponding period, data from 14 RSS channels of rss.new.yahoo.com were also collected, to keep track of what was going on in the media. The authors used WebFountain to store the data as parent/child entities.



Study Plan

For Topic Modeling

  • Sequential Pattern Mining [1]

For Individual Modeling


Related Work

  • Norman T. J. Bailey. The Mathematical Theory of Infectious Diseases and its Applications. Griffin, London, 2nd edition, 1975.
  • Cristopher Moore and M. E. J. Newman. Epidemics and percolation in small-world networks. Physical Review E, 61:5678–5682, 2000. cond-mat/9911492.
  • M. E. J. Newman. The spread of epidemic disease on networks. Physical Review E, 66(016128), 2002. cond-mat/0205009.