Rodriguez et al. KDD 2010
Online Version
An electronic version of this paper can be downloaded here: [1]
Summary
This paper addresses the problem of inferring a diffusion network from observed data consisting of the times at which a contagion infects particular nodes in the graph. The main premise of the proposed approach is that joint observations of many different contagion-spreading processes can determine the underlying diffusion network, assuming that it does not change over time. The proposed algorithm, NetInf, models the spread of a contagion as a directed tree through the network and reduces the search over an exponentially large set of candidate trees to polynomial time. Instead of estimating the likelihood of a cascade over all possible propagation trees, which makes the likelihood computationally intractable, NetInf approximates it by considering only the most likely tree, an approximation that is found to work well empirically. Additionally, influence from outside the network is modeled via ε-edges, which allow every node to be infected by some external source with small probability.
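The most-likely-tree approximation can be illustrated with a minimal sketch. Because infections only propagate forward in time, the best tree for one cascade can be found by letting each infected node pick its single most likely parent among the earlier-infected neighbors, falling back to the external source when no network edge explains the infection. The function name, the exponential transmission weight, and the parameters `alpha`, `beta`, and `eps` are assumptions made for illustration, not the paper's exact formulation.

```python
import math


def best_tree_log_likelihood(times, edges, alpha=1.0, beta=0.5, eps=1e-4):
    """Score one cascade by its single most likely propagation tree.

    times: dict mapping each infected node to its infection time
    edges: set of directed (u, v) pairs in the candidate network
    alpha, beta, eps: illustrative parameters (assumed, not from the paper)
    """
    total = 0.0
    parents = {}
    for v, t_v in times.items():
        # Fallback: v was infected by the external source (an epsilon-edge).
        best_score, best_parent = math.log(eps), None
        for u, t_u in times.items():
            # Only earlier-infected nodes with an edge into v can be parents.
            if t_u < t_v and (u, v) in edges:
                # Assumed log-weight: shorter delays are more likely.
                score = math.log(beta) - (t_v - t_u) / alpha
                if score > best_score:
                    best_score, best_parent = score, u
        total += best_score
        parents[v] = best_parent
    return total, parents


cascade = {"a": 0.0, "b": 1.0, "c": 2.0}
network = {("a", "b"), ("b", "c"), ("a", "c")}
log_lik, tree = best_tree_log_likelihood(cascade, network)
print(tree)
```

In this toy cascade, node `c` chooses `b` over `a` as its parent because the shorter delay gives a higher weight, and node `a` has no earlier-infected neighbor, so it is attributed to the external source.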
Results
For evaluation, they apply the proposed approach to both synthetic data (including networks from the Forest Fire model and the Kronecker Graphs model) and real data (a blog hyperlink cascades dataset and the MemeTracker dataset, created from 172 million news articles and blog posts from 1 million online sources over a year). They compare the results with a simple baseline that infers the diffusion network by picking the edges with the highest scores, where the score of an edge reflects how likely cascades are to propagate over it.
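A baseline of this kind might look like the following sketch: each ordered pair of nodes accumulates a score across all cascades in which the first node was infected before the second, and the `k` highest-scoring pairs are returned as the inferred edges. The exponential weight and the names `baseline_top_k_edges` and `alpha` are illustrative assumptions; the paper's exact scoring function may differ.

```python
import math
from collections import defaultdict


def baseline_top_k_edges(cascades, k, alpha=1.0):
    """Rank candidate edges by accumulated transmission score; keep the top k.

    cascades: list of dicts, each mapping infected nodes to infection times
    alpha: assumed decay parameter of the illustrative exponential weight
    """
    scores = defaultdict(float)
    for times in cascades:
        for u, t_u in times.items():
            for v, t_v in times.items():
                if t_u < t_v:
                    # Shorter delays contribute more to the edge score.
                    scores[(u, v)] += math.exp(-(t_v - t_u) / alpha)
    # Sort by descending score, breaking ties by edge name for determinism.
    ranked = sorted(scores, key=lambda e: (-scores[e], e))
    return ranked[:k]


cascades = [{"a": 0, "b": 1, "c": 3}, {"a": 0, "b": 1}]
print(baseline_top_k_edges(cascades, k=2))
```

Unlike the tree-based approach, this scoring treats every pair independently, so it cannot tell a direct edge from a path through an intermediate node.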
In the experiments on synthetic data, the network inferred by NetInf achieves much higher precision and recall than the baseline. The paper also analyzes performance under different levels of cascade coverage, incubation time noise, and infections by the external source.
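For reference, precision and recall over edges can be computed as in the sketch below, which assumes the ground-truth network is known (as it is for synthetic data). The function name and the edge representation as `(source, target)` tuples are illustrative choices.

```python
def precision_recall(inferred_edges, true_edges):
    """Compare an inferred edge set against the ground-truth edge set."""
    inferred, truth = set(inferred_edges), set(true_edges)
    correct = len(inferred & truth)
    # Precision: fraction of inferred edges that really exist.
    precision = correct / len(inferred) if inferred else 0.0
    # Recall: fraction of true edges that were recovered.
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


inferred = [("a", "b"), ("b", "c"), ("c", "a")]
truth = [("a", "b"), ("b", "c"), ("b", "d"), ("d", "e")]
print(precision_recall(inferred, truth))
```

Sweeping the number of inferred edges and recomputing both values at each step traces out a precision-recall curve.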
On the real data, they also find a significant performance gap between NetInf and the baseline, as shown in the paper's precision-recall curves.
Further analysis of the data and results shows that information mainly flows from the mainstream media to blogs, and blogs tend to be slower to get infected than media sites.