Difference between revisions of "Gomez Rodriguez, M., J. Leskovec, and A. Krause. 2010. Inferring networks of diffusion and influence. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 1019–1028."
(5 intermediate revisions by the same user not shown) | |||
Line 3: | Line 3: | ||
== Summary == | == Summary == | ||
− | This paper addresses the problem of inferring | + | This paper addresses the problem of [[AddressesProblem::inferring networks of diffusion]] from the observed data that consists of times when some contagion infects particular nodes in the graph . The main premise behind their proposed approach is that the joint observations of many different contagion-spreading processes could determine the underlying diffusion network, assuming that it does not change over time. The proposed algorithm is called [[UsesMethod::NetInf]], which formulates the contagion spread as directed trees through the network, and can reduce the complexity of searching an exponential set of candidate trees to polynomial time. instead of estimating the likelihood of a cascade using all possible propagation trees, which causes the likelihood computationally intractable, NetInf approximates it by considering only the most likely tree, and is found to work well empirically. Additionally, the external influence from outside the network is modeled via <math>\epsilon</math>-edges, which basically says every node can get infected by some external source with small probability. |
== Results == | == Results == | ||
− | For evaluation they apply the proposed approach to both synthetic data (including Forest Fire model and the Kronecker Graphs model) and real data (Blog hyperlink cascades dataset and MemeTracker dataset created from 172 million news articles and blog posts from 1 million online sources over a year). They compare the results with a simple baseline method that infers the diffusion network by picking edges with highest scores, with the score of an edge determined by how likely it is for cascades to propagate over. | + | For evaluation they apply the proposed approach to both synthetic data (including Forest Fire model and the Kronecker Graphs model) and real data ([[UsesDataset::Blog hyperlink cascades]] dataset and [[UsesDataset::MemeTracker]] dataset created from 172 million news articles and blog posts from 1 million online sources over a year). They compare the results with a simple baseline method that infers the diffusion network by picking edges with highest scores, with the score of an edge determined by how likely it is for cascades to propagate over. |
− | For experiments on the synthetic data, they found the resulting network by | + | For experiments on the synthetic data, they found the resulting network by NetInf has much higher performance (in terms of precision-recall) than the baseline. Analysis for the performance under different cascade coverage, incubation time noise and infections by the external source is also included in the paper. |
− | As for testing on the real data, they also found significant performance gap between | + | As for testing on the real data, they also found significant performance gap between NetInf and baseline as shown in the precision-recall curves below: |
[[File:Res.png]] | [[File:Res.png]] | ||
+ | |||
+ | Further analysis of the data and results shows that information mainly flows from the mainstream media to blogs, and blogs tend to be slower to get infected than media sites. | ||
== Related Papers == | == Related Papers == | ||
+ | [[RelatedPaper::MYERS, S. AND LESKOVEC, J. 2010. On the Convexity of Latent Social Network Inference. In NIPS ’10: Advances in Neural Information Processing Systems.]] | ||
+ | |||
+ | [[RelatedPaper::GOMEZ-RODRIGUEZ, M., BALDUZZI, D., AND SCHO ̈ LKOPF, B. 2011. Uncovering the Temporal Dynamics of Diffusion Networks. In ICML ’11: Proceedings of the 28th International Conference on Machine Learning. 561–568.]] | ||
+ | |||
+ | [[RelatedPaper::VER STEEG, G., GHOSH, R., AND LERMAN, K. 2011. What stops social epidemics? In ICWSM ’11: Proceedings of the 5th Int. Conf. on Weblogs and Social Media.]] |
Latest revision as of 01:54, 4 October 2012
Online Version
An electronic version of this paper can be downloaded here: [1]
Summary
This paper addresses the problem of inferring networks of diffusion from the observed data that consists of times when some contagion infects particular nodes in the graph . The main premise behind their proposed approach is that the joint observations of many different contagion-spreading processes could determine the underlying diffusion network, assuming that it does not change over time. The proposed algorithm is called NetInf, which formulates the contagion spread as directed trees through the network, and can reduce the complexity of searching an exponential set of candidate trees to polynomial time. instead of estimating the likelihood of a cascade using all possible propagation trees, which causes the likelihood computationally intractable, NetInf approximates it by considering only the most likely tree, and is found to work well empirically. Additionally, the external influence from outside the network is modeled via -edges, which basically says every node can get infected by some external source with small probability.
Results
For evaluation they apply the proposed approach to both synthetic data (including Forest Fire model and the Kronecker Graphs model) and real data (Blog hyperlink cascades dataset and MemeTracker dataset created from 172 million news articles and blog posts from 1 million online sources over a year). They compare the results with a simple baseline method that infers the diffusion network by picking edges with highest scores, with the score of an edge determined by how likely it is for cascades to propagate over.
For experiments on the synthetic data, they found the resulting network by NetInf has much higher performance (in terms of precision-recall) than the baseline. Analysis for the performance under different cascade coverage, incubation time noise and infections by the external source is also included in the paper.
As for testing on the real data, they also found significant performance gap between NetInf and baseline as shown in the precision-recall curves below:
Further analysis of the data and results shows that information mainly flows from the mainstream media to blogs, and blogs tend to be slower to get infected than media sites.