Yang et al Modeling Information Diffusion in Implicit Networks

This is a paper summarized for the course Analysis of Social Media 10-802 in Fall 2012.

Citation

Yang, J., and Leskovec, J. 2010. Modeling Information Diffusion in Implicit Networks.

Online version

Modeling Information Diffusion in Implicit Networks

Summary

The paper proposes the Linear Influence Model. Rather than predicting which nodes influence which other nodes from the network structure, the model captures the global influence each node has on the rate at which information diffuses. Each node is assigned an influence function, which describes the effect of the node on the spread of the contagion at different times after the node adopts it. Because different types of nodes (for example blogs and news websites) may have differently shaped influence functions, the paper models the influence functions non-parametrically instead of using one fixed functional form with different (estimated) parameters per node.

Main Ideas

Motivation

Models that infer influence from the network structure make several assumptions: that complete network data is available, that the network structure is sufficient to explain the observed behavior, and that a contagion can spread only along the edges. The paper points out, however, that many external and hidden factors may be at work for which data is not readily available. In addition, when information or a virus propagates, the source of an infection is often unknown; for example, people usually pick up new information without explicitly acknowledging where it came from. The authors therefore argue that existing diffusion models based on network characteristics may be too constrained and that a global influence model is needed.

Linear Influence Model

The main idea of this model is that each node has an influence function associated with it, and that the number of newly infected nodes at time t is a function of the influence functions of the nodes that were infected before time t. The paper defines the following terms:

  • V(t): Number of nodes that mention the information at time t (the volume at time t).
  • I_u(l): Number of follow-up mentions l time units after node u adopted the information. Instead of a parametric function, I_u is modeled as a vector of length L, which assumes that the influence of node u drops to zero L time units after u was activated.
  • A(t): Set of active (infected, influenced) nodes, i.e. the nodes u that were activated before time t.
  • Time t: Measured in discrete steps (one hour in most experiments).
  • M_{u,k}(t): Indicator function that equals 1 if node u was infected by contagion k at time t, and 0 otherwise.

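The relation between V(t+1) and the influence functions can be written in the following two ways, where t_u is the time at which node u was infected and V_k is the volume of contagion k: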

  V(t+1) = \sum_{u \in A(t)} I_u(t - t_u)
  V_k(t+1) = \sum_{u} \sum_{l=0}^{L-1} M_{u,k}(t - l) \, I_u(l + 1)
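
The second form can be evaluated directly once the indicators and the influence vectors are known. The following is a minimal Python sketch of that evaluation for a single contagion, not code from the paper. The function name `predict_volume`, the dictionary layout, and the toy numbers are assumptions made for illustration; `influence[u][l]` is taken to store I_u(l + 1).

```python
def predict_volume(mentions, influence, t):
    """Return the predicted V_k(t+1) for a single contagion k.

    mentions[u][s]  : M_{u,k}(s), 1 if node u mentioned the contagion at time s, else 0
    influence[u][l] : I_u(l + 1), the influence of node u, l + 1 steps after its mention
    (Both containers are a hypothetical layout chosen for this sketch.)
    """
    total = 0.0
    for u, indicator in mentions.items():
        weights = influence[u]
        for l in range(len(weights)):  # l = 0 .. L-1
            if t - l < 0:              # nothing is observed before time 0
                break
            total += indicator[t - l] * weights[l]
    return total


# Toy data (made up): node "a" mentions at time 0, node "b" at time 1, L = 3.
mentions = {"a": [1, 0, 0, 0], "b": [0, 1, 0, 0]}
influence = {"a": [3.0, 2.0, 1.0], "b": [1.0, 0.0, 0.0]}
predicted = [predict_volume(mentions, influence, t) for t in range(4)]
print(predicted)
```

Each entry of `predicted` is the model's volume for the next time step, so the list covers V_k(1) through V_k(4) for the toy contagion.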

The second equation is then rewritten in matrix form (the paper gives the details together with a helpful illustration), which turns the estimation of the values I_u(l) into a Non-Negative Least Squares (NNLS) problem. The paper solves this problem with existing, efficient NNLS methods.
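
A small sketch may help make the matrix form concrete. It assumes NumPy and SciPy are available and uses `scipy.optimize.nnls` as a stand-in solver; the paper's own solver and exact matrix layout may differ. The array shapes, the function names `build_design_matrix` and `fit_influence`, and the toy data are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import nnls


def build_design_matrix(M, L):
    """Stack the indicators so that X @ I gives the predicted volumes.

    M has shape (K, N, T) with M[k, u, t] = 1 if node u mentioned contagion k at time t.
    Row k * (T - 1) + t predicts V_k(t + 1); column u * L + l multiplies I_u(l + 1).
    """
    K, N, T = M.shape
    X = np.zeros((K * (T - 1), N * L))
    for k in range(K):
        for t in range(T - 1):
            for l in range(min(L, t + 1)):
                X[k * (T - 1) + t, np.arange(N) * L + l] = M[k, :, t - l]
    return X


def fit_influence(M, V, L):
    """Estimate all I_u(l) >= 0 by non-negative least squares.

    V has shape (K, T); V[k, t] is the observed volume of contagion k at time t.
    Returns an (N, L) array whose entry [u, l] estimates I_u(l + 1).
    """
    K, N, T = M.shape
    X = build_design_matrix(M, L)
    y = V[:, 1:].reshape(-1)           # targets V_k(t + 1) for t = 0 .. T-2
    coef, _residual_norm = nnls(X, y)
    return coef.reshape(N, L)


# Toy data (made up): two contagions, each started by a different node at time 0.
M = np.zeros((2, 2, 5))
M[0, 0, 0] = 1
M[1, 1, 0] = 1
V = np.array([[1.0, 3.0, 2.0, 1.0, 0.0],
              [1.0, 1.0, 0.5, 0.0, 0.0]])
estimated = fit_influence(M, V, L=3)
print(estimated)
```

With real data the design matrix is large and sparse, so a dense array like this would only be practical for small examples.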

Extensions

The paper proposes the following two extensions to the problem formulation:

  • Accounting for novelty: To model the fact that nodes are more likely to adopt novel, recent information and to ignore old, obsolete information, a multiplicative factor α(t) is added to the above equation. A coordinate descent approach is used to estimate the two sets of parameters (α(t) and the influence functions).
  • Accounting for imitation: An additive term b(t) is added to the equation to model imitation. The term acts as a latent volume, i.e. volume that is caused not by the influence of the nodes but by other factors.
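
The two extensions can be sketched as follows. This is one reading of the summary above, not the paper's procedure. It assumes the novelty model has the form V_k(t+1) = α(t) · Σ M_{u,k}(t - l) I_u(l + 1), that the imitation model adds b(t) ≥ 0 to the unscaled sum, and that the design matrix `X` (shape K × S × P, one row per contagion and predicted time step) has already been built. All function names are hypothetical, and `scipy.optimize.nnls` is again used as a stand-in solver.

```python
import numpy as np
from scipy.optimize import nnls


def novelty_loss(X, y, alpha, infl):
    """Squared error of the assumed novelty model alpha(t) * (X @ infl)."""
    return float(((y - alpha[None, :] * (X @ infl)) ** 2).sum())


def fit_with_novelty(X, y, n_iter=20):
    """Alternate between the influence values and alpha(t) (coordinate descent).

    X: (K, S, P) indicator features, y: (K, S) observed volumes V_k(t + 1).
    """
    K, S, P = X.shape
    alpha = np.ones(S)
    infl = np.zeros(P)
    for _ in range(n_iter):
        # Step 1: alpha fixed, so scaling each row by alpha(t) gives a plain NNLS problem.
        scaled = (X * alpha[None, :, None]).reshape(K * S, P)
        infl, _ = nnls(scaled, y.reshape(-1))
        # Step 2: influence fixed, so each alpha(t) has a closed-form least-squares value.
        pred = X @ infl                      # shape (K, S)
        num = (pred * y).sum(axis=0)
        den = (pred ** 2).sum(axis=0)
        ok = den > 0                         # leave alpha(t) unchanged where it has no effect
        alpha[ok] = np.maximum(num[ok] / den[ok], 0.0)
    return alpha, infl


def fit_with_imitation(X, y):
    """Estimate influence and b(t) jointly by adding one indicator column per time step."""
    K, S, P = X.shape
    time_cols = np.tile(np.eye(S), (K, 1))   # row k * S + t has a 1 in column t
    A = np.hstack([X.reshape(K * S, P), time_cols])
    coef, _ = nnls(A, y.reshape(-1))
    return coef[:P], coef[P:]
```

Note that in the assumed novelty model the scale is not identifiable: multiplying α(t) by a constant and dividing every influence value by the same constant gives identical predictions, so some normalization would be needed before comparing fitted values.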

Experiments

In the experiments, the Linear Influence Model is fitted on different datasets and used to predict the volume time series V_k(t + 1). Its predictions are compared with those of three existing models: a 1-time lag predictor, an autoregressive model, and an autoregressive moving average model. The paper describes experiments performed on the following two datasets:

  • MemeTracker phrases: Collected with the MemeTracker methodology, this dataset comprises popular phrases from sources such as news articles and blog posts. For each selected phrase, the websites that mention it are tracked for 5 days.
  • Twitter hashtags: The Twitter hashtags dataset is collected and pruned to include only the top 1000 hashtags. To account for sparsity in hashtag adoption, a group of 100 users, rather than a single user, is treated as one node.
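
Two of the baseline predictors named above are simple enough to sketch. The code below is a generic illustration, assuming NumPy, of a 1-time lag predictor and of an autoregressive model fitted by ordinary least squares without an intercept. The summary does not state the model orders or fitting details used in the paper, so the function names and the order `p` are assumptions.

```python
import numpy as np


def lag1_predict(series):
    """1-time lag baseline: the prediction for V(t + 1) is simply V(t)."""
    x = np.asarray(series, dtype=float)
    return x[:-1]                      # aligned with the targets x[1:]


def fit_ar(series, p):
    """Fit AR(p) coefficients by least squares (no intercept, an assumption)."""
    x = np.asarray(series, dtype=float)
    rows = np.array([x[t - p:t] for t in range(p, len(x))])
    targets = x[p:]
    coef, *_ = np.linalg.lstsq(rows, targets, rcond=None)
    return coef                        # coef[i] multiplies x[t - p + i]


def ar_predict_next(series, coef):
    """Predict the next value from the last len(coef) observations."""
    x = np.asarray(series, dtype=float)
    return float(x[-len(coef):] @ coef)


# Toy volume series (made up) that halves at every step.
volume = [16.0, 8.0, 4.0, 2.0, 1.0]
coef = fit_ar(volume, p=1)
next_volume = ar_predict_next(volume, coef)
print(coef, next_volume)
```

Unlike the Linear Influence Model, both baselines use only the past volume of the contagion itself and ignore which nodes mentioned it.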

The paper compares the influence functions of 5 types of media (News, Professional Blogs, Television, News Agencies, Personal Blogs) across 6 topics (politics, nation, entertainment, business, sports and technology). The results for the two extensions (novelty and imitation) are also discussed.

Dataset

This paper uses the following dataset: Volume Time Series of Twitter Hashtags and Memetracker Phrases

Related papers