Supervised Random Walk

From Cohen Courses
Jump to navigationJump to search

This is one of the paper discussed and written in course Social Media Analysis 10-802 in Spring 2011

Citation

Lars Backstrom & Jure Leskovec "Supervised Random Walks: Predicting and Recommending Links in Social Networks"

Online version

Link to paper

Summary

This paper addresses the problem of Link Prediction using the method of Random walk with restart. Supervised Random Walk is interesting because it ranks the nodes based the network information and also using rich node and edge attributes that exist in the dataset. The method is supervised learning task where the goal is to learn the parameters of the function that assigns the strength of the edge (probability of taking that edge) such that a random walker is more likely to reach nodes to which new links will be created in future.

This method is used to recommend friends on Facebook dataset and also in predicting links in collaboration network in arXive database.

Method

Typically, Random walk with restart involves giving probabilities to every edge, which indicate the probability of taking that edge by a random walker given he is at any one of the node on either side of the edge. These probabilities decide which nodes are closer to the node from which we restart our random walks. One simple method that is to assign each edge out of a given node equal probability. However, supervised random walk presented in this paper provides the method to learn the probabilities so that the random walker restarting from node s is more likely to reach the "positive" nodes than "negative" nodes. Positive nodes are the one for which the links were formed in the training dataset and negative nodes are rest all nodes which are not connected the nodes s. The method is formulated into an optimization problem for which an efficient estimation procedure is derived.

Problem Formulation

Given a graph and the training data which has set of the positive nodes for which the links were formed by s and all other nodes in the graph to which linked were not formed as Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle L} . Assume that we are learning the probability assignments for edges with respect to a single s. Each edge Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (u,v) } has a vector of features Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \psi_{uv}} , which could be features of the interaction between nodes Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle u, v} or could be features on individual nodes Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle u,v} . Also, if there is a function Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle f_w (\psi_{uv}) = a_{uv}} which is parameterized function that assigns edge strengths or probabilities Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle a_{uv}} given feature vector Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \psi_{uv}} . The problem formulation is to learn the parameter, Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle w} that such that if we do random walk with restarts from s on the graph where the edge strengths are assigned based Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle f_w} , the random walk station distribution Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle p} has property that for each Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle d \in D, l \in L\ \ p_l < p_d } . This means that are more likely to reach the positive nodes Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle D} than nodes in negative set . Hence, the optimization problem is to regularize the parameter vector satisfying this condition, which is given below.

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \min_w F(w) = ||w||^2 }

such that

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \forall d \in D, l \in L: p_l < p_d }

The above formulation is the "hard" constraint that the stationary probabilities of all negative scores should be less than that of a positive node. The constraint is relaxed by introducing an error function Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle h } which is such that Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle h(p_l - p_d) = 0 } if Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle p_l - p_d < 0 } however Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle h(p_l - p_d) > 0 } in case Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle p_l-p_d > 0 } where the constraint is violated. With this "soft" constraints the following is the an optimization problem formulation

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \min_w F(w) = ||w||^2 + \lambda \sum_{d\in D, l\in L} h(p_l - p_d) }

In the above formulation Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \lambda } is regularization parameter that trades-off between the complexity (measured in norm of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle w } ) for the fit of the model ( how many constraints are violated).

Solving Optimization Problem

In order to solve the optimization problem formulated, the expression is differentiated. Firstly, the loss function Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle h } has to be differential and secondly, they derive the relationship between parameters Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle w } and random walk scores . The edge strength Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle a_{uv} = f_w(\psi_{uv})} then the stochastic transition matrix Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Q' } is given as following

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \displaystyle \begin{array}{lcll} Q'_{uv} & = & \frac{a_{uv}}{\sum_w a_{uw}}\ & if\ (u,v)\ \in E\\ & = & 0 \ & otherwise\\ \end{array} }

From this stochastic transition matrix Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Q' } we can obtain the transition probability matrix Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Q } when we do Random walk with restart. The following Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha } is the restart probability.

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Q_{uv} = (1-\alpha) Q'_{uv} + \alpha \mathbf{1}(v == s) }

Given this transition probability matrix, the stationary distribution Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle p } of Random walk with restart is given by eigenvector equation.

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle p^T = p^T Q }

With this we can find the relation ship between parameters Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle w } and stationary distribution Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle p } . Hence if we differentiate the optimization problem we get the following equation. We can consider Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \delta_{ld} = p_l - p_d }

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \frac{\partial F(w)}{\partial w} = 2w + \sum_{l,d} \frac{\partial h(p_l - p_d)}{\partial w}}

Now since we know Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle p } is principal eigen vector for the transition probability matrix we have Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle p_u = \sum_j p_j Q_{ju} } and hence

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \frac{\partial p_u}{\partial w} = \sum_j Q_{ju} \frac{\partial p_j}{\partial w} + p_j \frac{\partial Q_{ju}}{\partial u} }

In the above equation Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \frac{\partial p_u}{\partial w} } can be computed by Power Iteration method and Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \frac{\partial Q_{ju}}{\partial u} } can be computed using the definition of transition probability matrix.

Optimization problem is then solved using Gradient Descent method and Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle F(w) } is minimized.

Datasets Used

This paper for link prediction and link recommendation uses the following two datasets

  • Facebook dataset is dataset of Facebook users in Iceland. They chose Iceland, as they have seen the penetration was highest in Iceland and hence new links were formed more rapidly than other countries.
  • Collaboration network in arXive database.

Experiments

The paper does experiments of ranking the nodes using the supervised random walks both on synthetic dataset and real world dataset.

Experimental Setup

The following are some of the general considerations of loss function, edge strength function, choice of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha } and regularization parameter Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha }

Edge Strength Function

Edge strength function with a parameter Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle w } takes as input the feature vector Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \psi_{uv} } and outputs the edge strength. The following are some possibility of the edge strength

  • Exponential edge strength: Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle a_{uv} = exp(\psi_{uv}\cdot w) }
  • Logistic edge strength: Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle a_{uv} = \frac{1}{1 + exp(-\psi_{uv}\cdot w)} }

Loss Function

As said earlier we need to consider loss function such that it is differentiable. The following are some of th possible loss functions considered

  • Squared loss with margin:
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle h(x) = \max\{x+b,0\}^2 }
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \begin{array}{lcll} h(x) & = & 0 &if\ \ x \le -b\\ & = & (x+b)^2/(2z) & if\ \ -b < x \le z-b\\ & = & (x+b)-z/2 & if\ \ x> z-b\\ \end{array} }

Choice of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha }

The choice of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha} decides how far we do the random walk before we restart our random walk from s. As Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha } approaches 1, the random walks crossing more than 2 hops becomes increasingly unlikely.

Regularization parameter Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \lambda }

This parameter controls the over-fitting of the model, and in this paper they set Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \lambda = 1 } which gave best performance.

Experimental Results

Synthetic dataset

They generate a scale-free graph with 10,000 nodes. For each edge Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (u,v) } they create two independent Gaussian features with mean 0 and variance 1. The strength of each edge is given by Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle a_{uv} = exp(\psi_{uv1} - \psi_{uv2})} and hence Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle w* = [1, -1].} . On this graph they run the experiment with Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha = 0.2 } and obtain the PageRank scores Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle p^* } . Using these scores they pick the destination nodes or positive nodes in two ways. First, is to pick the top K nodes and second way is to pick K nodes with probability equal to their scores. They also add the Gaussian noise to the features and vary the noise and verify the performance.

They generate 100 graphs using the way explained above and use 50 graphs for learning the parameters and 50 graphs to test the performance. The measure of performance is AUC (area under the curve) which measures the accuracy of the prediction, AUC = 1 meaning the prediction were perfect. The Figures shown below provides the performance measures both in case of deterministic top K nodes and probabilistic top K nodes. The un-weighted version is the case where we assign equal weight to all the edges. We can see that performance of the learned graph(blue) is better than the un-weighted version (red) and most importantly as the noise increases the learned parameters performs better then the true parameters (green) both in case of deterministic and probabilistic ways.

SyntheticPerformanceSRW.png

Real world dataset

They run it on both the Facebook dataset and arXive database dataset and show that the performance of the supervised random walk is better than doing a typical Random walk with restart method, or a logistic regression with the network features extracted out to predict the nodes for which a given s gets linked to. As the number of nodes in the graph get really high, they consider the nodes that are not very far in terms of number of hops as potential candidates so that the method does not bloat up the size of transition matrix.

Conclusion

Supervised random walk algorithm, takes into consideration both network structures and features of the nodes/edges, therefore has better performance than a typical Random walk with restarts which assigns all the equal weight. The problem is formulated as an optimization problem and is solved using gradient descent. Experiments were done on synthetic dataset and on real world Facebook and co-authorship dataset and shown that it performs better. This can be applied to various other applications like Link recommendation, ranking nodes in the graph, missing link analysis.