Neighborhood formation and Anomaly detection

From Cohen Courses
Jump to navigationJump to search

This a technical problem discussed in Social Media Analysis 10-802 in Spring 2010.

These methods were proposed in Sun et. al., ICDM 2005

Neighbourhood Formation

Given a bipartite graph G : {(V1 U V2), E}, and a query node a in V1, NF computes the relevance scores of all the nodes in V1 to a. The ones with higher relevance are the “neighbors” of a.

Exact NF

  • Form adjecency matrix of all nodes in (V1 U V2)
  • Form a query vector where only 1th element is 1 and rest 0.
  • Run random walk with restart algorithm to compute steady state vector.
  • The steady state values for nodes in V1 give the relevance score w.r.t. query node a.

Parallel NF

If there are multiple nodes given as query then instead of running the algorithm multiple times, they propose more efficient way. The authors have vectorised the code and they run all queries in parallel.

Approximate NF

Both the above approaches need a lot of memory. But in most social graphs, nodes are sparsely connected. So the propose partitioning the graph first and then applying exact NF on the partition in which query node is present. This saves the computation cost processing the part of graph which has very less relevence to query node.

Anomaly Detection

  • Anomalous node is defined as a node which connects nodes from different neighbourhood.
  • Given a query node 'q' in V1, we get a set of nodes 'R' in V2 which connect to 'q'.
  • Every node r in R, connects to some nodes 'S' in V1.
  • A node r is anomalous if nodes in S dont belong to same neighbourhood.
  • To detect such nodes, they propose computing normality scores for all nodes in R.
  • The nodes with significantly lower normality scores than others are anomalous.
  • Normality scores can be computed using relevance score computation described in NF.
  • Computing normality score of a node r in V2
 - Let S be a set of nodes in V1 which link to r
 - Let RS be a matrix of pairwise relavance scores between any two nodes in S.
 - RS will be diagonally dominant, since a node is most relevant to itself.
 - normality_score(r) = function(RS)
     - e.g. normality_score(r) = mean over all non-diagonal elements of RS.

Relevant Papers