Community structure in social and biological networks

From Cohen Courses

Citation

@article{Girvan:2002,

 title={Community structure in social and biological networks},
 author={Girvan, M. and Newman, M.E.J.},
 journal={Proceedings of the National Academy of Sciences},
 volume={99},
 number={12},
 pages={7821--7826},
 year={2002},
 publisher={National Acad Sciences}

}

Abstract from the paper

A number of recent studies have focused on the statistical properties of networked systems such as social networks and the Worldwide Web. Researchers have concentrated particularly on a few properties that seem to be common to many networks: the small-world property, power-law degree distributions, and network transitivity. In this article, we highlight another property that is found in many networks, the property of community structure, in which network nodes are joined together in tightly knit groups, between which there are only looser connections. We propose a method for detecting such communities, built around the idea of using centrality indices to find community boundaries. We test our method on computer-generated and real-world graphs whose community structure is already known and find that the method detects this known structure with high sensitivity and reliability. We also apply the method to two networks whose community structure is not well known - a collaboration network and a food web - and find that it detects significant and informative community divisions in both cases.

Online version

Summary

In this paper, the authors introduce the community structure detection problem as one of the important characteristics of many kinds of networks, and then mention one feasible solution to it: the hierarchical clustering algorithm. However, hierarchical clustering requires node distances (i.e., edge weights) as input. The paper discusses various weighting methods, and in its second half adopts betweenness centrality, borrowed from Freeman's work, to assign the weights. Finally, an algorithm similar to hierarchical clustering is proposed and evaluated on three computer-generated or real-world networks with known true communities and two further networks with unknown communities. The results show that the proposed betweenness-based algorithm outperforms the baseline method, in which edges are weighted by the number of paths connecting the two endpoints.

Objective

The objective of the community structure detection problem is to find the communities (or clusters) in a network, where a community can be intuitively defined as a set of nodes that communicate more frequently with each other than with the rest of the network. The key difficulty is how to measure the affinity (or closeness) of two nodes.

Background

Various network characteristics have been discovered and analyzed systematically, e.g., the small-world effect, long-tailed (power-law) degree distributions, and network transitivity. With community structure, the authors aim to capture the network at a more holistic level.

Modeling

The concept of betweenness, as a centrality measure, was proposed by Freeman. The betweenness of a vertex v in a graph G = (V, E) is computed as follows:

1. For each pair of vertices (s,t), compute the shortest paths between them.

2. For each pair of vertices (s,t), determine the fraction of shortest paths that pass through the vertex in question (here, vertex v).

3. Sum this fraction over all pairs of vertices (s,t).
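The three steps above can be sketched directly in Python. This is a brute-force illustration on a plain adjacency-dict graph (a representation assumed here for simplicity, not taken from the paper): it enumerates every shortest path explicitly, which is only feasible for small graphs.

```python
from collections import deque
from itertools import combinations

def shortest_paths(graph, s, t):
    """Step 1: enumerate all shortest s-t paths via BFS plus backtracking."""
    # BFS records, for each node, its predecessors on shortest paths from s.
    dist, preds, queue = {s: 0}, {s: []}, deque([s])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w], preds[w] = dist[u] + 1, [u]
                queue.append(w)
            elif dist[w] == dist[u] + 1:
                preds[w].append(u)
    if t not in dist:
        return []
    # Walk predecessors back from t to reconstruct every shortest path.
    paths = []
    def backtrack(node, path):
        if node == s:
            paths.append([s] + path)
            return
        for p in preds[node]:
            backtrack(p, [node] + path)
    backtrack(t, [])
    return paths

def vertex_betweenness(graph, v):
    """Steps 2-3: sum, over all pairs (s, t) with v not an endpoint, the
    fraction of shortest s-t paths that pass through v (Freeman's measure)."""
    total = 0.0
    for s, t in combinations(graph, 2):
        if v in (s, t):
            continue
        paths = shortest_paths(graph, s, t)
        if paths:
            total += sum(1 for p in paths if v in p) / len(paths)
    return total

# Toy graph: a path a-b-c; the only a-c shortest path passes through b.
path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(vertex_betweenness(path_graph, "b"))  # 1.0
```

Production implementations avoid enumerating paths and instead use a faster accumulation scheme, but the brute-force version matches Freeman's definition term by term.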

The community structure detection algorithm, which proceeds divisively in a manner reminiscent of iterative hierarchical clustering, can be formalized as follows:

1. Calculate the betweenness for all edges in the network.

2. Remove the edge with the highest betweenness.

3. Recalculate the betweenness for all edges affected by the removal.

4. Repeat from step 2 until no edges remain.
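A minimal, self-contained sketch of the loop above, using brute-force edge betweenness recomputed after each removal. The adjacency-dict representation and the helper names are assumptions for illustration; this is not the paper's implementation, and real implementations use a much faster betweenness algorithm.

```python
from collections import deque
from itertools import combinations

def all_shortest_paths(graph, s, t):
    """All shortest s-t paths, via BFS plus predecessor backtracking."""
    dist, preds, queue = {s: 0}, {s: []}, deque([s])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w], preds[w] = dist[u] + 1, [u]
                queue.append(w)
            elif dist[w] == dist[u] + 1:
                preds[w].append(u)
    paths, stack = [], ([(t, [t])] if t in dist else [])
    while stack:
        node, path = stack.pop()
        if node == s:
            paths.append(path)
        else:
            stack.extend((p, [p] + path) for p in preds[node])
    return paths

def edge_betweenness(graph):
    """Step 1: each shortest s-t path credits 1/#paths to every edge it uses."""
    scores = {}
    for s, t in combinations(graph, 2):
        paths = all_shortest_paths(graph, s, t)
        for path in paths:
            for u, v in zip(path, path[1:]):
                edge = tuple(sorted((u, v)))
                scores[edge] = scores.get(edge, 0.0) + 1.0 / len(paths)
    return scores

def girvan_newman_step(graph):
    """Steps 2-3: remove the highest-betweenness edge, in place."""
    scores = edge_betweenness(graph)
    u, v = max(scores, key=scores.get)
    graph[u].remove(v)
    graph[v].remove(u)

def components(graph):
    """Current connected components, i.e. the detected communities."""
    seen, comps = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            u = queue.popleft()
            if u not in comp:
                comp.add(u)
                queue.extend(graph[u])
        seen |= comp
        comps.append(sorted(comp))
    return sorted(comps)

# Two triangles joined by a single bridge edge (2, 3): the bridge carries
# every cross-community shortest path, so it has the highest betweenness,
# is removed first, and the two planted communities split apart.
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
girvan_newman_step(g)
print(components(g))  # [[0, 1, 2], [3, 4, 5]]
```

Running step 4's full loop until no edges remain yields the complete hierarchy of divisions, from one community down to singletons.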

Results

The algorithm was tested on a computer-generated network with known community structure, two real-world networks with known community structures (Zachary's Karate Club and College Football), and two real-world networks with unknown community structures (a collaboration network and a food web). The baseline is a hierarchical clustering algorithm whose edge weights are edge-independent path counts. All experiments show that the proposed algorithm outperforms this baseline.

Further thoughts & Study Plan

  • Freeman, Linton (1977). "A set of measures of centrality based upon betweenness". Sociometry 40: 35–41.
  1. It is worth studying how the original idea of betweenness, the central concept of this algorithm, was first proposed.
  2. Further papers should be surveyed to understand the differences among various centrality measures, and how they behave in the community structure detection problem.
  • Another thread worth pursuing is the clustering algorithm itself, rather than the weighting method. The data mining community has proposed a large number of clustering methods, e.g., EM, k-means, and, more recently, affinity propagation based on factor graph models. All of them could also be applied to this problem with the proposed weighting scheme.
  • The authors do not mention how one should specify the number of communities; instead, they varied it from the total number of nodes (every node in its own community) all the way down to 1 (all nodes in the same community). How to decide the number of clusters automatically is also an interesting question.
  • Large-scale experiments remain to be done. For example, coauthor networks are now available covering not just a single institution but the whole academic community. We should investigate how to scale up the algorithm and quantitatively measure its performance on a large-scale dataset.