Difference between revisions of "Girvan et al PNAS 2002"

From Cohen Courses
Jump to navigationJump to search
 
(11 intermediate revisions by the same user not shown)
Line 13: Line 13:
 
The key idea is to construct the communities by removing edges from the original graph, contrasting to the traditional way which constructs the communities by add edges to the vertex set. The edges to be removed is chosen based on the [[UsesMethod::edge betweenness]] - the number of shortest paths between pairs of vertices that run along the edge.
 
The key idea is to construct the communities by removing edges from the original graph, contrasting to the traditional way which constructs the communities by add edges to the vertex set. The edges to be removed is chosen based on the [[UsesMethod::edge betweenness]] - the number of shortest paths between pairs of vertices that run along the edge.
  
If a network contains communities or groups that are only loosely connected by a few inter-group edges, then all shortest paths between different communities must go along one of these few edges. Thus, the edges connecting communities will have high edge betweenness. By removing these edges, we separate groups from one another and so reveal the underlying community structure of the graph.
+
If a network contains communities or groups that are only loosely connected by a few inter-group edges, then all shortest paths between different communities must go along one of these few edges. Thus, the edges connecting communities will have high edge betweenness. By removing these edges, the method separates groups from one another and so reveal the underlying community structure of the graph.
  
 
== Brief description of the method ==
 
== Brief description of the method ==
Line 22: Line 22:
 
#Repeat from step 2 until no edges remain.
 
#Repeat from step 2 until no edges remain.
  
Step one uses the fast algorithm to calculate the betweenness from [http://www-personal.umich.edu/~mejn/papers/016132.pdf Newman's paper]. The result is not well if we omit the third step because if two communities are connected by more than two edges, we can only guarantee one with high betweenness. By recalculating every time, we can ensure that at least one of the remaining edges between two communities will have a high value.  
+
Step one uses the fast algorithm to calculate the betweenness from [http://www-personal.umich.edu/~mejn/papers/016132.pdf Newman's paper]. The result is not well if the method omits the third step because if two communities are connected by more than two edges, the method can only guarantee one with high betweenness. By recalculating every time, the method can ensure that at least one of the remaining edges between two communities will have a high value.  
  
 
The time complexity of this algorithm is <math> O(m^2n)</math> where <math>m</math> is the number of edges and <math>n</math> is the number of nodes. This is because the each betweenness calculation time complexity is <math> O(mn) </math>.
 
The time complexity of this algorithm is <math> O(m^2n)</math> where <math>m</math> is the number of edges and <math>n</math> is the number of nodes. This is because the each betweenness calculation time complexity is <math> O(mn) </math>.
Line 35: Line 35:
  
 
* [[UsesDataset::Santa Fe Institute network|scientific collaboration network]] The algorithm seems to find two types of communities: scientists grouped together by similarity either of research topic or of methodology.
 
* [[UsesDataset::Santa Fe Institute network|scientific collaboration network]] The algorithm seems to find two types of communities: scientists grouped together by similarity either of research topic or of methodology.
* [[UsesDataset::Girvan et al PNAS 2002 food web|food web]]The algorithm finds out two well-defined communities of roughly equal size,plus a small number of vertices that belong to neither community. The split between the two large communities corresponds quite closely with the division between pelagic organisms and benthic organisms.
+
* [[UsesDataset::Girvan et al PNAS 2002 food web|food web]] The algorithm finds out two well-defined communities of roughly equal size,plus a small number of vertices that belong to neither community. The split between the two large communities corresponds quite closely with the division between pelagic organisms and benthic organisms.
  
 
== Background ==
 
== Background ==
Line 49: Line 49:
 
#Beginning from an vertex only set, add edges between pairs one by one in the desc order of the weights.
 
#Beginning from an vertex only set, add edges between pairs one by one in the desc order of the weights.
 
#The resulting graph shows a nested set of increasingly large components, which are taken to be the communities.
 
#The resulting graph shows a nested set of increasingly large components, which are taken to be the communities.
 +
 +
== Related papers ==
 +
*[[RelatedPaper:: M. E. J. Newman PNAS 2001|M. E. J. Newman, The structure of scientic collaboration networks. Proc. Natl. Acad. Sci. USA 98, 404-409(2001)]]
 +
*[[RelatedPaper::M. E. J. Newman Phys. Rev. E 2001 | M. E. J. Newman, Scientfic collaboration networks: II.Shortest paths, weighted networks, and centrality. Phys. Rev. E 64, 016132 (2001)]]
 +
*[[RelatedPaper::Zhou, Phy. Rev Phys. Rev. E 67, 041908 (2003)]]
 +
 +
== Study plan ==
 +
This paper is quite clear and self-explanatory thus require very little background in order to understand it. Some of the common properties in the [[Girvan et al PNAS 2002#Background | background section]] would be useful.
 +
 +
Just in case you are not familiar with graph:
 +
*[http://en.wikipedia.org/wiki/Graph_theory graph theory]
 +
 +
The fast algorithm calculating betweenness:
 +
*[[RelatedPaper::M. E. J. Newman Phys. Rev. E 2001 | M. E. J. Newman, Scientfic collaboration networks: II.Shortest paths, weighted networks, and centrality. Phys. Rev. E 64, 016132 (2001)]]

Latest revision as of 08:25, 27 September 2012

Citation

Girvan, M. & Newman, M. E. J. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002).

Online Version

Link to PNAS

Summary

This paper shows a method for detecting community structure in a network which addresses the topic of Community Detection. The method is designed with the setting that network nodes are joined together in tightly-knit groups between which there are only looser connections.

The key idea is to construct the communities by removing edges from the original graph, contrasting to the traditional way which constructs the communities by add edges to the vertex set. The edges to be removed is chosen based on the edge betweenness - the number of shortest paths between pairs of vertices that run along the edge.

If a network contains communities or groups that are only loosely connected by a few inter-group edges, then all shortest paths between different communities must go along one of these few edges. Thus, the edges connecting communities will have high edge betweenness. By removing these edges, the method separates groups from one another and so reveal the underlying community structure of the graph.

Brief description of the method

  1. Calculate the betweenness for all edges in the network.
  2. Remove the edge with the highest betweenness.
  3. Recalculate betweennesses for all edges affected by the removal.
  4. Repeat from step 2 until no edges remain.

Step one uses the fast algorithm to calculate the betweenness from Newman's paper. The result is not well if the method omits the third step because if two communities are connected by more than two edges, the method can only guarantee one with high betweenness. By recalculating every time, the method can ensure that at least one of the remaining edges between two communities will have a high value.

The time complexity of this algorithm is where is the number of edges and is the number of nodes. This is because the each betweenness calculation time complexity is .

Experiment result

The test result is quite promising. It was verified on three known-community graphs and explored two unknown-community graphs. Both return high accuracy results.

  • computer-generated graph. If out of 16 edges of each vertex, six or less edges are inter-community edges, then the accuracy is pretty high: 100% accuracy of classifying the vertex to the community.

G 2002 result1.png

  • Zachary's karate network Out of 34 nodes, only one node is classified incorrectly.
  • football networks Almost all teams are correctly grouped with the other teams in their conference. Few cases in which the algorithm seems to fail actually correspond to nuances in the scheduling of games.
  • scientific collaboration network The algorithm seems to find two types of communities: scientists grouped together by similarity either of research topic or of methodology.
  • food web The algorithm finds out two well-defined communities of roughly equal size,plus a small number of vertices that belong to neither community. The split between the two large communities corresponds quite closely with the division between pelagic organisms and benthic organisms.

Background

Some common properties of many networks

  • Small-word property - average distance between vertices in a network is short
  • power-law degree distributions - many vertices in a network with low degree and a small number with high degree
  • network transitivity - two vertices having a same neighbor would have higher probability of being neighbors of each other.


A traditional method of constructing the communities

  1. Calculate the weight for each pair of vertices.
  2. Beginning from an vertex only set, add edges between pairs one by one in the desc order of the weights.
  3. The resulting graph shows a nested set of increasingly large components, which are taken to be the communities.

Related papers

Study plan

This paper is quite clear and self-explanatory thus require very little background in order to understand it. Some of the common properties in the background section would be useful.

Just in case you are not familiar with graph:

The fast algorithm calculating betweenness: