Difference between revisions of "M. E. J. Newman PNAS 2006"

Revision as of 14:27, 1 October 2012

1 Citation
2 Online version
3 Summary
4 Background
- 4.1 Modularity
- 4.2 Existing work on community detection based on modularity maximization
5 Method
- 5.1 Dividing networks into two communities
  - 5.1.1 The nice features of this method
- 5.2 Dividing networks into more than two communities
6 Dataset
7 Experiment
8 Review
- 8.1 Recommendation for whether or not to assign the paper as required/optional reading in later classes.
9 Related Papers
10 Study Plan

Citation

@article{Newman:2006:Proc-Natl-Acad-Sci-U-S-A:16723398,

 author = {Newman, M E},
 journal = {Proc Natl Acad Sci U S A},
 pages = {8577-8582},
 title = {Modularity and community structure in networks},
 volume = 103,
 year = 2006

Online version

Modularity and community structure in networks

Summary

One effective approach for community detection in networks is the optimization of the quality function known as modularity.

This paper shows that this optimization problem can be rewritten in terms of eigenvalues and eigenvectors of a matrix called modularity matrix, which leads to a spectral algorithm for community detection. Using several real-world data, the author demonstrate the proposed algorithm outperforms existing methods in terms of both quality of results and computation speed.

Background

Modularity

This term is coined in Newman, M. E. J. & Girvan, M. Finding and evaluating community structure in networks. (2004) Phys. Rev. E 69, 026113.

Some links in this wiki:
- modularity
- Maximization of the benefit function known as "modularity"

Problem definition

Suppose that we are given the structure of some network and that we want to determine whether there exists any natural division of its vertices into nonoverlapping groups or communities, where these communities may be of any size. Given some divisons on the network, modularity is a quality function that measures how "good" the division is. The question here is, of course, what is the "good" divisions. I will illustrate it in the following.

Intuition

For the sake of ease, let us focus initially on the problem of whether any good division of the network exists into just two communities.
The most obvious way is to minimize the number of edges running between two groups, which is most often adopted in the graph-partitioning literature.
But it allows trivial divisions such as (1) all vertexes in one group and none in the other group, or (2) only 1 vertex in one group and the rest in the other group.
So, we need a different way to define a "good" devision of a network.
Modularity is introduced in this context. The basic idea here is, a good division of a network into communities is one in which there are fewer edges than expected between communities.
- "If the number of edges between two groups is only what one would expect on the basis of random chance, then few thoughtful observers would claim this constitutes evidence of meaningful community structure. On the other hand, if the number of edges between groups is significantly less than we expect by chance, or equivalent if the number within groups is significantly more, then it is reasonable to conclude that something interesting is going on." - From the paper

Definition

Given some divisions on networks, modularity is, up to a multiplicative constant, the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random.

Preliminary
- For a particular division of the network into two groups let $s_{i}=1$ , if vertex $i$ belongs to group 1 and $s_{i}=-1$ , if vertex $i$ belongs to group 2.
- Let the number of edges between vertices $i$ and $j$ be $A_{ij}$ .
- Then, the expected number of edges between vertices $i$ and $j$ when edges are placed at random is ${\frac {k_{i}k_{j}}{2m}}$ , where $k_{i}$ and $k_{j}$ are the degrees of the vertices and $m={\frac {1}{2}}\sum _{i}k_{i}$ is the total number of edges in the network.

Now, the modularity $Q$ is given by the sum of $A_{ij}-{\frac {k_{i}k_{j}}{2m}}$ over all pairs of vertices $i,j$ that fall in the same group. (We can interpret $Q$ as follows, too: We are given some groups now. For each group, we can calculate the difference between the actual number of edges and the expected number of edges over all pairs in the group. $Q$ is the sum of these values.)

Since ${\frac {1}{2}}(s_{i}s_{j}+1)$ is 1 if $i$ and $j$ are in the same group and 0 otherwise, we can express modularity as follows: $Q={\frac {1}{4m}}\sum _{ij}(A_{il}-{\frac {k_{i}k_{j}}{2m}})(s_{i}s_{j}+1)={\frac {1}{4m}}\sum _{ij}(A_{il}-{\frac {k_{i}k_{j}}{2m}})s_{i}s_{j},[1]$

where the second equality follows from $2m=\sum _{i}k_{i}=\sum _{ij}A_{ij}.$

Note that ${\frac {1}{4m}}$ is merely conventional.

Existing work on community detection based on modularity maximization

Danon, L., Duch, J., Diaz-Guilera, A. & Arenas, A. Comparing community structure identification. (2005) J. Stat. Mech.,P09008. and Guimer`a, R. & Amaral, L. A. N. Functional cartography of complex metabolic networks. (2005) Nature 433, 895–900. used simulated annealing. But these methods are not expected to scale to large networks due to the high computation cost.
There are other heuristic methods such as greedy algorithms (Newman, M. E. J. Fast algorithm for detecting community structure in networks. (2004) Phys. Rev. E 69, 066133.) and extremal optimization (Duch, J. & Arenas, A. Community detection in complex networks using extremal optimization. (2005) Phys. Rev. E 72, 027104.).

Method

Dividing networks into two communities

The author rewrites the equation [1] as follows. $Q={\frac {1}{4m}}\mathbf {s^{T}} \mathbf {B} \mathbf {s} ,$ where $\mathbf {s}$ is the column vector whose elements are the $s_{i}$ , and $B_{ij}=A_{ij}-{\frac {k_{i}k_{j}}{2m}}$ , which is called modularity matrix.

By writing $\mathbf {s}$ as a linear combination of the normalized eigenvectors $u_{i}$ of $\mathbf {B}$ , we can express $Q$ as follow:

$Q={\frac {1}{4m}}\sum _{i}(\mathbf {u_{i}} ^{T}\cdot \mathbf {s} )^{2}\beta _{i},[2]$

where $\beta _{i}$ is the eigenvalue of $\mathbf {B}$ corresponding to eigenvector $\mathbf {u_{i}}$ .

The author shows that the maximum of $Q$ is achieved by setting $s_{i}=+1$ if the corresponding element of the leading eigen vector (whose eigenvalue is largest) is positive and -1 otherwise. Thus, the algorithm is as follows: we compute the leading eigenvector of the modularity matrix and divide the vertices into two groups according to the signs of the elements in this vector.

The nice features of this method

We do not need to specify the size of communities.
It has the ability to refuse to divide the network when no good division exists.
- If the modularity matrix has no positive eigenvalues, it means there is no division of the network that results in positive modularity , which we can see from the equation [2]. The author says this is a situation where the network is "indivisible.".

Dividing networks into more than two communities

We can get multiple communities by dividing the network recursively

Dataset

Zachary's karate network 34 nodes.
Pablo's jazz musicians network 198 nodes.
Jeong's metabolic network 453 nodes.
Guimer's email network 1,133 nodes.
Guardiola,'s Key signing network 10,680 nodes.
Newman's Physicists network 27,519 nodes.

Experiment

Measure
- The author used modularity value as a performance measure of a community detection method.
Competing methods
- Betweenness-based algorithm of Girvan and Newman
  - Girvan, M. & Newman, M. E. J. Community structure in social and biological networks. (2002) PNAS.
- Fast algorithm of Clauset et al.
  - Clauset, A., Newman, M. E. J. & Moore, C. Finding community structure in very large networks. (2004) Phys. Rev. E 70, 066111.
  - It optimizes modularity by using a greedy algorithm.
- Extremal optimization algorithm of Duch and Arenas
  - Duch, J. & Arenas, A. Community detection in complex networks using extremal optimization. (2005) Phys. Rev. E 72, 027104.
Result
- The proposed method outperformed the first two methods (Betweenness-based algorithm of Girvan and Newman, Fast algorithm of Clauset et al. ) for all of the networks.
- The third method (Extremal optimization algorithm of Duch and Arenas) was more competitive. There are no clear diference between them for the smaller networks up to ~ 1000 vertices. But for larger networks, the proposed method outperformed the third method, and the performance gap increased as the size of networks increased, showing that the proposed method is most promising for large networks.

Review

Recommendation for whether or not to assign the paper as required/optional reading in later classes.

Yes.

Modularity-based methods are common in community detection task. This papper might be a good introduction for the concept of modularity.
This paper also illustrates how the optimization problem can be rewritten in terms of eigenvalues and eigenvectors of a matrix called modularity matrix, which results into eigenvalue problems. This derivation shows that we can solve a problem by seeing the problem from different view points. This might be a good lesson for us when we face problems.

Related Papers

Study Plan

Newman, M. E. J. & Girvan, M. Finding and evaluating community structure in networks. (2004) Phys. Rev. E 69, 026113.
- This is a paper published in 2004, by the same author, which investigated the problem of community detection by several approaches. This paper also introduced the term "modularity" to evaluate the community structure. So, this paper might be a good material to understand the motivation of the authors.

@@ Line 80: / Line 80: @@
 Thus, the algorithm is as follows: we compute the leading eigenvector of the modularity matrix and divide the vertices into two groups according to the signs of the elements in this vector.
-* The nice features of this method
+==== The nice features of this method ====
-** We do not need to specify the size of communities.
+* We do not need to specify the size of communities.
-** It has the ability to refuse to divide the network when no good division exists.
+* It has the ability to refuse to divide the network when no good division exists.
-*** If the modularity matrix has no positive eigenvalues, it means there is no division of the network that results in positive modularity , which we can see from the equation [2]. The author says this is a situation where the network is "indivisible.".
+** If the modularity matrix has no positive eigenvalues, it means there is no division of the network that results in positive modularity , which we can see from the equation [2]. The author says this is a situation where the network is "indivisible.".
 === Dividing networks into more than two communities ===