Difference between revisions of "M. E. J. Newman PNAS 2006"

From Cohen Courses
Jump to navigationJump to search
 
(49 intermediate revisions by the same user not shown)
Line 24: Line 24:
 
=== Modularity ===
 
=== Modularity ===
  
* First introduced in [http://arxiv.org/pdf/cond-mat/0308217v1.pdf Newman, M. E. J. & Girvan, M. Finding and evaluating community structure in networks. (2004) Phys. Rev. E 69, 026113.]
+
* This term is coined in [http://arxiv.org/pdf/cond-mat/0308217v1.pdf Newman, M. E. J. & Girvan, M. Finding and evaluating community structure in networks. (2004) Phys. Rev. E 69, 026113.]
  
*Some links in this wiki:
+
*Some links:
** [http://malt.ml.cmu.edu/mw/index.php/%E2%80%9Cmodularity%E2%80%9D modularity]
+
** [http://malt.ml.cmu.edu/mw/index.php/Modularity Modularity]
 
** [http://malt.ml.cmu.edu/mw/index.php/Maximization_of_the_benefit_function_known_as_%22modularity%22 Maximization of the benefit function known as "modularity"]
 
** [http://malt.ml.cmu.edu/mw/index.php/Maximization_of_the_benefit_function_known_as_%22modularity%22 Maximization of the benefit function known as "modularity"]
 
+
** [http://en.wikipedia.org/wiki/Modularity_(networks) Wikipedia:Modularity]
 
==== Problem definition ====
 
==== Problem definition ====
 
Suppose that we are given the structure of some network and that we want to determine whether there exists any natural division of its vertices into nonoverlapping groups or communities, where these communities may be of any size.
 
Suppose that we are given the structure of some network and that we want to determine whether there exists any natural division of its vertices into nonoverlapping groups or communities, where these communities may be of any size.
Line 52: Line 52:
  
 
Now, the modularity <math>Q</math> is given by the sum of <math>A_{ij} -  \frac{k_{i}k_{j}}{2m}</math>over all pairs of vertices <math>i, j</math> that fall in the same group.
 
Now, the modularity <math>Q</math> is given by the sum of <math>A_{ij} -  \frac{k_{i}k_{j}}{2m}</math>over all pairs of vertices <math>i, j</math> that fall in the same group.
* We can interpret <math>Q</math> as follows, too: We are given some groups now. For each group, we can calculate the difference between the actual number of edges and the expected number of edges over all pairs in the group. <math>Q</math> is the sum of these values.
+
(We can interpret <math>Q</math> as follows, too: We are given some groups now. For each group, we can calculate the difference between the actual number of edges and the expected number of edges over all pairs in the group. <math>Q</math> is the sum of these values.)
 +
 
 +
Since <math> \frac{1}{2} (s_{i} s_{j} + 1)</math> is 1 if  <math>i</math> and <math>j</math> are in the same group and 0 otherwise, we can express modularity as follows:
 +
<math>Q =  \frac{1}{4m}  \sum_{ij} (A_{il} - \frac{k_{i}k_{j}}{2m}) (s_{i} s_{j} +1) =  \frac{1}{4m}  \sum_{ij} (A_{il} - \frac{k_{i}k_{j}}{2m}) s_{i} s_{j}, [1]</math>
 +
 
 +
where the second equality follows from <math>2m = \sum_{i} k_{i} =  \sum_{ij} A_{ij} .</math>
 +
 
 +
Note that <math>\frac{1}{4m}</math> is merely conventional.
  
 
=== Existing work on community detection based on modularity maximization ===
 
=== Existing work on community detection based on modularity maximization ===
Line 59: Line 66:
  
 
== Method ==
 
== Method ==
 +
=== Dividing networks into two communities ===
 +
The author rewrites the equation [1] as follows.
 +
<math>Q = \frac{1}{4m}  \mathbf{s^T} \mathbf{B} \mathbf{s},</math>
 +
where <math> \mathbf{s}</math> is the column vector whose elements are the  <math> s_{i}</math>, and  <math> B_{ij} =  A_{ij} -  \frac{k_{i}k_{j}}{2m} </math>, which is called modularity matrix.
 +
 +
By writing <math>\mathbf{s}</math> as a linear combination of the normalized eigenvectors <math>u_{i}</math> of <math>\mathbf{B} </math>,  it is shown that we can express <math>Q</math> as follow:
 +
 +
<math>Q = \frac{1}{4m}  \sum_{i} (\mathbf{u_{i}}^T \cdot \mathbf{s} )^2 \beta_{i} , [2]</math>
 +
 +
where <math>\beta_{i}</math> is the eigenvalue of <math>\mathbf{B}</math> corresponding to eigenvector <math>\mathbf{u_{i}}</math>.
 +
 +
The author shows that the maximum of <math>Q</math> is achieved by setting <math>s_{i} = +1</math>  if the corresponding element of the leading eigen vector (whose eigenvalue is largest)  is positive and -1 otherwise.
 +
Thus, the algorithm is as follows: we compute the leading eigenvector of the modularity matrix and divide the vertices into two groups according to the signs of the elements in this vector.
 +
 +
=== Dividing networks into more than two communities ===
 +
The author divides networks into multiple communities by repeating the previous method recursively.
 +
That is, he uses the algorithm described above first to divide the network into two parts, then divides those parts, and so on.
 +
 +
More specifically, he considers how much the modularity increases when we divide a group <math>g</math> into two parts.
 +
He shows this additional contribution of modularity <math>\Delta{Q}</math> can be expressed in a similar form as the previous section.
 +
He also shows that the modularity matrix in the previous section is now rewritten as a generalized modularity matrix.
 +
Then he shows that we can apply same spectral algorithm to maximize <math>\Delta{Q}</math>.
 +
 +
This algorithm tells us clearly at what point we need to halt the subdivision process;
 +
If there are no division of a subgraph that will increase the modularity of the network, or equivalently that gives a positive value for <math>\Delta{Q}</math>, we should stop the process then.
 +
 +
=== Nice features of this method ===
 +
* We do not need to specify the size of communities.
 +
* It has the ability to refuse to divide the network when no good division exists.
 +
** If the generalized modularity matrix has no positive eigenvalues, it means there is no division of the network that results in positive modularity, which we can see from the equation [2].
  
 
== Dataset ==
 
== Dataset ==
Line 69: Line 106:
  
 
== Experiment ==
 
== Experiment ==
* Measure  
+
'''Measure''':
** The author used modularity value as a performance measure of a community detection method.
+
* The author used modularity value as a performance measure of a community detection method.
* Competing methods
 
**  Betweenness-based algorithm of Girvan and Newman
 
*** [http://www.pnas.org/content/99/12/7821.abstract Girvan, M. & Newman, M. E. J. Community structure in social and biological networks. (2002) Proc. Natl. Acad. Sci. USA 99,7821–7826.]
 
** Fast algorithm of Clauset et al.
 
*** [http://arxiv.org/abs/cond-mat/0408187 Clauset, A., Newman, M. E. J. & Moore, C. Finding community structure in very large networks. (2004) Phys. Rev. E 70, 066111.]
 
*** It optimizes modularity by using a greedy algorithm.
 
** Extremal optimization algorithm of Duch and Arenas
 
*** [http://pre.aps.org/abstract/PRE/v72/i2/e027104 Duch, J. & Arenas, A. Community detection in complex networks using extremal optimization. (2005) Phys. Rev. E 72, 027104.]
 
* Result
 
** The proposed method outperformed the first two methods (Betweenness-based algorithm of Girvan and Newman, Fast algorithm of Clauset et al. ) for all of the networks.
 
** The third method (Extremal optimization algorithm of Duch and Arenas) was more competitive.  There are no clear diference between them for the smaller networks up to ~ 1000 vertices. But for larger networks, the proposed method outperformed the third method, and the performance gap increased as the size of networks increased, showing that the proposed method is most promising for large networks.
 
  
== Strengths and weaknesses ==
+
'''Competing methods''':
 +
*  Betweenness-based algorithm of Girvan and Newman (GN)
 +
** [http://www.pnas.org/content/99/12/7821.abstract Girvan, M. & Newman, M. E. J. Community structure in social and biological networks. (2002) PNAS.]
 +
* Fast algorithm of Clauset et al.  (CNM)
 +
** [http://arxiv.org/abs/cond-mat/0408187 Clauset, A., Newman, M. E. J. & Moore, C. Finding community structure in very large networks. (2004) Phys. Rev. E 70, 066111.]
 +
** It optimizes modularity by using a greedy algorithm.
 +
* Extremal optimization algorithm of Duch and Arenas (DA)
 +
** [http://pre.aps.org/abstract/PRE/v72/i2/e027104 Duch, J. & Arenas, A. Community detection in complex networks using extremal optimization. (2005) Phys. Rev. E 72, 027104.]
 +
 
 +
'''Result''':
 +
 
 +
[[File:Newman_1.png]]
 +
* The figure above shows comparison of modularities for the network divisions found by the algorithm described here and three other previously published methods as described in the text, for six networks of varying sizes.
 +
* The proposed method outperformed the first two methods (GN and CNM ) for all of the networks.
 +
* The third method (DA) was more competitive.  There are no clear diference between them for the smaller networks up to ~ 1000 vertices. But for larger networks, the proposed method outperformed the third method, and the performance gap increased as the size of networks increased, showing that the proposed method is most promising for large networks.
 +
 
 +
== Review==
 +
 
 +
=== Strengths and weaknesses ===
  
 
;Strength:
 
;Strength:
The problems the authors pointed out regarding existing social CF are genuine to social media, but have not yet been fully considered.
 
The authors propose a method that can solve the problem in a unified way based on MF.
 
In addition, they actually created a Facebook application to collect user behavior data in Facebook.
 
By doing so, they got rich features and conducted detailed analyses on users behaviors, too.
 
  
;weakness:
+
This paper is an innovative paper that showed the modularity maximization problem can be rewritten in terms of eigenvalues and eigenvectors of a modularity matrix, which results into eigenvalue problems. The method is mathematically well founded.
They manually set the weight for each objective function, and this might be time-consuming in practical situations.
+
Moreover, it proposed an efficient way of implementing the method, by exploiting a particular structure of the modularity matrix.
In addition, we are not clear which part of extensions actually contributed the increase of accuracy, because there are no systematic analyses.
+
As a result, it proposed a more effective algorithm for the community detection problem, in terms of both the quality of results and computation cost, compared to existing community detection methods.
Thus, we cannot get much insight about users behaviors. We also cannot get much insight to improve the proposed method.
+
 
 +
;Weakness:
 +
 
 +
Actually, it is a difficult task to point our weakness of this paper, since it is already in very high-quality finished form.
 +
But If I had to criticize, I would point out the following thing.
 +
Although they argue the proposed method outperforms the most competitive method (DA), as data increase, we observe that there is only one data (the Physicists network) in which the poposed method clearly outperformed it.
 +
We are not clear whether this is because data become large or not.
  
== Possible impact  ==
+
=== Possible impact  ===
If they were able to publish data, it would have much more impact. (Actually, they cannot publish their data because of the requirement from the funding project.)
 
Also, if they conducted analyses about how much each part of the extension contributed to the performance. If they did so, we could have insight about users' behaviors, or ways to improve existing social CF.
 
  
 +
Since there have been many knowledge about eigenvalue problems, it may be possible that other nice properties of the proposed method can be found.
 +
Or, it may be possible that it is further found that the modularity optimization problem can be solved in more efficient ways, by relating it to the eigenvalue problems.
  
== Recommendation for whether or not to assign the paper as required/optional reading in later classes.  ==
+
=== Recommendation for whether or not to assign the paper as required/optional reading in later classes.  ===
  
 
Yes.  
 
Yes.  
 
* Modularity-based methods are common in community detection task. This papper might be a good introduction for the concept of modularity.  
 
* Modularity-based methods are common in community detection task. This papper might be a good introduction for the concept of modularity.  
* This paper also illustrates how the optimization problem can be rewritten in terms of eigenvalues and eigenvectors of a matrix called modularity matrix, which results into eigenvalue problems. This derivation shows that we can solve a problem by seeing the problem from different view points.
+
* This paper also illustrates how the optimization problem can be rewritten in terms of eigenvalues and eigenvectors of a matrix called modularity matrix, which results into eigenvalue problems. This derivation shows that we can solve a problem by seeing the problem from different view points. This might be a good lesson for us when we face problems.
  
 
== Related Papers ==
 
== Related Papers ==
* There are many papers on how to combine social network and users' other actions.
+
* [[RelatedPaper::Newman, M. E. J. & Girvan, M. Phys. Rev.  2004.]]
** [[RelatedPaper::S. H. Yang et al. WWW 2011]] : S. H. Yang, B. Long, A. Smola, N. Sadagopan, Z. Zheng, and H. Zha. Like like alike: Joint friendship and interest propagation in social networks.In WWW-11, 2011.
+
** [http://arxiv.org/pdf/cond-mat/0308217v1.pdf Newman, M. E. J. & Girvan, M. Finding and evaluating community structure in networks. (2004) Phys. Rev. E 69, 026113.]
 +
** This is a paper published before the reviewed paper, by the same author, which investigated the problem of community detection by several approaches. This paper also introduced the term "modularity" to evaluate the community structure. So, this paper might be a good material to understand the motivation of the authors in the reviewed paper.
  
 +
* [[RelatedPaper::Girvan, M. & Newman, M. E.  PNAS. 2002.]]
 +
** [http://www.pnas.org/content/99/12/7821.abstract Girvan, M. & Newman, M. E. J. Community structure in social and biological networks. (2002) PNAS.]
 +
** The method in this paper is the only method among compared methods in the reviewed paper, which is not based on modularity. So, by reading this paper, we may be able to understand what kind of approaches we can take to define community, and what pros/cons there are when we take one approach.
  
 
== Study Plan  ==
 
== Study Plan  ==
To understand some matrix calculation, I read some of the paper; K. B. Petersen and M. S. Pedersen. The matrix cookbook, 2008.
+
* [http://arxiv.org/pdf/cond-mat/0308217v1.pdf Newman, M. E. J. & Girvan, M. Finding and evaluating community structure in networks. (2004) Phys. Rev. E 69, 026113.]: This is a paper published in 2004, by the same author, which investigated the problem of community detection by several approaches. This paper also introduced the term "modularity" to evaluate the community structure. So, this paper might be a good material to understand the motivation of the authors.
 +
** [http://pre.aps.org/abstract/PRE/v67/i2/e026126 M. E. J. Newman, Mixing patterns in networks. Phys. Rev. E 67, 026126 (2003).]: The concept of modularity in the above paper is based on the concept proposed in this paper. So I may need to read this paper to further understand the background of modularity.

Latest revision as of 00:21, 4 October 2012

Citation

@article{Newman:2006:Proc-Natl-Acad-Sci-U-S-A:16723398,

 author = {Newman, M E},
 journal = {Proc Natl Acad Sci U S A},
 pages = {8577-8582},
 title = {Modularity and community structure in networks},
 volume = 103,
 year = 2006

Online version

Modularity and community structure in networks


Summary

One effective approach for community detection in networks is the optimization of the quality function known as modularity.

This paper shows that this optimization problem can be rewritten in terms of eigenvalues and eigenvectors of a matrix called modularity matrix, which leads to a spectral algorithm for community detection. Using several real-world data, the author demonstrate the proposed algorithm outperforms existing methods in terms of both quality of results and computation speed.

Background

Modularity

Problem definition

Suppose that we are given the structure of some network and that we want to determine whether there exists any natural division of its vertices into nonoverlapping groups or communities, where these communities may be of any size. Given some divisons on the network, modularity is a quality function that measures how "good" the division is. The question here is, of course, what is the "good" divisions. I will illustrate it in the following.

Intuition

  • For the sake of ease, let us focus initially on the problem of whether any good division of the network exists into just two communities.
  • The most obvious way is to minimize the number of edges running between two groups, which is most often adopted in the graph-partitioning literature.
  • But it allows trivial divisions such as (1) all vertexes in one group and none in the other group, or (2) only 1 vertex in one group and the rest in the other group.
  • So, we need a different way to define a "good" devision of a network.
  • Modularity is introduced in this context. The basic idea here is, a good division of a network into communities is one in which there are fewer edges than expected between communities.
    • "If the number of edges between two groups is only what one would expect on the basis of random chance, then few thoughtful observers would claim this constitutes evidence of meaningful community structure. On the other hand, if the number of edges between groups is significantly less than we expect by chance, or equivalent if the number within groups is significantly more, then it is reasonable to conclude that something interesting is going on." - From the paper

Definition

Given some divisions on networks, modularity is, up to a multiplicative constant, the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random.

  • Preliminary
    • For a particular division of the network into two groups let , if vertex belongs to group 1 and , if vertex belongs to group 2.
    • Let the number of edges between vertices and be .
    • Then, the expected number of edges between vertices and when edges are placed at random is , where and are the degrees of the vertices and is the total number of edges in the network.

Now, the modularity is given by the sum of over all pairs of vertices that fall in the same group. (We can interpret as follows, too: We are given some groups now. For each group, we can calculate the difference between the actual number of edges and the expected number of edges over all pairs in the group. is the sum of these values.)

Since is 1 if and are in the same group and 0 otherwise, we can express modularity as follows:

where the second equality follows from

Note that is merely conventional.

Existing work on community detection based on modularity maximization

Method

Dividing networks into two communities

The author rewrites the equation [1] as follows. where is the column vector whose elements are the , and , which is called modularity matrix.

By writing as a linear combination of the normalized eigenvectors of , it is shown that we can express as follow:

where is the eigenvalue of corresponding to eigenvector .

The author shows that the maximum of is achieved by setting if the corresponding element of the leading eigen vector (whose eigenvalue is largest) is positive and -1 otherwise. Thus, the algorithm is as follows: we compute the leading eigenvector of the modularity matrix and divide the vertices into two groups according to the signs of the elements in this vector.

Dividing networks into more than two communities

The author divides networks into multiple communities by repeating the previous method recursively. That is, he uses the algorithm described above first to divide the network into two parts, then divides those parts, and so on.

More specifically, he considers how much the modularity increases when we divide a group into two parts. He shows this additional contribution of modularity can be expressed in a similar form as the previous section. He also shows that the modularity matrix in the previous section is now rewritten as a generalized modularity matrix. Then he shows that we can apply same spectral algorithm to maximize .

This algorithm tells us clearly at what point we need to halt the subdivision process; If there are no division of a subgraph that will increase the modularity of the network, or equivalently that gives a positive value for , we should stop the process then.

Nice features of this method

  • We do not need to specify the size of communities.
  • It has the ability to refuse to divide the network when no good division exists.
    • If the generalized modularity matrix has no positive eigenvalues, it means there is no division of the network that results in positive modularity, which we can see from the equation [2].

Dataset

Experiment

Measure:

  • The author used modularity value as a performance measure of a community detection method.

Competing methods:

Result:

Newman 1.png

  • The figure above shows comparison of modularities for the network divisions found by the algorithm described here and three other previously published methods as described in the text, for six networks of varying sizes.
  • The proposed method outperformed the first two methods (GN and CNM ) for all of the networks.
  • The third method (DA) was more competitive. There are no clear diference between them for the smaller networks up to ~ 1000 vertices. But for larger networks, the proposed method outperformed the third method, and the performance gap increased as the size of networks increased, showing that the proposed method is most promising for large networks.

Review

Strengths and weaknesses

Strength

This paper is an innovative paper that showed the modularity maximization problem can be rewritten in terms of eigenvalues and eigenvectors of a modularity matrix, which results into eigenvalue problems. The method is mathematically well founded. Moreover, it proposed an efficient way of implementing the method, by exploiting a particular structure of the modularity matrix. As a result, it proposed a more effective algorithm for the community detection problem, in terms of both the quality of results and computation cost, compared to existing community detection methods.

Weakness

Actually, it is a difficult task to point our weakness of this paper, since it is already in very high-quality finished form. But If I had to criticize, I would point out the following thing. Although they argue the proposed method outperforms the most competitive method (DA), as data increase, we observe that there is only one data (the Physicists network) in which the poposed method clearly outperformed it. We are not clear whether this is because data become large or not.

Possible impact

Since there have been many knowledge about eigenvalue problems, it may be possible that other nice properties of the proposed method can be found. Or, it may be possible that it is further found that the modularity optimization problem can be solved in more efficient ways, by relating it to the eigenvalue problems.

Recommendation for whether or not to assign the paper as required/optional reading in later classes.

Yes.

  • Modularity-based methods are common in community detection task. This papper might be a good introduction for the concept of modularity.
  • This paper also illustrates how the optimization problem can be rewritten in terms of eigenvalues and eigenvectors of a matrix called modularity matrix, which results into eigenvalue problems. This derivation shows that we can solve a problem by seeing the problem from different view points. This might be a good lesson for us when we face problems.

Related Papers

Study Plan