Difference between revisions of "Xufei Wang, ICDM, 2010"

Latest revision as of 00:28, 4 April 2011

Citation

Xufei Wang. 2010. Discovering Overlapping Groups in Social Media, the 10th IEEE International Conference on Data Mining (ICDM 2010).

Online Version

http://dmml.asu.edu/users/xufei/Papers/ICDM2010.pdf

Databases

BlogCatalog [1]

Delicious [2]

Summary

In this paper, the authors propose a novel co-clustering framework, which takes advantage of networking information between users and tags in social media, to discover these overlapping communities. The basic ideas are:

To discover overlapping communities in social media. Diverse interests and interactions that human beings can have in online social life suggest that one person often belongs more than one community.

To use user-tag subscription information instead of user-user links. Metadata such as tags become an important source in measuring the user-user similarity. The paper shows that more accurate community structures can be obtained by scrutinizing tag information.

To obtain clusters containing users and tags simultaneously. Existing co-clustering methods cluster users/tags separately. Thus, it is not clear which user cluster corresponds to which tag cluster. But the proposed method is able to ﬁnd out user/tag group structure and their correspondence

Problem Statement

In this paper, the concept of community is generalized to include both users and tags. Tags of a community imply the major concern of people within it.

Let $\mu =\left(\mu _{1},\mu _{2},...,\mu _{m}\right)$ denote the user set, $\tau =\left(\tau _{1},\tau _{2},...,\tau _{n}\right)$ the tay set. A community $C_{i}\left(1\leq i\leq k\right)$ is a subset of user and tags, where k is the number of communities. As mentioned above, communities usually overlap, i.e., $C_{i}\bigcap C_{j}\neq \emptyset \left(1\leq i,j\leq k\right)$ .On the other hand, users and their subscribed tags form a user-tag matrix M, in which each entry $M_{ij}\in \left\{0,1\right\}$ indicates whether user $u_{i}$ subscribes to tag $t_{j}$ . So it is reasonable to view a user as a sparse vector of tags, and each tag as a sparse vector of users.

Given notations above, the overlapping co-clustering problem can be stated formally as follows:

Input:

A user-tag subscription matrix $M_{N_{\mu }\times N_{t}}$ , where $N_{\mu }$ and $N_{t}$ are the numbers of users and tags.

The number of communities k.

Output:

k overlapping communities which consist of both users and tags.

Brief Description Of The Method

Communities that aggregate similar users and tags together can be detected by maximizing intra-cluster similarity, which is shown below: $argmax{\frac {1}{k}}\sum _{i=1}^{k}\sum _{x_{j}\in C_{i}}^{}S_{C}\left(x_{j},c_{i}\right)$ where k is the number of communities, x is the edges and c is the centroid of community. This formulation can be solved by a k-means variant.

This paper uses different methods to solve the problem of overlapping communities:

A. Independent Learning

B. Normalized Learning

C. Correlational Learning

Experimental Result

The authors use two kinds of datasets: one is a synthetic data and the other kind is real data from BlogCatalog and Delicious

A. Synthetic Data

Synthetic data, which is controlled by various parameters, facilitates a comparative study between the uncovered and actual clusters. It has 1,000 users and 1,000 tags and with different number of clusters which range from 5 to 50.

From the experiment result, we see that correlational Learning is more effective than the other two methods in recovering overlapping clusters. It works well even when the intra-cluster link density is low. Co-clustering performs poorly because it only ﬁnds non-overlapping clusters.

B. Social Media Data

From the experiment with BlogCatalog and Delicious, the paper show us that:

The probability of a link between two users increases with respect to the number of tags they share.

Correlational Learning consistently performs better, especially when the training set is small.

Higher co-occurrence frequency suggests that two users are more similar. Similar patterns are observed in the three methods.

Related papers

The author uses Co-Clustering method in Co-clustering documents and words using bipartite spectral graph partitioning as a comparison to above methods.

@@ Line 1: / Line 1: @@
 == Citation ==
 Xufei Wang. 2010. Discovering Overlapping Groups in Social Media, the 10th IEEE International Conference on Data Mining (ICDM 2010).
@@ Line 7: / Line 6: @@
 == Databases ==
-[[Category::BlogCatalog]] [http://www.blogcatalog.com/]
+[[Property:UsesDataset|BlogCatalog]] [http://www.blogcatalog.com/]
-[[Category::Delicious]] [http://www.delicious.com/]
+[[Property:UsesDataset|Delicious]] [http://www.delicious.com/]
 == Summary ==
@@ Line 29: / Line 28: @@
 '''Input:'''
-* A user-tag subscription matrix <math>M_{N_{\mu }\times N_{t},</math> where <math>N_{\mu }</math> and <math>N_{t}</math> are the numbers of users and tags.
+* A user-tag subscription matrix <math>M_{N_{\mu }\times N_{t}}</math>, where <math>N_{\mu }</math> and <math>N_{t}</math> are the numbers of users and tags.
 * The number of communities k.
@@ Line 36: / Line 35: @@
 * k overlapping communities which consist of both users and tags.
-== Brief description of the method ==
+== Brief Description Of The Method ==
+Communities that aggregate similar users and tags together can be detected by maximizing intra-cluster similarity, which is shown below:
+<math>arg max\frac{1}{k}\sum_{i=1}^{k}\sum_{x_{j}\in C_{i}}^{}S_{C}\left ( x_{j},c_{i} \right )</math>
+where k is the number of communities, x is the edges and c is the centroid of community. This formulation can be solved by a [[UsesMethod::k-means]] variant.
+This paper uses different methods to solve the problem of overlapping communities:
+'''A. [[UsesMethod::Independent Learning]]'''
+'''B. [[UsesMethod::Normalized Learning]]'''
+'''C. [[UsesMethod::Correlational Learning]]'''
 == Experimental Result ==
+The authors use two kinds of datasets: one is a synthetic data and the other kind is real data from [[Category::Dataset|BlogCatalog]] and [[Category::Dataset|Delicious]]
+'''A. Synthetic Data'''
+Synthetic data, which is controlled by various parameters, facilitates a comparative study between the uncovered and actual clusters. It has 1,000 users and 1,000 tags and with different number of clusters which range from 5 to 50.
+[[File:figure1.jpg]]
+From the experiment result, we see that correlational Learning is more effective than the other two methods in recovering overlapping clusters. It works well even when the intra-cluster link density is low. Co-clustering performs poorly because it only ﬁnds non-overlapping clusters.
+'''B. Social Media Data'''
+[[File:Social Media experiment.jpg]]
+From the experiment with [[Category::Dataset|BlogCatalog]] and [[Category::Dataset|Delicious]], the paper show us that:
+* The probability of a link between two users increases with respect to the number of tags they share.
+* Correlational Learning consistently performs better, especially when the training set is small.
+* Higher co-occurrence frequency suggests that two users are more similar. Similar patterns are observed in the three methods.
 == Related papers ==
+The author uses Co-Clustering method in [[RelatedPaper::Co-clustering documents and words using bipartite spectral graph partitioning]] as a comparison to above methods.

Difference between revisions of "Xufei Wang, ICDM, 2010"

Latest revision as of 00:28, 4 April 2011

Contents

Citation

Online Version

Databases

Summary

Problem Statement

Brief Description Of The Method

Experimental Result

Related papers

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Tools