Difference between revisions of "Xufei Wang, ICDM, 2010"

From Cohen Courses
Jump to navigationJump to search
 
(36 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
 
== Citation ==
 
== Citation ==
 
Xufei Wang. 2010. Discovering Overlapping Groups in Social Media, the 10th IEEE International Conference on Data Mining (ICDM 2010).
 
Xufei Wang. 2010. Discovering Overlapping Groups in Social Media, the 10th IEEE International Conference on Data Mining (ICDM 2010).
Line 7: Line 6:
  
 
== Databases ==
 
== Databases ==
[[Category::BlogCatalog]] [http://www.blogcatalog.com/]
+
[[Property:UsesDataset|BlogCatalog]] [http://www.blogcatalog.com/]
  
[[Category::Delicious]] [http://www.delicious.com/]
+
[[Property:UsesDataset|Delicious]] [http://www.delicious.com/]
  
 
== Summary ==
 
== Summary ==
Line 29: Line 28:
  
 
'''Input:'''
 
'''Input:'''
* A user-tag subscription matrix <math>M_{N_{\mu }\times N_{t},</math> where <math>N_{\mu }</math> and <math>N_{t}</math> are the numbers of users and tags.
+
* A user-tag subscription matrix <math>M_{N_{\mu }\times N_{t}}</math>, where <math>N_{\mu }</math> and <math>N_{t}</math> are the numbers of users and tags.
  
 
* The number of communities k.
 
* The number of communities k.
Line 36: Line 35:
 
* k overlapping communities which consist of both users and tags.
 
* k overlapping communities which consist of both users and tags.
  
== Brief description of the method ==
+
== Brief Description Of The Method ==
 +
 
 +
Communities that aggregate similar users and tags together can be detected by maximizing intra-cluster similarity, which is shown below:
 +
<math>arg max\frac{1}{k}\sum_{i=1}^{k}\sum_{x_{j}\in C_{i}}^{}S_{C}\left ( x_{j},c_{i} \right )</math>
 +
where k is the number of communities, x is the edges and c is the centroid of community. This formulation can be solved by a [[UsesMethod::k-means]] variant.
 +
 
 +
This paper uses different methods to solve the problem of overlapping communities:
 +
 
 +
'''A. [[UsesMethod::Independent Learning]]'''
 +
 
 +
'''B. [[UsesMethod::Normalized Learning]]'''
 +
 
 +
'''C. [[UsesMethod::Correlational Learning]]'''
  
 
== Experimental Result ==
 
== Experimental Result ==
 +
 +
The authors use two kinds of datasets: one is a synthetic data and the other kind is real data from [[Category::Dataset|BlogCatalog]] and [[Category::Dataset|Delicious]]
 +
 +
'''A. Synthetic Data'''
 +
 +
Synthetic data, which is controlled by various parameters, facilitates a comparative study between the uncovered and actual clusters. It has 1,000 users and 1,000 tags and with different number of clusters which range from 5 to 50.
 +
 +
[[File:figure1.jpg]]
 +
 +
From the experiment result, we see that correlational Learning is more effective than the other two methods in recovering overlapping clusters. It works well even when the intra-cluster link density is low. Co-clustering performs poorly because it only finds non-overlapping clusters.
 +
 +
'''B. Social Media Data'''
 +
 +
[[File:Social Media experiment.jpg]]
 +
 +
From the experiment with [[Category::Dataset|BlogCatalog]] and [[Category::Dataset|Delicious]], the paper show us that:
 +
 +
* The probability of a link between two users increases with respect to the number of tags they share.
 +
 +
* Correlational Learning consistently performs better, especially when the training set is small.
 +
 +
* Higher co-occurrence frequency suggests that two users are more similar. Similar patterns are observed in the three methods.
  
 
== Related papers ==
 
== Related papers ==
 +
 +
The author uses Co-Clustering method in [[RelatedPaper::Co-clustering documents and words using bipartite spectral graph partitioning]] as a comparison to above methods.

Latest revision as of 23:28, 3 April 2011

Citation

Xufei Wang. 2010. Discovering Overlapping Groups in Social Media, the 10th IEEE International Conference on Data Mining (ICDM 2010).

Online Version

http://dmml.asu.edu/users/xufei/Papers/ICDM2010.pdf

Databases

BlogCatalog [1]

Delicious [2]

Summary

In this paper, the authors propose a novel co-clustering framework, which takes advantage of networking information between users and tags in social media, to discover these overlapping communities. The basic ideas are:

  • To discover overlapping communities in social media. Diverse interests and interactions that human beings can have in online social life suggest that one person often belongs more than one community.
  • To use user-tag subscription information instead of user-user links. Metadata such as tags become an important source in measuring the user-user similarity. The paper shows that more accurate community structures can be obtained by scrutinizing tag information.
  • To obtain clusters containing users and tags simultaneously. Existing co-clustering methods cluster users/tags separately. Thus, it is not clear which user cluster corresponds to which tag cluster. But the proposed method is able to find out user/tag group structure and their correspondence

Problem Statement

In this paper, the concept of community is generalized to include both users and tags. Tags of a community imply the major concern of people within it.

Let denote the user set, the tay set. A community is a subset of user and tags, where k is the number of communities. As mentioned above, communities usually overlap, i.e., .On the other hand, users and their subscribed tags form a user-tag matrix M, in which each entry indicates whether user subscribes to tag . So it is reasonable to view a user as a sparse vector of tags, and each tag as a sparse vector of users.

Given notations above, the overlapping co-clustering problem can be stated formally as follows:

Input:

  • A user-tag subscription matrix , where and are the numbers of users and tags.
  • The number of communities k.

Output:

  • k overlapping communities which consist of both users and tags.

Brief Description Of The Method

Communities that aggregate similar users and tags together can be detected by maximizing intra-cluster similarity, which is shown below: where k is the number of communities, x is the edges and c is the centroid of community. This formulation can be solved by a k-means variant.

This paper uses different methods to solve the problem of overlapping communities:

A. Independent Learning

B. Normalized Learning

C. Correlational Learning

Experimental Result

The authors use two kinds of datasets: one is a synthetic data and the other kind is real data from BlogCatalog and Delicious

A. Synthetic Data

Synthetic data, which is controlled by various parameters, facilitates a comparative study between the uncovered and actual clusters. It has 1,000 users and 1,000 tags and with different number of clusters which range from 5 to 50.

Figure1.jpg

From the experiment result, we see that correlational Learning is more effective than the other two methods in recovering overlapping clusters. It works well even when the intra-cluster link density is low. Co-clustering performs poorly because it only finds non-overlapping clusters.

B. Social Media Data

Social Media experiment.jpg

From the experiment with BlogCatalog and Delicious, the paper show us that:

  • The probability of a link between two users increases with respect to the number of tags they share.
  • Correlational Learning consistently performs better, especially when the training set is small.
  • Higher co-occurrence frequency suggests that two users are more similar. Similar patterns are observed in the three methods.

Related papers

The author uses Co-Clustering method in Co-clustering documents and words using bipartite spectral graph partitioning as a comparison to above methods.