Xufei Wang, ICDM, 2010

From Cohen Courses
Jump to navigationJump to search

Citation

Xufei Wang. 2010. Discovering Overlapping Groups in Social Media, the 10th IEEE International Conference on Data Mining (ICDM 2010).

Online Version

http://dmml.asu.edu/users/xufei/Papers/ICDM2010.pdf

Databases

BlogCatalog [1]

Delicious [2]

Summary

In this paper, the authors propose a novel co-clustering framework, which takes advantage of networking information between users and tags in social media, to discover these overlapping communities. The basic ideas are:

  • To discover overlapping communities in social media. Diverse interests and interactions that human beings can have in online social life suggest that one person often belongs more than one community.
  • To use user-tag subscription information instead of user-user links. Metadata such as tags become an important source in measuring the user-user similarity. The paper shows that more accurate community structures can be obtained by scrutinizing tag information.
  • To obtain clusters containing users and tags simultaneously. Existing co-clustering methods cluster users/tags separately. Thus, it is not clear which user cluster corresponds to which tag cluster. But the proposed method is able to find out user/tag group structure and their correspondence

Problem Statement

In this paper, the concept of community is generalized to include both users and tags. Tags of a community imply the major concern of people within it.

Let denote the user set, the tay set. A community is a subset of user and tags, where k is the number of communities. As mentioned above, communities usually overlap, i.e., .On the other hand, users and their subscribed tags form a user-tag matrix M, in which each entry indicates whether user subscribes to tag . So it is reasonable to view a user as a sparse vector of tags, and each tag as a sparse vector of users.

Given notations above, the overlapping co-clustering problem can be stated formally as follows:

Input:

  • A user-tag subscription matrix , where and are the numbers of users and tags.
  • The number of communities k.

Output:

  • k overlapping communities which consist of both users and tags.

Brief Description Of The Method

Communities that aggregate similar users and tags together can be detected by maximizing intra-cluster similarity, which is shown below: where k is the number of communities, x is the edges and c is the centroid of community. This formulation can be solved by a k-means variant.

This paper uses different methods to solve the problem of overlapping communities:


A. Independent Learning

If two tags are different, their similarity can be defined as 0, and 1 if they are the same. their cosine similarity can be rewritten as:

B. Normalized Learning

Let denote the degree of the user ,and represent the degree of tag in a user-tag network. their cosine similarity can be rewritten as:

C. Correlational Learning

The singular value decomposition of user-tag network M is given by , where columns of U and V are the left and right singular vectors and is the diagonal matrix whose elements are singular values.

So we can get

where and , Parameter α (0 ≤ α ≤ 1) controls the weights of users and tags. Considering the balance between user similarity and tag similarity, α is set to 0.5.

Experimental Result

The authors use two kinds of datasets: one is a synthetic data and the other kind is real data from BlogCatalog and Delicious

A. Synthetic Data

Synthetic data, which is controlled by various parameters, facilitates a comparative study between the uncovered and actual clusters. It has 1,000 users and 1,000 tags and with different number of clusters which range from 5 to 50.

Figure1.jpg

From the experiment result, we see that correlational Learning is more effective thancthe other two methods in recovering overlapping clusters. It works well even when the intra-cluster link density is low. Co-clustering performs poorly because it only finds non-overlapping clusters.

B. Social Media Data

From the experiment with BlogCatalog and Delicious, the paper show us that:

  • The probability of a link between two users increases with respect to the number of tags they share.
  • Correlational Learning consistently performs better, especially when the training set is small.
  • Higher co-occurrence frequency suggests that two users are more similar. Similar patterns are observed in the three methods.

Related papers

The author uses Co-Clustering method in Co-clustering documents and words using bipartite spectral graph partitioning as a comparison to above methods.