# Xufei Wang, ICDM, 2010

## Citation

Xufei Wang. 2010. Discovering Overlapping Groups in Social Media, the 10th IEEE International Conference on Data Mining (ICDM 2010).

## Summary

In this paper, the authors propose a novel co-clustering framework, which takes advantage of networking information between users and tags in social media, to discover these overlapping communities. The basic ideas are:

• To discover overlapping communities in social media. Diverse interests and interactions that human beings can have in online social life suggest that one person often belongs more than one community.
• To use user-tag subscription information instead of user-user links. Metadata such as tags become an important source in measuring the user-user similarity. The paper shows that more accurate community structures can be obtained by scrutinizing tag information.
• To obtain clusters containing users and tags simultaneously. Existing co-clustering methods cluster users/tags separately. Thus, it is not clear which user cluster corresponds to which tag cluster. But the proposed method is able to ﬁnd out user/tag group structure and their correspondence

## Problem Statement

In this paper, the concept of community is generalized to include both users and tags. Tags of a community imply the major concern of people within it.

Let ${\displaystyle \mu =\left(\mu _{1},\mu _{2},...,\mu _{m}\right)}$ denote the user set, ${\displaystyle \tau =\left(\tau _{1},\tau _{2},...,\tau _{n}\right)}$ the tay set. A community ${\displaystyle C_{i}\left(1\leq i\leq k\right)}$ is a subset of user and tags, where k is the number of communities. As mentioned above, communities usually overlap, i.e., ${\displaystyle C_{i}\bigcap C_{j}\neq \emptyset \left(1\leq i,j\leq k\right)}$.On the other hand, users and their subscribed tags form a user-tag matrix M, in which each entry ${\displaystyle M_{ij}\in \left\{0,1\right\}}$ indicates whether user ${\displaystyle u_{i}}$ subscribes to tag ${\displaystyle t_{j}}$. So it is reasonable to view a user as a sparse vector of tags, and each tag as a sparse vector of users.

Given notations above, the overlapping co-clustering problem can be stated formally as follows:

Input:

• A user-tag subscription matrix ${\displaystyle M_{N_{\mu }\times N_{t}}}$, where ${\displaystyle N_{\mu }}$ and ${\displaystyle N_{t}}$ are the numbers of users and tags.
• The number of communities k.

Output:

• k overlapping communities which consist of both users and tags.

## Brief Description Of The Method

Communities that aggregate similar users and tags together can be detected by maximizing intra-cluster similarity, which is shown below: ${\displaystyle argmax{\frac {1}{k}}\sum _{i=1}^{k}\sum _{x_{j}\in C_{i}}^{}S_{C}\left(x_{j},c_{i}\right)}$ where k is the number of communities, x is the edges and c is the centroid of community. This formulation can be solved by a k-means variant.

This paper uses different methods to solve the problem of overlapping communities:

## Experimental Result

The authors use two kinds of datasets: one is a synthetic data and the other kind is real data from BlogCatalog and Delicious

A. Synthetic Data

Synthetic data, which is controlled by various parameters, facilitates a comparative study between the uncovered and actual clusters. It has 1,000 users and 1,000 tags and with different number of clusters which range from 5 to 50.

From the experiment result, we see that correlational Learning is more effective than the other two methods in recovering overlapping clusters. It works well even when the intra-cluster link density is low. Co-clustering performs poorly because it only ﬁnds non-overlapping clusters.

B. Social Media Data

From the experiment with BlogCatalog and Delicious, the paper show us that:

• The probability of a link between two users increases with respect to the number of tags they share.
• Correlational Learning consistently performs better, especially when the training set is small.
• Higher co-occurrence frequency suggests that two users are more similar. Similar patterns are observed in the three methods.

## Related papers

The author uses Co-Clustering method in Co-clustering documents and words using bipartite spectral graph partitioning as a comparison to above methods.