Difference between revisions of "Xufei Wang, ICDM, 2010"

Revision as of 01:38, 28 March 2011

Citation

Xufei Wang. 2010. Discovering Overlapping Groups in Social Media, the 10th IEEE International Conference on Data Mining (ICDM 2010).

Online Version

http://dmml.asu.edu/users/xufei/Papers/ICDM2010.pdf

Databases

BlogCatalog [1]

Delicious [2]

Summary

In this paper, the authors propose a novel co-clustering framework that takes advantage of the network information between users and tags in social media to discover overlapping communities. The basic ideas are:

To discover overlapping communities in social media. The diverse interests and interactions that people have in online social life suggest that one person often belongs to more than one community.

To use user-tag subscription information instead of user-user links. Metadata such as tags becomes an important source for measuring user-user similarity. The paper shows that more accurate community structures can be obtained by scrutinizing tag information.

To obtain clusters containing users and tags simultaneously. Existing co-clustering methods cluster users and tags separately, so it is not clear which user cluster corresponds to which tag cluster. The proposed method finds the user/tag group structure together with the correspondence between the two.

Problem Statement

In this paper, the concept of community is generalized to include both users and tags. Tags of a community imply the major concern of people within it.

Let $\mu =\left(u_{1},u_{2},...,u_{m}\right)$ denote the user set and $\tau =\left(t_{1},t_{2},...,t_{n}\right)$ the tag set. A community $C_{i}\left(1\leq i\leq k\right)$ is a subset of users and tags, where k is the number of communities. As mentioned above, communities usually overlap, i.e., $C_{i}\cap C_{j}\neq \emptyset$ can hold for $i\neq j$ $\left(1\leq i,j\leq k\right)$. Users and their subscribed tags form a user-tag matrix M, in which each entry $M_{ij}\in \left\{0,1\right\}$ indicates whether user $u_{i}$ subscribes to tag $t_{j}$. It is therefore reasonable to view each user as a sparse vector of tags and each tag as a sparse vector of users.
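As an illustration of the user-tag matrix, here is a minimal Python sketch. The user names, tag names and the `subscriptions` dictionary are made-up toy data, not taken from the paper or its datasets.

```python
# Hypothetical toy data: the tags each user subscribes to.
subscriptions = {
    "alice": ["python", "ml"],
    "bob": ["ml", "cooking"],
    "carol": ["cooking"],
}

users = sorted(subscriptions)
tags = sorted({t for ts in subscriptions.values() for t in ts})

# M[i][j] = 1 if user i subscribes to tag j, else 0.
M = [[1 if tag in subscriptions[user] else 0 for tag in tags] for user in users]

# Each row is a user viewed as a vector of tags;
# each column is a tag viewed as a vector of users.
user_vector = dict(zip(users, M))
tag_vector = {tag: [row[j] for row in M] for j, tag in enumerate(tags)}

# The non-zero entries of M are the user-tag edges of the network.
edges = [(u, t) for i, u in enumerate(users)
         for j, t in enumerate(tags) if M[i][j]]
```

On real data M would be stored in a sparse format, since most users subscribe to only a small fraction of the tags.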

Given notations above, the overlapping co-clustering problem can be stated formally as follows:

Input:

A user-tag subscription matrix $M_{N_{\mu }\times N_{t}}$ , where $N_{\mu }$ and $N_{t}$ are the numbers of users and tags.

The number of communities k.

Output:

k overlapping communities which consist of both users and tags.

Brief Description Of The Method

Communities that aggregate similar users and tags can be detected by maximizing the intra-cluster similarity: $\arg \max {\frac {1}{k}}\sum _{i=1}^{k}\sum _{x_{j}\in C_{i}}S_{C}\left(x_{j},c_{i}\right)$ where k is the number of communities, $x_{j}$ is an edge, $c_{i}$ is the centroid of community $C_{i}$, and $S_{C}$ is the similarity between an edge and a centroid. This formulation can be solved by a k-means variant.
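The following is a rough Python sketch of such a k-means variant over edges. It is not the authors' implementation. It assumes that each edge is represented by a sparse indicator vector over its user and its tag, that cosine similarity is used as $S_{C}$, and that a community collects the users and tags of the edges assigned to it, so a node whose edges fall into different clusters belongs to several communities. The function names, the fixed `seeds` argument and the toy edge list are illustrative only.

```python
import math
from collections import defaultdict

def edge_vector(edge):
    """Sparse indicator vector with one entry for the user and one for the tag."""
    user, tag = edge
    return {("user", user): 1.0, ("tag", tag): 1.0}

def cosine(a, b):
    dot = sum(w * b.get(key, 0.0) for key, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster_edges(edges, seeds, n_iter=20):
    """Assign every edge to its most similar centroid until labels stop changing."""
    vectors = [edge_vector(e) for e in edges]
    centroids = [dict(vectors[s]) for s in seeds]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            # The sum of member vectors points the same way as their mean,
            # which is all that cosine similarity depends on.
            total = defaultdict(float)
            for v, label in zip(vectors, labels):
                if label == c:
                    for key, w in v.items():
                        total[key] += w
            if total:  # keep the old centroid if a cluster becomes empty
                centroids[c] = dict(total)
    return labels

def communities_from_labels(edges, labels, k):
    """Collect the users and tags touched by the edges of each cluster."""
    groups = [(set(), set()) for _ in range(k)]
    for (user, tag), label in zip(edges, labels):
        groups[label][0].add(user)
        groups[label][1].add(tag)
    return groups

# Toy network: users a, b share tags x, y; users c, d share tags z, w;
# user b also subscribes to tag z.
edges = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"), ("b", "z"),
         ("c", "z"), ("c", "w"), ("d", "z"), ("d", "w")]
labels = cluster_edges(edges, seeds=[0, 8])
communities = communities_from_labels(edges, labels, k=2)
```

In this toy run tag z ends up in both communities, which is the kind of overlap a node-level partition cannot express. The seeds are fixed here only to keep the example deterministic.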

The paper considers three ways of defining the similarity between two edges when solving the overlapping community problem:

A. Independent Learning

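Let $e=\left(u_{i},t_{p}\right)$ and ${e}'=\left(u_{j},t_{q}\right)$ be two edges. The similarity between two tags is defined as 1 if they are the same and 0 if they are different, and likewise for two users; $\delta$ denotes this indicator. The cosine similarity between the two edges can then be rewritten as: $S_{e}\left(e,{e}'\right)={\frac {1}{2}}\left(\delta \left(u_{i},u_{j}\right)+\delta \left(t_{p},t_{q}\right)\right)$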
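A small Python sketch of this similarity follows. It also checks the closed form against the cosine of explicit indicator vectors, which each have one non-zero entry for the user and one for the tag, and therefore norm $\sqrt{2}$. The function names and the toy users and tags are illustrative assumptions.

```python
import math

def delta(a, b):
    """Indicator: 1 if the two items are identical, 0 otherwise."""
    return 1 if a == b else 0

def independent_similarity(e1, e2):
    """Closed form: S_e(e, e') = (delta(u_i, u_j) + delta(t_p, t_q)) / 2."""
    (u_i, t_p), (u_j, t_q) = e1, e2
    return 0.5 * (delta(u_i, u_j) + delta(t_p, t_q))

def cosine_of_indicators(e1, e2, users, tags):
    """Same quantity computed from explicit indicator vectors."""
    def vec(edge):
        u, t = edge
        return ([1.0 if x == u else 0.0 for x in users]
                + [1.0 if x == t else 0.0 for x in tags])
    a, b = vec(e1), vec(e2)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

users = ["u1", "u2"]
tags = ["t1", "t2"]
shared_user = independent_similarity(("u1", "t1"), ("u1", "t2"))
nothing_shared = independent_similarity(("u1", "t1"), ("u2", "t2"))
```

Under this definition two distinct edges can only score 0 or 0.5, so the measure records whether edges share an endpoint but not how informative that endpoint is.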

B. Normalized Learning

Let $d_{u_{i}}$ denote the degree of user $u_{i}$ and $d_{t_{p}}$ the degree of tag $t_{p}$ in the user-tag network. The cosine similarity between edges $e=\left(u_{i},t_{p}\right)$ and ${e}'=\left(u_{j},t_{q}\right)$ can be rewritten as: $S_{e}\left(e,{e}'\right)={\frac {d_{t_{p}}d_{t_{q}}\delta \left(u_{i},u_{j}\right)+d_{u_{i}}d_{u_{j}}\delta \left(t_{p},t_{q}\right)}{{\sqrt {d_{u_{i}}^{2}+d_{t_{p}}^{2}}}{\sqrt {d_{u_{j}}^{2}+d_{t_{q}}^{2}}}}}$
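The formula can be sketched in Python as below. The second function is one way to read it, not something the summary states: it is algebraically the cosine between edge vectors whose user entry is weighted by $1/d_{u}$ and whose tag entry is weighted by $1/d_{t}$, so that high-degree nodes contribute less. The function names and the degree values are made up for illustration.

```python
import math

def normalized_similarity(e1, e2, d_user, d_tag):
    """Closed-form similarity from the degrees of the four endpoints."""
    (u_i, t_p), (u_j, t_q) = e1, e2
    same_user = 1 if u_i == u_j else 0
    same_tag = 1 if t_p == t_q else 0
    numerator = (d_tag[t_p] * d_tag[t_q] * same_user
                 + d_user[u_i] * d_user[u_j] * same_tag)
    denominator = (math.hypot(d_user[u_i], d_tag[t_p])
                   * math.hypot(d_user[u_j], d_tag[t_q]))
    return numerator / denominator

def cosine_of_weighted_edges(e1, e2, d_user, d_tag):
    """Cosine of vectors with entries 1/d_u (user slot) and 1/d_t (tag slot)."""
    (u_i, t_p), (u_j, t_q) = e1, e2
    dot = 0.0
    if u_i == u_j:
        dot += 1.0 / (d_user[u_i] * d_user[u_j])
    if t_p == t_q:
        dot += 1.0 / (d_tag[t_p] * d_tag[t_q])
    norm_1 = math.hypot(1.0 / d_user[u_i], 1.0 / d_tag[t_p])
    norm_2 = math.hypot(1.0 / d_user[u_j], 1.0 / d_tag[t_q])
    return dot / (norm_1 * norm_2)

# Hypothetical degrees for a three-edge network: (u1,t1), (u1,t2), (u2,t1).
d_user = {"u1": 2, "u2": 1}
d_tag = {"t1": 2, "t2": 1}
sim_same_user = normalized_similarity(("u1", "t1"), ("u1", "t2"), d_user, d_tag)
sim_same_tag = normalized_similarity(("u1", "t1"), ("u2", "t1"), d_user, d_tag)
```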

C. Correlational Learning

The singular value decomposition of user-tag network M is given by $M=U\Sigma V^{T}$ , where columns of U and V are the left and right singular vectors and $\Sigma$ is the diagonal matrix whose elements are singular values.

Users are mapped into the latent space by ${\vec {u}}_{i}({\vec {t}}_{1},{\vec {t}}_{2},...,{\vec {t}}_{m})=u_{i}(t_{1},t_{2},...t_{n})V$. The edge similarity is then defined as $S_{e}\left(e,{e}'\right)=\alpha S_{u}\left(u_{i},u_{j}\right)+\left(1-\alpha \right)S_{t}\left(t_{p},t_{q}\right)$

where $S_{u}(u_{i},u_{j})={\frac {{\vec {u}}_{i}\cdot {\vec {u}}_{j}}{\left\|{\vec {u}}_{i}\right\|\left\|{\vec {u}}_{j}\right\|}}$ and $S_{t}(t_{p},t_{q})={\frac {{\vec {t}}_{p}\cdot {\vec {t}}_{q}}{\left\|{\vec {t}}_{p}\right\|\left\|{\vec {t}}_{q}\right\|}}$. The parameter α (0 ≤ α ≤ 1) controls the relative weights of user similarity and tag similarity. To balance the two, α is set to 0.5.
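A rough NumPy sketch of this similarity is given below. The user projection follows the formula $u_{i}V$ above. Mapping tags by the left singular vectors, the `rank` truncation parameter, the function names and the toy matrix are assumptions made for illustration, not details from the paper.

```python
import numpy as np

def latent_vectors(M, rank):
    """Project users and tags into a rank-dimensional latent space via SVD."""
    U, singular_values, Vt = np.linalg.svd(M, full_matrices=False)
    V_top = Vt[:rank].T         # tags x rank: top right singular vectors
    U_top = U[:, :rank]         # users x rank: top left singular vectors
    user_latent = M @ V_top     # row i is u_i V
    tag_latent = M.T @ U_top    # assumed symmetric mapping for tags
    return user_latent, tag_latent

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def correlational_similarity(e1, e2, user_latent, tag_latent, alpha=0.5):
    """S_e = alpha * S_u + (1 - alpha) * S_t, with edges given as index pairs."""
    (i, p), (j, q) = e1, e2
    s_u = cosine(user_latent[i], user_latent[j])
    s_t = cosine(tag_latent[p], tag_latent[q])
    return alpha * s_u + (1 - alpha) * s_t

# Toy matrix: users 0 and 1 both subscribe to tags 0 and 1; user 2 only to tag 2.
M = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)
user_latent, tag_latent = latent_vectors(M, rank=2)
within_group = correlational_similarity((0, 0), (1, 1), user_latent, tag_latent)
across_groups = correlational_similarity((0, 0), (2, 2), user_latent, tag_latent)
```

Unlike the two previous measures, edges (0, 0) and (1, 1) share neither a user nor a tag, yet they score close to 1 here because their endpoints are correlated in the latent space.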

Experimental Result

The authors use two kinds of datasets: synthetic data, and real data from BlogCatalog and Delicious.

A. Synthetic Data

Synthetic data, which is controlled by various parameters, facilitates a comparative study between the uncovered and the actual clusters. It has 1,000 users and 1,000 tags, with the number of clusters ranging from 5 to 50.

The experimental results show that Correlational Learning is more effective than the other two methods in recovering overlapping clusters. It works well even when the intra-cluster link density is low. Co-clustering performs poorly because it only finds non-overlapping clusters.

B. Social Media Data

The experiments on BlogCatalog and Delicious show that:

The probability of a link between two users increases with respect to the number of tags they share.

Correlational Learning consistently performs better, especially when the training set is small.

Higher co-occurrence frequency suggests that two users are more similar. Similar patterns are observed in the three methods.

Related papers
