Difference between revisions of "Xufei Wang, ICDM, 2010"

From Cohen Courses
Jump to navigationJump to search
Line 50: Line 50:
 
'''B. [[UsesMethod::Normalized Learning]]'''
 
'''B. [[UsesMethod::Normalized Learning]]'''
  
Let <math>d_{u_{i}}</math> denote the degree of the user <math>u_{i}</math>,and <math>d_{t_{p}}</math> represent the degree of tag <math>t_{p}</math> in a user-tag network. their cosine similarity can be rewritten as:
 
<math>S_{e}\left ( e,{e}' \right )= \frac{d_{t_{p}}d_{t_{q}}\delta \left ( u_{i},u_{j} \right )+d_{u_{i}}d_{u_{j}}\delta \left ( t_{p},t_{q} \right )}{\sqrt{d_{u_{i}}^{2}+d_{t_{p}}^{2}}\sqrt{d_{u_{j}}^{2}+d_{t_{q}}^{2}}}</math>
 
  
 
'''C. [[UsesMethod::Correlational Learning]]'''
 
'''C. [[UsesMethod::Correlational Learning]]'''
 
The singular value decomposition of user-tag network M is given by <math>M= U\Sigma V^{T}</math>, where columns of U and V are the left and right singular vectors and <math>\Sigma </math> is the diagonal matrix whose elements are singular values.
 
 
<math>\vec{u}_i(\vec{t}_1,\vec{t}_2,...,\vec{t}_m)=u_i(t_1,t_2,...t_n)V</math> So we can get <math>S_{e}\left ( e,{e}' \right )= \alpha S_{u}\left ( u_{i},u_{j} \right )+\left ( 1-\alpha  \right )S_{t}\left ( t_{p},t_{q} \right )</math>
 
 
where <math>S_u(u_i,u_j)=\frac{\vec{u}_i \vec{u}_j}{\left \| \vec{u}_i \right \| \left \| \vec{u}_j \right \|}</math> and <math>S_t(t_i,t_j)=\frac{\vec{t}_i \vec{t}_j}{\left \| \vec{t}_i \right \| \left \| \vec{t}_j \right \|}
 
</math>,
 
Parameter α (0 ≤ α ≤ 1) controls the weights of users and tags. Considering the balance between user similarity and tag similarity, α is set to 0.5.
 
  
 
== Experimental Result ==
 
== Experimental Result ==

Revision as of 12:20, 3 April 2011

Citation

Xufei Wang. 2010. Discovering Overlapping Groups in Social Media, the 10th IEEE International Conference on Data Mining (ICDM 2010).

Online Version

http://dmml.asu.edu/users/xufei/Papers/ICDM2010.pdf

Databases

BlogCatalog [1]

Delicious [2]

Summary

In this paper, the authors propose a novel co-clustering framework, which takes advantage of networking information between users and tags in social media, to discover these overlapping communities. The basic ideas are:

  • To discover overlapping communities in social media. Diverse interests and interactions that human beings can have in online social life suggest that one person often belongs more than one community.
  • To use user-tag subscription information instead of user-user links. Metadata such as tags become an important source in measuring the user-user similarity. The paper shows that more accurate community structures can be obtained by scrutinizing tag information.
  • To obtain clusters containing users and tags simultaneously. Existing co-clustering methods cluster users/tags separately. Thus, it is not clear which user cluster corresponds to which tag cluster. But the proposed method is able to find out user/tag group structure and their correspondence

Problem Statement

In this paper, the concept of community is generalized to include both users and tags. Tags of a community imply the major concern of people within it.

Let denote the user set, the tay set. A community is a subset of user and tags, where k is the number of communities. As mentioned above, communities usually overlap, i.e., .On the other hand, users and their subscribed tags form a user-tag matrix M, in which each entry indicates whether user subscribes to tag . So it is reasonable to view a user as a sparse vector of tags, and each tag as a sparse vector of users.

Given notations above, the overlapping co-clustering problem can be stated formally as follows:

Input:

  • A user-tag subscription matrix , where and are the numbers of users and tags.
  • The number of communities k.

Output:

  • k overlapping communities which consist of both users and tags.

Brief Description Of The Method

Communities that aggregate similar users and tags together can be detected by maximizing intra-cluster similarity, which is shown below: where k is the number of communities, x is the edges and c is the centroid of community. This formulation can be solved by a k-means variant.

This paper uses different methods to solve the problem of overlapping communities:


A. Independent Learning


B. Normalized Learning


C. Correlational Learning

Experimental Result

The authors use two kinds of datasets: one is a synthetic data and the other kind is real data from BlogCatalog and Delicious

A. Synthetic Data

Synthetic data, which is controlled by various parameters, facilitates a comparative study between the uncovered and actual clusters. It has 1,000 users and 1,000 tags and with different number of clusters which range from 5 to 50.

Figure1.jpg

From the experiment result, we see that correlational Learning is more effective thancthe other two methods in recovering overlapping clusters. It works well even when the intra-cluster link density is low. Co-clustering performs poorly because it only finds non-overlapping clusters.

B. Social Media Data

Social Media experiment.jpg

From the experiment with BlogCatalog and Delicious, the paper show us that:

  • The probability of a link between two users increases with respect to the number of tags they share.
  • Correlational Learning consistently performs better, especially when the training set is small.
  • Higher co-occurrence frequency suggests that two users are more similar. Similar patterns are observed in the three methods.

Related papers

The author uses Co-Clustering method in Co-clustering documents and words using bipartite spectral graph partitioning as a comparison to above methods.