Difference between revisions of "Xufei Wang, ICDM, 2010"
(53 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | |||
== Citation == | == Citation == | ||
Xufei Wang. 2010. Discovering Overlapping Groups in Social Media, the 10th IEEE International Conference on Data Mining (ICDM 2010). | Xufei Wang. 2010. Discovering Overlapping Groups in Social Media, the 10th IEEE International Conference on Data Mining (ICDM 2010). | ||
Line 7: | Line 6: | ||
== Databases == | == Databases == | ||
+ | [[Property:UsesDataset|BlogCatalog]] [http://www.blogcatalog.com/] | ||
+ | |||
+ | [[Property:UsesDataset|Delicious]] [http://www.delicious.com/] | ||
+ | |||
+ | == Summary == | ||
+ | In this [[Category::paper]], the authors propose a novel co-clustering framework, which takes advantage of networking information between users and tags in social media, to discover these [[AddressesProblem::overlapping communities]]. The basic ideas are: | ||
+ | |||
+ | * To discover overlapping communities in social media. Diverse interests and interactions that human beings can have in online social life suggest that one person often belongs more than one community. | ||
+ | |||
+ | * To use user-tag subscription information instead of user-user links. Metadata such as tags become an important source in measuring the user-user similarity. The paper shows that more accurate community structures can be obtained by scrutinizing tag information. | ||
+ | |||
+ | * To obtain clusters containing users and tags simultaneously. Existing co-clustering methods cluster users/tags separately. Thus, it is not clear which user cluster corresponds to which tag cluster. But the proposed method is able to find out user/tag group structure and their correspondence | ||
+ | |||
+ | == Problem Statement == | ||
+ | |||
+ | In this paper, the concept of community is generalized to include both users and tags. Tags of a community imply the major concern of people within it. | ||
+ | |||
+ | Let <math>\mu = \left ( \mu _{1},\mu _{2},...,\mu _{m} \right )</math> denote the user set, <math>\tau = \left ( \tau _{1},\tau _{2},...,\tau _{n} \right )</math> the tay set. A community <math>C_{i}\left ( 1\leq i\leq k \right )</math> is a subset of user and tags, where k is the number of communities. As mentioned above, communities usually overlap, i.e., <math>C_{i}\bigcap C_{j}\neq \O \left ( 1\leq i,j\leq k \right )</math>.On the other hand, users and their subscribed tags form a user-tag matrix M, in which each entry <math>M_{ij}\in \left \{ 0,1 \right \}</math> indicates whether user <math>u_{i}</math> subscribes to tag <math>t_{j}</math>. So it is reasonable to view a user as a sparse vector of tags, and each tag as a sparse vector of users. | ||
+ | |||
+ | Given notations above, the overlapping co-clustering problem can be stated formally as follows: | ||
+ | |||
+ | '''Input:''' | ||
+ | * A user-tag subscription matrix <math>M_{N_{\mu }\times N_{t}}</math>, where <math>N_{\mu }</math> and <math>N_{t}</math> are the numbers of users and tags. | ||
+ | |||
+ | * The number of communities k. | ||
+ | |||
+ | '''Output:''' | ||
+ | * k overlapping communities which consist of both users and tags. | ||
+ | |||
+ | == Brief Description Of The Method == | ||
+ | |||
+ | Communities that aggregate similar users and tags together can be detected by maximizing intra-cluster similarity, which is shown below: | ||
+ | <math>arg max\frac{1}{k}\sum_{i=1}^{k}\sum_{x_{j}\in C_{i}}^{}S_{C}\left ( x_{j},c_{i} \right )</math> | ||
+ | where k is the number of communities, x is the edges and c is the centroid of community. This formulation can be solved by a [[UsesMethod::k-means]] variant. | ||
+ | |||
+ | This paper uses different methods to solve the problem of overlapping communities: | ||
+ | |||
+ | '''A. [[UsesMethod::Independent Learning]]''' | ||
+ | |||
+ | '''B. [[UsesMethod::Normalized Learning]]''' | ||
+ | |||
+ | '''C. [[UsesMethod::Correlational Learning]]''' | ||
+ | |||
+ | == Experimental Result == | ||
+ | |||
+ | The authors use two kinds of datasets: one is a synthetic data and the other kind is real data from [[Category::Dataset|BlogCatalog]] and [[Category::Dataset|Delicious]] | ||
+ | |||
+ | '''A. Synthetic Data''' | ||
+ | |||
+ | Synthetic data, which is controlled by various parameters, facilitates a comparative study between the uncovered and actual clusters. It has 1,000 users and 1,000 tags and with different number of clusters which range from 5 to 50. | ||
+ | |||
+ | [[File:figure1.jpg]] | ||
+ | |||
+ | From the experiment result, we see that correlational Learning is more effective than the other two methods in recovering overlapping clusters. It works well even when the intra-cluster link density is low. Co-clustering performs poorly because it only finds non-overlapping clusters. | ||
+ | |||
+ | '''B. Social Media Data''' | ||
+ | |||
+ | [[File:Social Media experiment.jpg]] | ||
+ | |||
+ | From the experiment with [[Category::Dataset|BlogCatalog]] and [[Category::Dataset|Delicious]], the paper show us that: | ||
+ | |||
+ | * The probability of a link between two users increases with respect to the number of tags they share. | ||
+ | |||
+ | * Correlational Learning consistently performs better, especially when the training set is small. | ||
+ | |||
+ | * Higher co-occurrence frequency suggests that two users are more similar. Similar patterns are observed in the three methods. | ||
+ | |||
+ | == Related papers == | ||
+ | |||
+ | The author uses Co-Clustering method in [[RelatedPaper::Co-clustering documents and words using bipartite spectral graph partitioning]] as a comparison to above methods. |
Latest revision as of 23:28, 3 April 2011
Contents
Citation
Xufei Wang. 2010. Discovering Overlapping Groups in Social Media, the 10th IEEE International Conference on Data Mining (ICDM 2010).
Online Version
http://dmml.asu.edu/users/xufei/Papers/ICDM2010.pdf
Databases
Summary
In this paper, the authors propose a novel co-clustering framework, which takes advantage of networking information between users and tags in social media, to discover these overlapping communities. The basic ideas are:
- To discover overlapping communities in social media. Diverse interests and interactions that human beings can have in online social life suggest that one person often belongs more than one community.
- To use user-tag subscription information instead of user-user links. Metadata such as tags become an important source in measuring the user-user similarity. The paper shows that more accurate community structures can be obtained by scrutinizing tag information.
- To obtain clusters containing users and tags simultaneously. Existing co-clustering methods cluster users/tags separately. Thus, it is not clear which user cluster corresponds to which tag cluster. But the proposed method is able to find out user/tag group structure and their correspondence
Problem Statement
In this paper, the concept of community is generalized to include both users and tags. Tags of a community imply the major concern of people within it.
Let denote the user set, the tay set. A community is a subset of user and tags, where k is the number of communities. As mentioned above, communities usually overlap, i.e., .On the other hand, users and their subscribed tags form a user-tag matrix M, in which each entry indicates whether user subscribes to tag . So it is reasonable to view a user as a sparse vector of tags, and each tag as a sparse vector of users.
Given notations above, the overlapping co-clustering problem can be stated formally as follows:
Input:
- A user-tag subscription matrix , where and are the numbers of users and tags.
- The number of communities k.
Output:
- k overlapping communities which consist of both users and tags.
Brief Description Of The Method
Communities that aggregate similar users and tags together can be detected by maximizing intra-cluster similarity, which is shown below: where k is the number of communities, x is the edges and c is the centroid of community. This formulation can be solved by a k-means variant.
This paper uses different methods to solve the problem of overlapping communities:
Experimental Result
The authors use two kinds of datasets: one is a synthetic data and the other kind is real data from BlogCatalog and Delicious
A. Synthetic Data
Synthetic data, which is controlled by various parameters, facilitates a comparative study between the uncovered and actual clusters. It has 1,000 users and 1,000 tags and with different number of clusters which range from 5 to 50.
From the experiment result, we see that correlational Learning is more effective than the other two methods in recovering overlapping clusters. It works well even when the intra-cluster link density is low. Co-clustering performs poorly because it only finds non-overlapping clusters.
B. Social Media Data
From the experiment with BlogCatalog and Delicious, the paper show us that:
- The probability of a link between two users increases with respect to the number of tags they share.
- Correlational Learning consistently performs better, especially when the training set is small.
- Higher co-occurrence frequency suggests that two users are more similar. Similar patterns are observed in the three methods.
Related papers
The author uses Co-Clustering method in Co-clustering documents and words using bipartite spectral graph partitioning as a comparison to above methods.