Zheleva and Getoor, WWW2009
Contents
Citation
Zheleva, E., and L. Getoor. 2009. To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles. In Proceedings of the 18th international conference on World wide web, 531–540.
Online version
Summary
This paper applied some practical models that use friendship an group membership information do a relational classification problem in social networks. The key novel idea is that in addition to friendship links, groups can be carriers of significant information.
Methodology
- Attacks without links and groups
This is the baseline, where the only available information is the overall marginal distribution for the sensitive attribute in the public profiles, i.e. no relationship or group information. So the probability of assigning a sensitive attribute value is just the number of public profiles with that senssitve attribute value over the total number of public profiles.
- Privacy attacks using links
Link-based privacy attacks take advantage of autocorrelation, the property that the attribute values of linked objects are correlated. An example of autocorrelation is that people who are friends often share common characteristics (as in the proverb "Tell me who your friends are, and I'll tell you who you are").
- Friend-aggregate model (AGG)
The nodes and their links produce a graph structure in which one can identify circles of close friends. The AGG looks at the sensitive attribute distribution amongst the friends of the person under question. The adversary using this model picks the most probable attribute value (i.e., the mode of the friends' attribute distribution).
- Collective classification model (CC)
Collective classification also takes advantage of autocorrelation between linked objects. Unlike more traditional methods, in which each instance is classified independently of the rest, collective classification aims at learning and inferring class labels of linked objects together. So it makes use of not only the public profiles but also the inferred values for connected private profile. The approximating inference algorithm used in this paper is the iterative classification (ICA).
- Flat-link model (LINK)
This approach deals with links as flattening the data by considering the adjacency matrix of the graph. In this model, each row in the matrix is a user instance and each user has a list of binary features of the size of the network, while each feature has a value of 1 if the user is friends with the person who corresponds to this feature and 0 otherwise. The algorithms can be any traditional classifier, such as Naive Bayes, logistic regression or SVM.
- Blockmodeling attach (BLOCK)
Users form natural clusters or blocks, and their interactions can be explained by the blocks they belong to. In particular, the link probability between two users is the same as the link probability between their corresponding blocks. If sensitive attribute values separate users into blocks, then based on the observed interactions of a private-profile user with public-profile users, one can predict the most likely block the user belongs to and thus discover the attribute value. This model is similar to the class-distribution relational-neighbor classifier.
- Privacy attachks using groups
- Groupmate-link model (CLIQUE)
One can think of groupmates as friends to whom users are implicitly linked. So this model can assume that each group is a clique of friends, thus creating a friendship link between users who belong to at least one group together. The advantage of this model is that it simplifies the problem to a link-based classification problem, which has been studied more thoroughly. One of the disadvantages is that it doesn't account for the strength of the relationship between two people, e.g. number of common groups.
- Group-based classification model (GROUP)
This approach considers each group as a feature in a classifier. While some groups may be useful in inferring the sensitive attribute, a problem in many of the datasets that we encountered was that users were members of a very large number of groups, so identifying which groups are likely to be predictive is a key.
The group-based classification approach contains three main steps as Algorithm 1 shows. In the first step, the algorithm performs feature selection: it selects the groups that are relevant to the node classification task. This can either be done automatically or by a domain expert. Ideally, when the number of groups is high, the feature selection should be automated. For example, the function isRelevant(h) can return true if the entropy of group h is low. In the second step, the algorithm learns a global function f, e.g., trains a classifier, that takes the relevant groups of a node as features and returns the sensitive attribute value. This step uses only the nodes from the observed set whose sensitive attributes are known. Each node v is represented as a binary vector where each dimension corresponds to a unique group: {groupId : isMember}, v.a. Only memberships to relevant groups are considered and v.a is the class coming from a multinomial distribution which denotes the sensitiveattribute value. In the third step, the classifier returns the predicted sensitive attribute for each private profile.
Algorithm 1 Group-based classification model
1. Set of relevant groups 2. for each group do if then end if end for 3. 4. for each sensitive node do end for
- Privacy attacks using links and groups
This method uses both links and groups to predict the sensitive attributes of users, i.e. combine the flat-link and the group-based classification models.
Data
Experimental Results
The results show that groups can leak a significant amount of information and not joining homogeneous groups preserves privacy better. People who are concerned about their privacy should consider properties of the groups they join, and social network providers should warn their users of the privacy breaches associated with joining groups. Another privacy aspect is the ability to join public groups but display group memberships only to friends. Currently, neither Facebook nor Flickr allow group memberships to be private and this is a desirable solution to the problem discussed in the paper.
Surprisingly, link-based methods did not perform as well as the authors expected. This suggests that breaking privacy in social networks with mixed private and public profiles is not necessarily straightforward, and using friends in classifying people has to be treated with care. The results also conjecture that this depends on the dataset. For example, while link-based methods were not very successful in predicting the location of users in Flickr, they may work well in LiveJournal; for example, a study by Liben-Nowell et al. showed that most of the friendship links in LiveJournal are related to geographical proximity.
Related papers
Preserving the privacy of sensitive relationships in graph data. PinKDD, 2007.