Difference between revisions of "Zheleva ACM 2009"


Revision as of 21:37, 31 March 2011

This is a paper discussed in Social Media Analysis 10-802 in Spring 2011.

Citation

Zheleva, E. and Getoor, L. To Join or Not to Join: The Illusion of Privacy in Social Networks with Mixed Public and Private User Profiles. In WWW 2009 (ACM), April 20-24, Madrid, Spain.

Online version

To Join or Not to Join: The Illusion of Privacy in Social Networks with Mixed Public and Private User Profiles

Summary

This paper investigates what sensitive information can be inferred from friendship and group membership information in social networks such as Facebook, Orkut and Flickr. In such social networks a person's profile information can be marked as private, but friendship links and group affiliations are often visible to the public. The paper proposes eight privacy attacks using different classifiers and features.

In this paper, sensitive information is defined as attributes such as age, political affiliation or location. Depending on their privacy settings, some users in the social network hide this information while for others it is observed. The goal of the paper is to predict these values for the users who hide them. The approach consists of learning a Naive Bayes classifier for a specialized graphical model.
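
As a rough illustration of how a group-based attack of this kind could work, the sketch below trains a small Naive Bayes classifier on users with public attributes and predicts the attribute of a user who hides it, using only group memberships as features. It is a minimal sketch under stated assumptions, not the paper's actual model: the function names, the Bernoulli feature encoding, the Laplace smoothing parameter `alpha` and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_group_nb(public_labels, memberships, alpha=1.0):
    """Estimate P(label) and P(group | label) from users with public labels.

    public_labels: {user: label} for users whose attribute is visible.
    memberships:   {user: set of group names} for all users.
    alpha:         Laplace smoothing constant (an assumption of this sketch).
    """
    label_counts = Counter(public_labels.values())
    group_counts = defaultdict(Counter)  # label -> group -> member count
    for user, label in public_labels.items():
        for group in memberships.get(user, ()):
            group_counts[label][group] += 1
    all_groups = {g for groups in memberships.values() for g in groups}
    return label_counts, group_counts, all_groups, alpha


def predict_group_nb(model, user_groups):
    """Return the most probable label for a user given only their groups."""
    label_counts, group_counts, all_groups, alpha = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for group in all_groups:
            # Smoothed probability that a user with this label is in the group.
            p = (group_counts[label][group] + alpha) / (n + 2 * alpha)
            score += math.log(p if group in user_groups else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data (hypothetical): "eve" hides her attribute but her group is visible.
public_labels = {"ann": "liberal", "bob": "liberal",
                 "cat": "conservative", "dan": "conservative"}
memberships = {"ann": {"g1"}, "bob": {"g1", "g3"},
               "cat": {"g2"}, "dan": {"g2", "g3"}, "eve": {"g1"}}

model = train_group_nb(public_labels, memberships)
prediction = predict_group_nb(model, memberships["eve"])
print(prediction)
```

In this toy example the group `g1` contains only users with one label, so membership in it is highly informative; a group such as `g3`, which mixes both labels, contributes nothing to the decision.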

Their experiments show that group memberships can leak a significant amount of information, although users who avoid joining homogeneous groups preserve their privacy better. Link-based methods, on the other hand, did not reveal as much information, although related work, Liben-Nowell PNAS 2005 [1], shows that on other datasets links do help.
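
For contrast with the group-based setting, a link-based baseline can be sketched as a majority vote over the public attributes of a user's friends. This is only an illustrative assumption about what a simple link-based predictor might look like; it is not claimed to be one of the paper's eight attacks, and the function name and toy data are hypothetical.

```python
from collections import Counter


def predict_from_friends(user, friends, public_labels):
    """Predict a hidden attribute as the most common label among friends.

    friends:       {user: list of friend names} (the visible friendship links).
    public_labels: {user: label} for users whose attribute is visible.
    Returns None when no friend has a public label.
    """
    votes = Counter(
        public_labels[f] for f in friends.get(user, ()) if f in public_labels
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): "zed" also hides his attribute, so he casts no vote.
friends = {"eve": ["ann", "bob", "cat", "zed"], "zed": ["eve"]}
public_labels = {"ann": "liberal", "bob": "liberal", "cat": "conservative"}

guess = predict_from_friends("eve", friends, public_labels)
print(guess)
```

Such a predictor only works well when friends tend to share the attribute, which is consistent with the observation that the usefulness of links depends on the dataset.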

References

1. D. Liben-Nowell, J. Novak, R. Kumar, P. Raghavan and A. Tomkins. Geographic routing in social networks. PNAS, 102(33):11623–11628, August 2005.