Zheleva ACM 2009

From Cohen Courses
Revision as of 21:23, 31 March 2011 by Aoverwij (talk | contribs)

This is a paper discussed in Social Media Analysis 10-802 in Spring 2011.

Citation

Zheleva, E. and Getoor, L. 2009. To Join or Not to Join: The Illusion of Privacy in Social Networks with Mixed Public and Private User Profiles. In WWW 2009 (ACM), April 20-24, Madrid, Spain.

Online version

To Join or Not to Join: The Illusion of Privacy in Social Networks with Mixed Public and Private User Profiles

Summary

This paper investigates what sensitive information can be inferred from friendship and group membership information in social networks such as Facebook, Orkut and Flickr. In such networks a person's profile information can be marked as private, but friendship links and group affiliations are often still visible to the public. The paper proposes eight privacy attacks that use different classifiers and features.

In this paper, sensitive information is defined as attributes such as age, political affiliation or location. Depending on privacy settings, these attributes are hidden for some users of the social network and observed for others. The goal of the paper is to predict the hidden values from the information that remains public.

The approach consists of learning a Naive Bayes classifier for a specialized graphical model.
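To make the idea of predicting a hidden attribute from public group memberships concrete, here is a minimal sketch. It is not the paper's model: it assumes a plain Bernoulli Naive Bayes over group-membership indicators with Laplace smoothing, and the function names, group names and labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_group_nb(observed):
    """Count labels and group memberships for users with public profiles.

    observed: list of (label, set_of_groups) pairs (hypothetical toy format).
    """
    label_counts = Counter()
    group_counts = defaultdict(Counter)  # label -> group -> number of members
    vocab = set()
    for label, groups in observed:
        label_counts[label] += 1
        for group in groups:
            group_counts[label][group] += 1
            vocab.add(group)
    return label_counts, group_counts, vocab

def predict_group_nb(model, groups):
    """Return the most probable label for a user whose profile is hidden."""
    label_counts, group_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for group in vocab:
            # Laplace-smoothed probability that a user with this label joins the group
            p = (group_counts[label][group] + 1) / (n + 2)
            score += math.log(p if group in groups else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data: labels and groups are made up for illustration.
observed = [
    ("liberal", {"g1", "g2"}),
    ("liberal", {"g1"}),
    ("conservative", {"g3"}),
    ("conservative", {"g2", "g3"}),
]
model = train_group_nb(observed)
hidden_user_prediction = predict_group_nb(model, {"g1"})
```

Only the group memberships of the hidden user are needed at prediction time, which is why public group affiliations can leak a private attribute.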



This paper proposes a method for extracting the semantic orientation of words using a spin model, which is a model of a set of electrons with spins. Each word has a positive or negative orientation, which corresponds to an electron with up or down spin. Calculating the probability function exactly is intractable, so mean field theory is used instead to approximate the average orientation of each word. In the spin model, two electrons (words) next to each other tend to have the same spin (orientation).
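The intractability and the mean-field approximation can be written out in the standard Ising form. The notation below is an assumption for illustration (spins $x_i \in \{-1,+1\}$, link weights $w_{ij}$, inverse temperature $\beta$) and may differ from the symbols used in the paper.

```latex
% Energy of a spin configuration x and its probability
E(x) = -\sum_{i<j} w_{ij}\, x_i x_j ,
\qquad
P(x) = \frac{\exp\bigl(-\beta E(x)\bigr)}{Z},
\qquad
Z = \sum_{x \in \{-1,+1\}^N} \exp\bigl(-\beta E(x)\bigr)

% Z sums over 2^N configurations, so P(x) cannot be computed exactly.
% Mean-field approximation for the average spin of word i:
\bar{x}_i \approx \tanh\Bigl(\beta \sum_{j} w_{ij}\, \bar{x}_j\Bigr)
```

A positive weight $w_{ij}$ lowers the energy when $x_i$ and $x_j$ agree, which is what makes linked words prefer the same orientation.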

The approach in the paper first constructs a lexical network in which two words are linked if one appears in the gloss of the other. Each link indicates that the two words have either the same or a different orientation. The latter can happen due to negation words such as 'not'. The links are then weighted depending on the degree of both words. They call this the gloss network. In addition to the gloss network, another network called the gloss-thesaurus network is constructed, which also uses synonyms, antonyms and hypernyms. They enhance this network with cooccurrence information extracted from the corpus and call the result the gloss-thesaurus-corpus network.
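The gloss network construction described above can be sketched as follows. The glosses, the negation list, and the helper name `build_gloss_network` are hypothetical, and the weight formula (sign divided by the square root of the product of the two degrees) is an assumption chosen to illustrate degree-based weighting, not a statement of the paper's exact formula.

```python
import math
from collections import defaultdict

NEGATIONS = {"not", "no", "never"}  # assumed negation words

def build_gloss_network(glosses):
    """Link two words if one occurs in the other's gloss.

    glosses: dict mapping a word to its gloss text (toy dictionary format).
    Returns a dict mapping a sorted word pair to a signed weight.
    """
    vocab = set(glosses)
    signs = {}  # frozenset({a, b}) -> +1 (same orientation) or -1 (different)
    for word, gloss in glosses.items():
        tokens = gloss.lower().split()
        for i, token in enumerate(tokens):
            if token in vocab and token != word:
                # A directly preceding negation word flips the orientation
                negated = i > 0 and tokens[i - 1] in NEGATIONS
                signs[frozenset((word, token))] = -1 if negated else 1

    degree = defaultdict(int)
    for pair in signs:
        for w in pair:
            degree[w] += 1

    weights = {}
    for pair, sign in signs.items():
        a, b = sorted(pair)
        # Assumed weighting: links between high-degree words count less
        weights[(a, b)] = sign / math.sqrt(degree[a] * degree[b])
    return weights

# Toy glosses, made up for illustration.
toy_glosses = {
    "good": "pleasant and nice",
    "nice": "pleasant",
    "bad": "not good",
    "pleasant": "enjoyable",
}
gloss_weights = build_gloss_network(toy_glosses)
```

In the toy data the link between "bad" and "good" gets a negative weight because "good" follows "not" in the gloss of "bad".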

Given the orientation of a small number of seed words, the orientations of all the other words are propagated through the network. This propagation is based on an update formula for each orientation value and ends when the difference in the value of the variational free energy is smaller than a certain threshold. The words with high final average values are classified as the positive words.
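A rough sketch of such a propagation is given below. It simplifies the paper's procedure in two ways that should be read as assumptions: seed words are clamped to their known orientation, and the loop stops when the largest change in any average falls below a tolerance rather than when the variational free energy stops changing. The edge weights, `beta`, and the function name `propagate` are hypothetical.

```python
import math

def propagate(weights, seeds, beta=1.0, tol=1e-6, max_iter=1000):
    """Mean-field style update of average orientations with clamped seeds.

    weights: dict mapping (word_a, word_b) to a signed link weight.
    seeds:   dict mapping seed words to +1.0 or -1.0.
    """
    neighbors = {}
    for (a, b), w in weights.items():
        neighbors.setdefault(a, []).append((b, w))
        neighbors.setdefault(b, []).append((a, w))

    x = {word: 0.0 for word in neighbors}  # start with no orientation
    x.update(seeds)

    for _ in range(max_iter):
        max_change = 0.0
        for word in sorted(neighbors):
            if word in seeds:
                continue  # seeds keep their given orientation
            field = sum(w * x[other] for other, w in neighbors[word])
            new_value = math.tanh(beta * field)
            max_change = max(max_change, abs(new_value - x[word]))
            x[word] = new_value
        if max_change < tol:
            break  # simplified stopping rule (see lead-in)
    return x

# Toy network with made-up weights; "good" is the only seed word.
toy_weights = {
    ("bad", "good"): -0.5,
    ("good", "nice"): 0.5,
    ("nice", "pleasant"): 0.5,
}
orientation = propagate(toy_weights, seeds={"good": 1.0})
positive_words = sorted(w for w, v in orientation.items() if v > 0)
```

Words that are connected to the seed only through other words, such as "pleasant" here, still receive an orientation because each update uses the current averages of the neighbors.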

They created a network of approximately 88,000 words collected from the Wall Street Journal and Brown corpus. For evaluation they used a labeled dataset of 3596 words as a gold standard. Parameter tuning as well as the effect of the number of seed words is evaluated using 10-fold cross-validation.
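For readers unfamiliar with the evaluation protocol, the following sketch shows how 3596 labeled items could be split into 10 folds. The helper `k_fold_indices` and the shuffling seed are hypothetical; the paper's actual fold assignment is not described on this page.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    folds = [indices[i::k] for i in range(k)]  # near-equal fold sizes
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Each labeled word is used for testing exactly once across the 10 folds.
splits = list(k_fold_indices(3596, k=10))
fold_sizes = [len(test) for _, test in splits]
```

Every item appears in the test part of exactly one fold and in the training part of the other nine.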

Based on their experiments they conclude that the network that incorporates synonyms and the cooccurrence information from the corpus improves accuracy when there are more than 2 seed words. A possible explanation is that with only 2 seed words there is a relatively large degree of freedom, so the method ends up in a local optimum. Furthermore, they show that their method works well in a comparison with the shortest-path method by Kamps LREC 2004 [1] and the bootstrapping method by Hu SIGKDD 2004 [2]. However, their method is not perfect: it suffers from ambiguity, lack of structural information and idiomatic expressions.

References

1. Jaap Kamps, Maarten Marx, Robert J. Mokken, and Maarten de Rijke. 2004. Using WordNet to measure semantic orientation of adjectives. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), volume IV, pages 1115–1118.

2. Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining (KDD-2004), pages 168–177.

Remarks

We noticed that they reported accuracy based on cross-validation, which was also used for parameter tuning. This risks overfitting to the dataset, so it would have been better if they had evaluated on a separate test set.