Preserving the privacy of sensitive relationships in graph data. PinKDD, 2007

From Cohen Courses

Citation

author = {Zheleva, Elena and Getoor, Lise},
title = {Preserving the privacy of sensitive relationships in graph data},
booktitle = {Proceedings of the 1st ACM SIGKDD international conference on Privacy, security, and trust in KDD},
series = {PinKDD'07},
year = {2008},
isbn = {3-540-78477-2, 978-3-540-78477-7},
location = {San Jose, CA, USA},
pages = {153--171},
numpages = {19},

Online version

https://utd.edu/~mxk055100/courses/privacy08f_files/zheleva-pinkdd07-extended.pdf

Abstract from the paper

In this paper, we focus on the problem of preserving the privacy of sensitive relationships in graph data. We refer to the problem of inferring sensitive relationships from anonymized graph data as link reidentification. We propose five different privacy preservation strategies, which vary in terms of the amount of data removed (and hence their utility) and the amount of privacy preserved. We assume the adversary has an accurate predictive model for links, and we show experimentally the success of different link re-identification strategies under varying structural characteristics of the data.

Summary

This paper addresses the problem of data sanitization: removing private information from datasets that are released to researchers and the public for data mining. Most work in this area has focused on data that can be described by a single table with attribute information for each entry. However, real-world datasets often also exhibit dependencies between entities. The authors study anonymization techniques for such graph data. Specifically, unlike existing work that concentrates on hiding the identity of entities and/or their attributes (k-anonymity, l-diversity and t-closeness), the authors focus on how to prevent the inference of sensitive/private relationships from anonymized graph data. They propose five anonymization strategies that vary in how many attributes or observations must be deleted during sanitization, and measure their effectiveness by the number of sensitive relationships that can be inferred from the anonymized data. First, the graph nodes are anonymized using one of the established techniques for non-graph data. For example, the nodes could be k-anonymized by clustering them into equivalence classes such that each node is indistinguishable, given its attributes, from the other k-1 nodes in its equivalence class. Second, the graph edges are anonymized using one of the following approaches:

  • Intact edges: remove only the sensitive edges, leaving all other edges intact.
  • Partial-edge removal: remove a portion of the observed edges, e.g., edges connecting high-degree nodes, edges of a particular type that contribute to the overall likelihood of a sensitive edge, or edges chosen at random.
  • Cluster-edge anonymization: collapse the anonymized nodes into a single node per cluster, and summarize the edges by keeping a count of the number of edges of each type between clusters.
  • Cluster-edge anonymization with constraints: creates edges between equivalence classes as above, but requires the equivalence-class nodes to satisfy the same constraints as any two nodes in the original data (such as a maximum number of edges of a certain type between two entities).
  • Removed edges: the final strategy is simply to remove all the edges. This preserves the most privacy, but the resulting dataset has very low utility.
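The cluster-edge strategy above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each node has already been assigned to an equivalence class (here via a hypothetical `cluster_of` mapping), and it summarizes the edge set as per-type counts between unordered cluster pairs.

```python
from collections import Counter

def cluster_edge_anonymize(edges, cluster_of):
    """Collapse nodes into their equivalence-class clusters and count
    edges of each type between (unordered) pairs of clusters.

    edges      : iterable of (u, v, edge_type) triples
    cluster_of : dict mapping each node to its cluster id
    """
    counts = Counter()
    for u, v, etype in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        # Use an unordered pair so (A, B) and (B, A) aggregate together.
        key = (min(cu, cv), max(cu, cv), etype)
        counts[key] += 1
    return counts

# Toy usage: two "friend" edges from node a into cluster 1
edges = [("a", "b", "friend"), ("a", "c", "friend")]
clusters = {"a": 0, "b": 1, "c": 1}
summary = cluster_edge_anonymize(edges, clusters)
# summary[(0, 1, "friend")] == 2
```

The constrained variant would additionally check each aggregated count against the constraints observed in the original data (e.g., a cap on edges of a given type between two entities) before releasing it.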

The authors then test the privacy-preserving characteristics of each of the above techniques by attempting link re-identification on the anonymized dataset using a noisy-or model. The noisy-or model assumes that each observed edge contributes independently to the probability of the sensitive edge existing. The authors empirically evaluate the proposed techniques on synthetically generated data. The results show that at higher thresholds, keeping all the edges between node equivalence classes preserves privacy much better than deleting 50% of the two-node edges, while retaining higher utility. For the cluster-edge method, the lower the value of k, the more private edges can be revealed. With the cluster-edge-with-constraints method, however, varying k did not change privacy preservation much.
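The noisy-or assumption has a compact closed form: the sensitive edge fails to exist only if every independent contribution fails, so P(edge) = 1 - (1 - leak) * prod(1 - lambda_i). The sketch below illustrates this; the per-observation weights `lambdas` and the `leak` term are generic placeholders, not parameters taken from the paper.

```python
def noisy_or(lambdas, leak=0.0):
    """Probability that a sensitive edge exists under a noisy-or model.

    lambdas : per-observed-edge contribution probabilities in [0, 1]
    leak    : baseline probability of the edge absent any observations
    """
    p_absent = 1.0 - leak
    for lam in lambdas:
        # Each observation independently fails to "cause" the edge
        # with probability (1 - lam).
        p_absent *= 1.0 - lam
    return 1.0 - p_absent

# Two observed edges each contributing 0.5:
# 1 - (1 - 0.5) * (1 - 0.5) = 0.75
print(noisy_or([0.5, 0.5]))
```

An adversary scores every candidate node pair this way and predicts a sensitive link wherever the score exceeds a threshold, which is why the evaluation above reports privacy as a function of that threshold.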

Related papers

Zheleva and Getoor, WWW2009

Study plan

To learn about k-anonymity and t-closeness, read the abstract and introductory sections of the following papers:

  • k-anonymity:
    • P. Samarati. Protecting respondent’s privacy in microdata release. IEEE T. Knowl. Data En., 2001. pdf
  • t-closeness:
    • N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In IEEE 23rd International Conference on Data Engineering, April 2007. pdf