Inferring Social Ties From Geographic Coincidences

From Cohen Courses
Revision as of 22:48, 26 March 2011 by Dwijaya (talk | contribs)

Citation

D. Crandall, L. Backstrom, D. Cosley, S. Suri, D. Huttenlocher, J. Kleinberg. Inferring Social Ties from Geographic Coincidences. Proc. National Academy of Sciences 107 (52) 22436-22441, 28 December 2010.

Online version

Link to paper

Summary

This paper addresses the problem of inferring social ties between people based on their co-occurrence in time and space. Given that two people have been in the same geographical location at around the same time on several occasions, what is the probability that they actually know each other? Such inferences, although very intuitive, have been difficult to make precise. In this regard, the paper's contribution is in developing a general analytic framework to quantify this probability.

The framework is applied to a network of Flickr users: the probability of a friendship (social tie) between two users is inferred from the number of photos they took at approximately the same place and at approximately the same time. The paper finds that even a very small number of such co-occurrences between two users can result in a high probability of friendship between them.

The paper's second contribution is in presenting a probabilistic model that produces a good fit to the distributions observed in the actual Flickr data. The findings of the paper also highlight potential privacy implications in the possibility of inferring social structures from even a small amount of spatio-temporal co-occurrence data.

Description of the method

First, the surface of the Earth is divided into grid-like cells, each spanning s x s degrees of latitude and longitude. Two people A and B co-occur in a given cell C, at a temporal range t, if both A and B took photos geo-tagged at a location in cell C within t days of each other. For each pair of users, the number of distinct cells (k) in which they co-occurred at temporal range t is counted. The probability of friendship between users is computed by first constructing the social network of Flickr using all friendship links up through April 2008 and then identifying spatio-temporal co-occurrences that occurred after April 2008, so that only friendships existing prior to the accumulation of co-occurrence evidence are considered. The probability of friendship (the fraction of user pairs that are friends) is then computed as a function of the number of co-occurrences k (indicating the amount of evidence for a social tie), the cell size s and the temporal range t (indicating the precision of the evidence).
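The counting step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the `(lat, lon, day)` tuple layout, the floor-based cell indexing, and the inclusive reading of "within t days" are all assumptions made for the example.

```python
from collections import defaultdict
from math import floor


def cell_of(lat, lon, s):
    """Map a coordinate to the index of its s x s degree grid cell (assumed indexing)."""
    return (floor(lat / s), floor(lon / s))


def count_cooccurrence_cells(photos_a, photos_b, s, t):
    """Count distinct cells in which users A and B took photos within t days of each other.

    photos_a, photos_b: iterables of (lat, lon, day) tuples, with day measured in days.
    """
    # Group A's photo timestamps by grid cell.
    days_by_cell_a = defaultdict(list)
    for lat, lon, day in photos_a:
        days_by_cell_a[cell_of(lat, lon, s)].append(day)

    # A cell counts once if any photo by B in it is within t days of a photo by A.
    cells = set()
    for lat, lon, day in photos_b:
        c = cell_of(lat, lon, s)
        if any(abs(day - d) <= t for d in days_by_cell_a.get(c, ())):
            cells.add(c)
    return len(cells)
```

The returned value plays the role of k for one pair of users at a given s and t. A real pipeline over millions of photos would need an index over cells and time rather than a pairwise scan.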

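Given the observed distribution in the Flickr data of the probability of friendship over the number of co-occurrences k, cell size s and temporal range t, a probabilistic model is proposed to fit it. A simple model supposes that the world is divided into N geographic cells, with M people (each having exactly one social tie). Each day, each pair of friends chooses to visit a place jointly with probability β and independently with probability 1 − β. The choice of location itself is made uniformly at random. Using Bayes' Law, the probability of friendship F between two people given that they visit the same cells on k consecutive days ($C_k$) is:

$$P(F \mid C_k) = \frac{P(F)\,P(C_k \mid F)}{P(C_k)}$$

where the prior probability of friendship between two people, $P(F)$, is:

$$P(F) = \frac{1}{M-1}$$

and

$$P(C_k) = P(F)\,P(C_k \mid F) + \bigl(1 - P(F)\bigr)\,P(C_k \mid \neg F) = \frac{1}{M-1}\,p^k + \frac{M-2}{M-1}\,q^k$$

where $p$, the probability of two friends being at the same place on one given day, is:

$$p = \beta + \frac{1-\beta}{N}$$

and $q$, the probability of co-occurrence between two non-friends on a given day, is:

$$q = \frac{1}{N}$$

Hence, $P(F \mid C_k)$ is:

$$P(F \mid C_k) = \frac{p^k}{p^k + (M-2)\,q^k} = \frac{1}{1 + (M-2)\,\bigl(\beta(N-1)+1\bigr)^{-k}}$$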
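As a small numerical sketch of the simple model, the function below evaluates the Bayes' Law posterior directly. It assumes, following the model description, that locations are chosen uniformly, so that two friends coincide on a given day with probability p = β + (1 − β)/N, two non-friends coincide with probability q = 1/N, and the prior is 1/(M − 1). The function name and the parameter values in the usage lines are illustrative assumptions, not taken from the paper.

```python
def friendship_posterior(k, M, N, beta):
    """P(friends | co-located on k consecutive days) under the simple model.

    Assumes uniform cell choice: p = beta + (1 - beta) / N for friends,
    q = 1 / N for non-friends, and prior P(F) = 1 / (M - 1).
    """
    p = beta + (1.0 - beta) / N
    q = 1.0 / N
    prior = 1.0 / (M - 1)
    evidence_friends = prior * p ** k
    evidence_strangers = (1.0 - prior) * q ** k
    return evidence_friends / (evidence_friends + evidence_strangers)


# Illustrative (hypothetical) parameters: the posterior starts at the prior
# for k = 0 and rises quickly as k grows.
for k in range(4):
    print(k, friendship_posterior(k, M=1000, N=10000, beta=0.1))
```

Because q shrinks like 1/N while p stays close to β, each additional co-occurrence multiplies the odds of friendship by roughly β(N − 1) + 1, which is why a handful of coincidences is enough to overwhelm a very small prior.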

In a more complex model, each pair of friends is assigned a "home" cell drawn at random from the empirical distribution of Flickr photographs (approximately a power law with exponent 2.45). When they choose a cell to visit on a given day, they sample from a distribution that is not uniform over all cells but is peaked around the home cell and decays with distance according to a power law (with exponent γ). Each day, a person independently decides whether to visit a cell, with probability α, or to do nothing. When two friends both choose to visit a cell (an event with probability α²), with probability β they end up in the same cell, and with probability 1 − β their cell selections are independent.
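One day of the more complex model for a single pair of friends might be simulated roughly as below. This is only a sketch under several assumptions that the summary does not specify: the home cell is passed in rather than sampled from the empirical photo distribution, cells are integer grid coordinates with Euclidean distance d, and the visit weight is taken to be (1 + d)^(−γ) so that it stays finite at the home cell. All function names are hypothetical.

```python
import random


def visit_weights(home, cells, gamma):
    """Unnormalized visit weights, peaked at `home` and decaying as a power law in distance.

    The (1 + d) ** -gamma form is an assumption chosen to avoid a singularity at d = 0.
    """
    hx, hy = home
    return [(1.0 + ((x - hx) ** 2 + (y - hy) ** 2) ** 0.5) ** (-gamma) for x, y in cells]


def simulate_day(home, cells, alpha, beta, gamma, rng):
    """Return the cells visited by two friends on one day (None means no visit)."""
    weights = visit_weights(home, cells, gamma)
    # Each person independently decides whether to go anywhere today.
    a_goes = rng.random() < alpha
    b_goes = rng.random() < alpha
    cell_a = rng.choices(cells, weights=weights)[0] if a_goes else None
    cell_b = rng.choices(cells, weights=weights)[0] if b_goes else None
    # If both go out, with probability beta they travel together.
    if a_goes and b_goes and rng.random() < beta:
        cell_b = cell_a
    return cell_a, cell_b


grid = [(x, y) for x in range(10) for y in range(10)]
rng = random.Random(0)
print(simulate_day((5, 5), grid, alpha=0.29, beta=0.12, gamma=1.8, rng=rng))
```

Repeating `simulate_day` over many days and many pairs, and comparing co-occurrence counts for friends against those for non-friends, would give a simulated counterpart of the friendship probability curve.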

Datasets used

Using Flickr's public API, a dataset of about 85 million geo-tagged photographs was collected. After removing photos with imprecise geo-tags and/or missing time stamps, about 38 million photos taken by about 490,000 users remained. The social contacts of each of these users were then collected (where the user had made them public). The dataset thus contains photos taken by Flickr users as well as their social contacts.

Experimental Results

Social Network Attribute

Using the dataset of 38 million geo-tagged Flickr photos, the paper finds that the probability of a social tie increases sharply as the number of co-occurrences k increases and the temporal range t decreases. Specifically, two randomly chosen Flickr users have a 0.0134% chance of being friends, but this probability rises substantially once they have multiple spatio-temporal co-occurrences: for example, two people have a 60% chance of having a social tie when they have k = 5 co-occurrences at t = 1 day and s = 1° of latitude-longitude. The observed log-scale probability of friendship (y-axis) over the number of co-occurrences k (x-axis) at s = 1° is shown below:

[Figure: EmpiricalObservation.png]
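An empirical curve of this kind can be computed, in outline, as the fraction of user pairs with a given number of co-occurrences that are also friends. The sketch below assumes the per-pair counts k have already been computed for one choice of s and t, that "probability at k" means pairs with exactly k co-occurrences, and that pairs are represented as `frozenset` objects; these choices and the function name are illustrative assumptions.

```python
from collections import Counter


def friendship_probability_by_k(pair_counts, friends):
    """Fraction of user pairs that are friends, grouped by co-occurrence count k.

    pair_counts: dict mapping frozenset({u, v}) -> k (number of co-occurrence cells).
    friends: set of frozenset({u, v}) pairs linked in the social network.
    """
    total = Counter()
    tied = Counter()
    for pair, k in pair_counts.items():
        total[k] += 1
        if pair in friends:
            tied[k] += 1
    return {k: tied[k] / total[k] for k in total}


counts = {
    frozenset({"u1", "u2"}): 1,
    frozenset({"u1", "u3"}): 1,
    frozenset({"u2", "u3"}): 3,
}
ties = {frozenset({"u2", "u3"})}
print(friendship_probability_by_k(counts, ties))
```

Plotting the resulting values on a logarithmic y-axis against k gives a curve of the kind described above.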

To fit this observed distribution qualitatively, the complex model (described above) is used; with parameters M = 7500, N = 64800, α = 0.29, β = 0.12, γ = 1.8 it is found to match the observed distribution well. The model's log-scale probability of friendship (y-axis) over the number of co-occurrences k (x-axis) at s = 1° is shown below:

[Figure: ModelDistribution.png]

Related Papers

The main contribution of this paper is an analytical framework that quantifies the "power" of spatio-temporal coincidences (no matter how sparse) in predicting the probability of social ties. Earlier works that attempt to expose such social network structure have relied on less sparse information on:

- Anonymized versions of the network itself: Backstrom L, Dwork C, Kleinberg J (2007) Wherefore art thou R3579X? Anonymized social networks, hidden patterns, and structural steganography. Proceedings of the 16th International World Wide Web Conference: link (discussed on the slides during the class meeting for 10-802 on 04/07/2011); Narayanan A, Shmatikov V (2009) De-anonymizing social networks. Proceedings of the 30th IEEE Symposium on Security and Privacy pp 173–187: link

- Commonalities in online behavior, such as co-visitations to web sites: Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for online brand advertising: Privacy-friendly social network targeting. Proceedings of the International Conference on Knowledge Discovery and Data Mining pp 707–716: link; and tagging shared content with similar keywords: Schifanella R, Barrat A, Cattuto C, Markines B, Menczer F (2010) Folks in folksonomies: Social link prediction from shared metadata. Proceedings of the Third ACM International Conference on Web Search and Data Mining pp 271–280: link

- Detailed time series of physical co-presence: Eagle N, Pentland A, Lazer D (2009) Inferring social network structure using mobile phone data. Proc Natl Acad Sci USA 106:15274–15278: link

Discussion

The novelty of the paper lies in its quantitative treatment of spatio-temporal coincidences between people and how they relate to the likelihood of social ties. The paper does not address the question of whether friendships manifest themselves in patterns of repeated spatio-temporal coincidences. Rather, the strength of the paper lies in the opposite implication: when two people exhibit multiple spatio-temporal coincidences, this is a strong indicator of a social tie, relative to the baseline frequency of such ties.

Unfortunately, although the paper proposes a probabilistic model to fit the distribution of friendship observed in the Flickr data, it falls short of providing a quantitative evaluation of that model beyond showing that the model matches the observed distribution. No testing (prediction using the model) is conducted. An interesting next direction is perhaps to use the model to predict social ties among people based on their spatio-temporal coincidences. Another is to apply the framework and model to a different dataset, for example one that is not photography-related, to discover whether the same model still holds.

A further direction is to explore whether it is possible to characterize the type of social tie between two people from their spatio-temporal coincidences, i.e. whether strong ties can be differentiated from weak ones. For example, in photography, people may often take pictures with their friends (indicating strong ties). However, they may also often take pictures at popular public events or tourist destinations where they are part of large crowds, which lowers the likelihood that an observed co-occurrence reflects a strong tie. Differentiating such ties would be an interesting avenue for further work.