Project Draft 1 - diliu, dperciva
Project Team
Introduction
We propose to explore two aspects of a large collection of SMS (text message) data. First, we will explore methods to measure the reciprocity of the ties, taking into account the timing and perhaps the length of the text messages. Second, we will focus on abnormal behavior detection. These data, which consist of both phone call and SMS records, have been used previously in several studies, mostly focused on the phone records.
Dataset
The dataset was collected by an anonymous mobile phone operator in the six-month period between December 1, 2007 and May 31, 2008. The data include text messages to and from users within the network. For each text message, we have information about the sender, the recipient, and the timestamp of the message. We also have the length, in characters, of the message; the full text is not available for privacy reasons. The dataset consists of text message records between 4,545,744 distinct phone numbers. At least one party of each message is within the network of the mobile carrier. That is, the data also contain customers outside the network. Due to privacy and confidentiality concerns, these data are not publicly available; we have access to these data through iLab, Heinz school.
Related Work
- On this data:
- Nanavati et al. (2006) -- Some exploratory data analysis (power laws, degree distributions, ...); graph structure for the phone call networks
- Seshadri et al. (2008) -- Other models besides power laws for the phone call distributions
- De Melo et al. (2010) -- Looked at the duration of phone calls; Group behavior models
- On reciprocity:
- Zhang, Dantu, and Cangussu (2009) -- some reciprocity measures that include time (using phone call data as well)
Note: the details of these papers are to be filled in for the next phase.
Proposed Work
Note on the Data
Due to the large size of the data, we must construct our methods so that they can be computed in parallel. This rules out some more sophisticated techniques that cannot easily be split across independent pieces of the data.
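To illustrate the kind of computation that does fit this constraint, here is a minimal sketch of counting messages per directed pair in a map/reduce style. The record layout (sender, recipient, timestamp, length), the function names, and the use of a thread pool are all assumptions made for illustration; they are not the actual data format or the framework we will use. The point is only that per-chunk counts can be computed independently and merged in any order.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical record layout: (sender, recipient, timestamp_seconds, length_chars).


def count_directed(chunk):
    """Map step: count messages per directed (sender, recipient) pair in one chunk."""
    counts = Counter()
    for sender, recipient, _timestamp, _length in chunk:
        counts[(sender, recipient)] += 1
    return counts


def merge_counts(left, right):
    """Reduce step: adding Counters is associative and commutative,
    so partial results can be merged in any order."""
    return left + right


def parallel_directed_counts(chunks, max_workers=4):
    """Count each chunk independently, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(count_directed, chunks))
    return reduce(merge_counts, partials, Counter())
```

The thread pool here only stands in for whatever parallel framework is eventually used; the same map and reduce functions would apply if the chunks lived on different machines.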
Reciprocity
Reciprocity is a property of a communication network which measures how balanced relationships are in the network. In communication terms, it simply means: if I talk to you, do you talk to me as much? Reciprocity can characterize the nature of relationships and the overall flavor of the network. For example, a rigidly hierarchical network derived from a corporation may display less reciprocity than a network derived from a group of high school students.
Previous studies have focused on reciprocity in phone calls. However, there is a fundamental problem with applying reciprocity measures to this sort of data. When a person calls another, that in itself may be a reciprocal relationship, since both people talk during the call. A call back from the second person may not be necessary to complete the communication. All we can hope to measure with such data is whether a pair of people alternate in initiating communication. We cannot truly measure the depth of their communication. Text messages, on the other hand, do not have this issue; a reply text message is necessary for a reciprocal relationship.
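As one possible way to make "a reply is necessary" concrete, the sketch below computes, for one direction of a dyad, the fraction of messages that are followed by a message in the opposite direction within a fixed time window. This is only an illustrative definition, not a measure taken from the papers above; the function name, the (sender, recipient, timestamp) layout, and the window parameter are assumptions.

```python
from bisect import bisect_right


def reply_fraction(messages, sender, recipient, window_seconds):
    """Fraction of sender->recipient messages that are followed by a
    recipient->sender message within window_seconds.

    messages: iterable of (sender, recipient, timestamp_seconds) tuples
    (a hypothetical layout). Returns None if sender never messaged recipient.
    Note that a single reply may count as answering several earlier messages.
    """
    messages = list(messages)
    sent = sorted(t for s, r, t in messages if s == sender and r == recipient)
    replies = sorted(t for s, r, t in messages if s == recipient and r == sender)
    if not sent:
        return None
    answered = 0
    for t in sent:
        # Index of the first reply strictly later than this message.
        i = bisect_right(replies, t)
        if i < len(replies) and replies[i] - t <= window_seconds:
            answered += 1
    return answered / len(sent)
```

Comparing this fraction in the two directions of a dyad would give a time-aware view of how balanced the exchange is; the choice of window is a free parameter that we would need to study.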
In a phone call network, we may wish to measure the reciprocity of the overall network or of individual pairs (dyads). The simplest approach to the first is to use dyad-based counts to measure overall network reciprocity. Alternatively, we could measure the reciprocity of individual dyads using a ratio of call counts. We hope to go beyond such measures in two ways: (1) including other covariates such as timestamp and character length; (2) using the phone call records together with the SMS data, which would give us two types of ties in the network.
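The two baseline measures can be sketched as follows. The specific definitions used here are assumptions chosen for illustration: network reciprocity is taken to be the fraction of connected dyads with traffic in both directions, and dyad reciprocity is taken to be the ratio of the smaller directed count to the larger. Other definitions are possible, and the function names are hypothetical.

```python
from collections import Counter


def directed_counts(messages):
    """Count messages per directed (sender, recipient) pair."""
    counts = Counter()
    for sender, recipient in messages:
        counts[(sender, recipient)] += 1
    return counts


def dyad_ratio(counts, a, b):
    """Ratio of the smaller to the larger directed count for the dyad {a, b}.

    1.0 means perfectly balanced, 0.0 means one-way only,
    None means the pair never communicated.
    """
    ab = counts.get((a, b), 0)
    ba = counts.get((b, a), 0)
    if max(ab, ba) == 0:
        return None
    return min(ab, ba) / max(ab, ba)


def network_reciprocity(counts):
    """Fraction of connected dyads with messages in both directions."""
    # Ignore self-messages; each unordered pair is counted once.
    dyads = {frozenset(pair) for pair in counts if pair[0] != pair[1]}
    if not dyads:
        return None
    mutual = 0
    for dyad in dyads:
        a, b = tuple(dyad)
        if counts.get((a, b), 0) > 0 and counts.get((b, a), 0) > 0:
            mutual += 1
    return mutual / len(dyads)
```

Both functions depend only on the directed counts, so they can be applied after the counts have been aggregated over the full dataset.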
Abnormal Behavior
With such a large dataset, we hope to use the SMS data to detect abnormal behavior in the network. We are not yet sure exactly what we will be looking for, but we have had previous success with the phone call records in finding telemarketers, changed phone numbers, business numbers, and emergencies. Similar measures could also work for the text message data.