Project Draft 1 - diliu, dperciva

From Cohen Courses

Project Team

Di Liu

Daniel Percival

Introduction

We have data from a mobile carrier in India. In previous work, we explored anomaly detection in the phone call records. In this class project, we propose to explore two aspects of a large collection of SMS (text message) data. First, we are interested in methods for measuring the reciprocity of ties, taking into account both the timing and perhaps the length of the text messages. We expect reciprocity to play an important part in communication between people. SMS differs here from phone records, where people can complete a question-and-reply exchange in a single call. Second, we will focus on detecting abnormal behavior. For example, we would like to detect telemarketers or uncommon relationships. If time permits, we would also like to consider the similarities and differences between the SMS data and the phone records. For example, if person i calls j often, does i also text j often?

We also have records of the music downloading history for each phone number, and we are interested in examining the distribution of ringtone downloads as well.

The dataset is interesting for two reasons. First, the scale of the dataset is of interest to us. Second, phone calls, as a traditional communication method, reveal important personal relationships, and the phone records might give us different information from internet interactions. These data, which consist of both phone call and SMS records, have been used previously in several studies, mostly focused on the phone records.

Dataset

The dataset was collected by an anonymous mobile phone operator in the 6-month period between December 1, 2007 and May 31, 2008. The data include text messages to and from users within the network. For each text message, we have information about the sender, the recipient, and the timestamp of the message. We also have the length of the message in characters; the full text is not available for privacy reasons. The dataset consists of text message records between 4,545,744 distinct phone numbers. At least one party of each message is within the network of the mobile carrier. That is, the data also include customers outside the network. Due to privacy and confidentiality concerns, these data are not publicly available; we have access to them through iLab, Heinz School.

Related Work

  • On this data:
    • Nanavati et al. (2006) -- Some exploratory data analysis (power laws, degree distributions, ...); graph structure for the phone call networks
    • Seshadri et al. (2008) -- Other models besides power laws for the phone call distributions
    • De Melo et al. (2010) -- Looked at the duration of phone calls; group behavior models
    • Leman Akoglu et al. (2010) -- Structural properties in the call/SMS data; eigenvalue decomposition method for change-point detection
    • Jure Leskovec et al. (2006) -- Observed the propagation of recommendations and the cascade sizes, explained by a simple stochastic model
  • On reciprocity:
    • Zhang, Dantu, and Cangussu (2009) -- some reciprocity measures that include time (using phone call data as well)

Note: the details of these papers are to be filled in for the next phase.

Proposed Work

Note on the Data

Due to the large size of the data, we are considering designing our methods so that they can be computed in parallel. This, however, rules out some more sophisticated statistical techniques. Therefore, we are also considering sampling from the full data to get a subset that fits in the memory of a large-memory computer. We still need to determine which sampling procedure to use. Suggestions are welcome.
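As one possible candidate (not a decision), the sketch below samples phone numbers with a deterministic hash and keeps only messages whose sender and recipient are both sampled. Because membership depends only on the identifier, separate workers could filter separate chunks of the records and agree without coordination, and both directions of a sampled pair are kept, which matters for reciprocity. The record layout (sender first, recipient second), the function names, and the salt are assumptions made for illustration.

```python
import hashlib


def in_sample(phone_id, rate, salt="sms-sample"):
    """Return True if this identifier falls in the sample.

    The decision is a deterministic function of (salt, phone_id), so every
    worker reaches the same answer for the same phone number.
    """
    digest = hashlib.sha256(f"{salt}:{phone_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def sample_messages(messages, rate):
    """Yield messages whose sender AND recipient are both in the sample.

    Each message is assumed to be a tuple with the sender at index 0 and
    the recipient at index 1; any further fields are passed through.
    """
    for message in messages:
        if in_sample(message[0], rate) and in_sample(message[1], rate):
            yield message
```

Note that sampling nodes at rate p keeps roughly p squared of the pairs, so the rate would need to be chosen with the target memory size in mind. Other procedures (for example, sampling by sender only, or by edge) would preserve different properties of the network.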

Reciprocity

Reciprocity is a property of a communication network which measures how balanced relationships are in the network. In communication terms, it simply means: if I talk to you, do you talk to me as much? Reciprocity can characterize the nature of relationships and the overall flavor of the network. For example, a rigidly hierarchical network derived from a corporation may display less reciprocity than a network derived from a group of high school students.

Previous studies have focused on reciprocity in phone calls. However, there is a fundamental problem with applying reciprocity measures to this sort of data. When a person calls another, that in itself may be a reciprocal relationship, since both people talk during the call. A call back from the second person may not be necessary to complete the communication. All we can hope to measure with such data is whether a pair of people alternate in initiating communication. We cannot truly measure the depth of their communication. Text messages, on the other hand, do not have such an issue; a reply text message is necessary for a reciprocal relationship.
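To make the idea of a reply concrete, here is a minimal sketch of one way to pair text messages with their replies using timestamps. It assumes each record is a (sender, recipient, timestamp) tuple with numeric timestamps in seconds, and it treats a message from j to i as a reply if it arrives within a fixed window after the most recent unanswered message from i to j. The window length and the function name are illustrative assumptions, not part of the dataset or of any method we have settled on.

```python
def reply_delays(messages, window):
    """Return the delays (in timestamp units) of messages answered within `window`.

    messages: iterable of (sender, recipient, timestamp) tuples.
    A message r -> s counts as a reply if the most recent unanswered
    message s -> r was sent at most `window` earlier.
    """
    pending = {}  # (sender, recipient) -> timestamp of latest unanswered message
    delays = []
    for sender, recipient, t in sorted(messages, key=lambda m: m[2]):
        asked = pending.pop((recipient, sender), None)
        if asked is not None and t - asked <= window:
            # This message answers the earlier one in the opposite direction.
            delays.append(t - asked)
        else:
            # Treat it as a new initiation that is waiting for a reply.
            pending[(sender, recipient)] = t
    return delays
```

The fraction of messages that receive a reply, or the distribution of the returned delays, could then serve as a time-aware ingredient in a reciprocity measure.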

In a phone call network, we may wish to measure the reciprocity of the overall network, or of the individual pairs (dyads). The simplest approach to the first is to use dyad-based counts to measure the overall network reciprocity. Alternatively, we could measure the reciprocity of individual dyads using a ratio of counts of calls. We hope to go beyond such measures in two ways: (1) including other covariates such as time stamp and character length; (2) using the phone call records together with the SMS data, which would give us two types of ties in the network.
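The two count-based baselines described above can be sketched as follows. This is only an illustration under stated assumptions: the input is an iterable of (sender, recipient) pairs, the network-level measure is taken to be the fraction of connected dyads with traffic in both directions, and the dyad-level measure is taken to be the ratio of the smaller directed count to the larger. The function names are ours, and other definitions of both measures are possible.

```python
from collections import Counter


def directed_counts(messages):
    """Count messages for each ordered (sender, recipient) pair."""
    return Counter((sender, recipient) for sender, recipient in messages)


def dyad_reciprocity(counts, i, j):
    """Ratio of the smaller to the larger directed count between i and j.

    1.0 means perfectly balanced, 0.0 means one-way; None if no contact.
    """
    a, b = counts.get((i, j), 0), counts.get((j, i), 0)
    if max(a, b) == 0:
        return None
    return min(a, b) / max(a, b)


def network_reciprocity(counts):
    """Fraction of connected dyads with at least one message in each direction."""
    dyads = {tuple(sorted(pair)) for pair in counts if pair[0] != pair[1]}
    if not dyads:
        return None
    mutual = sum(
        1 for i, j in dyads
        if counts.get((i, j), 0) > 0 and counts.get((j, i), 0) > 0
    )
    return mutual / len(dyads)
```

For example, with two messages from a to b, one from b to a, and one from a to c, the dyad (a, b) has ratio 0.5, the dyad (a, c) has ratio 0.0, and one of the two connected dyads is mutual.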

Abnormal Behavior

With such a large data set, we hope to use the SMS data to detect abnormal behavior in the network. We are not yet sure exactly what we will be looking for, but we have had previous success with the phone call records in finding telemarketers, changed phone numbers, business numbers, and emergencies. Similar measures could also work for the text message data.

Music Downloading History

The music downloading dataset is relatively small compared to the other two datasets. It is also possible for us to fit extracted summary statistics into the memory of a personal computer, so the data are more manageable. We are considering developing methods based on the paper by Leskovec et al. (2006).