Modeling Spread of Disease from Social Interaction

Citation

Adam Sadilek, Henry Kautz, Vincent Silenzio. "Modeling Spread of Disease from Social Interaction." Sixth AAAI International Conference on Weblogs and Social Media (ICWSM), 2012.

Online Version

Online Pdf

Summary

This paper models the spread of communicable diseases by analyzing social media.

An example from the paper: given that five of your friends have flu-like symptoms, and that you have recently met eight people, possibly strangers, who complained about having runny noses and headaches, what is the probability that you will soon become ill as well?

Traditionally, public health is monitored through surveys and statistics obtained from health care centres. This process is expensive and slow, and it is also biased, since many people do not even take medication for the flu. This work is more fine-grained: it considers the interactions between individuals as revealed by their tweets. One of the main challenges was to correctly identify the very small number of tweets that are related to sickness. The authors develop an SVM classifier for this task that performs very well (0.98 precision and 0.97 recall).
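For readers less familiar with the two metrics quoted above, the following minimal sketch shows how precision and recall are computed from classification counts. The function name and the counts in the usage note are hypothetical and chosen only for illustration; they are not figures from the paper.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from raw classification counts.

    precision = fraction of tweets flagged "sick" that really are sick
    recall    = fraction of truly sick tweets that were flagged
    """
    flagged = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical counts, for illustration only (not data from the paper).
p, r = precision_recall(true_pos=98, false_pos=2, false_neg=3)
```

With these made-up counts, precision is 98 / 100 and recall is 98 / 101. High precision matters here because sick tweets are so rare that even a small false-positive rate on ordinary tweets would swamp the true positives.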

Data

The analysis is based on tweets collected over one month from the NYC metropolitan area. The specifics of the data are shown in the following figure:

Methodologies and models

Detecting illness related tweets

The major challenge was to detect which of a person's tweets are related to illness, since for every health-related tweet there are more than 1000 unrelated ones. Given this class imbalance, the authors formulate a semi-supervised, cascade-based approach to learn a robust Support Vector Machine (SVM) classifier.

To extract these tweets, they first obtain high-quality training data for their final classifier. The following figure shows their methodology:

Briefly, they first train two different binary SVM classifiers. The classifier $C_{S}$ is penalized severely for false positives (normal tweets labelled as sick), and $C_{O}$ is penalized severely for false negatives. Both classifiers are trained on a corpus of 5128 hand-labeled tweets. They are then trained with 1.6 million tweets (health-related, though noisy) obtained from the work by Paul and Dredze. $C_{O}$ is further trained with a set of 200 million tweets. Thresholding is applied to reduce the noise in the cascade. The result is a final corpus of over 700 thousand “sick” messages and 3 million “other” tweets, which is used as the training set for the final classifier. The classifier's features are unigrams, bigrams, and trigrams.
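The two ingredients described above, n-gram features and asymmetric misclassification penalties, can be illustrated with a small sketch. This is not the authors' implementation: the whitespace tokenizer, the function names, and the cost values are assumptions made for illustration, and the loss shown is a generic class-weighted hinge loss rather than the exact objective used in the paper.

```python
from collections import Counter


def ngram_features(text, max_n=3):
    """Count unigrams, bigrams, and trigrams in a tweet.

    Assumption: lowercase + whitespace tokenization; the paper's
    actual preprocessing is not described in this summary.
    """
    tokens = text.lower().split()
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


def weighted_hinge(score, label, fp_cost, fn_cost):
    """Class-weighted hinge loss for one example.

    label: +1 for "sick", -1 for "other".
    score: signed output of a linear classifier.
    Margin violations on "other" tweets (potential false positives)
    are scaled by fp_cost; violations on "sick" tweets (potential
    false negatives) are scaled by fn_cost.
    """
    loss = max(0.0, 1.0 - label * score)
    return (fn_cost if label == 1 else fp_cost) * loss


feats = ngram_features("I have the flu")

# Hypothetical costs: a C_S-like classifier punishes false positives heavily,
# a C_O-like classifier punishes false negatives heavily.
cs_loss = weighted_hinge(score=0.5, label=-1, fp_cost=10.0, fn_cost=1.0)
co_loss = weighted_hinge(score=-0.5, label=1, fp_cost=1.0, fn_cost=10.0)
```

Choosing a large `fp_cost` pushes the decision boundary so that only confidently sick tweets are labelled "sick", while a large `fn_cost` does the opposite, which is why the two classifiers can disagree on borderline tweets.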

Modeling the spread of disease

This work primarily considers two factors: location proximity and social relationship. Two individuals are considered co-located if they visit the same 100 by 100 meter cell within a time window. Twitter has the concept of followers and followees; two individuals are considered "friends" if they follow each other.
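The two definitions above can be sketched in a few lines. This is a minimal illustration under stated assumptions: positions are taken to be already projected to meters on a flat grid, the length of the time window is left as a parameter because this summary does not state it, and the `Visit` type and function names are hypothetical.

```python
from dataclasses import dataclass

CELL_SIZE_M = 100.0  # cell edge length, from the 100 by 100 meter definition


@dataclass(frozen=True)
class Visit:
    user: str
    x_m: float  # assumption: easting in meters on a local flat grid
    y_m: float  # assumption: northing in meters on a local flat grid
    t_s: float  # timestamp in seconds


def cell_of(visit):
    """Map a position to the index of its 100 m x 100 m grid cell."""
    return (int(visit.x_m // CELL_SIZE_M), int(visit.y_m // CELL_SIZE_M))


def co_located(a, b, window_s):
    """True if two different users visit the same cell within window_s seconds."""
    return (
        a.user != b.user
        and cell_of(a) == cell_of(b)
        and abs(a.t_s - b.t_s) <= window_s
    )


def are_friends(follows, u, v):
    """True if u and v follow each other; follows maps user -> set of followees."""
    return v in follows.get(u, set()) and u in follows.get(v, set())


alice = Visit("alice", 120.0, 250.0, 0.0)
bob = Visit("bob", 180.0, 299.0, 1800.0)
follows = {"alice": {"bob", "carol"}, "bob": {"alice"}}

same_place = co_located(alice, bob, window_s=3600.0)
mutual = are_friends(follows, "alice", "bob")
one_way = are_friends(follows, "alice", "carol")
```

Here `alice` and `bob` fall in the same cell and tweet 30 minutes apart, so they count as co-located under a one-hour window, and they are friends because the follow relation is mutual; `alice` following `carol` without reciprocation does not count as friendship.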

Experiments and Results

The above figure shows the impact of co-location and friendship with infected people on a given day on one’s health the following day.
