Difference between revisions of "Modeling Spread of Disease from Social Interaction"

From Cohen Courses

Revision as of 23:54, 5 November 2012

Citation

Adam Sadilek, Henry Kautz, Vincent Silenzio "Modeling Spread of Disease from Social Interaction" Sixth AAAI International Conference on Weblogs and Social Media (ICWSM)

Online Version

Online Pdf

Summary

This paper models the spread of communicable diseases by analyzing social media.

A motivating question from the paper: given that five of your friends have flu-like symptoms, and that you have recently met eight people, possibly strangers, who complained about having runny noses and headaches, what is the probability that you will soon become ill as well?

Traditionally, public health is monitored through surveys and statistics obtained from health care centres. This process is expensive and slow, and it is also biased, since many people do not bother to seek treatment for the flu. This work is more fine-grained: it considers the interactions between individuals as revealed by their tweets. One of the main challenges was to correctly identify the very small number of tweets that are related to sickness. The authors develop an SVM classifier for this task that performs very well (0.98 precision and 0.97 recall).

Data

The analysis is based on tweets collected over one month from the NYC metropolitan area. The specifics of the data are shown in the following figure:

54.jpg

Methodologies and models

Detecting illness related tweets

The major challenge of this work was to detect a person's illness-related tweets, since for every health-related tweet there were more than 1000 unrelated ones. Given this class imbalance, the work formulates a semi-supervised, cascade-based approach to learning a robust Support Vector Machine (SVM) classifier.
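To make the class imbalance concrete, here is a minimal sketch of why plain accuracy is a poor measure at a ratio of roughly 1 to 1000. The tweet counts and the "always other" baseline are illustrative assumptions, not figures from the paper.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Hypothetical sample: one "sick" tweet for every 1000 "other" tweets.
n_sick, n_other = 1_000, 1_000_000

# A trivial baseline that labels every tweet "other" never finds a sick tweet,
# yet it is still right on all of the "other" tweets.
precision, recall, accuracy = precision_recall_accuracy(
    tp=0, fp=0, fn=n_sick, tn=n_other
)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.4f}")
```

The baseline is almost always "right" while detecting nothing, which is why precision and recall on the "sick" class are the quantities worth tracking here.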

To extract these tweets, they first obtain high-quality training data for their final classifier. The following figure shows their methodology:

55.jpg

Briefly, they first train two different binary SVM classifiers, C_S and C_O. The classifier C_S is penalized severely for false positives (normal tweets that are labelled as sick), and C_O is penalized severely for false negatives. Both classifiers are first trained on a corpus of 5128 hand-labeled tweets. After this, they are trained with 1.6 million tweets (health-related, though noisy) obtained from the work of Paul and Dredze (http://www.cs.jhu.edu/~mdredze/publications/2011.tech.twitter_health.pdf). C_O was further trained with a set of 200 million tweets. Thresholding was applied to reduce the noise in the cascade. The result is a final corpus of over 700 thousand “sick” messages and 3 million “other” tweets, which is used as the training set for the final classifier. The features for the classifier are unigrams, bigrams, and trigrams.
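The two ingredients described above, n-gram features and asymmetric misclassification penalties, can be sketched as follows. This is only an illustration under stated assumptions: the whitespace tokenizer, the function names, and the cost values are hypothetical, and the paper's actual SVM training setup is not reproduced here.

```python
from collections import Counter

def ngram_features(text, max_n=3):
    """Count unigrams, bigrams and trigrams in a tweet.

    Tokenization by lowercasing and whitespace splitting is an assumption
    made for this sketch.
    """
    tokens = text.lower().split()
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats

def asymmetric_hinge(label, score, cost_fp, cost_fn):
    """Cost-weighted hinge loss for one example.

    label is +1 ("sick") or -1 ("other"); score is the classifier output.
    Margin violations on "other" examples (potential false positives) are
    scaled by cost_fp, and those on "sick" examples by cost_fn.
    """
    violation = max(0.0, 1.0 - label * score)
    return (cost_fn if label > 0 else cost_fp) * violation

feats = ngram_features("I feel sick")

# An "other" tweet that the classifier scores as mildly "sick" (score 0.5).
# Cost values are made up: a C_S-like setting makes this mistake expensive,
# a C_O-like setting makes it cheap.
loss_cs_like = asymmetric_hinge(-1, 0.5, cost_fp=10.0, cost_fn=1.0)
loss_co_like = asymmetric_hinge(-1, 0.5, cost_fp=1.0, cost_fn=10.0)
print(sorted(feats), loss_cs_like, loss_co_like)
```

With a large false-positive cost, a classifier trained on this loss is pushed toward labelling a tweet "sick" only when it is confident, which matches the role the text assigns to C_S; swapping the costs gives the C_O behaviour.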

Modeling the spread of disease