Collier et al. Journal of Biomedical Semantics 2011
This is a paper discussed in Social Media Analysis 10-802 in Fall 2012.
Citation
N. Collier, N. T. Son, and N. M. Nguyen, "OMG U got flu? Analysis of shared health messages for bio-surveillance," J Biomed Semantics, vol. 2 Suppl 5, p. S9, 2011.
Electronic version
Summary
This paper focuses on tracking reports of self-protective behavior as a basis for further risk analysis, specifically the responses of Twitter users to an epidemic. The authors tagged Twitter messages (tweets) that report self-protective behavior related to influenza-like illness, then trained two supervised classifiers, Support Vector Machines (SVM) and Naïve Bayes, using unigrams, bigrams and regular expressions as features.
Background and Methods
Infectious disease outbreaks might be detected earlier, at the pre-diagnostic stage, using social media such as Twitter, where users post their thoughts publicly, concisely and in real time. The hypothesis is that social media posts and user queries are secondary indicators that should correlate with patient-reported symptoms.
In this study, the categories considered for direct reporting of influenza are:
- Avoidance behavior
- Increased sanitation
- Seeking pharmaceutical intervention
- Wearing a mask
- Self-reported diagnosis
In order to handle the biased nature of self-protection messages, the authors used two stages of filters:
- Use a bag of 7 keywords to select tweets on topics related to influenza (e.g. flu, H1N1, swine flu).
- Use hand-built patterns to select 14,508 tweets; from these, randomly choose 7,412 tweets spread across the 5 categories listed above; finally, remove duplicates, leaving 5,283 messages.
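The keyword stage and the final de-duplication step can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the keyword list contains only the three terms named above (the remaining keywords are not given here), the function names are hypothetical, and the hand-built patterns and random sampling are not reproduced.

```python
import re

# Only the three keywords named in the summary; the full list of 7 is not given.
KEYWORDS = ["flu", "h1n1", "swine flu"]

# Word boundaries keep "flu" from matching inside "fluent" or "influence".
KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def normalize(text):
    """Lowercase and collapse whitespace so near-identical copies compare equal."""
    return " ".join(text.lower().split())


def filter_and_dedupe(tweets):
    """Keep tweets that mention a keyword, dropping repeated messages."""
    seen = set()
    kept = []
    for tweet in tweets:
        if not KEYWORD_RE.search(tweet):
            continue  # stage 1: no influenza keyword present
        key = normalize(tweet)
        if key in seen:
            continue  # duplicate of a message already kept
        seen.add(key)
        kept.append(tweet)
    return kept
```

How duplicates were defined in the study is not stated in this summary; comparing lowercased, whitespace-normalized text is an assumption made for the sketch.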
The Weka toolkit was used to build the Naïve Bayes and SVM models, which classify the messages in each of the 5 category datasets as positive or negative. The Simple Rule Language (SRL) toolkit was used to custom-build regular expressions, which were hypothesized to outperform the two classification models above.
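To make the three feature types concrete, the sketch below turns a tweet into unigram, bigram and regular-expression features. It is written in plain Python for illustration only; the study itself used Weka and SRL, and the pattern `MASK_RE`, the tokenizer and the feature names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical pattern for the "wearing a mask" category; not taken from the paper.
MASK_RE = re.compile(r"\b(?:wear(?:ing)?|wore|bought)\b.*\bmasks?\b", re.IGNORECASE)


def tokenize(text):
    """Very simple tokenizer: lowercase runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def extract_features(text):
    """Return a bag of unigram, bigram and regex-match features for one tweet."""
    tokens = tokenize(text)
    feats = Counter()
    for tok in tokens:
        feats["uni=" + tok] += 1
    for first, second in zip(tokens, tokens[1:]):
        feats["bi=" + first + "_" + second] += 1
    if MASK_RE.search(text):
        feats["re=mask"] = 1  # binary flag: the hand-built pattern fired
    return feats
```

A feature dictionary like this could then be passed to any bag-of-features classifier, with one binary (positive or negative) model per category.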
Results and Discussion
Overall, Naïve Bayes tended to have higher recall than precision, whereas SVM generally had higher precision than recall. The SRL custom-built expressions performed better than NB or SVM when combined with unigrams, but did not perform as well when combined with both unigrams and bigrams.
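As a reminder of how the two measures differ, the following sketch computes precision and recall for binary labels. The function name is hypothetical and the labels in the usage example are made up for illustration; they are not results from the paper.

```python
def precision_recall(gold, predicted):
    """Compute (precision, recall) for binary labels given as 0/1 sequences."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)      # correctly flagged positives
    fp = sum(1 for g, p in pairs if not g and p)  # negatives flagged as positive
    fn = sum(1 for g, p in pairs if g and not p)  # positives that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: a liberal classifier (recall above precision) and a
# conservative one (precision above recall).
gold = [1, 1, 1, 0, 0, 0]
liberal = [1, 1, 1, 1, 1, 0]
conservative = [1, 0, 0, 0, 0, 0]
print(precision_recall(gold, liberal))
print(precision_recall(gold, conservative))
```

A classifier that labels many messages positive finds most true positives but also raises false alarms, which is the pattern described for Naïve Bayes; a more conservative classifier shows the opposite trade-off, as described for SVM.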
A validation study for the classifiers was conducted using a corpus of Twitter data called the Edinburgh Corpus, which holds 97 million tweets for the period November 2009 to February 2010. The authors applied the same keyword filtering method and followed the study plan explained in the methods section above. The results were compared to laboratory results for weeks 47-5 of the 2009-2010 influenza season. The correlation between the counts of positive messages in each class and the laboratory data for H1N1 was measured using Spearman's rho. Strong correlations were found.
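Spearman's rho is the Pearson correlation computed on ranks rather than raw values, so it measures whether two weekly series rise and fall together without assuming a linear relationship. A minimal pure-Python sketch, with hypothetical function names and with tied values given their average rank:

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rank correlation between two equal-length sequences."""
    return pearson(average_ranks(x), average_ranks(y))
```

In the setting described above, `x` would hold the weekly counts of positive messages for one category and `y` the laboratory counts for the same weeks.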