Collier et al. Journal of Biomedical Semantics 2011

This is a paper discussed in Social Media Analysis 10-802 in Fall 2012.

Citation

N. Collier, N. T. Son, and N. M. Nguyen, "OMG U got flu? Analysis of shared health messages for bio-surveillance," J Biomed Semantics, vol. 2 Suppl 5, p. S9, 2011.

Electronic version

Media:Collier et al 2010.pdf

Summary

This paper focuses on tracking reports of self-protective behavior as the basis for further risk analysis, specifically considering the epidemic responses of users on Twitter. The authors tagged Twitter messages (tweets) that report self-protective behavior related to influenza-like illness, and then ran a supervised learning study using unigrams, bigrams and regular expressions as features with two supervised classifiers, Support Vector Machines (SVM) and Naïve Bayes (NB).

Background and Methods

Infectious disease outbreaks may be detectable earlier, at what is also referred to as the pre-diagnostic stage, by using social media such as Twitter, where users post their thoughts in real time in a concise and public way. The hypothesis is that social media posts and user queries are secondary indicators that should correlate with patient-reported symptoms.

In this study, the categories considered for direct reporting of influenza are:

Avoidance behavior
Increased sanitation
Seeking pharmaceutical intervention
Wearing a mask
Self-reported diagnosis

In order to handle the biased nature of self-protection messages, the authors used two stages of filters:

Use a bag of 7 keywords to select tweets on topics related to influenza (e.g. flu, H1N1, swine flu).
Use hand-built patterns to select 14,508 tweets, randomly choose 7,412 of these spread across the 5 categories listed previously, and finally remove duplicates, which left 5,283 messages.
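The two filtering stages and the duplicate removal can be sketched roughly as below. This is only an illustration, not the authors' implementation: the keyword tuple holds just the three keywords named in this summary (the other four are not listed here and are not guessed), `MASK_PATTERN` is a hypothetical pattern for the "Wearing a mask" category, and the random sampling step is left out.

```python
import re

# Only three of the seven keywords are named in this summary.
KEYWORDS = ("flu", "h1n1", "swine flu")

# Hypothetical hand-built pattern for the "Wearing a mask" category
# (an assumption for illustration, not taken from the paper).
MASK_PATTERN = re.compile(r"\b(wear(ing)?|wore|bought)\b.*\bmask\b", re.IGNORECASE)


def mentions_keyword(text):
    """Stage 1: keep tweets that contain at least one influenza keyword."""
    lowered = text.lower()
    return any(re.search(r"\b" + re.escape(k) + r"\b", lowered) for k in KEYWORDS)


def normalize(text):
    """Lowercase and collapse whitespace so near-identical copies compare equal."""
    return " ".join(text.lower().split())


def two_stage_filter(tweets):
    """Apply the keyword filter, then the pattern filter, then drop duplicates."""
    seen = set()
    kept = []
    for tweet in tweets:
        if not mentions_keyword(tweet):
            continue  # fails stage 1
        if not MASK_PATTERN.search(tweet):
            continue  # fails stage 2
        key = normalize(tweet)
        if key in seen:
            continue  # duplicate of a tweet already kept
        seen.add(key)
        kept.append(tweet)
    return kept


if __name__ == "__main__":
    sample = [
        "Wearing a mask on the bus because of swine flu",
        "wearing a mask on the bus because of  swine flu",
        "I love pizza",
        "Got the flu, staying home",
    ]
    print(two_stage_filter(sample))
```

In this toy run only the first tweet survives: the second is a duplicate after normalization, the third has no keyword, and the fourth matches a keyword but not the mask pattern.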

The Weka Toolkit was used to implement the Naïve Bayes and SVM classification models, which classify the messages in each of the 5 category datasets as positive or negative. The Simple Rule Language (SRL) toolkit was used to custom build regular expressions, which were hypothesized to outperform these two classification models.
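To make the feature setup concrete, here is a minimal sketch of unigram and bigram extraction feeding a multinomial Naïve Bayes classifier with add-one smoothing. It is written in plain Python as a stand-in for the Weka models used in the paper; the class name `NaiveBayes`, the helper `ngram_features`, whitespace tokenization, and the toy training messages are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_features(text, use_bigrams=True):
    """Return unigram features, plus bigram features joined by '_' if requested."""
    tokens = text.lower().split()
    feats = list(tokens)
    if use_bigrams:
        feats += [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = {c: Counter() for c in self.class_counts}
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(ngram_features(text))
        self.vocab = set()
        for counts in self.feature_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each known feature.
            score = math.log(self.class_counts[label] / total_docs)
            for feat in ngram_features(text):
                if feat in self.vocab:  # features unseen in training are skipped
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Toy messages invented for this example, not taken from the paper's data.
    texts = [
        "i am wearing a mask today",
        "bought a mask for the flu",
        "flu season is here",
        "swine flu news report",
    ]
    labels = ["positive", "positive", "negative", "negative"]
    model = NaiveBayes().fit(texts, labels)
    print(model.predict("wearing a mask"))
    print(model.predict("flu season news"))
```

One binary model of this kind would be trained per category, matching the positive/negative setup described above.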

Results and Discussion

The overall trend for Naïve Bayes was to have stronger recall than precision whereas for SVM, precision was generally higher than recall. The SRL custom built expressions performed better than NB or SVM when combined with unigrams, but did not perform as well when combined with both unigrams and bigrams.
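For reference, precision and recall for one binary category can be computed as in the short sketch below. The function name and the toy label vectors are made up for illustration; the example is chosen so that recall exceeds precision, the pattern reported above for Naïve Bayes.

```python
def precision_recall(gold, predicted):
    """Compute precision and recall for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    gold = [1, 1, 0, 0, 0]
    predicted = [1, 1, 1, 1, 0]  # over-predicts the positive class
    print(precision_recall(gold, predicted))  # (0.5, 1.0)
```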

A validation study for the classifiers was conducted using a corpus of Twitter data called the Edinburgh Corpus, which holds 97 million tweets for the period November 2009 to February 2010. The authors applied the same keyword filtering method and followed the study plan explained in the methods section above. The results were compared to laboratory results for weeks 47 through 5 of the 2009-2010 influenza season. The correlation between the counts of positive messages in each class and the laboratory data for H1N1 was measured using Spearman's rho. Strong correlations were found.
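Spearman's rho is the Pearson correlation computed on the ranks of the two series, which makes it sensitive to whether weekly message counts and laboratory counts rise and fall together rather than to their absolute sizes. A minimal sketch follows; the function names and the weekly numbers are invented for illustration and are not the paper's data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical weekly counts, invented for this example.
    positive_tweets = [12, 30, 25, 8, 40]
    lab_confirmed = [3, 9, 7, 2, 15]
    print(spearman_rho(positive_tweets, lab_confirmed))
```

Because both toy series rank their weeks in the same order, the printed value is 1 up to floating-point rounding.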

Related papers

Some related articles include the following: [1] [2] [3]

These use Twitter or other forms of social media for early detection of events such as earthquakes and the spread of epidemic disease, as well as other global events.
