Twitter Dataset For Sentiment

From Cohen Courses

The Twitter dataset consists of 475 million public tweets posted from May 2009 to January 2010. All non-English characters are removed, and URL links, hashtags, and user references are replaced by the meta-words URL, TAG, and REF, respectively. Content hashtags are treated as labels for the classification task. The sentiment labels are either hashtag-based or smiley-based.
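The meta-word replacement described above can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' actual preprocessing: the regular expressions and the function name `normalize_tweet` are assumptions, "references" is taken to mean @user mentions, and the removal of non-English characters is not covered.

```python
import re

# Assumed patterns; the dataset's real tokenization rules are not given here.
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # web links
REF_RE = re.compile(r"@\w+")                    # @user references
TAG_RE = re.compile(r"#\w+")                    # hashtags


def normalize_tweet(text):
    """Replace links, user references, and hashtags with meta-words."""
    # URLs go first so that a '#' or '@' inside a link is not touched later.
    text = URL_RE.sub("URL", text)
    text = REF_RE.sub("REF", text)
    text = TAG_RE.sub("TAG", text)
    return text


print(normalize_tweet("@bob see http://t.co/x #fun"))
```

Because hashtags also serve as labels, any label extraction would presumably have to run on the raw text before this replacement is applied.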

  • Hashtag-based labels - The most frequent hashtags over the entire dataset were collected, and two human judges assigned each one to one of five categories: 1. strong sentiment, 2. most likely sentiment, 3. context-dependent sentiment, 4. focused sentiment, and 5. no sentiment.

The following table shows the annotation result.

[Table image: Annotate.png]

  • Smiley-based labels - Amazon Mechanical Turk (AMT) is used to obtain a list of commonly used and unambiguous ASCII smileys.

Fifty hashtag-based labels from the strong sentiment and most likely sentiment categories, along with 15 smiley-based labels, are used as the labels for the classification task.
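As a rough sketch of how such labels might be read off a raw tweet, the snippet below matches hashtags and smileys against fixed label sets. The sets `HASHTAG_LABELS` and `SMILEY_LABELS` and the function `extract_labels` are hypothetical placeholders. The actual 50 hashtags and 15 smileys are not listed on this page, and case-insensitive hashtag matching is an assumption.

```python
import re

# Hypothetical stand-ins for the 50 hashtag labels and 15 smiley labels.
HASHTAG_LABELS = {"#happy", "#sad"}
SMILEY_LABELS = {":)", ":("}


def extract_labels(text):
    """Return the set of sentiment labels found in a raw (unnormalized) tweet."""
    # Hashtags are compared case-insensitively (an assumption).
    tags = {tag.lower() for tag in re.findall(r"#\w+", text)}
    labels = tags & HASHTAG_LABELS
    # Smileys are matched as whole whitespace-separated tokens.
    labels |= {tok for tok in text.split() if tok in SMILEY_LABELS}
    return labels


print(extract_labels("so tired today #Sad :("))
```

A tweet can carry more than one label under this scheme, so the function returns a set rather than a single value.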