Pak 2010 Twitter as a Corpus for Sentiment Analysis and Opinion Mining



Alexander Pak and Patrick Paroubek. 2010. Twitter as a Corpus for Sentiment Analysis and Opinion Mining. In Proceedings of LREC.

Online version

An online version of this paper is available at [1].


This paper presents an automatic approach to gathering Twitter data for sentiment analysis. It also includes a detailed study of the collected corpus, its basic statistics, and a proposed classification method for sentiment analysis of Twitter messages.

Key Contributions

The paper consolidates several earlier lines of research on automatically collecting a Twitter corpus and presents a novel approach to the problem. The authors also describe their collected corpus with basic statistics. They claim that their classifier for sentiment analysis of Twitter messages outperforms previously used methods.

Corpus Collection

The authors set out to collect Twitter messages for three classes: positive, negative, and objective. They used the Twitter API to collect messages according to the following criteria:

  • If the message contains happy emoticons (in this case, ":-)", ":)", ":D" etc.), it is considered a positive message
  • If the message contains sad emoticons (in this case, ":-(", ":(", ";(" etc.), it is considered a negative message
  • Tweets collected by querying the accounts of 44 newspapers are considered objective messages
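
The labeling rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the emoticon sets contain just the examples quoted above (the "etc." means the real lists are longer), the function names are hypothetical, and skipping tweets that contain both kinds of emoticon is an assumption, not something this summary attributes to the paper.

```python
from typing import Optional

# Only the emoticons quoted in the summary; the paper's lists are longer.
HAPPY = (":-)", ":)", ":D")
SAD = (":-(", ":(", ";(")


def label_by_emoticon(text: str) -> Optional[str]:
    """Return 'positive', 'negative', or None if the signal is absent or mixed."""
    has_happy = any(e in text for e in HAPPY)
    has_sad = any(e in text for e in SAD)
    if has_happy and not has_sad:
        return "positive"
    if has_sad and not has_happy:
        return "negative"
    # Assumption: tweets with no emoticon, or with both kinds, are skipped.
    return None


def label_tweet(text: str, from_newspaper: bool = False) -> Optional[str]:
    """Tweets from newspaper accounts are objective; others use emoticons."""
    if from_newspaper:
        return "objective"
    return label_by_emoticon(text)
```

For example, `label_tweet("Great day :)")` yields `"positive"`, while a tweet fetched from a newspaper account is labeled `"objective"` regardless of its text.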

Corpus Analysis

The authors analyzed the word frequency distribution and showed that it follows Zipf's law. They also performed POS tagging on all the Twitter messages and presented how the distribution of POS tags varies across the classes.
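
Zipf's law states that a word's frequency is roughly inversely proportional to its frequency rank, so a log-log plot of frequency against rank is close to a straight line with slope near -1. A minimal sketch of such a check follows; the function names are hypothetical and the least-squares fit is one simple way to test this, not necessarily the analysis the authors ran.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent word first."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two distinct ranks; a value near -1 is Zipf-like.
    """
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

On a real corpus, `loglog_slope(rank_frequency(all_tokens))` would only approximate -1, and the fit is usually poorer at the very top and the long tail of the ranking.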

  • Positive vs. Negative Tags

The authors show that superlative adverbs (RBS), such as “most” and “best”, are an indicator of positive text. Positive texts are also characterized by the use of the possessive ending (POS). The negative set more often contains verbs in the past tense (VBN, VBD), because many authors express negative sentiment about a loss or disappointment.

  • Subjective vs. Objective Tags

The authors observe that objective texts tend to contain more common and proper nouns (NPS, NP, NNS), while authors of subjective texts more often use personal pronouns (PP, PP$).
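
One simple way to compare tag usage between two sets is a normalized difference of tag counts, which ranges from -1 (tag appears only in the second set) to +1 (only in the first). The sketch below assumes the messages have already been POS tagged and that the two sets are of comparable size. The function name is hypothetical, and the measure is an illustration rather than a statement of the authors' exact statistic.

```python
from collections import Counter


def tag_preference(tags_a, tags_b):
    """Map each tag to (N_a - N_b) / (N_a + N_b) over two lists of POS tags."""
    counts_a, counts_b = Counter(tags_a), Counter(tags_b)
    scores = {}
    for tag in set(counts_a) | set(counts_b):
        n_a, n_b = counts_a[tag], counts_b[tag]
        # Every tag here occurs at least once, so the denominator is nonzero.
        scores[tag] = (n_a - n_b) / (n_a + n_b)
    return scores
```

With positive-set tags as the first argument and negative-set tags as the second, a tag such as RBS would be expected to score above zero and VBD below zero, given the observations above.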

Sentiment Classification and Results

The authors present details of their sentiment classification experiments, including feature extraction, classifier construction, and experimental results.

For feature extraction, the authors present a four-step approach: (1) filtering out URL links, Twitter user names, and similar non-informative tokens; (2) tokenizing the text by splitting on punctuation marks and spaces; (3) removing stopwords (articles); (4) constructing n-grams.
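
The four steps can be sketched as a small pipeline. The regular expressions, the lowercasing, the handling of apostrophes, and the name `extract_ngrams` are assumptions made for illustration; the summary does not specify these details.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
# Assumption: a token is a run of letters/digits, optionally with one
# apostrophe suffix, so that short forms like "don't" stay in one piece.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")
ARTICLES = {"a", "an", "the"}


def extract_ngrams(text, n=2):
    """Filter, tokenize, drop articles, and return n-grams as tuples."""
    text = URL_RE.sub(" ", text)    # (1) filter URL links
    text = USER_RE.sub(" ", text)   # (1) filter Twitter user names
    tokens = TOKEN_RE.findall(text.lower())  # (2) split on spaces/punctuation
    tokens = [t for t in tokens if t not in ARTICLES]  # (3) remove articles
    # (4) slide a window of size n over the remaining tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Because the tokenizer keeps only alphanumeric runs, emoticons are dropped as a side effect, which keeps the labels used for corpus collection out of the features.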

For the classifier, the authors report trying Naive Bayes, SVM, and CRF. The Naive Bayes classifier performed best and was therefore selected.

In the final results, the authors compare several system configurations and conclude that the Naive Bayes classifier with bigram features works best, because bigrams offer a good balance between coverage and the ability to capture sentiment patterns.
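
To make the setup concrete, here is a minimal sketch of a multinomial Naive Bayes classifier over bigram features. It is a generic textbook version with add-one smoothing, not the authors' exact model; the class name, the `bigrams` helper, and the whitespace tokenization are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()                 # documents per class
        self.feature_counts = defaultdict(Counter)  # feature counts per class
        self.vocab = set()

    def fit(self, samples):
        """samples: iterable of (feature_list, label) pairs."""
        for features, label in samples:
            self.class_docs[label] += 1
            self.feature_counts[label].update(features)
            self.vocab.update(features)
        return self

    def predict(self, features):
        """Return the label with the highest log posterior."""
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.feature_counts[label]
            total = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for f in features:
                # Smoothed log likelihood of each observed feature
                score += math.log((counts[f] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def bigrams(text):
    """Whitespace-tokenize and pair adjacent tokens."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))
```

A toy usage example: after fitting on a handful of labeled phrases, `model.predict(bigrams("really love this"))` returns the class whose training bigrams best match the input. Unseen bigrams contribute only the smoothing term, which illustrates the coverage side of the trade-off described above.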


This paper is highly related to our proposed course project on automatic Twitter message clustering based on hashtags.