Pak 2010 Twitter as a Corpus for Sentiment Analysis and Opinion Mining

From Cohen Courses
Revision as of 19:31, 31 March 2011

Citation

Alexander Pak and Patrick Paroubek. 2010. Twitter as a Corpus for Sentiment Analysis and Opinion Mining. In Proceedings of LREC.

Online version

An online version of this paper is available at [1].

Summary

This paper presents an automatic approach to gathering Twitter data for sentiment analysis. It also includes a detailed study of the collected corpus, its basic statistics, and a proposed classification method for sentiment analysis of Twitter messages.

Key Contributions

The paper consolidates several previous research efforts on automatically collecting a Twitter corpus and presents a novel approach to this problem. The authors also describe their collected corpus and report its basic statistics. They further claim that their classifier for sentiment analysis of Twitter messages outperforms previously used methods.

Corpus Collection

The authors collect Twitter messages for three classes: positive, negative and objective. They used the Twitter API to collect messages according to the following criteria for each class.

  • If the message contains happy emoticons (in this case, ":-)", ":)", ":D" etc.), it is considered a positive message
  • If the message contains sad emoticons (in this case, ":-(", ":(", ";(" etc.), it is considered a negative message
  • They queried the accounts of 44 newspapers to collect their tweets and considered these objective messages
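The emoticon rule above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `label_by_emoticon` is hypothetical, the emoticon lists contain only the examples quoted above (the full lists are not given here), and returning no label for messages with both kinds of emoticon is an assumption.

```python
from typing import Optional

# Only the examples quoted in the summary; the full lists ("etc.") are not known here.
HAPPY_EMOTICONS = (":-)", ":)", ":D")
SAD_EMOTICONS = (":-(", ":(", ";(")


def label_by_emoticon(message: str) -> Optional[str]:
    """Return 'positive', 'negative', or None if no label can be assigned."""
    has_happy = any(e in message for e in HAPPY_EMOTICONS)
    has_sad = any(e in message for e in SAD_EMOTICONS)
    if has_happy and not has_sad:
        return "positive"
    if has_sad and not has_happy:
        return "negative"
    # No emoticon at all, or both kinds present (skipping these is an assumption).
    return None
```

Objective messages are not covered by this function, since they are labeled by their source (newspaper accounts) rather than by their content.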

Corpus Analysis

The authors analyzed the word frequency distribution of the corpus and showed that it follows Zipf's law. They also performed POS tagging on all the Twitter messages and presented how the distribution of POS tags varies across the different classes.
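A Zipf-like distribution can be checked roughly by fitting a line to log frequency against log rank. The sketch below is a minimal illustration under stated assumptions: it uses plain whitespace tokenization, and the function names `rank_frequency` and `zipf_slope` are hypothetical. It is not the analysis procedure used in the paper.

```python
import math
from collections import Counter


def rank_frequency(messages):
    """Return (rank, frequency) pairs, most frequent word first."""
    counts = Counter(word for m in messages for word in m.lower().split())
    return [(rank, freq)
            for rank, (_, freq) in enumerate(counts.most_common(), start=1)]


def zipf_slope(rank_freq):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two ranks; a slope near -1 suggests Zipf-like behaviour.
    """
    xs = [math.log(rank) for rank, _ in rank_freq]
    ys = [math.log(freq) for _, freq in rank_freq]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

On a real corpus the fit is usually done on the full vocabulary, and the slope is only a rough indicator, since the head and tail of the distribution often deviate from a straight line.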

  • Positive vs. Negative Tags

The authors show that superlative adverbs (RBS), such as “most” and “best”, are an indicator of positive text. Positive texts are also characterized by the use of the possessive ending (POS). The negative set more often contains verbs in the past tense (VBN, VBD), because many authors express negative sentiment about a loss or disappointment.

  • Subjective vs. Objective Tags

The authors observe that objective texts tend to contain more common and proper nouns (NPS, NP, NNS), while authors of subjective texts more often use personal pronouns (PP, PP$).
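One simple way to compare tag usage between two sets of messages is a normalized difference of relative tag frequencies. The sketch below is an illustrative measure chosen for this summary and is not claimed to be the exact formula used in the paper. It assumes the messages have already been POS-tagged and flattened into non-empty lists of tags; the function name `tag_preference` is hypothetical.

```python
from collections import Counter


def tag_preference(tags_a, tags_b):
    """Score each tag in [-1, 1]: positive means more typical of set A.

    tags_a and tags_b are non-empty lists of POS tags, one entry per token.
    """
    count_a, count_b = Counter(tags_a), Counter(tags_b)
    scores = {}
    for tag in set(count_a) | set(count_b):
        # Relative frequencies, so sets of different sizes stay comparable.
        freq_a = count_a[tag] / len(tags_a)
        freq_b = count_b[tag] / len(tags_b)
        scores[tag] = (freq_a - freq_b) / (freq_a + freq_b)
    return scores
```

With the positive set as A and the negative set as B, tags such as RBS would be expected to score above zero and past-tense verb tags below zero, in line with the observations above.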

Sentiment Classification and Results

The authors present details on their sentiment classification experiments.

  • Merge Similar Topics

Every candidate phrase defines a topic, that is, the set of messages that contain that phrase. Many phrases, however, occur in roughly the same set of messages, so their topics are redundant. The authors use two methods to merge similar topics. First, they merge overlapping (in fact, subsuming) topic phrases. Second, they consider the message sets directly and merge topics whose message sets have a Jaccard similarity above 90%.
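The second merging step can be sketched as follows. This is a minimal illustration, assuming topics are given as a mapping from phrase to a set of message ids, and using a greedy single pass that compares each topic with the groups built so far. The merging order and the function names are assumptions, not details taken from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def merge_similar_topics(topics, threshold=0.9):
    """Greedily merge topics whose message sets are more than `threshold` similar.

    topics: dict mapping phrase -> set of message ids.
    Returns a list of (phrases, message_ids) pairs.
    """
    merged = []
    for phrase, ids in topics.items():
        for phrases, group_ids in merged:
            if jaccard(group_ids, ids) > threshold:
                phrases.append(phrase)
                group_ids.update(ids)  # grow the group's message set in place
                break
        else:
            merged.append(([phrase], set(ids)))
    return merged
```

For example, two phrases whose message sets share 19 of 20 messages have a Jaccard similarity of 0.95 and end up in the same group.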

  • Group Near-duplicate Messages

The authors note the massive amount of message duplication on Twitter, including forwarded messages, repetitive advertisements, spam, news feeds, etc. Their algorithm is therefore designed to group messages based on the Jaccard similarity of their tri-gram phrases: messages are grouped if their pairwise similarity exceeds 65%.
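A rough sketch of this grouping is given below. It assumes word-level tri-grams with whitespace tokenization and compares each message only with the first member of each existing group. Both choices are simplifications made for illustration, and the function names are hypothetical.

```python
def word_trigrams(message):
    """Set of word tri-grams of a message (empty if fewer than three words)."""
    tokens = message.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_near_duplicates(messages, threshold=0.65):
    """Return groups of message indices whose tri-gram sets are near-duplicates."""
    signatures = [word_trigrams(m) for m in messages]
    groups = []
    for i, signature in enumerate(signatures):
        for group in groups:
            # Compare with the group's first member only (a simplification).
            if jaccard(signatures[group[0]], signature) > threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups
```

A forwarded message that only adds a short prefix such as "RT" shares most of its tri-grams with the original and is therefore grouped with it.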

Discussion

This paper is highly related to our proposed course project on automatic Twitter message clustering based on hashtags.