Pak 2010 Twitter as a Corpus for Sentiment Analysis and Opinion Mining

Citation

Alexander Pak and Patrick Paroubek. 2010. Twitter as a Corpus for Sentiment Analysis and Opinion Mining. In Proceedings of LREC.

Online version

An online version of this paper is available at [1].

Summary

This paper presents an automatic approach to gathering Twitter data for sentiment analysis. It also includes a detailed study of the collected corpus, its basic statistics, and a proposed classification method for sentiment analysis of Twitter messages.

Key Contributions

The paper consolidates previous research on automatically collecting a Twitter corpus and presents a novel approach to this problem. The authors also describe their collected corpus and report its basic statistics. They claim that their classifier for sentiment analysis of Twitter messages outperforms previously used methods.

Corpus Collection

The authors collect Twitter messages for three classes: positive, negative, and objective (neutral). They used the Twitter API to collect messages that meet the following criteria for each class.

If the

Corpus Analysis

The authors present a set of four experiments and findings, ranging from tokenization on Twitter to topic phrase extraction and topic merging. They claim that traditional tokenization methods work poorly on Twitter messages, that traditional IR paradigms need some revision, and that near-duplicate messages need to be grouped together.

  • Score and Filter Topic Phrase Candidates

TweetMotif takes a simple language modeling approach to identifying the topic phrases that are most distinctive for a tweet result set. It scores each phrase by the likelihood ratio of the phrase appearing in the result set versus appearing in a background tweet language model. The intuition is close to that of TF-IDF in IR. Twitter has some distinctive properties, however: for example, the TF of a particular word is essentially the document (message) frequency of that word. The authors therefore suggest that IR paradigms will need some revision to be used in this new context.
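The likelihood-ratio scoring described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes whitespace tokenization, scores single words rather than multi-word phrases, and uses add-one smoothing on the background counts, none of which is specified in this summary. The function names `message_frequencies` and `likelihood_ratio_scores` are hypothetical.

```python
from collections import Counter


def message_frequencies(messages):
    """Count, for each token, the number of messages that contain it.

    Each token is counted at most once per message, reflecting the remark
    that term frequency on Twitter is essentially message frequency.
    """
    counts = Counter()
    for message in messages:
        counts.update(set(message.lower().split()))
    return counts


def likelihood_ratio_scores(result_msgs, background_msgs, smoothing=1.0):
    """Score tokens by P(token | result set) / P(token | background).

    `smoothing` is an assumed add-one style constant that keeps the
    background probability non-zero for tokens unseen in the background.
    """
    fg = message_frequencies(result_msgs)
    bg = message_frequencies(background_msgs)
    n_fg, n_bg = len(result_msgs), len(background_msgs)
    scores = {}
    for token, count in fg.items():
        p_fg = count / n_fg
        p_bg = (bg[token] + smoothing) / (n_bg + smoothing)
        scores[token] = p_fg / p_bg
    return scores


results = ["iphone launch today", "new iphone launch", "iphone rumors"]
background = ["good morning", "today is monday", "iphone is ok", "launch party today"]
scores = likelihood_ratio_scores(results, background)
# Tokens frequent in the result set but rare in the background score highest.
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy data, "iphone" ranks above "today" because "today" is also common in the background messages.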

Sentiment Classification and Results

  • Merge Similar Topics

Every candidate phrase defines a topic: the set of messages that contain that phrase. Many phrases, however, occur in roughly the same set of messages, so their topics are repetitive. The authors use two methods to merge similar topics. First, they merge overlapping (in fact, subsuming) topic phrases. Second, they consider the message sets directly and merge topics whose message sets have more than 90% Jaccard similarity.
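The second merging method can be sketched as follows. This is an illustrative sketch only: it covers the Jaccard-based merge and not the phrase-subsumption step, and the greedy one-pass strategy, the function names, and the representation of topics as a dict from phrase to a set of message ids are all assumptions rather than details from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_similar_topics(topics, threshold=0.9):
    """Greedily merge topics whose message sets exceed the Jaccard threshold.

    topics: dict mapping a phrase to the set of ids of messages containing it.
    Returns a list of (phrases, message_ids) pairs, one per merged topic.
    """
    merged = []
    for phrase, message_ids in topics.items():
        for phrases, ids in merged:
            if jaccard(ids, message_ids) > threshold:
                phrases.append(phrase)
                ids.update(message_ids)
                break
        else:
            # No existing topic was similar enough, so start a new one.
            merged.append(([phrase], set(message_ids)))
    return merged


topics = {
    "swine flu": set(range(20)),
    "flu": set(range(20)) | {20},
    "vaccine": {100, 101},
}
merged = merge_similar_topics(topics)
```

Here "swine flu" and "flu" share 20 of 21 messages (Jaccard of about 0.95), so they end up in one topic, while "vaccine" stays separate.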

  • Group Near-duplicate Messages

The authors note the massive amount of message duplication on Twitter, including forwarded messages, repetitive advertisements, spam, and news feeds. Their algorithm therefore groups messages by the Jaccard similarity of their trigram sets: two messages are grouped if their pairwise similarity exceeds 65%.
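A rough sketch of this grouping step is given below. It assumes word-level trigrams over whitespace tokens and treats grouping as transitive (connected components via a small union-find); the summary does not say how pairwise matches are combined into groups, so both choices, along with the function names, are assumptions.

```python
from itertools import combinations


def word_trigrams(message):
    """Return the set of word trigrams in a message (empty if under 3 tokens)."""
    tokens = message.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_near_duplicates(messages, threshold=0.65):
    """Group message indices whose trigram sets exceed the Jaccard threshold."""
    grams = [word_trigrams(m) for m in messages]
    parent = list(range(len(messages)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(messages)), 2):
        if jaccard(grams[i], grams[j]) > threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(messages)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


messages = [
    "RT @bob breaking news the bridge is closed today downtown",
    "breaking news the bridge is closed today downtown",
    "lovely weather in the park this morning",
]
groups = group_near_duplicates(messages)
```

The first two messages share 6 of 8 distinct trigrams (Jaccard 0.75), so they fall into one group. Note that the all-pairs comparison is quadratic in the number of messages, which is acceptable for a small result set but would need an index for larger collections.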

Discussion

This paper is highly related to our proposed course project on automatic Twitter message clustering based on hashtags.