A Sentiment Detection Engine for Internet Stock Message Boards


This is a Paper summary for 10-802 Analysis of Social Media during Fall 2012.

Citation

Christopher Chua, Maria Milosavljevic, and James R. Curran. 2009. A sentiment detection engine for internet stock message boards. In Proceedings of the Australasian Language Technology Association Workshop 2009.

Online Version

Article published online here

Summary

This article presents a method for classifying investor sentiment on internet stock message boards. It builds on prior work that deals with messy, sparse data sets. The authors use variations of the Bayes classifier with feature selection; the classifiers they implement are Naive Bayes and Support Vector Machines (SVM).

Background

Sentiment prediction has been applied to discussion boards and forums in areas such as medicine, social media, and gaming. This study focuses on monitoring financial information, with the eventual goal of using real-time posts to explain future price movements in the stock market. Classifying investor sentiment from forum messages is a challenging text classification task because the posts vary in quality and in descriptive content. As in the earlier study by Das and Chen (2007), the focus is on capturing the emotional aspect of the posts rather than their factual content. Das and Chen used a US-based Yahoo! Finance corpus, and Antweiler and Frank (2004) used Raging Bull, which is also based in the US. Both groups found that implementing SVM did not improve classifier performance and only added complexity and computation time.

Dataset

The data set used for the study comes from HotCopper, the most popular investment forum for the Australian market. Authors can attach a self-reported sentiment label to their posts, which makes the corpus well suited to sentiment analysis. The "Buy", "Hold" and "Sell" tags are analogous to positive, neutral and negative sentiment respectively.

The study used the January-June 2004 HotCopper ASX stock-based discussions. There were 8,307 labeled posts across 469 stocks, with an average of 28 words per post and a total of 23,670 distinct words in the data set. As on most discussion boards and forums, messages are organized by thread, and a single thread consists of multiple posts on the same topic for a stock. Discussions on HotCopper mainly concern speculative stocks, particularly those in minerals exploration and energy.

Methods

The first step is to pre-process the posts by removing stop words from the training set and removing non-informative alphanumeric strings such as "ahhhhhh". The authors then introduce a thread volatility measure, which assigns each thread an index value representing the level of disagreement between subsequent replies. The volatility is measured as the average of the differences between the discrete values of the sentiment classes (buy=1, hold=2, and sell=3). Thus, transitions between buy and sell have a higher volatility than transitions between buy and hold. Posts within a thread with lower volatility are considered a superior sample, so the samples with lower volatility (less than 0.5) were chosen for the training data set.
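The volatility measure can be sketched as follows. The summary does not give the exact formula, so this sketch assumes it is the mean absolute difference between the class values of consecutive posts in a thread. The names `thread_volatility` and `select_low_volatility` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Sequence

# Discrete class values as given in the summary.
SENTIMENT_VALUE = {"buy": 1, "hold": 2, "sell": 3}


def thread_volatility(labels: Sequence[str]) -> float:
    """Mean absolute difference between consecutive sentiment values.

    Assumption: "average sum of the differences" means averaging over
    consecutive pairs of posts in thread order.
    """
    values = [SENTIMENT_VALUE[label.lower()] for label in labels]
    if len(values) < 2:
        return 0.0  # a single post cannot disagree with anything
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)


def select_low_volatility(
    threads: Dict[str, List[str]], threshold: float = 0.5
) -> Dict[str, List[str]]:
    """Keep only threads whose volatility is strictly below the threshold."""
    return {
        thread_id: labels
        for thread_id, labels in threads.items()
        if thread_volatility(labels) < threshold
    }
```

Under this assumption, a thread labeled Buy, Buy, Hold has volatility 0.5 and is excluded by the "less than 0.5" cutoff, while a thread in which every post is labeled Buy has volatility 0 and is kept.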

The automated sentiment detection engine was implemented using two Naive Bayes variants: Bernoulli Naive Bayes and Complement Naive Bayes. Both models were used with feature selection based on information gain. Bernoulli NB replaces feature frequency counts with Boolean values.
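As an illustration of the two ideas in this paragraph, the sketch below turns posts into Boolean (present or absent) word features and ranks words by information gain. It is a minimal standard-library sketch, not the authors' implementation: whitespace tokenization, the function names, and the choice to keep the top `k` words are all assumptions made here.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence, Set


def to_boolean_features(text: str) -> Set[str]:
    """Bernoulli-style view of a post: a word is present or absent, counts are dropped."""
    return set(text.lower().split())


def entropy(counts: Iterable[int]) -> float:
    """Shannon entropy in bits of a distribution given as class counts."""
    positive = [c for c in counts if c > 0]
    total = sum(positive)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in positive)


def information_gain(docs: Sequence[Set[str]], labels: Sequence[str], term: str) -> float:
    """H(class) minus H(class | term present or absent)."""
    present = Counter(l for d, l in zip(docs, labels) if term in d)
    absent = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    n_present = sum(present.values())
    h_class = entropy(Counter(labels).values())
    h_conditional = (n_present / n) * entropy(present.values()) + (
        (n - n_present) / n
    ) * entropy(absent.values())
    return h_class - h_conditional


def select_features(docs: Sequence[Set[str]], labels: Sequence[str], k: int) -> List[str]:
    """Return the k words with the highest information gain (ties broken alphabetically)."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]
```

For example, with the posts "buy buy now", "buy strong", "sell now" and "sell weak" labeled Buy, Buy, Sell, Sell, the repeated "buy" in the first post collapses to a single Boolean feature, and the words "buy" and "sell" receive the highest information gain because each one separates the two classes completely.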

Discussion and Conclusion

Complement Naive Bayes achieved an accuracy of 78.72% and Bernoulli Naive Bayes an accuracy of 78.45%. Both outperformed the human annotator's accuracy of 57% and the baseline of 65.63%.

Study Plan