A Sentiment Detection Engine for Internet Stock Message Boards
This is a paper summary for 10-802 Analysis of Social Media, Fall 2012.
Citation
Christopher Chua, Maria Milosavljevic, and James R. Curran. 2009. A sentiment detection engine for internet stock message boards. In Proceedings of the Australasian Language Technology Association Workshop 2009.
Online Version
Summary
This paper addresses the problem of sentiment analysis by presenting a solution for classifying investor sentiment on internet stock message boards, building on prior work that deals with messy, sparse data sets. The authors use variations of the Naive Bayes classifier with feature selection, specifically implementing Naive Bayes and Support Vector Machine (SVM) classifiers.
Background
Sentiment prediction has been applied to discussion boards and forums in many areas, such as medicine, social media, and gaming. This study focuses on monitoring financial information, with the eventual goal of using real-time posts to explain future price movements in the stock market. Classifying investor sentiment from forum messages is a challenging task in the text classification domain, since posts vary widely in quality and descriptive content. Like past studies, the focus is on capturing the emotional aspect of the posts rather than their factual content. Das and Chen (2007) used a US-based Yahoo! Finance corpus, and Antweiler and Frank (2004) used Raging Bull, also US-based. Both groups found that SVMs did not improve classifier performance and only added complexity and computation time.
Dataset
The data set used for the study comes from HotCopper, the most popular investment forum for the Australian market. Posts carry author self-reported sentiment labels, which makes the corpus well suited to sentiment analysis. Authors can attach a sentiment label to their posts: the "Buy", "Hold", and "Sell" tags are analogous to positive, neutral, and negative sentiment, respectively.
The study used the January-June 2004 HotCopper ASX stock discussions. There were 8,307 labeled posts across 469 stocks, with an average of 28 words per post and a total of 23,670 distinct words in the data set. As on most discussion boards and forums, messages are organized by thread, and a single thread consists of multiple posts on the same topic for a stock. Discussion on HotCopper mainly surrounds speculative stocks, particularly those in minerals exploration and energy.
Methods
The first step is to pre-process the posts by removing stop words from the training set, removing alphanumeric strings of no informative value such as "ahhhhhh", and finally introducing a thread volatility measure that assigns an index value representing the level of disagreement between subsequent replies in a thread. Volatility is measured as the average of the differences between the discrete values of the sentiment classes of consecutive posts (buy=1, hold=2, and sell=3), so transitions between buy and sell have a higher volatility than transitions between buy and hold. Posts within a thread with lower volatility are therefore considered a superior sample, and threads with volatility below 0.5 were chosen for the training data set, as sketched below.
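A minimal Python sketch of this filter, assuming the average is taken over consecutive-post label differences within a thread (the paper's exact formulation is not reproduced here, and the example threads are invented):

```python
# Sketch of the thread-volatility filter described above (assumed
# implementation; the paper only gives the label encoding and threshold).
def thread_volatility(labels):
    """Average absolute difference between consecutive post labels
    in a thread, with buy=1, hold=2, sell=3."""
    codes = {"buy": 1, "hold": 2, "sell": 3}
    values = [codes[label] for label in labels]
    if len(values) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)

# Keep only low-disagreement threads for training (threshold from the paper).
threads = {
    "t1": ["buy", "buy", "buy"],    # volatility 0.0 -> kept
    "t2": ["buy", "sell", "buy"],   # volatility 2.0 -> dropped
}
training = {t: posts for t, posts in threads.items()
            if thread_volatility(posts) < 0.5}
```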
The automated sentiment detection engine was implemented using two classifiers: Bernoulli Naive Bayes and Complement Naive Bayes, both with feature selection based on the information gain approach. Bernoulli NB replaces feature frequency counts with Boolean presence/absence values, and both models make the usual Naive Bayes assumption that the features in a post occur independently of one another. The authors also applied a Term Frequency-Inverse Document Frequency (TF-IDF) transformation, which gives more weight to distinctive words that help classify a post than to common words, along with a smoothing function to account for zero probabilities.
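This setup can be illustrated with scikit-learn; the following is a hypothetical sketch of the two model configurations, not the authors' implementation, and the toy posts and labels are invented:

```python
# Minimal sketch of the two classifiers; preprocessing details in the
# paper (stop-word lists, tokenization) may differ.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, ComplementNB
from sklearn.pipeline import make_pipeline

posts = ["going long on this one", "time to exit", "holding for the report"]
labels = ["buy", "sell", "hold"]  # toy data, not from the corpus

# Bernoulli NB: Boolean presence/absence features instead of counts.
bernoulli = make_pipeline(
    CountVectorizer(binary=True, stop_words="english"),
    BernoulliNB(alpha=1.0),  # alpha is the smoothing term for zero counts
)

# Complement NB on TF-IDF weights, which down-weight common words.
complement = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    ComplementNB(alpha=1.0),
)

for model in (bernoulli, complement):
    model.fit(posts, labels)
```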
For feature selection, the first step was to rank the words in order of frequency, and the second was to apply the information gain algorithm to reduce the number of features to a manageable subset, as sketched below.
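A sketch of this two-step selection, using mutual information as the information-gain criterion (the `posts` and `labels` toy data carry over from the sketch above; the feature counts are illustrative):

```python
# Frequency ranking followed by information-gain selection (assumed
# pipeline; the paper does not publish its code).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

posts = ["going long on this one", "time to exit", "holding for the report"]
labels = ["buy", "sell", "hold"]

# Step 1: rank words by frequency (max_features keeps the most frequent).
vectorizer = CountVectorizer(max_features=20000)
X = vectorizer.fit_transform(posts)

# Step 2: reduce to a manageable subset by information gain.
selector = SelectKBest(mutual_info_classif, k=2)  # paper's best run used 7,200
X_selected = selector.fit_transform(X, labels)
```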
Discussion and Conclusion
To evaluate the performance of the classifiers, the authors compared it to human agreement, using Amazon's Mechanical Turk (MTurk) to obtain classifications from 3 paid annotators who passed a qualification test (the test is not described in any detail). The annotators' accuracy was 57% and the Kappa statistic was 0.5, which demonstrates how challenging this classification task is.
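The paper does not specify which kappa variant was computed; as an illustration, mean pairwise Cohen's kappa over three hypothetical annotators can be computed like this:

```python
# Sketch of measuring inter-annotator agreement; the annotations are
# invented, and the paper may have used a different kappa variant.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

annotations = {
    "a1": ["buy", "hold", "sell", "buy"],
    "a2": ["buy", "sell", "sell", "hold"],
    "a3": ["hold", "hold", "sell", "buy"],
}
scores = [cohen_kappa_score(annotations[x], annotations[y])
          for x, y in combinations(annotations, 2)]
print(sum(scores) / len(scores))  # mean pairwise kappa
```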
Each model was evaluated using 10-fold cross validation, and the main findings are as follows. The Complement Naive Bayes achieved an accuracy of 78.72% and the Bernoulli Naive Bayes an accuracy of 78.45%; both outperformed the human annotators' 57% accuracy and the baseline of 65.63%. Both Naive Bayes variants performed best with 7,200 features.
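A sketch of this cross-validated evaluation, reusing the hypothetical `complement` pipeline and `posts`/`labels` names from the Methods sketches (in practice the data would be the filtered HotCopper training set, which provides enough examples per class for 10 folds):

```python
# 10-fold cross-validated accuracy for one of the sketched pipelines.
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracies = cross_val_score(complement, posts, labels, cv=cv,
                             scoring="accuracy")
print(accuracies.mean())
```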
In conclusion, the authors achieved a classification F-score of 77.50% using the methods outlined above. There remains considerable room for optimizing model performance, which should be pursued in future studies.