Comparison: Widespread Worry and the Stock Market versus Sentiment Detection Engine for Internet Stock Message Boards
Papers
- Christopher Chua, Maria Milosavljevic, and James R. Curran. 2009. A Sentiment Detection Engine for Internet Stock Message Boards. In Proceedings of the Australasian Language Technology Association Workshop 2009.
- Eric Gilbert and Karrie Karahalios. 2010. Widespread Worry and the Stock Market. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM '10).
Method
- Gilbert et al. used supervised learning techniques to classify posts in a binary fashion into anxious and non-anxious sets. The first classifier was a boosted decision tree that used the 100 most informative word stems as features, ranked by information gain. The second was a bagged Complement Naive Bayes classifier, which compensated for the small vocabulary available to the boosted decision tree. These two classifiers, trained on the LiveJournal posts, were then used to create an Anxiety Index. Once this Anxiety Index was created, the authors used Granger causality to determine the correlation (if any) between the Anxiety Index and the S&P 500 series. In short, Granger causality is a statistical hypothesis test for determining whether one time series is useful for forecasting another (a sketch of this step appears after this list). The final step was to use a Monte Carlo simulation to confirm the findings of the Granger causality test.
- Chua et al. first pre-processed the posts by removing stop words from the training set, removing non-informative alphanumeric strings such as "ahhhhhh", and finally introducing a thread volatility measure that assigns an index value representing the level of disagreement between subsequent replies in a thread. The automated sentiment detection engine was implemented using two classifiers, Bernoulli Naive Bayes and Complement Naive Bayes, both used with feature selection based on information gain. Bernoulli NB replaces feature frequency counts with Boolean presence/absence values, and, as in any Naive Bayes model, individual features in a post were assumed to be independent of one another. The authors also implemented the Term Frequency-Inverse Document Frequency (TF-IDF) transformation, which gives more weight to rare, discriminative words than to common words such as "a" or "the" when classifying a post. A smoothing function was also implemented to account for zero probabilities (the second sketch below illustrates this kind of pipeline).
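The Granger-causality step from Gilbert et al. can be sketched in a few lines. The following is a minimal illustration, not the authors' code: the random stand-in data, the variable names, and the choice of maxlag=3 are assumptions made for the example, and statsmodels is used in place of whatever tooling the authors actually employed.

    # Minimal sketch of a Granger-causality test between an anxiety index
    # and market returns. The data here are random stand-ins; real use
    # would supply the aligned daily Anxiety Index and S&P 500 returns.
    import numpy as np
    from statsmodels.tsa.stattools import grangercausalitytests

    rng = np.random.default_rng(0)
    anxiety_index = rng.normal(size=250)   # hypothetical daily Anxiety Index
    sp500_returns = rng.normal(size=250)   # hypothetical daily S&P 500 returns

    # grangercausalitytests expects a two-column array and tests whether
    # the SECOND column helps forecast the FIRST, at each lag up to maxlag.
    data = np.column_stack([sp500_returns, anxiety_index])
    results = grangercausalitytests(data, maxlag=3)
    # Each lag's result includes an F-test; a small p-value suggests the
    # anxiety series carries predictive information about returns. A Monte
    # Carlo check could repeat the test on shuffled copies of the anxiety
    # series to see how often a comparable statistic arises by chance.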
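Likewise, here is a minimal sketch of a Chua-style sentiment pipeline, assuming scikit-learn. The toy posts, labels, and the alpha smoothing value are illustrative rather than taken from the paper, and the authors' feature selection by information gain is omitted for brevity.

    # Two Naive Bayes variants for buy/hold/sell post classification.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.naive_bayes import BernoulliNB, ComplementNB
    from sklearn.pipeline import make_pipeline

    posts = ["great earnings, buying more", "dumping this stock now", "holding for now"]
    labels = ["buy", "sell", "hold"]

    # Bernoulli NB uses Boolean presence/absence features (binary=True);
    # alpha > 0 smooths away zero probabilities for unseen words.
    bernoulli = make_pipeline(
        CountVectorizer(stop_words="english", binary=True),
        BernoulliNB(alpha=1.0),
    )

    # Complement NB pairs naturally with TF-IDF weights, which downweight
    # common words relative to rare, more discriminative ones.
    complement = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        ComplementNB(alpha=1.0),
    )

    for model in (bernoulli, complement):
        model.fit(posts, labels)
        print(model.predict(["thinking of selling everything"]))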
Overall, the two papers used very different methods when evaluating their respective financial forums, even though they had very similar goals. Both attempted to use sentiment analysis of stock-market discussion boards to find trends and use them to predict future patterns in the stock market.
Dataset Used
- Chua et al., in A Sentiment Detection Engine for Internet Stock Message Boards, used the HotCopper data set, collecting posts made between January and June 2004. This corpus is based in Australia.
- Gilbert et al., in Widespread Worry and the Stock Market, used two data sets: the LiveJournal blog data set and the S&P 500 data set. Both data sets are US based.
Overall, the papers used similar kinds of data sets, and the primary difference was the geographic location of the blogs/discussion boards. Since these are financially focused forums, the geographic difference suggests that the posts on HotCopper come mostly from residents of Australia, while the posts in the LiveJournal blogs and the S&P 500 data are primarily from the US. But this does not limit either corpus to those two regions, as bloggers can participate in either web site.
Problem
As mentioned above, the general problem of opinion mining and sentiment analysis was a common theme throughout both papers. Both also focused on financially oriented discussion boards whose threads mainly concerned particular stocks and investment options. Gilbert et al. distinguished posts as anxious or not anxious, while Chua et al. had posts pre-labeled as buy, hold, or sell (positive, neutral, and negative sentiment, respectively).
Big Idea
Similar to the Problem section above, the overall ideas of the two studies were similar, which is to be expected when two papers are linked by similar citations. The biggest differences were the data sets used and the methodologies employed in each study.
Additional Questions
1) How much time did you spend reading the (new, non-wikified) paper you summarized? ~2 hours
2) How much time did you spend reading the old wikified paper? ~1 hour
3) How much time did you spend reading the summary of the old paper? ~15 minutes
4) How much time did you spend reading background material? ~1 hour
5) Was there a study plan for the old paper? No
6) If so, did you read any of the items suggested by the study plan, and how much time did you spend reading them? N/A
7) Other comments and feedback: It was an interesting assignment, but since we had to create two wiki pages, it was quite time consuming. It would be just as effective to have a short summary section within the comparison wiki page, so that we could focus our attention on the comparison itself. Otherwise, the exercise of comparing two studies across several categories (data set, methods, problem, etc.) was beneficial.