This is a [[Category::Paper]] that appeared at the [http://www.aaai.org/Conferences/ICWSM/icwsm.php International AAAI Conference on Weblogs and Social Media] 2010.
== Citation ==

title={ICWSM--A great catchy name: Semi-supervised recognition of sarcastic sentences in online product reviews},
author={Tsur, O. and Davidov, D. and Rappoport, A.},
booktitle={Proceedings of the fourth international AAAI conference on weblogs and social media},
pages={162--169},
year={2010}

== Online version ==
== Summary ==
In this work, the authors introduce a novel semi-supervised approach for identifying sarcasm in the comments of online reviews. Sarcasm detection can be viewed as a particularly challenging instance of the [[AddressesProblem::Sentiment analysis]] problem. As the authors point out, the task is hard even for humans, who often fail to recognize sarcasm, let alone for a machine learning algorithm.

To that end, the authors first define a small training set, labeled by hand, containing some clearly sarcastic comments and some clearly non-sarcastic ones. Each review in this set is rated on a sarcasm scale from 1 (clearly non-sarcastic) to 5 (clearly sarcastic). Using this training set, they extract two different types of features:
* Pattern-based: For pattern identification, the authors separate all terms into High Frequency Words (HFWs) and Content Words (CWs), simply by thresholding their corpus frequency (HFWs lie above the threshold, CWs below it). Each pattern is then allowed to contain 2-6 HFWs and 1-6 CWs. Next, to cut down the initially large number of patterns, they filter out patterns that are not particularly useful, eliminating those that 1) appear in reviews of only a single product, or 2) appear in the training set both in clearly sarcastic reviews (rated 5) and in clearly non-sarcastic ones (rated 1), and thus fail to discriminate between the two. This part is based on [http://leibniz.cs.huji.ac.il/tr/884.pdf Davidov, D., and Rappoport, A. 2006. Efficient unsupervised discovery of word categories using symmetric patterns and high frequency words. In COLING-ACL]. A minimal sketch of this step follows after the list.
* Syntactic: These features mainly concern the punctuation used in the review: for example, the number of quotes, exclamation marks, question marks, and capitalized words, as well as the length of a sentence. As a preview of the results, however, the authors conclude that punctuation marks are not particularly useful for detecting sarcasm in written text (in contrast to spoken communication, as earlier work had concluded). A sketch of these features also follows below.
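To make the pattern-based features concrete, here is a minimal sketch of the HFW/CW split and the pattern extraction described above. The frequency thresholds and the whitespace tokenization are illustrative assumptions; the paper's exact values are not reproduced here.

<pre>
from collections import Counter

# Illustrative thresholds (assumptions, not the paper's values): terms whose
# corpus frequency is above HFW_T count as High Frequency Words (HFWs),
# terms below CW_T count as Content Words (CWs).
HFW_T = 1e-3
CW_T = 1e-4

def classify_terms(corpus_tokens):
    """Map each distinct term to "HFW" or "CW" by its corpus frequency."""
    counts = Counter(corpus_tokens)
    total = float(sum(counts.values()))
    kind = {}
    for term, n in counts.items():
        freq = n / total
        if freq > HFW_T:
            kind[term] = "HFW"
        elif freq < CW_T:
            kind[term] = "CW"
    return kind

def extract_patterns(tokens, kind):
    """Collect candidate patterns: contiguous slices containing 2-6 HFWs and
    1-6 CWs, with every CW replaced by a generic "CW" slot."""
    patterns = set()
    tags = [kind.get(t) for t in tokens]
    for i in range(len(tokens)):
        for j in range(i + 3, min(i + 13, len(tokens) + 1)):  # 3-12 terms total
            window = tags[i:j]
            if None in window:  # skip slices containing unclassified terms
                continue
            if 2 <= window.count("HFW") <= 6 and 1 <= window.count("CW") <= 6:
                patterns.add(" ".join(
                    tok if tag == "HFW" else "CW"
                    for tok, tag in zip(tokens[i:j], window)))
    return patterns
</pre>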
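The punctuation-based features can likewise be sketched in a few lines; the exact feature set in the paper may differ slightly, for instance in how capitalized words are counted.

<pre>
def punctuation_features(review):
    """Simple surface features of the kind listed above."""
    words = review.split()
    return {
        "quotes": review.count('"'),
        "exclamations": review.count("!"),
        "questions": review.count("?"),
        "all_caps_words": sum(1 for w in words if w.isupper()),  # assumption: "capitalized" = all caps
        "length_in_words": len(words),
    }
</pre>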
+ | |||
Since the approach is semi-supervised, and manually labeling enough sarcastic comments to enrich the training set is painful (both because of the number of reviews that would need labeling and because of the task's difficulty), the authors come up with a "data enrichment" strategy, which briefly works as follows: they used the [http://developer.yahoo.com/search/boss Yahoo! BOSS API] to build a custom search engine for sentences with a given level of sarcasm. They controlled the sarcasm level by examining the structure of already known/labeled sarcastic sentences and querying the search engine for similar ones, which were then added to the training set.
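A rough sketch of that enrichment loop is below. The web_search stub is a hypothetical stand-in for the Yahoo! BOSS calls, and the real query construction (built from the structure of the labeled sentences) is more involved than shown here.

<pre>
def web_search(query, n_results=50):
    """Hypothetical stand-in for a Yahoo! BOSS search call."""
    raise NotImplementedError("plug in a real web-search client here")

def enrich_training_set(labeled_seed):
    """labeled_seed: list of (sentence, sarcasm_level) pairs, levels in 1-5.
    Returns the seed plus web sentences assumed to share each seed's level."""
    enriched = list(labeled_seed)
    for sentence, level in labeled_seed:
        for hit in web_search(sentence):
            enriched.append((hit, level))
    return enriched
</pre>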
+ | |||
After the feature extraction stage, in order to decide how sarcastic a new comment drawn from a test dataset is, they use a k-NN-inspired classifier which works as follows: for any given new review, after extracting its features and converting it to a vector in the feature space, they look at its k nearest neighboring vectors in the training set, in the Euclidean sense. The label of that review is then the weighted average of the scores/labels of those k neighbors.
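A minimal sketch of such a classifier in NumPy is shown below. The inverse-distance weighting is one common choice and an assumption here; the paper's exact weighting scheme may differ.

<pre>
import numpy as np

def knn_sarcasm_score(x, train_X, train_y, k=5):
    """Score a new feature vector x (shape (d,)) by a distance-weighted
    average of the labels of its k nearest training vectors (Euclidean)."""
    dists = np.linalg.norm(train_X - x, axis=1)   # distance to every training vector
    nearest = np.argsort(dists)[:k]               # indices of the k closest ones
    weights = 1.0 / (dists[nearest] + 1e-9)       # inverse-distance weights (illustrative)
    return float(np.dot(weights, train_y[nearest]) / weights.sum())
</pre>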
+ | |||
+ | == Evaluation == | ||
+ | |||
+ | For the performance evaluation of their approach, the authors first do 5-fold cross validation on their dataset and measure the performance of each component of the algorithm separately and once combined altogether. As a second step, they asked 15 adult raters coming from different backgrounds, to label a portion of the dataset, providing a "gold-standard" means of validation. Based on this gold-standard, the authors compare their approach to an intuitive baseline that they also propose. | ||
+ | |||
+ | '''Dataset''': | ||
+ | |||
+ | The data used for the experiments come from Amazon. In particular, they consist of 66271 reviews, spanning 120 different products. The average number of starts (aka the average rating) for those products was 4.19/5, whereas the average review length was 953 characters. | ||
+ | |||
+ | '''Metrics''': | ||
+ | |||
+ | The authors base their evaluation on | ||
+ | *[http://en.wikipedia.org/wiki/Precision_and_recall Precision] | ||
+ | *[http://en.wikipedia.org/wiki/Precision_and_recall Recall] | ||
+ | *Accuracy | ||
+ | *[http://en.wikipedia.org/wiki/F1_score F-score] | ||
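For reference, the standard definitions of these metrics for binary labels (1 = sarcastic), as a generic sketch rather than code from the paper:

<pre>
def binary_metrics(y_true, y_pred):
    """Precision, recall, accuracy and F-score for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / max(len(y_true), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f-score": f1}
</pre>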
+ | |||
+ | '''Baseline''': | ||
+ | |||
+ | For the sake of comparison, the authors propose a strong baseline, according to which they look at all the negative reviews (roughly between 1-3 stars) and classify as sarcastic those that have strong positive sentiment. Intuitively, this is because | ||
+ | according to the authors, it captures one definition of sarcasm which states that sarcasm is "saying the opposite of what you mean in a way intended to make someone else feel stupid or show you are angry". The authors call this baseline "Star Sentiment" approach. | ||
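A minimal sketch of this decision rule, assuming a hypothetical off-the-shelf sentiment_score helper returning a value in [-1, 1] (the baseline's actual sentiment detection is not specified in this summary):

<pre>
POSITIVE_T = 0.5  # illustrative threshold for "strong positive sentiment"

def sentiment_score(text):
    """Hypothetical sentence-level sentiment scorer in [-1, 1]."""
    raise NotImplementedError("plug in any off-the-shelf sentiment model")

def star_sentiment_baseline(review_text, star_rating):
    """Classify as sarcastic iff the star rating is negative (1-3 stars)
    but the review text carries strong positive sentiment."""
    return star_rating <= 3 and sentiment_score(review_text) >= POSITIVE_T
</pre>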
+ | |||
+ | '''Results''': | ||
+ | |||
+ | In the tables below we show the results of the cross-validation (Table 2) and the gold-standard comparison to the baseline (Table 3). | ||
+ | |||
+ | [[File:ICWSM10_sarcasm_results.png]] | ||
+ | |||
+ | == Discussion == | ||
+ | |||
+ | The most important conclusions drawn from the results are: | ||
+ | * Punctuation marks, in contrast to their initial speculation, serves as a relatively poor indicator of sarcasm, in written speech. | ||
+ | * By looking at the baseline results, one can see that the aforementioned "naive" definition of sarcasm is good enough for obvious sarcastic phrases, but fails to distinguish more subtle sarcastic sentences. | ||
+ | * The algorithm still fails to distinguish subtle sarcasm like the following: “This book was really good until page 2!” (sarcasm) and “This book was really good until page 430!” (probably not sarcasm). This behavior, however, is to be expected, since the sarcasm levels of this sentence depend on features not taken into account (and which are probably hard to consider to begin with), such as how long the particular book is and at what point the page number indicates sarcasm or not. | ||
+ | * Finally, the authors observed high correlation between sarcastic comments and average low star ratings, a fact that indicates that sarcasm mostly expresses negative feelings. | ||
+ | |||
+ | |||
== Related Papers ==
* [[RelatedPaper::Davidov, D., and Rappoport, A. COLING-ACL 2006]]: Davidov, D., and Rappoport, A. 2006. Efficient unsupervised discovery of word categories using symmetric patterns and high frequency words. In COLING-ACL.