Difference between revisions of "Higher Order Review Rating Sentiment Analysis"
Revision as of 18:52, 10 October 2012
Comments
--Bbd 23:52, 10 October 2012 (UTC)
Team members
Project Title
Higher Order Review Rating / Sentiment Analysis
Project Abstract
Given a review, there are many different dimensions that one may exploit to improve performance in review rating/sentiment analysis. For example, when was the review written? Does it refer to the book or the movie of a given title (e.g. Harry Potter the book, or Harry Potter the movie)?
We propose to exploit the intrinsic high dimensionality of a review, in order to improve performance. Namely, we propose to model such reviews as high dimensional tensors (possibly with more than 3 dimensions/modes) and use tensor decomposition algorithms in order to obtain higher accuracy.
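To make the contrast with matrix methods concrete, the two model families can be written side by side. The proposal does not name a specific tensor decomposition, so the CP (PARAFAC) form below is only one common choice, shown here as an assumption. The symbols (rank R, factor vectors a_r^{(n)}, number of modes N) are likewise introduced for illustration.

```latex
% Matrix case (two modes, e.g. reviews x terms): truncated SVD of rank R
X \approx \sum_{r=1}^{R} \sigma_r \, u_r v_r^{\top}

% Tensor case (N modes, e.g. reviewer x product x date x term):
% CP/PARAFAC model of rank R, one factor vector per mode and component
\mathcal{X} \approx \sum_{r=1}^{R} a_r^{(1)} \circ a_r^{(2)} \circ \cdots \circ a_r^{(N)},
\qquad
x_{i_1 i_2 \cdots i_N} \approx \sum_{r=1}^{R} \prod_{n=1}^{N} a^{(n)}_{i_n r}
```

Here \circ denotes the vector outer product. With N = 2 the CP model reduces to an ordinary low-rank matrix factorization, which is why the matrix approaches can be seen as the special case that ignores the extra modes.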
Intuitively, by incorporating all available dimensions, we should be able to do at least as well as when using only a subset of them, provided that the extra information we add is useful and not particularly noisy.
We will show that by exploiting this high-dimensional data, we can achieve higher performance in the review sentiment classification task than when we do not exploit it.
Data
We will use Amazon Product Review Data.
The data size is more than 5.8 million reviews.
This data has been used in (Jindal and Liu, WWW-2007, WSDM-2008; Lim et al., CIKM-2010; Jindal, Liu and Lim, CIKM-2010; Mukherjee et al., WWW-2011; Mukherjee, Liu and Glance, WWW-2012).
From this dataset, we can extract various dimensions such as reviewer, product id, product category, date, review text.
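As a rough sketch of how these dimensions could be turned into a multi-mode tensor, the following stores term counts in a sparse dictionary keyed by (reviewer, product, category, month, term) indices. The record fields, the helper names `build_index` and `build_tensor`, the monthly granularity for dates, and the whitespace tokenization are all assumptions made for illustration. They are not the actual schema of the Amazon data.

```python
from collections import Counter

# Hypothetical toy records; field names are illustrative, not the real schema.
reviews = [
    {"reviewer": "u1", "product_id": "p1", "category": "Books",
     "date": "2005-03-14", "text": "great story great characters"},
    {"reviewer": "u2", "product_id": "p1", "category": "Books",
     "date": "2005-04-02", "text": "boring story"},
    {"reviewer": "u1", "product_id": "p2", "category": "Movies",
     "date": "2005-04-20", "text": "great effects"},
]


def build_index(values):
    """Assign a stable integer index to each distinct value."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def build_tensor(reviews):
    """Return (modes, tensor): per-mode index maps and a sparse count tensor."""
    modes = {
        "reviewer": build_index(r["reviewer"] for r in reviews),
        "product": build_index(r["product_id"] for r in reviews),
        "category": build_index(r["category"] for r in reviews),
        # Assumption: dates are bucketed by month (YYYY-MM prefix).
        "month": build_index(r["date"][:7] for r in reviews),
        # Assumption: naive lowercase whitespace tokenization.
        "term": build_index(t for r in reviews for t in r["text"].lower().split()),
    }
    tensor = Counter()  # maps an index tuple to a term count
    for r in reviews:
        for term in r["text"].lower().split():
            key = (
                modes["reviewer"][r["reviewer"]],
                modes["product"][r["product_id"]],
                modes["category"][r["category"]],
                modes["month"][r["date"][:7]],
                modes["term"][term],
            )
            tensor[key] += 1
    return modes, tensor


modes, tensor = build_tensor(reviews)
shape = tuple(len(index) for index in modes.values())
```

A sparse representation is used because, at the scale of millions of reviews, almost all entries of such a tensor would be zero.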
Task
The task is classification of reviews (from Amazon) as positive or negative.
We have only ratings (from 1 to 5) for reviews, and we do not have explicit positive/negative labels. Since the data is so large, we will not manually create labels for reviews. Instead, we will count ratings 1 and 2 as negative, and ratings 4 and 5 as positive.
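The labeling rule above can be sketched as a small function. The text does not say what happens to reviews rated 3, so returning `None` for them (and for any out-of-range value) is an assumption of this sketch. The function name `rating_to_label` is also illustrative.

```python
from typing import Optional


def rating_to_label(rating: int) -> Optional[str]:
    """Map a 1-5 star rating to a sentiment label."""
    if rating in (1, 2):
        return "negative"
    if rating in (4, 5):
        return "positive"
    # Assumption: rating 3 (and anything outside 1-5) is left unlabeled.
    return None


ratings = [5, 1, 3, 4, 2]
labels = [rating_to_label(r) for r in ratings]
# Keep only the reviews that received a label.
labeled = [(r, l) for r, l in zip(ratings, labels) if l is not None]
```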
Baseline
As a baseline, we propose to "ignore" the inherent high dimensionality of the data and instead use matrix approaches (which are two dimensional). For the example that we mentioned earlier, in the case of a review, we may only take into account the terms that appear in that review but ignore the date or the product category. Examples of these baselines could be the Singular Value Decomposition or some other Matrix Factorization methods, like the Non-negative Matrix Factorization (NMF), which is particularly popular in many data mining applications.
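One way to picture "ignoring" dimensions is to sum a sparse tensor over the modes that the baseline discards, leaving a two-mode matrix that could then be handed to SVD or NMF. The sketch below does only the collapsing step. The helper name `collapse_to_matrix`, the toy tensor, its mode order (reviewer, product, month, term), and the choice of product as the row mode are all assumptions for illustration.

```python
from collections import defaultdict


def collapse_to_matrix(tensor, row_mode, col_mode):
    """Sum a sparse tensor over every mode except row_mode and col_mode."""
    matrix = defaultdict(float)
    for index, value in tensor.items():
        matrix[(index[row_mode], index[col_mode])] += value
    return dict(matrix)


# Hypothetical sparse tensor with modes (reviewer, product, month, term).
tensor = {
    (0, 0, 0, 3): 2.0,
    (0, 0, 0, 4): 1.0,
    (1, 0, 1, 4): 1.0,
    (0, 1, 1, 3): 1.0,
}

# Baseline view: product x term counts, with reviewer and month summed out.
product_term = collapse_to_matrix(tensor, row_mode=1, col_mode=3)
```

After collapsing, the two entries for product 0 and term 4 that came from different reviewers and months are merged into one count, which is exactly the information the baseline gives up.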
Evaluation
For performance evaluation, we may consider both quantitative and qualitative approaches.
- Quantitative
  - We may classify the reviews using, e.g., an SVM or k-NN classifier and argue that our approach yields better classification accuracy than the chosen baselines.
  - We may also analyze how each dimension contributes to the performance. For example, we may compare a situation where we ignore "time" information with a situation where we ignore "reviewers".
- Qualitative
  - We may consider visualizing the reviews in a lower (possibly 2-dimensional) space and argue that the visualization achieved by our approach (which takes into account more views of the data) succeeds in visually separating, e.g., good from bad reviews.
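The quantitative comparison could be sketched with a minimal k-NN classifier over low-dimensional review representations. Everything here is illustrative: the function names, the use of cosine similarity, k = 3, and the toy 2-dimensional vectors, which stand in for whatever features a decomposition would actually produce.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(train_x, train_y, x, k=3):
    """Predict the majority label among the k most similar training vectors."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: cosine(pair[0], x), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_x, train_y, test_x, test_y, k=3):
    """Fraction of test vectors whose predicted label matches the true one."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_y)


# Hypothetical 2-D review embeddings (e.g. rows of a factor matrix).
train_x = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0],
           [0.1, 1.0], [0.2, 0.9], [0.0, 0.8]]
train_y = ["positive"] * 3 + ["negative"] * 3
test_x = [[1.0, 0.3], [0.1, 0.7]]
test_y = ["positive", "negative"]

acc = accuracy(train_x, train_y, test_x, test_y)
```

The same `accuracy` call could be repeated on features obtained from the matrix baseline, or from a tensor with one mode (such as time or reviewer) left out, to compare the resulting numbers.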
Key technical challenges
- We will need to deal with large data (the original dataset contains more than 5.8 million reviews).
- We may need to deal with features of each object (such as a product's price), in addition to the relational data.
- We may need to deal with multi-relational data (such as a reviewer-reviewer trust network) if such data is available, though we have not found any so far.
What we hope to learn
- We would like to learn how each dimension actually contributes to the performance in a specific task.