Higher Order Review Rating Sentiment Analysis

From Cohen Courses

Comments

  • Nice proposal with well-thought-out tasks and challenges.
  • On Amazon, users can specify whether they found a review useful. This could be another interesting dimension for your study. Is this data available in the dataset you have?
  • It would be great if you could add a related work section that gives, for each paper, the paper title, a one-line summary of how it is related, and a link to the PDF.

  • You might find the following paper interesting: Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification [1]

--Bbd 01:08, 11 October 2012 (UTC)


Thank you very much for your comment.

  • Regarding the "review of a review", that is, the data in which users specify whether they found a review useful: it is surely an interesting dimension. We found similar data, the "number of helpful feedbacks" for each review, and we are thinking of exploiting it. Thank you very much for your advice!
  • Regarding the related paper you suggested, we find domain adaptation an interesting topic as well, and we might be able to cover it.
  • Regarding the related work, we have added one paper.

--Nnori

Team members


Project Title

Higher Order Review Rating / Sentiment Analysis

Project Abstract

Given a review, there are many different dimensions that one may exploit to improve performance in review rating/sentiment analysis. For example, when was the review written? Does it refer to the book or the movie of a given title (e.g., Harry Potter the book versus Harry Potter the movie)?

We propose to exploit the intrinsic high dimensionality of a review in order to improve performance. Namely, we propose to model reviews as high-dimensional tensors (possibly with more than 3 dimensions/modes) and to use tensor decomposition algorithms to obtain higher accuracy.
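As a rough illustration of what "modeling reviews as a tensor" could look like, the sketch below builds a sparse 4-mode count tensor (reviewer × product × month × term) from a few toy records. The record fields, the choice of modes, and the month granularity are assumptions made for this example only; they are not taken from the dataset description.

```python
from collections import defaultdict

# Hypothetical toy records; the field names are illustrative assumptions.
reviews = [
    {"reviewer": "u1", "product": "p1", "month": "2005-03", "text": "great book great plot"},
    {"reviewer": "u2", "product": "p1", "month": "2005-04", "text": "boring plot"},
    {"reviewer": "u1", "product": "p2", "month": "2005-04", "text": "great movie"},
]

# Assumed modes of the tensor, in index order.
MODES = ("reviewer", "product", "month", "term")


def build_sparse_tensor(records):
    """Return (tensor, index): tensor maps a 4-tuple of mode indices to a term count."""
    index = {mode: {} for mode in MODES}
    tensor = defaultdict(int)

    def idx(mode, value):
        # Assign the next free integer id the first time a value is seen.
        return index[mode].setdefault(value, len(index[mode]))

    for r in records:
        base = (
            idx("reviewer", r["reviewer"]),
            idx("product", r["product"]),
            idx("month", r["month"]),
        )
        for term in r["text"].lower().split():
            tensor[base + (idx("term", term),)] += 1
    return dict(tensor), index


tensor, index = build_sparse_tensor(reviews)
shape = tuple(len(index[m]) for m in MODES)
print(shape, len(tensor))
```

Storing only the nonzero entries keeps memory proportional to the number of observed (reviewer, product, month, term) combinations, which matters for a dataset of this size.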

Intuitively, by incorporating all available dimensions, we should be able to do at least as well as if we were using only a subset of them, provided that the extra information we add is useful and not particularly noisy.

We will show that by exploiting these high-dimensional data, we can achieve higher performance in the review sentiment classification task than when we do not exploit them.


Data

We will use Amazon Product Review Data.

The dataset contains more than 5.8 million reviews.

This data is used in (Jindal and Liu, WWW-2007, WSDM-2008; Lim et al., CIKM-2010; Jindal, Liu and Lim, CIKM-2010; Mukherjee et al., WWW-2011; Mukherjee, Liu and Glance, WWW-2012).

From this dataset, we can extract various dimensions such as reviewer, product ID, product category, date, and review text.


Task

The task is classification of reviews (from Amazon) as positive or negative.

We only have ratings (from 1 to 5) for the reviews; we do not have explicit positive/negative labels. Since the dataset is so large, we will not create labels manually. Instead, we will count ratings 1 and 2 as negative, and ratings 4 and 5 as positive.
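The labeling rule above can be sketched as a small function. How 3-star reviews are handled is not stated in the proposal; returning None for them (so that they can be skipped) is an assumption of this sketch, and the function name is hypothetical.

```python
def rating_to_label(rating):
    """Map a 1-5 star rating to a sentiment label.

    Ratings 1 and 2 are negative, 4 and 5 are positive.
    Rating 3 returns None: an assumption, since the proposal
    does not say how neutral ratings are treated.
    """
    if rating in (1, 2):
        return "negative"
    if rating in (4, 5):
        return "positive"
    if rating == 3:
        return None
    raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")


# Keep only the reviews that receive a label.
ratings = [5, 1, 3, 4, 2]
labels = [lab for lab in map(rating_to_label, ratings) if lab is not None]
print(labels)
```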


Baseline

As a baseline, we propose to "ignore" the inherent high dimensionality of the data and instead use matrix (i.e., two-dimensional) approaches. In the review example mentioned earlier, we may take into account only the terms that appear in a review and ignore the date or the product category. Examples of such baselines are the Singular Value Decomposition (SVD) or other matrix factorization methods such as Non-negative Matrix Factorization (NMF), which is particularly popular in data mining applications.
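A minimal sketch of such a matrix baseline, assuming NumPy is available: it builds a review-by-term count matrix from toy texts (ignoring every other dimension) and keeps the leading singular directions as low-dimensional review features. The toy documents, the rank, and the helper name are illustrative assumptions; NMF could be substituted for the SVD step.

```python
import numpy as np

# Toy review texts; only the terms are used, all other dimensions are ignored.
docs = ["great book great plot", "boring plot", "great movie"]

# Build the review-by-term count matrix X.
vocab = sorted({t for d in docs for t in d.split()})
col = {t: j for j, t in enumerate(vocab)}
X = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for t in d.split():
        X[i, col[t]] += 1.0


def truncated_svd_features(X, rank):
    """Project each row of X onto its top `rank` singular directions."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Scale the left singular vectors by the singular values.
    return U[:, :rank] * s[:rank]


Z = truncated_svd_features(X, rank=2)
print(Z.shape)
```

The rows of Z could then be fed to the same classifier that is used for the tensor-based features, so that the two approaches are compared on equal terms.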


Evaluation

For performance evaluation, we may consider both quantitative and qualitative approaches.

Quantitative
  • We may classify the reviews using, e.g., an SVM or k-NN classifier and argue that our approach yields better classification accuracy than the chosen baselines.
  • We may also conduct analysis on how each dimension contributes to the performance. For example, we may compare a situation where we ignore "time" information, and a situation where we ignore "reviewers".
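The per-dimension analysis could be organized as a leave-one-out ablation over modes, sketched below. The mode names, the decision to always keep the term mode, and both helper functions are assumptions of this sketch; the accuracy values would come from actual experiments and are deliberately not filled in here.

```python
# Assumed modes; "term" is always kept so that the review text is never dropped.
MODES = ("reviewer", "product", "date", "term")


def leave_one_out_configs(modes, required=("term",)):
    """Return the full mode set followed by each set with one optional mode removed."""
    configs = [tuple(modes)]
    for m in modes:
        if m not in required:
            configs.append(tuple(x for x in modes if x != m))
    return configs


def accuracy_drop(scores, modes):
    """Given {mode tuple: accuracy}, return the accuracy lost when each mode is removed."""
    full = scores[tuple(modes)]
    drops = {}
    for m in modes:
        reduced = tuple(x for x in modes if x != m)
        if reduced in scores:
            drops[m] = full - scores[reduced]
    return drops


for config in leave_one_out_configs(MODES):
    print(config)
```

A larger drop for a given mode would suggest that the mode contributes more to classification accuracy.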


Qualitative
  • We may consider visualizing the reviews in a lower-dimensional (possibly 2-dimensional) space and argue that the visualization obtained by our approach (which takes more views of the data into account) visually separates, e.g., good reviews from bad ones.


Key technical challenges

  • We will need to deal with large data (the original dataset contains more than 5.8 million reviews).
  • We may need to deal with features of each object (such as a product's price), in addition to the relational data.
  • We may need to deal with multi-relational data (such as a reviewer-reviewer trust network) if such data is available, though we have not found any so far.


What we hope to learn

  • We would like to learn how each dimension actually contributes to the performance in a specific task.

Related Work