Difference between revisions of "Sentiment Analysis in Multiple Domains"

From Cohen Courses
(Created page with '== Team members == * Zeyu Zheng * Mahaveer Jain == Project Title == Sentiment Analysis in Multiple Domains == Project Abstract == == Data…')
 
Revision as of 21:05, 7 October 2012

Team members

  • Zeyu Zheng
  • Mahaveer Jain

Project Title

Sentiment Analysis in Multiple Domains

Project Abstract

Sentiment analysis of text has emerged as an active and challenging research area over the past decade. Previous work has proposed several techniques, including simple rule-based approaches, unsupervised learning, and a range of supervised learning methods that use various feature representations and constraints. Sentiment analysis at different levels of granularity has also been studied extensively.

However, one major challenge that still needs to be addressed is domain adaptation across the different kinds of text that sentiment analysis algorithms must process. For example, for product reviews on Amazon.com, a sentiment analysis model learned on book reviews does not perform as well when applied directly to kitchen appliance reviews [Blitzer et al. 2007]. One reason the model underperforms is that the features that indicate positive or negative sentiment in book reviews differ from those that do so in kitchen appliance reviews. For example, in the kitchen domain of the Amazon reviews, many people may use “stainless” as positive feedback on a product, so the word may receive a high weight in the domain-specific classifier. However, this word is much less likely to appear in other domains such as books or DVDs, so it contributes little to classifying reviews in those domains.
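
The vocabulary mismatch described above can be illustrated with a minimal Python sketch. The reviews below are made up for illustration and are not drawn from the Amazon dataset; the `vocabulary` helper and the whitespace tokenization are assumptions of this sketch.

```python
from collections import Counter

# Toy, made-up reviews; illustrative only, not taken from the Amazon dataset.
kitchen_reviews = [
    "stainless steel pan heats evenly",
    "sturdy stainless kettle works great",
]
book_reviews = [
    "gripping plot and great characters",
    "the story was dull and predictable",
]


def vocabulary(reviews):
    """Count how often each whitespace-separated token occurs (hypothetical helper)."""
    counts = Counter()
    for text in reviews:
        counts.update(text.lower().split())
    return counts


kitchen_vocab = vocabulary(kitchen_reviews)
book_vocab = vocabulary(book_reviews)

# Words a kitchen classifier could weight that never occur in the book reviews.
kitchen_only = sorted(set(kitchen_vocab) - set(book_vocab))
# Words that both domains share and that could transfer between them.
shared = sorted(set(kitchen_vocab) & set(book_vocab))

print("kitchen-only words:", kitchen_only)
print("shared words:", shared)
```

In this toy data, “stainless” appears only in the kitchen reviews, so a classifier trained on the book reviews alone would have no weight for it.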


Data

We will use the benchmark dataset of Amazon reviews collected by Blitzer et al. (2007). This dataset contains more than 340,000 reviews from 22 different product types, which can be regarded as different domains.


Task

Given labeled reviews from some product types, which are regarded as the source domains, and unlabeled reviews from another product type, which is regarded as the target domain, we want to classify the reviews from the target domain as positive or negative.
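
One way to picture this setup is the small Python sketch below. The `Review` record, the `split_by_role` helper, and the 1/0/`None` label encoding are assumptions made for illustration; they are not part of the dataset or of any existing code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Review:
    """Hypothetical record for one review."""
    text: str
    domain: str           # product type, e.g. "books" or "kitchen"
    label: Optional[int]  # 1 = positive, 0 = negative, None = unlabeled


def split_by_role(reviews: List[Review], target_domain: str) -> Tuple[List[Review], List[Review]]:
    """Return (labeled source reviews, unlabeled target reviews)."""
    source = [r for r in reviews if r.domain != target_domain and r.label is not None]
    target = [r for r in reviews if r.domain == target_domain and r.label is None]
    return source, target


# Made-up examples for illustration only.
reviews = [
    Review("a moving and well written story", "books", 1),
    Review("the disc arrived scratched", "dvd", 0),
    Review("the stainless kettle boils quickly", "kitchen", None),
]
source, target = split_by_role(reviews, target_domain="kitchen")
```

Here the books and DVD reviews play the role of source domains, and the kitchen review is the unlabeled target example to be classified.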


Baseline

First, the most naïve approach to this task is simply to merge all examples from the multiple source product types and apply a single-source domain adaptation algorithm such as [1] or [2] to classify the target-domain reviews.
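
The merging step can be sketched in Python as follows. As a stand-in for the single-source domain adaptation algorithms of [1] and [2], which are not reproduced here, this sketch uses a plain multinomial naive Bayes classifier with add-one smoothing. The toy reviews and the function names are assumptions for illustration.

```python
import math
from collections import Counter

# Made-up labeled reviews per source domain (1 = positive, 0 = negative).
sources = {
    "books": [("a great and moving story", 1), ("dull and boring plot", 0)],
    "dvd": [("great picture and excellent sound", 1),
            ("boring film with terrible acting", 0)],
}

# Merge all source domains into one training set, ignoring the domain labels.
merged = [example for examples in sources.values() for example in examples]


def train_naive_bayes(examples):
    """Collect per-class word counts and document counts."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return 1 (positive) or 0 (negative) for a review text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (0, 1):
        total_words = sum(word_counts[label].values())
        score = math.log((doc_counts[label] + 1) / (total_docs + 2))
        for word in text.lower().split():
            if word in vocab:  # words unseen in every source domain are ignored
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return 1 if scores[1] > scores[0] else 0


model = train_naive_bayes(merged)
positive_guess = predict(model, "excellent kettle works great")
negative_guess = predict(model, "terrible blender boring design")
```

Target-domain words such as “kettle” or “blender” never occur in the merged source data, so this sketch ignores them. That is exactly the limitation that motivates using unlabeled target-domain data.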

Since we assume that unlabeled target-domain data is available, the second baseline follows a bootstrapping approach that automatically adds unlabeled target-domain data, as proposed in [3].
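
A generic self-training loop gives a rough idea of such bootstrapping. The Python sketch below is not the algorithm of [3]: the naive Bayes scorer, the confidence `threshold`, the round limit, and the toy reviews are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def train(examples):
    """Collect per-class word and document counts from (text, label) pairs."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for text, label in examples:
        docs[label] += 1
        counts[label].update(text.lower().split())
    return counts, docs


def log_odds(model, text):
    """Positive-vs-negative log-odds under add-one-smoothed naive Bayes."""
    counts, docs = model
    vocab = set(counts[0]) | set(counts[1])
    n_pos, n_neg = sum(counts[1].values()), sum(counts[0].values())
    score = math.log((docs[1] + 1) / (docs[0] + 1))
    for word in text.lower().split():
        if word in vocab:  # skip words never seen in training
            p_pos = (counts[1][word] + 1) / (n_pos + len(vocab))
            p_neg = (counts[0][word] + 1) / (n_neg + len(vocab))
            score += math.log(p_pos / p_neg)
    return score


def self_train(labeled, unlabeled, threshold=0.5, max_rounds=5):
    """Repeatedly add confidently classified unlabeled texts to the training set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        confident, remaining = [], []
        for text in pool:
            score = log_odds(model, text)
            if abs(score) >= threshold:
                confident.append((text, 1 if score > 0 else 0))
            else:
                remaining.append(text)
        if not confident:  # nothing new to add, so stop early
            break
        labeled.extend(confident)
        pool = remaining
    return train(labeled), pool


# Made-up data: labeled source (books) reviews and unlabeled target (kitchen) reviews.
source = [("great story and excellent writing", 1),
          ("boring plot and terrible ending", 0)]
target_unlabeled = ["excellent stainless kettle", "terrible flimsy handle",
                    "stainless and sturdy", "flimsy lid"]

source_only = train(source)
adapted, leftover = self_train(source, target_unlabeled)
```

In this toy run, the source-only model has no evidence about “stainless”. After self-training, the word has been picked up from target reviews that also contained a known positive word, so it carries positive weight.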

Challenges

  • How to
  • We may need to deal with features of each object (such as a product's price), in addition to the relational data.
  • We may need to deal with multi-relational data (such as a reviewer-reviewer trust network) if such data is available, though we have not found any so far.


What we hope to learn

  • We would like to learn how each dimension actually contributes to the performance in a specific task.