Sentiment Analysis in Multiple Domains

Comments

  • Your proposal and list of related work look good.
  • I would suggest adding more details about which algorithms you are going to explore and, if their authors used them for a different task, how you are going to adapt them to yours.
  • You might want to take a look at Mahesh's work, which covers domain adaptation for the Amazon dataset and contains an overview of the related papers you have cited here: http://www.cs.cmu.edu/~maheshj/pubs/joshi+dredze+cohen+rose.emnlp2012.pdf

--Bbd 01:31, 11 October 2012 (UTC)

Team members

  • Zeyu Zheng
  • Mahaveer Jain

Project Title

Sentiment Analysis in Multiple Domains

Project Abstract

Analyzing sentiment in text has emerged as an interesting and challenging area of research over the past decade. Previous work has proposed several techniques, including simple rule-based approaches, unsupervised learning, and a range of supervised learning methods using various feature representations and constraints. Sentiment analysis at different levels of granularity has also been studied extensively.

However, one of the major challenges that still needs to be addressed is domain adaptation across the different kinds of text that sentiment analysis algorithms must process. For example, in the context of product reviews on Amazon.com, a sentiment analysis model learned on book reviews does not perform as well when applied directly to kitchen appliance reviews [Blitzer et al. 2007]. One reason the model underperforms is that the features indicating positive or negative sentiment in book reviews are not the same as those indicating sentiment in the kitchen appliance domain. For example, in the kitchen domain of the Amazon reviews, many people use "stainless" as positive feedback for a product, so this feature may receive a high weight in the domain-specific classifier. However, the word rarely appears in other domains such as books or DVDs, so it does not help classify reviews there.
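
To make the point concrete, here is a minimal sketch (ours, not from the cited work, using invented toy reviews) that trains a separate bag-of-words logistic regression for two domains and prints each classifier's highest-weighted features. On the real Amazon data one would expect a term like "stainless" to rank highly only in the kitchen model.

    # Toy sketch: one bag-of-words sentiment classifier per domain.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Invented example reviews; labels are 1 = positive, 0 = negative.
    domains = {
        "kitchen": [("great stainless steel pan, heats evenly", 1),
                    ("the handle broke after a week", 0),
                    ("love the stainless finish, easy to clean", 1),
                    ("rusted quickly, very disappointed", 0)],
        "books": [("a gripping, beautifully written story", 1),
                  ("boring plot and flat characters", 0),
                  ("could not put it down, wonderful writing", 1),
                  ("predictable and poorly edited", 0)],
    }

    for name, reviews in domains.items():
        texts, labels = zip(*reviews)
        vectorizer = CountVectorizer()
        features = vectorizer.fit_transform(texts)
        classifier = LogisticRegression().fit(features, labels)
        # Rank features by learned weight; the top ones are this domain's
        # strongest positive-sentiment indicators.
        ranked = sorted(zip(classifier.coef_[0],
                            vectorizer.get_feature_names_out()), reverse=True)
        print(name, [word for _, word in ranked[:5]])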

In this project, we want to address the problem of how to leverage labeled reviews from multiple source domains to better classify reviews in a target domain.

Task

Given labeled reviews from several product types, regarded as source domains, and unlabeled reviews from another product type, regarded as the target domain, we want to classify the target-domain reviews as positive or negative.

Data

We will use the benchmark Amazon review dataset collected by Blitzer et al. (2007), which contains more than 340,000 reviews from 22 different product types; each product type can be regarded as a separate domain. We do not label the data manually; instead, we derive labels from the star ratings, as proposed in the original work.
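
As a rough sketch of this labeling step (our reading of the original setup; the exact thresholds are an assumption): reviews with more than three stars are treated as positive, reviews with fewer than three stars as negative, and three-star reviews are discarded as ambiguous.

    # Map star ratings to binary sentiment labels; the record format is hypothetical.
    def star_to_label(stars):
        """Return 1 (positive), 0 (negative), or None (discard) for a star rating."""
        if stars > 3:
            return 1
        if stars < 3:
            return 0
        return None  # neutral three-star reviews are dropped

    raw_reviews = [
        {"text": "works perfectly, highly recommend", "stars": 5},
        {"text": "stopped working after two days", "stars": 1},
        {"text": "it is okay, nothing special", "stars": 3},
    ]

    labeled = [(r["text"], star_to_label(r["stars"]))
               for r in raw_reviews
               if star_to_label(r["stars"]) is not None]
    print(labeled)  # the three-star review is excluded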

Techniques

First, the most naive approach for this task would be to simply merge the examples from all source product types and apply a single-source domain adaptation algorithm such as [1] or [3] to classify the target-domain reviews.
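
A minimal sketch of this merged-source baseline follows; the toy review lists are placeholders for the real data, and an adaptation step such as structural correspondence learning [3] would sit between feature extraction and training.

    # Naive baseline sketch: pool all source-domain reviews and train one model.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Hypothetical pooled source data (text, label) and unlabeled target texts.
    source_texts = ["great pan, cooks evenly", "terrible blender, broke fast",
                    "wonderful novel, loved it", "dull story, gave up halfway"]
    source_labels = [1, 0, 1, 0]
    target_texts = ["excellent dvd, great picture", "awful dvd, waste of money"]

    vectorizer = TfidfVectorizer()
    source_features = vectorizer.fit_transform(source_texts)  # vocabulary from the merged sources
    target_features = vectorizer.transform(target_texts)      # same feature space for the target

    model = LogisticRegression().fit(source_features, source_labels)
    print(model.predict(target_features))  # predicted labels for the target-domain reviews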

Since we assume that unlabeled target-domain data is available, the second technique could follow a bootstrapping approach that automatically adds unlabeled target-domain examples to the training set, as proposed in [5].

One major challenge in domain adaptation is that we do not have enough training data from the target domain, so the trained model may not represent the target-domain data distribution. In this project, we therefore want to explore how to leverage unlabeled target-domain data in a bootstrapping manner to address this distribution bias and make the final model more adaptable to the target domain.
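
The sketch below shows one generic way such a bootstrapping loop could look (a standard self-training scheme, not the exact procedure of [5]): in each round the current model labels the target-domain pool, and only its most confident predictions are added to the training set before retraining. Calling bootstrap(source_texts, source_labels, target_texts) returns the final model and vectorizer, which can then be applied to held-out target reviews.

    # Generic self-training loop over unlabeled target-domain reviews.
    import numpy as np
    from scipy.sparse import vstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def bootstrap(source_texts, source_labels, target_texts,
                  rounds=5, per_round=10, threshold=0.8):
        vectorizer = TfidfVectorizer().fit(source_texts + target_texts)
        train_x = vectorizer.transform(source_texts)
        train_y = np.array(source_labels)
        pool = list(target_texts)
        model = LogisticRegression().fit(train_x, train_y)

        for _ in range(rounds):
            if not pool:
                break
            probs = model.predict_proba(vectorizer.transform(pool))
            confidence = probs.max(axis=1)
            # Pick the most confident target examples above the threshold.
            order = confidence.argsort()[::-1][:per_round]
            picked = [i for i in order if confidence[i] >= threshold]
            if not picked:
                break
            new_x = vectorizer.transform([pool[i] for i in picked])
            new_y = probs[picked].argmax(axis=1)  # use the model's own predictions as labels
            train_x = vstack([train_x, new_x])
            train_y = np.concatenate([train_y, new_y])
            pool = [text for i, text in enumerate(pool) if i not in set(picked)]
            model = LogisticRegression().fit(train_x, train_y)
        return model, vectorizer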

Related Work

[1] John Blitzer, Mark Dredze, Fernando Pereira. Biographies, Bollywood, Boom-Boxes and Blenders: Domain Adaptation for Sentiment Classification. Proc. 45th Ann. Meeting of the Assoc. Computational Linguistics, pp. 432-439, 2007.

[2] Hal Daumé III. Frustratingly Easy Domain Adaptation. Proc. 45th Ann. Meeting of the Assoc. Computational Linguistics, pp. 256-263, June 2007.

[3] John Blitzer, Ryan McDonald, Fernando Pereira. Domain Adaptation with Structural Correspondence Learning. Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, July 22-23, 2006, Sydney, Australia.

[4] Jing Jiang, Chengxiang Zhai. Instance Weighting for Domain Adaptation in NLP. Proc. 45th Ann. Meeting of the Assoc. Computational Linguistics, pp. 264-271, June 2007.

[5] Dan Wu, Wee Sun Lee, Nan Ye, Hai Leong Chieu. Domain Adaptive Bootstrapping for Named Entity Recognition. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3, August 6-7, 2009, Singapore.