Comparative Study: Sentiment Analysis Using an Automated Pattern-Based Approach vs. a Single Structured Model

From Cohen Courses
Revision as of 06:29, 6 November 2012 by Ydalal (talk | contribs) (→‎Comparison)

Papers Compared

  1. Enhanced sentiment learning using Twitter hashtags and smileys Davidov ...
  2. Structured Models for Fine-to-Coarse Sentiment Analysis Ryan ...

Comparison

Both papers address the same problem, sentiment classification, but their approaches are completely different.

Problem

Davidov and team address sentiment classification by leveraging a ready-to-use corpus, Twitter: they use Twitter hashtags and smileys as labels to train a KNN model, so the process requires no manual labeling of training data. They also use interesting "pattern-based" features that are language independent and provide the most significant improvement over the rest of the features. This approach is limited to the document level (treating each tweet as a document).
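The idea of training without manual labels can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the `MARKERS` mapping, the toy tweets, and the Jaccard word-overlap similarity are invented, and the sketch uses plain word sets rather than the paper's pattern-based features or its actual KNN distance function.

```python
from collections import Counter

# Hypothetical marker-to-label mapping, invented for this sketch.
MARKERS = {"#happy": "positive", ":)": "positive",
           "#sad": "negative", ":(": "negative"}

def auto_label(tweet):
    """Return (word_set, label), or None if the tweet has no single label."""
    tokens = tweet.lower().split()
    labels = {MARKERS[t] for t in tokens if t in MARKERS}
    if len(labels) != 1:
        return None  # no marker, or conflicting markers: skip the tweet
    # Remove the markers so the label cannot leak into the features.
    features = frozenset(t for t in tokens if t not in MARKERS)
    return features, labels.pop()

def jaccard(a, b):
    """Word-overlap similarity (an assumed stand-in distance)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def knn_predict(train, tweet, k=3):
    """Majority vote among the k most similar training tweets."""
    query = frozenset(t for t in tweet.lower().split() if t not in MARKERS)
    ranked = sorted(train, key=lambda ex: jaccard(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

raw_tweets = [
    "great coffee this morning #happy",
    "love this sunny day :)",
    "what a great day with friends :)",
    "missed my train again #sad",
    "my phone broke again :(",
    "stuck in traffic again #sad",
]
train = [ex for ex in map(auto_label, raw_tweets) if ex is not None]

print(knn_predict(train, "what a great sunny day"))
print(knn_predict(train, "my train broke again"))
```

The point of the sketch is only that the hashtags and smileys play the role of the annotator: no tweet in `raw_tweets` was labeled by hand.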

Ryan and team approach sentiment classification from a different perspective altogether. Rather than exploiting a particular corpus, they propose a new structured modeling approach to improve sentiment classification accuracy at different levels of granularity.

Big Idea

Davidov and team leverage the vast source of sentiment information available in the Twitter corpus. They identify several sentiment classes from hashtags and use "pattern-based" features to improve sentiment classification accuracy over previous work.

Ryan and team use a novel structured model that transfers inference from the sentiment of subcomponents to that of their parents and vice versa, so that sentiment is predicted jointly at these levels. They report very significant improvements in sentiment classification accuracy as a result.
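A toy sketch can show why joint prediction across levels differs from labeling each sentence on its own. The scores, the `agree_bonus` weight, and the scoring function below are all invented for illustration and are not the paper's learned model; the exhaustive search is used only because the example is tiny, and a practical structured model would rely on dynamic programming instead.

```python
from itertools import product

LABELS = ("pos", "neg")

def independent_decode(sentence_scores):
    """Label each sentence on its own, ignoring the document."""
    return [max(LABELS, key=lambda y: s[y]) for s in sentence_scores]

def joint_decode(sentence_scores, agree_bonus=0.3):
    """Pick the document label and sentence labels that maximize one score.

    The total score is the sum of the local sentence scores plus a bonus
    (an assumed weight) for every sentence that agrees with the document.
    """
    best, best_score = None, float("-inf")
    for doc in LABELS:
        for sents in product(LABELS, repeat=len(sentence_scores)):
            score = sum(s[y] for s, y in zip(sentence_scores, sents))
            score += agree_bonus * sum(y == doc for y in sents)
            if score > best_score:
                best, best_score = (doc, list(sents)), score
    return best

# Hypothetical local scores for a three-sentence review; the middle
# sentence is nearly ambiguous when looked at in isolation.
scores = [
    {"pos": 2.0, "neg": 0.0},
    {"pos": 0.4, "neg": 0.5},
    {"pos": 1.5, "neg": 0.0},
]

print(independent_decode(scores))
print(joint_decode(scores))
```

In this example the ambiguous middle sentence is labeled negative in isolation, but the joint decoder labels it positive because the surrounding evidence makes the whole document positive. That is the kind of information flow between levels described above.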

Dataset

Ryan and team use a customer reviews dataset, whereas Davidov and team use a Twitter dataset.

  • Customer reviews differ from tweets in that they contain no hashtags and very few smileys. A review dataset could still be evaluated with Davidov's model using the same set of features, but only at the document level.
  • Conversely, Ryan and team's model could be applied to the Twitter corpus, since it is capable of handling multiple classes.

Discussion

Other

The first paper provides good feature engineering, but it neglects two practical concerns: overlapping classes and the ignoring of minority neighbors. The paper itself shows that hashtags and smileys overlap to a large extent, so I would suggest trying multilabel classification instead of multiclass classification.
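The difference between the two settings can be sketched with a hypothetical neighbor vote. The neighbor label sets and the `min_share` threshold are invented for illustration, and neither function is taken from the paper.

```python
from collections import Counter

def multiclass_vote(neighbor_labels):
    """Return one label: the most frequent among the neighbors."""
    counts = Counter(l for labels in neighbor_labels for l in set(labels))
    return counts.most_common(1)[0][0]

def multilabel_vote(neighbor_labels, min_share=0.4):
    """Return every label carried by at least min_share of the neighbors."""
    k = len(neighbor_labels)
    counts = Counter(l for labels in neighbor_labels for l in set(labels))
    return {l for l, c in counts.items() if c / k >= min_share}

# Hypothetical labels of the 5 nearest neighbors of one tweet; several
# neighbors carry both a hashtag label and a smiley label.
neighbors = [
    {"#happy", ":)"},
    {"#happy"},
    {"#happy", ":)"},
    {"#happy", ":)"},
    {"#sarcasm"},
]

print(multiclass_vote(neighbors))
print(multilabel_vote(neighbors))
```

With these neighbors the multiclass vote keeps only `#happy`, while the multilabel vote also keeps `:)`, which most neighbors share. Lowering `min_share` would additionally let minority labels such as `#sarcasm` through.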

Additional Questions

  1. How much time did you spend reading the (new, non-wikified) paper you summarized?
    • 2.5 hours
  2. How much time did you spend reading the old wikified paper?
    • 1 hour
  3. How much time did you spend reading the summary of the old paper?
    • 15 minutes
  4. How much time did you spend reading background material?
    • None
  5. Was there a study plan for the old paper?
    • Yes
    1. If so, did you read any of the items suggested by the study plan, and how much time did you spend reading them?
      • Yes, I glanced over 2/3 papers to understand the key concepts. It was a good starting point.
      • 45 minutes
  6. Give us any additional feedback you might have about this assignment.
    1. The wikified paper's summary was quite useful to start with as it helped in understanding the big picture immediately and noting down the key areas to look for in the paper.
      • For example, the binary classification was not immediately clear from the summary, and the evaluation with human judges was new to me when I read the paper. I also had a doubt about overlapping hashtags and labels, which the paper explained.
    2. Some additional key features that I had to look for in the paper: the KNN distance function, the neighbor selection criteria, and the feature selection process.
    3. I think it is useful to have a good summary, and it is unavoidable that a summary omits many details. But the current wikified summary was missing some important features and a good discussion of the pros and cons of the approach.