Esuli and Sebastiani LREC 2006

From Cohen Courses
Revision as of 07:58, 28 September 2012 by Gmontane (talk | contribs)

This is a paper discussed in Social Media Analysis 10-802 in Spring 2010.

Citation

SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining, Andrea Esuli and Fabrizio Sebastiani, 2006, In Proceedings of the 5th Conference on Language Resources and Evaluation (LREC 06)

Online version

SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining

Summary

This paper builds on WordNet to create a lexical resource for opinion mining. The authors add a three-part layer on top of the existing WordNet ontology, assigning Objectivity, Positivity and Negativity scores to each WordNet synset (a set of terms with the same meaning).

The approach used in the paper is semi-supervised learning, where a small hand-labeled set of terms is used to seed an automatic process which generates more labeled data. The authors use WordNet lexical relationships to expand both the positive and the negative sets of terms. The remaining terms are labeled as objective when they are also not marked as positive or negative in the General Inquirer lexicon.
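The seed-expansion step can be sketched on a toy example. The relation tables below are hand-made and hypothetical (they are not taken from WordNet), and the sketch assumes that synonym-like relations preserve polarity while antonymy flips it; the paper's exact choice of relations may differ, and the General Inquirer check for the objective set is left out.

```python
from typing import Dict, Iterable, Set, Tuple

# Toy relation tables (hypothetical; a real run would query WordNet instead).
SYNONYMS: Dict[str, Set[str]] = {
    "good": {"nice", "fine"},
    "nice": {"pleasant"},
    "bad": {"poor", "awful"},
}
ANTONYMS: Dict[str, Set[str]] = {
    "good": {"bad"},
    "nice": {"nasty"},
    "bad": {"good"},
}


def expand_seeds(
    pos_seeds: Iterable[str], neg_seeds: Iterable[str], iterations: int
) -> Tuple[Set[str], Set[str]]:
    """Grow positive and negative term sets by following lexical relations."""
    pos, neg = set(pos_seeds), set(neg_seeds)
    for _ in range(iterations):
        new_pos, new_neg = set(pos), set(neg)
        for term in pos:
            new_pos |= SYNONYMS.get(term, set())  # same polarity
            new_neg |= ANTONYMS.get(term, set())  # opposite polarity
        for term in neg:
            new_neg |= SYNONYMS.get(term, set())
            new_pos |= ANTONYMS.get(term, set())
        pos, neg = new_pos, new_neg
    return pos, neg


pos, neg = expand_seeds({"good"}, {"bad"}, iterations=2)
print(sorted(pos))
print(sorted(neg))
```

Each additional iteration reaches terms further from the seeds, so the training sets grow larger but are likely to become noisier.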

Given the training datasets, the glosses (dictionary definitions) for each synset are represented as vectors using cosine-normalized tf * idf weighting and are then fed into standard supervised learning algorithms (Rocchio and SVMs) to generate several semi-independent classifiers. The binary outputs of these classifiers are then combined to produce a score between 0 and 1.0 inclusive for objectivity, positivity and negativity.
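As an illustration of the gloss representation, here is a minimal sketch of cosine-normalized tf * idf weighting in plain Python. The tokenized glosses are made up for the example, and the specific variant (raw term count times log(N / df)) is an assumption; the paper may use a different formulation.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical, pre-tokenized glosses (not actual WordNet glosses).
glosses: List[List[str]] = [
    ["having", "desirable", "qualities"],
    ["having", "undesirable", "qualities"],
    ["a", "piece", "of", "furniture"],
]


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse, unit-length tf*idf vector per document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # Cosine normalization: divide by the Euclidean length of the vector.
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


vectors = tfidf_vectors(glosses)
print(cosine(vectors[0], vectors[1]))  # glosses sharing terms: similarity > 0
print(cosine(vectors[0], vectors[2]))  # no shared terms: similarity 0.0
```

Vectors of this kind are what a Rocchio or SVM learner would take as input.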

The entire WordNet dataset is then assigned scores using the trained classifiers.
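The step that turns classifier outputs into scores can be sketched as follows. The sketch assumes that each committee member consists of two binary decisions (positive or not, negative or not), that these are merged into one three-way label, and that a synset's score for a label is the fraction of committee members that chose it. The merging rule and the example votes are illustrative assumptions, not details taken from this summary.

```python
from collections import Counter
from typing import Dict, List, Tuple

LABELS = ("positive", "negative", "objective")


def ternary_label(is_positive: bool, is_negative: bool) -> str:
    """Merge two binary decisions into one three-way label.

    Assumption: agreeing on neither class, or on both, counts as objective.
    """
    if is_positive and not is_negative:
        return "positive"
    if is_negative and not is_positive:
        return "negative"
    return "objective"


def committee_scores(votes: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Score each label by the fraction of committee members that chose it."""
    counts = Counter(ternary_label(p, n) for p, n in votes)
    return {label: counts[label] / len(votes) for label in LABELS}


# Hypothetical (is_positive, is_negative) outputs of eight committee members
# for a single synset.
votes = [(True, False)] * 5 + [(False, False)] * 2 + [(True, True)]
scores = committee_scores(votes)
print(scores)
```

Because every committee member contributes exactly one label, the three scores always sum to 1.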

Results

The authors found that nearly 1/4 of all WordNet words were labeled as non-objective by their classifiers. However, the number of words drops sharply as the degree of non-objectivity increases. Thus, only a relatively small proportion of WordNet terms convey strong sentiment.

As for accuracy, the authors acknowledge that they currently lack the ability to verify the results output by their committee classifier, since the lack of labeled data was what prompted their approach in the first place.

  • As a proxy for determining accuracy, the authors previously compared their labelings to those of the General Inquirer lexicon, as reported in Esuli and Sebastiani EACL 2006.
  • They report that a large-scale manual labeling project is in preparation, which would later allow them to compare their results against a human-generated ground truth.

Tool Visual Output

The authors present a web-based tool [1] that visualizes the relationship between objectivity, positivity and negativity scores for each term. The sum of the three scores is 1, so the results can be represented within a simplex, with the corners representing full objectivity, full positivity or full negativity.

Sentiwordnet.png
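Because the three scores sum to 1, they act as barycentric coordinates, and a point inside a triangle can be computed directly from them. The sketch below is only an illustration: the corner positions (positive at top left, negative at top right, objective at the bottom) are an assumption about the layout, and the function name is hypothetical.

```python
import math
from typing import Tuple

HEIGHT = math.sqrt(3) / 2  # height of an equilateral triangle with side 1

# Assumed corner positions of the simplex in 2D.
POSITIVE_CORNER = (0.0, HEIGHT)
NEGATIVE_CORNER = (1.0, HEIGHT)
OBJECTIVE_CORNER = (0.5, 0.0)


def to_simplex_xy(pos: float, neg: float, obj: float) -> Tuple[float, float]:
    """Map (positivity, negativity, objectivity) to a point in the triangle."""
    if min(pos, neg, obj) < 0 or not math.isclose(pos + neg + obj, 1.0):
        raise ValueError("scores must be non-negative and sum to 1")
    # Weighted average of the three corners, weighted by the scores.
    x = (pos * POSITIVE_CORNER[0] + neg * NEGATIVE_CORNER[0]
         + obj * OBJECTIVE_CORNER[0])
    y = (pos * POSITIVE_CORNER[1] + neg * NEGATIVE_CORNER[1]
         + obj * OBJECTIVE_CORNER[1])
    return x, y


print(to_simplex_xy(0.0, 0.0, 1.0))      # fully objective: the bottom corner
print(to_simplex_xy(0.625, 0.0, 0.375))  # mostly positive, partly objective
```

A fully objective term lands on the objective corner, and a term with mixed scores lands in the interior, closer to the corners with the larger scores.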

Discussion

The paper presents a potentially useful resource in SentiWordNet, which has applications in sentiment analysis tasks. The authors also develop a web-based tool for visualizing the three-part scoring relationship for each term. These tools may be useful, but the true value of the resource can only be measured once its general accuracy is known. Even if comprehensive, the resource might not be useful to researchers if the sentiment scores output by the classifiers do not reflect the true sentiment of terms. Thus, this research represents an interesting direction in which more work needs to be done.

Related papers

Study plan

Some concepts which may aid in understanding this paper