M. Hurst and K. Nigam. Retrieving topical sentiments from online document collections.

From Cohen Courses

This is a paper reviewed for Social Media Analysis 10-802 in Fall 2012.

Citation

@inproceedings{hurst2004retrieving,
  title={Retrieving topical sentiments from online document collections},
  author={Hurst, M.F. and Nigam, K.},
  booktitle={Proceedings of SPIE},
  volume={5296},
  pages={27--34},
  year={2004}
}

Online version

Hurst, Matthew F., and Kamal Nigam. "Retrieving topical sentiments from online document collections." Proceedings of SPIE. Vol. 5296. 2004.

Summary

In this work the authors automatically extract public opinion on certain topics from online text. This is one of the earlier works to combine topicality and polarity, i.e., identifying polar sentences about a topic. The authors argue for fusing the two by using statistical machine learning to identify topics and shallow NLP techniques to determine polarity. They rely on a locality assumption: a polar sentence that contains the topic expresses polarity about that topic. To build their dataset, they retrieved a collection of online messages from specific domains and manually annotated a subset. They present a lightweight approach that pairs a linear classifier with a shallow, rule-based NLP system to identify polar sentences about topics.

Dataset Description

16,616 sentences from 982 messages drawn from online resources (Usenet, online message boards, etc.) about a given topic. The authors manually annotated 250 randomly selected sentences with the following labels:

  • Polarity Identification: positive, negative
  • Topic Identification: Topical, Out-of-Topic
  • Polarity and Topic Identification: positive-correlated, negative-correlated, positive-uncorrelated, negative-uncorrelated.

Task Description and Evaluation

Polarity Identification:

The authors use a rule-based approach to polarity identification, with the following steps:

  • Tokenization followed by POS tagging using a statistical tagger trained on Penn Treebank data.
  • Semantic polarity tagging using a manually created polarity lexicon tuned for the domain.
  • Chunking using simple POS-tag patterns.
  • Rule-based syntactic patterns and negation rules to modify polarity and associate it with topics.
  • The syntactic patterns are: predicative modification (it is good), attributive modification (a good car), equality (it is a good car), and polar clause (it broke my car). The negation rule is verbal attachment (it is not good, it isn't good).
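The paper does not spell out its rules in full, so the following is only a minimal sketch of the general idea: a polarity lexicon assigns signs to words, and a crude negation check flips the sign when a negation token appears shortly before a polar word. The lexicon entries and the three-token window are illustrative choices, not the authors' domain-tuned resources.

```python
# Minimal sketch of lexicon-based sentence polarity with negation handling.
# Lexicon and negation list are illustrative, not the paper's actual lexicon.

POLARITY_LEXICON = {"good": 1, "great": 1, "love": 1,
                    "bad": -1, "broke": -1, "hate": -1}
NEGATIONS = {"not", "n't", "never", "isn't"}

def sentence_polarity(tokens):
    """Return +1, -1, or 0 for a tokenized sentence.

    A polar word's sign is flipped if a negation token appears shortly
    before it (a crude stand-in for the verbal-attachment rule).
    """
    score = 0
    for i, tok in enumerate(tokens):
        sign = POLARITY_LEXICON.get(tok.lower(), 0)
        if sign:
            # Look back a small window for a negation ("it is not good").
            window = {t.lower() for t in tokens[max(0, i - 3):i]}
            if window & NEGATIONS:
                sign = -sign
            score += sign
    return (score > 0) - (score < 0)

print(sentence_polarity("it is a good car".split()))  # 1
print(sentence_polarity("it is not good".split()))    # -1
print(sentence_polarity("it broke my car".split()))   # -1
```

The real system operates on chunks and syntactic patterns rather than a flat token window, which is what lets it attach polarity to a specific topic mention instead of the sentence as a whole.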

Performance: Their system achieved a precision of 82% at detecting positive polarity and 80% at detecting negative polarity.

Topic Identification

Here the authors identify the topicality of a sentence using a text-classification approach. They use a variant of the Winnow classifier, an online learning algorithm for learning a linear decision boundary. Since they don't have enough sentence-level annotations for topicality, they train the classifier on message-level labels (topical or not topical) using a standard bag-of-words representation.
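The paper uses a Winnow variant whose exact parameters are not reproduced here; as a rough illustration of the family of algorithms, the classic Winnow keeps positive weights, predicts with a linear threshold, and updates multiplicatively only on mistakes. The promotion factor and threshold below are the textbook defaults, not the authors' settings.

```python
# Illustrative sketch of the classic Winnow update rule (the paper uses
# a variant; parameters here are textbook defaults, not theirs).

class Winnow:
    def __init__(self, n_features, threshold=None, alpha=2.0):
        self.w = [1.0] * n_features         # weights start at 1
        self.theta = threshold if threshold is not None else n_features
        self.alpha = alpha                  # promotion/demotion factor

    def predict(self, x):                   # x: binary bag-of-words vector
        return sum(w * xi for w, xi in zip(self.w, x)) >= self.theta

    def update(self, x, label):
        """Mistake-driven online update on one (x, label) example."""
        if self.predict(x) == label:
            return                          # no mistake, no change
        factor = self.alpha if label else 1.0 / self.alpha
        for i, xi in enumerate(x):
            if xi:                          # only active features change
                self.w[i] *= factor

# Toy usage: learn that feature 0 alone signals "topical".
clf = Winnow(4)
for _ in range(5):
    clf.update([1, 0, 0, 0], True)
    clf.update([0, 1, 1, 1], False)
print(clf.predict([1, 0, 0, 0]))  # True
print(clf.predict([0, 1, 1, 1]))  # False
```

The multiplicative update makes Winnow attractive for bag-of-words text, where the feature space is huge but only a few words per example are active.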

During the testing phase:

  • Classify each message as topical or non-topical using the trained classifier.
  • If the message is topical, classify each of its sentences with the same classifier.
  • If a sentence is topical, perform semantic analysis to determine its polarity.
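The three-step cascade above can be sketched as a single function. The classifier and polarity step are passed in as callables; `topic_clf` and `polarity_fn` are placeholder names standing in for the trained Winnow model and the rule-based polarity system, not API names from the paper.

```python
# Sketch of the test-time cascade: message filter -> sentence filter ->
# polarity analysis. `topic_clf` and `polarity_fn` are illustrative
# stand-ins for the trained classifier and the rule-based system.

def topical_polar_sentences(message_sentences, topic_clf, polarity_fn):
    """Return (sentence, polarity) pairs for topical, polar sentences."""
    whole_message = " ".join(message_sentences)
    if not topic_clf(whole_message):        # step 1: message-level filter
        return []
    results = []
    for sent in message_sentences:
        if topic_clf(sent):                 # step 2: sentence-level filter
            pol = polarity_fn(sent)         # step 3: polarity analysis
            if pol != 0:
                results.append((sent, pol))
    return results

# Toy usage with dummy components.
is_topical = lambda text: "car" in text
polarity = lambda s: 1 if "good" in s else 0
msg = ["this car is good", "the weather is nice"]
print(topical_polar_sentences(msg, is_topical, polarity))
```

Note that the message-level filter lets most non-topical text be discarded cheaply before the sentence-level classifier and the more expensive shallow parsing run.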

Performance: They achieved a message-level precision of 85.4% and a sentence-level precision of 79%.

Combining Polarity and Topical Models

  • 982 messages (16,616 sentences) classified as topical.
  • 1,262 of the 16,616 sentences were predicted to be topical; of these, 316 were labeled positive polarity and 81 negative polarity.
  • A precision of 72% was observed, i.e., the percentage of polar sentences containing the topic whose polarity is actually about the topic.
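Putting the reported counts together: the ~286 correct sentences below is an implied figure (72% of the 316 + 81 polar predictions), not a number stated in the paper.

```python
# Arithmetic on the reported counts; the "correct" figure is implied,
# not stated in the paper.
positive, negative = 316, 81
polar_topical = positive + negative        # sentences flagged polar and topical
precision = 0.72
print(polar_topical)                       # 397
print(round(precision * polar_topical))    # ~286 judged correct about the topic
```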

Findings and Discussions

  • A lightweight approach to identifying polar sentences about topics, with satisfactory results. It combines topicality and polarity detection using a linear classifier and a simple grammatical model. A grammatical model is desirable when obtaining enough labeled data for learning algorithms is difficult.
  • As one of the earlier attempts at polarity detection, the paper makes various assumptions and leaves several areas for improvement.
  • The dataset is quite small owing to the manual labor of annotation. Several modeling assumptions (restricting the domain, using a hand-tuned semantic polarity dictionary, the training/testing disparity in the topic classifier, etc.) may bias the system.
  • Co-reference resolution could improve topic detection by resolving anaphora (e.g., pronouns) to the entities they refer to.
  • As the paper itself notes, manual overhead prevented annotating topicality at the sentence level, so message-level annotations were used to train the topic classifier. At test time this classifier was applied to much shorter sentences, an undesirable training/testing disparity.
  • Another constraint is the domain-specific, hand-tuned semantic vocabulary for polarity labeling, which restricts use in other domains or makes porting require further manual effort. Automatic lexicon construction would be desirable.

Related papers

A follow-up work by the authors.

  1. Glance, Natalie, et al. "Deriving marketing intelligence from online discussion." Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 2005.

Some of the earlier works cited in the paper.

  1. B. Pang, L. Lee, and S. Vaithyanathan. "Thumbs up? Sentiment classification using machine learning techniques." In Proceedings of EMNLP 2002, 2002.
  2. J. Wiebe, T. Wilson, and M. Bell. "Identifying collocations for recognizing opinions." In Proceedings of the ACL/EACL '01 Workshop on Collocation, Toulouse, France, July 2001.

Study plan

The paper does not require any major background knowledge.