Detection of Ad Hominem attacks in blog and review data

From Cohen Courses
Revision as of 18:10, 6 October 2012 by Gmontane

Task

Use machine-learning and/or probabilistic topic modeling to detect examples of personal insult in blog and product review data. This is a form of opinion mining.

Overview

This project aims to detect ad hominem attacks and personal insults in blog and product review data. A personal insult attacks a person rather than an idea or a feature of a product. The task is challenging because what counts as a verbal attack is subjective, but previous work in this area shows that at least some progress is possible.

Team

George Montañez

Datasets

  • A dataset from the Kaggle.com "Detecting Insults in Social Commentary" competition, consisting of 1,050 insult comments and 2,898 neutral comments.
  • A collection of 30,771 unlabeled blog documents from blogs discussing evolution and anti-evolution.
  • A collection of 294 hand-labeled product review sentences, classified for sentiment (pos/neg/neutral).
  • Bo Pang and Lillian Lee's data from "A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts", which consists of 2,000 multi-sentence movie reviews labeled as positive or negative.

Baseline Method

Given the presence of labeled data, simple logistic regression or naive Bayes classification on a bag-of-words representation will be used to predict whether a sentence is "insulting" (a personal attack) or not.
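As a rough illustration of the baseline, here is a minimal sketch of a multinomial naive Bayes classifier over bag-of-words counts, written with the Python standard library only. The tokenizer, the add-one smoothing, the class labels "insult"/"neutral", and the toy training sentences are all assumptions made for the example; none of them come from the Kaggle data or from the project's actual code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.word_counts = {c: Counter() for c in self.classes}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        # The vocabulary is every word seen in any class during training.
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def log_scores(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for c in self.classes:
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            score = math.log(self.doc_counts[c] / total_docs)  # log prior
            for tok in tokenize(text):
                if tok in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[c][tok] + 1) / denom)
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy, made-up training sentences (not drawn from any of the datasets above).
train_texts = [
    "you are an idiot",
    "you are a pathetic moron",
    "what a stupid clown you are",
    "the battery life is great",
    "i disagree with this argument",
    "the screen is too small",
]
train_labels = ["insult", "insult", "insult", "neutral", "neutral", "neutral"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("you are a moron"))
print(model.predict("the argument is great"))
```

Logistic regression on the same bag-of-words features would replace the per-class word counts with learned weights; the input representation stays the same.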

Proposed Method

I propose using two signals together as a proxy for ad hominem: the presence of negative sentiment within a sentence, and whether the target of the sentence is a person (that is, negative sentiment aimed at persons, not ideas). In addition, I would like to try machine learning based on more advanced features, such as part-of-speech tags and inferred topic models, to build additional classifiers.
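The combination of the two signals could look roughly like the sketch below. It is only a rule-based illustration: the word lists NEGATIVE_WORDS and PERSON_CUES are tiny, hypothetical lexicons invented for this example, and matching person-referring words is a crude stand-in for real target detection, which would more plausibly rely on a parser or named-entity recognizer.

```python
import re

# Hypothetical, tiny lexicons for illustration only; a real system would use
# a full sentiment lexicon and proper target detection.
NEGATIVE_WORDS = {"stupid", "idiot", "moron", "pathetic", "ignorant", "terrible", "awful"}
PERSON_CUES = {"you", "your", "he", "she", "him", "her", "they", "them", "author"}


def tokenize(sentence):
    """Lowercase and split into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def has_negative_sentiment(tokens):
    """True if any token appears in the (assumed) negative lexicon."""
    return any(tok in NEGATIVE_WORDS for tok in tokens)


def targets_person(tokens):
    """True if any token is a person-referring cue word (a rough heuristic)."""
    return any(tok in PERSON_CUES for tok in tokens)


def is_ad_hominem(sentence):
    """Flag a sentence only when both signals are present."""
    tokens = tokenize(sentence)
    return has_negative_sentiment(tokens) and targets_person(tokens)


print(is_ad_hominem("You are an ignorant fool"))
print(is_ad_hominem("The battery life is terrible"))
print(is_ad_hominem("Your argument is stupid"))
```

The third sentence is flagged even though it attacks an argument rather than a person, which shows why simple word co-occurrence is not enough and why identifying the actual target of the sentiment matters.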

Evaluation

  • Quantitative analysis will be performed on the labeled test data by computing precision and recall scores.
  • Qualitative analysis will be performed by running the classification algorithms on the unlabeled data and inspecting the examples of text that the classifiers label as "insult/ad hominem".
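For the quantitative part, precision and recall for the insult class can be computed as in the sketch below. The function name precision_recall, the label strings, and the five-item gold and predicted lists are made up for illustration.

```python
def precision_recall(gold, predicted, positive="insult"):
    """Return (precision, recall) for the class named by `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against division by zero when nothing is predicted or labeled positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
gold = ["insult", "insult", "neutral", "neutral", "insult"]
pred = ["insult", "neutral", "neutral", "insult", "insult"]

p, r = precision_recall(gold, pred)
print(f"precision={p:.3f} recall={r:.3f}")
```

These scores are more informative than accuracy for the Kaggle data: with 1,050 insults among 3,948 comments, a classifier that always predicts "neutral" would reach about 73% accuracy while having zero recall on insults.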

Challenges

  • The subjective nature of personal attack makes this task difficult. Humans can disagree on whether a sentence is insulting or not.
  • The labeled insult data is noisy: on inspection, some insults are not marked as such, which may make the task more difficult.
  • The primary challenge of securing labeled insult data has already been met.

Learning Objectives

The hope is to develop automated methods for identifying objects as complicated as ad hominem attacks in text. I would also like to expand my knowledge of sentiment analysis methods. Furthermore, I am interested in seeing how many sentences, and what types of sentences, are identified as insults in the blog data.

Related Work