Can predicate-argument structures be used for contextual opinion retrieval from blogs?

From Cohen Courses
Revision as of 00:34, 7 November 2012 by Austinma (talk | contribs) (→‎Dataset)

This is a paper reviewed for Social Media Analysis 10-802 in Fall 2012.

Citation

author    = {Sylvester O. Orimaye and
              Saadat M. Alhashmi and
              Eu-Gene Siew},
 title     = {Can predicate-argument structures be used for contextual opinion retrieval from blogs?},
 booktitle = {DOI},
 year      = {2011},
 ee        = {http://rd.springer.com/article/10.1007/s11280-012-0170-8}

Online Version

Can predicate-argument structures be used for contextual opinion retrieval from blogs?

Main Idea

This paper examines the use of syntactic structures for sentiment analysis of text. Rather than relying on keyword frequency or word distance, as is traditionally done, the authors leverage predicate-argument structures to attach sentiment-bearing words to their objects. They use standard topic and mention extraction techniques such as tf-idf, and a synonym/hyponym matrix structure derived from Rijsbergen's (1977) Binary Independence Model to identify sentiment-bearing phrases.

They compute predicate-argument structures from CCG parses of training data and compare them to predicate-argument structures of input queries, returning results that have similar structure but with a question word filled in by another argument.
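The matching step can be illustrated with a minimal sketch. It assumes a predicate-argument structure is a predicate plus an ordered tuple of arguments, and that the query marks the slot left open by a question word with a wildcard. The `PredArg` type, the `WILDCARD` marker, and the exact-match rule are all hypothetical simplifications for illustration, not the paper's representation or similarity measure.

```python
from typing import NamedTuple, Optional

WILDCARD = "?"  # hypothetical marker for the slot left open by a question word


class PredArg(NamedTuple):
    predicate: str
    args: tuple  # argument strings, in order


def match(query: PredArg, candidate: PredArg) -> Optional[dict]:
    """Return {slot index: filler} if the candidate fits the query, else None.

    Assumption: predicates and non-wildcard arguments must match exactly.
    """
    if query.predicate != candidate.predicate:
        return None
    if len(query.args) != len(candidate.args):
        return None
    fillers = {}
    for i, (q_arg, c_arg) in enumerate(zip(query.args, candidate.args)):
        if q_arg == WILDCARD:
            fillers[i] = c_arg  # the candidate supplies the missing argument
        elif q_arg != c_arg:
            return None
    return fillers


query = PredArg("like", (WILDCARD, "film"))
sentence = PredArg("like", ("critics", "film"))
print(match(query, sentence))
```

A real system would presumably relax the exact-match rule, for example through the synonym/hyponym structure mentioned above.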

They then combine that information with two other factors, a relevance score and a subjectivity score, using a linear model.
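A linear combination of this kind can be sketched as follows. The function name, the argument names, and the equal default weights are assumptions made for illustration; the review does not give the weights the authors used.

```python
def combined_score(structure, relevance, subjectivity,
                   weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three per-document scores.

    The weights are hypothetical placeholders, not values from the paper.
    """
    w_structure, w_relevance, w_subjectivity = weights
    return (w_structure * structure
            + w_relevance * relevance
            + w_subjectivity * subjectivity)


# Example with made-up scores and weights.
print(combined_score(1.0, 0.5, 0.0, weights=(0.5, 0.25, 0.25)))
```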

Dataset

The system was trained on TREC blog data, using heuristics to extract only English-language blogs. They tested on two particular years of TREC data: 2006 and 2008.

For each document retrieved, they then removed all tags and other markup, and broke it into sentences using the LingPipe sentence model. They modified the model to detect omitted sentences and sentences without boundary punctuation.

Furthermore, they used query formulation techniques to remove functional, as opposed to content-bearing, parts of queries. For example, they modified the query Provide opinion of the film documentary "March of the Penguins" to simply Film documentary "March of the Penguins". This allowed many more hits with their predicate-argument structures.

Methodology

- Sentiment Extraction

Each web document is treated as a bag of words. The Harvard Inquirer and SentiWordNet are used to obtain sentiment scores for the individual words in a post. The sentiment score of the entire post is taken to be the average of the sentiment scores of the words in the document. The sentiment attributes are the positivity, negativity and objectivity of a post.
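The averaging step can be sketched as below. The toy lexicon and its values are made up for illustration and do not come from the Harvard Inquirer or SentiWordNet. Skipping words that are missing from the lexicon is also an assumption, since the text does not say how such words are handled.

```python
# Toy lexicon: word -> (positivity, negativity, objectivity). Values are made up.
TOY_LEXICON = {
    "great": (0.75, 0.0, 0.25),
    "awful": (0.0, 0.75, 0.25),
    "film": (0.0, 0.0, 1.0),
}


def post_sentiment(text, lexicon=TOY_LEXICON):
    """Average (positivity, negativity, objectivity) over the words of a post.

    Assumption: words missing from the lexicon are ignored.
    """
    words = [w for w in text.lower().split() if w in lexicon]
    if not words:
        return (0.0, 0.0, 0.0)
    n = len(words)
    return tuple(sum(lexicon[w][i] for w in words) / n for i in range(3))


print(post_sentiment("great film"))
```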

The paper also proposes extracting sentiment from emoticons. Emoticon sentiment is assumed to be binary (+1/-1) and is assigned to the post directly, based on the frequency of the emoticons in the post. A simple technique is to treat :), :D, :P, :p and ;) as positive, and :( and D: as negative.
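A minimal sketch of the emoticon heuristic, using the emoticon lists given above. Two details are assumptions: emoticons are recognized only as whitespace-separated tokens, and the post score is the count of positive emoticons minus the count of negative ones.

```python
POSITIVE_EMOTICONS = {":)", ":D", ":P", ":p", ";)"}
NEGATIVE_EMOTICONS = {":(", "D:"}


def emoticon_score(text):
    """Net emoticon sentiment: +1 per positive emoticon, -1 per negative one.

    Assumption: emoticons appear as separate whitespace-delimited tokens.
    """
    tokens = text.split()
    positive = sum(1 for t in tokens if t in POSITIVE_EMOTICONS)
    negative = sum(1 for t in tokens if t in NEGATIVE_EMOTICONS)
    return positive - negative


print(emoticon_score("loved it :) :D"))
```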

The authors define the average sentiment of a user as the baseline and then compute the deviation of each individual post from it as the polarity of the post. Each web domain is treated as an author, and the baseline sentiment for the domain is obtained by averaging over the sentiment of its individual posts.
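The baseline-and-deviation computation can be sketched as follows, assuming each post is given as a (domain, sentiment score) pair. The function name and input layout are hypothetical.

```python
from collections import defaultdict


def polarity_by_deviation(posts):
    """posts: list of (domain, sentiment score) pairs.

    Returns each post's deviation from its domain's mean score, in input order.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for domain, score in posts:
        totals[domain] += score
        counts[domain] += 1
    # Baseline per domain: mean sentiment over that domain's posts.
    baseline = {d: totals[d] / counts[d] for d in totals}
    return [score - baseline[domain] for domain, score in posts]


posts = [("a.com", 0.25), ("a.com", 0.75), ("b.org", 0.5)]
print(polarity_by_deviation(posts))
```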

- Identification of Cascades and its Topology

The data is modeled as a graph. Each node represents a blog post and carries its sentiment score as an attribute. A directed edge from u to v indicates that post u contains a hyperlink citing v. Nodes with no outgoing edges represent posts that start the flow of sentiment and are referred to as cascade initiators. The topology of a cascade is obtained by applying breadth-first search (BFS) from the cascade initiators.
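The cascade construction can be sketched as below. Because an edge points from the citing post to the cited post, this sketch assumes the BFS walks edges in reverse, from each initiator to the posts that cite it. It also assumes every edge endpoint appears in `nodes`. The function name and return shape are hypothetical.

```python
from collections import defaultdict, deque


def cascade_depths(nodes, edges):
    """edges: (u, v) pairs meaning post u links to (cites) post v.

    Returns (initiators, depth), where depth maps each reachable post to its
    BFS distance from the nearest cascade initiator.
    """
    cited_by = defaultdict(list)  # v -> posts that cite v
    out_degree = {n: 0 for n in nodes}
    for u, v in edges:
        cited_by[v].append(u)
        out_degree[u] += 1

    # Initiators cite nothing, so they have out-degree zero.
    initiators = [n for n in nodes if out_degree[n] == 0]
    depth = {n: 0 for n in initiators}
    queue = deque(initiators)
    while queue:
        node = queue.popleft()
        for citing in cited_by[node]:
            if citing not in depth:
                depth[citing] = depth[node] + 1
                queue.append(citing)
    return initiators, depth


nodes = ["a", "b", "c", "d"]
edges = [("b", "a"), ("c", "b"), ("d", "a")]
print(cascade_depths(nodes, edges))
```

The maximum value in `depth` gives one possible notion of the overall depth of a cascade.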

Findings

The paper explores the flow in the sentiment across hyperlink networks. The main findings of the paper are as follows.

- Post Level Analysis

  • Nodes are strongly influenced by their immediate neighbors.
    • Given an edge from u to v, u is referred to as the parent of v, and v is referred to as the child of u. The analysis in the paper shows that the subjectivity of a child can be attributed to the subjectivity of its parent. The use of subjective language in the parent post leads to a higher sentiment score in the child post.
  • Emoticon tagging provides a rough heuristic for sentiment analysis, but the bag-of-words model provides a much richer understanding of the sentiment.

- Cascade Level Analysis

  • Sentiment in deeper cascades exhibits four distinct phases over time.
    • At the cascade initiator, language is close to the baseline.
    • Positivity and negativity heat up quickly.
    • The sentiment cools off fairly quickly.
    • The language returns to the mild baseline.
  • Shallow cascades have a mild and short-lived sentiment exhibition.
    • Shallow cascades start off with slight sentiment and then die out quickly. One explanation is that these posts tend to be relatively tame, and so do not attract the attention of more posters.

To conclude, the position of a post in the cascade topology and the overall depth of the cascade are important factors in determining the sentiment of a post.

Related Work