Revision as of 21:13, 7 November 2012

This is a paper reviewed for Social Media Analysis 10-802 in Fall 2012.

Citation

 author = {Sylvester O. Orimaye and Saadat M. Alhashmi and Eu-Gene Siew},
 title  = {Can predicate-argument structures be used for contextual opinion retrieval from blogs?},
 year   = {2011},
 ee     = {http://rd.springer.com/article/10.1007/s11280-012-0170-8}

Online Version

Can predicate-argument structures be used for contextual opinion retrieval from blogs?

Main Idea

This paper examines the use of syntactic structures for sentiment analysis of text. Instead of using frequency of certain keywords or word distance as is traditionally done, the authors instead leverage predicate-argument structures to attach sentiment-bearing words to their objects. They use standard topic and mention extraction techniques like tf-idf, and use a synonym/hyponym matrix structure derived from [Rijsbergen_1977 Rijsbergen]'s Binary Independence Model to identify sentiment-bearing phrases.
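The tf-idf weighting mentioned above can be sketched as follows. This is a minimal toy implementation of the standard formula, not the paper's actual extraction pipeline; the document lists and tokenization are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Toy tf-idf over tokenized documents (lists of words).

    tf is term frequency normalized by document length; idf is
    log(N / document frequency). Words appearing in every document
    get a score of zero.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per document
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: (tf[w] / len(doc)) * math.log(n / df[w])
                       for w in tf})
    return scores
```

A word like "china" that occurs in every document scores zero, while rarer content-bearing words score higher.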

They compute predicate-argument structures from CCG parses of training data and compare them to predicate-argument structures of input queries, returning results that have similar structure but with a question word filled in by another argument.

They then combine that information with two other factors, a relevance score and a subjectivity score, using a linear model.
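The linear combination described above can be sketched as follows. The weight values and score names are placeholder assumptions for illustration; the paper's actual weights are not reproduced here.

```python
def combined_score(pas_sim, relevance, subjectivity,
                   w_pas=0.5, w_rel=0.3, w_subj=0.2):
    """Linearly combine the three evidence scores for a candidate sentence.

    pas_sim:      predicate-argument structure similarity to the query
    relevance:    topical relevance score
    subjectivity: subjectivity score
    The weights here are hypothetical; the paper tunes its own combination.
    """
    return w_pas * pas_sim + w_rel * relevance + w_subj * subjectivity
```

With all three scores at their maximum of 1.0, the combined score is 1.0 under these weights.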

Dataset

The system was trained on TREC blog data, using heuristics to extract only English-language blogs. They tested on two particular years of TREC data: 2006 and 2008.

For each document retrieved, they then removed all tags and other markup, and broke it into sentences using the LingPipe sentence model. They modified the model to detect omitted sentences and sentences without boundary punctuation.

Furthermore, they used query formulation techniques to remove functional, as opposed to content-bearing, parts of queries. For example, they modified the query Provide opinion of the film documentary "March of the Penguins" to simply Film documentary "March of the Penguins". This allowed many more hits with their predicate-argument structures.
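That reformulation step can be sketched as stripping a leading instruction phrase from the query. The regex pattern below handles only the example given and is an illustrative assumption, not the paper's actual query-formulation technique.

```python
import re

def reformulate(query):
    """Strip a leading instruction phrase such as "Provide opinion of the",
    leaving only the content-bearing part of the query.

    The pattern is a guess covering the paper's example; a real system
    would use a fuller list of instruction phrases.
    """
    return re.sub(r'^\s*provide\s+opinion\s+of\s+the\s+', '',
                  query, flags=re.I)
```

Applied to the example query, this keeps the quoted film title intact while dropping the instruction prefix.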

Methodology

The authors claim that standard methodologies for sentiment extraction require 'ad-hoc' techniques, and they seek to perform the task in a more principled way. As such, they parse every sentence in their data with a full syntactic parser. In particular, they use a CCG parser. This allows them to easily create predicate-argument structures for verbs, overcoming many challenges in natural language, such as long-range dependencies, in the process. They can then match the extracted predicate-argument structures against opinion queries to find sentences that mention the same topic as the query, with similar structure.

In this model, each word is tagged with a category, which can be simple or comprehensive. A simple category might be 'NP' or 'S'. A comprehensive category might be S\NP, which represents something that would be a sentence if it had an NP to its left, such as a verb phrase. Another example of a comprehensive category might be S\NP/NP, representing a word that would form a complete sentence with one NP to its left and another to its right, like a transitive verb. They use a grammar model that comes with their CCG parser, which is trained on news data, to automatically parse their sentences with this CCG model. While they throw out some sentences in their data that are not well-formed, and hence not parseable, they report a parsing success rate of over 90%.
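The category mechanics can be illustrated with a toy sketch of CCG function application over string-encoded categories. This is not the parser the authors use; it ignores ambiguity and real CCG derivation constraints, and exists only to show how a transitive verb's category combines with its arguments.

```python
# "S\NP" consumes an NP to its LEFT (backward application);
# "(S\NP)/NP" first consumes an NP to its RIGHT (forward application).

def forward_apply(left_cat, right_cat):
    """X/Y  Y  =>  X"""
    if left_cat.endswith("/" + right_cat):
        result = left_cat[: -(len(right_cat) + 1)]
        return result.strip("()")  # drop wrapping parens, e.g. "(S\\NP)"
    return None

def backward_apply(left_cat, right_cat):
    """Y  X\\Y  =>  X"""
    if right_cat.endswith("\\" + left_cat):
        result = right_cat[: -(len(left_cat) + 1)]
        return result.strip("()")
    return None

# "China proposed regulations":  NP  (S\NP)/NP  NP
vp = forward_apply("(S\\NP)/NP", "NP")   # verb + object -> verb phrase
s = backward_apply("NP", vp)             # subject + verb phrase -> sentence
```

The verb's category consumes the object to its right to yield a verb phrase (S\NP), which then consumes the subject to its left to yield a full sentence.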

To compare the structure of a query against their training data, they describe a novel algorithm. Given a query, such as 'regulations that China proposed', they first extract the predicate-argument structure Q. In this case, the predicate is "proposed" and the arguments are "China" and "regulations". Note that the parser was able to determine that China is the first argument (i.e. the subject) of the sentence, despite it occurring later in the query. Suppose they also have a sentence that is hypothesized to be relevant, such as "China proposed regulations on tobacco". That predicate-argument structure S would again have predicate "proposed" and arguments "China" and "regulations on tobacco". They model similarity between each term in Q and S using a PLSA topic model to measure term co-occurrences. To compute the overall similarity they use the Jaccard Similarity Coefficient (JSC), which they claim is more efficient for sparse texts.
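The Jaccard Similarity Coefficient itself can be sketched as below. As a simplifying assumption, exact string match stands in for the paper's PLSA-based term similarity, and the term sets are illustrative.

```python
def jaccard(a, b):
    """Jaccard Similarity Coefficient between two term sets:
    |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Terms from the query Q and the candidate sentence S in the example
q_terms = {"proposed", "china", "regulations"}
s_terms = {"proposed", "china", "regulations", "on", "tobacco"}
sim = jaccard(q_terms, s_terms)  # 3 shared terms out of 5 total -> 0.6
```

Because only set membership matters, the coefficient is cheap to compute on sparse texts, which is the efficiency claim made above.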

They wish, however, to extract only sentences that are highly subjective. So far their approach will find both objective and subjective sentences. As such, they turn to other works, such as [Wiebe_Wilson_Cardie_2005 Annotating expressions of opinions and emotions in language] and [Esuli_2008 Automatic generation of lexical resources for opinion mining: models, algorithms and applications].

Findings

The paper explores the flow of sentiment across hyperlink networks. The main findings of the paper are as follows.

- Post Level Analysis

  • Nodes are strongly influenced by their immediate neighbors.
    • Given an edge from u to v, u is referred to as the parent of v, and v is referred to as the child of u. The analysis in the paper shows that the subjectivity of a child is attributable to the subjectivity of its parent. The use of subjective language in the parent post leads to a higher sentiment score in the child post.
  • Emoticon tagging provides a rough heuristic for sentiment analysis, but the bag-of-words model gives a much richer understanding of the sentiment.
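The emoticon heuristic compared above can be sketched as a binary (+1/-1) lookup over a post's tokens. The two emoticon lists are illustrative examples of positive and negative markers, not an exhaustive inventory.

```python
# Illustrative emoticon lists; a real system would use a larger inventory.
POSITIVE = {":)", ":D", ":P", ":p", ";)"}
NEGATIVE = {":(", "D:"}

def emoticon_score(post):
    """Binary emoticon heuristic: each positive emoticon contributes +1,
    each negative one -1, summed over the tokens of the post."""
    score = 0
    for tok in post.split():
        if tok in POSITIVE:
            score += 1
        elif tok in NEGATIVE:
            score -= 1
    return score
```

This captures only explicit emoticon markers, which is why the bag-of-words model gives a richer picture of a post's sentiment.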

- Cascade Level Analysis

  • Sentiment in deeper cascades exhibits 4 distinct phases over time.
    • At the cascade initiator, language is close to the baseline.
    • Positivity and negativity heat up quickly.
    • The sentiment cools off fairly quickly.
    • It returns to the mild baseline.
  • Shallow cascades exhibit mild and short-lived sentiment.
    • Shallow cascades start off with slight sentiment support and then die out quickly. A possible explanation is that these posts tend to be relatively tame, and so do not attract the attention of more posters.

To conclude, the position of a post in the cascade topology and the overall depth of the cascade play an important role in determining the sentiment of a post.

Related Work