Can predicate-argument structures be used for contextual opinion retrieval from blogs?

From Cohen Courses

This is a paper reviewed for Social Media Analysis 10-802 in Fall 2012.

Citation

author    = {Sylvester O. Orimaye and
             Saadat M. Alhashmi and
             Eu-Gene Siew},
title     = {Can predicate-argument structures be used for contextual opinion retrieval from blogs?},
journal   = {World Wide Web},
year      = {2011},
ee        = {http://rd.springer.com/article/10.1007/s11280-012-0170-8}

Online Version

Can predicate-argument structures be used for contextual opinion retrieval from blogs?

Main Idea

This paper examines the use of syntactic structures for sentiment analysis of text. Instead of using the frequency of certain keywords or word distance, as is traditionally done, the authors leverage predicate-argument structures to attach sentiment-bearing words to their objects. They use standard topic and mention extraction techniques like tf-idf, and a synonym/hyponym matrix structure derived from [Rijsbergen_1977 Rijsbergen]'s Binary Independence Model to identify sentiment-bearing phrases.
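As a refresher on the tf-idf weighting mentioned above, here is a minimal sketch; the toy documents and the exact smoothing choices are illustrative, not the paper's formulation.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return per-document {term: tf-idf} maps for tokenized docs."""
    n = len(docs)
    df = Counter()                    # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)             # raw term frequency in this doc
        scores.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                       for t in tf})
    return scores

docs = [["china", "proposed", "regulations"],
        ["china", "signed", "treaty"],
        ["film", "documentary", "review"]]
weights = tf_idf(docs)
# "regulations" outweighs "china" in doc 0, since "china" appears in two docs
```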

They compute predicate-argument structures from CCG parses of training data and compare them to predicate-argument structures of input queries, returning results that have similar structure but with a question word filled in by another argument.

They then combine that information with two other factors, a relevance score and a subjectivity score, using a linear model.
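The linear combination can be sketched as follows; the weights below are invented for illustration, since the paper's actual coefficients are not reproduced here.

```python
def combined_score(structure_sim, relevance, subjectivity,
                   w=(0.4, 0.3, 0.3)):
    """Linearly combine the three per-sentence scores.

    w is a hypothetical weight vector; the paper learns/tunes its own.
    """
    return w[0] * structure_sim + w[1] * relevance + w[2] * subjectivity

score = combined_score(0.8, 0.6, 0.9)   # -> 0.77 with the toy weights
```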

Dataset

The system was trained on TREC blog data, using heuristics to extract only English-language blogs. They tested on two particular years of TREC data: 2006 and 2008.

For each document retrieved, they then removed all tags and other markup, and broke it into sentences using the LingPipe sentence model. They modified the model to detect omitted sentences and sentences without boundary punctuation.

Furthermore, they used query formulation techniques to remove functional, as opposed to content-bearing, parts of queries. For example, they modified the query Provide opinion of the film documentary "March of the Penguins" to simply Film documentary "March of the Penguins". This allowed many more hits with their predicate-argument structures.

Methodology

The authors claim that standard methodologies for sentiment extraction require 'ad-hoc' techniques, and they seek to perform the task in a more principled way. As such, they parse every sentence in their data with a full syntactic parser, in particular a CCG parser. This allows them to easily create predicate-argument structures for verbs, overcoming many challenges in natural language, such as long-range dependencies, in the process. They can then match the extracted predicate-argument structures against opinion queries to find sentences that mention the same topic as the query, with similar structure.

In this model, each word is tagged with a category, which can be simple or complex. A simple category might be 'NP' or 'S'. A complex category might be S\NP, which represents something that would be a sentence if it had an NP to its left, such as a verb phrase. Another example of a complex category is S\NP/NP, representing a word that would form a complete sentence with one NP to its left and another to its right, like a transitive verb. They use a grammar model that comes with their CCG parser, which is trained on news data, to automatically parse their sentences. While they throw out some sentences in their data that are not well-formed, and hence not parseable, they report a parsing success rate of over 90%.
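The category combination rules above can be illustrated with a toy sketch (this is not the authors' parser, just the two standard CCG application rules):

```python
def apply_forward(left, right):
    """Forward application: X/Y followed by Y yields X."""
    if "/" in left:
        x, y = left.rsplit("/", 1)
        if y == right:
            return x.strip("()")
    return None

def apply_backward(left, right):
    """Backward application: Y followed by X\\Y yields X."""
    if "\\" in right:
        x, y = right.rsplit("\\", 1)
        if y == left:
            return x.strip("()")
    return None

# "China proposed regulations": the transitive verb has category (S\NP)/NP.
vp = apply_forward("(S\\NP)/NP", "NP")   # "proposed regulations" -> S\NP
s = apply_backward("NP", vp)             # "China" + verb phrase  -> S
```

Here the verb first consumes its object NP to the right, yielding a verb phrase S\NP, which then consumes the subject NP to its left to form a sentence S.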

To compare the structure of a query against their training data, they describe a novel algorithm. Given a query, such as 'regulations that China proposed', they first extract its predicate-argument structure Q. In this case, the predicate is "proposed" and the arguments are "China" and "regulations". Note that the parser was able to determine that China is the first argument (i.e. the subject), despite it occurring later in the query. Suppose they also have a sentence hypothesized to be relevant, such as "China proposed regulations on tobacco". That predicate-argument structure S would again have predicate "proposed", with arguments "China" and "regulations on tobacco". They model similarity between each term in Q and S using a PLSA topic model to measure term co-occurrences. To compute the overall similarity they use the Jaccard Similarity Coefficient (JSC), which they claim is more efficient for sparse texts.
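A minimal sketch of the Jaccard comparison on the example above; reducing each predicate-argument structure to a bag of tokens is my simplification, and the PLSA term-similarity step is omitted.

```python
def jaccard(a, b):
    """Jaccard Similarity Coefficient between two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def pas_similarity(q, s):
    """q, s: (predicate, [arguments]) tuples; compare their token sets."""
    q_tokens = {q[0], *(w for arg in q[1] for w in arg.split())}
    s_tokens = {s[0], *(w for arg in s[1] for w in arg.split())}
    return jaccard(q_tokens, s_tokens)

query = ("proposed", ["China", "regulations"])
sent = ("proposed", ["China", "regulations on tobacco"])
sim = pas_similarity(query, sent)   # 3 shared tokens of 5 total -> 0.6
```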

They wish, however, to extract only sentences that are highly subjective; so far their approach will find both objective and subjective sentences. As such, they turn to other works, such as Annotating expressions of opinions and emotions in language (Wiebe, Wilson, and Cardie 2005) and Automatic generation of lexical resources for opinion mining: models, algorithms and applications (Esuli 2008).

Results

The authors found that their new approach greatly improved upon the best runs on the TREC Blog06 and Blog08 test sets. On Blog06, they outperform the best run by over 11% in MAP, 6.8% in R-Prec, and 10% in P@10. On Blog08, they beat the best run by 8% in MAP, 12% in R-Prec, and 2.5% in P@10.

Related Work