Yang et al 2007 Fusion approach to finding opinions in blogosphere
Citation
Yang, K., N. Yu, A. Valerio, H. Zhang, and W. Ke. 2007. Fusion approach to finding opinions in Blogosphere. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM), Boulder, Colorado.
Summary
This paper outlines an approach to finding opinionated blog posts on a given topic. The problem is two-fold: first, a retrieval task to find blog posts relevant to the topic; second, an opinion mining task to keep only those posts that express a polar sentiment about the topic. The authors first search the blog collection with a vector-space search engine, the SMART system, using Lnu weights for document terms and ltc weights for query terms. The initial result set is re-ranked once for topical precision, and then re-ranked again based on "opinion scores". Four opinion-scoring modules are used, each producing its own ranked list of documents (a simplified code sketch of these heuristics follows the list):
- Opinion Term Module
  - counts opinion terms in a document (an opinion term lexicon is provided)
- Rare Term Module
  - counts non-dictionary, low-frequency terms that are sometimes used to express opinion, such as "soooo good"
- IU Module
  - counts phrases containing "I" or "you" that express opinion (e.g. "I believe", "good for you")
- Adjective-Verb Module
  - looks at the density of potentially subjective elements, i.e. subjective adjectives and verbs
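As a rough illustration of how such heuristics can be computed, here is a minimal sketch with toy word lists and phrase patterns; the lexicons and patterns below are stand-ins for illustration only, not the resources used by the authors.

```python
import re

# Toy lexicons for illustration; the paper uses its own opinion lexicon
# and lists of subjective adjectives and verbs.
OPINION_TERMS = {"good", "bad", "love", "hate", "great", "terrible"}
SUBJECTIVE_ADJ_VERBS = {"amazing", "awful", "believe", "feel", "think", "wonderful"}
ENGLISH_DICTIONARY = {"so", "good", "the", "movie", "was", "i", "believe", "this", "is"}
IU_PATTERNS = [r"\bi (believe|think|feel)\b", r"\bgood for you\b"]

def opinion_term_score(tokens):
    # Opinion Term Module: count lexicon hits in the document.
    return sum(t in OPINION_TERMS for t in tokens)

def rare_term_score(tokens):
    # Rare Term Module: count non-dictionary terms (e.g. "soooo").
    return sum(t not in ENGLISH_DICTIONARY for t in tokens)

def iu_score(text):
    # IU Module: count "I"/"you" opinion phrases.
    return sum(len(re.findall(p, text)) for p in IU_PATTERNS)

def adjective_verb_score(tokens):
    # Adjective-Verb Module: density of potentially subjective adjectives/verbs.
    return sum(t in SUBJECTIVE_ADJ_VERBS for t in tokens) / max(len(tokens), 1)

text = "I believe this movie was soooo good"
tokens = text.lower().split()
print(opinion_term_score(tokens), rare_term_score(tokens),
      iu_score(text.lower()), adjective_verb_score(tokens))
```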
These four ranked result sets generated by the modules are then combined by a fusion module. The fusion module scores each document in three ways; the general weighted form, which uses all of the quantities defined below, is:

$$FS_i = n_i \sum_{k} w_k \cdot NS_{ik}$$

where $FS_i$ is the fusion score of document $i$, $w_k$ is the weight of system $k$, $NS_{ik}$ is the normalized score of document $i$ from system $k$, and $n_i$ is the number of systems that retrieve document $i$.
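Below is a minimal sketch of this kind of score fusion, assuming each module has already produced normalized scores; the module weights, document IDs, and score values are made up for illustration and are not taken from the paper.

```python
# Hedged sketch: weighted fusion of normalized per-module opinion scores,
# following FS_i = n_i * sum_k(w_k * NS_ik).

def fuse(module_scores, module_weights):
    """module_scores: {module_name: {doc_id: normalized_score}}
       module_weights: {module_name: weight}
       Returns {doc_id: fusion_score}."""
    weighted_sum = {}
    hits = {}  # n_i: how many modules retrieved each document
    for name, scores in module_scores.items():
        w = module_weights[name]
        for doc_id, ns in scores.items():
            weighted_sum[doc_id] = weighted_sum.get(doc_id, 0.0) + w * ns
            hits[doc_id] = hits.get(doc_id, 0) + 1
    return {d: hits[d] * s for d, s in weighted_sum.items()}

if __name__ == "__main__":
    scores = {
        "opinion_term": {"d1": 0.9, "d2": 0.4},
        "rare_term":    {"d1": 0.2, "d3": 0.7},
        "iu":           {"d2": 0.8},
        "adj_verb":     {"d1": 0.5, "d2": 0.6, "d3": 0.1},
    }
    weights = {"opinion_term": 0.4, "rare_term": 0.1, "iu": 0.2, "adj_verb": 0.3}
    print(fuse(scores, weights))  # documents retrieved by more modules get boosted
```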
Experimental Results
The system was run on the TREC Blog06 dataset produced by the Blog track of the Text REtrieval Conference (TREC) in 2006. Query topics and relevance judgements were created by trained assessors from NIST. The results for the system were as follows:
| Mean Average Precision | Mean Reciprocal Precision | Precision @ 10 |
|---|---|---|
| 0.2052 | 0.2881 | 0.468 |
TREC is run as a competitive evaluation; of the groups that submitted runs to the 2006 Blog track, the authors placed first in MAP and MRP and second in P@10.
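For reference, ranking measures like those in the table can be computed from a ranked result list and a set of relevance judgements. Here is a minimal sketch of average precision (averaged over topics to give MAP) and precision at 10, using made-up judgements rather than TREC data.

```python
# Hedged sketch of the ranking measures reported above, on toy data.

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(d in relevant for d in ranked[:k]) / k

def average_precision(ranked, relevant):
    """Average of precision values at each rank where a relevant document appears,
    divided by the total number of relevant documents."""
    hits, precisions = 0, []
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

# MAP is the mean of average precision over all query topics.
runs = {
    "topic1": (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    "topic2": (["d5", "d4"], {"d4"}),
}
ap = [average_precision(ranked, rel) for ranked, rel in runs.values()]
print("MAP:", sum(ap) / len(ap))
print("P@10 (topic1):", precision_at_k(runs["topic1"][0], runs["topic1"][1], k=10))
```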
Comments & Criticism
The group managed to get strong results using a handful of heuristics that seem cheap to compute. However, because they are heuristics, they lack a solid theoretical foundation. The paper also lacked an analysis of which heuristics were the most helpful.
Study Plan
For background on information retrieval ranking models, read: