Multilingual Sentiment Analysis in Microblogs
Team members
Project Summary
Most work on microblogs (e.g. Twitter) has focused on processing English-language messages. However, [1] reports that only approximately 40% of Twitter messages are posted in English. Ignoring the remaining messages may distort the results of an analysis of a given topic. For instance, an analysis of customer satisfaction with a product based only on English messages might overlook issues such as support for non-native customers.
In this project, we analyse user sentiment during the 2012 Olympic Games period from two sources: Twitter and Sina Weibo. The goal is to determine, for a multitude of topics, whether the aggregate sentiment on Twitter over the Olympic Games period correlates with that on Weibo. Where the aggregate sentiments diverge strongly over a period, we will investigate the reasons for that divergence.
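As a rough illustration of the comparison step, the sketch below computes a Pearson correlation between two daily sentiment series, one per platform. The choice of Pearson correlation, the function names and the toy values are assumptions made for illustration; the project does not yet fix a correlation measure.

```python
import math

def align_series(a, b):
    """Return the values of two {day: score} dicts on the days they share."""
    days = sorted(set(a) & set(b))
    return [a[d] for d in days], [b[d] for d in days]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical daily sentiment scores for one topic (toy values, not real data).
twitter_scores = {"2012-07-28": 1.0, "2012-07-29": 2.0, "2012-07-30": 3.0}
weibo_scores = {"2012-07-28": 2.0, "2012-07-29": 4.0,
                "2012-07-30": 6.0, "2012-07-31": 1.0}

xs, ys = align_series(twitter_scores, weibo_scores)
correlation = pearson(xs, ys)
```

Only days present in both series are compared, so a gap in one crawl does not shift the other series.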
Dataset
A daily Twitter dataset of 1M sentences (each day) is available internally to CMU students.
To obtain the Weibo corpus, we will use the search API provided by Weibo to crawl messages posted during the Olympic Games period.
To estimate the aggregate sentiment, we plan to use the method described in O'Connor et al. (ICWSM 2010), which relies on a list of words and their prior polarities. The English list will be taken from the Subjectivity Lexicon available at [2]. For the Chinese lexicon, we can project the English words into Chinese using a bilingual dictionary. This strategy was explored in [3], with reasonable results. We therefore hope that the noise introduced by the projection will not have a strongly negative impact on the aggregate sentiment of the Weibo messages.
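The two steps above can be sketched as follows: projecting a polarity lexicon through a bilingual dictionary, and computing a daily ratio of positive to negative messages for a topic. This is a minimal sketch under several assumptions: the word lists and dictionary entries are toy examples rather than the Subjectivity Lexicon, messages are assumed to be already tokenized with spaces (Chinese text would need a word segmenter first), translations that receive conflicting polarities are simply dropped, and the ratio is our reading of the lexicon-based score, not a verified reimplementation of O'Connor et al.

```python
def project_lexicon(lexicon, bilingual):
    """Map {source_word: polarity} to {target_word: polarity}.

    Target words that inherit conflicting polarities are discarded
    (an assumption of this sketch, not a rule taken from [3]).
    """
    projected = {}
    conflicting = set()
    for word, polarity in lexicon.items():
        for translation in bilingual.get(word, []):
            if translation in conflicting:
                continue
            if translation in projected and projected[translation] != polarity:
                del projected[translation]
                conflicting.add(translation)
            else:
                projected[translation] = polarity
    return projected

def daily_sentiment_ratio(messages, topic, lexicon):
    """Return {day: positive_count / negative_count} for messages on a topic.

    A message counts as positive (negative) if it contains at least one
    positive (negative) lexicon word; it may count as both.
    Days with no negative message are skipped to avoid division by zero.
    """
    positive = {w for w, p in lexicon.items() if p == "positive"}
    negative = {w for w, p in lexicon.items() if p == "negative"}
    pos_counts, neg_counts = {}, {}
    for day, text in messages:
        tokens = set(text.lower().split())
        if topic not in tokens:
            continue
        if tokens & positive:
            pos_counts[day] = pos_counts.get(day, 0) + 1
        if tokens & negative:
            neg_counts[day] = neg_counts.get(day, 0) + 1
    return {day: pos_counts.get(day, 0) / count
            for day, count in sorted(neg_counts.items())}

# Toy English lexicon and bilingual dictionary (hypothetical entries).
english_lexicon = {"happy": "positive", "glad": "positive", "sad": "negative"}
bilingual_dict = {"happy": ["高兴", "快乐"], "glad": ["高兴"], "sad": ["难过"]}
chinese_lexicon = project_lexicon(english_lexicon, bilingual_dict)

# Toy (day, text) messages, not real data.
tweets = [
    ("2012-07-28", "happy olympics day"),
    ("2012-07-28", "olympics made me sad"),
    ("2012-07-28", "so glad about the olympics"),
    ("2012-07-29", "sad olympics result"),
    ("2012-07-29", "happy but unrelated"),
]
ratios = daily_sentiment_ratio(tweets, "olympics", english_lexicon)
```

The same `daily_sentiment_ratio` function can be applied to segmented Weibo messages with `chinese_lexicon`, which gives two daily series on the same scale.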