Difference between revisions of "Multilingual Sentiment Analysis in Microblogs"

From Cohen Courses

Revision as of 22:04, 8 October 2012

Team members

Project Summary

Most of the work done on microblogs (e.g. Twitter) has focused on processing English-language messages. However, according to [1], only approximately 40% of Twitter messages are posted in English. Ignoring the non-English messages may bias the results of an analysis of a given topic. For instance, an analysis of customer satisfaction with a product based only on English messages might overlook issues such as support for non-native customers.

In this project, we analyse user sentiment during the 2012 Olympic Games period from two sources: Twitter and Sina Weibo. The goal is to analyse, for a multitude of topics, whether the aggregate sentiment over the Olympic Games period in Twitter correlates with that in Weibo. If there is a strong divergence between the aggregate sentiments over a period, we will investigate the reasons behind that divergence.

Dataset

A Twitter dataset of 1M sentences per day is available internally to CMU students.

To obtain the Weibo corpus, we will use the search API provided by Weibo to crawl messages from the specified period.

To estimate the aggregate sentiment, we plan to use the method described in O'Connor et al. (ICWSM 2010), which relies on a list of words and their prior polarities. The English list will be taken from the Subjectivity Lexicon available at [2]. For the Chinese lexicon, we can project the English words into Chinese using a bilingual dictionary. This strategy was explored in [3], with reasonable results, so we hope that the noise introduced by the projection will not have a strong negative impact on the aggregate sentiment of Weibo messages.
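The lexicon projection and the word-list labelling can be sketched as follows. This is only a minimal illustration under stated assumptions, not the final implementation: the function names, the toy lexicon and the toy bilingual dictionary are hypothetical, messages are assumed to be already tokenized (Chinese text would first need word segmentation), and dropping Chinese words whose translations disagree on polarity is one possible way to reduce projection noise, not something prescribed by [3].

```python
def project_lexicon(english_lexicon, bilingual_dict):
    """Project English prior polarities onto Chinese words.

    english_lexicon: {english_word: "positive" | "negative"}
    bilingual_dict:  {english_word: [chinese_word, ...]}
    """
    votes = {}
    for en_word, polarity in english_lexicon.items():
        for zh_word in bilingual_dict.get(en_word, []):
            votes.setdefault(zh_word, set()).add(polarity)
    # Assumption: keep only words whose translations agree on one polarity.
    return {zh: next(iter(p)) for zh, p in votes.items() if len(p) == 1}


def message_polarity(tokens, lexicon):
    """Return (is_positive, is_negative) for a tokenized message.

    A message counts as positive if it contains any positive word and as
    negative if it contains any negative word, so it can be both.
    """
    labels = {lexicon[t] for t in tokens if t in lexicon}
    return ("positive" in labels, "negative" in labels)


# Toy data, for illustration only.
english_lexicon = {
    "good": "positive",
    "great": "positive",
    "proud": "positive",
    "bad": "negative",
    "arrogant": "negative",
}
bilingual_dict = {
    "good": ["好"],
    "great": ["好", "伟大"],
    "proud": ["骄傲"],
    "bad": ["坏"],
    "arrogant": ["骄傲"],  # same word as "proud": ambiguous, so dropped
}

chinese_lexicon = project_lexicon(english_lexicon, bilingual_dict)
print(chinese_lexicon)
print(message_polarity(["今天", "好"], chinese_lexicon))
```

In this toy example the ambiguous word 骄傲 receives both polarities and is excluded, which shows the kind of noise the projection can introduce.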

Task

The main goal of this project is to analyse the sentiment related to different types of topics associated with the 2012 Olympic Games in Twitter and Weibo, and to correlate them over the period of the Games. The task is divided into the following steps:

  • First, we need to detect the messages that are relevant to each topic. We will do this simply by filtering messages against a manually specified set of keywords. Examples of topics include a given athlete, a sport, a country, or an event (such as the opening ceremony).
  • Afterwards, we will aggregate the messages by day, estimate the aggregate sentiment (the ratio of positive to negative messages), and plot the sentiment over the Olympic Games period (using kernel smoothing).
  • The next step will depend on the conclusions we obtain. We will check whether the general sentiment within a topic is correlated between the two microblogs. If there is a clear correlation, we will look for topics and periods of time where the correlation is less apparent and investigate the cause. Otherwise, we will examine factors that could explain the lack of correlation (such as censorship, patriotism, etc.).
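The steps above can be sketched end to end as follows. This is a rough sketch under several assumptions: messages are given as (day index, token list) pairs, the function names are hypothetical, a Gaussian kernel is used for smoothing and the Pearson coefficient for correlation (the proposal fixes neither choice), and days without negative messages are simply skipped.

```python
import math
from collections import defaultdict


def daily_ratio(messages, keywords, lexicon):
    """Positive/negative message ratio per day for one topic.

    messages: iterable of (day_index, tokens); keywords: set of topic words.
    """
    pos = defaultdict(int)
    neg = defaultdict(int)
    for day, tokens in messages:
        if not keywords.intersection(tokens):
            continue  # message does not mention the topic
        labels = {lexicon[t] for t in tokens if t in lexicon}
        pos[day] += "positive" in labels
        neg[day] += "negative" in labels
    # Assumption: days with no negative messages are skipped (ratio undefined).
    return {d: pos[d] / neg[d] for d in neg if neg[d] > 0}


def gaussian_smooth(series, bandwidth=1.0):
    """Kernel-weighted average of {day: value} at each observed day."""
    smoothed = {}
    for d in series:
        weights = {e: math.exp(-0.5 * ((d - e) / bandwidth) ** 2) for e in series}
        total = sum(weights.values())
        smoothed[d] = sum(w * series[e] for e, w in weights.items()) / total
    return smoothed


def pearson(a, b):
    """Pearson correlation of two {day: value} series over their shared days."""
    days = sorted(set(a) & set(b))
    xs = [a[d] for d in days]
    ys = [b[d] for d in days]
    if len(days) < 2:
        return float("nan")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")  # a constant series has no defined correlation
    return cov / (sx * sy)


# Toy data, for illustration only.
lexicon = {"great": "positive", "bad": "negative"}
tweets = [
    (1, ["bolt", "great"]), (1, ["bolt", "bad"]), (1, ["tennis", "great"]),
    (2, ["bolt", "great"]), (2, ["bolt", "great"]), (2, ["bolt", "bad"]),
]
twitter_ratio = daily_ratio(tweets, {"bolt"}, lexicon)
print(twitter_ratio)
print(gaussian_smooth(twitter_ratio, bandwidth=1.0))
```

The same functions would be applied to the Weibo messages with the Chinese keywords and lexicon, and pearson would then compare the two smoothed series for each topic.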