Difference between revisions of "Multilingual Sentiment Analysis in Microblogs"

From Cohen Courses
== Comments ==
 
 
Seems like a neat project.  A couple of questions to think about:
 
 
 
* Topics will be search queries.  Will it be difficult to come up with parallel queries?
 
* Can you think of any way to do a quantitative evaluation of the results?
 
* There are a lot of non-English users of Twitter, and non-US users, so it might not be a nice homogeneous community that's a counterpoint to Weibo.
 
 
 
--[[User:Wcohen|Wcohen]] 20:15, 10 October 2012 (UTC)
 
 
 
  
 
== Team members ==
 
 
== Project Summary ==
 
  
Most of the work done on microblogs (e.g. Twitter) has focused on processing English-language messages. However, [http://www.mediabistro.com/alltwitter/twitter-language-share_b16109] reports that only approximately 40% of Twitter messages are posted in English. Ignoring the remaining messages may skew the results of an analysis of a given topic. For instance, an analysis of customer satisfaction with a product based only on English messages might overlook issues such as support for non-native customers.
  
In this project, we analyse user sentiment during the 2012 Olympic Games period from two sources, Twitter and Sina Weibo. The goal is to analyse, for a multitude of topics, whether the aggregate sentiment over the Olympic Games period on Twitter correlates with that on Weibo. Where there is a strong divergence between the aggregate sentiments over a period, we will look for the reasons behind that divergence.
  
 
== Dataset ==
 
  
A daily Twitter dataset of 1M sentences (each day) is available internally to CMU students.  
  
To obtain the Weibo corpus, we will use the search API provided by Weibo to crawl the messages posted in the specified period.
  
To estimate the aggregate sentiment, we plan to use the method described in [http://malt.ml.cmu.edu/mw/index.php/OConnor_et._al.,_ICWSM_2010 O'Connor et al., ICWSM 2010], which relies on a list of words and their prior polarities. For English, this list will be taken from the Subjectivity Lexicon available at [http://www.cs.pitt.edu/mpqa/subj_lexicon.html]. For Chinese, we can project the English words onto Chinese words using a bilingual dictionary. This strategy was explored before in [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.76.1774], with reasonable results. We therefore hope that the noise introduced by the projection will not have a strong negative impact on the aggregate sentiment of Weibo messages.
  
== Task ==
  
The main goal of this project is to analyse the sentiment related to different types of topics associated with the 2012 Olympic Games on Twitter and Weibo, and to correlate them over the period of the Games. The task is divided into several steps:
  
* First, we need to detect the messages that are relevant to each topic. We will do this simply by filtering messages against a manually specified set of keywords. Examples of topics include a given athlete, a sport, a country, or an event (such as the opening ceremony).
  
* Afterwards, we will aggregate the messages by day, estimate the aggregate sentiment (the ratio between positive and negative messages), and plot the sentiment over the Olympic Games period (using kernels for smoothing).
  
* The following step will depend on the conclusions we obtain. We will check whether the general sentiment within a topic is correlated between the two microblogs. If there is a clear correlation, we will look for topics and periods of time where the correlation is less apparent and investigate the cause. Otherwise, we will examine factors that could explain the lack of correlation (such as censorship, patriotism, etc.).
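The aggregation step above can be sketched as follows. This is only an illustration: it assumes each message has already been labelled `pos` or `neg`, that days are consecutive integers, and that a Gaussian kernel is an acceptable choice of smoothing kernel. The function names and the `bandwidth` parameter are hypothetical.

```python
import math
from collections import defaultdict


def daily_sentiment_ratio(messages):
    """messages: iterable of (day, label) pairs, label being 'pos' or 'neg'.

    Returns {day: positive_count / negative_count}. Days without any
    negative message are skipped to avoid dividing by zero (an assumption).
    """
    counts = defaultdict(lambda: [0, 0])
    for day, label in messages:
        if label == "pos":
            counts[day][0] += 1
        elif label == "neg":
            counts[day][1] += 1
    return {day: pos / neg for day, (pos, neg) in sorted(counts.items()) if neg > 0}


def gaussian_smooth(series, bandwidth=1.0):
    """Smooth a list of daily values with a normalised Gaussian kernel."""
    smoothed = []
    for i in range(len(series)):
        weights = [math.exp(-0.5 * ((i - j) / bandwidth) ** 2) for j in range(len(series))]
        smoothed.append(sum(w * v for w, v in zip(weights, series)) / sum(weights))
    return smoothed
```

The smoothed Twitter and Weibo series for a topic could then be compared day by day.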

Revision as of 15:00, 16 October 2012

Project changed to "bitext extraction from Weibo"

Team members

Project Summary

Parallel corpora are a valuable resource for machine translation. They are used to train translation models in statistical machine translation, to build bilingual dictionaries and, most importantly, to evaluate translation systems. However, state-of-the-art machine translation systems (such as Google Translate) are not well suited to translating short messages. One of the main reasons is the use of colloquial language (He ain 't about that “ team no sleep ” life . Gotcha DJ Irie . photobombed . . . love - Dwayne Wade). More importantly, research on the translation of short messages is scarce, and we believe one reason is that there is no gold standard for evaluating the translation quality of such messages.

In this project, we will show how to build a Mandarin-English parallel corpus from Weibo messages. We leverage the fact that some users translate their own messages from Chinese to English and vice versa. For instance, pop stars such as Snoop Dogg post messages in English together with their translations in Mandarin (Shout out 2 Kelly Monaco on DWTS! Good lucc. Keep em bouncn! U got it! - 为Dancing with the Stars的Kelly Monaco加油!祝你好运。舞翻全场吧!你行的!). Although the number of such users is extremely small, given the sheer volume of messages on Weibo (8 times larger than Twitter), we are confident that we can build a gold standard of 1000 sentence pairs. Furthermore, we expect to be able to extract more than 1M sentence pairs for building translation models. Such a resource could substantially advance the field of short message translation.

Dataset

To obtain the Weibo corpus, we will use the search API provided by Weibo to crawl messages.

Task

There are three components in this work: extraction of messages, detection of parallel data, and sentence alignment.

First, we will have to build an API client to retrieve messages. This raises the question of how best to find users who are likely to write parallel messages. One option is to check whether users who translate their messages live in the US and therefore need to post in multiple languages; another is to find one user who writes parallel messages and check whether their friends/followees show the same behavior.
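The second option, expanding from one known user through their followees, could be sketched as a breadth-first search. This is a minimal illustration and not tied to the real Weibo API: `get_followees` and `posts_parallel` are hypothetical callables supplied by the caller, and `max_users` is an assumed budget on how many users are inspected.

```python
from collections import deque


def find_parallel_posters(seed, get_followees, posts_parallel, max_users=1000):
    """Breadth-first search from a seed user over the followee graph.

    get_followees(user) -> iterable of users   (hypothetical, e.g. an API wrapper)
    posts_parallel(user) -> bool               (hypothetical, e.g. a message filter)
    Only users who themselves post parallel messages are expanded further.
    """
    found = []
    seen = {seed}
    queue = deque([seed])
    inspected = 0
    while queue and inspected < max_users:
        user = queue.popleft()
        inspected += 1
        if not posts_parallel(user):
            continue
        found.append(user)
        for followee in get_followees(user):
            if followee not in seen:
                seen.add(followee)
                queue.append(followee)
    return found
```

Expanding only through users who pass the check keeps the crawl small, at the cost of missing parallel posters who are reachable only through non-parallel ones.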

Second, we will have to filter the collected messages to discard those that are not parallel. This can be done using heuristics and bilingual dictionaries. First, we can check the ratio between Chinese characters and English words. We can also use a dictionary to check how many words on the two sides map to each other.
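Both heuristics can be sketched in a few lines. The sketch below makes several assumptions that are not part of the proposal: Chinese characters are approximated by the CJK Unified Ideographs range, English words by runs of ASCII letters, the thresholds are arbitrary placeholders, and `lexicon` is a hypothetical dictionary mapping a lowercase English word to its Chinese translations.

```python
import re

# Assumption: the CJK Unified Ideographs block approximates "Chinese characters".
HAN = re.compile(r"[\u4e00-\u9fff]")
# Assumption: runs of ASCII letters approximate "English words".
EN_WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def script_counts(message):
    """Return (number of Chinese characters, number of English words)."""
    return len(HAN.findall(message)), len(EN_WORD.findall(message))


def looks_parallel(message, min_ratio=0.5, max_ratio=4.0, min_units=3):
    """Heuristic 1: both scripts are present and in a plausible length ratio.

    The threshold values are placeholders and would need tuning on real data.
    """
    han, en = script_counts(message)
    if han < min_units or en < min_units:
        return False
    return min_ratio <= han / en <= max_ratio


def dictionary_hits(english_words, chinese_text, lexicon):
    """Heuristic 2: count English words with a listed translation in the Chinese text."""
    hits = 0
    for word in english_words:
        translations = lexicon.get(word.lower(), ())
        if any(t in chinese_text for t in translations):
            hits += 1
    return hits
```

Note that a Chinese translation may itself contain English names (as in the Snoop Dogg example), so the word count on the English side is only approximate until the message has been split.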

The main problem with the second heuristic is that the boundary between the source and target sentences is not defined (or not always defined in the same way). Thus, the second step is tightly coupled to the third step, where we need to find the best boundary at which to split the message into its source side and target side.
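One simple way to search for the boundary is to try every split position and score how cleanly it separates the two scripts. The sketch below uses script purity as a stand-in for a dictionary-based score, which is an assumption made for brevity; `best_split` and its return format are hypothetical.

```python
import re

HAN = re.compile(r"[\u4e00-\u9fff]")   # assumed range for Chinese characters
LATIN = re.compile(r"[A-Za-z]")        # assumed proxy for English text


def best_split(message):
    """Return (score, index, english_first) for the best split, or None.

    The message is cut into message[:index] and message[index:]. The score is
    the number of Latin letters on the English side plus the number of Chinese
    characters on the Chinese side; both orders are tried. On ties, the
    earliest index wins.
    """
    if len(message) < 2:
        return None
    han = [1 if HAN.match(c) else 0 for c in message]
    lat = [1 if LATIN.match(c) else 0 for c in message]
    total_han, total_lat = sum(han), sum(lat)
    best = None
    left_han = left_lat = 0
    for i in range(1, len(message)):
        left_han += han[i - 1]
        left_lat += lat[i - 1]
        en_first = left_lat + (total_han - left_han)   # English left, Chinese right
        zh_first = left_han + (total_lat - left_lat)   # Chinese left, English right
        for score, flag in ((en_first, True), (zh_first, False)):
            if best is None or score > best[0]:
                best = (score, i, flag)
    return best
```

In a fuller system the purity score would likely be combined with the dictionary overlap between the two candidate sides, so that detection and splitting are decided jointly.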

Evaluation

We will choose 1000 messages and split the source and target sides manually. We will then measure the precision and recall of our system on the detection of parallel messages. Given the large volume of messages, we believe the system should be tuned for precision when building a test set. For the training set, statistical models tend to be tolerant to noise, so precision is less important.
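For reference, precision and recall on the detection task can be computed as below. This is a minimal sketch that assumes messages are identified by ids and that both the system output and the manual annotation are given as collections of the ids judged parallel; the function name is hypothetical.

```python
def precision_recall(predicted, gold):
    """predicted, gold: iterables of message ids labelled as parallel.

    Precision = correct detections / all detections.
    Recall    = correct detections / all truly parallel messages.
    Empty denominators yield 0.0 by convention (an assumption).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Tuning for precision then means raising the detection thresholds until few non-parallel messages pass, accepting that recall drops.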

Furthermore, we should test whether a model built from the parallel data can obtain better results than state-of-the-art systems trained on huge amounts of data, such as Google Translate.