Difference between revisions of "Multilingual Sentiment Analysis in Microblogs"

From Cohen Courses
 
Project changed to "Bitext Extraction from Weibo"
 
== Team members ==
 
* [[User:lingwang|Wang Ling]]
 
== Project Summary ==
 
  
Parallel corpora are a valuable resource for Machine Translation. They are used to train translation models in statistical Machine Translation, to build bilingual dictionaries and, most importantly, to evaluate translation systems. However, state-of-the-art machine translation systems (such as Google Translate) are not well suited for translating short messages. One of the main reasons is the use of colloquial language ("He ain't about that “team no sleep” life. Gotcha DJ Irie. photobombed... love - Dwayne Wade"). More importantly, research on the translation of short messages is scarce, and we believe one of the reasons is that there is no gold standard for evaluating the quality of such translations.
  
In this project, we will show how to build a Mandarin-English parallel corpus from Weibo messages. We leverage the fact that some users tend to translate their messages from Chinese to English and vice-versa. For instance, pop stars such as Snoop Dogg post messages in English together with their translations in Mandarin (Shout out 2 Kelly Monaco on DWTS! Good lucc. Keep em bouncn! U got it! - 为Dancing with the Stars的Kelly Monaco加油!祝你好运。舞翻全场吧!你行的!). Although the number of such users is extremely small, given the vast number of messages on Weibo (8 times larger than Twitter), we can certainly build a gold standard of 1000 sentence pairs. Furthermore, we expect to be able to extract more than 1M sentence pairs for training translation models. Such a resource could significantly advance the field of short message translation.
  
 
== Dataset ==
 
  
A Twitter dataset of 1M sentences per day is available internally to CMU students.

To obtain the Weibo corpus, we will use the search API provided by Weibo to crawl messages.

== Task ==

There are three components in this work: extraction of messages, detection of parallel data, and sentence alignment.

First, we will have to build a client for the API to retrieve messages. This raises some questions about the best approach to finding users that potentially write parallel messages. One way is to check whether users who translate their messages live in the US, and thus need to post in multiple languages; another is to find one user who writes parallel messages and check whether his friends/followees show the same behavior.

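
As an illustration of the second strategy, the expansion over friends/followees can be sketched as a breadth-first search. The followee graph and the parallel-posting check below are toy stand-ins (in a real system, both would come from the Weibo API and from the filtering step described next):

```python
from collections import deque

# Toy stand-ins for the real data sources: a hypothetical followee graph
# and a hypothetical set of users known to post parallel messages.
FOLLOWEES = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": [],
}
PARALLEL_POSTERS = {"seed", "b", "c"}

def find_candidate_users(seed, max_users=100):
    """Breadth-first expansion over followees, keeping users that
    appear to post parallel (bilingual) messages."""
    candidates, seen, queue = [], {seed}, deque([seed])
    while queue and len(candidates) < max_users:
        user = queue.popleft()
        if user in PARALLEL_POSTERS:
            candidates.append(user)
            # Only expand from users that already show the behavior,
            # on the assumption that bilingual users follow each other.
            for followee in FOLLOWEES.get(user, []):
                if followee not in seen:
                    seen.add(followee)
                    queue.append(followee)
    return candidates
```

Starting from the seed, this visits "b" and then "c" (reached through "b"), while "a" and "d" are dropped because they never post parallel messages.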
Secondly, we will have to filter the messages we collect, discarding those that are not parallel. This can be done using heuristics and bilingual dictionaries. First, we can check the ratio between Chinese characters and English words. Second, using a dictionary, we can check how many words on the two sides map to each other.

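
The two heuristics can be sketched as follows; the character ranges, the ratio thresholds, and the tiny dictionary are illustrative assumptions, not tuned values:

```python
import re

def char_ratio_ok(text, low=0.3, high=3.0):
    """Heuristic 1: the ratio of Chinese characters to English words
    should be plausible for a message containing both halves."""
    n_zh = len(re.findall(r'[\u4e00-\u9fff]', text))  # CJK ideographs
    n_en = len(re.findall(r'[A-Za-z]+', text))        # English words
    if n_zh == 0 or n_en == 0:
        return False
    return low <= n_zh / n_en <= high

# Toy bilingual dictionary; a real system would load a full
# English-Chinese dictionary.
DICT = {"good": "好", "luck": "运", "you": "你"}

def dict_overlap(text):
    """Heuristic 2: fraction of English words whose dictionary
    translation also appears somewhere in the message."""
    words = [w.lower() for w in re.findall(r'[A-Za-z]+', text)]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in DICT and DICT[w] in text)
    return hits / len(words)
```

A message passing both checks (a sane character ratio and a high overlap score) is kept as a parallel candidate; everything else is discarded.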
The main problem with the second heuristic is that the boundary between the source and the target sentence is not defined (or not always defined in the same way). Thus, the second step is highly correlated with the third step, where we need to find the best boundary at which to split the message into a source side and a target side.
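
A minimal sketch of this joint step, assuming the split point may fall at any character position, scores every candidate boundary by how cleanly it separates the Chinese side from the English side and keeps the best one:

```python
import re

def best_split(message):
    """Try every boundary position and return the split that best
    separates the English side from the Chinese side, scored by how
    'pure' each side is in its expected script (both orientations
    are tried)."""
    def purity(seg, pattern):
        chars = [c for c in seg if not c.isspace()]
        if not chars:
            return 0.0
        return sum(1 for c in chars if re.match(pattern, c)) / len(chars)

    best, best_score = None, -1.0
    for i in range(1, len(message)):
        left, right = message[:i], message[i:]
        # Orientation 1: English first, Chinese second; orientation 2: reverse.
        s1 = purity(left, r'[A-Za-z]') + purity(right, r'[\u4e00-\u9fff]')
        s2 = purity(left, r'[\u4e00-\u9fff]') + purity(right, r'[A-Za-z]')
        score = max(s1, s2)
        if score > best_score:
            best_score, best = score, (left, right)
    return best
```

In practice the dictionary-overlap score would be folded into the boundary score as well, so that detection and alignment are decided together.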
  
== Evaluation ==
  
We will manually select 1000 messages and split their source and target sides by hand. Then, we will test the precision and recall of our system on the detection of parallel messages. Given the large mass of messages, we believe the system should be tuned for precision when building the test set. For the training set, statistical models tend to be tolerant to noise, so precision is less important there.
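
The detection metrics themselves are straightforward; a sketch, assuming messages are identified by ids:

```python
def precision_recall(predicted, gold):
    """Precision and recall of parallel-message detection, where
    `predicted` and `gold` are collections of message ids judged
    to be parallel (by the system and by the annotators)."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```

For example, a system that flags messages {1, 2, 3, 4} as parallel when the annotators marked {2, 3, 5} gets precision 0.5 and recall 2/3.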
  
Furthermore, we should test whether a model trained on the extracted parallel data can obtain better results than state-of-the-art systems trained on huge amounts of data, such as Google Translate.

Latest revision as of 14:05, 16 October 2012
