Multilingual Sentiment Analysis in Microblogs
Project changed to "Bitext Extraction from Weibo"
Team members
Project Summary
Parallel corpora are a valuable resource for Machine Translation. They are used to train translation models in statistical Machine Translation, to build bilingual dictionaries, and, most importantly, to evaluate translation systems. However, state-of-the-art machine translation systems (such as Google Translate) are not well suited to translating short messages. One of the main reasons is the use of colloquial language (He ain 't about that “ team no sleep ” life . Gotcha DJ Irie . photobombed . . . love - Dwayne Wade). More importantly, research on the translation of short messages is scarce, and we believe that one reason is the lack of a gold standard for evaluating translation quality on such messages.
In this project, we will show how to build a Mandarin-English parallel corpus from Weibo messages. We leverage the fact that some users translate their messages from Chinese to English and vice versa. For instance, pop stars such as Snoop Dogg post messages in English together with their translations in Mandarin (Shout out 2 Kelly Monaco on DWTS! Good lucc. Keep em bouncn! U got it! - 为Dancing with the Stars的Kelly Monaco加油!祝你好运。舞翻全场吧!你行的!). Although the number of such users is extremely small, given the sheer volume of messages on Weibo (8 times larger than Twitter), we are confident that we can build a gold standard of 1000 sentence pairs. Furthermore, we expect to be able to extract more than 1M sentence pairs for training translation models. Such a resource could substantially advance the field of short message translation.
Dataset
To obtain the Weibo corpus, we will use the search API provided by Weibo to crawl messages.
Task
This work has three components: extraction of messages, detection of parallel data, and sentence alignment.
First, we will have to build an API client to retrieve messages. This raises some questions about the best approach to finding users who are likely to write parallel messages. One way is to check whether users live in the US and thus need to write their messages in multiple languages; another is to find one user who writes parallel messages and check whether their friends/followees show the same behavior.
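The second idea, expanding from known bilingual posters through their followees, can be sketched as a breadth-first search. This is only an illustration under stated assumptions: `get_followees` and `posts_parallel` are hypothetical callables standing in for the crawler and the parallel-message detector, not part of any Weibo API.

```python
from collections import deque

def find_candidate_users(seeds, get_followees, posts_parallel, max_visited=1000):
    """Breadth-first expansion from seed users through followee links.

    get_followees(user) -> iterable of user ids   (hypothetical crawler hook)
    posts_parallel(user) -> bool                  (hypothetical detector hook)
    """
    visited = set()
    queue = deque(seeds)
    found = []
    while queue and len(visited) < max_visited:
        user = queue.popleft()
        if user in visited:
            continue
        visited.add(user)
        if not posts_parallel(user):
            continue  # only expand through users who show the behavior
        found.append(user)
        queue.extend(f for f in get_followees(user) if f not in visited)
    return found
```

Expanding only through users who themselves post parallel messages keeps the crawl small, at the cost of missing bilingual users who are reachable only through monolingual ones.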
Second, we will have to filter the messages we collect so that we discard those that are not parallel. This can be done using heuristics and bilingual dictionaries. First, we can check the ratio between Chinese characters and English words. We can also use a dictionary to check how many words map to each other.
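A minimal sketch of these two checks follows. The character range used for Chinese, the threshold values, and the dictionary format (English word mapped to a set of Chinese strings) are all assumptions made for illustration and would need tuning on real data.

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")              # CJK Unified Ideographs only
EN_WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

def length_ratio(message):
    """Chinese characters per English word, or None if there are no English words."""
    n_zh = len(CJK.findall(message))
    n_en = len(EN_WORD.findall(message))
    return n_zh / n_en if n_en else None

def dictionary_coverage(message, en_zh_dict):
    """Fraction of distinct English words with a listed translation found in the message."""
    en_words = {w.lower() for w in EN_WORD.findall(message)}
    if not en_words:
        return 0.0
    hits = sum(
        1 for w in en_words
        if any(zh in message for zh in en_zh_dict.get(w, ()))
    )
    return hits / len(en_words)

def looks_parallel(message, en_zh_dict, min_ratio=0.5, max_ratio=4.0, min_coverage=0.2):
    """Combine both heuristics; the thresholds are placeholder values."""
    ratio = length_ratio(message)
    if ratio is None or not (min_ratio <= ratio <= max_ratio):
        return False
    return dictionary_coverage(message, en_zh_dict) >= min_coverage
```

For example, with a toy dictionary `{"good": {"好"}, "luck": {"好运"}}`, the message `"Good luck! 祝你好运"` has a ratio of 2.0 and full coverage, so `looks_parallel` accepts it. Note that the coverage check here looks for translations anywhere in the message rather than in a designated target side.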
The main problem with this second heuristic is that the boundary between the source and the target sentences is not defined (or not always defined in the same way). Thus, the second step is tightly coupled with the third step, where we need to find the best boundary at which to split the message into the source side and the target side.
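One simple way to search for the boundary is to try every split point and keep the one that best separates the two scripts. The sketch below is an assumption-laden baseline, not the method this project commits to: it scores a split only by script (Chinese character versus Latin token), and a dictionary-based score could be substituted for it.

```python
import re

# One token per Chinese character or per run of Latin letters/digits.
TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")

def is_zh(token):
    return "\u4e00" <= token[0] <= "\u9fff"

def best_split(message):
    """Return (left, right, score) for the split that best separates the scripts.

    score is the fraction of tokens that fall on the side matching their script,
    taking the better of the two orders (English|Chinese or Chinese|English).
    Returns None if the message has fewer than two tokens.
    """
    spans = [(m.group(), m.start()) for m in TOKEN.finditer(message)]
    if len(spans) < 2:
        return None
    flags = [is_zh(tok) for tok, _ in spans]
    total, total_zh = len(flags), sum(flags)
    best_score, best_cut = -1.0, 0
    zh_left = 0
    for i in range(1, total):            # split just before token i
        zh_left += flags[i - 1]
        en_left = i - zh_left
        zh_right = total_zh - zh_left
        en_right = (total - i) - zh_right
        agree = max(en_left + zh_right, zh_left + en_right)
        score = agree / total
        if score > best_score:
            best_score, best_cut = score, spans[i][1]
    return message[:best_cut].strip(), message[best_cut:].strip(), best_score
```

A low best score suggests the message mixes both languages throughout rather than containing a sentence and its translation, so the same score can feed back into the filtering step.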
Evaluation
We will choose 1000 messages and split the source and target sides manually. Then, we will measure the precision and recall of our system on the detection of parallel messages. Given the large volume of messages, we believe that our system should be tuned for precision when building a test set. As for the training set, statistical models tend to be tolerant to noise, so precision is less important there.
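For reference, precision and recall on the detection task can be computed as in the sketch below, assuming each message has an identifier and that `predicted` and `gold` are the sets of identifiers marked parallel by the system and by the annotators respectively (these names are illustrative).

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted parallel messages against the gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Tuning for precision then amounts to raising the filter thresholds until precision on the annotated messages is acceptable, while tracking how much recall is lost.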
Furthermore, we should test whether a model built using the extracted parallel data can obtain better results than state-of-the-art systems trained on huge amounts of data, such as Google Translate.