Project - Second Draft Proposal - Bo, Kevin, Rushin
Social Media Analysis (10-802) Project Proposal
Bo Lin [email@example.com]
Kevin Dela Rosa [firstname.lastname@example.org]
Rushin Shah [email@example.com]
Title: Fine & coarse grain clustering of tweets based on topics
We propose to tackle the problem of clustering Twitter messages (tweets) for a set of six predefined topics: Politics, Sports, Science/Technology, Entertainment, Finance, and "Just for Fun". We propose to address this problem by gathering Twitter data for approximately 30 popular hash tags corresponding to the different topics, performing language and/or topic modeling on the tweets to produce a set of clusters, and then comparing those clusters against the ones defined by the different tags.
This problem is motivated by the observation that hash tags are approximate indicators of a tweet's topic, and by a desire to cluster user tweets into topics in a way analogous to how Google clusters news articles. Some of the challenges for this project include the fact that tweets are short (on average around 10 words) and noisy (filled with jargon, abbreviations, and misspellings).
The specific tasks that we plan to look at are:
- Task 1: (Coarse grain clustering) Can we cluster the tweets into 6 different clusters (unsupervised, not classification), and how well will these clusters correspond to our 6 predefined clusters of tweets?
- Task 2: (Fine grain clustering) Can we cluster the tweets into approximately 30 clusters, and how well will these correspond to our hash tags?
- Task 3: For tweets of a given topic (out of the 6), can we cluster those tweets into the approximately 5 corresponding "sub-topics" as indicated by the hash-tags?
- Task 4: Can we extract the most representative tweets for a given cluster?
Dataset & Evaluation
We plan to create a data set of at least 1,000 tweets (probably many more) for each of our 30 popular/topical hash-tags using the Twitter API. For example:
- Politics: #obama, #2012, #sotu
- Sports: #superbowl, #worldcup, #lebron
- Science/Technology: #ipad, #watson, #exoplanet
- Entertainment: #musicmonday, #grammy, #oscar
- Finance: #GOOG, #dowjones, #gold
- "Just for fun": #fml, #thingsilike, #followfriday
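The example tags above can be turned into gold labels with a small lookup table. The following is a minimal sketch, not a settled design: the names TOPIC_HASHTAGS, TAG_TO_TOPIC, and gold_labels are hypothetical. It also makes two assumptions that the proposal does not state: a tweet is kept only if it contains exactly one tracked tag, and that tag is removed from the text so a clustering method cannot simply read the label.

```python
import re

# Hashtag-to-topic mapping copied from the example list (lowercased).
TOPIC_HASHTAGS = {
    "Politics": ["obama", "2012", "sotu"],
    "Sports": ["superbowl", "worldcup", "lebron"],
    "Science/Technology": ["ipad", "watson", "exoplanet"],
    "Entertainment": ["musicmonday", "grammy", "oscar"],
    "Finance": ["goog", "dowjones", "gold"],
    "Just for Fun": ["fml", "thingsilike", "followfriday"],
}
TAG_TO_TOPIC = {tag: topic
                for topic, tags in TOPIC_HASHTAGS.items()
                for tag in tags}

HASHTAG_RE = re.compile(r"#(\w+)")


def gold_labels(tweet_text):
    """Return (fine_label, coarse_label, stripped_text), or None.

    Assumption: tweets with zero or several tracked tags are discarded,
    because their gold class would be ambiguous.
    """
    tags = {t.lower() for t in HASHTAG_RE.findall(tweet_text)}
    tracked = sorted(tags & set(TAG_TO_TOPIC))
    if len(tracked) != 1:
        return None
    tag = tracked[0]
    # Assumption: drop the tracked tag so the label does not leak into the text.
    stripped = re.sub(r"#" + re.escape(tag) + r"\b", "", tweet_text,
                      flags=re.IGNORECASE)
    return tag, TAG_TO_TOPIC[tag], " ".join(stripped.split())
```

The fine label (the hash tag) would serve as the gold class for tasks 2 and 3, and the coarse label (the topic) for task 1.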
For tasks 1, 2, and 3, we plan to use the following standard IR metrics to measure the quality of clusters, using the hash-tag or higher-level topic as the gold-standard class for each tweet:
- Normalized Mutual Information
- Per-Cluster F-Score
- Rand Index
- Additionally we may use B-Cubed or some variant of it
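Two of the metrics above, the Rand Index and Normalized Mutual Information, can be computed directly from the gold labels and the cluster assignments. The sketch below is illustrative only. It assumes NMI is normalized by the arithmetic mean of the two entropies (other normalizations exist), and the function names are our own.

```python
from collections import Counter
from itertools import combinations
from math import log


def rand_index(gold, pred):
    """Fraction of tweet pairs on which the two labelings agree.

    A pair agrees if it is together in both labelings or apart in both.
    """
    pairs = list(combinations(range(len(gold)), 2))
    agree = sum((gold[i] == gold[j]) == (pred[i] == pred[j])
                for i, j in pairs)
    return agree / len(pairs)


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(gold, pred):
    """Mutual information divided by the mean of the two entropies."""
    n = len(gold)
    joint = Counter(zip(gold, pred))
    g_counts, p_counts = Counter(gold), Counter(pred)
    mi = sum((c / n) * log(c * n / (g_counts[g] * p_counts[p]))
             for (g, p), c in joint.items())
    denom = (entropy(gold) + entropy(pred)) / 2
    return mi / denom if denom > 0 else 0.0
```

Both functions take two equal-length sequences, for example the hash tag of each tweet and the cluster id assigned to it. The pairwise loop in rand_index is quadratic in the number of tweets, so a full 30,000-tweet data set would call for a contingency-table formulation instead.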
For task 4, we plan to rank tweets according to cosine similarity or some other such measure to determine how representative each tweet is of the entire cluster, and then select the highest ranked one.
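One way to read the ranking step for task 4 is to score each tweet by its cosine similarity to the centroid of its cluster. The sketch below assumes plain bag-of-words term counts and whitespace tokenization, neither of which the proposal has fixed, and its function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Term-count vector using naive whitespace tokenization."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_representativeness(tweets):
    """Return (score, tweet) pairs for one cluster, most representative first."""
    vectors = [bag_of_words(t) for t in tweets]
    # The summed vector points in the same direction as the mean vector,
    # so it can stand in for the centroid when using cosine similarity.
    centroid = Counter()
    for v in vectors:
        centroid.update(v)
    scored = [(cosine(v, centroid), t) for v, t in zip(vectors, tweets)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

The first element of the returned list is the candidate "most representative" tweet. TF-IDF weighting could replace the raw counts without changing the structure.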
Since this is a clustering problem, we can think of a variety of unsupervised techniques:
- Generative models like LDA
- Training a distance metric that uses features such as cosine similarity and noun-phrase cosine similarity, and then applying a standard clustering algorithm such as Hierarchical Agglomerative Clustering
- In addition to the tweet text, we could look at some of the following features:
  - Text from links referenced in the tweets
  - Sentiment analysis of the tweet text
  - Text from recent tweets by users near the ones that mention the hash tag
  - Re-tweet information
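To make the Hierarchical Agglomerative Clustering option concrete, here is a minimal sketch. It uses plain (not learned) cosine similarity over bag-of-words vectors with average linkage, and stops when k clusters remain. The linkage choice and the function names are assumptions, and the naive search over cluster pairs is only practical for small samples.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hac(texts, k):
    """Average-linkage agglomerative clustering down to k clusters.

    Returns a list of clusters, each a list of indices into texts.
    """
    vectors = [Counter(t.lower().split()) for t in texts]
    clusters = [[i] for i in range(len(texts))]
    while len(clusters) > k:
        best = None  # (average similarity, index x, index y)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[x] for j in clusters[y]]
                avg = sum(sims) / len(sims)
                if best is None or avg > best[0]:
                    best = (avg, x, y)
        _, x, y = best
        # y > x, so removing cluster y does not shift the position of x.
        clusters[x] += clusters.pop(y)
    return clusters
```

For task 1 this would be called with k=6 and for task 2 with k of about 30. A trained distance metric would replace the cosine function.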
Additionally, we plan to look at supervised algorithms (such as Rocchio and Labeled LDA) for tasks 1-3, and to compare their results with those of the unsupervised methods.
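As a point of reference for the supervised comparison, Rocchio classification can be sketched as a nearest-centroid rule: each class is represented by the mean of its training vectors, and a new tweet is assigned to the class whose centroid is most similar. The version below assumes length-normalized term counts and cosine similarity, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def unit_vector(text):
    """Length-normalized term-count vector as a plain dict."""
    counts = Counter(text.lower().split())
    norm = sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}


def train_centroids(labeled_tweets):
    """Average the unit vectors of each class; input is (text, label) pairs."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter()
    for text, label in labeled_tweets:
        sizes[label] += 1
        for term, weight in unit_vector(text).items():
            sums[label][term] += weight
    return {label: {t: w / sizes[label] for t, w in vec.items()}
            for label, vec in sums.items()}


def cosine(a, b):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, centroids):
    """Assign the label of the most similar class centroid."""
    query = unit_vector(text)
    return max(centroids, key=lambda label: cosine(query, centroids[label]))
```

The labels can be either the six topics or the individual hash tags, so the same code would cover the supervised variants of tasks 1 and 2.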
On the Addition of Other Tasks
We realize that since we are collecting tweets from clearly distinct topic areas, we may end up getting really good results even with relatively easy approaches. To remedy this, we discussed various additions that we could make to this project, such as using these techniques to personalize a user's Twitter stream according to the users they follow, suggesting new users to follow, or predicting hash-tags for a tweet. However, all these ideas, although interesting, would significantly change the initial direction of our project and dramatically increase its workload. We are not sure that we would be able to finish them, so we do not want to modify our goals at the moment. If we complete our initial project tasks with enough time to spare, we will revisit one of these other ideas and consider adding it to our project.
Professor Cohen has provided us with a few initial references:
- An Empirical Comparison of Topics in Twitter and Traditional Media. Zhao et al. 2011.
  - This paper contains techniques to prepare topic-based Twitter datasets, as well as a novel technique to adapt LDA to tweets. Since there isn't any off-the-shelf implementation of this new version of LDA, we are not sure if we will implement it, but we will try to do so if it can be done in a reasonable amount of time.
- Social Links from Latent Topics in Micro-blog. Puniyani et al. 2010.
  - This paper describes Kriti's project from last year, which involved some visualization of Twitter topics. It helps one get an idea of what topics naturally appear in Twitter. The paper reports stylistic variations as well as topical variations when one performs unsupervised LDA on tweets, and we will see if we obtain evidence of these variations in our project as well.