Project - First Draft Proposal - Bo, Kevin, Rushin

Social Media Analysis (10-802) Project Proposal

Team Members

Bo Lin [bolin@cs.cmu.edu]

Rushin Shah [rnshah@cs.cmu.edu]

Summary

Tentative Title: Fine- and coarse-grained clustering of tweets based on topics

We propose to tackle the problem of clustering Twitter messages (tweets) for a set of six predefined topics: Politics, Sports, Technology, Entertainment, Finance, and "Just for Fun". We propose to address this problem by gathering Twitter data for approximately 30 popular hashtags corresponding to the different topics, applying language and/or topic modeling to the tweets to produce a set of clusters, and then comparing those clusters against the ones defined by the hashtags.

We look at the following issues:

Task 1: (Coarse-grained clustering) Can we cluster the tweets into 6 clusters (unsupervised clustering, not classification), and how well will these clusters correspond to our 6 predefined topics?
Task 2: (Fine-grained clustering) Can we cluster the tweets into approximately 30 clusters, and how well will these correspond to our hashtags?
Task 3: For the tweets of a given topic (out of the 6), can we cluster those tweets into the approximately 5 corresponding "sub-topics" indicated by the hashtags?
Task 4: Can we extract the most representative tweets for a given cluster?

Dataset & Evaluation

We plan to create a data set of at least 1,000 tweets for each of our 30 popular/topical hashtags. For example:

Politics: #obama, #takethecountryback, #2012, #bp, #sotu
Sports: #superbowl, #kobe, #lebron
Technology: #ipad, #willjobslive, #exoplanet
Entertainment: #musicmonday, #ladygaga, #oscar
Finance: #AAPL, #tarp
"Just for fun": #fml, #thingsilike

For Tasks 1, 2, and 3, we plan to use the B-cubed metric to measure the quality of the clusters, using the hashtag or the higher-level topic as the gold-standard class for each tweet. We are not yet sure how to evaluate our performance on Task 4. It can be considered a cluster labeling problem, and we may use crowdsourcing or some equivalent to obtain a representativeness score for each tweet. We'd appreciate W's suggestions on this.
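
To make the evaluation plan concrete, here is a minimal sketch of how B-cubed precision, recall, and F1 could be computed, assuming each tweet has exactly one predicted cluster and one gold class (a hashtag or topic). The function name `b_cubed`, the dictionary-based input format, and the toy tweet ids are our own illustrative assumptions, not part of any existing tool.

```python
from collections import Counter


def b_cubed(predicted, gold):
    """Return (precision, recall, f1) under the B-cubed metric.

    predicted: dict mapping tweet id -> predicted cluster id
    gold:      dict mapping tweet id -> gold class (hashtag or topic)
    Both dicts are assumed to cover the same tweet ids.
    """
    items = list(gold)
    cluster_sizes = Counter(predicted[i] for i in items)
    class_sizes = Counter(gold[i] for i in items)
    # How many tweets share both a given cluster and a given gold class.
    joint_sizes = Counter((predicted[i], gold[i]) for i in items)

    precision = 0.0
    recall = 0.0
    for i in items:
        overlap = joint_sizes[(predicted[i], gold[i])]
        # Fraction of this tweet's cluster that shares its gold class.
        precision += overlap / cluster_sizes[predicted[i]]
        # Fraction of this tweet's gold class that landed in its cluster.
        recall += overlap / class_sizes[gold[i]]

    n = len(items)
    precision /= n
    recall /= n
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example (hypothetical tweet ids): one #ipad tweet is placed in the #obama cluster.
gold = {"t1": "#obama", "t2": "#obama", "t3": "#ipad", "t4": "#ipad"}
predicted = {"t1": 0, "t2": 0, "t3": 0, "t4": 1}
precision, recall, f1 = b_cubed(predicted, gold)
```

The same function would serve Tasks 1 through 3: only the gold labels change (topic for Task 1, hashtag for Tasks 2 and 3).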

Possible Techniques

Since this is a clustering problem, we can think of a variety of unsupervised techniques:

Generative models like LDA
Training a distance metric that uses features such as cosine similarity and noun-phrase cosine similarity, and then applying a standard clustering algorithm such as hierarchical agglomerative clustering (HAC).
Additional features such as sentiment could also be used.
We could augment tweets with text from links referenced in the tweets.
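
As a rough illustration of the similarity-plus-HAC idea, the sketch below clusters tweets with average-link agglomerative clustering over plain bag-of-words cosine similarity. It is a simplification under stated assumptions: the similarity is fixed rather than a trained distance metric, tokenization is a whitespace split, and the names `cosine` and `hac` and the sample texts are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hac(texts, k):
    """Average-link agglomerative clustering down to k clusters.

    Returns a list of clusters, each a sorted list of indices into texts.
    """
    # Assumption: lowercase whitespace tokens stand in for real tweet tokenization.
    vectors = [Counter(t.lower().split()) for t in texts]
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]

    def avg_link(c1, c2):
        # Mean pairwise similarity between members of the two clusters.
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find the most similar pair of clusters and merge them.
        a, b = max(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: avg_link(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Hypothetical sample texts, not real collected tweets.
texts = [
    "obama speech tonight",
    "obama speech sotu",
    "ipad launch apple",
    "new ipad apple",
]
clusters = hac(texts, 2)
```

This naive version recomputes linkage scores on every merge, so it would only suit small samples; for the full data set of roughly 30,000 tweets an optimized library implementation would be the more realistic choice.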

We'd appreciate any other suggestions on improving our project.
