Project - First Draft Proposal - Bo, Kevin, Rushin
Social Media Analysis (10-802) Project Proposal
Team Members
Bo Lin [bolin@cs.cmu.edu]
Kevin Dela Rosa [kdelaros@cs.cmu.edu]
Rushin Shah [rnshah@cs.cmu.edu]
Summary
Tentative Title: Fine & coarse grain clustering of tweets based on topics
We propose to tackle the problem of clustering Twitter messages (tweets) for a set of six predefined topics: Politics, Sports, Technology, Entertainment, Finance, and "Just for Fun". We propose to address this problem by gathering Twitter data for approximately 30 popular hash tags corresponding to the different topics, performing language and/or topic modeling on the tweets to produce a set of clusters, and then comparing those clusters against the ones defined by the different tags.
We look at the following issues:
- Task 1: (Coarse grain clustering) Can we cluster the tweets into 6 different clusters (unsupervised, not classification), and how well will these clusters correspond to our 6 predefined clusters of tweets?
- Task 2: (Fine grain clustering) Can we cluster the tweets into approximately 30 clusters, and how well will these correspond to our hash tags?
- Task 3: For tweets of a given topic (out of the 6), can we cluster those tweets into the approximately 5 corresponding "sub-topics" as indicated by the hash-tags?
- Task 4: Can we extract the most representative tweets for a given cluster?
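As a rough illustration of Task 4, one simple baseline is to rank the tweets in a cluster by their cosine similarity to the cluster's term-count centroid. This is only a sketch under our own assumptions, not a settled method: the whitespace tokenizer, the raw term counts, and the function names (`tf_vector`, `cosine`, `most_representative`) are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term counts using a naive lowercase whitespace tokenizer."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_representative(tweets, top_n=1):
    """Return the top_n tweets closest to the cluster centroid.

    The centroid is the sum of the term-count vectors; cosine similarity
    ignores scale, so the sum ranks tweets the same way the mean would.
    """
    if not tweets:
        raise ValueError("cluster must contain at least one tweet")
    vectors = [tf_vector(t) for t in tweets]
    centroid = Counter()
    for vec in vectors:
        centroid.update(vec)
    ranked = sorted(
        range(len(tweets)),
        key=lambda i: cosine(vectors[i], centroid),
        reverse=True,
    )
    return [tweets[i] for i in ranked[:top_n]]


if __name__ == "__main__":
    cluster = [
        "ipad launch today",
        "new ipad launch event",
        "ipad launch event today",
        "lunch was great",
    ]
    print(most_representative(cluster, top_n=2))
```

A centroid-based ranking tends to favor tweets that reuse the cluster's most frequent terms, so an off-topic tweet that landed in the cluster by mistake should rank last.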
Dataset & Evaluation
We plan to create a data set of at least 1,000 tweets for each of our 30 popular/topical hash-tags. For example:
- Politics: #obama, #takethecountryback, #2012, #bp, #sotu
- Sports: #superbowl, #kobe, #lebron
- Technology: #ipad, #willjobslive, #exoplanet
- Entertainment: #musicmonday, #ladygaga, #oscar
- Finance: #AAPL, #tarp
- "Just for fun": #fml, #thingsilike
For Tasks 1, 2, and 3, we plan to use the B-cubed metric to measure the quality of the clusters, using the hash-tag or the higher-level topic as the gold-standard class for each tweet. We are not yet sure how to evaluate our performance on Task 4. It can be considered a cluster labeling problem, and we may use crowdsourcing or some equivalent to obtain representativeness scores for each tweet. We'd appreciate W's suggestions on this.
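To make the B-cubed evaluation concrete, here is a minimal sketch of how precision, recall, and F1 could be computed from two parallel label lists. It assumes each tweet has exactly one predicted cluster and one gold class (a hash-tag or a topic); the function name `b_cubed` and the input format are our own illustrative choices.

```python
from collections import Counter


def b_cubed(predicted, gold):
    """Compute B-cubed (precision, recall, f1).

    predicted[i] is the cluster assigned to item i, gold[i] its true class.
    For each item, precision is the fraction of its cluster that shares its
    gold class, and recall is the fraction of its gold class that shares its
    cluster. Both are averaged over all items.
    """
    if len(predicted) != len(gold):
        raise ValueError("label lists must have the same length")
    if not predicted:
        raise ValueError("need at least one item")

    cluster_sizes = Counter(predicted)
    class_sizes = Counter(gold)
    pair_sizes = Counter(zip(predicted, gold))  # items sharing cluster and class

    n = len(predicted)
    precision = sum(
        pair_sizes[(c, g)] / cluster_sizes[c] for c, g in zip(predicted, gold)
    ) / n
    recall = sum(
        pair_sizes[(c, g)] / class_sizes[g] for c, g in zip(predicted, gold)
    ) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1]
    hashtags = ["#obama", "#obama", "#kobe", "#kobe", "#kobe"]
    print(b_cubed(clusters, hashtags))
```

The same function would serve Tasks 1, 2, and 3: only the gold labels change (topic for Task 1, hash-tag for Tasks 2 and 3).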
Possible Techniques
Since this is a clustering problem, we can think of a variety of unsupervised techniques:
- Generative models like LDA
- Training a distance metric that uses features such as cosine similarity and noun-phrase cosine similarity, and then applying a standard clustering algorithm such as hierarchical agglomerative clustering (HAC).
- Additional features such as sentiment could also be used
- We could augment tweets with text from links referenced in the tweets.
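As a sketch of the similarity-plus-HAC option above, the following shows average-link agglomerative clustering over bag-of-words cosine similarity. Note the simplifications: it uses a fixed cosine similarity rather than a trained distance metric, a naive whitespace tokenizer, and a brute-force merge loop that would be too slow for 30,000 tweets. The names `tf_vector`, `cosine`, and `hac` are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term counts using a naive lowercase whitespace tokenizer."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hac(texts, k):
    """Average-link agglomerative clustering; returns one cluster id per text."""
    if not 1 <= k <= len(texts):
        raise ValueError("k must be between 1 and the number of texts")
    vectors = [tf_vector(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]  # start with singletons

    def avg_sim(c1, c2):
        # Mean pairwise similarity between members of the two clusters.
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find the most similar pair of clusters and merge them.
        pairs = (
            (a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        a, b = max(pairs, key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(texts)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels


if __name__ == "__main__":
    tweets = [
        "obama speech tonight",
        "obama speech sotu",
        "superbowl game tonight",
        "superbowl game kickoff",
    ]
    print(hac(tweets, k=2))
```

Setting `k` to 6 or to roughly 30 would correspond to the coarse and fine grain tasks respectively. Additional features such as sentiment or linked-page text could be folded into the similarity function.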
We'd appreciate any other suggestions on improving our project.