Project - First Draft Proposal - Bo, Kevin, Rushin
Revision as of 14:27, 1 February 2011
Social Media Analysis (10-802) Project Proposal
Team Members
Bo Lin [bolin@cs.cmu.edu]
Kevin Dela Rosa [kdelaros@cs.cmu.edu]
Rushin Shah [rnshah@cs.cmu.edu]
Summary
Tentative Title: Fine & coarse grain clustering of tweets based on topics
We propose to tackle the problem of clustering Twitter messages (tweets) for a set of six predefined topics: Politics, Sports, Technology, Entertainment, Finance, and "Just for Fun". We propose to address this problem by gathering Twitter data for approximately 30 popular hashtags corresponding to the different topics (roughly five per topic), performing language and/or topic modeling on the tweets to produce a set of clusters, and then comparing those clusters against the ones defined by the hashtags.
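One way to make "comparing clusters against the ones defined by the hashtags" concrete is cluster purity: each cluster is assigned its most frequent topic, and purity is the fraction of tweets that agree with their cluster's majority topic. This is only a minimal sketch of one possible measure, not a metric the proposal commits to; the function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, topic_labels):
    """Fraction of tweets whose topic equals the majority topic of their cluster."""
    if len(cluster_ids) != len(topic_labels):
        raise ValueError("cluster_ids and topic_labels must have the same length")
    if not cluster_ids:
        return 0.0

    # Count how often each topic label occurs inside each cluster.
    topics_by_cluster = defaultdict(Counter)
    for cluster_id, topic in zip(cluster_ids, topic_labels):
        topics_by_cluster[cluster_id][topic] += 1

    # Sum the size of the majority topic over all clusters.
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in topics_by_cluster.values()
    )
    return majority_total / len(cluster_ids)


# Hypothetical toy data: cluster assignments and hashtag-derived topics.
clusters = [0, 0, 0, 1, 1, 2]
topics = ["Sports", "Sports", "Politics", "Politics", "Politics", "Finance"]
print(cluster_purity(clusters, topics))
```

Purity rises trivially as the number of clusters grows, so it is best compared across runs with the same number of clusters (6 for the coarse setting, about 30 for the fine setting).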
We will look at the following questions:
- (Coarse-grained clustering) Can we cluster the tweets into 6 clusters (unsupervised clustering, not classification), and how well do these clusters correspond to our 6 predefined topics?
- (Fine-grained clustering) Can we cluster the tweets into approximately 30 clusters, and how well do these correspond to our hashtags?
- For the tweets of a given topic (out of the 6), can we cluster them into the approximately 5 corresponding "sub-topics" indicated by the hashtags?
- Can we extract the most representative tweets for a given cluster?
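For the last question, one simple baseline is to pick the tweet closest to the cluster centroid. The sketch below assumes whitespace tokenization, raw term counts, and cosine similarity; these are illustrative choices rather than the method the proposal settles on, and all names and example tweets are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens mapped to their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_representative(tweets):
    """Return the tweet whose bag-of-words vector is closest to the centroid."""
    if not tweets:
        raise ValueError("tweets must not be empty")
    vectors = [bag_of_words(t) for t in tweets]

    # The summed vector points in the same direction as the mean vector,
    # so it can stand in for the centroid under cosine similarity.
    centroid = Counter()
    for vector in vectors:
        centroid.update(vector)

    best = max(range(len(tweets)), key=lambda i: cosine(vectors[i], centroid))
    return tweets[best]


# Hypothetical cluster contents.
cluster_tweets = [
    "great game tonight",
    "great game by the home team tonight",
    "stocks fell today",
]
print(most_representative(cluster_tweets))
```

Ranking all tweets by the same similarity score, instead of taking only the maximum, would give the top few representative tweets per cluster.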
Dataset
We will create a dataset of at least 1,000 tweets for each of our 30 popular/topical hashtags.
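While collecting the data, it may help to track which hashtags are still short of the 1,000-tweet target. The following is a rough sketch assuming tweets are available as plain strings; the hashtag names, the regular expression, and the function names are hypothetical and do not reflect the tags actually chosen.

```python
import re
from collections import Counter

# Matches a '#' followed by word characters; a simplification of real hashtag rules.
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def count_tweets_per_hashtag(tweets, tracked_tags):
    """Count tweets containing each tracked hashtag (case-insensitive)."""
    tracked = {tag.lower() for tag in tracked_tags}
    counts = Counter({tag: 0 for tag in tracked})
    for text in tweets:
        # Use a set so a tag repeated within one tweet is counted once.
        tags_in_tweet = {match.lower() for match in HASHTAG_PATTERN.findall(text)}
        for tag in tags_in_tweet & tracked:
            counts[tag] += 1
    return counts


def tags_below_target(counts, minimum=1000):
    """Hashtags that have fewer than `minimum` collected tweets."""
    return sorted(tag for tag, n in counts.items() if n < minimum)


# Hypothetical sample; real collection would cover about 30 hashtags.
sample_tweets = [
    "Go team! #NFL",
    "#nfl #nfl again",
    "Budget vote today #politics",
]
counts = count_tweets_per_hashtag(sample_tweets, ["nfl", "politics", "oscars"])
print(tags_below_target(counts, minimum=2))
```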
Possible Techniques
Blah