Difference between revisions of "Project - First Draft Proposal - Bo, Kevin, Rushin"

From Cohen Courses

Revision as of 14:29, 1 February 2011

Social Media Analysis (10-802) Project Proposal

Team Members

Bo Lin [bolin@cs.cmu.edu]

Kevin Dela Rosa [kdelaros@cs.cmu.edu]

Rushin Shah [rnshah@cs.cmu.edu]

Summary

Tentative Title: Fine & coarse grain clustering of tweets based on topics

We propose to tackle the problem of clustering Twitter messages (tweets) for a set of six predefined topics: Politics, Sports, Technology, Entertainment, Finance, and "Just for Fun". We propose to address this problem by gathering Twitter data for approximately 30 popular hash tags corresponding to the different topics, performing some language and/or topic modeling on the tweets to produce a set of clusters, and then comparing those clusters against the ones defined by the different tags.

We look at the following issues:

  • Task 1: (Coarse grain clustering) Can we cluster the tweets into 6 different clusters (unsupervised, not classification), and how well will these clusters correspond to our 6 predefined clusters of tweets?
  • Task 2: (Fine grain clustering) Can we cluster the tweets into approximately 30 clusters, and how well will these correspond to our hash tags?
  • Task 3: For tweets of a given topic (out of the 6), can we cluster those tweets into the approximately 5 corresponding "sub-topics" as indicated by the hash-tags?
  • Task 4: Can we extract the most representative tweets for a given cluster?

Dataset & Evaluation

We plan to create a data set of at least 1,000 tweets for each of our 30 popular/topical hash-tags.

For Tasks 1, 2, and 3, we plan to use the B-cubed metric to measure the quality of the clusters, using the hash-tag (for the fine grain tasks) or the higher level topic (for the coarse grain task) as the gold-standard class of each tweet.
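
As a rough illustration of the evaluation described above, the following is a minimal sketch of B-cubed scoring in Python. It assumes the standard per-item formulation of the metric: for each tweet, precision is the fraction of tweets in its cluster that share its gold class, and recall is the fraction of tweets in its gold class that share its cluster; both are then averaged over all tweets. The function name `b_cubed`, the parallel-list input format, and the example labels are illustrative assumptions, not part of the proposal.

```python
from collections import Counter


def b_cubed(clusters, gold):
    """Return (precision, recall, f1) under the B-cubed metric.

    clusters: list of predicted cluster ids, one per tweet (hypothetical format).
    gold:     list of gold classes (hash-tag or topic), one per tweet.
    """
    if len(clusters) != len(gold):
        raise ValueError("clusters and gold must have the same length")
    n = len(clusters)
    if n == 0:
        raise ValueError("at least one item is required")

    cluster_sizes = Counter(clusters)            # tweets per predicted cluster
    class_sizes = Counter(gold)                  # tweets per gold class
    overlap = Counter(zip(clusters, gold))       # tweets per (cluster, class) pair

    # Per-item precision and recall, averaged over all tweets.
    precision = sum(overlap[(c, g)] / cluster_sizes[c]
                    for c, g in zip(clusters, gold)) / n
    recall = sum(overlap[(c, g)] / class_sizes[g]
                 for c, g in zip(clusters, gold)) / n

    # Each item overlaps with itself, so precision + recall is never zero.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: five tweets, two predicted clusters, two made-up gold hash-tags.
predicted = [0, 0, 0, 1, 1]
gold_tags = ["#tagA", "#tagA", "#tagB", "#tagB", "#tagB"]
p, r, f = b_cubed(predicted, gold_tags)
```

For Task 1 the `gold` list would hold the six topic labels, while for Tasks 2 and 3 it would hold the hash-tags themselves; the scoring code is the same in both cases.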

Possible Techniques

Blah