Difference between revisions of "Project - Second Draft Proposal - Bo, Kevin, Rushin"

From Cohen Courses

Revision as of 19:31, 14 February 2011

Social Media Analysis (10-802) Project Proposal

Team Members

Bo Lin [bolin@cs.cmu.edu]

Kevin Dela Rosa [kdelaros@cs.cmu.edu]

Rushin Shah [rnshah@cs.cmu.edu]

Summary

Title: Fine & coarse grain clustering of tweets based on topics

We propose to tackle the problem of clustering Twitter messages (tweets) for a set of six predefined topics: Politics, Sports, Science/Technology, Entertainment, Finance, and "Just for Fun". We propose to address this problem by gathering Twitter data for approximately 30 popular hash tags corresponding to the different topics, performing some language and/or topic modeling on the tweets to produce a set of clusters, and then comparing those clusters against the ones defined by the different tags.

We look at the following issues:

  • Task 1: (Coarse grain clustering) Can we cluster the tweets into 6 different clusters (unsupervised, not classification), and how well will these clusters correspond to our 6 predefined clusters of tweets?
  • Task 2: (Fine grain clustering) Can we cluster the tweets into approximately 30 clusters, and how well will these correspond to our hash tags?
  • Task 3: For tweets of a given topic (out of the 6), can we cluster those tweets into the approximately 5 corresponding "sub-topics" as indicated by the hash-tags?
  • Task 4: Can we extract the most representative tweets for a given cluster?

Dataset & Evaluation

We plan to create a data set of at least 1,000 tweets (probably many more) for each of our 30 popular/topical hash tags using the Twitter API. For example:

  • Politics: #obama, #2012, #sotu
  • Sports: #superbowl, #worldcup, #lebron
  • Science/Technology: #ipad, #watson, #exoplanet
  • Entertainment: #musicmonday, #grammy, #oscar
  • Finance: #AAPL, #dowjones, #gold
  • "Just for fun": #fml, #thingsilike, #followfriday
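
As a minimal sketch of how the hash tags could serve as gold labels, the snippet below maps each example tag from the list above to its topic and derives a (fine, coarse) label pair per tweet. The names `TOPIC_OF_TAG` and `gold_labels` are hypothetical, and the policy of skipping tweets that carry zero or several tracked tags is our assumption, not something the proposal specifies.

```python
import re

# Hash tag -> topic, copied from the example list above (lowercased).
TOPIC_OF_TAG = {
    "obama": "Politics", "2012": "Politics", "sotu": "Politics",
    "superbowl": "Sports", "worldcup": "Sports", "lebron": "Sports",
    "ipad": "Science/Technology", "watson": "Science/Technology",
    "exoplanet": "Science/Technology",
    "musicmonday": "Entertainment", "grammy": "Entertainment",
    "oscar": "Entertainment",
    "aapl": "Finance", "dowjones": "Finance", "gold": "Finance",
    "fml": "Just for fun", "thingsilike": "Just for fun",
    "followfriday": "Just for fun",
}

HASHTAG_RE = re.compile(r"#(\w+)")


def gold_labels(tweet_text):
    """Return (hash tag, topic) for a tweet, or None if it is unusable.

    Assumption: a tweet is only usable as gold data when it mentions
    exactly one of the tracked hash tags.
    """
    tags = {t.lower() for t in HASHTAG_RE.findall(tweet_text)}
    tracked = sorted(t for t in tags if t in TOPIC_OF_TAG)
    if len(tracked) != 1:
        return None
    tag = tracked[0]
    return tag, TOPIC_OF_TAG[tag]
```

In practice the tracked hash tag would probably also need to be removed from the tweet text before clustering, since otherwise the label leaks into the features.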

For tasks 1, 2, and 3, we plan to use the following standard IR metrics to measure the quality of clusters, using the hash tag or higher-level topic as the gold-standard class for each tweet:

  • Purity
  • Normalized Mutual Information (NMI)
  • Per-Cluster F-Score
  • Rand Index
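
Three of the metrics above can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the evaluation code of the project: the function names are hypothetical, NMI is normalized by the arithmetic mean of the two entropies (other normalizations exist), and the per-cluster F-score is left out for brevity.

```python
from collections import Counter
from itertools import combinations
from math import log


def purity(clusters, golds):
    """Fraction of items that belong to the majority gold class of their cluster."""
    by_cluster = {}
    for c, g in zip(clusters, golds):
        by_cluster.setdefault(c, Counter())[g] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(golds)


def rand_index(clusters, golds):
    """Fraction of item pairs on which the clustering and the gold classes agree."""
    pairs = list(combinations(range(len(golds)), 2))
    agree = sum(
        (clusters[i] == clusters[j]) == (golds[i] == golds[j]) for i, j in pairs
    )
    return agree / len(pairs)


def entropy(labels):
    n = len(labels)
    return -sum((k / n) * log(k / n) for k in Counter(labels).values())


def nmi(clusters, golds):
    """Mutual information divided by the mean of the two entropies."""
    n = len(golds)
    joint = Counter(zip(clusters, golds))
    pc, pg = Counter(clusters), Counter(golds)
    mi = sum(
        (k / n) * log((k * n) / (pc[c] * pg[g])) for (c, g), k in joint.items()
    )
    denom = (entropy(clusters) + entropy(golds)) / 2
    # Both labelings are a single group: treat as perfect agreement.
    return mi / denom if denom > 0 else 1.0
```

The pairwise loop in `rand_index` is quadratic in the number of tweets, so for 30,000 or more tweets a contingency-table formulation would be preferable.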


For task 4, we plan to rank tweets according to cosine similarity or some other such measure to determine how representative each tweet is of the entire cluster, and then select the highest ranked one.
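
One possible reading of this ranking step is sketched below: each tweet is scored by its cosine similarity to the summed bag-of-words vector of its cluster, and the top-scoring tweet is returned. The whitespace tokenization, the raw term counts, and the names `bow`, `cosine`, and `most_representative` are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def bow(text):
    """Bag-of-words term counts using naive whitespace tokenization."""
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def most_representative(tweets):
    """Return the tweet closest to the cluster centroid, plus all scores."""
    vectors = [bow(t) for t in tweets]
    centroid = Counter()
    for v in vectors:
        centroid.update(v)  # summing suffices: cosine ignores vector length
    scores = [cosine(v, centroid) for v in vectors]
    best = max(range(len(tweets)), key=scores.__getitem__)
    return tweets[best], scores
```

Sorting the tweets by `scores` gives the full ranking if more than one representative tweet is wanted.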

Possible Techniques

Since this is a clustering problem, we can think of a variety of unsupervised techniques:

  • Generative models like LDA
  • Training a distance metric that uses features such as cosine similarity and noun-phrase cosine similarity, and then applying a standard clustering algorithm such as Hierarchical Agglomerative Clustering (HAC)
  • We could also look at additional features such as sentiment, and a user's other recent tweets posted close to the ones that mention the hash tag
  • We could augment tweets with text from links referenced in the tweets.
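
To make the HAC option concrete, here is a naive average-linkage sketch that merges clusters by plain bag-of-words cosine similarity until k clusters remain. It is a simplification under stated assumptions: the proposal suggests a trained distance metric, whereas this sketch uses untrained cosine similarity, and the function name `hac` and the tokenization are hypothetical.

```python
from collections import Counter
from math import sqrt


def bow(text):
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def hac(texts, k):
    """Average-linkage agglomerative clustering; returns one label per text."""
    vecs = [bow(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]  # start with singletons

    def avg_sim(a, b):
        total = sum(cosine(vecs[i], vecs[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Find the most similar pair of clusters and merge them.
        p, q = max(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: avg_sim(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[p] += clusters[q]
        del clusters[q]

    labels = [0] * len(texts)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

Calling `hac(texts, 6)` corresponds to task 1 and `hac(texts, 30)` to task 2. This naive version recomputes all pairwise similarities at every merge, so it only suits small samples.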

We also plan to look at supervised algorithms for tasks 1-3 (such as Rocchio and Labeled LDA) and compare their results with those of the unsupervised methods.
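
For the supervised baseline, Rocchio classification can be sketched as a nearest-centroid classifier: one centroid per class is averaged from length-normalized training vectors, and a new tweet receives the label of the most similar centroid. The raw term counts (no TF-IDF weighting) and the names `train_rocchio` and `classify` are assumptions for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def bow(text):
    return Counter(text.lower().split())


def norm(v):
    return sqrt(sum(w * w for w in v.values()))


def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0


def train_rocchio(texts, labels):
    """Compute one centroid per class from unit-length document vectors."""
    sums = defaultdict(Counter)
    counts = Counter(labels)
    for text, label in zip(texts, labels):
        vec = bow(text)
        n = norm(vec)
        for term, w in vec.items():
            sums[label][term] += w / n
    return {
        label: {t: w / counts[label] for t, w in vec.items()}
        for label, vec in sums.items()
    }


def classify(text, centroids):
    """Assign the label of the centroid with the highest cosine similarity."""
    v = bow(text)
    return max(centroids, key=lambda label: cosine(v, centroids[label]))
```

Training on tweets labeled with the 6 topics gives a classifier for task 1, and training on the 30 hash tags gives one for task 2, so the predictions can be scored with the same metrics as the clusters.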

Discussion

We realize that since we are collecting tweets from different areas of the web, we might end up getting really good results even with relatively easy approaches. To remedy this, we discussed various additions that we could make to this project, such as using these techniques to personalize a user's Twitter stream according to the users they follow, suggesting new users to follow, or predicting hash tags for a tweet. However, all these ideas, although interesting, would significantly change the direction and increase the workload of our project, and we are not sure we would be able to finish them. So, we do not want to modify our goals as of now. If we finish our project with enough time to spare, we will revisit one of these other ideas and consider adding it to our project.

References

Prof. Cohen has provided us with a couple of references:

- http://net.pku.edu.cn/~zhaoxin/Wayne%20Xin%20ZHAO@PKU_files/Comparing_Twitter_NYT_Report.pdf This paper contains techniques to prepare topic-based Twitter datasets, as well as a novel technique to adapt LDA to Tweets. Since there isn't any off-the-shelf implementation of this new version of LDA, we are not sure if we will implement it, but we will try to do so if it can be done in a reasonable amount of time.

- http://www.aclweb.org/anthology/W/W10/W10-0510.pdf This paper describes Kriti's project from last year, which involved some visualization of Twitter topics. It helps one get an idea of what topics naturally appear on Twitter. The paper reports stylistic variations as well as topical variations when one performs unsupervised LDA on tweets, and we will see if we obtain evidence of these variations in our project as well.