Difference between revisions of "Project - Second Draft Proposal - Bo, Kevin, Rushin"

From Cohen Courses

Revision as of 19:39, 14 February 2011

Social Media Analysis (10-802) Project Proposal

Team Members

Bo Lin [bolin@cs.cmu.edu]

Kevin Dela Rosa [kdelaros@cs.cmu.edu]

Rushin Shah [rnshah@cs.cmu.edu]

Summary

Title: Fine & coarse grain clustering of tweets based on topics

We propose to tackle the problem of clustering Twitter messages (tweets) for a set of six predefined topics: Politics, Sports, Science/Technology, Entertainment, Finance, and "Just for Fun". We propose to address this problem by gathering Twitter data for approximately 30 popular hash tags corresponding to the different topics, performing some language and/or topic modeling on the tweets to produce a set of clusters, and then comparing those clusters against the ones defined by the different tags.

This problem is motivated by the observation that hash tags are approximate indicators of a tweet's topic, and by a desire to cluster user tweets into topics in a way analogous to how Google clusters news articles. Some of the challenges for this project include the fact that tweets are short (on average around 10 words) and noisy (filled with jargon, abbreviations, and misspellings).

The specific tasks that we plan to look at are:

  • Task 1: (Coarse grain clustering) Can we cluster the tweets into 6 different clusters (unsupervised, not classification), and how well will these clusters correspond to our 6 predefined clusters of tweets?
  • Task 2: (Fine grain clustering) Can we cluster the tweets into approximately 30 clusters, and how well will these correspond to our hash tags?
  • Task 3: For tweets of a given topic (out of the 6), can we cluster those tweets into the approximately 5 corresponding "sub-topics" as indicated by the hash-tags?
  • Task 4: Can we extract the most representative tweets for a given cluster?

Dataset & Evaluation

We plan to create a data set of at least 1,000 tweets (probably many more) for each of our 30 popular/topical hash-tags using the Twitter API. For example:

  • Politics: #obama, #2012, #sotu
  • Sports: #superbowl, #worldcup, #lebron
  • Science/Technology: #ipad, #watson, #exoplanet
  • Entertainment: #musicmonday, #grammy, #oscar
  • Finance: #AAPL, #dowjones, #gold
  • "Just for fun": #fml, #thingsilike, #followfriday

For tasks 1, 2, and 3, we plan to use the following standard IR metrics to measure the quality of clusters, using the hash-tag or higher-level topic as the gold-standard class for each tweet:

  • Purity
  • NMI
  • Per-Cluster F-Score
  • Rand Index
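
As a rough illustration of how two of these metrics could be computed, here is a minimal sketch in plain Python covering Purity and the Rand Index. The function names and the toy cluster assignments are assumptions made for this example, not part of our pipeline; NMI and per-cluster F-score would follow the same pattern of comparing cluster ids against gold labels.

```python
from collections import Counter
from itertools import combinations


def purity(clusters, labels):
    """Fraction of tweets that carry the majority gold label of their cluster."""
    by_cluster = {}
    for cluster_id, gold in zip(clusters, labels):
        by_cluster.setdefault(cluster_id, Counter())[gold] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(labels)


def rand_index(clusters, labels):
    """Fraction of tweet pairs on which the clustering and gold labels agree."""
    pairs = list(combinations(range(len(labels)), 2))
    agree = sum(
        (clusters[i] == clusters[j]) == (labels[i] == labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Toy data, made up for illustration: predicted cluster ids vs. hash tags.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["#obama", "#obama", "#ipad", "#ipad", "#ipad", "#obama"]

print(purity(clusters, labels))      # 4 of 6 tweets match their cluster's majority tag
print(rand_index(clusters, labels))  # 7 of 15 pairs are treated consistently
```

Both functions take parallel lists, so the same code would apply whether the gold labels are the 30 hash-tags or the 6 higher-level topics.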

For task 4, we plan to rank tweets according to cosine similarity or some other such measure to determine how representative each tweet is of the entire cluster, and then select the highest ranked one.
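
One possible way to realize this ranking is sketched below, assuming a simple bag-of-words representation with whitespace tokenization and taking the summed term counts of the cluster as its centroid. These choices, the function names, and the toy tweets are assumptions for illustration; any other similarity measure could be substituted.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_representative(tweets):
    """Return the tweet closest to the cluster centroid, plus all scores."""
    vectors = [Counter(t.lower().split()) for t in tweets]
    centroid = Counter()
    for v in vectors:
        centroid.update(v)  # summed counts; cosine ignores the scale
    scores = [cosine(v, centroid) for v in vectors]
    best = max(range(len(tweets)), key=scores.__getitem__)
    return tweets[best], scores


# Toy cluster, made up for illustration.
cluster = [
    "superbowl tonight go packers",
    "packers win the superbowl",
    "watching the superbowl with friends packers",
    "my cat is asleep",
]
best, scores = most_representative(cluster)
print(best)
```

In this toy cluster the off-topic tweet receives the lowest score, which is the behaviour we would want from a representativeness measure.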

Possible Techniques

Since this is a clustering problem, we can think of a variety of unsupervised techniques:

  • Generative models like LDA
  • Training a distance metric that uses features such as cosine similarity and noun-phrase cosine similarity, then applying a standard clustering algorithm such as Hierarchical Agglomerative Clustering
  • In addition to the tweet text, we could look at some of the following features:
    • Text from links referenced in the tweets.
    • Sentiment analysis of the tweet text
    • Text from recent tweets by users near the ones that mention the hash tag
    • Re-tweet information
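
To make the Hierarchical Agglomerative Clustering option concrete, here is a minimal sketch under simplifying assumptions: it uses plain bag-of-words cosine distance rather than a trained distance metric, average linkage, and a naive merge loop that would not scale to the full data set. The function names and toy tweets are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / (norm_a * norm_b) if norm_a and norm_b else 0.0)


def hac(tweets, k):
    """Average-link agglomerative clustering; returns k lists of tweet indices."""
    vectors = [Counter(t.lower().split()) for t in tweets]
    clusters = [[i] for i in range(len(tweets))]  # start with singletons

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find and merge the closest pair of clusters.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy tweets, made up for illustration.
tweets = [
    "obama speech tonight",
    "obama sotu speech",
    "new ipad launch",
    "ipad launch event",
]
result = hac(tweets, 2)
print(result)
```

A trained metric would replace `cosine_distance` with a learned combination of features; the merge loop itself would stay the same.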

Additionally, we also plan to look at supervised algorithms for tasks 1-3 (like Rocchio & Labeled LDA), and compare the results with the unsupervised methods.
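
For the supervised comparison, a Rocchio-style classifier can be sketched as a nearest-centroid rule: build one centroid per class from the labeled tweets and assign a new tweet to the class whose centroid is most similar. The version below is a simplified illustration that uses raw term counts without tf-idf weighting; the function names and training examples are assumptions.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_rocchio(tweets, labels):
    """Build one centroid (summed term counts) per class label."""
    centroids = defaultdict(Counter)
    for text, label in zip(tweets, labels):
        centroids[label].update(text.lower().split())
    return dict(centroids)


def classify(text, centroids):
    """Assign the class whose centroid is most similar to the tweet."""
    vec = Counter(text.lower().split())
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy training data, made up for illustration.
train_tweets = [
    "obama sotu speech",
    "senate vote on budget",
    "new ipad launch",
    "watson wins jeopardy",
]
train_labels = ["Politics", "Politics", "Science/Technology", "Science/Technology"]

centroids = train_rocchio(train_tweets, train_labels)
print(classify("obama gives speech", centroids))
print(classify("ipad launch today", centroids))
```

Because the predictions use the same label set as the clustering tasks, they can be scored with the same metrics for a direct comparison.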

Discussion

We realize that since we are collecting tweets from different areas of the web, we might end up getting really good results even with relatively easy approaches. To remedy this, we discussed various additions that we could make to this project, such as using these techniques to personalize a user's twitter stream according to the users that he follows, or suggesting new users to follow, or predicting hashtags for a tweet. However, all these ideas, although interesting, end up significantly changing the direction and increasing the workload of our project; we aren't sure if we would be able to finish them. So, we don't want to modify our goals as of now. If we finish our project with enough time to spare, we will revisit one of these other ideas and consider adding it to our project.

References

Prof. Cohen has provided us with a couple of references:

- http://net.pku.edu.cn/~zhaoxin/Wayne%20Xin%20ZHAO@PKU_files/Comparing_Twitter_NYT_Report.pdf This paper contains techniques to prepare topic-based Twitter datasets, as well as a novel technique to adapt LDA to Tweets. Since there isn't any off-the-shelf implementation of this new version of LDA, we are not sure if we will implement it, but we will try to do so if it can be done in a reasonable amount of time.

- http://www.aclweb.org/anthology/W/W10/W10-0510.pdf This paper describes Kriti's project from last year, which involved some visualization of Twitter topics. It helps one get an idea of what topics naturally appear on Twitter. The paper reports stylistic variations as well as topical variations when one performs unsupervised LDA on tweets, and we will see if we obtain evidence of these variations in our project as well.