Analyzing User Tweets around foursquare checkins
Team
- Rajarshi Das
- Ranjitha Gurunath Kulkarni
Latest revision as of 22:02, 11 January 2013
Project idea
Location-sharing social networks have recently seen a massive increase in usage. Networks such as foursquare have introduced a new form of social interaction in which a user checks in to a physical location (Food, College & University, Nightlife Spots, etc.). Foursquare allows user check-ins to be published as tweets. Our plan was to analyze users' tweeting behaviour around (before and after) their foursquare check-ins. A person who is excited to go to a place (a movie, a baseball match, etc.) is likely to tweet about it on the way there. People are also likely to tweet about a place after they have been there (for example, a restaurant or a movie). Our strategy was therefore to analyze the tweets a user posts shortly before or after visiting a place. We identify these tweets through the check-in, which serves as a proxy for the user's location.
Data
We have a week of tweets from around 300,000 users across the world. We expect a significant number of foursquare check-ins among these tweets. We will start our analysis on this data and, once we have a proof of concept, begin gathering more. The present data was generously shared with us by Hazim Almuhimedi, a PhD student at the Institute for Software Research at CMU.
We also have a list of NYC users who were considered in the Livehoods project, from Justin Cranshaw (http://justincranshaw.com). We plan to use this list to collect these users' tweets incrementally, starting 10/16.
Approach
To find evidence for our hypothesis in the collected data, we carried out two experiments:
- Topic Modeling.
- Label Prediction.
To gain some initial confidence in our hypothesis, we ran topic modeling (Blei, Ng and Jordan 2003) on the tweets temporally surrounding a foursquare check-in and examined the resulting distribution of topics. Each document in this experiment was the set of tweets posted within a specific time interval before or after any check-in tweet belonging to a particular foursquare category. The time interval varied with the category. For example, for a food place we considered only the tweets within an hour before or after the check-in, while for places such as offices or residences we used a larger interval of around 3 hours. This gave us one document of tweets surrounding check-ins for each top-level foursquare category (for example Food, Nightlife, etc.). The results were promising and are described in the next section.
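The document construction described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the function name `build_documents`, the tuple layouts, the `WINDOW_HOURS` table, and the default window for unlisted categories are all assumptions made for the example. It also assumes that the check-in tweets themselves have already been separated from the ordinary tweets.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical per-category windows. The text mentions about 1 hour for food
# places and about 3 hours for offices or residences; other values are guesses.
WINDOW_HOURS = {"Food": 1, "Professional": 3, "Residence": 3}
DEFAULT_WINDOW_HOURS = 1  # assumption for categories the text does not mention


def build_documents(tweets, checkins):
    """Group tweets posted near a check-in into one document per category.

    tweets:   list of (user, timestamp, text), excluding check-in tweets
    checkins: list of (user, timestamp, category)
    Returns a dict mapping category -> list of tweet texts.
    """
    tweets_by_user = defaultdict(list)
    for user, ts, text in tweets:
        tweets_by_user[user].append((ts, text))

    documents = defaultdict(list)
    for user, checkin_ts, category in checkins:
        hours = WINDOW_HOURS.get(category, DEFAULT_WINDOW_HOURS)
        window = timedelta(hours=hours)
        for ts, text in tweets_by_user[user]:
            # Keep tweets within the window before or after the check-in.
            if abs(ts - checkin_ts) <= window:
                documents[category].append(text)
    return dict(documents)


# Toy data, invented for illustration only.
t0 = datetime(2012, 10, 16, 12, 0)
tweets = [
    ("u1", t0 - timedelta(minutes=30), "so hungry, pizza time"),
    ("u1", t0 + timedelta(hours=2), "back at my desk"),
    ("u2", t0 + timedelta(hours=2, minutes=30), "long meeting today"),
]
checkins = [("u1", t0, "Food"), ("u2", t0, "Professional")]
docs = build_documents(tweets, checkins)
```

In this toy run the second tweet from `u1` falls outside the one-hour Food window and is dropped, while the tweet from `u2` falls inside the three-hour Professional window. Each resulting document could then be passed to a topic-modeling tool.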
After gaining this initial confidence, we carried out a quantitative analysis in which we trained a classifier to assign a document of tweets to a category. For example, given one document containing a user's tweets surrounding nightlife places and another containing tweets surrounding professional places (offices, workplaces, etc.), we want the classifier to label the former as "Nightlife places" and the latter as "Professional places". We selected semantically contrasting categories. For example, in Nightlife vs. Professional, people visit the former at night and the latter during the day, so if our hypothesis holds their tweeting behaviour should differ. The other category pairs we considered are Food vs. Professional and College vs. Travel.
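As a rough illustration of this kind of two-category document classifier, the sketch below implements a small multinomial Naive Bayes with add-one smoothing in plain Python. It is not the project's implementation: the whitespace tokenizer, the function names `train_naive_bayes` and `classify`, and the four training snippets are all invented for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Count documents per label and words per label.

    labeled_docs: list of (label, text) pairs.
    Returns (label_counts, word_counts, vocabulary).
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in labeled_docs:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log posterior for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training documents, invented for illustration only.
training = [
    ("Nightlife", "great beer at the lounge tonight"),
    ("Nightlife", "dancing and drinks with friends"),
    ("Professional", "meeting with the client this morning"),
    ("Professional", "deadline for the project report"),
]
model = train_naive_bayes(training)
nightlife_guess = classify(model, "beer and drinks tonight")
professional_guess = classify(model, "project meeting this morning")
```

With real data, each training document would be the set of tweets surrounding check-ins of one category, built as described in the topic-modeling experiment.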
Results
Our initial topic-modeling results were very promising. Below are some of the topics that came up for the categories Nightlife and College & University.
[Figure: Topics for category Nightlife] [Figure: Topics for category College & University]
As the figures show, the words in tweets near nightlife places are related to nightlife (beer, lounge, etc.), and the words in tweets near university campuses are related to university life (class, professor, etc.).
After gaining this initial confidence in our approach, we carried out a quantitative evaluation by training classifiers on the tweets around a check-in (described briefly in the Approach section). There were two different sets of test data. One set contained the tweets around the check-in together with the text of the check-in itself; the other contained only the surrounding tweets. When you check in to a location via foursquare, a default "I'm at <this place>" text is generated, which the user can edit. Below we report the accuracy of two classifiers (Naive Bayes and Maxent).
[Figure: Naive Bayes] [Figure: Maxent]
When the foursquare check-in text is included, accuracy is high, which matches intuition, since that particular tweet talks about the location. Without the check-in text, accuracy decreases but remains fairly high, which gives us confidence that our hypothesis is correct: "What a person tweets is affected by where they are and what they are doing."
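The comparison between the two test sets can be sketched as below. This is only an illustration of the evaluation setup: the dictionary layout, the helper names `make_test_document` and `accuracy`, the place names, and the keyword-based `toy_predict` stand-in for a trained classifier are all hypothetical, and the numbers it produces say nothing about the accuracies reported in the figures.

```python
def make_test_document(example, include_checkin_text):
    """Join the surrounding tweets, optionally adding the check-in text."""
    parts = list(example["tweets"])
    if include_checkin_text:
        parts.append(example["checkin_text"])
    return " ".join(parts)


def accuracy(predict, examples, include_checkin_text):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(
        predict(make_test_document(ex, include_checkin_text)) == ex["label"]
        for ex in examples
    )
    return correct / len(examples)


def toy_predict(document):
    """Stand-in for a trained classifier: a single hypothetical keyword rule."""
    return "Nightlife" if "lounge" in document.lower() else "Professional"


# Toy examples, invented for illustration only.
examples = [
    {
        "label": "Nightlife",
        "checkin_text": "I'm at Blue Lounge",
        "tweets": ["long week, finally out"],
    },
    {
        "label": "Professional",
        "checkin_text": "I'm at Acme Office",
        "tweets": ["slides are ready for the meeting"],
    },
]
with_checkin = accuracy(toy_predict, examples, include_checkin_text=True)
without_checkin = accuracy(toy_predict, examples, include_checkin_text=False)
```

Keeping the check-in text as a separate field makes it easy to evaluate the same classifier on both variants of the test set.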
Comments
Note: A previous project for this course used 4-square data and location-based Twitter data, which might be available. --Wcohen 20:31, 10 October 2012 (UTC)
Thank you. Yes, we have retrieved that data. We will use it for user data in case the current set of users doesn't prove helpful. Rgkulkar 23:11, 15 October 2012 (UTC)
- The problem/task is not concrete. You may want to write more about what exactly you want to predict and how you want to do it, i.e. what your approach is and what features you will use.
- Some ideas
  - As you have noted, the data could be really small w.r.t. check-in and location-based tweet information. One possibility is to leave out a portion of the "location-based" tweets as a test set for evaluation, then take the rest of the location-based tweets as a seed set to cluster the unlabeled tweets from the dataset.
  - You may want to start with a coarse-level prediction, i.e. category type (say restaurants_in_sq._hill or just restaurants), as opposed to fine-grained, i.e. the exact place (name of that restaurant), for the sentiment analysis.
-- Apappu 13:32, 11 October 2012 (UTC)
Thank you. We have updated the wiki to include as much detail as possible. One clarification: we are not predicting anything, but analyzing users' tweeting behaviour conditioned on the place they are at in that moment. We agree with the category-type level analysis that you pointed out. Rgkulkar 23:11, 15 October 2012 (UTC)