Difference between revisions of "Analyzing User Tweets around foursquare checkins"

From Cohen Courses

Revision as of 18:12, 15 October 2012

Comments

Note: A previous project for this course used 4-square data and location-based Twitter data, which might be available. --Wcohen 20:31, 10 October 2012 (UTC)

Thank you. Yes, we have retrieved that data. We will use it for user data in case the current set of users doesn't prove helpful. Rgkulkar 23:11, 15 October 2012 (UTC)

  • The problem/task is not concrete. You may want to write more about what exactly you want to predict and how do you want to do it i.e. what's your approach and what features will you use.
  • Some ideas
    • As you have noted the data could be really small w.r.t. check-in and location based tweet information. One possibility is that you could leave-out a portion of "location-based" tweets as test-set for evaluation. Then take the rest of the location-based tweets as a seed-set to cluster the unlabeled tweets from the dataset.

    • You may want to start with a coarse-level prediction i.e., category-type (say restaurants_in_sq._hill or just restaurants) as opposed to fine-grained i.e. exact place (name of that restaurant) for the sentiment-analysis.

-- Apappu 13:32, 11 October 2012 (UTC)

Thank you. We have updated the wiki to include as much detail as possible. One clarification: we are not predicting anything, but analyzing users' tweeting behaviour conditioned on the place they are at in that moment. We agree with the category-type level analysis that you pointed out. Rgkulkar 23:11, 15 October 2012 (UTC)

Team

Project idea

Recently there has been a massive increase in the usage of location-sharing social networks. Social networks such as FourSquare have brought a new form of social interaction in which a user checks in to a physical location (Food, College & University, Nightlife Spots, etc.). FourSquare allows user check-ins to be published as tweets. We plan to analyze users' tweeting behaviour after their foursquare check-ins.

Description

The plan is to do an exploratory analysis on user tweets that follow foursquare check-ins.

Our main hypothesis is that: When a user enters a place (we use foursquare check-ins as a proxy for this), the tweets that follow carry some useful information related to the place the user is in.

We plan to investigate this using a bag-of-words approach. For example, if the user enters a restaurant, we will scan the tweets that follow for keywords related to "restaurants", with the keyword list generated by source expansion (top n search results could be one of the sources, where 10<n<100). This is the approach we want to use for the baseline. We might consider a more sophisticated approach if this one performs poorly.
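The baseline described above can be sketched as a simple keyword-overlap check. This is a minimal illustration, not the project's implementation: the keyword set, the function names and the `min_hits` threshold are all assumptions made for the example, and in the proposal the keywords would come from source expansion rather than being written by hand.

```python
import re

# Hypothetical keyword list for the "restaurants" category. In the proposal
# these would be generated by source expansion over the top-n search results.
RESTAURANT_KEYWORDS = {"menu", "dinner", "lunch", "pizza", "waiter", "delicious"}


def tokenize(text):
    """Lowercase a tweet and return its set of word tokens (bag of words)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def keyword_overlap(tweet, keywords):
    """Return the category keywords that occur in the tweet."""
    return tokenize(tweet) & keywords


def is_place_related(tweet, keywords, min_hits=1):
    """Treat a tweet as place-related if it contains at least min_hits keywords.

    min_hits is an assumed threshold, not something the proposal specifies.
    """
    return len(keyword_overlap(tweet, keywords)) >= min_hits
```

For instance, `is_place_related("The pizza here is delicious!", RESTAURANT_KEYWORDS)` would flag the tweet, since two keywords overlap with its tokens.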

Once we have sufficient evidence for our hypothesis, we plan to further explore user tweet behaviour after foursquare check-ins. To begin with, we want to generate topic models of what users talk about when they are at a particular place. We would collect the tweets as follows:

    for every location (or category) in our check-ins list,
        for all the users that have checked in to this place,
            retrieve all the user tweets that follow each check-in

Once we have all the tweets for a location (or category), we plan to generate a topic model to get the distribution of topics that users tweet about when they are there.
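The collection loop above could look roughly like the following sketch. The tuple layouts, the function name and the one-hour window are assumptions chosen for illustration; the proposal only says that tweets "follow" a check-in within a small interval. The per-category lists it returns are the kind of input a topic model would then be trained on.

```python
from collections import defaultdict
from datetime import timedelta


def tweets_after_checkins(checkins, tweets, window=timedelta(hours=1)):
    """Group tweets by the category of the check-in they follow.

    checkins: iterable of (user, category, checkin_time) tuples
    tweets:   iterable of (user, tweet_time, text) tuples
    window:   assumed cut-off for how long after a check-in a tweet still counts
    """
    # Index tweets by user so each check-in only scans that user's tweets.
    tweets_by_user = defaultdict(list)
    for user, when, text in tweets:
        tweets_by_user[user].append((when, text))

    by_category = defaultdict(list)
    for user, category, checked_in in checkins:
        for when, text in tweets_by_user.get(user, ()):
            # Keep only tweets posted after the check-in and inside the window.
            if checked_in < when <= checked_in + window:
                by_category[category].append(text)
    return dict(by_category)
```

Grouping by category rather than by exact venue follows the coarse-level analysis agreed on in the comments, and it also helps when data for a single venue is sparse.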

If we have sufficient data for a particular place, we also want to investigate the overall sentiment of the user tweets as a proxy for customer reviews about that place. To do sentiment analysis, we might use LingPipe. We will have more clarity on this once we get there.


Tasks

  • For each user, analyze the tweets posted within a small interval after a foursquare check-in, to see whether the user talks about things related to the place where he/she has checked in.
  • Analyze all the tweets that follow foursquare check-ins to a particular place (or category), to see what percentage of users tweet about that place.
  • Find out the topics that users mostly talk about when they are at a particular place.
  • Once we have all the tweets about a particular place, analyze the overall sentiment about that place (for example, whether a particular restaurant is liked by most people).

Note: We will be able to do the sentiment analysis task only if we find that a significant number of people actually tweet about a place after checking in there.
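The percentage mentioned in the second task, which also decides whether sentiment analysis is worthwhile, could be computed along these lines. This is only a sketch: the input layout, the function name and the keyword-based notion of "tweeting about the place" are assumptions carried over from the bag-of-words baseline.

```python
import re


def fraction_of_users_mentioning_place(tweets_by_user, keywords):
    """Return the share of checked-in users with at least one keyword-bearing tweet.

    tweets_by_user: dict mapping a user id to the list of tweet texts that
                    user posted after checking in at one place (or category)
    keywords:       set of lowercase keywords associated with that place
    """
    if not tweets_by_user:
        return 0.0

    def mentions_place(texts):
        # A user counts if any of their tweets shares a token with the keywords.
        return any(set(re.findall(r"[a-z']+", t.lower())) & keywords for t in texts)

    matching = sum(1 for texts in tweets_by_user.values() if mentions_place(texts))
    return matching / len(tweets_by_user)
```

What counts as a "significant number" is left open in the proposal, so any threshold applied to this fraction would need to be chosen once the data has been inspected.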

Data

We have one week of tweets for around 300,000 users around the world. We expect a significant number of these tweets to be foursquare check-ins. We will start our analysis on this data, and once we have a proof of concept we will start gathering more. The present data has been generously shared with us by Hazim Almuhimedi, a PhD student at the Institute for Software Research at CMU.

We also have a list of NYC users that were considered in the Livehoods project, provided by Justin Cranshaw. We plan to listen to these users' tweets incrementally, starting 10/16.


Evaluation

  • Quantitative: Build a small annotated test dataset to evaluate how accurately we identify tweets related to the checked-in place.
  • Qualitative: For sentiment analysis on restaurant tweets, we will see whether the overall sentiment correlates with the ratings on other well-known services such as Yelp.
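One possible way to check the correlation with Yelp ratings is a plain Pearson coefficient over per-restaurant values. The proposal does not name a correlation measure, so the choice of Pearson, the function name and the input layout (one aggregate sentiment score and one rating per restaurant, in matching order) are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers.

    xs: e.g. aggregate tweet sentiment per restaurant (assumed input)
    ys: e.g. the rating of the same restaurants on another service
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)
```

With only a handful of restaurants the coefficient will be noisy, so a rank-based measure might turn out to be a better fit once real data is available.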

Key Technical Challenges

  • We might not have enough data if we narrow the analysis to a single location (for example, a particular restaurant).
  • Given the limited amount of data, we are not sure whether we can do topic modelling accurately, since tweets are inherently short.