Difference between revisions of "Cohen Courses:Tweet"
Latest revision as of 00:14, 6 October 2011
Inferring geographical activity using Twitter.
Team Member(s)
Proposal
Inferring the location of social media users is an interesting problem that can enable many applications, such as location-based personalized information services, regionally focused advertising, and epidemiological modeling of the spread of disease. However, previous work has focused largely on modeling or inferring the mostly static base location of the user (region, city), rather than the dynamic location, which changes with time and is reflected in the user's social media activity, e.g. a status update on Facebook or a tweet on Twitter. In this project we would like to infer this dynamic location category of a user (i.e. where the user makes a tweet) based on the words in the tweet (including sentiment) and the time of the tweet. We believe the following about Twitter users:
- they tweet differently at different locations - e.g. a tweet made from a restaurant (about the food, the service, etc.) may be different from a tweet made from an office (about work, etc.)
- their location is affected by time - e.g. a person is more likely to tweet from the office than from a nightspot in the morning
- their sentiment is affected by location and/or time - e.g. a person may be more likely to feel sombre in the office on weekdays than at travel spots on holidays or weekends
How the locations of a user's tweets change with time represents the geographical activity profile of that user. Such activity may be structured across geographical space and across time. This structure is what we want to learn about the user from his tweets. Using the structure and the tweet, we would like to infer the location from which the tweet is made.
The location categories to infer are taken from Foursquare categories: "Arts and Entertainment", "College and Education", "Food", "Home/Work/Other", "Nightlife Spots", "Great Outdoors", "Shops", "Travel Spots".
Proposed Approach
There are several challenges to this task, including:
- tweets are inherently noisy, with shorthand and non-standard vocabulary
- there may not be any location cues in the tweet: e.g. a user may be in a restaurant, but his tweet may not reflect that
- a user may tweet about a location without being there (i.e. it can be a location that he is merely interested in, has visited before, or wants to visit in the future)
- a user may not have any structure in his tweeting habits: i.e. he may not have any geographical pattern to his activity, or even if he has, he may not tweet about it or geo-tag his tweets regularly
To begin the project, we would like to first analyze the data to check that our assumptions are valid: (1) the location of a tweet depends on time, and (2) tweet contents (and/or sentiments) differ across time and/or location.
To analyze the data, we will split tweets into time periods at different granularities: {morning, afternoon, evening, night}; {weekdays, weekends}; {Monday, Tuesday, ..., Sunday}. For each group of tweets we will examine the location categories of the tweets and see whether clusters of location categories emerge at different time periods. Such clusters would support our assumption that the location of a tweet depends on time. We then plan to analyze the contents of the tweets (including sentiments) in more depth to see whether they indeed vary with time and space.
Next, we will construct a baseline classifier based on a bag-of-words model to classify the location category of each tweet. Then, we will add time and space in a structured prediction model to see whether it improves on the baseline classifier.
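The time-based grouping described above could be sketched as follows. This is a minimal illustration rather than the project's actual code: the hour boundaries of each period, the function names (`time_of_day`, `week_part`, `category_distribution`), and the sample checkins are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import datetime


def time_of_day(ts: datetime) -> str:
    """Map a timestamp to a coarse period (hour boundaries are assumed)."""
    hour = ts.hour
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 22:
        return "evening"
    return "night"


def week_part(ts: datetime) -> str:
    """Monday..Friday -> 'weekday', Saturday/Sunday -> 'weekend'."""
    return "weekend" if ts.weekday() >= 5 else "weekday"


def category_distribution(checkins, bucket_fn):
    """For each time bucket, return the share of each location category.

    `checkins` is an iterable of (timestamp, category) pairs.
    """
    counts = defaultdict(Counter)
    for ts, category in checkins:
        counts[bucket_fn(ts)][category] += 1
    return {
        bucket: {cat: n / sum(c.values()) for cat, n in c.items()}
        for bucket, c in counts.items()
    }


# Hypothetical checkins, invented purely for illustration.
sample_checkins = [
    (datetime(2010, 10, 4, 9, 30), "Home/Work/Other"),   # Monday morning
    (datetime(2010, 10, 4, 13, 0), "Food"),              # Monday afternoon
    (datetime(2010, 10, 8, 23, 15), "Nightlife Spots"),  # Friday night
    (datetime(2010, 10, 9, 11, 0), "Great Outdoors"),    # Saturday morning
]

if __name__ == "__main__":
    for name, fn in [("time of day", time_of_day), ("week part", week_part)]:
        print(name, category_distribution(sample_checkins, fn))
```

Comparing the per-bucket distributions (for instance, whether "Nightlife Spots" is concentrated in the night bucket) is one simple way to look for the clusters mentioned above; a day-of-week grouping can be added by passing a function that returns `ts.strftime("%A")`.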
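As a rough illustration of what a bag-of-words baseline might look like, the sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing. The proposal does not name a specific classifier, so Naive Bayes is an assumption here, as are the tokenizer, the class name `NaiveBayesBagOfWords`, and the toy training tweets.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into word-like tokens (keeps #hashtags and @mentions)."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


class NaiveBayesBagOfWords:
    """Multinomial Naive Bayes over token counts, with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()            # tweets seen per category
        self.word_counts = defaultdict(Counter)  # token counts per category
        self.vocab = set()

    def fit(self, tweets, labels):
        for text, label in zip(tweets, labels):
            self.class_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def log_scores(self, text):
        """Return log P(category) + sum of log P(token | category)."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                num = self.word_counts[label][tok] + self.alpha
                den = total_words + self.alpha * vocab_size
                score += math.log(num / den)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
train_tweets = [
    "the pasta and the service here are great",
    "best burger and fries in town",
    "stuck in a long meeting before the deadline",
    "another deadline, another late night at my desk",
]
train_labels = ["Food", "Food", "Home/Work/Other", "Home/Work/Other"]

model = NaiveBayesBagOfWords().fit(train_tweets, train_labels)

if __name__ == "__main__":
    print(model.predict("great burger and quick service"))
    print(model.predict("long meeting before the deadline"))
```

On real data the same interface would be trained on the geo-tagged tweets with their Foursquare categories as labels; the per-category log scores are also a convenient input for any later model that adds time or space.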
Baseline & Dataset
For the baseline, we will use a bag-of-words model to predict the location category. We would like to find out whether adding structure across geographical space and time improves the prediction results.
For the dataset, we have obtained a collection from Infolab (http://infolab.tamu.edu/) consisting of geo-tagged tweets from Twitter (up to the 2,000 most recent geo-labeled tweets for each sampled user) from late September 2010 to late January 2011, giving a total of 225,098 users and 22,506,721 unique checkins (i.e. the process of announcing arrival at a location), of which more than 53% are from Foursquare.
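The proposal does not say which structured model will be used. One possible reading, shown here only as a sketch, is a first-order chain over a user's consecutive tweets: each tweet gets per-category scores from the text classifier, and a transition table encodes how likely a user is to move from one category to the next. The function name `viterbi`, the two-category setup, and all probabilities below are hypothetical.

```python
import math


def viterbi(emission_log_scores, transition_log_probs, categories):
    """Most likely category sequence for one user's time-ordered tweets.

    emission_log_scores: list of {category: log score}, one dict per tweet.
    transition_log_probs: {(previous, current): log probability}.
    """
    if not emission_log_scores:
        return []
    best = [{c: emission_log_scores[0][c] for c in categories}]
    backpointers = []
    for scores in emission_log_scores[1:]:
        current, pointer = {}, {}
        for c in categories:
            # Best previous category to arrive at c from.
            prev = max(
                categories,
                key=lambda p: best[-1][p] + transition_log_probs[(p, c)],
            )
            current[c] = best[-1][prev] + transition_log_probs[(prev, c)] + scores[c]
            pointer[c] = prev
        best.append(current)
        backpointers.append(pointer)
    # Trace back from the best final category.
    path = [max(categories, key=lambda c: best[-1][c])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]


# Hypothetical numbers: users tend to stay in the same category (0.8).
categories = ["Food", "Nightlife Spots"]
transitions = {
    (a, b): math.log(0.8 if a == b else 0.2)
    for a in categories
    for b in categories
}
# Per-tweet text scores; the middle tweet is ambiguous on its own.
emissions = [
    {"Food": math.log(0.9), "Nightlife Spots": math.log(0.1)},
    {"Food": math.log(0.45), "Nightlife Spots": math.log(0.55)},
    {"Food": math.log(0.9), "Nightlife Spots": math.log(0.1)},
]

if __name__ == "__main__":
    independent = [max(e, key=e.get) for e in emissions]
    print("per-tweet:", independent)
    print("chain:", viterbi(emissions, transitions, categories))
```

In this toy case the middle tweet, taken alone, leans slightly toward "Nightlife Spots", but the chain labels it "Food" because its neighbours are confidently "Food" and switching categories is penalized. Whether such structure helps on real data is exactly the question the comparison against the baseline is meant to answer.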
Related Work
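- A Latent Variable Model for Geographic Lexical Variation by Eisenstein et al., EMNLP 2010. http://brenocon.com/eisenstein_oconnor_smith_xing.emnlp2010.geographic_lexical_variation.pdf
- A Mixture Model of Demographic Lexical Variation by O'Connor et al., NIPS 2010 Workshop on Machine Learning and Social Computing. http://people.csail.mit.edu/jacobe/papers/nipsws2010.pdf
- You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users by Cheng et al., CIKM 2010. http://faculty.cs.tamu.edu/caverlee/pubs/cheng10cikm.pdf
- Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures by Golder and Macy, Science, Vol. 333, no. 6051, pp. 1878-1881, 30 September 2011. http://www.sciencemag.org/content/333/6051/1878.abstract
- Exploring Millions of Footprints in Location Sharing Services by Cheng et al., ICWSM 2011. http://students.cse.tamu.edu/kyumin/papers/cheng11icwsm.pdf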