Difference between revisions of "Analysis of Twitter user's location behaviors"

From Cohen Courses

Latest revision as of 19:11, 15 October 2012

Comments

This is a nice general area - I'm a little worried, though, because you guys haven't yet converged on a precise topic/question, and are planning on collecting data. A couple of suggestions:

  • Justin's talk tomorrow on LiveHoods may give you some concrete ideas. I believe that their group has data to distribute.
  • There is some data with ground truth on some of the questions you're raising, in MIT's reality mining project (http://reality.media.mit.edu/download.php). It might make sense to evaluate some models on this data, and then apply them to Twitter location data to see what can be inferred (e.g., where people live).

Good luck refining these ideas! --Wcohen 15:08, 10 October 2012 (UTC)

(Response: Thank you, Professor Cohen, for your helpful comments! One of our ultimate motivations is to ask the question "Are geo-tags meaningful or useful in tweets?". Sometimes people attach a location to tweets with little or no purpose, and we do not think a content-based method can tell us when that is the case. Our hypothesis is that we can get that information from users' past behavior. (Users turn out to behave consistently when posting and sharing locations. For example, if someone always attaches location information to tweets and posts tweets frequently, then the location is probably irrelevant to the tweet text.) We think this course project can be an initial step toward answering that question. We have added some detailed tasks to the task description below.)

PS. Bhavana recommends this paper: Nilesh N. Dalvi, Ravi Kumar, Bo Pang: Object matching in tweets with spatial models. WSDM 2012: 43-52

(Thanks, Bhavana, for sharing. This paper is closely related to our idea and insightful on content-based methods. We will try to benefit from its findings.)

PPS. A previous project for this course used 4-square data and location-based Twitter data, which might be available.

What's the team

What’s the data you’ll work with?

We first select a list of Twitter users who use location information in their tweets.

  • Seed user set (SUS): we randomly gather a set of users (say 500) who are "location-active", meaning they use location information in their tweets.

We identify them by listening to the streaming API and applying filtering rules for the SUS. To make their locations interpretable by us, we require them to be in the Greater Pittsburgh area.

  • Extended user set (EUS): we extract all the mutual followers of users in the SUS in order to observe the social sphere of the seed users.

(All SUS users are in the EUS. If user A is in the SUS, A follows B, and B follows A, then B is in the EUS. B does not have to use the location feature.)
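
The two user sets defined above could be sketched roughly as follows. This is only an illustration under stated assumptions: the Tweet record, the bounding box coordinates for the Greater Pittsburgh area, and the min_geo_tweets threshold are all hypothetical and not part of the proposal, and the random sampling down to about 500 seed users is left out.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical bounding box for the Greater Pittsburgh area (assumed values):
# (min_lat, min_lon, max_lat, max_lon).
PITTSBURGH_BBOX = (40.2, -80.4, 40.7, -79.6)


@dataclass
class Tweet:
    """Hypothetical minimal tweet record used only for this sketch."""
    user: str
    lat: Optional[float] = None  # None when the tweet carries no geo-tag
    lon: Optional[float] = None


def in_bbox(tweet, bbox=PITTSBURGH_BBOX):
    """True when the tweet has coordinates that fall inside the box."""
    if tweet.lat is None or tweet.lon is None:
        return False
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= tweet.lat <= max_lat and min_lon <= tweet.lon <= max_lon


def seed_user_set(tweets, min_geo_tweets=1):
    """Users with at least min_geo_tweets geo-tagged tweets inside the box."""
    counts = {}
    for t in tweets:
        if in_bbox(t):
            counts[t.user] = counts.get(t.user, 0) + 1
    return {u for u, c in counts.items() if c >= min_geo_tweets}


def extended_user_set(sus, follows):
    """SUS plus every mutual follower of a seed user.

    follows is a set of (follower, followee) pairs.
    """
    eus = set(sus)
    for a, b in follows:
        # B joins the EUS when A is a seed user, A follows B, and B follows A.
        if a in sus and (b, a) in follows:
            eus.add(b)
    return eus
```

Keeping the follow graph as a set of directed pairs makes the mutual-follow test a single membership lookup.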

Then we crawl the user profiles and tweeting history of all users in the EUS over a specified time period (say 60 days) using the REST API, in order to study how they use location in their tweets.

What’s the task or tasks?

The main task for us is to understand:

  • [1] What types of places do people tend to tag in their tweets? Do they use the location feature to share public info only, private info only, both, or neither?
    • (We try to identify some typical personas and estimate how the user population is distributed across them. We use a reverse geocoding API to label locations semantically.)
  • [2] Do people who are socially close to each other share similar location usage patterns? How much difference does the location feature make to social interactions between users?
    • (We analyze location usage for all users, and check conversations, retweets, and deletions to study the social impact of location information in tweets.)
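
One possible way to approach task [2] is to compare each user's share of geo-tagged tweets with the average share among that user's mutual followers. The sketch below assumes this particular measure and a plain Pearson correlation; neither is specified in the proposal, and all function names are hypothetical.

```python
from math import sqrt


def geotag_rate(tweet_flags):
    """Share of tweets carrying a location; tweet_flags is a list of booleans."""
    return sum(tweet_flags) / len(tweet_flags) if tweet_flags else 0.0


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def friend_similarity(rates, friends):
    """Correlate each user's rate with the mean rate of their friends.

    rates maps user -> geotag rate; friends maps user -> set of mutual followers.
    Users with no friends of known rate are skipped.
    """
    own, peer = [], []
    for user, fs in friends.items():
        known = [rates[f] for f in fs if f in rates]
        if user in rates and known:
            own.append(rates[user])
            peer.append(sum(known) / len(known))
    return pearson(own, peer) if len(own) >= 2 else 0.0
```

A value near 1 would suggest that friends use the location feature at similar rates, while a value near 0 would suggest no such relation.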

Some general research questions we are trying to answer during the project:

  • RQ1: What fraction of users use the location feature in Twitter, and how frequently do they use it?
  • RQ2: What is the relation between location and tweet content among "location-active" users?
  • RQ3: Do "location-active" users share common posting behavior with their social sphere?
  • RQ4: What impact does location info have in the "location-active" users' social sphere?
  • RQ5: What can we infer from users' location info? (Can we locate their living area, their home, or their workplace?)
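
As an illustration of RQ1, the fraction of users who use the location feature and how often they use it could be computed along these lines. The input layout (one list of booleans per user) and the window_days parameter are assumptions made for this sketch.

```python
def location_usage(tweets_by_user, window_days=60):
    """Summarize location usage over an observation window.

    tweets_by_user maps user -> list of booleans (True = tweet has a location).
    Returns (fraction of users with at least one geo-tagged tweet,
             {user: (share of tweets geo-tagged, geo-tagged tweets per day)}).
    """
    total_users = len(tweets_by_user)
    stats = {}
    for user, flags in tweets_by_user.items():
        tagged = sum(flags)
        if tagged:  # only users who used the feature at least once
            stats[user] = (tagged / len(flags), tagged / window_days)
    fraction = len(stats) / total_users if total_users else 0.0
    return fraction, stats
```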

How will you evaluate? (qualitatively or quantitatively?)

  • To evaluate our analysis of tweets, we can pick a random subset of users and compare the user info we infer with human-interpreted information.
  • To evaluate the consistency of our conclusions, we may use cross-validation.

(We use a subset of users to learn the nature of tweeting behavior, and check whether the conclusions also apply to the rest of the users.)
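
A minimal sketch of such a consistency check, assuming the conclusion under test can be reduced to one numeric statistic per user (for example, the share of geo-tagged tweets). The fold count, the seed, and the function names are hypothetical.

```python
import random


def k_fold_users(users, k=5, seed=0):
    """Shuffle users reproducibly and deal them into k folds."""
    users = sorted(users)
    random.Random(seed).shuffle(users)
    return [users[i::k] for i in range(k)]


def consistency_check(values, k=5, seed=0):
    """Compare a statistic on the learning folds with the held-out fold.

    values maps user -> numeric statistic.
    Returns one (mean on other folds, mean on held-out fold) pair per fold.
    """
    folds = k_fold_users(values, k, seed)
    results = []
    for i, held_out in enumerate(folds):
        train = [u for j, fold in enumerate(folds) if j != i for u in fold]
        if not train or not held_out:
            continue  # skip empty folds when there are fewer users than k
        train_mean = sum(values[u] for u in train) / len(train)
        test_mean = sum(values[u] for u in held_out) / len(held_out)
        results.append((train_mean, test_mean))
    return results
```

If the two means in each pair stay close across folds, the conclusion drawn from the learning subset plausibly carries over to the remaining users.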

What are the key technical challenges, and what do you hope to learn?

  • One of the challenges is to infer users' behavior from the tweet stream, which is a noisy channel.

Users might not have a consistent tweeting style, or might behave differently on different clients (for example, web browser vs. mobile clients).

  • Another challenge is the analysis of social interactions between users. The interaction flow that we can observe might be incomplete.

What kinds of social impact can we identify? We might need to dig deeper into our data to find the answer.
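
One simple way to probe the first challenge is to check whether the share of geo-tagged tweets differs by posting client. The sketch below assumes each tweet has been reduced to a (source, has_location) pair; that layout and the client labels are hypothetical.

```python
from collections import defaultdict


def geotag_rate_by_source(tweets):
    """Share of geo-tagged tweets per posting client.

    tweets is an iterable of (source, has_location) pairs.
    """
    totals = defaultdict(int)
    tagged = defaultdict(int)
    for source, has_location in tweets:
        totals[source] += 1
        if has_location:
            tagged[source] += 1
    return {source: tagged[source] / totals[source] for source in totals}
```

A large gap between clients would suggest analyzing each client separately rather than pooling all tweets of a user.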