Rao, D., D. Yarowsky, A. Shreevats, and M. Gupta. 2010. Classifying latent user attributes in twitter. In Proceedings of the 2nd international workshop on Search and mining user-generated contents, 37–44.
Online Version
An online version of this paper is available here: [1]
Summary
The authors investigate the use of rich feature sets and stacked SVM-based classifiers to classify latent user attributes, including gender, age, regional origin, and political orientation, solely from Twitter user language. They also analyze which features and approaches are effective, and which are not, for classifying such user attributes in Twitter-style data, as opposed to the spoken genres previously studied in the user-property classification literature.
Dataset
The authors build a distinct dataset for each attribute mentioned above in a semi-supervised manner. For gender, they get the seed set for a crawl from initial sources including sororities, fraternities, and male and female hygiene products. This produces around 500 users in each class. For age, they manually classify seed users as being below or above 30. For regional origin, they select seed users from cities with low cross-migration and annotate them manually. Finally, for political orientation, they look at Twitter lists for the National Rifle Association (NRA), keyword searches like "support Palin" or "proud democrat", and hashtags related to current news events.
Network Structure and Communication Behavior
For each class of each attribute, the authors studied whether there was any difference in the distribution of:
1. The follower-following ratio: the ratio of the number of followers of a user to the number of users he/she is following.
2. The follower frequency: the number of followers.
3. The following frequency: the number of followees.
They found that these features did not correlate with the attributes under consideration. They similarly considered the distributions of:
1. Response frequency: percentage of tweets from the user that are replies.
2. Retweet frequency: percentage of tweets that are retweets.
3. Tweet frequency: percentage of tweets that are from the user, uninitiated.
Again, they observed no exploitable difference in communication behavior on these dimensions between male and female users, nor for classifying age, regional origin, or political orientation. This is in contrast to the results of similar experiments on speech and conversation data.
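As a rough illustration of the six quantities above, the following sketch computes them from per-user counts. The `UserStats` record and its field names are hypothetical and do not come from the paper. The sketch also assumes that replies and retweets are disjoint, so that every remaining tweet counts as uninitiated.

```python
from dataclasses import dataclass


@dataclass
class UserStats:
    """Hypothetical per-user counts; field names are illustrative only."""
    followers: int
    following: int
    replies: int
    retweets: int
    total_tweets: int


def network_features(u: UserStats) -> dict:
    # Follower-following ratio; users who follow nobody get an infinite ratio.
    ratio = u.followers / u.following if u.following else float("inf")
    return {
        "ff_ratio": ratio,
        "follower_freq": u.followers,
        "following_freq": u.following,
    }


def communication_features(u: UserStats) -> dict:
    # All three values are percentages of the user's total tweets.
    if u.total_tweets == 0:
        return {"response": 0.0, "retweet": 0.0, "tweet": 0.0}
    # Assumption: a tweet is exactly one of reply, retweet, or uninitiated.
    uninitiated = u.total_tweets - u.replies - u.retweets
    return {
        "response": 100.0 * u.replies / u.total_tweets,
        "retweet": 100.0 * u.retweets / u.total_tweets,
        "tweet": 100.0 * uninitiated / u.total_tweets,
    }
```

Comparing the per-class distributions of these values is the kind of check the authors report as showing no exploitable difference.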
Classification Models
The authors used Support Vector Machines to train binary classifiers for each attribute. They used three sets of features, all derived from the tweet messages alone, and trained the following three SVM classifiers:
1) Sociolinguistic-feature model
The authors rely on prior research in sociolinguistics that shows differences in lexical choice and other linguistic features in discourse conditioned on age, gender, and social class. For example, in speech it is well known that certain utterances like "umm" and "uh-huh", and backchannel responses like laughter and lip smacking, are more prevalent among female speakers than among their male counterparts.
2) Ngram-feature model
Such models have been used earlier to identify gender from telephone conversations. The authors apply the same class of approaches to Twitter classification by deriving the unigrams and bigrams of the tweet text. The text was segmented and normalized so as to preserve emoticons and other punctuation sequences as first-class lexical items. The authors find that emoticons serve as important features, and that TF-IDF performed worse than TF alone for this task on a development set.
3) A stacked model that combines SVMs based on the above two feature sets
Finally, the authors employed a stacked SVM model to do simple classifier stacking. Its features are the predictions from the Ngram-feature and Sociolinguistic models, along with their prediction weights.
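The n-gram features and the input to the stacked model can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The token pattern is an assumed stand-in for their segmentation step, and reading the "prediction weight" as the absolute value of a base classifier's signed score is likewise an assumption. The function names `tokenize`, `ngram_tf`, and `stacked_features` are hypothetical.

```python
import re
from collections import Counter

# Emoticons and punctuation runs are tried before words so that they
# survive as single tokens. Illustrative pattern, not the paper's tokenizer.
TOKEN_RE = re.compile(r"[:;=][-']?[)(DPp/\\|]|[!?.]{2,}|[#@]?\w+|[^\w\s]")


def tokenize(text):
    tokens = TOKEN_RE.findall(text)
    # Lowercase words, mentions, and hashtags; leave emoticons such as ":D" intact.
    return [t.lower() if (t[0].isalnum() or t[0] in "#@") else t for t in tokens]


def ngram_tf(tokens, max_n=2):
    # Raw term frequencies of unigrams and bigrams (TF only, no IDF weighting).
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def stacked_features(ngram_score, socio_score):
    # Meta-level features: each base model's predicted label (+1 or -1)
    # followed by |score| as an assumed proxy for its prediction weight.
    feats = []
    for score in (ngram_score, socio_score):
        feats.append(1.0 if score >= 0 else -1.0)
        feats.append(abs(score))
    return feats
```

In a full pipeline, the TF counts would feed the Ngram-feature SVM, and the vectors from `stacked_features` would be the training input of the third, stacked SVM.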