Rao, D., D. Yarowsky, A. Shreevats, and M. Gupta. 2010. Classifying latent user attributes in twitter. In Proceedings of the 2nd international workshop on Search and mining user-generated contents, 37–44.
An online version of this paper is available here: 
The authors investigate the use of rich feature sets and stacked SVM-based classifiers to classify latent user attributes, including gender, age, regional origin, and political orientation, solely from Twitter user language. They also analyze which features and approaches are effective, and which are not, for classifying such user attributes in Twitter-style data, as opposed to the spoken genres previously studied in the user-property classification literature.
The authors build a distinct dataset for each attribute mentioned above in a semi-supervised manner. For gender, they obtain the seed set for a crawl from initial sources including sororities, fraternities, and male and female hygiene products. This produces around 500 users in each class. For age, they manually classify seed users as being below or above 30. For regional origin, they select seed users from cities with low cross-migration and manually annotate users as being from either North or South India. Finally, for political orientation, they look at Twitter lists for the National Rifle Association (NRA), keyword searches like "support Palin" or "proud democrat", and hashtags related to current news events, and classify users as being either Republicans or Democrats.
Network Structure and Communication Behavior
For each of the classes in each of the attributes, the authors studied if there was any difference in the distribution of:
- 1. The follower-following ratio: the ratio of the number of followers of a user to the number of users he/she is following.
- 2. The follower frequency: the number of followers.
- 3. The following frequency: the number of followees.
They found that these features did not correlate with the attributes under consideration. They similarly considered distributions of:
- 1. Response frequency: percentage of tweets from the user that are replies.
- 2. Retweet frequency: percentage of tweets that are retweets.
- 3. Tweet frequency: percentage of tweets that are from the user, uninitiated.
Again, they observed no exploitable difference in communication behavior on these dimensions between male and female users, or for classifying age, regional origin, and political orientation. This is in contrast to the results of similar experiments on speech and conversational data.
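As a rough illustration of the network and communication statistics listed in this section, the sketch below computes the follower-following ratio and the three percentage-based frequencies for one user. The function names, the `kinds` labels ("reply", "retweet", "original"), and the handling of users who follow nobody are assumptions made for this example; the paper does not describe its implementation.

```python
from collections import Counter


def follower_following_ratio(num_followers, num_following):
    """Ratio of followers to followees for a single user."""
    # Assumption: users who follow nobody get an infinite ratio
    # (or 0.0 if they also have no followers).
    if num_following == 0:
        return float("inf") if num_followers > 0 else 0.0
    return num_followers / num_following


def communication_profile(kinds):
    """Percentage of a user's tweets that are replies, retweets, or original.

    `kinds` holds one label per tweet: "reply", "retweet", or "original".
    Treating the three kinds as mutually exclusive is an assumption.
    """
    total = len(kinds)
    labels = ("reply", "retweet", "original")
    if total == 0:
        return {label: 0.0 for label in labels}
    counts = Counter(kinds)
    return {label: 100.0 * counts[label] / total for label in labels}


profile = communication_profile(["reply", "retweet", "original", "original"])
ratio = follower_following_ratio(300, 150)
```

Comparing the distributions of these values across the two classes of an attribute is what the authors report as showing no exploitable difference.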
The authors used Support Vector Machines to train binary classifiers for each attribute. They used three sets of features, all derived from the tweet messages alone, and trained the following three SVM classifiers:
- 1) Sociolinguistic features based model
The authors rely on prior research in sociolinguistics that shows differences in lexical choice and other linguistic features in discourse conditioned on age, gender, and social class. For example, in speech it is well known that certain utterances like "umm", "uh-huh", and back channel responses like laughter and lip smacking are more prevalent among female speakers than their male counterparts.
- 2) Ngram-features based model
These models have been used earlier for identification of gender from telephone conversations. The authors utilized this as one class of approaches for Twitter classification by deriving the unigrams and bigrams of the tweet text. The text was segmented and normalized to preserve emoticons and other punctuation sequences as first-class lexical items. The authors find that emoticons serve as important features, and that TFIDF did worse for this task on a development set than TF alone.
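The n-gram feature extraction described above might look roughly like the following sketch, which keeps emoticons and punctuation runs as single tokens and counts unigrams and bigrams by raw term frequency (TF, without IDF weighting). The regular expressions, the lowercasing step, and the function names are illustrative assumptions; the paper does not specify its tokenizer.

```python
import re
from collections import Counter

# Illustrative patterns (assumptions, not the paper's tokenizer).
EMOTICON = r"[:;=][\-']?[)(DPp\]\[]"   # e.g. :) ;-) :D =P
PUNCT_RUN = r"[!?.]{2,}"               # e.g. !!! ... ?!
WORD = r"[#@]?\w+(?:'\w+)?"            # words, hashtags, mentions
OTHER = r"[^\w\s]"                     # any other single symbol

TOKEN_RE = re.compile("|".join([EMOTICON, PUNCT_RUN, WORD, OTHER]))
EMOTICON_RE = re.compile(EMOTICON)


def tokenize(text):
    """Split a tweet into tokens, keeping emoticons and punctuation runs whole."""
    tokens = []
    for tok in TOKEN_RE.findall(text):
        # Lowercase everything except emoticons, where case carries meaning (:D).
        tokens.append(tok if EMOTICON_RE.fullmatch(tok) else tok.lower())
    return tokens


def ngram_counts(tokens, max_n=2):
    """Raw term-frequency counts of all n-grams up to length max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


tokens = tokenize("Sooo happy!!! :) #win")
counts = ngram_counts(tokens)
```

In this example the emoticon `:)` and the run `!!!` each survive as one lexical item, so they can appear both as unigram features and inside bigrams such as `!!! :)`.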
- 3) A stacked model that combines SVMs based on the above two feature sets
Finally, the authors employed a stacked SVM model to do simple classifier stacking. Its features are the predictions from the n-gram and sociolinguistic models, along with their prediction weights.
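One way to picture the input to the stacked model is the sketch below, which turns the outputs of the two base classifiers into a single feature vector. It assumes each base SVM yields a signed score, and it interprets the "prediction weight" as the magnitude of that score; both the interpretation and the function name `stack_features` are assumptions, since the summary does not define them. Training the second-level SVM on these vectors is not shown.

```python
def stack_features(ngram_score, socio_score):
    """Build the meta-feature vector for one user from two base-model scores.

    Each score is assumed to be a signed value (positive for one class,
    negative for the other), as an SVM decision function would produce.
    """
    features = []
    for score in (ngram_score, socio_score):
        label = 1.0 if score >= 0 else -1.0   # predicted class of the base model
        features.extend([label, abs(score)])  # prediction plus its weight
    return features


# Hypothetical scores: the n-gram model leans positive, the other negative.
meta = stack_features(0.8, -0.3)
```

Under these assumptions the second-level classifier can learn to trust whichever base model is more reliable for a given attribute, especially when the two disagree.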
The authors achieved an accuracy of 72.33% for gender using their stacked model, significantly beating the 50% baseline of random chance. They found that their sociolinguistic model did better than their n-gram model for this task. They found that sociolinguistic features such as emoticons, ellipses, alphabetic character repetitions inside words, and exclamation marks were very helpful in discriminating females from males.
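The sociolinguistic cues named here (emoticons, ellipses, character repetitions, exclamation marks) could be counted with a few regular expressions, as in this minimal sketch. The patterns, the threshold of three repeated letters, and the feature names are assumptions for illustration and are not taken from the paper.

```python
import re

# Illustrative patterns (assumptions, not the paper's feature definitions).
EMOTICON_RE = re.compile(r"[:;=][\-']?[)(DPp\]\[]")  # e.g. :) ;-) :D
ELLIPSIS_RE = re.compile(r"\.{2,}|…")                # ".." or longer, or "…"
REPEAT_RE = re.compile(r"([a-zA-Z])\1{2,}")          # same letter 3+ times


def sociolinguistic_features(text):
    """Count a few sociolinguistic cues in one tweet."""
    return {
        "emoticons": sum(1 for _ in EMOTICON_RE.finditer(text)),
        "ellipses": sum(1 for _ in ELLIPSIS_RE.finditer(text)),
        "char_repetitions": sum(1 for _ in REPEAT_RE.finditer(text)),
        "exclamations": text.count("!"),
    }


features = sociolinguistic_features("Nooo way!!! so cute... :)")
```

Counts like these, aggregated over all of a user's tweets, would form one plausible feature vector for the sociolinguistic model.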
The authors achieved an accuracy of 74.11% in classifying users as above or below 30 with the stacked model, and for this task found that their n-gram based model did significantly better than their socio-linguistic feature based model.
The authors achieved an accuracy of 77% for regional origin using their sociolinguistic model, which outperformed both their stacked and n-gram based models. They found that, for some unknown reason, north Indian users tended to use emoticons, repeat characters, and show excitement (repeated exclamation marks, puzzled punctuation) more than south Indian users.
The authors achieved an accuracy of 82% for political orientation; in this case, their n-gram based model beat their stacked and sociolinguistic models. They found that certain words, such as "handgun" or "vegetarianism", were almost always indicative of Republicans or Democrats, respectively.
This paper presents a novel set of features and approaches for automatically classifying latent user attributes including gender, age, regional origin, and political orientation solely from the user language of informal communication such as Twitter.