Ramage et al ICWSM 2010

From Cohen Courses
Revision as of 21:51, 5 November 2012 by Yuchenz (talk | contribs)

Citation

Daniel Ramage, Susan Dumais, and Dan Liebling. Characterizing Microblogs with Topic Models. In ICWSM, 2010.

Online version

PDF

Summary

This paper presents a topic model that maps the content of a Twitter feed into dimensions. The paper claims that the latent topics of a tweet can be categorized into four types: Substance topics about events and ideas, Social topics recognizing language used toward a social end, Status topics denoting personal updates, and Style topics that capture broader trends in language usage. The authors also used hashtags, emoticons, etc. as labeled dimensions of a tweet. They evaluated the learned dimensions on two tasks: ranking tweets in a user's feed and recommending new users to follow.

Brief description of the method

The authors used LDA to model tweets with the latent and labeled dimensions defined above.

To label tweets with the four latent dimensions, the authors first applied LDA to learn 200 latent dimensions from the tweets. They then manually assigned each of these 200 dimensions to one of the four latent categories (substance, status, style, social) or to "other".

The other labeled dimensions are hashtags (one label per hashtag), emoticons (variations were collapsed into one of two canonical forms, :) or :( ), @user (tweets addressed to a specific user), reply (tweets that are a reply), and question (tweets containing a question mark character).
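The labeling rules above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the emoticon sets, the function name `tweet_labels`, the `is_reply` flag, and the rule that an @user tweet starts with a mention are all assumptions made for the example.

```python
import re

# Hypothetical emoticon variants; the summary only says that variations
# were collapsed into the canonical forms ":)" and ":(".
HAPPY = {":)", ":-)", ":D", "=)"}
SAD = {":(", ":-(", "=("}

HASHTAG_RE = re.compile(r"#\w+")
MENTION_AT_START_RE = re.compile(r"@\w+")


def tweet_labels(text, is_reply=False):
    """Return the set of labeled dimensions observed in one tweet."""
    labels = set()
    tokens = text.split()

    # One label per distinct hashtag (lowercased so variants share a label).
    labels.update(tag.lower() for tag in HASHTAG_RE.findall(text))

    # Collapse emoticon variants into their canonical form.
    if any(tok in HAPPY for tok in tokens):
        labels.add(":)")
    if any(tok in SAD for tok in tokens):
        labels.add(":(")

    # Assumption: a tweet "addresses" a user when it begins with a mention.
    if MENTION_AT_START_RE.match(text.strip()):
        labels.add("@user")

    # Reply status is metadata, so it is passed in rather than parsed.
    if is_reply:
        labels.add("reply")

    if "?" in text:
        labels.add("question")

    return labels
```

For example, `tweet_labels("@bob are you going to #ICWSM ? :-)", is_reply=True)` yields the hashtag label plus the emoticon, @user, reply, and question labels.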

Based on the number of posts containing each hashtag, they selected a final set of 504 labels, keeping the most frequently used hashtags.
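A frequency-based selection like this might look as follows. The function name `top_hashtags` and the cutoff parameter `k` are hypothetical, and counting each hashtag once per post is an assumption about how "number of posts" was measured.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#\w+")


def top_hashtags(tweets, k):
    """Return the k hashtags that appear in the largest number of posts."""
    post_counts = Counter()
    for text in tweets:
        # A set ensures each hashtag is counted at most once per post.
        post_counts.update({tag.lower() for tag in HASHTAG_RE.findall(text)})
    return [tag for tag, _ in post_counts.most_common(k)]
```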

After obtaining the topic distribution, the authors characterized a tweet or a collection of tweets by calculating the fraction of words in the posts that belong to each topic.
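As a rough sketch of this step, suppose each word in a post has already been assigned a topic id by the model, and that the manual mapping from topic ids to categories is available as a dictionary. Both inputs, and the name `dimension_profile`, are assumptions for illustration.

```python
from collections import Counter


def dimension_profile(word_topics, topic_to_category):
    """Fraction of words falling into each category.

    word_topics: list of topic ids, one per word in the post(s).
    topic_to_category: dict mapping a topic id to a category name;
        topics missing from the mapping are counted as "other".
    """
    if not word_topics:
        return {}
    counts = Counter(topic_to_category.get(t, "other") for t in word_topics)
    total = len(word_topics)
    return {category: n / total for category, n in counts.items()}


# Hypothetical mapping for three of the 200 latent topics.
mapping = {0: "substance", 1: "status", 2: "social"}
profile = dimension_profile([0, 0, 1, 2], mapping)
```

Here half of the words come from a substance topic and a quarter each from status and social topics, so `profile` holds 0.5, 0.25, and 0.25 for those categories.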

Experimental Results

They evaluated their approach on two tasks: ranking tweets in a user's feed and recommending users to follow.

For the ranking experiment, they conducted a user study in which users were asked to rate each tweet on a 1-3 scale. They then used 70% of this data to train a classifier and the remaining 30% to test it. The classifier used LDA-based and TF-IDF-based features. They reported that the MAP for the LDA-based features is better than the MAP for the TF-IDF-based features, and that the system combining both feature sets performed best.
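For readers unfamiliar with the metric, mean average precision (MAP) can be computed as below. The sketch assumes binary relevance judgments; how the 1-3 ratings were turned into relevance labels is not described in this summary, so that step is left out.

```python
def average_precision(ranked_relevance):
    """Average precision for one ranked list of True/False relevance flags."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Precision at this rank, accumulated only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-list average precision values."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

A ranking that places relevant tweets near the top scores close to 1, while one that buries them scores close to 0.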

For the user recommendation experiment, they modeled each participant's interests from the posts of the posters that participant follows, holding out the posts of one followed poster to serve as the positive example. Eight other posters whom the participant does not follow were picked as negative examples. In this experiment, again, the LDA-based system outperformed the TF-IDF-based system, and the combined system performed best.
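One way to picture this setup is to rank the nine candidates (one held-out positive and eight negatives) by how similar their topic profiles are to the participant's profile. The sketch below uses cosine similarity over category fractions; the scoring function, the profile format, and all names are assumptions, since the summary does not say how candidates were scored.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(user_profile, candidates):
    """Return candidate names sorted from most to least similar."""
    return sorted(
        candidates,
        key=lambda name: cosine(user_profile, candidates[name]),
        reverse=True,
    )


# Hypothetical profiles: one held-out followed poster and two non-followed posters.
user = {"substance": 0.8, "social": 0.2}
candidates = {
    "held_out": {"substance": 0.7, "social": 0.3},
    "neg1": {"status": 1.0},
    "neg2": {"substance": 0.1, "social": 0.9},
}
ranking = rank_candidates(user, candidates)
```

A good model places the held-out poster at or near the top of `ranking`.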

Dataset Used

They trained the proposed model on data collected by crawling one week of public posts from Twitter.

Related papers

There is not much related work on topic modeling of tweets. Here are some related papers:

  • The paper by Ritter NAACL 2010 proposed an unsupervised approach to the problem of modeling dialogue acts in Twitter.
  • The paper by Naaman CSCW 2010 examines the characteristics of social activity and patterns of communication on Twitter.