SAGE Weibo
Team Members
Summary
For this project, we will use generative graphical models to predict interactions among users of Tencent Weibo, such as retweets, comments, tweets-at, and follower links. In particular, we would like to examine how demographic factors (namely age, gender, and user location) influence the distribution of words in the social network and the overall structure of the network. We plan to start with a basic graphical model, LDA, and then extend it to account for other potential structure in the data. In addition to using collapsed samplers to infer the hidden variables and structures in our models, we will evaluate whether SAGE can make our models more robust on held-out data and reduce computational cost.
Dataset
We will use the data from KDD Cup 2012 Track 1 (Tencent Weibo). The original KDD Cup challenge was to predict whether an item (a broad class of things that includes people and users) recommended to a user would actually be followed. That is not the goal of our project, as the competition has ended and the winning entries have been announced. Instead, we will divide the publicly available dataset into development, tuning, and test sets (by user ID) and predict the likelihood of a user following another user, tweeting at them, retweeting them, or commenting on something they said. In particular, we are interested in the demographic information associated with users, such as their age and gender.
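As a rough illustration of the split by user ID described above, the sketch below hashes each ID into one of the three sets. The function name `assign_split` and the 80/10/10 proportions are assumptions made for this example; the proposal does not fix them.

```python
import hashlib


def assign_split(user_id, dev_fraction=0.8, tuning_fraction=0.1):
    """Map a user ID to 'development', 'tuning', or 'test' deterministically.

    The default fractions are illustrative assumptions, not project decisions.
    """
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1).
    bucket = int(digest[:8], 16) / 16 ** 8
    if bucket < dev_fraction:
        return "development"
    if bucket < dev_fraction + tuning_fraction:
        return "tuning"
    return "test"
```

Because the assignment depends only on the ID, a given user always lands in the same set across runs, so none of a user's records can leak from the development set into the test set.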
Even though Tencent Weibo is often described as the Chinese version of Twitter, there will be no language barrier: the dataset contains integer IDs rather than actual words. For the same reason, deep linguistic processing is impossible on this dataset, and we will have to use bag-of-words models over unigrams. The dataset has gone through unknown preprocessing steps that leave only keywords, and the original tweets are not recoverable. Although this preprocessing can help reduce computational complexity, it is also a limitation of the study, as some meaning may have been lost along the way.
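A unigram bag-of-words representation over integer IDs could look like the sketch below. It assumes, purely for illustration, that each tweet arrives as a whitespace-separated string of integer keyword IDs; the actual file layout of the dataset may differ, and the helper names are hypothetical.

```python
from collections import Counter


def bag_of_words(tweet):
    """Count unigram occurrences in a tweet given as space-separated integer IDs.

    The input format is an assumption made for this example.
    """
    return Counter(int(token) for token in tweet.split())


def user_bag(tweets):
    """Pool the bags of all of a user's tweets into a single count vector."""
    total = Counter()
    for tweet in tweets:
        total.update(bag_of_words(tweet))
    return total
```

Word order is discarded, which is all a unigram model needs, and the integer IDs can be used directly as vocabulary indices.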
Methodology
Baseline: Use LDA to predict the likelihood of a user following another user, based on the topics learned from the words in a tweet. Then build our own generative model in which gender and age are additional observed variables generated from a latent variable model.
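For reference, the standard LDA generative process that the baseline builds on can be written as follows. Treating each tweet (or each user's pooled keywords) as a document d is an assumption of this sketch, and the symbols K, α, β, θ, and φ follow the usual LDA notation rather than anything fixed in this proposal.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d), & n &= 1, \dots, N_d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Here w_{d,n} is the observed keyword ID, z_{d,n} is its hidden topic, θ_d is the topic mixture of document d, and φ_k is the word distribution of topic k.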
Once we have our baseline performance, we propose the following main thrusts to this work:
- Perform inference in the baseline model using a collapsed Gibbs sampler that we will derive. We may develop multiple models, depending on the complexity of the project.
- For each of these models, reformulate the problem from a SAGE perspective. See whether SAGE gives a noticeable improvement (or not). Compare the inference cost of the collapsed sampler with that of the additive model.
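To make the two approaches in the list above concrete, here is a minimal sketch in plain Python. The first function is a textbook collapsed Gibbs sampler for standard LDA, not the sampler we will derive for our extended models. The second shows the additive form used by SAGE, in which a word distribution is the softmax of background log-frequencies plus a deviation vector. All function names, hyperparameter defaults, and the toy interface are assumptions made for illustration.

```python
import math
import random


def collapsed_gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
                        iterations=50, seed=0):
    """Collapsed Gibbs sampling for plain LDA over documents of integer word IDs."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_{k,w}
    topic_total = [0] * num_topics                              # n_k
    assignments = []

    # Assign every token a random initial topic and record the counts.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Add the token back under its newly sampled topic.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word


def sage_word_distribution(background, eta):
    """SAGE-style word distribution: softmax of background log-frequencies plus a deviation."""
    scores = [m + e for m, e in zip(background, eta)]
    peak = max(scores)  # subtract the max for numerical stability
    exp_scores = [math.exp(s - peak) for s in scores]
    total = sum(exp_scores)
    return [x / total for x in exp_scores]
```

The sampler touches every token on every sweep, so its cost grows with corpus size times the number of topics. In the additive form, each topic or demographic facet only needs its own deviation vector on top of a shared background, which is the property we want to compare against the sampler in terms of inference cost and held-out robustness.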
We also plan to submit this work to the Social Media Track of WWW2013 [[1]].
Study Plan
The background knowledge needed for this project is a basic understanding of generative graphical models of text and some of their variants. In particular, we will need to be knowledgeable about: