Difference between revisions of "SAGE Weibo"

From Cohen Courses

Revision as of 23:47, 8 October 2012

For this project, we will use generative graphical models to predict retweets, comments, tweets-at, and follower networks among users of Tencent Weibo. In particular, we would like to examine how different demographic factors influence the distribution of words in the social network and the structure of the network. We plan to start with basic graphical models such as LDA and then extend them to account for the other structure in the data. In addition to using collapsed samplers to infer the hidden variables and structures in our models, we will evaluate whether SAGE can increase the robustness of our models on held-out data and reduce computational complexity.

Dataset

We will use the data from KDD Cup 2012 Track 1 (Tencent Weibo). The original KDD Cup challenge was to predict whether an item (a broad class of things, including people and users) recommended to a user would actually be followed. We will not address that task in this project. Instead, we will divide the publicly available dataset into dev, tuning, and test sets (by user ID) and predict the likelihood of a user following another user, tweeting at them, retweeting them, or commenting on something they said. In particular, we are interested in the demographic information associated with users, such as year of birth and gender.
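One way to split by user ID is to hash each ID into a fixed bucket, so that a given user always lands in the same set. The following is a minimal sketch, not the project's actual procedure: the function name `assign_split` and the 80/10/10 proportions are assumptions chosen for illustration.

```python
import hashlib

# Hypothetical proportions; the proposal does not state the split sizes.
DEFAULT_FRACTIONS = (("dev", 0.8), ("tuning", 0.1), ("test", 0.1))

def assign_split(user_id, fractions=DEFAULT_FRACTIONS):
    """Deterministically map a user ID to one of the named splits."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    # Scale the first 8 hex digits to a number in [0, 1).
    point = int(digest[:8], 16) / 16 ** 8
    cumulative = 0.0
    for name, fraction in fractions:
        cumulative += fraction
        if point < cumulative:
            return name
    # Guard against floating-point round-off in the cumulative sum.
    return fractions[-1][0]

# Every record belonging to a user follows that user into a single split.
splits = {user_id: assign_split(user_id) for user_id in range(10)}
```

Hashing rather than random sampling keeps the assignment reproducible across runs without having to store a separate lookup table.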

This dataset does not contain actual words, only integer IDs, so deep linguistic processing is impossible. We will have to use bag-of-words models over unigrams. The dataset has undergone unknown preprocessing steps that left only keywords; the original tweets are not recoverable.
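Because the tokens are already integer IDs, building a unigram bag-of-words representation only requires counting. The sketch below assumes each document has already been read into a list of integer keyword IDs; the helper names and the toy IDs are hypothetical, and the actual file layout of the dataset is not assumed here.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct keyword ID to a dense index 0..V-1, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token_id in doc:
            if token_id not in vocab:
                vocab[token_id] = len(vocab)
    return vocab

def to_bag_of_words(doc, vocab):
    """Count how often each (re-indexed) unigram occurs in one document."""
    return Counter(vocab[token_id] for token_id in doc)

# Toy documents made of made-up keyword IDs.
documents = [[9001, 17, 9001], [17, 523]]
vocab = build_vocabulary(documents)
bags = [to_bag_of_words(doc, vocab) for doc in documents]
```

Re-indexing to a dense range is a convenience for topic models, which typically keep one count per vocabulary entry.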

Methodology

We propose three main thrusts to this work:

  • Use LDA to predict the likelihood of a user following another user, based on the topics learned from the words in a tweet.
  • Build our own generative model that has additional observed values of gender and age, which come from a latent variable model. Perform inference in this model using a collapsed Gibbs sampler that we will derive. We may develop multiple models, depending on the complexity of the project.
  • For each of these models, reformulate the problem from a SAGE perspective. See whether SAGE gives a noticeable improvement (or not). Compare the inference cost of a collapsed sampler with that of the additive model.
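As a point of reference for the first two items, the following is a minimal sketch of a collapsed Gibbs sampler for plain LDA, with the topic proportions and topic-word distributions integrated out. It is not the demographic model proposed above, which has yet to be derived; the function name, the hyperparameters `alpha` and `beta`, and the iteration count are assumptions for illustration.

```python
import random

def lda_collapsed_gibbs(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iterations=50, seed=0):
    """Run collapsed Gibbs sampling for LDA over documents given as lists of word indices."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # n(k, w)
    topic_total = [0] * num_topics                               # n(k)
    assignments = []

    # Initialise every token with a random topic and record the counts.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word

toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 0, 1]]
doc_topic, topic_word = lda_collapsed_gibbs(toy_docs, num_topics=2, vocab_size=4)
```

Each sweep touches every token once and costs time proportional to the number of topics per token, which is the inference cost that the SAGE reformulation would be compared against.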

We also plan to submit this work to the Social Media Track of WWW2013 (http://www2013.org/).

Study Plan

This project requires a basic understanding of generative graphical models of text and some of their derivatives. In particular, we will need to be knowledgeable about:

  • Latent Dirichlet Allocation
  • Sparse Additive Generative Models of Text

Team Members

  • Kenton Murray
  • Natasha Loghmanpour