SAGE Weibo
Latest revision as of 14:12, 1 November 2012
Comments
- Well defined problem but I wonder whether you would do a 3-way classification (retweet, comment or tweet-at) or come up with a coarse prediction that a user follows another user with some likelihood.
- In a 3-way classification, what kind of features could discriminate them?
- Please add a related-work section with 2-3 papers that have looked at this (or a similar) problem.
--Apappu 12:10, 11 October 2012 (UTC)
Team Members
- Kenton Murray
- Natasha Loghmanpour
Summary
For this project, we are using generative graphical models to predict occurrences such as retweets, comments, tweets-at, and follower networks among users of Tencent Weibo. In particular, we would like to examine how demographic factors (i.e., age, gender, and user location) influence the distribution of words in the social network and the overall structure of the network. We plan to start with basic graphical models such as LDA and then extend them to account for other potential structures in the data. In addition to using collapsed samplers to infer hidden variables and structures in our model, we will evaluate whether SAGE can increase the robustness of our models on held-out data and reduce computational complexity.
Dataset
We will be using the data from KDD Cup 2012 Track 1 (Tencent Weibo). The original KDD Cup challenge was to predict whether an item (a broad class that includes people and users) recommended to a user would actually be followed. This is not the goal of our project, as the competition has passed and the winning entries have been announced. Instead, we will divide the publicly available dataset into development, tuning, and test sets (by user ID) and predict the likelihood of a user following another user, tweeting at them, retweeting them, or commenting on something they said. In particular, we are interested in the demographic information associated with users, such as their age and gender.
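A split by user ID could be sketched as follows. This is only an illustration: the function names, the 80/10/10 proportions, and the use of a hash of the ID are assumptions, not choices stated in the proposal.

```python
import hashlib


def assign_split(user_id, dev_frac=0.8, tune_frac=0.1):
    """Map a user ID to 'dev', 'tune', or 'test' deterministically.

    The fractions are hypothetical defaults; the proposal does not fix them.
    """
    # Hash the ID so the assignment is stable across runs and does not
    # depend on the order in which users are read.
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    bucket = (int(digest, 16) % 1000) / 1000.0  # pseudo-uniform value in [0, 1)
    if bucket < dev_frac:
        return "dev"
    if bucket < dev_frac + tune_frac:
        return "tune"
    return "test"


def split_users(user_ids):
    """Group user IDs into the three sets; each user lands in exactly one."""
    splits = {"dev": [], "tune": [], "test": []}
    for uid in user_ids:
        splits[assign_split(uid)].append(uid)
    return splits
```

Splitting on the user rather than on individual tweets keeps all of a user's activity in a single set, so held-out evaluation is on users the model has not seen.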
Even though Tencent Weibo is thought of as the Chinese version of Twitter, there will be no language barrier: the dataset does not contain actual words but rather integer IDs, so deep linguistic processing is impossible. We will have to use bag-of-words models over unigrams. The dataset has undergone unknown preprocessing steps that leave only keywords, and the original tweets are not recoverable. We acknowledge that although this preprocessing can help reduce computational complexity, it is also a limitation of the study, as some meaning may have been lost in the process.
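Because the keywords are already integer IDs, a unigram bag-of-words representation reduces to counting IDs. A minimal sketch, assuming each user's keywords are available as a list of integers (the input layout and function name are hypothetical, not the dataset's actual file format):

```python
from collections import Counter


def user_term_counts(user_keywords):
    """Build a unigram bag-of-words per user.

    user_keywords: dict mapping user ID -> list of integer keyword IDs
    (an assumed in-memory layout for illustration).
    Returns a dict mapping user ID -> Counter of keyword ID -> count.
    """
    return {uid: Counter(ids) for uid, ids in user_keywords.items()}


# Toy data for illustration only; not taken from the dataset.
toy = {1: [17, 42, 17], 2: [42]}
counts = user_term_counts(toy)
```

Word order is discarded here, which matches the unigram assumption above.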
Methodology
Baseline: Use LDA to predict the likelihood of a user following another user based on the topics learned from the words in a tweet. Then, build our own generative model in which gender and age are additional observed values generated by a latent variable model.
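For orientation, the standard collapsed Gibbs sampler for plain LDA can be sketched as below. This is a generic textbook version, not the sampler the team will derive for the extended model; the function name, the hyperparameter values, and the treatment of each user's keywords as one document are assumptions. Word IDs are assumed to be remapped to the range 0 to vocab_size - 1.

```python
import random


def lda_collapsed_gibbs(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, num_iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of integer word IDs in [0, vocab_size).
    Returns (doc_topic, topic_word) count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_{k,w}
    topic_total = [0] * num_topics                              # n_k
    assignments = []

    # Random initialisation of the topic assignment of every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(num_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                k_old = assignments[d][i]
                doc_topic[d][k_old] -= 1
                topic_word[k_old][w] -= 1
                topic_total[k_old] -= 1
                # p(z = k | everything else) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                assignments[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][w] += 1
                topic_total[k_new] += 1
    return doc_topic, topic_word
```

The topic and word distributions are integrated out, so only the count tables need to be stored and updated; the per-token cost grows linearly with the number of topics.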
Once we have our baseline performance, we propose the following main thrusts of this work:
- Perform inference in the baseline model using a collapsed Gibbs sampler that we will derive. We may build multiple models, depending on the complexity of the project.
- For each of these models, reformulate the problem from a SAGE perspective. See if there is a noticeable improvement (or lack thereof) from using SAGE. Compare the inference cost of a collapsed sampler with that of this additive model.
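To make the additive formulation concrete: in SAGE, the log-probability of a word is a shared background log-frequency plus sparse deviations, one per facet (for example a topic and a demographic attribute), normalised with a softmax. The sketch below only computes the resulting word distribution; the function name and the dict-based sparse representation are assumptions, and learning the deviations is not shown.

```python
import math


def sage_word_distribution(background_log_freq, *deviations):
    """Compute p(w) proportional to exp(m_w + sum of eta_w over facets).

    background_log_freq: list of background log-frequencies m, one per word ID.
    deviations: any number of sparse deviation vectors, each a dict mapping
    word ID -> deviation (words absent from the dict have deviation zero).
    """
    scores = list(background_log_freq)
    for eta in deviations:
        for w, value in eta.items():
            scores[w] += value
    # Subtract the maximum before exponentiating for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, `sage_word_distribution(m, eta_topic, eta_gender)` combines a topic deviation and a gender deviation by simple addition in log space, which is why adding a demographic facet does not require a separate word distribution for every topic and demographic combination.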
We also plan to submit this work to the Social Media Track of WWW2013 (http://www2013.org/).
Related Work
Chen and She: An Analysis of Verifications in Microblogging Social Networks - Sina Weibo
Chen et al.: Combining Factorization Model and Additive Forest for Collaborative Followee Recommendation
Deans and Miles: Framework for Understanding Social Media Trends in China
Jiang et al.: Social Contextual Recommendation
Jiang et al.: Understanding Latent Interactions in Online Social Networks
Lin et al.: Analysis and Comparison of Interaction Patterns in Online Social Network and Social Media
Wang et al.: TM LDA Efficient Online Modeling of Latent Topic Transitions in Social Media
Yang et al.: Data Selection for User Topic Model in Twitter-like Service
Yu et al.: What Trends in Chinese Social Media
Study Plan
The background knowledge needed for this project is a basic understanding of generative graphical models of text and some of their derivatives. In particular, we will need to be knowledgeable about:
- Latent Dirichlet Allocation
- Sparse Additive Generative Models of Text