Active Learning in Link Prediction for social networks
Contents
Comments
This is a nice idea. One worry I have is that most link datasets are not complete - they are observed interactions, but non-observed interactions might also be just as plausible as the ones actually observed. It's probably hard to find a dataset with actual users you can query.
I see a couple of ways to get around this. One option would be to consider active learning for link classification - there are a few signed networks available, for instance. Lise Getoor has also done some work on link classification, I believe. Another would be to look at link datasets that include time, and explore a variant of active learning that looks ahead into the future, instead of asking users. Eg, you'd do active learning to modify a classifier at time T by using labels from time T+1 as oracles, and then use this to classify at time T+2. Or perhaps you could consider two disjoint networks, and learn some sort of link-prediction strategy on one by looking ahead into the future for the second.
There are a lot of interesting issues that arise when you start asking - what is the impact of making suggestions to a user (as most networks now do)? I don't know if there's any easy way to study this, or get insight into it, from passive time-stamped data, but it's worth thinking about. --Wcohen 19:02, 10 October 2012 (UTC)
Response
I think the concern is about how to "imitate" an oracle or construct the ground truth from datasets which are not likely to be complete. This is a very important issue not only in algorithm design but also in evaluation due to one important feature in social network interactions which is the difficulty to have highly-confident negative data points.
To focus on prediction over different time stamps is a good way to get a golden-standard datasets as the negative data points actually make sense here. But one of my concerns is whether our features are strong or informative enough to achieve this goal. From my own experience, the order or sequence that I added my friends on Facebook is quite intractable even does not mean anything that when I came to CMU, I added my classmates here and the order was actually how I got to know them in real life. One promising feature can be my recent update of school information in my profile. However, this kind of feature is hard to provide. Liben-Nowell Kleinberg J. Am.Soc.Inf.Sci.2007 did an extensive study on many heuristic graph features, but even the best did not do well enough in a time stamps based prediction task.
My idea is to use users with sufficient relations as fully developed nodes. For such a use with "sufficient" friend relations (I will discuss the issue with "sufficiency" later), we randomly take a small part of her friends as our training set and use the rest to build an oracle who answers "yes" if the system asks someone which is actually in her friend list and answers "no" otherwise as we believe she has already added all friends she would like to add. This will imitate an oracle.
There are two issues here. One is how to define users with sufficient relations. The other is training over these hub-like uses will have a strong bias over the rest. To overcome these two issues, we have to do two things: training over most users and distinguish them in some way
our active learning framework is designed in the following way.
- For the learning model, it will not only give a prediction score for each possible friend suggestion but also a learned threshold.
- For each user, we randomly take a small fixed number of friends to build a training set.
- For each round, the oracle for this user only gives positive feedback when the query suggestions are in the actual friend-list.
- For each round of querying the oracle, the learning model will not only change the prediction model itself, but also the threshold.
- After a fixed number of rounds, the learning model will generate a list of suggestions of which scores are higher than the learned threshold.
- Intuitively, the threshold will work like a tuning variable to predict the total number of user relations.
- For evaluation, we can calculate the harmonic mean of precision and recall over the rest data which are not chosen into the training set in the first step. And this score should be the target we want to optimize.
Teem members
- Troy Hua
- Looking for teammates :-)
Basic Idea
Link prediction or friend suggestion is a common feature that most online social network sites have provided.
In a real scenario, the process is not static, that users keep adding friends by themselves or by confirming to the friend suggestions from the system. All the users feedback to the suggestion list is crucial and informative to generate a more accurate suggestion list. More over, we can look at this from an active learning model's perspective. The system is trying to learn a model to predict friends for a user and is able to query the user by giving suggestions and learn from the feedback.
Advantage of using active learning in link prediction
- It just matches how user is interacting with the system in online social network
- In a cold start situation, e.g. a new social network site, a choice of informative suggestions and a clever learning from the feedback will quick increase the prediction accuracy even with small number of known friends, which plays a crucial role for a new SNS to grow up and become popular rapidly.
- It will generate a more diversified suggestion list. If a user moves into a new environment, she/he will be more eager to establish relationship with people in the new environment. A traditional learning algorithm will not give a high score to people in the new environment due to lack of information. However, active learning will possibly do if we a good design on how to make informative suggestions.
- Active learning is able to adopt other good methods for link prediction as its interior part.
Our Goal
We want to use fewer labeled data (known relation between users) to reach a higher precision and to derive a more natural user scenarios for friend suggestions.
Possible Methods and something we can do
when deriving a detailed learning framework, we have to answer all the questions below:
- Q: How to choose informative suggestions and what does "informative" mean here?
A: There are couple of ways in active learning. e.g. To choose the one that will change the model the most if we know the label or to choose the unlabeled data which have different predictions from different hypothesis.
- Q: How to understand user feedback?
A: One simple way is just to add the feedback to the current labeled data, or we can add more weight in this data point if it gets a positive feedback and we should think about how to use it wisely when we get a negative feedback.
- Q: How to "get" feedback information from the data we have? (We actually do not have a user to query.)
A: A way to solve this is to use a subset of training data and use the rest as oracles. It can be a problem that different user will have different number of friends. So I am thinking to add a learned threshold for each user to bias to suggestion.
- Q: What is the general learning method here?
A: As far as I know, supervised random walk is kind of state of the art and well known method for general link prediction on social networks. It can be used as our interior learning method as well as a competitive baseline.
Data sets
Stanford data sets and a lot of social network data from datamob, which has a 2009 facebook social graph data. And I also have a protein-protein interaction network dataset. Although it is not social network, it shares some important facts like small-world phenomenon and a power low distribution of node degree and modularity. It can be used to test on a cold start situation as the average degree is only 4 to 5.
Evaluation Metrics
An "unfair" evaluation is to compare the accuracy with other learning methods which have their training data randomly selected. Although I think it is a little bit "unfair", some papers in active learning actually do their evaluation in this way as they are the first to use active learning in a certain data set or a certain problem. So one way to evaluate is to compare the accuracy with supervised random walk. And if it is possible, I hope we can involve real users to interact with the system and provide reviews on the suggestion list.