Link prediction for the communication graphs including text
This is a sample, not a real project - although I think it would be a cool one!--Wcohen 14:14, 2 October 2012 (UTC)
Link prediction is predicting when new links will be formed in a network. Often it is performed by using structural properties of the graph, using methods such as random walk with reset (RWR). Link prediction is straightforward to evaluate in a graph with timestamped links - one common measure is AUC.
In social networks, links often correspond to communication between people, and sometimes the text associated with these communications is available. Our project is to attempt link prediction using recently-developed methods for using supervised learning to tune RWR by using arbitrary features of the edges in the network for link prediction tasks. In particular, we want to use a bag-of-words representation for text as a feature set for the edges of a network. To our knowledge this is a novel task.
- Team: William Cohen, Bhavana Dalvi. (We'd be happy to have a third member if it's approved by the TAs.)
- Datasets: we plan to use the three communication networks available from Stanford - in particular the email and wikipedia-talk network. We will start by considering the wikipedia talk dataset. We will start with the minimal features in the dataset, and then "scale up" by looking at the text features associated with a communication.
- Baseline method: our baselines will be an untrained RWR method. We also think there should be some sort of baseline that uses text alone, without considering the graph, but we haven't decided yet what that will be.
- Challenges: it may be expensive to run this algorithm on large datasets. We're not sure how to subset the data. We are also considering implementing a stochastic gradient descent version of the Backstrom and Leskovic algorithm, which should be faster.