Link Prediction in Relational Data

Citation

B. Taskar, M.-F. Wong, P. Abbeel, and D. Koller. Link prediction in relational data. In Proc. NIPS, 2003.

Summary

This paper addresses link prediction and develops a framework that supports multiple link types and uses both link features and node features. The key idea is to use a relational Markov network and, for each application data set, to define probabilistic patterns over subgraph structures, with each pattern capturing a particular kind of relational feature.

Problem and Intuition

The problem is broader than traditional relationship prediction or recommendation over a social network. Given data in a relational format, say hyperlinked university web pages, the task can be to find who advises whom. This formulation is compatible with the traditional link prediction problem, since every node feature can be mapped into a relational format. To predict whether a link exists, information about the two endpoint nodes and the link itself is not enough. For example, the fact that a professor and a student often appear on the same research project pages is a strong indicator of an advising relationship. The paper uses subgraph structures to capture this kind of graph feature within a relational Markov network framework.

Relational Markov Network

Let $G=(V,E)$ be an undirected graph with a set of cliques $C(G)$. Each $c\in C(G)$ is associated with a set of nodes $V_{c}$ and a clique potential $\phi _{c}(V_{c})$, which is a non-negative function defined on the joint domain of $V_{c}$. The Markov network defines the distribution $P(v)={\frac {1}{Z}}\prod _{c\in C(G)}{\phi _{c}(v_{c})}$, where $Z$ is the normalizing constant (partition function) that makes the probabilities sum to one.
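As a minimal sketch of the distribution defined above, the following computes $P(v)$ by brute-force enumeration on a toy network. The three binary variables, the two pairwise cliques, and the "agreement" potential are illustrative assumptions, not taken from the paper; enumeration is only feasible for very small networks.

```python
from itertools import product

# Hypothetical toy network: three binary variables and two pairwise cliques.
variables = ["a", "b", "c"]
cliques = [("a", "b"), ("b", "c")]

def agreement_potential(values):
    """Illustrative non-negative potential: favors equal values in a clique."""
    return 2.0 if values[0] == values[1] else 1.0

def unnormalized(assignment):
    """Product of clique potentials for one full assignment."""
    score = 1.0
    for clique in cliques:
        score *= agreement_potential(tuple(assignment[v] for v in clique))
    return score

def joint_distribution():
    """Enumerate all assignments, compute Z, and normalize."""
    assignments = [
        dict(zip(variables, vals))
        for vals in product([0, 1], repeat=len(variables))
    ]
    z = sum(unnormalized(a) for a in assignments)  # partition function Z
    return {
        tuple(a[v] for v in variables): unnormalized(a) / z
        for a in assignments
    }

dist = joint_distribution()
print(dist[(0, 0, 0)])  # 4/18: both cliques agree
print(dist[(0, 1, 0)])  # 1/18: neither clique agrees
```

Here $Z=18$, so the fully agreeing assignments are four times as likely as the assignment in which both cliques disagree.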

To extend it to a relational setting, a relational Markov Network specifies a conditional distribution over all of the labels of all of the entities in an instantiation given the relational structure and the content attributes.

To specify what cliques should be constructed in an instantiation, we will define a notion of a relational clique template. A relational clique template specifies tuples of variables in the instantiation by using a relational query language.
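To make the template idea concrete, here is a rough sketch of how a template might be expanded into cliques over a small instantiation. The page names, the `instantiate` helper, and the two templates are hypothetical illustrations; the paper expresses templates in a relational query language, which this sketch only imitates with Python predicates.

```python
from itertools import permutations

# Hypothetical instantiation: three pages and two directed hyperlinks.
pages = ["prof_page", "student_page", "project_page"]
links = {("prof_page", "project_page"), ("student_page", "project_page")}

def instantiate(entities, arity, where, select):
    """Build one clique per ordered entity tuple that satisfies `where`.

    `where` plays the role of the query condition and `select` names the
    variables that the resulting clique ranges over.
    """
    return [
        select(*combo)
        for combo in permutations(entities, arity)
        if where(*combo)
    ]

# Template 1 (assumed): the labels of two pages joined by a hyperlink.
linked_cliques = instantiate(
    pages, 2,
    where=lambda p, q: (p, q) in links,
    select=lambda p, q: (f"{p}.label", f"{q}.label"),
)

# Template 2 (assumed): a candidate link between two pages that both
# point to the same third page; `p < q` keeps each unordered pair once.
co_link_cliques = instantiate(
    pages, 3,
    where=lambda p, q, r: (p, r) in links and (q, r) in links and p < q,
    select=lambda p, q, r: (f"exists({p},{q})",),
)

print(linked_cliques)
print(co_link_cliques)
```

The second template mirrors the professor and student example: the two pages share a project page, so the template creates a clique over the variable for a possible link between them. All cliques produced by one template would share the same potential.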

For more details, please refer to B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proc. UAI, 2002.

Data Sets

The paper uses two data sets. Both were collected by the authors and have not been released publicly, so there is no known public source for them.

Computer Science department webpages from 3 schools (Stanford, Berkeley, and MIT), collected and manually labeled by the authors.
A social network data set collected by a portal website at a large university that hosts an online community for students.