Difference between revisions of "Link propagation: A fast semi-supervised learning algorithm for link prediction"

Latest revision as of 02:59, 4 October 2012

Link propagation: A fast semi-supervised learning algorithm for link prediction

Citation

Kashima H., Kato T. , Yamanishi Y. , Sugiyama M and Tsuda K (May-2009) Link Propagation: A Fast Semi-supervised Learning Algorithm for Link Prediction In: SDM 2009, 2009 SIAM International Conference on Data Mining, Philadelphia, PA, USA, Society for Industrial and Applied Mathematics, 1099-1110.

Online version

PDF

Summary

This Paper adopts a semi-supervised algorithm known as "label propagation" and successfully transfers the original one which focus on node labeling to solving Link Prediction task. It also solves the problem of combining topological data and node information in the prediction algorithm.

Key Ideas

Mapping between node labeling and link prediction.

Label Propagation is a semi-supervised algorithm which successfully combines topological features and node information in predicting nodes' label(assigning a label for each node on a graph). And the paper we are talking about find a mapping between node labeling and link prediction by treating each link as a triplet $(x_{i},y_{j},z_{t})$ , where $x_{i}$ and $y_{j}$ represent two nodes and $z_{t}$ represents a type of link. As the method of this paper also supports the prediction of multiple type of link, if there is only one type of link, we can just set $z_{t}$ to 1 constantly. And we treat this triplet as one node and consider a problem of assigning two possible label, 0 or 1, for each this triplet and 1 stands for this link exists and 0 otherwise.

Use Conjugate Gradient Method to solve the minimization problem

The Idea of The Original Node Labeling Algorithm

The goal is to derive a $f(x)$ for every node $x$ which means the probability for this node to have a label "1". And $f$ function has to agree on the training data which we already know the labels.
With an input of similarity matrix of nodes $w_{i,j}$ and the base assumption is that similar nodes should have similar labels
Have a cost function on the assumption and the goal is to find a $f$ which minimize the loss function $E(f)$

$E(f)={\frac {1}{2}}\sum _{i,j}{w_{i,j}(f(i)-f(j))^{2}}$

Derive that the minimum loss labels are just a weighted average of neighbor points

$f(j)={\frac {1}{d_{j}}}\sum _{i}^{j}{w_{i,j}f(i)}$

Some Modifications

The loss function is changed to

$J(\{f_{ijk}\})={\frac {\delta }{2}}\sum _{i,j,k,l,m,n}{w_{ijk,lmn}(f_{ijk}-f_{lmn})^{2}}+{\frac {1}{2}}$

Differentiate the above function over $f$ and apply conjugate gradient method to solve the equation after differentiation.

Data sets

The paper use three data sets.

The metabolic pathways of the yeast S. Cerevisiae in the KEGG/PATHWAY database

Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Supervised enzyme network inference from the integration of genomic data and chemical information. Bioinformatics, 21:i468–i477, 2005

A protein-protein interaction network data set

C. von Mering, R. Krause, B. Snel, M. Cornell, S. Olivier,S. Fields, and P. Bork. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417:399–403, 2002

NIPS 1-17 data is a social network representing the co-authorships in the NIPS conferences, available here

Related papers

There is a later paper which is based on this one and tries to improve the efficiency of running time: Fast and scalable algorithms for semi-supervised link prediction on static and dynamic graphs.

The paper talks about the node label algorithm is Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions

Study Plan

To understand the original node labeling algorithm
- Fast and scalable algorithms for semi-supervised link prediction on static and dynamic graphs
- Wiki Page Harmonic function
Wiki Page Conjugate gradient method

@@ Line 11: / Line 11: @@
 == Summary ==
-This paper adopts a semi-supervised algorithm known as "label propagation" and successfully transfers the original one which focus on node labeling to solving link (edge) prediction task. It also solves the problem of combining topological data and node information in the prediction algorithm.
+This [[Category::Paper]] adopts a semi-supervised algorithm known as "label propagation" and successfully transfers the original one which focus on node labeling to solving [[AddressesProblem::Link  Prediction]] task. It also solves the problem of combining topological data and node information in the prediction algorithm.
 == Key Ideas ==
 *Mapping between node labeling and link prediction.
 [http://www.aladdin.cs.cmu.edu/papers/pdfs/y2003/zgl.pdf Label Propagation] is a semi-supervised algorithm which successfully combines topological features and node information in predicting nodes' label(assigning a label for each node on a graph). And the paper we are talking about find a mapping between node labeling and link prediction by treating each link as a triplet <math>(x_i, y_j, z_t)</math>, where <math>x_i</math> and <math>y_j</math> represent two nodes and <math>z_t</math> represents a type of link. As the method of this paper also supports the prediction of multiple type of link, if there is only one type of link, we can just set <math>z_t</math> to 1 constantly. And we treat this triplet as one node and consider a problem of assigning two possible label, 0 or 1, for each this triplet and 1 stands for this link exists and 0 otherwise.
-*Use Conjugate gradient Method to solve the minimization problem
+*Use [[UsesMethod::Conjugate Gradient Method]] to solve the minimization problem
 == The Idea of The Original Node Labeling Algorithm==
@@ Line 40: / Line 40: @@
 * Differentiate the above function over <math>f</math> and apply conjugate gradient method to solve the equation after differentiation.
-== Data sets and Experiments ==
+== Data sets ==
-The paper works with five co-authorship networks, obtained from the author lists of articles contained in five sections of the physics e-Print arXiv. The training interval is defined to be the 3 years from 1994 through 1996 and test interval to be the 3 years from 1997 through 1999.
+The paper use three data sets.
+*The metabolic pathways of the yeast S. Cerevisiae in the [[UsesDataset::KEGG/PATHWAY database]]
-Each link predictor outputs a ranked list of pairs and the paper takes the first <math>n</math>, which is the number of new links from the test set, and count how many of edges from these <math>n</math> are in the test set.
+[[paper::Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Supervised enzyme network inference from the integration of genomic data and chemical information. Bioinformatics, 21:i468–i477, 2005]]
+*A protein-protein interaction network data set
-== Result ==
+[[paper::C. von Mering, R. Krause, B. Snel, M. Cornell, S. Olivier,S. Fields, and P. Bork. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417:399–403, 2002]]
-The performance is compared with a random predictor.
+*[[UsesDataset::NIPS 1-17 data]] is a social network representing the co-authorships in the NIPS conferences, available [http://ai.stanford.edu/~gal/data.html here]
-[[File:LN_K_2007_Result.jpg]]
-We can see that Adamic/Adar outperforms other methods and for more detailed results please refer to the paper.
 == Related papers ==
-Most link prediction solutions can fall to two directories, one is to use topological features and the other is to use node or edge information.
+There is a later paper which is based on this one and tries to improve the efficiency of running time:
+[[RelatedPaper::Fast and scalable algorithms for semi-supervised link prediction on static and dynamic graphs]].
-[[RelatedPaper::Link propagation: A fast semi-supervised learning algorithm for link prediction]] suggests a semi-supervised framework which need a node similarity matrix as input. Many applications like protein-protein interactions and social networks are able to provide this information.
-[[RelatedPaper::Supervised Random Walks: Predicting and Recommending Links in Social Networks]] adopts a random walk framework and use edge and node information as features to train a transition rate over each edge to maximize the probability of random walk stopping time on positive nodes.
+The paper talks about the node label algorithm is [[RelatedPaper::Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions]]
 == Study Plan ==
-* About Adamic/Adar score
+* To understand the original node labeling algorithm
-**[[Paper::Adamic, L.A., & Adar, E. (2003). Friends and neighbors on the Web. Social Networks, 25(3), 211–230]]
+**[[Paper::Fast and scalable algorithms for semi-supervised link prediction on static and dynamic graphs]]
-* To understand which graph information these features represent
+**Wiki Page [http://en.wikipedia.org/wiki/Harmonic_function Harmonic function]
-**[[Paper::Davidsen, J., Ebel, H., & Bornholdt, S. (2002). Emergence of a small world from local interactions: Modeling acquaintance networks. Physical Review Letters, 88(128701)]]
+* Wiki Page [http://en.wikipedia.org/wiki/Conjugate_gradient_method Conjugate gradient method]

Difference between revisions of "Link propagation: A fast semi-supervised learning algorithm for link prediction"

Latest revision as of 02:59, 4 October 2012

Contents

Citation

Online version

Summary

Key Ideas

The Idea of The Original Node Labeling Algorithm

Some Modifications

Data sets

Related papers

Study Plan

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Tools