Link prediction using low-dimensional node embeddings: The measurement problem

Significance. Link prediction is a fundamental machine learning task on complex networks, used to evaluate the central technique of low-dimensional embeddings. Our results question the common wisdom that low-dimensional embeddings perform well in link prediction tasks. We show that this wisdom rests on faulty measurements, based on the area under the curve (AUC), used to evaluate link prediction. We propose vertex-centric local measures, under which existing low-dimensional embedding methods are shown to fail at link prediction. We identify a mathematical connection between this poor performance and the low-dimensional geometry of the node embeddings. Within a formal theoretical framework, we prove that low-dimensional vectors cannot capture sparse ground truth using dot product similarities (the standard practice in the literature).


D. Exploration of Normalized Discounted Cumulative Gain.
We also compute a vertex-centric Normalized Discounted Cumulative Gain (NDCG). NDCG is a metric that analyzes how well a ranking method ranks relevant documents (47). Similarly to VCMPR, we compute VCNDCG@k for some threshold k. Formally, VCNDCG@k is computed as follows. For a given vertex i of non-zero degree, we rank all other vertices j in decreasing order of their scores. Let L be the list of binary ground truth values ordered by this ranking. Let I, the ideal ranking, be the list of binary ground truth values in sorted order. All ground-truth lists contain only the values 0 or 1, i.e., relevant or not. Then, following the standard NDCG definition (47), VCNDCG@k for the vertex i is

VCNDCG@k(i) = DCG@k(L) / DCG@k(I), where DCG@k(L) = sum_{j=1}^{k} L_j / log_2(j+1).

We compute VCNDCG for all datasets and plot their scores in Fig. 12. Just as with VCMPR, the scores are quite low, which indicates that the low-dimensional embeddings have poor overall performance.

Table 7. This table complements Fig. 8, which has results from link prediction on the ogbl-ddi dataset. We give the ROC-AUC, PR-AUC, and average VCMPR@k for k = 10, 50.
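As a concrete sketch, the per-vertex score can be implemented directly from the definition above; the function names are our own illustration, not the paper's code:

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain of the top-k entries of a binary relevance list."""
    return sum(rel / math.log2(j + 2) for j, rel in enumerate(rels[:k]))

def vcndcg_at_k(ranked_rels, k):
    """VCNDCG@k for one vertex: DCG of the ranked ground-truth list L divided by
    the DCG of the ideal (sorted) list I. Returns 0 when nothing is relevant."""
    ideal = sorted(ranked_rels, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_rels, k) / idcg if idcg > 0 else 0.0
```

A perfectly ranked list attains VCNDCG@k = 1; any relevant vertex pushed below an irrelevant one lowers the score.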
E. Comparison with the hits metric. A natural choice of k for VCMPR is the average degree (or possibly twice the average degree). For the hits metric, k can vary depending on the instance. On the OGB leaderboard, common choices are k = 20, 50. We performed a comparison of all methods on the ogbl-collab dataset, where we see that the VCMPR scores are much lower than the hits scores.
We also perform the same experiments on the SBM datasets we generated. We simply set the parameter k to 20 for both VCMPR and hits. Again, the hits scores are significantly higher than the VCMPR scores. The results are summarized in Tab. 11. For example, the HOP-rec method has a hits@20 value of 0.83, but an average VCMPR@20 of 0.49. There are numerous hits values above 0.65 where the average VCMPR@20 is less than 0.5.
For the dense ogbl-ddi dataset, the recommended metric on the OGB leaderboard is hits@20. Here, we see the opposite: the VCMPR scores are large, but the hits scores are low. This is another indication that the hits@k and VCMPR metrics are fundamentally different.
We also perform the same experiment on a PPI dataset, specifically the Bioplex dataset, with the number of negative instances set to 10% of the number of edges, consistent with the numbers chosen for the OGB datasets.

This table complements Fig. 9, which has results from link prediction on the Bioplex dataset. We give the ROC-AUC, PR-AUC, and average VCMPR@k for k = 10, 50.
We compare VCMPR and hits with k set to 50. Here we see that the VCMPR scores are again lower than the hits scores; for example, Role2Vec has a VCMPR@50 of 0.08 but a hits@50 of 0.47.

F. Results on entire datasets. For the same setup as described in the paper, we compute VCMPR@k plots for the entire dataset, not just the graph consisting of Etest. This includes edges seen in training, which allows us to closely examine the local structure of the embedding. Under this setting, VCMPR@k is defined as follows.

VCMPR@k for vertex i is |T_k(i) ∩ N(i)| / min(k, D_i), where T_k(i) is the set of top-k predictions for vertex i, N(i) is the neighborhood of i, and D_i is the degree of vertex i in the entire graph G = (V, E). We report the scores in Fig. 13 and Fig. 14. As expected, the scores are much higher. However, we see that in general, across all data-sets, thresholds, and methods, very few vertices have a VCMPR of 0.8, despite the embedding having seen 80% of the graph's edges. As before, the VCMPR curves drop quite steeply.

Table 9. This table complements Fig. 10, which has results from link prediction on the HI-II-14 dataset. We give the ROC-AUC, PR-AUC, and average VCMPR@k for k = 10, 50.
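For concreteness, here is a minimal sketch of the per-vertex computation with dot-product scores, assuming a dense boolean adjacency matrix and an embedding matrix with one row per vertex (our illustration, not the authors' code):

```python
import numpy as np

def vcmpr_at_k(embeddings, adj, i, k):
    """VCMPR@k for vertex i over the entire graph: the fraction of the top-k
    dot-product-ranked candidates that are true neighbors of i,
    normalized by min(k, D_i)."""
    scores = embeddings @ embeddings[i]   # dot-product similarity to every vertex
    scores[i] = -np.inf                   # exclude the vertex itself
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    hits = sum(1 for j in top_k if adj[i, j])
    return hits / min(k, int(adj[i].sum()))
```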
G. Results using a variable threshold for VCMPR. We also investigate the use of a variable threshold in our evaluation. We compute VCMPR as follows. Let D_i be the degree of vertex i in the original graph G = (V, E) and d_i be the degree of i in the test graph Gtest = (V, Etest). Then VCMPR@Deg(v) for vertex i is |T_{D_i}(i) ∩ N_test(i)| / min(D_i, d_i), where T_{D_i}(i) is the set of top D_i predictions for i and N_test(i) is the neighborhood of i in Gtest. We use D_i rather than d_i as the threshold to ensure each threshold is sufficiently large. This experiment is motivated by the fact that in some graphs, the vertex degrees tend to obey a power law distribution. Under such a distribution, a small portion of the vertices are incident to a large portion of the edges. Thus, variable thresholds may be more appropriate. The results of these experiments are in Fig. 11 and Tab. 12, where we see low scores as before. The performance of all methods is extremely low in comparison to the AUC scores.
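Under the assumption that the variable-threshold score mirrors the fixed-threshold definition, with D_i as the cutoff and min(D_i, d_i) as the normalizer, a sketch looks like the following (function and variable names are ours):

```python
import numpy as np

def vcmpr_at_deg(embeddings, adj_full, adj_test, i):
    """Variable-threshold VCMPR for vertex i: rank by dot product, take the
    top D_i candidates, and count how many are test-graph neighbors.
    The min(D_i, d_i) normalizer is an assumption, mirroring VCMPR@k."""
    D_i = int(adj_full[i].sum())   # degree in the original graph G
    d_i = int(adj_test[i].sum())   # degree in the test graph Gtest
    scores = embeddings @ embeddings[i]
    scores[i] = -np.inf            # exclude the vertex itself
    top = np.argsort(scores)[::-1][:D_i]
    hits = sum(1 for j in top if adj_test[i, j])
    return hits / min(D_i, d_i)
```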

Fig. 7. We show results for a small SBM. The SBM has a block size of 50 and 200 blocks. The AUC scores are exceptional, over 0.97 in almost all cases. However, the VCMPR values are quite low: the VCMPR@10 scores are under 0.3. Even for k = 20, no method scores above 0.5, despite the fact that this threshold is more than half the size of a block with the training edges removed. The data is summarized in Tab. 4.

Table 5. This table complements Fig. 5, which has results from link prediction on the dblp dataset. We give the ROC-AUC, PR-AUC, and average VCMPR@k for k = 10, 20. The AUC numbers are extremely large for real data, close to 0.9, with all methods showing good scores. Comparatively, the average VCMPR@10 numbers are low. Walklets gets a score of 0.56, but other methods are below 0.4.

…and clustering coefficient, as shown in Tab. 3. We train each model on 90% of the edges of the graph and withhold 10% for testing. The results for amazon can be found in Fig. 6 and Tab. 6, for dblp in Fig. 5 and Tab. 5, for blogCatalog in Fig. 1 and Tab. 1, for ogbl-collab in Fig. 2 and Tab. 2, and for ogbl-ddi in Fig. 8 and Tab. 7.

We perform some small-scale experiments with simple Stochastic Block Models (SBMs) to make our point more compelling. We create an SBM with a block size of 50 and 200 such blocks. Within each block, an edge is inserted with probability 0.3. Pairs across blocks are connected with probability 0.3/n (where n is the number of vertices). So blocks are extremely dense, and there are few edges across blocks. We show the results in Tab. 4 and Fig. 7.

B. The connection to graph density. We show results on the dense ogbl-ddi dataset. In Tab. 7, we can see that the VCMPR scores are quite high, and in many cases almost the same as the AUC scores. For the leading HOP-rec and Walklets algorithms, the scores are quite close to each other.
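The SBM construction described above (200 blocks of 50 vertices, within-block probability 0.3, cross-block probability 0.3/n) can be generated with networkx; the seed is our choice for reproducibility:

```python
import networkx as nx

block_size, num_blocks = 50, 200
n = block_size * num_blocks          # 10,000 vertices in total
p_in, p_out = 0.3, 0.3 / n           # dense blocks, very sparse cross-block edges

sizes = [block_size] * num_blocks
# Probability matrix: p_in on the diagonal, p_out everywhere else.
probs = [[p_in if a == b else p_out for b in range(num_blocks)]
         for a in range(num_blocks)]
G = nx.stochastic_block_model(sizes, probs, seed=0)
```

Roughly 73,500 within-block edges are expected against only about 1,500 cross-block edges, so almost all structure is local to the blocks.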

Fig. 12. We compute VCNDCG for all datasets. Similarly to VCMPR, the VCNDCG scores are quite low, indicating that the top predictions are of poor quality. Thus, most edges are not in the top k predictions, or are not ranked highly within said predictions.

Table 6. This table complements Fig. 6, which has results from link prediction on the amazon dataset. We give the ROC-AUC, PR-AUC, and average VCMPR@k for k = 10, 20. The AUC numbers are quite high, with a highest of 0.97, but the average VCMPR numbers are quite low. Even the highest, achieved by Walklets, is 0.58, while other methods are lower than 0.4. The average degree of amazon is 5, so a choice of k = 20 is quite large. …VCMPR scores than Walklets, despite having lower ROC-AUC and PR-AUC scores. But in all cases, the VCMPR scores are extremely low. Overall, AUC is typically sufficient to demonstrate the poor performance of link prediction algorithms for such datasets. Hence, VCMPR may have limited utility in such settings.

Table 10. This table complements Fig. 12, which plots the Normalized Discounted Cumulative Gain of each method over multiple datasets. We give the average NDCG. We see that across all datasets, the scores are very low, typically below 0.2. Since blogCatalog and ogbl-ddi

Table 12. This table complements Fig. 11, which plots the VCMPR scores using a variable threshold based on deg(v). We give the average scores here. For amazon and dblp, we see a slight drop in scores when compared to the fixed thresholds.