# Toward link predictability of complex networks

Contributed by H. Eugene Stanley, December 31, 2014 (sent for review March 10, 2014; reviewed by Giorgio Parisi and Dashun Wang)

## Significance

Quantifying a network's link predictability allows us to (*i*) evaluate predictive algorithms associated with the network, (*ii*) estimate the extent to which the organization of the network is explicable, and (*iii*) monitor sudden mechanistic changes during the network's evolution. The hypothesis of this paper is that a group of links is predictable if removing them has only a small effect on the network's structural features. We introduce a quantitative index for measuring link predictability and an algorithm that outperforms state-of-the-art link prediction methods in both accuracy and universality. This study provides fundamental insights into important scientific problems and will aid in the development of information filtering technologies.

## Abstract

The organization of real networks usually embodies both regularities and irregularities, and, in principle, the former can be modeled. The extent to which the formation of a network can be explained coincides with our ability to predict missing links. To understand network organization, we should be able to estimate link predictability. We assume that the regularity of a network is reflected in the consistency of structural features before and after a random removal of a small set of links. Based on the perturbation of the adjacency matrix, we propose a universal structural consistency index that is free of prior knowledge of network organization. Extensive experiments on disparate real-world networks demonstrate that (*i*) structural consistency is a good estimation of link predictability and (*ii*) a derivative algorithm outperforms state-of-the-art link prediction methods in both accuracy and robustness. This analysis has further applications in evaluating link prediction algorithms and monitoring sudden changes in evolving network mechanisms. It will provide unique fundamental insights into the above-mentioned academic research fields, and will foster the development of advanced information filtering technologies of interest to information technology practitioners.

Understanding the organization of real networks is a long-standing challenge in many branches of science (1). Although some mechanisms have already been accepted as primary driving forces in network organization, including homophily (2–4), triadic closure (5–7), preferential attachment (8–10), reciprocity (11), and social balance (12), one or two of these mechanisms cannot provide a complete explanation; i.e., link formation in real-world networks is usually driven by both regular and irregular factors, and only the former can be explained using mechanistic models. This intrinsic network complexity presents us with the question of how to estimate what portion of a real network can be categorized as regular, in other words, to what extent the link formation in a network is explicable.

This question brings to mind the link prediction problem, in which the set of observed links in a network is used to estimate the likelihood that a nonobserved link exists (13). The extent to which the network formation is explicable coincides with our capacity to predict missing links (14, 15). On the one hand, an effective link prediction algorithm provides strong evidence of the corresponding mechanism(s) of network organization, e.g., the effectiveness of common-neighborhood-based methods indicates the significance of triadic closure (16, 17). On the other hand, a better understanding of network organization should be transferable to a good link prediction algorithm, e.g., the prior assumption of hierarchical organization of networks can be directly applied to the design of a prediction algorithm (18). In this sense, the precision of a link prediction algorithm tells us the extent to which the link formation in a network can be explained by this algorithm. However, different algorithms provide different precisions in the same network (see Table 1 for the precisions of seven link prediction (LP) methods on 10 networks), and thus the precision only reflects the link predictability associated with a specific algorithm, not an intrinsic feature of the network itself.

Predictability is usually defined as the maximum possible precision of a prediction algorithm (19). However, this kind of definition is not suitable for link prediction, since a real network’s link predictability under such a definition would be 1, because its nonobserved links are almost always distinguishable (see *Materials and Methods*). In this paper, link predictability instead characterizes the inherent difficulty of prediction, independent of any specific algorithm. Our fundamental hypothesis is that missing links are difficult to predict if their addition causes large structural changes; thus a network is highly predictable if the removal or addition of a set of randomly selected links does not significantly change the network’s structural features. Accordingly, we propose a so-called “structural consistency” index, based on first-order matrix perturbation, that reflects the inherent link predictability of a network and does not require any prior knowledge of the network’s organization. We also propose a structural perturbation method for link prediction that is more accurate and robust than the state-of-the-art methods.

## Structural Consistency

Consider a simple undirected network *G*(*V*, *E*), where *V* is the set of nodes and *E* is the set of links. The given network can be represented by an adjacency matrix *A*, where the element *a*_{ij} = 1 if nodes *i* and *j* are connected and *a*_{ij} = 0 otherwise.

We consider the set of observed links *E* and randomly select a fraction of them as a perturbation set Δ*E*, with corresponding adjacency matrix Δ*A*. The remainder, *A*^{R} = *A* − Δ*A*, can be diagonalized as *A*^{R}*x*_{k} = *λ*_{k}*x*_{k}, where *λ*_{k} and *x*_{k} are its eigenvalues and orthonormal eigenvectors (see *SI Appendix, Case of Degenerate Eigenvalues*, for the case with degenerate eigenvalues). After perturbation by Δ*A*, the eigenvalue *λ*_{k} is shifted, to first order, by Δ*λ*_{k} ≈ *x*_{k}^{T}Δ*A* *x*_{k}/(*x*_{k}^{T}*x*_{k}), which allows us to approximate the original matrix *A* if the expansion is based on the eigenvectors of *A*^{R}.

The eigenvectors can well reflect network structural features (20). If the perturbation does not significantly change the structural features, the eigenvectors of the observed matrix can be kept unchanged, and the perturbed matrix can be written as

*Ã* = Σ_{*k*} (*λ*_{k} + Δ*λ*_{k}) *x*_{k}*x*_{k}^{T}. [**4**]

Given a network *A*, we first randomly remove a group of randomly selected links Δ*E* (with adjacency matrix Δ*A*) and then construct the perturbed matrix *Ã* via Eq. **4**. If the network is highly regular, the random removal should be recoverable: the removed links will obtain high scores in *Ã*. We therefore define the structural consistency *σ*_{c} as the fraction of the removed links that are ranked within the top *L* = |Δ*E*| of the unobserved links of *A* − Δ*A*, i.e., *σ*_{c} = |Δ*E* ∩ *E*^{L}|/|Δ*E*|, where *E*^{L} denotes the set of the top-*L* ranked links among *U* − (*E* − Δ*E*) and *U* is the universal set of links (see *SI Appendix, Six Steps to Calculate σ*_{c}).
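The steps above can be sketched numerically. The following is a minimal NumPy illustration of a single perturbation realization; the function name and the 10% default removal fraction are our own illustrative choices, and the paper averages *σ*_{c} over many independent removals:

```python
import numpy as np

def structural_consistency(A, p_remove=0.1, rng=None):
    """Single-realization estimate of the structural consistency sigma_c.

    A is a symmetric 0/1 adjacency matrix. A fraction p_remove of the
    links is removed; the remaining matrix A^R is diagonalized, the
    removed links re-enter as a first-order eigenvalue perturbation
    (eigenvectors held fixed), and sigma_c is the fraction of removed
    links found among the top-ranked nonobserved links of the
    perturbed matrix.
    """
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    # All links (i < j) present in A.
    links = np.transpose(np.triu(A, 1).nonzero())
    n_remove = max(1, int(round(p_remove * len(links))))
    removed = links[rng.choice(len(links), n_remove, replace=False)]

    # A^R = A - Delta A: the network after removing the link set.
    dA = np.zeros_like(A, dtype=float)
    for i, j in removed:
        dA[i, j] = dA[j, i] = 1.0
    AR = A - dA

    # Eigendecomposition of the reduced (symmetric) matrix.
    lam, X = np.linalg.eigh(AR)
    # First-order eigenvalue shifts: d_lambda_k = x_k^T dA x_k
    # (eigenvectors are orthonormal, so the denominator is 1).
    dlam = np.einsum('ik,ij,jk->k', X, dA, X)
    # Perturbed matrix with unchanged eigenvectors, Eq. 4.
    A_tilde = (X * (lam + dlam)) @ X.T

    # Rank links that are nonobserved in A^R; sigma_c is the fraction
    # of removed links among the top-|removed| of them.
    cand = [(A_tilde[i, j], (i, j))
            for i in range(n) for j in range(i + 1, n) if AR[i, j] == 0]
    cand.sort(key=lambda t: -t[0])
    top = {ij for _, ij in cand[:n_remove]}
    hits = sum((i, j) in top for i, j in removed)
    return hits / n_remove
```

For a maximally regular network such as a complete graph, the removed links are trivially recovered and the index equals 1; for random graphs it drops toward the random-chance value, consistent with the ER experiments described below.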

## Structural Perturbation Method

The perturbation method used to determine the structural consistency can be applied to predict missing links. Link prediction aims at estimating the existence likelihood of nonobserved links based on the observed topology (13). The simplest framework is that of similarity-based algorithms (16), in which each pair of nodes, *x* and *y*, is assigned a similarity score *s*_{xy}, and a higher score means a higher likelihood of connection (see *SI Appendix, Link Prediction Problem*). Under this framework, the entries of the perturbed matrix *Ã* can be used directly as similarity scores. To predict the missing links of *A* by using the structural perturbation method (SPM), we rank all of the nonobserved links (i.e., the links corresponding to 0 in matrix *A*) according to their scores in the perturbed matrix, averaged over independent selections of Δ*E*. SPM recovers missing links with high accuracy (see *SI Appendix*, Table S1), which indicates that the missing links, which are considered as unknown information, can be recovered by perturbing the network with another set of known links (i.e., the randomly removed links Δ*E*).

Consider an undirected network *G*(*V*, *E*), whose set of links, *E*, is randomly divided into two parts: (*i*) a training set *E*^{T}, which is treated as known information, and (*ii*) a probe set (i.e., validation subset) *E*^{P}, which is used for testing, with *E*^{T} ∪ *E*^{P} = *E* and *E*^{T} ∩ *E*^{P} = ∅.

Notice that, in this task, the training set *E*^{T} plays the role of the observed link set and defines the observed matrix *A*. To apply SPM, we further remove a random subset of links Δ*E* from *E*^{T} and obtain the perturbed matrix *Ã* via Eq. **4**. The final average prediction matrix is obtained by repeating this procedure over independent selections of Δ*E* and averaging the resulting perturbed matrices (see *SI Appendix, Five Steps to Calculate Prediction Accuracy of SPM*).
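This evaluation procedure can be mirrored in code. Below is a hedged NumPy sketch; the function names, the 10% removal fraction, and the averaging count are our own illustrative choices, not taken from the paper:

```python
import numpy as np

def spm_scores(A_train, p_remove=0.1, n_avg=10, rng=None):
    """Average the first-order-perturbed matrix over n_avg independent
    random removals of a fraction p_remove of the training links; the
    entries at nonobserved positions serve as link-existence scores."""
    rng = np.random.default_rng(rng)
    n = A_train.shape[0]
    links = np.transpose(np.triu(A_train, 1).nonzero())
    n_rm = max(1, int(round(p_remove * len(links))))
    acc = np.zeros((n, n))
    for _ in range(n_avg):
        sel = links[rng.choice(len(links), n_rm, replace=False)]
        dA = np.zeros((n, n))
        for i, j in sel:
            dA[i, j] = dA[j, i] = 1.0
        lam, X = np.linalg.eigh(A_train - dA)      # spectrum of A^R
        dlam = np.einsum('ik,ij,jk->k', X, dA, X)  # first-order shifts
        acc += (X * (lam + dlam)) @ X.T            # perturbed matrix
    return acc / n_avg

def precision_at_L(scores, A_train, probe, L):
    """Fraction of the top-L ranked nonobserved links that lie in the
    probe set (the standard precision metric for link prediction)."""
    n = A_train.shape[0]
    cand = [(scores[i, j], (i, j)) for i in range(n)
            for j in range(i + 1, n) if A_train[i, j] == 0]
    cand.sort(key=lambda t: -t[0])
    return sum(ij in probe for _, ij in cand[:L]) / L
```

A typical use would hold out part of the link set as the probe, score the rest with `spm_scores`, and report `precision_at_L` at *L* = |*E*^{P}|, matching the evaluation protocol described above.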

We compare the structural perturbation method with six widely applied link prediction algorithms, including four similarity-based indices: the common neighbors (CN) index (16), the Adamic-Adar (AA) index (22), the resource allocation (RA) index (17, 23), and the Katz index (24). We also use two likelihood methods: the hierarchical structure model (HSM) (18) and the stochastic block model (SBM) (25). See *Materials and Methods* for the six baseline algorithms. Table 1 shows the prediction accuracy of the 10 real-world networks (see *Materials and Methods* and *SI Appendix*, Table S2, for the description and basic statistics of the data), measured by precision [see *SI Appendix*, Table S3 for the results measured by another metric called AUC: the area under the receiver operating characteristic curve (26); see the definition in *SI Appendix, Link Prediction Problem*, Eq. 5]. The highest value for each network (in each column) is in boldface. Overall, SPM outperforms all other baseline algorithms including such state-of-the-art methods as the RA index, HSM, and SBM. In addition, SPM is the most robust method for disparate networks; i.e., although, in a few cases, its performance is not the best, it is always very good. In contrast, all six baseline algorithms give very poor predictions for some networks. In addition to the effectiveness of SPM, we can efficiently obtain an approximate result by sampling large-scale networks (see discussion in *SI Appendix, Applying to Large Networks*).

Notice that the random division of *E* into training and probe sets introduces statistical fluctuations, so the reported precisions are averaged over independent divisions. We also test the methods on three larger networks, Arxiv, Facebook, and Enron (see *Materials and Methods* and *SI Appendix*, Table S2); as shown in Table 2, SPM still performs the best.

## Link Predictability

We first consider the structural consistency of modeled networks and show the validity of *σ*_{c} as an indicator of link predictability. An Erdős−Rényi (ER) network connects each pair of *N* nodes independently with probability *p*. If *p* is finite and the network size *N* goes to infinity, the spectral densities of adjacency matrices in ER networks obey the Wigner semicircle law and the eigenvectors are distributed isotropically at random (29, 30). The first-order perturbation of the eigenvalues is thus also random, leading to low structural consistency values. Given an ER network, we calculate its structural consistency (illustrated in *SI Appendix*, Fig. S3) and determine the average structural consistency as a function of *N* for different *p*. Fig. 2*A* shows how the structural consistency decreases with the network size in a power-law-like relationship and tends to the random-chance value.

We next consider the Watts−Strogatz (WS) networks (32) with controllable randomness. A WS network starts from a regular ring of *N* nodes in which each node connects to its *k* nearest neighbors. With a probability *q*, each link is replaced by another link that joins two randomly chosen nodes, so that *q* interpolates between a completely regular network (*q* = 0) and a completely random one (*q* = 1). Fig. 2*B* shows how the structural consistency decreases with *q*, indicating once again that higher irregularities (i.e., randomness) will result in lower predictability.

The above experimental results on modeled networks affirm the rationale behind the proposed index *σ*_{c}: the more regular the organization of a network, the higher its structural consistency and thus its link predictability.

## Discussion

Throughout history, human beings, from ancient prophets to modern scientists, have attempted to make predictions. The recent development of theoretical tools and the expanding availability of massive databases have allowed scientists to predict behaviors and trends, chart emergent events, and locate missing elements of a system (33–35). In this paper, we treat predictability as an inherent measurement of the regularities in the organization of a networked system, and our hypothesis is that a missing part is predictable only when it does not significantly change the structural features of the observable part. If it does, it cannot be revealed through observation. The perspective of this hypothesis is that high predictability indicates some perceptible regulative principles in the organization of the network. Putting aside any a priori hypotheses of what this regulative principle might be, we directly measure the structural consistency of a network by perturbing its adjacency matrix and observing the change of eigenvalues with the eigenvectors held fixed, similar to the well-known first-order perturbation in quantum mechanics. Although we cannot determine the maximum precision of any given link prediction algorithm, the structural consistency *σ*_{c} provides an algorithm-independent estimate of a network's link predictability.

Potential applications of this work are wide and can take both theoretical and practical forms. Using structural consistency values, we can determine whether a poor prediction was caused by an inappropriate algorithm or by the intrinsic unpredictability of the network, and then estimate how much room remains to improve the algorithm. For example, the CN index does not perform well for either the neural network or the food web. Because the structural consistency of the food web is much larger than that of the neural network, we can infer that CN is not appropriate for the food web, whereas the low precision on the neural network may result from its own low predictability. Indeed, as shown in Fig. 3, the networks lying far below the fitting line are those to which the corresponding algorithm is not well suited. For an evolving network, the structural consistency can give a temporal estimation of whether the network becomes more elusive or not, as well as monitor sudden changes in the evolving mechanisms (see *SI Appendix, Monitor the Sudden Changes of Evolving Networks with σ*_{c}, for numerical experiments). In addition, the structural perturbation method, as a straightforward extension of structural consistency, can be directly applied to determining the missing links in real-world networks. This work should be of interest to academic researchers in a variety of fields, to information technology practitioners, and to business practitioners.

## Materials and Methods

### Maximum Precision of Link Prediction.

If we define link predictability as the maximum precision of any link prediction algorithm, then a network is of nearly zero predictability if all nonobserved links are completely the same (e.g., a star network). For example, a vertex-transitive (20) network is of zero predictability, since all of the nodes in the observed structure are identical and thus missing links are indistinguishable from nonexistent links. For a vertex-transitive network, given any two of its nodes *u* and *v*, there is some automorphism *f* such that *f*(*u*) = *v*.

### Data Description.

We consider the following 13 real-world networks drawn from disparate fields: (*i*) Jazz (37), a collaboration network of jazz musicians consisting of 198 nodes and 2742 interactions; (*ii*) Metabolic (38), the metabolic network of *Caenorhabditis elegans*; (*iii*) Neural (32), the neural network of *C. elegans* (the original network is directed and weighted; here we treat it as a simple network by ignoring the directions and weights); (*iv*) US Air (39), the US Air transportation network; (*v*) Food web (40), the food web in Florida Bay during wet season; (*vi*) Hamster (41), a friendship network of users on the website hamsterster.com; (*vii*) NetSci (42), a coauthorship network of scientists working on network theory and experiment; (*viii*) Yeast (43), a protein−protein interaction network in budding yeast; (*ix*) Email (44), a network of email interchanges between members of the University Rovira i Virgili; (*x*) Router (45), a symmetrized snapshot of the structure of the Internet at the level of autonomous systems; (*xi*) Arxiv (46), a scientific collaboration network from the arXiv’s High Energy Physics C Theory (hep-th) section; (*xii*) Facebook (47), a network of a small group of Facebook users; and (*xiii*) Enron (48), an email communication network from employees of Enron between 1999 and 2003. More detailed information and statistical features of these networks can be found in *SI Appendix, Statistical Features of Experimental Networks*.

### Baseline Algorithms for Link Prediction.

The link prediction problem has been a long-standing challenge in modern information science (13, 49). Its main goal is to estimate the existence likelihood of nonobserved links based on the known topology and node attributes. Link prediction has already found wide applications in interdisciplinary fields, including uncovering missing parts of social and biological networks (50–52) and recommending friends and products in online social networks and e-commerce web sites (53–55).

For comparison, we introduce four benchmark similarity indices (13). The simplest is the CN index (16), in which two nodes, *x* and *y*, have a higher connecting probability if they have more common neighbors. Two improved indices based on CN are the AA index (22) and the RA index (17, 23), both of which assign less-connected common neighbors more weight. The mathematical expressions are

*s*_{xy}^{CN} = |Γ(*x*) ∩ Γ(*y*)|,  *s*_{xy}^{AA} = Σ_{*z*∈Γ(*x*)∩Γ(*y*)} 1/log *k*_{z},  *s*_{xy}^{RA} = Σ_{*z*∈Γ(*x*)∩Γ(*y*)} 1/*k*_{z},

where Γ(*x*) denotes the set of neighbors of node *x* and *k*_{z} is the degree of node *z*.
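These three definitions can all be read off the adjacency matrix, since each has the form *s*_{xy} = Σ_{z} *a*_{xz}*w*_{z}*a*_{zy} for a neighbor weight *w*_{z}. A small NumPy sketch follows; the matrix formulation and the degree guard in the AA branch are our own implementation choices:

```python
import numpy as np

def local_similarity_scores(A, index="CN"):
    """CN / AA / RA similarity scores for all node pairs, using the
    common form s_xy = sum_z a_xz * w_z * a_zy with neighbor weights
    w_z = 1 (CN), 1/log(k_z) (AA), or 1/k_z (RA)."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=0)                        # node degrees
    if index == "CN":
        w = np.ones_like(k)
    elif index == "AA":
        # Guard against k_z <= 1, where 1/log(k_z) is undefined.
        w = 1.0 / np.log(np.maximum(k, 2))
    elif index == "RA":
        w = 1.0 / np.maximum(k, 1)           # guard isolated nodes
    else:
        raise ValueError(f"unknown index: {index}")
    # s_xy = sum_z A_xz * w_z * A_zy  ==  A @ diag(w) @ A
    return A @ np.diag(w) @ A
```

For example, in a 4-cycle the nodes 0 and 2 share the two common neighbors 1 and 3 (each of degree 2), giving CN = 2 and RA = 1/2 + 1/2 = 1.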

Unlike the above three local similarity indices, the Katz index (24) uses global topological information by summing over the collection of paths with exponential damping according to path lengths, i.e.,

*s*_{xy}^{Katz} = Σ_{*l*=1}^{∞} α^{*l*}(*A*^{*l*})_{xy},

which can be rewritten in the compact form, when α < 1/*λ*_{1},

*S* = (*I* − α*A*)^{−1} − *I*, [**9**]

where *I* is the identity matrix, *A* is the adjacency matrix, and *λ*_{1} is the largest eigenvalue of *A*. In our experiments, we tune the parameter α to optimize the performance of the Katz index. Notice that, since α cannot be exactly zero, the Katz index cannot degenerate to the CN index. Even when α is very close to zero, the performance of the Katz index can be different from the CN index, because, under the CN index, many nonobserved links are scored the same and thus ranked in a random way (see analysis on this so-called degeneracy phenomenon in refs. 17 and 31); therefore the very slight differences contributed by the higher-order terms in Eq. **9** may result in considerable changes in the order of nonobserved links associated with the same number of common neighbors.
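The closed form in Eq. **9** is straightforward to evaluate. A minimal NumPy sketch (the default choice of α as half the convergence threshold is our own, for illustration only):

```python
import numpy as np

def katz_scores(A, alpha=None):
    """Katz similarity: S = sum_{l>=1} alpha^l A^l = (I - alpha*A)^{-1} - I,
    which converges only for alpha < 1/lambda_1, where lambda_1 is the
    largest eigenvalue of the symmetric adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    lam1 = np.linalg.eigvalsh(A)[-1]   # largest eigenvalue (ascending order)
    if alpha is None:
        alpha = 0.5 / lam1             # illustrative default, inside radius
    if alpha * lam1 >= 1:
        raise ValueError("series diverges: need alpha < 1/lambda_1")
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) - alpha * A) - np.eye(n)
```

The closed form agrees with a truncated version of the damped path sum, which is a quick way to sanity-check an implementation.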

We also consider two probability methods. The HSM (18) method assumes that many real-world networks are hierarchically organized and thus nodes can be divided into groups and further divided into subgroups. The SBM (25) approach is one of the most general network models. Nodes are partitioned into groups and the connecting probability of any two nodes is only determined by the groups they belong to.

## Acknowledgments

We thank G. D’Agostino for helpful discussion. This work was partially supported by the National Natural Science Foundation of China (Grants 11222543, 11075031, 11205042, and 61433014), NESS Project, and CCF-Tencent Open Research Fund. L.L. acknowledges the research start-up fund of Hangzhou Normal University under Grant PE13002004039 and the EU FP7 Grant 611272 (project GROWTHCOM). T.Z. acknowledges the Program for New Century Excellent Talents in University under Grant NCET-11-0070. The work at Boston University was supported by US National Science Foundation Grants 1125290 and 0855453.

## Footnotes

^{1}L.L., L.P., and T.Z. contributed equally to this work.

^{2}To whom correspondence may be addressed. Email: hes{at}bu.edu, linyuan.lv{at}hznu.edu.cn, or yi-cheng.zhang{at}unifr.ch.

Author contributions: L.L., L.P., T.Z., Y.-C.Z., and H.E.S. designed research; L.L., L.P., Y.-C.Z., and H.E.S. performed research; L.L., L.P., T.Z., and Y.-C.Z. analyzed data; and L.L., T.Z., Y.-C.Z., and H.E.S. wrote the paper.

Reviewers: G.P., University of Rome; and D.W., IBM TJ Watson Research Center.

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1424644112/-/DCSupplemental.

Freely available online through the PNAS open access option.

## References

- Barabási A-L
- Currarini S, Jackson MO, Pin P
- Lewis K, Gonzalez M, Kaufman J
- Szabo G, Alava M, Kertész J, in *Complex Networks*, Lecture Notes in Physics, eds Ben-Naim E, Frauenfelder H, Toroczkai Z (Springer, New York), Vol 650, pp 139–162
- Kossinets G, Watts DJ
- Yin D, Hong L, Xiong X, Davison BD, in *Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval* (ACM Press, New York), pp 1234–1236
- Barabási A-L, Albert R
- Leskovec J, Backstrom L, Kumar R, Tomkins A, in *Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining* (ACM Press, New York), pp 462–470
- Song C, Qu Z, Blumm N, Barabási A-L
- Godsil C, Royle G
- Guimerà R, Sales-Pardo M
- Amaral LAN
- Erdös P, Rényi A
- Chung F, Lu L, Vu V
- Vespignani A
- Goel S, Hofman JM, Lahaie S, Pennock DM, Watts DJ
- Bollobás B
- Batagelj V, Mrvar A (2006) Pajek datasets. Available at vlado.fmf.uni-lj.si/pub/networks/data/. Accessed May 19, 2013.
- Ulanowicz RE, Bondavalli C, Egnotovich MS, *Network Analysis of Trophic Dynamics in South Florida Ecosystem, FY 97: The Florida Bay Ecosystem* (Chesapeake Biol Lab, Solomons, MD), Tech Rep CBL 98-123
- Kunegis J (2013) Hamsterster friendships unique network dataset – KONECT. Available at konect.uni-koblenz.de/networks/petster-friendships-hamster. Accessed June 1, 2013.
- Bu D, et al.
- Viswanath B, Mislove A, Cha M, Gummadi KP, in *Proceedings of the 2nd Workshop on Online Social Networks* (ACM Press, New York), pp 37–42
- Klimt B, Yang Y, in *Proceedings of the European Conference on Machine Learning* (Springer, New York), pp 217–226
- Getoor L, Diehl CP
- Schifanella R, Barrat A, Cattuto C, Markines B, Menczer F, in *Proceedings of the 3rd ACM International Conference on Web Search and Data Mining* (ACM Press, New York), pp 271–280
- Aiello LM, et al.