Homophily modulates double descent generalization in graph convolution networks

Significance Graph neural networks (GNNs) have been applied with great success across science and engineering, but we do not understand why they work so well. Motivated by experimental evidence of a rich phase diagram of generalization behaviors, we analyzed simple GNNs on a community graph model and derived precise expressions for generalization error as a function of noise in the graph, noise in the features, proportion of labeled data, and the nature of interactions in the graph. Computer experiments show that the analysis also qualitatively explains large “production-scale” networks and can thus be used to improve performance and guide hyperparameter tuning. This is significant both for the downstream science and for the theory of deep learning on graphs.

Graph neural networks (GNNs) excel in modeling relational data such as biological, social, and transportation networks, but the underpinnings of their success are not well understood.Traditional complexity measures from statistical learning theory fail to account for observed phenomena like the double descent or the impact of relational semantics on generalization error.Motivated by experimental observations of "transductive" double descent in key networks and datasets, we use analytical tools from statistical physics and random matrix theory to precisely characterize generalization in simple graph convolution networks on the contextual stochastic block model.Our results illuminate the nuances of learning on homophilic versus heterophilic data and predict double descent whose existence in GNNs has been questioned by recent work.We show how risk is shaped by the interplay between the graph noise, feature noise, and the number of training labels.Our findings apply beyond stylized models, capturing qualitative trends in real-world GNNs and datasets.As a case in point, we use our analytic insights to improve performance of state-of-the-art graph convolution networks on heterophilic datasets.Graph neural networks (GNNs) recently achieved impressive results on problems as diverse as weather forecasting (1), predicting forces in granular materials (2), or understanding biological molecules (3)(4)(5).They have become the de facto machine learning model for datasets with relational information such as interactions in protein graphs or friendships in a social network (6)(7)(8)(9).These remarkable successes triggered a wave of research on better, more expressive GNN architectures for diverse tasks, yet there is little theoretical work that studies why and how these networks achieve strong performance.
In this paper we study generalization in graph neural networks for transductive (semi-supervised) node classification: given a graph G = (V, E), node features x : V → R F , and labels y : Vtrain → {−1, 1} for a "training" subset of nodes Vtrain ⊂ V , we want to learn a rule which assigns labels to nodes in Vtest = V \ Vtrain.This setting exhibits a richer generalization phenomenology than the usual supervised learning: in addition to the quality and dimensionality of features associated with data, the generalization error is affected by the quality of relational information (are there missing or spurious edges?), the proportion of observed labels |Vtrain|/|V |, and the specifics of interaction between the graph and the features.Additional complexity arises because links in different graphs encode qualitatively distinct semantics.Interactions between proteins are heterophilic; friendships in social networks are homophilic (10).They result in graphs with different structural statistics, which in turn modulate interactions between the graphs and the features (11,12).Whether and how these factors influence learning and generalization is currently not understood.Outstanding questions include the role of overparameterization and the differences in performance on graphs with different levels of homophily or heterophily.Despite much work showing that in overparameterized models the traditional bias-variance tradeoff is replaced by the so-called double descent, there have been no reports nor analyses of double descent in transductive graph learning.Recent work speculates that this is due to implicit regularization (13).
Toward addressing this gap, we derive a precise characterization of generalization in simple graph convolution networks (GCNs) in semi-supervised * node classification on random community graphs.We motivate this setting by first presenting a sequence of experimental observations that point to universal behaviors in a variety of GNNs on a variety of domains.
In particular, we argue that in the transductive setting a natural way to "diagnose" double descent is by varying the number of labels available for training (Section 1).We then design experiments that show that double descent is in fact ubiquitous in GNNs: there is often a counterintuitive regime where more training data hurts generalization (14).Understanding this regime has important implications for the (often costly) label collection and questions of observability of complex systems (15).While earlier work reports similar behavior in standard supervised learning, our transductive version demonstrates it directly (14,16).On the other hand, we indeed find that for many combinations of relational datasets and GNNs, double descent is mitigated by implicit or explicit regularization.Interestingly, the risk curves are affected not only by the properties of the models and data (14), but also by the level of homophily or heterophily in the graphs.
Motivated by these findings we then present our main theoretical result: a precise analysis of generalization on the contextual stochastic block model (CSBM) with a simple GCN.We combine tools from statistical physics and random matrix theory and derive generalization curves either in closed form or as solutions to tractable low-dimensional optimization problems.To carry out our theoretical analysis, we formulate a universality conjecture which states that in the limit of large graphs, the risks in GCNs with polynomial filters do not change if we replace random binary adjacency matrices with random Gaussian matrices.We empirically verify the validity of this conjecture in a variety of settings; we think it may serve as a starting point for future analyses of deep GNNs.
These theoretical results allow us to effectively explore a range of questions: for example, in Section 3 we show that double descent also appears when we fix the (relative) number of observed labels, and vary relative model complexity (Fig. 4).This setting is close but not identical to the usual supervised double descent (17).We also explain why self-loops improve performance of GNNs on homophilic (18) but not heterophilic (11,12) graphs, as empirically established in a number of papers, but also that negative self-loops benefit learning on heterophilic graphs (19,20).We then go back to experiment and show that building negative self-loop filters into stateof-the-art GCNs can further improve their performance on heterophilic graphs.This can be seen as a theoretical GCN counterpart of recent observations in the message passing literature (19,20) and an explicit connection with heterophily for architectures such as GraphSAGE which can implement analogous logic (9).
Existing studies of generalization in graph neural networks rely on complexity measures like the VC-dimension or Rademacher complexity but they result in vacuous bounds which do not explain the observed new phenomena (21)(22)(23).Further, they only indirectly address the interaction between the graph and the features.This interaction, however, is of key importance: an Erdős-Renyi graph is not likely to be of much use in learning with a graph neural network.In reality both the graph and the features contain information about the labels; learning should exploit the complementarity of these two views.
Instead of applying the "big hammers" of statistical learning theory, we adopt a statistical mechanics approach and study performance of simple graph convolution networks on the contextual stochastic block model (CSBM) (24).We derive precise expressions for the learning curves which exhibit a rich phenomenology.
The two ways to think about generalization, statistical learning theory and statistical mechanics, have been contrasted already in the late 1980s and the early 1990s.Statistical mechanics of learning, developed at that time by Gardner, Opper, Sejnowski, Sompolinsky, Tishby, Vallet, Watkin, and many others-an excellent account is given in the review paper by Watkin, Rau, and Biehl (25)-must make more assumptions about the data and the space of admissible functions, but it gives results that are more precise and more readily applied to the practice of machine learning.
These dichotomies have been revisited recently in the context of deep learning and highly-overparameterized models by Martin and Mahoney (26), in reaction to Zhang et al.'s thought provoking "Understanding deep learning requires rethinking generalization" (27) which shows, among other things, that modern deep neural networks easily fit completely random labels.Martin and Mahoney explain that such seemingly surprising new behaviors can be effectively understood within the statistical mechanics paradigm by identifying the right order parameters and related phase diagrams.We explore these connections further in Section 4-Discussion.
Outline.We begin by describing the motivational experimental findings in Section 1.We identify the key trends to explain, such as the dependence of double descent generalization on the level of noise in features and graphs.In Section 2 we introduce our analytical model: a simple GCN on the contextual stochastic block model.Section 3 then explores the implications of some of the analytical findings about self-loops and heterophily on the design of state-of-the-art GCNs.We follow this by a discussion of our results in the context of related work in Section 4. In Section 5 we explain the analogies between GCNs and spin glasses which allow us to apply analysis methods from statistical physics and random matrix theory.We follow with a few concluding comments in Section 6.

Motivation: empirical results
Given an N -vertex graph G = (V, E) with an adjacency matrix A ∈ {0, 1} N ×N and features X ∈ R N ×F , a node classification GNN is a function (A, X) → h(w; A, X) insensitive to vertex ordering: for any node permutation π, h(w; πAπ ⊺ , πX) = πh(w; A, X).We are interested in the behavior of train and test risk, with S ∈ {Vtrain, Vtest} and ℓ(•, •) a loss metric such as the mean-squared error (MSE) or the cross-entropy.The optimal network parameters w * are obtained by minimizing the regularized loss [2] where rN (w) is a regularizer.

Is double descent absent in GNNs?
We start by investigating the lack of reports of double descent in transductive learning on graphs.Double descent is a scaling of test risk with model complexity which is rather different from the textbook biasvariance tradeoff (16,28).Up to the interpolation point, where the model has sufficient capacity to fit the training data without error, things behave as usual, with the test risk first decreasing together with the bias and then increasing with the variance due to overfitting.But increasing complexity beyond the interpolation point-into an overparameterized region characteristic for modern deep learning-may make the test risk decrease again.This generalization behavior has been identified already in the 90s by applying analytical tools from statistical mechanics to problems of machine learning; see for example Figure 10 in the paper by Watkin, Rau, and Biehl (25) or Figure 1 in Opper et al. (29) which show the generalization ability of the so-called pseudoinverse algorithm to train a boolean linear classifier (see also the book (30)).It is implicit in work on phase diagrams of generalization akin to those for magnetism or the Sherrington-Kirkpatrick model (31,32).
While these works are indeed the first to observe double descent, its significance for modern machine learning has been recognized by a line of research starting with (33).Double descent has been observed in complex deep neural networks (14) and theoretically analyzed for a number of machine learning models (17,25,30,34,35).There are, however, scarcely any reports of double descent in graph neural networks.Oono and Suzuki (13) speculate that this may be due to implicit regularization in relation to the so-called oversmoothing (36).
Generalization in supervised vs transductive learning.When illustrating double descent the test error is usually plotted against model complexity.For this to make sense, the amount of training data must be fixed, so the complexity on the abscissa is really relative complexity; denoting the size of the dataset (node of nodes) by N and the number of parameters by F we let this relative complexity be α := F/N .An alternative is to plot the risk against γ = α −1 : Starting from a small amount of data (small γ), we first go through a regime in which increasing the amount of training data leads to worse performance.In our context this can be interpreted as varying the size of the graph while keeping the number of features fixed.
In transductive node classification we always observe the entire graph A and the features associated with all vertices X, but only a subset of M labels.It is then more natural to vary τ := M/N than α −1 , with M being the number of observed labels.Although the resulting curves are slightly different, they both exhibit double descent; in the terminology of Martin and Mahoney, both τ and α −1 may be called loadlike parameters (26); see also (37).† In particular, they both have the interpolation peak at τ = α −1 , or M = F , when the system matrix becomes square and poorly conditioned.The key aspect of double descent is that the generalization error decreases on both sides of the interpolation peak.
Using τ instead of α −1 is convenient for several reasons: in real datasets, the number of input features is fixed; we cannot vary it.Further, there is no unique way to increase the number of parameters in a GNN and different GNNs are parameterized differently which complicates comparisons.Varying depth may lead to confounding effencts such as oversmoothing which is implicit regularization.Varying τ is a straightforward and clean way to compare different architectures in analogous settings.We can, however, easily vary α = γ −1 in our analytic model described in Section 2; we show the related results in Fig. 4.

Experimental observation of double descent in GNNs.
Armed with this understanding, we design an experiment as follows: we study the homophilic citation graph Cora (38) and the heterophilic graphs of Wikipeda pages Chameleon (39) and university web pages Texas (11).We apply different graph convolution networks with different losses, with and without dropout regularization.
Results are shown in Fig. 1.Importantly, we plot both the test error (red) and the test accuracy (black) in node classification against a range of training label ratios τ .In the first column, we use a one-layer GCN similar to the one we analyze theoretically in Section 2, but with added degree normalization, self-loops, and multiple classes; in the second column, we use a two-layer GCN; in the third column we add dropout; in the fourth, we use the cross-entropy loss instead of the MSE.This last model is used in the pytorch-geometric node classification tutorial ‡ .
First, with a one-layer network one can clearly observe transductive double descent on Cora in both the test risk and accuracy.The situation is markedly different on the heterophilic Texas, which contains only 183 nodes but 1703 features per node which yields relative model complexity α = F/N much higher than for other datasets.Here the test accuracy decreases near-monotonically, consistently with our theoretical analysis in Section 2 (cf.Fig. 5D).In this setting strong regularization improves performance.
With a two-layer network the double descent still "survives" in the test error on Cora, but the accuracy is almost monoton- ically increasing except on Texas.These results corroborate the intuition that dropout and non-linearity alleviate GNN overfitting on node classification, especially for large training label ratios.
We then explore the role of noise in the graph and in the features by manually adding noise to Cora.We randomly remove 30% of the links and add the same number of random links, and randomize 30% of the entries in X; results are shown in the fourth row of Fig. 1.The double descent in test error appears even with substantial regularization.Comparing the first and the fourth row affirms that double descent is more prominent with noisy data; this is again consistent with our analysis (see Section 3).In the last row we apply the networks to the synthetic CSBM.Observing the same qualitative behavior also in this case lends credence to the choice of CSBM for our precise analysis in Section 2.
In Fig. 2 we further focus on the strongly heterophilic Chameleon which does not clearly show double descent in Fig. 1.We randomly perturb different percentages of edges and in addition to GCNs also use the considerably more powerful FSGNN (40), which achieves current state-of-the-art results on Chameleon.Again, we see that double descent (a non-monotonic risk curve) emerges at higher noise (weaker heterophily).It is noteworthy that more expressive architectures do seem to mitigate double descent; conversely, a one-layer GCN exhibits double descent even without additional noise.We analytically characterize this phenomenon in Section 2 and illustrate it in Fig. 3. Beyond GCNs, we show that double descent occurs in more sophisticated GNNs like graph attention networks (41), GraphSAGE (9) and Chebyshev graph networks (7); see the SI Appendix E for details.
In summary, transductive double descent occurs in a variety of graph neural networks applied to real-world data, with noise and implicit or explicit regularization being the key determinants of the shape of generalization curves.Understanding the behavior of generalization error as a function of the number of training labels is of great practical value given the difficulty of obtaining labels in many domains.For some datasets like Texas, using too many labels seems detrimental for some architectures.

A precise analysis of node classification on CSBM with a simple graph convolution network
Motivated by the above discussions, we turn to a theoretical study of the performance of GCNs on random community graphs where we can understand the influence of all the involved parameters.We have seen in Section 1 that the generalization behavior in this setting qualitatively matches generalization on real data.Graph convolution networks are composed of graph convolution filters and nonlinear activations.Removing the activations results in a so-called simple GCN (42) or a spectral GNN (43,44).For a graph G = (V, E) with adjacency matrix A and features that live on the nodes X, h(w ; A, X) = P (A)Xw where where w ∈ R F are trainable parameters and K is the filter support size in terms of hops on the graph.We treat the neighborhood weights c k at different hops as hyperparameters.We let A 0 def = IN so that the model Eq. ( 3) reduces to ordinary linear regression when K = 0.
In standard feed-forward networks, removing the activations results in a linear end-to-end mapping.Surprisingly, GCNs without activations (such as SGC ( 42)) or with activations only in the output (such as FSGNN (40) and GPRGNN ( 12)) achieve state-of-the-art performance in many settings.§  We will derive test risk expressions for the above graph convolution network in two shallow cases: P (A) = A and P (A) = A + cI.We will also state a universality conjecture for general polynomial filters.Starting with this conjecture, we can in principle extend the results to all polynomial filters using routine but tedious computation.We provide an example for the training error of a two-hop network in SI Appendix C. As we will show, this analytic behavior closely resembles the motivational empirical findings from Section 1.
Training and generalization.We are interested in the large graph limit N → ∞ where the training label ratio |Vtrain|/N → τ .We fit the model parameters w by ridge regression w * := arg min w L A,X (w), where [4] We are interested in the training and test risk in the limit of large graphs, [5]    § GCNs without activations are sometimes called "linear" in analogy with feed-forward networks, but that terminology is misleading.In graph learning, both A and X are bona fide parts of the input and a function which depends on their multiplication is a nonlinear function.What is more, in many applications A is constructed deterministically from a dataset X, for example as a neighborhood graph, resulting in even stronger nonlinearity.
as well as in the expected accuracy, ACC = lim We will sometimes write Rtrain(A), Rtest(A), ACC(A) to emphasize that the matrix A in Eq. ( 3) follows a distribution A, A ∼ A. The expectations are over the random graph adjacency matrix A, random features X, and the uniformly random test-train partition V = Vtrain ∪ Vtest.Our analysis in fact shows that the quantities all concentrate around the mean for large N (and M and F ): In the language of statistical physics, they are self-averaging.This proportional aysmptotics regime where F, M , and N all grow large at constant ratios is more challenging to analyze than the regimes where dataset size or model complexity is constant, but it results in phenomena we see with production-scale machine learning models on real data; see also (26,34).
Contextual stochastic block model.We apply the GCN to the contextual stochastic block model (CSBM).CSBM adds node features to the stochastic block model (SBM)-a random community graph model (24) where the probability of a link between nodes depends on their communities.The lower triangular part of the adjacency matrix A bs has distribution A convenient parameterization is where d is the average node degree and the sign of λ determines whether the graph is homophilic or heterophilic; |λ| can be regarded as the graph signal noise ratio (SNR).
We will also study a directed SBM (45,46) with adjacency matrix distributed as Many real graphs have directed links, including chemical connections between neurons, the electric grid, folowee-folower relation in social media, and Bayesian graphs.In our case the directed SBM facilitates analysis with self-loops while exhibiting the same qualitative behavior and phenomenology as the undirected one.
The features of CSBM follow the spiked covariance model, where u ∼ N (0, IF /F ) is the F -dimensional hidden feature and ξ i ∼ N (0, IF /F ) are i.i.d.Gaussian noise; the parameter µ is the feature SNR.We work in the proportional scaling regime where N F → γ, with γ being the inverse relative model complexity, and ascribe feature vectors to the rows of the data matrix X, We assume throughout that the two communities are balanced; without loss of generality we let y i = 1 for i = 1, 2, . . ., N/2 and y i = −1 for i > N/2.
We will show that CSBM is a comparatively tractable statistical model to characterize generalization in GNNs.Intuitively, when N → ∞, the risk should concentrate around a number that depends on five parameters: τ Label ratio r Ridge regularization parameter.
We emphasize that we study the challenging weak-signal regime where λ, µ and γ do not scale with N (but F does).This stands in contrast to recent machine learning work on CSBM (47,48) which studies the low-noise regime where µ or λ 2 scale with N , or even the noiseless regime where the classes become linearly separable after applying a graph filter or a GCN.We argue that the weak-signal regime is closer to real graph learning problems which are neither too easy (as in linearly separable) nor too hard (as with a vanishing signal).The fact that we discover phenomena which occur in state-of-the-art networks and real datasets supports this claim.
We outline our analysis in Section 5 and provide the details in the SI appendices.But first, in the following section, we show that the derived expressions precisely characterize generalization of shallow GCNs on CSBM and also give a correct qualitative description of the behavior of "big" graph neural networks on complex datasets, pointing to interesting phenomena and interpretations.

Phenomenology of generalization in GCNs
We focus on the behavior of the test risk under various levels of graph homophily, emphasizing two main aspects: i) different levels of homophily lead to different types of double descent; ii) self-loops, standard in GCNs, create an imbalance between heterophilic and homophilic datasets; negative self-loops improve the handling of heterophilic datasets.

Double descent in shallow GCNs on CSBM.
As we show in Section 5 and SI Appendix Section A, the expression for the test risk for unregularized regression (r = 0) with shallow GCN can be obtained in closed form as when γτ > 1.It is evident that the denominator vanishes as γτ approaches 1.When this happens, the system matrix ItrainP (A)X, where Itrain selects the rows for which we have labels; see Section 5, Eq. ( 12), is square and near-singular for large N , which leads to the explosion of Rtest (Fig. 3A).When relative model complexity is high, i.e., γ = N/F < 1 is low, τ γ is always less than 1.In such cases, no interpolation peak appears, which is consistent with our experimental results for the Texas dataset where γ = 0.11; cf.Fig. 5D.
At the other extreme, for strongly regularized training (large r) the double descent disappears (Fig. 3C); it has been shown that this happens at optimal regularization (35,49).The absolute risk values in Fig. 3B and 3C show the same behavior.We work with the symmetric binary adjacency matrix ensemble A bs .Each experimental data point is averaged over 10 independent trials; their standard deviation is shown by vertical bars.The theoretical curves agree perfectly with experiments but also qualitatively with the phenomena we observed on real data in Section 1.
Fig. 3B shows that when the graph is very noisy (λ is small) the test error starts to increase as soon as the training label ratio τ increases from 0. When λ is large, meaning that the graph is discriminative, the test error first decreases and then increases.Similar behavior can be observed when varying the feature SNR µ instead of λ.Double descent also appears in test accuracy (Fig. 5).
While these curves all illustrate double descent in the sense that they all have the interpolation peak on both sides of which the error decreases, they are qualitatively different.The emergence of these different shapes can be explained by looking at the distribution of the predicted ith label hi(w * ).As we show in SI Appendix A, hi(w * ) is normally distributed with mean and variance given by the solutions of a saddle point equation outlined in Section 5.The test accuracy can thus be expressed by the error function (cf.Eq. ( 17)).
As we increase the number of labels, the mean E[hi(w * )] approaches y i monotonically.However, the variance Var[hi(w * )] behaves diferently for different model complexities α = 1 γ and regularizations r, resulting in distinct double descent curves.
For example, when r → 0 and τ → 1 γ , the variance of hi(w * ) for i ∈ Vtest diverges and the accuracy approaches 50%, a random guess.On the other hand, when r is large, the variance is small and double descent is mild or absent, as shown in Fig. 5A. Figure 5B shows a typical double descent curve with two regimes where additional labels hurt generalization.In Fig. 5C we also see a mild double descent when the relative model complexity is close to 1: this is consistent with experimental observations on Cora in Fig. 1.In certain extremal cases, for example when γ is very small, the test accuracy continuously decreases after a very small ascent around τ = 0 (Fig. 5D); this is consistent with our experimental observations for the Texas dataset.

Double descent as a function of the relative model complexity.
As mentioned earlier, the theoretical model makes it easy to study double descent as we vary the model complexity α = 1/γ rather than τ ; this is closer to the traditional reports of double descent in supervised learning.The resulting plots follow a similar logic: as shown in Fig. 4, adding randomness in the graph (low |λ|), makes the double descent more prominent.Conversely, for a highly homophilic graph (large λ), the test risk decreases monotonically as the relative model complexity α grows.(with increasing λ) show curves for graphs of decreasing randomness.Varying model complexity in GNNs yields non-monotonic curves similar to those in the earlier studies of double descent studies in supervised (inductive) learning.Note that the overall shape of the curve is strongly modulated by the degree of homophily in the graph.

Heterophily, homophily, and positive and negative self-loops.
GCNs often perform worse on heterophilic than on homophilic graphs.An active line of research tries to understand and mitigate this phenomenon with special architectures and training strategies (12,50,51).We now show that it can be understood through the lens of self-loops.
Strong GCNs ubiquitously employ self-loops of the form P (A) = A + IN on homophilic graphs (8,12,41,42,52).¶ Self-loops, however, deteriorate performance on heterophilic networks.CSBM is well suited to study this phenomenon since λ allows us to transition between homophilic and heterophilic graphs.
We allow the self-loop strength c to vary continuously so that the effective adjacency matrix becomes A + cIN .Importantly, we also allow c to be negative (see SI Appendix B).In Fig. 6 we plot the test risk as a function of c for both positive and negative c.We find that a negative self-loop (c < 0) results in much better performance on heterophilic data (λ < 0).We sketch a signal-processing interpretation of this phenomenon in SI Appendix D.
Negative self-loops in state-of-the-art GCNs.It is remarkable that this finding generalizes to complex state-of-the-art graph neural networks and datasets.We experiment with two common heterophilic benchmarks, Chameleon and Squirrel, first ¶ One way to characterize link semantics in graphs is by notions of homophily and heterophily.In a friendship graph links signify similarity: if Alice and Bob both know Jane it is reasonable to expect that Alice and Bob also know each other.In a protein interaction graph, if proteins A and B interact, a small mutation A' of A will likely still interact with B but not with A. Thus "interaction" links signify partition.Most graphs are somewhere in between the homophilic and heterophilic extremes.with a two-layer ReLU GCN.The default GCN (for example in pytorch-geometric) contains self-loops of the form A + I; we immediately observe in Fig. 7 that removing them improves performance on both datasets.We then make the intensity of the self-loop adjustable as a hyper-parameter and find that a negative self-loop with c between −1.0 and −0.The datasets and splits are the same as Fig. 7.

Discussion
Before delving into the details of the analytical methods in Section 5 and conceptual connections between GNNs and spin glasses, we discuss the various interpretations of our results in the context of related work.

Related work on theory of GNNs. Most theoretical work on
GNNs addresses their expressivity (53,54).A key result is that the common message-passing formulation is limited by the power of the Weisfeiler-Lehman graph isomorphism test (55).This is of great relevance for computational chemistry where one must discriminate between the different molecular structures (56), but it does not explain how the interaction between the graph structure and the node features leads to generalization.Indeed, simple architectures like graph convolution networks (GCNs) are far from being universal approximators but they often achieve excellent performance on real problems with real data.
Existing studies of generalization in GNNs leverage complexity measures such as the Vapnik-Chervonenkis dimension (57)(58)(59) or the Rademacher complexity (21).While the resulting bounds sometimes predict coarse qualitative behavior, a precise characterization of relevant phenomena remains elusive.Even the more refined techniques like PAC-Bayes perform only marginally better (22).It is striking that only in rare cases do these bounds explicitly incorporate the interaction between the graph and the features (23).Our results show that understanding this interaction is crucial to understanding learning on graphs.
Indeed, recall that a standard practice in the design of GNNs is to build (generalized) filters from the adjacency matrix or the graph Laplacian and then use these filters to process data.But if the underlying graph is an Erdős-Rényi random graph, the induced filters will be of little use in learning.
The key is thus to understand how much useful information the graph provides about the labels (and vice-versa), and in what way that information is complementary to that contained in the features.
A statistical mechanics approach: precise analysis of simple models.An alternative to the typically vacuous ‖ complexitybased risk bounds for graph neural networks (21)(22)(23) is to adopt a statistical mechanics perspective on learning; this is what we do here.Indeed, one key aspect of learning algorithms that is not easily captured by complexity measures of statistical learning theory is the emergence of qualitatively distinct phases of learning as one varies certain key "order parameters".Such phase diagrams emerge naturally when one views machine learning models in terms of statistical mechanics of learning (26,37).
Martin and Mahoney (26) demonstrate this elegantly by formulating what they call a very simple deep learning model, and showing that it displays distinct learning phases reminiscent of many realistic, complex models, despite abstracting away all but the essential "load-like" and "temperature-like" parameters.They argue that such parameters can be identified in machine learning models across the board.
The statistical mechanics paradigm requires one to commit to a specific model and do different calculations for different ‖ We quote the authors of the PAC-Bayesian analysis of generalization in GNNs ( 22 We use the non-symmetric binary adjacency matrix ensemble A bn .The solid lines are the theoretical results predicted by the replica method.In B we see that the optimal generalization performance requires adapting the self-loop intensity c to the degree of homophily.models (25), but it results in sharp characterizations of relevant phases of learning.Important results within this paradigm, both rigorous and heuristic, were derived over the last decade for regularized leastsquares (60-62), random-feature regression (17,34,49,63), and noisy Gaussian mixture and spiked covariance models (64-66), using a variety of analytical techniques from statistical physics, high-dimensional probability, and random matrix theory.Not all of these works explicitly declare adherence to the statistical mechanics tradition.It nonetheless seems appropriate to categorize them thus since they provide precise analyses of learning in specific models in terms of a few order parameters.
Even though these papers study comparatively simple models, many key results only appeared in the last couple of years, motivated by the proliferation of over-parameterized models and advances in analytical techniques.One should make sure to work in the correct scaling of the various parameters (34); while this may complicate the analysis it leads to results which match the behavior of realistic machine learning systems.We extend these recent results by allowing the information to propagate on a graph; this gives rise to interesting new phenomena of some relevance for the practitioners.In order to obtain precise results we similarly study simple graph networks, but we also show that the salient predictions closely match the behavior of state-of-the-art networks on real datasets.We precisely traced the connection between generalization, the interaction type (homophilic or heterophilic) and the parameters of the GCN architecture and the dataset for a specific graph model.Experiments show that the learned lessons apply to a broad class of GNNs and can be used constructively to improve the performance of state-of-the-art graph neural networks on heterophilic data.
Finally, let us mention that phenomenological characterizations of phase diagrams of risk are not the only way to apply tools from statistical mechanics and more broadly physics to machine learning and neural networks.These tools may help address a rather different set of "design" questions, as reviewed by Bahri et al. (67).
Negative self-loops in other graph learning models.Recent theoretical work (19,20) shows that optimal message passing in heterophilic datasets requires aggregating neighbor messages with a sign opposite from that of node-specific updates.Similarly, in earlier GCN architectures such as GraphSAGE (9), node and neighbor features are extracted using different trainable functions.This immediately allows the possibility of aggregating neighbors with an opposite sign in heterophilic settings.We show that self-loops with sign and strength depending on the degree of heterophily improve performance both in theory and in real state-of-the-art GCNs.The notion of self-loops in the context of GCNs usually indicates an explicit connection between a node and itself, A ← A + I.
GCNs with a few labels outperform optimal unsupervised detection.One interpretation of our results is that they quantify the value of labels in community detection, traditionally approached with unsupervised methods.These approaches are subject to fundamental information-theoretic detection limits which have drawn considerable attention over the last decade (64,68,69).The most challenging and most realistic high-dimensional setting is when the signal strength is comparable to that of the noise for both the graph and the features (24,68,70).The results of Deshpande et al. indicate that when µ 2 /γ + λ 2 < 1, no unsupervised estimator can detect the latent structure y from A and X (24).Our analysis shows that even a small fraction of revealed labels allow a simple GCN to break this unsupervised barrier.
In Fig. 8, we compare the accuracy of a one-layer GCN with unsupervised belief propagation (BP) (24).We first run BP with µ = λ = γ = 1 and record the achieved accuracy.We then plot the smallest training label ratio τ for which the GCN achieves the same accuracy.We repeat this procedure for different feature SNRs µ and graph SNRs λ.The black solid line indicates the information-theoretic threshold for detecting the latent structure from A and X.
Earlier analyses of belief propagation in the SBM without features uncover a detectability phase transition (71).Our analysis shows that no such transition happens with GCNs.Indeed, our primary interest is in understanding GCNs, which are a general tool for a variety of problems, but unlike belief propagation, GCNs need not be near-optimal for community detection.For the optimal inference strategy, the phase transition may not be destroyed by revealing labels.

Generalization in GCNs via statistical physics
The optimization problem Eq. ( 4) has a unique minimizer as long as r > 0. Since it is a linear least-squares problem in w, we can write down a closed-form solution, w * = (rIF + (P (A)X) T ItrainP (A)X) −1 (P (A)X) T Itrainy, [11] where Analyzing generalization is, in principle, as simple as substituting the closed-form expression Eq. ( 11) into Eq.( 5) and Eq. ( 6) and calculating the requisite averages.The procedure is, however, complicated by the interaction between the graph A and the features X and the fact that A is a random binary adjacency matrix.Further, for a symmetric A, ItrainP (A) is correlated with ItestP (A) even in a shallow GCN (and certainly in a deep one).
The statistical physics program.We interpret the (scaled) loss function as an energy, or a Hamiltonian, H(w; A, X) = τ N L A,X (w).Corresponding to this Hamiltonian is the Gibbs measure over the weights w, where Z β (A, X) = dw exp (−βH (w; A, X)) , β is the inverse temperature and Z β is the partition function.
At infinite temperature (β → 0), the Gibbs measure is diffuse; as the temperature approaches zero (β → ∞), it converges to an atomic measure concentrated on the unique solution of Eq. ( 4), w * = lim β→∞ wP β (w; A, X)dw.In this latter case the partition function is similarly dominated by the minimum of the Hamiltonian.The expected loss can thus be computed from the free energy density f β , Since the quenched average E ln Z β is usually intractable, we apply the replica method ( 72) which allows us to take the expectation inside the logarithm and compute the annealed average, The gist of the replica method is to compute E A,X Z n β for integer n and then "pretend" that n is real and take the limit n → 0. The computation for integer n is facilitated by the fact that Z n β normalizes the joint distribution of n independent copies of w, {w a } n a=1 .We obtain ).
[13] Instead of working with the product AX, replica allows us to express the free energy density as a stationary point of a function where the dependence on A and X is separated (see Appendix A for details), C(m, p, q) + E( m, p, q) + D(m, p, q, m, p, q), [14] where we defined ], which in the limit N → ∞, n → 0 only depend on the socalled order parameters m, p, q and m, p, q.The separation thus allows us to study the influence of the distribution of A in isolation; we provide the details in SI Appendix A. The risks (called the observables in physics) can be obtained from f β .

Gaussian adjacency equivalence.
A challenge in computing the quantities in Eq. ( 13) and Eq. ( 14) is to average over the binary adjacency matrix A. We argue that f in Eq. ( 14) does not change if we instead average over the Gaussian ensemble with a correctly chosen mean and covariance.For a one-layer GCN (P (A) = A), we show that replacing E A bs c(P (A bs )) by E A gn c (P (A gn )) will not change f in Eq. ( 14)with A gn being a spiked non-symmetric Gaussian random matrix, [15] with Ξ gn having i.i.d.centered normal entries with variance 1/N .This substitution is inspired by the universality results for the disorder of spin glasses (73)(74)(75) and the universality of mutual information in CSBM (24).Deshpande et al. (24) showed that the binary adjacency matrix in the stochastic block model can be replaced by where Ξ gs ∈ R N ×N is a sample from the standard Gaussian orthogonal ensemble, without affecting the mutual information between y (which they modeled as random) and (A, X) when N → ∞ and d → ∞.
Our claim refers to certain averages involving A; we record it as a conjecture since our derivations are based on the nonrigorous replica method.We first define four probability distributions: • A bs : The distribution of adjacency matrices in the undirected CSBM (cf.Eq. ( 7)) scaled by 1/ √ d, 1 • A bn : the distribution of adjacency matrices in the directed CSBM (cf.Eq. ( 8)), scaled by 1/ √ d; • A gs : the distribution of spiked Gaussian orthogonal ensemble (cf.Eq. ( 16); • A gn : the distribution of spiked Gaussian random matrices (cf.Eq. (15).

With these definitions in hand we can state
Conjecture 1 (Equivalence of graph matrices).Assume that d scales with N so that 1/d → 0 and d/N → 0 when N → ∞.
Let P (A) be a polynomial in A used to define the GCN function in Eq. ( 3).It then holds that with • ∈ {s, n}.When P (A) = A, the above quantities for symmetric and non-symmetric distributions also coincide.
In the case when P (A) = A we justify Conjecture 1 by the replica method (see SI Appendix A).In the general case we provide abundant numerical evidence in Fig. 9.We first consider the case when P (A) = A. Fig. 9A and Fig. 9D show estimates of Rtrain and Rtest averaged over 100 independent runs.The standard deviation over independent runs is indicated by the shading.We see that the means converge and the variance shrinks as N grows.
We also show absolute differences between the averages of Rtrain and Rtest in Fig. 9B and Fig. 9E.We find that the values of Rtrain and Rtest can be well fitted by a linear relationship in the logarithmic scale, suggesting that the absolute differences approach zero exponentially fast as N → ∞.We next consider P (A) = A 2 .In Fig. 9C and Fig. 9F we can see that for intermediate values of d, Rtrain and Rtest corresponding to A bs and A bn are both close to that corresponding to A gs and A gn .This is consistent with the results shown in Fig. 3 and Fig. 6 where the theoretical results computed by the replica method and A gn perfectly match the numerical results with A bn (for P (A) = A + cIN ) and A bs (for P (A) = A), further validating the conjecture.
Solution to the saddle point equation.We can now solve the saddle point equation Eq. ( 14) by averaging over A gn .In the general case the solution is easy to obtain numerically.For an one-layer GCN with P (A) = A we can compute a closed-form solution.Denoting the critical point in Eq. ( 14) by (m * , p * , q * ) we obtain where erf is the usual error function.While the general expressions are complicated (see SI Appendix A), in the ridgeless limit r → 0 we can compute simple closed-form expressions for train and test risks, , [18] assuming that τ γ > 1.
A rigorous solution.We note that for a one-layer GCN risks can be computed rigorously using random matrix theory provided that Conjecture 1 holds and we begin with a Gaussian "adjacency matrix" instead of the true binary SBM adjacency matrix.We outline this approach in SI Appendix C; in particular, for r = 0, the result of course coincides with that in Eq. ( 18).

Conclusion
We analyzed generalization in graph neural networks by making an analogy with a system of interacting particles: particles correspond to the data points and the interactions are specified by the adjacency relation and the learnable weights.The latter can be interpreted as defining the "interaction physics" of the problem.The best weights correspond to the most plausible interaction physics, coupled in turn with the network formation mechanism.
The setting that we analyzed is maybe the simplest combination of a graph convolution network and data distribution which exhibits interesting, realistic behavior.In order to theoretically capture a broader spectrum of complexity in graph learning we need to work on new ideas in random matrix theory and its neural network counterparts (76).While very deep GCNs are known to suffer from oversmoothing, there exists an interesting intermediate-depth regime beyond a single layer (77).Our techniques should apply simply by replacing A by any polynomial P (A) before solving the saddle point equation, but we will need a generalization of existing random matrix theory results for HCIZ integrals.Finally, it is likely that these generalized results could be made fully rigorous if "universality" in Conjecture 1 could be established formally.

A. Sketch of the derivation
A. Replica method.We now outline the replica-based derivations.We first consider one-layer GCN P (A) = A to show the main idea.The training and test risks in this case is given by Eq. ( 5) in the main text.We further extend the analysis when self-loops are included in Appendix B.
We begin by defining the augmented partition function, where β is the inverse temperature.The Hamiltonian in the above equation reads which is the loss Eq. ( 4) scaled by N τ .The "observables" O train and Otest (the scaled training and test risks) are the quantities we are interested in: When the inverse temperature β is small, the Gibbs measure Z −1 β (A, X) exp (−βH(w)) dw is diffuse; when β → ∞, the Gibbs measure converges to an atomic measure concentrated on the unique solution of Eq. ( 4).That is to say, for t 0 = t 1 = 0, we can write The idea is to compute the values of the observables in the large system limit at a finite temperature and then take the limit β → ∞.To this end, we define the free energy density f β corresponding to the augmented partition function, A, X). [20] The expected risks can be computed as Although ln Z β (A, B)/N concentrates for large N , a direct computation of the quenched average Eq.( 20) is intractable.We now use the replica trick which takes the expectation inside logarithm, or replaces the quenched average by annealed average in physics jargon: The main idea of the replica method is to first compute E A,X Z n β for integer n by interpreting Z n β = (Z β ) 1 (Z β ) 2 . . .(Z β )n as the product of n partition functions for n independent configurations {w a } n a=1 .Then we obtain E A,X ln Z β by taking the limit n → 0 in Eq. ( 22), even though the formula for ln E A,X Z β n is valid only for integer n.The expectation of replicated partition function reads where we denote I β def = β − βt 0 I train + −βt 1 Itest.We first keep X fixed and take the expectation over A. Directly computing the expectation over the binary graph matrix A ∼ A bs is non-trivial.To make progress, we average over A ∼ A gn (instead cf.Eq. ( 15)).In Appendix B we show that this Gaussian substitution does not change the free energy density, ultimately yielding the same risks and accuracies as detailed in Conjecture 1.
Letting σ a = Xw a , the elements of A gn σ a are now jointly Gaussian for any fixed σ a and C = can be computed by multivariate Gaussian integration.It is not hard to see that C depends only on a vector m ∈ R n and a matrix Q ∈ R n×n defined as ma = y T σ a /N, and In statistical physics these quantities are called the order parameters.We then define . [25] Using the Fourier representation of the Dirac delta function δ(t − t 0 ) = 1 2π dω exp(iω(t − t 0 )), we have , [26] so that Eq. ( 23) becomes where 2 . [28] In Eq. ( 27) and Eq. ( 28), we apply the change of variables iQ ab → Q ab , ima → ma.Note that while Eq. ( 25) still depends on X, we do not average over it.As A and X appear as a product in Eq. ( 22), it is not straightforward to average over them simultaneously.We average over A first to arrive at Eq. ( 27), and then find that Eq. ( 25)-Eq.( 28) are self-averaging, meaning that they concentrate around their expectation as N → ∞.Ultimately, this allows us to isolate the randomness in A and X.It also allows the possibility to adapt our framework to other graph filters by simply replacing A by P (A) in Eq. ( 25).Since Eq. ( 4) has a unique solution, we take the replica symmetry assumption where the order parameters are structured as In the limit N → ∞, n → 0, we are only interested in the leading order contributions to C(m, Q) so we write (with a small abuse of notation), Similarly, we have , [31] where and r = 2τ r 2 q + p .[32] We give the details of the derivation of C in Appendix B and of E in Appendix C. When r → 0, we have T → 1 1−γ , and (1).[33] We can now compute Eq. ( 27) by integrating only on 6 parameters: m, m, q, q, p, and p.For N → ∞, the integral can be computed via the saddle point method: C(m, p, q) + E( m, p, q) + (q + p) q + p − 1 2 p p + m m. [34] The stationary point satisfies When β → ∞, the stationary point exists only if m, m, p, p, q, q scale as We thus reparameterize them as p → p, βq → q, p β 2 → p, We use the asymptotic notation o(1) for deterministic and random quantities which vanish in the limit n → 0, N → ∞ in a suitable sense.We similarly use o β (1) in the limit β → ∞.The order parameters m, p, q in Eq. ( 30) as well as m, p, q in Eq. ( 33) scale in linear order of β or a polynomial of β.
We analyze the connection between the stationary point and w * .As w * is the unique solution in Eq. ( 4), and from the definition of order parameters Eq. ( 24), stationarity in Eq. (36) implies that E A,X y T Xw * /N .[37] Let A train ∈ R F ×N be the selection of rows from A corresponding to i-th row for all i ∈ V train and Atest ∈ R (N −F )×N be the selection of rows corresponding to i-th row for all i ∈ Vtest.The neural network output for the test nodes reads htest = Atestσ * , where σ * = Xw * .Since we work with a non-symmetric Gaussian random matrix A ∼ A gn as our graph matrix, Atest is independent of A train and σ * (Note σ * = Xw * depends on A train ).Therefore, for any fixed A train and X but random Atest, the network outputs for test nodes are jointly Gaussian, Combining this with the results from Eq. ( 37), we obtain the test accuracy as ACC = P(x > 0), where x ∼ N (λm * , p * ) > 0. [38]

B. Computation of C(•).
Recall that in Eq. ( 25) we define In this section, we begin by computing G(A gn ) = C(m, p, q) in Eq. ( 30) where A gn denotes the distribution of non-symmetric Gaussian spiked matrices Eq. ( 15).We then show that the symmetry does not influence the value of Eq. ( 39), i.e., G(A gn )/β = G(A gs )/β when N, d, β → ∞ and d/N, n → 0. Finally, we show that the Gaussian substitution for the binary adjacency matrix does not influence the corresponding free energy density, which ultimately leads to the same risks and accuracies under different adjacency matrices (Conjecture 1).Let's first concatenate {σ a } n a=1 as σ = (σ Then we can rewrite Eq. ( 39) in vector form where ⊗ is the Kronecker product.By the central limit theorem, when N → ∞, the vectors (A ⊗ 1n)σ for A ∼ A gn , A ∼ A bn , A ∼ A gs and A ∼ A bs all converge in distribution to Gaussian random vectors.Letting µ(A) and Σ(A) be the mean and the covariance of A ∼ A, we get [41] where I nβ = I β ⊗ In.The vanishing lower-order term o(1) comes from the tails in the central limit theorem and it is thus absent when A = A gn .In this case we have µ(A gn ) = λy ⊗ m, Σ(A gn ) = I N ⊗ Q, [42] with m and Q defined in Eq. (24).Leveraging the replica symmetric assumption Eq. ( 29), we compute the determinant term in Eq. ( 41) as + τ ln (2β (1 − t 0 ) q + 1) + (1 − τ ) ln (2 (−βt 1 ) q + 1) . [43] The o(1) in the third line comes from the approximation 1 n ln(1 + n) = 1 + o (1).The last two terms in Eq. ( 43) do not increase with β and can thus be neglected in the limit β → ∞ when computing G(A)/β: they give rise to βo β (1) in Eq. (30).The rest terms in Eq. ( 41) can be computed as Collecting everything we get G(A gn ) in Eq. (30).Note that G(A gn )/β = O(1), and we are going to show that G(A gn )/β −G(A gs )/β = o(1).For A gs , A bn and A bs , we find the means and covariances of (A ⊗ 1n)σ as where l ∈ R n with entries la = are order parameters analogous to m in Eq. ( 24).Substituting Eq. ( 44) into Eq.( 41), we see that the perturbation In µ(A bn ) and µ(A gn ), there is a bias l.By the replica symmetric assumption Eq. ( 29), we have l = l1n.It is easy to show that a critical point in the saddle point equation with l and l exists only when l = 0 for β → ∞.Therefore, the term √ d1 N ⊗ l will not influence the value of free energy density when β → ∞.It further implies that the elements of σ * = Xw * are symmetrically distributed around zero.This is analogous to the vanishing average magnetization for the Sherrington-Kirkpatrick model (72).
To summarize this section, as long as 1  N , 1 √ d , d N → 0, averaging over A bs , A gn , A gs and A gn are equivalent in Eq. ( 34) when β → ∞ for a one-layer GCN with P (A) = A. † † C. Computation of E. We recall Eq. ( 28) and denote Q We can compute the determinant term and the exponential term in Eq. ( 45) separately when N → ∞.Denoting by λ f (X ⊺ X) the f -th largest eigenvalue of X ⊺ X, we have For the same reasons as in Eq. ( 43), the two logarithmic terms in the last line of Eq. ( 46) can be ignored since they do not grow with β.
We denote r def = 2τ rβ/ 2 q + p .The first term in the RHS of Eq. ( 46) is then which is obtained by integrating over the Marchenko-Pastur distribution.† † In general it does not hold that G(A gn ) = G(A bn ) nor that G(A gs ) = G(A bs ).The equivalence stems from the fact that σ * = Xw * are symmetrically distributed, but for any fixed σ * with term in Eq. ( 49) in the n → 0 limit as [51] The exponential term in Eq. ( 49) can be computed similarly, with noticing that Y 0 and Y 1 are rotationally invariant since µ = 0, Both Eq. ( 51) and Eq. ( 52) involve the same quantity, which can be computed by random matrix free convolution (78).The Green's function (also called the Cauchy function in the mathematical literature) is defined via the Stieltjes transform which in turn yields the spectrum transform ρ We get the multiplicative free convolution as [55] After computing S Y 0 (Y 1 ) −1 and S Y 1 (Y 0 ) −1 , we obtain the expression for U in Eq. (53).For example, when τ = 0.8 and γ = 5, the eigenvalue distributions of Y 0 , Y 1 asymptotically follow the Marchenko-Pastur distribution as We then get Once C and E are computed, we have all the ingredients of the saddle point equation Eq. ( 34) with 12 variables, m 0 , m 1 , m 0 , m 1 , p 0 , p 1 , p 0 , p 1 , q 0 , q 1 , q 0 and m 1 .
A critical point of this saddle point equation gives the explicit formulas for the risks in Section A.
The accurate matching between the numerical and theoretical results in Figure 10 also supports Conjecture 1.

D. A signal processing interpretation of self-loops
We now show a simple interpretation of negative self-loops based on a graph signal processing intuition (80,81).In homophilic graphs the labels change slowly on the graph: they are a low-pass signal (12,81) with most of their energy concentrated on the eigenvectors of the graph Laplacian which correspond to small eigenvalues or small "frequencies".Equivalently, they correspond to large eigenvalues of the adjacency matrix since L = diag(A1) − A. ‡ ‡ On heterophilic graphs the labels usually change across an edge, which corresponds to a high-frequency signal concentrated on the small-eigenvalue eigenvectors of the adjacency matrix.A graph Fourier transform can be defined via the Laplacian but also via the adjacency matrix (81).The matrix product Ax = h is a delay-like filter, diagonal in the graph Fourier domain with basis functions which are the eigenvectors u 1 , . . ., u N of A. We have ( Ax) i = ⟨x, u i ⟩ = λ i x i = λ i ⟨x, u i ⟩, where λ i is the i-th smallest eigenvalue of A. Figure 11 illustrates the spectra of homophilic and heterophilic labels and graphs.A homophilic graph § § has a low-pass spectrum while a heterophilic graph has a high-pass spectrum.A self-loop shifts the spectrum of A so that it becomes either a lowpass filter for positive c or a highpass filter for negative c.As a result, the corresponding GCNs better suppress noise and enhance signal for the corresponding graph types.In particular, assuming that the label-related signal in x lives between eigenvalues λa and λ b (say, negative, so we are in a heterophilic situation), we can quantify the distortion induced by the filter A + cI as (λa + c)/(λ b + c) which is close to 1 for large |c|.

E. Double descent in various GNNs
In Figure 12 we experiment with node classification on the citeeer dataset and some popular GNN architectures: the graph attention network (41), GraphSAGE (9), and Chebyshev graph network (7).The architectures of these GNNs incorporate various strategies to mitigate overfitting.As a result there is no clear double descent in the test accuracy curves, but we still observe non-monotonicity in the test risk.

F. Experimental Details
In this section we provide more details for the experiments in the main text.
In the real-world data experiments, all GCNs in Figure 1 are trained by the ADAM optimizer with learning rate 10 −2 and weight decay 10 −5 .We run ADAM for 10 4 iterations and select the model with minimal training loss.In each trial, the training and test nodes are selected uniformly randomly.We sample training nodes separately for each label to avoid the pathology where a label has few or zero samples, which can happen at extremely low training ratios.We average 10 different trials for each point; the error bars show their standard deviation.The standard deviation in the figures is mainly due to the train-test splits in the different trials; The fluctuations due to random initialization and stochastic optimization training are comparatively small.We do not normalize features and reprocess the data.All results in this paper are fully reproducible; code available at https://github.com/DaDaCheng/SMGCN.For the CSBM experiments in Fig. 3,5 and 6, we calculate w * by Eq. ( 11) and then compute Eq. ( 5) and Eq. ( 5).In Fig. 3 and 5, we use symmetric binary adjacency matrix set A bs ; In Fig. 6 we use non-symmetric binary adjacency matrix A bn as defined in Conjecture 1.The theoretical results in Fig. 3

Fig. 1 .
Fig.1.Double descent generalization for different GNNs, different losses, with and without explicit regularization, on datasets with varying levels of noise.We plot both the test error (red) and the test accuracy (black) against different training label ratios τ on the abscissa on a logarithmic scale.First column: one linear layer trained by MSE loss; second column: a two-layer GCN with ReLU activations and MSE loss; third column: a two-layer GCN with ReLU activation function, dropout and MSE loss; fourth column: a two-layer GCN with ReLU activations, dropout and cross-entropy loss; Each experimental data point is averaged over 10 random traintest splits; the shadow area represents the standard deviation.The right ordinate axis shows classification accuracy; we suppress the left-axis ticks due to different numerical ranges.We observe that double descent is ubiquitous across datasets and architectures when varying the ratio of training labels: there often exists a regime where more labels impair generalization.

Fig. 3 .
Fig. 3. Theoretical results computed by the replica method (solid line) versus experimental results (solid circles) on CSBM, with P (A) = A, for varying training label ratios τ .A: training and test risks with λ = µ = 1, γ = 5 and r = 0. (For τ < 0.2, we use the pseudoinverse in Eq. (11) in numerics and r = 10 −5 for the theoretical curves).We further study the impact of varying λ in B and r in C. We set r = 0.02, γ = 2, µ = 1 in B and λ = 3, µ = 1, γ = 2 in C. In all experiments we set N = 5000 and d = 30.We work with the symmetric binary adjacency matrix ensemble A bs .Each experimental data point is averaged over 10 independent trials; their standard deviation is shown by

Fig. 4 .
Fig. 4. Test risk as a function of relative model complexity α = γ −1 : different levels of homophily lead to distinct types of double descent in CSBM.Plots from left to right

Fig. 5 .
Fig. 5. Four typical generalization curves in CSBM model.The solid lines represent theoretical results of test risk (black) and accuracy (red) computed via Eq.(17).We also plot the mean and variance of test output hi(w * ) where i ∈ Vtest.This illustrates how the tradeoff of Mean-Variance leads to different double descent curves.Note we only display results for nodes with label y i = 1; the result for the y i = −1 class simply has opposite mean and identical variance.A: monotonic ACC (increasing) and Rtest (decreasing) when regularization r is large; B: A typical double descent with small regularization r; C slight double descent with relative model complexity α close to 1; D (near-monotonically) decreasing ACC and increasing Rtest with large relative model complexity α = 1/γ.The parameters are chosen as A: µ = 1, λ = 2, γ = 5, r = 2; B: µ = 1, λ = 2, γ = 5, r = 0.1; C: µ = 1, λ = 2, γ = 1.2, r = 0.05; D: µ = 5, λ = 1, γ = 0.1, r = 0.005.The solid circles and vertical bars represent the mean and standard deviation of risk and accuracy from experiment results.Each experimental data point is averaged over 10 independent trials; the standard deviation is indicated by vertical bars.We use N = 5000 and d = 30 for A, B and C, and N = 500 and d = 20 for D. In all case we use the symmetric binary adjacency matrix ensemble A bs .

Fig. 6 .
Fig. 6.Train and test risks on CSBM for different intensities of self loops.A: train and test risk for τ = 0.8 and λ = −1 (heterophilic).B: test risks for γ = 0.8, τ = 0.8, µ = 0 under different λ.C: training risk for different µ when τ = λ = 1.Each data point is averaged over 10 independent trials with N = 5000, r = 0, and d = 30.We use the non-symmetric binary adjacency matrix ensemble A bn .The solid lines are the theoretical results predicted by the replica method.In B we see that the optimal generalization performance requires adapting the self-loop intensity c to the degree of homophily.

Fig. 7 .
Fig. 7. Test accuracy (black) and test error (red) in node classification with GCNs on real heterophilic graphs with different self-loop intensities.We implement a two-layer ReLU GCN with 128 hidden neurons and an additional self-loop with strength c.Each setting is averaged over different training-test splits taken from (11) (60% training, 20% validation, 20% test).The relatively large standard deviation (vertical bars) is mainly due to the randomness of the splits.The randomness from model initialization and training is comparatively small.The optimal test accuracy for these two datasets is obtained for self-loop intensity −0.5 < c * < −1.

Fig. 8 .
Fig. 8. Training label ratio when a one-layer GCN matches the performance of unsupervised belief propagation at µ = λ = γ = 1.The black solid line denotes the information-theoretic detection threshold in the unsupervised setting where no label information is available ( i.e., when we use onlyA, X).If given a small number of labels, a simple, generally sub-optimal estimator matches the performance of the optimal unsupervised estimator.

Fig. 9 .
Fig. 9. Numerical validation of Conjecture 1.In A & D: we show training and test risks with different numbers of nodes for P (A) = A. The parameters are set to γ = λ = µ = 2, r = 0.01, τ = 0.8 and d = √ N /2.In B & E, we show the absolute difference of the risks between binary and Gaussian adjacency as a function of N , using the same data in A & D. The solid lines correspond to a linear fit in the logarithmic scale, which shows that the error scales as |∆| ∼ N −0.5 .In C & F we show the training and test risks when P (A) = A 2 under different average node degrees d.Other parameters are set to λ = µ = 1, γ = 2, N = 2000, τ = 0.8 and r = 0.01.In these settings, the conjecture empirically holds up to scrutiny.

5 Fig. 10 .
Fig.10.Theoretical results (solid line) vs. experimental results (solid circles) for varying homophily of graphs (λ).We compare the one-hop case (P (A) = A) and two-hops case (P (A) = A 2 ) for non-symmetric CSBM with µ = 0, τ = 1, N = 2000 and d = 30.We use non-symmetric binary adjacency matrix A bn .Each experimental data point is averaged over 10 independent trials and the standard deviation is indicated by vertical lines.
, 4,5 and 6 are obtained by computing the extreme values in 36.‡ ‡ If node degrees are all the same the eigenvectors of the adjacency matrix and the Laplacian coincide.§ § More precisely, a homophilic graph-label pair.

Fig. 12 .
Fig. 12. Test error and classification accuracy at different training ratios for Chebyshev GNN (ChebNet), Graph Attention Network (GAT), the Graph Sample and Aggregate Network (SAGE), Topology Adaptive Graph Convolutional Networks (TAGCN), on the Citeeer dataset.All models have two layers with ReLU activations, and are trained by ADAM with the cross-entropy loss.

Table 1 . Comparison of test accuracy when negative self-loop is absent (first and third column) or present (second and fourth column).
5results in the highest accuracy on both datasets.It is notable that the best performance in the two-layer ReLU GCN with c = −0.5 (76.29%) is already close to state-of-the-art results by the Feature Selection Graph Neural Network (FSGNN) (40) (78.27%).FSGNN uses a graph filter bank B = {A k , (A + I) k } with careful normalization.Taking a cue from the above findings, we show that a simple addition of negative self-loop filters (A−0.5I)k to FSGNN yields the new state of the art (78.96%); see also Table1.