Representations and generalization in artificial and brain neural networks

Humans and animals excel at generalizing from limited data, a capability yet to be fully replicated in artificial intelligence. This perspective investigates generalization in biological and artificial deep neural networks (DNNs), in both in-distribution and out-of-distribution contexts. We introduce two hypotheses: First, the geometric properties of the neural manifolds associated with discrete cognitive entities, such as objects, words, and concepts, are powerful order parameters. They link the neural substrate to the generalization capabilities and provide a unified methodology bridging gaps between neuroscience, machine learning, and cognitive science. We review recent progress in studying the geometry of neural manifolds, particularly in visual object recognition, and discuss theories connecting manifold dimension and radius to generalization capacity. Second, we suggest that the theory of learning in wide DNNs, especially in the thermodynamic limit, provides mechanistic insights into the learning processes generating desired neural representational geometries and generalization. This includes the role of weight norm regularization, network architecture, and hyper-parameters. We will explore recent advances in this theory and ongoing challenges. We also discuss the dynamics of learning and its relevance to the issue of representational drift in the brain.

Humans and animals exhibit a remarkable ability to generalize from limited experiences to novel situations. This trait is likely related to the ability of the neuronal system to extract, from a stream of complex, noisy, high-dimensional input signals, the features that are relevant for downstream computation, a property known as "feature learning". Understanding generalization and feature learning in biological neural networks can lead to significant breakthroughs in both neuroscience and artificial intelligence (AI) (1).
In recent years, AI has undergone a transformative advancement in capability, primarily fueled by developments in DNNs (2). These computational models have achieved unparalleled success across diverse domains, ranging from image recognition and natural language processing to structural biology and medicine. Broadly, their exceptional performance is rooted in their ability to generalize from training data to unseen inputs. Although the generalization power of DNNs falls short of that of human brains, understanding the mechanisms behind generalization in DNNs could provide insights into the principles governing generalization in neural networks within the brain. The study of learning in DNNs thus offers an important opportunity to understand the process of feature learning in complex learning systems.
In DNNs, data representation undergoes iterative refinement through multiple layers, capturing increasingly abstract features, yielding a top layer ("the feature layer") whose representations serve as substrates for a broad spectrum of downstream computations (3).
This Perspective explores the impact of neural representations on the generalization capabilities of artificial and brain neural networks. One facet of generalization is predicting the correct response on a learned task for novel "test inputs" sampled from the same distribution as the training examples (in-distribution generalization). A more challenging capability is to rapidly learn new tasks. The example we will explore is few-shot learning, where the trained network is capable of learning new tasks using few examples (4). We will elucidate the role of the learned representations in each of these capabilities.
For decades, neuroscientists have studied single neurons' receptive fields and tuning curves across various sensory arrays (e.g., in the retina, the cochlea, and the olfactory receptor neurons) as well as in sensory and motor cortices, e.g., the primary visual area (V1), primary somatosensory area (S1), and primary motor area (M1). The difficulties in extending this program beyond primary areas suggest that understanding neural representations in higher stages of processing requires population-level theoretical and experimental approaches. An increasingly powerful line of research focuses on the topological and geometrical properties of an ensemble of population responses, known as neural manifolds (5)(6)(7)(8)(9)(10)(11). In the first part of this paper, we will demonstrate successful applications of geometric approaches in predicting key aspects of generalization in the context of object recognition tasks. In the second part, we will explore our current understanding of the relation between generalization and the emergence of these representations in DNNs.

Neural Representational Geometry underlying Object Recognition
Geometry and Separability of Object Manifolds. The set of neural population responses to stimuli belonging to the same object defines an object manifold. Intuitively, to perform well on object identity tasks, object manifolds at the top stages of the visual hierarchy (IT cortex in the ventral visual stream) should be well separated from each other (12). How do we quantify the degree to which object manifolds satisfy this property? A simple approach is to consider random binary classification tasks and check whether the tasks can be performed by a linear classifier downstream of a given neuronal layer. The utility of the object manifolds can be quantified by the maximum number of objects that can be classified with high probability by a separating hyperplane in the population state space (Fig. 1A). The classical theory of linear classification, largely limited to finite, weakly correlated input vectors, is inapplicable to the problem of classifying manifold data. To close this gap, we have developed a statistical mechanics theory of linear separability of manifolds (13). Our theory identifies three key metrics as the primary determinants of manifold separability: manifold dimension D_M, radius R_M, and inter-manifold correlation. We consider a layer consisting of N neurons responding to numerous images belonging to P objects, forming P object manifolds; the system's load is defined by the ratio α = P/N.
We ask whether these manifolds can be separated into two randomly labeled classes by a hyperplane. In the regime where P and N are large, our theory shows the existence of a critical load value α_c, called the manifold classification capacity, such that when P < α_c N the object manifolds are linearly separable with high probability, whereas if P > α_c N the manifolds are inseparable with high probability. Intuitively, this capacity serves as a measure of the amount of linearly decodable information per neuron about object identity. Thus, the theory predicts that the total number of objects that can be well represented is extensive, proportional to the total number of neurons participating in the representation.
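A toy numerical version of this separability experiment can be sketched as follows: generate P synthetic point-cloud manifolds in N dimensions, assign random binary labels, and test linear separability with a perceptron. All sizes and the manifold generator are illustrative choices, not the mean-field estimator of ref. 13.

```python
import numpy as np

def random_manifolds(P, N, D, R, n_points, rng):
    """P point-cloud manifolds: random centers plus variability of total
    radius R confined to D random orthonormal directions."""
    centers = rng.standard_normal((P, N)) / np.sqrt(N)
    clouds = []
    for mu in range(P):
        axes = np.linalg.qr(rng.standard_normal((N, D)))[0]  # D-frame
        coeffs = rng.standard_normal((n_points, D)) * (R / np.sqrt(D))
        clouds.append(centers[mu] + coeffs @ axes.T)
    return clouds

def separable(clouds, labels, n_iter=1000):
    """Perceptron check: every point of manifold mu must satisfy
    labels[mu] * (w . x) > 0 for a single weight vector w."""
    X = np.vstack(clouds)
    y = np.repeat(labels, [len(c) for c in clouds])
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        bad = np.where(y * (X @ w) <= 0)[0]
        if bad.size == 0:
            return True
        w += y[bad[0]] * X[bad[0]]
    return False

rng = np.random.default_rng(0)
N = 120
frac = []
for P in (20, 120, 300):  # increasing load alpha = P / N
    ok = 0
    for _ in range(3):
        clouds = random_manifolds(P, N, D=4, R=0.3, n_points=10, rng=rng)
        labels = rng.choice([-1.0, 1.0], size=P)
        ok += separable(clouds, labels)
    frac.append(ok / 3)
print(frac)  # probability of separability falls as the load grows
```

Beyond the capacity the perceptron cannot find a separating plane for any choice of weights, which is why the fraction drops as P grows past α_c N.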

Next, we study how the shapes of the manifolds determine the value of α_c. As stated above, the theory predicts that the key metrics are their radius and dimensionality. The first measures the overall extent of the manifolds (relative to the distance between their centers) and the second measures the number of directions that these manifolds span.
As shown in ref. 13, the well-known concept of support vectors can be generalized to manifolds, where the weight vector normal to the separating plane is a linear combination of anchor points. Each manifold contributes (at most) a single anchor point, residing in the manifold or its convex hull. These points uniquely define the separating plane, thus anchoring it. The identity of the anchor points depends not only on the manifolds' shape but also on their location and orientation in the N-dimensional state space, as well as on the particular choice of random labeling. Thus, for a given fixed manifold, as the locations and labelings of the other manifolds are varied, the manifold's anchor point will change, thereby generating a distribution of its anchor points. The manifold's radius R_M is the total variance of its anchor points normalized by the average distance between the manifold centers. Its dimension D_M is the spread of the anchor points along the different manifold axes. The mean-field theory provides precise algorithms for estimating these quantities for any given set of manifolds (9, 13).
For manifolds spanning D ≫ 1 dimensions, the classification capacity is well approximated in terms of R_M and D_M by α_c ≈ (1 + R_M⁻²)/D_M. The theory predicts that when R_M √D_M ≫ 1 the manifolds are "entangled," yielding a capacity of O(1/D) (note that in the entangled regime D_M ≈ D ≫ 1). This theory assumes that the positions and orientations of different manifolds are uncorrelated. Object manifolds induced by real images show substantial correlations between their positions (i.e., their centers), particularly in early stages of the deep hierarchy. These correlations exhibit a prominent low-rank structure. Hence, they can be accounted for by projecting all the points in the P manifolds (at each layer) onto the null space of the center-center correlations. Recent work extends this theory to the case where not only the manifolds' centroids but also their directions of variability are correlated (14).
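A frequently quoted large-D approximation of the capacity, α_c ≈ (1 + R_M⁻²)/D_M (an approximation valid for R_M √D_M ≫ 1; full mean-field expressions are in ref. 13), makes the trade-off between radius and dimension easy to tabulate:

```python
def capacity_approx(R_M, D_M):
    """Approximate manifold classification capacity for D_M >> 1,
    valid when R_M * sqrt(D_M) >> 1 (see text)."""
    return (1.0 + R_M ** -2) / D_M

# Shrinking either the radius or the dimension raises capacity.
for R in (2.0, 1.0, 0.5):
    for D in (50, 10):
        print(f"R={R}, D={D}: alpha_c ~ {capacity_approx(R, D):.3f}")
```

For example, halving the radius at fixed D = 10 raises the approximate capacity from 0.2 (R = 1) to 0.5 (R = 0.5), consistent with the compression of radius and dimension observed along the trained hierarchy.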
We have applied this framework to study the geometry of neural representations of object manifolds in DNNs pretrained for object recognition on a large labeled dataset, ImageNet (15), including AlexNet (16), VGG (17), and ResNet (18). In each network, we measure the classification capacity and geometry of point-cloud manifolds generated by responses to high-scoring samples from ImageNet classes (15) in each layer (10). Results of this analysis (shown in Fig. 1B for ResNet) demonstrate that the manifold classification capacity increases along the hierarchy of a fully trained deep network, with a concomitant decrease in manifold dimension and radius. Across most of the stages, the reductions in dimension and radius are incremental, followed by steep changes in the last stages; this pattern is apparent in other architectures as well (10).
While the classification capacity does not directly provide information about generalization, in particular the likelihood that a system trained on a set of images would correctly classify held-out images from the same classes ("test accuracy"), we can use the notion of max margin from the theory of support vector machines (SVMs) (19) as a good proxy for the test accuracy. As in SVMs, the margin in our context is the distance of the anchor points from the separating plane, optimized to maximize this distance. Naturally, if the load is near capacity, the margin is close to zero. For a fixed load below capacity, the maximum achievable margin κ is determined by the manifold radius and dimensionality through the capacity relation α_c(κ, R_M, D_M) = α (13). The behavior of the manifold margin is shown in the bottom plot of Fig. 1B. Here the value of α is fixed close to the minimal capacity, yielding an almost-zero margin at the early stages. Overall, there is a large, sevenfold increase in margin value from the pixel layer to the feature layer.
Manifold Geometry for Few-Shot Category Learning. Another facet of generalization is the ability to transfer knowledge acquired during training to rapidly learn novel tasks. In the context of object recognition, we discuss few-shot learning: the ability to learn new object categories from just a few novel examples, building on representations established by learning a large number of other object categories. As schematized in Fig. 2 A and B, one or a few examples of two novel objects (here coatis and numbats) are presented and mapped through the layers of the ventral visual stream of a mature animal (Top), or the layers of a DNN pre-trained for object recognition tasks (Bottom), resulting in high-dimensional neural representations of each example.
The few-shot learning happens only downstream and is modeled by a single readout neuron learning a decision boundary between the two novel objects on the basis of these few examples (Fig. 2 B, Right). Several common choices of linear and nonlinear decision rules (e.g., SVMs, nearest-neighbor classifiers) all match or underperform a simple linear classifier trained with a prototype-learning rule: averaging the few examples of each class into a central "prototype," serving as an approximation of the true prototype of the new category manifold, and classifying a new input according to which of the two estimated prototypes is closer to it.
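A minimal sketch of the prototype rule, with synthetic Gaussian features standing in for the DNN feature layer (dimensions and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N, m = 512, 5                          # feature dimension, shots per class
noise = 4.0                            # within-class variability (std per unit)

mu_a = rng.standard_normal(N)          # true (unknown) class prototypes
mu_b = rng.standard_normal(N)

# m-shot training examples of each novel class
train_a = mu_a + noise * rng.standard_normal((m, N))
train_b = mu_b + noise * rng.standard_normal((m, N))

# prototype rule: average the few examples into empirical centroids
proto_a = train_a.mean(axis=0)
proto_b = train_b.mean(axis=0)

# classify held-out examples of class a by the nearer prototype
test_a = mu_a + noise * rng.standard_normal((1000, N))
correct = (np.linalg.norm(test_a - proto_a, axis=1)
           < np.linalg.norm(test_a - proto_b, axis=1))
print("few-shot accuracy on held-out examples:", correct.mean())
```

Despite averaging only m = 5 noisy examples per class, the empirical prototypes support high accuracy here because the centroid separation is large relative to the noise projected onto the decision direction.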
In experiments on pre-trained DNNs (details in SI Appendix, 1A), we find that with the simple prototype-learning rule, these pre-trained representations are powerful enough to support good few-shot learning performance (Fig. 2C). Furthermore, performance consistently improves along the layers of the pre-trained DNN (Fig. 2D). We additionally perform numerical experiments on neural representations in the visual cortices of primates (21) (details in SI Appendix, 1B). We find that these representations are also powerful enough for few-shot learning of visual objects, and that few-shot performance improves along the visual hierarchy (Fig. 2D).

To understand what features of the neural representation empower good few-shot learning performance, we introduce a mathematical theory relating few-shot learning of new objects to the geometry of their underlying manifolds. Unlike the complex manifold geometry governing object classification capacity, an ellipsoidal approximation of their geometry is sufficient to account for few-shot prototype learning. Thus, performance is well predicted by each (true) manifold's centroid x₀ and radii R_i along a set of orthonormal basis directions u_i, i = 1, …, N, capturing the extent of natural variation of examples belonging to the same object. A useful measure of the overall size of these variations is the mean squared radius R², the average of R_i² over the manifold's axes of variation. The reason this simplified geometry suffices is that, unlike the case of separating a large number of manifolds each consisting of a large (or infinite) number of points, here the separating plane is determined only by the empirical centroids and does not have access to more salient manifold statistics such as the anchor points.
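The centroid, radii, axes, and participation-ratio dimension of this ellipsoidal description can be estimated from a point cloud via the SVD; a sketch on a synthetic manifold (sizes are arbitrary):

```python
import numpy as np

def ellipsoid_geometry(points):
    """Centroid x0, radii R_i and axes u_i of a point-cloud manifold,
    plus the participation-ratio dimension D = (sum R_i^2)^2 / sum R_i^4."""
    x0 = points.mean(axis=0)
    # principal axes and variances of the centered cloud
    _, s, vt = np.linalg.svd(points - x0, full_matrices=False)
    R2 = s ** 2 / len(points)            # squared radii along each axis
    D = R2.sum() ** 2 / (R2 ** 2).sum()  # participation ratio
    return x0, np.sqrt(R2), vt, D

rng = np.random.default_rng(2)
# synthetic manifold: large variance along 3 axes, tiny along the other 47
radii = np.array([3.0, 3.0, 3.0] + [0.1] * 47)
pts = rng.standard_normal((2000, 50)) * radii
x0, R, axes, D = ellipsoid_geometry(pts)
print(round(D, 1))  # close to 3: three directions dominate the variability
```

The participation ratio counts directions weighted by their variance, so the 47 weak axes contribute almost nothing and the estimate sits near 3 rather than 50.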
Our theory predicts that the average error of m-shot learning on test examples of object a is given by ε_a = H(SNR_a), where H(x) = ∫_x^∞ e^(−t²/2) dt/√(2π) is the Gaussian tail function. The quantity SNR_a is the signal-to-noise ratio (SNR) for manifold a; a full expression and derivation are given in ref. 20. Its dominant terms are as follows.
(1) Signal. ‖Δx₀‖² represents the pairwise distance between the manifolds' centroids, x₀ᵃ and x₀ᵇ, normalized by R_a². Well-separated manifolds have a higher SNR and hence a lower generalization error.
(2) Bias. R_b² R_a⁻² − 1 represents the average bias of the linear classifier. Importantly, this bias is asymmetric: when manifold a is larger than manifold b, the bias term is negative, predicting a lower SNR for manifold a.
(3) Dimension. A natural notion of dimensionality arises in our theory, known as the participation ratio, D_a = (Σ_i R_i²)² / Σ_i R_i⁴, which quantifies the number of dimensions along which the object manifold varies significantly and is often much smaller than the number of neurons N. This is the analog of the manifold dimension D_M studied above. However, in contrast to the role of dimensionality for capacity discussed above, Eq. 3 reveals that for few-shot learning, high-dimensional manifolds are preferred.
(4) Signal-noise overlap. (Δx₀ · Uᵃ)² and (Δx₀ · Uᵇ)² quantify the overlap between the signal direction Δx₀ and the manifolds' axes of variation. Generalization error increases as the overlap between the signal and noise directions increases. We note that the signal-noise overlap is bounded above by 1/D_a, and hence is small in high dimensions.
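The counterintuitive dimension effect can be checked in simulation: holding the total variance R² fixed, spreading the variability over more dimensions lowers the prototype classifier's error. This is a qualitative sketch, not the full SNR expression of ref. 20; note that at low D the variability here concentrates on the signal direction, maximizing the signal-noise overlap.

```python
import numpy as np

def few_shot_error(D, m=1, N=500, total_var=100.0, trials=2000, seed=0):
    """Empirical m-shot prototype-learning error for two manifolds whose
    within-class variance (R^2 = total_var) is spread evenly over D axes."""
    rng = np.random.default_rng(seed)
    dx = np.zeros(N)
    dx[0] = 6.0                          # centroid separation (the signal)
    sigma = np.sqrt(total_var / D)       # per-axis radius at fixed R^2

    def noise(k):
        n = np.zeros((k, N))
        n[:, :D] = sigma * rng.standard_normal((k, D))
        return n

    a, b = dx / 2, -dx / 2
    errs = 0
    for _ in range(trials):
        proto_a = (a + noise(m)).mean(axis=0)   # empirical prototypes
        proto_b = (b + noise(m)).mean(axis=0)
        test = a + noise(1)[0]                  # held-out example of class a
        errs += np.linalg.norm(test - proto_a) > np.linalg.norm(test - proto_b)
    return errs / trials

# fixed total variance: higher-dimensional variability -> lower error
for D in (1, 10, 100):
    print(D, few_shot_error(D))
```

With all the variance squeezed into one axis the noise rides directly on the signal direction and the error is large; spreading the same variance over 100 axes dilutes the noise seen by the decision direction.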
To validate our theory, we conducted experiments on visual object manifolds from pre-trained DNNs (ResNet50) and primate IT cortex neural activity (21), finding agreement across visual categories (Fig. 3 A and B).
Comparing Geometry in DNNs and the Primate Visual Pathway.
While the SNR increases along both the primate visual hierarchy and the successive layers of pre-trained DNNs (Fig. 2D), the individual underlying geometric quantities may show different behavior. In particular, the dimension of object manifolds expands dramatically in the early layers of trained DNNs and compresses in the final layers (Fig. 3C). This dimensionality expansion and compression has been observed in other recent works and architectures (23, 24). In contrast, the dimension of object manifolds in the primate visual pathway remains low throughout V4 and IT cortex (Fig. 3C). This difference is highlighted in Fig. 3D, which shows that the eigenspectra of object manifolds in V4 are low dimensional and well described by a power law, while the eigenspectra in the corresponding layer of a trained DNN are much higher dimensional. Future work could explore the computational underpinnings of these differences.
Comparing Geometry of Vision and Language Representations.
Our finding that a downstream classifier can use empirical prototypes obtained by few-shot learning raises the question of whether information from other modalities may also be used to approximate vision prototypes in the feature layer, enabling transfer learning of new categories. Indeed, in ref. 11, we find a surprising alignment between representations in vision models pre-trained on images and word-vector embedding models pre-trained on text. Object prototypes in the visual embedding space and their corresponding language representations can be closely aligned by a rotation operation. Moreover, we show that this alignment generalizes to novel objects, so that new visual categories can be correctly discriminated purely by a language-based descriptor ("zero-shot" learning). This finding suggests that the two pre-training processes endow vision and language models with a similar fine-grained, generalizable semantic structure. This conclusion is supported by the finding that the geometry of visual representations encodes a rich hierarchical structure (11) (SI Appendix, Fig. 1). Interesting recent works have investigated the structure and origin of this hierarchy (25, 26).
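The rotation-based alignment can be sketched with an orthogonal Procrustes fit: learn a rotation on a set of training concepts, then identify held-out concepts purely from their rotated "language" vectors. The vectors below are synthetic stand-ins; in ref. 11 they are DNN visual prototypes and word embeddings.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_train, n_test = 64, 200, 50

# synthetic "language" vectors and "visual" prototypes related by an
# unknown rotation plus noise (stand-ins for real embeddings)
lang = rng.standard_normal((n_train + n_test, d))
q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # ground-truth rotation
vis = lang @ q.T + 0.1 * rng.standard_normal((n_train + n_test, d))

# orthogonal Procrustes: best rotation mapping train language -> visual
u, _, vt = np.linalg.svd(vis[:n_train].T @ lang[:n_train])
R = u @ vt                                         # optimal orthogonal map

# zero-shot test: does the rotated language vector of a held-out concept
# land nearest to that concept's visual prototype?
pred = lang[n_train:] @ R.T
dists = np.linalg.norm(pred[:, None, :] - vis[n_train:][None, :, :], axis=2)
acc = (dists.argmin(axis=1) == np.arange(n_test)).mean()
print("zero-shot identification accuracy:", acc)
```

The fitted rotation transfers to concepts never used in the fit, which is the synthetic analog of discriminating new visual categories from a language descriptor alone.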

Theory of Deep Learning
We have discussed the geometric properties of neural manifolds that are necessary and sufficient for good generalization capabilities. Understanding the learning mechanisms that give rise to these representations requires a theory of how learning in deep networks shapes neural representations. In this section, we review recent advances in the theory of fully connected, deep, wide networks (27)(28)(29)(30)(31)(32)(33)(34)(35)(36)(37)(38)(39). We will compare the predictions of these theories regarding the geometry of learned representations against our results on object manifolds, and discuss future directions for analyzing feature learning in more complex DNNs.
Sampling the Space of Solutions. Wide DNNs are examples of over-parameterized neural networks, in which the training data can be perfectly fit by many choices of weights, only a subset of which yields good generalization. One strategy for sampling solutions with good generalization performance is to bias the sampling toward solutions with small weight norms, as they tend to mitigate overfitting (40, 41). Recent DNN theories focus on two disparate implementation schemes. One approach focuses on learning by gradient descent (GD) on the training cost function, where different solutions are reached by varying the initialization. In this case, weight norms are controlled indirectly by the norms of the initial weights. The performance of GD also depends on details of the learning dynamics, including batch sizes, learning rates, and initialization (36)(42)(43)(44). The second approach focuses on Bayesian neural networks (BNNs) (45), where the effect of learning is characterized by a posterior distribution in weight space. Sampling of weights from this posterior distribution determines the statistics of the input-output function of the network. Significant analytical progress has been made in both approaches in the limit where the width of the network is large.
Predictor, Loss Function, and Generalization Error. In a fully connected DNN, for a given set of weights and an input x ∈ R^N₀, the output, called the predictor, is given by a linear summation of the activations of the last hidden layer (the "feature layer"), f(x; Θ) = N^(−ν) a⊤ Φ(W, x) ∈ R^M, where M is the dimension of the output and a denotes the linear readout weights. Different choices of ν may result in vastly different behaviors, as discussed in later sections. One choice, common in practice and in theoretical investigations and known as the "lazy regime," is ν = 1/2. This is the regime we focus on in this perspective, as opposed to the "non-lazy" regime, where ν = 1 (46). We denote the network parameters as Θ = {W, a}, with hidden-layer weights W and readout weights a. Φ(W, x) is the vector of responses of the "feature layer" to an input vector x, Φ(W, x) = φ(h^L(x)), where the pre-activations obey h^l(x) = W^l φ(h^(l−1)(x)) for l = 2, …, L, with h^1(x) = W^1 x and W^1 ∈ R^(N×N₀), where N denotes the width of all hidden layers. The function φ(·) denotes the nonlinear activation function. For simplicity, we use the squared error (SE) loss both during training and for evaluating test performance. Denoting the set of P training data points as D = {x^μ, y^μ}, μ = 1, …, P, the training loss is L(D, Θ) = Σ_μ ‖f(x^μ; Θ) − y^μ‖². GD learning starts from some random initialization Θ₀ and performs weight updates in proportion to −∇_Θt L(D, Θ_t); initial weights are chosen from an iid Gaussian distribution, Θ₀ ∼ N(0, σ₀² I). In contrast, BNNs sample from the posterior distribution P(Θ|D) ∝ exp(−β L(D, Θ)) P₀(Θ) (Eq. 6), where we choose a Gaussian prior P₀(Θ) ∝ exp(−(2σ²)^(−1) ‖Θ‖²). β = T^(−1) is the inverse temperature, controlling the relative strength of the likelihood P(D|Θ) over the prior. Hereafter we focus on the limit β → ∞, which constrains the posterior distribution to the L(D, Θ) = 0 solution space. The generalization error per input x with ground-truth label y(x) can be decomposed into bias and variance components, ε_g(x) = ‖y(x) − ⟨f(x)⟩_Θ‖² + ⟨‖f(x) − ⟨f(x)⟩_Θ‖²⟩_Θ, where ⟨·⟩_Θ denotes averaging over the posterior distribution, Eq. 6. Thus, the mean and variance of the predictor determine the generalization error.
Predictor Statistics and Kernel Functions. Using the BNN model, Eq. 6, it is straightforward to average over the readout weights a conditioned on W, yielding, in the β → ∞ limit, ⟨f(x)⟩_a = k^L(x)⊤ (K^L)^(−1) Y (Eq. 8) and ⟨δf(x) δf(x)⊤⟩_a = σ² [K^L(x, x) − k^L(x)⊤ (K^L)^(−1) k^L(x)] I_M (Eq. 9), where ⟨·⟩_a denotes partial averaging over the conditional distribution P(a|W, D), and I_M denotes an M × M identity matrix. These statistics are given in terms of the top-layer kernel function K^L (Eq. 10). For each layer l, K^l is a scalar function of a pair of inputs {x, x′} (27), K^l(x, x′) = N^(−1) φ(h^l(x)) · φ(h^l(x′)) (Eq. 10). From this function, the P × P data kernel matrix is constructed as K^l = K^l(X, X) and the P × 1 kernel vector is defined as k^l(x) = K^l(x, X). These kernel functions depend on the hidden-layer weights W; averaging over them is highly nontrivial due to the non-Gaussianity of W. How to make this average tractable in different regimes remains a challenging question, as we discuss in detail below.
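These kernel objects are easy to make concrete for a single random-feature network: compute K^L on the training data and form the kernel-regression mean predictor. This is a sketch of the β → ∞ conditional mean, with a small jitter added for numerical stability; the architecture and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
N0, N, P = 10, 2000, 40

def features(W, X):
    """Feature-layer response Phi(W, x) for a one-hidden-layer ReLU net."""
    return np.maximum(0.0, X @ W.T / np.sqrt(N0))

W = rng.standard_normal((N, N0))
X = rng.standard_normal((P, N0))
Y = np.sin(X[:, 0])                      # a scalar target

Phi = features(W, X)
K = Phi @ Phi.T / N                      # data kernel matrix K^L = K^L(X, X)

# mean predictor at a test point: k^L(x)^T (K^L)^{-1} Y
x_test = rng.standard_normal((1, N0))
k = features(W, x_test) @ Phi.T / N      # kernel vector k^L(x)
f_mean = k @ np.linalg.solve(K + 1e-8 * np.eye(P), Y)
print("mean prediction:", f_mean[0])
```

At β → ∞ the mean predictor interpolates the training data exactly (up to the jitter), which is the defining property of the zero-training-error solution space.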
Infinitely Wide DNNs. Infinitely wide networks have been the target of numerous theoretical studies (27)(28)(29)(30)(31)(32)(33). The infinite-width limit is defined by taking the dimensionality of the hidden layers, N, to infinity while keeping the size of the training dataset, P, finite. For instance, in the BNN model of Eq. 6, the first- and second-order predictor statistics in this limit are given by ⟨f(x)⟩_Θ = k_GP^L(x)⊤ (K_GP^L)^(−1) Y (Eq. 11) and ⟨δf(x) δf(x)⊤⟩_Θ = σ² [K_GP^L(x, x) − k_GP^L(x)⊤ (K_GP^L)^(−1) k_GP^L(x)] I_M (Eq. 12). Here, the W-dependent kernel functions of Eqs. 8-10 are replaced by their averages over the Gaussian prior of W, K_GP^l(x, x′) = ⟨K^l(x, x′)⟩_W (Eq. 13), and thus no longer depend on W. This kernel function is referred to as the Neural Network Gaussian Process (NNGP) kernel. Similarly, the corresponding P × P data kernel matrix and P × 1 kernel vector are constructed as K_GP^l = K_GP^l(X, X) and k_GP^l(x) = K_GP^l(x, X), respectively. The NNGP kernels can be calculated iteratively across layers, and for some choices of the nonlinearity φ(·), analytical forms of this recursion relation have been derived (27).
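For ReLU activations, one well-known closed form of the layer-to-layer NNGP recursion is the arccosine kernel of Cho & Saul; a sketch assuming unit weight variance and no biases:

```python
import numpy as np

def nngp_relu_step(Kxx, Kxy, Kyy):
    """One layer of the ReLU NNGP recursion (arccosine-kernel form),
    with unit weight variance, so diagonal entries halve each layer."""
    c = np.clip(Kxy / np.sqrt(Kxx * Kyy), -1.0, 1.0)   # cos(angle) at layer l
    theta = np.arccos(c)
    Kxy_new = np.sqrt(Kxx * Kyy) / (2 * np.pi) * (
        np.sin(theta) + (np.pi - theta) * c)
    return Kxx / 2, Kxy_new, Kyy / 2

# two orthogonal inputs: track their kernel correlation across layers
x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
Kxx, Kxy, Kyy = x @ x, x @ y, y @ y       # input-layer (l = 0) kernel
corrs = []
for _ in range(3):
    Kxx, Kxy, Kyy = nngp_relu_step(Kxx, Kxy, Kyy)
    corrs.append(Kxy / np.sqrt(Kxx * Kyy))
print(corrs)  # the correlation grows with depth
```

Even inputs that start orthogonal acquire a correlation of 1/π after one ReLU layer, and the correlation keeps growing with depth, one reason deep NNGP kernels lose input discriminability without careful scaling.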
Expressions similar to Eqs. 11 and 12 hold for GD-trained infinitely wide DNNs but with a different kernel function, the neural tangent kernel (NTK) (28,33).Quantitative differences between the NTK and the NNGP have been systematically studied for different network architectures and tasks (47).However, the connection between the two frameworks has not yet been elucidated.In the following section, we describe our recent work (39) which unifies them.
Langevin Learning Connects GD Training and BNNs. Unlike learning by GD, the BNN formulation does not specify the learning dynamics for sampling from the posterior, Eq. 6, and various efficient sampling methods have been proposed (48, 49). Here, we consider sampling by Langevin dynamics, a gradient-based stochastic dynamical process which at long times corresponds to sampling from the Gibbs equilibrium distribution (50). In our case, the dynamics of the network parameters Θ take the form dΘ_t/dt = −∇_Θ L(D, Θ_t) − (T/σ²) Θ_t + η_t (Eq. 14), where η_t is Gaussian white noise with zero mean and covariance ⟨η_t η_t′⊤⟩ = 2T δ(t − t′) I. The weight-decay term −(T/σ²) Θ_t is the gradient of an L2 weight-norm regularization, or equivalently of the exponent of the prior P₀(Θ). When T = 0, the above dynamics is deterministic and corresponds to continuous-time GD. For any finite T, the process converges to sampling from the posterior distribution, Eq. 6. Here we describe the interesting regime where T is small but nonzero (51). By analyzing the distribution of the dynamical trajectories induced by the above Langevin dynamics (Eq. 14) at small T in the infinite-width limit, we are able to characterize the dynamics that connect deterministic GD learning (short times) to sampling from the BNN posterior (long times), recovering both NTK and NNGP results on different time scales. Fig. 4 A and B offer an intuitive illustration of the dynamic process. The dynamics initially approximate the GD dynamics, as the first term on the RHS of Eq. 14 dominates. We refer to this learning stage as the gradient-driven phase. As the training error reaches approximately zero, the gradient contribution becomes of the same order as the noise η_t, and the dynamics enter what we refer to as the diffusive phase. The predictor fluctuates significantly as the dynamics explore the solution space, driven by the small noise. Although the initial gradient-driven stage depends largely on the initialization, as in GD, the dynamics become ergodic in the diffusive phase. When t scales as σ²/T (σ controls the size of the solution space and T controls the speed of exploration), the time-averaged predictor statistics become independent of initialization, as expected for BNNs.
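Both phases appear in a minimal simulation of discretized Langevin dynamics on an overparameterized linear model: the training loss first collapses (gradient-driven phase), after which the weights keep wandering through the solution space at near-constant loss (diffusive phase). Temperature, step size, and dimensions are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
P, N = 20, 200                        # overparameterized: N - P flat directions
X = rng.standard_normal((P, N)) / np.sqrt(N)
y = rng.standard_normal(P)

T, sigma2, dt = 1e-4, 1.0, 0.05       # temperature, prior variance, step size
theta = rng.standard_normal(N)
losses = []
theta_ref = None
for t in range(4000):
    grad = X.T @ (X @ theta - y)                 # gradient of the SE loss
    weight_decay = (T / sigma2) * theta          # gradient of the prior term
    noise = np.sqrt(2 * T * dt) * rng.standard_normal(N)
    theta = theta - dt * (grad + weight_decay) + noise
    losses.append(np.sum((X @ theta - y) ** 2))
    if t == 2000:
        theta_ref = theta.copy()                 # snapshot in the diffusive phase

print("loss at step 10:", losses[10], "| final loss:", losses[-1])
print("weight movement during diffusive phase:", np.linalg.norm(theta - theta_ref))
```

The loss drops by orders of magnitude and then stays near its small equilibrium value, while the weights continue to move appreciably: the noise drives a random walk within the zero-error manifold.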
Our formulation allows for investigation of how representations and generalization performance vary during the initial transient and the gradual exploration of the solution space.
Finite-Width Kernel Renormalization. Comparing the infinite-width limit predictions to real wide neural networks with N ∼ 10² to 10³ requires restricting to a relatively small number of examples P, often failing to capture the properties of realistic networks where both N and P are large. A more realistic regime is one where the number of training examples scales linearly with the network width, namely P → ∞ and N, N₀ → ∞, with α ≡ P/N and α₀ ≡ P/N₀ held fixed. We refer to this regime as the thermodynamic limit. In this section, we discuss our results in this regime (38), focusing on the BNN posterior framework.
Although the prior on the weights is Gaussian, the constraints imposed by the likelihood term on the training data cannot be ignored in the finite-α regime. We show that in this regime the effect of finite α can be expressed in terms of a kernel renormalization, which can be derived using the Back-Propagating Kernel Renormalization (BPKR) procedure. The BPKR approach allows us to integrate out the network weights in a backward direction, starting from the readout weights a and proceeding to W^L, W^(L−1), …, W^1, as shown in Fig. 5A. At each stage of the integration, we introduce a renormalization factor, which summarizes the effect of the integrated weights. After averaging over all weights, we find that the predictor statistics still follow the same form as Eqs. 11 and 12, but with the NNGP kernel function (Eq. 13) replaced by a renormalized kernel function. For a network with a single readout (M = 1) and L hidden layers, the renormalized kernel function takes the form K_R(x, x′) = (u₀/σ²) K_GP^L(x, x′). The renormalization factor u₀ is determined self-consistently by u₀ = (1 − α) σ² + α σ² Y⊤ K_R^(−1) Y / P, where K_R ∈ R^(P×P) denotes the data kernel matrix constructed as (K_R)_μν = K_R(x^μ, x^ν). Note that at α = 0, u₀ = σ², reducing to the NNGP theory. Unlike the NNGP kernel function, this renormalized kernel function also depends on the target labels Y, reflecting the effect of the training data on the posterior. Furthermore, u₀ can be related to the average norm of the readout weights a with respect to the posterior distribution, u₀ = N^(−1) ⟨‖a‖²⟩_Θ. Intuitively, the renormalization by u₀ captures how the learned partial alignment between the readout weights and the target labels affects the average readout weight norm, and in turn the predictor statistics.
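The scalar self-consistency for u₀ can be solved by fixed-point iteration. The sketch below assumes the renormalized kernel is the NNGP kernel rescaled by u₀/σ² (an assumed form, consistent with the α = 0 limit) and uses a toy full-rank data kernel:

```python
import numpy as np

rng = np.random.default_rng(6)
P, N = 100, 500
alpha, sigma2 = P / N, 1.0            # load alpha = P/N and prior variance

# toy NNGP data kernel (linear-network form, full rank) and binary labels
X = rng.standard_normal((P, 2 * P)) / np.sqrt(2 * P)
K_gp = sigma2 * (X @ X.T)
Y = np.sign(X[:, 0])

u0 = sigma2                           # start from the alpha = 0 (NNGP) value
for _ in range(100):
    K_r = (u0 / sigma2) * K_gp        # renormalized kernel (assumed form)
    u0_new = (1 - alpha) * sigma2 + alpha * sigma2 * (
        Y @ np.linalg.solve(K_r, Y)) / P
    if abs(u0_new - u0) < 1e-12:
        break
    u0 = u0_new
print("renormalization factor u0:", u0)
```

The map u₀ → (1 − α)σ² + const/u₀ is a contraction near its positive root, so a handful of iterations suffices; the label-dependent term pushes u₀ away from its prior value σ², reflecting the readout's alignment with Y.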
For networks with multiple outputs, the renormalized kernel is given by the Kronecker product of the GP kernel with an M×M dimensional renormalization matrix U 0 (SI Appendix, C3 of ref. 38).
The mean predictor is unaffected by kernel renormalization, because the renormalization factors cancel between the kernel vector and the inverse kernel matrix. However, the predictor variance is affected. Importantly, we find that for strong norm regularization (small σ), the variance decreases with N; thus the infinite-width limit performance is optimal. Conversely, for large σ, the error increases with the width, implying that weak regularization fails to prevent overfitting as the network width increases (Fig. 5B). The transition point between the two regimes depends on the depth as well as on the training data.
The above result is exact for linear networks in the thermodynamic limit. An interesting recent work (52) derived nonasymptotic expressions for the posterior predictor statistics and training-data likelihood in terms of Meijer G-functions, and agrees with our kernel renormalization results in the thermodynamic limit. For ReLU networks, the above kernel renormalization expressions are a heuristic extension of the linear case. Surprisingly, we find that this approximation agrees remarkably well with numerical simulations for ReLU networks, as illustrated in Fig. 5B. The validity of this approximation is further discussed in recent works (53)(54)(55)(56).

Feature Learning in Wide Networks.
Mean layer-wise kernels. Our BPKR framework allows for the computation of the changes in the representations in the network, which can be evaluated through the posterior average of the layer-wise kernel functions (Eq. 10). We find that even in the thermodynamic limit, the mean kernels depart from their infinite-width limit only by a correction of order 1/N. For instance, the P × P mean training-data kernel matrix is ⟨K^l⟩_Θ = K_GP^l + N^(−1) Y⊤ G_l(U₀) Y (Eq. 15), where G_l(U₀) ∈ R^(M×M) is a function of the renormalization matrix U₀. The first term is the NNGP kernel matrix, which is generally full rank; the second term is a rank-M correction that aligns with the subspace spanned by the M × P-dimensional target labels Y. The expression for the mean layer-wise kernel function on arbitrary test points is given in SI Appendix, 2A. We emphasize that the average kernel matrix ⟨K^l⟩_Θ is not equivalent to the renormalized kernel K_R(x, x′); the latter appears in the predictor statistics, which involve products and inverses of the hidden-layer kernels as well as the effect of the posterior readout weights. We note that the NNGP kernel scales as σ^(2(l+1)), while G_l(U₀) in the second term scales as σ². For σ < 1, the first term shrinks rapidly as l increases while the magnitude of the second term remains unchanged, revealing a progressively more pronounced learning-induced structure. Furthermore, in Eq. 15 the factor G_l(U₀) modifies the structure of the second term. In Fig.
6D, we show an example of our theory applied to an L = 4 ReLU network, trained simultaneously on 4-way classification of four MNIST digits and on 2-way classification of even vs. odd. We see that the learning-induced changes in the mean layer-wise kernel become more pronounced as l increases. Furthermore, the structure of the kernel matrix changes across l due to the modification of G_l(U₀); in particular, the higher-order structure (the two larger blocks corresponding to even vs. odd) becomes more pronounced at the deeper layers.
Representation and generalization. Although the learning-induced change in the mean layer-wise kernel is small, it is low rank and aligns with the network's target output; therefore, it may significantly affect the generalization performance. We investigate this effect by comparing the generalization error of a fully trained DNN, with the predictor statistics given by the BPKR theory, to that of a DNN with random features Φ(W, x) (W ∼ N(0, σ² I)) of the same width.
In Fig. 6 A-C we show an example of a ReLU network with one hidden layer trained on MNIST classification (see details in SI Appendix, 3). As expected, the learned mean hidden-layer kernel (Fig. 6B) exhibits a slightly stronger block structure compared to the NNGP kernel (Fig. 6A). However, we see a drastic improvement in the generalization performance of the trained DNNs compared to DNNs with random features across all values of α > 0 (Fig. 6C, red vs. blue lines). In particular, the generalization error diverges in the random feature model at α = 1. This is because, conditioned on any fixed random W, the norm of a diverges at α = 1. This divergence does not show up in fully trained DNNs, in which the hidden representations are partially aligned with the target outputs. Therefore, although feature learning in the thermodynamic limit is weak, it allows the network to outperform the corresponding random-feature model with the same finite hidden-layer width, and yields a performance similar to that of the corresponding infinitely wide network captured by the NNGP theory (Fig. 6C, red vs. yellow lines). This similar performance arises because α is chosen to be relatively small (α = 0.2), so the bias contribution dominates the generalization error.
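The divergence of the readout norm near α = 1 can be seen in a minimal random-features regression, a toy setup with assumed sizes rather than the paper's experiment: for a fixed random ReLU feature map, the least-squares readout a fitted to random binary labels grows as the load α = P/N approaches 1.

```python
import numpy as np

# Toy random-feature model: ReLU features Phi(W, x) with fixed random W,
# readout a fit by least squares. ||a|| grows as alpha = P/N -> 1.
rng = np.random.default_rng(1)
N, d = 400, 50                               # feature width, input dimension
W = rng.standard_normal((d, N)) / np.sqrt(d)

alphas, norms = [], []
for P in (80, 200, 320, 380):                # alpha = 0.2 ... 0.95
    X = rng.standard_normal((P, d))
    Phi = np.maximum(X @ W, 0.0)             # fixed random ReLU features
    y = rng.choice([-1.0, 1.0], size=P)
    a, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    alphas.append(P / N)
    norms.append(np.linalg.norm(a))

for alpha, n in zip(alphas, norms):
    print(alpha, n)                          # readout norm grows with alpha
```

In a trained network the hidden representations are partially aligned with the targets, which avoids this blow-up.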
From Mean Kernels to Representational Geometry. In the first part of this paper, we introduced several normative conditions on the representational geometry for obtaining good generalization performance in concept-identity tasks. Some of these measures can be readily obtained from the mean layer-wise kernels (SI Appendix, 2B). In Fig. 7, we present preliminary results for the signal ‖Δx_0‖² and dimension D_a defined in Eq. 3. They are calculated from the mean layer-wise kernel functions (Eq. 15 and SI Appendix, 2A) and the NNGP kernel functions (Eq. 13) for linear and ReLU networks (see details in SI Appendix, 2B and 3), applied to test samples from the MNIST dataset. Each manifold is determined by sample points from a unique digit. We compare these geometric measures across different hidden layers l. The signal increases monotonically with the layer depth, and the dimension is non-monotonic. These preliminary results, strikingly similar to the trends in the manifold geometry exhibited in DNNs trained for object recognition (Fig. 3C), suggest that the BPKR theory may provide a powerful theoretical tool to explain the emergence of category manifolds in deep networks. Extending these theoretical calculations to CNNs and to representations of held-out categories is a future direction.
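As a concrete recipe for such measures, the sketch below estimates the signal and a participation-ratio dimension directly from sampled activations of two class manifolds. The function name and Gaussian toy data are our own illustrative stand-ins; the participation ratio is a common estimator of manifold dimension and is used here as a proxy for the D_a entering Eq. 3.

```python
import numpy as np

# Estimate "signal" (squared centroid distance) and a participation-ratio
# dimension from sampled activations of two manifolds. Toy Gaussian data.
def manifold_geometry(H_a, H_b):
    """H_a, H_b: (n_samples, n_features) activations for two manifolds."""
    c_a, c_b = H_a.mean(axis=0), H_b.mean(axis=0)
    signal = float(np.sum((c_a - c_b) ** 2))            # ||Delta x_0||^2
    dims = []
    for H, c in ((H_a, c_a), (H_b, c_b)):
        lam = np.linalg.eigvalsh(np.cov((H - c).T))     # within-manifold spectrum
        dims.append(lam.sum() ** 2 / np.sum(lam ** 2))  # participation ratio
    return signal, float(np.mean(dims))

rng = np.random.default_rng(2)
H_a = rng.standard_normal((200, 50)) + 2.0              # two separated clouds
H_b = rng.standard_normal((200, 50)) - 2.0
signal, D = manifold_geometry(H_a, H_b)
print(signal, D)
```

Applied layer by layer to a network's activations, these two numbers trace out profiles like those in Fig. 7.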
Representational Drift. Representational drift (RD) refers to neuroscience observations of neural activities accumulating changes over time without noticeably affecting the relevant animal behavior (58-62). It has been suggested that behavioral robustness to RD is due to readout changes that compensate for the drift in representational layers, maintaining stable input-output relations (63, 64). Indeed, within our framework of learning under Langevin dynamics with small noise, the stability of the performance during the diffusion phase is due to the continuous realignment of the readout weights a_t to changes in W_t. Additionally, as shown above, the diffusion in W_t is constrained by learning. Adhering to these constraints requires an ongoing learning signal. To highlight the importance of this signal, we consider an alternative scenario in which the readout weights are frozen at some time t_0 after achieving low training error, while the weights of the hidden layers W_t drift randomly without an external learning signal. As we show in Fig. 8A, while the generalization error remains small under Langevin learning dynamics (red), the performance degrades significantly in the absence of the learning signal (blue).
We seek to understand how the results on the mean layer-wise kernel translate to constrained drift of the representation φ(h^l). Inspecting Eq. 15, we hypothesize that the dynamical trajectory of φ(h^l_t(X)) can be approximately captured by

φ(h^l_t(X)) = φ(h^l(W^0_t, X)) + N^{-1/2} z(t) Y,   [16]

where z(t) ∈ R^{N×M} is a time-dependent random matrix with mean 0 and ⟨N^{-1} z(t)ᵀ z(t)⟩ = G^l(U_0). W^0_t denotes a sample of the hidden-layer weights from N(0, σ² I) that is independent of z(t), and φ(h^l(W^0_t, X)) denotes the l-th layer hidden activation on X with weights W^0_t. The hypothesis is consistent with Eq. 15: the first term contributes the NNGP kernel, and the second term represents the representational drift within a space constrained by the task.
We have tested this hypothesis by simulating a single-hidden-layer, single-output ReLU network trained with Langevin dynamics (Eq. 14). We track the hidden-layer representations on the training data X during training. At time t, we denote the hidden-layer activation data matrix as φ(h^l_t(X)) ∈ R^{N×P}. In order to characterize the drift, we compute the unit-norm top right and left singular vectors of φ(h^l_t(X)), denoted by u(t) ∈ R^P and v(t) ∈ R^N respectively, and track their temporal correlations in the diffusive learning stage. These temporal correlations are defined as ρ_u(τ) ≡ lim_{t→∞} ⟨u(t + τ)ᵀ u(t)⟩ and ρ_v(τ) ≡ lim_{t→∞} ⟨v(t + τ)ᵀ v(t)⟩. Furthermore, to quantify how the representation is constrained by the training labels Y, we define the correlation between u(t) and Y at equilibrium as ρ_{u,Y} ≡ lim_{t→∞} ⟨u(t)ᵀ Y⟩ / ‖Y‖. As shown in Fig. 8B, we find that u(t) is constantly aligned with Y, while v(t) gradually decorrelates with time, representing the drift in the N-dimensional feature space. This pattern is consistent with the low-rank correction in Eq. 16: the diffusion of z(t) is compensated for by a continuous realignment of a_t to read out the target labels. These results provide insights regarding the pattern of representational drift in neural circuits and the robustness of the performance to the drift. They predict that injecting synaptic noise that changes the representation in a less constrained manner may degrade performance, a prediction that can be tested with perturbation experiments. Finally, our recent work (39) has shown that architectural constraints, such as the type of nonlinearity and weight sharing, may preserve significant performance in the presence of weight drift even in the absence of a gradient error signal.
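A synthetic version of these singular-vector diagnostics (our own illustration of the Eq. 16 picture, with assumed scales, not the trained network) shows the expected pattern: when the activations are a dominant rank-one part z(t)Yᵀ with a drifting feature-space direction plus noise, the top right singular vector stays locked to Y while the top left singular vector decorrelates over time.

```python
import numpy as np

# Synthetic activations H(t) = 3 z(t) Y^T + noise, with z(t) a random walk
# on the unit sphere in feature space. Scales are illustrative.
rng = np.random.default_rng(4)
N, P, T = 200, 100, 60
Y = rng.choice([-1.0, 1.0], size=P)                # single output: Y in R^P
z = rng.standard_normal(N)
z /= np.linalg.norm(z)

us, vs = [], []
for t in range(T):
    z = z + 0.3 * rng.standard_normal(N) / np.sqrt(N)
    z /= np.linalg.norm(z)                         # drifting unit direction
    H = 3.0 * np.outer(z, Y) + 0.2 * rng.standard_normal((N, P))
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    sign = np.sign(Vt[0] @ Y)                      # fix the SVD sign ambiguity
    us.append(sign * Vt[0])                        # top right singular vector u(t)
    vs.append(sign * U[:, 0])                      # top left singular vector v(t)

rho_uY = us[-1] @ Y / np.linalg.norm(Y)            # alignment of u(t) with labels
rho_v = vs[T // 2] @ vs[-1]                        # temporal correlation of v(t)
print(rho_uY, rho_v)                               # rho_uY stays high, rho_v decays
```

The label-aligned direction in sample space is stable while the feature-space direction wanders, mirroring the behavior reported in Fig. 8B.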

Fig. 1. (A) Illustration of three layers in a visual hierarchy where the population response of the first layer is mapped into the intermediate layer by F1 and into the last layer by F2 (Top) (10). The transformation of per-stimuli responses is associated with changes in the geometry of the object manifold, the collection of responses to stimuli of the same object (colored blue for a "dog" manifold and pink for a "cat" manifold). Changes in geometry may result in transforming object manifolds that are not linearly separable (in the first and intermediate layers) into separable ones in the last layer (separating hyperplane, colored orange). (B) Changes in classification capacity α_C, manifold radius R_M, manifold dimension D_M, and classification margin across the layers of pre-trained DNNs (ResNets).

Fig. 2. (A and B) Examples of novel objects, here "coatis" (blue) and "numbats" (green), are presented to the ventral visual pathway (Top), modeled by a trained DNN (Bottom), eliciting a pattern of activity across IT-like neurons in the feature layer. We model concept learning as learning a linear readout w to classify these activity patterns. (C) Generalization accuracy is very high across pairs of novel objects from the ImageNet21k dataset when using a pre-trained DNN (orange), but poor when using a randomly initialized DNN (blue) or a linear classifier in the pixel space of input images (gray). (D) Few-shot learning improves along the ventral visual hierarchy from pixels to V1 to V4 to IT, due to orchestrated transformations of object manifold geometry. The layer-wise behavior of a trained ResNet50 (blue), AlexNet (light blue), and an untrained ResNet50 (gray) is included for comparison. We align V1, V4, and IT to the most similar ResNet layer under the BrainScore metric (20) (see ref. 11 for details).

Fig. 3. (A) We compare the empirical generalization error in 1-, 2-, and 5-shot learning experiments to the prediction from our geometric theory (Eq. 3) on all pairs of objects from the ImageNet21k dataset, using object manifolds derived from a trained ResNet50. x-axis: SNR obtained by estimating neural manifold geometry. y-axis: empirical generalization error measured in few-shot learning experiments. The theoretical prediction (dashed line) shows a good match with experiments. (B) Additional examples of 5-shot prototype learning experiments in a ResNet50 (colored points), along with the prediction from our geometric theory (dashed line), on four randomly selected novel visual objects from the ImageNet21k dataset. Each panel plots the generalization error of one novel visual object (e.g., "Virginia bluebell") against all 999 other novel visual objects; each point represents the average generalization error on one such pair of objects. x-axis: SNR (Eq. 3) obtained by estimating neural manifold geometry. y-axis: empirical generalization error measured in few-shot learning experiments. (C) In a pre-trained ResNet50 (blue), dimensionality expands dramatically in the early layers and contracts in the later layers, while in the primate visual pathway (black) dimensionality contracts from the V1-like layer to V4, then expands from V4 to IT. (D) Single-manifold eigenspectra in macaque V4 (black) and the corresponding layer of a pre-trained ResNet50 (blue).

Fig. 4. (A) Two stages of learning under Langevin dynamics with small T. σ_0 controls the width of the weight distribution at initialization, σ controls the size of the solution space, and T relates to the sampling speed. (B) Example trajectories of the predictor from three different initializations; the dynamics is initially deterministic and starts to fluctuate as Θ_t drifts in the solution space after reaching zero training error.

Fig. 5. (A) Schematic of the BPKR approach. A renormalization factor is introduced at each step during backward integration until all the network weights are averaged out. (B) Theory (black solid line) and simulation (blue points) of the generalization error ε_g = ⟨ε_g(x, y(x))⟩_{x,y(x)} on binary MNIST classification in fully connected ReLU networks, for small (Top) and large (Bottom) σ. The approximate theory for ReLU networks agrees remarkably well with the numerics.

Fig. 6. (A and B) NNGP and mean layer-wise kernels in classifying eight MNIST digits (SI Appendix, 3) (57). (C) Generalization error averaged across test examples for a finite-width random-feature model (blue), an infinitely wide network following the NNGP theory (yellow), and the learned network following the BPKR theory (red, overlaying the NNGP theory). (D) The NNGP kernel and the mean layer-wise kernel of hidden layers l = 2, 4, for a 4-hidden-layer ReLU network trained on four MNIST digits grouped into two higher-order categories of even vs. odd. The values of the kernel are small since we take relatively small σ (SI Appendix, 3).

Fig. 7. (A) Signal as a function of hidden layer depth l. For ReLU networks in the thermodynamic limit, the signal increases with layer depth (blue). For linear networks (yellow, purple) and ReLU networks in the infinite-width limit (red), the signal remains unchanged across l. Error bars are across all distinct pairs of manifolds/digits. (B) Dimension as a function of l. In the infinite-width limit, the dimension remains constant with l in linear networks (purple) and increases with l in ReLU networks (red). In the thermodynamic limit, the dimension decreases in linear networks (yellow) and is non-monotonic in ReLU networks (blue), similar to Fig. 3C. Error bars are across all manifolds/digits.

Fig. 8. (A) Comparison of the generalization error dynamics between a network fully trained under Langevin dynamics (Eq. 14, shown in red) and a network with the readout weights a frozen at time t_0 in the diffusive learning stage, with W_t randomly drifting afterward (shown in blue). τ denotes the difference between the current time t and t_0. (B) Both the temporal correlation of the top right singular vector (ρ_u(τ)) and the correlation between u(t) and Y (ρ_{u,Y}) remain close to 1, representing the constant alignment between the top right singular vector of the representation and the training labels. The temporal correlation of the top left singular vector (ρ_v(τ)) gradually decreases with the time difference τ, representing the random drift in the feature space.