Optimal errors and phase transitions in high-dimensional generalized linear models

Significance High-dimensional generalized linear models are basic building blocks of current data analysis tools including multilayers neural networks. They arise in signal processing, statistical inference, machine learning, communication theory, and other fields. We establish rigorously the intrinsic information-theoretic limitations of inference and learning for a class of randomly generated instances of generalized linear models, thus closing several decades-old conjectures. Moreover, we delimit regions of parameters for which the optimal error rates are efficiently achievable with currently known algorithms. Our proof technique is able to deal with the output nonlinearity and is hence of independent interest, opening ways to establish similar results for models of neural networks where nonlinearities are essential but in general difficult to account for.

Generalized linear models (GLMs) are used in high-dimensional machine learning, statistics, communications, and signal processing.In this paper we analyze GLMs when the data matrix is random, as relevant in problems such as compressed sensing, error-correcting codes, or benchmark models in neural networks.We evaluate the mutual information (or "free entropy") from which we deduce the Bayes-optimal estimation and generalization errors.Our analysis applies to the high-dimensional limit where both the number of samples and the dimension are large and their ratio is fixed.Nonrigorous predictions for the optimal errors existed for special cases of GLMs, e.g., for the perceptron, in the field of statistical physics based on the socalled replica method.Our present paper rigorously establishes those decades-old conjectures and brings forward their algorithmic interpretation in terms of performance of the generalized approximate message-passing algorithm.Furthermore, we tightly characterize, for many learning problems, regions of parameters for which this algorithm achieves the optimal performance and locate the associated sharp phase transitions separating learnable and nonlearnable regions.We believe that this random version of GLMs can serve as a challenging benchmark for multipurpose algorithms.
high-dimensional inference | generalized linear model | Bayesian inference | perceptron | approximate message-passing algorithm A s datasets grow larger and more complex, modern data anal- ysis requires solving high-dimensional estimation problems with very many parameters.Developing algorithms for this task and understanding their limitations have become a major challenge in computer science, machine learning, statistics, signal processing, communications, and related fields.
In the present contribution, we address this challenge in the case of generalized linear estimation models (GLMs) (1,2) where data are generated as follows: Given an n-dimensional vector X * , hidden to statisticians, they observe instead an m-dimensional vector Y where each component reads where Φ is an m × n "measurement" or "data" matrix, and the random variables (Aµ) iid ∼ PA account for noise/randomness of the model.The model is "linear" because the output Yµ depends on a linear combination of the data zµ The GLM generalizes the ordinary linear regression by allowing the output function ϕ(z , A) to be nonlinear and/or stochastic; in the case of a deterministic model we simply write ϕ(z ).Explicit examples are given below.
GLMs belong to the realm of supervised learning and arise in a wide variety of scientific fields.In signal processing one usually observes Yµ given as a linear combination of the sig-nal elements X * .In a range of applications these observations are obtained via a nonlinear function ϕ.In optics or X-ray crystallography one often measures only the amplitude of [ΦX * ]µ, leading to the phase retrieval problem (3).A real-valued analog is the problem of sign retrieval when we observe only |[ΦX * ]µ| (4,5).Observations are sometimes quantized to reduce the storage, leading for instance to the problem of 1-bit compressed sensing (6).In statistics and machine learning, classification is often described via a GLM where the output function ϕ is discrete and corresponds to the labels that classify the data points Φµ (1, 2, 7).GLMs with nonlinear output functions are also the basic building blocks of each layer of neural networks (8): ϕ corresponds to the activation, the rows of the matrix Φ are different data samples, and X * is the set of synaptic weights to be learned.
There are two main learning problems in GLMs: (i) The estimation task requires, knowing the measured vector Y and the matrix Φ, inference of the unknown vector X * ; (ii) the prediction or generalization task instead requires, again knowing Y and Φ, accurate prediction of new values Ynew when new rows (i.e., data points) are added to the matrix Φ.

Significance
High-dimensional generalized linear models are basic building blocks of current data analysis tools including multilayers neural networks.They arise in signal processing, statistical inference, machine learning, communication theory, and other fields.We establish rigorously the intrinsic informationtheoretic limitations of inference and learning for a class of randomly generated instances of generalized linear models, thus closing several decades-old conjectures.Moreover, we delimit regions of parameters for which the optimal error rates are efficiently achievable with currently known algorithms.Our proof technique is able to deal with the output nonlinearity and is hence of independent interest, opening ways to establish similar results for models of neural networks where nonlinearities are essential but in general difficult to account for.
In the present paper we build a rigorous theory for both of these tasks for random instances of the GLM.In this setting each element Φµi of the matrix is sampled independently from a probability distribution of zero mean and unit variance, and the unknown vector X * has been also created randomly from a probability distribution P0, with each of its components X * 1 , . . ., X * n iid ∼ P0.Since our main aim is to study the intrinsic information-theoretic and algorithmic limitations caused by the lack of samples and/or the amplitude of the noise, we assume throughout this paper that P0 and ϕ are known to the statistician (if they are not, the task can only be harder).Our results are derived in the challenging and interesting high-dimensional limit where m, n → ∞ and m/n → α a constant.Random instances of GLMs are both practically and theoretically relevant in many different contexts: i) In signal processing, GLM estimation with a random matrix Φ has been studied with considerable attention in the context of compressed sensing (9)(10)(11), where an n-dimensional sparse signal is recovered from m < n noisy measurements.
While standard compressed sensing focused on the linear case-where ϕ(z , A) = z + A with a Gaussian noise A-the generalized case was also widely studied (12,13), especially for quantized output ( 14) and 1-bit compressed sensing (6,15) where ϕ(z , A) = sign(z + A), as well as for compressive phase retrieval when ϕ(z , A) = |z + A| (16).ii) In statistical learning, a substantial amount of activity is dedicated to understanding the limitation of learning with data generated by GLMs, both in the linear case, e.g., in the context of ridge regression or least absolute shrinkage and selection operator (LASSO) (17), or with nonlinear probabilistic output, e.g., logistic regression.Random instances were studied in particular in the context of so-called M estimators (18)(19)(20)(21).iii) In studies of artificial neural networks there has been a large amount of work using random instances of GLMs, with ϕ playing the role of a nonlinear activation function.In this context the random GLM was introduced as the teacherstudent setting for the perceptron in the pioneering work of Gardner and Derrida (22).A large volume of work followed and is reviewed, e.g., in refs.23-25.While initial works concentrated on a simple activation function ϕ(z ) = sign(z − K ) (K is the threshold constant), many other functions were considered, e.g., in refs.26-28.Recently, the study of random instances of neural networks has emerged as a key ingredient in understanding the performance of deep-learning algorithms (29,30).Computing mutual information in GLMs is also a critical issue in confirming the information bottleneck scenario of refs.31 and 32.iv) In communications, error-correcting codes that use random constructions are particularly efficient, as discussed by Shannon in his seminal paper (33).Random instances of GLMs describe both the setting of code-division multiple access-a multiuser access method used in communication technologies (34,35)-and an error correction scheme called sparse superposition codes, which have been shown to achieve the Shannon capacity for any type of noisy channel (36)(37)(38)(39)(40).
Interestingly there is an important gap in the above volume of work.On the one hand there are studies that rely on the algorithmic performance of the so-called generalized approximate message-passing (GAMP) algorithm (11,12,41).GAMP is remarkable in that its asymptotic (n, m → ∞, m/n → α) performance can be analyzed rigorously using the so-called state evolution (42)(43)(44)(45).However, GAMP is not expected to be always information-theoretically optimal.On the other hand, other results are concerned with the linear case of the GLM with additive Gaussian noise for which the information-theoretically optimal performance was established in refs.46-48 (the methodology of these works unfortunately does not generalize straightforwardly to the important nonlinear case or to other types of additive noise).All of the other works, which provide information-theoretic results for the nonlinear case, are based on powerful and sophisticated but nonrigorous techniques originating in statistical physics of disordered systems, such as the cavity and replica methods (49).Historically, the first of these nonrigorous, yet correct, results on information-theoretic limitations of learning was for the perceptron with binary weights and was established using the replica method in refs.22, 50, and 51, including a discontinuous phase transition to perfect learning that appears as the ratio between the number of samples and the dimension exceeds α ≈ 1.249.
In the present paper we close the above gap between mathematically rigorous work and conjectures (some of them several decades old) from statistical mechanics.In particular, we prove that the results for GLMs stemming from the replica method are indeed correct and imply the optimal value of both the estimation and generalization error.These results are summarized in Main Results.The proof is based on the adaptive interpolation method recently developed in ref. 52 and is of independent interest as it is applicable to a range of other models.We present it in Methods and Proofs and in SI Appendix.We compare our information-theoretic results to the performance of the GAMP algorithm and its state evolution (reviewed briefly in Main Results).We determine regions of parameters where this algorithm is or is not information-theoretically optimal.Up to technical assumptions (specified below), our results apply to all activation functions ϕ and priors P0, thus unifying a large volume of previous work where many particular functions have been analyzed on a case-by-case basis.This generality allows us to provide a unifying understanding of the types of phase transitions and phase diagrams that we can encounter in GLMs, which is as well of independent interest and we devote Application to Learning and Inference to its presentation.

Main Results
This section summarizes our main results.Their formal statement and all technical assumptions and full proofs are provided in Methods and Proofs and in SI Appendix.
For the random GLM problem as defined in the Introduction, the optimal way to estimate the ground-truth signal/weights X * relies on its posterior probability distribution where we used the prior P0 of X * and introduced the likelihood Pout that an output Yµ is observed given is the probability density function of ϕ(z , A) [where again the random variable (r.v.)A ∼ PA accounts for noise].We are concerned with the so-called Bayes-optimal setting where the prior P0 and the likelihood Pout that appear in the posterior 2 were also used to generate the ground-truth signal X * and the labels Y, with a known random matrix Φ.
A first quantity of interest is the free entropy (which is the free energy up to a sign) defined as fn (Y, Φ) ≡ 1 n ln Z(Y, Φ).The expectation of the free entropy is equal to minus the conditional entropy density of the observation − 1 n H (Y|Φ), as well as (up to an additive constant) to the mutual information density between the signal and the observations 1  n I (X * ; Y|Φ).
The Free Entropy.Our first result is the rigorous determination of the free entropy, in the high-dimensional asymptotic regime n, m → ∞, m/n → α.For a random matrix Φ with independent entries of zero mean and unit variance, for output Y that was generated using [1], and under appropriate technical assumptions stated precisely in Methods and Proofs, the free entropy converges in probability to fRS(q, r ; ρ), where ρ ≡ EP 0 [(X * ) 2 ] and where the potential fRS(q, r ; ρ) is where Dw = dw exp(−w 2 /2)/ √ 2π is a standard Gaussian measure and the scalar r.v. are independently sampled from Only the special linear case with Gaussian Pout is known rigorously so far (46)(47)(48).Convergence of the averaged free entropy is precisely stated in Theorem 1; the one in probability follows from concentration results in SI Appendix.
One can check by explicit comparison that for specific choices of P0 and Pout the expression 4 is the replica-symmetric free entropy derived in numerous statistical physics papers (thus the RS in fRS) and in particular in refs.22, 41, 50, and 51 for ϕ(z ) = sign(z ).The formula for general P0 and Pout was conjectured based on the statistical physics derivation in ref. 13.Establishing [3] closes these old conjectures and yields an important step toward vindication of the cavity and replica methods for inference, along with, e.g., refs.43 and 53.We now discuss the main consequences of this formula.
Overlap and Optimal Estimation Error.Our second result concerns the overlap between a sample x from the posterior 2 and the ground truth.We obtain that as n, m → ∞, n/m → α, whenever q * = q * (α) the maximizer in formula 3 is unique.This is the case for almost every α (SI Appendix).
It is a simple fact of Bayesian inference that, given the measurements Y and the measurement matrix Φ, the estimator X that minimizes the mean-square error with the ground-truth X * is the mean of the posterior distribution 2; i.e., X = E P(x|Y,Φ) [x].The minimum mean-square error (MMSE) that is achieved by such a "Bayes-optimal" estimator is deduced, again in the limit n → ∞, m/n → α, as follows: We refer to Theorem 2 in Methods and Proofs for rigorous statements.Again the value of the MMSE is known rigorously so far only for the linear case with Gaussian noise (46-48) (and conjectured for the nonlinear case, e.g., in ref. 13).
Optimal Generalization Error.Our third result concerns the prediction error, also called generalization error.Consider again the statistical model 1.To define the Bayes-optimal generalization error, one is given a new row of the matrix/data point, denoted Φnew ∈ R n (in addition to the data Φ and associated outputs Y used for the learning), and is asked to estimate the corresponding output value Ynew.We seek for an estimator Ŷnew = Ŷnew(Y, Φ, Φnew) that achieves Egen ≡ min Ŷnew E[(Ynew − Ŷnew) 2 ], i.e., that minimizes the MSE with the true Ynew obtained using the ground-truth weights X * .Such an estimator is again obtained from the posterior: , which leads to a worse MSE than Ŷnew.Yet it is often used in practice for deterministic models since most algorithms for generalized linear regression do not provide the full posterior distribution.
Our result states that the optimal generalization error follows from the I-MMSE theorem (54) applied to the free entropy 3 (see SI Appendix for details).The optimal generalization error reads as n → ∞, m/n → α (q * is the maximizer in [9] where V , w iid ∼ N (0, 1) and a ∼ PA.See again Theorem 2 in Methods and Proofs for the precise statement (and SI Appendix, Theorems 3 and 4).
Note that for labels Y belonging to a discrete set the MSE might not be a suitable loss and we are more often interested in maximizing the so-called overlap, i.e., the probability of obtaining the correct label.In this case the Bayes-optimal estimator is computed as the argmax of the posterior marginals, rather than as their mean; i.e., for discrete labels Ȳnew = argmax y P(y = ϕ( 1 a ∼ PA.The replica method has been used to compute the optimal generalization error for the perceptron where ϕ(x ) = sign(z ) in the pioneering works of refs.23, 50, and 55.We note that in this special case the plug-in estimator Ỹnew is actually equal to the optimal one Ȳnew.
A final note concerns the issue of overfitting.In optimizationbased approaches to learning overfitting may lead to a generalization error which is too large compared with the training error.In the Bayes-optimal setting the estimators are constructed to not overfit.This is related to general properties of Bayes-optimal inference and learning that are called "Nishimori conditions" in the physics literature (13) and that turn out to be crucial in our proofs.
Optimality of Approximate Message Passing.Although the three results stated above are of an information-theoretic nature, our fourth one concerns the performance of an algorithm for solving random instances of GLMs called GAMP (11)(12)(13), which is closely related to the Thouless-Anderson-Palmer (TAP) equations developed in statistical physics (41,56,57).
The GAMP algorithm can be summarized as follows (11-13): Given initial estimates x 0 , v 0 for the marginal posterior means and variances of the unknown signal vector X * entries, GAMP iterates the following equations, with g 0 µ = 0: (here we denote by u the average over all of the components of a vector u).The so-called thresholding function gP 0 (R, λ) is defined as the mean of the normalized distribution ∝ P0(x ) exp(−λ(R − x ) 2 /2) and the output function . The heuristic derivation of GAMP in statistical physics (13) suggests via the definition of the function gP out that ω and V are the estimates of the means and average variance of the components of the variable z = Φx.This, in turn, suggests a GAMP prediction of labels of new data points, where Comparing it with the test-set labels, this serves to compute GAMP's generalization error.
One of the strongest assets of GAMP is that its performance can be tracked via a closed-form procedure known as state evolution (SE), again in the asymptotic limit when n, m → ∞, m/n → α.For proofs of SE see refs.43 and 44 for the linear case and ref. 45 for the generalized one.In our notations, SE tracks the correlation (or "overlap") between the true weights X * and their estimate x t defined as q t ≡ lim The derivatives are with respect to (w.r.t.) the first argument.
Similarly for the evolution of GAMP's generalization error E GAMP,t gen (SI Appendix) we obtain that it is asymptotically, and with high probability, given by the right-hand side (r.h.s.) of formula 9 but with q * replaced by q t .It is a simple algebraic fact that the fixed points of the SE Eqs. 10 correspond to the critical points of the potential 4. The question of GAMP achieving asymptotically optimal MMSE or generalization error therefore reduces to the study of the extrema of the two-scalar-variables potential 4. If the SE 10 converges to the same couple (q, r ) as the extremizer (q * , r * ) of [3], then GAMP is optimal, and if it does not, then GAMP is suboptimal.In the next section we illustrate this result on several examples, delimiting regions where GAMP reaches optimality.We note that optimality of AMP-based algorithms in terms of the MMSE on the ground-truth vector X * was proved for several cases where the extremizer q * in [3] is unique, e.g., ref. 58, or in the linear case of GLM in ref. 47.Our results allow us to complete the characterization of regions of parameters where the algorithm reaches optimal performance in terms of the estimation and generalization errors.While the asymptotic value of the Bayes-optimal generalization error was predicted for some cases of Pout and P0 (55), and TAP-based algorithms were argued to reach this performance in refs.59 and 60, it was not known whether this error can be achieved provably or for what exact regions of parameters the algorithm is suboptimal.Our present work settles this question due to the state evolution of the GAMP algorithm.Interestingly, heuristic arguments based on the glassy nature of the corresponding probability measure were used to argue that direct sampling or optimization-based approaches will not be able to match this performance (51).Whether this statement is correct goes beyond the scope of the present paper.

Application to Learning and Inference
In this section, we report what our results imply for the information-theoretically optimal errors and those reached by the GAMP algorithm for several interesting cases of output functions ϕ and prior distributions P0.We do not seek to be exhaustive in any way; we simply aim to illustrate the kind of insights about the GLM that can be obtained from our results.We focus on determination of phase transitions in performance as we vary parameters of the model, e.g., the number of samples or the sparsity of the signal.We use careful numerical procedures to compute the expectations required in formula 4 and check that the reported results are stable toward the choice of various precision parameters.In this section we, however, do not seek rigor in bounding formally the corresponding numerical errors.Many of the codes used in this section are given online in a github repository (62).
General Observations About Fixed Points and Terminology.
Noninformative fixed point and its stability.It is instrumental to analyze under what conditions q * = 0 is the optimizer in [3].Our result 8 about the MMSE implies that if q * = 0, then the MMSE is as large as if we had no samples/measurements at our disposition.A necessary condition for q * = 0 is that it is a fixed point of the state evolution.In turn, a sufficient condition for the state evolution 10 to have such a fixed point is that (i) the output density Pout(y|z ) is even in the argument z and that (ii) the prior P0 has zero mean.A proof of this is given in SI Appendix.For q * = 0 to be a fixed point to which the state evolution 10 converges, it needs to be stable.We detail in SI Appendix that under properties i and ii this fixed point is stable when In what follows we denote αc the largest value of α for which the above condition holds.Consequently the error reachable by the GAMP algorithm is as bad as random guessing for both the estimation and generalization errors as long as α < αc.For α > αc, starting with infinitesimal positive q the state evolution will move toward larger q as in ref. 63.Note that condition 11 also appears in a recent work (61) as a barrier for performance of spectral algorithms.Concerning the information-theoretically optimal error, we call the phase where MMSE = ρ, i.e., q * = 0 is the extremizer of [4], the noninformative phase.Existing literature sometimes refers to such behavior as the retarded learning phase (64), in the sense that in this case a critical number of samples is required for the generalization error to be better than random guessing.Below we evaluate condition 11 explicitly for several examples.

Almost exact recovery fixed point. Another fixed point of [10]
that is worth our particular attention is the one corresponding to almost exact recovery, meaning with average error per coordinate going to 0 as n → ∞, where q * = ρ.A sufficient and necessary condition for this to be a fixed point is that limq→ρ Ψ P out (q; ρ) = +∞.This means that the integral of the Fisher information of the output channel diverges, where P out (y|ω) denotes the partial derivative w.r.t.ω.This typically means that the output channel should be noiseless.For example, for the Gaussian channel with noise variance ∆, the above expression equals 1/∆.For the probit channel where Pout(y|z ) = erfc(−yz / √ 2∆)/2 the above expression at small ∆ is proportional to 1/ √ ∆.Stability of the almost exact recovery fixed point depends nontrivially on the properties of both the output channel and the prior.Below we give several examples where almost exact recovery either is or is not possible.In what follows we call the region of parameters for which MMSE = 0, i.e., q * = ρ is the extremizer in [3], the almost exact recovery phase.Hard phase.As can be anticipated from the statement of our main algorithmic result, there are regions of parameters for which the error reached by GAMP is asymptotically equal to the optimal error and regions where it is not.We call the hard phase the region of parameters where MMSE < MSEAMP with a strict inequality.Focusing on the ratio α between the number of samples and the dimensionality, we denote αIT the ratio for which the hard phase appears and αAMP > αIT the ratio for which it disappears.In other words, the hard phase is an interval (αIT, αAMP) and is associated to a first-order phase transition in the Bayes-optimal posterior probability distribution.
It remains a formidable open question of average computational complexity whether in the setting of this paper (and for problems that are NP complete in the worst case) there exists an efficient algorithm that achieves better performance than GAMP in the hard phase.We are not aware of any and tend to conjecture that there is not.
Sensing Compressively with Nonlinear Outputs.Existing literature covers in detail the case of noiseless compressed sensing, i.e., when the output function ϕ(z ) = z .The representative sparse prior distribution is the Gauss-Bernoulli (GB) distribution P0 = ρN (0, 1) + (1 − ρ)δ0, where ρ is the average fraction of nonzeros, which are in this case standard Gaussians.The phase diagram of this case is well known (67,68).In noiseless compressed sensing with random i.i.d.matrices and GB prior, almost exact recovery of the signal is possible for α > αIT = ρ and GAMP recovers the signal for α > αAMP,CS where αAMP,CS is plotted in Fig. 1 (Left) with a dotted red line, thus delimiting the hard phase of compressed sensing.We note that the Donoho-Tanner phase transition (9) known as the performance limit of the LASSO 1 regularization is slightly higher than αAMP,CS.Signless output channel.The phase diagram of noiseless compressed sensing changes intriguingly when only the absolute value of the output is measured, i.e., when ϕ(z ) = |z | instead of ϕ(z ) = z .Such an output channel is reminiscent of the widely studied phase retrieval problem (3) where the signal is complex valued and only the amplitude is observed.The generalization of our results for the complex case would require extensions, as done for the algorithmic aspects in ref. 69.The real-valued case was studied under the name "sparse recovery from quadratic measurements" in the literature, e.g., ref. 70 and references therein, when the number of nonzero variables grows slower than linearly with the dimension n.Our results give access to the phase diagram of sparse recovery from quadratic (or equivalently signless) measurements that is presented in Fig. 1 (Left) for the GB prior.
We observe that the information-theoretical phase transition αIT is the same in the signless sparse recovery as in the canonical linear case; i.e., almost exact recovery is possible whenever α > ρ.However, the algorithmic phase transition αAMP above which GAMP is able to find the sparse signal is strikingly larger for the signless case (solid red line in Fig. 1, Left).(We note that to break the symmetry that prevents GAMP from finding the signal in a constant number of iteration steps, we mismatch infinitesimally the output function ϕ used in the algorithm from the symmetric one used to generate the data.Another way to deal with this issue is related to a spectral initialization as discussed recently in ref. 61.)We note that even for a dense signal ρ = 1 almost exact recovery is algorithmically possible only for α > αAMP(ρ = 1) ≈ 1.128.For very sparse signals, small ρ, the situation is even more striking because the measurement rate of at least α > αc = 1/2 is needed for algorithmically tractable almost exact recovery for every ρ.This is in sharp contrast with the canonical compressed sensing where αAMP,CS → 0 as ρ → 0. The nature of this algorithmic difficulty of GAMP is related to the symmetry of the output channel due to which the noninformative fixed point is stable for α < αc = 1/2.Summarizing this result in one sentence, tractable compressive sensing is impossible (for α < 1/2) if we have lost the signs.We reiterate that this result holds in the setting of the present paper, i.e., in particular when the sparsity ρ is of constant order.For signals where ρ = O(1) the situation is expected to be different (70).Rectified linear unit output channel.Another case of output channel that attracted our interest is the rectified linear unit (ReLU), ϕ(z ) = max(0, z ), as widely used in multilayer feedforward neural networks.In the present single-layer case reconstruction with the ReLU output is interesting mathematically.With GB signals, roughly half of the measurements are given without noise, but the only information we have about the other half is its sign.A straightforward upper bound for both information-theoretic and tractable almost exact recovery is simply twice as many measurements than needed in the canonical  4] for this case, we find that a recovery of the signal is information-theoretically impossible for α < α IT = ρ.Recovery becomes possible starting from α > ρ, just as in the canonical compressed sensing.Algorithmically the signless case is much harder.Evaluating [11], we conclude that GAMP is not able to perform better than a random guess as long as α < αc = 1/2, and the same is true for spectral algorithms (61).For larger values of α, the inference using GAMP leads to better results than a purely random guess.GAMP can recover the signal and generalize perfectly only for values of α larger than α AMP (solid red line).The dotted red line shows for comparison the algorithmic phase transition of the canonical compressed sensing.(Center) Analogous to Left, for the ReLU output function, ϕ(x) = max(0, x).Here it is always possible to perform better than random guessing using GAMP.The dotted red line shows the algorithmic phase transition when using information only about the nonzero observations.(Right) Phase diagram for the symmetric door output function ϕ(z) = sign(|z| − K) for a Rademacher signal, as a function of α and K.The stability line αc is depicted as a dashed blue line, the information-theoretic phase transition to almost exact recovery α IT is a solid black line, and the algorithmic one α AMP is a solid red line.
noiseless compressed sensing.It is interesting to ask whether this bound is tight.Results in the present paper imply that for the information-theoretic performance this bound indeed is tight.However, the phase transition αAMP above which almost exact recovery is possible with the GAMP algorithm is strictly lower than twice the phase transition of compressed sensing; both are depicted in Fig. 1, Center.This implies that while the negative outputs are not useful information theoretically, they do help to achieve better performance algorithmically.
Perceptron and Similar Problems.
Binary and Gauss-Bernoulli perceptron.One of the most studied problems that fits in the setting of the present paper is the problem of the perceptron (71), where ϕ(z ) = sign(z ), that has been analyzed for random patterns Φ in the statistical physics literature; see refs.23-25 for reviews.We plot in Fig. 2 the optimal generalization error 9 as follows from our results for the binary perceptron, i.e., weights taken from the Rademacher distribution P0 = 1 2 δ+1 + 1 2 δ−1 (Fig. 2, Left) and for the GB perceptron where P0 = ρN (0, 1) + (1 − ρ)δ0 (Fig. 2, Center).The information-theoretically optimal value of the generalization error that we report and prove agrees with existing predictions obtained by the nonrigorous replica method from refs.50, 51, and 55.Notably, we see that for the GB case the optimal generalization error decreases smoothly as α increases, while for the binary case the generalization error has a first-order (i.e., discontinuous) phase transition toward perfect generalization at αIT ≈ 1.249 as predicted already in ref. 50.Our results provide rigorous validation for these old conjectures.
Furthermore, our results together with recent literature on GAMP provide a refreshing clarification of the algorithmic questions.It is natural to ask for what region of parameters the optimal generalization error can be provably achieved with efficient algorithms.This question remained unanswered until now.Indeed, for the spherical perceptron the optimal generalization error was computed in ref. 55 and argued empirically in small instances to be achievable with a TAP-like algorithm (59).The state evolution of GAMP together with our formulas for the generalization error ( [9] for the average optimal one and with q t replacing q * in this formula for GAMP) imply that the optimal generalization error is indeed achievable asymptotically for all α in the GB perceptron.
For the binary perceptron the optimal generalization error was computed in refs.50 and 51.By comparing with the state evolution of GAMP we obtain that it can also be asymptot-ically achieved by GAMP, but this time only outside of the hard phase (αIT, αAMP) with αAMP ≈ 1.493.The literature was unclear on the algorithmic question; ref.50 identified the spinodal of the replica-symmetric solution to be at α ≈ 1.493, but did not attribute it to any algorithmic or physical meaning.Ref. 51 argues that metastable states exist at least up to αRSB ≈ 1.628 and speculates that Gibbs sampling-based algorithms will not be able to reach perfect generalization before that point (23).Taking our results into account, the main algorithmic question that remains open is whether efficient algorithms can reach perfect generalization for αIT < α < αAMP.Symmetric door.Out of interest we explored an example of a binary output channel for which Pout(y|z ) is even in the argument z , so that the noninformative fixed point q * = 0 exists.Specifically we analyzed the symmetric door channel with ϕ(z ) = sign(|z | − K ) and Rademacher prior P0.In literature such a perceptron is studied with the replica method in the context of lossy data compression (28).In Fig. 1, Right we report the phase diagram in terms of the stability line of the noninformative fixed point αc (below which GAMP is not better than random guesses), the information-theoretic phase transition toward perfect generalization αIT, and the phase transition of GAMP to perfect generalization αAMP.
A simple-counting lower bound states that for binary outputs and weights X * i perfect generalization is not possible for α < 1.Thus it is interesting to note that the symmetric door channel is able to saturate this lower bound for K ≈ 0.6745 for which the probability of yµ = 1 is 1/2.This saturation was already stated in ref. 28.Our results also, however, imply that in that case the perfect generalization will not be achievable with the GAMP (and we conjecture no other efficient) algorithm unless α > αAMP ≈ 1.566.The generalization error that GAMP provides for this case is depicted in Fig. 2, Right.
Empirical comparison with general-purpose algorithms.In this section we argue that many cases that fit into the setting of the present paper could serve as useful benchmarks for existing machine-learning algorithms.We believe that the situation is perhaps similar to Shannon coding theorems that have driven algorithmic developments in error-correcting codes, achieving the Shannon bound being the primary goal in many works in communications.In machine learning, classification is a natural task and algorithms are usually benchmarked using open access databases.In current state-of-the-art applications of machine learning we usually have very little insight about what is the NeuralNet Fig. 2. Optimal generalization error in three classification problems vs. the sample complexity α, the size of the training set being αn.The solid red line is the Bayes-optimal generalization error 9 while the solid green line shows the (asymptotic) performances of GAMP as predicted by the state evolution 10.For comparison, we also show the results of GAMP (black circles) and the performance of a standard out-of-the-box solver (blue squares).(Left) Perceptron, with ϕ(x) = sign(x) and a binary Rademacher signal.While a perfect generalization is information-theoretically possible starting from α IT ≈ 1.249, the state evolution predicts that GAMP will achieve such perfect prediction only above α AMP ≈ 1.493.The results of a logistic regression with fine-tuned regularizations with the software scikit-learn (65) are shown for comparison.(Center) Perceptron with Gauss-Bernoulli distribution of the weights.No phase transition is observed in this case, but a smooth decrease of the error with α.The results of a logistic regression are very close to optimal.(Right) The symmetric door activation rule with parameter K chosen to observe the same number of occurrences of the two classes.In this case there is a sharp phase transition from as bad as random to perfect generalization at α IT = 1.GAMP identifies the rule perfectly, starting only from α AMP ≈ 1.566.The noninformative fixed point is stable up to αc = 1.36 (dashed gray line).Interestingly, this nonlinear rule seems very hard to learn for standardly used solvers.Using Keras (66), a neural network with two hidden layers was able to learn only approximately the rule, only for considerably larger training set sizes and a much larger number of iterations than GAMP.
sample complexity, i.e., how many samples are truly needed so that a given generalization error can be achieved.In our setting the situation is different: We can present samples (yµ, Φµ) to generic out-of-the-box classification algorithms and see how their performances compare with the information-theoretic optimal performance and to the one of the GAMP algorithm that is fine-tuned to the problem.
In Fig. 2 we present examples of state-of-the-art classification algorithms that are compared with our results.In Fig. 2, Left and Center we compare the optimal and GAMP performances to a simple logistic regression, fine-tuned by manually optimizing the ridge penalty (for 2 regularization) and LASSO penalty (for a sparsity-enhancing 1 regularization) with the software scikit-learn (65).We observe that for the GB case the logistic regression is comparable to the performance of GAMP, whereas for binary weights perfect generalization is not achieved close to the GAMP phase transition.
In Fig. 2, Right we study classification for labels generated by the symmetric door channel.A general-purpose algorithm would not know about the form of the channel.A neural network with only two hidden units is in principle able to represent the corresponding function (each of the hidden neurons can learn one of the two planes that separate data in the symmetric door function).A more intriguing question is whether a more generic multilayer neural network is indeed able to learn this rule and how many samples it may need.In the example used in Fig. 2, using the software Keras (66) with a tensorflow backend, we show the performance of a network with two hidden layers, ReLU activation and dropout [the details for this particular run can be found in the github repository (62)].The symmetric door function thus provides a challenging benchmark that could be used to study how to improve performance of the general-purpose multilayer neural network classifiers.In SI Appendix we provide additional examples comparing the optimal performance to general-purpose algorithms for regression.

Methods and Proofs
In this section we give the main theorem for the free entropy and main ideas of the proof.An essential tool is the adaptive interpolation method recently introduced in ref. 52 which is a powerful evolution of the Guerra and Toninelli (72) interpolation method developed for spin glasses.Ref. 52 analyzed simpler inference problems.In particular, the proof for the upper bound in ref. 52 does not apply to GLMs and requires nontrivial additional ingredients.One such additional ingredient is to work with a potential f RS (q, r; ρ) depending on two parameters (q, r) instead of a single one as in ref. 52.This allows us to use convexity arguments that are crucial to finish the proof, discussed in Matching Bounds and End of Proof.We stress that the present analysis heavily relies on properties of Bayes-optimal inference that translate into remarkable identities between correlation functions (called Nishimori identities by physicists; see SI Appendix for their formulation) valid for all values of parameters.These are used in the derivation of [17] and [18] below, which are two essential steps of our proof.The formula from Theorem 1 relies on the Nishimori identities and does not hold out of the Bayes-optimal setting.Main Theorems.For the proof it is necessary to work with a slightly different model with an additive regularizing Gaussian noise with variance ∆ ≥ 0, where (Zµ) iid ∼ N (0, 1), and (Aµ) h3) The r.v.(Φ µi ) are independent with zero mean, unit variance, and finite third moment bounded with n. h4) For almost all values of a (w.r.t. the distribution P A ), the function x → ϕ(x, a) is continuous almost everywhere.h5) (∆ > 0) or (∆ = 0 and ϕ takes values in N).
In general, when ϕ is continuous, the condition ∆ > 0 (but arbitrarily small) is necessary for the existence of a finite limit of the free entropy [for particular choices of (ϕ, P A ) this might not be needed, e.g., ϕ(z, A) = z + A with A ∼ N (0, σ 2 )].We also assume that the kernel Pout is informative; i.e., there exists y such that Pout(y | •) is not equal almost everywhere to a constant.If Pout is not informative, it is not difficult to show that estimation is then impossible.
We define the set of the critical points of f RS , [4], also called "state evolution fixed points" (as is clear from [10]): Then the main theorem of this paper is stated as follows: Theorem 1 (Replica-Symmetric Free Entropy).Suppose that (h1)-( h2)-( h3)-(h4)-(h5) hold.Then, for the GLM 12, Moreover, as one can see in SI Appendix, the "sup inf" and the supremum over Γ above are achieved over the same couples.Under stronger assumptions on P 0 and Pout, one can show (Theorem 6 in SI Appendix) that fn(Y, Φ) concentrates around its mean fn and thus obtains convergence in probability 3.
An immediate corollary of Theorem 1 is the limiting expression of the mutual information I(X*; Y|Φ) ≡ E ln P(Y, X*|Φ) − E ln(P(Y|Φ)P(X*)) between the observations and the unknown vector: i RS (q, r) = inf (q,r)∈Γ i RS (q, r) , i RS (q, r) ≡ αΨ P out (ρ; ρ) − αΨ P out (q; ρ) − ψ P 0 (r) + rq/2 .Finally, we gather our main results related to the optimal errors in a single theorem (see SI Appendix for more details), including results on the optimality of the GAMP algorithm: Theorem 2 (Optimal Errors).Assume the same hypotheses as in Theorem 1. Then formula 9 for the generalization error is true as n, m → ∞, m/n → α whenever the maximizer q*(α) of [3] is unique, which is the case for almost every α.If moreover all of the moments of P 0 are finite, then formula 7 for the overlap and the matrix-MMSE formula are true, where − F is the Frobenius norm.
There are cases of GLMs (e.g., the signless output channel Y = |ΦX*|/ √ n + Z) where the sign of X* simply cannot be estimated (thus the absolute value in [7]).This is why our general theorem is related to an error metric 13 insensitive to this ± symmetry.Nevertheless formula 8 for the signal MSE is formally valid when there is no such sign symmetry.Sketch of Proof by the Adaptive Interpolation Method.We now give the main ideas behind the proof of Theorem 1.We defer to SI Appendix the details, as well as those of Corollary 1 and Theorem 2.
We note a clarification about notation.The r.v.Y (and also Φ, X*, A, and Z) are called quenched because once the measurements are acquired, they are fixed.The expectation w.r.t.all quenched r.v. is denoted by E without a subscript.In contrast, expectation of annealed variables w.r.t. a posterior distribution at fixed quenched variables is denoted by Gibbs brackets − .Two scalar inference channels.An important role in the proof is played by two simple scalar inference channels.The free entropy is expressed in terms of the free entropies of these channels.This "decoupling property" stands at the root of the replica approach in statistical physics.
The first scalar channel is an additive Gaussian channel.Suppose that we observe Y 0 = √ r X 0 + Z 0 where X 0 ∼ P 0 and Z 0 ∼ N (0, 1) are independent.Consider the inference problem consisting of retrieving X 0 from the observation Y 0 .The free entropy associated with this channel is the expectation of the logarithm of the normalization factor of the associated posterior dP(x|Y 0 ) that is given by [5] (up to a constant).
The second scalar channel that appears naturally in the problem is linked to the channel Pout through the following inference model.Suppose that V, W* iid ∼ N (0, 1) where V is known while the inference problem is to recover the unknown W* from the observation Ỹ0 ∼ Pout(• | √ q V + √ ρ − q W*) where ρ > 0 and q ∈ [0, ρ].The free entropy for this model, again given by a normalization factor of the posterior of w given Ỹ0 and V, is exactly [6].
Interpolating the estimation problem.To carry out the proof, we introduce an "interpolating estimation problem" that interpolates between the original problem Yµ ∼ Pout(•| 1 √ n [ΦX*]µ) at t = 0, with t ∈ [0, 1] being the interpolation parameter, and the two scalar problems described above at t = 1.For t ∈ (0, 1) the interpolating estimation problem is a mixture of the original and the scalar problems.This interpolation scheme is inspired by the interpolation paths used by Talagrand (73) to study the perceptron.Due to a novel ingredient specific to the adaptive interpolation method (52), it allows us to obtain in a unified manner a complete proof of the replica formula for the free entropy and in the whole phase diagram.
Let q(t) and r(t) be two interpolation functions.Moreover define where Vµ, W* µ iid ∼ N (0, 1).Consider the following observation channels, with two types of observations obtained through where (Z i ) iid ∼ N (0, 1).We assume that V = (Vµ) m µ=1 is known and that the inference problem is to recover both W* = (W* µ ) m µ=1 and X* = (X* i ) n i=1 from the "t-dependent" observations Yt = (Yt,µ) m µ=1 and Y t = (Y t,i ) n i=1 .We now understand that the integral of r(t) appearing in the second set of measurements in [14] and 1 − t as well as the two integrals appearing in the first set all play the role of signal-to-noise ratios (SNRs) in the interpolating problem, with t giving more and more "power" (or weight) to the scalar inference channels when increasing.Here is the first crucial ingredient of our interpolation scheme.In classical interpolations, these SNRs would all take a trivial form, i.e., be linear in t, but here, the nontrivial integral dependency in t of the two latter SNRs allows for much more flexibility when choosing the interpolation path.This will allow us to actually choose the "optimal interpolation path" (this will become clear below).

[15]
Now comes another crucial property of the interpolating model: It is such that at t = 0 we recover the original problem and thus fn(0) = fn − 1/2 (the constant 1/2 comes from the purely noisy measurements of the second channel in [14]), while at t = 1 we have the two scalar inference channels and thus the associated terms ψ P 0 and Ψ P out appear in fn (1).These are precisely the terms appearing in the free entropy potential 4. Entropy variation along the interpolation.From the understanding of the previous section, it is natural to evaluate the variation of entropy along the interpolation, which allows us to "compare" the original and purely scalar models due to the identity fn = fn(0) + 1 2 = fn(1) where the first equality follows from [15] (the prime means the derivative).
Then by choosing the optimal interpolation path due to the nontrivial SNR dependencies in t, we will be able to show the equality between the replica formula and the true free entropy fn.
We thus compute the t derivative of the free entropy (see SI Appendix for the details of this calculation).It is given by where On( 1) is a quantity that goes to 0 in the n, m → ∞ limit, uniformly in t, and the overlap is Q = Qn ≡ n −1 n i=1 X* i x i .We now state a crucial result in an informal way and refer to SI Appendix for precise statements.Formally, the overlap concentrates around its mean (for all t ∈ [0, 1]), a behavior called "replica-symmetric" in statistical physics.To make this statement mathematically rigorous, one has to slightly modify the interpolating model by adding a "side channel" that brings vanishingly small additional information about X* without affecting the asymptotic free entropy density.This perturbation forces the overlap to concentrate.Effectively, one can use the following formal formula (see SI Appendix, section 4.3, Lemma 2 for a precise statement): (1) . [18] Canceling the remainder.Note from [15] and [4] that the first two terms appearing in [17] are precisely the missing ones to obtain the expression of the potential on the r.h.s. of [16].Thus, we want to "cancel" the Gibbs bracket in [17].This term is called the remainder.To prove the replica formula, we have to show that this remainder vanishes, which was until now a difficult task.But due to the freedom of choice of the interpolation path allowed by the interpolating function q, we are able to do so by "adapting" the interpolation (thus the name of the method).Thus, we want to choose q(t) = E Q t ≈ Q because of [18].However, E Q t is a function of t 0 q(v)dv.The equation q(t) = E Q t ∈ [0, ρ] is therefore an order 1 differential equation over t → t 0 q(v)dv.Assume for the moment that this equation has a solution over [0,1].Once the solution q (r) n is selected, the Cauchy-Schwarz inequality applied to the remainder allows us to show that its absolute value is upper bounded by C Vart(Q) for some constant C > 0 independent of n and t.Therefore from [17] and [18], for 0 ≤ t ≤ 1 we get 2 q (r) n (t) − r(t)ρ 2 + On (1) .

[19]
This important equality is obtained due to the choice of the optimal interpolation path q (r) n (t) permitted by the method.Matching bounds and end of proof.We now possess all of the necessary tools to finish the sketch of the proof of Theorem 1.We first prove that limn→∞ fn = sup r≥0 inf q∈[0,ρ] f RS (q, r).Then in SI Appendix, we show that (i) this is also equal to sup q∈[0,ρ] inf r≥0 f RS (q, r), which gives the first equality of the theorem, and (ii) that this sup inf is attained at the supremum of the state evolution fixed points, which gives the second equality.

Fig. 1 .
Fig. 1.Phase diagrams showing boundaries of the region where almost exact recovery is possible (in absence of noise).(Left) The case of signless sparse recovery, ϕ(x) = |x| with a Gauss-Bernoulli signal, as a function of the ratio between number of samples/measurements and the dimension α = m/n, and the fraction of nonzero components ρ.Evaluating [4] for this case, we find that a recovery of the signal is information-theoretically impossible for α < α IT = ρ.Recovery becomes possible starting from α > ρ, just as in the canonical compressed sensing.Algorithmically the signless case is much harder.Evaluating [11], we conclude that GAMP is not able to perform better than a random guess as long as α < αc = 1/2, and the same is true for spectral algorithms(61).For larger values of α, the inference using GAMP leads to better results than a purely random guess.GAMP can recover the signal and generalize perfectly only for values of α larger than α AMP (solid red line).The dotted red line shows for comparison the algorithmic phase transition of the canonical compressed sensing.(Center) Analogous to Left, for the ReLU output function, ϕ(x) = max(0, x).Here it is always possible to perform better than random guessing using GAMP.The dotted red line shows the algorithmic phase transition when using information only about the nonzero observations.(Right) Phase diagram for the symmetric door output function ϕ(z) = sign(|z| − K) for a Rademacher signal, as a function of α and K.The stability line αc is depicted as a dashed blue line, the information-theoretic phase transition to almost exact recovery α IT is a solid black line, and the algorithmic one α AMP is a solid red line.