# Reconciling modern machine-learning practice and the classical bias–variance trade-off


Edited by Peter J. Bickel, University of California, Berkeley, CA, and approved July 2, 2019 (received for review February 21, 2019)


## Significance

While breakthroughs in machine learning and artificial intelligence are changing society, our fundamental understanding has lagged behind. It is traditionally believed that fitting models to the training data exactly is to be avoided as it leads to poor performance on unseen data. However, powerful modern classifiers frequently have near-perfect fit in training, a disconnect that spurred recent intensive research and controversy on whether theory provides practical insights. In this work, we show how classical theory and modern practice can be reconciled within a single unified performance curve and propose a mechanism underlying its emergence. We believe this previously unknown pattern connecting the structure and performance of learning architectures will help shape design and understanding of learning algorithms.

## Abstract

Breakthroughs in machine learning are rapidly changing science and society, yet our fundamental understanding of this technology has lagged far behind. Indeed, one of the central tenets of the field, the bias–variance trade-off, appears to be at odds with the observed behavior of methods used in modern machine-learning practice. The bias–variance trade-off implies that a model should balance underfitting and overfitting: Rich enough to express underlying structure in data and simple enough to avoid fitting spurious patterns. However, in modern practice, very rich models such as neural networks are trained to exactly fit (i.e., interpolate) the data. Classically, such models would be considered overfitted, and yet they often obtain high accuracy on test data. This apparent contradiction has raised questions about the mathematical foundations of machine learning and their relevance to practitioners. In this paper, we reconcile the classical understanding and the modern practice within a unified performance curve. This “double-descent” curve subsumes the textbook U-shaped bias–variance trade-off curve by showing how increasing model capacity beyond the point of interpolation results in improved performance. We provide evidence for the existence and ubiquity of double descent for a wide spectrum of models and datasets, and we posit a mechanism for its emergence. This connection between the performance and the structure of machine-learning models delineates the limits of classical analyses and has implications for both the theory and the practice of machine learning.

Machine learning has become key to important applications in science, technology, and commerce. The focus of machine learning is on the problem of prediction: Given a sample of training examples

The predictor

The goal of machine learning is to find

Conventional wisdom in machine learning suggests controlling the capacity of the function class ℋ based on the bias–variance trade-off by balancing underfitting and overfitting (cf. refs. 1 and 2): 1) If ℋ is too small, all predictors in ℋ may underfit the training data (i.e., have large empirical risk) and hence predict poorly on new data. 2) If ℋ is too large, the empirical risk minimizer may overfit spurious patterns in the training data, resulting in poor accuracy on new examples (small empirical risk but large true risk).
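For reference, the quantities named above can be written in a standard textbook form. The notation here (loss $\ell$, training sample $(x_i, y_i)$, data distribution $P$) is an assumption made for illustration and may differ from the symbols used in the original article.

```latex
\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(h(x_i), y_i\bigr)
\quad \text{(empirical risk)}, \qquad
R(h) = \mathbb{E}_{(x,y)\sim P}\bigl[\ell(h(x), y)\bigr]
\quad \text{(true risk)}, \qquad
h_n \in \arg\min_{h \in \mathcal{H}} \hat{R}_n(h)
\quad \text{(empirical risk minimizer)}.
```

In these terms, underfitting corresponds to a large $\hat{R}_n(h_n)$, overfitting to a small $\hat{R}_n(h_n)$ paired with a large $R(h_n)$, and interpolation to $\hat{R}_n(h_n) = 0$.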

The classical thinking is concerned with finding the “sweet spot” between underfitting and overfitting. The control of the function class capacity may be explicit, via the choice of ℋ (e.g., picking the neural network architecture), or it may be implicit, using regularization (e.g., early stopping). When a suitable balance is achieved, performance on the training data is expected to carry over to new data. This view is traditionally summarized by the classical U-shaped risk curve shown in Fig. 1*A*, which has been widely used to guide model selection and is even thought to describe aspects of human decision making (3). The textbook corollary of this curve is that “a model with zero training error is overfit to the training data and will typically generalize poorly” (ref. 2, p. 221), a view still widely accepted.

However, practitioners routinely use modern machine-learning methods, such as large neural networks and other nonlinear predictors that have very low or zero training risk. Despite the high function class capacity and near-perfect fit to training data, these predictors often give very accurate predictions on new data. Indeed, this behavior has guided a best practice in deep learning for choosing neural network architectures, specifically that the network should be large enough to permit effortless zero-loss training (called interpolation) of the training data (4). Moreover, in direct challenge to the bias–variance trade-off philosophy, recent empirical evidence indicates that neural networks and kernel machines trained to interpolate the training data obtain near-optimal test results even when the training data are corrupted with high levels of noise (5, 6).

The main finding of this work is a pattern in how performance on unseen data depends on model capacity and the mechanism underlying its emergence. This dependence, empirically witnessed with important model classes including neural networks and a range of datasets, is summarized in the “double-descent” risk curve shown in Fig. 1*B*. The curve subsumes the classical U-shaped risk curve from Fig. 1*A* by extending it beyond the point of interpolation.

When function class capacity is below the “interpolation threshold,” learned predictors exhibit the classical U-shaped curve from Fig. 1*A*. (In this paper, function class capacity is identified with the number of parameters needed to specify a function within the class.) The bottom of the U is achieved at the sweet spot which balances the fit to the training data and the susceptibility to overfitting: To the left of the sweet spot, predictors are underfitted, and immediately to the right, predictors are overfitted. When we increase the function class capacity high enough (e.g., by increasing the number of features or the size of the neural network architecture), the learned predictors achieve (near) perfect fits to the training data—i.e., interpolation. Although the learned predictors obtained at the interpolation threshold typically have high risk, we show that increasing the function class capacity beyond this point leads to decreasing risk, typically going below the risk achieved at the sweet spot in the “classical” regime.

All of the learned predictors to the right of the interpolation threshold fit the training data perfectly and have zero empirical risk. So why should some, in particular those from richer function classes, have lower test risk than others? The answer is that the capacity of the function class does not necessarily reflect how well the predictor matches the inductive bias appropriate for the problem at hand. For the learning problems we consider (a range of real-world datasets as well as synthetic data), the inductive bias that seems appropriate is the regularity or smoothness of a function as measured by a certain function space norm. Choosing the smoothest function that perfectly fits observed data is a form of Occam’s razor: The simplest explanation compatible with the observations should be preferred (cf. refs. 7 and 8). By considering larger function classes, which contain more candidate predictors compatible with the data, we are able to find interpolating functions that have smaller norm and are thus “simpler.” Thus, increasing function class capacity improves performance of classifiers.

Related ideas have been considered in the context of margins theory (7, 9, 10), where a larger function class ℋ may permit the discovery of a classifier with a larger margin. While the margins theory can be used to study classification, it does not apply to regression and also does not predict the second descent beyond the interpolation threshold. Recently, there has been an emerging recognition that certain interpolating predictors (not based on ERM) can indeed be provably statistically optimal or near optimal (11, 12), which is compatible with our empirical observations in the interpolating regime.

In the remainder of this article, we discuss empirical evidence for the double-descent curve and the mechanism for its emergence and conclude with some final observations and parting thoughts.

## Neural Networks

In this section, we discuss the double-descent risk curve in the context of neural networks.

### Random Fourier Features.

We first consider a popular class of nonlinear parametric models called random Fourier features (RFF) (13), which can be viewed as a class of 2-layer neural networks with fixed weights in the first layer. The RFF model family

Our learning procedure using

In Fig. 2, we show the test risk of the predictors learned using

The interpolation regime connected with modern practice is shown to the right of the interpolation threshold, with

What structural mechanisms account for the double-descent shape? When the number of features is much smaller than the sample size,

To the right of the interpolation threshold, all function classes are rich enough to achieve zero training risk. For the classes *SI Appendix*.

Additional empirical evidence for the same double-descent behavior using other datasets is presented in *SI Appendix*. For instance, we demonstrate double descent for rectified linear unit (ReLU) random feature models, a class of ReLU neural networks with a setting similar to that of RFF. We also describe a simple synthetic model, which can be regarded as a 1D version of the RFF model, where we observe the same double-descent behavior.
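The experiment described in this subsection can be sketched on synthetic data. The following is a minimal illustration, not the authors' code: the minimum norm least-squares fit follows the description given later in the article, while the cosine-with-random-phase feature map, the synthetic target, the bandwidth, and all sizes are assumptions chosen for brevity. Features are nested (each larger model reuses the features of the smaller ones) so that the coefficient norms of interpolating solutions are directly comparable.

```python
import numpy as np


def make_rff_map(d, n_features_max, bandwidth, rng):
    """Draw random frequencies and phases once; return a nested feature map.

    Assumption: real-valued features cos(<w, x> + b), a common RFF variant.
    """
    W = rng.standard_normal((d, n_features_max)) / bandwidth
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features_max)

    def features(X, n_features):
        # Use only the first n_features random features (nested classes).
        return np.cos(X @ W[:, :n_features] + b[:n_features])

    return features


def rff_sweep(n_train=40, n_test=1000, d=5, noise=0.1,
              feature_counts=(5, 10, 20, 40, 80, 160, 400), seed=0):
    """Fit minimum norm least squares for each feature count N.

    Returns a list of (N, train_mse, test_mse, coefficient_norm).
    """
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(d)  # hypothetical ground-truth direction

    def sample(n):
        X = rng.standard_normal((n, d))
        y = np.sin(X @ w_star) + noise * rng.standard_normal(n)
        return X, y

    X_tr, y_tr = sample(n_train)
    X_te, y_te = sample(n_test)
    features = make_rff_map(d, max(feature_counts), bandwidth=2.0, rng=rng)

    rows = []
    for N in feature_counts:
        Phi_tr = features(X_tr, N)
        # The pseudoinverse gives the least-squares solution when N < n and
        # the minimum norm interpolating solution when N >= n.
        a = np.linalg.pinv(Phi_tr) @ y_tr
        train_mse = np.mean((Phi_tr @ a - y_tr) ** 2)
        test_mse = np.mean((features(X_te, N) @ a - y_te) ** 2)
        rows.append((N, train_mse, test_mse, np.linalg.norm(a)))
    return rows


if __name__ == "__main__":
    for N, tr, te, nrm in rff_sweep():
        print(f"N={N:4d}  train_mse={tr:.3e}  test_mse={te:.3e}  norm={nrm:.3e}")
```

With nested features, any interpolating solution for N features is also one for a larger N (pad with zeros), so the minimum norm among interpolants cannot increase as N grows past the sample size. Plotting `test_mse` against `N` is the way to look for a peak near `N = n_train`; whether and how sharply it appears depends on the assumed data and bandwidth.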

### Neural Networks and Backpropagation.

In general multilayer neural networks (beyond RFF or ReLU random feature models), a learning algorithm will tune all of the weights to fit the training data, typically using versions of stochastic gradient descent (SGD), with backpropagation to compute partial derivatives. This flexibility increases the representational power of neural networks, but also makes ERM generally more difficult to implement. Nevertheless, as shown in Fig. 3, we observe that increasing the number of parameters in fully connected 2-layer neural networks leads to a risk curve qualitatively similar to that observed with RFF models. That the test risk improves beyond the interpolation threshold is compatible with the conjectured “small norm” inductive biases of the common training algorithms for neural networks (16, 17). We note that this transition from under- to overparameterized regimes for neural networks was also previously observed by refs. 18–21. In particular, ref. 21 draws a connection to the physical phenomenon of “jamming” in particle systems.

The computational complexity of ERM with neural networks makes the double-descent risk curve difficult to observe. Indeed, in the classical underparameterized regime (

It is common to use neural networks with an extremely large number of parameters (22). But to achieve interpolation for a single output (regression or 2-class classification) one expects to need at least as many parameters as there are data points. Moreover, if the prediction problem has more than one output (as in multiclass classification), then the number of parameters needed should be multiplied by the number of outputs. This is indeed the case empirically for neural networks shown in Fig. 3. Thus, for instance, datasets as large as ImageNet (23), which has

Additional results with neural networks are given in *SI Appendix*.
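The parameter-counting heuristic above (roughly one parameter per data point per output) can be turned into a small calculation. This is a sketch under stated assumptions: the count assumes a fully connected 2-layer network with bias terms in both layers, the function names are hypothetical, and the sizes in the usage note are illustrative values chosen only for the arithmetic.

```python
def two_layer_param_count(d, hidden, n_outputs):
    """Weights and biases of a fully connected net: d -> hidden -> n_outputs."""
    return (d + 1) * hidden + (hidden + 1) * n_outputs


def interpolation_threshold(n_samples, n_outputs):
    """Heuristic number of parameters needed to interpolate: n * K."""
    return n_samples * n_outputs


def hidden_units_at_threshold(n_samples, d, n_outputs):
    """Smallest hidden width whose parameter count reaches n * K.

    Solves (d + 1 + K) * H + K >= n * K for the integer H.
    """
    target = interpolation_threshold(n_samples, n_outputs)
    per_unit = d + 1 + n_outputs
    # Ceiling division without floating point.
    return max(1, -(-(target - n_outputs) // per_unit))


if __name__ == "__main__":
    n, d, K = 4000, 784, 10  # illustrative sizes only
    H = hidden_units_at_threshold(n, d, K)
    print(interpolation_threshold(n, K), H, two_layer_param_count(d, H, K))
```

For the illustrative sizes above, the heuristic threshold is 40,000 parameters, which a network of this shape first reaches at 51 hidden units (40,555 parameters).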

## Decision Trees and Ensemble Methods

Does the double-descent risk curve manifest with other prediction methods besides neural networks? We give empirical evidence that the families of functions explored by boosting with decision trees and random forests also show similar generalization behavior to that of neural nets, both before and after the interpolation threshold.

AdaBoost and random forests have recently been investigated in the interpolation regime by ref. 24 for classification. In particular, they give empirical evidence that, when AdaBoost and random forests are used with maximally large (interpolating) decision trees, the flexibility of the fitting methods yields interpolating predictors that are more robust to noise in the training data than the predictors produced by rigid, noninterpolating methods (e.g., AdaBoost or random forests with shallow trees). This in turn is said to yield better generalization. The averaging of the (near) interpolating trees ensures that the resulting function is substantially smoother than any individual tree, which aligns with an inductive bias that is compatible with many real-world problems.
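The effect of averaging can be made concrete with a one-line calculation. The notation is hypothetical: suppose $M$ trees $T_1, \dots, T_M$ each fit every training label exactly, $T_j(x_i) = y_i$, and let $\bar{T}$ denote their average.

```latex
\bar{T}(x) = \frac{1}{M}\sum_{j=1}^{M} T_j(x)
\quad\Longrightarrow\quad
\bar{T}(x_i) = \frac{1}{M}\sum_{j=1}^{M} y_i = y_i
\quad \text{for all } i = 1, \dots, n.
```

The ensemble therefore still interpolates the training data. Away from the training points the individual trees generally disagree, so the average takes intermediate values instead of jumping between the piecewise-constant levels of any single tree.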

We can understand these flexible fitting methods in the context of the double-descent risk curve. Observe that the size of a decision tree (controlled by the number of leaves) is a natural way to parameterize the function class capacity: Trees with only 2 leaves correspond to 2-piecewise constant functions with an axis-aligned boundary, while trees with n leaves can interpolate n training examples. It is a classical observation that the U-shaped bias–variance trade-off curve manifests in many problems when the class capacity is considered this way (2). (The interpolation threshold may be reached with fewer than n leaves in many cases, but n is clearly an upper bound.) To further enlarge the function class, we consider ensembles (averages) of several interpolating trees.* So, beyond the interpolation threshold, we use the number of such trees to index the class capacity. When we view the risk curve as a function of class capacity defined in this hybrid fashion, we see the double-descent curve appear just as with neural networks (Fig. 4 and *SI Appendix*). We observe a similar phenomenon using *SI Appendix*.

## Concluding Thoughts

The double-descent risk curve introduced in this paper reconciles the U-shaped curve predicted by the bias–variance trade-off and the observed behavior of rich models used in modern machine-learning practice. The posited mechanism that underlies its emergence is based on common inductive biases and hence can explain its appearance (and, we argue, ubiquity) in machine-learning applications.

We conclude with some final remarks.

### Historical Absence.

The double-descent behavior may have been historically overlooked on account of several cultural and practical barriers. Observing the double-descent curve requires a parametric family of spaces with functions of arbitrary complexity. The linear settings studied extensively in classical statistics usually assume a small, fixed set of features and hence fixed fitting capacity. Richer families of function classes are typically used in the context of nonparametric statistics, where smoothing and regularization are almost always used (28). Regularization, of all forms, can both prevent interpolation and change the effective capacity of the function class, thus attenuating or masking the interpolation peak.

The RFF model is a popular and flexible parametric family. However, these models were originally proposed as a computationally favorable alternative to kernel machines. This computational advantage over traditional kernel methods holds only for

The situation with general multilayer neural networks is slightly different and more involved. Due to the nonconvexity of the ERM optimization problem, solutions in the classical underparameterized regime are highly sensitive to initialization. Moreover, as we have seen, the peak at the interpolation threshold is observed within a narrow range of parameters. Sampling of the parameter space that misses that range may lead to the misleading impression that increasing the size of the network simply improves performance. Finally, in practice, training of neural networks is typically stopped as soon as (an estimate of) the test risk fails to improve. This early stopping has a strong regularizing effect that, as discussed above, makes it difficult to observe the interpolation peak.

### Inductive Bias.

In this paper, we have dealt with several types of methods for choosing interpolating solutions. For random Fourier features, solutions are constructed explicitly by minimum norm linear regression in the feature space. As the number of features tends to infinity they approach the minimum functional norm solution in the reproducing kernel Hilbert space, a solution which maximizes functional smoothness subject to the interpolation constraints. For neural networks, the inductive bias owes to the specific training procedure used, which is typically SGD. When all but the final layer of the network are fixed (as in RFF models), SGD initialized at zero also converges to the minimum norm solution. While the behavior of SGD for more general neural networks is not fully understood, there is significant empirical and some theoretical evidence (e.g., ref. 16) that a similar minimum norm inductive bias is present. Yet another type of inductive bias related to averaging is used in random forests. Averaging potentially nonsmooth interpolating trees leads to an interpolating solution with a higher degree of smoothness; this averaged solution performs better than any individual interpolating tree.
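The claim that gradient-based training started at zero reaches the minimum norm solution when only the final layer is trained can be checked numerically. The sketch below uses full-batch gradient descent as a stand-in for SGD and a random Gaussian matrix as a stand-in for fixed first-layer features; both choices, along with the sizes and function names, are assumptions made for illustration.

```python
import numpy as np


def gradient_descent_from_zero(Phi, y, n_steps=2000):
    """Minimize 0.5 * ||Phi a - y||^2 by gradient descent, starting at a = 0."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2  # 1 / (largest singular value)^2
    a = np.zeros(Phi.shape[1])
    for _ in range(n_steps):
        # Each update is a combination of rows of Phi, so the iterates never
        # leave the row space of Phi.
        a -= step * Phi.T @ (Phi @ a - y)
    return a


def compare_interpolants(n=20, N=100, seed=0):
    """Overparameterized linear model: n equations, N > n unknowns."""
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((n, N))  # stand-in for fixed features
    y = rng.standard_normal(n)

    a_min = np.linalg.pinv(Phi) @ y            # minimum norm interpolant
    a_gd = gradient_descent_from_zero(Phi, y)  # gradient descent from zero

    # Another interpolant: add a component from the null space of Phi.
    z = rng.standard_normal(N)
    null_part = z - np.linalg.pinv(Phi) @ (Phi @ z)
    a_other = a_min + null_part
    return Phi, y, a_min, a_gd, a_other


if __name__ == "__main__":
    Phi, y, a_min, a_gd, a_other = compare_interpolants()
    print("max |a_gd - a_min|:", np.max(np.abs(a_gd - a_min)))
    print("norms (min, other):", np.linalg.norm(a_min), np.linalg.norm(a_other))
```

Because the iterates stay in the row space of the feature matrix, the only interpolant they can converge to is the minimum norm one. Every other interpolant differs from it by a null-space component, which is orthogonal to it and can only increase the norm.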

Remarkably, for kernel machines all 3 methods lead to the same minimum norm solution. Indeed, the minimum norm interpolating classifier,

### Optimization and Practical Considerations.

In our experiments, appropriately chosen “modern” models usually outperform the optimal classical model on the test set. But another important practical advantage of overparameterized models is in optimization. There is a growing understanding that larger models are “easy” to optimize as local methods, such as SGD, converge to global minima of the training risk in overparameterized regimes (e.g., ref. 30). Thus, large interpolating models can have low test risk and be easy to optimize at the same time, in particular with SGD (31). It is likely that the models to the left of the interpolation peak have optimization properties qualitatively different from those to the right, a distinction of significant practical import.

### Outlook.

The classical U-shaped bias–variance trade-off curve has shaped our view of model selection and directed applications of learning algorithms in practice. The understanding of model performance developed in this work delineates the limits of classical analyses and opens additional lines of inquiry to study and compare computational, statistical, and mathematical properties of the classical and modern regimes in machine learning. We hope that this perspective, in turn, will help practitioners choose models and algorithms for optimal performance.

## Acknowledgments

M.B. was supported by NSF Grant RI-1815697. D.H. was supported by NSF Grant CCF-1740833 and a Sloan Research Fellowship.

## Footnotes

^{1}To whom correspondence may be addressed. Email: mbelkin@cse.ohio-state.edu.

Author contributions: M.B., D.H., S. Ma, and S. Mandal designed research, performed research, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

*These trees are trained in the way proposed for random forests, except without bootstrap resampling. This is similar to the PERT method of ref. 25.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1903070116/-/DCSupplemental.

Published under the PNAS license.

## References

- ↵
- ↵
- T. Hastie,
- R. Tibshirani,
- J. Friedman

- ↵
- G. Gigerenzer,
- H. Brighton

- ↵
- R. Salakhutdinov

- ↵
- C. Zhang,
- S. Bengio,
- M. Hardt,
- B. Recht,
- O. Vinyals

- ↵
- J. Dy,
- A. Krause

- M. Belkin,
- S. Ma,
- S. Mandal

- ↵
- V. N. Vapnik

- ↵
- A. Blumer,
- A. Ehrenfeucht,
- D. Haussler,
- M. K. Warmuth

- ↵
- P. L. Bartlett

- ↵
- ↵
- S. Bengio,
- H. Wallach,
- H. Larochelle,
- K. Grauman,
- N. Cesa-Bianchi,
- R. Garnett

- M. Belkin,
- D. Hsu,
- P. Mitra

- ↵
- M. Belkin,
- A. Rakhlin,
- A. B. Tsybakov

*Proceedings of Machine Learning Research*, K. Chaudhuri, M. Sugiyama, Eds. (Proceedings of Machine Learning Research, 2019), vol. 89, pp. 1611–1619.

- ↵
- J. C. Platt,
- D. Koller,
- Y. Singer,
- S. T. Roweis

- A. Rahimi,
- B. Recht

- ↵
- B. E. Boser,
- I. M. Guyon,
- V. N. Vapnik

- ↵
- I. Guyon,
- U. V. Luxburg,
- S. Bengio,
- H. Wallach,
- R. Fergus,
- S. Vishwanathan,
- R. Garnett

- A. Rudi,
- L. Rosasco

- ↵
- I. Guyon,
- U. V. Luxburg,
- S. Bengio,
- H. Wallach,
- R. Fergus,
- S. Vishwanathan,
- R. Garnett

- S. Gunasekar,
- B. E. Woodworth,
- S. Bhojanapalli,
- B. Neyshabur,
- N. Srebro

- ↵
- S. Bubeck,
- V. Perchet,
- P. Rigollet

- Y. Li,
- T. Ma,
- H. Zhang

- ↵
- M. C. Mozer,
- M. I. Jordan,
- T. Petsche

- S. Bös,
- M. Opper

- ↵
- M. S. Advani,
- A. M. Saxe

- ↵
- B. Neal et al.

- ↵
- S. Spigler et al.

- ↵
- A. Canziani,
- A. Paszke,
- E. Culurciello

- ↵
- ↵
- A. J. Wyner,
- M. Olson,
- J. Bleich,
- D. Mease

- ↵
- A. Cutler,
- G. Zhao

- ↵
- ↵
- ↵
- L. Wasserman

- ↵
- O. Bousquet,
- U. von Luxburg,
- G. Rätsch

- C. E. Rasmussen

- ↵
- M. Soltanolkotabi,
- A. Javanmard,
- J. D. Lee

- ↵
- J. Dy,
- A. Krause

- S. Ma,
- R. Bassily,
- M. Belkin
