# Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes


Edited by William Bialek, Princeton University, Princeton, NJ, and approved October 14, 2016 (received for review May 20, 2016)

## Significance

Artificial neural networks are some of the most widely used tools in data science. Learning is, in principle, a hard problem in these systems, but in practice heuristic algorithms often find solutions with good generalization properties. We propose an explanation of this good performance in terms of a nonequilibrium statistical physics framework: We show that there are regions of the optimization landscape that are both robust and accessible and that their existence is crucial to achieve good performance on a class of particularly difficult learning problems. Building on these results, we introduce a basic algorithmic scheme that improves existing optimization algorithms and provides a framework for further research on learning in neural networks.

## Abstract

In artificial neural networks, learning from data is a computationally demanding task in which a large number of connection weights are iteratively tuned through stochastic-gradient-based heuristic processes over a cost function. It is not well understood how learning occurs in these systems, in particular how they avoid getting trapped in configurations with poor computational performance. Here, we study the difficult case of networks with discrete weights, where the optimization landscape is very rough even for simple architectures, and provide theoretical and numerical evidence of the existence of rare—but extremely dense and accessible—regions of configurations in the network weight space. We define a measure, the robust ensemble (RE), which suppresses trapping by isolated configurations and amplifies the role of these dense regions. We analytically compute the RE in some exactly solvable models and also provide a general algorithmic scheme that is straightforward to implement: define a cost function given by a sum of a finite number of replicas of the original cost function, with a constraint centering the replicas around a driving assignment. To illustrate this, we derive several powerful algorithms, ranging from Markov chains to message passing to gradient descent processes, where the algorithms target the robust dense states, resulting in substantial improvements in performance. The weak dependence on the number of precision bits of the weights leads us to conjecture that very similar reasoning applies to more conventional neural networks. Analogous algorithmic schemes can also be applied to other optimization problems.

There is increasing evidence that artificial neural networks perform exceptionally well in complex recognition tasks (1). Despite huge numbers of parameters and strong nonlinearities, learning often occurs without getting trapped in local minima with poor prediction performance (2). The remarkable output of these models has created unprecedented opportunities for machine learning in a host of applications. However, these practical successes have been guided by intuition and experiments, whereas obtaining a complete theoretical understanding of why these techniques work seems currently out of reach, due to the inherent complexity of the problem. In other words, in practical applications, large and complex architectures are trained on big and rich datasets using an array of heuristic improvements over basic stochastic gradient descent (SGD). These heuristic enhancements over a stochastic process have the general purpose of improving the convergence and robustness properties (and therefore the generalization properties) of the networks, with respect to what would be achieved with a pure gradient descent (GD) on a cost function.

There are many parallels between the studies of algorithmic stochastic processes and out-of-equilibrium processes in complex systems. Examples include jamming processes in physics, local search algorithms for optimization and inference problems in computer science, regulatory processes in biological and social sciences, and learning processes in real neural networks (see, e.g., refs. 3–7). In all these problems, the underlying stochastic dynamics are not guaranteed to reach states described by an equilibrium probability measure, as would occur for ergodic statistical physics systems. Sets of configurations that are quite atypical for certain classes of algorithmic processes become typical for other processes. Although this fact is unsurprising in a general context, it is unexpected and potentially quite significant when sets of relevant configurations that are typically inaccessible for a broad class of search algorithms become extremely attractive to other algorithms.

Here, we discuss how this phenomenon emerges in learning in large-scale neural networks with low precision synaptic weights. We further show how it is connected to an out-of-equilibrium statistical physics measure that suppresses the confounding role of exponentially many deep and isolated configurations (local minima of the error function) and also amplifies the statistical weight of rare but extremely dense regions of minima. We call this measure the robust ensemble (RE). Moreover, we show that the RE allows us to derive novel and exceptionally effective algorithms. One of these algorithms is closely related to a recently proposed stochastic learning protocol used in complex deep artificial neural networks (8), implying that the underlying geometrical structure of the RE may provide an explanation for its effectiveness.

In the present work, we consider discrete neural networks with only one or two layers, which can be studied analytically. However, we believe that these results should extend to deep neural networks of which the models studied here are building blocks, and in fact to other learning problems as well.

## Interacting Replicas As a Tool for Seeking Dense Regions

In statistical physics, the canonical ensemble describes the equilibrium (i.e., long-time limit) properties of a stochastic process in terms of a probability distribution over the configurations W of the system, P(W; β) = Z(β)⁻¹ e^(−βE(W)), where E(W) is the energy of the configuration, β is an inverse temperature, and Z(β) is a normalization constant (the partition function).

In the past few decades, equilibrium statistical physics descriptions have emerged as fundamental frameworks for studying the properties of a variety of systems that were previously squarely in the domain of other disciplines. For example, the study of the phase transitions of the random K-satisfiability problem has deeply influenced both the statistical physics of disordered systems and theoretical computer science.

For learning problems with discrete synapses, numerical experiments indicate that efficient algorithms also seek unfrozen solutions. In ref. 13, we showed that the equilibrium description in these problems is insufficient, in the sense that it predicts that the problem is always in the completely frozen phase in which all solutions are locked (14), despite the fact that efficient algorithms seem to exist. This motivated us to introduce a different measure, which ignores isolated solutions and enhances the statistical weight of large, accessible regions of solutions:

$$P(\tilde{W}; \beta, y, \gamma) = Z(\beta, y, \gamma)^{-1}\, e^{\,y\, \Phi(\tilde{W}, \beta, \gamma)}, \qquad \textbf{[2]}$$

where Φ(W̃, β, γ) = log Σ_W e^(−βE(W) − γ d(W, W̃)) is a "local free entropy," d(·,·) is a distance function between configurations, and Z(β, y, γ) is a normalization constant.

Here, γ controls the size of the neighborhood of the reference configuration W̃ over which the local free entropy is computed, and y plays the formal role of an inverse temperature conjugate to Φ: For large y, this measure concentrates on configurations that are surrounded by an exponential number of low-energy configurations, whereas isolated minima, however deep, carry a negligible weight.

It is therefore natural to consider using our large-deviation statistics as a starting point to design new algorithms, in much the same way that SA uses equilibrium statistics. Indeed, this was shown to work well in ref. 15. The main difficulty of that approach was the need to estimate the local (free) entropy at each step, which requires an auxiliary inference procedure and is computationally demanding.

Here, we demonstrate an alternative, general, and much simpler approach. The key observation is that, when y is a nonnegative integer, the partition function of the large-deviation measure can be rewritten as that of a system of y + 1 interacting replicas of the original problem, each of which recovers the standard equilibrium measure of Eq. **1** when the coupling γ is switched off.

This partition function describes a system of y + 1 interacting replicas: the reference configuration W̃ and y replicas W¹, …, Wʸ, each subject to the original energy function and attracted to the reference by the coupling term γ d(Wᵃ, W̃).

In fact, in most cases, we can further improve on this scheme by tracing out the reference instead, which leaves us with a system of y identical replicas coupled to one another (Eq. **4**): this defines the RE.
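Concretely, tracing out the reference leaves a sum of individual energies plus an attractive coupling of each replica to the replicas' center of mass. A minimal sketch (hypothetical names; the squared distance introduced later in the text is assumed):

```python
import numpy as np

def replicated_energy(replicas, energy_fn, gamma):
    """Energy of y coupled replicas with the reference traced out:
    the sum of the individual energies plus an attractive coupling of
    each replica to the replicas' center of mass (squared distance)."""
    center = np.mean(replicas, axis=0)          # center of mass of the replicas
    individual = sum(energy_fn(w) for w in replicas)
    coupling = gamma * sum(np.sum((w - center) ** 2) for w in replicas)
    return individual + coupling
```

When all replicas coincide, the coupling term vanishes and the replicated energy reduces to y times the original one; the coupling penalizes configurations that scatter the replicas apart.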

In the following, we will demonstrate how this simple procedure can be applied to a variety of different algorithms: SA, SGD, and BP. To demonstrate the utility of the method, we will focus on the problem of training neural network models.

## Neural Network Models

Throughout this paper, we will consider for simplicity one main kind of neural network model, composed of identical threshold units arranged in a feed-forward architecture. Each unit has many input channels and one output and is parameterized by a vector of “synaptic weights” W: on an input vector, the unit outputs the sign of the weighted sum of its inputs.

Because we are interested in connecting with analytical results, for the sake of simplicity all our tests have been performed using binary weights, Wᵢ ∈ {−1, +1}.

In all tests, we extracted all inputs and outputs in {−1, +1} at random, unbiased and identically distributed.

To use methods such as SA and gradient descent, we need to associate an energy or cost to every configuration of the system: we use the number of misclassified training patterns (see *SI Appendix*). For deeper networks, the definition is conceptually too naive and computationally too hard, and it should be amended to reflect the fact that more than one layer is affected by training, but this is beyond the scope of the present work.

We also need to define a distance function between replicas of the system. In all our tests, we used the squared distance d(W, W′) = Σᵢ (Wᵢ − Wᵢ′)², which for binary weights is proportional to the Hamming distance.

## Replicated SA

We claim that there is a general strategy for learning in these systems: simulating a system of interacting replicas, with energy given by Eq. **3** or **4**, and lowering the temperature via an SA procedure, until either a zero of the energy (“cost”) or a “give-up condition” is reached. For simplicity, we use the RE, in which the reference configuration is traced out (Eq. **4**), and we compare our method to the case in which the interaction between the replicas is absent (i.e., γ = 0, equivalent to running y independent standard SA procedures in parallel).

Additionally, we find that the effect of the interaction among replicas can be almost entirely accounted for by adding a prior on the choice of the moves within an otherwise standard Metropolis scheme, while still maintaining the detailed balance condition (of course, this reduces to the standard Metropolis rule for γ = 0). The scheme is described in *Materials and Methods* and in more detail in *SI Appendix*.
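A stripped-down version of such a replicated Monte Carlo step can be sketched as follows (plain Metropolis on the replicated energy, without the prior-based move selection or the annealing schedule; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def replicated_sa_step(replicas, energy_fn, gamma, beta):
    """One simplified Metropolis step on the replicated energy:
    pick a replica and a spin uniformly at random, flip it, and accept
    with the standard Metropolis probability computed on the full
    replicated energy (individual terms plus replica coupling).
    This omits the prior-based move selection described in the text."""
    y, N = len(replicas), len(replicas[0])
    a = rng.integers(y)                  # choose a replica uniformly
    i = rng.integers(N)                  # choose a spin uniformly

    def total_energy():
        center = np.mean(replicas, axis=0)
        return (sum(energy_fn(w) for w in replicas)
                + gamma * sum(np.sum((w - center) ** 2) for w in replicas))

    E_old = total_energy()
    replicas[a][i] *= -1                 # propose the flip
    dE = total_energy() - E_old
    if dE > 0 and rng.random() >= np.exp(-beta * dE):
        replicas[a][i] *= -1             # reject: undo the flip
    return replicas
```

Recomputing the full energy at each step is wasteful; an efficient implementation would compute the energy difference of a single flip directly.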

In Fig. 2, we show the results for the perceptron; an analogous figure for the committee machine, with similar results, is shown in *SI Appendix*, Fig. S1. The analysis of the scaling with the problem size (*SI Appendix*) shows that the interaction among the replicas is crucial, radically improving the performance with respect to the noninteracting case.

## Replicated GD

Monte Carlo methods are computationally expensive and may be infeasible for large systems. One simple alternative general method for finding minima of the energy is using GD or one of its many variants. All these algorithms are generically called backpropagation algorithms in the neural networks context (19). Indeed, GD—in particular SGD—is the basis of virtually all recent successful “deep learning” techniques in machine learning. The two main issues with using GD are that it does not in general offer any guarantee to find a global minimum and that convergence may be slow [in particular for some of the variables; cf. the “vanishing gradient” problem (20) that affects deep neural network architectures]. Additionally, when training a neural network for the purpose of inferring (generalizing) a rule from a set of examples, it is in general unclear how the properties of the local minima of the energy on the training set are related to the energy of the test set (i.e., to the generalization error).

GD is defined on differentiable systems, and thus it cannot be applied directly to the case of systems with discrete variables considered here. One possible work-around is to introduce a two-level scheme, consisting of using two sets of variables, a continuous one W̃ and a discrete one W obtained from it by binarization: the error signal is computed from the discrete variables, while the gradient update is applied to the continuous ones. Similar heuristic schemes have recently proven effective in training deep networks with low-precision weights.
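A minimal sketch of the two-level idea on a single threshold unit (a simplified, perceptron-style update; the names and the use of sign as the discretization are illustrative assumptions):

```python
import numpy as np

def two_level_sgd_step(w_cont, x, y, lr):
    """One step of a two-level scheme: the output is computed with the
    binarized weights sign(w_cont), while the correction is applied to
    the underlying continuous variables (a heuristic, perceptron-style
    rule on a single threshold unit)."""
    w_bin = np.sign(w_cont)              # discrete weights used by the unit
    out = np.sign(w_bin @ x)             # threshold-unit output
    if out != y:                         # on error, update...
        w_cont += lr * y * x             # ...the continuous variables
    return w_cont
```

The discrete weights only change when the accumulated continuous updates cross zero, which gives the scheme a useful hysteresis.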

Almost all of the above-mentioned results are purely heuristic (except in the online generalization setting, which is not considered in the present work). Indeed, even just using the two-level SGD is heuristic in this context. Nevertheless, here we demonstrate that, as in the case of SA of the previous section, replicating the system and adding a time-dependent interaction term (i.e., performing the GD over the RE energy defined in Eq. **5**) leads to a noticeable improvement in the performance of the algorithm, and that when a solution is found it is indeed part of a dense region, as expected (*Materials and Methods*). We showed in ref. 13 that solutions belonging to maximally dense regions have better generalization properties than other solutions; in other words, they are less prone to overfitting.

It is also important to note here that the stochastic nature of the SGD algorithm is essential in this case in providing an entropic component that counters the purely attractive interaction between replicas introduced by Eq. **5**, because the absolute minima of the replicated energy of Eq. **4** are always obtained by collapsing all of the replicas in the minima of the original energy function. Indeed, the existence of an entropic component is what allowed us to derive the RE definition from the large-deviation measure of Eq. **2** at finite interaction strength; to use this technique with a nonstochastic minimization algorithm, the distance between the replicas should be controlled explicitly instead.

In Fig. 3 we show the results for a fully connected committee machine, demonstrating that the introduction of the interaction term greatly improves the capacity of the network with respect to the noninteracting case; additional results are shown in *SI Appendix*, Fig. S2, and further details are given in *SI Appendix*. These results are in perfect agreement with the analysis of the next section, on BP, which suggests that this replicated SGD algorithm has near-optimal capacity.

It is interesting to note that a very similar approach—a replicated system in which each replica is attracted toward a reference configuration, called elastic averaged SGD (EASGD)—was proposed in ref. 8 (see also ref. 26) using deep convolutional networks with continuous variables, as a heuristic way to exploit parallel computing environments under communication constraints. Although it is difficult in that case to fully disentangle the effect of replicating the system from the other heuristics (in particular the use of “momentum” in the GD update), their results clearly demonstrate a benefit of introducing the replicas in terms of training error, test error, and convergence time. It seems therefore plausible that, despite the great difference in complexity between their network and the simple models studied in this paper, the general underlying reason for the effectiveness of the method is the same (i.e., the existence of accessible robust low-energy states in the space of configurations).
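In our notation, the elastic-averaging idea can be sketched roughly as follows (a simplified synchronous variant without momentum or communication delays; not the exact protocol of ref. 8):

```python
import numpy as np

def easgd_step(replicas, center, grads, lr, rho):
    """One elastic-averaging-style update: each replica follows its own
    (stochastic) gradient plus an elastic pull toward a center variable,
    while the center moves toward the average of the replicas.
    Simplified: synchronous updates, no momentum."""
    new_replicas = []
    for w, g in zip(replicas, grads):
        new_replicas.append(w - lr * g - lr * rho * (w - center))
    center = center + lr * rho * sum(w - center for w in replicas)
    return new_replicas, center
```

With the gradients switched off, the elastic terms alone contract the replicas and the center toward one another, which is exactly the attractive coupling discussed in the text.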

## Replicated BP

BP, also known as sum-product, is an iterative message-passing method that can be used to describe a probability distribution over an instance described by a factor graph in the correlation decay approximation (27, 28). The accuracy of the approximation relies on the assumption that, when removing an interaction from the network, the nodes involved in that interaction become effectively independent, an assumption linked to so-called replica symmetry (RS) in statistical physics.

Algorithmically, the method can be briefly summarized as follows. The goal is to solve a system of equations involving quantities called messages. These messages represent single-variable cavity marginal probabilities, where the term “cavity” refers to the fact that these probabilities neglect the existence of a node in the graph. For each edge of the graph there are two messages going in opposite directions. Each equation in the system gives the expression of one of the messages as a function of its neighboring messages. The expressions for these functions are derived from the form of the interactions between the variables of the graph. The resulting system of equations is then solved iteratively, by initializing the messages in some arbitrary configuration and updating them according to the equations until a fixed point is eventually reached. If convergence is achieved, the messages can be used to compute various quantities of interest; among those, the most relevant in our case are the single-site marginals and the thermodynamic quantities such as the entropy and the free energy of the system. From single-site marginals, it is also possible to extract a configuration by simply taking the argmax of each marginal.
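To make the generic scheme concrete, here is a self-contained sum-product example on a small Ising chain (unrelated to the learning model of this paper; on a tree such as a chain, the fixed point yields exact marginals):

```python
import numpy as np

def bp_chain_marginals(h, J, n_iter=50):
    """Sum-product on an Ising chain p(s) ∝ exp(Σ h_i s_i + Σ J_i s_i s_{i+1}).
    Messages live on directed edges and are functions of the receiving
    spin; on a tree the fixed point gives exact single-site marginals.
    Returns P(s_i = +1) for every site."""
    N = len(h)
    s_vals = np.array([1.0, -1.0])       # state ordering: index 0 is s = +1
    # msg[(i, j)]: message from i to j, indexed by the state of s_j
    msg = {(i, j): np.array([0.5, 0.5])
           for i in range(N) for j in (i - 1, i + 1) if 0 <= j < N}
    for _ in range(n_iter):
        new = {}
        for (i, j) in msg:
            inc = np.ones(2)             # product of messages into i, except from j
            for k in (i - 1, i + 1):
                if 0 <= k < N and k != j:
                    inc = inc * msg[(k, i)]
            Jij = J[min(i, j)]
            m = np.zeros(2)
            for b, sj in enumerate(s_vals):   # sum over the sending spin s_i
                m[b] = np.sum(np.exp(h[i] * s_vals + Jij * s_vals * sj) * inc)
            new[(i, j)] = m / m.sum()    # normalize for numerical stability
        msg = new
    marg = []
    for i in range(N):
        b = np.exp(h[i] * s_vals)
        for k in (i - 1, i + 1):
            if 0 <= k < N:
                b = b * msg[(k, i)]
        marg.append(b[0] / b.sum())      # P(s_i = +1)
    return marg
```

On graphs with loops the same iteration is only approximate, which is where the correlation-decay assumption mentioned above comes in.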

In the case of our learning problem, the variables are the synaptic weights, each pattern represents a constraint (i.e., an “interaction”) between all variables, and the form of these interactions is such that BP message updates can be computed efficiently. Also note that, because our variables are binary, each of the messages and marginals can be described with a single quantity: We generally use the magnetization, that is, the difference between the probabilities associated to the states +1 and −1. The complete BP equations for our models are reported in *SI Appendix*, closely following the work of ref. 29; here, we give only a short summary and refer to *SI Appendix* for the notation and the derivations.

As mentioned above, BP is an inference algorithm: When it converges, it describes a probability distribution. One particularly effective scheme to turn BP into a solver is the addition of a “reinforcement” term (29): A time-dependent local field is introduced for each variable, proportional to its own marginal probability as computed in the previous iteration step, and is gradually increased until the whole system is completely biased toward one configuration. This admittedly heuristic scheme is quite general, leads to very good results in a variety of different problems, and can even be used in cases in which unmodified BP would not converge or would provide a very poor approximation (see, e.g., ref. 30). In the case of single-layer binary networks such as those considered in this paper, it can reach near-optimal capacities (*SI Appendix*).
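Schematically, the reinforcement idea can be sketched as follows, with a toy stand-in for the BP sweep (the interface and the linear growth schedule are illustrative assumptions, not the exact protocol of ref. 29):

```python
import numpy as np

def reinforce(marginal_fn, N, steps=100, gamma_rate=0.01):
    """Sketch of the reinforcement heuristic: at each iteration, a local
    field proportional to the previous magnetizations is fed back into
    the inference step, with a strength that grows over time, so that
    the system progressively polarizes toward one configuration.
    `marginal_fn(field)` stands in for one BP sweep returning the
    magnetizations given external fields (hypothetical interface)."""
    field = np.zeros(N)
    m = np.zeros(N)
    for t in range(steps):
        m = marginal_fn(field)               # one inference sweep (stand-in)
        field = gamma_rate * (t + 1) * m     # growing self-reinforcement
    return np.sign(m)                        # extracted configuration
```

The feedback loop amplifies whatever bias the inference step expresses, which is the sense in which the process "focuses" on a shrinking region of configuration space.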

The reason for the effectiveness of reinforced BP has not been clear. Intuitively, the process progressively focuses on smaller and smaller regions of the configuration space, with these regions determined from the current estimate of the distribution by looking in the “most promising” direction. This process has thus some qualitative similarities with the search for dense regions described in the previous sections. This analogy can be made precise by writing the BP equations for the system described by Eq. **3**. There are in this case two equivalent approaches. The first is to use the local entropy as the energy function, using a second-level BP to estimate the local entropy itself. This approach is very similar to the so-called one-step replica-symmetry-breaking (1RSB) cavity equations (see ref. 16 for a general introduction). The second approach is to replicate the system, considering explicit copies of the variables coupled to the reference configuration, and to write the BP equations for this replicated factor graph.

Fortunately, it turns out that there is a different way of applying BP to the replicated system, leading to an efficient solver that is both very similar to the reinforced BP algorithm and reasonably well described by the theoretical results. Instead of considering the joint distribution over all replicated variables at a given site, one can exploit the fact that the replicas are identical and interchangeable, so that a single set of messages suffices to describe all of them (in the RE formulation of Eq. **4**). The procedure is shown graphically in Fig. 4. At each iteration step, the effect of the other replicas is accounted for by an extra self-interaction term entering the usual BP update, Eq. **6**. Note that even though we started from a system of many replicas, the resulting message-passing procedure has the same computational complexity as standard BP on a single system; we call this algorithm focusing BP (fBP).

The latter equation uses a single parameter that combines the effects of the number of replicas and of the coupling; for a particular choice of this parameter, fBP reduces to reinforced BP (*SI Appendix*). This result is somewhat incidental, because in the context of our analysis the corresponding limit of Eq. **7** is only related to the approximations introduced by the simplified scheme of Fig. 4. As it turns out, however, choosing slightly different mappings, in which both parameters are varied explicitly, leads to algorithms with very similar behavior, as shown in *SI Appendix*. Using a protocol in which the coupling is progressively increased, fBP can thus be turned into a solver.

Apart from the possibility of using fBP as a solver by gradually increasing the coupling, the algorithm can also be used at fixed parameters as a tool to study the geometry of the dense regions, as detailed in *SI Appendix*. The results indicate that fBP is better than the alternatives described above in overcoming the issues arising from RSB effects. The 1RSB scheme describes a nonergodic situation that arises from the breakup of the space of configurations into a collection of several separate equivalent ergodic clusters, each representing a possible state of the system. These states are characterized by the typical overlap inside the same state (the intracluster overlap) and by the typical overlap between different states (the intercluster overlap); see *SI Appendix*, Fig. S5.

Therefore, although this algorithm is not fully understood from the theoretical point of view, it offers a valuable insight into the reason for the effectiveness of adding a reinforcement term to the BP equations. It is interesting in this context to observe that the existence of a link between the reinforcement term and the RE seems consistent with some recent results on random bicoloring constraint satisfaction problems (32), which showed that reinforced BP finds solutions with shorter “whitening times” with respect to other solvers: This could be interpreted as meaning they belong to a region of higher solution density, or are more central in the cluster they belong to. Furthermore, our algorithm can be used to estimate the point up to which accessible dense states exist, even in cases, such as multilayer networks, where analytical calculations are prohibitively complex.

Fig. 6 shows the result of experiments performed on a committee machine with the same architecture and the same parameters as in the replicated SGD experiments (*Materials and Methods*). The figure shows that fBP finds that dense states (where the local entropy curves approach the upper bound at small distances) exist up to values of the training set size close to the network's capacity.

Finally, we also performed some exploratory tests applying fBP on the random K-satisfiability problem; the results are reported in *SI Appendix*.

## Discussion

In this paper, we have presented a general scheme that can be used to bias the search for low-energy configurations, enhancing the statistical weight of large, accessible states. Although the underlying theoretical description is based on a nontrivial large deviation measure, its concrete implementation is very simple—replicate the system and introduce an interaction between the replicas—and versatile, in that it can be generally applied to a number of different optimization algorithms or stochastic processes. We demonstrated this by applying the method to SA, GD, and BP, but it is clear that the list of possible applications may be much longer. The intuitive interpretation of the method is also quite straightforward: A set of coupled systems is less likely to get trapped in narrow minima, and will instead be attracted to wide regions of good (and mostly equivalent) configurations, thus naturally implementing a kind of robustness to details of the configurations.

The utility of this kind of search depends on the details of the problem under study. Here we have mainly focused on the problem of training neural networks, for a number of reasons. The first is that, at least in the case of single-layer networks with discrete synapses, we had analytical and numerical evidence that dense, accessible states exist and are crucial for learning and improving the generalization performance, and we could compare our findings with analytical results. The second is that the general problem of training neural networks has been addressed in recent years via a sort of collective search in the space of heuristics, fueled by impressive results in practical applications and mainly guided by intuition; heuristics are evaluated based on their effectiveness in finding accessible states with good generalization properties. It seems reasonable to describe these accessible states as regions of high local entropy (i.e., wide, very robust energy minima): The center of such a region can act as a Bayesian estimator for the whole extensive neighborhood. Here we showed a simple way to exploit the existence of such states efficiently, whatever the optimization algorithm used. This not only sheds light on previously known algorithms but also suggests improvements or even entirely new algorithms. Further work is required to determine whether the same type of phenomenon that we observed here in simple models actually generalizes to the deep and complex networks currently used in machine learning applications (the performance boost obtained by the EASGD algorithm of ref. 8 being a first indication in this direction), and to investigate further ways to improve the performance of learning algorithms, or to overcome constraints (such as being limited to very-low-precision computations).

It is also natural to consider other classes of problems in which this analysis may be relevant. One application would be solving other constraint satisfaction problems. For example, in ref. 15 we demonstrated that the EdMC algorithm can be successfully applied to the random K-satisfiability problem.

## Materials and Methods

### Replicated SA.

The SA procedure used an elementary Metropolis–Hastings Monte Carlo iteration step that can be summarized as follows: *i*) We choose a replica uniformly at random; *ii*) we choose a candidate weight to flip for that replica according to a prior computed from the state of all of the replicas; and *iii*) we estimate the single-replica energy cost of the flip and accept it according to a standard Metropolis rule with a (usually very small) extra term.

The sampling technique (step *ii* above) uses a prior to choose which spin to flip. The prior for a spin that, if flipped, would produce a given change in the interaction term of Eq. **5** is computed from the number of spins in the same replica that would produce the same change; the flip is then accepted (step *iii* above) with a probability chosen so that, in combination with the prior, detailed balance is preserved. The complete description is given in *SI Appendix*.

The annealing procedure consisted of starting from a given inverse temperature and gradually increasing it during the simulation; the schedules and all parameter values are reported in *SI Appendix*.

### Replicated GD.

Our replicated SGD scheme uses continuous internal variables whose binarized values enter the energy; the update follows the gradient of the replicated energy of Eq. **5**, with an additional step parameter controlling the size of the update.

This formula is derived and discussed in more detail in *SI Appendix*.

We used a standard SGD protocol with fixed minibatch size. During the training process, we kept the gradient steps fixed but gradually increased the replica coupling; all parameter values are reported in *SI Appendix*.

### Replicated BP.

Our implementation of the fBP algorithm closely follows ref. 29, with the addition of the self-interaction of Eq. **7**; great care is required to correctly estimate the local entropy at large coupling, as detailed in *SI Appendix*.

## Acknowledgments

We thank Y. LeCun and L. Bottou for encouragement and interesting discussions about future directions for this work. This work was supported by European Research Council Grant 267915 (to C. Baldassi, C.L., and R.Z.).

## Footnotes

- ^{1}To whom correspondence should be addressed. Email: carlo.baldassi{at}polito.it.

Author contributions: C. Baldassi, C. Borgs, J.T.C., A.I., C.L., L.S., and R.Z. designed research; C. Baldassi, A.I., C.L., L.S., and R.Z. performed research; C. Baldassi and A.I. contributed software; C. Baldassi analyzed data; and C. Baldassi, C. Borgs, J.T.C., C.L., L.S., and R.Z. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: The code for the Replicated Stochastic Gradient Descent algorithm is available at https://github.com/carlobaldassi/BinaryCommitteeMachineRSGD.jl. The code for the Focusing Belief Propagation algorithm is available at https://github.com/carlobaldassi/BinaryCommitteeMachineFBP.jl.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1608103113/-/DCSupplemental.

Freely available online through the PNAS open access option.

## References

- Ngiam J, et al. *Proceedings of the 28th International Conference on Machine Learning (ICML-11)* (International Machine Learning Society), pp 265–272.
- Ricci-Tersenghi F, Semerjian G.
- Bressloff PC.
- Easley D, Kleinberg J.
- Zhang S, Choromanska AE, LeCun Y. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R, eds.
- Kirkpatrick S, Gelatt CD Jr, Vecchi MP.
- Mézard M, Parisi G, Zecchina R.
- Krzakala F, Montanari A, Ricci-Tersenghi F, Semerjian G, Zdeborova L.
- Baldassi C, Ingrosso A, Lucibello C, Saglietti L, Zecchina R.
- Huang H, Kabashima Y.
- Baldassi C, Ingrosso A, Lucibello C, Saglietti L, Zecchina R.
- Mézard M, Montanari A.
- Baldassi C, Gerace F, Lucibello C, Saglietti L, Zecchina R.
- Moore C, Mertens S.
- Rumelhart DE, Hinton GE, Williams RJ.
- Hochreiter S.
- Baldassi C, Braunstein A, Brunel N, Zecchina R.
- Baldassi C.
- Courbariaux M, Bengio Y, David JP. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R, eds.
- Courbariaux M, Hubara I, Soudry D, El-Yaniv R, Bengio Y.
- Zhang S.
- MacKay DJ.
- Yedidia JS, Freeman WT, Weiss Y.
- Bailly-Bechet M, et al.
- Kabashima Y.
- Braunstein A, Dall’Asta L, Semerjian G, Zdeborová L.
- Marino R, Parisi G, Ricci-Tersenghi F.
- Dall’Asta L, Ramezanpour A, Zecchina R.

## Article Classifications

- Physical Sciences
- Computer Sciences