# Fundamental bounds on learning performance in neural circuits

Edited by Terrence J. Sejnowski, Salk Institute for Biological Studies, La Jolla, CA, and approved March 4, 2019 (received for review August 3, 2018)

## Significance

We show how neural circuits can use additional connectivity to achieve faster and more precise learning. Biologically, internal synaptic noise imposes an optimal size of network for learning a given task. Above the optimal size, addition of neurons and synaptic connections starts to impede learning and task performance. Overall brain size may therefore be constrained by pressure to learn effectively with unreliable synapses. The same effect may explain why certain neurological learning deficits are associated with hyperconnectivity. Beneath this optimal size, apparently redundant connections are advantageous for learning. Such apparently redundant connections have recently been observed in several species and brain areas.

## Abstract

How does the size of a neural circuit influence its learning performance? Larger brains tend to be found in species with higher cognitive function and learning ability. Intuitively, we expect the learning capacity of a neural circuit to grow with the number of neurons and synapses. We show how adding apparently redundant neurons and connections to a network can make a task more learnable. Consequently, large neural circuits can either devote connectivity to generating complex behaviors or exploit this connectivity to achieve faster and more precise learning of simpler behaviors. However, we show that in a biologically relevant setting where synapses introduce an unavoidable amount of noise, there is an optimal size of network for a given task. Above the optimal network size, the addition of neurons and synaptic connections starts to impede learning performance. This suggests that the size of brain circuits may be constrained by the need to learn efficiently with unreliable synapses and provides a hypothesis for why some neurological learning deficits are associated with hyperconnectivity. Our analysis is independent of specific learning rules and uncovers fundamental relationships between learning rate, task performance, network size, and intrinsic noise in neural circuits.

In the brain, computations are distributed across circuits that can include many millions of neurons and synaptic connections. Maintaining a large nervous system is expensive energetically and reproductively (1–3), suggesting that the cost of additional neurons is balanced by an increased capacity to learn and process information.

Empirically, a “bigger is better” hypothesis is supported by the correlation of brain size with higher cognitive function and learning capacity across animal species (4–6). Within and across species, the volume of a brain region often correlates with the importance or complexity of the tasks it performs (7–9). These observations make sense from a theoretical perspective because larger artificial neural networks can solve more challenging computational tasks than smaller networks (10–15). However, we still lack a firm theoretical understanding of how network size improves learning performance.

Biologically it is not clear that there is always a computational advantage to having more neurons and synapses engaged in learning a task. During learning, larger networks face the problem of tuning greater numbers of synapses using limited and potentially corrupted information on task performance (16, 17). Moreover, no biological component is perfect, so unavoidable noise arising from the molecular machinery in individual synapses might sum unfavorably as the size of a network grows. Intriguingly, a number of well-studied neurodevelopmental disorders exhibit cortical hyperconnectivity at the same time as learning deficits (18–21). It is therefore a fundamental question whether learning capacity can grow indefinitely with the number of neurons and synapses in a neural circuit or whether there is some law of diminishing returns that eventually leads to a decrease in performance beyond a certain network size.

We address these questions with a general mathematical analysis of learning performance in neural circuits that is independent of specific learning rules and circuit architectures. For a broad family of learning tasks, we show how the expected learning rate and steady-state performance are related to the size of a network. The analysis reveals how connections can be added to intermediate layers of a multilayer network to reduce the difficulty of learning a task. This gain in overall learning performance is accompanied by slower per-synapse rates of change, predicting that synaptic turnover rates should vary across brain areas according to the number of connections involved in a task and the typical task complexity.

If each synaptic connection is intrinsically noisy, we show that there is an optimal network size for a given task. Above the optimal network size, adding neurons and connections degrades learning and steady-state performance. This reveals an important disparity between synapses in artificial neural networks, which are not subject to unavoidable intrinsic noise, and those in biology, which are necessarily subject to fluctuations at the molecular level (22–25).

For networks that are beneath the optimal size, it turns out to be advantageous to add apparently redundant neurons and connections. We show how additional synaptic pathways reduce the impact of imperfections in learning rules and uncertainty in the task error. This provides a potential theoretical explanation for recent, counterintuitive experimental observations in mammalian cortex (26, 27), which show that neurons frequently make multiple, redundant synaptic connections to the same postsynaptic cell. A nonobvious consequence of this result is that the size of a neural circuit can either reflect the complexity of a fixed task or instead deliver greater learning performance on simpler, arbitrary tasks.

## Results

### Modeling the Effect of Network Size on Learning.

Our goal is to analyze how network size affects learning and steady-state performance in a general setting depicted in Fig. 1, which is independent of specific tasks, network architectures, and learning rules. We assume that there is some error signal that is fed back to the network via a learning rule that adjusts synaptic weights. We also assume that the error signal is limited both by noise and by a finite sampling rate quantified by some time interval T (Fig. 1*A*). In addition to the noise in the learning rule, we also consider noise that is independently distributed across synapses (“intrinsic synaptic noise”). This models molecular noise in signaling and structural apparatus in a biological synapse that is uncorrelated with learning processes and with changes in other synapses. Network size is adjusted by adding synapses and neurons (Fig. 1*B*).

Before analyzing the general case, we motivate the analysis with simulations of fully connected, multilayer nonlinear feedforward neural networks that we trained to learn input–output mappings (Fig. 2*A*). We used the so-called student–teacher framework to generate tasks (e.g., refs. 28 and 29). A “teacher” network is initialized with random fixed weights. The task is for a randomly initialized “student” network to learn the input–output mapping of the teacher. This framework models learning of any task that can be performed by a feedforward neural network by setting the teacher as the network optimized to perform the task.
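The student–teacher setup described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the logistic nonlinearity, the layer sizes, the uniform weight range, and the helper names `init_network`, `forward`, and `task_error` are hypothetical choices made here, not the simulation code used for the figures.

```python
import math
import random

def sigmoid(x):
    """Logistic nonlinearity (an assumed choice of sigmoidal function)."""
    return 1.0 / (1.0 + math.exp(-x))

def init_network(layer_sizes, rng, scale=1.0):
    """Random weights; weights[k][i][j] connects neuron j of layer k to neuron i of layer k+1."""
    return [
        [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]

def forward(weights, u):
    """Propagate an input vector u through every fully connected layer."""
    h = list(u)
    for W in weights:
        h = [sigmoid(sum(w * x for w, x in zip(row, h))) for row in W]
    return h

def task_error(student, teacher, inputs):
    """Mean squared mismatch between student and teacher outputs over the inputs."""
    total = 0.0
    for u in inputs:
        ys, yt = forward(student, u), forward(teacher, u)
        total += sum((a - b) ** 2 for a, b in zip(ys, yt))
    return total / len(inputs)

rng = random.Random(0)
teacher = init_network([4, 3, 2], rng)   # random weights, held fixed
student = init_network([4, 3, 2], rng)   # independent random starting point
inputs = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]

error_at_start = task_error(student, teacher, inputs)
error_of_teacher = task_error(teacher, teacher, inputs)
```

The teacher's error against itself is zero by construction, so any student that can represent the teacher's weights can in principle reach zero task error.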

The sizes of the student networks were set by incrementally adding neurons and connections to internal layers of a network with the same initial connection topology as that of the teacher (Fig. 2*B* and *Materials and Methods*). This generated student networks of increasing size with the guarantee that each student can in principle learn the exact input–output mapping of the teacher.

Learning was simulated by modifying synapses with noise-corrupted gradient descent to mimic an imperfect biological learning rule. We emphasize that we do not assume learning in a biological network occurs by explicit gradient descent. However, any error-based learning rule must induce synaptic changes that approximate gradient descent, as we show below (Eq. **1**). We assume that learning must be performed online; that is, data arrive one sample at a time. We believe this captures a typical biological learning scenario where a learner gets intermittent examples and feedback.
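As a minimal sketch of online learning with a noise-corrupted gradient, consider a linear student that sees one Gaussian input per step. The linear model, the step size `lr`, the noise level `noise_sd`, and the function names are assumptions made for illustration; they are not the parameters of the reported simulations.

```python
import random

def noisy_online_step(w, w_teacher, rng, lr=0.05, noise_sd=0.1):
    """One online update from a single sample, using a corrupted gradient."""
    u = [rng.gauss(0.0, 1.0) for _ in w]                    # one input sample
    err = sum((ws - wt) * x for ws, wt, x in zip(w, w_teacher, u))
    grad = [err * x for x in u]                             # gradient of 0.5 * err**2
    noisy = [g + rng.gauss(0.0, noise_sd) for g in grad]    # learning-rule noise
    return [ws - lr * g for ws, g in zip(w, noisy)]

def true_error(w, w_teacher):
    """Expected squared output error for unit-variance Gaussian inputs."""
    return sum((ws - wt) ** 2 for ws, wt in zip(w, w_teacher))

rng = random.Random(1)
n = 10
w_teacher = [rng.uniform(-1, 1) for _ in range(n)]
w = [rng.uniform(-1, 1) for _ in range(n)]

initial_error = true_error(w, w_teacher)
for _ in range(2000):
    w = noisy_online_step(w, w_teacher, rng)
final_error = true_error(w, w_teacher)
```

Because the noise is injected at every step, the error in this sketch settles at a nonzero steady-state value rather than converging to zero.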

The phenomena we wish to understand are shown in Fig. 2 *C* and *D*. We trained networks of varying sizes on the same task, with the same amount of learning-rule noise. Larger networks learn more quickly and to a higher steady-state performance than smaller networks when there is no intrinsic synaptic noise (Fig. 2*C*). This is surprising because the only difference between the smallest network and larger networks is the addition of redundant synapses and neurons, and the task is designed so that all networks can learn it perfectly in principle. Moreover, as shown in Fig. 2*D*, adding intrinsic noise to the synapses of the student networks results in a nonmonotonic relationship between performance and network size. Beyond a certain size, both learning and asymptotic performance start to worsen.

The simulations in Fig. 2 provide evidence of an underlying relationship between learning rate, task performance, network size, and intrinsic noise. To understand these observations in a rigorous and general way, we mathematically analyzed how network size and imperfections in feedback learning rules impact learning in a general case.

We note that in machine learning, noise processes such as dropout and stochastic regularization (e.g., refs. 30–32) can be applied to improve generalization from finite training data. Intrinsic synaptic noise is qualitatively different from these regularization processes. In particular, the per-synapse magnitude of intrinsic noise remains constant, independent of network size or training level. Moreover, our simulations use online learning, which is distinct from the common machine-learning paradigm where data are divided into training and test sets. The implications of our paper for this paradigm are considered in *SI Appendix*, *Regularization and Generalization Error* and *Online Learning and Generalization Error*, where we also show that regularization can be incorporated as learning-rule noise.

### Learning Rate and Task Difficulty.

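We define task error as a smooth function $F[\mathbf{w}]$ of the vector $\mathbf{w} \in \mathbb{R}^N$ of synaptic weights in the network, where lower values of $F$ correspond to better task performance.

Biologically, it is reasonable to assume that learning-related synaptic changes occur due to old information. For example, a task-related reward or punishment may be supplied only at the end of a task, which itself takes time to execute. Similarly, even if error feedback is ongoing in time, there will always be some biochemically induced delay between acquisition of this error signal and its integration into plastic changes at each synapse.

Thus, there will be a maximum rate at which task error information can be transmitted to synapses during learning, which for mathematical convenience can be lumped into discrete time points. Suppose feedback on task error occurs at time points 0 and T, but not in between, for some $T > 0$ (Fig. 3*A*). If the network learned over the interval $[0, T]$, then task error decreased: $F[\mathbf{w}(T)] < F[\mathbf{w}(0)]$. Writing $\Delta\mathbf{w} = \mathbf{w}(T) - \mathbf{w}(0)$ for the change in synaptic weights, we can express the change in task error using a Taylor expansion (Fig. 3*A*) as

$$F[\mathbf{w}(T)] - F[\mathbf{w}(0)] = \nabla F[\mathbf{w}(0)]^{\top} \Delta\mathbf{w} + \tfrac{1}{2}\, \Delta\mathbf{w}^{\top}\, \nabla^{2} F[\mathbf{w}(0)]\, \Delta\mathbf{w} + \mathcal{O}\!\left(\|\Delta\mathbf{w}\|^{3}\right). \tag{1}$$

Eq. **1** shows that synaptic changes, on average, must anticorrelate with the gradient for learning to occur. We can thus decompose net learning rate during the interval T into contributions as follows (further details in *SI Appendix*, *Learning Rate and Local Task Difficulty*):

$$k = -\frac{F[\mathbf{w}(T)] - F[\mathbf{w}(0)]}{T\, F[\mathbf{w}(0)]} = \frac{\|\Delta\mathbf{w}\|\, \|\nabla F\|}{T\, F} \left[ -\widehat{\Delta\mathbf{w}}^{\top}\, \widehat{\nabla F} \;-\; \frac{\|\Delta\mathbf{w}\|}{2\, \|\nabla F\|}\, \widehat{\Delta\mathbf{w}}^{\top}\, \nabla^{2} F\, \widehat{\Delta\mathbf{w}} \right] + \mathcal{O}\!\left(\|\Delta\mathbf{w}\|^{3}\right). \tag{2}$$

Here hats denote unit vectors, and $F$ and its derivatives are evaluated at $\mathbf{w}(0)$. The synaptic change $\Delta\mathbf{w}$ is not assumed to be small, so the higher-order remainder in Eq. **2** is not necessarily small. Nonetheless, we can gain useful insight for how error surface geometry affects learning by examining the other terms on the right-hand side of Eq. **2**. The gradient strength scales the overall learning rate. Inside the brackets, the curvature term (which can change sign and magnitude during learning) can compete with the gradient term to slow down (or reverse) learning.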

Informally, the curvature term in Eq. **2** therefore controls the learning “difficulty” at each stage of learning. As we will show, this term can be tuned by changing the number of neurons and synaptic connections in the network.
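The competition between the gradient term and the curvature term can be checked numerically on a toy quadratic error surface, for which a second-order Taylor expansion is exact. The sketch below assumes that the learning rate k is the fractional decrease in task error per unit time over the interval T; that definition, the diagonal Hessian, and all variable names are illustrative assumptions.

```python
import math

def F(w, a):
    """Toy quadratic task error F(w) = 0.5 * sum_i a_i * w_i**2."""
    return 0.5 * sum(ai * wi * wi for ai, wi in zip(a, w))

def grad_F(w, a):
    """Gradient of the toy error; the Hessian is diag(a)."""
    return [ai * wi for ai, wi in zip(a, w)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

a = [1.0, 4.0, 9.0]        # curvatures along each weight axis
w0 = [1.0, -0.5, 0.25]     # weights at time 0
dw = [-0.2, 0.1, 0.05]     # synaptic change accumulated by time T
T = 1.0

# Assumed definition: fractional decrease in error per unit time.
w1 = [x + d for x, d in zip(w0, dw)]
k_direct = -(F(w1, a) - F(w0, a)) / (T * F(w0, a))

# Decomposition: gradient strength times (alignment term - curvature term).
g = grad_F(w0, a)
g_hat = [x / norm(g) for x in g]
dw_hat = [x / norm(dw) for x in dw]
alignment = -sum(x * y for x, y in zip(g_hat, dw_hat))     # anticorrelation with gradient
curvature = sum(ai * x * x for ai, x in zip(a, dw_hat))    # dw_hat^T H dw_hat
bracket = alignment - 0.5 * norm(dw) / norm(g) * curvature
k_decomposed = norm(dw) * norm(g) / (T * F(w0, a)) * bracket
```

Scaling `dw` up in this sketch increases the curvature penalty faster than the alignment gain, so a sufficiently large step makes the bracket, and hence the learning rate, negative.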

The learning rate, k, is likely to remain positive during learning if the gradient direction changes gradually as the error surface is traversed (i.e., the error surface is almost linear). In this case a high rate of plasticity (due to a high gain between feedback error and synaptic change) will result in a high learning rate. However, if the descent direction changes rapidly due to the curvature of the error surface (i.e., the surface is crinkled up), then correlation with the gradient is quickly lost as the weights move, and a high rate of plasticity can slow or reverse learning. This is illustrated in Fig. 3*A*, where the length of the leaps along the error surface indicates the rate of plasticity.

We next decompose the contributions to the overall synaptic change during a learning increment. First we assume that synapses are perfectly reliable, with no intrinsic noise fluctuations affecting their strengths. In this case, we can decompose the overall synaptic change into a task-relevant component, taken to be parallel to the gradient of task error, and a task-irrelevant component (Fig. 3*B*).

Note that a learning rule could theoretically induce task-relevant synaptic changes in a direction that is not parallel to the gradient. This case is treated in *SI Appendix*, *Task-Relevant Plasticity*, but it complicates the presentation without adding insight.

There are several sources of task-irrelevant plasticity. First, there can be inherent imperfections in the learning rule: Information on task error may be imperfectly acquired and transmitted through the nervous system. Second, as we have emphasized above, the process of integrating feedback error and converting it into synaptic changes takes time. Therefore, any learning rule will be using outdated information on task performance, implying that the gradient information will have error in general, unless it is constant for a task. This is illustrated in Fig. 3*A*, where we see that during learning, the local information used to modify synapses leads to a network overshooting local minima in the error surface. Third, in a general biological setting, synapses will be involved in multiple task-irrelevant plasticity processes that contribute to *A*). For instance, the learning of additional, unrelated tasks may induce concurrent synaptic changes; so too could other ongoing cellular processes such as homeostatic plasticity. The common feature of all these components of task-irrelevant plasticity is that they are correlated across the network, but uncorrelated with learning the task.

We now consider the impact of intrinsic noise in the synapses themselves. Synapses are subject to continuous fluctuations due to turnover of receptors and signaling components. Some component of this will be uncorrelated across synapses so we can model these sources of noise as additional, independent white-noise fluctuations at each synapse with some total (per-synapse) variance *SI Appendix*, *Decomposition of Local Task Difficulty* for justification), we can also write the magnitude of total synaptic rate of change across the network in a convenient form:**5** and **6** allow us to rewrite Eq. **2**:

For given values of the **7** that **7** to be negative and ceases to occur when this term is zero. This implies

If inequality Eq. **8** is broken, then learning stops entirely. At some point in learning, this breakage is inevitable: As **8** above is broken.

To validate our analysis we numerically computed the quantities in Eq. **7** in simulations (Fig. 4). In the case of a linear network with quadratic error the **8** indeed predicts the steady-state value of *A*.

For more general error functions, we have observed that Eq. **8** is always conservative in numerical simulations: Learning stops before local task difficulty reaches the critical value, implying that the **7** is usually negative. This is demonstrated in Fig. 4 *A* and *B*.

In summary, we have shown that local task difficulty

Note that in Fig. 4 (as well as in subsequent numerical simulations) we define the entire distribution of input–output pairs to be a finite set or “batch” generated from a fixed, random set of inputs. This is a technical necessity that allowed us to numerically calculate true task error (i.e., the error over all inputs) and the true task gradient that appears in Eq. **5**. We emphasize that this finite batch is considered to be the entire distribution, not a sample from a true (unknown) distribution. It is not possible to numerically specify the true gradient when inputs are drawn from an infinite distribution (see *SI Appendix*, *Online Learning and Generalization Error* for further details).

### Local Task Difficulty as a Function of Network Size.

We next show precisely how network size influences the local task difficulty and thus learning rate and steady-state performance when other factors such as noise and the task itself remain the same.

Recall that **7**, as we see by substituting into it the expanded form of

We can gain intuition into how Eq. **9** is derived without going through additional technical details (*SI Appendix*, *Decomposition of Local Task Difficulty*). Suppose that the weights were perturbed by a randomly chosen direction n over the time interval

If the direction n is drawn independently of task error and its derivatives, then**11** that determines the effect of the perturbation on task error. Its contribution is**9** tells us explicitly how local task difficulty (and thus expected learning rate and steady-state performance) can be modified by changing the size of a network, provided the size change leaves *C* generically increased learning performance and provides a general explanation for enhanced learning performance in larger networks.
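The intuition about randomly chosen perturbation directions rests on a standard fact: for a unit vector n drawn uniformly at random in N dimensions, the expected curvature along n, written nᵀHn for a symmetric matrix H, equals Tr(H)/N. The Monte Carlo sketch below checks this for an arbitrary symmetric matrix standing in for the Hessian of task error; the matrix, the sample count, and the function names are assumptions made for illustration.

```python
import math
import random

def random_unit_vector(n, rng):
    """Uniformly random direction, obtained by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    s = math.sqrt(sum(x * x for x in v))
    return [x / s for x in v]

def quadratic_form(H, v):
    """Curvature of a quadratic surface with Hessian H along direction v."""
    size = len(v)
    return sum(v[i] * H[i][j] * v[j] for i in range(size) for j in range(size))

rng = random.Random(2)
n = 6

# Arbitrary symmetric matrix used as a stand-in Hessian.
B = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
H = [[B[i][j] + B[j][i] for j in range(n)] for i in range(n)]

samples = 20000
mean_curvature = sum(
    quadratic_form(H, random_unit_vector(n, rng)) for _ in range(samples)
) / samples
trace_over_n = sum(H[i][i] for i in range(n)) / n
```

Under this fact, the number of weights N appears explicitly in the average curvature experienced by a random perturbation, which is one way network size can enter local task difficulty.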

### Network Expansions That Increase Learning Performance.

We next give detailed examples of network expansions that increase learning rate and use the theory developed so far to compute the optimal size of a network when intrinsic noise is present. We first analyze a linear network and then apply insights from this to a more general nonlinear feedforward case.

Consider a linear network (i.e., a linear map, as shown in Fig. 5*A*) that transforms any input u into an output

We next embed this network in a larger network with *A*.

The expanded neural network with weights **12** tells us that if these weights are related to the original network weights by *SI Appendix, Learning in a Linear Network*) yields:**13** implies

Bringing together Eqs. **13** and **14** and the formula Eq. **10** for **16**. Indeed this allows us to optimize the steady-state error of the network by changing N. To see how, recall that*A*, by evaluating the learning performance of transformed neural networks of different sizes, with different

This estimate of the optimal network size is plotted in Fig. 5*B*, which shows the dependence on intrinsic synaptic noise levels. As noise decreases to zero, we see that the optimal network size grows arbitrarily. In addition, the optimal network size is smaller for a lower amount of task-irrelevant plasticity (i.e., a “better” learning rule). We validate the optimal network size estimate in Fig. 6*A* in simulations.

We next consider nonlinear multilayer, feedforward networks. Again, we use the student–teacher framework to generate learning tasks. We consider learning performance of a nominal and an expanded network, both with l layers, and both using the same learning rule. The only difference between the two networks is the larger number of neurons in each hidden layer of the expanded network. We will use our theory to predict an optimal number of synapses (and consequently optimal hidden layer sizes) for the transformed network. As before, this size will depend on the learning rule used by the networks, which is defined by levels of task-relevant plasticity, task-irrelevant plasticity, per-synapse white-noise intensity, and frequency of task error feedback. Our predictions are validated in simulations in Fig. 6*B*.

We first describe the nominal network architecture. Given a vector *Materials and Methods*). The first layer of neurons receives an input vector u in place of neural activities

For any given state w of the nominal network, we can construct a state

Suppose the nominal network is at state *SI Appendix*, *Learning in a Nonlinear, Feedforward Network* for additional details). We get**18** to minimize*B*).Note the dependence of **19** reflects the intrinsic difficulty of the task.

## Discussion

It is difficult to disentangle the physiological and evolutionary factors that determine the size of a brain circuit (33–35). Previous studies focused on the energetic cost of sustaining large numbers of neurons and connecting them efficiently (2, 34–36). Given the significant costs associated with large circuits (3), it is clear that some benefit must offset these costs, but it is currently unclear whether other inherent tradeoffs constrain network size. We showed under broad assumptions that there is an upper limit to the learning performance of a network which depends on its size and the intrinsic volatility of synaptic connections.

Neural circuits in animals with large brains were presumably shaped on an evolutionary timescale by gradual addition of neurons and connections. Expanding a small neural circuit into a larger one can increase its dynamical repertoire, allowing it to generate more complex behaviors (37, 38). It can improve the quality of locally optimal behaviors arrived at after learning (39, 40). Less obviously, as we show here, circuit expansion can also allow a network to learn simpler tasks more quickly and to greater precision.

By directly analyzing the influence of synaptic weight configurations on task error we derived a quantity we called “local task difficulty” that determines how easily an arbitrary network can learn. We found that local task difficulty always depends implicitly on the number of neurons and can therefore be decreased by adding neurons according to relatively unrestrictive constraints. In simple terms, adding redundancy flattens out the mapping between synaptic weights and task error, reducing the local task difficulty on average. This flattening makes learning faster and steady-state task error lower because the resulting error surface is less contorted and easier to descend using intermittent task error information. Biological learning rules are unlikely to explicitly compute gradients. Regardless, any learning rule that uses error information must effectively approximate a gradient as the network learns.

As an analogy, imagine hiking to the base of a mountain without a map and doing so using intermittent and imperfect estimates of the slope underfoot. An even slope will be easier to descend because slope estimates will remain consistent and random errors in the estimates will average out over time. An undulating slope will be harder to descend because the direction of descent necessarily changes with location. Now consider the same hike in a heavy fog at dusk. The undulating slope will become far harder to descend. However, if it were possible to somehow smooth out the undulations (that is, reduce local task difficulty), the same hike would progress more efficiently. This analogy illustrates why larger neural circuits are able to achieve better learning performance in a given task when error information is corrupted.

In specific examples we show that adding neurons to intermediate layers of a multilayer, feedforward network increases the magnitude of the slope (gradient) of the task error function relative to its curvature. From this we provide a template for scaling up network architecture such that both quantities increase approximately equally. This provides hypotheses for the organizing principles in biological circuits which, among other things, predict a prevalence of apparently redundant connections in networks that need to learn new tasks quickly and to high accuracy. Recent experimental observations reveal such apparently redundant connections in a number of brain areas across species (26, 27, 41, 42).

Even if neurons are added to a network in a way that obeys the architectural constraints we derive, intrinsic synaptic noise eventually defeats the benefits conferred to learning. All synapses experience noisy fluctuations due to their molecular makeup (23–25, 43–45). These sources of noise are distinct from shared noise in a feedback signal that is used in learning. Such independent noise sources accumulate as a network grows in size, outcompeting the benefit of size on learning performance. An immediate consequence is an optimal network size for a given task and level of synaptic noise.
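The accumulation of independent noise with network size can be illustrated with a short simulation. If each of N synapses receives an independent fluctuation with standard deviation σ, the expected squared magnitude of the total perturbation is Nσ², so it grows in proportion to the number of synapses. The values of N and σ and the function name below are assumptions chosen for illustration only.

```python
import random

def intrinsic_noise_power(n_synapses, sd, rng, trials=200):
    """Average squared norm of independent per-synapse fluctuations."""
    total = 0.0
    for _ in range(trials):
        total += sum(rng.gauss(0.0, sd) ** 2 for _ in range(n_synapses))
    return total / trials

rng = random.Random(3)
sd = 0.1                                             # per-synapse noise, fixed across sizes
power_small = intrinsic_noise_power(100, sd, rng)    # expected value: 100 * sd**2 = 1.0
power_large = intrinsic_noise_power(1000, sd, rng)   # expected value: 1000 * sd**2 = 10.0
```

A tenfold increase in the number of synapses gives roughly a tenfold increase in total noise power, even though the per-synapse noise level is unchanged.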

Furthermore, our results show that different noise sources in nervous systems impact learning in qualitatively different ways. Noise in the learning rule as well as external noise in the task error, which may arise from sensory noise or fluctuations in the task, can be overcome in a larger circuit. On the other hand, the impact of intrinsic noise in the synaptic connections only worsens as network size grows. Our results demonstrate the intuitive fact that insufficient connections impair learning performance. Conversely, and less obviously, excessive numbers of connections impair learning once the optimal network size is exceeded. This provides a hypothesis for why abnormalities in circuit connectivity may lead to learning deficits (18–21).

Our analysis allowed us to predict the optimal size of a network in theoretical learning tasks where we can specify the levels of noise in the learning rule and in synapses. Fig. 5 shows that the optimal network size decreases rapidly as the intrinsic noise in synapses increases. We speculate that the emergence of large neural circuits therefore depended on evolutionary modifications to synapses that reduce intrinsic noise. In particular, optimal network size increases explosively as intrinsic synaptic noise approaches zero. An intriguing and challenging goal for future work would be to infer noise parameters in synapses across different nervous systems and test whether overall network size obeys the relationships our theory predicts.

## Materials and Methods

Full details of all simulations are provided in *SI Appendix*. We provide an overview here. Code related to this paper is publicly accessible in ref. 46.

### Network Architectures.

All tested neural networks have fully connected feedforward architectures. Unless otherwise specified networks are nonlinear with multiple hidden layers. Each neuron in these networks passes inputs through a sigmoidal nonlinearity

### Details of the Learning Tasks.

All training tasks used in simulation use the student–teacher framework. The basic setup is as follows: We first make a teacher network, which has the same basic network architecture as the student’s. We then initialize the teacher weights at fixed values, which are generated as follows (unless otherwise specified): At the kth layer of the teacher network, weights are distributed uniformly on the interval

We then specify a set U consisting of 1,000 input vectors. We generate each vector

The output vector of the teacher, given an input vector u, is denoted

For the linear networks studied at the beginning of the section Network Expansions That Increase Learning Performance, the dimensionalities of the input and output vectors differ between the teacher and the students. However, fixed matrix transformations lift the teacher inputs/outputs into the appropriate dimensionality (Eq. **12**). For all other networks, the number of network inputs/outputs is shared between the students and the teacher. Teacher and students also have the same number of hidden layers. At the ith hidden layer, each student has at least as many neurons as the teacher. This ensures that the teacher network forms a subset of each student network and therefore that each student network is theoretically capable of exactly recreating the input–output mapping
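One simple way to see why a larger student can reproduce a smaller teacher exactly is to copy the teacher's weights and give every additional hidden neuron zero outgoing weights. The sketch below demonstrates this for a single hidden layer. The zero-padding construction, the logistic nonlinearity, and the helper `expand_hidden_layer` are assumptions made here for illustration and are not necessarily how the student networks in the simulations were initialized.

```python
import math
import random

def sigmoid(x):
    """Logistic nonlinearity (an assumed choice of sigmoidal function)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(weights, u):
    """Propagate an input vector u through every fully connected layer."""
    h = list(u)
    for W in weights:
        h = [sigmoid(sum(w * x for w, x in zip(row, h))) for row in W]
    return h

def expand_hidden_layer(W_in, W_out, extra, rng):
    """Add `extra` hidden neurons with random incoming and zero outgoing weights."""
    n_in = len(W_in[0])
    new_in = W_in + [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(extra)]
    new_out = [row + [0.0] * extra for row in W_out]
    return new_in, new_out

rng = random.Random(4)
W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]   # 3 inputs -> 2 hidden
W2 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(1)]   # 2 hidden -> 1 output
teacher = [W1, W2]

E1, E2 = expand_hidden_layer(W1, W2, extra=5, rng=rng)
student = [E1, E2]                                                # 3 -> 7 -> 1

u = [0.3, -1.2, 0.8]
teacher_out = forward(teacher, u)
student_out = forward(student, u)
```

The extra neurons are active but contribute nothing to the output, so the expanded network starts with more adjustable weights while computing the same mapping.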

### Network Training.

The theoretical analysis in this paper and the simulations in Fig. 2 pertain to online learning, where training data are sampled continuously from an (infinite) distribution defined by the input–output mapping of the teacher networks. However, in cases where we needed to compute a true gradient (specifically Figs. 4 and 6), we needed to define finite distributions to numerically evaluate the true gradient. Having the true gradient allows us to precisely specify values of task-relevant and task-irrelevant components of plasticity.

As emphasized in the main text, treating the finite input set as the entire distribution differs from treating it as a sample (or batch) from an infinite distribution, which would incur generalization issues because any finite sample is necessarily biased. The relationship between the results of our paper and the latter scenario is described in *SI Appendix*, *Regularization and Generalization Error*.
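To make concrete why a finite input set allows the components of plasticity to be specified precisely, the sketch below computes the exact gradient of a linear student over a finite set and builds a weight change from a task-relevant step along the negative normalized gradient plus a task-irrelevant step along a random unit direction. The linear model and the hypothetical names `gamma_relevant` and `gamma_irrelevant` are assumptions for illustration; intrinsic synaptic noise is omitted.

```python
import math
import random

def true_gradient(w, w_teacher, inputs):
    """Exact gradient of mean squared error over the whole finite input set."""
    grad = [0.0] * len(w)
    for u in inputs:
        err = sum((a - b) * x for a, b, x in zip(w, w_teacher, u))
        for i, x in enumerate(u):
            grad[i] += 2.0 * err * x / len(inputs)
    return grad

def unit(v):
    """Normalize a vector to unit length."""
    s = math.sqrt(sum(x * x for x in v))
    return [x / s for x in v]

def weight_update(w, w_teacher, inputs, rng, gamma_relevant=0.05, gamma_irrelevant=0.02):
    """Task-relevant step down the true gradient plus a random task-irrelevant step."""
    g_hat = unit(true_gradient(w, w_teacher, inputs))
    n_hat = unit([rng.gauss(0.0, 1.0) for _ in w])   # random direction, not orthogonalized
    return [-gamma_relevant * g + gamma_irrelevant * n for g, n in zip(g_hat, n_hat)]

rng = random.Random(5)
n = 8
w_teacher = [rng.uniform(-1, 1) for _ in range(n)]
w = [rng.uniform(-1, 1) for _ in range(n)]
inputs = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(100)]

dw = weight_update(w, w_teacher, inputs, rng)
g_hat = unit(true_gradient(w, w_teacher, inputs))
projection = -sum(d * g for d, g in zip(dw, g_hat))   # component of dw along descent direction
update_norm = math.sqrt(sum(d * d for d in dw))
```

Because the task-relevant magnitude exceeds the task-irrelevant magnitude in this sketch, the update always has a positive component along the descent direction, whatever random direction is drawn.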

In Figs. 4 and 6 learning is conducted on the finite input set U described in the previous section. At each learning cycle, we apply the weight update**21** using backpropagation. The normalized vectors

The network in Fig. 2 *C* and *D* is conducted online from an infinite distribution. At each learning cycle, we randomly draw a single input vector u of Gaussian components; i.e., **21** with

## Acknowledgments

We thank Fulvio Forni, Rodrigo Echeveste, and Aoife McMahon for careful readings of the manuscript; and Rodolphe Sepulchre and Stephen Boyd for helpful discussions. This work is supported by European Research Council Grant StG2016-FLEXNEURO (716643).

## Footnotes

¹To whom correspondence may be addressed. Email: tso24@cam.ac.uk or dvr23@cam.ac.uk.

Author contributions: D.V.R. and T.O. designed research; D.V.R. and A.P.R. performed research; D.V.R., A.P.R., and T.O. analyzed data; D.V.R. and T.O. wrote the paper; and T.O. interpreted results.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: Code for figure simulations in this paper is available at https://github.com/olearylab/raman_etal_2018.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1813416116/-/DCSupplemental.

- Copyright © 2019 the Author(s). Published by PNAS.

This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

## References

- ↵
- ↵
- Tomasi D,
- Wang G-J,
- Volkow ND

- ↵
- ↵
- Reader SM,
- Laland KN

- ↵
- Sol D,
- Duncan RP,
- Blackburn TM,
- Cassey P,
- Lefebvre L

- ↵
- ↵
- Maguire EA, et al.

- ↵
- ↵
- Black JE,
- Isaacs KR,
- Anderson BJ,
- Alcantara AA,
- Greenough WT

- ↵
- Lawrence S,
- Giles CL,
- Tsoi AC

- ↵
- Pereira F,
- Burges CJC,
- Bottou L,
- Weinberger KQ

- Krizhevsky A,
- Sutskever I,
- Hinton GE

- ↵
- Huang G-B

- ↵
- Takiyama K

- ↵
- ↵
- Saxe AM,
- McClelland JL,
- Ganguli S

- ↵
- ↵
- Saul L,
- Weiss Y,
- Bottou L

- Werfel J,
- Xie X,
- Seung HS

- ↵
- ↵
- ↵
- ↵
- ↵
- Loewenstein Y,
- Yanover U,
- Rumpel S

- ↵
- Ziv NE,
- Brenner N

- ↵
- ↵
- Puro DG,
- De Mello FG,
- Nirenberg M

- ↵
- Bloss EB, et al.

- ↵
- ↵
- Levin E,
- Tishby N,
- Solla SA

- ↵
- ↵
- Srivastava N,
- Hinton G,
- Krizhevsky A,
- Sutskever I,
- Salakhutdinov R

- ↵
- José Hanson S

- ↵
- Frazier-Logue N,
- José Hanson S

- ↵
- ↵
- Herculano-Houzel S

- ↵
- ↵
- ↵
- Hinton GE,
- Salakhutdinov RR

- ↵
- Kschischang FR

- Tishby N,
- Zaslavsky N

- ↵
- ↵
- Ghahramani Z,
- Welling M,
- Cortes C,
- Lawrence ND,
- Weinberger KQ

- Dauphin YN, et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Raman DV,
- Perez-Rotondo A,
- O’Leary TS

- ↵
- Montavon G,
- Orr GB,
- Muller KR

- Bengio Y

## Article Classifications

- Biological Sciences
- Neuroscience
