Research Article

A multiplicative reinforcement learning model capturing learning dynamics and interindividual variability in mice

Brice Bathellier, Sui Poh Tee, Christina Hrovat, and Simon Rumpel
aResearch Institute of Molecular Pathology, 1030 Vienna, Austria; and
bUnité de Neurosciences, Information et Complexité, Unité Propre de Recherche 3293, Centre National de la Recherche Scientifique, 91198 Gif sur Yvette, France


PNAS December 3, 2013 110 (49) 19950-19955; https://doi.org/10.1073/pnas.1312125110
Edited by Terrence J. Sejnowski, Salk Institute for Biological Studies, La Jolla, CA, and approved October 22, 2013 (received for review June 26, 2013)

Significance

Learning speed can strongly differ across individuals, in humans as well as in animals. Here, we measured learning speed in mice performing a discrimination task and developed a theoretical model, based on the reinforcement learning framework, to account for the differences between individual mice. We found that, with a multiplicative learning rule, the initial connectivity values of the model strongly determine the shape of the learning curves. This contrasts with current learning models based on additive rules, in which learning speed is typically determined by a single parameter describing the ability of connections to change their strength. Our findings suggest that the particular wiring architecture of the brain strongly conditions our ability to rapidly learn a new task.

Abstract

Both in humans and in animals, different individuals may learn the same task with strikingly different speeds; however, the sources of this variability remain elusive. In standard learning models, interindividual variability is often explained by variations of the learning rate, a parameter indicating how much synapses are updated on each learning event. Here, we theoretically show that the initial connectivity between the neurons involved in learning a task is also a strong determinant of how quickly the task is learned, provided that connections are updated in a multiplicative manner. To experimentally test this idea, we trained mice to perform an auditory Go/NoGo discrimination task followed by a reversal to compare learning speed when starting from naive or already trained synaptic connections. All mice learned the initial task, but often displayed sigmoid-like learning curves, with a variable delay period followed by a steep increase in performance, as often observed in operant conditioning. For all mice, learning was much faster in the subsequent reversal training. An accurate fit of all learning curves could be obtained with a reinforcement learning model endowed with a multiplicative learning rule, but not with an additive rule. Surprisingly, the multiplicative model could explain a large fraction of the interindividual variability by variations in the initial synaptic weights. Altogether, these results demonstrate the power of multiplicative learning rules to account for the full dynamics of biological learning and suggest an important role of initial wiring in the brain for predispositions to different tasks.

  • behavior
  • memory
  • cue competition
  • savings

It is commonly observed in animal behavior experiments, as much as in the classroom, that different individuals eventually learn the same task with strikingly different dynamics. Many factors have been shown to influence learning speed and/or performance at the group level, including genetic background (1), early experience (2), and contextual factors such as stress (3). In all these cases, the underlying idea is that these factors act on synaptic plasticity mechanisms to change parameters that modulate or gate synaptic updates, thereby modifying learning dynamics at the system scale (e.g., ref. 4). These ideas are in line with most theoretical models of biological learning (5), such as reinforcement learning (6, 7), which are based on the trial-by-trial update of mathematical variables that can be directly or indirectly related to updates of synaptic weights between neurons. In these models, one essential parameter controlling the speed of learning is the learning rate, which scales the learning rule (i.e., the rule according to which synapses are updated). Other factors are also known to influence learning speed, such as the level of noise and the particular learning rule (8), or metaparameters that dynamically control the balance between different aspects of learning behavior, such as the exploration vs. exploitation or memory storage vs. renewal trade-offs (3, 9). Importantly, however, all these potential variability factors are core parameters of the system that directly influence the dynamics throughout the entire learning process.

Whereas most theoretical models efficiently capture the dynamics of group learning curves, it is known that in many basic operant conditioning protocols, individual learning curves strongly deviate from the gradually increasing, negatively accelerated learning curves that result from group averaging (10). Individual learning curves in fact often display a step-like increase from the untrained to the trained performance level after a delay, sometimes termed the "presolution period" (11, 12), whose duration varies strongly from one animal to another. Little is known about the biological underpinnings of these learning delays and of their interindividual variations. Delays have been tentatively explained by a threshold between the experience accumulated by the animal and its behavioral response (12); however, the biological nature of such a threshold remains elusive.

Here, we combine theoretical modeling and learning experiments in mice and show that, in a model using a multiplicative learning rule, both the presence and the variability of learning delays across individuals can be quantitatively explained by variations of the initial connectivity between the representation of the relevant stimuli and the action selection network. Hence, we propose that initial connectivity could be a crucial, yet unidentified, factor of variability in learning.

Results

Learning Dynamics Strongly Vary Across Individual Mice.

To assess sensory learning dynamics in mice, we trained 15 inbred male mice to perform an auditory Go/NoGo discrimination task (Fig. 1A). In the task, water-deprived mice had to sustain licking at a spout during a delay period following an S+ sound to obtain a water drop and to refrain from licking after an S− sound to avoid a timeout before they could initiate the next trial. The two specific sounds chosen for the task were short broadband sounds (Fig. 1A) that we expected mice to readily discriminate perceptually, because we had observed in a previous study that learning curves for this pair of sounds were the fastest among all sound pairs tested and because the sounds could be readily decoded from activity patterns in the auditory cortex even in naive mice (13). To dissociate motor learning from sensory–motor associative learning, mice were first trained on an uncued operant conditioning task in which they learned to visit the water delivery port and to sustain licking during the delay period to obtain the reward in the absence of any auditory stimuli. The discrimination task was started once all mice obtained rewards in at least 90% of the operant trials. From this point on, mice slowly learned to differentially adjust their behavior in response to each of the two sounds (Fig. 1B) until they reached a maximal overall performance (i.e., the mean fraction of correct choices for S+ and S− trials) of 95.4 ± 2.7% (mean ± SD). Interestingly, the number of trials necessary to learn the task was highly variable: across the 15 mice, between 800 and 2,160 trials were needed to reach 80% performance (Fig. 1C). Whereas the mean group learning curve suggested a smooth and steady increase in performance from the start of training, individual learning curves deviated clearly from this average, as previously observed in a variety of tasks (10). Although some mice improved their performance very early in the training, we also observed several sigmoid-like learning curves displaying a delay period during which performance remained at chance level (Fig. 2A), very similar to the presolution periods described in other studies, including in symmetrical "two-alternative forced-choice" tasks (12, 14). We evaluated the duration of the delay period (number of trials until the mouse reaches 20% of its maximum performance increase) and of the subsequent rise period (number of trials for the mouse to go from 20% to 80% of its maximum performance increase), based on sigmoid functions fitted to the individual learning curves (Fig. 2B). The durations of both the delay and the rise periods varied across mice, and in more than half of the mice the delay was even longer than the rise period. More importantly, delay and rise durations were not correlated (Fig. 2C; Pearson's correlation coefficient = −0.13, P = 0.63), suggesting the presence of at least two independent factors determining the shape of an individual learning curve. We therefore looked for a mechanism that could explain a delay period during the initial part of the training, independent of the parameters that determine the speed of learning during the rise period.
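To make the delay/rise analysis concrete, the sketch below shows one way to extract the delay (t20%) and rise (t80% − t20%) from a binned learning curve. The authors worked in Matlab and their exact sigmoid parameterization is given in their Methods/SI, so the logistic form, starting values, and function names here are illustrative assumptions.

```python
# Illustrative sketch only: the logistic parameterization and starting
# values are assumptions, not the authors' exact fitting procedure.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(t, p_base, p_max, t_half, tau):
    """Performance rising from baseline p_base to asymptote p_max."""
    return p_base + (p_max - p_base) / (1.0 + np.exp(-(t - t_half) / tau))

def delay_and_rise(trials, performance):
    """Fit a sigmoid to a binned learning curve and return (delay, rise):
    trials to reach 20% of the asymptotic performance increase, and
    trials to go from 20% to 80% of it (cf. Fig. 2B)."""
    popt, _ = curve_fit(sigmoid, trials, performance,
                        p0=[0.5, 0.95, np.median(trials), 100.0])
    _, _, t_half, tau = popt
    t20 = t_half - tau * np.log(4.0)  # sigmoid inverted at 20% of the rise
    t80 = t_half + tau * np.log(4.0)  # sigmoid inverted at 80% of the rise
    return t20, t80 - t20
```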

Fig. 1.

Interindividual variability in auditory discrimination learning. (A) Sketch of the Go/NoGo behavioral task. For the sound spectrograms, the time and frequency axes range from 0 ms to 70 ms and from 1 kHz to 100 kHz (logarithmic scale), respectively. (B) Population learning curve for the overall performance of 15 mice. Binning: 100 trials. (C) Cumulative distribution of the number of trials necessary for each mouse to reach 80% correct performance measured from sigmoid functions fitted to the individual learning curves.

Fig. 2.

Discrimination performance increases following variable delays. (A) Individual learning curves for four mice. The red and blue lines represent the probability of correct performance for the rewarded sound (S+) and the nonrewarded sound (S−), respectively. The black line is the overall performance (average of the red and blue lines). Binning: 180 trials. (B) Overall performance for a fifth mouse (black line) fitted with a sigmoid function (red dashed line; the black dashed lines indicate the asymptotic values). The sigmoid function is used to evaluate the delay until behavioral performance reaches 20% of the asymptotic performance increase (delay = t20%) and the additional number of trials needed to reach 80% of the asymptotic performance increase (rise = t80% − t20%). (C) Plot of t80% − t20% against t20% for 15 mice, with the best linear fit to the data (black line). The magenta circle corresponds to the measurements in B.

A Reinforcement Learning Model of the Behavioral Task.

To do so, we designed a minimal reinforcement learning model of the auditory discrimination task, using a formalism that eases the biological interpretation of the different parameters. The model consists of three sensory units projecting onto a simple decision circuit (Fig. 3A). The activity of the sensory units is described by a 3D binary vector $\mathbf{x}$. The first dimension represents the port entry (trial initiation) and captures all associations between nonspecific stimuli and the reward that may occur in the initial uncued operant conditioning. The two other dimensions reflect the presence of the S+ and the S− sounds. The decision circuit consists of a unit that linearly sums the sensory inputs and responds in an all-or-none fashion to signal the decision to lick or not to lick (y = 0 or 1). In addition, it receives graded feed-forward inhibition from an inhibitory unit. Units of the circuit can be thought of as populations of functionally similar neurons that we model with a single activity variable (e.g., average population activity). Similarly, the connections between units can be thought of as large populations of synapses. For example, the graded feed-forward inhibition provided by the inhibitory unit can be envisioned as the summed output of a heterogeneous interneuron population receiving numerous, distributed axonal connections from sensory networks. Because we suppose the inhibitory unit to provide graded inhibition that is linear with respect to its inputs, we can simply model its output as a change of sign for the sum of its inputs. Hence, the model formally reduces to a single equation for the decision unit,

$$y = \Theta\left[\left(\mathbf{w}^{d} - \mathbf{w}^{i}\right) \cdot \mathbf{x} + \xi\right], \qquad [1]$$

in which $\Theta$ is the Heaviside step function. $\mathbf{w}^{d}$ and $\mathbf{w}^{i}$ are 3D positive vectors describing the excitatory synaptic weights from the sensory units to the decision and inhibitory units, respectively. The variable $\xi$ is a Gaussian noise process of unit variance that models the stochasticity of behavioral choices.
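A minimal sketch of Eq. 1 as reconstructed above may help fix ideas (Python rather than the authors' Matlab; all names are ours):

```python
# Sketch of the decision circuit (Eq. 1): lick iff summed excitation
# minus feed-forward inhibition, plus unit-variance Gaussian noise, > 0.
import numpy as np

rng = np.random.default_rng(0)

def decide(w_d, w_i, x):
    """One stochastic trial. w_d, w_i: positive 3-vectors of weights
    (port, S+, S-) onto the decision and inhibitory units; x: binary
    stimulus vector. Returns the lick decision y (0 or 1)."""
    drive = np.dot(w_d - w_i, x) + rng.normal()
    return int(drive > 0)  # Heaviside step
```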

Fig. 3.

Additive vs. multiplicative rule in a reinforcement learning model. (A) Sketch of the reinforcement learning (RL) model. (B) Learning curves for a model based on an additive learning rule for three different initial conditions sketched in Insets: (Top) balanced start, all initial weights are equal; (Middle) slightly unbalanced start, in which for the "port" unit the difference between the synaptic weights to the decision and the inhibitory unit ($w^{d}_{\mathrm{port}} - w^{i}_{\mathrm{port}}$) is initially equal to 1; and (Bottom) strongly unbalanced start, with a much larger initial difference. In the latter situation, the model initially responds only with lick decisions (arrow) until $w^{d}_{\mathrm{port}} - w^{i}_{\mathrm{port}}$ decreases. (C) Learning curves for a model based on a multiplicative learning rule for three different initial conditions sketched in Insets: (Top) all initial weights are large; (Middle) the synaptic weights between the sound units and the decision circuit ($w^{s}$) are 10-fold smaller than in Top; the low synaptic weights initially slow down discrimination learning; and (Bottom) $w^{s}$ is 100-fold smaller. Red and blue lines: probability of correct performance for the rewarded and the nonrewarded sound, respectively. Black line: overall performance.

For learning, the synaptic weights are updated trial-wise according to the stimulus received ($\mathbf{x} = [1\ 1\ 0]$ for S+ trials or $\mathbf{x} = [1\ 0\ 1]$ for S− trials) and the result of the model's decision (R = 1 for a reward, R = −1 for no reward). Following common reinforcement learning models (7), we first chose to implement the additive learning rule

$$\Delta w_j^{d} = \alpha\, f\!\left[R - \gamma\left(\mathbf{w}^{d} - \mathbf{w}^{i}\right) \cdot \mathbf{x}\right] x_j\, y, \qquad [2]$$
$$\Delta w_j^{i} = -\alpha\, f\!\left[R - \gamma\left(\mathbf{w}^{d} - \mathbf{w}^{i}\right) \cdot \mathbf{x}\right] x_j\, y, \qquad [3]$$

in which α is the learning rate and $x_j y$ is a Hebbian term that conditions updates to coactivation of pre- and postsynaptic units. In this implementation of the model, the learning rule for the inhibitory unit is nonlocal, as its update depends on the activity y. Note that these equations are equivalent to a model with excitatory and inhibitory inputs from a population of sensory neurons directly impinging on the decision neuron, because the inhibitory unit only reverses the sign of its input. The central term corresponds to an expectation error as used in the Rescorla–Wagner model (15) or in temporal difference learning (7). This term is the difference between the reward R and the prediction $\gamma\left(\mathbf{w}^{d} - \mathbf{w}^{i}\right) \cdot \mathbf{x}$, which corresponds to the Q value of canonical reinforcement learning models (SI Methods). γ is a parameter that sets the asymptotic performance of the model. In contrast to canonical reinforcement learning models, we suppose that positive expectation errors are more strongly weighted than negative ones, as expressed by the asymmetric function $f(e) = \delta e$ if $e > 0$ and $f(e) = e$ if $e \le 0$ (the parameter δ is typically larger than 1). We introduced this function to account for the fast improvement of performance for rewarded trials and the slower improvement for nonrewarded trials observed in all mice (Fig. 2A). However, it is noteworthy that such an asymmetry is typically observed in mice (16) and monkeys (6) in the activity of dopaminergic neurons of the basal ganglia that code for reward expectation errors (Fig. S1). The model has three core parameters (α, γ, δ), to which we added three parameters describing the initial connectivity of the model at the beginning of the training (initial conditions). The first two are $w^{d}_{\mathrm{port}}(0)$ and $w^{i}_{\mathrm{port}}(0)$, the initial weight values of the connections from the "port" unit to the decision and inhibitory units, respectively. The third parameter is $w^{s}(0)$, the initial value of the four connections from the sound units to the decision circuit, which we supposed to be the same at the beginning of the training.
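One trial-wise update under Eqs. 2 and 3, as reconstructed above, can be sketched as follows (the clipping that keeps weights nonnegative is our assumption, motivated by the weights being defined as positive vectors):

```python
import numpy as np

def additive_update(w_d, w_i, x, y, R, alpha, gamma, delta):
    """One trial of the additive rule (Eqs. 2 and 3 as reconstructed).
    w_d, w_i, x: NumPy arrays; y: 0/1 decision; R: +1 reward, -1 none."""
    err = R - gamma * np.dot(w_d - w_i, x)  # reward expectation error
    err = delta * err if err > 0 else err   # asymmetric weighting f
    w_d = np.clip(w_d + alpha * err * x * y, 0.0, None)  # Hebbian term x_j * y
    w_i = np.clip(w_i - alpha * err * x * y, 0.0, None)  # nonlocal in y
    return w_d, w_i
```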

In this form, the model produces stochastic responses on a trial to trial basis, similar to mouse behavior. However, for simplicity and computational speed, it is more advantageous to directly compute the average response probabilities than single instantiations of the response sequence. This is done by transforming the above-described stochastic equations into deterministic probability equations as described in SI Methods.
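Given the unit-variance Gaussian noise in Eq. 1, the trial-averaged lick probability has a simple closed form, which is one standard way to obtain such probability equations (the authors' full derivation, including averaging of the weight updates, is in their SI Methods):

```python
# With y = Theta(drive + noise) and standard normal noise,
# P(y = 1) = P(noise > -drive) = Phi(drive).
import numpy as np
from scipy.stats import norm

def lick_probability(w_d, w_i, x):
    return norm.cdf(np.dot(w_d - w_i, x))
```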

When simulating the model (Fig. 3B), we observed no delay in the learning curves when $w^{d}_{\mathrm{port}} - w^{i}_{\mathrm{port}}$ initially lies between the two asymptotic values toward which it converges as learning goes on (SI Methods). All learning curves for these initial conditions resembled inverted exponentials. The only condition in which the model produced a delay in the overall performance learning curves was for much larger (or smaller) initial values of $w^{d}_{\mathrm{port}} - w^{i}_{\mathrm{port}}$. In this situation, the model initially operates in a saturation regime and systematically chooses one of the two possible responses irrespective of the stimulus, until $w^{d}_{\mathrm{port}} - w^{i}_{\mathrm{port}}$ has decreased enough (Fig. 3B). However, this phenomenon clearly could not explain the delay observed in the mouse learning curves, because none of the 15 mice systematically chose only one of the two responses at the beginning of the discrimination training (e.g., Fig. 2A). On the contrary, response probabilities in mice during the first 100 trials were often close to 50% for both S+ and S− trials (Fig. 2A and Fig. 4G). These qualitative observations indicated that the model as such could not explain the learning dynamics observed in mice.

Fig. 4.

Faster reversal learning is explained by the multiplicative learning rule. (A) Illustration of reversal learning, in which the sound spectrograms depict the rewarded and nonrewarded stimuli. (B) Population learning curves for the overall performance of 15 mice (dashed lines) during initial (black) and reversal training (purple). Binning: 180 trials. (C) Number of trials for individual mice to reach 80% behavioral performance in the initial training vs. the reversal training. (D) Population learning curves for the overall performance of 15 mice (dashed lines) and for the fitted reinforcement learning models endowed with a multiplicative learning rule (solid lines) during initial (black) and reversal training (purple). (E) Population learning curves for the overall performance of 15 mice (dashed lines) and for the fitted reinforcement learning models endowed with an additive learning rule (solid lines) during initial (black) and reversal training (purple). Unlike behaving mice, the additive model learns more slowly during reversal training. (F) Number of trials to reach 80% behavioral performance in the initial training vs. the reversal training for the additive (white circles) and the multiplicative (gray circles) models. (G) Population learning curves as in D, but including the performance for rewarded and nonrewarded trials. (H) Percentage of variance unexplained by the fitted models across both initial and reversal training when the additive learning rule is used (Additive), when the multiplicative learning rule is used (Multiplicative), when the synaptic diffusion term is omitted from the multiplicative model (β = 0), and when the expectation error function of the multiplicative model is symmetrical (δ = 1).

A Multiplicative Learning Rule Can Explain Delayed Learning.

We reasoned that the exponential-like learning curves in the nonsaturated conditions resulted from the near linearity of the time evolution equations of the synaptic weights (Eqs. 2 and 3), a consequence of the additive learning rule. In contrast, a multiplicative learning rule, as used in some neural network models (17–19) and machine learning applications (20), is expected to render Eqs. 2 and 3 strongly nonlinear. Interestingly, it was recently shown that the ongoing dynamics of synaptic spines (21) and boutons (22) in the mouse cortex are multiplicative rather than additive. We therefore tested whether a multiplicative learning rule would better account for the behavioral learning curves.

We changed the learning rate α for each excitatory synaptic connection $w_j$ into a weight-dependent parameter,

$$\alpha_j = \alpha\, w_j, \qquad [4]$$

whereas the rest of the model was unchanged. With this modification, the effective learning rate depends on the current synaptic weight. Hence, if learning starts with small synaptic weights, learning is initially slow but accelerates when synaptic weights become larger and start to influence the output decision. In line with this qualitative idea, our simulations showed that the multiplicative learning rule gave rise to delays in the learning curve when initial synaptic weights were small, whereas the delay vanished when the initial weights were large (Fig. 3C). In addition, the specific learning dynamics for rewarded and nonrewarded trials qualitatively agreed with the experimental measurements (Fig. 3C).
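In the sketch used above, the multiplicative variant amounts to a single change: the rate α is multiplied by the current weight of each connection (Eq. 4 as reconstructed):

```python
import numpy as np

def multiplicative_update(w_d, w_i, x, y, R, alpha, gamma, delta):
    """Eqs. 2 and 3 with the weight-dependent rate of Eq. 4: each
    connection j learns at alpha * w_j instead of alpha, so small
    weights learn slowly and learning accelerates as weights grow."""
    err = R - gamma * np.dot(w_d - w_i, x)
    err = delta * err if err > 0 else err
    w_d = np.clip(w_d + alpha * w_d * err * x * y, 0.0, None)
    w_i = np.clip(w_i - alpha * w_i * err * x * y, 0.0, None)
    return w_d, w_i
```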

Reversal Learning Is Faster Than Initial Learning in Mice.

We next wanted to test more rigorously the idea that multiplicative learning rules might mediate the acquisition of the auditory discrimination task. A prediction of the multiplicative learning rule is that learning is fast in situations in which synaptic weights are high. For example, if learning is initially slow because of weak initial synaptic weights, a task involving the same synapses but starting from a state in which the weights are high should be learned much faster and without a delay. Such a learning situation can be created in mice through so-called reversal training, which consists of switching the rewarded and nonrewarded sounds (Fig. 4A). Hence, the 15 mice initially trained on the sound discrimination task were subjected to a reversal. All mice learned the reversed task faster than the initial task (Fig. 4 B and C). Interestingly, we did not observe a significant correlation between reversal and initial learning speeds (Fig. 4C; correlation coefficient = 0.39, P = 0.15), consistent with the idea that the factor causing a delay in initial training contributes much less during reversal training.

The Multiplicative Learning Rule Can Explain Fast Reversal Learning.

When trying to fit the learning curves during initial and reversal training, we observed that both the additive and the multiplicative model failed to capture the learning dynamics (Fig. S2A). The reason for this failure lies in the learning rules expressed in Eqs. 2 and 3, which lead to the potentiation of only those synapses that drive correct behavior. In the case of the S+ unit, this means potentiation of the excitatory connection, whereas the connection to the inhibitory unit is weakened to very low levels. Hence, the connections that become relevant during reversal learning start again from very low weights (Fig. S2B). However, we observed that processes that lead to even just a slight reduction in the specificity of the potentiation of a given connection endow the model with the ability to capture the full dynamics of initial and reversal learning. In a biological sense, such processes can be reflected by heterosynaptic plasticity (23, 24), in which potentiation of a subset of synapses induced by simultaneous pre- and postsynaptic activity leads to potentiation of neighboring synapses as well, in a non-Hebbian manner (Fig. S2E). In addition, nonspecificity in potentiation could be caused by incorrect targeting or residual turnover of axonal boutons (25) or dendritic spines (26) (Fig. S2C). We modeled the latter process by a synaptic diffusion term added to the learning rules,

$$\Delta w_j^{d} = \alpha_j\, f\!\left[R - \gamma\left(\mathbf{w}^{d} - \mathbf{w}^{i}\right) \cdot \mathbf{x}\right] x_j\, y + \beta\left(w_j^{i} - w_j^{d}\right), \qquad [5]$$
$$\Delta w_j^{i} = -\alpha_j\, f\!\left[R - \gamma\left(\mathbf{w}^{d} - \mathbf{w}^{i}\right) \cdot \mathbf{x}\right] x_j\, y + \beta\left(w_j^{d} - w_j^{i}\right), \qquad [6]$$

where β, the fraction of "diffusing" synapses, is typically more than 100 times smaller than α. With this term, even connections that are not driving correct behavior have a residual increase, allowing for a fast reversal.
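In the running sketch, this can be added as a separate step applied on every trial. Note that the exchange form of Eqs. 5 and 6 above is reconstructed from the surrounding text (the authors' exact diffusion term is given in their SI Methods), so this implementation is an assumption:

```python
def diffuse(w_d, w_i, beta):
    """Assumed synaptic diffusion step (Eqs. 5 and 6 as reconstructed):
    a small fraction beta of each sensory unit's synapses is exchanged
    between its excitatory and inhibitory targets on every trial, so no
    pathway ever decays fully to zero (enabling fast reversal)."""
    return w_d + beta * (w_i - w_d), w_i + beta * (w_d - w_i)
```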

We quantitatively tested the ability of our multiplicative model to capture the dynamics of individual learning curves of mice during initial and reversal training. We observed a good match between the average of the individual fits and the average overall behavioral performance (Fig. 4D). Furthermore, the learning curves for S+ and S− trials were also reproduced with high precision (Fig. 4G). In contrast, the synaptic diffusion term was not sufficient in the additive model to produce a good fit when initialized with nonsaturated connectivity (Fig. 4E). The additive model always produced slower learning in the reversal (Fig. 4F), a pattern observed in none of the 15 mice (Fig. 4C). To precisely quantify the contribution of each feature of the full model to the goodness of fit, we measured the fraction of unexplained variance between the fitted learning curves and the behavioral measurements. This measure clearly indicated that the full multiplicative model with seven unconstrained parameters explains much more of the observed learning dynamics than either the additive model or the multiplicative model lacking the synaptic diffusion term or the asymmetric expectation error function (Fig. 4H), which also suggests that the parameters β and δ do not overfit the data.
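The goodness-of-fit measure itself is straightforward, assuming the standard definition of the fraction of unexplained variance:

```python
import numpy as np

def unexplained_variance(data, fit):
    """Fraction of variance left unexplained by a fitted curve (1 - R^2);
    the standard definition is assumed here for Fig. 4H."""
    return np.sum((data - fit) ** 2) / np.sum((data - np.mean(data)) ** 2)
```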

Initial Connectivity Can Explain a Large Fraction of the Interindividual Variability.

An interesting aspect of the multiplicative learning rule is its sensitivity to initial conditions. So far, we have shown fits of the model in which all parameters were allowed to vary to find the optimal match to an individual learning curve. However, we noted that very good fits could also be obtained when only the three initial connectivity parameters were allowed to vary across animals, while the four core parameters, including the learning rate α, were optimized under the constraint that they be the same for all 15 mice (Fig. 5A). In this setting, the learning curves of individual mice during both the initial and the reversal training could be well fitted by adjusting only the initial connectivity values at the beginning of the initial training. Initial conditions of the model could explain large differences in learning dynamics between two individuals that showed similar performance during initial training but striking differences during reversal training (Fig. 5 B and C). Further quantification (Fig. S3) indicated that initial conditions are sufficient to account for a large fraction of the observed variance and can explain the lack of correlation between delay and rise durations seen in Fig. 2C. This suggests that initial connectivity could be an important determinant of interindividual variability, one that can explain even nontrivial aspects of learning dynamics.

Fig. 5.

Initial connectivity can explain a large fraction of the interindividual variability. (A) Behavioral learning curves (dashed lines) and the fit obtained with the multiplicative model (solid lines) when only the initial connectivity parameters are allowed to vary across the 15 animals. Binning: 180 trials. (B and C) Two examples of single-mouse learning curves and of the fit obtained when only the initial connectivity parameters are allowed to vary across the 15 animals. Strikingly, strong discrepancies in the reversal learning curves (arrow) can be accounted for by different initial conditions.

Discussion

We have designed a reinforcement learning model that can reproduce individual learning dynamics in a cohort of mice performing an auditory Go/NoGo discrimination task. The model includes two crucial features: (i) a multiplicative learning rule that allows learning speed to change with increasing experience and (ii) a process that introduces imprecision in the potentiation of synaptic connections to accelerate behavioral switching after reversal (Fig. S2). The multiplicative rule creates a nonlinearity that allows fitting the sigmoid-like learning dynamics observed during initial training. It is not fully excluded that another type of nonlinearity (such as a threshold on learned associations that gates their impact on behavior) would lead to similar dynamics. However, when we modeled such a nonlinearity, although sigmoid-like curves could be generated under certain conditions, the model was not able to reproduce faster learning in the reversal, and other shortcomings were observed (Fig. S4). Hence, a nonlinearity alone does not seem to be a sufficient alternative to our model.

Our model includes an asymmetry of the reward expectation error signal (27), which was essential to capture the large differences in performance for S+ and S− trials that are specific to Go/NoGo compared with two-alternative forced-choice paradigms. To account for the observed behavior, the error signal must be much stronger for the presence of an unexpected reward (positive error) than for the absence of a reward (negative error), i.e., δ must be well above 1 (Fig. 5 B and C), similar to the activity of the neurons suspected to signal expectation errors (6, 16). By accelerating learning for positive outcomes, the asymmetry allows the animal to collect a maximum of rewards even when the outcome is uncertain. The asymmetric rule produces two learning speeds, as in models designed to adapt to different timescales of external fluctuations (28) or models that change state according to contextual inferences (29, 30), except that in our model the speed is adjusted as a function of its relevance for obtaining rewards. It is noteworthy that the asymmetry of the task and of the learning rule is not a condition for obtaining delayed learning curves. Large delays are also observed behaviorally in two-alternative forced-choice tasks (12, 14), and a multiplicative learning rule is also sufficient to model delayed learning in a symmetric task (Fig. S5).

Our model can precisely account for the learning dynamics in an auditory version of the Go/NoGo task. To what extent can it account for learning dynamics in other contexts? The model is a dynamical extension of the Rescorla–Wagner rule (15) and is therefore expected to reproduce a wide range of effects observed in classical conditioning. As an illustration, we demonstrated that the model reproduces cue competition effects (Fig. S6) as efficiently as the original rule. Going beyond the Rescorla–Wagner model, specific dynamic effects can also be modeled. One example is savings effects in relearning (31). We observed that if forgetting is modeled as a loss of precision in synaptic connections with a limited net loss of synapses, relearning with the multiplicative rule is faster than initial learning (Fig. S7). In some experimental paradigms (32–34), reversal learning is actually slower than initial learning. As a second example, we show that slower reversal learning can be modeled by either low synaptic diffusion or large initial synaptic weights (Fig. S8). When studying learning delays, Heinemann (12) observed that delays increase with the similarity of the two stimuli to be discriminated. As a last example, we show that this effect can be easily reproduced in our model by introducing correlations between the vectors representing the stimuli (Fig. S9).

What factors could determine the variability of initial wiring in a biological learning situation? Initial wiring could emerge during development from genetic or even stochastic causes, but also from postnatal learning experiences. Hence, our results suggest that multiplicative learning rules can give rise to a large variety of learning dynamics across individuals for different tasks, independent of genetic factors. Intriguingly, our behavioral (Fig. 4C) and theoretical results also suggest that slow learning in a particular task does not necessarily predict slow learning in the future. If a long enough period of training is used to overcome certain weak initial connections, subsequent learning based on these connections could occur in a much shorter time.

Methods

Experiments were performed with male C57BL/6J mice and complied with Austrian laboratory animal law guidelines (approval no. M58/001236/2010/8). Mice were trained twice a day in 30-min sessions of ∼200 trials. The pretraining lasted exactly six sessions for all mice. The reversal training was initiated independently for each mouse after it had performed three discrimination sessions with more than 90% correct performance.

In all graphs, the error bars indicate SEM. All analyses and simulations were performed in MATLAB. The additive model is fully described by Eqs. 1, 5, and 6, and the multiplicative model is obtained by making α proportional to the synaptic weight (Eq. 4). We fitted the response probabilities for S+ and S− trials using a brute-force approach to minimize the squared error between the binned learning curves of each mouse (bins of 180 trials) and the output of the model. Extended methods are found in SI Methods.
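A brute-force fit of this kind can be sketched as a plain grid search (the parameter ranges and grid resolutions actually used are in SI Methods, so the interface below is an assumption):

```python
import itertools
import numpy as np

def brute_force_fit(binned_curve, simulate, grids):
    """Scan a parameter grid and keep the setting minimizing the squared
    error between a binned mouse learning curve and the model curve.
    grids: dict of parameter name -> candidate values; simulate(**params)
    must return a model curve binned like the data (bins of 180 trials)."""
    best_params, best_err = None, np.inf
    names = list(grids)
    for combo in itertools.product(*(grids[n] for n in names)):
        params = dict(zip(names, combo))
        err = np.sum((simulate(**params) - binned_curve) ** 2)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err
```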

Acknowledgments

We thank H. Sprekeler, A. Destexhe, N. Kaouane, and members of the S.R. laboratory for helpful discussions and comments on the manuscript and A. Bichl, M. Ziegler, and M. Colombini for technical assistance. This work was supported by Boehringer Ingelheim GmbH and a postdoctoral fellowship (to B.B.) from the Human Frontier Science Program.

Footnotes

  • 1. To whom correspondence should be addressed. E-mail: brice.bathellier@unic.cnrs-gif.fr.
  • Author contributions: B.B. and S.R. designed research; B.B. and C.H. performed research; B.B. and S.P.T. analyzed data; and B.B. and S.R. wrote the paper.

  • The authors declare no conflict of interest.

  • This article is a PNAS Direct Submission.

  • This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312125110/-/DCSupplemental.

References

  1. Holmes A, Wrenn CC, Harris AP, Thayer KE, Crawley JN (2002) Behavioral profiles of inbred strains on novel olfactory, spatial and emotional tests for reference memory in mice. Genes Brain Behav 1(1):55–69.
  2. Kosten TA, Kim JJ, Lee HJ (2012) Early life manipulations alter learning and memory in rats. Neurosci Biobehav Rev 36(9):1985–2006.
  3. Luksys G, Gerstner W, Sandi C (2009) Stress, genotype and norepinephrine in the prediction of mouse behavior using reinforcement learning. Nat Neurosci 12(9):1180–1186.
  4. Tsai KJ, Chen SK, Ma YL, Hsu WL, Lee EH (2002) sgk, a primary glucocorticoid-induced gene, facilitates memory consolidation of spatial learning in rats. Proc Natl Acad Sci USA 99(6):3990–3995.
  5. Dayan P, Abbott LF (2001) Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems (MIT Press, Cambridge, MA).
  6. Schultz W, Dayan P, Montague PR (1997) A neural substrate of prediction and reward. Science 275(5306):1593–1599.
  7. Sutton RS, Barto AG (1998) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).
  8. Urbanczik R, Senn W (2009) Reinforcement learning in populations of spiking neurons. Nat Neurosci 12(3):250–252.
  9. Doya K (2002) Metalearning and neuromodulation. Neural Netw 15(4-6):495–506.
  10. Gallistel CR, Fairhurst S, Balsam P (2004) The learning curve: Implications of a quantitative analysis. Proc Natl Acad Sci USA 101(36):13124–13131.
  11. Krechevsky I (1938) A study of the continuity of the problem solving process. Psychol Rev 45:107–133.
  12. Heinemann EG (1983) The presolution period and the detection of statistical associations. Quantitative Analysis of Behavior: Discrimination Processes, ed Commons ML (Ballinger, Cambridge, MA), Vol IV, pp 21–36.
  13. Bathellier B, Ushakova L, Rumpel S (2012) Spontaneous association of sounds by discrete neuronal activity patterns in the neocortex. Neuron 76:435–449.
  14. Chase S, Schupak C, Ploog BO (2012) Attention, the presolution period, and choice accuracy in pigeons. Behav Processes 89(3):225–231.
  15. Rescorla RA, Wagner AR (1972) A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. Classical Conditioning II: Current Research and Theory, eds Black AH, Prokasy WF (Appleton Century Crofts, New York), pp 64–99.
  16. Cohen JY, Haesler S, Vong L, Lowell BB, Uchida N (2012) Neuron-type-specific signals for reward and punishment in the ventral tegmental area. Nature 482(7383):85–88.
  17. Gütig R, Aharonov R, Rotter S, Sompolinsky H (2003) Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. J Neurosci 23(9):3697–3714.
  18. Morrison A, Aertsen A, Diesmann M (2007) Spike-timing-dependent plasticity in balanced random networks. Neural Comput 19(6):1437–1467.
  19. Koulakov AA, Hromádka T, Zador AM (2009) Correlated connectivity and the distribution of firing rates in the neocortex. J Neurosci 29(12):3685–3694.
  20. Littlestone N, Warmuth MK (1994) The weighted majority algorithm. Inf Comput 108:212–261.
  21. Loewenstein Y, Kuras A, Rumpel S (2011) Multiplicative dynamics underlie the emergence of the log-normal distribution of spine sizes in the neocortex in vivo. J Neurosci 31(26):9481–9488.
  22. Grillo FW, et al. (2013) Increased axonal bouton dynamics in the aging mouse cortex. Proc Natl Acad Sci USA 110(16):E1514–E1523.
  23. Engert F, Bonhoeffer T (1997) Synapse specificity of long-term potentiation breaks down at short distances. Nature 388(6639):279–284.
  24. Harvey CD, Svoboda K (2007) Locally dynamic synaptic learning rules in pyramidal neuron dendrites. Nature 450(7173):1195–1200.
  25. De Paola V, et al. (2006) Cell type-specific structural plasticity of axonal branches and boutons in the adult neocortex. Neuron 49(6):861–875.
  26. Trachtenberg JT, et al. (2002) Long-term in vivo imaging of experience-dependent synaptic plasticity in adult cortex. Nature 420(6917):788–794.
  27. Niv Y, Duff MO, Dayan P (2005) Dopamine, uncertainty and TD learning. Behav Brain Funct 1:6.
  28. Kording KP, Tenenbaum JB, Shadmehr R (2007) The dynamics of memory as a consequence of optimal adaptation to a changing body. Nat Neurosci 10(6):779–786.
  29. Redish AD, Jensen S, Johnson A, Kurth-Nelson Z (2007) Reconciling reinforcement learning models with behavioral extinction and renewal: Implications for addiction, relapse, and problem gambling. Psychol Rev 114(3):784–805.
  30. Gershman SJ, Niv Y (2012) Exploring a latent cause theory of classical conditioning. Learn Behav 40(3):255–268.
  31. Ebbinghaus H (1885) Über das Gedächtnis: Untersuchungen zur experimentellen Psychologie [On Memory: Investigations in Experimental Psychology] (Duncker & Humblot, Leipzig, Germany). German.
  32. Rapp PR (1990) Visual discrimination and reversal learning in the aged monkey (Macaca mulatta). Behav Neurosci 104(6):876–884.
  33. Mair RG, Knoth RL, Rabchenuk SA, Langlais PJ (1991) Impairment of olfactory, auditory, and spatial serial reversal learning in rats recovered from pyrithiamine-induced thiamine deficiency. Behav Neurosci 105(3):360–374.
  34. Johnson C, Wilbrecht L (2011) Juvenile mice show greater flexibility in multiple choice reversal learning than adults. Dev Cogn Neurosci 1(4):540–551.