A dopamine mechanism for reward maximization

Individual survival and evolutionary selection require biological organisms to maximize reward. Economic choice theories define the necessary and sufficient conditions for such maximization, and neuronal signals of decision variables provide mechanistic explanations. Reinforcement learning (RL) formalisms use predictions, actions, and policies to maximize reward. Midbrain dopamine neurons code reward prediction errors (RPE) of subjective reward value suitable for RL. Electrical and optogenetic self-stimulation experiments demonstrate that monkeys and rodents repeat behaviors that result in dopamine excitation. Dopamine excitations reflect positive RPEs that increase reward predictions via RL; against these increased predictions, obtaining similar dopamine RPE signals again requires better rewards than before. The positive RPEs drive predictions higher again and thus advance a recursive reward-RPE-prediction iteration toward better and better rewards. Agents also avoid dopamine inhibitions that lower reward prediction via RL, which allows smaller rewards than before to elicit positive dopamine RPE signals and resume the iteration toward better rewards. In this way, dopamine RPE signals serve as a causal mechanism that attracts agents via RL to the best rewards. The mechanism improves daily life and benefits evolutionary selection but may also induce restlessness and greed.

these events (24,29), thus reflecting the unpredictability of the eliciting event (Fig. S1A) but without showing inhibition with event omission like the negative dopamine RPE response (Fig. S1B) (24), which suggests salience rather than bidirectional prediction error coding. Thus, the initial component may constitute a default signal that reports novel or poorly identified environmental events that are potential rewards and should be approached so as not to miss a reward; such a neuronal signal would therefore be beneficial for individual animals and evolutionary selection (21).
The distinction between the initial component and the main RPE signal requires temporal resolution in the ten-millisecond range; accordingly, some modern cellular imaging techniques with slower time courses, despite their elegance, may not allow distinction between the initial, prediction-dependent component and the genuine bidirectional RPE signal. The distinction would be helped by independent variation of the sensory and reward parameters of the eliciting event and by multivariate data analysis (19).
Aversive responses. Many studies, including our own, report excitations by aversive stimuli in some dopamine neurons (19,23,32). In analogy to rewards, punishers elicit sequential components of detection, identification and evaluation. Dissociations of these components suggest that most reported dopamine excitations by aversive stimuli reflect the sensory impact of the first saliency component rather than genuine aversive negative value of the punisher (19,23). Most dopamine neurons either do not code aversiveness at all (19) or show inhibitions that reflect the negative reward value (second component) (19,29,31). The inhibitory punisher responses are compatible with full value coding from positive (excitation with reward) to negative (inhibition with punishment) and thus can be aligned with the RPE account of dopamine signals. Further, dopamine neurons are excited by omission of punishment, which constitutes a negative prediction error of negative value and thus a positive RPE (25,29). Also, as relief from punishment is rewarding, its prediction elicits a dopamine excitation (28,31). In contrast to these responses that are compatible with the standard RPE account, a population of outlying dopamine neurons projecting to the posterior striatum shows excitation by aversive air puff but not by other aversive stimuli; as these neurons are not excited by punishment omission, they code neither general reward value nor value prediction error (31). Thus, apart from these exceptions, the 'canonical' dopamine RPE signal reflects reward value, together with its early sensory detection, rather than punishment.
Multiple phasic dopamine functions. In addition to the composite RPE signal, dopamine neurons show distinct slower and lower heterogeneous excitations and inhibitions that last from hundreds of milliseconds to several seconds and minutes; they arise in a wide variety of well-established tasks (1-3) (Fig. 2 blue). These changes were first observed in monkeys. They occur in various task epochs during or before large reaching movements, during consumption of expected rewards or over whole trial durations (Fig. S1C) (4,5,8). The movement changes are not precisely related to movement onset, fail to vary with postural adjustments and small reaching movements, and do not occur with precise and well-controlled elbow, hand and eye movements (15-17, 34, 35). Further, some dopamine neurons show ramping activity in anticipation of risky rewards (36). In rodents, which may show substantial task-unrelated behaviors (13), dopamine neurons show even larger varieties of slow and heterogeneous excitations or inhibitions that vary with movement speed and velocity (12, 37-42) and may also reflect inadvertent visual, auditory and proprioceptive sensory stimulation during ongoing behaviors. Some of these activities ramp up over several seconds toward the reward and last beyond it, reflecting sensory stimulation, movement, reward expectation, motivation and general behavioral activation (10, 43-47); the ramp is often absent without sensory feedback (27, 48-54). These changes are independent of any prediction, including reward prediction, and thus do not seem to reflect RPEs.
While these activities are slower and lower than the phasic RPE signal with rewards and reward-predicting stimuli, they might contain small RPE responses. TD implementations often assume small RPEs that backpropagate during learning via imaginary intervening stimuli from the primary reward to the next preceding stimulus (55-57). However, biological dopamine neurons do not respond to imagined stimuli (58); accordingly, monkey dopamine neurons do not show ramps in tasks with short stimulus-reward intervals and devoid of intervening events or movements (55), except with risky rewards as noted above (36). Interestingly, ramps are not essential for biologically plausible TD implementations when using larger backward steps (59). In rodents, dopamine neurons show ramps during learning (60). Ramps also occur during well-established tasks with intervening stimuli that may generate small RPEs arising from inaccurately perceived temporal reward predictions ('state uncertainty') (61). Dopamine neurons in monkeys tested in such richer tasks with multiple stimuli may also show ramping activity reflecting small RPE responses.
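To make the backpropagation idea concrete, here is a minimal tabular TD(0) sketch (an illustration only, not the implementation of the cited TD models); the number of states, learning rate and discount factor are arbitrary choices for the example.

```python
import numpy as np

# Minimal TD(0) sketch: a fixed sequence of states leading to a reward at the end.
# Over trials, the prediction error (delta, the RPE analogue) migrates backward
# from the reward toward the earliest predictive state, and all errors shrink
# as the predictions become accurate.

n_states = 5                      # hypothetical stimulus -> intervening steps -> reward
alpha, gamma = 0.1, 1.0           # learning rate and discount factor (arbitrary)
V = np.zeros(n_states + 1)        # V[n_states] is the terminal, post-reward state

for trial in range(1, 201):
    deltas = []
    for s in range(n_states):
        r = 1.0 if s == n_states - 1 else 0.0      # reward only at the last state
        delta = r + gamma * V[s + 1] - V[s]        # temporal-difference error
        V[s] += alpha * delta
        deltas.append(delta)
    if trial in (1, 20, 200):
        print(f"trial {trial:3d}: deltas = {np.round(deltas, 2)}")
```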
The large variety of sensory, motor and rewarding events eliciting slower dopamine changes in monkeys and rodents may not only reflect specific behavioral processes but may also include underlying general processes such as motivation, arousal and behavioral activation. The more stimuli and movements engage the animal, the more frequent and stronger the slower dopamine changes seem to be. These dopamine changes are often difficult to distinguish from the more phasic RPE signal and are sometimes lumped together with it into a single phasic dopamine change, in particular in rodents that often show multiple parallel behaviors with inadvertent, task-unrelated limb movements and locomotor activity (13). Recording methods should have temporal resolution of tens of milliseconds, and the statistics for neuronal analysis should use multivariate methods (12). Appropriate regression models should account for potential intercorrelations between reward expectation, movement speed, vigor, motivation, arousal and behavioral activation.
Besides these slower phasic dopamine changes, tonic levels of extracellular dopamine in the striatum may have permissive influences rather than phasically driving motor and cognitive functions, as the efficacy of dopamine receptor-stimulating drugs in alleviating Parkinsonian deficits indicates (Fig. 2 light blue, 'tonic'). Modulation of the influence of such tonic dopamine 'fuel for the brain' may also underlie the behavioral effects of dopamine agonists and neuroleptics.
Absence of dopamine responses despite reward prediction or delivery. Dopamine neurons fail to respond to stimuli and rewards that are fully predicted (8); when in addition general behavioral activation is limited, dopamine neurons may not show any task modulation at all (15). The first stimulus in such tasks would predict the ultimate reward, and its appearance at an unpredicted time should elicit an RPE response. However, due to the delay to the reward, temporal discounting (62) may reduce this response to a level that is indistinguishable from neuronal noise.
Multiple neuronal functions in other brain systems. The possibility of individual brain systems having more than one function is not entirely new. Multiple functions are classically seen in phylogenetically older brain structures like the amygdala, whose neurons are not only involved in fear conditioning but also process reward information (63) and exploratory states (64). The same amygdala neurons code reward value early in a trial and reward choice at trial end (65). In the hippocampus, place cells change their focus depending on locally available reward (66). Even in the cerebellum with its staunchly motor function, optogenetic stimulation drives behavior toward reward (67). In the monkey prefrontal cortex, the same neurons process several task components (68). Even neurons in primary visual cortex code reward information (69) and specific aspects of face motion together with arousal states (70). Neurons in primary motor cortex also code visual and, classically, somatosensory information (71,72). These mixed selectivities might constitute examples of energy saving in the brain by using the same neurons for different processes. Hence, the notion of one brain system having just one function might be an unrealistic myth, and dopamine neurons are no exception.

Summary. Phasic dopamine signals have multiple functions. The RPE signal is composed of an initial component that reflects the salience elicited by the detection of any sufficiently strong event, including reward-predicting stimuli and rewards (Fig. 2 brown), and a main component that reflects the true bidirectional RPE (red). Somewhat slower, non-RPE dopamine changes lasting for seconds or minutes (blue) occur during movement initiation or reward delivery, or reflect the TD ramp, reward risk, reward expectation, sensory stimulation or movement, or common underlying processes of general arousal or behavioral activation. An even slower change or outright tonic level of striatal and cortical dopamine concentration (light blue) is required for well-functioning motor and cognitive processes, as the deficits in Parkinson's disease and the behavioral effects of dopamine agonists and neuroleptics attest. Thus, the recognition of the slower dopamine signals provides for a more complete appreciation of dopamine functions. However, as interesting as they are, the slower dopamine functions do not play a major role in the proposed dopamine reward maximization mechanism, which relies on the RPE signal.

SI Text 2: Economic choice and utility
Economic utility is a mathematical representation of preferences (73) and thus provides a metric of subjective reward value. Rather than estimating subjective value on an ad-hoc basis, defined utility functions convey the general advantage of mathematical functions: they predict events that have not been used for their estimation. In economics, this prediction concerns the choice of goods, which are called rewards in biology. Utility is not objectively measurable but can be inferred from observable choice. Rewards are typically uncertain and risky, which contributes to the subjectivity of reward value. Thus, utility functions can be estimated from choice under risk. Choice under risk is also the basis of the four utility axioms of Expected Utility Theory (EUT) that define utility maximization under risk and provide a conceptual foundation for estimating utility from choice (74-76).
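For reference, the quantity that these axioms justify maximizing can be written in the standard textbook form (generic notation, not a formula taken from the cited works):

\[
EU(g)=\sum_i p_i\,u(x_i), \qquad g \succeq g' \iff EU(g) \ge EU(g'),
\]

where the gamble \(g\) delivers reward amount \(x_i\) with probability \(p_i\) and \(u\) is the utility function inferred from choice.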

Definition of risk.
Risk has three basic connotations: variability, loss, and gain. Among the many definitions of risk, economic risk concerns the higher statistical moments of outcome probability distributions, notably variance (second moment, a measure of spread), skewness (third moment, a measure of asymmetry), and kurtosis (fourth moment, a measure of symmetric outliers). This definition of risk requires fully known probability distributions (whereas the uncertainty in incompletely known probability distributions is referred to as ambiguity). Thus, economic risk focuses on variability and considers any outcome below the mean of a gamble as a (relative) loss and any outcome above the mean as a (relative) gain. A different, popular view defines risk as the probability of loss, which equals negative probabilistic value. This view focuses on loss and considers probability to represent variability, but the definition does not include the potential gain (without which risk-taking would be useless). The intuitive, everyday notion of risk often comprises a weighted combination of both definitions. Here I will follow only the economic definition.
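For a discrete gamble with outcomes \(x_i\) occurring with probabilities \(p_i\), the moments underlying this definition are the standard central moments (generic definitions, included here only for completeness):

\[
\mu=\sum_i p_i x_i,\qquad
\mathrm{Var}=\sum_i p_i (x_i-\mu)^2,\qquad
\mathrm{Skew}=\frac{\sum_i p_i (x_i-\mu)^3}{\mathrm{Var}^{3/2}},\qquad
\mathrm{Kurt}=\frac{\sum_i p_i (x_i-\mu)^4}{\mathrm{Var}^{2}}.
\]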

Basic choice design.
In the simplest and most controllable version, a monkey chooses on each trial between two discrete and distinct options (binary choice) that appear simultaneously to the right and left of center on a computer monitor in front of the animal. The animal chooses its momentarily preferred option by eye movement or joystick movement, or by contacting a specific spot on a touch-sensitive computer monitor. After the choice, the chosen option, and no other option, is paid out to the animal. Thus, the option set is collectively exhaustive (all options are included, and every option can be chosen) and mutually exclusive (only one option can be chosen in a trial).
The definition of economic risk based on reward probability distributions encourages simple experimental designs that demonstrate meaningful choices in rhesus monkeys. A very simple design comprises a 'safe' option with a specific reward amount and a 'risky' option with several probabilistically alternating reward amounts (whose probabilities sum to P = 1.0). Each option is represented by a distinct quantitative stimulus composed of one horizontal bar (safe option) or several bars (risky option), each inside a vertical rectangle (Fig. S2A top). The vertical position of each bar inside the vertical rectangle indicates the reward amount the animal would receive with a specific probability if it chooses that option. For example, with equiprobable choice options, two bars indicate two different reward amounts, each of them occurring on half the trials (P = 0.5 each amount); the animal receives only one of these reward amounts when it chooses that option. Thus, the reward amounts of a multi-amount option are collectively exhaustive (all amounts are included, and every amount can be delivered) and mutually exclusive (only one amount is delivered on a given trial).

Fig. S2. Utility estimation and maximization during risky choice in monkeys. (A) Meaningful choice according to reward goodness shown by compliance with first-order stochastic dominance (FOST). Top: ocular stochastic choice between a safe reward (blue) and a risky reward (red, probability P (each amount) = 0.5). Left option set: the gamble is always equal to or better than the safe reward (the gamble dominates the safe reward). Right option set: the safe reward is always equal to or better than the gamble. Bar height within each rectangle indicates juice amount (higher is more). Both stimuli of each option set are presented simultaneously on a computer monitor in front of the animal. Bottom: choice probability of each option indicates compliance with FOST. The occasional choice of the non-dominating option reflects the nature of the stochastic choice process. (B) Utility function estimated from the fractile chaining procedure. Higher variance defines more economic risk (red). Higher choice probability for the riskier option indicates risk-seeking. Physical reward amounts are ml of juice. (C) Schematic showing how convexity of the utility function predicts risk-seeking with a binary gamble. Respective distances to the utility of the mean physical gamble amount (u(EV); EV, expected (mean) value) indicate that large rewards are subjectively overweighted and small rewards underweighted. Hence, the utility gain from obtaining the large reward relative to mean utility is larger than the utility loss from obtaining the small reward, consistent with risk-seeking. Rectangle with 2 solid inside bars indicates a binary equiprobable gamble (P = 0.5 each reward amount). (D) Utility maximization at choice indifference. The options have the same utility (expressed as Certainty Equivalent, CE) when chosen equally frequently (P (choice safe reward) = 0.5; choice indifference); here, the animal chooses the gamble with the lower physical EV in half the trials, indicating that its choice follows utility rather than physical reward amount (note that the animal ranks rewards correctly according to their goodness, as shown in A). Top: options presented to the animal for psychophysical estimation. The safe reward occurs with probability P = 1.0 (blue); the risky rewards occur in random alternation with P = 0.5 each (red). Risk is defined as the reward difference between bottom and top outcome (variance; no skew). (E) Utility maximization with same mean physical reward amounts. Projection of the two tested gambles onto a globally convex utility function (red and blue dots) indicates higher Expected Utility (EU) of the riskier gamble (red) compared to the less risky gamble (blue) despite same mean physical amounts (dotted line in top display). Note that the higher utility (EU, red) in this test with the same monkey explains the risky choices in panel A by value-seeking rather than risk-seeking. Panels A, B and C newly created, and panels D and E modified (CC-BY), from own work (77).

Meaningful risky choice.
Any assessment of utility maximization rests on the assumption that animals' choices reflect the rank-ordered goodness of rewards; they should prefer unambiguously better rewards to worse rewards. This requirement can be tested formally by invoking first-order stochastic dominance in stochastic choice between probabilistic options, which is defined as every probabilistic option being at least as good as its alternative and better in at least one instance. Statewise dominance, its simplest version, is defined by every single reward of the better (dominant) option being at least as good as the corresponding reward in the alternative (dominated) option and better in at least one instance. In a test of statewise dominance, a monkey chooses between a safe reward (Fig. S2A blue) and a risky option containing two equiprobable but mutually exclusive rewards (P = 0.5 each) (red). Choice of the safe option results in delivery of a small reward on every trial, whereas choice of the risky option delivers either the same small reward or a larger reward (Fig. S2A left). Thus, the option with the occasionally (P = 0.5) larger reward is always objectively and subjectively equal to or better than its alternative, and never worse (given a positive value function, for example when satiety is ruled out); the risky option dominates the safe option. Monkeys choose the better (dominating) option on most trials (Fig. S2A bottom) (77) (occasional selection of the non-dominating option is typical for stochastic choice and may indicate exploration). Alternatively, preference for the risky option might reflect risk-seeking. To differentiate between the value-seeking and risk-seeking interpretations, the animal is presented with a safe reward that has the same amount as the top risky reward (Fig. S2A right): that safe reward (blue) is now always equal to or better than any reward of the risky gamble (red) and thus is the dominating option. Monkeys mostly choose the safe over the risky option (bottom) and thus follow the better reward. This choice pattern is not specific to two-reward gambles but applies also to three-reward gambles in which each reward occurs with P = 1/3 (78). The animals' choices also follow first-order stochastic dominance when both options have two probabilistic rewards (Fig. 3A). Thus, the animals' choices satisfy first-order stochastic dominance; they seem to understand their stimuli, they distinguish between good and less good options, and they choose the better option. Thus, the design can be used for estimating utility from choices and then for assessing utility maximization.
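In its standard formulation (stated here generically in terms of cumulative distribution functions, not quoted from the cited studies), option A first-order stochastically dominates option B when

\[
F_A(x) \le F_B(x)\ \ \text{for all } x, \quad \text{with strict inequality for at least one } x,
\]

where \(F_A\) and \(F_B\) are the cumulative distribution functions of the reward amounts offered by the two options; any agent with a monotonically increasing value function should then prefer A.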
Utility estimation. Utility functions can be reliably estimated in laboratory settings during choice under risk from a sequence of psychophysically estimated CEs at choice indifference (choice of each option with P = 0.5) between an adjustable safe reward and a pre-set equiprobable gamble, using the so-called fractile or chaining procedure (79,80). Starting with a CE for an equiprobable gamble containing the minimal and maximal tested reward amounts (each delivered at P = 0.5), the procedure assesses new CEs with progressively finer resolution by iteratively setting gamble rewards to previously estimated CEs, thus obtaining more closely spaced CEs across the entire test range (for details, see (77)). The CEs are then used to fit a full utility function using multiple splines or power, exponential, logarithmic or multi-parameter functions. The fitted utility functions with the reward amounts used for monkeys are often S-shaped, gradually steepening with increasing reward amounts (convex to the origin), then becoming more linear and flattening gradually with larger rewards (concave) (Fig. S2B) (77,81,82). Such sigmoid utility functions have been conceptualized before for humans (83,84). Utility functions estimated for safe rewards using the Random Utility Model (85,86) show similar profiles (87), suggesting that the nonlinear transformation from objective physical reward amount to subjective utility applies to both risky and safe rewards. The convexity may partly reflect the small reward amounts required for testing the large number of trials typical of monkey experiments, although monkey utility functions can also be concave (88) (the 'well-behaved' concave human utility function typically concerns more substantial amounts).
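The logic of the chaining step can be sketched in a few lines of code; this is an illustration only, with the behavioral CE measurement replaced by a hypothetical simulated_CE() function standing in for the psychophysical titration, and a purely illustrative convex 'true' utility.

```python
# Hedged sketch of the fractile (chaining) procedure described above; not the
# authors' implementation. The animal's measured certainty equivalent (CE) is
# replaced by simulated_CE(), which assumes an illustrative convex utility.

def simulated_CE(low, high):
    """Stand-in for the measured CE of an equiprobable (P = 0.5) gamble,
    assuming an illustrative 'true' utility u(x) = x**2 on [0, 1]."""
    u = lambda x: x ** 2
    u_inv = lambda y: y ** 0.5
    return u_inv(0.5 * u(low) + 0.5 * u(high))

def fractile(low, high, depth):
    """Recursively bisect the utility scale: each CE is assigned the mean
    utility of the two gamble outcomes that produced it."""
    points = {low: 0.0, high: 1.0}
    def split(lo, hi, d):
        if d == 0:
            return
        ce = simulated_CE(lo, hi)
        points[ce] = 0.5 * (points[lo] + points[hi])
        split(lo, ce, d - 1)
        split(ce, hi, d - 1)
    split(low, high, depth)
    return dict(sorted(points.items()))

for amount, utility in fractile(0.0, 1.0, 2).items():
    print(f"amount {amount:.3f} ml -> utility {utility:.2f}")
```

Each estimated CE is assigned the mean utility of the two gamble outcomes that produced it, so repeated splitting yields (amount, utility) pairs at utility levels 0.25, 0.5, 0.75, and so on, which are then fitted as described above.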
The convex curvature of the utility function with small reward amounts corresponds to the risk attitude observed in binary choice between a risky gamble and a safe reward (Fig. S2C): relative to the utility of the mean gamble reward u(EV), the utility gain from receiving the top gamble reward exceeds the utility loss from receiving the bottom gamble reward. Thus, the gain exceeds the loss, and the animal would prefer the risky gamble to the safe reward when their EVs are the same; it would be risk-seeking, exactly as shown empirically (Fig. S2D red), and choice indifference requires the safe reward to exceed the EV (blue). By contrast, the concave utility curvature with larger rewards indicates a lower gain from the top gamble reward, and the animal would be risk-avoiding. Such risk attitudes are indeed observed in monkeys with two-reward gambles and three-reward gambles (77,78), confirming that utility functions estimated under risk predict choice under risk. Further, the utility functions obtained from stochastic choices have cardinal numeric properties (defined as being immune to offset and gain change). Utility functions with these properties are mathematically appropriate for neuronal response functions that themselves have cardinal properties (which allow z-score-normalized responses to be meaningfully averaged across neurons, as is often done in neurophysiology). Thus, in keeping with the general purpose of mathematical functions, utility functions, rather than ad-hoc estimated subjective values, allow prediction of choices that have not been used for estimating the utility function.
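A worked numerical example may help; the utility function \(u(x)=x^2\) is purely illustrative, not an estimate from the cited data. For an equiprobable gamble between 0.2 and 0.8 ml:

\[
EV=0.5\ \text{ml},\qquad u(EV)=0.25,\qquad
EU=0.5\,u(0.2)+0.5\,u(0.8)=0.34>u(EV),\qquad
CE=\sqrt{0.34}\approx 0.58\ \text{ml}>EV .
\]

Because the certainty equivalent exceeds the gamble's mean amount, choice indifference requires a safe reward larger than the EV, i.e., the agent is risk-seeking; a concave function reverses the inequality.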
Compliance with EUT utility axioms and Prospect Theory. The estimated utility functions are valid within the constraints of EUT (linearly weighted probability, fixed reference point, restriction to gains). Choices of monkeys satisfy the first three EUT axioms within well-defined option sets: (i) the choices are complete, revealing preferences for all options (77); (ii) the choices are transitive and thus consistent (if option A is preferred to option B, and option B is preferred to option C, then option A should be preferred to option C) (35); (iii) the choices follow the trade-off between probability and reward amount defined by the continuity axiom (89). In tests of the independence axiom (IA), the fourth and most demanding EUT axiom, monkeys show some outright preference reversals and many graded preference changes. These IA non-compliances suggest that subjective value includes not only the usually nonlinear weighting of reward amount (utility) but also nonlinear probability weighting; such subjective value, defined by a combination of weighted amount (utility) and weighted probability, corresponds well to monkeys' choices (90). Nevertheless, monkeys choose more consistently than humans, who violate the independence axiom often and severely (91,92). The long experience of monkeys with the experiment and its reward distributions might partly explain their good performance (93).
The IA tests demonstrate the importance of nonlinear probability weighting, which constitutes one of the tenets of Prospect Theory (94). When probabilities are presented in random alternation, monkeys overweight the influence of low probabilities on expected utility and underweight the influence of high probabilities, which is represented by an inverted-S shape of the probability weighting function (95) and resembles the weighting of instructed probabilities in human choices. By contrast, block-wise probability changes result in a regular-S shape of the probability weighting function (81), which resembles the weighting of experienced probabilities in humans. Thus, reward probability weighting is more variable than the weighting of reward amount in utility functions. The larger variability may reflect the complexity of probability perception, which requires memory of event occurrence and is more delicate than the perception of reward amounts, which are closer to detection by primary sensory receptors. Thus, both reward amount and reward probability, as components of reward value, are subjective.
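One common way to write such a weighting function is the one-parameter Prelec form, given here only as an illustration of how inverted-S and regular-S shapes can be parameterized (the cited studies may use different functional forms):

\[
w(p)=\exp\!\left(-(-\ln p)^{\alpha}\right),\qquad 0<\alpha<1:\ \text{inverted-S},\qquad \alpha>1:\ \text{regular-S},
\]

with the subjective value of a gamble then combining both nonlinearities as \(V=\sum_i w(p_i)\,u(x_i)\).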
Further, monkeys' choices of safe rewards are generally compatible with the reference dependency of Prospect Theory. The utility functions adapt to changing reward distributions over several months (96). Finally, Prospect Theory suggests a convex loss function that is steeper than the gain function. When considering the full test range of reward amounts, reception of the small reward of a gamble might constitute a loss relative to the mean gamble amount. The convexity of monkeys' utility functions in the lower amount range might then correspond to the convexity of the loss function of Prospect Theory; however, the function is less, and not more, steep at such low values.
These properties of monkeys' choices suggest that utility functions provide a fundamental, concept-based and mathematically defined approach for expressing reward value on a subjective scale. Together with probability weighting, utility functions incorporate key features of subjective reward value that can be inferred from economic choice. The successful behavioral characterization of utility encourages investigations of its underlying neuronal signals.
Empirical maximization of utility rather than physical amount. Utility maximization can be empirically tested when monkeys choose safe and risky rewards equally frequently ('choice indifference'). With small rewards, where the utility function is convex, monkeys typically prefer risky to safe rewards when the two options have the same EV (equal mean amounts), as described above (Fig. S2D red) (35,97); the animals choose both options equally frequently (choice indifference) only when the safe reward exceeds the EV of the risky reward (certainty equivalent, CE, in Fig. S2D blue) (77). Despite these different reward amounts, the monkey chooses each option equally frequently, suggesting that the two options have the same utility (as we infer utility from choice). Thus, the animal chooses the larger (safe) reward on only half the trials. On the other half of trials, it chooses the on-average smaller (risky) reward, although it could have chosen the larger (safe) reward. The compliance with first-order stochastic dominance confirms that the animal understands the stimuli and distinguishes very well between reward amounts (Fig. S2A). Thus, the animal behaves 'as if' it chooses the smaller (risky) reward because of its utility, and surely not for its smaller physical amount, which suggests that utility matters more to the animal than reward amount. (Use of the term 'as if' is an attempt to bypass the circular argument that utility is inferred from choices, and that the animal chooses to maximize the utility that is inferred from choices.) The dominance of utility over physical amount occurs also with larger rewards, where the utility function is concave. Here, choice indifference occurs when the risky reward is larger than the safe reward (77,78): by choosing the smaller safe reward on half the trials, the animal foregoes the larger risky amount and thus chooses according to utility rather than physical reward amount. Thus, their choices under risk demonstrate that monkeys choose 'as if' they are maximizing utility rather than physical reward amount.
The tendency towards utility maximization is also present in choices between two gambles with the same mean reward amount but different variance (Fig. S2E). This 'mean-preserving spread' provides a minimal-confound test of risky choice (98,99), as both options have the same EV (mean physical reward amount), the same number of rewards (N = 2) occurring with the same probability (P = 0.5, small nonlinear probability weight), and no skewness. Using a previously estimated utility function (Fig. S2B), the projection of the gambles onto the function suggests that the riskier gamble has higher expected utility than the less risky gamble (Fig. S2E) (77). If monkeys chose both gambles equally frequently, they would be choosing according to their same EV (same mean amount), but if they prefer the riskier to the less risky gamble despite the same EV, they choose according to utility (specifically expected utility, EU). Indeed, the animals prefer the riskier gamble 'as if' they are maximizing utility rather than physical amount (see Fig. 2B). Thus, monkeys maximize utility irrespective of different or same mean physical reward amounts (Fig. S2D, E).
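Continuing the illustrative convex utility \(u(x)=x^2\) from above (an assumption, not the estimated monkey utility function), a mean-preserving spread of an equiprobable gamble raises its expected utility:

\[
A=\{0.3,\,0.7\}\ \text{ml}:\ EU_A=0.5\,(0.09+0.49)=0.29;\qquad
B=\{0.1,\,0.9\}\ \text{ml}:\ EU_B=0.5\,(0.01+0.81)=0.41>EU_A ,
\]

although both gambles have the same EV of 0.5 ml. An agent maximizing expected utility under a convex function therefore prefers the riskier gamble B, whereas a concave function yields the opposite preference.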

SI Text 3: Risk attitude explained by dopamine excitation and inhibition
The results from artificial manipulations of dopamine signals suggest that animals choose more frequently events that result in dopamine excitation than events that result in dopamine inhibition (100-103). This notion is interesting in light of risk-seeking and risk-avoiding. The two reward outcomes of binary gambles elicit, respectively, positive or negative reward prediction errors (RPE) relative to the mean reward; the RPEs induce corresponding dopamine excitations and inhibitions. Thus, agents' risk attitude may depend on the relative strength of dopamine excitations and inhibitions elicited by the RPEs of gamble rewards. The relative strength of dopamine excitations and inhibitions depends on the curvature of the agent's utility function coded by dopamine neurons (77). A convex (gradually steepening) curvature reflects overweighting of large relative to small gamble rewards, resulting in risk-seeking (Fig. S2C), whereas a concave (gradually flattening) curvature reflects underweighting of large relative to small gamble rewards, resulting in risk-avoiding.
In the simple case of a binary equiprobable gamble and a convex utility function, the dopamine excitation from the larger-than-mean gamble reward (eliciting a positive RPE) would have a stronger effect on an agent's behavior than the inhibition from the smaller-than-mean gamble reward (eliciting a negative RPE). Here, the agent would prefer the gamble to a safe reward (or to a less risky gamble); thus, the agent would be risk-seeking. But in the case of a concave utility function, the dopamine excitation from the larger-than-mean gamble reward would have a weaker effect than the inhibition from the smaller-than-mean gamble reward; the agent would be risk-avoiding.
The relative strength of dopamine excitation and inhibition would differentially update reward predictions. An overweighting of the dopamine excitation from the large gamble reward (positive RPE signal) relative to the inhibition from the small gamble reward (negative RPE signal; convex utility function) would result, via RL, in stronger gamble-predicting dopamine excitation by a riskier gamble as compared to a less risky gamble, even when both options have equal mean reward amount (Fig. S2E, red vs. blue). As a result, the stronger dopamine excitation would drive an agent toward the riskier gamble. Risk-seeking can be particularly beneficial when it results in particularly valuable rewards, namely outliers in positively skewed gambles (78). In the opposite case, an underweighting of the dopamine excitation from the large gamble reward relative to the inhibition from the small gamble reward (concave utility function) would result in weaker gamble-predicting dopamine excitation by the riskier as compared to the less risky gamble and lead to risk avoidance. Thus, the dopamine excitation and inhibition induced by gamble rewards may provide a neuronal basis for behavioral risk attitudes.
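The argument can be illustrated with a minimal simulation (an assumption-laden sketch, not the paper's model): a scalar reward prediction is updated with different learning rates for positive and negative prediction errors, standing in for unequal behavioral impact of dopamine excitations and inhibitions.

```python
import random

# Prediction learning with asymmetric weighting of positive vs. negative
# prediction errors, analogous to stronger or weaker impact of dopamine
# excitations vs. inhibitions (illustration only).

def learn_prediction(outcomes, alpha_pos, alpha_neg, n_trials=5000, seed=0):
    """Learn the predicted value of an equiprobable binary gamble when positive
    and negative prediction errors update the prediction at different rates."""
    rng = random.Random(seed)
    v = 0.0
    for _ in range(n_trials):
        r = rng.choice(outcomes)
        delta = r - v
        v += (alpha_pos if delta > 0 else alpha_neg) * delta
    return v

gamble = [0.2, 0.8]                          # ml, P = 0.5 each; EV = 0.5 ml
print(learn_prediction(gamble, 0.2, 0.1))    # excitation overweighted -> prediction above EV
print(learn_prediction(gamble, 0.1, 0.2))    # inhibition overweighted -> prediction below EV
```

With the stronger rate on positive errors the learned prediction settles above the gamble's EV (a risk-seeking valuation); with the stronger rate on negative errors it settles below the EV (risk-avoiding).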

SI Text 4: Postsynaptic effects of dopamine signals
Modeling work suggests that phasic dopamine RPE signals are suitable to serve as effective teaching signals for reinforcement learning (RL). A temporal difference (TD) model using the kind of dopamine-like RPE signals recorded during electrophysiological experiments can mediate the learning of a series of spatial tasks in a similar manner as monkeys do in the laboratory (59). Artificial computer models using TD RL, often together with deep neural networks, acquire champion-level and superhuman performance in Backgammon, Atari, Go, Chess and Shogi games (104-107). The obvious architectural differences between the artificial computer models and biological brains stress the importance of the TD RL algorithm over the particular implementation of the algorithm (108). Nevertheless, we are dealing with real brains and need to assess how the biological dopamine RPE signal may affect the neuronal mechanisms mediating TD RL.
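The core quantity shared by these models is the TD error, written here in its standard form (the dopamine RPE signal is proposed to resemble \(\delta_t\)):

\[
\delta_t = r_{t+1} + \gamma\,V(s_{t+1}) - V(s_t), \qquad V(s_t) \leftarrow V(s_t) + \alpha\,\delta_t ,
\]

where \(r_{t+1}\) is the obtained reward, \(V\) the learned value prediction, \(\gamma\) the temporal discount factor and \(\alpha\) the learning rate.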
The neurophysiological responses of dopamine neurons in the midbrain constitute the first step of processing RPEs. To result in RL, this signal is propagated along axons to postsynaptic structures that implement RL. The implementation involves several subsequent steps: (1) Axons of dopamine neurons branch profusely in the ventral and dorsal striatum (nucleus accumbens, caudate nucleus, putamen), frontal cortex and amygdala (109). Axonal conduction can fail at any axonal branch and reduce action potential frequency there (110,111). Such and other mechanisms may underlie the known heterogeneous dopamine release in postsynaptic structures (112). Thus, the dopamine action potentials may not propagate into the whole axonal arborization tree but may be routed selectively to specific terminal regions.
(2) Once an action potential reaches a varicosity, its action on dopamine release is subject to local influences. Dopamine release is regulated by presynaptic dopamine autoreceptors (113,114) and by rich presynaptic interactions with other neurotransmitters (115). For example, acetylcholine modulates striatal dopamine release depending on the timing of incoming dopamine action potentials relative to the activity of cholinergic neurons (116). Differences in these local interactions will result in region-specific dopamine influence.
(3) The presynaptic influence of other neurotransmitters on dopamine release may result in dopamine release without involving changes in action potential frequency (117), which is also apparent from behavioral tests (118). Thus, some behavioral dopamine functions may not require action potential changes at dopamine cell bodies.
(4) As soon as dopamine is released into the extracellular space, the dopamine transporter recaptures the intact molecule into the dopamine neurons within tens to hundreds of milliseconds and thus limits the dopamine action on postsynaptic receptors (119). The known local variations in dopamine transporter concentration result in regional differences of dopamine level. For example, dopamine transporter concentration is lower in the nucleus accumbens than in the dorsal striatum (120). With different recapture speeds, postsynaptic dopamine action follows different time courses in different brain regions.
(5) The impact of dopamine neurotransmission on postsynaptic neurons depends on the type of postsynaptic dopamine receptor. Striatal neurons carrying the D1 or D2 type of dopamine receptor are affected by dopamine in opposite ways and project to different nuclei of the basal ganglia (121). These distinct striatal neurons affect learning and choices in opposite ways (122-124). While it is unknown whether these effects of striatal neurons derive from their dopamine inputs, the existence of dopamine receptors on these neurons makes the efferent routing of dopamine influences at least possible.
(6) Dopamine neurotransmission enhances the synaptic weights from recently active inputs onto postsynaptic striatal and cortical neurons while leaving inactive synapses unchanged, thus following the notion of a three-factor Hebbian learning mechanism based on the anatomical triad of cortical inputs, local striatal or cortical neurons, and dopamine varicosities (125-129). These changes require D1 and D2 dopamine receptors (130,131) and occur at dendritic spines within a time window compatible with behavioral conditioning (132). These plasticity effects may be key mechanisms by which the dopamine RPE signal mediates RL.
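The following toy simulation (an illustrative sketch under simplifying assumptions, not a biophysical model of the cited plasticity studies) shows the three-factor logic: a Hebbian eligibility trace marks recently coactive synapses, and a global dopamine RPE term then gates which weights change.

```python
import numpy as np

# Three-factor Hebbian update: weight change requires (1) presynaptic activity,
# (2) postsynaptic activity, and (3) a dopamine RPE signal; inactive synapses
# (zero eligibility) remain unchanged.

rng = np.random.default_rng(0)
n_inputs = 8
w = np.zeros(n_inputs)            # synaptic weights onto one striatal neuron
eligibility = np.zeros(n_inputs)
eta, trace_decay = 0.1, 0.8       # learning rate and eligibility decay (arbitrary)

for t in range(50):
    pre = (rng.random(n_inputs) < 0.3).astype(float)       # presynaptic cortical activity
    post = float(w @ pre + rng.random() > 0.5)              # postsynaptic activity (toy rule)
    eligibility = trace_decay * eligibility + pre * post    # factors 1 & 2: Hebbian coincidence
    dopamine_rpe = rng.normal()                             # factor 3: phasic dopamine (stand-in)
    w += eta * dopamine_rpe * eligibility                   # weight change gated by dopamine

print(np.round(w, 3))
```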
Thus, while allowing temporally and spatially precise measurements, the action potentials of midbrain dopamine neurons carry only the first step of RPE information; subsequent steps are likely to modulate the influence of the RPE signal depending on other presynaptic inputs and on the specificity of postsynaptic neurons. While these multiple modulations are important to take into account, they do not fundamentally change the concept of dopamine influence on learning, as attested by the reward-related behavioral effects of optogenetic excitation and inhibition of midbrain dopamine cell bodies and striatal dopamine axons.

Fig. S1. Diverse dopamine responses in addition to the reward prediction error (RPE) signal. (A) Prediction-dependent unidirectional attentional coding of non-rewarding pictures in monkey: excitation by surprising appearance of a probabilistic picture (P = 0.25), but absence of inhibition by picture omission (P = 0.75) (population average from 14 dopamine neurons). Reproduced from own work (CC-BY) (24). (B) For comparison: bidirectional juice RPE coding in the same dopamine population as in A (CC-BY) (24). (C) Heterogeneous slow phasic excitations and inhibitions in different monkey dopamine neurons during behavior. Time marker applies to all subpanels (200 msec). Reproduced from own work (4,8).