Prospective and retrospective values integrated in frontal cortex drive predictive choice
Edited by Peter Strick, University of Pittsburgh Brain Institute, Pittsburgh, PA; received April 30, 2022; accepted September 21, 2022
Significance
Animals must flexibly estimate the value of their actions (action-value) to successfully adapt to a changing environment. The brain is thought to estimate action-values from two different sources, namely the action-outcome history (retrospective value) and knowledge of the environment (prospective value), but how these different estimates of action-value are reconciled to make a choice is not well understood. Here we found that as mice learn the state-transition structure of a decision-making task, retrospective and prospective values become jointly encoded in the preparatory activity of neurons in the anterior-lateral motor cortex (ALM). Suppressing this preparatory activity in expert mice returned their behavior to a naïve state. These results reveal a neural circuit that injects structural knowledge into action selection to promote predictive decision-making.
Abstract
To act deliberately in a volatile environment, the brain must frequently reassess the value of each action (action-value). Choices can initially be made by trial and error, but once the dynamics of the environment are learned, they can be made from knowledge of the environment. Action-values constructed from experience (retrospective values) and those derived from knowledge (prospective values) have been identified in various regions of the brain. However, how and in which neural circuit these values are integrated to execute the chosen action remains unknown. Combining reinforcement learning and two-photon calcium imaging, we found that the preparatory activity of neurons in a part of the frontal cortex, the anterior-lateral motor (ALM) area, initially encodes the retrospective value but, after extensive training, jointly encodes the retrospective and prospective values. Optogenetic inhibition of ALM preparatory activity specifically abolished the expert mice's predictive choice behavior and returned them to a novice-like state. Thus, the integrated action-value encoded in the preparatory activity of ALM plays an important role in biasing actions toward knowledge-dependent, predictive choice behavior.
An animal adapting to a volatile environment must flexibly change its choices, an ability that depends on assessing the value of each action (action-value). In a novel environment, a naïve animal would assess action-values from the history of actions and outcomes (retrospective values). Computationally, the retrospective value of an action can be calculated as a discounted sum of past rewards obtained by that action (1, 2). However, choice strategies based on retrospective values are slow to change, especially in a volatile environment, because updating the values requires trial and error. By contrast, after learning a model of the environment, action-values computed from knowledge of the environment (prospective values) allow the animal to predict the state transition and promote flexible, predictive choices (3, 4).
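Schematically (in notation that is ours, not the authors'), the two estimates differ only in the direction of the discounted sum over the rewards r obtained by an action a:

Q_retrospective(a, t) ≈ Σ_{k≥1} λ^k · r_a(t − k),   Q_prospective(a, t) = E[ Σ_{k≥0} γ^k · r_a(t + k) ],

where λ and γ (both between 0 and 1) discount past and future rewards, respectively; the exact formulations used in this study are given in SI Appendix, Methods.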
Accumulating evidence suggests that different cortico-striatal circuits compute different types of values. Lesion studies in rodents and neuroimaging studies in humans implicate the dorsolateral striatum in rodents and the posterior striatum in humans in history-dependent, retrospective valuation, and the dorsomedial striatum in knowledge-dependent, prospective valuation (5–8). In the cortex, history-dependent (retrospective) value is often detected in posterior regions (9–11). In contrast, anterior regions of the cortex are often implicated in choice behaviors that depend on prediction, such as the state transition within a trial (11–13), state transitions over trials (e.g., a change of reward condition) (14–16), rule representation (17, 18), and implicit knowledge about the task (19–21). These results suggest that the lateral or posterior regions of the striatum, and possibly posterior regions of the cortex, are involved in retrospective valuation (9, 10, 22, 23), whereas regions of the prefrontal cortex and their downstream pathways are involved in prospective valuation (3, 4, 24–27). An important question is how and in which brain region these values are integrated to ultimately select the optimal action, especially when different brain regions propose different values for the same action.
A subregion of the mouse frontal cortex, the anterior-lateral motor (ALM) cortex (28), is involved in the planning and control of orofacial movements (29–31). When a mouse is trained to lick either the left or right water port to obtain water, a large number of ALM neurons show persistent activity that predicts the future lick direction, often referred to as preparatory activity (29, 30, 32, 33). Neurons in ALM send direct projections to the orofacial control regions in the midbrain and brainstem (34). ALM also projects to the orofacial region of the striatum (35), and its descending pathway to the superior colliculus is involved in the left-versus-right lick competition (36). The preparatory activity in ALM is maintained by a thalamocortical loop (37), which is under the influence of the basal ganglia pathway (38). Therefore, the anatomical connectivity of ALM makes it well suited for integrating reward-expectation signals through its cortico-basal ganglia-thalamic loop and biasing the future lick direction. Indeed, higher-order motor cortices including ALM are known to encode value-related information (10, 39–41). However, it is not known whether ALM encodes retrospective values, prospective values, or integrates both to make a choice.
In the present study, we investigated how the value representation in ALM develops as animals learn the state-transition structure of a task. For this, we employed a fully deterministic two-alternative forced-choice task in which mice choose to lick either the left or right water port for water. The rewarding water port alternates after every tenth reward delivery (state transition). Overtrained mice spontaneously changed their action near the state transition, suggesting that they predicted its approach. A reinforcement learning model that explains this choice behavior predicted two distinct forms of value dynamics depending on the training stage. In the early phase of training, the expectation of reward should increase monotonically after each rewarded trial because the mice do not yet know about the state transition. In contrast, in the later phase of training, the expectation should be down-regulated because overtrained mice have learned that the currently rewarding action will yield no reward in the near future. The trial-by-trial dynamics of ALM preparatory activity showed a striking concordance with this prediction throughout learning. Photoinhibition of ALM in experts delayed reversals and suppressed their spontaneous (and premature) changes of action toward the state transition, returning the experts to a naïve-like state. Our results demonstrate the importance of ALM in deliberative action selection that incorporates knowledge about the task.
Results
Deterministic State-Transition Task (Reward10).
To study the neuronal processes that integrate retrospective and prospective values in mice, we employed a fully deterministic two-alternative forced-choice task. In this task, the reward condition (state) changes deterministically after every tenth reward delivery (Fig. 1 A and B; we term this the "Reward10" task). This deterministic task structure allows the analytical calculation of the prospective values. Here, the prospective value of an action is formulated as the discounted sum of future rewards obtained by the action (1, 4) (SI Appendix, Methods). Each trial begins with a preparatory period signaled by an auditory Ready signal (noise), during which the head-fixed mouse was required to withhold licking. Upon hearing the Go tone, the mouse was allowed to respond by licking either the left or right water port. Premature licks between the Ready and Go signals were punished by a loud noise. The first detected lick within the response window (3 s) was considered the response. In the left (right) state (Fig. 1B), only the response toward the left (right) water port was rewarded by a drop of water (~4 μl); otherwise, no water was provided. The next trial resumed after an inter-trial interval (ITI, ~10 s). After mice harvested the prefixed number of rewards, the rewarding port was reversed without notice (Fig. 1B, State transition).
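Because the state transitions are fully deterministic, the prospective value of repeating the currently rewarded action has a simple closed form for illustration (the formulation actually used is given in SI Appendix, Methods). If k rewards remain in the current state and γ is the discount factor, the discounted sum of the rewards still obtainable by that action before the transition is

V_prospective(k) = Σ_{t=0}^{k−1} γ^t = (1 − γ^k) / (1 − γ),

which shrinks with every reward consumed within the state (ignoring rewards obtainable after the transition).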
Fig. 1.

Mice reached the criterion performance (n = 43 mice, success rate ≥ 70%, Fig. 1C) after a period of training. The behavior data were classified into two categories for comparison (novice, session success rate <60%; expert, ≥70%; n = 103 and 675 sessions, respectively). Experts made reversals faster than novices (Fig. 1 D and E; for premature rate and reaction time, SI Appendix, Fig. S1 A and B). However, their trial-by-trial success rate peaked before the tenth trial in the Reward10 task (SI Appendix, Fig. S1C, experts), indicating that experts selected the alternative to the previously rewarded action (Fig. 1B; win-shift) instead of selecting the same action (win-stay). Indeed, the win-shift probability increased toward the state transition in experts (Fig. 1 F and G), which occasionally led to reversals without mistakes (win-shift reversals, Fig. 1D, filled circles). This increase in win-shifts suggests that the animals predicted the approach of the state transition.
Reward5 and Short ITI Task.
To further test whether the predictive choice behavior observed in the Reward10 task depends on the key task parameter, the maximum number of rewards in one state, we subjected Reward10 expert mice to a five-rewards-per-state task (Reward5 task, SI Appendix, Fig. S2 A–C and Movie S2). After additional training in the Reward5 task, the expert mice showed a success-rate peak before the fifth trial (SI Appendix, Fig. S2 D and E). They also showed a larger increase in win-shift probability (SI Appendix, Fig. S2 F and G), which led to a higher probability of making reversals at the state transition without mistakes compared to Reward10 experts, suggesting that the Reward5 experts shifted the timing of anticipatory reversals toward earlier trials in a state. To exclude the possibility that the animals rely on internal timing, the expert mice were also tested in a short inter-trial interval task (ITI reduced from 10 to 5 s). If a mouse relied on internal timing to make predictive reversals, halving the ITI would significantly increase errors; however, it did not affect the performance (SI Appendix, Fig. S3). These results demonstrate that mice can predict the state transition triggered by the number of rewards delivered.
Reinforcement Learning.
To understand the value dynamics behind this predictive choice, we resorted to the framework provided by reinforcement learning (2). Previous studies showed that Q-learning, which depends on history-dependent action-values (retrospective values), can explain choice behavior well in dynamic foraging tasks (9, 10, 22). However, we found that the retrospective value alone cannot account for the increase of win-shifts in our task (SI Appendix, Fig. S4). If the choice is made solely from the history of actions and outcomes, the expectation increases monotonically following each gain of reward (Fig. 2A, cyan, retrospective value). It is therefore not surprising that reinforcement learning based solely on the retrospective value showed only a monotonic decrease of win-shift probability, which is inconsistent with the choice behavior of the experts. However, if the mice understood that the water supply is limited, the amount of water available in the near future decreases following each consumption of water (Fig. 2A, dark blue, prospective value). This decrease in expectation could explain the increase of win-shifts in experts. Indeed, a model with action-values constructed as a hybrid of the retrospective and prospective values with a mixing weight w (Fig. 2A; hybrid Q-learning [hQ-learning]; SI Appendix, Methods) was sufficient to explain the increase of win-shifts in experts (Fig. 2B). We quantified the goodness-of-fit to the behavior using the Bayes factor (42). This analysis showed that hQ-learning outperformed Q-learning and another simple strategy, win-stay-lose-shift (WSLS) (SI Appendix, Fig. S5 A and B); we therefore use hQ-learning to characterize the choice behavior of the animals hereafter. Our model predicted contrasting value dynamics in novice and expert animals: the action-value in novices increases monotonically as more rewards are consumed, whereas the action-value in experts decreases as the state transition approaches (Fig. 2C). This model also implies that the decay of action-value contributes to the increase of win-shifts in experts.
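To make the structure of the hybrid model concrete, a minimal Python sketch of one possible hQ-learning agent in this task is shown below. The variable names, the specific prospective-value term, and the softmax parameterization are illustrative assumptions on our part; the exact model definition and fitting procedure are given in SI Appendix, Methods.

```python
import numpy as np

def other(action):
    return "right" if action == "left" else "left"

def run_hq_agent(n_trials=300, rewards_per_state=10, alpha=0.3, gamma=0.8,
                 w=0.7, beta=3.0, bias=0.0, seed=0):
    """Simulate a hybrid Q-learning (hQ) agent in a Reward10-like deterministic task.

    Retrospective values are updated with a standard delta rule. The prospective
    value is approximated as the discounted sum of the rewards still available in
    the current state, assigned to the action the agent currently believes to be
    rewarded (an illustrative simplification: an expert is assumed to count the
    rewards already delivered within a state).
    """
    rng = np.random.default_rng(seed)
    q_retro = {"left": 0.0, "right": 0.0}
    rewarding, rewards_taken = "left", 0           # true environment state
    history = []

    for _ in range(n_trials):
        remaining = rewards_per_state - rewards_taken
        prospective = (1.0 - gamma ** remaining) / (1.0 - gamma)

        # Hybrid action-value: mix retrospective and prospective with weight w.
        believed = max(q_retro, key=q_retro.get)   # action believed to be rewarded
        q = {a: (1 - w) * q_retro[a] + w * (prospective if a == believed else 0.0)
             for a in ("left", "right")}

        # Softmax choice with inverse temperature beta and a side bias.
        p_right = 1.0 / (1.0 + np.exp(-(beta * (q["right"] - q["left"]) + bias)))
        choice = "right" if rng.random() < p_right else "left"
        reward = 1.0 if choice == rewarding else 0.0

        q_retro[choice] += alpha * (reward - q_retro[choice])   # delta-rule update

        if reward:
            rewards_taken += 1
            if rewards_taken == rewards_per_state:              # state transition
                rewarding, rewards_taken = other(rewarding), 0
        history.append((choice, reward))
    return history
```

In this sketch, the mixing weight w plays the same qualitative role as in the text: with w near 0 the agent behaves like a purely retrospective Q-learner, whereas with larger w the hybrid value, and hence the tendency to stay, falls as the counted rewards approach the state transition.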
Fig. 2.

The parameters of the reinforcement learning model characterize the choice behavior of the mice. The learning rate, which controls the speed of reversal, increased significantly from the novice to the expert state (Fig. 2D), suggesting that experts rely on more recent outcomes to update their choice behavior. The mixing weight, which indicates the relative contribution of the prospective value to the choice behavior, also increased significantly in experts (Fig. 2E). The other parameters (the inverse temperature, choice bias, and the discount factor for future rewards) did not show significant changes over learning (SI Appendix, Fig. S5 D and E). This analysis suggests that novices largely relied on the retrospective value, whereas experts incorporated the prospective value into their action selection.
Two-Photon Ca Imaging.
To investigate whether the neuronal activities in ALM follow the model prediction, we conducted longitudinal two-photon calcium imaging in ALM. A previous study showed that neurons in ALM layer 5 have higher lick-direction selectivity than those in layer 2/3 (32). Therefore, we focused on layer 5 neurons to maximize our chance of imaging task-selective cells (400–600 μm deep, Fig. 3A, Movie S1, and SI Appendix, Fig. S6A for the imaging depth distribution). In total, we analyzed fluorescence signals from n = 3,645 cells from novices and n = 5,392 cells from experts (n = 51 sessions from 12 mice in total; duplicate image planes within the same learning stage were excluded; see data selection criteria in SI Appendix, Table S1). Among them, n = 965 cells in novices and n = 2,741 cells in experts showed trial-type-selective activity. ALM is known for its preparatory activity, which signals the future action several seconds before action initiation (29, 30, 32, 33). To capture this preparatory activity, we focused on the ramping-up cells, which showed their highest activity during the preparatory period (Ready to Go). Among the task-selective cells, n = 156 cells from novices and n = 459 cells from experts were ramping-up cells. Examples of trial-type-selective ramping-up cells from a novice and an expert mouse are shown in Fig. 3 B and C. Their trial-averaged activities were modulated by the upcoming choice (Fig. 3 B–E, Right reward (red) versus Left reward (blue)), consistent with previous studies (10, 29, 30, 32, 43). However, these cells also changed their trial-averaged activity between rewarded and unrewarded trials even for the same action (Fig. 3 B and D, Right reward (red) versus Right no-reward (purple); Fig. 3 C and E, Left reward (blue) versus Left no-reward (cyan); see SI Appendix, Fig. S6 B–F for more examples). In this task, the outcome of an action can be reliably guessed from the past outcome unless the state transition occurs. Therefore, we suspected that these cells do not directly encode the uncontrollable future outcome but rather respond to the previous outcome. To investigate this possibility, we regressed each neuron's activity on the upcoming and previous choices and on the upcoming and previous outcomes. This regression analysis revealed that a significant fraction of ALM ramping-up cells responded to the previous outcome (SI Appendix, Fig. S7 A and B; novice, 13.5 ± 1.9%; expert, 18.0 ± 2.1%; no significant difference between novice and expert states, P = 0.34), a fraction significantly larger than that encoding the future outcome (novice, 8.2 ± 1.0%, previous versus future, P = 3.5 × 10^−6; expert, 7.9 ± 1.0%, P = 2.8 × 10^−7). A similar fraction (SI Appendix, Fig. S7B; novice, 9.6 ± 0.9%; expert, 14.7 ± 1.2%; no significant difference between novice and expert states, P = 0.19) of ALM ramping-up cells was selective to the upcoming choice (lick direction). The responses to the future action and the previous outcome are suggestive of action-value coding neurons. Action-value neurons would respond differently to the first, the second, and each of the following rewards, but this regression analysis cannot capture such dynamics.
Fig. 3.

To further investigate how each gain of reward modulates the preparatory activity, we plotted the population average of all the ramping-up cells at various reward counts (Fig. 4 A and B). In this analysis, only the rewarded trials were included. Unrewarded trials in this context mean that the animal changed the lick direction in the middle of one state or continued to lick in the same direction following the state transition (see SI Appendix, Fig. S8 for unrewarded trials and unpreferred lick direction data). In novices, each gain of reward monotonically increased the preparatory activity. In experts, each gain of reward initially increased but ultimately decreased the preparatory activity. This population average, computed without any clustering, showed a striking concordance between the trial-by-trial dynamics of preparatory activity and the dynamics of action-values (compare Figs. 2C and 4B; only the preferred lick in the rewarded trials is shown; for the other trial types, see SI Appendix, Fig. S8). To further investigate what fraction of neurons is selective to either choice or action-values, we regressed the activity of each neuron on the choice and action-values. Here, action-values (Q-values) were computed from hQ-learning fit to each session of each mouse and used as regressors. This regression analysis revealed that a larger fraction of ALM ramping-up cells was selective to action-value during the preparatory period (Fig. 4C (Action-value), novice, 20.7 ± 2.8%, expert, 30.5 ± 2.7%, Pthreshold = 0.025; significantly increased from novice to expert state, P = 0.011) than to the imminent choice (Fig. 4C (Choice), novice, 9.1 ± 0.6%, expert, 10.1 ± 0.6%, Pthreshold = 0.05; no significant difference between novice and expert states, P = 0.35; for non-ramping-up cells, SI Appendix, Fig. S7C). The fraction of action-value coding cells was significantly larger than that of choice coding cells in both novice and expert states (choice versus action-value; novice, P = 0.0067; expert, P = 4.2 × 10^−9). Among the action-value coding ramping-up cells, the majority was positively correlated with action-values in both novices and experts (ΣQ = Q_contra + Q_ipsi ≥ 0; novice, 72.4%, P = 9.11 × 10^−10; expert, 64.5%, P = 6.3 × 10^−10, binomial test; Fig. 4D). The population activity of positive action-value neurons (ΣQ ≥ 0) showed dynamics correlated with the action-values (SI Appendix, Fig. S9 A–C). The population activity of negative action-value neurons (ΣQ < 0), although a minor population in ALM, showed dynamics anti-correlated with the action-values (SI Appendix, Fig. S9 D–F). These neurons were active when unrewarded trials continued and may encode error-related information. Video analysis during the preparatory period did not detect any correlation between preparatory facial movements and action-values (SI Appendix, Fig. S10); therefore, facial movements cannot explain the action-value coding in ALM. These data strongly support the idea that the population activity of ALM ramping-up cells dominantly encodes the action-value, which is largely retrospective in novices but integrates both retrospective and prospective values in experts.
Fig. 4.

Task-Parameter Dependence of Action-Value Dynamics.
To further test whether the trial-by-trial dynamics of preparatory activity depend on the key parameter of the task, the maximum number of rewards per state, we imaged ALM layer 5 cells in Reward5 novice and expert mice. In total, we analyzed n = 1,561 cells from novices and n = 6,147 cells from experts (n = 44 sessions from n = 14 mice). Among them, n = 769 cells in novices and n = 3,273 cells in experts showed task-selective activity, and n = 116 cells in novices and n = 446 cells in experts were ramping-up cells (SI Appendix, Fig. S6 for single-cell examples and Fig. S7 D and E for history dependence). The session success rate initially dropped following the transition from the Reward10 to the Reward5 task (SI Appendix, Fig. S2B). During this low success rate period (defined as novice, session success rate <60%), the dynamics of population preparatory activity (without clustering) reverted to a monotonic increase. After a period of training, it evolved into the rise-and-decay dynamics again as the mice became experts (session success rate ≥70%; SI Appendix, Fig. S8 F–I), consistent with the model prediction (SI Appendix, Fig. S8J). The peak of activity shifted toward an earlier rewarded trial in Reward5 experts (compare SI Appendix, Fig. S8 D and I), suggesting that the dynamics of preparatory activity depend on the number of available rewards in one state rather than on an innate tendency to avoid repeating the same action. The regression analysis using the choice and action-values further confirmed that ALM layer 5 is enriched with action-value coding cells (SI Appendix, Fig. S7 F and G).
Decay of Preparatory Activity Precedes Win-Shift.
An implication of our model is that the decrease of action-value is the cause of the win-shift choice. We therefore investigated the differences in action-values before win-stay and win-shift choices in experts. To minimize the effect of choice and reward history, we selected the same action-outcome sequence, uninterrupted 10-wins (ten straight rewarded trials) with the same action, followed by either the same or a different choice (Fig. 5A, 10-wins & stay, and 10-wins & shift), and investigated the preparatory activity of positive action-value neurons (ΣQ ≥ 0). Because ALM neurons tend to have a preferred lick direction, we collected data from left-preferring neurons in uninterrupted ten left-rewarded trials and right-preferring neurons in uninterrupted ten right-rewarded trials. We then divided the data into two groups by the future (11th) choice: stay (same action) or shift (different action). We found that the action-value started to diminish several trials before shifting but not when staying with the same action (Fig. 5 B and C), supporting the idea that the decline of action-value is the precursor of the win-shift choice.
Fig. 5.

The experts often made premature reversals. To investigate the neural activity before a premature reversal, we also collected uninterrupted 9-wins followed by either stay (and rewarded) or shift (premature reversal, no reward). A similar trend was observed when the animals made a premature reversal (9-wins & shift, Fig. 5 D–F); the preparatory activity started to decline several trials before shifting. In the above 10-win and 9-win sequences, the actions and outcomes are identical throughout the rewarded trials. Therefore, the difference in the level of preparatory activity is likely to reflect a difference in the animal's internal state that leads to the different choice (stay or shift). The positive action-value neurons (ΣQ ≥ 0) were collected without regard to the future choice (stay or shift). Therefore, this analysis suggests a causal relationship between the preparatory activity of ALM and the actual choice behavior of expert mice.
Photoinhibition of ALM.
Finally, to test the causal role of ALM preparatory activity in predictive choice behavior, we used photoinhibition of ALM (29) (Fig. 6A and SI Appendix, Fig. S11, n = 8 expert mice). ALM photoinhibition and control stimulation targeting the headpost were conducted in alternating blocks (every trial in two to three consecutive states for ALM photoinhibition; three to seven consecutive states for control stimulation). We found that bilateral ALM inactivation during the preparatory period significantly delayed the reversal (Fig. 6 B and C and SI Appendix, Fig. S12A) and suppressed the increase of win-shift probability (Fig. 6 D and E), as if the experts had reverted to novices (compare Figs. 1 E–G and 6 C–E). This manipulation did not affect the reaction time (SI Appendix, Fig. S12B), suggesting that the effect is not caused by a motor deficit. If the function of ALM in this task were to maintain the memory of the previously rewarded action, photoinhibition during the ITI would also affect the success rate. However, longer photoinhibition during the ITI (−10 to −5 s before Go) had no significant effect (SI Appendix, Fig. S12C). This indicates that ALM preparatory activity plays a specific role in this task.
Fig. 6.

We further analyzed how the reinforcement learning parameters were affected by ALM photoinhibition during the preparatory period. This analysis revealed that the learning rate and the mixing weight were significantly reduced when ALM activity was suppressed before the action (Fig. 6 F and G). The other parameters did not show significant changes (SI Appendix, Fig. S12 D–F). The reduction of the learning rate corresponds to the slower reversal, and the reduction of the mixing weight corresponds to the lack of anticipatory reversals. These results suggest a temporally specific role of ALM in reflecting the prospective value in action selection in experts (SI Appendix, Fig. S13).
Discussion
The ability to learn the dynamics of a volatile environment would be advantageous, allowing individuals to predict how and when a state transition occurs and to prepare the next action. After extensive training in the deterministic state-transition task, our mice showed an anticipatory increase of win-shifts before the state transition. We take this anticipatory action as the behavioral signature of predicting the state transition. However, premature reversals (win-shifts before the state transition) are detrimental to the reward rate in our task. Given that there is a much simpler strategy that can yield a higher success rate, such as WSLS, why did most of the experts converge on the sub-optimal strategy that requires prediction? WSLS requires only the memory of the previous action-outcome pair and is therefore simpler than estimating the state transitions of the environment. WSLS should also lead to a sufficiently high session success rate in our task; the upper limit of the mean session success rate under WSLS is ~90.9% (ten rewards in 11 trials) for Reward10 and ~83.3% (five rewards in six trials) for Reward5. However, we did not observe experts showing a stable WSLS-like strategy (SI Appendix, Fig. S5 A and B), except for one animal. Indeed, most of the session success rates of the experts were below the upper limit of WSLS, partly due to the increased win-shifts toward the state transition. Interestingly, macaques and humans engaged in reversal tasks also showed similar predictive behavior (44, 45). This suggests that there is a common mechanism in the brain that triggers win-shift behavior in a predictable environment.
One possible mechanism is the innate tendency to explore different options (46). However, our Reward5 data showed that the tendency to make win-shifts was also adaptively increased (SI Appendix, Fig. S2F), suggesting that the increase of win-shifts is not solely explained by innate behavior but is rather an acquired behavior adjusted to the parameters of the environment. Another possible mechanism is that the expert mice were motivated enough to use their knowledge about the task, but a limited capacity for hidden-state representation (e.g., the number of rewards already delivered) in mice may hinder the full usage of such knowledge in the task. This might explain why the successful win-shift at the state-transition point (reversal without mistake) was higher in Reward5 than in Reward10, because Reward5 requires a smaller state space to represent the task. One unexplored possibility is the fast modulation of behavioral variability (e.g., the inverse temperature parameter in reinforcement learning). Although our reinforcement learning analysis did not detect changes in the inverse temperature during the photoinhibition trials (SI Appendix, Fig. S12D), it is possible that an exploration signal can be rapidly modulated to control the choice variability. In some species, the source of behavioral variability has been identified (47–49). In mammals, several neuromodulators and brain regions are implicated in exploration (50–52), and uncertainty is postulated to trigger exploration (50, 53). In our task, the probability of reward delivery is 100% for the correct choice; therefore, the uncertainty could arise from the hidden-state representation of the task. However, our Reward5 results also speak against uncertainty-driven exploration in our task: the Reward5 task has shorter states than the Reward10 task, so the uncertainty level should be lower in Reward5, yet the win-shift probability at the state-transition point was higher in Reward5. These results all point toward the idea that the value signal is modulated by knowledge of the task.
An approach unique to this study is the dissection of value along the retrospective and prospective directions. Much attention has been drawn to the distinction between model-free and model-based values (3, 4). This classification is based on the algorithm used to compute the value. The model-free action-value of an action is often calculated as the discounted sum of past rewards obtained by the action, which is equivalent to the retrospective value in this study. The model-based value usually refers to a value computed from an algorithm that requires complete knowledge about the task, such as the state-transition matrix representing the dynamics of the environment. However, to what extent model-based computation can be mapped onto computation in the real brain remains obscure. Meanwhile, a recent machine learning study pointed to the possibility that a deep, recurrent neural network trained with a "model-free" algorithm can behave like a "model-based" learner (meta-RL) (54). This suggests that different learning algorithms can converge to highly similar value dynamics. In this study, instead of defining the value by a specific algorithm, we adopted one of the most general definitions of value, the discounted sum of future rewards (prospective value). The dichotomy of value along the retrospective and prospective directions, combined with a simple deterministic state-transition task, provides an alternative approach to understanding the neural mechanism of value computation.
Value-related signaling in higher motor cortices has been consistently observed in many species. Neurons in the secondary motor cortex (M2) of rodents are known to encode information about the history of sensory inputs, actions, and outcomes (10, 39, 55–58). The supplementary eye field of primates encodes action and its value-related information before a saccade (41, 59). Neurons in the cingulate motor area also encode reward information before action initiation (60). ALM is implicated in holding the short-term memory of future action (29, 30, 32, 37, 43) and in tongue movements (31). Recent studies have also shed light on value coding in ALM (10, 40). In this study, using reinforcement learning, we have identified ALM as a key cortical area that integrates retrospective and prospective values. Thus, our results extend the role of ALM to encoding the knowledge-dependent value, which presumably helps the animal guide its behavior based on task knowledge.
Thus far, we have largely focused on the activity of ramping-up cells, defined as cells showing their highest activity during the preparatory period in any of the four trial types. The reason for this selection was to capture the characteristic feature of ALM activity, the preparatory activity that signals the future action several seconds before action initiation, and to test whether this preparatory activity is modulated by reward-related information and task knowledge. We showed that the preparatory activity dominantly encoded the action-value during the preparatory period of our task. Nevertheless, our conclusions are not limited to these ramping-up cells. For example, 20 to 30% of task-selective cells (including those whose activity peaks at other times in the task) significantly encode the action-value during the preparatory period (SI Appendix, Fig. S7 C and F). Therefore, the dominance of action-value coding also holds for a much larger fraction of neurons in ALM.
What are the potential anatomical pathways through which ALM could integrate different types of values? Some afferent regions of ALM are known to encode values that rely on task knowledge. For example, the rodent frontal agranular cortex, including ALM, receives inputs from the orbitofrontal cortex (OFC) (61, 62). The OFC is necessary for tasks that require task knowledge (63–65), such as inferring the currently rewarding water port (16). Therefore, the OFC could be one of the critical areas that provide task knowledge through its connection to ALM. The anterior cingulate cortex (ACC) is another area implicated in inference-based decision-making and exploration (13, 51, 52, 66). Although a direct projection from ACC to ALM is not known, it is possible that ACC indirectly modulates ALM activity through the cortico-basal ganglia-thalamic loop. Indeed, the ACC, OFC, and ALM all receive input from the mediodorsal thalamic nucleus (MD) (37, 67, 68), suggesting that these structures have access to the same information. Different regions of the cortex form their own recurrent cortico-basal ganglia-thalamic loops (35). To what extent these loops converge and share information, however, remains to be elucidated.
Finally, our photoinhibition experiment provided a causal link between the prospective value and predictive choice behavior. We note that the action-value was encoded throughout the task (Fig. 4C and SI Appendix, Fig. S6C). However, our photoinhibition revealed that ALM has a causal role during the preparatory period but not during the ITI. This indicates that other brain regions may encode similar information during the ITI, and such redundancy could explain the lack of effect of ITI photoinhibition. In contrast, ALM photoinhibition during the preparatory period strongly delayed the reversals but, importantly, did not fully prevent licking or reversal learning itself. ALM-photoinhibited animals could still lick with the same reaction time, could still change their choices, and reached a similar trial-by-trial success rate by the end of one state. Therefore, the photoinhibited animals showed the signatures of novice animals: inflexible choices and the lack of anticipatory reversals. Our reinforcement learning-based analysis further supported this idea; the learning rate and the contribution of the prospective value (the mixing weight) increased through learning, and ALM photoinhibition specifically reduced these parameters. This suggests that ALM is the major conduit by which the prospective value reaches the action-selection circuit (SI Appendix, Fig. S13), such as the superior colliculus, where the left-versus-right lick competition might occur (36). The residual behavior pattern that is resistant to photoinhibition also implies the existence of a bypassing route for the retrospective value to reach the action-selection circuit. Recent studies found that neurons in the posterior parietal cortex and the retrosplenial cortex encode past sensory and reward information (10, 69). This supports the idea that the brain contains parallel learning systems in which different cortico-basal ganglia loops implement a gradient from retrospective to prospective learning. Large-scale neural recordings from the areas that interact with ALM will further elucidate the precise circuit mechanism of flexible action selection. Together, our results show that the frontal cortex (ALM) is a critical hub that injects structural knowledge into action selection to promote predictive decision-making.
Materials and Methods
Animals.
All experiments were approved by the Animal Research Committee, Graduate School of Medicine, Kyoto University. C57BL/6N mice (n = 35, Japan SLC) and PV-Cre × Ai32 mice (n = 8, PV-Cre: B6;129P2-Pvalbtm1(cre)Arbr/J [JAX 008069]; Ai32: B6.Cg-Gt(ROSA)26Sortm32(CAG-COP4*H134R/EYFP)Hze/J [JAX 024109]) were used. All mouse strains used in the behavioral studies were maintained on a C57BL/6N background. Mice were individually housed in aluminum cages with an environmental enrichment apparatus (a running wheel). The room was kept on a reversed 12 h:12 h light/dark cycle, and all experiments were conducted during the dark phase. Male mice aged 8 wk or older were used in all experiments.
AAV Preparation.
The plasmid carrying the CaMKIIa promoter and GCaMP6s sequence (pAAV-CaMKIIa-GCaMP6s) was derived from pAAV-CaMKIIa-GCaMP6s-P2A-nls-dTomato (Addgene plasmid #51086) by deleting the P2A-nls-dTomato sequence. Recombinant adeno-associated virus (AAV) expressing GCaMP6s under the control of the CaMKIIa promoter was packaged in serotype AAV2/9. AAV was purified by iodixanol gradient ultracentrifugation. The final concentration was assessed using real-time PCR (original titer 1.2 × 10^13 vg/mL).
Surgery.
Headposting surgeries were conducted at least 3 wk before the behavior training. Mice were anesthetized with isoflurane (1 to 2%) and placed in a stereotaxic device, followed by subcutaneous injection of dexamethasone (2 mg/kg). Body temperature was maintained using an electric heater. The scalp was dissected out in a round shape, and the periosteum over the dorsal surface of the skull was removed. The imaging areas were identified according to the stereotaxic coordinates. The target was marked on the surface of the skull with a razor blade and painted with a black marker. A rectangular stainless-steel frame (CF-10, Narishige) was attached with clear dental cement (Superbond C&B, Sun Medical). A single craniotomy was made over both cerebral hemispheres. A glass pipette attached to a Nanoject-II (Drummond Scientific) was used to deliver AAV expressing GCaMP6s (AAV-CaMKII-GCaMP6s, diluted to 2.4 × 10^12 vg/mL, 40–60 nl per imaging site). A glass window (5 mm diameter, No. 2 thickness, Matsunami-Glass) was cut to fit the craniotomy and was secured on the remaining edge of the skull by using the same clear dental cement. After the surgery, Enrofloxacin (2.5% Baytril, Bayer, diluted to 8 μl/mL) and Carprofen (Rimadyl, Zoetis JP, diluted to 1.6 mg/mL) were administered via drinking water for at least 5 d. After that, mice were allowed to recover with free access to water for 2 wk before water restriction. A subset of mice received windowing surgery after they reached the criterion (70% success rate), followed by at least 3 d of a recovery period. For photoinhibition experiments, the same procedure was used except for the craniotomy. The skull was covered with the same clear dental cement. For electrophysiological recording experiments, several days before the recording, a craniotomy (~1 mm diameter) was made over the skull, and a ground and a reference pin were implanted over the cerebellum. The surface was covered by silicon elastomer (Kwik-Sil, WPI) to protect the craniotomy until the recording. The extracellular activity was recorded using a silicon probe (A1x32 series, NeuroNexus Inc) through a head-stage amplifier (RHD-2000, Intan). Photo-stimulation patterns and electrophysiological data were simultaneously recorded via a data acquisition board (RHD USB Interface Board, Intan) and analyzed by custom-written software (by KH, MATLAB, MathWorks).
Pre-Training.
Mice received pre-training before performing the deterministic state-transition task. Two to three days before the start of pre-training, mice were water restricted to 1 mL per day and acclimated to handling by an experimenter. The preferred side of licking was assessed by manual water delivery. A behavior control program running in real time (written in MATLAB and Simulink Real-Time by KH) was used to control and record behavioral events. Initially, the rewarding side was fixed to the unfavored side. The head-fixation period started at less than 5 min and was eventually extended up to 90 min over the sessions. When the mice reached the first criterion (more than 100 trials within 30 min), licks between the "Ready" and "Go" signals were categorized as premature licks: they were punished by a loud noise, and the state of the trial reverted to the inter-trial interval. Once the percentage of premature trials became less than 15% of all trials, the water delivery port was changed to the other, originally favored side. This condition was kept fixed until the same criterion (premature trials less than 15% of all trials) was met. This phase of pre-training usually lasted 5–20 d.
Deterministic State-Transition Task.
Once the mice learned to lick both water ports with minimal premature behavior as described above, we initiated the deterministic state-transition task. In the main task, the mice were required to refrain from licking after hearing the "Ready" noise and to wait until the "Go" tone presentation. Any lick between the "Ready" and "Go" signals was punished by the loud noise, and the trial was considered premature. Both novice and expert mice performed the task with a small fraction of premature actions and a similar level of reaction time (SI Appendix, Fig. S1 A and B). After the "Go" tone presentation, the first detected lick within the response window (3 s) was considered the response. The detection of a response was reported to the animal by a brief click sound (~100 ms). In the left (right) state, the left (right) response was considered a "correct" response and was rewarded with a drop of water (~4 μl). In the left (right) state, the right (left) response was considered an "incorrect" response, and no water was supplied. The rewarding water port was changed when the mice had gained a prefixed number of rewards (ten for the Reward10 task, five for the Reward5 task). Reward delivery was deterministic: water was always given for correct responses and never for incorrect responses. Trials with no lick within the response window were considered missed trials, and no water was supplied.
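For reference, the trial and state logic described above can be summarized in the following sketch (the actual control software was written in MATLAB/Simulink Real-Time; the function names and callback structure here are illustrative only):

```python
def run_trial(state, rewards_taken, rewards_per_state,
              lick_during_delay, get_response, play, deliver_water):
    """One trial of the deterministic state-transition task (illustrative sketch).

    state: 'left' or 'right', the currently rewarding port.
    lick_during_delay(): True if the mouse licked between Ready and Go.
    get_response(window_s): first lick ('left'/'right') within the window, or None.
    play()/deliver_water(): hardware callbacks (sounds, solenoid valve).
    Returns the updated (state, rewards_taken) and a trial label.
    """
    play("ready_noise")
    if lick_during_delay():
        play("loud_noise")                        # punish premature lick
        return state, rewards_taken, "premature"

    play("go_tone")
    response = get_response(window_s=3.0)         # first lick within 3 s
    if response is None:
        return state, rewards_taken, "miss"
    play("click")                                 # report lick detection (~100 ms)

    if response == state:                         # correct response
        deliver_water(volume_ul=4)
        rewards_taken += 1
        if rewards_taken == rewards_per_state:    # state transition, without notice
            state = "right" if state == "left" else "left"
            rewards_taken = 0
        return state, rewards_taken, "rewarded"
    return state, rewards_taken, "unrewarded"
```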
Initially, the mice were trained with the ten rewards per state condition (Reward10). A subset of mice was also subjected to five rewards per state task (Reward5) if they had already become Reward10 experts (at least 70% success rate) over three consecutive days in the past. One session lasted maximally 90 min or until the mice obtained 1.5 mL of water. If mice did not maintain a stable body weight, they received supplemental water. The main training lasted 50–190 sessions. We did not observe any decline of fluorescence of genetically encoded calcium indicators (GCaMP6s) during this period.
Behavior Control Hardware and Software.
The behavior experiments under the two-photon microscope were performed in a dual-lick operant module (OPR-1410, O'Hara & Co., LTD). The rest of the behavior training and the optogenetic experiments were performed using in-house-built dual-lick operant chambers. The onset of a lick was detected by the break of an infrared beam situated in front of the water delivery port. The auditory stimuli were presented by a pair of miniature speakers (FK-23451-000, Knowles) attached to the head-fixation stage (O'Hara & Co., LTD) near the ears of the mouse. Water delivery was controlled by a solenoid valve (LHDA1233315H, The Lee Company). The duration of the valve opening was adjusted to deliver 4 μl of water. The lick ports and mouse chair were designed in 3D CAD software (SolidWorks, Dassault Systèmes) and printed on a 3D printer (Form2, Formlabs).
In all the experiments, a custom-written behavior control program running in real time (written in MATLAB and Simulink Real time by KH) was used to control the behavioral events. The behavioral events, imaging scan signal, and laser control signals were recorded at 1 kHz by using a data acquisition board (MF634, Humusoft or PCI-6251, National Instruments) and streamed to the host PC.
Behavior Data Analysis.
The session success rate and the trial-by-trial success rate were defined as the fraction of rewarded trials, excluding misses and premature trials. The definitions of expert and novice are based on the single-session success rate. The transition from the expert to the novice state was a rare event (3.8% of Reward10 expert sessions and 3.0% of Reward5 expert sessions dropped to the novice state in the next session). For imaging data, we confirmed that no novice imaging data contained sessions that had dropped from the expert state (except for Reward5 novices after the transition from the Reward10 to the Reward5 task). The win-shift probability was defined as the fraction of win-shift actions among all wins (rewarded trials). The bootstrap mean and 5 to 95% CI were used for the trial-by-trial success rate and win-shift probability plots. The slope of the win-shift probability was estimated from the pooled trials for each animal using linear regression. The peaks of the trial-by-trial success rate were calculated for each session, and the mean, median, and quartiles were computed over sessions for each animal. The reaction time was defined as the median time from the Go signal to the first lick and was computed for each animal.
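As a concrete reference for the win-shift metric, a minimal sketch is shown below, assuming each trial is reduced to its choice and outcome and that misses and premature trials have already been removed; this is an illustration rather than the published code.

```python
def win_shift_probability(choices, rewards):
    """Fraction of rewarded ('win') trials followed by a different choice ('shift').

    choices: sequence of 'L'/'R'; rewards: sequence of 0/1, aligned with choices.
    """
    wins = shifts = 0
    for i in range(len(choices) - 1):
        if rewards[i]:                      # win trial
            wins += 1
            if choices[i + 1] != choices[i]:
                shifts += 1                 # win followed by a shift
    return shifts / wins if wins else float("nan")
```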
Two-Photon Calcium Imaging.
Imaging experiments started after the mice had reached the criteria in the pre-training stage. Imaging was conducted using a commercial two-photon microscope (FVMPE-RS, Olympus) with a 25× objective lens (XLPN25XWMP2, Olympus). The beam of a femtosecond pulsed laser with a peak wavelength at 920 nm (Chameleon Vision II, Coherent) was delivered via a custom-built light path. The laser intensity was controlled by an acousto-optical modulator (AOM; MT110-B50A1,5-IRHk, OEM ver. AA Optoelectronic). Images (512 × 512 pixels covering a 500 × 500 μm area) were continuously recorded, up to 64,000 frames at 30 frames per second, using image acquisition software (FV315-SW, Olympus). The average excitation power was up to 120 mW for deep layer imaging (~500 μm deep).
Pre-Processing for Ca Imaging Data Analysis.
Data analysis was performed in MATLAB (MathWorks) and Python 3.6. Brain motion was detected as the peak of phase correlation between the mean image and each image. The detected motion was corrected by shifting each image in an x-y coordinate (no shear, no rotation) by using an algorithm in Suite2P (version 2016) (70). Regions of interest (ROIs) corresponding to the cell bodies were detected by using a custom-written algorithm (HDBCellSCAN) (74). The fluorescence signal was deconvolved using a nonnegative deconvolution algorithm (71) and was defined as the dF.
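For illustration, rigid-shift estimation by phase correlation can be sketched as follows; this generic implementation is ours, whereas the actual registration used the Suite2P algorithm cited above.

```python
import numpy as np

def phase_correlation_shift(template, frame, eps=1e-9):
    """Estimate the rigid (dy, dx) shift of `frame` relative to `template`.

    The peak of the inverse FFT of the normalized cross-power spectrum gives
    the displacement d such that frame ~= template shifted by d.
    """
    cross_power = np.conj(np.fft.fft2(template)) * np.fft.fft2(frame)
    cross_power /= np.abs(cross_power) + eps          # keep phase information only
    corr = np.real(np.fft.ifft2(cross_power))
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap peaks beyond half the image size to negative displacements.
    return tuple(int(p) if p <= s // 2 else int(p) - s
                 for p, s in zip(peak, corr.shape))

def register_frame(frame, dy, dx):
    """Undo an integer (dy, dx) displacement with a circular shift (no shear, no rotation)."""
    return np.roll(frame, shift=(-dy, -dx), axis=(0, 1))
```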
Ca Imaging Data Analysis.
All data analyses were performed using custom-written code in MATLAB (MathWorks). First, neurons were tested for their selectivity to the four trial types (Contra/Ipsi-lick × Rewarded/Unrewarded) during the preparatory period (−2 to 0 s before the Go signal), the action period (0–1 s after Go), and the outcome period (1–4 s after Go), using the dF averaged within each period (ANOVA, P threshold = 0.05). The ramping-up cells were selected as the cells that showed maximum activity during the preparatory period in any of the four trial types. For the rewarded trials, the data were grouped by the learning stage (novice, expert) and the cumulative number of rewards (reward count) within one state and then averaged within each group for each animal. Note that the first reward trial was defined as the trial in which the animal received the first reward within the state. For the unrewarded trials, the data were grouped by the learning stage and the cumulative number of unrewarded trials within the state and then averaged. In all the analyses above, the preferred lick direction of a neuron was defined as the direction in which the mean activity in rewarded trials was higher.
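A schematic version of this cell-classification step is sketched below. The trial-type labels, window boundaries, and data layout are illustrative assumptions; the published analysis used custom MATLAB code.

```python
import numpy as np
from scipy.stats import f_oneway

TRIAL_TYPES = ("contra_rew", "contra_unrew", "ipsi_rew", "ipsi_unrew")
WINDOWS = {"prep": (-2.0, 0.0), "action": (0.0, 1.0), "outcome": (1.0, 4.0)}  # s, relative to Go

def window_mean(dff, times, window):
    """Mean dF within a time window for each trial (dff: trials x timepoints)."""
    lo, hi = window
    mask = (times >= lo) & (times < hi)
    return dff[:, mask].mean(axis=1)

def classify_cell(dff_by_type, times, p_thresh=0.05):
    """Return (is_task_selective, is_ramping_up) for one cell.

    dff_by_type maps each trial type to a (trials x timepoints) dF array.
    A cell is task selective if ANOVA across the four trial types is significant
    in any window; it is 'ramping up' if its maximum window-averaged activity
    (in any trial type) falls in the preparatory window.
    """
    selective = False
    peak_window, peak_value = None, -np.inf
    for name, window in WINDOWS.items():
        means = [window_mean(dff_by_type[t], times, window) for t in TRIAL_TYPES]
        if f_oneway(*means).pvalue < p_thresh:
            selective = True
        best = max(m.mean() for m in means)
        if best > peak_value:
            peak_window, peak_value = name, best
    return selective, selective and peak_window == "prep"
```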
Photoinhibition.
Light from a 473 nm laser (OBIS 473 LX 75 mW, Coherent) was first introduced to an f = 100 mm lens (AC254-100AB, Thorlabs) to form a focused spot on a two-axis scanning mirror (Integrated MEMS Mirror, Mirrorcle Tech) and was subsequently refocused onto the skull surface by another f = 75 mm lens (AC254-75AB, Thorlabs). The laser had a Gaussian profile with σ = 160 μm at the level of the skull surface. The laser spot size on a piece of black paper was measured by a CMOS camera (DMK37BUX287, ImagingSource).
The MEMS mirror directed the light in a stepwise manner to stimulate multiple brain regions sequentially. The MEMS mirror was aimed at each target for 50 ms, with a transient time of less than 2 ms. The laser intensity was controlled through a remote controller (OBIS Single Remote, Coherent) in the analog modulation mode. During laser stimulation, the laser amplitude was sinusoidally modulated at 40 Hz and was linearly attenuated over the last 100–200 ms. During the transient time of the mirror, the laser beam was briefly turned off to prevent stimulation of non-target areas. Laser power was calibrated using a laser power meter (Fieldmate, Coherent). The laser intensity was set to 1.5 mW per spot throughout the experiments. All photostimulation was delivered through the clear dental cement and the intact skull. The fraction of laser power transmitted through the skull was 28.4 ± 4.8% (n = 5 skulls, mean ± SEM), measured using the power meter and skulls recovered after the experiments. This corresponds to 0.42 ± 0.07 mW per spot in our experiments.
Photoinhibition experiments were conducted in alternating blocks; one block corresponds to two to three consecutive states for ALM bilateral photoinhibition and three to seven consecutive states for control stimulation. In ALM targeting blocks, the laser was aimed at two spots over ALM (1.5 mm left and right from the midline, 2.5 mm anterior from the bregma). In the control blocks, the laser was aimed at the frontal edge of the head post (two spots at 1.5 mm left and right from the midline, usually 3.5–4.5 mm anterior from bregma). The length of states within one block was randomly chosen. The experiments were always started with a control stimulation block.
Electrophysiological Data Analysis.
Data analysis was performed in MATLAB (MathWorks) and Python 3.6. First, isolated units were identified from the multi-channel recording data using Kilosort (72) and manually re-clustered using Phy (https://github.com/cortex-lab/phy). The difference in firing rate between the photoinhibition and control periods was used to assess the range of photoinhibition effects.
Statistical Analysis.
All the statistics were computed using MATLAB (MathWorks). For the nested data (Figs. 1 E and F, 5 B and C, and 6 C–E and SI Appendix, Figs. S2 D and F and S12 C) in which multiple data points (win-shift probability, neural activity, success rate) were recorded from multiple sessions and animals, we used the hierarchical bootstrap (73) to compute the mean, 5 to 95% CI, and significance of the differences. The other statistics were based on the averages from each animal. Tests for differences were two-sided, and differences were considered as significant when P < 0.05. Bonferroni correction was applied when necessary.
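For readers unfamiliar with the hierarchical bootstrap (73), the resampling scheme can be sketched as follows: animals, then sessions within each resampled animal, then trials within each resampled session are drawn with replacement before the statistic is computed. The data layout below is an illustrative assumption, not the published implementation.

```python
import numpy as np

def hierarchical_bootstrap_mean(data, n_boot=1000, seed=0):
    """Bootstrap distribution of the mean for nested data.

    data: {animal_id: {session_id: array-like of per-trial values}}.
    """
    rng = np.random.default_rng(seed)
    animal_ids = list(data.keys())
    boot_means = np.empty(n_boot)
    for b in range(n_boot):
        values = []
        for ai in rng.integers(len(animal_ids), size=len(animal_ids)):
            sessions = list(data[animal_ids[ai]].values())      # resample animals
            for si in rng.integers(len(sessions), size=len(sessions)):
                trials = np.asarray(sessions[si])                # resample sessions
                values.append(trials[rng.integers(len(trials), size=len(trials))])
        boot_means[b] = np.concatenate(values).mean()            # resampled trials
    return boot_means  # e.g., np.percentile(boot_means, [5, 95]) gives the 5-95% CI
```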
Regression Analysis.
We performed a multivariate regression analysis of the dF activity to test whether the neuronal activity was correlated with any of the behavior events and latent variables of the model. The regression model was
dF_i = β_0 + β_c · c_i + β_contra · Q_contra,i + β_ipsi · Q_ipsi,i + ε_i   [1]

where dF_i is the dF during the ith trial, c_i is the choice during the ith trial (contra- or ipsilateral choice; 1 or −1), and Q_contra,i and Q_ipsi,i are the action-values associated with the contra- and ipsilateral actions before observing the outcome of the ith trial (SI Appendix, Methods for the computation of Q_contra,i and Q_ipsi,i). The residual error term is ε_i, and β_0, β_c, β_contra, and β_ipsi are the regression coefficients. The median variance inflation factors for choice, Q_contra, and Q_ipsi were 1.62, 1.65, and 1.45 in novices and 3.32, 2.61, and 2.32 in experts. The regression coefficients and their P-values were computed with the MATLAB function regstats. A regression coefficient was considered significant when its P-value was lower than Pthreshold, which was set to 0.05 for choice and 0.025 for the action-values because choice is represented by a single variable (c_i) whereas the action-values are defined by two variables (Q_contra,i and Q_ipsi,i). The learning-dependent changes in the fraction of significant cells were computed as follows: the fraction of significant cells was computed for each mouse, and these population data were then compared using the MATLAB function multcompare. For video analysis of facial movements, dF_i was replaced by a principal component of a detected facial feature.
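A hedged Python equivalent of regression model [1] for a single cell is sketched below (the published analysis used the MATLAB function regstats); the variable names and the use of statsmodels are our assumptions for illustration.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def regress_cell(dff, choice, q_contra, q_ipsi, p_choice=0.05, p_value_q=0.025):
    """Regress trial-wise dF on choice (+1/-1) and the two action-values.

    Returns fitted coefficients, P-values, VIFs of the regressors, and boolean
    flags for significant choice coding and action-value coding.
    """
    X = np.column_stack([choice, q_contra, q_ipsi])
    X_const = sm.add_constant(X)
    fit = sm.OLS(dff, X_const).fit()

    p = fit.pvalues                       # [intercept, choice, Q_contra, Q_ipsi]
    choice_coding = p[1] < p_choice
    value_coding = (p[2] < p_value_q) or (p[3] < p_value_q)

    # Variance inflation factors for the three regressors (excluding the intercept).
    vif = [variance_inflation_factor(X_const, i) for i in range(1, X_const.shape[1])]
    return fit.params, p, vif, choice_coding, value_coding
```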
We also used the following model to regress the contribution of past choices and outcomes to the dF activity:

dF_i = β_0 + Σ_{k=0}^{2} β_c,k · c_{i−k} + Σ_{k=0}^{2} β_r,k · r_{i−k} + ε_i   [2]

where r_{i−k} represents rewarded (1) or unrewarded (0) on the (i − k)th trial, and β_c,k and β_r,k represent the regression coefficients of past choices and outcomes. The median variance inflation factors for the choice on the current trial (c_i), one trial before (c_{i−1}), and two trials before (c_{i−2}), and for the reward outcomes on the current trial (r_i), one trial before (r_{i−1}), and two trials before (r_{i−2}), were 2.10, 2.58, 2.02, 1.86, 2.22, and 1.80 in novices and 2.44, 3.63, 2.42, 1.29, 1.59, and 1.35 in experts. For this regression analysis with the future (current-trial) and past choices and outcomes, Pthreshold was set to 0.05. The differences between the fractions of significant cells and their learning-dependent changes were computed in the same way as described for the choice and action-value regression analysis.
Histology.
Mice were perfused transcardially with PBS followed by 4% PFA/0.1 M PB. The brains were postfixed overnight and transferred to 30% sucrose until sectioning on a microtome (Leica, CM1850). Coronal, 40 μm free-floating sections were collected in PBS-azide. Slide-mounted sections were imaged on a Keyence microscope (BZ-X700).
For the identification of PV-positive and ChR2-EYFP positive neurons, we used Rabbit-anti-GFP (Invitrogen A11122) and Mouse-anti-PV (Swant PV235) as primary antibodies (1:1,000 dilution), and Goat-anti-Rabbit Alexa488 (Invitrogen), Goat-anti-Mouse Alexa 594 (Invitrogen) secondary antibodies (1:200 dilution). DAPI was used for counterstaining. For cell counting, we manually selected cells using the ImageJ multipoint selection tool (NIH).
Data, Materials, and Software Availability
HDBCellSCAN is deposited to GitHub (https://github.com/hamaguchikosuke/HDBCellSCAN) (74). Imaging and electrophysiological data will be shared upon a reasonable request.
Acknowledgments
We thank Mitsuko Uchida, Naoshige Uchida, Kenji Doya, James Hejna, and Richard Mooney for their comments on earlier versions of the manuscript and Bernd Kuhn for his advice on the two-photon microscope setup. Satoshi Yawata constructed the pAAV-CaMKIIa-GCaMP6s plasmid. Fumi Ageta assisted with animal training. pAAV-CaMKIIa-GCaMP6s-P2A-nls-dTomato was a gift from Jonathan Ting (Addgene plasmid #51086). This work was supported by MEXT/JSPS KAKENHI 19H04983, 21H02804, and 22H05495 (K.H.), MEXT/JSPS KAKENHI JP18H04014 (D.W.), JST CREST JPMJCR1756 (D.W.), and the Takeda Science Foundation (D.W.).
Author contributions
K.H. designed research; K.H. and H.T.-A. performed research; K.H. and D.W. contributed new reagents/analytic tools; K.H. analyzed data; and K.H. wrote the paper.
Competing interests
The authors declare no competing interest.
Supporting Information
Appendix 01 (PDF)
Movie S1
ALM layer 5 calcium imaging. Imaged at Anterior 2.5 mm from bregma, Left 1.5 mm, 500 μm deep. ×8 speed, GCaMP6s.
Movie S2
Reward5 expert behavior. An example of expert behavior in the Reward5 condition, in which the animal performs reversals without mistakes.
References
1
W. Schultz, P. Dayan, P. R. Montague, A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
2
R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA, 1998).
3
B. B. Doll, D. A. Simon, N. D. Daw, The ubiquity of model-based reinforcement learning. Curr. Opin. Neurobiol. 22, 1075–1081 (2012).
4
N. D. Daw, P. Dayan, The algorithmic anatomy of model-based evaluation. Philos. Trans. R. Soc. B 369, 20130478 (2014).
5
H. H. Yin, B. J. Knowlton, B. W. Balleine, Lesions of dorsolateral striatum preserve outcome expectancy but disrupt habit formation in instrumental learning. Eur. J. Neurosci. 19, 181–189 (2004).
6
H. H. Yin, S. B. Ostlund, B. J. Knowlton, B. W. Balleine, The role of the dorsomedial striatum in instrumental conditioning. Eur. J. Neurosci. 22, 513–523 (2005).
7
E. Tricomi, B. W. Balleine, J. P. O’Doherty, A specific role for posterior dorsolateral striatum in human habit learning. Eur. J. Neurosci. 29, 2225–2232 (2009).
8
K. Wunderlich, P. Dayan, R. J. Dolan, Mapping value based planning and extensively trained choice in the human brain. Nat. Neurosci. 15, 786–791 (2012).
9
L. P. Sugrue, G. S. Corrado, W. T. Newsome, Matching behavior and the representation of value in the parietal cortex. Science 304, 1782–1787 (2004).
10
R. Hattori, B. Danskin, Z. Babic, N. Mlynaryk, T. Komiyama, Area-specificity and plasticity of history-dependent value coding during learning. Cell 177, 1858–1872.e15 (2019).
11
S. W. Lee, S. Shimojo, J. P. O’Doherty, Neural computations underlying arbitration between model-based and model-free learning. Neuron 81, 687–699 (2014).
12
K. J. Miller, M. M. Botvinick, C. D. Brody, Dorsal hippocampus contributes to model-based planning. Nat. Neurosci. 20, 1269–1276 (2017).
13
T. Akam et al., The anterior cingulate cortex predicts future states to mediate model-based action selection. Neuron 109, 149–163.e7 (2020).
14
V. D. Costa, V. L. Tran, J. Turchi, B. B. Averbeck, Reversal learning and dopamine: A bayesian perspective. J. Neurosci. 35, 2407–2416 (2015).
15
R. Bartolo, B. B. Averbeck, Prefrontal cortex predicts state switches during reversal learning. Neuron 106, 1044–1054.e4 (2020).
16
P. Vertechi et al., Inference-based decisions in a hidden state foraging task: Differential contributions of prefrontal cortical areas. Neuron 106, 166–176.e6 (2020).
17
E. K. Miller, J. D. Cohen, An integrative theory of prefrontal cortex function. Annu. Rev. Neurosci. 24, 167–202 (2001).
18
K. Sakai, Task set and prefrontal cortex. Annu. Rev. Neurosci. 31, 219–245 (2008).
19
K. Watanabe, O. Hikosaka, Immediate changes in anticipatory activity of caudate neurons associated with reversal of position-reward contingency. J. Neurophysiol. 94, 1879–1887 (2005).
20
E. S. Bromberg-Martin, M. Matsumoto, S. Hong, O. Hikosaka, A pallidus-habenula-dopamine pathway signals inferred stimulus values. J. Neurophysiol. 104, 1068–1076 (2010).
21
M. A. van der Meer, A. D. Redish, Covert expectation-of-reward in rat ventral striatum at decision points. Front. Integr. Neurosci. 3, 1 (2009).
22
K. Samejima, Y. Ueda, K. Doya, M. Kimura, Representation of action-specific reward values in the striatum. Science 310, 1337–1340 (2005).
23
M. Ito, K. Doya, Distinct neural representation in the dorsolateral, dorsomedial, and ventral parts of the striatum during fixed- and free-choice tasks. J. Neurosci. 35, 3499–3514 (2015).
24
H. H. Yin, B. J. Knowlton, The role of the basal ganglia in habit formation. Nat. Rev. Neurosci. 7, 464–476 (2006).
25
B. W. Balleine, J. P. O’Doherty, Human and rodent homologies in action control: Corticostriatal determinants of goal-directed and habitual action. Neuropsychopharmacology 35, 48–69 (2010).
26
B. B. Doll, K. D. Duncan, D. A. Simon, D. Shohamy, N. D. Daw, Model-based choices involve prospective neural activity. Nat. Neurosci. 18, 767–772 (2015).
27
L. S. Morris et al., Fronto-striatal organization: Defining functional and microstructural substrates of behavioural flexibility. Cortex 74, 118–133 (2016).
28
T. Komiyama et al., Learning-related fine-scale specificity imaged in motor cortex circuits of behaving mice. Nature 464, 1182–1186 (2010).
29
Z. V. Guo et al., Flow of cortical activity underlying a tactile decision in mice. Neuron 81, 179–194 (2014).
30
N. Li, T. W. Chen, Z. V. Guo, C. R. Gerfen, K. Svoboda, A motor cortex circuit for motor planning and movement. Nature 519, 51–56 (2015).
31
T. Bollu et al., Cortex-dependent corrections as the tongue reaches for and misses targets. Nature 594, 82–87 (2021).
32
T. W. Chen, N. Li, K. Daie, K. Svoboda, A map of anticipatory activity in mouse motor cortex. Neuron 94, 866–879.e4 (2017).
33
J. Tanji, E. V. Evarts, Anticipatory activity of motor cortex neurons in relation to direction of an intended movement. J. Neurophysiol. 39, 1062–1068 (1976).
34
M. N. Economo et al., Distinct descending motor cortex pathways and their roles in movement. Nature 563, 79–84 (2018).
35
N. N. Foster et al., The mouse cortico–basal ganglia–thalamic network. Nature 598, 188–194 (2021).
36
J. Lee, B. L. Sabatini, Striatal indirect pathway mediates exploration via collicular competition. Nature 599, 645–649 (2021).
37
Z. V. Guo et al., Maintenance of persistent activity in a frontal thalamocortical loop. Nature 545, 181–186 (2017).
38
Y. Wang et al., A cortico-basal ganglia-thalamo-cortical channel underlying short-term memory. Neuron 109, 3486–3499.e7 (2021).
39
J. H. Sul, S. Jo, D. Lee, M. W. Jung, Role of rodent secondary motor cortex in value-based action selection. Nat. Neurosci. 14, 1202–1208 (2011).
40
B. A. Bari et al., Stable representations of decision variables for flexible behavior. Neuron 103, 922–933.e7 (2019).
41
V. Stuphorn, T. L. Taylor, J. D. Schall, Performance monitoring by the supplementary eye field. Nature 408, 857–860 (2000).
42
R. E. Kass, A. E. Raftery, Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).
43
H. K. Inagaki, L. Fontolan, S. Romani, K. Svoboda, Discrete attractor dynamics underlies persistent activity in the frontal cortex. Nature 566, 212–217 (2019).
44
V. D. Costa, O. Dal Monte, D. R. Lucas, E. A. Murray, B. B. Averbeck, Amygdala and ventral striatum make distinct contributions to reinforcement learning. Neuron 92, 505–517 (2016).
45
A. Vilà-Balló et al., Unraveling the role of the hippocampus in reversal learning. J. Neurosci. 37, 6686–6697 (2017).
46
E. A. Gaffan, J. Davies, Reward, novelty and spontaneous alternation. J. Exp. Psychol. Sec. B 34, 31–47 (1982).
47
M. H. Kao, A. J. Doupe, M. S. Brainard, Contributions of an avian basal ganglia-forebrain circuit to real-time modulation of song. Nature 433, 638–643 (2005).
48
B. P. Olveczky, A. S. Andalman, M. S. Fee, Vocal experimentation in the juvenile songbird requires a basal ganglia circuit. PLoS Biol. 3, e153 (2005).
49
K. Hamaguchi, R. Mooney, Recurrent interactions between the input and output of a songbird cortico-basal ganglia pathway are implicated in vocal sequence variability. J. Neurosci. 32, 11671–11687 (2012).
50
A. J. Yu, P. Dayan, Uncertainty, neuromodulation, and attention. Neuron 46, 681–692 (2005).
51
D. G. R. Tervo et al., Behavioral variability through stochastic choice and its gating by anterior cingulate cortex. Cell 159, 21–32 (2014).
52
D. G. R. Tervo et al., The anterior cingulate cortex directs exploration of alternative strategies. Neuron 109, 1876–1887.e6 (2021).
53
N. D. Daw, Y. Niv, P. Dayan, Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat. Neurosci. 8, 1704–1711 (2005).
54
J. X. Wang et al., Prefrontal cortex as a meta-reinforcement learning system. Nat. Neurosci. 21, 860–868 (2018).
55
M. J. Siniscalchi, V. Phoumthipphavong, F. Ali, M. Lozano, A. C. Kwan, Fast and slow transitions in frontal ensemble activity during flexible sensorimotor behavior. Nat. Neurosci. 19, 1234–1242 (2016).
56
T.-Y. Wang, J. Liu, H. Yao, Control of adaptive action selection by secondary motor cortex during flexible visual categorization. eLife 9, e54474 (2020).
57
B. A. Bari, J. Y. Cohen, Dynamic decision making and value computations in medial frontal cortex. Int. Rev. Neurobiol. 158, 83–113 (2021).
58
Y. Yuan, H. Mao, J. Si, Cortical neural responses to previous trial outcome during learning of a directional choice task. J. Neurophysiol. 113, 1963–1976 (2014).
59
N. Amador, M. Schlag-Rey, J. Schlag, Reward-predicting and reward-detecting neuronal activity in the primate supplementary eye field. J. Neurophysiol. 84, 2166–2170 (2000).
60
K. Shima, J. Tanji, Role for cingulate motor area cells in voluntary movement selection based on reward. Science 282, 1335–1338 (1998).
61
J. P. Donoghue, C. Parham, Afferent connections of the lateral agranular field of the rat motor cortex. J. Comp. Neurol. 217, 390–404 (1983).
62
F. Condé, E. Maire-Lepoivre, E. Audinat, F. Crépel, Afferent connections of the medial frontal cortex of the rat. II. Cortical and subcortical afferents. J. Comp. Neurol. 352, 567–593 (1995).
63
T. A. Stalnaker, N. K. Cooch, G. Schoenbaum, What the orbitofrontal cortex does not do. Nat. Neurosci. 18, 620–627 (2015).
64
M. A. McDannald, F. Lucantonio, K. A. Burke, Y. Niv, G. Schoenbaum, Ventral striatum and orbitofrontal cortex are both required for model-based, but not model-free, reinforcement learning. J. Neurosci. 31, 2700–2705 (2011).
65
Y. Liu, Y. Xin, N.-L. Xu, A cortical circuit mechanism for structural knowledge-based flexible sensorimotor decision-making. Neuron 109, 2009–2024.e6 (2021).
66
N. D. Daw, S. J. Gershman, B. Seymour, P. Dayan, R. J. Dolan, Model-based influences on humans’ choices and striatal prediction errors. Neuron 69, 1204–1215 (2011).
67
V. B. Domesick, Thalamic relationships of the medial cortex in the rat. Brain Behav. Evol. 6, 457–483 (1972).
68
H. J. Groenewegen, Organization of the afferent connections of the mediodorsal thalamic nucleus in the rat, related to the mediodorsal-prefrontal topography. Neuroscience 24, 379–431 (1988).
69
A. Akrami, C. D. Kopec, M. E. Diamond, C. D. Brody, Posterior parietal cortex represents sensory history and mediates its effects on behaviour. Nature 554, 368–372 (2018).
70
M. Pachitariu et al., Suite2p: Beyond 10,000 neurons with standard two-photon microscopy. bioRxiv [Preprint] (2017). https://doi.org/10.1101/061507. Accessed 15 September 2016.
71
J. T. Vogelstein et al., Fast nonnegative deconvolution for spike train inference from population calcium imaging. J. Neurophysiol. 104, 3691–3704 (2010).
72
M. Pachitariu, N. Steinmetz, S. Kadir, M. Carandini, K. Harris, Fast and accurate spike sorting of high-channel count probes with Kilosort. Adv. Neural Inf. Proc. Sys. 4448–4456 (2016).
73
V. Saravanan, G. J. Berman, S. J. Sober, Application of the hierarchical bootstrap to multi-level data in neuroscience. Neuron Behav. Data Anal. Theory 3 (2020).
74
K. Hamaguchi, HDBCellSCAN (Version 1.0.0) [Computer software]. GitHub. https://doi.org/10.5281/zenodo.72971475. Deposited 7 November 2022.
Copyright © 2022 the Author(s). Published by PNAS. This article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).
Submission history
Received: April 30, 2022
Accepted: September 21, 2022
Published online: November 23, 2022
Published in issue: November 29, 2022
Notes
This article is a PNAS Direct Submission.
Cite this article
Prospective and retrospective values integrated in frontal cortex drive predictive choice. Proc. Natl. Acad. Sci. U.S.A. 119 (48), e2206067119 (2022). https://doi.org/10.1073/pnas.2206067119