A general model of hippocampal and dorsal striatal learning and decision making

Significance

A central question in neuroscience concerns how humans and animals trade off multiple decision-making strategies. Another question pertains to the use of egocentric and allocentric strategies during navigation. We introduce reinforcement-learning models based on learning to predict future reward directly from states and actions or via learning to predict future "successor" states, choosing actions from either system based on the reliability of its predictions. We show that this model explains behavior on both spatial and nonspatial decision tasks, and we map the two model components onto the function of the dorsal hippocampus and the dorsolateral striatum, thereby unifying findings from the spatial-navigation and decision-making fields.

Following Wan Lee et al. (1), we use the reliability measure for arbitration between the two systems. These authors computed transition rates α and β for transitioning from MF to MB control and vice versa; here we use the same terms, but for transitions between MF and SR. These transition rates are functions of the reliability of the respective systems:

α(χ_MF) = A_α / (1 + exp(B_α χ_MF))

β(χ_SR) = A_β / (1 + exp(B_β χ_SR)) [5]

where the A and B parameters in both equations determine the maximum transition rate and the steepness of these curves, respectively.
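
As a concrete illustration, the following sketch implements these sigmoidal transition-rate functions. The parameter values shown are placeholders; the values used in the simulations were matched to those fitted by Wan Lee et al. (1) (Table S1).

```python
import numpy as np

def transition_rate(reliability, A, B):
    """Sigmoidal transition rate as in Wan Lee et al. (1): A / (1 + exp(B * reliability)).

    The rate of leaving a system decreases as that system's reliability increases.
    """
    return A / (1.0 + np.exp(B * reliability))

# Hypothetical parameter values, for illustration only.
A_alpha, B_alpha = 0.5, 10.0   # MF -> SR transition (driven by MF reliability)
A_beta,  B_beta  = 0.5, 10.0   # SR -> MF transition (driven by SR reliability)

chi_MF, chi_SR = 0.2, 0.8      # example reliability estimates
alpha = transition_rate(chi_MF, A_alpha, B_alpha)
beta = transition_rate(chi_SR, A_beta, B_beta)
print(f"alpha (MF->SR) = {alpha:.3f}, beta (SR->MF) = {beta:.3f}")
```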

These parameters were fitted to behavioural data by Wan Lee et al. (1) and we matched their parameter values (see Table S1).

At each time step, the rate of change of the probability of choosing the SR system, P_SR, was computed using the following differential equation:

dP_SR/dt = α(χ_MF)(1 − P_SR) − β(χ_SR) P_SR

Although not explored here (but see 1), this means that there is a certain "stickiness" to the model: if the model is currently choosing MF actions, it will take some time to move weight to the SR system.
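
A minimal sketch of these dynamics, assuming a simple Euler discretisation of the differential equation above (step size and rates are illustrative):

```python
def update_p_sr(p_sr, alpha, beta, dt=1.0):
    """One Euler step of the arbitration dynamics following Wan Lee et al. (1):

        dP_SR/dt = alpha * (1 - P_SR) - beta * P_SR

    alpha pushes weight toward the SR system, beta pushes it back toward MF.
    Because P_SR changes gradually, control is "sticky".
    """
    p_sr += dt * (alpha * (1.0 - p_sr) - beta * p_sr)
    return min(max(p_sr, 0.0), 1.0)  # keep the probability in [0, 1]

# Example: starting from mostly-MF control, weight moves toward SR only gradually.
p_sr = 0.1
for step in range(5):
    p_sr = update_p_sr(p_sr, alpha=0.4, beta=0.05)
    print(f"step {step}: P_SR = {p_sr:.3f}")
```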

Following Wan Lee et al. (1), state-action value estimates were given by a weighted average of the two model components:

Q(s, a) = P_SR Q_SR(s, a) + (1 − P_SR) Q_MF(s, a)

Thus, the degree to which a system contributes to the value estimate is influenced by its reliability. Given these full-model state-action values, the agent chose actions following a softmax policy:

P(a | s) = exp(τ⁻¹ Q(s, a)) / Σ_a′ exp(τ⁻¹ Q(s, a′))

where τ⁻¹ is an inverse temperature parameter which sets the balance between exploration and exploitation. The higher the inverse temperature, the more the agent chooses higher-valued actions.
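
The sketch below illustrates the value combination and softmax choice rule; the numerical values of the Q-arrays and the inverse temperature are hypothetical.

```python
import numpy as np

def combined_q(q_sr, q_mf, p_sr):
    """Weighted average of the two controllers' state-action values."""
    return p_sr * q_sr + (1.0 - p_sr) * q_mf

def softmax_policy(q_values, inv_temp):
    """Softmax action probabilities; a higher inverse temperature favours exploitation."""
    prefs = inv_temp * (q_values - np.max(q_values))  # subtract max for numerical stability
    return np.exp(prefs) / np.exp(prefs).sum()

# Example for a state with three available actions.
q_sr = np.array([1.0, 0.2, 0.5])
q_mf = np.array([0.4, 0.9, 0.5])
q = combined_q(q_sr, q_mf, p_sr=0.7)
probs = softmax_policy(q, inv_temp=3.0)
action = np.random.default_rng(0).choice(len(q), p=probs)
print(q, probs, action)
```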

Task-specific adaptations
Although the general model architecture remained the same throughout all simulations, different adaptations were made to the model described above such that it could be used in the different state spaces defined by the tasks.

Plus maze. For the Plus Maze task described in Fig. 3, landmark cells were tuned to the ends of the maze. We assumed that the landmark cells could not distinguish between the two ends of the maze, such that, from the point of view of the striatal system, probe trials and training trials looked the same.

Blocking. For the blocking simulations (Fig. 4), we adapted the hippocampal controller (which worked with a tabular state representation as input) to incorporate the effects of boundaries on place cell firing. To that end, we defined the hippocampal SR system using linear function approximation. The agent observes states through a vector of features f(s) which, if chosen well, will be of much smaller dimension than the number of states, allowing the agent to generalise to states that are nearby in feature space. The feature-based SR (2) encodes the expected discounted future activity of each feature:

ψ^π(s) = E_π[ Σ_{t=0..∞} γ^t f(s_t) | s_0 = s ] [9]

As in the tabular case, the feature-based SR can be used to compute value when multiplied with a vector of reward expectations per feature, u: V^π(s) = ψ^π(s)^T u. In the case of linear function approximation, these Successor Features ψ in Equation 9 are approximated by a linear function of the features f:

ψ^π(s) ≈ W f(s)

where W is a weight matrix which parameterises the approximation.
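
A minimal sketch of this value computation, using a hypothetical low-dimensional feature vector standing in for the place-cell input features (the dimensionality, the contents of W and u, and the feature values are illustrative only):

```python
import numpy as np

n_features = 4
rng = np.random.default_rng(0)

W = rng.normal(scale=0.1, size=(n_features, n_features))  # parameterises psi(s) ~= W f(s)
u = rng.normal(size=n_features)                           # reward expectation per feature

def successor_features(f_s, W):
    """Approximate successor features: psi(s) ~= W f(s)."""
    return W @ f_s

def value(f_s, W, u):
    """V(s) = psi(s)^T u, as in the tabular SR but over features."""
    return successor_features(f_s, W) @ u

f_s = np.array([0.9, 0.1, 0.0, 0.0])  # example population vector of input features
print(value(f_s, W, u))
```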

In the context of the hippocampus, the feature-based SR allows us to represent states as population vectors of place cells, with the input features and the place cell population activity corresponding to f and ψ, respectively.

As in the tabular case, temporal difference learning can be used to update the SR weights:

δ_t = f(s_t) + γ ψ̂(s_{t+1}) − ψ̂(s_t),    W ← W + η δ_t f(s_t)^T

where η is a learning rate. Note that the algorithm has not changed with respect to the one-hot state encoding mentioned earlier: it is easy to see that, when f is a one-hot encoding of the states, this reduces to the tabular SR update.
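
The following sketch implements this successor-feature TD update; the learning rate, discount factor, and feature vectors are assumptions for illustration.

```python
import numpy as np

def td_update_sf_weights(W, f_s, f_next, learning_rate=0.1, gamma=0.95):
    """One temporal-difference update of the successor-feature weights W.

        delta = f(s) + gamma * W f(s') - W f(s)
        W    <- W + learning_rate * outer(delta, f(s))

    With one-hot features f, this reduces to the tabular SR update.
    """
    psi_s = W @ f_s
    psi_next = W @ f_next
    delta = f_s + gamma * psi_next - psi_s      # vector-valued TD error
    return W + learning_rate * np.outer(delta, f_s)

# Example transition between two feature vectors (here one-hot, for comparison
# with the tabular case).
W = np.zeros((3, 3))
f_s = np.array([1.0, 0.0, 0.0])
f_next = np.array([0.0, 1.0, 0.0])
W = td_update_sf_weights(W, f_s, f_next)
print(W)
```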

To investigate the relationship between the agents' spatial navigation and non-spatial decision making strategies, we quantified the agents' degree of MB planning, as well as their degree of using an allocentric strategy, and computed the correlation between the two.

For quantifying MB planning, we followed earlier studies (10, 11) and analysed the agents' choices using a mixed-effects logistic regression (estimated using the statsmodels Python package (12)). For each trial, the dependent variable (stay with the same first-level action or switch) was explained in terms of whether there was a reward on the previous trial, whether the previous transition was of the rare or common type, and the interaction between these factors. The logic of the two-step task is that an MB learner will stay with the same action if it was rewarded after a common transition, but will be more likely to switch if it was rewarded after a rare transition. Thus, the degree of MB planning can be quantified as the interaction between previous reward and transition type.
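
A simplified (fixed-effects, single-agent) sketch of this regression on simulated data is shown below; the column names and data-generating process are purely illustrative, and the full analysis used a mixed-effects model across agents.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical trial-by-trial data for one agent.
rng = np.random.default_rng(1)
n_trials = 500
df = pd.DataFrame({
    "prev_reward": rng.integers(0, 2, n_trials),   # reward on previous trial (0/1)
    "prev_common": rng.integers(0, 2, n_trials),   # previous transition common (1) or rare (0)
})
# Simulate stay/switch behaviour with an MB-like interaction, for illustration only.
logit_p = 0.5 + 1.5 * (2 * df.prev_reward - 1) * (2 * df.prev_common - 1)
df["stay"] = rng.random(n_trials) < 1 / (1 + np.exp(-logit_p))

# The reward x transition-type interaction coefficient indexes the degree of MB planning.
model = smf.logit("stay ~ prev_reward * prev_common", data=df.astype(float)).fit(disp=0)
print(model.params["prev_reward:prev_common"])
```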

For quantifying the degree of allocentric place memory, we computed the average distance between the previous platform location and the location of the maximum of the agent's value function at the start of the next session. This is akin to the boundary distance error employed by (13).
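
A short sketch of this measure, with hypothetical variable names and an illustrative 5x5 grid of states:

```python
import numpy as np

def place_memory_error(value_map, prev_platform_xy, xy_of_state):
    """Distance between the previous platform location and the peak of the value map.

    value_map        : array of state values at the start of a session
    prev_platform_xy : (x, y) of the platform on the previous session
    xy_of_state      : array mapping each state index to its (x, y) position
    All names are illustrative; this mirrors the boundary distance error of (13).
    """
    peak_xy = xy_of_state[np.argmax(value_map)]
    return np.linalg.norm(np.asarray(peak_xy) - np.asarray(prev_platform_xy))

xs, ys = np.meshgrid(np.arange(5), np.arange(5))
xy_of_state = np.stack([xs.ravel(), ys.ravel()], axis=1)
value_map = np.random.default_rng(2).random(25)
print(place_memory_error(value_map, prev_platform_xy=(2, 3), xy_of_state=xy_of_state))
```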