Model-free decision making is prioritized when learning to avoid harming others

Significance
“Do no harm” is a universal principle of human social life. But how do we learn which of our actions help or harm others? Learning theory suggests there are two different systems that govern how we link actions and outcomes: a model-free system that is efficient and a model-based system that is deliberative. Here we show that people rely more on model-free decision making when learning to avoid harming others compared to themselves. Model-free neural signals that distinguish self and other are observed in the thalamus/caudate, and reliance on model-free moral learning for others varies with individual differences in moral judgment. These findings suggest that moral decision making for others is more model-free and has a specific neural signature.


Experimental note in introduction
When designing our study, we considered multiple variants of two-step tasks that have previously been used in research [1][2][3][4]. Our aim was to design a task that assesses people's 'relative balance' between model-free and model-based strategies, using a paradigm that yields equal frequencies of pain and no pain outcomes independent of the level of model-basedness and thus does not 'push' people towards being more model-based. We ultimately opted for a hybrid of the original Daw et al. 4 paradigm and the more recently suggested version by Kool et al. 2, including some of the benefits of both variants of the task.
We chose to adopt one important feature from the original paradigm by Daw et al. 4 and designed the current paradigm such that there was no incentive to be model-based. In other words, model-free and model-based learning strategies yielded overall similar proportions of positive versus negative outcomes. This meant that participants could not avoid more pain by adopting a model-based over a model-free strategy. This was an important feature for two reasons. First, there is extensive evidence that people value others' outcomes differently from their own 5,6 and are willing to exert more effort to benefit themselves than others 7. If we had adopted the more recent variant of this task by Kool et al. 2, in which the model-based strategy more effectively obtains desired outcomes, and observed differences in model-based learning for self vs. other, we would have been unable to rule out the possibility that such differences were due to differential utilities of outcomes for self vs. other. Second, this aspect of the original variant of the task is optimised to measure people's 'relative balance' or 'natural tendency' between engaging in model-free vs. model-based learning. In their paper describing the newer version of the two-step task, Kool et al. themselves state that "even though there is no significant relationship between reward and model-based control [in the two-step task], this does not undermine the usefulness of the task for measuring the relative balance of model-based and model-free strategies" (pages 6-7). The authors therefore highlighted the usefulness of a variant more closely based on the original Daw paradigm for assessing the relative balance between systems, which was the main aim of the current study. By contrast, the newer version 2 focusses on the trade-off between cognitive demand and 'accuracy' (here successfully avoiding harm), which was not the focus of the current study.
However, we also adopted some of the improvements in task design suggested by Kool et al. 2, including a change in drift rate and drifts that include more extreme probabilities to facilitate learning. Specifically, we made the drifting of the random walks faster (0.2 as in Kool, rather than 0.025 in Daw) and, as suggested by Kool et al., bounded the reward (pain) rate between 0 and 1 rather than between 0.25 and 0.75. The drift rate refers to the standard deviation of the normal distribution from which the random noise added at each timepoint was drawn when generating the random walks. The first modification allowed us to assess learning within a smaller number of trials (138 per agent) than in Daw et al 42 (201 trials) but comparable to Kool et al. 2 (125 trials). The second modification allowed us to do this whilst ensuring the task was not too difficult. Also, because there are currently no studies involving neural data for the Kool et al. paradigm, we wanted a task that could be used, in part, to replicate the neural findings described in Daw et al. 4.
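The drifting outcome probabilities described above can be illustrated with a minimal Python sketch. It assumes reflecting boundaries at 0 and 1, four second-stage options and uniformly random starting values; none of these details are stated here, and the function name is hypothetical. Only the standard deviation (0.2), the bounds (0 to 1) and the number of trials (138) come from the text.

```python
import random


def bounded_gaussian_walk(n_trials, start, sd=0.2, lower=0.0, upper=1.0, rng=None):
    """Return n_trials probabilities from a Gaussian random walk kept inside [lower, upper].

    Assumption: values that leave the interval are reflected back into it.
    """
    rng = rng or random.Random()
    p = start
    walk = []
    for _ in range(n_trials):
        walk.append(p)
        # Add Gaussian noise; sd is the "drift rate" in the sense used in the text.
        p += rng.gauss(0.0, sd)
        # Reflect off the bounds until the value is back inside the interval.
        while p < lower or p > upper:
            p = 2 * lower - p if p < lower else 2 * upper - p
    return walk


rng = random.Random(1)
# Assumed layout: four second-stage options, each with its own drifting probability.
walks = [bounded_gaussian_walk(138, rng.random(), rng=rng) for _ in range(4)]
```

With a standard deviation of 0.2 on a unit interval the walks reach extreme probabilities within a few trials, which is the property the faster drift is meant to provide.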

Linear mixed effect model of behavioural data
We validated our behavioural results by repeating our analysis using linear mixed-effects models with the lme4 package in R, fitted using Bound Optimization by Quadratic Approximation (see Methods). This approach ensured we had good estimates of random effects and accounted for variability in behaviour. All results remained the same for the model glmer(Switch_Stay ~ NoPain_Pain*Transition*Recipient + (1 + NoPain_Pain*Transition*Recipient | Subject)): main effect of outcome (model-free), Z=4.154, p<.001; outcome x transition interaction (model-based), Z=3.841, p=.001; outcome x recipient, Z=-2.048, p=0.041; [outcome x transition (model-based)] x recipient, Z=0.255, p=0.80.

Bayesian 1st level analysis comparing prediction error and outcome only models
We ran a Bayesian 1st level analysis in SPM12 to compare exceedance probabilities for the PE model (Model 1) and a model based on outcome only (Model 2, no pain/pain). We constructed GLMs that were identical except for whether the onset of each outcome had a parametric regressor modelling model-free prediction errors or outcomes (1 or 0). We note that PE and outcome regressors were highly correlated (r² = .83). We then compared the exceedance probabilities at the group level for each of the peak coordinates in thalamus and ventral striatum reported in the manuscript. This analysis showed that one peak in the thalamus was best explained by a PE model (M1, xp=0.97), one by the outcome model (M2, xp=1) and another equally by both models (xp=0.43 and xp=0.57). In the striatum, one peak (right) was best explained by a PE model (Table S2) whereas another (left) was best explained by an outcome model (Table S2). Therefore, the results suggest anatomical specificity in tracking PEs and outcomes. Both variables were encoded in thalamus and striatum, but in anatomically distinct locations.

TPJ BOLD and model-free and model-based behaviour
In GLM2, we observed the inverse pattern to the one seen in sgACC in right TPJ (x=54, y=-38, z=34, Z=3.56, K=39, p=.03, FWE-SVC, Fig. S2). In other words, the signal in TPJ was opposite to what would be predicted by model-free influence. This region was more active during switch relative to stay choices on the current trial, following a 'no pain' outcome on the previous trial, in the 'other' condition specifically. However, in GLM3 where we tested whether there was a significant outcome x transition interaction (model-based), as well as a main effect of outcome (model-free), we found no suprathreshold voxels in the independent anatomical TPJ ROI.

Switch/stay additional analyses
In the switch-stay analysis we identified sgACC and TPJ activity after receiving no pain for other (Figure 5 and Fig. S2). There were no significant results at the whole brain level or in any of our ROIs after no pain for self, or after pain for either agent. Because switch-stay was coded as a parametric modulator, we were unable to statistically assess a full 2 (self/other) x 2 (switch/stay) x 2 (no pain/pain) interaction. However, in sgACC, we additionally confirmed using post-hoc t-tests that there was indeed no response for the other three parametric modulators: self no pain (t(32)=-.56, p=.58, BF01 = 4.6 [substantial in support of null]), self pain (t(32)=1.52, p=.14, BF01 = 1.89 [anecdotal in support of null]) and other pain (t(32)=-.05, p=.96, BF01 = 5.37 [substantial in support of null]). This was also true for right TPJ: it distinguished between stay and switch after no pain for other, but there were no significant effects in any of the other three conditions: self no pain (t(32)=.95, p=.35, BF01 = 3.56 [substantial in support of null]), self pain (t(32)=1.09, p=.29, BF01 = 3.12 [substantial in support of null]) and other pain (t(32)=-.065, p=.95, BF01 = 5.36 [substantial in support of null]).

Testing for differences in neural responses to self and other at decision time
We ran an exploratory GLM to test whether participants showed any difference in overall response to self vs. other at decision time, perhaps consistent with such processing requiring a greater cognitive load. We modelled the onsets of self and other choices separately and ran a simple contrast to compare them directly. No brain areas survived whole brain correction, and there was no significant response in any of our ROIs.

Model-free influence at choice additional analyses
Although the results related to Figure 5 in the main text are consistent with a model-free influence in sgACC, a decisive test would be to also show a BOLD effect of outcome, but no significant outcome x transition interaction, analogous to the behavioural analysis. Therefore, in an additional GLM (GLM3) we instead modelled onsets separately for stay versus switch choices with two parametric regressors each: a model-free regressor (outcome, no pain/pain) and a model-based regressor (outcome x transition, no pain/pain x common/rare transition). We allowed the model-free and model-based regressors to directly compete for variance (correlations were all <0.4). We found that sgACC activity for other reflected the model-free effect, responding most strongly to switch choices following pain relative to no pain, in a cluster that overlapped with our analysis in GLM2 (x=0, y=36, z=6, K=29; Z = 3.46, p=.024, FWE-SVC for an anatomical sgACC ROI after initial thresholding at p<.001). Importantly, within this independent anatomical ROI, there was no significant parametric modulation that reflected the outcome x transition interaction (p=.12, t(32) =1.59, BF01=1.7, d=0.28), and the model-free and model-based estimates were also significantly different from one another (p=.017, t(32) = -2.5, d = -.44). We note that it would be ideal to conduct an analysis on just the rare trials to provide further support that this signal reflects a model-free influence. However, by design, there are few rare trials, and such an analysis would not be well powered.

Psychophysiological Interaction analyses
For completeness we also ran an additional set of GLMs that used seed regions in brain areas apart from sgACC that showed specificity of model-free processing for other compared to self (Thalamus and TPJ). We defined a seed region in these two areas using a 6mm sphere based on the peak co-ordinates from our analyses. We then extracted the physiological variable and the psychophysiological interaction terms for other prediction error > self prediction error (peak in thalamus/caudate) and stay vs. switch after no pain for other. These PPI terms were entered into the GLMs along with all previous regressors that specified the events of our study. In all PPI GLMs, six head motion parameters modelled the residual effects of head motion as covariates of no interest. Neither area showed significant connectivity with any part of the brain at the whole brain or small-volume corrected level.

Profile of dlPFC response
We ran two additional GLMs (GLM 4.1 and GLM 4.2) to assess whether dlPFC responses were specific to connectivity with sgACC or also showed a main effect. In the first, we modelled the onset of stay vs. switch trials separately for the self and other conditions to test whether dlPFC was more active for switch vs. stay trials overall. We confirmed that the model was estimable (correlations between regressors: all |r| < .032). This analysis showed no significant voxels at the whole brain level or in any of our ROIs. In the second GLM, we coded each participant's choice as consistent versus not consistent with model-free behaviour (1=stay after no pain, switch after pain; -1=stay after pain, switch after no pain) and again observed no significant voxels at the whole brain level or in any of our ROIs. These findings suggest that dlPFC effects were related to differential connectivity with sgACC, rather than reflecting a main effect of switch/stay during the task or a main effect of model-free processing regardless of connectivity with sgACC.

Correlations between shock aversiveness and w parameter
We also tested whether ratings of the aversiveness of shocks for self and other, or the difference between self and other, correlated with the model-free x recipient interaction and with the w parameters for self and other. No correlations were statistically significant (all r's <.09, all ps > .61). Instead, the strongest correlation with the model-free by recipient interaction was with the sensitivity of moral judgments to harmful outcomes, consistent with the idea that aversion to harming others drove the model-free effects rather than the perception of the aversiveness of shocks.

Time pressure, reaction times and social decision-making
Previous studies have suggested that time pressure can alter social decision-making such that increased time pressure can make people more cooperative 8,9 . Although in our experiment average RTs (1.2 seconds) were well below the allotted time available to make decisions (2.5 seconds), it would be interesting to compare a self-paced version of the task with one where participants are under time-pressure to examine whether more time to deliberate would reduce participants' model-free behaviour for others.
Previous research using the two-step task in combination with eye tracking has suggested that subjects who are "model-free" deliberate more at the time of choice whereas those that are "model-based" may make their choice about which first-stage cue to select prior to its onset 10 . Whilst we did not observe sgACC or dlPFC activity when we modelled switch-stay decisions at the time of the outcome instead of the choice, or any differences in reaction times between conditions, it would be interesting to examine whether an aversion to harming others is reflected in eye gaze deliberation.

Alternative accounts of increased model-free decision making for others
A potential alternative explanation for our behavioural finding that participants were more model-free for other than self is that model-based learning is effortful 2, and people might choose to put in less effort to benefit others 7. As described in the main text, this explanation seems unlikely because being more model-free overall should equally influence trials that lead to pain or no pain. However, in our study, model-free moral learning was specifically driven by a lower probability of repeating choices that harmed others (pain outcome), while there was no difference between self and other on trials that avoided harm (no pain outcome). Furthermore, we did not observe differences in choice consistency (captured by the temperature parameter in our model) during learning for self vs. other. Since choice consistency is related to task engagement, if our behavioural effects reflected reduced effort for others, we might expect to see lower choice consistency when learning for others than self. However, we acknowledge that the link between choice consistency and cognitive control is not completely straightforward 11. RT analyses also showed that choices were not slower for other compared to self trials, which again might be expected if effort differed between self and other.

Opposing Prediction error signals in the thalamus
Of note, the thalamus tracked prediction errors for self and other in opposing directions, signalling a positive PE when avoiding pain for others, but signalling a negative PE when causing harm to oneself. Whilst we interpret these results with caution to avoid problems of reverse inference, one possible explanation is that participants were differentially invoking optimism/pessimism biases for self and other. If participants were pessimistic about avoiding harm to others but optimistic about avoiding harm to themselves, this may have been reflected in opposing encoding in the thalamus. It would be interesting for future studies to employ designs used to study optimism and pessimism biases and examine whether these might be able to explain learning to avoid harm for self and other.

Overlap between behavioural and neuroimaging results
Of note, our main neural result at decision time in sgACC was specific to no pain trials, whereas behaviourally we observed differences between self and other more strongly on pain trials, as evidenced by increased switching after pain caused to others. However, overall there was consistency between our behavioural and neural effects. First, our neural finding of distinct model-free prediction error signalling for self vs. other outcomes in the ventral striatum and thalamus/caudate is in line with our behavioural effect of more model-free behaviour for others, as the prediction error tracks responses parametrically to no pain vs. pain outcomes. Second, our behavioural effect (more model-free for other than self) does correlate with responses in sgACC, with those people who are more model-free for other having stronger sgACC responses to stay vs. switch after no pain for other.
Finally, we ran an additional analysis to further understand our results in sgACC (see SI Text, Model-free influence at choice additional analyses, for full statistical details). In parallel to our behavioural analysis, we coded a GLM where we modelled parametric regressors of the choice at the current trial as a function of (a) outcome on the previous trial (pain/no pain, the model-free influence) and (b) an interaction between outcome (pain/no pain) and transition (rare/common) on the previous trial (the model-based influence). We then looked at these two parametric regressors separately for trials when participants switched or stayed, or by pooling across switch/stay decisions. We used an anatomical ROI in sgACC that was not biased in any way towards finding model-free or model-based effects.
Our results demonstrate effects in sgACC consistent with a model-free, but not a model-based influence on choice, in a cluster overlapping with our original analysis reported in the manuscript. In other words, sgACC activity tracks more strongly with decisions to switch than decisions to stay following pain than no pain for other (model-free effect: contrast of pain outcome (coded as -1) vs. no pain outcome (coded as 1), regardless of transition type). We note that this is highly consistent with our behavioural effect, whereby participants are more likely to switch following pain than no pain to others, regardless of transition type.
By contrast, the model-based regressor that codes the outcome x transition interaction shows no significant parametric modulation in sgACC, with these two effects in sgACC significantly different from one another. Taken together, these results suggest consistency between our behavioural and neuroimaging analyses (see Fig. S3 for additional details).

Detailed description of full 7-parameter model
We refer to Daw et al. 4 for a more detailed description of the learning model but repeat a short description with the formulas here for completeness.
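As an illustration of the hybrid learner described by Daw et al. 4, the following minimal Python sketch implements the seven parameters named in this supplement (two learning rates, two temperatures, eligibility trace, perseverance and weighting parameter w). Several details are assumptions made for the example and are not taken from this document: the common transition probability is fixed at 0.7, first-stage action 0 commonly leads to second-stage state 0, outcomes are coded 1 for no pain and 0 for pain, and all values start at zero. The class and method names are hypothetical.

```python
import math


def softmax2(v0, v1):
    """Probability of choosing option 0 given two decision values."""
    return 1.0 / (1.0 + math.exp(v1 - v0))


class HybridLearner:
    """Sketch of a hybrid model-free/model-based learner for a two-step task."""

    def __init__(self, alpha1, alpha2, beta1, beta2, lam, rho, w, p_common=0.7):
        self.alpha1, self.alpha2 = alpha1, alpha2  # learning rates, stage 1 and 2
        self.beta1, self.beta2 = beta1, beta2      # temperatures, stage 1 and 2
        self.lam, self.rho, self.w = lam, rho, w   # eligibility trace, perseverance, weight
        self.p_common = p_common                   # assumed fixed transition probability
        self.q1 = [0.0, 0.0]                       # model-free first-stage values
        self.q2 = [[0.0, 0.0], [0.0, 0.0]]         # second-stage values, per state
        self.prev_choice = None

    def first_stage_probs(self):
        best = [max(q) for q in self.q2]
        values = []
        for a in (0, 1):
            # Model-based value: expected best second-stage value under the transition model.
            q_mb = self.p_common * best[a] + (1.0 - self.p_common) * best[1 - a]
            q_net = self.w * q_mb + (1.0 - self.w) * self.q1[a]
            rep = 1.0 if self.prev_choice == a else 0.0
            values.append(self.beta1 * (q_net + self.rho * rep))
        p0 = softmax2(values[0], values[1])
        return [p0, 1.0 - p0]

    def second_stage_probs(self, state):
        q = self.q2[state]
        p0 = softmax2(self.beta2 * q[0], self.beta2 * q[1])
        return [p0, 1.0 - p0]

    def update(self, a1, state, a2, outcome):
        # Stage-1 prediction error (no outcome is delivered at stage 1).
        delta1 = self.q2[state][a2] - self.q1[a1]
        self.q1[a1] += self.alpha1 * delta1
        # Stage-2 (outcome) prediction error.
        delta2 = outcome - self.q2[state][a2]
        self.q2[state][a2] += self.alpha2 * delta2
        # The eligibility trace passes the outcome prediction error back to stage 1.
        self.q1[a1] += self.alpha1 * self.lam * delta2
        self.prev_choice = a1
        return delta1, delta2


# A "no pain" outcome after a rare transition separates the two strategies.
mf = HybridLearner(0.5, 0.5, 3.0, 3.0, 1.0, 0.0, w=0.0)
mb = HybridLearner(0.5, 0.5, 3.0, 3.0, 1.0, 0.0, w=1.0)
for agent in (mf, mb):
    agent.update(a1=0, state=1, a2=0, outcome=1.0)
p_stay_mf = mf.first_stage_probs()[0]
p_stay_mb = mb.first_stage_probs()[0]
```

In this sketch the purely model-free agent (w=0) becomes more likely to repeat the first-stage choice, whereas the purely model-based agent (w=1) becomes more likely to switch, which is the behavioural signature the stay/switch analyses rely on.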

Computational modelling of behavioural data
For modelling of choice behaviour using trial-by-trial updates we proceeded in two steps. First, we fitted a range of plausible models separately to self and other blocks. This was to probe whether the same model would win for self and other blocks, allowing us to rule out that participants might employ entirely different strategies in the two block types. The following models were fitted:
(1) 7-parameter model: the full model specified by Daw et al., with learning rates for stage 1 and 2 (αS1, αS2), temperature parameters for stage 1 and 2 (βS1, βS2), a perseverance parameter (ρ), an eligibility trace (λ) and a model-free/model-based weighting parameter (w); for full details of the model see Supplementary information and 4
(2) 6-parameter model: as (1) but with λ=1 (λ was shown to have a high mean value and small variance in previous work, e.g. 12)
(3) 5-parameter model: as (1) but with only one α and one β for stage 1 and 2
(4) 4-parameter model: as (3) but with λ=1
(5) 5-parameter model: as (4) but with two learning rates for pain and no pain outcomes (αPain, αNoPain)
Models were fitted using a hierarchical Bayesian model fitting approach described in detail in 13,14. It finds the maximum a posteriori estimate of each parameter for each subject, using a prior distribution for each parameter that helps to regularise and constrain parameters. The algorithm uses Expectation-Maximization (EM) 15, and parameters were transformed using a logistic or exponential transformation to enforce constraints and ensure normality such that 0<{α,w}<1 and {β,λ}>0.
For formal model comparison, we report the Bayesian Information Criterion (BIC) based on the log-likelihood, and computed the model evidence by integrating out the free parameters (BICint 13,14 ; Tables S3 and S4). Exceedance probabilities were calculated by feeding the BICint into SPM's function spm_BMS (http://www.fil.ion.ucl.ac.uk/spm/software/spm8).
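For reference, the standard BIC can be computed from a model's log-likelihood as in the minimal Python sketch below. This shows only the textbook formula; BICint additionally integrates out the free parameters and is not reproduced here. The function name and the example numbers are hypothetical.

```python
import math


def bic(log_likelihood, n_params, n_observations):
    """Standard Bayesian Information Criterion; lower values indicate a better model."""
    return n_params * math.log(n_observations) - 2.0 * log_likelihood


# Hypothetical comparison: a 5-parameter and a 7-parameter model with the same
# log-likelihood over 138 trials. The extra parameters are penalised.
bic_small = bic(-80.0, 5, 138)
bic_large = bic(-80.0, 7, 138)
```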
The five-parameter model with separate learning rates for pain and no-pain outcomes best explained behaviour in both self and other conditions (Table S3). We report the difference between the best-fitting parameters, but this method has a caveat. Because of the nature of hierarchical fitting, which uses separate priors for self and other parameters, this method is somewhat biased towards finding differences. Conversely, fitting self and other parameters using the same priors is overly conservative and biased against finding differences.
To resolve whether the parameter w differed between self and other blocks, in line with results from the basic logistic regression analyses and without introducing any such biases, we therefore fitted, in a second step, three models to the merged data of both self and other blocks:
(1) 5-parameter model (as (5) above) with all parameters shared between self/other
(2) 6-parameter model with the learning rates (αPain, αNoPain), temperature (β) and perseverance (ρ) shared but w split into wSelf and wOther
(3) 7-parameter model with αPain, αNoPain and β shared, and ρ and w split into ρSelf, ρOther, wSelf and wOther
As described above, model comparison was performed based on BICint values. The mean parameter estimates are shown in Table S4. We also simulated data from our participant schedules and showed that we had reliable parameter recovery 16,17.

Parameter recovery
Because schedules had been optimized for the seven-parameter model, but our winning model involved five parameters, with two separate learning rates for pain and no pain outcomes, one inverse temperature parameter and a fixed eligibility trace λ=1, we tested whether we could recover parameters for this model from simulated data for which we knew the ground truth. For the five parameters (αPain, αNoPain, β, ρ, w), we simulated behaviour using the same schedule given to our participants. We used a wide range of parameter values. We added noise to each of the five parameters for each simulated agent (from a standard normal distribution multiplied by 0.1) to improve our coverage of possible parameter values. After having generated the behaviour, we refitted the simulated behaviour using fminunc in Matlab. We used the best fit from 10 random starting configurations to avoid local minima.
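The two generic ingredients of this procedure, jittering the generating parameters and keeping the best of several randomly initialised fits, can be sketched in Python as follows. This is not the Matlab pipeline: a simple derivative-free compass search stands in for fminunc, the objective is a toy placeholder for the negative log-likelihood of simulated choices, and all function names and numbers are hypothetical.

```python
import random


def compass_search(objective, start, lower, upper, step=0.25, tol=1e-6):
    """Derivative-free local minimiser on a box (a stand-in for a gradient-based optimiser)."""
    x = list(start)
    fx = objective(x)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for direction in (-1.0, 1.0):
                cand = list(x)
                cand[i] = min(upper[i], max(lower[i], x[i] + direction * step))
                fc = objective(cand)
                if fc < fx:
                    x, fx, improved = cand, fc, True
        if not improved:
            step /= 2.0  # refine the search once no neighbour improves the fit
    return x, fx


def multistart_fit(objective, lower, upper, n_starts=10, seed=0):
    """Run the local search from n_starts random configurations and keep the best fit."""
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(n_starts):
        start = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        x, fx = compass_search(objective, start, lower, upper)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f


def jitter(params, scale=0.1, rng=None):
    """Add standard-normal noise multiplied by `scale` to each generating parameter."""
    rng = rng or random.Random()
    return [p + scale * rng.gauss(0.0, 1.0) for p in params]


# Toy demonstration: the objective is the squared distance to known "true" values,
# standing in for the negative log-likelihood of behaviour simulated from them.
true_params = [0.3, 0.7]
recovered, misfit = multistart_fit(
    lambda x: sum((a - b) ** 2 for a, b in zip(x, true_params)),
    lower=[0.0, 0.0],
    upper=[1.0, 1.0],
)
```

In a full recovery analysis the generating and recovered values would then be compared across many simulated agents, for example by correlating them parameter by parameter.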

Statistical analysis of behavioural data
For Fig 2a, we calculated the % of stay choices after common or rare transitions following pain or no pain outcomes (2x2). For regression analyses using lme4 in R, we coded Stay as 1 and Switch as 0 and created regressors for transition type, outcome, outcome x transition type, agent x outcome and agent x outcome x transition type. We used Bound Optimization by Quadratic Approximation (bobyqa) with 1e5 function evaluations. We examined bivariate associations between the interactions with agent from the regression analyses and neural responses with individual differences in utilitarianism using the Oxford Utilitarianism Scale (OUS) 18 and action and outcome sensitivity from the Harmful Action Outcome Scale 19 .
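The stay percentages and regression predictors described above can be derived from a trial list as in the following minimal Python sketch. The data layout, the ±1 effect coding of outcome, transition and agent, and the function names are assumptions made for illustration; only the Stay = 1 / Switch = 0 coding and the set of regressors come from the text.

```python
def stay_table(trials):
    """Percentage of stay choices by previous outcome and previous transition.

    trials: list of dicts with keys 'choice', 'outcome' ('pain' or 'no_pain')
    and 'transition' ('common' or 'rare'), in trial order.
    """
    counts = {}
    for prev, cur in zip(trials, trials[1:]):
        key = (prev["outcome"], prev["transition"])
        stays, total = counts.get(key, (0, 0))
        counts[key] = (stays + int(cur["choice"] == prev["choice"]), total + 1)
    return {key: 100.0 * s / n for key, (s, n) in counts.items()}


def regression_rows(trials, agent):
    """One row per trial (from the second trial on) with the stay outcome and predictors."""
    recipient = 1 if agent == "other" else -1  # assumed effect coding
    rows = []
    for prev, cur in zip(trials, trials[1:]):
        outcome = 1 if prev["outcome"] == "no_pain" else -1
        transition = 1 if prev["transition"] == "common" else -1
        rows.append({
            "stay": int(cur["choice"] == prev["choice"]),  # Stay = 1, Switch = 0
            "outcome": outcome,
            "transition": transition,
            "outcome_x_transition": outcome * transition,
            "agent_x_outcome": recipient * outcome,
            "agent_x_outcome_x_transition": recipient * outcome * transition,
        })
    return rows


example = [
    {"choice": 0, "outcome": "no_pain", "transition": "common"},
    {"choice": 0, "outcome": "pain", "transition": "rare"},
    {"choice": 1, "outcome": "no_pain", "transition": "common"},
    {"choice": 1, "outcome": "pain", "transition": "common"},
]
table = stay_table(example)
rows = regression_rows(example, agent="other")
```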

Moral Judgment Scales
The OUS-IH consists of 4 items reflecting a relative willingness to cause harm to others in order to bring about the greater good (e.g., "It is morally right to harm an innocent person if harming them is a necessary means to helping several other innocent people"). Participants rated these items on a 7-point scale (1 = strongly disagree, 7 = strongly agree). A mean score was then computed for all participants.
For the HAO, participants were presented with a scenario about two people, Carl and John. They were told that John is terminally ill, sincerely wants to die and has asked Carl to perform a mercy killing. Participants then rated how morally wrong each of 23 methods of killing was on a scale from 1 (the least morally wrong) to 10 (the most morally wrong). In a previous study 19, participants were asked to rate the action value and outcome value of these different methods of killing. To assess action value, the researchers asked participants to rate how upsetting it would be to "act out" performing each behaviour as though it were part of a movie script. To assess outcome value, they asked them to rate how much suffering each act would impose. We used the mean action and outcome scores derived from this initial paper to predict the 'wrongness' scores in our current sample. This analysis created two beta weights for each participant, corresponding to action and outcome sensitivity, respectively.
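The per-participant beta weights can be illustrated with an ordinary least-squares regression of wrongness ratings on the normative action and outcome scores. The Python sketch below solves the two-predictor normal equations directly. Whether predictors were standardised in the original analysis is not stated, so unstandardised betas are an assumption here, and the function name and example values are hypothetical.

```python
def two_predictor_betas(y, x1, x2):
    """OLS slopes for y ~ intercept + x1 + x2, solved from the centred normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * v for a, v in zip(c1, cy))
    s2y = sum(b * v for b, v in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("predictors are collinear or constant")
    beta1 = (s22 * s1y - s12 * s2y) / det
    beta2 = (s11 * s2y - s12 * s1y) / det
    return beta1, beta2


# Hypothetical item-level scores (the real scale has 23 items).
action_value = [1.0, 2.0, 3.0, 4.0, 5.0]
outcome_value = [2.0, 1.0, 4.0, 3.0, 6.0]
# A hypothetical participant whose ratings weight outcome more than action.
wrongness = [1.0 + 2.0 * a + 3.0 * o for a, o in zip(action_value, outcome_value)]
action_beta, outcome_beta = two_predictor_betas(wrongness, action_value, outcome_value)
```

Applied to each participant's 23 ratings, the first slope plays the role of action sensitivity and the second of outcome sensitivity.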

fMRI pre-processing and set-up of GLMs 1-5
Images were realigned and unwarped using a fieldmap and co-registered to the participant's own anatomical image. The anatomical image was processed using a unified segmentation procedure combining segmentation, bias correction, and spatial normalization to the MNI template using the New Segment procedure; the same normalization parameters were then used to normalize the EPI images. Lastly, a Gaussian kernel of 8 mm FWHM (SPM default) was applied to spatially smooth the images.
Before the study, example first-level design matrices were checked to ensure that estimable GLMs could be performed with independence between the parametric regressors: value difference at the first-stage choice, the state prediction error at stage 2, and a model-free prediction error at the time of the outcome (Fig. S7). This allowed us to look at value and prediction error responses independently of one another. We also tested a GLM that coded switch vs. stay trials as a parametric modulator at the time of choice dependent on the previous outcome. Again, this GLM could be estimated with independence (see Fig. S8). We convolved these different event types with SPM's canonical haemodynamic response function. All events were modelled as stick functions with 0 duration.
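The independence check on parametric regressors amounts to inspecting pairwise correlations between them. A minimal Python sketch, assuming each regressor is available as a list of per-trial values (the names and toy values are hypothetical), is:

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a regressor with zero variance cannot be correlated")
    return sxy / math.sqrt(sxx * syy)


def max_abs_correlation(regressors):
    """Largest absolute pairwise correlation among named regressors."""
    return max(
        abs(pearson_r(regressors[a], regressors[b]))
        for a, b in combinations(sorted(regressors), 2)
    )


# Toy regressors: the first two are orthogonal, the third correlates with both.
toy = {
    "value_difference": [1.0, -1.0, 1.0, -1.0],
    "state_pe": [1.0, 1.0, -1.0, -1.0],
    "outcome_pe": [1.0, 2.0, 3.0, 4.0],
}
worst = max_abs_correlation(toy)
```

A design would be flagged for revision when the largest absolute correlation exceeds whatever threshold is considered acceptable for the analysis.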
For GLM1, each of these regressors was associated with parametric modulators taken from the computational model. At the time of the first-stage choice this was the value difference from the hybrid model combining model-free and model-based learning. At the time of the second-stage choice this was the state prediction error for the transition from stage 1 to stage 2, and at the time of the outcome this was the model-free prediction error (since the behavioural differences were in the model-free parameters). In all cases, values were modelled separately for the onsets of self and other trials. As in 20, we fixed the parameters to the average values for self and other (Table S5 and S6) but allowed w to vary.
For GLM2 we modelled whether participants stayed or switched at the first-stage choice relative to the outcome on the previous trial, i.e. no pain or pain. Due to the smaller number of trials included in this analysis we coded stay and switch as a parametric regressor with values of 1 assigned to switch and -1 assigned to stay. One participant did not have a trial in at least one of these regressors and was therefore excluded from the stay-switch analysis. In all GLMs, for participants who missed trials, an extra regressor modelled all missed trials, on which participants did not select one of the first-stage choices.
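The switch/stay coding used in GLM2 can be sketched in Python as follows. The data layout (missed trials stored as None and skipped along with the trial that follows them) and the function names are assumptions made for the example; the +1 switch / -1 stay values and the split by previous outcome come from the text.

```python
def switch_stay_modulators(choices, outcomes):
    """Group trials by the previous outcome and code switch as +1 and stay as -1.

    choices: first-stage choices per trial (None marks a missed trial).
    outcomes: 'pain' or 'no_pain' per trial.
    Returns a dict mapping previous outcome to a list of (trial_index, value) pairs.
    """
    modulators = {"no_pain": [], "pain": []}
    for t in range(1, len(choices)):
        if choices[t] is None or choices[t - 1] is None:
            continue  # assumed handling: stay/switch is undefined around a missed trial
        value = 1 if choices[t] != choices[t - 1] else -1
        modulators[outcomes[t - 1]].append((t, value))
    return modulators


def has_empty_regressor(modulators):
    """True if any regressor has no trials, in which case the GLM cannot be estimated."""
    return any(len(events) == 0 for events in modulators.values())


choices = [0, 0, 1, None, 1, 0]
outcomes = ["no_pain", "pain", "no_pain", "pain", "pain", "no_pain"]
mods = switch_stay_modulators(choices, outcomes)
```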
GLM3 was designed to confirm that the BOLD effects in sgACC were indeed best explained by a model-free, rather than a model-based, influence. This GLM was similar to GLM2 but instead of splitting regressors by the outcome on the previous trial, we modelled the onsets of switch and stay choices (with respect to the choice on the previous trial), for both self and other. Each of these regressors was assigned two parametric modulators, a parametric regressor coding the outcome of the previous trial irrespective of transition (pain/no pain, model-free) and a second regressor that coded the outcome by transition interaction on the previous trial (outcome x transition, model-based).
GLM4 and GLM5 corresponded to our psychophysiological interaction analyses. We defined a seed region in the sgACC using a 6mm sphere based on the peak co-ordinates from our analyses (for completeness we also ran PPI analyses using seed regions in thalamus and TPJ, see SI Text). We then extracted the physiological variable and the psychophysiological interaction terms for stay vs. switch after no pain for other (GLM4) and stay vs. switch after no pain for self as a control analysis (GLM5). These PPI terms were entered into the GLMs along with all previous regressors that specified the events of our study as described above.

Neurosynth meta-analysis. (a) Neurosynth meta-analysis of the term 'Pain' from 516 studies (including both experienced pain and observed pain) showed robust activation in the thalamus that overlapped with the structural thalamus ROI we used to examine specific responses to Self and Other prediction errors. (b) Independent structural small-volume regions of interest in the thalamus and caudate.

Fig. S3.
Hypothesised pattern of model-free effects as a function of switch/stay and no pain/pain. In GLM2, we modelled onsets separately for choices following pain versus no pain outcomes, with switch/stay as the parametric modulator, and concluded that sgACC BOLD distinguished between stay versus switch after no pain for others (GLM2, red arrow in the schematic below). In GLM3, we modelled onsets separately for switch and stay choices, with pain/no pain on the previous trial as the parametric modulator, and here we see that sgACC activity distinguishes between switching more after pain compared to no pain to another person (denoted by the yellow arrow in the schematic below). Both of these effects are consistent with a model-free influence on decisions that affect others.

Fig. S4.
Bootstrapping analysis for rare trials. The effect of switch versus stay for other after no pain was estimated for rare trials only (blue line) and for 1000 GLMs that were matched, in trial numbers, to the rare trials but relied on a randomly drawn subset of only common trials in each iteration (grey histogram). Rare trials contributed to the effect of switch vs. stay for other after no pain about as much as common trials, supporting our interpretation that the sgACC response is consistent with a model-free effect (parameter estimate from rare trials only: -.042; mean parameter estimate from 1000 models using random subsets of common trials matching the number of rare trials: -.036 +/- SE .002). However, we note that each model in this analysis relied on a small subset of trials and thus this result should be interpreted with caution.

Fig. S5.
Correlations between Oxford Utilitarianism Scale (OUS), model-free moral behaviour and dlPFC switch after no pain for other.

Fig. S6.
Sensitivity to harm in moral judgments correlates with model-free moral behaviour and its neural correlates. (a) Partial correlation, controlling for harmful action sensitivity (as measured by the Harmful Action-Outcome Scale, Miller et al. 19), between harmful outcome sensitivity and model-free moral behaviour. Model-free moral behaviour was the parameter estimate for the model-free x recipient interaction, which showed that participants had a tendency to switch more after causing harm to the other participant. Note that the values are reversed on the y axis to depict that greater model-free behaviour is associated with greater harmful outcome sensitivity. (b) Partial correlation, controlling for harmful action sensitivity, between harmful outcome sensitivity and prediction errors of pain avoidance in the thalamus/caudate for Other compared to Self. (c) Partial correlation, controlling for harmful action sensitivity, between harmful outcome sensitivity and parameter estimates extracted from the parametric regressor for stay vs. switch after no pain for other in subgenual anterior cingulate cortex (sgACC). Note that values are reversed on the y axis such that a greater tendency to stay vs. switch tracked in sgACC correlates with harmful outcome sensitivity.

Fig. S7.
Correlation between parametric regressors in the model-based analyses. All correlations were |r| < 0.26, indicating that conditions could be appropriately estimated with independence from one another.

Fig. S8.
Correlation between parametric regressors in switch/stay analysis. All correlations were |r| < 0.1, indicating that conditions could be appropriately estimated with independence from one another.

Model comparison when fitting M5 and variants of it to the pooled data from self and other blocks. This does not support a strong conclusion. Nevertheless, the BICint slightly prefers M6 over the others. This model is identical to M5 but models separate model-free/model-based weights for self and other blocks (wSelf and wOther). Crucially, there is a significant difference between the parameter estimates obtained for wSelf and wOther, providing further support for including both parameters.

Average parameter estimates extracted for both agents fitted simultaneously for the three different models M5-M7. The winning model M6 contained one perseverance parameter (ρ) across both agents but separate model-free/model-based weighting parameters (w), which differed significantly from each other.

Table S5. Whole brain analysis (p<.05 cluster correction after thresholding at p<.001).

1) to an outcome only model (M2), shown at different peak coordinates in thalamus and ventral striatum. There was anatomical specificity and both models best explained the BOLD signal in different sub-peaks of the two areas we identified in our main analysis (GLM1), suggesting responses in different locations can be interpreted as reflecting the tracking of prediction errors or the tracking of outcomes.