Preferences for nutrients and sensory food qualities identify biological sources of economic values in monkeys

Significance Preferences for foods high in sugar and fat are near universal and major contributors to obesity. Additionally, human food choices are sophisticated and individualistic: we choose by evaluating a food’s nutrients and sensory features and trading them against quantity and cost. To understand the mechanisms behind human-like food choices, we developed an experimental paradigm in which monkeys chose nutrient rewards offered in varying quantities. Resembling human suboptimal eating, the monkeys’ fat and sugar preferences shifted their nutrient balance away from dietary reference points. Formally defined economic values for specific nutrients and food textures explained the monkeys’ preferences and individual differences. Our findings show how human-like preferences derive from biologically critical food components and open up investigations of underlying neural mechanisms.

contacting surfaces (Fig. 3A, right). Our design included a flat aluminium platform to hold the nominally flat fixed base pig tongue. The upper moving tongue-tip tissue was attached using superglue onto a dome-shaped slider of radius 100 mm. Thus, the two tissues contacted with a nominal point contact, avoiding issues with alignment during the sliding process. The anatomically upper surface of each tongue was used for each of the two contacting surfaces. The dome-shaped slider was mounted using low-friction bushings onto a track containing two rails, pivoted at one end. A counterweight was used to balance out the weight of the rail elements of the track. Thus, the load through the contact between the tongues was only the weight of the dome-shaped slider and tongue tip specimen (2.58 ± 0.07 N), which did not vary as the slider moved along the track. The slider was attached via a light string and pulley to an Instron 5544 Universal Testing machine (Instron, USA). Preliminary tests confirmed that pulley friction was negligible. During testing, the moving tissue was loaded against the fixed base pig tongue with the testing liquids interfaced as lubricating layers. The Instron machine then imposed a fixed velocity (v = 16 mm/s) on the slider, while measuring the traction force using a load cell attached to the Instron machine. This design measured the sliding friction between the liquids and oral tissues to approximate the oral sensing conditions of the animals. Because we maintained a constant velocity during the test, according to Newton's First Law of Motion,

$$\mu = \frac{\text{traction force } (F)}{\text{loading force } (N)} = \frac{\text{traction force measured by the tensile machine}}{\text{total weight of the slider and the fixed tongue tip}}$$

where $F$ is the applied force (traction) by the Instron machine, $N$ is the loading force perpendicular to the contact surface (normal force), and $\mu$, the coefficient of sliding friction, is the ratio of the traction force and the perpendicular loading force.
The day before testing, fresh pig tongues were obtained intact from a local butcher (Leech & Sons, Royston, UK) and were gently rinsed with water to remove residual blood and tissue fluids. We then retrieved the superficial 1 cm-thick, anterior 18 cm of the tongues for a flat contact surface that fitted onto the testing platform. The processed pig tongue slices were preserved in isotonic saline buffer (Phosphate-buffered saline, PBS, 1X, pH 7.4) in a freezer under 4 °C overnight. On the testing day, we first prepared the contact surfaces by gluing one 18-cm tongue on the base platform and another tongue tip (5 cm) on the surface of the dome-shaped slider. We weighed the dome-shaped slider with the attached tongue tip to give the loading force for later calculation of the CSF. Before each measurement, we rinsed both tongue surfaces with 10 mL of isotonic saline buffer (PBS) three times to remove residual testing liquids and hydrate the tongue surfaces. Next, we loaded 30 mL of the testing liquids between the pig tongues and pulled the slider from the posterior tongue bases forward to the anterior tongue tip (Fig. 3B). All procedures were approved by the Departmental Safety Office, Department of Engineering, University of Cambridge, including Control of Substances Hazardous to Health (COSHH) and biohazard risk assessments.
For the formal testing of the stimuli, we measured each liquid with triplicate repeats using two pairs of pig tongues in opposite measuring orders to cancel out possible carryover effects (low-fat to high-fat liquids and reverse). We first averaged the triplicate measurements for each liquid and divided them by the corresponding loading force to obtain the coefficients of sliding friction along the tongue surface. We selected the anterior 5-7 cm of the tongues as the analysis window because the mechanosensory receptors are located mainly within the anterior two thirds of tongue (2,3), and further anterior tongue tips were too thin for stable measurements. Because inevitable variations in tongue conditions influenced the absolute sliding-friction measurement, for each pair of testing tongues we normalized the measured coefficients with the coefficient of water obtained from the same pair of tongues. This normalization adjusted the offset in absolute measurement due to variations of the tongues and provided comparable results between different tongue pairs. Finally, we averaged the two normalized coefficients obtained from the different tongues for further analyses.
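The normalization steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the traction forces, the second load value, and the function names are hypothetical and are not taken from the study data or the original MATLAB code.

```python
from statistics import mean

def friction_coefficient(traction_forces_n, load_n):
    """Mean traction force over repeats divided by the normal load (mu = F / N)."""
    return mean(traction_forces_n) / load_n

def normalized_coefficients(pair_measurements, load_n, reference="water"):
    """Normalize each liquid's coefficient by the water coefficient of the same tongue pair."""
    mu = {liquid: friction_coefficient(reps, load_n)
          for liquid, reps in pair_measurements.items()}
    return {liquid: value / mu[reference] for liquid, value in mu.items()}

# Hypothetical traction forces (N), three repeats per liquid, for two tongue pairs.
pair_1 = {"water": [1.00, 1.10, 0.90], "high_fat": [0.50, 0.55, 0.45]}
pair_2 = {"water": [2.00, 2.10, 1.90], "high_fat": [1.20, 1.25, 1.15]}
loads_n = [2.58, 2.60]  # slider plus tongue-tip weight per pair (illustrative values)

# Normalize within each tongue pair, then average the two normalized coefficients.
per_pair = [normalized_coefficients(p, n) for p, n in zip((pair_1, pair_2), loads_n)]
final = {liquid: mean(d[liquid] for d in per_pair) for liquid in pair_1}
```

Because each coefficient is divided by the water coefficient from the same tongue pair, the loading force cancels within a pair, which is why the normalization removes pair-specific offsets.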
Our engineering approach measured sliding friction on biological tissues to approximate oral food-sensing conditions. We followed an earlier study that used pig tongues to measure sliding friction and link human fat perception to lower friction coefficients (4). The physical mechanism by which fat in emulsions lowers friction, i.e. produces lubrication, likely involves coalescence of fat droplets on oral surfaces that form an adhering fat layer (5)(6)(7). The present study is the first to combine food-engineering methods with a controlled repeated-choice paradigm. To refine this technique, future studies should examine differences in hydrodynamic conditions between the tribological set-up and oral conditions, dependences on lubrication by saliva and testing speed, and measure related food properties (e.g. coalescence of fat particles, fatty-acid concentration) (8).

Data analysis and statistical methods
Unless otherwise specified, all data were analyzed separately for each animal and different juice flavors (peach and blackcurrant) using custom code and in-built functions in MATLAB R2017b.

Nutrient-specific choice biases.
We assessed the preferences for fat and sugar content based on choice biases including choice frequency, choice repetition, magnitude-nutrient trade-off, and reward values between nutrient-defined rewards. Analyses were performed separately for each monkey.
Choice frequency. For each monkey, we pooled choice data for the same offered liquid rewards across sessions to compute the mean choice frequency. We first transformed the observed left-right choices into reward-based choice outcomes and fitted the binary reward choice outcomes using the binomial distribution (binofit function, MATLAB) to estimate the mean choice frequency and the 95 % confidence interval. Statistical significance of choice frequencies was determined based on two-tailed one-sample proportion tests against the null hypothesis $H_0: p = 0.5$. In addition, differences of choice frequencies between two samples were examined based on two-sided two-sample proportion tests against the null hypothesis $H_0: p_1 = p_2$.
Choice repetition. We quantified choice repetition based on repeated-choice counts, which tracked how many consecutive choices for the same reward had been performed up to the current choice trial (Fig. 1E, Fig. S1). Starting from zero, repeated-choice counts accumulated when the chosen reward on the current trial matched the previously chosen reward and returned to zero otherwise. Therefore, each session had two strings of repeated-choice counts that tracked the numbers of repetitive choices for each reward. Repeated-choice counts for each reward were compared using a two-sided likelihood ratio test:

$$\frac{\bar{x}_1}{\bar{x}_2} \sim F(df_1, df_2)$$

where $\bar{x}_1$ and $\bar{x}_2$ were the means of the two exponentially distributed repeated-choice counts and the degrees of freedom were twice the sample numbers ($df_1 = 2n_1$, $df_2 = 2n_2$).
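As a minimal sketch of the repeated-choice counts, the following Python function builds one string of counts per reward. It assumes that the counter of the reward not chosen on a trial is recorded as zero; the source does not spell out this detail, and the function name and example session are hypothetical.

```python
def repeated_choice_counts(choices):
    """Return {reward: [count per trial]} of consecutive repeats up to each trial."""
    rewards = sorted(set(choices))
    counts = {r: [] for r in rewards}
    run = 0
    previous = None
    for chosen in choices:
        # Accumulate when the choice repeats the previous one, otherwise reset to zero.
        run = run + 1 if chosen == previous else 0
        for r in rewards:
            # Assumption: the reward not chosen on this trial gets a count of zero.
            counts[r].append(run if r == chosen else 0)
        previous = chosen
    return counts

session = ["A", "A", "A", "B", "A", "A", "B", "B"]  # hypothetical choice sequence
counts = repeated_choice_counts(session)
```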

Nutrient-magnitude trade-off.
To examine nutrient-magnitude trade-offs, we computed the intake of fat, sugar, and reward amount with respect to the maximally and minimally available intake in each session (Fig. S1). For each choice trial, we multiplied the reward amount and the nutrient concentration to calculate the amount of fat and sugar for both options. We then summed the larger nutrient amount between the options in each trial as the upper bound of the fat and sugar intake, respectively. Similarly, the smaller nutrient amounts were summed to derive the lower bound of nutrient intake. We then normalized the actual fat, sugar, and reward intake amount into percentage between these bounds,

$$\text{Intake (\%)} = \frac{I - I_{\min}}{I_{\max} - I_{\min}} \times 100\%$$

where $I$ is the actual intake of fat and sugar (g) or reward amount (mL), and $I_{\max}$ and $I_{\min}$ are the maximally and minimally available intake amounts. Therefore, any monkey who aimed to maximize the reward amount should obtain 100 % of reward amount ($I = I_{\max}$). By contrast, because the controlled nutrient content was identical in both rewards, the intake of the controlled nutrient scaled with the percentage of reward intake. The intake of the manipulated nutrient, however, should be 50 % for a reward maximizer because the reward magnitudes were randomized. The equally distributed choices should translate to the mean of the maximal and minimal intake, i.e. $E[I] = (I_{\max} + I_{\min})/2$ or 50 % intake percentage. However, if the monkeys showed a preference for the manipulated nutrient, they would choose the high-nutrient reward even when it was offered in lower amounts, thereby reducing the intake of reward amount (< 100 %) in exchange for the preferred manipulated nutrient (> 50 %). The extent to which the intake of reward amount reduced from 100 % revealed the willingness of the monkeys to trade reward magnitudes for preferred nutrients.
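The bound-based normalization can be illustrated with a short Python sketch. The offer volumes, the fat concentration, and the helper names are hypothetical; the example only shows how an animal that always picks the high-fat option gives up reward volume in exchange for fat.

```python
def option(ml, fat_g_per_ml):
    """One offer: volume (mL) and the fat it contains (g)."""
    return {"ml": ml, "fat_g": ml * fat_g_per_ml}

def intake_percentage(trials, chosen, key):
    """Actual intake as a percentage between the session's lower and upper bounds."""
    actual = sum(trial[c][key] for trial, c in zip(trials, chosen))
    upper = sum(max(opt[key] for opt in trial) for trial in trials)
    lower = sum(min(opt[key] for opt in trial) for trial in trials)
    return 100.0 * (actual - lower) / (upper - lower)

# Two hypothetical trials: (high-fat option, low-fat option).
trials = [
    (option(0.2, 0.05), option(0.4, 0.0)),  # high-fat offered in the smaller amount
    (option(0.4, 0.05), option(0.2, 0.0)),  # high-fat offered in the larger amount
]
chosen = [0, 0]  # the simulated animal always picks the high-fat option

fat_pct = intake_percentage(trials, chosen, "fat_g")
volume_pct = intake_percentage(trials, chosen, "ml")
```

In this toy session the fat intake reaches its upper bound while the reward volume falls to the midpoint between its bounds, which is the trade-off pattern described in the text.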
Psychometric curves and subjective reward values. We estimated subjective reward values based on the psychometric curves that linked choice probabilities to the offered reward amounts (Fig. 1F, Fig. 2A). Specifically, as we used LFLS as the common reference to estimate the values of other rewards, we first binned the log offer ratios (LFLS/alternatives) into deciles and computed the binned choice frequencies of LFLS. We then fitted the choice frequencies at different offer ratios with a two-parameter logistic function:

$$P(\text{choice}) = \frac{1}{1 + e^{-k(x - x_0)}}$$

where $x$ was the log offer ratio between LFLS and the alternatives; $x_0$ was the inflection point of the logistic curve; $k$ was a steepness constant of the curve, and $e$ was the base of the natural logarithm. The inflection point $x_0$ represented the specific ratio at which both rewards were chosen with equal probability (indifference point), i.e. $P(\text{choice}) = 0.5$, and signaled the relative exchange rate between rewards, i.e. one unit of the alternative reward was equally valued to $x_0$ unit(s) of LFLS. Therefore, an indifference point larger than 1 ($x_0 > 1$) and a right-shifted psychometric curve revealed a positive preference for the alternative reward compared to LFLS, and vice versa (Fig. 2A, top). To compress the psychometric curves for better visualization while preserving noticeable changes of the indifference points, we log-transformed the offered ratios with respect to base 2 and reversed the transformation once we acquired the indifference point estimates. We further performed a 10-fold cross-validation to examine the reward value estimates by transforming the trial-by-trial offers into value equivalents of LFLS. Specifically, we computed the value equivalents of high-nutrient rewards ($V_{\text{eq}}$) by multiplying the offered reward amount ($M$) with the subjective reward value ($SV$) and predicted choices based on the differential LFLS-equivalent offers ($V_{\text{eq,left}} - V_{\text{eq,right}}$) (Fig. 2B). The probability of left choices was again fitted with the two-parameter logistic function to assess how well the subjective reward values served as the exchange rates between rewards to explain the choices. The distribution of the adjusted $R^2$ in the sigmoid fit evaluated the out-of-sample validity of these value estimates.
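To make the indifference-point estimate concrete, the sketch below fits a two-parameter logistic function to synthetic binned choice frequencies and converts the fitted inflection point back from log2 units. The least-squares grid search is a simple stand-in chosen for self-containment, not the fitting routine used in the study; all data, grids, and names are hypothetical.

```python
import math

def logistic(x, x0, k):
    """Two-parameter logistic: choice probability at log offer ratio x."""
    return 1.0 / (1.0 + math.exp(-k * (x - x0)))

def fit_logistic(xs, freqs, x0_grid, k_grid):
    """Least-squares grid search; returns the (x0, k) pair with the smallest error."""
    best = None
    for x0 in x0_grid:
        for k in k_grid:
            err = sum((logistic(x, x0, k) - f) ** 2 for x, f in zip(xs, freqs))
            if best is None or err < best[0]:
                best = (err, x0, k)
    return best[1], best[2]

# Synthetic binned data generated from known parameters (not study data).
true_x0, true_k = 1.0, 2.0
xs = [-2 + 0.5 * i for i in range(9)]          # log2 offer ratios from -2 to 2
freqs = [logistic(x, true_x0, true_k) for x in xs]

x0_grid = [i / 10 for i in range(-20, 21)]     # candidate inflection points
k_grid = [i / 10 for i in range(1, 51)]        # candidate steepness values
x0_hat, k_hat = fit_logistic(xs, freqs, x0_grid, k_grid)

# Undo the log2 transform to express the indifference point as an offer ratio.
indifference_ratio = 2 ** x0_hat
```

With noisy data a maximum-likelihood or nonlinear least-squares optimizer would normally replace the grid search; the grid version is kept here only because it needs no external libraries.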

Transitivity of relative values.
To validate the reward values derived from the psychometric curves as subjective exchange rates between rewards, we examined the transitivity of these values by comparing direct relative values between reward pairs and their corresponding indirect relative values through another intermediate reward (Fig. S2). If reward values are suitable exchange rates between rewards, the direct values and the indirect values should be equal, as in the following derivation:

$$V(A,B) \times V(B,C) = \frac{V_A}{V_B} \times \frac{V_B}{V_C} = \frac{V_A}{V_C} = V(A,C)$$

where the indirect relative value from reward A to reward B ($V(A,B)$) and then from reward B to reward C ($V(B,C)$) was computed based on the multiplication of the value ratios between the relevant rewards. The result was identical to the direct relative value between reward A and reward C ($V(A,C)$). Thus, valid reward value estimates should fulfill this transitivity criterion to explain choices between rewards with different levels of preference. We performed this analysis across all pairs between the four factorial rewards, except for those involving the HFHS reward because for some of the animals the reward estimates for the highly preferred HFHS reward were outside of the appropriate estimation range, due to the necessarily limited magnitude range we could offer in the task.
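A small Python check illustrates the transitivity criterion. The relative values and the 10 % tolerance are hypothetical; the source compares direct and indirect values (Fig. S2) without stating a numerical tolerance.

```python
import math

def indirect_relative_value(v_ab, v_bc):
    """Relative value of A to C obtained via the intermediate reward B."""
    return v_ab * v_bc

def is_transitive(v_ac_direct, v_ab, v_bc, rel_tol=0.1):
    """True when direct and indirect values agree within a relative tolerance."""
    return math.isclose(v_ac_direct, indirect_relative_value(v_ab, v_bc), rel_tol=rel_tol)

# Hypothetical relative values estimated from three psychometric curves.
v_ab, v_bc, v_ac = 1.5, 2.0, 3.1
consistent = is_transitive(v_ac, v_ab, v_bc)
```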

Logistic regression analysis
Mixed-effects multinomial logistic regression. We adopted mixed-effects multinomial logistic regression analysis (fitglme function, MATLAB) to model the animals' trial-by-trial choices. Specifically, we modelled the left-right choices excluding the first 50 trials in each session (during which associations between visual cues and liquid rewards were learned) and specified the categorical session number (Session) as the group variable to account for session-by-session (i.e. day-by-day) variations (random effects). We adopted the global model in which we estimated both the main effects and random effects of all the relevant regressors. The response variable was the dichotomous left ($\text{choice} = 1$) or right ($\text{choice} = 0$) trial-by-trial choice, collected from $J_m$ sessions in monkey $m$ ($J_m \in \mathbb{N}$, $m = 1,2,3$). Under the framework of generalized linear mixed models (GLMMs) with the logit function as the link function, the logistic regression model can be specified as

$$\text{logit}(p_{ij}) = \ln\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + \mathbf{z}_{ij}^{\top}\mathbf{b}_j$$

where $p_{ij}$ denotes the probability of choosing left in the $i$th trial of session $j$ ($i = 1,2,\ldots,n_j$, $n_j \in \mathbb{N}$; $n_j$ = the total number of trials in session $j$); $\mathbf{x}_{ij}$ is a vector of trial-by-trial predictors depending on the models (fixed-effect regressors; see below) and $\mathbf{z}_{ij}$ is a vector of trial-by-trial predictors nested in $\mathbf{x}_{ij}$ whose effects vary across sessions (random-effect regressors). Under the assumption that the session-by-session variations of random-effect regressors followed a normal distribution with mean 0 and covariance matrix $\Omega$,

$$\mathbf{b}_j \sim N(\mathbf{0}, \Omega), \quad j = 1,2,\ldots,J_m,\ J_m \in \mathbb{N},\ m = 1,2,3$$

the model estimated the coefficients of fixed-effect regressors, $\boldsymbol{\beta}$, and the session-wise variations of the random-effect regressors, $\mathbf{b}_j$. The estimated left-right choice responses, $\hat{y}_{ij}$, were derived by the inverse logit function conditional on the session-wise random effects ($\mathbf{b}_j$), and the session-wise regression coefficients ($\boldsymbol{\beta}_j$) were derived from the fixed-effect coefficients ($\boldsymbol{\beta}$) and the session-wise calibration terms ($\mathbf{b}_j$).
Regression models.

Nutrient model.
In the main nutrient model (Table S2), we modeled basic nutrient sensitivities while controlling for task-related regressors including the position of the liquid-delivery spout (Spout) and the presentation order of visual cues (LeftFirst) in a mixed-effects model. Importantly, we specified the categorical session number (Session) as the group variable to address session-wise variations of nutrient sensitivities as follows,

$$\text{logit}(P_{\text{left}}) = \beta_0 + \beta_1 \text{LeftFirst} + \beta_2 \text{Spout} + \beta_3 \Delta\text{Magnitude} + \beta_4 \Delta\text{Fat} + \beta_5 \Delta\text{Sugar} + \beta_6 \Delta\text{Fat} \times \Delta\text{Sugar}$$

with fixed and session-wise random effects estimated for each regressor, where LeftFirst indicated whether the left option was shown first (1, if the left option was shown first; 0 if the right option was shown first) and Spout indicated the spout channel that delivered the left reward option (1, if left, 0 if right). $\Delta\text{Magnitude}$ represented the left offer magnitudes minus the right offer magnitudes, whereas $\Delta\text{Fat}$ and $\Delta\text{Sugar}$ coded the ordinal left-right nutrient level differences (1, if left > right; 0, if left = right; -1, if left < right) and $\Delta\text{Fat} \times \Delta\text{Sugar}$ captured the additional fat-sugar interactions (1, if the left option was both high-fat and high-sugar, but the right was not; -1, if the opposite, and 0, if otherwise).

Energy model.
To test the hypothesis of energy maximization, we combined reward magnitudes and nutrient content into a single energy-content regressor and included the energy difference between left-right options to construct the energy model (Table S2). The newly included regressor represented the left-right differences in actual energy content (kcal).
Nutrient history model. We explored how past fat and sugar choices influenced current nutrient sensitivities by including interaction terms between current nutrient sensitivities and within-nutrient (sugar→sugar, fat→fat) or across-nutrient (sugar→fat, fat→sugar) feedback up to 10 trials prior to the current trial (Table S2).
In this model, in addition to the regressors of the nutrient model, we computed the intake of fat and sugar up to 10 trials prior to the current trial ($\text{Fat}_{t-k}$, $\text{Sugar}_{t-k}$, $k = 1,2,3,\ldots,10$). The within-nutrient effects were modeled by the interactions between past sugar intake and current sugar sensitivity, as well as past fat intake and current fat sensitivity; the across-nutrient effects were captured by the influences of past sugar intake on current fat sensitivity and the influences of past fat intake on current sugar sensitivity.

Model comparison.
We compared the performance of regression models (Fig. 4E) based on the Akaike Information Criterion (AIC),

$$\text{AIC} = 2k - 2\ln(\hat{L})$$

where $k$ is the number of parameters in the model and $\hat{L}$ is the maximal likelihood of the model predictions given the actual data. Because the relative likelihood of model $i$ suggested by the AIC difference is

$$\exp\left(\frac{\text{AIC}_0 - \text{AIC}_i}{2}\right)$$

where $\text{AIC}_0$ and $\text{AIC}_i$ are the AIC values of the reference model and model $i$, respectively, we accepted that a model was better if the model likelihood was 5 times more than the competing models, i.e. $\Delta\text{AIC} = \text{AIC}_0 - \text{AIC}_i > 2\ln 5 \approx 3.22$. We constructed the main nutrient model by including significant task-related regressors based on AIC criteria for all three monkeys. We then performed similar model comparison between the nutrient model and the energy model to test the energy maximization hypothesis against the nutrient valuation strategy (Fig. 4E).
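The AIC-based acceptance rule can be written out as a short Python sketch. The parameter counts and log-likelihoods below are hypothetical placeholders, not values from the fitted models.

```python
import math

def aic(n_params, log_likelihood):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def relative_likelihood(aic_ref, aic_model):
    """How many times more likely the model is than the reference, per the AIC difference."""
    return math.exp((aic_ref - aic_model) / 2)

def is_better(aic_ref, aic_model, factor=5.0):
    """Accept the model if it is at least `factor` times more likely than the reference."""
    return aic_ref - aic_model > 2 * math.log(factor)

# Hypothetical fits: reference model (4 parameters) and candidate model (7 parameters).
aic_reference = aic(4, -520.0)
aic_candidate = aic(7, -510.0)
accepted = is_better(aic_reference, aic_candidate)
threshold = 2 * math.log(5)  # about 3.22
```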
Cross-prediction validation. We examined the robustness of the models across sessions, across flavors (Fig. 2E), and across animals (Fig. 2F, Fig. S5). The cross-prediction analyses validated the regression models outside of the training samples. Specifically, we first separated the data into mutually exclusive training and testing sets. We then predicted choices in the testing set based on the regression coefficients derived from the training data. The performance of the cross-prediction was evaluated by McFadden's cross-validated pseudo-$R^2$,

$$\text{Cross-validated pseudo-}R^2 = 1 - \frac{LL_{\text{nutrient}}}{LL_{\text{intercept}}}$$

which was based on the log-likelihood ratio of choices predicted by the nutrient model ($LL_{\text{nutrient}}$) and the intercept model ($LL_{\text{intercept}}$). Higher cross-validated pseudo-$R^2$ indicated better cross-prediction performance and therefore more robust nutrient-value functions across conditions. Based on this concept, in the cross-session prediction, we sequentially left out one session as the testing set and fitted the nutrient model in the remaining training data. We then reported the mean cross-validated pseudo-$R^2$ to indicate the stability of the nutrient model across testing sessions. In the cross-flavor prediction (Fig. 2E), we took turns using choice data in one flavor as the training set to predict choices in another flavor. The cross-predicted pseudo-$R^2$ was reported in the confusion matrix. Lastly, in the cross-animal prediction (Fig. 2F, Fig. S5), we used the nutrient-value function from one monkey to predict the other two monkeys' choices. We first randomly selected one flavor-matched testing session from each of the three monkeys. We then fitted the nutrient model in one of the monkeys (training monkey) using choices excluding the left-out session and predicted choices in the three testing sessions. We compared the prediction performance on the other two monkeys (testing monkeys) to that on the training monkey. Importantly, we truncated the testing sessions to identical trial numbers to ensure comparability.
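As an illustration of the evaluation metric, the following Python sketch computes McFadden's pseudo-R² on a held-out set from predicted left-choice probabilities. The choices, the predicted probabilities, and the intercept-model probability are hypothetical; in practice both sets of predictions would come from models fitted on the training data.

```python
import math

def log_likelihood(choices, probs):
    """Bernoulli log-likelihood of binary choices (1 = left) given P(left) per trial."""
    return sum(math.log(p if c == 1 else 1.0 - p) for c, p in zip(choices, probs))

def mcfadden_pseudo_r2(choices, model_probs, intercept_prob):
    """1 - LL(model) / LL(intercept-only model), both evaluated on the testing set."""
    ll_model = log_likelihood(choices, model_probs)
    ll_null = log_likelihood(choices, [intercept_prob] * len(choices))
    return 1.0 - ll_model / ll_null

# Hypothetical held-out trials and predictions.
test_choices = [1, 0, 1, 1]
nutrient_model_probs = [0.8, 0.2, 0.8, 0.8]
pseudo_r2 = mcfadden_pseudo_r2(test_choices, nutrient_model_probs, intercept_prob=0.5)
```

A value near 0 means the nutrient model predicts held-out choices no better than the intercept model, and the value can become negative when cross-prediction is worse than the intercept model.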

Preference dissimilarity index (PDI).
To quantify the distinctiveness of choice patterns across monkeys, we defined the preference dissimilarity index (PDI) as the bidirectional average ratio of the log-likelihoods ($LL$) of cross-predicted to self-predicted choices, where $LL_{A \to B}$ denotes the log-likelihood of using the trained nutrient preference of monkey A to predict choices of monkey B, and $LL_{B \to A}$ the opposite; $LL_{A \to A}$ and $LL_{B \to B}$ were self-predicted reference models based on the monkey's own nutrient preferences. The PDI compared the cross-predictability to self-predictability, with PDI = 1 indicating comparable cross-predictability based on self and other nutrient preferences. PDIs larger than 1 suggested inconsistency between other-predicted and the reference self-predicted choices (distinct choice patterns). The cross-prediction was repeated for 1,000 iterations between each pair of the monkeys to compute the average log-likelihood ratios for pairwise PDIs. The results were visualized in a preference triangle using pairwise PDIs as side-lengths; therefore, longer between-animal distances indicated more distinct choice patterns of the two connected animals (Fig. 2F, Fig. S5). The importance of fat and sugar in explaining the individual differences of choice patterns was estimated by repeating the cross-animal predictions after systematically including fat and sugar regressors in the nutrient model. The nutrient contribution was estimated by the percentage change of PDI with and without the nutrient regressor in the nutrient model.

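$$\text{Nutrient contribution} = \frac{\text{PDI}_{\text{nutrient}} - \text{PDI}_{\text{null}}}{\text{PDI}_{\text{full}} - \text{PDI}_{\text{null}}}$$

With the null model containing only the reward magnitudes and task-related control variables, the nutrient contribution to choice discrepancies was quantified as the increase in the PDI after adding a specific nutrient regressor into the null model ($\text{PDI}_{\text{nutrient}}$), normalized with the whole range of PDI, which was bounded by the full nutrient model ($\text{PDI}_{\text{full}}$) and the null model ($\text{PDI}_{\text{null}}$) (Fig. 2G).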
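The two quantities can be sketched in Python as follows. The sketch assumes that the PDI is the mean of cross-to-self log-likelihood ratios over the two prediction directions; which self-prediction is paired with which cross-prediction is left to the caller because the source text does not make the pairing explicit. All numbers and function names are hypothetical.

```python
def pdi(ll_pairs):
    """Mean of cross/self log-likelihood ratios over both prediction directions.

    ll_pairs: list of (LL_cross, LL_self) tuples, one per direction.
    """
    ratios = [ll_cross / ll_self for ll_cross, ll_self in ll_pairs]
    return sum(ratios) / len(ratios)

def nutrient_contribution(pdi_nutrient, pdi_null, pdi_full):
    """Share of the null-to-full PDI range gained by adding one nutrient regressor."""
    return (pdi_nutrient - pdi_null) / (pdi_full - pdi_null)

# Hypothetical log-likelihoods (negative numbers; cross-predictions fit worse).
example_pdi = pdi([(-120.0, -100.0), (-150.0, -100.0)])

# Hypothetical PDIs for the null model, the null model plus fat, and the full model.
fat_share = nutrient_contribution(pdi_nutrient=1.2, pdi_null=1.0, pdi_full=1.4)
```

Because log-likelihoods are negative, a worse cross-prediction yields a ratio above 1, which matches the interpretation that PDIs larger than 1 indicate distinct choice patterns.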
Mediation analysis. We adopted mediation analysis (9, 10) in logistic regressions to assess possible causal relationships between the nutrient content, oral texture parameters and reward choices. The framework of mediation analysis involved three components (Fig. 3D): Component 1 (path c): fat and sugar contents were significant predictors for choices (total effect); Component 2 (path a): fat and sugar content (predictors) were correlated with the texture parameters (mediators); Component 3 (path c'): after including the mediators into the regression model, they replaced the effects of the original predictors, either completely (complete mediation) or partially (partial mediation). The direct effect (path c') of the nutrient content (predictors) on choices (outcome) was defined as the coefficients of nutrient content (predictors) after controlling for the texture parameters (mediators). The mediation effect (indirect effect = c − c') was then quantified as the coefficient differences between the total effect and the direct effect. Specifically, we examined whether the oral texture parameters replaced the effects of fat and sugar in the nutrient model by independently including viscosity and CSF into the nutrient model (texture model, Table S2). The differences of fat and sugar regression coefficients before (nutrient model, total effect: c) and after including texture parameters (texture model, direct effect: c') in all three monkeys showed that viscosity and CSF partially replaced both fat and sugar sensitivities (Fig. 3E). We tested the significance of the mediation effects using a bootstrap analysis (11,12) with 1,000 iterations. We accepted significant mediation effects if the iterated distributions of the mediation effects significantly deviated from zero.

Structural Equation Modelling. Based on the framework of structural equation modelling (SEM), we combined three logistic regressions in the path analysis to describe the relationships between nutrient content, food texture parameters and choices (Fig. 3F). The first two regressions recapitulated how fat and sugar content influenced viscosity and sliding friction. The third regression characterized the influences of food textures and the direct influences of sugar content independent of its texture on choices (direct effect, with coefficients $\beta'_0, \ldots, \beta'_3$), and was compared with the corresponding regression without the texture parameters (total effect, with coefficients $\beta_0, \beta_1, \beta_2$). The differences between the regression coefficients for sugar level in the two models ($\beta_1 - \beta'_1$) were defined as the mediation effect of the texture parameters between sugar content and choices. We performed a bootstrap test with 1,000 iterations to evaluate the significance of these regression coefficients and the mediation effects. All path coefficients were normalized and expressed in the path diagrams to indicate how fat and sugar content could change the food textures to influence reward choices.
Reward space choice trajectories. In the reward space, starting from the origin, we plotted the cumulative choices between the two options against each other (Fig. 4A, B). We normalized the choice counts to the total trial number in each session and averaged the cumulative choices across sessions. Thus, indiscriminate choices would show a choice trajectory that follows the 45-degree unity line and an endpoint that rests on the midpoint of the hypotenuse. Conversely, deviations away from the unity line suggest a choice bias and the continuous trajectory describes the changing choice patterns within the session. In addition, we compared reference trajectories based on three strategies that maximized energy, sugar, or fat, respectively. Specifically, we first computed the target component (energy, fat, or sugar) of both options in each trial. The simulated maximizers then chose on each trial the option with the higher target component and chose randomly when the two options matched in target component.

Nutrient space choice trajectories.
In the nutrient space, we converted the choice trajectories from the reward space to visualize the changing patterns of nutrient sensitivity within sessions (Fig. 4C, D). Specifically, we first computed the energy intake from fat and sugar (kcal) on each trial, based on the chosen reward magnitude and the nutrient composition of the chosen reward. Next, we normalized the trial-by-trial nutrient-specific energy intake to the total trial number in each session before averaging them to derive the final trajectories. For visualization, we plotted the smoothed energy intake from sugar against fat (kcal/kcal) across sessions. Thus, in the isocaloric comparison, the slope of the trajectory revealed the animals' trade-off between fat and sugar as source of energy, indicated by the angle between the trajectory and the horizontal axis,

$$\theta = \tan^{-1}\left(\frac{\text{Energy intake from sugar (kcal)}}{\text{Energy intake from fat (kcal)}}\right)$$

Likewise, the three reference trajectories in the reward space were transformed into the nutrient space, to indicate the relative nutrient intake in the three maximizing strategies.
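The trade-off angle can be computed directly from the two energy intakes, as in this minimal Python sketch. It assumes sugar energy on the vertical axis and fat energy on the horizontal axis, as stated in the text; the example values are hypothetical.

```python
import math

def tradeoff_angle_deg(sugar_kcal, fat_kcal):
    """Angle (degrees) between the trajectory and the horizontal (fat) axis."""
    return math.degrees(math.atan2(sugar_kcal, fat_kcal))

balanced = tradeoff_angle_deg(10.0, 10.0)     # equal energy from sugar and fat
sugar_biased = tradeoff_angle_deg(30.0, 10.0)  # more energy from sugar than fat
```

Angles above 45 degrees indicate that more energy was taken from sugar than from fat, and angles below 45 degrees indicate the reverse.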

Geometric Framework for Nutrition (GFN).
We used the right-angled mixer triangle developed by Raubenheimer, which implements a proportion-based variant of the Geometric Framework for Nutrition (GFN) (13,14) in which the available food compositions, reference nutritional targets, and the actual nutrient-intake balance could be analyzed in a common framework (Fig. 5A). The compositions of food rewards were plotted in a mixer triangle (13) based on the percentage contribution of fat and sugar to total energy content. The actual nutrient-intake balance ($N^*$, a vector of nutrient composition in percentage of total energy) was calculated based on the ratio of consumed fat and sugar amount from the choices:

$$N^* = \frac{\text{Total energy intake from nutrients (kcal)}}{\text{Total energy intake (kcal)}} \times 100\%$$

Because nutrients can be acquired through reward A or reward B, the total intake can be separated based on the source of rewards,

$$E_A = \rho_A \sum_{t=1}^{T} M_{A,t}\, \mathbb{1}_{A,t}, \qquad E_B = \rho_B \sum_{t=1}^{T} M_{B,t}\, \mathbb{1}_{B,t}$$

where $M_{A,t}$ and $M_{B,t}$ were the offered magnitudes (mL) of reward A and B on trial $t$ across $T$ total trials; $\mathbb{1}_{A,t}$ and $\mathbb{1}_{B,t}$ were indicator functions whose values were 1 only when the specific reward was chosen, and 0 if otherwise. The intake amounts of rewards were then multiplied by the energy density, $\rho_A$ and $\rho_B$ (kcal/mL), and the nutrient composition, $N_A$ and $N_B$ (% energy), to compute the energy intake from specific nutrients. Because the reward compositions ($N_A$, $N_B$) were constant within sessions, the geometric representations of final nutrient balance were interpolations between the two points for reward options, $N_A$ and $N_B$, weighted by the energy intake contributed by each reward.

$$N^* = \frac{E_A}{E_A + E_B} \cdot N_A + \frac{E_B}{E_A + E_B} \cdot N_B = w_A \cdot N_A + w_B \cdot N_B \in \overline{N_A N_B}, \quad w_A = \frac{E_A}{E_A + E_B} \in [0,1], \quad w_B = \frac{E_B}{E_A + E_B} \in [0,1], \quad w_A + w_B = 1$$

Nutrient reference comparison. We compared the nutrient intake balance derived from the animals' choices with two nutrient reference points, a recommended ('optimal') diet composition for adult macaques (15) and macaque milk (16). Because the actual nutrient balance should lie on the segments that connected the reward options, we orthogonally projected the reference points onto the lines that connected LFLS and HFLS or LFHS, using vector orthogonal projection as below (Fig. 5C, Fig. 5D):

$$R' = A + \frac{(R - A) \cdot (B - A)}{\lVert B - A \rVert^2}\,(B - A)$$

where $A$ and $B$ were nutrient compositions of reward A and reward B, $R$ was the nutrient reference point and $R'$ was its projection onto line $\overleftrightarrow{AB}$. These projection points were the closest achievable targets that served as surrogate nutrient references in these comparisons.
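Both geometric steps, the energy-weighted interpolation between two reward compositions and the orthogonal projection of a reference point onto the connecting line, can be sketched in Python. The compositions, energy intakes, and reference point are hypothetical numbers chosen for easy checking, and the function names are not from the original analysis code.

```python
def intake_balance(comp_a, comp_b, energy_a, energy_b):
    """Energy-weighted interpolation between the compositions of rewards A and B."""
    w_a = energy_a / (energy_a + energy_b)
    w_b = 1.0 - w_a
    return tuple(w_a * a + w_b * b for a, b in zip(comp_a, comp_b))

def project_onto_line(r, a, b):
    """Orthogonal projection of point r onto the line through points a and b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ar = [ri - ai for ai, ri in zip(a, r)]
    scale = sum(x * y for x, y in zip(ar, ab)) / sum(x * x for x in ab)
    return tuple(ai + scale * d for ai, d in zip(a, ab))

# Hypothetical compositions as (% energy from fat, % energy from sugar).
reward_a = (10.0, 20.0)
reward_b = (50.0, 20.0)

# Hypothetical energy intake (kcal) obtained from each reward in a session.
balance = intake_balance(reward_a, reward_b, energy_a=30.0, energy_b=10.0)

# Hypothetical reference diet, projected onto the line connecting the two rewards.
reference = (30.0, 40.0)
surrogate = project_onto_line(reference, reward_a, reward_b)
```

The intake balance always falls on the segment between the two reward points, so comparing it with the projected reference isolates the deviation that the animal could actually have avoided.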

Reinforcement Learning (RL) simulation
Reversal-learning task. We simulated a reversal-learning task involving binary choices between high-nutrient (H) and low-nutrient (L) reward options. Each option was associated with a specific reward probability, in this case $P(H) = 0.6$, $P(L) = 0.4$ (Fig. S9A). The reward probability was reversed regularly without notification every 50 trials, e.g. $P(H) = 0.6 \to 0.4$, $P(L) = 0.4 \to 0.6$. Therefore, the agent should track the changing reward values through trial and error. This basic reversal-learning task has been widely used as a paradigm for adaptive learning in neuroscience and has been successfully modelled by RL models (17)(18)(19).

Standard RL model.
In the standard RL model, we adopted the basic form of a Q-learning algorithm that followed the Rescorla-Wagner learning rule (20,21) and performed 100 repetitions of choice simulations in the reversal-learning task (Fig. 6A). The choice action for the high-nutrient reward ($a_H$) was 1 when the choice probability was larger than 0.5, and was 0 if otherwise. In case of equal choice probability ($P_{\text{choice}}(H) = P_{\text{choice}}(L) = 0.5$), the agent flipped a fair coin (Bernoulli trial) to decide which reward to choose. The subsequent reward outcomes of the choices were randomly drawn depending on the task-assigned reward probability, which alternated every 50 trials in the reversal-learning task. The received reward value was 1 if the agent received the reward and was 0 if otherwise.
Importantly, in the standard RL model, the values for both rewards were identical, irrespective of their nutrient composition. This value specification was later modified in the nutrient-sensitive RL model to incorporate the nutrient preferences into the RL framework.
Nutrient-sensitive RL model. In the nutrient-sensitive model, we extended the standard RL model by assigning higher reward outcomes for the high-nutrient reward than for the low-nutrient reward, when rewarded. The higher value for the high-nutrient reward was controlled by a nutrient-sensitivity parameter $\eta \in [0,1)$ (Fig. 6B, Fig. S9). The single nutrient-sensitivity parameter $\eta$ created a continuous spectrum of nutrient-sensitive RL models, which degenerated to the standard RL model when $\eta = 0$ and converged to high-nutrient-only choices regardless of the reward probability when $\eta \to 1$ (the high-nutrient reward outcome $\to \infty$). By contrast, the reward outcome for the low-nutrient reward remained unchanged from the standard RL model.
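A minimal Python sketch of such a simulation is given below. Several parts are assumptions rather than the authors' specification: the high-nutrient payoff is set to 1/(1 − eta), a form chosen only because it equals 1 at eta = 0 and grows without bound as eta approaches 1; the learning rate `alpha`, the softmax slope `beta`, and the zero initial values are hypothetical; and `simulate` is an illustrative name.

```python
import math
import random

def simulate(eta=0.0, alpha=0.2, beta=3.0, n_trials=200, block=50, seed=0):
    """Rescorla-Wagner learner in a reversal task; returns the list of choices."""
    rng = random.Random(seed)
    q = {"H": 0.0, "L": 0.0}                      # learned action values
    p_reward = {"H": 0.6, "L": 0.4}               # task-assigned reward probabilities
    payoff = {"H": 1.0 / (1.0 - eta), "L": 1.0}   # assumed nutrient-sensitive payoff
    choices = []
    for t in range(n_trials):
        if t > 0 and t % block == 0:
            # Unsignalled reversal of reward probabilities every `block` trials.
            p_reward["H"], p_reward["L"] = p_reward["L"], p_reward["H"]
        p_h = 1.0 / (1.0 + math.exp(-beta * (q["H"] - q["L"])))
        if p_h > 0.5:
            choice = "H"
        elif p_h < 0.5:
            choice = "L"
        else:
            choice = rng.choice(["H", "L"])       # fair coin flip on ties
        reward = payoff[choice] if rng.random() < p_reward[choice] else 0.0
        q[choice] += alpha * (reward - q[choice])  # prediction-error update
        choices.append(choice)
    return choices

standard = simulate(eta=0.0)
nutrient_sensitive = simulate(eta=0.5)
```

Setting `eta=0.0` gives identical payoffs for both options, which corresponds to the standard model; larger values of `eta` raise the payoff of the high-nutrient option only.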

Economic choice theory simulation
Nutrient indifference map.
We simulated each monkey's choices in the nutrient choice task, in which we systematically compared randomized amounts of LFLS with rewards across combinations of interpolated fat and sugar content (Fig. S10). We sampled LFLS (fat level = 0, sugar level = 0) against rewards with combinations of fat and sugar levels from 0.1 to 1 in steps of 0.1. The 10,000 simulated choices were based on the regression coefficients derived from each monkey (Fig. 2C). We then used the simulated choices to estimate the reward value of each fat-sugar combination from the indifference points on the psychometric curves. In the model, we log-transformed the fat and sugar levels to follow the formulation of the Cobb-Douglas utility function (22,23), which created non-overlapping, nondecreasing, and negatively sloped indifference curves of the kind widely used in economic studies.
In this model, the dependent variable was the probability of choosing left in a given trial of a given session, the task-related parameters were aggregated as in the nutrient model, and the error term was Gaussian.
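The link between log-transformed nutrient levels and Cobb-Douglas indifference curves can be illustrated with a short sketch. The exponents `a` and `b` below are hypothetical placeholders, not the coefficients fitted to the monkeys, and the function names are introduced here only for illustration. In log form, Cobb-Douglas utility is a weighted sum of log fat and log sugar, so each indifference curve is the set of fat-sugar pairs with the same weighted sum.

```python
import math

def cobb_douglas_log_utility(fat, sugar, a=0.4, b=0.6):
    """Log of the Cobb-Douglas utility fat**a * sugar**b (a, b hypothetical)."""
    return a * math.log(fat) + b * math.log(sugar)

def sugar_on_indifference_curve(fat, log_u, a=0.4, b=0.6):
    """Sugar level that keeps log-utility at `log_u` for a given fat level."""
    return math.exp((log_u - a * math.log(fat)) / b)

# Trace the indifference curve through the point (fat = 0.5, sugar = 0.5)
# over the fat levels sampled in the simulation (0.1 to 1 in steps of 0.1).
log_u = cobb_douglas_log_utility(0.5, 0.5)
fat_levels = [0.1 * k for k in range(1, 11)]
curve = [(f, sugar_on_indifference_curve(f, log_u)) for f in fat_levels]
```

Along such a curve, more fat is traded against less sugar, which gives the negative slope; curves for different utility levels do not cross because utility increases in both nutrients.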
Extension of indifference analysis. We proposed an approach for indifference analysis between goods with common ingredients but different compositions (Fig. 6D). Specifically, we constructed four example composite food rewards, each with a different composition of fat and sugar (fat/sugar): A (70%/30%), B (10%/90%), C (90%/10%), D (20%/80%). We assumed that the values for fat and sugar followed the exponential utility function, where c is the quantity of the nutrient and a is a risk attitude parameter that determines the curvature of the function. We also assumed that the values for fat and sugar were additive; the value of reward X was therefore the weighted sum of the values of its nutrient constituents. Without loss of generality, we set the risk attitude parameter a to 2 for fat and 5 for sugar, and the weights that integrated the values of fat and sugar into the composite reward value to 0.3 for fat and 0.7 for sugar. The slightly larger parameters for sugar reflected the observed stronger influence of sugar content than fat content on choices. Next, we illustrated how four reward bundles (a, b, c, d) and their relative preference ranking were linearly transformed from reward space A-B to the nutrient space, and finally to reward space C-D. Specifically, each of the four bundle points in reward space A-B was linearly transformed into the nutrient space by multiplying it by the transformation matrix to nutrient space. This matrix was the nutrient composition matrix containing the nutrient vectors of composite rewards A and B (the fat and sugar content of reward A, and the fat and sugar content of reward B). Similarly, the same four bundle points could again be linearly transformed into reward space C-D, which was defined by rewards with the same ingredients as rewards A and B but with different compositions.
Each bundle point in reward space C-D was obtained from the corresponding point in the nutrient space by multiplying it by the transformation matrix for the reward space, which was the inverse of the nutrient composition matrix containing the nutrient vectors of rewards C and D. Notably, the relative rankings of the four bundles were preserved throughout the transformations. This value-preserving property illustrated that the same choice analysis could be performed in the nutrient space or in reward spaces constructed from rewards with different compositions, thereby providing a unifying framework that links indifference analyses across different reward sets via their common ingredients.
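The two transformations can be sketched numerically using the compositions, risk attitude parameters and weights given above. Several details are assumptions made for illustration: the exponential utility is written in the common form (1 − exp(−a·c))/a, the composition matrices hold fat in the first row and sugar in the second, and the four bundle quantities are hypothetical values rather than those shown in Fig. 6D.

```python
import math

# Nutrient composition matrices (rows: fat, sugar; columns: the two rewards).
N_AB = [[0.7, 0.1], [0.3, 0.9]]  # reward A (70%/30%), reward B (10%/90%)
N_CD = [[0.9, 0.2], [0.1, 0.8]]  # reward C (90%/10%), reward D (20%/80%)

def matvec(m, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def inverse_2x2(m):
    """Inverse of a 2x2 matrix via its determinant."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def exp_utility(c, a):
    """Exponential utility of quantity c (assumed normalisation)."""
    return (1.0 - math.exp(-a * c)) / a

def nutrient_value(nutrients, a_fat=2.0, a_sugar=5.0, w_fat=0.3, w_sugar=0.7):
    """Weighted sum of the fat and sugar utilities of a nutrient-space point."""
    fat, sugar = nutrients
    return w_fat * exp_utility(fat, a_fat) + w_sugar * exp_utility(sugar, a_sugar)

def ranking(values):
    """Bundle labels ordered from most to least valuable."""
    return sorted(values, key=values.get, reverse=True)

# Hypothetical bundles: quantities of reward A and reward B.
bundles_ab = {"a": [0.2, 0.8], "b": [0.5, 0.5], "c": [0.8, 0.2], "d": [1.0, 1.0]}

# Reward space A-B -> nutrient space -> reward space C-D.
in_nutrient = {k: matvec(N_AB, v) for k, v in bundles_ab.items()}
in_cd = {k: matvec(inverse_2x2(N_CD), v) for k, v in in_nutrient.items()}

# Value each bundle from its nutrient content in both representations.
rank_nutrient = ranking({k: nutrient_value(v) for k, v in in_nutrient.items()})
rank_cd = ranking({k: nutrient_value(matvec(N_CD, v)) for k, v in in_cd.items()})
```

In this sketch the two rankings coincide because the value of a bundle depends only on its nutrient content, which the change of reward space leaves unchanged.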
Human psychophysical experiment. Healthy, non-obese participants (N = 23, aged 18-21, 15 male) gave written informed consent and participated in an experiment that involved sampling and psychophysically evaluating liquid rewards. The experiment was approved by the Local Research Ethics Committee of the Cambridgeshire Health Authority. The rewards were the same stimuli as used in our monkey experiment, but prepared specifically for human testing, and included the four factorial rewards, a cream-based stimulus, water and water-diluted fruit juice concentrate (an additional stimulus involving a food thickener was tested in a subset of 14 subjects). All stimuli were blackcurrant-flavored. Subjects sampled and swallowed 1.5 mL of each stimulus from opaque cups in randomized order; after sampling a stimulus, subjects gave psychophysical ratings on a touchpad and rinsed their mouths with water before sampling the next stimulus. Each stimulus was sampled six times. Rating scales ranged from 1 to 10, with endpoints labelled 'none' and 'very strong'. Ratings were z-scored before analysis.

Regression coefficients (± s.e.m.) for fat, sugar and reward magnitude (RM) from the nutrient model fitted to deciles of consecutive trial-windows in each testing session. Following initial learning of the associations between conditioned stimuli and rewards, coefficients for fat (blue) and sugar (green) were typically stable or increased throughout the session, compared to constant or slightly decreased coefficients for magnitude (yellow).

Reward intake was not affected by the nutrient sensitivity parameter η due to the symmetric experimental design (i.e. same number of high-probability blocks for both options), but nutrient intake increased with higher nutrient sensitivity from nutrient-indiscriminate choices (50% nutrient intake) to nutrient-exclusive choices (100% nutrient intake).
The animal obtained higher reward intake due to faster learning driven by the higher nutrient content in the high-probability option. However, the nutrient-sensitive animal was reluctant to switch to low-nutrient choices after probability reversal even if the high-nutrient option was now associated with a lower reward probability. The gain and loss of reward intake would cancel out in a symmetric experimental design as in (A), but the nutrient intake would always increase irrespective of the task structure.