Stochastic sampling provides a unifying account of visual working memory limits

Significance We demonstrate that three of the most prominent accounts of visual working memory in the psychology and neuroscience literature—the slots+averaging model, the variable precision model, and the population coding model—can all be expressed in the common mathematical framework of sampling. This reformulation allows us to pinpoint the key differences between these models, and to determine which factors are critical to account for the observed patterns of recall errors across different human psychophysical experiments. Moreover, the sampling framework provides a possible neural grounding for these models in the spiking activity of neuronal populations, as well as a link to existing theories of capacity limits in visual attention.

W orking memory refers to the nervous system's ability to form stable internal representations that can be actively manipulated in the pursuit of behavioral goals. A classical view of visual working memory (VWM) held that it was organized into a limited number of memory slots, each capable of holding a single object (1,2). This model was subsequently modified to allow multiple slots to hold the same object and be combined on retrieval to achieve higher precision (3). This "slots+averaging" model incorporated aspects of an alternative view, which holds that VWM is better conceptualized as a continuous resource that can be flexibly distributed between different objects or visual elements (4,5), accounting for set size effects in delayed reproduction tasks (6) (Fig. 1A) and flexibility in prioritizing representations (7). Variable precision models (8,9) additionally proposed that the amount of memory resource is not fixed but varies randomly from item to item and trial to trial. An alternative approach (10) sought to explain VWM errors from neural principles as decoding variability in population representations (11), with the limited memory resource equated to the total neural activity dedicated to storage. Here we show that each of these influential accounts of VWM can be interpreted within a common framework based on the statistical principle of sampling (12)(13)(14)(15)(16)(17)(18).

Sampling Interpretation of Population Coding
We first show how a population coding model (10) can, with some simplifying assumptions, be reinterpreted in terms of sampling ( Fig. 1 A-C). We consider a mathematically idealized population of independent neurons encoding a one-dimensional (1D) stimulus feature θ, where the amplitude of each cell's activity is determined by its individual tuning function. Neurons are assumed to share the same tuning function, merely shifted so the peak lies at each neuron's preferred feature value ϕi , [1] Discrete spikes are generated from the cells' activity via independent Poisson processes. If we pick, at random, any spike generated by the neural population in response to a stimulus value θ, we can determine the probability that it was produced by a neuron with preferred feature value ϕ. If we assume dense uniform coverage of the underlying feature space by neural tuning curves, this yields a continuous probability distribution p(ϕ) over the space of preferred feature values (Fig. 1C). This distribution has the same shape as the neural tuning curves and is centered on the true stimulus value, Thus, if we associate each spike with the preferred feature value of the neuron that generated it (the principle of population vector decoding; ref. 19), we can interpret the spiking activity of the population as a set of noisy samples of the true stimulus value, drawn from the distribution p(ϕ).
Retrieval of a feature value is modeled as decoding of the spikes generated within a fixed time window. In the idealized case with Gaussian tuning functions, the maximum likelihood (ML) decoder generates an estimate by simple averaging of the spike values,θ where ϕ (j ) is the preferred feature value of the neuron that generated the j th spike. Due to the superposition property of Poisson processes, the number of spikes-or samples-generated by the neural  (B) Probability distribution over stimulus space obtained by associating a spike with the preferred stimulus of the neuron that generated it. (C) Precision of maximum likelihood estimates based on spikes emitted in a fixed decoding window. Precision, defined as the width of the likelihood function (Insets), is discretely distributed as a product of the tuning precision (ω 1 ) and the number of spikes, which varies stochastically. Assuming normalization of total activity encoding multiple items, larger set sizes correspond to less mean activity per item. (D and E) An account based on averaging limited memory slots can also be described as sampling. (D) Allocation of a fixed number of samples or slots (here, three) to memory displays of different sizes. (E) Precision is discretely distributed as a product of the tuning width, ω 1 , and the number of samples allocated per item. population within the decoding window is also a Poisson random variable. If the total spike rate in the neural population is normalized (20), or fixed at a population level γ, it implements a form of limited resource (10). This resource is continuousunlike the discrete number of samples-and can be distributed between memory items, depending on task demands (e.g., prioritizing one item that is cued as a likely target). We will focus on the simplest case, in which the total spike rate is distributed evenly among all memory items, resulting in a mean number of samples available for decoding each stimulus that is inverse to the set size N . This has been shown to quantitatively capture the set size effect in single-report delayed reproduction tasks ( Fig. 2A).
The actual number of samples available in this model for decoding each item in a single trial, n k , is a discrete random variable independently drawn from a Poisson distribution, with its mean determined by the spike rate for that item, The neural population model can therefore be interpreted as a stochastic sampling model.

Fixed Sampling Models
The most prominent discrete representation account of VWM, the slots+averaging model (3), can also readily be interpreted in terms of sampling ( Fig. 1 D and E). Each slot is postulated to hold a representation of a single item with a fixed precision, and so provides a noisy sample of the item's feature value (or values; the sampling interpretation is agnostic as to feature-vs. objectbased views of VWM; refs. 21 and 22). Multiple slots, or samples, that correspond to the same object are averaged at retrieval to enhance the precision of the estimated stimulus feature. Thus, the format of representation and the decoding mechanism are identical to the stochastic sampling model. There is one critical difference, however: The slots+averaging model assumes that the total number of samples available for representing multiple items is fixed, that is, This has also been the most common assumption in previous sampling-based models in the attentional and memory literature (refs. 12-14, but see ref. 23). We will refer to this as a fixed sampling model.

Predictions for Precision and Error Distributions
We now consider the distribution of representational precision in these models. For any particular set of samples, ϕ, the information they provide about the stimulus is described by the likelihood function, L(θ; ϕ) = p θ (ϕ|θ), equivalent to the conditional probability of obtaining those samples given different values of the stimulus. The width of the likelihood function is a measure of uncertainty in the estimate: A set of samples with a broad likelihood function ( Fig. 1 C, Bottom Inset) is compatible with many different feature values, whereas a narrow likelihood function ( Fig. 1 C, Top Inset) identifies a value more precisely. While a pattern of samples may have a sharp likelihood function with a peak far from the true estimate (a kind of "false alarm"), statistically, this is unlikely. If the sample values follow a normal distribution with variance σ 2 centered on the true stimulus value, then the likelihood function is also normal, with a width that depends only on the number of samples available for decoding, Furthermore, for a specified number of samples, the ML estimate is distributed around the true stimulus value as a normal with the same width as the likelihood, This correspondence between uncertainty, as expressed in the likelihood width, and trial-to-trial variability is not universal, but does apply to all of the models considered in this study, and justifies defining the precision of an individual estimate (which we will denote ω) as the precision of its corresponding likelihood function (see SI Appendix, Fig. S1 for a detailed illustration).
Adopting this definition explicitly (see also refs. 25 and 26) allows us to treat precision as a random variable with a defined probability function, describing variation in the reliability of estimates while also predicting the distribution of errors across trials. This will prove critical in fitting data from whole-report tasks ( Fig. 2D and below).
For the stochastic sampling model based on population coding, likelihood precision has a Poisson distribution (Fig. 1C), scaled by the precision of a single sample which is determined by the neural tuning function, ω1 = 1/σ 2 , Example distributions of decoding error are shown in Fig. 2 B and E, where we have made a transition from 1D Euclidean to a circular stimulus space, corresponding more closely to the feature dimensions (e.g., orientation, hue) commonly used experimentally. The distribution of errors can be described as a scale mixture of normal distributions with precision proportional to the sample count (SI Appendix, Fig. S1; due to the circular stimulus space, this is a close approximation rather than exact: see SI Appendix, Supporting Information Text). The dispersion of errors increases with decreasing activity (e.g., as a result of increasing set size; Fig. 2B), and the distribution deviates from normality, with this effect being particularly evident at lower activity levels (blue curve) where long tails are observed.
For the fixed sampling model, making the common assumption that samples are distributed as evenly as possible among items (9,27), we obtain a discrete distribution over, at most, two precision values (Fig. 1E), which are multiples of the precision of one sample, ω1. As in the stochastic model, mean precision is inversely proportional to set size, but, because the distributions over precision differ, the fixed and stochastic models make distinct, testable predictions for error distributions (Fig. 2).

Response Errors Discriminate between Models
We tested the ability of stochastic and fixed sampling models to capture response errors in delayed reproduction tasks (SI Appendix, Fig. S2). We fit the models to a large dataset of singlereport tasks originating from different laboratories (SI Appendix, Table S1) and also to a set of whole-report experiments (24), in which participants reported the feature values of all items in a sample array, either in a prescribed random order or in an order freely chosen by each participant on each trial. While only a single study, the whole-report results include information regarding correlations in errors between items represented simultaneously in VWM that could differentiate the models. On free choice trials, we assumed that participants gave their responses in order of decreasing precision (corresponding to decreasing number of samples and increasing likelihood width). This assumption is supported by previous findings that humans have knowledge about the uncertainty with which individual items are recalled (8,25).
Overall, the stochastic model fit data substantially better than the fixed sampling model for both single-report ( Fig. 3A; difference in log likelihood per participant, ∆LL = 16.3 ± 2.37 [M ± SE]) and whole-report tasks ( Fig. 3B; ∆LL = 162 ± 13.6), indicating that stochasticity is critical for capturing behavioral performance (see also SI Appendix, Fig. S3). The response error distributions in the whole-report task with freely chosen response order have previously been argued to provide evidence for a fixed item limit (24), since they approach uniform distributions for the later responses at high set sizes ( Fig. 2D; see SI Appendix, Figs. S4 and S5 for full behavioral results and model fits). However, this qualitative observation is also predicted by the stochastic sampling model with responses ordered by precision, as the lowest precision retrievals will be based on few, or no, samples (Fig. 2E). Quality of fits could be further improved by taking into account deterioration of recall precision with increasing retention intervals (SI Appendix, Figs. S3J and S5), modeled as random drift of encoded feature values over time (28) In contrast, the quantitative changes in error distribution with response order and set size were relatively poorly fit by the fixed sampling model (Fig. 2F). In particular, when the set size exceeds the fixed sample count, each item is represented by either one or zero samples, so this model cannot reproduce the graded decline in precision with response order that is also present in individual participants' data (and does not merely arise at the group level due to averaging across participants with different capacities).
We tested two intermediate model versions in order to further dissociate the specific aspects in which the fixed and stochastic sampling models differ, and determine the significance of each for capturing human performance. In the random-fixed model, the total number of samples was fixed but distributed randomly between items. This model provided an improved fit to data compared to the fixed model with even allocation (moderately for single-report, ∆LL = 3.07 ± 1.10; strongly for whole-report, ∆LL = 112 ± 11.9), but was still substantially worse than the stochastic model in both cases (single-report, ∆LL = 13.2 ± 2.24; whole-report, ∆LL = 50.4 ± 7.03). In the even-stochastic model, the total number of samples was a Poisson random variable, but the samples were distributed as evenly as possible between items. This model achieved a better fit to single-report data than the stochastic model with independent sample counts for each item (∆LL = 3.57 ± 0.697), but provided a much worse fit to whole-report data (∆LL = 21.4 ± 4.12). Combining evidence across all participants and tasks, the stochastic model with independent sample counts was strongly preferred over this and the other alternative models (total ∆LL > 1,450; Fig. 3C).

Generalizing the Stochastic Model
For the models examined above, typical fitted parameters indicate that estimates are based on relatively small numbers of samples (e.g., mean of ∼13 samples based on fits to single-report data). One result is that the precision of decoded estimates could take on only a limited set of possible values, and error distributions reflect a discrete mixture of distributions with different widths. From a neural perspective, while consistent with the remarkable fidelity with which single neurons' activity encodes visual stimuli (29,30), such small sample counts nonetheless seem unlikely when interpreted as spike counts (see Toward Biophysically Realistic Models). To investigate whether discreteness and/or low numbers of samples are important for reproducing human performance, we therefore implemented a generalization of the stochastic model in which the number of samples was free to vary.
The distribution over precision values in the generalized stochastic model was obtained as a scaling of the negative binomial distribution, This distribution has previously been proposed to model neural spiking activity (31), and it retains the characteristic relationship between mean and variability in the scaled Poisson distribution: The Fano factor (the ratio of variance to mean) is constant, equal to the value of a single sample, . This distinguishes the stochastic models from the fixed sampling model, where the Fano factor is at or close to zero (mean ∼0.25 of ω1 based on ML parameters and typical set sizes) and varies in an idiosyncratic manner between set sizes, due to the varying combinatorial possibilities of allocating a fixed number of samples to a fixed number of items (Fig. 3D, purple). The parameter p in the generalized stochastic model controls the discretization of the precision distribution: p = 1 corresponds to the Poisson model described above and illustrated in Fig. 4A (strictly, Eq. 8 is the limit of Eq. 9 as p → 1), while p < 1 corresponds to a stochastic model with a greater mean number of samples,n = γ/p, each with a lower individual precision, ω1p. The mean and variance in precision (E [ω] = ω1γ/N and Var [ω] = ω 2 1 γ/N ), and thus also the Fano factor, are independent of the discretization p. Examples of precision distributions with different discretizations are shown in Fig. 4 B and C.
As the discretization parameter becomes very small (p → 0), the number of samples becomes very large, and the distribution of precision described by Eq. 9 approaches a continuous function ( Fig. 4D and SI Appendix), specifically the Gamma distribution, Two previous studies (8,9) independently proposed that a continuous scale mixture of normal distributions with Gammadistributed precision provided a good account of VWM data, but did not provide a theoretical motivation for this choice of distribution. In particular, ref. 9 proposed distributing precision as Gamma(J1/N α , τ ), withJ1, τ , and α as free parameters. With α = 1, this is identical to Eq. 10 (see SI Appendix for results regarding this parameter). We can now explain Gammadistributed precision as a limit case of the stochastic sampling model with large numbers of low-precision samples. Fig. 3 E, Top shows the results of fitting the generalized stochastic model with different levels of discretization, p, to the single-report dataset. The best fit was obtained with a discretization roughly one-third that of the Poisson model, p = 0.39. However, varying discretization produced differences in fit an order of magnitude smaller than those between fixed and stochastic sampling (varying by ∼1.5 versus ∼15 log likelihood points). Fitting the same model with p as a free parameter that could vary between participants, we found that ML estimates of discretization were very broadly distributed (Fig. 3 E, Bottom), with a majority of participants (72%) best described by a sampling model with less discreteness than the Poisson, and a minority (18%) better captured by the continuous limit (p → 0) than any discrete value of p we tested (as low as 0.0001, corresponding to ∼100,000 samples). Formal model comparison was equivocal with respect to an advantage of including the discretization parameter in comparison to either the Poisson model (i.e., p = 1; difference in Akaike Information Criterion, ∆AIC = −0.61 ± 0.49; difference in Bayesian Information Criterion, ∆BIC = +4.2 ± 0.46; negative values favor the added parameter) or the continuous Gamma model (i.e., p → 0; ∆AIC = −3.3 ± 0.93; ∆BIC = +1.5 ± 0.89). Overall, these results do not allow strong conclusions to be drawn regarding the discreteness of sampling, which has relatively little effect on error distributions (Fig 4, Insets) or the quality of fits.

Probabilistic Item Limits
In the fixed sampling model, at higher set sizes, a meaningful proportion of estimates are random "guesses" based on no samples (Fig. 5 A and B). Specifically, if an estimate was generated for every item in the memory array, then, as set size N increased, the number of estimates based on at least one sample, Sω>0, would reach a maximum at the fixed total number of samples, This is a trivial consequence of sharing out a fixed number of samples evenly between items.  In the stochastic model with Poisson variability (p = 1), the number of samples available for each item varies probabilistically and independently of the other items. There is, again, a probability of making an estimate based on zero samples, and the number of nonrandom retrievals in a set of N items has a binomial distribution, As set size increases, the mean number of estimates based on at least one sample reaches a maximum at the expected total number of samples (Fig. 5 C and D). However, unlike the fixed sampling model, this limit is probabilistic, and (as illustrated in Fig. 5C) the actual number will vary from one set of memory items to the next, converging to a Poisson distribution for large N , lim N →∞ Sω>0 ∼ Poisson(γ).

[14]
As we increase the expected number of samples by reducing the discretization, p, the probability of zero samples falls to zero: Pr(ω = 0) = p γ/ (N (1−p)) . However, if we choose a precision threshold that is less than or equal to the base precision ω1, it can be shown that the mean number of items with abovethreshold precision converges to a finite positive number at large set sizes (SI Appendix). This saturation is illustrated for different levels of discretization and various precision thresholds in Fig. 5 E-H. Item limits or "magic numbers" (2,24,32) are usually considered synonymous with slot-based accounts, occurring when some items must go unrepresented because other items have filled the available capacity. The present results show that a probabilistic item limit, that is, an upper limit on the average number of items successfully retrieved that is not exceeded at any set size, can arise even when the probability of success for one item is independent of each other item. This holds true in the Poisson sampling model if we define success as obtaining one or more samples, but also more generally, even in a continuous model, if we define success as exceeding a threshold level of precision in estimation. Note, however, that the item limit does not, in general, have a simple relationship to the underlying number of samples. For example, the probabilistic limit at approximately five items in Fig. 5E is obtained from a model with a mean of 24 samples.

Toward Biophysically Realistic Models
The idealized description of population coding on which we based the stochastic sampling model overlooks a number of important considerations in order to reveal relationships between cognitive and neural-level accounts of VWM. For instance, the statistics of spike counts in the neural system often deviate from the Poisson distribution assumed in the original population coding model in that they are "overdispersed" (i.e., Fano factor of >1). Such an overdispersion in the sample count also occurs in the generalized stochastic model as the discretization p decreases (in order for the Fano factor of the precision distribution to remain constant, the Fano factor of the sample count has to increase). Spike counts in visual cortical neurons typically show a Fano factor in the range 1.5 to 3 (e.g., ref. 33), cretization falls to zero, the mean number of samples becomes infinite, and the distribution over precision approaches a continuous Gamma distribution. The ratio of variance to mean precision (Fano factor) is fixed (at ω 1 = 1.5) across all set sizes and levels of discretization.   A and B) In the fixed sampling model, the number of items with nonzero precision increases with set size, then plateaus when the number of items equals the number of samples. (C and D) The stochastic sampling model with Poisson variability also has a limit on the number of items with nonzero precision, although this limit is probabilistic and emerges asymptotically (converging to the distribution shown by the red curve in C for large set sizes, corresponding to the mean number of items plotted as black curve in D). (E and F) Stochastic models with lower discretization display similar probabilistic item limits for precision exceeding a fixed threshold, but with the expected number of items saturating at different values depending on threshold (different colors in F). (G and H) This property also extends to models with continuous precision distributions.
corresponding, in our model, to discretization p in the range 0.33 to 0.75.
In real neural populations, there is also considerable variability between individual cells' tuning curves (34). Due to this heterogeneity, neurons differ in the amount of information each spike provides about a stimulus. From a sampling perspective, this means that estimates are based on samples that vary in precision, and this has the effect of "smoothing out" the discrete distribution of precision values predicted by the stochastic model (SI Appendix, Fig. S6). This has similar consequences for estimation error as decreasing p in the generalized model. We fit the single report data with a variant of the population model with random variability in the neurons' tuning curves (affecting baseline activity, gain, and tuning curve width, as well as adding heterogeneity in the coverage of the feature space by neural tuning curves; SI Appendix), scaled by a global heterogeneity parameter ν. Incorporating biologically realistic heterogeneity into the population model improved fits to data (∆AIC = 8.3 ± 1.8, ∆BIC = 3.4 ± 1.7 compared to the stochastic sampling model). The mean heterogeneity parameter in the ML fits was ν = 0.66 ± 0.08, where ν = 0 means no heterogeneity, and ν = 1 was approximately matched to heterogeneity of orientation-selective neurons in recordings from primary visual cortex (34).
Finally, spikes in real neural populations are not independent events as assumed by the sampling interpretation, but rather are correlated within and between neurons. This will tend to result in deviations from the simple additivity assumed by sampling. An implementation of short-range pair-wise correlations in the heterogeneous population model (see SI Appendix for details) greatly increased the numbers of decoded spikes required to reproduce behavioral data (on average, 168 times higher), without changing quality of fit (∆AIC/BIC = 0.045 ± 0.28). We note, however, that the exact consequences of spike correlations for decoding depend on details of correlation structure that are difficult to measure experimentally (35)(36)(37), and suboptimal inference (in the form of a mismatched decoder) may play an important part (38).

Discussion
Taking, as a starting point, a mathematical idealization of the way neural populations encode information, we have shown that retrieval of a visual feature from working memory can be described as estimation based on a stochastically varying number of noisy samples. Two other influential models of VWM can be reconceptualized in the same framework: The slots+averaging model, because it modified the original slot model to allow multiple representations with independent noise, is directly equivalent to a sampling model with a fixed number of samples (13). And the variable precision model (8,9) constitutes the continuous limit of a sampling model as samples are made less precise and more numerous, while maintaining the fixed proportionality between the variance and mean of precision in the decoded estimate.
Formulating all three models in the same mathematical framework (a formal "unification") allowed us to pinpoint specific differences between them. We determined the effect of these differences on the models' ability to account for human behavior by fitting multiple variants of the sampling model to a large database of delayed reproduction tasks. We found that stochasticity both in the total number of samples and in their distribution among items has a major impact on the quality of fit, with the best fits obtained if the number of samples is drawn randomly and independently for each item in each trial. Note that this form of stochasticity is poorly captured by the concept of memory "slots," because of the implication that a slot occupied by one item leaves fewer slots available for other items-this would predict dependencies between items in whole report that were not supported experimentally.
On the other hand, contrary to the assumptions of continuous resource models, we did find limited support for discreteness of memory representations (3). The fully continuous model with Gamma-distributed precision proposed in previous studies provided fits to data that were, overall, a little worse than the discrete Poisson model, in both single-and whole-report tasks (SI Appendix). When we attempted to fit discretization as a free parameter, however, we found that ML estimates varied widely between participants, and many were best fit by continuous or near-continuous versions of the generalized stochastic sampling model. So, while discreteness in memory representations is plausible-even inevitable if based on discrete spiking activity-recall errors do not provide strong evidence for any one particular level of discreteness, or, as a corollary, any particular mean number of samples.
Our findings further highlight the need to distinguish between two concepts that have previously been elided: discreteness in representation and discreteness in allocation. In the stochastic sampling model, the resource underlying capacity limits in VWM is equated with the mean number of samples (or the mean spike rate in the population coding interpretation), which can be distributed among items in a continuous fashion, even though the consequent number of samples obtained by each item is a discrete integer. This view on memory resources was strongly motivated by studies showing that prioritized items can be represented more precisely in VWM, at the cost of decreased precision for other items (7,(39)(40)(41)(42). The stochastic sampling model can account for such findings through an uneven distribution of resources among memory items, corresponding to a higher mean number of samples for some items at the cost of a lower mean for other items. The actual number of samples available on an individual memory retrieval varies randomly about the item's mean. In the neural population model, this mechanism has previously been shown to successfully reproduce data from tasks in which one item is cued as the likely target (10).
While the stochastic sampling model is based on a highly idealized implementation of population coding, it nevertheless provides a link to a concrete neural mechanism that could form the basis of VWM performance. We have shown that adapting this model to achieve a higher degree of biophysical realism-by introducing heterogeneity in neural tuning curves and correlated spiking activity-improved the quality of fit to behavioral data. It has recently been shown that more neurally realistic population coding models preserve the key characteristics of the idealized model, and that signatures of neural tuning may even be visible in behavioral data (43).
Our results also provide a link between models of working memory used in the psychological literature and more biophysically detailed neural models such as continuous attractor networks (44)(45)(46), whose greater complexity typically precludes quantitative fits to behavioral data. These models are, likewise, based on principles of population coding and emphasize the role of neural noise in explaining variability in working memory performance. They are capable of producing probabilistic item limits similar to those described here, but it remains unresolved how these models could account for the graded variations in recall fidelity that we have found to be essential for capturing human behavioral performance.
In keeping with most previous work on VWM limits, we have not here attempted to reproduce the variations in bias and precision that are observed for different feature values, exemplified by the finding that cardinal orientations can be reproduced with greater precision than obliques. However, previous work has shown that these effects can be simply and elegantly captured within the population coding framework via the principle of efficient coding (47)(48)(49). The idea is that neural tuning functions are adapted to the stimulus statistics of the natural environment in such a way as to maximally convey information in that environment (effectively by distributing neural resources preferentially to the most frequently occurring stimulus features). Although it should be possible to formulate this model as a modification of stochastic sampling, without reference to neural populations, it seems that the modifications required would not have a natural explanation within the sampling framework. These observations, and the results of incorporating heterogeneity described above, illustrate the value of connecting abstract cognitive models to neural theory.
We also did not address here the question of how individual features of a visual stimulus are bound together, which forms another point of contention in the debate on the format of VWM representations. In the model fits, we allowed failures of binding memory in the form of swap errors to occur with a fixed rate, although taking into account similarity of items with respect to the cue feature is likely to improve model fits (50,51). While the discrete memory representations in slot models have traditionally been associated with a strongly object-based view (1), the sampling framework is agnostic as to whether objects or features are the units of VWM storage. Both views are compatible with the population coding interpretation, depending on whether the neurons in question are sensitive to a single feature (52) or a conjunction of features (51,53).
A recent proposal that VWM errors can be explained in terms of a perceptual rescaling of stimulus space can also be expressed in terms of population coding, with some minor differences from the version presented here (see ref. 54 for details and discussion). In particular, the idea of retrieval based on normally distributed "memory-match" signals maps exactly onto an idealized population code with continuous-valued activity and constant Gaussian noise (51,55). This predicts a continuous distribution over precision, not dissimilar to the Gamma distribution. Continuousness in representation does not appear a necessary component of this account, however, and it should be possible to reformulate it with arbitrary levels of discreteness, as in our generalized stochastic model.
There are other models of working memory that address capacity limits without explicitly postulating a limited memory resource (56). Some accounts stress the importance of memory decay over time, and active rehearsal to counteract this decay (57,58). These theories do not have a clear analogue in the sampling framework, although effects of retention time have been incorporated into the neural population model (28). Other accounts have sought to explain capacity limits by interference between different memorized items (59). While the sampling framework does not explicitly address interference, the effect of normalization could be described as a form of nonspecific interference between items. A model of feature binding based on the neural population model shows some notable congruencies with an interference account of VWM, and both models make similar predictions regarding swap errors (50,51). Further research will be needed to determine the exact relationship between these models.
Taken together, our results reveal a surprising convergence between prominent models of VWM. Despite the fact that these competing models were independently motivated by different behavioral and neural findings, they can be expressed within the shared formal framework of sampling, which reveals specific distinguishing factors as well as shared general principles. This convergence gives cause for confidence that the stochastic sampling model captures key characteristics of VWM and will provide a solid foundation for future research.

Materials and Methods
We fit computational models of VWM to behavioral data from a large dataset of delayed estimation experiments. The dataset included 15 individual single-report experiments (SI Appendix, Table S1; see SI Appendix, Supporting Information Text for inclusion criteria), as well as four wholereport experiments (SI Appendix, Table S2). Each model defines a parameterized distribution of response probabilities given the true feature values of the target and nontarget items in each trial (SI Appendix). For fits to whole-report data, we determined the probabilities of obtaining the given combination of responses within a single trial, taking into account the correlations of recall precision between different items within a trial predicted by each model.
We obtained an ML fit of each subject's data for each model. The stochastic sampling, fixed sampling, and continuous sampling (Gamma) model, as well as the fixed-random and stochastic-even variants, each have three free parameters (including one parameter for the probability of swap errors). We fit these to both the single-report and whole-report data using the Nelder-Mead simplex algorithm (see SI Appendix for details). The generalized stochastic model and the neural population model with heterogeneous tuning curves have four free parameters each. For these models, we employed a grid search to obtain fits only of the single-report data (fitting them to whole-report data was not computationally feasible). We further evaluated model variants employing a more accurate method for ML decoding for circular feature spaces (rather than the Gaussian approximation used for fits reported in Response Errors Discriminate between Models), models without swap errors, models with an additional free parameter for the power law in set size effects, and models with a temporal decay of memory precision over varying response delays in the whole-report experiments (SI Appendix).
Data Availability. Data and code associated with this study are publicly available in Open Science Framework (60).