Academics are more specific, and practitioners more sensitive, in forecasting interventions to strengthen democratic attitudes
Edited by Kenneth Wachter, University of California, Berkeley, CA; received April 30, 2023; accepted November 8, 2023
Significance
More credible ideas for addressing social problems are generated than can be tested or implemented. To identify the most promising interventions, decision-makers may rely on forecasts of intervention efficacy from experts or laypeople. We compare the accuracy of academic experts, practitioner experts, and members of the public in forecasting interventions to strengthen Americans’ democratic attitudes. Results show that academics and practitioners outperformed nonexperts. Experts also differed in how they were accurate: Academics were better at avoiding false-positive forecasts (predicting success when an intervention actually failed), while practitioners were better at avoiding false-negative forecasts (predicting failure when an intervention actually succeeded). Depending on the relative importance of avoiding false-positive vs. false-negative forecasts, decision-makers may prefer different experts.
Abstract
Concern over democratic erosion has led to a proliferation of proposed interventions to strengthen democratic attitudes in the United States. Resource constraints, however, prevent implementing all proposed interventions. One approach to identify promising interventions entails leveraging domain experts, who have knowledge regarding a given field, to forecast the effectiveness of candidate interventions. We recruit experts who develop general knowledge about a social problem (academics), experts who directly intervene on the problem (practitioners), and nonexperts from the public to forecast the effectiveness of interventions to reduce partisan animosity, support for undemocratic practices, and support for partisan violence. Comparing 14,076 forecasts submitted by 1,181 forecasters against the results of a megaexperiment (n = 32,059) that tested 75 hypothesized effects of interventions, we find that both types of experts outperformed members of the public, though experts differed in how they were accurate. While academics’ predictions were more specific (i.e., they identified a larger proportion of ineffective interventions and had fewer false-positive forecasts), practitioners’ predictions were more sensitive (i.e., they identified a larger proportion of effective interventions and had fewer false-negative forecasts). Consistent with this, practitioners were better at predicting best-performing interventions, while academics were superior in predicting which interventions performed worst. Our paper highlights the importance of differentiating types of experts and types of accuracy. We conclude by discussing factors that affect whether sensitive or specific forecasters are preferable, such as the relative cost of false positives and negatives and the expected rate of intervention success.
American politics in the 21st century has been characterized by political division and democratic backsliding. Research shows concerning levels of support for undemocratic practices (SUP) and partisan violence (SPV) (1, 2). Additionally, partisan animosity (PA) has substantially increased since the 1970s, with Democrats and Republicans expressing extreme hostility toward one another (3). Animosity imperils social relationships (4) and limits democratic accountability (5). These developments have spurred civic organizations and researchers to propose numerous interventions to reduce PA, SUP, and SPV, along with calls to evaluate these interventions (6).
While experimental tests offer one way to evaluate intervention effectiveness, resource limitations constrain the number of interventions and outcomes that can be experimentally tested. Additionally, in scenarios requiring urgent response, testing prior to intervening may be impractical or undesirable (7). Decision-makers faced with the challenge of selecting interventions to test or implement (when testing is impractical) may consult experts to forecast which interventions are likely to succeed (8). Nevertheless, it remains unclear whether domain experts, understood as individuals with knowledge relevant to a particular field (9–11), can accurately forecast intervention efficacy in the realm of political polarization. Additionally, it is unclear how different kinds of domain experts compare in their predictive success.
We differentiate two types of experts with potentially useful domain knowledge. Some experts work to develop general knowledge about social problems through the collection and analysis of data, less often acquiring direct, personal experience with the problem being studied (academic experts, in our analysis). Other experts attempt to intervene directly on the same social problems (practitioner experts, in our analysis), gaining participatory and experiential knowledge as a result. These two kinds of domain experts correspond closely to the classic distinction between “actuarial judgment”—judgment that relies on establishing systematic, general empirical relationships to draw conclusions—and “clinical judgment”—judgment that combines data with direct, personal experience to arrive at conclusions (12). While academics can draw on analyses of large datasets and systematic research from multiple contexts, practitioners are more likely to develop and employ first-hand knowledge built from personal experience (13). Although some prior research finds greater support for the accuracy of actuarial judgment (12), each approach may consider factors that the other ignores and thus could outperform the other in a given domain. Further, clinical judgment may have advantages in multifactorial settings where general knowledge development is nascent.
Here, we recruit members of the public, academics who study polarization, and practitioners who work on reducing polarization to forecast the performance of interventions to reduce PA, SUP, and SPV. Our recruitment approach extends research on forecasting in other domains in two primary ways. First, most prior work on expert forecasting has focused on the forecasting performance of academic experts or the lay public, neglecting practitioner experts. For instance, forecasting studies of the efficacy of behavioral nudges to improve vaccine uptake (14), the efficacy of group innovation interventions (15), the effect of various monetary and nonmonetary incentives in inducing costly effort (16), or whether scientific studies will replicate (17, 18) recruited academic experts or laypeople as forecasters. Second, those studies that have compared practitioner and academic experts have not specifically recruited experts with domain expertise, though this may be an important factor in many practitioners’ judgment. For instance, forecasting studies of interventions to increase gym activity (19), nudge interventions to improve uptake of government services (20), and messages to promote masking (21) recruited academics and practitioners with significant experience running experiments, but not necessarily with direct domain experience regarding the behavior of interest (e.g., gym activity, uptake of government services, or protective health behaviors).
We calculate the accuracy of these academic, practitioner, and lay forecasts of intervention efficacy by comparing forecasts to the intervention effects identified by a large-scale experiment called the Strengthening Democracy Challenge. In analyses that we preregistered, we first assess whether the forecasts of domain experts are more accurate than chance, whether they outperform nonexpert members of the lay public, and whether performance varies between academic and practitioner experts.* In subsequent analyses that were not preregistered, we explore the sensitivity and specificity of expert forecasters (22, 23). Specificity refers to how well forecasters avoid false-positive forecasts, that is, inaccurate predictions that an intervention would succeed when it in fact failed. Sensitivity, by contrast, involves avoiding false-negative forecasts, that is, inaccurate predictions that an intervention would fail when it actually succeeded. We conclude by discussing how and when differences in specificity and sensitivity matter for decision-makers, identifying conditions under which one component of accuracy might be preferred over the other.
The Forecasting for Democracy Challenge.
Our findings come from the Forecasting for Democracy competition, which was linked to the Strengthening Democracy Challenge, an experimental “megastudy” that tested the effects of 25 crowdsourced interventions on PA, SUP, and SPV (24). The impact of each intervention on these three outcomes was assessed in a survey experiment conducted on a large national sample of American partisans (n = 32,059) that was representative of US population benchmarks on key demographic variables. This design enables us to directly compare forecasts of these effects with the actual performance of the interventions in a highly powered experiment.
We recruited three cohorts of forecasters: nonexperts from the general public (n = 1,024), academic experts (n = 106), and practitioner experts (n = 51). Members of the general public were recruited from Bovitz, an online panel. We recruited academics by extending invitations through professional networks and mailing lists; we operationalized academics as those who report studying political polarization, having a PhD degree, and working in an academic institution. We recruited practitioners by sending invitations to participate through nonprofits that work on bridging partisan divides, counting as practitioner experts those who work, volunteer, or are otherwise connected to nonprofit bridging organizations with active programs to reduce political polarization. Hence, all expert forecasters had experience studying (in the case of academics) or intervening directly in (in the case of practitioners) the three outcomes they were to forecast (11).
Forecasters were invited to make predictions about the efficacy of interventions. Before doing so, they participated in a training that defined the outcomes, explained the nature of the experimental sample used to measure the interventions’ success, offered details on how to register forecasts, and described how a forecast would be evaluated (in terms of its success). For each forecast, participants received a summary that described the intervention and a link that enabled them to view and experience the entire intervention. Having reviewed this information, forecasters registered the percent likelihood of success of each of the 25 interventions across the three outcomes (PA, SUP, and SPV; thus, they could make up to 75 forecasts). We defined intervention success as statistically significantly reducing the target outcome and failure as having no statistically significant effect or a significant backfire effect that increased the target outcome. Forecasters received monetary rewards for accuracy (up to $0.40 per forecast), as well as public acknowledgment. In total, our 1,181 forecasters contributed 14,076 predictions. The expert forecasters, but not the public, were also given the option of writing rationales for their forecasts. After academics and practitioners had 3 wk to register their individual forecasts, they could view the forecasts and rationales of others within their cohort and update their own forecasts.
Results
Experts Outperform Nonexperts.
The predicted likelihoods of intervention success among domain experts (academics and practitioners) and nonexpert members of the public are plotted in Fig. 1A, with interventions that had a statistically significant effect on reducing target outcomes (“successes”) colored in blue and those that did not or that backfired (“failures”) colored in red. When we interpret forecasts of 50% or more as predicting the success of an intervention (indicated by a dashed line in Fig. 1A), the average of academic and practitioner (expert) forecasts correctly identified the effects of 61 of the 75 intervention–outcome pairs, or 81% (all blue dots above the dashed line plus all red dots below the dashed line). This is significantly higher than chance (one-sample binomial test, P < 0.0001). By contrast, members of the public (represented by “x”s in Fig. 1A) correctly identified only 39 of the 75 intervention–outcome pairs (52%), which is not significantly higher than chance (P = 0.409). Raw predictions for each of the interventions are displayed in SI Appendix, Figs. S1–S3.
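As a minimal sketch of this classification step (in Python, with hypothetical variable names and placeholder data rather than the study data), forecasts at or above the 50% cutoff are treated as predictions of success and the resulting hit rate is tested against chance with a one-sample binomial test:

```python
import numpy as np
from scipy.stats import binomtest

# Placeholder inputs: one mean expert forecast (0-1) and one realized outcome
# (1 = the intervention significantly reduced its target) per the 75 pairs.
rng = np.random.default_rng(0)
mean_expert_forecast = rng.uniform(0, 1, size=75)
actually_succeeded = rng.integers(0, 2, size=75)

# Forecasts of 50% or more count as predicting success.
predicted_success = mean_expert_forecast >= 0.50
hits = int((predicted_success == actually_succeeded.astype(bool)).sum())

# One-sample binomial test of the hit rate against a 50% chance rate.
print(hits / 75, binomtest(hits, n=75, p=0.5).pvalue)
```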
Fig. 1. (A) Predicted likelihoods of intervention success among academics, practitioners, and nonexpert members of the public, by whether the intervention significantly reduced its target outcome. (B) Crowd forecasts of academics and practitioners arrayed by intervention effect size.
Because of the hierarchical nature of the dependent variable (predictions nested within each forecaster), we calculate predicted margins from a multilevel model to test whether experts outperformed members of the general public, controlling for demographic characteristics like gender, race, and age (Materials and Methods). Members of the public had an average absolute prediction error (APE) of 0.484 (95% CI [0.447 to 0.521]), which is higher than academics’ average APE (0.447 [0.402 to 0.492]) and practitioners’ average APE (0.418 [0.367 to 0.468]). Pairwise tests of differences, with a Tukey adjustment for multiple comparisons, are statistically significant between the public and both academics (P = 0.03) and practitioners (P = 0.002). The difference between academics and practitioners, however, is not statistically significant (P = 0.41).
We also evaluated the relative accuracy of expert and public crowd wisdom. That is, we take the mean forecast for each intervention in a given cohort (i.e., practitioners, academics, and the public), such that each cohort makes 75 collective forecasts (25 interventions × 3 outcomes). We use Wilcoxon signed-rank tests comparing the APE for the crowd wisdom of practitioners, academics, and the public. The APE of crowd wisdom among the public was 0.501, compared to 0.449 among academics and 0.460 among practitioners. Crowd forecasts among academics had lower APE than the public (by 0.053; 95% CI [0.027 to 0.081], P < 0.001). Similarly, the APE of crowd forecasts among practitioners was 0.043 less than in the general public (95% CI [0.026 to 0.064], P < 0.0001). We cannot reject the null hypothesis that the aggregate forecasts of academics and practitioners are identical in their accuracy, 95% CI [−0.041 to 0.017] (P = 0.414).
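A minimal sketch of this crowd-wisdom comparison, assuming a long-format table `df` with hypothetical columns cohort, pair, forecast, and outcome (not the authors’ actual code):

```python
import pandas as pd
from scipy.stats import wilcoxon

def crowd_ape(df: pd.DataFrame) -> pd.DataFrame:
    # Average forecasts within each cohort to get 75 collective forecasts,
    # then compute the absolute prediction error (APE) for each.
    crowd = (df.groupby(["cohort", "pair"])
               .agg(forecast=("forecast", "mean"), outcome=("outcome", "first"))
               .reset_index())
    crowd["ape"] = (crowd["forecast"] - crowd["outcome"]).abs()
    # One row per intervention-outcome pair, one APE column per cohort.
    return crowd.pivot(index="pair", columns="cohort", values="ape")

# Paired Wilcoxon signed-rank test of crowd APE, e.g. academics vs. the public:
# wide = crowd_ape(df)
# stat, p = wilcoxon(wide["academic"], wide["public"])
```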
Finally, forecasters were also asked to assign probabilities to whether the interventions would have a large, medium, small, null, or backfire effect, allowing us to compare the accuracy of members of the public, academics, and practitioners in forecasting ordinal effect sizes. To do this, we calculate a Brier score for predictions of these five events and analyze this outcome with the same multilevel models described above. Predicted margins show that academics and practitioners outperform the public, but with no statistically significant difference between academics and practitioners (SI Appendix, Tables S1 and S2). In SI Appendix, we additionally discuss results showing that academics and practitioners become more accurate after opportunities to update their forecasts, and we cannot reject the null hypothesis that academics and practitioners are identical in their accuracy after these opportunities to update.
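A minimal sketch of a multicategory Brier score over the five effect-size events (the exact scoring rule is described in SI Appendix; the probabilities below are placeholders):

```python
import numpy as np

# Placeholder: forecasted probabilities over the five ordered events
# (backfire, null, small, medium, large) and a one-hot vector of what occurred.
forecast_probs = np.array([0.05, 0.45, 0.30, 0.15, 0.05])
realized = np.array([0, 1, 0, 0, 0])

# Multicategory Brier score: summed squared error across the five events
# (lower is better; 0 is a perfect forecast).
brier = np.sum((forecast_probs - realized) ** 2)
print(brier)
```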
Academics and Practitioners Excel at Different Components of Accuracy.
Fig. 1B shows the crowd forecasts of academics and practitioners arrayed by the effect size of interventions. This figure distinguishes two kinds of errors. The first error, which we refer to as false-positive forecasts, occurs when a forecaster predicted intervention success, but the intervention did not have a statistically significant effect on reducing the target outcome or backfired (Top Right Quadrant in Fig. 1B). The second error, which we refer to as false-negative forecasts, occurs when a forecaster predicted intervention failure when it, in fact, successfully reduced target outcomes (Bottom Left Quadrant in Fig. 1B).
Fig. 1B reveals that while the two cohorts performed similarly in their overall accuracy, they differed in the kinds of errors they made, differing in their specificity (i.e., avoiding false-positive forecasts) and in their sensitivity (i.e., avoiding false-negative forecasts—22, 23). On the one hand, academics demonstrated greater specificity. In the 42 cases where interventions did not statistically significantly reduce PA, SUP, or SPV, academics correctly predicted these failures 66.7% of the time (95% CI [0.647 to 0.691], estimated using 5,000 bootstrap samples with bias-corrected and accelerated bootstrap intervals—25, 26). Practitioners correctly predicted failures only 56.7% [0.530 to 0.603] of the time. Permutation tests suggest that academics have statistically significantly higher specificity than practitioners (P < 0.001, based on 5,000 permutations under the null hypothesis that specificity in the two cohorts is identical).
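A sketch of the specificity calculation and its bias-corrected and accelerated (BCa) bootstrap interval, using SciPy; the array names and data below are placeholders, not the study data:

```python
import numpy as np
from scipy.stats import bootstrap

def specificity(pred_success, succeeded):
    # Share of forecasts about failed interventions that correctly predicted failure.
    # Sensitivity is analogous: np.mean(pred_success[succeeded == 1]).
    failed = succeeded == 0
    return np.mean(~pred_success[failed])

# Placeholder data: one entry per individual forecast.
rng = np.random.default_rng(1)
forecasts = rng.uniform(0, 1, size=500)
succeeded = rng.integers(0, 2, size=500)
pred_success = forecasts >= 0.50

print(specificity(pred_success, succeeded))

# 95% BCa interval from 5,000 paired resamples of individual forecasts.
res = bootstrap((pred_success, succeeded), specificity, paired=True,
                vectorized=False, n_resamples=5000, method="BCa")
print(res.confidence_interval)
```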
On the other hand, practitioners demonstrated greater sensitivity. In the 33 cases where interventions statistically significantly reduced PA, SUP, or SPV, practitioners correctly forecasted their effects 59.8% [0.566 to 0.628] of the time. By contrast, academics correctly predicted only 48.8% [0.467 to 0.510] of these successful interventions. Permutation tests again suggest that the difference in sensitivity is statistically significant (P < 0.001).†
We can also compare specificity and sensitivity in terms of the mean, or aggregate, forecasts of academics and practitioners. One caveat is that we successfully recruited more academics than practitioners, so there are more academic forecasts. Hence, academics have an advantage over practitioners when we aggregate forecasts, as the larger aggregate incorporates a wider range of information, which enhances its predictive value (27). We find that, in the aggregate, academics correctly identified 37 of the 42 cases where interventions failed, a specificity of 88.1% [0.744 to 0.955]. By contrast, practitioners correctly predicted only 19 of the 42 cases where interventions failed, a specificity of 45.2% [0.301 to 0.605]. In terms of sensitivity, the opposite was true. Academics correctly predicted 18 of the 33 cases where interventions were successful, a sensitivity of 57.6% [0.393 to 0.733], whereas practitioners displayed a higher sensitivity of 87.9% [0.802 to 0.966]. A permutation test indicates that these differences in sensitivity (P = 0.005) and specificity (P < 0.001) are both statistically significant. Taken together, the evidence suggests that academics and practitioners, while similar in their overall accuracy, differ significantly in both their specificity and sensitivity.
Accounting for baseline forecast likelihoods.
The mean forecasted likelihood of intervention success among academics was 0.46 (SD = 0.286), whereas it was 0.56 (SD = 0.317) among practitioners. Hence, one concern is that differences in sensitivity and specificity are attributable to baseline pessimism or optimism. We define baseline pessimism or optimism as overall lower or higher prediction likelihoods that do not change the rank order of forecasted probabilities across interventions. If two groups of forecasters make the same predictions, but one group then reduces their forecasted probabilities by some constant, then this pessimistic group would demonstrate higher specificity without being better at distinguishing between failures and successes. This would mean our results regarding academic specificity and practitioner sensitivity are an artifact of distinct baselines, rather than true differences in ability to identify effective or ineffective interventions.
We address this concern by varying thresholds for classifying predicted probabilities. That is, rather than classifying forecasts of 50% or more as predicting intervention success, as in the analyses above, we allow this threshold to vary for academics and practitioners, making it stricter for practitioners (who were more optimistic) and more lenient for academics (who were more pessimistic). To select the appropriate threshold, we rely on optimal thresholds for academics and practitioners. At the optimal threshold, forecasters have the greatest ability to differentiate between successes and failures. At this cutoff, sensitivity and specificity have the lowest (and thus best) tradeoff, and the sum of sensitivity and specificity is maximized (Youden J statistic—28).
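A sketch of this threshold selection using an ROC curve and the Youden J statistic (applied separately to each cohort’s forecasts; variable names in the usage comment are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_curve

def optimal_threshold(succeeded, forecasts):
    # ROC curve over forecast probabilities, treating "success" as the positive class.
    fpr, tpr, thresholds = roc_curve(succeeded, forecasts)
    youden_j = tpr - fpr  # equivalently: sensitivity + specificity - 1
    return thresholds[np.argmax(youden_j)]

# e.g., cutoff_practitioners = optimal_threshold(succeeded_prac, forecasts_prac)
#       cutoff_academics     = optimal_threshold(succeeded_acad, forecasts_acad)
```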
We find that baseline academic pessimism or practitioner optimism can only partially explain the observed differences in sensitivity and specificity. The optimal threshold for practitioners is 0.600, and for academics, it is 0.505, which reflects the difference in average forecast probabilities between the two cohorts. Thus, all else constant, practitioners are more optimistic. At these respective thresholds, the estimated sensitivity for practitioners declines to 56.4% [0.533 to 0.596] and remains 48.8% [0.467 to 0.509] for academics. Given the stricter threshold, the estimated specificity for practitioners increases to 59.1% [0.555 to 0.626], whereas for academics, it is 66.9% [0.647 to 0.690]. Permutation tests suggest that academics continue to have statistically significantly lower sensitivity than practitioners (P < 0.001), and practitioners have statistically significantly lower specificity than academics (P < 0.001). Thus, this result establishes that differences in sensitivity and specificity are not a mere by-product of overall optimism vs. pessimism regarding intervention efficacy.
When comparing the mean, or aggregate, forecasts of academics and practitioners with threshold adjustments, we find the optimal threshold for academics is 0.476. At this threshold, their sensitivity is 78.7% [0.607 to 0.906], and their specificity is 83.3% [0.685 to 0.923]. For practitioners, their optimal threshold is 0.551. At this threshold, their sensitivity is 75.8% [0.579 to 0.882], and their specificity is 69.0% [0.528 to 0.816]. Given the small sample size for these aggregated analyses (n = 75 per group), the differences are not statistically significant (P = 0.807 for the sensitivity difference, and P = 0.078 for the specificity difference). Additionally, comparisons of sensitivity and specificity in the aggregate are biased in favor of academics, as the larger number of academic forecasters gives this group an inherent advantage in our data.
Heterogeneity by target outcome and by effect size of interventions.
The fact that academics and practitioners demonstrate different levels of specificity and sensitivity, even at their respective optimal thresholds, means that the prevalence of working interventions will affect the accuracy of their forecasts. To see why this is the case, consider how accuracy is defined as the weighted sum of sensitivity and specificity. When prevalence is greater than 0.5, sensitivity receives more weight. When prevalence is lower than 0.5, specificity receives more weight. Accordingly, Fig. 2 shows that practitioners had lower APEs than academics on PA—for which the prevalence of effective interventions was high (92%)—whereas academics outperformed practitioners on SUP and SPV—for which the prevalence of effective interventions was low (20% for each). The results are robust to different ways of operationalizing prediction error (SI Appendix, Figs. S4 and S5).
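Formally, with $\pi$ denoting the prevalence of effective interventions, overall accuracy is the prevalence-weighted sum of the two components:

$$\text{Accuracy} = \pi \cdot \text{Sensitivity} + (1 - \pi) \cdot \text{Specificity},$$

so sensitivity receives more weight when $\pi > 0.5$ and specificity receives more weight when $\pi < 0.5$.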
Fig. 2. Absolute prediction error (APE) of academics, practitioners, and the public for each target outcome (PA, SUP, and SPV).
Another finding consistent with differences in sensitivity and specificity is that practitioners were better at correctly predicting interventions with the strongest effect sizes (best performers), whereas academics were better at correctly predicting the worst-performing interventions. If policymakers and other decision-makers are most interested in selecting best-performing interventions, then what matters may be less whether forecasters are generally accurate than whether they can identify the interventions with the largest effect sizes (in the intended direction). By including an interaction between quartiles of the actual treatment effect size and our indicators for cohort affiliation in our multilevel model, we estimate how the APEs of forecasters vary with the size of the interventions’ treatment effects. We also include a fixed effect for which outcome is being forecasted (PA, SUP, or SPV). The predicted margins, displayed in Fig. 3, show that academics have low prediction errors when forecasting interventions with weak or backfire effects, as well as interventions with treatment effects in the second quartile. Yet they are statistically significantly worse than practitioners when forecasting whether the best-performing interventions (those with the largest treatment effects in the intended direction) will succeed. To illustrate concretely, relative to academics, practitioners assigned higher likelihoods of success to the five interventions that in fact had the largest effects in reducing PA (68.2 vs. 54.7%), SUP (59.2 vs. 55.4%), and SPV (57.2 vs. 51.8%). By contrast, for the five worst-performing interventions, academics correctly assigned lower likelihoods of success than practitioners did for PA (40.6 vs. 50.8%), SUP (38.7 vs. 50.9%), and SPV (42.7 vs. 54.7%).
Fig. 3. Predicted margins of APE by cohort across quartiles of actual intervention treatment effect size.
If these differences were an artifact of baseline optimism or pessimism, they would disappear when comparing the effect sizes of the highest-ranked interventions from practitioners vs. academics. By focusing only on the top interventions, rank-ordered by forecasted probabilities, we control for the baseline differences in optimism and pessimism between academics and practitioners. For academics, the average effect size among the top three interventions selected as most likely to work was d = −0.193. For practitioners, it was −0.223, an advantage of 0.03 over academics. This pattern is robust across other thresholds: the average effect size of the top 10, 15, or 20 interventions as ranked by practitioners is stronger than that of those ranked by academics by 0.088, 0.095, and 0.043, respectively. The only exception is the top five interventions, where the interventions selected by academics had slightly stronger effect sizes (by 0.024) than those selected by practitioners. For reference, the average effect size detected across all 75 intervention–outcome pairs was −0.089, so the ability to identify interventions that are d = 0.03 stronger represents a 33% improvement. Taken together, it appears that the interventions to which practitioners assign the highest probabilities of success are generally more effective than those chosen by academics.
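A sketch of this rank-based check, assuming a table `crowd` of mean cohort forecasts per intervention–outcome pair with a hypothetical `effect_size` column (more negative values indicate larger reductions):

```python
import pandas as pd

def mean_effect_of_top_k(crowd: pd.DataFrame, cohort: str, k: int) -> float:
    # Rank a cohort's forecasts and average the realized effect sizes of its top k picks.
    top_k = crowd[crowd["cohort"] == cohort].nlargest(k, "forecast")
    return top_k["effect_size"].mean()

# e.g., compare cohorts at k = 3:
# mean_effect_of_top_k(crowd, "practitioner", 3), mean_effect_of_top_k(crowd, "academic", 3)
```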
Discussion
The apparently high levels of political division and the fragile state of democratic attitudes in the United States have stimulated substantial discussion about how to effectively intervene (29). This includes interventions aimed at creating a less politically acrimonious citizenry that supports democratic practices and opposes partisan violence (24). Because financial and logistical constraints limit the number of interventions that can be subject to high-quality experimental evaluation—in this domain and many others—decisions about which interventions to implement must be made based on incomplete evidence. Expert forecasters are a promising source of predictions to inform such decisions under uncertainty, and our results suggest that domain experts can accurately predict intervention success, making predictions that are more accurate than chance and than those made by the general public. Importantly, we analyze two distinct components of accuracy across types of experts. While academics and practitioners had comparable overall accuracy, they differed in their specificity and sensitivity. Academics were more specific, meaning that they correctly identified a larger proportion of failed interventions (i.e., they made fewer false-positive forecasts). By contrast, practitioners were more sensitive, meaning that they correctly identified a larger proportion of successful interventions (i.e., they made fewer false-negative forecasts). Consistent with this difference, practitioners were better at predicting best-performing interventions, while academics were superior in predicting which interventions performed worst. Finally, we found that these differences in specificity and sensitivity were not fully explained by differences in baseline optimism or pessimism.
In the context of the larger forecasting literature, beyond extending work to the substantive domain of political interventions, our findings differ from the results of some recent forecasting studies in which experts were found to perform no better than chance in predicting the effects of behavioral interventions (14, 15, 21), or no better than members of the general public (19). While these differences may be attributable to varying degrees of predictability in intervention effects across domains, or to the specifics of what is being forecast (e.g., the likelihood of a statistically significant effect vs. effect sizes), we speculate that the better performance of expert forecasters in our study may have resulted from our recruitment of experts with domain-specific expertise, i.e., experts who had directly worked on the topic being forecasted (9).
Additionally, by recruiting two different types of experts, the present analyses shed light on the question of whether those with on-the-ground experience intervening on a problem (practitioners), vs. those with general knowledge about a problem gained through systematic analysis of data (academics), may offer distinct insights that could be useful to decision-makers. Prior studies have contrasted the performance of academics against forecasters with experience evaluating behavioral interventions (e.g., “nudge practitioners”—19–21), but it is unclear whether these forecasters had domain-specific expertise. For instance, it is unclear whether the nudge practitioners had first-hand experience intervening in the problems of interest, and the academics in these prior forecasting studies were general experts who did not necessarily study the social problems (e.g., gym attendance or masking during the COVID-19 pandemic) they were forecasting. Our results suggest that practitioners with direct experience attempting to address PA and/or anti-democratic attitudes can accurately forecast the efficacy of interventions targeting these outcomes, performing better than chance and overall forecasting as accurately as academics. The sensitivity and specificity differences further suggest that practitioners and academics offer different insights for decision-makers seeking to select among various interventions, at least in this domain.
Finally, our study generally illustrates the utility of differentiating components of accuracy, as different decision scenarios make it preferable to draw upon more specific, or more sensitive, experts. We propose five criteria that, all else being equal, affect whether a decision-maker is likely to prefer greater sensitivity or specificity. In Table 1, we provide a summary of criteria that shape the prioritization of sensitive vs. specific experts. First, if decision-makers know that many interventions will work (i.e., prevalence is high), experts with greater sensitivity (practitioners, in our case) are more likely to be correct. When prevalence is low, then experts with greater specificity are more likely to be correct.
Table 1.
| Criterion | Value → preferred accuracy type |
| --- | --- |
| Probable proportion of successful interventions | High → sensitivity; low → specificity |
| Interest in identifying top-performing interventions | High → sensitivity; low → specificity |
| Availability of later assessment | Yes → sensitivity; no → specificity |
| Marginal intervention deployment cost | High → specificity; low → sensitivity |
| Number of successful interventions desired | Many → sensitivity; few → specificity |
Second, if decision-makers are primarily interested in identifying top-performing interventions—such as situations where finding one of the most efficacious interventions is of unique interest—experts with sensitivity are preferred. By contrast, if there is greater emphasis on avoiding interventions that will backfire or otherwise fail, then experts with specificity are preferred.
Third, if decision-makers are selecting interventions that can be validated or assessed at a later stage (e.g., via an experiment), then sensitivity is more important. This is because false-positive forecasts can be ruled out through the later assessment, while potentially highly effective interventions will not be missed; the later assessment serves as a form of insurance against false-positive forecasts. However, in an emergency or otherwise unexpected situation where interventions must be deployed without subsequent assessment of their effects (7), specific forecasters will be preferred, as they keep the chance of fielding an ineffective intervention low.
Fourth, when the marginal cost of deploying interventions is low, then the cost of mistakenly selecting ineffective interventions is lower, which favors greater reliance on forecasters with sensitivity. If the marginal deployment costs are high, as in the case of a costly field intervention, then a specific forecaster may be preferred.
Finally, when the goal is to discover many working interventions, as opposed to needing to find only a single working solution, then more sensitive forecasters are preferred because there is greater likelihood that they will identify multiple effective interventions. By contrast, if a single effective intervention is all that is desired or needed, such as a targeted intervention at a point in time, then forecasters who are specific are preferable.
To illustrate these five criteria concretely, whether one prioritizes specificity or sensitivity (and hence, academics or practitioners) for strengthening democratic attitudes might depend on whether few or many successful interventions are expected. All other factors being equal, practitioners would be favored if more interventions are expected to succeed. If decision-makers care primarily about identifying the most successful interventions in reducing PA and anti-democratic attitudes, then practitioners would also be favored. To give examples where academics might be favored, if decision-makers need to identify immediate actions to address fast-moving democratic backsliding that preclude other assessment exercises, as opposed to longer-term efforts to evaluate interventions to address democratic erosion, then academic forecasters are more informative. If interventions require expensive in-person participation, rather than being deployed online at scale, then academics would again be favored. Finally, if it is unnecessary to identify multiple workable variants upon which to draw, and a single intervention is all that is required, then academics would again be preferable as forecasters.
Future research could extend the present work by investigating whether the sensitivity and specificity differences between practitioners and academics observed here obtain in other domains, especially because the differences we observed were not anticipated by us ex ante, and therefore were not preregistered as hypotheses. In future work, it would also be valuable to isolate the mechanisms driving these differences. For instance, academics may generally exhibit greater specificity given the particular emphasis placed by social scientists on avoiding type 1 errors (false-positive findings), more so than type 2 errors (false-negative findings). The fact that the customary cutoff for type 1 errors (0.05) is smaller than that for type 2 errors (0.2) suggests that academics may be more concerned with making conclusions that hold up to later replication or review, as opposed to missing important findings. One possibility, then, is that this differential avoidance of false-positive vs. -negative claims generalizes to how academics make judgments about intervention efficacy, making academics more specific in their forecasts. Practitioners, by contrast, might be more sensitive and able to predict effective interventions because they focus on identifying potential workable interventions to implement in the field. Thus, interventions with low or inconsistent effectiveness may be discarded, while promising interventions are attended to more closely by this pragmatically minded expert community.
The selection of promising behavioral interventions remains a persistent problem for researchers and policymakers. Even as the emergence of megastudies and adaptive experimental research designs enable more behavioral interventions to be tested in parallel, it remains infeasible to test all potential interventions, necessitating less costly selection processes that often involve expert recommendations. The fact that domain experts can accurately identify the effects of interventions to strengthen democratic attitudes and reduce PA suggests these experts are important sources of insight for this domain. Additionally, the fact that different types of domain expertise track with distinct components of accuracy (specificity and sensitivity) implies academic and practitioner experts may be preferred as forecasters under different circumstances.
Materials and Methods
We study forecasting in the context of a competition called the Strengthening Democracy Challenge. The study evaluated 25 crowdsourced interventions targeting three widely expressed concerns (24): PA, SUP, and/or SPV. Eligible interventions had to be no longer than 8 min and suitable for implementation via a survey experiment. We tested the impact of each of the 25 interventions on each of the three outcome variables (i.e., 25 × 3 = 75 effects) in a megastudy survey experiment (n = 32,059). These results enable us to evaluate the accuracy of forecaster predictions (see SI Appendix for more details about the Challenge).
We recruited expert academics, expert practitioners, and members of the general public (a nationally diverse sample recruited from an online panel provided by Bovitz Inc.) to predict how likely each of the 25 interventions would reduce PA, SUP, or SPV (a total of 75 forecasts, given there are three outcome variables). Study procedures were approved by Northwestern University’s Institutional Review Board. All potential participants had to complete an intake survey before any forecasts were elicited. The intake survey introduced the study in sufficient detail for participants to give informed consent.
We invited academics (n = 106) who directly worked on the outcomes under study, such as those who had published papers on partisan polarization. We invited practitioners (n = 51) who worked in organizations aimed at reducing partisan divisions, such as those that implement interventions to bridge divides. These numbers of forecasters align with (or exceed) that found in prior forecasting studies (14–21). The higher number of academics reflects a higher-than-anticipated participation rate. Additionally, we recruited 1,024 participants from the general public, where each participant was only asked to make eight predictions in one sitting. The higher number of general public forecasters reflects the fact that they each forecasted fewer interventions. Participants were paid for accurate forecasts (see SI Appendix on incentives).
To ensure fairness and to avoid conflating expertise with familiarity with our forecasting setup, all participants completed a training module on how to make forecasts prior to registering forecasts. This training ensured that all participants understood how the success of interventions would be measured, i.e., statistical significance at the 0.05 level using a one-tailed test relative to a control group (described in widely accessible language). Additionally, the training covered details about how each dependent variable was measured, the experimental sample, and other pertinent details for making forecasts.
After completing this training module, academics and practitioners signed onto a custom website called Cultivate Labs to register their forecasts. The general panel registered forecasts via a survey we programmed rather than the custom website. All forecasters could see a title and abstract of each intervention, and a link to the full intervention, exactly as participants in the experiment would experience it. Based on this information, each participant was asked to forecast the likelihood of the intervention statistically significantly reducing one of three dependent variables as well as the effect size (as being a large, medium, small, null, or backfire effect—see SI Appendix for more details on Cultivate Labs). We sum forecasted probabilities for statistically significant small, medium, and large effects as predictions that the intervention would be successful.
The expert forecasters also could, but were not required to, write a rationale for each forecast. Expert forecasters had 3 wk to make their forecasts, at which point we made the forecasts and rationales of all forecasters publicly available. The forecasters could then update their forecasts as often as they wanted (and the others’ forecasts were constantly updated to the most recent predictions) for another 3 wk.
Data and code to reproduce all analyses in this paper are available at https://osf.io/acver/. For our preregistered statistical analyses of cohort differences in prediction accuracy, our primary dependent measure is prediction error. In the manuscript, we report APE, defined as the absolute value of the error between the forecast probability and the actual outcome. We also test whether results are robust to different variants of prediction error, including the Brier score (mean squared error) or Brier score for predictions of the five effect size categories (we report these specifications in SI Appendix).
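The two error measures can be written compactly as follows (a sketch with placeholder values; the outcome is coded 1 if the intervention significantly reduced its target and 0 otherwise):

```python
import numpy as np

forecast = np.array([0.70, 0.20, 0.55])  # placeholder forecast probabilities
outcome = np.array([1, 0, 0])            # placeholder realized outcomes

# Absolute prediction error (APE): |forecast - outcome| for each forecast.
ape = np.abs(forecast - outcome)

# Brier score: squared error of the probabilistic forecast.
brier = (forecast - outcome) ** 2

print(ape.mean(), brier.mean())
```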
We model each operationalization of prediction error using a multilevel model to account for the hierarchical nature of the data (predictions nested within each forecaster). The first-level portion of the model is as follows:

$$\text{PredictionError}_{ij} = b_i + \varepsilon_{ij},$$

where the prediction error of each forecast $j$ within individual $i$ is modeled as having a mean $b_i$ and an error $\varepsilon_{ij}$. The second-level equation in the model is as follows:

$$b_i = \beta_0 + \beta_1 \text{Academic}_i + \beta_2 \text{Practitioner}_i + \mathbf{X}_i \boldsymbol{\gamma} + u_i,$$

where the prediction error within each forecaster $i$ ($b_i$) is modeled as a function of the primary independent variables (the cohort indicators), a vector $\mathbf{X}$ consisting of covariates for gender, race, age, education, and party identification, and a between-subjects error term $u_i$. Our primary variables of interest are the indicator variables for cohort, i.e., identification as an academic or practitioner, relative to being a member of the general public.
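A sketch of fitting such a random-intercept model in Python (assuming a long-format DataFrame with hypothetical column names; the authors’ own estimation code is available on OSF):

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_ape_model(df: pd.DataFrame):
    # df: one row per forecast, with columns ape, cohort, gender, race, age,
    # education, party_id, and forecaster_id (all names are hypothetical).
    model = smf.mixedlm(
        "ape ~ C(cohort) + C(gender) + C(race) + age + C(education) + C(party_id)",
        data=df,
        groups=df["forecaster_id"],  # random intercept for each forecaster
    )
    return model.fit()
```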
Finally, in our exploratory analyses comparing the specificity and sensitivity of academic and practitioner forecasts, we report CI estimated using bootstrapping, with 5,000 resamples of individual forecasts (with replacement). Because the distribution of estimates in the bootstrap samples is not necessarily normally distributed, we construct bias-corrected and accelerated CI (25, 26). Additionally, to identify whether the differences between academics and practitioners are statistically significant, we rely on a nonparametric permutation test to produce a distribution of differences under the null hypothesis, i.e., that there were no differences between academics and practitioners. To generate this distribution, we randomly permute whether individual forecasts are attributable to academics or practitioners multiple times, each time calculating the difference in specificity or sensitivity between academics and practitioners. By comparing the observed differences against the distribution under the null hypothesis, we can estimate a P-value and infer whether our results are statistically significant.
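A sketch of the permutation procedure for the academic–practitioner gap in specificity (array names are hypothetical; the specificity helper from the Results sketch is repeated here for self-containment):

```python
import numpy as np

def specificity(pred_success, succeeded):
    # Share of forecasts about failed interventions that correctly predicted failure.
    failed = succeeded == 0
    return np.mean(~pred_success[failed])

def permutation_pvalue(pred_success, succeeded, is_academic, n_perm=5000, seed=0):
    # pred_success: bool forecasts of success; succeeded: 1/0 outcomes;
    # is_academic: bool cohort labels (True = academic, False = practitioner).
    rng = np.random.default_rng(seed)

    def gap(labels):
        return (specificity(pred_success[labels], succeeded[labels])
                - specificity(pred_success[~labels], succeeded[~labels]))

    observed = gap(is_academic)
    null = np.array([gap(rng.permutation(is_academic)) for _ in range(n_perm)])
    # Two-sided p-value: share of shuffled gaps at least as extreme as observed.
    return np.mean(np.abs(null) >= np.abs(observed))
```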
Data, Materials, and Software Availability
Anonymized .csv data have been deposited in OSF (https://osf.io/acver/) (30).
Acknowledgments
We thank A. Dreber, S. DellaVigna, E. Linos, W. Kim, K. Milkman, T. Pfeffer, and anonymous reviewers for invaluable feedback. Funding for this project was provided by the Ford Motor Company Center for Global Citizenship and the Institute for Policy Research at Northwestern University.
Author contributions
J.Y.C. contributed new reagents/analytic tools; J.Y.C. and S.K. analyzed data; J.Y.C., J.N.D., and R.W. wrote the paper; J.Y.C., J.G.V., and M.N.S. equally led the project; and J.G.V., M.N.S., and D.G.R. provided edits.
Competing interests
The authors declare no competing interest.
Supporting Information
Appendix 01 (PDF)
References
1. M. H. Graham, M. W. Svolik, Democracy in America? Partisanship, polarization, and the robustness of support for democracy in the United States. Am. Polit. Sci. Rev. 114, 392–409 (2020).
2. N. P. Kalmoe, L. Mason, Radical American Partisanship: Mapping Violent Hostility, Its Causes, and the Consequences for Democracy (University of Chicago Press, 2022).
3. E. J. Finkel et al., Political sectarianism in America. Science 370, 533–536 (2020).
4. S. Iyengar, Y. Lelkes, M. Levendusky, N. Malhotra, S. J. Westwood, The origins and consequences of affective polarization in the United States. Annu. Rev. Polit. Sci. 22, 129–146 (2019).
5. J. N. Druckman, S. Klar, Y. Krupnikov, M. Levendusky, J. B. Ryan, Partisan Hostility and American Democracy: Explaining Political Divides (University of Chicago Press, Forthcoming).
6. R. Hartman et al., Interventions to reduce partisan animosity. Nat. Hum. Behav. 6, 1194–1205 (2022).
7. A. J. London, O. O. Omotade, M. M. Mello, G. T. Keusch, Ethics of randomized trials in a public health emergency. PLoS Negl. Trop. Dis. 12, e0006313 (2018).
8. S. DellaVigna, D. Pope, E. Vivalt, Predict science to improve science. Science 366, 428–429 (2019).
9. E. Salas, M. A. Rosen, D. DiazGranados, Expertise-based intuition and decision making in organizations. J. Manage. 36, 941–973 (2010).
10. K. A. Ericsson, R. T. Krampe, C. Tesch-Römer, The role of deliberate practice in the acquisition of expert performance. Psychol. Rev. 100, 363–406 (1993).
11. K. A. Ericsson, R. Pool, Peak: Secrets From the New Science of Expertise (Houghton Mifflin Harcourt, 2016).
12. R. M. Dawes, D. Faust, P. E. Meehl, Clinical versus actuarial judgment. Science 243, 1668–1674 (1989).
13. L. Candy, The Creative Reflective Practitioner: Research Through Making and Practice (Routledge, ed. 1, 2019).
14. K. L. Milkman et al., A megastudy of text-based nudges encouraging patients to get vaccinated at an upcoming doctor’s appointment. Proc. Natl. Acad. Sci. U.S.A. 118, e2101165118 (2021).
15. D. Viganola et al., Using prediction markets to predict the outcomes in the Defense Advanced Research Projects Agency’s next-generation social science programme. R. Soc. Open Sci. 8, e181308 (2021).
16. S. DellaVigna, D. Pope, Predicting experimental results: Who knows what? J. Political Econ. 126, 2410–2456 (2018).
17. A. Dreber et al., Using prediction markets to estimate the reproducibility of scientific research. Proc. Natl. Acad. Sci. U.S.A. 112, 15343–15347 (2015).
18. C. F. Camerer et al., Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat. Hum. Behav. 2, 637–644 (2018).
19. K. L. Milkman et al., Megastudies improve the impact of applied behavioural science. Nature 600, 478–483 (2021).
20. S. DellaVigna, E. Linos, RCTs to scale: Comprehensive evidence from two nudge units. Econometrica 90, 81–116 (2022).
21. E. Dimant et al., Politicizing mask-wearing: Predicting the success of behavioral interventions among republicans and democrats in the U.S. Sci. Rep. 12, 7575 (2022).
22. J. Yerushalmy, Statistical problems in assessing methods of medical diagnosis, with special reference to X-ray techniques. Public Health Rep. 62, 1432–1449 (1947).
23. D. G. Altman, J. M. Bland, Statistics notes: Diagnostic tests 1: Sensitivity and specificity. BMJ 308, 1552 (1994).
24. J. G. Voelkel et al., Megastudy Identifying Effective Interventions to Strengthen Americans’ Democratic Attitudes (Open Science Framework, 2023).
25. T. J. DiCiccio, B. Efron, Bootstrap confidence intervals. Stat. Sci. 11, 189–228 (1996).
26. B. Efron, R. Tibshirani, Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat. Sci. 1, 54–75 (1986).
27. T. Kameda, W. Toyokawa, R. S. Tindale, Information aggregation and collective intelligence beyond the wisdom of crowds. Nat. Rev. Psychol. 1, 345–357 (2022).
28. M. D. Ruopp, N. J. Perkins, B. W. Whitcomb, E. F. Schisterman, Youden index and optimal cut-point estimated from observations affected by a lower limit of detection. Biom. J. 50, 419–430 (2008).
29. S. Levitsky, D. Ziblatt, How Democracies Die (Crown, ed. 1, 2018).
30. J. Y. Chu et al., Forecasting for Democracy. Open Science Framework. https://osf.io/acver/. Deposited 9 August 2022.
Copyright
Copyright © 2024 the Author(s). Published by PNAS. This article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).
Submission history
Received: April 30, 2023
Accepted: November 8, 2023
Published online: January 12, 2024
Published in issue: January 16, 2024
Notes
This article is a PNAS Direct Submission.
*
The preregistration and all code and data to replicate analyses in this study are available at https://osf.io/acver/. Some preregistered analyses are beyond the scope of this paper and will appear in a future paper analyzing how individual differences correlate with forecasting accuracy.
†
Differences in specificity and sensitivity between academics and practitioners also remain robust even after participants had an opportunity to update their forecasts (SI Appendix).