Using prediction markets to estimate the reproducibility of scientific research
- a) Department of Economics, Stockholm School of Economics, SE-113 83 Stockholm, Sweden;
- b) New Zealand Institute for Advanced Study, Massey University, Auckland 0745, New Zealand;
- c) Wissenschaftskolleg zu Berlin–Institute for Advanced Study, D-14193 Berlin, Germany;
- d) Sveriges Riksbank, SE-103 37 Stockholm, Sweden;
- e) Consensus Point, Nashville, TN 37203;
- f) John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138;
- g) Department of Psychology, University of Virginia, Charlottesville, VA 22904;
- h) Center for Open Science, Charlottesville, VA 22903
Edited by Kenneth W. Wachter, University of California, Berkeley, CA, and approved October 6, 2015 (received for review August 17, 2015)

Significance
There is increasing concern about the reproducibility of scientific research. For example, the costs associated with irreproducible preclinical research alone have recently been estimated at US$28 billion a year in the United States. However, there are currently no mechanisms in place to quickly identify findings that are unlikely to replicate. We show that prediction markets are well suited to bridge this gap. Prediction markets set up to estimate the reproducibility of 44 studies published in prominent psychology journals and replicated in the Reproducibility Project: Psychology predict the outcomes of the replications well and outperform a survey of individual forecasts.
Abstract
Concerns about a lack of reproducibility of statistically significant results have recently been raised in many fields, and it has been argued that this lack comes at substantial economic costs. We here report the results from prediction markets set up to quantify the reproducibility of 44 studies published in prominent psychology journals and replicated in the Reproducibility Project: Psychology. The prediction markets predict the outcomes of the replications well and outperform a survey of market participants’ individual forecasts. This shows that prediction markets are a promising tool for assessing the reproducibility of published scientific results. The prediction markets also allow us to estimate probabilities for the hypotheses being true at different testing stages, which provides valuable information regarding the temporal dynamics of scientific discovery. We find that the hypotheses being tested in psychology typically have low prior probabilities of being true (median, 9%) and that a “statistically significant” finding needs to be confirmed in a well-powered replication to have a high probability of being true. We argue that prediction markets could be used to obtain speedy information about reproducibility at low cost and could potentially even be used to determine which studies to replicate to optimally allocate limited resources into replications.
The process of scientific discovery centers on empirical testing of research hypotheses. A standard tool to interpret results in statistical hypothesis testing is the P value. A result associated with a P value below a predefined significance level (typically 0.05) is considered “statistically significant” and interpreted as evidence in favor of a hypothesis. However, concerns about the reproducibility of statistically significant results have recently been raised in many fields including medicine (1–3), neuroscience (4), genetics (5, 6), psychology (7–11), and economics (12, 13). For example, an industrial laboratory could only reproduce 6 out of 53 key findings from “landmark” studies in preclinical oncology (2), and it has been argued that the costs associated with irreproducible preclinical research alone are about US$28 billion a year in the United States (3). The mismatch between the interpretation of statistically significant findings and a lack of reproducibility threatens to undermine the validity of statistical hypothesis testing as it is currently practiced in many research fields (14).
The problem with inference based on P values is that a P value provides only partial information about the probability of a tested hypothesis being true (14, 15). This probability also depends on the statistical power to detect a true positive effect and the prior probability that the hypothesis is true (14). Lower statistical power increases the probability that a statistically significant effect is a false positive (4, 14). Statistically significant results from small studies are therefore more likely to be false positives than statistically significant results from large studies. A lower prior probability for a hypothesis to be true similarly increases the probability that a statistically significant effect is a false positive (14). This problem is exacerbated by publication bias in favor of speculative findings and against null results (4, 16–19).
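To make this dependence concrete, the following minimal sketch (our illustration, not code from the paper) computes the probability that a statistically significant finding is a true positive from an assumed prior, power, and significance level:

```python
def ppv(prior, power, alpha=0.05):
    """Probability that a statistically significant result is a true positive,
    given the prior probability of the hypothesis and the study's power."""
    true_pos = prior * power          # hypothesis true and detected
    false_pos = (1 - prior) * alpha   # hypothesis false but "significant"
    return true_pos / (true_pos + false_pos)

# A surprising hypothesis (10% prior) tested at 80% power:
print(ppv(0.10, 0.80))  # ~0.64: over a third of such significant findings are false
# The same prior in a small, underpowered study (35% power):
print(ppv(0.10, 0.35))  # ~0.44: lower power makes a false positive more likely
```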
Apart from rigorous replication of published studies, which is often perceived as unattractive and therefore rarely done, there are no formal mechanisms to identify irreproducible findings. Thus, it is typically left to the judgment of individual researchers to assess the credibility of published results. Prediction markets are a promising tool to fill this gap, because they can aggregate private information on reproducibility, and can generate and disseminate a consensus among market participants. Although prediction markets have been argued to be a potentially important tool for assessing scientific hypotheses (20–22)—most notably in Robin Hanson’s paper “Could Gambling Save Science? Encouraging an Honest Consensus” (20)—relatively little has been done to develop potential applications (21). Meanwhile, the potential of prediction markets has been demonstrated in a number of other domains, such as sports, entertainment, and politics (23–26).
We tested the potential of using prediction markets to estimate reproducibility in conjunction with the Reproducibility Project: Psychology (RPP) (9, 10). The RPP systematically replicated studies from a sampling frame of three top journals in psychology. To investigate the performance of prediction markets in this context, a first set of prediction markets was implemented in November 2012 and included 23 replication studies scheduled to be completed in the subsequent 2 mo, and a second set of prediction markets was implemented in October 2014 and included 21 replication studies scheduled to be completed before the end of December 2014. The prediction markets were active for 2 wk on each of these occasions.
For each of the replication studies, participants could bet on whether or not the key original result would be replicated. Our criterion for a successful replication was a replication result with a P value of less than 0.05 in the same direction as the original result. In one of the studies, the original result was a negative finding, and successful replication was thus defined as obtaining a negative (i.e., statistically nonsignificant) result in the replication. Information on the original study and the setup of the replication was accessible to all participants.
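Stated as code, the settlement rule reads as follows; this is our paraphrase of the criterion above, and the function name, arguments, and null-result handling are our own framing rather than the study's:

```python
def replicated(p_value, same_direction, original_was_null=False, alpha=0.05):
    """Replication criterion as described above (our paraphrase in code).

    For the one originally negative finding, success means the replication
    is also statistically nonsignificant; direction plays no role there.
    """
    if original_was_null:
        return p_value >= alpha
    return p_value < alpha and same_direction
```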
In the prediction markets, participants traded contracts that pay $1 if the study is replicated and $0 otherwise. This type of contract allows the price to be interpreted as the predicted probability of the outcome occurring. This interpretation of the price is not without caveats (27) but has the advantage of being simple and reasonably robust (28), especially in settings where traders’ initial endowments are the same and traders’ bets are relatively small. Invitations to participate in the prediction markets were sent to the email list of the Open Science Framework, and for the second set of markets also to the email list of the RPP collaboration. Participants were not allowed to bet in those markets where they were involved in carrying out the replication. In the first set of prediction markets, 49 individuals signed up and 47 of these actively participated; in the second set, 52 individuals signed up and 45 of these actively participated. Before the markets started, participants were asked in a survey for their subjective probability of each study being replicated. Each participant was endowed with US$100 for trading.
Results
The prediction markets functioned well in an operational sense. Participation was broad, i.e., trading was not dominated by a small subset of traders or concentrated in just a few of the markets. In total, 2,496 transactions were carried out. The number of transactions per market ranged from 28 to 108 (mean, 56.7), and the number of active traders per market ranged from 18 to 40 (mean, 26.7). We did not detect any market bias regarding bets on success (“long positions”) or failure (“short positions”) to replicate the original results. In the final portfolios held at market closing time (Supporting Information), we observed approximately the same number of bets on success and failure.
The mean prediction market final price is 55% (range, 13–88%), implying that about half of the 44 studies were expected to replicate. Out of the 44 scientific studies included in the prediction markets, the replications were completed for 41 of the studies, with the remaining replications being delayed. Of the 41 completed, 16 studies (39%) replicated and 25 studies (61%) did not replicate according to the market criterion for a successful replication (Supporting Information).
We evaluate the performance of the markets in three ways: whether the market prices are informative; whether the market prices can be interpreted as probabilities of replication; and whether the prediction markets predict the replication outcomes better than a survey measure of beliefs. When interpreting a market price larger than 50% as predicting successful replication and a market price smaller than 50% as predicting failed replication, informative markets are expected to correctly predict more than 50% of the replications. We find that the prediction markets correctly predict the outcome of 71% of the replications (29 of 41 studies; Fig. 1), which is significantly higher than 50% (one-sample binomial test; P = 0.012).
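The reported binomial test can be checked in a few lines; a minimal sketch, assuming scipy is available:

```python
from scipy.stats import binomtest

# 29 of 41 completed replications were predicted correctly by the
# price-above/below-50% rule; test the rate against chance (50%).
print(binomtest(29, n=41, p=0.5).pvalue)  # ~0.012, matching the value above
```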
Fig. 1. Prediction market performance. Final market prices and survey predictions are shown for the replication of 44 publications from three top psychology journals. The prediction market predicts 29 out of 41 replications correctly, yielding better predictions than a survey carried out before the trading started. Successful replications (16 of 41 replications) are shown in black, and failed replications (25 of 41) are shown in red. Gray symbols are replications that remained unfinished (3 of 44).
Interpreting the prediction market prices as probabilities means that not all markets with a price larger (smaller) than 50% are expected to correspond to successful (failed) replications. The expected prediction rate of the markets depends on the distribution of final market prices, which in our study implies that 69% of the outcomes are expected to be predicted correctly. This is very close to the observed value of 71%. To formally test whether prediction market prices can be interpreted as probabilities of replication, we estimated a linear probability model (with robust SEs) with the outcome of the replication as a function of the prediction market price. If market prices equal replication probabilities, the coefficient of the market price variable should be equal to 1 and the constant in the regression should be equal to zero. The coefficient of the market price variable is 0.995, which is significantly different from zero (P = 0.003), but not significantly different from 1 (P = 0.987). The constant (−0.167) is not significantly different from zero (t = −1.11, P = 0.276).
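Both calibration checks can be sketched as follows. This is our illustration, not the authors' analysis code: the arrays are simulated stand-ins for the 41 price/outcome pairs (the actual data are deposited at https://osf.io/yjmht), drawn so that prices are calibrated by construction.

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-ins for the 41 final prices and replication outcomes.
rng = np.random.default_rng(0)
prices = rng.uniform(0.13, 0.88, size=41)
outcomes = rng.binomial(1, prices)  # outcomes drawn so prices are calibrated

# Expected prediction rate if prices are true probabilities: each market
# is predicted correctly with probability max(p, 1 - p).
expected_rate = np.maximum(prices, 1 - prices).mean()

# Linear probability model with robust (HC1) standard errors;
# calibration implies slope ~1 and intercept ~0.
fit = sm.OLS(outcomes, sm.add_constant(prices)).fit(cov_type="HC1")
print(expected_rate, fit.params)
```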
The prediction market can also be compared with the pretrading survey of participants’ beliefs about the probability of replication. A simple average of the survey correctly predicts 58% of outcomes (23 of 40; Fig. 1; survey data are missing for one market), which is not significantly different from 50% (one-sample binomial test; P = 0.429). A weighted average, using self-reported expertise as weights, correctly predicts 50% (20 of 40) of outcomes, which is not significantly different from 50% (one-sample binomial test; P = 1.00). The absolute prediction error is significantly lower for the prediction market than for both the pretrading survey (paired t test, n = 40, t = −2.558, P = 0.015) and the weighted survey (paired t test, n = 40, t = −2.727, P = 0.010; see Supporting Information for a more detailed comparison of the prediction market and survey responses). The prediction market thus outperforms the survey measure of beliefs.
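The paired comparison of absolute prediction errors follows the same pattern; a toy illustration with made-up numbers (the paper uses the n = 40 studies with both measures):

```python
import numpy as np
from scipy.stats import ttest_rel

# Toy per-study absolute errors |prediction - outcome| for market and survey.
market_err = np.array([0.30, 0.20, 0.40, 0.15])
survey_err = np.array([0.45, 0.40, 0.45, 0.35])
print(ttest_rel(market_err, survey_err))  # a negative t favors the market
```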
The above results suggest that the prediction markets generate good estimates of the probability that a published result will be replicated. Note that the probability of successful replication is not the same thing as the probability of a tested hypothesis being true. The probability of a tested hypothesis being true, also referred to as the positive predictive value or PPV (4), can however be estimated from the market price (Fig. 2). Using information about the power and significance levels of the original study and the replications (see Supporting Information for details), it can be estimated for three stages of the testing process: the prior probability (p0) before observing the outcome of the initial study; the probability after observing the result of the initially published study (p1); and the probability after observing the outcome of the replication (p2). A summary of the results of these estimations is shown in Fig. 3; a more detailed breakdown is given in Supporting Information.
Fig. 2. Relationship between market price and prior and posterior probabilities p0, p1, and p2 of the hypothesis under investigation. Bayesian inference (green arrows) assigns an initial (prior) probability p0 to a hypothesis, indicating its plausibility in the absence of a direct test. Results from an initial study allow this prior probability to be updated to posterior p1, which in turn determines the chances that the initial result will hold up in a replication, and thus the market price in the prediction market. Once the replication has been performed, the result can be used to generate posterior p2. Observing the market price, and using the statistical characteristics of the initial study and the replication, we can thus reconstruct probabilities p1, p2, and p0. Detailed calculations are presented in Supporting Information.
Fig. 3. Probability of a hypothesis being true at three different stages of testing: before the initial study (p0), after the initial study but before the replication (p1), and after replication (p2). Whiskers represent the range, boxes the first to third quartiles, and thick lines the medians. Initially, priors of the tested hypotheses are relatively low, with a median of 8.8% (range, 0.7–66%). A positive result in an initial publication then moves the prior into a broad range of intermediate levels, with a median of 56% (range, 10–97%). If replicated successfully, the probability moves further up, with a median of 98% (range, 93.0–99.2%). If the replication fails, the probability moves back to a range close to the initial prior, with a median of 6.3% (range, 0.01–80%).
Our analysis reveals priors (p0) for the 44 studies ranging from 0.7% to 66% with a median (mean) of 8.8% (13%). This relatively low average prior may reflect that top psychology journals focus on publishing surprising findings, i.e., positive findings on relatively unlikely hypotheses. The probability that the research hypothesis is true after observing the positive finding in the first study (p1) ranges from 10% to 97% with a median (mean) of 56% (57%) for the 44 studies. This estimate implies that about 43% of statistically significant research findings published in these top psychology journals can be expected to be false positives.
For the 41 studies replicated so far, we can also estimate the posterior probability that the research finding is true contingent on observing the result of the replication (p2). This probability ranges between 93.0% and 99.2% with a median (mean) of 98% (97%) for the 16 studies whose result was replicated, and between 0.1% and 80% with a median (mean) of 6.3% (15%) for the 25 studies that were not replicated.
These results show that prediction markets can give valuable insights into the dynamics of information accumulation in a research field. Eliciting priors in this manner allows us to evaluate whether hypotheses are tested appropriately in a given research field. A common, but incorrect, interpretation of a published result with a P < 0.05 is that it implies a 95% probability of the research hypothesis being true. Interestingly, our findings imply that to achieve such a high probability of the research hypothesis being true, a “statistically significant” positive finding needs to be confirmed in a well-powered replication. This illustrates the importance of replicating positive research findings before they are given high credibility. It remains to be studied how psychology compares with other fields in this respect.
Discussion
The RPP recently found that more than one-half of 100 original findings published in top psychology journals failed to replicate (10). Our prediction market results suggest that this relatively low rate of reproducibility should not come as a surprise to the profession, as it is consistent with the beliefs held by psychologists participating in our prediction market.
As can be seen in Fig. 1, original findings for which the market prices indicated a low probability of replication were indeed typically not replicated. However, some findings failed to replicate despite high market prices indicating that participants had fewer doubts about them. An interesting hypothesis is that in some of these cases it was the replication itself, rather than the original finding, that failed. It would thus be particularly interesting to carry out additional replications of these studies.
Although our results suggest that prediction markets can be used to obtain accurate forecasts regarding the outcome of replications, one limitation of the approach used in this study lies in the need to actually run replications so that there is an outcome to trade on. Some studies, such as large field experiments, may be very costly to replicate (29). One way to mitigate this would be to run prediction markets on a number of studies, from which a subset is randomly selected for replication after the market closes (20). Such an approach could provide quick information about reproducibility at low cost. Moreover, prediction markets could potentially be used as so-called “decision markets” (30, 31) to prioritize replication of some studies, such as those with the lowest likelihood of replication. This would generate salient and informative signals about reproducibility and help optimize the allocation of resources for replication.
Materials and Methods
The RPP by the Open Science Collaboration (10) sampled papers from the 2008 issues of three top psychology journals: Journal of Personality and Social Psychology, Psychological Science, and Journal of Experimental Psychology: Learning, Memory, and Cognition. When a paper contained several studies, typically the last study was selected for replication.
We chose 23 studies for the first set of prediction markets and 21 studies for the second set, where the chosen studies were scheduled to be replicated within 2 mo after the completion of the prediction market. For each replication, the hypothesis of the original study was summarized by one of the authors of this paper and submitted to the replication team for comments and final approval. In 1 of the 23 studies in the first prediction market, the chosen experiment was changed by the replicating researcher after the survey had been performed but before the trading started (SI ref. 34 in Supporting Information); we thus lack survey data for this study. In 1 of the 21 studies in the second prediction market, a different experiment was later substituted for replication (SI ref. 59 in Supporting Information), but for completeness we still include the prediction market and survey data for this study (although there are no current plans to replicate it).
Participants in the prediction market were researchers in various fields of psychology, ranging from graduate students to professors. Fourteen participants were directly involved in one or several replication studies (15 studies in total) and were not allowed to make trades on the outcomes of these specific studies. Sixteen participants took part in both sets of prediction markets. Before the prediction market, the participants filled out a survey. For each study, participants were asked two questions. One was meant to capture their beliefs about reproducibility: “How likely do you think it is that this hypothesis will be replicated (on a scale from 0% to 100%)?” Participants were also asked about their expertise in the area: “How well do you know this topic? (not at all, slightly, moderately, very well, extremely well).” We transformed this latter measure into a 1–5 scale, which was used to construct the weighted average belief measure from the survey.
Trading in the prediction market took place through a web-based market interface in collaboration with Consensus Point (www.consensuspoint.com/), a leading provider of prediction market research technology. Before starting to trade, participants received information about the trading procedure as well as logins. Trading accounts were initially endowed with $100 (expressed as 10,000 “points”). These points were used to make predictions of successful replication. Predictions were made by buying and selling stocks on the hypotheses on an interface that highlighted the forecasting functionality of the market (Supporting Information). In the prediction market, participants traded contracts that pay $1 (i.e., 100 points) if the study is replicated and $0 otherwise. This type of contract allows the price to be interpreted as the predicted probability of the outcome occurring. For each hypothesis, participants could see the current market prediction for the probability of successful replication.
The trading platform used an automated market maker implementing a logarithmic market scoring rule (32). This algorithm offers a buying price and a selling price at all times, ensuring that there is always a counterpart with which to trade. More specifically, the algorithm uses the net sales (s) the market maker has done so far in a market to determine the price for an (infinitesimally small) trade as P = exp(s/b)/(exp(s/b) + 1). To buy stocks, participants chose YES on the trading interface and entered how many points they would like to invest. For each additional point invested in a YES position, the price (and the predicted probability of successful replication) increased. To sell stocks, participants chose NO on the trading interface and entered how many points they would like to invest. For each additional point invested in a NO position, the price decreased. Participants could also buy (sell) shares by increasing (decreasing) an existing YES position, or decreasing (increasing) an existing NO position. The market maker ensures that the value of a YES share is $1 minus the value of a NO share. The parameter b determines the liquidity and the maximal subsidies provided by the market maker and controls how strongly the market price is affected by a trade. We set the liquidity parameter to b = 100 (points). This means that, by investing 1,000 points (i.e., 1/10 of the initial endowment), traders can move the price of a single market from 50% to about 55%; investing the entire initial endowment into a single market moves the price from 50% to 82%.
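Under our reading of the units (b expressed in dollar terms, with one contract paying $1, i.e., 100 points), the standard LMSR cost function reproduces the two price moves quoted above; this sketch is our illustration, not the platform's code:

```python
import numpy as np

B = 100.0  # liquidity parameter, in dollar units where one contract pays $1
           # (an assumption on our part; the paper quotes b in points)

def price(s):
    """Instantaneous YES price given net YES sales s (in shares)."""
    return np.exp(s / B) / (np.exp(s / B) + 1)

def cost(s):
    """LMSR cost function; moving net sales from s0 to s1 costs cost(s1) - cost(s0)."""
    return B * np.log(1 + np.exp(s / B))

def shares_after_spending(budget, s0=0.0):
    """Net sales level reachable from s0 by spending `budget` dollars on YES."""
    return B * np.log(np.exp((cost(s0) + budget) / B) - 1)

# Starting from a 50% market (s = 0):
print(price(shares_after_spending(10)))   # ~0.55 after investing $10 (1,000 points)
print(price(shares_after_spending(100)))  # ~0.82 after investing the full $100 endowment
```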
For the first set of prediction markets, investments were settled 5 mo after the market had closed, according to the actual results of the replications where the outcome was available and to market value where the replications were not yet finished. At the time the market closed, only eight results were known by the replicating researchers, and all replicating researchers had agreed not to share the results with anyone until after the market closed. For the second set of prediction markets, investments were similarly settled 4.5 mo after the markets had closed. At the time the second market closed, one result was known by the replicating researcher; here, too, all replicating researchers agreed not to share their results with anyone until the market had closed.
Supporting Information
Here, we provide further details on the market performance; the comparison of the prediction market and survey responses; the reconstruction of the prior and posterior probabilities (p0, p1, and p2) from the market price; the association between the market price and the statistical power; and results and data for the individual studies.
Market Performance
The overall trading volume in the first set of prediction markets ranged from 169 to 2,564 (mean, 921; median, 797) in terms of traded shares, and from 9,671 to 146,472 (mean, 51,486; median, 46,415) in terms of cash. In the second set of markets, volumes ranged from 365 to 1,155 (mean, 555; median, 506) in terms of traded shares, and from 18,721 to 67,033 (mean, 30,147; median, 27,987) in terms of cash.
We distinguish between four types of transactions: increasing a long position, reducing a long position, increasing a short position, and reducing a short position. In the first set of markets, 618 transactions were carried out to increase a long position (average volume, 12.4; median volume, 6.8), 157 to reduce a long position (average volume, 22.4; median volume, 9.8), 549 to increase a short position (average volume, 12.8; median volume, 8.8), and 156 to reduce a short position (average volume, 18.9; median volume, 9.9). In the second set of markets, 408 transactions were carried out to increase a long position (average volume, 13.8; median volume, 10.4), 77 to reduce a long position (average volume, 11.3; median volume, 5.5), 454 to increase a short position (average volume, 9.8; median volume, 6.4), and 77 to reduce a short position (average volume, 8.8; median volume, 5.9). Thus, transactions to reduce existing positions were larger in volume than transactions to enter new positions or increase existing ones; and trading into long positions and short positions showed similar patterns.
Comparison of the Prediction Market and Survey Responses
There is considerable overlap between the prediction market and survey responses (Fig. 1 and Fig. S3), suggesting that the information given in the survey is also reflected in the market. The market generated predictions over a wider range of 13–88% compared with the survey range of 32–74%; i.e., the prediction market was more informative than the survey, in the narrow sense that the survey generated predictions closer to a diffuse (noninformative) prior. This constitutes additional support for the interpretation that the prediction market generated better predictions than the survey. We also observe that the diversity of beliefs is positively correlated in the survey and the market (Fig. S3). The diversity of beliefs is also higher when the prediction market predicts a low probability that the original result will be replicated. In other words, there is more disagreement about the outcomes of replications that are not likely to be replicated, which could indicate that market participants hold more private information about false positives.
Fig. S1. Final positions per participant and market. The left panel shows the portfolios in the first set of prediction markets, and the right panel shows the portfolios for the second set of prediction markets. Long positions (bets on success) are shown in green, and short positions (bets on failure) are shown in red. This figure indicates that, in both sets of prediction markets, the participants had broad portfolios with positions in several markets. Similarly, each market attracted a number of traders. Often, traders have diverging views: in each market, there is at least one trader holding a long position, and one trader holding a short position. The final portfolios show that there are a few “bears” (predominantly betting on failure) who invested in short positions only (6 of 47 traders for the first set of markets; 4 of 45 traders for the second set of markets), and “bulls” (predominantly betting on success) who invested in long positions only (3 of 47 traders for the first set of markets; 6 of 45 traders for the second set of markets). However, most of the participants fall into a wide spectrum between these two extremes.
Fig. S2. (A) Trading interface introductory page. When entering the prediction market, participants were presented with all hypotheses along with their current price (“score”) and recent change in price. By clicking Adjust, the participants received more information on the study and the possibility to trade by buying and selling (a). For each replication, participants were presented with the hypothesis, the authors, the title, and the journal, and could buy stocks by choosing Yes or sell stocks by choosing No (b), and enter how many points they would like to invest in the specific hypothesis (c). (B) Position summary presented participants with an overview of their investments: which hypotheses, number of shares held, and current market value.
Fig. S3. Comparison of survey responses and behavior in the two prediction markets. (A) Correlation between market price and average survey response. Market prices and average survey responses are positively correlated, suggesting that information given in the surveys was also revealed in the market (Pearson correlation coefficient of 0.78, P < 0.001, n = 43). However, market prices are more “extreme” than survey responses, which translates into a lower prediction error. Studies that were replicated successfully are shown in black, and studies that failed to replicate are shown in red. Studies that remained unfinished are shown in gray. (B) Correlation between volume of traded shares and diversity in survey responses (i.e., SD of responses; Pearson correlation coefficient of 0.51, P < 0.001, n = 43). The positive correlation between volume in the market and diversity in the surveys suggests that there was more trading for studies where participants had more diverging views on the replicability of a study. In other words, when there is larger diversity in premarket views, more trades are required to reach a “consensus” in the market pricing. (C) Negative correlation between market price and diversity in survey responses (Pearson correlation coefficient of −0.53, P < 0.001, n = 43). The diversity of survey responses is higher when the prediction market predicts a low probability that the original result will be replicated. This suggests that there is more disagreement around replications that are overall expected to fail rather than replications expected to succeed.
The point-biserial correlation coefficient between the market price and the outcome of the replication is 0.42 and significant (P = 0.006, n = 41), whereas the survey and weighted survey measures are not significantly correlated with the outcome of the replication [the point-biserial correlation coefficient between the survey and the outcome of the replication is 0.27 (P = 0.096, n = 40), and the point-biserial correlation coefficient between the weighted survey and the outcome of the replication is 0.26 (P = 0.112, n = 40)].
Reconstruction of the Prior and Posterior Probabilities p0, p1, and p2 from the Market Price pM
Prior and posterior probabilities associated with the hypothesis are denoted by p0, p1, and p2. Probability p0 is the prior at the time of the original study, p1 is the prior at the time of the replication, and p2 is the posterior after the replication. Probabilities α0 and α1 are the false-positive probabilities (significance levels) and β0 and β1 the statistical power of the original study and the replication, respectively. Probability pE denotes the probability of observing positive evidence in the replication, and pM is the final market price.
From Market Price to p1 (Eq. 1 in Fig. 2).
When the original study reports a positive outcome, successful replication means a positive outcome in the replication. Such a positive outcome can be due to either a true or a false positive. The probability pE of a positive outcome is thus given by pE = p1β1 + (1 − p1)α1. Assuming that the market price pM reflects probability pE, probability p1 can thus be reconstructed as follows:
p1 = (pM − α1)/(β1 − α1). [S1a]
From p1 to p2 (Eq. 2 in Fig. 2).
Once the outcome of the replication is known, it can be used to calculate p2 from p1. In case of a positive outcome, p2 is given by p2 = p1β1/pE. When the original finding is positive, Eq. S1a can be used to substitute p1, and pM can be assumed to reflect pE, so p2 can be calculated as follows:
p2 = β1(pM − α1)/[pM(β1 − α1)]. [S2a]
From p1 to p0 (Eq. 3 in Fig. 2).
Probability p1 can also be used to reconstruct the original prior, p0. When the original result is positive, Bayes’ rule gives p1 = p0β0/[p0β0 + (1 − p0)α0], and solving for the original prior yields the following:
p0 = p1α0/[p1α0 + (1 − p1)β0]. [S3a]
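Putting Eqs. S1a–S3a together, a small helper (ours; the default power values are illustrative only, not the study-specific values used in the paper) recovers all three probabilities from a final market price:

```python
def reconstruct(p_m, a0=0.05, b0=0.80, a1=0.05, b1=0.90):
    """Recover p0, p1, and p2 (for a successful replication) from the final
    market price p_m via Eqs. S1a-S3a. a0/a1 are significance levels, b0/b1
    statistical power; defaults here are illustrative assumptions."""
    p1 = (p_m - a1) / (b1 - a1)                  # Eq. S1a
    p0 = p1 * a0 / (p1 * a0 + (1 - p1) * b0)     # Eq. S3a
    p2 = b1 * (p_m - a1) / (p_m * (b1 - a1))     # Eq. S2a
    return p0, p1, p2

# The mean final market price of 55% with these powers:
print(reconstruct(0.55))  # roughly (0.08, 0.59, 0.96)
```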
The Association Between the Market Price and the Statistical Power
Based on the section above, one would expect the market price to be positively associated with the statistical power of the original study and the statistical power of the replication. We tested these associations in the data (excluding the study that replicated a null result in the original study). The Pearson correlation coefficient between the market price and the power of the original study is 0.26 (P = 0.086, n = 43). The Pearson correlation coefficient between the market price and the power of the replication is 0.35 (P = 0.020, n = 43). In an ordinary least squares regression (with robust SEs) of the prediction market price as a function of the statistical power of the original study and the power of the replication, the R-squared is 14.7% and the regression is significant (F = 6.56, P = 0.003; both coefficients have the expected signs, but only the coefficient for the power of the replication is significant; P = 0.027 for the power of the replication, and P = 0.321 for the power of the original study; n = 43).
A limitation of these analyses is that there is relatively little variation in the replication power, which was constrained to be at least 80% in all replications. In addition, the power of the original studies was estimated ex post from their P values; ideally, ex ante power calculations from the original studies would have been used, but these were not available.
Results and Data for the Individual Studies
In Table S1 (the first set of prediction markets) and Table S2 (the second set of prediction markets), we present the results of p0, p1, and p2 for each of the 44 studies included in the prediction market (p2 could only be estimated for the 41 studies in which the replication has been carried out), along with the data used in these estimations (the market price, the statistical power of the original study, and the statistical power of the replication). In Table S3 (the first set of prediction markets) and Table S4 (the second set of prediction markets), we report the hypothesis replicated in each study; and in Table S5 (the first set of prediction markets) and Table S6 (the second set of prediction markets), we provide additional data about the prediction markets. The significance levels (the false-positive probabilities α0 and α1) are set to 5% in all estimations, as a significance level of 5% was used in both the original studies and the replications. The results of the prediction market and the survey are also shown in the tables. For the replication of the originally negative result, we show 1 − p0, 1 − p1, and 1 − p2 in Table S1 and in Fig. 3, because the working hypothesis in the original study was a negative outcome.
Table S1. Individual results for the 23 replication studies in the first set of prediction markets
Table S2. Individual results for the 20 replication studies in the second set of prediction markets
Table S3. Hypotheses for the 23 replication studies in the first set of prediction markets
Table S4. Hypotheses for the 21 replication studies in the second set of prediction markets
Table S5. Additional market data for the 23 replication studies in the first set of prediction markets
Table S6. Additional market data for the 21 replication studies in the second set of prediction markets
For the statistical power of each finished replication study, we use the power stated in the replicating authors’ replication reports. This information is available on the RPP project page at the Open Science Framework, https://osf.io/ezcuj/. For the replications that have not yet been carried out, we use the planned power of the replication, also taken from the RPP Open Science Framework project page (this information was available to the prediction market participants at the same location). The statistical power of the original studies was not reported in the published papers. We therefore estimated it post hoc from the P values of the published studies and the standard power formula (i.e., the power estimate is essentially a rescaled P value). This power estimate can be interpreted as the power to detect the observed effect size of the original study at the 5% level with the same sample size as in the original study.
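One standard formulation of this post hoc rescaling, assuming a two-sided z test and treating the observed effect as the true effect, is sketched below; this is our illustration of the described approach, not the authors' exact code:

```python
from scipy.stats import norm

def post_hoc_power(p_value, alpha=0.05):
    """Post hoc power for a two-sided z test, treating the observed effect
    as the true effect: essentially a rescaled P value, as described above."""
    z_obs = norm.ppf(1 - p_value / 2)   # z score implied by the observed P value
    z_crit = norm.ppf(1 - alpha / 2)    # critical z at the 5% level
    return (1 - norm.cdf(z_crit - z_obs)) + norm.cdf(-z_crit - z_obs)

print(post_hoc_power(0.05))  # 0.50 by construction at the significance threshold
print(post_hoc_power(0.01))  # ~0.73
```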
The prediction markets predicted 87% (20 of 23) of the replications correctly in the first set of prediction markets and 50% (9 of 18) of the replications correctly in the second set of prediction markets. These point estimates differ substantially and the prediction rates are significantly different between the two sets of prediction markets (P = 0.016; Fisher’s exact test). If the prediction market prices are correct estimates of the probability of replication for each individual replication, the expected prediction rate is 69% in the first set of prediction markets and 68% in the second set of prediction markets.
The self-reported expertise about the topics of the studies was significantly lower in the second set of prediction markets than in the first (1.71 vs. 1.91; independent-samples t test, n = 92, t = 2.146, P = 0.035). It is possible that this lower self-reported expertise contributed to the weaker performance of the second set of prediction markets. However, the different prediction rates in the two sets of markets may also be due to random variation, especially as the overall prediction rate of 71% across both sets is close to the expected prediction rate of 69% based on the distribution of market prices.
Acknowledgments
We thank Agneta Berge for research assistance; and Juergen Huber, Willemien Kets, and Pranjal Mehta for comments on a previous version of the manuscript. We thank the Jan Wallander and Tom Hedelius Foundation (P2012-0002:1, P2013-0156:1, and P2015-0001:1), the Knut and Alice Wallenberg Foundation [Wallenberg Academy Fellows Grant (to A.D.)], the Swedish Foundation for Humanities and Social Sciences (NHS 14-1719:1), and the National Science Foundation (Grant CCF-0953516) for financial support.
Footnotes
- 1) A.D. and T.P. contributed equally to this work.
- 2) To whom correspondence should be addressed. Email: anna.dreber@hhs.se.
Author contributions: A.D., T.P., J.A., B.A.N., and M.J. designed research; A.D., T.P., J.A., S.I., B.W., Y.C., B.A.N., and M.J. performed research; A.D., T.P., J.A., and M.J. analyzed data; A.D., T.P., J.A., and M.J. wrote the paper.
Conflict of interest statement: Consensus Point employs B.W. and provided the online market interface used in the experiment. The market interface is commercial software.
This article is a PNAS Direct Submission.
Data deposition: The data reported in this paper have been deposited in the Open Science Framework database, https://osf.io/yjmht.
See Commentary on page 15267.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1516179112/-/DCSupplemental.
Freely available online through the PNAS open access option.
References
- Simmons JP, Nelson LD, Simonsohn U
- Carpenter S
- Open Science Collaboration
- Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science 349(6251):aac4716.
- Bohannon J
- Ioannidis J, Doucouliagos CJ
- Stern JM, Simes RJ
- Miguel E, et al.
- Tziralis G, Tatsiopoulos I
- Arrow KJ, et al.
- Plott CR, Smith VL
- Berg J, Forsythe R, Nelson F, Rietz T
- Wolfers J, Zitzewitz E
- Patashnik EM, Gerber AS
- Hanson R
- Chen Y, Kash IA, Ruberry M, Shnayder V
- Hanson R
- Richeson JA, Trawalter S
- Rule NO, Ambady N
- Alter AL, Oppenheimer DM
- Estes Z, Verges M, Barsalou LW
- Nairne JS, Pandeirada JNS, Thompson SR
- Masicampo EJ, Baumeister RF
- Vul E, Nieuwenstein M, Kanwisher N
- Vohs KD, Schooler JW
- Bressan P, Stranieri D
- Lobue V, DeLoache JS
- Nurmsoo E, Bloom P
- Lau GP, Kay AC, Spencer SJ
- Halevy N, Bornstein G, Sagiv L
- Tamir M, Mitchell C, Gross JJ
- Farris C, Treat TA, Viken RJ, McFall RM