The mixed effects of online diversity training

Significance

Although diversity training is commonplace in organizations, the relative scarcity of field experiments testing its effectiveness leaves ambiguity about whether diversity training improves attitudes and behaviors toward women and racial minorities. We present the results of a large field experiment with an international organization testing whether a short online diversity training can affect attitudes and workplace behaviors. Although we find evidence of attitude change and some limited behavior change as a result of our training, our results suggest that the one-off diversity trainings that are commonplace in organizations are not panaceas for remedying bias in the workplace.


Study Population
We conducted our field experiment at a large global organization. The organization we partnered with has offices on every continent except Antarctica. The organization co-developed and deployed our training as part of a broader strategic effort around inclusion and inclusive leadership training.
We recruited as many employees as possible at the firm to participate in the study over the course of six weeks. Of the 10,983 employees who received an email with an invitation to take the training, 3,016 (27.5%) started the training and consented to participate in our experiment. Of the employees who started the training, 2,282 (75.7%) completed the hour-long training. Our entire sample of 3,016 study participants was 61.5% male and included employees located in 63 different countries (38.5% were located in the United States). In the U.S., 56.3% of participants were male, while 64.7% of participants outside of the U.S. were male. In the U.S., 59.4% of participants were white (our partner organization does not track race of employees outside of the U.S.). The median completion time of the training was 68 minutes. The length of the training was selected not because we thought there was something special about one hour of education but instead in response to various considerations at the organization we partnered with, including employee bandwidth. Additional details, including balance checks confirming the success of random assignment to conditions, are available in the section Data Validations: Validation of Random Assignment.

Summary of Attitudes Towards Women in the Absence of Intervention
In our pre-analysis plan, we pre-registered that we would examine our data by two demographic splits (see https://www.socialscienceregistry.org/trials/2200). Specifically, we pre-registered that we would analyze our participants by gender (men vs. women) and by location (in the U.S. vs. outside the U.S.). Past surveys have shown that women generally hold more supportive attitudes towards women in the workplace and towards feminist issues than men do (Davis & Robinson, 1991; Inglehart & Norris, 2003), so we expected that women's attitudes would be particularly supportive of women relative to men's attitudes. Similarly, past surveys have shown that attitudes in the United States are more progressive towards gender equality than attitudes internationally (Brandt, 2011), though there is, of course, considerable heterogeneity within the U.S. and across and within other countries. We therefore expected the attitudes of employees in the U.S. to be relatively progressive on gender, and somewhat more so than those of employees outside the U.S.
We examined the attitudinal support for women among participants in our control condition in order to provide a measure of the attitudes held by different subgroups in the absence of a diversity training intervention. We used the attitudinal support for women scale to verify these predicted differences because it is a minor adaptation of a scale that has been previously validated extensively in the literature (Swim, Aikin, Hall, & Hunter, 1995).
As anticipated, we found that among study participants who were not exposed to our treatment intervention (that is, participants in our control group), women exhibited higher average standardized levels of attitudinal support for women than men (M women = 0.269, SD women = 0.92; M men = -0.346, SD men = 1.12; t(781) = 8.03, p < 0.0001). In addition, we found that employees located in the U.S. exhibited higher average standardized levels of attitudinal support for women than employees outside the U.S. (M U.S. = 0.288, SD U.S. = 1.03; M international = -0.358, SD international = 1.05; t(781) = 8.47, p < 0.0001).
In exploratory analyses, we examined the intersection of these two demographic splits.
We found that women in the U.S. exhibited the highest average levels of attitudinal support for women (see Figure S2), men in the U.S. and women outside the U.S. exhibited slightly less attitudinal support for women (and these two groups did not differ significantly from one another), and finally, men outside the U.S. who did not experience our diversity training exhibited somewhat lower average levels of attitudinal support for women than these other groups (see Figure S2).
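The pooled two-sample t-tests reported throughout this section can be sketched as follows. The data below are simulated stand-ins for the proprietary employee responses; the subgroup sizes, means, and SDs are loosely matched to the women-vs.-men control-group comparison above (sizes chosen so the degrees of freedom match the reported t(781)).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated standardized attitude scores for two control-group subgroups;
# sizes, means, and SDs loosely match the comparison reported above.
women = rng.normal(loc=0.27, scale=0.92, size=300)
men = rng.normal(loc=-0.35, scale=1.12, size=483)

# Pooled-variance two-sample t-test, as in the reported comparisons
t_stat, p_value = stats.ttest_ind(women, men, equal_var=True)
```

With the simulated subgroups above, the resulting t-statistic lands in the same general range as the reported comparison, though the exact value depends on the random draw.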

Validation of Random Assignment
We found no evidence of differential attrition between our treatment and control conditions. The training completion rate in the control condition was 75.1%, while the completion rate in the treatment condition was 75.9% (z = 0.467; p = 0.640). Our sample was also balanced in terms of demographics (see Table S10), indicating that randomization was successful. Employees rated the control training as more valuable (M = 5.27; SD = 1.33) than the treatment training (M = 4.93; SD = 1.39; t(2278) = 5.65, p < 0.0001) and spent more time on the control training than the treatment training (median completion time for control = 70.7 minutes; median completion time for treatment = 65.5 minutes). This might suggest that, if anything, demand effects would bias employees towards greater responsiveness to follow-up measures in the control condition rather than the treatment condition.
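The differential-attrition check above is a two-proportion z-test on completion rates. A minimal sketch follows; the completion counts are hypothetical (roughly 1,508 invitees per arm at the reported 75.9% and 75.1% rates), not the actual cell counts.

```python
import numpy as np
from scipy import stats

def two_prop_ztest(x1, n1, x2, n2):
    """Two-sided pooled z-test for equality of two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * stats.norm.sf(abs(z))

# Hypothetical completion counts consistent with the reported rates
# (about 75.9% of ~1,508 treated vs. 75.1% of ~1,508 controls)
z, p = two_prop_ztest(1145, 1508, 1133, 1508)
```

A z-statistic this small (well under 1) corresponds to a p-value far above conventional significance thresholds, i.e., no detectable difference in attrition.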

Ruling Out Response Biases for Behavioral Measures
In addition to collecting measures of actual workplace behaviors in programs that were in no ostensible way connected to our diversity training intervention, we collected two follow-up measures that were explicitly connected to the inclusive leadership training, as described in the Procedures section. These measures were delivered by our research team and were labeled as follow-ups to the training, allowing us to test whether the response rates differed between treatment and control (note that this concern is also mitigated by the fact that we use an intent-to-treat framework for analyzing our measures of real workplace behaviors).
First, we measured what proportion of employees in each condition clicked through to take a voluntary follow-up survey sent by the research team associated with the training. We found no differences in willingness to take this follow-up survey between conditions (M treatment = 15.4%; M control = 15.7%; z = 0.278, p = 0.781). Second, we measured the average number of engagements with text messages per employee across conditions where engagements were defined as any reply to a text or click-through to a linked article or video. We found no differences in engagement with text messages from the research team across conditions (M treatment = 0.507 engagements, SD treatment = 1.16 engagements; M control = 0.549 engagements, SD control = 1.25 engagements; t(3,014) = 0.891, p = 0.373).
The fact that we found no differences in response rates to these two follow-ups explicitly connected to our intervention and that attrition was also nearly identical in our treatment and control groups suggests that our experimental conditions were successfully balanced in the degree of engagement they produced.

Analysis Strategy
We compared participants across conditions on each outcome measure using t-tests and ordinary least squares (OLS) regressions, following our pre-analysis plan. Our pre-analysis plan was uploaded to the AEA RCT Registry (https://www.socialscienceregistry.org/trials/2200) after recruitment for the study started but before any data were received from our field partner. First, we conducted pairwise t-tests between treatment and control for each of our dependent variables about attitudes and behaviors pertaining to women (see Tables S11, S12). Next, we ran regressions to see the effects of the treatment condition on these dependent variables in models that include controls for each subgroup (see Tables S13, S14). Finally, we present results from regressions that include interaction terms to see how the intervention varies in effectiveness between pre-registered subgroups (i.e., men and women; participants based in the United States and those based outside of the United States; see Tables S15-S17). Our pre-analysis plan did not specify the exact functional form of our regression models because we did not know in advance what demographic variables our field partner would be able to provide us, so we use all available participant-invariant characteristics as controls (namely: office location, job category, race, and gender). Specifically, our standard ordinary least squares regression with controls predicts each outcome using all interactions between (a) an indicator for being in the treatment condition, (b) an indicator for being male, and (c) an indicator for being located in the U.S., as well as fixed effects for race, exact office location, and job category. Following our pre-analysis plan, for outcome measures that require nominating colleagues, we also cluster standard errors by office to account for the fact that employees are likely to nominate colleagues who work in the same office as them.
For ease of interpretability, all attitudinal outcome measures have been standardized.
Results of our intervention's effects on attitudes and behaviors pertaining to racial minorities are available in Tables S18-S20. For these results, we focus our attention on employees in the U.S., given that our field partner only collects data on the race of its employees in the U.S. We also break down our results pertaining to racial minorities by white employees and racial minority employees in the U.S., just as we break down results involving attitudes and behaviors pertaining to women between men and women inside and outside of the U.S.
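The regression specification described above (a fully interacted treatment x male x U.S. design with cluster-robust standard errors) can be sketched in a few lines. The data here are simulated, and the variance estimator shown is the simple CR0 cluster-robust "sandwich" (statistical packages typically also apply small-sample corrections), so this is an illustrative sketch rather than the exact estimator used.

```python
import numpy as np

def ols_cluster(y, X, clusters):
    """OLS point estimates with CR0 cluster-robust standard errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(clusters):
        mask = clusters == g
        score_g = X[mask].T @ resid[mask]   # per-cluster score vector
        meat += np.outer(score_g, score_g)
    se = np.sqrt(np.diag(bread @ meat @ bread))
    return beta, se

# Simulated data: outcome ~ treatment x male x U.S., with office "clusters"
rng = np.random.default_rng(1)
n = 2000
treat = rng.integers(0, 2, n)
male = rng.integers(0, 2, n)
us = rng.integers(0, 2, n)
office = rng.integers(0, 40, n)                 # 40 hypothetical offices
y = 0.15 * treat - 0.3 * treat * us + rng.normal(size=n)
X = np.column_stack([np.ones(n), treat, male, us,
                     treat * male, treat * us, male * us,
                     treat * male * us])
beta, se = ols_cluster(y, X, office)
```

In a real analysis the fixed effects for race, office location, and job category would enter as additional dummy columns of X; they are omitted here to keep the design matrix small.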

Attitudes Pertaining to Women
Note that whenever we report on attitudinal variables, we standardize the variable to report its level in z-scores.
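Concretely, this standardization is just a z-score transform; a minimal sketch with toy scale scores:

```python
import numpy as np

def standardize(x):
    """Rescale raw scale scores to z-scores: mean 0, SD 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

z = standardize([3.0, 4.0, 5.0, 4.0, 6.0])
```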
Attitudinal Support for Women. Employees in the treatment condition exhibited significantly higher levels of attitudinal support for women than employees in the control condition post-intervention (M treatment = 0.055, SD treatment = 0.95; M control = -0.108, SD control = 1.09; t(2325) = 3.736, p = 0.0002, d = 0.164; see Table S11). This difference suggests that our treatment had a significant, positive effect on employees' attitudes towards women. However, the overall positive treatment effect was largely driven by differences among employees outside the U.S. (see Table S11).
We ran our standard ordinary least squares (OLS) regression predicting standardized attitudinal support for women where we included fixed effects for an employee's office location, job category, race, and all interactions between (a) the treatment condition, (b) an indicator for being male, and (c) an indicator for being located in the U.S. (see Table S15, Model 2). We found a significant main effect of the treatment (b = 0.149, p < 0.001), suggesting that our diversity training intervention had a significant positive effect on attitudes. Further, we observed a significant interaction between the treatment and an indicator for being located in the U.S. (b = -0.333, p = 0.012), indicating that the treatment had a larger positive effect on the attitudes of employees outside the U.S. than employees in the U.S.
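The effect sizes (d) reported alongside these t-tests are standardized mean differences (Cohen's d); a minimal sketch with toy data:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: mean difference scaled by the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

d = cohens_d([1, 2, 3, 4], [0, 1, 2, 3])  # about 0.775
```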

Gender Bias Acknowledgment.
Employees in the treatment condition were more willing to acknowledge that their own gender biases matched those of the general population compared to employees in the control condition (M treatment = 0.077, SD treatment = 0.97; M control = -0.128, SD control = 1.04; t(2330) = 4.69, p < 0.0001, d = 0.205; see Table S11). The increase in acknowledgment was driven by significantly higher perceptions of employees' own gender bias and stereotyping in the treatment condition compared to the control condition (M treatment = 0.193, SD treatment = 0.96; M control = -0.381, SD control = 0.97; t(2333) = 13.60, p < 0.0001; d = 0.596).
Perceptions of others' gender bias and stereotyping were also significantly higher in the treatment condition than in the control condition (M treatment = 0.158, SD treatment = 0.92; M control = -0.312, SD control = 1.08; t(2333) = 10.98, p < 0.0001, d = 0.491). All pre-registered demographic subgroups showed significant differences between the treatment and control conditions (see Table S11).
In our standard pre-registered ordinary least squares regression with all controls, we still observed a significant main effect of the treatment condition (b = 0.217, p < 0.001; see Table S15, Model 4). We did not find any significant interactions between the treatment and the demographic subgroups of interest (employees in the U.S. and male employees).

Gender Inclusive Intentions.
Employees in the treatment condition earned significantly higher scores on the situational judgment test measuring gender-inclusive intentions than employees in the control condition (M treatment = 0.042, SD treatment = 1.04; M control = -0.084, SD control = 0.90; t(2280) = 2.85, p = 0.0044, d = 0.13; see Table S11). The overall treatment effect was largely driven by intention change among employees located outside of the U.S. (see Table S11).
We also ran our standard pre-registered ordinary least squares regression with all controls to predict scores on this measure. We again saw a significant main effect of the treatment (b = 0.147, p = 0.001; see Table S15, Model 6), but we observed no significant interactions.

Informal Mentoring of Women (~3 Weeks After Recruitment Ended).
Women in the U.S. in the treatment condition selected more women for informal mentoring than women in the U.S. in the control condition (see Table S12). This difference remains significant when we operationalize our dependent variable as a binary outcome measuring whether an employee selected any women at all (any women selected by U.S. women in treatment = 10.4%; any women selected by U.S. women in control = 3.35%; z = 2.81, p = 0.0050).
We also ran our standard pre-registered ordinary least squares regression with all controls and standard errors clustered by office location predicting the number of women selected per employee. We found a significant interaction between the treatment condition and an indicator for being located in the U.S. (b = 0.292, p = 0.003) as well as a significant three-way interaction between the treatment condition, an indicator for being male, and an indicator for being located in the U.S. (b = -0.243, p = 0.015; see Table S16, Model 2), suggesting that women in the U.S. were most likely to change their behaviors in response to the intervention. In terms of real-world significance, these estimates suggest that for every five women in the U.S. who were in the treatment condition (instead of the control), an additional woman was selected through this program (a Wald test adding together the coefficients for Treatment and Treatment x Male Employee provides a treatment effect estimate for women in the U.S. of b = 0.203, p = 0.001).
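A Wald test on a sum of coefficients, like the one used to recover the treatment effect for a specific subgroup, can be sketched as follows. The coefficient vector and covariance matrix below are hypothetical placeholders, not the fitted values from our models.

```python
import numpy as np
from scipy import stats

def wald_sum_test(beta, cov, idx):
    """Wald test that the sum of the coefficients at positions `idx` is zero."""
    r = np.zeros(len(beta))
    r[list(idx)] = 1.0
    estimate = r @ beta            # linear combination of coefficients
    variance = r @ cov @ r         # its variance from the coefficient covariance
    w = estimate**2 / variance     # chi-squared statistic with 1 df
    return estimate, stats.chi2.sf(w, df=1)

# Hypothetical coefficient estimates and covariance matrix (NOT the paper's
# fitted values) for two coefficients whose sum is the effect of interest
beta = np.array([0.30, -0.10])
cov = np.array([[0.004, -0.002],
                [-0.002, 0.004]])
estimate, p = wald_sum_test(beta, cov, [0, 1])
```

Because the two coefficients are negatively correlated here, the variance of their sum is smaller than the sum of their variances, which is exactly why the joint Wald test can be significant even when an individual coefficient is not.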
In exploratory analyses suggested by our organizational partner, we examined tenure differences between the employees who participated in our study and those they invited to meet for coffee in the informal mentoring program. While this program was designed so employees could volunteer to provide informal mentoring to others, we discovered that many participants elected to invite employees who were senior to them in the organization to meet for coffee, suggesting that many were using this program to seek out mentorship, rather than provide mentorship. To determine if an employee was seeking out mentorship or providing mentorship, we used data provided by our field partner. If an employee selected someone who was either (a) at a higher level in the organization, as determined by our field partner, or (b) at the same level in the organization but had spent more years at the organization, then we categorized that selection as "seeking mentorship," rather than "providing mentorship." If an employee selected someone who was either (a) at a lower level in the organization or (b) at the same level in the organization but had spent fewer years at the organization, then we categorized that selection as "providing mentorship." We then re-ran our standard pre-registered ordinary least squares regression with these two new outcome variables ("providing mentorship" and "seeking mentorship") and clustered standard errors by office location. Specifically, one outcome variable was the number of women invited to coffee through this mentoring program who were senior to the employee in our study and the other was the number of women invited to coffee through this mentorship program who were junior to the employee in our study. We see the same pattern of results for both measures, although the results appear driven by women in the U.S. seeking out more mentoring: women in the U.S.
aimed to provide marginally more mentorship to women in the treatment condition relative to the control condition (b = 0.0568, p = 0.067) in addition to seeking out more informal mentorship from other women (b = 0.141, p = 0.001; see Table S21). More generally, women in the U.S. in the treatment condition sought out more mentorship from more senior colleagues regardless of gender (b = 0.221, p < 0.001), suggesting that our intervention prompted women in the U.S. to take more initiative in overcoming any potential obstacles or barriers they face in the workplace.

Recognition of Women for Excellence (~6 weeks after recruitment ended).
We found no differences in the number of women recognized for excellence per consented participant across conditions (M treatment = 0.0155, SD treatment = 0.127; M control = 0.0119, SD control = 0.108; t(3014) = 0.763; p = 0.446; see Table S12). Using our standard pre-registered regression model (see Table S16), we found a marginal positive treatment effect on the number of women nominated for excellence awards by employees in our study who worked in the U.S. (b = 0.012, p = 0.075).

Willingness to Talk to a New Hire.
Participants in the treatment condition were significantly more willing to talk to a female new hire than a male new hire (p = 0.0008; see Table S12). Collapsing across female employees and using our standard pre-registered regression model to estimate treatment effects (see Table S17), we found that the intervention did have a significant effect among female employees, leading them to favor speaking with a female new hire over a male new hire (b = 0.127; p = 0.047).

Moderation of Gender Results by Attitudes in the Absence of Intervention
In exploratory analyses, we tested for additional evidence to support our proposed model of behavior change. In particular, we tested whether those whose untreated attitudes were more aligned with our intervention would change their behaviors more and their attitudes less in response to our intervention, while those whose untreated attitudes were less aligned with our intervention would change their attitudes more and their behaviors less in response to our intervention. We do not know the pre-treatment attitudes of participants because they were not measured to avoid demand and anchoring effects (Orne, 1962; Tversky & Kahneman, 1974; Zizzo, 2010). However, we can examine the attitudes of participants in the control condition in different subgroups as a proxy for pre-treatment attitudes.
We created participant subgroups by both gender and exact country of residence (a narrower category than our pre-registered U.S. vs. international classification). This allows us to analyze 86 country-gender subgroups, with an average of 34.1 participants per subgroup. (Although 63 countries are represented in our data overall, not all countries had both men and women as participants or had participants in both the treatment and control conditions, so we must exclude these country-gender pairs from our analysis.) We can then use the attitudinal support for women exhibited by untreated participants in the subgroup of interest (e.g., women in France) as a proxy for where participants in that subgroup stood along an attitude continuum pre-training, from most to least aligned with our training's message. (We use attitudinal support for women for all of these tests since it is the only attitude measure we collected that is based on a scale previously validated extensively in the literature; Swim, Aikin, Hall, & Hunter, 1995.) The standard deviation in the attitudinal support for women scores across these 86 subgroups was 0.541, and the absolute difference between the subgroups with the highest and lowest average scores was 3.25.
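The subgroup-proxy construction can be sketched as follows: average the control-group scores within each country-gender cell and assign that average to every member of the cell. The mini-dataset here is entirely hypothetical (two made-up countries, one control observation per cell).

```python
import numpy as np

# Hypothetical mini-dataset: (country, gender, treated, attitude score)
country = np.array(["FR", "FR", "FR", "FR", "JP", "JP", "JP", "JP"])
gender = np.array(["F", "F", "M", "M", "F", "F", "M", "M"])
treated = np.array([0, 1, 0, 1, 0, 1, 0, 1])
score = np.array([0.6, 0.9, -0.2, 0.1, 0.3, 0.5, -0.5, -0.1])

# Control-group mean per country-gender subgroup, assigned to every member
# of the subgroup as a proxy for its pre-training attitudes
subgroup = np.char.add(country, gender)
proxy = np.empty_like(score)
for g in np.unique(subgroup):
    mask = subgroup == g
    proxy[mask] = score[mask & (treated == 0)].mean()
```

In the actual analysis this proxy is then standardized and interacted with the treatment indicator in an OLS regression; with only one control observation per cell in this toy example, the proxy is simply that observation's score.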
The Intervention's Impact on Attitudes. We used our proxy measure of pre-training attitudinal support for women in moderation analyses. It is important to note that, on average, employees in our study had attitudes well above the midpoint of our scale. We tested whether a subgroup's pre-training attitudes predicted which subgroups' attitudes shifted the most in a series of regressions. To test this, we ran an ordinary least squares regression predicting attitudes after training with an interaction between an indicator for the treatment condition and a continuous standardized variable for our proxy of pre-training attitudes. When predicting our intervention's impact on attitudinal support for women, we found a significant interaction between the treatment and an employee's subgroup's pre-training attitudes (b = -0.273, p < 0.001; see Table S22, Model 2). Specifically, those who showed the most movement were those whose attitudes in the absence of intervention, while still supportive of women, were less so than those of other employees. In charts depicting the relationship between a subgroup's pre-training attitudes and its subsequent attitudinal shift as a result of the intervention, we found the same pattern of results (see Figure S3, Panel A). Similarly, when predicting scores on our situational judgment test, we found significant interactions in the predicted direction between the treatment and an employee's subgroup's pre-training attitudes (b = -0.121, p = 0.023, see Table S22, Model 6). While employees' perceptions of other people's gender bias showed the expected interaction, this was not true for perceptions of their own gender bias, so we did not find the expected interaction when predicting the size of the gap between an employee's perceptions of their own gender bias and their perceptions of other people's gender bias.
The Intervention's Impact on Behaviors. Using the same proxy measure for our finer-grained subgroups' pre-training attitudes, we tested whether a subgroup's pre-training attitudes predicted its degree of behavior change in response to our intervention. Specifically, we ran a series of OLS regressions (see Table S22) predicting behaviors after training with a treatment indicator variable, a continuous standardized variable for our proxy of pre-training attitudes (i.e., a demographic subgroup's average level of attitudinal support for women in the control condition), and, as our predictor of primary interest, an interaction between an indicator for the treatment condition and the continuous standardized variable for our proxy of pre-training attitudes. For the informal mentoring program and the award nomination program, we cluster standard errors by office location to account for the fact that employees are likely to select people with whom they share an office location. In the audit study, our dependent variable is the difference in the percent of employees willing to talk to a female new hire relative to a male new hire by condition. Hence, we need to interact these willingness-to-talk terms with an indicator for whether the new hire was female to determine if our treatment had the predicted effect.
We found some evidence suggesting stronger behavioral effects of the intervention among subgroups whose untreated attitudes were more aligned with our intervention's message.
Specifically, as shown in Table S22, Model 8, there was a significant interaction between receiving our intervention and an employee's subgroup's pre-training attitude alignment when predicting the number of women an employee chose to informally mentor (b = 0.112, p = 0.012).
The largest effects on behaviors occurred for those subgroups whose attitudes were most supportive of women (see Figure S3, Panel B). To put the magnitude of this effect in context, our results suggest that an additional woman received informal mentoring as a result of our intervention for every twenty employees trained in the subgroup whose attitudes were least aligned with our intervention, while an additional woman received informal mentoring as a result of our intervention for every five employees trained in the subgroup whose attitudes were most aligned with our intervention. We did not, however, find significant interactions between pretraining attitudes and our treatment condition when predicting other behavioral measures.

Moderation Robustness Checks.
In addition to our moderation analyses using country-gender subgroups, we also ran moderation analyses using finer-grained office location-gender subgroups (note that there is often more than one office location in a country). Using the same method discussed above, we used attitudinal support in the control group as a proxy for pre-training attitudes in each subgroup. We were able to create 159 office location-gender subgroups with an average of 17.9 employees per subgroup. We then repeated each of our moderation analyses. The results of these analyses are listed in Table S23 and are extremely similar to those that rely on country-gender subgroups, providing further support for our proposed model of behavior change. In particular, we found evidence suggesting that the treatment was more effective at increasing attitudinal support for women (p < 0.001; see Table S23, Model 2) and gender inclusive intentions (p = 0.035; see Table S23, Model 6) for employees whose pre-training attitudes were relatively less supportive of women. On the other hand, we found evidence that the treatment was more effective at increasing the number of women selected for informal mentoring (p = 0.037; see Table S23, Model 8) for employees whose pre-training attitudes were relatively more supportive of women.
We also used these office location-gender subgroups to examine variation within the United States to see if our proposed model of behavior change was supported within the country with the most participants in our sample. Limiting our data to participants in the United States left us with 43 office location-gender subgroups with an average of 26.3 employees each. We re-ran our moderation analyses using these U.S. office location-gender subgroups. The results of these analyses are listed in Table S24. We again found some evidence in support of our proposed model of behavior change. First, the treatment was more effective at increasing attitudinal support for women among employees whose pre-training attitudes were relatively less supportive of women (p < 0.001; see Table S24, Model 2). Second, the treatment was more effective at increasing willingness to talk to the female new hire relative to the male new hire for employees whose pre-training attitudes were relatively more supportive of women (p = 0.045; see Table S24, Model 12).

Attitudes Pertaining to Racial Minorities
The one measure of attitudes pertaining to racial minorities that we collected was a racial bias acknowledgment measure (see section Materials and Methods: Attitude Measures for a definition). Employees in the treatment condition had a significantly smaller gap in their perceptions of racial bias and stereotyping by others versus themselves than did employees in the control condition (t(2330) = 4.27, p < 0.0001, d = 0.187; see Table S18). The smaller difference in self-other perceptions was driven by significantly higher perceptions of employees' own racial bias and stereotyping in the treatment condition compared to the control condition (t(2330) = 14.04, p < 0.0001; d = 0.615). Perceptions of others' racial bias and stereotyping were also significantly higher in the treatment condition than in the control condition (t(2330) = 11.93, p < 0.0001, d = 0.522). In our standard pre-registered OLS regression with all controls, we still observed a significant main effect of the treatment condition (b = 0.193, p < 0.001). We did not find any significant interactions between the treatment and the demographic subgroups of interest (employees in the U.S. and male employees).
We also found evidence of positive spillover effects, as participants in our gender-focused intervention condition exhibited a significantly smaller gap in self-other perceptions of racial bias and stereotyping than did participants in our placebo control condition (t(1575) = 2.66, p = 0.0079; see Table S6). The smaller difference in self-other perceptions was driven by participants' significantly higher perceptions of their own racial bias and stereotyping in the gender-focused intervention condition compared to the control condition (t(1575) = 12.04, p < 0.0001). Perceptions of others' racial bias and stereotyping were also significantly higher in the gender-focused treatment condition than in the control condition (t(1575) = 10.90, p < 0.0001).
Together, these results suggest that a diversity training focusing exclusively on gender bias and stereotyping can also have a positive effect on people's attitudes relating to racial minorities. The general-bias condition also had a significant positive effect on this measure (t(1537) = 4.71, p < 0.001; see Table S9).

Behaviors Pertaining to Racial Minorities
Because our field partner only collects data on the race of its employees in the United States, we can only analyze our behavioral measures when it comes to racial minorities for employees in the U.S.
A t-test comparing the average number of racial minorities selected for informal mentoring by U.S. participants exposed to our treatment condition and our control condition showed a directionally positive effect, whereby U.S. participants in the treatment condition chose to informally mentor marginally more racial minorities than participants in the control condition (M treatment = 0.123, SD treatment = 0.57; M control = 0.072, SD control = 0.40; t(1157) = 1.57, p = 0.116; see Table S18). We also ran an OLS regression predicting the number of racial minorities selected for informal mentoring using an indicator for the treatment condition, an indicator for the participant being white, the interaction between these two indicators, and fixed effects for office location, job category, race, and gender while clustering standard errors by office location (see Table S20). We again found a directionally positive effect of the treatment on the number of racial minorities selected for informal mentoring (b = 0.0470, p = 0.052).
A t-test comparing the average number of racial minorities recognized for excellence across conditions showed that U.S. participants in the treatment condition chose to recognize directionally more racial minorities than participants in the control condition (M treatment = 0.026, SD treatment = 0.17; M control = 0.0080, SD control = 0.089; t(1157) = 1.93, p = 0.054; see Table S18).
We also ran an OLS regression with interactions and controls to predict the number of racial minorities recognized for excellence (see Table S20). We found a significant positive effect of the intervention on the number of racial minorities recognized for excellence (b = 0.0170, p = 0.039).
Interestingly, these effects on behaviors appeared to be directionally driven by the effects of the gender-bias treatment condition, as opposed to the general-bias treatment condition.
Although there were no significant differences between these two conditions on these measures, an OLS regression showed a significant positive effect of the gender-bias treatment on the number of racial minorities selected for informal mentoring (b = 0.0539, p = 0.044), providing additional support for the existence of positive spillover effects of our training. Further, there was also a significant positive effect of the gender-bias intervention on the number of racial minorities recognized for excellence (b = 0.026, p = 0.016).

Additional Robustness Checks
In exploratory analyses encouraged by reviewers, we investigated whether there might be other factors that moderated the effects of our diversity training. For example, managers might have more incentives to attend to a diversity training because they may be evaluated in part based on their handling of diversity and inclusion.
Our organizational partner has two main types of employees. For one type of employee (73.9% of our sample), there is a strict hierarchy of ranks (or levels) in the organization, so we can easily identify who is a manager and who is not (25.1% of these employees are managers).
For the second type of employee (the remaining 26.1% of our sample), there is not a clear hierarchy, and we only know how long each employee has been at the company. In this subsample, we are not able to easily classify who is a manager.
To test whether our training had different effects on managers as compared to other types of employees, we reran our primary analyses with the sub-sample of employees who could be classified as managers or not and included an interaction term between our treatment indicator and an indicator for whether a given employee was a manager. We do not see any evidence that the training was differentially effective for managers versus non-managers (all ps across all behavioral and attitudinal DVs > 0.1). We also reran these analyses interacting our treatment indicator with a continuous variable representing an employee's rank in the organization (again, these analyses necessarily only include the 73.9% of employees in roles with clearly defined ranks in our organizational partner's hierarchy). We find no evidence that the training was differentially effective for higher-ranked versus lower-ranked employees (again, all ps across all behavioral and attitudinal DVs > 0.1).
Given that the recruitment period for our training was six weeks long, we also examined whether there were differential effects of treatment depending on when an employee completed the training. Similar to the above analyses, we reran our primary analyses interacting our treatment indicator with a continuous variable representing the number of days from the start of the recruitment period to when an employee began the training. We find no evidence that the training was differentially effective depending on when an employee started the training (all ps across all behavioral and attitudinal DVs > 0.05).
Finally, we also tested whether our results differed if we analyzed only participants who completed the training, as compared to analyzing all participants who were randomized into a condition under our intention-to-treat analysis strategy. We do not find any significant differences in estimated treatment effects when limiting our analyses to employees who completed the training. This may be because the training completion rate was relatively high (above 75%) in both conditions, and we observe no evidence of differential attrition by condition.

SURVEY INSTRUMENTS AND INSTRUMENT VALIDATION
Attitudinal Support for Women Scale (adapted from (Swim et al., 1995)): Please rate the following items on a scale from -3 (Strongly Disagree) to 3 (Strongly Agree): 1. Discrimination against women is no longer a problem in society. 2. Women often miss out on good jobs due to sexual discrimination. 3. It is rare to see women treated in a sexist manner on television. 4. On average, people in our society treat husbands and wives equally. 5. Society has reached the point where women and men have equal opportunities for achievement. 6. It is easy to understand the anger of women's groups in society. 7. It is easy to understand why women's groups are still concerned about societal limitations of women's opportunities. 8. Over the past few years, the government and news media have been showing more concern about the treatment of women than is warranted by women's actual experiences.

Gender Bias Acknowledgment:
Gender Stereotyping: Many studies have found that we often make automatic assumptions about other people based on their gender. For example, people associate men with technology and women with housework.
Please answer the following questions on a scale from 1 (Not at all) to 7 (Very Much): 1. To what extent do you believe that you exhibit gender stereotyping? 2. To what extent do you believe that the average person exhibits gender stereotyping?

Racial Bias Acknowledgment:
Racial Stereotyping: Many studies have found that we often make automatic assumptions about other people based on their race. For example, people associate Asians with being good at math and Blacks with being athletic. Please answer the following questions on a scale from 1 (Not at all) to 7 (Very Much): 1. To what extent do you believe that you exhibit racial stereotyping? 2. To what extent do you believe that the average person exhibits racial stereotyping?
Gender Inclusive Intentions: [NOTE: The words "project", "organization", and "junior colleagues" in the survey instrument printed below have replaced words that could potentially identify our organizational partner. We have also removed one additional word that could identify our organizational partner. All edits are embedded in brackets below.] Instructions: For the following 10 questions, please read each scenario carefully and choose two responses: 1) the response that most likely reflects what you would do and 2) the response that least likely reflects what you would do.
Some questions may involve descriptions of situations you normally face, and some may not. Sometimes you might think of another strategy you might use. That is okay. Please choose from the responses presented.
1. Sara and Joe were members of a large team you were on last year. You think that they are both excellent data analysts, though you don't know either of them well personally. Both Sara and Joe have decided to apply to a fellowship, and both asked you to write a recommendation letter. There is only one fellowship available. What would you be most likely and least likely to do?
1. Write a recommendation for the first person who asked and politely decline the second
2. Try to get to know them both a little better so you can make a more informed choice
3. Follow your intuition about who deserves the recommendation more
4. Write them both recommendations anyway and let the selection committee decide (0)
2. Your client has asked for collaboration, but you suspect that she is not fully sharing her opinions and ideas. Your team needs buy-in from her on an initial set of recommendations to be able to move forward. What would you be most likely and least likely to do?
3. You are about to serve a new client and are getting advice from a peer who served them in the past. Among other things, your colleague shares that one of the women you will be working closely with is highly competent, but he didn't like her tendency to frequently assert strong opinions. What would you be most likely and least likely to do?
1. Be prepared to manage a potentially challenging client relationship
2. Take your colleague's feedback on board but wait before drawing any conclusions of your own (0)
3. Go out of your way early in the engagement to make sure you have earned her respect

Validation of Situational Judgment Test for Gender Inclusive Intentions
As part of this study, we created a new situational judgment test (SJT) to measure the extent to which individuals intend to behave in inclusive ways when responding to workplace situations where bias may arise.
Test Development. Situational judgment tests ask participants how they would respond to a range of context-specific scenarios as a way of detecting patterns in employees' motivations and behavioral tendencies across different circumstances (Lievens, Peeters, & Schollaert, 2008; Weekley & Ployhart, 2013). The SJT format was particularly well-suited for our study for several reasons: a) by presenting participants with multiple attractive options, SJTs can be more difficult to fake than traditional personality assessments (Hooper, Cullen, & Sackett, 2006); b) because the items are written in a way that is job-relevant, they also tend to have strong face validity and elicit positive reactions from respondents (Oostrom, De Soete, & Lievens, 2015); and c) SJT scores can be improved through organizationally-endorsed coaching without undermining their validity (Stemig, Sackett, & Lievens, 2015), which suggests they may capture a behavioral intention that is responsive to a training such as our intervention.
We developed our initial item pool by following a combination of critical-incident and theory-based methods, both of which have been shown to be effective methods for SJT development (Oostrom et al., 2015). To develop item stems that would be job-relevant for our population of employees, we solicited examples of common workplace situations where bias may arise from human resources and learning and development leaders at our partner organization. Our research team then turned these situations into an initial pool of 10 item stems and drafted theory-derived behavioral responses that ranged from bias-reinforcing to bias-reducing. These items were further refined based on input from our organizational partner. This process resulted in a list of 10 questions asking employees what they would be most and least likely to do in different bias-prone situations, choosing from four response options ranging from bias-interrupting (earning a score of +1 if selected as most likely and a score of -1 if selected as least likely) to bias-reinforcing (earning a score of -1 if selected as most likely and a score of +1 if selected as least likely). Overall scores are determined by summing participants' scores across all 10 situations.
Validation Study. To demonstrate convergent validity with different measures of bias, as well as predictive validity for gender-inclusive behaviors, we collected data from a sample of working adults via Amazon Mechanical Turk, which typically provides samples more representative of the population than undergraduates (Buhrmester, Kwang, & Gosling, 2011).
Participants were asked to complete our SJT and several scales commonly used in research in this domain, and they agreed to be contacted for a follow-up survey one week later. The follow-up survey included the gender-career implicit association test (Greenwald, McGhee, & Schwartz, 1998) and multiple opportunities to engage in behaviors to promote gender inclusion, as described in more detail below.

Sample.
A total of 299 participants completed the initial survey, and 243 (81.3%) of those respondents also completed the follow-up survey. The respondents in the final sample were 51% male and 81% Caucasian. The respondents had an average age of 35.6 years, and an average of 15 years of work experience. When asked to describe their political views, 58% described themselves as somewhat to extremely liberal, 19% were neither liberal nor conservative, and 23% said they were somewhat to extremely conservative.
Measures. Participants completed two rounds of surveys intended to measure the extent to which they hold biased attitudes and to what degree they were willing to engage in behaviors that promote gender inclusion. The first survey included our SJT, along with modern sexism (Swim et al., 1995) and ambivalent sexism (Glick & Fiske, 1996), two commonly used measures of gender bias, as well as a measure of gender bias acknowledgment (described in detail under Gender Bias Acknowledgment above).
At the one-week follow-up, participants completed the gender-career implicit association test and were offered an opportunity to advocate for women at a personal cost. Participants were told they would receive an additional $0.25 bonus above and beyond their payment for participating in this follow-up study. They were then told they could contribute some or all of this bonus to Ellevate, a professional development organization for women that aims to reduce gender inequality at work. Participants then indicated how much they would like to donate to Ellevate.
We also included several other measures that involved providing opinions or making public commitments to support gender equality.
-Hiring Measure -Participants were asked to evaluate a candidate for a managerial position in a task adapted from several prior studies (Bowles, Babcock, & Lai, 2007;Duguid & Thomas-Hunt, 2015;Rudman & Glick, 1999). Participants were randomly assigned to receive application materials from either a hypothetical male or a female candidate with equal qualifications. They then rated their likelihood of choosing that candidate, as well as the candidate's agentic and communal traits.
-Discrimination Compensation -In another task adapted from a prior study (Pietri et al., 2017), participants watched a brief video excerpt from a simulated job interview and learned that the job applicant had sued the prospective employer because she felt the interviewer asked several inappropriate questions about her pregnancy.
Participants were asked how much the applicant should receive in damages if she wins her case. They could choose any amount between $10,000 and $100,000, which they were told was a typical award range in prior cases like this.
-Advocating Measures -Participants were given two opportunities to advocate for gender inclusiveness. First, participants were asked if they would be willing to sign the HeForShe Commitment, a United Nations Women's Solidarity Movement for Gender Equality. To increase the believability of this measure without asking for any identifiable information, we told them we would provide instructions for signing the pledge if they said "yes", asked them to write a short statement about the importance of the pledge, and asked if we could anonymously share their comment with future study participants. Second, participants were asked if they would be willing to share an article from Harvard Business Review about gender bias in entrepreneurship on social media if we provided them with a link to do so. We scored this measure by giving participants a 0 if they declined both invitations to advocate for gender inclusiveness, a 1 if they accepted one but not the other, and a 2 if they accepted both.
Results. To rule out nonresponse bias in our final sample (Rogelberg & Stanton, 2007), we compared participants who responded to both surveys with those who responded to only the first survey on each of the variables we measured in the first survey. Using independent samples t-tests, we found no significant differences between these groups on any of our bias-related measures.
We removed items nine and ten from our SJT due to heavily skewed responses. As expected, scores on our resulting 8-item SJT were significantly negatively correlated with both modern sexism (r = -0.189, p = 0.001) and ambivalent sexism (r = -0.235, p < 0.001), including the latter's sub-scales for both benevolent sexism (r = -0.187, p = 0.001) and hostile sexism (r = -0.177, p = 0.002). Scores on our SJT were also significantly negatively correlated with broader measures of biased attitudes, including social dominance orientation (r = -0.241, p < 0.001) and authoritarian aggression (r = -0.213, p < 0.001). Our SJT was significantly positively correlated with gender bias acknowledgment (r = 0.217, p < 0.001) and racial bias acknowledgment (r = 0.178, p = 0.002), with the larger gap being driven by significant negative correlations between the SJT and self-ratings of gender bias (r = -0.167, p = 0.004) and racial bias (r = -0.191, p = 0.001). The SJT was not significantly associated with ratings of others' bias. A full set of correlations is reported in Table S25. Taken together, this suggests that higher scores on our SJT are associated with less biased attitudes.
To examine the unique role of our SJT in predicting future behavior, we conducted a hierarchical regression analysis with the amount of charitable contribution to Ellevate as our dependent variable. We entered the gender-related bias scales (benevolent sexism, hostile sexism, modern sexism, and gender bias acknowledgment) in the first step, the gender-career IAT in the second step, and our SJT in the third step. In the first two steps, none of the explicit or implicit biased attitude scales were significant predictors of the charitable contribution measure.
In the third step, only the SJT was a significant predictor (b = 0.003, p = 0.040; see Table S26 for the full results of this analysis). This suggests that although our SJT was correlated in the expected directions with other measures of biased attitudes, it was unique in its ability to predict this specific gender-inclusive behavior.
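The stepwise-block structure of the hierarchical regression above can be sketched as follows. This is an illustrative sketch on simulated data with hypothetical column names, not the study's analysis code: predictor blocks are entered cumulatively, and each step's contribution is read off the change in R-squared and the newly entered coefficients.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the validation sample (n = 243 completers);
# variable names are hypothetical.
rng = np.random.default_rng(1)
n = 243
df = pd.DataFrame({
    "donation": rng.uniform(0, 0.25, n),      # contribution to the women's cause
    "benevolent_sexism": rng.normal(size=n),
    "hostile_sexism": rng.normal(size=n),
    "modern_sexism": rng.normal(size=n),
    "bias_acknowledgment": rng.normal(size=n),
    "iat_d": rng.normal(size=n),              # gender-career IAT D-score
    "sjt": rng.integers(-8, 9, n),            # 8-item SJT score
})

# Step 1: explicit bias scales; Step 2: + IAT; Step 3: + SJT.
steps = [
    "donation ~ benevolent_sexism + hostile_sexism + modern_sexism + bias_acknowledgment",
    "donation ~ benevolent_sexism + hostile_sexism + modern_sexism + bias_acknowledgment + iat_d",
    "donation ~ benevolent_sexism + hostile_sexism + modern_sexism + bias_acknowledgment + iat_d + sjt",
]
fits = [smf.ols(f, data=df).fit() for f in steps]

# Because the models are nested, R-squared is non-decreasing across steps;
# the SJT's incremental contribution is its step-3 coefficient and p-value.
for i, fit in enumerate(fits, start=1):
    print(f"step {i}: R2 = {fit.rsquared:.3f}")
print("SJT coefficient in step 3:", fits[-1].params["sjt"])
```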
We ran similar regression analyses with each of the other behavioral measures and found mixed results. On the hiring measure, there were no significant differences in likelihood to work with the male versus the female candidate (see Table S27 for the full results of this analysis). The other Big 5 personality traits were not significant. This suggests that although our SJT is positively associated with agreeableness and actively open-minded thinking (AOT), it still has incremental predictive value for future gender-inclusive behaviors like our charitable contribution measure.
Lastly, to test whether our SJT still held incremental predictive value after controlling for both biased attitudes and personality measures, we conducted a hierarchical regression analysis with the gender-related bias scales in the first step, the gender-career IAT in the second step, the Big 5 personality dimensions and actively open-minded thinking in the third step, and our SJT in the fourth step. In the first two steps, neither the gender-related bias scales nor the IAT were significant predictors of financial contribution to a women's cause. In the third step, agreeableness (β = 0.012, p = 0.032) was a significant predictor and actively open-minded thinking was a marginally negative predictor (β = -0.016, p = 0.052). In the final step, the SJT significantly predicted financial contribution (β = 0.004, p = 0.037), with agreeableness (β = 0.011, p = 0.049) and actively open-minded thinking (β = -0.018, p = 0.023; see Table S28 for the full results of this analysis) remaining significant.

FIG. S1
Study Timeline

Note on supplementary tables. The supplementary tables report either (a) differences between the indicated pair of conditions (gender-bias vs. general-bias, gender-bias vs. control, general-bias vs. control, or pooled treatment vs. control), with standard errors for these differences in parentheses and sample sizes for the specified measures split by demographic subgroup, or (b) estimated treatment effects, with (robust) standard errors in parentheses, from ordinary least squares regressions predicting the specified dependent variable using fixed effects for office location, job category, race, and gender, with each regression including only participants from the specified demographic subgroup. For the difference tables, statistical significance is from t-tests, except for the audit study, where it is from a regression with no controls. Regressions with interactions include all interactions between the treatment, an indicator for the participant being male, and an indicator for the participant being located in the U.S. (or, for the racial measures, an indicator for the participant being white); when full controls are present, regressions include fixed effects for office location, job category, and race (and gender where noted). Robust standard errors are clustered at the office location level for the nomination measures, and regressions on outcomes pertaining to racial minorities are limited to U.S. employees. We do not have the gender or office location of five participants, so we include them in the overall intention-to-treat analysis, but we cannot include them in demographic subgroup analyses. Sample sizes vary because we only include employees who filled out each measure. †, *, **, and *** denote significance at the 10%, 5%, 1%, and 0.1% levels, respectively.

a We only include participants who received an email about talking to a new hire in this intention-to-treat analysis. Some participants in our study were not on the original invite list (e.g., due to people forwarding the training to others), so they did not receive any of the emails for follow-up behavioral measures. For the informal mentoring program and the recognition for excellence program, we can treat these people as nominating zero women; however, because the dependent variable for the audit study is the difference in willingness to talk to the female versus the male new hire, a participant who did not receive any email was not randomly assigned to speak to either new hire and cannot be counted in either bucket. We still conduct an intention-to-treat analysis for all participants who received any email about speaking to a new hire.