Intersectional inequalities in science

Significance The US scientific workforce is not representative of the population. Barriers to entry and participation have been well studied; however, few studies have examined the effect of these disparities on the advancement of science itself. Furthermore, most studies have looked at either race or gender, failing to account for the intersection of these variables. Our analysis utilizes millions of scientific papers to study the relationship between scientists and the science they produce. We find a strong relationship between the characteristics of scientists and their research topics, suggesting that diversity changes the scientific portfolio, with consequences for the career advancement of minoritized individuals. Science policies should consider this relationship to increase equitable participation in the scientific workforce and thereby improve the robustness of science.


Definitions
Joint probability is the proportion of articles for each racial group, gender, and topic, P = P(r, g, t), with r: racial group, g: gender, and t: topic. In this way, the joint probability summed over all articles, racial groups, and genders equals 1. The distribution across racial groups and genders within each topic sums to one for each topic.
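As an illustration, the joint probability and its per-topic normalization can be computed from a toy article table; the column names and data below are hypothetical, not taken from the WOS dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per article, with its inferred racial
# group, gender, and assigned topic (all labels illustrative only).
articles = pd.DataFrame({
    "race":   ["White", "White", "Black", "Black", "White", "Black"],
    "gender": ["M", "F", "F", "M", "M", "F"],
    "topic":  ["T1", "T1", "T2", "T1", "T2", "T2"],
})

# Joint probability P(r, g, t): the share of all articles in each cell.
joint = articles.value_counts(["race", "gender", "topic"], normalize=True)
print(joint.sum())  # 1.0: sums to one over all races, genders, and topics

# Normalizing within each topic gives a distribution that sums to one
# for every topic.
within_topic = joint / joint.groupby(level="topic").transform("sum")
print(within_topic.groupby(level="topic").sum())  # 1.0 for each topic
```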
The over- or underrepresentation of a racial group and gender is, proportionally, how much more or less present that group is in a topic compared with what would be expected at random, given the overall share of that racial group and gender in the full dataset.
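A minimal sketch of this representation ratio on the same kind of hypothetical article table (a ratio above 1 means the group is overrepresented in that topic relative to its overall share):

```python
import pandas as pd

# Hypothetical article table (illustrative labels, not the WOS data).
articles = pd.DataFrame({
    "race":   ["White", "White", "Black", "Black", "White", "Black"],
    "gender": ["M", "F", "F", "M", "M", "F"],
    "topic":  ["T1", "T1", "T2", "T1", "T2", "T2"],
})

# Observed share of each (race, gender) group within each topic.
observed = (articles.groupby("topic")[["race", "gender"]]
                    .value_counts(normalize=True))

# Expected share at random: the group's overall share in the full dataset.
expected = articles.value_counts(["race", "gender"], normalize=True)

# Ratio > 1: overrepresented in that topic; ratio < 1: underrepresented.
ratio = observed / expected.reindex(observed.index.droplevel("topic")).to_numpy()
print(ratio)
```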

Data Sources
The bibliometric database used for our analysis is Clarivate Analytics' Web of Science (WOS). In addition to country information and citation indicators, we used authors' given names to infer their probable gender and their family names to infer their probable race. The field and subfield classification is that of the National Science Foundation (1). To build topics, we used titles, keywords, and abstracts. Because WOS does not provide given names before 2008, we restrict our analysis to the period 2008-2019 so that author gender can be inferred. Racial categories are a social construct that varies by country; therefore, our analysis is limited to the United States. For racial inference, we used the information provided by the 2010 US Census on family names and their distribution by race and Latinx origin (see below).

Race and Gender Inference
The method used for racial inference based on names can be found in (2); we briefly highlight its main characteristics below. Our work is guided by the words of Tukufu Zuberi (3): "The racialization of data is an artifact of both the struggles to preserve and to destroy racial stratification." Similarly, Buolamwini and Gebru's work showed the dangers of algorithmic bias (4). We therefore examined several possible methodologies for inferring race from names, and the potential biases they carry. Family names from the 2010 US Census (5) were used as the primary source of information on racial categorization. We also attempted to use data on the distribution of given names by race from mortgage data (6). In sum, in our attempt to minimize bias, we developed several strategies: 1) collapsing the probability distribution into a single metric, 2) using thresholds, 3) using family names first, and then given names, and 4) mixing given- and family-name probabilities.
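Strategy 4 can be illustrated with a naive-Bayes-style combination; the mixing rule and every number below are hypothetical assumptions for illustration, not the paper's actual formula or data:

```python
import numpy as np

# Hypothetical P(race | name) distributions over (White, Black, Asian)
# for a single author; all values are invented for illustration.
p_race_given  = np.array([0.20, 0.70, 0.10])  # from the given name
p_race_family = np.array([0.50, 0.40, 0.10])  # from the family name
p_race_prior  = np.array([0.60, 0.13, 0.27])  # overall population shares

# Naive independence assumption: P(r | g, f) is proportional to
# P(r | g) * P(r | f) / P(r); normalize to obtain a distribution.
posterior = p_race_given * p_race_family / p_race_prior
posterior /= posterior.sum()
print(posterior)  # combined distribution, sums to 1
```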
Ultimately, we found that the distribution by race obtained when retaining only names with a >x% probability of belonging to a racial group (i.e., thresholding) is strongly affected by the choice of x. The 'informativeness' of family names is therefore heavily dependent on the racial categorization itself. This result is deeply related to the history of US racial classification and naming conventions: descendants of enslaved Africans have historically shared surnames with their White enslavers. Disambiguating Black from White names is therefore more challenging in the US context, reinforcing the need to understand historical context when developing race-based algorithms like ours. Our main conclusion is that thresholds can heavily underestimate racialized groups, especially the Black population. One way to avoid this bias is to use the full distribution across racial groups, even if this complicates subsequent analyses.
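The thresholding effect can be seen in a toy example with invented surname distributions (the probabilities below are illustrative, not Census values): names whose largest racial probability falls below the cutoff are discarded entirely, and much of the discarded mass belongs to the Black population.

```python
import pandas as pd

# Invented P(race | surname) rows; the ambiguous Black/White names mirror
# the shared-surname history described above.
names = pd.DataFrame(
    {"White": [0.70, 0.47, 0.45, 0.02],
     "Black": [0.25, 0.48, 0.50, 0.01],
     "Asian": [0.05, 0.05, 0.05, 0.97]},
    index=["Smith", "Williams", "Jefferson", "Nguyen"],
)

# Thresholding: keep a name only if some group exceeds the cutoff, then
# assign the name wholly to that group.
threshold = 0.6
kept = names[names.max(axis=1) > threshold].idxmax(axis=1)
print(kept.value_counts())       # the Black group disappears entirely

# Full distribution: every name contributes fractionally to every group.
full_shares = names.sum() / names.to_numpy().sum()
print(full_shares)               # the Black group retains ~31% of the mass
```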

Topic Modeling
To assess the robustness of the LDA model, we developed an experimental approach. The randomness of LDA can be controlled with a random seed, and each seed yields different results for the same model and dataset. A non-robust model would produce very different results on each run, whereas a robust model should display only minor changes, guaranteeing that the observed results are not the product of chance. To test for this, we ran the LDA model for Social Sciences, Humanities, and Professional Fields ten times with different random seeds. As comparison baselines, we used the model pre-trained on Health to predict the data from Social Sciences, and we generated a completely random case using a Dirichlet distribution with the same dimensions. Importantly, the order of topics in LDA is not fixed: what is referred to as topic #2 in one model may be topic #17 in another. We therefore used a column-permutation-invariant distance metric to compare models.
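A minimal sketch of this seed experiment using scikit-learn's LDA on synthetic counts; the corpus, matrix dimensions, and number of topics are placeholders, not the paper's actual setup:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Synthetic document-term count matrix standing in for a real corpus.
X = rng.poisson(1.0, size=(200, 50))

# Same model and data, different random seeds: a robust LDA should give
# near-identical document-topic matrices up to a permutation of topics.
runs = [
    LatentDirichletAllocation(n_components=5, random_state=seed).fit_transform(X)
    for seed in range(3)
]

# Fully random baseline with the same dimensions, drawn from a Dirichlet.
random_case = rng.dirichlet(np.ones(5), size=200)
print(runs[0].shape, random_case.shape)  # (200, 5) (200, 5)
```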
The result of the LDA model is a matrix of dimensions N×T, with N articles and T topics. The proposed distance measure first takes the L2 norm of each article's topic vector. This assigns a single value to every article and, for each run of the model, yields a vector of length N. We then compare models by computing the cosine similarity between these vectors. The results (see figure S7) show that all runs with different random seeds are very similar to one another, while the model trained on a different dataset and the random case are both very different from all other cases. This validates our model and demonstrates that the results are not simply a product of chance.

Fig. S7. Cosine similarity between multiple runs of the LDA model. Models 1 to 10 represent a model trained on the same social science dataset with different random seeds. Model 'health' is a model with the same number of topics trained on the health dataset. All models were used to predict the same cases (social science data), resulting in a distribution of topics by document. The 'random case' uses a Dirichlet distribution that replicates the dimensions of the results. All results are compared using the L2 norm and cosine similarity. All models trained on the same data showed similar behavior, while the two control groups show a very different pattern. This confirms that the LDA results are not a product of chance, but reflect the properties of the database.

Dataset S10 (separate file). Over/underrepresentation of race and gender groups by topic, Health.
Dataset S11 (separate file). Mean, median, and standard deviation of citations by topic, race, and gender, Health.
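The comparison metric described in this section can be sketched as follows (our reconstruction, not the authors' code): the per-article L2 norm is unchanged by any reordering of the topic columns, so cosine similarity between these norm vectors compares runs without having to match topics across models.

```python
import numpy as np

def signature(doc_topic):
    # One L2 norm per article; invariant to permuting the topic columns.
    return np.linalg.norm(doc_topic, axis=1)

def model_similarity(a, b):
    # Cosine similarity between the signature vectors of two runs.
    sa, sb = signature(a), signature(b)
    return sa @ sb / (np.linalg.norm(sa) * np.linalg.norm(sb))

rng = np.random.default_rng(0)
run = rng.dirichlet(np.ones(20), size=1000)     # one N x T result matrix
shuffled = run[:, rng.permutation(20)]          # same run, topics reordered
print(model_similarity(run, shuffled))          # ~1.0: permutation-invariant
```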