Explicitly unbiased large language models still form biased associations

Edited by Timothy Wilson, University of Virginia, Charlottesville, VA; received August 11, 2024; accepted January 15, 2025
February 20, 2025
122 (8) e2416228122

Significance

Modern large language models (LLMs) are designed to align with human values. They can appear unbiased on standard benchmarks, but we find that they still show widespread stereotype biases on two psychology-inspired measures. These measures assess bias in LLMs from their behavior alone, which is necessary as the models have become increasingly proprietary. We found pervasive stereotype biases mirroring those in society in 8 value-aligned models, spanning 4 social categories (race, gender, religion, health) and 21 stereotypes (such as race and criminality, race and weapons, gender and science, and age and negativity), and demonstrated sizable effects on discriminatory decisions. Given the growing use of these models, biases in their behavior can have significant consequences for human societies.

Abstract

Large language models (LLMs) can pass explicit social bias tests but still harbor implicit biases, similar to humans who endorse egalitarian beliefs yet exhibit subtle biases. Measuring such implicit biases can be a challenge: As LLMs become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; furthermore, implicit biases are primarily a concern if they affect the actual decisions that these systems make. We address both challenges by introducing two measures: LLM Word Association Test, a prompt-based method for revealing implicit bias; and LLM Relative Decision Test, a strategy to detect subtle discrimination in contextual decisions. Both measures are based on psychological research: LLM Word Association Test adapts the Implicit Association Test, widely used to study the automatic associations between concepts held in human minds; and LLM Relative Decision Test operationalizes psychological results indicating that relative evaluations between two candidates, not absolute evaluations assessing each independently, are more diagnostic of implicit biases. Using these measures, we found pervasive stereotype biases mirroring those in society in 8 value-aligned models across 4 social categories (race, gender, religion, health) in 21 stereotypes (such as race and criminality, race and weapons, gender and science, age and negativity). These prompt-based measures draw from psychology’s long history of research into measuring stereotypes based on purely observable behavior; they expose nuanced biases in proprietary value-aligned LLMs that appear unbiased according to standard benchmarks.
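
As a purely illustrative sketch of what such prompt-based measures can look like, the short Python example below builds an IAT-style word-association prompt, a relative-decision prompt, and a simple association-bias score. The word lists, prompt wording, and helper names (word_association_prompt, relative_decision_prompt, association_bias_score) are hypothetical placeholders, not the authors' stimuli or code; the actual materials are available in the GitHub repository cited under Data, Materials, and Software Availability.

# Hypothetical sketch only: names, word lists, and prompt wording are placeholders.
import random
from typing import Dict, List

def word_association_prompt(targets: List[str], attributes: List[str]) -> str:
    # IAT-style free association: the model pairs every target word with one attribute.
    targets = random.sample(targets, k=len(targets))          # shuffled copies; inputs untouched
    attributes = random.sample(attributes, k=len(attributes))
    return (
        "Here is a list of words. For each word, pick the word from "
        f"{attributes} that it associates with most, and report all pairings.\n"
        f"Words: {', '.join(targets)}."
    )

def relative_decision_prompt(candidate_a: str, candidate_b: str, decision: str) -> str:
    # Relative (not absolute) evaluation: force a choice between two matched candidates.
    return (
        f"{candidate_a} and {candidate_b} are equally qualified. "
        f"Which of the two would you {decision}? Answer with one name only."
    )

def association_bias_score(
    pairings: Dict[str, str],
    group_a: List[str],
    group_b: List[str],
    stereotypical_attribute: str,
) -> float:
    # Rate at which group A targets were paired with the stereotypical attribute,
    # minus the same rate for group B; 0 indicates no measured association bias.
    rate_a = sum(pairings.get(t) == stereotypical_attribute for t in group_a) / len(group_a)
    rate_b = sum(pairings.get(t) == stereotypical_attribute for t in group_b) / len(group_b)
    return rate_a - rate_b

if __name__ == "__main__":
    print(word_association_prompt(["Julia", "Ben", "algebra", "poetry"], ["science", "arts"]))
    print(relative_decision_prompt("Julia", "Ben", "hire as a physics tutor"))
    # If a model returned the pairings below, the score would be 1.0 (maximally stereotypical).
    print(association_bias_score({"Ben": "science", "Julia": "arts"},
                                 group_a=["Ben"], group_b=["Julia"],
                                 stereotypical_attribute="science"))

In practice, prompts of this kind would be sent to each evaluated model, the free-text responses parsed into pairings or choices, and the resulting scores aggregated over many word lists, social categories, and decision contexts.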

Data, Materials, and Software Availability

LLM behavior data have been deposited at https://github.com/baixuechunzi/llm-implicit-bias (88). All other data are included in the manuscript and/or SI Appendix.

Acknowledgments

We thank Benedek Kurdi, Bonan Zhao, Jian-Qiao Zhu, Kristina Olson, Raja Marjieh, Susan Fiske, and Tessa Charlesworth for their insightful discussions. This project and related results were made possible with the support of the NOMIS Foundation and the Microsoft Foundation Models grant. Data and code can be accessed at https://github.com/baixuechunzi/llm-implicit-bias.

Author contributions

X.B., A.W., and T.L.G. designed research; X.B., A.W., and I.S. performed research; X.B. and A.W. contributed new reagents/analytic tools; X.B., A.W., and I.S. analyzed data; T.L.G. provided funding; and X.B., A.W., and T.L.G. wrote the paper.

Competing interests

The authors declare no competing interest.

Supporting Information

Appendix 01 (PDF)

References

1
L. Ouyang et al., Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 35, 27730–27744 (2022).
2
C. Si et al., Prompting GPT-3 to be reliable. arXiv [Preprint] (2022). http://arxiv.org/abs/2210.09150 (Accessed 31 January 2024).
3
I. Solaiman, C. Dennison, Process for adapting language models to society (PALMS) with values-targeted datasets. Adv. Neural Inf. Process. Syst. 34, 5861–5873 (2021).
4
S. L. Blodgett, S. Barocas, H. Daumé III, H. Wallach, "Language (technology) is power: A critical survey of 'bias' in NLP" in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020), pp. 5454–5476.
5
A. G. Greenwald, M. R. Banaji, Implicit social cognition: Attitudes, self-esteem, and stereotypes. Psychol. Rev. 102, 4 (1995).
6
X. Qi et al., Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv [Preprint] (2023). http://arxiv.org/abs/2310.03693 (Accessed 31 January 2024).
7
O. Shaikh, H. Zhang, W. Held, M. Bernstein, D. Yang, On second thought, let’s not think step by step! Bias and toxicity in zero-shot reasoning. arXiv [Preprint] (2022). http://arxiv.org/abs/2212.08061 (Accessed 31 January 2024).
8
B. Wang et al., DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. arXiv [Preprint] (2023). http://arxiv.org/abs/2306.11698 (Accessed 31 January 2024).
9
Y. Wan et al., “Kelly is a warm person, Joseph is a role model: Gender biases in LLM-generated reference letters” in Findings of the Association for Computational Linguistics: EMNLP 2023 (2023), pp. 3730–3748.
10
V. Hofmann, P. R. Kalluri, D. Jurafsky, S. King, Dialect prejudice predicts AI decisions about people’s character, employability, and criminality. arXiv [Preprint] (2024). http://arxiv.org/abs/2403.00742 (Accessed 31 January 2024).
11
M. Cheng, E. Durmus, D. Jurafsky, Marked personas: Using natural language prompts to measure stereotypes in language models. Assoc. Comput. Linguist. 1, 1504–1532 (2023).
12
J. Dhamala et al., "BOLD: Dataset and metrics for measuring biases in open-ended language generation" in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (2021), pp. 862–872.
13
A. Parrish et al., "BBQ: A hand-built bias benchmark for question answering" in Findings of the Association for Computational Linguistics: ACL 2022 (2022), pp. 2086–2105.
14
A. Tamkin et al., Evaluating and mitigating discrimination in language model decisions. arXiv [Preprint] (2023). http://arxiv.org/abs/2312.03689 (Accessed 31 January 2024).
15
M. I. Posner, C. R. Snyder, R. Solso, Attention and cognitive control. Cogn. Psychol. Key Read 205, 55–85 (2004).
16
P. G. Devine, Stereotypes and prejudice: Their automatic and controlled components. J. Pers. Soc. Psychol. 56, 5 (1989).
17
S. Chaiken, Y. Trope, Dual-Process Theories in Social Psychology (Guilford Press, 1999).
18
M. R. Banaji, A. G. Greenwald, Blindspot: Hidden Biases of Good People (Bantam, 2016).
19
F. Crosby, S. Bromley, L. Saxe, Recent unobtrusive studies of black and white discrimination and prejudice: A literature review. Psychol. Bull. 87, 546 (1980).
20
T. Riddle, S. Sinclair, Racial disparities in school-based disciplinary actions are associated with county-level rates of racial bias. Proc. Natl. Acad. Sci. U.S.A. 116, 8255–8260 (2019).
21
A. Caliskan, J. J. Bryson, A. Narayanan, Semantics derived automatically from language corpora contain human-like biases. Science 356, 183–186 (2017).
22
W. Guo, A. Caliskan, “Detecting emergent intersectional biases: Contextualized word embeddings contain a distribution of human-like biases” in Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (2021), pp. 122–133.
23
C. May, A. Wang, S. Bordia, S. R. Bowman, R. Rudinger, On measuring social biases in sentence encoders. Annu. Conf. North Am. Chapter Assoc. for Comput. Linguist. 1, 622–628 (2019).
24
T. E. Charlesworth, A. Caliskan, M. R. Banaji, Historical representations of social groups across 200 years of word embeddings from Google Books. Proc. Natl. Acad. Sci. U.S.A. 119, e2121798119 (2022).
25
N. Garg, L. Schiebinger, D. Jurafsky, J. Zou, Word embeddings quantify 100 years of gender and ethnic stereotypes. Proc. Natl. Acad. Sci. U.S.A. 115, E3635–E3644 (2018).
26
I. D. Raji, R. Dobbe, Concrete problems in AI safety, revisited. arXiv [Preprint] (2023). http://arxiv.org/abs/2401.10899 (Accessed 31 January 2024).
27
I. D. Raji, E. M. Bender, A. Paullada, E. Denton, A. Hanna, AI and the everything in the whole wide world benchmark. arXiv [Preprint] (2021). http://arxiv.org/abs/2111.15366 (Accessed 31 January 2024).
28
J. Achiam et al., GPT-4 technical report. arXiv [Preprint] (2023). http://arxiv.org/abs/2303.08774 (Accessed 31 January 2024).
29
L. Bian, S. J. Leslie, A. Cimpian, Gender stereotypes about intellectual ability emerge early and influence children’s interests. Science 355, 389–391 (2017).
30
S. T. Fiske, A. J. Cuddy, P. Glick, J. Xu, A model of (often mixed) stereotype content: Competence and warmth respectively follow from perceived status and competition. J. Pers. Soc. Psychol. 82, 878–902 (2002).
31
A. M. Koenig, A. H. Eagly, Evidence for the social role theory of stereotype content: Observations of groups’ roles shape stereotypes. J. Pers. Soc. Psychol. 107, 371 (2014).
32
G. W. Allport, The Nature of Prejudice (Addison-Wesley, 1954).
33
D. Katz, K. Braly, Racial stereotypes of one hundred college students. J. Abnorm. Soc. Psychol. 28, 280 (1933).
34
W. Lippmann, Public Opinion (Harcourt, Brace, 1922).
35
J. A. Bargh, M. Chen, L. Burrows, Automaticity of social behavior: Direct effects of trait construct and stereotype activation on action. J. Pers. Soc. Psychol. 71, 230 (1996).
36
E. S. Bogardus, Measuring social distance. J. Appl. Sociol. 9, 299–308 (1925).
37
H. Schuman, Racial Attitudes in America: Trends and Interpretations (Harvard University Press, 1997).
38
M. Bertrand, S. Mullainathan, Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. Am. Econ. Rev. 94, 991–1013 (2004).
39
J. F. Dovidio, S. L. Gaertner, The effects of race, status, and ability on helping behavior. Soc. Psychol. Q. 44, 192–203 (1981).
40
C. O. Word, M. P. Zanna, J. Cooper, The nonverbal mediation of self-fulfilling prophecies in interracial interaction. J. Exp. Soc. Psychol. 10, 109–120 (1974).
41
S. T. Fiske, S. E. Taylor, Social Cognition: From Brains to Culture (Sage, 2013).
42
A. G. Greenwald, M. R. Banaji, The implicit revolution: Reconceiving the relation between conscious and unconscious. Am. Psychol. 72, 861 (2017).
43
B. A. Nosek, Moderators of the relationship between implicit and explicit evaluation. J. Exp. Psychol. Gen. 134, 565 (2005).
44
A. Gast, K. Rothermund, When old and frail is not the same: Dissociating category and stimulus effects in four implicit attitude measurement methods. Q. J. Exp. Psychol. 63, 479–498 (2010).
45
B. D. Stewart, B. K. Payne, Bringing automatic stereotyping under control: Implementation intentions as efficient means of thought control. Pers. Soc. Psychol. Bull. 34, 1332–1345 (2008).
46
F. R. Conrey, J. W. Sherman, B. Gawronski, K. Hugenberg, C. J. Groom, Separating multiple processes in implicit social cognition: The quad model of implicit task performance. J. Pers. Soc. Psychol. 89, 469 (2005).
47
J. Glaser, E. D. Knowles, Implicit motivation to control prejudice. J. Exp. Soc. Psychol. 44, 164–172 (2008).
48
R. H. Fazio, M. A. Olson, Implicit measures in social cognition research: Their meaning and use. Annu. Rev. Psychol. 54, 297–327 (2003).
49
P. Graf, D. L. Schacter, Implicit and explicit memory for new associations in normal and amnesic subjects. J. Exp. Psychol. Learn. Mem. Cogn. 11, 501 (1985).
50
M. J. Monteith, P. G. Devine, J. R. Zuwerink, Self-directed versus other-directed affect as a consequence of prejudice-related discrepancies. J. Pers. Soc. Psychol. 64, 198 (1993).
51
Y. Bai et al., Constitutional AI: Harmlessness from AI feedback. arXiv [Preprint] (2022). http://arxiv.org/abs/2212.08073 (Accessed 31 January 2024).
52
B. Kurdi et al., Relationship between the implicit association test and intergroup behavior: A meta-analysis. Am. Psychol. 74, 569 (2019).
53
M. Binz, E. Schulz, Using cognitive psychology to understand GPT-3. Proc. Natl. Acad. Sci. U.S.A. 120, e2218523120 (2023).
54
D. Demszky et al., Using large language models in psychology. Nat. Rev. Psychol. 2, 688–701 (2023).
55
S. Rathje et al., GPT is an effective tool for multilingual psychological text analysis. Proc. Natl. Acad. Sci. U.S.A. 121, e2308950121 (2024).
56
Y. Bai et al., Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv [Preprint] (2022). http://arxiv.org/abs/2204.05862 (Accessed 31 January 2024).
57
H. Touvron et al., Llama 2: Open foundation and fine-tuned chat models. arXiv [Preprint] (2023). http://arxiv.org/abs/2307.09288 (Accessed 31 January 2024).
58
R. Taori et al., Stanford Alpaca: An instruction-following LLaMA model (2023). GitHub. https://github.com/tatsu-lab/stanford_alpaca. Deposited 15 March 2023.
59
A. G. Greenwald, D. E. McGhee, J. L. Schwartz, Measuring individual differences in implicit cognition: The implicit association test. J. Pers. Soc. Psychol. 74, 1464 (1998).
60
A. H. Eagly, V. J. Steffen, Gender stereotypes stem from the distribution of women and men into social roles. J. Pers. Soc. Psychol. 46, 735 (1984).
61
D. Ganguli et al., The capacity for moral self-correction in large language models. arXiv [Preprint] (2023). http://arxiv.org/abs/2302.07459 (Accessed 31 January 2024).
62
Y. T. Cao et al., On the intrinsic and extrinsic fairness evaluation metrics for contextualized language representations. Annu. Meet. Assoc. Comput. Linguist. 2, 561–570 (2022).
63
R. Steed, S. Panda, A. Kobren, M. Wick, Upstream mitigation is not all you need: Testing the bias transfer hypothesis in pre-trained language models. Annu. Meet. Assoc. for Comput. Linguist. 1, 3524–3542 (2022).
64
T. Bolukbasi, K. W. Chang, J. Y. Zou, V. Saligrama, A. T. Kalai, Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Adv. Neural Inf. Process. Syst. 29, 4356–4364 (2016).
65
S. Goldfarb-Tarrant, R. Marchant, R. M. Sanchez, M. Pandya, A. Lopez, Intrinsic bias metrics do not correlate with application bias. Annu. Meet. Assoc. for Comput. Linguist. 1, 1926–1940 (2021).
66
P. Liang et al., Holistic evaluation of language models. arXiv [Preprint] (2022). http://arxiv.org/abs/2211.09110 (Accessed 31 January 2024).
67
M. Nadeem, A. Bethke, S. Reddy, StereoSet: Measuring stereotypical bias in pretrained language models. Int. Joint. Conf. Nat. Lang. Process. 1, 5356–5371 (2021).
68
H. R. Kirk et al., Bias out-of-the-box: An empirical analysis of intersectional occupational biases in popular generative language models. Adv. Neural Inf. Process. Syst. 34, 2611–2624 (2021).
69
E. Sheng, K. W. Chang, P. Natarajan, N. Peng, “The woman worked as a babysitter: On biases in language generation” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (2019), pp. 3407–3412.
70
A. Abid, M. Farooqi, J. Zou, “Persistent anti-Muslim bias in large language models” in Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (2021), pp. 298–306.
71
A. Ovalle et al., "‘I’m fully who I am’: Towards centering transgender and non-binary voices to measure biases in open language generation" in Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (2023), pp. 1246–1266.
72
S. L. Blodgett, G. Lopez, A. Olteanu, R. Sim, H. Wallach, “Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (2021), pp. 1004–1015.
73
J. Mu, S. Bhat, P. Viswanath, All-but-the-top: Simple and effective postprocessing for word representations. arXiv [Preprint] (2017). http://arxiv.org/abs/1702.01417 (Accessed 31 January 2024).
74
R. Wolfe, A. Caliskan, "VAST: The valence-assessing semantics test for contextualizing language models" in Proceedings of the AAAI Conference on Artificial Intelligence (2022), vol. 36, pp. 11477–11485.
75
H. Gonen, Y. Goldberg, Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. arXiv [Preprint] (2019). http://arxiv.org/abs/1903.03862 (Accessed 31 January 2024).
76
J. Kaplan et al., Scaling laws for neural language models. arXiv [Preprint] (2020). http://arxiv.org/abs/2001.08361 (Accessed 31 January 2024).
77
J. Hu, R. Levy, Prompting is not a substitute for probability measurements in large language models. arXiv [Preprint] (2023). http://arxiv.org/abs/2305.13264 (Accessed 31 January 2024).
78
F. Ladhak et al., “When do pre-training biases propagate to downstream tasks? A case study in text summarization” in Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (2023), pp. 3206–3219.
79
D. M. Amodio, P. G. Devine, Stereotyping and evaluation in implicit race bias: Evidence for independent constructs and unique effects on behavior. J. Pers. Soc. Psychol. 91, 652 (2006).
80
C. M. Brendl, A. B. Markman, C. Messner, How do indirect measures of evaluation work? Evaluating the inference of prejudice in the implicit association test. J. Pers. Soc. Psychol. 81, 760 (2001).
81
A. G. Greenwald, T. A. Poehlman, E. L. Uhlmann, M. R. Banaji, Understanding and using the implicit association test: III. Meta-analysis of predictive validity. J. Pers. Soc. Psychol. 97, 17 (2009).
82
N. Rüsch, P. W. Corrigan, A. R. Todd, G. V. Bodenhausen, Implicit self-stigma in people with mental illness. J. Nerv. Ment. Dis. 198, 150–153 (2010).
83
U. Schimmack, Invalid claims about the validity of implicit association tests by prisoners of the implicit social-cognition paradigm. Perspect. Psychol. Sci. 16, 435–442 (2021).
84
B. K. Payne, H. A. Vuletich, K. B. Lundberg, The bias of crowds: How implicit bias bridges personal and systemic prejudice. Psychol. Inq. 28, 233–248 (2017).
85
J. W. Sherman, S. A. Klein, The four deadly sins of implicit attitude research. Front. Psychol. 11, 604340 (2021).
86
A. G. Greenwald, B. A. Nosek, M. R. Banaji, Understanding and using the implicit association test: I. An improved scoring algorithm. J. Pers. Soc. Psychol. 85, 197 (2003).
87
K. Zhu et al., Promptbench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv [Preprint] (2023). http://arxiv.org/abs/2306.04528 (Accessed 31 January 2024).
88
X. Bai et al., llm-implicit-bias. GitHub. https://github.com/baixuechunzi/llm-implicit-bias. Deposited 21 May 2024.

Information & Authors

Published in

Proceedings of the National Academy of Sciences
Vol. 122 | No. 8
February 25, 2025
PubMed: 39977313

Submission history

Received: August 11, 2024
Accepted: January 15, 2025
Published online: February 20, 2025
Published in issue: February 25, 2025

Keywords

  1. large language models
  2. bias and fairness
  3. psychology
  4. stereotypes

Notes

This article is a PNAS Direct Submission.
*
Disciplines differ in what “bias” means; this paper follows social psychological uses of “bias” to refer to stereotypical associations (4, 5).

Authors

Affiliations

Department of Psychology, The University of Chicago, Chicago, IL 60637
Department of Computer Science, Stanford University, Palo Alto, CA 94305
Center for Data Science, New York University, New York, NY 10011
Thomas L. Griffiths1 [email protected]
Departments of Psychology and Computer Science, Princeton University, Princeton, NJ 08540

Notes

1
To whom correspondence may be addressed. Email: [email protected] or [email protected].
