DON’T LET AI MAKE YOU MOODY: ASSESSING RACIAL BIAS IN LARGE LANGUAGE MODELS’ CLINICAL DECISION-MAKING IN MOOD DISORDERS

Background

Racial disparities in the diagnosis and treatment of mood disorders, particularly bipolar disorder, have been widely reported in psychiatric practice. Compared with White patients, Black patients are more likely to receive antipsychotic-focused or restrictive treatment approaches and less likely to receive guideline-concordant mood-stabilizing therapies, even when their clinical presentations are similar. Other racial and ethnic minority groups, including Hispanic and Asian patients, have also been shown to experience delayed diagnosis, differential prescribing patterns, and barriers to equitable psychiatric care. As artificial intelligence (AI), in the form of large language models (LLMs), is increasingly explored for clinical decision support and psychiatric education, concerns have emerged that these systems may reproduce existing treatment inequities. Although prior studies have identified racial bias in AI-generated psychiatric recommendations across multiple diagnoses, less is known about whether LLM treatment recommendations for mood disorders change when patient race is varied.

Objective

This study aims to examine whether general-purpose and medical-focused LLMs generate different treatment recommendations for identical mood disorder presentations when patient race is explicitly stated, implicitly suggested, or omitted.

Methods

Standardized clinical cases representing three mood disorders (bipolar I disorder, bipolar II disorder, and major depressive disorder) will be developed. Each case will be presented under one of three framing conditions: a neutral condition with no reference to race, an implicit condition using a name empirically associated with a racial group, and an explicit condition stating the patient’s race. In the implicit and explicit conditions, race will be systematically varied across four groups (White non-Hispanic, African American, Hispanic, and Asian). The same formulary, including medication costs, will be provided to each of four LLMs: the general-purpose models ChatGPT and Grok and the medical-focused models OpenEvidence and DoxGPT. Highly specific prompts will state the demographic attributes, the diagnosis, and the formulary, and will request a succinct response naming the medication(s) of choice and the risk of recurrent mood episodes (low, medium, or high). Outputs will be analyzed within each LLM to determine whether the recommended medication(s), the cost of the medication(s), and the stated risk of recurrent mood episodes vary by patient race.

Outcomes

This study aims to identify race-associated differences in LLM-generated treatment recommendations for mood disorders. Findings will inform the responsible use of AI in psychiatric care by clarifying how treatment recommendations vary across race and model type.
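As a rough illustration of the case design described under Methods, the sketch below enumerates the prompt conditions in Python. It rests on assumptions the abstract does not state: one case per diagnosis, a single neutral variant per case (since the neutral condition carries no race information), and one prompt per case variant per model. The function names `build_conditions` and `build_design` are hypothetical, and no names or prompt wording are specified here because the abstract does not give them.

```python
from itertools import product

# Factors taken from the Methods description.
DIAGNOSES = ["bipolar I disorder", "bipolar II disorder", "major depressive disorder"]
RACES = ["White non-Hispanic", "African American", "Hispanic", "Asian"]
MODELS = ["ChatGPT", "Grok", "OpenEvidence", "DoxGPT"]


def build_conditions():
    """Return (framing, race) pairs.

    Assumption: the neutral framing appears once with no race attached,
    while the implicit and explicit framings are crossed with every race.
    """
    conditions = [("neutral", None)]
    conditions += [
        (framing, race)
        for framing in ("implicit", "explicit")
        for race in RACES
    ]
    return conditions


def build_design():
    """Cross every model with every diagnosis and every framing condition."""
    return [
        {"model": model, "diagnosis": diagnosis, "framing": framing, "race": race}
        for model, diagnosis, (framing, race) in product(
            MODELS, DIAGNOSES, build_conditions()
        )
    ]


if __name__ == "__main__":
    design = build_design()
    # Under the stated assumptions: 4 models * 3 diagnoses * 9 conditions.
    print(len(design))
```

Under these assumptions each diagnosis yields nine case variants (one neutral, four implicit, four explicit), so the full grid contains 108 prompts; repeated runs per prompt, if planned, would multiply that count.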