DEPLOYING ARTIFICIAL INTELLIGENCE (AI) AND LARGE LANGUAGE MODELS (LLMS) IN PSYCHIATRY DRUG DEVELOPMENT

Artificial Intelligence (AI) and Large Language Models (LLMs) are driving transformative change across healthcare and drug development. While these technologies have begun to address challenges in psychiatric clinical practice, their application in psychiatry drug development remains in its early stages. Nevertheless, they hold considerable promise for mitigating many of the methodological challenges that have historically hindered progress in this field.

Psychiatry clinical trials continue to have among the lowest success rates of any therapeutic area in drug development, contributing to the limited availability of new and effective treatments for brain health disorders. This low success rate stems from several factors, including uncertainty about the biological mechanisms of psychiatric illnesses, high placebo response rates, reliance on subjective efficacy endpoints, and variability in rater training, assessment, and performance. These issues increase data variability and reduce the ability to detect true drug effects amid study noise.

LLMs such as GPT-4, Claude, and Llama are advanced AI systems trained on vast text corpora (including books, scientific literature, and online content) to learn the structure, meaning, and context of human language. By recognizing linguistic patterns, LLMs can answer questions, translate text, and summarize information, demonstrating powerful language understanding and generation capabilities.

Psychiatric drug development relies heavily on clinician-reported outcome (ClinRO) rating scales as primary endpoints in pivotal trials. Instruments such as the Montgomery–Åsberg Depression Rating Scale (MADRS), Hamilton Anxiety Rating Scale (HAM-A), and Positive and Negative Syndrome Scale (PANSS) are semi-structured clinical interviews that quantify symptom severity through standardized questions and scoring rubrics. Despite their widespread use, these tools are limited by clinician bias, inter- and intra-rater variability, and the inherent subjectivity of psychological evaluation. Drug developers have therefore implemented measures such as rater training, centralized assessments, blinded data analytics, and third-party endpoint reviews to improve data quality, but these approaches add operational complexity and cost without fully resolving reliability concerns.

The structured, language-based nature of ClinROs makes them particularly well suited to the application of LLMs through prompt engineering and model fine-tuning. LLMs have the potential to transform psychiatric clinical trials by serving both as training tools that improve rater consistency and as oversight systems that independently score rating scales, providing an objective comparator to human assessments that could substantially improve the reliability and validity of study endpoints. This panel brings together psychometricians, drug developers, and AI experts who are actively integrating LLMs into late-stage psychiatry clinical trials.

Learning Objective 1: Explore how Large Language Models (LLMs) such as GPT-4 can be applied to enhance the reliability, objectivity, and efficiency of clinician-reported outcomes in psychiatry clinical trials.

Learning Objective 2: Discuss the practical, operational, and regulatory considerations for integrating LLMs into late-stage psychiatry clinical development based on emerging real-world implementations.

References

Volkmer, Sebastian, Andreas Meyer-Lindenberg, and Emanuel Schwarz. “Large Language Models in Psychiatry: Opportunities and Challenges.” Psychiatry Research 339 (2024): 116026.

Kolar, A., et al. “Using Large Language Models for Endpoint Oversight.” Poster presented at the International Society for CNS Clinical Trials Methodology (ISCTM), Amsterdam, Netherlands, October 2025.