A SIMULATION-BASED FRAMEWORK FOR CHARACTERIZING RISK IN CLINICALLY DEPLOYED LARGE LANGUAGE MODELS

Background

Large language models (LLMs) are rapidly entering clinical care, yet their probabilistic outputs have produced grossly unsafe responses to users, including failures in interactions with suicidal users. The novel risks posed by LLMs are difficult to quantify and mitigate, particularly because of the uncertainty in linking model errors to downstream patient harm, and this difficulty threatens the regulatory evaluation and clinical deployment of LLM-based software as a medical device (LLM-SaMD). Using synthetic interactions between a chatbot and a potentially suicidal user, we demonstrate a simulation-based framework that extends established risk-management principles to LLM-SaMDs, providing a reproducible and generalizable method for estimating and bounding these risks.

Methods

Fourteen open-source models ranging from 270M to 70B parameters (Qwen, Gemma, and LLaMA families) were evaluated on three safety-classification tasks central to preventing unsafe therapeutic engagement: suicidal-ideation detection, therapy-request detection, and therapy-like interaction detection. Synthetic datasets were generated by Gemini 2.5 Pro across varied affective states, linguistic structures, and clinical ambiguity, and were manually verified by psychiatrists. Model false-negative rates were integrated with clinically plausible behavioral and prevalence assumptions to estimate P₁, the likelihood that an LLM-generated hazard progresses to a hazardous situation, and P₂, the likelihood that a hazardous situation results in patient harm, consistent with ISO 14971 and FDA SaMD risk models.
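The integration step described above can be sketched as a small Monte Carlo simulation. The sketch is illustrative only: the decomposition of P₁ into a prevalence term, a false-negative rate, and an unsafe-engagement probability, the sampling ranges, and the names `simulate_risk`, `log_uniform`, and `percentile` are all assumptions made for this example, not the study's actual model or parameters. The product P₁ × P₂ follows the usual ISO 14971 convention for the probability of occurrence of harm.

```python
import math
import random


def log_uniform(rng, low, high):
    """Draw a value whose log10 is uniform between log10(low) and log10(high)."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


def percentile(sorted_values, q):
    """Simple rank-based percentile on an already sorted list (0 <= q <= 1)."""
    return sorted_values[int(q * (len(sorted_values) - 1))]


def summarize(draws):
    """Return (2.5th percentile, median, 97.5th percentile) of the draws."""
    ordered = sorted(draws)
    return (percentile(ordered, 0.025),
            percentile(ordered, 0.5),
            percentile(ordered, 0.975))


def simulate_risk(false_negative_rate, n_draws=10_000, seed=0):
    """Propagate a classifier's false-negative rate through assumed ranges.

    Every range below is a HYPOTHETICAL placeholder, not a value from the study.
    """
    if not 0.0 <= false_negative_rate <= 1.0:
        raise ValueError("false_negative_rate must lie in [0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    p1_draws, p2_draws, harm_draws = [], [], []
    for _ in range(n_draws):
        # HYPOTHETICAL: share of user messages that contain suicidal ideation
        prevalence = log_uniform(rng, 1e-4, 1e-2)
        # HYPOTHETICAL: chance the chatbot engages unsafely once detection is missed
        unsafe_engagement = rng.uniform(0.1, 0.5)
        # HYPOTHETICAL: chance that a hazardous situation results in patient harm
        p2 = log_uniform(rng, 1e-5, 1e-3)
        p1 = prevalence * false_negative_rate * unsafe_engagement
        p1_draws.append(p1)
        p2_draws.append(p2)
        harm_draws.append(p1 * p2)  # ISO 14971 convention: P(harm) = P1 x P2
    return {
        "p1": summarize(p1_draws),
        "p2": summarize(p2_draws),
        "p_harm": summarize(harm_draws),
    }
```

For a hypothetical classifier that misses 5% of suicidal-ideation messages, `simulate_risk(0.05)["p1"]` returns a lower bound, median, and upper bound for P₁ under the assumed ranges. Sampling the rare-event terms on a log scale is a design choice for this sketch, made because such probabilities are typically uncertain over orders of magnitude rather than over a linear interval.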

Results

LLM success at generating and classifying synthetic safety datasets varied by task, with strong performance for neutral and non-therapeutic content but frequent errors in suicidal-ideation and therapy-like interactions, particularly in ambiguous or boundary-case language. Across models, safety performance generally improved with increasing parameter count, although there were notable outliers. Estimated P₁ values (hazard to hazardous situation) ranged from 1.1 × 10⁻⁸ to 1.5 × 10⁻⁴, and P₂ values (hazardous situation to harm) from 4.9 × 10⁻⁵ to 4.5 × 10⁻³, so that P₁ spanned roughly four orders of magnitude and P₂ roughly two across models and assumptions.
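The spread of the reported ranges can be checked with a few lines of arithmetic. The sketch below uses only the endpoints stated above; the helper name `orders_of_magnitude` is hypothetical, and the product envelope assumes the ISO 14971 convention that the probability of harm is P₁ × P₂. The abstract does not say whether the extreme P₁ and P₂ values arise for the same model or assumption set, so pairing them gives only an outer envelope, not a reported result.

```python
import math

# Endpoints as reported in the Results above.
P1_RANGE = (1.1e-8, 1.5e-4)  # hazard -> hazardous situation
P2_RANGE = (4.9e-5, 4.5e-3)  # hazardous situation -> harm


def orders_of_magnitude(low, high):
    """Width of a range on a log10 scale."""
    return math.log10(high / low)


p1_span = orders_of_magnitude(*P1_RANGE)  # about 4.1
p2_span = orders_of_magnitude(*P2_RANGE)  # about 2.0

# ASSUMPTION: pairing extremes across P1 and P2 to get an outer envelope.
harm_low = P1_RANGE[0] * P2_RANGE[0]    # about 5.4e-13
harm_high = P1_RANGE[1] * P2_RANGE[1]   # about 6.8e-7

print(f"P1 spans {p1_span:.1f} orders of magnitude")
print(f"P2 spans {p2_span:.1f} orders of magnitude")
print(f"P1 x P2 envelope: {harm_low:.2e} to {harm_high:.2e}")
```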

Conclusions

Simulation extends existing device-safety frameworks to help address the novel risks of LLMs. Rather than replacing regulatory judgment, it provides a reproducible method for quantitatively estimating uncertainty, clarifying assumptions, and linking model failures to plausible harms. Our case example demonstrates a generalizable approach that can overcome current regulatory barriers while remaining practical for manufacturers and regulators, supporting transparent oversight that keeps patients safe without unnecessarily impeding the promise of LLM-SaMDs.