INTERPRETING PSYCHIATRIC DIGITAL PHENOTYPING DATA WITH LARGE LANGUAGE MODELS: A PRELIMINARY ANALYSIS
Background
Digital phenotyping, the passive monitoring of behavioral health via smartphone sensors, offers promising clinical applications but faces implementation challenges in translating complex multimodal data into actionable insights. Digital navigators, healthcare staff who interpret patient data for clinicians, provide a solution, but workforce limitations restrict scalability. Large language models (LLMs) may help address this gap, yet their performance in interpreting psychiatric digital phenotyping data remains unevaluated.
Objective
We sought to provide one of the first systematic evaluations of LLM performance in interpreting simulated psychiatric digital phenotyping data, establishing baseline accuracy metrics for this emerging application.
Methods
We evaluated GPT-4o and GPT-3.5-turbo on 153 test cases spanning a range of clinical scenarios, timeframes, and data quality levels, using simulated datasets currently employed to train human digital navigators. Performance was assessed by comparing each model’s identification of clinical patterns against that of expert human digital navigators.
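As an illustration of how an evaluation of this kind can be tallied, the following minimal Python sketch scores model interpretations against expert labels and reports accuracy per group. The record fields (`scenario`, `expert_label`, `model_label`) and the exact-match scoring rule are assumptions made for illustration; the abstract does not specify how agreement with the experts was judged.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: fraction correct}, grouping records by rec[key]."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[key]
        total[group] += 1
        # Assumed scoring rule: exact match between model and expert label.
        if rec["model_label"] == rec["expert_label"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Hypothetical records for demonstration only; not data from the study.
records = [
    {"scenario": "worsening depression",
     "expert_label": "worsening", "model_label": "worsening"},
    {"scenario": "worsening depression",
     "expert_label": "worsening", "model_label": "worsening"},
    {"scenario": "worsening anxiety",
     "expert_label": "worsening", "model_label": "worsening"},
    {"scenario": "worsening anxiety",
     "expert_label": "worsening", "model_label": "stable"},
]

per_scenario = accuracy_by_group(records, "scenario")
```

The same function can be reused with a different `key` (for example a hypothetical `data_quality` or `timeframe` field) to stratify accuracy along the other dimensions named above.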
Results
GPT-4o demonstrated 52% accuracy (95% CI 46.5%-57.6%) in identifying clinical patterns, significantly outperforming GPT-3.5-turbo (12%, 95% CI 8.4%-15.6%). Performance varied substantially across clinical scenarios: GPT-4o achieved 100% accuracy for worsening depression and 83% for worsening anxiety, but only 6% for increased home time with improving symptoms. Accuracy declined with decreasing data quality (69% for high-quality data vs. 39% for low-quality data) and shorter observation timeframes (60% for 3-month data vs. 43% for 3-week data). Hallucinations were occasionally observed, with models generating clinically plausible but incorrect interpretations.
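For readers unfamiliar with how a 95% confidence interval for an accuracy figure is obtained, the sketch below computes two common intervals for a binomial proportion in Python. The abstract does not state which interval method was used or how many scored responses underlie each percentage, so the choice of the Wald and Wilson formulas and the counts in the usage lines are assumptions for illustration only.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clip to [0, 1] because the approximation can overshoot the bounds.
    return max(0.0, p - half_width), min(1.0, p + half_width)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval, better behaved for small n or extreme p."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts (50 correct out of 100), not figures from the study.
wald_low, wald_high = wald_interval(50, 100)
wilson_low, wilson_high = wilson_interval(50, 100)
print(f"Wald:   {wald_low:.1%} to {wald_high:.1%}")
print(f"Wilson: {wilson_low:.1%} to {wilson_high:.1%}")
```

The Wilson interval is often preferred when a subgroup is small or its accuracy is near 0% or 100%, where the Wald interval collapses toward zero width.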
Conclusions
GPT-4o’s 52% accuracy establishes a meaningful baseline, though performance gaps and hallucinations confirm that human oversight in digital navigation tasks remains essential. These results suggest that current LLMs are less reliable than traditional algorithmic approaches for raw sensor data parsing, yet may still help address workforce scalability challenges when positioned as assistive tools that augment rather than replace human digital navigators. Future research should explore agentic architectures that combine traditional pipelines for structured data processing with LLM capabilities for clinical reasoning and synthesis—potentially capturing the strengths of both approaches to improve psychiatric digital phenotyping workflows.
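One way to picture the hybrid arrangement proposed above is sketched below in Python: deterministic code reduces raw daily values to a few summary statistics, and only that compact summary is handed to a language model for interpretation. Everything here is hypothetical, including the feature names, the prompt wording, and the `interpret` callable that stands in for an LLM client; it is not the authors' design.

```python
from statistics import mean
from typing import Callable, Sequence


def linear_trend(values: Sequence[float]) -> float:
    """Least-squares slope of values against day index (units per day)."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def summarize(home_hours: Sequence[float], symptom_scores: Sequence[float]) -> dict:
    """Traditional pipeline step: reduce raw series to structured features."""
    return {
        "home_hours_mean": round(mean(home_hours), 2),
        "home_hours_trend": round(linear_trend(home_hours), 3),
        "symptom_mean": round(mean(symptom_scores), 2),
        "symptom_trend": round(linear_trend(symptom_scores), 3),
    }


def build_prompt(summary: dict) -> str:
    """Render the structured features as text for the language model."""
    lines = [f"{name}: {value}" for name, value in summary.items()]
    return "Describe clinically relevant patterns in these features:\n" + "\n".join(lines)


def run_pipeline(home_hours, symptom_scores, interpret: Callable[[str], str]) -> str:
    """Compute features deterministically, then delegate interpretation."""
    return interpret(build_prompt(summarize(home_hours, symptom_scores)))


# Stub interpreter that echoes the prompt; a real system would call an LLM
# here, and a human digital navigator would review the output.
draft = run_pipeline([10, 12, 14], [5, 4, 3], interpret=lambda prompt: prompt)
```

Keeping the numeric work in ordinary code means the model never has to parse raw sensor values, which is the task the results above suggest it handles least reliably.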