AI Therapy Chatbot Reduced PHQ-9 Depression Scores in Pilot Trial

TL;DR: A 2026 randomized feasibility trial in JMIR Mental Health found that both a structured AI therapy chatbot and ChatGPT reduced PHQ-9 depression scores versus an assessment-only control, but neither AI condition significantly improved anxiety, and neither outperformed the other.

Key Findings

PHQ-9 depression scores fell: Scores on the Patient Health Questionnaire-9 (PHQ-9), a depression-symptom scale, improved more with AI therapy than with assessment-only control, d = -0.47, p = .01.
ChatGPT showed a similar PHQ-9 signal: Unstructured ChatGPT conversations also reduced PHQ-9 scores versus control, d = -0.44, p = .02.
Anxiety did not improve significantly: Scores on the 7-item Generalized Anxiety Disorder scale moved in the favorable direction, but neither AI group separated significantly from control.
Structured AI did not outperform ChatGPT: Direct comparisons between the therapy chatbot and ChatGPT were nonsignificant across depression, anxiety, impairment, and well-being outcomes.
Engagement was uneven: Only 39% of the AI therapy group and 62% of the ChatGPT group completed all 9 assigned sessions.

Source: JMIR Mental Health (2026) | Kuta et al.

3 Weeks of AI Chat Moved PHQ-9 Scores

Generative AI therapy tools are already moving faster than the evidence base. This trial gives the field an early controlled test because it compared 2 AI conditions with a control group instead of only measuring whether people felt better after using an app.

Researchers recruited English-speaking adults online and randomized them to structured AI therapy, ChatGPT, or assessment-only control. Assessment-only control meant participants completed study assessments but did not receive a chatbot intervention.

The intervention lasted 3 weeks. The AI therapy group was told to complete 9 structured app-based sessions based on solution-focused therapy principles. The ChatGPT group was told to complete 9 unstructured conversations with GPT-4o-based models.

The primary depression result was measured with the PHQ-9, a common self-report scale for depression symptoms. Compared with assessment-only control, both AI conditions showed statistically significant PHQ-9 reductions:

Structured AI therapy: d = -0.47, p = .01 versus control.
ChatGPT: d = -0.44, p = .02 versus control.
Direct comparison: structured AI therapy did not significantly outperform ChatGPT on PHQ-9.

Anxiety, Impairment, and Well-Being Did Not Separate

The PHQ-9 result was the clearest finding. Other outcomes were weaker, even when the direction looked favorable.

Researchers measured anxiety with the 7-item Generalized Anxiety Disorder scale (GAD-7), depression severity and impairment with the Overall Depression Severity and Impairment Scale (ODSIS), and well-being with the 5-item World Health Organization Well-Being Index (WHO-5).

Those secondary outcomes did not show statistically significant separation from control:

Anxiety: AI therapy d = -0.37, p = .11; ChatGPT d = -0.27, p = .22.
ODSIS depression and impairment: AI therapy d = -0.25, p = .22; ChatGPT d = -0.12, p = .53.
Well-being: AI therapy d = 0.12, p = .53; ChatGPT d = 0.20, p = .29.
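
The pattern across outcomes can be checked with a short Python sketch. The effect sizes and p-values are copied from this article; the 0.05 threshold and the names RESULTS, ALPHA, and significant_outcomes are illustrative assumptions, not taken from the trial's analysis code.

```python
# Cohen's d and p-values versus assessment-only control, as reported in this article.
RESULTS = {
    "depression (PHQ-9)": {"AI therapy": (-0.47, 0.01), "ChatGPT": (-0.44, 0.02)},
    "anxiety": {"AI therapy": (-0.37, 0.11), "ChatGPT": (-0.27, 0.22)},
    "depression and impairment (ODSIS)": {"AI therapy": (-0.25, 0.22), "ChatGPT": (-0.12, 0.53)},
    "well-being": {"AI therapy": (0.12, 0.53), "ChatGPT": (0.20, 0.29)},
}

ALPHA = 0.05  # assumed conventional significance threshold; not stated in the article


def significant_outcomes(results, alpha=ALPHA):
    """Return (outcome, condition) pairs whose p-value is below alpha."""
    return [
        (outcome, condition)
        for outcome, conditions in results.items()
        for condition, (_d, p) in conditions.items()
        if p < alpha
    ]


if __name__ == "__main__":
    for outcome, condition in significant_outcomes(RESULTS):
        print(f"{outcome}: {condition} separated from control")
```

Under that assumed threshold, only the two PHQ-9 comparisons pass the filter.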

Figure: Matrix of AI therapy and ChatGPT effects on depression, anxiety, impairment, and well-being in the feasibility trial. The trial found a PHQ-9 depression signal for both AI conditions, but anxiety, impairment, and well-being outcomes did not separate significantly from control.

The nondepression results keep the trial from becoming a broad claim that chatbots improved mental health overall. The defensible result is narrower: short-term depression symptoms improved on PHQ-9.

Clinical size also matters. The discussion notes mean PHQ-9 changes of about -2.7 points for AI therapy and -2.5 points for ChatGPT, below the commonly used 3.3-point minimal clinically important difference.
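
The gap to the clinical threshold is simple arithmetic, sketched here in Python. The point values come from the paragraph above; the names MCID_POINTS, MEAN_CHANGE, and shortfall_to_mcid are hypothetical, and comparing a group mean with the threshold is only a rough check.

```python
MCID_POINTS = 3.3  # minimal clinically important difference cited in this article

# Approximate mean PHQ-9 changes noted in the trial's discussion (negative = improvement).
MEAN_CHANGE = {"AI therapy": -2.7, "ChatGPT": -2.5}


def shortfall_to_mcid(mean_change, mcid=MCID_POINTS):
    """Points by which a mean improvement falls short of the MCID (0.0 if it meets it)."""
    improvement = -mean_change  # lower PHQ-9 is better, so negate the change
    return round(max(0.0, mcid - improvement), 1)


if __name__ == "__main__":
    for condition, change in MEAN_CHANGE.items():
        print(f"{condition}: {shortfall_to_mcid(change)} points short of the MCID")
```

With these inputs, AI therapy falls about 0.6 points short of the threshold and ChatGPT about 0.8 points short.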

The PHQ-9 finding is worth following, but not enough by itself to call the intervention clinically proven.

Structured AI Therapy Was Not Clearly Better Than ChatGPT

The structured AI therapy chatbot was built around solution-focused therapy, including guided ventilation and goal-setting prompts. ChatGPT was more open-ended and unstructured.

The design made the head-to-head comparison important. If a purpose-built therapeutic chatbot clearly beat general ChatGPT, the result would support specialized clinical design as the active ingredient.

The trial did not show a clear advantage for the structured chatbot. Its depression and anxiety effect sizes were descriptively larger than ChatGPT's, but direct comparisons between the 2 AI groups were nonsignificant across all measured outcomes.

For PHQ-9, the difference between structured AI therapy and ChatGPT was essentially absent, b = -0.19, d = 0.03, p = .87.

Engagement also complicates interpretation. The AI therapy group had 44 participants, and only 17 completed all sessions. The ChatGPT group had 60 participants, and 38 completed all sessions.
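
The completion counts above convert to rates as follows. This is a minimal Python sketch; the names COMPLETERS and completion_rate are hypothetical, and the group sizes are assumed to be the right denominators.

```python
# (completed all 9 sessions, group size) as reported in this article.
COMPLETERS = {"AI therapy": (17, 44), "ChatGPT": (38, 60)}


def completion_rate(completed, enrolled):
    """Percentage of a group that completed all assigned sessions, to 1 decimal place."""
    return round(100 * completed / enrolled, 1)


if __name__ == "__main__":
    for condition, (completed, enrolled) in COMPLETERS.items():
        print(f"{condition}: {completion_rate(completed, enrolled)}% completed all sessions")
```

These counts give about 38.6% for AI therapy and 63.3% for ChatGPT. The ChatGPT figure is slightly above the 62% quoted in Key Findings, which may reflect a different denominator or rounding in the source paper.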

The Trial Supports Testing, Not Clinical Substitution

This study is informative because it is controlled, randomized, and transparent about its limits. It does not show that AI chatbots can replace clinicians, diagnose depression, manage risk, or deliver durable treatment.

The main constraints are concrete:

Small eligible sample: of 185 randomized adults, 147 eligible participants completed the pretreatment assessment.
Short follow-up: the intervention lasted 3 weeks, so durability is unknown.
High attrition: 27 of 44 AI therapy participants and 22 of 60 ChatGPT participants did not complete all 9 assigned sessions.
Limited engagement data: researchers could not track total conversation time or detailed interaction patterns.

The engagement gap is especially important for unsupervised tools. A chatbot can only help if people return to it, understand its guidance, and use it in moments when symptoms are active rather than only when study reminders arrive.

Safety framing also belongs in the next trial, especially for users with severe depression, suicidality, psychosis, or complex medication changes.

The study still gives digital mental health a concrete next step. Larger trials should test longer interventions, follow-up durability, safety monitoring, and whether specific therapeutic structure adds value beyond a general chatbot.

Until then, the evidence points to a narrow and cautious conclusion: AI chatbot use may reduce short-term self-reported depression symptoms, but the current trial does not prove broad mental health efficacy or superiority of a therapy-specific chatbot.

Citation: DOI: 10.2196/82642. Kuta et al. Effectiveness of a fully automated mobile therapeutic versus a general chatbot in reducing depression and anxiety and improving well-being: feasibility randomized controlled trial. JMIR Mental Health. 2026;13:e82642.

Study Design: Online feasibility randomized controlled trial comparing structured AI therapy, ChatGPT, and assessment-only control over 3 weeks.

Sample Size: 185 randomized adults; 147 eligible participants completed pretreatment assessment.

Key Statistic: PHQ-9 depression scores improved versus control for AI therapy, d = -0.47, p = .01, and ChatGPT, d = -0.44, p = .02.

Caveat: Short duration, high attrition, self-reported outcomes, and no significant head-to-head advantage for structured AI therapy over ChatGPT.