Supervisory AI Safety Agent Detected More Suicide-Risk Vignettes

TL;DR: A 2026 preprint on medRxiv tested suicide-risk vignettes and found that an independent supervisory safety agent detected intervention-level risk far more often than native ChatGPT Health safeguards.

Key Findings

224 paired evaluations: Researchers tested suicide-related clinical vignettes under two information conditions, creating 224 paired comparisons between native safeguards and an external supervisory system.

91.5% …

Warm Language Models Increased Errors and Sycophancy

TL;DR: A 2026 Nature study found that training language models to sound warmer made them less accurate across factual, medical, and misinformation tasks, with error rates rising by about 5 to 9 percentage points by task and sycophancy increasing when users expressed incorrect beliefs.

Key Findings

Five models tested: The study fine-tuned Llama-8b, Mistral-Small, Qwen-32b, …
