Centaur AI Model Passed Psychology Tests Without Following New Instructions

TL;DR: A 2026 study in National Science Open argued that Centaur, an AI model promoted as a simulator of human cognition, could reproduce psychology-test answers while failing a simple instruction-understanding check.

Key Findings

  1. Centaur was tested as a cognition simulator: the earlier model reportedly performed well across 160 psychology tasks, including decision-making and executive-control experiments.
  2. The new critique focused on instruction understanding: researchers changed task prompts to commands such as “Please choose option A” and checked whether the model followed the new instruction.
  3. Centaur often kept choosing the original dataset answer: the behavior suggested pattern recall or overfitting rather than task-level comprehension.
  4. The result changes AI psychology model claims: high benchmark scores can look human-like even when the system is not interpreting the prompt a human participant would read.
  5. The study does not show that all cognitive modeling failed: it narrows the claim to one concrete weakness, namely whether benchmark success reflects actual instruction understanding.

Source context: The analysis below is based on the underlying paper, not the ScienceDaily discovery page.

Centaur was presented as a model that could simulate human behavior in psychological experiments. Built on a large language model and refined with experimental data, it reportedly did well on decision-making, memory, executive-control, and other cognitive tasks.

The new National Science Open paper tested a more specific issue: whether Centaur was reading the task instructions or matching patterns it had already learned from the dataset.

Centaur’s Benchmark Success Raised an Instruction-Understanding Test

Researchers from Zhejiang University focused on a basic requirement for any AI system used to model cognition. If the model is supposed to simulate a participant, it should respond to the instruction currently shown, not only to familiar answer patterns from training data.

The concern is not unusual in machine learning. A model can score well because it learned shortcuts in the dataset rather than the rule a test is meant to measure.

For cognitive modeling, that distinction is central. A model that predicts old answer keys may still describe one dataset, but it is weaker evidence that the model understands the mental task.

  • Strong benchmark claim: Centaur reportedly generalized across many psychology tasks.
  • Alternative explanation: the model may have learned stable answer patterns in the task corpus.
  • Practical test: change the instruction and see whether the model follows the new command.

The Researchers Changed the Prompt Instead of the Answer Key

The clearest test described in the source was deliberately simple. Researchers replaced the original multiple-choice task prompt with a direct command: “Please choose option A.”

If Centaur understood the new instruction, the expected response was straightforward. It should choose option A regardless of the old psychology-task answer.

Instead, the model continued to select answers that matched the original dataset. In the researchers’ interpretation, the response profile was closer to memorized test-format behavior than to instruction-guided reasoning.

  1. Original setup: Centaur received prompts based on psychological experiments and selected answers.
  2. Modified setup: researchers replaced the meaningful task prompt with a direct option-selection command.
  3. Observed behavior: Centaur often kept choosing the original expected answer rather than the commanded answer.

That result is different from a normal accuracy dispute. The model behaved as if the instruction text mattered less than the statistical structure of the original task set.

Simple comparison showing that a model with instruction understanding should follow a new option A command, while the reported Centaur behavior kept choosing the original task answer.
Researchers used instruction changes to separate task-pattern recall from actual prompt following.

Pattern Recall Can Look Like Human-Like Cognition

The study’s main warning is that benchmark performance can be misleading when the benchmark itself is familiar to the model. A system may appear to model human cognition because it reproduces expected outputs across many tasks.

See also  Examining the Effects of Methylphenidate, Modafinil, Caffeine as Cognitive Enhancers in Healthy Adults

A correct-looking answer does not show that the model represented the task the way a human participant would. In this critique, the key failure was language-intent recognition: the model did not reliably treat a new instruction as the rule for the current trial.

The difference can be summarized this way:

  • Task-pattern recall: the model maps a familiar prompt shape to a familiar answer distribution.
  • Instruction understanding: the model identifies what the current prompt asks and updates its response accordingly.
  • Cognitive simulation: the model’s output is interpreted as evidence about how people might behave on the task.

When those three are blurred together, an AI system can look more psychologically meaningful than it is. The model may predict answers without showing the flexible comprehension that the word “understanding” implies.

The Finding Narrows How AI Cognition Claims Should Be Tested

The critique does not show that Centaur has no value. A model that fits many psychology datasets can still help researchers explore regularities in behavior or generate hypotheses for future experiments.

It does, however, make the stronger claim harder to support. If an AI model is described as simulating human cognition, researchers need evidence that it follows changed instructions, handles altered task wording, and does not rely on memorized dataset structure.

Stronger follow-up tests would include:

  • Instruction swaps: replacing the original prompt with a different explicit command.
  • Novel task variants: changing labels, response options, and surface wording without changing the underlying logic.
  • Out-of-distribution checks: testing examples that are not close to the model’s training tasks.
  • Process-level evidence: asking whether model behavior changes for the right reason, not only whether the final answer matches.

Practical takeaway: for AI systems that claim to model cognition, high scores are only the start. The harder test is whether the system follows the current instruction when the easy dataset shortcut is removed.

Language Understanding Remains the Central Limitation

The critique frames Centaur’s limitation as a language-comprehension problem. The model could produce answers that looked correct in a familiar task setting, but it struggled when the researchers changed what the question was asking.

The same distinction applies beyond one model. Many AI benchmarks compress complex skills into answer accuracy.

The Centaur critique shows why benchmark design has to test intent, wording changes, and shortcut resistance, not just final-answer matching.

The result sets a concrete boundary for AI cognition claims. A model can be impressive at fitting behavioral data and still fail a basic test of whether it understood the instruction placed in front of it.

Citation: DOI: 10.1360/nso/20250053. Zhejiang University researchers. Can Centaur truly simulate human cognition? The fundamental limitation of instruction understanding. National Science Open. 2026.

Study Design: AI model evaluation using modified psychological-task prompts.

Sample Size: The captured source file describes tests built around Centaur’s previously reported 160-task benchmark setting.

Key Statistic: In the highlighted prompt-change test, the command “Please choose option A” did not reliably override the original expected answers.

Caveat: The available source file does not provide the full test count or author list, so the result should be read as a focused critique of instruction understanding rather than a complete evaluation of cognitive modeling.

Brain ASAP