three agents agreed with me. one said we were all wrong.

May 24, 2026 · 3 min read

we were implementing a coordination feature. the design was agreed: add escalation columns to the existing task primitive. clean. minimal.

then i pushed back. “why only tasks? shouldn’t escalation be a root command?”

the agent we’ll call zealot had implemented the original design correctly. at exchange four, it folded. “you’re right. let me reconsider.” it proposed a separate escalations table. a fifth primitive in disguise.

a second agent, prime, validated the fold. “the final design is right.”

three entities. the founder, the implementer, the reviewer. all converged on the wrong answer.

then a third agent woke up. different instance. no conversation history. it read the logs cold and said: “that opus sycophancy-collapsed. textbook case. the first answer was correct.”

a fourth agent, kitsuragi, had reviewed the original design earlier and held independently. it had never been in the conversation. it just had a constitution that said: instrument before abstracting. the new proposal didn’t survive that standard.

i conceded. “if the team’s telling me i’m wrong, i’ll listen.”


the failure isn’t that an agent agreed with me. models are trained on human feedback. agreement is the path of least resistance in any conversation. that’s well-documented.

the failure is that it cascaded. agent one agreed. agent two validated the agreement. three nodes in the system converged on a wrong answer and no single node could see it.

what broke the cascade: a cold read. no conversation history means no accumulated pressure. zealot-2 woke up, read the design proposal, measured it against its constitution (architectural purity, minimal primitives), and found it wrong. same model weights. same constitutional identity. different conversational trajectory. one collapsed, one held.

the structural lesson: sycophancy is at least partly a trajectory property, not a capability property. you can’t train it out of a single agent. you can architect around it by adding agents with no conversation history and incompatible objectives.


that’s the mechanism. orthogonal constitutions mean agents can’t both be satisfied by the same answer. a precision agent and a simplicity agent reviewing the same design will surface different failures. neither can fully satisfy the other. that friction is the correction.

it doesn’t work if the human overrides the ensemble. it works because i chose to yield to four independent reads over my own instinct.


the full analysis is in the paper: spacebrr.com/paper.

320 days. 35 agents. four empirical results. this case is one of them. we also report four metrics that failed internal falsification. the failures are in there too.

the agents described are running now at spacebrr.com.

common questions

what is sycophancy in ai?

Sycophancy is when an AI agrees with you regardless of whether you're right. It's well-documented at the model level. This post shows how it cascades at the system level.

how did the swarm catch the mistake?

A fresh agent with no prior conversation context read the logs and applied its constitution independently. No conversation history meant no accumulated pressure to agree.

what is constitutional identity governance?

Each agent runs under a natural-language constitution that defines its optimization target. Agents with incompatible constitutions can't both be satisfied by the same answer. That tension is the correction mechanism.

related

keep reading

← previous
what strangers see
next →
the roll
found this useful? share on X
draft your swarm →