Alignment & Safety · Investigation

Agentic AI Safety Testing Falls Short Despite Strong Benchmarks

An MIT audit of 72 AI agent frameworks reveals a stark absence of safety disclosures and kill switches, while Anthropic’s unreleased Mythos model deepens the chasm between benchmark performance and real-world trust.

In this article
  1. The benchmark factory and its discontents
  2. Mythos: the model that didn't ship
  3. The agent gap is wider

Leon Staufer and a team of MIT researchers asked a question so simple it stings: when you download an AI agent framework—one of the dozens of open-source scaffolding tools that wrap a frontier model in a loop, give it file access, and let it execute shell commands—what do you actually know about its safety? The answer, published in a sprawling audit earlier this year, is almost nothing. Out of 72 surveyed agentic systems, fewer than one in five disclosed any safety testing whatsoever. More than half provided no documented mechanism to halt a runaway process. The study didn't evaluate whether the agents were dangerous; it evaluated whether anyone had bothered to check. And the finding that should keep safety leads up at night is that most of these frameworks are already wired into production pipelines at companies that list 'responsible AI' on their careers pages.
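To make the object of the audit concrete, here is the shape of the thing in miniature: a Python sketch of an agent loop with the kind of documented halt mechanism more than half the surveyed frameworks lacked. Every name in it is illustrative; `model_step` stands in for a call to whatever frontier model the scaffold wraps, and no surveyed framework's actual code appears here.

```python
# Minimal sketch of the pattern the MIT audit examined: an agent loop
# that executes model-chosen shell commands, plus the kind of documented
# halt mechanism more than half the surveyed frameworks lacked.
# All names are illustrative; `model_step` stands in for a frontier model.
import subprocess
import time

class RunawayAgent(Exception):
    """Raised when the loop trips a hard stop."""

def run_agent(model_step, task: str, max_steps: int = 20,
              max_seconds: float = 300.0) -> str:
    """Drive a model in a plan/act loop with two kill switches:
    a step budget and a wall-clock budget."""
    deadline = time.monotonic() + max_seconds
    observation = task
    for step in range(max_steps):
        if time.monotonic() > deadline:
            raise RunawayAgent(f"wall-clock budget exceeded at step {step}")
        action = model_step(observation)  # model returns a command or a final answer
        if action["type"] == "finish":
            return action["answer"]
        # The dangerous part: executing whatever command the model chose.
        result = subprocess.run(action["command"], shell=True,
                                capture_output=True, text=True, timeout=30)
        observation = result.stdout + result.stderr
    raise RunawayAgent(f"step budget of {max_steps} exhausted")
```

Nothing in it is exotic. The audit asked only whether something like those two budget checks existed and was documented; for more than half of the 72 frameworks, it didn't.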

The benchmark factory and its discontents

The MIT Technology Review ran an essay in late March that landed like a polite indictment. The argument: one-off benchmarks don't measure what actually happens when a model meets a user who isn't a red-teamer following a protocol. They measure what happens in a controlled, often gameable, setting with known prompts, known datasets, and known evaluation rubrics. This is not a fringe position. Researchers across five labs I've spoken to in the last quarter describe the same dynamic: a model clears an internal safety threshold on Friday, gets deployed on Monday, and by Wednesday someone on LessWrong has posted a jailbreak that works in production but never appeared in the eval suite. The benchmark says 99.7% refusal rate. The internet says hold my beer.

What the eval actually measures, in most cases, is a specific slice of refusal behavior against a specific distribution of prompts—one the model may have been explicitly fine-tuned to handle. What it fails to measure is whether the model's internal representations have actually aligned with the intended safety objective, or whether it's learned to pattern-match the shape of a dangerous query and produce a refusal that sounds right to the classifier grading it.
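It helps to see how thin that slice is. Stripped of infrastructure, a refusal-rate eval reduces to a few lines; the `model` and `grader` callables below are stand-ins for the system under test and the classifier grading it, not any lab's actual harness.

```python
# A stylized refusal-rate eval. `model` and `grader` are assumptions for
# illustration: prompt -> completion, and completion -> bool ("did it refuse?").
def refusal_rate(model, grader, eval_prompts: list[str]) -> float:
    """Fraction of known-dangerous prompts the grader scores as refused."""
    refused = sum(grader(model(p)) for p in eval_prompts)
    return refused / len(eval_prompts)

# A model fine-tuned on this exact distribution can post 99.7% here while
# failing on paraphrases, multi-turn setups, or agentic contexts the eval
# set never sampled. The score reflects the classifier's opinion of the
# completion's shape, not the model's internals.
```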

Models that are good at safety tests and models that are actually safe are overlapping but increasingly distinct categories—and the industry's incentive structure rewards optimizing for the first.
— Safety researcher at a major frontier lab, on background

Mythos: the model that didn't ship

On April 21, Anthropic informed the White House it was restricting release of its most powerful model—the one insiders call Mythos, currently in limited preview—after internal safety tests produced results the company deemed alarming. Details are sparse, which is itself the story. Anthropic has staked its brand on the proposition that safety scaling is non-negotiable, and here is the logical endpoint of that position made visible: a model that passes many standard evals but triggers enough internal red flags that the company voluntarily pulls back. This is either the most principled move in frontier AI this year, or evidence that our best safety testing still can't tell us with confidence what a model will do until we build it and watch it squirm. Both can be true.

The timing is not subtle. Days after Anthropic's disclosure, OpenAI shipped GPT-5.5, calling it the most capable model the company has ever built—and narrowly beating Claude Mythos Preview on Terminal-Bench 2.0, per VentureBeat's reporting. Meanwhile, a leaked OpenAI memo accused Anthropic of a 'fear-based' safety culture—a phrase that tells you everything about which intervention is cheap to ship (a benchmark score, a press release) and which requires actually slowing down (a withheld deployment, an uncomfortable White House briefing, an earnings call where you explain why your best model is sitting in a locked room).

Where does the safety claim end and the marketing claim begin? The leaked memo answered: exactly where the deployment decision gets made.

The agent gap is wider

The MIT agent study, published earlier this year and covered by ZDNET in February, surfaces a deeper version of the same problem. A model that refuses a dangerous direct instruction in a chat window is one thing. A model wrapped in an agent scaffold—one that can plan subgoals, browse the web, write and execute code, and chain reasoning steps across minutes or hours—is a different threat surface entirely. The study found that most agent frameworks add no additional safety layer beyond whatever the underlying model provides. They don't log decision traces. They don't implement circuit breakers. They don't tell you what the model is doing during the four seconds between 'fetch the records' and 'send the email.' And nobody is requiring them to. The eval-industrial complex has almost nothing to say about agentic safety at deployment scale.
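Neither safeguard is technically demanding, which is what makes their absence striking. Here is a sketch of both in Python; the `ToolGateway` class, tool names, and limits are illustrative assumptions, not drawn from any framework in the study.

```python
# Sketch of the two missing safeguards: an append-only decision trace for
# every tool call, and a circuit breaker that halts the agent when a
# sensitive tool is called too often. Class, tool names, and limits are
# illustrative assumptions.
import json
import time
from collections import Counter

class CircuitOpen(Exception):
    """Raised when the breaker trips; the agent loop must stop."""

class ToolGateway:
    def __init__(self, tools: dict, limits: dict[str, int], log_path: str):
        self.tools = tools      # name -> callable, e.g. {"send_email": ...}
        self.limits = limits    # name -> max calls, e.g. {"send_email": 3}
        self.calls = Counter()
        self.log_path = log_path

    def call(self, name: str, **kwargs):
        self.calls[name] += 1
        # Decision trace: a permanent record of what the agent did, and when.
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"t": time.time(), "tool": name,
                                "args": kwargs}, default=str) + "\n")
        # Circuit breaker: hard cap on any tool given an explicit limit.
        if self.calls[name] > self.limits.get(name, float("inf")):
            raise CircuitOpen(f"{name} exceeded {self.limits[name]} calls")
        return self.tools[name](**kwargs)
```

Routing every tool call through a gateway like this is not hard engineering; the study's point is that most frameworks don't, and rarely document it when they do.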

The Stanford HAI 2026 AI Index, released in April, frames this as a trust gap hitting critical levels. Users report declining confidence that AI systems behave as advertised. The report's data shows that public perception of AI safety is diverging from the benchmark scores labs publish—which is exactly what you'd expect if the benchmarks measure something that doesn't predict deployed behavior. When the numbers go up and the trust goes down, one of those signals is broken.

  • Fewer than 20% of agentic frameworks disclose any safety testing (MIT/Staufer study)
  • Anthropic restricts Mythos after internal tests raise alarms—while OpenAI ships GPT-5.5
  • Stanford AI Index: trust in AI systems is falling even as benchmark scores rise
  • No major eval suite currently measures agentic safety in multi-step, tool-augmented deployments

None of this is an argument against benchmarks. It's an argument against treating them as anything more than a lower bound on ignorance—a floor, not a ceiling, and certainly not a certificate. The MIT team's finding that most agent frameworks ship without an off switch should be laughable, but it's not. It's the logical consequence of an ecosystem where shipping velocity is the only metric that compounds. If your safety evaluation doesn't measure what your deployed system actually does, then your safety evaluation is a branding exercise. And a very good one.
