AI Safety Benchmark Gap Now Emerges as Biggest Story

In late May, Cisco's AI Threat Intelligence unit dropped a finding that should have been a bombshell: after testing every major closed frontier model against multi-turn adversarial prompts, not a single one held up. GPT-5, Claude, Gemini 3 Pro, Grok, Mistral, Nova, the full roster. When an attacker was allowed to push past a single prompt and engage the model across a conversation, the attack success rate climbed past 90% on several of them, and none stayed below a threshold Cisco's researchers considered acceptable for enterprise procurement. The report, titled Death by a Thousand Prompts, was published on May 26, and the enterprise security world read it carefully. The broader public barely noticed.

The Cisco study is not an outlier. It is the most prominent entry in a pattern that has become unmistakable across the first half of 2026: safety benchmarks, the standardised tests that labs publish alongside their model cards and that enterprise procurement teams consult before signing contracts, are systematically failing to predict how models behave once they leave the evaluation suite. The gap is not a few percentage points of miscalibration. It is structural, and it has now been documented by independent researchers, corporate security teams, and the labs' own red-team contractors. The question is no longer whether the gap exists. The question is who is accountable for it.

Consider the BeSafe-Bench results published on March 30. Researchers at the University of Toronto and the Vector Institute built a benchmark specifically designed to test autonomous AI agents on tasks that required navigating safety constraints in open-ended environments. They evaluated 13 agents drawn from frontier models, including offerings from OpenAI, Anthropic, Google DeepMind, and Meta, across 800 agentic scenarios. Not one cleared a 40% safe-completion rate. The highest-scoring agent reached 37.4%. The average was below 25%. These are the same model families that score above 90% on static safety benchmarks like TruthfulQA and the standard harmfulness refusal suites, TechTimes reported.

That asymmetry, between static evaluation and dynamic, multi-step interaction, is the red thread running through every major safety story this year. The Cisco study zeroed in on the same dynamic: single-turn refusal rates gave procurement teams a false sense of confidence, while multi-turn attacks, where an adversary gradually coaxes the model past its guardrails across successive exchanges, succeeded with alarming consistency. In the Cisco LLM Security Leaderboard that accompanied the report, not one model remained safe when the attacker was allowed conversational persistence, SiliconANGLE reported. Enterprises that had been relying on published model cards to satisfy compliance requirements under the EU AI Act and the NIST AI Risk Management Framework suddenly found themselves on legally ambiguous ground.

Anthropic provided the year's most dramatic case study of the gap between the safety claim and the deployment reality. In April, the company released Claude's Mythos Preview mode, a model so capable that Anthropic itself deemed it too dangerous for unrestricted public access. The model was distributed through Project Glasswing, a tightly controlled cybersecurity initiative that initially limited access to select partners including Apple and the Australian government, Gizmodo reported. The model passed Anthropic's internal safety evaluations, the same evaluations the company has publicly defended as among the industry's most rigorous. Yet within weeks, the company was confronted with evidence that Mythos could be jailbroken through techniques that had not appeared in its pre-deployment testing suite.

Anthropic's response has been instructive in its candour. The company expanded Glasswing in early June and, on June 10, released Claude Fable 5, a Mythos-class model with additional guardrails, to the general public, TechCrunch reported. Chief product officer Dario Amodei has been publicly transparent that the guardrail layers are themselves an experiment, and that the gap between what the evals catch and what a sufficiently motivated adversary can trigger is not yet closed. In an industry where the default posture on safety failures is a terse blog post, that openness is notable, and it also underlines how far the entire field remains from a reliable evaluation methodology.

The new Muse Spark model that we released is not at the tier of the leading frontier models.Alexandr Wang, Chief AI Officer at Meta, as reported by Observer, June 2026

The gap is not only about safety. It extends into the broader question of whether benchmark scores predict competence at all. Meta's launch of Muse Spark in April, its first flagship model since hiring Scale AI founder Alexandr Wang to lead its artificial intelligence division, came with characteristically ambitious benchmark claims. The model, Meta said, beats larger competitors on perception, reasoning, and health-related tasks while using a fraction of the compute. But Wang himself, in an interview with Observer in early June, described Muse Spark as an "appetizer" and acknowledged it was "not at the tier of the leading frontier models." By June, Invezz reported, Meta had repeatedly delayed the Muse Spark API for developers, with no confirmed launch date.

The Muse Spark timeline captures the structural problem. A model can score competitively on standardised evals, justify a press release, and still be too unpredictable to ship as an API. The benchmark becomes the product for a few news cycles, and then the integration teams find what the evals missed. Tim Bajarin, writing in Forbes on June 16, framed the issue as an economic one: the unreliability of AI systems in production is quietly driving up costs, skewing ROI calculations, and limiting adoption despite strong benchmark performance. The cost of the gap is not theoretical. It shows up in procurement write-offs, in abandoned pilot programmes, and in the rising cost of red-teaming engagements that find what the evals never tested for.

Red teaming itself has undergone a rapid professionalisation in response. Microsoft open-sourced two tools in May, RAMPART and Clarity, designed to turn red-team findings into repeatable safety tests that can run in continuous integration pipelines, Redmond Magazine reported. RAMPART automates the adversarial prompt generation that Cisco's researchers had to run manually. Clarity helps developers validate whether a model's refusal behaviour in production matches what the evaluation suite promised. The tools are an implicit admission that the existing static benchmark regime is inadequate for agentic AI, where a model's safety depends not only on its training but on the runtime context, the tool integrations, and the adversarial persistence of the user.

The Policy Response Arrives Already Out of Date

Washington has noticed the gap, but the response has been reactive and, in several respects, self-undermining. In May, the Trump administration signed agreements with Google DeepMind, Microsoft, and xAI to run government safety checks on frontier models before and after deployment, a reversal of the administration's earlier hostility to AI safety regulation. Ars Technica reported that the move was partly triggered by the Mythos release. But by early June, Ars Technica found, the US security teams responsible for conducting those evaluations had been gutted by DOGE cuts, leaving the administration's testing commitments hollow. Critics told the outlet the plan was "short-sighted" and "performative."

The European Union's AI Act, which entered into force earlier this year, requires conformity assessments for high-risk AI systems, but the technical standards that would specify how to conduct those assessments remain incomplete. Cisco's research explicitly framed its findings as a warning to enterprises attempting to comply with both the EU AI Act and the NIST AI RMF using only published model cards. The gap between the compliance checkbox and the actual safety posture is, in Cisco's analysis, the vector that a well-resourced adversary will exploit, Network World reported.

The dynamic recalls an uncomfortable truth about safety evaluation in any complex system: what the test measures is rarely what the attacker targets. An opinion piece in The Hill on June 16 crystallised a related worry: capability gains are widening the number of harm pathways faster than the evaluation ecosystem can catalogue them. The piece cited a report whose most striking pattern was that each generation of capability improvement opens novel attack surfaces that the previous generation's safety tests were not designed to detect. It is a kind of evaluation lag that compounds, and the compounding rate appears to be accelerating.

There are at least three distinct failure modes now documented in the literature and the lab post-mortems. The first is the single-turn to multi-turn gap, which Cisco measured directly and which remains unaddressed in every major model card published this year. The second is the agentic gap, which BeSafe-Bench documented: the safety behaviours a model exhibits in a chat window do not transfer to contexts where the model controls tools, browses the web, or executes code. The third is the distribution gap, where the evaluation dataset, however carefully constructed, does not sample from the distribution of attacks that a creative adversary will attempt once the model is in the wild. Each failure mode is tractable individually. None has been solved in combination.

What makes this moment different from the safety benchmarking discussions of 2024 and early 2025 is the presence of real economic consequences. Enterprises that signed multi-year procurement agreements based on model-card safety scores are now confronting the fact that those scores do not translate to their deployment environment. Red-team contractors report that their engagements have tripled in scope over the past twelve months, driven by procurement teams that have read the Cisco report and the BeSafe-Bench findings and are now asking harder questions before signing. The cost of evaluation is rising, because the evaluation that actually matters requires simulating a persistent adversary, and that simulation is expensive, time-consuming, and not yet standardised.

Which Interventions Actually Close the Gap

The interventions that would meaningfully narrow the gap between benchmark safety and deployed behaviour fall into two categories that the industry tends to conflate. The cheap-to-ship category includes automated red-teaming tools like RAMPART, dynamic guardrail layers that monitor model outputs in production, and expanded model cards that disclose multi-turn attack success rates in addition to single-turn refusal scores. These are incremental, they improve the situation at the margin, and they are the interventions that the labs are most willing to discuss publicly.

The second category is expensive. It includes pre-deployment evaluation regimes that model a persistent, creative adversary rather than a static prompt list. It includes independent auditing bodies with the technical competence to design novel attacks and the legal authority to delay a deployment. It includes a standard for evaluation transparency that is not satisfied by a selectively curated model card. These interventions require slowing down, and slowing down is not currently incentivised by the competitive dynamics of frontier AI development.

Cisco's researchers, in their report, made a recommendation that is simple on paper and difficult in practice: every model card should include multi-turn attack success rates across a standardised adversarial suite, and procurement teams should treat single-turn refusal rates as insufficient evidence of safety. It is a recommendation that would cost the labs very little to implement and that would immediately improve the information available to downstream deployers. As of mid-June 2026, no major lab has committed to it.

The next checkpoint to watch is the release cadence for the Mythos-class models. Anthropic has signalled that the Fable 5 release on June 10 is the beginning of a broader Mythos-class rollout, and every subsequent release will test whether the guardrails added between April and June were adequate or merely adequate for the threats the company anticipated. If the pattern holds, the next jailbreak will appear within weeks of the next release, and the gap will be measured again. The question is whether anyone measuring it will have the authority to pause what they find.

The Policy Response Arrives Already Out of Date

Which Interventions Actually Close the Gap

Read next

Agentic AI Raises the Stakes for Red Teaming Beyond the Pentest Lab

Get the Daily Briefbefore your first meeting.

Get the Daily Brief
before your first meeting.