Agentic AI Raises the Stakes for Red Teaming Beyond the Pentest Lab
With autonomous AI agents in production, enterprises are turning to open-source adversarial testing tools, continuous red teaming frameworks, and new certifications to uncover failures that static evaluations miss.
blogs.microsoft.com
On June 16, 2026, the cybersecurity training organization OffSec announced a new certification: OSAI, a 24-hour live red team challenge built specifically for AI systems. The announcement, covered by TMCnet, carried a blunt assessment that doubles as a thesis statement for where the field now stands. "Traditional pentesting," the release stated, "is insufficient for AI risks." The certification, OffSec argued, had to be built around the adversary's actual attack surface: prompt injection chains, tool-use manipulation, multi-turn jailbreaks that unfold across API calls rather than inside a single model response. The fact that a major infosec training provider would launch an AI-specific red team cert in 2026 is not itself surprising. What is surprising is how long it took, given that frontier labs have been running internal AI red teams since at least 2022 and the wider enterprise has spent the intervening years shipping agents first and testing them second.
The gap between the speed of deployment and the maturity of adversarial testing is the story beneath most AI safety conversations, and it widened considerably in the first half of 2026. In a Forbes Technology Council piece published in May, Joan Vendrell, CEO and cofounder of NeuralTrust, described a CISO who had prepared for a major production rollout of an autonomous customer service agent. The organization's traditional penetration tests had come back clean. But when Vendrell asked how the agent would handle a multi-step adversarial interaction across distinct tool calls, the room went quiet. The dynamic attack surface of an agentic system, Vendrell wrote, creates what he called an "adversarial reasoning" problem: attackers do not need to break the model; they need to find sequences of inputs that produce unsafe outputs across chained tool invocations, and those sequences are largely invisible to pre-deployment benchmarking suites.
This is not a hypothetical concern dressed up in consultant language. Between March and May 2026, four separate supply-chain incidents hit OpenAI, Anthropic, and Meta, as VentureBeat reported in May. Three were adversary-driven attacks; one was a self-inflicted packaging failure. None targeted the model weights directly. All four exploited the release pipeline: the software supply chain around the model, not the model itself. The lesson, as VentureBeat characterized it, was that the attack surface of an AI product is the entire software delivery lifecycle, and red teams still overwhelmingly focus on the model endpoint. A red team that only probes the chat interface is auditing roughly a third of the production risk surface.
The tools to close that gap are emerging, but their proliferation raises a distinct question: which interventions actually change the engineering, and which are safety theatre dressed as organizational diligence? In November 2025, the Cloud Security Alliance published an evaluation of Microsoft's Python Risk Identification Toolkit, or PyRIT, a framework designed to automate adversarial testing for agentic AI systems. The CSA artifact walked through PyRIT's scoring modules, its orchestrator architecture, and its ability to chain attack strategies across multiple model calls, a capability that distinguishes it from single-prompt benchmarking tools. The evaluation was careful to note what PyRIT does and does not do: it generates adversarial prompts and scores responses, but the risk taxonomy and severity thresholds are left to the operator. A tool that finds more failures does not, on its own, tell you which failures matter in production.
Microsoft extended this line of work in May 2026 by open-sourcing two additional projects: RAMPART and Clarity. Campus Technology reported that RAMPART is designed to turn red-team findings into repeatable safety tests that run inside continuous integration pipelines, while Clarity helps developers validate agent outputs against defined safety policies before those outputs reach an end user or trigger a downstream tool. The significance of the CI pipeline integration is easy to miss if you think of red teaming as a pre-release checkpoint. RAMPART treats red-team findings as engineering artifacts, not audit artifacts. A jailbreak discovered during a manual exercise becomes a unit test that blocks the next build unless the model's response pattern changes. That is a genuinely different operating model from the quarterly red-team-engagement-and-report cycle that many enterprises still treat as the gold standard.
What makes the agentic context different, and harder, is that the failure modes are compositional. An autonomous customer service agent does not just generate text; it reads from a knowledge base, queries a CRM, issues refunds, and sends confirmation emails. A red team that only probes the language model's safety filters is testing the component with the smallest blast radius. The real adversarial surface is the sequence: prompt, tool call, tool response, model reasoning about the tool response, next tool call. Joan Vendrell, in the Forbes piece, laid out what he called the "adversarial reasoning" scenario: an attacker who understands that the agent will summarize a retrieved document before acting on it can plant adversarial text in a field the agent will read, wait for the summarization step to recontextualize the toxic payload, and then trigger the downstream action through the model's own chain of thought. No single prompt looks dangerous. The sequence is the attack.
Enterprise security teams are beginning to respond to this category of risk at the infrastructure layer. In late June 2026, InfoQ reported that Grab's security engineering team had built an internal platform called Palana, a Kubernetes-native secure execution environment designed specifically to run autonomous AI agents with runtime policy enforcement. Palana does not attempt to harden every model prompt; it constrains what the agent can do at the container and network level after the model has already generated its output. If an agent attempts to exfiltrate data or invoke an unexpected tool, the enforcement happens at the orchestration layer, not the inference layer. It is a recognition, architecturally, that some percentage of model-level jailbreaks will succeed, and safety has to be layered in depth.
The same week, OrcaRouter, an OpenAI-compatible LLM gateway, published its AI Threat Report 2026 and made two of its security controls, an agent firewall and an input/output content filter, available at no cost. The report's headline finding, according to the company's release, was a sharp increase in prompt-injection attacks targeting agentic deployments in the first half of 2026. OrcaRouter's move to free pricing for security controls is a market signal: when the infrastructure layer starts treating adversarial robustness as table stakes rather than a premium add-on, it suggests the threat is no longer theoretical enough to charge extra for mitigating it.
Academic work has also been moving toward frameworks that make adversarial robustness evaluation more systematic and less reliant on the intuition of a handful of expert red-teamers. In March 2026, researchers at Johns Hopkins University and Microsoft published a reusable framework for evaluating AI safety, described by the Hub as a sustainable method to "simulate risks within large language models to prevent harm before they go live." The framework is designed to be efficient in a specific sense: it does not require retraining or fine-tuning the model under test, and it can be reapplied across model versions without rebuilding the test infrastructure from scratch. This matters because one of the unspoken failures of current red-teaming practice is that red-team findings are frequently model-version-specific. A jailbreak that works on one checkpoint may not work on the next, and teams burn cycles rediscovering vulnerabilities that the previous round of testing should have permanently closed.
The phrase "AI safety" has, by mid-2026, accumulated enough baggage that it is worth saying plainly what it obscures. In marketing materials, safety often means the model refused to answer a sensitive question. In an adversarial context, safety means the system did not execute a harmful multi-step action when an attacker probed its tool-use boundaries. Those are not the same measure, and confusing them is expensive. A model that scores 99.8 percent on a standard refusal benchmark can still be trivially jailbroken through a five-turn conversation that exploits the agent's summarization module. The Cloud Security Alliance's PyRIT evaluation acknowledged this explicitly by structuring its test scenarios around agentic threat models rather than single-turn prompt benchmarks, a design choice that more evaluation suites will need to replicate if the industry wants safety metrics that correlate with production outcomes.
The question of which interventions are cheap to ship and which require actually slowing down has become the central fault line in AI governance debates. Pre-deployment red teaming is cheap in the sense that it adds a step to an existing release process, generates findings, and creates a paper trail for regulators. Continuous adversarial testing with CI-integrated frameworks like RAMPART is moderately more expensive: it requires engineering investment in test infrastructure and the organizational willingness to let a red build block a release. Runtime enforcement at the orchestration layer, as Grab's Palana demonstrates, is expensive in a different way, because it adds latency to every tool call and constrains agent capabilities. The genuinely hard intervention is the one that no tooling announcement in the first half of 2026 addressed head-on: restricting the set of actions an agent is allowed to take in production until adversarial coverage reaches a defined threshold. That requires slowing down a deployment, and slowing down costs money.
A separate Forbes analysis by Paulo Carvão in early May 2026 surveyed the pre-deployment evaluation landscape and noted that Washington was beginning to study China's AI governance model, which mandates pre-release safety assessments for certain classes of AI systems. The article pointed out that China's approach, whatever its civil-liberties implications, had produced a structured evaluation pipeline that Western labs have not matched in regulatory form. The tension is clear: a mandatory pre-deployment evaluation is a blunt instrument, but a voluntary continuous-testing framework only works if the organization that deploys the agent is the same organization that funds the red team, and those incentives do not always point toward rigor.
What the eval actually measures, and what it fails to measure, is the question that should follow every new tool announcement. PyRIT measures whether a model generates content that matches a pattern in the toolkit's scoring modules. RAMPART measures whether a previously discovered jailbreak still succeeds against the current build. Clarity measures whether an agent output violates a developer-defined safety policy. None of these tools measures whether an attacker can chain three benign-looking prompts across two tool calls to produce a harmful outcome that no single policy rule catches. None measures whether the model's refusal mechanism itself can be socially engineered across a long conversation. And none measures the supply-chain attack surface that VentureBeat documented, because the supply chain is not inside the model at all.
The certification market is beginning to fill one of these gaps. OffSec's OSAI challenge, according to the TMCnet report, is structured as a 24-hour live exercise rather than a multiple-choice exam, requiring participants to find and exploit vulnerabilities in AI systems that are deliberately instrumented with agentic tool-use patterns. The certification is new enough that it has no track record, but the format is instructive: a live, time-boxed exercise that tests adversarial reasoning against a dynamic system, not static knowledge of prompt-injection taxonomies. If the industry's red-teaming methodology is going to professionalize, certifications that test adversarial creativity rather than checklist compliance will matter more than another benchmark suite.
There is a version of the near future in which every production AI agent ships with a continuously updating adversarial test suite, runtime policy enforcement at the orchestration layer, and a red team whose findings feed directly into the CI pipeline. There is another version in which enterprises continue to treat red teaming as a pre-release checkbox, the tooling improves faster than the organizational incentives, and the next wave of supply-chain and multi-turn attacks catches the industry flat-footed. Both futures are compatible with the same press releases. The difference is whether the red team's findings have the power to stop a build, and whether the organization that deploys the agent is willing to hear "not yet" from the team whose job is to break it.