AI Red Teaming Outgrows Its Script: What Adversarial Testing Measures
Automated tools, agentic testing, and the Mythos wake-up call are reshaping AI security assessments, yet the gap between what evaluations detect and what adversaries actually exploit remains far wider than vendor marketing suggests.
In late April, Anthropic quietly restricted access to its Mythos model after internal testers found it could identify exploitable software vulnerabilities at a speed and scale that the company's own safety team described as dangerous. Within days, independent researchers had already used the preview to surface 271 distinct flaws in Mozilla Firefox alone, Computerworld reported. The number itself is vivid, but what ought to concentrate the mind is the rhythm: a single model, aimed at a single widely-deployed codebase, producing a vulnerability report that would have taken a conventional red team months to assemble, delivered in what sources described as a dramatically compressed window. That ratio, not the bug count, is the signal.
The Mythos episode arrives at a moment when the practice of adversarial testing for AI systems is undergoing a transformation that is simultaneously technical, commercial, and conceptual. Across the first half of 2026, a wave of startups and platform vendors have shipped tools promising to automate, accelerate, or productize AI red-teaming. DeepKeep Ltd. launched its Vibe AI Red Teaming capability in April, designed for what the company calls human-steered, dynamic attack simulation on AI applications and agents. NDay Security made its GARAK self-service LLM red-teaming platform generally available in March. Suzu Labs acquired Emulated Criminals in late April to fold continuous adversary emulation into its AI security portfolio. The market is filling a vacuum that the frontier labs themselves created when they first admitted, in 2023 and 2024, that their internal red-teaming was uneven and under-resourced.
The problem these tools are selling a fix for is real and getting realer by the quarter. The Hacker News reported this month that exploit windows have collapsed to roughly 10 hours in 2026, a compression severe enough that manual purple-teaming workflows, where red and blue teams share findings in scheduled handoffs, now resemble air traffic control conducted by carrier pigeon. The article, which drew on data from security validation firm Picus Security, argued that only autonomous purple teaming, in which attack simulation and defensive response run in a continuous loop mediated by AI, can keep pace. Picus itself was named Frost & Sullivan's 2026 Global Company of the Year for automated security validation, a recognition announced in early May.
Read the press releases side by side and a pattern emerges. The new red-teaming tools share a rhetorical architecture: they emphasize speed, coverage, and the inadequacy of manual approaches. What they describe less often is the evaluation methodology underneath. This is not an accident. Red-teaming an AI system, particularly a large language model or an agent that takes actions in external environments, is categorically different from penetration-testing a network. A network has a topology. An LLM has a probability distribution over tokens. The attack surface is not a set of ports and protocols but a conversation, and conversations do not have well-defined boundaries, which is the core observation behind the title of TechRadar's recent piece, "You can't firewall a conversation."
The phrase captures something structural. Traditional security testing assumes an architecture in which data flows through known channels and an attacker must find a misconfiguration or an unpatched vulnerability to move laterally. Prompt injection, jailbreaking, encoded-instruction attacks, and multi-turn manipulation do not require any of those things. They exploit the fact that a language model has been trained to be helpful and has no native concept of a security boundary. A red team that models an LLM as if it were a web application will miss the most consequential failure modes, which emerge from the model's own reasoning rather than from a software bug in the traditional sense.
This is where the distinction between automated scanning and genuine red-teaming becomes critical, and it is a distinction that several of the new commercial offerings blur by design. DeepKeep's Vibe AI Red Teaming product, for instance, is built around the concept of human-steered testing: a human operator guides the adversarial exploration, while the platform handles attack generation, variation, and logging. The architecture acknowledges that threat modeling for AI systems is not yet automatable in the way that port scanning is automatable. A human has to decide what a meaningful failure looks like for a given deployment context, because the same model output that constitutes a safety violation in a customer-service chatbot may be entirely benign in a creative-writing assistant.
The Devdiscourse piece on agentic AI red-teaming, published May 8, pushes the argument one step further. The core claim is that as AI systems gain agency, the red team must itself become agentic. A static prompt library, no matter how large, cannot anticipate the combinatorial space of actions an agent might take when it has access to tools, memory, and multi-step reasoning. The article invokes Meta's Llama Scout as a reference point and catalogues the standard taxonomy of LLM attacks: prompt injection, jailbreak sequences, encoded payloads. But the deeper insight is that agentic systems introduce timing-dependent and state-dependent vulnerabilities that are invisible to a single-turn evaluation. An agent that behaves safely in the first five steps of a task may diverge dangerously on step six, after it has accumulated context that its safety training never anticipated.
There is an irony here that the field has not yet fully metabolized. The same AI capabilities that make systems more useful, long-horizon reasoning, tool use, memory, are the capabilities that make them harder to evaluate. Each new affordance expands the attack surface. This is not a temporary condition that better fine-tuning will resolve. It is a structural feature of deploying models that can plan. The safety community has a term for this, "capability overhang," and 2026 is the year it stopped being a seminar-room abstraction.
The Mythos episode crystallizes the other half of the dilemma. Anthropic did not restrict Mythos because it was unsafe in the sense of generating toxic text or producing biased outputs. The company restricted it because the model was too good at something that is, in many contexts, a legitimate and valuable capability: finding software vulnerabilities. The dual-use problem is not new, but Mythos made it visceral. An AI that can audit code for security flaws at superhuman speed is a defensive asset and an offensive weapon, and the difference is entirely a matter of who holds the keyboard and under what constraints. The restriction itself became a news event, covered by MSN and multiple outlets, precisely because it raised the question that no commercial red-teaming product currently answers: what do you do when the model passes your safety eval but fails your deployment calculus?
That question exposes a fault line in the red-teaming-as-a-service market. Most of the products launching in 2026 test whether a model can be made to produce harmful outputs under adversarial prompting. They test for jailbreaks, for data leakage, for compliance violations. What they do not test, because it is not an eval you can ship as a feature, is whether the model's legitimate capabilities, deployed at scale, change the threat landscape in ways that make the world harder to defend. Measuring that requires a different kind of analysis, one that integrates red-teaming findings with threat intelligence, deployment context, and an honest accounting of who the likely adversaries are and what they already have.
The Picus Security recognition from Frost & Sullivan points toward one resolution of this tension. Picus has built its platform around automated security validation that simulates real-world attack scenarios continuously, not at a single point in the development lifecycle. The company's thesis, as described in the Manila Times report, is that validation must be as dynamic as the threats it measures. When exploit windows shrink to 10 hours, a red-teaming exercise conducted quarterly becomes a compliance checkbox rather than a security control. The same logic applies to AI systems, with the additional complication that a model's behavior can shift post-deployment due to fine-tuning, system-prompt updates, or interaction with other models in agentic workflows.
The autonomous purple-teaming model that The Hacker News piece advocates is, in effect, an argument that the red-team and blue-team functions must be algorithmically fused. In the AI context, this would mean an architecture where adversarial probes, safety evaluations, and defensive mitigations run in a continuous feedback loop, with each new attack vector automatically generating a corresponding guardrail. Several of the new platforms gesture in this direction. NDay's GARAK, for example, is positioned as a self-service tool that allows organizations to run continuous exploitability testing against LLMs, rather than commissioning point-in-time red-team engagements. The shift from consultancy to platform is itself a signal about where the market believes the value lies.
But the gap between what a platform can measure and what an adversary can exploit remains stubbornly wide. Consider the evaluation landscape for prompt injection. A typical automated test will submit thousands of prompt variants, measure the model's response, and flag outputs that match a harm taxonomy. This approach catches known attack patterns efficiently. It does not catch a novel multi-turn manipulation that exploits a specific deployment's tool-access configuration, because the test was designed before that configuration existed. The eval measures the model's resistance to last year's attacks, while the adversary is developing this morning's.
This is why the phrase "red-teaming" itself has become ambiguous in ways that matter. To a frontier lab, red-teaming means a structured engagement with external experts who spend weeks probing a pre-deployment model for safety failures, often producing detailed qualitative reports that inform training adjustments. To a startup buying a SaaS tool, red-teaming means running an automated scan that generates a risk score and a list of flagged prompts. Both activities are valuable, but they are not the same activity, and conflating them under one label serves vendors more than it serves security teams. The careful reader of the 2026 product landscape should watch for the moment when a vendor claims its tool "replaces" expert red-teaming rather than augmenting it. That is where the safety claim ends and the marketing claim begins.
The policy dimension adds another layer of pressure. The MSN piece on 2026 priorities frames AI red-teaming as one of the dual strategic challenges facing organizations, paired with managing the most age-diverse workforce in history. That pairing is not random. The workforce that builds and deploys AI systems is increasingly composed of practitioners who did not grow up with adversarial machine learning as part of their training, and the institutions that might bridge that gap, universities, professional certifications, internal training programs, are moving at a pace that bears no relationship to the speed at which AI systems are being shipped. In that environment, a red-teaming tool that promises to encode expert knowledge into a platform has genuine appeal, even if the encoding is necessarily incomplete.
The Suzu Labs acquisition of Emulated Criminals, announced via Business Wire on April 27, is instructive in this regard. The deal was explicitly framed around combining "security-first AI expertise with real-world continuous adversarial emulation." The phrase "real-world" is doing the heavy lifting. Emulated Criminals built its reputation on adversary emulation that modeled the tactics of actual threat actors rather than running generic attack scripts. The acquisition suggests that the market is beginning to value fidelity over coverage, a bet that organizations would rather have a smaller number of high-quality adversarial simulations than a larger number of automated checks that produce noise.
Where does this leave the practitioner who needs to make a decision about which red-teaming investment to make? The honest answer, unsatisfying as it is, is that no single tool or engagement model is sufficient. An organization deploying LLM-based applications or agentic systems in 2026 needs at least three layers: continuous automated scanning for known attack patterns, periodic expert-led red-teaming for novel threat discovery, and a purple-teaming feedback loop that ensures defensive mitigations are tested against the same adversarial techniques that breached them. The third layer is the one most organizations skip, and it is the one that the Picus and BreachLock platforms are explicitly designed to operationalize.
What to watch for in the second half of 2026 is whether the red-teaming tooling market consolidates around a shared evaluation methodology or fragments further into incompatible frameworks. The frontier labs have begun to converge on certain standards. OpenAI's Trusted Access for Cyber program, which added Zscaler as a participant in April and provides access to security-tuned models including GPT-5.4-Cyber, represents one model of how labs are externalizing safety validation to partners. But those partnerships are selective, and the standards that govern them are not public in the way that, say, the OWASP Top 10 for LLM Applications is public. A fragmented evaluation landscape benefits vendors who can sell integration services. It does not benefit defenders who need to compare risk across heterogeneous deployments.
The Mythos episode will be studied for years, but the immediate lesson is simpler than the commentary suggests. Anthropic tested its own model, found a capability it considered too dangerous to release broadly, and acted on that finding before the model caused harm. That sequence, internal testing, honest finding, restrictive action, is what the red-teaming methodology is supposed to produce. It is also a sequence that depends on institutional incentives that do not exist at most organizations deploying AI. A startup racing to ship an AI feature does not have the same incentive to find reasons not to ship it. The tools are getting better. The incentives are not. Until they do, the most sophisticated red-teaming platform in the world is a mirror held up to an organization that may not want to see its own reflection.