
Anthropic’s Automated Code Review Runs Four AI Agents on Every PR

Anthropic's new multi-agent system sets four AI agents loose on every pull request, while a $70 million bet on rival Qodo signals that code verification is becoming a category of its own. Together they are reshaping how dev teams handle code quality.

In this article
  1. What the Copilot Backlash Revealed
  2. The Fourteen-Person Team, Not the Two-Person Startup

I type claude code review in the terminal and wait. Not long — maybe forty seconds for a 300-line diff across seven files. What comes back is not a diff comment. It's a four-paragraph analysis with line-numbered references, a severity classification I didn't ask for, and a suggestion that a race condition in the retry logic will fail under backpressure. The kind of thing a senior engineer on my team would catch after their second coffee. Only it's 9:17 a.m., nobody else is awake yet, and the review was generated by four AI agents arguing with each other inside a sandboxed environment I can't see.

The tool is Anthropic's Code Review, launched inside Claude Code in March 2026 — and it matters not because it's first, but because it's the first one a working engineer might actually use without being forced to. The context is this: according to a study of 700 companies by the engineering analytics firm Jellyfish, reported by Alistair Barr at Business Insider, top adopters of AI coding tools are shipping nearly double the pull requests per week compared to a year ago. Sixty-three percent of companies now use AI tools for most coding. The pipeline is full. The review queue, however, hasn't doubled its headcount.

That gap — between code produced and code reviewed — is the defining operational stress of the AI-assisted development era. When GitHub Copilot can generate a feature branch in an afternoon and Claude Code can scaffold an entire service in a morning, the human reviewer becomes the bottleneck. The term "vibe coding" caught on in late 2025 to describe a workflow where developers prompt, accept, and ship AI-generated code with minimal scrutiny. It was meant as a joke. Then people started shipping it to production. Rebecca Bellan, writing for TechCrunch, called what Anthropic shipped a direct response to "the flood of AI-generated code" — a phrase that will age either as hyperbole or understatement, depending on how the next eighteen months go.

The architecture underneath claude code review is worth understanding because it's not a single model reading a diff. It's a multi-agent system in which specialized reviewer agents — one for correctness, one for style and maintainability, one for security, and one for performance — independently analyze a pull request, then collate their findings through a coordinating agent that resolves contradictions before presenting a unified review. The system runs inside an isolated sandbox; it can execute the code, run tests, and surface runtime behavior that static analysis misses. David Gewirtz at ZDNet reported that in internal Anthropic testing, the multi-agent setup tripled the amount of "meaningful code review feedback" relative to a single-pass review. Meaningful, here, is doing a lot of work — but the distinction matters. We've had linting for decades.
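
To make that fan-out-and-collate shape concrete, here is a minimal sketch against Anthropic's public Python SDK. The specialist prompts, the model name, and the single coordinating call are my assumptions for illustration, not Anthropic's internals; the production system also executes code and tests in its sandbox, which this sketch omits entirely.

```python
# Minimal sketch of the fan-out/collate pattern; prompts, model name, and
# single-shot coordination are illustrative assumptions, not Anthropic's design.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SPECIALISTS = {
    "correctness": "Review this diff for logic errors, race conditions, and broken invariants.",
    "style": "Review this diff for naming, structure, and maintainability problems.",
    "security": "Review this diff for injection, authorization, and unsafe-input issues.",
    "performance": "Review this diff for N+1 queries, hot-loop allocations, and blocking I/O.",
}

def review(diff: str) -> str:
    # Fan out: each specialist analyzes the same diff independently.
    findings = {}
    for role, instructions in SPECIALISTS.items():
        msg = client.messages.create(
            model="claude-sonnet-4-5",  # assumed model name for illustration
            max_tokens=1024,
            system=instructions,
            messages=[{"role": "user", "content": diff}],
        )
        findings[role] = msg.content[0].text

    # Collate: a coordinating call merges findings and resolves contradictions.
    combined = "\n\n".join(f"[{role}]\n{text}" for role, text in findings.items())
    unified = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system="Merge these specialist code reviews into one, deduplicating and resolving conflicts.",
        messages=[{"role": "user", "content": combined}],
    )
    return unified.content[0].text
```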

I ran it against five open PRs from a mid-stage fintech team in Budapest. It caught a missing null check that a human reviewer had approved two days earlier, flagged an N+1 query pattern that had survived three rounds of manual review, and — memorably — noticed that a retry loop with exponential backoff had its maximum delay set in milliseconds where the configuration schema expected seconds. That last one would have failed silently in production until the first traffic spike. The review didn't just flag the line. It explained why the configuration schema and the code disagreed, and suggested a fix. A junior developer could learn from it. A senior developer could verify it in under a minute.
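
For a sense of what that third catch looks like, here is a hypothetical reconstruction. The fintech team's code is private, so every name and value below is invented:

```python
# Hypothetical reconstruction of the unit-mismatch bug; names and values invented.
import time

RETRY_CONFIG = {
    "max_attempts": 10,
    # The author meant "cap at 30 seconds" and wrote milliseconds; the config
    # schema documents this field in seconds, so the cap is really ~8.3 hours.
    "max_delay": 30_000,
}

def call_with_backoff(fn):
    delay = 1.0  # seconds
    for _ in range(RETRY_CONFIG["max_attempts"]):
        try:
            return fn()
        except ConnectionError:
            # The min() never binds at realistic delays, so under sustained
            # failure the sleeps grow far past 30s: silent in testing,
            # painful at the first traffic spike.
            delay = min(delay * 2, RETRY_CONFIG["max_delay"])
            time.sleep(delay)
    raise TimeoutError("retries exhausted")
```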

"The thing that surprised us wasn't that the agents found bugs — we expected that. It was that they found bugs in code that had already passed human review, sometimes multiple rounds of it. That tells you something about the state of review discipline in a world where PRs arrive faster than anyone can read them." — Senior DX engineer at a European fintech firm, speaking on background

Anthropic is not alone in chasing this problem. In late March, the AI code verification startup Qodo raised $70 million in a Series B, bringing its total funding to $120 million. Qodo, originally known as Codium, places a different architectural bet: rather than reviewing PRs through conversational agents, it builds verification and testing into the IDE as the developer writes code, emphasizing what the company calls "quality gates" that run before a PR is even opened. Mike Wheatley at SiliconANGLE described Qodo's multi-agent approach as targeting "specialized AI agents" for code verification, each scoped to a different class of defect. That total signals investors are betting code verification becomes a standalone category, not a feature swallowed by the major coding assistants.
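
What a quality gate amounts to in practice is mundane: checks that run, and block, before the PR exists. Here is a generic sketch of the concept, emphatically not Qodo's product, whose actual gate definitions aren't public in this form:

```python
# Generic pre-PR quality gate: run local checks, block on any failure.
# Illustrates the concept only; not Qodo's tooling or configuration.
import subprocess
import sys

GATES = [
    ("unit tests", ["pytest", "-q"]),
    ("type check", ["mypy", "src/"]),
    ("lint", ["ruff", "check", "src/"]),
]

def run_gates() -> int:
    for name, cmd in GATES:
        if subprocess.run(cmd).returncode != 0:
            print(f"gate failed: {name}; fix before opening the PR")
            return 1
    print("all gates passed")
    return 0

if __name__ == "__main__":
    sys.exit(run_gates())
```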

Then there's the acquisition trail. In December 2025, Cursor — the AI-native IDE that has been eating into VS Code's share among early adopters — acquired the code review startup Graphite. Fortune's exclusive report framed the deal as Cursor buying its way into the code review workflow: Graphite's stacked-PR model and review automation grafted directly onto Cursor's AI-first editing experience. The bet is that code generation and code review are converging into a single surface. Write it, review it, merge it — all in the same tool, with AI mediating every transition. If that sounds like a lot of power concentrated in one vendor's pipeline, it is.

What the Copilot Backlash Revealed

The trust question is not theoretical. In April and May 2026, Microsoft burned a significant amount of developer goodwill when a silent VS Code update began automatically appending "Co-authored-by: Copilot" to Git commits — including commits where no AI assistance was used. Developers discovered the change through their own Git logs, not through a changelog. The backlash was fast and loud. Microsoft reversed the change in VS Code version 1.119, but the damage illuminated something structural: when automated systems touch the authorship and review layer of the software development lifecycle, developers want to know exactly what touched what, and they want the option to turn it off. The incident, covered by Gadget Review on MSN, was about attribution — but the deeper anxiety was about review integrity.
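
The countermeasure to that anxiety is boring and auditable: read your own history. Here's a small script in that spirit, using only standard git log format specifiers; the "Copilot" substring match and the 500-commit window are my arbitrary choices:

```python
# Audit sketch: how many recent commits carry a Copilot co-author trailer?
# Window size and substring match are arbitrary illustrative choices.
import subprocess

def git(*args: str) -> str:
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout

hashes = git("log", "-500", "--format=%H").split()
flagged = [
    sha for sha in hashes
    if "Copilot" in git("log", "-1", "--format=%(trailers:key=Co-authored-by,valueonly)", sha)
]
print(f"{len(flagged)} of {len(hashes)} recent commits carry a Copilot co-author trailer")
```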

Here's the tension. A multi-agent review system that catches real bugs is valuable. But if the same system can also write the code, review the code, and approve the merge — with commit attribution muddied in the process — you've built a closed loop. No human in the critical path. The engineers I spoke with for this piece don't object to AI reviewing code. They object to not knowing when it reviewed code, what it reviewed, and whether the human sign-off above theirs was also generated. One staff engineer at a German enterprise software company put it bluntly: "If I'm co-authoring with an AI and reviewing with an AI, what exactly am I being paid for — and who is accountable at 3 a.m.?"

The Fourteen-Person Team, Not the Two-Person Startup

Most of the demos I've seen for code review automation are built around a single developer and a single PR. That's not how software gets built at scale. In a fourteen-person team with three ongoing feature branches, daily merges, and a rotation schedule for review duty, automated review doesn't just change the reviewer's workflow — it changes the team's social contract. Junior developers stop learning from senior review comments because the AI gets there first. Senior developers stop writing detailed review comments because the AI already covered the mechanical issues, and they default to rubber-stamping the rest. The review becomes a compliance checkbox rather than a conversation. Two engineering managers I spoke with — one at a 200-person SaaS company in London, another at a 60-person platform team in Prague — both described a pattern they're seeing: PR discussion threads shrinking, review turnaround accelerating, and the quality of human-to-human feedback declining. "We're faster," the London manager said. "I'm not sure we're better."

Which brings me to the question I keep asking every team that adopts these tools: what habit does this train? Automated PR review trains the reviewer to trust a machine verdict and trains the author to satisfy the machine before the human. Those are not inherently bad habits — unless they atrophy the skills of reading code critically and articulating why something is wrong. The best automated review tools I've tested position themselves as a first pass, not a final verdict. Claude Code's review output includes confidence levels and explicit uncertainty markers on subjective judgments. That's the right design. But design intent doesn't survive team culture intact. If the tool is fast enough and accurate enough, the gravitational pull is toward treating its output as the review.
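
What "confidence levels on subjective judgments" might look like from the consuming team's side, hypothetically (these field names are mine, not Claude Code's actual output format):

```python
# Hypothetical shape for confidence-marked review findings; fields invented.
from dataclasses import dataclass

@dataclass
class Finding:
    file: str
    line: int
    severity: str    # "blocker" | "warning" | "nit"
    confidence: str  # "high" for mechanical checks, "low" for taste calls
    message: str

findings = [
    Finding("retry.py", 42, "blocker", "high",
            "max_delay is in milliseconds; the config schema documents seconds"),
    Finding("handlers.py", 17, "nit", "low",
            "consider extracting this branch; subjective, check team style"),
]

# One sane consumption policy: machine-gate only what the machine is sure of,
# and route low-confidence, subjective findings to a human reviewer.
for f in findings:
    route = "block merge" if (f.severity, f.confidence) == ("blocker", "high") else "human judgment"
    print(f"{f.file}:{f.line} [{f.severity}/{f.confidence}] -> {route}")
```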

Enterprise developers are already pushing back on the reliability question. Anirban Ghoshal at InfoWorld reported in early April that a senior director in AMD's AI Group publicly criticized Claude Code for what she described as a tendency to produce shallow, syntactically clean but logically fragile code when operating autonomously on complex engineering tasks. Her critique resonated because it named something many engineers feel but struggle to articulate: the tool is brilliant at the kind of code you'd write in a bootcamp project and uneven on the kind of code that runs a distributed database. The review agents face the same limit. They're reviewing code through the same model family that writes it. That's not a checks-and-balances situation. That's a hall of mirrors.

Does this remove a step from my morning, or just rearrange the steps? For the solo developer working on a side project, automated PR review removes friction. You get a review at 9:17 a.m. instead of waiting for a colleague in a different timezone. For the team lead managing a fourteen-person group, the equation is different. Automated review adds a step: you now have to review the AI's review, or at minimum calibrate it periodically to ensure it hasn't drifted into a pattern of false positives or false negatives. The tool saves review time but creates oversight overhead. Whether the net is positive depends on the team's existing review discipline — and most teams I've worked with wouldn't describe their review discipline as a strength.
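
The oversight overhead can at least be made cheap. A sketch of the periodic calibration I'm describing, with an invented data shape and an arbitrary threshold; a real pipeline would pull its sample from your PR tool:

```python
# Calibration sketch: sample AI findings, have a human adjudicate each one,
# and watch precision drift over time. Data shape and threshold are invented.
def calibrate(sample: list[dict]) -> float:
    confirmed = sum(1 for f in sample if f["human_verdict"] == "real")
    precision = confirmed / len(sample)
    if precision < 0.80:  # arbitrary alert threshold
        print("review bot may be drifting toward false positives; recalibrate")
    return precision

print(calibrate([
    {"finding": "missing null check", "human_verdict": "real"},
    {"finding": "possible SQL injection", "human_verdict": "false_positive"},
    {"finding": "N+1 query pattern", "human_verdict": "real"},
]))
```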

Here's a checkpoint worth watching: what happens the first time an AI-reviewed-and-approved PR causes a production incident. The postmortem will ask who reviewed the code. If the answer is "four Claude Code agents and a human who skimmed the summary," the conversation changes. Tooling decisions that felt like productivity gains will be re-litigated as risk decisions. The SRE team I spoke with at a Berlin startup already has a Slack channel dedicated to tracking which deployments were AI-reviewed versus human-reviewed. They started it after an incident in February where an AI-authored database migration passed automated review without issue — and then ran a DROP COLUMN on a table still being read by a background job the AI didn't know existed. The migration was correct in isolation and disastrous in context. Context is where automated review still fails.
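
A hypothetical reconstruction of that failure's shape, with the crude context check that would have caught it; the table, column, and repo layout are all invented:

```python
# Hypothetical reconstruction: the disqualifying context lived outside the
# diff the reviewer saw. Names and layout invented for illustration.
import pathlib

MIGRATION_SQL = "ALTER TABLE orders DROP COLUMN legacy_status;"  # fine in isolation

def readers_outside_migrations(column: str, repo: str = ".") -> list[str]:
    # Crude context check: does anything else in the repo still read this column?
    return [
        str(p) for p in pathlib.Path(repo).rglob("*.py")
        if "migrations" not in p.parts and column in p.read_text(errors="ignore")
    ]

# A non-empty list here is exactly the signal the automated review never had.
print(readers_outside_migrations("legacy_status"))
```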

The code review automation category is moving fast enough that any conclusion I write today will be stale by the time GitHub Universe rolls around. Anthropic is iterating on agent specialization. Qodo is building quality gates deeper into the IDE. Cursor is folding review into its editing surface. Microsoft is — after the Copilot attribution debacle — presumably learning something about developer trust and consent. The thing to watch isn't any single tool's accuracy benchmark. It's whether teams build the muscle of critical review around the AI's output, or let that muscle atrophy because the AI is fast and mostly right. Mostly right at scale is another way of saying eventually wrong. And eventually wrong, in production, is what the on-call phone is for.
