AI Labs Overhaul Cloud Compute Deals, Reshaping Power Dynamics
A leaked compute capacity spreadsheet exposes how top AI labs are ditching exclusive cloud tie-ups for messy multi-provider deals, reshaping power in the foundation-model race.
On Tuesday, April 14, 2026, a capacity-allocation spreadsheet landed in the Slack channels of researchers across three AI labs. The spreadsheet, a shared view of reserved GPU clusters for the coming quarter, was unremarkable in format—color-coded rows, columns for teraflops, a notes field marked 'do not forward.' What stopped people mid-scroll was a single cell: a two-week block of Trainium2 capacity that had been double-booked. One lab's pre-training run was scheduled against another lab's fine-tuning sprint, both on the same silicon, in the same region, on the same calendar dates. By that evening, the conflict was escalated to a VP. By Thursday morning, a four-year-old compute contract was back on the negotiating table. The episode, minor as it was, exposed the quiet reality reshaping AI's infrastructure layer: the era of exclusive cloud partnerships is ending, and no one is entirely sure what comes next.
For most of the last decade, the relationship between frontier AI labs and cloud providers followed a clean template. A lab picked a primary cloud—OpenAI on Microsoft Azure, Anthropic on Amazon Web Services after its $4 billion deal in 2023, Google DeepMind on Google Cloud—and that provider furnished the compute in exchange for equity, revenue commitments, or both. The arrangement was exclusive enough that each deal functioned as a kind of structural moat: the lab got priority access to the provider's newest silicon and biggest clusters, and the provider got a marquee customer whose workloads would stress-test its infrastructure at scale.
The multi-cloud drift
That model began to crack in mid-2024, when OpenAI signed a deal with Oracle Cloud Infrastructure for additional capacity to train its next-generation models, as CNBC first reported. The move was framed as a supplement to Microsoft Azure, not a replacement. But to the infrastructure teams inside competing labs, it read as a precedent: a frontier lab hedging its compute supply chain by going multi-cloud. The two-cloud, three-cloud model stopped being theoretical.
Anthropic's trajectory is instructive. The lab's 2023 deal with AWS, valued at up to $4 billion, made AWS its 'primary cloud provider' and gave Amazon a minority stake. At the time, the pact was read as an AWS lock-in. But Anthropic also maintained a separate, earlier relationship with Google Cloud, which had invested $2 billion in the lab, as Reuters reported in October 2023. The capacity-allocation spreadsheet that surfaced in April 2026 was, in part, a symptom of that new complexity: the scheduling tools hadn't caught up to the multi-provider reality.
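The failure is easy to state in code. In a single-provider world, a scheduler never has to check competing tenants against the same silicon; in a multi-provider one, it does. As a minimal sketch (the reservation schema, field names, and the `conflicts` helper are hypothetical illustrations, not any lab's actual tooling), detecting a clash like the Trainium2 double-booking reduces to an interval-overlap check keyed on provider, region, and hardware pool:

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations

@dataclass(frozen=True)
class Reservation:
    """One reserved block of accelerator capacity (hypothetical schema)."""
    lab: str
    provider: str      # e.g. "aws", "gcp", "oracle"
    region: str
    pool: str          # e.g. "trainium2"
    start: date
    end: date          # exclusive

def conflicts(reservations: list[Reservation]) -> list[tuple[Reservation, Reservation]]:
    """Return pairs of reservations that collide on the same silicon.

    Two bookings clash only if they share provider, region, and hardware
    pool AND their date ranges intersect -- the cross-tenant check a
    single-provider scheduler never needed to run.
    """
    clashes = []
    for a, b in combinations(reservations, 2):
        same_silicon = (a.provider, a.region, a.pool) == (b.provider, b.region, b.pool)
        overlaps = a.start < b.end and b.start < a.end
        if same_silicon and overlaps and a.lab != b.lab:
            clashes.append((a, b))
    return clashes

# The April scenario: two labs, one Trainium2 pool, overlapping fortnights.
booked = [
    Reservation("lab-a", "aws", "us-east-1", "trainium2", date(2026, 4, 20), date(2026, 5, 4)),
    Reservation("lab-b", "aws", "us-east-1", "trainium2", date(2026, 4, 27), date(2026, 5, 11)),
]
assert conflicts(booked)  # the double-booking the spreadsheet surfaced
```

Ten lines of bookkeeping, but it is exactly the bookkeeping that exclusive, single-cloud contracts made unnecessary.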
"Two years ago, a compute deal was a commitment ceremony. Now it's a series of reservations on a shared calendar that anyone can overbook."
— Cloud infrastructure lead at a major AI lab, speaking on background
The equity-for-compute model under pressure
The original logic of equity-for-compute was simple: cloud providers took ownership stakes in AI labs, and in return the labs committed to spending billions on that provider's infrastructure. The provider locked in demand; the lab locked in supply. But the math has changed as training runs have grown. The Stargate joint venture—announced in early 2025 by SoftBank, OpenAI, Oracle, and MGX, with $100 billion in initial commitments and a projected $500 billion in infrastructure spend over four years—represents a structural departure. Instead of a lab signing a check to a cloud provider, Stargate is a standalone entity that builds data centers and leases capacity back. It disaggregates compute provisioning from the equity relationship entirely.
The bargaining power is shifting from whoever owns the GPUs to whoever can write the biggest check for the next training run.
The custom-silicon wildcard
Complicating the landscape further is custom silicon. AWS has Trainium and Inferentia; Google has its TPU line, now in its sixth generation; Microsoft is ramping Maia; and several labs are co-designing chips directly with Broadcom and Marvell. A lab that builds its models to run optimally on a specific provider's custom silicon is, by definition, less portable. That tension—between the desire for multi-cloud flexibility and the performance gains of hardware–software co-design—is the strategic question that keeps infrastructure VPs up at night. Here is where the major labs currently stand:
- OpenAI: primary on Azure, supplemental on Oracle Cloud, Stargate build-out underway; custom chip work with Broadcom reported.
- Anthropic: primary on AWS (Trainium2), active workloads on Google Cloud (TPU v5); maintaining dual-provider inference.
- Google DeepMind: exclusive on Google Cloud (TPU); no known secondary provider.
- Meta: primarily on self-built infrastructure; uses cloud providers for surge capacity and inference edge deployment.
- xAI: self-built Colossus cluster in Memphis (100,000+ GPUs); limited cloud dependency.
The cheapest signal to watch
If you want to know whether a lab's compute strategy is working, the cheapest signal is not the size of its latest deal or the number of GPUs it has reserved. It is the velocity of its capacity reallocation. A lab that can shift a pre-training run from one provider to another inside a week, without re-architecting its stack, has genuine multi-cloud leverage. A lab that needs two months and a rewrite of its kernel libraries does not. The tools that make the difference, the abstraction layers, schedulers, and portable kernels sitting between a lab's training stack and its providers, rarely make headlines. They are also where the next phase of the compute wars will be won or lost.
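What that leverage looks like in practice is a seam between the training stack and the provider. The sketch below is a hypothetical illustration of the idea (the `ComputeProvider` interface, the stub adapter, and all names here are assumptions, not any lab's real internals): everything above the seam is provider-agnostic, so moving a run means swapping an adapter, not rewriting the stack.

```python
from typing import Protocol

class ComputeProvider(Protocol):
    """Hypothetical seam between a training stack and a cloud backend."""
    def reserve(self, accelerators: int, days: int) -> str: ...
    def launch(self, reservation_id: str, job_spec: dict) -> None: ...

class StubBackend:
    """Stand-in for a real provider adapter (Azure, AWS, GCP, Oracle...).

    Provider-specific capacity APIs and kernel choices live only inside
    the adapter; everything above the seam stays portable.
    """
    def __init__(self, name: str) -> None:
        self.name = name

    def reserve(self, accelerators: int, days: int) -> str:
        # A real adapter would call the provider's capacity APIs here.
        return f"{self.name}-resv-{accelerators}x{days}d"

    def launch(self, reservation_id: str, job_spec: dict) -> None:
        print(f"launching {job_spec['run']} under {reservation_id}")

def run_pretraining(provider: ComputeProvider, job_spec: dict) -> None:
    # Shifting a run between clouds means passing a different adapter,
    # not re-architecting the training stack.
    rid = provider.reserve(accelerators=4096, days=14)
    provider.launch(rid, job_spec)

run_pretraining(StubBackend("oracle"), {"run": "pretrain-q3"})
```

The hard part, of course, is what the stub elides: vendor-specific kernels and capacity APIs live inside the adapter, which is precisely why hardware–software co-design pulls against this pattern.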
The April 14 conflict was, in the end, resolved with a phone call and a handshake agreement to push one research team's run back by ten days. No contract was torn up. No press release was issued. But the calendar kept filling, and the double-bookings kept appearing, little yellow flags in a shared document that everyone now watches. The question the spreadsheet posed, and that no one in the room could answer definitively, was this: when the next generation of models requires ten times the compute, and every lab is booking capacity on every cloud, who owns the calendar—and who pays when it breaks?