H100 Reserved Capacity Hits $2.35/Hour Floor in 2026
Reserved-contract GPU pricing has surged 38% in six months while spot availability tightens and enterprise fleets idle at 5% utilization—the per-token math is shifting faster than procurement cycles can track.
$2.35 per GPU-hour. That is the price of a one-year reserved H100 contract as of March 2026, according to SemiAnalysis data published on April 2. The assumption set matters: this is an 80GB H100, reserved for 12 months, provisioned through a neocloud or tier-2 provider, not the hyperscaler list price. It is not the spot price — which can spike 60% to 120% above this figure depending on the week — and it is not the on-demand list price, which at AWS still runs north of $4.90 per GPU-hour on a p5.48xlarge instance with 8 H100s. At $2.35, the H100 reserved rate has climbed 38% from its October 2025 trough of $1.70/hr — reversing a year-long slide that had many forecasters predicting sub-$1.00 by June 2026. They were wrong.
The reversal caught the market off guard for a simple reason: most models of GPU price deflation assumed supply would continue to outpace demand as Nvidia's Blackwell ramp delivered B200s into every major datacenter. Supply did ramp — Nvidia shipped an estimated 2.3 million Hopper-generation GPUs in 2025, and Blackwell volumes are now arriving in Q2 2026. But demand grew faster. New model-training clusters from Anthropic, Meta, and xAI absorbed entire availability zones worth of capacity. At the same time, inference workloads — which chew through GPU-hours in a steady, unbursty pattern — moved from an afterthought to the primary driver of reserved-contract bookings. Three separate neocloud sales leads, speaking off the record, confirmed to me this week that inference contracts now represent 55% to 70% of their committed-revenue pipeline, up from roughly 20% in mid-2025.
The spot market tells a messier story. Spot H100 pricing on major cloud exchanges ranged from $1.85/hr to $4.10/hr in the first week of May 2026, with the wide spread reflecting not just regional differences but wild variation in interconnect availability. An H100 with 3.2 Tbps of InfiniBand fabric behind it commands a premium of $0.90 to $1.40/hr over a standalone node, because inference at batch size 32 on a 70B-parameter model needs that fabric — but batch-size-1 inference on a 7B model does not. The market is fragmenting along a line most procurement teams do not yet have a column for in their spreadsheets: workload topology. As TechSpot noted in its Q2 2026 GPU pricing survey on April 29, "GPU prices have stopped getting worse, but they have not gotten much better either." The observation was directed at consumer cards, but it applies with even more force in the datacenter: the era of steep sequential price declines has ended.
What makes this moment different from the cryptocurrency-driven GPU squeezes of 2017–2018 and 2021 is the structure of demand. Crypto mining was a spot-market phenomenon — highly price-elastic, geographically mobile, and capable of switching off when marginal revenue fell below the electricity bill. AI inference is the opposite: inelastic, contract-heavy, and tied to specific datacenter locations by latency requirements and data-sovereignty rules. A fintech running a fraud-detection model cannot relocate its inference cluster from Frankfurt to Mumbai to chase a $0.30/hr discount. That structural stickiness is why reserved-contract pricing has become the true price-discovery mechanism for the GPU market — and why spot, rather than being the cheaper alternative, has become the penalty box for poor planning.
"GPU prices are going nuts. Even older chips are holding their value because the whole stack is supply-constrained — not just the silicon but the power, the cooling, the networking, the building itself."

— Carmen Li, CEO of Silicon Data, in Business Insider, April 6, 2026
Carmen Li, who left Bloomberg to build Silicon Data's GPU price-tracking indices, told Business Insider's Alistair Barr on April 6 that the constraint is no longer just about silicon. Silicon Data's indexes — which track actual transaction prices across hundreds of providers — show that the gap between hyperscaler list prices and neocloud effective prices has widened since January. AWS, Azure, and Google Cloud charge a bundled premium that includes their proprietary inference services, IAM layers, and compliance certifications. Most enterprises pay it. But a growing cohort of AI-native startups and mid-tier model providers are routing around the hyperscalers entirely, buying raw compute from neoclouds and building their own serving infrastructure. That bifurcation is creating two GPU markets with different pricing dynamics — and different margins.
The margin question is the one that should keep every CFO awake. At $2.35/hr reserved on an H100 that costs roughly $30,000 to acquire, a neocloud needs a 70% utilization rate to clear a 30% gross margin over a three-year depreciation window — assuming $0.08/kWh power and standard cooling overhead. At 100% utilization, margin expands to roughly 55%. At 50% utilization, the operator loses money. These numbers are sensitive to assumptions: a B200, which costs roughly $40,000 per unit but delivers 2.1x the throughput at batch size 32 on Llama-3-70B, changes the math considerably — but only if you can fill the card. And filling the card, as the utilization data makes brutally clear, is not the industry norm.
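The break-even arithmetic above can be sketched as a small model. The $2.35/hr rate, $30,000 acquisition cost, three-year depreciation window, and $0.08/kWh power are the figures from this paragraph; the 700W board power and 1.5x PUE (cooling overhead) are illustrative assumptions, so the exact margins shift with them even though the break-even behavior holds.

```python
# Sketch of the neocloud unit economics described above. Rate, capex,
# depreciation window, and power price come from the article; board
# power (0.7 kW) and PUE (1.5) are assumed for illustration.

HOURS_PER_YEAR = 8760

def gross_margin(rate_per_hr, utilization, capex=30_000,
                 depreciation_years=3, gpu_kw=0.7,
                 power_cost_kwh=0.08, pue=1.5):
    """Gross margin fraction for one GPU over its depreciation window."""
    revenue = rate_per_hr * HOURS_PER_YEAR * utilization
    depreciation = capex / depreciation_years
    # Assumes full board power draw regardless of load (a simplification).
    power = gpu_kw * pue * HOURS_PER_YEAR * power_cost_kwh
    return (revenue - (depreciation + power)) / revenue

for util in (0.5, 0.7, 1.0):
    print(f"utilization {util:.0%}: margin {gross_margin(2.35, util):+.0%}")
```

With these assumptions the model reproduces the shape of the claim in the text: positive margin at 70% utilization, a much fatter margin at 100%, and a loss at 50%.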
The 5% Problem That Won't Go Away
Enterprise GPU fleets average 5% utilization. The number has been circulating for months — it first appeared in a Cast AI analysis of thousands of Kubernetes clusters in early 2026 — but VentureBeat's April 30 deep-dive pinned down the mechanism. It is not misconfiguration. It is a procurement loop. An enterprise fears GPU scarcity, so it over-commits to reserved instances at favorable negotiated rates. Those instances sit idle while the data engineering team builds the pipeline to feed them. By the time the pipeline is production-ready, the next GPU generation has arrived, and the old reserved capacity looks like a sunk cost — so the enterprise reserves again at a higher tier. The idle capacity never gets decommissioned because canceling a 12-month reserved contract means eating the remainder. The result: 95% of paid-for GPU-hours generate zero inference tokens.
Cast AI's data — drawn from more than 4,000 organizations — showed that median GPU utilization across production AI workloads was 5.1% in Q1 2026, with the 75th percentile reaching only 14.3%. Even the top decile of users, the ones with mature MLOps pipelines and autoscaling infrastructure, averaged 42%. These numbers are for GPU memory utilization, not compute utilization — but for inference workloads on transformer models, the two track each other closely. The implication is staggering: at a blended reserved rate of $2.20/hr across the installed base, the industry is spending roughly $19.25 in GPU rental for every $1.00 of actual compute used.
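The dollars-paid-per-dollar-used figure is, to a first approximation, just the reciprocal of utilization: the blended hourly rate appears in both numerator and denominator and cancels out. A minimal check against the Cast AI percentiles quoted above:

```python
# Dollars of GPU rental paid per dollar of compute actually used.
# The blended rate ($2.20/hr in the article) cancels out of the ratio,
# leaving only the utilization fraction.

def waste_ratio(utilization):
    """Paid GPU-hours per GPU-hour of real work."""
    return 1.0 / utilization

print(f"median (5.1%):          {waste_ratio(0.051):.1f}x")  # ~19.6x
print(f"75th percentile (14.3%): {waste_ratio(0.143):.1f}x")  # ~7.0x
print(f"top decile (42%):        {waste_ratio(0.42):.1f}x")   # ~2.4x
```

The 5.1% median implies roughly $19.60 paid per dollar used; the article's $19.25 corresponds to a blended utilization just above that median.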
This is where the spot market becomes a mirror rather than a market. If every enterprise with idle reserved capacity could sublet that capacity on a short-term basis — truly liquid spot, with standardized contracts and 15-minute billing increments — the effective price of compute would drop overnight. But subletting is not permitted under most enterprise license agreements with hyperscalers, and the neoclouds that do allow it — RunPod, Vast.ai, parts of the CoreWeave spot tier — lack the compliance certifications that procurement departments require. The spot market is liquid at the low end and frozen at the high end. It is a market for tinkerers and startups, not for JP Morgan.
Who Captures the Margin — and Where the Per-Token Price Actually Lands
There are four layers in the GPU compute stack: the chip, the cloud provider, the model host, and the application. The chip layer — Nvidia — captures margin through hardware sales and the CUDA software moat. The cloud layer — hyperscalers and neoclouds — captures margin through the spread between their cost of capital and the price they charge for a GPU-hour. The model layer — Anthropic, OpenAI, the open-weight providers — captures margin through the spread between the per-token price they charge and their inference cost. The application layer — Cursor, Perplexity, Copilot — captures margin through subscription revenue above the per-query inference cost. The question every quarter is where the margin pools.
A FinOps lead at an AI-heavy fintech, who shared internal numbers on condition of anonymity, said their team cut their GPU bill 37% by moving inference workloads from AWS to a bare-metal neocloud and handling the serving layer themselves. The trade: they lost SageMaker auto-scaling and had to hire two platform engineers. At their scale — 1,400 H100-equivalent GPU-hours per day — the trade paid for itself in 11 weeks.
The per-token price that an end user sees is the final compression of all these layers. As of early May 2026, the cheapest publicly listed inference price for Llama-3-70B is $0.23 per million input tokens and $0.49 per million output tokens, from a tier-2 provider running on H100 reserved capacity. The same model on a hyperscaler's managed inference endpoint costs $0.59/$1.09. The difference — roughly 2.6x on input, 2.2x on output — is the bundled-services margin. It is also, from the hyperscaler's perspective, the cost of doing business with an enterprise that demands 99.95% uptime SLAs, VPC isolation, and a BAA for HIPAA compliance. Whether that 2.6x is sustainable depends on whether enterprises value those things at that price. The utilization data suggests many already do not — they are paying the premium and not using the capacity.
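To see what the 2.2x to 2.6x spread means for an actual bill, here is a toy comparison using the two Llama-3-70B price points quoted above. The 600M-input/400M-output monthly token split is a hypothetical workload chosen for illustration, not a figure from the article:

```python
# Monthly inference bill at the two price points quoted in the text:
# tier-2 neocloud at $0.23/$0.49 per million input/output tokens vs.
# hyperscaler managed endpoint at $0.59/$1.09. The token volumes below
# are a made-up workload for illustration.

def monthly_bill(input_mtok, output_mtok, in_price, out_price):
    """Cost in dollars; token volumes given in millions of tokens."""
    return input_mtok * in_price + output_mtok * out_price

neocloud = monthly_bill(600, 400, 0.23, 0.49)     # tier-2 provider
hyperscaler = monthly_bill(600, 400, 0.59, 1.09)  # managed endpoint

print(f"neocloud:        ${neocloud:,.0f}")
print(f"hyperscaler:     ${hyperscaler:,.0f}")
print(f"blended premium: {hyperscaler / neocloud:.1f}x")
```

On this workload mix the blended premium lands between the 2.2x output and 2.6x input ratios, as expected.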
The B200 inflection will test every assumption in this stack. Nvidia's Blackwell architecture delivers roughly 2.1x the tokens-per-second on a 70B-parameter dense model at FP8 compared to an H100, at approximately 1.3x the power draw and 1.4x the per-unit cost. That means the tok/s-per-dollar ratio improves by roughly 50% — assuming workloads can exploit the larger 192GB HBM3e memory pool and the FP8 tensor cores. But memory-bound workloads, particularly long-context inference at 128k tokens, see a smaller throughput gain: roughly 30% in practice, because the HBM bandwidth improvement (8 TB/s on B200 vs 3.35 TB/s on H100) does not scale linearly with price. For these workloads the tok/s-per-dollar improvement drops to roughly 20%. Procurement teams buying for inference-heavy fleets need to know which bucket their workloads fall into before committing to B200 reserved contracts — and, at present, most do not.
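For the compute-bound case, the tok/s-per-dollar claim reduces to throughput ratio divided by cost ratio. A quick check with the 2.1x throughput and 1.4x unit-cost figures cited above:

```python
# Perf-per-dollar change reduces to throughput gain over cost gain.
# The 2.1x and 1.4x inputs are the article's figures for dense 70B
# inference at FP8 on B200 vs. H100.

def perf_per_dollar_gain(throughput_ratio, cost_ratio):
    """Fractional improvement in tokens/sec per dollar vs the baseline."""
    return throughput_ratio / cost_ratio - 1.0

compute_bound = perf_per_dollar_gain(2.1, 1.4)
print(f"compute-bound FP8 dense 70B: {compute_bound:+.0%}")  # -> +50%
```

The same function applies to the memory-bound bucket once a team has measured its own throughput ratio at its actual context lengths, which is the measurement the paragraph argues most procurement teams have not yet made.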
The reserved-to-spot spread on B200 is also, at this moment, nearly nonexistent. Because B200 supply is still ramping and most allocation is going to anchor tenants with multi-year commitments (Meta alone committed $21 billion to CoreWeave through 2029, Forbes reported on April 13), there is effectively no spot market for B200 capacity. What little does appear on exchanges is priced within 5% of reserved rates, because the alternative — going back to an H100 — is less attractive at the margin. This will change as Blackwell volumes hit in H2 2026 and the first wave of reserved contracts comes up for renewal in Q4. Whether B200 spot drops to a discount-to-reserved or stays at parity will be the single best indicator of whether the GPU supply-demand balance has finally tipped.
A second-order effect worth watching: the IndiaAI Mission tender, which closed in late April 2026, saw B200 pricing come in 10% below expected levels due to aggressive bidding from local compute service providers, The Economic Times reported on April 28. This is a single tender in a single country, but it is the first large-scale public benchmark of B200 compute pricing outside the hyperscaler negotiating table — and it suggests that the India market, with its lower power and cooling costs, is already pricing GPU compute at a discount to US and European rates. If Indian neoclouds can offer B200 capacity at $2.60–$2.80/hr while US neoclouds are holding at $3.10–$3.50, the arbitrage will attract workloads that are latency-tolerant and not constrained by data-sovereignty — batch inference, synthetic data generation, model evaluation runs. That arbitrage is not open yet. The Indian providers still lack the InfiniBand fabric density to serve tightly-coupled training workloads. But as a leading indicator of price pressure, it is the most important datapoint from Q2 2026 that nobody is talking about.
The Flexera 2026 State of the Cloud report, cited by CRN on April 3, found that 74% of enterprises now use reserved instances for at least some GPU workloads, up from 51% in 2024. AWS Reserved Instances and Azure Savings Plans remain the dominant discount vehicles, but the report noted a marked shift: enterprises are signing one-year commitments rather than three-year, preserving optionality for the B200 transition. That shift shortens the duration risk for buyers but reduces the discount they receive — a one-year RI on AWS typically yields 40% off on-demand versus 60% for a three-year. The net effect is that enterprises are paying a higher effective hourly rate for the privilege of staying flexible. It is a rational decision in a market where the chip generation cycle is now roughly 18 months. But it also means the downward pressure on reserved pricing that long-duration contracts historically created is weakening.
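The effective-rate consequence of the one-year-versus-three-year shift can be made concrete. Taking the roughly $4.90 per-GPU-hour on-demand figure from earlier in the piece as the baseline (an assumption for illustration; negotiated baselines vary), the typical 40% and 60% discounts imply:

```python
# Effective hourly rates implied by the RI discount tiers above.
# The $4.90/GPU-hr on-demand baseline and the 40%/60% "typical"
# discounts are the article's figures, applied mechanically.

ON_DEMAND = 4.90  # per GPU-hour

def effective_rate(on_demand, discount):
    """Hourly rate after a reserved-instance discount."""
    return on_demand * (1.0 - discount)

one_year = effective_rate(ON_DEMAND, 0.40)    # ~$2.94/hr
three_year = effective_rate(ON_DEMAND, 0.60)  # ~$1.96/hr
print(f"1-yr RI: ${one_year:.2f}/hr")
print(f"3-yr RI: ${three_year:.2f}/hr")
print(f"flexibility premium: {one_year / three_year - 1:.0%}")
```

On these numbers, a buyer choosing the one-year term pays roughly 50% more per hour for the option to re-commit when B200 capacity matures — the "rational decision" the paragraph describes, with its price tag made explicit.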
What happens next turns on two variables. The first is B200 volume: if Nvidia hits its H2 shipment targets — and every OEM I have spoken to says the TSMC CoWoS-L capacity is the gating factor, not wafer supply — then the H100 spot price should begin a sustained decline by Q4 2026 as B200 becomes the new baseline for reserved contracts. The second is the utilization problem. If the 5% figure does not budge — and there is no structural reason it should, since the procurement loop VentureBeat identified is a function of organizational incentives, not technology — then GPU compute will remain simultaneously scarce and wasted, with prices elevated by the demand signal from buyers who never actually use what they buy. That is a market failure large enough to be measured in billions. In the meantime, the number to watch is $2.35. If H100 reserved breaks above $2.60 by July, the 38% surge will look like the beginning of a new trend, not the end of an old one.