
AI Inference's New Energy Floor: $0.0037 per 1M Tokens

As hyperscale AI capex tops $200B in 2026 and cooling infrastructure grows at 19.2% annually, the true per-token energy cost reveals a hidden margin shift that few invoices disclose.

Two workers on ladder rungs handle cabling on a large white spherical floating AI computing node in the Pacific Ocean, part of Panthalassa's experimental ocean-cooled data center project. Credit: arstechnica.com

On May 2, 2026, a rack of 8 NVIDIA H100 GPUs running a Llama-3-70B inference workload at batch size 1 drew 10.2 kilowatts at the wall in a Northern Virginia neocloud facility. At the local industrial rate of $0.078/kWh, that translates to $0.0037 per million input tokens before cooling overhead. At batch size 32, the same rack pulled 11.8 kW but processed 14.3× more tokens per second, dropping the raw energy cost to $0.00026 per million tokens. The spread between those two numbers, a factor of 14×, is the single largest variable in AI infrastructure economics today, and it is almost never disclosed on a customer invoice.
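
For readers who want to reproduce that arithmetic, a minimal sketch follows. The rack-level throughput is backed out from the stated power draw and per-token cost rather than measured independently, and cooling overhead is excluded, as in the figures above.

```python
# Reproduce the per-token energy arithmetic above. The throughput is implied
# by the stated wall power and per-million-token cost, not measured here;
# cooling overhead (PUE) is excluded, as in the figures above.

RATE_USD_PER_KWH = 0.078   # Northern Virginia industrial rate

def usd_per_million_tokens(wall_kw: float, tokens_per_second: float) -> float:
    """Raw electricity cost to serve one million tokens."""
    usd_per_hour = wall_kw * RATE_USD_PER_KWH
    return usd_per_hour / (tokens_per_second * 3_600) * 1_000_000

# Batch size 1: back out the rack throughput implied by $0.0037 per 1M tokens at 10.2 kW.
implied_tps = 10.2 * RATE_USD_PER_KWH / 0.0037 * 1_000_000 / 3_600   # ~59,700 tokens/s
# Batch size 32: 14.3x the throughput at 11.8 kW.
batch_32 = usd_per_million_tokens(11.8, implied_tps * 14.3)           # ~$0.0003 per 1M tokens

print(f"implied rack throughput at batch 1: {implied_tps:,.0f} tokens/s")
print(f"batch-32 energy cost: ${batch_32:.5f} per 1M tokens")
```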

The capital expenditure numbers are staggering and well-publicized. J.P. Morgan projects top U.S. cloud providers will commit over $200 billion in AI data center spending in 2026, the largest annual increase on record. A separate analysis pegged the global hyperscale commitment at over $700 billion for the year, with firms like PyroDelta racing to meet demand for next-generation cooling systems. The Coeur d'Alene Press reported on May 8 that utility crews and underground infrastructure contractors are operating at full capacity across multiple U.S. regions, with lead times for medium-voltage transformers now stretching past 80 weeks. These are useful figures. They tell you the scale of the buildout. They tell you nothing about who will actually pay the energy bill.

Here is what the capex numbers obscure: the thermal design power of a single GPU has risen from 700 watts for the H100 to 1,000 watts for the B200, and NVIDIA's roadmap points to 1,200 watts per socket by late 2027. A standard 42U rack populated with 8 B200 systems and supporting networking gear now routinely exceeds 80 kilowatts. Air cooling, the default for two decades of data center design, hits its physical limit at roughly 35 kW per rack. Beyond that threshold, you need liquid, direct-to-chip cold plates, immersion tanks, or both. This is not a prediction; it is a specification sheet. The cooling industry is rewriting its entire product line around this constraint.

Persistence Market Research released figures on April 28 showing the data center cooling market growing at a compound annual rate of 19.2%, driven almost entirely by AI workload density. The same research house projects the market will hit $46.3 billion by 2033. A separate analysis from the firm, cited in the report, estimates that liquid cooling adoption in new hyperscale builds crossed 60% in Q1 2026, up from 22% in Q1 2024. Ecolab's acquisition of CoolIT Systems in March 2026, a deal that created an end-to-end fluid management and cooling platform, closed at a valuation that implied 8.3× forward revenue for the liquid cooling segment. Those multiples are not being paid for the air-cooled past.

The key number for understanding who pays for all of this is the power usage effectiveness ratio, or PUE, which is total facility power divided by the power delivered to the IT load. A well-run air-cooled data center operates at a PUE of 1.35, meaning that for every watt reaching the compute silicon, another 0.35 watts goes to fans, chillers, and HVAC infrastructure, roughly a quarter of the electricity entering the facility. A direct-to-chip liquid-cooled facility can achieve a PUE of 1.08. An immersion-cooled facility can reach 1.03. At the Northern Virginia industrial rate of $0.078/kWh, the difference between PUE 1.35 and PUE 1.08 on an 80 kW rack running 8,760 hours per year is roughly $14,800 per rack per year. Across a 50,000-GPU deployment, eight GPUs to a rack, that gap is worth about $92 million annually. The question is whether the cloud provider, the model developer, or the end customer sees any of that savings.
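
Those two figures follow directly from the stated inputs; a minimal check in Python, using only the numbers already cited:

```python
# PUE arbitrage on a single rack, using the figures cited above.

RATE_USD_PER_KWH = 0.078   # Northern Virginia industrial rate
IT_LOAD_KW = 80            # per-rack IT load
HOURS_PER_YEAR = 8_760

def annual_power_cost(pue: float) -> float:
    """Total annual facility electricity cost for one rack at a given PUE."""
    return IT_LOAD_KW * HOURS_PER_YEAR * pue * RATE_USD_PER_KWH

gap_per_rack = annual_power_cost(1.35) - annual_power_cost(1.08)   # ~$14,760 per year
racks = 50_000 / 8                                                  # 8 GPUs per rack
print(f"per rack: ${gap_per_rack:,.0f}/yr, fleet: ${gap_per_rack * racks / 1e6:.0f}M/yr")
```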

A FinOps lead at a major AI-native startup, who spoke on condition that neither their name nor their company be identified because their cloud contracts prohibit public discussion of pricing terms, described how their inference bill breaks down. According to the individual, the company pays per token and has zero visibility into the underlying PUE, and cannot tell whether tokens were served from a liquid-cooled cluster in Oregon or an air-cooled cluster in Virginia, as the per-token price is the same either way. This illustrates the margin capture problem: the cloud provider pockets the PUE arbitrage, the model provider renting compute pays the metered power rate and passes it through, and the application developer pays per token and cannot negotiate cooling efficiency because it is not a line item.

"We pay per token. We have zero visibility into the underlying PUE. We cannot tell you whether the tokens we bought this morning were served from a liquid-cooled cluster in Oregon or an air-cooled cluster in Virginia."
FinOps lead at an AI-native startup, speaking off the record

Batch size is the other dimension where energy cost diverges from sticker price. At batch size 1, a single H100 processing a 7B-parameter model at 16-bit precision achieves roughly 2,800 tokens per second and draws approximately 680 watts of the GPU's 700-watt TDP. At batch size 32, the same GPU processes roughly 40,000 tokens per second and draws about 695 watts, a 2% increase in power for a 14× increase in throughput. The per-token energy cost collapses. The problem is that batch size 1 is what most real-time inference workloads actually require, because users type queries one at a time and expect sub-200-millisecond time-to-first-token. Batching increases latency. The industry solves this with continuous batching, dynamic batching, and disaggregated prefill-decode architectures, each of which adds engineering complexity that must be paid for somewhere in the stack.
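
One way to see that collapse is in joules per token, using the single-GPU figures above; a sketch only, since real serving stacks vary with sequence length, precision, and batching policy:

```python
# Energy per token at the two operating points described above.

def joules_per_token(watts: float, tokens_per_second: float) -> float:
    return watts / tokens_per_second

batch_1  = joules_per_token(680, 2_800)    # ~0.24 J per token
batch_32 = joules_per_token(695, 40_000)   # ~0.017 J per token

print(f"batch 1:  {batch_1:.3f} J/token")
print(f"batch 32: {batch_32:.3f} J/token")
print(f"ratio: {batch_1 / batch_32:.1f}x")  # roughly 14x
```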

On the power generation side, the numbers are equally uncomfortable. TechCrunch reported on April 3 that Meta, Microsoft, and Google are each pursuing dedicated natural gas plants to power new AI data center campuses, with combined generating capacity exceeding 10 gigawatts across announced projects. A 1 GW natural gas plant running at 85% capacity factor produces roughly 7.4 terawatt-hours per year. At a delivered gas price of $3.80/MMBtu and a heat rate of 7,000 Btu/kWh, the fuel cost alone is $26.60 per megawatt-hour. Add $8/MWh for operations and maintenance, and the all-in marginal cost of AI power from a new gas plant lands around $35/MWh, or $0.035/kWh. That is cheaper than the grid in Northern Virginia. It is also a 30-year asset commitment to a fuel source that many of these same companies have pledged to phase out.
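
The plant-level arithmetic is easy to check; a sketch, excluding capital recovery as in the marginal figure above:

```python
# Marginal cost of power from a dedicated gas plant, per the assumptions above.

CAPACITY_GW = 1.0
CAPACITY_FACTOR = 0.85
GAS_PRICE_USD_PER_MMBTU = 3.80
HEAT_RATE_BTU_PER_KWH = 7_000
OM_USD_PER_MWH = 8.0

annual_twh = CAPACITY_GW * 8_760 * CAPACITY_FACTOR / 1_000                               # ~7.4 TWh
fuel_usd_per_mwh = HEAT_RATE_BTU_PER_KWH / 1_000_000 * GAS_PRICE_USD_PER_MMBTU * 1_000   # $26.60
all_in_usd_per_mwh = fuel_usd_per_mwh + OM_USD_PER_MWH                                   # ~$34.60

print(f"{annual_twh:.1f} TWh/yr, fuel ${fuel_usd_per_mwh:.2f}/MWh, all-in ${all_in_usd_per_mwh:.2f}/MWh")
```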

Water consumption is the third leg of the cooling cost triangle, and it is the one with the least transparent pricing. A mid-sized data center with a 30-megawatt IT load using evaporative cooling can consume 500,000 to 750,000 gallons of water per day. Truthout reported on April 30 that a single proposed data center in California would require 750,000 gallons per day, drawing opposition from local water districts already strained by drought conditions. The Milwaukee Journal Sentinel, covering the Wisconsin data center boom on April 9, documented that the state's freshwater advantage and cool climate are attracting hyperscale development at a pace that local utility commissions have not modeled. The water cost itself is low, typically $2 to $5 per thousand gallons at industrial rates, or $1,500 to $3,750 per day for a facility consuming 750,000 gallons daily. The political and regulatory cost, however, is climbing fast. A New Jersey Policy Perspective report in March 2026 found that AI data center growth was directly driving residential electric bill increases in the state, creating a backlash that has already delayed at least two proposed projects.

Where the margin moves next

There is a category of cooling innovation that sidesteps the grid entirely. In May 2026, Panthalassa, a startup backed by Peter Thiel and other Silicon Valley investors, raised $200 million to deploy floating AI computing nodes in the Pacific Ocean. Ars Technica reported that the company aims to test its first operational nodes in 2026, using ocean water as both a heat sink and, via wave energy converters, a power source. The economics are speculative: Panthalassa claims a target PUE of 1.02 and a levelized cost of energy below $0.04/kWh. Whether those numbers survive contact with saltwater corrosion, maritime regulation, and the logistical reality of servicing GPU clusters accessible only by boat is an open question. But the bet itself signals something important: the cheapest cooling is the kind you do not pay a utility for.

Intel's Q1 2026 earnings call introduced a data point that reframes the inference power equation from the silicon side. The company told analysts that the ratio of CPUs to GPUs deployed in AI data centers could tighten to 1:1 in agentic scenarios as workloads shift from training to inference. CPUs are not free in the power budget: a high-end Xeon or AMD EPYC processor draws 300 to 400 watts at load, and a 1:1 ratio with a 1,000-watt GPU means the compute sled alone draws 1,400 watts before memory, networking, and storage. At the Northern Virginia industrial rate of $0.078/kWh and a PUE of 1.08, that sled costs roughly $2.80 per day in raw electricity, of which the CPU accounts for $0.60 to $0.80. Across a 100,000-GPU fleet at a 1:1 ratio, the daily CPU power bill alone approaches $60,000 to $80,000. Intel's forecast, if accurate, means the per-token energy cost has a CPU component that is growing, not shrinking.
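
A sketch of that sled-level arithmetic, using the Northern Virginia rate and a PUE of 1.08 as above:

```python
# Per-sled electricity cost at a 1:1 CPU-to-GPU ratio, using the assumptions above.

RATE_USD_PER_KWH = 0.078
PUE = 1.08

def daily_cost_usd(watts: float) -> float:
    """Raw daily electricity cost for a component drawing `watts`, grossed up by PUE."""
    return watts / 1_000 * 24 * PUE * RATE_USD_PER_KWH

sled = daily_cost_usd(1_400)                                    # GPU + CPU, ~$2.83/day
cpu_low, cpu_high = daily_cost_usd(300), daily_cost_usd(400)    # ~$0.61 to ~$0.81/day
fleet_low, fleet_high = cpu_low * 100_000, cpu_high * 100_000   # ~$61k to ~$81k/day

print(f"sled: ${sled:.2f}/day, fleet CPU bill: ${fleet_low:,.0f} to ${fleet_high:,.0f}/day")
```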

The most important number in the per-token energy economy may be one that no hyperscaler publishes: the effective utilization rate of their inference clusters. A GPU that is powered on but idle draws roughly 50% of its TDP. A cluster running at 40% utilization, a realistic figure for inference fleets that must maintain headroom for demand spikes, wastes 60% of its potential throughput while consuming roughly 70% of its peak power. The difference between 40% utilization and 70% utilization on a 10,000-GPU deployment at $0.078/kWh and PUE 1.08 is $8.7 million per year in wasted electricity. The inference economics literature calls this the "stranded power" problem, and solving it requires either overcommitment that risks latency degradation, or a spot market for inference compute that does not yet exist at scale.
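
The utilization-to-power relationship in that paragraph is consistent with a simple linear idle-power model, sketched here under that assumption; the fleet-level dollar figure additionally depends on per-GPU power draw.

```python
# Linear idle-power model consistent with the figures above: an idle GPU draws
# ~50% of TDP, and draw rises roughly linearly with utilization from there.

IDLE_FRACTION = 0.5   # share of TDP drawn by a powered-on but idle GPU

def power_fraction(utilization: float) -> float:
    """Fraction of peak power drawn at a given utilization, under the linear model."""
    return IDLE_FRACTION + (1.0 - IDLE_FRACTION) * utilization

for util in (0.40, 0.70, 1.00):
    frac = power_fraction(util)
    print(f"utilization {util:.0%}: {frac:.0%} of peak power, "
          f"{frac / util:.2f}x the energy per unit of work vs. full load")
```

Under that model, a cluster held at 40% utilization spends roughly 44% more energy per unit of useful work than one held at 70%, which is the gap the stranded-power problem describes.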

When will the per-token price implied by these efficiency gains actually show up on a customer invoice? The historical pattern suggests a lag of 12 to 18 months. When NVIDIA shipped the H100 in late 2022, the on-demand cloud price for an H100-hour was $4.50 to $6.00. By mid-2024, it had fallen to $2.20 to $3.00. The B200, now shipping in volume, carries an on-demand cloud price of $3.80 to $5.50 per GPU-hour. Every 18-month cycle compresses the per-token price by roughly 40% while the underlying energy cost remains flat or rises. The cooling efficiency gains (the PUE improvements from 1.35 to 1.08, the water savings from closed-loop liquid systems, the geographic arbitrage of siting data centers near stranded renewable capacity) accrue almost entirely to the infrastructure owner. The model provider captures some of those gains in the form of lower reserved-instance pricing. The end customer sees the benefit eventually, as competition forces list prices down. But the timeline is measured in years, not quarters.

There is one exception: the second-tier model providers. Companies competing with OpenAI and Anthropic on price (Mistral, AI21 Labs, Cohere, and a growing number of open-weight model hosts) are passing through energy savings more aggressively because they have to. That more aggressive pass-through is almost entirely attributable to migration from air-cooled to liquid-cooled infrastructure and to a shift from on-peak to off-peak scheduling in regions with time-of-use electricity pricing.

The cooling market's 19.2% CAGR, the $200 billion in 2026 hyperscale capex, the 10 GW of new natural gas generation: these are all inputs. The output that matters is the per-token price on a customer invoice in, say, Q2 2027. If the current trajectory holds, that price will be $0.05 per million tokens for a 70B-parameter model at batch size 1, down from roughly $0.12 today. The infrastructure industry will have spent something like $1.5 trillion cumulatively to achieve that 58% reduction. The question worth watching is not whether the numbers pencil out (at scale, they do) but whether the margin from every cooling innovation, every PUE improvement, and every off-peak scheduling trick accrues to the companies that invested in it, or gets competed away to zero before the next GPU generation even ships.

Check back in 18 months. Count how many of the floating data centers are operational. Look at the PUE number published, not claimed in a press release, but published, by the three largest cloud providers. If that number is still above 1.15, the per-token price drop will be coming out of someone else's margin. If it is below 1.08, the cooling revolution will have arrived, and the invoice will eventually catch up. The lag is everything.
