
Inference ASICs Now Handle Two-Thirds of AI Compute

Transformer-specialised silicon sheds training circuitry to create a new class of inference ASICs optimised purely for token speed and power efficiency.

Inference workloads consumed two-thirds of all AI compute resources in early 2026, overtaking training for the first time as the primary driver of hardware investment, according to an MSN market analysis published 11 May. The number had been climbing for eighteen months, but crossing the two-thirds threshold crystallised something that process integration engineers and fabless design teams had been betting on since late 2024: the silicon supply chain is reorganising itself around the token, not the checkpoint.

The shift has a specific technical shape. A training GPU must be good at everything: matrix multiplication at BF16, attention mechanisms that span 128K-token context windows, all-reduce collectives across thousands of nodes, and enough programmability to survive the next architecture paper out of Berkeley or Beijing. An inference chip only needs to run the forward pass. If you are willing to lock the architecture to transformers, and transformers only, you can strip out the training circuitry, shrink the die, drop the DRAM bandwidth requirements by an order of magnitude, and deliver tokens per watt that make an H200 look indecently wasteful.
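To make the contrast concrete, here is a minimal numpy sketch of what "only the forward pass" means: the weights are load-time constants, there is no gradient or optimiser state, and the whole job is one deterministic data path from context to next token. The one-layer model and its shapes are illustrative, not any vendor's design.

```python
# Minimal sketch of the inference-only contract: weights are frozen
# constants, there is no gradient or optimiser state, and the chip's
# whole job is this forward pass. One layer, illustrative shapes;
# not any vendor's design.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 1000

# Weights are load-time constants: no optimiser state, no all-reduce.
W_qkv = rng.standard_normal((d_model, 3 * d_model)) * 0.02
W_out = rng.standard_normal((d_model, vocab)) * 0.02

def decode_step(x):
    """One forward pass over the current context x: (seq, d_model)."""
    q, k, v = np.split(x @ W_qkv, 3, axis=-1)
    scores = (q @ k.T) / np.sqrt(d_model)
    scores += np.triu(np.full(scores.shape, -np.inf), k=1)  # causal mask
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    logits = (attn @ v) @ W_out
    return int(logits[-1].argmax())  # greedy next-token choice

context = rng.standard_normal((8, d_model))
next_token = decode_step(context)
```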

The trade-off is not theoretical. Groq's Language Processing Unit, the dataflow architecture that Nvidia licensed in a deal valued at roughly $20 billion and integrated into its Vera Rubin platform, achieves its throughput numbers by being deterministic: every operation is scheduled at compile time, there is no cache hierarchy to manage, and the memory system is pure SRAM. The chip cannot train a model. It cannot run a mixture-of-experts gating network that was not anticipated at tape-out. It does exactly one thing, and the GTC 2026 keynote made clear that Nvidia now considers that one thing to be the main event. DigiTimes reported from the San Jose conference that the Groq integration was "widely seen as a strategic effort to defend its market share and discourage customers from turning to application-specific integrated circuits, or ASICs."
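The determinism claim is easier to see in miniature. In the toy sketch below, a "compiler" emits a fixed cycle-by-cycle program and the "chip" simply replays it; there is no cache to miss, no arbitration to lose, and no runtime decision anywhere in the loop. The operation names and operand labels are entirely schematic.

```python
# Toy illustration of compile-time scheduling: the compiler emits a
# fixed cycle-by-cycle program and the chip replays it verbatim.
# Entirely schematic; not Groq's actual instruction set.
program = [
    (0, "load_weights", "sram[0:64]", "mac_array"),
    (1, "matmul",       "act_in",     "acc"),
    (2, "softmax",      "acc",        "acc"),
    (3, "matmul",       "acc",        "act_out"),
]

def run(program):
    for cycle, op, src, dst in program:
        # Every operand location and every cycle was fixed at compile
        # time, so latency is known exactly: no misses, no stalls.
        print(f"cycle {cycle}: {op} {src} -> {dst}")

run(program)
```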

The irony is that Nvidia is using one ASIC strategy to fight another. While Jensen Huang was on stage projecting a $1 trillion inference market, Google was deep in negotiations with Marvell Technology to design two new inference-optimised chips, adding a third partner to a TPU supply chain that already includes Broadcom. The Next Web reported on 19 April that custom ASIC sales in the AI segment are projected to grow 45 percent in 2026, driven almost entirely by inference deployments. The hyperscalers are not diversifying for leverage alone; they are chasing a different point on the power-performance curve that a general-purpose GPU cannot reach.

What a transformer-only ASIC chooses not to be good at is the more interesting list. It cannot handle state-space models like Mamba-3, which VentureBeat covered in March as achieving nearly 4 percent better language modelling scores than transformers with reduced latency. It cannot pivot if the next generation of models moves to a fundamentally different attention mechanism. It cannot be repurposed for scientific computing, molecular dynamics, or any of the other workloads that justify HPC cluster budgets. The bet is that the transformer, and its linear-attention variants, will dominate commercial inference long enough to amortise a 7 nm or 5 nm mask set. Given that every major model API endpoint from OpenAI, Anthropic, Google, and Meta currently serves transformer-derived architectures, the bet is not reckless.

But it is a bet. "The traditional software playbook promised that more users meant better margins," Forbes noted in its coverage of a Stanford lecture on AI hardware economics in late April. "Generative AI breaks that rule entirely; every new user requires burning expensive GPU compute." The inference ASIC thesis is a direct response to that broken rule. If every query costs money at the transistor level, the only way to scale gross margin is to reduce the transistor count per query. Training ASICs don't solve this; they optimise a capital expense that is increasingly concentrated in a handful of frontier labs. Inference ASICs target operating expense, which scales with every user.
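The claim about transistor-level economics reduces to back-of-envelope arithmetic. The sketch below uses illustrative round figures, not any vendor's actual costs, to show the shape of the calculation: serving cost per token is an energy term plus an amortised-hardware term, and both fall when the silicon does less per token.

```python
# Back-of-envelope serving economics. Every figure here is an
# illustrative round number, not a vendor's actual cost.
power_w = 700             # sustained board power during serving
tok_per_s = 3000          # sustained serving throughput
elec_per_kwh = 0.10       # USD per kWh
hw_cost = 30_000          # USD per accelerator, amortised over 3 years
amort_s = 3 * 365 * 24 * 3600

energy_cost_per_tok = (power_w / 1000 / 3600) * elec_per_kwh / tok_per_s
hw_cost_per_tok = hw_cost / amort_s / tok_per_s
total = energy_cost_per_tok + hw_cost_per_tok
print(f"≈ ${total * 1e6:.2f} per million tokens")
# Halving joules per token halves the first term; a smaller, cheaper
# die shrinks the second. Both terms scale with every user served.
```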

The bottleneck that defines this generation of inference silicon is memory bandwidth, and the design responses are diverging sharply. Groq's LPU avoids high-bandwidth memory entirely, using 230 MB of on-chip SRAM distributed across its dataflow pipeline to keep the model weights stationary and stream tokens through. d-Matrix's Corsair, which entered volume sampling in late 2025, instead opts for in-memory compute built on LPDDR5 and SRAM, and claims a 10x improvement in tokens per watt over GPU baselines for the same Llama-class models. The common thread is a refusal to pay the HBM tax: a stack of eight HBM3e modules costs more than the logic die it sits beside and consumes roughly half the package power budget.
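The arithmetic behind the diverging designs is the standard bandwidth-bound decode ceiling: at small batch sizes, every generated token has to stream the full set of weights through the compute units at least once, so the memory system sets the speed limit. The figures below are illustrative round numbers; the aggregate SRAM bandwidth in particular is an assumption, and in practice a 70B-parameter model is sharded across many SRAM-only chips.

```python
# Bandwidth-bound decode ceiling: tokens/s <= bandwidth / weight bytes
# at batch 1. All figures are illustrative round numbers.
params = 70e9             # Llama-class parameter count
bytes_per_param = 1       # INT8/FP8 weights
weight_bytes = params * bytes_per_param

memories = {
    "HBM3e package (~4 TB/s)": 4e12,
    "aggregate on-chip SRAM (assumed ~80 TB/s)": 80e12,
}
for name, bw in memories.items():
    print(f"{name}: ceiling ≈ {bw / weight_bytes:,.0f} tokens/s at batch 1")
# Batching amortises the weight reads across concurrent queries, which
# is why throughput-oriented designs push batch size as far as the
# latency budget allows.
```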

Euclyd, the Dutch startup backed by former ASML CEO Eric Meurice, is approaching the same problem from the other direction. Rather than building a new chip, CNBC reported on 17 April that the company is seeking at least $100 million to scale a chiplet-based interconnect that stitches existing inference accelerators into a coherent fabric. The pitch is not a better MAC array but a better way to keep the ones you already have from stalling on memory fetches. Fractile, a UK startup that also appeared in CNBC's funding roundup, is working on in-memory compute at the analogue-digital boundary, where the precision trade-offs get uncomfortably sharp below 8 bits.

The supply chain implications are as significant as the architectural ones. A training GPU tape-out at TSMC N3 typically requires 18 to 24 months from RTL freeze to volume production, with yields that can hover below 50 percent for the first two quarters. An inference ASIC built on a mature node, with no training circuitry, a fixed dataflow, and a smaller die, can go from spec to silicon in 12 months with yields above 80 percent by the third month. Samsung's foundry, which is manufacturing the Groq LPU on its 4 nm process, has been aggressively courting inference ASIC designers with a process design kit tuned for deterministic data paths and SRAM-heavy floorplans. The foundry competition that TSMC largely ignored during the training boom is now material in the inference cycle.
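The yield gap follows from die area alone under the simplest defect model, Y = exp(−D₀·A). The defect density and die areas below are assumed figures chosen only to show the shape of the curve.

```python
# Poisson die-yield model: Y = exp(-D0 * A). D0 and the die areas are
# assumed illustrative figures, not foundry data.
import math

d0 = 0.1  # defects per cm^2 on a mature node (assumed)
for name, area_cm2 in [("training-GPU-class die", 8.0),
                       ("inference-ASIC-class die", 2.0)]:
    y = math.exp(-d0 * area_cm2)
    print(f"{name}: {area_cm2:.1f} cm^2 -> {y:.0%} predicted yield")
```

At the same defect density, quartering the die area moves the predicted yield from the mid-40s to above 80 percent, which is the gap described above.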

Google's talks with Marvell, as The Next Web detailed, include a memory processing unit alongside an inference-optimised TPU. The MPU is the more telling design win. It signals that even companies with mature internal silicon teams see the memory wall as the binding constraint and are willing to commission a standalone chip to address it. Broadcom, which has been Google's primary TPU design partner since the programme's inception, is not being displaced; the Marvell engagement is additive. Google is effectively running two inference ASIC programmes in parallel, each targeting a different segment of the latency-throughput curve. One is optimised for the millisecond response-time requirements of its search and assistant products; the other, for the batch-processing economics of its cloud inference API.

EDA tooling is adapting in ways that reinforce the ASIC trend. Cadence and Synopsys both released inference-specific design flows in the first quarter of 2026 that automate the mapping of a transformer architecture specification to a physical floorplan, pre-characterising the memory macros and MAC arrays for the forward-pass data path. The flows assume a fixed model architecture at tape-out and optimise for it, which is exactly the constraint that a transformer-only ASIC accepts. The result is a reduction in physical design time from six months to roughly ten weeks for a mid-complexity inference die. The tooling investment is a lagging indicator of market conviction, and in this case it is telling.

What is absent from the inference ASIC story so far is a credible merchant silicon player outside Nvidia. The hyperscalers are building their own, the startups are targeting niche deployments or acquisition, and the independent foundry ecosystem is filling in the gaps, but there is no AMD or Intel inference ASIC that a mid-tier cloud provider or an enterprise can buy off the shelf. AMD's MI-series remains a training-first architecture with inference bolted on. Intel's Gaudi 3 competes on total cost of ownership for training but has not broken out an inference-only SKU with the power envelope to justify a dedicated deployment. The inference ASIC market is a market of custom silicon parties to which the general public is not invited.

The Stanford lecture covered by Forbes made a related point about value concentration. If the inference hardware layer consolidates around a handful of custom ASIC architectures, the economic surplus migrates from the chip vendor to the hyperscaler who commissioned the design. A Google inference TPU is not for sale; its value accrues to Google's cloud margin. An Anthropic model served on an Amazon Trainium3 is value captured by AWS. The merchant GPU market, which made Nvidia a $3 trillion company, was built on the premise that the same silicon serves everyone. The inference ASIC market is built on the opposite premise: that differentiation at the silicon level is a competitive moat wide enough to justify the NRE.

The next milestone to watch is the third quarter of 2026, when both the Groq LPU in Nvidia's Rubin racks and Google's inference TPU on TSMC N3E are expected to reach volume deployment. The metric that matters is not peak TFLOPS, which is a training-era vanity number that tells you nothing about tokens per second per watt under a real serving load. The metric is tokens per joule at p99 latency under a 128-batch concurrent query profile. Whoever publishes that number first, with a reproducible benchmark on a model the industry actually uses, will define the inference ASIC conversation for the next two years. Nobody has published it yet.
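For what it would take to publish that number, a sketch: given per-query latencies, output token counts, and rack-level energy for a fixed 128-way concurrent run, the metric is one percentile check and one division. The field names, the 200 ms SLO, and the synthetic data below are assumptions, not an existing benchmark's schema.

```python
# Sketch of the proposed headline metric: tokens per joule, valid only
# if p99 latency meets an SLO at fixed concurrency. The SLO and the
# synthetic data are assumptions for illustration.
import numpy as np

def tokens_per_joule_at_p99(latencies_ms, tokens_out, energy_j, slo_ms=200.0):
    """Per-query latencies (ms) and token counts, total joules for the run."""
    if np.percentile(latencies_ms, 99) > slo_ms:
        return None  # invalid run: throughput bought by blowing the tail
    return np.sum(tokens_out) / energy_j

rng = np.random.default_rng(1)
lat = rng.lognormal(mean=4.5, sigma=0.3, size=10_000)  # ms, synthetic
toks = rng.integers(64, 512, size=10_000)              # output tokens/query
energy = 5.2e5                                         # J, measured at the rack
print(tokens_per_joule_at_p99(lat, toks, energy))
```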

The transformer-only bet is, in the end, a bet on architectural stasis, and the semiconductor industry has rarely been rewarded for betting that way. The x86 ISA survived four decades because the software ecosystem was too heavy to lift. Transformer architectures have no such lock-in; they are seven years old, defined in software, and under active assault from state-space models, liquid networks, and whatever comes out of the next NeurIPS. The ASIC designers know this. Their response is that seven years is an eternity in silicon, and that a chip taped out in 2026 can pay for itself by 2028 on the inference volumes that the hyperscalers are projecting. The arithmetic works. Whether the architecture holds still is the question.
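The arithmetic the designers are leaning on is simple amortisation. With assumed figures for NRE and for the per-token cost gap against a GPU baseline, the payback period lands comfortably inside the two-year window; every number below is an assumption, and the conclusion is only as good as the volume projection.

```python
# Payback arithmetic for a 2026 tape-out. Every figure is an assumed
# illustration; the structure of the calculation is the point.
nre = 500e6                  # USD: masks, design, bring-up (assumed)
gpu_cost_per_mtok = 0.60     # USD per million tokens, GPU baseline (assumed)
asic_cost_per_mtok = 0.15    # USD per million tokens on the ASIC (assumed)
saving_per_tok = (gpu_cost_per_mtok - asic_cost_per_mtok) / 1e6

daily_tokens = 2e12          # projected serving volume (assumed)
payback_days = nre / (saving_per_tok * daily_tokens)
print(f"payback ≈ {payback_days:.0f} days")   # ~556 days: inside two years
```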

Nvidia's $20 billion Groq license, Google's dual-track inference programme with Broadcom and Marvell, and the $275 million that d-Matrix raised in its Series C are all priced against the same assumption: that in 2028, a majority of the world's largest AI workloads will still be transformers, and they will still spend the overwhelming majority of their silicon budget on the forward pass. If that assumption cracks, so does the entire inference ASIC investment thesis. For now, the silicon is shipping and the order books are full. What happens when the architecture moves is a problem for the next mask set.
