Inference Drives Two-Thirds of AI Compute, Reshaping Chip Design
As inference overtakes training, custom transformer-specific ASICs from Google, Cerebras, and emerging startups are challenging the GPU monopoly and rewriting the semiconductor roadmap.
Two-thirds. That is the share of AI compute now consumed by inference workloads, according to a Deloitte projection confirmed by MSN in May 2026. Training, the compute regime that built Nvidia's $3 trillion market cap, has slipped to the minority position. The inversion happened faster than most roadmaps anticipated, and it has opened a window that the silicon industry has not seen since the death of Moore's Law as a free lunch: a wholesale re-architecture of the chip, aimed not at training the largest possible model but at running it cheaply, repeatedly, at planetary scale.
The chip that wins inference is not a scaled-down training chip. It is a fundamentally different animal. A training accelerator spends its die area on floating-point throughput for backpropagation, on high-bandwidth interconnect for all-reduce collectives across thousands of nodes, on the flexibility to handle any operator a researcher might dream up. An inference accelerator for transformer models can strip all of that. It can be built around a single dataflow: attention, feed-forward, repeat. No backpropagation. No support for arbitrary operators in the name of research flexibility. Just the forward pass of a transformer, executed as efficiently as silicon allows. The term of art is transformer-only ASIC, and in 2026 it has moved from academic curiosity to the centre of the industry's capital allocation.
The numbers driving this shift are unforgiving. A frontier model serving a billion daily queries on general-purpose GPUs burns tens of megawatts and consumes HBM at rates that strain the global supply of advanced packaging. Deloitte's two-thirds figure, cited across multiple outlets including The Motley Fool's May 2026 analysis, implies that the hyperscaler capex cycle, already the largest infrastructure buildout in corporate history, is tilting decisively toward deployment silicon rather than training silicon. Nvidia captured the training buildout. The inference buildout is still up for grabs.
Google saw this coming earliest and has built the most diversified custom-chip supply chain in the industry to meet it. The Next Web reported in April 2026 that Google now works with four design partners (Broadcom, MediaTek, Marvell, and Intel) on a dual-track TPU v8 roadmap targeting TSMC's 2nm process. The Ironwood TPU, announced the same month, packs dense SRAM alongside the compute die specifically for inference, a design choice that CNBC described as "packing ample amounts of static random access memory into a dedicated chip for running artificial intelligence models, following Nvidia's plans." The approach is straightforward: if you know the model architecture in advance (and Google knows its own Gemini family), you can size the on-chip memory to eliminate off-chip HBM accesses for the most latency-sensitive operations.
What Google's strategy reveals is that the inference chip market is not consolidating around a single architecture. It is fragmenting by model family and deployment environment. A TPU optimized for Gemini's attention mechanism is not necessarily optimal for Llama or Claude. This fragmentation is the strategic premise behind nearly every inference ASIC startup that has raised capital in the last eighteen months.
"Give me tokens. Just give me tokens. I want them fast. I want them cheap. I want them now." (Developer mantra cited by Parasail CEO, as reported by TechCrunch)
Parasail, which raised $32 million in Series A funding in April 2026, is betting on what it calls "tokenmaxxing", the developer behaviour of routing inference requests to whichever provider delivers the cheapest token per second for a given model, with zero loyalty to the underlying silicon. TechCrunch's Tim Fernholz reported that the round signals "a fractured future of models and compute," a phrase that captures the structural disaggregation underway. When the API is the product and the model is a commodity, the chip underneath matters only on price-performance. That is a market structure the ASIC startups were built for.
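The routing behaviour described above can be sketched in a few lines of Python. This is an illustration only, not Parasail's implementation: the `Offer` fields, the provider names, and the prices are all hypothetical, and "cheapest" is read here as lowest price among offers that meet a minimum speed.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Offer:
    """One provider's listing for one model (all fields are hypothetical)."""
    provider: str
    model: str
    usd_per_million_tokens: float
    tokens_per_second: float


def pick_offer(offers: Iterable[Offer], model: str,
               min_tokens_per_second: float = 0.0) -> Optional[Offer]:
    """Return the cheapest offer for `model` that is fast enough, or None."""
    candidates = [
        o for o in offers
        if o.model == model and o.tokens_per_second >= min_tokens_per_second
    ]
    if not candidates:
        return None
    # Lowest price wins; ties are broken in favour of higher throughput.
    return min(candidates,
               key=lambda o: (o.usd_per_million_tokens, -o.tokens_per_second))


# Made-up listings for illustration; the silicon behind each is invisible here.
offers = [
    Offer("provider-a", "open-model-70b", 0.90, 45.0),
    Offer("provider-b", "open-model-70b", 0.60, 30.0),
    Offer("provider-c", "open-model-70b", 0.75, 400.0),
    Offer("provider-c", "other-model-8b", 0.10, 900.0),
]

cheapest = pick_offer(offers, "open-model-70b")
fast_enough = pick_offer(offers, "open-model-70b", min_tokens_per_second=100.0)
```

With these invented numbers, the unconstrained request goes to `provider-b` and the latency-sensitive one to `provider-c`. Nothing in the selection logic refers to the chip underneath, which is the point of the market structure the paragraph describes.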
The most radical expression of transformer-only design comes from Cerebras Systems. Its wafer-scale engine, a single chip fabricated across an entire 300mm silicon wafer, nearly 30 times the area of an Nvidia Blackwell B200, dispenses with the multi-chip packaging problem entirely. Cerebras went public in May 2026 at $125 per share, becoming the first pure-play AI chip company to list, and quickly reached a $60 billion valuation. The IPO filing disclosed major contracts with OpenAI and Amazon. What matters for the inference story is that Cerebras's architecture eliminates the inter-chip communication bottleneck that dominates multi-GPU inference clusters: the wafer is the cluster.
Alongside Cerebras, a constellation of inference-specialized startups is filling specific niches in the stack. Gimlet Labs, founded by former Google engineers, raised $80 million in Series A funding led by Menlo Ventures in March 2026 to build a multi-chip inference cloud that can split workloads across silicon from different manufacturers. Within weeks, OpenAI hired Gimlet Labs to optimize its models for Cerebras hardware, claiming 10× faster inference at the same cost. The pairing is instructive: a middleware layer that abstracts away which chip runs the query, sitting on top of a wafer-scale engine that was never designed to be general-purpose.
The Korean startup HyperAccel, led by CEO Joo-Young Kim, has taken a different approach, building an ASIC narrowly optimized for large language model inference at the matrix multiplication level. Kim received a national ICT honour from the Korean government in April 2026 for what Chosunbiz described as "LLM chip breakthroughs." HyperAccel's architecture is built around the observation that the attention mechanism in a transformer (the QKV projection, the softmax, the weighted sum) can be implemented as a fixed-function pipeline with far less control logic than a programmable GPU requires. The die area saved translates directly into more multiply-accumulate units per square millimetre.
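The three stages named above (QKV projection, softmax, weighted sum) are short enough to write out. The sketch below is generic single-head scaled dot-product attention for one decode step in plain Python; it is not HyperAccel's design, and the function names and the list-based key/value cache are assumptions made for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_step(x, w_q, w_k, w_v, k_cache, v_cache):
    """One single-head attention step for the newest token embedding `x`.

    `k_cache` and `v_cache` hold keys and values of earlier tokens and are
    extended in place with the entry for the current token.
    """
    # Stage 1: QKV projection.
    q = matvec(w_q, x)
    k_cache.append(matvec(w_k, x))
    v_cache.append(matvec(w_v, x))

    # Stage 2: scaled dot-product scores, normalised with softmax.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_cache]
    weights = softmax(scores)

    # Stage 3: weighted sum of the cached values.
    dim = len(v_cache[0])
    return [sum(w * v[i] for w, v in zip(weights, v_cache)) for i in range(dim)]


# Toy example: 2-dimensional embeddings and identity projection matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
keys, values = [], []
first = attention_step([1.0, 0.0], identity, identity, identity, keys, values)
second = attention_step([0.0, 1.0], identity, identity, identity, keys, values)
```

The control flow never depends on the data: the same three stages run in the same order for every token, with only the cache length changing. That regularity is what makes a fixed-function pipeline plausible.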
d-Matrix, based in Santa Clara, has been pursuing a similar thesis with its Corsair inference chip, optimised for the low-latency, high-throughput regime that enterprise deployments demand. In April 2026, d-Matrix acquired GigaIO's datacenter technology and assets to add rack-scale composability to its inference platform. The acquisition signals that the inference chip problem is increasingly a systems problem: how you feed the chip, how you interconnect dozens of them, how you keep memory bandwidth from becoming the bottleneck long before compute does.
Memory bandwidth is, in fact, the bottleneck most of these designs are racing to solve. A transformer forward pass with a 70-billion-parameter model running at batch size one is memory-bound on almost every architecture currently shipping. Every token generated requires reading the full set of model weights from memory. The arithmetic intensity, the ratio of compute operations to bytes moved, is low enough that the chip spends more time waiting for data than computing on it. This is why Google's TPU v8 pushes SRAM directly onto the inference die, why Cerebras's wafer-scale design keeps everything on one piece of silicon, and why d-Matrix is investing in rack-scale memory pooling. The ASIC that wins inference will not necessarily have the most teraflops. It will have the memory subsystem that keeps the compute units fed.
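A back-of-the-envelope calculation makes the memory-bound claim concrete. The sketch below assumes a dense model in which every weight is read once per generated token, about two floating-point operations per weight (one multiply, one add), and 16-bit weights; it ignores the KV cache, activations, and any overlap inefficiency. The bandwidth and peak-compute figures are hypothetical round numbers, not the specification of any shipping chip.

```python
def arithmetic_intensity(params, bytes_per_param, batch_size=1):
    """FLOPs performed per byte of weights read, for one decode step."""
    flops = 2 * params * batch_size          # multiply + add per weight per sequence
    bytes_moved = params * bytes_per_param   # weights are read once per step
    return flops / bytes_moved


def decode_time_per_token(params, bytes_per_param, mem_bandwidth, peak_flops):
    """Return (memory_time, compute_time) in seconds for batch size one."""
    memory_time = (params * bytes_per_param) / mem_bandwidth
    compute_time = (2 * params) / peak_flops
    return memory_time, compute_time


# Assumed workload: 70 billion parameters stored as 16-bit values.
PARAMS = 70e9
BYTES_PER_PARAM = 2

# Hypothetical accelerator: 3 TB/s of memory bandwidth, 1 PFLOP/s of compute.
MEM_BANDWIDTH = 3e12
PEAK_FLOPS = 1e15

intensity = arithmetic_intensity(PARAMS, BYTES_PER_PARAM)
machine_balance = PEAK_FLOPS / MEM_BANDWIDTH
memory_time, compute_time = decode_time_per_token(
    PARAMS, BYTES_PER_PARAM, MEM_BANDWIDTH, PEAK_FLOPS)

# The slower of the two sets the ceiling on single-stream generation speed.
tokens_per_second = 1.0 / max(memory_time, compute_time)

print(f"arithmetic intensity: {intensity:.1f} FLOP/byte")
print(f"machine balance:      {machine_balance:.0f} FLOP/byte")
print(f"memory time/token:    {memory_time * 1e3:.1f} ms")
print(f"compute time/token:   {compute_time * 1e3:.2f} ms")
print(f"upper bound:          {tokens_per_second:.1f} tokens/s")
```

Under these assumptions the workload offers about 1 FLOP per byte while the hypothetical chip needs roughly 333 FLOPs per byte to stay busy, so weight traffic takes about 47 ms per token against 0.14 ms of arithmetic, capping a single stream at around 21 tokens per second. Raising `batch_size` raises the intensity proportionally, which is why batching helps throughput but not single-request latency.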
Intel's inference play, by contrast, does not rely on a single exotic ASIC but on portfolio breadth. Multiple MSN analyses in May 2026 project Intel as the market leader in inference silicon, citing its 71% share of the server CPU market, growing custom ASIC revenue, and its EMIB-T advanced packaging technology. The argument is that most inference workloads do not run on exotic wafer-scale engines or dedicated transformer ASICs; they run on Xeon processors inside existing data centres, handling the long tail of smaller models, traditional ML serving, and the batch workloads that do not justify dedicated accelerator hardware. Intel's Gaudi ASIC line, meanwhile, targets the mid-range of the inference market with a chip that is purpose-built but not transformer-exclusive.
Nvidia is not standing still. Forbes reported in March 2026 that Nvidia licensed Groq's inference technology in a deal valued at $20 billion and subsequently unveiled inference-optimised silicon derived from that partnership. Groq's language processing unit architecture, which uses a deterministic dataflow scheduler to eliminate the memory-access unpredictability that plagues GPU-based inference, represents a philosophical pivot for Nvidia, an acknowledgment that the general-purpose GPU, however successful in training, is not the optimal inference substrate.
What every player in this market is choosing, explicitly or implicitly, is what not to be good at. A transformer-only ASIC is not good at training. It is not good at convolutional networks. It is not good at graph neural networks or recommendation models or any of the non-transformer architectures still running in production. It is good at exactly one thing: the forward pass of a transformer model. The bet is that one thing is big enough, and growing fast enough, to support an entire chip ecosystem.
What the supply chain looks like now
The inference ASIC supply chain has bifurcated. On one track, the hyperscalers (Google, Amazon, Microsoft, and reportedly Meta) are designing their own silicon through internal teams and close partnerships with Broadcom and Marvell. These chips are fabbed at TSMC's leading nodes, with Google's TPU v8 targeting 2nm and Amazon's Trainium-Inferentia roadmap expected to follow. On the other track, independent startups are competing for the non-hyperscaler market: enterprises that need inference capacity but cannot design their own silicon, cloud providers that want a differentiated offering, and model companies that want hardware co-optimised with their specific architectures.
The RISC-V ecosystem is also staking a claim. SiFive raised $400 million at a $3.65 billion valuation in April 2026, with Atreides Management leading the round. The open-source instruction set architecture is attractive for inference ASICs because it allows designers to add custom vector extensions tuned to transformer operations without paying the ARM licensing tax. A SiFive-based inference chip can ship with a minimal RISC-V control core and a sea of custom matrix units, all on the same die, with no third-party intellectual property encumbering the layout.
FuriosaAI, the Korean inference chip startup, has been navigating the same territory with its RNGD (pronounced "renegade") chip, designed for the low-power, high-throughput regime that hyperscaler edge inference demands. CEO June Paik, in a February 2026 interview with TechRadar, discussed the power constraints that are reshaping inference deployments: chips that must fit inside a 75-watt thermal envelope while serving thousands of simultaneous requests. That power budget simply does not allow for a general-purpose GPU. It demands an ASIC.
The bottleneck that no amount of clever chip design can solve alone is advanced packaging. TSMC's CoWoS capacity, Samsung's I-Cube, and Intel's EMIB-T are all booked through 2027. An inference ASIC startup that finishes tape-out in 2026 faces an eighteen-month queue for packaging before it can ship volume. This is why d-Matrix's acquisition of GigaIO's rack-scale technology matters; if you cannot get enough HBM stacks onto a single interposer, you need a system architecture that pools memory across multiple chips and presents it as a unified address space to the inference workload.
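What "a unified address space" over pooled memory means can be shown with a toy address translator. This is a conceptual sketch only and does not describe d-Matrix's or GigaIO's technology: the `PooledMemory` class, the contiguous layout, and the capacities are all hypothetical, and a real system would do this in hardware with interleaving, caching, and fault handling.

```python
from bisect import bisect_right
from itertools import accumulate


class PooledMemory:
    """Map one flat address range onto several devices' local memories."""

    def __init__(self, capacities):
        if not capacities or any(c <= 0 for c in capacities):
            raise ValueError("need at least one device with positive capacity")
        self.capacities = list(capacities)
        # ends[i] is the first global address past device i's region.
        self.ends = list(accumulate(self.capacities))

    @property
    def total(self):
        """Size of the unified address space."""
        return self.ends[-1]

    def locate(self, address):
        """Translate a global address to (device index, local offset)."""
        if not 0 <= address < self.total:
            raise IndexError(f"address {address} outside pooled range")
        device = bisect_right(self.ends, address)
        region_start = self.ends[device] - self.capacities[device]
        return device, address - region_start


# Three hypothetical devices with 4, 8, and 4 units of memory each.
pool = PooledMemory([4, 8, 4])
```

Here `pool.locate(4)` resolves to offset 0 on the second device and `pool.locate(15)` to offset 3 on the third. The workload sees sixteen contiguous units and never has to know how many packages they are spread across.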
Who wins, who loses, and what to watch
The inference ASIC market is not winner-take-all. It is winner-take-the-deployment-that-matches-the-architecture. Google's TPU will win Google's own workloads. Cerebras will win the highest-throughput, largest-model deployments where wafer-scale makes economic sense. Intel's Xeon will win the long tail of inference that never touches an accelerator. The startups (HyperAccel, d-Matrix, FuriosaAI) will win or lose on their ability to hit volume before the hyperscalers' internal designs close the window.
What every design in this space has in common is a single architectural commitment: the transformer is the workload, and nothing else matters. That commitment is either the smartest bet in the history of silicon or a spectacular bet on architectural stasis. The history of computing suggests that workloads change. But the history of AI since 2017 suggests that the transformer, in one attention-variant or another, has been more durable than any ISA, any process node, any packaging technology. The inference ASIC market is a bet that the next decade looks like the last one, and that being excellent at exactly one thing is better than being pretty good at everything.
The checkpoint to watch is TSMC's Q3 2026 earnings call. The foundry's revenue breakdown by application will reveal whether inference ASIC tape-outs are accelerating at the rate the startups' funding rounds imply. If the 2nm pipeline is dominated by five-nanometre-retread GPUs, the transformer-only thesis still has something to prove. If it is filling with custom silicon from names that did not exist three years ago, the rewiring is real.