The Inference Chip Is No Longer a Training Chip With Fewer Transistors

In 2026, inference workloads consumed two-thirds of all AI compute resources, according to industry data cited by MSN. That number, once a forward projection in a Goldman Sachs note, is now a trailing measurement. Training still grabs the headlines, but inference pays the bills, and the silicon supply chain has spent the past eighteen months reorganising itself around that fact. The result is a new category of chip that does not merely de-prioritise training: it strips out the circuitry required for it entirely.

The cleanest signal arrived at Google Cloud Next in April 2026. The company made its seventh-generation Ironwood TPU generally available, a 4.6-petaFLOPS chip that can still handle both training and inference. But alongside it, Google previewed the eighth-generation architecture: two separate chips, one for training designed with Broadcom, and one for inference designed with MediaTek, both targeting TSMC's 2nm node for late 2027, as The Next Web reported. The inference variant, codenamed TPU 8i, is not a scaled-down training chip. It is a ground-up design built around the specific dataflow of a forward pass through a transformer, with matrix multiply units dimensioned for decoder-only attention patterns and memory bandwidth provisioned for KV-cache reads rather than gradient accumulation.

This split, long predicted by chip architects, has arrived faster than most roadmaps anticipated. TrendForce data published in late May 2026 projected 44.6 percent ASIC shipment growth for the year, roughly triple the rate of GPU shipments, Tech Times reported. Inside that aggregate number, the fastest-growing sub-segment is inference-only silicon. The economic logic is simple: a transformer forward pass is a deterministic, highly repetitive computation graph. A chip that executes only that graph can drop the general-purpose programmability, the high-precision training accumulators, and the inter-chip topologies needed for distributed backpropagation. The die area and power saved translate directly into lower cost per token, the metric that hyperscaler procurement teams now track in place of raw FLOPS.

Google is not alone. In April 2026, The Next Web reported that Google was in talks with Marvell Technology to develop two additional custom chips: a memory processing unit and a further inference-optimised TPU, adding a third design partner alongside Broadcom and MediaTek. Marvell had already disclosed in its fiscal 2026 results that custom silicon revenue hit a $1.5 billion annual run rate across 18 cloud-provider design wins, a figure that has nearly doubled year on year. The company's CEO, Matt Murphy, has described the inference pipeline as the single largest contributor to that growth, according to Marvell's earnings materials.

What makes an inference ASIC transformer-only rather than merely inference-optimised is what the design team chose to leave out. A chip that still supports convolutional networks, mixture-of-experts routing with dynamic gating, or training backward passes carries silicon area that sits idle during a transformer decode. The most aggressive designs eliminate all of it. UK startup Fractile, which raised a $220 million Series B in May 2026 led by Factorial Funds, Accel, and Founders Fund, builds inference accelerators that remove DRAM from the design entirely, relying instead on a deeply banked SRAM architecture tuned to hold KV-cache entries for long-context inference, MSN reported. Founded in 2022 by Oxford-trained engineer Walter Goodwin, Fractile has already drawn early purchase interest from Anthropic, which is evaluating the chips for inference workloads that currently run on Nvidia hardware, according to reporting by The Information cited in the same piece.

The Fractile architecture is instructive because it makes explicit a trade-off that general-purpose accelerators cannot make. A standard GPU or a general-purpose TPU provisions high-bandwidth DRAM, typically HBM, to feed both training batches and inference requests. But inference at scale is bound by per-token latency rather than by bulk throughput: the bottleneck is the time required to read the KV cache from memory, perform the attention computation, and write the result back. Fractile's bet is that a large on-chip SRAM, sized to hold the entire KV cache for a batch of sequences running to 128k tokens, eliminates the HBM controller, the interposer, and the associated power draw. The die area that would have gone to memory interfaces goes to more multiply-accumulate units instead. The chip runs transformers and, for all practical purposes, nothing else.
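
The latency argument can be sketched with back-of-the-envelope arithmetic. The sketch below assumes that each decode step must stream the full KV cache once, so the floor on step time is simply cache size divided by sustained memory bandwidth. The cache size and both bandwidth figures are illustrative placeholders rather than specifications for Fractile's silicon or any HBM product, and the function names are hypothetical.

```python
def kv_read_time_seconds(cache_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step: time to stream the KV cache once."""
    if cache_bytes < 0:
        raise ValueError("cache_bytes must be non-negative")
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth_bytes_per_s must be positive")
    return cache_bytes / bandwidth_bytes_per_s


def max_decode_steps_per_second(cache_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode steps per second implied by the read time."""
    return 1.0 / kv_read_time_seconds(cache_bytes, bandwidth_bytes_per_s)


# Illustrative assumptions only, not vendor specifications.
CACHE_BYTES = 1.0e9          # hypothetical 1 GB KV cache
OFF_CHIP_BW = 3.0e12         # hypothetical off-chip DRAM bandwidth, bytes/s
ON_CHIP_BW = 100.0e12        # hypothetical aggregate on-chip SRAM bandwidth, bytes/s

for label, bandwidth in (("off-chip DRAM", OFF_CHIP_BW), ("on-chip SRAM", ON_CHIP_BW)):
    step_time = kv_read_time_seconds(CACHE_BYTES, bandwidth)
    steps = max_decode_steps_per_second(CACHE_BYTES, bandwidth)
    print(f"{label}: {step_time * 1e6:.1f} microseconds per step, "
          f"at most {steps:,.0f} steps per second")
```

In this simplified model the speed-up is exactly the ratio of the two bandwidths, which is why a design that keeps the cache on-die can trade memory interfaces for compute without starving the multiply-accumulate units.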

The hyperscalers are moving in parallel. Meta Platforms announced a new deal with Broadcom in April 2026, extending their custom silicon partnership with a commitment to deploy 1 gigawatt of custom AI processors, as SiliconANGLE reported. Meta's MTIA chip programme, which began with a modest inference accelerator for recommendation models, has now expanded to cover the Llama family of large language models. The new Broadcom-designed chips target the inference side of that workload. Meta's training infrastructure remains largely Nvidia-based, but the inference fleet is migrating to custom silicon at a pace that Mark Zuckerberg described as the company's single largest infrastructure investment for 2026, according to the SiliconANGLE report.

Amazon's trajectory follows a similar arc, though with a twist. CEO Andy Jassy disclosed in early 2026 that the company's custom AI chip unit had reached a $20 billion annual revenue run rate and could generate $50 billion if sold externally, MSN reported. AWS already offers Trainium for training and Inferentia for inference, but Jassy has also signalled interest in selling Trainium chips to customers outside AWS, a move that would position Amazon as a direct merchant silicon competitor to Nvidia and AMD, Fudzilla reported. Inferentia3, the third-generation inference ASIC expected to tape out in late 2026, is rumoured to drop training support entirely and focus on transformer decode at ultra-low latency, though Amazon has not confirmed the specification publicly.

Cerebras Systems, which takes the opposite architectural approach by building chips at wafer scale, said in May 2026 that its CS-3 system ran Moonshot AI's trillion-parameter Kimi K2.6 model at inference speeds nearly seven times faster than comparable GPU cloud instances, VentureBeat reported. The Cerebras architecture is not transformer-only in the strict sense; its wafer-scale design retains general-purpose programmability. But the inference performance advantage it claims derives from the same principle Fractile exploits: keep the entire model state on-die and eliminate the memory wall. With 46,225 square millimetres of silicon on a single wafer, the CS-3 can hold model parameters and KV-cache entries without touching off-chip DRAM. The trade-off is yield, cost, and power, and Cerebras has yet to demonstrate that the economics work at the scale of a hyperscaler deployment rather than a specialised research cluster.

The EDA tooling ecosystem is adapting to this shift. Designing a transformer-only ASIC is, in one sense, simpler than designing a general-purpose GPU: the compute graph is fixed, the dataflow is predictable, and the verification space shrinks dramatically. But in another sense it is harder, because the margin for error narrows. A general-purpose GPU that underperforms on a novel attention variant can be reprogrammed. A transformer-only ASIC that mis-provisions its SRAM banks for a KV-cache layout that shifts with the next model release becomes a $300 million paperweight. The design services firms that win in this market are the ones that can close the loop between model architecture roadmaps and silicon tape-outs, compressing what used to be an eighteen-month design cycle into something closer to nine months. Taiwan's ASIC design service providers delivered sharply divergent results in the first quarter of 2026, DigiTimes reported, with firms that had direct line-of-sight into hyperscaler model roadmaps outperforming those that relied on general-purpose design wins.

The memory supply chain is the binding constraint on all of this. An inference ASIC that drops DRAM in favour of SRAM faces a different bottleneck: SRAM bit-cell area scales poorly at advanced process nodes. At TSMC 3nm, an SRAM bit cell occupies roughly 0.0199 square microns, a figure that has improved only marginally from the 5nm node. A chip designed to hold a 128k-token KV cache for a model with 70 billion parameters requires on the order of 1.4 gigabytes of SRAM, which at 3nm densities translates to roughly 560 square millimetres of die area for the cache alone, before any compute logic. That is why Fractile's architecture, and any design that bets on SRAM-heavy inference, effectively requires a 2nm process node or an advanced packaging solution that stacks SRAM dies on top of logic dies. The TSMC 2nm ramp scheduled for late 2027, which Google's TPU 8i is targeting, is the enabling event for this entire category.
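
The area arithmetic in the paragraph above can be reproduced in a few lines. The sketch takes the 1.4 gigabyte capacity, the 0.0199 square micron bit cell, and the 560 square millimetre estimate from the text, and computes the raw bit-cell area along with the array efficiency the quoted figure would imply. Reading the gap between raw and quoted area as peripheral overhead (sense amplifiers, decoders, redundancy) is an assumption on our part, as is treating 128k as 131,072 tokens in a single sequence; the function names are hypothetical.

```python
BITS_PER_BYTE = 8
UM2_PER_MM2 = 1_000_000  # 1 square millimetre = 10^6 square microns


def raw_bitcell_area_mm2(sram_bytes: float, bitcell_um2: float) -> float:
    """Area of the bit cells alone, ignoring all peripheral circuitry."""
    return sram_bytes * BITS_PER_BYTE * bitcell_um2 / UM2_PER_MM2


def implied_array_efficiency(raw_area_mm2: float, quoted_area_mm2: float) -> float:
    """Fraction of the quoted macro area that is actual bit cells."""
    if quoted_area_mm2 <= 0:
        raise ValueError("quoted_area_mm2 must be positive")
    return raw_area_mm2 / quoted_area_mm2


def cache_budget_bytes_per_token(sram_bytes: float, tokens: int) -> float:
    """Per-token KV-cache budget if the whole SRAM holds one sequence."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return sram_bytes / tokens


# Figures taken from the text above.
SRAM_BYTES = 1.4e9
BITCELL_UM2 = 0.0199
QUOTED_AREA_MM2 = 560.0
# Assumption: "128k" means 128 * 1024 tokens in one sequence.
CONTEXT_TOKENS = 128 * 1024

raw_area = raw_bitcell_area_mm2(SRAM_BYTES, BITCELL_UM2)
efficiency = implied_array_efficiency(raw_area, QUOTED_AREA_MM2)
budget = cache_budget_bytes_per_token(SRAM_BYTES, CONTEXT_TOKENS)

print(f"raw bit-cell area: {raw_area:.1f} mm^2")
print(f"implied array efficiency: {efficiency:.0%}")
print(f"per-token cache budget: {budget:,.0f} bytes")
```

On these inputs the bit cells alone account for roughly 223 square millimetres, so the 560 square millimetre estimate corresponds to an array efficiency of about 40 percent, and the capacity works out to a budget of roughly 10.7 kilobytes per token for a single 128k-token sequence.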

Intel, meanwhile, is positioning its foundry business to capture a slice of the inference ASIC market. The company has landed multiple AI inference chip contracts in 2026, leveraging its advanced packaging capabilities, particularly EMIB and Foveros, which allow customers to combine logic dies from one process node with SRAM or HBM dies from another, according to MSN reporting. Intel's foundry pitch to inference-chip startups is straightforward: you design the logic tile, we handle the packaging and memory integration. Whether Intel's 18A process node can compete with TSMC 2nm on power efficiency for inference workloads remains an open question, but the packaging argument is real, and it is the reason several early-stage inference chip companies have signed memoranda of understanding with Intel Foundry rather than committing to TSMC's crowded N2 queue.

The Goldman Sachs research note that captured attention in late May 2026 projected that ASIC demand would match GPU demand by 2027, Yahoo Finance reported. The note framed the shift as a function of inference workload growth, not training displacement. Training still requires the flexibility of general-purpose compute, and Nvidia's CUDA moat remains deep. But inference at the scale of billions of tokens per day is a different economic proposition, one where a 20 percent reduction in cost per token justifies a multi-billion-dollar custom silicon investment. Goldman's analysts estimated that the inference ASIC market alone would reach $45 billion by 2028, with transformer-only designs capturing the majority of that value.
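
Whether a given saving justifies the investment depends on volume, which a short payback calculation makes explicit. The sketch below computes how many tokens must be served for a cost-per-token reduction to repay a fixed silicon investment. Only the 20 percent reduction comes from the text above; the 2 billion dollar investment, the baseline of 2 dollars per million tokens, and the five-year amortisation window are hypothetical placeholders, not figures from the Goldman Sachs note.

```python
def breakeven_tokens(investment_usd: float,
                     baseline_cost_per_million_usd: float,
                     reduction_fraction: float) -> float:
    """Total tokens that must be served for the savings to repay the investment."""
    if not 0 < reduction_fraction <= 1:
        raise ValueError("reduction_fraction must be in (0, 1]")
    if baseline_cost_per_million_usd <= 0:
        raise ValueError("baseline_cost_per_million_usd must be positive")
    saving_per_token = baseline_cost_per_million_usd / 1e6 * reduction_fraction
    return investment_usd / saving_per_token


def breakeven_tokens_per_day(investment_usd: float,
                             baseline_cost_per_million_usd: float,
                             reduction_fraction: float,
                             amortisation_years: float) -> float:
    """Average daily volume needed to break even within the amortisation window."""
    if amortisation_years <= 0:
        raise ValueError("amortisation_years must be positive")
    total = breakeven_tokens(investment_usd,
                             baseline_cost_per_million_usd,
                             reduction_fraction)
    return total / (amortisation_years * 365)


# Hypothetical inputs, except the 20 percent reduction quoted in the text.
INVESTMENT_USD = 2.0e9
BASELINE_COST_PER_MILLION_USD = 2.0
REDUCTION_FRACTION = 0.20
AMORTISATION_YEARS = 5

total_tokens = breakeven_tokens(INVESTMENT_USD,
                                BASELINE_COST_PER_MILLION_USD,
                                REDUCTION_FRACTION)
daily_tokens = breakeven_tokens_per_day(INVESTMENT_USD,
                                        BASELINE_COST_PER_MILLION_USD,
                                        REDUCTION_FRACTION,
                                        AMORTISATION_YEARS)

print(f"break-even volume: {total_tokens:.2e} tokens in total")
print(f"break-even rate:   {daily_tokens:.2e} tokens per day")
```

With these placeholder inputs the break-even volume is about 5 × 10^15 tokens in total, or roughly 2.7 trillion tokens per day over five years. The result scales linearly with the investment and inversely with both the baseline cost and the reduction, which suggests why the case for custom silicon is strongest for the very largest serving fleets.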

Ambarella, a company historically associated with video processing and edge AI, signalled its own pivot at the Needham Growth Conference in January 2026, outlining a transformer-ready inference architecture for edge devices, Yahoo Finance reported. The edge inference market splits along a different axis than the data centre: power is the binding constraint, not memory bandwidth, and the models running on-device tend to be smaller, often in the 1-billion to 7-billion parameter range. But the design philosophy is the same. Ambarella's new chips drop support for the convolutional neural network primitives that defined its earlier computer vision processors and replace them with transformer-specific attention engines. It is the same bet Fractile is making, scaled down to a 5-watt power envelope.

What the transformer-only ASIC category has not solved, and may not solve in this generation, is the problem of model architecture drift. The transformer has been dominant for eight years, an eternity in machine learning, but state-space models, recurrent alternatives like Mamba and RWKV, and hybrid architectures that interleave attention with other mechanisms are all in active development. A chip that hard-codes the transformer attention pattern is a bet that the dominant architecture will remain dominant for the five-year lifecycle of a silicon design. The chip architects making that bet point to the installed base of transformer models, the tooling ecosystem built around them, and the fact that even the challenger architectures borrow heavily from the attention mechanism's primitives. The sceptics point to the history of AI accelerators optimised for LSTMs, which were obsolete before the first tape-out. The difference this time, the inference ASIC designers argue, is scale: the transformer has become infrastructure, and infrastructure changes slowly.

The next milestone to watch is the TSMC 2nm risk production ramp, expected in the second half of 2027. Google's TPU 8i, the Fractile production silicon, and the next generation of Meta's MTIA chips are all queued for that node. If the SRAM scaling at 2nm hits the targets TSMC has been signalling to early-access customers, the economics of transformer-only inference ASICs become compelling not just for the hyperscalers but for the tier below them: the AI-native startups, the enterprise deployment vendors, and the sovereign cloud operators that are building inference capacity independent of US-dominated supply chains. If it misses, the category pauses, and the general-purpose GPU gets another cycle to defend its inference territory.
