Inference Chip Split as Google's TPU Fork Redraws ASIC Map
Google forking its TPU into separate training and inference dies at 2nm signals that the AI chip market is splitting, and for inference-only ASICs the crucial design question is no longer what a chip can do but what it can leave out.
TrendForce put a number on something the supply chain already knew: custom AI chip shipments will grow 44.6% in 2026, against 16.1% for merchant GPUs. The forecast, published in late May, marks the first year ASICs outgrow the GPU category they were supposed to complement, not displace. It is not a fluke. It is the arrival of a bifurcated silicon market in which the training chip and the inference chip stop being siblings and become different species.
The split has been visible in roadmaps for eighteen months, but it became architecture-level doctrine in April 2026 at Google Cloud Next, where the company confirmed that its eighth-generation TPU will ship as two separate dies. TPU 8t, code-named Sunfish and designed with Broadcom, is a training chip. TPU 8i, designed with MediaTek, is an inference chip. Both will be fabricated on TSMC's 2nm process and are scheduled for late 2027, The Next Web reported. The current-generation Ironwood TPU, which shipped at the same event, already delivers 4.6 petaFLOPS per chip. But Ironwood is a single architecture. The fork that follows it is the real story.
What an inference-only die chooses not to do is more interesting than what it does. Training silicon must support a backward pass: gradient computation, optimizer state, weight update logic, and the associated memory traffic for all of it. An inference die can shed that entire subsystem. The silicon real estate that gets freed up goes to more matrix-multiply units, more on-chip SRAM for key-value caches, and narrower data paths tuned for forward-pass throughput rather than bidirectional gradient flow. The result is a chip that is worse at training by design and better at inference per watt and per square millimeter of die area. That tradeoff did not make sense when the same cluster ran both workloads. It makes sense now.
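The gap between the two workloads can be put in rough numbers. The sketch below is a back-of-envelope accounting of resident state per model parameter. It assumes a common mixed-precision recipe (bf16 weights and gradients, an fp32 master copy, and two fp32 Adam moments); the byte counts and the 70-billion-parameter example are illustrative assumptions, not figures for any TPU.

```python
# Back-of-envelope accounting of resident state per model parameter.
# Byte counts follow a common mixed-precision Adam recipe and are
# assumptions for illustration, not measurements of any specific chip.
BYTES_BF16 = 2
BYTES_FP32 = 4


def inference_bytes_per_param() -> int:
    """Forward pass only: one bf16 copy of each weight."""
    return BYTES_BF16


def training_bytes_per_param() -> int:
    """bf16 weight and gradient, plus fp32 master weight and two Adam moments."""
    weights_and_grads = BYTES_BF16 + BYTES_BF16
    optimizer_state = 3 * BYTES_FP32  # master copy, momentum, variance
    return weights_and_grads + optimizer_state


def state_gib(num_params: float, bytes_per_param: int) -> float:
    """Total resident state in GiB for a model of the given size."""
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    params = 70e9  # hypothetical 70B-parameter model
    print("inference GiB:", round(state_gib(params, inference_bytes_per_param()), 1))
    print("training GiB: ", round(state_gib(params, training_bytes_per_param()), 1))
```

Under these assumptions training holds eight times as much state per parameter as inference, before counting the activations that must be saved for the backward pass. That is the subsystem an inference die gets to leave out.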
The Deloitte projection that inference workloads will consume two-thirds of AI compute in 2026, up from roughly half in 2025, is what changed the arithmetic. When inference becomes the dominant cost, the overhead of running it on training-capable silicon stops being acceptable. A chip that carries backward-pass logic it never uses is paying a transistor tax on every inference token. The hyperscalers noticed. Google's TPU split is the most detailed public confirmation, but it is not alone.
Amazon's Graviton5, the subject of a multibillion-dollar, multi-year deal with Meta announced in late April, is a general-purpose ARM server CPU, not an ASIC. But the deal is instructive for a different reason: Meta committed to tens of millions of cores for agentic AI workloads, which are overwhelmingly inference-heavy. The Graviton5 pipeline demonstrates that the inference buildout is now large enough to justify dedicated procurement tracks that run parallel to, and sometimes independent of, GPU training budgets. When Meta's capex crossed $135 billion, the inference fraction became too large to route entirely through Nvidia.
Google's inference silicon strategy goes deeper than the MediaTek TPU 8i. In April, The Next Web reported that the company is in talks with Marvell Technology to develop two additional inference chips, a memory processing unit and a further inference-optimised TPU. That would bring Google's custom-chip design partners to four: Broadcom for training, MediaTek and Marvell for inference, and Intel as a packaging partner. No other hyperscaler operates a supply chain this wide for its own silicon. The motivation, according to the reporting, is to diversify beyond Broadcom's TPU programme and to create parallel inference architectures that could optimise for different model sizes, batch assumptions, or power envelopes.
The memory question is where inference ASICs make their hardest architectural bets. Transformer inference is memory-bandwidth-bound, not compute-bound, at almost every batch size that matters for real-time serving. An inference chip that gets the SRAM ratio wrong will underperform a training chip running the same model, even if its matrix engines are faster on paper. Google's response, CNBC reported, is to pack "ample amounts of static random access memory" into its dedicated inference die. The phrasing is nonspecific, but the direction is clear: the inference TPU will carry a larger on-chip SRAM pool relative to its compute throughput than a training TPU at the same node. That is a close cousin of the bet Nvidia made with the B200, which couples its compute with a significantly enlarged HBM stack: both spend their budget on feeding the matrix engines rather than multiplying them, although HBM sits beside the die while SRAM sits on it. The inference-ASIC approach simply takes the logic to its conclusion: if you know the chip will never run a backward pass, you can tune the memory hierarchy for forward-only access patterns without compromise.
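The bandwidth-bound claim can be checked with a simple roofline comparison. The sketch below is a rough model of a single dense layer in 16-bit precision; the peak-throughput and bandwidth figures in the example are hypothetical round numbers, not specifications of any TPU or GPU.

```python
# Roofline-style check: is one dense layer limited by memory or by compute?
# Hardware figures used in the example are hypothetical round numbers.

def matmul_intensity(batch: int, d_in: int, d_out: int,
                     bytes_per_value: int = 2) -> float:
    """FLOPs per byte moved for a (batch x d_in) @ (d_in x d_out) product."""
    flops = 2 * batch * d_in * d_out  # one multiply and one add per term
    values_moved = d_in * d_out + batch * d_in + batch * d_out
    return flops / (bytes_per_value * values_moved)


def machine_balance(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """FLOPs the chip can perform in the time it takes to move one byte."""
    return peak_flops / bandwidth_bytes_per_s


def is_memory_bound(batch: int, d_in: int, d_out: int,
                    peak_flops: float, bandwidth_bytes_per_s: float) -> bool:
    """True when the layer cannot keep the matrix units busy."""
    return (matmul_intensity(batch, d_in, d_out)
            < machine_balance(peak_flops, bandwidth_bytes_per_s))


if __name__ == "__main__":
    peak, bandwidth = 1e15, 4e12  # hypothetical: 1 PFLOPS, 4 TB/s
    for batch in (1, 16, 1024):
        print(batch, is_memory_bound(batch, 8192, 8192, peak, bandwidth))
```

With these assumed numbers the chip needs about 250 FLOPs per byte to stay busy, while a batch-of-one layer delivers roughly one. Only very large batches cross the line, and latency-sensitive serving rarely reaches them, which is why the memory hierarchy rather than the matrix engine tends to set inference speed.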
The SRAM decision points to a deeper design philosophy question. A transformer-only inference chip can hard-code attention. It can bake the softmax, the QKV projection, and the positional encoding pipeline into fixed-function blocks rather than relying on programmable shader cores. It does not need to support the dozens of operator types that a training chip must handle for experimentation with novel architectures. The bet is that the transformer will remain the dominant inference architecture for long enough to justify the silicon commitment. That bet is now being locked into physical design at 2nm.
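To make concrete how small the candidate for hard-coding is, here is a minimal reference for the core operation: scaled dot-product attention for one query over cached keys and values. It is a plain-Python sketch for illustration only; it does not describe how any vendor lays out its fixed-function blocks, and it omits multi-head splitting, masking, and positional encoding.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head attention for one query vector over cached keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    # Similarity of the query to each cached key, scaled by sqrt(head_dim).
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # Output is the attention-weighted average of the cached values.
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]
    values = [[10.0, 0.0], [0.0, 10.0]]
    print(attend([4.0, 0.0], keys, values))
```

The whole computation is a dot product, an exponentiation, a normalisation, and a weighted sum. That short, fixed sequence is what makes a dedicated pipeline plausible where a training chip would need general programmability.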
What the ASIC pipeline actually looks like
TrendForce's 44.6% growth figure applies to the broader custom ASIC market, which includes networking chips, storage accelerators, and non-AI silicon. But the AI fraction is the driver. Alchip Chairman Johnny Shen, whose company is one of the largest ASIC design houses serving the hyperscalers, told Tech Times that the growth is concentrated in AI-specific designs and that the pipeline of new projects is shifting decisively toward inference. The article also cited TrendForce's finding that merchant GPU shipment growth has decelerated to 16.1%, a rate that reflects saturation in the training market as frontier labs concentrate their largest training runs into fewer, higher-utilisation clusters.
The ASIC pipeline is not uniform. At the high end, Google's approach involves custom dies at the leading edge, with per-chip development costs that industry analysts estimate in the low hundreds of millions of dollars. At the mid-range, companies like Ambarella are shipping inference chips tuned for edge deployment. Ambarella outlined a "transformer-ready" inference architecture at the Needham Growth Conference in January, MarketBeat reported via Yahoo Finance, targeting vision transformers and small language models at power envelopes measured in single-digit watts. The edge inference market has its own bifurcation: chips that run vision transformers for automotive and surveillance, and chips that run quantised LLMs on-device. Both are transformer-only in practice, but their memory budgets and precision assumptions are worlds apart from those of a TPU 8i in a Google data centre.
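The distance between those memory budgets is easy to sketch. The helpers below estimate weight storage at a given precision and key-value cache size for a given context length. The model shapes in the example are hypothetical, and the weight estimate ignores the per-group scale factors that real quantisation schemes add.

```python
# Rough memory budget for serving a transformer.
# Model shapes in the example are hypothetical; quantisation metadata
# (scales, zero points) is ignored.
GIB = 2**30


def weight_footprint_gib(num_params: float, bits_per_weight: int) -> float:
    """Storage for the weights alone at the given precision."""
    return num_params * bits_per_weight / 8 / GIB


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_value: int = 2) -> float:
    """Key and value tensors cached for one sequence across all layers."""
    values_cached = 2 * layers * kv_heads * head_dim * seq_len  # K and V
    return values_cached * bytes_per_value / GIB


if __name__ == "__main__":
    params = 3e9  # hypothetical small on-device language model
    print("16-bit weights GiB:", round(weight_footprint_gib(params, 16), 2))
    print("4-bit weights GiB: ", round(weight_footprint_gib(params, 4), 2))
    print("KV cache GiB:      ", kv_cache_gib(32, 8, 128, 4096))
```

Dropping from 16-bit to 4-bit weights cuts the footprint by a factor of four, which can decide whether a model fits an edge device at all. A data-centre part can spend its budget on longer contexts and many concurrent sequences instead.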
Intel's play for the inference market operates on a third axis. Rather than designing a transformer-only ASIC from scratch, Intel is leveraging its CPU dominance and its packaging technology. The company's EMIB interconnect, 24/7 Wall St reported in early May, creates an opening in a market constrained by TSMC's CoWoS capacity. Intel's foundry business can offer an alternative advanced-packaging path for ASIC designers who cannot get TSMC CoWoS allocation. The bottleneck, in other words, is not just chip design; it is the physical assembly of chiplets, HBM stacks, and interposers. A brilliant inference ASIC that cannot get packaged is a paperweight.
The packaging constraint feeds back into architecture decisions. An inference ASIC designed for a CoWoS-like advanced package can assume high-bandwidth access to HBM stacks and can design its memory controllers accordingly. An inference ASIC that might need to ship in a less exotic package must provision more on-chip SRAM or accept lower bandwidth. The fork in Google's TPU roadmap may partly reflect a bet that by late 2027, 2nm capacity and advanced packaging capacity will be plentiful enough to support two distinct die designs without forcing compromises on either. That is a supply-chain bet as much as an architectural one.
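The cost of "accepting lower bandwidth" can be estimated with a simple ceiling. If every generated token requires streaming the full weight set (plus any cached keys and values) from memory once, single-stream decode speed cannot exceed bandwidth divided by bytes read. The sketch below applies that bound; the bandwidth and model-size figures are hypothetical, and batching, caching in SRAM, and compute limits are all ignored.

```python
# Upper bound on single-stream decode speed when memory bandwidth is the
# limit. All figures in the example are hypothetical.

def decode_tokens_per_second(bandwidth_bytes_per_s: float,
                             weight_bytes: float,
                             kv_bytes: float = 0.0) -> float:
    """Tokens per second if each token streams weights and KV cache once."""
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes)


if __name__ == "__main__":
    weights = 16e9  # hypothetical model occupying 16 GB
    for label, bandwidth in (("advanced package", 4e12),
                             ("simpler package", 1e12)):
        rate = decode_tokens_per_second(bandwidth, weights)
        print(label, round(rate, 1), "tokens/s ceiling")
```

Under this bound a fourfold drop in bandwidth is a fourfold drop in the decode ceiling, unless the design keeps more of the working set in on-chip SRAM. That is the trade a designer faces when the package is not guaranteed.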
Where this leaves the merchant GPU
Nvidia is not standing still. The B200 and its successors push ever-larger HBM configurations that benefit inference as much as training. Nvidia's architectural response to the inference-specialised ASIC threat is to argue that programmability matters: a customer running a B200 can shift capacity between training and inference as demand fluctuates, while a customer running a transformer-only inference ASIC cannot. That argument has merit for cloud providers serving a mixed workload. It has less merit for a hyperscaler whose inference traffic is large enough, stable enough, and cost-sensitive enough to justify its own silicon.
The TrendForce data suggests that the market is tilting toward the latter logic. When ASIC shipment growth is nearly triple GPU shipment growth, the implication is that a meaningful share of the inference buildout is happening outside the merchant GPU channel. That share includes Google's TPU fleet, Amazon's Trainium and Inferentia lines, Microsoft's rumoured inference ASIC programme, and a growing number of mid-volume designs from fabless startups that target specific model architectures at specific power points. Not all of these are transformer-only in the strict sense, but the design trend runs in one direction: strip out every transistor that does not serve a forward pass through a transformer, and ship the result at the best price per inference token the process node allows.
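The arithmetic behind the growth comparison is short enough to check directly. The sketch below takes the two TrendForce growth rates quoted in this article and computes their ratio, then shows how one year at those rates would move unit share. The 30/70 starting split is a hypothetical chosen for illustration, not a reported market share.

```python
# Growth rates are the TrendForce 2026 forecasts quoted in this article.
# The starting unit split used in the example is hypothetical.
ASIC_GROWTH = 0.446
GPU_GROWTH = 0.161


def growth_ratio(fast: float, slow: float) -> float:
    """How many times larger one growth rate is than the other."""
    return fast / slow


def asic_share_after(asic_units: float, gpu_units: float,
                     asic_growth: float = ASIC_GROWTH,
                     gpu_growth: float = GPU_GROWTH) -> float:
    """ASIC share of combined shipments after one year of growth."""
    asic_next = asic_units * (1 + asic_growth)
    gpu_next = gpu_units * (1 + gpu_growth)
    return asic_next / (asic_next + gpu_next)


if __name__ == "__main__":
    print("growth ratio:", round(growth_ratio(ASIC_GROWTH, GPU_GROWTH), 2))
    print("share after one year:", round(asic_share_after(30, 70), 3))
```

The ratio of the two growth rates comes out a little under three. A growth gap of that size moves share by only a few points in a single year, so the structural shift shows up through compounding rather than in any one forecast.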
The chip that emerges from this philosophy has a distinctive silhouette. It is SRAM-heavy, matrix-multiply-dense, and architecturally boring. It runs a small set of fixed-function attention pipelines and very little else. It does not do general-purpose linear algebra. It does not train. It ships with a software stack that is thin relative to CUDA, targeting a narrow operator set with compiler optimisations that assume a specific dataflow. It is, in short, the opposite of a GPU. And it is, in 2026, the fastest-growing category in AI silicon.
Google's four-partner supply chain, with Broadcom on training, MediaTek and Marvell on inference, and Intel on packaging, is the most elaborate expression of the inference-specialised thesis. But the thesis itself is broader. It holds that inference is now a big enough business, measured in tens of billions of dollars annually, to support chip designs that are useless for anything else. It holds that the transformer architecture is stable enough to hard-code. And it holds that the packaging bottleneck, not the logic design, is the binding constraint on how fast the inference fleet can grow. The late-2027 arrival of TPU 8t and TPU 8i on TSMC 2nm will be the first real test of all three assumptions at once. The tape-out clock is already running.