TechReaderDaily.com

Inference-Specialized ASICs Surge 44.6%, Rewriting Silicon Roadmap

OpenAI's Jalapeño, Cerebras's wafer-scale benchmarks, and a 44.6% ASIC shipment surge confirm that transformer-only inference chips are no longer experimental; they are the industry's new default.

Image: Broadcom AI compute ASIC chip on a circuit board, illustrating the custom silicon that hyperscalers are increasingly deploying for inference workloads. (Source: digitimes.com)

In this article
  1. Jalapeño: what OpenAI left on the table
  2. The bottleneck question

ASIC shipments for AI inference are forecast to grow 44.6% in 2026, nearly three times the 16.1% growth rate projected for merchant GPUs. That single datapoint, from a TrendForce forecast published in late May, captures the structural shift now running through the AI silicon supply chain. For the first time, custom silicon designed for a specific model architecture, overwhelmingly the transformer, is growing faster than general-purpose accelerators in percentage terms, and the absolute unit gap is narrowing faster than most roadmaps predicted eighteen months ago.
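The two growth rates can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the two percentages come from the TrendForce forecast, but the starting unit counts (40 ASICs per 100 GPUs) are hypothetical placeholders, and holding both rates constant for several years is an assumption the forecast itself does not make.

```python
from typing import Optional

ASIC_GROWTH = 0.446  # 2026 ASIC shipment growth, per the TrendForce forecast
GPU_GROWTH = 0.161   # 2026 merchant GPU shipment growth, same forecast


def growth_ratio(fast: float, slow: float) -> float:
    """Return how many times larger one growth rate is than another."""
    return fast / slow


def years_to_parity(asic_units: float, gpu_units: float,
                    asic_growth: float = ASIC_GROWTH,
                    gpu_growth: float = GPU_GROWTH) -> Optional[int]:
    """Whole years until ASIC units catch up, if both rates stayed constant.

    Returns None when the gap can never close under these inputs.
    """
    if asic_units >= gpu_units:
        return 0
    if asic_units <= 0 or asic_growth <= gpu_growth:
        return None
    years = 0
    while asic_units < gpu_units:
        asic_units *= 1 + asic_growth
        gpu_units *= 1 + gpu_growth
        years += 1
    return years


if __name__ == "__main__":
    print(f"growth ratio: {growth_ratio(ASIC_GROWTH, GPU_GROWTH):.2f}x")
    # Hypothetical base: 40 ASIC units shipped for every 100 GPU units.
    print("years to parity:", years_to_parity(40, 100))
```

The ratio works out to about 2.77, which is where "nearly three times" comes from. The parity estimate is only as good as the made-up base it starts from.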

The shift is not merely about market share. It represents a renegotiation of what a compute chip is meant to do. A GPU must run matrix multiplies for training and inference, support a broad software stack, and maintain enough generality to serve an HPC customer on Tuesday and a diffusion model on Wednesday. An inference-specialized ASIC has none of those obligations. What it chooses not to be good at is the whole point.

Jalapeño: what OpenAI left on the table

On June 24, OpenAI and Broadcom unveiled Jalapeño, OpenAI's first custom silicon: an "Intelligence Processor" built from the ground up for large language model inference. Tech Times reported that the chip is claimed to deliver roughly 50% lower inference cost per token than current Nvidia GPUs. It was designed in nine months, manufactured by TSMC, and co-developed using OpenAI's own models to accelerate parts of the chip design flow, VentureBeat reported, citing the companies' joint announcement.

The silicon choices are what tell the story. Jalapeño is architected around the memory-access patterns of transformer attention layers: large matrix-vector products, high-bandwidth key-value cache lookups, and a kernel library tuned specifically for the sequence lengths and batch sizes that ChatGPT encounters at production scale. SDxCentral noted that the chip is "optimized around kernels, memory movement and networking." It does not attempt to be a training accelerator. It does not ship with a CUDA-compatible programming model. It does not target vision transformers, recommendation models, or anything outside the autoregressive decoder stack. Those omissions are the design.
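To make the access pattern concrete, here is a toy single-head decode step in plain Python. It is a textbook sketch of attention with a key-value cache, not a description of Jalapeño's kernels. The `KVCache` class and `attend` function are hypothetical names introduced for illustration.

```python
import math


class KVCache:
    """Stores one key vector and one value vector per previously seen token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def floats_read_per_step(self):
        # Every decode step touches the whole cache: all keys, then all values.
        return sum(len(k) for k in self.keys) + sum(len(v) for v in self.values)


def attend(query, keys, values):
    """Attention output for ONE new token against all cached keys and values."""
    scale = math.sqrt(len(query))
    # Matrix-vector product: the cached key matrix times the single query vector.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Second matrix-vector product: the weights times the cached value matrix.
    width = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]


if __name__ == "__main__":
    cache = KVCache()
    cache.append([1.0, 0.0], [2.0, 0.0])
    cache.append([1.0, 0.0], [4.0, 0.0])
    print(attend([1.0, 0.0], cache.keys, cache.values))
    print(cache.floats_read_per_step())
```

Each generated token does little arithmetic per byte it reads, and the amount read grows with every token appended to the cache. That is the pattern a decode-only chip can be built around.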

This is the transformer-only bet, executed at the scale of one of the world's largest inference footprints. The economics are straightforward: if two-thirds of AI compute spending has already shifted to inference (a figure cited in a May 2026 MSN analysis of hardware investment trends), then a chip that cuts 50% off the per-token cost of that dominant workload recoups its development cost in months, not years.
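The payback logic reduces to one division. In the sketch below, only the 50% reduction comes from the reporting. The monthly spend and development cost are hypothetical figures chosen to show the shape of the calculation. The model also assumes the entire inference workload moves to the new chip at once, which no real deployment would do.

```python
def payback_months(monthly_inference_spend: float,
                   cost_reduction: float,
                   development_cost: float) -> float:
    """Months of savings needed to cover a one-time development cost."""
    monthly_savings = monthly_inference_spend * cost_reduction
    if monthly_savings <= 0:
        raise ValueError("no savings, so the cost is never recovered")
    return development_cost / monthly_savings


if __name__ == "__main__":
    # Hypothetical inputs: $200M/month inference bill, $500M development cost.
    months = payback_months(
        monthly_inference_spend=200e6,
        cost_reduction=0.50,  # the claimed per-token reduction
        development_cost=500e6,
    )
    print(f"payback in {months:.1f} months")
```

With these placeholder inputs the answer is five months. Halving the assumed inference bill doubles the payback period, which is why the bet only makes sense for the largest inference footprints.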

OpenAI is not alone. The same week that Jalapeño was announced, Cerebras Systems reported first-quarter revenue of $193.4 million, with core revenue up 92% year over year, and disclosed a multi-year deal with OpenAI. Cerebras's wafer-scale engine, a single chip the size of a dinner plate, ran Moonshot AI's trillion-parameter Kimi K2.6 model at nearly seven times the speed of GPU clouds, according to VentureBeat. Cerebras completed the largest tech IPO of 2026 in May and promptly put inference at the center of its post-IPO narrative.

The bottleneck question

Every inference ASIC on the market answers the same question differently: where is the bottleneck? For Jalapeño, it is memory bandwidth between the compute die and the key-value cache, hence the emphasis on data movement and kernel-level optimization. For Cerebras, it is the off-chip communication latency that strangles large-batch inference on multi-GPU clusters; the wafer-scale approach eliminates interposer hops and keeps the entire model on a single piece of silicon. For Groq, the inference startup that raised $650 million in late June 2026 to build out its inference cloud, the bottleneck is the scheduler: its language processing unit uses a deterministic architecture that avoids the nondeterministic memory-access patterns of GPU thread blocks, trading flexibility for predictable latency.
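The memory-bandwidth bottleneck can be estimated on the back of an envelope. At small batch sizes, each decoded token requires streaming the model weights and the key-value cache from memory, so bandwidth divided by bytes moved gives a ceiling on tokens per second. The sketch below uses that standard bound. The model shape, weight size, and bandwidth figure are all hypothetical and do not describe any chip named in this article.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of the key-value cache for one sequence.

    The leading factor of 2 counts keys and values separately.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


def decode_tokens_per_sec_bound(weight_bytes: float, kv_bytes: float,
                                bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on single-sequence decode speed when memory-bound."""
    return bandwidth_bytes_per_sec / (weight_bytes + kv_bytes)


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads of width 128, 16-bit values.
    kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192)
    print(f"KV cache: {kv / 2**30:.2f} GiB")
    # Hypothetical hardware: 14 GB of weights, 3 TB/s of memory bandwidth.
    bound = decode_tokens_per_sec_bound(14e9, kv, 3e12)
    print(f"decode ceiling: {bound:.0f} tokens/s")
```

Real throughput lands below this ceiling, and batching changes the picture because weight reads are shared across sequences while cache reads are not. The point is only that the bound depends on bytes moved, not on peak FLOPs.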

The common thread is architectural honesty. A GPU is a compromise that works acceptably well across a dozen workload categories. An ASIC is a bet that one workload category is large enough, stable enough, and valuable enough to justify a dedicated transistor budget. The TrendForce data suggests that bet is paying off: Alchip Technologies Chairman and CEO Johnny Shen told investors that he expects revenue growth from custom ASICs to outrun the broader GPU market, a forecast aligned with the TrendForce projection.

The supply chain is reconfiguring around this reality. Broadcom's custom ASIC business has doubled, and the company now counts at least five hyperscaler customers for its AI accelerator platform, DigiTimes reported. Marvell Technology and MediaTek have each announced custom ASIC pipelines targeting inference workloads. TSMC's advanced packaging capacity, CoWoS in particular, is being allocated increasingly toward ASIC customers rather than merchant GPU builders, according to supply-chain analysts who track wafer allocation.

The ASIC wave is not limited to the data center. Ambarella pitched a transformer-ready edge AI platform at the Needham Growth Conference in January 2026, targeting on-device inference for vision and language models. The edge is a harder problem: power envelopes are measured in single-digit watts, not kilowatts, and the memory footprint of even a quantized transformer can overwhelm a $12 BOM. But the same dynamic applies: if the model architecture is known at design time, a general-purpose accelerator leaves performance on the table.
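The edge memory squeeze follows from simple arithmetic: weight storage is parameter count times bits per weight. The sketch below applies that formula. The model size, memory budgets, and 20% runtime overhead are hypothetical values, not Ambarella specifications.

```python
def weight_footprint_bytes(num_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store the weights, rounded up to a whole byte."""
    return (num_params * bits_per_weight + 7) // 8


def fits(num_params: int, bits_per_weight: int, memory_budget_bytes: int,
         overhead_fraction: float = 0.2) -> bool:
    """Check a memory budget, reserving headroom for activations and cache.

    overhead_fraction is an assumed allowance, not a measured figure.
    """
    needed = weight_footprint_bytes(num_params, bits_per_weight)
    return needed * (1 + overhead_fraction) <= memory_budget_bytes


if __name__ == "__main__":
    params = 1_000_000_000  # hypothetical 1B-parameter model
    for bits in (16, 8, 4):
        size_mb = weight_footprint_bytes(params, bits) / 1e6
        ok = fits(params, bits, 512 * 1024**2)  # hypothetical 512 MiB part
        print(f"{bits}-bit: {size_mb:.0f} MB, fits in 512 MiB: {ok}")
```

Under these assumptions even the 4-bit variant misses the 512 MiB budget once overhead is counted, which is the kind of gap the article describes.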

China's AI chip industry is being pushed along the same path by force. US export controls have restricted access to Nvidia's highest-performance GPUs, making the economics of custom ASICs suddenly compelling. The Next Web reported in June 2026 that Huawei now commands a projected 62% share of China's domestic AI chip market, with its Ascend series of inference ASICs filling the gap left by restricted Nvidia silicon. Alibaba's T-Head division and Cambricon are pursuing similar custom designs optimized for domestic LLM workloads. The export controls have inadvertently accelerated the same architectural trend that is reshaping the Western market: the move from general-purpose to purpose-built.

What each of these chips chooses not to be good at is as instructive as what it optimizes for. Jalapeño does not do training at all: zero FLOPs are allocated to backward passes. Cerebras's wafer-scale engine cannot be disaggregated across smaller inference jobs efficiently; its economics work best at the scale of trillion-parameter models served to millions of users. Groq's deterministic scheduler imposes constraints on model topology that make it a poor fit for architectures that deviate significantly from the standard transformer decoder stack. These are not bugs. They are explicit design decisions, made early and held to.

The EDA tooling story is a quieter but equally significant part of the shift. OpenAI's nine-month design cycle for Jalapeño, from architectural specification to tape-out, was accelerated by using its own models to optimize floorplanning, place-and-route, and power-grid design, according to VentureBeat. This is a feedback loop with implications beyond a single chip: if frontier AI labs can shorten the ASIC design cycle from eighteen months to nine, the window in which a merchant GPU enjoys a performance advantage narrows considerably. A general-purpose chip designed on a two-year cadence is competing against custom silicon that can be refreshed more than twice as fast.

The foundry ecosystem is adapting to this cadence. TSMC's N3 and N4 process nodes now support ASIC-specific design rules that reduce the number of mask layers required for inference-only chips: transistors that would be wasted on training-specific data paths can be left undoped, cutting die area and improving yield. Equipment vendors, including ASML and Applied Materials, have begun offering lithography and deposition recipes tuned for the simpler, more repetitive structures of transformer accelerators. These are incremental changes individually. Cumulatively, they lower the barrier to entry for any company with a model and a workload forecast.

The question hanging over all of this is what happens when the transformer stops being the universal substrate for large language models. Every inference ASIC shipping in 2026 is optimized for the attention mechanism, for the autoregressive decode loop, for the KV-cache access pattern. If a successor architecture (a state-space model, a recurrent architecture with fundamentally different memory behavior, or something not yet published) supplants the transformer for production inference, these ASICs lose their architectural advantage overnight. The counterargument, and the one these chip programs are betting on, is that the transformer's dominance is now locked in by an ecosystem of tooling, quantization techniques, and compiler optimizations too large to displace quickly.
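The difference in memory behavior is easy to see in numbers. A transformer's key-value cache grows with every token in the context, while a recurrent or state-space layer carries a fixed-size state regardless of sequence length. The sketch below compares the two. Both shapes are hypothetical, and `state_elems_per_layer` is a stand-in for whatever state a given recurrent design keeps.

```python
def kv_cache_elems(layers: int, kv_heads: int, head_dim: int, tokens: int) -> int:
    """Elements held by a transformer KV cache after `tokens` tokens."""
    return 2 * layers * kv_heads * head_dim * tokens  # keys plus values


def recurrent_state_elems(layers: int, state_elems_per_layer: int) -> int:
    """Elements held by a fixed-size recurrent state; independent of length."""
    return layers * state_elems_per_layer


if __name__ == "__main__":
    layers = 32  # hypothetical depth shared by both models
    for tokens in (1_024, 8_192, 65_536):
        kv = kv_cache_elems(layers, kv_heads=8, head_dim=128, tokens=tokens)
        rec = recurrent_state_elems(layers, state_elems_per_layer=65_536)
        print(f"{tokens:>6} tokens  KV cache: {kv:>13,}  recurrent state: {rec:,}")
```

Hardware sized for a cache that scales with context length would be provisioned very differently from hardware serving a constant-size state, which is why a change of architecture would erode the advantage these chips are built on.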

The next milestone to watch is Q3 2026, when Jalapeño is expected to enter volume deployment in OpenAI's data center infrastructure. The per-token cost numbers that matter will not be the claimed 50% reduction from a press release; they will be the measured figures from production traffic, on real workloads, at scale. That data will either validate the transformer-only ASIC thesis or reveal the hidden overhead that clean-room benchmarks obscure. Either way, the silicon roadmap has already forked.
