TechReaderDaily.com
Compute & Inference Economics · Tokyo

Inference Economics Takes Over Neocloud War in $643M Eigen Deal

Nebius's $643M purchase of 20-person Eigen AI, valuing inference optimization at $32 million per engineer, and CoreWeave's $21 billion Meta deal signal that the neocloud race now centers on extracting maximum tokens per GPU rather than GPU count.

[Image: Server racks inside a Nebius AI data centre, rows of GPU hardware under blue lighting. Source: thenextweb.com]
In this article
  1. Who Captures the Margin
  2. What to Watch

On 1 May 2026, Nebius Group, the Dutch neocloud that separated from Yandex in 2024, agreed to acquire Eigen AI for approximately $643 million in stock and cash, The Next Web reported. The target: a 20-person startup founded by alumni of MIT's HAN Lab. The arithmetic is $32 million per employee, a figure that requires explanation in a market where the largest AI companies are valued in the hundreds of billions and the most prominent acquisitions involve thousands of engineers. The explanation is inference. Eigen AI's technology maximises the number of tokens that each Nvidia GPU can generate when running AI models. That capability, it turns out, is the most valuable layer in AI infrastructure right now.

The AI industry's most expensive problem is not training models. It is running them. Training a frontier model is a one-time capital expenditure, measured in hundreds of millions of dollars, that produces a set of weights. Inference, the process of running those weights to generate responses for users, is a recurring operational cost that scales with every query, every API call, and every token produced. For companies that sell AI as a service, inference is the dominant cost line. Every percentage point of efficiency gained in inference, every additional token squeezed from the same Nvidia chip, translates directly into lower costs or higher margins. Eigen AI specialises in exactly this: optimising the performance of open-source models from OpenAI, Alibaba, Meta, and Nvidia so that each chip produces more output for the same input of electricity and silicon.

The acquisition arrives in the middle of a neocloud supercycle. Less than four weeks earlier, Forbes reported that CoreWeave signed a $21 billion infrastructure deal with Meta and, within 48 hours of that announcement, closed a separate multiyear agreement with Anthropic. CoreWeave disclosed it now serves nine of the 10 largest AI labs. Seeking Alpha subsequently noted that CoreWeave's total contracted backlog reached $66.8 billion. Then, on 15 April, Reuters reported that trading firm Jane Street committed approximately $6 billion to CoreWeave's cloud services, the third multibillion-dollar deal for the Nvidia-backed neocloud in a matter of weeks.

The neocloud market has bifurcated. One path, exemplified by CoreWeave's $66.8 billion backlog, is about raw capacity deployment: building and leasing GPU clusters at scale to the largest AI labs. The other path, which Nebius is now pursuing aggressively with the Eigen acquisition, is about extracting more value from each unit of capacity already deployed. The two strategies are complementary but the margin dynamics are entirely different. Capacity leasing generates revenue from square footage and kilowatt-hours. Inference optimisation generates revenue from software that runs on top of someone else's silicon, and software margins, at scale, are structurally higher.

Eigen AI's core technical contribution is activation-aware weight quantisation, a method for compressing AI models from higher-precision to lower-precision arithmetic without proportionally degrading output quality. The technique identifies which weights in a neural network are most sensitive to precision loss and preserves them at higher bit depths while aggressively quantising the rest. The result is that a model can run in INT4 or FP8 precision on fewer GPU resources while maintaining output quality close to the full-precision baseline. At batch size 1, the regime that dominates interactive inference workloads, the difference between a naive quantisation and an activation-aware one can be 15 to 30 percent in throughput per dollar of GPU time. At batch size 32, the gap narrows because the GPU's compute utilisation already runs high, but the majority of paid inference traffic flows at low batch sizes where users expect sub-second latency.
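The select-then-quantise logic can be sketched in a few lines. This is a toy illustration of the mixed-precision idea described above, not Eigen's implementation (the published activation-aware method protects salient channels by rescaling before uniform quantisation, but the selection principle is the same); the `keep_frac` parameter and the calibration inputs are illustrative assumptions:

```python
def quantize_int4(column, scale):
    """Symmetric 4-bit quantisation: snap each weight to one of 16 levels."""
    return [max(-8, min(7, round(w / scale))) * scale for w in column]

def activation_aware_quantize(weight_cols, act_magnitudes, keep_frac=0.01):
    """Keep the weight columns that see the largest activations on
    calibration data at full precision; quantise the rest to INT4.

    weight_cols:    list of weight columns, one per input channel
    act_magnitudes: mean |activation| per input channel (same length)
    keep_frac:      fraction of channels preserved at high precision
    """
    n_keep = max(1, int(keep_frac * len(weight_cols)))
    ranked = sorted(range(len(weight_cols)), key=lambda i: act_magnitudes[i])
    salient = set(ranked[-n_keep:])  # channels most sensitive to precision loss

    out = []
    for i, col in enumerate(weight_cols):
        if i in salient:
            out.append(list(col))  # preserved at full precision
        else:
            scale = max(abs(w) for w in col) / 7 or 1.0  # per-channel scale
            out.append(quantize_int4(col, scale))
    return out
```

The point of the ranking step is that quantisation error on a weight is amplified by the magnitude of the activation it multiplies, so spending the precision budget on the high-activation channels buys the most output quality per bit.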

"This is like the Olympic sport of the current market: who can extract more tokens for the same price?" Roman Chernin, Nebius co-founder and chief business officer, told The Next Web.

Chernin's Olympic-sport framing is not rhetorical flourish. The per-token unit economics of inference are being measured with the precision of high-frequency trading. A model provider charging $2.00 per million output tokens on an H200 GPU might have a gross margin of 35 percent at 1,200 tokens per second throughput. An optimisation layer that pushes that to 1,500 tokens per second, a 25 percent gain, lifts the gross margin on that same pricing to roughly 48 percent, assuming power and cooling costs stay flat. When inference revenues are measured in billions of tokens per day, a 13-point margin swing moves hundreds of millions of dollars to the bottom line annually. The market is pricing that delta, and Nebius just paid $643 million to capture it.
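The margin arithmetic above can be checked directly. A minimal sketch, assuming serving cost per token scales inversely with throughput while pricing, power, and cooling stay fixed:

```python
def gross_margin_after_speedup(price_per_mtok, baseline_margin, speedup):
    """Gross margin after an inference speedup, assuming the GPU-hour cost
    and the per-token price stay fixed, so serving cost per token scales
    as 1 / throughput.

    price_per_mtok:  revenue per million output tokens (USD)
    baseline_margin: gross margin at baseline throughput (0..1)
    speedup:         new_throughput / old_throughput
    """
    baseline_cost = price_per_mtok * (1 - baseline_margin)  # cost per M tokens
    new_cost = baseline_cost / speedup  # same GPU-hour spend, more tokens out
    return (price_per_mtok - new_cost) / price_per_mtok

# The article's scenario: $2.00 per million output tokens, 35% margin at
# 1,200 tok/s, and an optimisation layer that lifts throughput to 1,500 tok/s.
margin = gross_margin_after_speedup(2.00, 0.35, 1500 / 1200)
print(f"{margin:.0%}")  # → 48%
```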

The Nebius deal also signals something about the structure of the inference stack. Eigen AI's technology will be integrated into Nebius's Token Factory inference platform, which serves enterprise customers running open-source models. The platform competes not against the hyperscalers' raw GPU rental businesses but against the inference endpoints offered by model providers themselves: think Together AI, Fireworks, Anyscale, and the fast-growing category of pay-per-token inference clouds. SiliconANGLE reported in mid-April that Parasail, another entrant in this space, raised $32 million in Series A funding for its own pay-per-token inference cloud. The funding environment suggests that investors see inference-as-a-service as a distinct category from GPU rental, one where the margin capture shifts from the hardware layer to the software optimisation layer.

CoreWeave's Forbes profile, headlined "CoreWeave Becomes AI's Landlord," captures the capacity-side thesis. The company's model is to build or lease data centre space, fill it with Nvidia GPUs, and rent that compute to the largest AI labs under multiyear contracts. The landlord metaphor is apt: CoreWeave collects rent on scarce physical assets while the tenants, Meta and Anthropic among them, bear the burden of making those assets productive. But the Forbes analysis also flagged risks, noting that the neocloud market is getting crowded. Applied Digital, profiled by The Motley Fool in early April, builds AI data centres for neocloud operators including CoreWeave itself, and the company's revenue pipeline suggests that capacity buildout is accelerating across the sector. When capacity becomes abundant, the margin migrates upward to the layer that differentiates: software that makes each GPU produce more tokens.

The per-token pricing model that now dominates the inference market adds another dimension to the competition. OnMSFT reported in late April that token-based pricing is disrupting the AI chip market, with Groq challenging Nvidia on cost and speed for inference workloads. Groq's language processing units are designed specifically for inference, not training, and the company markets directly against Nvidia on a cost-per-million-tokens basis. The incumbency advantage Nvidia holds in training does not translate cleanly to inference, where specialised architectures can undercut general-purpose GPUs on both latency and throughput. For neoclouds building their fleets, the question is no longer simply which Nvidia SKU to deploy but whether the next generation of inference hardware will make today's GPU purchase decisions look expensive in hindsight.

Who Captures the Margin

The inference stack has four layers: the silicon, the cloud infrastructure, the model, and the application. Each layer wants to capture the margin, and the competition between them is intensifying. Nvidia captures margin through GPU sales at roughly 70 percent gross margins on data centre products. Neoclouds like CoreWeave and Nebius capture margin through rental or platform fees layered on top of hardware they purchase from Nvidia. Model providers, from OpenAI to Anthropic to the open-source ecosystem, capture margin through API pricing that bundles compute and model access. And application developers capture margin by building products on top of models that users pay for directly.

The Nebius-Eigen deal represents an attempt to shift margin capture toward the cloud layer. By owning the optimisation technology, Nebius can offer inference at a lower per-token price than competitors running the same models on the same hardware, while maintaining or improving its own margins. A customer running Llama 4 at scale has little loyalty to a particular inference provider; the model weights are identical regardless of who serves them. The provider that offers the lowest price per million tokens at acceptable latency wins the volume. Eigen's technology is designed to create exactly that price gap. The $643 million price tag suggests Nebius believes the gap is both real and defensible.

CoreWeave's approach is different but converging. The company's $66.8 billion backlog is built on long-term contracts that lock in revenue before the GPUs are even deployed. Seeking Alpha noted that CoreWeave's $6.8 billion contract expansion with Meta in early April "validates long-term hyperscaler demand for specialised AI infrastructure and high-performance inference." The validation, however, cuts both ways. If hyperscalers are willing to commit billions to neocloud capacity, the capacity itself is becoming a commodity. The question for CoreWeave's margin structure is whether the next contract renewal will price capacity at the same premium as the current one, or whether an increasingly crowded supply of GPU clusters will compress rental rates toward the cost of capital plus a thin operating margin.

What to Watch

The most important number to track in the neocloud inference race over the next two quarters is not revenue or backlog but the effective per-token price showing up on customer invoices. Public pricing pages list API rates like $0.50 per million input tokens or $2.00 per million output tokens, but the largest customers negotiate rates that can be 40 to 60 percent below list. Those negotiated rates, not the public ones, determine whether inference optimisation technology like Eigen's actually creates a moat. If the market price per token falls faster than optimisation can reduce cost, the margin captured by the optimisation layer evaporates. The EDN network warned in late March that cost-per-token metrics can be misleading because they abstract away assumptions about batch size, sequence length, and hardware generation, all of which materially change the underlying economics.
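EDN's caution is easy to demonstrate: the same GPU at the same hourly cost produces wildly different headline cost-per-token figures depending on batch size alone. The $3.50 per GPU-hour rate and the throughput figures below are illustrative assumptions, not quoted prices:

```python
def cost_per_million_tokens(gpu_hour_cost, tokens_per_sec_per_request,
                            batch_size, utilisation=1.0):
    """Effective serving cost per million output tokens.

    gpu_hour_cost:              all-in USD per GPU-hour (rental, power, cooling)
    tokens_per_sec_per_request: decode speed seen by one request at this batch size
    batch_size:                 concurrent requests sharing the GPU
    utilisation:                fraction of wall-clock time serving paid traffic
    """
    tokens_per_hour = tokens_per_sec_per_request * batch_size * utilisation * 3600
    return gpu_hour_cost / tokens_per_hour * 1_000_000

# Same hypothetical $3.50/hour GPU, two serving regimes:
interactive = cost_per_million_tokens(3.50, 60, batch_size=1)   # ≈ $16.20 per M
batched = cost_per_million_tokens(3.50, 40, batch_size=32)      # ≈ $0.76 per M
```

A twentyfold-plus swing from batching alone, before sequence length or hardware generation enter the picture, which is why headline cost-per-token numbers mean little without their underlying assumptions.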

The second number to watch is the age of the GPU fleet at each major neocloud. Nvidia's cadence, from H100 to H200 to B200 to the Vera Rubin generation expected later in 2026, means that each new architecture delivers roughly 2x to 4x the inference throughput of its predecessor on the same power envelope. A neocloud that bought heavily into H100s in 2024 is now competing against rivals deploying B200s and soon Vera Rubin chips that can serve the same model at half the cost per token. The residual value of last-generation GPUs in inference is not zero, because demand continues to grow faster than supply, but the margin premium accrues to the operator with the newest silicon. CoreWeave's backlog locks customers into contracts that can absorb this depreciation cycle; smaller neoclouds without long-term commitments face the risk of being undercut on price by competitors on newer hardware.
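The pricing pressure from a generational jump follows from cost per token being hourly price divided by throughput. A small sketch of the break-even discount an older fleet must offer (the 2x to 4x gains are the article's figures; the break-even framing is an assumption):

```python
def breakeven_rental_discount(throughput_gain):
    """Discount an older-generation GPU's hourly rate must carry for its
    cost per token (hourly price / throughput) to match a newer chip at
    the same hourly price and power envelope."""
    return 1 - 1 / throughput_gain

# A 2x-4x generational throughput gain implies steep required discounts:
for gain in (2, 3, 4):
    print(f"{gain}x throughput -> {breakeven_rental_discount(gain):.0%} discount")
    # prints 50%, 67%, and 75% respectively
```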

The third variable is the open-source model ecosystem itself. Eigen AI's optimisation technology works on open-weight models from OpenAI, Alibaba, Meta, and Nvidia. If the frontier models that drive the most inference volume shift toward closed-source APIs where the model provider also controls the serving infrastructure, the addressable market for third-party inference optimisation shrinks. The trend over the past eighteen months has moved in the opposite direction: open-weight models like Llama 4 and Qwen 3 have closed the quality gap with proprietary alternatives, expanding the opportunity for neocloud inference platforms. But that trend is not a law of nature. A single breakthrough from a closed-source lab that resets the quality frontier could redirect inference traffic back toward vertically integrated providers and away from the neoclouds that depend on open models.

Nebius paid $643 million for a 20-person team because the market is pricing a future where inference is the dominant cost centre for the entire AI industry. Training budgets are front-page news; inference budgets are the recurring expense that determines whether AI businesses actually make money. The neoclouds that capture the inference layer, whether through capacity scale like CoreWeave or through software optimisation like Nebius with Eigen, are betting that the per-token economy will be larger and more durable than the per-training-run economy. The next checkpoint to watch is Nebius's Q2 earnings, the first report that will reflect any contribution from the Eigen acquisition. The numbers that matter will not be revenue alone but the tokens-per-second-per-dollar figures that the company chooses, or declines, to disclose.
