Stop calling it ‘The AI bubble’: It’s actually multiple bubbles, each with a different expiration date

It’s the question on everyone’s minds and lips: Are we in an AI bubble?

It’s the wrong question. The real question is: Which AI bubble are we in, and when will each one burst?

The debate over whether AI represents a transformative technology or an economic time bomb has reached a fever pitch. Even tech leaders like Meta CEO Mark Zuckerberg have acknowledged evidence of an unstable financial bubble forming around AI. OpenAI CEO Sam Altman and Microsoft co-founder Bill Gates see clear bubble dynamics: overexcited investors, frothy valuations and plenty of doomed projects — but they still believe AI will ultimately transform the economy.

But treating “AI” as a single monolithic entity destined for a uniform collapse is fundamentally misguided. The AI ecosystem is actually three distinct layers, each with different economics, defensibility and risk profiles. Understanding these layers is critical, because they won’t all pop at once.

Layer 3: The wrapper companies (first to fall)

The most vulnerable segment isn’t building AI — it’s repackaging it.

These are the companies that take OpenAI’s API, add a slick interface and some prompt engineering, then charge $49/month for what amounts to a glorified ChatGPT wrapper. Some have achieved rapid initial success, like Jasper.ai, which reached approximately $42 million in annual recurring revenue (ARR) in its first year by wrapping GPT models in a user-friendly interface for marketers.

But the cracks are already showing. These businesses face threats from every direction:

Feature absorption: Microsoft can bundle your $50/month AI writing tool into Office 365 tomorrow. Google can make your AI email assistant a free Gmail feature. Salesforce can build your AI sales tool natively into their CRM. When large platforms decide your product is a feature, not a product, your business model evaporates overnight.

The commoditization trap: Wrapper companies essentially just pass inputs and outputs back and forth. If OpenAI improves its own prompting, these tools lose value overnight. As foundation models become more similar in capability and pricing continues to fall, margins compress to nothing.

Zero switching costs: Most wrapper companies don’t own proprietary data, embedded workflows or deep integrations. A customer can switch to a competitor, or directly to ChatGPT, in minutes. There’s no moat, no lock-in, no defensibility.

The white-label AI market exemplifies this fragility. Companies using white-label platforms face vendor lock-in risks from proprietary systems and API limitations that can hinder integration. These businesses are building on rented land, and the landlord can change the terms, or bulldoze the property, at any moment.

The exception that proves the rule: Cursor stands as a rare wrapper-layer company that has built genuine defensibility. By deeply integrating into developer workflows, creating proprietary features beyond simple API calls and establishing strong network effects through user habits and custom configurations, Cursor has demonstrated how a wrapper can evolve into something more substantial. But companies like Cursor are outliers, not the norm — most wrapper companies lack this level of workflow integration and user lock-in.

Timeline: Expect significant failures in this segment by late 2025 through 2026, as large platforms absorb functionality and users realize they’re paying premium prices for commoditized capabilities.

Layer 2: Foundation models (the middle ground)

The companies building LLMs — OpenAI, Anthropic, Mistral — occupy a more defensible but still precarious position.

Economic researcher Richard Bernstein points to OpenAI as an example of the bubble dynamic, noting that the company has made around $1 trillion in AI deals, including a $500 billion data center buildout project, despite being set to generate only $13 billion in revenue. The divergence between investment and plausible earnings “certainly looks bubbly,” Bernstein notes.

Yet, these companies possess genuine technological moats: Model training expertise, compute access and performance advantages. The question is whether these advantages are sustainable or whether models will commoditize to the point where they’re indistinguishable — turning foundation model providers into low-margin infrastructure utilities.

Engineering will separate winners from losers: As foundation models converge in baseline capabilities, the competitive edge will increasingly come from inference optimization and systems engineering. Companies that can scale the memory wall through innovations like extended KV cache architectures, achieve superior token throughput and deliver faster time-to-first-token will command premium pricing and market share. The winners won’t just be those with the largest training runs, but those who can make AI inference economically viable at scale. Technical breakthroughs in memory management, caching strategies and infrastructure efficiency will determine which frontier labs survive consolidation.

Another concern is the circular nature of investments. For instance, Nvidia is pumping $100 billion into OpenAI to bankroll data centers, and OpenAI is then filling those facilities with Nvidia’s chips. Nvidia is essentially subsidizing one of its biggest customers, potentially artificially inflating actual AI demand.

Still, these companies have massive capital backing, genuine technical capabilities and strategic partnerships with major cloud providers and enterprises. Some will consolidate, some will be acquired, but the category will survive.

Timeline: Consolidation in 2026 to 2028, with 2 to 3 dominant players emerging while smaller model providers are acquired or shuttered.

Layer 1: Infrastructure (built to last)

Here’s the contrarian take: The infrastructure layer — including Nvidia, data centers, cloud providers, memory systems and AI-optimized storage — is the least bubbly part of the AI boom.

Yes, the latest estimates suggest global AI capital expenditures and venture capital investments already exceed $600 billion in 2025, with Gartner estimating that all AI-related spending worldwide might top $1.5 trillion. That sounds like bubble territory.

But infrastructure has a critical characteristic: It retains value regardless of which specific applications succeed. The fiber optic cables laid during the dot-com bubble weren’t wasted — they enabled YouTube, Netflix and cloud computing. Twenty-five years ago, the original dot-com bubble burst after debt financing built out fiber-optic cables for a future that had not yet arrived, but that future eventually did arrive, and the infrastructure was there waiting.

Despite stock pressure, Nvidia’s Q3 fiscal year 2026 revenue hit about $57 billion, up 22% quarter-over-quarter and 62% year-over-year, with the data center division alone generating roughly $51.2 billion. These aren’t vanity metrics; they represent real demand from companies making genuine infrastructure investments.

The chips, data centers, memory systems and storage infrastructure being built today will power whatever AI applications ultimately succeed, whether that’s today’s chatbots, tomorrow’s autonomous agents or applications we haven’t even imagined yet. Unlike commoditized storage alone, modern AI infrastructure encompasses the entire memory hierarchy — from GPU HBM to DRAM to high-performance storage systems that serve as token warehouses for inference workloads. This integrated approach to memory and storage represents a fundamental architectural innovation, not a commodity play.

Timeline: Short-term overbuilding and lazy engineering are possible (2026), but long-term value retention is expected as AI workloads expand over the next decade.

The cascade effect: Why this matters

The current AI boom won’t end with one dramatic crash. Instead, we’ll see a cascade of failures beginning with the most vulnerable companies, and the warning signs are already here.

Phase 1: Wrapper and white-label companies face margin compression and feature absorption. Hundreds of AI startups with thin differentiation will shut down or sell for pennies on the dollar. More than 1,300 AI startups now have valuations of over $100 million, with 498 AI “unicorns” valued at $1 billion or more, many of which won’t justify those valuations.

Phase 2: Foundation model consolidation as performance converges and only the best-capitalized players survive. Expect 3 to 5 major acquisitions as tech giants absorb promising model companies.

Phase 3: Infrastructure spending normalizes but remains elevated. Some data centers will sit partially empty for a few years (like fiber optic cables in 2002), but they’ll eventually fill as AI workloads genuinely expand.

What this means for builders

The most significant risk isn’t being a wrapper — it’s staying one. If you own the experience the user operates in, you own the user.

If you’re building in the application layer, you need to move upstack immediately:

From wrapper → application layer: Stop just generating outputs. Own the workflow before and after the AI interaction.

From application → vertical SaaS: Build execution layers that force users to stay inside your product. Create proprietary data, deep integrations and workflow ownership that makes switching painful.

The distribution moat: Your real advantage isn’t the LLM, it’s how you get users, keep them and expand what they do inside your platform. Winning AI businesses aren’t just software companies — they’re distribution companies.

The bottom line

It’s time to stop asking whether we’re in “the” AI bubble. We’re in multiple bubbles with different characteristics and timelines.

The wrapper companies will pop first, probably within 18 months. Foundation models will consolidate over the next 2 to 4 years. I predict that current infrastructure investments will ultimately prove justified over the long term, although not without some short-term overbuilding pains.

This isn’t a reason for pessimism, it’s a roadmap. Understanding which layer you’re operating in and which bubble you might be caught in is the difference between becoming the next casualty and building something that survives the shakeout.

The AI revolution is real. But not every company riding the wave will make it to shore.

Val Bercovici is CAIO at WEKA.

Why reinforcement learning plateaus without representation depth (and other key takeaways from NeurIPS 2025)

Every year, NeurIPS produces hundreds of impressive papers, and a handful that subtly reset how practitioners think about scaling, evaluation and system design. In 2025, the most consequential works weren’t about a single breakthrough model. Instead, they challenged fundamental assumptions that academics and corporations have quietly relied on: Bigger models mean better reasoning, RL creates new capabilities, attention is “solved” and generative models inevitably memorize.

This year’s top papers collectively point to a deeper shift: AI progress is now constrained less by raw model capacity and more by architecture, training dynamics and evaluation strategy.

Below is a technical deep dive into five of the most influential NeurIPS 2025 papers — and what they mean for anyone building real-world AI systems.

1. LLMs are converging, and we finally have a way to measure it

Paper: Artificial Hivemind: The Open-Ended Homogeneity of Language Models

For years, LLM evaluation has focused on correctness. But in open-ended or ambiguous tasks like brainstorming, ideation or creative synthesis, there often is no single correct answer. The risk instead is homogeneity: Models producing the same “safe,” high-probability responses.

This paper introduces Infinity-Chat, a benchmark designed explicitly to measure diversity and pluralism in open-ended generation. Rather than scoring answers as right or wrong, it measures:

  • Intra-model collapse: How often the same model repeats itself

  • Inter-model homogeneity: How similar different models’ outputs are

The result is uncomfortable but important: Across architectures and providers, models increasingly converge on similar outputs — even when multiple valid answers exist.
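The two measurements listed above can be sketched with a simple similarity score. This is only an illustration: the article does not describe how Infinity-Chat computes its scores, so the word-overlap (Jaccard) measure, the function names and the sample responses below are all assumptions. A production setup would more likely compare embeddings.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two responses (0 = disjoint, 1 = identical)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def intra_model_similarity(samples: list[str]) -> float:
    """Mean pairwise similarity among repeated samples from ONE model.

    Needs at least two samples. Higher values mean the model repeats itself.
    """
    pairs = list(combinations(samples, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def inter_model_similarity(samples_a: list[str], samples_b: list[str]) -> float:
    """Mean similarity over every cross pair of responses from TWO models."""
    total = sum(jaccard(a, b) for a in samples_a for b in samples_b)
    return total / (len(samples_a) * len(samples_b))

# Hypothetical responses to the same open-ended prompt.
model_a = ["time is a river", "time is a river", "time is a thief"]
model_b = ["time is a river", "time is a flowing river"]

intra_a = intra_model_similarity(model_a)
inter_ab = inter_model_similarity(model_a, model_b)
print(f"intra-model (A): {intra_a:.2f}, inter-model (A vs B): {inter_ab:.2f}")
```

Tracking both numbers over time, per prompt category, is one way to notice when preference tuning starts to narrow the range of outputs.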

Why this matters in practice

For corporations, this reframes “alignment” as a trade-off. Preference tuning and safety constraints can quietly reduce diversity, leading to assistants that feel too safe, predictable or biased toward dominant viewpoints.

Takeaway: If your product relies on creative or exploratory outputs, diversity metrics need to be first-class citizens. 

2. Attention isn’t finished — a simple gate changes everything

Paper: Gated Attention for Large Language Models

Transformer attention has been treated as settled engineering. This paper proves it isn’t.

The authors introduce a small architectural change: Apply a query-dependent sigmoid gate after scaled dot-product attention, per attention head. That’s it. No exotic kernels, no massive overhead.
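To make the change concrete, here is a minimal single-head sketch in plain Python. It is not the authors' implementation: the gate projection `wg`, the choice to compute the gate from the same input vector that produces the query, and the causal masking are assumptions made for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_attention_head(x, wq, wk, wv, wg):
    """Causal scaled dot-product attention for one head, followed by a sigmoid gate.

    x: sequence of input vectors; wq, wk, wv, wg: projection matrices.
    wg is the assumed gate projection; each position's gate depends only on
    that position's own input, i.e. on the token issuing the query.
    """
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = 1.0 / math.sqrt(len(q[0]))
    attended = []
    for i, qi in enumerate(q):
        # Position i may only attend to positions 0..i.
        scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in range(i + 1)]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(len(v[0]))]
        )
    # Elementwise gate in (0, 1), applied after attention.
    gate = [[sigmoid(g) for g in row] for row in matmul(x, wg)]
    return [[o * g for o, g in zip(orow, grow)] for orow, grow in zip(attended, gate)]

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]

# With a zero gate projection every gate value is sigmoid(0) = 0.5,
# so the output is exactly half of the ungated attention output.
half_gated = gated_attention_head(tokens, identity, identity, identity, zeros)
```

Because the gate can drive an output toward zero, a head has a way to contribute nothing at a given position, which is one plausible reading of the reduced attention sinks reported in the paper.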

Across dozens of large-scale training runs — including dense and mixture-of-experts (MoE) models trained on trillions of tokens — this gated variant:

  • Improved stability

  • Reduced “attention sinks”

  • Enhanced long-context performance

  • Consistently outperformed vanilla attention

Why it works

The gate introduces:

  • Non-linearity in attention outputs

  • Implicit sparsity, suppressing pathological activations

This challenges the assumption that attention failures are purely data or optimization problems.

Takeaway: Some of the biggest LLM reliability issues may be architectural — not algorithmic — and solvable with surprisingly small changes.

3. RL can scale — if you scale in depth, not just data

Paper: 1,000-Layer Networks for Self-Supervised Reinforcement Learning

Conventional wisdom says RL doesn’t scale well without dense rewards or demonstrations. This paper shows that assumption is incomplete.

By scaling network depth aggressively from typical 2 to 5 layers to nearly 1,000 layers, the authors demonstrate dramatic gains in self-supervised, goal-conditioned RL, with performance improvements ranging from 2X to 50X.

The key isn’t brute force. It’s pairing depth with contrastive objectives, stable optimization regimes and goal-conditioned representations.

Why this matters beyond robotics

For agentic systems and autonomous workflows, this suggests that representation depth — not just data or reward shaping — may be a critical lever for generalization and exploration.

Takeaway: RL’s scaling limits may be architectural, not fundamental.

4. Why diffusion models generalize instead of memorizing

Paper: Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training

Diffusion models are massively overparameterized, yet they often generalize remarkably well. This paper explains why.

The authors identify two distinct training timescales:

  • One where generative quality rapidly improves

  • Another — much slower — where memorization emerges

Crucially, the memorization timescale grows linearly with dataset size, creating a widening window where models improve without overfitting.

Practical implications

This reframes early stopping and dataset scaling strategies. Memorization isn’t inevitable — it’s predictable and delayed.

Takeaway: For diffusion training, dataset size doesn’t just improve quality — it actively delays overfitting.

5. RL improves reasoning performance, not reasoning capacity

Paper: Does Reinforcement Learning Really Incentivize Reasoning in LLMs?

Perhaps the most strategically important result of NeurIPS 2025 is also the most sobering.

This paper rigorously tests whether reinforcement learning with verifiable rewards (RLVR) actually creates new reasoning abilities in LLMs — or simply reshapes existing ones.

Their conclusion: RLVR primarily improves sampling efficiency, not reasoning capacity. At large sample sizes, the base model often already contains the correct reasoning trajectories.
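The distinction between sampling efficiency and capacity is easiest to see with pass@k, the probability that at least one of k sampled answers is correct. The sketch below uses the standard unbiased pass@k estimator; the success counts are hypothetical and are not taken from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c were correct.

    Returns the probability that a random subset of k of the n samples
    contains at least one correct answer.
    """
    if n - c < k:
        return 1.0  # every subset of size k must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts for ONE problem, 1,000 samples per model.
N = 1000
base_correct = 20    # base model: right 2% of the time
tuned_correct = 300  # RL-tuned model: right 30% of the time

for k in (1, 16, 256):
    base = pass_at_k(N, base_correct, k)
    tuned = pass_at_k(N, tuned_correct, k)
    print(f"k={k:>3}  base={base:.3f}  tuned={tuned:.3f}")
```

At k=1 the tuned model looks far stronger. At k=256 both are close to 1, because the base model already produces correct trajectories and simply needs more draws to surface one. That is what "reshaping the distribution" means in practice.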

What this means for LLM training pipelines

RL is better understood as:

  • A distribution-shaping mechanism

  • Not a generator of fundamentally new capabilities

Takeaway: To truly expand reasoning capacity, RL likely needs to be paired with mechanisms like teacher distillation or architectural changes — not used in isolation.

The bigger picture: AI progress is becoming systems-limited

Taken together, these papers point to a common theme:

The bottleneck in modern AI is no longer raw model size — it’s system design.

  • Diversity collapse requires new evaluation metrics

  • Attention failures require architectural fixes

  • RL scaling depends on depth and representation

  • Memorization depends on training dynamics, not parameter count

  • Reasoning gains depend on how distributions are shaped, not just optimized

For builders, the message is clear: Competitive advantage is shifting from “who has the biggest model” to “who understands the system.”

Maitreyi Chatterjee is a software engineer.

Devansh Agarwal currently works as an ML engineer at a FAANG company.

How Google’s ‘internal RL’ could unlock long-horizon AI agents

Researchers at Google have developed a technique that makes it easier for AI models to learn complex reasoning tasks that usually cause LLMs to hallucinate or fall apart. Instead of training LLMs through next-token prediction, their technique, called internal reinforcement learning (internal RL), steers the model’s internal activations toward developing a high-level step-by-step solution for the input problem. 

Ultimately, this could provide a scalable path for creating autonomous agents that can handle complex reasoning and real-world robotics without needing constant, manual guidance.

The limits of next-token prediction

Reinforcement learning plays a key role in post-training LLMs, particularly for complex reasoning tasks that require long-horizon planning. However, the problem lies in the architecture of these models. LLMs are autoregressive, meaning they generate sequences one token at a time. When these models explore new strategies during training, they do so by making small, random changes to the next single token or action. This exposes a deeper limitation: next-token prediction forces models to search for solutions at the wrong level of abstraction, making long-horizon reasoning inefficient even when the model “knows” what to do.

This token-by-token approach works well for basic language modeling but breaks down in long-horizon tasks where rewards are sparse. If the model relies solely on random token-level sampling, the probability of stumbling upon the correct multi-step solution is infinitesimally small, “on the order of one in a million,” according to the researchers.

The issue isn’t just that the models get confused; it’s that they get confused at the wrong level. In comments provided to VentureBeat, Yanick Schimpf, a co-author of the paper, notes that in a 20-step task, an agent can get lost in the minute details of a single step, or it can lose track of the overall goal.

“We argue that when facing a problem with some abstract structure… [goal-oriented exploration] is what you want,” Schimpf said. By solving the problem at the abstract level first, the agent commits to a path, ensuring it doesn’t “get lost in one of the reasoning steps” and fail to complete the broader workflow.

To address this, the field has long looked toward hierarchical reinforcement learning. HRL attempts to solve complex problems by decomposing them into a hierarchy of temporally abstract actions (high-level subroutines that represent different stages of the solution) rather than managing a task as a string of tokens. 

However, discovering these appropriate subroutines remains a longstanding challenge. Current HRL methods often fail to discover proper policies, frequently “converging to degenerate options” that do not represent meaningful behaviors. Even sophisticated modern methods like GRPO (a popular RL algorithm used for sparse-reward tasks) fail in complex environments because they cannot effectively bridge the gap between low-level execution and high-level planning.

Steering the LLM’s internal thoughts

To overcome these limitations, the Google team proposed internal RL. Advanced autoregressive models already “know” how to perform complex, multi-step tasks internally, even if they aren’t explicitly trained to do so.

Because these complex behaviors are hidden inside the model’s residual stream (i.e., the numerical values that carry information through the network’s layers), the researchers introduced an “internal neural network controller,” or metacontroller. Instead of monitoring and changing the output token, the metacontroller controls the model’s behavior by applying changes to the model’s internal activations in the middle layers.

This nudge steers the model into a specific useful state. The base model then automatically generates the sequence of individual steps needed to achieve that goal because it has already seen those patterns during its initial pretraining. 
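As a rough mental model, steering can be pictured as adding a vector to the activations at one intermediate layer while every base-model weight stays frozen. The toy below is only that picture, not the paper's method: the two-dimensional "model," the layer function, the lookup-table controller and the subgoal names are all hypothetical, and in the actual work the controller is a learned network rather than a table.

```python
import math

def layer(x, w):
    """Toy residual layer: x + tanh(w * x), applied elementwise."""
    return [xi + math.tanh(w * xi) for xi in x]

def run_frozen_model(x, weights, steer_layer=None, offset=None):
    """Run the frozen layers in order.

    If steer_layer and offset are given, add offset to the residual stream
    right after that layer. The weights themselves are never modified.
    """
    for i, w in enumerate(weights):
        x = layer(x, w)
        if i == steer_layer and offset is not None:
            x = [xi + oi for xi, oi in zip(x, offset)]
    return x

# Hypothetical stand-in for a metacontroller: abstract subgoal -> offset vector.
SUBGOAL_OFFSETS = {"go_left": [-1.0, 0.0], "go_right": [1.0, 0.0]}

FROZEN_WEIGHTS = [0.5, -0.3, 0.8, 0.1]  # stands in for the pretrained base model
x0 = [0.2, -0.1]

baseline = run_frozen_model(x0, FROZEN_WEIGHTS)
steered = run_frozen_model(
    x0, FROZEN_WEIGHTS, steer_layer=1, offset=SUBGOAL_OFFSETS["go_right"]
)
print("baseline:", baseline)
print("steered: ", steered)
```

The layers after the intervention are unchanged, yet they now process a different internal state, which is how a single high-level choice can shape many downstream steps.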

The metacontroller operates through unsupervised learning and does not require human-labeled training examples. Instead, the researchers use a self-supervised framework where the model analyzes a full sequence of behavior and works backward to infer the hidden, high-level intent that best explains the actions.

During the internal RL phase, the updates are applied to the metacontroller, which shifts training from next-token prediction to learning high-level actions that can lead to the solution.

To understand the practical value of this, consider an enterprise agent tasked with code generation. Today, there is a difficult trade-off: You need “low temperature” (predictability) to get the syntax right, but “high temperature” (creativity) to solve the logic puzzle.

“Internal RL might facilitate this by allowing the model to explore the space of abstract actions, i.e. structuring logic and method calls, while delegating the token-level realization of those actions to the robust, lower-temperature distribution of the base model,” Schimpf said. The agent explores the solution without breaking the syntax.

The researchers investigated two methods for applying this controller. In the first, the base autoregressive model is pretrained on a behavioral dataset and then frozen, while the metacontroller is trained to steer the frozen model’s residual stream. In the second, the metacontroller and the base model are jointly optimized, with parameters of both networks updated simultaneously. 

Internal RL in action

To evaluate the effectiveness of internal RL, the researchers ran experiments across hierarchical environments designed to stump traditional learners. These included a discrete grid world and a continuous control task where a quadrupedal “ant” robot must coordinate joint movements. Both environments used sparse rewards with very long action sequences.

While baselines like GRPO and CompILE failed to learn the tasks within a million episodes due to the difficulty of credit assignment over long horizons, internal RL achieved high success rates with a small number of training episodes. By choosing high-level goals rather than tiny steps, the metacontroller drastically reduced the search space. This allowed the model to identify which high-level decisions led to success, making credit assignment efficient enough to solve the sparse reward problem.

Notably, the researchers found that the “frozen” approach was superior. When the base model and metacontroller were co-trained from scratch, the system failed to develop meaningful abstractions. However, applied to a frozen model, the metacontroller successfully discovered key checkpoints without any human labels, perfectly aligning its internal switching mechanism with the ground-truth moments when an agent finished one subgoal and started the next.

As the industry currently fixates on reasoning models that output verbose “chains of thought” to solve problems, Google’s research points toward a different, perhaps more efficient future.

“Our study joins a growing body of work suggesting that ‘internal reasoning’ is not only feasible but potentially more efficient than token-based approaches,” Schimpf said. “Moreover, these silent ‘thoughts’ can be decoupled from specific input modalities — a property that could be particularly relevant for the future of multi-modal AI.”

If internal reasoning can be guided without being externalized, the future of AI agents may hinge less on prompting strategies and more on how well we can access and steer what models already represent internally. For enterprises betting on autonomous systems that must plan, adapt, and act over long horizons, that shift could matter more than any new reasoning benchmark.

Breaking through AI’s memory wall with token warehousing

As agentic AI moves from experiments to real production workloads, a quiet but serious infrastructure problem is coming into focus: memory. Not compute. Not models. Memory.

Under the hood, today’s GPUs simply don’t have enough space to hold the Key-Value (KV) caches that modern, long-running AI agents depend on to maintain context. The result is a lot of invisible waste — GPUs redoing work they’ve already done, cloud costs climbing, and performance taking a hit. It’s a problem that’s already showing up in production environments, even if most people haven’t named it yet.

At a recent stop on the VentureBeat AI Impact Series, WEKA CTO Shimon Ben-David joined VentureBeat CEO Matt Marshall to unpack the industry’s emerging “memory wall,” and why it’s becoming one of the biggest blockers to scaling truly stateful agentic AI — systems that can remember and build on context over time. The conversation didn’t just diagnose the issue; it laid out a new way to think about memory entirely, through an approach WEKA calls token warehousing.

The GPU memory problem

“When we’re looking at the infrastructure of inferencing, it is not a GPU cycles challenge. It’s mostly a GPU memory problem,” said Ben-David.

The root of the issue comes down to how transformer models work. To generate responses, they rely on KV caches that store contextual information for every token in a conversation. The longer the context window, the more memory those caches consume, and it adds up fast. A single 100,000-token sequence can require roughly 40GB of GPU memory, noted Ben-David.

That wouldn’t be a problem if GPUs had unlimited memory. But they don’t. Even the most advanced GPUs top out at around 288GB of high-bandwidth memory (HBM), and that space also has to hold the model itself.
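A back-of-the-envelope calculation shows where numbers of this size come from. The model shape below (80 layers, 8 key-value heads of dimension 128, 16-bit values, 70 billion parameters) is an assumption chosen for illustration, not the configuration Ben-David had in mind, and the estimate ignores activations and other runtime overhead.

```python
GB = 10**9

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size for one sequence: a key and a value vector
    per token, per layer, per key-value head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Assumed model shape (hypothetical, roughly a 70B-parameter model).
per_sequence = kv_cache_bytes(100_000, layers=80, kv_heads=8, head_dim=128)

hbm_capacity = 288 * GB             # HBM figure quoted above
model_weights = 70 * 10**9 * 2      # 70B parameters at 2 bytes each
free_for_cache = hbm_capacity - model_weights
sequences_that_fit = free_for_cache // per_sequence

print(f"KV cache per 100k-token sequence: {per_sequence / GB:.1f} GB")
print(f"HBM left after weights: {free_for_cache / GB:.0f} GB")
print(f"100k-token sequences that fit: {sequences_that_fit}")
```

Under these assumptions one 100,000-token sequence needs about 33GB, the same order of magnitude as the roughly 40GB cited, and only four such sequences fit alongside the weights on a single 288GB GPU.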

In real-world, multi-tenant inference environments, this becomes painful quickly. Workloads like code development or processing tax returns rely heavily on KV-cache for context.

“If I’m loading three or four 100,000-token PDFs into a model, that’s it — I’ve exhausted the KV cache capacity on HBM,” said Ben-David. This is what’s known as the memory wall. “Suddenly, what the inference environment is forced to do is drop data,” he added.

That means GPUs are constantly throwing away context they’ll soon need again, preventing agents from being stateful and maintaining conversations and context over time.

The hidden inference tax

“We constantly see GPUs in inference environments recalculating things they already did,” Ben-David said. Systems prefill the KV cache, start decoding, then run out of space and evict earlier data. When that context is needed again, the whole process repeats — prefill, decode, prefill again. At scale, that’s an enormous amount of wasted work. It also means wasted energy, added latency, and degraded user experience — all while margins get squeezed.
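The prefill, evict, prefill-again cycle can be simulated in a few lines. This is a simplified sketch, not a model of any real inference server: it assumes least-recently-used eviction, treats each session's context as one block, and uses made-up session names and sizes.

```python
from collections import OrderedDict

def simulate_prefill(requests, context_tokens, capacity_tokens):
    """Replay requests against an LRU KV cache.

    Returns (total_prefill_tokens, redundant_prefill_tokens), where redundant
    means the same session's context had already been prefilled earlier.
    """
    cache = OrderedDict()  # session id -> cached tokens, oldest first
    used = 0
    total = 0
    redundant = 0
    seen = set()
    for sid in requests:
        if sid in cache:
            cache.move_to_end(sid)  # cache hit: no prefill needed
            continue
        size = context_tokens[sid]
        # Evict least-recently-used sessions until the new context fits.
        while used + size > capacity_tokens and cache:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[sid] = size
        used += size
        total += size
        if sid in seen:
            redundant += size
        seen.add(sid)
    return total, redundant

# Three hypothetical agent sessions, each holding 100k tokens of context.
contexts = {"A": 100_000, "B": 100_000, "C": 100_000}
pattern = ["A", "B", "C", "A", "B", "C"]

tight = simulate_prefill(pattern, contexts, capacity_tokens=200_000)
roomy = simulate_prefill(pattern, contexts, capacity_tokens=300_000)
print("capacity 200k:", tight)
print("capacity 300k:", roomy)
```

With room for only two of the three contexts, every revisit misses the cache and half of all prefill work is repeated. With room for all three, no work is repeated. Extending effective cache capacity is aimed at exactly that gap.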

That GPU recalculation waste shows up directly on the balance sheet. Organizations can suffer nearly 40% overhead just from redundant prefill cycles. This is creating ripple effects in the inference market.

“If you look at the pricing of large model providers like Anthropic and OpenAI, they are actually teaching users to structure their prompts in ways that increase the likelihood of hitting the same GPU that has their KV cache stored,” said Ben-David. “If you hit that GPU, the system can skip the prefill phase and start decoding immediately, which lets them generate more tokens efficiently.”

But this still doesn’t solve the underlying infrastructure problem of extremely limited GPU memory capacity.

Solving for stateful AI

“How do you climb over that memory wall? How do you surpass it? That’s the key for modern, cost-effective inferencing,” Ben-David said. “We see multiple companies trying to solve that in different ways.”

Some organizations are deploying new linear models that try to create smaller KV caches. Others are focused on tackling cache efficiency.

“To be more efficient, companies are using environments that calculate the KV cache on one GPU and then try to copy it from GPU memory or use a local environment for that,” Ben-David explained. “But how do you do that at scale in a cost-effective manner that doesn’t strain your memory and doesn’t strain your networking? That’s something that WEKA is helping our customers with.”

Simply throwing more GPUs at the problem doesn’t solve the AI memory barrier. “There are some problems that you cannot throw enough money at to solve,” Ben-David said.

Augmented memory and token warehousing, explained

WEKA’s answer is what it calls augmented memory and token warehousing — a way to rethink where and how KV cache data lives. Instead of forcing everything to fit inside GPU memory, WEKA’s Augmented Memory Grid extends the KV cache into a fast, shared “warehouse” within its NeuralMesh architecture.

In practice, this turns memory from a hard constraint into a scalable resource — without adding inference latency. WEKA says customers see KV cache hit rates jump to 96–99% for agentic workloads, along with efficiency gains of up to 4.2x more tokens produced per GPU.

Ben-David put it simply: “Imagine that you have 100 GPUs producing a certain amount of tokens. Now imagine that those hundred GPUs are working as if they’re 420 GPUs.”

For large inference providers, the result isn’t just better performance — it translates directly to real economic impact.

“Just by adding that accelerated KV cache layer, we’re looking at some use cases where the savings amount would be millions of dollars per day,” said Ben-David.

This efficiency multiplier also opens up new strategic options for businesses. Platform teams can design stateful agents without worrying about blowing up memory budgets. Service providers can offer pricing tiers based on persistent context, with cached inference delivered at dramatically lower cost.

What comes next

NVIDIA projects a 100x increase in inference demand as agentic AI becomes the dominant workload. That pressure is already trickling down from hyperscalers to everyday enterprise deployments; this isn’t just a “big tech” problem anymore.

As enterprises move from proofs of concept into real production systems, memory persistence is becoming a core infrastructure concern. Organizations that treat it as an architectural priority rather than an afterthought will gain a clear advantage in both cost and performance.

The memory wall is not something organizations can simply outspend to overcome. As agentic AI scales, it is one of the first AI infrastructure limits that forces a deeper rethink, and as Ben-David’s insights made clear, memory may also be where the next wave of competitive differentiation begins.

How DoorDash scaled without a costly ERP overhaul

Presented by NetSuite


Most companies racing from startup to industry leader face a choice: limp along with scrappy early systems or endure a costly platform migration.

DoorDash did neither. The local-commerce giant scaled from its 2013 founding through IPO and global expansion, acquiring the Helsinki-based technology company Wolt in 2022 and UK-based Deliveroo in 2025, while keeping its original Oracle NetSuite business system. Today, it serves over 50 million consumers in more than 40 countries.*

Chief Accounting Officer Gordon Lee says the secret is building a scalable ecosystem that allows teams to use tools that work best for them.

Choosing flexibility over uniformity

When DoorDash selected NetSuite as its corporate financial control center, it wasn’t looking for a system to enforce uniformity. It sought a scalable platform that could connect all its systems, from ERP and CRM to HR, sourcing, and more.

“Our philosophy has been to create a platform that allows our customers and business partners to use whatever tools work best for them,” Lee says. “When we’re managing growth, the majority of the conversation is about managing expectations — what people expect when you grow from A to B.”

The migration question

Two years after its founding, DoorDash surpassed one million deliveries and expanded into Canada. As the company scaled, Lee faced growing pressure from vendors insisting that rapid growth required a new enterprise platform.

He ran the numbers. The move to another platform could cost millions and consume months of his team’s focus.

Instead, DoorDash stayed with NetSuite, which continued to scale alongside the company’s growth. Built on Oracle Cloud Infrastructure, NetSuite delivers the performance and reliability of an enterprise platform without the cost or disruption of migration.

Lee concluded: “Why do I bother to move? I already have the scalability I need from NetSuite.”

Today, DoorDash’s NetSuite backend provides enterprise-grade security while its familiar front end provides the team flexibility, creating a stable, modern foundation for sustained, high-velocity growth.

Expanding the menu without the technical indigestion

That flexibility soon proved invaluable. The ability to add new applications quickly — without long, costly integrations — became a major advantage during hypergrowth.

For example, as DoorDash expanded from restaurant delivery into grocery, convenience, and retail, Lee turned to NetSuite’s inventory modules to handle the distinct demands of those new categories.

“The flexibility to have and not have, and turn the switch on and off, is easy because it’s all integrated,” he explains.

Today, DoorDash’s technology stack spans multiple systems — all integrating seamlessly with NetSuite as the financial hub. “They do it, and you’re done,” Lee says.

Embedding expertise to scale smarter, not bigger

For Lee, true partnerships turn vendors into part of the team — and that’s exactly how he describes NetSuite Advanced Customer Support (ACS).

“They are here with us every week. They know all my schematics, they know all my data infrastructure, they know all my database structure within NetSuite. Essentially, they are an extension of my team,” Lee explains.

Close collaboration benefits both parties. DoorDash keeps NetSuite attuned to the realities of hypergrowth and gets instant feedback on technology capability and scalability. In turn, NetSuite stays close to a marquee customer. Interaction is ongoing — and frank, according to Lee.

“We work directly with NetSuite ACS and often ask, ‘Can NetSuite do this?’ If they can prove it can, we stay with NetSuite.”

Another benefit is the ability to extend DoorDash’s expertise without expanding headcount.

“If someone says to me, ‘Gordon, you’re just an accountant. How do you know about systems?’ I say, ‘I don’t. I have a network guy with us, an expert.’ That’s the kind of partner I want to surround myself with, so that I can grow beyond what I am.”

By embedding expertise within its partnerships, DoorDash scales with precision and control. Lee says the model applies to other companies preparing for IPOs or global expansion. He adds that sustainable growth depends as much on shared understanding as on technology itself.

Too often, finance and IT “look at the same requirement but see completely different things,” Lee says, describing what he calls the “blue versus purple” problem. “The accountant doesn’t understand the configuration of the system,” he explains. “The IT guy doesn’t understand what the accountant was trying to tell them.”

NetSuite bridges that gap. With a unified data model and built-in best practices across finance, operations, and more, it keeps teams aligned and information consistent. That close collaboration, Lee notes, is what keeps rollouts smooth, data clean, and growth sustainable at any stage.

AI strategy: Trust only internal data, get data ducks in a row

Lee plans to test the NetSuite AI Connector Service — which supports Model Context Protocol (MCP) and lets customers connect their own AI to NetSuite — to see how faster access to accurate data can drive growth.

By implementing an internal instance, Lee is less worried about disruptive errors from LLMs trained on public data sources.

“Think about a generative AI chatbot. When you ask a question, it can reflect many perspectives,” he explains. On the other hand, a chatbot trained on private enterprise systems benefits from “a clean data infrastructure.”

Lee is taking a methodical approach: first get data pristine, then train AI on domain-specific terminology, and finally see how internal AI can both find the right information and automate downstream accounting processes to save resources and accelerate growth.

Betting long-term on its original financial core

From early growth to major acquisitions that helped expand its footprint across the globe, DoorDash has relied on NetSuite as a consistent foundation for innovation and scale.

Lee credits NetSuite’s flexible architecture and close partnership with helping enable DoorDash as it continued to scale and cement itself as a leader in local commerce globally.

His mantra is simple: “Focus on growth instead of churning through vendors.”

* Based on the combined numbers for DoorDash, Wolt, and Deliveroo, measured as of September 2025.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Why your LLM bill is exploding — and how semantic caching can cut it by 73%

Our LLM API bill was growing 30% month-over-month. Traffic was increasing, but not that fast. When I analyzed our query logs, I found the real problem: Users ask the same questions in different ways.

“What’s your return policy?”, “How do I return something?” and “Can I get a refund?” were all hitting our LLM separately, generating nearly identical responses, each incurring full API costs.

Exact-match caching, the obvious first solution, captured only 18% of these redundant calls. The same semantic question, phrased differently, bypassed the cache entirely.

So, I implemented semantic caching based on what queries mean, not how they’re worded. After implementing it, our cache hit rate increased to 67%, reducing LLM API costs by 73%. But getting there requires solving problems that naive implementations miss.

Why exact-match caching falls short

Traditional caching uses query text as the cache key. This works when queries are identical:

# Exact-match caching
def get_cached(query_text, cache):
    cache_key = hash(query_text)
    if cache_key in cache:
        return cache[cache_key]
    return None  # cache miss: caller falls through to the LLM
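To see why text keys leave so much redundancy on the table, here is a minimal sketch of an exact-match cache with light normalization. The `normalized_key` and `lookup` helpers and the `<cached answer>` placeholder are illustrative assumptions, not the production code. Normalizing case, punctuation and whitespace rescues trivial variants, but a paraphrase still produces a different key.

```python
import hashlib
import re
from typing import Dict, Optional


def normalized_key(query_text: str) -> str:
    """Exact-match key after trivial normalization (case, punctuation, whitespace)."""
    text = re.sub(r"[^\w\s]", "", query_text.lower())
    text = " ".join(text.split())
    # hashlib yields the same key in every process; the built-in hash() of a
    # str is randomized per process by default, so it is unsuitable for a shared cache.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


cache: Dict[str, str] = {}


def lookup(query_text: str) -> Optional[str]:
    return cache.get(normalized_key(query_text))


cache[normalized_key("What's your return policy?")] = "<cached answer>"

hit = lookup("  what's your RETURN policy ")   # same words, different formatting
miss = lookup("How do I return something?")    # same intent, different words
```

Here `hit` is the cached answer and `miss` is `None`: no amount of string cleanup maps the second phrasing onto the first.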

But users don’t phrase questions identically. My analysis of 100,000 production queries found:

  • Only 18% were exact duplicates of previous queries

  • 47% were semantically similar to previous queries (same intent, different wording)

  • 35% were genuinely novel queries

That 47% represented massive cost savings we were missing. Each semantically-similar query triggered a full LLM call, generating a response nearly identical to one we’d already computed.

Semantic caching architecture

Semantic caching replaces text-based keys with embedding-based similarity lookup:

from datetime import datetime, timezone
from typing import Optional

# VectorStore, ResponseStore and generate_id are placeholders for your own backends.
class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.92):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.vector_store = VectorStore()  # FAISS, Pinecone, etc.
        self.response_store = ResponseStore()  # Redis, DynamoDB, etc.

    def get(self, query: str) -> Optional[str]:
        """Return cached response if semantically similar query exists."""
        query_embedding = self.embedding_model.encode(query)
        # Find most similar cached query
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= self.threshold:
            entry = self.response_store.get(matches[0].id)
            # set() stores a dict, so unwrap the response text
            return entry['response'] if entry else None
        return None

    def set(self, query: str, response: str):
        """Cache query-response pair."""
        query_embedding = self.embedding_model.encode(query)
        cache_id = generate_id()
        self.vector_store.add(cache_id, query_embedding)
        self.response_store.set(cache_id, {
            'query': query,
            'response': response,
            'timestamp': datetime.now(timezone.utc)
        })

The key insight: Instead of hashing query text, I embed queries into vector space and find cached queries within a similarity threshold.
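The lookup flow can be tried end to end with a toy, dependency-free sketch. Everything here is an assumption for illustration: `toy_embed` is a word-count stand-in for a real embedding model, the linear scan replaces a vector store, and the 0.8 threshold applies only to this toy similarity (word-count cosine scores are not comparable to scores from a learned embedding model).

```python
import math
import re
from collections import Counter
from typing import Optional


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToySemanticCache:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query: str) -> Optional[str]:
        query_embedding = toy_embed(query)
        best_score, best_response = 0.0, None
        # Linear scan; a real vector store would use an approximate index
        for embedding, response in self.entries:
            score = cosine(query_embedding, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def set(self, query: str, response: str) -> None:
        self.entries.append((toy_embed(query), response))


toy_cache = ToySemanticCache(threshold=0.8)
toy_cache.set("How do I return an item?", "<return policy answer>")

near_match = toy_cache.get("How do I return this item?")  # 5 of 6 words shared
unrelated = toy_cache.get("Where is my order?")           # no words shared
```

The near match scores 5/6 ≈ 0.83 and is served from the cache, while the unrelated query scores 0 and falls through. Swapping `toy_embed` for a real model changes the scores but not the control flow.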

The threshold problem

The similarity threshold is the critical parameter. Set it too high, and you miss valid cache hits. Set it too low, and you return wrong responses.

Our initial threshold of 0.85 seemed reasonable; 85% similar should be “the same question,” right?

Wrong. At 0.85, we got cache hits like:

  • Query: “How do I cancel my subscription?”

  • Cached: “How do I cancel my order?”

  • Similarity: 0.87

These are different questions with different answers. Returning the cached response would be incorrect.

I discovered that optimal thresholds vary by query type:

| Query type | Optimal threshold | Rationale |
|---|---|---|
| FAQ-style questions | 0.94 | High precision needed; wrong answers damage trust |
| Product searches | 0.88 | More tolerance for near-matches |
| Support queries | 0.92 | Balance between coverage and accuracy |
| Transactional queries | 0.97 | Very low tolerance for errors |

I implemented query-type-specific thresholds:

class AdaptiveSemanticCache(SemanticCache):
    def __init__(self, embedding_model):
        # Reuse the embedding model and stores set up by SemanticCache
        super().__init__(embedding_model)
        self.thresholds = {
            'faq': 0.94,
            'search': 0.88,
            'support': 0.92,
            'transactional': 0.97,
            'default': 0.92
        }
        self.query_classifier = QueryClassifier()

    def get_threshold(self, query: str) -> float:
        query_type = self.query_classifier.classify(query)
        return self.thresholds.get(query_type, self.thresholds['default'])

    def get(self, query: str) -> Optional[str]:
        threshold = self.get_threshold(query)
        query_embedding = self.embedding_model.encode(query)
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= threshold:
            entry = self.response_store.get(matches[0].id)
            return entry['response'] if entry else None
        return None

Threshold tuning methodology

I couldn’t tune thresholds blindly. I needed ground truth on which query pairs were actually “the same.”

Our methodology:

Step 1: Sample query pairs. I sampled 5,000 query pairs at various similarity levels (0.80-0.99).

Step 2: Human labeling. Annotators labeled each pair as “same intent” or “different intent.” I used three annotators per pair and took a majority vote.

Step 3: Compute precision/recall curves. For each threshold, we computed:

  • Precision: Of cache hits, what fraction had the same intent?

  • Recall: Of same-intent pairs, what fraction did we cache-hit?

def compute_precision_recall(pairs, labels, threshold):
    """Compute precision and recall at given similarity threshold."""
    predictions = [1 if pair.similarity >= threshold else 0 for pair in pairs]
    true_positives = sum(1 for pred, label in zip(predictions, labels) if pred == 1 and label == 1)
    false_positives = sum(1 for pred, label in zip(predictions, labels) if pred == 1 and label == 0)
    false_negatives = sum(1 for pred, label in zip(predictions, labels) if pred == 0 and label == 1)
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    return precision, recall

Step 4: Select threshold based on cost of errors. For FAQ queries where wrong answers damage trust, I optimized for precision (0.94 threshold gave 98% precision). For search queries where missing a cache hit just costs money, I optimized for recall (0.88 threshold).
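One way to turn Step 4 into a rule is to sweep candidate thresholds and keep the lowest one that still meets a precision target, which maximizes recall subject to that target. The sketch below assumes pairs are given as `(similarity, same_intent)` tuples; the labeled pairs, the candidate list and the `pick_threshold` helper are made up for illustration and are not the article's data.

```python
def precision_recall(pairs, threshold):
    """pairs: iterable of (similarity, same_intent) tuples."""
    tp = sum(1 for sim, same in pairs if sim >= threshold and same)
    fp = sum(1 for sim, same in pairs if sim >= threshold and not same)
    fn = sum(1 for sim, same in pairs if sim < threshold and same)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pick_threshold(pairs, candidates, min_precision):
    """Lowest candidate threshold whose precision meets the target, else None."""
    for threshold in sorted(candidates):
        precision, recall = precision_recall(pairs, threshold)
        if precision >= min_precision:
            return threshold, precision, recall
    return None


# Hypothetical labeled pairs: (similarity, annotators agreed on same intent)
labeled_pairs = [
    (0.99, True), (0.96, True), (0.95, True), (0.93, False),
    (0.91, True), (0.89, True), (0.87, False), (0.84, False),
]
candidates = [0.85, 0.90, 0.94]

strict = pick_threshold(labeled_pairs, candidates, min_precision=0.95)
lenient = pick_threshold(labeled_pairs, candidates, min_precision=0.75)
```

On this toy data the strict target selects 0.94 (precision 1.0, recall 0.6) and the lenient target selects 0.90 (precision 0.8, recall 0.8), mirroring the trade-off between FAQ-style and search-style queries.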

Latency overhead

Semantic caching adds latency: You must embed the query and search the vector store before knowing whether to call the LLM.

Our measurements:

| Operation | Latency (p50) | Latency (p99) |
|---|---|---|
| Query embedding | 12ms | 28ms |
| Vector search | 8ms | 19ms |
| Total cache lookup | 20ms | 47ms |
| LLM API call | 850ms | 2400ms |

The 20ms overhead is negligible compared to the 850ms LLM call we avoid on cache hits. Even at p99, the 47ms overhead is acceptable.

However, cache misses now take 20ms longer than before (embedding + search + LLM call). At our 67% hit rate, the math works out favorably:

  • Before: 100% of queries × 850ms = 850ms average

  • After: (33% × 870ms) + (67% × 20ms) = 287ms + 13ms = 300ms average

Net latency improvement of 65% alongside the cost reduction.
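The arithmetic above generalizes to any hit rate. This small sketch (the function names are illustrative, and it treats the p50 figures as averages, as the calculation above does) also gives the break-even point: the cache lowers average latency whenever the hit rate exceeds lookup time divided by LLM time.

```python
def average_latency_ms(hit_rate, lookup_ms, llm_ms):
    """Expected latency when every query pays the lookup and misses also pay the LLM call."""
    hit_cost = lookup_ms
    miss_cost = lookup_ms + llm_ms
    return hit_rate * hit_cost + (1 - hit_rate) * miss_cost


def break_even_hit_rate(lookup_ms, llm_ms):
    """Hit rate above which the cache reduces average latency."""
    return lookup_ms / llm_ms


before_ms = 850
after_ms = average_latency_ms(hit_rate=0.67, lookup_ms=20, llm_ms=850)
improvement = 1 - after_ms / before_ms
minimum_hit_rate = break_even_hit_rate(lookup_ms=20, llm_ms=850)
```

With the measured numbers, `after_ms` is 300.5, `improvement` is about 0.65, and `minimum_hit_rate` is roughly 2.4%, so the lookup overhead pays for itself at almost any realistic hit rate.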

Cache invalidation

Cached responses go stale. Product information changes, policies update and yesterday’s correct answer becomes today’s wrong answer.

I implemented three invalidation strategies:

  1. Time-based TTL

Simple expiration based on content type:

from datetime import timedelta

TTL_BY_CONTENT_TYPE = {
    'pricing': timedelta(hours=4),      # Changes frequently
    'policy': timedelta(days=7),         # Changes rarely
    'product_info': timedelta(days=1),   # Daily refresh
    'general_faq': timedelta(days=14),   # Very stable
}
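A minimal sketch of how such a TTL table might be applied at read time follows. The `content_type` field on the cache entry, the `DEFAULT_TTL` fallback and the `is_expired` helper are assumptions added for illustration; the stored entries shown earlier carry only the query, response and timestamp.

```python
from datetime import datetime, timedelta, timezone

TTL_BY_CONTENT_TYPE = {
    "pricing": timedelta(hours=4),
    "policy": timedelta(days=7),
    "product_info": timedelta(days=1),
    "general_faq": timedelta(days=14),
}
DEFAULT_TTL = timedelta(days=1)  # assumed fallback, not from the article


def is_expired(entry, now=None):
    """True when the entry is older than the TTL for its content type."""
    now = now or datetime.now(timezone.utc)
    ttl = TTL_BY_CONTENT_TYPE.get(entry["content_type"], DEFAULT_TTL)
    return now - entry["timestamp"] > ttl


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
five_hours_ago = now - timedelta(hours=5)
pricing_entry = {"content_type": "pricing", "timestamp": five_hours_ago}
policy_entry = {"content_type": "policy", "timestamp": five_hours_ago}

pricing_stale = is_expired(pricing_entry, now)
policy_stale = is_expired(policy_entry, now)
```

A five-hour-old pricing answer is past its four-hour TTL, while a policy answer of the same age is still fresh. The timestamps must be timezone-aware here, because subtracting a naive datetime from an aware one raises a `TypeError`.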

  2. Event-based invalidation

When underlying data changes, invalidate related cache entries:

class CacheInvalidator:
    def on_content_update(self, content_id: str, content_type: str):
        """Invalidate cache entries related to updated content."""
        # Find cached queries that referenced this content
        affected_queries = self.find_queries_referencing(content_id)
        for query_id in affected_queries:
            self.cache.invalidate(query_id)
        self.log_invalidation(content_id, len(affected_queries))
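The invalidator above relies on knowing which cached entries were built from which content. One plausible way to support that lookup is a reverse index maintained at write time. The `ContentIndex` class, its `record` and `forget` methods and the sample IDs are hypothetical; the article does not say how `find_queries_referencing` is implemented.

```python
from collections import defaultdict


class ContentIndex:
    """Reverse index from a content ID to the cache entries built from it."""

    def __init__(self):
        self._cache_ids_by_content = defaultdict(set)

    def record(self, cache_id, content_ids):
        """Call when caching a response, listing the content it drew on."""
        for content_id in content_ids:
            self._cache_ids_by_content[content_id].add(cache_id)

    def find_queries_referencing(self, content_id):
        # Return a copy so callers can invalidate while iterating
        return set(self._cache_ids_by_content.get(content_id, ()))

    def forget(self, cache_id):
        """Call after an entry is invalidated or expires."""
        for cache_ids in self._cache_ids_by_content.values():
            cache_ids.discard(cache_id)


index = ContentIndex()
index.record("q1", ["return-policy"])
index.record("q2", ["return-policy", "shipping-rates"])
index.record("q3", ["shipping-rates"])

affected = index.find_queries_referencing("return-policy")
```

Updating the return policy would then invalidate `q1` and `q2` and leave `q3` untouched. This only works if the response generator reports which content it used, which is itself an assumption about the surrounding system.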

  3. Staleness detection

For responses that might become stale without explicit events, I implemented periodic freshness checks:

def check_freshness(self, cached_response: dict) -> bool:
    """Verify cached response is still valid."""
    # Re-run the query against current data
    fresh_response = self.generate_response(cached_response['query'])
    # Compare semantic similarity of responses
    cached_embedding = self.embed(cached_response['response'])
    fresh_embedding = self.embed(fresh_response)
    similarity = cosine_similarity(cached_embedding, fresh_embedding)
    # If responses diverged significantly, invalidate
    if similarity < 0.90:
        self.cache.invalidate(cached_response['id'])
        return False
    return True

We run freshness checks on a sample of cached entries daily, catching staleness that TTL and event-based invalidation miss.

Production results

After three months in production:

| Metric | Before | After | Change |
|---|---|---|---|
| Cache hit rate | 18% | 67% | +272% |
| LLM API costs | $47K/month | $12.7K/month | -73% |
| Average latency | 850ms | 300ms | -65% |
| False-positive rate | N/A | 0.8% | |
| Customer complaints (wrong answers) | Baseline | +0.3% | Minimal increase |

The 0.8% false-positive rate (queries where we returned a cached response that was semantically incorrect) was within acceptable bounds. These cases occurred primarily at the boundaries of our threshold, where similarity was just above the cutoff but intent differed slightly.

Pitfalls to avoid

Don’t use a single global threshold. Different query types have different tolerance for errors. Tune thresholds per category.

Don’t skip the embedding step on cache hits. You might be tempted to skip embedding overhead when returning cached responses, but you need the embedding for cache key generation. The overhead is unavoidable.

Don’t forget invalidation. Semantic caching without invalidation strategy leads to stale responses that erode user trust. Build invalidation from day one.

Don’t cache everything. Some queries shouldn’t be cached: Personalized responses, time-sensitive information, transactional confirmations. Build exclusion rules.

def should_cache(self, query: str, response: str) -> bool:
    """Determine if response should be cached."""
    # Don't cache personalized responses
    if self.contains_personal_info(response):
        return False
    # Don't cache time-sensitive information
    if self.is_time_sensitive(query):
        return False
    # Don't cache transactional confirmations
    if self.is_transactional(query):
        return False
    return True

Key takeaways

Semantic caching is a practical pattern for LLM cost control that captures redundancy exact-match caching misses. The key challenges are threshold tuning (use query-type-specific thresholds based on precision/recall analysis) and cache invalidation (combine TTL, event-based and staleness detection).

At 73% cost reduction, this was our highest-ROI optimization for production LLM systems. The implementation complexity is moderate, but the threshold tuning requires careful attention to avoid quality degradation.

Sreenivasa Reddy Hulebeedu Reddy is a lead software engineer.

Nvidia’s Vera Rubin is months away — Blackwell is getting faster right now

The big news this week from Nvidia, splashed in headlines across all forms of media, was the company’s announcement about its Vera Rubin GPU.

This week, Nvidia CEO Jensen Huang used his CES keynote to highlight performance metrics for the new chip. According to Huang, the Rubin GPU is capable of 50 PFLOPs of NVFP4 inference and 35 PFLOPs of NVFP4 training performance, representing 5x and 3.5x the performance of Blackwell.

But it won’t be available until the second half of 2026. So what should enterprises be doing now?

Blackwell keeps on getting better

The current, shipping Nvidia GPU architecture is Blackwell, which was announced in 2024 as the successor to Hopper. Alongside that release, Nvidia emphasized that its product engineering path also included squeezing as much performance as possible out of the prior Grace Hopper architecture.

It’s a direction that will hold true for Blackwell as well, with Vera Rubin coming later this year.

“We continue to optimize our inference and training stacks for the Blackwell architecture,” Dave Salvator, director of accelerated computing products at Nvidia, told VentureBeat.

In the same week that Vera Rubin was being touted by Nvidia’s CEO as its most powerful GPU ever, the company published new research showing improved Blackwell performance.

How Blackwell inference performance has improved by 2.8x

Nvidia has increased Blackwell inference performance by up to 2.8x per GPU in just three months.

The performance gains come from a series of innovations that have been added to the Nvidia TensorRT-LLM inference engine. These optimizations apply to existing hardware, allowing current Blackwell deployments to achieve higher throughput without hardware changes.

The performance gains are measured on DeepSeek-R1, a 671-billion parameter mixture-of-experts (MoE) model that activates 37 billion parameters per token.

Among the technical innovations that provide the performance boost:

  • Programmatic dependent launch (PDL): Expanded implementation reduces kernel launch latencies, increasing throughput.

  • All-to-all communication: New implementation of communication primitives eliminates an intermediate buffer, reducing memory overhead.

  • Multi-token prediction (MTP): Generates multiple tokens per forward pass rather than one at a time, increasing throughput across various sequence lengths.

  • NVFP4 format: A 4-bit floating point format with hardware acceleration in Blackwell that reduces memory bandwidth requirements while preserving model accuracy.

The optimizations reduce cost per million tokens and allow existing infrastructure to serve higher request volumes at lower latency. Cloud providers and enterprises can scale their AI services without immediate hardware upgrades.
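The link between throughput and cost per million tokens is simple arithmetic: at a fixed hourly price, cost per token falls in inverse proportion to tokens per second. The sketch below uses a made-up GPU price and baseline throughput purely for illustration; only the 2.8x multiplier comes from the reporting above, and it is an "up to" figure.

```python
def cost_per_million_tokens(gpu_hour_usd, tokens_per_second):
    """Serving cost per one million generated tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs: neither number is from Nvidia or the article
GPU_HOUR_USD = 10.0
BASELINE_TOKENS_PER_SECOND = 1000.0

baseline_cost = cost_per_million_tokens(GPU_HOUR_USD, BASELINE_TOKENS_PER_SECOND)
optimized_cost = cost_per_million_tokens(GPU_HOUR_USD, BASELINE_TOKENS_PER_SECOND * 2.8)
cost_ratio = optimized_cost / baseline_cost
```

Whatever the starting price and throughput, `cost_ratio` comes out to 1/2.8, or about 36% of the original cost per million tokens, if the full 2.8x gain is realized.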

Blackwell has also made training performance gains 

Blackwell is also widely used as a foundational hardware component for training the largest of large language models.

In that respect, Nvidia has also reported significant gains for Blackwell when used for AI training. 

Since its initial launch, the GB200 NVL72 system delivered up to 1.4x higher training performance on the same hardware — a 40% boost achieved in just five months without any hardware upgrades.

The training boost came from a series of updates including:

  • Optimized training recipes. Nvidia engineers developed sophisticated training recipes that effectively leverage NVFP4 precision. Initial Blackwell submissions used FP8 precision, but the transition to NVFP4-optimized recipes unlocked substantial additional performance from the existing silicon.

  • Algorithmic refinements. Continuous software stack enhancements and algorithmic improvements enabled the platform to extract more performance from the same hardware, demonstrating ongoing innovation beyond initial deployment.

Double-down on Blackwell or wait for Vera Rubin?

Salvator noted that the high-end Blackwell Ultra is a market-leading platform purpose-built to run state-of-the-art AI models and applications. 

He added that the Nvidia Rubin platform will extend the company’s market leadership and enable the next generation of MoEs to power a new class of applications to take AI innovation even further.

Salvator explained that Vera Rubin is built to address the growing demand for compute created by the continuing growth in model size and in reasoning token generation from leading model architectures such as MoE.

“Blackwell and Rubin can serve the same models, but the difference is the performance, efficiency and token cost,” he said.

According to Nvidia’s early testing results, compared to Blackwell, Rubin can train large MoE models with a quarter the number of GPUs, generate inference tokens with 10x more throughput per watt, and run inference at one-tenth the cost per token.

“Better token throughput performance and efficiency, means newer models can be built with more reasoning capability and faster agent-to-agent interaction, creating better intelligence at lower cost,” Salvator said.

What it all means for enterprise AI builders

For enterprises deploying AI infrastructure today, current investments in Blackwell remain sound despite Vera Rubin’s arrival later this year.

Organizations with existing Blackwell deployments can capture up to the 2.8x inference improvement by updating to the latest TensorRT-LLM versions, and up to the 1.4x training boost by adopting Nvidia’s updated training recipes and software stack, delivering real cost savings without capital expenditure. For those planning new deployments in the first half of 2026, proceeding with Blackwell makes sense. Waiting six months means delaying AI initiatives and potentially falling behind competitors already deploying today.

However, enterprises planning large-scale infrastructure buildouts for late 2026 and beyond should factor Vera Rubin into their roadmaps. The 10x improvement in throughput per watt and 1/10th cost per token represent transformational economics for AI operations at scale.

The smart approach is phased deployment: Leverage Blackwell for immediate needs while architecting systems that can incorporate Vera Rubin when available. Nvidia’s continuous optimization model means this isn’t a binary choice; enterprises can maximize value from current deployments without sacrificing long-term competitiveness.

Why AI feels generic: Replit CEO on slop, toys, and the missing ingredient of taste

Right now in the AI world, there are a lot of percolating ideas and experimentation. But as far as Replit CEO Amjad Masad is concerned, they’re just “toys”: unreliable, marginally effective, and generic. 

“There’s a lot of sameness out there,” Masad explains in a new VB Beyond the Pilot podcast. “Everything kind of looks the same, all the images, all the code, everything.”

This “slop,” as it’s come to be known, is not only the result of lazy one-shot prompting, but a lack of individual flavor. 

“The way to overcome slop is for the platform to expend more effort and for the developers of the platform to imbue the agent with taste,” Masad says.

How Replit overcomes being generic

Replit tackles the slop problem through a mix of specialized prompting, classification features built into its design systems, and proprietary RAG techniques. The team also isn’t hesitant to use more tokens; this results in higher-quality inputs, Masad notes. 

Ongoing testing is also critical. After the first generation of an app, Masad’s team kicks the result off to a testing agent, which analyzes all its features, then reports back to a coding agent about what worked (and didn’t). “If you introduce testing in the loop, you can give the model feedback and have the model reflect on its work,” Masad says. 

Pitting models against one another is another of Replit’s strategies: Testing agents may be built on one LLM, coding agents on another. This capitalizes on their different knowledge distributions. “That way the product you’re giving to the customer is high effort and less sloppy,” Masad says. “You generate more variety.” 

Ultimately, he describes a “push and pull” between what the model can actually do and what teams need to build on top of it to add value. Also, “if you wanna move fast and you wanna ship things, you need to throw away a lot of code,” he says. 

Why vibe coding is the future 

There’s still a lot of frustration around AI because, Masad acknowledges, it isn’t living up to the intense hype. Chatbots are well-established but they offer a “marginal improvement” in workflows. 

Vibe coding is beginning to take off partly because it’s the best way for companies to adopt AI in an impactful way, he notes. It can “make everyone in the enterprise the software engineer,” he says, allowing employees to solve problems and improve efficiency through automation, thus requiring less reliance on traditional SaaS tools. 

“I would say that the population of professional developers who studied computer science and trained as developers will shrink over time,” Masad says. On the flip side, the population of vibe coders who can solve problems with software and agents will grow “tremendously” over time. 

In the end, enterprises must fundamentally change how they think about software; traditional roadmaps are no longer relevant, Masad says. Because AI capabilities are evolving so dramatically, builders can only “roughly” estimate what things might look like months or even weeks into the future. 

Reflecting this reality, Replit’s team remains agile and isn’t hesitant to “drop everything” when a new model comes out to perform evals. “It’ll ebb and flow,” Masad contends. “You need to be very zen about it and not have an ego about it.” 

Listen to the full podcast to hear about: 

  • The “squishy” divide in AI intelligence that impedes specialization;

  • The cathedral versus bazaar debate in open source — and why a “cathedral made of bazaars” may be the best path to collective innovation;

  • How Replit “forks” the development environment to create isolated sandboxes for experimentation; 

  • The importance of context compression; 

  • What really defines AI agents: They don’t just retrieve information; they work autonomously, repeatedly, without human intervention.  

Subscribe to Beyond the Pilot on Apple Podcasts, Spotify and YouTube

New ‘Test-Time Training’ method lets AI keep learning without exploding inference costs

A new study from researchers at Stanford University and Nvidia proposes a way for AI models to keep learning after deployment — without increasing inference costs. For enterprise agents that have to digest long docs, tickets, and logs, this is a bid to get “long memory” without paying attention costs that grow with context length.

The approach, called “End-to-End Test-Time Training” (TTT-E2E), reframes language modeling as a continual learning problem: Instead of memorizing facts during pre-training, models learn how to adapt in real time as they process new information.

The result is a Transformer that can match long-context accuracy of full attention models while running at near-RNN efficiency — a potential breakthrough for enterprise workloads where context length is colliding with cost.

The accuracy-efficiency trade-off

For developers building AI systems for long-document tasks, the choice of model architecture often involves a painful trade-off between accuracy and efficiency.

On one side are Transformers with full self-attention, currently the gold standard for accuracy. They are designed to scan through the keys and values of all previous tokens for every new token generated, providing them with lossless recall. However, this precision comes at a steep cost: The computational cost per token grows significantly with context length.

On the other side are linear-time sequence models, which keep inference costs constant but struggle to retain information over very long contexts.

Other approaches try to split the difference — sliding-window attention, hybrids that mix attention with recurrence, and other efficiency tricks — but they still tend to fall short of full attention on hard language modeling.
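A rough way to see the cost side of this trade-off is to count how many cached key/value entries each new token has to scan. The sketch below is a back-of-the-envelope model, not a benchmark; the function names and the 8,000-token window are illustrative assumptions.

```python
def keys_scanned_per_token(position, window=None):
    """Cached key/value entries one new token attends to at a 1-indexed position.

    window=None models full attention; an integer models sliding-window attention.
    """
    if window is None:
        return position
    return min(position, window)


def total_keys_scanned(context_len, window=None):
    """Total entries scanned while processing a whole context token by token."""
    return sum(keys_scanned_per_token(t, window) for t in range(1, context_len + 1))


full_at_128k = keys_scanned_per_token(128_000)
windowed_at_128k = keys_scanned_per_token(128_000, window=8_000)
```

At token 128,000, full attention scans 128,000 entries while the windowed model scans 8,000, a 16x gap that keeps widening as the context grows. The flip side is that everything outside the window is invisible to the windowed model unless it has been stored somewhere else.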

The researchers’ bet is that the missing ingredient is compression: Instead of trying to recall every token exactly, models should distill what matters into a compact state.

Test-Time Training

The core innovation of the paper is the application of Test-Time Training (TTT) to language modeling. This transforms the model from a static database into a flexible learner.

In standard AI deployment, models are trained to minimize loss and then deployed as frozen artifacts. If you try to make a static model learn during deployment, it typically performs poorly because it was never trained to update itself efficiently.

The researchers solve this by shifting from standard pre-training (teaching the model facts) to meta-learning (teaching the model how to learn). The goal is to optimize the model’s “initialization” so that it can absorb new information rapidly when it goes live.

The process involves simulating inference-time learning during the training phase:

  • Inner loop (learn): During training, the model treats text as a stream and performs small, temporary updates as it predicts the next token — simulating how it would adapt at inference.

  • Outer loop (teach it to learn): The system then updates the model’s initialization so the next round of streaming adaptation becomes faster and more accurate.
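The two loops can be illustrated with a deliberately tiny toy, which is not the TTT-E2E method itself. It assumes a one-parameter model `y = w * x`, treats each "document" as a short stream of `(x, y)` pairs with its own slope, and uses finite differences for the outer gradient where a real system would presumably backpropagate through the inner updates. All names and numbers are invented for illustration.

```python
def streaming_loss(w_init, stream, inner_lr=0.1):
    """Inner loop: predict each sample, then take one small gradient step."""
    w = w_init
    total = 0.0
    for x, y in stream:
        error = w * x - y
        total += error ** 2            # loss measured before adapting to this sample
        w -= inner_lr * 2 * error * x  # temporary update, discarded after the stream
    return total


def meta_loss(w_init, streams):
    """Average streaming loss across documents for a given initialization."""
    return sum(streaming_loss(w_init, s) for s in streams) / len(streams)


def meta_train(streams, w_init=0.0, outer_lr=0.1, steps=50, eps=1e-4):
    """Outer loop: move the initialization so streaming adaptation starts better."""
    for _ in range(steps):
        grad = (meta_loss(w_init + eps, streams) - meta_loss(w_init - eps, streams)) / (2 * eps)
        w_init -= outer_lr * grad
    return w_init


xs = [1.0, 0.5, 1.0, 0.5]
streams = [[(x, slope * x) for x in xs] for slope in (1.5, 2.0, 2.5)]

loss_before = meta_loss(0.0, streams)
learned_init = meta_train(streams)
loss_after = meta_loss(learned_init, streams)
```

The outer loop settles on an initialization near 2.0, the middle of the three slopes, and the average streaming loss drops sharply compared with starting from zero. The model still adapts inside each stream; meta-training only chooses the starting point from which that adaptation is cheapest.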

While the idea of a model changing its weights during deployment might sound risky to reliability-focused enterprise leaders, co-author Yu Sun argues it is mathematically safer than it appears.

“You should think of the model as an RNN with a huge hidden state,” Sun says. He notes that if an enterprise feels safe deploying standard Transformers or RNNs, the stability profile of TTT is comparable.

Dual-memory architecture

To implement TTT-E2E, the researchers modified the standard Transformer architecture to support this new learning paradigm, creating a hierarchy that separates cheap short-term context handling from selective long-term memory updates.

  1. The model uses Sliding Window Attention rather than full attention. This acts as the model’s “working memory,” looking back only at a fixed window of recent tokens to handle immediate syntax and local references. This ensures the cost of processing a new token remains constant rather than growing as the context expands.

  2. The model employs “targeted weight updates.” While standard models have completely frozen weights during use, TTT-E2E designates specific sections (Multi-Layer Perceptron layers in the final 25% of the model’s blocks) to be mutable.

  3. The architecture uses “dual-track storage” to prevent the model from forgetting its general training while learning a new document. Each updateable block contains two MLP components: one static layer that holds general pre-trained knowledge, and one dynamic layer that updates in real time to store the current document’s context.
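A short sketch of the first two design choices, under stated assumptions: the window size of 8,192 tokens and the 24-block depth are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
def sliding_window_positions(t, window):
    """Positions token t may attend to under causal sliding-window attention."""
    return range(max(0, t - window + 1), t + 1)

def full_attention_positions(t):
    """Positions token t may attend to under causal full attention."""
    return range(0, t + 1)

def mutable_blocks(n_blocks, fraction=0.25):
    """Indices of the final `fraction` of blocks, whose MLPs are allowed to update."""
    n_mutable = max(1, round(n_blocks * fraction))
    return list(range(n_blocks - n_mutable, n_blocks))

# Per-token attention work stays flat with a window, but grows with full attention.
for t in (10_000, 100_000):
    print(t, len(sliding_window_positions(t, 8192)), len(full_attention_positions(t)))

# In a hypothetical 24-block model, only the last six blocks would be mutable.
print(mutable_blocks(24))   # [18, 19, 20, 21, 22, 23]
```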

The innovation lies in how the model handles information that falls out of the sliding window. In a standard sliding window model, once a token slides out of view, it is forgotten. TTT-E2E prevents this via compression. As the window moves, the model uses next-token prediction to “compress” the passing information directly into the weights of the dynamic MLP layers. This consolidates the gist and facts of the earlier parts of the document into the model’s structure, serving as a long-term memory.
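The compression step can be illustrated with a deliberately tiny stand-in. In this sketch, a table of bigram logits plays the role of the dynamic MLP, and each token pair that slides out of the window triggers one gradient step on next-token cross-entropy. The class and function names are hypothetical, and the real model updates neural network weights rather than a lookup table.

```python
import math
from collections import defaultdict

class DynamicMemory:
    """Toy stand-in for a dynamic MLP: bigram logits updated by gradient steps."""
    def __init__(self, vocab, lr=0.5):
        self.vocab = list(vocab)
        self.lr = lr
        self.logits = defaultdict(lambda: {tok: 0.0 for tok in self.vocab})

    def probs(self, prev):
        """Softmax over next-token logits given the previous token."""
        row = self.logits[prev]
        top = max(row.values())
        exps = {tok: math.exp(v - top) for tok, v in row.items()}
        total = sum(exps.values())
        return {tok: e / total for tok, e in exps.items()}

    def update(self, prev, nxt):
        """One gradient step on cross-entropy for the pair (prev -> nxt)."""
        p = self.probs(prev)
        for tok in self.vocab:
            grad = p[tok] - (1.0 if tok == nxt else 0.0)
            self.logits[prev][tok] -= self.lr * grad

def stream_and_compress(tokens, window, memory):
    """Slide a fixed window over the tokens; compress each pair that leaves it."""
    for t in range(len(tokens)):
        leaving = t - window            # index of the token falling out of view
        if leaving >= 1:
            memory.update(tokens[leaving - 1], tokens[leaving])

vocab = ["alice", "likes", "tea", "filler"]
memory = DynamicMemory(vocab)
tokens = ["alice", "likes", "tea"] * 5 + ["filler"] * 10
stream_and_compress(tokens, window=4, memory=memory)

# The early pattern is long out of the window, yet the memory still favors it.
print(memory.probs("alice")["likes"] > 1 / len(vocab))   # True
```

Note what the toy keeps and what it loses: it retains the recurring pattern, not a verbatim copy of the early tokens.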

TTT-E2E in action

The headline result: TTT-E2E continues improving as context length grows — matching or outperforming full attention — while efficient baselines plateau after ~32,000 tokens.

To validate their approach, the researchers trained models ranging from 125 million to 3 billion parameters. They employed a two-stage training process: pre-training on 8,000-token contexts and fine-tuning on 128,000-token contexts. These models were tested against robust baselines, including Transformers with full attention, Transformers with Sliding Window Attention (SWA), hybrid models (Mamba 2 and Gated DeltaNet), and TTT-KVB (an earlier form of test-time training).

The results highlight a significant breakthrough in scaling. The most critical experiment tested performance as the input document grew from 8,000 to 128,000 tokens. The Full Attention Transformer, the gold standard, continued to improve its performance (lower loss) as the context grew. In contrast, efficient baselines like Mamba 2, Gated DeltaNet, and SWA hit a ceiling, with their performance degrading or flattening out after 32,000 tokens.

The new TTT-E2E method successfully scaled with context length, mimicking the behavior of Full Attention. In the experiments using 3B parameter models, TTT-E2E actually maintained a lower perplexity (better performance) than Full Attention throughout the context window.
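For readers unfamiliar with the metric, perplexity is simply the exponential of the average per-token loss, so “lower loss” and “lower perplexity” describe the same improvement. A minimal sketch, with a hypothetical helper name and natural-log probabilities assumed:

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.25 to every correct token is, in effect,
# choosing among four equally likely options: perplexity of about 4.
print(perplexity([math.log(0.25)] * 4))
```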

Critically, this performance did not come at the cost of speed. On inference latency, TTT-E2E matched the efficiency of RNNs. At a context length of 128k tokens, TTT-E2E was 2.7x faster than the Full-Attention Transformer on Nvidia H100 hardware.

Crucially for adoption, Sun notes that TTT models can be deployed for inference today on standard Transformer infrastructure to achieve these speedups. However, he cautions that the training side of the equation (specifically the outer loop) is currently more complex and slower than standard methods, representing a hurdle that still needs engineering optimization.

The benefits should become even more pronounced as data scales: Sun argues the advantage will widen further at million-token contexts, though those figures are projections rather than today’s benchmarked deployments.

However, the approach does have specific limitations rooted in its design philosophy. The researchers performed a “Needle in a Haystack” test, which requires the model to retrieve a specific, isolated piece of information (like a passcode) hidden in a large block of text. In this evaluation, Full Attention dramatically outperformed all other methods, including TTT-E2E.

This is because Full Attention relies on a cache that allows for nearly lossless recall of specific details, whereas TTT-E2E relies on compression. Compression captures the gist and core information well but may lose specific, random details that do not fit the learned patterns.

This distinction has major implications for enterprise data pipelines, specifically RAG. Sun suggests that TTT won’t make RAG obsolete but will redefine it. He likens TTT to “updating the human brain” with general knowledge, while RAG will remain a necessary tool for precision, “similar to how humans still need to write things down in a notepad.” For enterprise teams, the takeaway is that TTT reduces how often you need retrieval — but doesn’t eliminate the need for exact external memory.

While the technique was demonstrated on the Transformer architecture, the researchers note that “in principle, TTT can be applied to any baseline architecture” that allows for a separation of long-term and short-term memory components.

“We believe that these two classes of memory will continue to complement each other,” the researchers concluded. 

Looking ahead, Sun predicts a paradigm shift where the primary form of AI memory will be highly compressed rather than exact. While models will retain a “reasonable” perfect-recall window of around 128,000 tokens, he believes TTT architectures will eventually unlock a “compressed memory of billions of tokens,” fundamentally changing how enterprise agents balance recall, cost, and context length.

Nvidia just admitted the general-purpose GPU era is ending

Nvidia’s $20 billion strategic licensing deal with Groq represents one of the first clear moves in a four-front fight over the future AI stack. 2026 is when that fight becomes obvious to enterprise builders.

For the technical decision-makers we talk to every day — the people building the AI applications and the data pipelines that drive them — this deal is a signal that the era of the one-size-fits-all GPU as the default AI inference answer is ending.

We are entering the age of the disaggregated inference architecture, where the silicon itself is being split into two different types to accommodate a world that demands both massive context and instantaneous reasoning.

Why inference is breaking the GPU architecture in two

To understand why Nvidia CEO Jensen Huang dropped one-third of his reported $60 billion cash pile on a licensing deal, you have to look at the existential threats converging on his company’s reported 92% market share.

The industry reached a tipping point in late 2025: For the first time, inference — the phase where trained models actually run — surpassed training in terms of total data center revenue, according to Deloitte. In this new “Inference Flip,” the metrics have changed. While accuracy remains the baseline, the battle is now being fought over latency and the ability to maintain “state” in autonomous agents.

There are four fronts of that battle, and each front points to the same conclusion: Inference workloads are fragmenting faster than GPUs can generalize.

1. Breaking the GPU in two: Prefill vs. decode

Gavin Baker, an investor in Groq (and therefore biased, but also unusually fluent on the architecture), summarized the core driver of the Groq deal cleanly: “Inference is disaggregating into prefill and decode.”

Prefill and decode are two distinct phases:

  • The prefill phase: Think of this as the user’s “prompt” stage. The model must ingest massive amounts of data — whether it’s a 100,000-line codebase or an hour of video — and compute a contextual understanding. This is “compute-bound,” requiring massive matrix multiplication that Nvidia’s GPUs are historically excellent at.

  • The generation (decode) phase: This is the actual token-by-token “generation.” Once the prompt is ingested, the model generates one word (or token) at a time, feeding each one back into the system to predict the next. This is “memory-bandwidth bound.” If the data can’t move from the memory to the processor fast enough, the model stutters, no matter how powerful the GPU is. (This is where Nvidia was weak, and where Groq’s specialized language processing unit (LPU), with its on-chip SRAM, shines. More on that in a bit.)
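A back-of-the-envelope calculation shows why the two phases stress different parts of the chip. The sketch below uses a common rule of thumb (roughly 2 floating-point operations per parameter per token, with the weights read from memory once per forward pass) and ignores attention, KV cache traffic and batching. The hardware figures in the usage lines are hypothetical round numbers, not the specifications of any real chip.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2):
    """FLOPs per byte of weights read, for one forward pass.

    Assumes ~2 FLOPs per parameter per token and one read of the weights
    per pass, so the parameter count cancels out of the ratio.
    """
    flops_per_param = 2 * tokens_per_pass
    return flops_per_param / bytes_per_param

def is_memory_bound(intensity, peak_flops_per_s, bandwidth_bytes_per_s):
    """True when the chip can compute faster than memory can feed it."""
    ridge = peak_flops_per_s / bandwidth_bytes_per_s
    return intensity < ridge

prefill = arithmetic_intensity(tokens_per_pass=100_000)   # whole prompt in one pass
decode = arithmetic_intensity(tokens_per_pass=1)          # one new token per pass

# Hypothetical accelerator: 1e15 FLOP/s of compute, 3e12 bytes/s of memory bandwidth.
print(prefill, is_memory_bound(prefill, 1e15, 3e12))   # 100000.0 False
print(decode, is_memory_bound(decode, 1e15, 3e12))     # 1.0 True
```

Under these assumptions, prefill performs about 100,000 times more math per byte fetched than decode does, which is the gap that makes one phase compute-bound and the other bandwidth-bound.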

Nvidia has announced an upcoming Vera Rubin family of chips that it’s architecting specifically to handle this split. The Rubin CPX component of this family is the designated “prefill” workhorse, optimized for massive context windows of 1 million tokens or more. To handle this scale affordably, it moves away from the eye-watering expense of high bandwidth memory (HBM) — Nvidia’s current gold-standard memory that sits right next to the GPU die — and instead utilizes 128GB of a new kind of memory, GDDR7. While HBM provides extreme speed (though not as quick as Groq’s static random-access memory (SRAM)), its supply on GPUs is limited and its cost is a barrier to scale; GDDR7 provides a more cost-effective way to ingest massive datasets.

Meanwhile, the “Groq-flavored” silicon, which Nvidia is integrating into its inference roadmap, will serve as the high-speed “decode” engine. This is about neutralizing a threat from alternative architectures like Google’s TPUs and maintaining the dominance of CUDA, Nvidia’s software ecosystem that has served as its primary moat for over a decade.

All of this was enough for Baker, the Groq investor, to predict that Nvidia’s move to license Groq will cause all other specialized AI chips to be canceled — that is, outside of Google’s TPU, Tesla’s AI5, and AWS’s Trainium.

2. The differentiated power of SRAM

At the heart of Groq’s technology is SRAM. Unlike the DRAM found in your PC or the HBM on an Nvidia H100 GPU, SRAM is etched directly into the logic of the processor.

Michael Stewart, managing partner of Microsoft’s venture fund, M12, describes SRAM as the best for moving data over short distances with minimal energy. “The energy to move a bit in SRAM is like 0.1 picojoules or less,” Stewart said. “To move it between DRAM and the processor is more like 20 to 100 times worse.”
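Stewart’s figures can be turned into a rough energy budget. The sketch below assumes a hypothetical model of 8 billion parameters stored at 1 byte each, and asks what it costs to move every weight once. It takes his 0.1 picojoule figure as the SRAM cost and applies his 20x to 100x multiplier for DRAM.

```python
JOULES_PER_PICOJOULE = 1e-12

def move_energy_joules(n_bytes, pj_per_bit):
    """Energy to move n_bytes once, at a given cost per bit in picojoules."""
    return n_bytes * 8 * pj_per_bit * JOULES_PER_PICOJOULE

weights_bytes = 8e9 * 1   # hypothetical: 8 billion parameters at 1 byte each

sram = move_energy_joules(weights_bytes, pj_per_bit=0.1)
dram_low, dram_high = sram * 20, sram * 100

print(f"SRAM: {sram * 1e3:.1f} mJ per full pass over the weights")
print(f"DRAM: {dram_low * 1e3:.0f} to {dram_high * 1e3:.0f} mJ per full pass")
```

That works out to roughly 6.4 millijoules for SRAM versus 128 to 640 millijoules for DRAM per pass, a gap that compounds quickly when the weights are read again for every generated token.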

In the world of 2026, where agents must reason in real-time, SRAM acts as the ultimate “scratchpad”: a high-speed workspace where the model can manipulate symbolic operations and complex reasoning processes without the “wasted cycles” of external memory shuttling.

However, SRAM has a major drawback: it is physically bulky and expensive to manufacture, meaning its capacity is limited compared to DRAM. This is where Val Bercovici, chief AI officer at Weka, another company offering memory for GPUs, sees the market segmenting.

Groq-friendly AI workloads — where SRAM has the advantage — are those that use small models of 8 billion parameters and below, Bercovici said. This isn’t a small market, though. “It’s just a giant market segment that was not served by Nvidia, which was edge inference, low latency, robotics, voice, IoT devices — things we want running on our phones without the cloud for convenience, performance, or privacy,” he said.

This 8B “sweet spot” is significant because 2025 saw an explosion in model distillation, where many enterprise companies are shrinking massive models into highly efficient smaller versions. While SRAM isn’t practical for the trillion-parameter “frontier” models, it is perfect for these smaller, high-velocity models.

3. The Anthropic threat: The rise of the ‘portable stack’

Perhaps the most under-appreciated driver of this deal is Anthropic’s success in making its stack portable across accelerators.

The company has pioneered a portable engineering approach for training and inference: a software layer that allows its Claude models to run across multiple AI accelerator families, including Nvidia’s GPUs and Google’s Ironwood TPUs. Until recently, Nvidia’s dominance was protected because running high-performance models outside of the Nvidia stack was a technical nightmare. “It’s Anthropic,” Weka’s Bercovici told me. “The fact that Anthropic was able to … build up a software stack that could work on TPUs as well as on GPUs, I don’t think that’s being appreciated enough in the marketplace.”

(Disclosure: Weka has been a sponsor of VentureBeat events.)

Anthropic recently committed to accessing up to 1 million TPUs from Google, representing over a gigawatt of compute capacity. This multi-platform approach ensures the company isn’t held hostage by Nvidia’s pricing or supply constraints. So for Nvidia, the Groq deal is equally a defensive move. By integrating Groq’s ultra-fast inference IP, Nvidia is making sure that the most performance-sensitive workloads, such as those running small models or serving real-time agents, can be accommodated within Nvidia’s CUDA ecosystem, even as competitors try to jump ship to Google’s Ironwood TPUs. (CUDA is the software platform Nvidia provides to developers for programming its GPUs.)

4. The agentic ‘statehood’ war: Manus and the KV Cache

The timing of this Groq deal coincides with Meta’s acquisition of the agent pioneer Manus just two days ago. The significance of Manus was partly its obsession with statefulness.

If an agent can’t remember what it did 10 steps ago, it is useless for real-world tasks like market research or software development. KV Cache (Key-Value Cache) is the “short-term memory” that an LLM builds during the prefill phase.

Manus reported that for production-grade agents, the ratio of input tokens to output tokens can reach 100:1. This means for every word an agent says, it is “thinking” and “remembering” 100 others. In this environment, the KV Cache hit rate is the single most important metric for a production agent, Manus said. If that cache is “evicted” from memory, the agent loses its train of thought, and the model must burn massive energy to recompute the prompt.
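To see why eviction is so costly, it helps to size the cache. The sketch below uses the standard accounting (one key and one value vector per layer, per KV head, per token) on a hypothetical model configuration of 32 layers, 8 KV heads and a head dimension of 128, stored at 2 bytes per value. None of these numbers come from Manus or Nvidia.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold keys and values for n_tokens across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

def tokens_to_prefill(input_tokens, hit_rate):
    """Input tokens that must be recomputed because their cached state was missing."""
    return round(input_tokens * (1 - hit_rate))

config = dict(n_layers=32, n_kv_heads=8, head_dim=128)   # hypothetical model

per_token = kv_cache_bytes(1, **config)          # 131072 bytes, or 128 KiB
context = kv_cache_bytes(100_000, **config)      # about 13.1 GB

# At a 100:1 input-to-output ratio, 1,000 output tokens imply 100,000 input tokens.
for hit_rate in (0.0, 0.9, 0.99):
    print(hit_rate, tokens_to_prefill(100_000, hit_rate))
```

Under these assumptions, a single long-running agent session carries around 13 GB of state, and every point of cache hit rate lost translates directly into prompt tokens that have to be processed again.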

Groq’s SRAM can be a “scratchpad” for these agents (although, again, mostly for smaller models) because it allows for the near-instant retrieval of that state. By combining it with its Dynamo framework and the KV Block Manager (KVBM), Nvidia is building an “inference operating system” that enables inference servers to tier this state across SRAM, DRAM, HBM, and flash-based offerings like that from Bercovici’s Weka.
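The tiering idea can be sketched as a small data structure. This is a toy illustration only, not Dynamo’s or KVBM’s actual API: the class name, tier labels and capacities are hypothetical, and real systems move blocks of tensors rather than dictionary entries.

```python
from collections import OrderedDict

class TieredKVStore:
    """Toy multi-tier store: fastest tier first; overflow is demoted, hits are promoted."""
    def __init__(self, tiers):
        # tiers: list of (name, capacity) pairs ordered fastest to slowest
        self.tiers = [(name, cap, OrderedDict()) for name, cap in tiers]

    def put(self, key, value, level=0):
        if level >= len(self.tiers):
            return                       # fell off the slowest tier: state is evicted
        _, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)           # mark as most recently used
        if len(store) > cap:
            old_key, old_value = store.popitem(last=False)   # least recently used
            self.put(old_key, old_value, level + 1)          # demote one tier down

    def get(self, key):
        for name, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self.put(key, value)     # promote back to the fastest tier
                return value, name
        return None, None                # miss: the caller must recompute the prefill

store = TieredKVStore([("sram", 1), ("hbm", 1), ("flash", 1)])
for session in ("a", "b", "c", "d"):
    store.put(session, f"kv-state-{session}")

print(store.get("a"))   # (None, None): evicted from every tier
print(store.get("b"))   # ('kv-state-b', 'flash'): slow hit, then promoted
print(store.get("b"))   # ('kv-state-b', 'sram'): fast hit
```

The design choice worth noticing is that a slow hit is still far cheaper than a miss: state found in the bottom tier is copied up, while state that fell off the end has to be rebuilt from the original prompt.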

Thomas Jorgensen, senior director of Technology Enablement at Supermicro, which specializes in building clusters of GPUs for large enterprise companies, told me in September that compute is no longer the primary bottleneck for advanced clusters. Feeding data to GPUs is the bottleneck, and breaking that bottleneck requires memory.

“The whole cluster is now the computer,” Jorgensen said. “Networking becomes an internal part of the beast … feeding the beast with data is becoming harder because the bandwidth between GPUs is growing faster than anything else.”

This is why Nvidia is pushing into disaggregated inference. By separating the workloads, enterprise applications can use specialized storage tiers to feed data at memory-class performance, while the specialized “Groq-inside” silicon handles the high-speed token generation.

The verdict for 2026

We are entering an era of extreme specialization. For decades, incumbents could win by shipping one dominant general-purpose architecture — and their blind spot was often what they ignored on the edges. Intel’s long neglect of low-power is the classic example, Michael Stewart, managing partner of Microsoft’s venture fund M12, told me. Nvidia is signaling it won’t repeat that mistake. “If even the leader, even the lion of the jungle will acquire talent, will acquire technology — it’s a sign that the whole market is just wanting more options,” Stewart said.

For technical leaders, the message is to stop architecting your stack like it’s one rack, one accelerator, one answer. In 2026, advantage will go to the teams that label workloads explicitly — and route them to the right tier:

  • prefill-heavy vs. decode-heavy

  • long-context vs. short-context

  • interactive vs. batch

  • small-model vs. large-model

  • edge constraints vs. data-center assumptions
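In practice, labeling can start as something as simple as a routing function. The sketch below is one hypothetical way to encode the labels above. The tier names and the 20:1 prompt-to-output threshold are invented for illustration, and the 8-billion-parameter cutoff borrows Bercovici’s rule of thumb rather than any vendor specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Workload:
    prompt_tokens: int
    output_tokens: int
    interactive: bool
    model_params_b: float    # model size in billions of parameters
    on_edge: bool = False

def route(w: Workload) -> str:
    """Pick a (hypothetical) hardware tier from explicit workload labels."""
    if w.on_edge:
        return "edge-accelerator"      # edge constraints override everything else
    if w.model_params_b <= 8 and w.interactive:
        return "sram-decode"           # small model, latency-sensitive
    if w.prompt_tokens >= 20 * max(w.output_tokens, 1):
        return "prefill-optimized"     # long context, relatively little generation
    if not w.interactive:
        return "batch-gpu"             # throughput matters more than latency
    return "general-gpu"

print(route(Workload(2_000, 300, interactive=True, model_params_b=8)))         # sram-decode
print(route(Workload(500_000, 2_000, interactive=False, model_params_b=70)))   # prefill-optimized
print(route(Workload(1_000, 1_000, interactive=False, model_params_b=70)))     # batch-gpu
```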

Your architecture will follow those labels. In 2026, “GPU strategy” stops being a purchasing decision and becomes a routing decision. The winners won’t ask which chip they bought — they’ll ask where every token ran, and why.