Conversational AI doesn’t understand users — ‘Intent First’ architecture does

The modern customer has just one need that matters: getting the thing they want when they want it. The standard RAG pattern (embed, retrieve, pass to an LLM) misunderstands intent, overloads context and misses freshness, repeatedly sending customers down the wrong paths.

Instead, intent-first architecture uses a lightweight language model to parse the query for intent and context, before delivering to the most relevant content sources (documents, APIs, people).

Enterprise AI is a speeding train headed for a cliff. Organizations are deploying LLM-powered search applications at a record pace, while a fundamental architectural issue is setting most up for failure.

A recent Coveo study revealed that 72% of enterprise search queries fail to deliver meaningful results on the first attempt, while Gartner also predicts that the majority of conversational AI deployments have been falling short of enterprise expectations.

The problem isn’t the underlying models. It’s the architecture around them.

After designing and running live AI-driven customer interaction platforms at scale, serving millions of customer and citizen users at some of the world’s largest telecommunications and healthcare organizations, I’ve come to see a pattern. It’s the difference between successful AI-powered interaction deployments and multi-million-dollar failures.

It’s a cloud-native architecture pattern that I call Intent-First. And it’s reshaping the way enterprises build AI-powered experiences.

The $36 billion problem

Gartner projects the global conversational AI market will balloon to $36 billion by 2032. Enterprises are scrambling to get a slice. The demos are irresistible. Plug your LLM into your knowledge base, and suddenly it can answer customer questions in natural language. Magic.

Then production happens. 

A major telecommunications provider I work with rolled out a RAG system with the expectation of driving down the support call rate. Instead, the rate increased. Callers tried AI-powered search, were provided incorrect answers with a high degree of confidence and called customer support angrier than before.

This pattern is repeated over and over. In healthcare, customer-facing AI assistants are providing patients with formulary information that’s outdated by weeks or months. Financial services chatbots are spitting out answers from both retail and institutional product content. Retailers are seeing discontinued products surface in product searches.

The issue isn’t a failure of AI technology. It’s a failure of architecture.

Why standard RAG architectures fail 

The standard RAG pattern (embedding the query, retrieving semantically similar content, passing it to an LLM) works beautifully in demos and proofs of concept. But it falls apart in production use cases for three systematic reasons:

1. The intent gap

Intent is not context. But standard RAG architectures don’t account for this.

Say a customer types “I want to cancel.” What does that mean? Cancel a service? Cancel an order? Cancel an appointment? During our telecommunications deployment, we found that 65% of queries for “cancel” were actually about orders or appointments, not service cancellation. The RAG system had no way of understanding this intent, so it consistently returned service cancellation documents.

Intent matters. In healthcare, a patient who types “I need to cancel” may be trying to cancel an appointment, a prescription refill or a procedure. Routing them to medication content when they need scheduling is not only frustrating, it’s also dangerous.

2. Context flood 

Enterprise knowledge and experience is vast, spanning dozens of sources such as product catalogs, billing, support articles, policies, promotions and account data. Standard RAG models treat all of it the same, searching every source for every query.

When a customer asks “How do I activate my new phone,” they don’t care about billing FAQs, store locations or network status updates. But a standard RAG model retrieves semantically similar content from every source, returning search results that are a half-step off the mark.

3. Freshness blindspot 

Vector space is time-blind. Semantically, last quarter’s promotion is identical to this quarter’s. But presenting customers with outdated offers shatters trust. We linked a significant percentage of customer complaints to search results that surfaced expired products, offers, or features.

The Intent-First architecture pattern 

The Intent-First architecture pattern is the mirror image of the standard RAG deployment. In the RAG model, you retrieve, then route. In the Intent-First model, you classify before you route or retrieve.

Intent-First architectures use a lightweight language model to parse a query for intent and context, before dispatching to the most relevant content sources (documents, APIs, agents).

Comparison: Intent-first vs standard RAG

Cloud-native implementation

The Intent-First pattern is designed for cloud-native deployment, leveraging microservices, containerization and elastic scaling to handle enterprise traffic patterns.

Intent classification service

The classifier determines user intent before any retrieval occurs:

ALGORITHM: Intent Classification
INPUT: user_query (string)
OUTPUT: intent_result (object)

1. PREPROCESS query (normalize, expand contractions)
2. CLASSIFY using transformer model:
   – primary_intent ← model.predict(query)
   – confidence ← model.confidence_score()
3. IF confidence < 0.70 THEN
   – RETURN {
       requires_clarification: true,
       suggested_question: generate_clarifying_question(query)
     }
4. EXTRACT sub_intent based on primary_intent:
   – IF primary = “ACCOUNT” → check for ORDER_STATUS, PROFILE, etc.
   – IF primary = “SUPPORT” → check for DEVICE_ISSUE, NETWORK, etc.
   – IF primary = “BILLING” → check for PAYMENT, DISPUTE, etc.
5. DETERMINE target_sources based on intent mapping:
   – ORDER_STATUS → [orders_db, order_faq]
   – DEVICE_ISSUE → [troubleshooting_kb, device_guides]
   – MEDICATION → [formulary, clinical_docs] (healthcare)
6. RETURN {
     primary_intent,
     sub_intent,
     confidence,
     target_sources,
     requires_personalization: true/false
   }
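The classification steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production classifier: the transformer model is replaced by a hypothetical keyword counter (`predict_intent`) so the example runs without dependencies, and the keyword tables, the clarifying question and the rule that only ACCOUNT intents need personalization are invented for illustration. Only the 0.70 threshold and the source mapping come from the pseudocode.

```python
CONFIDENCE_THRESHOLD = 0.70  # threshold from step 3 of the pseudocode

# Hypothetical keyword tables standing in for a trained transformer classifier.
PRIMARY_KEYWORDS = {
    "ACCOUNT": ["order", "profile", "account"],
    "SUPPORT": ["phone", "device", "network", "activate"],
    "BILLING": ["bill", "payment", "charge", "dispute"],
}
SUB_INTENT_KEYWORDS = {
    "ACCOUNT": {"ORDER_STATUS": ["order"], "PROFILE": ["profile"]},
    "SUPPORT": {"DEVICE_ISSUE": ["phone", "device", "activate"], "NETWORK": ["network"]},
    "BILLING": {"PAYMENT": ["payment", "bill"], "DISPUTE": ["dispute", "charge"]},
}
# Source mapping taken from step 5 of the pseudocode.
TARGET_SOURCES = {
    "ORDER_STATUS": ["orders_db", "order_faq"],
    "DEVICE_ISSUE": ["troubleshooting_kb", "device_guides"],
}
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am"}


def preprocess(query: str) -> str:
    """Step 1: normalize case and apostrophes, then expand contractions."""
    text = query.lower().strip().replace("\u2019", "'")
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text


def predict_intent(text: str) -> tuple[str, float]:
    """Step 2 stand-in: share of keyword hits won by the best primary intent."""
    hits = {
        intent: sum(word in text for word in words)
        for intent, words in PRIMARY_KEYWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return "UNKNOWN", 0.0
    best = max(hits, key=hits.get)
    return best, hits[best] / total


def classify(query: str) -> dict:
    text = preprocess(query)
    primary, confidence = predict_intent(text)
    if confidence < CONFIDENCE_THRESHOLD:  # step 3: ask instead of guessing
        return {
            "requires_clarification": True,
            "suggested_question": "Is this about an order, a device or a bill?",
        }
    sub_intent = next(  # step 4: first sub-intent whose keywords appear
        (sub for sub, words in SUB_INTENT_KEYWORDS[primary].items()
         if any(word in text for word in words)),
        None,
    )
    return {  # steps 5 and 6
        "primary_intent": primary,
        "sub_intent": sub_intent,
        "confidence": confidence,
        "target_sources": TARGET_SOURCES.get(sub_intent, []),
        "requires_personalization": primary == "ACCOUNT",  # assumed rule
    }
```

With these tables, `classify("Where is my order?")` resolves to ORDER_STATUS and targets only `orders_db` and `order_faq`, while the ambiguous `classify("I want to cancel")` returns a clarifying question rather than a guess.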

Context-aware retrieval service

Once intent is classified, retrieval becomes targeted:

ALGORITHM: Context-Aware Retrieval
INPUT: query, intent_result, user_context
OUTPUT: ranked_documents

1. GET source_config for intent_result.sub_intent:
   – primary_sources ← sources to search
   – excluded_sources ← sources to skip
   – freshness_days ← max content age
2. IF intent requires personalization AND user is authenticated:
   – FETCH account_context from Account Service
   – IF intent = ORDER_STATUS:
       – FETCH recent_orders (last 60 days)
       – ADD to results
3. BUILD search filters:
   – content_types ← primary_sources only
   – max_age ← freshness_days
   – user_context ← account_context (if available)
4. FOR EACH source IN primary_sources:
   – documents ← vector_search(query, source, filters)
   – ADD documents to results
5. SCORE each document:
   – relevance_score ← vector_similarity × 0.40
   – recency_score ← freshness_weight × 0.20
   – personalization_score ← user_match × 0.25
   – intent_match_score ← type_match × 0.15
   – total_score ← SUM of above
6. RANK by total_score descending
7. RETURN top 10 documents
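The filtering and scoring steps (3, 5, 6 and 7) can be sketched as follows. The weights are the ones in the pseudocode. Everything else is an assumption made for illustration: the `Document` fields, the linear decay used for `freshness_weight`, and the premise that vector similarity, user match and type match arrive as values between 0 and 1 from an upstream search step.

```python
from dataclasses import dataclass

# Weights from step 5 of the pseudocode; they sum to 1.0.
WEIGHTS = {"relevance": 0.40, "recency": 0.20, "personalization": 0.25, "intent_match": 0.15}


@dataclass
class Document:
    doc_id: str
    source: str
    age_days: int
    vector_similarity: float  # 0..1, assumed to come from the vector search
    user_match: float         # 0..1, assumed personalization signal
    type_match: float         # 0..1, assumed intent/content-type agreement


def freshness_weight(age_days: int, freshness_days: int) -> float:
    """Hypothetical linear decay: 1.0 for new content, 0.0 at the age limit."""
    return max(0.0, 1.0 - age_days / freshness_days)


def total_score(doc: Document, freshness_days: int) -> float:
    return (
        WEIGHTS["relevance"] * doc.vector_similarity
        + WEIGHTS["recency"] * freshness_weight(doc.age_days, freshness_days)
        + WEIGHTS["personalization"] * doc.user_match
        + WEIGHTS["intent_match"] * doc.type_match
    )


def rank(documents, primary_sources, freshness_days, top_k=10):
    """Drop off-intent and stale documents, then rank the rest by total score."""
    eligible = [
        d for d in documents
        if d.source in primary_sources and d.age_days <= freshness_days
    ]
    eligible.sort(key=lambda d: total_score(d, freshness_days), reverse=True)
    return eligible[:top_k]


docs = [
    Document("fresh_faq", "order_faq", 5, 0.80, 1.0, 1.0),
    Document("older_faq", "order_faq", 60, 0.90, 0.0, 1.0),
    Document("expired_promo", "order_faq", 200, 0.95, 1.0, 1.0),
    Document("billing_note", "billing_faq", 1, 0.99, 1.0, 1.0),
]
ranked = rank(docs, primary_sources={"orders_db", "order_faq"}, freshness_days=90)
print([d.doc_id for d in ranked])
```

In this sample the expired promotion and the billing note never reach the ranking stage even though their raw similarity is highest, which is the point of filtering by intent and age before scoring.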

Healthcare-specific considerations

In healthcare deployments, the Intent-First pattern includes additional safeguards:

Healthcare intent categories:

  • Clinical: Medication questions, symptoms, care instructions

  • Coverage: Benefits, prior authorization, formulary

  • Scheduling: Appointments, provider availability

  • Billing: Claims, payments, statements

  • Account: Profile, dependents, ID cards

Critical safeguard: Clinical queries always include disclaimers and never replace professional medical advice. The system routes complex clinical questions to human support.

Handling edge cases

The edge cases are where systems fail. The Intent-First pattern includes specific handlers:

Frustration detection keywords:

  • Anger: “terrible,” “worst,” “hate,” “ridiculous”

  • Time: “hours,” “days,” “still waiting”

  • Failure: “useless,” “no help,” “doesn’t work”

  • Escalation: “speak to human,” “real person,” “manager”

When frustration is detected, skip search entirely and route to human support.
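A minimal sketch of that handler, using the keyword lists above. The function names and the two route labels (`human_support`, `intent_classifier`) are hypothetical, and plain substring matching is an assumption chosen for brevity: it would also fire on words such as “holidays” or on a neutral question about delivery in “3 days,” so a real deployment would likely need word-boundary matching or a trained classifier.

```python
# Keyword lists copied from the categories above.
FRUSTRATION_KEYWORDS = {
    "anger": ["terrible", "worst", "hate", "ridiculous"],
    "time": ["hours", "days", "still waiting"],
    "failure": ["useless", "no help", "doesn't work"],
    "escalation": ["speak to human", "real person", "manager"],
}


def detect_frustration(message: str) -> list[str]:
    """Return the frustration categories whose keywords appear in the message."""
    text = message.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return [
        category
        for category, phrases in FRUSTRATION_KEYWORDS.items()
        if any(phrase in text for phrase in phrases)
    ]


def route(message: str) -> str:
    """Skip search entirely when any frustration signal is present."""
    return "human_support" if detect_frustration(message) else "intent_classifier"
```

For example, `route("This is useless, I want a real person")` goes straight to human support, while `route("How do I activate my new phone")` continues to intent classification.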

Cross-industry applications

The Intent-First pattern applies wherever enterprises deploy conversational AI over heterogeneous content:

Industry            | Intent categories                           | Key benefit
--------------------|---------------------------------------------|---------------------------------------
Telecommunications  | Sales, Support, Billing, Account, Retention | Prevents “cancel” misclassification
Healthcare          | Clinical, Coverage, Scheduling, Billing     | Separates clinical from administrative
Financial services  | Retail, Institutional, Lending, Insurance   | Prevents context mixing
Retail              | Product, Orders, Returns, Loyalty           | Ensures promotional freshness

Results

After implementing Intent-First architecture across telecommunications and healthcare platforms:

Metric              | Impact
--------------------|---------------------------
Query success rate  | Nearly doubled
Support escalations | Reduced by more than half
Time to resolution  | Reduced approximately 70%
User satisfaction   | Improved roughly 50%
Return user rate    | More than doubled

The return user rate proved most significant. When search works, users come back. When it fails, they abandon the channel entirely, increasing costs across all other support channels.

The strategic imperative

The conversational AI market will continue to experience hypergrowth.

But enterprises that build and deploy typical RAG architectures will continue to fail … repeatedly.

AI will confidently give wrong answers, users will abandon digital channels out of frustration and support costs will go up instead of down.

Intent-First is a fundamental shift in how enterprises need to architect and build AI-powered customer conversations. It’s not about better models or more data. It’s about understanding what a user wants before you try to help them.

The sooner an organization realizes this as an architectural imperative, the sooner they will be able to capture the efficiency gains this technology is supposed to enable. Those that don’t will be debugging why their AI investments haven’t been producing expected business outcomes for many years to come.

The demo is easy. Production is hard. But the pattern for production success is clear: Intent First.

Sreenivasa Reddy Hulebeedu Reddy is a lead software engineer and enterprise architect.

Railway secures $100 million to challenge AWS with AI-native cloud infrastructure

Railway, a San Francisco-based cloud platform that has quietly amassed two million developers without spending a dollar on marketing, announced Thursday that it raised $100 million in a Series B funding round, as surging demand for artificial intelligence applications exposes the limitations of legacy cloud infrastructure.

TQ Ventures led the round, with participation from FPV Ventures, Redpoint, and Unusual Ventures. The investment positions Railway as one of the most significant infrastructure startups to emerge during the AI boom, capitalizing on developer frustration with the complexity and cost of traditional platforms like Amazon Web Services and Google Cloud.

“As AI models get better at writing code, more and more people are asking the age-old question: where, and how, do I run my applications?” said Jake Cooper, Railway’s 28-year-old founder and chief executive, in an exclusive interview with VentureBeat. “The last generation of cloud primitives were slow and outdated, and now with AI moving everything faster, teams simply can’t keep up.”

The funding is a dramatic acceleration for a company that has charted an unconventional path through the cloud computing industry. Railway raised just $24 million in total before this round, including a $20 million Series A from Redpoint in 2022. The company now processes more than 10 million deployments monthly and handles over one trillion requests through its edge network — metrics that rival far larger and better-funded competitors.

Why three-minute deploy times have become unacceptable in the age of AI coding assistants

Railway’s pitch rests on a simple observation: the tools developers use to deploy and manage software were designed for a slower era. A standard build-and-deploy cycle using Terraform, the industry-standard infrastructure tool, takes two to three minutes. That delay, once tolerable, has become a critical bottleneck as AI coding assistants like Claude, ChatGPT, and Cursor can generate working code in seconds.

“When godly intelligence is on tap and can solve any problem in three seconds, those amalgamations of systems become bottlenecks,” Cooper told VentureBeat. “What was really cool for humans to deploy in 10 seconds or less is now table stakes for agents.”

The company claims its platform delivers deployments in under one second — fast enough to keep pace with AI-generated code. Customers report a tenfold increase in developer velocity and up to 65 percent cost savings compared to traditional cloud providers.

These numbers come directly from enterprise clients, not internal benchmarks. Daniel Lobaton, chief technology officer at G2X, a platform serving 100,000 federal contractors, measured a sevenfold improvement in deployment speed and an 87 percent cost reduction after migrating to Railway. His infrastructure bill dropped from $15,000 per month to approximately $1,000.

“The work that used to take me a week on our previous infrastructure, I can do in Railway in like a day,” Lobaton said. “If I want to spin up a new service and test different architectures, it would take so long on our old setup. In Railway I can launch six services in two minutes.”

Inside the controversial decision to abandon Google Cloud and build data centers from scratch

What distinguishes Railway from competitors like Render and Fly.io is the depth of its vertical integration. In 2024, the company made the unusual decision to abandon Google Cloud entirely and build its own data centers, a move that echoes the famous Alan Kay maxim: “People who are really serious about software should make their own hardware.”

“We wanted to design hardware in a way where we could build a differentiated experience,” Cooper said. “Having full control over the network, compute, and storage layers lets us do really fast build and deploy loops, the kind that allows us to move at ‘agentic speed’ while staying 100 percent the smoothest ride in town.”

The approach paid dividends during recent widespread outages that affected major cloud providers — Railway remained online throughout.

This soup-to-nuts control enables pricing that undercuts the hyperscalers by roughly 50 percent and newer cloud startups by three to four times. Railway charges by the second for actual compute usage: $0.00000386 per gigabyte-second of memory, $0.00000772 per vCPU-second, and $0.00000006 per gigabyte-second of storage. There are no charges for idle virtual machines — a stark contrast to the traditional cloud model where customers pay for provisioned capacity whether they use it or not.
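To make the per-second rates concrete, here is a small back-of-the-envelope calculation. The three rates are the ones quoted above. The function name, the sample workload and the simplification that compute, memory and storage are all billed for the same number of seconds are assumptions for illustration, not Railway’s billing logic.

```python
# Rates quoted in the article, in US dollars.
MEMORY_PER_GB_SECOND = 0.00000386
CPU_PER_VCPU_SECOND = 0.00000772
STORAGE_PER_GB_SECOND = 0.00000006

SECONDS_PER_30_DAY_MONTH = 30 * 24 * 60 * 60  # 2,592,000


def usage_cost(vcpus: float, memory_gb: float, storage_gb: float, seconds: float) -> float:
    """Cost of a workload billed per second for what it actually uses."""
    per_second = (
        vcpus * CPU_PER_VCPU_SECOND
        + memory_gb * MEMORY_PER_GB_SECOND
        + storage_gb * STORAGE_PER_GB_SECOND
    )
    return seconds * per_second


# Hypothetical workload: 1 vCPU, 1 GB of memory and 10 GB of storage.
always_on = usage_cost(1, 1, 10, SECONDS_PER_30_DAY_MONTH)
quarter_time = usage_cost(1, 1, 10, SECONDS_PER_30_DAY_MONTH * 0.25)
print(f"always on: ${always_on:.2f}, active 25% of the time: ${quarter_time:.2f}")
```

Under these assumptions the always-on service costs roughly $31.57 for a 30-day month, and the same service active a quarter of the time costs about a quarter of that, which is the contrast with paying for provisioned but idle capacity.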

“The conventional wisdom is that the big guys have economies of scale to offer better pricing,” Cooper noted. “But when they’re charging for VMs that usually sit idle in the cloud, and we’ve purpose-built everything to fit much more density on these machines, you have a big opportunity.”

How 30 employees built a platform generating tens of millions in annual revenue

Railway has achieved its scale with a team of just 30 employees generating tens of millions in annual revenue — a ratio of revenue per employee that would be exceptional even for established software companies. The company grew revenue 3.5 times last year and continues to expand at 15 percent month-over-month.

Cooper emphasized that the fundraise was strategic rather than necessary. “We’re default alive; there’s no reason for us to raise money,” he said. “We raised because we see a massive opportunity to accelerate, not because we needed to survive.”

The company hired its first salesperson only last year and employs just two solutions engineers. Nearly all of Railway’s two million users discovered the platform through word of mouth — developers telling other developers about a tool that actually works.

“We basically did the standard engineering thing: if you build it, they will come,” Cooper recalled. “And to some degree, they came.”

From side projects to Fortune 500 deployments: Railway’s unlikely corporate expansion

Despite its grassroots developer community, Railway has made significant inroads into large organizations. The company claims that 31 percent of Fortune 500 companies now use its platform, though deployments range from company-wide infrastructure to individual team projects.

Notable customers include Bilt, the loyalty program company; Intuit’s GoCo subsidiary; TripAdvisor’s Cruise Critic; and MGM Resorts. Kernel, a Y Combinator-backed startup providing AI infrastructure to over 1,000 companies, runs its entire customer-facing system on Railway for $444 per month.

“At my previous company Clever, which sold for $500 million, I had six full-time engineers just managing AWS,” said Rafael Garcia, Kernel’s chief technology officer. “Now I have six engineers total, and they all focus on product. Railway is exactly the tool I wish I had in 2012.”

For enterprise customers, Railway offers security certifications including SOC 2 Type 2 compliance and HIPAA readiness, with business associate agreements available upon request. The platform provides single sign-on authentication, comprehensive audit logs, and the option to deploy within a customer’s existing cloud environment through a “bring your own cloud” configuration.

Enterprise pricing starts at custom levels, with specific add-ons for extended log retention ($200 monthly), HIPAA BAAs ($1,000), enterprise support with SLOs ($2,000), and dedicated virtual machines ($10,000).

The startup’s bold strategy to take on Amazon, Google, and a new generation of cloud rivals

Railway enters a crowded market that includes not only the hyperscale cloud providers (Amazon Web Services, Microsoft Azure, and Google Cloud Platform) but also a growing cohort of developer-focused platforms like Vercel, Render, Fly.io, and Heroku.

Cooper argues that Railway’s competitors fall into two camps, neither of which has fully committed to the new infrastructure model that AI demands.

“The hyperscalers have two competing systems, and they haven’t gone all-in on the new model because their legacy revenue stream is still printing money,” he observed. “They have this mammoth pool of cash coming from people who provision a VM, use maybe 10 percent of it, and still pay for the whole thing. To what end are they actually interested in going all the way in on a new experience if they don’t really need to?”

Against startup competitors, Railway differentiates by covering the full infrastructure stack. “We’re not just containers; we’ve got VM primitives, stateful storage, virtual private networking, automated load balancing,” Cooper said. “And we wrap all of this in an absurdly easy-to-use UI, with agentic primitives so agents can move 1,000 times faster.”

The platform supports databases including PostgreSQL, MySQL, MongoDB, and Redis; provides up to 256 terabytes of persistent storage with over 100,000 input/output operations per second; and enables deployment to four global regions spanning the United States, Europe, and Southeast Asia. Enterprise customers can scale to 112 vCPUs and 2 terabytes of RAM per service.

Why investors are betting that AI will create a thousand times more software than exists today

Railway’s fundraise reflects broader investor enthusiasm for companies positioned to benefit from the AI coding revolution. As tools like GitHub Copilot, Cursor, and Claude become standard fixtures in developer workflows, the volume of code being written — and the infrastructure needed to run it — is expanding dramatically.

“The amount of software that’s going to come online over the next five years is unfathomable compared to what existed before — we’re talking a thousand times more software,” Cooper predicted. “All of that has to run somewhere.”

The company has already integrated directly with AI systems, building what Cooper calls “loops where Claude can hook in, call deployments, and analyze infrastructure automatically.” Railway released a Model Context Protocol server in August 2025 that allows AI coding agents to deploy applications and manage infrastructure directly from code editors.

“The notion of a developer is melting before our eyes,” Cooper said. “You don’t have to be an engineer to engineer things anymore — you just need critical thinking and the ability to analyze things in a systems capacity.”

What Railway plans to do with $100 million and zero marketing experience

Railway plans to use the new capital to expand its global data center footprint, grow its team beyond 30 employees, and build what Cooper described as a proper go-to-market operation for the first time in the company’s five-year history.

“One of my mentors said you raise money when you can change the trajectory of the business,” Cooper explained. “We’ve built all the required substrate to scale indefinitely; what’s been holding us back is simply talking about it. 2026 is the year we play on the world stage.”

The company’s investor roster reads like a who’s who of developer infrastructure. Angel investors include Tom Preston-Werner, co-founder of GitHub; Guillermo Rauch, chief executive of Vercel; Spencer Kimball, chief executive of Cockroach Labs; Olivier Pomel, chief executive of Datadog; and Jori Lallo, co-founder of Linear.

The timing of Railway’s expansion coincides with what many in Silicon Valley view as a fundamental shift in how software gets made. Coding assistants are no longer experimental curiosities — they have become essential tools that millions of developers rely on daily. Each line of AI-generated code needs somewhere to run, and the incumbents, by Cooper’s telling, are too wedded to their existing business models to fully capitalize on the moment.

Whether Railway can translate developer enthusiasm into sustained enterprise adoption remains an open question. The cloud infrastructure market is littered with promising startups that failed to break the grip of Amazon, Microsoft, and Google. But Cooper, who previously worked as a software engineer at Wolfram Alpha, Bloomberg, and Uber before founding Railway in 2020, seems unfazed by the scale of his ambition.

“In five years, Railway [will be] the place where software gets created and evolved, period,” he said. “Deploy instantly, scale infinitely, with zero friction. That’s the prize worth playing for, and there’s no bigger one on offer.”

For a company that built a $100 million business by doing the opposite of what conventional startup wisdom dictates (no marketing, no sales team, no venture hype), the real test begins now. Railway spent five years proving that developers would find a better mousetrap on their own. The next five will determine whether the rest of the world is ready to get on board.

Why LinkedIn says prompting was a non-starter — and small models were the breakthrough

LinkedIn is a leader in AI recommender systems, having developed them over the last 15-plus years. But getting to a next-gen recommendation stack for the job-seekers of tomorrow required a whole new technique. The company had to look beyond off-the-shelf models to achieve next-level accuracy, latency, and efficiency.

“There was just no way we were gonna be able to do that through prompting,” Erran Berger, VP of product engineering at LinkedIn, says in a new Beyond the Pilot podcast. “We didn’t even try that for next-gen recommender systems because we realized it was a non-starter.”

Instead, his team set out to develop a highly detailed product policy document to fine-tune an initially massive 7-billion-parameter model; that was then further distilled into additional teacher and student models optimized to hundreds of millions of parameters.

The technique has created a repeatable cookbook now reused across LinkedIn’s AI products. 

“Adopting this eval process end to end will drive substantial quality improvement of the likes we probably haven’t seen in years here at LinkedIn,” Berger says. 

Why multi-teacher distillation was a ‘breakthrough’ for LinkedIn 

Berger and his team set out to build an LLM that could interpret individual job queries, candidate profiles and job descriptions in real time, and in a way that mirrored LinkedIn’s product policy as accurately as possible. 

Working with the company’s product management team, engineers eventually built out a 20-to-30-page document scoring job description and profile pairs “across many dimensions.” 

“We did many, many iterations on this,” Berger says. That product policy document was then paired with a “golden dataset” comprising thousands of pairs of queries and profiles; the team fed this into ChatGPT during data generation and experimentation, prompting the model over time to learn scoring pairs and eventually generate a much larger synthetic data set to train a 7-billion-parameter teacher model.

However, Berger says, it’s not enough to have an LLM running in production just on product policy. “At the end of the day, it’s a recommender system, and we need to do some amount of click prediction and personalization.” 

So, his team used that initial product policy-focused teacher model to develop a second teacher model oriented toward click prediction. Using the two, they further distilled a 1.7-billion-parameter model for training purposes. That eventual student model was run through “many, many training runs,” and was optimized “at every point” to minimize quality loss, Berger says.

This multi-teacher distillation technique allowed the team to “achieve a lot of affinity” to the original product policy and “land” click prediction, he says. They were also able to “modularize and componentize” the training process for the student.

Consider it in the context of a chat agent with two different teacher models: One is training the agent on accuracy in responses, the other on tone and how it should communicate. Those two things are very different, yet critical, objectives, Berger notes. 

“By now mixing them, you get better outcomes, but also iterate on them independently,” he says. “That was a breakthrough for us.” 
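For readers unfamiliar with the idea, a generic multi-teacher distillation objective can be sketched as a weighted sum of per-teacher losses, each comparing one output head of the student with the teacher responsible for that objective. This is a textbook-style illustration, not LinkedIn’s training code: the head names (`policy`, `click`), the weights, the temperature and the use of KL divergence are all assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q): how far the student distribution q is from the teacher p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def multi_teacher_loss(student_heads, teacher_heads, weights, temperature=2.0):
    """Weighted sum of one distillation term per teacher.

    Each teacher supervises its own student head, so the terms can be
    reweighted or swapped independently of one another.
    """
    loss = 0.0
    for name, teacher_logits in teacher_heads.items():
        teacher = softmax(teacher_logits, temperature)
        student = softmax(student_heads[name], temperature)
        loss += weights[name] * kl_divergence(teacher, student)
    return loss


# Hypothetical logits for one example: a 3-way policy score and a click / no-click head.
teachers = {"policy": [2.0, 0.5, -1.0], "click": [0.2, 1.5]}
student = {"policy": [1.0, 1.0, -0.5], "click": [0.0, 1.0]}
weights = {"policy": 0.6, "click": 0.4}
print(multi_teacher_loss(student, teachers, weights))
```

Because each teacher contributes a separate term, one objective can be retrained or reweighted without touching the other, which is one way to read the “iterate on them independently” point.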

Changing how teams work together

Berger says he can’t overstate the importance of anchoring on a product policy and an iterative eval process.

Getting a “really, really good product policy” requires translating product manager domain expertise into a unified document. Historically, Berger notes, the product management team was laser focused on strategy and user experience, leaving modeling iteration approaches to ML engineers. Now, though, the two teams work together to “dial in” and create an aligned teacher model. 

“How product managers work with machine learning engineers now is very different from anything we’ve done previously,” he says. “It’s now a blueprint for basically any AI products we do at LinkedIn.”

Watch the full podcast to hear more about: 

  • How LinkedIn optimized every step of the R&D process to support velocity, leading to real results in days or hours rather than weeks;

  • Why teams should develop pipelines for pluggability and experimentation and try out different models to support flexibility;

  • The continued importance of traditional engineering debugging.

You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.

TrueFoundry launches TrueFailover to automatically reroute enterprise AI traffic during model outages

When OpenAI went down in December, one of TrueFoundry’s customers faced a crisis that had nothing to do with chatbots or content generation. The company uses large language models to help refill prescriptions. Every second of downtime meant thousands of dollars in lost revenue — and patients who could not access their medications on time.

TrueFoundry, an enterprise AI infrastructure company, announced Wednesday a new product called TrueFailover designed to prevent exactly that scenario. The system automatically detects when AI providers experience outages, slowdowns, or quality degradation, then seamlessly reroutes traffic to backup models and regions before users notice anything went wrong.

“The challenge is that in the AI world, failover is no longer that simple,” said Nikunj Bajaj, co-founder and chief executive of TrueFoundry, in an exclusive interview with VentureBeat. “When you move from one model to another, you also have to consider things like output quality, latency, and whether the prompt even works the same way. In many cases, the prompt needs to be adjusted in real-time to prevent results from degrading. That is not something most teams are set up to manage manually.”

The announcement arrives at a pivotal moment for enterprise AI adoption. Companies have moved far beyond experimentation. AI now powers prescription refills at pharmacies, generates sales proposals, assists software developers, and handles customer support inquiries. When these systems fail, the consequences ripple through entire organizations.

Why enterprise AI systems remain dangerously dependent on single providers

Large language models from OpenAI, Anthropic, Google, and other providers have become essential infrastructure for thousands of businesses. But unlike traditional cloud services from Amazon Web Services or Microsoft Azure — which offer robust uptime guarantees backed by decades of operational experience — AI providers operate complex, resource-intensive systems that remain prone to unexpected failures.

“Major LLM providers experience outages, slowdowns, or latency spikes every few weeks or months, and we regularly see the downstream impact on businesses that rely on a single provider,” Bajaj told VentureBeat.

The December OpenAI outage that affected TrueFoundry’s pharmacy customer illustrates the stakes. “At their scale, even seconds of downtime can translate into thousands of dollars in lost revenue,” Bajaj explained. “Beyond the economic impact, there is also a human consequence when patients cannot access prescriptions on time. Because this customer had our failover solution in place, they were able to reroute requests to another model provider within minutes of detecting the outage. Without that setup, recovery would likely have taken hours.”

The problem extends beyond complete outages. Partial failures — where a model slows down or produces lower-quality responses without going fully offline — can quietly destroy user experience and violate service-level agreements. These “slow but technically up” scenarios often prove more damaging than dramatic crashes because they evade traditional monitoring systems while steadily eroding performance.

Inside the technology that keeps AI applications online when providers fail

TrueFailover operates as a resilience layer on top of TrueFoundry’s AI Gateway, which already processes more than 10 billion requests per month for Fortune 1000 companies. The system weaves together several interconnected capabilities into a unified safety net for enterprise AI.

At its core, the product enables multi-model failover by allowing enterprises to define primary and backup models across providers. If OpenAI becomes unavailable, traffic automatically shifts to Anthropic, Google’s Gemini, Mistral, or self-hosted alternatives. The routing happens transparently, without requiring application teams to rewrite code or manually intervene.
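The basic idea of ordered failover can be sketched in a few lines. This is a generic illustration, not TrueFoundry’s API: the `ProviderUnavailable` exception, the `call_with_failover` function and the two stub providers are hypothetical, and a real gateway would also handle timeouts, retries and prompt adaptation between models.

```python
class ProviderUnavailable(Exception):
    """Raised by a provider client when the upstream model cannot be reached."""


def call_with_failover(prompt, providers):
    """Try each (name, call) pair in priority order and return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderUnavailable as exc:
            errors.append(f"{name}: {exc}")  # remember why this provider was skipped
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub providers standing in for real model clients.
def primary_model(prompt):
    raise ProviderUnavailable("upstream timeout")


def backup_model(prompt):
    return "echo: " + prompt


used, answer = call_with_failover(
    "refill status?",
    [("primary", primary_model), ("backup", backup_model)],
)
print(used, answer)
```

The calling application only sees the returned answer. Which provider served it is a routing detail, which is what lets the switch happen without application code changes.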

The system extends this protection across geographic boundaries through multi-region and multi-cloud resilience. By distributing AI endpoints across zones and cloud providers, health-based routing can detect problems in specific regions and divert traffic to healthy alternatives. What would otherwise become a global incident transforms into an invisible infrastructure adjustment that users never perceive.

Perhaps most critically, TrueFailover employs degradation-aware routing that continuously monitors latency, error rates, and quality signals. “We look at a combination of signals that together indicate when a model’s performance is starting to degrade,” Bajaj explained. “Large language models are shared resources. Providers run the same model instance across many customers, so when demand spikes for one user or workload, it can affect everyone else using that model.”

The system watches for rising response times, increasing error rates, and patterns suggesting instability. “Individually, none of these signals tell the full story,” Bajaj said. “But taken together, they allow us to detect early signs that a model is slowing down or becoming unreliable. Those signals feed into an AI-driven system that can decide when and how to reroute traffic before users experience a noticeable drop in quality.”
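To make the idea of combining signals concrete, here is a deliberately simple sketch that flags an endpoint as degraded from a rolling window of latency and error observations. The article describes an AI-driven decision system; this fixed-threshold rule is a stand-in, and the class name, window size and thresholds are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean


class HealthWindow:
    """Rolling window of recent calls for one model endpoint (hypothetical)."""

    def __init__(self, size: int = 50, latency_slo_s: float = 2.0,
                 max_error_rate: float = 0.05):
        self.latencies = deque(maxlen=size)  # seconds per request
        self.failures = deque(maxlen=size)   # 1 for a failed call, 0 otherwise
        self.latency_slo_s = latency_slo_s   # assumed latency budget
        self.max_error_rate = max_error_rate # assumed tolerated error rate

    def record(self, latency_s: float, ok: bool) -> None:
        self.latencies.append(latency_s)
        self.failures.append(0 if ok else 1)

    def degraded(self) -> bool:
        """True when average latency or error rate exceeds its threshold."""
        if not self.latencies:
            return False
        slow = mean(self.latencies) > self.latency_slo_s
        flaky = mean(self.failures) > self.max_error_rate
        return slow or flaky


window = HealthWindow(size=5)
for _ in range(5):
    window.record(0.4, ok=True)   # fast, successful calls
healthy_before = window.degraded()
for _ in range(5):
    window.record(4.0, ok=True)   # still "up", but slow
degraded_after = window.degraded()
print(healthy_before, degraded_after)
```

Note that every call in the second batch succeeds. A monitor that only counts errors would miss this "slow but technically up" case.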

Strategic caching rounds out the protection by shielding providers from sudden traffic spikes and preventing rate-limit cascades during high-demand periods. This allows systems to absorb demand surges and provider limits without brownouts or throttling surprises.
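The article does not say how the caching works. One simple form it could take is an exact-match response cache with a time-to-live, sketched below. `TTLCache` and `slow_provider` are hypothetical names, and exact-match keying is an assumption; production systems may use semantic or partial matching instead.

```python
import time
from typing import Callable


class TTLCache:
    """Exact-match response cache with a time-to-live (illustrative only)."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.store: dict[str, tuple[float, str]] = {}

    def get_or_call(self, prompt: str, call: Callable[[str], str]) -> str:
        now = self.clock()
        hit = self.store.get(prompt)
        if hit is not None and now - hit[0] < self.ttl_s:
            return hit[1]               # fresh entry: the provider is not called
        response = call(prompt)         # miss or stale entry: call the provider
        self.store[prompt] = (now, response)
        return response


calls = []


def slow_provider(prompt: str) -> str:
    # Stub provider that records how often it is actually invoked.
    calls.append(prompt)
    return "answer"


cache = TTLCache(ttl_s=60.0)
first = cache.get_or_call("store hours?", slow_provider)
second = cache.get_or_call("store hours?", slow_provider)
print(first, second, len(calls))
```

Repeated identical requests within the TTL reach the provider once, which is how a cache can keep a burst of traffic from tripping rate limits.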

The approach represents a fundamental shift in how enterprises should think about AI reliability. “TrueFailover is designed to handle that complexity automatically,” Bajaj said. “It continuously monitors how models behave across many customers and use cases, looks for early warning signs like rising latency, and takes action before things break. Most individual enterprises do not have that kind of visibility because they are only able to see their own systems.”

The engineering challenge of switching models without sacrificing output quality

One of the thorniest challenges in AI failover involves maintaining consistent output quality when switching between models. A prompt optimized for GPT-5 may produce different results on Claude or Gemini. TrueFoundry addresses this through several mechanisms that balance speed against precision.

“Some teams rely on the fact that large models have become good enough that small differences in prompts do not materially affect the output,” Bajaj explained. “In those cases, switching from one provider to another can happen with some visible impact — that’s not ideal, but some teams choose to do it.”

More sophisticated implementations maintain provider-specific prompts for the same application. “When traffic shifts from one model to another, the prompt shifts with it,” Bajaj said. “In that case, failover is not just switching models. It is switching to a configuration that has already been tested.”
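A provider-specific prompt setup can be as small as a lookup table keyed by provider. The sketch below is an assumption about how a team might organize it; the provider names, templates and `build_request` helper are invented for illustration.

```python
# One pre-tested template per provider (contents are illustrative).
PROMPTS = {
    "provider_a": (
        "You are a pharmacy assistant. Answer in two sentences.\n\n"
        "Question: {question}"
    ),
    "provider_b": (
        "<role>pharmacy assistant</role>\n"
        "Reply briefly.\n"
        "Question: {question}"
    ),
}


def build_request(provider: str, question: str) -> dict:
    """Pair the target provider with the prompt that was tested for it."""
    template = PROMPTS[provider]  # KeyError here means no tested prompt exists
    return {"provider": provider, "prompt": template.format(question=question)}


primary_request = build_request("provider_a", "Is my refill ready?")
backup_request = build_request("provider_b", "Is my refill ready?")
print(backup_request["prompt"])
```

Failing loudly on an unknown provider is a deliberate choice here: it prevents traffic from shifting to a model that has no tested configuration.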

TrueFailover automates this process. The system dynamically routes requests and adjusts prompts based on which model handles the query, keeping quality within acceptable ranges without manual intervention. The key, Bajaj emphasized, is that “failover is planned, not reactive. The logic, prompts, and guardrails are defined ahead of time, which is why end users typically do not notice when a switch happens.”

Importantly, many failover scenarios do not require changing providers at all. “It can be routing traffic from the same model in one region to another region, such as from the East Coast to the West Coast, where no prompt changes are required,” Bajaj noted. This geographic flexibility provides a first line of defense before more complex cross-provider switches become necessary.

How regulated industries can use AI failover without compromising compliance

For enterprises in healthcare, financial services, and other regulated sectors, the prospect of AI traffic automatically routing to different providers raises immediate compliance concerns. Patient data cannot simply flow to whichever model happens to be available. Financial records require strict controls over where they travel. TrueFoundry built explicit guardrails to address these constraints.

“TrueFailover will never route data to a model or provider that an enterprise has not explicitly approved,” Bajaj said. “Everything is controlled through an admin configuration layer where teams set clear guardrails upfront.”

Enterprises define exactly which models qualify for failover, which providers can receive traffic, and even which regions or model categories — such as closed-source versus open-source — are acceptable. Once those rules take effect, TrueFailover operates only within them.

“If a model is not on the approved list, it is simply not an option for routing,” Bajaj emphasized. “There is no scenario where traffic is automatically sent somewhere unexpected. The idea is to give teams full control over compliance and data boundaries, while still allowing the system to respond quickly when something goes wrong. That way, reliability improves without compromising security or regulatory requirements.”
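The allowlist behavior described here amounts to filtering failover candidates against an admin-defined policy before any routing decision is made. A minimal sketch, with hypothetical `Target` and `Policy` types and invented model, provider and region names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    model: str
    provider: str
    region: str
    open_source: bool


@dataclass(frozen=True)
class Policy:
    approved_models: frozenset
    approved_providers: frozenset
    approved_regions: frozenset
    allow_open_source: bool


def eligible(targets: list, policy: Policy) -> list:
    """Keep only targets that satisfy every rule in the policy."""
    return [
        t for t in targets
        if t.model in policy.approved_models
        and t.provider in policy.approved_providers
        and t.region in policy.approved_regions
        and (policy.allow_open_source or not t.open_source)
    ]


candidates = [
    Target("model-x", "provider_a", "us-east", False),
    Target("model-y", "provider_b", "eu-west", False),
    Target("model-z", "self_hosted", "us-east", True),
]
policy = Policy(
    approved_models=frozenset({"model-x", "model-z"}),
    approved_providers=frozenset({"provider_a", "self_hosted"}),
    approved_regions=frozenset({"us-east"}),
    allow_open_source=False,
)
allowed = eligible(candidates, policy)
print([t.model for t in allowed])
```

Because the filter runs first, an unapproved target never enters the routing pool, no matter how healthy it looks.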

This design reflects lessons learned from TrueFoundry’s existing enterprise deployments. A Fortune 50 healthcare company already uses the platform to handle more than 500 million IVR calls annually through an agentic AI system. That customer required the ability to run workloads across both cloud and on-premise infrastructure while maintaining strict data residency controls — exactly the kind of hybrid environment where failover policies must be precisely defined.

Where automatic failover cannot help and what enterprises must plan for

TrueFoundry acknowledges that TrueFailover cannot solve every reliability problem. The system operates within the guardrails enterprises configure, and those configurations determine what protection is possible.

“If a team allows failover from a large, high-capacity model to a much smaller model without adjusting prompts or expectations, TrueFailover cannot guarantee the same output quality,” Bajaj explained. “The system can route traffic, but it cannot make a smaller model behave like a larger one without appropriate configuration.”

Infrastructure constraints also limit protection. If an enterprise hosts its own models and all of them run on the same GPU cluster, TrueFailover cannot help when that infrastructure fails. “When there is no alternate infrastructure available, there is nothing to fail over to,” Bajaj said.

The question of simultaneous multi-provider failures occasionally surfaces in enterprise risk discussions. Bajaj argues this scenario, while theoretically possible, rarely matches reality. “In practice, ‘going down’ usually does not mean an entire provider is offline across all models and regions,” he explained. “What happens far more often is a slowdown or disruption in a specific model or region because of traffic spikes or capacity issues.”

When that occurs, failover can happen at multiple levels — from on-premise to cloud, cloud to on-premise, one region to another, one model to another, or even within the same provider before switching providers entirely. “That alone makes it very unlikely that everything fails at once,” Bajaj said. “The key point is that reliability is built on layers of redundancy. The more providers, regions, and models that are included in the guardrails, the smaller the chance that users experience a complete outage.”
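The arithmetic behind layered redundancy is simple if one assumes, optimistically, that the options fail independently: the chance that all of them are down at once is the product of their individual failure probabilities. The 1% figure below is purely illustrative, not a measured provider statistic.

```python
from math import prod


def p_total_outage(p_fail: list) -> float:
    """Chance every option is down at once, ASSUMING independent failures."""
    return prod(p_fail)


one_option = p_total_outage([0.01])
three_options = p_total_outage([0.01, 0.01, 0.01])
print(one_option, three_options)
```

Real failures are often correlated (two providers on the same cloud region, for example), so the true figure will be worse than the product suggests. That is one reason to spread options across providers, regions and infrastructure types rather than simply adding more of the same.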

A startup that built its platform inside Fortune 500 AI deployments

TrueFoundry has established itself as infrastructure for some of the world’s largest AI deployments, providing crucial context for its failover ambitions. The company raised $19 million in Series A funding in February 2025, led by Intel Capital with participation from Eniac Ventures, Peak XV Partners, and Jump Capital. Angel investors including Gokul Rajaram and Mohit Aron also joined the round, bringing total funding to $21 million.

The San Francisco-based company was founded in 2021 by Bajaj and co-founders Abhishek Choudhary and Anuraag Gutgutia, all former Meta engineers who met as classmates at IIT Kharagpur. Initially focused on accelerating machine learning deployments, TrueFoundry pivoted to support generative AI capabilities as the technology went mainstream in 2023.

The company’s customer roster demonstrates enterprise-scale adoption that few AI infrastructure startups can match. Nvidia employs TrueFoundry to build multi-agent systems that optimize GPU cluster utilization across data centers worldwide — a use case where even small improvements in utilization translate into substantial business impact given the insatiable demand for GPU capacity. Adopt AI routes more than 15 million requests and 40 billion input tokens through TrueFoundry’s AI Gateway to power its enterprise agentic workflows.

Gaming company Games 24×7 serves machine learning models to more than 100 million users through the platform at scales exceeding 200 requests per second. Digital adoption platform Whatfix migrated to a microservices architecture on TrueFoundry, reducing its release cycle sixfold and cutting testing time by 40 percent.

TrueFoundry currently reports more than 30 paid customers worldwide and has indicated it exceeded $1.5 million in annual recurring revenue last year while quadrupling its customer base. The company manages more than 1,000 clusters for machine learning workloads across its client base.

TrueFailover will be offered as an add-on module on top of the existing TrueFoundry AI Gateway and platform, with pricing following a usage-based model tied to traffic volume along with the number of users, models, providers, and regions involved. An early access program for design partners opens in the coming weeks.

Why traditional cloud uptime guarantees may never apply to AI providers

Enterprise technology buyers have long demanded uptime commitments from infrastructure providers. Amazon Web Services, Microsoft Azure, and Google Cloud all offer service-level agreements with financial penalties for failures. Will AI providers eventually face similar expectations?

Bajaj sees fundamental constraints that make traditional SLAs difficult to achieve in the current generation of AI infrastructure. “Most foundational LLMs today operate as shared resources, which is what enables the standard pricing you see publicly advertised,” he explained. “Providers do offer higher uptime commitments, but that usually means dedicated capacity or reserved infrastructure, and the cost increases significantly.”

Even with substantial budgets, enterprises face usage quotas that create unexpected exposure. “If traffic spikes beyond those limits, requests can still spill back into shared infrastructure,” Bajaj said. “That makes it hard to achieve the kind of hard guarantees enterprises are used to with cloud providers.”

The economics of running large language models create additional barriers that may persist for years. “LLMs are still extremely complex and expensive to run. They require massive infrastructure and energy, and we do not expect a near-term future where most companies run multiple, fully dedicated model instances just to guarantee uptime.”

This reality drives demand for solutions like TrueFailover that provide resilience regardless of what individual providers can promise. “Enterprises are realizing that reliability cannot come from the model provider alone,” Bajaj said. “It requires additional layers of protection to handle the realities of how these systems operate today.”

The new calculus for companies that built AI into critical business processes

The timing of TrueFoundry’s announcement reflects a fundamental shift in how enterprises use AI — and what they stand to lose when it fails. What began as internal experimentation has evolved into customer-facing applications where disruptions directly affect revenue and reputation.

“Many enterprises experimented with Gen AI and agentic systems in the past, and production use cases were largely internal-facing,” Bajaj observed. “There was no immediate impact on their top line or the public perception of the enterprise.”

That era has ended. “Now that these enterprises have launched public-facing applications, where both the top line and public perception can be impacted if an outage occurs, the stakes are much higher than they were even six months ago. That’s why we are seeing more and more attention on this now.”

For companies that have woven AI into critical business processes — from prescription refills to customer support to sales operations — the calculus has changed entirely. The question is no longer which model performs best on benchmarks or which provider offers the most compelling features. The question that now keeps technology leaders awake is far simpler and far more urgent: what happens when the AI disappears at the worst possible moment?

Somewhere, a pharmacist is filling a prescription. A customer support agent is resolving a complaint. A sales team is generating a proposal for a deal that closes tomorrow. All of them depend on AI systems that depend on providers that, despite their scale and sophistication, still go dark without warning.

TrueFoundry is betting that enterprises will pay handsomely to ensure those moments of darkness never reach the people who matter most — their customers.

Stop calling it ‘The AI bubble’: It’s actually multiple bubbles, each with a different expiration date

It’s the question on everyone’s minds and lips: Are we in an AI bubble?

It’s the wrong question. The real question is: Which AI bubble are we in, and when will each one burst?

The debate over whether AI represents a transformative technology or an economic time bomb has reached a fever pitch. Even tech leaders like Meta CEO Mark Zuckerberg have acknowledged evidence of an unstable financial bubble forming around AI. OpenAI CEO Sam Altman and Microsoft co-founder Bill Gates see clear bubble dynamics: overexcited investors, frothy valuations and plenty of doomed projects — but they still believe AI will ultimately transform the economy.

But treating “AI” as a single monolithic entity destined for a uniform collapse is fundamentally misguided. The AI ecosystem is actually three distinct layers, each with different economics, defensibility and risk profiles. Understanding these layers is critical, because they won’t all pop at once.

Layer 3: The wrapper companies (first to fall)

The most vulnerable segment isn’t building AI — it’s repackaging it.

These are the companies that take OpenAI’s API, add a slick interface and some prompt engineering, then charge $49/month for what amounts to a glorified ChatGPT wrapper. Some have achieved rapid initial success, like Jasper.ai, which reached approximately $42 million in annual recurring revenue (ARR) in its first year by wrapping GPT models in a user-friendly interface for marketers.

But the cracks are already showing. These businesses face threats from every direction:

Feature absorption: Microsoft can bundle your $50/month AI writing tool into Office 365 tomorrow. Google can make your AI email assistant a free Gmail feature. Salesforce can build your AI sales tool natively into their CRM. When large platforms decide your product is a feature, not a product, your business model evaporates overnight.

The commoditization trap: Wrapper companies are essentially just passing inputs and outputs. If OpenAI improves prompting, these tools lose value overnight. As foundation models become more similar in capability and pricing continues to fall, margins compress to nothing.

Zero switching costs: Most wrapper companies don’t own proprietary data, embedded workflows or deep integrations. A customer can switch to a competitor, or directly to ChatGPT, in minutes. There’s no moat, no lock-in, no defensibility.

The white-label AI market exemplifies this fragility. Companies using white-label platforms face vendor lock-in risks from proprietary systems and API limitations that can hinder integration. These businesses are building on rented land, and the landlord can change the terms, or bulldoze the property, at any moment.

The exception that proves the rule: Cursor stands as a rare wrapper-layer company that has built genuine defensibility. By deeply integrating into developer workflows, creating proprietary features beyond simple API calls and establishing strong network effects through user habits and custom configurations, Cursor has demonstrated how a wrapper can evolve into something more substantial. But companies like Cursor are outliers, not the norm — most wrapper companies lack this level of workflow integration and user lock-in.

Timeline: Expect significant failures in this segment from late 2025 through 2026, as large platforms absorb functionality and users realize they’re paying premium prices for commoditized capabilities.

Layer 2: Foundation models (the middle ground)

The companies building LLMs — OpenAI, Anthropic, Mistral — occupy a more defensible but still precarious position.

Economic researcher Richard Bernstein points to OpenAI as an example of the bubble dynamic, noting that the company has made around $1 trillion in AI deals, including a $500 billion data center buildout project, despite being set to generate only $13 billion in revenue. The divergence between investment and plausible earnings “certainly looks bubbly,” Bernstein notes.

Yet, these companies possess genuine technological moats: Model training expertise, compute access and performance advantages. The question is whether these advantages are sustainable or whether models will commoditize to the point where they’re indistinguishable — turning foundation model providers into low-margin infrastructure utilities.

Engineering will separate winners from losers: As foundation models converge in baseline capabilities, the competitive edge will increasingly come from inference optimization and systems engineering. Companies that can scale the memory wall through innovations like extended KV cache architectures, achieve superior token throughput and deliver faster time-to-first-token will command premium pricing and market share. The winners won’t just be those with the largest training runs, but those who can make AI inference economically viable at scale. Technical breakthroughs in memory management, caching strategies and infrastructure efficiency will determine which frontier labs survive consolidation.

Another concern is the circular nature of investments. For instance, Nvidia is pumping $100 billion into OpenAI to bankroll data centers, and OpenAI is then filling those facilities with Nvidia’s chips. Nvidia is essentially subsidizing one of its biggest customers, which may make AI demand look larger than it really is.

Still, these companies have massive capital backing, genuine technical capabilities and strategic partnerships with major cloud providers and enterprises. Some will consolidate, some will be acquired, but the category will survive.

Timeline: Consolidation in 2026 to 2028, with 2 to 3 dominant players emerging while smaller model providers are acquired or shuttered.

Layer 1: Infrastructure (built to last)

Here’s the contrarian take: The infrastructure layer — including Nvidia, data centers, cloud providers, memory systems and AI-optimized storage — is the least bubbly part of the AI boom.

Yes, the latest estimates suggest global AI capital expenditures and venture capital investments already exceed $600 billion in 2025, with Gartner estimating that all AI-related spending worldwide might top $1.5 trillion. That sounds like bubble territory.

But infrastructure has a critical characteristic: It retains value regardless of which specific applications succeed. The fiber optic cables laid during the dot-com bubble weren’t wasted — they enabled YouTube, Netflix and cloud computing. Twenty-five years ago, the original dot-com bubble burst after debt financing built out fiber-optic cables for a future that had not yet arrived, but that future eventually did arrive, and the infrastructure was there waiting.

Despite stock pressure, Nvidia’s revenue for the third quarter of fiscal year 2026 hit about $57 billion, up 22% quarter-over-quarter and 62% year-over-year, with the data center division alone generating roughly $51.2 billion. These aren’t vanity metrics; they represent real demand from companies making genuine infrastructure investments.
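As a sanity check on those figures, the quoted growth rates can be inverted to estimate the prior-period revenue they imply. The inputs are the rounded numbers from the paragraph above, so the results are approximations rather than reported values.

```python
revenue = 57.0        # $ billions, latest quarter as quoted above
data_center = 51.2    # $ billions, data center division as quoted above
qoq_growth = 0.22     # quarter-over-quarter growth
yoy_growth = 0.62     # year-over-year growth

implied_prev_quarter = revenue / (1 + qoq_growth)
implied_year_ago = revenue / (1 + yoy_growth)
data_center_share = data_center / revenue

print(round(implied_prev_quarter, 1))
print(round(implied_year_ago, 1))
print(round(data_center_share * 100))
```

On these inputs, the data center division accounts for roughly nine-tenths of total revenue, which is why the quarter reads as a direct measure of infrastructure demand.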

The chips, data centers, memory systems and storage infrastructure being built today will power whatever AI applications ultimately succeed, whether that’s today’s chatbots, tomorrow’s autonomous agents or applications we haven’t even imagined yet. Unlike commoditized storage alone, modern AI infrastructure encompasses the entire memory hierarchy — from GPU HBM to DRAM to high-performance storage systems that serve as token warehouses for inference workloads. This integrated approach to memory and storage represents a fundamental architectural innovation, not a commodity play.

Timeline: Short-term overbuilding and lazy engineering are possible (2026), but long-term value retention is expected as AI workloads expand over the next decade.

The cascade effect: Why this matters

The current AI boom won’t end with one dramatic crash. Instead, we’ll see a cascade of failures beginning with the most vulnerable companies, and the warning signs are already here.

Phase 1: Wrapper and white-label companies face margin compression and feature absorption. Hundreds of AI startups with thin differentiation will shut down or sell for pennies on the dollar. More than 1,300 AI startups now have valuations of over $100 million, with 498 AI “unicorns” valued at $1 billion or more, many of which won’t justify those valuations.

Phase 2: Foundation model consolidation as performance converges and only the best-capitalized players survive. Expect 3 to 5 major acquisitions as tech giants absorb promising model companies.

Phase 3: Infrastructure spending normalizes but remains elevated. Some data centers will sit partially empty for a few years (like fiber optic cables in 2002), but they’ll eventually fill as AI workloads genuinely expand.

What this means for builders

The most significant risk isn’t being a wrapper — it’s staying one. If you own the experience the user operates in, you own the user.

If you’re building in the application layer, you need to move upstack immediately:

From wrapper → application layer: Stop just generating outputs. Own the workflow before and after the AI interaction.

From application → vertical SaaS: Build execution layers that force users to stay inside your product. Create proprietary data, deep integrations and workflow ownership that makes switching painful.

The distribution moat: Your real advantage isn’t the LLM, it’s how you get users, keep them and expand what they do inside your platform. Winning AI businesses aren’t just software companies — they’re distribution companies.

The bottom line

It’s time to stop asking whether we’re in “the” AI bubble. We’re in multiple bubbles with different characteristics and timelines.

The wrapper companies will pop first, probably within 18 months. Foundation models will consolidate over the next 2 to 4 years. I predict that current infrastructure investments will ultimately prove justified over the long term, although not without some short-term overbuilding pains.

This isn’t a reason for pessimism, it’s a roadmap. Understanding which layer you’re operating in and which bubble you might be caught in is the difference between becoming the next casualty and building something that survives the shakeout.

The AI revolution is real. But not every company riding the wave will make it to shore.

Val Bercovici is CAIO at WEKA.

Why reinforcement learning plateaus without representation depth (and other key takeaways from NeurIPS 2025)

Every year, NeurIPS produces hundreds of impressive papers, and a handful that subtly reset how practitioners think about scaling, evaluation and system design. In 2025, the most consequential works weren’t about a single breakthrough model. Instead, they challenged fundamental assumptions that academicians and corporations have quietly relied on: Bigger models mean better reasoning, RL creates new capabilities, attention is “solved” and generative models inevitably memorize.

This year’s top papers collectively point to a deeper shift: AI progress is now constrained less by raw model capacity and more by architecture, training dynamics and evaluation strategy.

Below is a technical deep dive into five of the most influential NeurIPS 2025 papers — and what they mean for anyone building real-world AI systems.

1. LLMs are converging, and we finally have a way to measure it

Paper: Artificial Hivemind: The Open-Ended Homogeneity of Language Models

For years, LLM evaluation has focused on correctness. But in open-ended or ambiguous tasks like brainstorming, ideation or creative synthesis, there often is no single correct answer. The risk instead is homogeneity: Models producing the same “safe,” high-probability responses.

This paper introduces Infinity-Chat, a benchmark designed explicitly to measure diversity and pluralism in open-ended generation. Rather than scoring answers as right or wrong, it measures:

  • Intra-model collapse: How often the same model repeats itself

  • Inter-model homogeneity: How similar different models’ outputs are

The result is uncomfortable but important: Across architectures and providers, models increasingly converge on similar outputs — even when multiple valid answers exist.
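To make the two measurements concrete, here is a toy version that scores similarity with word-level Jaccard overlap. That metric is an assumption chosen to keep the example dependency-free; it is not the scoring method used by the Infinity-Chat benchmark, and the sample responses are invented.

```python
from itertools import combinations, product


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two responses (stand-in similarity metric)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def intra_model_similarity(samples: list) -> float:
    """Average similarity between repeated samples from ONE model."""
    pairs = list(combinations(samples, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def inter_model_similarity(samples_a: list, samples_b: list) -> float:
    """Average similarity between samples from TWO different models."""
    pairs = list(product(samples_a, samples_b))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented responses to "Write a metaphor about time."
model_a = ["time is a river", "time is a river"]
model_b = ["time is a thief"]

intra = intra_model_similarity(model_a)
inter = inter_model_similarity(model_a, model_b)
print(intra, inter)
```

High intra-model scores indicate a model repeating itself; high inter-model scores indicate different models saying the same thing. A product team could track both alongside accuracy.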

Why this matters in practice

For corporations, this reframes “alignment” as a trade-off. Preference tuning and safety constraints can quietly reduce diversity, leading to assistants that feel too safe, predictable or biased toward dominant viewpoints.

Takeaway: If your product relies on creative or exploratory outputs, diversity metrics need to be first-class citizens. 

2. Attention isn’t finished — a simple gate changes everything

Paper: Gated Attention for Large Language Models

Transformer attention has been treated as settled engineering. This paper proves it isn’t.

The authors introduce a small architectural change: Apply a query-dependent sigmoid gate after scaled dot-product attention, per attention head. That’s it. No exotic kernels, no massive overhead.
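A single-head toy version shows how small the change is. This sketch uses plain Python lists and assumes an elementwise gate computed from the same token representation that produces the query; `Wg` is a hypothetical name for the gate's weight matrix, and the causal mask is an assumption typical of decoder-only LLMs.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_attention_head(X, Wq, Wk, Wv, Wg):
    """Causal scaled dot-product attention for one head, followed by a
    sigmoid gate computed from each query token's own input vector."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        # Standard attention over positions 0..i (causal mask).
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        attn = [sum(w * V[j][c] for j, w in enumerate(weights))
                for c in range(len(V[0]))]
        # The added step: scale each output channel by a value in (0, 1).
        gate = [sigmoid(g) for g in matvec(Wg, X[i])]
        out.append([g * a for g, a in zip(gate, attn)])
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
gated = gated_attention_head(tokens, identity, identity, identity, zeros)
print(gated[0])
```

With zero gate weights every gate sits at 0.5, so the output is simply half the ordinary attention output. Training moves the gate weights so that some channels are suppressed toward zero for some tokens, which is where the sparsity comes from.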

Across dozens of large-scale training runs — including dense and mixture-of-experts (MoE) models trained on trillions of tokens — this gated variant:

  • Improved stability

  • Reduced “attention sinks”

  • Enhanced long-context performance

  • Consistently outperformed vanilla attention

Why it works

The gate introduces:

  • Non-linearity in attention outputs

  • Implicit sparsity, suppressing pathological activations

This challenges the assumption that attention failures are purely data or optimization problems.

Takeaway: Some of the biggest LLM reliability issues may be architectural — not algorithmic — and solvable with surprisingly small changes.

3. RL can scale — if you scale in depth, not just data

Paper: 1,000-Layer Networks for Self-Supervised Reinforcement Learning

Conventional wisdom says RL doesn’t scale well without dense rewards or demonstrations. This paper shows that assumption is incomplete.

By scaling network depth aggressively from typical 2 to 5 layers to nearly 1,000 layers, the authors demonstrate dramatic gains in self-supervised, goal-conditioned RL, with performance improvements ranging from 2X to 50X.

The key isn’t brute force. It’s pairing depth with contrastive objectives, stable optimization regimes and goal-conditioned representations.
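For readers unfamiliar with contrastive objectives, the sketch below implements InfoNCE, a widely used contrastive loss, over paired state and goal embeddings. The article does not specify the paper's exact loss, so treat this as a generic example of the family; the embeddings are invented, and in practice they would come from the deep networks being trained.

```python
import math


def info_nce(state_emb: list, goal_emb: list) -> float:
    """InfoNCE loss: state i should score highest against its own goal i,
    with the other goals in the batch serving as negatives."""
    n = len(state_emb)
    loss = 0.0
    for i in range(n):
        logits = [sum(a * b for a, b in zip(state_emb[i], goal_emb[j]))
                  for j in range(n)]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / n


states = [[5.0, 0.0], [0.0, 5.0]]
matching_goals = [[5.0, 0.0], [0.0, 5.0]]
swapped_goals = [[0.0, 5.0], [5.0, 0.0]]

aligned_loss = info_nce(states, matching_goals)
misaligned_loss = info_nce(states, swapped_goals)
print(aligned_loss < misaligned_loss)
```

The loss needs no reward signal, only pairs of states and the goals they eventually reached. That is what makes the setup self-supervised.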

Why this matters beyond robotics

For agentic systems and autonomous workflows, this suggests that representation depth — not just data or reward shaping — may be a critical lever for generalization and exploration.

Takeaway: RL’s scaling limits may be architectural, not fundamental.

4. Why diffusion models generalize instead of memorizing

Paper: Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training

Diffusion models are massively overparameterized, yet they often generalize remarkably well. This paper explains why.

The authors identify two distinct training timescales:

  • One where generative quality rapidly improves

  • Another — much slower — where memorization emerges

Crucially, the memorization timescale grows linearly with dataset size, creating a widening window where models improve without overfitting.
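In symbols, the claim can be summarized as follows. The notation is introduced here for illustration and is not taken from the paper: n is the number of training examples, t is training time, c is a positive constant, and the quality timescale is assumed not to grow with n, which is what "widening window" implies.

```latex
\tau_{\mathrm{gen}} \approx \text{const}, \qquad
\tau_{\mathrm{mem}} \approx c\,n, \qquad
\text{safe training window: } \tau_{\mathrm{gen}} < t < \tau_{\mathrm{mem}}
```

Under this reading, doubling the dataset roughly doubles how long training can continue before memorization sets in, while the time needed to reach good sample quality stays about the same.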

Practical implications

This reframes early stopping and dataset scaling strategies. Memorization isn’t inevitable — it’s predictable and delayed.

Takeaway: For diffusion training, dataset size doesn’t just improve quality — it actively delays overfitting.

5. RL improves reasoning performance, not reasoning capacity

Paper: Does Reinforcement Learning Really Incentivize Reasoning in LLMs?

Perhaps the most strategically important result of NeurIPS 2025 is also the most sobering.

This paper rigorously tests whether reinforcement learning with verifiable rewards (RLVR) actually creates new reasoning abilities in LLMs — or simply reshapes existing ones.

Their conclusion: RLVR primarily improves sampling efficiency, not reasoning capacity. At large sample sizes, the base model often already contains the correct reasoning trajectories.
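The distinction is easiest to see with pass@k, the probability that at least one of k sampled answers is correct. The sketch below uses invented per-sample success rates for a single problem and assumes independent samples; it illustrates the sampling-efficiency argument and does not reproduce the paper's measurements.

```python
def pass_at_k(p_single: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_single) ** k


base_model = 0.02   # assumed: base model solves this problem 2% of the time
rl_tuned = 0.30     # assumed: after RLVR, 30% of the time

for k in (1, 16, 256):
    print(k, round(pass_at_k(base_model, k), 3), round(pass_at_k(rl_tuned, k), 3))
```

At k = 1 the tuned model looks far more capable. At large k the base model catches up, because the correct reasoning path was already in its distribution and RL mainly made it more likely to be sampled first.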

What this means for LLM training pipelines

RL is better understood as:

  • A distribution-shaping mechanism

  • Not a generator of fundamentally new capabilities

Takeaway: To truly expand reasoning capacity, RL likely needs to be paired with mechanisms like teacher distillation or architectural changes — not used in isolation.

The bigger picture: AI progress is becoming systems-limited

Taken together, these papers point to a common theme:

The bottleneck in modern AI is no longer raw model size — it’s system design.

  • Diversity collapse requires new evaluation metrics

  • Attention failures require architectural fixes

  • RL scaling depends on depth and representation

  • Memorization depends on training dynamics, not parameter count

  • Reasoning gains depend on how distributions are shaped, not just optimized

For builders, the message is clear: Competitive advantage is shifting from “who has the biggest model” to “who understands the system.”

Maitreyi Chatterjee is a software engineer.

Devansh Agarwal currently works as an ML engineer at FAANG.

How Google’s ‘internal RL’ could unlock long-horizon AI agents

Researchers at Google have developed a technique that makes it easier for AI models to learn complex reasoning tasks that usually cause LLMs to hallucinate or fall apart. Instead of training LLMs through next-token prediction, their technique, called internal reinforcement learning (internal RL), steers the model’s internal activations toward developing a high-level step-by-step solution for the input problem. 

Ultimately, this could provide a scalable path for creating autonomous agents that can handle complex reasoning and real-world robotics without needing constant, manual guidance.

The limits of next-token prediction

Reinforcement learning plays a key role in post-training LLMs, particularly for complex reasoning tasks that require long-horizon planning. However, the problem lies in the architecture of these models. LLMs are autoregressive, meaning they generate sequences one token at a time. When these models explore new strategies during training, they do so by making small, random changes to the next single token or action. This exposes a deeper limitation: next-token prediction forces models to search for solutions at the wrong level of abstraction, making long-horizon reasoning inefficient even when the model “knows” what to do.

This token-by-token approach works well for basic language modeling but breaks down in long-horizon tasks where rewards are sparse. If the model relies solely on random token-level sampling, the probability of stumbling upon the correct multi-step solution is infinitesimally small, “on the order of one in a million,” according to the researchers.
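A back-of-the-envelope calculation shows how quickly the odds collapse. The numbers are illustrative assumptions, not figures from the paper: suppose a task has 20 decision points and random exploration picks the right option half the time at each one.

```python
def p_random_success(p_step: float, steps: int) -> float:
    """Chance of getting every step right by independent random choices."""
    return p_step ** steps


p = p_random_success(0.5, 20)
print(p)            # about 9.5e-07
print(round(1 / p)) # 1048576, roughly one in a million
```

Because the reward only arrives when every step is right, almost all exploration produces no learning signal at all.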

The issue isn’t just that the models get confused; it’s that they get confused at the wrong level. In comments provided to VentureBeat, Yanick Schimpf, a co-author of the paper, notes that in a 20-step task, an agent can get lost in the minute details of a single step, or it can lose track of the overall goal.

“We argue that when facing a problem with some abstract structure… [goal-oriented exploration] is what you want,” Schimpf said. By solving the problem at the abstract level first, the agent commits to a path, ensuring it doesn’t “get lost in one of the reasoning steps” and fail to complete the broader workflow.

To address this, the field has long looked toward hierarchical reinforcement learning. HRL attempts to solve complex problems by decomposing them into a hierarchy of temporally abstract actions (high-level subroutines that represent different stages of the solution) rather than managing a task as a string of tokens. 

However, discovering these appropriate subroutines remains a longstanding challenge. Current HRL methods often fail to discover proper policies, frequently “converging to degenerate options” that do not represent meaningful behaviors. Even sophisticated modern methods like GRPO (a popular RL algorithm used for sparse-reward tasks) fail in complex environments because they cannot effectively bridge the gap between low-level execution and high-level planning.

Steering the LLM’s internal thoughts

To overcome these limitations, the Google team proposed internal RL. Advanced autoregressive models already “know” how to perform complex, multi-step tasks internally, even if they aren’t explicitly trained to do so.

Because these complex behaviors are hidden inside the model’s residual stream (i.e., the numerical values that carry information through the network’s layers), the researchers introduced an “internal neural network controller,” or metacontroller. Instead of monitoring and changing the output token, the metacontroller controls the model’s behavior by applying changes to the model’s internal activations in the middle layers.

This nudge steers the model into a specific useful state. The base model then automatically generates the sequence of individual steps needed to achieve that goal because it has already seen those patterns during its initial pretraining. 
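To make the mechanics of "steering" concrete, here is a toy sketch of adding a control vector to a residual stream partway through a stack of layers. It is not the researchers' implementation: the function `run_with_steering`, the `half_block` layer and the control vector are all hypothetical, and a real metacontroller would learn its nudge rather than have it hard-coded.

```python
def run_with_steering(layers, x, control=None, steer_at=None):
    """Run a stack of residual blocks over the hidden state x.

    If steer_at is given, `control` is added to the residual stream
    right after that layer, so every later layer sees the shifted state.
    """
    h = list(x)
    for i, layer in enumerate(layers):
        update = layer(h)
        h = [a + u for a, u in zip(h, update)]  # residual connection
        if steer_at is not None and i == steer_at:
            h = [a + c for a, c in zip(h, control)]  # hypothetical controller nudge
    return h


# Toy "frozen" model: each block adds half of the current state.
def half_block(h):
    return [0.5 * v for v in h]


layers = [half_block, half_block]
baseline = run_with_steering(layers, [1.0, 2.0])
steered = run_with_steering(layers, [1.0, 2.0], control=[1.0, 0.0], steer_at=0)
print(baseline)  # [2.25, 4.5]
print(steered)   # [3.75, 4.5]
```

The base layers are untouched in both runs; only the intermediate state differs, and that difference propagates through the remaining layers.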

The metacontroller operates through unsupervised learning and does not require human-labeled training examples. Instead, the researchers use a self-supervised framework where the model analyzes a full sequence of behavior and works backward to infer the hidden, high-level intent that best explains the actions.

During the internal RL phase, the updates are applied to the metacontroller, which shifts training from next-token prediction to learning high-level actions that can lead to the solution.

To understand the practical value of this, consider an enterprise agent tasked with code generation. Today, there is a difficult trade-off: You need “low temperature” (predictability) to get the syntax right, but “high temperature” (creativity) to solve the logic puzzle.

“Internal RL might facilitate this by allowing the model to explore the space of abstract actions, i.e. structuring logic and method calls, while delegating the token-level realization of those actions to the robust, lower-temperature distribution of the base model,” Schimpf said. The agent explores the solution without breaking the syntax.

The researchers investigated two methods for applying this controller. In the first, the base autoregressive model is pretrained on a behavioral dataset and then frozen, while the metacontroller is trained to steer the frozen model’s residual stream. In the second, the metacontroller and the base model are jointly optimized, with parameters of both networks updated simultaneously. 

Internal RL in action

To evaluate the effectiveness of internal RL, the researchers ran experiments across hierarchical environments designed to stump traditional learners. These included a discrete grid world and a continuous control task where a quadrupedal “ant” robot must coordinate joint movements. Both environments used sparse rewards with very long action sequences.

While baselines like GRPO and CompILE failed to learn the tasks within a million episodes due to the difficulty of credit assignment over long horizons, internal RL achieved high success rates with a small number of training episodes. By choosing high-level goals rather than tiny steps, the metacontroller drastically reduced the search space. This allowed the model to identify which high-level decisions led to success, making credit assignment efficient enough to solve the sparse reward problem.

Notably, the researchers found that the “frozen” approach was superior. When the base model and metacontroller were co-trained from scratch, the system failed to develop meaningful abstractions. However, applied to a frozen model, the metacontroller successfully discovered key checkpoints without any human labels, perfectly aligning its internal switching mechanism with the ground-truth moments when an agent finished one subgoal and started the next.

As the industry currently fixates on reasoning models that output verbose “chains of thought” to solve problems, Google’s research points toward a different, perhaps more efficient future.

“Our study joins a growing body of work suggesting that ‘internal reasoning’ is not only feasible but potentially more efficient than token-based approaches,” Schimpf said. “Moreover, these silent ‘thoughts’ can be decoupled from specific input modalities — a property that could be particularly relevant for the future of multi-modal AI.”

If internal reasoning can be guided without being externalized, the future of AI agents may hinge less on prompting strategies and more on how well we can access and steer what models already represent internally. For enterprises betting on autonomous systems that must plan, adapt, and act over long horizons, that shift could matter more than any new reasoning benchmark.

Breaking through AI’s memory wall with token warehousing

As agentic AI moves from experiments to real production workloads, a quiet but serious infrastructure problem is coming into focus: memory. Not compute. Not models. Memory.

Under the hood, today’s GPUs simply don’t have enough space to hold the Key-Value (KV) caches that modern, long-running AI agents depend on to maintain context. The result is a lot of invisible waste — GPUs redoing work they’ve already done, cloud costs climbing, and performance taking a hit. It’s a problem that’s already showing up in production environments, even if most people haven’t named it yet.

At a recent stop on the VentureBeat AI Impact Series, WEKA CTO Shimon Ben-David joined VentureBeat CEO Matt Marshall to unpack the industry’s emerging “memory wall,” and why it’s becoming one of the biggest blockers to scaling truly stateful agentic AI — systems that can remember and build on context over time. The conversation didn’t just diagnose the issue; it laid out a new way to think about memory entirely, through an approach WEKA calls token warehousing.

The GPU memory problem

“When we’re looking at the infrastructure of inferencing, it is not a GPU cycles challenge. It’s mostly a GPU memory problem,” said Ben-David.

The root of the issue comes down to how transformer models work. To generate responses, they rely on KV caches that store contextual information for every token in a conversation. The longer the context window, the more memory those caches consume, and it adds up fast. A single 100,000-token sequence can require roughly 40GB of GPU memory, noted Ben-David.
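A back-of-the-envelope calculation shows why the numbers grow so quickly. The sketch below uses the standard per-token KV cache size (keys plus values, for every layer and KV head); the model configuration is an assumption chosen for illustration, not a figure from the talk, and it lands in the same tens-of-gigabytes range as the estimate quoted above.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2 accounts for storing both a key and a value per position.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


# Assumed configuration (illustrative only): 80 layers, 8 KV heads,
# head dimension 128, 16-bit values, one 100,000-token sequence.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=100_000)
print(f"{size / 1e9:.1f} GB")  # 32.8 GB
```

The exact figure depends heavily on the architecture: models that use more KV heads or higher-precision values need proportionally more memory per token.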

That wouldn’t be a problem if GPUs had unlimited memory. But they don’t. Even the most advanced GPUs top out at around 288GB of high-bandwidth memory (HBM), and that space also has to hold the model itself.

In real-world, multi-tenant inference environments, this becomes painful quickly. Workloads like code development or processing tax returns rely heavily on the KV cache for context.

“If I’m loading three or four 100,000-token PDFs into a model, that’s it — I’ve exhausted the KV cache capacity on HBM,” said Ben-David. This is what’s known as the memory wall. “Suddenly, what the inference environment is forced to do is drop data,” he added.

That means GPUs are constantly throwing away context they’ll soon need again, preventing agents from being stateful and maintaining conversations and context over time.

The hidden inference tax

“We constantly see GPUs in inference environments recalculating things they already did,” Ben-David said. Systems prefill the KV cache, start decoding, then run out of space and evict earlier data. When that context is needed again, the whole process repeats — prefill, decode, prefill again. At scale, that’s an enormous amount of wasted work. It also means wasted energy, added latency, and degraded user experience — all while margins get squeezed.
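The prefill, evict, prefill-again cycle can be illustrated with a toy simulation. The sketch below models the KV cache as a least-recently-used store with a fixed token capacity and counts how many tokens have to be prefilled. The LRU policy, the function name `simulate_prefill` and the workload are assumptions for illustration; real inference servers use more elaborate eviction schemes.

```python
from collections import OrderedDict


def simulate_prefill(requests, capacity_tokens):
    """Count prefilled tokens for a stream of (session_id, context_tokens)
    requests served from a toy LRU KV cache with a fixed token capacity."""
    cache = OrderedDict()  # session_id -> cached context size in tokens
    used = 0
    prefilled = 0
    for session_id, tokens in requests:
        if session_id in cache:
            cache.move_to_end(session_id)  # cache hit: skip prefill
            continue
        prefilled += tokens  # cache miss: context must be recomputed
        # Evict least-recently-used sessions until the new context fits.
        while cache and used + tokens > capacity_tokens:
            _, evicted_tokens = cache.popitem(last=False)
            used -= evicted_tokens
        if tokens <= capacity_tokens:
            cache[session_id] = tokens
            used += tokens
    return prefilled


# Three sessions with 100,000-token contexts, each visited twice in rotation.
requests = [(s, 100_000) for s in ("a", "b", "c")] * 2
print(simulate_prefill(requests, capacity_tokens=200_000))  # 600000
print(simulate_prefill(requests, capacity_tokens=300_000))  # 300000
```

With room for only two of the three contexts, every request misses and the total prefill work doubles; with room for all three, each context is computed once.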

That GPU recalculation waste shows up directly on the balance sheet. Organizations can suffer nearly 40% overhead just from redundant prefill cycles. This is creating ripple effects in the inference market.

“If you look at the pricing of large model providers like Anthropic and OpenAI, they are actually teaching users to structure their prompts in ways that increase the likelihood of hitting the same GPU that has their KV cache stored,” said Ben-David. “If you hit that GPU, the system can skip the prefill phase and start decoding immediately, which lets them generate more tokens efficiently.”

But this still doesn’t solve the underlying infrastructure problem of extremely limited GPU memory capacity.

Solving for stateful AI

“How do you climb over that memory wall? How do you surpass it? That’s the key for modern, cost-effective inferencing,” Ben-David said. “We see multiple companies trying to solve that in different ways.”

Some organizations are deploying new linear models that try to create smaller KV caches. Others are focused on tackling cache efficiency.

“To be more efficient, companies are using environments that calculate the KV cache on one GPU and then try to copy it from GPU memory or use a local environment for that,” Ben-David explained. “But how do you do that at scale in a cost-effective manner that doesn’t strain your memory and doesn’t strain your networking? That’s something that WEKA is helping our customers with.”

Simply throwing more GPUs at the problem doesn’t solve the AI memory barrier. “There are some problems that you cannot throw enough money at to solve,” Ben-David said.

Augmented memory and token warehousing, explained

WEKA’s answer is what it calls augmented memory and token warehousing — a way to rethink where and how KV cache data lives. Instead of forcing everything to fit inside GPU memory, WEKA’s Augmented Memory Grid extends the KV cache into a fast, shared “warehouse” within its NeuralMesh architecture.

In practice, this turns memory from a hard constraint into a scalable resource — without adding inference latency. WEKA says customers see KV cache hit rates jump to 96–99% for agentic workloads, along with efficiency gains of up to 4.2x more tokens produced per GPU.

Ben-David put it simply: “Imagine that you have 100 GPUs producing a certain amount of tokens. Now imagine that those hundred GPUs are working as if they’re 420 GPUs.”

For large inference providers, the result isn’t just better performance — it translates directly to real economic impact.

“Just by adding that accelerated KV cache layer, we’re looking at some use cases where the savings amount would be millions of dollars per day,” said Ben-David.

This efficiency multiplier also opens up new strategic options for businesses. Platform teams can design stateful agents without worrying about blowing up memory budgets. Service providers can offer pricing tiers based on persistent context, with cached inference delivered at dramatically lower cost.

What comes next

NVIDIA projects a 100x increase in inference demand as agentic AI becomes the dominant workload. That pressure is already trickling down from hyperscalers to everyday enterprise deployments; this isn’t just a “big tech” problem anymore.

As enterprises move from proofs of concept into real production systems, memory persistence is becoming a core infrastructure concern. Organizations that treat it as an architectural priority rather than an afterthought will gain a clear advantage in both cost and performance.

The memory wall is not something organizations can simply outspend to overcome. As agentic AI scales, it is one of the first AI infrastructure limits that forces a deeper rethink, and as Ben-David’s insights made clear, memory may also be where the next wave of competitive differentiation begins.

How DoorDash scaled without a costly ERP overhaul

Presented by NetSuite


Most companies racing from startup to industry leader face a choice: limp along with scrappy early systems or endure a costly platform migration.

DoorDash did neither. The local-commerce giant scaled from its 2013 founding through IPO and global expansion, acquiring the Helsinki-based technology company Wolt in 2022 and UK-based Deliveroo in 2025, while keeping its original Oracle NetSuite business system. Today, it serves over 50 million consumers in more than 40 countries.*

Chief Accounting Officer Gordon Lee says the secret is building a scalable ecosystem that allows teams to use tools that work best for them.

Choosing flexibility over uniformity

When DoorDash selected NetSuite as its corporate financial control center, it wasn’t looking for a system to enforce uniformity. It sought a scalable platform that could connect all its systems, from ERP and CRM to HR, sourcing, and more.

“Our philosophy has been to create a platform that allows our customers and business partners to use whatever tools work best for them,” Lee says. “When we’re managing growth, the majority of the conversation is about managing expectations — what people expect when you grow from A to B.”

The migration question

Two years after its founding, DoorDash surpassed one million deliveries and expanded into Canada. As the company scaled, Lee faced growing pressure from vendors insisting that rapid growth required a new enterprise platform.

He ran the numbers. The move to another platform could cost millions and consume months of his team’s focus.

Instead, DoorDash stayed with NetSuite, which continued to scale alongside the company’s growth. Built on Oracle Cloud Infrastructure, NetSuite delivers the performance and reliability of an enterprise platform without the cost or disruption of migration.

Lee concluded: “Why do I bother to move? I already have the scalability I need from NetSuite.”

Today, DoorDash’s NetSuite backend provides enterprise-grade security while its familiar front end provides the team flexibility, creating a stable, modern foundation for sustained, high-velocity growth.

Expanding the menu without the technical indigestion

That flexibility soon proved invaluable. The ability to add new applications quickly — without long, costly integrations — became a major advantage during hypergrowth.

For example, as DoorDash expanded from restaurant delivery into grocery, convenience, and retail, Lee turned to NetSuite’s inventory modules to handle the distinct demands of those new categories.

“The flexibility to have and not have, and turn the switch on and off, is easy because it’s all integrated,” he explains.

Today, DoorDash’s technology stack spans multiple systems — all integrating seamlessly with NetSuite as the financial hub. “They do it, and you’re done,” Lee says.

Embedding expertise to scale smarter, not bigger

For Lee, true partnerships turn vendors into part of the team — and that’s exactly how he describes NetSuite Advanced Customer Support (ACS).

“They are here with us every week. They know all my schematics, they know all my data infrastructure, they know all my database structure within NetSuite. Essentially, they are an extension of my team,” Lee explains.

Close collaboration benefits both parties. DoorDash keeps NetSuite attuned to the realities of hypergrowth and gets instant feedback on technology capability and scalability. In turn, NetSuite stays close to a marquee customer. Interaction is ongoing — and frank, according to Lee.

“We work directly with NetSuite ACS and often ask, ‘Can NetSuite do this?’ If they can prove it can, we stay with NetSuite.”

Another benefit is the ability to extend DoorDash’s expertise without expanding headcount.

“If someone says to me, ‘Gordon, you’re just an accountant. How do you know about systems?’ I say, ‘I don’t. I have a network guy with us, an expert.’ That’s the kind of partner I want to surround myself with, so that I can grow beyond what I am.”

By embedding expertise within its partnerships, DoorDash scales with precision and control. Lee says the model applies to other companies preparing for IPOs or global expansion. He adds that sustainable growth depends as much on shared understanding as on technology itself.

Too often, finance and IT “look at the same requirement but see completely different things,” Lee says, describing what he calls the “blue versus purple” problem. “The accountant doesn’t understand the configuration of the system,” he explains. “The IT guy doesn’t understand what the accountant was trying to tell them.”

NetSuite bridges that gap. With a unified data model and built-in best practices across finance, operations, and more, it keeps teams aligned and information consistent. That close collaboration, Lee notes, is what keeps rollouts smooth, data clean, and growth sustainable at any stage.

AI strategy: Trust only internal data, get data ducks in a row

Lee plans to test the NetSuite AI Connector Service — which supports Model Context Protocol (MCP) and lets customers connect their own AI to NetSuite — to see how faster access to accurate data can drive growth.

By implementing an internal instance, Lee is less worried about disruptive errors from LLMs trained on public data sources.

“Think about a generative AI chatbot. When you ask a question, it can reflect many perspectives,” he explains. On the other hand, a chatbot trained on private enterprise systems benefits from “a clean data infrastructure.”

Lee is taking a methodical approach: first get data pristine, then train AI on domain-specific terminology, and finally see how internal AI can both find the right information and automate downstream accounting processes to save resources and accelerate growth.

Betting long-term on its original financial core

From early growth to major acquisitions that helped expand its footprint across the globe, DoorDash has relied on NetSuite as a consistent foundation for innovation and scale.

Lee credits NetSuite’s flexible architecture and close partnership with helping enable DoorDash as it continued to scale and cement itself as a leader in local commerce globally.

His mantra is simple: “Focus on growth instead of churning through vendors.”

* Based on the combined numbers for DoorDash, Wolt, and Deliveroo, measured as of September 2025.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Why your LLM bill is exploding — and how semantic caching can cut it by 73%

Our LLM API bill was growing 30% month-over-month. Traffic was increasing, but not that fast. When I analyzed our query logs, I found the real problem: Users ask the same questions in different ways.

“What’s your return policy?”, “How do I return something?” and “Can I get a refund?” were all hitting our LLM separately, generating nearly identical responses, each incurring full API costs.

Exact-match caching, the obvious first solution, captured only 18% of these redundant calls. The same semantic question, phrased differently, bypassed the cache entirely.

So, I implemented semantic caching based on what queries mean, not how they’re worded. After implementing it, our cache hit rate increased to 67%, reducing LLM API costs by 73%. But getting there requires solving problems that naive implementations miss.

Why exact-match caching falls short

Traditional caching uses query text as the cache key. This works when queries are identical:

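# Exact-match caching: the raw query text is the cache key
def get_exact(cache: dict, query_text: str):
    cache_key = hash(query_text)
    if cache_key in cache:
        return cache[cache_key]
    return None  # miss: caller falls through to the LLM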

But users don’t phrase questions identically. My analysis of 100,000 production queries found:

  • Only 18% were exact duplicates of previous queries

  • 47% were semantically similar to previous queries (same intent, different wording)

  • 35% were genuinely novel queries

That 47% represented massive cost savings we were missing. Each semantically-similar query triggered a full LLM call, generating a response nearly identical to one we’d already computed.

Semantic caching architecture

Semantic caching replaces text-based keys with embedding-based similarity lookup:

from datetime import datetime, timezone
from typing import Optional

class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.92):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.vector_store = VectorStore()  # FAISS, Pinecone, etc.
        self.response_store = ResponseStore()  # Redis, DynamoDB, etc.

    def get(self, query: str) -> Optional[str]:
        """Return cached response if semantically similar query exists."""
        query_embedding = self.embedding_model.encode(query)
        # Find most similar cached query
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= self.threshold:
            cache_id = matches[0].id
            # Entries are stored as dicts (see set); return only the response text
            entry = self.response_store.get(cache_id)
            return entry['response'] if entry else None
        return None

    def set(self, query: str, response: str):
        """Cache query-response pair."""
        query_embedding = self.embedding_model.encode(query)
        cache_id = generate_id()
        self.vector_store.add(cache_id, query_embedding)
        self.response_store.set(cache_id, {
            'query': query,
            'response': response,
            'timestamp': datetime.now(timezone.utc)
        })

The key insight: Instead of hashing query text, I embed queries into vector space and find cached queries within a similarity threshold.
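For readers who want to see the lookup end to end without a vector database, here is a minimal in-memory sketch. The `toy_embed` function is a hypothetical stand-in for a real embedding model: it uses bag-of-words counts, so it only matches queries that share words and cannot recognize true paraphrases. `TinySemanticCache` and the 0.8 threshold are likewise illustrative assumptions.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return Counter(cleaned.split())


def cosine_similarity(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinySemanticCache:
    """Linear-scan cache: fine for a demo, too slow for production volumes."""

    def __init__(self, embed, threshold):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def set(self, query, response):
        self.entries.append((self.embed(query), response))

    def get(self, query):
        query_embedding = self.embed(query)
        best_score, best_response = 0.0, None
        for embedding, response in self.entries:
            score = cosine_similarity(query_embedding, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = TinySemanticCache(toy_embed, threshold=0.8)
cache.set("How do I return something?", "See our returns page.")
hit = cache.get("how do i return something")
miss = cache.get("What are your opening hours?")
print(hit)   # See our returns page.
print(miss)  # None
```

Swapping `toy_embed` for a sentence-embedding model and the linear scan for an approximate nearest-neighbor index gives the architecture described above.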

The threshold problem

The similarity threshold is the critical parameter. Set it too high, and you miss valid cache hits. Set it too low, and you return wrong responses.

Our initial threshold of 0.85 seemed reasonable; 85% similar should be “the same question,” right?

Wrong. At 0.85, we got cache hits like:

  • Query: “How do I cancel my subscription?”

  • Cached: “How do I cancel my order?”

  • Similarity: 0.87

These are different questions with different answers. Returning the cached response would be incorrect.

I discovered that optimal thresholds vary by query type:

| Query type | Optimal threshold | Rationale |
| --- | --- | --- |
| FAQ-style questions | 0.94 | High precision needed; wrong answers damage trust |
| Product searches | 0.88 | More tolerance for near-matches |
| Support queries | 0.92 | Balance between coverage and accuracy |
| Transactional queries | 0.97 | Very low tolerance for errors |

I implemented query-type-specific thresholds:

class AdaptiveSemanticCache(SemanticCache):
    def __init__(self, embedding_model):
        # Reuse the embedding model, vector store and response store setup
        super().__init__(embedding_model)
        self.thresholds = {
            'faq': 0.94,
            'search': 0.88,
            'support': 0.92,
            'transactional': 0.97,
            'default': 0.92
        }
        self.query_classifier = QueryClassifier()

    def get_threshold(self, query: str) -> float:
        query_type = self.query_classifier.classify(query)
        return self.thresholds.get(query_type, self.thresholds['default'])

    def get(self, query: str) -> Optional[str]:
        threshold = self.get_threshold(query)
        query_embedding = self.embedding_model.encode(query)
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= threshold:
            # Entries are stored as dicts; return only the response text
            entry = self.response_store.get(matches[0].id)
            return entry['response'] if entry else None
        return None
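The snippet above depends on a query classifier that is not shown. As a placeholder, a keyword-rule version might look like the sketch below. The class name `KeywordQueryClassifier` and every keyword in `KEYWORD_RULES` are hypothetical; a production system would more plausibly use a small trained classifier.

```python
# Hypothetical keyword rules, checked in order; the first match wins.
KEYWORD_RULES = [
    ("transactional", ("charge", "payment", "invoice", "transfer")),
    ("support", ("error", "not working", "broken")),
    ("search", ("find", "show me", "looking for")),
    ("faq", ("policy", "opening hours", "how do i")),
]


class KeywordQueryClassifier:
    def classify(self, query):
        """Return the first category whose keywords appear in the query."""
        text = query.lower()
        for label, keywords in KEYWORD_RULES:
            if any(keyword in text for keyword in keywords):
                return label
        return "default"


classifier = KeywordQueryClassifier()
print(classifier.classify("Was I charged twice?"))        # transactional
print(classifier.classify("What's your return policy?"))  # faq
print(classifier.classify("Hello there"))                 # default
```

Ordering the rules from strictest to most lenient category means an ambiguous query falls into the category with the higher threshold, which errs on the side of a cache miss.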

Threshold tuning methodology

I couldn’t tune thresholds blindly. I needed ground truth on which query pairs were actually “the same.”

Our methodology:

Step 1: Sample query pairs. I sampled 5,000 query pairs at various similarity levels (0.80-0.99).

Step 2: Human labeling. Annotators labeled each pair as “same intent” or “different intent.” I used three annotators per pair and took a majority vote.

Step 3: Compute precision/recall curves. For each threshold, we computed:

  • Precision: Of cache hits, what fraction had the same intent?

  • Recall: Of same-intent pairs, what fraction did we cache-hit?

def compute_precision_recall(pairs, labels, threshold):
    """Compute precision and recall at given similarity threshold."""
    predictions = [1 if pair.similarity >= threshold else 0 for pair in pairs]
    true_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 1)
    false_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 0)
    false_negatives = sum(1 for p, l in zip(predictions, labels) if p == 0 and l == 1)
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    return precision, recall

Step 4: Select threshold based on cost of errors. For FAQ queries where wrong answers damage trust, I optimized for precision (0.94 threshold gave 98% precision). For search queries where missing a cache hit just costs money, I optimized for recall (0.88 threshold).
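One way to turn Step 4 into code is to sweep candidate thresholds and keep the lowest one that still meets a precision target, since a lower threshold means more cache hits. The sketch below is an assumption about how such a sweep could look, not the exact procedure used here; the labeled pairs are synthetic and `choose_threshold` is a hypothetical helper.

```python
def precision_recall(pairs, threshold):
    """pairs: list of (similarity, same_intent) with same_intent as a bool."""
    tp = sum(1 for sim, same in pairs if sim >= threshold and same)
    fp = sum(1 for sim, same in pairs if sim >= threshold and not same)
    fn = sum(1 for sim, same in pairs if sim < threshold and same)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def choose_threshold(pairs, target_precision, candidates):
    """Return (threshold, precision, recall) for the lowest candidate that
    meets the precision target, or None if no candidate does."""
    for threshold in sorted(candidates):
        precision, recall = precision_recall(pairs, threshold)
        if precision >= target_precision:
            return threshold, precision, recall
    return None


# Synthetic labeled pairs, for illustration only.
pairs = [
    (0.99, True), (0.97, True), (0.95, True), (0.93, False),
    (0.91, True), (0.89, False), (0.87, False), (0.85, True),
]
candidates = [0.85, 0.87, 0.89, 0.91, 0.93, 0.95, 0.97, 0.99]

strict = choose_threshold(pairs, target_precision=0.95, candidates=candidates)
lenient = choose_threshold(pairs, target_precision=0.80, candidates=candidates)
print(strict)   # (0.95, 1.0, 0.6)
print(lenient)  # (0.91, 0.8, 0.8)
```

The stricter target pushes the threshold up and costs recall, which is the trade-off between the FAQ and search settings described above.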

Latency overhead

Semantic caching adds latency: You must embed the query and search the vector store before knowing whether to call the LLM.

Our measurements:

| Operation | Latency (p50) | Latency (p99) |
| --- | --- | --- |
| Query embedding | 12ms | 28ms |
| Vector search | 8ms | 19ms |
| Total cache lookup | 20ms | 47ms |
| LLM API call | 850ms | 2400ms |

The 20ms overhead is negligible compared to the 850ms LLM call we avoid on cache hits. Even at p99, the 47ms overhead is acceptable.

However, cache misses now take 20ms longer than before (embedding + search + LLM call). At our 67% hit rate, the math works out favorably:

  • Before: 100% of queries × 850ms = 850ms average

  • After: (33% × 870ms) + (67% × 20ms) = 287ms + 13ms = 300ms average

Net latency improvement of 65% alongside the cost reduction.
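The same arithmetic generalizes to any hit rate. The short sketch below recomputes the average from the p50 figures above and also derives the break-even hit rate, the point below which the lookup overhead would outweigh the savings. The function names are illustrative.

```python
def expected_latency_ms(hit_rate, lookup_ms, llm_ms):
    """Average latency when every query pays the lookup and misses also pay the LLM call."""
    hit_latency = lookup_ms
    miss_latency = lookup_ms + llm_ms
    return hit_rate * hit_latency + (1 - hit_rate) * miss_latency


def break_even_hit_rate(lookup_ms, llm_ms):
    """Hit rate at which the average latency equals the uncached LLM latency."""
    # lookup + (1 - h) * llm = llm  =>  h = lookup / llm
    return lookup_ms / llm_ms


average = expected_latency_ms(hit_rate=0.67, lookup_ms=20, llm_ms=850)
print(round(average, 1))                      # 300.5
print(f"{break_even_hit_rate(20, 850):.1%}")  # 2.4%
```

With these figures, any hit rate above roughly 2.4% already reduces average latency, so the 67% hit rate leaves a wide margin.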

Cache invalidation

Cached responses go stale. Product information changes, policies update and yesterday’s correct answer becomes today’s wrong answer.

I implemented three invalidation strategies:

  1. Time-based TTL

Simple expiration based on content type:

from datetime import timedelta

TTL_BY_CONTENT_TYPE = {
    'pricing': timedelta(hours=4),      # Changes frequently
    'policy': timedelta(days=7),         # Changes rarely
    'product_info': timedelta(days=1),   # Daily refresh
    'general_faq': timedelta(days=14),   # Very stable
}
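Applying these TTLs at read time could look like the sketch below. It assumes each cached entry carries a `content_type` field alongside its `timestamp`, which the cache entries shown earlier do not include, and it uses a hypothetical `DEFAULT_TTL` for unknown content types.

```python
from datetime import datetime, timedelta, timezone

TTL_BY_CONTENT_TYPE = {
    "pricing": timedelta(hours=4),
    "policy": timedelta(days=7),
    "product_info": timedelta(days=1),
    "general_faq": timedelta(days=14),
}
DEFAULT_TTL = timedelta(days=1)  # assumption: fallback for unknown content types


def is_expired(entry, now=None):
    """Return True if the entry is older than the TTL for its content type."""
    now = now or datetime.now(timezone.utc)
    ttl = TTL_BY_CONTENT_TYPE.get(entry["content_type"], DEFAULT_TTL)
    return now - entry["timestamp"] > ttl


now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
five_hours_ago = now - timedelta(hours=5)
pricing_entry = {"content_type": "pricing", "timestamp": five_hours_ago}
policy_entry = {"content_type": "policy", "timestamp": five_hours_ago}
print(is_expired(pricing_entry, now))  # True
print(is_expired(policy_entry, now))   # False
```

Passing `now` explicitly keeps the check deterministic in tests; in production the default of the current UTC time applies.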

  2. Event-based invalidation

When underlying data changes, invalidate related cache entries:

class CacheInvalidator:
    def on_content_update(self, content_id: str, content_type: str):
        """Invalidate cache entries related to updated content."""
        # Find cached queries that referenced this content
        affected_queries = self.find_queries_referencing(content_id)
        for query_id in affected_queries:
            self.cache.invalidate(query_id)
        self.log_invalidation(content_id, len(affected_queries))
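The lookup from content to affected queries needs some record of which content each cached response drew on. One plausible way to keep that record is a reverse index, sketched below. `ContentIndex` and its methods are hypothetical, and the sketch assumes the response generator can report the content IDs it used.

```python
from collections import defaultdict


class ContentIndex:
    """Reverse index from content IDs to the cache entries that used them."""

    def __init__(self):
        self._by_content = defaultdict(set)

    def record(self, cache_id, content_ids):
        """Register which content a newly cached response was built from."""
        for content_id in content_ids:
            self._by_content[content_id].add(cache_id)

    def find_queries_referencing(self, content_id):
        """Return the cache IDs that must be invalidated when content changes."""
        return set(self._by_content.get(content_id, ()))

    def forget(self, cache_id):
        """Drop a cache entry from the index once it has been invalidated."""
        for cache_ids in self._by_content.values():
            cache_ids.discard(cache_id)


index = ContentIndex()
index.record("q1", ["doc-a", "doc-b"])
index.record("q2", ["doc-b"])
print(sorted(index.find_queries_referencing("doc-b")))  # ['q1', 'q2']
index.forget("q1")
print(sorted(index.find_queries_referencing("doc-b")))  # ['q2']
```

Calling `record` at cache-write time keeps the index in step with the cache, so an update to one document touches only the entries that depended on it.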

  3. Staleness detection

For responses that might become stale without explicit events, I implemented periodic freshness checks:

def check_freshness(self, cached_response: dict) -> bool:
    """Verify cached response is still valid."""
    # Re-run the query against current data
    fresh_response = self.generate_response(cached_response['query'])
    # Compare semantic similarity of responses
    cached_embedding = self.embed(cached_response['response'])
    fresh_embedding = self.embed(fresh_response)
    similarity = cosine_similarity(cached_embedding, fresh_embedding)
    # If responses diverged significantly, invalidate
    if similarity < 0.90:
        self.cache.invalidate(cached_response['id'])
        return False
    return True

We run freshness checks on a sample of cached entries daily, catching staleness that TTL and event-based invalidation miss.

Production results

After three months in production:

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Cache hit rate | 18% | 67% | +272% |
| LLM API costs | $47K/month | $12.7K/month | -73% |
| Average latency | 850ms | 300ms | -65% |
| False-positive rate | N/A | 0.8% | |
| Customer complaints (wrong answers) | Baseline | +0.3% | Minimal increase |

The 0.8% false-positive rate (queries where we returned a cached response that was semantically incorrect) was within acceptable bounds. These cases occurred primarily at the boundaries of our threshold, where similarity was just above the cutoff but intent differed slightly.

Pitfalls to avoid

Don’t use a single global threshold. Different query types have different tolerance for errors. Tune thresholds per category.

Don’t skip the embedding step on cache hits. You might be tempted to skip embedding overhead when returning cached responses, but you need the embedding for cache key generation. The overhead is unavoidable.

Don’t forget invalidation. Semantic caching without invalidation strategy leads to stale responses that erode user trust. Build invalidation from day one.

Don’t cache everything. Some queries shouldn’t be cached: Personalized responses, time-sensitive information, transactional confirmations. Build exclusion rules.

def should_cache(self, query: str, response: str) -> bool:
    """Determine if response should be cached."""
    # Don't cache personalized responses
    if self.contains_personal_info(response):
        return False
    # Don't cache time-sensitive information
    if self.is_time_sensitive(query):
        return False
    # Don't cache transactional confirmations
    if self.is_transactional(query):
        return False
    return True

Key takeaways

Semantic caching is a practical pattern for LLM cost control that captures redundancy exact-match caching misses. The key challenges are threshold tuning (use query-type-specific thresholds based on precision/recall analysis) and cache invalidation (combine TTL, event-based and staleness detection).

At 73% cost reduction, this was our highest-ROI optimization for production LLM systems. The implementation complexity is moderate, but the threshold tuning requires careful attention to avoid quality degradation.

Sreenivasa Reddy Hulebeedu Reddy is a lead software engineer.