Conversational AI doesn’t understand users — ‘Intent First’ architecture does

The modern customer has just one need that matters: getting the thing they want when they want it. The old standard RAG model (embed, retrieve, generate with an LLM) misunderstands intent, overloads context and misses freshness, repeatedly sending customers down the wrong paths.

Instead, intent-first architecture uses a lightweight language model to parse the query for intent and context, before delivering to the most relevant content sources (documents, APIs, people).

Enterprise AI is a speeding train headed for a cliff. Organizations are deploying LLM-powered search applications at a record pace, while a fundamental architectural issue is setting most up for failure.

A recent Coveo study revealed that 72% of enterprise search queries fail to deliver meaningful results on the first attempt, and Gartner also reports that the majority of conversational AI deployments have been falling short of enterprise expectations.

The problem isn’t the underlying models. It’s the architecture around them.

After designing and running live AI-driven customer interaction platforms at scale, serving millions of customer and citizen users at some of the world’s largest telecommunications and healthcare organizations, I’ve come to see a pattern. It’s the difference between successful AI-powered interaction deployments and multi-million-dollar failures.

It’s a cloud-native architecture pattern that I call Intent-First. And it’s reshaping the way enterprises build AI-powered experiences.

The $36 billion problem

Gartner projects the global conversational AI market will balloon to $36 billion by 2032. Enterprises are scrambling to get a slice. The demos are irresistible. Plug your LLM into your knowledge base, and suddenly it can answer customer questions in natural language. Magic.

Then production happens. 

A major telecommunications provider I work with rolled out a RAG system with the expectation of driving down the support call rate. Instead, the rate increased. Callers tried AI-powered search, were provided incorrect answers with a high degree of confidence and called customer support angrier than before.

This pattern is repeated over and over. In healthcare, customer-facing AI assistants are providing patients with formulary information that’s outdated by weeks or months. Financial services chatbots are spitting out answers from both retail and institutional product content. Retailers are seeing discontinued products surface in product searches.

The issue isn’t a failure of AI technology. It’s a failure of architecture.

Why standard RAG architectures fail 

The standard RAG pattern (embedding the query, retrieving semantically similar content, passing it to an LLM) works beautifully in demos and proofs of concept. But it falls apart in production use cases for three systematic reasons:

1. The intent gap

Intent is not context. But standard RAG architectures don’t account for this.

Say a customer types “I want to cancel.” What does that mean? Cancel a service? Cancel an order? Cancel an appointment? During our telecommunications deployment, we found that 65% of queries for “cancel” were actually about orders or appointments, not service cancellation. The RAG system had no way of understanding this intent, so it consistently returned service cancellation documents.

Intent matters. In healthcare, a patient typing “I need to cancel” may be trying to cancel an appointment, a prescription refill or a procedure. Routing them to medication content when they wanted scheduling is not only frustrating; it is also dangerous.

2. Context flood 

Enterprise knowledge is vast, spanning dozens of sources such as product catalogs, billing, support articles, policies, promotions and account data. Standard RAG models treat all of it the same, searching every source for every query.

When a customer asks “How do I activate my new phone,” they don’t care about billing FAQs, store locations or network status updates. But a standard RAG model retrieves semantically similar content from every source, returning search results that are a half-step off the mark.

3. Freshness blind spot

Vector space is time-blind. Semantically, last quarter’s promotion is identical to this quarter’s. But presenting customers with outdated offers shatters trust. We linked a significant percentage of customer complaints to search results that surfaced expired products, offers or features.

The Intent-First architecture pattern 

The Intent-First architecture pattern is the mirror image of the standard RAG deployment. In the RAG model, you retrieve, then route. In the Intent-First model, you classify before you route or retrieve.

Intent-First architectures use a lightweight language model to parse a query for intent and context, before dispatching to the most relevant content sources (documents, APIs, agents).

Comparison: Intent-first vs standard RAG

Cloud-native implementation

The Intent-First pattern is designed for cloud-native deployment, leveraging microservices, containerization and elastic scaling to handle enterprise traffic patterns.

Intent classification service

The classifier determines user intent before any retrieval occurs:

ALGORITHM: Intent Classification
INPUT: user_query (string)
OUTPUT: intent_result (object)

1. PREPROCESS query (normalize, expand contractions)
2. CLASSIFY using transformer model:
   - primary_intent ← model.predict(query)
   - confidence ← model.confidence_score()
3. IF confidence < 0.70 THEN
   - RETURN {
       requires_clarification: true,
       suggested_question: generate_clarifying_question(query)
     }
4. EXTRACT sub_intent based on primary_intent:
   - IF primary = "ACCOUNT" → check for ORDER_STATUS, PROFILE, etc.
   - IF primary = "SUPPORT" → check for DEVICE_ISSUE, NETWORK, etc.
   - IF primary = "BILLING" → check for PAYMENT, DISPUTE, etc.
5. DETERMINE target_sources based on intent mapping:
   - ORDER_STATUS → [orders_db, order_faq]
   - DEVICE_ISSUE → [troubleshooting_kb, device_guides]
   - MEDICATION → [formulary, clinical_docs] (healthcare)
6. RETURN {
     primary_intent,
     sub_intent,
     confidence,
     target_sources,
     requires_personalization: true/false
   }
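The classification steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the production classifier: a keyword-overlap function stands in for the transformer model, and the names `classify_intent`, `predict`, `KEYWORDS`, `CONTRACTIONS` and the clarifying question text are all hypothetical. Only the 0.70 confidence threshold and the intent-to-source mapping are taken from the pseudocode.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.70  # threshold from step 3 of the pseudocode

# Sub-intent -> content sources, from step 5 of the pseudocode.
SOURCE_MAP: Dict[str, List[str]] = {
    "ORDER_STATUS": ["orders_db", "order_faq"],
    "DEVICE_ISSUE": ["troubleshooting_kb", "device_guides"],
    "MEDICATION": ["formulary", "clinical_docs"],
}

# Hypothetical keyword rules standing in for a trained transformer model.
KEYWORDS: Dict[Tuple[str, str], List[str]] = {
    ("ACCOUNT", "ORDER_STATUS"): ["order", "delivery", "shipment"],
    ("SUPPORT", "DEVICE_ISSUE"): ["phone", "device", "activate"],
    ("CLINICAL", "MEDICATION"): ["medication", "prescription", "dose"],
}

# Tiny, illustrative contraction table (an assumption, not a full list).
CONTRACTIONS = {"where's": "where is", "what's": "what is", "can't": "cannot"}


@dataclass
class IntentResult:
    requires_clarification: bool
    primary_intent: Optional[str] = None
    sub_intent: Optional[str] = None
    confidence: float = 0.0
    target_sources: List[str] = field(default_factory=list)
    suggested_question: Optional[str] = None


def preprocess(query: str) -> List[str]:
    """Normalize case and apostrophes, expand contractions, split into words."""
    text = query.lower().replace("\u2019", "'")
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return re.findall(r"[a-z]+", text)


def predict(tokens: List[str]) -> Tuple[Optional[Tuple[str, str]], float]:
    """Stand-in model: confidence is the winning label's share of keyword hits."""
    hits = {label: sum(tok in words for tok in tokens)
            for label, words in KEYWORDS.items()}
    total = sum(hits.values())
    if total == 0:
        return None, 0.0
    best = max(hits, key=hits.get)
    return best, hits[best] / total


def classify_intent(query: str) -> IntentResult:
    label, confidence = predict(preprocess(query))
    if label is None or confidence < CONFIDENCE_THRESHOLD:
        # Low confidence: ask instead of guessing, as in step 3.
        return IntentResult(
            requires_clarification=True,
            confidence=confidence,
            suggested_question="Could you tell me a bit more about what you need?",
        )
    primary, sub = label
    return IntentResult(False, primary, sub, confidence, SOURCE_MAP[sub])
```

With this stand-in, an ambiguous query such as "I want to cancel" matches no keywords and comes back with `requires_clarification` set, which is the behavior the pattern depends on: retrieval never runs on a guess.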

Context-aware retrieval service

Once intent is classified, retrieval becomes targeted:

ALGORITHM: Context-Aware Retrieval
INPUT: query, intent_result, user_context
OUTPUT: ranked_documents

1. GET source_config for intent_result.sub_intent:
   - primary_sources ← sources to search
   - excluded_sources ← sources to skip
   - freshness_days ← max content age
2. IF intent requires personalization AND user is authenticated:
   - FETCH account_context from Account Service
   - IF intent = ORDER_STATUS:
       - FETCH recent_orders (last 60 days)
       - ADD to results
3. BUILD search filters:
   - content_types ← primary_sources only
   - max_age ← freshness_days
   - user_context ← account_context (if available)
4. FOR EACH source IN primary_sources:
   - documents ← vector_search(query, source, filters)
   - ADD documents to results
5. SCORE each document:
   - relevance_score ← vector_similarity × 0.40
   - recency_score ← freshness_weight × 0.20
   - personalization_score ← user_match × 0.25
   - intent_match_score ← type_match × 0.15
   - total_score ← SUM of above
6. RANK by total_score descending
7. RETURN top 10 documents
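The scoring and ranking in steps 5 to 7 can be sketched as follows. The four weights and the top-10 cutoff come from the pseudocode; the `Candidate` fields, the linear freshness decay and the function names are assumptions made for illustration, and the vector search itself is left out.

```python
from dataclasses import dataclass
from typing import List

# Weights from step 5 of the pseudocode.
WEIGHTS = {
    "relevance": 0.40,
    "recency": 0.20,
    "personalization": 0.25,
    "intent_match": 0.15,
}


@dataclass
class Candidate:
    doc_id: str
    vector_similarity: float  # 0..1, from the vector search
    age_days: int             # days since the content was published or updated
    user_match: float         # 0..1, fit with the account context
    type_match: float         # 0..1, fit between content type and intent


def freshness_weight(age_days: int, freshness_days: int) -> float:
    """Assumed linear decay: 1.0 for new content, 0.0 at the age limit."""
    if age_days >= freshness_days:
        return 0.0
    return 1.0 - age_days / freshness_days


def total_score(doc: Candidate, freshness_days: int) -> float:
    return (
        WEIGHTS["relevance"] * doc.vector_similarity
        + WEIGHTS["recency"] * freshness_weight(doc.age_days, freshness_days)
        + WEIGHTS["personalization"] * doc.user_match
        + WEIGHTS["intent_match"] * doc.type_match
    )


def rank(candidates: List[Candidate], freshness_days: int,
         top_k: int = 10) -> List[Candidate]:
    # The max_age filter drops stale content before it can be scored at all.
    fresh = [d for d in candidates if d.age_days <= freshness_days]
    fresh.sort(key=lambda d: total_score(d, freshness_days), reverse=True)
    return fresh[:top_k]


docs = [
    Candidate("old_promo", 0.95, 120, 0.5, 1.0),
    Candidate("new_promo", 0.80, 5, 0.5, 1.0),
    Candidate("faq", 0.60, 30, 0.0, 0.0),
]
print([d.doc_id for d in rank(docs, freshness_days=90)])
```

In this example the expired promotion has the highest vector similarity but is filtered out by age, so the current promotion ranks first. That is the freshness behavior a similarity-only ranking cannot provide.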

Healthcare-specific considerations

In healthcare deployments, the Intent-First pattern includes additional safeguards:

Healthcare intent categories:

  • Clinical: Medication questions, symptoms, care instructions

  • Coverage: Benefits, prior authorization, formulary

  • Scheduling: Appointments, provider availability

  • Billing: Claims, payments, statements

  • Account: Profile, dependents, ID cards

Critical safeguard: Clinical queries always include disclaimers and never replace professional medical advice. The system routes complex clinical questions to human support.

Handling edge cases

The edge cases are where systems fail. The Intent-First pattern includes specific handlers:

Frustration detection keywords:

  • Anger: “terrible,” “worst,” “hate,” “ridiculous”

  • Time: “hours,” “days,” “still waiting”

  • Failure: “useless,” “no help,” “doesn’t work”

  • Escalation: “speak to human,” “real person,” “manager”

When frustration is detected, skip search entirely and route to human support.
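A keyword check like this is simple to sketch. The phrase lists below are the ones given above; the function names, the `human_support` and `search` route labels, and the choice of whole-word regular expression matching are assumptions for illustration.

```python
import re
from typing import Optional

# Phrase lists from the article.
FRUSTRATION_KEYWORDS = {
    "anger": ["terrible", "worst", "hate", "ridiculous"],
    "time": ["hours", "days", "still waiting"],
    "failure": ["useless", "no help", "doesn't work"],
    "escalation": ["speak to human", "real person", "manager"],
}

# Whole-word matching avoids false hits such as "hate" inside "whatever".
_PATTERNS = {
    category: re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b"
    )
    for category, phrases in FRUSTRATION_KEYWORDS.items()
}


def detect_frustration(message: str) -> Optional[str]:
    """Return the first matching frustration category, or None."""
    text = message.lower().replace("\u2019", "'")
    for category, pattern in _PATTERNS.items():
        if pattern.search(text):
            return category
    return None


def route(message: str) -> str:
    """Skip search entirely when frustration is detected."""
    return "human_support" if detect_frustration(message) else "search"


print(route("This is the worst, I have been waiting for days"))
print(route("How do I activate my new phone"))
```

Plain keyword lists are noisy. A neutral question such as "How many days until my order arrives?" would also trip the time category, so a real deployment would likely combine these signals with context rather than rely on a single match.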

Cross-industry applications

The Intent-First pattern applies wherever enterprises deploy conversational AI over heterogeneous content:

Industry | Intent categories | Key benefit
Telecommunications | Sales, Support, Billing, Account, Retention | Prevents “cancel” misclassification
Healthcare | Clinical, Coverage, Scheduling, Billing | Separates clinical from administrative
Financial services | Retail, Institutional, Lending, Insurance | Prevents context mixing
Retail | Product, Orders, Returns, Loyalty | Ensures promotional freshness

Results

After implementing Intent-First architecture across telecommunications and healthcare platforms:

Metric | Impact
Query success rate | Nearly doubled
Support escalations | Reduced by more than half
Time to resolution | Reduced approximately 70%
User satisfaction | Improved roughly 50%
Return user rate | More than doubled

The return user rate proved most significant. When search works, users come back. When it fails, they abandon the channel entirely, increasing costs across all other support channels.

The strategic imperative

The conversational AI market will continue to experience hyper growth.

But enterprises that build and deploy typical RAG architectures will continue to fail … repeatedly.

AI will confidently give wrong answers, users will abandon digital channels out of frustration and support costs will go up instead of down.

Intent-First is a fundamental shift in how enterprises need to architect and build AI-powered customer conversations. It’s not about better models or more data. It’s about understanding what a user wants before you try to help them.

The sooner an organization realizes this as an architectural imperative, the sooner they will be able to capture the efficiency gains this technology is supposed to enable. Those that don’t will be debugging why their AI investments haven’t been producing expected business outcomes for many years to come.

The demo is easy. Production is hard. But the pattern for production success is clear: Intent First.

Sreenivasa Reddy Hulebeedu Reddy is a lead software engineer and enterprise architect.

Why enterprise AI pilots fail — and how to move to scaled execution

Presented by Insight Enterprises


Organizations today are trapped in proof-of-concept purgatory because yesterday’s models don’t work for today’s AI challenges.

Everyone’s racing to prove what AI could do. But the real winners are those who have realized that AI deployment is not a technology project — it is a core operational capability.

Success depends on execution, not just far-reaching visions of optimization.

At Insight, we’ve seen this cycle before. For more than 35 years, from our roots as a Value-Added Reseller (VAR) to our evolution as the leading Solutions Integrator, we’ve helped clients cut through the hype and make emerging technology actually work.

AI is following the same pattern. But this time, the stakes are higher, and the timelines are tighter. The organizations making real progress aren’t chasing pilots. They’re building the muscle to deploy, turning experiments and early momentum into measurable outcomes for the business.

What every technology “era” has taught us about AI success

MIT research estimates that 95% of enterprise AI initiatives fail to deliver measurable business value. This isn’t a failure of ambition. It’s a failure of deployment.

Too often, leaders are stuck in the “what”, obsessing over which model to use or how fast they can automate a single task. They get locked into long, costly discovery phases with traditional consultants that are all about theory and very little action.

We know this because we’ve lived it. When Insight first began experimenting with generative AI, our early pilots suffered from the same issues we see in the market: they looked great on slides but failed to scale.

We also hit cultural resistance and skills gaps. To overcome this, we had to stop treating AI as a “tool” and start treating it as a “capability.”

We started asking questions like, “Where will AI truly change how our people work and how our business performs, and how do we get there now?” or, “Given the AI tech advances, what is the art of the possible? How can we re-imagine our business processes and the work our people do to drive 10x improvement?”

Now, 93% of our 14,000+ teammates are using generative AI tools in their daily work, saving more than 8,500 hours every week through automation and productivity gains.

Building AI that actually delivers value

If there’s one thing we’ve learned from decades of transformation, it’s that success isn’t born from strategy decks or proofs of concept.

It’s earned in the details.

As we brought together our AI experts from across our business, we saw that the most successful client engagements shared three common traits, but not the kind that fit neatly into a diagram. They’re about how work gets done:

Fees tied to outcomes. The old model of billing for time and material is broken. Commercial models need to put skin in the game. We win when you see measurable business value, not when we complete a project.

Use tech to accelerate past theory. Instead of manual, multi-month discovery phases, look for partners who can accelerate your journey. We do this by providing our clients with an inventory of high-value use cases on day zero, so our consulting engagement starts with a roadmap to action, not just a listening tour.

Look at internal transformation. You cannot successfully deploy for your customers what you haven’t mastered internally. At Insight, we built our suite of AI offerings by first transforming our own business. Our internal story isn’t just a data point. It’s our proof of concept for cultural and operational change. It’s how we break the old perceptions and prove we understand the human side of deployment. In our 2024 survey of IT leaders, 44% identified skills gaps as a top barrier to transformation, and 74% said they have focused time and budget on building custom AI tools. Yet most still lack the deployment discipline to embed them.

That’s the real craft of deployment. It’s not theory, and it’s not hype. It is execution at scale.

And over the past few years, we’ve built on those lessons to give organizations a clear roadmap from ideation to ROI. Real success comes from connecting expertise, tools, and a robust delivery engine to get beyond vision and experimentation.

The 70% that separates talk from transformation

I love this concept from Boston Consulting Group (BCG) called the 10-20-70 rule.

10% of success comes from algorithms, 20% from data and technology, and 70% from people, process, and culture.

Most companies invest nearly all their energy in the first 30%. But the real advantage (yes, the durable kind) lives in the 70%. That’s where execution happens.

At Insight, we’ve built our entire business around that principle. From cloud to AI, our mission hasn’t changed. We turn technology into a capability that clients can scale and continuously improve.

Turning AI potential into real-world results

The “AI theory” era is ending. This next chapter belongs to the doers. To organizations ready to apply intelligence the same way they operationalized cloud or digital transformation.

It requires a delicate balance of innovation and governance, and of bold ideas and disciplined execution.

In fact, that philosophy is exactly what inspired Prism, our way of helping organizations bring clarity to complexity. Clients can get a full inventory of AI use cases for their entire business on day zero, skipping the months-long discovery phase of traditional consulting and prioritizing opportunities for immediate impact.

We know that transformation doesn’t begin with algorithms. It begins with mastery, and it’s the kind we’ve earned through decades of deploying and scaling what’s next.

How are you moving from hype to how?

Joyce Mullen is President & CEO at Insight Enterprises.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Everything in voice AI just changed: how enterprise AI builders can benefit

Despite lots of hype, “voice AI” has so far largely been a euphemism for a request-response loop. You speak, a cloud server transcribes your words, a language model thinks, and a robotic voice reads the text back. Functional, but not really conversational.

That all changed in the past week with a rapid succession of powerful, fast, and more capable voice AI model releases from Nvidia, Inworld, FlashLabs, and Alibaba’s Qwen team, combined with a massive talent acquisition and tech licensing deal by Google DeepMind and Hume AI.

Now, the industry has effectively solved the four “impossible” problems of voice computing: latency, fluidity, efficiency, and emotion.

For enterprise builders, the implications are immediate. We have moved from the era of “chatbots that speak” to the era of “empathetic interfaces.”

Here is how the landscape has shifted, the specific licensing models for each new tool, and what it means for the next generation of applications.

1. The death of latency – no more awkward pauses

The “magic number” in human conversation is roughly 200 milliseconds. That is the typical gap between one person finishing a sentence and another beginning theirs. Anything longer than 500ms feels like a satellite delay; anything over a second breaks the illusion of intelligence entirely.

Until now, chaining together ASR (speech recognition), LLMs (intelligence), and TTS (text-to-speech) resulted in latencies of 2–5 seconds.

Inworld AI’s release of TTS 1.5 directly attacks this bottleneck. By achieving a P90 latency of under 120ms, Inworld has effectively pushed the technology faster than human perception.

For developers building customer service agents or interactive training avatars, this means the “thinking pause” is dead.

Crucially, Inworld claims this model achieves “viseme-level synchronization,” meaning the lip movements of a digital avatar will match the audio frame-by-frame—a requirement for high-fidelity gaming and VR training.

It’s available via commercial API (with usage-based pricing tiers) and includes a free tier for testing.

Simultaneously, FlashLabs released Chroma 1.0, an end-to-end model that integrates the listening and speaking phases. By processing audio tokens directly via an interleaved text-audio token schedule (1:2 ratio), the model bypasses the need to convert speech to text and back again.

This “streaming architecture” allows the model to generate acoustic codes while it is still generating text, effectively “thinking out loud” in data form before the audio is even synthesized. This one is open source on Hugging Face under the enterprise-friendly, commercially viable Apache 2.0 license.
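To make the 1:2 interleaving concrete, here is a small sketch of what such a token schedule looks like. It only illustrates the ratio described above and is not FlashLabs’ implementation: the function name, the `("text", id)` and `("audio", id)` tags and the handling of leftover tokens are assumptions.

```python
from typing import List, Sequence, Tuple


def interleave(
    text_tokens: Sequence[int],
    audio_tokens: Sequence[int],
    text_per_group: int = 1,
    audio_per_group: int = 2,
) -> List[Tuple[str, int]]:
    """Merge two token streams into one schedule at a fixed text:audio ratio.

    Each group emits `text_per_group` text tokens followed by
    `audio_per_group` audio tokens. If one stream runs out first,
    the remainder of the other stream is emitted on its own.
    """
    schedule: List[Tuple[str, int]] = []
    t = a = 0
    while t < len(text_tokens) or a < len(audio_tokens):
        for tok in text_tokens[t:t + text_per_group]:
            schedule.append(("text", tok))
        t += text_per_group
        for tok in audio_tokens[a:a + audio_per_group]:
            schedule.append(("audio", tok))
        a += audio_per_group
    return schedule


for kind, tok in interleave([1, 2], [10, 11, 12, 13]):
    print(kind, tok)
```

Because audio tokens appear in the stream right after the text they belong to, a decoder can start synthesizing sound before the full text response exists, which is what removes the transcribe-then-speak delay.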

Together, they signal that speed is no longer a differentiator; it is a commodity. If your voice application has a 3-second delay, it is now obsolete. The standard for 2026 is immediate, interruptible response.

2. Solving “the robot problem” via full duplex

Speed is useless if the AI is rude. Traditional voice bots are “half-duplex”—like a walkie-talkie, they cannot listen while they are speaking. If you try to interrupt a banking bot to correct a mistake, it keeps talking over you.

Nvidia’s PersonaPlex, released last week, introduces a 7-billion parameter “full-duplex” model.

Built on the Moshi architecture (originally from Kyutai), it uses a dual-stream design: one stream for listening (via the Mimi neural audio codec) and one for speaking (via the Helium language model). This allows the model to update its internal state while the user is speaking, enabling it to handle interruptions gracefully.

Crucially, it understands “backchanneling”—the non-verbal “uh-huhs,” “rights,” and “okays” that humans use to signal active listening without taking the floor. This is a subtle but profound shift for UI design.

An AI that can be interrupted allows for efficiency. A customer can cut off a long legal disclaimer by saying, “I got it, move on,” and the AI will instantly pivot. This mimics the dynamics of a high-competence human operator.

The model weights are released under the Nvidia Open Model License (permissive for commercial use but with attribution/distribution terms), while the code is MIT Licensed.

3. High-fidelity compression leads to smaller data footprints

While Inworld and Nvidia focused on speed and behavior, open source AI powerhouse Qwen (parent company Alibaba Cloud) quietly solved the bandwidth problem.

Earlier today, the team released Qwen3-TTS, featuring a breakthrough 12Hz tokenizer. In plain English, this means the model can represent high-fidelity speech using an incredibly small amount of data—just 12 tokens per second.

For comparison, previous state-of-the-art models required significantly higher token rates to maintain audio quality. Qwen’s benchmarks show it outperforming competitors like FireredTTS 2 on key reconstruction metrics (MCD, CER, WER) while using fewer tokens.

Why does this matter for the enterprise? Cost and scale.

A model that requires less data to generate speech is cheaper to run and faster to stream, especially on edge devices or in low-bandwidth environments (like a field technician using a voice assistant on a 4G connection). It turns high-quality voice AI from a server-hogging luxury into a lightweight utility.
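The cost argument comes down to simple arithmetic, sketched below. The 12 tokens per second figure is from the article; the 50 Hz comparison rate is a hypothetical value chosen only for illustration, not a measurement of any specific model, and the calculation ignores details such as multiple codebooks per frame.

```python
def tokens_needed(duration_seconds: float, tokens_per_second: float) -> int:
    """Number of speech tokens needed to represent an utterance."""
    return round(duration_seconds * tokens_per_second)


QWEN3_TTS_RATE_HZ = 12          # from the article
HYPOTHETICAL_BASELINE_HZ = 50   # assumption, for comparison only

utterance_seconds = 10
low = tokens_needed(utterance_seconds, QWEN3_TTS_RATE_HZ)
high = tokens_needed(utterance_seconds, HYPOTHETICAL_BASELINE_HZ)

print(f"12 Hz tokenizer: {low} tokens for {utterance_seconds} s of speech")
print(f"50 Hz baseline:  {high} tokens for {utterance_seconds} s of speech")
print(f"Reduction: {1 - low / high:.0%}")
```

Every token has to be generated by the model and then streamed or decoded, so a lower token rate reduces both compute per second of speech and the data sent over the connection.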

It’s available on Hugging Face now under a permissive Apache 2.0 license, perfect for research and commercial application.

4. The missing ‘it’ factor: emotional intelligence

Perhaps the most significant news of the week—and the most complex—is Google DeepMind’s move to license Hume AI’s technology and hire its CEO, Alan Cowen, along with key research staff.

While Google integrates this tech into Gemini to power the next generation of consumer assistants, Hume AI itself is pivoting to become the infrastructure backbone for the enterprise.

Under new CEO Andrew Ettinger, Hume is doubling down on the thesis that “emotion” is not a UI feature, but a data problem.

In an exclusive interview with VentureBeat regarding the transition, Ettinger explained that as voice becomes the primary interface, the current stack is insufficient because it treats all inputs as flat text.

“I saw firsthand how the frontier labs are using data to drive model accuracy,” Ettinger says. “Voice is very clearly emerging as the de facto interface for AI. If you see that happening, you would also conclude that emotional intelligence around that voice is going to be critical—dialects, understanding, reasoning, modulation.”

The challenge for enterprise builders has been that LLMs are sociopaths by design—they predict the next word, not the emotional state of the user. A healthcare bot that sounds cheerful when a patient reports chronic pain is a liability. A financial bot that sounds bored when a client reports fraud is a churn risk.

Ettinger emphasizes that this isn’t just about making bots sound nice; it’s about competitive advantage.

When asked about the increasingly competitive landscape and the role of open source versus proprietary models, Ettinger remained pragmatic.

He noted that while open-source models like PersonaPlex are raising the baseline for interaction, the proprietary advantage lies in the data—specifically, the high-quality, emotionally annotated speech data that Hume has spent years collecting.

“The team at Hume ran headfirst into a problem shared by nearly every team building voice models today: the lack of high-quality, emotionally annotated speech data for post-training,” he wrote on LinkedIn. “Solving this required rethinking how audio data is sourced, labeled, and evaluated… This is our advantage. Emotion isn’t a feature; it’s a foundation.”

Hume’s models and data infrastructure are available via proprietary enterprise licensing.

5. The new enterprise voice AI playbook

With these pieces in place, the “Voice Stack” for 2026 looks radically different.

  • The Brain: An LLM (like Gemini or GPT-4o) provides the reasoning.

  • The Body: Efficient, open-weight models like PersonaPlex (Nvidia), Chroma (FlashLabs), or Qwen3-TTS handle the turn-taking, synthesis, and compression, allowing developers to host their own highly responsive agents.

  • The Soul: Platforms like Hume provide the annotated data and emotional weighting to ensure the AI “reads the room,” preventing the reputational damage of a tone-deaf bot.

Ettinger claims the market demand for this specific “emotional layer” is exploding beyond just tech assistants.

“We are seeing that very deeply with the frontier labs, but also in healthcare, education, finance, and manufacturing,” Ettinger told me. “As people try to get applications into the hands of thousands of workers across the globe who have complex SKUs… we’re seeing dozens and dozens of use cases by the day.”

This aligns with his comments on LinkedIn, where he revealed that Hume signed “multiple 8-figure contracts in January alone,” validating the thesis that enterprises are willing to pay a premium for AI that doesn’t just understand what a customer said, but how they felt.

From good enough to actually good

For years, enterprise voice AI was graded on a curve. If it understood the user’s intent 80% of the time, it was a success.

The technologies released this week have removed the technical excuses for bad experiences. Latency is solved. Interruption is solved. Bandwidth is solved. Emotional nuance is solvable.

“Just like GPUs became foundational for training models,” Ettinger wrote on his LinkedIn, “emotional intelligence will be the foundational layer for AI systems that actually serve human well-being.”

For the CIO or CTO, the message is clear: The friction has been removed from the interface. The only remaining friction is in how quickly organizations can adopt the new stack.

MemRL outperforms RAG on complex agent benchmarks without fine-tuning

A new technique developed by researchers at Shanghai Jiao Tong University and other institutions enables large language model agents to learn new skills without the need for expensive fine-tuning.

The researchers propose MemRL, a framework that gives agents the ability to develop episodic memory, the capacity to retrieve past experiences to create solutions for unseen tasks. MemRL allows agents to use environmental feedback to refine their problem-solving strategies continuously.

MemRL is part of a broader push in the research community to develop continual learning capabilities for AI applications. In experiments on key industry benchmarks, the framework outperformed other baselines such as RAG and other memory organization techniques, particularly in complex environments that require exploration and experiments. This suggests MemRL could become a critical component for building AI applications that must operate in dynamic real-world settings where requirements and tasks constantly shift.

The stability-plasticity dilemma

One of the central challenges in deploying agentic applications is adapting the underlying model to new knowledge and tasks after the initial training phase. Current approaches generally fall into two categories: parametric approaches, such as fine-tuning, and non-parametric approaches, such as RAG. But both come with significant trade-offs.

Fine-tuning, while effective for baking in new information, is computationally expensive and slow. More critically, it often leads to catastrophic forgetting, a phenomenon where newly acquired knowledge overwrites previously learned data, degrading the model’s general performance.

Conversely, non-parametric methods like RAG are fundamentally passive; they retrieve information based solely on semantic similarity (for example, between vector embeddings) without evaluating the actual utility of the information to the input query. This approach assumes that “similar implies useful,” which is often flawed in complex reasoning tasks.

The researchers argue that human intelligence solves this problem by maintaining “the delicate balance between the stability of cognitive reasoning and the plasticity of episodic memory.” In the human brain, stable reasoning (associated with the cortex) is decoupled from dynamic episodic memory. This allows humans to adapt to new tasks without “rewiring neural circuitry” (the rough equivalent of model fine-tuning).

Inside the MemRL framework

Inspired by humans’ use of episodic memory and cognitive reasoning, MemRL is designed to enable an agent to continuously improve its performance after deployment without compromising the stability of its backbone LLM. Instead of changing the model’s parameters, the framework shifts the adaptation mechanism to an external, self-evolving memory structure.

In this architecture, the LLM’s parameters remain completely frozen. The model acts effectively as the “cortex,” responsible for general reasoning, logic, and code generation, but it is not responsible for storing specific successes or failures encountered after deployment. This structure ensures stable cognitive reasoning and prevents catastrophic forgetting.

To handle adaptation, MemRL maintains a dynamic episodic memory component. Instead of storing plain text documents and static embedding values, as is common in RAG, MemRL organizes memory into “intent-experience-utility” triplets. These contain the user’s query (the intent), the specific solution trajectory or action taken (the experience), and a score, known as the Q-value, that represents how successful this specific experience was in the past (the utility).

Crucially for enterprise architects, this new data structure doesn’t require ripping out existing infrastructure. “MemRL is designed to be a ‘drop-in’ replacement for the retrieval layer in existing technology stacks and is compatible with various vector databases,” Muning Wen, a co-author of the paper and PhD candidate at Shanghai Jiao Tong University, told VentureBeat. “The existence and updating of ‘Q-Value’ is solely for better evaluation and management of dynamic data… and is independent of the storage format.”

This utility score is the key differentiator from classic RAG systems. At inference time, MemRL agents employ a “two-phase retrieval” mechanism. First, the system identifies memories that are semantically close to the query to ensure relevance. It then re-ranks these candidates based on their Q-value, effectively prioritizing proven strategies.
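The triplet structure and the two-phase retrieval can be sketched in a few lines. This is an illustration of the idea as described in the article, not the researchers’ code: the `Memory` class, the plain cosine similarity over small lists, and the `k_similar` and `k_final` parameters are assumptions.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Memory:
    """One intent-experience-utility triplet."""
    intent_embedding: Sequence[float]  # embedding of the user's query (intent)
    experience: str                    # solution trajectory or action taken
    q_value: float                     # learned utility of this experience


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_phase_retrieve(
    query_embedding: Sequence[float],
    bank: List[Memory],
    k_similar: int = 3,
    k_final: int = 1,
) -> List[Memory]:
    # Phase 1: keep the memories that are semantically closest to the query.
    candidates = sorted(
        bank,
        key=lambda m: cosine(query_embedding, m.intent_embedding),
        reverse=True,
    )[:k_similar]
    # Phase 2: re-rank those candidates by Q-value to favor proven strategies.
    return sorted(candidates, key=lambda m: m.q_value, reverse=True)[:k_final]


bank = [
    Memory([1.0, 0.0], "strategy A (closest match, rarely worked)", 0.2),
    Memory([0.9, 0.1], "strategy B (close match, usually worked)", 0.9),
    Memory([0.0, 1.0], "strategy C (unrelated task)", 1.0),
]
best = two_phase_retrieve([1.0, 0.0], bank, k_similar=2)
print(best[0].experience)
```

In the example, strategy C has the highest Q-value but is unrelated to the query, so phase one removes it. Strategy A is the nearest match but has a poor track record, so phase two prefers strategy B. Similarity alone would have returned A.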

The framework incorporates reinforcement learning directly into the memory retrieval process. When an agent attempts a solution and receives environmental feedback (i.e., success or failure), it updates the Q-value of the retrieved memory. This creates a closed feedback loop: over time, the agent learns to ignore distractor memories and prioritize high-value strategies without ever needing to retrain the underlying LLM.
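The article does not spell out the update rule, so the sketch below assumes a common one: move the Q-value a small step toward the observed reward, with 1.0 for success and 0.0 for failure. The function name and the learning rate of 0.1 are hypothetical.

```python
def update_q(q_value: float, success: bool, learning_rate: float = 0.1) -> float:
    """Assumed update: Q <- Q + learning_rate * (reward - Q)."""
    reward = 1.0 if success else 0.0
    return q_value + learning_rate * (reward - q_value)


# A distractor memory that keeps failing loses utility over time.
q = 0.5
for outcome in (False, False, False):
    q = update_q(q, outcome)
print(f"after three failures: {q:.4f}")

# A memory that succeeds gains utility.
print(f"after one success from 0.5: {update_q(0.5, True):.2f}")
```

Under this assumed rule, each update is a single arithmetic step on one stored number per memory, with no gradient computation and no change to the model weights.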

While adding a reinforcement learning step might sound like it adds significant latency, Wen noted that the computational overhead is minimal. “Our Q-value calculation is performed entirely on the CPU,” he said.

MemRL also possesses runtime continual learning capabilities. When the agent encounters a new scenario, the system uses the frozen LLM to summarize the new trajectory and adds it to the memory bank as a new triplet. This allows the agent to expand its knowledge base dynamically as it interacts with the world.

It is worth noting that the automation of the value assignment comes with a risk: If the system mistakenly validates a bad interaction, the agent could learn the wrong lesson. Wen acknowledges this “poisoned memory” risk but notes that unlike black-box neural networks, MemRL remains transparent and auditable. “If a bad interaction is mistakenly classified as a positive example… it may spread more widely,” Wen said. “However … we can easily fix it by removing the contaminated data from the memory bank or resetting their Q-values.”
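The cleanup Wen describes could look something like the following sketch. The dictionary layout, the `quarantine` helper and the sample entries are hypothetical, and the check for contamination is left to whatever audit process a team already has.

```python
def quarantine(memory_bank, is_contaminated, reset_to=None):
    """Drop contaminated entries, or reset their Q-value if reset_to is given.

    Returns a new list and leaves the original entries untouched.
    """
    cleaned = []
    for entry in memory_bank:
        if is_contaminated(entry):
            if reset_to is None:
                continue  # remove the entry from the bank entirely
            entry = {**entry, "q_value": reset_to}  # keep it, but forget its score
        cleaned.append(entry)
    return cleaned


bank = [
    {"intent": "refund order", "experience": "issue refund twice", "q_value": 0.9},
    {"intent": "refund order", "experience": "issue refund once", "q_value": 0.7},
]


def flagged(entry):
    """Stand-in for an audit result that marks a bad interaction."""
    return "twice" in entry["experience"]


removed = quarantine(bank, flagged)
reset = quarantine(bank, flagged, reset_to=0.0)
print(len(removed), reset[0]["q_value"])
```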

MemRL in action

The researchers evaluated MemRL against several baselines on four diverse industry benchmarks: BigCodeBench (code generation), ALFWorld (embodied navigation), Lifelong Agent Bench (OS and database interaction), and Humanity’s Last Exam (complex multidisciplinary reasoning).

The results showed that MemRL consistently outperformed baselines in both runtime learning (improving during the session) and transfer learning (generalizing to unseen tasks).

The advantages of this value-aware retrieval mechanism were most pronounced in exploration-heavy environments like ALFWorld. In this benchmark, which requires agents to navigate and interact with a simulated household environment, MemRL achieved a relative improvement of approximately 56% over MemP, another agentic memory framework. The researchers found that the reinforcement learning component effectively encouraged the agent to explore and discover solutions for complex tasks that similarity-based retrieval methods often failed to solve.

When the memory bank was frozen and tested on held-out sets to measure generalization, MemRL achieved the highest accuracy across benchmarks. For example, on the Lifelong Agent Bench, it improved significantly upon the standard RAG baseline on OS tasks. This indicates that the system does not merely memorize training data but effectively filters out low-value memories to retain high-utility experiences that generalize to new situations.

The broader picture for self-evolving agents

MemRL fits within a growing body of research focused on Memory-Based Markov Decision Processes (M-MDP), a formulation that frames memory retrieval as an active decision-making step rather than a passive search function. By treating retrieval as an action that can be optimized via reinforcement learning, frameworks like MemRL and similar approaches such as Memento are paving the way for more autonomous systems. 

For enterprise AI, this shift is significant. It suggests a future where agents can be deployed with a general-purpose LLM and then rapidly adapt to specific company workflows, proprietary databases, and unique problem sets through interaction alone. The key shift we’re seeing is frameworks that are treating applications as dynamic environments that they can learn from.

These emerging capabilities will allow organizations to maintain consistent, high-performance agents that evolve alongside their business needs, solving the problem of stale models without incurring the prohibitive costs of constant retraining.

It marks a transition in how we value data. “In a future where static data is about to be exhausted, the interaction experience generated by each intelligent agent during its lifespan will become the new fuel,” Wen said.


MIT’s new ‘recursive’ framework lets LLMs process 10 million tokens without context rot

Recursive language models (RLMs) are an inference technique developed by researchers at MIT CSAIL that treat long prompts as an external environment to the model. Instead of forcing the entire prompt into the model’s context window, the framework allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the text.

Rather than expanding context windows or summarizing old information, the MIT team reframes long-context reasoning as a systems problem. By letting models treat prompts as something they can inspect with code, recursive language models allow LLMs to reason over millions of tokens without retraining. This offers enterprises a practical path to long-horizon tasks like codebase analysis, legal review, and multi-step reasoning that routinely break today’s models.

Because the framework is designed as a wrapper around existing models, it can serve as a drop-in replacement for applications that make direct calls to LLMs.

The LLM context problem

While frontier models are becoming increasingly sophisticated at reasoning, their ability to process massive amounts of information is not scaling at the same rate. This bottleneck is driven by two distinct limitations: the hard physical constraint on how much text a model can process at once (context length) and “context rot.”

The challenge, the researchers argue, is whether it’s possible to scale the effective context size of general-purpose LLMs by orders of magnitude without retraining them. This capability is becoming increasingly important for enterprise applications, where LLMs are adopted for long-horizon tasks requiring the processing of millions of tokens. Alex Zhang, a co-author of the paper, argues that this challenge can’t be solved by simply expanding context windows.

“There is an entropy argument that implies you need exponentially more data samples as you increase the effective context window size,” Zhang told VentureBeat.

Current approaches to extending context often rely on compaction, where the model summarizes older parts of the conversation to free up space. However, this method fails for tasks requiring random access to specific details located in earlier parts of the prompt.

How RLMs work

The concept behind RLMs is drawn from “out-of-core” algorithms used in classical computing. These algorithms are designed to process datasets too large to fit into a computer’s main memory by keeping the data on a hard drive and fetching only the necessary chunks as needed.

RLMs apply this logic to generative AI. Instead of feeding a long prompt directly into the neural network, the framework loads the text as a string variable inside a Python coding environment. The LLM is given general context about the data (such as the total character count) but does not “see” the text initially.

Once the prompt is stored as a variable, the LLM acts as a programmer. It writes Python code to interact with the external variable, using standard commands to peek into the data. For example, the model might use regular expressions to search for specific keywords like “Chapter 1” or “financial results.”

When the code execution finds a relevant snippet, the RLM pulls only that specific chunk into its active context window for analysis.

For example, if the prompt is a massive book, the LLM might write a loop that identifies chapter boundaries and then triggers a sub-call to summarize each chapter individually.
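The chapter-by-chapter pattern can be sketched in Python as follows. This is an illustration, not the RLM codebase: `sub_call` is a hypothetical stand-in for a call to a worker model (here it just returns the first sentence), and the "Chapter N" marker format is an assumption about the input.

```python
import re


def sub_call(snippet: str) -> str:
    """Stand-in for a worker LLM call; returns the snippet's first sentence."""
    return snippet.strip().split(".")[0] + "."


def summarize_by_chapter(prompt: str) -> dict:
    # The orchestrating model sees only metadata about the prompt, such as its size.
    print(f"prompt length: {len(prompt)} characters")
    # Locate chapter boundaries with a regular expression.
    starts = [m.start() for m in re.finditer(r"Chapter \d+", prompt)]
    summaries = {}
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(prompt)
        chunk = prompt[start:end]
        title = re.match(r"Chapter \d+", chunk).group()
        # Only this chunk, not the whole prompt, is handed to the worker.
        summaries[title] = sub_call(chunk[len(title):])
    return summaries


book = ("Chapter 1 The heroine leaves home. She travels far. "
        "Chapter 2 She returns changed. The town has not.")
result = summarize_by_chapter(book)
print(result)
```

The full text stays in the `book` variable throughout; each sub-call receives only the slice between two chapter markers.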

The architecture typically involves two agents. A “root language model,” often a capability-heavy model like GPT-5, acts as the orchestrator. It plans the approach, writes the code, and manages the data flow within the REPL environment. A “recursive language model,” often a faster and cheaper model, acts as the worker. The root LM calls this worker to process the specific text snippets isolated by the code.

Because the prompt resides in the environment’s memory rather than the model’s context window, the system can handle inputs far larger than the model’s training limit. Importantly, to the end-user, the RLM behaves exactly like a standard model: It accepts a string and returns an answer. This allows enterprise teams to swap standard API calls for RLMs.

For developers looking to experiment, the RLM code is currently available on GitHub.

“A key argument for RLMs is that most complex tasks can be decomposed into smaller, ‘local’ sub-tasks,” Zhang said. “However, how to perform this context/problem decomposition is non-trivial, and the model must be capable of performing this.”

RLMs in action

To validate the framework, the researchers tested RLMs against base models and other agentic approaches like CodeAct and summary agents across a variety of long-context tasks, including retrieval and multi-hop question answering.

The results demonstrated strong performance gains at the 10 million+ token scale. On BrowseComp-Plus, a benchmark involving inputs of 6 to 11 million tokens, standard base models failed completely, scoring 0%. In contrast, the RLM powered by GPT-5 achieved a score of 91.33%, significantly outperforming the Summary Agent (70.47%) and CodeAct (51%).

The framework also excelled at tasks with high computational complexity. On OOLONG-Pairs, an information-dense reasoning benchmark where the difficulty scales quadratically with input length, base GPT-5 models failed catastrophically with a score of just 0.04%. The RLM achieved an F1 score (a balanced measure of precision and recall) of 58%, demonstrating emergent capabilities to handle dense tasks that paralyze standard models. Similarly, on code understanding tasks (CodeQA benchmark), the RLM more than doubled the performance of the base GPT-5 model, jumping from 24% to 62%.

Regarding the context rot problem, the data showed that while the base GPT-5 performance degrades rapidly as task complexity increases, RLM performance holds steady, consistently outperforming the base model on contexts longer than 16,000 tokens.

Despite the increased complexity of the workflow, RLMs often maintained comparable or lower average costs than the baselines. On the BrowseComp-Plus benchmark, the RLM was up to three times cheaper than the summarization baseline.

However, the researchers noted that while median costs are low, RLM trajectories are “long-tailed.” Outlier runs can become expensive if the model gets stuck in loops or performs redundant verifications. While GPT-5 was conservative in its sub-calls, the open-source Qwen3-Coder model sometimes attempted thousands of sub-calls for simple tasks.

“Today, you likely will have to implement your own guardrails and logic to control RLM behavior,” Zhang said. However, he hypothesizes that future models could be trained to manage their own compute budgets more effectively. Companies like Prime Intellect are planning to integrate RLM into the training process of models, possibly addressing the edge cases where the model’s inference budget spikes.
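One possible shape for such a guardrail is a wrapper that caps the number of sub-calls and caches repeated requests. This is a sketch only: the `SubCallBudget` class, its parameters and the uppercase "worker" are hypothetical, and a real deployment would likely track tokens or cost rather than a simple call count.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run tries to exceed its sub-call allowance."""


class SubCallBudget:
    """Wrap a worker function with a hard cap on calls and a result cache."""

    def __init__(self, worker, max_calls: int):
        self.worker = worker
        self.max_calls = max_calls
        self.calls = 0
        self.cache = {}

    def __call__(self, snippet: str) -> str:
        if snippet in self.cache:
            # A repeated request is served from the cache and costs nothing.
            return self.cache[snippet]
        if self.calls >= self.max_calls:
            raise BudgetExceeded(f"sub-call limit of {self.max_calls} reached")
        self.calls += 1
        self.cache[snippet] = self.worker(snippet)
        return self.cache[snippet]


# The worker here is a trivial placeholder for a model call.
guarded = SubCallBudget(worker=lambda s: s.upper(), max_calls=2)
print(guarded("chunk one"))
print(guarded("chunk one"))  # second request hits the cache
print(guarded.calls)
```

Raising an exception rather than silently truncating makes runaway trajectories visible to whatever is orchestrating the run.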

For enterprise architects deciding where to place their bets, the RLM framework offers a new tool for handling information-dense problems.

“I think RLMs are still extremely useful for chatbots (think long chat histories), but ultimately they argue for an alternative way of using LMs,” Zhang said. “I think RLMs work in tandem with standard retrieval methods like RAG; they do not serve as a replacement, and can be used in different settings or together.”

Why reinforcement learning plateaus without representation depth (and other key takeaways from NeurIPS 2025)

Every year, NeurIPS produces hundreds of impressive papers, and a handful that subtly reset how practitioners think about scaling, evaluation and system design. In 2025, the most consequential works weren’t about a single breakthrough model. Instead, they challenged fundamental assumptions that academicians and corporations have quietly relied on: Bigger models mean better reasoning, RL creates new capabilities, attention is “solved” and generative models inevitably memorize.

This year’s top papers collectively point to a deeper shift: AI progress is now constrained less by raw model capacity and more by architecture, training dynamics and evaluation strategy.

Below is a technical deep dive into five of the most influential NeurIPS 2025 papers — and what they mean for anyone building real-world AI systems.

1. LLMs are converging—and we finally have a way to measure it

Paper: Artificial Hivemind: The Open-Ended Homogeneity of Language Models

For years, LLM evaluation has focused on correctness. But in open-ended or ambiguous tasks like brainstorming, ideation or creative synthesis, there often is no single correct answer. The risk instead is homogeneity: Models producing the same “safe,” high-probability responses.

This paper introduces Infinity-Chat, a benchmark designed explicitly to measure diversity and pluralism in open-ended generation. Rather than scoring answers as right or wrong, it measures:

  • Intra-model collapse: How often the same model repeats itself

  • Inter-model homogeneity: How similar different models’ outputs are

The result is uncomfortable but important: Across architectures and providers, models increasingly converge on similar outputs — even when multiple valid answers exist.
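To make the two measurements concrete, here is a small Python sketch. It is not the benchmark's scoring method: word-set Jaccard overlap is an assumed stand-in for whatever similarity measure the paper uses, and the function names and sample outputs are invented for illustration.

```python
from itertools import combinations, product


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two responses (1.0 means identical vocabulary)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def intra_model_similarity(samples):
    """Mean pairwise similarity among one model's repeated samples."""
    pairs = list(combinations(samples, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def inter_model_similarity(samples_a, samples_b):
    """Mean similarity across every pairing of two models' samples."""
    pairs = list(product(samples_a, samples_b))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up responses to an open-ended prompt ("write a metaphor about time").
model_a = ["time is a river", "time is a river"]
model_b = ["time is a river", "time is a thief"]

print(intra_model_similarity(model_a))
print(inter_model_similarity(model_a, model_b))
```

Higher values on either measure indicate less diversity, so a product team could track both alongside its accuracy metrics.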

Why this matters in practice

For corporations, this reframes “alignment” as a trade-off. Preference tuning and safety constraints can quietly reduce diversity, leading to assistants that feel too safe, predictable or biased toward dominant viewpoints.

Takeaway: If your product relies on creative or exploratory outputs, diversity metrics need to be first-class citizens. 

2. Attention isn’t finished — a simple gate changes everything

Paper: Gated Attention for Large Language Models

Transformer attention has been treated as settled engineering. This paper proves it isn’t.

The authors introduce a small architectural change: Apply a query-dependent sigmoid gate after scaled dot-product attention, per attention head. That’s it. No exotic kernels, no massive overhead.
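A single gated attention head can be sketched in plain Python to show where the gate sits. This is an illustration under assumptions, not the paper's code: it assumes the gate is computed from the same token representation that produces the query, uses one head with no causal mask, and the weight names (`Wq`, `Wk`, `Wv`, `Wg`) are hypothetical.

```python
import math


def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_attention_head(X, Wq, Wk, Wv, Wg):
    """Scaled dot-product attention for one head, followed by a sigmoid gate."""
    Q = [matvec(x, Wq) for x in X]
    K = [matvec(x, Wk) for x in X]
    V = [matvec(x, Wv) for x in X]
    d_k, d_v = len(Q[0]), len(V[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        attended = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)]
        # The gate depends on the current token, so each position can damp
        # its own attention output element by element.
        gate = [sigmoid(g) for g in matvec(X[i], Wg)]
        out.append([g * a for g, a in zip(gate, attended)])
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# With zero gate weights every gate is sigmoid(0) = 0.5, halving the output.
print(gated_attention_head([[1.0, 2.0]], identity, identity, identity, zeros))
```

Because the sigmoid can push individual outputs toward zero, the gate gives the head a way to suppress an attention result instead of always passing it through.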

Across dozens of large-scale training runs — including dense and mixture-of-experts (MoE) models trained on trillions of tokens — this gated variant:

  • Improved stability

  • Reduced “attention sinks”

  • Enhanced long-context performance

  • Consistently outperformed vanilla attention

Why it works

The gate introduces:

  • Non-linearity in attention outputs

  • Implicit sparsity, suppressing pathological activations

This challenges the assumption that attention failures are purely data or optimization problems.

Takeaway: Some of the biggest LLM reliability issues may be architectural — not algorithmic — and solvable with surprisingly small changes.

3. RL can scale — if you scale in depth, not just data

Paper: 1,000-Layer Networks for Self-Supervised Reinforcement Learning

Conventional wisdom says RL doesn’t scale well without dense rewards or demonstrations. This paper shows that assumption is incomplete.

By scaling network depth aggressively, from the typical 2 to 5 layers to nearly 1,000 layers, the authors demonstrate dramatic gains in self-supervised, goal-conditioned RL, with performance improvements ranging from 2X to 50X.

The key isn’t brute force. It’s pairing depth with contrastive objectives, stable optimization regimes and goal-conditioned representations.

Why this matters beyond robotics

For agentic systems and autonomous workflows, this suggests that representation depth — not just data or reward shaping — may be a critical lever for generalization and exploration.

Takeaway: RL’s scaling limits may be architectural, not fundamental.

4. Why diffusion models generalize instead of memorizing

Paper: Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training

Diffusion models are massively overparameterized, yet they often generalize remarkably well. This paper explains why.

The authors identify two distinct training timescales:

  • One where generative quality rapidly improves

  • Another — much slower — where memorization emerges

Crucially, the memorization timescale grows linearly with dataset size, creating a widening window where models improve without overfitting.

Practical implications

This reframes early stopping and dataset scaling strategies. Memorization isn’t inevitable — it’s predictable and delayed.

Takeaway: For diffusion training, dataset size doesn’t just improve quality — it actively delays overfitting.

5. RL improves reasoning performance, not reasoning capacity

Paper: Does Reinforcement Learning Really Incentivize Reasoning in LLMs?

Perhaps the most strategically important result of NeurIPS 2025 is also the most sobering.

This paper rigorously tests whether reinforcement learning with verifiable rewards (RLVR) actually creates new reasoning abilities in LLMs — or simply reshapes existing ones.

Their conclusion: RLVR primarily improves sampling efficiency, not reasoning capacity. At large sample sizes, the base model often already contains the correct reasoning trajectories.
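The distinction between sampling efficiency and capacity is easier to see with a pass@k calculation, a common way to score models when multiple samples are allowed. The sketch below uses the standard unbiased pass@k estimator; treating pass@k as the relevant metric is an assumption here, and the sample counts are invented purely to show the shape of the effect.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k samples is correct, given c correct out of n drawn."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k slots
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for a single problem: out of 100 samples, the base model
# is right 2 times and the RL-tuned model is right 30 times.
for k in (1, 8, 64):
    base = pass_at_k(100, 2, k)
    tuned = pass_at_k(100, 30, k)
    print(f"k={k:>2}  base={base:.3f}  tuned={tuned:.3f}")
```

With these made-up counts the tuned model is far ahead at k=1, but the gap shrinks as k grows, because the base model's rare correct answers eventually show up when enough samples are drawn.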

What this means for LLM training pipelines

RL is better understood as:

  • A distribution-shaping mechanism

  • Not a generator of fundamentally new capabilities

Takeaway: To truly expand reasoning capacity, RL likely needs to be paired with mechanisms like teacher distillation or architectural changes — not used in isolation.

The bigger picture: AI progress is becoming systems-limited

Taken together, these papers point to a common theme:

The bottleneck in modern AI is no longer raw model size — it’s system design.

  • Diversity collapse requires new evaluation metrics

  • Attention failures require architectural fixes

  • RL scaling depends on depth and representation

  • Memorization depends on training dynamics, not parameter count

  • Reasoning gains depend on how distributions are shaped, not just optimized

For builders, the message is clear: Competitive advantage is shifting from “who has the biggest model” to “who understands the system.”

Maitreyi Chatterjee is a software engineer.

Devansh Agarwal currently works as an ML engineer at FAANG.

Claude Code just got updated with one of the most-requested user features

Anthropic’s open source standard, the Model Context Protocol (MCP), released in late 2024, allows users to connect AI models and the agents atop them to external tools in a structured, reliable format. It is the engine behind Anthropic’s hit AI agentic programming harness, Claude Code, allowing it to access numerous functions like web browsing and file creation immediately when asked.

But there was one problem: Claude Code typically had to “read” the instruction manual for every single tool available, regardless of whether it was needed for the immediate task, using up the available context that could otherwise be filled with more information from the user’s prompts or the agent’s responses.

At least until last night. The Claude Code team released an update that fundamentally alters this equation. Dubbed MCP Tool Search, the feature introduces “lazy loading” for AI tools, allowing agents to dynamically fetch tool definitions only when necessary.

It is a shift that moves AI agents from a brute-force architecture to something resembling modern software engineering—and according to early data, it effectively solves the “bloat” problem that was threatening to stifle the ecosystem.

The ‘Startup Tax’ on Agents

To understand the significance of Tool Search, one must understand the friction of the previous system. The Model Context Protocol (MCP), released in 2024 by Anthropic as an open source standard, was designed to be a universal way of connecting AI models to data sources and tools, from GitHub repositories to local file systems.

However, as the ecosystem grew, so did the “startup tax.”

Thariq Shihipar, a member of the technical staff at Anthropic, highlighted the scale of the problem in the announcement.

“We’ve found that MCP servers may have up to 50+ tools,” Shihipar wrote. “Users were documenting setups with 7+ servers consuming 67k+ tokens.”

In practical terms, this meant a developer using a robust set of tools might sacrifice 33% or more of their 200,000-token context window before they even typed a single character of a prompt, as AI newsletter author Aakash Gupta pointed out in a post on X.

The model was effectively “reading” hundreds of pages of technical documentation for tools it might never use during that session.

Community analysis provided even starker examples.

Gupta further noted that a single Docker MCP server could consume 125,000 tokens just to define its 135 tools.

“The old constraint forced a brutal tradeoff,” he wrote. “Either limit your MCP servers to 2-3 core tools, or accept that half your context budget disappears before you start working.”

How Tool Search Works

The solution Anthropic rolled out — which Shihipar called “one of our most-requested features on GitHub” — is elegant in its restraint. Instead of preloading every definition, Claude Code now monitors context usage.

According to the release notes, the system automatically detects when tool descriptions would consume more than 10% of the available context.

When that threshold is crossed, the system switches strategies. Instead of dumping raw documentation into the prompt, it loads a lightweight search index.

When the user asks for a specific action—say, “deploy this container”—Claude Code doesn’t scan a massive, pre-loaded list of 200 commands. Instead, it queries the index, finds the relevant tool definition, and pulls only that specific tool into the context.
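The threshold-then-search behavior can be sketched in Python. None of this reflects Claude Code's actual implementation or API: the function names, the four-characters-per-token estimate and the keyword-overlap search are all assumptions made for illustration, with only the 10% threshold taken from the description above.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate (about 4 characters per token); an assumption."""
    return max(1, len(text) // 4)


def build_context(tools: dict, context_limit: int, threshold: float = 0.10):
    """Preload all tool definitions, or fall back to a lightweight name index."""
    total = sum(estimate_tokens(definition) for definition in tools.values())
    if total > threshold * context_limit:
        return "index", sorted(tools)   # only tool names enter the context
    return "preload", dict(tools)       # small enough to load everything


def search_tools(tools: dict, request: str, top_n: int = 1) -> dict:
    """Return only the definitions whose text best overlaps the request."""
    words = set(request.lower().split())
    ranked = sorted(
        tools,
        key=lambda name: len(words & set(tools[name].lower().split())),
        reverse=True,
    )
    return {name: tools[name] for name in ranked[:top_n]}


# Two made-up tools with deliberately long definitions.
tools = {
    "docker_deploy": "deploy a container image to a target host " * 50,
    "slack_notify": "send a notification message to a channel " * 50,
}

mode, loaded = build_context(tools, context_limit=2_000)
print(mode, loaded)
print(list(search_tools(tools, "deploy this container")))
```

In this toy run the definitions exceed 10% of a small context budget, so only the tool names are loaded up front and the full definition for `docker_deploy` is fetched when the request calls for it.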

“Tool Search flips the architecture,” Gupta analyzed. “The token savings are dramatic: from ~134k to ~5k in Anthropic’s internal testing. That’s an 85% reduction while maintaining full tool access.”

For developers maintaining MCP servers, this shifts the optimization strategy.

Shihipar noted that the `server instructions` field in the MCP definition—previously a “nice to have”—is now critical. It acts as the metadata that helps Claude “know when to search for your tools, similar to skills.”

‘Lazy Loading’ and Accuracy Gains

While the token savings are the headline metric—saving money and memory is always popular—the secondary effect of this update might be more important: focus.

LLMs are notoriously sensitive to “distraction.” When a model’s context window is stuffed with thousands of lines of irrelevant tool definitions, its ability to reason decreases. It creates a “needle in a haystack” problem where the model struggles to differentiate between similar commands, such as `notification-send-user` versus `notification-send-channel`.

Boris Cherny, Head of Claude Code, emphasized this in his reaction to the launch on X: “Every Claude Code user just got way more context, better instruction following, and the ability to plug in even more tools.”

The data backs this up. Internal benchmarks shared by the community indicate that enabling Tool Search improved the accuracy of the Opus 4 model on MCP evaluations from 49% to 74%.

For the newer Opus 4.5, accuracy jumped from 79.5% to 88.1%.

By removing the noise of hundreds of unused tools, the model can dedicate its “attention” mechanisms to the user’s actual query and the relevant active tools.

Maturing the Stack

This update signals a maturation in how we treat AI infrastructure. In the early days of any software paradigm, brute force is common. But as systems scale, efficiency becomes the primary engineering challenge.

Aakash Gupta drew a parallel to the evolution of Integrated Development Environments (IDEs) like VSCode or JetBrains. “The bottleneck wasn’t ‘too many tools.’ It was loading tool definitions like 2020-era static imports instead of 2024-era lazy loading,” he wrote. “VSCode doesn’t load every extension at startup. JetBrains doesn’t inject every plugin’s docs into memory.”

By adopting “lazy loading”—a standard best practice in web and software development—Anthropic is acknowledging that AI agents are no longer just novelties; they are complex software platforms that require architectural discipline.

Implications for the Ecosystem

For the end user, this update is seamless: Claude Code simply feels “smarter” and retains more memory of the conversation. But for the developer ecosystem, it opens the floodgates.

Previously, there was a “soft cap” on how capable an agent could be. Developers had to curate their toolsets carefully to avoid lobotomizing the model with excessive context. With Tool Search, that ceiling is effectively removed. An agent can theoretically have access to thousands of tools—database connectors, cloud deployment scripts, API wrappers, local file manipulators—without paying a penalty until those tools are actually touched.

It turns the “context economy” from a scarcity model into an access model. As Gupta summarized, “They’re not just optimizing context usage. They’re changing what ‘tool-rich agents’ can mean.”

The update is rolling out immediately for Claude Code users. For developers building MCP clients, Anthropic recommends implementing the `ToolSearchTool` to support this dynamic loading, ensuring that as the agentic future arrives, it doesn’t run out of memory before it even says hello.

Why MongoDB thinks better retrieval — not bigger models — is the key to trustworthy enterprise AI

Agentic systems and enterprise search depend on strong data retrieval that works efficiently and accurately. Database provider MongoDB thinks its newest embeddings models help solve falling retrieval quality as more AI systems go into production.

As agentic and RAG systems move into production, retrieval quality is emerging as a quiet failure point — one that can undermine accuracy, cost, and user trust even when models themselves perform well.

The company launched four new versions of its embeddings and reranking models. Voyage 4 will be available in four modes: voyage-4 embedding, voyage-4-large, voyage-4-lite, and voyage-4-nano.  

MongoDB said voyage-4 serves as its general-purpose embedding model, while voyage-4-large is its flagship. The voyage-4-lite model targets tasks that need low latency and lower costs, and voyage-4-nano is intended for local development and testing environments or for on-device data retrieval.

Voyage-4-nano is also MongoDB’s first open-weight model. All models are available via an API and on MongoDB’s Atlas platform. 

The company said the models outperform similar models from Google and Cohere on the RTEB benchmark. Hugging Face’s RTEB benchmark puts Voyage 4 as the top embedding model. 

“Embedding models are one of those invisible choices that can really make or break AI experiences,” Frank Liu, product manager at MongoDB, said in a briefing. “You get them wrong, your search results will feel pretty random and shallow, but if you get them right, your application suddenly feels like it understands your users and your data.”

He added that the goal of the Voyage 4 models is to improve the retrieval of real-world data, which often collapses once agentic and RAG pipelines go into production. 

MongoDB also released a new multimodal embedding model, voyage-multimodal-3.5, that can handle documents that include text, images, and video. This model vectorizes the data and extracts semantic meaning from the tables, graphics, figures, and slides typically found in enterprise documents.

Enterprises’ embedding problems

For enterprises, an agentic system is only as good as its ability to reliably retrieve the right information at the right time. This requirement becomes harder as workloads scale and context windows fragment.

Several model providers target that layer of agentic AI. Google’s Gemini Embedding model topped the embedding leaderboards, and Cohere launched its Embed 4 multimodal model, which processes documents more than 200 pages long. Mistral said its coding-embedding model, Codestral Embedding, outperforms Cohere, Google, and even MongoDB’s Voyage Code 3. MongoDB argues that benchmark performance alone doesn’t address the operational complexity enterprises face in production.

MongoDB said many clients have found that their data stacks cannot handle context-aware, retrieval-intensive workloads in production. The company said it’s seeing more fragmentation with enterprises having to stitch together different solutions to connect databases with a retrieval or reranking model. To help customers who don’t want fragmented solutions, the company is offering its models through a single data platform, Atlas. 

MongoDB’s bet is that retrieval can’t be treated as a loose collection of best-of-breed components anymore. For enterprise agents to work reliably at scale, embeddings, reranking, and the data layer need to operate as a tightly integrated system rather than a stitched-together stack.