Agentic AI is moving rapidly from the developer terminal to the corporate world.
On Tuesday, OpenAI announced a major update to its agentic AI platform Codex, introducing domain-specific workflows, a fast, semi-private web hosting feature for enterprises called “Sites,” and an in-place editing tool named “Annotations.”
The release marks a deliberate strategy to transform Codex from a specialized programming assistant into an everyday operating environment for business professionals.
Non-developers—including financial analysts, marketers, operators, and researchers—now constitute approximately 20% of the platform’s 5 million weekly users and are adopting the technology three times faster than traditional engineers, according to research shared by OpenAI with VentureBeat and other outlets.
OpenAI is capitalizing on this shift to position Codex as the premier application for white-collar task automation. The timing looks strategic: the announcement lands the same week that Microsoft, OpenAI’s primary investor turned business rival, kicks off its annual BUILD developer conference in San Francisco, where a slate of competing enterprise productivity tools is expected. It also comes hot on the heels of Anthropic’s rapid adoption among knowledge workers via its Claude Cowork and Claude Code platforms.
For business users, the most critical technical upgrade is the elimination of full-document regeneration. Previously, instructing an AI to update a specific chart or spreadsheet calculation often meant the model had to rewrite the entire file, which frequently broke custom formatting or introduced hallucinations.
OpenAI addresses this through Annotations, a localized context-scoping mechanism. As demonstrated in the company’s release materials, the platform maps a document’s underlying data schema.
When a user highlights a specific segment—such as a block of cells in a financial model—Codex isolates those exact data arrays.
If an analyst prompts the system to “Add a chart of revenue, EBITDA, and net income over the selected years,” the model executes the code strictly within that boundary, generating the visualization while leaving the surrounding cell dependencies, styles, and unselected formulas completely untouched.
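The scoping idea can be illustrated with a minimal sketch. This is not OpenAI’s implementation; the `apply_scoped_edit` function, the dictionary-based sheet, and the sample values are all hypothetical, chosen only to show an edit that touches the selected cells and nothing else.

```python
from typing import Callable, Dict, Set, Tuple

Cell = Tuple[int, int]  # (row, column); a hypothetical addressing scheme


def apply_scoped_edit(
    sheet: Dict[Cell, float],
    selection: Set[Cell],
    edit: Callable[[float], float],
) -> Dict[Cell, float]:
    """Return a copy of the sheet with `edit` applied only to selected cells."""
    updated = dict(sheet)  # copy so the original sheet is never mutated
    for cell in selection:
        if cell in updated:
            updated[cell] = edit(updated[cell])
    return updated


# Hypothetical data: row 0 is selected, row 1 is outside the selection.
sheet = {(0, 0): 100.0, (0, 1): 120.0, (1, 0): 40.0, (1, 1): 55.0}
selection = {(0, 0), (0, 1)}

result = apply_scoped_edit(sheet, selection, lambda value: value * 2)
untouched = {cell: value for cell, value in result.items() if cell not in selection}
```

Because the function works on a copy and iterates only over the selection, every cell outside the boundary keeps its original value, which is the property the article attributes to Annotations.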
To further anchor Codex in daily enterprise operations, OpenAI has introduced modular software bundles and a rapid-prototyping hosting environment.
The company is rolling out six role-specific plugins that aggregate 62 popular business applications (including Snowflake, Figma, and Salesforce) and 110 automated skills straight out of the box.
Data Analytics: Unifies cloud environments like Snowflake, Databricks Genie, Hex, and Tableau to translate natural language inquiries into data reports and change-analysis dashboards.
Creative Production: Connects Figma, Canva, Shutterstock, Picsart, and Fal to generate and iterate on ad variations, campaign boards, and e-commerce assets directly from text briefs.
Sales: Integrates pipeline infrastructure across Salesforce, HubSpot, Slack, Outreach, Clay, Rox, and Actively to automate follow-up communications, close plans, and account risk reviews.
Product Design: Bridges Figma and Canva environments to audit live user journeys and transform static wireframes into clickable prototypes.
Public Equity & Investment Banking: Syncs institutional market feeds—including Moody’s, Daloopa, Datasite, FactSet, LSEG, S&P, PitchBook, and Hebbia—to streamline financial modeling, competitive landscaping, and pitch book preparation.
These integrations allow distinct departments—from data analytics and creative production to sales and investment banking—to automate complex, multi-step workflows without requiring IT to build custom API connections.
Concurrently, the new Sites feature introduces an interactive canvas that converts static data inputs or text documents into functional, web-hosted internal applications.
Rolling out in preview for Business and Enterprise tiers, Sites allow cross-functional teams to bypass front-end development.
Financial leaders, for example, can transform a static spreadsheet into an interactive scenario planner shared via a secure workspace URL, allowing executives to tweak assumptions in a live web app rather than clicking through document tabs.
Instead of static decks, Sites promise to keep enterprises updated on their latest metrics and important information in an easily digestible way.
A critical operational distinction in this rollout centers on exactly where these new features can be executed. Codex’s existing infrastructure runs natively across multiple surfaces, including IDE extensions and the terminal command line.
However, the release documentation notes that Sites are rolling out “through the Codex app” and that plugins are managed via a “Codex plugin directory”.
An OpenAI spokesperson confirmed that Plugins and Sites are available in the CLI and desktop app, while Sites are hosted by OpenAI.
These updates operate entirely within OpenAI’s closed, proprietary enterprise licensing model. Unlike open-source frameworks, enterprise clients do not maintain code-level ownership over Codex’s integration nodes.
Instead, system administrators manage deployment through centralized workspace settings, giving them explicit authority to enable or disable hosted “Sites” and restrict underlying application permissions.
These new capabilities deploy seamlessly on top of Codex’s existing commercial framework. Users will continue to access the agent via established baseline subscription tiers—such as the individual “Plus” plan ($20/month) or the high-volume “Pro” plan ($100/month)—or through a separate, seat-free pay-as-you-go model that draws down pre-purchased utility credits.
Enterprise AI agents are stalling — not because of model performance, but because of permissioning. Every agentic workflow eventually hits the same wall: what is this agent allowed to touch, on whose behalf, and how does the system know?
Workday’s answer is to make its existing system of record the governance layer for agents. Gerrit Kazmaier, the company’s president for product and technology, told VentureBeat in an interview that customers often struggle when they cobble together solutions for their agents.
“Sana makes sure the integrity of the approvals and security model is always adhered to,” Kazmaier said. “Frankly, that’s where we see customers struggling when they try to build do-it-yourself AI by just accessing raw data, so the richness of the security model gets lost, and the results become overly broad.”
Workday, which launched Sana in March, expanded its partnership with Google to bring its Sana agent system of record to Gemini Enterprise, so agents built on Sana are also discoverable there.
Kazmaier said the biggest hurdle they faced was ensuring agent accuracy, especially for HR and finance users.
“Almost right is not acceptable,” Kazmaier said. “Think about paying people correctly, closing the books or managing work schedules reliably.”
Accuracy is harder to evaluate here than in most AI contexts. Policy configurations, role-based security, and organizational hierarchies are deeply interrelated — a small error compounds. And unlike most generative AI outputs, HR and finance queries often lack a correction loop. By the time a paycheck processes incorrectly or an interview is scheduled wrong, the damage is done.
Workday addressed this by building Gemini in as its base reasoning layer, then adding its context engine and business process logic on top. Workday also added verification and classification models that “interrogate” outputs before execution.
Accuracy and identity, it turns out, are the same question: does the system know enough about the agent, the authorizing human, and the current state of the record to act correctly?
Workday’s advantage is that it can infer its customers’ organizational structures from the data they provide. Already, third-party identity providers like Okta verify their information by checking Workday, so its context is the system of record for many enterprises. Kazmaier said the Sana Self-Service Agent uses Gemini as the conversational surface to trigger the workflow. The user is then authenticated and authorized through Workday’s identity and security model. Sana agents will only act on behalf of that user and work within their current permissions.
Audit trails follow the same logic: Gemini retains only interaction logs, while the main audit remains within Workday and its customer.
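A minimal sketch of the “act only on behalf of the user, within their current permissions” rule might look like the following. It is an illustration under stated assumptions, not Workday’s or Google’s API: the `User`, `AuditLog`, and `run_agent_action` names, and the permission strings, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass(frozen=True)
class User:
    user_id: str
    permissions: FrozenSet[str]  # permissions held in the system of record


@dataclass
class AuditLog:
    entries: List[str] = field(default_factory=list)


def run_agent_action(user: User, action: str, audit: AuditLog) -> bool:
    """Allow `action` only if the authorizing user already holds that permission."""
    allowed = action in user.permissions
    outcome = "allowed" if allowed else "denied"
    # Every decision is recorded, whether or not the action proceeds.
    audit.entries.append(f"{user.user_id}:{action}:{outcome}")
    return allowed


audit = AuditLog()
analyst = User("u-17", frozenset({"view_own_payslip"}))

can_view = run_agent_action(analyst, "view_own_payslip", audit)
can_edit = run_agent_action(analyst, "edit_payroll", audit)
```

The agent never carries permissions of its own in this sketch; it inherits whatever the authorizing user holds at the moment of the request, and the audit record stays with the system that made the decision.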
For many practitioners in HR and finance, keeping the permission and governance layer inside the agent system of record is essential in regulated industries.
“It has to live in the system of record, that’s not a preference, that’s the only way it works,” said Dan Obendorfer, director of product at Würk, in an email to VentureBeat. “If your permissions are defined somewhere outside of where the data actually lives, you’ve already lost.”
Kadan Stadelmann, chief technology officer and co-founder of Compance.AI, made the same point separately. “Without agent ownership, performance, costs or actions, chaos ensues.”
Enabling LLMs to acquire new knowledge after training remains a major hurdle for enterprise AI — current solutions are either too expensive, too slow, or constrained by context window limits.
MeMo, a framework from researchers at multiple universities, encodes new knowledge into a dedicated smaller memory model that operates separately from the main LLM.
The modular architecture works with both open- and closed-source models and sidesteps the complexity of RAG pipelines and full model retraining.
Experiments show that MeMo handles complex queries reliably even when retrieval pipelines are noisy. It avoids the catastrophic forgetting associated with direct fine-tuning and provides a cost-effective pathway for continuous knowledge updates.
Large language models are frozen after training and their internal knowledge remains static until they undergo subsequent, computationally massive updates.
Currently, developers rely on three main approaches to integrate external knowledge into an LLM, each with distinct drawbacks:
Non-parametric methods, such as retrieval-augmented generation (RAG) and in-context learning, retrieve relevant documents from an external database and insert them directly into the model’s prompt. While popular, these methods are limited by context window sizes.
As Armando Solar-Lezama, a co-author of the paper, told VentureBeat, “Vector databases have a fundamentally difficult job of encoding the full semantics of a chunk of text in a single vector, and then match that vector to a query, even when the relevance of the chunk… may only be apparent in the context of other chunks.”
The researchers note that the semantic similarity of embeddings often does not correspond to what a user’s query actually requires. Processing thousands of retrieved tokens also creates substantial computational overhead and inference latency. Most problematically, RAG systems are highly sensitive to noise. Irrelevant or poorly retrieved passages often degrade the model’s final response.
Parametric methods, like continual pretraining or supervised fine-tuning, attempt to internalize new knowledge directly into the LLM’s weights. Updating modern, massive LLMs is prohibitively expensive and typically impossible for proprietary, closed-source models hidden behind APIs. Fine-tuning is also prone to causing catastrophic forgetting. Forcing the model to adapt to new corporate data often erodes its previously acquired reasoning capabilities and safety guardrails.
Latent memory methods, such as context compression, offer a middle ground. They compress knowledge into compact “soft tokens” or representations that are added to the model’s context during inference. The fatal flaw here is “representation coupling.” The compressed memory is strictly bound to the model architecture that produced it; you can’t transfer a latent memory trained on an open-source model to a closed-source one.
The MeMo (Memory as a Model) framework introduces a modular architecture featuring two separate components. The MEMORY model is a small language model trained specifically to encode new knowledge into its parameters. The EXECUTIVE model is a frozen, off-the-shelf LLM that functions as the reasoning engine. When a user asks a question, the EXECUTIVE model treats the MEMORY model as an external oracle, issuing targeted sub-queries to gather facts and synthesizing those facts into a final answer.
The core design principle driving MeMo is the concept of “reflections.” Reflections are targeted question-answer (QA) pairs designed to capture every possible angle of a knowledge corpus. Rather than forcing the AI to process a massive, unstructured document corpus during training, MeMo uses a GENERATOR model to distill the raw text into thousands of targeted QA pairs. The MEMORY model is then fine-tuned on this dataset to answer questions using only its parametric knowledge without the need to read retrieved context.
At inference time, the interaction between the two models follows a structured, three-stage protocol:
1. The EXECUTIVE model decomposes a user’s complex query into a set of atomic sub-questions. The MEMORY model answers each independently to establish the basic facts.
2. Using those initial clues, the EXECUTIVE model issues follow-up queries to narrow down candidate entities until it confidently converges on a specific target.
3. Finally, the EXECUTIVE model queries the MEMORY model for supporting facts about that target entity and synthesizes the retrieved snippets into a cohesive answer.
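The three stages above can be sketched roughly as follows. This is a simplified illustration, not the researchers’ code: the MEMORY model is replaced by a lookup table, the EXECUTIVE model’s question writing is replaced by hypothetical callables (`decompose`, `narrow`, `support`), and stage 2 runs a single follow-up rather than iterating until convergence.

```python
from typing import Callable, Dict, List

MemoryModel = Callable[[str], str]


def make_stub_memory(facts: Dict[str, str]) -> MemoryModel:
    """Stand-in for the fine-tuned MEMORY model: answers from a fixed table."""
    return lambda question: facts.get(question, "unknown")


def answer_query(
    query: str,
    decompose: Callable[[str], List[str]],
    narrow: Callable[[List[str]], str],
    support: Callable[[str], str],
    memory: MemoryModel,
) -> str:
    # Stage 1: break the query into atomic sub-questions and collect basic facts.
    clues = [memory(question) for question in decompose(query)]
    # Stage 2: use the clues to ask a follow-up that pins down the target entity.
    target = memory(narrow(clues))
    # Stage 3: fetch supporting facts about the target and compose the answer.
    evidence = memory(support(target))
    return f"{target}: {evidence}"


# Hypothetical corpus facts, for illustration only.
memory = make_stub_memory({
    "Which team owns expense approvals?": "the finance team",
    "Which policy does the finance team follow?": "Policy FIN-7",
    "What does Policy FIN-7 require?": "manager sign-off above $500",
})

result = answer_query(
    "What approval do large expenses need?",
    decompose=lambda q: ["Which team owns expense approvals?"],
    narrow=lambda clues: f"Which policy does {clues[0]} follow?",
    support=lambda target: f"What does {target} require?",
    memory=memory,
)
```

In a real deployment both roles would be language models exchanging natural-language messages; the fixed callables here only make the control flow visible.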
This architecture merges the strengths of the three existing AI memory paradigms while bypassing their pitfalls. It leverages off-the-shelf frontier models by keeping memory storage separate from reasoning, guaranteeing compatibility with both open-weight and closed API models. It internalizes knowledge directly into parameters, but isolates the updates to a smaller, dedicated MEMORY model to protect the reasoning engine. Finally, it creates a queryable memory artifact that is not tied to any specific model and can be used with different LLM families.
Managing an AI’s memory requires continuous updates as company policies change and new reports are published. Normally, updating a model’s parameters requires retraining it from scratch on both the old and the new data combined. As the knowledge base grows, this cumulative retraining cost becomes unmanageable.
To handle continual updates efficiently, MeMo relies on a technique called “model merging.” Instead of a massive joint retraining phase, MeMo trains a new, independent MEMORY model exclusively on the newly added documents. The system derives a “task vector” representing the parameter changes learned from the fresh data. These updates are then mathematically merged into the weights of the original MEMORY model.
This approach reduces the computing hours required to keep the system current while avoiding the interference that causes catastrophic forgetting.
This efficiency comes with a trade-off: model merging incurs an 11% to 19% accuracy drop compared to a full retrain, depending on the reasoning model used.
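As a rough illustration of the merging step, the sketch below uses plain task-vector arithmetic: subtract the shared starting weights from the newly trained model to get a delta, then add that delta to the existing MEMORY model. The assumptions are loud ones: both models start from the same base checkpoint, the merge is simple addition with a hypothetical `scale` factor, and the toy weights are invented. The paper’s exact merging procedure may differ.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def task_vector(base: Weights, tuned: Weights) -> Weights:
    """Parameter changes learned during fine-tuning: tuned minus base."""
    return {
        name: [t - b for t, b in zip(tuned[name], base[name])]
        for name in base
    }


def merge(memory: Weights, delta: Weights, scale: float = 1.0) -> Weights:
    """Add a (scaled) task vector to the existing MEMORY model's weights."""
    return {
        name: [m + scale * d for m, d in zip(memory[name], delta[name])]
        for name in memory
    }


# Toy weights, for illustration only.
base = {"w": [0.0, 1.0]}          # shared starting checkpoint
memory_v1 = {"w": [0.5, 1.0]}     # MEMORY model trained on the original corpus
memory_new = {"w": [0.0, 1.25]}   # model trained only on the new documents

delta = task_vector(base, memory_new)
merged = merge(memory_v1, delta)
```

No gradient step touches the original corpus here, which is where the compute savings come from; the accuracy drop the researchers report is the price of combining weights arithmetically instead of retraining jointly.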
To measure real-world effectiveness, the research team evaluated MeMo against several industry benchmarks that require complex, multi-hop reasoning across multiple documents.
The researchers used Qwen2.5-32B-Instruct as the GENERATOR model to distill raw text into reflections. For the primary MEMORY model, they deployed Qwen2.5-14B-Instruct. They also validated the approach on smaller 1-2B parameter models across different architectures, including Gemma3-1B.
For the EXECUTIVE reasoning model, they tested both the open-weight Qwen2.5-32B and Google’s proprietary Gemini 3 Flash.
They benchmarked MeMo against a “Perfect Retrieval” upper bound (where the exact correct documents are manually provided) and several advanced retrieval systems, including traditional BM25 search, dense vector retrieval, and state-of-the-art graph-based RAG (HippoRAG2). They also tested “Cartridges,” a recent method that loads a trained KV-cache onto the model during inference.
MeMo dominated in long-document reasoning. On the NarrativeQA benchmark, MeMo achieved 53.58% accuracy paired with Gemini 3 Flash, according to the researchers. HippoRAG2 maxed out at 23.21%.
Enterprise systems frequently need to synthesize complex answers, such as traversing overlapping regulatory frameworks written independently by different bodies, or consolidating insights across a massive codebase and external documentation. Traditional RAG systems falter here because they hit context window limits and fail to connect concepts spanning hundreds of pages. MeMo succeeds because those connections are mapped and internalized inside the MEMORY model during training. It is “like having your very own Malcolm Gladwell that can connect the story of the Beatles with the story of Bill Gates to make an argument about the nature of expertise,” Solar-Lezama said.
The experiments revealed another major advantage: upgrading the reasoning engine requires zero retraining. Simply switching the EXECUTIVE model from the open-source Qwen to the proprietary Gemini 3 Flash boosted MeMo’s performance by 26.73% on NarrativeQA and 11.90% on the MuSiQue benchmark. For practitioners, this means you can train a MEMORY model securely on your private data and instantly plug it into the latest commercial APIs, continuously upgrading system intelligence without incurring new training costs.
The research team described the integration as requiring no additional setup: “The base (or Executive) LLM that teams are already using in RAG can be configured to query the Memory model directly. These queries are done in natural language, similar to sending a message request to an API, with no additional setup required.”
MeMo also handles noisy data exceptionally well. When researchers deliberately flooded the dataset with irrelevant documents (up to twice the amount of the useful information), HippoRAG2’s performance dropped by 11.55%. MeMo’s performance remained relatively stable, dropping less than 2%. Enterprise knowledge bases are typically messy, filled with duplicate documents and outdated policies. Standard RAG systems struggle with this noise, pulling incorrect paragraphs into the prompt and causing hallucinations. Because MeMo’s EXECUTIVE model interacts with a synthesized oracle rather than raw document chunks, it remains highly robust against disorganized corporate data.
For engineering teams looking to deploy MeMo, there are several key limitations to consider.
Unlike traditional RAG systems that quickly index raw documents into a vector database, MeMo requires an upfront training cost for each new corpus. The data generation pipeline used to synthesize the training reflections is computationally expensive. For example, the team noted that “generating the full reflection QA dataset took approximately 240 GPU-hours on NVIDIA H200s,” while training a 14B parameter MEMORY model “took approximately 180 H200 GPU-hours.” As Solar-Lezama said, “Reducing the training cost is one of the most significant open research problems in order to make this a workhorse technique.”
Because the MEMORY model is a fixed-size neural network, its ability to internalize knowledge is bounded by its representational capacity. While the researchers did not hit a hard limit during their benchmarking, they hypothesize that “sufficiently large or information-dense corpora will exceed what a fixed-size MEMORY model can correctly compress and represent.”
Finally, because MeMo synthesizes answers from parametric memory rather than retrieving exact text snippets, it obscures the provenance of the information. This makes it difficult to attribute specific claims to original source documents, which poses a critical compliance issue for enterprise applications requiring strict audit trails.
Deciding between MeMo and traditional RAG comes down to a heuristic of “lookup vs. synthesis,” alongside data volatility. The researchers advise that “traditional RAG would be preferred when answers live in a single document or when there is a well-defined source… MeMo would be preferred when the task shifts from lookup to synthesizing an answer from information scattered across multiple chunks.” If your knowledge corpus changes rapidly (e.g., daily feeds) and you require exact source citations, RAG remains the better option due to the upfront training cost of MeMo. If your corpus consists of generalized domain knowledge that evolves slowly relative to its volume, MeMo offers vastly superior reasoning. Teams can also adopt a hybrid routing architecture in production: sending “lookup” queries to a standard vector database and “synthesis” queries to the MEMORY model.
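One way to picture the hybrid routing idea is a small dispatcher like the one below. It is a sketch under stated assumptions: the keyword list in `SYNTHESIS_CUES` is a hypothetical stand-in for whatever classifier a team would actually use, and the two backends are placeholder functions rather than a real vector database or MEMORY model.

```python
from typing import Callable

# Hypothetical cue words suggesting a query needs cross-document synthesis.
SYNTHESIS_CUES = ("compare", "across", "why", "summarize", "relationship")

Backend = Callable[[str], str]


def classify(query: str) -> str:
    """Label a query 'synthesis' if it contains a cue word, else 'lookup'."""
    text = query.lower()
    return "synthesis" if any(cue in text for cue in SYNTHESIS_CUES) else "lookup"


def route(query: str, rag_backend: Backend, memory_backend: Backend) -> str:
    """Send lookup queries to retrieval and synthesis queries to the memory model."""
    backend = memory_backend if classify(query) == "synthesis" else rag_backend
    return backend(query)


# Placeholder backends that only report which path handled the query.
rag = lambda query: f"[rag] {query}"
memo = lambda query: f"[memory] {query}"

lookup_answer = route("What is the PTO accrual rate?", rag, memo)
synthesis_answer = route("Compare the travel rules across regions", rag, memo)
```

A production router would more likely use a small classifier or the reasoning model itself to make this decision; keyword matching is used here only because it is easy to read.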
“Looking further out, I would expect memory models to become a standard architectural component alongside retrieval,” Daniela Rus, co-author of the paper and director of the MIT Computer Science and Artificial Intelligence Lab (CSAIL), told VentureBeat, “in the same way that caching and indexing are standard components of any serious data system today.”
At 620 million monthly users, calling a frontier model for every image recommendation isn’t a strategy — it’s a bill. Pinterest CTO Matt Madrigal solved it by gutting Qwen3-VL’s vision layer and rebuilding it with proprietary embeddings, cutting costs 90% and boosting accuracy 30%.
Madrigal’s team has been heavily investing in customizing open-source models “foundationally in-house.”
“If you’ve got really unique data that you can then fine-tune an open source model with, data quality will, frankly, outweigh or overcome model size,” Madrigal explained in a recent VB Beyond the Pilot podcast.
Pinterest, which has around 620 million monthly active users, has long applied open source models for visual search and discovery, going back to Google’s BERT and OpenAI’s CLIP. The company fine-tuned its own Pin CLIP on the latter, incorporating proprietary visual embeddings and image metadata.
Pinterest’s conversational shopping assistant, Navigator 1, was built on Qwen3-VL and customized in “pretty significant” ways. Madrigal’s team essentially “ripped out” Qwen’s vision encoder layer and fine-tuned the model on proprietary multimodal embeddings. This has allowed them to capture metadata around pins and images that can then be precomputed offline and regularly retrained on new information to deliver personalized experiences.
“Open-source models, especially with open Apache licenses where you can truly tweak a lot of open weights and customize for unique use cases — that’s where we’ve found open source to be so powerful for us,” Madrigal said.
Bringing their own embeddings gives his team context around metadata, pins, and images; notably, the model also performs better at inference time. Without these embeddings, devs would have to call and encode each image returned at runtime, one at a time. That results in a latency “20 times worse” from an inference perspective, Madrigal said.
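The difference between encoding at request time and precomputing offline can be sketched as follows. This is not Pinterest’s pipeline: `CountingEncoder` is a hypothetical stand-in for an expensive vision encoder, and the pin identifiers are invented. The call counter simply makes the saved work visible.

```python
from typing import Dict, Iterable, List


class CountingEncoder:
    """Hypothetical stand-in for an expensive vision encoder; counts its calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, image_id: str) -> List[float]:
        self.calls += 1
        # Toy embedding derived from the id, purely for illustration.
        return [float(len(image_id)), float(sum(map(ord, image_id)) % 7)]


def precompute(image_ids: Iterable[str], encoder: CountingEncoder) -> Dict[str, List[float]]:
    """Offline job: encode every image once and store the result."""
    return {image_id: encoder(image_id) for image_id in image_ids}


def serve(candidates: List[str], store: Dict[str, List[float]]) -> List[List[float]]:
    """Request path: embeddings are looked up, never recomputed."""
    return [store[image_id] for image_id in candidates]


encoder = CountingEncoder()
store = precompute(["pin-1", "pin-2", "pin-3"], encoder)
calls_after_offline_job = encoder.calls

first_response = serve(["pin-1", "pin-3"], store)
second_response = serve(["pin-2"], store)
```

After the offline job, serving any number of requests adds no encoder calls, which is the mechanism behind the latency gap Madrigal describes.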
“If it’s something that’s going to be critical for our end users, that’s going to drive engagement, that will have to scale to over 600 million monthly active users, we’re going to either probably build it or we’re going to leverage open source and customize the heck out of it,” he said.
To guide users from inspiration to purchase, Madrigal’s team built a “taste graph”: a dynamic representation of what individual users actually like, not just what they click on. “It’s this representation of billions of people’s evolving tastes,” he said.
People go to Google or other search engines when they have a clear picture of what they want; Pinterest is for when they’re still in the discovery phase, Madrigal said. Pinterest’s goal is to encourage “lateral exploration” and turn discovery into intent (that is, clicking through ads or making purchases).
Under the hood, the architecture combines a graph structure with representational learning. User embeddings capture a user’s evolving tastes. These are constantly updated based on activity and new content and signals. “It’s not a social graph,” Madrigal said. “It’s much more of a preference graph: What’s going to inspire you? What are you trying to do next?”
For instance, one user may be into mid-century modern designs; another may prefer a Nantucket aesthetic. Those preferences will be captured in user embeddings, and the taste graph will deliver up specific, relevant products as a result.
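The embedding side of this idea can be sketched in a few lines. The example below is an assumption-heavy illustration, not Pinterest’s taste graph: it covers only the user-embedding part (no graph structure), uses two-dimensional toy vectors, and the `update_user` moving-average rule and its `rate` parameter are hypothetical.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def update_user(user: List[float], item: List[float], rate: float = 0.2) -> List[float]:
    """Nudge the user embedding toward an item the user engaged with."""
    return [(1 - rate) * u + rate * i for u, i in zip(user, item)]


def rank(user: List[float], items: Dict[str, List[float]]) -> List[str]:
    """Order items from most to least similar to the user's current taste."""
    return sorted(items, key=lambda name: cosine(user, items[name]), reverse=True)


# Toy axes: [mid-century modern, Nantucket]; values are invented.
items = {
    "walnut credenza": [1.0, 0.0],
    "shingle-style lantern": [0.0, 1.0],
    "rattan chair": [0.6, 0.6],
}
user = [0.9, 0.1]

ranking = rank(user, items)
shifted_user = update_user(user, items["shingle-style lantern"])
```

A user who leans mid-century modern sees the credenza first, and each new interaction moves the embedding slightly, so the ranking drifts as tastes evolve.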
“You go from the upper funnel, inspiration discovery, all the way through lower funnel intent,” Madrigal said.
Listen to the full podcast to hear more about:
How Pinterest uses sandboxes to encourage creativity in a way that is secure and contained;
Why a continuous feedback loop can prevent visual AI slop;
The importance of constant benchmarking to gauge user engagement, performance, latency, and other factors.
As enterprise AI agents move into production, organizations are confronting a growing reliability problem. Many teams are discovering that LLM performance alone does not determine whether agents succeed in production. Long-running AI workflows must survive crashes, preserve state, recover from failures, manage inference costs, and coordinate across APIs, tools, and enterprise systems.
After a first wave focused on rapid deployment, organizations now need to revisit those first-generation implementations, and redesign early agent architectures around workflow orchestration, observability, governance, and recovery, said Preeti Somal, Senior VP Engineering at Temporal Technologies, during the latest AI Impact Series event in New York.
“We do have a lot of customers that come to us where they’re building version 2.0 of the same agent,” Somal said. “They had to move really fast, but they didn’t take care of the plumbing. Things crash and burn, and then they’re back to rebuilding with the reliable foundation.”
For workflow orchestration company Temporal, whose infrastructure predates the current wave of agentic AI, the shift reflects a broader enterprise realization: production AI systems require durable execution, state management, visibility into workflows, and mechanisms to recover when models or downstream systems fail.
“These patterns aren’t necessarily new,” Somal said. “AI just supercharges them.”
Agentic systems introduce additional complexity because they often involve long-running, multi-step processes spanning multiple services, models, APIs, and tools. A single workflow might call several large language models, access retrieval systems, trigger external applications, and manage state over hours or days. The engineering questions, Somal said, often emerge only after deployment.
“People will write agents but haven’t thought about what happens if the agent crashes,” she said. “Am I going to need to run the entire agent flow again?”
For enterprises operating under cost constraints, the answer matters. Restarting workflows after failures can multiply inference expenses, increase latency, and create poor customer experiences.
Somal compared the current moment to an earlier period in enterprise cloud adoption, when organizations rushed to migrate workloads before realizing they needed to redesign the underlying architectures for those workloads to hold up over the long term.
“This rush to do AI in a world where you haven’t even modernized your application reminds me a little bit of that lift-and-shift that happened in the cloud,” she said. “Everybody realized you’re spending more money on cloud and we haven’t gotten value there.”
Enterprise workflows increasingly involve agents executing over long windows, sometimes spanning many hours, while interacting with tools and systems. Reliability challenges compound when workflows persist over time, and they affect both state and memory, two ideas that are often treated interchangeably in AI conversations.
State concerns workflow execution. It includes where an agent is in a process, which actions have already completed, and where recovery should resume after failure. Memory or context captures information an agent carries forward across interactions or tasks.
“The state of the agent is around what step and what actions have been performed, and if something crashes, where do you want to recover from, versus the context and memory piece,” Somal explained.
That distinction becomes increasingly important when enterprises begin moving beyond simple chatbot interactions toward longer-running business processes. Somal pointed to a healthcare example involving customer Abridge, where workflows process physician visits through multiple stages, including audio processing, summarization, model calls, and after-visit generation.
“There’s not just one piece to that flow,” Somal said. “Taking videos and slicing that, taking summaries, calling the LLMs, generating the after-visit summary, all of that is being orchestrated.”
The implication for enterprises is that successful agents increasingly depend on systems that can survive interruptions, coordinate across services, and maintain continuity over time.
A useful framework for enterprise AI design is the “deterministic spine,” Somal said, which is how Temporal thinks about its own role.
“It is denoting the path you want to take,” she said. “It is calling the brain, but if the brain doesn’t respond, it will call it again. If the brain responds but the next step is going to fail, it will pick up from where that failure happened.”
In this framing, the language model acts as a probabilistic system producing variable outputs, while orchestration software maintains execution reliability around it. And the concept matters because enterprise systems increasingly require consistency even when models remain non-deterministic. A procurement workflow, healthcare summary, customer support escalation, or compliance process cannot simply fail silently because a model call timed out or an external dependency crashed.
“What you care most about is making sure that you can recover and that you’re not paying the token tax if something goes wrong,” Somal said.
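The “deterministic spine” pattern can be approximated in plain Python, as in the sketch below. It does not use Temporal’s SDK and makes no claims about its API; `call_with_retry`, `run_workflow`, the JSON checkpoint file, and the two example steps are all hypothetical. A real system would also add backoff between retries and durable storage.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict], Dict]]


def call_with_retry(fn: Callable[[Dict], Dict], data: Dict, attempts: int = 3) -> Dict:
    """Call a step again if it fails, up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn(data)
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
    raise RuntimeError("step failed after retries") from last_error


def run_workflow(steps: List[Step], checkpoint: Path) -> Dict:
    """Run steps in order, skipping any that a previous run already completed."""
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"done": [], "data": {}}
    for name, fn in steps:
        if name in state["done"]:
            continue  # finished before a crash or rerun; do not pay for it again
        state["data"] = call_with_retry(fn, state["data"])
        state["done"].append(name)
        checkpoint.write_text(json.dumps(state))  # persist progress per step
    return state["data"]


calls = {"transcribe": 0, "summarize": 0}


def transcribe(data: Dict) -> Dict:
    calls["transcribe"] += 1
    return {**data, "transcript": "visit notes"}


def summarize(data: Dict) -> Dict:
    calls["summarize"] += 1
    if calls["summarize"] == 1:
        raise TimeoutError("model did not respond")  # simulated flaky model call
    return {**data, "summary": data["transcript"].upper()}


steps = [("transcribe", transcribe), ("summarize", summarize)]
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "checkpoint.json"
    result = run_workflow(steps, path)
    rerun = run_workflow(steps, path)  # simulates restarting after a crash
```

The model call (the “brain”) is retried when it times out, and a restart reads the checkpoint and skips completed steps, so no earlier model call is repeated.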
As enterprise leaders evaluate AI ROI, cost visibility has become a growing concern. Long-running agents frequently make multiple model calls across complex workflows, which can create opaque spending patterns. Somal described one operational advantage of orchestration as visibility into where costs accumulate. Because workflows are observable step-by-step, teams can see where tokens are being consumed across an agent process.
“You’ve got visibility into that entire flow in a single pane of glass,” she said. “You can now see where you’re spending the tokens in an agent that is multiple steps and calling multiple different systems.”
Workflow recovery also shapes cost efficiency. Without durable orchestration, a late-stage failure can force organizations to rerun an entire process from the beginning, including all prior model calls. Somal said systems designed around recovery can resume execution from the point of interruption.
“You pick up from where the crash happened,” she said. “We save you the cost of running the agent from step one again.”
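The cost argument reduces to simple arithmetic, sketched below with invented numbers. The step names and token counts are hypothetical, and `tokens_by_step` and `tokens_saved_by_resume` are illustrative helpers, not part of any orchestration product.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # (step name, tokens consumed by one model call)


def tokens_by_step(events: List[Event]) -> Dict[str, int]:
    """Total tokens per workflow step, to show where spend accumulates."""
    totals: Dict[str, int] = defaultdict(int)
    for step, tokens in events:
        totals[step] += tokens
    return dict(totals)


def tokens_saved_by_resume(events: List[Event], failed_step: str) -> int:
    """Tokens already spent before `failed_step`, which a full restart would re-spend."""
    saved = 0
    for step, tokens in events:
        if step == failed_step:
            return saved
        saved += tokens
    raise ValueError(f"unknown step: {failed_step}")


# Hypothetical token counts for a three-step workflow.
events = [("audio", 1200), ("summary", 2500), ("summary", 500), ("after_visit", 800)]

per_step = tokens_by_step(events)
saved = tokens_saved_by_resume(events, "after_visit")
```

In this toy run, a failure at the last step would force a restart-from-scratch design to re-spend 4,200 tokens that a resumable workflow keeps.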
Governance is another emerging concern as agentic AI takes hold. Rather than adopting fully managed agent systems wholesale, Somal said, enterprises increasingly want standardized internal frameworks that provide guardrails while preserving flexibility, with built-in governance controls, model selection policies, identity systems, cost management, and observability.
“The enterprises are looking at building these paved paths,” she said. “Taking something off the shelf is maybe not going to work because there are all of these other requirements.”
As organizations revisit first-generation deployments, challenges like this increasingly look less like a model problem and more like a systems engineering problem. Temporal is positioned to help enterprises take this next step in part because, for many organizations, it was already in place as part of broader modernization programs before AI became a strategic priority.
“Temporal is already in the enterprise,” Somal said. “Taking that and extending that to AI and agent platforms feels very natural.”
Test-time scaling (TTS) has emerged as a proven method to improve the performance of large language models in real-world applications by giving them extra compute cycles at inference time. However, TTS strategies have historically been handcrafted, relying heavily on human intuition to dictate the rules of the model’s reasoning.
To address this bottleneck, researchers from Meta, Google, and several universities have introduced AutoTTS, a framework that automatically discovers optimal TTS strategies. This automated approach allows enterprise organizations to dynamically optimize compute allocation without manually tuning heuristics.
By implementing the optimal strategies discovered by AutoTTS, organizations can directly reduce the token usage and operational costs of deploying advanced reasoning models in production environments. In experimental trials, AutoTTS managed inference budgets efficiently, successfully reducing token consumption by up to 69.5% without sacrificing accuracy.
Test-time scaling enhances LLMs by granting them extra compute when generating answers. This extra compute allows the model to generate multiple reasoning paths or evaluate its intermediate steps before arriving at a final response.
The primary challenge for designing TTS strategies is determining how to allocate this extra computation optimally. Historically, researchers have designed these strategies manually, relying on guesswork to build rigid heuristics. Engineers must hypothesize the rules and thresholds for when a model should branch out into new reasoning paths, probe deeper into an existing path, prune an unpromising branch, or stop reasoning altogether.
Because this manual tuning process is constrained by human intuition, a vast number of possible approaches remain unexplored. This often results in suboptimal trade-offs between model accuracy and computing costs.
Current TTS algorithms can be mapped to a width-depth control space — “width” being the number of reasoning branches explored, “depth” being how far each develops. Self-consistency (SC) samples a fixed number of trajectories and majority-votes the answer. Adaptive-consistency (ASC) saves compute by stopping early once a confidence threshold is hit. Parallel-probe takes a more granular approach, pruning unpromising branches while deepening the rest. All three are hand-crafted, and that’s the constraint AutoTTS is designed to break.
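The difference between a fixed-width vote and an early-stopping vote can be sketched in a few lines. This is a simplified illustration, not the published algorithms: the vote-share `threshold` and `min_samples` parameters are assumptions standing in for the statistical stopping criterion a real adaptive-consistency implementation would use, and both functions assume a non-empty list of answers.

```python
from collections import Counter
from typing import List, Tuple

def self_consistency(answers: List[str]) -> str:
    """Majority vote over a fixed set of sampled answers."""
    return Counter(answers).most_common(1)[0][0]

def adaptive_consistency(answers: List[str], threshold: float = 0.8,
                         min_samples: int = 3) -> Tuple[str, int]:
    """Consume answers one at a time and stop once the leader's vote share
    reaches `threshold` (a simplified stand-in for a confidence test)."""
    counts: Counter = Counter()
    for used, answer in enumerate(answers, start=1):
        counts[answer] += 1
        leader, votes = counts.most_common(1)[0]
        if used >= min_samples and votes / used >= threshold:
            return leader, used
    return counts.most_common(1)[0][0], len(answers)

# Hypothetical final answers from eight sampled reasoning trajectories.
sampled = ["42", "42", "42", "42", "17", "42", "42", "9"]

fixed_answer = self_consistency(sampled)              # uses all eight samples
early_answer, samples_used = adaptive_consistency(sampled)
```

Both strategies return the same answer here, but the early-stopping variant reads only the first three samples. The rule for when to stop is still chosen by hand, which is the limitation the article goes on to describe.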
While some more advanced methods employ richer structures like tree search or external verifiers, they all share one key characteristic: they are meticulously hand-crafted. This manual approach restricts the scope of strategy discovery, leaving a massive portion of the potential resource-allocation space untouched.
AutoTTS reframes the way test-time scaling is optimized. Instead of treating strategy design as a human task, AutoTTS approaches it as an algorithmic search problem within a controlled environment.
This framework redefines the roles of both the human engineer and the AI model. Rather than hand-crafting specific rules for when an LLM should branch, prune, or stop reasoning, the engineer’s role shifts to constructing the discovery environment. The human defines the boundaries, including the control space of states and actions, optimization objectives balancing accuracy versus cost, and the specific feedback mechanisms.
An explorer LLM, such as Claude Code, designs the strategy. This explorer acts as an autonomous agent that iteratively proposes TTS “controllers.” These controllers are code-defined policies or algorithms that dictate how an AI model allocates its computational budget during inference. The explorer tests and refines these controllers based on feedback until it discovers an optimal resource-allocation policy.
To make this automated search computationally affordable, AutoTTS relies on an “offline replay environment.” If the explorer LLM had to invoke a base reasoning model to generate new tokens every time it tested a new strategy, the compute costs would be astronomical. Instead, it relies on thousands of reasoning trajectories pre-collected from the base LLM. These trajectories include “probe signals,” which are intermediate answers that help the controller evaluate progress across different reasoning branches.
During the discovery loop, the explorer agent proposes a controller and evaluates it against this offline data. The agent observes the execution traces of the proposed controller, which show how it allocated compute over time. By analyzing these traces, the agent can diagnose specific failure modes, such as noting if a controller pruned branches too aggressively in a specific scenario. This provides an advantage over just viewing a final result. The agent then iteratively rewrites its code to improve the accuracy-cost tradeoff.
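A toy version of the replay idea, under stated assumptions, looks like this. The data layout (each branch as a list of `(tokens, probe_answer)` steps), the lockstep schedule, and the two example controllers are hypothetical; the point is only that a candidate policy can be scored for accuracy and token cost without generating any new tokens.

```python
from collections import Counter
from typing import Callable, List, Tuple

# One recorded branch: a list of (tokens_spent_in_step, probe_answer) pairs.
Branch = List[Tuple[int, str]]
Controller = Callable[[List[str]], bool]  # latest probe answers -> stop now?

def replay(branches: List[Branch], controller: Controller) -> Tuple[str, int]:
    """Advance all recorded branches in lockstep until the controller stops."""
    tokens = 0
    latest: List[str] = []
    for step in range(max(len(b) for b in branches)):
        latest = []
        for branch in branches:
            if step < len(branch):
                cost, probe = branch[step]
                tokens += cost  # cost is read from the recording, not regenerated
                latest.append(probe)
            else:
                latest.append(branch[-1][1])  # a finished branch keeps its last answer
        if controller(latest):
            break
    return Counter(latest).most_common(1)[0][0], tokens

def unanimous(probes: List[str]) -> bool:
    return len(set(probes)) == 1

def never_stop(probes: List[str]) -> bool:
    return False

# Hypothetical pre-collected trajectories with intermediate probe answers.
recorded = [
    [(100, "7"), (100, "12"), (100, "12"), (100, "12")],
    [(100, "12"), (100, "12"), (100, "12"), (100, "12")],
    [(100, "3"), (100, "12"), (100, "12"), (100, "12")],
]

early = replay(recorded, unanimous)
full = replay(recorded, never_stop)
```

On this recording the early-stopping controller reaches the same answer as the exhaustive one at half the token cost, and an explorer could compare many such candidates against the same data.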
Because the explorer agent is not constrained by human intuition, it can discover highly coordinated, complex rules that a human engineer would likely never hand-code. One optimal controller discovered by AutoTTS, named the Confidence Momentum Controller, leverages several non-obvious mechanisms to manage compute:
Trend-based stopping: Hand-crafted strategies often instruct the model to stop reasoning once it hits a certain instantaneous confidence threshold. The AutoTTS agent discovered that instantaneous confidence can be misleading due to temporary spikes. Instead, the controller tracks an exponential moving average (EMA) of confidence and only stops if the overall confidence level is high and the trend is not actively declining.
Coupled width-depth control: Manually designed algorithms usually treat the “widening” of new reasoning paths and the “deepening” of current paths as separate decisions. AutoTTS discovered a closed feedback loop where the two actions are linked. If the confidence of the current branches stalls or regresses, the controller automatically triggers the spawning of new branches.
Alignment-aware depth allocation: Instead of giving all active reasoning branches an equal computation budget, the controller dynamically identifies which branches agree with the current leading answer. It then gives those branches priority “bursts” of extra computation. This concentrates the computational budget on the emerging consensus to quickly verify if it is correct.
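The three mechanisms above can be sketched as small, independent rules. This is an illustrative reading of the description, not the released Confidence Momentum Controller: the smoothing factor `alpha`, the `high` threshold, and the `base` and `burst` budgets are all assumed values.

```python
from collections import Counter
from typing import Dict, List

def ema(values: List[float], alpha: float = 0.5) -> List[float]:
    """Exponential moving average of a confidence series."""
    out: List[float] = []
    for v in values:
        out.append(v if not out else alpha * v + (1 - alpha) * out[-1])
    return out

def should_stop(conf: List[float], high: float = 0.8) -> bool:
    """Trend-based stopping: smoothed confidence is high and not declining."""
    smooth = ema(conf)
    return len(smooth) >= 2 and smooth[-1] >= high and smooth[-1] >= smooth[-2]

def should_widen(conf: List[float]) -> bool:
    """Coupled width-depth control: spawn branches when confidence stalls or regresses."""
    smooth = ema(conf)
    stalled = len(smooth) >= 2 and smooth[-1] <= smooth[-2]
    return stalled and not should_stop(conf)

def allocate_depth(answers: Dict[str, str], base: int = 1, burst: int = 3) -> Dict[str, int]:
    """Alignment-aware allocation: branches agreeing with the leader get a burst."""
    leader = Counter(answers.values()).most_common(1)[0][0]
    return {branch: burst if ans == leader else base for branch, ans in answers.items()}

spike = [0.3, 0.3, 0.95]          # one sudden jump in confidence
steady = [0.7, 0.9, 0.95, 0.95]   # sustained high confidence
declining = [0.95, 0.95, 0.6]     # confidence is falling

budget = allocate_depth({"b1": "12", "b2": "12", "b3": "7"})
```

The `spike` series would trip a naive instantaneous threshold of 0.8 on its last reading, but its smoothed value is still low, so the trend-based rule keeps reasoning.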
To test whether an AI could autonomously discover a better test-time scaling strategy, researchers set up a rigorous evaluation framework. The core experiments were conducted on Qwen3 models ranging from 0.6B to 8B parameters. The researchers also tested the system’s ability to generalize on a distilled 8B version of the DeepSeek-R1 model.
The explorer AI agent was initially tasked with discovering an optimal strategy using the AIME24 mathematical reasoning benchmark. This discovered strategy was then tested on two held-out math benchmarks, AIME25 and HMMT25, as well as the graduate-level general reasoning benchmark GPQA-Diamond.
The AutoTTS-discovered controller was pitted against four manually designed test-time scaling algorithms used in the industry. These baselines included Self-Consistency with 64 parallel reasoning paths (SC@64), Adaptive-Consistency (ASC), Parallel-Probe, and Early-Stopping Self-Consistency (ESC). ESC is a hybrid approach that generates trajectories in parallel and stops early when an answer seems stable.
When set to a balanced, cost-conscious mode, the AutoTTS-discovered controller reduced total token consumption by approximately 69.5% compared to SC@64. At the same time, the controller maintained the same average accuracy across the four Qwen models. When the inference budget was turned up, AutoTTS pushed peak accuracy beyond all handcrafted baselines in five out of eight test cases.
This efficiency translated to other tasks. On the GPQA-Diamond benchmark, the balanced AutoTTS variant slashed the inference token cost from 510K tokens down to just 151K tokens, while slightly improving overall accuracy. On the DeepSeek model, AutoTTS achieved the highest overall accuracy on the HMMT25 benchmark while cutting the token spend nearly in half.
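As a quick arithmetic check, the GPQA-Diamond figures quoted above imply a reduction close to the headline number. The helper name `reduction_pct` is just for illustration.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before

# Token counts as reported for GPQA-Diamond: 510K down to 151K.
gpqa_reduction = reduction_pct(510_000, 151_000)
print(f"{gpqa_reduction:.1f}% fewer tokens")
```

That works out to roughly a 70% cut, in line with the approximately 69.5% reduction reported for the balanced mode on the math benchmarks.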
For practitioners building enterprise AI applications, these experiments highlight two major operational benefits:
Raising peak performance: AutoTTS doesn’t just save money on token consumption. It actively raises the peak attainable performance of the base model. The AI-designed controller is remarkably good at detecting noisy or unproductive reasoning branches on the fly and continuously redirecting its compute budget toward the branches generating the most useful reasoning signals.
Cost-effective custom development: Because the framework relies on an offline replay environment, the entire discovery process cost only $39.90 and took 160 minutes. For enterprise teams, that means optimized reasoning strategies tailored to proprietary models and internal tasks are now within reach — without a dedicated research budget.
Both the AutoTTS framework and the Confidence Momentum Controller are available on GitHub; the CMC can be used as a drop-in replacement for other TTS controllers.
There is a category of production incident that engineering teams are not tracking yet — because it doesn’t fit any existing postmortem template.
The agent initiated an action. The action was technically correct given the agent’s context. The context was incomplete. The infrastructure cascaded. And, by the time the incident review happened, three teams were arguing about whether it was an agent failure or an infrastructure failure, because the frameworks for thinking about these two things have never been connected.
The scale of this exposure is no longer theoretical. Seventy-nine percent of organizations now have some form of AI agent in production, with 96% planning expansion. Gartner predicts 33% of enterprise software will include agentic AI by 2028, but separately warns that 40% of those projects will be canceled due to poor risk controls.
What neither statistic captures is the failure mode happening between those two numbers: Agents that are running, that are not canceled, and that are quietly generating infrastructure events no one has categorized as risk.
I’ve spent six years building infrastructure automation systems at enterprise scale, first at Cisco (leading AI-driven lifecycle platforms deployed across 20-plus global enterprise customers), then at Splunk (designing AI-assisted root cause analysis and observability workflows across thousands of enterprise environments).
During that time I also filed a patent on intent-based chaos engineering methodology. And across all of it, I kept watching organizations make the same structural mistake: Treating autonomous agents and chaos engineering as separate disciplines. They are not. They are the same discipline, and the gap between them is quietly generating the next wave of major production incidents.
To understand why this matters, you need to understand what’s actually broken in how enterprises govern chaos today, before you add agents to the picture.
Most mature engineering organizations have invested in chaos engineering programs. Game days, blast radius controls, SLO-gated experiments. When a human engineer initiates a chaos experiment, the sequence has a critical property: A human is making a judgment call about whether the system has capacity to absorb the perturbation right now. They check dashboards. They look at the error budget burn rate. They assess whether dependencies are stable. It’s imperfect and often intuitive, but there is at least a person in the loop asking the right question before anything runs.
When you introduce an autonomous remediation agent, one that can restart services, reroute traffic, scale resources, or modify configurations in response to detected anomalies, that question disappears. The agent sees an anomaly. The agent takes an action. The action is a chaos event. No SLO burn rate check. No blast radius calculation. No human judgment about whether right now is the right moment to introduce additional stress into a system that may already be under pressure from three other directions.
Here is the specific failure mode I have watched play out. A remediation agent detects elevated latency on a microservice and responds by restarting the service cluster; a reasonable action given its training data and its narrow view of the incident. What the agent doesn’t know: Three other services are in the middle of handling peak traffic. The shared connection pool is already at 87% utilization. A dependent database is running a background index rebuild. The restart triggers a thundering herd against the recovering service.
What started as a latency spike the agent was designed to fix becomes a cascade the agent was never designed to model. The blast radius of that agent action was not the service restart. It was everything downstream of the restart, in a system state the agent had no complete picture of.
Nobody’s chaos engineering program had tested for that specific combination. Nobody’s blast radius calculation had included the agent as an actor. Because we don’t think of agents as chaos injectors. We should.
According to the AI Incidents Database, reported AI-related incidents rose 21% from 2024 to 2025. That count almost certainly understates the actual exposure, because most organizations have no incident classification that captures an autonomous agent action as the initiating cause of a cascade. The incident gets logged as a service restart, a connection pool saturation, or a latency event. The agent is invisible in the postmortem.
The underlying problem is that enterprise systems have no shared language for absorb capacity — the real-time estimate of how much additional stress a system can take before it breaches its SLO commitments. Chaos engineering programs manage it implicitly, through human judgment and static thresholds that fire after a limit has already been crossed. Agents don’t manage it at all.
Through structured primary research with site reliability engineering (SRE) and platform engineering practitioners across organizations including Intuit and GPTZero, I’ve been developing a resilience budget model. The core idea is to treat absorb capacity as a continuously recomputed, consumable resource rather than a static threshold you try not to breach.
A resilience budget draws on four live signal classes.
SLO burn rate is the primary input, because it directly encodes the distance between current system behavior and the commitment that actually matters. If a system is burning its monthly error budget at five times the expected rate, the resilience budget is near zero regardless of what CPU utilization looks like.
P99 latency trend matters more than absolute latency, because a service trending upward over forty minutes tells you something different than a service that has been stable at the same absolute value.
Dependency saturation state is the most commonly missed signal; a chaos experiment or an agent action that assumes a shared connection pool is freely available when it’s sitting at 87% will produce failure modes that nobody designed for.
Application behavioral signals, such as session completion rates, API call pattern shifts, and conversion degradation, surface system stress earlier than infrastructure metrics do, because users feel the degradation before Prometheus reports it.
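One way to picture the four signal classes working together is a score where the weakest signal dominates. This is a hypothetical sketch, not the author's model: the field names, the scaling constants (for example, treating a 5x burn rate as zero headroom), and the use of `min` are all assumptions chosen to match the behavior described above.

```python
from dataclasses import dataclass

def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))

@dataclass
class Signals:
    burn_rate: float              # error-budget burn as a multiple of the expected rate
    p99_slope_ms_per_min: float   # trend of P99 latency, not its absolute value
    dependency_saturation: float  # busiest shared dependency, 0.0 to 1.0
    session_completion: float     # fraction of user sessions completing, 0.0 to 1.0

def resilience_budget(s: Signals) -> float:
    """Remaining absorb capacity in [0, 1]; the weakest signal dominates."""
    burn_headroom = clamp01(1.0 - (s.burn_rate - 1.0) / 4.0)   # 1x -> 1.0, 5x -> 0.0
    trend_headroom = clamp01(1.0 - s.p99_slope_ms_per_min / 10.0)
    dependency_headroom = clamp01(1.0 - s.dependency_saturation)
    behavior_headroom = clamp01((s.session_completion - 0.90) / 0.10)
    return min(burn_headroom, trend_headroom, dependency_headroom, behavior_headroom)

calm = resilience_budget(Signals(1.0, 0.0, 0.40, 0.99))
burning = resilience_budget(Signals(5.0, 0.0, 0.10, 0.99))
pool_tight = resilience_budget(Signals(1.0, 0.0, 0.87, 0.99))
```

Taking the minimum means a system burning its error budget at five times the expected rate scores zero no matter how healthy the other signals look.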
What makes this a budget rather than a threshold is that it is consumable. Every chaos experiment draws from the available capacity. Every agent action draws from it. In multi-team organizations where multiple experiments and multiple agents may be acting simultaneously, the budget is shared.
Without a shared ledger of consumption, two teams running experiments against overlapping dependencies produce a combined blast radius that neither team planned. Add autonomous agents acting completely outside the ledger, and the accounting collapses.
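The shared-ledger idea can be sketched as a small reservation table. The class name, the integer "budget points," and the actor labels are hypothetical; a production version would need locking, expiry, and a live budget feed.

```python
from typing import Dict

class ResilienceLedger:
    """Shared record of how much absorb capacity each actor has reserved."""

    def __init__(self, budget_points: int) -> None:
        self.budget_points = budget_points
        self.reservations: Dict[str, int] = {}

    def available(self) -> int:
        return self.budget_points - sum(self.reservations.values())

    def reserve(self, actor: str, cost: int) -> bool:
        """Grant the reservation only if the shared budget can cover it."""
        if cost > self.available():
            return False
        self.reservations[actor] = self.reservations.get(actor, 0) + cost
        return True

    def release(self, actor: str) -> None:
        self.reservations.pop(actor, None)

ledger = ResilienceLedger(budget_points=50)
team_a = ledger.reserve("team-a-experiment", 30)
team_b = ledger.reserve("team-b-experiment", 30)   # overlaps with team A's draw
agent = ledger.reserve("remediation-agent", 20)    # the agent draws from the same pool
remaining = ledger.available()
```

Because experiments and agent actions draw from one pool, the second team's request is refused rather than silently stacking on top of the first.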
Several engineering organizations are now running experiments using large language models (LLMs) to generate chaos hypotheses from dependency graphs and incident postmortem corpora. The results are directionally useful. Language models surface plausible failure modes that experienced SREs recognize as worth testing, and they generate hypotheses faster than manual processes, particularly when working from rich postmortem history.
The limit is dependency graph staleness, and it is a hard limit. A hypothesis generated from a graph that doesn’t reflect last month’s service extraction, or a new shared library dependency added two sprints ago, will propose an experiment with incorrect blast radius assumptions. The problem is not that the model makes a mistake; it’s that the model doesn’t know it’s making one. It will be confidently incorrect about a system boundary that no longer exists, and in chaos engineering, confident incorrectness in production means an unplanned outage.
Stanford’s Trustworthy AI Research Lab found that model-level guardrails alone are insufficient: Fine-tuning attacks bypassed leading models in the majority of tested cases. The implication for chaos hypothesis generation is direct: A model that cannot reliably hold its own safety boundaries cannot be trusted to accurately model the blast radius of an action it has never seen in a dependency graph it has not verified.
When hypothesis generation draws instead from postmortem corpora, the staleness problem shrinks considerably. Postmortems describe failures that actually occurred in the system at a specific moment in time. The signal is inherently validated by production reality. This is the tractable near-term AI application in this space, and it is genuinely useful for organizations with mature incident documentation practices.
What AI cannot do, and should not be asked to do, is make the execution decision when signals are ambiguous. That judgment requires awareness of things that live entirely outside any monitoring system: Pending deployments that changed the dependency landscape an hour ago, on-call staffing levels on a holiday weekend, a customer commitment that makes any additional risk unacceptable until Monday.
A model without access to that context should not be making that call. This is not a temporary limitation pending a more capable model. It is a structural constraint of what machine observability can represent, and building an agent architecture that ignores it is building one that will eventually make a consequential decision with incomplete information — and no human in the loop to catch it.
The governance implication is straightforward to describe and harder to implement than it sounds. Every autonomous agent action that touches infrastructure needs to register against the same live signal layer that governs chaos experiments. The same SLO burn rates, latency trends, and dependency saturation states that a human engineer would check before initiating an experiment should gate what an agent is permitted to do and when. If the resilience budget is below a defined floor, the agent waits or escalates. It does not act.
Agent actions also need to be modeled as experiments, not just logged as events. When an agent restarts a service, the question isn’t only whether the restart completed successfully. It’s whether the blast radius of that action was proportionate to the available absorb capacity, and what cascading effects it produced across dependencies. That is chaos engineering data. It belongs in the budget model, feeding the next decision the agent or the team needs to make.
And when signals are genuinely ambiguous, when the budget score is unclear, when a recent deployment has changed the topology in ways the agent’s context window doesn’t capture, when dependency states are in flux, the execution decision needs to go to a human. Not as a permanent limitation on agent autonomy, but as a hard engineering requirement for the current state of the technology.
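The gating logic described here can be written as a small decision function. This is a sketch under assumptions: the `floor` value, the `action_cost` estimate, and the `topology_changed_recently` flag are hypothetical inputs, and the rule that an action which would push the budget below the floor is escalated is an illustrative choice rather than something the approach prescribes.

```python
from enum import Enum

class Decision(Enum):
    ACT = "act"
    WAIT = "wait"
    ESCALATE = "escalate_to_human"

def gate_agent_action(budget: float, action_cost: float, floor: float = 0.2,
                      topology_changed_recently: bool = False) -> Decision:
    """Decide whether an agent may act, given the live resilience budget."""
    if topology_changed_recently:
        return Decision.ESCALATE  # ambiguous context goes to a person
    if budget < floor:
        return Decision.WAIT      # below the floor the agent does not act
    if budget - action_cost < floor:
        return Decision.ESCALATE  # assumption: borderline actions need sign-off
    return Decision.ACT

healthy = gate_agent_action(budget=0.6, action_cost=0.1)
depleted = gate_agent_action(budget=0.1, action_cost=0.05)
borderline = gate_agent_action(budget=0.25, action_cost=0.1)
ambiguous = gate_agent_action(budget=0.6, action_cost=0.1, topology_changed_recently=True)
```

The ambiguity check runs first on purpose: a healthy-looking budget should not override the fact that the agent's picture of the topology may be stale.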
A circuit breaker that hands ambiguous cases to a human is not a weakness in the agent architecture. It is the thing that makes the architecture trustworthy enough to actually run in production. Intent-based verification formalizes exactly this: Defining what correct agent behavior looks like before deployment, then continuously probing whether those boundaries hold under live system conditions.
The organizations that operate autonomous agents reliably at scale are not the ones with the most sophisticated models. They are the ones that understood, before something went badly wrong, that every agent action is a chaos event and built their governance layer accordingly.
The practical first step is unglamorous: Audit every autonomous agent currently touching infrastructure, map its action surface against your live SLO burn rate signals, and define explicit floor conditions below which the agent is required to wait or escalate. That audit will surface agents acting entirely outside your resilience accounting.
Most organizations running agents at scale today have several. Find them before production does.
Sayali Patil has spent 6-plus years at Cisco Systems and Splunk building the reliability and automation systems that keep enterprise AI infrastructure running at scale.
When agentic workflows fail, developers often assume the problem lies in the underlying model’s reasoning abilities. In reality, the limited information provided by the retrieval interface is often the primary limiting factor.
Researchers at multiple universities propose a technique called direct corpus interaction (DCI) that lets agents bypass embedding models entirely, searching raw corpora directly using standard command-line tools.
In classic retrieval systems such as RAG, documents are chunked, converted into vector representations (or embeddings), and indexed offline in a vector database. When an AI system processes a query, a retriever filters the entire database to return a ranked “top-k” list of document snippets that match the query. All evidence must pass through this scoring mechanism before any downstream reasoning occurs.
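The top-k step can be sketched with cosine similarity over a tiny in-memory index. The three-dimensional vectors and document names are made up for illustration; a real pipeline would get its vectors from an embedding model and store them in a vector database.

```python
import math
from typing import Dict, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: List[float], index: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Score every indexed document against the query and keep the best k."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

# Hypothetical pre-computed embeddings, indexed offline.
index = {
    "billing_faq": [0.9, 0.1, 0.0],
    "outage_report": [0.1, 0.9, 0.2],
    "release_notes": [0.2, 0.3, 0.9],
}
query = [0.0, 1.0, 0.1]

hits = top_k(query, index, k=2)
hit_ids = [doc_id for doc_id, _ in hits]
```

Anything that falls outside the returned k is invisible to the downstream model, which is the constraint the rest of this piece focuses on.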
But modern agentic applications demand much more. “Dense retrieval is very useful for broad semantic recall, but when an agent has to solve a multi-step task, it often needs to search for exact strings, numbers, versions, error codes, file paths, or sparse combinations of clues,” the authors of the DCI paper said in comments provided to VentureBeat. “These long-tail details are precisely where semantic similarity can be brittle.”
Unlike static search, agents must also revise their search plans dynamically after observing partial or localized evidence. Exact lexical constraints and multi-step hypothesis refinement are difficult to execute with semantic retrievers. Because the retriever compresses access into a single step, any critical evidence filtered out by the similarity search cannot be recovered later, no matter how advanced the agent’s downstream reasoning capabilities are. As the authors explain, current retrieval pipelines can become a bottleneck because “they decide too early what the agent is allowed to see.”
Searching the raw corpus directly also addresses a core problem in enterprise environments: data staleness. Embedding indexes are always a snapshot of a specific moment in time, taking considerable compute and time to build and maintain.
“In many enterprise settings, the data is not a stable document collection. It is daily financial reports, live logs, tickets, code commits, configuration files, incident timelines, and internal documents that keep changing,” the authors said. DCI lets the agent reason over the current state of the workspace rather than yesterday’s vector index.
The agent operates in a terminal-like environment where its observations are raw tool outputs such as file paths, matched text spans, and surrounding lines. The core tools provided by DCI are few but highly expressive. Agents use commands like “find” and “glob” to navigate directory structures and locate files. For exact matching, they use “grep” and “rg” to locate specific keywords, regex patterns, and exact strings. When local inspection is needed, tools like “head,” “tail,” “sed,” “cat,” and lightweight Python scripts allow the agent to peek at the context surrounding a match or read specific file sections.
The agent can combine these tools via shell pipelines to execute complex search logic in a single step. An agent can pipe commands to enforce strict lexical constraints, such as searching a file for one term and piping the output to search for a second term. It can combine multiple weak clues across a corpus by finding a specific file type, searching for a keyword like “report,” and filtering for a year like “2024.” It can also immediately verify a hypothesis by inspecting the exact lines around a keyword match.
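To keep the examples in one language, the sketch below mimics that pipeline in Python rather than in a shell. It is not the DCI codebase: `find_files`, `grep`, `refine`, and `context` are hypothetical helpers that roughly mirror `find`, `grep -n`, a second piped `grep`, and `sed -n`, run against a throwaway corpus.

```python
import re
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple

Match = Tuple[Path, int, str]  # (file, 1-based line number, line text)

def find_files(root: Path, pattern: str) -> List[Path]:
    """Rough analogue of `find root -name pattern`."""
    return sorted(p for p in root.rglob(pattern) if p.is_file())

def grep(paths: Iterable[Path], regex: str) -> List[Match]:
    """Rough analogue of `grep -n regex files`."""
    rx = re.compile(regex)
    hits: List[Match] = []
    for path in paths:
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if rx.search(line):
                hits.append((path, lineno, line))
    return hits

def refine(hits: List[Match], regex: str) -> List[Match]:
    """Rough analogue of piping earlier matches into a second grep."""
    rx = re.compile(regex)
    return [h for h in hits if rx.search(h[2])]

def context(path: Path, lineno: int, radius: int = 1) -> List[str]:
    """Lines surrounding a match, for verifying a hypothesis in place."""
    lines = path.read_text().splitlines()
    return lines[max(0, lineno - 1 - radius): lineno + radius]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "finance").mkdir()
    (root / "finance" / "q4.txt").write_text("header\nannual report 2024 revenue up\nfooter\n")
    (root / "finance" / "q3.txt").write_text("annual report 2023 revenue flat\n")
    (root / "notes.md").write_text("report 2024 draft\n")

    # File type, then keyword, then year: three weak clues combined.
    hits = refine(grep(find_files(root, "*.txt"), r"report"), r"2024")
    names = [p.name for p, _, _ in hits]
    around = context(hits[0][0], hits[0][1])
```

Each filter narrows the previous result, so the only surviving match is the one file that satisfies all three constraints, and the agent can then read the lines around it.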
DCI delegates semantic interpretation directly to the agent instead of relying on embedding-based similarity search. The agent can formulate hypotheses, test exact lexical patterns, and extract detailed information that a traditional semantic retriever might miss.
The researchers propose two versions of this system. DCI-Agent-Lite is designed as a lightweight, low-cost setup built on the GPT-5.4 nano model and restricted purely to raw terminal interactions like bash commands and basic file reads. Because reading raw files can quickly fill up a smaller model’s memory, this version relies on lightweight runtime context-management strategies to sustain long-horizon exploration.
DCI-Agent-CC is the higher-performance version, designed for teams with more compute budget. It runs on Claude Code powered by Claude Sonnet 4.6. Claude Code provides stronger prompting, more robust tool orchestration, and superior built-in context handling, which improves the agent’s stability during complex, multi-step searches across heterogeneous datasets.
The researchers tested both versions of DCI across agentic search benchmarks like BrowseComp-Plus, knowledge-intensive QA with single-hop and multi-hop reasoning, and information retrieval ranking in tasks requiring domain-specific reasoning and scientific fact-checking.
They tested DCI against three baselines. The first included open-weight retrieval agents such as Search-R1 and proprietary agents powered by frontier models like GPT-5 and Claude Sonnet 4.6, paired with standard retrievers. The second baseline included classical sparse retrievers like BM25 and dense retrievers like OpenAI’s text-embedding-3-large and Qwen3-Embedding-8B. The third baseline consisted of high-performing reasoning-oriented re-rankers like ReasonRank-32B and Rank-R1.
DCI systematically outperformed the baselines, according to the researchers. On the complex BrowseComp-Plus benchmark, swapping a traditional Qwen3 semantic retriever for DCI on a Claude Sonnet 4.6 backbone improved accuracy from 69.0% to 80.0% while reducing the API cost from $1,440 to $1,016. The return on investment for lightweight agents was also noticeable. DCI-Agent-Lite with GPT-5.4 nano competed with the OpenAI o3 model using traditional retrieval while cutting costs by more than $600.
On multi-hop QA benchmarks, DCI-Agent-CC reached an 83.0% average accuracy, improving on the strongest open-weight retrieval baseline by 30.7 points, according to the researchers.
The data shows that DCI has lower overall document recall than dense embedding models, but once it finds a relevant document, it extracts substantially more value from it.
“If an enterprise AI lead asked where DCI is most clearly useful, I would point to tasks that require exact evidence localization in a dynamic workspace: debugging production incidents, searching large codebases, analyzing logs, compliance investigation, audit trails, or multi-document root-cause analysis,” the researchers note.
In one complex deep-research task, the agent had to identify a specific soccer match based on 12 interlocking clues, including exact attendance, yellow cards, and player birth dates. A traditional retriever would fail by surfacing short, disconnected snippets. Instead, the DCI agent explored the file directory, read specific lines of a 1990 England versus Belgium match report to verify the exact number of substitutions, pulled a specific quote from an interview file, and verified the exact birth dates of two players by peeking into their Wikipedia text files. By chaining these simple commands, DCI ensures that no evidence is permanently lost behind a flawed semantic search algorithm.
DCI has a clear operating envelope: it scales very well in search depth but struggles with search breadth. When the experimental corpus was expanded from 100,000 to 400,000 documents, the system’s accuracy dropped significantly and the average number of tool calls rose. While DCI is powerful once a promising document is found, the cost of locating that initial useful anchor document grows sharply as the size of the candidate space increases.
DCI also has lower broad document recall compared to dense embedding models. It trades exhaustive recall for high-resolution, local precision. If an enterprise workflow strictly requires finding every single relevant document across a massive dataset, DCI may not be the right tool.
Granting an agent expressive tools like an unrestricted bash shell increases latency and compute costs due to the high volume of iterative tool calls required to complete a search. It also creates significant context-management and security challenges for IT departments.
“Tool calls can return large outputs; long trajectories can fill the context window; and raw terminal access requires sandboxing, permission control, and careful engineering,” the authors said. To manage the context window, the researchers found that moderate truncation and compaction help the agent sustain longer searches, whereas overly aggressive summarization tends to discard useful evidence.
Because of these operational realities, DCI is not meant to be a mandatory replacement for existing vector infrastructure. Instead, it serves as a complement to it.
“For orchestration engineers and data architects, our view is that the most practical near-term deployment pattern is hybrid,” the authors said. Semantic retrieval can still provide high-recall candidate discovery when a user’s intent is broad or underspecified. “DCI can then operate as a precision and verification layer: the agent can search within the retrieved documents, expand from them into neighboring files, check exact constraints, and combine weak signals across documents.”
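A minimal sketch of that hybrid pattern, under loud assumptions, looks like this. `semantic_candidates` is only a stand-in that ranks by shared words instead of real embeddings, and the incident corpus and error codes are invented; the part that matters is the second stage, which applies exact constraints to whatever the first stage returns.

```python
import re
from typing import Dict, List

def semantic_candidates(query: str, corpus: Dict[str, str], k: int) -> List[str]:
    """Stand-in for a dense retriever: ranks by shared-word count, not embeddings."""
    words = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda d: len(words & set(corpus[d].lower().split())),
                    reverse=True)
    return ranked[:k]

def verify_exact(doc_ids: List[str], corpus: Dict[str, str], patterns: List[str]) -> List[str]:
    """Precision layer: keep only candidates that satisfy every exact constraint."""
    return [d for d in doc_ids if all(re.search(p, corpus[d]) for p in patterns)]

# Hypothetical incident notes; the two outage reports differ only in exact details.
corpus = {
    "inc-101": "payment service outage error E4012 in version 2.3.1",
    "inc-102": "payment service outage error E4021 in version 2.3.0",
    "faq": "how to reset a password",
}

candidates = semantic_candidates("payment service outage", corpus, k=2)
confirmed = verify_exact(candidates, corpus, [r"E4021", r"2\.3\.0"])
```

The broad query cannot tell the two incidents apart, while the exact-match pass separates them on the error code and version string.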
The researchers have released the code for DCI under the permissive MIT license.
“Longer term, DCI changes how we think about enterprise data. Data will not only need to be stored for humans or indexed for search engines; it will need to be organized for agents that can inspect, compare, grep, trace, and verify,” the authors conclude. “File names, timestamps, stable identifiers, metadata, version history, and machine-readable structure become part of the retrieval interface.”
AI agents forget. Every time a coding assistant loses track of a debugging thread, or a data analysis agent re-ingests the same context it already processed, the team pays in latency, token costs, and brittle workflows. The fix most teams reach for — expanding the context window or adding more RAG — is increasingly expensive and still doesn’t reliably work.
To address this, researchers from Mind Lab and several universities proposed delta-mem, an efficient technique that compresses the model’s historical information into a dynamically updated matrix without changing the model itself. The resulting module adds just 0.12% of the backbone model’s parameters — compared to 76.40% for one leading alternative — while outperforming it on memory-heavy benchmarks. Delta-mem allows models to continuously accumulate and reuse historical data, reducing the reliance on massive context windows or complex external retrieval modules for behavioral continuity.
The conventional solution is to simply dump all the information into the model’s context window.
But as Jingdi Lei, co-author of the paper, told VentureBeat, current systems treat memory merely as a context-management problem. “Either we keep expanding the context window, or we retrieve more documents through RAG,” Lei explained. “These approaches are useful and will remain important, but they become increasingly expensive and brittle when agents need to operate over long-running, multi-step interactions, and they don’t really [work] like human memory since they are more like looking up documents.”
In enterprise settings, the bottleneck is not just whether the model can access history, but whether it can reuse that history efficiently, continuously, and with low latency. Standard attention mechanisms incur a quadratic computational cost as the sequence length increases. Furthermore, expanding the context window does not guarantee the model will actually recall the information effectively. Models often suffer from context degradation or context rot as they become overwhelmed with more (and often conflicting) information, even if they support one million tokens in theory.
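A quick back-of-the-envelope calculation shows why quadratic scaling matters. The sketch below counts the query-key pairs that full self-attention scores for one head in one layer. The function name is hypothetical, and the count ignores optimizations such as causal masking or sparse attention.

```python
def attention_score_count(sequence_length):
    """Query-key pairs scored by full self-attention (one head, one layer)."""
    return sequence_length * sequence_length

short_context = attention_score_count(32_000)
long_context = attention_score_count(1_000_000)

# The context is 31.25 times longer, but the pairwise work grows by its square.
growth = long_context / short_context
print(f"{growth:.1f}x more attention scores")
```

Under this simplified count, moving from a 32,000-token prompt to a one-million-token prompt multiplies the attention work by nearly a thousand, even though the prompt itself is only about 31 times longer.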
The researchers argue for advanced memory mechanisms that can represent historical information compactly and maintain it dynamically across interactions. Existing solutions come with heavy trade-offs and generally fall into three paradigms:
Textual memory: stores history as text injected into context — constrained by window limits and prone to information loss under compression.
Outside-channel (RAG): encodes and retrieves from external modules — adds latency, integration complexity, and potential misalignment with the backbone.
Parametric: encodes memory into model weights via adapters — static after training, can’t adapt to new information during live interactions.
To achieve a compact and dynamically updated memory, delta-mem compresses an agent’s past interactions into an “online state of associative memory” (OSAM). This state is maintained as a fixed-size matrix that preserves historical information while the underlying language model remains frozen.
For enterprise workflows, this translates directly to resolving operational bottlenecks. Lei noted that a persistent coding assistant, for example, “may need to remember project conventions, recent debugging steps, user preferences, or intermediate decisions across a workflow.” Similarly, a data analysis agent might “need to maintain task state, assumptions, and prior observations while iterating over multiple tool calls.”
Rather than repeatedly retrieving and re-inserting all relevant history for these tasks, the delta-mem matrix provides a low-overhead way to carry forward useful interaction states inside the model’s forward computation.
During generation, the system does not retrieve raw text segments to add to the prompt. Instead, the backbone LLM’s current hidden state is projected into the matrix to retrieve old memory. This operation extracts context-relevant associative memory signals from delta-mem. These signals are then transformed into numerical corrections that are applied to the computations of the model. This steers the model’s reasoning at inference time without altering its internal parameters.
Following each interaction, delta-mem updates the online state using “delta-rule learning.” When new information arrives, the previous state predicts the resulting attention values. The system then compares this prediction to the actual values and corrects the memory matrix based on the discrepancy.
This update mechanism relies on a “gated delta-rule.” In practice, the memory module has gates that control how much of the previous memory is kept and how much of the new information is written. This error correction with controlled forgetting allows the matrix to evolve over time, holding onto stable historical associations without being derailed by short-term noise.
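The read-and-correct cycle described above can be sketched in a few lines. This is a generic textbook-style gated delta rule written in plain Python, not code from the delta-mem repository. The `keep` and `write` gates, the 2×2 state, and the toy key and value vectors are assumptions chosen for illustration. In the real system the gates would be learned and the inputs would come from the model's hidden states.

```python
def read(state, key):
    """Retrieve what the memory currently associates with `key` (matrix-vector product)."""
    return [sum(s * k for s, k in zip(row, key)) for row in state]

def gated_delta_update(state, key, value, keep=1.0, write=1.0):
    """One gated delta-rule step.

    keep  : fraction of the old memory retained (controlled forgetting)
    write : how strongly the prediction error is written back
    """
    decayed = [[keep * s for s in row] for row in state]
    predicted = read(decayed, key)                      # what memory expects for this key
    error = [v - p for v, p in zip(value, predicted)]   # discrepancy with the actual value
    # Add the scaled outer product of the error and the key to the decayed state.
    return [[s + write * e * k for s, k in zip(row, key)]
            for row, e in zip(decayed, error)]

state = [[0.0, 0.0], [0.0, 0.0]]

# Store an association, then read it back.
state = gated_delta_update(state, key=[1.0, 0.0], value=[2.0, -1.0])
recalled = read(state, [1.0, 0.0])

# A half-strength write moves the stored value halfway toward the new target.
partial = gated_delta_update(state, key=[1.0, 0.0], value=[4.0, 0.0], write=0.5)
blended = read(partial, [1.0, 0.0])

# A keep gate below 1 fades older associations even when the new write is empty.
faded = gated_delta_update(state, key=[0.0, 1.0], value=[0.0, 0.0], keep=0.5)
decayed_recall = read(faded, [1.0, 0.0])
print(recalled, blended, decayed_recall)
```

Because the update only writes the error, repeating information the memory already holds changes almost nothing, while surprising information produces a large correction. The state stays the same size no matter how many updates it absorbs.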
The researchers explored three strategies for determining when and how the matrix updates:
Token-state write captures fine-grained changes but is vulnerable to short-term noise.
Sequence-state write averages tokens within a message segment, smoothing updates at the cost of some localized detail.
Multi-state write decomposes memory into sub-states for different information types like facts or task progress.
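The difference between the three strategies comes down to what gets written and how often. The sketch below is a simplified illustration, not the researchers' implementation. Token vectors are plain lists, and the explicit type tags used for the multi-state case (`"fact"`, `"progress"`) are a hypothetical stand-in for however the real system separates information types.

```python
def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def token_state_writes(token_vectors):
    """One write per token: every fluctuation reaches memory."""
    return [list(vector) for vector in token_vectors]

def sequence_state_write(token_vectors):
    """One write per message segment: the average of its token vectors."""
    return [mean_vector(token_vectors)]

def multi_state_writes(tagged_vectors):
    """One averaged write per sub-state, grouped by a (hypothetical) type tag."""
    groups = {}
    for kind, vector in tagged_vectors:
        groups.setdefault(kind, []).append(vector)
    return {kind: mean_vector(vectors) for kind, vectors in groups.items()}

segment = [[1.0, 0.0], [3.0, 0.0], [2.0, 6.0]]   # the last token is an outlier
per_token = token_state_writes(segment)           # three writes, outlier included as-is
per_segment = sequence_state_write(segment)       # one smoothed write

tagged = [("fact", [1.0, 0.0]), ("progress", [0.0, 4.0]), ("fact", [3.0, 0.0])]
per_type = multi_state_writes(tagged)
print(per_token, per_segment, per_type)
```

Averaging dilutes the outlier token across the segment, which is the smoothing the sequence-state strategy trades for lost detail. Grouping by type keeps facts and task progress from overwriting each other.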
The researchers evaluated delta-mem across three LLM backbones: Qwen3-8B, Qwen3-4B-Instruct, and SmolLM3-3B. They configured the framework with a compact 8×8 matrix. The system was tested on general capability benchmarks, including HotpotQA, GPQA-Diamond, and IFEval. It was also evaluated on memory-heavy tasks such as LoCoMo, which tests long-term conversational memory, and Memory Agent Bench, which assesses retention, retrieval, selective forgetting, and test-time learning over extended interactions.
The framework was compared against representative models from the three existing memory paradigms: textual memory baselines (e.g., BM25 RAG, LLMLingua-2, and MemoryBank), parametric systems (Context2LoRA and MemGen), and the outside-channel approach MLP Memory.
Across the board, delta-mem outperformed the baselines, according to the researchers. On the Qwen3-4B-Instruct backbone, the token-state write variant achieved an average score of 51.66%, easily surpassing the frozen vanilla backbone at 46.79% and the strongest baseline, Context2LoRA, at 44.90%. On the memory-heavy Memory Agent Bench, the average score jumped from 29.54% to 38.85%. Performance on the specific test-time learning subtask nearly doubled from 26.14 to 50.50.
However, the most compelling takeaway is the system’s operational efficiency. The researchers tested the framework in a no-context setting where the historical text was entirely removed from the context. Even without explicit text replay, delta-mem successfully recovered context-relevant evidence in multi-hop tasks. The researchers argue that the model remembers past interactions without needing to ingest massive amounts of prompt tokens.
The framework also adds only 4.87 million trainable parameters, representing just 0.12% of the Qwen3-4B-Instruct backbone. By comparison, the MLP Memory baseline required 3 billion parameters, equivalent to 76.40% of the backbone’s size, while delivering inferior results. When prompt lengths scaled up to 32,000 tokens during inference tests, the framework maintained almost the exact same GPU memory footprint as a standard, unmodified model. It sidesteps the heavy memory bloat that affects other advanced memory systems like MemGen and MLP Memory.
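The reported percentages can be sanity-checked with simple arithmetic. The sketch below uses only the figures quoted above. Because those figures are rounded (the 3 billion in particular), the results are approximations rather than exact parameter counts, and the variable names are illustrative.

```python
# Figures as reported in the article (rounded).
DELTA_MEM_PARAMS = 4.87e6      # trainable parameters added by delta-mem
DELTA_MEM_SHARE = 0.0012       # 0.12% of the backbone
MLP_MEMORY_PARAMS = 3e9        # parameters required by MLP Memory
MLP_MEMORY_SHARE = 0.7640      # 76.40% of the backbone

# Backbone size implied by each pair of figures.
backbone_from_delta_mem = DELTA_MEM_PARAMS / DELTA_MEM_SHARE
backbone_from_mlp_memory = MLP_MEMORY_PARAMS / MLP_MEMORY_SHARE

# How many times larger the MLP Memory module is than the delta-mem module.
size_ratio = MLP_MEMORY_PARAMS / DELTA_MEM_PARAMS

print(f"implied backbone: {backbone_from_delta_mem / 1e9:.2f}B "
      f"and {backbone_from_mlp_memory / 1e9:.2f}B parameters")
print(f"MLP Memory is about {size_ratio:.0f}x larger than the delta-mem module")
```

Both pairs of figures imply a backbone of roughly 4 billion parameters, which matches a 4B-class model, and the added module is several hundred times smaller than the MLP Memory alternative.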
Different update strategies proved beneficial depending on the underlying model capacity. The sequence-state write strategy was the most effective for stronger backbones like Qwen3-8B. These more capable models use the segment-level writing to smooth out updates and mitigate token-level noise. Conversely, the multi-state write strategy drove massive performance leaps for smaller backbones like SmolLM3-3B. For these lower-capacity models, separating memory into multiple states proved critical to minimizing information interference.
The researchers have released the code for delta-mem on GitHub and the weights for their trained adapters on Hugging Face. For AI engineering teams looking to integrate this framework into their existing inference stack, the process requires minimal computing resources.
“In practice, an engineering team would start from an existing instruction-tuned backbone, attach the Delta-Mem adapter modules to selected attention layers, train only the adapter parameters on domain-relevant multi-turn or long-context data… and then run inference with the memory state updated online during interaction,” Lei said. Crucially, teams do not need a massive pretraining corpus. The training data only needs to reflect the target memory behavior, such as multi-turn dialogues, agent traces, or domain workflows where earlier information must influence later decisions.
While compressing interaction history into a fixed-size matrix yields large efficiency gains, it comes with trade-offs. Delta-mem is not a lossless replacement for explicit text logs or document retrieval. Because different pieces of information compete inside the same limited state, there is a risk of memory blending.
“Delta-Mem is useful when the system needs fast, online, continuously updated behavioral state,” Lei said. “RAG is better when the system needs exact factual recall, citation, compliance, auditability, or access to a large external knowledge base.” Remembering a user’s working style or a multi-step reasoning trajectory is a perfect fit for delta-mem, while retrieving a legal contract or a medical guideline should remain in a vector database.
This means the most realistic enterprise architecture moving forward is a hybrid approach. Delta-mem acts as a lightweight internal working memory, reducing the need to retrieve or replay everything all the time, while RAG serves as the explicit, high-capacity memory layer.
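One way to picture the division of labor is a small routing rule that sends each request to the memory layer suited to it. The sketch below is a hypothetical illustration, not part of delta-mem. The `MemoryLayer` enum, the need tags, and the `choose_layer` function are invented for this example, with the tags drawn from the criteria Lei lists for when RAG is the better fit.

```python
from enum import Enum

class MemoryLayer(Enum):
    WORKING_MEMORY = "in-model state (delta-mem style)"
    RETRIEVAL = "explicit store (RAG / vector database)"

# Hypothetical tags for needs that call for verbatim, attributable recall.
EXACT_RECALL_NEEDS = {"exact_fact", "citation", "compliance", "audit", "large_knowledge_base"}

def choose_layer(needs):
    """Route to retrieval if any need requires exact recall, else to working memory."""
    if EXACT_RECALL_NEEDS & set(needs):
        return MemoryLayer.RETRIEVAL
    return MemoryLayer.WORKING_MEMORY

style_request = choose_layer({"user_working_style", "reasoning_trajectory"})
contract_request = choose_layer({"citation", "user_working_style"})
print(style_request.name, contract_request.name)
```

A production system would likely use both layers for a single request rather than choosing one. The point of the sketch is only that exact-recall requirements push work toward the explicit store.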
“Looking ahead, I do not think vector databases will become obsolete,” Lei said. “Instead, I expect enterprise AI stacks to become more layered. We will likely see short-term working memory inside the model, longer-term explicit memory in retrieval systems, and policy or audit layers that decide what should be stored, retrieved, forgotten, or exposed to the user.”
At Google I/O, the company unveiled Managed Agents in its Gemini API — a service that promises to collapse weeks of agent deployment work into a single API call. It’s also a sign that Google believes its ecosystem, including the newly launched Antigravity CLI, is ready to own the execution layer end-to-end.
Before a single agent is written, teams are already spending days on the unglamorous work: standing up execution environments, managing sandboxes, wiring tool call infrastructure. Model providers like Anthropic have launched platforms to handle much of that work — but Google’s approach is different.
Google said in a blog post that Managed Agents in the Gemini API abstracts “away the complexity so that you can focus on your product experience and agent behavior.” The service is available in preview via new custom templates in Google AI Studio.
The growth of managed agent platforms has introduced a real architectural question: should agent management live at the execution layer (embedded in the model or its harness) or at the infrastructure layer, as a separate runtime?
Until recently, agent orchestration relied on frameworks that sat above the model, directing agents and letting teams control routing and execution separately. That layer is now being absorbed by the platforms themselves.
Recent platforms like Claude Managed Agents embed orchestration at the model layer rather than on a separate runtime platform. The idea is that the model owns the reasoning and orchestration layers, and enterprises have control over execution.
AWS, through new capabilities on Bedrock AgentCore, adds managed harnesses that stitch together the upfront tasks for deploying agents. Google’s approach goes further, optimizing the model, harness, and sandbox together and running everything in secure Google-managed environments.
René Sultan of Ramp, cited in Google’s announcement, said the shift is concrete: “The real shift with Gemini Managed Agents is that the agent runtime moves into the platform. With the sandbox, infrastructure and execution loop managed for you, developers can focus on productizing the agent’s domain-specific behavior and iterating at a completely different pace.”
Enterprises starting fresh with agents may find the platform offerings from Anthropic and Google compelling, especially since they remove much of the difficulty of deploying agents while still leaving teams some control. Google, however, is pushing for a more vertically integrated system, while Anthropic is betting on the model layer as an orchestration plane, and AWS focuses on authorization.
But this also brings some risks, according to XYO founder and chief executive Arie Trouw.
“An additional risk is that developers will switch out what previously were deterministic services for what will now be probabilistic services, which can introduce unpredictable outcomes for the users at best, or data corruption at worst,” Trouw told VentureBeat in an email. “This is the classic example of having an amazing hammer and everything starting to look like nails. I’ve seen this pattern repeatedly as a developer and business founder myself in the past few decades.”