Enterprise AI agents keep failing because they forget what they learned

RAG architectures are good at one thing: surfacing semantically relevant documents. That’s also where they stop.

A framework called a decision context graph addresses that gap by giving agents structured memory, time-aware reasoning, and explicit decision logic. Rippletide, a startup in the Neo4j ecosystem, has built one. The key capability: agents that are non-regressive, able to freeze validated sequences of actions and compound on them over time.

“The key point you want is non-regressivity: How do you make sure that, when the agent will generate something new, you can compound on the previous discoveries?” said Yann Bilien, Rippletide’s co-founder and chief scientific officer.

Why RAG doesn’t go far enough

Enterprise context is sprawled across ERP tools, logs, databases, vector stores, and policy documents. Generative AI tools can retrieve from all of it — through keyword search, SQL queries, or full RAG pipelines — but retrieval has a ceiling.

Notably, data retrieved may not be relevant to the decision at hand (thus causing hallucinations); and, even if agents do pull the right data, they often lack guidance to make decisions backed by a strong rationale.

That is, RAG retrieves documents, not decision context. “Everyone starts with RAG: Pull relevant docs, stuff them in the prompt, let the model figure it out,” said Wyatt Mayham of Northwest AI Consulting.

While that works fine for chatbots, it “breaks immediately” for agents that need to make decisions and take actions, he pointed out. “The biggest thing builders struggle with is the gap between retrieval and applicability.” 

A retrieved document doesn’t tell the agent whether it still applies, whether it’s been superseded, or whether there’s a conflicting rule that takes priority, Mayham said. “Agents need decision context, not just information.”

In construction (the human world), that might mean knowing that a pricing exception expired, that a safety policy only applies in certain jurisdictions, or that a standard operating procedure was updated a month prior. “Miss any of that, and the agent confidently does the wrong thing,” Mayham said. 

Without structured decision context, agents combine incompatible rules, invent constraints to fill gaps, and rely on what Bilien calls “probabilistic guesses over unbounded data.” Errors are difficult to reproduce because builders can’t trace why the agent made a given choice.

The compounding error problem is real, too, Mayham said: A small miss rate per step becomes “catastrophic” across a multi-step workflow. “That’s the main reason most enterprise agents never leave the pilot phase.” 
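A back-of-the-envelope calculation shows how quickly this compounds. The sketch below assumes every step succeeds independently with the same probability, which is a simplification and not something Mayham stated; the function name is illustrative.

```python
def workflow_success_rate(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    # Compare a short and a long workflow at the same per-step reliability.
    for steps in (1, 5, 10, 20):
        rate = workflow_success_rate(0.95, steps)
        print(f"{steps:>2} steps at 95% per step: {rate:.1%} end to end")
```

Under these assumptions, a 95% per-step success rate falls to roughly 60% over 10 steps and about 36% over 20.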

How decision context graphs get to the relevant answer 

A decision context graph solves this by encoding a structured map of what is applicable, what the rules are, and when they apply.

The framework is optimized for one question: “Given this situation, which context applies right now?” Time is treated as a first-class dimension; every rule, decision, and exception is scoped to when it is valid.

“The goal is to explicitly address missing, incoherent, or contradictory data when building the graph to avoid probabilistic [errors] once the agent is running,” Bilien said. 

The system is built around three principles:

  • Applicability: Logic is explicitly encoded so the agent knows what rules to remember and apply in a given situation. Context is returned only when it is relevant to the situation. 

  • Time-aware memory: Every rule, decision, and exception is time-scoped. This allows agents to reason about “What was true then versus what is true now,” then reproduce or explain their decisions.

  • Decision paths: The system can explain how it got from A to B and the “why” behind its rationale (for instance, why one piece of context was included and another was not). Agents are given “decision path” examples of how similar cases were handled before. 
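The three principles can be illustrated with a small in-memory sketch. This is not Rippletide's implementation: the `Rule` fields, the `supersedes` link, and the example rules are all hypothetical, chosen to mirror the expired pricing exception, jurisdiction-bound policy, and updated procedure described earlier.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    jurisdiction: str                 # hypothetical applicability attribute
    valid_from: date
    valid_to: Optional[date] = None   # None means "still in force"
    supersedes: Optional[str] = None  # name of the rule this one replaces


def applicable_rules(rules: List[Rule], jurisdiction: str,
                     on: date) -> Tuple[List[Rule], List[str]]:
    """Return the rules that apply on a date, plus a decision path
    recording why each rule was included or excluded."""
    path: List[str] = []
    in_scope: List[Rule] = []
    for rule in rules:
        if rule.jurisdiction != jurisdiction:
            path.append(f"skip {rule.name}: applies in {rule.jurisdiction}")
        elif on < rule.valid_from or (rule.valid_to and on > rule.valid_to):
            path.append(f"skip {rule.name}: not valid on {on}")
        else:
            in_scope.append(rule)
    # A rule only supersedes another if it is itself in force on that date.
    superseded_by = {r.supersedes: r.name for r in in_scope if r.supersedes}
    result: List[Rule] = []
    for rule in in_scope:
        if rule.name in superseded_by:
            path.append(f"skip {rule.name}: superseded by {superseded_by[rule.name]}")
        else:
            path.append(f"apply {rule.name}")
            result.append(rule)
    return result, path


RULES = [
    Rule("pricing-exception", "US", date(2024, 1, 1), date(2024, 6, 30)),
    Rule("sop-v1", "US", date(2023, 1, 1)),
    Rule("sop-v2", "US", date(2024, 9, 1), supersedes="sop-v1"),
    Rule("eu-safety-policy", "EU", date(2022, 1, 1)),
]

if __name__ == "__main__":
    for when in (date(2024, 3, 1), date(2024, 10, 1)):
        rules, path = applicable_rules(RULES, "US", when)
        print(when, [r.name for r in rules])
        for step in path:
            print("  ", step)
```

Querying the same rule set for two different dates returns different answers, and the recorded path explains each inclusion and exclusion.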

At setup, unstructured data is ingested and structured into an ontology: what entities exist, what rules apply, what counts as an exception. Neuro-symbolic AI handles the pattern recognition and encodes formal, machine-readable logic. Over time, the system refines its knowledge base as new decisions are made.

“Neuro-symbolic brings two parts: A neuronal part giving a large autonomy to agents and a symbolic part to reduce the number of data needed and bring control,” Bilien said. 

The agent is tested at build time (pre-production) to validate its behaviors or pinpoint improvements. This reduces risks as well as computation needs during inferencing, he noted. 

Agents learning, rather than regressing 

When it comes to non-regression, the key piece is compounding both on intelligence (models) and on knowledge (shared between agents), Bilien said. It’s important that agents can explore; when they don’t know how to accomplish a task, they can attempt different possibilities, typically in a controlled environment or simulation (like a support bot trying multiple response patterns). 

Then, “once a solution is evaluated as satisfactory, the graph freezes that sequence of actions,” Bilien said. Future exploration then starts from this “stable base of validated behaviors” to prevent newly-acquired skills from overwriting previously learned good behavior. 

Before an agent acts or affects a customer, it checks against the graph: Is it violating a rule? Hallucinating? Staying within constraints? Can it generalize the solution across similar cases?

At a macro level, the system assesses outcomes: Did the behavior improve long-term performance? Did it generalize across similar contexts? Did it preserve previous capabilities?
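One way to picture the freeze-then-check cycle is a registry of validated action sequences that any new behavior must reproduce before it is accepted. The sketch below is an assumption-laden illustration, not Rippletide's graph: the class name, the string-keyed cases, and the exact-match comparison are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A policy maps a case identifier to the sequence of actions the agent takes.
Policy = Callable[[str], List[str]]


class FrozenBehaviors:
    """Hypothetical store of validated action sequences."""

    def __init__(self) -> None:
        self._frozen: Dict[str, Tuple[str, ...]] = {}

    def freeze(self, case: str, actions: List[str]) -> None:
        """Record a validated sequence; frozen entries are never overwritten."""
        if case in self._frozen:
            raise ValueError(f"{case!r} is already frozen")
        self._frozen[case] = tuple(actions)

    def regressions(self, candidate: Policy) -> List[str]:
        """Cases where the candidate deviates from a frozen sequence."""
        return [case for case, actions in self._frozen.items()
                if tuple(candidate(case)) != actions]

    def accept(self, candidate: Policy) -> bool:
        """Accept a candidate only if it preserves every frozen behavior."""
        return not self.regressions(candidate)


def baseline(case: str) -> List[str]:
    known = {"refund_request": ["verify_order", "issue_refund"]}
    return known.get(case, ["escalate"])


def candidate(case: str) -> List[str]:
    # Learns a new skill but silently drops a step from an old one.
    known = {
        "refund_request": ["issue_refund"],
        "address_change": ["verify_identity", "update_address"],
    }
    return known.get(case, ["escalate"])


if __name__ == "__main__":
    store = FrozenBehaviors()
    store.freeze("refund_request", baseline("refund_request"))
    print("baseline accepted:", store.accept(baseline))
    print("candidate regressions:", store.regressions(candidate))
```

Here the candidate gains a new capability but is rejected because it changes a previously validated sequence, which is the failure mode non-regression is meant to prevent.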

“This determinism is key for agents to run reliability at scale,” Bilien said. It leads to behavior that is more consistent, predictable, and explainable, allowing for stronger control and auditability.

“You want your agents to be able to learn by themselves when they face something they don’t know,” he said. “You want them to be able to explore and find new solutions.”

Getting beyond “episodic” memory

While the team initially assumed it would deploy RL everywhere, “that actually proved very difficult in an enterprise setting,” Bilien said. “Data are scarce for some specific use cases and messy for others.”

Typically, using raw data for reliable predictions has been a manual and time-consuming challenge, but “now with agents we entered a new era where building ontologies is possible automatically,” Bilien said. 

Classic supervised fine-tuning methods can lead to oscillations, when models forget the last skill they learned while learning the next one. Overall, learning is not compounded, compression is “dramatic,” and models improve “episodically” rather than continuously, leading them to continually fail on new or unseen tasks.

As Bilien noted: “You will never have a fully self-learning model if you are regressing every time.” 

In enterprise use cases — like banking where millions of transactions are processed a day — a high level of reliability is critical, he noted. “One question I ask all customers: Is 95% enough? In a lot of use cases, it’s not. You need 99.999%. 1% off is way too much.” 

Decision context graphs can close that gap, he contends: When the same customer support question is asked repeatedly, the agent will return a “satisfactory” answer predictably and without regression, all while retaining autonomy. 

Encoding applicability and temporal validity into a structured graph — rather than relying on an LLM to infer it — is a “sound approach” to a real limitation in existing retrieval frameworks, Mayham said. The open question is whether the automatic ontology generation holds up against the messy, diverse data that enterprises actually have. “That’s always the hard part,” he said.

NanoClaw’s creators are turning the secure, open source AI agent harness into an enterprise ‘second brain’

The creators of NanoClaw — the hit open source, enterprise-friendly variant of autonomous AI agent harness OpenClaw — are moving towards commercializing their technology for enterprises at scale, aiming to provide them with secure AI agents, and an ever-updating library of workplace context, for each human employee the enterprise has approved.

The duo, former Wix.com engineer Gavriel Cohen and his brother Lazer Cohen, also founder of tech public relations firm Concrete Media, shared with VentureBeat that their new startup, NanoCo AI, has raised a $12 million oversubscribed seed round led by Valley Capital Partners.

The round features a roster of strategic backers that reads like an enterprise infrastructure all-star team, including Docker, Vercel, monday.com, Factorial Capital, and Hugging Face CEO and founder Clem Delangue.

Buoyed by the seed round, NanoCo AI wants to move beyond basic automation to offer every enterprise worker a secure “professional assistant.” Yet they are still committed to building out and maintaining NanoClaw as an MIT Licensed, enterprise-friendly, open source standard — just offering specialized commercial managed services integration atop it.

The new killer use case: an informed, ever-updating personal assistant for each human worker

Gavriel, now CEO of NanoCo AI, sees this personalized approach as the ultimate unlock for the modern worker.

“The killer use case is the one to one we’re calling it professional assistant,” Cohen explained in a recent exclusive interview with VentureBeat. “If you can give someone an agent and make them twice, three times as effective, then you probably want more people as well, right?”

He noted that as users forward emails, documents, and call notes to the agent, it systematically builds an “LLM wiki” — similar to the “LLM Knowledge Base” concept articulated by influential AI researcher Andrej Karpathy — effectively creating a dynamic knowledge graph of the user’s specific job and projects.

This persistent memory allows the agent to shift from simply answering questions to actively transforming information and executing first drafts that rival human output.

Cohen emphasized that NanoClaw acts as a massive productivity multiplier rather than a headcount replacement.

One-to-one secure ‘lobster’ AI

NanoCo’s core offering is a one-to-one professional AI assistant designed to shadow employees, draft contracts, review code, and manage accounts directly within tools like Slack and Microsoft Teams.

Rather than a generic chatbot, the assistant learns the employee’s role and adapts to their specific working style through ordinary conversation.

How does NanoCo prevent this highly capable assistant from going rogue? By moving security away from fragile prompt engineering and embedding it directly into the infrastructure.

Unlike its predecessor and inspiration, the even more popular open source AI assistant OpenClaw (which grew to a massive 400,000 lines of code), NanoClaw’s core logic was intentionally minimized to roughly 500 lines of TypeScript. This minimalism ensures the entire system can be audited by a human security team in about eight minutes.

Furthermore, every NanoClaw agent operates within a strictly isolated environment. Leveraging a strategic partnership with Docker announced in March, NanoCo AI runs these agents inside MicroVM-based Docker Sandboxes.

“In NanoClaw, the ‘blast radius’ of a potential prompt injection is strictly confined to the container and its specific communication channel,” Cohen previously explained.

To prevent unauthorized actions, raw API credentials never reach the agent itself. Instead, outbound requests pass through a secure OneCLI Rust Gateway that enforces company-defined policies. If an agent attempts a sensitive “write” action—like modifying a cloud environment or deleting an email—the gateway intercepts the request and pings the human user via a rich interactive card on Slack, Teams, or WhatsApp.

Only when the user explicitly taps “Approve” does the system inject the credential. It is the architectural equivalent of a highly capable junior employee drafting an important corporate communication, but being physically unable to click “send” without the manager turning a literal launch key.
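The approval flow can be sketched in a few lines. This is an illustrative Python model, not the OneCLI gateway's actual interface (which the article describes as written in Rust): `Gateway`, `Request`, the `ask_human` callback, and the action names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set


@dataclass(frozen=True)
class Request:
    action: str   # e.g. "read_events" or "delete_event"
    target: str   # which integration the request is aimed at


class Gateway:
    """Holds credentials on behalf of the agent and gates sensitive actions."""

    def __init__(self, credentials: Dict[str, str], write_actions: Set[str],
                 ask_human: Callable[[Request], bool]) -> None:
        self._credentials = credentials      # never handed to the agent
        self._write_actions = write_actions  # company-defined policy
        self._ask_human = ask_human          # e.g. an interactive chat card

    def execute(self, request: Request,
                backend: Callable[[Request, str], str]) -> str:
        # Sensitive actions require explicit human approval first.
        if request.action in self._write_actions and not self._ask_human(request):
            return "denied"
        # The credential is injected here, at the boundary, only for this call.
        token = self._credentials[request.target]
        return backend(request, token)


if __name__ == "__main__":
    def calendar_backend(request: Request, token: str) -> str:
        return f"{request.action} ok"

    gateway = Gateway(
        credentials={"calendar": "secret-token"},
        write_actions={"delete_event"},
        ask_human=lambda request: False,  # the user taps "Deny"
    )
    print(gateway.execute(Request("read_events", "calendar"), calendar_backend))
    print(gateway.execute(Request("delete_event", "calendar"), calendar_backend))
```

The agent only ever sees the return value of `execute`, so a prompt injection that takes over the agent has no credential to exfiltrate.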

Continued commitment to open source, MIT License

Despite its new enterprise push, NanoCo AI is maintaining its commitment to its open-source foundation. The core NanoClaw framework remains available under the permissive MIT License, meaning independent developers and companies can continue to fork, modify, and run the system locally.

In plain terms, the MIT License allows anyone to use the software commercially without paying NanoCo AI, provided they include the original copyright notice.

NanoCo AI’s monetization strategy instead focuses on the vast majority of enterprises that lack the specialized engineering resources to build, maintain, and scale internal agent platforms.

While highly technical teams can choose to build their own infrastructure on top of the open-source code, NanoCo will sell managed, organization-wide deployments, taking on the burden of health checks, integrations, and ongoing security maintenance.

Widespread global adoption

The open-source adoption of NanoClaw has been staggering, crossing 250,000 downloads and nearing 29,000 GitHub stars since its debut. This ground-up momentum is entirely responsible for the surging enterprise demand.

“Countless enterprise executives have told us the same thing,” Cohen stated in the press release. “They’re running NanoClaw personally, getting two and three times more done, and asking how to roll it out to their teams.”

Perhaps the most high-profile validation came during the founders’ recent trip to Singapore. The country’s Foreign Minister, Dr. Vivian Balakrishnan, invited the NanoCo team to his office after publicly posting about his personal use of NanoClaw. Balakrishnan described the agent as “getting smarter over time,” referred to it as his “second brain,” and stated he wouldn’t “dare switch it off.”

Cohen put the platform’s security claims to the ultimate test during a live conference demonstration in Singapore. He invited a crowd of 300 people to chat simultaneously with his personal agent, which was actively connected to his real email and calendar.

Thanks to NanoClaw’s zero-trust gateway architecture, the agent safely rejected malicious attempts to access his inbox or delete existing events, while successfully allowing 12 attendees to book legitimate coffee chats.

As AI shifts from a novelty tool that answers questions into a digital workforce that autonomously executes tasks, NanoCo AI is betting that verifiable security will be the defining metric of success. By combining a transparent open-source core with strict, infrastructure-level sandboxing, they aren’t just selling an assistant; they are selling the peace of mind required for enterprises to actually use one.

Claude agents can finally connect to enterprise APIs without leaking credentials

The reason enterprises have been slow to connect AI agents to internal APIs and databases isn’t the models — it’s the credentials. In most production deployments, the agent carries authentication tokens with it as it executes tool calls, which means a compromised or misbehaving agent takes the keys with it.

Anthropic is addressing that problem with two new capabilities for Claude Managed Agents: self-hosted sandboxes, which let teams run tool execution inside their own infrastructure perimeter, and MCP tunnels, which connect agents to private MCP servers without exposing credentials in the agent’s context. Together they move credential control to the network boundary rather than leaving it inside the agent.

Right now, self-hosted sandboxes are available to Claude Managed Agent users in public beta, while MCP tunnels are currently in research preview.  

Anthropic isn’t the only model provider making this bet. OpenAI added local execution to its Agents SDK in April in response to similar demand. The architectural distinction Anthropic draws is a split: the agent loop runs on Anthropic’s infrastructure, while tool execution runs on the enterprise’s own system — a separation that existing sandbox approaches, including OpenAI’s, don’t make.

The architecture problem in sandboxes and agents

MCP moved to enterprise production faster than the security architecture around it matured. In most deployments, credentials travel through the agent itself as it executes tool calls against internal systems — meaning a compromised or misbehaving agent has everything it needs to cause damage.

Self-hosted sandboxes, such as those offered on Claude Managed Agents, help keep files and packages within an enterprise’s infrastructure. The agentic loop—orchestration, context management and error recovery—moves to the platform, and ideally, enterprises control compute resources. 

This allows the agent to complete tool calls without holding the keys that unlock the systems it calls.

Private network connectivity works similarly — a lightweight outbound-only gateway inside the organization’s network, with no credentials passing through the agent.

Orchestration teams get some control

For orchestration teams, the capabilities represent more than just a security update; they help agents run better. But the first thing they need to understand is how this split architecture can affect their deployment. 

Since sandboxes determine tool execution locations and the resources agents access, and MCP tunnels tell agents how to reach internal systems, these are separate concerns—splitting them up enables enterprises to map agents’ workflows more effectively.

For teams already on Claude Managed Agents, the practical starting point is sandboxes — move tool execution onto your own infrastructure and test the boundary before touching MCP tunnels, which are still in research preview. Teams evaluating the platform for the first time should treat the sandbox architecture as the primary technical differentiator: it’s the piece that changes the threat model, not just the deployment model.

LangSmith Engine closes the agent debugging loop automatically — but multi-model enterprises still need a neutral layer

Enterprises building and deploying agents have a problem: it’s taking their engineers too long to find out that an agent made a mistake, and in the meantime the error keeps repeating, especially when no human checks every step.

LangSmith, the monitoring and evaluation platform from LangChain, launched a new capability in public beta that could make that issue more manageable. LangSmith Engine automates the entire chain by detecting production failures, diagnosing root causes against the live codebase, drafting a fix and preventing regression. It does this in a single automated pass. 

LangSmith Engine gives AI engineers a faster path to triage, but it launches into a crowded field: Anthropic, OpenAI and Google are all pulling observability and evaluation into their own platforms.

LangSmith Engine looks at failures

LangChain said in a blog post that the typical agent development cycle starts by tracing the agent to understand what it’s doing, followed by identifying gaps, making changes to the prompts and tools, and creating ground-truth datasets. Developers then run experiments and check for regressions before shipping the agent. 

The problem is that customers often run into issues when the trace review doesn’t surface faulty patterns, error repetition gets difficult to see, and there’s no targeted evaluator to catch the same problem when it repeats in production.

LangSmith Engine works by monitoring production traces for several signal types, “explicit errors, online evaluator failures, trace anomalies, negative user feedback and unusual behaviors like user asking questions the agent wasn’t built to answer,” according to the blog post.

Engine will then read the live codebase, find the culprit and draft a pull request before proposing a custom evaluator for that specific failure pattern. The human comes in at the approval step. 

It’s built on top of LangSmith’s existing tracing and evaluation infrastructure and also works with an enterprise’s evaluator results. 

Unlike observability tools such as Weights & Biases, Arize Phoenix and Honeyhive, LangSmith Engine takes the entire chain automatically — detecting the failure, diagnosing root cause, drafting a fix — and brings the human in only at the approval step.

Model providers bringing evaluators in platform

While LangSmith identified this evaluation loop as a need for many enterprises, Engine comes at a time when the larger providers are beginning to offer observability tools within their platforms. This means enterprises may choose to use an end-to-end platform rather than add LangSmith Engine onto their existing workflows.

Anthropic’s Claude Managed Agents brings together agentic deployment, evaluation and orchestration into a single suite. OpenAI’s Frontier offers a similar end-to-end platform for building, governing and evaluating enterprise agents — though both have faced questions from enterprises wary of committing to a single vendor.

However, practitioners point out that not everyone wants to bring evaluations and observability fully into one platform.

Leigh Coney, founder and principal consultant at Workwise Solutions, told VentureBeat that third-party observability is the default for many enterprises. 

“One fund I work with runs Claude for analysis and GPT for a separate workflow. If observability lives inside each provider’s tooling, you now have two systems that can’t talk to each other. Your compliance team can’t produce a unified audit trail,” he said. “So third-party observability is surviving because multi-model is already the default in enterprise, and somebody has to sit across providers.”

Jessica Arredondo Murphy, CEO and co-founder of True Fit, said independent platforms like LangSmith have to prove to enterprises that they can “answer the long-term question of whether they become the cross-model operating layer for quality and reliability.”

“Enterprises are not consolidating onto the first-party model provider tooling as quickly as the model providers would prefer. What I see is a pragmatic split: teams will use first-party tooling for fast onboarding and early-stage debugging, but as soon as they care about production reliability, governance, and long-term flexibility, they tend to introduce a more neutral layer for observability and evaluation,” she said. 

LangSmith Engine is available now in public beta. Teams can connect a tracing project, optionally connect their repo, and Engine will begin surfacing issues from production traces automatically.

Architectural patterns for graph-enhanced RAG: Moving beyond vector search in production

Retrieval-augmented generation (RAG) has become the de facto standard for grounding large language models (LLMs) in private data. The standard architecture — chunking documents, embedding them into a vector database, and retrieving top-k results via cosine similarity — is effective for unstructured semantic search.

However, for enterprise domains characterized by highly interconnected data (supply chain, financial compliance, fraud detection), vector-only RAG often fails. It captures similarity but misses structure. It struggles with multi-hop reasoning questions like, “How will the delay in Component X impact our Q3 deliverable for Client Y?” because the vector store doesn’t “know” that Component X is part of Client Y’s deliverable.

This article explores the graph-enhanced RAG pattern. Drawing on my experience building high-throughput logging systems at Meta and private data infrastructure at Cognee, we will walk through a reference architecture that combines the semantic flexibility of vector search with the structural determinism of graph databases.

The problem: When vector search loses context

Vector databases excel at capturing meaning but discard topology. When a document is chunked and embedded, explicit relationships (hierarchy, dependency, ownership) are often flattened or lost entirely.

Consider a supply chain risk scenario. While this is a hypothetical example, it represents the exact class of structural problems we see constantly in enterprise data architectures:

  • Structured data: A SQL database defining that Supplier A provides Component X to Factory Y.

  • Unstructured data: A news report stating, “Flooding in Thailand has halted production at Supplier A’s facility.”

A standard vector search for “production risks” will retrieve the news report. However, it likely lacks the context to link that report to Factory Y’s output. The LLM receives the news but cannot answer the critical business question: “Which downstream factories are at risk?”

In production, this manifests as hallucination. The LLM attempts to bridge the gap between the news report and the factory but lacks the explicit link, leading it to either guess relationships or return an “I don’t know” response despite the data being present in the system.

The pattern: Hybrid retrieval

To solve this, we move from a “Flat RAG” to a “Graph RAG” architecture. This involves a three-layer stack:

  1. Ingestion (The “Meta” Lesson): At Meta, working on the Shops logging infrastructure, we learned that structure must be enforced at ingestion. You cannot guarantee reliable analytics if you try to reconstruct structure from messy logs later. Similarly, in RAG, we must extract entities (nodes) and relationships (edges) during ingestion. We can use an LLM or named entity recognition (NER) model to extract entities from text chunks and link them to existing records in the graph.

  2. Storage: We use a graph database (like Neo4j) to store the structural graph. Vector embeddings are stored as properties on specific nodes (e.g., a RiskEvent node).

  3. Retrieval: We execute a hybrid query:

    • Vector scan: Find entry points in the graph based on semantic similarity.

    • Graph traversal: Traverse relationships from those entry points to gather context.

Reference implementation

Let’s build a simplified implementation of this supply chain risk analyzer using Python, Neo4j, and OpenAI.

1. Modeling the graph

We need a schema that connects our unstructured “risk events” to our structured “supply chain” entities.
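A minimal sketch of such a schema follows. The labels `Supplier`, `Factory`, and `RiskEvent` come from the scenario above, but the constraint and index names, the relationship shape `(:RiskEvent)-[:AFFECTS]->(:Supplier)-[:SUPPLIES]->(:Factory)`, and the 1536-dimension embedding size are assumptions. The statements use Neo4j 5 Cypher syntax, and `session` is expected to behave like a session from the official `neo4j` Python driver.

```python
# Assumed graph shape (relationship names are illustrative):
#   (:RiskEvent)-[:AFFECTS]->(:Supplier)-[:SUPPLIES]->(:Factory)
SCHEMA_STATEMENTS = [
    # Structured entities are keyed by name so ingestion can link to them.
    "CREATE CONSTRAINT supplier_name IF NOT EXISTS "
    "FOR (s:Supplier) REQUIRE s.name IS UNIQUE",
    "CREATE CONSTRAINT factory_name IF NOT EXISTS "
    "FOR (f:Factory) REQUIRE f.name IS UNIQUE",
    # Embeddings live as a property on RiskEvent nodes; the dimension must
    # match the embedding model (1536 is an assumption here).
    "CREATE VECTOR INDEX risk_event_embedding IF NOT EXISTS "
    "FOR (r:RiskEvent) ON r.embedding "
    "OPTIONS {indexConfig: {`vector.dimensions`: 1536, "
    "`vector.similarity_function`: 'cosine'}}",
]


def apply_schema(session) -> int:
    """Run each schema statement and return how many were applied."""
    for statement in SCHEMA_STATEMENTS:
        session.run(statement)
    return len(SCHEMA_STATEMENTS)


# Usage with the neo4j driver (connection details are placeholders):
#   from neo4j import GraphDatabase
#   with GraphDatabase.driver(uri, auth=(user, password)) as driver:
#       with driver.session() as session:
#           apply_schema(session)
```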

2. Ingestion: Linking structure and semantics

In this step, we assume the structural graph (suppliers -> factories) already exists. We ingest a new unstructured “risk event” and link it to the graph.
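The ingestion step might look like the sketch below. It assumes the supplier name has already been extracted from the text upstream (by an LLM or NER model), that `embed` is any callable returning a vector (for example a wrapper around an embeddings API), and that relationship and property names match the hypothetical schema `(:RiskEvent)-[:AFFECTS]->(:Supplier)`.

```python
from typing import Callable, Sequence

# MATCH (rather than MERGE) on the supplier: an event is only linked to a
# supplier that already exists in the structural graph.
INGEST_RISK_EVENT = """
MATCH (s:Supplier {name: $supplier})
MERGE (e:RiskEvent {description: $description})
SET e.embedding = $embedding
MERGE (e)-[:AFFECTS]->(s)
RETURN e.description AS description
"""


def ingest_risk_event(session, embed: Callable[[str], Sequence[float]],
                      description: str, supplier: str) -> bool:
    """Embed a risk event and link it to an existing supplier.

    Returns False when the supplier is not in the graph, so the caller can
    route the event to a review queue instead of creating a dangling node.
    """
    embedding = list(embed(description))
    result = session.run(
        INGEST_RISK_EVENT,
        description=description,
        embedding=embedding,
        supplier=supplier,
    )
    return result.single() is not None
```

Returning a boolean instead of silently creating a new supplier keeps the structured side of the graph authoritative, in line with enforcing structure at ingestion.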

3. The hybrid retrieval query

This is the core differentiator. Instead of just returning the top-k chunks, we use Cypher to perform a vector search to find the event, and then traverse to find the downstream impact.
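A sketch of that query is shown below. It relies on Neo4j's `db.index.vector.queryNodes` procedure; the index name `risk_event_embedding`, the relationship names, and the `embed` callable are assumptions carried over for illustration rather than details confirmed by the article.

```python
from typing import Callable, Dict, List, Sequence

HYBRID_QUERY = """
CALL db.index.vector.queryNodes('risk_event_embedding', $k, $embedding)
YIELD node AS event, score
MATCH (event)-[:AFFECTS]->(s:Supplier)-[:SUPPLIES]->(f:Factory)
RETURN event.description AS issue,
       s.name AS impacted_supplier,
       f.name AS risk_to_factory
ORDER BY score DESC
"""


def find_downstream_risk(session, embed: Callable[[str], Sequence[float]],
                         question: str, k: int = 3) -> List[Dict[str, str]]:
    """Vector scan for entry points, then traverse to affected factories."""
    embedding = list(embed(question))
    result = session.run(HYBRID_QUERY, k=k, embedding=embedding)
    # Each record becomes a plain dict that can be serialized into the prompt.
    return [record.data() for record in result]
```

The first clause performs the vector scan to find matching risk events, and the `MATCH` clause performs the two-hop traversal from event to supplier to factory.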

The output: Instead of a generic text chunk, the LLM receives a structured payload:

[{'issue': 'Severe flooding…', 'impacted_supplier': 'TechChip Inc', 'risk_to_factory': 'Assembly Plant Alpha'}]

This allows the LLM to generate a precise answer: “The flooding at TechChip Inc puts Assembly Plant Alpha at risk.”

Production lessons: Latency and consistency

Moving this architecture from a notebook to production requires handling trade-offs.

1. The latency tax

Graph traversals are more expensive than simple vector lookups. In my work on product image experimentation at Meta, we dealt with strict latency budgets where every millisecond impacted user experience. While the domain was different, the architectural lesson applies directly to Graph RAG: You cannot afford to compute everything on the fly.

  • Vector-only RAG: ~50-100ms retrieval time.

  • Graph-enhanced RAG: ~200-500ms retrieval time (depending on hop depth).

Mitigation: We use semantic caching. If a user asks a question similar (cosine similarity > 0.85) to a previous query, we serve the cached graph result. This reduces the “graph tax” for common queries.
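A minimal sketch of such a cache, using only the standard library, is below. The class and method names are illustrative, and the linear scan over cached entries is a simplification; a production system would more likely look up cached queries through a vector index.

```python
import math
from typing import Any, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Serve a cached graph result when a new query is close to an old one."""

    def __init__(self, threshold: float = 0.85) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], Any]] = []

    def put(self, query_embedding: Sequence[float], result: Any) -> None:
        self._entries.append((list(query_embedding), result))

    def get(self, query_embedding: Sequence[float]) -> Optional[Any]:
        """Return the result of the most similar cached query, or None
        if nothing exceeds the similarity threshold."""
        best_score, best_result = 0.0, None
        for cached_embedding, result in self._entries:
            score = cosine_similarity(query_embedding, cached_embedding)
            if score > best_score:
                best_score, best_result = score, result
        return best_result if best_score > self.threshold else None


if __name__ == "__main__":
    cache = SemanticCache()
    cache.put([1.0, 0.0], [{"risk_to_factory": "Assembly Plant Alpha"}])
    print(cache.get([0.9, 0.1]))  # close to the cached query: served from cache
    print(cache.get([0.0, 1.0]))  # unrelated query: falls through to the graph
```

On a cache miss the caller runs the full hybrid query and stores the result with `put`. Cached entries can themselves go stale, so they need an invalidation policy of their own.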

2. The “stale edge” problem

In vector databases, data is independent. In a graph, data is dependent. If Supplier A stops supplying Factory Y, but the edge remains in the graph, the RAG system will confidently hallucinate a relationship that no longer exists.

Mitigation: Graph relationships must have Time-To-Live (TTL) or be synced via Change Data Capture (CDC) pipelines from the source of truth (the ERP system).
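The TTL variant can be sketched in memory as follows. The `last_confirmed` field is a hypothetical property that a CDC sync would refresh whenever the source of truth still reports the relationship; edges not confirmed within the TTL are ignored at query time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class SupplyEdge:
    supplier: str
    factory: str
    last_confirmed: datetime  # refreshed by the sync job (assumed field)


def live_edges(edges: List[SupplyEdge], now: datetime,
               ttl: timedelta) -> List[SupplyEdge]:
    """Keep only edges confirmed within the TTL window."""
    return [edge for edge in edges if now - edge.last_confirmed <= ttl]


def downstream_factories(edges: List[SupplyEdge], supplier: str,
                         now: datetime, ttl: timedelta) -> List[str]:
    """Factories currently supplied by a supplier, ignoring stale edges."""
    return sorted({edge.factory for edge in live_edges(edges, now, ttl)
                   if edge.supplier == supplier})


if __name__ == "__main__":
    now = datetime(2025, 6, 1)
    edges = [
        SupplyEdge("Supplier A", "Factory Y", datetime(2025, 1, 10)),  # stale
        SupplyEdge("Supplier A", "Factory Z", datetime(2025, 5, 30)),  # fresh
    ]
    print(downstream_factories(edges, "Supplier A", now, timedelta(days=30)))
```

The same filter can be expressed in the traversal itself by comparing a timestamp property on the relationship against the current time.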

Infrastructure decision framework

Should you adopt Graph RAG? Here is the framework we use at Cognee:

  1. Use vector-only RAG if:

    • The corpus is flat (e.g., a chaotic Wiki or Slack dump).

    • Questions are broad (“How do I reset my VPN?”).

    • Latency < 200ms is a hard requirement.

  2. Use graph-enhanced RAG if:

    • The domain is regulated (finance, healthcare).

    • “Explainability” is required (you need to show the traversal path).

    • The answer depends on multi-hop relationships (“Which indirect subsidiaries are affected?”).

Conclusion

Graph-enhanced RAG is not a replacement for vector search, but a necessary evolution for complex domains. By treating your infrastructure as a knowledge graph, you provide the LLM with the one thing it cannot hallucinate: The structural truth of your business.

Daulet Amirkhanov is a software engineer at UseBead.

Intercom, now called Fin, launches an AI agent whose only job is managing another AI agent

The company formerly known as Intercom just did something that no major customer service platform has attempted at scale: it built an AI agent whose sole job is to manage another AI agent.

Fin Operator, announced Thursday at a live event in San Francisco, is a new AI-powered system designed specifically for the back-office teams that configure, monitor, and improve Fin, the company’s customer-facing AI agent. Rather than replacing human support agents — which is what Fin itself does on the front lines — Operator targets the growing army of support operations professionals who spend their days updating knowledge bases, debugging conversation failures, and combing through performance dashboards.

“Fin is an agent for your customers,” Brian Donohue, the company’s VP of Product, told VentureBeat in an exclusive interview ahead of the launch. “Operator is an agent for your support ops team. This is an agent for the back office team who manages Fin and then manages their human agents.”

The announcement arrives at a pivotal moment for the company. Just two days ago, CEO Eoghan McCabe formally renamed the 15-year-old company from Intercom to Fin — an aggressive signal that the AI agent is now the business, not merely a feature of it. Fin recently crossed $100 million in annual recurring revenue and is growing at 3.5x. The broader company generates $400 million in ARR, meaning the AI agent now accounts for roughly a quarter of total revenue and virtually all of its growth.

Fin Operator enters early access for Pro-tier users starting today, with general availability planned for summer 2026.

The invisible crisis behind every AI customer service deployment

As companies push their AI agents to handle more conversations — Fin alone now resolves more than two million customer issues each week across 8,000 customers globally, including Anthropic, DoorDash, and Mercury — the operational complexity behind those systems has exploded. Someone has to keep the knowledge base current. Someone has to figure out why the bot entered an infinite loop with a frustrated customer last Tuesday. Someone has to analyze whether the automation rate dropped after a product update.

That “someone” is the support operations team, and according to Donohue, they are drowning.

“Almost every support ops team is already doing data analysis and knowledge management — that’s table stakes today,” Donohue said. “Where teams struggle is the agent builder work. It’s a new skill set, and most don’t have enough time for it. They get their first iteration up and running, and then they get stuck.”

The problem is structural. AI customer agents are not static software. They require constant tuning — a process that looks more like training a new employee than configuring a SaaS tool. Each customer conversation is a potential source of failure, and each failure requires diagnosis, root-cause analysis, a configuration fix, testing, and monitoring. It is tedious, technical, and relentless. Fin Operator aims to collapse that entire loop into a conversational interface.

How one AI system plays data analyst, knowledge manager, and debugger all at once

Donohue described Operator as filling three distinct roles that typically consume the bandwidth of support ops teams: expert data analyst, expert knowledge manager, and expert agent builder.

As a data analyst, Operator can field high-level questions like, “How did my team perform last week?” and generate on-the-fly charts, trend reports, and drill-down analyses across all of the data already stored in Intercom’s platform. The company has loaded Operator with contextual knowledge about customer-specific data attributes to help it interpret workspace-specific metrics accurately.

As a knowledge manager, Operator can ingest a product update — say, a three-page PDF describing a new feature — and autonomously search the company’s entire content library to identify what needs to change. It finds gaps, drafts new articles, suggests edits to existing ones, and presents everything in a diff-style review interface. The underlying search engine is the same semantic search system that Intercom has built and optimized for Fin over more than two years.

“On that knowledge management front, you just have such a time compression of something that would take, certainly hours, sometimes days, into the space of about 10 minutes,” Donohue said.

As an agent builder, Operator introduces what the company calls a “debugger skill.” Support ops teams can paste in a link to a conversation where Fin misbehaved, and Operator will trace every step of Fin’s internal reasoning, identify the root cause — often a piece of guidance that unintentionally creates a loop — propose a rewrite, back-test the change against the original conversation, and then suggest creating a production monitor to catch similar issues going forward.

“This is literally what our professional services team does,” Donohue explained. “You’ve written guidance that is unintentionally causing Fin to repeat itself — this happens a lot. You didn’t realize it, but you never gave it an escape hatch.”

The ‘pull request’ safety net that keeps humans in control of AI changes

One of the most consequential design decisions in Fin Operator is what the company calls its “proposal system” — a mechanism that functions like a pull request in software engineering.

Every change that Operator recommends — whether it is an edit to a help article, a rewrite of an AI guidance rule, or the creation of a new QA monitor — appears as a proposal with a full diff view. Users can inspect, edit, and approve each change before it takes effect. Nothing goes live without a human clicking “Apply.”

“Right now, we’re taking zero risk on this — Fin cannot make any changes to the system without human approval,” Donohue emphasized. “Nothing goes live until a human clicks apply.”
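The approval gate described above can be pictured with a small sketch. This is not Fin Operator's implementation; the `Proposal` and `Workspace` classes, the `refund-guidance` key, and the sample text are all hypothetical, chosen only to show a change that carries a diff and cannot reach live content until a human approves it.

```python
import difflib
from dataclasses import dataclass


@dataclass
class Proposal:
    """A suggested change that stays pending until a human approves it."""
    target: str            # hypothetical key, e.g. an article or guidance id
    old_text: str
    new_text: str
    approved: bool = False

    def diff(self) -> str:
        # Unified diff, similar in spirit to a pull-request view.
        lines = difflib.unified_diff(
            self.old_text.splitlines(),
            self.new_text.splitlines(),
            fromfile=f"{self.target} (current)",
            tofile=f"{self.target} (proposed)",
            lineterm="",
        )
        return "\n".join(lines)


class Workspace:
    """Hypothetical store of live configuration text."""

    def __init__(self, content: dict[str, str]):
        self.content = dict(content)

    def apply(self, proposal: Proposal) -> None:
        # The approval gate: unapproved proposals never reach live content.
        if not proposal.approved:
            raise PermissionError("proposal has not been approved by a human")
        self.content[proposal.target] = proposal.new_text


workspace = Workspace({"refund-guidance": "Always ask the customer to retry."})
proposal = Proposal(
    target="refund-guidance",
    old_text=workspace.content["refund-guidance"],
    new_text="Ask the customer to retry once.\nIf it fails again, hand off to a human.",
)
print(proposal.diff())  # a reviewer inspects this before setting approved = True
```

Keeping the gate inside `apply` rather than in the calling code means no code path can skip the review step by accident.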

This is a notable architectural choice. In a market increasingly enamored with fully autonomous AI systems, the company is deliberately keeping a human approval gate in place — at least for now. Donohue acknowledged this will evolve, but said the current moment demands caution: “It’s too big a leap to just let Operator make changes automatically and then tell the team, ‘Hey, let me tell you about what I did.'”

For enterprise buyers evaluating AI tools, this design point matters. It is the difference between an AI system that proposes changes and one that enacts them — a distinction that compliance teams, security officers, and risk managers will scrutinize closely.

Why Fin Operator runs on Anthropic’s Claude instead of the company’s own AI models

In a revealing technical detail, Donohue confirmed that Fin Operator does not use the company’s proprietary Apex models — the same custom AI models that power the customer-facing Fin agent and that the company has promoted as outperforming GPT-5.4 and Claude Sonnet 4.6 in customer service benchmarks.

Instead, Operator runs on Anthropic’s Claude.

“We’re not using our custom models,” Donohue said. “Those are designed to directly answer customer questions, whereas these are closer to what frontier models are best suited for. This is really closer to software engineering.”

The distinction is telling. Fin’s Apex models are optimized for one thing: resolving customer service conversations with minimal hallucination and maximum accuracy. Operator’s tasks — analyzing data, writing code-like configurations, debugging complex reasoning chains — demand a different kind of intelligence. Donohue characterized these capabilities as more akin to software engineering, an area where Anthropic’s Claude models have been deliberately optimized.

The company has not ruled out building custom models for Operator in the future, but Donohue positioned it as a lower priority. What the team has built around Claude, he argued, is the differentiated layer: the proposal system, the debugger skill, the semantic search integration, the data attribution logic, and the charting capabilities that make Operator more than just “Claude inside the app.”

Early beta testers say Fin Operator feels like adding five people to the team

Fin Operator is currently in beta with roughly 200 customers, a number Donohue said has “ramped up pretty fast the last couple of weeks.”

Constantina Samara, VP of Customer Support, Enablement & Trust at Synthesia, said the tool has already changed how her team works: “Previously, improving how Fin handles a conversation often meant reviewing everything yourself — the conversation, the configuration, the content. With Fin Operator, you just ask. It walks you through what happened and makes improving Fin dramatically easier.”

Jordan Thompson, an AI Conversational Analyst at Raylo, reported that he has been using Operator daily and has run head-to-head comparisons between Operator’s analysis and his own manual work. “It’s very accurate,” Thompson said. “It’s just as strong at high-level trend analysis as it is at debugging individual conversations. That’s a real limitation when using an LLM connector on its own — you get conversational depth but nothing on reporting or trends.”

Donohue also shared an internal anecdote from the company’s own knowledge management team. Beth, who leads knowledge operations, told the product team that Operator made her feel like she had “five more people on my team.” Whether internal testimonials carry the same weight as external customer validation is debatable, but Donohue said the knowledge management use case consistently generates the most visceral reactions because the time savings are so stark — collapsing hours or days of content auditing into roughly 10 minutes.

A new pricing model signals how AI is reshaping the economics of enterprise software

Fin Operator will live inside the company’s Pro add-on tier — a relatively new bundle that already includes advanced analytics features like CX scoring, topic detection, real-time issue detection, and quality assurance monitoring across both AI and human agent conversations.

The pricing model introduces something new for the company: usage-based billing. Intercom has historically relied on outcome-based pricing — charging roughly $0.99 per conversation that Fin resolves without human intervention. Operator’s work does not map cleanly to that model because it produces configuration changes, not customer resolutions.

“This has pushed us to a different model, to go more into that usage model for support ops teams,” Donohue said. “We’ll try to be generous with the usage amounts that come into Pro, but for people who are leaning heavily in, we’ll have the ability to buy more usage blocks.”

The shift is worth watching. Outcome-based pricing was one of the company’s most distinctive market positions — a bet that customers would pay for results rather than seats. Extending that philosophy to internal operations work proved impractical, which suggests that as AI agents take on more diverse roles within an organization, the pricing models that support them will need to become equally diverse.

How Fin Operator stacks up in a crowded field of AI customer service competitors

Fin Operator lands in an increasingly competitive landscape. Zendesk, Salesforce, Sierra, and a constellation of AI-native startups are all building some version of AI-powered support operations tooling. The broader AI automation market is projected to reach $169 billion in 2026, according to Grand View Research, growing at a 31.4% compound annual rate.

But Donohue argued that Operator’s differentiation lies in two areas. First, breadth: Operator works across the full surface area of the company’s configuration system — data, content, procedures, simulations, guidance, and monitoring — rather than addressing a single narrow use case. Second, the fact that it spans both AI and human operations.

“Most critically, where I think we have the most differentiation is because it’s for your human system and your AI system,” Donohue said. “That’s really one of the unique spaces we have — to have a first-class AI agent and a first-class help desk, and Operator works across both.”

The competitive positioning also benefits from timing. The company’s recent corporate rebrand from Intercom to Fin signals a wholesale commitment to AI that legacy players may struggle to match. As CEO McCabe wrote in announcing the name change, the AI agent “is about to be the largest part of our business.” The help desk product continues as Intercom 2, but the parent company now carries the name of its AI agent — a branding move that some industry observers have interpreted as pre-IPO positioning. The Fin API Platform, launched in early April, adds another dimension: the company opened its proprietary Apex models to third-party developers and even offered to license the technology to direct competitors like Decagon and Sierra.

The real paradigm shift isn’t a new chat interface — it’s an agent that does the thinking for you

Step back from the product specifics and Fin Operator represents something potentially more consequential than a new dashboard or analytics tool. It is one of the first commercial products to explicitly embody the emerging paradigm of AI agents that manage other AI agents — a two-layer abstraction that is beginning to reshape how companies think about operational software.

Donohue was emphatic on this point. The real paradigm shift, he argued, is not the chat interface replacing buttons and menus. It is that the AI is doing the actual knowledge work — figuring out what should change, why, and how.

“The UX change is secondary, even though it’s most visible,” Donohue said. “The change is that we are identifying and doing the work of support operations. It’s doing the work of what the knowledge manager is doing, so that they just have to approve that. That’s the huge shift.”

The analogy to software engineering is apt. Over the past year, AI coding agents have fundamentally altered the daily workflow of developers, shifting their primary responsibility from writing code to reviewing and guiding the AI that writes it. Donohue sees the same transformation arriving for support operations professionals.

“Software engineers — three months have upended their world, where their primary job now is managing agents who are actually writing the code,” he said. “Similarly now, support ops, your job is to manage an agent who’s managing the agent for your customers.”

Whether this vision pans out at enterprise scale remains to be seen. The company is still launching Operator in beta precisely because it wants to keep refining quality through what Donohue described as a painstaking, conversation-by-conversation debugging process. “We’ve spent three months, conversation by conversation, learning, fixing, learning, fixing, to get it where it’s robust,” he said.

But if the early returns hold, Fin Operator may preview what the next generation of enterprise software looks like: not tools that help humans do work faster, but agents that do the work themselves, subject to human judgment and approval. For customer service leaders already running AI agents in production, the question is no longer just “how good is my bot?” It is now, inevitably, “who is managing it?” And increasingly, the answer is another bot.

How RecursiveMAS speeds up multi-agent inference by 2.4x and reduces token usage by 75%

One of the key challenges of current multi-agent AI systems is that they communicate by generating and sharing text sequences, which introduces latency, drives up token costs, and makes it difficult to train the entire system as a cohesive unit. 

To overcome this challenge, researchers at the University of Illinois Urbana-Champaign and Stanford University developed RecursiveMAS, a framework that enables agents to collaborate and transmit information through embedding space instead of text. This change results in both efficiency and performance gains. 

Experiments show that RecursiveMAS improves accuracy across complex domains like code generation, medical reasoning, and search, while also increasing inference speed and slashing token usage. 

RecursiveMAS is significantly cheaper to train than standard full fine-tuning or LoRA methods, making it a scalable and cost-effective blueprint for custom multi-agent systems.

The challenges of improving multi-agent systems

Multi-agent systems can help tackle complex tasks that single-agent systems struggle to handle. When scaling multi-agent systems for real-world applications, a big challenge is enabling the system to evolve, improve, and adapt to different scenarios over time. 

Prompt-based adaptation improves agent interactions by iteratively refining the shared context provided to the agents. By updating the prompts, the system acts as a director, guiding the agents to generate responses that are more aligned with the overarching goal. The fundamental limitation is that the capabilities of the models underlying each agent remain static. 

A more sophisticated approach is to train the agents by updating the weights of the underlying models. Training an entire system of agents is difficult because updating all the parameters across multiple models is computationally non-trivial.

Even if an engineering team commits to training their models, the standard method of agents communicating via text-based interactions creates major bottlenecks. Because agents rely on sequential text generation, each model must wait for the previous one to finish generating its text before it can begin its own processing, which adds latency at every hand-off. 

Forcing models to spell out their intermediate reasoning token-by-token just so the next model can read it is highly inefficient. It severely inflates token usage, drives up compute costs, and makes iterative learning across the whole system painfully slow to scale. 

How RecursiveMAS works

Instead of trying to improve each agent as an isolated, standalone component, RecursiveMAS is designed to co-evolve and scale the entire multi-agent system as a single integrated whole. 

The framework is inspired by recursive language models (RLMs). In a standard language model, data flows linearly through a stack of distinct layers. In contrast, a recursive language model reuses a set of shared layers that process the data and feed it back into themselves. By looping the computation, the model can deepen its reasoning without adding parameters.

RecursiveMAS extends this scaling principle from a single model to a multi-agent architecture that acts as a unified recursive system. In this setup, each agent functions like a layer in a recursive language model. Rather than generating text, the agents iteratively pass their continuous latent representations to the next agent in the sequence, creating a looped hidden stream of information flowing through the system. 

This latent hand-off continues down the line through all the agents. When the final agent finishes its processing, its latent outputs are fed directly back to the very first agent, kicking off a new recursion round. 

This structure allows the entire multi-agent system to interact, reflect, and refine its collective reasoning over multiple rounds entirely in the latent space, with only the very last agent producing a textual output in the final round. It is as if the agents were communicating telepathically as a unified whole, with the last agent delivering the final response as text.
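The recursion described above can be sketched as a toy loop. This is an illustration under simplifying assumptions, not the authors' code: each "agent" is a stand-in function on a short list of floats, all agents share one vector size, and `make_agent`, `decode_to_text`, and `recursive_rounds` are hypothetical names.

```python
import math

Vector = list[float]


def make_agent(scale: float, shift: float):
    """Stand-in for a frozen model: maps a latent vector to a new latent vector."""
    def agent(latent: Vector) -> Vector:
        return [math.tanh(scale * x + shift) for x in latent]
    return agent


def decode_to_text(latent: Vector) -> str:
    """Toy decoder, called once on the last agent's output in the last round."""
    return "answer:" + ",".join(f"{x:.3f}" for x in latent)


def recursive_rounds(agents, latent: Vector, rounds: int) -> str:
    for _ in range(rounds):
        # Latent hand-off down the chain; no text is generated in between.
        for agent in agents:
            latent = agent(latent)
        # The last agent's latent is what the first agent receives
        # at the start of the next round.
    return decode_to_text(latent)


agents = [make_agent(0.9, 0.1), make_agent(1.1, -0.2), make_agent(0.8, 0.05)]
output = recursive_rounds(agents, latent=[0.5, -0.3, 0.2], rounds=3)
print(output)
```

The point of the sketch is the control flow: text appears exactly once, after the final round, while everything before it stays in vector form.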

The architecture of latent collaboration

To make continuous latent space collaboration possible, the authors introduce a specialized architectural component called the RecursiveLink. This is a lightweight, two-layer module designed to transmit and refine a model’s latent states rather than forcing it to decode text. 

A language model’s last-layer hidden states contain the rich, semantic representation of its reasoning process. The RecursiveLink is designed to preserve and transmit this high-dimensional information from one embedding space to another. 

To avoid the cost of updating every parameter across multiple large language models, the framework keeps the models’ parameters frozen. Instead, it optimizes the system by only training the parameters of the RecursiveLink modules.

To handle both internal reasoning and external communication, the system uses two variations of the module. The inner RecursiveLink operates inside an agent during its reasoning phase. It takes the model’s newly generated embeddings and maps them directly back into its own input embedding space. This allows the agent to continuously generate a stream of latent thoughts without generating discrete text tokens. 

The outer RecursiveLink serves as the bridge between agents. Because agents in a real-world system might use different model architectures and sizes, their internal embedding spaces have entirely different dimensions. The outer RecursiveLink includes an additional layer designed to match the embeddings from one agent’s hidden dimension with the next agent’s embedding space.
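A rough sketch of the two link variants follows, in plain Python so it runs without a tensor library. It assumes a `tanh` nonlinearity between the two layers and random initial weights. Neither detail comes from the article, and the class names `InnerLink` and `OuterLink` are illustrative rather than the framework's actual API.

```python
import math
import random

Vector = list[float]
Matrix = list[list[float]]


def init_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class InnerLink:
    """Two-layer map from an agent's hidden state back into its own input space."""

    def __init__(self, dim: int, rng: random.Random):
        self.w1 = init_matrix(dim, dim, rng)
        self.w2 = init_matrix(dim, dim, rng)

    def __call__(self, hidden: Vector) -> Vector:
        mid = [math.tanh(x) for x in matvec(self.w1, hidden)]  # assumed nonlinearity
        return matvec(self.w2, mid)


class OuterLink(InnerLink):
    """Same two layers plus a projection into the next agent's embedding size."""

    def __init__(self, src_dim: int, dst_dim: int, rng: random.Random):
        super().__init__(src_dim, rng)
        self.proj = init_matrix(dst_dim, src_dim, rng)

    def __call__(self, hidden: Vector) -> Vector:
        return matvec(self.proj, super().__call__(hidden))


rng = random.Random(0)
inner = InnerLink(4, rng)          # agent with a 4-dimensional hidden state
outer = OuterLink(4, 6, rng)       # next agent expects 6-dimensional inputs
hidden = [0.1, -0.2, 0.3, 0.4]
looped = inner(hidden)             # stays in the same agent's space
handed_off = outer(hidden)         # resized for the next agent
```

The extra projection in `OuterLink` is what lets models of different sizes sit next to each other in the chain.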

Training proceeds in two stages. First, the inner links are trained independently to warm up each agent’s ability to think in continuous latent embeddings. Then, the system enters outer-loop training, where the diverse, frozen models are chained together in a loop, and the system is evaluated based on the final textual output of the last agent. 

Only the RecursiveLink parameters are updated during training; the original model weights remain unchanged, similar to low-rank adaptation (LoRA). Another advantage of this system comes into effect when you have multiple agents on top of the same backbone model. 

If you have a multi-agent system where two agents are built on the exact same foundation model acting in different roles, you do not need to load two copies of the model into your GPU memory, nor do you train them separately. The agents will share the same backbone as the brain and use the RecursiveLink as the connective tissue.
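The memory argument can be made concrete with a bookkeeping sketch. The parameter counts below are made-up round numbers, and `Backbone`, `Link`, `Agent`, and `count_params` are hypothetical names. The sketch only shows that two roles pointing at one backbone load it once, while the links remain the only trainable part.

```python
from dataclasses import dataclass


@dataclass
class Backbone:
    """Stand-in for a frozen foundation model; only its size matters here."""
    name: str
    num_params: int
    trainable: bool = False  # frozen throughout training


@dataclass
class Link:
    """Stand-in for a RecursiveLink module, the only trainable piece."""
    num_params: int
    trainable: bool = True


@dataclass
class Agent:
    role: str
    backbone: Backbone
    link: Link


def count_params(agents: list[Agent]) -> tuple[int, int]:
    """Return (loaded, trainable) parameter counts, loading each backbone once."""
    unique_backbones = {id(a.backbone): a.backbone for a in agents}
    loaded = sum(b.num_params for b in unique_backbones.values())
    loaded += sum(a.link.num_params for a in agents)
    trainable = sum(a.link.num_params for a in agents if a.link.trainable)
    return loaded, trainable


shared = Backbone("example-model", 1_000_000_000)   # illustrative size
agents = [
    Agent("planner", shared, Link(2_000_000)),
    Agent("critic", shared, Link(2_000_000)),
]
loaded, trainable = count_params(agents)
```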

RecursiveMAS in action

The researchers evaluated RecursiveMAS across nine benchmarks spanning mathematics, science and medicine, code generation, and search-based question answering. They created a multi-agent system using open-weights models including Qwen, Llama-3, Gemma3, and Mistral. These models were assigned roles to form different agent collaboration patterns such as sequential reasoning and mixture-of-experts collaboration. 

RecursiveMAS was compared to baselines under identical training budgets, including standalone models enhanced with LoRA or full supervised fine-tuning, alternative multi-agent frameworks like Mixture-of-Agents and TextGrad, and recursive baselines like LoopLM. It was also compared to Recursive-TextMAS, which uses the same recursive loop structure as RecursiveMAS but forces the agents to explicitly communicate via text.

RecursiveMAS achieved an average accuracy improvement of 8.3% compared to the strongest baselines across the benchmarks. It excelled particularly on reasoning-heavy tasks, outperforming text-based optimization methods like TextGrad by 18.1% on AIME2025 and 13% on AIME2026. 

Because it avoids generating text at every step, RecursiveMAS achieved 1.2x to 2.4x end-to-end inference speedup. RecursiveMAS is also much more token efficient than the alternative. Compared to the text-based Recursive-TextMAS, it reduces token usage by 34.6% in the first round of the recursion, and by round three, it achieves 75.6% token reduction. RecursiveMAS also proved remarkably cheap to train. Because it only updates the lightweight RecursiveLink modules, which consist of roughly 13 million parameters or about 0.31% of the parameters of the frozen models, it requires the lowest peak GPU memory and cuts training costs by more than half compared to full fine-tuning.
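A quick back-of-the-envelope check puts the reported figures in perspective. It assumes the 0.31% figure is a share of the frozen models' total parameter count, and it uses a hypothetical baseline of 10,000 tokens, since the article gives only percentages.

```python
link_params = 13_000_000   # RecursiveLink parameters, as reported
link_share = 0.0031        # 0.31%, as reported

# Implied combined size of the frozen models under the assumption above.
implied_total = link_params / link_share


def tokens_after(baseline_tokens: int, reduction: float) -> float:
    """Tokens remaining after a fractional reduction."""
    return baseline_tokens * (1.0 - reduction)


baseline = 10_000                               # hypothetical text-based token count
round_one = tokens_after(baseline, 0.346)       # 34.6% reduction in round one
round_three = tokens_after(baseline, 0.756)     # 75.6% reduction by round three

print(f"implied frozen parameters: {implied_total / 1e9:.1f}B")
print(f"tokens left: round 1 = {round_one:.0f}, round 3 = {round_three:.0f}")
```

Under these assumptions the frozen models total a little over four billion parameters, and a workload that cost 10,000 tokens in text form would need roughly a quarter of that by the third round.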

Enterprise adoption

The efficiency gains — lower token consumption, reduced GPU memory requirements, and faster inference — are intended to make complex multi-step agent workflows viable in production environments without the compute overhead that limits enterprise agentic deployments. The researchers have released the code and trained model weights under the Apache 2.0 license.

Claude’s next enterprise battle is not models: it’s the agent control plane

New VB Pulse data shows Microsoft and OpenAI leading enterprise agent orchestration, but Anthropic’s first measurable foothold points to a larger fight over who controls the infrastructure where AI agents run.

For the last two years, the enterprise AI race has mostly been framed as a model war: OpenAI’s GPT series versus Anthropic’s Claude versus Google’s Gemini, with smaller and open-source alternatives also coming in from the U.S. and China. 

But the next strategic fight may not be over which model answers a prompt best. It may be over who controls the layer where agents plan, call tools, access data, run workflows and prove to security teams that they did not do anything they were not supposed to do.

New VB Pulse survey data suggests the category is already taking shape. Our independent Enterprise Agentic Orchestration tracker, a survey that records the preferences of qualified, verified technical decision-maker respondents at enterprises at regular intervals, found that Microsoft Copilot Studio and Azure AI Studio led with 38.6% primary-platform adoption in February, up from 35.7% in January. 

OpenAI’s Assistants and Responses API held second place, rising from 23.2% to 25.7%.

Anthropic remained far smaller, but it made its first appearance in the tracker: moving from 0% in January to 5.7% in February for Anthropic tool use and workflows. 

The underlying move is small — four respondents out of a total 70 in this cohort, with more to come — but strategically interesting because it marks the first sign in this tracker of Claude usage moving from the model layer into native orchestration.
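To see how much uncertainty a four-in-70 reading carries, here is a short calculation of the share and an approximate 95% Wilson score interval. Treating the cohort as a simple random sample is an assumption made for illustration. The tracker's actual methodology may differ, so the interval is a rough guide only.

```python
import math


def share(count: int, total: int) -> float:
    return count / total


def wilson_interval(count: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion."""
    p = count / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


anthropic_share = share(4, 70)
low, high = wilson_interval(4, 70)
print(f"share = {anthropic_share:.1%}, interval = {low:.1%} to {high:.1%}")
```

With a sample this small the plausible range is wide, spanning low single digits to the low teens, which is why the figure reads as a directional signal rather than a market-share estimate.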

That distinction matters. Enterprises are not merely choosing chatbots. They are deciding where the live operational machinery of AI work will sit: inside Microsoft’s stack, inside OpenAI’s API layer, inside Anthropic’s managed runtime, inside an open framework, or across a hybrid mix of all of them.

“This is the convergence moment for enterprise AI,” said Tom Findling, CEO and cofounder of AI cybersecurity startup Conifers, in a statement to VentureBeat. “Models and agent frameworks have matured enough together that enterprises are now shifting focus beyond model quality to the control plane around it. In security operations, we’re seeing the competitive advantage move toward platforms that can orchestrate agents, leverage enterprise context, and provide governance and auditability across customer environments.”

Anthropic’s number is small. Its timing is not

The Anthropic number, by itself, should not be overread. A move from zero to 5.7% is not a juggernaut. It is not proof that Anthropic has captured enterprise orchestration. 

It is not even enough to say Anthropic has a durable lead in any part of this market. Microsoft owns the early enterprise distribution advantage, and OpenAI has a much larger installed base in orchestration than Anthropic.

But small numbers can matter when they appear at the start of a new market structure. Anthropic’s emergence in orchestration comes as the broader VB Pulse data shows Claude also gaining massive enterprise adoption at the model layer. 

In our VB Pulse Q1 Foundation Models and Intelligence Platforms tracker, Anthropic rose from 23.9% in January to 28.6% in February and then even more dramatically to 56.2% in March among qualified enterprise respondents, with the March reading flagged as directional only, because the sample was only 16 respondents.

The story, then, is not that Anthropic is winning orchestration today. It is that Anthropic’s model momentum may be starting to spill into the orchestration layer.

That is where the strategic stakes get higher.

A model is easier to swap than an agent runtime

A model is relatively easy to swap, at least in theory. A company can route one workload to Claude, another to GPT, another to Gemini and another to a smaller open model.

In fact, the VB Pulse Foundation Models tracker over the same Q1 period shows that multi-model strategy is the enterprise consensus: respondents increasingly report adopting multiple models and building orchestration layers that route across them by task, cost and risk profile.
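A routing layer of the kind respondents describe can be as simple as an ordered rule table. The sketch below is hypothetical throughout: the `Workload` fields, the rules, and the placeholder model labels are illustrative and do not reflect any vendor's product or a recommendation about which model suits which task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    task: str      # e.g. "coding", "chat"
    risk: str      # "low" or "high"
    budget: str    # "low" or "standard"


# Ordered rules: the first match wins. Model labels are placeholders.
ROUTES = [
    (lambda w: w.risk == "high", "model-a"),
    (lambda w: w.task == "coding", "model-b"),
    (lambda w: w.budget == "low", "small-open-model"),
]
DEFAULT_MODEL = "model-c"


def route(workload: Workload) -> str:
    """Return the first model whose rule matches, else the default."""
    for matches, model in ROUTES:
        if matches(workload):
            return model
    return DEFAULT_MODEL


print(route(Workload(task="coding", risk="low", budget="standard")))
```

Because the table is data rather than logic buried in application code, swapping one model for another is a one-line change, which is the sense in which models are easy to replace.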

An agent runtime is different. Once a company’s workflows, tool permissions, credentials, audit logs, memory, sandboxed execution and operational monitoring live inside one provider’s environment, switching providers becomes less like changing models and more like changing infrastructure.

That is the real reason Anthropic’s 5.7% foothold is worth watching

Anthropic has already made clear that it wants to provide more than the model. Its Claude Managed Agents documentation describes a public beta for a managed agent harness with secure sandboxing, built-in tools and API-run sessions, while Anthropic’s engineering post frames the architecture around decoupling the model from the surrounding agent machinery: the session, the harness and the sandbox.

In plain English, Anthropic is trying to host the environment where Claude agents remember context, use tools, run code, operate inside sandboxes and persist across long-running workflows. That is no longer just inference. That is operational infrastructure.

The pitch is obvious: most enterprises do not want to stitch together their own agent stack from scratch. They want agents that can act, but they also want permission boundaries, audit trails, workflow reliability and ways to stop the system when something goes wrong.

Security is becoming the buying criterion

The VB Pulse orchestration tracker shows that buyers are prioritizing exactly those concerns. Security and permissions ranked as the top orchestration platform selection criterion in both January and February, at 39.3% and 37.1%.

Control over agent execution rose from 17.9% to 22.9%, while flexibility across models and tools fell from 35.7% to 25.7%. The market appears to be shifting from optionality toward governance.

That shift is not surprising. A chatbot can be wrong and still remain mostly contained. An agent that can send emails, modify documents, query databases, call APIs or execute workflows has a much larger blast radius. The enterprise question is not only whether the agent is smart enough.

It is who gave it permission, what it touched, what it changed, whether those actions were logged, and whether the company can unwind the damage if something goes wrong.
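Those questions map onto a small amount of machinery: a permission check before each action and a log entry for every attempt. The sketch below is a minimal illustration with hypothetical names (`AuditedToolbox`, `billing-agent`, the action strings). It does not model any specific identity product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditedToolbox:
    """Hypothetical wrapper that checks permissions and logs every attempt."""
    grants: dict[str, set[str]]                      # agent id -> allowed actions
    log: list[dict] = field(default_factory=list)

    def perform(self, agent_id: str, action: str, target: str) -> bool:
        allowed = action in self.grants.get(agent_id, set())
        # Record the attempt whether or not it is allowed.
        self.log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "agent": agent_id,
            "action": action,
            "target": target,
            "allowed": allowed,
        })
        return allowed

    def revoke(self, agent_id: str) -> None:
        # Remove every grant for this agent in one step.
        self.grants.pop(agent_id, None)


toolbox = AuditedToolbox({"billing-agent": {"query_db", "send_email"}})
sent = toolbox.perform("billing-agent", "send_email", "customer-123")
dropped = toolbox.perform("billing-agent", "drop_table", "orders")
toolbox.revoke("billing-agent")
after_revoke = toolbox.perform("billing-agent", "send_email", "customer-123")
```

Logging denied attempts as well as successful ones is a deliberate choice here: the record of what an agent tried to do is often as useful to a security team as the record of what it did.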

Ev Kontsevoy, cofounder and CEO of Teleport, an identity and digital infrastructure solutions company, argues that the industry is still putting too much emphasis on orchestration itself and not enough on identity: “The race to own the agent orchestration layer is real,” Kontsevoy said. “It’s also solving the wrong problem first. Orchestration without identity only multiplies chaos. Without identity, you don’t know what an agent can access, what it actually did, or how to revoke its access when it operates outside policy. A unified identity layer is a prerequisite to deploying agents — one or many — in infrastructure.”

Syam Nair, Chief Product Officer at the enterprise unified data storage solutions firm NetApp, believes data management is key in all cases to secure AI agent orchestration across the enterprise. As he said in a statement to VentureBeat: “Effective agent management requires built-in intelligence and a continuously updated understanding of both data and, critically, its metadata. This visibility allows organizations to define and enforce clear policies so data is used only by the right agents, for the right purposes. Making this work at scale is a crossfunctional effort. Security, storage, and data science teams must work together to implement policies that safeguard company data, while creating a strong data foundation for AI.”

He continued: “The CIOs and technology leaders that are successful are the ones who take the input, policies, and vision from all these teams into account as they build a data infrastructure that minimizes risk and drives business value.”

Microsoft has the distribution edge

That is why Microsoft’s early lead makes sense. Copilot Studio and Azure AI Studio sit inside an enterprise stack many companies already use: Microsoft 365, Teams, Entra ID, Azure and existing procurement relationships.

The VB Pulse Orchestration Tracker for Q1 2026 describes Microsoft as the enterprise default, with no other platform within 13 percentage points in February.

David Weston, CVP, AI Security, Microsoft, provided some insight on why, writing in a statement to VentureBeat: “Without a unified control layer, you start to see fragmentation – agents operating in silos, inconsistent governance, and gaps in security. What customers are asking for is a way to bring order to that complexity. With Agent 365, we’re providing a single control plane to observe, govern, and secure agents across Microsoft, partner, and third-party ecosystems, all grounded in enterprise data and identity.”

OpenAI’s second-place position is also unsurprising. Its Assistants and Responses API gave developers an early way to build agent-like systems using OpenAI’s models and tooling. In the orchestration tracker, OpenAI is not surging, but it is still ticking up steadily: 23.2% in January to 25.7% in February.

Anthropic is the newcomer at the orchestration layer. But its timing may be favorable. The VB Pulse Foundation Models tracker for Q1 2026 suggests enterprises increasingly see Claude as a fit for higher-stakes workloads where safety, instruction following, long context and governance matter.

The orchestration tracker suggests those same buyers are now moving from agent experiments toward production workflows, where security, permissions and task reliability become the gating issues.

That creates a possible path for Anthropic: not to beat Microsoft as the default enterprise platform, at least not immediately, but to become the agent runtime for companies that already trust Claude for sensitive or complex workloads.

The risk is lock-in

The risk for enterprises is lock-in.

The orchestration tracker found that a hybrid control plane — combining provider-native orchestration with external orchestration — was the leading expected architecture, holding around 35% to 36% across the two substantive waves.

Provider-managed-only approaches grew modestly but remained a minority. The report’s conclusion is blunt: enterprises are not willing to give full orchestration control to any single provider.

That reluctance makes sense: enterprises want to use the “best-in-breed” models, harnesses, and tools from multiple vendors, especially as their needs differ widely across sector, business, and size.

“Most enterprises will operate in a multi-model, multi-agent environment, which makes an independent control plane essential,” agreed Felix Van de Maele, CEO of Collibra, a unified data governance startup for AI, in a statement to VentureBeat. “That is why we built AI Command Center: to give organizations the visibility, governance, and real-time oversight needed to manage AI systems and agents across the full lifecycle.”

That caution shows up in the risk data. When asked about risks if agent control lives inside a model provider platform, respondents cited security and permissioning limitations as the top concern. Vendor lock-in was the second-largest concern and the only one that increased from January to February, rising from 23.2% to 25.7%.

This is the tension at the heart of the agent market. Enterprises want managed infrastructure because building reliable agents is hard. But the more a provider manages, the more it may own.

Dr. Rania Khalaf, chief AI officer at WSO2 — the subsidiary of EQT that offers open source, customizable AI stacks for enterprises — said enterprises will need an agent control plane that sits apart from individual frameworks, harnesses and runtimes because agents combine the unpredictability of LLMs with the ability to take actions that have consequences.

“Teams want the freedom to use the best model and framework for each job — Claude for coding, Gemini for writing, LangGraph or CrewAI for dynamic modular behavior — and that heterogeneity makes consistent governance untenable in integrated platforms that lock into one ecosystem,” Khalaf said.

From LLMOps to Agent Ops

Khalaf said the industry is also moving from MLOps to LLMOps to “Agent Ops,” where governance has to cover the whole agent, not just the model call.

“A guardrail on an LLM call can catch hallucination or toxic output, but it will not catch an agent thrashing in an unbreakable, costly loop, which is why governance now has to extend out from the LLM interaction to the scope of the agent,” she said.

The practical implication is that enterprises need to separate policy and control from the agent logic itself. Khalaf pointed to the recent example of an agent deleting a production database despite being told not to, arguing that the failure showed the limits of relying on prompt-level instructions where hard identity and access controls are needed.

“Pulling guardrails, evals, policies, bindings, and agent identity out of the core agent logic allows them to be configured per deployment and per environment, owned by the appropriate teams in security, product, and compliance, without fragmenting the governance layer as different teams choose different models and frameworks,” Khalaf said.
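One way to picture the separation Khalaf describes is a policy gate that sits outside the agent and checks every tool call against a per-deployment configuration. The Python sketch below is purely illustrative: the names ToolCall, Policy, authorize and execute are hypothetical and do not come from WSO2 or any vendor product.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class ToolCall:
    agent_id: str  # identity of the calling agent (hypothetical field)
    tool: str      # e.g. "db"
    action: str    # e.g. "read", "write", "delete"


@dataclass
class Policy:
    # Maps an agent identity to the (tool, action) pairs it may perform.
    # Owned by a security or compliance team, not by the agent's prompt.
    allowed: Dict[str, Set[Tuple[str, str]]] = field(default_factory=dict)

    def authorize(self, call: ToolCall) -> bool:
        return (call.tool, call.action) in self.allowed.get(call.agent_id, set())


def execute(call: ToolCall, policy: Policy) -> str:
    # The check runs outside the agent logic, so a prompt-level
    # instruction being ignored cannot bypass it.
    if not policy.authorize(call):
        return f"DENIED: {call.agent_id} may not {call.action} on {call.tool}"
    return f"OK: {call.action} on {call.tool}"


# Illustrative per-environment policy: no "delete" permission in production.
prod_policy = Policy(allowed={"migration-agent": {("db", "read"), ("db", "write")}})

print(execute(ToolCall("migration-agent", "db", "read"), prod_policy))
print(execute(ToolCall("migration-agent", "db", "delete"), prod_policy))
```

Because the policy object is plain configuration, the same agent code could run under a stricter policy in production and a looser one in a sandbox.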

MCP is open. The runtime may still be sticky

That is where Anthropic’s Model Context Protocol, or MCP, complicates the story. MCP is not a walled garden; Anthropic introduced it as an open standard for connecting AI systems to data and tools, and Anthropic’s documentation describes MCP as an open-source standard for connecting AI applications to external systems.

But openness at the protocol layer does not automatically eliminate lock-in at the runtime layer. An enterprise could use an open protocol to connect tools while still becoming dependent on a provider’s managed sessions, logs, sandboxes, permissions model, workflow state and deployment environment. In other words, MCP may reduce integration friction, while managed agent infrastructure could still increase switching costs.

Khalaf said Microsoft’s lead likely reflects its M365 and Azure distribution, while Anthropic’s emerging foothold could reflect a different architectural bet around open protocols such as MCP. But she argued the long-term direction is not a single-provider stack.

“Enterprises serious about running agents in production will end up multi-vendor across these layers,” Khalaf said, “which is why the open and interoperable control plane matters more than the current percentages might suggest.”

The next cycle may be cross-vendor collaboration

That same tension — between provider-native convenience and cross-vendor reality — is where Arick Goomanovsky, CEO and cofounder of universal AI agent orchestrator startup BAND, sees the next competitive cycle forming.

“Enterprises now run agents everywhere: individual assistants and coding agents, multi-agent systems in production, agents embedded in Agentforce and ServiceNow, and third-party agents consumed as agent-as-a-service,” Goomanovsky said. “None of them collaborate across those boundaries by default.”

Goomanovsky argues that the missing layer is not just orchestration inside a single model provider, but a cross-vendor collaboration layer that lets agents from different ecosystems act together.

“What’s emerging in parallel is demand for an agentic collaboration harness – an interaction layer that lets agents from Microsoft, OpenAI, Anthropic, and internal teams operate as one workforce,” he said. “Orchestration inside any single vendor is still a walled garden so the next competitive cycle is cross-vendor agent collaboration.”

Independent frameworks face an enterprise packaging problem

There is also a warning sign for independent orchestration frameworks. LangChain and LangGraph fell from 5.4% to 1.4% as the primary orchestration platform in the qualified enterprise sample.

External orchestration abstracted entirely from model providers also fell from 8.9% to 2.9%.

Scott Likens, Global Chief AI Engineer at professional services giant PwC, has a front row seat to this trend as the company spearheads and assists clients with their AI transformations.

As he told VentureBeat in a statement: “Right now, most enterprises are still operating in fragmented environments, with orchestration spread across platforms, business applications, and internally developed tooling. Over time, the market will likely move toward more unified orchestration models, but interoperability, governance and security will remain critical because enterprises are unlikely to standardize on a single agent ecosystem.”

The report argues that fully independent orchestration frameworks may not yet have the enterprise packaging — security certifications, support, compliance documentation and vendor accountability — that procurement teams require.

That does not mean open frameworks are irrelevant. It does suggest that enterprise buyers may increasingly consume open or developer-first orchestration through managed products, cloud-provider partnerships or internal control planes rather than as standalone frameworks.

The agent market starts to look like cloud infrastructure

This is where the agent market starts to look less like the early chatbot market and more like enterprise cloud infrastructure. The winning vendors will not only have capable models. They will have identity integration, permission controls, audit logs, observability, workflow tooling, sandboxing, evaluation and a credible answer to who owns the control plane.

Indeed, the orchestration layer is but one part of the stack that the enterprise must fill in, and enterprises may actually decide to have different orchestration layers for agents working in different departments and functions.

As Nithya Lakshmanan, Chief Product Officer at revenue team AI orchestration startup Outreach.ai, wrote in a statement to VentureBeat: “General-purpose orchestration platforms coordinate agent activity well, but they don’t carry the workflow-specific context that determines whether an agent’s action is correct for a given situation. In revenue workflows, an agent acting on incomplete deal history or missing buyer context will underperform and erode trust with users. The teams getting the most out of multi-agent systems are treating domain-specific data as the governance layer, with orchestration sitting on top. Most enterprises have chosen their orchestration stack, and what they’re now figuring out is how those platforms get access to the workflow context they need to make agents useful inside specific business functions.”

That is why Anthropic — which is increasingly launching its own domain-specific agents for finance and design, among other categories — is worth following closely. The company does not need to win the entire orchestration market tomorrow for its strategy to matter. It only needs to persuade a growing set of Claude enterprise customers to let Anthropic handle more of the surrounding machinery: tools, workflows, memory, execution and governance.

If it succeeds, Claude becomes more than a model in a multi-model portfolio. It becomes part of the infrastructure where enterprise work gets done.

That would put Anthropic in a more direct fight with OpenAI and Microsoft — not just over model quality, but over the operating layer of AI agents.

The narrow but important read

The safe interpretation of the VB Pulse data is narrow but important: Anthropic is not yet a major enterprise orchestration platform. Microsoft is. OpenAI is much closer. But Anthropic has registered its first measurable foothold at the orchestration layer, just as the market is deciding who should control agent execution.

For enterprise buyers, that may be the question that matters most in 2026. Not which model is best, but which provider gets to run the agent — and how hard it will be to leave once the agent is running.

Claude Code’s ‘/goals’ separates the agent that works from the one that decides it’s done

A code migration agent finishes its run, and the pipeline looks green. But several pieces were never compiled — and it took days to catch. That’s not a model failure; that’s an agent deciding it was done before it actually was.

Many enterprises are now seeing that production AI agent pipelines fail not because of the models’ abilities but because the model behind the agent decides to stop too soon. Several methods to prevent premature task exits are now available from LangChain, Google and OpenAI, though these often rely on separate evaluation systems. The newest method comes from Anthropic: /goals on Claude Code, which formally separates task execution and task evaluation.

Coding agents work in a loop: they read files, run commands, edit code and then check whether the task is done. 

Claude Code /goals essentially adds a second layer to that loop. After a user defines a goal, Claude continues working turn by turn, but an evaluator model reviews the work after every step and decides whether the goal has been achieved.

The two-model split

Orchestration platforms from all three vendors identified the same roadblock, but they approach it differently. OpenAI leaves the loop alone and lets the model decide when it’s done, but does let users tag on their own evaluators. For LangGraph and Google’s Agent Development Kit, independent evaluation is possible, but requires developers to define the critic node, write up the termination logic and configure observability.

Claude Code /goals makes the independent evaluator the default, whether the user wants the agent to run longer or shorter. The developer sets the goal completion condition via a prompt, for example: /goal all tests in test/auth pass, and the lint step is clean. Claude Code then runs, and every time the agent attempts to end its work, the evaluator model, which is Haiku by default, checks the result against that condition. If the condition is not met, the agent keeps running. If the condition is met, the evaluator logs the achieved condition to the agent conversation transcript and clears the goal. The evaluator makes only one binary decision, whether the task is done or not, which is why the smaller Haiku model works well.

Claude Code makes this possible by separating the model that attempts to complete a task from the evaluator model that ensures the task is actually completed. This prevents the agent from mixing up what it’s already accomplished with what still needs to be done. With this method, Anthropic noted there’s no need for a third-party observability platform — though enterprises are free to continue using one alongside Claude Code — no need for a custom log, and less reliance on post-mortem reconstruction.
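The split can be illustrated with a small loop in which one function does the work and a separate function decides whether the goal has been reached. This is a minimal Python sketch, not Anthropic’s implementation: run_until_goal, fix_one and all_tests_pass are hypothetical names, and plain functions stand in for the worker and evaluator models.

```python
from typing import Callable, List


def run_until_goal(
    worker_step: Callable[[List[str]], str],
    goal_met: Callable[[List[str]], bool],
    max_turns: int = 20,
) -> List[str]:
    """Run the worker turn by turn; a separate evaluator decides when to stop."""
    transcript: List[str] = []
    for _ in range(max_turns):
        transcript.append(worker_step(transcript))
        # The worker never declares itself done; only the evaluator can.
        if goal_met(transcript):
            transcript.append("goal achieved")
            return transcript
    transcript.append("stopped: turn limit reached")
    return transcript


# Toy stand-ins: three failing tests, and a worker that fixes one per turn.
failing = ["test_login", "test_logout", "test_refresh"]


def fix_one(transcript: List[str]) -> str:
    return f"fixed {failing.pop()}"


def all_tests_pass(transcript: List[str]) -> bool:
    return not failing


log = run_until_goal(fix_one, all_tests_pass)
print(log)
```

The turn limit is an added assumption of this sketch; it keeps the loop from running forever if the evaluator never reports success.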

Competitors like Google ADK support similar evaluation patterns. Google ADK deploys a LoopAgent, but developers have to architect that logic.

In its documentation, Anthropic said the most successful conditions usually have: 

  • One measurable end state: a test result, a build exit code, a file count, an empty queue

  • A stated check: how Claude should prove it, such as “npm test exits 0” or “git status is clean.”

  • Constraints that matter: anything that must not change on the way there, such as “no other test file is modified”
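Those three ingredients map naturally onto a small check function: one end-state test, a concrete way to prove it, and a list of constraints that must also hold. The Python sketch below is an assumption-laden illustration, not how Claude Code evaluates goals; exits_zero and condition_met are hypothetical helpers, and a Python one-liner stands in for a command such as “npm test” so the example runs anywhere Python does.

```python
import subprocess
import sys
from typing import Callable, List, Sequence


def exits_zero(cmd: Sequence[str]) -> bool:
    """Stated check: the command must exit with status 0."""
    return subprocess.run(list(cmd), capture_output=True).returncode == 0


def condition_met(check: Callable[[], bool],
                  constraints: List[Callable[[], bool]]) -> bool:
    """The goal holds only if the end-state check passes and every constraint holds."""
    return check() and all(constraint() for constraint in constraints)


# Stand-ins for a real test command (hypothetical).
passing_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
failing_cmd = [sys.executable, "-c", "raise SystemExit(1)"]

# A constraint such as "no other test file is modified" would be a real
# check in practice; here it is a placeholder that always holds.
untouched = lambda: True

print(condition_met(lambda: exits_zero(passing_cmd), [untouched]))
print(condition_met(lambda: exits_zero(failing_cmd), [untouched]))
```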

Reliability in the loop

For enterprises already managing sprawling tool stacks, the appeal is a native evaluator that doesn’t add another system to maintain.

This is part of a broader trend in the agentic space, especially as the possibility of stateful, long-running and self-learning agents becomes more of a reality. Evaluator models, verification systems and other independent adjudication systems are starting to show up in reasoning systems and, in some cases, in coding agents like Devin or SWE-agent. 

Sean Brownell, solutions director at Sprinklr, told VentureBeat in an email that there is interest in this kind of loop, where the task and judge are separate, but he feels there is nothing unique about Anthropic’s approach.

“Yes, the loop works. Separating the builder from the judge is sound design because, fundamentally, you can’t trust a model to judge its own homework. The model doing the work is the worst judge of whether it’s done,” Brownell said. “That being said, Anthropic isn’t first to market. The most interesting story here is that two of the world’s biggest AI labs shipped the same command just days apart, but each of them reached entirely different conclusions about who gets to declare ‘done.’”

Brownell said the loop works best “for deterministic work with a verifiable end-state like migrations, fixing broken test suites, clearing a backlog,” but for more nuanced tasks or those needing design judgment, a human making that decision is far more important.

Bringing that evaluator/task split to the agent-loop level shows that companies like Anthropic are pushing agents and orchestration further toward a more auditable, observable system.

Frontier AI models don’t just delete document content — they rewrite it, and the errors are nearly impossible to catch

As large language models become more capable, users are tempted to delegate knowledge tasks where models process documents on their behalf and provide the finished results. But how far can you trust the model to stay faithful to the content of your documents when it has to iterate over them across multiple rounds?

A new study by researchers at Microsoft shows that large language models silently corrupt documents that they work on by introducing errors. The researchers developed a benchmark that simulates multi-step autonomous workflows across 52 professional domains, using a method that automatically measures how much content degrades over time.

Their findings show that even top-tier frontier models corrupt an average of 25% of document content by the end of these workflows. And providing models with agentic tools or realistic distractor documents actually worsens their performance.

This serves as a warning that while there is increasing pressure to automate knowledge work, current language models are not fully reliable for these tasks.

The mechanics of delegated work

The Microsoft study focuses on “delegated work,” an emerging paradigm where users allow LLMs to complete knowledge tasks on their behalf by analyzing and modifying documents.

A prominent example of this paradigm is vibe coding, where a user delegates software development and code editing to an AI. But delegated workflows extend far beyond programming into other domains. In accounting, for example, a user might supply a dense ledger and instruct the model to split the document into separate files organized by specific expense categories.

Because users might lack the time or the specialized expertise to manually review every modification the AI implements, delegation often hinges on trust. Users expect that the model will faithfully complete tasks without introducing unchecked errors, unauthorized deletions, or hallucinations in the documents.

To measure how far AI systems can be trusted in extended, iterative delegated workflows, the researchers developed the DELEGATE-52 benchmark. The benchmark is composed of 310 work environments spanning 52 diverse professional domains, including financial accounting, software engineering, crystallography, and music notation.

Each work environment relies on real-world seed text documents ranging from 2,000 to 5,000 tokens. Alongside the seed document, the environments include five to ten complex, non-trivial editing tasks.

Grading a complex, multi-step editing process usually requires expensive human review. DELEGATE-52 bypasses this by using a “round-trip relay” simulation method that evaluates answers without requiring human-annotated reference solutions. The approach is inspired by the backtranslation technique used in machine translation evaluation, where an AI model is told to translate a document from one language to another and back to see how perfectly it reproduces the original version.

Accordingly, every edit task in DELEGATE-52 is designed to be fully reversible, pairing a forward instruction with its precise inverse. For example, an instruction to split the ledger into separate files by expense category is paired with an instruction to merge all category files back into a single ledger.

In comments provided to VentureBeat, Philippe Laban, Senior Researcher at Microsoft Research and co-author of the paper, clarified that this is not simply a test of whether an AI can hit “undo.” Because human workers cannot be forced to instantly “forget” a task they just did, this round-trip evaluation is uniquely suited for AI. By starting a new conversational session, the researchers force the model to attempt the inverse task completely independently.

The models in their experiments “do not know whether a task is a forward or backward step and are unaware of the overall experiment design,” Laban explained. “They are simply attempting each task as thoroughly as they can at each step.”

These round-trip tasks are chained together into a continuous relay to simulate long-horizon workflows spanning 20 consecutive interactions. To make the environment more realistic, the benchmark introduces distractor files in the context of each task. These contain 8,000 to 12,000 tokens of topically related but completely irrelevant documents. Distractors measure whether the AI can maintain focus or if it gets confused and pulls in the wrong data.
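The relay idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the benchmark’s code: plain functions stand in for model calls, the standard library’s difflib.SequenceMatcher stands in for the study’s domain-specific similarity measures, and the ledger text is invented.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

Edit = Callable[[str], str]  # stand-in for "model applies an instruction"


def similarity(a: str, b: str) -> float:
    """Generic text similarity in [0, 1]; an assumption, not the paper's metric."""
    return SequenceMatcher(None, a, b).ratio()


def relay(seed: str, round_trips: List[Tuple[Edit, Edit]]) -> List[float]:
    """Chain forward/inverse edits and score the document against the seed each time."""
    doc, scores = seed, []
    for forward, inverse in round_trips:
        doc = inverse(forward(doc))  # a faithful round trip returns the input
        scores.append(similarity(seed, doc))
    return scores


seed = "rent 1200\nfood 300\ntravel 450"

# A faithful pair, and a lossy pair whose inverse silently drops the last line.
faithful = (lambda d: d.upper(), lambda d: d.lower())
lossy = (lambda d: d.upper(), lambda d: "\n".join(d.lower().split("\n")[:-1]))

scores = relay(seed, [faithful, lossy])
print(scores)
```

No reference answer is needed at any point: the seed document itself is the ground truth, which is what lets the method scale without human annotation.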

Testing frontier models in the relay

To understand how different architectures and scales handle delegated work, the researchers tested 19 different language models from OpenAI, Anthropic, Google, Mistral, xAI, and Moonshot. The main experiment subjected these models to a simulation of 20 consecutive editing interactions.

Across all models, documents suffered an average degradation of 50% by the end of the simulation. Even the best frontier models in the experiment, specifically Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4, corrupted an average of 25% of the document content.

Out of 52 professional domains, Python was the only one where most models achieved a ready status with a score of 98% or higher. Models excel in programmatic tasks but struggle severely in natural language and niche domains like fiction, earnings statements, or recipes. The overall top model, Gemini 3.1 Pro, was deemed ready for delegated work in only 11 out of the 52 domains.

Interestingly, the corruption was not a death by a thousand cuts in which models slowly accumulate tiny errors. Instead, about 80% of total degradation was caused by sparse but massive critical failures: single interactions where a model suddenly drops at least 10% of the document’s content. The frontier models do not necessarily avoid small errors better. They simply delay these catastrophic failures to later rounds.
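A team tracking its own workflows could look for the same pattern by flagging any single interaction that loses at least 10 points of document fidelity and measuring how much of the total loss those interactions explain. The Python sketch below assumes the threshold is applied to absolute score drops, and the score series is invented for illustration; neither comes from the study.

```python
from typing import List, Tuple


def critical_failures(scores: List[float],
                      threshold: float = 0.10) -> Tuple[List[int], float]:
    """Return the interactions with a drop >= threshold and their share of total loss.

    `scores` is the fraction of content preserved after each interaction,
    starting from a perfect 1.0 before the first one.
    """
    drops = [max(prev - cur, 0.0) for prev, cur in zip([1.0] + scores, scores)]
    total = sum(drops)
    critical = [i for i, drop in enumerate(drops) if drop >= threshold]
    share = sum(drops[i] for i in critical) / total if total else 0.0
    return critical, share


# Invented series: mostly tiny losses, with two sudden large drops.
example = [0.99, 0.98, 0.97, 0.82, 0.81, 0.80, 0.60]
indices, share = critical_failures(example)
print(indices, round(share, 3))
```

In this made-up series, two of the seven interactions account for most of the loss, which is the shape the researchers describe.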

Another important observation is that when weaker models fail, their degradation originates primarily from content deletion. However, when frontier models fail, they actively corrupt the existing content. The text is still there, but it has been subtly distorted or hallucinated, making it much harder for a human overseer to detect the error.

Interestingly, giving models an agentic harness with generic tools for code execution and file read/write access actually worsened their performance, adding an average of 6% more degradation. Laban explained that the failure lies in relying on generic tools rather than domain-specific ones.

“Models lack the capability to write effective programs on the fly that can manipulate files across diverse domains without mistakes,” he noted. “When they cannot do something programmatically, they resort to reading and rewriting entire files, which is less efficient and more error prone.” The solution for developers is to build tightly scoped tools (such as specific functions to calculate or move entries within .ledger files) to keep agents on track.
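A tightly scoped tool of that kind might look like the following Python sketch. It is hypothetical: move_entry and the list-of-dictionaries ledger are invented for illustration and do not reflect any real .ledger parser. The point is that the agent calls one narrow function instead of reading and rewriting the whole file.

```python
from typing import Dict, List

Entry = Dict[str, object]


def move_entry(entries: List[Entry], index: int, new_category: str) -> List[Entry]:
    """Return a copy of the ledger with one entry recategorized; nothing else changes."""
    if not 0 <= index < len(entries):
        raise IndexError(f"no entry at index {index}")
    updated = [dict(entry) for entry in entries]  # copy, so the input is untouched
    updated[index]["category"] = new_category
    return updated


ledger = [
    {"category": "food", "amount": 300},
    {"category": "misc", "amount": 450},
]
result = move_entry(ledger, 1, "travel")
print(result)
```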

Degradation also snowballs as documents get larger or as more distractor files are added to the workspace. For enterprise teams investing heavily in retrieval-augmented generation (RAG), these distractor documents serve as a direct warning about the compounding cost of messy context. While a noisy context window might cause a minimal 1% performance drop after just two interactions, that degradation compounds to a massive 2-8% drop over a long simulation.

“For the retrieval community: RAG pipelines should be evaluated over multi-step workflows, not just single-turn retrieval benchmarks,” Laban said. “Single-turn measurements systematically underestimate the harm of imprecise retrieval.”

Reality check for the autonomous enterprise

The findings from the DELEGATE-52 benchmark offer a critical reality check for the current hype surrounding fully autonomous AI agents.

The benchmark’s design also implies a practical constraint: because models can maintain a clean record for several steps before a sudden catastrophic failure, incremental human review is necessary — not a single final check. Laban recommends building AI applications around short, transparent tasks rather than complex long-horizon agents.

For organizations wanting to deploy autonomous agents safely today, the DELEGATE-52 methodology provides a practical blueprint for testing in-house data pipelines. Laban explained that “… an enterprise team wanting to adopt this framework needs to build three components: (a) a set of reversible editing tasks representative of their workflows, (b) a parser that converts their domain documents into a structured representation, and (c) a similarity function that compares two parsed representations.” Teams do not even need to build parsers from scratch. The Microsoft research team successfully repurposed existing parsing libraries for 30 out of the 52 domains tested.
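Components (b) and (c) can be quite small for simple document types. The Python sketch below assumes an invented one-line-per-entry ledger format and uses Jaccard overlap as the similarity function; both are illustrative choices, not the parsers or metrics the Microsoft team used.

```python
from typing import FrozenSet, Tuple

Entry = Tuple[str, int]


def parse_ledger(text: str) -> FrozenSet[Entry]:
    """Parser: turn 'category amount' lines into a set of structured entries."""
    entries = set()
    for line in text.splitlines():
        if line.strip():
            category, amount = line.split()
            entries.add((category, int(amount)))
    return frozenset(entries)


def jaccard(a: FrozenSet[Entry], b: FrozenSet[Entry]) -> float:
    """Similarity function: shared entries divided by all distinct entries."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "rent 1200\nfood 300\ntravel 450"
reordered = "travel 450\nrent 1200\nfood 300"   # same content, new order
corrupted = "travel 450\nrent 1200\nfood 330"   # one amount silently changed

print(jaccard(parse_ledger(original), parse_ledger(reordered)))
print(jaccard(parse_ledger(original), parse_ledger(corrupted)))
```

Comparing parsed entries rather than raw text means harmless reordering does not count as damage, while a single altered figure, the kind of subtle corruption that is hard for a human reviewer to spot, lowers the score.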

Laban is optimistic about the rate of improvement. “Progress is real and fast. Looking at the GPT family alone, models go from scoring below 20% to around 70% in 18 months,” Laban said. “If that trajectory continues, models will soon be able to achieve saturated scores on DELEGATE-52.”

However, Laban cautioned that DELEGATE-52 is purposefully small compared to massive enterprise environments. Even as foundation models inevitably master this benchmark, the endless long tail of unique enterprise data and workflows means organizations will always need to invest in custom, domain-specific tooling to keep their autonomous agents reliable.