The landscape of enterprise artificial intelligence shifted fundamentally today as OpenAI announced $110 billion in new funding from three of tech’s largest firms: $30 billion from SoftBank, $30 billion from Nvidia, and $50 billion from Amazon.
But while the first two are contributing capital alone, OpenAI is going further with Amazon, establishing a fully “Stateful Runtime Environment” on Amazon Web Services (AWS), the world’s most widely used cloud platform.
This signals OpenAI’s and Amazon’s vision of the next phase of the AI economy — moving from chatbots to autonomous “AI coworkers” known as agents — and that this evolution requires a different architectural foundation than the one that built GPT-4.
For enterprise decision-makers, this announcement isn’t just a headline about massive capital; it is a technical roadmap for where the next generation of agentic intelligence will live and breathe.
And for enterprises already running on AWS, it’s particularly good news: they will soon have the option of a new runtime environment from OpenAI, though the companies have yet to announce a precise timeline for when it will arrive.
At the heart of the new OpenAI-Amazon partnership is a technical distinction that will define developer workflows for the next decade: the difference between “stateless” and “stateful” environments.
To date, most developers have interacted with OpenAI through stateless APIs. In a stateless model, every request is an isolated event; the model has no “memory” of previous interactions unless the developer manually feeds the entire conversation history back into the prompt. Microsoft, OpenAI’s longtime cloud partner and major investor, remains the exclusive third-party cloud provider for these stateless APIs through Azure.
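The stateless pattern can be sketched in a few lines. In this illustration, `call_model` is a placeholder standing in for a real chat-completions API call, not an actual OpenAI SDK function:

```python
# Stateless pattern: the caller owns all memory. Every request must re-send
# the full conversation history, because the model retains nothing between
# calls. `call_model` is a stand-in for a real chat-completions API call.

def call_model(messages):
    # Placeholder: a real implementation would POST `messages` to the API.
    return f"(reply based on {len(messages)} messages)"

history = [{"role": "system", "content": "You are a support assistant."}]

def ask(question):
    history.append({"role": "user", "content": question})
    reply = call_model(history)          # entire history resent each turn
    history.append({"role": "assistant", "content": reply})
    return reply

ask("Where is my order?")
ask("Can you expedite it?")
# By turn two, the request already carries four prior messages plus the new one.
```

The cost of this pattern grows with every turn: each new question ships the whole transcript back over the wire.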
The newly announced Stateful Runtime Environment, by contrast, will be hosted on Amazon Bedrock — a paradigm shift.
This environment allows models to maintain persistent context, memory, and identity. Rather than a series of disconnected calls, the stateful environment enables “AI coworkers” to handle ongoing projects, remember prior work, and move seamlessly across different software tools and data sources.
As OpenAI notes on its website: “Now, instead of manually stitching together disconnected requests to make things work, your agents automatically execute complex steps with ‘working context’ that carries forward memory/history, tool and workflow state, environment use, and identity/permission boundaries.”
For builders of complex agents, this reduces the “plumbing” required to maintain context, as the infrastructure itself now handles the persistent state of the agent.
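OpenAI has not published the runtime’s API, so any concrete code is speculative. The sketch below only models the shape of the “working context” the announcement describes, with memory, tool state, and identity held by the runtime rather than re-sent by the caller; all names are illustrative:

```python
from dataclasses import dataclass, field

# Illustrative only: OpenAI has not published the runtime's API. This models
# the "working context" described in the announcement -- memory, tool and
# workflow state, and identity/permission boundaries held server-side, so
# the caller no longer stitches history together on every request.

@dataclass
class WorkingContext:
    identity: str                                  # agent identity/permissions
    memory: list = field(default_factory=list)     # carries forward across calls
    tool_state: dict = field(default_factory=dict)

class StatefulAgent:
    def __init__(self, identity):
        self.ctx = WorkingContext(identity=identity)

    def step(self, task):
        # Each call sees everything the agent has done so far.
        self.ctx.memory.append(task)
        return f"{self.ctx.identity}: step {len(self.ctx.memory)} ({task})"

agent = StatefulAgent("billing-agent")
agent.step("pull last month's invoices")
agent.step("flag anomalies")   # remembers step 1 without the caller re-sending it
```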
The vehicle for this stateful intelligence is OpenAI Frontier, an end-to-end platform designed to help enterprises build, deploy, and manage teams of AI agents, launched back in early February 2026.
Frontier is positioned as a solution to the “AI opportunity gap”—the disconnect between model capabilities and the ability of a business to actually put them into production.
Key features of the Frontier platform include:
Shared Business Context: Connecting siloed data from CRMs, ticketing tools, and internal databases into a single semantic layer.
Agent Execution Environment: A dependable space where agents can run code, use computer tools, and solve real-world problems.
Built-in Governance: Every AI agent has a unique identity with explicit permissions and boundaries, allowing for use in regulated environments.
While the Frontier application itself will continue to be hosted on Microsoft Azure, AWS has been named the exclusive third-party cloud distribution provider for the platform.
This means that while the “engine” may sit on Azure, AWS customers will be able to access and manage these agentic workloads directly through Amazon Bedrock, integrated with AWS’s existing infrastructure services.
For now, OpenAI has launched a dedicated Enterprise Interest Portal on its website. This serves as the primary intake point for organizations looking to move past isolated pilots and into production-grade agentic workflows.
The portal is a structured “request for access” form where decision-makers provide:
Firmographic Data: Basic details including company size (ranging from startups of 1–50 to large-scale enterprises with 20,000+ employees) and contact information.
Business Needs Assessment: A dedicated field for leadership to outline specific business challenges and requirements for “AI coworkers”.
By submitting this form, enterprises signal their readiness to work directly with OpenAI and AWS teams to implement solutions like multi-system customer support, sales operations, and finance audits that require high-reliability state management.
The scale of the announcement was mirrored in the public statements from the key players on social media.
Sam Altman, CEO of OpenAI, expressed excitement about the Amazon partnership, specifically highlighting the “stateful runtime environment” and the use of Amazon’s custom Trainium chips.
However, Altman was quick to clarify the boundaries of the deal: “Our stateless API will remain exclusive to Azure, and we will build out much more capacity with them”.
Amazon CEO Andy Jassy emphasized the demand from his own customer base, stating, “We have lots of developers and companies eager to run services powered by OpenAI models on AWS”. He noted that the collaboration would “change what’s possible for customers building AI apps and agents”.
Early adopters have already begun to weigh in on the utility of the Frontier approach. Joe Park, EVP at State Farm, noted that the platform is helping the company accelerate its AI capabilities to “help millions plan ahead, protect what matters most, and recover faster”.
For CTOs and enterprise decision-makers, the OpenAI-Amazon-Microsoft triangle creates a new set of strategic choices. The decision of where to allocate budget now depends heavily on the specific use case:
For High-Volume, Standard Tasks: If your organization relies on standard API calls for content generation, summarization, or simple chat, Microsoft Azure remains the primary destination. These “stateless” calls are exclusive to Azure, even if they originate from an Amazon-linked collaboration.
For Complex, Long-Running Agents: If your goal is to build “AI coworkers” that require deep integration with AWS-hosted data and persistent memory across weeks of work, the AWS Stateful Runtime Environment is the clear choice.
For Custom Infrastructure: OpenAI has committed to consuming 2 gigawatts of AWS Trainium capacity to power Frontier and other advanced workloads. This suggests that enterprises looking for the most cost-efficient way to run OpenAI models at massive scale may find an advantage in the AWS-Trainium ecosystem.
Despite the massive infusion of Amazon capital, the legal and financial ties between Microsoft and OpenAI remain remarkably rigid. A joint statement released by both companies clarified that their “commercial and revenue share relationship remains unchanged”.
Crucially, Microsoft continues to maintain its “exclusive license and access to intellectual property across OpenAI models and products”. Furthermore, Microsoft will receive a share of the revenue generated by the OpenAI-Amazon partnership.
This ensures that while OpenAI is diversifying its infrastructure, Microsoft remains the ultimate beneficiary of OpenAI’s commercial success, regardless of which cloud the compute actually runs on.
The definition of Artificial General Intelligence (AGI) also remains a protected term in the Microsoft agreement. The contractual processes for determining when AGI has been reached—and the subsequent impact on commercial licensing—have not been altered by the Amazon deal.
Ultimately, OpenAI is positioning itself as more than a model or tool provider; it is an infrastructure player attempting to straddle the two largest clouds on Earth.
For the user, this means more choice and more specialized environments. For the enterprise, it means that the era of “one-size-fits-all” AI procurement is over.
The choice between Azure and AWS for OpenAI services is now a technical decision about the nature of the work itself: whether your AI needs to simply “think” (stateless) or to “remember and act” (stateful).
AI agents now carry more access and more connections to enterprise systems than any other software in the environment. That makes them a bigger attack surface than anything security teams have had to govern before, and the industry doesn’t yet have a framework for it. “If that attack vector gets utilized, it can result in a data breach, or even worse,” said Spiros Xanthos, founder and CEO of Resolve AI, speaking at a recent VentureBeat AI Impact Series event.
Traditional security frameworks are built around human interactions. There’s not yet an agreed-upon construct for AI agents that have personas and can work autonomously, noted Jon Aniano, SVP of product and CRM applications at Zendesk, at the same event. Agentic AI is moving faster than enterprises can build guardrails around it, according to Aniano and other enterprise leaders, and Model Context Protocol (MCP), while decreasing integration complexity, is making the problem worse.
“Right now it’s an unsolved problem because it’s the wild, wild West,” Aniano said. “We don’t even have a defined technical agent-to-agent protocol that all companies agree on. How do you balance user expectations versus what keeps your platform safe?”
Enterprises are increasingly hooking into MCP servers because they simplify integration between agents, tools and data. However, MCP servers tend to be “extremely permissive,” he said.
They are “actually probably worse than an API,” he contended, because APIs at least have more controls in place to constrain what agents can do.
Today’s agents are acting on behalf of humans based on explicit permissions, thus establishing human accountability. “But you might have tens, hundreds of agents in the future with their own identity, their own access,” said Xanthos. “It becomes a very complex matrix.”
Even as his startup is developing autonomous AI agents for site reliability engineering (SRE) and system management, he acknowledged that the industry “completely lacks the framework” for autonomous agents.
“It’s completely on us and to anybody who builds agents to figure out what restrictions to give them,” he said. And customers must be able to trust those decisions.
Some existing security tools do offer fine-grained access — Splunk, for instance, developed a method to provide access to certain indexes in underlying data stores, he noted — but most are broader and human-oriented.
“We’re trying to figure this out with existing tools,” he said. “But I don’t think they’re sufficient for the era of agents.”
At Zendesk and other customer relationship management (CRM) platform providers, AI is involved in a number of user interactions, Aniano noted — in fact, now it’s at a “volume and a scale that we haven’t contemplated as businesses and as a society.”
It can get tricky when AI is helping out human agents; the audit trail can become a labyrinth.
“So now you’ve got a human talking to a human that’s talking to an AI,” Aniano noted. “The human tells the AI to take action. Who’s at fault if it’s the wrong action?” This becomes even more complicated when there are “multiple pieces of AI and multiple humans” in the mix.
To prevent agents from going off the rails, Zendesk tends to be “very strict” about access and scope; however, customers can define their own guardrails based on their needs. In most cases, AI can access knowledge sources, but they’re not writing code or running commands on servers, Aniano said. If an AI does call an API, it is “declaratively designed” and sanctioned, and actions are specifically called out.
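The “declaratively designed and sanctioned” approach Aniano describes can be sketched as an explicit per-identity allowlist, where any tool call outside an agent’s declared scope is refused. The names below are illustrative, not a Zendesk API:

```python
# A sketch of the "declaratively designed and sanctioned" pattern: each agent
# identity carries an explicit allowlist of tools, and any call outside that
# scope is refused. Tool and agent names here are hypothetical.

SANCTIONED_TOOLS = {
    "support-agent": {"read_kb_article", "create_ticket"},
    "billing-agent": {"read_invoice"},
}

def call_tool(agent_id, tool, *args):
    allowed = SANCTIONED_TOOLS.get(agent_id, set())
    if tool not in allowed:
        raise PermissionError(f"{agent_id} is not sanctioned to call {tool}")
    return f"{tool} executed for {agent_id}"

call_tool("support-agent", "create_ticket", "order delayed")   # in scope
try:
    call_tool("support-agent", "read_invoice")                 # out of scope
except PermissionError as e:
    print(e)
```

The key design choice is deny-by-default: an agent with no entry in the registry can call nothing, which is the inverse of the “extremely permissive” MCP defaults criticized above.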
However, customer demand is flooding these scenarios and “we’re kind of holding the gates right now,” he said.
The industry must develop concrete standards for agent interactions. “We’re entering a world where, with things like MCP that can auto-discover tools, we’re going to have to create new methods of safety for deciding what tools these bots can interact with,” said Aniano.
When it comes to security, enterprises are rightly concerned when AI takes over authentication tasks, such as sending out and processing one-time passwords (OTP), SMS codes, or other two-step verification methods, he said. What happens if an AI mis-authenticates or misidentifies someone? This can lead to sensitive data leakage or open the door for attackers.
“There’s a spectrum now, and the end of that spectrum today is a human,” Aniano said. However, “the end of that spectrum tomorrow might be a specialized agent designed to do the same kind of gut feeling or human-level interaction.”
Customers themselves are on a spectrum of adoption and comfort. In certain companies, particularly in financial services and other highly regulated environments, humans still must be involved in authentication, Aniano noted. In other cases, legacy or old-guard companies only trust humans to authenticate other humans.
He noted that Zendesk is experimenting with new AI agents that are “a little more connected to systems,” and working with a select group of customers around guardrailing.
At some point in the future, agents may actually be more trusted than humans for certain tasks, and granted permissions “way beyond” what humans have today, Xanthos said. But we’re a long way from that, and, for the most part, the fear of something going wrong is what’s holding enterprises back.
“Which is a good fear, right? I’m not saying that it is a bad thing,” he said. Many enterprises simply aren’t yet comfortable with an agent doing all steps of a workflow or fully closing the loop by itself. They still want human review.
Resolve AI is on the cusp of giving agents standing authorization in a few cases that are “generally safe,” such as in coding; from there they’ll move to more open-ended scenarios that are not all that risky, Xanthos explained. But he acknowledged that there will always be very risky situations where AI mistakes could “mutate the state of the production system,” as he put it.
Ultimately, though: “There’s no going back, obviously; this is moving faster than maybe even mobile did. So the question is what do we do about it?”
Both speakers pointed to interim measures available within existing tooling. Xanthos noted that some tools — Splunk among them — already offer fine-grained index-level access controls that can be applied to agents. Aniano described Zendesk’s approach as a practical starting point: declaratively designed API calls with explicitly sanctioned actions, strict access and scope limits, and human review before expanding agent permissions.
The underlying principle, as Aniano put it: “We’re always checking those gates and seeing how we can widen the aperture” — meaning don’t grant standing authorization until you’ve validated each expansion.
Former Twitter co-founder Jack Dorsey’s company Block (the parent of merchant payments system Square, mobile peer-to-peer payments app Cash App, music streamer Tidal, and open-source agentic AI system Goose) sent shockwaves across the business world tonight after announcing a headcount reduction of more than 40%, cutting its workforce by more than 4,000 people from a prior total of roughly 10,000, despite its latest quarterly earnings statement, released today, showing $2.87 billion in gross profit, up 24% year-over-year.
The culprit? Newfound AI efficiencies. As Dorsey put it in a note shared on his own former social network, X:
“we’re not making this decision because we’re in trouble. our business is strong. gross profit continues to grow, we continue to serve more and more customers, and profitability is improving. but something has changed. we’re already seeing that the intelligence tools we’re creating and using, paired with smaller and flatter teams, are enabling a new way of working which fundamentally changes what it means to build and run a company. and that’s accelerating rapidly.
i had two options: cut gradually over months or years as this shift plays out, or be honest about where we are and act on it now. i chose the latter. repeated rounds of cuts are destructive to morale, to focus, and to the trust that customers and shareholders place in our ability to lead. i’d rather take a hard, clear action now and build from a position we believe in than manage a slow reduction of people toward the same outcome. a smaller company also gives us the space to grow our business the right way, on our own terms, instead of constantly reacting to market pressures.”
The core of this reorganization is a pivot toward an “intelligence-native” model. Dorsey argues that a significantly smaller team, leveraging the very tools they are building, can deliver more value than a traditional large-scale organization. Block is re-engineering its entire operational stack to be orchestrated by AI, moving away from human-intensive management hierarchies toward what it calls “agentic AI infrastructure”.
This includes four primary focus areas:
Customer Capabilities: Atomic features that allow customers to build directly on top of Block’s infrastructure.
Proactive Intelligence: Moving from reactive dashboards to tools like Moneybot that anticipate customer needs before they ask.
Intelligence Models: A system to orchestrate the company’s internal operations, aiming for extreme speed and product velocity.
Operational Orchestration: An AI model designed to manage the internal decision-making and risk-assessment processes of the firm.
The financial strength cited in the lede is driven by deep engagement in Cash App and Square. Cash App’s gross profit grew 33% YoY to $1.83 billion, while Square saw its strongest year on record for new volume added (NVA).
Specific product highlights include:
Cash App Green: This status program for “modern earners” — a segment of 125 million people including gig workers and freelancers — has become a cornerstone of the company’s engagement strategy.
Square AI: Now embedded in the Square Dashboard, it provides sellers with instant insights into staffing and customer behavior.
Consumer Lending: Cash App Borrow origination volume surged 223% YoY, proving to be a high-return product that manages income variability for users.
Block also exceeded the Rule of 40—the industry benchmark where the sum of gross profit growth and adjusted operating income margin exceeds 40%—for the first time in the fourth quarter.
Not everyone was convinced by Dorsey’s letter stating that AI efficiencies were the primary driver of the layoffs. As Will Slaughter wrote on X: “In 3 years from December 2019 to December 2022, Block $XYZ more than tripled its headcount from 3,900 to 12,500. Unwinding less than half an insane COVID overhiring binge has much more to do with Jack Dorsey’s managerial incompetence than whether AI is going to take your job.”
Entrepreneur Marcelo P. Lima offered a similar sentiment on X, writing in part: “Everyone will assume Jack Dorsey ‘greatest of all time’ is doing this because of AI. He’s not. Block has been massively bloated for years. Don’t forget, Jack was head of Twitter. When Elon took over, he fired 80% of staff within 5 months and the product got better. This was before generative AI and Claude Code.”
And yet, regardless of how heavily AI factored into these layoffs in particular, the outcome on the wider enterprise landscape may ultimately be the same. With Block’s stock price rising more than 24% on the news, the boards and leadership of other public companies will likely be forced to at least entertain the idea of similarly drastic cuts if they believe AI can replace human labor and drive greater organizational efficiencies.
As user @khuppy wrote on X: “By Q2, if you aren’t firing lots of employees, your board will fire you for being a dinosaur who doesn’t implement AI. It’s going to happen fast now. Feudalism, here we come…”
Clearly, companies across sectors, especially those in tech and services, will be re-examining their headcount in light of Block’s latest move.
Despite the robust financial performance, the human cost is stark. The reduction from over 10,000 to just under 6,000 employees is one of the most drastic in fintech history. Dorsey’s internal note, while aimed at transparency, was met with a mix of awe at the technical vision and criticism of the timing.
Affected employees are receiving a severance package that includes 20 weeks of salary plus one week per year of tenure, equity vesting through May, and a $5,000 transition fund.
Dorsey noted that communication channels would stay open through Thursday evening so the team could say goodbye properly, stating, “i’d rather it feel awkward and human than efficient and cold.”
For enterprise decision-makers, Block’s move represents a fundamental challenge to the “growth at all costs” hiring model that has defined the last decade of tech.
Leadership teams should view this not merely as a cost-cutting measure, but as a strategic reset where organizational value is measured by the ratio of output to “intelligence-native” tools rather than total headcount. Executives should begin by auditing their own internal workflows to identify where agentic AI can consolidate roles and flatten management hierarchies before market pressures force a more reactive, less orderly contraction.
Even if it doesn’t lead to cuts as drastic, or to hiring slowdowns and freezes, Block’s move should likely prompt at least the kind of policy Shopify CEO Tobi Lutke introduced separately nearly a year ago: “Before asking for more Headcount and resources, teams must demonstrate why they cannot get what they want done using AI.”
While the community reaction to Block’s layoffs highlights the potential for brand damage and morale loss, the 24% surge in Block’s stock price suggests that the public market is increasingly rewarding lean, automated efficiency over human-intensive scaling.
Decision-makers should evaluate their current “bloat” against the benchmark set by Dorsey: if a company of 6,000 can drive $12.20 billion in gross profit, the standard for organizational efficiency has been permanently raised.
In building LLM applications, enterprises often have to create very long system prompts to adjust the model’s behavior for their applications. These prompts contain company knowledge, preferences, and application-specific instructions. At enterprise scale, these contexts can push inference latency past acceptable thresholds and drive per-query costs up significantly.
On-Policy Context Distillation (OPCD), a new training framework proposed by researchers at Microsoft, helps bake the knowledge and preferences of applications directly into a model. OPCD uses the model’s own responses during training, which avoids some of the pitfalls of other training techniques. This improves the abilities of models for bespoke applications while preserving their general capabilities.
In-context learning allows developers to update a model’s behavior at inference time without modifying its underlying parameters. Updating parameters is typically a slow and expensive process. However, in-context knowledge is transient. This knowledge does not carry across different conversations with the model, meaning you have to feed the model the exact same massive set of instructions or documents every time. For an enterprise application, this might mean repeatedly pasting company policies, customer tickets, or dense technical manuals into the prompt. This eventually slows down the model, drives up costs, and can confuse the system.
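A back-of-the-envelope calculation shows why re-sending a long prompt matters at scale. The per-token price, prompt size, and query volume below are assumptions for illustration, not quoted figures:

```python
# Back-of-the-envelope cost of re-sending a long system prompt on every call.
# All numbers (price, prompt size, volume) are assumed for illustration.

PRICE_PER_1K_INPUT_TOKENS = 0.003   # assumed $/1K input tokens
SYSTEM_PROMPT_TOKENS = 5_000        # policies, manuals, instructions
QUERY_TOKENS = 200
QUERIES_PER_DAY = 100_000

def daily_cost(prompt_tokens):
    tokens = (prompt_tokens + QUERY_TOKENS) * QUERIES_PER_DAY
    return tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

with_prompt = daily_cost(SYSTEM_PROMPT_TOKENS)   # prompt resent every call
distilled = daily_cost(0)                        # prompt baked into weights
print(f"${with_prompt:,.0f}/day vs ${distilled:,.0f}/day")
```

Under these assumptions the resent prompt accounts for the overwhelming majority of daily input-token spend, which is the overhead distillation aims to eliminate.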
“Enterprises often use long system prompts to enforce safety constraints (e.g., hate speech detection) or to provide domain-specific expertise (e.g., medical knowledge),” said Tianzhu Ye, co-author of the paper and researcher at Microsoft Research Asia, in comments provided to VentureBeat. “However, lengthy prompts significantly increase computational overhead and latency at inference time.”
The main idea behind context distillation is to train a model to internalize the information that you repeatedly insert into the context. Like other distillation techniques, it follows a teacher-student paradigm. The teacher is an AI model that receives the massive, detailed prompt. Because it has all the instructions and reference documents, it generates highly tailored responses. The student is a model being trained that only sees the main question and doesn’t have access to the full context. Its goal is simply to observe the teacher’s responses and learn to mimic its behavior.
Through this training process, the student model effectively compresses the complex instructions from the teacher’s prompt directly into its parameters. For an enterprise, the primary value happens at inference time. Because the student model has internalized the context, you can deploy it in your application without needing to paste in the lengthy instructions again. This makes the model significantly faster to run, with far less computational overhead.
However, classic context distillation relies on a flawed training method called “off-policy training,” where the model is trained on fixed datasets that were collected before the training process. This is problematic in several ways. During training, the student is only exposed to ground-truth data and teacher-generated answers, creating what Ye calls “exposure bias.” In production, the model must come up with its own token sequences to reach those answers. Because it never practiced making its own decisions or recovering from its own mistakes during training, it can easily derail when operating independently. It’s like showing a student videos of a professional driver and expecting them to learn driving without trial and error.
Another problem is the “forward Kullback-Leibler (KL) divergence” minimization measure used to train the model. Under this method, the model is graded on how similar its answers are to the teacher, which encourages “mode-covering” behavior, Ye says. The student model is often smaller or lacks the rich context the teacher had, meaning it simply lacks the capacity to perfectly replicate the teacher’s complex reasoning. Because the student is forced to try and cover all those possibilities anyway, its underlying guesses become overly broad and unfocused.
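The mode-covering versus mode-seeking distinction can be made concrete with toy distributions. Here a “teacher” puts probability mass on two distinct modes that a limited student cannot both match; forward KL prefers a vague student that spreads mass over both, while reverse KL prefers a student that commits to one mode:

```python
import math

# Toy illustration of forward vs reverse KL. The teacher has two modes a
# limited student cannot both match; the student must spread out or commit.

def kl(p, q):
    # KL(p || q) over discrete distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.49, 0.02, 0.49]   # bimodal teacher distribution
spread  = [0.34, 0.32, 0.34]   # covers both modes, vague everywhere
commit  = [0.90, 0.05, 0.05]   # picks one mode decisively

# Forward KL(teacher || student) grades coverage of the teacher's modes.
assert kl(teacher, spread) < kl(teacher, commit)   # rewards the vague student

# Reverse KL(student || teacher) penalizes mass where the teacher has none.
assert kl(commit, teacher) < kl(spread, teacher)   # rewards the focused student
```

The second assertion is the behavior Ye describes: under reverse KL, the student is better off being confidently right about one mode than diffusely hedging across all of them.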
In real-world applications, this can result in hallucinations, where the AI gets confused and confidently makes things up because it is trying to mimic a depth of knowledge it does not actually possess. It also means that the model cannot generalize well to new tasks.
To fix the critical issues with the old teacher-student dynamic, the Microsoft researchers introduced On-Policy Context Distillation (OPCD). The most important shift in OPCD is that the student model learns from its own generation trajectories as opposed to a static dataset (which is why it is called “on-policy”). Instead of passively studying a dataset of the teacher’s perfect outputs, the student is given a task without seeing the massive instruction prompt and has to generate an answer entirely on its own.
As the student generates its answer, the teacher acts as a live instructor. The teacher has access to the full, customized prompt and evaluates the student’s output. At every step along the student’s generation, the system compares the student’s token distribution against what the context-aware teacher would do.
OPCD uses “reverse KL divergence” to grade the student. “By minimizing reverse KL divergence, it promotes ‘mode-seeking’ behavior. It focuses on high-probability regions of the student’s distribution,” Ye said. “It suppresses tokens that the student considers unlikely, even if the teacher’s belief assigned them high probability. This alignment helps the student correct its own mistakes and avoid the broad, hallucinatory distributions of standard distillation.”
Because the student model actively practices making its own decisions and learns to correct its own mistakes during training, it behaves more reliably when deployed in a live application. It successfully bakes complex business rules, safety constraints, or specialized knowledge directly into its permanent memory.
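A schematic OPCD training step, under the mechanics described above, might look like the following. This is a sketch of the idea, not the paper’s implementation: the student rolls out its own answer, the frozen prompt-aware teacher scores each of the student’s own tokens, and the loss is the per-token reverse KL:

```python
import math

# Schematic OPCD step (not the paper's code): the student generates its OWN
# rollout (on-policy); at each token the frozen, prompt-aware teacher supplies
# reference probabilities; the loss is reverse KL(student || teacher),
# averaged over the rollout.

def reverse_kl(student_probs, teacher_probs):
    return sum(s * math.log(s / t)
               for s, t in zip(student_probs, teacher_probs) if s > 0)

def opcd_step(student_rollout, teacher_for):
    # student_rollout: per-token distributions from the student's own sample
    losses = []
    for step, s_probs in enumerate(student_rollout):
        t_probs = teacher_for(step)      # teacher scores the student's path
        losses.append(reverse_kl(s_probs, t_probs))
    return sum(losses) / len(losses)     # gradient of this updates the student

# Toy rollout: 2 generation steps over a 3-token vocabulary.
rollout = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
teacher = lambda step: [[0.6, 0.3, 0.1], [0.05, 0.9, 0.05]][step]
loss = opcd_step(rollout, teacher)
```

The crucial on-policy detail is that `rollout` comes from the student’s own sampling, so the teacher grades the paths the student actually takes, including its mistakes, rather than a fixed dataset of teacher outputs.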
The researchers tested OPCD in two key areas: experiential knowledge distillation and system prompt distillation. For experiential knowledge distillation, the researchers wanted to see if an LLM could learn from its own past successes and permanently adopt those lessons. They tested this on models of various sizes, using mathematical reasoning problems.
First, the model solved problems and was asked to write down general rules it learned from its successes. Then, using OPCD, they baked those written lessons directly into the model’s parameters. The results showed that the models improved dramatically without needing the learned experience pasted into their prompts anymore. On complex math problems, an 8-billion-parameter model improved from a 75.0% baseline to 80.9%. For example, on the Frozen Lake navigation game, a small 1.7-billion parameter model initially had a success rate of 6.3%. After OPCD baked in the learned experience, its accuracy jumped to 38.3%.
The second set of experiments was on long system prompts. Enterprises often use massive system prompts to enforce strict behavioral guidelines, like maintaining a professional tone, ensuring medical accuracy, or filtering out toxic language. The researchers tested whether OPCD could permanently bake these dense behavioral rules into the models so they would not have to be sent with every single user query. Their experiments show that OPCD successfully internalized these complex rules and massively boosted performance. When testing a 3-billion parameter Llama model on safety and toxicity classification, the base model scored 30.7%. After using OPCD to internalize the safety prompt, its accuracy spiked to 83.1%. On medical question answering, the same model improved from 59.4% to 76.3%.
One of the key challenges of fine-tuning models is catastrophic forgetting, where the model becomes too focused on the fine-tune task and worse at general tasks. The researchers tracked out-of-distribution performance to test for this tunnel vision. When they distilled strict safety rules into a model, they immediately tested its ability to answer unrelated medical questions. OPCD successfully maintained the model’s general medical knowledge, outperforming the old off-policy methods by approximately 4 percentage points. It specialized without losing its broader intelligence.
While OPCD is a powerful tool for internalizing static knowledge and complex rules, it does not replace all external context methods. “RAG is better when the required information is highly dynamic or involves a massive, frequently updated external database that cannot be compressed into model weights,” Ye said.
For enterprise teams evaluating their pipelines, adopting OPCD does not require overhauling existing systems or investing in specialized hardware. “OPCD can be integrated into existing workflows with very little friction,” Ye said. “Any team already running standard RLVR [Reinforcement Learning from Verifiable Rewards] pipelines can adopt OPCD without major architectural changes.”
In practice, the student model acts as the policy model performing rollouts, while the frozen teacher model serves as a reference providing logits. The hardware requirements are highly accessible. According to Ye, enterprise teams can reproduce the researchers’ experiments using about eight A100 GPUs.
The data requirements are similarly lightweight. For experiential knowledge distillation, developers only need around 30 seed examples to generate solution traces. Because the technique is applied to previously unoptimized environments, even a small amount of data yields the majority of the performance improvement. For system prompt distillation, existing optimized prompts and standard task datasets are sufficient.
The researchers built their own implementation on verl, an open-source RLVR codebase, proving that the technique fits cleanly within conventional reinforcement learning frameworks. They plan to release their implementation as open source following internal reviews.
Looking ahead, OPCD paves the way for genuinely self-improving models that continuously adapt to bespoke enterprise environments. Once deployed, a model can extract lessons from real-world interactions and use OPCD to progressively internalize those characteristics without requiring manual supervision or data annotation from model trainers.
“This represents a fundamental paradigm shift in model improvement: the core improvements to the model would move from training time to test time,” Ye said. “Using the model—and allowing it to gather experience—would become the primary driver of its advancement.”
When your token usage averages 8 billion a day, you have a massive scale problem.
This was the case at AT&T, and chief data officer Andy Markus and his team recognized that it simply wasn’t feasible (or economical) to push everything through large reasoning models.
So, when building out an internal Ask AT&T personal assistant, they reconstructed the orchestration layer. The result: a multi-agent stack built on LangChain, in which large language model “super agents” direct smaller, underlying “worker” agents that perform narrower, purpose-driven work.
This flexible orchestration layer has dramatically improved latency, speed and response times, Markus told VentureBeat. Most notably, his team has seen up to 90% cost savings.
“I believe the future of agentic AI is many, many, many small language models (SLMs),” he said. “We find small language models to be just about as accurate, if not as accurate, as a large language model on a given domain area.”
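Stripped of the LangChain specifics, the super-agent/worker pattern Markus describes is a routing layer: a large model classifies the request, then a small purpose-built worker handles it. The sketch below is purely illustrative — the function names and keyword-based classifier are hypothetical stand-ins, not AT&T’s actual code:

```python
def classify_intent(request: str) -> str:
    """Stand-in for the LLM 'super agent' that routes requests.
    A real system would use a language model here, not keywords."""
    if "sql" in request.lower() or "query" in request.lower():
        return "nl_to_sql"
    if "invoice" in request.lower() or "document" in request.lower():
        return "doc_processing"
    return "general"

WORKERS = {
    # Each worker would wrap a small language model (SLM) tuned for one
    # narrow domain; here they are plain functions for illustration.
    "nl_to_sql": lambda req: f"[sql-worker] handling: {req}",
    "doc_processing": lambda req: f"[doc-worker] handling: {req}",
    "general": lambda req: f"[general-worker] handling: {req}",
}

def handle(request: str) -> str:
    """The 'super agent' step: classify, then dispatch to a worker."""
    intent = classify_intent(request)
    return WORKERS[intent](request)

print(handle("Convert this question to a SQL query"))
```

The cost savings Markus cites come from the dispatch step: only the routing decision needs a large model, while the bulk of the token volume flows through cheaper, domain-specific workers.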
Most recently, Markus and his team used this re-architected stack along with Microsoft Azure to build and deploy Ask AT&T Workflows, a graphical drag-and-drop agent builder for employees to automate tasks.
The agents pull from a suite of proprietary AT&T tools that handle document processing, natural language-to-SQL conversion, and image analysis. “As the workflow is executed, it’s AT&T’s data that’s really driving the decisions,” Markus said. Rather than asking general questions, “we’re asking questions of our data, and we bring our data to bear to make sure it focuses on our information as it makes decisions.”
Still, a human always oversees the “chain reaction” of agents. All agent actions are logged, data is isolated throughout the process, and role-based access is enforced when agents pass workloads off to one another.
“Things do happen autonomously, but the human on the loop still provides a check and balance of the entire process,” Markus said.
AT&T doesn’t take a “build everything from scratch” mindset, Markus noted; it relies instead on models that are “interchangeable and selectable” and on “never rebuilding a commodity.” As functionality matures across the industry, the team will deprecate homegrown tools in favor of off-the-shelf options, he explained.
“Because in this space, things change every week, if we’re lucky, sometimes multiple times a week,” he said. “We need to be able to pilot, plug in and plug out different components.”
They run “really rigorous” evaluations of available options as well as their own; for instance, their Ask Data with Relational Knowledge Graph has topped the Spider 2.0 text-to-SQL accuracy leaderboard, and other tools have scored highly on the BIRD SQL benchmark.
In the case of homegrown agentic tools, his team uses LangChain as a core framework, grounds models with standard retrieval-augmented generation (RAG) and other in-house algorithms, and partners closely with Microsoft, using the tech giant’s search functionality for its vector store.
Ultimately, though, it’s important not to fuse agentic AI or other advanced tools into everything for the sake of it, Markus advised. “Sometimes we overcomplicate things,” he said. “Sometimes I’ve seen a solution over-engineered.”
Instead, builders should ask themselves whether a given tool actually needs to be agentic. That means asking questions like: What accuracy level could be achieved with a simpler, single-turn generative solution? How could the problem be broken into smaller pieces, each of which could be delivered “way more accurately,” as Markus put it?
Accuracy, cost and tool responsiveness should be core principles. “Even as the solutions have gotten more complicated, those three pretty basic principles still give us a lot of direction,” he said.
Ask AT&T Workflows has been rolled out to 100,000-plus employees. More than half say they use it every day, and active adopters report productivity gains as high as 90%, Markus said.
“We’re looking at, are they using the system repeatedly? Because stickiness is a good indicator of success,” he said.
The agent builder offers “two journeys” for employees. One is pro-code, where users can program Python behind the scenes, dictating rules for how agents should work. The other is no-code, featuring a drag-and-drop visual interface for a “pretty light user experience,” Markus said.
Interestingly, even proficient users are gravitating toward the latter option. At a recent hackathon geared to a technical audience, participants were given a choice between the two, and more than half chose the no-code option. “This was a surprise to us, because these people were all very competent in the programming aspect,” Markus said.
Employees are using agents across a variety of functions; for instance, a network engineer may build a series of them to address alerts and reconnect customers when they lose connectivity. In this scenario, one agent can correlate telemetry to identify the network issue and its location, pull change logs and check for known issues. Then, it can open a trouble ticket.
Another agent could then come up with ways to solve the issue and even write new code to patch it. Once the problem is resolved, a third agent can then write up a summary with preventative measures for the future.
“The [human] engineer would watch over all of it, making sure the agents are performing as expected and taking the right actions,” Markus said.
That same engineering discipline — breaking work into smaller, purpose-built pieces — is now reshaping how AT&T writes code itself, through what Markus calls “AI-fueled coding.”
He compared the process to RAG: devs use agile coding methods in an integrated development environment (IDE) along with “function-specific” build archetypes that dictate how code should interact.
The output is not loose code; the code is “very close to production grade,” and could reach that quality in one turn. “We’ve all worked with vibe coding, where we have an agentic kind of code editor,” Markus noted. But AI-fueled coding “eliminates a lot of the back and forth iterations that you might see in vibe coding.”
He sees this coding technique as “tangibly redefining” the software development cycle, ultimately shortening development timelines and increasing output of production-grade code. Non-technical teams can also get in on the action, using plain language prompts to build software prototypes.
His team, for instance, has used the technique to build an internal curated data product in 20 minutes; without AI, building it would have taken six weeks. “We develop software with it, modify software with it, do data science with it, do data analytics with it, do data engineering with it,” Markus said. “So it’s a game changer.”
ServiceNow is handling 90% of its own employee IT requests autonomously, resolving cases 99% faster than human agents. On Thursday, it announced the products and technology it plans to use to do the same for everyone else.
Organizations have spent three years running pilots that stall when AI gets to the execution layer. The agent can identify the problem and recommend a fix, then hand it back to a human because it lacks the permissions to finish the job or because no one trusts it to act autonomously inside a governed environment.
The gap most teams are hitting isn’t capability. It’s governance and workflow continuity.
ServiceNow’s answer is a new framework called Autonomous Workforce; a new employee-facing product called EmployeeWorks built on its December acquisition of Moveworks; and an underlying architectural approach it calls “role automation.”
ServiceNow has been building toward this for two decades. The platform started as a ticketing system, evolved into a workflow automation engine, and spent the last two years layering AI onto that foundation through its Now Assist product.
What’s different is that the new approach stops treating AI as a feature sitting on top of workflows and starts treating it as a worker operating inside them. That shift, from AI that assists to AI that executes, is where the broader enterprise market is headed. ServiceNow is making a specific architectural bet about how to get there.
The announcement has three parts: ServiceNow EmployeeWorks lets employees describe a problem in plain language and have it fixed without filing a ticket; Autonomous Workforce executes work end to end; and role automation is the architectural layer that governs how those specialists operate inside existing enterprise permissions.
Most enterprise AI assistants, including Microsoft Copilot and Google Gemini, require employees to know which tool handles which problem. Moveworks, which had 5.5 million enterprise users before the December acquisition, was built around a single entry point that routes across that ambiguity automatically.
Bhavin Shah, founder of Moveworks and now SVP at ServiceNow following the acquisition, framed the problem directly in a briefing with press and analysts.
“Over the last two years, organizations have raced to adopt AI, but in many cases that rush has created fragmented tools, disconnected AI experiences and employees bouncing between systems just to get simple things done,” he said.
ServiceNow is proposing a new architectural layer it calls role automation, and it differs from the agents most enterprises are already running.
Conventional AI agents are task-oriented: they’re given a goal, they reason toward it and in doing so they figure out what they’re allowed to do at runtime. That creates problems in enterprise environments where governance, audit trails and permission boundaries aren’t optional.
With role automation, an AI specialist does not reason its way into permissions. It inherits them. The same access control framework, CMDB (configuration management database) context, SLA (service level agreement) logic and entitlement rules that govern human workers on the ServiceNow platform govern the AI specialist from the moment it is deployed. It cannot exceed its defined scope. It cannot self-escalate privileges based on what it learns mid-task.
The company draws a three-tier distinction: task agents handle individual automation steps, agentic workflows mix deterministic and probabilistic execution, and role automation sits above both as a fully virtualized employee role with defined responsibilities and pre-inherited governance.
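The governance distinction can be made concrete with a toy sketch: a specialist whose permission set is fixed at deploy time, which logs every action and escalates rather than improvises when asked to exceed its scope. The class and method names below are hypothetical illustrations, not ServiceNow’s API:

```python
class ScopeViolation(Exception):
    """Raised when a specialist is asked to act outside its role."""

class RoleSpecialist:
    def __init__(self, role: str, permissions: frozenset):
        self.role = role
        # frozenset: scope is fixed at deployment and immutable -- the
        # agent cannot self-escalate privileges mid-task.
        self.permissions = permissions
        self.audit_log = []  # every decision is recorded

    def act(self, action: str, target: str) -> str:
        if action not in self.permissions:
            # Outside defined scope: escalate to a human, don't improvise.
            self.audit_log.append(("escalated", action, target))
            raise ScopeViolation(f"{self.role} cannot '{action}'; escalating")
        self.audit_log.append(("executed", action, target))
        return f"{action} on {target}: done"

# An L1 service-desk specialist with a pre-inherited, bounded scope:
desk = RoleSpecialist(
    "L1 Service Desk",
    frozenset({"reset_password", "provision_software"}),
)
desk.act("reset_password", "user42")       # within scope: executes
try:
    desk.act("delete_database", "prod")    # outside scope: escalates
except ScopeViolation:
    pass
```

The contrast with a conventional task agent is that the check happens before any reasoning: the permission boundary is data the specialist inherits, not a conclusion it reaches at runtime.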
The first product built on this architecture, the Level 1 Service Desk AI Specialist, handles common IT requests end to end — password resets, software access provisioning and network troubleshooting — documenting each resolution and escalating to a human agent only when it hits something outside its defined scope.
Alan Rosa has seen what happens when AI governance fails in healthcare. As CISO and SVP of infrastructure and operations at CVS Health, he manages AI deployment across 300,000 employees where compliance isn’t optional.
Speaking at the same briefing, his framework for scaling AI maps directly onto what ServiceNow is claiming architecturally. CVS Health was already a customer of both ServiceNow and Moveworks before the December acquisition. Rosa said the combination of the two platforms is encouraging and that the potential is “coming to life,” though CVS Health has not committed publicly to deploying Autonomous Workforce.
“Boring is beautiful,” Rosa said. “Predictable. Stable. You have to start with responsible, explainable AI. No bias, no hallucinations, clear guardrails. Everyone understands the rules.”
On the temptation to chase the newest AI capabilities before governance is in place, he was direct: “Don’t chase butterflies. Focus on gritty, unsexy, operational use cases. The ones with real ROI that have an impact on people’s lives.”
Rosa’s approach treats AI as a continuously evolving set of capabilities requiring dynamic rather than static testing. CVS Health runs every AI use case through clinical, legal, privacy and security review before it touches production.
“Static review doesn’t cut it when AI is learning and adapting,” he said. “Wash, rinse, repeat.”
Rosa’s framework requires governance to be embedded in the deployment architecture from the start, not retrofitted after a problem surfaces. That is precisely the claim ServiceNow is making about role automation. AI specialists that inherit existing enterprise permissions and workflow logic are structurally less likely to break governance boundaries than agents that determine their own scope at runtime.
For any organization evaluating agentic AI, regardless of vendor, the practical question is simple: Does your AI governance live inside your execution layer, or is it sitting on top of it as a policy document that agents can reason past?
That is what ServiceNow is trying to solve with Autonomous Workforce and EmployeeWorks, baking governance and workflow context directly into the agentic layer rather than bolting it on afterward. For practitioners, the starting point is governance architecture, not capability. Before deploying any agentic AI, map where your permissions, workflow logic and audit requirements actually live. If that foundation isn’t in place, no agent framework will hold at enterprise scale.
“Scale and trust go together,” Rosa said. “If you lose trust, you lose the right to scale.”
Perplexity, the AI-powered search company valued at $20 billion, on Wednesday launched what it calls the most ambitious product in its three-year history: a multi-model agent orchestration platform called Computer that coordinates 19 different AI models to complete complex, long-running workflows entirely in the background.
The product, currently available only to Perplexity Max subscribers at $200 per month, is the company’s clearest articulation yet of a thesis it has been refining for more than a year: that AI models are not converging into general-purpose commodities but are instead specializing — and that the company best positioned to win the next era of AI is the one that can orchestrate all of them together.
“What has Perplexity been up to last two months? We’ve silently been working on the next big thing,” CEO Aravind Srinivas wrote on X, announcing that “Computer unifies every current capability of AI into a single system.” Srinivas said the system treats models as interchangeable tools rather than core products. “It’s multi-model by design,” he wrote. “When models specialise, they just become tools similar to the file system, CLI tools, connectors, browser, search.”
Computer arrives at a moment when the AI industry is grappling with a fundamental question: now that foundation models have become extraordinarily capable, who captures the value? The model makers — OpenAI, Anthropic, Google — or the companies that sit above them and turn raw intelligence into reliable, accurate products?
Perplexity is making a $20 billion bet on the latter.
At its core, Computer functions as what Perplexity describes as “a general-purpose digital worker” — a system that can accept a high-level objective from a user, decompose it into subtasks, and delegate those subtasks to whichever AI model is best suited for each one. The Verge described it as existing “somewhere between OpenClaw and Claude Cowork,” referring to the viral open-source autonomous agent and Anthropic’s enterprise collaboration tool, respectively.
The platform’s central reasoning engine runs on Anthropic’s Claude Opus 4.6, which handles orchestration logic and coding tasks. Google’s Gemini powers deep research queries. Google’s Nano Banana generates images, and Veo 3.1 handles video. xAI’s Grok is deployed for lightweight, speed-sensitive tasks. OpenAI’s GPT-5.2 manages long-context recall and expansive web search. In total, the system coordinates 19 models on the backend, according to the company.
That model roster is not fixed. Perplexity says new models can be added as they demonstrate strength in specific domains, and the existing lineup will shift as models evolve. Users can also step into the orchestrator role themselves, manually assigning subtasks to particular models if they prefer.
What makes Computer distinct from existing agent tools is its combination of scope and accessibility. A user can describe a desired outcome — say, “Plan a weeklong trip to Japan, find flights under $1,200, and build a full itinerary with restaurant reservations” — and Computer will autonomously break that project into components, assign each to the right model, and work on it in the background. Perplexity says the system can operate quietly for extended periods, checking in with the user only when it genuinely needs input.
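Conceptually, that loop is a decompose-and-route step. The sketch below is illustrative only — the task-to-model mapping mirrors the roster reported above, but the code, function names, and hard-coded decomposition are hypothetical, not Perplexity’s implementation:

```python
# Task types mapped to the specialist models reported for Computer.
MODEL_FOR_TASK = {
    "orchestration": "claude-opus-4.6",
    "coding": "claude-opus-4.6",
    "deep_research": "gemini",
    "image": "nano-banana",
    "video": "veo-3.1",
    "fast_lookup": "grok",
    "long_context_search": "gpt-5.2",
}

def decompose(goal: str) -> list:
    """Stand-in for the orchestrator model breaking a high-level goal
    into (task_type, subtask) pairs; a real system would plan these
    with the central reasoning model, not a fixed template."""
    return [
        ("deep_research", f"research options for: {goal}"),
        ("fast_lookup", f"check prices for: {goal}"),
        ("long_context_search", f"compile sources for: {goal}"),
    ]

def run(goal: str) -> list:
    """Route each subtask to the model best suited for it."""
    return [
        f"{MODEL_FOR_TASK[task]} <- {subtask}"
        for task, subtask in decompose(goal)
    ]

for step in run("weeklong trip to Japan under $1,200"):
    print(step)
```

The user-override feature described above maps onto the same structure: letting the user edit the `(task_type, subtask)` plan directly instead of accepting the orchestrator’s decomposition.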
The intellectual foundation of Computer rests on data that Perplexity has been collecting across its enterprise customer base — data that, according to the company, no other AI company has access to at the same scale.
At a recent press briefing that VentureBeat attended with other reporters in San Francisco, Perplexity executives shared enterprise usage statistics that illustrated a dramatic shift in how businesses use AI models. In January 2025, more than 90 percent of enterprise tasks on the Perplexity platform were spread across just two models. By December 2025, no single model commanded more than 25 percent of usage across businesses and task types.
That shift, executives said, was driven partly by increasingly intelligent model routing on Perplexity’s side, and partly by a simple reality: models are getting better at different things, not the same things. A new frontier model emerged on average every 17.5 days in 2025, and each one brought distinct strengths rather than uniform improvement.
Claude, for instance, has emerged as the model of choice for software engineering tasks — a reputation so strong that even OpenClaw, the viral autonomous agent created by Austrian programmer Peter Steinberger (who was subsequently hired by OpenAI), was originally built on Claude’s code capabilities. But Claude’s strengths in coding do not translate to writing or creative generation, where Gemini tends to outperform. And in long-context retrieval and broad web search, GPT-5.2 holds advantages.
“What we’ve learned in this time is that they are not commoditizing. They’re specializing,” a senior Perplexity executive said at the briefing, characterizing Claude Opus 4.6 as “a terrible writer” while noting its coding prowess, and adding: “Everybody has job security on that one.”
This specialization dynamic creates what Perplexity sees as a structural advantage. A marketing team using Claude, executives argued, will generally produce worse results than one using Gemini. An engineering team using Gemini will underperform one using Claude. No company operates with only one type of team — and no single model can serve all of them equally well.
Computer’s launch arrives in the immediate wake of OpenClaw, the open-source autonomous agent that went viral earlier this month and prompted OpenAI to hire its creator. OpenClaw captured the imagination of the AI community by demonstrating what a fully autonomous agent could accomplish when given broad access to a user’s entire digital ecosystem — files, email, messaging apps, API keys, and more.
But it also demonstrated the risks. In a widely shared incident this week, Meta AI security researcher Summer Yue posted screenshots on X of her frantic attempts to stop OpenClaw from deleting her entire email inbox — a process the agent had initiated and was refusing to halt. “I had to RUN to my Mac Mini like I was diffusing a bomb,” Yue wrote.
Perplexity has been vocal about why Computer runs entirely in the cloud rather than accessing a user’s local machine — an approach taken by rivals like Anthropic’s Claude and OpenAI’s Operator.
The company argues that local access creates unnecessary risk, comparing it to malware in how easily it can damage data or expose sensitive information. Computer instead operates inside what Perplexity describes as a safe and secure development sandbox, meaning security failures are contained and cannot spread to a user’s primary network or device. The company also said it has run thousands of tasks internally using Computer, from publishing web copy to building apps.
The distinction extends to accessibility. Where OpenClaw requires terminal access, API key configuration, and a dedicated machine (typically a Mac mini), Computer is designed to be invoked from a phone, a Slack message, or the Perplexity app.
At the press briefing, executives elaborated on the philosophy, positioning Computer’s browser agent capabilities — built on Perplexity’s Comet browser technology — as central to the product. One executive noted that Perplexity’s browser agent usage numbers are three to five times higher than ChatGPT’s agent numbers published by The Information in January, despite Perplexity’s much smaller user base.
Perplexity’s product ambitions are backed by a business that, by the company’s own metrics, is growing faster than its user base — and executives say the company has barely begun to focus on monetization.
At the press briefing, executives disclosed that Perplexity grew users by 3.7x in 2025 and revenue by 4.7x, meaning the company is extracting more value from its existing users over time. Consumer subscriptions remain the largest revenue component, but the enterprise business is ramping with what executives acknowledged is a remarkably lean operation.
“We only have five people on our enterprise sales team,” one executive said, before adding that the company’s revenue per employee working on deals may be unmatched in the industry. Another executive noted that 92 percent of the Fortune 500 have Perplexity usage — though that figure encompasses employees signing up with personal accounts and work email addresses for the consumer version, not necessarily formal enterprise contracts.
A common enterprise sales conversation, executives said, starts with: “Did you know that there’s already 3,000 of your employees using Perplexity, and they’re using the consumer version that doesn’t adhere to all of your security policies?”
Notably, Perplexity is not pursuing advertising revenue, even as competitors like OpenAI move toward ad-supported models. Executives said advertising is fundamentally misaligned with the company’s accuracy mission. “The challenge with ads is, you know, a user will just start doubting everything,” one executive said. The company confirmed it has taken no economics on its shopping integrations and expressed doubt that any shopping-based monetization would materialize this year.
On the question of an IPO, Srinivas indicated the company has “very good properties of a company that can go public” given its low capital expenditure and healthy margins, but stopped short of committing to a timeline. Another executive warned that “a lot of IPO talk is hype” and that “if you over promise and under deliver the market punches you severely.”
TestingCatalog also reported this week that a new “Usage and Credits” settings area has appeared in Perplexity’s development builds, which would let users purchase additional credits to extend usage — potentially easing backlash from subscribers who saw their Deep Research query limits cut from roughly 500 per day to as few as 20 per month between late 2025 and early 2026.
Perhaps the least-discussed but most strategically significant element of Perplexity’s story is its search API business — an infrastructure play that positions the company not just as a consumer product but as a foundational layer for the broader AI ecosystem.
At the press briefing, executives revealed that Perplexity launched its search API approximately four months ago and already has four of the “Mag Seven” — the seven largest technology companies by market capitalization — using it in production at significant scale. “You guys cover the Mag Seven, you know that they don’t turn on a feature in production unless they’ve run rigorous evals and compared it,” one executive told reporters.
This disclosure suggests that the world’s largest technology companies have evaluated Perplexity’s search index against alternatives and concluded it is better optimized for AI-native use cases — a fundamentally different optimization target than Google’s traditional index, which was designed for humans scanning lists of links.
“Everything in our index is optimized, not for a human to see 10 blue links,” one executive explained. “It’s for an AI to be able to take those snippets and consume it in this context window and then reason through it.”
The company also confirmed it has fully independent search infrastructure, no longer relying on any third-party APIs from Google or Bing for its index — a significant departure from its earlier years.
For Chinese open-source models, which Perplexity uses in its orchestration stack, the company runs all inference from its own U.S. data centers, post-training the models for accuracy, removing what executives described as “state-infused propaganda,” and building custom inference kernels. The company open-sourced its methodology for depropagandizing Chinese models for others to use as well.
The search API creates a powerful data flywheel, executives argued: Perplexity can observe which snippets its search ranker surfaces for a given query, then track which of those snippets the LLM actually uses in its final output. That feedback loop makes the next query on a similar topic smarter — an advantage that pure API search businesses like Exa cannot replicate because they lack the consumer product generating user queries and feedback.
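The flywheel can be illustrated with a minimal sketch: log which sources were surfaced versus actually cited by the LLM, then derive a per-source usage rate that a ranker could fold into future scoring. This is a toy illustration of the feedback loop described, not Perplexity’s ranker:

```python
from collections import defaultdict

surfaced = defaultdict(int)  # times a source's snippet was shown to the LLM
used = defaultdict(int)      # times the LLM cited it in the final answer

def record(query_result: dict):
    """Log one query's outcome: what the ranker surfaced vs. what the
    model actually consumed in its answer."""
    for src in query_result["surfaced"]:
        surfaced[src] += 1
    for src in query_result["used"]:
        used[src] += 1

def usage_rate(src: str) -> float:
    """The feedback signal: of the times this source was surfaced,
    how often did the model actually use it?"""
    return used[src] / surfaced[src] if surfaced[src] else 0.0

record({"surfaced": ["a.com", "b.com", "c.com"], "used": ["a.com"]})
record({"surfaced": ["a.com", "b.com"], "used": ["a.com", "b.com"]})
# a.com was used both times it was surfaced; c.com never was -- a
# ranker can boost the former and demote the latter on similar queries.
```

The structural point the executives made is visible in the sketch: the `used` signal only exists if you own both the ranker and the consumer product generating queries, which a pure API search business cannot observe.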
Perplexity’s ambitions are not without complications. The company faces active lawsuits from multiple publishers, and the legal landscape grew more contentious this week.
As Business Insider’s Melia Russell reported, Perplexity filed a motion on February 24 in its ongoing legal battle with Dow Jones (publisher of The Wall Street Journal) and the New York Post, alleging that the publishers “cherry-picked” responses from Perplexity’s search engine to support their copyright claims. The company said it identified hundreds of prompts the publishers submitted that “were clear attempts to induce copyright-infringing answers,” including one instance where a user allegedly hit the “retry” button more than 50 times.
At the press briefing, Perplexity executives framed the broader copyright debate in historical terms, noting that waves of lawsuits have accompanied every major technology shift since radio. They expressed confidence that AI companies will ultimately prevail, particularly on the question of whether underlying knowledge — as distinct from unique creative expression — can be freely accessed by AI systems. “Countries have copyright law for one reason: to promote innovation,” one executive said, noting that the law protects unique expression while keeping the underlying knowledge open.
On user agents specifically, executives argued that a user’s AI agent is legally and technologically an extension of the user, not an independent actor. In the Amazon lawsuit, which challenges Perplexity’s ability to act as a purchasing agent on behalf of users, one executive offered a pointed analogy: “What Amazon’s claiming is that you shouldn’t be able to have your personal shopper be employed by you. It needs to be employed by them. They want you to use Rufus.”
Executives also clarified the company’s approach to citations, noting that citing a source like The New York Times (which is currently suing the company) does not necessarily mean Perplexity crawled that publication directly. “We can get the summary of that somewhere else, but we cite, we always try to cite that original source,” one executive said. “So drive that traffic to the New York Times if somebody clicks instead of driving them to a summary.”
Computer’s launch crystallizes a tension that has been building in the AI industry for months. The major model makers — OpenAI, Anthropic, Google — have been racing to build end-to-end products that keep users within their ecosystems. OpenAI’s Codex and ChatGPT, Anthropic’s Claude Code and Cowork, Google’s Gemini — all assume that one model family can handle the full range of user needs.
Perplexity is making the opposite bet: that the future belongs to the orchestration layer, not the model layer. It is a bet with historical parallels. In the early days of cloud computing, the companies that built the best abstraction layers above commodity infrastructure — not the infrastructure providers themselves — often captured outsized value. Perplexity is positioning itself as that abstraction layer for AI.
The risk, of course, is that model makers could restrict API access or degrade service to platform competitors. Srinivas has said he isn’t worried, noting that he received congratulations from Anthropic and Google after Computer’s launch and that model makers benefit when their systems are part of broader workflows. But the AI industry’s history of platform dynamics suggests this détente may not last forever.
For enterprise technology leaders evaluating their AI strategies, Computer raises a practical question: should organizations standardize on a single model provider’s ecosystem, accepting its limitations in exchange for simplicity? Or should they invest in multi-model orchestration, gaining access to the best capabilities across providers at the cost of additional complexity?
Perplexity is betting that as models continue to specialize and the gap between their respective strengths widens, the answer will become obvious. The company’s enterprise usage data — showing a market that went from two-model dominance to no-model dominance in just 12 months — suggests the shift is already underway.
Computer is currently available to Perplexity Max subscribers, with a rollout to Pro and Enterprise users planned in the coming weeks. The company has also announced a developer event on March 11, where it plans to share more details about its search API, ranking embeddings, and the infrastructure powering its orchestration stack.
For years, the “last mile” of digital transformation has been littered with forgotten PDFs and ignored training manuals.
Organizations spend millions on sophisticated software like SAP or Salesforce, only for employees to struggle with basic navigation. Now, as the era of agentic AI arrives, companies face a double-edged sword: they must teach human employees to collaborate with AI, while simultaneously teaching AI agents to navigate the labyrinthine interfaces of the modern enterprise.
One idea that seems to be gaining momentum among AI-forward businesses: using screen recordings and tutorial walkthroughs of someone performing an enterprise task — be it creating a new ticket or processing an invoice — and training AI to replicate the flow based on the screen capture. Just this week, a startup called Standard Intelligence went viral on X showing an early demo of an open-ended version of this for the physical and digital world.
But there are already players tackling this problem for the enterprise head-on. Case in point: Guidde, an Israeli startup born during the video-centric years of the COVID-19 pandemic, today announced an oversubscribed $50 million Series B funding round led by PSG Equity to address this exact knowledge infrastructure crisis.
Instead of feeding an agent a static PDF manual, Guidde provides high-fidelity “Video Ground Truth”—a rich stream of data captured from real human experts as they navigate complex software.
The investment signals a shift in how the tech industry views documentation—not as a static byproduct of work, but as the critical telemetry needed to train the next generation of autonomous digital agents.
At its core, Guidde is an AI Digital Adoption Platform (ADAP). However, its technological breakthrough lies in what happens behind the scenes during a recording.
Guidde isn’t just recording pixels; it is capturing every click, scroll, and latent interaction with the HTML page—the subtle pauses, the specific scroll depths, and the corrections a human makes when a system lags. This telemetry transforms raw video into a Vision-Language-Action (VLA) training set.
Meanwhile, the platform’s Magic Redaction automatically obscures sensitive data like passwords or credit card numbers during capture, ensuring materials remain secure and HIPAA-aligned.
“Every time you click a button, you drag-and-drop, you scroll, you type, we gather the interaction… all of it, we do cleanse it—there’s no private information,” explained Guidde co-founder and CEO Yoav Einav in an exclusive interview with VentureBeat.
Under the hood, the platform captures the underlying metadata and DOM (Document Object Model) changes synchronized with the video frames. The differentiator is the telemetry hidden beneath the surface.
This rich metadata creates a “digital world model” of enterprise software. And because each enterprise uses its own unique mix of apps and processes, Guidde is creating a data moat that allows enterprise agents to reason through legacy UIs with the same spatial awareness as a human, ensuring that automation actually works in a production environment rather than just a lab demo.
For a human, it’s a tutorial. For an AI agent, it is a high-fidelity map of the interface. This allows agents to “see” and reason through complex UIs the way humans do, solving the “last mile” of automation where agents previously failed due to lack of specific enterprise and in-situ usage context.
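To make the idea concrete, here is a minimal sketch of what one such telemetry record might look like, pairing a video frame with the DOM-level context of a user action and converting it into a vision-language-action training triple. Every field and function name here is illustrative; Guidde has not published its actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class InteractionEvent:
    """Hypothetical capture record: one user action plus its DOM context."""
    timestamp_ms: int          # position within the screen recording
    action: str                # "click", "scroll", "type", "drag"
    x: int                     # screen coordinates of the action
    y: int
    dom_selector: str          # CSS path of the element acted on
    dom_diff: dict = field(default_factory=dict)  # DOM changes after the action
    scroll_depth: float = 0.0  # fraction of the page scrolled at action time
    latency_ms: int = 0        # pause before the action (human hesitation)

def to_vla_sample(event: InteractionEvent, frame_id: str) -> dict:
    """Turn raw telemetry into a (vision, language, action) training triple."""
    return {
        "vision": frame_id,  # reference to the synchronized video frame
        "language": f"{event.action} on {event.dom_selector}",
        "action": {"type": event.action, "x": event.x, "y": event.y},
    }

event = InteractionEvent(4200, "click", 812, 344, "button#submit-invoice")
sample = to_vla_sample(event, "frame_000105")
print(sample["language"])  # click on button#submit-invoice
```

The key point the sketch illustrates is that the video frame alone is ambiguous; it is the synchronized selector, coordinates, and DOM diff that let an agent ground "what happened" in "where and on what."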
In a sense, Guidde is building the equivalent of a self-driving car for computer usage: a Waymo for enterprise software.
The platform has evolved into three distinct products designed to scale with an organization’s maturity:
Guidde Create: The engine for subject matter experts to turn workflows into documentation in minutes.
Guidde Broadcast: A personalized recommendation engine—often compared to Netflix—that delivers answers inside the tools people actually use. It knows who the user is and what department they are in to surface relevant content exactly when needed.
Guidde Discover: The newly launched “agentic” pillar. Like Waze mapping roads by observing drivers, Discover maps software routes by tracking how employees work. It understands the workflow, creates the content, and updates it automatically when the UI changes.
The most non-obvious aspect of Guidde’s growth is its dual-purpose mission. “We’re the only platform that trains both humans and agents,” Einav stated.
As companies roll out AI tools like Microsoft 365 Copilot or ServiceNow agents, they hit a proficiency gap. One of Guidde’s largest customers revealed they were paying over $1 million a year for a sophisticated AI tool, yet “nobody knows how to use them because they did like a 30-minute training session, and then that’s it.” Guidde closes this gap by providing “bite-sized” video tutorials in the flow of work.
Simultaneously, these videos train the AI agents themselves. Foundation models like Gemini or GPT-4 often hallucinate when tasked with specific enterprise workflows because they weren’t trained on the highly specific, internal “vanilla workflows” found in private enterprise systems. Guidde provides the “starting point,” the “metadata,” and the “x, y coordinates of the button” that an agent needs to complete an action without getting stuck.
To maintain this level of accuracy, Guidde employs a multimodal infrastructure. The system doesn’t rely on a single model; instead, it uses a “fleet” of models that evaluate one another.
Google Gemini: Generally used for visual tasks like analyzing PDFs or PowerPoints.
Anthropic Claude: Leveraged for writing the storyline and narrative scripts.
Feedback Loops: When a user edits a video, that data is fed back into the model to prevent the same mistakes from occurring in future captures.
This approach allows Guidde to replace a legacy stack of six or seven disconnected tools—Loom for capture, Adobe Premiere for editing, 11Labs for text-to-speech, and Synthesia for avatars—with a single, AI-native platform. “We basically pack everything for you,” Einav says, “and automate the entire process based on your brand guidelines.”
The genesis of Guidde lies in a frustration familiar to any product leader. Before founding the company, Einav and co-founder Dan Sahar spent years mastering video traffic at Qwilt, a company they started in 2010 to analyze how people watched Netflix and Disney+.
When COVID-19 hit, they saw a massive opportunity to apply that video expertise to the workplace. They observed that short video explainers could increase free-to-paid account conversions by 30%, but the friction of creating them was unsustainable.
In an interview, Einav recalled the “tedious work” of the old world: “My team in Israel were creating the content, someone in the US with a US accent was doing the narration, someone in the marketing team would write the script… and someone in the enablement team would do the edit.” This fragmented workflow meant a single video took two to three weeks to produce. “And then two weeks later, the product changes, and you need to redo it from scratch,” Einav added.
Guidde was built to collapse this cycle into seconds. By automating the “Magic Capture” of a workflow, the platform generates a structured narrative script and professional AI voiceover instantly. This removes the editing bottleneck, transforming subject matter experts into “training powerhouses.”
Guidde’s pricing structure reflects its transition from a utility to a core piece of enterprise infrastructure:
Free: $0 (Up to 25 videos, web-app support).
Pro: $18/creator/month (Unlimited videos, brand kits).
Business: $39/creator/month (Unlimited text-to-voice, analytics).
Enterprise: Custom pricing (Multi-language translation, SSO, Magic Redaction).
The platform’s impact is already visible in the numbers: a 41% reduction in video creation time and 34% fewer inbound support tickets.
For customers like Emerson, this translates to 40–60% quicker guide creation. Support teams, in particular, are finding they can offload 80% of their ticket volume with agents—but only if those agents have the content to be useful.
“The agent without the content is useless,” Einav warns, noting that most enterprise documentation is either years out of date or entirely undocumented.
Guidde already claims 4,500 enterprise customers and seeks to expand this number with its new round of funding. Support and operations leaders have been vocal about the platform’s ease of use. Christopher Cummings, VP of Client Experience at DocNetwork, highlighted its ability to provide “quick, personalized video responses to customer questions.”
Meanwhile, Wren Cotrone, a Director of Customer Support, noted that “Once you set the branding the way you want, you can really zoom through this stuff.”
Ronen Nir, Managing Director at PSG, summarized the investment thesis: “Guidde is solving one of the biggest blockers to successful AI adoption: the knowledge infrastructure.”
The paradigm shift from text-only LLMs to agentic video intelligence is the defining trend of 2026. Guidde’s Series B signals that the “ground truth” for enterprise agents will come from raw video observation, not static documentation.
By capturing how work gets done across tens of millions of workflows, Guidde is building a dataset that few others possess.
As Einav put it: “It starts with humans in the loop, and over time moves toward full autonomy.” For the modern enterprise, the map is no longer a static document—it’s a living, breathing video intelligence layer that guides both the workforce and the agents that support them.
Gong, the revenue intelligence company that has spent a decade turning recorded sales calls into data, today launched what it calls Mission Andromeda — its most ambitious platform release to date, bundling a new AI-powered coaching product, a sales-focused chatbot, unified account management tools, and open interoperability with rival AI systems through the Model Context Protocol.
The release arrives at a pivotal moment. The revenue technology market is consolidating at a pace that would have been unthinkable two years ago, and Gong — still a private company with roughly $300 million in annual recurring revenue — finds itself at the center of a category that Gartner only formally defined three months ago. Mission Andromeda is Gong’s answer to a basic question facing every enterprise AI vendor in 2026: Can you move beyond surfacing insights and actually change how people work?
“The whole show, Andromeda, is basically a collection of very significant capabilities that take us a huge step forward,” Eilon Reshef, Gong’s co-founder and chief product officer, told VentureBeat in an interview ahead of the launch. He described it as an effort to make revenue teams “more productive as individuals” and to give leaders “better decisions” — positioning the release not as a feature dump, but as an operating system upgrade.
Mission Andromeda contains four main components, each targeting a different layer of the sales workflow.
The headliner is Gong Enable, a brand-new product with its own pricing tier — Reshef described it as “in the tens of dollars per seat per month” — that attacks what the company sees as a gaping hole in most sales organizations: the disconnect between training and performance. Highspot and Seismic announced their intent to merge in February 2026, creating a combined enablement giant, and Gong is now moving directly onto their turf.
Gong Enable has three pieces. The first, AI Call Reviewer, analyzes completed customer calls and grades reps based on their organization’s own methodology. When asked whether this operates in real time, Reshef was direct: “For that particular agent, it’s post-call, because obviously you want to grade the whole call as a whole — maybe you didn’t do anything in minute one, minute 30.” The second piece, AI Trainer, lets reps practice high-stakes conversations — pricing objections, renewal risk scenarios — against AI-generated simulations built from the company’s own winning call patterns. The third, Initiative Tracking, links coaching programs to revenue metrics so leaders can see whether new behaviors actually show up in live deals.
Beyond Enable, the launch includes Gong Assistant, a conversational AI chatbot purpose-built for revenue teams that lets users ask questions about customer calls inside the platform. The release also introduces Account Console and Account Boards, which unify customer activity, risk signals, and next steps into a single view for sales and post-sales teams. And rounding out the package is built-in support for the Model Context Protocol, the open standard originally developed by Anthropic, enabling Gong to exchange data with AI systems from Microsoft, Salesforce, HubSpot, and others.
In a market where every company wants to claim proprietary AI supremacy, Reshef described a notably pragmatic approach to the models powering the new features. Gong uses both internal models and foundation models from external providers, he said, noting that “four out of the five leading AI companies, LLM, are basically Gong customers.”
The company picks models task by task. “Based on the product or task at hand, we pick the right model,” he said. “We would sometimes swap in and out a model if we feel it’s best for our customers and they get more and more power.” Reshef drew a clear line between what needs a large language model and what does not: “Our revenue prediction models are not using LLMs, but kind of the core interaction chatbots — of course, you’re going to use the foundation model.”
This approach contrasts with competitors that have hitched their wagon to a single AI provider. It also reflects a philosophical choice: Gong’s real moat, Reshef suggested, is not the models themselves but the data underneath — what the company calls the Revenue Graph, its proprietary layer that captures phone calls, Zoom meetings, emails, text messages, WhatsApp conversations, and more, stitching them together into a connected intelligence layer.
Storing and analyzing every customer conversation a sales team has raises obvious questions about privacy and data governance. Reshef was eager to address them head-on.
“We’ve been around the block for a long while — a little bit over a decade — with AI first,” he said. “Over the years, we’ve developed exactly those capabilities that are the most boring pieces of AI, which is: how do you collect the right data? How do you manage it? How do you manage permissions about it, retention policies, right to be forgotten?”
On the sensitive question of whether Gong trains its AI on customer data across accounts, Reshef drew a firm boundary. Training, he explained, happens per customer: “The majority of the training happens based on each customer’s data.” He pointed to large accounts like Cisco, which he said has 20,000 Gong users — enough data to train the AI Trainer from within their own environment. “AI Trainer can go mine what’s working in their environment. It might not work in their competitor’s environment — maybe their benefits are different, their objections are different.”
Cross-customer training, he said, happens “only in very, very rare cases, very safe based — like transcription. But we don’t do it for business-specific processes.”
Gong’s support for Model Context Protocol is perhaps the most strategically significant piece of the launch. The company now offers built-in client and server support for MCP, enabling organizations to connect Gong with other AI systems while maintaining clear controls over data access, usage, and provenance. Gong first announced MCP support in October 2025 at its Celebrate conference, where it revealed initial integrations with Microsoft Dynamics 365, Microsoft 365 Copilot, Salesforce Agentforce, and HubSpot CRM. Today’s launch builds on that foundation.
But Reshef did not sugarcoat MCP’s limitations. “MCP is very immature when it comes to security,” he told VentureBeat. The protocol lets enterprise AI systems share data and context, but trust remains the enterprise’s responsibility. He explained a two-sided model: Gong can pull data from partners like Zendesk through certified integrations, and simultaneously makes its own MCP server available so that tools like Microsoft Copilot can query Gong’s data. “It’s up to the company which connections they actually feel are secure enough,” he said. “The safest ones are the ones that we’ve kind of like certified in a way. But MCP is an open protocol. They can connect it to their own systems. We have no control over this.”
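MCP itself is a JSON-RPC 2.0 protocol, so the two-sided model Reshef describes boils down to which parties are allowed to send messages like the one below. This is an illustrative sketch of the request shape an MCP client (say, Microsoft Copilot) might send to an MCP server such as Gong's; the tool name and arguments are hypothetical, not Gong's actual API.

```python
import json

# Shape of a standard MCP "tools/call" request (JSON-RPC 2.0).
# The tool name "search_calls" and its arguments are invented for
# illustration; only the envelope follows the MCP specification.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",  # standard MCP method for invoking a server tool
    "params": {
        "name": "search_calls",                      # hypothetical tool
        "arguments": {"account": "Acme Corp", "limit": 5},
    },
}

payload = json.dumps(request)
print(payload)
```

The security question Reshef raises lives outside this message: nothing in the protocol itself says whether the server on the other end should be trusted with the query or with what comes back, which is why the enterprise must decide which connections are "secure enough."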
That candor matters. As MCP adoption accelerates across the enterprise software stack, security teams are scrambling to understand what happens when agentic AI systems start talking to each other without humans in the loop. Gong appears to be betting that transparency about the protocol’s immaturity will build more trust than marketing bravado.
When asked for hard numbers, Reshef offered a mix of platform-wide results and measured candor about the newest features. Existing Gong customers report roughly a 50 percent reduction in sales rep ramp time and 10 to 15 percent improvements in win rates, he said.
But on Gong Enable specifically, he acknowledged the product is still brand new. “The trainer has been in the market for literally, you know, days, a week,” he said. “I would probably lie to you if I said, ‘Hey, we’re already seeing people crushing it after taking three or four courses.'” For the earlier version of Enable that includes the AI Call Reviewer, however, he said customers are “definitely seeing a very high kind of skill improvement” and are attributing increases in win rates and quota attainment to those gains — though he conceded that “it’s always hard to do 100 percent attribution.”
Morningstar, one of Gong’s early adopters, offered a pre-launch endorsement. Rae Cheney, Director of Sales Enablement Technology at Morningstar, said in a statement that Gong Enable helped the firm “spend less time on status updates and more time on the work that actually moves deals.”
One of the more interesting threads in Reshef’s remarks concerned his view of AI autonomy — or rather, its limits. He pushed back on what he called a “common misperception about AI” — that it operates completely autonomously.
“There has to be a person in the middle, which I call operator,” he said. “It could be RevOps. It could be enablement. In the case of training, it could be analysts. Sometimes it could be even business leaders.” Those operators, he argued, are responsible for a “repeatable process of AI doing something, measuring the AI” and adjusting over time.
This philosophy extends to the AI Call Reviewer’s feedback. Gong does not dictate what the system trains on — enablement leaders choose. “We don’t decide what they want to train on. We let them choose,” Reshef said. “You iterate, you optimize, you see how it goes, and there has to be somebody in the organization who’s responsible for making sure this aligns with the business needs.”
That stance puts Gong at odds with the more aggressive “autonomous agent” rhetoric emerging from some competitors, and it may resonate with enterprise buyers who remain cautious about letting AI run unsupervised in revenue-critical workflows.
Mission Andromeda does not exist in a vacuum. The revenue AI landscape has been reshaped by a remarkable wave of consolidation over the past six months.
In a category-defining move, Clari and Salesloft merged in December 2025 to form what they called a “Revenue AI powerhouse,” combining roughly $450 million in ARR under new CEO Steve Cox. Just two weeks ago, Highspot and Seismic signed a definitive agreement to merge, creating a combined entity worth more than $6 billion focused on AI-powered sales enablement — the very same territory Gong is now invading with Enable.
Meanwhile, Gong was named a Leader in the inaugural 2025 Gartner Magic Quadrant for Revenue Action Orchestration, published in December. The company placed highest among the 12 vendors evaluated on both the “Ability to Execute” and “Completeness of Vision” axes and ranked first in all four evaluated use cases in Gartner’s companion Critical Capabilities report.
In his interview, Reshef did not name competitors directly, but he drew a clear contrast. “We’ve built a product from the ground up. It’s all organic,” he said. “All of the other players in the field have sort of stitched together tools. And obviously you can’t just get it to be a coherent product if you just stitch together tools. Some of them even have multiple logins.” That is a thinly veiled shot at the merged Clari-Salesloft entity, which Forrester has described as presenting a “bifurcated approach” — Salesloft serving frontline users while Clari supports management insights.
Reshef also pointed to growth as a competitive weapon. “We’re growing at the top deck side in terms of SaaS companies,” he said, adding that Gong hired roughly 200 R&D employees this year and plans to hire another 200. “It’s kind of a flywheel where we can invest more in R&D, we make the product better, we get more capabilities, more flexibility, more enterprise customers.”
Any discussion of Gong’s trajectory inevitably raises the question of a public offering. When asked directly, Reshef declined to comment: “I wouldn’t comment on IPO at this stage. No.”
The company has been on a clear growth arc. Gong has raised approximately $584 million to date, with its last official funding round valuing it at $7.25 billion — the culmination of a series of rapid jumps from $750 million in 2019 to $2.2 billion in 2020. The company reached an annual sales run rate of approximately $300 million in January 2025, driven largely by the adoption of AI, according to Calcalist.
But that valuation has since slipped. As Calcalist reported in November 2025, Gong is conducting a secondary round for company employees and investors at a valuation of roughly $4.5 billion — well below its 2021 peak. The offering is being conducted through Nasdaq’s private market platform and was in advanced stages at the time of the report. It is not yet clear whether the company has repriced employee options, some of which were issued at significantly higher valuations than the current secondary round. Gong told Calcalist that it “regularly receives inquiries from potential investors” but as a private company does not “engage in speculation.”
The structured quarterly launch cadence that Mission Andromeda inaugurates — complete with galactic naming conventions and coordinated product narratives — certainly resembles the kind of predictable, story-driven approach that public market investors reward. Reshef framed it differently: “We felt like having quarterly launches with a name, a mission, and a story around it makes it easier to work… It’s a good way to educate the market on a regular basis.”
Reshef’s most revealing comment came when he laid out the company’s long-term thesis: Gong aims to increase productivity for revenue professionals by 50 percent. “We’re not there yet,” he admitted. “I think we’re like at 20 to 30 — whatever, hard to measure.”
He broke the productivity gain into two categories. The first is making high-complexity human tasks — like conducting a live Zoom sales call — better, through coaching, training, and review. “I think there’s going to be a long while, if ever, that Zoom conversations are going to get replaced by bots,” he said. The second is automating the manual drudgery: call preparation, post-meeting summaries, follow-up emails, account research briefs.
The distinction matters because it frames Gong’s ambition not as replacing salespeople but as making them dramatically more effective — a message calibrated to appeal to the thousands of revenue leaders who control Gong’s buying decisions. Whether that thesis holds will depend on whether Gong Enable, the AI Trainer, and the rest of Mission Andromeda can deliver measurable gains in a market that has been burned before by tools that promise insight but struggle to change behavior.
Gong currently serves more than 5,000 companies worldwide. The Clari-Salesloft merger has produced a rival with deeper combined resources. The Highspot-Seismic combination is assembling a sales enablement colossus. And a new Gartner category means every enterprise buyer now has a framework for comparison shopping. The next twelve months will test whether Mission Andromeda is the release that cements Gong’s position at the center of the revenue AI category — or the last big swing before the consolidated giants close in.
“Our mission is to be at the forefront,” Reshef said. “If everybody else is doing 20 percent, we’re going to do 50. If everybody is going to do 50, we’re going to do 80.”
In the revenue AI wars, that kind of confidence is easy to project. Delivering on it, with brand-new products still days old and a market being remade around you in real time, is something else entirely.
Claude Code has become increasingly popular in the first year since its launch, and especially in recent months, as developers and non-technical users alike flock to AI unicorn Anthropic's hit coding agent to build, on their own and in days, full applications and websites that would once have required months of work and dedicated technical teams. It's not a stretch to say it helped spur the "vibe coding" boom: using plain English instead of programming languages to write software.
But it’s all been restricted to the desktop Claude Code apps and Terminal command-line interfaces and integrated development environments (IDEs) — until today. Now, Anthropic has added a new mode, Remote Control, that lets users issue commands to Claude Code from their iPhone or Android smartphone — starting with subscribers to Anthropic’s Claude Max ($100-$200 USD monthly) subscription tier.
Anthropic posted on X saying Remote Control will also make its way to Claude Pro ($20 USD monthly) subscribers in the future.
Announced earlier today by Claude Code Product Manager Noah Zweben, Remote Control is a synchronization layer that bridges local CLI environments with the Claude mobile app and web interface.
The feature allows developers to initiate a complex task in their terminal and maintain full control of it from a phone or tablet, effectively decoupling the AI agent from the physical workstation.
Currently, Remote Control is available as a Research Preview for subscribers on the Claude Max tier. While access for Claude Pro ($20/month) users is expected shortly, the feature remains a high-end tool for power users and is notably absent from Team or Enterprise plans during this initial phase.
To access the feature, users must update to Claude Code version 2.1.52 or later, then execute the command claude remote-control or use the in-session slash command /rc. Once active, the terminal displays a QR code that, when scanned, opens a responsive, synchronized session in the Claude mobile app.
The messaging behind the release centers on the preservation of a developer’s “flow state.”
In his announcement, Zweben framed the update as a lifestyle upgrade rather than just a technical one, encouraging users to “take a walk, see the sun, walk your dog without losing your flow.”
This “Remote Control” is not a cloud-based replacement for local development, but a portal into it. According to official documentation, the core value is that “Claude keeps running on your machine, and you can control the session from the Claude app.”
This ensures that local context—filesystem access, environment variables, and Model Context Protocol (MCP) servers—remains active and reachable even if the user is miles away from their desk.
Claude Code Remote Control functions as a secure bridge between your local terminal and Anthropic’s cloud interface, which provides the Anthropic AI models, Opus 4.6 and Sonnet 4.6, that power Claude Code.
When you run the command, your desktop machine initiates an outbound connection to Anthropic’s API for serving the models — meaning you aren’t opening any “inbound” ports or exposing your computer to the open web. Instead, your local machine polls the API for instructions.
When you visit the session URL or use the Claude app, you are essentially using those devices as a “remote window” to view and command the process still running on your computer. Your files and MCP servers never leave your machine; only the chat messages and tool results flow through the encrypted bridge.
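The outbound-only pattern described above, including the automatic reconnection behavior, can be sketched in a few lines. Everything here is an assumption for illustration: the function names, the command shapes, and the stop signal are invented, not Anthropic's actual Remote Control API; only the pattern (poll outward, execute locally, back off and retry on network failure) reflects the description.

```python
import time

def poll_for_instructions(fetch, execute, max_backoff=60):
    """Long-poll a relay for commands; back off and retry on network errors.

    fetch:   callable making an outbound request for the next queued command
             (returns None when nothing is queued) -- no inbound ports needed.
    execute: callable running the command against the local files/MCP servers.
    """
    backoff = 1
    while True:
        try:
            command = fetch()      # outbound HTTPS poll to the relay
            backoff = 1            # successful poll: reset the retry backoff
            if command is None:
                continue           # nothing queued yet; poll again
            if command == "stop":
                break              # hypothetical session-end signal
            execute(command)       # runs locally, next to the real filesystem
        except ConnectionError:
            time.sleep(backoff)            # network dropped: wait, then retry
            backoff = min(backoff * 2, max_backoff)

# Demo with a stubbed transport: two commands queued, then a stop signal.
queue = ["ls", "git status", "stop"]
ran = []
poll_for_instructions(fetch=lambda: queue.pop(0), execute=ran.append)
print(ran)  # ['ls', 'git status']
```

The design consequence is exactly what the article notes: because every connection originates from the laptop, the phone never talks to the machine directly, and a dropped network simply pauses the loop until polling succeeds again.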
To get started, ensure you are on a Pro or Max plan and have authenticated your CLI using the /login command. Simply navigate to your project directory and run claude remote-control to initialize the session. The terminal will then generate a unique session URL and a QR code (toggleable via the spacebar) for your mobile device.
Once you open that link on your phone, tablet, or another browser, the two surfaces stay in perfect sync—allowing you to start a task at your desk and continue it from the couch while maintaining full access to your local filesystem and project configuration.
Prior to this official release, the developer community went to great lengths to “hack” mobile access into their terminal-based workflows.
Power users frequently relied on a patchwork of third-party tools like Tailscale for secure tunneling, Termius or Termux for mobile SSH access, and Tmux for session persistence.
Some developers even built complex custom WebSocket bridges just to get a responsive mobile UI for their local Claude sessions.
These unofficial solutions, while functional, were often brittle and prone to timeout issues. Remote Control replaces these workarounds with a native streaming connection that requires no port forwarding or complex VPN configurations.
It also includes automatic reconnection logic: if a user’s laptop sleeps or the network drops, the session remains alive in the background and reconnects as soon as the host machine is back online.
The launch of Remote Control serves as an “escalation of force” in what has become a dominant business for Anthropic. As of February 2026, Claude Code has hit a $2.5 billion annualized run rate — a figure that has more than doubled since the start of the year alone.
Claude Code is currently experiencing its “ChatGPT moment,” surging to 29 million daily installs within Visual Studio Code. Its efficiency is no longer theoretical; recent analysis suggests that 4% of all public GitHub commits worldwide are now authored by Claude Code.
By extending this power to mobile, Anthropic is further entrenching its lead in the “agentic” coding space, moving beyond simple autocomplete to a world where the AI acts as an autonomous collaborator.
The move toward mobile terminal control signals a broader shift in the software market. We are entering an era where AI tools are writing roughly 41% of all code. For developers, this translates to a migration from “line-by-line” typing to “strategic oversight.”
This trend is likely to accelerate as mobile-tethered agents become the norm. The barrier between “idea” and “production” is collapsing, enabling a single developer to manage complex systems that previously required entire DevOps teams. This shift has already rattled the broader tech market; shares of major cybersecurity firms like CrowdStrike and Datadog fell as much as 11% following the launch of Claude Code’s automated security scanning features.
As Claude Code moves from the desk to the pocket, the definition of a “software engineer” is being rewritten. In the coming year, the industry may see a surge in “one-person unicorns”—startups built and maintained almost entirely via mobile agentic commands—marking the end of the manual coding era as we knew it.