Presented by EdgeVerveFor most enterprises, AI adoption began with a straightforward ambition: automate work faster, cheaper, and at scale. Chatbots replaced basic service requests, machine‑learning models optimized forecasts, and analytics dashboards …
Presented by Apptio, an IBM company
AI spending is surging, but the full impact often remains an open question. Closing the gap requires clear answers to how AI is governed, measured, and tied to business outcomes.
ROI uncertainty isn’t unique to AI: In the Apptio 2026 Technology Investment Management Report, 90% of technology leaders surveyed said that ROI uncertainty has a moderate or major impact on overall tech investment decisions, a 5-percentage point year-over-year increase. In other words, tech leaders are increasing their reliance on ROI – even if they don’t fully know how to measure it. And AI economics involves new and unpredictable costs, further complicating ROI calculations. Faced with increasing uncertainty and increasing budgets, technology leaders need a clear, reliable framework for evaluating AI ROI.
Organizations increasingly expect scaled AI to pay its own way, at least partially. According to Apptio’s technology investment management report, 45% of organizations surveyed intend to fund innovation by reinvesting savings from AI-driven efficiencies. That model assumes that such savings are both achievable and quantifiable. Meanwhile, the two-thirds of organizations planning to reallocate existing budget capital to AI will need clarity on the trade-offs involved.
Much like the early days of public cloud, AI costs and returns are difficult to predict. Pricing varies widely across providers and continues to evolve, while consumption is unpredictable. The pressure to adopt quickly is also formidable as organizations navigate the threat of disruption by more agile competitors.
Considering the many variables, tech leaders should view AI ROI as a matter of optimization. At a high level, the implementation of AI initiatives is inevitable. The question is how to achieve the greatest possible returns — both financial and organizational.
Start with the business problem. There are many ways AI can deliver positive impact, but organizational resources and focus may be limited. Make sure you’re prioritizing the right initiatives by basing your AI investment strategy on quantifiable goals tied to real business outcomes. Are you trying to improve decision-making speed? Increase throughput or capacity? Or chasing cool edge cases with high potential returns but minimal strategic relevance?
Determine what success looks like. AI can introduce a new capability or augment an existing one. For new capabilities, articulate the possibilities you’d like to unlock, such as new revenue opportunities, workflows, or decision-making processes. For augmentations, establish baseline performance and the expected lift you aim to achieve with AI.
Consider how finances will influence your evaluation. Some use cases may show minimal results in the near-term but drive significant value in the long-term. What’s your timeframe for return? On the other hand, more successful rollouts with rapid adoption can generate unexpectedly high inference bills. Would that mean pulling the plug — or leaning in further? What should your cost and return curve look like over the years? As you map your timeline, establish clear thresholds to determine whether you’ll proceed, pause, stop, or accelerate your investment.
Identify the right KPIs. The returns on an AI investment can be even more difficult to evaluate than the costs. Usage, efficiency, and financial impact all matter. But AI success metrics won’t always be straightforward. There may be new usage patterns you don’t yet have a way to measure. Your technology environment may experience follow-on shifts that call for further evaluation. Will you be able to lessen your reliance on other tools, such as reducing seats in your data analytics platform? How will you factor in cross-tool pricing comparisons for multiple AI providers with shifting rates?
To gain full context and insight, you must also take into account the alignment of the initiative with your broader strategy and consider the opportunity cost of the investments you might otherwise have made. Remember that you’re not evaluating AI business value in isolation; you’re deciding whether it’s the best use of finite capital across all your investments.
These decisions will call for a level of insight far exceeding what was needed to justify traditional purchases like network infrastructure or enterprise software. Tech leaders navigating the complexities of AI economics should consider a new framework for data-driven decision-making.
Technology business management (TBM) helps make ROI more concrete and measurable, so it can be relevant to the business. By bringing together IT Financial Management (ITFM), AI FinOps (cloud financial management for AI workloads), and Strategic Portfolio Management (SPM), a TBM framework connects financial, operational, and business data across the enterprise.This makes it possible to account for AI value and cost across a wide array of dimensions — and translate hypothetical innovation into board presentations and budget justifications that hold up under scrutiny.
TBM can help leaders build a trustworthy cost foundation that captures AI spend across labor, infrastructure, inference, storage, and applications. As AI workloads shift dynamically, TBM provides visibility into how that spend is distributed across on-premises systems and cloud environments — both of which require different capacity planning for specialized skill sets. The framework also connects investments to business outcomes, aligning AI initiatives with strategic priorities and measurable results. With increased visibility, you’re able to identify issues and make decisions fast, such as catching cost spikes early. Early detection can help to determine if the usage shift merits shifting funding. This unified view of financial and operational data helps leaders scale what’s working and reassess what isn’t as adoption increases. TBM provides essential visibility and context across the entire AI spend management conversation. Even as pricing evolves, tooling changes, and workflows shift, you can apply the same analytical approach and understand what’s actually working and demonstrate ROI. Leaders who operationalize AI within a TBM framework can:
Evaluate ROI at both project and portfolio levels
Spot unexpected cost spikes
Compare multiple AI tools
Understand ripple effects across run-the-business systems
Defend investment decisions with confidence
Understand and manage total costs and usage across the AI investment lifecycle
Organizations are moving beyond AI experiments, and we’re past the point where these investments can be funded on optimism alone. Amid heightened uncertainty and cost sensitivity, boards are asking more strategic questions and finance wants trustworthy data.
Enterprise leaders who treat AI as a managed investment, rather than a bet on innovation, are those who will scale it successfully. To fund AI responsibly, leaders must establish clarity around scope, outcomes, cost drivers, and readiness. A TBM-driven approach provides the data foundation, visibility, and accountability to make those decisions.
Learn more here about how Apptio TBM transforms IT spend management in the AI era.
Ajay Patel is General Manager at Apptio, an IBM Company.
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
Voice agents have been expensive to run and painful to orchestrate, not because the models can’t handle conversation, but because context ceilings forced enterprises to build session resets, state compression, and reconstruction layers into every deployment. OpenAI’s three new voice models are designed to reduce that overhead, and they change how engineers can think about building voice into a larger agent stack.
GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper integrate real-time audio into the model management stack as discrete orchestration primitives — separating conversational reasoning, translation, and transcription into specialized components rather than bundling them in a single voice product.
The company said in a blog post that Realtime-2 is its first voice model “with GPT-5 class reasoning” and can handle difficult requests and keep conversations flowing naturally. Realtime-Translate understands more than 70 languages and translates them into 13 others at the speaker’s pace, and Realtime-Whisper is its new speech-to-text transcription model.
These three actions no longer sit inside a single stack or model. GPT-Realtime-2 could technically handle transcription, but OpenAI is routing distinct tasks to specialized models: Realtime-Translate for multilingual speech and Realtime-Whisper for transcription. Enterprises can assign each task to the appropriate model rather than routing everything through a single, all-encompassing voice system.
The new OpenAI models compete against Mistral’s Voxtral models, which also separate transcription and target enterprise use cases.
More enterprises are seeing the value of voice agents now that more people are becoming comfortable conversing with an AI agent, and also because of the richness of data from voice customer interactions.
Organizations evaluating these models will need to consider their orchestration architecture, not just model quality — specifically, whether their stack can route discrete voice tasks to specialized models and manage state across a 128K-token context window.
Just a few weeks after announcing Claude Managed Agents, Anthropic has updated the platform with three new capabilities that collapse infrastructure layers like memory, evaluation, and multi-agent orchestration, into a single runtime.
This move could threaten the standalone tools that many enterprises cobble together.
The new capabilities — ‘Dreaming,’ ‘Outcomes,’ and ‘Multi-Agent Orchestration’ — aim to make agents inside Claude Managed Agents “more capable at handling complex tasks with minimal steering,” Anthropic said in a press release.
Dreaming deals with memory, where agents “reflect” on their many sessions and curate memories so they learns and surface unknown patterns. Outcomes allows teams to define and set specific rubrics to measure an agent’s success, while Multi-Agent Orchestration breaks jobs down so a lead agent can delegate to other agents.
Claude Managed Agents ideally provides enterprises with a simpler path to deploy agents and embeds orchestration logic in the model layer. It’s an end-to-end platform to manage state, execution graphs, and routing. With the addition of Dreaming, Outcomes and Multi-agent Orchestration, Claude Managed Agents expands capabilities even further and directly competes with tools like LangGraph or CrewAI, as well as external evaluation frameworks, RAG memory architectures, and QA loops.
Enterprises must now ask: Should we ditch our flexible, modular system in favor of an agent platform that brings almost everything in-house?
Anthropic designed Claude Managed Agents to share context, state, and traceability in one place. This means the platform sees every decision agents make, rather than enterprises having to wire separate systems together. It sounds practical to have one platform that does everything. But not all enterprises want a full-service system.
Claude Managed Agents already faces criticism that it encourages vendor lock-in because it owns most of the architecture and tools that govern agents. In the current paradigm, an organization may run Managed Agents but keep multi-agent orchestration, memory, or evaluations in a separate space ensures flexibility.
The platform offers a fully-hosted runtime, which means memory and orchestration run on infrastructure the enterprise does not own. This can become a compliance nightmare for some organizations that have to prove data residency.
Another problem to consider is that enterprises already in the middle of large-scale AI transformations must cobble together workarounds to deal with the constraints of their tech stack. Not every workflow is easily replaceable by switching to Claude Managed Agents.
Most enterprises have a fragmented approach to AI deployment.
For example, they may use LangGraph or Crew AI for agent routing and workflow management, Pinecone as a vector database for long-term memory, DeepEval for external evaluation, and a human-in-the-loop quality assurance to review some tasks. Anthropic hopes to do away with all of that.
With Dreaming, Anthropic approaches memory by allowing users to actively rewrite it between sessions, so the agent essentially learns from its mistakes. Anthropic says this capability is useful for long-running states and orchestration. Current systems often handle memory persistence by storing embeddings, retrieving relevant context, and adding more state over time.
Outcomes addresses the evaluation portion by detailing expectations for agents. Instead of external quality checks, which are often done by a team of humans, Anthropic is bringing evaluation into the orchestration layer rather than above it.
But it’s the Multi-Agent Orchestration capability that pits Claude Managed Agents against orchestration frameworks from Microsoft, LangChain, CrewAI, and others. Model providers like Anthropic and OpenAI have already begun pushing aggressively into this space, arguing that bringing this to the model layer gives teams better control.
Enterprises face a big decision, and this one could depend on where they are in agent maturity.
If an organization is still experimenting with agents and has not deployed many in production, they may find moving to Claude Managed Agents and configuring Dreaming and Outcomes to their needs much easier. This is the stage of development where, even if enterprises are using a third-party orchestrator like LangChain, they’re still customizing it.
But for those who are already further along in the process, the calculation becomes trickier. It’s now a matter of parallel evaluation and better understanding of their processes.
Businesses, though, will face the same decision even if they don’t intend to use Claude Managed Agents. Anthropic has signaled that other model and platform providers will likely shift their product roadmaps to a similar model that keeps everything locked in the same system — because models may become interchangeable, but the tooling and orchestration infrastructure will not.
Presented by SAP
The enterprise software industry has undergone a fundamental shift, and vendors are adapting their approaches to better protect the customers who rely on them. For years, every global platform vendor running multi-tenant cloud infrastructure has maintained documented rate limits, usage controls, and restrictions on the use of undocumented internal interfaces.
CRM platforms impose daily API call limits per organization, enforce platform-layer limits, and maintain a strict separation between bulk data APIs and transactional REST surfaces. Productivity and collaboration suites throttle their graph APIs and redirect bulk workloads to purpose-built data access channels designed for that load. HR and workforce management platforms enforce concurrent request limits and per-session data retrieval caps. IT service management platforms enforce per-user rate limits and instance-level throttling. Hyperscalers publish per-service quotas, enforce them at the infrastructure layer, and explicitly prohibit applications from calling non-SDK or non-published interfaces.
These are not controversial measures. They are baseline hygiene for enterprise-grade software platforms operating shared infrastructure at scale. For more than a decade these measures have been in place without serious objection.
As SAP has taken responsibility for securing customers’ mission-critical workloads in the cloud, a unified API policy with clarified usage controls is not a restriction but the expression of enterprise-grade stewardship. Some have read the policy as a new restriction. The policy does not introduce new restrictions. It names and unifies controls that have existed across individual SAP products for years.
SAP is not introducing API governance as a novel concept. SAP SuccessFactors, SAP Ariba, SAP LeanIX, and several other SAP solutions have enforced documented rate limits and usage controls. SAP Notes and SAP’s documentation have also in the past defined API usage.
What the recent policy does is unify that existing practice into a single cross-portfolio standard, a step made urgent by the arrival of autonomous agentic harnesses that SAP is fully committed to enabling, but which place a categorically different performance, stability, and security load on API surfaces that were never designed for autonomous orchestration and data extraction at scale.
Custom APIs built by customers in their own namespace for their own extensibility, integration, and migration purposes are customer-developed interfaces. If you have spent years building custom data services, custom RFCs, and ABAP interfaces to connect your SAP system to the world around it, the policy’s restriction on non-published APIs might read, on first encounter, like a demolition order. It is not. The policy’s restriction targets SAP’s own internal unreleased objects. It does not reach into the Z namespace and condemn two decades of ABAP engineering.
SAP’s Private Cloud customers are in a distinctly privileged position compared with much of the enterprise world, because they have long been able to build in their own namespace and to shape an environment they were free to modify and extend, and that freedom is not being revoked.
The policy is focused on something narrower: SAP’s own internal interfaces that were never published, never documented for customer use, and never offered as a dependable foundation for integration. Most custom code never touches these internals and will continue untouched; where it does, the risk for customers has always been present, and the policy merely names it rather than inventing it.
However, within that set there is a smaller class of interfaces that is not a matter for debate but for prohibition. ODP-RFC belongs in that class: it sits in SAP’s namespace as an internal, non-released interface that SAP explicitly classifies as “unpermitted” for customer or third-party application use as documented in SAP Note 3255746.
These are precisely the kinds of interfaces SAP will flag as prohibited in notes and automated tooling so that such usage can be identified early through tooling and guidance, rather than discovered late in deployment or operational context. Clean Core is distinct from the API Policy but points in the same direction, and it bears noting that customers did not merely accept it but asked for it repeatedly, having lived through the upgrade costs of the alternative; in the agentic era, where SAP runs mission-critical ERP as a service, both the Clean Core Recommendations and API Policy are conditions of the enterprise-grade reliability that cloud operations make possible.
While some commentators have argued this policy is primarily a commercial move, the technical evidence tells a different story.
AI has changed everything about our traditional view of transactional interfaces. The APIs that enterprises have used for decades to integrate SAP systems with third-party applications are request-response interfaces built for transactional workloads. They were designed to fetch a sales order, post a goods receipt, or trigger a payment run. They were designed to be mostly called by a human-authored integration flow, at a predictable frequency, for a defined business purpose. They were not designed to have an autonomous AI orchestration harness run thousands of sequential calls against them in pursuit of semantic context about the business model encoded within. That is not a clean core integration pattern.
Much of the debate misses a core architectural distinction. A traditional integration tool reads a sales order from SAP, converts it into the format a target schema needs, and moves it on. SAP’s data model plays no role beyond being a transient interpretation step.
An AI agent does something categorically different. It does not merely retrieve a value. It reads the sales order header data and learns that this structure represents a customer commitment to buy. It reads the line item data and learns how individual items relate to that order. It reads the net value and learns that this number is meaningful only when paired with the document currency. It traces the path that a sales order takes through delivery, billing, and finally into the accounting ledger, and internalizes how SAP reconciles operations and finance within its business object model.
The agent is not only consuming a customer’s transactional data. It is consuming the semantic ontology: the business object definitions, the relationships between entities, the conceptual architecture that SAP has built and refined over five decades of enterprise knowledge encoding.
SAP has long distinguished between enabling transactional access to customer data and the broader extraction or replication of the underlying ontology. The policy does not create this boundary, because it already existed. Autonomous agents must continue to respect that boundary, rather than redefine it.
Then there is a security angle, and it is not abstract. The same week this policy was published, a supply chain attack named the Mini Shai-Hulud – a variant of the npm worm, quietly compromised hundreds of software packages. SAP-ecosystem npm packages were compromised and we addressed this with this security note for customers. This is not a theoretical threat model. This is the active threat environment in which community-built MCP servers are being connected to productive SAP systems running mission-critical business processes.
The OWASP MCP Top 10 documents the vulnerability classes systematically: tool poisoning, prompt injection, privilege escalation via scope creep, token mismanagement, and supply chain compromise. Recent research across thousands of analyzed MCP implementations shows that a majority operate with static long-lived credentials or carry identifiable security findings, and a single compromised package in the MCP ecosystem can cascade into hundreds of thousands of exposed development environments. VentureBeat just last week reported a serious com.mand execution flaw that made up to 200,000 MCP servers vulnerable.
Consider what that means in practice. An AI agent that has just internalized the semantic structure of your SAP data model and is operating through a community MCP server, moves beyond a productivity tool and into an elevated risk category, one that combines broad system access with an attack surface that is still evolving.
The MCP debate has also obscured a technical reality that enterprise architects need to confront directly. The Model Context Protocol is plumbing. It specifies how an AI model calls a tool. It says nothing about whether the model understands what the tool does in a business context, in what sequence tools must be called, what side effects a given API invocation will trigger, or what the consequences of an incorrect parameter will be. A naive MCP implementation connecting to SAP OData services can call a tool. It cannot run a business process.
The token consumption data from production agentic deployments is instructive. For illustration, a query asking for an employee’s manager and traversing through the list of peers in an SAP SuccessFactors system consumed 565,000 tokens under a standard MCP implementation. The same query under a context-aware implementation consumed 80,000 tokens. That is the difference between a query costing $1.70 and a query costing $.24, for example, on a single operation, repeated across thousands of daily transactions. The standard MCP implementation is not automation. It is an expensive approximation of automation that fails on complex queries while loading the API surface with traffic it was not designed to carry.
SAP’s response to these challenges is not to close the ecosystem but to build the right infrastructure for an open one. That distinction is worth dwelling on.
The API Policy anchors compliance in documented, co-engineered architectures. The agentic interoperability reference architectures jointly developed with major technology partners are published and available on the SAP Architecture Center, prioritized by customer demand and updated as new patterns are validated.
The bi-directional integration of SAP Joule and Microsoft 365 Copilot is the most visible example of what co-engineered agentic integration looks like in production: two AI systems, from two different vendors, working across each other’s application surfaces without either party bypassing the other’s security model. The endorsed path for external AI agent access to SAP is the Agent Gateway via the A2A protocol, with reference AI Golden Path on the SAP Architecture Center. The SAP Knowledge Graph, Open Resource Discovery (ORD) specification for metadata, and SAP BDC data products provide the context layer that transforms a protocol connection into a business-capable interaction. SAP also offers governed MCP servers for CAP, UI5, Fiori Elements, and has indicated its intent to extent this model to additional development environments, including ABAP development. These are not closed doors, they are the right doors.
SAP’s position in the standards community is that of an active contributor, not a gatekeeper. SAP is a launch partner of the Agent2Agent (A2A) protocol under the Linux Foundation and holds Gold level membership in the Agentic AI Foundation, co-chairing the Agent Identity and Trust workstream alongside the organizations that define how AI agents authenticate, authorize, and interoperate across enterprise boundaries.
A2A and MCP are not external constraints that SAP is grudgingly accommodating. They are protocols SAP uses internally and is actively hardening through standards work. When community and open-source frameworks meet the security floor that enterprise deployment requires, external integration pathways will follow.
The API Policy issued by SAP does not mark the end of openness. The industry has spent two years deploying AI agents against enterprise systems using protocols that the enterprise security community had not finished hardening, against APIs that were never designed for autonomous orchestration, with community tooling that documented attackers had already learned to compromise. Governance was not optional, it was timely.
Anirban Majumdar is Head of the Office of the CTO at SAP.
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
Anthropic on Tuesday unveiled a suite of updates to its Claude Managed Agents platform at its second annual Code with Claude developer conference in San Francisco, introducing a new capability called “dreaming” that lets AI agents learn from their own past sessions and improve over time — a step toward the kind of self-correcting, self-improving AI systems that enterprises have demanded before trusting agents with production workloads.
The company also moved two previously experimental features — outcomes and multi-agent orchestration — from research preview into public beta, making them broadly available to developers building on the Claude platform. Together, the three features address what Anthropic says are the hardest problems in running AI agents at scale: keeping them accurate, helping them learn, and preventing them from becoming bottlenecks on complex, multi-step work.
Early adopters are already reporting significant results. Legal AI company Harvey saw task completion rates increase roughly 6x after implementing dreaming. Medical document review company Wisedocs cut its document review time by 50% using outcomes. And Netflix is now processing logs from hundreds of builds simultaneously using multi-agent orchestration.
The announcements come at a moment of extraordinary momentum for Anthropic. CEO Dario Amodei disclosed during a fireside chat at the conference that the company’s growth has outpaced even its own aggressive internal projections.
In the first quarter of 2026, Anthropic saw what Amodei described as 80x annualized growth in revenue and usage — far exceeding the 10x annual growth the company had planned for. API volume on the Claude platform is up nearly 70x year over year, and the average developer using Claude Code now spends 20 hours per week working with the tool.
“We tried to plan very well for a world of 10x growth per year,” Amodei said. “And yet we saw 80x. And so that is the reason we have had difficulties with compute.”
Dreaming is the most novel of the three features and the one Anthropic is most eager to distinguish from conventional memory systems. While the company launched agent memory earlier this year — allowing Claude to retain preferences and context within and across individual sessions — dreaming works at a higher level of abstraction. It is a scheduled process that reviews an agent’s past sessions and memory stores, extracts patterns across them, and curates those memories so agents improve over time. It surfaces insights that no single agent session could see on its own: recurring mistakes, workflows that multiple agents converge on independently, and preferences shared across a team of agents.
Alex Albert, who leads research product management at Anthropic, explained the concept in an interview at the conference. He described dreaming as analogous to how people within organizations create skills after working through a task. “They might do a workflow with Claude, and at the end of that workflow, after they’ve iterated and zigzagged a little bit, they want to record that path from A to B,” Albert said. “A very similar thing is happening with dreaming — instead of you manually creating the skill from your experience working with Claude, the model is doing it, so it has that same context for a future session.”
Crucially, dreaming does not modify the underlying model weights. “We’re not changing the model itself through dreaming — it’s not doing updates to the weights or anything like that,” Albert said. Instead, the agent writes learnings as plain-text notes and structured “playbooks” that future sessions can reference, making the entire process observable and auditable by humans. When asked about the trust implications of agents consolidating their own knowledge, Albert acknowledged that “there is a level of trust that you need to place” but noted that all memories are inspectable and that smarter models are getting progressively better at managing this process. “They’re learning to write better notes for their future self,” he said.
During the keynote, the Anthropic team demonstrated all three features live on stage using a fictional aerospace startup called “Lumara” that needed to autonomously land drones on the moon for resource mining. The team configured a multi-agent system with three specialists — a commander agent responsible for overall mission success, a detector agent that identified high-quality landing sites, and a navigator agent that handled safe drone flight and landing — and defined a success rubric requiring soft landings, clear ground, and enough fuel reserves for a return trip to Earth.
An initial simulation across six hypothetical landing sites produced strong but imperfect results. To improve, the presenters triggered a dreaming session directly from the Claude Developer Console. Overnight, the dreaming agent reviewed all past simulation sessions and wrote a detailed descent playbook — a comprehensive set of heuristics drawn from patterns across multiple mission runs. When the team ran a new simulation the following morning with the dreaming-derived playbook in memory, the results improved meaningfully on the sites that had previously underperformed.
“All we had to do was just have Caitlin press a button,” said Angela Jiang, Head of Product for the Claude Platform, referring to her colleague on stage. “All dreaming.”
The demo illustrated how the three features compose together in practice. Multi-agent orchestration split the complex task across specialists with independent context windows. Outcomes provided the rubric against which a separate grader agent evaluated each run. And dreaming extracted lessons across those runs to improve future performance — forming what Anthropic describes as a continuous improvement loop that requires no human intervention between iterations.
The outcomes feature, now in public beta, gives developers a way to define what success looks like using a rubric — a structural framework, a presentation standard, a brand voice, or any other set of criteria — and then lets the agent iterate toward that standard autonomously. What makes outcomes architecturally distinctive is its separation of concerns. When an agent completes its work, a separate grader agent evaluates the output against the developer-defined rubric in its own independent context window. Because the grader operates in a fresh context, it is not influenced by the working agent’s reasoning or accumulated biases from the session.
When the grader identifies gaps between the output and the rubric, it pinpoints specifically what needs to change, and the working agent takes another pass. This loop continues until the rubric criteria are met — without a human needing to review each attempt.
Albert described Anthropic’s broader verification strategy as employing “more test time compute, more models thinking about a problem for longer, to check over the work of another.” He acknowledged that having a model check its own work raises reasonable questions, but said a fresh context window reviewing completed work consistently outperforms asking the same long-running thread to identify its own bugs. “You will get higher success if you give that output to a fresh Claude and say, ‘what bugs do you see?'” he said. “There is still something to the attention” that degrades over very long sessions — a limitation he said Anthropic is actively working to fix in future models.
The approach mirrors strategies already in use at GitHub. Mario Rodriguez, Chief Product Officer at GitHub, described during a separate talk at the conference how Copilot uses a similar advisor pattern with Claude models — pairing a smaller, cheaper model as an executor with a larger model as a mentor. When the smaller model encounters a problem beyond its capability, it calls the larger model for guidance, then continues executing on its own. Rodriguez said the approach delivers near-Opus-level intelligence at significantly lower cost, and that GitHub inserts critique models at three specific points in the coding workflow: after drafting a plan, after a complex implementation, and after writing tests but before running them.
Multi-agent orchestration, the third feature moving to public beta, allows a lead agent to decompose a large task into subtasks and delegate each one to a specialist agent — each with its own model, system prompt, tools, and independent context window. Every step in the process is traceable in the Claude Console, showing which agent did what, in what order, and why.
The design gives each sub-agent an isolated context, which Anthropic says produces better results than having a single agent attempt to hold all the complexity in one thread. “Each sub-agent has its own independent thread and context window,” the keynote presenters explained. “This is very intentional — we found that by splitting the work and then merging the results, we get better outcomes.”
Albert offered his own heuristic for when multi-agent architectures make sense versus sticking with a single thread. “Parallel agents are better for investigation,” he said — situations where there is a lot of context that will ultimately be discarded. “If you’re trying to answer a specific question, you don’t need all the search results from the areas where it didn’t find the answer. You just need the answer.” He described spinning up disposable sub-agents for specific retrieval tasks and bringing only the result back to the main thread. Increasingly, he said, the model itself will decide when to parallelize. “In the future, you won’t really care if it’s one agent or multi-agent or whatever’s happening. You just have a Claude that you’re talking to, and it will deploy the right architecture automatically.”
The three features arrive as part of a broader platform push that Anthropic framed throughout the conference as closing “the gap between what AI can do and what it’s actually doing for people.” Ami Vora, Anthropic’s Chief Product Officer, set the theme in her opening keynote, noting that while model capabilities are advancing on an exponential curve, most organizations are still adopting AI on a linear path.
Dianne Penn, who leads product for Anthropic’s research team, described the company’s measure of progress as “task horizon” — how long an AI agent can work autonomously while improving the quality of its deliverables. “This time last year, models could work for minutes,” she said. “Now, most of us have agents running for hours on end. Tomorrow, we’ll have agents that are proactive, always on, and know what to work on without losing the frame.”
The event also included several infrastructure announcements designed to help developers keep pace. Anthropic said it is doubling its five-hour rate limits for Pro, Max, Team, and Enterprise plans, and raising API rate limits considerably. The company announced a partnership with SpaceX to use the full capacity of its Colossus data center to expand compute availability — a direct response to the demand crunch Amodei described.
All three features are built into Claude Managed Agents, which launched in public beta on April 8 as an opinionated harness that bundles best practices including memory, tool integration, and action handling. Anthropic says teams using Managed Agents have shipped 10x faster than those building their own agent infrastructure from scratch. Albert described the platform using an operating system analogy: “With managed agents, you don’t need to think about all the technicalities of how you set up the surrounding system,” he said. “You’re building an application for Macs — you don’t want to go have to re-implement every detail of macOS.”
The competitive implications are significant. As AI agent platforms from OpenAI, Google, and others compete for developer adoption, Anthropic is betting that production reliability — not just raw model intelligence — will determine which platform wins enterprise budgets. The dreaming feature in particular stakes out new territory: while other platforms offer memory and tool use, the idea of agents systematically reviewing their own histories to extract reusable knowledge goes further toward the kind of continuously improving systems that enterprises need before delegating high-stakes work.
The conference showcased companies already operating at that scale. Mercado Libre, Latin America’s largest e-commerce platform, has 23,000 engineers running Claude Code, has reviewed more than 500,000 pull requests with human oversight, and is aiming for 90% autonomous coding by the third quarter of this year. Shopify has deployed Claude Code across not just engineering but design, product, and data science teams.
But it was Dario Amodei who articulated the most expansive vision for where all of this leads. He described a progression from single agents to multiple agents to whole organizational intelligence — from “a team of smart people in a room” to what he called “a country of geniuses in the data center.” And he reiterated a prediction he made roughly a year ago: that 2026 would see the first billion-dollar company run by a single person. “Hasn’t quite happened yet,” he said. “But we’ve got seven more months.”
Dreaming is available now in research preview. Outcomes and multi-agent orchestration are in public beta and available to all developers on the Claude platform. Whether seven months is enough time for a solo founder to build a billion-dollar business remains an open question — but after Tuesday, they have a few more tools to try.
Every LangChain pipeline your team hardcodes starts breaking the moment the query distribution shifts — and it always shifts. That bottleneck is what Sakana AI set out to eliminate.
Researchers at Sakana AI have introduced the “RL Conductor,” a small language model trained via reinforcement learning to automatically orchestrate a diverse pool of worker LLMs. Conductor dynamically analyzes inputs, distributes labor among workers, and coordinates among agents.
This automated coordination achieves state-of-the-art results on difficult reasoning and coding benchmarks, outperforming individual frontier models like GPT-5 and Claude Sonnet 4 as well as expensive human-designed multi-agent pipelines. It achieves this performance at a fraction of the cost and with fewer API calls than competitors. RL Conductor is the backbone of Fugu, Sakana AI’s commercial multi-agent orchestration service.
Large language models have strong latent capabilities. But tapping these capabilities to their fullest is a great challenge. Extracting this level of performance relies heavily on manually designed agentic workflows, which serve as critical components in commercial AI products.
However, these frameworks fall short because they are inherently rigid and constrained. In comments to VentureBeat, Yujin Tang, co-author of the paper, explained the exact breaking point of current systems: “While using frameworks with hard-coded pipelines like LangChain and Mixture-of-Agents can work well for specific use cases … In production, an inherent bottleneck arises when targeting domains with large user bases with very heterogeneous demands.”
Tang noted that achieving “real-world generalization in such heterogeneous applications inherently necessitates going beyond human-hardcoded designs.”
Another bottleneck for building robust agentic systems is that no single model is optimal for all tasks. Different models are fine-tuned to specialize in distinct domains. One model might excel at scientific reasoning, while another is superior at code generation, mathematical logic, or high-level planning.
Because models have these varying characteristics and complementary skills, manually predicting and hard-coding the ideal combination of models for every query is practically impossible. An optimal agentic framework should be able to analyze a problem and delegate subtasks to the most suitable expert in the pool.
The RL Conductor is designed to overcome the limitations of rigid, human-designed frameworks. As the name implies, it conducts an orchestra of agents by dividing challenging problems, delegating targeted subtasks, and designing communication topologies for a set of worker LLMs.
Instead of relying on fixed code or static routing, the Conductor orchestrates these models by generating a customized workflow. For each step in the workflow, the model generates a natural language instruction for a specific aspect of the task, assigns an agent to carry it out, and defines an “access list” that dictates which past subtasks and responses from other agents are included in that agent’s context.
By defining everything in natural language, the Conductor builds flexible workflows tailored to each input. It can construct simple sequential chains, parallel tree structures, or even recursive loops depending on the problem’s demands.
Importantly, the model learns these strategies not by human design but through reinforcement learning (RL) and reward maximization. During training, the model is given a task, a pool of workers, and a reward signal based on whether its answer and output format are correct.
Through a simple trial-and-error RL algorithm, the model organically discovers which combinations of instructions and communication structures yield the highest reward. As a result, it automatically adopts advanced orchestration strategies such as targeted prompt engineering, iterative refinement, and meta-prompt optimization.
The model learns to dynamically adjust its strategies and leverage the distinct strengths of its worker agents without any human developer having to hard-code the process.
To test RL Conductor in action, the researchers fine-tuned the 7-billion parameter Qwen2.5-7B using the framework. During training, the Conductor was tasked with designing agentic workflows of up to five steps. It was given access to a worker pool containing seven different models: three closed-source giants (Gemini 2.5 Pro, Claude-Sonnet-4, and GPT-5) and four open-source models (including DeepSeek-R1-Distill-Qwen-32B, Gemma3-27B, and Qwen3-32B).
The team evaluated the Conductor across a variety of highly challenging benchmarks, comparing it against individual frontier models acting alone, self-reflection agents prompted iteratively to improve their own answers, and state-of-the-art multi-agent routing frameworks like MASRouter, Mixture-of-Agents (MoA), RouterDC, and Smoothie. The small 7B Conductor set new benchmarks across the board. It achieved an average score of 77.27% across all tasks, hitting 93.3% on the AIME25 math benchmark, 87.5% on GPQA-Diamond, and 83.93% on LiveCodeBench, according to the researchers.
Remarkably, it achieved these marks while remaining highly efficient. While baseline models like MoA burned through 11,203 tokens per question, the Conductor used an average of just 1,820 tokens, taking an average of only three steps per workflow.
A closer look at the experimental details shows exactly why the framework is so effective. The Conductor automatically learned to measure task difficulty. For simple factual recall questions, it often solved the problem in a single step or used a basic two-agent setup. However, for complex coding problems, it built extensive workflows involving up to four agents with dedicated planning, implementation, and verification phases.
The Conductor also learned that frontier models have different strengths. To achieve record scores on coding benchmarks, the Conductor frequently assigned Gemini 2.5 Pro and Claude Sonnet 4 to act as high-level planners, and only brought in GPT-5 at the very end to write the final optimized code. In a particularly clever display of adaptability, the Conductor would sometimes completely abdicate its own role, handing the entire planning process over to Gemini 2.5 Pro and allowing it to dictate the subtasks for the rest of the pool.
Beyond math and coding benchmarks, Sakana AI is already putting the underlying architecture to work in front-office utility. “We have been using our Fugu models based on the Conductor technology internally for various practical enterprise applications: software development, deep research, strategy development, and even visual tasks like slide generation,” Tang said.
While the 7B model described in the research paper was an exploratory blueprint and is not publicly available, Sakana AI has productized the Conductor framework into its flagship commercial AI product, Sakana Fugu. Now in its beta phase, Fugu serves as a multi-agent orchestration system accessible through a standard OpenAI-compatible API.
Tang noted Fugu targets “the large market of industries where AI adoption has yet to bring large productivity gains due to the generalization limitations of current hard-coded pipelines, such as finance and defense.”
For enterprise developers, this allows seamless integration into existing applications without the headache of managing multiple API keys or manually routing tasks across different vendors. Behind the API interface, Fugu automates complex collaboration topologies and role assignments across a pool of models. To support varying business needs, Sakana released two variants: Fugu Mini, built for low-latency operations, and Fugu Ultra, designed for maximum performance on demanding workloads.
Addressing governance concerns around autonomous agents spinning up invisible workflows, Tang pointed out that the interpretability risks are functionally similar to the hidden reasoning traces of current top-tier closed APIs, and the system is managed with established guardrails to minimize hallucinations.
For enterprise architects weighing when to deploy RL-orchestration versus traditional routing, the decision often comes down to engineering resources. “We believe the absolute sweet spot comes whenever users and their teams feel they are spending a disproportionate amount of time guiding their underlying agents,” Tang said. However, he cautioned that the framework isn’t necessary for everything, noting that “it’s hard to beat the economic proposition of a local model running directly on the user’s machine for simple queries.”
As the diversity of specialized open- and closed-source AI models continues to grow, static hardcoded pipelines will inevitably become obsolete. Looking ahead, this dynamic orchestration will likely extend beyond text and code environments. “There is indeed a large potential to fill this gap with cross-modal Conductor frameworks becoming the foundation for more autonomous, self-coordinating physical AI systems,” Tang said.
Presented by Zeta Global
The gap between what AI promises and what it delivers is not subtle. The same model can produce precise, useful output in one system and generic, irrelevant results in another.
The issue is not the model. It’s the context.
Most enterprise systems were not built for how AI operates. Data is scattered across tools. Identity is inconsistent. Signals arrive late or not at all. Systems record events but fail to connect them into a continuous view.
AI depends on that continuity. Without it, the model fills in the gaps so the result looks polished but lacks relevance. This is where most teams get stuck.
A better model does not fix fragmented, stale, or commoditized data. Gartner estimates organizations lose an average of $12.9 million annually due to poor data quality. AI does not solve that problem, it surfaces it faster and at a greater scale.
There is a fast diagnostic test for this. Give your AI a perfect, high-intent customer signal and see what comes back. If the output is generic or irrelevant, the model needs work. But if the model produces something sharp and useful on clean data, and then falls apart on real production data, the problem is the data.
In practice, it is almost always the second scenario. AI functions like a magnifying glass, so strong data systems become dramatically more powerful, and the weak ones become dramatically more visible. Organizations that have been coasting on fragmented, poorly integrated customer data can no longer hide behind reporting lag and manual interpretation. The AI renders the problem in plain sight.
This is really where the next evolution gets interesting. Even after you solve the data quality problem, there is still a second shift underway in how customer profiles are built and used.
For years, enterprise data systems stored content: transactions in CRMs, demographics in data warehouses, campaign responses in marketing platforms. These records described what had already happened. They were useful for reporting but were not built for AI.
AI requires context. Context is not a static record. It is a current view of the customer including recent behavior, cross-channel signals, and emerging intent. The thread that connects one interaction to the next. Identity tells you who someone is. Context tells you what they are doing and what they are likely to do next.
Consider a simple example: ask an AI to recommend a beach vacation destination, and it might suggest Hawaii or Florida. Tell it you have three children, and it surfaces family-friendly options. Give it access to your recent search patterns, your affordability signals, and where you have been searching over the past year, and the recommendation changes entirely because the model is no longer working from demographic categories but from a live picture of who you are and what you are doing right now.
Most enterprise systems were built to store state, not maintain context. They capture events, but they don’t maintain continuity between them.
That’s the gap AI exposes.
But for practitioners, the challenge is not conceptual; it is architectural. Context does not live in a single system. It is fragmented across event streams, product analytics tools, CRMs, data warehouses, and real-time pipelines. Stitching that into something an AI system can actually use requires moving from batch-oriented data models to streaming or near-real-time architectures, where signals are continuously ingested, resolved, and made available at inference time.
This is where many AI initiatives stall. The model is ready, but the context layer is not operationalized. Systems are not designed to retrieve the right signals within milliseconds, or to resolve identity across channels in real time. Without that, “context” remains theoretical rather than actionable.
Architectures like Model Context Protocol (MCP) are accelerating this shift by giving AI systems a way to pass memory about a user between applications, essentially threading a continuous line of context around an individual across different interactions. The result is a profile that becomes richer and more predictive over time, one that creates a line of continuity between what someone has done, what they are doing now, and what they are likely to do next.
When that identity layer is strong, the same model produces better outcomes. When it is weak, no model can compensate.
Organizations that built first-party data systems and durable identity infrastructure before the AI wave are now benefiting from a compounding effect. Better data trains smarter models. Smarter models attract more consented users. More consented users generate richer behavioral signals.
Competitors without that foundation cannot replicate this, regardless of which model they are running. The gap is structural, not algorithmic, and because identity systems improve incrementally over time, the organizations that started investing earlier have advantages that are genuinely hard to close.
The practical implication is a shift in where AI investment goes. The organizations getting consistent results from AI are treating it as a processing layer for a living data system, not as a standalone capability to be bolted onto existing infrastructure.
For builders and operators, this translates into a different set of priorities than the last two years of AI experimentation:
First, instrument for real-time signals. Batch pipelines and nightly refreshes are not sufficient when AI systems are expected to respond to user intent as it happens. Teams need event-driven architectures that capture and surface behavioral signals in near real time.
Second, make context retrievable at inference time. It is not enough to store data in a warehouse. Systems must be designed so that relevant context can be resolved and injected into prompts or retrieved by agents within milliseconds.
Third, invest in identity resolution as infrastructure. Connecting fragmented signals across devices and channels so the system understands real individuals rather than anonymous interactions is foundational, not optional.
Fourth, treat governance and consent as part of system design. First-party data built on trust is not just safer; it is more durable and ultimately more valuable than third-party data that competitors can access.
These investments are less visible than a new model launch and are also far harder to copy.
Models are now interchangeable. The difference will come from who can operationalize context at scale and treat the model as a processing layer, not the advantage.
That advantage comes from years of investment in identity infrastructure, first-party data, and systems that keep customer context current.
The organizations that win won’t be the ones with better prompts. They’ll be the ones whose systems understand the customer before the prompt is ever written.
Neej Gore is Chief Data Officer at Zeta Global.
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
Presented by NutanixAcross industries, organizations are focused on how to move from AI pilots, proofs of concept, and cloud-based experimentation to deploying it at scale — across real workloads, for real users, in real business environments. VentureB…
OpenAI updated the default model for ChatGPT to its new GPT-5.5 Instant, along with a new memory capability that finally shows which context shaped responses — at least some of them.
This limitation signals that models are starting to create a second, incomplete memory observability layer that could conflict with existing audit systems and agent logs.
GPT-5.5 Instant replaces GPT-5.3 Instant as the default ChatGPT model and is a version of its new flagship GPT-5.5 LLM. It’s supposed to be more dependable, accurate and smarter than 5.3.
But it’s the introduction of memory sources, which will be enabled across all models in the platform, that could help enterprises in their projects.
“When a response is personalized, you can see what context was used, such as saved memories or past chats, and delete or correct it if something is outdated or no longer relevant,” OpenAI said in a blog post.
When a user asks ChatGPT something, users can tap the sources button (at the bottom of the response) to see which files or past chats the model tapped to find the answer. Users also have full control over the sources models can cite, and these sources will not be shared if the conversation is sent to others.
The company said memory sources should make it easier to personalize model responses. Still, OpenAI admitted that the models “may not show every factor that shaped an answer” and promised to make the capability more comprehensive over time.
What this means is that memory sources offer a semblance of observability in ChatGPT answers, but not full auditability yet.
Enterprises have a system in place to solve part of the memory and context problem with models and agents.
Models are exposed to context through retrieval-augmented generation (RAG) pipelines; whatever the agent fetches from the vector databases is logged, and the agent’s state is stored in a memory layer. All of this is tracked in application logs, usually in an orchestration or management layer with built-in observability. Ideally, this allows teams to trace failure back through the stack.
The current system is imperfect; sometimes, it’s not easy to trace failure points, but it’s at least internally consistent. For enterprises using ChatGPT, whether the default GPT-5.5 Instant or their model of choice, that’s no longer the case.
The model surfaces its own version with memory sources that are wholly separate from existing retrieval logs — in short, a model-reported context. A problem arises if these cannot be reconciled reliably. And because memory sources only give users part of the picture — it’s unclear what ChatGPT’s limit on citing memory sources is — it becomes even harder to match what GPT-5.5 Instant said it tapped to what it actually did in the production environment.
This situation creates a new failure mode: A competing context log. If something seems wrong, it can create inconsistencies that enterprises have to deal with.
Malcolm Harkins, chief trust and security officer at HiddenLayer, told VentureBeat that memory sources “look like a pragmatic middle ground ” in offering some transparency, but it’s still not easy to see its value.
“For enterprises, it’s directionally useful but insufficient on its own,” Harkins said. “Real value will depend on how it integrates with security, governance, access controls and audit systems.”
However, GPT-5.5 Instant handles memory, and OpenAI calls it an improvement over GPT-5.3 Instant.
Internal evaluations showed GPT-5.5 Instant returned 52.5% fewer hallucinated claims than the previous default model, especially for high-stakes domains such as medicine, law, and finance. Inaccurate claims fell by 37.3% on challenging conversations. The company said the model improved on photo analysis and image uploads, answering STEM questions and knowing when to tap its own knowledge base or use web search.
Peter Gostev, AI capability at independent model evaluator Arena, explained to VentureBeat in an email that the key result to watch about GPT-5.5 Instant is how it performs on the overall text rankings, especially because its predecessor did not have a strong showing.
“Since GPT-4o, the strongest-performing OpenAI chat model on the Arena has been GPT-5.2-Chat, which still ranks 12th on the Overall Text Arena months after release,” Gostev said. Notably, users preferred it even over the higher-reasoning GPT-5.2-High variant, which is currently ranked 52nd on the Arena. “By comparison, GPT-5.3-Chat, the previous default model in ChatGPT, was significantly less competitive, ranking 44th overall, 32 places below GPT-5.2-Chat.”
Organizations that rely on ChatGPT for some tasks will need to formalize how memory works for their stack. Memory sources are not limited to GPT-5.5 Instant; it is enabled for all models on the ChatGPT platform.
To address the problem of competing memory sources, enterprises have to audit their memory management. Model-reported context could overlap or contradict these logs, so it’s best to define a clear source of truth. In the event of a failure, administrators know which log to believe.
It would also be a good idea to decide whether or not to expose memory sources to users. ChatGPT only shows a select number of chats or files it used to complete a request. Some users may find more transparency trustworthy.
Ultimately, the number one thing for enterprises to remember about memory sources is that what the model reports as its context is not the full picture for auditing. It’s a form of observability, but it cannot withstand a full examination.