Inside AMEX’s agentic commerce stack: How intent contracts and single-use tokens enforce AI transactions

American Express (Amex) is building a system that lets AI agents shop and pay on behalf of users — but right now it’s only within its own payment network, and still involves a black box that could hinder trust and auditability.

Amex already participates in agentic commerce protocol projects, especially Google’s Agent Pay Protocol (AP2), which focuses on interoperability. Amex’s Agentic Commerce Experiences (ACE) developer kit, on the other hand, touches on something most protocols currently lack: Full transaction control in the payment layer. 

But it still isn’t completely transparent in how it handles validation. ACE uses a closed-loop system — serving as both the card issuer and the payment network — to validate agent-led transactions. 

Luke Gebb, Amex’s EVP and global head of innovation, told VentureBeat that the company believes this model is the missing piece in agentic commerce.  

“Some of what is missing so far is the perspective of a company like ours: We feel that trust and security are critical to advancing this space,” Gebb said. “This is really the first time that an issuer is coming to the table.”

Amex sits in that interesting space: Unlike other financial institutions or card providers like Chase or Bank of America, Amex can route transactions through its American Express Network. Visa and Mastercard are two of the most well-known payment networks, but these companies don’t issue cards themselves and must work with a bank.

The continued black box of agentic commerce 

The ACE kit is just one approach to addressing some of agentic commerce’s biggest problems: trust, control, accountability, validation, and security. 

Consumers generally don’t want rogue agents to run away with their bank accounts and start buying things. Merchants don’t want to be stuck with unpaid items. Banks don’t want to deal with an influx of chargebacks and the potential for fraud. 

Projects like the ACE kit aim to build trust and accountability by verifying an agent’s identity and goals. This can build the trust agentic commerce desperately needs.

Amex claims it offers validation, too, although the process behind that is unclear. It is abstracting how it performs validation, even though it explains at which layer it does it. More traditional systems feature a mix of deterministic checks and a flexible, semantic evaluation that helps match intent and outcome for validation. Amex said agents built with ACE can submit user shopping carts and check them against the agent’s original intent. However, they did not disclose how this works.

Practitioners building to the agentic commerce ecosystem lament that, despite strides in creating a trust layer, many black boxes remain that could hinder widespread adoption.

Raj Ananthanpillai, founder and CEO of identity and verification system provider Trua, told VentureBeat that payment protocols and software kits like Agentic Commerce Suite from Stripe, Google’s Verifiable Intent proof chain, and the ACE developer kit “excel at handling proofs, verifiable authorizations and the mechanics of fund movement, but leave upstream human validation opaque and underdeveloped.”

Ananthanpillai continued: “Without a clear, high-assurance cryptographic link proving that an agent is acting under the explicit authority of a verified human owner, merchants, issuers, and networks face heightened risks of repudiation, massive chargebacks, sanctioned people conducting financial transactions, and fraud.”

The ACE kit

The ACE developer kit solves several running issues with agentic commerce, Gebb said, and gives developers access to integrated services:

  • Agent registration

  • Account enablement

  • Intent intelligence

  • Payment credentials 

  • Cart context

First, it deals with agent registration, establishing identity and trust with both the consumer and company agents. When a transaction begins, the agent acting on behalf of the customer and the merchant’s agent can verify each other’s identities and trust that they are dealing with the correct entity. 

Next comes account enablement, which links the user’s Amex account to their agent and grants the agent permission to act, or, in the case of agentic commerce, buy something.

Intent intelligence creates what Amex calls an intent contract, where the user defines what they want the agent to do. Once the intent is defined, the ACE system generates an Intent ID and a Proof of Intent Token that definitively proves authorization in the event of a dispute.

Amex handles the actual transaction part, where the user pays for the product through a single-use token. ACE establishes payment credentials used for the transaction, bound to intent and constraints. 

“Once the agent has found the item that the customer has asked for, like red shoes, they’ll make a call for the payment credentials, which is a token that has the boundaries that the card member has provided,” Gebb said. “So, for instance, if they said they only wanted to spend $500, that token won’t allow for a purchase of $600 because it has controls built in.”

The last piece is cart context and validation, which Gebb said helps banks and brands compare a user’s cart that their agent submitted to their intent. 

Amex’s approach shows that for agentic commerce to really soar, providers must understand what systems will allow agents to do and who is ultimately accountable if something goes wrong. 

Salesforce launches Agentforce Operations to fix the workflows breaking enterprise AI

Enterprise AI teams are hitting a wall — not because their models can’t reason, but because the workflows underneath them were never built for agents. Tasks fail, handoffs break, and the problem compounds as organizations push agents deeper into back-office systems. A new architectural layer is emerging to address it: workflow execution control planes that impose deterministic structure on processes agents are expected to run.

One of the companies bringing this to the forefront is Salesforce, with a new workflow platform that turns back-office workflows into a set of tasks for specialized agents to complete. Users can upload their processes or use one of the set Blueprints provided by Salesforce, and Agentforce Operations will break it down for agents. 

Salesforce senior vice president of Product, Sanjna Parulekar, told VentureBeat in an interview that the problem is that many enterprise workflows are not built for agents. “What we’ve observed with customers is that a lot of times, the brokenness in a process is probably in your product requirements document,” Parulekar said. “So when that’s uploaded into a product, it doesn’t quite work. We can optimize it and cut out some things and replace it with an agent.”

Without this control panel layer, enterprises could risk deploying agents that increase cost rather than fix their workflow problems.

Making the workflow work for agents, not just humans

Enterprises deploying agents are learning a costly lesson: Their workflows were designed around human judgment gaps, not machine execution. Processes that evolved through years of workarounds — loosely defined steps, implicit decisions, coordination that depends on individuals knowing what to do next — break when agents are asked to follow them literally.

Even with all of an enterprise’s context at its fingertips, AI systems will have difficulty completing tasks if it is not clear what it’s supposed to do. 

Parulekar said her team found that focusing on what makes the process tick and breaking it down into more explicit steps and workflows makes the system more deterministic. Then, when platforms like Agentforce Operations introduce agents, those agents already know their specific tasks.  

“It forces companies to rethink their processes and introduces observability into the mix because of the session tracing model in the system,” she said.

Parulekar said human checks can be built into the system, so the process is more transparent.

What makes this approach different from other workflow automation offerings is that it doesn’t rely on agents to decide what to do next; the system does. Unlike more traditional automation tools that route tasks and agents on probabilistic decision-making, this enforces execution on a more pre-defined, deterministic structure.

The problem it introduces

Codifying a workflow doesn’t fix a broken one. If a process has flawed steps, encoding it for agents locks in the problem at scale. And once workflows are distributed across agents, the challenge shifts from execution to governance: who owns the process, who validates it, and how it evolves when business conditions change.

It puts the onus on teams to take a hard look at what works for them and what doesn’t.

Organizations need to consider that, along with the execution control plane offered by platforms like Agentforce Operations, someone should be made responsible for task completion and success. 

Brandon Metcalf, founder and CEO of workforce orchestration company Asymbl, told VentureBeat in a separate interview that the key to both humans and agents following a workflow is a shared goal. 

“You have to understand the goal or the agent or human won’t complete the task successfully,” Metcalf said. “Someone has to manage that outcome that has to be delivered. It can be a person or an agent.”

The bottleneck has moved. As Metcalf framed it, the question is no longer whether agents can reason through a task, it’s whether the workflow underneath them is coherent enough to execute. For enterprises that built their processes around human judgment and institutional memory, that’s a harder fix than swapping in a smarter model.

Alibaba’s Metis agent cuts redundant AI tool calls from 98% to 2% — and gets more accurate doing it

One of the key challenges of building effective AI agents is teaching them to choose between using external tools or relying on their internal knowledge. But large language models are often trained to blindly invoke tools, which causes latency bottlenecks, unnecessary API costs, and degraded reasoning caused by environmental noise. 

To overcome this challenge, researchers at Alibaba introduced Hierarchical Decoupled Policy Optimization (HDPO), a reinforcement learning framework that trains agents to balance both execution efficiency and task accuracy. 

Metis, a multimodal model they trained using this framework, reduces redundant tool invocations from 98% to just 2% while establishing new state-of-the-art reasoning accuracy across key industry benchmarks. This framework helps create AI agents that are not trigger-happy and know when to abstain from using tools, enabling the development of responsive and cost-effective agentic systems.

The metacognitive deficit

Current agentic models face what the researchers call a “profound metacognitive deficit.” The models have a hard time deciding when to use their internal parametric knowledge versus when to query an external utility. As a result, they blindly invoke tools and APIs, like web search or code execution, even when the user’s prompt already contains all the necessary information to resolve the task.

This trigger-happy tool-calling behavior creates severe operational hurdles for real-world applications. Because the models are trained to focus almost entirely on task completion, they are indifferent to latency. These agents frequently hit exorbitant tool call rates. Every unnecessary external API call introduces a serial processing bottleneck, turning a technically capable AI into a sluggish system that frustrates users and burns through tool budgets.

At the same time, burning computational resources on excessive tool use does not translate to better reasoning. Redundant tool interactions inject noise into the model’s context. This noise can distract the model, derailing an otherwise sound chain of reasoning and actively degrading the final output.

To address the latency and cost issues of blind tool invocation, previous reinforcement learning methods attempted to penalize excessive tool usage by combining task accuracy and execution efficiency into one reward signal. However, this entangled design creates an unsolvable optimization dilemma. If the efficiency penalty is too aggressive, the model becomes overly conservative and suppresses essential tool use, sacrificing correctness on arduous tasks. Conversely, if the penalty is mild, the optimization signal loses its value and does not prevent tool overuse on simpler tasks.

Furthermore, this shared reward creates semantic ambiguity, where an inaccurate trajectory with zero tool calls might yield the same reward as an accurate trajectory with excessive tool usage. Because the training signals for accuracy and efficiency become entangled, the model can’t learn to control tool-use without degrading its core reasoning capabilities.

Hierarchical decoupled policy optimization

To solve the optimization dilemma of coupled rewards, the researchers introduced HDPO. HDPO separates accuracy and efficiency into two independent optimization channels. The accuracy channel focuses on maximizing task correctness across all of the model’s rollouts. The efficiency channel optimizes for execution economy.

HDPO computes the training signals for these two channels independently and only combines them at the final stage of loss computation. The efficiency signal is conditional upon the accuracy channel. This means that an incorrect response is never rewarded simply for being fast or using fewer tools. This decoupling avoids situations where accuracy and efficiency gradients cancel each other out, providing the AI with clean learning signals for both goals.

The most powerful emergent property of this decoupled design is that it creates an implicit cognitive curriculum. Early in training, when the model still struggles with the task, the optimization is dominated by the accuracy objective, forcing the model to prioritize learning correct reasoning and knowledge. As the model’s reasoning capabilities mature and it consistently arrives at the right answers, the efficiency signal smoothly scales up. This mechanism causes the model to first master task resolution, and only then refine its self-reliance by avoiding redundant, costly API calls.

To complement HDPO, the researchers developed a rigorous, multi-stage data curation regime that tackles severe flaws found in existing tool-augmented datasets. Their data curation pipeline covers supervised fine-tuning (SFT) and reinforcement learning (RL) stages.

For the SFT phase, they sourced data from publicly available tool-augmented multimodal trajectories and filtered them to remove low-quality examples containing execution failures or feedback inconsistencies. They also aggressively filtered out any training sample that the base model could solve directly without tools. Finally, using Google’s Gemini 3.1 Pro as an automated judge, they filtered the SFT corpus to only keep examples that demonstrated strategic tool use.

For the RL phase, the curation focused on ensuring a stable optimization signal. They filtered out prompts with corrupted visuals or semantic ambiguity. The HDPO algorithm relies on comparing correct and incorrect responses. If a task is trivially easy where the model always gets it right, or prohibitively hard where the model always fails, there is no meaningful mathematical variance to learn from. The team strictly retained only prompts that exhibited a non-trivial mix of successes and failures to guarantee an actionable gradient signal.

Metis agent: HDPO  in action

To test HDPO in action, the researchers used the framework to develop Metis, a multimodal reasoning agent equipped with coding and search tools. Metis is built on top of the Qwen3-VL-8B-Instruct vision-language model. The researchers trained it in two distinct stages. First, they applied SFT using their curated data to provide a cold-start initialization. Next, they applied RL using the HDPO framework, exposing the model to multi-turn interactions where it could invoke tools like Python code execution, text search, and image search.

The researchers pitted Metis against standard open-source vision models like LLaVA-OneVision, text-only reasoners, and state-of-the-art agentic models including DeepEyes V2 and the 30-billion-parameter Skywork-R1V4. The evaluation spanned two main areas: visual perception and document understanding datasets like HRBench and V*Bench, and rigorous mathematical and logical reasoning tasks like WeMath and MathVista.

On all tasks, Metis achieved state-of-the-art or highly competitive performance, outperforming existing agentic models — including the much larger 30-billion-parameter Skywork-R1V4 — across both visual perception and reasoning tasks.

Equally important is the anecdotal behavior Metis showed in the experiments. For example, when presented with an image of a museum sign and asked what the center text says, standard agentic models waste time blindly writing Python scripts to crop the image just to read it. Metis, however, recognizes that the text is clearly legible in the raw image. It skips the tools entirely and uses a single inference pass.

In another experiment, the model was given a complex chart and asked to identify the second-highest line at a specific data point within a tiny subplot. Metis recognized that fine-grained visual analysis exceeded its native resolution capabilities and could not accurately distinguish the overlapping lines. Instead of guessing from the full image, it invoked Python to crop and zoom in exclusively on that specific subplot region, allowing it to correctly identify the line. It treats code as a precision instrument deployed only when the visual evidence is genuinely ambiguous, not as a default fallback.

The researchers released Metis along with the code for HDPO under the permissive Apache 2.0 license.

“Our results demonstrate that strategic tool use and strong reasoning performance are not a trade-off; rather, eliminating noisy, redundant tool calls directly contributes to superior accuracy,” the researchers conclude. “More broadly, our work suggests a paradigm shift in tool-augmented learning: from merely teaching models how to execute tools, to cultivating the meta-cognitive wisdom of when to abstain from them.”

Writer launches AI agents that can act without prompts, taking on Amazon, Microsoft and Salesforce

Writer, the enterprise AI agent platform backed by Salesforce Ventures, Adobe Ventures, and Insight Partners, today launched event-based triggers for its Writer Agent platform, enabling AI agents to autonomously detect business signals across Gmail, Gong, Google Calendar, Google Drive, Microsoft SharePoint, and Slack — and execute complex multi-step workflows without any human initiating the process.

The release, which also includes a new Adobe Experience Manager connector and a suite of enhanced governance controls such as bring-your-own encryption keys and a Datadog observability plugin, represents Writer’s most aggressive bet yet on fully autonomous enterprise AI. It arrives at a moment when AWS, Salesforce, and Microsoft are all racing to establish their own agentic platforms, and when the question of how much autonomy enterprises will actually hand to AI agents remains deeply unresolved.

“We are launching a series of event triggers that power and drive our playbooks to be more proactively called,” Doris Jwo, Writer’s VP of Product Management, told VentureBeat ahead of the announcement. “We’re building on the ecosystem to actually for these connectors, such as SharePoint, Google Drive, Gong, Gmail, Google Calendar, actually listen for events happening in those platforms, so that the agent can practically know that something happened externally, and then, where relevant, call a certain playbook to be actually run live in real time, without any sort of human intervention required.”

The shift from reactive to proactive AI agents marks a critical inflection point for enterprise software. Until now, most AI assistants — including Writer’s own platform — required a human to initiate every interaction. A marketer had to open a chat window and ask for help. A salesperson had to prompt a research brief. The new event-based triggers flip that dynamic entirely: the system watches for business events and acts on its own.

Why Writer decided humans were the weakest link in enterprise AI workflows

Writer’s push toward autonomous triggers stems from a practical observation its product team made as enterprise customers scaled their use of the platform’s playbooks — the reusable, natural-language workflows that Writer introduced in November 2025 to let business users automate recurring tasks without writing code.

“What we found is, as playbooks continue to get integrated into enterprise workflows, it’s actually humans that become the bottleneck in making sure that playbooks get triggered,” Jwo said. “This really kind of solves that problem, to make sure that that sort of always-on, proactive, autonomous nature of that agent has continued to be built on.”

The mechanics work like this: Writer’s connectors, which already provided read and write access to third-party enterprise tools, now also listen for specific events — an email arriving in Gmail, a sales call completing in Gong, a new file landing in a Google Drive folder, a meeting starting or ending on Google Calendar, a message posted in Slack. When the system detects a qualifying event, it triggers a predefined playbook that executes a multi-step workflow autonomously.

Consider the use case Jwo described for marketing teams already running on Writer’s platform. An email campaign workflow typically begins when a creative brief lands in a Google Drive folder. From there, multiple team members coordinate through Slack to assemble research, build assets, draft copy, review graphics, and package everything for a campaign management tool. Writer’s event-based triggers collapse much of that chain: the moment a brief hits the designated folder, the system automatically fires a cascade of playbooks that assemble the research, generate the assets, and prepare deliverables for human review.

“All the playbooks that our customers have been building with us to build all those each individual pieces now just get automatically triggered the minute that initial brief kind of hits the Google Drive folder,” Jwo said. “That’s, I think, a very common workflow for most of these marketing sort of, like, content-heavy use cases, where it’s multiple parties involved, it’s a lot of assets coming together in a cascade.”

How Writer’s AI reasoning engine separates it from simple automation tools like Zapier

The comparison to Zapier — the popular automation tool that connects thousands of apps through if-this-then-that logic — is inevitable, and Jwo addressed it directly.

“It’s more than just an LLM in the middle,” she said. “It is an agent with reasoning and then access to a really powerful set of tools that includes connectors, that includes its own virtual sandbox, which enables it to do things like write and execute code on the fly and create those assets.”

The distinction matters for understanding where Writer sits in an increasingly crowded landscape. Zapier and similar workflow automation tools require users to manually define rigid logic paths, specifying exact conditions and actions in a deterministic sequence. Writer’s approach uses its Palmyra-powered reasoning engine to process event context and make real-time execution decisions. Users describe their goals in natural language rather than dragging around boxes and defining conditional branches.

“It’s not quite Zapier, because I think it requires a lot more — it’s more rigid,” Jwo said of traditional automation tools. “It requires more manual kind of setup to define the logic and the roles and the conditions for which a workflow has to be run.” Writer’s playbooks, by contrast, allow “a simple idea to turn into something that’s actually executable and repeatable,” she added, noting that builds take “hours and days, not weeks and months.”

This natural-language accessibility has been central to Writer’s strategy since it introduced the Agent platform and playbooks last November. The company has consistently positioned itself as a platform that puts power in the hands of business users — marketers, sales teams, operations leads — rather than requiring engineering resources to build and maintain AI workflows. Writer CEO May Habib made this case forcefully at Davos earlier this year, arguing that the leaders pulling ahead are those entering what she called “rebuild mode” — stripping workflows down to outcomes and eliminating what she described as the “coordination tax” of endless handoffs, status meetings, and alignment emails.

The event-based triggers extend that philosophy to its logical conclusion. If business users can build playbooks in natural language, and those playbooks can now fire automatically based on real-world business events, then the entire loop from signal to action can operate with minimal human involvement.

Inside the governance controls Writer built to make autonomous AI agents safe for regulated enterprises

That level of autonomy raises obvious concerns, and Writer appears to understand that governance is the linchpin of the entire strategy. The company paired its trigger launch with a substantial expansion of its administrative controls — a combination that suggests Writer views enterprise trust as its primary competitive weapon.

The new governance features include Connector Profiles, which allow administrators to configure multiple versions of the same connector with different permissions per team; Writer Agent Profiles for deploying customized agent configurations with specific capability toggles and security settings; AI Studio Observability for auditable tracking of every agent interaction; a Datadog Logs Plugin that forwards every LLM request and response as structured log events; and bring-your-own encryption key support through AWS, Azure, or GCP key management services.

“A really important part of that, and a baseline, sort of foundation for everything that we roll out, is our observability and governance platform,” Jwo told VentureBeat. “When connectors are set up, admins have full control over connector access, what is set up, who has access, which teams exactly are those access granted to, as well as individually, which exact tools do teams are able to call.”

The observability story extends to the individual user level as well. Jwo described Writer Agent’s user experience as built around progressive disclosure — clean initial views that users can expand to inspect the full chain of reasoning behind any agent action. “You can drill down to the actual tool call level,” she said. “You’d actually have the ability to look at specifically what web search results were pulled, what connector was called, what tool called, what succeeded, what failed, how did the agent divert its path to fulfill your goal.”

This transparency architecture reflects a broader conviction Writer has articulated through what it calls “The Agentic Compact” — a framework the company published for responsible AI that emphasizes foundational transparency, auditability, and human oversight. Dan Bikel, Writer’s head of AI, has argued publicly that the industry’s obsession with model scale has created what he calls a “transparency paradox,” leaving businesses with powerful tools they cannot fully understand or control. Writer’s governance-first approach to autonomous triggers represents the operational expression of that philosophy.

Writer also introduced its agent supervision suite in December 2025, offering centralized monitoring, agent approval workflows, global guardrails, and integrations with external observability and security platforms like Datadog, Noma, and Lakera. The event-based triggers now extend that governance framework to cover actions initiated without any human in the loop — a meaningfully harder problem.

Writer takes aim at AWS, Salesforce, and Microsoft in the escalating agentic platform wars

The timing of Writer’s announcement is not accidental. The enterprise agentic AI market has entered a period of intense platform competition, with the largest technology companies in the world staking claims to the same territory Writer occupies.

Jwo acknowledged the pressure directly when asked why a CIO would choose Writer over established vendor relationships with AWS, Salesforce, or Microsoft — all of which have announced agentic platforms of their own.

“At the baseline, I think we have all the pieces to be fully enterprise-grade and ready,” Jwo said. But she argued that Writer’s real advantage lies in accessibility for non-technical users. “A lot of the challenge has been: how do we get business users to actually be able to build these powerful workflows in a way that maybe a technical user, using coding agents, can do very quickly and well, but the typical business user is not accustomed to anything beyond typical prompting to actually create?”

That positioning — enterprise-grade capabilities wrapped in a business-user-friendly interface — has been Writer’s core differentiation since the company’s founding in 2020. It is also the reason Writer has attracted strategic investment from Salesforce Ventures and Adobe Ventures, both of which are building their own AI platforms but apparently see value in Writer’s approach to the business-user segment.

The company’s March 2026 release of Skills — reusable building blocks that encode a team’s specific methodologies, quality standards, and decision frameworks into the Agent platform — reinforced this direction. Skills allow marketing teams, for instance, to capture exactly how their best strategist structures competitive analysis or formats campaign briefs, then make that expertise available to every team member and every playbook across the organization. Combined with event-based triggers, the result is a system where institutional knowledge executes automatically in response to real-world business events.

Writer’s 2026 AI adoption survey, conducted with Workplace Intelligence and covering 2,400 global executives, found that 79% of enterprises face AI adoption challenges despite high investment — and that organizations with strong change management programs are six times more likely to reach production. Writer CMO Diego Lomanto has argued that the real barrier to AI adoption is not technology but trust, writing that “they treat resistance as a training problem when it’s actually a trust problem.” The governance-heavy approach to event-based triggers appears designed to address exactly that dynamic.

Salesforce, SAP, and Workday triggers are next as Writer expands its connector roadmap

Writer’s initial event trigger support covers Gmail, Gong, Google Calendar, Google Drive, SharePoint, and Slack — tools that Jwo described as “generally the most applicable to every end user.” But the company has its eye on deeper enterprise system integration.

When asked about CRM and ERP triggers for systems like Salesforce, SAP, and Workday, Jwo confirmed these are within the scope of the roadmap. “You can imagine, you know, a Salesforce opportunity is created that may trigger a cascade of events that happens,” she said. “You might want to set up the right assets, maybe the right customer environment, all sorts of things can kind of cascade from that.”

The connector ecosystem has been a strategic priority since Writer launched its MCP (Model Context Protocol) gateway in November 2025, providing governed agent access across enterprise systems including Microsoft 365, Google Workspace, HubSpot, Gong, PitchBook, FactSet, and others. The addition of Adobe Experience Manager in this release gives marketing teams direct read/write access to pages, fragments, and digital assets in Adobe’s content management system — a connector that closes the gap between AI-generated content and published output.

Jwo clarified that in most integration scenarios, Writer Agent delivers content in a draft state rather than publishing it directly. “Writer Agent basically accomplishes the majority of the workload — pulling together the assets, making the changes and presenting — and then hopefully a person just has to go through the last three or so final steps to get it out,” she said.

The real question enterprise AI must answer: how much autonomy is too much autonomy

The degree of autonomy enterprises are comfortable granting their AI agents remains one of the most consequential open questions in the industry. Jwo acknowledged that most customers still maintain human checkpoints in their workflows.

“You can also build in instructions into our playbooks to say, ‘Hey, before you move on to a next playbook, make sure that you check with me. I want to take a look, and then if I hit go, then you’re good to go,'” she said. The agent can also be designed with self-QA capabilities, validating outputs against known pitfalls before proceeding.

Writer plans to expand these checkpoint capabilities in the coming quarter, adding the ability to specify not just that a checkpoint is required but which specific person must respond and what types of responses are expected — essentially building a formal approval workflow into the autonomous trigger chain.

Jwo characterized the current system as a hybrid: the platform listens deterministically for predefined events, but the agent applies reasoning to decide what action to take — or whether to act at all. “The agent has the ability to process what happened, understand the context of it, and understand the intent of what you want to do, so it can make that decision,” she said. “You’re just saying, like, ‘Hey, the goal might be feedback is coming in, and we want to triage that in real time. And some things we might not want to action on, some things we do.’ You basically just explain that to the agent.”

She views this release as a stepping stone toward a future where agents are “even more mission-driven, and less governed by even like a set of instructions or roles” — a future where the AI doesn’t just respond to triggers but proactively identifies when action is needed based on broader organizational goals.

For now, Writer is betting that the combination of autonomous triggers, robust governance, and business-user accessibility will be enough to carve out defensible territory in an enterprise AI market where the biggest technology companies in the world are all converging on the same set of capabilities. The company’s argument is that having the foundational pieces is not enough — what matters is making those pieces work together in a way that non-technical business users can build, manage, and trust.

It is, in other words, the same wager Writer has been making since 2020 — that the future of enterprise AI belongs not to the platform with the most powerful model, but to the one that can get an entire organization to actually use it. The difference now is that the agents don’t wait to be asked.

Event-based triggers, new connectors, and enhanced governance controls are available immediately to Writer enterprise customers.

Netomi raises $110 million as Accenture and Adobe bet on AI for customer service

Netomi, the San Francisco-based startup building AI systems for enterprise customer service, said Thursday that it has raised $110 million in new funding in a round led by Accenture Ventures, with participation from Adobe Ventures, WndrCo, Silver Lake Waterman, NAVER Ventures, Metis Strategy and Fin Capital. Jeffrey Katzenberg, managing partner of WndrCo and co-founder of DreamWorks, has joined the company’s board. The round builds on early backing from a roster of AI luminaries that includes OpenAI co-founder Greg Brockman, Google DeepMind co-founder Demis Hassabis and Microsoft AI CEO Mustafa Suleyman.

On its face, the financing is another large AI round in a market still awash in capital. But the deal is more revealing than that. It suggests that a new line is being drawn inside enterprise AI — not between companies that have a chatbot and companies that do not, but between companies that can show AI works in the messy, brittle, heavily governed environments where large businesses actually operate, and those that still mostly shine in demos.

The market around Netomi makes the stakes clear. Sierra, the AI agent startup led by former Salesforce co-CEO Bret Taylor, raised $350 million at a $10 billion valuation in September 2025 and has since made three acquisitions in 2026 alone. Decagon tripled its valuation to $4.5 billion in January 2026 with a $250 million Series D. Salesforce, ServiceNow and Intercom are all racing to embed AI agents into their existing platforms; Intercom’s Fin AI agent reportedly crossed $100 million in annual recurring revenue at $0.99 per resolution. Gartner predicts that 40 percent of enterprise applications will include task-specific AI agents by the end of 2026, up from less than 5 percent in 2025.

Against that backdrop, Netomi’s $110 million round is not the largest in the category, but it may be the most strategically constructed. The combination of Accenture’s enterprise consulting network, Adobe’s dominance in digital experience management and Netomi’s track record in production deployments represents a coordinated play to embed AI not as a chatbot layer on top of websites, but as the fundamental intelligence governing how entire digital experiences behave.

The company did not disclose its valuation, and in an interview tied to the announcement, Netomi executives declined to provide revenue or profitability figures. Instead, Chief Executive Puneet Mehta pointed to customer economics, saying a typical large deployment can generate at least tens of millions of dollars in impact, with some customers on a path to hundreds of millions.

For technical decision-makers, though, the more important part of Thursday’s news may be the partnerships attached to the money.

Why Accenture and Adobe turned a venture deal into a global distribution play

The structure of the deal reads like a map of how enterprise AI gets bought in 2026.

Alongside the investment, Accenture has entered a global alliance with Netomi to bring the platform to its Fortune 100 client base worldwide. The alliance will involve hundreds of Accenture team members receiving training on Netomi’s platform — a meaningful commitment from the world’s largest consulting firm and a distribution channel that few AI startups can match. Adobe Ventures’ participation comes with plans to integrate Netomi into Adobe’s Brand Concierge agentic ecosystem, giving Netomi a path into the software layer many large brands already use to manage websites, content and digital journeys. Metis Strategy brings access to CIO advisory channels. Ndidi Oteh, CEO of Accenture Song, said in the press release that the partnership is designed to help clients “reinvent how they serve their customers — seamlessly, responsibly and at scale.”

The result is not just more cash. It is a distribution network wrapped around a thesis.

Justin Wexler, a partner at WndrCo who led the firm’s Series B investment in Netomi in 2021, said most companies in the customer experience space are simply swapping a human for an AI. “That’s the extent of what they’re building,” Wexler said. “What we’re doing at Netomi, particularly with the Adobe partnership, is leapfrogging that altogether — merging the two layers. You don’t have a ‘How can I help you?’ chatbot. This is anticipating the issue and eliminating the ticket altogether.”

The distinction matters because it describes a fundamentally different kind of product. Most customer service AI still sits downstream. A customer encounters a problem, opens a chat window, explains the issue and waits for a response. Even when AI speeds up that exchange, the friction has already happened. Netomi wants to move upstream, into the experience before the ticket exists.

Mehta described the idea in blunt economic terms. “Why are there so many customer service tickets? Why is $500 billion spent on human labor answering customer service phone calls, emails and chats?” he asked. “What we realized is that the world’s largest companies wait for a problem to happen and then jump on it to solve it — but by that time, they’ve already created a lot of frustration, and it’s very expensive to do that.”

The answer, in Mehta’s view, is not to make downstream customer service faster with AI. It is to prevent the service ticket from being created in the first place. That logic sits behind almost every strategic decision the company has made — including the Adobe partnership.

“Most important websites run on Adobe Experience Manager,” Mehta said. “So we’re saying, what if we bring that kind of context and awareness upstream — capturing that a customer might be affected before it even turns into a customer service ticket.”

The Wall Street trading floor origins behind Netomi’s AI architecture

To understand what Netomi is building, you have to understand where its founder came from.

Mehta, who spent his early career constructing automated trading engines on Wall Street, told VentureBeat that the founding thesis was deceptively simple. “When we started Netomi, the core thesis was that AI is going to become the new customer interface,” he said. “The Transformers [paper] did not exist, so we had literally stitched together a set of different models to create the same end result.”

That background in low-latency finance is not incidental. It is the intellectual architecture that undergirds everything Netomi builds. When asked what connects trading systems to customer experience platforms, Mehta drew a direct line.

“If you think about the low-latency trading world, that was the first technology application to use situational awareness and a variety of different signals at scale,” he said. “There was not one signal that it was making decisions on. You needed market data feeds. You needed situational awareness. You needed news. You needed awareness of your own book of business. You needed your own risk assessment.”

That multi-signal architecture, Mehta argued, translates directly to what enterprise customer experience demands. Rather than waiting passively for a customer to describe a problem — the way traditional chatbots and even most current AI agents operate — Netomi’s system attempts to reconstruct the full situation before it acts. The request itself is only part of the story.

“What the customer tells you is very important, but the situation the customer is in is sometimes even more important,” Mehta said. “What if we borrowed that design pattern we built for low-latency trading? Because we can probably know why the customer is calling us. And if we can know that, we could maybe even reach out to them before they reach out to us and solve the problem.”

He summarized the philosophical distinction this way: “What large language models by themselves did was they essentially democratized just raw intelligence. We are democratizing context, and that changes everything.”

That is a sharp line, and also a revealing one. Netomi is effectively betting that the defensible layer in enterprise AI will not be the foundation model alone. It will be the orchestration layer that turns general model capability into governed, auditable, domain-specific action.

That governed approach extends to how the platform handles risk. Netomi uses what it calls an AI authority matrix — a real-time system that defines what the AI can do autonomously and when it must escalate to a human. “It’s a little bit like autonomous driving,” Mehta said. The AI knows when it’s approaching a boundary and pulls a human in. For regulated industries, specific endpoints can be locked to deterministic, rules-based flows while the agentic layer handles broader orchestration — and all of it is version-controlled and traceable, with metadata saved for seven years.

Inside the AI system that rearranges websites and retail stores in real time

The most technically ambitious element of Netomi’s vision — and the one that most sharply distinguishes it from competitors — is what the company calls AI-embedded customer experience orchestration. Rather than placing a chatbot in the corner of a website, Netomi’s system can rearrange the website itself based on what the AI infers about each individual customer’s situation.

Wexler demonstrated a live example during the interview. “As we see most deployments, companies that want to deploy AI on their websites, they throw a chatbot on the corner,” he said. “If you embed agentic capabilities into the digital layer itself — and again, Adobe Experience Manager is the leading digital layer of enterprise — then you could do really unique things.”

Wexler described what this looks like in practice. In a typical deployment, he said, the AI doesn’t just answer questions — it reshapes the page. Based on a customer’s browsing behavior, purchase history and inferred intent, the system can reorganize a product page in real time: surfacing warnings one customer needs but another doesn’t, prompting a sample order at the moment of hesitation, or flagging a compatibility issue before checkout. Two customers looking at the same product might see fundamentally different pages — not because a marketing team built two versions, but because the AI is composing the experience on the fly.

“The AI is playing the role of arranging the elements of the website to cater to me and my needs,” Wexler said. “It’s anticipating my needs.”

The implication is a shift from static web pages to something closer to generative websites — pages that reconstruct themselves around each visitor the way a good salesperson adjusts a pitch mid-conversation. It is a fundamentally different model from bolting a chat widget onto a page that otherwise looks the same for everyone.

“The AI is playing the role of arranging the elements of the website to cater to me and my needs,” Wexler said. “It’s anticipating my needs.”

That vision already extends beyond screens. Mehta revealed that Coach, the handbag company owned by Tapestry, deployed Netomi’s platform in a physical flagship store during the holiday season to help customers navigate the retail space and is now rolling it out chainwide.

The numbers Netomi is putting behind its production claims are equally ambitious. At DraftKings, the company said its platform can handle traffic surging to more than 40,000 concurrent customer requests per second during major sporting events, while delivering sub-three-second response times and 98 percent intent classification accuracy. At Paramount, the company said it deployed across chat and voice in two weeks and then scaled through a weekend that included a major UFC event and the AFC Championship.

Those are company-reported numbers, and they are hard to benchmark against competitors because the industry lacks standard public reporting. But they illustrate the kind of problem Netomi wants buyers to think about. At that scale, an AI support product stops looking like a smarter FAQ bot and starts looking like a distributed systems challenge. You are not just asking whether a model can answer a question. You are asking whether an entire system can make decisions quickly, safely and consistently while traffic spikes and business rules collide.

The $110 million question: can invisible AI beat the chatbot industrial complex?

Whether Netomi can deliver on the full scope of its ambition — transforming from an AI customer service platform into an ambient intelligence layer that reshapes digital and physical experiences in real time — remains an open question. The company faces competitors with far larger war chests, deeper platform footprints and, in Sierra’s case, a founder-level relationship with OpenAI.

But Netomi’s bet is fundamentally different from what much of the field is building. While Sierra and Decagon race to replace human agents with AI concierges, measuring success in conversations handled, Netomi is wagering that the highest form of customer service is the interaction that never needs to happen at all.

“There are new startups trying to convince enterprises that if every customer gets a ‘concierge,’ if there’s ‘an agent for every moment,’ then loyalty follows,” Mehta said. “But most relationships with brands are functional. Customers don’t want a conversational relationship with their airline or their bank. They want things to work — seamlessly, invisibly, without friction.”

In his closing comments during the interview, Mehta warned that many companies still underestimate the operational risk of deploying immature AI into sensitive customer environments. “What large companies adopting AI don’t fully realize yet is what kind of risk are they taking by adopting those platforms that are not really field tested for this kind of scale and situations,” he said.

That may be the most important line in the whole announcement. Because beneath the funding round, beneath the partner logos and beneath the talk of agents and orchestration, the real question in enterprise AI remains old-fashioned: which systems can be trusted when the environment gets ugly?

“We have built this technology more like how automated trading got built, or how autonomous driving got built, compared to coming at this from just a customer service lens,” Mehta said.

It is a fitting frame for a company whose founder left Wall Street to fix customer service. On the trading floor, the best systems were never the ones that made the most trades. They were the ones that knew, with precision, when not to act — and the ones nobody noticed until something went wrong and they held. Netomi’s new investors are betting $110 million that the same principle applies when the person on the other end of the system is not a trader, but a customer who just wants their floor not to leak.

Cheaper tokens, bigger bills: The new math of AI infrastructure

Presented by Nutanix


As enterprises move from AI experimentation into production deployment, the primary cost driver has shifted away from foundation model training and toward the infrastructure required to run thousands of concurrent inference workloads at scale, with agentic AI as the accelerant.

Where early enterprise AI projects involved a handful of large, scheduled training jobs, production agentic environments require continuous support for short-lived, unpredictable requests that consume GPU, networking, and storage resources in ways traditional infrastructure was never designed to handle. For enterprise technology leaders, that shift is turning infrastructure efficiency into a make-or-break factor in AI economics.

“Every employee with an AI assistant, every automated workflow, every agent pipeline needs models for inferencing and generates a lot of tokens,” says Anindo Sengupta, VP of products at Nutanix. “Those inferencing requests land on a GPU infrastructure, traverse specialized networks, and pull data from storage systems purpose built to support these AI workloads.”

Why cost per token is becoming a core infrastructure metric

Inference costs per token have dropped by roughly an order of magnitude over the past two years, driven by model efficiency improvements and competitive pressure among cloud providers. The expectation would be that enterprise AI is getting cheaper. Instead, total costs are rising, Sengupta says, pointing to what economists call the Jevons paradox: when a resource becomes cheaper to use, consumption tends to increase faster than the price drops.

So while the cost per token is going down by almost an order of 10 in the last couple of years, consumption has risen more than 100X. The result is that cost per token and GPU utilization are becoming primary operational metrics for enterprise IT, sitting alongside traditional measures like uptime and throughput.

“Cost per token is really about the total cost of ownership for serving inference models,” Sengupta says. “Utilization is about making sure that once you have GPU assets, you’re getting maximum return from them. These metrics will be critical for enterprise IT leaders.”

What makes this difficult is the number of variables involved. Token costs shift depending on which models an organization runs, where workloads execute, and how prompts are structured.

“There are too many variables in cost to manage intuitively,” Sengupta adds. “Optimizing it is an engineering problem, and one that requires continuous tuning.”

Agentic workloads expose the limits of traditional infrastructure

Production agentic AI introduces a workload profile that traditional enterprise infrastructure was not designed to handle. Classic data center deployments are built around predictable loads and long planning cycles. Agentic environments produce unpredictable, high-frequency bursts of short inference requests, place new demands on networking and storage, and change faster than most procurement cycles allow.

The infrastructure supporting agentic AI is also structurally different from CPU-based computing. GPU topology, high-speed interconnects, parallel storage systems for agent memory and KV cache, and networking architectures capable of handling DPU offloading all represent new capabilities that require new operational skills.

Siloed infrastructure compounds these challenges. When GPU resources, networking, and data access are managed independently, scheduling inefficiencies accumulate, utilization drops, and costs climb. Organizations running fragmented stacks tend to underutilize expensive GPU assets while simultaneously bottlenecking on storage and network throughput.

Integrated stacks and the case for full-stack architecture

The response emerging among infrastructure vendors is a move toward tightly integrated, validated full-stack platforms designed specifically for production AI workloads. The premise is that end-to-end optimization across compute, networking, storage, and software layers produces better utilization and lower per-token costs than assembling best-of-breed components from separate vendors.

Nutanix’s Agentic AI solutionrepresents one approach to this problem. Built on the Nutanix AHV hypervisor, Nutanix Enterprise AI and Nutanix Kubernetes Platform, the solution is designed to manage both the traditional compute layer where agent orchestration runs and the accelerated compute layer where inference executes. The company has introduced NVIDIA topology-aware enhancements to AHV that automatically optimize how GPUs, CPUs, memory, and DPUs are allocated to virtual machines, and has offloaded the Nutanix Flow Virtual Networking to BlueField DPUs, to free GPU cycles and sustain throughput without compromising security.

The solution supports instant deployment of NVIDIA NIM microservices and open-source models including Nemotron, and integrates an AI gateway that governs access to frontier cloud LLMs from Anthropic, Google, OpenAI, and others. The gateway also implements model context protocol (MCP) to allow agents to connect to enterprise data with granular access controls. The solution runs on Cisco infrastructure, allowing organizations to deploy on infrastructure they already operate.

“By integrating everything from the AHV hypervisor and Flow Virtual Networking up to the Kubernetes platform, you remove the silos that slow down AI projects,” Sengupta explains.

Platform teams and developer agility cannot be traded off against each other

One organizational tension that scales with agentic AI adoption is the relationship between platform teams managing shared infrastructure and the developers building and running agent applications on top of it. These groups have historically operated with different tooling, different priorities, and different time horizons, but Sengupta argues that the core dynamic hasn’t changed even as the technology has.

“Platform teams will continue to deliver a catalog of self-service AI capabilities that are also compliant to business needs, that they can serve to agentic AI builders,” Sengupta says. “Mature AI teams will do a great job not just in GPU utilization, but in creating an operating model that enables fast AI infrastructure delivery to meet the pace of innovation that developers want. That’s what is very critical to success.”

The organizations that are managing GPU utilization most effectively tend to be further along in their AI adoption journey, with more established operating models and clearer cost accountability. For organizations earlier in that journey, the infrastructure design and operating model decisions being made now will determine whether AI projects can move from pilot to production without cost or complexity becoming the limiting factor.

The AI factory operating model

The emerging framework for enterprise AI infrastructure is the AI factory, a purpose-built environment for producing and running AI workloads at scale. The challenge is that most organizations will need to operate both traditional compute and accelerated compute simultaneously for years, requiring a common operating model that spans both technology paradigms without sacrificing agility.

With Nutanix, running on Cisco as part of the Cisco AI Pods, powered by Intel and optimized for the NVIDIA reference architecture, organizations get a production-ready, full-stack foundation by enabling AI factories to be securely and efficiently shared by thousands of agents, to achieve the lowest costs per token. The solution bridges the gap between the infrastructure and platform engineering teams who manage the hardware and the AI engineering and agentic AI developer teams who build and run agentic AI applications, making it truly affordable to run AI at a massive scale.

“The metrics that will determine whether an organization can sustain and scale its AI investment — cost per token, GPU utilization, scheduling efficiency — are infrastructure metrics,” Sengupta says. “Managing them well is increasingly a precondition for making AI viable, not just functional.”

Secure and scale your AI factory — explore the full-stack approach here.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Amazon’s OpenAI gambit signals a new phase in the cloud wars — one where exclusivity no longer applies

Amazon Web Services on Tuesday launched one of the most consequential enterprise AI plays in the company’s 20-year history, simultaneously bringing OpenAI’s most powerful models to its Bedrock platform, unveiling a new agentic developer framework, releasing a desktop AI productivity tool called Amazon Quick, and expanding its Amazon Connect service from a single contact-center product into a family of four agentic AI solutions targeting supply chains, hiring, healthcare, and customer experience.

The announcements, made at a live event in San Francisco titled “What’s Next with AWS,” landed just 24 hours after OpenAI and Microsoft publicly restructured their exclusive cloud partnership — a move that, for the first time, freed OpenAI to distribute all of its products across rival cloud providers. AWS CEO Matt Garman called it “a huge partnership” and said customers have been asking for OpenAI models inside AWS “from the very early days.”

The timing was no accident. Amazon CEO Andy Jassy had flagged the Microsoft-OpenAI restructuring as “very interesting” in a post on X the day prior, promising more details on Tuesday. What followed was a sweeping set of launches that together represent AWS’s bid to become the definitive infrastructure layer for the agentic AI era — one where intelligent software agents don’t just answer questions but take autonomous action inside enterprise workflows.

OpenAI’s most capable models arrive on Amazon Bedrock for the first time, reshaping the cloud AI marketplace

The centerpiece announcement: OpenAI’s latest models are now available through Amazon Bedrock in limited preview, with general availability expected within weeks. AWS confirmed that GPT-5.4 is available immediately in limited preview, with GPT-5.5 arriving shortly thereafter.

In an exclusive interview with VentureBeat at the event, Anthony Liguori, Vice President and Distinguished Engineer at AWS, described the significance of the moment. “We announced a partnership about eight weeks ago centered around this idea of the stateful runtime environment, the SRE APIs,” Liguori said. “However, today we announced the availability of all of OpenAI’s frontier models in Amazon Bedrock available via both the stateless APIs — these are the APIs that are commonly used, like chat completions and responses.”

Liguori characterized the stateless API availability as particularly critical because it removes migration friction. “Customers can take their existing workloads today and just start using AWS right off the bat,” he said. “They don’t have to write any new software, develop any new things. I think that’s one of the most exciting announcements that came out today.”

The integration means AWS customers can now evaluate and deploy OpenAI models alongside offerings from Anthropic, Meta, Mistral, Cohere, and Amazon’s own models — all through Bedrock’s unified security, governance, and cost controls. For enterprise procurement teams, this collapses what had been a fragmented multi-vendor landscape into a single pane of glass.

How a $50 billion Amazon investment and a messy Microsoft breakup cleared the way for Tuesday’s deal

The path to Tuesday’s announcement was anything but smooth. As TechCrunch reported, OpenAI’s earlier $50 billion deal with Amazon, announced in February, had created a legal tangle with Microsoft. Under the original Microsoft-OpenAI agreement, Microsoft retained exclusive rights to OpenAI products accessed through APIs, which appeared to conflict directly with OpenAI’s promise to give AWS exclusive hosting rights for its new Frontier agent-building tool.

Microsoft had publicly pushed back at the time, stating that “Azure remains the exclusive cloud provider of stateless OpenAI APIs.” The Financial Times reported that Microsoft even contemplated legal action. Monday’s restructured deal — which replaced Microsoft’s open-ended exclusivity with a nonexclusive license running through 2032 — swept those legal obstacles aside.

For AWS, the resolution means its multi-billion-dollar investment in OpenAI can now fully bear fruit. As CNBC reported, OpenAI’s revenue chief Denise Dresser had told employees in a memo that the Microsoft relationship “has also limited our ability to meet enterprises where they are — for many that’s Bedrock.” At the San Francisco event, Dresser framed the moment as a turning point. “They’re no longer in the mindset of experimentation and pilots,” she said of enterprise customers. “They really want to go full enterprise wide, and they understand that to do that, they need to have powerful models. But even more importantly, they want those models in a trusted environment.”

OpenAI CEO Sam Altman, who was unable to attend in person due to his ongoing court case against Elon Musk across the Bay Bridge in Oakland, sent a recorded video message. “We are co-developing an agent platform from the ground up, deeply integrated with AWS services and powered by OpenAI’s most advanced models and tools,” Altman said, “so that customers can build and run powerful agents in their own environment without worrying about the underlying plumbing.”

Inside Bedrock managed agents, the reinforcement learning-trained ‘harness’ that AWS says will define the agentic era

Beyond raw model access, AWS launched Amazon Bedrock Managed Agents powered by OpenAI — a system that combines OpenAI’s frontier models with its proprietary “harness,” the agentic execution framework that powers products like Codex. This is where Liguori’s technical analysis was most revealing.

He explained that the harness concept represents a shift in how models are trained and deployed for agentic work. “When you think about an agentic platform, there’s really two components,” Liguori told VentureBeat. “One is the harness — the actual logic that will execute tool calls for the model, determine when to compact the context, all of those sorts of things — and then the model itself.”

Critically, Liguori argued, the best agentic performance comes when models are trained specifically against their harness through reinforcement learning — not merely prompted to use tools at inference time. “You can give a model a whole lot of instructions and a set of tools, and it will be able to use it most of the time,” he said. “But when you really train the model on a specific set of tools, a specific style of operations, it’s just like drilling plays over and over again — the model builds muscle memory for using that harness.”

The football analogy is instructive. Where general-purpose models are like versatile athletes who can adapt to any playbook, harness-trained models are like championship teams that have run the same formations thousands of times until execution becomes instinctive. For enterprises deploying agents in high-stakes production environments — managing financial transactions, orchestrating supply chains, or processing sensitive healthcare data — that reliability gap matters enormously.

Bedrock Managed Agents consists of three components: a runtime layer for configuring skills, memory policies, and tool access; an environment layer where the agent lives (deployable on Fargate or other AWS compute); and an inference API for interacting with the agent. The system integrates deeply with AWS’s identity and access management, VPC networking, and CloudTrail auditing — meaning every action an agent takes is logged and governed by existing enterprise security policies.

AWS makes its boldest security claim yet: zero human access to inference machines running OpenAI’s models

Liguori made what may be his most striking claim when discussing why enterprises should trust AWS over on-premises alternatives or smaller cloud providers. “With Bedrock, the system that we’re using to host the GPT-5.4 models, that whole environment is zero operator access,” he told VentureBeat. “There’s no human that could ever log into one of those machines, so your inference data is never able to be accessed by a human.”

He pointed to AWS’s custom silicon — Graviton processors and Nitro security chips — as the foundation for this claim. “When you look at one of our servers, either compute servers or the servers we’re using for Gen AI, the only thing that you can buy off the shelf is the memory modules. Everything else is either custom boards or even custom silicon.”

This argument is designed to counter a growing narrative from what the industry calls “neo-clouds” — smaller providers that offer on-premises model hosting with tighter physical security controls. Liguori flipped that argument on its head: “You’re actually way more secure in the cloud because we have built a platform with such strong physical securities… If you were to try to stand up your own inference system today, you’d probably be running open source software on just Linux.”

It’s a bold claim, and one that enterprise CISOs will undoubtedly scrutinize. But it underscores AWS’s conviction that the agentic era — where AI agents access source code, PII data, and critical business systems — demands infrastructure security guarantees that go far beyond what most organizations can build independently.

Codex’s 4 million weekly users could soon multiply as OpenAI’s coding agent arrives on AWS

OpenAI’s Codex coding agent also arrived on Bedrock in limited preview. Dresser shared that Codex has been growing at a blistering pace, expanding “from 3 million weekly active users to 4 million in two weeks.” The tool has evolved beyond simple code generation into a full agentic software development lifecycle platform.

For Liguori, who described himself as “10 to 20 times more productive” as an engineer thanks to tools like Codex, bringing this capability into AWS represents the bridge between individual developer productivity and enterprise-scale deployment. “Most developers today are using these OpenAI models on their laptops,” he said. “We haven’t seen that happen yet in the rest of the industry, and with Bedrock Managed Agents, we think we have a way for enterprises to deploy agents in a means that meets their compliance requirements.”

The gap Liguori is describing — between the solo developer experience and enterprise-wide adoption — is arguably the central challenge of the current AI moment. Individual engineers can achieve extraordinary productivity gains with agentic coding tools. But scaling that to thousands of developers across a Fortune 500 company, with proper governance, security, and auditability, requires platform-level infrastructure. That’s the market AWS is targeting.

Liguori saw the near-term potential in even more immediate terms. He described leading a team of about 20 engineers who share a common codebase of skills and MCP tools. “That has been an amazingly powerful thing, because we’re all able to build on top of each other as we learn how to use these models,” he said. “Where I’ve run into a hurdle is there’s a lot of stuff I’d like to share with our finance team… and I can’t really ask them to clone a Git repo and build it from a Git repo.” Bedrock Managed Agents, he argued, will let teams create hosted agents that non-technical colleagues can access — taking agentic development from a developer-only practice to an enterprise-wide capability within the next six months.

Amazon Quick Desktop aims to be the agentic AI assistant that finally works for non-developers

While the OpenAI partnership dominated headlines, AWS also launched Amazon Quick Desktop — a new desktop application designed to bring agentic AI to knowledge workers who aren’t developers. Liguori framed the product as addressing a critical gap. “A lot of these agentic tools have primarily targeted developers,” he said. “Quick Desktop is a really great tool if you are a knowledge worker that is not a developer… I think it’s been underserved for the non-developer knowledge workers.”

Quick Desktop integrates with a user’s local files, calendar, email, Slack, and enterprise applications — building what AWS calls a “Knowledge Graph” that maps relationships between people, projects, decisions, and actions. The system connects natively with Google Workspace, Microsoft 365, Zoom, and Salesforce. Unlike other AI productivity tools, Quick doesn’t wait for prompts. It proactively surfaces what matters — unanswered emails, deals needing updates, documents awaiting review — and can take action like scheduling meetings, drafting emails, or updating Jira tickets.

Garman, who said he had been using the desktop app for several weeks, called it “by far the most effective tool” among AI productivity products he has tested. “If you think about what we’ve done with Quick — combine all of your sources of data inside of the enterprise — but then we also saw the power of having access to a local desktop and being able to operate with your local files and your local email and your local Slack… but people were worried about security, appropriately so,” Garman said. “What we’re doing here is combining a bunch of those things together with QUIC to give you the best of all of those worlds.”

The product is available in preview today, with no AWS account required — users can sign up with just an email address. Customers including BMW, 3M, Mondelēz, Southwest Airlines, and the NFL are already using it, with some reporting production time reductions of nearly 80% and customer issue processing cut by more than 50%.

Amazon Connect becomes a family of four as AWS bets that ‘agentic teammates’ will transform supply chains, hiring, and healthcare

Perhaps the most ambitious long-term bet announced Tuesday was the expansion of Amazon Connect from a single contact-center product — one that reached over $1 billion in revenue last year and processes 20 million interactions daily — into a family of four agentic AI solutions.

The new lineup includes Amazon Connect Decisions, an agentic supply chain planning tool built on more than 25 specialized supply chain tools and 30 years of Amazon operational science, including one of Amazon’s SCOT (Supply Chain Optimization Technologies) foundation models. Amazon Connect Talent is a high-volume hiring platform inspired by Amazon’s experience hiring 250,000 seasonal employees during peak periods, using AI agents to conduct voice interviews around the clock and present recruiters with anonymized, skills-based scoring. Amazon Connect Customer AI is the renamed and enhanced version of the original contact-center service. And Amazon Connect Health covers the patient journey from appointment scheduling through clinical encounters, including ambient documentation, billing code suggestions, and post-visit summaries drawn from Amazon’s experience with One Medical and Amazon Pharmacy.

Colleen Aubrey, who leads applied AI solutions at AWS and previously co-founded Amazon’s advertising business, introduced a new design philosophy underlying all four products: “humorphism.” Where skeuomorphism translated physical objects into digital metaphors — desks to desktops, files to folders — humorphism translates human interaction dynamics into AI agent behavior. “If we’re building products that at the heart of which is an agentic teammate, then how should those teammates interact with you?” Aubrey asked. The philosophy manifests in specific design choices: Connect Decisions agents ask planners why they made manual adjustments and apply those insights across similar products. Connect Talent agents adapt follow-up questions based on candidate responses. Connect Health agents trace every clinical insight back to source data so physicians can verify AI-generated documentation.

What AWS’s four-layer strategy reveals about where the real value in enterprise AI will be captured

Taken together, Tuesday’s announcements reveal a coherent strategy operating across four distinct layers: custom infrastructure (Graviton, Trainium, zero-operator-access security), model access (Bedrock as a model marketplace with unified APIs), an agentic platform (Bedrock Managed Agents and AgentCore for building and governing agents), and purpose-built applications (Quick for individual productivity, Connect for vertical business operations).

This layered approach addresses a fundamental tension in the enterprise AI market. Companies want choice at the model layer but integration at the platform layer and specificity at the application layer. By offering all three through a single security and governance framework, AWS is betting it can capture value across the entire stack — a strategy that reshapes competitive dynamics for Microsoft, Google Cloud, and the growing constellation of smaller AI infrastructure providers.

Garman pushed back on the “SaaSpocalypse” narrative that agentic AI will destroy incumbent enterprise software companies. “The incumbent providers today have such a huge advantage,” he said. “They have deep domain expertise… a large customer set with all of their data.” He pointed to Salesforce’s recent headless API offering as an example of incumbents adapting smartly. But he also drew an explicit parallel to the early days of cloud computing, when customers would simply replicate their on-premises data centers in the cloud rather than reimagine what was possible. “You see that today with how people are thinking about AI and agents,” Garman said. “They’re like, ‘I have this business process, I’m gonna have agents do the exact same thing that humans do.’ It kind of works… but it doesn’t give you that transformational change.”

He pointed to Amazon’s own Prime Video team as proof of what that change looks like in practice. The team used agentic tools to rebuild a partner payment system that was projected to take two years — completing it in roughly two quarters with a handful of people, while simultaneously improving the system for customers, for Amazon, and for the partners who get paid through it.

The enterprise AI arms race enters a new phase as model access becomes table stakes and the platform war begins

For enterprises evaluating their AI strategies, Tuesday’s announcements simplify one decision — OpenAI models are now available where most of them already run production workloads — while complicating another. With model access increasingly commoditized across cloud providers, the real differentiator becomes the platform layer: where agents are built, governed, deployed, and trusted to take consequential actions. That’s the battleground AWS is staking out, and it’s the same ground Microsoft, Google, Salesforce, and a growing number of startups intend to contest.

Liguori sees the transformation accelerating fast. “I think what we’re going to see in the next six months is a lot of this agentic stuff going from developer only to being able to be consumed by a larger number of folks within an enterprise,” he told VentureBeat. Anthony Liguori, the AWS distinguished engineer who led the technical work over eight sleepless weeks to bring OpenAI’s models to Bedrock, said his own productivity as a software engineer has increased 10 to 20 times over the past year. When asked what excites him most about what comes next, he didn’t talk about models or infrastructure. He talked about what happens when that same multiplier reaches the finance team, the product managers, the supply chain planners — the millions of knowledge workers who have been watching the agentic revolution from the sidelines.

“We had nothing eight weeks ago,” he said, “and now we’re here.” If the next eight weeks move as fast, the sidelines may not exist for much longer.

IBM launches Bob with multi-model routing and human checkpoints to turn AI coding into a secure production system

Bringing AI agents into the enterprise software development lifecycle is fast becoming the norm. As developers experiment with new platforms, organizations are exposed to potential security and orchestration failures. Systems that work in pilots may fail once the agents start working with real-time data.

Legacy tech giant IBM is one of several companies trying to address that gap by introducing more structure into how these workflows run. Yesterday, it announced the global launch of its AI-powered software development platform Bob, designed to write and test code across the development cycle, already in use by more than 80,000 of its employees after starting with just 100 internal users in summer 2025.

Bob introduces a structured layer that constantly pauses for human-led checkpoints, yet by harnessing AI models to perform agentic tasks, IBM says it has saved some teams up to 70% of time “on selected tasks…equaling an average time savings of 10 hours per week.”

Specific models supported include IBM’s own Granite series, Anthropic’s Claude, some from French AI firm Mistral and other smaller distilled models — no Alibaba Qwen or other fully open source ones.

This approach reflects a shift in how enterprises want to approach AI-led development: to build systems that not only build applications but also execute complex, multi-step workflows that do not rely on a single model or a single orchestration framework. It provides a structured, guarded approach to automation that seeks to center humans more in the process and fill audit gaps. 

Neal Sundaresan, general manager, Automation and AI at IBM, told VentureBeat in an exclusive interview that a large part of using AI for software development is being systematic. 

“Model capability alone isn’t enough,” Sundaresan said. “How you deploy it, how you structure context, and how you keep humans in the loop is what determines whether AI actually delivers.”

That divide is shaping how enterprises choose AI tools, whether they prioritize flexibility and experimentation or reliability and auditability.

Varying approaches to AI-led development

A growing class of open or autonomous agent systems has pushed the boundaries of what developers can do. They can now run extended or stateful workflows without much human intervention.

The rise of OpenClaw showed enterprises how far experimentation can go, especially when trained on local data and run in sandboxes. But it also meant that the choice between easier agent and workflow creation and security. 

Some companies have embraced this spirit of experimentation.

Enterprise providers like Nvidia chose to embrace OpenClaw-like systems by adding a fence around the sandbox environment that runs autonomous agents, using NemoClaw. Kilo launched Kilo Claw, aimed at providing security for autonomous agents. OpenAI, in its updated Agents SDK, added support for sandbox agent implementations that mirror a lot of the usage patterns of systems like OpenClaw. 

Sundaresan said enterprises continue to experiment with how they want to approach coding and agent building. He doesn’t want to close the door on fully autonomous agents proactively completing tasks, but he believes enterprises will want to exercise more caution as well. 

“If you tell me that the final answer will be OpenClaw, then we will get there,” he said. “But it’s better to open the gate slowly than say, ‘oops, how do I close it now?’”

Bob reflects that thought process, highlighting the increasing shift for enterprises. 

How Bob compares

Bob acts as a coding platform, but unlike similar products, it aims to standardize and govern the agent workflows created on it. 

Tools like Cursor and Claude Code position the user at the beginning of the task. They are writing the prompts, chaining steps and debugging. LangGraph does similarly while also allowing teams to define agent flows.

The difference is not about capabilities but about control, and whether the system enterprises use explores potential solutions or delivers predictable execution.

In this case, the human employee starts and ends the process. If the agent is unable to complete its task or makes a mistake, this is handled after the fact. 

Bob, on the other hand, essentially pre-structures the development lifecycle into role-based stages. The agents will often check-in with the user for approval as a natural workflow checkpoint. Sundaresan said the idea is to combine the human and automated workflows. 

What is becoming clear is that the next phase of enterprise AI no longer relies on model power, but rather on how well tools are designed to balance autonomy and control. 

Pricing and availability

As mentioned previously, Bob is now available for all regions where IBM does business. IBM’s pricing structure for Bob consists of four primary subscription tiers for each user/seat and is built around its own internal credits system called “Bobcoins,” which serves as the primary metric for transparency and predictability.

These are set at a fixed valuation of 1 Bobcoin per $0.50 USD. Users consume these coins by performing specific actions, such as generating code, running commands, or performing file operations. If a user exhausts their balance, they must upgrade their plan to continue using the service.

Here are the plans currently offered and how many Bobcoins the user obtains by subscribing to each tier.

  1. 30-day Free Trial providing 40 Bobcoins

  2. Pro plan at $20 per month with 40 Bobcoins

  3. Pro+ plan at $60 per month with 160 Bobcoins

  4. Ultra tier priced at $200 per month for 500 Bobcoins.

All standard plans provide access to core features including specialized agentic modes, literate coding, the Bob Shell for intelligent CLI workflows, and Model Context Protocol (MCP) integration.

While all individual plans are restricted to a single user, an Enterprise plan is available through sales contact, offering centralized team management, flexible role assignments, and the ability to distribute Bobcoins across an organization.

Enterprise subscribers receive additional benefits such as priority support and a dashboard to track entitlements and usage awareness.

AWS Quick’s personal knowledge graph is making orchestration decisions most control planes can’t see

Enterprise AI teams running centralized orchestration stacks now have a new variable to account for: AWS Quick, which expanded this week to a desktop-native agent that builds a persistent personal knowledge graph and executes actions across local files and SaaS tools — outside the visibility of most control planes.

Unlike chat-based copilots that reset with each session, Quick now maintains a continuously updated knowledge graph built from the user’s local files, calendar, email and connected SaaS apps. It uses it to proactively trigger actions without waiting to be asked.

AWS launched Quick in October last year as an alternative to AI workflow and productivity platforms coming from Google, OpenAI and Anthropic. It was a way for enterprise employees to access insights from connected applications, an agent builder, deep research, and workflow automation. Now, it’s grown beyond a simple AI assistant and acts more as a proactive workflow agent with a stateful, real-time knowledge graph of the user. It integrates with third-party apps like Google Workspace, Microsoft 365, Zoom, Salesforce and Slack — and now local files — so the agent can gather context and take actions. 

“What we’ve been hearing is that many enterprises have not been happy with how difficult it is to get context from their legacy tools,” Jigar Thakkar, vice president of Quick Suite at AWS, told VentureBeat in an interview. “Our vision is that Quick is a desktop experience that is the one place where people can go to get all their information and tasks.”

Governance blindspots 

Enterprises often put orchestration layers at the center to help guide and manage agents. Context is pulled in, decisions are made, and then actions are executed within defined system boundaries.

Recent releases like Anthropic’s Claude Managed Agents or updates to OpenAI’s Agent SDK also push for more stateless, autonomous agents within enterprise workflows, but still operate within defined orchestration boundaries. 

Quick still operates under enterprise controls, something that AWS has always underscored with its AI products, so actions taken on Quick remain bound by permissions, identity and security. Integrations remain managed by either an API or an MCP connection. 

However, this evolution of Quick introduces a more subtle shift in the decision layer. AWS updated Quick to build a personal knowledge graph that learns more about the user the more they interact with the platform. It builds a profile based on how they use local files, calendar, email or third-party app integrations to proactively suggest actions such as reminding a team leader to set up check-ins. 

Enterprises should be wary that a kind of shadow orchestration could arise in a system like this. The personalized context means the decision layer focuses on implicit triggers rather than set workflows, user-specific interpretations, and different action timings. Practitioners are rightfully wary of this much autonomy, understanding that shadow orchestration may not be something completely under their control.

Upal Saha, co-founder and CTO of Bem, told VentureBeat in an email that platforms like AWS Bedrock AgentCore, its managed agent runtime, and similar ones from Salesforce “maximize autonomy rather than accountability” so enterprises are not losing agent visibility by accident.

“When you deploy an agent that reasons its way to a decision across multiple steps, you have already accepted that you will not be able to fully explain what happened after the fact,” Saha said. “That is fine for a demo. It is not fine for a claims processing pipeline or a financial workflow where a regulator can ask you to produce a complete audit trail for every automated decision made in the last three years.”

AWS said the platform’s governance model is designed to address these concerns. “Users can set up different agents and automated workflows tailored to their role — things like monitoring tickets, pulling data from connected systems, or drafting docs — all managed within a governed environment where IT retains control over what’s connected and what data flows where. It’s designed to give individual users flexibility while keeping enterprise-level oversight in place,” an AWS spokesperson said. 

A possible blueprint 

Quick’s evolution from an AI assistant to something more proactive represents a possible approach some enterprise software providers will take to deep AI agent integration into workflows. While what AWS wants to accomplish with Quick—better context from apps and local files and a strong understanding of what its users actually want to do—is not unique, it isn’t focusing on traditional orchestration. Instead, it’s relying on context-driven agent management. 

This market tension is growing, as evidenced by the release of similar platforms. Mistral, for example, announced Workflows the same day as the updates to Quick. That platform uses a more traditional orchestration framework. 

Stateful and personalized agents continue to evolve, and so do the questions around how enterprises govern them.

How to build custom reasoning agents with a fraction of the compute

Training AI reasoning models demands resources that most enterprise teams do not have. Engineering teams are often forced to choose between distilling knowledge from large, expensive models or relying on reinforcement learning techniques that provide sparse feedback.

Researchers at JD.com and several academic institutions recently introduced a new training paradigm that sidesteps this dilemma. The technique, called Reinforcement Learning with Verifiable Rewards with Self-Distillation (RLSD), combines the reliable performance tracking of reinforcement learning with the granular feedback of self-distillation. 

Experiments indicate that models trained with RLSD outperform those built on classic distillation and reinforcement learning algorithms. For enterprise teams, this approach lowers the technical and financial barriers to building custom reasoning models tailored to specific business logic.

The problem with training reasoning models

The standard method for training reasoning models is Reinforcement Learning with Verifiable Rewards (RLVR). In this paradigm, the model learns through trial and error, guided by a final outcome from its environment. An automated verifier checks if the model’s answer is right or wrong, providing a binary reward, such as a 0 or 1.

RLVR suffers from sparse and uniform feedback. “Standard GRPO has a signal density problem,” Chenxu Yang, co-author of the paper, told VentureBeat. “A multi-thousand-token reasoning trace gets a single binary reward, and every token inside that trace receives identical credit, whether it’s a pivotal logical step or a throwaway phrase.” Consequently, the model never learns which intermediate steps led to its success or failure.

On-Policy Distillation (OPD) takes a different approach. Instead of waiting for a final outcome, developers pair a smaller student model with a larger, more capable teacher model. For each training example, the student compares its response to that of the teacher token by token. This provides the student with granular feedback on the entire reasoning chain and response-generation process.

Deploying and running a separate, massive teacher model alongside the student throughout the entire training process incurs massive computational overhead. “You have to keep a larger teacher model resident throughout training, which roughly doubles your GPU footprint,” Yang said. Furthermore, the teacher and student models must share the exact same vocabulary structure, which according to Yang, “quietly rules out most cross-architecture, cross-modality, or multilingual setups that enterprises actually run.”

The promise and failure of self-distillation

On-Policy Self-Distillation (OPSD) emerged as a solution designed to overcome the shortcomings of the other two approaches. In OPSD, the same model plays the role of both the student and the teacher.

During training, the student receives a standard prompt while the teacher receives privileged information, such as a verified, step-by-step answer key. This well-informed teacher version of the model then evaluates the student version, providing token-by-token feedback as the student tries to solve the problem using only the standard prompt.

OPSD appears to be the perfect compromise for an enterprise budget. It delivers the granular, step-by-step guidance of OPD. Because it eliminates the need for an external teacher model, it operates with the high computational efficiency and low cost of RLVR, only requiring an extra forward pass for the teacher.

However, the researchers found that OPSD suffers from a phenomenon called “privileged information leakage.”

“The objective is structurally ill-posed,” Yang said. “There’s an irreducible mutual-information gap that the student can never close… When self-distillation is set up as distribution matching, the student is asked to imitate the teacher’s full output distribution under privileged context.”

Because the teacher evaluates the student based on a hidden answer key, the training objective forces the student model to learn the teacher’s exact phrasing or steps instead of the underlying reasoning logic. As a result, the student model starts hallucinating references to an invisible solution that it will not have access to in a real-world deployment.

In practice, OPSD models show a rapid spike in performance early in training, but their reasoning capabilities soon plateau and progressively degrade over time.

Decoupling direction from magnitude with RLSD

The researchers behind RLSD realized that the signals governing how a model updates its parameters have fundamentally asymmetric requirements. They identified that the signal dictating the direction of the update (i.e., whether to reinforce or penalize a behavior) can be sparse, but must be perfectly reliable, because pointing the model in the wrong direction damages its reasoning policy.

On the other hand, the signal dictating the magnitude of the update (i.e., how much relative credit or blame a specific step deserves) benefits from being extremely dense to enable fine-grained, step-by-step corrections.

RLSD builds on this principle by decoupling the update direction from the update magnitude. The framework lets the verifiable environmental feedback from the RLVR signal strictly determine the direction of learning. The model only receives overall reinforcement if the final answer is objectively correct.

The self-teacher is stripped of its power to dictate what the model should generate. Instead, the teacher’s token-by-token assessment is repurposed to determine the magnitude of the update. It simply distributes the total credit or blame across the individual steps of the model’s reasoning path.

This alters how the model learns compared to the classic OPSD paradigm. In standard OPSD, the training objective acts like behavioral cloning, where the model is forced to directly copy the exact wording and phrasing of the teacher. This causes the student to hallucinate and leak references to data it does not have.

Instead of forcing the model to copy a hidden solution, RLSD provides a natural and virtually cost-free source of per-token credit information.

“The intuition: we’re not teaching the model to reason like the teacher,” Yang said. “We’re telling the model, on the path it chose, which of its own tokens were actually doing the work. The model’s exploration distribution stays its own. Only the credit allocation gets sharpened.”

If a specific deduction strongly supports the correct outcome, it receives a higher score. If it is just a useless filler word, it receives a baseline score. RLSD eliminates the need to train complex auxiliary reward networks, manually annotate step-by-step data, or maintain massive external teacher models.

Putting RLSD to the test

To test RLSD, the researchers trained the open-weight Qwen3-VL-8B vision-language model and evaluated it on several visual reasoning benchmarks. These included MMMU for college-level multi-discipline questions, MathVista, MathVision, WeMath, and ZeroBench, a stress-test benchmark explicitly designed to be nearly impossible for current frontier models.

They compared the RLSD model against the base model with no post-training, standard RLVR via the GRPO algorithm, standard OPSD, and a hybrid combination of the two.

RLSD significantly outperformed every other method, achieving the highest average accuracy of 56.18% across all five benchmarks. It beat the base model by 4.69% and outperformed standard RLVR by 2.32%. The gains were most pronounced in complex mathematical reasoning tasks, where RLSD outperformed standard RLVR by 3.91% on the MathVision benchmark.

Beyond accuracy, the framework offers massive efficiency gains. “Concretely, RLSD at 200 training steps already beats GRPO trained for 400 steps, so roughly 2x convergence speedup,” Yang said. “Cost-wise, the only overhead beyond a normal GRPO pipeline is one extra forward pass per response to grab teacher logits. Compared to rollout generation… that’s basically free.”

Unlike OPSD, which saw performance spike and then completely collapse due to information leakage, RLSD maintained long-term training stability and converged on a higher performance ceiling than standard methods.

The qualitative findings highlight how the model alters its learning behavior. For example, in a complex visual counting task, standard RLVR looks at the final correct answer and gives the entire paragraph of reasoning tokens the same reward. RLSD surgically applied rewards to the specific mathematical subtraction steps that solved the problem, while actively down-weighting generic filler text like “Looking at the image, I see…”.

In another example, the model performed an incorrect math derivation based on a bar chart. Instead of labeling the whole response as a failure, RLSD concentrated the heaviest penalty on the exact point where the model misread a relationship from the chart. It remained neutral on the rest of the logical setup, recognizing that the initial framework was valid.

This is particularly important for messy, real-world enterprise use cases. If a model makes a mistake analyzing a 50-page quarterly earnings report, developers do not want it to unlearn its entire analytical framework. They just want it to fix the specific assumption it got wrong. RLSD allows the model to learn exactly which logical leaps are valuable and which are flawed, token by token. Because RLSD does this by repurposing the model itself, it provides models with granular reasoning capabilities while keeping the costs of training reasonable.

How enterprises can get started

For data engineers and AI orchestration teams, integrating RLSD is straightforward, but it requires the right setup. The most critical requirement is a verifiable reward signal, such as code compilers, math checkers, SQL execution, or schema validators. “Tasks without verifiable reward (open-ended dialogue, brand-voice writing) belong in preference-based pipelines,” Yang said.

However, RLSD is highly flexible regarding the privileged information it requires. While OPSD structurally requires full intermediate reasoning traces, forcing enterprises to either pay annotators or distill from a frontier model, RLSD does not.

“If you have full verified reasoning traces, great, RLSD will use them,” Yang said. “If all you have is the ground-truth final answer, that also works… OPSD doesn’t have this flexibility.”

Integrating the technique into existing open-source multi-modality RL frameworks like veRL or EasyR1 is incredibly lightweight. According to Yang, it requires no framework rewrite and slots right into the standard stack. The code swap involves simply changing tens of lines to adjust the GRPO objective and sync the teacher with the student.

Looking ahead, RLSD offers a powerful way for enterprises to maximize their existing internal assets.

“The proprietary data enterprises hold inside their perimeter (compliance manuals, internal documentation, historical tickets, verified code snippets) is essentially free privileged information,” Yang concluded. “RLSD lets enterprises feed this kind of data straight in as privileged context, which sharpens the learning signal on smaller models without needing an external teacher and without sending anything outside the network.”