Traza raises $2.1 million led by Base10 to automate procurement workflows with AI

For decades, procurement has been the back office that enterprise software forgot. Billions of dollars flow through vendor negotiations, purchase orders, and supplier communications every year at the largest manufacturers and construction companies in the country — and the vast majority of that work still runs on email threads, spreadsheets, and phone calls.

Traza, a newly launched startup headquartered in New York, believes the moment has arrived to change that. The company announced today the close of a $2.1 million pre-seed round led by Base10 Partners, with participation from Kfund, a16z scouts, Clara Ventures, Masia Ventures, and a roster of angel investors including Pepe Agell, who scaled Chartboost to 700 million monthly users before its acquisition by Zynga.

The funding is modest by Silicon Valley standards. But Traza’s pitch is anything but incremental: the company deploys AI agents that don’t just recommend procurement actions — they execute them autonomously, handling vendor outreach, request-for-quote generation, order tracking, supplier communications, and invoice processing without continuous human supervision.

“AI is redesigning the procurement category from the ground up,” said Silvestre Jara Montes, Traza’s CEO and co-founder, in an exclusive interview with VentureBeat. “This wave of AI won’t just build procurement software — it will rebuild how procurement works.”

Why procurement contracts silently lose millions after the ink dries

The market Traza is targeting is enormous and, by the company’s framing, spectacularly underserved. The procurement software market alone exceeds $8 billion and grows at roughly 10% annually. But the real cost sits in the labor — the armies of people, agencies, and ad hoc workarounds required to actually run procurement operations at scale. Most enterprises meaningfully engage with only their top 20% of suppliers. The remaining 80% — the vendor outreach, order tracking, invoice reconciliation, and compliance monitoring — goes largely unmanaged.

Research from World Commerce & Contracting and Ironclad finds that organizations lose an average of 11% of total contract value after agreements are signed, a phenomenon described as “post-signature value leakage.” As Tim Cummins, President of WorldCC, put it: “The research shows that the 11% value gap is not caused by poor negotiation, but by how contracts are managed after signature.” For a large enterprise with $500 million in annual contracted spend, that represents $55 million vanishing each year — not from bad deals, but from the operational void between what gets agreed at the negotiating table and what actually gets executed on the ground. Missed savings, unauthorized changes, and poor renewal planning are responsible for the biggest losses.
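The $55 million figure follows directly from the 11% leakage rate applied to $500 million in contracted spend. A minimal sketch of that arithmetic, in which the function name and the default rate parameter are illustrative choices rather than anything from the WorldCC research itself:

```python
def post_signature_leakage(annual_contract_spend: float, leakage_rate: float = 0.11) -> float:
    """Estimate value lost after signature as a share of contracted spend."""
    return annual_contract_spend * leakage_rate


# The article's example: $500 million in annual contracted spend at 11% leakage.
spend = 500_000_000
loss = post_signature_leakage(spend)
print(f"Estimated annual leakage: ${loss:,.0f}")
```

Because the estimate scales linearly with spend, an enterprise with half that contracted spend would see roughly half the loss at the same rate.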

Jara Montes argues that Traza sits precisely in this gap. “The 11% spans commercial, operational, and compliance leakage. We own the operational layer — and that’s where the most recoverable value sits,” he said. “Supplier tail management that never happens, RFQ processes skipped because someone ran out of bandwidth, invoice discrepancies that slip through unnoticed. That’s where contracts bleed value after signing, and that’s exactly what we automate.” The numbers from Traza’s early deployments, while nascent, are striking: the company claims a 70% reduction in human hours spent on procurement tasks and procurement cycles running three times faster than manual baselines.

How AI agents crossed the line from procurement copilot to autonomous worker

To understand what makes Traza’s approach different, it helps to understand what “AI for procurement” has meant until now. For the past several years, the term largely described dashboards, analytics layers, and recommendation engines that surfaced insights but left every decision and action in a human’s hands. Products from incumbents like SAP Ariba and Coupa — as well as newer entrants like Zip, Fairmarkit, and Tonkean — have layered AI capabilities on top of existing systems of record. But the gap between piloting AI and achieving production-scale impact remains stark, with 49 percent of procurement teams running pilots but only 4 percent reaching meaningful deployment.

Traza’s bet is that 2026 represents an inflection point. AI agents now possess the multi-step reasoning, tool use, and contextual memory required to execute full procurement workflows autonomously — from vendor discovery through invoice processing. The company frames this not as an upgrade to existing procurement software, but as an entirely new product category. “The incumbents built systems of record. They organize procurement data and they’ve never executed procurement work — and their AI additions don’t fundamentally change that,” Jara Montes said. “What they’re shipping is a recommendation layer on the same underlying architecture. A human still has to act on every suggestion. We replace the operational layer entirely.”

Industry data supports the thesis that enterprises are hungry for this shift. According to the 2025 Global CPO Survey from EY, 80 percent of global chief procurement officers plan to deploy generative AI in some capacity over the next three years, and 66 percent consider it a high priority over the next 12 months. A 2025 ABI Research survey found that 76% of supply chain professionals already see autonomous AI agents as ready to handle core tasks like reordering, supplier outreach, and shipment rerouting without human intervention — and early deployments are demonstrably reducing supply chain operational costs by 20 to 35%.

Inside the workflow: what Traza’s AI does and where humans still make the call

In a typical deployment, Traza’s AI agent takes over the operational labor that currently lives in inboxes, spreadsheets, and manual follow-up chains. In a standard RFQ workflow, the agent identifies suitable suppliers, drafts and sends the request for quotes, monitors supplier responses, follows up automatically when responses lag, parses incoming quotes regardless of their format, and builds a structured comparison table ready for a human decision-maker. The key design principle is deliberate: humans remain in the loop at critical junctures.
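The final step of that RFQ workflow, turning parsed quotes into a structured comparison for a human decision-maker, can be sketched in a few lines. This is an illustration only: the `Quote` fields, the supplier names, the prices, and the ranking rule (lowest unit price, then shortest lead time) are all assumptions, since the article does not describe how Traza's agent actually structures or ranks quotes.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    # Hypothetical fields; real quotes would carry many more terms.
    supplier: str
    unit_price: float
    lead_time_days: int


def comparison_table(quotes):
    """Return table rows with quotes ranked by unit price, then lead time."""
    ranked = sorted(quotes, key=lambda q: (q.unit_price, q.lead_time_days))
    header = f"{'Supplier':<12}{'Unit price':>12}{'Lead time (d)':>15}"
    rows = [
        f"{q.supplier:<12}{q.unit_price:>12.2f}{q.lead_time_days:>15d}"
        for q in ranked
    ]
    return [header, *rows]


# Made-up quotes standing in for parsed supplier responses.
quotes = [
    Quote("Supplier A", 4.20, 14),
    Quote("Supplier B", 3.95, 21),
    Quote("Supplier C", 3.95, 10),
]
table = comparison_table(quotes)
print("\n".join(table))
```

The table is deliberately only a ranking aid: consistent with the design principle described above, choosing the winning supplier stays with a person.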

“At critical steps — approving a purchase order, flagging a compliance issue, committing spend above a threshold — a human is always in the loop,” Jara Montes explained. “That’s not a limitation, it’s the design. It’s how you maintain the auditability enterprises require while moving faster than any manual process could. You earn expanded autonomy over time, as trust is built and results compound.”

When asked about the risk of AI errors — a wrong purchase order or a missed compliance check that could prove costly — Jara Montes was direct: “Anything with meaningful financial or compliance exposure requires human approval before it executes — that’s non-negotiable and baked into the architecture. Below those thresholds, the agent acts autonomously and logs everything.” He added a point that reveals a subtler product insight: “Most procurement operations today are a black box — nobody has a clear picture of what’s happening across the supplier tail. We make it legible.” In other words, the transparency the AI agent provides may itself be a product — giving procurement leaders visibility they have never had into the long tail of supplier relationships that most enterprises simply ignore.
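The rule Jara Montes describes (human approval above a threshold, autonomous execution with logging below it) resembles a simple approval gate. The sketch below is hypothetical: the $10,000 threshold, the class names, and the log format are assumptions made for illustration, not details of Traza's architecture.

```python
from dataclasses import dataclass, field

# Assumed threshold; the article does not say what values customers configure.
APPROVAL_THRESHOLD_USD = 10_000


@dataclass
class Action:
    description: str
    amount_usd: float
    compliance_flag: bool = False


@dataclass
class Agent:
    audit_log: list = field(default_factory=list)
    pending_approval: list = field(default_factory=list)

    def handle(self, action: Action) -> str:
        """Queue high-exposure actions for a human; execute and log the rest."""
        needs_human = (
            action.compliance_flag or action.amount_usd > APPROVAL_THRESHOLD_USD
        )
        outcome = "queued_for_approval" if needs_human else "executed"
        if needs_human:
            self.pending_approval.append(action)
        # Every action is logged, whether or not it ran autonomously.
        self.audit_log.append((action.description, action.amount_usd, outcome))
        return outcome


agent = Agent()
small = agent.handle(Action("Reorder fasteners", 1_200.0))
large = agent.handle(Action("Issue purchase order for steel", 85_000.0))
flagged = agent.handle(Action("Onboard new supplier", 500.0, compliance_flag=True))
print(small, large, flagged)
```

Logging both branches is what would make the supplier tail "legible" in the sense described above: the audit trail covers autonomous actions as well as those waiting on a person.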

How Traza plugs into legacy enterprise systems without ripping them out

One of the recurring challenges for any enterprise AI startup is the integration question: How do you plug into the deeply entrenched, often decades-old technology stacks that large manufacturers and construction companies rely on? Traza’s answer is to sit on top of existing systems rather than replace them. “We connect via API or direct integration into whatever the customer already runs — ERPs, email, supplier portals. We have reach across more than 200 enterprise tools,” Jara Montes said. “We don’t rip out their system, we sit on top of them.”

The go-to-market motion mirrors this pragmatism. Instead of attempting a big-bang deployment, Traza runs a two-to-three-month proof of value focused on a single, specific workflow. Integrations are built at the key steps that matter for that particular use case, then expanded as the scope of the engagement grows. “We don’t try to connect everything upfront — we compound integrations as we expand scope within each account,” Jara Montes said. “And every integration we build compounds across customers too. Each new deployment makes the next one faster.” Throughout the process, the company works side by side with the customer’s team, managing complexity and helping them transition into a new way of operating. It is a notably high-touch approach for a company selling automation.

The company is already working with large manufacturers and construction companies and says they are paying, though it declines to name them publicly. “We want to earn the right to grow inside each account, not land a pilot that goes nowhere,” Jara Montes said. “That’s how you build something that actually sticks in enterprise.”

Traza bets that vertical depth in physical industry will beat horizontal AI platforms

Traza enters a market that is rapidly heating up. The leading AI procurement solutions include platforms from Coupa, Ivalua, SAP Ariba, Zip, Zycus, and Fairmarkit. Keelvar provides autonomous sourcing bots capable of launching RFQs, collecting bids, and recommending optimal awards, while Tonkean offers a no-code orchestration platform using NLP and generative AI to streamline procurement intake and tail-spend management. Against this crowded field, Jara Montes draws a sharp distinction between horizontal automation tools and Traza’s focus on physical industry.

“We’re built specifically for the physical industry, where supplier relationships, compliance requirements, and workflow complexity are categorically different from software procurement,” he said. “A generic agent doesn’t survive contact with how procurement actually works in manufacturing or construction. Specificity is the moat.” The competitive dynamics with major incumbents are perhaps even more consequential. SAP Ariba, Coupa, and their peers have massive installed bases and deep enterprise relationships. Jara Montes frames their AI initiatives as surface-level additions to legacy architectures — but whether Traza can convert that framing into market share at scale, especially given the gravitational pull of existing vendor relationships, remains the central strategic question.

Beneath Traza’s product pitch sits a deeper strategic thesis about compounding data advantages. The company describes a two-layered learning architecture: at the agent level, Traza gets smarter across every deployment by absorbing supplier behavior patterns, RFQ response dynamics, pricing anomalies, and workflow edge cases. At the data level, each customer’s information stays fully isolated. “What we’re building is deep operational knowledge of how procurement actually runs in the physical industry — not how it’s supposed to run according to an RFP, but how it really runs, with all the exceptions and workarounds,” Jara Montes said. “That’s extraordinarily hard to replicate if you’re starting from scratch, and it gets harder to catch up with the more deployments we have.”

Three Spanish founders, one fellowship, and a plan to rewire industrial procurement

Traza was co-founded by three Spanish entrepreneurs — Silvestre Jara Montes, Santiago Martínez Bragado, and Sergio Ayala Miñano — who came to the United States through the Exponential Fellowship, a program that brings Europe’s top technical talent to the U.S. to build companies at the frontier of AI. Their backgrounds span both sides of the problem Traza is trying to solve. Jara Montes worked at Amazon and CMA CGM — one of the world’s largest shipping groups — at the intersection of operations strategy and supply chain optimization. Martínez Bragado built and deployed agentic AI at Clarity AI before joining Concourse (backed by a16z, Y Combinator, and CRV) as Founding AI Engineer. Ayala Miñano comes from StackAI, one of the fastest-growing enterprise AI platforms in San Francisco, where he was a Founding Engineer.

None of the founders carry the title of Chief Procurement Officer, a gap that the company acknowledges has occasionally surfaced in buyer conversations. Jara Montes’s response is characteristically direct: “Our work is the answer. The results we’re generating move that conversation quickly.” He noted that the company has senior procurement leaders serving as advisors who have run procurement at the scale of its target customers.

Base10 Partners, the lead investor, is a San Francisco-based venture capital firm that invests in companies automating sectors of what it calls “the Real Economy.” Its portfolio includes Notion, Figma, Nubank, Stripe, and Aurora Solar. Rexhi Dollaku, General Partner at Base10, framed the investment in emphatic terms: “Supply chain and procurement is one of the largest, most underautomated markets in the Real Economy. AI agents are finally capable of doing the work, not just assisting with it.” The supporting cast of investors reinforces the immigrant-founder narrative. Clara Ventures — founded by the executives behind Olapic’s $130 million exit — specifically invests in driven foreign founders building in the United States, and Agell adds operational credibility from building Chartboost into a $100 million revenue business in under three years as a Spanish founder in Silicon Valley.

Why $2.1 million may stretch further than it looks for an enterprise AI startup

At $2.1 million, this is a deliberately small round for a company selling to large enterprises with notoriously long procurement cycles. Jara Montes argues it goes further than it appears for structural reasons. “We leverage Europe as a tech talent hub, where we have a deep network of exceptional engineers — people who want to work at the frontier of AI but have far fewer opportunities to do so than their US counterparts,” he said. “We’re not just lean — we’re built to outcompete on capital efficiency while others are burning through runway trying to hire in San Francisco.”

The go-to-market motion is designed for speed to revenue. Proofs of value are scoped, time-bounded, and converted to paying partnerships. The company says it is not running 18-month enterprise sales cycles before seeing a dollar. The milestone for the next raise is explicit: more paying customers, meaningfully stronger annual recurring revenue, and a repeatable sales motion that makes the seed round, as Jara Montes put it, “an obvious conversation.”

Looking ahead, he outlined an ambitious three-year target: 20 to 30 large industrial enterprises in the U.S. and Europe running Traza across their procurement operations, with over a billion dollars in procurement spend flowing through the platform. Whether that vision is achievable depends on several interlocking variables — the pace at which AI agent capabilities continue to improve, the speed of enterprise adoption in a traditionally conservative buyer segment, and Traza’s ability to navigate the competitive gauntlet of incumbents adding AI features and well-funded startups attacking adjacent workflows.

But the underlying math may be on Traza’s side. In procurement, the money that disappears does not look like waste. It vanishes into inefficiency, missed obligations, unmanaged risks, and forgotten commitments — the kind of silent losses that no one tracks because no one has the bandwidth to track them. The traditional mandate of procurement, as currently configured, ends where the value gap begins: at signature. Traza is building an AI workforce that picks up where the humans leave off. For an industry that has spent decades losing $55 million at a time to the back office nobody watches, that might be precisely the point.

Anthropic’s Claude Managed Agents gives enterprises a new one-stop shop but raises vendor ‘lock-in’ risk

Anthropic announced a new platform last week, Claude Managed Agents, that aims to cut out the more complex parts of AI agent deployment for enterprises and competes with existing orchestration frameworks.

Claude Managed Agents is also an architectural shift: enterprises, already burdened with orchestrating an increasing number of agents, can now choose to embed the orchestration logic in the AI model layer.

While this comes with some potential advantages, such as speed (Anthropic says its customers can deploy agents in days instead of weeks or months), it also hands more control over the enterprise’s AI agent deployments and operations to the model provider, in this case Anthropic. That could result in greater “lock-in” for the enterprise customer, leaving it more subject to Anthropic’s terms, conditions, and any subsequent platform changes.

But maybe that is worth it for your enterprise, as Anthropic further claims that its platform “handles the complexity” by letting users define agent tasks, tools and guardrails with a built-in orchestration harness, all without the enterprise having to build its own sandboxed code execution, checkpointing, credential management, scoped permissions and end-to-end tracing.

The framework manages state, execution graphs and routing and brings managed agents to a vendor-controlled runtime loop.

Even before the release of Claude Managed Agents, new directional VentureBeat research showed that Anthropic was gaining traction at the orchestration level as enterprises adopted its native tooling. Claude Managed Agents represents a new attempt by the firm to widen its footprint as the orchestration method of choice for organizations.

Anthropic is surging in orchestration interest

Orchestration has emerged as an important segment for enterprises to address as they scale AI systems and deploy agentic workflows. 

VentureBeat directional research of several dozen firms for the first quarter of 2026 found that enterprises mostly chose existing frameworks, such as Microsoft’s Copilot Studio/Azure AI Studio, with 38.6% of respondents in February reporting using Microsoft’s platform. VentureBeat surveyed 56 organizations with more than 100 employees in January and 70 in February.

OpenAI closely followed at 25.7%. Both showed strong growth between the first two months of the year.

Anthropic, driven by increased interest in its offerings, such as Claude Code, over the past year, is putting up a fight. 

Adoption of the Anthropic tool-use and workflows API increased from 0% to 5.7% between January and February. This tracks closely with the growing adoption of Anthropic’s foundation models, showing that enterprises using Claude turn to the company’s native orchestration tooling instead of adding a third-party framework. 
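Because the survey is small, it helps to translate those February percentages back into approximate respondent counts. The sketch below assumes each share is a fraction of the 70 organizations surveyed in February, which is how the figures are described above; the counts themselves are inferred here and were not reported by VentureBeat.

```python
FEBRUARY_RESPONDENTS = 70  # organizations surveyed in February, per the article

# Reported February shares, in percent.
shares = {
    "Microsoft Copilot Studio/Azure AI Studio": 38.6,
    "OpenAI": 25.7,
    "Anthropic tool-use and workflows API": 5.7,
}

# Inferred counts: share of respondents rounded to the nearest whole organization.
counts = {
    name: round(FEBRUARY_RESPONDENTS * pct / 100) for name, pct in shares.items()
}

for name, count in counts.items():
    print(f"{name}: about {count} of {FEBRUARY_RESPONDENTS}")
```

On that reading, Anthropic's move from 0% to 5.7% corresponds to roughly four organizations, which is why the research is described as directional.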

While VentureBeat surveyed before the launch of Claude Managed Agents, we can extrapolate that the new tool will build on that growth, especially if it promises a more straightforward way to deploy agents.

Collapsing the external orchestration layer

Enterprises may find a streamlined, internal harness for agents compelling, but it does mean giving up certain controls.

Session data is stored in a database managed by Anthropic, increasing the risk that enterprises become locked into a system run by a single company. This may be less desirable for some firms and conflict with their desire to move away from the locked-in software-as-a-service (SaaS) applications in their current stacks, a move many hope AI will facilitate.

The specter of vendor lock-in means agent execution becomes more model-driven rather than directed by the organization, happens in an environment enterprises don’t fully control, and produces behavior that is harder to guarantee.

It also opens the possibility of giving agents conflicting instructions, especially if the only way for users to exert any control over agents is to prompt them with more context.

Agents could have two control planes: one defined by the enterprises’ orchestration system through instructions and the other as an embedded skill from the Claude runtime.

This could pose an issue for highly sensitive and regulated workflows, such as financial analysis or customer-facing tasks. 

Pricing, control and competitive set

Balancing control with ease is one thing; enterprises must also consider the cost structure of Claude Managed Agents.

Claude Managed Agents introduces a hybrid pricing model that blends token-based billing with a usage-based runtime fee.

This makes Managed Agents more dynamic, though less predictable, when determining cost structures. Enterprises will be charged a standard rate of $0.08 per hour when agents are actively running.

For example, at $0.70 per hour, a one-hour session could cost up to $37 to process 10,000 support tickets, depending on how long each agent runs and how many steps it takes to complete a task.

Microsoft, currently the leader according to VentureBeat’s directional survey, offers several orchestration offerings. Copilot Studio uses a capacity-based billing structure, so enterprises pay for blocks of interactions between users and agents rather than the number of steps an agent takes.

Microsoft’s approach tends to be more predictable than Anthropic’s pricing plan: Copilot Studio starts at $200 per month for 25,000 messages.

Compared with competitors like OpenAI’s Agents SDK, the picture becomes murky. Agents SDK is technically free to use as an open-source project. However, OpenAI bills for the underlying API usage. Agents built and orchestrated with Agents SDK using GPT-5.4, for example, will cost $2.50 per 1 million input tokens and $15 per 1 million output tokens.
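To see why these pricing models are hard to compare, it helps to reduce each to the unit it actually bills. The sketch below uses only the rates quoted in this section ($0.08 per active runtime hour as Anthropic's standard rate, $200 for 25,000 Copilot Studio messages, and $2.50/$15 per million GPT-5.4 tokens). The workload numbers are hypothetical, and Claude token charges, which are billed on top of the runtime fee, are left out because the article does not state them.

```python
# Rates quoted in the article.
ANTHROPIC_RUNTIME_USD_PER_HOUR = 0.08
COPILOT_STUDIO_USD_PER_MONTH = 200.0
COPILOT_STUDIO_MESSAGES_PER_MONTH = 25_000
GPT_5_4_USD_PER_M_INPUT = 2.50
GPT_5_4_USD_PER_M_OUTPUT = 15.0


def copilot_cost_per_message() -> float:
    """Capacity-based billing: cost of one message within the starting block."""
    return COPILOT_STUDIO_USD_PER_MONTH / COPILOT_STUDIO_MESSAGES_PER_MONTH


def openai_token_cost(input_tokens: int, output_tokens: int) -> float:
    """Token-based billing for agents built on Agents SDK with GPT-5.4."""
    return (
        input_tokens / 1_000_000 * GPT_5_4_USD_PER_M_INPUT
        + output_tokens / 1_000_000 * GPT_5_4_USD_PER_M_OUTPUT
    )


def anthropic_runtime_fee(
    active_hours: float, rate: float = ANTHROPIC_RUNTIME_USD_PER_HOUR
) -> float:
    """Runtime portion only; token charges are billed separately."""
    return active_hours * rate


# Hypothetical workload, chosen for illustration.
print(f"Copilot Studio: ${copilot_cost_per_message():.3f} per message")
print(f"GPT-5.4 tokens: ${openai_token_cost(2_000_000, 500_000):.2f}")
print(f"Claude runtime: ${anthropic_runtime_fee(100):.2f} for 100 active hours")
```

The three results are in different units (per message, per token volume, per active hour), which is the practical source of the murkiness: a fair comparison requires knowing how many messages, tokens, and runtime hours a given workload consumes.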

The enterprise decision

Claude Managed Agents does give enterprises who find the actual deployment of production agents too complicated a reprieve. It reduces their engineering overhead while adding speed and simplicity in a fast-changing enterprise environment. 

But that comes with a choice: lose control, observability and portability and risk further vendor lock-in.

Anthropic just made a case for why its ecosystem is becoming not just the foundation model of choice for enterprises, but also the orchestration infrastructure. It becomes more imperative for enterprises to weigh ease of deployment against reduced control.

Google leaders including Demis Hassabis push back on claim of uneven AI adoption internally

A viral post on X from veteran programmer and former Google engineer Steve Yegge set off a rhetorical firestorm this week, drawing sharp public rebuttals from some of Google’s most prominent AI leaders and reopening a sensitive question for the company: how deeply are its own engineers really using the latest generation of AI coding tools?

The debate began after Yegge summarized what he said was the view of his friend, a current and longtime Google employee (or Googler), who claimed the Gemini maker’s internal AI adoption looks much more ordinary and less cutting-edge than outsiders might expect.

Yegge said his Googler friend claimed Google engineering mirrors an “average” industry pattern of a 20%-60%-20% split: a small group of outright AI refusers (20%), a much larger middle still relying mainly on simpler chat and coding-assistant workflows (60%), and another small group of AI-first, cutting-edge engineers using agentic tools extensively and mastering them (20%).

A VentureBeat search of X using its parent company’s AI assistant Grok found that Yegge’s April 13 post spread quickly, topping 4,500 likes, 205 quote posts, 458 replies and 1.9 million views as of April 14.

We’ve reached out to Google for comment on the claims and will update when we receive a response.

A veteran, outspoken Googler voice

Why did the opinion of Yegge’s unnamed Googler friend land so hard? In part because Yegge is not just another commentator taking shots from the sidelines.

He spent about 13 years at Google after earlier stints at Amazon and GeoWorks, later joined Grab, and then became head of engineering at Sourcegraph in 2022. He has long been known in software circles for widely read essays on programming and engineering culture, and for an earlier internal Google memo that accidentally became public in 2011 and drew broad media attention.

That history helps explain why engineers and executives still take his critiques seriously, even when they reject them.

Yegge has built a reputation over many years as a blunt insider-outsider voice on software culture, someone with enough standing in the industry that his judgments can travel fast, especially when they touch nerves inside big technology companies.

Wikipedia’s summary of his career notes his long Google tenure and the outsized attention his blog posts and prior Google critiques have received.

Unpacking Yegge’s friend’s argument

In this case, Yegge’s argument was not simply that Google uses too little AI. It was that the company’s adoption may be uneven, culturally constrained and less transformed than its branding implies.

His friend supposedly argued that some Googlers could not use Anthropic’s Claude Code because it was framed as “the enemy,” and that Gemini was not yet sufficient for the fullest agentic coding workflows. He contrasted Google with what he described as a smaller set of companies moving much faster.

Pushback from Hassabis and current Googlers

The first major pushback came from Demis Hassabis, the co-founder and CEO of Google DeepMind, who replied directly and forcefully. “Maybe tell your buddy to do some actual work and to stop spreading absolute nonsense. This post is completely false and just pure clickbait,” Hassabis wrote.

Other Google leaders followed with lengthier defenses.

Addy Osmani, a director at Google Cloud AI, wrote that Yegge’s account “doesn’t match the state of agentic coding at our company.” He added, “Over 40K SWEs use agentic coding weekly here.”

Osmani said Googlers have access to internal tools and systems including “custom models, skills, CLIs and MCPs,” and pushed back on the idea that Google employees are sealed off from outside models, writing that “folks can even use @AnthropicAI’s models on Vertex” and concluding that “Google is anything but average.”

Other current Google employees reinforced that message. Jaana Dogan, a software engineer at Google, wrote in a quote tweet: “Everyone I work with uses @antigravity like every second of the day,” later following up with another X post stating: “Unpopular opinion: If you think tokens burned is a productivity metric, no one should take you seriously. Imagine you are a top 0.0001% writer and they are only counting the tokens you produce.”

Paige Bailey, a DevX engineering lead at Google DeepMind, said teams had agents “running 24/7.”

Several other Google and DeepMind figures also challenged Yegge’s characterization, some disputing the factual basis of his claims and others suggesting he lacked visibility into current internal usage.

Yegge’s rebuttal

Yegge, for his part, did not retreat. In a follow-up to Hassabis, he wrote, “I’m not trying to misrepresent anyone,” but argued that by his own standard for advanced AI adoption, Google still does not appear to be doing especially well.

He pointed to token usage and the replacement of older development habits with truly agentic workflows as the more meaningful benchmark, and said he would be willing to retract his criticism if Google could show its engineers were operating at that level.

AI adoption vs. AI transformation

That leaves the core dispute unresolved, but clearer. This is less a fight over whether Google engineers use AI at all than a fight over what should count as meaningful adoption.

Googlers are pointing to scale, weekly usage and the availability of internal and external tools. Yegge is arguing that those measures may capture broad exposure without proving a deeper change, an AI transformation, in how engineering work gets done. The clash reflects a wider industry split between visible usage metrics and more transformative, power-user behavior.

For Google, the subject is especially sensitive. Yegge has criticized the company before, including in a 2018 essay explaining why he left, where he argued Google had become too risk-averse and had lost much of its ability to innovate.

If his latest critique had come from a lesser-known poster, it might have faded. Coming from a former longtime Google engineer with a record of memorable public criticism, it instead drew direct responses from some of the company’s top AI figures — and turned a single post into a broader public argument about whether Google’s AI leadership is as deep internally as it looks from the outside.

OpenAI introduces ChatGPT Pro $100 tier with 5X usage limits for Codex compared to Plus

OpenAI is making moves to try and court more developers and vibe coders (those who build software using AI models and natural language) away from rivals like Anthropic.

Today, the firm arguably most synonymous with the generative AI boom announced it will begin offering a new, more mid-range subscription tier — a $100 ChatGPT Pro plan — which joins its free, Go ($8 monthly), Plus ($20 monthly) and existing Pro ($200 monthly) plans for individuals using ChatGPT and related OpenAI products.

OpenAI also currently offers Edu, Business ($25 per user monthly, formerly known as Team) and Enterprise (variably priced) plans for organizations in said sectors.

Why offer a $100 monthly ChatGPT Pro plan?

So why introduce a new $100 ChatGPT Pro plan, then?

The big selling point from OpenAI is that the new plan offers five times the usage limits on Codex, the company’s agentic vibe coding application/harness (the name is shared by both, as well as a lineup of coding-specific language models), compared with the existing $20 monthly Plus plan, which seems fair given the math ($20×5=$100).

As OpenAI co-founder and CEO Sam Altman wrote in a post on X: “It is very nice to see Codex getting so much love. We are launching a $100 ChatGPT Pro tier by very popular demand.”

However, alongside this, OpenAI’s official company account on X noted that “we’re rebalancing Codex usage in [ChatGPT] Plus to support more sessions throughout the week, rather than longer sessions in a single day.”

That sounds a lot like OpenAI is also simultaneously reducing how much ChatGPT Plus users can use its Codex harness and application per day.

What are the new usage limits for the new $100 ChatGPT Pro plan vs. the $20 Plus?

So, what are the current limits on the $20 Plus plan? The new Pro plan gives you 5X greater than…what?

Turns out, this is trickier than you’d think to calculate, because it actually varies depending on which underlying AI model you are using to power the Codex application or harness, and whether you are working on code stored in the cloud or locally on your machine or servers.

OpenAI’s Developer website notes that for individual users, usage is categorized by “Local Messages” (tasks run on the user’s machine) and “Cloud Tasks” (tasks run on OpenAI’s infrastructure), both of which share a five-hour rolling window. Currently, it actually shows the $100 Pro plan gives you 10X as many messages as the $20 Plus plan (see below)!

ChatGPT Plus ($20/month)

  • GPT-5.4: 33–168 local messages every 5 hours.

  • GPT-5.4-mini: 110–560 local messages every 5 hours.

  • GPT-5.3-Codex: 45–225 local messages and 10–60 cloud tasks every 5 hours.

  • Code Reviews: 10–25 pull requests per week

ChatGPT Pro 5X ($100/month)

  • GPT-5.4: 330–1,680 local messages every 5 hours.

  • GPT-5.4-mini: 1,100–5,600 local messages every 5 hours.

  • GPT-5.3-Codex: 450–2,250 local messages and 100–600 cloud tasks every 5 hours.

  • Code Reviews: 100–250 pull requests per week

ChatGPT Pro 20x ($200/month)

  • GPT-5.4: 660–3,360 local messages every 5 hours.

  • GPT-5.4-mini: 2,200–11,200 local messages every 5 hours.

  • GPT-5.3-Codex: 900–4,500 local messages and 200–1,200 cloud tasks every 5 hours.

  • Code Reviews: 200–500 pull requests per week

  • Exclusive Access: Includes GPT-5.3-Codex-Spark (research preview), which has its own dynamic usage limit.

And as OpenAI’s Help documentation states:

“The number of Codex messages you can send within these limits varies based on the size and complexity of your coding tasks, and where you execute tasks. Small scripts or simple functions may only consume a fraction of your allowance, while larger codebases, long running tasks, or extended sessions that require Codex to hold more context will use significantly more per message.”
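The multipliers implied by the published ranges can be checked with a few lines of arithmetic. The sketch below copies the local-message figures from the lists above; the dictionary layout and the `multiplier` helper are illustrative, not anything OpenAI publishes.

```python
# Local-message ranges per 5-hour window, copied from the plan lists above.
LIMITS = {
    "Plus": {"GPT-5.4": (33, 168), "GPT-5.4-mini": (110, 560), "GPT-5.3-Codex": (45, 225)},
    "Pro 5X": {"GPT-5.4": (330, 1680), "GPT-5.4-mini": (1100, 5600), "GPT-5.3-Codex": (450, 2250)},
    "Pro 20x": {"GPT-5.4": (660, 3360), "GPT-5.4-mini": (2200, 11200), "GPT-5.3-Codex": (900, 4500)},
}


def multiplier(plan: str, model: str) -> tuple[float, float]:
    """Return (low, high) ratios of a plan's range to the Plus range for one model."""
    base_low, base_high = LIMITS["Plus"][model]
    low, high = LIMITS[plan][model]
    return low / base_low, high / base_high


for plan in ("Pro 5X", "Pro 20x"):
    for model in LIMITS[plan]:
        low, high = multiplier(plan, model)
        print(f"{plan:8} {model:14} {low:.0f}x to {high:.0f}x the Plus range")
```

On these figures, every model on the $100 tier comes out at 10 times the Plus range and every model on the $200 tier at 20 times, which matches the discrepancy noted above between the “5X” label and the numbers currently listed.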

The larger strategic implications and context

OpenAI’s sudden move toward the $100 price point and expanded agentic capacity comes amid the unprecedented financial ascent of its chief rival, Anthropic.

Just days ago, Anthropic revealed its annualized run-rate revenue (ARR) has topped $30 billion, surpassing OpenAI’s last reported ARR of approximately $24–$25 billion.

This growth has been fueled by the massive adoption of Claude Code and Claude Cowork, products that have set the benchmark for enterprise-grade autonomous coding.

The competitive friction intensified on April 4, 2026, when Anthropic officially blocked Claude subscriptions from being used to provide the intelligence for third-party agentic AI harnesses like OpenClaw.

To be clear, Anthropic’s Claude models can still be used with OpenClaw. Users must now pay for that access through Anthropic’s application programming interface (API) or extra usage credits, rather than as part of the monthly Claude subscription tiers. Some have likened those tiers to an “all-you-can-eat” buffet, which makes the economics challenging for Anthropic when power users and third-party harnesses like OpenClaw consume more in tokens than the $20 or $200 monthly price of the plans.

OpenClaw’s creator, Peter Steinberger, was notably hired by OpenAI in February 2026 to lead their personal agent strategy, and has, since joining, actively spoken out against Anthropic’s limitations — advising that OpenAI’s Codex and models generally don’t have the same restrictions as Anthropic is now imposing.

By hiring Steinberger and subsequently launching a Pro tier that provides the high-volume capacity Anthropic recently restricted, OpenAI is effectively courting the displaced OpenClaw community to reclaim the professional developer market.

New framework lets AI agents rewrite their own skills without retraining the underlying model

One major challenge in deploying autonomous agents is building systems that can adapt to changes in their environments without the need to retrain the underlying large language models (LLMs).

Memento-Skills, a new framework developed by researchers at multiple universities, addresses this bottleneck by giving agents the ability to develop their skills by themselves. “It adds its continual learning capability to the existing offering in the current market, such as OpenClaw and Claude Code,” Jun Wang, co-author of the paper, told VentureBeat.

Memento-Skills acts as an evolving external memory, allowing the system to progressively improve its capabilities without modifying the underlying model. The framework provides a set of skills that can be updated and expanded as the agent receives feedback from its environment.

For enterprise teams running agents in production, that matters. The alternative — fine-tuning model weights or manually building skills — carries significant operational overhead and data requirements. Memento-Skills sidesteps both.

The challenges of building self-evolving agents

Self-evolving agents are crucial because they overcome the limitations of frozen language models. Once a model is deployed, its parameters remain fixed, restricting it to the knowledge encoded during training and whatever fits in its immediate context window.

Giving the model an external memory scaffolding enables it to improve without the costly and slow process of retraining. However, current approaches to agent adaptation largely rely on manually designed skills to handle new tasks. While some automatic skill-learning methods exist, they mostly produce text-only guides that amount to prompt optimization. Other approaches simply log single-task trajectories that don’t transfer across different tasks.

Furthermore, when these agents try to retrieve relevant knowledge for a new task, they typically rely on semantic similarity routers, such as standard dense embeddings; high semantic overlap does not guarantee behavioral utility. An agent relying on standard RAG might retrieve a “password reset” script to solve a “refund processing” query simply because the documents share enterprise terminology.

“Most retrieval-augmented generation (RAG) systems rely on similarity-based retrieval. However, when skills are represented as executable artifacts such as markdown documents or code snippets, similarity alone may not select the most effective skill,” Wang said. 

How Memento-Skills stores and updates skills

To solve the limitations of current agentic systems, the researchers built Memento-Skills. The paper describes the system as “a generalist, continually-learnable LLM agent system that functions as an agent-designing agent.” Instead of keeping a passive log of past conversations, Memento-Skills creates a set of skills that act as a persistent, evolving external memory.

These skills are stored as structured markdown files and serve as the agent’s evolving knowledge base. Each reusable skill artifact is composed of three core elements. It contains declarative specifications that outline what the skill is and how it should be used. It includes specialized instructions and prompts that guide the language model’s reasoning. And it houses the executable code and helper scripts that the agent runs to actually solve the task.
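To make the three-part structure concrete, here is a minimal sketch of how one such artifact might be modeled and rendered to markdown. The class name, field names and section headings are assumptions for illustration; the article does not describe the actual file layout Memento-Skills uses.

```python
from dataclasses import dataclass, field

FENCE = "`" * 3  # built at runtime so this example nests cleanly inside a document


@dataclass
class SkillArtifact:
    """Hypothetical in-memory view of one skill; all field names are illustrative."""
    name: str
    specification: str   # declarative: what the skill is and how to use it
    instructions: str    # prompts that guide the model's reasoning
    code: str            # executable code the agent runs
    helper_scripts: dict[str, str] = field(default_factory=dict)

    def to_markdown(self) -> str:
        """Render the skill as one structured markdown document."""
        parts = [
            f"# {self.name}",
            "## Specification", self.specification,
            "## Instructions", self.instructions,
            "## Code", f"{FENCE}python\n{self.code}\n{FENCE}",
        ]
        for filename, body in self.helper_scripts.items():
            parts += [f"### Helper: {filename}", f"{FENCE}\n{body}\n{FENCE}"]
        return "\n\n".join(parts) + "\n"


web_search = SkillArtifact(
    name="web_search",
    specification="Look up a query on the web and return the top result titles.",
    instructions="Prefer primary sources; stop after three results.",
    code="def run(query):\n    return search(query)[:3]",
)
print(web_search.to_markdown())
```

Keeping the specification, prompts and code in one file means a rewrite can touch any of the three in a single update, which is the property the learning loop described next relies on.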

Memento-Skills achieves continual learning through its “Read-Write Reflective Learning” mechanism, which frames memory updates as active policy iteration rather than passive data logging. When faced with a new task, the agent queries a specialized skill router to retrieve the most behaviorally relevant skill — not just the most semantically similar one — and executes it.

After the agent executes the skill and receives feedback, the system reflects on the outcome to close the learning loop. Rather than just appending a log of what happened, the system actively mutates its memory. If the execution fails, an orchestrator evaluates the trace and rewrites the skill artifacts. This means it directly updates the code or prompts to patch the specific failure mode. When no existing skill can be patched to fit, it creates an entirely new skill.

Memento-Skills also updates the skill router through a one-step offline reinforcement learning process that learns from execution feedback rather than just text overlap. “The true value of a skill lies in how it contributes to the overall agentic workflow and downstream execution,” Wang said. “Therefore, reinforcement learning provides a more suitable framework, as it enables the agent to evaluate and select skills based on long-term utility.”
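The difference between routing on similarity and routing on observed utility can be illustrated with a toy example. This is not the paper’s router: it is a tabular, bandit-style stand-in that averages logged rewards per task type and skill, and every name and number in it is hypothetical.

```python
from collections import defaultdict

# Made-up similarity scores: the wrong skill looks closer on text overlap alone.
SIMILARITY = {
    ("refund", "password_reset"): 0.9,
    ("refund", "refund_processing"): 0.6,
}


def similarity(task_kind: str, skill: str) -> float:
    return SIMILARITY.get((task_kind, skill), 0.0)


class UtilityRouter:
    """Picks the skill with the best average logged reward for a task type."""

    def __init__(self, prior):
        self.prior = prior                  # used only when no feedback exists yet
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def fit(self, log):
        """Single offline pass over logged (task_kind, skill, reward) tuples."""
        for task_kind, skill, reward in log:
            self.total[(task_kind, skill)] += reward
            self.count[(task_kind, skill)] += 1

    def value(self, task_kind, skill):
        n = self.count[(task_kind, skill)]
        return self.total[(task_kind, skill)] / n if n else self.prior(task_kind, skill)

    def route(self, task_kind, skills):
        return max(skills, key=lambda skill: self.value(task_kind, skill))


SKILLS = ["password_reset", "refund_processing"]
LOG = [
    ("refund", "password_reset", 0.0),
    ("refund", "password_reset", 0.0),
    ("refund", "refund_processing", 1.0),
]

router = UtilityRouter(similarity)
print("before feedback:", router.route("refund", SKILLS))
router.fit(LOG)
print("after feedback: ", router.route("refund", SKILLS))
```

Before any feedback the router falls back to similarity and repeats the “password reset” mistake described earlier; once execution outcomes are logged, the skill that actually resolved refund tasks wins.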

To prevent regression in a production environment, the automated skill mutations are guarded by an automatic unit-test gate. The system generates a synthetic test case, executes it through the updated skill, and checks the results before saving the changes to the global library.
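The read, execute, reflect and write cycle, together with the test gate, can be sketched in a few dozen lines. In this simplified version a skill is a plain Python callable and the rewrite step is a stub supplied by the caller; in the real system a language model proposes the patch. All function names here are assumptions.

```python
from typing import Callable

Skill = Callable[[str], str]


def passes_gate(candidate: Skill, test_input: str, expected: str) -> bool:
    """Unit-test gate: run one synthetic case through the candidate skill."""
    try:
        return candidate(test_input) == expected
    except Exception:
        return False


def reflective_step(
    library: dict[str, Skill],
    name: str,
    task: str,
    is_correct: Callable[[str], bool],
    rewrite: Callable[[Skill], Skill],
    synthetic_case: tuple[str, str],
) -> str:
    """Read a skill, execute it, and on failure try to write back a patched version."""
    skill = library[name]                      # read
    try:
        succeeded = is_correct(skill(task))    # execute, then collect feedback
    except Exception:
        succeeded = False
    if succeeded:
        return "ok"
    candidate = rewrite(skill)                 # reflect: propose a patched skill
    if passes_gate(candidate, *synthetic_case):
        library[name] = candidate              # write: only gated changes are saved
        return "patched"
    return "rejected"


# Toy run: the stored skill is buggy, and a stubbed rewrite supplies the fix.
library: dict[str, Skill] = {"shout": lambda text: text.lower()}
status = reflective_step(
    library,
    "shout",
    task="hello",
    is_correct=lambda out: out == "HELLO",
    rewrite=lambda old: (lambda text: text.upper()),
    synthetic_case=("abc", "ABC"),
)
print(status, library["shout"]("hello"))
```

The point of the gate is that a failed rewrite leaves the library untouched, so a bad patch cannot silently replace a skill that other tasks depend on.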

By continuously rewriting and refining its own executable tools, Memento-Skills enables a frozen language model to build robust muscle memory and progressively expand its capabilities end-to-end.

Putting the self-evolving agent to the test

The researchers evaluated Memento-Skills on two rigorous benchmarks. The first is General AI Assistants (GAIA), which requires complex multi-step reasoning, multi-modality handling, web browsing, and tool use. The second is Humanity’s Last Exam, or HLE, an expert-level benchmark spanning eight diverse academic subjects like mathematics and biology. The entire system was powered by Gemini-3.1-Flash acting as the underlying frozen language model.

The system was compared against a Read-Write baseline that retrieves skills and collects feedback but doesn’t have self-evolving features. The researchers also tested their custom skill router against standard semantic retrieval baselines, including BM25 and Qwen3 embeddings.

The results proved that actively self-evolving memory vastly outperforms a static skill library. On the highly diverse GAIA benchmark, Memento-Skills improved test set accuracy by 13.7 percentage points over the static baseline, achieving 66.0% compared to 52.3%. On the HLE benchmark, where the domain structure allowed for massive cross-task skill reuse, the system more than doubled the baseline’s performance, jumping from 17.9% to 38.7%.

Moreover, the specialized skill router of Memento-Skills avoids the classic retrieval trap where an irrelevant skill is selected simply because of semantic similarity. Experiments show that Memento-Skills boosts end-to-end task success rates to 80%, compared to just 50% for standard BM25 retrieval.

The researchers observed that Memento-Skills manages this performance through highly organic, structured skill growth. Both benchmark experiments started with just five atomic seed skills, such as basic web search and terminal operations. On the GAIA benchmark, the agent autonomously expanded this seed group into a compact library of 41 skills to handle the diverse tasks. On the expert-level HLE benchmark, the system dynamically scaled its library to 235 distinct skills. 

Finding the enterprise sweet spot

The researchers have released the code for Memento-Skills on GitHub, and it is readily available for use.

For enterprise architects, the effectiveness of this system depends on domain alignment. Instead of simply looking at benchmark scores, the core business tradeoff lies in whether your agents are handling isolated tasks or structured workflows.

“Skill transfer depends on the degree of similarity between tasks,” Wang said. “First, when tasks are isolated or weakly related, the agent cannot rely on prior experience and must learn through interaction.” In such scattershot environments, cross-task transfer is limited. “Second, when tasks share substantial structure, previously acquired skills can be directly reused. Here, learning becomes more efficient because knowledge transfers across tasks, allowing the agent to perform well on new problems with little or no additional interaction.”

Given that the system requires recurring task patterns to consolidate knowledge, enterprise leaders need to know exactly where to deploy this today and where to hold off.

“Workflows are likely the most appropriate setting for this approach, as they provide a structured environment in which skills can be composed, evaluated, and improved,” Wang said.

However, he cautioned against over-deployment in areas not yet suited for the framework. “Physical agents remain largely unexplored in this context and require further investigation. In addition, tasks with longer horizons may demand more advanced approaches, such as multi-agent LLM systems, to enable coordination, planning, and sustained execution over extended sequences of decisions.”

As the industry moves toward agents that autonomously rewrite their own production code, governance and security remain paramount. While Memento-Skills employs foundational safety rails like automatic unit-test gates, a broader framework will likely be needed for enterprise adoption.

“To enable reliable self-improvement, we need a well-designed evaluation or judge system that can assess performance and provide consistent guidance,” Wang said. “Rather than allowing unconstrained self-modification, the process should be structured as a guided form of self-development, where feedback steers the agent toward better designs.”

How MassMutual and Mass General Brigham turned AI pilot sprawl into production results

Enterprise AI programs rarely fail because of bad ideas. More often, they get stuck in ungoverned pilot mode and never reach production. At a recent VentureBeat event, technology leaders from MassMutual and Mass General Brigham explained how they avoided that trap — and what the results look like when discipline replaces sprawl.

At MassMutual, the results are concrete: 30% developer productivity gains, IT help desk resolution times reduced from 11 minutes to one, and customer service calls cut from 15 minutes to just one or two.

“We’re always starting with why do we care about this problem?” Sears Merritt, MassMutual’s head of enterprise technology and experience, said at the event. “If we solve the problem, how are we gonna know we solved it? And, how much value is associated with doing that?”

Defining metrics, establishing strong feedback loops

MassMutual, a 175-year-old company serving millions of policy owners and customers, has pushed AI into production across the business — customer support, IT, customer acquisition, underwriting, servicing, claims, and other areas.

Merritt said his team follows the scientific method, beginning with a hypothesis and testing whether it has an outcome that will tangibly drive the business forward. Some ideas are great, but they may be “intractable in the business” due to factors like lack of data or access, or regulatory constraint.

“We won’t go any further with an idea until we get crystal clear on how we’re going to measure, and how we’re going to define success.”

Ultimately, it’s up to different departments and leaders to define what quality means: Choose a metric and define the minimum level of quality before a tool is placed into the hands of teams and partners.

That starting point creates a quick feedback loop. “The things that we find slow us down is where there isn’t shared clarity on what outcome we’re trying to achieve,” which can lead to confusion and constant re-adjusting, said Merritt. “We don’t go to production until there is a business partner that says, ‘Yes, that works.’”

His team is strategic about evaluating emerging tools, and “extremely rigorous” when testing and measuring what “good” means. For instance, they perform trust scoring to lower hallucination rates, establish thresholds and evaluation criteria, and monitor for feature and output drift.

Merritt also operates with a no-commitment policy — meaning the company doesn’t lock itself into using a particular model. It has what he calls an “incredibly heterogeneous” technology environment combining best of breed models alongside mainframes running on COBOL. That flexibility isn’t accidental. His team built common service layers, microservices and APIs that sit between the AI layer and everything underneath — so when a better model comes along, swapping it in doesn’t mean starting over.

Because, Merritt explained, “the best of breed today might be the worst of breed tomorrow, and we don’t want to set ourselves up to fall behind.”
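A rough sketch of what such a service layer can look like in code follows. The interface, the two vendor classes and the `SummaryService` name are all hypothetical and do not describe MassMutual’s implementation; the point is only that business code depends on an interface rather than on a specific model.

```python
from typing import Protocol


class TextModel(Protocol):
    """Anything the service layer can call; the method name is hypothetical."""

    def complete(self, prompt: str) -> str: ...


class VendorAModel:
    def complete(self, prompt: str) -> str:
        return f"[vendor-a] {prompt}"


class VendorBModel:
    def complete(self, prompt: str) -> str:
        return f"[vendor-b] {prompt}"


class SummaryService:
    """Business-facing service; callers never import a vendor class directly."""

    def __init__(self, model: TextModel) -> None:
        self._model = model

    def swap_model(self, model: TextModel) -> None:
        # Replacing the model does not change the interface callers see.
        self._model = model

    def summarize(self, text: str) -> str:
        return self._model.complete(f"Summarize: {text}")


service = SummaryService(VendorAModel())
print(service.summarize("claim 123"))
service.swap_model(VendorBModel())
print(service.summarize("claim 123"))
```

Everything that calls `summarize` keeps working after the swap, which is the property that makes a no-commitment policy practical.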

Weeding instead of letting a thousand flowers bloom

Mass General Brigham (MGB), for its part, took more of a spray and pray approach — at first.

Around 15,000 researchers in the not-for-profit health system have been using AI, ML, and deep learning for the last 10 to 15 years, CTO Nallan “Sri” Sriraman said at the same VB event.

But last year, he made a bold choice: His team shut down a sprawl of non-governed AI pilots. Initially, “we did follow the thousand flowers bloom [methodology], but we didn’t have a thousand flowers, we had probably a few tens of flowers trying to bloom,” he said.

Like Merritt’s team at MassMutual, MGB pivoted to a more holistic view, examining why they were developing certain tools for specific departments or workflows. They questioned what capabilities they wanted and needed and what investment those required.

Sriraman’s team also spoke with their primary platform providers — Epic, Workday, ServiceNow, Microsoft — about their roadmaps. This was a “pivotal moment,” he noted, as they realized they were building in-house tools that vendors were already providing (or were planning to roll out).

As Sriraman put it: “Why are we building it ourselves? We are already on the platform. It is going to be in the workflow. Leverage it.”

That said, the marketplace is still nascent, which can make for difficult decisions. “The analogy I will give is when you ask six blind men to touch an elephant and say, what does this elephant look like?” Sriraman said. “You’re gonna get six different answers.”

There’s nothing wrong with that, he noted; it’s just that everybody is discovering and experimenting as the landscape keeps shifting.

Instead of a Wild West environment, Sriraman’s team distributes Microsoft Copilot to users across the business, and uses a “small landing zone” where they can safely test more sophisticated products and control token use.

They also began “consciously embedding AI champions” across business groups. “This is kind of a reverse of letting a thousand flowers bloom, carefully planting and nourishing,” Sriraman said.

Observability is another big consideration; he describes real-time dashboards that manage model drift and safety and allow IT teams to govern AI “a little more pragmatically.” Health monitoring is critical with AI systems, he noted, and his team has established principles and policies around AI use, not to mention least access privileges.

In clinical settings, the guardrails are absolute: AI systems never issue the final decision. “There’s always going to be a doctor or a physician assistant in the loop to close the decision,” Sriraman said. He cited radiology report generation as one area where AI is used heavily, but where a radiologist always signs off.

Sriraman was clear: “Thou shall not do this: Don’t show PHI [protected health information] in Perplexity. As simple as that, right?”

And, importantly, there must be safety mechanisms in place. “We need a big red button, kill it,” Sriraman emphasized. “We don’t put anything in the operational setting without that.”

Ultimately, while agentic AI is a transformative technology, the enterprise approach to it doesn’t have to be dramatically different. “There is nothing new about this,” Sriraman said. “You can replace the word BPM [business process management] from the ’90s and 2000s with AI. The same concepts apply.”

Intuit’s AI agents hit 85% repeat usage. The secret was keeping humans involved

When Intuit shipped AI agents to 3 million customers, 85% came back. The reason, according to the company’s EVP and GM: combining AI with human expertise turned out to matter more than anyone expected — not less.

Marianna Tessel, the financial software company’s EVP and GM, calls this AI-HI combination a “massive ask” from its customers, noting that it provides another level of confidence and trust. 

“One of the things we learned that has been fascinating is really the combination of human intelligence and artificial intelligence,” Tessel said in a new VB Beyond the Pilot podcast. “Sometimes it’s the combination of AI and HI that gives you better results.”

Chatbots alone aren’t the answer 

Intuit — the parent company of QuickBooks, TurboTax, MailChimp and other widely-used financial products — was one of the first major enterprises to go all in on generative AI with its GenOS platform last June (long before fears of the “SaaSpocalypse” had SaaS companies scrambling to rethink their strategies). 

Quickly, though, the company recognized that chatbots alone weren’t the answer in enterprise environments, and pivoted to what it now calls Intuit Intelligence. The dashboard-like platform features specialized AI agents for sales, tax, payroll, accounting and project management that users can interact with using natural language to gain insights on their data, automate tasks, and generate reports. 

Customers report invoices are being paid 90% in full and five days faster, and that manual work has been reduced by 30%. AI agents help close books, categorize transactions, run payroll, automate invoice reminders and surface discrepancies.

For instance, one Intuit customer uncovered fraud after interacting with AI agents and asking questions about amounts that didn’t add up. “In the beginning it was like, ‘Is that an error?’ And as he dug in, he discovered very significant fraud,” Tessel said.

Why humans are still in the loop

Still, Intuit operates on the principle that humans are “always accessible,” Tessel said. Platforms are built in a way that users can ask questions of a human expert when they’re not getting what they need from the AI agent, or want a human to bounce ideas off of. 

“I’m not talking about product experts,” Tessel said. “I’m talking about an actual accounting expert or tax expert or payroll expert.”

The platform has also been built to suggest human involvement in “high stakes” decision-making scenarios. AI goes to a certain level, then human experts review and categorize the rest. This provides a level of confidence, according to Tessel. 

“We actually believe it becomes more needed and more powerful at the right moments,” she said. “The expert still provides things that are unique.”

The next step is giving customers the tools to perform next-gen tasks like vibe coding — but with simple architectures to reduce the burden for customers. “What we’re testing is this idea of, you can actually do coding without realizing that that’s what you are doing,” Tessel said. 

For example, a merchant running a flower shop wants to ensure that they have the right amount of inventory in stock for Mother’s Day. They can vibe code an agent that analyzes previous years’ sales and creates purchase orders where stock is low. That agent could then be instructed to automatically perform that task for future Mother’s Days and other big holidays. 
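Under the hood, the logic such an agent would need is simple. The sketch below is one illustrative way to turn past holiday sales into purchase orders; the function name, the 10% buffer, and all item names and quantities are made up, and it is not Intuit’s implementation.

```python
from dataclasses import dataclass


@dataclass
class PurchaseOrder:
    item: str
    quantity: int


def plan_orders(
    past_sales: dict[str, list[int]],
    stock: dict[str, int],
    buffer_pct: int = 10,
) -> list[PurchaseOrder]:
    """Order enough of each item to cover average past holiday sales plus a buffer."""
    orders = []
    for item, history in past_sales.items():
        # Integer ceiling division avoids floating-point rounding surprises.
        numerator = sum(history) * (100 + buffer_pct)
        denominator = len(history) * 100
        target = -(-numerator // denominator)
        shortfall = target - stock.get(item, 0)
        if shortfall > 0:
            orders.append(PurchaseOrder(item, shortfall))
    return orders


# Hypothetical Mother's Day unit sales for the last three years.
past_sales = {"roses": [400, 440, 480], "tulips": [200, 220, 210]}
stock = {"roses": 150, "tulips": 300}

for order in plan_orders(past_sales, stock):
    print(f"Order {order.quantity} x {order.item}")
```

With these numbers the shop already holds enough tulips, so only roses are reordered. Rerunning the same function before each holiday is the “automatically perform that task” step.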

Some users will be more sophisticated and want the ability to dive deeper into the technology. “But some just want to express what they want to have happen,” Tessel said. “Because all they want to do is run their business.” 

Listen to the full podcast to hear about: 

  • Why first-party data can create a “moat” for SaaS companies.

  • Why showing AI’s logic matters more than a polished interface.

  • Why 600,000 data points per customer changes what AI can tell you about your business.

You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.

The end of ‘shadow AI’ at enterprises? Kilo launches KiloClaw for Organizations to enable secure AI agents at scale

As generative AI matures from a novelty into a workplace staple, a new friction point has emerged: the “shadow AI” or “Bring Your Own AI (BYOAI)” crisis. Much like the unsanctioned use of personal devices in years past, developers and knowledge workers are increasingly deploying autonomous agents on personal infrastructure to manage their professional workflows.

“Our journey with Kilo Claw has been to make it easier and easier and more accessible to folks,” says Kilo co-founder Scott Breitenother. Today, the company dedicated to providing a portable, multi-model, cloud-based AI coding environment is moving to formalize this “shadow AI” layer: it’s launching KiloClaw for Organizations and KiloClaw Chat, a suite of tools designed to provide enterprise-grade governance over personal AI agents.

The announcement comes at a period of high velocity for the company. Since making its securely hosted, one-click OpenClaw product for individuals, KiloClaw, generally available last month, more than 25,000 users have integrated the platform into their daily workflows.

Simultaneously, Kilo’s proprietary agent benchmark, PinchBench, has logged over 250,000 interactions and recently gained significant industry validation when it was referenced by Nvidia CEO Jensen Huang during his keynote at the 2026 Nvidia GTC conference in San Jose, California.

The shadow AI crisis: Addressing the BYOAI problem

The impetus for KiloClaw for Organizations stems from a growing visibility gap within large enterprises. In a recent interview with VentureBeat, Kilo leadership detailed conversations with high-level AI directors at government contractors who found their developers running OpenClaw agents on random VPS instances to manage calendars and monitor repositories.

“What we’re announcing on Tuesday is Kilo Claw for organizations, where a company can buy an organization-level package of Kilo Claws and give every team member access,” explained Kilo co-founder and head of product and engineering Emilie Schario during the interview.

“We can’t see any of it,” the head of AI at one such firm reportedly told Kilo. “No audit logs. No credential management. No idea what data is touching what API.”

This lack of oversight has led some organizations to issue blanket bans on autonomous agents before a clear strategy on deployment could be formed.

Anand Kashyap, CEO and founder of data security firm Fortanix, told VentureBeat without seeing Kilo’s announcement that while “Openclaw has taken the technology world by storm… the enterprise usage is minimal due to the security concerns of the open source version.”

Kashyap expanded on this trend:

“In recent times, NVIDIA (with NemoClaw), Cisco (DefenseClaw), Palo Alto Networks, and Crowdstrike have all announced offerings to create an enterprise-ready version of OpenClaw with guardrails and governance for agent security. However, enterprise adoption continues to be low.

“Enterprises like centralized IT control, predictable behavior, and data security which keeps them compliant. An autonomous agentic platform like OpenClaw stretches the envelope on all these parameters, and while security majors have announced their traditional perimeter security measures, they don’t address the fundamental problems of having a reduced attack surface. Over time, we will see an agentic platform emerge where agents are pre-built and packaged, and deployed responsibly with centralized controls, and data access controls built into the agentic platform as well as the LLMs they call upon to get instructions on how to perform the next task. Technologies like Confidential Computing provide compartmentalization of data and processing, and are tremendously helpful in reducing the attack surface.”

KiloClaw for Organizations is positioned as the way for the security team to say “yes,” providing the visibility and control required to bring these agents in-house.

It transitions agents from developer-managed infrastructure into a managed environment characterized by scoped access and organizational-level controls.

Technology: Universal persistence and the “Swiss cheese” method

A core technical hurdle in the current agent landscape is the fragmentation of chat sessions.

During the VentureBeat interview, Schario noted that even advanced tools often struggle with canonical sessions, frequently dropping messages or failing to sync across devices.

Schario emphasized the security layer that supports this new structure: “You get all the same benefits of the Kilo gateway and the Kilo platform: you can limit what models people can use, get usage visibility, cost controls, and all the advantages of leveraging Kilo with managed, hosted, controlled Kilo Claw.”

To address the inherent unreliability of autonomous agents—such as missed cron jobs or failed executions—Kilo employs what Schario calls the “Swiss cheese method” of reliability. By layering additional protections and deterministic guardrails on top of the base OpenClaw architecture, Kilo aims to ensure that tasks, such as a daily 6:00 PM summary, are completed even if the underlying agent logic falters.

This is critical because, as Schario noted, “The real risk for any company is data leakage, and that can come from a bot commenting on a GitHub issue or accidentally emailing the person who’s going to get fired before they get fired.”
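The layering idea behind the “Swiss cheese” approach can be shown with a small sketch: several imperfect attempts are stacked so that a failure in one is caught by the next, ending in a deterministic fallback. Kilo has not published how its guardrails are built, so the function names and the summary text below are purely illustrative.

```python
from typing import Callable


def run_with_layers(layers: list[tuple[str, Callable[[], str]]]) -> tuple[str, str]:
    """Try each layer in order; return (layer_name, result) from the first that succeeds."""
    errors = []
    for name, attempt in layers:
        try:
            result = attempt()
            if result:  # treat empty output as a miss as well
                return name, result
            errors.append(f"{name}: empty result")
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all layers failed: " + "; ".join(errors))


def flaky_agent_summary() -> str:
    """Stands in for the agent-generated summary failing to arrive."""
    raise TimeoutError("agent did not respond")


def template_summary() -> str:
    """Deterministic fallback that needs no model call."""
    return "Daily summary: 3 PRs merged, 1 incident open."


layer, text = run_with_layers([
    ("agent", flaky_agent_summary),
    ("agent-retry", flaky_agent_summary),
    ("deterministic-template", template_summary),
])
print(layer, "->", text)
```

Each layer has holes of its own, but the summary still goes out as long as the holes do not all line up.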

Product: KiloClaw Chat and organizational guardrails

While managed infrastructure solves the backend problem, KiloClaw Chat addresses the user experience. Schario noted that “Hosted, managed OpenClaw is easier to get started with, but it’s not enough, and it still requires you to be at the edge of technology to understand how to set it up.” Kilo is looking to lower that barrier for the average worker, asking: “How do we give people who have never heard the phrase OpenClaw or Clawdbot an always-on AI assistant?”

Traditionally, interacting with an OpenClaw agent required connecting to third-party messaging services like Telegram or Discord—a process that involves navigating “BotFather” tokens and technical configurations that alienate non-engineers.

“One of the number one hurdles we see, both anecdotally and in the data, is that you get your bot running and then you have to connect a channel to it. If you don’t know what’s going on, it’s overwhelming,” Schario observed.

“We solved that problem. You don’t need to set up a channel. You can chat with Kilo in the web UI and, with the Kilo Claw app on your phone, interact with Kilo without setting an external channel,” she continued.

This native approach is essential for corporate compliance because, as she further explained, “When we were talking to early enterprise opportunities, they don’t want you using your personal Telegram account to chat with your work bot.” As Schario put it, there is a reason enterprise communication doesn’t flow through personal DMs; when a company shuts off access, they must be able to shut off access to the bot.

Looking ahead, the company plans to integrate these environments further. “What we’re going to do is make Kilo Chat the waypoint between Telegram, Discord, and OpenClaw, so you get all the convenience of Kilo Chat but can use it in the other channels,” Breitenother added.

The enterprise package includes several critical governance features:

  • Identity Management: SSO/OIDC integration and SCIM provisioning for automated user lifecycles.

  • Centralized Billing: Full visibility into compute and inference usage across the entire organization.

  • Admin Controls: Org-wide policies regarding which models can be used, specific permissions, and session durations.

  • Secrets Configuration: Integration with 1Password ensures that agents never handle credentials in plain text, preventing accidental leaks.

Licensing and governance: The “bot account” model

Other security experts note that handling permissions for bots and AI agents is among the most pressing problems enterprises face today.

As Ev Kontsevoy, CEO and co-founder of AI infrastructure and identity management company Teleport told VentureBeat without seeing the Kilo news: “The potential impact of OpenClaw as a non-deterministic actor demonstrates why identity can’t be an afterthought. You have an autonomous agent with shell access, browser control, and API credentials — running on a persistent loop, across dozens of messaging platforms, with the ability to write its own skills. That’s not a chatbot. That’s a non-deterministic actor with broad infrastructure access and no cryptographic identity, no short-lived credentials, and no real-time audit trail tying actions to a verifiable actor.”

Kilo is proposing to solve it with a major change in organizational structure: the adoption of employee “bot accounts.”

In Kilo’s vision, every employee eventually carries two identities—their standard human account and a corresponding bot account, such as scott.bot@kilo.ai.

These bot identities operate with strictly limited, read-only permissions. For example, a bot might be granted read-only access to company logs or a GitHub account with contributor-only rights. This “scoped” approach allows the agent to maintain full visibility of the data it needs to be helpful while ensuring it cannot accidentally share sensitive information with others.
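A scoped identity of this kind boils down to an explicit allow-list that is checked before every action. The sketch below assumes a simple "resource:action" scope format and hypothetical class and function names; it is not Kilo’s permission model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BotAccount:
    """Hypothetical bot identity tied to one employee."""
    identity: str
    scopes: frozenset[str]  # each scope is written as "resource:action"

    def can(self, resource: str, action: str) -> bool:
        return f"{resource}:{action}" in self.scopes


def perform(bot: BotAccount, resource: str, action: str) -> str:
    """Allow an action only when it is explicitly in the bot's scopes."""
    if not bot.can(resource, action):
        raise PermissionError(f"{bot.identity} may not {action} {resource}")
    return f"{bot.identity}: {action} on {resource} allowed"


bot = BotAccount(
    identity="scott.bot@kilo.ai",
    scopes=frozenset({"logs:read", "github:contribute"}),
)

print(perform(bot, "logs", "read"))
try:
    perform(bot, "logs", "write")
except PermissionError as exc:
    print("blocked:", exc)
```

Because anything not listed is denied by default, revoking the bot account removes every one of the agent’s permissions at once.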

Addressing concerns over data privacy and “black box” algorithms, Kilo emphasizes that its code is source-available.

“Anyone can go look at our code. It’s not a black box. When you’re buying Kilo Claw, you’re not giving us your data, and we’re not training on any of your data because we’re not building our own model,” Schario clarified.

This licensing choice allows organizations to audit the resiliency and security of the platform without fearing their proprietary data will be used to improve third-party models.

Pricing and availability

KiloClaw for Organizations follows a usage-based pricing model where companies pay only for the compute and inference consumed. Organizations can utilize a “Bring Your Own Key” (BYOK) approach or use Kilo Gateway credits for inference.

The service is available starting today, Wednesday, April 1. KiloClaw Chat is currently in beta, with support for web, desktop, and iOS sessions. New users can evaluate the platform via a free tier that includes seven days of compute.

As Breitenother summarized to VentureBeat, the goal is to shift from “one-off” deployments to a scalable model for the entire workforce: “I think of Kilo for Orgs as buying KiloClaw by the bushel instead of one-off. And we’re hoping to sell a lot of bushels of KiloClaw.”