OpenAI is moving to court more developers and vibe coders (those who build software using AI models and natural language) away from rivals like Anthropic.
Today, the firm arguably most synonymous with the generative AI boom announced it will begin offering a new, more mid-range subscription tier — a $100 ChatGPT Pro plan — which joins its free, Go ($8 monthly), Plus ($20 monthly) and existing Pro ($200 monthly) plans for individuals using ChatGPT and related OpenAI products.
OpenAI also currently offers Edu, Business ($25 per user monthly, formerly known as Team) and Enterprise (variably priced) plans for organizations in those sectors.
So why introduce a new $100 ChatGPT Pro plan, then?
The big selling point from OpenAI is that the new plan offers five times the Codex usage limits of the existing $20 monthly Plus plan — which seems fair, given the math ($20 × 5 = $100). Codex is the company's agentic vibe coding application and harness (the name is shared by both, as well as by a lineup of coding-specific language models).
As OpenAI co-founder and CEO Sam Altman wrote in a post on X: “It is very nice to see Codex getting so much love. We are launching a $100 ChatGPT Pro tier by very popular demand.”
However, alongside this, OpenAI’s official company account on X noted that “we’re rebalancing Codex usage in [ChatGPT] Plus to support more sessions throughout the week, rather than longer sessions in a single day.”
That sounds a lot like OpenAI is also simultaneously reducing how much ChatGPT Plus users can use its Codex harness and application per day.
So, what are the current limits on the $20 Plus plan? The new Pro plan gives you 5X greater than…what?
Turns out, this is trickier than you’d think to calculate, because it actually varies depending on which underlying AI model you are using to power the Codex application or harness, and whether you are working on code stored in the cloud or locally on your machine or servers.
OpenAI’s Developer website notes that for individual users, usage is categorized by “Local Messages” (tasks run on the user’s machine) and “Cloud Tasks” (tasks run on OpenAI’s infrastructure), both of which share a five-hour rolling window. Currently, it actually shows the $100 Pro plan giving you 10X as many messages as the $20 Plus plan (see below)!
ChatGPT Plus ($20 monthly):

GPT-5.4: 33–168 local messages every 5 hours.
GPT-5.4-mini: 110–560 local messages every 5 hours.
GPT-5.3-Codex: 45–225 local messages and 10–60 cloud tasks every 5 hours.
Code reviews: 10–25 pull requests per week.

New ChatGPT Pro ($100 monthly):

GPT-5.4: 330–1,680 local messages every 5 hours.
GPT-5.4-mini: 1,100–5,600 local messages every 5 hours.
GPT-5.3-Codex: 450–2,250 local messages and 100–600 cloud tasks every 5 hours.
Code reviews: 100–250 pull requests per week.

ChatGPT Pro ($200 monthly):

GPT-5.4: 660–3,360 local messages every 5 hours.
GPT-5.4-mini: 2,200–11,200 local messages every 5 hours.
GPT-5.3-Codex: 900–4,500 local messages and 200–1,200 cloud tasks every 5 hours.
Code reviews: 200–500 pull requests per week.
Exclusive access: includes GPT-5.3-Codex-Spark (research preview), which has its own dynamic usage limit.
And as OpenAI’s Help documentation states:
“The number of Codex messages you can send within these limits varies based on the size and complexity of your coding tasks, and where you execute tasks. Small scripts or simple functions may only consume a fraction of your allowance, while larger codebases, long running tasks, or extended sessions that require Codex to hold more context will use significantly more per message.”
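A "five-hour rolling window" means capacity isn't replenished at a fixed daily reset; each message simply stops counting five hours after it was sent. As a rough illustration of how such a limiter behaves (this is an assumed implementation for explanatory purposes, not OpenAI's actual system), consider:

```python
from collections import deque

class RollingWindowLimiter:
    """Illustrative sketch of a rolling-window usage limiter: each recorded
    message expires `window` seconds after it was sent, so capacity frees
    up continuously rather than at a fixed daily reset."""

    def __init__(self, limit: int, window: float = 5 * 3600):
        self.limit = limit
        self.window = window
        self.sent: deque = deque()  # timestamps of messages still in the window

    def try_send(self, now: float) -> bool:
        # Evict messages older than the window, then check remaining capacity.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) >= self.limit:
            return False  # over the rolling limit
        self.sent.append(now)
        return True

limiter = RollingWindowLimiter(limit=3)
print(limiter.try_send(0), limiter.try_send(60), limiter.try_send(120))  # True True True
print(limiter.try_send(180))            # False: three messages already inside the window
print(limiter.try_send(5 * 3600 + 1))   # True: the first message has aged out
```

This also explains why "longer sessions in a single day" and "more sessions throughout the week" are a real tradeoff: the same message budget can be spent in one long burst or spread across many windows.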
OpenAI’s sudden move toward the $100 price point and expanded agentic capacity comes amid the unprecedented financial ascent of its chief rival, Anthropic.
Just days ago, Anthropic revealed its annualized run-rate revenue (ARR) has topped $30 billion, surpassing OpenAI’s last reported ARR of approximately $24–$25 billion.
This growth has been fueled by the massive adoption of Claude Code and Claude Cowork, products that have set the benchmark for enterprise-grade autonomous coding.
The competitive friction intensified on April 4, 2026, when Anthropic officially blocked Claude subscriptions from being used to provide the intelligence for third-party agentic AI harnesses like OpenClaw.
To be clear, Anthropic’s Claude models themselves can still be used with OpenClaw — users just must now pay for access through Anthropic’s application programming interface (API) or extra usage credits, rather than through the monthly Claude subscription tiers. Some have likened those tiers to an “all-you-can-eat” buffet, making the economics challenging for Anthropic when power users and third-party harnesses like OpenClaw consume more in tokens than the $20 or $200 those users spend on the plans each month.
OpenClaw’s creator, Peter Steinberger, was notably hired by OpenAI in February 2026 to lead its personal agent strategy, and has, since joining, actively spoken out against Anthropic’s limitations — noting that OpenAI’s Codex and models generally don’t carry the restrictions Anthropic is now imposing.
By hiring Steinberger and subsequently launching a Pro tier that provides the high-volume capacity Anthropic recently restricted, OpenAI is effectively courting the displaced OpenClaw community to reclaim the professional developer market.
One major challenge in deploying autonomous agents is building systems that can adapt to changes in their environments without the need to retrain the underlying large language models (LLMs).
Memento-Skills, a new framework developed by researchers at multiple universities, addresses this bottleneck by giving agents the ability to develop their skills by themselves. “It adds its continual learning capability to the existing offering in the current market, such as OpenClaw and Claude Code,” Jun Wang, co-author of the paper, told VentureBeat.
Memento-Skills acts as an evolving external memory, allowing the system to progressively improve its capabilities without modifying the underlying model. The framework provides a set of skills that can be updated and expanded as the agent receives feedback from its environment.
For enterprise teams running agents in production, that matters. The alternative — fine-tuning model weights or manually building skills — carries significant operational overhead and data requirements. Memento-Skills sidesteps both.
Self-evolving agents are crucial because they overcome the limitations of frozen language models. Once a model is deployed, its parameters remain fixed, restricting it to the knowledge encoded during training and whatever fits in its immediate context window.
Giving the model an external memory scaffolding enables it to improve without the costly and slow process of retraining. However, current approaches to agent adaptation largely rely on manually designed skills to handle new tasks. While some automatic skill-learning methods exist, they mostly produce text-only guides that amount to prompt optimization. Other approaches simply log single-task trajectories that don’t transfer across different tasks.
Furthermore, when these agents try to retrieve relevant knowledge for a new task, they typically rely on semantic similarity routers, such as standard dense embeddings; high semantic overlap does not guarantee behavioral utility. An agent relying on standard RAG might retrieve a “password reset” script to solve a “refund processing” query simply because the documents share enterprise terminology.
“Most retrieval-augmented generation (RAG) systems rely on similarity-based retrieval. However, when skills are represented as executable artifacts such as markdown documents or code snippets, similarity alone may not select the most effective skill,” Wang said.
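The distinction can be made concrete with a toy router that weighs an observed success rate alongside semantic similarity, so a lexically close but historically unhelpful skill loses out. The scoring formula, field names, and skill names below are illustrative assumptions, not the paper's actual method:

```python
from dataclasses import dataclass

@dataclass
class SkillRecord:
    name: str
    similarity: float   # semantic similarity to the query, in [0, 1]
    successes: int      # times this skill solved a task when selected
    attempts: int       # times this skill was selected

    def success_rate(self) -> float:
        # Laplace smoothing so unproven skills score neither 0 nor 1.
        return (self.successes + 1) / (self.attempts + 2)

def route(skills: list, alpha: float = 0.5) -> SkillRecord:
    """Blend semantic similarity with behavioral utility (success rate)."""
    return max(skills, key=lambda s: alpha * s.similarity
                                     + (1 - alpha) * s.success_rate())

skills = [
    # Lexically close to a "refund" query, but rarely helps in practice.
    SkillRecord("password_reset", similarity=0.82, successes=1, attempts=20),
    # Slightly less similar, but reliably solves refund tasks.
    SkillRecord("refund_processing", similarity=0.74, successes=18, attempts=20),
]
print(route(skills).name)  # refund_processing
```

A pure-similarity router (alpha = 1.0) would pick "password_reset" here; blending in execution feedback flips the choice.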
To solve the limitations of current agentic systems, the researchers built Memento-Skills. The paper describes the system as “a generalist, continually-learnable LLM agent system that functions as an agent-designing agent.” Instead of keeping a passive log of past conversations, Memento-Skills creates a set of skills that act as a persistent, evolving external memory.
These skills are stored as structured markdown files and serve as the agent’s evolving knowledge base. Each reusable skill artifact is composed of three core elements. It contains declarative specifications that outline what the skill is and how it should be used. It includes specialized instructions and prompts that guide the language model’s reasoning. And it houses the executable code and helper scripts that the agent runs to actually solve the task.
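The three-part artifact described above might look something like the following sketch, which renders a skill to a structured markdown file. The field names and layout are assumptions for illustration; the paper's actual file format may differ:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """Illustrative shape of a reusable skill artifact."""
    name: str
    specification: str  # declarative: what the skill is and when to use it
    instructions: str   # prompts that guide the model's reasoning
    code: str           # executable helper script the agent runs

    def to_markdown(self) -> str:
        # Skills are persisted as structured markdown files.
        return (
            f"# Skill: {self.name}\n\n"
            f"## Specification\n{self.specification}\n\n"
            f"## Instructions\n{self.instructions}\n\n"
            f"## Code\n```python\n{self.code}\n```\n"
        )

skill = Skill(
    name="web_search",
    specification="Query a search engine and return the top results.",
    instructions="Prefer primary sources; cite URLs in the answer.",
    code="def run(query): ...",
)
print(skill.to_markdown().splitlines()[0])  # "# Skill: web_search"
```

Keeping all three elements in one artifact is what lets the orchestrator later rewrite a failing skill's prompts or code in place.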
Memento-Skills achieves continual learning through its “Read-Write Reflective Learning” mechanism, which frames memory updates as active policy iteration rather than passive data logging. When faced with a new task, the agent queries a specialized skill router to retrieve the most behaviorally relevant skill — not just the most semantically similar one — and executes it.
After the agent executes the skill and receives feedback, the system reflects on the outcome to close the learning loop. Rather than just appending a log of what happened, the system actively mutates its memory. If the execution fails, an orchestrator evaluates the trace and rewrites the skill artifacts. This means it directly updates the code or prompts to patch the specific failure mode. In case of need, it creates an entirely new skill.
Memento-Skills also updates the skill router through a one-step offline reinforcement learning process that learns from execution feedback rather than just text overlap. “The true value of a skill lies in how it contributes to the overall agentic workflow and downstream execution,” Wang said. “Therefore, reinforcement learning provides a more suitable framework, as it enables the agent to evaluate and select skills based on long-term utility.”
To prevent regression in a production environment, the automated skill mutations are guarded by an automatic unit-test gate. The system generates a synthetic test case, executes it through the updated skill, and checks the results before saving the changes to the global library.
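The gate's logic is simple in outline: run the mutated skill against a generated test case and persist it only on success. Here is a minimal sketch of that pattern — the function shape and the `(input, expected)` test representation are assumptions, not Memento-Skills' actual code:

```python
def gated_update(library: dict, name: str, new_code: str,
                 test_case: tuple) -> bool:
    """Commit a mutated skill only if it passes a generated test case.

    Illustrative sketch: in practice the test case would be synthesized
    by the LLM; here it is just an (input, expected_output) pair.
    """
    namespace: dict = {}
    try:
        exec(new_code, namespace)        # load the rewritten skill
        inp, expected = test_case
        if namespace["run"](inp) != expected:
            return False                 # regression: discard the mutation
    except Exception:
        return False                     # broken code never reaches the library
    library[name] = new_code             # gate passed: persist the change
    return True

library: dict = {}
ok = gated_update(library, "double", "def run(x):\n    return x * 2", (3, 6))
bad = gated_update(library, "double", "def run(x):\n    return x + 2", (3, 6))
print(ok, bad, "double" in library)  # True False True
```

Note that the failed mutation leaves the previously committed version untouched, which is exactly the regression protection the researchers describe.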
By continuously rewriting and refining its own executable tools, Memento-Skills enables a frozen language model to build robust muscle memory and progressively expand its capabilities end-to-end.
The researchers evaluated Memento-Skills on two rigorous benchmarks. The first is General AI Assistants (GAIA), which requires complex multi-step reasoning, multi-modality handling, web browsing, and tool use. The second is Humanity’s Last Exam, or HLE, an expert-level benchmark spanning eight diverse academic subjects like mathematics and biology. The entire system was powered by Gemini-3.1-Flash acting as the underlying frozen language model.
The system was compared against a Read-Write baseline that retrieves skills and collects feedback but doesn’t have self-evolving features. The researchers also tested their custom skill router against standard semantic retrieval baselines, including BM25 and Qwen3 embeddings.
The results proved that actively self-evolving memory vastly outperforms a static skill library. On the highly diverse GAIA benchmark, Memento-Skills improved test set accuracy by 13.7 percentage points over the static baseline, achieving 66.0% compared to 52.3%. On the HLE benchmark, where the domain structure allowed for massive cross-task skill reuse, the system more than doubled the baseline’s performance, jumping from 17.9% to 38.7%.
Moreover, the specialized skill router of Memento-Skills avoids the classic retrieval trap where an irrelevant skill is selected simply because of semantic similarity. Experiments show that Memento-Skills boosts end-to-end task success rates to 80%, compared to just 50% for standard BM25 retrieval.
The researchers observed that Memento-Skills manages this performance through highly organic, structured skill growth. Both benchmark experiments started with just five atomic seed skills, such as basic web search and terminal operations. On the GAIA benchmark, the agent autonomously expanded this seed group into a compact library of 41 skills to handle the diverse tasks. On the expert-level HLE benchmark, the system dynamically scaled its library to 235 distinct skills.
The researchers have released the code for Memento-Skills on GitHub, and it is readily available for use.
For enterprise architects, the effectiveness of this system depends on domain alignment. Instead of simply looking at benchmark scores, the core business tradeoff lies in whether your agents are handling isolated tasks or structured workflows.
“Skill transfer depends on the degree of similarity between tasks,” Wang said. “First, when tasks are isolated or weakly related, the agent cannot rely on prior experience and must learn through interaction.” In such scattershot environments, cross-task transfer is limited. “Second, when tasks share substantial structure, previously acquired skills can be directly reused. Here, learning becomes more efficient because knowledge transfers across tasks, allowing the agent to perform well on new problems with little or no additional interaction.”
Given that the system requires recurring task patterns to consolidate knowledge, enterprise leaders need to know exactly where to deploy this today and where to hold off.
“Workflows are likely the most appropriate setting for this approach, as they provide a structured environment in which skills can be composed, evaluated, and improved,” Wang said.
However, he cautioned against over-deployment in areas not yet suited for the framework. “Physical agents remain largely unexplored in this context and require further investigation. In addition, tasks with longer horizons may demand more advanced approaches, such as multi-agent LLM systems, to enable coordination, planning, and sustained execution over extended sequences of decisions.”
As the industry moves toward agents that autonomously rewrite their own production code, governance and security remain paramount. While Memento-Skills employs foundational safety rails like automatic unit-test gates, a broader framework will likely be needed for enterprise adoption.
“To enable reliable self-improvement, we need a well-designed evaluation or judge system that can assess performance and provide consistent guidance,” Wang said. “Rather than allowing unconstrained self-modification, the process should be structured as a guided form of self-development, where feedback steers the agent toward better designs.”
Enterprise AI programs rarely fail because of bad ideas. More often, they get stuck in ungoverned pilot mode and never reach production. At a recent VentureBeat event, technology leaders from MassMutual and Mass General Brigham explained how they avoided that trap — and what the results look like when discipline replaces sprawl.
At MassMutual, the results are concrete: 30% developer productivity gains, IT help desk resolution times reduced from 11 minutes to one, and customer service calls cut from 15 minutes to just one or two.
“We’re always starting with why do we care about this problem?” Sears Merritt, MassMutual’s head of enterprise technology and experience, said at the event. “If we solve the problem, how are we gonna know we solved it? And, how much value is associated with doing that?”
MassMutual, a 175-year-old company serving millions of policy owners and customers, has pushed AI into production across the business — customer support, IT, customer acquisition, underwriting, servicing, claims, and other areas.
Merritt said his team follows the scientific method, beginning with a hypothesis and testing whether it has an outcome that will tangibly drive the business forward. Some ideas are great, but they may be “intractable in the business” due to factors like lack of data or access, or regulatory constraint.
“We won’t go any further with an idea until we get crystal clear on how we’re going to measure, and how we’re going to define success.”
Ultimately, it’s up to different departments and leaders to define what quality means: Choose a metric and define the minimum level of quality before a tool is placed into the hands of teams and partners.
That starting point creates a quick feedback loop. “The things that we find slow us down is where there isn’t shared clarity on what outcome we’re trying to achieve,” which can lead to confusion and constant re-adjusting, said Merritt. “We don’t go to production until there is a business partner that says, ‘Yes, that works.’”
His team is strategic about evaluating emerging tools, and “extremely rigorous” when testing and measuring what “good” means. For instance, they perform trust scoring to lower hallucination rates, establish thresholds and evaluation criteria, and monitor for feature and output drift.
Merritt also operates with a no-commitment policy — meaning the company doesn’t lock itself into using a particular model. It has what he calls an “incredibly heterogeneous” technology environment combining best of breed models alongside mainframes running on COBOL. That flexibility isn’t accidental. His team built common service layers, microservices and APIs that sit between the AI layer and everything underneath — so when a better model comes along, swapping it in doesn’t mean starting over.
Because, Merritt explained, “the best of breed today might be the worst of breed tomorrow, and we don’t want to set ourselves up to fall behind.”
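The service-layer pattern Merritt describes is a standard abstraction: callers depend on a stable interface, and the vendor model behind it can be swapped without touching them. A generic sketch (the class and method names are illustrative, not MassMutual's actual architecture):

```python
from abc import ABC, abstractmethod

class ModelProvider(ABC):
    """Vendor-agnostic interface every model adapter implements."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class VendorA(ModelProvider):
    def complete(self, prompt: str) -> str:
        return f"[vendor-a] {prompt}"

class VendorB(ModelProvider):
    def complete(self, prompt: str) -> str:
        return f"[vendor-b] {prompt}"

class CompletionService:
    """The common service layer: callers depend on this, never on a vendor."""
    def __init__(self, provider: ModelProvider):
        self._provider = provider

    def swap(self, provider: ModelProvider) -> None:
        # Swapping in a better model touches one place, not every caller.
        self._provider = provider

    def answer(self, prompt: str) -> str:
        return self._provider.complete(prompt)

svc = CompletionService(VendorA())
print(svc.answer("hello"))   # [vendor-a] hello
svc.swap(VendorB())
print(svc.answer("hello"))   # [vendor-b] hello
```

The point of the indirection is exactly the no-commitment policy: when "best of breed" changes, only the adapter behind the interface does.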
Mass General Brigham (MGB), for its part, took more of a spray and pray approach — at first.
Around 15,000 researchers in the not-for-profit health system have been using AI, ML, and deep learning for the last 10 to 15 years, CTO Nallan “Sri” Sriraman said at the same VB event.
But last year, he made a bold choice: His team shut down a sprawl of non-governed AI pilots. Initially, “we did follow the thousand flowers bloom [methodology], but we didn’t have a thousand flowers, we had probably a few tens of flowers trying to bloom,” he said.
Like Merritt’s team at MassMutual, MGB pivoted to a more holistic view, examining why they were developing certain tools for specific departments or workflows. They questioned what capabilities they wanted and needed and what investment those required.
Sriraman’s team also spoke with their primary platform providers — Epic, Workday, ServiceNow, Microsoft — about their roadmaps. This was a “pivotal moment,” he noted, as they realized they were building in-house tools that vendors were already providing (or were planning to roll out).
As Sriraman put it: “Why are we building it ourselves? We are already on the platform. It is going to be in the workflow. Leverage it.”
That said, the marketplace is still nascent, which can make for difficult decisions. “The analogy I will give is when you ask six blind men to touch an elephant and say, what does this elephant look like?” Sriraman said. “You’re gonna get six different answers.”
There’s nothing wrong with that, he noted; it’s just that everybody is discovering and experimenting as the landscape keeps shifting.
Instead of a wild West environment, Sriraman’s team distributes Microsoft Copilot to users across the business, and uses a “small landing zone” where they can safely test more sophisticated products and control token use.
They also began “consciously embedding AI champions” across business groups. “This is kind of a reverse of letting a thousand flowers bloom, carefully planting and nourishing,” Sriraman said.
Observability is another big consideration; he describes real-time dashboards that manage model drift and safety and allow IT teams to govern AI “a little more pragmatically.” Health monitoring is critical with AI systems, he noted, and his team has established principles and policies around AI use, not to mention least access privileges.
In clinical settings, the guardrails are absolute: AI systems never issue the final decision. “There’s always going to be a doctor or a physician assistant in the loop to close the decision,” Sriraman said. He cited radiology report generation as one area where AI is used heavily, but where a radiologist always signs off.
Sriraman was clear: “Thou shall not do this: Don’t show PHI [protected health information] in Perplexity. As simple as that, right?”
And, importantly, there must be safety mechanisms in place. “We need a big red button, kill it,” Sriraman emphasized. “We don’t put anything in the operational setting without that.”
Ultimately, while agentic AI is a transformative technology, the enterprise approach to it doesn’t have to be dramatically different. “There is nothing new about this,” Sriraman said. “You can replace the word BPM [business process management] from the ’90s and 2000s with AI. The same concepts apply.”
When Intuit shipped AI agents to 3 million customers, 85% came back. The reason, according to the company’s EVP and GM: combining AI with human expertise turned out to matter more than anyone expected — not less.
Marianna Tessel, the financial software company’s EVP and GM, calls this AI-HI combination a “massive ask” from its customers, noting that it provides another level of confidence and trust.
“One of the things we learned that has been fascinating is really the combination of human intelligence and artificial intelligence,” Tessel said in a new VB Beyond the Pilot podcast. “Sometimes it’s the combination of AI and HI that gives you better results.”
Intuit — the parent company of QuickBooks, TurboTax, MailChimp and other widely-used financial products — was one of the first major enterprises to go all in on generative AI with its GenOS platform last June (long before fears of the “SaaSpocalypse” had SaaS companies scrambling to rethink their strategies).
Quickly, though, the company recognized that chatbots alone weren’t the answer in enterprise environments, and pivoted to what it now calls Intuit Intelligence. The dashboard-like platform features specialized AI agents for sales, tax, payroll, accounting and project management that users can interact with using natural language to gain insights on their data, automate tasks, and generate reports.
Customers report that invoices are being paid in full 90% of the time — and five days faster — and that manual work has been reduced by 30%. AI agents help close books, categorize transactions, run payroll, automate invoice reminders and surface discrepancies.
For instance, one Intuit customer uncovered fraud after interacting with AI agents and asking questions about amounts that didn’t add up. “In the beginning it was like, ‘Is that an error?’ And as he dug in, he discovered very significant fraud,” Tessel said.
Still, Intuit operates on the principle that humans are “always accessible,” Tessel said. Platforms are built in a way that users can ask questions of a human expert when they’re not getting what they need from the AI agent, or want a human to bounce ideas off of.
“I’m not talking about product experts,” Tessel said. “I’m talking about an actual accounting expert or tax expert or payroll expert.”
The platform has also been built to suggest human involvement in “high stakes” decision-making scenarios. AI goes to a certain level, then human experts review and categorize the rest. This provides a level of confidence, according to Tessel.
“We actually believe it becomes more needed and more powerful at the right moments,” she said. “The expert still provides things that are unique.”
The next step is giving customers the tools to perform next-gen tasks like vibe coding — but with simple architectures to reduce the burden for customers. “What we’re testing is this idea of, you can actually do coding without realizing that that’s what you are doing,” Tessel said.
For example, a merchant running a flower shop wants to ensure that they have the right amount of inventory in stock for Mother’s Day. They can vibe code an agent that analyzes previous years’ sales and creates purchase orders where stock is low. That agent could then be instructed to automatically perform that task for future Mother’s Days and other big holidays.
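The core logic such a vibe-coded agent would produce is straightforward: forecast demand from prior years' sales and draft purchase orders where stock falls short. A hypothetical sketch of that task (Intuit's actual agents are expressed in natural language; every name and number here is invented for illustration):

```python
def reorder_plan(prior_sales: dict, stock: dict, buffer: float = 1.1) -> dict:
    """Forecast each item's holiday demand from past years' sales and
    draft purchase orders where current stock falls short of the forecast."""
    orders = {}
    for item, history in prior_sales.items():
        forecast = sum(history) / len(history) * buffer  # average plus safety margin
        shortfall = forecast - stock.get(item, 0)
        if shortfall > 0:
            orders[item] = round(shortfall)
    return orders

# Units sold on the last three Mother's Days, and what's on hand now.
prior_sales = {"roses": [120, 140, 160], "tulips": [80, 90, 70]}
stock = {"roses": 60, "tulips": 95}
print(reorder_plan(prior_sales, stock))  # {'roses': 94}
```

Scheduling this to run ahead of each major holiday is what turns a one-off analysis into the standing agent Tessel describes.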
Some users will be more sophisticated and want the ability to dive deeper into the technology. “But some just want to express what they want to have happen,” Tessel said. “Because all they want to do is run their business.”
Listen to the full podcast to hear about:
Why first-party data can create a “moat” for SaaS companies.
Why showing AI’s logic matters more than a polished interface.
Why 600,000 data points per customer changes what AI can tell you about your business.
You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.
As generative AI matures from a novelty into a workplace staple, a new friction point has emerged: the “shadow AI” or “Bring Your Own AI (BYOAI)” crisis. Much like the unsanctioned use of personal devices in years past, developers and knowledge workers are increasingly deploying autonomous agents on personal infrastructure to manage their professional workflows.
“Our journey with Kilo Claw has been to make it easier and easier and more accessible to folks,” says Kilo co-founder Scott Breitenother. Today, the company dedicated to providing a portable, multi-model, cloud-based AI coding environment is moving to formalize this “shadow AI” layer: it’s launching KiloClaw for Organizations and KiloClaw Chat, a suite of tools designed to provide enterprise-grade governance over personal AI agents.
The announcement comes at a period of high velocity for the company. Since KiloClaw — its securely hosted, one-click OpenClaw product for individuals — became generally available last month, more than 25,000 users have integrated the platform into their daily workflows.
Simultaneously, Kilo’s proprietary agent benchmark, PinchBench, has logged over 250,000 interactions and recently gained significant industry validation when it was referenced by Nvidia CEO Jensen Huang during his keynote at the 2026 Nvidia GTC conference in San Jose, California.
The impetus for KiloClaw for Organizations stems from a growing visibility gap within large enterprises. In a recent interview with VentureBeat, Kilo leadership detailed conversations with high-level AI directors at government contractors who found their developers running OpenClaw agents on random VPS instances to manage calendars and monitor repositories.
“What we’re announcing on Tuesday is Kilo Claw for organizations, where a company can buy an organization-level package of Kilo Claws and give every team member access,” explained Kilo co-founder and head of product and engineering Emilie Schario during the interview.
“We can’t see any of it,” the head of AI at one such firm reportedly told Kilo. “No audit logs. No credential management. No idea what data is touching what API”.
This lack of oversight has led some organizations to issue blanket bans on autonomous agents before a clear strategy on deployment could be formed.
Anand Kashyap, CEO and founder of data security firm Fortanix, told VentureBeat without seeing Kilo’s announcement that while “Openclaw has taken the technology world by storm… the enterprise usage is minimal due to the security concerns of the open source version.”
Kashyap expanded on this trend:
“In recent times, NVIDIA (with NemoClaw), Cisco (DefenseClaw), Palo Alto Networks, and Crowdstrike have all announced offerings to create an enterprise-ready version of OpenClaw with guardrails and governance for agent security. However, enterprise adoption continues to be low.
Enterprises like centralized IT control, predictable behavior, and data security which keeps them compliant. An autonomous agentic platform like OpenClaw stretches the envelope on all these parameters, and while security majors have announced their traditional perimeter security measures, they don’t address the fundamental problems of having a reduced attack surface. Over time, we will see an agentic platform emerge where agents are pre-built and packaged, and deployed responsibly with centralized controls, and data access controls built into the agentic platform as well as the LLMs they call upon to get instructions on how to perform the next task. Technologies like Confidential Computing provide compartmentalization of data and processing, and are tremendously helpful in reducing the attack surface.”
KiloClaw for Organizations is positioned as the way for the security team to say “yes,” providing the visibility and control required to bring these agents in-house.
It transitions agents from developer-managed infrastructure into a managed environment characterized by scoped access and organizational-level controls.
A core technical hurdle in the current agent landscape is the fragmentation of chat sessions.
During the VentureBeat interview, Schario noted that even advanced tools often struggle with canonical sessions, frequently dropping messages or failing to sync across devices.
Schario emphasized the security layer that supports this new structure: “You get all the same benefits of the Kilo gateway and the Kilo platform: you can limit what models people can use, get usage visibility, cost controls, and all the advantages of leveraging Kilo with managed, hosted, controlled Kilo Claw”.
To address the inherent unreliability of autonomous agents—such as missed cron jobs or failed executions—Kilo employs what Schario calls the “Swiss cheese method” of reliability. By layering additional protections and deterministic guardrails on top of the base OpenClaw architecture, Kilo aims to ensure that tasks, such as a daily 6:00 PM summary, are completed even if the underlying agent logic falters.
This is critical because, as Schario noted, “The real risk for any company is data leakage, and that can come from a bot commenting on a GitHub issue or accidentally emailing the person who’s going to get fired before they get fired”.
While managed infrastructure solves the backend problem, KiloClaw Chat addresses the user experience. Schario noted that “Hosted, managed OpenClaw is easier to get started with, but it’s not enough, and it still requires you to be at the edge of technology to understand how to set it up”. Kilo is looking to lower that barrier for the average worker, asking: “How do we give people who have never heard the phrase OpenClaw or Clawdbot an always-on AI assistant?”.
Traditionally, interacting with an OpenClaw agent required connecting to third-party messaging services like Telegram or Discord—a process that involves navigating “BotFather” tokens and technical configurations that alienate non-engineers.
“One of the number one hurdles we see, both anecdotally and in the data, is that you get your bot running and then you have to connect a channel to it. If you don’t know what’s going on, it’s overwhelming,” Schario observed.
“We solved that problem. You don’t need to set up a channel. You can chat with Kilo in the web UI and, with the Kilo Claw app on your phone, interact with Kilo without setting an external channel,” she continued.
This native approach is essential for corporate compliance because, as she further explained, “When we were talking to early enterprise opportunities, they don’t want you using your personal Telegram account to chat with your work bot”. As Schario put it, there is a reason enterprise communication doesn’t flow through personal DMs; when a company shuts off access, they must be able to shut off access to the bot.
Looking ahead, the company plans to integrate these environments further. “What we’re going to do is make Kilo Chat the waypoint between Telegram, Discord, and OpenClaw, so you get all the convenience of Kilo Chat but can use it in the other channels,” Breitenother added.
The enterprise package includes several critical governance features:
Identity Management: SSO/OIDC integration and SCIM provisioning for automated user lifecycles.
Centralized Billing: Full visibility into compute and inference usage across the entire organization.
Admin Controls: Org-wide policies regarding which models can be used, specific permissions, and session durations.
Secrets Configuration: Integration with 1Password ensures that agents never handle credentials in plain text, preventing accidental leaks.
Other security experts note that managing bot and AI agent permissions is among the most pressing problems enterprises face today.
As Ev Kontsevoy, CEO and co-founder of AI infrastructure and identity management company Teleport told VentureBeat without seeing the Kilo news: “The potential impact of OpenClaw as a non-deterministic actor demonstrates why identity can’t be an afterthought. You have an autonomous agent with shell access, browser control, and API credentials — running on a persistent loop, across dozens of messaging platforms, with the ability to write its own skills. That’s not a chatbot. That’s a non-deterministic actor with broad infrastructure access and no cryptographic identity, no short-lived credentials, and no real-time audit trail tying actions to a verifiable actor.”
Kilo proposes to solve this identity problem with a major change in organizational structure: the adoption of employee “bot accounts”.
In Kilo’s vision, every employee eventually carries two identities—their standard human account and a corresponding bot account, such as scott.bot@kilo.ai.
These bot identities operate with strictly limited, read-only permissions. For example, a bot might be granted read-only access to company logs or a GitHub account with contributor-only rights. This “scoped” approach allows the agent to maintain full visibility of the data it needs to be helpful while ensuring it cannot accidentally share sensitive information with others.
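The scoped bot-identity model described above can be sketched in a few lines. This is an illustrative mock only, assuming a simple resource-to-permission mapping; the class and grant names are hypothetical and not Kilo’s actual API.

```python
# Hypothetical sketch of a "scoped bot identity": a bot account that
# mirrors a human account (e.g. scott.bot) but carries only the
# read-level grants it needs. Names here are illustrative, not Kilo's.

READ, WRITE = "read", "write"

class ScopedIdentity:
    def __init__(self, owner: str, grants: dict[str, str]):
        self.name = f"{owner}.bot"   # e.g. "scott.bot"
        self.grants = grants         # resource -> maximum permission level

    def can(self, resource: str, action: str) -> bool:
        level = self.grants.get(resource)
        # A read-only grant never satisfies a write request,
        # and a missing grant denies everything.
        return level == WRITE or (level == READ and action == READ)

bot = ScopedIdentity("scott", {"company-logs": READ, "github": READ})
assert bot.can("company-logs", READ)   # the bot may read logs
assert not bot.can("github", WRITE)    # but never push or modify
assert not bot.can("payroll", READ)    # no grant means no visibility
```

The point of the sketch is the asymmetry: the bot sees enough to be useful but is structurally unable to write or share outside its grants.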
Addressing concerns over data privacy and “black box” algorithms, Kilo emphasizes that its code is source available.
“Anyone can go look at our code. It’s not a black box. When you’re buying Kilo Claw, you’re not giving us your data, and we’re not training on any of your data because we’re not building our own model,” Schario clarified.
This licensing choice allows organizations to audit the resiliency and security of the platform without fearing their proprietary data will be used to improve third-party models.
KiloClaw for Organizations follows a usage-based pricing model where companies pay only for the compute and inference consumed. Organizations can utilize a “Bring Your Own Key” (BYOK) approach or use Kilo Gateway credits for inference.
The service is available starting today, Wednesday, April 1. KiloClaw Chat is currently in beta, with support for web, desktop, and iOS sessions. New users can evaluate the platform via a free tier that includes seven days of compute.
As Breitenother summarized to VentureBeat, the goal is to shift from “one-off” deployments to a scalable model for the entire workforce: “I think of Kilo for Orgs as buying KiloClaw by the bushel instead of one-off. And we’re hoping to sell a lot of bushels of KiloClaw.”
Deploying AI agents for repository-scale tasks like bug detection, patch verification, and code review requires overcoming significant technical hurdles. One major bottleneck: the need to set up dynamic execution sandboxes for every repository, which are expensive and computationally heavy.
Using large language model (LLM) reasoning instead of executing the code is rising in popularity to bypass this overhead, yet it frequently leads to unsupported guesses and hallucinations.
To improve execution-free reasoning, researchers at Meta introduce “semi-formal reasoning,” a structured prompting technique. This method requires the AI agent to fill out a logical certificate by explicitly stating premises, tracing concrete execution paths, and deriving formal conclusions before providing an answer.
The structured format forces the agent to systematically gather evidence and follow function calls before drawing conclusions. This increases the accuracy of LLMs in coding tasks and significantly reduces errors in fault localization and codebase question-answering.
For developers using LLMs in code review tasks, semi-formal reasoning enables highly reliable, execution-free semantic code analysis while drastically reducing the infrastructure costs of AI coding systems.
Agentic code reasoning is an AI agent’s ability to navigate files, trace dependencies, and iteratively gather context to perform deep semantic analysis on a codebase without running the code. In enterprise AI applications, this capability is essential for scaling automated bug detection, comprehensive code reviews, and patch verification across complex repositories where relevant context spans multiple files.
The industry currently tackles execution-free code verification through two primary approaches. The first involves unstructured LLM evaluators that try to verify code either directly or by training specialized LLMs as reward models to approximate test outcomes. The major drawback is their reliance on unstructured reasoning, which allows models to make confident claims about code behavior without explicit justification. Without structured constraints, it is difficult to ensure agents reason thoroughly rather than guess based on superficial patterns like function names.
The second approach involves formal verification, which translates code or reasoning into formal mathematical languages like Lean, Coq, or Datalog to enable automated proof checking. While rigorous, formal methods require defining the semantics of the programming language. This is entirely impractical for arbitrary enterprise codebases that span multiple frameworks and languages.
Existing approaches also tend to be highly fragmented and task-specific, often requiring entirely separate architectures or specialized training for each new problem domain. They lack the flexibility needed for broad, multi-purpose enterprise applications.
To bridge the gap between unstructured guessing and overly rigid mathematical proofs, the Meta researchers propose a structured prompting methodology, which they call “semi-formal reasoning.” This approach equips LLM agents with task-specific, structured reasoning templates.
These templates function as mandatory logical certificates. To complete a task, the agent must explicitly state premises, trace execution paths for specific tests, and derive a formal conclusion based solely on verifiable evidence.
The template forces the agent to gather proof from the codebase before making a judgment. The agent must actually follow function calls and data flows step-by-step rather than guessing their behavior based on surface-level naming conventions. This systematic evidence gathering helps the agent handle edge cases, such as confusing function names, and avoid making unsupported claims.
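A minimal sketch of what such a logical-certificate prompt might look like follows. The section names (PREMISES, EXECUTION TRACE, CONCLUSION) are assumptions drawn from the description above; Meta’s released templates may word these differently.

```python
# Illustrative sketch of a semi-formal reasoning "certificate" prompt.
# Section names are assumptions; the paper's actual templates may differ.
CERTIFICATE_TEMPLATE = """\
Before answering, complete every section below.

PREMISES:
- List each fact you verified by reading the code (file, definition).

EXECUTION TRACE:
- For the given input, follow each call step by step.
- Quote the actual definition of every function you invoke; never
  assume behavior from a function's name alone.

CONCLUSION:
- State the answer, citing only the premises and trace steps above.
"""

REQUIRED_SECTIONS = ("PREMISES:", "EXECUTION TRACE:", "CONCLUSION:")

def is_complete_certificate(response: str) -> bool:
    """Reject agent answers that skip any part of the certificate."""
    return all(section in response for section in REQUIRED_SECTIONS)

prompt = CERTIFICATE_TEMPLATE + "\nQuestion: are patches A and B equivalent?"
assert not is_complete_certificate("CONCLUSION: yes, they are equivalent")
```

The harness-side check is the key design choice: an answer that lacks any section of the certificate is rejected outright, which is what forces the evidence-gathering the researchers describe.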
The researchers evaluated semi-formal reasoning across three software engineering tasks: patch equivalence verification to determine if two patches yield identical test outcomes without running them, fault localization to pinpoint the exact lines of code causing a bug, and code question answering to test nuanced semantic understanding of complex codebases. The experiments used the Claude Opus-4.5 and Sonnet-4.5 models acting as autonomous verifier agents.
The team compared their structured semi-formal approach against several baselines, including standard reasoning, where an agentic model is given a minimal prompt and allowed to explain its thinking freely in unstructured natural language. They also compared against traditional text-similarity algorithms like difflib.
In patch equivalence, semi-formal reasoning improved accuracy on challenging, curated examples from 78% using standard reasoning to 88%. When evaluating real-world, agent-generated patches with test specifications available, the Opus-4.5 model using semi-formal reasoning achieved 93% verification accuracy, outperforming both the unstructured single-shot baseline at 86% and the difflib baseline at 73%. Other tasks showed similar gains across the board.
The paper highlights the value of semi-formal reasoning through real-world examples. In one case, the agent evaluates two patches in the Python Django repository that attempt to fix a bug with 2-digit year formatting for years before 1000 CE. One patch uses a custom format() function within the library that overrides the standard function used in Python.
Standard reasoning models look at these patches, assume format() refers to Python’s standard built-in function, calculate that both approaches will yield the same string output, and incorrectly declare the patches equivalent.
With semi-formal reasoning, the agent traces the execution path and checks method definitions. Following the structured template, the agent discovers that within one of the library’s files, the format() name is actually shadowed by a custom, module-level function. The agent formally proves that given the attributes of the input passed to the code, this patch will crash the system while the other will succeed.
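This failure mode is easy to reproduce in miniature. The toy below (not Django’s actual code) shows how a module-level function shadowing a familiar builtin name makes name-based reasoning wrong while definition-tracing gets the right answer.

```python
# Toy reproduction of the shadowing bug described above: a module-level
# function shadows the builtin format(), so guessing from the name gives
# the wrong answer. Django's real code differs; this is illustrative only.

def format(value):                  # shadows the builtin format()
    # Custom 2-digit year formatter that only handles years >= 1000.
    if value >= 1000:
        return str(value)[-2:].zfill(2)
    raise ValueError(f"year {value} not supported")

# A name-based guess treats format(999) like the builtin and predicts a
# normal string result. Tracing the shadowed definition shows it raises.
try:
    format(999)
    outcome = "returns"
except ValueError:
    outcome = "raises"

assert format(1987) == "87"   # the happy path still works
assert outcome == "raises"    # the pre-1000 edge case crashes instead
```

The two asserts capture exactly the distinction the certificate template enforces: the patch behaves identically on common inputs but diverges on the edge case that only a real trace reveals.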
Based on their experiments, the researchers suggest that “LLM agents can perform meaningful semantic code analysis without execution, potentially reducing verification costs in RL training pipelines by avoiding expensive sandbox execution.”
While semi-formal reasoning offers substantial reliability improvements, enterprise developers must consider several practical caveats before adopting it. There is a clear compute and latency tradeoff. Semi-formal reasoning requires more API calls and tokens. In patch equivalence evaluations, semi-formal reasoning required roughly 2.8 times as many execution steps as standard unstructured reasoning.
The technique also does not universally improve performance, particularly if a model is already highly proficient at a specific task. When researchers evaluated the Sonnet-4.5 model on a code question-answering benchmark, standard unstructured reasoning already achieved a high accuracy of around 85%. Applying the semi-formal template in this scenario yielded no additional gains.
Furthermore, structured reasoning can produce highly confident wrong answers. Because the agent is forced to build elaborate, formal proof chains, it can become overly assured if its investigation is deep but incomplete. In one Python evaluation, the agent meticulously traced five different functions to uncover a valid edge case, but completely missed that a downstream piece of code already safely handled that exact scenario. Because it had built a strong evidence chain, it delivered an incorrect conclusion with extremely high confidence.
The system’s reliance on concrete evidence also breaks down when it hits the boundaries of a codebase. When analyzing third-party libraries where the underlying source code is unavailable, the agent will still resort to guessing behavior based on function names.
And in some cases, despite strict prompt instructions, models will occasionally fail to fully trace concrete execution paths.
Ultimately, while semi-formal reasoning drastically reduces unstructured guessing and hallucinations, it does not completely eliminate them.
This technique can be used out of the box, requiring no model training or special packaging. It is execution-free, meaning you do not need to add additional tools to your LLM environment. You pay more compute at inference time in exchange for higher accuracy on code-review tasks.
The researchers suggest that structured agentic reasoning may offer “a flexible alternative to classical static analysis tools: rather than encoding analysis logic in specialized algorithms, we can prompt LLM agents with task-specific reasoning templates that generalize across languages and frameworks.”
The researchers have made the prompt templates available, allowing them to be readily implemented into your applications. While there is a lot of conversation about prompt engineering being dead, this technique shows how much performance you can still squeeze out of well-structured prompts.
Slack today announced more than 30 new capabilities for Slackbot, its AI-powered personal agent, in what amounts to the most sweeping overhaul of the workplace messaging platform since Salesforce acquired it for $27.7 billion in 2021. The update transforms Slackbot from a simple conversational assistant into a full-spectrum enterprise agent that can take meeting notes across any video provider, operate outside the Slack application on users’ desktops, execute tasks through third-party tools via the Model Context Protocol (MCP), and even serve as a lightweight CRM for small businesses — all without requiring users to install anything new.
The announcement, timed to a keynote event that Salesforce CEO Marc Benioff is headlining Tuesday morning, arrives less than three months after Slackbot first became generally available on January 13 to Business+ and Enterprise+ subscribers. In that short window, Slack says the feature is on track to become the fastest-adopted product in Salesforce’s 27-year history, with some employees at customer organizations reporting they save up to 90 minutes per day. Inside Salesforce itself, teams claim savings of up to 20 hours per week, translating to more than $6.4 million in estimated productivity value.
“Slackbot is smart. It’s pleasant, and I think it’s endlessly useful,” Rob Seaman, Slack’s interim CEO and former chief product officer, told VentureBeat in an exclusive interview ahead of the announcement. “The upper bound of use cases is effectively limitless for it.”
The release signals Slack’s clearest bid yet to become what Seaman and the company’s leadership describe as an “agentic operating system” — a single surface through which workers interact with AI agents, enterprise applications, and one another. It also marks a direct challenge to Microsoft, which has spent the past two years embedding its Copilot assistant across the entirety of its productivity stack.
The features announced Tuesday organize around several major capability areas, each designed to push Slackbot well beyond the role of a chatbot and into something closer to an autonomous digital coworker.
The most foundational may be what Slack is calling AI-Skills — reusable instruction sets that define the inputs, the steps, and the exact output format for a given task. Any team can build a skill once and deploy it on demand. Slackbot ships with a built-in library for common workflows, but users can also create their own. Critically, Slackbot can recognize when a user’s prompt matches an existing skill and apply it automatically, without being explicitly told to do so. “Think of these as topics or instructions — basically instructions for Slackbot to perform a repeat task that the user might want to do, that they can share with others, or a company might be able to set up for their whole company,” Seaman explained.
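A hypothetical sketch of what a skill definition and the automatic matching Seaman describes might look like follows. Slack has not published a schema, so every field name and the matching logic here are illustrative assumptions.

```python
# Hypothetical sketch of an "AI skill": a reusable instruction set with
# declared inputs, steps, and an exact output format. Field names are
# assumptions; Slack has not published a skill schema.

skill = {
    "name": "weekly-status-summary",
    "inputs": ["channel", "date_range"],
    "steps": [
        "Collect messages from {channel} within {date_range}",
        "Group by project and summarize decisions and blockers",
    ],
    "output_format": "Bulleted list, one bullet per project",
    "triggers": ["status summary", "weekly recap"],  # auto-match phrases
}

def matches(prompt: str, skill: dict) -> bool:
    """Crude stand-in for Slackbot recognizing a prompt fits a skill."""
    p = prompt.lower()
    return any(trigger in p for trigger in skill["triggers"])

assert matches("Give me the weekly recap for #eng", skill)
assert not matches("Book a meeting room for Thursday", skill)
```

The interesting product behavior is in the second half: the user never names the skill, yet the agent applies it because the prompt matches a declared trigger.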
Deep research mode gives Slackbot the ability to conduct extended, multi-step investigations that take approximately four minutes to complete — a significant departure from the instant-response paradigm of most enterprise chatbots. Slack chose not to demonstrate this feature on stage at the keynote, Seaman said, precisely because its value lies in depth, not speed. MCP client integration, meanwhile, allows Slackbot to make tool calls into external systems through the Model Context Protocol, meaning it can now create Google Slides, draft Google Docs, and interact with the more than 2,600 apps in the Slack Marketplace and the 6,000-plus apps built over two decades for the Salesforce AppExchange. “We’re going all in on MCP for Slackbot,” Seaman said. “MCP clients and MCP servers are becoming very mature.”
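For readers unfamiliar with MCP, the wire format is plain JSON-RPC 2.0. The sketch below builds the kind of `tools/call` request an MCP client sends to a server; the method name and `name`/`arguments` params come from the MCP specification, but the specific tool (`create_slide_deck`) and its arguments are hypothetical.

```python
import json

# Minimal sketch of the JSON-RPC 2.0 message an MCP client sends to
# invoke a tool on an MCP server. "tools/call" and its params shape are
# from the MCP spec; the tool name and arguments are hypothetical.

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "create_slide_deck",                      # hypothetical tool
        "arguments": {"title": "Q2 Planning", "slides": 5},
    },
}

wire_message = json.dumps(request)
assert json.loads(wire_message)["method"] == "tools/call"
```

Because every MCP server speaks this same envelope, an agent like Slackbot can call a Google Slides tool and a Salesforce tool through one client implementation, which is the maturity Seaman is pointing at.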
Meeting intelligence allows Slackbot to listen to any meeting — not just Slack huddles, but calls on Zoom, Google Meet, or any other provider — by tapping into the user’s local audio through the desktop application. It captures discussions, summarizes decisions, surfaces action items, and because Slackbot is natively connected to Salesforce, it can log actions and update opportunities directly in the CRM. Slackbot on Desktop extends the agent outside the Slack container entirely, while voice mode adds text-to-speech and speech-to-text capabilities, with full speech-to-speech functionality under active development.
Slackbot is built on Anthropic’s Claude model, a detail Seaman confirmed ahead of the keynote, where Anthropic’s leadership will appear alongside Slack executives on stage. The partnership underscores the deepening relationship between the two companies: Anthropic’s technology powers the reasoning layer, while Slack’s “context engineering” — the process of determining exactly which information from a user’s channels, files, and messages should be fed into the model’s context window — determines the quality and relevance of every response.
Managing the cost of that reasoning at enterprise scale is one of the most significant technical and financial challenges the team faces. Slackbot is included in Business+ and Enterprise+ plans at no additional consumption charge — a deliberate strategic choice that places the burden of cost optimization squarely on Slack’s engineering team rather than on customers.
“A lot of what we’ve done is in the context engineering phase, working really closely with Anthropic to make sure that we’re optimizing the RAG phase, optimizing our system prompts and everything, to make sure we’re getting the right amount of context into the context window and not obviously making fiscally irresponsible decisions for ourselves,” Seaman said. Starting in April, Slackbot will also become available in a limited sampling capacity to users on Slack’s free and Pro plans — a move designed to drive conversion up the pricing tiers.
The extension of Slackbot beyond the Slack application window — particularly its ability to listen to meetings and view screen content — raises immediate questions about employee surveillance, especially in large enterprise environments where tens of thousands of workers may be subject to company-wide IT policies.
Seaman was emphatic that every capability is user-initiated and opt-in. Slackbot cannot listen to audio unless the user explicitly tells it to take meeting notes. It cannot view the desktop autonomously; in its current form, users must manually capture and share screenshots. And it inherits every permission the organization has already established in Slack.
“Everything is user opt-in. That’s a key tenet of Slack,” Seaman said. “It’s not rogue looking at your desktop or autonomously looking at your desktop. It’s very important to us, and very important to our enterprise customers.” On Slackbot’s memory feature — which allows it to learn user preferences and habits over time — Seaman said the company has no plans to make that data available to administrators. Users can flush their stored preferences at any time simply by telling Slackbot to do so.
Among the most important features in Tuesday’s release is a native CRM built directly into Slack, targeting small businesses that haven’t yet adopted a dedicated customer relationship management system.
The logic is straightforward: small companies typically adopt Slack early in their lifecycle, often on the free tier, and their customer conversations already happen in channels and direct messages. Slack’s native CRM reads those channels, understands the conversations, and automatically keeps deals, contacts, and call notes up to date. When companies are ready to scale, every record is already connected to Salesforce — no migrations, no starting over.
“The hypothesis is that along the way, companies are effectively going to have moments where a CRM might matter,” Seaman said. “Our goal is to make it available to them as a default, so as they are starting their company and their company is growing, it’s just right there for them. They don’t have to think about going off and procuring another tool.”
The feature also represents a response to a growing competitive threat. As the Wall Street Journal reported earlier this year, a wave of startups and individual developers have begun “vibe coding” their own lightweight CRMs, emboldened by the capabilities of large language models. By embedding CRM directly into Slack — the tool many of those same startups already depend on — Salesforce aims to make the procurement of a separate system unnecessary.
The announcements arrive at a moment of intense competitive pressure. Microsoft has integrated Copilot across its entire productivity suite, giving it a distribution advantage that reaches into virtually every Fortune 500 company. Google has been similarly aggressive with Gemini across Workspace. And standalone AI tools from OpenAI to Anthropic threaten to fragment the enterprise AI experience.
Seaman took a measured approach when asked directly about competitive positioning, invoking a mantra he said Slack uses internally: “We are competitor aware, but customer obsessed.”
“I think there are two things that really stand out. One, we have a context advantage — if you look at the way people use Slack, they love it. They use it so much, constantly communicating with their colleagues, openly thinking, working in public project channels. Two is the user experience. We focus so much on how our product feels in people’s hands.”
That context advantage is real but not guaranteed. Slack’s strength lies in the richness and volume of conversational data flowing through its channels — data that, when fed into an AI model, can produce responses with a degree of organizational awareness that competitors struggle to match. But Microsoft’s Teams captures similar conversational data, and its deep integration with Windows, Office, and Azure gives it a systems-level advantage that Slack, operating as a single application, cannot easily replicate.
Starting this summer, every new Salesforce customer will receive Slack automatically provisioned and AI-powered from day one — a bundling play that ensures the messaging platform reaches the broadest possible enterprise audience. Salesforce reported $41.5 billion in revenue for fiscal year 2026, up 10% year-over-year, with Agentforce ARR reaching $800 million. But Wall Street has remained skeptical about whether AI will ultimately erode demand for traditional enterprise software, and Salesforce’s stock has underperformed the broader Nasdaq over the past year. More Slack users in more organizations gives AI-driven features more surface area to prove their value.
Tuesday’s launch is the first major product release under Seaman’s leadership. He assumed the interim CEO role after former Slack CEO Denise Dresser departed in December 2025 to become OpenAI’s first chief revenue officer — a move that signaled even Salesforce’s own executives felt the gravitational pull of frontier AI companies. The overarching thesis embedded in the announcement — that Slack is evolving from a messaging platform into an operating system for AI agents — is as risky as it is ambitious.
“One of the fundamental tenets of an operating system is that it obscures the complexity of the hardware from the end user,” Seaman said. “There are thousands of apps and agents out there, and that can be overwhelming. I think that’s our job — to be the OS that obscures that complexity, so you just use it like it’s a communication tool.”
When asked whether Slack risks losing its simplicity by trying to do everything, Seaman didn’t flinch. “There’s absolutely a risk,” he said. “That’s what keeps us up at night.”
It’s a remarkably candid admission from the leader of a platform that just launched 30 new features in a single day. The company that won the hearts of millions of workers with playful emoji reactions and frictionless messaging is now betting its future on meeting transcription, CRM pipelines, desktop agents, and enterprise orchestration. Whether Slack can absorb all of that ambition without losing the thing that made people love it in the first place isn’t just a product question — it’s the $27.7 billion question that Salesforce is still trying to answer.
Enterprises building voice-enabled workflows have had limited options for production-grade transcription: closed APIs with data residency risks, or open models that trade accuracy for deployability. Cohere’s new open-weight ASR model, Transcribe, is built to compete on all four key differentiators — contextual accuracy, latency, control and cost.
Cohere says that Transcribe outperforms current leaders on accuracy — and unlike closed APIs, it can run on an organization’s own infrastructure.
Transcribe, which can be accessed via an API or in Cohere’s Model Vault as cohere-transcribe-03-2026, has 2 billion parameters and is licensed under Apache-2.0. The company said the model has an average word error rate (WER) of just 5.42%, meaning it makes fewer transcription mistakes than comparable models.
It’s trained on 14 languages: English, French, German, Italian, Spanish, Greek, Dutch, Polish, Portuguese, Chinese, Japanese, Korean, Vietnamese and Arabic. The company did not specify which Chinese dialect the model was trained on.
Cohere said it trained the model “with a deliberate focus on minimizing WER, while keeping production readiness top-of-mind.” According to Cohere, the result is a model that enterprises can plug directly into voice-powered automations, transcription pipelines, and audio search workflows.
Until recently, enterprise transcription has been a trade-off — closed APIs offered accuracy but locked in data; open models offered control but lagged on performance. Unlike Whisper, which launched as a research model under MIT license, Transcribe is available for commercial use from release and can run on an organization’s own local GPU infrastructure. Early users flagged the commercial-ready open-weight approach as meaningful for enterprise deployments.
Organizations can bring Transcribe to their own local instances, since Cohere said the model has a manageable inference footprint for local GPUs. The company said it was able to do this because the model “extends the Pareto frontier, delivering state-of-the-art accuracy (low WER) while sustaining best-in-class throughput (high RTFx) within the 1B+ parameter model cohort.”
Transcribe outperformed speech-model stalwarts, including Whisper from OpenAI, which powers the voice feature of ChatGPT, and ElevenLabs, which many big retail brands deploy. It currently tops the Hugging Face ASR leaderboard, leading with an average word error rate of 5.42%, outperforming Whisper Large v3 at 7.44%, ElevenLabs Scribe v2 at 5.83%, and Qwen3-ASR-1.7B at 5.76%.
Transcribe also performed well on other datasets tested by Hugging Face. On the AMI dataset, which measures meeting understanding and dialogue analysis, Transcribe logged a WER of 8.15%. On the Voxpopuli dataset, which tests understanding of different accents, the model scored 5.87%, beaten only by Zoom Scribe.
Early users have flagged accuracy and local deployment as the standout factors — particularly for teams that have been routing audio data through external APIs and want to bring that workload in-house.
For engineering teams building RAG pipelines or agent workflows with audio inputs, Transcribe offers a path to production-grade transcription without the data residency and latency penalties of closed APIs.