Meta’s new structured prompting technique makes LLMs significantly better at code review — boosting accuracy to 93% in some cases

Deploying AI agents for repository-scale tasks like bug detection, patch verification, and code review requires overcoming significant technical hurdles. One major bottleneck: the need to set up dynamic execution sandboxes for every repository, which are expensive and computationally heavy. 

Using large language model (LLM) reasoning instead of executing the code is rising in popularity to bypass this overhead, yet it frequently leads to unsupported guesses and hallucinations. 

To improve execution-free reasoning, researchers at Meta introduce “semi-formal reasoning,” a structured prompting technique. This method requires the AI agent to fill out a logical certificate by explicitly stating premises, tracing concrete execution paths, and deriving formal conclusions before providing an answer. 

The structured format forces the agent to systematically gather evidence and follow function calls before drawing conclusions. This increases the accuracy of LLMs in coding tasks and significantly reduces errors in fault localization and codebase question-answering. 

For developers using LLMs in code review tasks, semi-formal reasoning enables highly reliable, execution-free semantic code analysis while drastically reducing the infrastructure costs of AI coding systems.

Agentic code reasoning

Agentic code reasoning is an AI agent’s ability to navigate files, trace dependencies, and iteratively gather context to perform deep semantic analysis on a codebase without running the code. In enterprise AI applications, this capability is essential for scaling automated bug detection, comprehensive code reviews, and patch verification across complex repositories where relevant context spans multiple files.

The industry currently tackles execution-free code verification through two primary approaches. The first involves unstructured LLM evaluators that try to verify code either directly or by training specialized LLMs as reward models to approximate test outcomes. The major drawback is their reliance on unstructured reasoning, which allows models to make confident claims about code behavior without explicit justification. Without structured constraints, it is difficult to ensure agents reason thoroughly rather than guess based on superficial patterns like function names.

The second approach involves formal verification, which translates code or reasoning into formal mathematical languages like Lean, Coq, or Datalog to enable automated proof checking. While rigorous, formal methods require defining the semantics of the programming language. This is entirely impractical for arbitrary enterprise codebases that span multiple frameworks and languages. 

Existing approaches also tend to be highly fragmented and task-specific, often requiring entirely separate architectures or specialized training for each new problem domain. They lack the flexibility needed for broad, multi-purpose enterprise applications.

How semi-formal reasoning works

To bridge the gap between unstructured guessing and overly rigid mathematical proofs, the Meta researchers propose a structured prompting methodology, which they call “semi-formal reasoning.” This approach equips LLM agents with task-specific, structured reasoning templates.

These templates function as mandatory logical certificates. To complete a task, the agent must explicitly state premises, trace execution paths for specific tests, and derive a formal conclusion based solely on verifiable evidence. 

The template forces the agent to gather proof from the codebase before making a judgment. The agent must actually follow function calls and data flows step-by-step rather than guessing their behavior based on surface-level naming conventions. This systematic evidence gathering helps the agent handle edge cases, such as confusing function names, and avoid making unsupported claims.

Semi-formal reasoning in action

The researchers evaluated semi-formal reasoning across three software engineering tasks: patch equivalence verification to determine if two patches yield identical test outcomes without running them, fault localization to pinpoint the exact lines of code causing a bug, and code question answering to test nuanced semantic understanding of complex codebases. The experiments used the Claude Opus-4.5 and Sonnet-4.5 models acting as autonomous verifier agents.

The team compared their structured semi-formal approach against several baselines, including standard reasoning, where an agentic model is given a minimal prompt and allowed to explain its thinking freely in unstructured natural language. They also compared against traditional text-similarity algorithms like difflib.

In patch equivalence, semi-formal reasoning improved accuracy on challenging, curated examples from 78% using standard reasoning to 88%. When evaluating real-world, agent-generated patches with test specifications available, the Opus-4.5 model using semi-formal reasoning achieved 93% verification accuracy, outperforming both the unstructured single-shot baseline at 86% and the difflib baseline at 73%. Other tasks showed similar gains across the board.

The paper highlights the value of semi-formal reasoning through real-world examples. In one case, the agent evaluates two patches in the Python Django repository that attempt to fix a bug with 2-digit year formatting for years before 1000 CE. One patch uses a custom format() function within the library that overrides the standard function used in Python. 

Standard reasoning models look at these patches, assume format() refers to Python’s standard built-in function, calculate that both approaches will yield the same string output, and incorrectly declare the patches equivalent. 

With semi-formal reasoning, the agent traces the execution path and checks method definitions. Following the structured template, the agent discovers that within one of the library’s files, the format() name is actually shadowed by a custom, module-level function. The agent formally proves that given the attributes of the input passed to the code, this patch will crash the system while the other will succeed.

Based on their experiments, the researchers suggest that “LLM agents can perform meaningful semantic code analysis without execution, potentially reducing verification costs in RL training pipelines by avoiding expensive sandbox execution.”

Caveats and tradeoffs

While semi-formal reasoning offers substantial reliability improvements, enterprise developers must consider several practical caveats before adopting it. There is a clear compute and latency tradeoff. Semi-formal reasoning requires more API calls and tokens. In patch equivalence evaluations, semi-formal reasoning required roughly 2.8 times as many execution steps as standard unstructured reasoning.

The technique also does not universally improve performance, particularly if a model is already highly proficient at a specific task. When researchers evaluated the Sonnet-4.5 model on a code question-answering benchmark, standard unstructured reasoning already achieved a high accuracy of around 85%. Applying the semi-formal template in this scenario yielded no additional gains.

Furthermore, structured reasoning can produce highly confident wrong answers. Because the agent is forced to build elaborate, formal proof chains, it can become overly assured if its investigation is deep but incomplete. In one Python evaluation, the agent meticulously traced five different functions to uncover a valid edge case, but completely missed that a downstream piece of code already safely handled that exact scenario. Because it had built a strong evidence chain, it delivered an incorrect conclusion with extremely high confidence.

The system’s reliance on concrete evidence also breaks down when it hits the boundaries of a codebase. When analyzing third-party libraries where the underlying source code is unavailable, the agent will still resort to guessing behavior based on function names. 

And in some cases, despite strict prompt instructions, models will occasionally fail to fully trace concrete execution paths. 

Ultimately, while semi-formal reasoning drastically reduces unstructured guessing and hallucinations, it does not completely eliminate them.

What developers should take away

This technique can be used out-of-the-box, requiring no model training or special packaging. It is code-execution free, which means you do not need to add additional tools to your LLM environment. You pay more compute at inference time to get higher accuracy at code review tasks. 

The researchers suggest that structured agentic reasoning may offer “a flexible alternative to classical static analysis tools: rather than encoding analysis logic in specialized algorithms, we can prompt LLM agents with task-specific reasoning templates that generalize across languages and frameworks.”

The researchers have made the prompt templates available, allowing them to be readily implemented into your applications. While there is a lot of conversation about prompt engineering being dead, this technique shows how much performance you can still squeeze out of well-structured prompts.

Slack adds 30 AI features to Slackbot, its most ambitious update since the Salesforce acquisition

Slack today announced more than 30 new capabilities for Slackbot, its AI-powered personal agent, in what amounts to the most sweeping overhaul of the workplace messaging platform since Salesforce acquired it for $27.7 billion in 2021. The update transforms Slackbot from a simple conversational assistant into a full-spectrum enterprise agent that can take meeting notes across any video provider, operate outside the Slack application on users’ desktops, execute tasks through third-party tools via the Model Context Protocol (MCP), and even serve as a lightweight CRM for small businesses — all without requiring users to install anything new.

The announcement, timed to a keynote event that Salesforce CEO Marc Benioff is headlining Tuesday morning, arrives less than three months after Slackbot first became generally available on January 13 to Business+ and Enterprise+ subscribers. In that short window, Slack says the feature is on track to become the fastest-adopted product in Salesforce’s 27-year history, with some employees at customer organizations reporting they save up to 90 minutes per day. Inside Salesforce itself, teams claim savings of up to 20 hours per week, translating to more than $6.4 million in estimated productivity value.

“Slackbot is smart. It’s pleasant, and I think it’s endlessly useful,” Rob Seaman, Slack’s interim CEO and former chief product officer, told VentureBeat in an exclusive interview ahead of the announcement. “The upper bound of use cases is effectively limitless for it.”

The release signals Slack’s clearest bid yet to become what Seaman and the company’s leadership describe as an “agentic operating system” — a single surface through which workers interact with AI agents, enterprise applications, and one another. It also marks a direct challenge to Microsoft, which has spent the past two years embedding its Copilot assistant across the entirety of its productivity stack.

From simple chatbot to autonomous coworker: six new capabilities that redefine what Slackbot can do

The features announced Tuesday organize around several major capability areas, each designed to push Slackbot well beyond the role of a chatbot and into something closer to an autonomous digital coworker.

The most foundational may be what Slack is calling AI-Skills — reusable instruction sets that define the inputs, the steps, and the exact output format for a given task. Any team can build a skill once and deploy it on demand. Slackbot ships with a built-in library for common workflows, but users can also create their own. Critically, Slackbot can recognize when a user’s prompt matches an existing skill and apply it automatically, without being explicitly told to do so. “Think of these as topics or instructions — basically instructions for Slackbot to perform a repeat task that the user might want to do, that they can share with others, or a company might be able to set up for their whole company,” Seaman explained.

Deep research mode gives Slackbot the ability to conduct extended, multi-step investigations that take approximately four minutes to complete — a significant departure from the instant-response paradigm of most enterprise chatbots. Slack chose not to demonstrate this feature on stage at the keynote, Seaman said, precisely because its value lies in depth, not speed. MCP client integration, meanwhile, allows Slackbot to make tool calls into external systems through the Model Context Protocol, meaning it can now create Google Slides, draft Google Docs, and interact with the more than 2,600 apps in the Slack Marketplace and the 6,000-plus apps built over two decades for the Salesforce AppExchange. “We’re going all in on MCP for Slackbot,” Seaman said. “MCP clients and MCP servers are becoming very mature.”

Meeting intelligence allows Slackbot to listen to any meeting — not just Slack huddles, but calls on Zoom, Google Meet, or any other provider — by tapping into the user’s local audio through the desktop application. It captures discussions, summarizes decisions, surfaces action items, and because Slackbot is natively connected to Salesforce, it can log actions and update opportunities directly in the CRM. Slackbot on Desktop extends the agent outside the Slack container entirely, while voice mode adds text-to-speech and speech-to-text capabilities, with full speech-to-speech functionality under active development.

How Anthropic’s Claude powers Slackbot — and why keeping it affordable is the hardest part

Slackbot is built on Anthropic’s Claude model, a detail Seaman confirmed ahead of the keynote, where Anthropic’s leadership will appear alongside Slack executives on stage. The partnership underscores the deepening relationship between the two companies: Anthropic’s technology powers the reasoning layer, while Slack’s “context engineering” — the process of determining exactly which information from a user’s channels, files, and messages should be fed into the model’s context window — determines the quality and relevance of every response.

Managing the cost of that reasoning at enterprise scale is one of the most significant technical and financial challenges the team faces. Slackbot is included in Business+ and Enterprise+ plans at no additional consumption charge — a deliberate strategic choice that places the burden of cost optimization squarely on Slack’s engineering team rather than on customers.

“A lot of what we’ve done is in the context engineering phase, working really closely with Anthropic to make sure that we’re optimizing the RAG phase, optimizing our system prompts and everything, to make sure we’re getting the right amount of context into the context window and not obviously making fiscally irresponsible decisions for ourselves,” Seaman said. Starting in April, Slackbot will also become available in a limited sampling capacity to users on Slack’s free and Pro plans — a move designed to drive conversion up the pricing tiers.

Desktop AI and meeting transcription are powerful, but they raise hard questions about workplace surveillance

The extension of Slackbot beyond the Slack application window — particularly its ability to listen to meetings and view screen content — raises immediate questions about employee surveillance, especially in large enterprise environments where tens of thousands of workers may be subject to company-wide IT policies.

Seaman was emphatic that every capability is user-initiated and opt-in. Slackbot cannot listen to audio unless the user explicitly tells it to take meeting notes. It cannot view the desktop autonomously; in its current form, users must manually capture and share screenshots. And it inherits every permission the organization has already established in Slack.

“Everything is user opt-in. That’s a key tenet of Slack,” Seaman said. “It’s not rogue looking at your desktop or autonomously looking at your desktop. It’s very important to us, and very important to our enterprise customers.” On Slackbot’s memory feature — which allows it to learn user preferences and habits over time — Seaman said the company has no plans to make that data available to administrators. Users can flush their stored preferences at any time simply by telling Slackbot to do so.

Slack’s native CRM is a Trojan horse designed to capture startups before they outgrow it

Among the most important features in Tuesday’s release is a native CRM built directly into Slack, targeting small businesses that haven’t yet adopted a dedicated customer relationship management system.

The logic is straightforward: small companies typically adopt Slack early in their lifecycle, often on the free tier, and their customer conversations already happen in channels and direct messages. Slack’s native CRM reads those channels, understands the conversations, and automatically keeps deals, contacts, and call notes up to date. When companies are ready to scale, every record is already connected to Salesforce — no migrations, no starting over.

“The hypothesis is that along the way, companies are effectively going to have moments where a CRM might matter,” Seaman said. “Our goal is to make it available to them as a default, so as they are starting their company and their company is growing, it’s just right there for them. They don’t have to think about going off and procuring another tool.”

The feature also represents a response to a growing competitive threat. As the Wall Street Journal reported earlier this year, a wave of startups and individual developers have begun “vibe coding” their own lightweight CRMs, emboldened by the capabilities of large language models. By embedding CRM directly into Slack — the tool many of those same startups already depend on — Salesforce aims to make the procurement of a separate system unnecessary.

Slack says it has a context advantage over Microsoft and Google — but can it last?

The announcements arrive at a moment of intense competitive pressure. Microsoft has integrated Copilot across its entire productivity suite, giving it a distribution advantage that reaches into virtually every Fortune 500 company. Google has been similarly aggressive with Gemini across Workspace. And standalone AI tools from OpenAI to Anthropic threaten to fragment the enterprise AI experience.

Seaman took a measured approach when asked directly about competitive positioning, invoking a mantra he said Slack uses internally: “We are competitor aware, but customer obsessed.”

“I think there are two things that really stand out. One, we have a context advantage — if you look at the way people use Slack, they love it. They use it so much, constantly communicating with their colleagues, openly thinking, working in public project channels. Two is the user experience. We focus so much on how our product feels in people’s hands.”

That context advantage is real but not guaranteed. Slack’s strength lies in the richness and volume of conversational data flowing through its channels — data that, when fed into an AI model, can produce responses with a degree of organizational awareness that competitors struggle to match. But Microsoft’s Teams captures similar conversational data, and its deep integration with Windows, Office, and Azure gives it a systems-level advantage that Slack, operating as a single application, cannot easily replicate.

Starting this summer, every new Salesforce customer will receive Slack automatically provisioned and AI-powered from day one — a bundling play that ensures the messaging platform reaches the broadest possible enterprise audience. Salesforce reported $41.5 billion in revenue for fiscal year 2026, up 10% year-over-year, with Agentforce ARR reaching $800 million. But Wall Street has remained skeptical about whether AI will ultimately erode demand for traditional enterprise software, and Salesforce’s stock has underperformed the broader Nasdaq over the past year. More Slack users in more organizations gives AI-driven features more surface area to prove their value.

Slack’s biggest bet is that it can do everything without losing the simplicity that made it beloved

Tuesday’s launch is the first major product release under Seaman’s leadership. He assumed the interim CEO role after former Slack CEO Denise Dresser departed in December 2025 to become OpenAI’s first chief revenue officer — a move that signaled even Salesforce’s own executives felt the gravitational pull of frontier AI companies. The overarching thesis embedded in the announcement — that Slack is evolving from a messaging platform into an operating system for AI agents — is as risky as it is ambitious.

“One of the fundamental tenets of an operating system is that it obscures the complexity of the hardware from the end user,” Seaman said. “There are thousands of apps and agents out there, and that can be overwhelming. I think that’s our job — to be the OS that obscures that complexity, so you just use it like it’s a communication tool.”

When asked whether Slack risks losing its simplicity by trying to do everything, Seaman didn’t flinch. “There’s absolutely a risk,” he said. “That’s what keeps us up at night.”

It’s a remarkably candid admission from the leader of a platform that just launched 30 new features in a single day. The company that won the hearts of millions of workers with playful emoji reactions and frictionless messaging is now betting its future on meeting transcription, CRM pipelines, desktop agents, and enterprise orchestration. Whether Slack can absorb all of that ambition without losing the thing that made people love it in the first place isn’t just a product question — it’s the $27.7 billion question that Salesforce is still trying to answer.

Cohere’s open-weight ASR model hits 5.4% word error rate — low enough to replace speech APIs in production pipelines

Enterprises building voice-enabled workflows have had limited options for production-grade transcription: closed APIs with data residency risks, or open models that trade accuracy for deployability. Cohere’s new open-weight ASR model, Transcribe, is built to compete on all four key differentiators — contextual accuracy, latency, control and cost.

Cohere says that Transcribe outperforms current leaders on accuracy — and unlike closed APIs, it can run on an organization’s own infrastructure.

Cohere, which can be accessed via an API or in Cohere’s Model Vault as cohere-transcribe-03-2026, has 2 billion parameters and is licensed under Apache-2.0. The company said Transcribe has an average word error rate (WER) of just 5.42%, so it makes fewer mistakes than similar models.

It’s trained on 14 languages: English, French, German, Italian, Spanish, Greek, Dutch, Polish, Portuguese, Chinese, Japanese, Korean, Vietnamese and Arabic. The company did not specify which Chinese dialect the model was trained on. 

Cohere said it trained the model “with a deliberate focus on minimizing WER, while keeping production readiness top-of-mind.” According to Cohere, the result is a model that enterprises can plug directly into voice-powered automations, transcription pipelines, and audio search workflows.

Self-hosted transcription for production pipelines

Until recently, enterprise transcription has been a trade-off — closed APIs offered accuracy but locked in data; open models offered control but lagged on performance. Unlike Whisper, which launched as a research model under MIT license, Transcribe is available for commercial use from release and can run on an organization’s own local GPU infrastructure. Early users flagged the commercial-ready open-weight approach as meaningful for enterprise deployments.

Organizations can bring Transcribe to their own local instances, since Cohere said the model has a more manageable inference footprint for local GPUs. The company said they were able to do this because the model “extends the Pareto frontier, delivering state-of-the-art accuracy (low WER) while sustaining best-in-class throughput (high RTFx) within the 1B+ parameter model cohort.”

How Transcribe stacks up

Transcribe outperformed speech-model stalwarts, including Whisper from OpenAI, which powers the voice feature of ChatGPT, and ElevenLabs, which many big retail brands deploy. It currently tops the Hugging Face ASR leaderboard, leading with an average word error rate of 5.42%, outperforming Whisper Large v3 at 7.44%, ElevenLabs Scribe v2 at 5.83%, and Qwen3-ASR-1.7B at 5.76%.

Based on other datasets tested by Hugging Face, Transcribe also performed well. The AMI dataset, which measures meeting understanding and dialogue analysis, Transcribe logged a score of 8.15%. For the Voxpopuli dataset that tests understanding of different accents, the model scored 5.87%, beaten only by Zoom Scribe.

Early users have flagged accuracy and local deployment as the standout factors — particularly for teams that have been routing audio data through external APIs and want to bring that workload in-house.

For engineering teams building RAG pipelines or agent workflows with audio inputs, Transcribe offers a path to production-grade transcription without the data residency and latency penalties of closed APIs.

When product managers ship code: AI just broke the software org chart

Last week, one of our product managers (PMs) built and shipped a feature. Not spec’d it. Not filed a ticket for it. Built it, tested it, and shipped it to production. In a day.

A few days earlier, our designer noticed that the visual appearance of our IDE plugins had drifted from the design system. In the old world, that meant screenshots, a JIRA ticket, a conversation to explain the intent, and a sprint slot. Instead, he opened an agent, adjusted the layout himself, experimented, iterated, and tuned in real time, then pushed the fix. The person with the strongest design intuition fixed the design directly. No translation layer required.

None of this is new in theory. Vibe coding opened the gates of software creation to millions. That was aspiration. When I shared the data on how our engineers doubled throughput, shifted from coding to validation, brought design upfront for rapid experimentation, it was still an engineering story. What changed is that the theory became practice. Here’s how it actually played out.

The bottleneck moved

When we went AI-first in 2025, implementation cost collapsed. Agents took over scaffolding, tests, and the repetitive glue code that used to eat half the sprint. Cycle times dropped from weeks to days, from days to hours. Engineers started thinking less in files and functions and more in architecture, constraints, and execution plans.

But once engineering capacity stopped being the bottleneck, we noticed something: Decision velocity was. All the coordination mechanisms we’d built to protect engineering time (specs, tickets, handoffs, backlog grooming) were now the slowest part of the system. We were optimizing for a constraint that no longer existed.

What happens when building is cheaper than coordination

We started asking a different question: What would it look like if the people closest to the intent could ship the software directly?

PMs already think in specifications. Designers already define structure, layout, and behavior. They don’t think in syntax. They think in outcomes. When the cost of turning intent into working software dropped far enough, these roles didn’t need to “learn to code.” The cost of implementation simply fell to their level.

I asked one of our PMs, Dmitry, to describe what changed from his perspective. He told me: “While agents are generating tasks in Zenflow, there’s a few minutes of idle time. Just dead air. I wanted to build a small game, something to interact with while you wait.”

If you’ve ever run a product team, you know this kind of idea. It doesn’t move a KPI. It’s impossible to justify in a prioritization meeting. It gets deferred forever. But it adds personality. It makes the product feel like someone cared about the small details. These are exactly the things that get optimized out of every backlog grooming session, and exactly the things users remember.

He built it in a day.

In the past, that idea would have died in a prioritization spreadsheet. Not because it was bad, but because the cost of implementation made it irrational to pursue. When that cost drops to near zero, the calculus changes completely.

Shipping became cheaper than explaining

As more people started building directly, entire layers of process quietly vanished. Fewer tickets. Fewer handoffs. Fewer “can you explain what you mean by…” conversations. Fewer lost-in-translation moments.

For a meaningful class of tasks, it became faster to just build the thing than to describe what you wanted and wait for someone else to build it. Think about that for a second. Every modern software organization is structured around the assumption that implementation is the expensive part. When that assumption breaks, the org has to change with it.

Our designer fixing the plugin UI is a perfect example. The old workflow (screenshot the problem, file a ticket, explain the gap between intent and implementation, wait for a sprint slot, review the result, request adjustments) existed entirely to protect engineering bandwidth. When the person with the design intuition can act on it directly, that whole stack disappears. Not because we eliminated process for its own sake, but because the process was solving a problem that no longer existed.

The compounding effect

Here’s what surprised me most: It compounds.

When PMs build their own ideas, their specifications get sharper, because they now understand what the agent needs to execute well. Sharper specs produce better agent output. Better output means fewer iteration cycles. We’re seeing velocity compound week over week, not just because the models improved, but because the people using them got closer to the work.

Dmitry put it well: The feedback loop between intent and outcome went from weeks to minutes. When you can see the result of your specification immediately, you learn what precision the system needs, and you start providing it instinctively.

There’s a second-order effect that’s harder to measure but impossible to miss: Ownership. People stop waiting. They stop filing tickets for things they could just fix. “Builder” stopped being a job title. It became the default behavior.

What this means for the industry

A lot of the “everyone can code” narrative last year was theoretical, or focused on solo founders and tiny teams. What we experienced is different. We have ~50 engineers working in a complex brownfield codebase: Multiple surfaces and programming languages, enterprise integrations, the full weight of a real production system. 

I don’t think we’re unique. I think we’re early. And with each new generation of models, the gap between who can build and who can’t is closing faster than most organizations realize. Every software company is about to discover that their PMs and designers are sitting on unrealized building capacity, blocked not by skill, but by the cost of implementation. As that cost continues to fall, the organizational implications are profound.

We started with an intent to accelerate software engineering. What we’re becoming is something different: A company where everyone ships.

Andrew Filev is founder and CEO of Zencoder.

When AI turns software development inside-out: 170% throughput at 80% headcount

Many people have tried AI tools and walked away unimpressed. I get it — many demos promise magic, but in practice, the results can feel underwhelming.

That’s why I want to write this not as a futurist prediction, but from lived experience. Over the past six months, I turned my engineering organization AI-first. I’ve shared before about the system behind that transformation — how we built the workflows, the metrics, and the guardrails. Today, I want to zoom out from the mechanics and talk about what I’ve learned from that experience — about where our profession is heading when software development itself turns inside out. 

Before I do, a couple of numbers to illustrate the scale of change. Subjectively, it feels that we are moving twice as fast. Objectively, here’s how the throughput evolved. Our total engineering team headcount floated from 36 at the beginning of the year to 30. So you get ~170% throughput on ~80% headcount, which matches the subjective ~2x. 

Zooming in, I picked a couple of our senior engineers who started the year in a more traditional software engineering process and ended it in the AI-first way. [The dips correspond to vacations and off-sites]:

Note that our PRs are tied to JIRA tickets, and the average scope of those tickets didn’t change much through the year, so it’s as good a proxy as the data can give us. 

Qualitatively, looking at the business value, I actually see even higher uplift. One reason is that, as we started last year, our quality assurance (QA) team couldn’t keep up with our engineers’ velocity. As the company leader, I wasn’t happy with the quality of some of our early releases. As we progressed through the year, and tooled our AI workflows to include writing unit and end-to-end tests, our coverage improved, the number of bugs dropped, users became fans, and the business value of engineering work multiplied.

From big design to rapid experimentation

Before AI, we spent weeks perfecting user flows before writing code. It made sense when change was expensive. Agile helped, but even then, testing multiple product ideas was too costly.

Once we went AI-first, that trade-off disappeared. The cost of experimentation collapsed. An idea could go from whiteboard to a working prototype in a day: From idea to AI-generated product requirements document (PRD), to AI-generated tech spec, to AI-assisted implementation. 

It manifested itself in some amazing transformations. Our website—central to our acquisition and inbound demand—is now a product-scale system with hundreds of custom components, all designed, developed, and maintained directly in code by our creative director

Now, instead of validating with slides or static prototypes, we validate with working products. We test ideas live, learn faster, and release major updates every other month, a pace I couldn’t imagine three years ago.

For example, Zen CLI was first written in Kotlin, but then we changed our mind and moved it to TypeScript with no release velocity lost.

Instead of mocking the features, our UX designers and project managers vibe code them. And when the release-time crunch hit everyone, they jumped into action and fixed dozens of small details with production-ready PRs to help us ship a great product. This included an overnight UI layout change.

From coding to validation

The next shift came where I least expected it: Validation.

In a traditional org, most people write code and a smaller group tests it. But when AI generates much of the implementation, the leverage point moves. The real value lies in defining what “good” looks like — in making correctness explicit.

We support 70-plus programming languages and countless integrations. Our QA engineers have evolved into system architects. They build AI agents that generate and maintain acceptance tests directly from requirements. And those agents are embedded into the codified AI workflows that allow us to achieve predictable engineering outcomes by using a system.

This is what “shift left” really means. Validation isn’t a stand-alone function, it’s an integral part of the production process. If the agent can’t validate it’s work, it can’t be trusted to generate production code. For QA professionals, this is a moment of reinvention, where, with the right upskilling, their work becomes a critical enabler and accelerator of the AI adoption

Product managers, tech leads, and data engineers now share this responsibility as well, because defining correctness has become a cross-functional skill, not a role confined to QA.

From diamond to double funnel

For decades, software development followed a “diamond” shape: A small product team handed off to a large engineering team, then narrowed again through QA.

Today, that geometry is flipping. Humans engage more deeply at the beginning — defining intent, exploring options — and again at the end, validating outcomes. The middle, where AI executes, is faster and narrower.

It’s not just a new workflow; it’s a structural inversion.

The model looks less like an assembly line and more like a control tower. Humans set direction and constraints, AI handles execution at speed, and people step back in to validate outcomes before decisions land in production.

Engineering at a higher level of abstraction

Every major leap in software raised our level of abstraction — from punch cards to high-level programming languages, from hardware to cloud. AI is the next step. Our engineers now work at a meta-layer: Orchestrating AI workflows, tuning agentic instructions and skills, and defining guardrails. The machines build; the humans decide what and why.

Teams now routinely decide when AI output is safe to merge without review, how tightly to bound agent autonomy in production systems, and what signals actually indicate correctness at scale, decisions that simply didn’t exist before.

And that’s the paradox of AI-first engineering — it feels less like coding, and more like thinking. Welcome to the new era of human intelligence, powered by AI.

Andrew Filev is founder and CEO of Zencoder

Mistral AI just released a text-to-speech model it says beats ElevenLabs — and it’s giving away the weights for free

The enterprise voice AI market is in the middle of a land grab. ElevenLabs and IBM announced a collaboration just this week to bring premium voice capabilities into IBM’s watsonx Orchestrate platform. Google Cloud has been expanding its Chirp 3 HD voices. OpenAI continues to iterate on its own speech synthesis. And the market underpinning all of this activity is enormous — voice AI crossed $22 billion globally in 2026, with the voice AI agents segment alone projected to reach $47.5 billion by 2034, according to industry estimates.

On Thursday morning, Mistral AI entered that fight with a fundamentally different proposition. The Paris-based AI startup released Voxtral TTS, what it calls the first frontier-quality, open-weight text-to-speech model designed specifically for enterprise use. Where every major competitor in the space operates a proprietary, API-first business — enterprises rent the voice, they don’t own it — Mistral is releasing the full model weights, inviting companies to download Voxtral TTS, run it on their own servers or even on a smartphone, and never send a single audio frame to a third party.

It is a bet that the future of enterprise voice AI will not be shaped by whoever builds the best-sounding model, but by whoever gives companies the most control over it. And it arrives at a moment when Mistral, valued at $13.8 billion after a $2 billion Series C round led by Dutch chipmaker ASML last September, has been aggressively assembling the building blocks of a complete, enterprise-owned AI stack — from its Forge customization platform announced at Nvidia GTC earlier this month, to its AI Studio production infrastructure, to the Voxtral Transcribe speech-to-text model released just weeks ago.

Voxtral TTS is the output layer that completes that picture, giving enterprises a speech-to-speech pipeline they can run end-to-end without relying on any external provider.

“We see audio as a big bet and as a critical and maybe the only future interface with all the AI models,” Pierre Stock, Mistral’s vice president of science and the first employee hired at the company, said in an exclusive interview with VentureBeat. “This is something customers have been asking for.”

A 3-billion-parameter model that fits on a laptop and runs six times faster than real-time speech

The technical specifications of Voxtral TTS read like a deliberate inversion of industry norms. Where most frontier TTS models are large and resource-intensive, Mistral built its model to be roughly three times smaller than what it calls the industry standard for comparable quality.

The architecture comprises three components: a 3.4-billion-parameter transformer decoder backbone, a 390-million-parameter flow-matching acoustic transformer, and a 300-million-parameter neural audio codec that Mistral developed in-house. The system is built on top of Ministral 3B, the same pretrained backbone that powers the company’s Voxtral Transcribe model — a design choice that Stock described as emblematic of Mistral’s culture of efficiency and artifact reuse.

In practice, the model achieves a time-to-first-audio of 90 milliseconds for a typical input and generates speech at approximately six times real-time speed. When quantized for inference, it requires roughly three gigabytes of RAM. Stock confirmed it can run on any laptop or smartphone, and even on older hardware it still operates in real time.

“It’s a 3B model, so it can basically run on any laptop or any smartphone,” Stock told VentureBeat. “If you quantize it to infer, it’s actually three gigabytes of RAM. And you can run it on super old chips — it’s still going to be real time.”

The model supports nine languages — English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic — and can adapt to a custom voice with as little as five seconds of reference audio. Perhaps more remarkably, it demonstrates zero-shot cross-lingual voice adaptation without explicit training for that task.

Stock illustrated this with a personal example: he can feed the model 10 seconds of his own French-accented voice, type a prompt in German, and the model will generate German speech that sounds like him — complete with his natural accent and vocal characteristics. For enterprises operating across borders, this capability unlocks cascaded speech-to-speech translation that preserves speaker identity, a feature that has obvious applications in customer support, sales, and internal communications for multinational organizations.

Human evaluators preferred Voxtral over ElevenLabs nearly 70 percent of the time on voice customization

Mistral is not being coy about which competitor it intends to displace. In human evaluations conducted by the company, Voxtral TTS achieved a 62.8 percent listener preference rate against ElevenLabs Flash v2.5 on flagship voices and a 69.9 percent preference rate in voice customization tasks. Mistral also claims the model performs at parity with ElevenLabs v3 — the company’s premium, higher-latency tier — on emotional expressiveness, while maintaining similar latency to the much faster Flash model.

The evaluation methodology involved a comparative side-by-side test across all nine supported languages. Using two recognizable voices in their native dialects for each language, three annotators performed preference tests on naturalness, accent adherence, and acoustic similarity to the original reference. Mistral says Voxtral TTS widened the quality gap to ElevenLabs v2.5 Flash especially in zero-shot multilingual custom voice settings, highlighting what the company calls the “instant customizability” of the model.

ElevenLabs remains widely regarded as the benchmark for raw voice quality. Its Eleven v3 model has been described by multiple independent reviewers as the gold standard for emotionally nuanced AI speech. But ElevenLabs operates as a closed platform with tiered subscription pricing that scales from around $5 per month at the starter level to over $1,300 per month for business plans. It does not release model weights.

Mistral’s pitch is that enterprises shouldn’t have to choose between quality and control — and that at scale, the economics of an open-weight model are dramatically more favorable.

“What we want to underline is that we’re faster and cheaper as well — and open source,” Stock told VentureBeat. “When something is open source and cheap, people adopt it and people build on it.”

He framed the cost argument in terms that resonate with CTOs managing AI budgets: “AI is a transformative technology, but it has a cost. When you want to scale and have impact on a large business, that cost matters. And what we allow is to scale seamlessly while minimizing the cost and maximizing the accuracy.”

Why Mistral thinks enterprises will want to own their voice AI rather than rent it

To understand why Mistral is entering text-to-speech now, you have to understand the broader strategic architecture the company has been building for the past year. While OpenAI and Anthropic have captured the imagination of consumers, Mistral has quietly assembled what may be the most comprehensive enterprise AI platform in Europe — and increasingly, globally.

CEO Arthur Mensch has said the company is on track to surpass $1 billion in annual recurring revenue this year, according to TechCrunch’s reporting on the Forge launch. The Financial Times has reported that Mistral’s annualized revenue run rate surged from $20 million to over $400 million within a single year. That growth has been powered by more than 100 major enterprise customers and a consistent thesis: companies should own their AI infrastructure, not rent it.

Voxtral TTS is the latest expression of that thesis, applied to what may be the most sensitive category of enterprise data there is. Voice recordings capture not just words but emotion, identity, and intent. They carry legal, regulatory, and reputational weight that text data often does not. For industries like financial services, healthcare, and government — all key Mistral verticals — sending voice data to a third-party API introduces risks that many compliance teams are unwilling to accept.

Stock made the data sovereignty argument forcefully. “Since the models are open weights, we have no trouble and no problem actually giving the weights to the enterprise and helping them customize the models,” he said. “We don’t see the weights anymore. We don’t see the data. We see nothing. And you are fully controlled.”

That message has particular resonance in Europe, where concern about technological dependence on American cloud providers has intensified throughout 2026. The EU currently sources more than 80 percent of its digital services from foreign providers, most of them American. Mistral has positioned itself as the answer to that anxiety — the only European frontier AI developer with the scale and technical capability to offer a credible alternative.

Voice agents are the enterprise use case that makes Mistral’s full AI stack click into place

Voxtral TTS is the final piece in a pipeline Mistral has been methodically assembling. Voxtral Transcribe handles speech-to-text. Mistral’s language models — from Mistral Small to Mistral Large — provide the reasoning layer. Forge allows enterprises to customize any of these models on their own data. AI Studio provides the production infrastructure for observability, governance, and deployment. And Mistral Compute offers the underlying GPU resources.

Together, these pieces form what Stock described as a “full AI stack, fully controllable and customizable” for the enterprise. Voice agents — AI systems that can listen to a customer, understand what they need, reason about the answer, and respond in natural-sounding speech — are the use case that ties all of these layers together.

The applications Mistral envisions span customer support, where voice agents can route and resolve queries with brand-appropriate speech; sales and marketing, where a single voice can work across markets through cross-lingual emulation; real-time translation for cross-border operations; and even interactive storytelling and game design, where emotion-steering can control tone and personality.

Stock was most animated when discussing how Voxtral TTS fits into the broader agentic AI trend that has dominated enterprise technology discussions in 2026. “We are totally building for a world in which audio is a natural interface, in particular for agents to which you can delegate work — extensions of yourself,” he said. He described a scenario in which a user starts planning a vacation on a computer, commutes to work, and then picks up the workflow on a phone simply by asking for an update by voice.

“To make that happen, you need a model you can trust, you need a model that’s super efficient and super cheap to run — otherwise you won’t use it for long — and you need a model that sounds super conversational and that you can interrupt at any time,” Stock said.

That emphasis on interruptibility and real-time responsiveness reflects a broader insight about voice interfaces that distinguishes them from text. A chatbot can take two or three seconds to respond without breaking the user experience. A voice agent cannot. The 90-millisecond time-to-first-audio that Voxtral TTS achieves is not just a benchmark number — it is the threshold between a voice interaction that feels natural and one that feels robotic.

Mistral’s open-weight approach aligns with a broader industry shift that even Nvidia is backing

Mistral’s decision to release Voxtral TTS with open weights is consistent with a movement that has been gathering momentum across the AI industry. At Nvidia GTC earlier this month, Nvidia CEO Jensen Huang declared that “proprietary versus open is not a thing — it’s proprietary and open.” Nvidia announced the Nemotron Coalition, a first-of-its-kind collaboration of model builders working to advance open frontier-level foundation models, with Mistral as a founding member. The first project from that coalition will be a base model codeveloped by Mistral AI and Nvidia.

For Mistral, open weights serve a dual commercial purpose. They drive adoption — developers and enterprises can experiment without friction or commitment — while the company monetizes through its platform services, customization offerings, and managed infrastructure. The model is available to test in Mistral Studio and through the company’s API, but the strategic play is to become embedded in enterprise voice pipelines as an owned asset, not a metered service.

This mirrors the playbook that worked for Mistral’s language models. As Mensch told CNBC in February, “AI is making us able to develop software at the speed of light,” predicting that “more than half of what’s currently being bought by IT in terms of SaaS is going to shift to AI.” He described a “replatforming” taking place across enterprise technology, with businesses looking to replace legacy software systems with AI-native alternatives. An open-weight voice model that enterprises can customize and deploy on their own terms fits naturally into that narrative.

Mistral signals that end-to-end audio AI is where the company is headed next

When asked what comes after Voxtral TTS, Stock outlined two directions. The first is expanding language and dialect support, with particular attention to cultural nuance. “It’s not the same to speak French in Paris than to speak French in Canada, in Montreal,” he said. “We want to respect both cultures, and we want our models to perform in both contexts with all the cultural specifics.”

The second direction is more ambitious: a fully end-to-end audio model that doesn’t just generate speech from text but understands the complete spectrum of human vocal communication.

“We convey some meaning with the words we speak,” Stock said. “We actually convey way more with the intonation, the rhythm, and how we say it. When people talk about end-to-end audio, that’s what they mean — the model is able to pick up that you’re in a hurry, for instance, and will go for the fastest answer. The model will know that you’re joyful today and crack a joke. It’s super adaptive to you, and that’s where we want to go.”

That vision — an AI that speaks naturally, listens with nuance, responds with emotional intelligence, and runs on a model small enough to fit in your pocket — is the frontier every major AI lab is racing toward. For now, Voxtral TTS gives Mistral a foundation to build on and enterprises a question they haven’t had to answer before: if you could own your voice AI stack outright, at lower cost and with competitive quality, why would you keep renting someone else’s?

The consequential AI work that actually moves the needle for enterprises

Presented by OutSystems


After two years of flashy AI demos, rushed agent prototypes, and breathless predictions, enterprise technology leaders are striking a more pragmatic tone in 2026. In a recent webinar hosted by OutSystems, a panel of software executives and enterprise practitioners made the case that the most consequential AI work happening now is focused on the practical matters of governance, orchestration, and iteration, along with integrating agents into the systems they’ve spent decades building.

Enterprise leaders are increasingly focused on fundamentals. The priority is using new AI technologies

to accelerate productivity, improve delivery, and produce measurable business results.

Three elements shape this work:

  • The move from AI agent prototypes to agentic systems that deliver measurable ROI in production

  • The growing role of enterprise platforms in governing, orchestrating, and scaling AI agents safely

  • The rise of the generalist developer and enterprise architect as the most valuable technical profiles in an era of AI-generated code

Against this backdrop, the panel discussed governance frameworks, the economics of enterprise AI, and the limits of large language models without orchestration. The conversation ultimately turned to how leading organizations are building multi-agent systems grounded in existing enterprise data and workflows.

Agents in the real world

Enabling agents to work in production across the enterprise is best accomplished with a unified platform that handles development, iteration, and deployment. And that’swhere capabilities like the Agent Workbench in the OutSystems platform matter, said Rajkiran Vajreshwari, senior manager of app development at Thermo Fisher Scientific. It provides the infrastructure to learn, iterate, and govern agents at scale.

His team at Thermo Fisher has moved away from single-task AI assistants in customer service to building a coordinated team of specialized agents using the workbench. When a support case arrives, a triage assistant classifies the request and dynamically routes it to the right specialist agent, whether that’s an intent and priority agent, a product context agent, a troubleshooting agent, or a compliance agent.

“We don’t have to think about what will work and how. It’s all pre-built,” he explained. “Each agent has a narrow role and clear guardrails. They stay accurate and auditable.”

Governing the risks of shadow AI

A new category of risk emerges when AI makes it possible for anyone in a company to generate production-level code without IT oversight. Basically, this is ungoverned shadow AI. These homegrown products are prone to hallucinations, data leakage, policy violations, model drift, and agents taking actions that were never formally approved.

To get ahead of the risk, leading organizations need to do three things, said Luis Blando, CPTO of OutSystems.

“Give users guardrails. They’re going to use AI whether you like it or not. Companies that seem to be getting ahead are using AI to govern AI across their full portfolio,” he explained. “That is the difference between shadow AI chaos and enterprise-grade scale.”

Eric Kavanagh, CEO of The Bloor Group, noted that governance requires a layered set of disciplines that includes securing data, monitoring models for drift, and making deliberate choices about where AI connects to existing business processes.

“Companies don’t have to be manually creating these controls,” he added. “A lot of those guardrails and levers are baked in to platforms like OutSystems.”

Why the real orchestration challenge is models vs. platforms

Much of the early excitement around enterprise AI focused on selecting the right large language model. Now the harder challenge, and far more durable source of value, is orchestration. This includes routing tasks, coordinating workflows, governing execution, and integrating AI into existing enterprise systems.

Scott Finkle, VP of development at McConkey Auction Group, noted that LLMs, however impressive, are pieces of complex workflows, not final solutions. Organizations should be ready to hot-swap between Gemini, ChatGPT, Claude, and whatever emerges next without having to rebuild the agentic system around it.

A platform with orchestration capabilities makes that possible. It manages the lifecycle, provides visibility, and ensures processes execute reliably, even as AI handles the reasoning layer on top.

“The AI and the models change, the workflows can change, but the orchestration remains the same,” Finkle said. “That’s how we’re going to extract value out of AI.”

The economics of enterprise AI investing

Security, compliance, governance, and platform-level AI capabilities will all command greater investment in 2026, particularly as AI moves into core workflows like finance and supply chain. Enterprises should favor incremental wins rather than expect big, immediate gains.

“We’re focusing on base hits,” Finkle said. “The way it counts is by getting something into production and having it make an impact. Big investments in pilot projects that don’t make it into production don’t save any money. It’s not going to happen overnight, but over time I think we’ll see tremendous savings.”

There’s still a split in how enterprises are approaching AI transformation. Some start from scratch and reimagine every process. Others, especially those with billions of dollars in existing infrastructure depreciating in-house, want AI to integrate with their systems. They want agentic systems to reuse data, APIs, and proven processes while speeding up delivery. The agent platform approach serves both camps, but particularly the latter. Organizations can deploy agents where they add clear value while preserving the integrity of established, deterministic workflows.

The rise of the enterprise architect and the generalist developer

As AI accelerates code generation, bottlenecks in software delivery are dissolving. In its place is a premium on systems thinking. This is the ability to understand the broader enterprise architecture, decompose complex business problems, and reason about how AI integrates with existing infrastructure. Kavanagh pointed to enterprise architects specifically as the professionals best positioned to capitalize on this moment.

“We’re entering a very interesting age of the generalist,” he explained. “The better you know your enterprise architecture and your business architecture and how those things align, the better off you’re going to be. ”

“The result is faster delivery with fewer interruptions and fewer bugs,” Kavanaugh said. “You can focus on the non-repetitive tasks. It’s a benefit to the developer, to the business, and to the whole IT organization.”

Catch the entire webinar here.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

How xMemory cuts token costs and context bloat in AI agents

Standard RAG pipelines break when enterprises try to use them for long-term, multi-session LLM agent deployments. This is a critical limitation as demand for persistent AI assistants grows.

xMemory, a new technique developed by researchers at King’s College London and The Alan Turing Institute, solves this by organizing conversations into a searchable hierarchy of semantic themes.

Experiments show that xMemory improves answer quality and long-range reasoning across various LLMs while cutting inference costs. According to the researchers, it drops token usage from over 9,000 to roughly 4,700 tokens per query compared to existing systems on some tasks.

For real-world enterprise applications like personalized AI assistants and multi-session decision support tools, this means organizations can deploy more reliable, context-aware agents capable of maintaining coherent long-term memory without blowing up computational expenses.

RAG wasn’t built for this

In many enterprise LLM applications, a critical expectation is that these systems will maintain coherence and personalization across long, multi-session interactions. To support this long-term reasoning, one common approach is to use standard RAG: store past dialogues and events, retrieve a fixed number of top matches based on embedding similarity, and concatenate them into a context window to generate answers.

However, traditional RAG is built for large databases where the retrieved documents are highly diverse. The main challenge is filtering out entirely irrelevant information. An AI agent’s memory, by contrast, is a bounded and continuous stream of conversation, meaning the stored data chunks are highly correlated and frequently contain near-duplicates.

To understand why simply increasing the context window doesn’t work, consider how standard RAG handles a concept like citrus fruit.

Imagine a user has had many conversations saying things like “I love oranges,” “I like mandarins,” and separately, other conversations about what counts as a citrus fruit. Traditional RAG may treat all of these as semantically close and keep retrieving similar “citrus-like” snippets. 

“If retrieval collapses onto whichever cluster is densest in embedding space, the agent may get many highly similar passages about preference, while missing the category facts needed to answer the actual query,” Lin Gui, co-author of the paper, told VentureBeat. 

A common fix for engineering teams is to apply post-retrieval pruning or compression to filter out the noise. These methods assume that the retrieved passages are highly diverse and that irrelevant noise patterns can be cleanly separated from useful facts.

This approach falls short in conversational agent memory because human dialogue is “temporally entangled,” the researchers write. Conversational memory relies heavily on co-references, ellipsis, and strict timeline dependencies. Because of this interconnectedness, traditional pruning tools often accidentally delete important bits of a conversation, leaving the AI without vital context needed to reason accurately.

Why the fix most teams reach for makes things worse

To overcome these limitations, the researchers propose a shift in how agent memory is built and searched, which they describe as “decoupling to aggregation.”

Instead of matching user queries directly against raw, overlapping chat logs, the system organizes the conversation into a hierarchical structure. First it decouples the conversation stream into distinct, standalone semantic components. These individual facts are then aggregated into a higher-level structural hierarchy of themes.

When the AI needs to recall information, it searches top-down through the hierarchy, going from themes to semantics and finally to raw snippets. This approach avoids redundancy. If two dialogue snippets have similar embeddings, the system is unlikely to retrieve them together if they have been assigned to different semantic components.

For this architecture to succeed, it must balance two vital structural properties. The semantic components must be sufficiently differentiated to prevent the AI from retrieving redundant data. At the same time, the higher-level aggregations must remain semantically faithful to the original context to ensure the model can craft accurate answers.

A four-level hierarchy that shrinks the context window

The researchers developed xMemory, a framework that combines structured memory management with an adaptive, top-down search strategy.

xMemory continuously organizes the raw stream of conversation into a structured, four-level hierarchy. At the base are the raw messages, which are first summarized into contiguous blocks called “episodes.” From these episodes, the system distills reusable facts as semantics that disentangle the core, long-term knowledge from repetitive chat logs. Finally, related semantics are grouped together into high-level themes to make them easily searchable.

xMemory uses a special objective function to constantly optimize how it groups these items. This prevents categories from becoming too bloated, which slows down search, or too fragmented, which weakens the model’s ability to aggregate evidence and answer questions.

When it receives a prompt, xMemory performs a top-down retrieval across this hierarchy. It starts at the theme and semantic levels, selecting a diverse, compact set of relevant facts. This is crucial for real-world applications where user queries often require gathering descriptions across multiple topics or chaining connected facts together for complex, multi-hop reasoning.

Once it has this high-level skeleton of facts, the system controls redundancy through what the researchers call “Uncertainty Gating.” It only drills down to pull the finer, raw evidence at the episode or message level if that specific detail measurably decreases the model’s uncertainty.

“Semantic similarity is a candidate-generation signal; uncertainty is a decision signal,” Gui said. “Similarity tells you what is nearby. Uncertainty tells you what is actually worth paying for in the prompt budget.” It stops expanding when it detects that adding more detail no longer helps answer the question.

What are the alternatives?

Existing agent memory systems generally fall into two structural categories: flat designs and structured designs. Both suffer from fundamental limitations.

Flat approaches such as MemGPT log raw dialogue or minimally processed traces. This captures the conversation but accumulates massive redundancy and increases retrieval costs as the history grows longer.

Structured systems such as A-MEM and MemoryOS try to solve this by organizing memories into hierarchies or graphs. However, they still rely on raw or minimally processed text as their primary retrieval unit, often pulling in extensive, bloated contexts. These systems also depend heavily on LLM-generated memory records that have strict schema constraints. If the AI deviates slightly in its formatting, it can cause memory failure.

xMemory addresses these limitations through its optimized memory construction scheme, hierarchical retrieval, and dynamic restructuring of its memory as it grows larger.

When to use xMemory

For enterprise architects, knowing when to adopt this architecture over standard RAG is critical. According to Gui, “xMemory is most compelling where the system needs to stay coherent across weeks or months of interaction.”

Customer support agents, for instance, benefit greatly from this approach because they must remember stable user preferences, past incidents, and account-specific context without repeatedly pulling up near-duplicate support tickets. Personalized coaching is another ideal use case, requiring the AI to separate enduring user traits from episodic, day-to-day details.

Conversely, if an enterprise is building an AI to chat with a repository of files, such as policy manuals or technical documentation, “a simpler RAG stack is still the better engineering choice,” Gui said. In those static, document-centric scenarios, the corpus is diverse enough that standard nearest-neighbor retrieval works perfectly well without the operational overhead of hierarchical memory.

The write tax is worth it

xMemory cuts the latency bottleneck associated with the LLM’s final answer generation. In standard RAG systems, the LLM is forced to read and process a bloated context window full of redundant dialogue. Because xMemory’s precise, top-down retrieval builds a much smaller, highly targeted context window, the reader LLM spends far less compute time analyzing the prompt and generating the final output.

In their experiments on long-context tasks, both open and closed models equipped with xMemory outperformed other baselines, using considerably fewer tokens while increasing task accuracy.

However, this efficient retrieval comes with an upfront cost. For an enterprise deployment, the catch with xMemory is that it trades a massive read tax for an upfront write tax. While it ultimately makes answering user queries faster and cheaper, maintaining its sophisticated architecture requires substantial background processing.

Unlike standard RAG pipelines, which cheaply dump raw text embeddings into a database, xMemory must execute multiple auxiliary LLM calls to detect conversation boundaries, summarize episodes, extract long-term semantic facts, and synthesize overarching themes.

Furthermore, xMemory’s restructuring process adds additional computational requirements as the AI must curate, link, and update its own internal filing system. To manage this operational complexity in production, teams can execute this heavy restructuring asynchronously or in micro-batches rather than synchronously blocking the user’s query.

For developers eager to prototype, the xMemory code is publicly available on GitHub under an MIT license, making it viable for commercial uses. If you are trying to implement this in existing orchestration tools like LangChain, Gui advises focusing on the core innovation first: “The most important thing to build first is not a fancier retriever prompt. It is the memory decomposition layer. If you get only one thing right first, make it the indexing and decomposition logic.”

Retrieval isn’t the last bottleneck

While xMemory offers a powerful solution to today’s context-window limitations, it clears the path for the next generation of challenges in agentic workflows. As AI agents collaborate over longer horizons, simply finding the right information won’t be enough.

“Retrieval is a bottleneck, but once retrieval improves, these systems quickly run into lifecycle management and memory governance as the next bottlenecks,” Gui said. Navigating how data should decay, handling user privacy, and maintaining shared memory across multiple agents is exactly “where I expect a lot of the next wave of work to happen,” he said.

What is DeerFlow 2.0 and what should enterprises know about this new, powerful local AI agent orchestrator?

ByteDance, the Chinese tech giant behind TikTok, last month released what may be one of the most ambitious open-source AI agent frameworks to date: DeerFlow 2.0. It’s now going viral across the machine learning community on social media. But is it safe and ready for enterprise use?

This is a so-called “SuperAgent harness” that orchestrates multiple AI sub-agents to autonomously complete complex, multi-hour tasks. Best of all: it is available under the permissive, enterprise-friendly standard MIT License, meaning anyone can use, modify, and build on it commercially at no cost.

DeerFlow 2.0 is designed for high-complexity, long-horizon tasks that require autonomous orchestration over minutes or hours, including conducting deep research into industry trends, generating comprehensive reports and slide decks, building functional web pages, producing AI-generated videos and reference images, performing exploratory data analysis with insightful visualizations, analyzing and summarizing podcasts or video content, automating complex data and content workflows, and explaining technical architectures through creative formats like comic strips.

ByteDance offers a bifurcated deployment strategy that separates the orchestration harness from the AI inference engine. Users can run the core harness directly on a local machine, deploy it across a private Kubernetes cluster for enterprise scale, or connect it to external messaging platforms like Slack or Telegram without requiring a public IP.

While many opt for cloud-based inference via OpenAI or Anthropic APIs, the framework is natively model-agnostic, supporting fully localized setups through tools like Ollama. This flexibility allows organizations to tailor the system to their specific data sovereignty needs, choosing between the convenience of cloud-hosted “brains” and the total privacy of a restricted on-premise stack.

Importantly, choosing the local route does not mean sacrificing security or functional isolation. Even when running entirely on a single workstation, DeerFlow still utilizes a Docker-based “AIO Sandbox” to provide the agent with its own execution environment.

This sandbox—which contains its own browser, shell, and persistent filesystem—ensures that the agent’s “vibe coding” and file manipulations remain strictly contained. Whether the underlying models are served via the cloud or a local server, the agent’s actions always occur within this isolated container, allowing for safe, long-running tasks that can execute bash commands and manage data without risk to the host system’s core integrity.

Since its release last month, it has accumulated more than 39,000 stars (user saves) and 4,600 forks — a growth trajectory that has developers and researchers alike paying close attention.

Not a chatbot wrapper: what DeerFlow 2.0 actually is

DeerFlow is not another thin wrapper around a large language model. The distinction matters.

While many AI tools give a model access to a search API and call it an agent, DeerFlow 2.0 gives its agents an actual isolated computer environment: a Docker sandbox with a persistent, mountable filesystem.

The system maintains both short- and long-term memory that builds user profiles across sessions. It loads modular “skills” — discrete workflows — on demand to keep context windows manageable. And when a task is too large for one agent, a lead agent decomposes it, spawns parallel sub-agents with isolated contexts, executes code and Bash commands safely, and synthesizes the results into a finished deliverable.

It is similar to the approach being pursued by NanoClaw, an OpenClaw variant, which recently partnered with Docker itself to offer enterprise-grade sandboxes for agents and subagents.

But while NanoClaw is extremely open ended, DeerFlow has more clearly defined its architecture and scoped tasks: Demos on the project’s official site, deerflow.tech, showcase real outputs: agent trend forecast reports, videos generated from literary prompts, comics explaining machine learning concepts, data analysis notebooks, and podcast summaries.

The framework is designed for tasks that take minutes to hours to complete — the kind of work that currently requires a human analyst or a paid subscription to a specialized AI service.

From Deep Research to Super Agent

DeerFlow’s original v1 launched in May 2025 as a focused deep-research framework. Version 2.0 is something categorically different: a ground-up rewrite on LangGraph 1.0 and LangChain that shares no code with its predecessor. ByteDance explicitly framed the release as a transition “from a Deep Research agent into a full-stack Super Agent.”

New in v2: a batteries-included runtime with filesystem access, sandboxed execution, persistent memory, and sub-agent spawning; progressive skill loading; Kubernetes support for distributed execution; and long-horizon task management that can run autonomously across extended timeframes.

The framework is fully model-agnostic, working with any OpenAI-compatible API. It has strong out-of-the-box support for ByteDance’s own Doubao-Seed models, as well as DeepSeek v3.2, Kimi 2.5, Anthropic’s Claude, OpenAI’s GPT variants, and local models run via Ollama. It also integrates with Claude Code for terminal-based tasks, and with messaging platforms including Slack, Telegram, and Feishu.

Why it’s going viral now

The project’s current viral moment is the result of a slow build that accelerated sharply this week.

The February 28 launch generated significant initial buzz, but it was coverage in machine learning media — including deeplearning.ai’s The Batch — over the following two weeks that built credibility in the research community.

Then, on March 21, AI influencer Min Choi posted to his large X following: “China’s ByteDance just dropped DeerFlow 2.0. This AI is a super agent harness with sub-agents, memory, sandboxes, IM channels, and Claude Code integration. 100% open source.” The post earned more than 1,300 likes and triggered a cascade of reposts and commentary across AI Twitter.

A search of X using Grok uncovered the full scope of that response. Influencer Brian Roemmele, after conducting what he described as intensive personal testing, declared that “DeerFlow 2.0 absolutely smokes anything we’ve ever put through its paces” and called it a “paradigm shift,” adding that his company had dropped competing frameworks entirely in favor of running DeerFlow locally. “We use 2.0 LOCAL ONLY. NO CLOUD VERSION,” he wrote.

More pointed commentary came from accounts focused on the business implications. One post from @Thewarlordai, published March 23, framed it bluntly: “MIT licensed AI employees are the death knell for every agent startup trying to sell seat-based subscriptions. The West is arguing over pricing while China just commoditized the entire workforce.”

Another widely shared post described DeerFlow as “an open-source AI staff that researches, codes and ships products while you sleep… now it’s a Python repo and ‘make up’ away.”

Cross-linguistic amplification — with substantive posts in English, Japanese, and Turkish — points to genuine global reach rather than a coordinated promotion campaign, though the latter is not out of the question and may be contributing to the current virality.

The ByteDance question

ByteDance’s involvement is the variable that makes DeerFlow’s reception more complicated than a typical open-source release.

On the technical merits, the open-source, MIT-licensed nature of the project means the code is fully auditable. Developers can inspect what it does, where data flows, and what it sends to external services. That is materially different from using a closed ByteDance consumer product.

But ByteDance operates under Chinese law, and for organizations in regulated industries — finance, healthcare, defense, government — the provenance of software tooling increasingly triggers formal review requirements, regardless of the code’s quality or openness.

The jurisdictional question is not hypothetical: U.S. federal agencies are already operating under guidance that treats Chinese-origin software as a category requiring scrutiny.

For individual developers and small teams running fully local deployments with their own LLM API keys, those concerns are less operationally pressing. For enterprise buyers evaluating DeerFlow as infrastructure, they are not.

A real tool, with limitations

The community enthusiasm is credible, but several caveats apply.

DeerFlow 2.0 is not a consumer product. Setup requires working knowledge of Docker, YAML configuration files, environment variables, and command-line tools. There is no graphical installer. For developers comfortable with that environment, the setup is described as relatively straightforward; for others, it is a meaningful barrier.

Performance when running fully local models — rather than cloud API endpoints — depends heavily on available VRAM and hardware, with context handoff between multiple specialized models a known challenge. For multi-agent tasks running several models in parallel, the resource requirements escalate quickly.

The project’s documentation, while improving, still has gaps for enterprise integration scenarios. There has been no independent public security audit of the sandboxed execution environment, which represents a non-trivial attack surface if exposed to untrusted inputs.

And the ecosystem, while growing fast, is weeks old. The plugin and skill library that would make DeerFlow comparably mature to established orchestration frameworks simply does not exist yet.

What does it mean for enterprises in the AI transformation age?

The deeper significance of DeerFlow 2.0 may be less about the tool itself and more about what it represents in the broader race to define autonomous AI infrastructure.

DeerFlow’s emergence as a fully capable, self-hostable, MIT-licensed agentic orchestrator adds yet another twist to the ongoing race among enterprises — and AI builders and model providers themselves — to turn generative AI models into more than chatbots, but something more like full or at least part-time employees, capable of both communications and reliable actions.

In a sense, it marks the natural next wave after OpenClaw: whereas that open source tool sought to great a dependable, always on autonomous AI agent the user could message, DeerFlow is designed to allow a user to deploy a fleet of them and keep track of them, all within the same system.

The decision to implement it in your enterprise hinges on whether your organization’s workload demands “long-horizon” execution—complex, multi-step tasks spanning minutes to hours that involve deep research, coding, and synthesis. Unlike a standard LLM interface, this “SuperAgent” harness decomposes broad prompts into parallel sub-tasks performed by specialized experts. This architecture is specifically designed for high-context workflows where a single-pass response is insufficient and where “vibe coding” or real-time file manipulation in a secure environment is necessary.

The primary condition for use is the technical readiness of an organization’s hardware and sandbox environment. Because each task runs within an isolated Docker container with its own filesystem, shell, and browser, DeerFlow acts as a “computer-in-a-box” for the agent. This makes it ideal for data-intensive workloads or software engineering tasks where an agent must execute and debug code safely without contaminating the host system. However, this “batteries-included” runtime places a significant burden on the infrastructure layer; decision-makers must ensure they have the GPU clusters and VRAM capacity to support multi-agent fleets running in parallel, as the framework’s resource requirements escalate quickly during complex tasks.

Strategic adoption is often a calculation between the overhead of seat-based SaaS subscriptions and the control of self-hosted open-source deployments. The MIT License positions DeerFlow 2.0 as a highly capable, royalty-free alternative to proprietary agent platforms, potentially functioning as a cost ceiling for the entire category. Enterprises should favor adoption if they prioritize data sovereignty and auditability, as the framework is model-agnostic and supports fully local execution with models like DeepSeek or Kimi. If the goal is to commoditize a digital workforce while maintaining total ownership of the tech stack, the framework provides a compelling, if technically demanding, benchmark.

Ultimately, the decision to deploy must be weighed against the inherent risks of an autonomous execution environment and its jurisdictional provenance. While sandboxing provides isolation, the ability of agents to execute bash commands creates a non-trivial attack surface that requires rigorous security governance and auditability. Furthermore, because the project is a ByteDance-led initiative via Volcengine and BytePlus, organizations in regulated sectors must reconcile its technical performance with emerging software-origin standards. Deployment is most appropriate for teams comfortable with a CLI-first, Docker-heavy setup who are ready to trade the convenience of a consumer product for a sophisticated and extensible SuperAgent harness.

The three disciplines separating AI agent demos from real-world deployment

Getting AI agents to perform reliably in production — not just in demos — is turning out to be harder than enterprises anticipated. Fragmented data, unclear workflows, and runaway escalation rates are slowing deployments across industries.

“The technology itself often works well in demonstrations,” said Sanchit Vir Gogia, chief analyst with Greyhound Research. “The challenge begins when it is asked to operate inside the complexity of a real organization.” 

Burley Kawasaki, who oversees agent deployment at Creatio, and team have developed a methodology built around three disciplines: data virtualization to work around data lake delays; agent dashboards and KPIs as a management layer; and tightly bounded use-case loops to drive toward high autonomy.

In simpler use cases, Kawasaki says these practices have enabled agents to handle up to 80-90% of tasks on their own. With further tuning, he estimates they could support autonomous resolution in at least half of use cases, even in more complex deployments.

“People have been experimenting a lot with proof of concepts, they’ve been putting a lot of tests out there,” Kawasaki told VentureBeat. “But now in 2026, we’re starting to focus on mission-critical workflows that drive either operational efficiencies or additional revenue.”

Why agents keep failing in production

Enterprises are eager to adopt agentic AI in some form or another — often because they’re afraid to be left out, even before they even identify real-world tangible use cases — but run into significant bottlenecks around data architecture, integration, monitoring, security, and workflow design. 

The first obstacle almost always has to do with data, Gogia said. Enterprise information rarely exists in a neat or unified form; it is spread across SaaS platforms, apps, internal databases, and other data stores. Some are structured, some are not. 

But even when enterprises overcome the data retrieval problem, integration is a big challenge. Agents rely on APIs and automation hooks to interact with applications, but many enterprise systems were designed long before this kind of autonomous interaction was a reality, Gogia pointed out. 

This can result in incomplete or inconsistent APIs, and systems can respond unpredictably when accessed programmatically. Organizations also run into snags when they attempt to automate processes that were never formally defined, Gogia said. 

“Many business workflows depend on tacit knowledge,” he said. That is, employees know how to resolve exceptions they’ve seen before without explicit instructions — but, those missing rules and instructions become startlingly obvious when workflows are translated into automation logic.

The tuning loop

Creatio deploys agents in a “bounded scope with clear guardrails,” followed by an “explicit” tuning and validation phase, Kawasaki explained. Teams review initial outcomes, adjust as needed, then re-test until they’ve reached an acceptable level of accuracy. 

That loop typically follows this pattern: 

  • Design-time tuning (before go-live): Performance is improved through prompt engineering, context wrapping, role definitions, workflow design, and grounding in data and documents. 

  • Human-in-the-loop correction (during execution): Devs approve, edit, or resolve exceptions. In instances where humans have to intervene the most (escalation or approval), users establish stronger rules, provide more context, and update workflow steps; or, they’ll narrow tool access. 

  • Ongoing optimization (after go-live): Devs continue to monitor exception rates and outcomes, then tune repeatedly as needed, helping to improve accuracy and autonomy over time. 

Kawasaki’s team applies retrieval-augmented generation to ground agents in enterprise knowledge bases, CRM data, and other proprietary sources. 

Once agents are deployed in the wild, they are monitored with a dashboard providing performance analytics, conversion insights, and auditability. Essentially, agents are treated like digital workers. They have their own management layer with dashboards and KPIs.

For instance, an onboarding agent will be incorporated as a standard dashboard interface providing agent monitoring and telemetry. This is part of the platform layer — orchestration, governance, security, workflow execution, monitoring, and UI embedding —  that sits “above the LLM,” Kawasaki said.

Users see a dashboard of agents in use and each of their processes, workflows, and executed results. They can “drill down” into an individual record (like a referral or renewal) that shows a step-by-step execution log and related communications to support traceability, debugging, and agent tweaking. The most common adjustments involve logic and incentives, business rules, prompt context, and tool access, Kawasaki said. 

The biggest issues that come up post-deployment: 

  • Exception handling volume can be high: Early spikes in edge cases often occur until guardrails and workflows are tuned. 

  • Data quality and completeness: Missing or inconsistent fields and documents can cause escalations; teams can identify which data to prioritize for grounding and which checks to automate.

  • Auditability and trust: Regulated customers, particularly, require clear logs, approvals, role-based access control (RBAC), and audit trails.

“We always explain that you have to allocate time to train agents,” Creatio’s CEO Katherine Kostereva told VentureBeat. “It doesn’t happen immediately when you switch on the agent, it needs time to understand fully, then the number of mistakes will decrease.” 

“Data readiness” doesn’t always require an overhaul

When looking to deploy agents, “Is my data ready?,” is a common early question. Enterprises know data access is important, but can be turned off by a massive data consolidation project. 

But virtual connections can allow agents access to underlying systems and get around typical data lake/lakehouse/warehouse delays. Kawasaki’s team built a platform that integrates with data, and is now working on an approach that will pull data into a virtual object, process it, and use it like a standard object for UIs and workflows. This way, they don’t have to “persist or duplicate” large volumes of data in their database. 

This technique can be helpful in areas like banking, where transaction volumes are simply too large to copy into CRM, but are “still valuable for AI analysis and triggers,” Kawasaki said.

Once integrations and virtual objects are established, teams can evaluate data completeness, consistency, and availability, and identify low-friction starting points (like document-heavy or unstructured workflows). 

Kawasaki emphasized the importance of “really using the data in the underlying systems, which tends to actually be the cleanest or the source of truth anyway.” 

Matching agents to the work

The best fit for autonomous (or near-autonomous) agents are high-volume workflows with “clear structure and controllable risk,” Kawasaki said. For instance, document intake and validation in onboarding or loan preparation, or standardized outreach like renewals and referrals.

“Especially when you can link them to very specific processes inside an industry — that’s where you can really measure and deliver hard ROI,” he said. 

For instance, financial institutions are often siloed by nature. Commercial lending teams perform in their own environment, wealth management in another. But an autonomous agent can look across departments and separate data stores to identify, for instance, commercial customers who might be good candidates for wealth management or advisory services.

“You think it would be an obvious opportunity, but no one is looking across all the silos,” Kawasaki said. Some banks that have applied agents to this very scenario have seen “benefits of millions of dollars of incremental revenue,” he claimed, without naming specific institutions. 

However, in other cases — particularly in regulated industries — longer-context agents are not only preferable, but necessary. For instance, in multi-step tasks like gathering evidence across systems, summarizing, comparing, drafting communications, and producing auditable rationales.

“The agent isn’t giving you a response immediately,” Kawasaki said. “It may take hours, days, to complete full end-to-end tasks.” 

This requires orchestrated agentic execution rather than a “single giant prompt,” he said. This approach breaks work down into deterministic steps to be performed by sub-agents. Memory and context management can be maintained across various steps and time intervals. Grounding with RAG can help keep outputs tied to approved sources, and users have the ability to dictate expansion to file shares and other document repositories.

This model typically doesn’t require custom retraining or a new foundation model. Whatever model enterprises use (GPT, Claude, Gemini), performance improves through prompts, role definitions, controlled tools, workflows, and data grounding, Kawasaki said. 

The feedback loop puts “extra emphasis” on intermediate checkpoints, he said. Humans review intermediate artifacts (such as summaries, extracted facts, or draft recommendations) and correct errors. Those can then be converted into better rules and retrieval sources, narrower tool scopes, and improved templates. 

“What is important for this style of autonomous agent, is you mix the best of both worlds: The dynamic reasoning of AI, with the control and power of true orchestration,” Kawasaki said.

Ultimately, agents require coordinated changes across enterprise architecture, new orchestration frameworks, and explicit access controls, Gogia said. Agents must be assigned identities to restrict their privileges and keep them within bounds. Observability is critical; monitoring tools can record task completion rates, escalation events, system interactions, and error patterns. This kind of evaluation must be a permanent practice, and agents should be tested to see how they react when encountering new scenarios and unusual inputs. 

“The moment an AI system can take action, enterprises have to answer several questions that rarely appear during copilot deployments,” Gogia said. Such as: What systems is the agent allowed to access? What types of actions can it perform without approval? Which activities must always require a human decision? How will every action be recorded and reviewed?

“Those [enterprises] that underestimate the challenge often find themselves stuck in demonstrations that look impressive but cannot survive real operational complexity,” Gogia said.