When product managers ship code: AI just broke the software org chart

Last week, one of our product managers (PMs) built and shipped a feature. Not spec’d it. Not filed a ticket for it. Built it, tested it, and shipped it to production. In a day.

A few days earlier, our designer noticed that the visual appearance of our IDE plugins had drifted from the design system. In the old world, that meant screenshots, a JIRA ticket, a conversation to explain the intent, and a sprint slot. Instead, he opened an agent, adjusted the layout himself, experimented, iterated, and tuned in real time, then pushed the fix. The person with the strongest design intuition fixed the design directly. No translation layer required.

None of this is new in theory. Vibe coding opened the gates of software creation to millions. That was aspiration. When I shared the data on how our engineers doubled throughput, shifted from coding to validation, and brought design upfront for rapid experimentation, it was still an engineering story. What changed is that the theory became practice. Here’s how it actually played out.

The bottleneck moved

When we went AI-first in 2025, implementation cost collapsed. Agents took over scaffolding, tests, and the repetitive glue code that used to eat half the sprint. Cycle times dropped from weeks to days, from days to hours. Engineers started thinking less in files and functions and more in architecture, constraints, and execution plans.

But once engineering capacity stopped being the bottleneck, we noticed something: Decision velocity was. All the coordination mechanisms we’d built to protect engineering time (specs, tickets, handoffs, backlog grooming) were now the slowest part of the system. We were optimizing for a constraint that no longer existed.

What happens when building is cheaper than coordination

We started asking a different question: What would it look like if the people closest to the intent could ship the software directly?

PMs already think in specifications. Designers already define structure, layout, and behavior. They don’t think in syntax. They think in outcomes. When the cost of turning intent into working software dropped far enough, these roles didn’t need to “learn to code.” The cost of implementation simply fell to their level.

I asked one of our PMs, Dmitry, to describe what changed from his perspective. He told me: “While agents are generating tasks in Zenflow, there’s a few minutes of idle time. Just dead air. I wanted to build a small game, something to interact with while you wait.”

If you’ve ever run a product team, you know this kind of idea. It doesn’t move a KPI. It’s impossible to justify in a prioritization meeting. It gets deferred forever. But it adds personality. It makes the product feel like someone cared about the small details. These are exactly the things that get optimized out of every backlog grooming session, and exactly the things users remember.

He built it in a day.

In the past, that idea would have died in a prioritization spreadsheet. Not because it was bad, but because the cost of implementation made it irrational to pursue. When that cost drops to near zero, the calculus changes completely.

Shipping became cheaper than explaining

As more people started building directly, entire layers of process quietly vanished. Fewer tickets. Fewer handoffs. Fewer “can you explain what you mean by…” conversations. Fewer lost-in-translation moments.

For a meaningful class of tasks, it became faster to just build the thing than to describe what you wanted and wait for someone else to build it. Think about that for a second. Every modern software organization is structured around the assumption that implementation is the expensive part. When that assumption breaks, the org has to change with it.

Our designer fixing the plugin UI is a perfect example. The old workflow (screenshot the problem, file a ticket, explain the gap between intent and implementation, wait for a sprint slot, review the result, request adjustments) existed entirely to protect engineering bandwidth. When the person with the design intuition can act on it directly, that whole stack disappears. Not because we eliminated process for its own sake, but because the process was solving a problem that no longer existed.

The compounding effect

Here’s what surprised me most: It compounds.

When PMs build their own ideas, their specifications get sharper, because they now understand what the agent needs to execute well. Sharper specs produce better agent output. Better output means fewer iteration cycles. We’re seeing velocity compound week over week, not just because the models improved, but because the people using them got closer to the work.

Dmitry put it well: The feedback loop between intent and outcome went from weeks to minutes. When you can see the result of your specification immediately, you learn what precision the system needs, and you start providing it instinctively.

There’s a second-order effect that’s harder to measure but impossible to miss: Ownership. People stop waiting. They stop filing tickets for things they could just fix. “Builder” stopped being a job title. It became the default behavior.

What this means for the industry

A lot of the “everyone can code” narrative last year was theoretical, or focused on solo founders and tiny teams. What we experienced is different. We have ~50 engineers working in a complex brownfield codebase: Multiple surfaces and programming languages, enterprise integrations, the full weight of a real production system. 

I don’t think we’re unique. I think we’re early. And with each new generation of models, the gap between who can build and who can’t is closing faster than most organizations realize. Every software company is about to discover that their PMs and designers are sitting on unrealized building capacity, blocked not by skill, but by the cost of implementation. As that cost continues to fall, the organizational implications are profound.

We started with an intent to accelerate software engineering. What we’re becoming is something different: A company where everyone ships.

Andrew Filev is founder and CEO of Zencoder.

When AI turns software development inside-out: 170% throughput at 80% headcount

Many people have tried AI tools and walked away unimpressed. I get it — many demos promise magic, but in practice, the results can feel underwhelming.

That’s why I want to write this not as a futurist prediction, but from lived experience. Over the past six months, I turned my engineering organization AI-first. I’ve shared before about the system behind that transformation — how we built the workflows, the metrics, and the guardrails. Today, I want to zoom out from the mechanics and talk about what I’ve learned from that experience — about where our profession is heading when software development itself turns inside out. 

Before I do, a couple of numbers to illustrate the scale of change. Subjectively, it feels like we are moving twice as fast. Objectively, here’s how the throughput evolved. Our total engineering team headcount drifted from 36 at the beginning of the year down to 30. So you get ~170% throughput on ~80% headcount, which matches the subjective ~2x.
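The per-engineer arithmetic behind those figures is easy to verify:

```python
# Sanity check on the headline numbers: ~170% throughput on ~80% headcount.
throughput_ratio = 1.70                       # total team output vs. start of year
headcount_start, headcount_end = 36, 30
headcount_ratio = headcount_end / headcount_start  # 30/36 ≈ 0.83, i.e. ~80%

per_engineer_uplift = throughput_ratio / headcount_ratio
print(round(per_engineer_uplift, 2))          # 2.04, matching the subjective ~2x
```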

Zooming in, I picked a couple of our senior engineers who started the year in a more traditional software engineering process and ended it in the AI-first way. [The dips correspond to vacations and off-sites.]

Note that our PRs are tied to JIRA tickets, and the average scope of those tickets didn’t change much through the year, so it’s as good a proxy as the data can give us. 

Qualitatively, looking at the business value, I actually see even higher uplift. One reason is that, as we started last year, our quality assurance (QA) team couldn’t keep up with our engineers’ velocity. As the company leader, I wasn’t happy with the quality of some of our early releases. As we progressed through the year, and tooled our AI workflows to include writing unit and end-to-end tests, our coverage improved, the number of bugs dropped, users became fans, and the business value of engineering work multiplied.

From big design to rapid experimentation

Before AI, we spent weeks perfecting user flows before writing code. It made sense when change was expensive. Agile helped, but even then, testing multiple product ideas was too costly.

Once we went AI-first, that trade-off disappeared. The cost of experimentation collapsed. An idea could go from whiteboard to a working prototype in a day: From idea to AI-generated product requirements document (PRD), to AI-generated tech spec, to AI-assisted implementation. 

It manifested itself in some amazing transformations. Our website—central to our acquisition and inbound demand—is now a product-scale system with hundreds of custom components, all designed, developed, and maintained directly in code by our creative director.

Now, instead of validating with slides or static prototypes, we validate with working products. We test ideas live, learn faster, and release major updates every other month, a pace I couldn’t imagine three years ago.

For example, Zen CLI was first written in Kotlin, but then we changed our mind and moved it to TypeScript with no release velocity lost.

Instead of mocking up features, our UX designers and product managers vibe code them. And when the release-time crunch hit everyone, they jumped into action and fixed dozens of small details with production-ready PRs to help us ship a great product. This included an overnight UI layout change.

From coding to validation

The next shift came where I least expected it: Validation.

In a traditional org, most people write code and a smaller group tests it. But when AI generates much of the implementation, the leverage point moves. The real value lies in defining what “good” looks like — in making correctness explicit.

We support 70-plus programming languages and countless integrations. Our QA engineers have evolved into system architects. They build AI agents that generate and maintain acceptance tests directly from requirements. And those agents are embedded into the codified AI workflows that allow us to achieve predictable engineering outcomes by using a system.

This is what “shift left” really means. Validation isn’t a stand-alone function; it’s an integral part of the production process. If the agent can’t validate its work, it can’t be trusted to generate production code. For QA professionals, this is a moment of reinvention where, with the right upskilling, their work becomes a critical enabler and accelerator of AI adoption.

Product managers, tech leads, and data engineers now share this responsibility as well, because defining correctness has become a cross-functional skill, not a role confined to QA.

From diamond to double funnel

For decades, software development followed a “diamond” shape: A small product team handed off to a large engineering team, then narrowed again through QA.

Today, that geometry is flipping. Humans engage more deeply at the beginning — defining intent, exploring options — and again at the end, validating outcomes. The middle, where AI executes, is faster and narrower.

It’s not just a new workflow; it’s a structural inversion.

The model looks less like an assembly line and more like a control tower. Humans set direction and constraints, AI handles execution at speed, and people step back in to validate outcomes before decisions land in production.

Engineering at a higher level of abstraction

Every major leap in software raised our level of abstraction — from punch cards to high-level programming languages, from hardware to cloud. AI is the next step. Our engineers now work at a meta-layer: Orchestrating AI workflows, tuning agentic instructions and skills, and defining guardrails. The machines build; the humans decide what and why.

Teams now routinely decide when AI output is safe to merge without review, how tightly to bound agent autonomy in production systems, and what signals actually indicate correctness at scale, decisions that simply didn’t exist before.
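A judgment call like “when is agent output safe to merge without review?” has to be made explicit somewhere. Below is a minimal sketch of one possible merge-gate policy; the field names and thresholds are entirely hypothetical, not anything from the article.

```python
# Hypothetical merge-gate policy for agent-generated changes.
# None of these fields or thresholds come from the article; they illustrate
# the kind of decision teams now have to encode explicitly.
from dataclasses import dataclass

@dataclass
class AgentChange:
    tests_passed: bool          # did the agent's own validation suite pass?
    coverage_delta: float       # change in test coverage, percentage points
    touches_prod_config: bool   # does the diff alter production configuration?
    lines_changed: int

def safe_to_auto_merge(change: AgentChange) -> bool:
    """Conservative defaults: only small, well-tested, non-config changes skip review."""
    return (
        change.tests_passed
        and change.coverage_delta >= 0
        and not change.touches_prod_config
        and change.lines_changed <= 200
    )

print(safe_to_auto_merge(AgentChange(True, 1.5, False, 80)))   # True
print(safe_to_auto_merge(AgentChange(True, 0.0, True, 10)))    # False
```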

And that’s the paradox of AI-first engineering — it feels less like coding, and more like thinking. Welcome to the new era of human intelligence, powered by AI.

Andrew Filev is founder and CEO of Zencoder.

Mistral AI just released a text-to-speech model it says beats ElevenLabs — and it’s giving away the weights for free

The enterprise voice AI market is in the middle of a land grab. ElevenLabs and IBM announced a collaboration just this week to bring premium voice capabilities into IBM’s watsonx Orchestrate platform. Google Cloud has been expanding its Chirp 3 HD voices. OpenAI continues to iterate on its own speech synthesis. And the market underpinning all of this activity is enormous — voice AI crossed $22 billion globally in 2026, with the voice AI agents segment alone projected to reach $47.5 billion by 2034, according to industry estimates.

On Thursday morning, Mistral AI entered that fight with a fundamentally different proposition. The Paris-based AI startup released Voxtral TTS, what it calls the first frontier-quality, open-weight text-to-speech model designed specifically for enterprise use. Where every major competitor in the space operates a proprietary, API-first business — enterprises rent the voice, they don’t own it — Mistral is releasing the full model weights, inviting companies to download Voxtral TTS, run it on their own servers or even on a smartphone, and never send a single audio frame to a third party.

It is a bet that the future of enterprise voice AI will not be shaped by whoever builds the best-sounding model, but by whoever gives companies the most control over it. And it arrives at a moment when Mistral, valued at $13.8 billion after a $2 billion Series C round led by Dutch chipmaker ASML last September, has been aggressively assembling the building blocks of a complete, enterprise-owned AI stack — from its Forge customization platform announced at Nvidia GTC earlier this month, to its AI Studio production infrastructure, to the Voxtral Transcribe speech-to-text model released just weeks ago.

Voxtral TTS is the output layer that completes that picture, giving enterprises a speech-to-speech pipeline they can run end-to-end without relying on any external provider.

“We see audio as a big bet and as a critical and maybe the only future interface with all the AI models,” Pierre Stock, Mistral’s vice president of science and the first employee hired at the company, said in an exclusive interview with VentureBeat. “This is something customers have been asking for.”

A 3-billion-parameter model that fits on a laptop and runs six times faster than real-time speech

The technical specifications of Voxtral TTS read like a deliberate inversion of industry norms. Where most frontier TTS models are large and resource-intensive, Mistral built its model to be roughly three times smaller than what it calls the industry standard for comparable quality.

The architecture comprises three components: a 3.4-billion-parameter transformer decoder backbone, a 390-million-parameter flow-matching acoustic transformer, and a 300-million-parameter neural audio codec that Mistral developed in-house. The system is built on top of Ministral 3B, the same pretrained backbone that powers the company’s Voxtral Transcribe model — a design choice that Stock described as emblematic of Mistral’s culture of efficiency and artifact reuse.

In practice, the model achieves a time-to-first-audio of 90 milliseconds for a typical input and generates speech at approximately six times real-time speed. When quantized for inference, it requires roughly three gigabytes of RAM. Stock confirmed it can run on any laptop or smartphone, and even on older hardware it still operates in real time.

“It’s a 3B model, so it can basically run on any laptop or any smartphone,” Stock told VentureBeat. “If you quantize it to infer, it’s actually three gigabytes of RAM. And you can run it on super old chips — it’s still going to be real time.”
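Those footprint claims can be sanity-checked from the published parameter counts alone. Here is a rough estimate assuming common quantization bit-widths; the bit-widths are assumptions, since Mistral has not disclosed its exact scheme.

```python
# Back-of-envelope RAM estimate from the reported parameter counts.
# The bit-widths below are assumptions; Mistral has not disclosed its
# exact quantization scheme.
params_billions = {
    "transformer_decoder": 3.4,
    "flow_matching_acoustic": 0.39,
    "neural_audio_codec": 0.30,
}

def weights_gb(params_b: float, bits_per_param: float) -> float:
    """Memory for the weights alone, ignoring activations and caches."""
    return params_b * bits_per_param / 8   # 1e9 params * (bits/8) bytes = GB

total_b = sum(params_billions.values())    # ≈ 4.09B parameters
print(round(weights_gb(total_b, 16), 1))   # fp16:  ~8.2 GB
print(round(weights_gb(total_b, 8), 1))    # int8:  ~4.1 GB
print(round(weights_gb(total_b, 4), 1))    # 4-bit: ~2.0 GB
```

The article’s ~3 GB figure falls between the int8 and 4-bit estimates, which would be consistent with a mixed-precision scheme; treat these numbers as rough bounds rather than the deployed footprint.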

The model supports nine languages — English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic — and can adapt to a custom voice with as little as five seconds of reference audio. Perhaps more remarkably, it demonstrates zero-shot cross-lingual voice adaptation without explicit training for that task.

Stock illustrated this with a personal example: he can feed the model 10 seconds of his own French-accented voice, type a prompt in German, and the model will generate German speech that sounds like him — complete with his natural accent and vocal characteristics. For enterprises operating across borders, this capability unlocks cascaded speech-to-speech translation that preserves speaker identity, a feature that has obvious applications in customer support, sales, and internal communications for multinational organizations.

Human evaluators preferred Voxtral over ElevenLabs nearly 70 percent of the time on voice customization

Mistral is not being coy about which competitor it intends to displace. In human evaluations conducted by the company, Voxtral TTS achieved a 62.8 percent listener preference rate against ElevenLabs Flash v2.5 on flagship voices and a 69.9 percent preference rate in voice customization tasks. Mistral also claims the model performs at parity with ElevenLabs v3 — the company’s premium, higher-latency tier — on emotional expressiveness, while maintaining similar latency to the much faster Flash model.

The evaluation methodology involved a comparative side-by-side test across all nine supported languages. Using two recognizable voices in their native dialects for each language, three annotators performed preference tests on naturalness, accent adherence, and acoustic similarity to the original reference. Mistral says Voxtral TTS widened the quality gap to ElevenLabs v2.5 Flash especially in zero-shot multilingual custom voice settings, highlighting what the company calls the “instant customizability” of the model.
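One caveat worth quantifying: with only three annotators, the number of comparisons matters. A 95% Wilson score interval shows how sample size (hypothetical here, since it isn’t reported) affects whether 62.8% is meaningfully above a coin flip:

```python
# How much does sample size matter for a 62.8% preference rate?
# Mistral does not report the number of comparisons, so n here is hypothetical.
import math

def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

for n in (50, 200, 1000):
    lo, hi = wilson_interval(0.628, n)
    print(f"n={n}: ({lo:.3f}, {hi:.3f})")
# At n=50 the lower bound dips below 0.5, so 62.8% would not be a
# statistically clear win; at n=200 and above, it clears the coin-flip line.
```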

ElevenLabs remains widely regarded as the benchmark for raw voice quality. Its Eleven v3 model has been described by multiple independent reviewers as the gold standard for emotionally nuanced AI speech. But ElevenLabs operates as a closed platform with tiered subscription pricing that scales from around $5 per month at the starter level to over $1,300 per month for business plans. It does not release model weights.

Mistral’s pitch is that enterprises shouldn’t have to choose between quality and control — and that at scale, the economics of an open-weight model are dramatically more favorable.

“What we want to underline is that we’re faster and cheaper as well — and open source,” Stock told VentureBeat. “When something is open source and cheap, people adopt it and people build on it.”

He framed the cost argument in terms that resonate with CTOs managing AI budgets: “AI is a transformative technology, but it has a cost. When you want to scale and have impact on a large business, that cost matters. And what we allow is to scale seamlessly while minimizing the cost and maximizing the accuracy.”

Why Mistral thinks enterprises will want to own their voice AI rather than rent it

To understand why Mistral is entering text-to-speech now, you have to understand the broader strategic architecture the company has been building for the past year. While OpenAI and Anthropic have captured the imagination of consumers, Mistral has quietly assembled what may be the most comprehensive enterprise AI platform in Europe — and increasingly, globally.

CEO Arthur Mensch has said the company is on track to surpass $1 billion in annual recurring revenue this year, according to TechCrunch’s reporting on the Forge launch. The Financial Times has reported that Mistral’s annualized revenue run rate surged from $20 million to over $400 million within a single year. That growth has been powered by more than 100 major enterprise customers and a consistent thesis: companies should own their AI infrastructure, not rent it.

Voxtral TTS is the latest expression of that thesis, applied to what may be the most sensitive category of enterprise data there is. Voice recordings capture not just words but emotion, identity, and intent. They carry legal, regulatory, and reputational weight that text data often does not. For industries like financial services, healthcare, and government — all key Mistral verticals — sending voice data to a third-party API introduces risks that many compliance teams are unwilling to accept.

Stock made the data sovereignty argument forcefully. “Since the models are open weights, we have no trouble and no problem actually giving the weights to the enterprise and helping them customize the models,” he said. “We don’t see the weights anymore. We don’t see the data. We see nothing. And you are fully controlled.”

That message has particular resonance in Europe, where concern about technological dependence on American cloud providers has intensified throughout 2026. The EU currently sources more than 80 percent of its digital services from foreign providers, most of them American. Mistral has positioned itself as the answer to that anxiety — the only European frontier AI developer with the scale and technical capability to offer a credible alternative.

Voice agents are the enterprise use case that makes Mistral’s full AI stack click into place

Voxtral TTS is the final piece in a pipeline Mistral has been methodically assembling. Voxtral Transcribe handles speech-to-text. Mistral’s language models — from Mistral Small to Mistral Large — provide the reasoning layer. Forge allows enterprises to customize any of these models on their own data. AI Studio provides the production infrastructure for observability, governance, and deployment. And Mistral Compute offers the underlying GPU resources.

Together, these pieces form what Stock described as a “full AI stack, fully controllable and customizable” for the enterprise. Voice agents — AI systems that can listen to a customer, understand what they need, reason about the answer, and respond in natural-sounding speech — are the use case that ties all of these layers together.

The applications Mistral envisions span customer support, where voice agents can route and resolve queries with brand-appropriate speech; sales and marketing, where a single voice can work across markets through cross-lingual emulation; real-time translation for cross-border operations; and even interactive storytelling and game design, where emotion-steering can control tone and personality.

Stock was most animated when discussing how Voxtral TTS fits into the broader agentic AI trend that has dominated enterprise technology discussions in 2026. “We are totally building for a world in which audio is a natural interface, in particular for agents to which you can delegate work — extensions of yourself,” he said. He described a scenario in which a user starts planning a vacation on a computer, commutes to work, and then picks up the workflow on a phone simply by asking for an update by voice.

“To make that happen, you need a model you can trust, you need a model that’s super efficient and super cheap to run — otherwise you won’t use it for long — and you need a model that sounds super conversational and that you can interrupt at any time,” Stock said.

That emphasis on interruptibility and real-time responsiveness reflects a broader insight about voice interfaces that distinguishes them from text. A chatbot can take two or three seconds to respond without breaking the user experience. A voice agent cannot. The 90-millisecond time-to-first-audio that Voxtral TTS achieves is not just a benchmark number — it is the threshold between a voice interaction that feels natural and one that feels robotic.

Mistral’s open-weight approach aligns with a broader industry shift that even Nvidia is backing

Mistral’s decision to release Voxtral TTS with open weights is consistent with a movement that has been gathering momentum across the AI industry. At Nvidia GTC earlier this month, Nvidia CEO Jensen Huang declared that “proprietary versus open is not a thing — it’s proprietary and open.” Nvidia announced the Nemotron Coalition, a first-of-its-kind collaboration of model builders working to advance open frontier-level foundation models, with Mistral as a founding member. The first project from that coalition will be a base model codeveloped by Mistral AI and Nvidia.

For Mistral, open weights serve a dual commercial purpose. They drive adoption — developers and enterprises can experiment without friction or commitment — while the company monetizes through its platform services, customization offerings, and managed infrastructure. The model is available to test in Mistral Studio and through the company’s API, but the strategic play is to become embedded in enterprise voice pipelines as an owned asset, not a metered service.

This mirrors the playbook that worked for Mistral’s language models. As Mensch told CNBC in February, “AI is making us able to develop software at the speed of light,” predicting that “more than half of what’s currently being bought by IT in terms of SaaS is going to shift to AI.” He described a “replatforming” taking place across enterprise technology, with businesses looking to replace legacy software systems with AI-native alternatives. An open-weight voice model that enterprises can customize and deploy on their own terms fits naturally into that narrative.

Mistral signals that end-to-end audio AI is where the company is headed next

When asked what comes after Voxtral TTS, Stock outlined two directions. The first is expanding language and dialect support, with particular attention to cultural nuance. “It’s not the same to speak French in Paris than to speak French in Canada, in Montreal,” he said. “We want to respect both cultures, and we want our models to perform in both contexts with all the cultural specifics.”

The second direction is more ambitious: a fully end-to-end audio model that doesn’t just generate speech from text but understands the complete spectrum of human vocal communication.

“We convey some meaning with the words we speak,” Stock said. “We actually convey way more with the intonation, the rhythm, and how we say it. When people talk about end-to-end audio, that’s what they mean — the model is able to pick up that you’re in a hurry, for instance, and will go for the fastest answer. The model will know that you’re joyful today and crack a joke. It’s super adaptive to you, and that’s where we want to go.”

That vision — an AI that speaks naturally, listens with nuance, responds with emotional intelligence, and runs on a model small enough to fit in your pocket — is the frontier every major AI lab is racing toward. For now, Voxtral TTS gives Mistral a foundation to build on and enterprises a question they haven’t had to answer before: if you could own your voice AI stack outright, at lower cost and with competitive quality, why would you keep renting someone else’s?

The consequential AI work that actually moves the needle for enterprises

Presented by OutSystems


After two years of flashy AI demos, rushed agent prototypes, and breathless predictions, enterprise technology leaders are striking a more pragmatic tone in 2026. In a recent webinar hosted by OutSystems, a panel of software executives and enterprise practitioners made the case that the most consequential AI work happening now is focused on the practical matters of governance, orchestration, and iteration, along with integrating agents into the systems they’ve spent decades building.

Enterprise leaders are increasingly focused on fundamentals. The priority is using new AI technologies to accelerate productivity, improve delivery, and produce measurable business results.

Three elements shape this work:

  • The move from AI agent prototypes to agentic systems that deliver measurable ROI in production

  • The growing role of enterprise platforms in governing, orchestrating, and scaling AI agents safely

  • The rise of the generalist developer and enterprise architect as the most valuable technical profiles in an era of AI-generated code

Against this backdrop, the panel discussed governance frameworks, the economics of enterprise AI, and the limits of large language models without orchestration. The conversation ultimately turned to how leading organizations are building multi-agent systems grounded in existing enterprise data and workflows.

Agents in the real world

Enabling agents to work in production across the enterprise is best accomplished with a unified platform that handles development, iteration, and deployment. And that’s where capabilities like the Agent Workbench in the OutSystems platform matter, said Rajkiran Vajreshwari, senior manager of app development at Thermo Fisher Scientific. It provides the infrastructure to learn, iterate, and govern agents at scale.

His team at Thermo Fisher has moved away from single-task AI assistants in customer service to building a coordinated team of specialized agents using the workbench. When a support case arrives, a triage assistant classifies the request and dynamically routes it to the right specialist agent, whether that’s an intent and priority agent, a product context agent, a troubleshooting agent, or a compliance agent.

“We don’t have to think about what will work and how. It’s all pre-built,” he explained. “Each agent has a narrow role and clear guardrails. They stay accurate and auditable.”
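The triage-and-route pattern Vajreshwari describes can be sketched in a few lines. Everything below is hypothetical and illustrative; it is not the OutSystems Agent Workbench API, and a production triage step would use a classifier model rather than keyword matching.

```python
# Hypothetical sketch of the triage-and-route pattern described above.
# All names are illustrative; this is not the OutSystems Agent Workbench API.
from typing import Callable

def intent_agent(case: str) -> str:
    return f"intent/priority assessment for: {case}"

def troubleshooting_agent(case: str) -> str:
    return f"troubleshooting steps for: {case}"

def compliance_agent(case: str) -> str:
    return f"compliance review for: {case}"

SPECIALISTS: dict[str, Callable[[str], str]] = {
    "intent": intent_agent,
    "troubleshoot": troubleshooting_agent,
    "compliance": compliance_agent,
}

def triage(case: str) -> str:
    """Classify the request and pick a route (stand-in for an LLM call)."""
    lowered = case.lower()
    if "crash" in lowered or "error" in lowered:
        return "troubleshoot"
    if "audit" in lowered or "gdpr" in lowered:
        return "compliance"
    return "intent"

def handle(case: str) -> str:
    route = triage(case)              # each agent keeps a narrow role
    return SPECIALISTS[route](case)   # guardrail: only known routes exist

print(handle("App crash on export"))  # routed to the troubleshooting agent
```

The guardrail here is structural: a specialist can only be reached through the triage step, and unknown routes simply don’t exist in the dispatch table.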

Governing the risks of shadow AI

A new category of risk emerges when AI makes it possible for anyone in a company to generate production-level code without IT oversight. Basically, this is ungoverned shadow AI. These homegrown products are prone to hallucinations, data leakage, policy violations, model drift, and agents taking actions that were never formally approved.

To get ahead of the risk, leading organizations need to do three things, said Luis Blando, CPTO of OutSystems.

“Give users guardrails. They’re going to use AI whether you like it or not. Companies that seem to be getting ahead are using AI to govern AI across their full portfolio,” he explained. “That is the difference between shadow AI chaos and enterprise-grade scale.”

Eric Kavanagh, CEO of The Bloor Group, noted that governance requires a layered set of disciplines that includes securing data, monitoring models for drift, and making deliberate choices about where AI connects to existing business processes.

“Companies don’t have to be manually creating these controls,” he added. “A lot of those guardrails and levers are baked into platforms like OutSystems.”

Why orchestration, not model selection, is the real challenge

Much of the early excitement around enterprise AI focused on selecting the right large language model. Now the harder challenge, and far more durable source of value, is orchestration. This includes routing tasks, coordinating workflows, governing execution, and integrating AI into existing enterprise systems.

Scott Finkle, VP of development at McConkey Auction Group, noted that LLMs, however impressive, are pieces of complex workflows, not final solutions. Organizations should be ready to hot-swap between Gemini, ChatGPT, Claude, and whatever emerges next without having to rebuild the agentic system around it.

A platform with orchestration capabilities makes that possible. It manages the lifecycle, provides visibility, and ensures processes execute reliably, even as AI handles the reasoning layer on top.

“The AI and the models change, the workflows can change, but the orchestration remains the same,” Finkle said. “That’s how we’re going to extract value out of AI.”

The economics of enterprise AI investing

Security, compliance, governance, and platform-level AI capabilities will all command greater investment in 2026, particularly as AI moves into core workflows like finance and supply chain. Enterprises should favor incremental wins rather than expect big, immediate gains.

“We’re focusing on base hits,” Finkle said. “The way it counts is by getting something into production and having it make an impact. Big investments in pilot projects that don’t make it into production don’t save any money. It’s not going to happen overnight, but over time I think we’ll see tremendous savings.”

There’s still a split in how enterprises are approaching AI transformation. Some start from scratch and reimagine every process. Others, especially those with billions of dollars in existing infrastructure depreciating in-house, want AI to integrate with their systems. They want agentic systems to reuse data, APIs, and proven processes while speeding up delivery. The agent platform approach serves both camps, but particularly the latter. Organizations can deploy agents where they add clear value while preserving the integrity of established, deterministic workflows.

The rise of the enterprise architect and the generalist developer

As AI accelerates code generation, bottlenecks in software delivery are dissolving. In their place is a premium on systems thinking: the ability to understand the broader enterprise architecture, decompose complex business problems, and reason about how AI integrates with existing infrastructure. Kavanagh pointed to enterprise architects specifically as the professionals best positioned to capitalize on this moment.

“We’re entering a very interesting age of the generalist,” he explained. “The better you know your enterprise architecture and your business architecture and how those things align, the better off you’re going to be.”

“The result is faster delivery with fewer interruptions and fewer bugs,” Kavanagh said. “You can focus on the non-repetitive tasks. It’s a benefit to the developer, to the business, and to the whole IT organization.”

Catch the entire webinar here.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

How xMemory cuts token costs and context bloat in AI agents

Standard RAG pipelines break when enterprises try to use them for long-term, multi-session LLM agent deployments. This is a critical limitation as demand for persistent AI assistants grows.

xMemory, a new technique developed by researchers at King’s College London and The Alan Turing Institute, solves this by organizing conversations into a searchable hierarchy of semantic themes.

Experiments show that xMemory improves answer quality and long-range reasoning across various LLMs while cutting inference costs. According to the researchers, it drops token usage from over 9,000 to roughly 4,700 tokens per query compared to existing systems on some tasks.

For real-world enterprise applications like personalized AI assistants and multi-session decision support tools, this means organizations can deploy more reliable, context-aware agents capable of maintaining coherent long-term memory without blowing up computational expenses.

RAG wasn’t built for this

In many enterprise LLM applications, a critical expectation is that these systems will maintain coherence and personalization across long, multi-session interactions. To support this long-term reasoning, one common approach is to use standard RAG: store past dialogues and events, retrieve a fixed number of top matches based on embedding similarity, and concatenate them into a context window to generate answers.
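The standard RAG loop described above can be sketched in a few lines. The toy 3-d embeddings and snippets are invented for illustration; a real system would use an embedding model and a vector store:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Stored past dialogue turns with toy embeddings.
memory = [
    ("I love oranges", [0.9, 0.1, 0.0]),
    ("I like mandarins", [0.85, 0.15, 0.0]),
    ("Oranges are citrus fruits", [0.5, 0.5, 0.0]),
    ("The meeting is on Friday", [0.0, 0.0, 1.0]),
]

def retrieve(query_vec, k=2):
    # Fixed top-k by embedding similarity, concatenated into context.
    ranked = sorted(memory, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

context = "\n".join(retrieve([0.9, 0.1, 0.0], k=2))
print(context)
```

Note that the query about citrus pulls both preference snippets, the densest cluster in embedding space, and misses the category fact entirely; this is exactly the collapse the researchers describe.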

However, traditional RAG is built for large databases where the retrieved documents are highly diverse. The main challenge is filtering out entirely irrelevant information. An AI agent’s memory, by contrast, is a bounded and continuous stream of conversation, meaning the stored data chunks are highly correlated and frequently contain near-duplicates.

To understand why simply increasing the context window doesn’t work, consider how standard RAG handles a concept like citrus fruit.

Imagine a user has had many conversations saying things like “I love oranges,” “I like mandarins,” and separately, other conversations about what counts as a citrus fruit. Traditional RAG may treat all of these as semantically close and keep retrieving similar “citrus-like” snippets. 

“If retrieval collapses onto whichever cluster is densest in embedding space, the agent may get many highly similar passages about preference, while missing the category facts needed to answer the actual query,” Lin Gui, co-author of the paper, told VentureBeat. 

A common fix for engineering teams is to apply post-retrieval pruning or compression to filter out the noise. These methods assume that the retrieved passages are highly diverse and that irrelevant noise patterns can be cleanly separated from useful facts.

This approach falls short in conversational agent memory because human dialogue is “temporally entangled,” the researchers write. Conversational memory relies heavily on co-references, ellipsis, and strict timeline dependencies. Because of this interconnectedness, traditional pruning tools often accidentally delete important bits of a conversation, leaving the AI without vital context needed to reason accurately.

From decoupling to aggregation: rethinking agent memory

To overcome these limitations, the researchers propose a shift in how agent memory is built and searched, which they describe as “decoupling to aggregation.”

Instead of matching user queries directly against raw, overlapping chat logs, the system organizes the conversation into a hierarchical structure. First it decouples the conversation stream into distinct, standalone semantic components. These individual facts are then aggregated into a higher-level structural hierarchy of themes.

When the AI needs to recall information, it searches top-down through the hierarchy, going from themes to semantics and finally to raw snippets. This approach avoids redundancy. If two dialogue snippets have similar embeddings, the system is unlikely to retrieve them together if they have been assigned to different semantic components.

For this architecture to succeed, it must balance two vital structural properties. The semantic components must be sufficiently differentiated to prevent the AI from retrieving redundant data. At the same time, the higher-level aggregations must remain semantically faithful to the original context to ensure the model can craft accurate answers.

A four-level hierarchy that shrinks the context window

The researchers developed xMemory, a framework that combines structured memory management with an adaptive, top-down search strategy.

xMemory continuously organizes the raw stream of conversation into a structured, four-level hierarchy. At the base are the raw messages, which are first summarized into contiguous blocks called “episodes.” From these episodes, the system distills reusable facts as semantics that disentangle the core, long-term knowledge from repetitive chat logs. Finally, related semantics are grouped together into high-level themes to make them easily searchable.
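A minimal sketch of that four-level structure, with names of our own choosing rather than the paper's actual API:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Episode:
    """A contiguous block of raw messages, summarized."""
    summary: str
    messages: List[str]

@dataclass
class Semantic:
    """A reusable long-term fact distilled from one or more episodes."""
    fact: str
    episodes: List[Episode] = field(default_factory=list)

@dataclass
class Theme:
    """A searchable group of related semantic facts."""
    label: str
    semantics: List[Semantic] = field(default_factory=list)

ep = Episode("User discussed fruit preferences",
             ["I love oranges", "I like mandarins"])
sem = Semantic("User prefers citrus fruit", [ep])
theme = Theme("food preferences", [sem])

# Top-down lookup: theme -> semantic fact -> episode -> raw message.
print(theme.label, "->", theme.semantics[0].fact)
```

Retrieval walks this tree from the top, so two near-duplicate raw messages filed under the same semantic fact are not retrieved twice.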

xMemory uses a special objective function to constantly optimize how it groups these items. This prevents categories from becoming too bloated, which slows down search, or too fragmented, which weakens the model’s ability to aggregate evidence and answer questions.

When it receives a prompt, xMemory performs a top-down retrieval across this hierarchy. It starts at the theme and semantic levels, selecting a diverse, compact set of relevant facts. This is crucial for real-world applications where user queries often require gathering descriptions across multiple topics or chaining connected facts together for complex, multi-hop reasoning.

Once it has this high-level skeleton of facts, the system controls redundancy through what the researchers call “Uncertainty Gating.” It only drills down to pull the finer, raw evidence at the episode or message level if that specific detail measurably decreases the model’s uncertainty.

“Semantic similarity is a candidate-generation signal; uncertainty is a decision signal,” Gui said. “Similarity tells you what is nearby. Uncertainty tells you what is actually worth paying for in the prompt budget.” It stops expanding when it detects that adding more detail no longer helps answer the question.
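The gating idea can be sketched as follows. The uncertainty function here is a toy proxy (the paper's actual signal would come from the reader model, e.g. answer entropy), so treat this as shape, not implementation:

```python
def uncertainty(context):
    # Toy proxy: uncertainty falls as distinct facts accumulate,
    # with diminishing returns. A real system would measure the
    # reader model's uncertainty directly.
    return 1.0 / (1 + len(set(context)))

def gated_expand(skeleton, fine_evidence, min_gain=0.04):
    # Start from the high-level skeleton of facts; drill down into
    # raw evidence only while each addition measurably reduces
    # uncertainty (i.e. is worth the prompt budget).
    context = list(skeleton)
    for item in fine_evidence:
        before = uncertainty(context)
        after = uncertainty(context + [item])
        if before - after >= min_gain:
            context.append(item)
        else:
            break  # stop expanding: no measurable gain
    return context

ctx = gated_expand(
    ["theme: citrus facts", "fact: oranges are citrus"],
    ["msg: 'I love oranges'", "msg: 'I like mandarins'", "msg: 'hello again'"],
)
print(len(ctx))
```

The third raw message never makes it into the prompt: it is nearby in similarity terms, but adding it no longer reduces uncertainty enough to pay for.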

What are the alternatives?

Existing agent memory systems generally fall into two structural categories: flat designs and structured designs. Both suffer from fundamental limitations.

Flat approaches such as MemGPT log raw dialogue or minimally processed traces. This captures the conversation but accumulates massive redundancy and increases retrieval costs as the history grows longer.

Structured systems such as A-MEM and MemoryOS try to solve this by organizing memories into hierarchies or graphs. However, they still rely on raw or minimally processed text as their primary retrieval unit, often pulling in extensive, bloated contexts. These systems also depend heavily on LLM-generated memory records that have strict schema constraints. If the AI deviates slightly in its formatting, it can cause memory failure.

xMemory addresses these limitations through its optimized memory construction scheme, hierarchical retrieval, and dynamic restructuring of its memory as it grows larger.

When to use xMemory

For enterprise architects, knowing when to adopt this architecture over standard RAG is critical. According to Gui, “xMemory is most compelling where the system needs to stay coherent across weeks or months of interaction.”

Customer support agents, for instance, benefit greatly from this approach because they must remember stable user preferences, past incidents, and account-specific context without repeatedly pulling up near-duplicate support tickets. Personalized coaching is another ideal use case, requiring the AI to separate enduring user traits from episodic, day-to-day details.

Conversely, if an enterprise is building an AI to chat with a repository of files, such as policy manuals or technical documentation, “a simpler RAG stack is still the better engineering choice,” Gui said. In those static, document-centric scenarios, the corpus is diverse enough that standard nearest-neighbor retrieval works perfectly well without the operational overhead of hierarchical memory.

The write tax is worth it

xMemory cuts the latency bottleneck associated with the LLM’s final answer generation. In standard RAG systems, the LLM is forced to read and process a bloated context window full of redundant dialogue. Because xMemory’s precise, top-down retrieval builds a much smaller, highly targeted context window, the reader LLM spends far less compute time analyzing the prompt and generating the final output.

In their experiments on long-context tasks, both open and closed models equipped with xMemory outperformed other baselines, using considerably fewer tokens while increasing task accuracy.

However, this efficient retrieval comes with an upfront cost. For an enterprise deployment, the catch with xMemory is that it trades a massive read tax for an upfront write tax. While it ultimately makes answering user queries faster and cheaper, maintaining its sophisticated architecture requires substantial background processing.

Unlike standard RAG pipelines, which cheaply dump raw text embeddings into a database, xMemory must execute multiple auxiliary LLM calls to detect conversation boundaries, summarize episodes, extract long-term semantic facts, and synthesize overarching themes.

Furthermore, xMemory’s restructuring process adds additional computational requirements as the AI must curate, link, and update its own internal filing system. To manage this operational complexity in production, teams can execute this heavy restructuring asynchronously or in micro-batches rather than synchronously blocking the user’s query.

For developers eager to prototype, the xMemory code is publicly available on GitHub under an MIT license, making it viable for commercial uses. If you are trying to implement this in existing orchestration tools like LangChain, Gui advises focusing on the core innovation first: “The most important thing to build first is not a fancier retriever prompt. It is the memory decomposition layer. If you get only one thing right first, make it the indexing and decomposition logic.”

Retrieval isn’t the last bottleneck

While xMemory offers a powerful solution to today’s context-window limitations, it clears the path for the next generation of challenges in agentic workflows. As AI agents collaborate over longer horizons, simply finding the right information won’t be enough.

“Retrieval is a bottleneck, but once retrieval improves, these systems quickly run into lifecycle management and memory governance as the next bottlenecks,” Gui said. Navigating how data should decay, handling user privacy, and maintaining shared memory across multiple agents is exactly “where I expect a lot of the next wave of work to happen,” he said.

What is DeerFlow 2.0 and what should enterprises know about this new, powerful local AI agent orchestrator?

ByteDance, the Chinese tech giant behind TikTok, last month released what may be one of the most ambitious open-source AI agent frameworks to date: DeerFlow 2.0. It’s now going viral across the machine learning community on social media. But is it safe and ready for enterprise use?

This is a so-called “SuperAgent harness” that orchestrates multiple AI sub-agents to autonomously complete complex, multi-hour tasks. Best of all: it is available under the permissive, enterprise-friendly standard MIT License, meaning anyone can use, modify, and build on it commercially at no cost.

DeerFlow 2.0 is designed for high-complexity, long-horizon tasks that require autonomous orchestration over minutes or hours, including conducting deep research into industry trends, generating comprehensive reports and slide decks, building functional web pages, producing AI-generated videos and reference images, performing exploratory data analysis with insightful visualizations, analyzing and summarizing podcasts or video content, automating complex data and content workflows, and explaining technical architectures through creative formats like comic strips.

ByteDance offers a bifurcated deployment strategy that separates the orchestration harness from the AI inference engine. Users can run the core harness directly on a local machine, deploy it across a private Kubernetes cluster for enterprise scale, or connect it to external messaging platforms like Slack or Telegram without requiring a public IP.

While many opt for cloud-based inference via OpenAI or Anthropic APIs, the framework is natively model-agnostic, supporting fully localized setups through tools like Ollama. This flexibility allows organizations to tailor the system to their specific data sovereignty needs, choosing between the convenience of cloud-hosted “brains” and the total privacy of a restricted on-premise stack.

Importantly, choosing the local route does not mean sacrificing security or functional isolation. Even when running entirely on a single workstation, DeerFlow still utilizes a Docker-based “AIO Sandbox” to provide the agent with its own execution environment.

This sandbox—which contains its own browser, shell, and persistent filesystem—ensures that the agent’s “vibe coding” and file manipulations remain strictly contained. Whether the underlying models are served via the cloud or a local server, the agent’s actions always occur within this isolated container, allowing for safe, long-running tasks that can execute bash commands and manage data without risk to the host system’s core integrity.
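The isolation pattern described here, a throwaway container per task with a mounted workspace, can be reproduced with plain Docker flags. This is our own sketch, not DeerFlow's actual sandbox code:

```python
import subprocess

def sandbox_args(command, workspace="/tmp/agent-ws"):
    # Build the docker invocation for one isolated agent command.
    return [
        "docker", "run", "--rm",          # throwaway container per task
        "--network", "none",              # no network unless explicitly granted
        "-v", f"{workspace}:/workspace",  # persistent, mountable filesystem
        "-w", "/workspace",
        "python:3.12-slim",               # image choice is illustrative
        "sh", "-c", command,
    ]

def run_in_sandbox(command):
    # The agent's shell commands and file writes land in the mounted
    # workspace, never on the host's own filesystem.
    result = subprocess.run(sandbox_args(command),
                            capture_output=True, text=True, timeout=300)
    return result.stdout

# e.g. run_in_sandbox("echo hello > out.txt && cat out.txt")
args = sandbox_args("ls")
print(args[0])
```

A production harness adds lifecycle management on top (persisting the workspace between sessions, resource limits, cleanup), but the containment boundary is the same.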

Since its release last month, it has accumulated more than 39,000 stars (user saves) and 4,600 forks — a growth trajectory that has developers and researchers alike paying close attention.

Not a chatbot wrapper: what DeerFlow 2.0 actually is

DeerFlow is not another thin wrapper around a large language model. The distinction matters.

While many AI tools give a model access to a search API and call it an agent, DeerFlow 2.0 gives its agents an actual isolated computer environment: a Docker sandbox with a persistent, mountable filesystem.

The system maintains both short- and long-term memory that builds user profiles across sessions. It loads modular “skills” — discrete workflows — on demand to keep context windows manageable. And when a task is too large for one agent, a lead agent decomposes it, spawns parallel sub-agents with isolated contexts, executes code and Bash commands safely, and synthesizes the results into a finished deliverable.

It is similar to the approach being pursued by NanoClaw, an OpenClaw variant, which recently partnered with Docker itself to offer enterprise-grade sandboxes for agents and subagents.

But while NanoClaw is extremely open-ended, DeerFlow has more clearly defined its architecture and scoped its tasks. Demos on the project’s official site, deerflow.tech, showcase real outputs: agent trend forecast reports, videos generated from literary prompts, comics explaining machine learning concepts, data analysis notebooks, and podcast summaries.

The framework is designed for tasks that take minutes to hours to complete — the kind of work that currently requires a human analyst or a paid subscription to a specialized AI service.

From Deep Research to Super Agent

DeerFlow’s original v1 launched in May 2025 as a focused deep-research framework. Version 2.0 is something categorically different: a ground-up rewrite on LangGraph 1.0 and LangChain that shares no code with its predecessor. ByteDance explicitly framed the release as a transition “from a Deep Research agent into a full-stack Super Agent.”

New in v2: a batteries-included runtime with filesystem access, sandboxed execution, persistent memory, and sub-agent spawning; progressive skill loading; Kubernetes support for distributed execution; and long-horizon task management that can run autonomously across extended timeframes.

The framework is fully model-agnostic, working with any OpenAI-compatible API. It has strong out-of-the-box support for ByteDance’s own Doubao-Seed models, as well as DeepSeek v3.2, Kimi 2.5, Anthropic’s Claude, OpenAI’s GPT variants, and local models run via Ollama. It also integrates with Claude Code for terminal-based tasks, and with messaging platforms including Slack, Telegram, and Feishu.

Why it’s going viral now

The project’s current viral moment is the result of a slow build that accelerated sharply this week.

The February 28 launch generated significant initial buzz, but it was coverage in machine learning media — including deeplearning.ai’s The Batch — over the following two weeks that built credibility in the research community.

Then, on March 21, AI influencer Min Choi posted to his large X following: “China’s ByteDance just dropped DeerFlow 2.0. This AI is a super agent harness with sub-agents, memory, sandboxes, IM channels, and Claude Code integration. 100% open source.” The post earned more than 1,300 likes and triggered a cascade of reposts and commentary across AI Twitter.

A search of X using Grok uncovered the full scope of that response. Influencer Brian Roemmele, after conducting what he described as intensive personal testing, declared that “DeerFlow 2.0 absolutely smokes anything we’ve ever put through its paces” and called it a “paradigm shift,” adding that his company had dropped competing frameworks entirely in favor of running DeerFlow locally. “We use 2.0 LOCAL ONLY. NO CLOUD VERSION,” he wrote.

More pointed commentary came from accounts focused on the business implications. One post from @Thewarlordai, published March 23, framed it bluntly: “MIT licensed AI employees are the death knell for every agent startup trying to sell seat-based subscriptions. The West is arguing over pricing while China just commoditized the entire workforce.”

Another widely shared post described DeerFlow as “an open-source AI staff that researches, codes and ships products while you sleep… now it’s a Python repo and ‘make up’ away.”

Cross-linguistic amplification — with substantive posts in English, Japanese, and Turkish — points to genuine global reach rather than a coordinated promotion campaign, though the latter is not out of the question and may be contributing to the current virality.

The ByteDance question

ByteDance’s involvement is the variable that makes DeerFlow’s reception more complicated than a typical open-source release.

On the technical merits, the open-source, MIT-licensed nature of the project means the code is fully auditable. Developers can inspect what it does, where data flows, and what it sends to external services. That is materially different from using a closed ByteDance consumer product.

But ByteDance operates under Chinese law, and for organizations in regulated industries — finance, healthcare, defense, government — the provenance of software tooling increasingly triggers formal review requirements, regardless of the code’s quality or openness.

The jurisdictional question is not hypothetical: U.S. federal agencies are already operating under guidance that treats Chinese-origin software as a category requiring scrutiny.

For individual developers and small teams running fully local deployments with their own LLM API keys, those concerns are less operationally pressing. For enterprise buyers evaluating DeerFlow as infrastructure, they are not.

A real tool, with limitations

The community enthusiasm is credible, but several caveats apply.

DeerFlow 2.0 is not a consumer product. Setup requires working knowledge of Docker, YAML configuration files, environment variables, and command-line tools. There is no graphical installer. For developers comfortable with that environment, the setup is described as relatively straightforward; for others, it is a meaningful barrier.

Performance when running fully local models — rather than cloud API endpoints — depends heavily on available VRAM and hardware, with context handoff between multiple specialized models a known challenge. For multi-agent tasks running several models in parallel, the resource requirements escalate quickly.

The project’s documentation, while improving, still has gaps for enterprise integration scenarios. There has been no independent public security audit of the sandboxed execution environment, which represents a non-trivial attack surface if exposed to untrusted inputs.

And the ecosystem, while growing fast, is weeks old. The plugin and skill library that would make DeerFlow comparably mature to established orchestration frameworks simply does not exist yet.

What does it mean for enterprises in the AI transformation age?

The deeper significance of DeerFlow 2.0 may be less about the tool itself and more about what it represents in the broader race to define autonomous AI infrastructure.

DeerFlow’s emergence as a fully capable, self-hostable, MIT-licensed agentic orchestrator adds another twist to the ongoing race among enterprises, AI builders, and model providers to turn generative AI models into something more than chatbots: full-time, or at least part-time, digital employees capable of both communication and reliable action.

In a sense, it marks the natural next wave after OpenClaw: whereas that open-source tool sought to create a dependable, always-on autonomous AI agent the user could message, DeerFlow is designed to let a user deploy a fleet of them and keep track of them, all within the same system.

The decision to implement it in your enterprise hinges on whether your organization’s workload demands “long-horizon” execution—complex, multi-step tasks spanning minutes to hours that involve deep research, coding, and synthesis. Unlike a standard LLM interface, this “SuperAgent” harness decomposes broad prompts into parallel sub-tasks performed by specialized experts. This architecture is specifically designed for high-context workflows where a single-pass response is insufficient and where “vibe coding” or real-time file manipulation in a secure environment is necessary.

The primary condition for use is the technical readiness of an organization’s hardware and sandbox environment. Because each task runs within an isolated Docker container with its own filesystem, shell, and browser, DeerFlow acts as a “computer-in-a-box” for the agent. This makes it ideal for data-intensive workloads or software engineering tasks where an agent must execute and debug code safely without contaminating the host system. However, this “batteries-included” runtime places a significant burden on the infrastructure layer; decision-makers must ensure they have the GPU clusters and VRAM capacity to support multi-agent fleets running in parallel, as the framework’s resource requirements escalate quickly during complex tasks.

Strategic adoption is often a calculation between the overhead of seat-based SaaS subscriptions and the control of self-hosted open-source deployments. The MIT License positions DeerFlow 2.0 as a highly capable, royalty-free alternative to proprietary agent platforms, potentially functioning as a cost ceiling for the entire category. Enterprises should favor adoption if they prioritize data sovereignty and auditability, as the framework is model-agnostic and supports fully local execution with models like DeepSeek or Kimi. If the goal is to commoditize a digital workforce while maintaining total ownership of the tech stack, the framework provides a compelling, if technically demanding, benchmark.

Ultimately, the decision to deploy must be weighed against the inherent risks of an autonomous execution environment and its jurisdictional provenance. While sandboxing provides isolation, the ability of agents to execute bash commands creates a non-trivial attack surface that requires rigorous security governance and auditability. Furthermore, because the project is a ByteDance-led initiative via Volcengine and BytePlus, organizations in regulated sectors must reconcile its technical performance with emerging software-origin standards. Deployment is most appropriate for teams comfortable with a CLI-first, Docker-heavy setup who are ready to trade the convenience of a consumer product for a sophisticated and extensible SuperAgent harness.

The three disciplines separating AI agent demos from real-world deployment

Getting AI agents to perform reliably in production — not just in demos — is turning out to be harder than enterprises anticipated. Fragmented data, unclear workflows, and runaway escalation rates are slowing deployments across industries.

“The technology itself often works well in demonstrations,” said Sanchit Vir Gogia, chief analyst with Greyhound Research. “The challenge begins when it is asked to operate inside the complexity of a real organization.” 

Burley Kawasaki, who oversees agent deployment at Creatio, and team have developed a methodology built around three disciplines: data virtualization to work around data lake delays; agent dashboards and KPIs as a management layer; and tightly bounded use-case loops to drive toward high autonomy.

In simpler use cases, Kawasaki says these practices have enabled agents to handle up to 80-90% of tasks on their own. With further tuning, he estimates they could support autonomous resolution in at least half of use cases, even in more complex deployments.

“People have been experimenting a lot with proof of concepts, they’ve been putting a lot of tests out there,” Kawasaki told VentureBeat. “But now in 2026, we’re starting to focus on mission-critical workflows that drive either operational efficiencies or additional revenue.”

Why agents keep failing in production

Enterprises are eager to adopt agentic AI in some form or another — often because they’re afraid of being left out, even before they identify real-world, tangible use cases — but run into significant bottlenecks around data architecture, integration, monitoring, security, and workflow design. 

The first obstacle almost always has to do with data, Gogia said. Enterprise information rarely exists in a neat or unified form; it is spread across SaaS platforms, apps, internal databases, and other data stores. Some are structured, some are not. 

But even when enterprises overcome the data retrieval problem, integration is a big challenge. Agents rely on APIs and automation hooks to interact with applications, but many enterprise systems were designed long before this kind of autonomous interaction was a reality, Gogia pointed out. 

This can result in incomplete or inconsistent APIs, and systems can respond unpredictably when accessed programmatically. Organizations also run into snags when they attempt to automate processes that were never formally defined, Gogia said. 

“Many business workflows depend on tacit knowledge,” he said. That is, employees know how to resolve exceptions they’ve seen before without explicit instructions — but those missing rules and instructions become startlingly obvious when workflows are translated into automation logic.

The tuning loop

Creatio deploys agents in a “bounded scope with clear guardrails,” followed by an “explicit” tuning and validation phase, Kawasaki explained. Teams review initial outcomes, adjust as needed, then re-test until they’ve reached an acceptable level of accuracy. 

That loop typically follows this pattern: 

  • Design-time tuning (before go-live): Performance is improved through prompt engineering, context wrapping, role definitions, workflow design, and grounding in data and documents. 

  • Human-in-the-loop correction (during execution): Devs approve, edit, or resolve exceptions. In instances where humans have to intervene the most (escalation or approval), users establish stronger rules, provide more context, and update workflow steps; or, they’ll narrow tool access. 

  • Ongoing optimization (after go-live): Devs continue to monitor exception rates and outcomes, then tune repeatedly as needed, helping to improve accuracy and autonomy over time. 
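The loop above can be sketched as a simple escalation harness. The agent, tasks, and confidence threshold are all invented for illustration, not Creatio's implementation:

```python
def run_batch(agent, tasks, threshold=0.8):
    # Run the agent on a bounded scope; escalate low-confidence cases
    # to a human, and track the exception rate that drives the next
    # tuning pass (stronger rules, more context, narrower tool access).
    resolved, escalated = [], []
    for task in tasks:
        answer, confidence = agent(task)
        if confidence >= threshold:
            resolved.append((task, answer))
        else:
            escalated.append(task)  # human-in-the-loop correction
    exception_rate = len(escalated) / len(tasks)
    return resolved, escalated, exception_rate

def toy_agent(task):
    # Toy agent: confident on standardized renewals, unsure on edge cases.
    if "renewal" in task:
        return ("processed", 0.95)
    return ("unsure", 0.4)

resolved, escalated, rate = run_batch(
    toy_agent, ["renewal A", "renewal B", "odd refund", "renewal C"])
print(rate)
```

Monitoring that exception rate over time is what tells a team whether tuning is converging toward the 80-90% autonomy Kawasaki describes, or whether the scope needs to be narrowed.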

Kawasaki’s team applies retrieval-augmented generation to ground agents in enterprise knowledge bases, CRM data, and other proprietary sources. 

Once agents are deployed in the wild, they are monitored with a dashboard providing performance analytics, conversion insights, and auditability. Essentially, agents are treated like digital workers. They have their own management layer with dashboards and KPIs.

For instance, an onboarding agent will be incorporated as a standard dashboard interface providing agent monitoring and telemetry. This is part of the platform layer — orchestration, governance, security, workflow execution, monitoring, and UI embedding — that sits “above the LLM,” Kawasaki said.

Users see a dashboard of agents in use and each of their processes, workflows, and executed results. They can “drill down” into an individual record (like a referral or renewal) that shows a step-by-step execution log and related communications to support traceability, debugging, and agent tweaking. The most common adjustments involve logic and incentives, business rules, prompt context, and tool access, Kawasaki said. 

The biggest issues that come up post-deployment: 

  • Exception handling volume can be high: Early spikes in edge cases often occur until guardrails and workflows are tuned. 

  • Data quality and completeness: Missing or inconsistent fields and documents can cause escalations; teams can identify which data to prioritize for grounding and which checks to automate.

  • Auditability and trust: Regulated customers, particularly, require clear logs, approvals, role-based access control (RBAC), and audit trails.

“We always explain that you have to allocate time to train agents,” Creatio’s CEO Katherine Kostereva told VentureBeat. “It doesn’t happen immediately when you switch on the agent, it needs time to understand fully, then the number of mistakes will decrease.” 

“Data readiness” doesn’t always require an overhaul

When looking to deploy agents, “Is my data ready?” is a common early question. Enterprises know data access is important, but can be put off by the prospect of a massive data consolidation project.

But virtual connections can give agents access to underlying systems, getting around typical data lake/lakehouse/warehouse delays. Kawasaki’s team built a platform that integrates with these data sources and is now working on an approach that pulls data into a virtual object, processes it, and uses it like a standard object for UIs and workflows. This way, they don’t have to “persist or duplicate” large volumes of data in their database.

This technique can be helpful in areas like banking, where transaction volumes are simply too large to copy into CRM, but are “still valuable for AI analysis and triggers,” Kawasaki said.

Once integrations and virtual objects are established, teams can evaluate data completeness, consistency, and availability, and identify low-friction starting points (like document-heavy or unstructured workflows). 

Kawasaki emphasized the importance of “really using the data in the underlying systems, which tends to actually be the cleanest or the source of truth anyway.” 

Matching agents to the work

The best fit for autonomous (or near-autonomous) agents are high-volume workflows with “clear structure and controllable risk,” Kawasaki said. For instance, document intake and validation in onboarding or loan preparation, or standardized outreach like renewals and referrals.

“Especially when you can link them to very specific processes inside an industry — that’s where you can really measure and deliver hard ROI,” he said. 

For instance, financial institutions are often siloed by nature. Commercial lending teams perform in their own environment, wealth management in another. But an autonomous agent can look across departments and separate data stores to identify, for instance, commercial customers who might be good candidates for wealth management or advisory services.

“You think it would be an obvious opportunity, but no one is looking across all the silos,” Kawasaki said. Some banks that have applied agents to this very scenario have seen “benefits of millions of dollars of incremental revenue,” he claimed, without naming specific institutions. 

However, in other cases — particularly in regulated industries — longer-context agents are not only preferable but necessary: think multi-step tasks like gathering evidence across systems, summarizing, comparing, drafting communications, and producing auditable rationales.

“The agent isn’t giving you a response immediately,” Kawasaki said. “It may take hours, days, to complete full end-to-end tasks.” 

This requires orchestrated agentic execution rather than a “single giant prompt,” he said. This approach breaks work down into deterministic steps to be performed by sub-agents. Memory and context management can be maintained across various steps and time intervals. Grounding with RAG can help keep outputs tied to approved sources, and users have the ability to dictate expansion to file shares and other document repositories.
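A hypothetical skeleton of that orchestration pattern might look like the following. The step names and functions here are our own illustration, not Creatio's implementation; the point is that each sub-agent runs as a deterministic step while a shared context carries memory across steps.

```python
def run_pipeline(task: str, steps):
    """Run sub-agent steps in a fixed, deterministic order.

    `steps` is a list of (name, fn) pairs; each fn receives and returns the
    shared context dict, so memory persists across steps and time intervals.
    """
    context = {"task": task, "history": []}
    for name, step_fn in steps:
        context = step_fn(context)
        context["history"].append(name)  # auditable record of completed steps
    return context

# Illustrative sub-agent steps (stand-ins for RAG-grounded tools)
def gather_evidence(ctx):
    ctx["evidence"] = ["doc-1", "doc-2"]  # pretend retrieval from approved sources
    return ctx

def summarize(ctx):
    ctx["summary"] = f"{len(ctx['evidence'])} documents reviewed"
    return ctx
```

Because each step is an ordinary function call, intermediate outputs can be checkpointed, logged, and resumed hours or days later — which is exactly what long-running end-to-end tasks require.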

This model typically doesn’t require custom retraining or a new foundation model. Whatever model enterprises use (GPT, Claude, Gemini), performance improves through prompts, role definitions, controlled tools, workflows, and data grounding, Kawasaki said. 

The feedback loop puts “extra emphasis” on intermediate checkpoints, he said. Humans review intermediate artifacts (such as summaries, extracted facts, or draft recommendations) and correct errors. Those can then be converted into better rules and retrieval sources, narrower tool scopes, and improved templates. 

“What is important for this style of autonomous agent, is you mix the best of both worlds: The dynamic reasoning of AI, with the control and power of true orchestration,” Kawasaki said.

Ultimately, agents require coordinated changes across enterprise architecture, new orchestration frameworks, and explicit access controls, Gogia said. Agents must be assigned identities to restrict their privileges and keep them within bounds. Observability is critical; monitoring tools can record task completion rates, escalation events, system interactions, and error patterns. This kind of evaluation must be a permanent practice, and agents should be tested to see how they react when encountering new scenarios and unusual inputs. 

“The moment an AI system can take action, enterprises have to answer several questions that rarely appear during copilot deployments,” Gogia said. Such as: What systems is the agent allowed to access? What types of actions can it perform without approval? Which activities must always require a human decision? How will every action be recorded and reviewed?

“Those [enterprises] that underestimate the challenge often find themselves stuck in demonstrations that look impressive but cannot survive real operational complexity,” Gogia said. 

Nvidia’s Nemotron-Cascade 2 wins math and coding gold medals with 3B active parameters — and its post-training recipe is now open-source

The prevailing assumption in AI development has been straightforward: larger models trained on more data produce better results. Nvidia’s latest release directly challenges that size assumption — and the training recipe behind it may matter more to enterprise AI teams than the model itself. The open-weight model’s Cascade RL post-training pipeline, detailed in Nvidia’s technical report, offers a reproducible blueprint for enterprise teams building domain-specific reasoning systems without training from scratch.

Nemotron-Cascade 2 is an open-weight 30B Mixture-of-Experts (MoE) model that activates only 3B parameters at inference time. Despite this compact footprint, it achieved gold medal-level performance on three of the world’s most demanding competitions: the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals. It is the second open model to reach this tier, after DeepSeek-V3.2-Speciale — a model with 20 times more parameters.

Why post-training is becoming the real competitive advantage

Pre-training a large language model from scratch is enormously expensive — on the order of tens to possibly hundreds of millions of dollars for frontier models. Nemotron-Cascade 2 starts from the same base model as Nvidia’s existing Nemotron-3-Nano — yet it outperforms that model on nearly every benchmark, and in many cases outperforms Nvidia’s own Nemotron-3-Super, a model with four times the active parameters, according to Nvidia’s technical report. The difference is entirely in the post-training recipe.

This is the strategic insight for enterprise teams: You don’t necessarily need a bigger or more expensive base model. You may need a better training pipeline on top of the one you already have. Cascade RL and MOPD represent a specific, reproducible approach to that problem.

Cascade RL explained: sequential domain training that avoids catastrophic forgetting

Reinforcement learning (RL) has become the dominant technique for teaching LLMs to reason. The challenge is that training a model on multiple domains simultaneously — math, code, instruction-following, agentic tasks — often causes interference. Improving performance in one domain degrades it in another. This is the problem of catastrophic forgetting, a long-documented challenge in multi-task machine learning.

Cascade RL addresses this by training RL stages sequentially, one domain at a time, rather than mixing everything together. Nemotron-Cascade 2 follows a specific ordering: first instruction-following RL, then multi-domain RL (covering STEM questions, tool calling, and structured output), then on-policy distillation, then RLHF for human preference alignment, then long-context RL, then code RL, and finally software engineering RL.

Three properties make this approach practical, according to Nvidia’s technical report. First, domain-specific RL stages turn out to be resistant to catastrophic forgetting — training on code rarely degrades math performance, and in some cases actually improves it. Second, because each stage trains on a single domain, hyperparameters and the training curriculum can be tailored to that domain’s specific characteristics, enabling better learning overall. Third, because responses within a single domain tend to be similar in length and verification cost, compute utilization is substantially more efficient than mixed-domain training.

The ordering itself is not fixed; it depends on the model’s behavior. The Nemotron-Cascade 2 team found that instruction-following RL should come first (because it can conflict with human preference alignment, which can be recovered later), while code RL and software engineering RL work best as the final stages, according to the report.

For enterprise teams, the implication is straightforward: If you are applying RL to improve a model across multiple capabilities, training them sequentially with careful ordering may give you better results than trying to train everything at once.

MOPD: reusing your own training checkpoints as teachers

Even with careful sequential ordering, some performance drift is inevitable as the model passes through many RL stages. Nvidia’s solution is Multi-Domain On-Policy Distillation (MOPD) — a technique inserted partway through the Cascade RL pipeline to rebalance capabilities.

The approach works as follows: As the model passes through different RL stages, some intermediate checkpoints will be the best-performing version for specific domains. The math checkpoint might be strongest after SFT; the instruction-following checkpoint might be strongest after IF-RL. MOPD selects the best intermediate checkpoint for each domain and uses it as a “teacher” to distill knowledge back into the student model.

Critically, these teachers are not external models. They come from the same training run, sharing the same tokenizer and architecture. This eliminates distribution mismatch problems that arise when distilling from a completely different model family.

According to Nvidia’s technical report, MOPD works at the token level rather than the sequence level, which makes it substantially more sample-efficient than RL with outcome-based rewards such as GRPO (Group Relative Policy Optimization). The Nvidia team reports that on the AIME 2025 math benchmark, MOPD recovered teacher-level performance within 30 optimization steps, while standard GRPO required more steps to achieve a lower score. On the ArenaHard benchmark for human preference alignment, MOPD reached 85.5 on hard prompts in 52 steps versus RLHF’s 80.7 in 160 steps.
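Token-level on-policy distillation is commonly formulated as minimizing the expected per-token divergence between teacher and student distributions on sequences the student itself samples. Nvidia's exact objective may differ in detail, but one standard form is:

```latex
\mathcal{L}(\theta) \;=\; \mathbb{E}_{y \sim \pi_\theta}\left[\sum_{t} D_{\mathrm{KL}}\!\Big(\pi_{\text{teacher}}(\cdot \mid y_{<t}) \,\big\|\, \pi_\theta(\cdot \mid y_{<t})\Big)\right]
```

Every token position contributes a dense supervision signal, which is why this style of distillation converges in far fewer steps than outcome-reward RL, where an entire sequence yields only a single scalar reward.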

The benchmark picture: dominant in reasoning, honest about trade-offs

The results on reasoning-intensive benchmarks are striking. On LiveCodeBench v6, a coding benchmark with problems from competitive programming platforms, Nemotron-Cascade 2 scores 87.2 — surpassing Qwen3.5-35B-A3B (74.6), Qwen3.5-397B-A17B (83.6), and even Kimi-K2.5-1T (85.0). On HMMT February 2025, a rigorous math competition benchmark, it scores 94.6, neck-and-neck with models many times its size. On ArenaHard v2 for alignment quality, it reaches 83.5, well ahead of competitors in its class. With tool-integrated reasoning enabled, AIME 2025 performance climbs to 98.6. All benchmark scores are self-reported by Nvidia and have not been independently verified.

The technical report is also candid about weaknesses. The model underperforms Qwen3.5-35B-A3B on knowledge-intensive benchmarks like MMLU-Pro (79.8 vs. 85.3) and GPQA-Diamond (76.1 vs. 84.2), as well as on several agentic benchmarks like BFCL v4 and τ²-Bench. The authors explicitly note that stronger knowledge-intensive pre-training and agentic RL are needed in future work.

This honesty matters for practitioners. The model is optimized for deep reasoning and instruction-following — not general knowledge retrieval or complex multi-turn agent interactions. Teams should evaluate against their specific use case, not assume blanket superiority.

What enterprise AI teams can take from this recipe

Several design patterns from this work are directly applicable to enterprise post-training efforts. The sequential domain ordering in Cascade RL means teams can add new capabilities without rebuilding the entire pipeline — a critical property for organizations that need to iterate quickly. MOPD’s approach of using intermediate checkpoints as domain-specific teachers eliminates the need for expensive external teacher models; teams can distill from their own best-performing snapshots. 

The training setup is also notable: Cascade RL utilizes GRPO with strict on-policy training and no KL penalty via Nvidia’s open-source Nemo-RL repository. For code RL, the pipeline used only 3,500 difficult, filtered problems.

The bigger picture: intelligence density as a design principle

Nemotron-Cascade 2 is part of a broader trend toward “intelligence density” — extracting maximum capability per active parameter. DeepSeek’s MoE models, Qwen’s A3B variants, and now Nvidia’s Cascade series all point toward a future where the most capable reasoning models are not necessarily the largest.

For enterprise deployment, this matters enormously. A model with 3B active parameters can be served at a fraction of the cost and latency of a dense 70B model. Nvidia’s results suggest that post-training techniques like Cascade RL and MOPD can close the performance gap on targeted domains — giving organizations a path to deploy strong reasoning capabilities without frontier-level infrastructure costs.

The open question is how far this approach can be generalized. Cascade RL works well for domains with verifiable rewards — math has correct answers, code has test cases, instruction-following has rule-based checkers. Extending it to more open-ended enterprise tasks, where verification is ambiguous, remains an active research challenge. For teams building systems that need deep reasoning on structured problems — financial modeling, scientific computing, software engineering, compliance analysis — Nvidia’s technical report offers one of the more detailed post-training methodologies published to date.

Testing autonomous agents (Or: how I learned to stop worrying and embrace chaos)

Look, we’ve spent the last 18 months building production AI systems, and we’ll tell you what keeps us up at night — and it’s not whether the model can answer questions. That’s table stakes now. What haunts us is the mental image of an agent autonomously approving a six-figure vendor contract at 2 a.m. because someone typo’d a config file.

We’ve moved past the era of “ChatGPT wrappers” (thank God), but the industry still treats autonomous agents like they’re just chatbots with API access. They’re not. When you give an AI system the ability to take actions without human confirmation, you’re crossing a fundamental threshold. You’re not building a helpful assistant anymore — you’re building something closer to an employee. And that changes everything about how we need to engineer these systems.

The autonomy problem nobody talks about

Here’s what’s wild: We’ve gotten really good at making models that *sound* confident. But confidence and reliability aren’t the same thing, and the gap between them is where production systems go to die.

We learned this the hard way during a pilot program where we let an AI agent manage calendar scheduling across executive teams. Seems simple, right? The agent could check availability, send invites, handle conflicts. Except, one Monday morning, it rescheduled a board meeting because it interpreted “let’s push this if we need to” in a Slack message as an actual directive. The model wasn’t wrong in its interpretation — it was plausible. But plausible isn’t good enough when you’re dealing with autonomy.

That incident taught us something crucial: The challenge isn’t building agents that work most of the time. It’s building agents that fail gracefully, know their limitations, and have the circuit breakers to prevent catastrophic mistakes.

What reliability actually means for autonomous systems

Layered reliability architecture

When we talk about reliability in traditional software engineering, we’ve got decades of patterns: Redundancy, retries, idempotency, graceful degradation. But AI agents break a lot of our assumptions.

Traditional software fails in predictable ways. You can write unit tests. You can trace execution paths. With AI agents, you’re dealing with probabilistic systems making judgment calls. A bug isn’t just a logic error — it’s the model hallucinating a plausible-sounding but completely fabricated API endpoint, or misinterpreting context in a way that technically parses but completely misses the human intent.

So what does reliability look like here? In our experience, it’s a layered approach.

Layer 1: Model selection and prompt engineering

This is foundational but insufficient. Yes, use the best model you can afford. Yes, craft your prompts carefully with examples and constraints. But don’t fool yourself into thinking that a great prompt is enough. We’ve seen too many teams ship “GPT-4 with a really good system prompt” and call it enterprise-ready.

Layer 2: Deterministic guardrails

Before the model does anything irreversible, run it through hard checks. Is it trying to access a resource it shouldn’t? Is the action within acceptable parameters? We’re talking old-school validation logic — regex, schema validation, allowlists. It’s not sexy, but it’s effective.

One pattern that’s worked well for us: Maintain a formal action schema. Every action an agent can take has a defined structure, required fields, and validation rules. The agent proposes actions in this schema, and we validate before execution. If validation fails, we don’t just block it — we feed the validation errors back to the agent and let it try again with context about what went wrong.
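As a sketch of that validate-then-feed-back loop (the action names and required fields here are hypothetical, not any particular product's schema):

```python
ALLOWED_ACTIONS = {
    # Hypothetical action schemas: required fields per action type
    "send_email": {"to", "subject", "body"},
    "create_event": {"title", "start", "attendees"},
}

def validate_action(proposal: dict) -> list[str]:
    """Return validation errors; an empty list means the action may execute."""
    name = proposal.get("action")
    if name not in ALLOWED_ACTIONS:
        return [f"unknown action: {name!r}"]
    missing = ALLOWED_ACTIONS[name] - proposal.get("params", {}).keys()
    return [f"missing required fields: {sorted(missing)}"] if missing else []
```

On failure, the error list goes back into the agent's context so it can repair the proposal and try again — the agent never touches a live system until validation passes.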

Layer 3: Confidence and uncertainty quantification

Here’s where it gets interesting. We need agents that know what they don’t know. We’ve been experimenting with agents that can explicitly reason about their confidence before taking actions. Not just a probability score, but actual articulated uncertainty: “I’m interpreting this email as a request to delay the project, but the phrasing is ambiguous and could also mean…”

This doesn’t prevent all mistakes, but it creates natural breakpoints where you can inject human oversight. High-confidence actions go through automatically. Medium-confidence actions get flagged for review. Low-confidence actions get blocked with an explanation.
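The tiering above reduces to a tiny routing function. The thresholds here are illustrative placeholders; in practice they are tuned per action type and risk level:

```python
def route_action(confidence: float,
                 auto_threshold: float = 0.9,
                 review_threshold: float = 0.6) -> str:
    """Route a proposed action based on the agent's articulated confidence."""
    if confidence >= auto_threshold:
        return "execute"          # high confidence: proceed automatically
    if confidence >= review_threshold:
        return "flag_for_review"  # medium confidence: queue for a human
    return "block"                # low confidence: stop and explain
```

The useful property is that the decision boundary lives in plain, auditable code rather than inside the model.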

Layer 4: Observability and auditability

Action Validation Pipeline

If you can’t debug it, you can’t trust it. Every decision the agent makes needs to be loggable, traceable, and explainable. Not just “what action did it take” but “what was it thinking, what data did it consider, what was the reasoning chain?”

We’ve built a custom logging system that captures the full large language model (LLM) interaction — the prompt, the response, the context window, even the model temperature settings. It’s verbose as hell, but when something goes wrong (and it will), you need to be able to reconstruct exactly what happened. Plus, this becomes your dataset for fine-tuning and improvement.
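A minimal version of that kind of logger might look like this. The field names are our own, not a specific product's schema; the point is one append-only JSON line per interaction, capturing enough to replay the call:

```python
import json
import time
import uuid

def log_llm_interaction(sink, *, prompt, response, context_window, temperature):
    """Append one full LLM interaction as a JSON line to `sink`.

    Captures everything needed to reconstruct the call after an incident;
    the same records double as a fine-tuning dataset later.
    """
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "prompt": prompt,
        "response": response,
        "context_window": context_window,
        "temperature": temperature,
    }
    sink.write(json.dumps(record) + "\n")
    return record
```

Any file-like object works as the sink, so the same code writes to local files in development and to a log shipper in production.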

Guardrails: The art of saying no

Let’s talk about guardrails, because this is where engineering discipline really matters. A lot of teams approach guardrails as an afterthought — “we’ll add some safety checks if we need them.” That’s backwards. Guardrails should be your starting point.

We think of guardrails in three categories.

Permission boundaries

What is the agent physically allowed to do? This is your blast radius control. Even if the agent hallucinates the worst possible action, what’s the maximum damage it can cause?

We use a principle called “graduated autonomy.” New agents start with read-only access. As they prove reliable, they graduate to low-risk writes (creating calendar events, sending internal messages). High-risk actions (financial transactions, external communications, data deletion) either require explicit human approval or are simply off-limits.

One technique that’s worked well: Action cost budgets. Each agent has a daily “budget” denominated in some unit of risk or cost. Reading a database record costs 1 unit. Sending an email costs 10. Initiating a vendor payment costs 1,000. The agent can operate autonomously until it exhausts its budget; then, it needs human intervention. This creates a natural throttle on potentially problematic behavior.
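The budget idea sketches out roughly like this, with costs mirroring the hypothetical figures above:

```python
class ActionBudget:
    """Daily risk budget; the costs are illustrative risk units, not dollars."""

    COSTS = {"read_record": 1, "send_email": 10, "vendor_payment": 1000}

    def __init__(self, daily_budget: int):
        self.remaining = daily_budget

    def try_spend(self, action: str) -> bool:
        """Deduct and return True if affordable; False means escalate to a human."""
        cost = self.COSTS[action]
        if cost > self.remaining:
            return False
        self.remaining -= cost
        return True
```

The agent loop calls `try_spend` before every action; a `False` pauses the agent until a human tops up or intervenes, which is what makes the throttle natural rather than bolted on.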

Graduated Autonomy and Action Cost Budget

Semantic boundaries

What should the agent understand as in-scope vs out-of-scope? This is trickier because it’s conceptual, not just technical.

We’ve found that explicit domain definitions help a lot. Our customer service agent has a clear mandate: handle product questions, process returns, escalate complaints. Anything outside that domain — someone asking for investment advice, technical support for third-party products, personal favors — gets a polite deflection and escalation.

The challenge is making these boundaries robust to prompt injection and jailbreaking attempts. Users will try to convince the agent to help with out-of-scope requests. Other parts of the system might inadvertently pass instructions that override the agent’s boundaries. You need multiple layers of defense here.

Operational boundaries

How much can the agent do, and how fast? This is your rate limiting and resource control.

We’ve implemented hard limits on everything: API calls per minute, maximum tokens per interaction, maximum cost per day, maximum number of retries before human escalation. These might seem like artificial constraints, but they’re essential for preventing runaway behavior.

We once saw an agent get stuck in a loop trying to resolve a scheduling conflict. It kept proposing times, getting rejections, and trying again. Without rate limits, it sent 300 calendar invites in an hour. With proper operational boundaries, it would’ve hit a threshold and escalated to a human after attempt number 5.
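The scheduling-loop failure maps directly to a hard-limit wrapper. This is a sketch with hypothetical names, but the shape is the whole point — bounded attempts, then escalation:

```python
def attempt_with_limit(attempt_fn, max_attempts: int = 5):
    """Retry an action until it succeeds or the hard limit trips.

    `attempt_fn` returns True on success; after `max_attempts` failures the
    result is an escalation record, never an unbounded loop.
    """
    for attempt in range(1, max_attempts + 1):
        if attempt_fn(attempt):
            return {"status": "resolved", "attempts": attempt}
    return {"status": "escalated", "attempts": max_attempts}
```

With this wrapper around the invite-proposal step, the 300-invite incident becomes five attempts and a page to a human.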

Agents need their own style of testing

Traditional software testing doesn’t cut it for autonomous agents. You can’t just write test cases that cover all the edge cases, because with LLMs, everything is an edge case.

What’s worked for us:

Simulation environments

Build a sandbox that mirrors production but with fake data and mock services. Let the agent run wild. See what breaks. We do this continuously — every code change goes through 100 simulated scenarios before it touches production.

The key is making scenarios realistic. Don’t just test happy paths. Simulate angry customers, ambiguous requests, contradictory information, system outages. Throw in some adversarial examples. If your agent can’t handle a test environment where things go wrong, it definitely can’t handle production.
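A minimal scenario harness — the shape is ours, not a specific framework — looks like:

```python
def run_scenarios(agent_fn, scenarios):
    """Run an agent against a batch of simulated scenarios.

    Each scenario supplies an input and a `check` predicate on the output;
    exceptions count as failures rather than crashing the harness.
    """
    results = []
    for scenario in scenarios:
        try:
            output = agent_fn(scenario["input"])
            passed = scenario["check"](output)
            error = None
        except Exception as exc:
            passed, error = False, repr(exc)
        results.append({"name": scenario["name"], "passed": passed, "error": error})
    return results
```

Adversarial and outage scenarios are just more entries in the list, which is what makes running hundreds of them per code change cheap.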

Red teaming

Get creative people to try to break your agent. Not just security researchers, but domain experts who understand the business logic. Some of our best improvements came from sales team members who tried to “trick” the agent into doing things it shouldn’t.

Shadow mode

Before you go live, run the agent in shadow mode alongside humans. The agent makes decisions, but humans actually execute the actions. You log both the agent’s choices and the human’s choices, and you analyze the delta.

This is painful and slow, but it’s worth it. You’ll find all kinds of subtle misalignments you’d never catch in testing. Maybe the agent technically gets the right answer, but with phrasing that violates company tone guidelines. Maybe it makes legally correct but ethically questionable decisions. Shadow mode surfaces these issues before they become real problems.
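The delta analysis at the heart of shadow mode can be as simple as the following sketch (the names and record shape are ours):

```python
def shadow_delta(cases, agent_decide, human_decisions):
    """Compare agent proposals against what humans actually did on the same cases."""
    mismatches = []
    for case, human_choice in zip(cases, human_decisions):
        agent_choice = agent_decide(case)
        if agent_choice != human_choice:
            mismatches.append(
                {"case": case, "agent": agent_choice, "human": human_choice}
            )
    agreement = 1 - len(mismatches) / len(cases)
    return {"agreement": agreement, "mismatches": mismatches}
```

The mismatch records, not the agreement number, are where the value is: each one is a concrete case where the agent's judgment diverged from a human's.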

The human-in-the-loop pattern

Three Human-in-the-Loop Patterns

Despite all the automation, humans remain essential. The question is: Where in the loop?

We’re increasingly convinced that “human-in-the-loop” is actually several distinct patterns:

Human-on-the-loop: The agent operates autonomously, but humans monitor dashboards and can intervene. This is your steady-state for well-understood, low-risk operations.

Human-in-the-loop: The agent proposes actions, humans approve them. This is your training wheels mode while the agent proves itself, and your permanent mode for high-risk operations.

Human-with-the-loop: Agent and human collaborate in real-time, each handling the parts they’re better at. The agent does the grunt work, the human does the judgment calls.

The trick is making these transitions smooth. An agent shouldn’t feel like a completely different system when you move from autonomous to supervised mode. Interfaces, logging, and escalation paths should all be consistent.

Failure modes and recovery

Let’s be honest: Your agent will fail. The question is whether it fails gracefully or catastrophically.

We classify failures into three categories:

Recoverable errors: The agent tries to do something, it doesn’t work, the agent realizes it didn’t work and tries something else. This is fine. This is how complex systems operate. As long as the agent isn’t making things worse, let it retry with exponential backoff.

Detectable failures: The agent does something wrong, but monitoring systems catch it before significant damage occurs. This is where your guardrails and observability pay off. The agent gets rolled back, humans investigate, you patch the issue.

Undetectable failures: The agent does something wrong, and nobody notices until much later. These are the scary ones. Maybe it’s been misinterpreting customer requests for weeks. Maybe it’s been making subtly incorrect data entries. These accumulate into systemic issues.
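For the recoverable case, retry with exponential backoff is the standard pattern. A sketch, with an illustrative delay schedule and the sleep function injectable so it can be tested:

```python
import time

def retry_with_backoff(fn, max_retries: int = 4, base_delay: float = 1.0,
                       sleep=time.sleep):
    """Retry `fn`, doubling the wait after each failure; re-raise when exhausted."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the failure for escalation
            sleep(base_delay * 2 ** attempt)
```

The final `raise` matters: once retries are exhausted, the failure must surface to monitoring rather than disappear.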

The defense against undetectable failures is regular auditing. We randomly sample agent actions and have humans review them. Not just pass/fail, but detailed analysis. Is the agent showing any drift in behavior? Are there patterns in its mistakes? Is it developing any concerning tendencies?
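Random sampling for audit is trivially small in code — the discipline is in actually reviewing what it selects. A sketch (the 5% rate is an arbitrary example):

```python
import random

def sample_for_audit(actions, rate=0.05, seed=None):
    """Randomly select a fraction of logged agent actions for human review."""
    rng = random.Random(seed)  # seedable so an audit batch is reproducible
    return [action for action in actions if rng.random() < rate]
```

Reviewing the sampled actions over time is what surfaces drift and mistake patterns before they accumulate.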

The cost-performance tradeoff

Here’s something nobody talks about enough: reliability is expensive.

Every guardrail adds latency. Every validation step costs compute. Multiple model calls for confidence checking multiply your API costs. Comprehensive logging generates massive data volumes.

You have to be strategic about where you invest. Not every agent needs the same level of reliability. A marketing copy generator can be looser than a financial transaction processor. A scheduling assistant can retry more liberally than a code deployment system.

We use a risk-based approach. High-risk agents get all the safeguards, multiple validation layers, extensive monitoring. Lower-risk agents get lighter-weight protections. The key is being explicit about these trade-offs and documenting why each agent has the guardrails it does.

Organizational challenges

We’d be remiss if we didn’t mention that the hardest parts aren’t technical — they’re organizational.

Who owns the agent when it makes a mistake? Is it the engineering team that built it? The business unit that deployed it? The person who was supposed to be supervising it?

How do you handle edge cases where the agent’s logic is technically correct but contextually inappropriate? If the agent follows its rules but violates an unwritten norm, who’s at fault?

What’s your incident response process when an agent goes rogue? Traditional runbooks assume human operators making mistakes. How do you adapt these for autonomous systems?

These questions don’t have universal answers, but they need to be addressed before you deploy. Clear ownership, documented escalation paths, and well-defined success metrics are just as important as the technical architecture.

Where we go from here

The industry is still figuring this out. There’s no established playbook for building reliable autonomous agents. We’re all learning in production, and that’s both exciting and terrifying.

What we know for sure: The teams that succeed will be the ones who treat this as an engineering discipline, not just an AI problem. You need traditional software engineering rigor — testing, monitoring, incident response — combined with new techniques specific to probabilistic systems.

You need to be paranoid but not paralyzed. Yes, autonomous agents can fail in spectacular ways. But with proper guardrails, they can also handle enormous workloads with superhuman consistency. The key is respecting the risks while embracing the possibilities.

We’ll leave you with this: Every time we deploy a new autonomous capability, we run a pre-mortem. We imagine it’s six months from now and the agent has caused a significant incident. What happened? What warning signs did we miss? What guardrails failed?

This exercise has saved us more times than we can count. It forces you to think through failure modes before they occur, to build defenses before you need them, to question assumptions before they bite you.

Because in the end, building enterprise-grade autonomous AI agents isn’t about making systems that work perfectly. It’s about making systems that fail safely, recover gracefully, and learn continuously.

And that’s the kind of engineering that actually matters.

Madhvesh Kumar is a principal engineer. Deepika Singh is a senior software engineer.

Views expressed are based on hands-on experience building and deploying autonomous agents, along with the occasional 3 a.m. incident response that makes you question your career choices.

Anthropic just shipped an OpenClaw killer called Claude Code Channels, letting you message it over Telegram and Discord

The hit open source autonomous AI agent OpenClaw may have just gotten mogged by Anthropic.

Today, Anthropic announced Claude Code Channels, a way to hook up its own powerful Claude Code AI agentic harness to a human user’s Discord or Telegram messaging applications, letting them message Claude Code directly whenever they want while on the go and instruct it to write code for them. Official documentation is here.

This isn’t just a new UI; it is a fundamental shift in how developers interact with AI agents, moving from a synchronous “ask-and-wait” model to an asynchronous, autonomous partnership. Previously, Claude Code users were stuck interacting with the agentic harness on the Claude desktop application, terminal or supported developer environment, and Claude mobile app through a somewhat flaky (in my experience) interconnection setting called Remote Control.

Now, Anthropic is offering some of the same core functionality that drove OpenClaw’s rapid adoption among software developers and vibe coders following its release in November 2025 by Austrian developer Peter Steinberger. (Ironically, Steinberger originally called his project “Clawd” in honor of Anthropic’s own Claude model, which powered it initially, until Anthropic sent him a cease-and-desist over potential trademark violations. He has since been hired by Anthropic’s rival OpenAI.)

Central to OpenClaw’s appeal was that it gave users a persistent, personal AI worker they could message 24/7 over common apps such as iMessage, Slack, Telegram, WhatsApp and Discord, and that would message them back — not just to chat, but to perform real work on its own, from writing, sending and organizing email and files to creating whole applications, applying for jobs on the user’s behalf, and managing complete ongoing social marketing campaigns. When the AI finishes a task, it can immediately alert the user over their preferred messaging platform.

But OpenClaw also came with a high degree of security risk (since it could be given access to a user’s hard drive and file system, or other personal information, and run amok) and difficulty for non-technical users, inspiring a wave of offshoots promising greater ease and security, including NanoClaw, KiloClaw and Nvidia’s recently announced NemoClaw.

By giving Claude Code this same basic functionality (the ability to be messaged from the popular third-party apps Discord and Telegram, and to message users back when it finishes a task), Anthropic has effectively countered OpenClaw’s appeal while offering something it does not: the Anthropic brand name, with its commitment to AI security and safety, and ease of use right out of the box for less technically inclined users.

Technology: A Bridge Built on the Model Context Protocol

At the heart of this update is the Model Context Protocol (MCP) open source standard that Anthropic introduced back in 2024. Think of MCP as a universal USB-C port for AI: it provides a standardized way for an AI model to connect to external data and tools. In the new “Channels” architecture, an MCP server acts as a two-way bridge.

When a developer starts a Claude Code session with the --channels flag, they aren’t just opening a chat; they are spinning up a polling service.

Using the Bun runtime, known for its speed in executing JavaScript, Claude Code monitors the configured channel plugins (currently Telegram and Discord).

When a message arrives, it is injected directly into the active session as a <channel> event. Claude can then use its internal tools to execute code, run tests, or fix bugs, and reply to the external platform using a specialized reply tool.

The technical achievement here is persistence. Unlike a standard web chat that times out, a Claude Code session can now run in a background terminal or on a persistent server (like a VPS), waiting for a “ping” to spring into action.
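Anthropic has not published the internal event format in detail, so the following is only an illustrative sketch of the polling-and-injection pattern described above; none of these names are Anthropic’s actual APIs.

```typescript
// Illustrative sketch of the polling-bridge pattern: poll a channel, wrap
// each message as a <channel> event, and route the agent's reply back.
// All names here are hypothetical, not Anthropic's real API.

interface ChannelMessage {
  channel: string; // e.g. "telegram" or "discord"
  sender: string;
  text: string;
}

// Wrap an incoming message as a tagged event for injection into a session.
function toChannelEvent(msg: ChannelMessage): string {
  return `<channel name="${msg.channel}" sender="${msg.sender}">${msg.text}</channel>`;
}

// One tick of a minimal polling loop: drain pending messages, hand each one
// to the agent session, and send the agent's reply back to the channel.
async function pollOnce(
  fetchPending: () => Promise<ChannelMessage[]>,
  injectIntoSession: (event: string) => Promise<string>,
  sendReply: (channel: string, sender: string, text: string) => Promise<void>,
): Promise<number> {
  const pending = await fetchPending();
  for (const msg of pending) {
    const reply = await injectIntoSession(toChannelEvent(msg));
    await sendReply(msg.channel, msg.sender, reply);
  }
  return pending.length; // how many messages were handled this tick
}
```

In the real product, a loop like this runs persistently under Bun, which is what lets a session sit idle in a terminal or on a VPS and spring into action when a message arrives.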

How to set up Claude Code Channels on Telegram and Discord

Setting up these native connectors requires Claude Code v2.1.80 or later and the Bun runtime installed on your desktop PC or Mac. Follow the instructions here or below.

1. Setting up Telegram

  1. Create your Bot: Open BotFather in Telegram and use the /newbot command to create a new bot and receive its access token.

  2. Install the Plugin: Inside your Claude Code terminal, run: /plugin install telegram@claude-plugins-official

  3. Configure the Token: Run /telegram:configure <your-token> to save your credentials.

  4. Restart with Channels: Exit Claude and restart using the channel flag: claude --channels plugin:telegram@claude-plugins-official

  5. Pair your Account: DM your new bot on Telegram to receive a pairing code, then enter it in your terminal: /telegram:access pair <code>

2. Setting up Discord

  1. Create an Application: Go to the Discord Developer Portal, create a “New Application,” and reset the bot token to copy it.

  2. Enable Intents: In the Bot settings, you must enable Message Content Intent under “Privileged Gateway Intents.”

  3. Install and Configure: In Claude Code, run /plugin install discord@claude-plugins-official followed by /discord:configure <your-token>.

  4. Launch and Pair: Restart with claude --channels plugin:discord@claude-plugins-official. DM your bot on Discord and use the /discord:access pair <code> command to finish the link.

Product: From Desktop to “Everywhere”

The immediate practical impact is the democratization of mobile AI coding. Previously, if a developer wanted to check a build status or run a quick fix while away from their desk, they had to rely on complex self-hosted setups like OpenClaw.

With Channels, the setup is native. A developer can create a Telegram bot via BotFather, link it to Claude Code with a /telegram:configure command, and “pair” their account with a security code. Once configured, the phone becomes a remote control for the development environment.

The product also introduces a “Fakechat” demo—a local-only chat UI that allows developers to test the “push” logic on their own machine before connecting to external servers. This reflects Anthropic’s cautious, “research preview” approach, ensuring developers understand the flow of events before exposing their terminal to the internet.

Licensing: Proprietary Power on Open Standards

The licensing implications of this release highlight a growing trend in the AI industry: proprietary engines running on open tracks. Claude Code remains a proprietary product tied to Anthropic’s commercial subscriptions (Pro, Max, and Enterprise).

However, by building on the open-source Model Context Protocol, Anthropic is encouraging a developer ecosystem to build the “connectors” that make their model more useful.

While the core Claude “brain” is closed, the plugins for Telegram and Discord are being hosted on GitHub under official Anthropic repositories, likely allowing for community contributions or forks.

This strategy allows Anthropic to maintain the security and quality of the model while benefiting from the rapid innovation of the open-source community—a direct challenge to the “free” but often fragmented nature of purely open-source agent frameworks.

And because it’s built on MCP, the community can now build “Connectors” for Slack or WhatsApp themselves, rather than waiting for Anthropic to ship them.
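Conceptually, a community connector only needs two capabilities: receive messages for the agent and push replies back. The sketch below is hypothetical (Anthropic has not published the plugin interface in this form); it also shows a local-only, in-memory connector in the spirit of the “Fakechat” demo, useful for exercising the push logic without touching a real network.

```typescript
// Hypothetical shape of a channel connector; not Anthropic's actual plugin API.
interface ChannelConnector {
  name: string;
  poll(): Promise<string[]>;          // fetch any messages waiting for the agent
  send(text: string): Promise<void>;  // push a reply back to the platform
}

// A local-only, in-memory connector for testing, in the spirit of "Fakechat":
// messages are enqueued by hand and replies land in an inspectable outbox.
class LocalConnector implements ChannelConnector {
  name = "local";
  private inbox: string[] = [];
  readonly outbox: string[] = [];

  // Simulate a user sending a message to the bot.
  enqueue(text: string): void {
    this.inbox.push(text);
  }

  // Drain and return everything received since the last poll.
  async poll(): Promise<string[]> {
    const batch = this.inbox;
    this.inbox = [];
    return batch;
  }

  // Record the agent's reply instead of hitting a real messaging API.
  async send(text: string): Promise<void> {
    this.outbox.push(text);
  }
}
```

A Slack or WhatsApp connector would implement the same two methods against that platform’s bot API; everything else about the session would stay unchanged.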

Community Reactions: ‘The OpenClaw Killer’

The response from users, especially AI observers on X, was swift and definitive. The sentiment was best captured by Ejaaz (@cryptopunk7213), who noted that Anthropic’s speed of shipping—incorporating texting, thousands of MCP skills, and autonomous bug-fixing in just four weeks—was “fucking crazy.”

For many, this update renders local-first agent frameworks obsolete. BentoBoi (@BentoBoiNFT) observed, “Claude just killed OpenClaw with this update. You no longer need to buy a Mac Mini. I say this as someone who owns a one lol,” referring to the common practice of developers buying dedicated hardware to run open-source agents like OpenClaw 24/7. By moving this persistence into the Claude Code environment, Anthropic has removed much of the “hardware tax” of autonomy.

AI YouTuber Matthew Berman summarized the shift succinctly: “They’ve BUILT OpenClaw.”

The consensus among early adopters is that Anthropic has successfully internalized the most desirable features of the open-source movement—multi-channel support and long-term memory—while maintaining the reliability of a tier-one AI provider.

While Anthropic’s Claude has long been a favorite for its reasoning, it remained a “brain in a jar”—a stateless entity that waited for a user to type before it could think. Meanwhile, open-source projects like OpenClaw thrived by offering “always-on” persistence, allowing developers to message their AI from Telegram or Discord to trigger complex workflows.

Now, with Anthropic closing the gap, it’s up to the users to choose which approach is best for them.