
Nvidia’s Nemotron-Cascade 2 wins math and coding gold medals with 3B active parameters — and its post-training recipe is now open-source

The prevailing assumption in AI development has been straightforward: larger models trained on more data produce better results. Nvidia’s latest release directly challenges that size assumption — and the training recipe behind it may matter more to enterprise AI teams than the model itself. The open-weight model’s Cascade RL post-training pipeline, detailed in Nvidia’s technical report, offers a reproducible blueprint for enterprise teams building domain-specific reasoning systems without training from scratch.

Nemotron-Cascade 2 is an open-weight 30B Mixture-of-Experts (MoE) model that activates only 3B parameters at inference time. Despite this compact footprint, it achieved gold medal-level performance on three of the world’s most demanding competitions: the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals. It is the second open model to reach this tier, after DeepSeek-V3.2-Speciale — a model with 20 times more parameters.

Why post-training is becoming the real competitive advantage

Pre-training a large language model from scratch is enormously expensive — on the order of tens to possibly hundreds of millions of dollars for frontier models. Nemotron-Cascade 2 starts from the same base model as Nvidia’s existing Nemotron-3-Nano — yet it outperforms that model on nearly every benchmark, and in many cases outperforms Nvidia’s own Nemotron-3-Super, a model with four times the active parameters, according to Nvidia’s technical report. The difference is entirely in the post-training recipe.

This is the strategic insight for enterprise teams: You don’t necessarily need a bigger or more expensive base model. You may need a better training pipeline on top of the one you already have. Cascade RL and MOPD represent a specific, reproducible approach to that problem.

Cascade RL explained: sequential domain training that avoids catastrophic forgetting

Reinforcement learning (RL) has become the dominant technique for teaching LLMs to reason. The challenge is that training a model on multiple domains simultaneously — math, code, instruction-following, agentic tasks — often causes interference. Improving performance in one domain degrades it in another. This is the problem of catastrophic forgetting, a long-documented challenge in multi-task machine learning.

Cascade RL addresses this by training RL stages sequentially, one domain at a time, rather than mixing everything together. Nemotron-Cascade 2 follows a specific ordering: first instruction-following RL, then multi-domain RL (covering STEM questions, tool calling, and structured output), then on-policy distillation, then RLHF for human preference alignment, then long-context RL, then code RL, and finally software engineering RL.
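The stage ordering can be pictured as a plain sequential loop that keeps every intermediate checkpoint. This is an illustrative sketch only, not Nvidia's implementation: `train_stage` is a hypothetical stand-in for a real RL trainer, the stage labels are made up from the description above, and a "checkpoint" here is just the list of stages applied so far.

```python
from typing import Callable, List

# Stage order as described in the article; the labels are illustrative.
CASCADE_STAGES: List[str] = [
    "instruction_following_rl",
    "multi_domain_rl",
    "on_policy_distillation",
    "rlhf",
    "long_context_rl",
    "code_rl",
    "software_engineering_rl",
]


def train_stage(checkpoint: List[str], stage: str) -> List[str]:
    """Hypothetical placeholder: a real trainer would run RL on one domain."""
    return checkpoint + [stage]


def run_cascade(
    start: List[str],
    stages: List[str],
    trainer: Callable[[List[str], str], List[str]] = train_stage,
) -> List[List[str]]:
    """Train one stage at a time and keep every intermediate checkpoint."""
    checkpoints = [start]
    for stage in stages:
        checkpoints.append(trainer(checkpoints[-1], stage))
    return checkpoints


history = run_cascade(["sft"], CASCADE_STAGES)
print(len(history))     # starting checkpoint plus one per stage
print(history[-1][-1])  # the last stage applied
```

Keeping the full `history` rather than only the final checkpoint matters because later steps in the pipeline can reuse intermediate snapshots.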

Three properties make this approach practical, according to Nvidia’s technical report. First, domain-specific RL stages turn out to be resistant to catastrophic forgetting — training on code rarely degrades math performance, and in some cases actually improves it. Second, because each stage trains on a single domain, hyperparameters and the training curriculum can be tailored to that domain’s specific characteristics, enabling better learning overall. Third, because responses within a single domain tend to be similar in length and verification cost, compute utilization is substantially more efficient than mixed-domain training.

The ordering itself is not fixed; it depends on the model’s behavior. The Nemotron-Cascade 2 team found that instruction-following RL should come first (because it can conflict with human preference alignment, which can be recovered later), while code RL and software engineering RL work best as the final stages, according to the report.

For enterprise teams, the implication is straightforward: If you are applying RL to improve a model across multiple capabilities, training them sequentially with careful ordering may give you better results than trying to train everything at once.

MOPD: reusing your own training checkpoints as teachers

Even with careful sequential ordering, some performance drift is inevitable as the model passes through many RL stages. Nvidia’s solution is Multi-Domain On-Policy Distillation (MOPD) — a technique inserted partway through the Cascade RL pipeline to rebalance capabilities.

The approach works as follows: As the model passes through different RL stages, some intermediate checkpoints will be the best-performing version for specific domains. The math checkpoint might be strongest after SFT; the instruction-following checkpoint might be strongest after IF-RL. MOPD selects the best intermediate checkpoint for each domain and uses it as a “teacher” to distill knowledge back into the student model.

Critically, these teachers are not external models. They come from the same training run, sharing the same tokenizer and architecture. This eliminates distribution mismatch problems that arise when distilling from a completely different model family.
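The teacher-selection step amounts to an argmax over checkpoints for each domain. The sketch below assumes evaluation scores are already available per checkpoint and per domain; the function name, checkpoint names, and numbers are hypothetical and are not taken from the technical report.

```python
from typing import Dict


def select_teachers(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each domain, pick the checkpoint with the highest eval score.

    `scores` maps checkpoint name -> {domain: score}.
    """
    teachers: Dict[str, str] = {}
    best: Dict[str, float] = {}
    for ckpt, per_domain in scores.items():
        for domain, score in per_domain.items():
            if domain not in best or score > best[domain]:
                best[domain] = score
                teachers[domain] = ckpt
    return teachers


# Made-up numbers purely for illustration.
example_scores = {
    "after_sft": {"math": 0.81, "instruction_following": 0.70},
    "after_if_rl": {"math": 0.78, "instruction_following": 0.86},
}
print(select_teachers(example_scores))
```

Each selected checkpoint would then act as the teacher for its own domain during distillation; the distillation loss itself is not shown here.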

According to Nvidia’s technical report, MOPD works at the token level rather than the sequence level, so every generated token receives a training signal instead of a single reward per response. That makes it substantially more sample-efficient than RL with outcome-based rewards such as GRPO (Group Relative Policy Optimization). The Nvidia team reports that on the AIME 2025 math benchmark, MOPD recovered teacher-level performance within 30 optimization steps, while standard GRPO needed more steps and still reached a lower score. On the ArenaHard benchmark for human preference alignment, MOPD reached 85.5 on hard prompts in 52 steps versus RLHF’s 80.7 in 160 steps.

The benchmark picture: dominant in reasoning, honest about trade-offs

The results on reasoning-intensive benchmarks are striking. On LiveCodeBench v6, a coding benchmark with problems from competitive programming platforms, Nemotron-Cascade 2 scores 87.2 — surpassing Qwen3.5-35B-A3B (74.6), Qwen3.5-397B-A17B (83.6), and even Kimi-K2.5-1T (85.0). On HMMT February 2025, a rigorous math competition benchmark, it scores 94.6, neck-and-neck with models many times its size. On ArenaHard v2 for alignment quality, it reaches 83.5, well ahead of competitors in its class. With tool-integrated reasoning enabled, AIME 2025 performance climbs to 98.6. All benchmark scores are self-reported by Nvidia and have not been independently verified.

The technical report is also candid about weaknesses. The model underperforms Qwen3.5-35B-A3B on knowledge-intensive benchmarks like MMLU-Pro (79.8 vs. 85.3) and GPQA-Diamond (76.1 vs. 84.2), as well as on several agentic benchmarks like BFCL v4 and τ²-Bench. The authors explicitly note that stronger knowledge-intensive pre-training and agentic RL are needed in future work.

This honesty matters for practitioners. The model is optimized for deep reasoning and instruction-following — not general knowledge retrieval or complex multi-turn agent interactions. Teams should evaluate against their specific use case, not assume blanket superiority.

What enterprise AI teams can take from this recipe

Several design patterns from this work are directly applicable to enterprise post-training efforts. The sequential domain ordering in Cascade RL means teams can add new capabilities without rebuilding the entire pipeline — a critical property for organizations that need to iterate quickly. MOPD’s approach of using intermediate checkpoints as domain-specific teachers eliminates the need for expensive external teacher models; teams can distill from their own best-performing snapshots.

The training setup is also notable: Cascade RL uses GRPO with strict on-policy training and no KL penalty, implemented with Nvidia’s open-source NeMo-RL repository. For code RL, the pipeline used only 3,500 difficult, filtered problems.

The bigger picture: intelligence density as a design principle

Nemotron-Cascade 2 is part of a broader trend toward “intelligence density” — extracting maximum capability per active parameter. DeepSeek’s MoE models, Qwen’s A3B variants, and now Nvidia’s Cascade series all point toward a future where the most capable reasoning models are not necessarily the largest.

For enterprise deployment, this matters enormously. A model with 3B active parameters can be served at a fraction of the cost and latency of a dense 70B model. Nvidia’s results suggest that post-training techniques like Cascade RL and MOPD can close the performance gap on targeted domains — giving organizations a path to deploy strong reasoning capabilities without frontier-level infrastructure costs.

The open question is how far this approach can be generalized. Cascade RL works well for domains with verifiable rewards — math has correct answers, code has test cases, instruction-following has rule-based checkers. Extending it to more open-ended enterprise tasks, where verification is ambiguous, remains an active research challenge. For teams building systems that need deep reasoning on structured problems — financial modeling, scientific computing, software engineering, compliance analysis — Nvidia’s technical report offers one of the more detailed post-training methodologies published to date.


Testing autonomous agents (Or: how I learned to stop worrying and embrace chaos)

Look, we’ve spent the last 18 months building production AI systems, and we’ll tell you what keeps us up at night — and it’s not whether the model can answer questions. That’s table stakes now. What haunts us is the mental image of an agent autonomously approving a six-figure vendor contract at 2 a.m. because someone typo’d a config file.

We’ve moved past the era of “ChatGPT wrappers” (thank God), but the industry still treats autonomous agents like they’re just chatbots with API access. They’re not. When you give an AI system the ability to take actions without human confirmation, you’re crossing a fundamental threshold. You’re not building a helpful assistant anymore — you’re building something closer to an employee. And that changes everything about how we need to engineer these systems.

The autonomy problem nobody talks about

Here’s what’s wild: We’ve gotten really good at making models that *sound* confident. But confidence and reliability aren’t the same thing, and the gap between them is where production systems go to die.

We learned this the hard way during a pilot program where we let an AI agent manage calendar scheduling across executive teams. Seems simple, right? The agent could check availability, send invites, handle conflicts. Except, one Monday morning, it rescheduled a board meeting because it interpreted “let’s push this if we need to” in a Slack message as an actual directive. The model wasn’t wrong in its interpretation — it was plausible. But plausible isn’t good enough when you’re dealing with autonomy.

That incident taught us something crucial: The challenge isn’t building agents that work most of the time. It’s building agents that fail gracefully, know their limitations, and have the circuit breakers to prevent catastrophic mistakes.

What reliability actually means for autonomous systems

[Figure: Layered reliability architecture]

When we talk about reliability in traditional software engineering, we’ve got decades of patterns: Redundancy, retries, idempotency, graceful degradation. But AI agents break a lot of our assumptions.

Traditional software fails in predictable ways. You can write unit tests. You can trace execution paths. With AI agents, you’re dealing with probabilistic systems making judgment calls. A bug isn’t just a logic error—it’s the model hallucinating a plausible-sounding but completely fabricated API endpoint, or misinterpreting context in a way that technically parses but completely misses the human intent.

So what does reliability look like here? In our experience, it’s a layered approach.

Layer 1: Model selection and prompt engineering

This is foundational but insufficient. Yes, use the best model you can afford. Yes, craft your prompts carefully with examples and constraints. But don’t fool yourself into thinking that a great prompt is enough. We’ve seen too many teams ship “GPT-4 with a really good system prompt” and call it enterprise-ready.

Layer 2: Deterministic guardrails

Before the model does anything irreversible, run it through hard checks. Is it trying to access a resource it shouldn’t? Is the action within acceptable parameters? We’re talking old-school validation logic — regex, schema validation, allowlists. It’s not sexy, but it’s effective.

One pattern that’s worked well for us: Maintain a formal action schema. Every action an agent can take has a defined structure, required fields, and validation rules. The agent proposes actions in this schema, and we validate before execution. If validation fails, we don’t just block it — we feed the validation errors back to the agent and let it try again with context about what went wrong.
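A minimal sketch of the propose, validate, and retry loop described above. Everything here is an assumption made for illustration: the `ActionSchema` shape, the `send_invite` action, the 120-minute rule, and `fake_agent`, which stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class ActionSchema:
    name: str
    required: Dict[str, type]                     # field name -> expected type
    rules: List[Callable[[Dict[str, Any]], str]]  # each returns "" or an error


def validate(action: Dict[str, Any], schema: ActionSchema) -> List[str]:
    """Return readable errors; an empty list means the action may run."""
    errors = []
    for fld, typ in schema.required.items():
        if fld not in action:
            errors.append(f"missing field: {fld}")
        elif not isinstance(action[fld], typ):
            errors.append(f"{fld} must be {typ.__name__}")
    if not errors:  # only apply business rules once the shape is right
        errors += [msg for rule in schema.rules if (msg := rule(action))]
    return errors


def propose_until_valid(propose, schema, max_attempts=3):
    """Feed validation errors back to the proposer until it passes or gives up."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        action = propose(feedback)
        feedback = validate(action, schema)
        if not feedback:
            return action
    return None  # caller should escalate to a human


# Hypothetical schema for a calendar invite.
invite = ActionSchema(
    name="send_invite",
    required={"attendee": str, "duration_minutes": int},
    rules=[
        lambda a: ""
        if 0 < a["duration_minutes"] <= 120
        else "duration_minutes must be between 1 and 120"
    ],
)


def fake_agent(feedback):
    # Stand-in for an LLM call: corrects its proposal once it sees errors.
    if not feedback:
        return {"attendee": "cfo@example.com", "duration_minutes": 600}
    return {"attendee": "cfo@example.com", "duration_minutes": 60}


print(propose_until_valid(fake_agent, invite))
```

Returning `None` after a bounded number of attempts keeps the retry loop itself from becoming a runaway behavior.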

Layer 3: Confidence and uncertainty quantification

Here’s where it gets interesting. We need agents that know what they don’t know. We’ve been experimenting with agents that can explicitly reason about their confidence before taking actions. Not just a probability score, but actual articulated uncertainty: “I’m interpreting this email as a request to delay the project, but the phrasing is ambiguous and could also mean…”

This doesn’t prevent all mistakes, but it creates natural breakpoints where you can inject human oversight. High-confidence actions go through automatically. Medium-confidence actions get flagged for review. Low-confidence actions get blocked with an explanation.
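The three-way routing can be sketched as a small threshold function. The cutoffs (0.9 and 0.6) and the assumption that confidence arrives as a single number between 0 and 1 are illustrative choices, not values from the authors' system.

```python
def route_by_confidence(
    confidence: float, auto_at: float = 0.9, review_at: float = 0.6
) -> str:
    """Map a confidence estimate to one of three handling paths."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence >= auto_at:
        return "execute"
    if confidence >= review_at:
        return "flag_for_review"
    return "block"


for c in (0.95, 0.7, 0.2):
    print(c, route_by_confidence(c))
```

In practice the thresholds would likely differ per action type, with riskier actions requiring a higher bar for automatic execution.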

Layer 4: Observability and auditability

[Figure: Action Validation Pipeline]

If you can’t debug it, you can’t trust it. Every decision the agent makes needs to be loggable, traceable, and explainable. Not just “what action did it take” but “what was it thinking, what data did it consider, what was the reasoning chain?”

We’ve built a custom logging system that captures the full large language model (LLM) interaction — the prompt, the response, the context window, even the model temperature settings. It’s verbose as hell, but when something goes wrong (and it will), you need to be able to reconstruct exactly what happened. Plus, this becomes your dataset for fine-tuning and improvement.
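One way to capture that information is a single structured record per model interaction. This is a sketch under assumptions: the field names and the `build_trace_record` helper are hypothetical and do not describe the authors' logging system.

```python
import json
import time
import uuid
from typing import Any, Dict, List


def build_trace_record(
    prompt: str,
    response: str,
    context: List[str],
    temperature: float,
    action: Dict[str, Any],
) -> str:
    """Serialize one LLM interaction as a JSON line for later reconstruction."""
    record = {
        "trace_id": str(uuid.uuid4()),  # lets related log lines be joined later
        "timestamp": time.time(),
        "prompt": prompt,
        "response": response,
        "context": context,
        "temperature": temperature,
        "action": action,
    }
    return json.dumps(record, sort_keys=True)


line = build_trace_record(
    prompt="Reschedule the sync?",
    response="Proposing Tuesday 10:00.",
    context=["calendar:team-sync"],
    temperature=0.2,
    action={"type": "propose_time", "slot": "Tue 10:00"},
)
print(line)
```

Writing one JSON object per line keeps the log easy to grep and easy to load as a dataset later.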

Guardrails: The art of saying no

Let’s talk about guardrails, because this is where engineering discipline really matters. A lot of teams approach guardrails as an afterthought — “we’ll add some safety checks if we need them.” That’s backwards. Guardrails should be your starting point.

We think of guardrails in three categories.

Permission boundaries

What is the agent physically allowed to do? This is your blast radius control. Even if the agent hallucinates the worst possible action, what’s the maximum damage it can cause?

We use a principle called “graduated autonomy.” New agents start with read-only access. As they prove reliable, they graduate to low-risk writes (creating calendar events, sending internal messages). High-risk actions (financial transactions, external communications, data deletion) either require explicit human approval or are simply off-limits.

One technique that’s worked well: Action cost budgets. Each agent has a daily “budget” denominated in some unit of risk or cost. Reading a database record costs 1 unit. Sending an email costs 10. Initiating a vendor payment costs 1,000. The agent can operate autonomously until it exhausts its budget; then, it needs human intervention. This creates a natural throttle on potentially problematic behavior.
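A minimal sketch of an action cost budget, using the example costs from the paragraph above (1, 10, and 1,000 units). The class name, action names, and daily limit are hypothetical, and a real system would also need to reset the budget each day and persist it across restarts.

```python
class ActionBudget:
    # Example costs from the text; action names are illustrative.
    COSTS = {"read_record": 1, "send_email": 10, "vendor_payment": 1000}

    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self.spent = 0

    def try_spend(self, action: str) -> bool:
        """Charge the action to the budget; False means a human must step in."""
        cost = self.COSTS.get(action)
        if cost is None:
            return False  # unknown actions are never run autonomously
        if self.spent + cost > self.daily_limit:
            return False
        self.spent += cost
        return True


budget = ActionBudget(daily_limit=25)
for name in ["read_record", "send_email", "send_email", "send_email", "vendor_payment"]:
    print(name, budget.try_spend(name))
```

With a limit of 25, the third email and the vendor payment are refused while cheap reads can continue, which is the throttling effect the budget is meant to produce.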

[Figure: Graduated Autonomy and Action Cost Budget]

Semantic boundaries

What should the agent understand as in-scope vs out-of-scope? This is trickier because it’s conceptual, not just technical.

We’ve found that explicit domain definitions help a lot. Our customer service agent has a clear mandate: handle product questions, process returns, escalate complaints. Anything outside that domain (someone asking for investment advice, technical support for third-party products, personal favors) gets a polite deflection and escalation.

The challenge is making these boundaries robust to prompt injection and jailbreaking attempts. Users will try to convince the agent to help with out-of-scope requests. Other parts of the system might inadvertently pass instructions that override the agent’s boundaries. You need multiple layers of defense here.

Operational boundaries

How much can the agent do, and how fast? This is your rate limiting and resource control.

We’ve implemented hard limits on everything: API calls per minute, maximum tokens per interaction, maximum cost per day, maximum number of retries before human escalation. These might seem like artificial constraints, but they’re essential for preventing runaway behavior.

We once saw an agent get stuck in a loop trying to resolve a scheduling conflict. It kept proposing times, getting rejections, and trying again. Without rate limits, it sent 300 calendar invites in an hour. With proper operational boundaries, it would’ve hit a threshold and escalated to a human after attempt number 5.
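The attempt cap from this example can be sketched as a bounded loop that escalates instead of retrying forever. The function and callback names are hypothetical; `propose_time` and `is_accepted` stand in for the agent's proposal step and the attendees' responses.

```python
def resolve_with_limit(propose_time, is_accepted, max_attempts=5):
    """Try up to max_attempts proposals, then hand off to a human."""
    for attempt in range(1, max_attempts + 1):
        slot = propose_time(attempt)
        if is_accepted(slot):
            return {"status": "scheduled", "slot": slot, "attempts": attempt}
    return {"status": "escalated", "slot": None, "attempts": max_attempts}


# Every proposal is rejected, so the loop stops after five tries.
result = resolve_with_limit(lambda n: f"slot-{n}", lambda slot: False)
print(result)
```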

Agents need their own style of testing

Traditional software testing doesn’t cut it for autonomous agents. You can’t just write test cases that cover all the edge cases, because with LLMs, everything is an edge case.

What’s worked for us:

Simulation environments

Build a sandbox that mirrors production but with fake data and mock services. Let the agent run wild. See what breaks. We do this continuously — every code change goes through 100 simulated scenarios before it touches production.

The key is making scenarios realistic. Don’t just test happy paths. Simulate angry customers, ambiguous requests, contradictory information, system outages. Throw in some adversarial examples. If your agent can’t handle a test environment where things go wrong, it definitely can’t handle production.
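A toy harness for this kind of scenario run might look like the following. The scenario kinds come from the list above; `make_scenarios`, `run_suite`, and `toy_agent` are hypothetical names, and a real sandbox would drive mock services rather than return a boolean.

```python
import random

SCENARIO_KINDS = [
    "happy_path",
    "angry_customer",
    "ambiguous_request",
    "contradictory_info",
    "system_outage",
    "adversarial",
]


def make_scenarios(n, seed=0):
    """Generate n scenarios with a fixed seed so runs are reproducible."""
    rng = random.Random(seed)
    return [{"id": i, "kind": rng.choice(SCENARIO_KINDS)} for i in range(n)]


def run_suite(agent, scenarios):
    """Run the agent on every scenario; crashes count as failures."""
    failures = []
    for scenario in scenarios:
        try:
            ok = agent(scenario)
        except Exception:
            ok = False
        if not ok:
            failures.append(scenario)
    return failures


def toy_agent(scenario):
    # Stand-in for the real agent: pretends it cannot cope with outages.
    return scenario["kind"] != "system_outage"


scenarios = make_scenarios(100)
failures = run_suite(toy_agent, scenarios)
print(f"{len(failures)} of {len(scenarios)} scenarios failed")
```

Fixing the seed means a failing scenario can be replayed exactly after a code change.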

Red teaming

Get creative people to try to break your agent. Not just security researchers, but domain experts who understand the business logic. Some of our best improvements came from sales team members who tried to “trick” the agent into doing things it shouldn’t.

Shadow mode

Before you go live, run the agent in shadow mode alongside humans. The agent makes decisions, but humans actually execute the actions. You log both the agent’s choices and the human’s choices, and you analyze the delta.

This is painful and slow, but it’s worth it. You’ll find all kinds of subtle misalignments you’d never catch in testing. Maybe the agent technically gets the right answer, but with phrasing that violates company tone guidelines. Maybe it makes legally correct but ethically questionable decisions. Shadow mode surfaces these issues before they become real problems.
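Analyzing the delta can start with something as simple as a disagreement rate over logged decision pairs. This sketch assumes each log entry is a `(case_id, agent_choice, human_choice)` tuple; the format and the sample data are hypothetical.

```python
def shadow_delta(pairs):
    """Return the disagreement rate and the cases where agent and human differ."""
    disagreements = [(cid, a, h) for cid, a, h in pairs if a != h]
    rate = len(disagreements) / len(pairs) if pairs else 0.0
    return rate, disagreements


# Made-up log entries for illustration.
log = [
    ("c1", "refund", "refund"),
    ("c2", "escalate", "escalate"),
    ("c3", "refund", "escalate"),
    ("c4", "reply", "reply"),
]
rate, diffs = shadow_delta(log)
print(rate, diffs)
```

Exact-match comparison only catches differences in the chosen action; issues such as tone or phrasing would still need human review of the disagreement list and a sample of the agreements.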

The human-in-the-loop pattern

[Figure: Three Human-in-the-Loop Patterns]

Despite all the automation, humans remain essential. The question is: Where in the loop?

We’re increasingly convinced that “human-in-the-loop” is actually several distinct patterns:

Human-on-the-loop: The agent operates autonomously, but humans monitor dashboards and can intervene. This is your steady-state for well-understood, low-risk operations.

Human-in-the-loop: The agent proposes actions, humans approve them. This is your training wheels mode while the agent proves itself, and your permanent mode for high-risk operations.

Human-with-the-loop: Agent and human collaborate in real-time, each handling the parts they’re better at. The agent does the grunt work, the human does the judgment calls.

The trick is making these transitions smooth. An agent shouldn’t feel like a completely different system when you move from autonomous to supervised mode. Interfaces, logging, and escalation paths should all be consistent.

Failure modes and recovery

Let’s be honest: Your agent will fail. The question is whether it fails gracefully or catastrophically.

We classify failures into three categories:

Recoverable errors: The agent tries to do something, it doesn’t work, the agent realizes it didn’t work and tries something else. This is fine. This is how complex systems operate. As long as the agent isn’t making things worse, let it retry with exponential backoff.
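Retrying with exponential backoff can be sketched as follows. The defaults (four attempts, a 0.5-second initial delay, doubling each time) are illustrative assumptions, and the `sleep` parameter is injectable only so the example can run without actually waiting.

```python
import time


def retry_with_backoff(
    operation, max_attempts=4, base_delay=0.5, factor=2.0, sleep=time.sleep
):
    """Call operation until it succeeds, waiting longer after each failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error for escalation
            sleep(delay)
            delay *= factor


# Demo: fail twice, then succeed, recording delays instead of sleeping.
attempts = {"n": 0}
waits = []


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


print(retry_with_backoff(flaky, sleep=waits.append), waits)
```

Production code would usually add random jitter to the delay and retry only on errors known to be transient; both are omitted here for brevity.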

Detectable failures: The agent does something wrong, but monitoring systems catch it before significant damage occurs. This is where your guardrails and observability pay off. The agent gets rolled back, humans investigate, you patch the issue.

Undetectable failures: The agent does something wrong, and nobody notices until much later. These are the scary ones. Maybe it’s been misinterpreting customer requests for weeks. Maybe it’s been making subtly incorrect data entries. These accumulate into systemic issues.

The defense against undetectable failures is regular auditing. We randomly sample agent actions and have humans review them. Not just pass/fail, but detailed analysis. Is the agent showing any drift in behavior? Are there patterns in its mistakes? Is it developing any concerning tendencies?
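Random sampling for human review can be as simple as drawing a fixed fraction of logged action IDs. The 5% rate and the function name are assumptions chosen for illustration, not figures from the authors.

```python
import random


def sample_for_audit(action_ids, rate=0.05, seed=None):
    """Pick a random subset of actions for human review (at least one if any exist)."""
    rng = random.Random(seed)
    k = max(1, round(len(action_ids) * rate)) if action_ids else 0
    return rng.sample(action_ids, k)


picked = sample_for_audit(list(range(200)), rate=0.05, seed=1)
print(len(picked), sorted(picked))
```

Uniform sampling gives an unbiased view of overall behavior; teams that worry about rare, high-risk actions may want to oversample those separately.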

The cost-performance tradeoff

Here’s something nobody talks about enough: reliability is expensive.

Every guardrail adds latency. Every validation step costs compute. Multiple model calls for confidence checking multiply your API costs. Comprehensive logging generates massive data volumes.

You have to be strategic about where you invest. Not every agent needs the same level of reliability. A marketing copy generator can be looser than a financial transaction processor. A scheduling assistant can retry more liberally than a code deployment system.

We use a risk-based approach. High-risk agents get all the safeguards, multiple validation layers, extensive monitoring. Lower-risk agents get lighter-weight protections. The key is being explicit about these trade-offs and documenting why each agent has the guardrails it does.

Organizational challenges

We’d be remiss if we didn’t mention that the hardest parts aren’t technical — they’re organizational.

Who owns the agent when it makes a mistake? Is it the engineering team that built it? The business unit that deployed it? The person who was supposed to be supervising it?

How do you handle edge cases where the agent’s logic is technically correct but contextually inappropriate? If the agent follows its rules but violates an unwritten norm, who’s at fault?

What’s your incident response process when an agent goes rogue? Traditional runbooks assume human operators making mistakes. How do you adapt these for autonomous systems?

These questions don’t have universal answers, but they need to be addressed before you deploy. Clear ownership, documented escalation paths, and well-defined success metrics are just as important as the technical architecture.

Where we go from here

The industry is still figuring this out. There’s no established playbook for building reliable autonomous agents. We’re all learning in production, and that’s both exciting and terrifying.

What we know for sure: The teams that succeed will be the ones who treat this as an engineering discipline, not just an AI problem. You need traditional software engineering rigor — testing, monitoring, incident response — combined with new techniques specific to probabilistic systems.

You need to be paranoid but not paralyzed. Yes, autonomous agents can fail in spectacular ways. But with proper guardrails, they can also handle enormous workloads with superhuman consistency. The key is respecting the risks while embracing the possibilities.

We’ll leave you with this: Every time we deploy a new autonomous capability, we run a pre-mortem. We imagine it’s six months from now and the agent has caused a significant incident. What happened? What warning signs did we miss? What guardrails failed?

This exercise has saved us more times than we can count. It forces you to think through failure modes before they occur, to build defenses before you need them, to question assumptions before they bite you.

Because in the end, building enterprise-grade autonomous AI agents isn’t about making systems that work perfectly. It’s about making systems that fail safely, recover gracefully, and learn continuously.

And that’s the kind of engineering that actually matters.

Madhvesh Kumar is a principal engineer. Deepika Singh is a senior software engineer.

Views expressed are based on hands-on experience building and deploying autonomous agents, along with the occasional 3 AM incident response that makes you question your career choices.


Anthropic just shipped an OpenClaw killer called Claude Code Channels, letting you message it over Telegram and Discord

The hit open source autonomous AI agent OpenClaw may have just gotten mogged by Anthropic.

Today, Anthropic announced Claude Code Channels, a way to connect its Claude Code agentic harness to a user’s Discord or Telegram account, so they can message Claude Code directly while on the go and instruct it to write code for them. Anthropic has published official documentation for the feature.

This isn’t just a new UI; it is a fundamental shift in how developers interact with AI agents, moving from a synchronous “ask-and-wait” model to an asynchronous, autonomous partnership. Previously, Claude Code users could reach the agentic harness only through the Claude desktop application, a terminal, or a supported developer environment, or from the Claude mobile app through a somewhat flaky (in my experience) interconnection setting called Remote Control.

Now, Anthropic is offering some of the same core functionality that drove OpenClaw’s rapid adoption among software developers and vibe coders following its release in November 2025 by Austrian developer Peter Steinberger. Ironically, Steinberger originally called his project “Clawd” in honor of Anthropic’s own AI model Claude, which powered it initially, until Anthropic sent him a cease-and-desist for potential trademark violations. Steinberger has since been hired by Anthropic’s rival OpenAI.

Central to OpenClaw’s appeal was its capability of allowing users to have a persistent, personal AI worker that they can message 24/7, whenever they feel like, over common messaging apps such as iMessage, Slack, Telegram, WhatsApp and Discord, and have their AI message them back — not just to chat with, but to perform real work for them on its own, from writing, sending and organizing email and files to creating whole applications, applying for jobs on the user’s behalf, to managing complete ongoing social marketing campaigns. When the AI finishes a task, it can immediately alert the human user over their preferred messaging platform.

But OpenClaw also came with a high degree of security risk (since it could be given access to a user’s hard drive and file system, or other personal information, and run amok) and difficulty for non-technical users, inspiring a wave of offshoots promising greater ease and security, including NanoClaw, KiloClaw and Nvidia’s recently announced NemoClaw.

By giving Claude Code this same basic functionality — the ability for users to message it from popular third-party apps Discord and Telegram, and have it message them back when it finishes a task — Anthropic has effectively countered OpenClaw’s appeal and offered something it does not: the Anthropic brand name with its commitment to AI security and safety, and ease of use right out of the box for less technically inclined users.

Technology: The Bridge of the Model Context Protocol

At the heart of this update is the Model Context Protocol (MCP) open source standard that Anthropic introduced back in 2024. Think of MCP as a universal USB-C port for AI: it provides a standardized way for an AI model to connect to external data and tools. In the new “Channels” architecture, an MCP server acts as a two-way bridge.

When a developer starts a Claude Code session with the --channels flag, they aren’t just opening a chat; they are spinning up a polling service.

Using the Bun runtime—known for its extreme speed in executing JavaScript—Claude Code monitors specific plugins (currently Telegram and Discord).

When a message arrives, it is injected directly into the active session as a <channel> event. Claude can then use its internal tools to execute code, run tests, or fix bugs, and reply back to the external platform using a specialized reply tool.

The technical achievement here is persistence. Unlike a standard web-chat that times out, a Claude Code session can now run in a background terminal or a persistent server (like a VPS), waiting for a “ping” to spring into action.

How to set up Claude Code Channels on Telegram and Discord

Setting up these native connectors requires Claude Code v2.1.80 or later and the Bun runtime installed on your desktop PC or Mac. Follow the instructions below.

1. Setting up Telegram

Create your Bot: Open BotFather in Telegram and use the /newbot command to generate a unique bot and access token.
Install the Plugin: Inside your Claude Code terminal, run: /plugin install telegram@claude-plugins-official
Configure the Token: Run /telegram:configure <your-token> to save your credentials.
Restart with Channels: Exit Claude and restart using the channel flag: claude --channels plugin:telegram@claude-plugins-official
Pair your Account: DM your new bot on Telegram to receive a pairing code, then enter it in your terminal: /telegram:access pair <code>

2. Setting up Discord

Create an Application: Go to the Discord Developer Portal, create a “New Application,” and reset the bot token to copy it.
Enable Intents: In the Bot settings, you must enable Message Content Intent under “Privileged Gateway Intents.”
Install and Configure: In Claude Code, run /plugin install discord@claude-plugins-official followed by /discord:configure <your-token>.
Launch and Pair: Restart with claude --channels plugin:discord@claude-plugins-official. DM your bot on Discord and use the /discord:access pair <code> command to finish the link.

Product: From Desktop to “Everywhere”

The immediate practical impact is the democratization of mobile AI coding. Previously, if a developer wanted to check a build status or run a quick fix while away from their desk, they had to rely on complex self-hosted setups like OpenClaw.

With Channels, the setup is native. A developer can create a Telegram bot via BotFather, link it to Claude Code with a /telegram:configure command, and “pair” their account with a security code. Once configured, the phone becomes a remote control for the development environment.

The product also introduces a “Fakechat” demo—a local-only chat UI that allows developers to test the “push” logic on their own machine before connecting to external servers. This reflects Anthropic’s cautious, “research preview” approach, ensuring developers understand the flow of events before exposing their terminal to the internet.

Licensing: Proprietary Power on Open Standards

The licensing implications of this release highlight a growing trend in the AI industry: proprietary engines running on open tracks. Claude Code remains a proprietary product tied to Anthropic’s commercial subscriptions (Pro, Max, and Enterprise).

However, by building on the open-source Model Context Protocol, Anthropic is encouraging a developer ecosystem to build the “connectors” that make their model more useful.

While the core Claude “brain” is closed, the plugins for Telegram and Discord are being hosted on GitHub under official Anthropic repositories, likely allowing for community contributions or forks.

This strategy allows Anthropic to maintain the security and quality of the model while benefiting from the rapid innovation of the open-source community—a direct challenge to the “free” but often fragmented nature of purely open-source agent frameworks.

And because it’s built on MCP, the community can now build “Connectors” for Slack or WhatsApp themselves, rather than waiting for Anthropic to ship them.

Community Reactions: ‘The OpenClaw Killer’

The response from users, especially AI observers on X, was swift and definitive. The sentiment was best captured by Ejaaz (@cryptopunk7213), who noted that Anthropic’s speed of shipping—incorporating texting, thousands of MCP skills, and autonomous bug-fixing in just four weeks—was “fucking crazy.”

For many, this update renders local-first agent frameworks obsolete. BentoBoi (@BentoBoiNFT) observed, “Claude just killed OpenClaw with this update. You no longer need to buy a Mac Mini. I say this as someone who owns a one lol,” referring to the common practice of developers buying dedicated hardware to run open-source agents like OpenClaw 24/7. By moving this persistence into the Claude Code environment, Anthropic has simplified the “hardware tax” for autonomy.

AI YouTuber Matthew Berman summarized the shift succinctly: “They’ve BUILT OpenClaw.”

The consensus among early adopters is that Anthropic has successfully internalized the most desirable features of the open-source movement—multi-channel support and long-term memory—while maintaining the reliability of a tier-one AI provider.

While Anthropic’s Claude has long been a favorite for its reasoning, it remained a “brain in a jar”—a stateless entity that waited for a user to type before it could think. Meanwhile, open-source projects like OpenClaw thrived by offering “always-on” persistence, allowing developers to message their AI from Telegram or Discord to trigger complex workflows.

Now, with Anthropic closing the gap, it’s up to the users to choose which approach is best for them.

Orchestration

Nvidia says it can shrink LLM memory 20x without changing model weights

Nvidia researchers have introduced a new technique that dramatically reduces how much memory large language models need to track conversation history — by as much as 20x — without modifying the model itself. The method, called KV Cache Transform Coding (KVTC), applies ideas from media compression formats like JPEG to shrink the key-value cache behind multi-turn AI systems, lowering GPU memory demands and speeding up time-to-first-token by up to 8x.

For enterprise AI applications that rely on agents and long contexts, this translates to reduced GPU memory costs, better prompt reuse, and up to an 8x reduction in latency by avoiding the need to recompute dropped KV cache values.

Serving large language models at scale requires managing a massive amount of data, especially for multi-turn conversations and long coding sessions. Every time a user adds to a prompt, the system relies on stored memory to avoid recomputing the entire conversation history from scratch.

However, this memory footprint grows rapidly, creating a severe bottleneck for latency and infrastructure costs.

Why KV cache becomes a bottleneck at scale

To power multi-turn AI applications like coding assistants or chat apps, large language models rely on a mechanism known as the key-value (KV) cache. This cache stores the hidden numerical representations for every previous token in a conversation. Because the model remembers the past conversation, it does not have to redundantly re-process the entire chat history each time the user submits a new prompt.

However, for AI applications with long context tasks, this cache can easily balloon to multiple gigabytes. As models scale up and generate increasingly long reasoning chains, the KV cache becomes a critical bottleneck for system throughput and latency.

This creates a difficult challenge for production environments. Because LLMs are highly memory-bound during inference, serving multiple users simultaneously is constrained by GPU memory exhaustion rather than computation time. “Effective KV cache management becomes critical, as idle caches must be quickly offloaded from GPU memory to accommodate other users, and quickly restored for resumed conversations,” Adrian Lancucki, Senior Deep Learning Engineer at Nvidia, told VentureBeat. “These infrastructure costs are now reflected in commercial pricing (e.g., as ‘prompt caching’) with additional charges for caching.”

Even compromise solutions, such as offloading the cache to lower-tier storage like CPU memory or SSDs, introduce significant data transfer overheads that can saturate network bandwidth and create bottlenecks.

One common solution is to compress the KV cache so that it takes up less memory. However, existing solutions often fall short of solving the problem holistically. Tools designed to compress caches for network transmission achieve low compression rates. Other compression methods require resource-intensive calculations on the fly for every single user prompt. Meanwhile, popular techniques like quantization or sparsification can introduce latency and accuracy drops or require making permanent changes to the model’s weights, which limits their practicality.

In their paper, the Nvidia researchers note that existing approaches “seldom exploit the strong low-rank structure of KV tensors.” This means that despite its huge number of dimensions and gigabytes of size, the actual underlying information in the KV cache is highly correlated and can be accurately represented using far fewer variables. Exploiting this characteristic is what KVTC focuses on.

Borrowing tricks from media codecs

At a high level, KVTC tackles the AI memory bottleneck by borrowing a proven concept from classical media: transform coding, the methodology that powers familiar image and video compression formats like JPEG. The framework shrinks the cache footprint through a fast, multi-step process that executes between inference phases to avoid slowing down the actual token generation. “This ‘media compression’ approach is advantageous for enterprise deployment because it is non-intrusive: it requires no changes to model weights or code and operates close to the transportation layer,” Lancucki said.

First, KVTC uses principal component analysis (PCA) to align the features of the KV cache data based on their importance. PCA is a statistical technique often used in machine learning to make models more efficient by isolating the most critical features of the data and stripping away redundancies. This part of the process is performed only once during an initial calibration phase for each model. Because the PCA alignment matrix is computed offline and reused, it does not slow down the compression process at inference time for individual user prompts.
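The calibrate-once, reuse-everywhere idea can be sketched in a few lines. This is a minimal illustration, not Nvidia's implementation: it uses made-up two-dimensional vectors in place of real KV tensors, finds only the leading principal direction with power iteration, and every function name here is hypothetical.

```python
import math
import random

def covariance(rows):
    """Return the mean vector and sample covariance matrix of the rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return mean, cov

def top_component(cov, iters=200):
    """Leading eigenvector of a symmetric matrix via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Offline calibration, done once: synthetic vectors (an assumption) whose two
# coordinates are strongly correlated, so one direction carries most variance.
rng = random.Random(0)
calib = []
for _ in range(500):
    t = rng.gauss(0, 1)
    calib.append([t + rng.gauss(0, 0.05), 2 * t + rng.gauss(0, 0.05)])
mean, cov = covariance(calib)
direction = top_component(cov)

def project(vec):
    """At inference time: reuse the stored basis, no recalibration needed."""
    return sum((x - m) * d for x, m, d in zip(vec, mean, direction))

def reconstruct(coeff):
    """Map a single coefficient back to the original two-dimensional space."""
    return [m + coeff * d for m, d in zip(mean, direction)]
```

In this toy case a single coefficient is enough to recover a two-dimensional vector almost exactly, which is the kind of redundancy the low-rank observation points to.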

Next, the system uses a dynamic programming algorithm to automatically budget how much memory each specific data dimension actually needs. The most critical principal components get high precision, while the trailing, less important components receive fewer bits or are assigned zero bits and dropped entirely.
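A bit-budgeting step of this kind can be illustrated with a small dynamic program. The sketch below rests on assumptions the article does not confirm: it uses the textbook model in which each extra bit cuts a component's quantization error by a factor of four, and the function name, variances and budget are invented for the example.

```python
def allocate_bits(variances, total_bits, max_bits=8):
    """Pick bits per component to minimise sum(var * 4**-bits) within a budget."""
    n = len(variances)
    inf = float("inf")
    # best[i][b]: lowest distortion using the first i components and exactly b bits
    best = [[inf] * (total_bits + 1) for _ in range(n + 1)]
    choice = [[0] * (total_bits + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for used in range(total_bits + 1):
            for bits in range(min(max_bits, used) + 1):
                prev = best[i - 1][used - bits]
                if prev == inf:
                    continue
                cost = prev + variances[i - 1] * 4.0 ** (-bits)
                if cost < best[i][used]:
                    best[i][used] = cost
                    choice[i][used] = bits
    # Use whichever total (up to the budget) gives the lowest distortion.
    used = min(range(total_bits + 1), key=lambda b: best[n][b])
    allocation = []
    for i in range(n, 0, -1):
        bits = choice[i][used]
        allocation.append(bits)
        used -= bits
    allocation.reverse()
    return allocation

# Leading components get more bits; the trailing one gets zero and is dropped.
print(allocate_bits([16.0, 4.0, 1.0, 0.01], total_bits=6))  # [3, 2, 1, 0]
```

With variances that fall off quickly, the allocation mirrors the behavior described above: precision concentrates on the first few components and the tail is discarded.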

Finally, the pipeline takes this optimized, quantized data and packs it into a byte array, running it through an entropy coder called DEFLATE. Because this step is executed in parallel directly on the GPU using Nvidia’s nvCOMP library, it operates at very high speeds.
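The pack-and-DEFLATE step can be approximated on a CPU with Python's standard zlib module, which implements DEFLATE. This is only a stand-in for the GPU path through nvCOMP described above; the 8-bit uniform quantizer, the step size and the function names are assumptions made for the sketch.

```python
import zlib

def pack_and_deflate(coeffs, step):
    """Quantise floats to signed 8-bit integers, pack to bytes, then DEFLATE."""
    quantised = [max(-128, min(127, round(c / step))) for c in coeffs]
    raw = bytes(q & 0xFF for q in quantised)  # two's-complement byte per value
    return zlib.compress(raw, 9)

def inflate_and_unpack(blob, step):
    """Reverse the pipeline: inflate, reinterpret bytes as signed, rescale."""
    raw = zlib.decompress(blob)
    return [(b - 256 if b > 127 else b) * step for b in raw]

# Mostly-zero coefficients, as left behind when trailing components are dropped.
coeffs = [0.0] * 1000 + [1.5, -2.25, 0.75]
blob = pack_and_deflate(coeffs, step=0.25)
restored = inflate_and_unpack(blob, step=0.25)
```

Because so many quantized values are identical, the entropy coder shrinks the byte array well below one byte per coefficient, while the round trip stays within half a quantization step of the input.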

To decompress the data when the user returns, KVTC simply performs the computations in reverse. To speed up the process, it performs the heavy lifting of decompression in chunks, layer-by-layer. This allows the AI model to begin computing the next response early using the first decompressed chunk while the subsequent chunks are being decompressed in the background.
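The overlap between decompression and compute might look roughly like the following. It is a sketch under stated assumptions: each layer is compressed as its own zlib stream, a small thread pool stands in for background GPU work, and `run_layer` is a hypothetical callback representing the model's per-layer computation.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def restore_and_run(blobs, run_layer, workers=2):
    """Decompress layers in the background while consuming them in order."""
    outputs = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Queue every layer for decompression up front.
        futures = [pool.submit(zlib.decompress, blob) for blob in blobs]
        for index, future in enumerate(futures):
            # Blocks only until this layer is ready; later layers keep
            # decompressing while run_layer works on the current one.
            outputs.append(run_layer(index, future.result()))
    return outputs

layers = [bytes([i]) * 1000 for i in range(4)]
blobs = [zlib.compress(layer) for layer in layers]
results = restore_and_run(blobs, lambda i, data: (i, len(data), data[0]))
```

The consumer starts on the first layer as soon as it is available instead of waiting for the whole cache to be restored.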

20x compression, less than 1% accuracy penalty

Nvidia researchers tested KVTC on a diverse roster of models ranging from 1.5B to 70B parameters, including the Llama 3 family, Mistral NeMo, and the reasoning-heavy R1-distilled Qwen 2.5 models. They evaluated these models on a variety of benchmarks, including complex math and coding challenges like MATH-500 and LiveCodeBench, as well as intensive long-context retrieval tasks like “Needle In A Haystack” and key-value retrieval.

They pitted KVTC against several popular baselines: token eviction methods (e.g., H2O and TOVA), heavy quantization techniques (e.g., KIVI and GEAR), and xKV (a prompt compression technique based on singular value decomposition).

At an effective 20x compression ratio, KVTC consistently stayed within one percentage point of the accuracy of the original, uncompressed models across most tasks. When researchers pushed the system to extreme limits of up to 32x and 64x compression, KVTC held its ground remarkably well.

By contrast, popular baselines like KIVI and GEAR began to suffer massive accuracy degradation at just a 5x compression ratio, particularly on long-context tasks. Standard cache eviction methods like H2O and TOVA proved entirely inadequate as generic compressors, effectively breaking down when asked to retrieve deep contextual information.

Consider the deployment of a smaller reasoning model like Qwen 2.5 1.5B for a coding assistant. Normally, this model requires 29 KB of memory for every single token. Using an 8x compression setting, KVTC shrank that footprint to roughly 3.2 KB per token, while suffering a negligible 0.3 percentage point drop in coding accuracy.
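To put those per-token figures in context, here is a quick back-of-the-envelope calculation. The 8,000-token session length is an illustrative assumption, and sizes use decimal units (1 MB = 1,000 KB).

```python
FULL_KB_PER_TOKEN = 29.0   # uncompressed figure quoted above
KVTC_KB_PER_TOKEN = 3.2    # compressed figure quoted above
SESSION_TOKENS = 8000      # hypothetical conversation length

def cache_size_mb(tokens, kb_per_token):
    """Total KV cache size in decimal megabytes."""
    return tokens * kb_per_token / 1000.0

full_mb = cache_size_mb(SESSION_TOKENS, FULL_KB_PER_TOKEN)
kvtc_mb = cache_size_mb(SESSION_TOKENS, KVTC_KB_PER_TOKEN)
implied_ratio = FULL_KB_PER_TOKEN / KVTC_KB_PER_TOKEN
```

Under these assumptions the cache for a single session drops from 232 MB to about 25.6 MB. The quoted per-token figures imply a ratio of roughly 9x, in the same range as the stated 8x setting.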

For enterprise architects, deciding when to deploy this technique depends heavily on the use case. “KVTC is optimized for long-context, multi-turn scenarios,” Lancucki said. He pointed to coding assistants, iterative agentic reasoning workflows — particularly when waiting for high-latency tool outputs — and iterative RAG as ideal applications. “However, the users should skip KVTC for short conversations,” he added, because the uncompressed sliding window of the newest tokens dominates the sequence in shorter interactions, preventing meaningful compression ratios.

KVTC is highly portable and an optimized implementation will soon be integrated into the KV Block Manager (KVBM) within the Dynamo framework, making it compatible with popular open-source inference engines like vLLM.

Most importantly for user experience, KVTC considerably reduces the time to first token (TTFT), the delay between sending a prompt and the model generating the first response token. On an 8,000-token prompt, a vanilla 12B model running on an Nvidia H100 GPU takes roughly 3 seconds to recompute the history from scratch. Meanwhile, a system can decompress the KVTC cache in just 380 milliseconds, delivering up to an 8x reduction in the time it takes to generate the first token.

Because KVTC does not alter how the model pays attention to tokens, it is theoretically compatible with token eviction methods like Dynamic Memory Sparsification (DMS), another advanced compression technique. DMS is an autoregressive token eviction method that optimizes memory by identifying and dropping the least important tokens from the context window entirely.

“In principle, KVTC is complementary to DMS,” Lancucki stated. “While DMS evicts individual tokens along the time axis, KVTC compresses the data at each position separately.” However, he cautioned that while they target different dimensions, “it remains to be tested what compression ratios can be achieved with KVTC on sparsified caches.”

As models continue to scale natively to multi-million token context windows, the need for robust memory management will only grow. “Given the structural similarities and recurring patterns in KV caches across various model architectures, the emergence of a dedicated, standardized compression layer is probable,” Lancucki said. Supported by hardware advancements, AI infrastructure could soon treat KV cache compression as an invisible, standardized layer, much like video compression is to streaming today.

Orchestration

Rethinking AEO when software agents navigate the web on behalf of users

For more than two decades, digital businesses have relied on a simple assumption: When someone interacts with a website, that activity reflects a human making a conscious choice. Clicks are treated as signals of interest. Time on page is assumed to ind…

DataDecisionMakers, Infrastructure, Orchestration, technology

How LinkedIn replaced five feed retrieval systems with one LLM model, at 1.3 billion-user scale

LinkedIn’s feed reaches more than 1.3 billion members — and the architecture behind it hadn’t kept pace. The system had accumulated five separate retrieval pipelines, each with its own infrastructure and optimization logic, serving different slices of what users might want to see. Engineers at the company spent the last year tearing that apart and replacing it with a single LLM-based system. The result, LinkedIn says, is a feed that understands professional context more precisely and costs less to run at scale.

The redesign touched three layers of the stack: how content is retrieved, how it’s ranked, and how the underlying compute is managed. Tim Jurka, vice president of engineering at LinkedIn, told VentureBeat the team ran hundreds of tests over the past year before reaching a milestone that, he says, reinvented a large chunk of its infrastructure.

“Starting from our entire system for retrieving content, we’ve moved over to using really large-scale LLMs to understand content much more richly on LinkedIn and be able to match it much in a much more personalized way to members,” Jurka said. “All the way to how we rank content, using really, really large sequence models, generative recommenders, and combining that end-to-end system to make things much more relevant and meaningful for members.”

One feed, 1.3 billion members

The core challenge, Jurka said, is two-sided: LinkedIn has to match members’ stated professional interests — their title, skills, industry — to their actual behavior over time, and it has to surface content that goes beyond what their immediate network is posting. Those two signals frequently pull in different directions.

People use LinkedIn in different ways: some look to connect with others in their industry, others prioritize thought leadership, and job seekers and recruiters use it to find roles and candidates.

How LinkedIn unified five pipelines into one

LinkedIn has spent more than 15 years building AI-driven recommendation systems, including prior work on job search and people search. LinkedIn’s feed, the one that greets you when you open the website, was built on a heterogeneous architecture, the company said in a blog post. Content fed to users came from various sources, including a chronological index of a user’s network, geographic trending topics, interest-based filtering, industry-specific content, and other embedding-based systems.

The company said this meant each source had its own infrastructure and optimization strategy. The approach worked, but maintenance costs soared. Jurka said using LLMs to scale out the new recommendation algorithm also meant updating the architecture around the feed.

“There’s a lot that goes into that, including how we maintain that kind of member context in a prompt, making sure we provide the right data to hydrate the model, profile data, recent activity data, etc,” he said. “The second is how you actually sample the most meaningful kind of data points to then fine-tune the LLM.”

LinkedIn tested different iterations of the data mix in an offline testing environment.

One of LinkedIn’s first hurdles in revamping its retrieval system revolved around converting its data into text for LLMs to process. To do this, LinkedIn built a prompt library that lets them create templated sequences. For posts, LinkedIn focused on format, author information, engagement counts, article metadata, and the post’s text. For members, they incorporated profile data, skills, work history, education and “a chronologically ordered sequence of posts they’ve previously engaged with.”
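A templated sequence of this kind could look something like the sketch below. LinkedIn's actual prompt library is not public in this article, so the field names, the bracketed labels and the output layout are all hypothetical; only the ingredients (post format, author, engagement, text, member profile, and a chronologically ordered engagement history) come from the description above.

```python
from dataclasses import dataclass, field

@dataclass
class Post:
    post_format: str
    author: str
    views: int
    text: str

@dataclass
class Member:
    headline: str
    skills: list
    engaged: list = field(default_factory=list)  # (timestamp, Post) pairs

def render_post(post):
    """Flatten one post into a single templated line."""
    return (f"[post] format: {post.post_format} | author: {post.author} | "
            f"views: {post.views} | text: {post.text}")

def render_member(member):
    """Profile line followed by engaged posts, oldest first."""
    history = sorted(member.engaged, key=lambda pair: pair[0])
    lines = [f"[member] headline: {member.headline} | "
             f"skills: {', '.join(member.skills)}"]
    lines += [render_post(post) for _, post in history]
    return "\n".join(lines)

member = Member(
    headline="Data engineer",
    skills=["Spark", "SQL"],
    engaged=[
        (20, Post("article", "B. Author", 900, "Second post")),
        (10, Post("text", "A. Author", 150, "First post")),
    ],
)
prompt = render_member(member)
```

Keeping rendering in one place means the same template feeds both offline experiments and production prompts, so changes to the data mix are easy to test.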

One of the most consequential findings from that testing phase involved how LLMs handle numbers. When a post had, say, 12,345 views, that figure appeared in the prompt as “views:12345,” and the model treated it like any other text token, stripping it of its significance as a popularity signal. To fix this, the team broke engagement counts into percentile buckets and wrapped them in special tokens, so the model could distinguish them from unstructured text. The intervention meaningfully improved how the system weighs post reach.
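The bucketing idea can be sketched as follows. The token format `<views_pN>`, the ten-bucket split and the reference distribution are assumptions for illustration; the article does not specify LinkedIn's actual tokens or bucket boundaries.

```python
from bisect import bisect_right

def make_bucketizer(reference_counts, buckets=10):
    """Build a function mapping a raw count to a percentile-bucket token."""
    ordered = sorted(reference_counts)
    # Cut points at equally spaced quantiles of the reference distribution.
    cuts = [ordered[(len(ordered) * k) // buckets] for k in range(1, buckets)]

    def to_token(count):
        bucket = bisect_right(cuts, count)  # 0 = least viewed, buckets-1 = most
        return f"<views_p{bucket}>"

    return to_token

# Hypothetical reference distribution of view counts.
to_token = make_bucketizer(range(1, 1001))
print(to_token(5))      # <views_p0>
print(to_token(12345))  # <views_p9>
```

A post with 12,345 views then appears in the prompt as a single top-bucket token instead of a string of digits, which gives the model an explicit ordinal signal to learn from.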

Teaching the feed to read professional history as a sequence

Of course, if LinkedIn wants its feed to feel more personal and its posts to reach the right audience, it needs to reimagine how it ranks posts, too. Traditional ranking models, the company said, misunderstand how people engage with content: engagement isn’t random but follows patterns that emerge from someone’s professional journey.

LinkedIn built a proprietary Generative Recommender (GR) model for its feed that treats interaction history as a sequence, or “a professional story told through the posts you’ve engaged with over time.”

“Instead of scoring each post in isolation, GR processes more than a thousand of your historical interactions to understand temporal patterns and long-term interests,” LinkedIn’s blog said. “As with retrieval, the ranking model relies on professional signals and engagement patterns, never demographic attributes, and is regularly audited for equitable treatment across our member base.”

The compute cost of running LLMs at LinkedIn’s scale

With a revitalized data pipeline and feed, LinkedIn faced another problem: GPU cost.

LinkedIn invested heavily in new training infrastructure to reduce how much it leans on GPUs. The biggest architectural shift was disaggregating CPU-bound feature processing from GPU-heavy model inference — keeping each type of compute doing what it’s suited for rather than bottlenecking on GPU availability. The team also wrote custom C++ data loaders to cut the overhead that Python multiprocessing was adding, and built a custom Flash Attention variant to optimize attention computation during inference. Checkpointing was parallelized rather than serialized, which helped squeeze more out of available GPU memory.

“One of the things we had to engineer for was that we needed to use a lot more GPUs than we’d like to,” Jurka said. “Being very deliberate about how you coordinate between CPU and GPU workloads because the nice thing about these kinds of LLMs and prompt context that we use to generate embeddings is you can dynamically scale them.”

For engineers building recommendation or retrieval systems, LinkedIn’s redesign offers a concrete case study in what replacing fragmented pipelines with a unified embedding model actually requires: rethinking how numerical signals are represented in prompts, separating CPU and GPU workloads deliberately, and building ranking models that treat user history as a sequence rather than a set of independent events. The lesson isn’t that LLMs solve feed problems — it’s that deploying them at scale forces you to solve a different class of problems than the ones you started with.

Orchestration

Y Combinator-backed Random Labs launches Slate V1, claiming the first ‘swarm-native’ coding agent

The software engineering world is currently wrestling with a fundamental paradox of the AI era: as models become more capable, the “systems problem” of managing them has become the primary bottleneck to real-world productivity. While a developer might have access to the raw intelligence of a frontier model, that intelligence often degrades the moment a task requires a long horizon or a deep context window.

But help appears to be on the way: San Francisco-based, Y Combinator-backed startup Random Labs has officially launched Slate V1, described as the industry’s first “swarm native” autonomous coding agent designed to execute massively parallel, complex engineering tasks.

Emerging from an open beta, the tool utilizes a “dynamic pruning algorithm” to maintain context in large codebases while scaling output to enterprise complexity. Co-founded by Kiran and Mihir Chintawar in 2024, the company aims to bridge the global engineering shortage by positioning Slate as a collaborative tool for the “next 20 million engineers” rather than a replacement for human developers.

With the release of Slate V1, the team at Random Labs is attempting to architect a way out of this bottleneck with what it calls a “swarm-native” agentic coding environment. Slate is not merely a wrapper or a chatbot with file access; it is an implementation of a “hive mind” philosophy designed to scale agentic work with the complexity of a human organization.

By leveraging a novel architectural primitive called Thread Weaving, Slate moves beyond the rigid task trees and lossy compaction methods that have defined the first generation of AI coding assistants.

Strategy: Action space

At the heart of Slate’s effectiveness is a deep engagement with Recursive Language Models (RLM).

In a traditional setup, an agent might be asked to “fix a bug,” a prompt that forces the model to juggle high-level strategy and low-level execution simultaneously.

Random Labs identifies this as a failure to tap into “Knowledge Overhang”—the latent intelligence a model possesses but cannot effectively access when it is tactically overwhelmed.

Slate solves this by using a central orchestration thread that essentially “programs in action space”. This orchestrator doesn’t write the code directly; instead, it uses a TypeScript-based DSL to dispatch parallel worker threads to handle specific, bounded tasks.

This creates a clear separation between the “kernel”—which manages the execution graph and maintains strategic alignment—and the worker “processes” that execute tactical operations in the terminal.

By mapping onto an OS-style framework, inspired by Andrej Karpathy’s “LLM OS” concept, Slate is able to treat the limited context window of a model as precious RAM, actively, intelligently managing what is retained and what is discarded.

Episodic memory and the swarm

The true innovation of the “Thread Weaving” approach lies in how it handles memory. Most agents today rely on “compaction,” which is often just a fancy term for lossy compression that risks dropping critical project state. Slate instead generates “episodes”.

When a worker thread completes a task, it doesn’t return a sprawling transcript of every failed attempt; it returns a compressed summary of the successful tool calls and conclusions.

Because these episodes share context directly with the orchestrator rather than relying on brittle message passing, the system maintains a “swarm” intelligence.
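The orchestrator-and-episode pattern can be illustrated with a short sketch. Slate's own orchestrator uses a TypeScript-based DSL that is not shown in this article, so the Python below is purely an analogy: the `Episode` type, the stand-in worker and the hard-coded transcript are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Episode:
    task: str
    summary: str

def worker(task):
    """Stand-in for a model-driven worker thread handling one bounded task."""
    # A real worker would call a model and tools; this transcript is fake.
    transcript = [(f"{task}: first try", False), (f"{task}: second try", True)]
    # Keep only the steps that succeeded, discarding failed attempts.
    kept = [step for step, succeeded in transcript if succeeded]
    return Episode(task=task, summary="; ".join(kept))

def orchestrate(tasks, max_workers=4):
    """Dispatch bounded tasks in parallel and collect their episodes in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, tasks))

episodes = orchestrate(["rename module", "update tests"])
```

The orchestrator never sees the failed attempts, only the compact episodes, which is what keeps its own context small as the number of workers grows.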

This architecture allows for massive parallelism. A developer can have Claude Sonnet orchestrating a complex refactor while GPT-5.4 executes code, and GLM 5, a favorite for its agentic search capabilities, simultaneously researches library documentation in the background. It’s an approach similar to the one Perplexity takes with its new Computer multi-model agent.

By selecting the “right model for the job,” Slate ensures that users aren’t overspending on intelligence for simple tactical steps while still benefiting from the strategic depth of the world’s most powerful models.

The business of autonomy

From a commercial perspective, Random Labs is navigating the early beta period with a mix of transparency and strategic ambiguity.

While the company has not yet published a fixed-price subscription sheet, the Slate CLI documentation confirms a shift toward a usage-based credit model.

Commands like /usage and /billing allow users to monitor their credit burn in real-time, and the inclusion of organization-level billing toggles suggests a clear focus on professional engineering teams rather than solo hobbyists.

There is also a significant play toward integration. Random Labs recently announced that direct support for OpenAI’s Codex and Anthropic’s Claude Code is slated for release next week.

This suggests that Slate isn’t trying to compete with these models’ native interfaces, but rather to act as the superior orchestration layer that allows engineers to use all of them at once, safely and cost-effectively.


Architecturally, the system is designed to maximize caching through subthread reuse, a “novel context engineering” trick that the team claims keeps the swarm approach from becoming a financial burden for users.

Stability AI

Perhaps the most compelling argument for the Slate architecture is its stability. In internal testing, an early version of this threading system managed to pass 2/3 of the tests on the make-mips-interpreter task within the Terminal Bench 2.0 suite.

This is a task where even the newest frontier models, like Opus 4.6, often succeed less than 20% of the time when used in standard, non-orchestrated harnesses.

This success in a “mutated” or changing environment is what separates a tool from a partner. According to Random Labs’ documentation, one fintech founder in NYC described Slate as their “best debugging tool,” a sentiment that echoes the broader goal of Random Labs: to build agents that don’t just complete a prompt, but scale like an organization.

As the industry moves past simple “chat with your code” interfaces, the “Thread Weaving” of Slate V1 offers a glimpse into a future where the primary role of the human engineer is to direct a hive mind of specialized models, each working in concert to solve the long-horizon problems of modern software.

Orchestration

Anthropic gives Claude shared context across Microsoft Excel and PowerPoint, enabling reusable workflows in multiple applications

Anthropic has upgraded its Claude AI model with new capabilities for Microsoft Excel and PowerPoint, marking a strategic move to expand its enterprise footprint and potentially challenging Microsoft’s newly launched Copilot Cowork — which Claude also partially powers.

The updated add-ins are available to Mac and Windows users on paid Claude plans starting today, March 11.

Anthropic is also expanding how enterprises can deploy the tools.

Claude for Excel and Claude for PowerPoint can now be accessed either through a Claude account or through an existing LLM gateway routing to Claude models on Amazon Bedrock, Google Cloud Vertex AI or Microsoft Foundry.

That gives enterprises more flexibility to use the add-ins within cloud and compliance setups they may already have in place.

Shared context across Office apps

Starting March 11, paid Claude users on Mac and Windows can access a new beta experience in which Claude for Excel and Claude for PowerPoint share the full context of a user’s conversation with the AI model between the two applications — no need for manually copying and pasting it over.

That means Claude can carry information, instructions and task history between an open spreadsheet and an open presentation in a single continuous session.

For example, Claude can write formulas to extract data from an Excel workbook and immediately apply it to a stylized PowerPoint slide in the same session.

“In practice: a financial analyst can ask Claude to pull comparable company financials from an open workbook, build out a trading comps table in Excel, drop the valuation summary into the pitch deck, and draft the email to the MD—without switching tabs or re-explaining the dataset at each step,” Anthropic said in a press release.

This builds on Anthropic’s release of a Claude plugin for Excel back in October 2025.

Repeatable workflows inside applications

A central feature of this launch is Skills, which allows teams to build and save repeatable workflows directly inside the Excel and PowerPoint sidebars.

Rather than re-uploading references or re-prompting instructions, users can save standardized processes—such as specific variance analyses or approved slide templates—as one-click actions available to the entire organization.

That could include workflows for recurring financial analysis, preparing presentations in a preferred house style or running common review steps that would otherwise need to be rewritten as prompts each time.

Anthropic said every Skill, whether personal or organization-wide, will work inside the add-ins the same way MCP connectors do.

“Workflows that previously lived in one person’s head become one-click actions available to the whole organization,” the company said.

Anthropic distinguishes these Skills from Instructions, which let users set persistent preferences across the add-ins, such as preferred number formatting in Excel or presentation-writing rules in PowerPoint.

Anthropic is also shipping a preloaded starter set of Skills, including:

Excel: Auditing models for formula errors, populating DCF and LBO templates, and cleaning messy data ranges.
PowerPoint: Building competitive landscape decks and reviewing investment banking materials for narrative alignment.

Similarly, Microsoft’s new Copilot Cowork capability introduced on Monday enables enterprise users to deploy agents to complete tasks across Microsoft applications such as Excel and PowerPoint.

The software giant openly stated that Copilot Cowork was built in conjunction with Anthropic. Anthropic released its own stand-alone Claude Cowork application for Mac and Windows earlier this year, which lets Claude autonomously access, edit, create and move information between files on a user’s computer at the user’s direction.

Previously, even with autonomous tools like the standalone Claude Cowork app, users often had to ask the AI to complete tasks in separate steps for each application. Now, Claude maintains a continuous session that reads live data and writes formulas across both apps simultaneously.

Battle of the enterprise app agents

Ever since the launch of Claude Cowork earlier this year, Anthropic has been making a case to be the chat and productivity platform of choice for enterprises.

Competitors like Google, with its close association with Google Workspace, which includes Gmail and Google Docs, and Microsoft, with its continued leadership in the Office suite, can directly bring AI capabilities to users’ workflows.

Anthropic did not present the new Skills feature as equivalent to the more autonomous, agentic behavior Microsoft is now emphasizing with its own Copilot Cowork.

But the release does show Anthropic steadily expanding beyond chatbot use cases and into more structured, repeatable work inside the applications many business users already rely on.

Anthropic, through Claude Cowork, Claude Code and the Claude model family, has seeped into many organizations’ systems, using its high performance in coding benchmarks and general knowledge to navigate a computer better and complete knowledge work rapidly, at scale, with high quality.

OpenClaw, the open source AI agent that has taken the developer world by storm, owes much of its existence to Claude Code.

The result is another sign that the battle over enterprise AI is no longer just about which model performs best on benchmarks. It is increasingly about what AI tools and systems enterprises trust to get real work done across their existing applications, files, and workflows.

Orchestration

Google finds that AI agents learn to cooperate when trained against unpredictable opponents

Training standard AI models against a diverse pool of opponents — rather than building complex hardcoded coordination rules — is enough to produce cooperative multi-agent systems that adapt to each other on the fly. That’s the finding from Google’s Paradigms of Intelligence team, which argues the approach offers a scalable and computationally efficient blueprint for enterprise multi-agent deployments without requiring specialized scaffolding.

The technique works by training an LLM agent via decentralized reinforcement learning against a mixed pool of opponents — some actively learning, some static and rule-based. Instead of hardcoded rules, the agent uses in-context learning to read each interaction and adapt its behavior in real time.

Why multi-agent systems keep fighting each other

The AI landscape is rapidly shifting away from isolated systems toward a fleet of agents that must negotiate, collaborate, and operate in shared spaces simultaneously. In multi-agent systems, the success of a task depends on the interactions and behaviors of multiple entities as opposed to a single agent.

The central friction in these multi-agent systems is that their interactions frequently involve competing goals. Because these autonomous agents are designed to maximize their own specific metrics, ensuring they don’t actively undermine one another in these mixed-motive scenarios is incredibly difficult.

Multi-agent reinforcement learning (MARL) tries to address this problem by training multiple AI agents that operate, interact, and learn in the same shared environment at the same time. However, in real-world enterprise architectures, a single, centralized system rarely has visibility into, or control over, every moving part. Developers must rely on decentralized MARL, where individual agents must figure out how to interact with others while only having access to their own limited, local data and observations.

One of the main problems with decentralized MARL is that the agents frequently get stuck in suboptimal states as they try to maximize their own specific rewards. The researchers refer to it as “mutual defection,” based on the Prisoner’s Dilemma puzzle used in game theory. For example, think of two automated pricing algorithms locked in a destructive race to the bottom. Because each agent optimizes strictly for its own selfish reward, they arrive at a stalemate where the broader enterprise loses.
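To make “mutual defection” concrete, here is a minimal sketch of a one-shot Prisoner’s Dilemma. The payoff numbers are the conventional textbook values (5, 3, 1, 0), an assumption for illustration and not figures from the Google paper; the function names are hypothetical.

```python
# Conventional textbook payoffs (an assumption; not taken from the paper).
# PAYOFF[(my_move, their_move)] -> my reward. "C" = cooperate, "D" = defect.
PAYOFF = {
    ("C", "C"): 3,  # both cooperate
    ("C", "D"): 0,  # I cooperate, they defect
    ("D", "C"): 5,  # I defect, they cooperate
    ("D", "D"): 1,  # both defect
}


def best_response(their_move: str) -> str:
    """Return the move that maximizes my own one-shot reward."""
    return max("CD", key=lambda my_move: PAYOFF[(my_move, their_move)])


def joint_reward(a: str, b: str) -> int:
    """Total reward earned by both players combined."""
    return PAYOFF[(a, b)] + PAYOFF[(b, a)]


if __name__ == "__main__":
    for their_move in "CD":
        print(f"If they play {their_move}, my best response is {best_response(their_move)}")
    print("Joint reward, both defect:   ", joint_reward("D", "D"))
    print("Joint reward, both cooperate:", joint_reward("C", "C"))
```

Under these assumed payoffs, defecting is each player’s best response no matter what the other does, so two purely selfish optimizers settle on mutual defection even though mutual cooperation would leave both better off.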

Another problem is that traditional training frameworks are designed for stationary environments, meaning the rules of the game and the behavior of the environment are relatively fixed. In a multi-agent system, from the perspective of any single agent, the environment is fundamentally unpredictable and constantly shifting because the other agents are simultaneously learning and adapting their own policies.

While enterprise developers currently rely on frameworks that use rigid state machines, these methods often hit a scalability wall in complex deployments.

“The primary limitation of hardcoded orchestration is its lack of flexibility,” Alexander Meulemans, co-author of the paper and Senior Research Scientist on Google’s Paradigms of Intelligence team, told VentureBeat. “While rigid state machines function adequately in narrow domains, they can fail to scale as the scope and complexity of agent deployments broaden. Our in-context approach complements these existing frameworks by fostering adaptive social behaviors that are deeply embedded during the post-training phase.”

What this means for developers using LangGraph, CrewAI, or AutoGen

Frameworks like LangGraph require developers to explicitly define agents, state transitions, and routing logic as a graph. LangChain describes this approach as equivalent to a state machine, where agent nodes and their connections represent states and transition matrices. Google’s approach inverts that model: rather than hardcoding how agents should coordinate, it produces cooperative behavior through training, leaving the agents to infer coordination rules from context.
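The hardcoded style described above can be sketched in plain Python. This is a deliberately simplified illustration, not the LangGraph API: the node functions, the TRANSITIONS table, and the run helper are all hypothetical names invented for this example.

```python
# Hypothetical agent nodes: each takes the shared state and returns an updated copy.
def research(state):
    return {**state, "notes": f"notes on {state['task']}"}


def write(state):
    return {**state, "draft": f"draft from {state['notes']}"}


def review(state):
    return {**state, "approved": "draft" in state}


NODES = {"research": research, "write": write, "review": review}

# Hardcoded routing: every transition is spelled out by the developer.
# None marks the terminal state.
TRANSITIONS = {"research": "write", "write": "review", "review": None}


def run(task, start="research"):
    """Walk the fixed state machine from `start` until a terminal state."""
    state = {"task": task}
    node = start
    path = []
    while node is not None:
        state = NODES[node](state)
        path.append(node)
        node = TRANSITIONS[node]
    state["path"] = path
    return state


if __name__ == "__main__":
    print(run("pricing strategy")["path"])
```

Every new agent or exception path means another entry in the transition table, which is the rigidity the researchers contrast with coordination learned during training.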

The researchers show that developers can achieve advanced, cooperative multi-agent systems using the exact same standard sequence modeling and reinforcement learning techniques that already power today’s foundation models.

The team validated the concept using a new method called Predictive Policy Improvement (PPI), though Meulemans notes the underlying principle is model-agnostic.

“Rather than training a small set of agents with fixed roles, teams should implement a ‘mixed pool’ training routine,” Meulemans said. “Developers can reproduce these dynamics using standard, out-of-the-box reinforcement learning algorithms (such as GRPO).”

By exposing agents to diverse co-players (for example, ones that vary in system prompts, fine-tuned parameters, or underlying policies), teams create a robust learning environment. This produces strategies that are resilient when interacting with new partners and ensures that multi-agent learning leads toward stable, long-term cooperative behaviors.

How the researchers proved it works

To build agents that can successfully deduce a co-player’s strategy, the researchers created a decentralized training setup where the AI is pitted against a highly diverse, mixed pool of opponents composed of actively learning models and static, rule-based programs. This forced diversity requires the agent to dynamically figure out who it is interacting with and adapt its behavior on the fly, entirely from the context of the interaction.
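The shape of such a setup can be sketched as a toy harness for the Iterated Prisoner’s Dilemma. This is an illustration under loud assumptions, not the paper’s PPI method: no model weights are updated, the pool contains only static rule-based co-players (an actively learning co-player would be added alongside them), the payoffs are conventional textbook values, and history_reader is a hand-written stand-in for a policy that adapts from interaction history alone. All names are hypothetical.

```python
import random

# Conventional textbook payoffs (assumed): PAYOFF[(my_move, their_move)] -> my reward.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


# Static, rule-based co-players. Each sees only the other side's past moves.
def always_defect(their_history):
    return "D"


def always_cooperate(their_history):
    return "C"


def tit_for_tat(their_history):
    return their_history[-1] if their_history else "C"


MIXED_POOL = [always_defect, always_cooperate, tit_for_tat]


def history_reader(their_history):
    """Stand-in for an in-context learner: it is told nothing about the
    co-player and must infer its type from the interaction so far."""
    if len(their_history) < 2:
        return "C"                      # probe with cooperation first
    if their_history[:2] == ["D", "D"]:
        return "D"                      # co-player ignored the probe: stop being exploited
    return their_history[-1]            # otherwise mirror the co-player's last move


def play_episode(agent, co_player, rounds=10):
    """Play one episode and return (agent_score, co_player_score)."""
    agent_moves, co_moves = [], []
    agent_score = co_score = 0
    for _ in range(rounds):
        a = agent(co_moves)             # agent sees only the co-player's past moves
        b = co_player(agent_moves)
        agent_score += PAYOFF[(a, b)]
        co_score += PAYOFF[(b, a)]
        agent_moves.append(a)
        co_moves.append(b)
    return agent_score, co_score


def run_mixed_pool(agent, pool, episodes=6, rounds=10, seed=0):
    """Sample a fresh co-player from the pool for every episode."""
    rng = random.Random(seed)
    scores = {}
    for _ in range(episodes):
        co_player = rng.choice(pool)
        agent_score, _ = play_episode(agent, co_player, rounds)
        scores.setdefault(co_player.__name__, []).append(agent_score)
    return scores


if __name__ == "__main__":
    print(run_mixed_pool(history_reader, MIXED_POOL))
```

In a real training run, the hand-written history_reader would be replaced by the model’s policy and each episode’s reward would feed a reinforcement learning update. The harness only shows why a mixed pool forces the agent to identify its partner from context: the same opening moves lead to sustained cooperation with tit_for_tat and to self-protection against always_defect.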

For enterprise developers, the phrase “in-context learning” often triggers concerns about context window bloat, API costs, and latency, especially when windows are already packed with retrieval-augmented generation (RAG) data and system prompts. However, Meulemans clarifies that this technique focuses on efficiency rather than token count. “Our method focuses on optimizing how agents utilize their available context during post-training, rather than strictly demanding larger context windows,” he said. Agents trained to parse their interaction history and infer strategies use their allocated context more adaptively, without requiring longer context windows than existing applications.

Using the Iterated Prisoner’s Dilemma (IPD) as a benchmark, the researchers achieved robust, stable cooperation without any of the traditional crutches. There are no artificial separations between meta and inner learners, and no need to hardcode assumptions about how the opponent’s algorithm functions. Because the agent is adapting in real time while also updating its core foundation model weights over time across many interactions, it effectively occupies both roles simultaneously. In fact, the agents performed better when given no information about their adversaries and were forced to adapt to their behavior through trial and error.

The developer’s role shifts from rule writer to architect

The researchers say that their work bridges the gap between multi-agent reinforcement learning and the training paradigms of modern foundation models. “Since foundation models naturally exhibit in-context learning and are trained on diverse tasks and behaviors, our findings suggest a scalable and computationally efficient path for the emergence of cooperative social behaviors using standard decentralized learning techniques,” they write.

As in-context behavioral adaptation replaces hardcoded rules as the standard approach, the human element of AI engineering will fundamentally shift. “The AI application developer’s role may evolve from designing and managing individual interaction rules to designing and providing high-level architectural oversight for training environments,” Meulemans said. This transition elevates developers from writing narrow rulebooks to taking on a strategic role, defining the broad parameters that ensure agents learn to be helpful, safe, and collaborative in any situation.


Google upgrades Gemini for Workspace allowing it to pull data from multiple apps to create Docs, Sheets, Slides and more

Lest you think Microsoft would have all the fun introducing new AI features for white collar enterprise work this week with its Copilot Cowork announcement yesterday, Google is here to take back the spotlight.

The search giant and, increasingly, AI leader today announced a sweeping series of updates to the Gemini AI models embedded in Google Workspace, its productivity suite of cloud-based apps including Drive, Docs, Sheets, Slides, and more. The updates are available to both individual consumers and enterprises: individuals need an AI Pro ($20 per month) or higher subscription plan, while enterprises must be enrolled in the “Gemini Alpha” program and have the features switched on by an administrator.

The biggest news: it’s now possible to have Gemini automatically create these file types from a single text prompt and fill them out with information gathered from other files and apps throughout the user’s Google Workspace, including emails, chats, and files, as well as from the open web via Google Search.

By synthesizing information across these disparate apps and experiences, Gemini acts as an assistant capable of drafting, iterating, and perfecting complex, finished, professional-grade content in seconds, effectively ending the era of the manual “dig” for information.

The message is simple: the era of searching across multiple windows, tabs, files and folders for your information is over. Gemini will do it and assemble a nearly finished product from a single natural language prompt, in English or another language of your choosing.

And best of all for enterprise technical leaders: this capability now comes first-party from Google itself, shortcutting or eliminating much of the need to build their own orchestration system, at least for teams that keep most of their data in these Google applications and don’t want to build one themselves.

Prompt to document, spreadsheet, slide deck and more

The rollout spans the entire Workspace suite, with specific features tailored to the unique demands of each application:

Google Docs: “Help me create”: The new “Help me create” experience allows users to generate fully formatted first drafts by simply describing their goal. Because Gemini can access Drive, Gmail, and Chat, a user can prompt: “Draft a newsletter using the meeting minutes from my January HOA meeting and the list of upcoming events”. The result is a contextualized document that includes smart chips and structured formatting, rather than a generic template.

Google Sheets gets a 9x speed boost: The most striking efficiency claim in this release involves “Fill with Gemini”. A 95-participant study conducted by Google found that using Gemini to auto-populate tables with categorized or summarized data was 9x faster than manual entry for 100-cell tasks. Users can now describe a goal—like optimizing a weekly schedule to maximize profit while balancing staff skills—and Gemini handles the multi-step construction from start to finish.

Google Slides gets narrative-first design: Slides is receiving updates that allow Gemini to act as a design collaborator. It can now turn rough brainstorm sketches into editable diagrams and generate slide layouts that balance visual weight and hierarchy while matching the theme of an existing deck. Google also teased an upcoming feature that will generate an entire presentation from a single prompt based on a reference document.

Google Drive: The Knowledge Base: Perhaps the most fundamental shift is in Google Drive, which is moving from “passive storage” to an “active knowledge base” that compiles data from multiple files and file types stored there and allows Gemini to access it and move it around as needed in creating and editing projects.

AI Overviews: Similar to Google Search, Drive will now provide a summarized answer with citations at the top of search results, removing the need to open multiple files to find a specific detail.
Ask Gemini in Drive: This allows for complex, cross-file queries, such as comparing multiple catering proposals or synthesizing months of research on a specific topic.
Projects: Users can now save curated lists of sources as “projects” to share with others, maintaining built-in security and compliance controls.

Not just Gemini — numerous Google AI models power this experience

While the user interface of the new Workspace updates is designed for simplicity, the backend architecture relies on a specialized ensemble of Google’s most advanced AI models.

These features are not powered by a single general-purpose engine but rather a suite of task-specific models developed by Google DeepMind and Google Research.

Gemini 3 Flash & Deep Think: The core text generation, summarization, and reasoning capabilities—such as “Help me create” in Docs and “AI Overviews” in Drive—are driven by the Gemini 3 family. Specifically, Gemini 3 Flash is utilized for high-speed summarization, while Gemini 3 Deep Think handles more complex reasoning tasks involving science, research, and engineering.
Google Research OR-Tools: To solve the “advanced optimization problems” in Sheets, such as complex employee scheduling or budget maximization, Google integrates its OR-Tools (Operations Research tools) alongside DeepMind’s logic models.
Nano Banana 2 (Gemini 3 Flash Image): The professional layouts and editable diagrams found in Slides are generated by Nano Banana 2, a state-of-the-art multi-image-to-image model. This model handles everything from text-to-image generation to complex style transfers, ensuring that new slides match a company’s existing brand aesthetics.
Veo & Lyria 3: For multimedia integration, Google utilizes Veo for high-fidelity video generation and Lyria 3 for professional-grade music and vocal arrangements, both of which include SynthID watermarking for AI identification.
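To give a feel for the kind of scheduling problem the OR-Tools item above refers to, here is a toy sketch that assigns staff to shifts to maximize total profit. It is a brute-force illustration using only the Python standard library, not OR-Tools and not Google’s implementation; the staff names, shifts, and profit figures are invented for the example.

```python
from itertools import permutations

# Hypothetical data: expected profit when a staff member covers a given shift.
PROFIT = {
    "Ana":   {"Mon": 90, "Tue": 60, "Wed": 40},
    "Ben":   {"Mon": 70, "Tue": 80, "Wed": 50},
    "Chloe": {"Mon": 30, "Tue": 55, "Wed": 75},
}


def best_schedule(profit):
    """Try every one-shift-per-person assignment and keep the most profitable."""
    staff = list(profit)
    shifts = list(next(iter(profit.values())))
    best_total, best_plan = -1, None
    for order in permutations(shifts):
        total = sum(profit[person][shift] for person, shift in zip(staff, order))
        if total > best_total:
            best_total, best_plan = total, dict(zip(staff, order))
    return best_total, best_plan


if __name__ == "__main__":
    total, plan = best_schedule(PROFIT)
    print(total, plan)
```

Brute force works for three people and three shifts, but the number of assignments grows factorially, which is why realistic schedules are handed to a dedicated optimization solver rather than enumerated.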

Licensing and availability

Google is positioning these features as premium additions to its ecosystem. The new Gemini capabilities are rolling out in beta starting today.

Feature Set	Target Audience	Availability
Google AI Ultra & Pro	Individual power users	English (Global for Docs, Sheets, Slides)
Gemini Alpha	Business/Enterprise customers	English (Global for Docs, Sheets, Slides)
Google Drive Updates	U.S. Customers (Initial)	English (U.S. Only for now)

The Gemini Alpha program is a pre-release initiative that allows Google Workspace administrators to grant users early access to experimental AI features before they are made generally available.

To participate, your organization must have a supported subscription — such as Business Standard, Business Plus, Enterprise, or Education tiers — along with Google AI Pro or Ultra add-ons.

Participation is managed entirely by your Google Workspace administrator, as the program is turned off by default. An admin can manually enable access via the Google Admin console by navigating to Menu > Generative AI > Gemini for Workspace and selecting the Alpha features panel. Once enabled, eligible users can begin reimagining their content creation journeys using these next-generation tools.

For business users, Google emphasizes that these features are built with “enterprise-grade data protections,” ensuring that sensitive company data used to ground Gemini’s responses remains confidential and is not used to train global models.

Community and leadership reactions

The announcement was met with immediate traction on social media, spearheaded by CEO Sundar Pichai. In a post on X, Pichai highlighted the practical, time-saving nature of the updates:

“New Gemini updates to make @GoogleWorkspace more personal, helpful and collaborative… no more digging through folders.”

The reaction from the broader tech community has focused heavily on the “9x faster” claim for Sheets, a metric that resonates with data analysts and project managers who spend a significant portion of their week on manual data entry.

Yulie Kwon Kim, VP of Product for Workspace, framed the release as a fundamental reimagining of content creation, stating that Gemini is no longer just a “tool” but a “partner that works alongside you throughout the creative process”.

As these features move from beta to general availability in the coming months, the true test will be how effectively Gemini handles the nuance of complex, real-world data without human intervention. For now, Google has signaled its intent: the era of starting with a blank page is officially over.

AI is the latest escalation in the cloud contest — and enterprise technical leaders should take note

For CTOs, CIOs, and product managers, the deep integration of Gemini into Google Workspace is not merely a suite of new features; it is a fundamental shift toward an “agentic” operating model.

This announcement arrives just 24 hours after Microsoft unveiled “Copilot Cowork,” a cloud-based agentic AI tool designed to complete work on a user’s behalf across the entire Microsoft 365 suite. Both tech giants are now converging on a singular vision: the AI assistant as an execution layer that can navigate multiple files, formats, and data sources to independently plan and deliver finished workplace materials.

By transforming static storage into an active knowledge base, these platforms are providing technical leaders with a framework to reduce the “digital debt” of searching through siloed applications, effectively reimagining white-collar work as a series of delegated outcomes rather than manual tasks.

The scale of this transformation is underpinned by Google’s massive and rapidly expanding footprint. As of early 2026, Google Workspace has surpassed 3 billion monthly active users globally. Within this ecosystem, the paid enterprise segment is seeing explosive growth, with approximately 11 million paying business customers—up from 8 million just one year prior.

More specifically, over 8 million paid Gemini Enterprise seats have already been deployed across more than 2,800 companies. While Microsoft leverages a multi-model architecture incorporating Anthropic’s Claude models for its “Cowork” features, Google is doubling down on its own integrated stack of Gemini 3 and DeepMind logic to provide a seamless, context-aware environment for its vast user base.

From an interpretive standpoint, this “Gemini-fication” of work represents the democratization of advanced analytics. When a manager can use natural language to solve complex optimization problems in Sheets or generate entire presentations from a single prompt, the traditional boundaries of professional roles begin to blur.

While early studies suggest these agentic tools can lead to productivity gains of 15% to 35%, the real value for technical leaders lies in headcount leverage—the ability to maintain high output with leaner teams.

As AI assistants evolve into autonomous agents that navigate enterprise data to “do the work for you,” the role of the knowledge worker is shifting from “creator” to “orchestrator,” requiring a strategic pivot in how enterprises hire and measure human talent in an AI-first economy.
