Traditional software governance often uses static compliance checklists, quarterly audits and after-the-fact reviews. But this method can’t keep up with AI systems that change in real time. A machine learning (ML) model might retrain or drift between quarterly operational syncs. This means that, by the time an issue is discovered, hundreds of bad decisions could already have been made. This can be almost impossible to untangle.
In the fast-paced world of AI, governance must be inline, not an after-the-fact compliance review. In other words, organizations must adopt what I call an “audit loop”: A continuous, integrated compliance process that operates in real-time alongside AI development and deployment, without halting innovation.
This article explains how to implement such continuous AI compliance through shadow mode rollouts, drift and misuse monitoring and audit logs engineered for direct legal defensibility.
When systems moved at the speed of people, it made sense to do compliance checks every so often. But AI doesn’t wait for the next review meeting. The change to an inline audit loop means audits will no longer occur just once in a while; they happen all the time. Compliance and risk management should be “baked in” to the AI lifecycle from development to production, rather than just post-deployment. This means establishing live metrics and guardrails that monitor AI behavior as it occurs and raise red flags as soon as something seems off.
For instance, teams can set up drift detectors that automatically alert when a model’s predictions go off course from the training distribution, or when confidence scores fall below acceptable levels. Governance is no longer just a set of quarterly snapshots; it’s a streaming process with alerts that go off in real time when a system goes outside of its defined confidence bands.
Cultural shift is equally important: Compliance teams must act less like after-the-fact auditors and more like AI co-pilots. In practice, this might mean compliance and AI engineers working together to define policy guardrails and continuously monitor key indicators. With the right tools and mindset, real-time AI governance can “nudge” and intervene early, helping teams course-correct without slowing down innovation.
In fact, when done well, continuous governance builds trust rather than friction, providing shared visibility into AI operations for both builders and regulators, instead of unpleasant surprises after deployment. The following strategies illustrate how to achieve this balance.
One effective framework for continuous AI compliance is “shadow mode” deployments with new models or agent features. This means a new AI system is deployed in parallel with the existing system, receiving real production inputs but not influencing real decisions or user-facing outputs. The legacy model or process continues to handle decisions, while the new AI’s outputs are captured only for analysis. This provides a safe sandbox to vet the AI’s behavior under real conditions.
According to global law firm Morgan Lewis: “Shadow-mode operation requires the AI to run in parallel without influencing live decisions until its performance is validated,” giving organizations a safe environment to test changes.
Teams can discover problems early by comparing the shadow model’s decisions to expectations (the current model’s decisions). For instance, when a model is running in shadow mode, they can check to see if its inputs and predictions differ from those of the current production model or the patterns seen in training. Sudden changes could indicate bugs in the data pipeline, unexpected bias or drops in performance.
In short, shadow mode is a way to check compliance in real time: It ensures that the model handles inputs correctly and meets policy standards (accuracy, fairness) before it is fully released. One AI security framework showed how this method worked: Teams first ran AI in shadow mode (AI makes suggestions but doesn’t act on its own), then compared AI and human inputs to determine trust. They only let the AI suggest actions with human approval after it was reliable.
For instance, Prophet Security eventually let the AI make low-risk decisions on its own. Using phased rollouts gives people confidence that an AI system meets requirements and works as expected, without putting production or customers at risk during testing.
Even after an AI model is fully deployed, the compliance job is never “done.” Over time, AI systems can drift, meaning that their performance or outputs change due to new data patterns, model retraining or bad inputs. They can also be misused or lead to results that go against policy (for example, inappropriate content or biased decisions) in unexpected ways.
To remain compliant, teams must set up monitoring signals and processes to catch these issues as they happen. In SLA monitoring, they may only check for uptime or latency. In AI monitoring, however, the system must be able to tell when outputs are not what they should be. For example, if a model suddenly starts giving biased or harmful results. This means setting “confidence bands” or quantitative limits for how a model should behave and setting automatic alerts when those limits are crossed.
Some signals to monitor include:
Data or concept drift: When input data distributions change significantly or model predictions diverge from training-time patterns. For example, a model’s accuracy on certain segments might drop as the incoming data shifts, a sign to investigate and possibly retrain.
Anomalous or harmful outputs: When outputs trigger policy violations or ethical red flags. An AI content filter might flag if a generative model produces disallowed content, or a bias monitor might detect if decisions for a protected group begin to skew negatively. Contracts for AI services now often require vendors to detect and address such noncompliant results promptly.
User misuse patterns: When unusual usage behavior suggests someone is trying to manipulate or misuse the AI. For instance, rapid-fire queries attempting prompt injection or adversarial inputs could be automatically flagged by the system’s telemetry as potential misuse.
When a drift or misuse signal crosses a critical threshold, the system should support “intelligent escalation” rather than waiting for a quarterly review. In practice, this could mean triggering an automated mitigation or immediately alerting a human overseer. Leading organizations build in fail-safes like kill-switches, or the ability to suspend an AI’s actions the moment it behaves unpredictably or unsafely.
For example, a service contract might allow a company to instantly pause an AI agent if it’s outputting suspect results, even if the AI provider hasn’t acknowledged a problem. Likewise, teams should have playbooks for rapid model rollback or retraining windows: If drift or errors are detected, there’s a plan to retrain the model (or revert to a safe state) within a defined timeframe. This kind of agile response is crucial; it recognizes that AI behavior may drift or degrade in ways that cannot be fixed with a simple patch, so swift retraining or tuning is part of the compliance loop.
By continuously monitoring and reacting to drift and misuse signals, companies transform compliance from a periodic audit to an ongoing safety net. Issues are caught and addressed in hours or days, not months. The AI stays within acceptable bounds, and governance keeps pace with the AI’s own learning and adaptation, rather than trailing behind it. This not only protects users and stakeholders; it gives regulators and executives peace of mind that the AI is under constant watchful oversight, even as it evolves.
Continuous compliance also means continuously documenting what your AI is doing and why. Robust audit logs demonstrate compliance, both for internal accountability and external legal defensibility. However, logging for AI requires more than simplistic logs. Imagine an auditor or regulator asking: “Why did the AI make this decision, and did it follow approved policy?” Your logs should be able to answer that.
A good AI audit log keeps a permanent, detailed record of every important action and decision AI makes, along with the reasons and context. Legal experts say these logs “provide detailed, unchangeable records of AI system actions with exact timestamps and written reasons for decisions.” They are important evidence in court. This means that every important inference, suggestion or independent action taken by AI should be recorded with metadata, such as timestamps, the model/version used, the input received, the output produced and (if possible) the reasoning or confidence behind that output.
Modern compliance platforms stress logging not only the result (“X action taken”) but also the rationale (“X action taken because conditions Y and Z were met according to policy”). These enhanced logs let an auditor see, for example, not just that an AI approved a user’s access, but that it was approved “based on continuous usage and alignment with the user’s peer group,” according to Attorney Aaron Hall.
Audit logs should also be well-organized and difficult to change if they are to be legally sound. Techniques like immutable storage or cryptographic hashing of logs ensure that records can’t be changed. Log data should be protected by access controls and encryption so that sensitive information, such as security keys and personal data, is hidden or protected while still being open.
In regulated industries, keeping these logs can show examiners that you are not only keeping track of AI’s outputs, but you are retaining records for review. Regulators are expecting companies to show more than that an AI was checked before it was released. They want to see that it is being monitored continuously and there is a forensic trail to analyze its behavior over time. This evidentiary backbone comes from complete audit trails that include data inputs, model versions and decision outputs. They make AI less of a “black box” and more of a system that can be tracked and held accountable.
If there is a disagreement or an event (for example, an AI made a biased choice that hurt a customer), these logs are your legal lifeline. They help you figure out what went wrong. Was it a problem with the data, a model drift or misuse? Who was in charge of the process? Did we stick to the rules we set?
Well-kept AI audit logs show that the company did its homework and had controls in place. This not only lowers the risk of legal problems but makes people more trusting of AI systems. With AI, teams and executives can be sure that every decision made is safe because it is open and accountable.
Implementing an “audit loop” of continuous AI compliance might sound like extra work, but in reality, it enables faster and safer AI delivery. By integrating governance into each stage of the AI lifecycle, from shadow mode trial runs to real-time monitoring to immutable logging, organizations can move quickly and responsibly. Issues are caught early, so they don’t snowball into major failures that require project-halting fixes later. Developers and data scientists can iterate on models without endless back-and-forth with compliance reviewers, because many compliance checks are automated and happen in parallel.
Rather than slowing down delivery, this approach often accelerates it: Teams spend less time on reactive damage control or lengthy audits, and more time on innovation because they are confident that compliance is under control in the background.
There are bigger benefits to continuous AI compliance, too. It gives end-users, business leaders and regulators a reason to believe that AI systems are being handled responsibly. When every AI decision is clearly recorded, watched and checked for quality, stakeholders are much more likely to accept AI solutions. This trust benefits the whole industry and society, not just individual businesses.
An audit-loop governance model can stop AI failures and ensure AI behavior is in line with moral and legal standards. In fact, strong AI governance benefits the economy and the public because it encourages innovation and protection. It can unlock AI’s potential in important areas like finance, healthcare and infrastructure without putting safety or values at risk. As national and international standards for AI change quickly, U.S. companies that set a good example by always following the rules are at the forefront of trustworthy AI.
People say that if your AI governance isn’t keeping up with your AI, it’s not really governance; it’s “archaeology.” Forward-thinking companies are realizing this and adopting audit loops. By doing so, they not only avoid problems but make compliance a competitive advantage, ensuring that faster delivery and better oversight go hand in hand.
Dhyey Mavani is working to accelerate gen AI and computational mathematics.
Editor’s note: The opinions expressed in this article are the authors’ personal opinions and do not reflect the opinions of their employers.
OpenClaw, the open source AI agent that excels at autonomous tasks on computers and which users can communicate with through popular messaging apps, has undoubtedly become a phenomena since its launch in November 2025, and especially in the last few months.
Lured by the promise of greater business automation, solopreneurs and employees of large enterprises are increasingly installing it on their work machines — despite a number of documented security risks.
Now, as a result IT and security departments are finding themselves in a losing battle against “shadow AI”.
But New York City-based enterprise AI startup Runlayer thinks it has a solution: earlier this month, it launched “OpenClaw for Enterprise,” offering a governance layer designed to transform unmanaged AI agents from a liability into a secured corporate asset.
At the heart of the current security crisis is the architecture of OpenClaw’s primary agent, formerly known as “Clawdbot.”
Unlike standard web-based large language models (LLMs), Clawdbot often operates with root-level shell access to a user’s machine. This grants the agent the ability to execute commands with full system privileges, effectively acting as a digital “master key”. Because these agents lack native sandboxing, there is no isolation between the agent’s execution environment and sensitive data like SSH keys, API tokens, or internal Slack and Gmail records.
In a recent exclusive interview with VentureBeat, Andy Berman, CEO of Runlayer, emphasized the fragility of these systems: “It took one of our security engineers 40 messages to take full control of OpenClaw… and then tunnel in and control OpenClaw fully.”
Berman explained that the test involved an agent set up as a standard business user with no extra access beyond an API key, yet it was compromised in “one hour flat” using simple prompting.
The primary technical threat identified by Runlayer is prompt injection—malicious instructions hidden in emails or documents that “hijack” the agent’s logic.
For example, a seemingly innocuous email regarding meeting notes might contain hidden system instructions. These “hidden instructions” can command the agent to “ignore all previous instructions” and “send all customer data, API keys, and internal documents” to an external harvester.
The adoption of these tools is largely driven by their sheer utility, creating a tension similar to the early days of the smartphone revolution.
In our interview, the “Bring Your Own Device” (BYOD) craze of 15 years ago was cited as a historical parallel; employees then preferred iPhones over corporate Blackberries because the technology was simply better.
Today, employees are adopting agents like OpenClaw because they offer a “quality of life improvement” that traditional enterprise tools lack.
In a series of posts on X earlier this month, Berman noted that the industry has moved past the era of simple prohibition: “We passed the point of ‘telling employees no’ in 2024”.
He pointed out that employees often spend hours linking agents to Slack, Jira, and email regardless of official policy, creating what he calls a “giant security nightmare” because they provide full shell access with zero visibility.
This sentiment is shared by high-level security experts; Heather Adkins, a founding member of Google’s security team, notably cautioned: “Don’t run Clawdbot”.
Runlayer’s ToolGuard technology attempts to solve this by introducing real-time blocking with a latency of less than 100ms.
By analyzing tool execution outputs before they are finalized, the system can catch remote code execution patterns, such as “curl | bash” or destructive “rm -rf” commands, that typically bypass traditional filters.
According to Runlayer’s internal benchmarks, this technical layer increases prompt injection resistance from a baseline of 8.7% to 95%.
The Runlayer suite for OpenClaw is structured around two primary pillars: discovery and active defense.
OpenClaw Watch: This tool functions as a detection mechanism for “shadow” Model Context Protocol (MCP) servers across an organization. It can be deployed via Mobile Device Management (MDM) software to scan employee devices for unmanaged configurations.
Runlayer ToolGuard: This is the active enforcement engine that monitors every tool call made by the agent,. It is designed to catch over 90% of credential exfiltration attempts, specifically looking for the “leaking” of AWS keys, database credentials, and Slack tokens.
Berman noted in our interview that the goal is to provide the infrastructure to govern AI agents “in the same way that the enterprise learned to govern the cloud, to govern SaaS, to govern mobile”.
Unlike standard LLM gateways or MCP proxies, Runlayer provides a control plane that integrates directly with existing enterprise identity providers (IDPs) like Okta and Entra.
While the OpenClaw community often relies on open-source or unmanaged scripts, Runlayer positions its enterprise solution as a proprietary commercial layer designed to meet rigorous standards. The platform is SOC 2 certified and HIPAA certified, making it a viable option for companies in highly regulated sectors.
Berman clarified the company’s approach to data in the interview, stating: “Our ToolGuard model family… these are all focused on the security risks with these type of tools, and we don’t train on organizations’ data”. He further emphasized that contracting with Runlayer “looks exactly like you’re contracting with a security vendor,” rather than an LLM inference provider.
This distinction is critical; it means any data used is anonymized at the source, and the platform does not rely on inference to provide its security layers.
For the end-user, this licensing model means a transition from “community-supported” risk to “enterprise-supported” stability. While the underlying AI agent might be flexible and experimental, the Runlayer wrapper provides the legal and technical guarantees—such as terms of service and privacy policies—that large organizations require.
Runlayer’s pricing structure deviates from the traditional per-user seat model common in SaaS. Berman explained in our interview that the company prefers a platform fee to encourage wide-scale adoption without the friction of incremental costs: “We don’t believe in charging per user. We want you to roll it enterprise across your organization”.
This platform fee is scoped based on the size of the deployment and the specific capabilities the customer requires.
Because Runlayer functions as a comprehensive control plane—offering “six products on day one”—the pricing is tailored to the infrastructure needs of the enterprise rather than simple headcount.
Runlayer’s current focus is on enterprise and mid-market segments, but Berman noted that the company plans to introduce offerings in the future specifically “scoped to smaller companies”.
Runlayer is designed to fit into the existing “stack” used by security and infrastructure teams. For engineering and IT teams, it can be deployed in the cloud, within a private virtual private cloud (VPC), or even on-premise. Every tool call is logged and auditable, with integrations that allow data to be exported to SIEM vendors like Datadog or Splunk.
During our interview, Berman highlighted the positive cultural shift that occurs when these tools are secured properly, rather than banned. He cited the example of Gusto, where the IT team was renamed the “AI transformation team” after partnering with Runlayer.
Berman said: “We have taken their company from… not using these type of tools, to half the company on a daily basis using MCP, and it’s incredible”. He noted that this includes non-technical users, proving that safe AI adoption can scale across an entire workforce.
Similarly, Berman shared a quote from a customer at home sales tech firm OpenDoor who claimed that “hands down, the biggest quality of life improvement I’m noticing at OpenDoor is Runlayer” because it allowed them to connect agents to sensitive, private systems without fear of compromise.
The market response appears to validate the need for this “middle ground” in AI governance. Runlayer already powers security for several high-growth companies, including Gusto, Instacart, Homebase, and AngelList.
These early adopters suggest that the future of AI in the workplace may not be found in banning powerful tools, but in wrapping them in a layer of measurable, real-time governance.
As the cost of tokens drops and the capabilities of models like “Opus 4.5” or “GPT 5.2” increase, the urgency for this infrastructure only grows.
“The question isn’t really whether enterprise will use agents,” Berman concluded in our interview, “it’s whether they can do it, how fast they can do it safely, or they’re going to just do it recklessly, and it’s going to be a disaster”.
For the modern CISO, the goal is no longer to be the person who says “no,” but to be the enabler who brings a “governed, safe, and secure way to roll out AI”.
Agents built on top of today’s models often break with simple changes — a new library, a workflow modification — and require a human engineer to fix it. That’s one of the most persistent challenges in deploying AI for the enterprise: creating agents that can adapt to dynamic environments without constant hand-holding. While today’s models are powerful, they are largely static.
To address this, researchers at the University of California, Santa Barbara have developed Group-Evolving Agents (GEA), a new framework that enables groups of AI agents to evolve together, sharing experiences and reusing their innovations to autonomously improve over time.
In experiments on complex coding and software engineering tasks, GEA substantially outperformed existing self-improving frameworks. Perhaps most notably for enterprise decision-makers, the system autonomously evolved agents that matched or exceeded the performance of frameworks painstakingly designed by human experts.
Most existing agentic AI systems rely on fixed architectures designed by engineers. These systems often struggle to move beyond the capability boundaries imposed by their initial designs.
To solve this, researchers have long sought to create self-evolving agents that can autonomously modify their own code and structure to overcome their initial limits. This capability is essential for handling open-ended environments where the agent must continuously explore new solutions.
However, current approaches to self-evolution have a major structural flaw. As the researchers note in their paper, most systems are inspired by biological evolution and are designed around “individual-centric” processes. These methods typically use a tree-structured approach: a single “parent” agent is selected to produce offspring, creating distinct evolutionary branches that remain strictly isolated from one another.
This isolation creates a silo effect. An agent in one branch cannot access the data, tools, or workflows discovered by an agent in a parallel branch. If a specific lineage fails to be selected for the next generation, any valuable discovery made by that agent, such as a novel debugging tool or a more efficient testing workflow, dies out with it.
In their paper, the researchers question the necessity of adhering to this biological metaphor. “AI agents are not biological individuals,” they argue. “Why should their evolution remain constrained by biological paradigms?”
GEA shifts the paradigm by treating a group of agents, rather than an individual, as the fundamental unit of evolution.
The process begins by selecting a group of parent agents from an existing archive. To ensure a healthy mix of stability and innovation, GEA selects these agents based on a combined score of performance (competence in solving tasks) and novelty (how distinct their capabilities are from others).
Unlike traditional systems where an agent only learns from its direct parent, GEA creates a shared pool of collective experience. This pool contains the evolutionary traces from all members of the parent group, including code modifications, successful solutions to tasks, and tool invocation histories. Every agent in the group gains access to this collective history, allowing them to learn from the breakthroughs and mistakes of their peers.
A “Reflection Module,” powered by a large language model, analyzes this collective history to identify group-wide patterns. For instance, if one agent discovers a high-performing debugging tool while another perfects a testing workflow, the system extracts both insights. Based on this analysis, the system generates high-level “evolution directives” that guide the creation of the child group. This ensures the next generation possesses the combined strengths of all their parents, rather than just the traits of a single lineage.
However, this hive-mind approach works best when success is objective, such as in coding tasks. “For less deterministic domains (e.g., creative generation), evaluation signals are weaker,” Zhaotian Weng and Xin Eric Wang, co-authors of the paper, told VentureBeat in written comments. “Blindly sharing outputs and experiences may introduce low-quality experiences that act as noise. This suggests the need for stronger experience filtering mechanisms” for subjective tasks.
The researchers tested GEA against the current state-of-the-art self-evolving baseline, the Darwin Godel Machine (DGM), on two rigorous benchmarks. The results demonstrated a massive leap in capability without increasing the number of agents used.
This collaborative approach also makes the system more robust against failure. In their experiments, the researchers intentionally broke agents by manually injecting bugs into their implementations. GEA was able to repair these critical bugs in an average of 1.4 iterations, while the baseline took 5 iterations. The system effectively leverages the “healthy” members of the group to diagnose and patch the compromised ones.
On SWE-bench Verified, a benchmark consisting of real GitHub issues including bugs and feature requests, GEA achieved a 71.0% success rate, compared to the baseline’s 56.7%. This translates to a significant boost in autonomous engineering throughput, meaning the agents are far more capable of handling real-world software maintenance. Similarly, on Polyglot, which tests code generation across diverse programming languages, GEA achieved 88.3% against the baseline’s 68.3%, indicating high adaptability to different tech stacks.
For enterprise R&D teams, the most critical finding is that GEA allows AI to design itself as effectively as human engineers. On SWE-bench, GEA’s 71.0% success rate effectively matches the performance of OpenHands, the top human-designed open-source framework. On Polyglot, GEA significantly outperformed Aider, a popular coding assistant, which achieved 52.0%. This suggests that organizations may eventually reduce their reliance on large teams of prompt engineers to tweak agent frameworks, as the agents can meta-learn these optimizations autonomously.
This efficiency extends to cost management. “GEA is explicitly a two-stage system: (1) agent evolution, then (2) inference/deployment,” the researchers said. “After evolution, you deploy a single evolved agent… so enterprise inference cost is essentially unchanged versus a standard single-agent setup.”
The success of GEA stems largely from its ability to consolidate improvements. The researchers tracked specific innovations invented by the agents during the evolutionary process. In the baseline approach, valuable tools often appeared in isolated branches but failed to propagate because those specific lineages ended. In GEA, the shared experience model ensured these tools were adopted by the best-performing agents. The top GEA agent integrated traits from 17 unique ancestors (representing 28% of the population) whereas the best baseline agent integrated traits from only 9. In effect, GEA creates a “super-employee” that possesses the combined best practices of the entire group.
“A GEA-inspired workflow in production would allow agents to first attempt a few independent fixes when failures occur,” the researchers explained regarding this self-healing capability. “A reflection agent (typically powered by a strong foundation model) can then summarize the outcomes… and guide a more comprehensive system update.”
Furthermore, the improvements discovered by GEA are not tied to a specific underlying model. Agents evolved using one model, such as Claude, maintained their performance gains even when the underlying engine was swapped to another model family, such as GPT-5.1 or GPT-o3-mini. This transferability offers enterprises the flexibility to switch model providers without losing the custom architectural optimizations their agents have learned.
For industries with strict compliance requirements, the idea of self-modifying code might sound risky. To address this, the authors said: “We expect enterprise deployments to include non-evolvable guardrails, such as sandboxed execution, policy constraints, and verification layers.”
While the researchers plan to release the official code soon, developers can already begin implementing the GEA architecture conceptually on top of existing agent frameworks. The system requires three key additions to a standard agent stack: an “experience archive” to store evolutionary traces, a “reflection module” to analyze group patterns, and an “updating module” that allows the agent to modify its own code based on those insights.
Looking ahead, the framework could democratize advanced agent development. “One promising direction is hybrid evolution pipelines,” the researchers said, “where smaller models explore early to accumulate diverse experiences, and stronger models later guide evolution using those experiences.”
Anthropic on Tuesday released Claude Sonnet 4.6, a model that amounts to a seismic repricing event for the AI industry. It delivers near-flagship intelligence at mid-tier cost, and it lands squarely in the middle of an unprecedented corporate rush to deploy AI agents and automated coding tools.
The model is a full upgrade across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. It features a 1M token context window in beta. It is now the default model in claude.ai and Claude Cowork, and pricing holds steady at $3/$15 per million tokens — the same as its predecessor, Sonnet 4.5.
That pricing detail is the headline that matters most. Anthropic’s flagship Opus models cost $15/$75 per million tokens — five times the Sonnet price. Yet performance that would have previously required reaching for an Opus-class model — including on real-world, economically valuable office tasks — is now available with Sonnet 4.6. For the thousands of enterprises now deploying AI agents that make millions of API calls per day, that math changes everything.
To understand the significance of this release, you need to understand the moment it arrives in. The past year has been dominated by the twin phenomena of “vibe coding” and agentic AI. Claude Code — Anthropic’s developer-facing terminal tool — has become a cultural force in Silicon Valley, with engineers building entire applications through natural-language conversation. The New York Times profiled its meteoric rise in January. The Verge recently declared that Claude Code is having a genuine “moment.” OpenAI, meanwhile, has been waging its own offensive with Codex desktop applications and faster inference chips.
The result is an industry where AI models are no longer evaluated in isolation. They are evaluated as the engines inside autonomous agents — systems that run for hours, make thousands of tool calls, write and execute code, navigate browsers, and interact with enterprise software. Every dollar spent per million tokens gets multiplied across those thousands of calls. At scale, the difference between $15 and $3 per million input tokens is not incremental. It is transformational.
The benchmark table Anthropic released paints a striking picture. On SWE-bench Verified, the industry-standard test for real-world software coding, Sonnet 4.6 scored 79.6% — nearly matching Opus 4.6’s 80.8%. On agentic computer use (OSWorld-Verified), Sonnet 4.6 scored 72.5%, essentially tied with Opus 4.6’s 72.7%. On office tasks (GDPval-AA Elo), Sonnet 4.6 actually scored 1633, surpassing Opus 4.6’s 1606. On agentic financial analysis, Sonnet 4.6 hit 63.3%, beating every model in the comparison, including Opus 4.6 at 60.1%.
These are not marginal differences. In many of the categories enterprises care about most, Sonnet 4.6 matches or beats models that cost five times as much to run. An enterprise running an AI agent that processes 10 million tokens per day was previously forced to choose between inferior results at lower cost or superior results at rapidly scaling expense. Sonnet 4.6 largely eliminates that trade-off.
In Claude Code, early testing found that users preferred Sonnet 4.6 over Sonnet 4.5 roughly 70% of the time. Users even preferred Sonnet 4.6 to Opus 4.5, Anthropic’s frontier model from November, 59% of the time. They rated Sonnet 4.6 as significantly less prone to over-engineering and “laziness,” and meaningfully better at instruction following. They reported fewer false claims of success, fewer hallucinations, and more consistent follow-through on multi-step tasks.
One of the most dramatic storylines in the release is Anthropic’s progress on computer use — the ability of an AI to operate a computer the way a human does, clicking a mouse, typing on a keyboard, and navigating software that lacks modern APIs.
When Anthropic first introduced this capability in October 2024, the company acknowledged it was “still experimental — at times cumbersome and error-prone.” The numbers since then tell a remarkable story: on OSWorld, Claude Sonnet 3.5 scored 14.9% in October 2024. Sonnet 3.7 reached 28.0% in February 2025. Sonnet 4 hit 42.2% by June. Sonnet 4.5 climbed to 61.4% in October. Now Sonnet 4.6 has reached 72.5% — nearly a fivefold improvement in 16 months.
This matters because computer use is the capability that unlocks the broadest set of enterprise applications for AI agents. Almost every organization has legacy software — insurance portals, government databases, ERP systems, hospital scheduling tools — that was built before APIs existed. A model that can simply look at a screen and interact with it opens all of these to automation without building bespoke connectors.
Jamie Cuffe, CEO of Pace, said Sonnet 4.6 hit 94% on their complex insurance computer use benchmark, the highest of any Claude model tested. “It reasons through failures and self-corrects in ways we haven’t seen before,” Cuffe said in a statement sent to VentureBeat. Will Harvey, co-founder of Convey, called it “a clear improvement over anything else we’ve tested in our evals.”
The safety dimension of computer use also got attention. Anthropic noted that computer use poses prompt injection risks — malicious actors hiding instructions on websites to hijack the model — and said its evaluations show Sonnet 4.6 is a major improvement over Sonnet 4.5 in resisting such attacks. For enterprises deploying agents that browse the web and interact with external systems, that hardening is not optional.
The customer reaction has been unusually specific about cost-performance dynamics. Multiple early testers explicitly described Sonnet 4.6 as eliminating the need to reach for the more expensive Opus tier.
Caitlin Colgrove, CTO of Hex Technologies, said the company is moving the majority of its traffic to Sonnet 4.6, noting that with adaptive thinking and high effort, “we see Opus-level performance on all but our hardest analytical tasks with a more efficient and flexible profile. At Sonnet pricing, it’s an easy call for our workloads.”
Ben Kus, CTO of Box, said the model outperformed Sonnet 4.5 in heavy reasoning Q&A by 15 percentage points across real enterprise documents. Michele Catasta, President of Replit, called the performance-to-cost ratio “extraordinary.” Ryan Wiggins of Mercury Banking put it more bluntly: “Claude Sonnet 4.6 is faster, cheaper, and more likely to nail things on the first try. That combination was a surprising combination of improvements, and we didn’t expect to see it at this price point.”
The coding improvements resonate particularly given Claude Code’s dominance in the developer tools market. David Loker, VP of AI at CodeRabbit, said the model “punches way above its weight class for the vast majority of real-world PRs.” Leo Tchourakov of Factory AI said the team is “transitioning our Sonnet traffic over to this model.” GitHub’s VP of Product, Joe Binder, confirmed the model is “already excelling at complex code fixes, especially when searching across large codebases is essential.”
Brendan Falk, Founder and CEO of Hercules, went further: “Claude Sonnet 4.6 is the best model we have seen to date. It has Opus 4.6 level accuracy, instruction following, and UI, all for a meaningfully lower cost.”
Buried in the technical details is a capability that hints at where autonomous AI agents are heading. Sonnet 4.6’s 1M token context window can hold entire codebases, lengthy contracts, or dozens of research papers in a single request. Anthropic says the model reasons effectively across all that context — a claim the company demonstrated through an unusual evaluation.
The Vending-Bench Arena tests how well a model can run a simulated business over time, with different AI models competing against each other for the biggest profits. Without human prompting, Sonnet 4.6 developed a novel strategy: it invested heavily in capacity for the first ten simulated months, spending significantly more than its competitors, and then pivoted sharply to focus on profitability in the final stretch. The model ended its 365-day simulation at approximately $5,700 in balance, compared to Sonnet 4.5’s roughly $2,100.
This kind of multi-month strategic planning, executed autonomously, represents a qualitatively different capability than answering questions or generating code snippets. It is the type of long-horizon reasoning that makes AI agents viable for real business operations — and it helps explain why Anthropic is positioning Sonnet 4.6 not just as a chatbot upgrade, but as the engine for a new generation of autonomous systems.
This release does not arrive in a vacuum. Anthropic is in the middle of the most consequential stretch in its history, and the competitive landscape is intensifying on every front.
On the same day as this launch, TechCrunch reported that Indian IT giant Infosys announced a partnership with Anthropic to build enterprise-grade AI agents, integrating Claude models into Infosys’s Topaz AI platform for banking, telecoms, and manufacturing. Anthropic CEO Dario Amodei told TechCrunch there is “a big gap between an AI model that works in a demo and one that works in a regulated industry,” and that Infosys helps bridge it. TechCrunch also reported that Anthropic opened its first India office in Bengaluru, and that India now accounts for about 6% of global Claude usage, second only to the U.S. The company, which CNBC reported is valued at $183 billion, has been expanding its enterprise footprint rapidly.
Meanwhile, Anthropic president Daniela Amodei told ABC News last week that AI would make humanities majors “more important than ever,” arguing that critical thinking skills would become more valuable as large language models master technical work. It is the kind of statement a company makes when it believes its technology is about to reshape entire categories of white-collar employment.
The competitive picture for Sonnet 4.6 is also notable. The model outperforms Google’s Gemini 3 Pro and OpenAI’s GPT-5.2 on multiple benchmarks. GPT-5.2 trails on agentic computer use (38.2% vs. 72.5%), agentic search (77.9% vs. 74.7% for Sonnet 4.6’s non-Pro score), and agentic financial analysis (59.0% vs. 63.3%). Gemini 3 Pro shows competitive performance on visual reasoning and multilingual benchmarks, but falls behind on the agentic categories where enterprise investment is surging.
The broader takeaway may not be about any single model. It is about what happens when Opus-class intelligence becomes available for a few dollars per million tokens rather than a few tens of dollars. Companies that were cautiously piloting AI agents with small deployments now face a fundamentally different cost calculus. The agents that were too expensive to run continuously in January are suddenly affordable in February.
Claude Sonnet 4.6 is available now on all Claude plans, Claude Cowork, Claude Code, the API, and all major cloud platforms. Anthropic has also upgraded its free tier to Sonnet 4.6 by default. Developers can access it immediately using claude-sonnet-4-6 via the Claude API.
As AI-powered coding tools flood the market, a critical weakness has emerged: by default, as with most LLM chat sessions, they are temporary — as soon as you close a session and start a new one, the tool forgets everything you were just working on.
Developers have worked around this by having coding tools and agents save their state to markdown and text files, but this solution is hacky at best.
Qodo, the AI code review startup, believes it has a solution with the launch of what it calls the industry’s first intelligent Rules System for AI governance — a framework that gives AI code reviewers persistent, organizational memory.
The new system, announced today as part of Qodo 2.1, replaces static, manually maintained rule files with an intelligent governance layer. It automatically generates rules from actual code patterns and past review decisions, continuously maintains rule health, enforces standards in every code review, and measures real-world impact.
For Itamar Friedman, CEO and co-founder of Qodo, the release represents a pivotal moment not just for his company but for the entire AI development tools space.
“I strongly believe that this announcement of ours is most important we ever done,” Friedman said in an interview with VentureBeat.
To explain the limitation of current AI coding tools, Friedman invokes the 2000 Christopher Nolan film Memento, in which the protagonist suffers from short-term memory loss and must tattoo notes on his body to remember crucial information.
“Every time you call them, it’s a machine that wakes up from scratch,” Friedman said of today’s AI coding assistants. “So all it can do is, before it goes to sleep and restart, it could write whatever it did in a file.”
This approach—saving context to markdown files like agents.md or napkin.md—has become a common workaround among developers using tools like Claude Code and Cursor. But Friedman argues this method breaks down at enterprise scale.
“Think about heavy duty software where you now have, let’s say, 100,000 of those sticky notes,” he said. “Some of them are sticky notes. Some of them are huge explanations. Some of them are stories. You wake up and you get a task. The first thing that [the AI] is doing is statistically starting to look for the right memos… It’s much better than not having it. But it’s very random.”
The evolution of AI development tools has followed a clear trajectory, according to Friedman: from autocomplete (GitHub Copilot) to question-and-answer (ChatGPT) to agentic coding within the IDE (Cursor) to agentic capabilities everywhere (Claude Code). But he contends all of these remain fundamentally stateless.
“In order for software development to really revolutionize how we do software development for real world software, it needs to be a stateful machine,” Friedman said.
The core challenge, he explained, is that code quality is inherently subjective. Different organizations have different standards, and even teams within the same enterprise may approach problems differently.
“In order to really reach high level of automation, you need to be able to customize for the specific requirements of the enterprise,” Friedman said. “You need to be able to provide code in high quality. But quality is subjective.”
Qodo’s answer is what Friedman describes as “memory that is built over a long time and is accessible to the coding agents, and then they can poke and check and verify that what they’re actually doing is according to the subjective needs of the enterprise.”
Qodo’s Rules System establishes what the company calls a unified source of truth for organizational coding standards. The system includes several key components:
Automatic Rule Discovery: A Rules Discovery Agent generates standards from codebases and pull request feedback, eliminating manual authoring of rule files.
Intelligent Maintenance: A Rules Expert Agent continuously identifies conflicts, duplicates, and outdated standards to prevent what the company calls “rule decay.”
Scalable Enforcement: Rules are automatically enforced during pull request code review, with recommended fixes provided to developers.
Real-World Analytics: Organizations can track adoption rates, violation trends, and improvement metrics to prove standards are being followed.
Friedman emphasized that this represents a fundamental shift in how AI code review tools operate. “It’s the first time that AI code review tool is moving from reactive to proactive,” he said.
The system surfaces rules based on code patterns, best practices, and its own library, then presents them to technical leads for approval. Once accepted, organizations receive statistics on rule adoption and violations across their entire codebase.
What distinguishes Qodo’s approach, according to Friedman, is how tightly the rules system integrates with the AI agents themselves—as opposed to treating memory as an external resource the AI must search through.
“At Qodo, this memory and agents are much more connected, like we have in our brain,” Friedman said. “There’s much more structure to it… where different parts are well connected and not separated.”
Friedman noted that Qodo applies fine-tuning and reinforcement learning techniques to this integrated system, which he credits for the company achieving an 11% improvement in precision and recall over other platforms, successfully identifying 580 defects across 100 real-world production PRs.
Friedman offered a prediction for the industry: “When you look one year ahead, it will be very clear that when we started 2026, we were in stateless machines that are trying to hack how they interact with memory. And we will have a very coupled way by the end of 2026, and Qodo 2.1 is the first blueprint of how to do that.”
Qodo positions itself as an enterprise-first company, offering multiple deployment options. Organizations can deploy the system entirely within their own infrastructure via cloud premise or VPN, use a single-tenant SaaS option where Qodo hosts an isolated instance, or opt for traditional self-serve SaaS.
The rules and memory files can reside wherever the enterprise requires—on their own cloud infrastructure or hosted by Qodo—addressing data governance concerns that enterprise customers typically raise.
On pricing, Qodo is maintaining its existing seat-based model with usage quotas. At present, the company offers three pricing tiers: a free Developer plan for individuals with 30 PR reviews per month, a Teams plan at $38 per user per month (with 21% savings available for annual billing) that includes 20 PRs per user monthly and 2,500 IDE/CLI credits, and a custom-priced Enterprise plan with contact-us pricing that adds features like multi-repo context awareness, on-prem deployment options, SSO, and priority support.
Friedman acknowledged the ongoing industry debate about whether seat-based pricing makes sense in an age of AI agents but said the company plans to address this topic more comprehensively later this year.
“If you get more value, you pay more,” Friedman said. “If you don’t, then we’re all good.”
Ofer Morag Brin of HR technology company Hibob, an early user of the Rules System, reported positive results in a press statement Qodo shared with VentureBeat ahead of the launch.
“Qodo’s Rules System didn’t just surface the standards we had scattered across different places; it operationalized them,” Brin said. “The system continuously reinforces how our teams actually review and write code, and we are seeing stronger consistency, faster onboarding, and measurable improvements in review quality across teams.”
Founded in 2018, Qodo has raised $50 million from investors including TLV Partners, Vine Ventures, Susa Ventures, and Square Peg, with angel investors from OpenAI, Shopify, and Snyk.
From miles away across the desert, the Great Pyramid looks like a perfect, smooth geometry — a sleek triangle pointing to the stars. Stand at the base, however, and the illusion of smoothness vanishes. You see massive, jagged blocks of limestone. It is not a slope; it is a staircase.
Remember this the next time you hear futurists talking about exponential growth.
Intel’s co-founder Gordon Moore (Moore’s Law) is famously quoted for saying in 1965 that the transistor count on a microchip would double every year. Another Intel executive, David House, later revised this statement to “compute power doubling every 18 months.” For a while, Intel’s CPUs were the poster child of this law. That is, until the growth in CPU performance flattened out like a block of limestone.
If you zoom out, though, the next limestone block was already there — the growth in compute merely shifted from CPUs to the world of GPUs. Jensen Huang, Nvidia’s CEO, played a long game and came out a strong winner, building his own stepping stones initially with gaming, then computer visioniand recently, generative AI.
Technology growth is full of sprints and plateaus, and gen AI is not immune. The current wave is driven by transformer architecture. To quote Anthropic’s President and co-founder Dario Amodei: “The exponential continues until it doesn’t. And every year we’ve been like, ‘Well, this can’t possibly be the case that things will continue on the exponential’ — and then every year it has.”
But just as the CPU plateaued and GPUs took the lead, we are seeing signs that LLM growth is shifting paradigms again. For example, late in 2024, DeepSeek surprised the world by training a world-class model on an impossibly small budget, in part by using the MoE technique.
Do you remember where you recently saw this technique mentioned? Nvidia’s Rubin press release: The technology includes “…the latest generations of Nvidia NVLink interconnect technology… to accelerate agentic AI, advanced reasoning and massive-scale MoE model inference at up to 10x lower cost per token.”
Jensen knows that achieving that coveted exponential growth in compute doesn’t come from pure brute force anymore. Sometimes you need to shift the architecture entirely to place the next stepping stone.
This long introduction brings us to Groq.
The biggest gains in AI reasoning capabilities in 2025 were driven by “inference time compute” — or, in lay terms, “letting the model think for a longer period of time.” But time is money. Consumers and businesses do not like waiting.
Groq comes into play here with its lightning-speed inference. If you bring together the architectural efficiency of models like DeepSeek and the sheer throughput of Groq, you get frontier intelligence at your fingertips. By executing inference faster, you can “out-reason” competitive models, offering a “smarter” system to customers without the penalty of lag.
For the last decade, the GPU has been the universal hammer for every AI nail. You use H100s to train the model; you use H100s (or trimmed-down versions) to run the model. But as models shift toward “System 2” thinking — where the AI reasons, self-corrects and iterates before answering — the computational workload changes.
Training requires massive parallel brute force. Inference, especially for reasoning models, requires faster sequential processing. It must generate tokens instantly to facilitate complex chains of thought without the user waiting minutes for an answer. Groq’s LPU (Language Processing Unit) architecture removes the memory bandwidth bottleneck that plagues GPUs during small-batch inference, delivering lightning-fast inference.
For the C-Suite, this potential convergence solves the “thinking time” latency crisis. Consider the expectations from AI agents: We want them to autonomously book flights, code entire apps and research legal precedent. To do this reliably, a model might need to generate 10,000 internal “thought tokens” to verify its own work before it outputs a single word to the user.
On a standard GPU: 10,000 thought tokens might take 20 to 40 seconds. The user gets bored and leaves.
On Groq: That same chain of thought happens in less than 2 seconds.
If Nvidia integrates Groq’s technology, they solve the “waiting for the robot to think” problem. They preserve the magic of AI. Just as they moved from rendering pixels (gaming) to rendering intelligence (gen AI), they would now move to rendering reasoning in real-time.
Furthermore, this creates a formidable software moat. Groq’s biggest hurdle has always been the software stack; Nvidia’s biggest asset is CUDA. If Nvidia wraps its ecosystem around Groq’s hardware, they effectively dig a moat so wide that competitors cannot cross it. They would offer the universal platform: The best environment to train and the most efficient environment to run (Groq/LPU).
Consider what happens when you couple that raw inference power with a next-generation open source model (like the rumored DeepSeek 4): You get an offering that would rival today’s frontier models in cost, performance and speed. That opens up opportunities for Nvidia, from directly entering the inference business with its own cloud offering, to continuing to power a growing number of exponentially growing customers.
Returning to our opening metaphor: The “exponential” growth of AI is not a smooth line of raw FLOPs; it is a staircase of bottlenecks being smashed.
Block 1: We couldn’t calculate fast enough. Solution: The GPU.
Block 2: We couldn’t train deep enough. Solution: Transformer architecture.
Block 3: We can’t “think” fast enough. Solution: Groq’s LPU.
Jensen Huang has never been afraid to cannibalize his own product lines to own the future. By validating Groq, Nvidia wouldn’t just be buying a faster chip; they would be bringing next-generation intelligence to the masses.
Andrew Filev, founder and CEO of Zencoder
The average Fortune 1000 company has more than 30,000 employees and engineering, sales and marketing teams with hundreds of members. Equally large teams exist in government, science and defense organizations. And yet, research shows that the ideal size…
Researchers at Nvidia have developed a technique that can reduce the memory costs of large language model reasoning by up to eight times. Their technique, called dynamic memory sparsification (DMS), compresses the key value (KV) cache, the temporary memory LLMs generate and store as they process prompts and reason through problems and documents.
While researchers have proposed various methods to compress this cache before, most struggle to do so without degrading the model’s intelligence. Nvidia’s approach manages to discard much of the cache while maintaining (and in some cases improving) the model’s reasoning capabilities.
Experiments show that DMS enables LLMs to “think” longer and explore more solutions without the usual penalty in speed or memory costs.
LLMs improve their performance on complex tasks by generating “chain-of-thought” tokens, essentially writing out their reasoning steps before arriving at a final answer. Inference-time scaling techniques leverage this by giving the model a larger budget to generate these thinking tokens or to explore multiple potential reasoning paths in parallel.
However, this improved reasoning comes with a significant computational cost. As the model generates more tokens, it builds up a KV cache.
For real-world applications, the KV cache is a major bottleneck. As the reasoning chain grows, the cache grows linearly, consuming vast amounts of memory on GPUs. This forces the hardware to spend more time reading data from memory than actually computing, which slows down generation and increases latency. It also caps the number of users a system can serve simultaneously, as running out of VRAM causes the system to crash or slow to a crawl.
Nvidia researchers frame this not just as a technical hurdle, but as a fundamental economic one for the enterprise.
“The question isn’t just about hardware quantity; it’s about whether your infrastructure is processing 100 reasoning threads or 800 threads for the same cost,” Piotr Nawrot, Senior Deep Learning Engineer at Nvidia, told VentureBeat.
Previous attempts to solve this focused on heuristics-based approaches. These methods use rigid rules, such as a “sliding window” that only caches the most recent tokens and deletes the rest. While this reduces memory usage, it often forces the model to discard critical information required for solving the problem, degrading the accuracy of the output.
“Standard eviction methods attempt to select old and unused tokens for eviction using heuristics,” the researchers said. “They simplify the problem, hoping that if they approximate the model’s internal mechanics, the answer will remain correct.”
Other solutions use paging to offload the unused parts of the KV cache to slower memory, but the constant swapping of data introduces latency overhead that makes real-time applications sluggish.
DMS takes a different approach by “retrofitting” existing LLMs to intelligently manage their own memory. Rather than applying a fixed rule for what to delete, DMS trains the model to identify which tokens are essential for future reasoning and which are disposable.
“It doesn’t just guess importance; it learns a policy that explicitly preserves the model’s final output distribution,” Nawrot said.
The process transforms a standard, pre-trained LLM such as Llama 3 or Qwen 3 into a self-compressing model. Crucially, this does not require training the model from scratch, which would be prohibitively expensive. Instead, DMS repurposes existing neurons within the model’s attention layers to output a “keep” or “evict” signal for each token.
For teams worried about the complexity of retrofitting, the researchers noted that the process is designed to be lightweight. “To improve the efficiency of this process, the model’s weights can be frozen, which makes the process similar to Low-Rank Adaptation (LoRA),” Nawrot said. This means a standard enterprise model like Qwen3-8B “can be retrofitted with DMS within hours on a single DGX H100.”
One of the important parts of DMS is a mechanism called “delayed eviction.” In standard sparsification, if a token is deemed unimportant, it is deleted immediately. This is risky because the model might need a split second to integrate that token’s context into its current state.
DMS mitigates this by flagging a token for eviction but keeping it accessible for a short window of time (e.g., a few hundred steps). This delay allows the model to “extract” any remaining necessary information from the token and merge it into the current context before the token is wiped from the KV cache.
“The ‘delayed eviction’ mechanism is crucial because not all tokens are simply ‘important’ (keep forever) or ‘useless’ (delete immediately). Many fall in between — they carry some information, but not enough to justify occupying an entire slot in memory,” Nawrot said. “This is where the redundancy lies. By keeping these tokens in a local window for a short time before eviction, we allow the model to attend to them and redistribute their information into future tokens.”
The researchers found that this retrofitting process is highly efficient. They could equip a pre-trained LLM with DMS in just 1,000 training steps, a tiny fraction of the compute required for the original training. The resulting models use standard kernels and can drop directly into existing high-performance inference stacks without custom hardware or complex software rewriting.
To validate the technique, the researchers applied DMS to several reasoning models, including the Qwen-R1 series (distilled from DeepSeek R1) and Llama 3.2, and tested them on difficult benchmarks like AIME 24 (math), GPQA Diamond (science), and LiveCodeBench (coding).
The results show that DMS effectively moves the Pareto frontier, the optimal trade-off between cost and performance. On the AIME 24 math benchmark, a Qwen-R1 32B model equipped with DMS achieved a score 12.0 points higher than a standard model when constrained to the same memory bandwidth budget. By compressing the cache, the model could afford to “think” much deeper and wider than the standard model could for the same memory and compute budget.
Perhaps most surprisingly, DMS defied the common wisdom that compression hurts long-context understanding. In “needle-in-a-haystack” tests, which measure a model’s ability to find a specific piece of information buried in a large document, DMS variants actually outperformed the standard models. By actively managing its memory rather than passively accumulating noise, the model maintained a cleaner, more useful context.
For enterprise infrastructure, the efficiency gains translate directly to throughput and hardware savings. Because the memory cache is significantly smaller, the GPU spends less time fetching data, reducing the wait time for users. In tests with the Qwen3-8B model, DMS matched the accuracy of the vanilla model while delivering up to 5x higher throughput. This means a single server can handle five times as many customer queries per second without a drop in quality.
Nvidia has released DMS as part of its KVPress library. Regarding how enterprises can get started with DMS, Nawrot emphasized that the barrier to entry is low. “The ‘minimum viable infrastructure’ is standard Hugging Face pipelines — no custom CUDA kernels are required,” Nawrot said, noting that the code is fully compatible with standard FlashAttention.
Looking ahead, the team views DMS as part of a larger shift where memory management becomes a distinct, intelligent layer of the AI stack. Nawrot also confirmed that DMS is “fully compatible” with newer architectures like the Multi-Head Latent Attention (MLA) used in DeepSeek’s models, suggesting that combining these approaches could yield even greater efficiency gains.
As enterprises move from simple chatbots to complex agentic systems that require extended reasoning, the cost of inference is becoming a primary concern. Techniques like DMS provide a path to scale these capabilities sustainably.
“We’ve barely scratched the surface of what is possible,” Nawrot said, “and we expect inference-time scaling to further evolve.”
Anthropic released its Claude Cowork AI agent software for Windows on Monday, bringing the file management and task automation tool to roughly 70 percent of the desktop computing market and intensifying a remarkable corporate realignment that has seen Microsoft embrace a direct competitor to its longtime AI partner, OpenAI.
The Windows launch arrives with what Anthropic calls “full feature parity” with the macOS version: file access, multi-step task execution, plugins, and Model Context Protocol (MCP) connectors for integrating external services. Users can now also set global and folder-specific instructions that Claude follows in every session, a feature developers on Reddit described as “a game-changer” for maintaining context across projects.
“Cowork is now available on Windows,” Anthropic announced on X. “We’re bringing full feature parity with MacOS: file access, multi-step task execution, plugins, and MCP connectors.”
The release closes a critical platform gap that had limited Cowork to Apple’s operating system since its January 12 debut. The Windows expansion underscores a broader transformation already underway in enterprise AI, with Microsoft simultaneously selling its own GitHub Copilot to customers while encouraging thousands of its own employees to adopt Anthropic’s competing tools internally.
The relationship between Microsoft and Anthropic has accelerated with striking speed. In November, the two companies announced a strategic partnership allowing Microsoft Foundry customers access to Claude Sonnet 4.5, Claude Opus 4.1, and Claude Haiku 4.5. As part of that arrangement, Anthropic committed to purchasing $30 billion of Azure compute capacity.
But the partnership has expanded well beyond cloud hosting. According to a January 22 report in The Verge, Microsoft has begun encouraging thousands of employees from some of its most prolific teams to adopt Claude Code — and now, by extension, Cowork — even if they have no coding experience.
Microsoft’s CoreAI team, the new AI engineering group led by former Meta engineering chief Jay Parikh, has tested Claude Code in recent months, The Verge reported. The company has also approved Claude Code across all code and repositories for its Business and Industry Copilot teams.
“Software engineers at Microsoft are now expected to use both Claude Code and GitHub Copilot and give feedback comparing the two,” The Verge reported.
The company’s spending on Anthropic approaches $500 million annually, according to The Information. Microsoft has even begun counting Anthropic AI model sales toward Azure sales quotas — an unusual incentive structure that the company typically reserves for homegrown products or models from OpenAI.
Microsoft’s embrace of Anthropic raises uncomfortable questions about its $13 billion investment in OpenAI, which has long served as the exclusive provider of frontier AI models for Microsoft’s products. The two companies signed their landmark partnership in 2019, with Microsoft providing Azure computing infrastructure in exchange for preferential access to OpenAI’s technology.
That relationship now appears to be evolving into something more nuanced. Microsoft has started favoring Anthropic’s Claude models inside Microsoft 365 apps and Copilot recently, deploying them in specific applications or features where Anthropic’s models have proven more capable than OpenAI’s counterparts.
On February 5, Microsoft announced that Claude Opus 4.6 — Anthropic’s most advanced model — would become available in Microsoft Foundry, the company’s enterprise AI platform. The Azure blog post framed the integration as bringing “even more capability to agents that increasingly learn from and act on business systems.”
“At Microsoft we believe that intelligence and trust are the core requirements of agentic AI at scale,” the announcement stated. “Built on Azure, Microsoft Foundry brings these capabilities together on a secure, scalable cloud foundation for enterprise AI.”
The timing and tone suggest Microsoft views Anthropic not merely as a hedging strategy but as a genuine technical leader in certain domains. Claude Opus 4.6 offers a one-million-token context window and 128,000-token maximum output — specifications that position it for complex, long-running enterprise tasks that require processing vast amounts of information.
The deepening Microsoft-Anthropic alliance takes on added significance when viewed against a backdrop of genuine alarm rippling through the software industry. Within days of the macOS launch in January, investors began repricing SaaS companies whose products overlap with Cowork’s capabilities — project management tools, writing assistants, data analysis platforms, and workflow automation software all saw sharp declines.
Bloomberg reported that Cowork triggered a $285 billion software stocks selloff. The carnage reflected growing investor conviction that AI agents capable of automating knowledge work could render entire categories of enterprise software obsolete.
The fear is not abstract. Cowork operates as a desktop agent powered by Claude Opus 4.6 that can read local files, execute multi-step tasks, and interact with external services through plugins — all running directly on a user’s machine. Unlike chatbot interfaces that respond to individual prompts, Cowork plans and executes complete workflows across files, applications, and connected services.
Anthropic has leaned into this positioning. On January 30, the company’s Anthropic Labs division released 11 open-source agentic plugins spanning sales, legal, finance, marketing, data analysis, and software development. These plugins connect Cowork to external tools, enabling the agent to pull data from CRMs, draft legal documents, analyze spreadsheets, or manage project boards without users switching applications.
Such convenience comes with tradeoffs, and Anthropic has been transparent about the risks inherent in agent software that can read, write, and delete files. The company’s support documentation warns users to “be cautious about granting access to sensitive information like financial documents, credentials, or personal records” and suggests saving backups and creating dedicated folders with nonsensitive information.
Cowork remains susceptible to prompt injection attacks — hidden instructions embedded in documents or websites that can hijack AI agents and redirect their actions. The browser automation feature includes an explicit disclaimer warning that hidden code in websites may “steal your data, inject malware into your systems, or take over your system.”
“We use a virtual machine under the hood,” Boris Cherny, Anthropic’s head of Claude Code, told Wired. “This means you have to say which folders Claude has access to. And if you don’t give it access to a folder, Claude literally cannot see that folder.”
The Windows version includes additional safety constraints. According to user reports on Reddit, Cowork on Windows restricts file access to the user’s personal folder, preventing the agent from accessing common development directories like C:\git. While some users expressed frustration at this limitation, others noted it as a prudent safeguard for less technical users.
“To be fair, seeing how many people nuked themselves with Claude Code, it is much safer to limit people to reduce the collateral damage,” wrote one Reddit user.
Despite the security caveats, early enterprise adoption suggests meaningful interest. Customer testimonials published alongside the Claude Opus 4.6 announcement on the Microsoft Azure blog included statements from Adobe, Dentons, and other major organizations already integrating Anthropic’s technology into their workflows.
“At Adobe, we’re continuously evaluating new AI capabilities that can help us deliver more powerful, responsible, and intuitive experiences for our customers,” said Michael Marth, VP Engineering for Experience Manager and LLM Optimizer. “Foundry gives us a flexible, enterprise-ready environment to explore frontier models while maintaining the trust, governance, and scale that are critical for Adobe.”
Matej Jambrich, CTO of Dentons Europe, described deploying Claude for legal work: “Better model reasoning reduces rework and improves consistency, so our lawyers can focus on higher value judgment.”
On Reddit, an Anthropic representative wrote that the Windows release addresses “the most consistent request” since Cowork’s macOS debut — a demand that came “especially from enterprise teams.” The detail underscores the tool’s perceived value in corporate environments where Windows dominates the desktop landscape.
Access to these capabilities comes at a price. Cowork for Windows is available in research preview at claude.com/cowork for all paid Claude subscription tiers, including Pro ($20/month), Max ($100/month), Team, and Enterprise. Free-tier users cannot access the feature.
This pricing structure positions Cowork as a premium productivity tool rather than a mass-market offering — at least for now. Anthropic has not announced plans for broader availability, and the “research preview” designation suggests the company continues to gather user feedback before committing to a general release.
The January macOS launch was similarly restricted to $100/month Max subscribers before expanding to other paid tiers, suggesting Anthropic may follow a gradual rollout strategy as it refines the product. For enterprise customers evaluating the tool, the pricing represents a fraction of what many pay for traditional software licenses—a calculus that could accelerate adoption if Cowork delivers on its automation promises.
For Microsoft, the deepening Anthropic partnership reflects a pragmatic recognition that AI leadership may require embracing multiple frontier providers rather than relying exclusively on a single partner.
The company’s willingness to deploy Claude tools internally while selling GitHub Copilot externally suggests confidence that the enterprise market can accommodate competing approaches — or perhaps an acknowledgment that betting everything on OpenAI carries its own risks.
For the broader software industry, Cowork’s expansion to Windows extends the competitive threat to an even larger installed base. Companies whose value propositions rest on task automation, file management, or workflow orchestration now face a well-funded competitor capable of replicating their core functionality through natural language commands.
The $285 billion in market capitalization that evaporated after Cowork’s January launch may prove to be just an opening salvo. With Windows support now live, Anthropic has removed the last major platform barrier between its AI agent and the enterprise customers most likely to adopt it.
The software industry spent decades building tools to help knowledge workers manage files, automate tasks, and organize information. Now it faces a future where a single application, powered by an AI that learns and improves with every interaction, threatens to do all of that and more. The question is no longer whether AI agents will reshape enterprise software, but how much of the old world will survive the transformation.
When enterprises fine-tune LLMs for new tasks, they risk breaking everything the models already know. This forces companies to maintain separate models for every skill.
Researchers at MIT, the Improbable AI Lab and ETH Zurich have developed a new technique that enables large language models to learn new skills and knowledge without forgetting their past capabilities.
Their technique, called self-distillation fine-tuning (SDFT), allows models to learn directly from demonstrations and their own experiments by leveraging the inherent in-context learning abilities of modern LLMs. Experiments show that SDFT consistently outperforms traditional supervised fine-tuning (SFT) while addressing the limitations of reinforcement learning algorithms.
For enterprise applications, the method enables a single model to accumulate multiple skills over time without suffering from performance regression on earlier tasks. This offers a potential pathway for building AI agents that can adapt to dynamic business environments, gathering new proprietary knowledge and skills as needed without requiring expensive retraining cycles or losing their general reasoning abilities.
Once an LLM is trained and deployed, it remains static. It does not update its parameters to acquire new skills, internalize new knowledge, or improve from experience. To build truly adaptive AI, the industry needs to solve “continual learning,” allowing systems to accumulate knowledge much like humans do throughout their careers.
The most effective way for models to learn is through “on-policy learning.” In this approach, the model learns from data it generates itself allowing it to correct its own errors and reasoning processes. This stands in contrast to learning by simply mimicking static datasets. Without on-policy learning, models are prone to “catastrophic forgetting,” a phenomenon where learning a new task causes the model to lose its past knowledge and ability to perform previous tasks.
However, on-policy learning typically requires reinforcement learning (RL), which depends on an explicit reward function to score the model’s outputs. This works well for problems with clear outcomes, such as math and coding. But in many real-world enterprise scenarios (e.g., writing a legal brief or summarizing a meeting), defining a mathematical reward function is difficult or impossible.
RL methods also often fail when trying to teach a model entirely new information, such as a specific company protocol or a new product line. As Idan Shenfeld, a doctorate student at MIT and co-author of the paper, told VentureBeat, “No matter how many times the base model tries, it cannot generate correct answers for a topic it has zero knowledge about,” meaning it never gets a positive signal to learn from.
The standard alternative is supervised fine-tuning (SFT), where the model is trained on a fixed dataset of expert demonstrations. While SFT provides clear ground truth, it is inherently “off-policy.” Because the model is just mimicking data rather than learning from its own attempts, it often fails to generalize to out-of-distribution examples and suffers heavily from catastrophic forgetting.
SDFT seeks to bridge this gap: enabling the benefits of on-policy learning using only prerecorded demonstrations, without needing a reward function.
SDFT solves this problem by using “distillation,” a process where a student model learns to mimic a teacher. The researchers’ insight was to use the model’s own “in-context learning” (ICL) capabilities to create a feedback loop within a single model.
In-context learning is the phenomenon where you provide the LLM with a difficult task and one or more demonstrations of how similar problems are solved. Most advanced LLMs are designed to solve new problems with ICL examples, without any parameter updates.
During the training cycle, SDFT employs the model in two roles.
The teacher: A frozen version of the model is fed the query along with expert demonstrations. Using ICL, the teacher deduces the correct answer and the reasoning logic required to reach it.
The student: This version sees only the query, simulating a real-world deployment scenario where no answer key is available.
When the student generates an answer, the teacher, which has access to the expert demonstrations, provides feedback. The student then updates its parameters to align closer to the teacher’s distribution.
This process effectively creates an on-policy learning loop by combining elements of SFT and RL. The supervision comes not from a static dataset, but from the model’s own interaction and outputs. It allows the model to correct its own reasoning trajectories without requiring an external reward signal. This process works even for new knowledge that RL would miss.
To validate the approach, the researchers tested SDFT using the open-weight Qwen 2.5 model on three complex enterprise-grade skills: science Q&A, software tool use, and medical reasoning.
The results showed that SDFT learned new tasks more effectively than standard methods. On the Science Q&A benchmark, the SDFT model achieved 70.2% accuracy, compared to 66.2% for the standard SFT approach.
More important for enterprise adoption is the impact on catastrophic forgetting. When the standard SFT model learned the science task, its ability to answer general questions (such as logic or humanities) collapsed. In contrast, the SDFT model improved on the science task while holding its “Previous Tasks” score steady at 64.5%. This stability suggests companies could specialize models for specific departments (e.g., HR or Legal) without degrading the model’s basic common sense or reasoning capabilities.
The team also simulated a knowledge injection scenario, creating a dataset of fictional “2025 Natural Disasters” to teach the model new facts. They tested the model on indirect reasoning questions, such as “Given the floods in 2025, which countries likely needed humanitarian aid?”
Standard SFT resulted in a model that memorized facts but struggled to use them in reasoning scenarios. The SDFT model, having internalized the logic during training, scored 98% on the same questions.
Finally, the researchers conducted a sequential learning experiment, training the model on science, tool use, and medical tasks one after another. While the standard model’s performance oscillated, losing previous skills as it learned new ones, the SDFT model successfully accumulated all three skills without regression.
This capability addresses a major pain point for enterprises currently managing “model zoos” of separate adapters for different tasks.
“We offer the ability to maintain only a single model for all the company’s needs,” Shenfeld said. This consolidation “can lead to a substantial reduction in inference costs” because organizations don’t need to host multiple models simultaneously.
The code for SDFT is available on GitHub and ready to be integrated into existing model training workflows.
“The SDFT pipeline is more similar to the RL pipeline in that it requires online response generation during training,” Shenfeld said. They are working with Hugging Face to integrate SDFT into the latter’s Transformer Reinforcement Learning (TRL) library, he added, noting that a pull request is already open for developers who want to test the integration.
For teams considering SDFT, the practical tradeoffs come down to model size and compute. The technique requires models with strong enough in-context learning to act as their own teachers — currently around 4 billion parameters with newer architectures like Qwen 3, though Shenfeld expects 1 billion-parameter models to work soon. It demands roughly 2.5 times the compute of standard fine-tuning, but is best suited for organizations that need a single model to accumulate multiple skills over time, particularly in domains where defining a reward function for reinforcement learning is difficult or impossible.
While effective, the method does come with computational tradeoffs. SDFT is approximately four times slower and requires 2.5 times more computational power (FLOPs) than standard fine-tuning because the model must actively generate its own answers (“rollouts”) during training to compare against the teacher. However, the researchers note that because the model retains knowledge better, organizations may avoid the costly multi-stage retraining processes often required to repair models that suffer from catastrophic forgetting.
The technique also relies on the underlying model being large enough to benefit from in-context learning. The paper notes that smaller models (e.g., 3 billion parameters) initially struggled because they lacked the “intelligence” to act as their own teachers.
However, Shenfeld said that the rapid improvement of small models is changing this dynamic. “The Qwen 2.5 3B models were too weak, but in some experiments we currently do, we found that the Qwen 3 4B model is strong enough,” he said. “I see a future where even 1B models have good enough ICL capabilities to support SDFT.”
Ultimately, the goal is to move beyond static snapshots toward systems that improve through use.
“Lifelong learning, together with the ability to extract learning signal from unstructured user interactions… will bring models that just keep and keep improving with time,” Shenfeld said.
“Think about the fact that already the majority of compute around the world goes into inference instead of training. We have to find ways to harness this compute to improve our models.”