Anthropic’s Sonnet 4.6 matches flagship AI performance at one-fifth the cost, accelerating enterprise adoption

Anthropic on Tuesday released Claude Sonnet 4.6, a model that amounts to a seismic repricing event for the AI industry. It delivers near-flagship intelligence at mid-tier cost, and it lands squarely in the middle of an unprecedented corporate rush to deploy AI agents and automated coding tools.

The model is a full upgrade across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. It features a 1M token context window in beta. It is now the default model in claude.ai and Claude Cowork, and pricing holds steady at $3/$15 per million tokens — the same as its predecessor, Sonnet 4.5.

That pricing detail is the headline that matters most. Anthropic’s flagship Opus models cost $15/$75 per million tokens — five times the Sonnet price. Yet performance that would have previously required reaching for an Opus-class model — including on real-world, economically valuable office tasks — is now available with Sonnet 4.6. For the thousands of enterprises now deploying AI agents that make millions of API calls per day, that math changes everything.

Why the cost of running AI agents at scale just dropped dramatically

To understand the significance of this release, you need to understand the moment it arrives in. The past year has been dominated by the twin phenomena of “vibe coding” and agentic AI. Claude Code — Anthropic’s developer-facing terminal tool — has become a cultural force in Silicon Valley, with engineers building entire applications through natural-language conversation. The New York Times profiled its meteoric rise in January. The Verge recently declared that Claude Code is having a genuine “moment.” OpenAI, meanwhile, has been waging its own offensive with Codex desktop applications and faster inference chips.

The result is an industry where AI models are no longer evaluated in isolation. They are evaluated as the engines inside autonomous agents — systems that run for hours, make thousands of tool calls, write and execute code, navigate browsers, and interact with enterprise software. Every dollar spent per million tokens gets multiplied across those thousands of calls. At scale, the difference between $15 and $3 per million input tokens is not incremental. It is transformational.

The benchmark table Anthropic released paints a striking picture. On SWE-bench Verified, the industry-standard test for real-world software coding, Sonnet 4.6 scored 79.6% — nearly matching Opus 4.6’s 80.8%. On agentic computer use (OSWorld-Verified), Sonnet 4.6 scored 72.5%, essentially tied with Opus 4.6’s 72.7%. On office tasks (GDPval-AA Elo), Sonnet 4.6 actually scored 1633, surpassing Opus 4.6’s 1606. On agentic financial analysis, Sonnet 4.6 hit 63.3%, beating every model in the comparison, including Opus 4.6 at 60.1%.

These are not marginal differences. In many of the categories enterprises care about most, Sonnet 4.6 matches or beats models that cost five times as much to run. An enterprise running an AI agent that processes 10 million tokens per day was previously forced to choose between inferior results at lower cost or superior results at rapidly scaling expense. Sonnet 4.6 largely eliminates that trade-off.
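How stark is the difference? A minimal back-of-the-envelope sketch, assuming a hypothetical agent that consumes 10 million input tokens and 1 million output tokens a day at the published list prices (the workload numbers are illustrative, not measured):

```python
# Back-of-the-envelope cost comparison at the published list prices.
# Daily token volumes are illustrative assumptions, not measured figures.
INPUT_TOKENS_PER_DAY = 10_000_000
OUTPUT_TOKENS_PER_DAY = 1_000_000

PRICES = {  # (input, output) in USD per million tokens
    "Sonnet 4.6": (3.00, 15.00),
    "Opus": (15.00, 75.00),
}

for model, (p_in, p_out) in PRICES.items():
    daily = (INPUT_TOKENS_PER_DAY / 1e6) * p_in + (OUTPUT_TOKENS_PER_DAY / 1e6) * p_out
    print(f"{model}: ${daily:,.2f}/day, about ${daily * 365:,.0f}/year")

# Sonnet 4.6: $45.00/day (~$16,425/year) vs. Opus: $225.00/day (~$82,125/year)
```

Run continuously across a fleet of agents, that gap is the difference between a line item and a budget line.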

In Claude Code, early testing found that users preferred Sonnet 4.6 over Sonnet 4.5 roughly 70% of the time. Users even preferred Sonnet 4.6 to Opus 4.5, Anthropic’s frontier model from November, 59% of the time. They rated Sonnet 4.6 as significantly less prone to over-engineering and “laziness,” and meaningfully better at instruction following. They reported fewer false claims of success, fewer hallucinations, and more consistent follow-through on multi-step tasks.

How Claude’s computer use abilities went from ‘experimental’ to near-human in 16 months

One of the most dramatic storylines in the release is Anthropic’s progress on computer use — the ability of an AI to operate a computer the way a human does, clicking a mouse, typing on a keyboard, and navigating software that lacks modern APIs.

When Anthropic first introduced this capability in October 2024, the company acknowledged it was “still experimental — at times cumbersome and error-prone.” The numbers since then tell a remarkable story: on OSWorld, Claude Sonnet 3.5 scored 14.9% in October 2024. Sonnet 3.7 reached 28.0% in February 2025. Sonnet 4 hit 42.2% by June. Sonnet 4.5 climbed to 61.4% in October. Now Sonnet 4.6 has reached 72.5% — nearly a fivefold improvement in 16 months.

This matters because computer use is the capability that unlocks the broadest set of enterprise applications for AI agents. Almost every organization has legacy software — insurance portals, government databases, ERP systems, hospital scheduling tools — that was built before APIs existed. A model that can simply look at a screen and interact with it opens all of these to automation without building bespoke connectors.

Jamie Cuffe, CEO of Pace, said Sonnet 4.6 hit 94% on their complex insurance computer use benchmark, the highest of any Claude model tested. “It reasons through failures and self-corrects in ways we haven’t seen before,” Cuffe said in a statement sent to VentureBeat. Will Harvey, co-founder of Convey, called it “a clear improvement over anything else we’ve tested in our evals.”

The safety dimension of computer use also got attention. Anthropic noted that computer use poses prompt injection risks — malicious actors hiding instructions on websites to hijack the model — and said its evaluations show Sonnet 4.6 is a major improvement over Sonnet 4.5 in resisting such attacks. For enterprises deploying agents that browse the web and interact with external systems, that hardening is not optional.

Enterprise customers say the model closes the gap between Sonnet and Opus pricing tiers

The customer reaction has been unusually specific about cost-performance dynamics. Multiple early testers explicitly described Sonnet 4.6 as eliminating the need to reach for the more expensive Opus tier.

Caitlin Colgrove, CTO of Hex Technologies, said the company is moving the majority of its traffic to Sonnet 4.6, noting that with adaptive thinking and high effort, “we see Opus-level performance on all but our hardest analytical tasks with a more efficient and flexible profile. At Sonnet pricing, it’s an easy call for our workloads.”

Ben Kus, CTO of Box, said the model outperformed Sonnet 4.5 in heavy reasoning Q&A by 15 percentage points across real enterprise documents. Michele Catasta, President of Replit, called the performance-to-cost ratio “extraordinary.” Ryan Wiggins of Mercury Banking put it more bluntly: “Claude Sonnet 4.6 is faster, cheaper, and more likely to nail things on the first try. That was a surprising combination of improvements, and we didn’t expect to see it at this price point.”

The coding improvements resonate particularly given Claude Code’s dominance in the developer tools market. David Loker, VP of AI at CodeRabbit, said the model “punches way above its weight class for the vast majority of real-world PRs.” Leo Tchourakov of Factory AI said the team is “transitioning our Sonnet traffic over to this model.” GitHub’s VP of Product, Joe Binder, confirmed the model is “already excelling at complex code fixes, especially when searching across large codebases is essential.”

Brendan Falk, Founder and CEO of Hercules, went further: “Claude Sonnet 4.6 is the best model we have seen to date. It has Opus 4.6 level accuracy, instruction following, and UI, all for a meaningfully lower cost.”

A simulated business competition reveals how AI agents plan over months, not minutes

Buried in the technical details is a capability that hints at where autonomous AI agents are heading. Sonnet 4.6’s 1M token context window can hold entire codebases, lengthy contracts, or dozens of research papers in a single request. Anthropic says the model reasons effectively across all that context — a claim the company demonstrated through an unusual evaluation.

The Vending-Bench Arena tests how well a model can run a simulated business over time, with different AI models competing against each other for the biggest profits. Without human prompting, Sonnet 4.6 developed a novel strategy: it invested heavily in capacity for the first ten simulated months, spending significantly more than its competitors, and then pivoted sharply to focus on profitability in the final stretch. The model ended its 365-day simulation with a balance of approximately $5,700, compared to Sonnet 4.5’s roughly $2,100.

This kind of multi-month strategic planning, executed autonomously, represents a qualitatively different capability than answering questions or generating code snippets. It is the type of long-horizon reasoning that makes AI agents viable for real business operations — and it helps explain why Anthropic is positioning Sonnet 4.6 not just as a chatbot upgrade, but as the engine for a new generation of autonomous systems.

Anthropic’s Sonnet 4.6 arrives as the company expands into enterprise markets and defense

This release does not arrive in a vacuum. Anthropic is in the middle of the most consequential stretch in its history, and the competitive landscape is intensifying on every front.

On the same day as this launch, TechCrunch reported that Indian IT giant Infosys announced a partnership with Anthropic to build enterprise-grade AI agents, integrating Claude models into Infosys’s Topaz AI platform for banking, telecoms, and manufacturing. Anthropic CEO Dario Amodei told TechCrunch there is “a big gap between an AI model that works in a demo and one that works in a regulated industry,” and that Infosys helps bridge it. TechCrunch also reported that Anthropic opened its first India office in Bengaluru, and that India now accounts for about 6% of global Claude usage, second only to the U.S. The company, which CNBC reported is valued at $183 billion, has been expanding its enterprise footprint rapidly.

Meanwhile, Anthropic president Daniela Amodei told ABC News last week that AI would make humanities majors “more important than ever,” arguing that critical thinking skills would become more valuable as large language models master technical work. It is the kind of statement a company makes when it believes its technology is about to reshape entire categories of white-collar employment.

The competitive picture for Sonnet 4.6 is also notable. The model outperforms Google’s Gemini 3 Pro and OpenAI’s GPT-5.2 on multiple benchmarks. GPT-5.2 trails badly on agentic computer use (38.2% vs. 72.5%) and on agentic financial analysis (59.0% vs. 63.3%), and only edges ahead on agentic search (77.9% vs. 74.7% for Sonnet 4.6’s non-Pro score). Gemini 3 Pro shows competitive performance on visual reasoning and multilingual benchmarks, but falls behind on the agentic categories where enterprise investment is surging.

The broader takeaway may not be about any single model. It is about what happens when Opus-class intelligence becomes available for a few dollars per million tokens rather than a few tens of dollars. Companies that were cautiously piloting AI agents with small deployments now face a fundamentally different cost calculus. The agents that were too expensive to run continuously in January are suddenly affordable in February.

Claude Sonnet 4.6 is available now on all Claude plans, Claude Cowork, Claude Code, the API, and all major cloud platforms. Anthropic has also upgraded its free tier to Sonnet 4.6 by default. Developers can access it immediately using claude-sonnet-4-6 via the Claude API.
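For teams wiring the model into existing pipelines, the switch is essentially a one-line change. Below is a minimal sketch using the official anthropic Python SDK, assuming an ANTHROPIC_API_KEY environment variable is set and that the claude-sonnet-4-6 identifier is enabled on the account:

```python
# Minimal sketch: calling Claude Sonnet 4.6 through the Messages API.
# Assumes `pip install anthropic` and ANTHROPIC_API_KEY set in the environment.
import anthropic

client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY automatically

response = client.messages.create(
    model="claude-sonnet-4-6",  # model identifier from Anthropic's announcement
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Summarize the open pull requests in this repo."}
    ],
)

print(response.content[0].text)
```

Existing Sonnet 4.5 integrations should only need the model string updated.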

Qodo 2.1 solves your coding agents’ ‘amnesia’ problem, giving them an 11% precision boost

As AI-powered coding tools flood the market, a critical weakness has emerged: by default, as with most LLM chat sessions, their memory is temporary — as soon as you close a session and start a new one, the tool forgets everything you were just working on.

Developers have worked around this by having coding tools and agents save their state to markdown and text files, but this solution is hacky at best.

Qodo, the AI code review startup, believes it has a solution with the launch of what it calls the industry’s first intelligent Rules System for AI governance — a framework that gives AI code reviewers persistent, organizational memory.

The new system, announced today as part of Qodo 2.1, replaces static, manually maintained rule files with an intelligent governance layer. It automatically generates rules from actual code patterns and past review decisions, continuously maintains rule health, enforces standards in every code review, and measures real-world impact.

For Itamar Friedman, CEO and co-founder of Qodo, the release represents a pivotal moment not just for his company but for the entire AI development tools space.

“I strongly believe that this announcement of ours is the most important we have ever done,” Friedman said in an interview with VentureBeat.

The ‘Memento’ problem

To explain the limitation of current AI coding tools, Friedman invokes the 2000 Christopher Nolan film Memento, in which the protagonist suffers from short-term memory loss and must tattoo notes on his body to remember crucial information.

“Every time you call them, it’s a machine that wakes up from scratch,” Friedman said of today’s AI coding assistants. “So all it can do is, before it goes to sleep and restart, it could write whatever it did in a file.”

This approach—saving context to markdown files like agents.md or napkin.md—has become a common workaround among developers using tools like Claude Code and Cursor. But Friedman argues this method breaks down at enterprise scale.

“Think about heavy duty software where you now have, let’s say, 100,000 of those sticky notes,” he said. “Some of them are sticky notes. Some of them are huge explanations. Some of them are stories. You wake up and you get a task. The first thing that [the AI] is doing is statistically starting to look for the right memos… It’s much better than not having it. But it’s very random.”

From stateless to stateful

The evolution of AI development tools has followed a clear trajectory, according to Friedman: from autocomplete (GitHub Copilot) to question-and-answer (ChatGPT) to agentic coding within the IDE (Cursor) to agentic capabilities everywhere (Claude Code). But he contends all of these remain fundamentally stateless.

“In order for software development to really revolutionize how we do software development for real world software, it needs to be a stateful machine,” Friedman said.

The core challenge, he explained, is that code quality is inherently subjective. Different organizations have different standards, and even teams within the same enterprise may approach problems differently.

“In order to really reach high level of automation, you need to be able to customize for the specific requirements of the enterprise,” Friedman said. “You need to be able to provide code in high quality. But quality is subjective.”

Qodo’s answer is what Friedman describes as “memory that is built over a long time and is accessible to the coding agents, and then they can poke and check and verify that what they’re actually doing is according to the subjective needs of the enterprise.”

How Qodo’s Rules System works

Qodo’s Rules System establishes what the company calls a unified source of truth for organizational coding standards. The system includes several key components:

  • Automatic Rule Discovery: A Rules Discovery Agent generates standards from codebases and pull request feedback, eliminating manual authoring of rule files.

  • Intelligent Maintenance: A Rules Expert Agent continuously identifies conflicts, duplicates, and outdated standards to prevent what the company calls “rule decay.”

  • Scalable Enforcement: Rules are automatically enforced during pull request code review, with recommended fixes provided to developers.

  • Real-World Analytics: Organizations can track adoption rates, violation trends, and improvement metrics to prove standards are being followed.

Friedman emphasized that this represents a fundamental shift in how AI code review tools operate. “It’s the first time that AI code review tool is moving from reactive to proactive,” he said.

The system surfaces rules based on code patterns, best practices, and its own library, then presents them to technical leads for approval. Once accepted, organizations receive statistics on rule adoption and violations across their entire codebase.
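Qodo has not published a schema for these rules, but conceptually each discovered rule bundles a standard with the evidence and enforcement metadata behind it. The sketch below is purely illustrative; every field name is a hypothetical placeholder rather than Qodo’s actual format:

```python
# Purely illustrative sketch of what an auto-discovered governance rule might
# carry. Field names are hypothetical placeholders, not Qodo's actual schema.
from dataclasses import dataclass, field

@dataclass
class ReviewRule:
    rule_id: str                   # stable identifier for tracking adoption
    statement: str                 # the standard, phrased for reviewers
    source: str                    # where the Rules Discovery Agent inferred it
    status: str = "proposed"       # proposed -> approved by a tech lead -> enforced
    violations_last_30d: int = 0   # analytics used to measure real-world impact
    examples: list[str] = field(default_factory=list)

rule = ReviewRule(
    rule_id="db-001",
    statement="Wrap multi-row writes in a transaction and retry on deadlock.",
    source="recurring feedback in past pull request reviews",
)
```

The point is less the fields themselves than the lifecycle: rules are generated, reviewed by humans, enforced in every PR, and measured, rather than living as static text in a markdown file.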

A tighter connection between memory and agents

What distinguishes Qodo’s approach, according to Friedman, is how tightly the rules system integrates with the AI agents themselves—as opposed to treating memory as an external resource the AI must search through.

“At Qodo, this memory and agents are much more connected, like we have in our brain,” Friedman said. “There’s much more structure to it… where different parts are well connected and not separated.”

Friedman noted that Qodo applies fine-tuning and reinforcement learning techniques to this integrated system, which he credits for the company achieving an 11% improvement in precision and recall over other platforms, successfully identifying 580 defects across 100 real-world production PRs.

Friedman offered a prediction for the industry: “When you look one year ahead, it will be very clear that when we started 2026, we were in stateless machines that are trying to hack how they interact with memory. And we will have a very coupled way by the end of 2026, and Qodo 2.1 is the first blueprint of how to do that.”

Enterprise deployment and pricing

Qodo positions itself as an enterprise-first company, offering multiple deployment options. Organizations can deploy the system entirely within their own infrastructure (in their own cloud or on-premises), use a single-tenant SaaS option where Qodo hosts an isolated instance, or opt for traditional self-serve SaaS.

The rules and memory files can reside wherever the enterprise requires—on their own cloud infrastructure or hosted by Qodo—addressing data governance concerns that enterprise customers typically raise.

On pricing, Qodo is maintaining its existing seat-based model with usage quotas. At present, the company offers three pricing tiers: a free Developer plan for individuals with 30 PR reviews per month; a Teams plan at $38 per user per month (with 21% savings for annual billing) that includes 20 PR reviews per user monthly and 2,500 IDE/CLI credits; and an Enterprise plan with contact-us pricing that adds features like multi-repo context awareness, on-prem deployment options, SSO, and priority support.

Friedman acknowledged the ongoing industry debate about whether seat-based pricing makes sense in an age of AI agents but said the company plans to address this topic more comprehensively later this year.

“If you get more value, you pay more,” Friedman said. “If you don’t, then we’re all good.”

Early customer response

Ofer Morag Brin of HR technology company Hibob, an early user of the Rules System, reported positive results in a press statement Qodo shared with VentureBeat ahead of the launch.

“Qodo’s Rules System didn’t just surface the standards we had scattered across different places; it operationalized them,” Brin said. “The system continuously reinforces how our teams actually review and write code, and we are seeing stronger consistency, faster onboarding, and measurable improvements in review quality across teams.”

Founded in 2018, Qodo has raised $50 million from investors including TLV Partners, Vine Ventures, Susa Ventures, and Square Peg, with angel investors from OpenAI, Shopify, and Snyk.

Nvidia, Groq and the limestone race to real-time AI: Why enterprises win or lose here

From miles away across the desert, the Great Pyramid looks like a perfect, smooth geometry — a sleek triangle pointing to the stars. Stand at the base, however, and the illusion of smoothness vanishes. You see massive, jagged blocks of limestone. It is not a slope; it is a staircase.

Remember this the next time you hear futurists talking about exponential growth.

Intel co-founder Gordon Moore, of Moore’s Law fame, is famously quoted as saying in 1965 that the transistor count on a microchip would double every year. Another Intel executive, David House, later revised this statement to “compute power doubling every 18 months.” For a while, Intel’s CPUs were the poster child of this law. That is, until the growth in CPU performance flattened out like a block of limestone.

If you zoom out, though, the next limestone block was already there — the growth in compute merely shifted from CPUs to the world of GPUs. Jensen Huang, Nvidia’s CEO, played a long game and came out a strong winner, building his own stepping stones initially with gaming, then computer vision and, more recently, generative AI.

The illusion of smooth growth

Technology growth is full of sprints and plateaus, and gen AI is not immune. The current wave is driven by the transformer architecture. To quote Anthropic CEO and co-founder Dario Amodei: “The exponential continues until it doesn’t. And every year we’ve been like, ‘Well, this can’t possibly be the case that things will continue on the exponential’ — and then every year it has.”

But just as the CPU plateaued and GPUs took the lead, we are seeing signs that LLM growth is shifting paradigms again. For example, late in 2024, DeepSeek surprised the world by training a world-class model on an impossibly small budget, in part by using the mixture-of-experts (MoE) technique.

Do you remember where you recently saw this technique mentioned? Nvidia’s Rubin press release: The technology includes “…the latest generations of Nvidia NVLink interconnect technology… to accelerate agentic AI, advanced reasoning and massive-scale MoE model inference at up to 10x lower cost per token.”

Jensen knows that achieving that coveted exponential growth in compute doesn’t come from pure brute force anymore. Sometimes you need to shift the architecture entirely to place the next stepping stone.

The latency crisis: Where Groq fits in

This long introduction brings us to Groq.

The biggest gains in AI reasoning capabilities in 2025 were driven by “inference-time compute” — or, in lay terms, “letting the model think for a longer period of time.” But time is money. Consumers and businesses do not like waiting.

Groq comes into play here with its lightning-speed inference. If you bring together the architectural efficiency of models like DeepSeek and the sheer throughput of Groq, you get frontier intelligence at your fingertips. By executing inference faster, you can “out-reason” competing models, offering a “smarter” system to customers without the penalty of lag.

From universal chip to inference optimization

For the last decade, the GPU has been the universal hammer for every AI nail. You use H100s to train the model; you use H100s (or trimmed-down versions) to run the model. But as models shift toward “System 2” thinking — where the AI reasons, self-corrects and iterates before answering — the computational workload changes.

Training requires massive parallel brute force. Inference, especially for reasoning models, requires fast sequential processing: the hardware must generate tokens quickly enough to sustain complex chains of thought without leaving the user waiting minutes for an answer. Groq’s LPU (Language Processing Unit) architecture removes the memory bandwidth bottleneck that plagues GPUs during small-batch inference.

The engine for the next wave of growth

For the C-suite, this potential convergence solves the “thinking time” latency crisis. Consider the expectations for AI agents: We want them to autonomously book flights, code entire apps and research legal precedent. To do this reliably, a model might need to generate 10,000 internal “thought tokens” to verify its own work before it outputs a single word to the user (the rough arithmetic is sketched after the list below).

  • On a standard GPU: 10,000 thought tokens might take 20 to 40 seconds. The user gets bored and leaves.

  • On Groq: That same chain of thought happens in less than 2 seconds.
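The estimates above follow directly from sequential token throughput. A minimal sketch of the arithmetic, with the tokens-per-second figures as rough illustrative assumptions rather than vendor benchmarks:

```python
# Rough wait-time arithmetic for internal "thinking" tokens. The throughput
# numbers below are illustrative assumptions, not measured vendor benchmarks.
THOUGHT_TOKENS = 10_000

assumed_throughput_tok_per_s = {
    "typical GPU deployment": 300,
    "Groq LPU deployment": 6_000,
}

for setup, tps in assumed_throughput_tok_per_s.items():
    print(f"{setup}: {THOUGHT_TOKENS / tps:.1f} s to generate {THOUGHT_TOKENS:,} tokens")

# ~33 s vs. ~1.7 s: the difference between a user abandoning the task and never
# noticing the model "thought" at all.
```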

If Nvidia integrates Groq’s technology, they solve the “waiting for the robot to think” problem. They preserve the magic of AI. Just as they moved from rendering pixels (gaming) to rendering intelligence (gen AI), they would now move to rendering reasoning in real time.

Furthermore, this creates a formidable software moat. Groq’s biggest hurdle has always been the software stack; Nvidia’s biggest asset is CUDA. If Nvidia wraps its ecosystem around Groq’s hardware, they effectively dig a moat so wide that competitors cannot cross it. They would offer the universal platform: the best environment to train and the most efficient environment to run (Groq/LPU).

Consider what happens when you couple that raw inference power with a next-generation open source model (like the rumored DeepSeek 4): You get an offering that would rival today’s frontier models in cost, performance and speed. That opens up opportunities for Nvidia, from directly entering the inference business with its own cloud offering to continuing to power a rapidly growing customer base.

The next step on the pyramid

Returning to our opening metaphor: The “exponential” growth of AI is not a smooth line of raw FLOPs; it is a staircase of bottlenecks being smashed.

  • Block 1: We couldn’t calculate fast enough. Solution: The GPU.

  • Block 2: We couldn’t train deep enough. Solution: Transformer architecture.

  • Block 3: We can’t “think” fast enough. Solution: Groq’s LPU.

Jensen Huang has never been afraid to cannibalize his own product lines to own the future. By validating Groq, Nvidia wouldn’t just be buying a faster chip; they would be bringing next-generation intelligence to the masses.

Andrew Filev, founder and CEO of Zencoder

Nvidia’s new technique cuts LLM reasoning costs by 8x without losing accuracy

Researchers at Nvidia have developed a technique that can reduce the memory costs of large language model reasoning by up to eight times. Their technique, called dynamic memory sparsification (DMS), compresses the key-value (KV) cache, the temporary memory LLMs generate and store as they process prompts and reason through problems and documents.

While researchers have proposed various methods to compress this cache before, most struggle to do so without degrading the model’s intelligence. Nvidia’s approach manages to discard much of the cache while maintaining (and in some cases improving) the model’s reasoning capabilities.

Experiments show that DMS enables LLMs to “think” longer and explore more solutions without the usual penalty in speed or memory costs.

The bottleneck of reasoning

LLMs improve their performance on complex tasks by generating “chain-of-thought” tokens, essentially writing out their reasoning steps before arriving at a final answer. Inference-time scaling techniques leverage this by giving the model a larger budget to generate these thinking tokens or to explore multiple potential reasoning paths in parallel.

However, this improved reasoning comes with a significant computational cost. As the model generates more tokens, it builds up a KV cache.

For real-world applications, the KV cache is a major bottleneck. As the reasoning chain grows, the cache grows linearly, consuming vast amounts of memory on GPUs. This forces the hardware to spend more time reading data from memory than actually computing, which slows down generation and increases latency. It also caps the number of users a system can serve simultaneously, as running out of VRAM causes the system to crash or slow to a crawl.
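The memory pressure is straightforward to estimate. The sketch below applies the standard KV-cache sizing formula (two tensors, keys and values, for every layer and KV head at every cached position) to a hypothetical 8B-class model; the configuration is an illustrative assumption, not a specific production model:

```python
# Standard KV-cache size estimate: keys + values for every layer and KV head,
# at every cached position. The model configuration below is an illustrative
# assumption, not a specific production model.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value  # 2 = K and V

per_token = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=1)
print(f"{per_token / 1024:.0f} KiB per cached token")  # ~128 KiB

for seq_len in (8_000, 64_000):
    gib = kv_cache_bytes(32, 8, 128, seq_len) / 1024**3
    print(f"{seq_len:,} cached tokens -> {gib:.1f} GiB per sequence")
# ~1.0 GiB at 8k tokens and ~7.8 GiB at 64k, multiplied again by every
# concurrent user the server tries to batch together.
```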

Nvidia researchers frame this not just as a technical hurdle, but as a fundamental economic one for the enterprise.

“The question isn’t just about hardware quantity; it’s about whether your infrastructure is processing 100 reasoning threads or 800 threads for the same cost,” Piotr Nawrot, Senior Deep Learning Engineer at Nvidia, told VentureBeat.

Previous attempts to solve this focused on heuristics-based approaches. These methods use rigid rules, such as a “sliding window” that only caches the most recent tokens and deletes the rest. While this reduces memory usage, it often forces the model to discard critical information required for solving the problem, degrading the accuracy of the output.

“Standard eviction methods attempt to select old and unused tokens for eviction using heuristics,” the researchers said. “They simplify the problem, hoping that if they approximate the model’s internal mechanics, the answer will remain correct.”

Other solutions use paging to offload the unused parts of the KV cache to slower memory, but the constant swapping of data introduces latency overhead that makes real-time applications sluggish.

Dynamic memory sparsification

DMS takes a different approach by “retrofitting” existing LLMs to intelligently manage their own memory. Rather than applying a fixed rule for what to delete, DMS trains the model to identify which tokens are essential for future reasoning and which are disposable.

“It doesn’t just guess importance; it learns a policy that explicitly preserves the model’s final output distribution,” Nawrot said.

The process transforms a standard, pre-trained LLM such as Llama 3 or Qwen 3 into a self-compressing model. Crucially, this does not require training the model from scratch, which would be prohibitively expensive. Instead, DMS repurposes existing neurons within the model’s attention layers to output a “keep” or “evict” signal for each token.

For teams worried about the complexity of retrofitting, the researchers noted that the process is designed to be lightweight. “To improve the efficiency of this process, the model’s weights can be frozen, which makes the process similar to Low-Rank Adaptation (LoRA),” Nawrot said. This means a standard enterprise model like Qwen3-8B “can be retrofitted with DMS within hours on a single DGX H100.”

One of the important parts of DMS is a mechanism called “delayed eviction.” In standard sparsification, if a token is deemed unimportant, it is deleted immediately. This is risky because the model might need a split second to integrate that token’s context into its current state.

DMS mitigates this by flagging a token for eviction but keeping it accessible for a short window of time (e.g., a few hundred steps). This delay allows the model to “extract” any remaining necessary information from the token and merge it into the current context before the token is wiped from the KV cache.

“The ‘delayed eviction’ mechanism is crucial because not all tokens are simply ‘important’ (keep forever) or ‘useless’ (delete immediately). Many fall in between — they carry some information, but not enough to justify occupying an entire slot in memory,” Nawrot said. “This is where the redundancy lies. By keeping these tokens in a local window for a short time before eviction, we allow the model to attend to them and redistribute their information into future tokens.”
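The exact training objective is beyond the scope of this article, but the inference-time bookkeeping it produces can be sketched simply: each token gets a learned keep-or-evict decision, and evicted tokens linger in a short local window before they are actually dropped. Everything below is a schematic illustration of that idea, not Nvidia’s implementation; evict_score stands in for the learned signal described above.

```python
# Schematic illustration of DMS-style delayed eviction (not Nvidia's code).
# `evict_score(token_state)` stands in for the learned keep/evict signal that
# DMS trains repurposed attention neurons to produce.
from collections import deque

DELAY_WINDOW = 256           # evicted tokens stay attendable for this many steps

kv_cache = []                # tokens kept long-term
pending_eviction = deque()   # (step_marked, token_state) awaiting final removal

def step(t, token_state, evict_score):
    # 1) Decide the fate of the newly generated token.
    if evict_score(token_state) > 0.5:
        pending_eviction.append((t, token_state))   # flagged, but still visible
    else:
        kv_cache.append(token_state)                # kept for the long haul

    # 2) Finalize evictions whose grace period has expired; by now the model has
    #    had a chance to fold their information into more recent tokens.
    while pending_eviction and t - pending_eviction[0][0] >= DELAY_WINDOW:
        pending_eviction.popleft()

    # 3) Attention at this step sees the kept tokens plus the not-yet-dropped ones.
    return kv_cache + [state for _, state in pending_eviction]
```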

The researchers found that this retrofitting process is highly efficient. They could equip a pre-trained LLM with DMS in just 1,000 training steps, a tiny fraction of the compute required for the original training. The resulting models use standard kernels and can drop directly into existing high-performance inference stacks without custom hardware or complex software rewriting.

DMS in action

To validate the technique, the researchers applied DMS to several reasoning models, including the Qwen-R1 series (distilled from DeepSeek R1) and Llama 3.2, and tested them on difficult benchmarks like AIME 24 (math), GPQA Diamond (science), and LiveCodeBench (coding).

The results show that DMS effectively moves the Pareto frontier, the optimal trade-off between cost and performance. On the AIME 24 math benchmark, a Qwen-R1 32B model equipped with DMS achieved a score 12.0 points higher than a standard model when constrained to the same memory bandwidth budget. By compressing the cache, the model could afford to “think” much deeper and wider than the standard model could for the same memory and compute budget.

Perhaps most surprisingly, DMS defied the common wisdom that compression hurts long-context understanding. In “needle-in-a-haystack” tests, which measure a model’s ability to find a specific piece of information buried in a large document, DMS variants actually outperformed the standard models. By actively managing its memory rather than passively accumulating noise, the model maintained a cleaner, more useful context.

For enterprise infrastructure, the efficiency gains translate directly to throughput and hardware savings. Because the memory cache is significantly smaller, the GPU spends less time fetching data, reducing the wait time for users. In tests with the Qwen3-8B model, DMS matched the accuracy of the vanilla model while delivering up to 5x higher throughput. This means a single server can handle five times as many customer queries per second without a drop in quality.

The future of memory

Nvidia has released DMS as part of its KVPress library. Regarding how enterprises can get started with DMS, Nawrot emphasized that the barrier to entry is low. “The ‘minimum viable infrastructure’ is standard Hugging Face pipelines — no custom CUDA kernels are required,” Nawrot said, noting that the code is fully compatible with standard FlashAttention. 

Looking ahead, the team views DMS as part of a larger shift where memory management becomes a distinct, intelligent layer of the AI stack. Nawrot also confirmed that DMS is “fully compatible” with newer architectures like the Multi-Head Latent Attention (MLA) used in DeepSeek’s models, suggesting that combining these approaches could yield even greater efficiency gains.

As enterprises move from simple chatbots to complex agentic systems that require extended reasoning, the cost of inference is becoming a primary concern. Techniques like DMS provide a path to scale these capabilities sustainably.

“We’ve barely scratched the surface of what is possible,” Nawrot said, “and we expect inference-time scaling to further evolve.”

Anthropic’s Claude Cowork finally lands on Windows — and it wants to automate your workday

Anthropic released its Claude Cowork AI agent software for Windows on Monday, bringing the file management and task automation tool to roughly 70 percent of the desktop computing market and intensifying a remarkable corporate realignment that has seen Microsoft embrace a direct competitor to its longtime AI partner, OpenAI.

The Windows launch arrives with what Anthropic calls “full feature parity” with the macOS version: file access, multi-step task execution, plugins, and Model Context Protocol (MCP) connectors for integrating external services. Users can now also set global and folder-specific instructions that Claude follows in every session, a feature developers on Reddit described as “a game-changer” for maintaining context across projects.

“Cowork is now available on Windows,” Anthropic announced on X. “We’re bringing full feature parity with MacOS: file access, multi-step task execution, plugins, and MCP connectors.”

The release closes a critical platform gap that had limited Cowork to Apple’s operating system since its January 12 debut. The Windows expansion underscores a broader transformation already underway in enterprise AI, with Microsoft simultaneously selling its own GitHub Copilot to customers while encouraging thousands of its own employees to adopt Anthropic’s competing tools internally.

Inside Microsoft’s surprising pivot toward its biggest AI rival

The relationship between Microsoft and Anthropic has accelerated with striking speed. In November, the two companies announced a strategic partnership allowing Microsoft Foundry customers access to Claude Sonnet 4.5, Claude Opus 4.1, and Claude Haiku 4.5. As part of that arrangement, Anthropic committed to purchasing $30 billion of Azure compute capacity.

But the partnership has expanded well beyond cloud hosting. According to a January 22 report in The Verge, Microsoft has begun encouraging thousands of employees from some of its most prolific teams to adopt Claude Code — and now, by extension, Cowork — even if they have no coding experience.

Microsoft’s CoreAI team, the new AI engineering group led by former Meta engineering chief Jay Parikh, has tested Claude Code in recent months, The Verge reported. The company has also approved Claude Code across all code and repositories for its Business and Industry Copilot teams.

“Software engineers at Microsoft are now expected to use both Claude Code and GitHub Copilot and give feedback comparing the two,” The Verge reported.

The company’s spending on Anthropic approaches $500 million annually, according to The Information. Microsoft has even begun counting Anthropic AI model sales toward Azure sales quotas — an unusual incentive structure that the company typically reserves for homegrown products or models from OpenAI.

A $13 billion partnership faces new questions as Microsoft hedges its bets

Microsoft’s embrace of Anthropic raises uncomfortable questions about its $13 billion investment in OpenAI, which has long served as the exclusive provider of frontier AI models for Microsoft’s products. The two companies signed their landmark partnership in 2019, with Microsoft providing Azure computing infrastructure in exchange for preferential access to OpenAI’s technology.

That relationship now appears to be evolving into something more nuanced. Microsoft has recently started favoring Anthropic’s Claude models inside Microsoft 365 apps and Copilot, deploying them in specific applications or features where Anthropic’s models have proven more capable than OpenAI’s counterparts.

On February 5, Microsoft announced that Claude Opus 4.6 — Anthropic’s most advanced model — would become available in Microsoft Foundry, the company’s enterprise AI platform. The Azure blog post framed the integration as bringing “even more capability to agents that increasingly learn from and act on business systems.”

“At Microsoft we believe that intelligence and trust are the core requirements of agentic AI at scale,” the announcement stated. “Built on Azure, Microsoft Foundry brings these capabilities together on a secure, scalable cloud foundation for enterprise AI.”

The timing and tone suggest Microsoft views Anthropic not merely as a hedging strategy but as a genuine technical leader in certain domains. Claude Opus 4.6 offers a one-million-token context window and 128,000-token maximum output — specifications that position it for complex, long-running enterprise tasks that require processing vast amounts of information.

Why a $285 billion stock selloff has the software industry questioning its future

The deepening Microsoft-Anthropic alliance takes on added significance when viewed against a backdrop of genuine alarm rippling through the software industry. Within days of the macOS launch in January, investors began repricing SaaS companies whose products overlap with Cowork’s capabilities — project management tools, writing assistants, data analysis platforms, and workflow automation software all saw sharp declines.

Bloomberg reported that Cowork triggered a $285 billion software stocks selloff. The carnage reflected growing investor conviction that AI agents capable of automating knowledge work could render entire categories of enterprise software obsolete.

The fear is not abstract. Cowork operates as a desktop agent powered by Claude Opus 4.6 that can read local files, execute multi-step tasks, and interact with external services through plugins — all running directly on a user’s machine. Unlike chatbot interfaces that respond to individual prompts, Cowork plans and executes complete workflows across files, applications, and connected services.

Anthropic has leaned into this positioning. On January 30, the company’s Anthropic Labs division released 11 open-source agentic plugins spanning sales, legal, finance, marketing, data analysis, and software development. These plugins connect Cowork to external tools, enabling the agent to pull data from CRMs, draft legal documents, analyze spreadsheets, or manage project boards without users switching applications.

The hidden risks of giving an AI agent access to your files

Such convenience comes with tradeoffs, and Anthropic has been transparent about the risks inherent in agent software that can read, write, and delete files. The company’s support documentation warns users to “be cautious about granting access to sensitive information like financial documents, credentials, or personal records” and suggests saving backups and creating dedicated folders with nonsensitive information.

Cowork remains susceptible to prompt injection attacks — hidden instructions embedded in documents or websites that can hijack AI agents and redirect their actions. The browser automation feature includes an explicit disclaimer warning that hidden code in websites may “steal your data, inject malware into your systems, or take over your system.”

“We use a virtual machine under the hood,” Boris Cherny, Anthropic’s head of Claude Code, told Wired. “This means you have to say which folders Claude has access to. And if you don’t give it access to a folder, Claude literally cannot see that folder.”

The Windows version includes additional safety constraints. According to user reports on Reddit, Cowork on Windows restricts file access to the user’s personal folder, preventing the agent from accessing common development directories like C:\git. While some users expressed frustration at this limitation, others noted it as a prudent safeguard for less technical users.

“To be fair, seeing how many people nuked themselves with Claude Code, it is much safer to limit people to reduce the collateral damage,” wrote one Reddit user.

Major corporations are already betting on Claude’s enterprise potential

Despite the security caveats, early enterprise adoption suggests meaningful interest. Customer testimonials published alongside the Claude Opus 4.6 announcement on the Microsoft Azure blog included statements from Adobe, Dentons, and other major organizations already integrating Anthropic’s technology into their workflows.

“At Adobe, we’re continuously evaluating new AI capabilities that can help us deliver more powerful, responsible, and intuitive experiences for our customers,” said Michael Marth, VP Engineering for Experience Manager and LLM Optimizer. “Foundry gives us a flexible, enterprise-ready environment to explore frontier models while maintaining the trust, governance, and scale that are critical for Adobe.”

Matej Jambrich, CTO of Dentons Europe, described deploying Claude for legal work: “Better model reasoning reduces rework and improves consistency, so our lawyers can focus on higher value judgment.”

On Reddit, an Anthropic representative wrote that the Windows release addresses “the most consistent request” since Cowork’s macOS debut — a demand that came “especially from enterprise teams.” The detail underscores the tool’s perceived value in corporate environments where Windows dominates the desktop landscape.

At $20 a month, Cowork positions itself as a premium productivity play

Access to these capabilities comes at a price. Cowork for Windows is available in research preview at claude.com/cowork for all paid Claude subscription tiers, including Pro ($20/month), Max ($100/month), Team, and Enterprise. Free-tier users cannot access the feature.

This pricing structure positions Cowork as a premium productivity tool rather than a mass-market offering — at least for now. Anthropic has not announced plans for broader availability, and the “research preview” designation suggests the company continues to gather user feedback before committing to a general release.

The January macOS launch was similarly restricted to $100/month Max subscribers before expanding to other paid tiers, suggesting Anthropic may follow a gradual rollout strategy as it refines the product. For enterprise customers evaluating the tool, the pricing represents a fraction of what many pay for traditional software licenses—a calculus that could accelerate adoption if Cowork delivers on its automation promises.

The battle for the future of work has a new front line

For Microsoft, the deepening Anthropic partnership reflects a pragmatic recognition that AI leadership may require embracing multiple frontier providers rather than relying exclusively on a single partner.

The company’s willingness to deploy Claude tools internally while selling GitHub Copilot externally suggests confidence that the enterprise market can accommodate competing approaches — or perhaps an acknowledgment that betting everything on OpenAI carries its own risks.

For the broader software industry, Cowork’s expansion to Windows extends the competitive threat to an even larger installed base. Companies whose value propositions rest on task automation, file management, or workflow orchestration now face a well-funded competitor capable of replicating their core functionality through natural language commands.

The $285 billion in market capitalization that evaporated after Cowork’s January launch may prove to be just an opening salvo. With Windows support now live, Anthropic has removed the last major platform barrier between its AI agent and the enterprise customers most likely to adopt it.

The software industry spent decades building tools to help knowledge workers manage files, automate tasks, and organize information. Now it faces a future where a single application, powered by an AI that learns and improves with every interaction, threatens to do all of that and more. The question is no longer whether AI agents will reshape enterprise software, but how much of the old world will survive the transformation.

MIT’s new fine-tuning method lets LLMs learn new skills without losing old ones

When enterprises fine-tune LLMs for new tasks, they risk breaking everything the models already know. This forces companies to maintain separate models for every skill.

Researchers at MIT, the Improbable AI Lab and ETH Zurich have developed a new technique that enables large language models to learn new skills and knowledge without forgetting their past capabilities.

Their technique, called self-distillation fine-tuning (SDFT), allows models to learn directly from demonstrations and their own experiments by leveraging the inherent in-context learning abilities of modern LLMs. Experiments show that SDFT consistently outperforms traditional supervised fine-tuning (SFT) while addressing the limitations of reinforcement learning algorithms.

For enterprise applications, the method enables a single model to accumulate multiple skills over time without suffering from performance regression on earlier tasks. This offers a potential pathway for building AI agents that can adapt to dynamic business environments, gathering new proprietary knowledge and skills as needed without requiring expensive retraining cycles or losing their general reasoning abilities.

The challenge of continual learning

Once an LLM is trained and deployed, it remains static. It does not update its parameters to acquire new skills, internalize new knowledge, or improve from experience. To build truly adaptive AI, the industry needs to solve “continual learning,” allowing systems to accumulate knowledge much like humans do throughout their careers.

The most effective way for models to learn is through “on-policy learning.” In this approach, the model learns from data it generates itself, allowing it to correct its own errors and reasoning processes. This stands in contrast to learning by simply mimicking static datasets. Without on-policy learning, models are prone to “catastrophic forgetting,” a phenomenon where learning a new task causes the model to lose its past knowledge and ability to perform previous tasks.

However, on-policy learning typically requires reinforcement learning (RL), which depends on an explicit reward function to score the model’s outputs. This works well for problems with clear outcomes, such as math and coding. But in many real-world enterprise scenarios (e.g., writing a legal brief or summarizing a meeting), defining a mathematical reward function is difficult or impossible.

RL methods also often fail when trying to teach a model entirely new information, such as a specific company protocol or a new product line. As Idan Shenfeld, a doctorate student at MIT and co-author of the paper, told VentureBeat, “No matter how many times the base model tries, it cannot generate correct answers for a topic it has zero knowledge about,” meaning it never gets a positive signal to learn from.

The standard alternative is supervised fine-tuning (SFT), where the model is trained on a fixed dataset of expert demonstrations. While SFT provides clear ground truth, it is inherently “off-policy.” Because the model is just mimicking data rather than learning from its own attempts, it often fails to generalize to out-of-distribution examples and suffers heavily from catastrophic forgetting. 

SDFT seeks to bridge this gap: enabling the benefits of on-policy learning using only prerecorded demonstrations, without needing a reward function.

How SDFT works

SDFT solves this problem by using “distillation,” a process where a student model learns to mimic a teacher. The researchers’ insight was to use the model’s own “in-context learning” (ICL) capabilities to create a feedback loop within a single model.

In-context learning is the phenomenon where you provide the LLM with a difficult task and one or more demonstrations of how similar problems are solved. Most advanced LLMs are designed to solve new problems with ICL examples, without any parameter updates.

During the training cycle, SDFT employs the model in two roles.

The teacher: A frozen version of the model is fed the query along with expert demonstrations. Using ICL, the teacher deduces the correct answer and the reasoning logic required to reach it.

The student: This version sees only the query, simulating a real-world deployment scenario where no answer key is available.

When the student generates an answer, the teacher, which has access to the expert demonstrations, provides feedback. The student then updates its parameters to align closer to the teacher’s distribution.

This process effectively creates an on-policy learning loop by combining elements of SFT and RL. The supervision comes not from a static dataset, but from the model’s own interaction and outputs. It allows the model to correct its own reasoning trajectories without requiring an external reward signal. This process works even for new knowledge that RL would miss.
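In code, the loop is close to a standard distillation step run on the student’s own generations. The sketch below is a schematic reconstruction from the description above, not the authors’ implementation; the KL-divergence loss and the Hugging Face-style model interface (generate, logits) are assumptions made for illustration.

```python
# Schematic SDFT-style training step, reconstructed from the description above
# (not the authors' code). Assumes a Hugging Face-style causal LM; the KL loss
# is a standard self-distillation choice, not necessarily the paper's exact one.
import torch
import torch.nn.functional as F

def sdft_step(student, teacher, optimizer, query_ids, demo_ids, gen_len=256):
    # `teacher` is a frozen copy of the same model, created once before training.

    # 1) The student sees only the query and produces its own rollout (on-policy).
    rollout = student.generate(query_ids, max_new_tokens=gen_len)

    # 2) The teacher scores the same rollout, but with expert demonstrations
    #    prepended so in-context learning supplies the "answer key".
    teacher_input = torch.cat([demo_ids, rollout], dim=-1)
    with torch.no_grad():
        teacher_logits = teacher(teacher_input).logits[:, demo_ids.size(-1):, :]
    student_logits = student(rollout).logits

    # 3) Pull the student's distribution toward the demonstration-conditioned teacher.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key property is that the supervision target is computed on text the student itself generated, which keeps the update on-policy even though no reward function is ever defined.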

SDFT in action

To validate the approach, the researchers tested SDFT using the open-weight Qwen 2.5 model on three complex enterprise-grade skills: science Q&A, software tool use, and medical reasoning.

The results showed that SDFT learned new tasks more effectively than standard methods. On the Science Q&A benchmark, the SDFT model achieved 70.2% accuracy, compared to 66.2% for the standard SFT approach.

More important for enterprise adoption is the impact on catastrophic forgetting. When the standard SFT model learned the science task, its ability to answer general questions (such as logic or humanities) collapsed. In contrast, the SDFT model improved on the science task while holding its “Previous Tasks” score steady at 64.5%. This stability suggests companies could specialize models for specific departments (e.g., HR or Legal) without degrading the model’s basic common sense or reasoning capabilities.

The team also simulated a knowledge injection scenario, creating a dataset of fictional “2025 Natural Disasters” to teach the model new facts. They tested the model on indirect reasoning questions, such as “Given the floods in 2025, which countries likely needed humanitarian aid?”

Standard SFT resulted in a model that memorized facts but struggled to use them in reasoning scenarios. The SDFT model, having internalized the logic during training, scored 98% on the same questions.

Finally, the researchers conducted a sequential learning experiment, training the model on science, tool use, and medical tasks one after another. While the standard model’s performance oscillated, losing previous skills as it learned new ones, the SDFT model successfully accumulated all three skills without regression.

This capability addresses a major pain point for enterprises currently managing “model zoos” of separate adapters for different tasks.

“We offer the ability to maintain only a single model for all the company’s needs,” Shenfeld said. This consolidation “can lead to a substantial reduction in inference costs” because organizations don’t need to host multiple models simultaneously.

SDFT limitations and availability

The code for SDFT is available on GitHub and ready to be integrated into existing model training workflows.

“The SDFT pipeline is more similar to the RL pipeline in that it requires online response generation during training,” Shenfeld said. They are working with Hugging Face to integrate SDFT into the latter’s Transformer Reinforcement Learning (TRL) library, he added, noting that a pull request is already open for developers who want to test the integration.

For teams considering SDFT, the practical tradeoffs come down to model size and compute. The technique requires models with strong enough in-context learning to act as their own teachers — currently around 4 billion parameters with newer architectures like Qwen 3, though Shenfeld expects 1 billion-parameter models to work soon. It demands roughly 2.5 times the compute of standard fine-tuning, but is best suited for organizations that need a single model to accumulate multiple skills over time, particularly in domains where defining a reward function for reinforcement learning is difficult or impossible.

While effective, the method does come with computational tradeoffs. SDFT is approximately four times slower and requires 2.5 times more computational power (FLOPs) than standard fine-tuning because the model must actively generate its own answers (“rollouts”) during training to compare against the teacher. However, the researchers note that because the model retains knowledge better, organizations may avoid the costly multi-stage retraining processes often required to repair models that suffer from catastrophic forgetting.

The technique also relies on the underlying model being large enough to benefit from in-context learning. The paper notes that smaller models (e.g., 3 billion parameters) initially struggled because they lacked the “intelligence” to act as their own teachers.

However, Shenfeld said that the rapid improvement of small models is changing this dynamic. “The Qwen 2.5 3B models were too weak, but in some experiments we currently do, we found that the Qwen 3 4B model is strong enough,” he said. “I see a future where even 1B models have good enough ICL capabilities to support SDFT.”

Ultimately, the goal is to move beyond static snapshots toward systems that improve through use.

“Lifelong learning, together with the ability to extract learning signal from unstructured user interactions… will bring models that just keep and keep improving with time,” Shenfeld said.

“Think about the fact that already the majority of compute around the world goes into inference instead of training. We have to find ways to harness this compute to improve our models.”

NanoClaw solves one of OpenClaw’s biggest security issues — and it’s already powering the creator’s biz

The rapid viral adoption of Austrian developer Peter Steinberger’s open source AI assistant OpenClaw in recent weeks has sent enterprises and indie developers into a tizzy.

It’s easy to see why: OpenClaw is freely available now and offers a powerful means of autonomously completing work and performing tasks across a user’s entire computer, phone, or even business with natural language prompts that spin up swarms of agents. Since its release in November 2025, it’s captured the market with over 50 modules and broad integrations — but its “permissionless” architecture raised alarms among developers and security teams.

Enter NanoClaw, a lighter, more secure version which debuted under an open source MIT License on January 31, 2026, and achieved explosive growth—surpassing 7,000 stars on GitHub in just over a week.

Created by Gavriel Cohen—an experienced software engineer who spent seven years at website builder Wix.com—the project was built to address the “security nightmare” inherent in complex, non-sandboxed agent frameworks. Cohen and his brother Lazer are also co-founders of Qwibit, a new AI-first go-to-market agency, and vice president and CEO, respectively, of Concrete Media, a respected public relations firm that often works with tech businesses covered by VentureBeat.

NanoClaw’s immediate solution to this architectural anxiety is a hard pivot toward operating system-level isolation. The project places every agent inside isolated Linux containers—utilizing Apple Containers for high-performance execution on macOS or Docker for Linux environments.

This creates a strictly “sandboxed” environment where the AI only interacts with directories explicitly mounted by the user.

While other frameworks build internal “safeguards” or application-level allowlists to block certain commands, Gavriel maintains that such defenses are inherently fragile.

“I’m not running that on my machine and letting an agent run wild,” Cohen explained during a recent technical interview. “There’s always going to be a way out if you’re running directly on the host machine. In NanoClaw, the ‘blast radius’ of a potential prompt injection is strictly confined to the container and its specific communication channel.”
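The containment pattern itself is standard container tooling. As a rough illustration (not NanoClaw's actual TypeScript, and with a placeholder image name), an agent task can be confined to a single mounted directory and denied network access like this:

```python
# Illustrative only: confining an agent command with Docker from Python.
# NanoClaw itself is TypeScript and also supports Apple Containers on macOS;
# the image name and paths here are placeholders.
import subprocess

def run_agent_task(workdir: str, command: list[str]) -> str:
    """Run a command with only `workdir` visible to the agent and no host network."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",            # no outbound access unless explicitly granted
            "-v", f"{workdir}:/workspace",  # the only host directory the agent can touch
            "--workdir", "/workspace",
            "agent-sandbox:latest",         # placeholder image
            *command,
        ],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Even a prompt-injected destructive command can only reach the container
# filesystem and this one mounted directory; the rest of the host stays invisible.
print(run_agent_task("/home/me/project", ["ls", "-la"]))
```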

A more secure foundation for agentic autonomy

The technical critique at the heart of NanoClaw’s development is one of bloat and auditability. When Cohen first evaluated OpenClaw (formerly Clawbot), he discovered a codebase approaching 400,000 lines with hundreds of dependencies.

In the fast-moving AI landscape, such complexity is an engineering hurdle and a potential liability.

“As a developer, every open source dependency that we added to our codebase, you vet. You look at how many stars it has, who are the maintainers, and if it has a proper process in place,” Cohen notes. “When you have a codebase with half a million lines of code, nobody’s reviewing that. It breaks the concept of what people rely on with open source”.

NanoClaw counters this by reducing the core logic to roughly 500 lines of TypeScript. This minimalism ensures that the entire system—from the state management to the agent invocation—can be audited by a human or a secondary AI in roughly eight minutes.

The architecture employs a single-process Node.js orchestrator that manages a per-group message queue with concurrency control.

Instead of heavy distributed message brokers, it relies on SQLite for lightweight persistence and filesystem-based IPC. This design choice is intentional: by using simple primitives, the system remains transparent and reproducible.
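As a sketch of that design (written in Python for illustration; the real orchestrator is the roughly 500 lines of TypeScript described above), a per-group queue backed by SQLite might look like this:

```python
# Pattern sketch: one queue per chat group, strict in-order handling within a group,
# SQLite as the only persistence layer. Not NanoClaw's actual code.
import asyncio
import sqlite3

db = sqlite3.connect("messages.db")
db.execute("CREATE TABLE IF NOT EXISTS messages (grp TEXT, body TEXT, done INTEGER DEFAULT 0)")

queues: dict[str, asyncio.Queue] = {}

async def handle(group: str, body: str) -> None:
    # Placeholder for the real work: invoking the containerized agent for this group.
    print(f"[{group}] agent handling: {body}")

async def worker(group: str) -> None:
    # One worker per group: serialized within the group, concurrent across groups.
    while True:
        body = await queues[group].get()
        await handle(group, body)
        db.execute("UPDATE messages SET done = 1 WHERE grp = ? AND body = ?", (group, body))
        db.commit()

async def enqueue(group: str, body: str) -> None:
    db.execute("INSERT INTO messages (grp, body) VALUES (?, ?)", (group, body))
    db.commit()
    if group not in queues:
        queues[group] = asyncio.Queue()
        asyncio.create_task(worker(group))
    await queues[group].put(body)

async def main() -> None:
    await enqueue("family", "remind me about the dentist tomorrow")
    await enqueue("sales", "summarize today's new leads")
    await asyncio.sleep(1)  # give the workers time to drain

asyncio.run(main())
```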

Furthermore, the isolation extends beyond just the filesystem. NanoClaw natively supports Agent Swarms via the Anthropic Agent SDK, allowing specialized agents to collaborate in parallel. In this model, each sub-agent in a swarm can be isolated with its own specific memory context, preventing sensitive data from leaking between different chat groups or business functions.

The product vision: Skills over features

One of the most radical departures in NanoClaw is its rejection of the traditional “feature-rich” software model. Cohen describes NanoClaw as “AI-native” software—a system designed to be managed and extended primarily through AI interaction rather than manual configuration.

The project explicitly discourages contributors from submitting PRs that add broad features like Slack or Discord support to the main branch. Instead, they are encouraged to contribute “Skills”—modular instructions housed in .claude/skills/ that teach a developer’s local AI assistant how to transform the code.

“If you want Telegram, rip out the WhatsApp and put in Telegram,” Cohen says. “Every person should have exactly the code they need to run their agent. It’s not a Swiss Army knife; it’s a secure harness that you customize by talking to Claude Code”.

This “Skills over Features” model means that a user can run a command like /add-telegram or /add-gmail, and the AI will rewrite the local installation to integrate the new capability while keeping the codebase lean. This methodology ensures that if a user only needs a WhatsApp-based assistant, they aren’t forced to inherit the security vulnerabilities of fifty other unused modules.
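To make that concrete, a hypothetical skill of this kind might look like the following. The directory path follows the .claude/skills/ convention above; the skill name and instructions are invented for illustration.

```markdown
<!-- .claude/skills/add-telegram/SKILL.md (hypothetical example) -->
---
name: add-telegram
description: Replace the WhatsApp channel in this NanoClaw install with Telegram.
---

1. Remove the WhatsApp connector and its message-parsing code.
2. Add a Telegram bot client that reads its token from the TELEGRAM_BOT_TOKEN environment variable.
3. Route inbound Telegram messages into the existing per-group message queue.
4. Update the container mounts so no WhatsApp session data is carried over.
```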

Real-world utility in an AI-native agency

This isn’t merely a theoretical experiment for the Cohen brothers. Their new AI go-to-market agency Qwibit uses NanoClaw—specifically a personal instance named “Andy”—to run its internal operations.

“Andy manages our sales pipeline for us. I don’t interact with the sales pipeline directly,” Cohen explained.

The agent provides Sunday-through-Friday briefings at 9:00 AM, detailing lead statuses and assigning tasks to the team.

The utility lies in the frictionless capture of data. Throughout the day, Lazer and Gavriel forward messy WhatsApp notes or email threads into their admin group.

Andy parses these inputs, updates the relevant files in an Obsidian vault or SQLite database, and sets automated follow-up reminders.

Because the agent has access to the codebase, it can also be tasked with recurring technical jobs, such as reviewing git history for “documentation drift” or refactoring its own functions to improve ergonomics for future agents.

Strategic evaluation for the enterprise

As the pace of change accelerates in early 2026, technical decision-makers are faced with a fundamental choice between convenience and control. For AI engineers focused on rapid deployment, NanoClaw offers a blueprint for what Cohen calls the “best harness” for the “best model”.

By building on top of the Claude Agent SDK, NanoClaw provides a pathway to leverage state-of-the-art models (like Opus 4.6) within a framework that a lean engineering team can actually maintain and optimize.

From the perspective of orchestration engineers, NanoClaw’s simplicity is its greatest asset for building scalable, reliable pipelines.

Traditional, bloated frameworks often introduce budget-draining overhead through complex microservices and message queues.

NanoClaw’s container-first approach allows for the implementation of advanced AI technologies—including autonomous swarms—without the resource constraints and “technical debt” associated with 400,000-line legacy systems.

Perhaps most critically, for security leaders, NanoClaw eases the competing demands of incident response and broader organizational protection.

In an environment where prompt injection and data exfiltration are evolving daily, a 500-line auditable core is far safer than a generic system trying to support every use case.

“I recommend you send the repository link to your security team and ask them to audit it,” Cohen advises. “They can review it in an afternoon—not just read the code, but whiteboard the entire system, map out the attack vectors, and verify it’s safe”.

Ultimately, NanoClaw represents a shift in the AI developer mindset. It is an argument that as AI becomes more powerful, the software that hosts it should become simpler. In the race to automate the enterprise, the winners may not be those who adopt the most features, but those who build upon the most transparent and secure foundations.

Why enterprise IT operations are breaking — and how AgenticOps fixes them

Presented by Cisco


AI agents are breaking traditional IT operations models, adding complexity, data silos, and fragmented workflows. DJ Sampath, Cisco’s SVP of AI Software and Platform, believes that AgenticOps is the solution: a new operational paradigm where humans and AI collaborate in real time to create efficiency, boost security, and allow for innovative technological applications.

In a recent conversation with VentureBeat, Sampath outlined why current enterprise IT management is fundamentally breaking and what makes AgenticOps not just useful, but necessary for IT operations going forward.

The breaking point of traditional IT operations

The core problem plaguing enterprise IT today is fragmentation, Sampath said.

“A lot of times inside of these enterprises, data is sitting across multiple different silos,” he explained. “For an operator to come in and start troubleshooting something, they have to go through many different dashboards, many different products, and that results in an increasing amount of time spent trying to figure out what is where before they can actually get to the root cause of an issue.”

This challenge is about to intensify dramatically. As AI agents become ubiquitous within enterprises, the complexity will multiply exponentially.

“Every single person is going to have at least 10 or more agents that are working on their behalf doing different types of things,” Sampath said. “This problem is only going to be tenfold, if not a hundredfold worse when you start to think about what’s really happening with the inclusion of agents.”

Three core principles of AgenticOps

To address these challenges, Cisco has developed its AgenticOps capabilities around three fundamental design principles that Sampath believes must be true for this new operational model to succeed.

First, unified data access across silos. The platform must bring together disparate data sources: network data, security data, application data, and infrastructure data.

“Bringing all of that stuff together is going to be incredibly important so that the agents that you are deploying to do work on your behalf can seamlessly connect the dots across the board,” Sampath said.

Second, multiplayer-first design. AgenticOps must be fundamentally collaborative from the ground up, enabling IT operations, security operations, network operations teams — and agents — to work together seamlessly.

“When you bring the IT ops person, the SecOps person, the NetOps person all together, you can troubleshoot and debug issues a whole lot faster than if you’re working in silos and copy pasting things back and forth,” he explained. “It’s humans and agents working together in a synchronous environment.”

Third, purpose-built AI models. While general-purpose AI models excel at broad tasks, specialized operations require models trained for specific domains.

“When you start to go into specializations, it becomes really important for these models to understand very specific things like network configuration or threat models that you care about and needs to be able to reason about that,” he said.

How Cisco operationalizes AgenticOps across the enterprise stack

Cisco’s approach unites telemetry, intelligence, and collaboration into a single coherent platform. Cisco AI Canvas is an operations workspace that replaces multiple dashboards with a generative UI and a unified collaborative experience. Within AI Canvas, operators can use natural language to delegate actions to agents — pulling telemetry, correlating signals, testing hypotheses, and executing changes — while maintaining human-in-the-loop control.

The reasoning capabilities come from Cisco’s Deep Network Model, trained on over 40 years of operational data including CCIE expertise, production telemetry, Cisco’s Technical Assistance Center (TAC), and Customer Experience (CX) insights. This purpose-built model delivers domain-specific intelligence that general-purpose models cannot match.

Cisco’s platform spans campus, branch, cloud, and edge environments, allowing agents to consume telemetry across the entire ecosystem at machine speed, including Meraki, ThousandEyes, and Splunk. With MCP servers implemented across Cisco products, agents gain standardized access to tools and data without custom integration work.
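As a rough illustration of what that standardized access looks like in practice (a generic example built on the open-source Python MCP SDK, not Cisco’s implementation; the tool name and data are invented), an MCP server can expose a telemetry lookup as a typed tool:

```python
# Illustrative only: a minimal MCP server exposing a single telemetry tool.
# The server name, tool, and returned metrics are placeholders.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("network-telemetry")

@mcp.tool()
def get_device_health(device_id: str) -> dict:
    """Return recent health metrics for a network device (stubbed data here)."""
    return {"device_id": device_id, "cpu_pct": 12.4, "packet_loss_pct": 0.0, "status": "healthy"}

if __name__ == "__main__":
    mcp.run()  # serves the Model Context Protocol over stdio by default
```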

How fragmented reporting data undermines IT troubleshooting

The traditional approach to IT troubleshooting involves raising tickets and piecing together fragmented information across multiple systems.

“People take screenshots. Sometimes it’s in Post-it notes,” Sampath said. “All of this information stays in completely different channels so it becomes really hard for somebody to start collecting them together.”

Cisco AI Canvas addresses this by giving teams one shared, real-time workspace for the work at hand — so context doesn’t get scattered across chats, tickets and screen shares. Teams can collaborate live, escalate instantly, and contribute context (such as screenshots and notes) alongside the agent’s generated charts and graphs. But the real power emerges when AI agents join these collaborative sessions.

“The machines are constantly learning from these human to machine interactions,” Sampath explained. “When you see that same problem happen again, you are that much faster in responding because the machines can assist you.”

This creates a virtuous cycle of continuous improvement: the agent might ask whether you’d like to reuse the approach that worked last time, for example, letting you hand over more of the work, and the time spent debugging gets compressed as the system learns and accelerates future responses.

Security as an AI accelerator

Historically security has been considered a roadblock to adoption and even innovation. But with the right guardrails, organizations can confidently deploy AI at scale, and even accelerate it.

Employees have already experienced the productivity gains of tools like ChatGPT and want similar capabilities within their enterprise environments. When organizations can detect personally identifiable information, prevent prompt injection attacks, and maintain proper data governance, they can unlock AI adoption inside the enterprise in a fundamentally different fashion.

The identity layer required for cross-domain AgenticOps

Cross-domain data access presents one of the most complex challenges in AgenticOps implementation. Cisco’s strategic acquisitions, particularly Splunk, position the company to address this, unifying data across traditionally disconnected systems. But bringing data together is only half the battle; governing who has access to which data becomes vitally important.

Cisco is evolving its Duo platform beyond multi-factor authentication to serve as a comprehensive identity provider, with robust identity and access management baked into the platform from the beginning, not bolted on as an afterthought.

“We’re investing in identity as a very core pillar of how these agents are going to be able to pull data from different data sources with the right authorization in mind,” explains Sampath. “Should this agent have access to this type of data? Should you be correlating these types of data together to be able to solve a problem?”

Humans in the loop, but at a higher level

As AI agents become more autonomous, the role of humans will evolve rather than disappear.

“We’re always going to have humans in the loop,” Sampath said. “What you’re going to see is the complexity of the tasks that are being performed are going to be a lot more involved.”

Take coding as an example, which today can be entirely agentic. The human role has shifted from manual coding, or even tab completion, to asking an agent to create code wholesale, and then verifying that it meets requirements before merging it into the codebase. This pattern will repeat across IT operations, with humans focusing on higher-level decision-making while agents handle execution. Importantly, rollback capabilities ensure that even autonomous actions can be reversed if needed.

Why waiting for AI to ‘settle down’ is the wrong move

For CIOs and CTOs, the message is clear: don’t wait.

“A lot of folks are in this holding pattern of waiting and watching,” Sampath said. “They’re waiting for AI to settle down before they make some of their decisions. And I think that is the wrong way to think about this. A partnership with the right groups of people, with the right sets of vendors, is going to help you go a whole lot faster, as opposed to trying to just stay on the fence, trying to figure out what’s right and what’s wrong.”


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

OpenAI upgrades its Responses API to support agent skills and a complete terminal shell

Until recently, the practice of building AI agents has been a bit like training a long-distance runner with a thirty-second memory.

Yes, you could give your AI model tools and instructions, but after a few dozen interactions — several laps around the track, to extend our running analogy — it would inevitably lose context and start hallucinating.

With OpenAI’s latest updates to its Responses API — the application programming interface that allows developers on OpenAI’s platform to access multiple agentic tools like web search and file search with a single call — the company is signaling that the era of the limited agent is waning.

The updates announced today include Server-side Compaction, Hosted Shell Containers, and a new “Skills” standard for agents.

With these three major updates, OpenAI is effectively handing agents a permanent desk, a terminal, and a memory that doesn’t fade, changes that should help agents evolve further into reliable, long-term digital workers.

Technology: overcoming ‘context amnesia’

The most significant technical hurdle for autonomous agents has always been the “clutter” of long-running tasks. Every time an agent calls a tool or runs a script, the conversation history grows.

Eventually, the model hits its token limit, and the developer is forced to truncate the history—often deleting the very “reasoning” the agent needs to finish the job.

OpenAI’s answer is Server-side Compaction. Unlike simple truncation, compaction allows agents to run for hours or even days.

Early data from e-commerce platform Triple Whale suggests this is a breakthrough in stability: their agent, Moby, successfully navigated a session involving 5 million tokens and 150 tool calls without a drop in accuracy.

In practical terms, this means the model can “summarize” its own past actions into a compressed state, keeping the essential context alive while clearing the noise. It transforms the model from a forgetful assistant into a persistent system process.
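For contrast, here is a hedged sketch of the do-it-yourself scaffolding that server-side compaction is meant to replace. The token estimate, threshold, and model choice are placeholders; only the responses.create call and the output_text field are standard SDK usage.

```python
# Illustrative sketch of manual context compaction, the kind of client-side state
# management developers previously had to maintain themselves.
from openai import OpenAI

client = OpenAI()

def estimate_tokens(messages: list[dict]) -> int:
    # Crude stand-in for a real tokenizer count.
    return sum(len(m["content"]) // 4 for m in messages)

def compact_history(messages: list[dict], keep_last: int = 10) -> list[dict]:
    """Summarize everything but the most recent turns into a single context message."""
    if estimate_tokens(messages) < 100_000:  # arbitrary threshold for the example
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = client.responses.create(
        model="gpt-4.1-mini",  # placeholder model choice
        input="Summarize this agent history, preserving decisions and open tasks:\n"
        + "\n".join(f"{m['role']}: {m['content']}" for m in old),
    ).output_text
    return [{"role": "system", "content": f"Summary of earlier work: {summary}"}] + recent
```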

Managed cloud sandboxes

The introduction of the Shell Tool moves OpenAI into the realm of managed compute. Developers can now opt for container_auto, which provisions an OpenAI-hosted Debian 12 environment.

This isn’t just a code interpreter: it gives each agent its own full terminal environment pre-loaded with:

  • Native execution environments including Python 3.11, Node.js 22, Java 17, Go 1.23, and Ruby 3.1.

  • Persistent storage via /mnt/data, allowing agents to generate, save, and download artifacts.

  • Networking capabilities that allow agents to reach out to the internet to install libraries or interact with third-party APIs.

The Hosted Shell and its persistent /mnt/data storage provide a managed environment where agents can perform complex data transformations using Python or Java without requiring the team to build and maintain custom ETL (Extract, Transform, Load) middleware for every AI project.

By leveraging these hosted containers, data engineers can implement high-performance data processing tasks without the overhead of building and securing their own sandboxes or managing bespoke infrastructure. OpenAI is essentially saying: “Give us the instructions; we’ll provide the computer.”
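A minimal sketch of invoking the hosted shell from the Responses API follows. The tool payload shape (type “shell”, container “auto”) is an assumption inferred from the container_auto option described above, and the model name is a placeholder; treat it as illustrative rather than verified SDK syntax.

```python
# Illustrative request only: the tools payload is an assumed shape, not confirmed syntax.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.1",  # placeholder model name
    tools=[{"type": "shell", "container": "auto"}],  # assumed shape for the hosted Debian container
    input=(
        "Load the CSV at /mnt/data/leads.csv, dedupe rows by email with Python, "
        "and write the cleaned file back to /mnt/data/leads_clean.csv."
    ),
)
print(response.output_text)
```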

OpenAI’s Skills vs. Anthropic’s Skills

While OpenAI is racing toward a unified agent orchestration stack, it faces a significant philosophical challenge from Anthropic’s Agent Skills.

Both companies have converged on a remarkably similar file structure — using a SKILL.md (markdown) manifest with YAML frontmatter — but their underlying strategies reveal divergent visions for the future of work.

OpenAI’s approach prioritizes a “programmable substrate” optimized for developer velocity. By bundling the shell, the memory, and the skills into the Responses API, they offer a “turnkey” experience for building complex agents rapidly.

Already, enterprise AI search startup Glean reported a jump in tool accuracy from 73% to 85% by using OpenAI’s Skills framework.

In contrast, Anthropic has launched Agent Skills as an independent open standard (agentskills.io).

While OpenAI’s system is tightly integrated into its own cloud infrastructure, Anthropic’s skills are designed for portability. A skill built for Claude can theoretically be moved to VS Code, Cursor, or any other platform that adopts the specification.

Indeed, the hit new open source AI agent OpenClaw adopted this exact SKILL.md manifest and folder-based packaging, allowing it to inherit a wealth of specialized procedural knowledge originally designed for Claude.

This architectural compatibility has fueled a community-driven “skills boom” on platforms like ClawHub, which now hosts over 3,000 community-built extensions ranging from smart home integrations to complex enterprise workflow automations.

This cross-pollination demonstrates that the “Skill” has become a portable, versioned asset rather than a vendor-locked feature. Because OpenClaw supports multiple models — including OpenAI’s GPT-5 series and local Llama instances — developers can now write a skill once and deploy it across a heterogeneous landscape of agents.

For technical decision-makers, this open standard is turning into the industry’s preferred way to externalize and share “agentic knowledge,” moving past proprietary prompts toward a shared, inspectable, and interoperable infrastructure.

But there is another important distinction between OpenAI’s and Anthropic’s “Skills.”

OpenAI uses Server-side Compaction to manage the active state of a long-running session. Anthropic utilizes Progressive Disclosure, a three-level system where the model is initially only aware of skill names and descriptions.

Full details and auxiliary scripts are only loaded when the task specifically requires them. This allows for massive skill libraries—brand guidelines, legal checklists, and code templates—to exist without overwhelming the model’s working memory.
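A minimal sketch of that lazy-loading pattern, assuming skills live on disk as SKILL.md files with a name and description in YAML frontmatter (the parsing helper is illustrative, not Anthropic’s implementation):

```python
# Progressive disclosure over a local skills directory: level 1 exposes only names
# and descriptions; the full instructions are read only when a task needs them.
from pathlib import Path

def read_frontmatter(skill_md: Path) -> dict:
    """Parse the simple key: value frontmatter block at the top of a SKILL.md file."""
    meta, in_block = {}, False
    for line in skill_md.read_text().splitlines():
        if line.strip() == "---":
            if in_block:
                break
            in_block = True
            continue
        if in_block and ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return meta

def skill_index(skills_dir: str = ".claude/skills") -> str:
    """Level 1: the lightweight listing the model sees up front."""
    entries = []
    for skill_md in sorted(Path(skills_dir).glob("*/SKILL.md")):
        meta = read_frontmatter(skill_md)
        entries.append(f"- {meta.get('name', skill_md.parent.name)}: {meta.get('description', '')}")
    return "\n".join(entries)

def load_skill(name: str, skills_dir: str = ".claude/skills") -> str:
    """Level 2: pull the full instructions only when the task calls for this skill."""
    return (Path(skills_dir) / name / "SKILL.md").read_text()
```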

Implications for enterprise technical decision-makers

For engineers focused on “rapid deployment and fine-tuning,” the combination of Server-side Compaction and Skills provides a massive productivity boost.

Instead of building custom state management for every agent run, engineers can leverage built-in compaction to handle multi-hour tasks.

Skills allow for “packaged IP,” where specific fine-tuning or specialized procedural knowledge can be modularized and reused across different internal projects.

For those tasked with moving AI from a “chat box” into a production-grade workflow, OpenAI’s announcement marks the end of the “bespoke infrastructure” era.

Historically, orchestrating an agent required significant manual scaffolding: developers had to build custom state-management logic to handle long conversations and secure, ephemeral sandboxes to execute code.

The challenge is no longer “How do I give this agent a terminal?” but “Which skills are authorized for which users?” and “How do we audit the artifacts produced in the hosted filesystem?” OpenAI has provided the engine and the chassis; the orchestrator’s job is now to define the rules of the road.

For security operations (SecOps) managers, giving an AI model a shell and network access is a high-stakes evolution. OpenAI’s use of Domain Secrets and Org Allowlists provides a defense-in-depth strategy, ensuring that agents can call APIs without exposing raw credentials to the model’s context.

But as agents become easier to deploy via “Skills,” SecOps must be vigilant about “malicious skills” that could introduce prompt injection vulnerabilities or unauthorized data exfiltration paths.

How should enterprises decide?

OpenAI is no longer just selling a “brain” (the model); it is selling the “office” (the container), the “memory” (compaction), and the “training manual” (skills). For enterprise leaders, the choice is becoming clear:

  • Choose OpenAI if you need an integrated, high-velocity environment for long-running autonomous work.

  • Choose Anthropic if your organization requires model-agnostic portability and an open ecosystem standard.

Ultimately, the announcements signal that AI is moving out of the chat box and into the system architecture, turning “prompt spaghetti” into maintainable, versioned, and scalable business workflows.