Microsoft announces Copilot Cowork with help from Anthropic — a cloud-powered AI agent that works across M365 apps

If you thought Anthropic was about to run away with the enterprise AI business…you’re not totally off the mark, actually.

This morning, Microsoft announced “Copilot Cowork,” a new cloud-based agentic AI automation tool within the company’s existing Microsoft 365 Copilot that can complete work on users’ behalf across many Microsoft apps, instead of being contained within each one. If it sounds suspiciously similar to Anthropic’s own “Claude Cowork” applications for Mac and Windows (released in January and February of 2026, respectively), that’s to be expected, since Microsoft and Anthropic worked together on this new feature.

Copilot Cowork is the centerpiece of what Microsoft is calling “Wave 3” of Microsoft 365 Copilot — a sweeping platform update that also brings agentic capabilities directly into individual Office apps, makes Anthropic’s Claude models available in mainline Copilot Chat, and introduces new enterprise pricing tiers designed to bundle AI productivity with security and governance.

Anthropic’s initial Claude Cowork applications released in the first two months of 2026 helped trigger a $285 billion selloff in enterprise software stocks as investors repriced companies whose core functionality — project management, writing, data analysis, workflow automation — overlapped with what Anthropic’s AI could do.

Thus, to some AI power users and observers in business and tech who have shared their views on X, the arrival of the closely named and similarly featured Copilot Cowork appears to be an instance of Microsoft playing “catch up.”

Like Claude Cowork, Copilot Cowork allows users to delegate complex, multi-step tasks to an AI agent that plans, executes, and delivers finished work. In this case, the AI is able to move across and use all the tools and features of Microsoft’s Outlook, Teams, Excel, PowerPoint, and other M365 applications.

CEO Satya Nadella promoted the launch on X, writing: “Announcing Copilot Cowork, a new way to complete tasks and get work done in M365. When you hand off a task to Cowork, it turns your request into a plan and executes it across your apps and files, grounded in your work data and operating within M365’s security and governance boundaries.”

Copilot Cowork is currently in Research Preview with a limited set of customers. Broader access will come through Microsoft’s Frontier program in late March 2026. Enterprises interested in getting early access can join the Frontier program at adoption.microsoft.com/en-us/copilot/frontier-program/. Microsoft also published a companion blog post, “Powering Frontier Transformation with Copilot and agents,” that outlines how organizations can prepare for the rollout.

The announcement represents Microsoft’s most significant step yet in transforming Copilot from a conversational assistant into what the company calls an execution layer — an AI that doesn’t just answer questions but actually completes work on a user’s behalf.

But with Claude Cowork offering much of the same functionality, the question is whether Microsoft’s first-party offering comes with enough unique advantages, or enough integration with the trusted systems enterprises already use, to catch on.

Claude Cowork vs. Copilot Cowork: same DNA, different bets on how work gets done

Microsoft’s announcement blog post explicitly states that Copilot Cowork integrates “the technology behind Claude Cowork,” and both products share a core premise: AI should plan and execute multi-step work, not just respond to prompts.

But the two products diverge sharply in where they operate, what they can reach, and who they’re built for.

Claude Cowork is a desktop agent. It lives on your machine — first Mac, now Windows — and operates within folders you explicitly grant it access to.

It can read, edit, and create local files, automate browser tasks, and connect to external services through Anthropic’s growing library of MCP connectors and plugins spanning tools like Google Drive, Slack, DocuSign, and Salesforce. Its power comes from flexibility: users can point it at essentially any local workflow, and Anthropic’s open plugin architecture means its reach keeps expanding.

But it is fundamentally a personal tool. The user manages what Claude can see, and the security model depends on folder-level sandboxing and individual judgment about what to share.

Copilot Cowork operates in the cloud, inside Microsoft 365’s infrastructure, and draws on something Claude Cowork simply cannot access: the full graph of a user’s enterprise work data.

That means Outlook email threads, Teams conversations, calendar history, SharePoint files, Excel workbooks, and the relationships between them. When Copilot Cowork reschedules a meeting or builds a briefing document, it is pulling from signals across all of those systems simultaneously — a capability that requires deep integration with M365’s APIs and data layer rather than just local file access. Enterprise IT administrators retain control through existing identity, permissions, and compliance policies, and all actions are auditable by default.

The practical upshot is that these products are likely to appeal to different buyers solving different problems, at least in the near term.

Organizations that are deeply embedded in the Microsoft 365 ecosystem — which is to say, most large enterprises — are the natural audience for Copilot Cowork. For a Fortune 500 company whose employees live in Outlook, Teams, and SharePoint all day, the value proposition is compelling: an AI agent that already understands your organizational context, operates within your existing security and compliance framework, and doesn’t require employees to adopt a new application or manage local file permissions.

IT departments that have spent years configuring M365 governance policies will find Copilot Cowork far easier to greenlight than a standalone desktop agent that requires individual users to make security decisions about folder access.

Claude Cowork, by contrast, is likely to attract organizations and individuals who need more flexibility than M365 can provide — teams working across heterogeneous tool stacks, power users who want granular control over what the AI can touch, and companies already building on Anthropic’s API and plugin ecosystem.

Startups and mid-market companies that haven’t standardized on Microsoft’s suite may find Claude Cowork more natural, since it doesn’t assume M365 as the center of gravity. Creative agencies, research teams, legal shops using specialized software, and technical organizations that prize customization over managed simplicity are plausible early adopters.

There is also a pricing and access dimension. Claude Cowork is available today to anyone with a $20/month Claude Pro subscription.

Copilot Cowork is in limited Research Preview and will require a Microsoft 365 Copilot license, which currently runs $30 per user per month on top of existing M365 enterprise subscriptions.

Microsoft also introduced Microsoft 365 E7, a new top-tier enterprise bundle priced at $99 per user per month and available May 1. It includes Copilot, the Agent 365 agentic AI control suite, the Microsoft Entra Suite security solution for identity management, and the full E5 security stack, representing the all-in price for organizations that want AI productivity, agent governance, and advanced security in a single license.

For individual knowledge workers or small teams, Claude Cowork is more accessible. For organizations already paying for M365 Copilot, Copilot Cowork arrives as an incremental capability within an existing investment.

The most intriguing question may be whether the two products end up competing at all or instead serve as complementary distribution channels for the same underlying intelligence.

Microsoft is explicitly positioning itself as model-agnostic, choosing “the right model for the job regardless of who built it.”

Anthropic, for its part, benefits from having its technology embedded in the world’s dominant enterprise productivity suite while maintaining a standalone product that keeps its brand and direct customer relationship intact.

It is possible — perhaps even likely — that some enterprises will end up using both: Copilot Cowork for M365-native workflows and Claude Cowork for everything else.

From chat to execution: how Copilot Cowork actually works

Charles Lamanna, Microsoft’s president of business applications and agents, framed the product as the logical next step in Copilot’s evolution. “Copilot Cowork is built for that: it helps Copilot take action, not just chat,” Lamanna wrote in a blog post accompanying the announcement.

The workflow is straightforward in concept but ambitious in scope. Users describe an outcome they want — preparing for a client meeting, researching a company, building a product launch plan — and Cowork automatically breaks that request into a structured plan.

It then grounds the work in the user’s existing emails, meetings, messages, files, and data using what Microsoft calls Work IQ, a system that draws on signals across the M365 suite so the AI operates with contextual awareness of the user’s actual work environment.

Critically, the plan executes in the background. Users can have a dozen tasks running simultaneously, each progressing while they focus on other work. Cowork checks in if it needs clarification and presents recommended actions for user approval before applying changes. Microsoft emphasized that users never give up control — the AI works independently but transparently.
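The plan-then-approve loop described above can be pictured with a small sketch. This is not Microsoft’s implementation or API; the `Step`, `Task`, and `run_task` names and the approval rule are hypothetical, chosen only to show an agent pausing for sign-off before any step that would change a user’s data.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    description: str
    changes_user_data: bool  # True if the step would modify mail, calendar, or files
    done: bool = False


@dataclass
class Task:
    request: str
    steps: list[Step] = field(default_factory=list)
    log: list[str] = field(default_factory=list)


def run_task(task: Task, approve) -> Task:
    """Run steps in order; ask `approve` before any step that changes user data."""
    for step in task.steps:
        if step.changes_user_data and not approve(step):
            task.log.append(f"skipped (not approved): {step.description}")
            continue
        step.done = True  # a real agent would perform the work here
        task.log.append(f"done: {step.description}")
    return task


task = Task(
    request="Clean up my calendar for next week",
    steps=[
        Step("Review calendar for conflicts", changes_user_data=False),
        Step("Decline low-value meeting", changes_user_data=True),
        Step("Add a focus block", changes_user_data=True),
    ],
)

# Hypothetical approval policy: the user approves everything except declines.
result = run_task(task, approve=lambda step: "Decline" not in step.description)
for line in result.log:
    print(line)
```

Read-only steps run without interruption, while anything that would alter the calendar waits on the user’s decision, which is the balance between independence and control that Microsoft describes.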

Jared Spataro, Microsoft’s chief marketing officer for AI at Work, described the shift in a companion blog post: “Tasks are no longer confined to a single turn or a single app. They can run for minutes or hours, coordinating actions and producing real outputs along the way.”

Calendar triage, meeting prep, deep research, and launch planning

Microsoft showcased four scenarios that illustrate what Cowork can do in practice.

In the first, Cowork reviews a user’s Outlook calendar, identifies conflicts and low-value meetings, and proposes changes — rescheduling, declining, or adding focus blocks — that it then applies once approved. In the second, it handles end-to-end meeting preparation: pulling relevant inputs from email and files, scheduling prep time, and producing a briefing document, supporting analysis, and a client-ready deck, all saved in M365 for team collaboration.

The third scenario demonstrates deep research capabilities. Cowork can gather earnings reports, SEC filings, analyst commentary, and news, then organize findings with citations into an executive summary, a structured research memo, and an Excel workbook with labeled tabs. The fourth tackles product launch planning, building a competitive comparison in Excel, distilling a value proposition document, generating a pitch deck, and outlining milestones and owners.

In each case, Microsoft stressed that Cowork isn’t just creating content — it’s coordinating the work around it, producing multiple connected deliverables across applications in a single workflow.

The Anthropic connection: Claude technology powers Copilot’s new agent brain

This is the clearest public confirmation yet that Microsoft’s deepening relationship with Anthropic — the $30 billion Azure compute deal announced in November 2025, the integration of Claude Opus 4.6 into Microsoft Foundry in early February 2026, the ongoing internal adoption of Claude Code across Microsoft engineering teams, according to The Verge — has now reached the company’s flagship productivity suite.

But it goes beyond Cowork. According to Spataro’s companion blog post, Claude is now available in mainline Copilot Chat for Frontier program users, alongside the latest generation of OpenAI models. That means Anthropic’s models aren’t just powering Cowork’s task execution behind the scenes — they’re becoming a general-purpose option that users can access directly in everyday Copilot conversations.

And this is despite Anthropic’s ongoing clash with the U.S. Department of War over its “red lines” prohibiting AI use in mass domestic surveillance and fully autonomous lethal weaponry, which the Department of War claims are unnecessary, already guided by existing law, and should not be enforced by an outside contract vendor. Notably, Microsoft is also a major vendor of the Department of War (formerly Defense) and governments more broadly.

Yet in a major difference, Microsoft also uses other AI models across the Microsoft 365 Copilot experience.

Spataro was pointed about why. “Many AI tools lock users into a single vendor’s models,” he wrote. “Others force people to choose between tools, experiences, or modes depending on the task. That fragmentation creates friction for individuals and complexity for organizations.” Microsoft’s answer is a multi-model architecture where “Copilot automatically applies the right model for the task, all grounded in your enterprise context.”

That multi-model positioning is a significant strategic signal. Microsoft has invested $13 billion in OpenAI, which has long served as the primary provider of AI models for Microsoft’s products.

The decision to power a major new M365 capability with Anthropic’s technology suggests Microsoft increasingly views model diversity not as a hedge but as a competitive advantage — choosing the best available AI for each specific task rather than remaining locked to a single provider.

Wave 3 extends agentic AI beyond Cowork

Copilot Cowork is the headline feature of the Wave 3 update, but it’s far from the only change.

What Microsoft previously called “Agent Mode” in Word, Excel, PowerPoint, and Outlook has been rebranded as simply how Copilot works in those apps going forward.

In Excel and Word, these agentic capabilities are now generally available. In PowerPoint and Outlook, they’re rolling out over the coming months. Copilot can now refine a Word document into a polished draft, improve Excel spreadsheets with real formulas, produce slides in PowerPoint that match an organization’s brand kits and layout conventions, and draft and refine emails directly in Outlook — all grounded in the user’s work context through Work IQ.

Copilot Chat is also becoming a more capable starting point. Users can now create documents, spreadsheets, and presentations directly from a conversation, or take workplace actions like scheduling meetings and sending emails without switching apps.

Microsoft is opening the chat experience to third-party agents as well, with integrations from Adobe, Monday.com, Figma, and others surfacing directly within Copilot Chat through open standards including MCP.

Enterprise-grade security and a cautious rollout

Microsoft emphasized that Copilot Cowork runs within M365’s existing security and governance framework. Identity, permissions, and compliance policies apply by default, and all actions and outputs are auditable. Tasks run in a protected, sandboxed cloud environment so they can continue progressing safely as users switch devices.

The cautious rollout plan signals that Microsoft is treating this as a deliberate enterprise deployment rather than a consumer splash. Copilot Cowork is currently being tested with a limited set of customers in what Microsoft calls a Research Preview. Broader availability will come through the company’s Frontier program in late March 2026.

The timing aligns with other major Microsoft announcements today, including the upcoming general availability of Agent 365 and Microsoft 365 E7, products designed to bring security and governance to AI agents operating inside large organizations.

Agent 365, which Microsoft calls “the control plane for agents,” will be generally available on May 1 at $15 per user per month, giving IT and security teams a single place to observe, secure, and govern every AI agent operating across an organization. Microsoft cited an IDC projection that agent use will increase by an order of magnitude in the next few years, with “hundreds of millions — and soon billions — of agents operating across enterprises.”

Together, the announcements paint a picture of Microsoft building out the infrastructure to support autonomous AI agents at enterprise scale while keeping IT administrators firmly in control.

What it means for the AI agent race

Copilot Cowork arrives at an inflection point in the AI industry. Every major platform company is now racing to deliver agents that don’t just converse but execute.

Anthropic has its standalone Claude Cowork. OpenAI recently launched GPT-5.4 with native computer use capabilities and its own integrations with Windows apps, and earlier launched its Codex AI coding application and hired the creator of the popular open source AI agent tool OpenClaw. Google has been steadily expanding Workspace integrations for AI agents.

Microsoft’s advantage is distribution. With hundreds of millions of M365 users across the enterprise, Copilot Cowork has a built-in audience that no standalone AI product can match.

By pairing that reach with what it considers the best available AI technology — even when that technology comes from a competitor to its $13 billion investment partner — Microsoft is betting that the enterprise agent market will be won not by the company with the best model but by the one that integrates most deeply into the workflows people already use.

Whether Copilot Cowork delivers on that promise will depend on execution quality and user trust — the same open questions facing every AI agent product on the market. But with Anthropic’s Claude technology running inside M365’s security perimeter and Nadella personally promoting the launch, Microsoft — like the rest of the tech industry — is clearly banking on the fact that the era of AI as a passive assistant is over. The next chapter is AI that does the work for you, without you.

Enterprise agentic AI requires a process layer most companies haven’t built

Presented by Celonis


85% of enterprises want to become agentic within three years — yet 76% admit their operations can’t support it. According to the Celonis 2026 Process Optimization Report, based on a survey of more than 1,600 global business leaders, organizations are aggressively pursuing AI-driven transformation. Yet most acknowledge that the foundational work — modernizing workflows, reducing process friction, and building operational resilience — remains unfinished. The ambition is clear. The infrastructure to execute on it is not.

To act autonomously and effectively, AI agents need optimized, AI-ready processes and the process data and operational context that only comes from process intelligence. Without that, they’re guessing. And 82% of decision-makers believe AI will fail to deliver return on investment (ROI) if it doesn’t understand how the business runs.

“The scale of the opportunity is truly remarkable: 89% of leaders see AI as their biggest competitive opportunity,” says Patrick Thompson, global SVP of customer transformation at Celonis. “That’s not a marginal finding. What’s interesting is the shift in the framing. Leaders are confident that AI will transform operations. The question now is how to fuel their ambitions with the right AI enablers.”

Explaining the gap between ambition and reality

Right now, 85% of teams are using gen AI tools for everyday tasks, so the “will this work?” question is largely settled. The real question has shifted to: “Why isn’t it working the way we need it to?” And that’s a much harder problem, because it’s structural. It’s siloed teams. Systems that don’t talk to each other. AI that looks impressive in a demo but falters once it’s dropped into a real enterprise environment. That’s the wall companies are hitting.

So, despite the overwhelming ambition, only 19% of organizations use multi-agent systems today. It all comes down to an operational readiness problem, Thompson says.

“Nine in ten leaders are already using or exploring multi-agent systems, so the will is absolutely there, but ambition without infrastructure doesn’t get you very far,” he explains.

Until now, process has largely been a “good enough” problem, because messy, disconnected processes can still produce results, just inefficiently and opaquely. As long as the business was growing, there was no burning urge to fix them. AI changed the calculus. If 82% of leaders believe AI can only deliver ROI with proper business context, then sub-optimal processes aren’t just an operational inconvenience; they’re actively blocking an AI strategy. Suddenly, process optimization isn’t a background IT project, but a prerequisite for competing.

“This is where structural modernization becomes critical,” he says. “Organizations that have invested in modernizing their data, systems, and processes are in a far stronger position to enable AI at scale.”

The other AI stopper: Lack of business context

AI will not be able to provide the strongest ROI possible until it understands the operational context of the business. That includes how KPIs are defined and calculated, any unique internal policies and procedures, how the organization is structured, and where the real decision authority sits.

This knowledge is usually trapped in different departments that have developed their own languages and systems over time. They don’t naturally share a common understanding. Bringing AI into that environment is something like dropping someone into a conversation that’s been going on for years, without any of the backstory.

Process intelligence becomes the connective layer — a shared operational language that grounds AI decisions in how the business actually runs.

Why AI adoption is also a change management problem

The AI adoption challenge is less a technology problem and more a change-management and operating-model problem than many leaders want to admit, because technology problems feel easier to solve. The data shows that only 6% of leaders cite resistance to change as a hurdle. The real blockers are siloed teams (54%) and a lack of coordination between departments (44%). And 93% of process and operations leaders explicitly state that process optimization is as much about people and culture as it is about tools and technology.

“When companies come to us looking for a technology fix, part of our job is helping them see that the operating model has to evolve alongside the tooling,” Thompson says. “You can’t bolt AI onto a broken process and expect it to work. True enterprise modernization means redesigning how teams, systems, and decisions connect, and AI only works when that modernization happens first.”

Making process optimization a strategic advantage

How do you make process optimization a strategic advantage, rather than another operational project? Connect it directly to outcomes that executives care about. When processes work, they go beyond IT metrics, directly affecting board-level concerns. A full 63% of leaders use process optimization to proactively manage risks, while 58% see faster decision-making.

Plus, the economic and geopolitical environment right now makes agility a survival skill. Look at the supply chain industry, where 66% already view process optimization as a critical business-wide initiative.

“That’s the mindset shift we’re trying to catalyze across the rest of the organization,” Thompson says. “It’s not maintenance work. It’s what lets you move fast when the world changes, and right now the world is moving constantly.”

Closing the readiness gap in enterprise agentic AI

To succeed, organizations must close the readiness gap, and they need to be honest about where they’re starting from, Thompson says.

“The biggest risk I see is companies continuing to layer AI on top of fragmented, opaque processes and then wondering why they’re not getting results,” he says. “Moving from static, traditional tools to real process intelligence, where you have live visibility into how your operations actually run, that’s the foundational shift that makes agentic AI viable.”

Without it, agents get deployed in the wrong places, can’t be integrated with existing systems, and organizations end up with expensive pilots that don’t scale. The call to action is clear: stop starting with tools and start with operational visibility.

“The leaders who will win in the agentic era aren’t necessarily the ones with the most sophisticated AI,” he says. “They’re the ones who’ve done the hard work of building a shared, accurate picture of their operations. Process intelligence is the starting point. It’s what enables enterprise modernization in practice, creating the operational clarity AI needs to deliver real ROI. Master your processes, give AI the context it needs, and then you can actually deploy it somewhere it will deliver.”


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

LangChain’s CEO argues that better models alone won’t get your AI agent to production

As models get smarter and more capable, the “harnesses” around them must also evolve.

This “harness engineering” is an extension of context engineering, says LangChain co-founder and CEO Harrison Chase in a new VentureBeat Beyond the Pilot podcast episode. Whereas traditional AI harnesses have tended to constrain models from running in loops and calling tools, harnesses specifically built for AI agents allow them to interact more independently and effectively perform long-running tasks.

Chase also weighed in on OpenAI’s acquisition of OpenClaw, arguing that its viral success came down to a willingness to “let it rip” in ways that no major lab would — and questioning whether the acquisition actually gets OpenAI closer to a safe enterprise version of the product.

“The trend in harnesses is to actually give the large language model (LLM) itself more control over context engineering, letting it decide what it sees and what it doesn’t see,” Chase says. “Now, this idea of a long-running, more autonomous assistant is viable.”

Tracking progress and maintaining coherence

While the concept of allowing LLMs to run in a loop and call tools seems relatively simple, it’s difficult to pull off reliably, Chase noted. For a while, models were “below the threshold of usefulness” and simply couldn’t run in a loop, so devs used graphs and wrote chains to get around that. Chase pointed to AutoGPT — once the fastest-growing GitHub project ever — as a cautionary example: same architecture as today’s top agents, but the models weren’t good enough yet to run reliably in a loop, so it faded fast.

But as LLMs keep improving, teams can construct environments where models can run in loops and plan over longer horizons, and they can continually improve these harnesses. Previously, “you couldn’t really make improvements to the harness because you couldn’t actually run the model in a harness,” Chase said.

LangChain’s answer to this is Deep Agents, a customizable general-purpose harness.

Built on LangChain and LangGraph, it has planning capabilities, a virtual filesystem, context and token management, code execution, and skills and memory functions. Further, it can delegate tasks to subagents; these are specialized with different tools and configurations and can work in parallel. Context is also isolated, meaning subagent work doesn’t clutter the main agent’s context, and large subtask context is compressed into a single result for token efficiency.
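To make the context-isolation idea concrete, here is a minimal sketch in plain Python. It does not use the Deep Agents, LangChain, or LangGraph APIs; the `Agent` class and `run_subagent` function are hypothetical stand-ins, and the subagent runs sequentially rather than in parallel. The point is only that bulky intermediate material stays in the subagent’s context while the main agent receives a single compressed result.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    context: list[str] = field(default_factory=list)  # messages this agent has seen

    def see(self, message: str) -> None:
        self.context.append(message)


def run_subagent(name: str, subtask: str, documents: list[str]) -> str:
    """Do the subtask in a fresh context and return only a short result."""
    sub = Agent(name)
    sub.see(f"subtask: {subtask}")
    for doc in documents:
        sub.see(doc)  # bulky intermediate material stays in the subagent
    # Stand-in for real work: report how much material was processed.
    return f"{name}: processed {len(documents)} documents for '{subtask}'"


main = Agent("main")
main.see("goal: write a competitor brief")

docs = [f"filing {i}: ..." for i in range(50)]
summary = run_subagent("researcher", "summarize filings", docs)
main.see(summary)  # only the compressed result enters the main context

print(len(main.context))
print(main.context[-1])
```

The main agent ends up with two context entries instead of more than fifty, which is the token-efficiency argument in miniature.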

All of these agents have access to file systems, Chase explained, and can essentially create to-do lists that they can execute on and track over time.

“When it goes on to the next step, and it goes on to step two or step three or step four out of a 200 step process, it has a way to track its progress and keep that coherence,” Chase said. “It comes down to letting the LLM write its thoughts down as it goes along, essentially.”
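A rough illustration of “writing its thoughts down” is a to-do list persisted to a file that the agent rereads before each step. The file layout and the helper names (`save_plan`, `next_step`, `mark_done`) below are assumptions made for this sketch, not anything from LangChain’s code.

```python
import json
import tempfile
from pathlib import Path


def save_plan(path: Path, steps: list[str]) -> None:
    """Write a fresh plan in which no step is done yet."""
    plan = [{"step": s, "done": False} for s in steps]
    path.write_text(json.dumps(plan, indent=2))


def next_step(path: Path):
    """Return the first unfinished step, or None when the plan is complete."""
    for item in json.loads(path.read_text()):
        if not item["done"]:
            return item["step"]
    return None


def mark_done(path: Path, step: str) -> None:
    """Record that a step finished by rewriting the plan file."""
    plan = json.loads(path.read_text())
    for item in plan:
        if item["step"] == step:
            item["done"] = True
    path.write_text(json.dumps(plan, indent=2))


plan_file = Path(tempfile.mkdtemp()) / "todo.json"
save_plan(plan_file, ["gather sources", "draft memo", "build workbook"])

completed = []
while (step := next_step(plan_file)) is not None:
    completed.append(step)  # a real agent would do the work here
    mark_done(plan_file, step)

print(completed)
```

Because progress lives on disk rather than in the model’s context window, the loop can resume at the right step even if earlier conversation history has been compacted away.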

He emphasized that harnesses should be designed so that models can maintain coherence over longer tasks, and be “amenable” to models deciding when to compact context at points they determine are “advantageous.”

Also, giving agents access to code interpreters and Bash tools increases flexibility. And providing agents with skills, as opposed to just tools loaded up front, allows them to load information when they need it. “So rather than hard code everything into one big system prompt,” Chase explained, “you could have a smaller system prompt, ‘This is the core foundation, but if I need to do X, let me read the skill for X. If I need to do Y, let me read the skill for Y.'”
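The skills pattern Chase describes might look something like the sketch below: a small core prompt that only names the available skills, with the full text of a skill pulled in when a task calls for it. The skill names, their contents, and the `build_context` helper are all hypothetical.

```python
SKILLS = {
    # Hypothetical skill library; in practice these might be files on disk.
    "spreadsheet": "When building a workbook, label every tab and cite sources.",
    "email": "Keep drafts under 150 words and state the ask in the first line.",
}

CORE_PROMPT = "You are a helpful agent. Available skills: " + ", ".join(sorted(SKILLS))


def build_context(task: str, needed_skills: list[str]) -> str:
    """Assemble the prompt from the small core plus only the skills requested."""
    parts = [CORE_PROMPT]
    for name in needed_skills:
        parts.append(f"[skill:{name}] {SKILLS[name]}")
    parts.append(f"Task: {task}")
    return "\n".join(parts)


context = build_context("Draft a follow-up email", ["email"])
print(context)
```

Here the spreadsheet instructions never enter the prompt for an email task, so the context stays short while the agent still knows the skill exists.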

Essentially, context engineering is a “really fancy” way of saying: What is the LLM seeing? Because that’s different from what developers see, he noted. When human devs can analyze agent traces, they can put themselves in the AI’s “mindset” and answer questions like: What is the system prompt? How is it created? Is it static or is it populated? What tools does the agent have? When it makes a tool call, and gets a response back, how is that presented?

“When agents mess up, they mess up because they don’t have the right context; when they succeed, they succeed because they have the right context,” Chase said. “I think of context engineering as bringing the right information in the right format to the LLM at the right time.”

Listen to the podcast to hear more about:

  • How LangChain built its stack: LangGraph as the core pillar, LangChain at the center, Deep Agents on top.

  • Why code sandboxes will be the next big thing.

  • How a different type of UX will evolve as agents run at longer intervals (or continuously).

  • Why traces and observability are core to building an agent that actually works.

You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.

New KV cache compaction technique cuts LLM memory 50x without accuracy loss

Enterprise AI applications that handle large documents or long-horizon tasks face a severe memory bottleneck. As the context grows longer, so does the KV cache, the area where the model’s working memory is stored.

A new technique developed by researchers at MIT addresses this challenge with a fast compression method for the KV cache. The technique, called Attention Matching, manages to compact the context by up to 50x with very little loss in quality.

While it is not the only memory compaction technique available, Attention Matching stands out for its execution speed and impressive information-preserving capabilities.

The memory bottleneck of the KV cache

Large language models generate their responses sequentially, one token at a time. To avoid recalculating the entire conversation history from scratch for every predicted word, the model stores a mathematical representation of every previous token it has processed, also known as the key and value pairs. This critical working memory is known as the KV cache.

The KV cache scales with conversation length because the model is forced to retain these keys and values for all previous tokens in a given interaction. This consumes expensive hardware resources. “In practice, KV cache memory is the biggest bottleneck to serving models at ultra-long context,” Adam Zweiger, co-author of the paper, told VentureBeat. “It caps concurrency, forces smaller batches, and/or requires more aggressive offloading.”

In modern enterprise use cases, such as analyzing massive legal contracts, maintaining multi-session customer dialogues, or running autonomous coding agents, the KV cache can balloon to many gigabytes of memory for a single user request.
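To make the scale concrete, here is a back-of-the-envelope sketch of KV cache size for a single request. The model dimensions (32 layers, 8 key-value heads, head dimension 128, 16-bit values) are illustrative assumptions for an 8B-class model with grouped-query attention, not figures from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key vector and one value vector per token, layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


GIB = 1024 ** 3

# Assumed, illustrative configuration and a 128K-token context.
full = kv_cache_bytes(num_tokens=131_072, num_layers=32, num_kv_heads=8, head_dim=128)

print(f"full cache:  {full / GIB:.2f} GiB")       # full cache:  16.00 GiB
print(f"50x compact: {full / 50 / GIB:.2f} GiB")  # 50x compact: 0.32 GiB
```

Under these assumptions every cached token costs 128 KiB, so memory grows linearly with context length and multiplies again with every concurrent request.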

To solve this massive bottleneck, the AI industry has tried several strategies, but these methods fall short when deployed in enterprise environments where extreme compression is necessary. A class of technical fixes includes optimizing the KV cache by either evicting tokens the model deems less important or merging similar tokens into a single representation. These techniques work for mild compression but “degrade rapidly at high reduction ratios,” according to the authors.

Real-world applications often rely on simpler techniques, with the most common approach being to simply drop the older context once the memory limit is reached. But this approach causes the model to lose older information as the context grows long. Another alternative is context summarization, where the system pauses, writes a short text summary of the older context, and replaces the original memory with that summary. While this is an industry standard, summarization is highly lossy and heavily damages downstream performance because it might remove pertinent information from the context.
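The "drop the oldest context" strategy can be sketched with a fixed-size window. This is a toy illustration over words rather than real tokens or KV entries, and the sample text is invented for the example.

```python
from collections import deque

# A window that silently discards the oldest entry once it is full.
context = deque(maxlen=4)

for token in ["contract", "signed", "in", "2019", "renewal", "due", "soon"]:
    context.append(token)

print(list(context))  # ['2019', 'renewal', 'due', 'soon']
```

The early detail (that a contract was signed) is gone by the time the window holds the renewal notice, which is exactly the failure mode described above.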

Recent research has shown that it is technically possible to compress this memory aggressively using a method called Cartridges. However, this approach requires training latent KV cache models through slow, end-to-end mathematical optimization. This gradient-based training can take several hours on expensive GPUs just to compress a single context, making it completely unviable for real-time enterprise applications.

How attention matching compresses without the cost

Attention Matching achieves high compaction ratios and quality while being orders of magnitude faster than gradient-based optimization. It bypasses the slow training process through clever mathematical tricks.

The researchers realized that to perfectly mimic how an AI interacts with its memory, they need to preserve two mathematical properties when compressing the original key and value vectors into a smaller footprint. The first is the “attention output,” which is the actual information the AI extracts when it queries its memory. The second is the “attention mass,” which acts as the mathematical weight that a token has relative to everything else in the model’s working memory. If the compressed memory can match these two properties, it will behave exactly like the massive, original memory, even when new, unpredictable user prompts are added later. 
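In standard attention notation, the two quantities can be written as follows. This is a sketch based on the description above: the symbols (a query q, cached keys k_i and values v_i, and head dimension d) are notation introduced here, and the paper's exact definitions may differ in detail.

```latex
s_i(q) = \frac{q^\top k_i}{\sqrt{d}}, \qquad
M(q) = \sum_{i=1}^{n} e^{s_i(q)}, \qquad
o(q) = \frac{1}{M(q)} \sum_{i=1}^{n} e^{s_i(q)} \, v_i
```

Here M(q) is the attention mass the cached block contributes for query q, and o(q) is its attention output. Matching the mass matters because, once new tokens are appended, the block's share of the softmax depends on M(q) relative to the mass of those new tokens.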

“Attention Matching is, in some ways, the ‘correct’ objective for doing latent context compaction in that it directly targets preserving the behavior of each attention head after compaction,” Zweiger said. While token-dropping and related heuristics can work, explicitly matching attention behavior simply leads to better results.

Before compressing the memory, the system generates a small set of “reference queries” that act as a proxy for the types of internal searches the model is likely to perform when reasoning about the specific context. If the compressed memory can accurately answer these reference queries, it will very likely succeed at answering the user’s actual questions later. The authors suggest various methods for generating these reference queries, including appending a hidden prompt to the document telling the model to repeat the previous context, known as the “repeat-prefill” technique. They also suggest a “self-study” approach where the model is prompted to perform a few quick synthetic tasks on the document, such as aggregating all key facts or structuring dates and numbers into a JSON format.

With these queries in hand, the system picks a set of keys to preserve in the compacted KV cache based on signals like the highest attention value. It then uses the keys and reference queries to calculate the matching values along with a scalar bias term. This bias ensures that pertinent information is preserved, allowing each retained key to represent the mass of many removed keys.
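A toy example shows how a scalar bias lets one retained key stand in for several removed ones. It is a hand-built case (two identical keys merged into one), not the authors' selection or fitting procedure, and the function and variable names are hypothetical.

```python
import math


def attention(query, keys, values, biases=None):
    """Return (output, mass) for one query over a block of keys and values."""
    scale = math.sqrt(len(query))
    biases = biases or [0.0] * len(keys)
    # Unnormalized attention weight of each key, with an optional additive bias.
    weights = [
        math.exp(sum(q * k for q, k in zip(query, key)) / scale + bias)
        for key, bias in zip(keys, biases)
    ]
    mass = sum(weights)
    output = [
        sum(w * value[j] for w, value in zip(weights, values)) / mass
        for j in range(len(values[0]))
    ]
    return output, mass


# Original cache: the first two keys are identical.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]

# Compacted cache: one key replaces the duplicate pair. A bias of log(2) restores
# the pair's combined mass, and the new value is the average of the pair's values.
compact_keys = [[1.0, 0.0], [0.0, 1.0]]
compact_values = [[2.0, 0.0], [0.0, 2.0]]
compact_biases = [math.log(2.0), 0.0]

for query in ([0.3, -1.2], [2.0, 0.5], [-0.7, 0.1]):
    full_out, full_mass = attention(query, keys, values)
    small_out, small_mass = attention(query, compact_keys, compact_values, compact_biases)
    print(round(full_mass, 6) == round(small_mass, 6),
          [round(x, 6) for x in full_out] == [round(x, 6) for x in small_out])
```

In this contrived case the match is exact for every query. With real caches the keys are not duplicates, so the bias and values have to be fitted approximately against the reference queries.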

This formulation makes it possible to fit the values with simple algebraic techniques, such as ordinary least squares and nonnegative least squares, entirely avoiding compute-heavy gradient-based optimization. This is what makes Attention Matching so much faster than optimization-heavy compaction methods. The researchers also apply chunked compaction, processing contiguous chunks of the input independently and concatenating them, to further improve performance on long contexts.
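One way to see why plain least squares suffices, sketched under assumptions: let c_1, ..., c_t be the retained keys with scalar biases β_j, let u_1, ..., u_t be the new values, and let q_1, ..., q_m be the reference queries. M(q) denotes the softmax denominator over the original keys and o(q) the original attention output. The notation and the two-step split are an illustrative reading of the description above, not necessarily the paper's exact formulation.

```latex
% Step 1: fit nonnegative weights w_j = e^{\beta_j} so the compacted mass matches the original mass.
\min_{w \ge 0} \; \sum_{r=1}^{m} \Big( \sum_{j=1}^{t} w_j \, e^{q_r^\top c_j / \sqrt{d}} - M(q_r) \Big)^2

% Step 2: with keys and biases fixed, the attention weights a_{rj} are constants,
% so the new values solve an ordinary least-squares problem.
a_{rj} = \frac{w_j \, e^{q_r^\top c_j / \sqrt{d}}}{\sum_{l=1}^{t} w_l \, e^{q_r^\top c_l / \sqrt{d}}}, \qquad
\min_{u_1, \dots, u_t} \; \sum_{r=1}^{m} \Big\| \sum_{j=1}^{t} a_{rj} \, u_j - o(q_r) \Big\|^2
```

Both problems are linear in their unknowns (w in the first step, u in the second), which is why closed-form or standard solvers apply and no gradient descent is needed.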

Attention matching in action

To understand how this method performs in the real world, the researchers ran a series of stress tests using popular open-source models like Llama 3.1 and Qwen-3 on two distinct types of enterprise datasets. The first was QuALITY, a standard reading comprehension benchmark using 5,000 to 8,000-word documents. The second, representing a true enterprise challenge, was LongHealth, a highly dense, 60,000-token dataset containing the complex medical records of multiple patients.

The key finding was that Attention Matching could compact the model’s KV cache by 50x without reducing accuracy, while taking only seconds to process the documents. To achieve that same level of quality previously, Cartridges required hours of intensive GPU computation per context.

When dealing with the dense medical records, standard industry workarounds completely collapsed. The researchers noted that when they tried to use standard text summarization on these patient records, the model’s accuracy dropped so low that it matched the “no-context” baseline, meaning the AI performed as if it had not read the document at all. 

Attention Matching drastically outperforms summarization, but enterprise architects will need to dial down the compression ratio for dense tasks compared to simpler reading comprehension tests. As Zweiger explains, “The main practical tradeoff is that if you are trying to preserve nearly everything in-context on highly information-dense tasks, you generally need a milder compaction ratio to retain strong accuracy.”

The researchers also explored what happens in cases where absolute precision isn’t necessary but extreme memory savings are. They ran Attention Matching on top of a standard text summary. This combined approach achieved 200x compression. It successfully matched the accuracy of standard summarization alone, but with a very small memory footprint.

One experiment of particular interest for enterprise workflows tested online compaction, though the researchers note that this is a proof of concept and has not been tested rigorously in production environments. They ran the model on the advanced AIME math reasoning test, forcing it to solve problems under a strictly capped physical memory limit. Whenever the model’s memory filled up, the system paused, instantly compressed its working memory by 50 percent using Attention Matching, and let it continue thinking. Even after hitting the memory wall and having its KV cache shrunk up to six consecutive times mid-thought, the model successfully solved the math problems. Its performance matched a model that had been given massive, unlimited memory.
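The control flow of such an online loop can be sketched as a simple simulation. It tracks only cache length, not real keys and values, and the function name, cap and token counts are hypothetical choices for illustration.

```python
def simulate_online_compaction(total_new_tokens, cap, keep_ratio=0.5):
    """Count how often a capped cache must be compacted while generating tokens."""
    cache_len = 0
    compactions = 0
    peak = 0
    for _ in range(total_new_tokens):
        if cache_len >= cap:
            # Memory is full: compact to a fraction of its size, then keep generating.
            cache_len = int(cache_len * keep_ratio)
            compactions += 1
        cache_len += 1
        peak = max(peak, cache_len)
    return cache_len, compactions, peak


final_len, compactions, peak = simulate_online_compaction(total_new_tokens=20, cap=8)
print(final_len, compactions, peak)  # 8 3 8
```

The cache never exceeds the cap, at the price of one compaction each time it fills. Whether quality survives repeated compaction is the empirical question the AIME experiment probes.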

There are caveats to consider. At a 50x compression ratio, Attention Matching is the clear winner in balancing speed and quality. However, if an enterprise attempts to push compression to extreme 100x limits on highly complex data, the slower, gradient-based Cartridges method actually outperforms it.

The researchers have released the code for Attention Matching. However, they note that this is not currently a simple plug-and-play software update. “I think latent compaction is best considered a model-layer technique,” Zweiger notes. “While it can be applied on top of any existing model, it requires access to model weights.” This means enterprises relying entirely on closed APIs cannot implement this themselves; they need open-weight models. 

The authors note that integrating this latent-space KV compaction into existing, highly optimized commercial inference engines still requires significant effort. Modern AI infrastructure uses complex tricks like prefix caching and variable-length memory packing to keep servers running efficiently, and seamlessly weaving this new compaction technique into those existing systems will take dedicated engineering work. However, there are immediate enterprise applications. “We believe compaction after ingestion is a promising use case, where large tool call outputs or long documents are compacted right after being processed,” Zweiger said.

Ultimately, the shift toward mechanical, latent-space compaction aligns with the future product roadmaps of major AI players, Zweiger argues. “We are seeing compaction to shift from something enterprises implement themselves into something model providers ship,” Zweiger said. “This is even more true for latent compaction, where access to model weights is needed. For example, OpenAI now exposes a black-box compaction endpoint that returns an opaque object rather than a plain-text summary.”

Google PM open-sources Always On Memory Agent, ditching vector databases for LLM-driven persistent memory

Google senior AI product manager Shubham Saboo has turned one of the thorniest problems in agent design into an open-source engineering exercise: persistent memory.

This week, he published an open-source “Always On Memory Agent” on the official Google Cloud Platform GitHub page under a permissive MIT License, allowing for commercial usage.

It was built with Google’s Agent Development Kit (ADK), introduced in spring 2025, and Gemini 3.1 Flash-Lite, a low-cost model Google introduced on March 3, 2026, as its fastest and most cost-efficient Gemini 3 series model.

The project serves as a practical reference implementation for something many AI teams want but few have productionized cleanly: an agent system that can ingest information continuously, consolidate it in the background, and retrieve it later without relying on a conventional vector database.

For enterprise developers, the release matters less as a product launch than as a signal about where agent infrastructure is headed.

The repo packages a view of long-running autonomy that is increasingly attractive for support systems, research assistants, internal copilots and workflow automation. It also brings governance questions into sharper focus as soon as memory stops being session-bound.

What the repo appears to do — and what it does not clearly claim

The repo appears to use a multi-agent internal architecture, with specialist components handling ingestion, consolidation and querying.

But the supplied materials do not clearly establish a broader claim that this is a shared memory framework for multiple independent agents.

That distinction matters. ADK as a framework supports multi-agent systems, but this specific repo is best described as an always-on memory agent, or memory layer, built with specialist subagents and persistent storage.

Even at this narrower level, it addresses a core infrastructure problem many teams are actively working through.

The architecture favors simplicity over a traditional retrieval stack

According to the repository, the agent runs continuously, ingests files or API input, stores structured memories in SQLite, and performs scheduled memory consolidation every 30 minutes by default.

A local HTTP API and Streamlit dashboard are included, and the system supports text, image, audio, video and PDF ingestion. The repo frames the design with an intentionally provocative claim: “No vector database. No embeddings. Just an LLM that reads, thinks, and writes structured memory.”

That design choice is likely to draw attention from developers managing cost and operational complexity. Traditional retrieval stacks often require separate embedding pipelines, vector storage, indexing logic and synchronization work.

Saboo’s example instead leans on the model to organize and update memory directly. In practice, that can simplify prototypes and reduce infrastructure sprawl, especially for smaller or medium-memory agents. It also shifts the performance question from vector search overhead to model latency, memory compaction logic and long-run behavioral stability.
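The ingest, consolidate and query pattern can be sketched with nothing but SQLite. This is a minimal illustration and not the repo's code: the table schema, function names and the `summarize_with_llm` stub (which stands in for a real model call) are all hypothetical.

```python
import sqlite3
import time

# The repo's stated default; a long-running service would call consolidate() on this interval.
CONSOLIDATE_EVERY_SECONDS = 30 * 60


def summarize_with_llm(texts):
    """Stand-in for a model call; a real agent would prompt an LLM here."""
    return " | ".join(texts)


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS memories ("
        "id INTEGER PRIMARY KEY, created REAL, kind TEXT, content TEXT)"
    )
    return db


def ingest(db, content):
    db.execute(
        "INSERT INTO memories (created, kind, content) VALUES (?, 'raw', ?)",
        (time.time(), content),
    )
    db.commit()


def consolidate(db):
    """Merge all raw memories into one consolidated row; return how many were merged."""
    rows = db.execute(
        "SELECT id, content FROM memories WHERE kind = 'raw' ORDER BY id"
    ).fetchall()
    if not rows:
        return 0
    summary = summarize_with_llm([content for _, content in rows])
    db.execute(
        "INSERT INTO memories (created, kind, content) VALUES (?, 'consolidated', ?)",
        (time.time(), summary),
    )
    # Simplification: raw rows are deleted, so the merge cannot be audited afterwards.
    db.executemany("DELETE FROM memories WHERE id = ?", [(row_id,) for row_id, _ in rows])
    db.commit()
    return len(rows)


def recall(db, keyword):
    rows = db.execute(
        "SELECT content FROM memories WHERE content LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    ).fetchall()
    return [content for (content,) in rows]


db = open_store()
ingest(db, "Alice prefers email updates")
ingest(db, "Project deadline is Friday")
print(consolidate(db), recall(db, "Alice"))
```

Even in this toy version, retrieval is still a design decision (a keyword match here), and deleting raw rows after a merge is precisely the kind of choice that raises audit and retention questions.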

Flash-Lite gives the always-on model some economic logic

That is where Gemini 3.1 Flash-Lite enters the story.

Google says the model is built for high-volume developer workloads at scale and priced at $0.25 per 1 million input tokens and $1.50 per 1 million output tokens.

The company also says Flash-Lite is 2.5 times faster than Gemini 2.5 Flash in time to first token and delivers a 45% increase in output speed while maintaining similar or better quality.

On Google’s published benchmarks, the model posts an Elo score of 1432 on Arena.ai, 86.9% on GPQA Diamond and 76.8% on MMMU Pro. Google positions those characteristics as a fit for high-frequency tasks such as translation, moderation, UI generation and simulation.

Those numbers help explain why Flash-Lite is paired with a background-memory agent. A 24/7 service that periodically re-reads, consolidates and serves memory needs predictable latency and low enough inference cost to avoid making “always on” prohibitively expensive.
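A rough cost estimate shows how the pricing interacts with a 30-minute consolidation schedule. The prices are the ones Google quotes above; the tokens consumed per run are assumed values chosen purely for illustration.

```python
INPUT_PRICE_PER_M = 0.25   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 1.50  # USD per 1M output tokens


def daily_cost(interval_minutes, input_tokens_per_run, output_tokens_per_run):
    """Return (runs per day, USD per day) for a periodic background job."""
    runs_per_day = (24 * 60) // interval_minutes
    input_cost = runs_per_day * input_tokens_per_run / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = runs_per_day * output_tokens_per_run / 1_000_000 * OUTPUT_PRICE_PER_M
    return runs_per_day, input_cost + output_cost


# Assumption: each consolidation pass reads 20,000 tokens and writes 1,000.
runs, cost = daily_cost(interval_minutes=30, input_tokens_per_run=20_000, output_tokens_per_run=1_000)
print(f"{runs} runs/day, about ${cost:.2f}/day")  # 48 runs/day, about $0.31/day
```

The estimate scales linearly with memory size and query traffic, so the same arithmetic is worth rerunning with measured token counts before committing to an always-on design.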

Google’s ADK documentation reinforces the broader story. The framework is presented as model-agnostic and deployment-agnostic, with support for workflow agents, multi-agent systems, tools, evaluation and deployment targets including Cloud Run and Vertex AI Agent Engine. That combination makes the memory agent feel less like a one-off demo and more like a reference point for a broader agent runtime strategy.

The enterprise debate is about governance, not just capability

Public reaction shows why enterprise adoption of persistent memory will not hinge on speed or token pricing alone.

Several responses on X highlighted exactly the concerns enterprise architects are likely to raise. Franck Abe called Google ADK and 24/7 memory consolidation “brilliant leaps for continuous agent autonomy,” but warned that an agent “dreaming” and cross-pollinating memories in the background without deterministic boundaries becomes “a compliance nightmare.”

ELED made a related point, arguing that the main cost of always-on agents is not tokens but “drift and loops.”

Those critiques go directly to the operational burden of persistent systems: who can write memory, what gets merged, how retention works, when memories are deleted, and how teams audit what the agent learned over time?

Another reaction, from Iffy, challenged the repo’s “no embeddings” framing, arguing that the system still has to chunk, index and retrieve structured memory, and that it may work well for small-context agents but break down once memory stores become much larger.

That criticism is technically important. Removing a vector database does not remove retrieval design; it changes where the complexity lives.

For developers, the tradeoff is less about ideology than fit. A lighter stack may be attractive for low-cost, bounded-memory agents, while larger-scale deployments may still demand stricter retrieval controls, more explicit indexing strategies and stronger lifecycle tooling.

ADK broadens the story beyond a single demo

Other commenters focused on developer workflow. One asked for the ADK repo and documentation and wanted to know whether the runtime is serverless or long-running, and whether tool-calling and evaluation hooks are available out of the box.

Based on the supplied materials, the answer is effectively both: the memory-agent example itself is structured like a long-running service, while ADK more broadly supports multiple deployment patterns and includes tools and evaluation capabilities.

The always-on memory agent is interesting on its own, but the larger message is that Saboo is trying to make agents feel like deployable software systems rather than isolated prompts. In that framing, memory becomes part of the runtime layer, not just an add-on feature.

What Saboo has shown — and what he has not

What Saboo has not shown yet is just as important as what he’s published.

The provided materials do not include a direct Flash-Lite versus Anthropic Claude Haiku benchmark for agent loops in production use.

They also do not lay out enterprise-grade compliance controls specific to this memory agent, such as deterministic policy boundaries, retention guarantees, segregation rules or formal audit workflows.

And while the repo appears to use multiple specialist agents internally, the materials do not clearly prove a larger claim about persistent memory shared across multiple independent agents.

For now, the repo reads as a compelling engineering template rather than a complete enterprise memory platform.

Why this matters now

Still, the release lands at the right time. Enterprise AI teams are moving beyond single-turn assistants and into systems expected to remember preferences, preserve project context and operate across longer horizons.

Saboo’s open-source memory agent offers a concrete starting point for that next layer of infrastructure, and Flash-Lite gives the economics some credibility.

But the strongest takeaway from the reaction around the launch is that continuous memory will be judged on governance as much as capability.

That is the real enterprise question behind Saboo’s demo: not whether an agent can remember, but whether it can remember in ways that stay bounded, inspectable and safe enough to trust in production.

Google Workspace CLI brings Gmail, Docs, Sheets and more into a common interface for AI agents

What’s old is new: the command line — the original, clunky non-graphical interface for interacting with and controlling PCs, where the user just typed in raw commands in code — has become one of the most important interfaces in agentic AI.

That shift has been driven in part by the rise of coding-native tools such as Claude Code and Kilo CLI, which have helped establish a model where AI agents do not just answer questions in chat windows but execute real tasks through a shared, scriptable interface already familiar to developers — and which can still be found on virtually all PCs.

For developers, the appeal is practical: the CLI is inspectable, composable and easier to control than a patchwork of custom app integrations.

Now, Google Workspace (the umbrella term for Google’s suite of enterprise cloud apps, including Drive, Gmail, Calendar, Sheets, Docs, Chat and Admin) is moving into that pattern with a new CLI that lets developers and AI agents access these applications and the data within them directly, without relying on third-party connectors.

The project, googleworkspace/cli, describes itself as “one CLI for all of Google Workspace — built for humans and AI agents,” with structured JSON output and agent-oriented workflows included.

In an X post yesterday, Google Cloud director Addy Osmani introduced the Google Workspace CLI as “built for humans and agents,” adding that it covers “Google Drive, Gmail, Calendar, and every Workspace API.”

While the tool is not officially supported by Google, other posts cast the release as a broader turning point for automation and agent access to enterprise productivity software.

Now, instead of having to set up third-party connectors like Zapier to access data and use AI agents to automate work across the Google Workspace suite of apps, enterprise developers (or indie devs and users, for that matter) can install the open source (Apache 2.0) Google Workspace CLI from GitHub and begin setting up automated agentic workflows directly in the terminal, asking their AI model to sort email, respond, edit docs and files, and more.

Why the CLI model is gaining traction

For enterprise developers, the importance of the release is not that Google suddenly made Workspace programmable. Workspace APIs have long been available. What changes here is the interface.

Instead of forcing teams to build and maintain separate wrappers around individual APIs, the CLI offers a unified command surface with structured output.

Installation is straightforward — npm install -g @googleworkspace/cli — and the repo says the package includes prebuilt binaries, with releases also available through GitHub.

The repo also says gws reads Google’s Discovery Service at runtime and dynamically builds its command surface, allowing new Workspace API methods to appear without waiting for a manually maintained static tool definition to catch up.

For teams building agents or internal automation, that is a meaningful operational advantage. It reduces glue code, lowers maintenance overhead and makes Workspace easier to treat as a programmable runtime rather than a collection of separate SaaS applications.
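The idea of deriving a command surface from a Discovery document can be sketched as a walk over its resource tree. The sample document below is a heavily trimmed, illustrative stand-in for a real Discovery response, and the "api resource method" command shape is an assumption; the source does not show how gws actually maps methods to commands.

```python
def build_commands(api_name, resources, prefix=()):
    """Walk a Discovery-style resource tree and yield one command per method."""
    for resource_name, resource in sorted(resources.items()):
        path = prefix + (resource_name,)
        for method_name in sorted(resource.get("methods", {})):
            yield " ".join((api_name, *path, method_name))
        # Resources can nest, so recurse into any child resources.
        yield from build_commands(api_name, resource.get("resources", {}), path)


# Illustrative sample only; a real document carries many more fields and methods.
SAMPLE_DISCOVERY_DOC = {
    "name": "drive",
    "resources": {
        "files": {"methods": {"list": {"httpMethod": "GET"}, "get": {"httpMethod": "GET"}}},
        "permissions": {"methods": {"create": {"httpMethod": "POST"}}},
    },
}

commands = list(build_commands(SAMPLE_DISCOVERY_DOC["name"], SAMPLE_DISCOVERY_DOC["resources"]))
print(commands)  # ['drive files get', 'drive files list', 'drive permissions create']
```

Because the commands are generated from data fetched at runtime, a method added to the document shows up in the output without any code change, which is the maintenance benefit described above.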

What developers and enterprises actually get

The CLI is designed for both direct human use and agent-driven workflows. For developers working in the terminal, the README highlights features such as per-resource help, dry-run previews, schema inspection and auto-pagination.

For agents, the value is clearer still: structured JSON output, reusable commands and built-in skills that let models interact with Workspace data and actions without a custom integration layer.

That creates immediate utility for internal enterprise workflows. Teams can use the tool to list Drive files, create spreadsheets, inspect request and response schemas, send Chat messages and paginate through large result sets from the terminal. The README also says the repo ships more than 100 agent skills, including helpers and curated recipes for Gmail, Drive, Docs, Calendar and Sheets.
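From an agent's side, structured JSON output means a command can be run and parsed in a few lines. The sketch below uses a stand-in Python one-liner instead of gws so it runs anywhere; the payload shape is invented, and real subcommands and flags should be taken from the tool's own per-resource help rather than from this example.

```python
import json
import subprocess
import sys


def run_json_cli(argv):
    """Run a command and parse its stdout as JSON, raising on a non-zero exit."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


# Stand-in for a CLI that prints JSON; an agent would pass a gws argument list instead.
fake_cli = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'files': [{'name': 'Q3 report'}]}))",
]

payload = run_json_cli(fake_cli)
names = [item["name"] for item in payload["files"]]
print(names)  # ['Q3 report']
```

Using `check=True` turns a failed command into an exception rather than an empty result, which gives an agent loop a clear signal to retry or escalate.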

That matters because Workspace remains one of the most common systems of record for day-to-day business work. Email, calendars, internal docs, spreadsheets and shared files are often where operational context lives. A CLI that exposes those surfaces through a common, agent-friendly interface makes it easier to build assistants that retrieve information, trigger actions and automate repetitive processes with less bespoke plumbing.

The important caveat: visible, but not officially supported

The social-media response has been enthusiastic, but enterprises should read the repo carefully before treating the project as a formal Google platform commitment.

The README explicitly says: “This is not an officially supported Google product.” It also says the project is under active development and warns users to expect breaking changes as it moves toward v1.0.

That does not diminish the technical relevance of the release. It does, however, shape how enterprise teams should think about adoption. Today, this looks more like a promising developer tool with strong momentum than a production platform that large organizations should standardize on immediately.

This is a cleaner interface, not a governance bypass

The other key point is that the CLI does not bypass the underlying controls that govern Workspace access.

The documentation says users still need a Google Cloud project for OAuth credentials and a Google account with Workspace access. It also outlines multiple authentication patterns for local development, CI and service accounts, along with instructions for enabling APIs and handling setup issues.

For enterprises, that is the right way to interpret the tool. It is not magic access to Gmail, Docs or Sheets. It is a more usable abstraction over the same permissions, scopes and admin controls companies already manage.

Not a rejection of MCP, but a broader agent interface strategy

Some of the early commentary around the tool frames it as a cleaner alternative to Model Context Protocol (MCP)-heavy setups, arguing that CLI-driven execution can avoid wasting context window on large tool definitions. There is some logic to that argument, especially for agent systems that can call shell commands directly and parse JSON responses.

But the repo itself presents a more nuanced picture. It includes a Gemini CLI extension that gives Gemini agents access to gws commands and Workspace agent skills after terminal authentication. It also includes an MCP server mode through gws mcp, exposing Workspace APIs as structured tools for MCP-compatible clients including Claude Desktop, Gemini CLI and VS Code.

The strategic takeaway is not that Google Workspace is choosing CLI instead of MCP. It is that the CLI is emerging as the base interface, with MCP available where it makes sense.

What enterprises should do now

The right near-term move for enterprises is not broad rollout. It is targeted evaluation.

Developer productivity, platform engineering and IT automation teams should test the tool in a sandboxed Workspace environment and identify a narrow set of high-friction use cases where a CLI-first approach could reduce integration work. File discovery, spreadsheet updates, document generation, calendar operations and internal reporting are natural starting points.

Security and identity teams should review authentication patterns early and determine how tightly permissions, scopes and service-account usage can be constrained and monitored. AI platform teams, meanwhile, should compare direct CLI execution against MCP-based approaches in real workflows, focusing on reliability, prompt overhead and operational simplicity.

The broader trend is clear. As agentic software matures, the command line is becoming a common control plane for both developers and AI systems. Google Workspace’s new CLI does not change enterprise automation overnight. But it does make one of the most widely used productivity stacks easier to access through the interface that agent builders increasingly prefer.

EY hit 4x coding productivity by connecting AI agents to engineering standards

Coding agents can generate thousands of lines of code in minutes. The problem: most of it can’t be deployed. It breaks internal standards, fails compliance checks, or creates more cleanup work than it saves.

“You can generate a ton of code, but it doesn’t mean really anything, right? It’s got to be code that is integratable, that is compliant, and you don’t want to create more work on the back end just because you sped up the code generation process on the front end,” said Stephen Newman, EY Global CTO Engineering Leader.

EY’s product development team solved this by connecting coding agents to their engineering standards, code repositories, and compliance frameworks. The result: 4x to 5x productivity gains across teams building EY’s suite of audit, tax, and financial platforms.

But the gains didn’t come from just turning on a tool. Newman’s team spent 18 to 24 months building the cultural foundation and technical integrations that made semi-autonomous coding work at scale.

The first step was cultural. EY started with GitHub Copilot-style tools, letting engineers get comfortable with prompt engineering and assistive AI. Newman said the key learning was making AI adoption organic rather than forced from leadership. “It’s important to bring AI capabilities as a ground-up organic adoption rather than force them onto the users,” he said.

Developers wanted to move beyond code generation to building, deployment, and operationalization. But productivity gains plateaued without deeper integration.

Newman realized agents needed access to EY’s code repos, engineering standards and source catalogs to generate deployable code. Without that “context universe,” as Newman calls it, agents produce generic output that requires extensive rework.

EY evaluated multiple agent platforms: Lovable, Replit and Factory’s IDE-based Droids. Rather than mandate a tool, Newman’s team measured adoption, usage and productivity across all three.

“We didn’t want to be too prescriptive as a leadership team to identify a tool and dumb it down,” Newman said. Developers “really gravitated and navigated” to Factory, which became the signal that it delivered real value.

Factory adoption “took off like wildfire” once elevated from evaluation to pilot. EY had to throttle traffic to Factory and Droids and restrict which repos could connect before getting compliance and security sign-off.

The workload classification framework

The enthusiasm from developers made it clear EY needed discipline around which workloads to delegate to agents. Newman’s team separated tasks into two categories:

High-autonomy tasks agents handle well:

  • Code review

  • Documentation

  • Defect fixing

  • Greenfield features

Complex tasks that still need human oversight:

  • Large-scale refactors

  • Architecture decisions

  • Cross-system integrations
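The two categories above can be encoded as a simple routing policy. The task names mirror the lists, while the enum, the function and the choice to default unknown work to human oversight are assumptions made for illustration, not EY's implementation.

```python
from enum import Enum


class Autonomy(Enum):
    HIGH = "agent handles the task"
    SUPERVISED = "human oversight required"


WORKLOAD_POLICY = {
    "code_review": Autonomy.HIGH,
    "documentation": Autonomy.HIGH,
    "defect_fixing": Autonomy.HIGH,
    "greenfield_feature": Autonomy.HIGH,
    "large_scale_refactor": Autonomy.SUPERVISED,
    "architecture_decision": Autonomy.SUPERVISED,
    "cross_system_integration": Autonomy.SUPERVISED,
}


def route(task_type):
    # Assumption: anything not explicitly classified falls back to human oversight.
    return WORKLOAD_POLICY.get(task_type, Autonomy.SUPERVISED)


print(route("documentation").value)         # agent handles the task
print(route("architecture_decision").value) # human oversight required
```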

EY also shifted developer roles. Rather than writing all code themselves, engineers became orchestrators directing agents to the correct databases and repos.

With security guardrails in place and integration into code repositories complete, EY measured efficiency gains ranging from 15% to 60% across different personas in the early adoption phase.

“There’s a leap that we’ve made on many of our products where we jumped on what I call horizon model development, where we have semi-autonomous agent execution at scale, a team of orchestrators as opposed to doers and we have the integrations into the context universe,” Newman said.

Newman acknowledged it’s difficult to attribute the 4x to 5x productivity gains solely to coding agents. The improvements came from trial and error combined with cultural and behavioral shifts in developer teams.