Train-to-Test scaling explained: How to optimize your end-to-end AI compute budget for inference

The standard guidelines for building large language models (LLMs) optimize only for training costs and ignore inference costs. This poses a challenge for real-world applications that use inference-time scaling techniques to increase the accuracy of model responses, such as drawing multiple reasoning samples from a model at deployment.

To bridge this gap, researchers at the University of Wisconsin-Madison and Stanford University have introduced Train-to-Test (T2) scaling laws, a framework that jointly optimizes a model’s parameter size, its training data volume, and the number of test-time inference samples.

In practice, their approach shows that it is compute-optimal to train substantially smaller models on vastly more data than traditional rules prescribe, then spend the saved compute on generating multiple repeated samples at inference.

For enterprise AI application developers who are training their own models, this research provides a practical blueprint for maximizing return on investment. It shows that AI reasoning does not necessarily require spending huge amounts on frontier models. Instead, smaller models can yield stronger performance on complex tasks while keeping per-query inference costs manageable within real-world deployment budgets.

Conflicting scaling laws

Scaling laws are an important part of developing large language models. Pretraining scaling laws dictate the best way to allocate compute during the model’s creation, while test-time scaling laws guide how to allocate compute during deployment, such as letting the model “think longer” or generating multiple reasoning samples to solve complex problems.

The problem is that these scaling laws have been developed completely independently of one another despite being fundamentally intertwined.

A model’s parameter size and training duration directly dictate both the quality and the per-query cost of its inference samples. Currently, the industry gold standard for pretraining is the Chinchilla rule, which suggests a compute-optimal ratio of roughly 20 training tokens for every model parameter.

However, creators of modern AI model families, such as Llama, Gemma, and Qwen, regularly break this rule by intentionally overtraining their smaller models on massive amounts of data.

As Nicholas Roberts, co-author of the paper, told VentureBeat, the traditional approach falters when building complex agentic workflows: “In my view, the inference stack breaks down when each individual inference call is expensive. This is the case when the models are large and you need to do a lot of repeated sampling.” Instead of relying on massive models, developers can use overtrained compact models to run this repeated sampling at a fraction of the cost.

But because training and test-time scaling laws are examined in isolation, there is no rigorous framework to calculate how much a model should be overtrained based on how many reasoning samples it will need to generate during deployment.

Consequently, there has previously been no formula that jointly optimizes model size, training data volume, and test-time inference budgets.

The reason that this framework is hard to formulate is that pretraining and test-time scaling speak two different mathematical languages. During pretraining, a model’s performance is measured using “loss,” a smooth, continuous metric that tracks prediction errors as the model learns.

At test time, developers use real-world, downstream metrics to evaluate a model’s reasoning capabilities, such as pass@k, which measures the probability that a model will produce at least one correct answer across k independent, repeated attempts.
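Under the simplifying assumption that the k attempts are independent with a fixed per-attempt success probability p, pass@k has a simple closed form. The sketch below illustrates the metric only; the paper fits empirical scaling curves rather than assuming independence:

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one correct answer across k independent
    attempts, each succeeding with probability p."""
    return 1.0 - (1.0 - p) ** k

# A weak model sampled many times can rival a stronger single-shot model:
print(round(pass_at_k(0.10, 1), 3))   # 0.1
print(round(pass_at_k(0.10, 16), 3))  # 0.815
```

This is why repeated sampling is such an effective lever: even a 10% per-attempt solve rate compounds to over 80% across 16 samples.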

Train-to-test scaling laws

To solve the disconnect between training and deployment, the researchers introduce Train-to-Test (T2) scaling laws. At a high level, this framework predicts a model’s reasoning performance by treating three variables as a single equation: the model’s size (N), the volume of training tokens it learns from (D), and the number of reasoning samples it generates during inference (k).

T2 combines pretraining and inference budgets into one optimization formula that accounts for both the baseline cost to train the model (6ND) and the compounding cost to query it repeatedly at inference (2Nk). The researchers explored two modeling approaches: fitting either the pretraining loss or the downstream test-time performance (pass@k) as a function of N, D, and k.
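Using the article's cost accounting, 6ND FLOPs to train plus roughly 2N FLOPs per generated token scaled by k samples, the end-to-end budget can be sketched as follows. All concrete numbers (generation length, query volume, model sizes) are illustrative assumptions, not figures from the paper:

```python
def end_to_end_flops(n_params: float, n_tokens: float, k_samples: int,
                     gen_tokens_per_sample: int, n_queries: int) -> float:
    """One-time training cost (6ND) plus repeated-sampling inference cost
    (~2N FLOPs per generated token, times k samples, times query volume)."""
    train = 6 * n_params * n_tokens
    infer = 2 * n_params * k_samples * gen_tokens_per_sample * n_queries
    return train + infer

# Chinchilla-style model: 10B params at the ~20 tokens-per-param ratio.
chinchilla = end_to_end_flops(10e9, 200e9, k_samples=16,
                              gen_tokens_per_sample=1_000, n_queries=1_000_000)
# Overtrained small model: 1B params on the same 200B tokens (200 tokens/param).
overtrained = end_to_end_flops(1e9, 200e9, k_samples=16,
                               gen_tokens_per_sample=1_000, n_queries=1_000_000)
print(overtrained < chinchilla)  # True
```

Because both the training and inference terms scale linearly in N, shrinking the model cuts the whole end-to-end bill, which is the intuition behind trading parameters for extra training tokens and extra samples.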

The first approach takes the familiar mathematical equation used for Chinchilla scaling (which calculates a model’s prediction error, or loss) and directly modifies it by adding a new variable that accounts for the number of repeated test-time samples (k). This allows developers to see how increasing inference compute drives down the model’s overall error rate.

The second approach directly models the downstream pass@k accuracy. It tells developers the probability that their application will solve a problem given a specific compute budget.

But should enterprises use this framework for every application? Roberts clarifies that this approach is highly specialized. “I imagine that you would not see as much of a benefit for knowledge-heavy applications, such as chat models,” he said. Instead, “T2 is tailored to reasoning-heavy applications such as coding, where typically you would use repeated sampling as your test-time scaling method.”

What it means for developers

To validate the T2 scaling laws, the researchers built an extensive testbed of over 100 language models, ranging from 5 million to 901 million parameters. They trained 21 new, heavily overtrained checkpoints from scratch to test if their mathematical forecasts held up in reality. They then benchmarked the models across eight diverse tasks, which included real-world datasets like SciQ and OpenBookQA, alongside synthetic tasks designed to test arithmetic, spatial reasoning, and knowledge recall.

Both of their mathematical models showed that the compute-optimal frontier shifts drastically away from standard Chinchilla scaling. To maximize performance under a fixed budget, the optimal choice is a model that is significantly smaller and trained on vastly more data than the traditional 20-tokens-per-parameter rule dictates.

In their experiments, the highly overtrained small models consistently outperformed the larger, Chinchilla-optimal models across all eight evaluation tasks when test-time sampling costs were accounted for.

For developers looking to deploy these findings, the technical barrier is surprisingly low.

“Nothing fancy is required to perform test-time scaling with our current models,” Roberts said. “At deployment, developers can absolutely integrate infrastructure that makes the sampling process more efficient (e.g. KV caching if you’re using a transformer).”

KV caching helps by storing previously processed context so the model doesn’t have to re-read the initial prompt from scratch for every new reasoning sample.
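As a toy illustration of why caching matters for repeated sampling: with a shared prompt of P tokens and k samples of G generated tokens each, a naive loop re-encodes the prompt for every sample, while a cached run encodes it once. This is a simplified token-count model that ignores attention's quadratic term:

```python
def tokens_processed(prompt_len: int, gen_len: int, k: int, kv_cache: bool) -> int:
    """Count forward-pass token computations across k repeated samples."""
    if kv_cache:
        # Prompt is encoded once; each sample pays only for its own generation.
        return prompt_len + k * gen_len
    # Without caching, every sample re-processes the full prompt.
    return k * (prompt_len + gen_len)

print(tokens_processed(2_000, 500, 16, kv_cache=False))  # 40000
print(tokens_processed(2_000, 500, 16, kv_cache=True))   # 10000
```

The longer the shared prompt relative to each sample's output, the larger the savings, which is exactly the regime of agentic workloads with big system prompts and many short reasoning attempts.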

However, extreme overtraining comes with practical trade-offs. Overtrained models are reputed to be stubborn and harder to fine-tune, but Roberts notes that when the team applied supervised fine-tuning, “while this effect was present, it was not a strong enough effect to pull the optimal model back to Chinchilla.” The compute-optimal strategy remains definitively skewed toward compact models.

Yet, teams pushing this to the absolute limit must be wary of hitting physical data limits. “Another angle is that if you take our overtraining recommendations to the extreme, you may actually run out of training data,” Roberts said, referring to the looming “data wall” where high-quality internet data is exhausted.

These experiments confirm that if an application relies on generating multiple test-time reasoning samples, aggressively overtraining a compact model is practically and mathematically the most effective way to spend an end-to-end compute budget.

To help developers get started, the research team plans to open-source their checkpoints and code soon, allowing enterprises to plug in their own data and test the scaling behavior immediately. Ultimately, this framework serves as an equalizing force in the AI industry. 

This is especially crucial because the high price of frontier models can become a barrier as teams scale agentic applications that rely on reasoning models.

“T2 fundamentally changes who gets to build strong reasoning models,” Roberts concludes. “You might not need massive compute budgets to get state-of-the-art reasoning. Instead, you need good data and smart allocation of your training and inference budget.”

Anthropic just launched Claude Design, an AI tool that turns prompts into prototypes and challenges Figma

Anthropic today launched Claude Design, a new product from its Anthropic Labs division that allows users to create polished visual work — designs, interactive prototypes, slide decks, one-pagers, and marketing collateral — through conversational prompts and fine-grained editing controls. The release, available immediately in research preview to all paid Claude subscribers, is the company’s most aggressive expansion beyond its core language model business and into the application layer that has historically belonged to companies like Figma, Adobe, and Canva.

Claude Design is powered by Claude Opus 4.7, Anthropic’s most capable generally available vision model, which the company also released today. Anthropic says it is rolling access out gradually throughout the day to Claude Pro, Max, Team, and Enterprise subscribers.

The simultaneous launches mark a watershed for Anthropic, whose ambitions now visibly extend from foundation model provider to full-stack product company — one that wants to own the arc from a rough idea to a shipped product. The timing is also significant: Anthropic hit roughly $20 billion in annualized revenue in early March 2026, according to Bloomberg, up from $9 billion at the end of 2025 — and surpassed $30 billion by early April 2026. The company is in early talks with Goldman Sachs, JPMorgan, and Morgan Stanley about a potential IPO that could come as early as October 2026.

How Claude Design turns a text prompt into a working prototype

The product follows a workflow that Anthropic has designed to feel like a natural creative conversation. Users describe what they need, and Claude generates a first version. From there, refinement happens through a combination of channels: chat-based conversation, inline comments on specific elements, direct text editing, and custom adjustment sliders that Claude itself generates to let users tweak spacing, color, and layout in real time.

During onboarding, Claude reads a team’s codebase and design files and builds a design system — colors, typography, and components — that it automatically applies to every subsequent project. Teams can refine the system over time and maintain more than one. The import surface is broad: users can start from a text prompt, upload images and documents in various formats, or point Claude at their codebase. A web capture tool grabs elements directly from a live website so prototypes look like the real product.

What distinguishes Claude Design from the wave of AI design experiments that have proliferated in the past year is the handoff mechanism. When a design is ready to build, Claude packages everything into a handoff bundle that can be passed to Claude Code with a single instruction. That creates a closed loop — exploration to prototype to production code — all within Anthropic’s ecosystem. The export options acknowledge that not everyone’s next step is Claude Code: users can also share designs as an internal URL within their organization, save as a folder, or export to Canva, PDF, PPTX, or standalone HTML files.

Anthropic points to Brilliant, the education technology company known for intricate interactive lessons, as an early proof point. The company’s senior product designer reported that the most complex pages required 20 or more prompts to recreate in competing tools but needed only two in Claude Design. The Brilliant team then turned static mockups into interactive prototypes they could share and user-test without code review, and handed everything — including the design intent — to Claude Code for implementation. Datadog’s product team described a similar shift, compressing what had been a week-long cycle of briefs, mockups, and review rounds into a single conversation.

Why Anthropic’s chief product officer just resigned from Figma’s board

The launch arrives against a backdrop that makes Anthropic’s claim of complementarity with existing design tools difficult to take entirely at face value. Mike Krieger, Anthropic’s chief product officer, resigned from the board of Figma on April 14 — the same day The Information reported Anthropic’s next model would include design tools that could compete with Figma’s primary offering.

Figma has collaborated closely with Anthropic to integrate the frontier lab’s AI models into its products. Just two months ago, in February, Figma launched “Code to Canvas,” a feature that converts code generated in AI tools like Claude Code into fully editable designs inside Figma — creating a bridge between AI coding tools and Figma’s design process. The partnership felt like a mutual bet that AI would make design more essential, not less. Claude Design complicates that narrative significantly.

Anthropic’s position, based on VentureBeat’s background conversations with the company, is that Claude Design is built around interoperability and is meant to meet teams where they already work, not replace incumbent tools. The company points to the Canva export, PPTX and PDF support, and plans to make it easier for other tools to connect via MCP (Model Context Protocol) as evidence of that philosophy. Anthropic is also making it possible for other tools to build integrations with Claude Design, a move clearly designed to preempt accusations of walled-garden ambitions.

But the market read the signals differently. The structural tension is clear: Figma commands an estimated 80 to 90% market share in UI and UX design, according to The Next Web. Both Figma and Adobe assume a trained designer is in the loop. Anthropic’s tool does not. Claude Design is not merely another AI copilot embedded in an existing design application. It is a standalone product that generates complete, interactive prototypes from natural language — accessible to founders, product managers, and marketers who have never opened Figma. The expansion of the design user base to non-designers is the real competitive threat, even if the professional designer’s workflow remains anchored in Figma for now.

Inside Claude Opus 4.7, the model Anthropic deliberately made less dangerous

The model powering Claude Design is itself a significant story. Claude Opus 4.7 is Anthropic’s most capable generally available model, with notable improvements over its predecessor Opus 4.6 in software engineering, instruction following, and vision — but it is intentionally less capable than Anthropic’s most powerful offering, Claude Mythos Preview, the model the company announced earlier this month as too dangerous for broad release due to its cybersecurity capabilities.

That dual-track approach — one model for the public, one model locked behind a vetted-access program — is unprecedented in the AI industry. Anthropic used Claude Mythos Preview to identify thousands of zero-day vulnerabilities in every major operating system and web browser, as reported by multiple outlets. The Project Glasswing initiative that houses Mythos brings together Amazon Web Services, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, Nvidia, and Palo Alto Networks as launch partners.

Opus 4.7 sits a deliberate step below Mythos. Anthropic stated in its release that it “experimented with efforts to differentially reduce” the new model’s cyber capabilities during training and ships it with safeguards that automatically detect and block requests indicating prohibited or high-risk cybersecurity uses. What Anthropic learns from those real-world safeguards will inform the eventual goal of broader release for Mythos-class models. For security professionals with legitimate needs, the company has created a new Cyber Verification Program.

On benchmarks, the model posts strong numbers. Opus 4.7 reached 64.3% on SWE-bench Pro, and on Anthropic’s internal 93-task coding benchmark, it delivered a 13% resolution improvement over Opus 4.6, including solving four tasks that neither Opus 4.6 nor Sonnet 4.6 could crack.

The vision improvements are substantial and directly relevant to Claude Design: Opus 4.7 can accept images up to 2,576 pixels on the long edge — roughly 3.75 megapixels, more than three times the resolution of prior Claude models. Early access partner XBOW, the autonomous penetration testing company, reported that the new model scored 98.5% on their visual-acuity benchmark versus 54.5% for Opus 4.6.

Meanwhile, Bloomberg reported that the White House is preparing to make a version of Mythos available to major federal agencies, with the Office of Management and Budget setting up protections for Cabinet departments — a sign that the government views the model’s capabilities as too important to leave solely in private hands.

What enterprise buyers need to know about data privacy and pricing

For enterprise and regulated-industry buyers, the data handling architecture of Claude Design will be a critical evaluation criterion. Based on VentureBeat’s exclusive background discussions with Anthropic, the system stores the design-system representation it generates — not the source files themselves. When users link a local copy of their code, it is not uploaded to or stored on Anthropic’s servers. The company is also adding the ability to connect directly to GitHub. Anthropic states unequivocally that it does not train on this data. For Enterprise customers, Claude Design is off by default — administrators choose whether to enable it and control who has access.

On pricing, Claude Design is included at no additional cost with Pro, Max, Team, and Enterprise plans, using existing subscription limits with optional extra usage beyond those caps. Opus 4.7 holds the same API pricing as its predecessor: $5 per million input tokens and $25 per million output tokens. The pricing strategy mirrors the approach Anthropic took with Claude Code, which launched as a bundled feature and rapidly grew into a major revenue driver. Anthropic’s reasoning is straightforward: the best way to learn what people will build with a new product category is to put it in their hands, then build monetization around demonstrated value.

Anthropic is also being transparent about the product’s limitations. The design system import works best with a clean codebase; messy source code produces messy output. Collaboration is basic and not yet fully multiplayer. The editing experience has rough edges. There is no general availability date, and Anthropic says that is intentional — it will let the product and user feedback determine when Claude Design is ready for prime time.

Anthropic’s bet that owning the full creative stack is worth the risk

Claude Design is the most visible expression of a trend that has been accelerating for months: the major AI labs are moving up the stack from model providers into full application builders, directly entering categories previously owned by established software companies. Anthropic now offers a coding agent (Claude Code), a knowledge-work assistant (Claude Cowork), desktop computer control, office integrations for Word, Excel, and PowerPoint, a browser agent in Chrome, and now a design tool. Each product reinforces the others. A designer can explore concepts in Claude Design, export a prototype, hand it to Claude Code for implementation, and have Claude Cowork manage the review cycle — all within Anthropic’s platform.

The financial momentum behind this expansion is staggering. Anthropic has received investor offers valuing the company at approximately $800 billion, according to Reuters, more than doubling its $380 billion valuation from a funding round closed just two months ago. But building an application empire while simultaneously navigating an AI safety reputation, an impending IPO, growing public hostility toward the technology, and the diplomatic fallout of competing with your own partners is a balancing act that no technology company has attempted at this scale or speed.

When Figma launched Code to Canvas in February, the implicit promise was that AI coding tools and design tools would grow together, each making the other more valuable. Two months later, Anthropic’s chief product officer has left Figma’s board, and the company has shipped a product that lets anyone who can type a sentence create the kind of interactive prototype that once required years of design training and a Figma license. The partnership may survive. But the power dynamic just changed — and in the AI industry, that tends to be the only kind of change that matters.

Should my enterprise AI agent do that? NanoClaw and Vercel launch easier agentic policy setting and approval dialogs across 15 messaging apps

For the past year, early adopters of autonomous AI agents have been forced to play a murky game of chance: keep the agent in a useless sandbox or give it the keys to the kingdom and hope it doesn’t hallucinate a catastrophic “delete all” command.

To unlock the true utility of an agent—scheduling meetings, triaging emails, or managing cloud infrastructure—users have had to grant these models raw API keys and broad permissions, raising the risk of their systems being disrupted by an accidental agent mistake.

That tradeoff ends today. The creators of the open source, sandboxed NanoClaw agent framework, who now operate as a private startup called NanoCo, have announced a landmark partnership with Vercel and OneCLI to introduce a standardized, infrastructure-level approval system.

By integrating Vercel’s Chat SDK and OneCLI’s open source credentials vault, NanoClaw 2.0 ensures that no sensitive action occurs without explicit human consent, delivered natively through the messaging apps where users already live.

The specific use cases that stand to benefit most are those involving high-consequence “write” actions. In DevOps, for example, an agent could propose a cloud infrastructure change that only goes live once a senior engineer taps “Approve” in Slack.

For finance teams, an agent could prepare batch payments or invoice triaging, with the final disbursement requiring a human signature via a WhatsApp card.

Technology: security by isolation

The fundamental shift in NanoClaw 2.0 is the move away from “application-level” security to “infrastructure-level” enforcement. In traditional agent frameworks, the model itself is often responsible for asking for permission—a flow that Gavriel Cohen, co-founder of NanoCo, describes as inherently flawed.

“The agent could potentially be malicious or compromised,” Cohen noted in a recent interview. “If the agent is generating the UI for the approval request, it could trick you by swapping the ‘Accept’ and ‘Reject’ buttons.”

NanoClaw solves this by running agents in strictly isolated Docker or Apple Containers. The agent never sees a real API key; instead, it uses “placeholder” keys. When the agent attempts an outbound request, the request is intercepted by the OneCLI Rust Gateway. The gateway checks a set of user-defined policies (e.g., “Read-only access is okay, but sending an email requires approval”).

If the action is sensitive, the gateway pauses the request and triggers a notification to the user. Only after the user approves does the gateway inject the real, encrypted credential and allow the request to reach the service.
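The intercept-and-approve flow described above can be sketched as a policy gateway. All names here are hypothetical illustrations; the real OneCLI gateway is written in Rust and its actual API is not documented in this article:

```python
from dataclasses import dataclass

PLACEHOLDER_KEY = "PLACEHOLDER-0000"  # all the sandboxed agent ever sees
REAL_KEY = "sk-real-secret"           # lives only inside the gateway

# User-defined policy: which actions pass through, which need a human tap.
POLICY = {"read": "allow", "send_email": "require_approval", "delete": "require_approval"}

@dataclass
class Request:
    action: str
    api_key: str

def gateway(req: Request, user_approves) -> str:
    """Intercept an outbound request, enforce policy, then inject the real key."""
    rule = POLICY.get(req.action, "require_approval")  # default-deny unknown actions
    if rule == "require_approval" and not user_approves(req.action):
        return "blocked"
    # Only now does the real credential replace the agent's placeholder.
    req.api_key = REAL_KEY
    return "forwarded"

print(gateway(Request("read", PLACEHOLDER_KEY), lambda a: False))       # forwarded
print(gateway(Request("delete", PLACEHOLDER_KEY), lambda a: False))     # blocked
print(gateway(Request("send_email", PLACEHOLDER_KEY), lambda a: True))  # forwarded
```

The key property is that approval logic and credentials sit entirely outside the agent's process, so a compromised agent can neither forge the approval UI nor exfiltrate the real secret.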

Product: bringing the ‘human’ into the loop

While security is the engine, Vercel’s Chat SDK is the dashboard. Integrating with different messaging platforms is notoriously difficult because every app—Slack, Teams, WhatsApp, Telegram—uses different APIs for interactive elements like buttons and cards.

By leveraging Vercel’s unified SDK, NanoClaw can now deploy to 15 different channels from a single TypeScript codebase. When an agent wants to perform a protected action, the user receives a rich interactive card on their phone. “The approval shows up as a rich, native card right inside Slack or WhatsApp or Teams, and the user taps once to approve or deny,” said Cohen. This “seamless UX” is what makes human-in-the-loop oversight practical rather than a productivity bottleneck.

The full list of 15 supported messaging apps/channels contains many favored by enterprise knowledge workers, including:

  • Slack

  • WhatsApp

  • Telegram

  • Microsoft Teams

  • Discord

  • Google Chat

  • iMessage

  • Facebook Messenger

  • Instagram

  • X (Twitter)

  • GitHub

  • Linear

  • Matrix

  • Email

  • Webex

Background on NanoClaw

NanoClaw launched on January 31, 2026, as a minimalist and security-focused response to the “security nightmare” inherent in complex, non-sandboxed agent frameworks.

Created by Cohen, a former Wix.com engineer, and marketed by his brother Lazer, CEO of B2B tech public relations firm Concrete Media, the project was designed to solve the auditability crisis found in competing platforms like OpenClaw, which had grown to nearly 400,000 lines of code.

By contrast, NanoClaw condensed its core logic into roughly 500 lines of TypeScript—a size that, according to VentureBeat, allows the entire system to be audited by a human or a secondary AI in approximately eight minutes.

The platform’s primary technical defense is its use of operating system-level isolation. Every agent is placed inside an isolated Linux container—utilizing Apple Containers for high performance on macOS or Docker for Linux—to ensure that the AI only interacts with directories explicitly mounted by the user.

As detailed in VentureBeat’s reporting on the project’s infrastructure, this approach confines the “blast radius” of potential prompt injections strictly to the container and its specific communication channel.

In March 2026, NanoClaw further matured this security posture through an official partnership with the software container firm Docker to run agents inside “Docker Sandboxes”.

This integration utilizes MicroVM-based isolation to provide an enterprise-ready environment for agents that, by their nature, must mutate their environments by installing packages, modifying files, and launching processes—actions that typically break traditional container immutability assumptions.

Operationally, NanoClaw rejects the traditional “feature-rich” software model in favor of a “Skills over Features” philosophy. Instead of maintaining a bloated main branch with dozens of unused modules, the project encourages users to contribute “Skills”—modular instructions that teach a local AI assistant how to transform and customize the codebase for specific needs, such as adding Telegram or Gmail support.

This methodology, as described on NanoClaw’s website and in VentureBeat interviews, ensures that users only maintain the exact code required for their specific implementation.

Furthermore, the framework natively supports “Agent Swarms” via the Anthropic Agent SDK, allowing specialized agents to collaborate in parallel while maintaining isolated memory contexts for different business functions.

Licensing and open source strategy

NanoClaw remains firmly committed to the open source MIT License, encouraging users to fork the project and customize it for their own needs. This stands in stark contrast to “monolithic” frameworks.

NanoClaw’s codebase is remarkably lean, consisting of only 15 source files and roughly 3,900 lines of code, compared to the hundreds of thousands of lines found in competitors like OpenClaw.

The partnership also highlights the strength of the “Open Source Avengers” coalition.

By combining NanoClaw (agent orchestration), Vercel Chat SDK (UI/UX), and OneCLI (security/secrets), the project demonstrates that modular, open-source tools can outpace proprietary labs in building the application layer for AI.

Community reactions

As shown on the NanoClaw website, the project has amassed more than 27,400 stars on GitHub and maintains an active Discord community.

A core claim on the NanoClaw site is that the codebase is small enough to understand in “8 minutes,” a feature targeted at security-conscious users who want to audit their assistant.

In an interview, Cohen noted that iMessage support via Vercel’s Photon project addresses a common community hurdle: previously, users often had to maintain a separate Mac Mini to connect agents to an iMessage account.

The enterprise perspective: should you adopt?

For enterprises, NanoClaw 2.0 represents a shift from speculative experimentation to safe operationalization.

Historically, IT departments have blocked agent usage due to the “all-or-nothing” nature of credential access. By decoupling the agent from the secret, NanoClaw provides a middle ground that mirrors existing corporate security protocols—specifically the principle of least privilege.

Enterprises should consider this framework if they require high-auditability and have strict compliance needs regarding data exfiltration. According to Cohen, many businesses have not been ready to grant agents access to calendars or emails because of security concerns. This framework addresses that by ensuring the agent structurally cannot act without permission.

Enterprises stand to benefit specifically in use cases involving “high-stakes” actions. As illustrated in the OneCLI dashboard, a user can set a policy where an agent can read emails freely but must trigger a manual approval dialog to “delete” or “send” one.

Because NanoClaw runs as a single Node.js process with isolated containers, it allows enterprise security teams to verify that the gateway is the only path for outbound traffic. This architecture transforms the AI from an unmonitored operator into a supervised junior staffer, providing the productivity of autonomous agents without forgoing executive control.

Ultimately, NanoClaw suits organizations that want the productivity of autonomous agents without the “black box” risk of traditional LLM wrappers: an AI that always asks for permission before hitting the “send” or “buy” button.

As AI-native setups become the standard, this partnership establishes the blueprint for how trust will be managed in the age of the autonomous workforce.

Salesforce launches Headless 360 to turn its entire platform into infrastructure for AI agents

Salesforce on Wednesday unveiled the most ambitious architectural transformation in its 27-year history, introducing “Headless 360” — a sweeping initiative that exposes every capability in its platform as an API, MCP tool, or CLI command so AI agents can operate the entire system without ever opening a browser.

The announcement, made at the company’s annual TDX developer conference in San Francisco, ships more than 100 new tools and skills that are immediately available to developers. It marks a decisive response to the existential question hanging over enterprise software: In a world where AI agents can reason, plan, and execute, does a company still need a CRM with a graphical interface?

Salesforce’s answer: No — and that’s exactly the point.

“We made a decision two and a half years ago: Rebuild Salesforce for agents,” the company said in its announcement. “Instead of burying capabilities behind a UI, expose them so the entire platform will be programmable and accessible from anywhere.”

The timing is anything but coincidental. Salesforce finds itself navigating one of the most turbulent periods in enterprise software history — a sector-wide sell-off that has pushed the iShares Expanded Tech-Software Sector ETF down roughly 28% from its September peak. The fear driving the decline: that AI, particularly large language models from Anthropic, OpenAI, and others, could render traditional SaaS business models obsolete.

Jayesh Govindarjan, EVP of Salesforce and one of the key architects behind the Headless 360 initiative, described the announcement as rooted not in marketing theory but in hard-won lessons from deploying agents with thousands of enterprise customers.

“The problem that emerged is the lifecycle of building an agentic system for every one of our customers on any stack, whether it’s ours or somebody else’s,” Govindarjan told VentureBeat in an exclusive interview. “The challenge that they face is very much the software development challenge. How do I build an agent? That’s only step one.”

More than 100 new tools give coding agents full access to the Salesforce platform for the first time

Salesforce Headless 360 rests on three pillars that collectively represent the company’s attempt to redefine what an enterprise platform looks like in the agentic era.

The first pillar — build any way you want — delivers more than 60 new MCP (Model Context Protocol) tools and 30-plus preconfigured coding skills that give external coding agents like Claude Code, Cursor, Codex, and Windsurf complete, live access to a customer’s entire Salesforce org, including data, workflows, and business logic. Developers no longer need to work inside Salesforce’s own IDE. They can direct AI coding agents from any terminal to build, deploy, and manage Salesforce applications.

Agentforce Vibes 2.0, the company’s own native development environment, now includes what it calls an “open agent harness” supporting both the Anthropic agent SDK and the OpenAI agents SDK. As demonstrated during the keynote, developers can choose between Claude Code and OpenAI agents depending on the task, with the harness dynamically adjusting available capabilities based on the selected agent. The environment also adds multi-model support, including Claude Sonnet and GPT-5, along with full org awareness from the start.

A significant technical addition is native React support on the Salesforce platform. During the keynote demo, presenters built a fully functional partner service application using React — not Salesforce’s own Lightning framework — that connected to org metadata via GraphQL while inheriting all platform security primitives. This opens up dramatically more expressive front-end possibilities for developers who want complete control over the visual layer.

The second pillar — deploy on any surface — centers on the new Agentforce Experience Layer, which separates what an agent does from how it appears, rendering rich interactive components natively across Slack, mobile apps, Microsoft Teams, ChatGPT, Claude, Gemini, and any client supporting MCP apps. During the keynote, presenters defined an experience once and deployed it across six different surfaces without writing surface-specific code. The philosophical shift is significant: rather than pulling customers into a Salesforce UI, enterprises push branded, interactive agent experiences into whatever workspace their customers already inhabit.

The third pillar — build agents you can trust at scale — introduces an entirely new suite of lifecycle management tools spanning testing, evaluation, experimentation, observation, and orchestration. Agent Script, the company’s new domain-specific language for defining agent behavior deterministically, is now generally available and open-sourced. A new Testing Center surfaces logic gaps and policy violations before deployment. Custom Scoring Evals let enterprises define what “good” looks like for their specific use case. And a new A/B Testing API enables running multiple agent versions against real traffic simultaneously.

Why enterprise customers kept breaking their own AI agents — and how Salesforce redesigned its tooling in response

Perhaps the most technically significant — and candid — portion of VentureBeat’s interview with Govindarjan addressed the fundamental engineering tension at the heart of enterprise AI: agents are probabilistic systems, but enterprises demand deterministic outcomes.

Govindarjan explained that early Agentforce customers, after getting agents into production through “sheer hard work,” discovered a painful reality. “They were afraid to make changes to these agents, because the whole system was brittle,” he said. “You make one change and you don’t know whether it’s going to work 100% of the time. All the testing you did needs to be redone.”

This brittleness problem drove the creation of Agent Script, which Govindarjan described as a programming language that “brings together the determinism that’s in programming languages with the inherent flexibility in probabilistic systems that LLMs provide.” The language functions as a single flat file — versionable, auditable — that defines a state machine governing how an agent behaves. Within that machine, enterprises specify which steps must follow explicit business logic and which can reason freely using LLM capabilities.
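Agent Script's actual syntax is not shown in the article, but the pattern it describes, a flat, versionable file defining a state machine in which each step is either explicit business logic or free-form LLM reasoning, can be sketched in plain Python. Every name below (the state labels, the `llm_reason` stub) is illustrative, not real Agent Script:

```python
# Conceptual sketch only; not Agent Script syntax. It shows a state machine
# where some steps are deterministic business logic and others defer to an
# LLM, which is the mix the article attributes to Agent Script.

def llm_reason(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a canned answer here."""
    return f"LLM response to: {prompt}"

# Each state maps to (handler, is_deterministic, next_state).
AGENT_STATES = {
    "greet":    (lambda ctx: "Hello! How can I help?",            True,  "classify"),
    "classify": (lambda ctx: llm_reason(ctx["user_msg"]),         False, "resolve"),
    "resolve":  (lambda ctx: f"Ticket {ctx['ticket_id']} updated", True,  None),
}

def run_agent(ctx: dict) -> list:
    """Walk the state machine, recording each step's output."""
    state, transcript = "greet", []
    while state is not None:
        handler, deterministic, nxt = AGENT_STATES[state]
        transcript.append(handler(ctx))
        state = nxt
    return transcript

transcript = run_agent({"user_msg": "My order is late", "ticket_id": "T-42"})
```

Because the whole definition is one plain data structure, it can be versioned and audited like any other file, which is the property the article highlights.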

Salesforce open-sourced Agent Script this week, and Govindarjan noted that Claude Code can already generate it natively because of its clean documentation. The approach stands in sharp contrast to the “vibe coding” movement gaining traction elsewhere in the industry. As the Wall Street Journal recently reported, some companies are now attempting to vibe-code entire CRM replacements — a trend Salesforce’s Headless 360 directly addresses by making its own platform the most agent-friendly substrate available.

Govindarjan described the tooling as a product of Salesforce’s own internal practice. “We needed these tools to make our customers successful. Then our FDEs needed them. We hardened them, and then we gave them to our customers,” he told VentureBeat. In other words, Salesforce productized its own pain.

Inside the two competing AI agent architectures Salesforce says every enterprise will need

Govindarjan drew a revealing distinction between two fundamentally different agentic architectures emerging in the enterprise — one for customer-facing interactions and one he linked to what he called the “Ralph Wiggum loop.”

Customer-facing agents — those deployed to interact with end customers for sales or service — demand tight deterministic control. “Before customers are willing to put these agents in front of their customers, they want to make sure that it follows a certain paradigm — a certain brand set of rules,” Govindarjan told VentureBeat. Agent Script encodes these as a static graph — a defined funnel of steps with LLM reasoning embedded within each step.

The “Ralph Wiggum loop,” by contrast, represents the opposite end of the spectrum: a dynamic graph that unrolls at runtime, where the agent autonomously decides its next step based on what it learned in the previous step, killing dead-end paths and spawning new ones until the task is complete. This architecture, Govindarjan said, manifests primarily in employee-facing scenarios — developers using coding agents, salespeople running deep research loops, marketers generating campaign materials — where an expert human reviews the output before it ships.

“Ralph Wiggum loops are great for employee-facing because employees are, in essence, experts at something,” Govindarjan explained. “Developers are experts at development, salespeople are experts at sales.”

The critical technical insight: both architectures run on the same underlying platform and the same graph engine. “This is a dynamic graph. This is a static graph,” he said. “It’s all a graph underneath.” That unified runtime — spanning the spectrum from tightly controlled customer interactions to free-form autonomous loops — may be Salesforce’s most important technical bet, sparing enterprises from maintaining separate platforms for different agent modalities.
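The "it's all a graph underneath" claim can be illustrated with a toy runner. One engine executes both styles: a static funnel whose edges are fixed ahead of time, and a dynamic loop whose next node is computed at runtime from what the previous step learned. All names here are hypothetical, not Salesforce's implementation:

```python
# Toy graph engine: a node returns (output, next_node). For a static graph
# next_node is a constant; for a dynamic loop it is decided at runtime.

def run_graph(start, nodes, max_steps=10):
    """Execute nodes until one returns next_node=None or steps run out."""
    node, outputs = start, []
    for _ in range(max_steps):
        if node is None:
            break
        result, node = nodes[node]()
        outputs.append(result)
    return outputs

# Static graph: the customer-facing funnel is fixed ahead of time.
static_nodes = {
    "qualify": lambda: ("qualified", "quote"),
    "quote":   lambda: ("quoted", None),
}

# Dynamic graph: the next step depends on what was just learned.
progress = {"done": False}
def explore():
    if progress["done"]:
        return ("finished", None)
    progress["done"] = True        # e.g. a dead end was killed, so retry
    return ("explored", "explore")  # loop back until the task completes

dynamic_nodes = {"explore": explore}

static_out = run_graph("qualify", static_nodes)
dynamic_out = run_graph("explore", dynamic_nodes)
```

Both calls go through the same `run_graph`, mirroring the unified runtime Govindarjan describes.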

Salesforce hedges its bets on MCP while opening its ecosystem to every major AI model and tool

Salesforce’s embrace of openness at TDX was striking. The platform now integrates with OpenAI, Anthropic, Google Gemini, Meta’s LLaMA, and Mistral AI models. The open agent harness supports third-party agent SDKs. MCP tools work from any coding environment. And the new AgentExchange marketplace unifies 10,000 Salesforce apps, 2,600-plus Slack apps, and 1,000-plus Agentforce agents, tools, and MCP servers from partners including Google, Docusign, and Notion, backed by a new $50 million AgentExchange Builders Initiative.

Yet Govindarjan offered a surprisingly candid assessment of MCP itself — the protocol Anthropic created that has become a de facto standard for agent-tool communication.

“To be very honest, not at all sure” that MCP will remain the standard, he told VentureBeat. “When MCP first came along as a protocol, a lot of us engineers felt that it was a wrapper on top of a really well-written CLI — which now it is. A lot of people are saying that maybe CLI is just as good, if not better.”

His approach: pragmatic flexibility. “We’re not wedded to one or the other. We just use the best, and often we will offer all three. We offer an API, we offer a CLI, we offer an MCP.” This hedging explains the “Headless 360” naming itself — rather than betting on a single protocol, Salesforce exposes every capability across all three access patterns, insulating itself against protocol shifts.

Engine, the B2B travel management company featured prominently in the keynote demos, offered a real-world proof point for the open ecosystem approach. The company built its customer service agent, Ava, in 12 days using Agentforce and now handles 50% of customer cases autonomously. Engine runs five agents across customer-facing and employee-facing functions, with Data 360 at the heart of its infrastructure and Slack as its primary workspace. “CSAT goes up, costs to deliver go down. Customers are happier. We’re getting them answers faster. What’s the trade off? There’s no trade off,” an Engine executive said during the keynote.

Underpinning all of it is a shift in how Salesforce gets paid. The company is moving from per-seat licensing to consumption-based pricing for Agentforce — a transition Govindarjan described as “a business model change and innovation for us.” It’s a tacit acknowledgment that when agents, not humans, are doing the work, charging per user no longer makes sense.

Salesforce isn’t defending the old model — it’s dismantling it and betting the company on what comes next

Govindarjan framed the company’s evolution in architectural terms. Salesforce has organized its platform around four layers: a system of context (Data 360), a system of work (Customer 360 apps), a system of agency (Agentforce), and a system of engagement (Slack and other surfaces). Headless 360 opens every layer via programmable endpoints.

“What you saw today, what we’re doing now, is we’re opening up every single layer, right, with MCP tools, so we can go build the agentic experiences that are needed,” Govindarjan told VentureBeat. “I think you’re seeing a company transforming itself.”

Whether that transformation succeeds will depend on execution across thousands of customer deployments, the staying power of MCP and related protocols, and the fundamental question of whether incumbent enterprise platforms can move fast enough to remain relevant when AI agents can increasingly build new systems from scratch. The software sector’s bear market, the financial pressures bearing down on the entire industry, and the breathtaking pace of LLM improvement all conspire to make this one of the highest-stakes bets in enterprise technology.

But there is an irony embedded in Salesforce’s predicament that Headless 360 makes explicit. The very AI capabilities that threaten to displace traditional software are the same capabilities that Salesforce now harnesses to rebuild itself. Every coding agent that could theoretically replace a CRM is now, through Headless 360, a coding agent that builds on top of one. The company is not arguing that agents won’t change the game. It’s arguing that decades of accumulated enterprise data, workflows, trust layers, and institutional logic give it something no coding agent can generate from a blank prompt.

As Benioff declared on CNBC’s Mad Money in March: “The software industry is still alive, well and growing.” Headless 360 is his company’s most forceful attempt to prove him right — by tearing down the walls of the very platform that made Salesforce famous and inviting every agent in the world to walk through the front door.

Parker Harris, Salesforce’s co-founder, captured the bet most succinctly in a question he posed last month: “Why should you ever log into Salesforce again?”

If Headless 360 works as designed, the answer is: You shouldn’t have to. And that, Salesforce is wagering, is precisely what will keep you paying for it.

Meta researchers introduce ‘hyperagents’ to unlock self-improving AI for non-coding tasks

Creating self-improving AI systems is an important step toward deploying agents in dynamic environments, especially enterprise production settings where tasks are neither predictable nor consistent.

Current self-improving AI systems face severe limitations because they rely on fixed, handcrafted improvement mechanisms that only work in narrow, well-structured domains such as software engineering.

To overcome this practical challenge, researchers at Meta and several universities introduced “hyperagents,” a self-improving AI system that continuously rewrites and optimizes its problem-solving logic and the underlying code. 

In practice, this allows the AI to self-improve across non-coding domains, such as robotics and document review. The agent independently invents general-purpose capabilities like persistent memory and automated performance tracking.

More broadly, hyperagents don’t just get better at solving tasks; they also learn to improve the self-improvement cycle itself, accelerating progress.

This framework can help develop highly adaptable agents that autonomously build structured, reusable decision machinery. This approach compounds capabilities over time with less need for constant, manual prompt engineering and domain-specific human customization.

Current self-improving AI and its architectural bottlenecks

The core goal of self-improving AI systems is to continually enhance their own learning and problem-solving capabilities. However, most existing self-improvement models rely on a fixed “meta agent.” This static, high-level supervisory system is designed to modify a base system.

“The core limitation of handcrafted meta-agents is that they can only improve as fast as humans can design and maintain them,” Jenny Zhang, co-author of the paper, told VentureBeat. “Every time something changes or breaks, a person has to step in and update the rules or logic.”

Instead of an abstract theoretical limit, this creates a practical “maintenance wall.” 

The current paradigm ties system improvement directly to human iteration speed, slowing down progress because it relies heavily on manual engineering effort rather than scaling with agent-collected experience.

To overcome this limitation, the researchers argue that the AI system must be “fully self-referential.” These systems must be able to analyze, evaluate, and rewrite any part of themselves without the constraints of their initial setup. This allows the AI system to break free from structural limits and become self-accelerating.

One example of a self-referential AI system is Sakana AI’s Darwin Gödel Machine (DGM), an AI system that improves itself by rewriting its own code.

In DGM, an agent iteratively generates, evaluates, and modifies its own code, saving successful variants in an archive to act as stepping stones for future improvements. DGM showed that open-ended, recursive self-improvement is practically achievable in coding.

However, DGM falls short when applied to real-world applications outside of software engineering because of a critical skill gap. In DGM, the system improves because both evaluation and self-modification are coding tasks. Improving the agent’s coding ability naturally improves its ability to rewrite its own code. But if you deploy DGM for a non-coding enterprise task, this alignment breaks down.

“For tasks like math, poetry, or paper review, improving task performance does not necessarily improve the agent’s ability to modify its own behavior,” Zhang said.

The skills needed to analyze subjective text or business data are entirely different from the skills required to analyze failures and write new Python code to fix them. 

DGM also relies on a fixed, human-engineered mechanism to generate its self-improvement instructions. In practice, if enterprise developers want to use DGM for anything other than coding, they must heavily engineer and manually customize the instruction prompts for every new domain.

The hyperagent framework

To overcome the limitations of previous architectures, the researchers introduce hyperagents. The framework proposes “self-referential agents that can in principle self-improve for any computable task.”

In this framework, an agent is any computable program that can invoke LLMs, external tools, or learned components. Traditionally, these systems are split into two distinct roles: a “task agent” that executes the specific problem at hand, and a “meta agent” that analyzes and modifies the task agent. A hyperagent fuses both roles into a single, self-referential, and editable program.

Because the entire program can be rewritten, the system can modify the self-improvement mechanism, a process the researchers call metacognitive self-modification.

“Hyperagents are not just learning how to solve the given tasks better, but also learning how to improve,” Zhang said. “Over time, this leads to accumulation. Hyperagents do not need to rediscover how to improve in each new domain. Instead, they retain and build on improvements to the self-improvement process itself, allowing progress to compound across tasks.”

The researchers extended the Darwin Gödel Machine to create DGM-Hyperagents (DGM-H). DGM-H retains the powerful open-ended exploration structure of the original DGM, which prevents the AI from converging too early or getting stuck in dead ends by maintaining a growing archive of successful hyperagents.

The system continuously branches from selected candidates in this archive, allows them to self-modify, evaluates the new variants on given tasks, and adds the successful ones back into the pool as stepping stones for future iterations.
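The archive loop described above can be sketched in a few lines of Python. The `self_modify` and `evaluate` functions are stubs standing in for the real self-rewriting and benchmark steps, so this shows the shape of the algorithm rather than the released DGM-H code:

```python
# Sketch of an open-ended archive search (illustrative, not DGM-H itself):
# sample a parent from a growing archive, let it self-modify, and keep
# variants that score better as stepping stones for future iterations.
import random

def self_modify(agent: str, rng: random.Random) -> str:
    """Stand-in for the agent rewriting its own program."""
    return agent + f"+edit{rng.randint(0, 9)}"

def evaluate(agent: str) -> float:
    """Stand-in for benchmark evaluation; longer lineages score higher here."""
    return agent.count("+edit") / 10.0

def search(iterations: int = 20, seed: int = 0) -> list:
    rng = random.Random(seed)
    archive = ["base-agent"]          # stepping stones for future variants
    for _ in range(iterations):
        parent = rng.choice(archive)  # branch from any archived candidate
        child = self_modify(parent, rng)
        if evaluate(child) > evaluate(parent):
            archive.append(child)     # keep successful variants
    return archive

archive = search()
```

Keeping every successful variant, rather than only the current best, is what lets the search branch from older candidates and avoid converging too early.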

By combining this open-ended evolutionary search with metacognitive self-modification, DGM-H eliminates the fixed, human-engineered instruction step of the original DGM. This enables the agent to self-improve across any computable task.

Hyperagents in action

The researchers used the Polyglot coding benchmark to compare the hyperagent framework against previous coding-only AI. They also evaluated hyperagents across non-coding domains that involve subjective reasoning, external tool use, and complex logic.

These included paper review to simulate a peer reviewer outputting accept or reject decisions, reward model design for training a quadruped robot, and Olympiad-level math grading. Math grading served as a held-out test to see if an AI that learned how to self-improve while reviewing papers and designing robots could transfer those meta-skills to an entirely unseen domain.

The researchers compared hyperagents against several baselines, including domain-specific models like AI-Scientist-v2 for paper reviews and the ProofAutoGrader for math. They also tested against the classic DGM and a manually customized DGM for new domains.

On the coding benchmark, hyperagents matched the performance of DGM despite not being designed specifically for coding. In paper review and robotics, hyperagents outperformed the open-source baselines and human-engineered reward functions. 

When the researchers took a hyperagent optimized for paper review and robotics and deployed it on the unseen math grading task, it achieved an improvement metric of 0.630 in 50 iterations. Baselines relying on classic DGM architectures remained at a flat 0.0. The hyperagent even beat the domain-specific ProofAutoGrader.

The experiments also highlighted interesting autonomous behaviors from hyperagents. In paper evaluation, the agent first used standard prompt-engineering tricks like adopting a rigorous persona. When this proved unreliable, it rewrote its own code to build a multi-stage evaluation pipeline with explicit checklists and rigid decision rules, leading to much higher consistency.

Hyperagents also autonomously developed a memory tool to avoid repeating past mistakes. Furthermore, the system wrote a performance tracker to log and monitor the result of architectural changes across generations. The model even developed a compute-budget aware behavior, where it tracked remaining iterations to adjust its planning. Early generations executed ambitious architectural changes, while later generations focused on conservative, incremental refinements.

For enterprise data teams wondering where to start, Zhang recommends focusing on tasks where success is unambiguous. “Workflows that are clearly specified and easy to evaluate, often referred to as verifiable tasks, are the best starting point,” she said. “This generally opens new opportunities for more exploratory prototyping, more exhaustive data analysis, more exhaustive A/B testing, [and] faster feature engineering.” For harder, unverified tasks, teams can use hyperagents to first develop learned judges that better reflect human preferences, creating a bridge to more complex domains.

The researchers have shared the code for hyperagents, though it has been released under a non-commercial license.

Caveats and future threats

The benefits of hyperagents introduce clear tradeoffs. The researchers highlight several safety considerations regarding systems that can modify themselves in increasingly open-ended ways.

These AI systems pose the risk of evolving far more rapidly than humans can audit or interpret. While the researchers contained DGM-H within safety boundaries such as sandboxed environments designed to prevent unintended side effects, those initial safeguards double as practical deployment blueprints.

Zhang advises developers to enforce resource limits and restrict access to external systems during the self-modification phase. “The key principle is to separate experimentation from deployment: allow the agent to explore and improve within a controlled sandbox, while ensuring that any changes that affect real systems are carefully validated before being applied,” she said. Only after the newly modified code passes developer-defined correctness checks should it be promoted to a production setting.

Another significant danger is evaluation gaming, where the AI improves its metrics without making actual progress toward the intended real-world goal. Because hyperagents are driven by empirical evaluation signals, they can autonomously discover strategies that exploit blind spots or weaknesses in the evaluation procedure itself to artificially inflate their scores. Preventing this behavior requires developers to implement diverse, robust, and periodically refreshed evaluation protocols alongside continuous human oversight.

Ultimately, these systems will shift the day-to-day responsibilities of human engineers. Just as we do not recompute every operation a calculator performs, future AI orchestration engineers will not write the improvement logic directly, Zhang believes.

Instead, they will design the mechanisms for auditing and stress-testing the system. “As self-improving systems become more capable, the question is no longer just how to improve performance, but what objectives are worth pursuing,” Zhang said. “In that sense, the role evolves from building systems to shaping their direction.”

We tested Anthropic’s redesigned Claude Code desktop app and ‘Routines’ — here’s what enterprises should know

The transition from AI as a chatbot to AI as a workforce is no longer a theoretical projection; it has become the primary design philosophy for the modern developer’s toolkit.

On April 14, 2026, Anthropic signaled this shift with a dual release: a complete redesign of the Claude Code desktop app (for Mac and Windows) and the launch of “Routines” in research preview.

These updates suggest that for the modern enterprise, the developer’s role is shifting from a solo practitioner to a high-level orchestrator managing multiple, simultaneous streams of work.

For years, the industry focused on “copilots”—single-threaded assistants that lived within the IDE and responded to the immediate line of code being written. Anthropic’s latest update acknowledges that the shape of “agentic work” has fundamentally changed.

Developers are no longer just typing prompts and waiting for answers; they are initiating refactors in one repository, fixing bugs in another, and writing tests in a third, all while monitoring the progress of these disparate tasks. The redesigned desktop application reflects this change through its central “Mission Control” feature: the new sidebar.

This interface element allows a developer to manage every active and recent session in a single view, filtering by status, project, or environment. It effectively turns the developer’s desktop into a command center where they can steer agents as they drift or review diffs before shipping. This represents a philosophical move away from “conversation” toward “orchestration”.

Routines: your new ‘set and forget’ option for repeating processes and tasks

The introduction of “Routines” represents a significant architectural evolution for Claude Code. Previously, automation was often tied to the user’s local hardware or manually managed infrastructure.

Routines move this execution to Anthropic’s web infrastructure, decoupling progress from the user’s local machine.

This means a critical task—such as a nightly triage of bugs from a Linear backlog—can run at 2:00 AM without the developer’s laptop being open.

These Routines are segmented into three distinct categories designed for enterprise integration:

  • Scheduled Routines: These function like a sophisticated cron job, performing repeatable maintenance like docs-drift scanning or backlog management on a cadence.

  • API Routines: These provide dedicated endpoints and auth tokens, allowing enterprises to trigger Claude via HTTP requests from alerting tools like Datadog or CI/CD pipelines.

  • Webhook Routines: Currently focused on GitHub, these allow Claude to listen for repository events and automatically open sessions to address PR comments or CI failures.

For enterprise teams, these Routines come with structured daily limits: Pro users are capped at 5, Max at 15, and Team/Enterprise tiers at 25 routines per day, though additional usage can be purchased.
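The article does not publish the actual endpoint format, so the following Python sketch only illustrates the general pattern of an alerting tool triggering an API Routine over HTTP. The URL, token, header names, and payload shape are all placeholders, and nothing is actually sent:

```python
# Hypothetical sketch of triggering an API Routine from an alerting tool.
# Endpoint, token, and payload fields are invented placeholders, not
# Anthropic's real API. The request is only constructed, never sent.
import json
import urllib.request

ROUTINE_URL = "https://example.invalid/routines/nightly-triage"  # placeholder
ROUTINE_TOKEN = "routine-token-123"                              # placeholder

def build_trigger(alert: dict) -> urllib.request.Request:
    """Package an alert (e.g. a Datadog monitor event) as a routine trigger."""
    body = json.dumps({"trigger": "alert", "payload": alert}).encode()
    return urllib.request.Request(
        ROUTINE_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {ROUTINE_TOKEN}",
            "Content-Type": "application/json",
        },
    )

req = build_trigger({"monitor": "p95-latency", "status": "alerting"})
# urllib.request.urlopen(req) would fire the routine in a real setup.
```

The point of the pattern is that the trigger is a plain authenticated HTTP call, which is why existing CI/CD and monitoring tools can drive routines without any Claude-specific SDK.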

Analysis: desktop GUI vs. Terminal

The pivot toward a dedicated Desktop GUI for a tool that originated in the terminal (CLI) invites an analysis of the trade-offs for enterprise users.

The primary benefit of the new desktop app is high-concurrency visibility. In a terminal environment, managing four different AI agents working on four different repositories is a cognitive burden, requiring multiple tabs and constant context switching.

The desktop app’s drag-and-drop layout allows the terminal, preview pane, diff viewer, and chat to be arranged in a grid that matches the user’s specific workflow.

Furthermore, the “Side Chat” feature (accessible via ⌘ + ;) solves a common problem in agentic work: the need to ask a clarifying question without polluting the main task’s history. This ensures that the agent’s primary mission remains focused while the human operator gets the context they need. The same side-chat capability is also available in the terminal view via the /btw command.

Despite the GUI’s benefits, the CLI remains the home of many developers. The terminal is lightweight and fits into existing shell-based automation.

Recognizing this, Anthropic has maintained parity: CLI plugins are supposed to work exactly the same in the desktop app as they do in the terminal. Yet in my testing, I was unable to get some of my third-party plugins to show up in the terminal or main view.

For pure speed and users who operate primarily within a single repository, the CLI avoids the resource overhead of a full GUI.

How to use the new Claude Code desktop app view

In practice, accessing the redesigned Claude Code desktop app requires a bit of digital hunting.

It’s not a separate new application — instead, it is but one of three main views in the official Claude desktop app, accessible only by hovering over the “Chat” icon in the top-left corner to reveal the specific coding interfaces.

Once inside, the transition from a standard chat window to the “Claude Code” view is stark. The interface is dominated by a central conversational thread flanked by a session-management sidebar that allows for quick navigation between active and archived projects.

The addition of a subtle circular indicator at the bottom, revealed on hover, showing how much context the user has consumed in the current session and against weekly plan limits is welcome. But it is a step back from third-party CLI plugins that can display this information constantly, without requiring the extra hover.

Similarly, pop-up icons for permissions and a small orange asterisk showing the time Claude Code has spent responding to each prompt (working) and the tokens consumed, right in the stream, are excellent for visibility into costs and activity.

While the visual clarity is high—bolstered by interactive charts and clickable inline links—the discoverability of parallel agent orchestration remains a hurdle.

Despite the promise of “many things in flight,” attempting to run tests across multiple disparate project folders proved difficult, as the current iteration tends to lock the user into a single project focus at a time.

Unlike the terminal CLI version of Claude Code, which defaults to asking the user to start their session in their user folder on macOS, the Claude Code desktop app asks for access to a specific subfolder — which is helpful if you have already started a project, but not necessarily for starting a new one or working on multiple in parallel.

The most effective addition for the “vibe coding” workflow is the integrated preview pane, located in the upper-right corner.

For developers who previously relied on the terminal-only version of Claude Code, this feature eliminates the need to maintain separate browser windows or rely on third-party extensions to view live changes to web applications.

However, the desktop experience is not without friction. The integrated terminal, intended to allow for side-by-side builds and testing, suffered from notable latency, often failing to update in real-time with user input. For users accustomed to the near-instantaneous response of a native terminal, this lag can make the GUI feel like an “overkill” layer that complicates rather than streamlines the dev cycle.

Setting up the new Routines feature also followed a steep learning curve. The interface does not immediately surface how to initiate these background automations; discovery required asking Claude directly and referencing the internal documentation to find the /schedule command.

Once identified, however, the process was remarkably efficient. By using the CLI command and configuring connectors in the browser, a routine can be operational in under two minutes, running autonomously on Anthropic’s web infrastructure without requiring the desktop app to remain active.

The ultimate trade-off for the enterprise user is one of flexibility (standard Terminal/CLI view) versus integrated convenience (new Claude Code desktop app).

The desktop app provides a high-context “Plan” view and a readable narrative of the agent’s logic, which is undeniably helpful for complex, multi-step refactors.

Yet, the platform creates a distinct “walled garden” effect. While the terminal version of Claude Code offers a broader range of movement, the desktop app is strictly optimized for Anthropic’s models.

For the professional coder who frequently switches between Claude and other AI models to work around rate limits or seek different architectural perspectives, this model-lock may be a dealbreaker. For these power users, the traditional terminal interface remains the superior surface for maintaining a diverse and resilient AI stack.

The enterprise verdict

For the enterprise, the Desktop GUI is likely to become the standard for management and review, while the CLI remains the tool for execution.

The desktop app’s inclusion of an in-app file editor and a faster diff viewer—rebuilt for performance on large changesets—makes it a superior environment for the “Review and Ship” phase of development.

It allows a lead developer to review an agent’s work, make spot edits, and approve a PR without ever leaving the application.

Philosophical implications for the future of AI-driven enterprise knowledge work

Anthropic developer Felix Rieseberg noted on X that this version was “redesigned from the ground up for parallel work,” emphasizing that it has become his primary way to interact with the system.

This shift suggests a future where “coding” is less about syntax and more about managing the lifecycle of AI sessions.

The enterprise user now occupies the “orchestrator seat,” managing a fleet of agents that can triage alerts, verify deploys, and resolve feedback automatically.

By providing the infrastructure to run these tasks in the cloud and the interface to monitor them on the desktop, Anthropic is defining a new standard for professional AI-assisted engineering.

AI’s next bottleneck isn’t the models — it’s whether agents can think together

AI agents can connect together, but they cannot think together. That’s a huge difference and a bottleneck for next-gen systems, says Outshift by Cisco’s SVP and GM Vijoy Pandey.

As he describes the current state of AI: Agents can be stitched together in a workflow or plug into a supervisor model — but there’s no semantic alignment, no shared context. They’re essentially working from scratch each go-around. 

This calls for next-level infrastructure, or what Pandey describes as the “internet of cognition.” 

“Agents are not able to think together because connection is not cognition,” he said. “We need to get to a point where you are sharing cognition. That is the greater unlock.”

Creating new protocols to support next-gen agent communication

So what is shared cognition? It’s when AI agents or entities can meaningfully work together to solve for something net new that they weren’t trained for, and do it “100% without human intervention,” Pandey said on the latest episode of Beyond the Pilot.

The Cisco exec analogizes it to human intelligence. Humans evolved over hundreds of thousands of years, first becoming intelligent individually, then communicating on a basic level (with gestures or drawings). That communication improved over time, eventually unlocking a ‘cognitive revolution’ and collective intelligence that allowed for shared intent and the ability to coordinate, negotiate, and ground and discover information. 

“Shared intent, shared context, collective innovation: That’s the exact trajectory that’s playing out in silicon today,” Pandey said. 

His team sees it as a “horizontal distributed assistance problem.” They are pursuing “distributed super intelligence” by codifying intent, context, and collective innovation as a set of rules, APIs, and capabilities within the infrastructure itself. 

Their approach is a set of new protocols: Semantic State Transfer Protocol (SSTP); Latent Space Transfer Protocol (LSTP); and Compressed State Transfer Protocol (CSTP). 

SSTP operates at the language level, analyzing semantic communication so systems can infer the right tool or task. Pandey’s team recently collaborated with MIT on a related piece called the Ripple Effect Protocol.

LSTP can be used to transfer the “entire latent space” of one agent to another, Pandey explained. “Can we just take the KV cache and send it over as an example?” he said. “Because that would be the most efficient way: instead of going through the tax of tokenizing it, going to a natural language, then going back the stack on the other side.” 
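Pandey's KV-cache idea can be illustrated with a toy sketch. This is a conceptual illustration only — the arrays below are mock tensors, not a real model's cache, and LSTP's actual wire format is not public. The point is that latent state can be serialized and restored directly, skipping the detokenize-to-text, re-tokenize round trip he describes as a "tax":

```python
import io
import numpy as np

def export_kv_cache(kv_cache: dict) -> bytes:
    """Serialize an agent's KV cache (mocked here as numpy arrays) for transfer."""
    buf = io.BytesIO()
    np.savez(buf, **kv_cache)
    return buf.getvalue()

def import_kv_cache(payload: bytes) -> dict:
    """Reconstruct the cache on the receiving agent without re-tokenizing."""
    buf = io.BytesIO(payload)
    loaded = np.load(buf)
    return {name: loaded[name] for name in loaded.files}

# Two mock key/value tensors for one layer, shaped (heads, seq_len, head_dim)
cache = {
    "layer0_k": np.random.rand(8, 128, 64).astype(np.float32),
    "layer0_v": np.random.rand(8, 128, 64).astype(np.float32),
}
restored = import_kv_cache(export_kv_cache(cache))
assert all(np.array_equal(cache[k], restored[k]) for k in cache)
```

In a real system the hard parts are exactly what this sketch omits: the two agents must share compatible architectures, tokenizers, and layer layouts for one model's latent state to mean anything to another.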

CSTP handles compression — grounding only the targeted variants while compressing everything else. Pandey says it’s particularly well-suited for edge deployments where you need to send large amounts of state accurately.

Ultimately, Pandey’s team is building a fabric to scale out intelligence and ensure that cognition states are synchronized across endpoints. Further, they are developing what they call “cognition engines” that provide guardrails and accelerate systems. 

“Protocols, fabric, cognition engines: These are the three layers that we are building out in the pursuit of distributed super intelligence,” Pandey said. 

How Cisco solved a big pain point

Stepping back from these advanced, next-level systems, Cisco has achieved tangible results with existing AI capabilities. Pandey described a specific pain point with the company’s site reliability engineering (SRE) team. 

While the team was churning out more and more products and code, it wasn't growing, and it felt mounting pressure to improve efficiency. Pandey introduced AI agents that automated more than a dozen end-to-end workflows, including continuous integration/continuous delivery (CI/CD) pipelines, EC2 instance spin-ups, and Kubernetes cluster deployments. 

Now, more than 20 agents — some built in-house, some third-party — have access to 100-plus tools via frameworks like Model Context Protocol (MCP), while also plugging into Cisco’s security platforms. 

The result: certain deployments dropped from "hours and hours to seconds," and agents have eliminated 80% of the issues the SRE team was seeing within Kubernetes workflows.

Still, as Pandey noted, AI is a tool like any other. “It does not mean that I have a new hammer and I’m just gonna go around looking for nails,” he said. “You still have deterministic code. You need to marry these two worlds to get the best outcome for the problem that you’re solving.”

Listen to the podcast to hear more about: 

  • How we are now enabling a new paradigm of non-deterministic computing. 

  • How Cisco bumped error detection capabilities in large networks from 10% to 100%. 

  • How Pandey named his own AI agent Arnold Layne after an early Pink Floyd song.

  • Why the “internet of cognition” must be an open, interoperable effort. 

  • How Cisco’s open source project Agntcy addresses discovery, identity and access management (IAM), observability, and evaluation.

You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.

Traza raises $2.1 million led by Base10 to automate procurement workflows with AI

For decades, procurement has been the back office that enterprise software forgot. Billions of dollars flow through vendor negotiations, purchase orders, and supplier communications every year at the largest manufacturers and construction companies in the country — and the vast majority of that work still runs on email threads, spreadsheets, and phone calls.

Traza, a newly launched startup headquartered in New York, believes the moment has arrived to change that. The company announced today the close of a $2.1 million pre-seed round led by Base10 Partners, with participation from Kfund, a16z scouts, Clara Ventures, Masia Ventures, and a roster of angel investors including Pepe Agell, who scaled Chartboost to 700 million monthly users before its acquisition by Zynga.

The funding is modest by Silicon Valley standards. But Traza’s pitch is anything but incremental: the company deploys AI agents that don’t just recommend procurement actions — they execute them autonomously, handling vendor outreach, request-for-quote generation, order tracking, supplier communications, and invoice processing without continuous human supervision.

“AI is redesigning the procurement category from the ground up,” said Silvestre Jara Montes, Traza’s CEO and co-founder, in an exclusive interview with VentureBeat. “This wave of AI won’t just build procurement software — it will rebuild how procurement works.”

Why procurement contracts silently lose millions after the ink dries

The market Traza is targeting is enormous and, by the company’s framing, spectacularly underserved. The procurement software market alone exceeds $8 billion and grows at roughly 10% annually. But the real cost sits in the labor — the armies of people, agencies, and ad hoc workarounds required to actually run procurement operations at scale. Most enterprises meaningfully engage with only their top 20% of suppliers. The remaining 80% — the vendor outreach, order tracking, invoice reconciliation, and compliance monitoring — goes largely unmanaged.

Research from World Commerce & Contracting and Ironclad finds that organizations lose an average of 11% of total contract value after agreements are signed, a phenomenon described as “post-signature value leakage.” As Tim Cummins, President of WorldCC, put it: “The research shows that the 11% value gap is not caused by poor negotiation, but by how contracts are managed after signature.” For a large enterprise with $500 million in annual contracted spend, that represents $55 million vanishing each year — not from bad deals, but from the operational void between what gets agreed at the negotiating table and what actually gets executed on the ground. Missed savings, unauthorized changes, and poor renewal planning are responsible for the biggest losses.
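The 11% figure scales linearly with contracted spend, so the article's headline numbers can be checked in a couple of lines (the rate is the WorldCC/Ironclad figure; everything else is simple arithmetic):

```python
LEAKAGE_RATE = 0.11  # post-signature value leakage (WorldCC/Ironclad research)

def annual_leakage(contracted_spend: float, rate: float = LEAKAGE_RATE) -> float:
    """Estimated value lost after signature for a given annual contracted spend."""
    return contracted_spend * rate

# The article's example: $500M in annual contracted spend loses ~$55M/year
assert round(annual_leakage(500_000_000)) == 55_000_000
```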

Jara Montes argues that Traza sits precisely in this gap. “The 11% spans commercial, operational, and compliance leakage. We own the operational layer — and that’s where the most recoverable value sits,” he said. “Supplier tail management that never happens, RFQ processes skipped because someone ran out of bandwidth, invoice discrepancies that slip through unnoticed. That’s where contracts bleed value after signing, and that’s exactly what we automate.” The numbers from Traza’s early deployments, while nascent, are striking: the company claims a 70% reduction in human hours spent on procurement tasks and procurement cycles running three times faster than manual baselines.

How AI agents crossed the line from procurement copilot to autonomous worker

To understand what makes Traza’s approach different, it helps to understand what “AI for procurement” has meant until now. For the past several years, the term largely described dashboards, analytics layers, and recommendation engines that surfaced insights but left every decision and action in a human’s hands. Products from incumbents like SAP Ariba and Coupa — as well as newer entrants like Zip, Fairmarkit, and Tonkean — have layered AI capabilities on top of existing systems of record. But the gap between piloting AI and achieving production-scale impact remains stark, with 49 percent of procurement teams running pilots but only 4 percent reaching meaningful deployment.

Traza’s bet is that 2026 represents an inflection point. AI agents now possess the multi-step reasoning, tool use, and contextual memory required to execute full procurement workflows autonomously — from vendor discovery through invoice processing. The company frames this not as an upgrade to existing procurement software, but as an entirely new product category. “The incumbents built systems of record. They organize procurement data and they’ve never executed procurement work — and their AI additions don’t fundamentally change that,” Jara Montes said. “What they’re shipping is a recommendation layer on the same underlying architecture. A human still has to act on every suggestion. We replace the operational layer entirely.”

Industry data supports the thesis that enterprises are hungry for this shift. According to the 2025 Global CPO Survey from EY, 80 percent of global chief procurement officers plan to deploy generative AI in some capacity over the next three years, and 66 percent consider it a high priority over the next 12 months. A 2025 ABI Research survey found that 76% of supply chain professionals already see autonomous AI agents as ready to handle core tasks like reordering, supplier outreach, and shipment rerouting without human intervention — and early deployments are demonstrably reducing supply chain operational costs by 20 to 35%.

Inside the workflow: what Traza’s AI does and where humans still make the call

In a typical deployment, Traza’s AI agent takes over the operational labor that currently lives in inboxes, spreadsheets, and manual follow-up chains. In a standard RFQ workflow, the agent identifies suitable suppliers, drafts and sends the request for quotes, monitors supplier responses, follows up automatically when responses lag, parses incoming quotes regardless of their format, and builds a structured comparison table ready for a human decision-maker. The key design principle is deliberate: humans remain in the loop at critical junctures.

“At critical steps — approving a purchase order, flagging a compliance issue, committing spend above a threshold — a human is always in the loop,” Jara Montes explained. “That’s not a limitation, it’s the design. It’s how you maintain the auditability enterprises require while moving faster than any manual process could. You earn expanded autonomy over time, as trust is built and results compound.”

When asked about the risk of AI errors — a wrong purchase order or a missed compliance check that could prove costly — Jara Montes was direct: “Anything with meaningful financial or compliance exposure requires human approval before it executes — that’s non-negotiable and baked into the architecture. Below those thresholds, the agent acts autonomously and logs everything.” He added a point that reveals a subtler product insight: “Most procurement operations today are a black box — nobody has a clear picture of what’s happening across the supplier tail. We make it legible.” In other words, the transparency the AI agent provides may itself be a product — giving procurement leaders visibility they have never had into the long tail of supplier relationships that most enterprises simply ignore.
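The thresholding rule Jara Montes describes — autonomous below a financial/compliance exposure line, human approval above it — can be sketched as a simple policy gate. The action names, threshold value, and gated categories below are illustrative assumptions, not Traza's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class ProcurementAction:
    kind: str               # e.g. "send_rfq", "approve_po", "commit_spend" (hypothetical names)
    amount: float           # dollar exposure of the action
    compliance_flag: bool = False

# Hypothetical autonomy ceiling and always-gated action types
SPEND_THRESHOLD = 10_000.0
HUMAN_GATED_KINDS = {"approve_po"}

def route(action: ProcurementAction) -> str:
    """Route an action to a human above the exposure line; execute (and log) below it."""
    if action.compliance_flag or action.kind in HUMAN_GATED_KINDS:
        return "human_approval"
    if action.amount >= SPEND_THRESHOLD:
        return "human_approval"
    return "autonomous"  # executed by the agent, with everything logged for audit

assert route(ProcurementAction("send_rfq", 500.0)) == "autonomous"
assert route(ProcurementAction("commit_spend", 50_000.0)) == "human_approval"
```

The design matches the quote: the gate is a hard architectural rule rather than a model behavior, which is what makes the resulting audit trail defensible.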

How Traza plugs into legacy enterprise systems without ripping them out

One of the recurring challenges for any enterprise AI startup is the integration question: How do you plug into the deeply entrenched, often decades-old technology stacks that large manufacturers and construction companies rely on? Traza’s answer is to sit on top of existing systems rather than replace them. “We connect via API or direct integration into whatever the customer already runs — ERPs, email, supplier portals. We have reach across more than 200 enterprise tools,” Jara Montes said. “We don’t rip out their system, we sit on top of them.”

The go-to-market motion mirrors this pragmatism. Instead of attempting a big-bang deployment, Traza runs a two-to-three-month proof of value focused on a single, specific workflow. Integrations are built at the key steps that matter for that particular use case, then expanded as the scope of the engagement grows. “We don’t try to connect everything upfront — we compound integrations as we expand scope within each account,” Jara Montes said. “And every integration we build compounds across customers too. Each new deployment makes the next one faster.” Throughout the process, the company works side by side with the customer’s team, managing complexity and helping them transition into a new way of operating. It is a notably high-touch approach for a company selling automation.

The company is already working with large manufacturers and construction companies and says they are paying, though it declines to name them publicly. “We want to earn the right to grow inside each account, not land a pilot that goes nowhere,” Jara Montes said. “That’s how you build something that actually sticks in enterprise.”

Traza bets that vertical depth in physical industry will beat horizontal AI platforms

Traza enters a market that is rapidly heating up. The leading AI procurement solutions include platforms from Coupa, Ivalua, SAP Ariba, Zip, Zycus, and Fairmarkit. Keelvar provides autonomous sourcing bots capable of launching RFQs, collecting bids, and recommending optimal awards, while Tonkean offers a no-code orchestration platform using NLP and generative AI to streamline procurement intake and tail-spend management. Against this crowded field, Jara Montes draws a sharp distinction between horizontal automation tools and Traza’s focus on physical industry.

“We’re built specifically for the physical industry, where supplier relationships, compliance requirements, and workflow complexity are categorically different from software procurement,” he said. “A generic agent doesn’t survive contact with how procurement actually works in manufacturing or construction. Specificity is the moat.” The competitive dynamics with major incumbents are perhaps even more consequential. SAP Ariba, Coupa, and their peers have massive installed bases and deep enterprise relationships. Jara Montes frames their AI initiatives as surface-level additions to legacy architectures — but whether Traza can convert that framing into market share at scale, especially given the gravitational pull of existing vendor relationships, remains the central strategic question.

Beneath Traza’s product pitch sits a deeper strategic thesis about compounding data advantages. The company describes a two-layered learning architecture: at the agent level, Traza gets smarter across every deployment by absorbing supplier behavior patterns, RFQ response dynamics, pricing anomalies, and workflow edge cases. At the data level, each customer’s information stays fully isolated. “What we’re building is deep operational knowledge of how procurement actually runs in the physical industry — not how it’s supposed to run according to an RFP, but how it really runs, with all the exceptions and workarounds,” Jara Montes said. “That’s extraordinarily hard to replicate if you’re starting from scratch, and it gets harder to catch up with the more deployments we have.”

Three Spanish founders, one fellowship, and a plan to rewire industrial procurement

Traza was co-founded by three Spanish entrepreneurs — Silvestre Jara Montes, Santiago Martínez Bragado, and Sergio Ayala Miñano — who came to the United States through the Exponential Fellowship, a program that brings Europe’s top technical talent to the U.S. to build companies at the frontier of AI. Their backgrounds span both sides of the problem Traza is trying to solve. Jara Montes worked at Amazon and CMA CGM — one of the world’s largest shipping groups — at the intersection of operations strategy and supply chain optimization. Martínez Bragado built and deployed agentic AI at Clarity AI before joining Concourse (backed by a16z, Y Combinator, and CRV) as Founding AI Engineer. Ayala Miñano comes from StackAI, one of the fastest-growing enterprise AI platforms in San Francisco, where he was a Founding Engineer.

None of the founders carry the title of Chief Procurement Officer, a gap that the company acknowledges has occasionally surfaced in buyer conversations. Jara Montes’s response is characteristically direct: “Our work is the answer. The results we’re generating move that conversation quickly.” He noted that the company has senior procurement leaders serving as advisors who have run procurement at the scale of its target customers.

Base10 Partners, the lead investor, is a San Francisco-based venture capital firm that invests in companies automating sectors of what it calls “the Real Economy.” Its portfolio includes Notion, Figma, Nubank, Stripe, and Aurora Solar. Rexhi Dollaku, General Partner at Base10, framed the investment in emphatic terms: “Supply chain and procurement is one of the largest, most underautomated markets in the Real Economy. AI agents are finally capable of doing the work, not just assisting with it.” The supporting cast of investors reinforces the immigrant-founder narrative. Clara Ventures — founded by the executives behind Olapic’s $130 million exit — specifically invests in driven foreign founders building in the United States, and Agell adds operational credibility from building Chartboost into a $100 million revenue business in under three years as a Spanish founder in Silicon Valley.

Why $2.1 million may stretch further than it looks for an enterprise AI startup

At $2.1 million, this is a deliberately small round for a company selling to large enterprises with notoriously long procurement cycles. Jara Montes argues it goes further than it appears for structural reasons. “We leverage Europe as a tech talent hub, where we have a deep network of exceptional engineers — people who want to work at the frontier of AI but have far fewer opportunities to do so than their US counterparts,” he said. “We’re not just lean — we’re built to outcompete on capital efficiency while others are burning through runway trying to hire in San Francisco.”

The go-to-market motion is designed for speed to revenue. Proofs of value are scoped, time-bounded, and converted to paying partnerships. The company says it is not running 18-month enterprise sales cycles before seeing a dollar. The milestone for the next raise is explicit: more paying customers, meaningfully stronger annual recurring revenue, and a repeatable sales motion that makes the seed round, as Jara Montes put it, “an obvious conversation.”

Looking ahead, he outlined an ambitious three-year target: 20 to 30 large industrial enterprises in the U.S. and Europe running Traza across their procurement operations, with over a billion dollars in procurement spend flowing through the platform. Whether that vision is achievable depends on several interlocking variables — the pace at which AI agent capabilities continue to improve, the speed of enterprise adoption in a traditionally conservative buyer segment, and Traza’s ability to navigate the competitive gauntlet of incumbents adding AI features and well-funded startups attacking adjacent workflows.

But the underlying math may be on Traza’s side. In procurement, the money that disappears does not look like waste. It vanishes into inefficiency, missed obligations, unmanaged risks, and forgotten commitments — the kind of silent losses that no one tracks because no one has the bandwidth to track them. The traditional mandate of procurement, as currently configured, ends where the value gap begins: at signature. Traza is building an AI workforce that picks up where the humans leave off. For an industry that has spent decades losing $55 million at a time to the back office nobody watches, that might be precisely the point.

Anthropic’s Claude Managed Agents gives enterprises a new one-stop shop but raises vendor ‘lock-in’ risk

Anthropic announced a new platform last week, Claude Managed Agents, that aims to cut out the more complex parts of AI agent deployment for enterprises and competes with existing orchestration frameworks.

Claude Managed Agents is also an architectural shift: enterprises, already burdened with orchestrating an increasing number of agents, can now choose to embed the orchestration logic in the AI model layer.

While this comes with some potential advantages, such as speed (Anthropic says its customers can deploy agents in days instead of weeks or months), it also hands more control over the enterprise's AI agent deployments and operations to the model provider — in this case, Anthropic — potentially resulting in greater "lock-in" for the enterprise customer, leaving it more subject to Anthropic's terms, conditions, and any subsequent platform changes.

But maybe that is worth it for your enterprise: Anthropic claims the platform "handles the complexity" by letting users define agent tasks, tools, and guardrails inside a built-in orchestration harness, so they no longer have to manage sandboxed code execution, checkpointing, credential management, scoped permissions, or end-to-end tracing themselves. 

The framework manages state, execution graphs and routing and brings managed agents to a vendor-controlled runtime loop.

Even before the release of Claude Managed Agents, new directional VentureBeat research showed that Anthropic was gaining traction at the orchestration level as enterprises adopted its native tooling. Claude Managed Agents represents a new attempt by the firm to widen its footprint as the orchestration method of choice for organizations.

Anthropic is surging in orchestration interest

Orchestration has emerged as an important segment for enterprises to address as they scale AI systems and deploy agentic workflows. 

VentureBeat directional research of several dozen firms for the first quarter of 2026 found that enterprises mostly chose existing frameworks, such as Microsoft’s Copilot Studio/Azure AI Studio, with 38.6% of respondents in February reporting using Microsoft’s platform. VentureBeat surveyed 56 organizations with more than 100 employees in January and 70 in February.

OpenAI closely followed at 25.7%. Both showed strong growth between the first two months of the year.

Anthropic, driven by increased interest in its offerings, such as Claude Code, over the past year, is putting up a fight. 

Adoption of the Anthropic tool-use and workflows API increased from 0% to 5.7% between January and February. This tracks closely with the growing adoption of Anthropic’s foundation models, showing that enterprises using Claude turn to the company’s native orchestration tooling instead of adding a third-party framework. 

While VentureBeat surveyed before the launch of Claude Managed Agents, we can extrapolate that the new tool will build on that growth, especially if it promises a more straightforward way to deploy agents.

Collapsing the external orchestration layer

Enterprises may find a streamlined, internal harness for agents compelling, but it does mean giving up certain controls.

Session data is stored in a database managed by Anthropic, increasing the risk that enterprises become locked into a system run by a single company. That may be unappealing for firms hoping to move away from the locked-in software-as-a-service (SaaS) applications in their current stacks, a shift many hope AI will facilitate.

The specter of vendor lock-in means agent execution becomes more model-driven rather than directed by the organization; it happens in an environment enterprises don't fully control, and its behavior becomes harder to guarantee.

It also opens the possibility of giving agents conflicting instructions, especially if the only way for users to exert any control over agents is to prompt them with more context.

Agents could have two control planes: one defined by the enterprises’ orchestration system through instructions and the other as an embedded skill from the Claude runtime.

This could pose an issue for highly sensitive and regulated workflows, such as financial analysis or customer-facing tasks. 

Pricing, control and competitive set

Balancing control with ease is one thing; enterprises also consider the cost structure of Claude Managed Agents.

Claude Managed Agents introduces a hybrid pricing model that blends token-based billing with a usage-based runtime fee.

This makes Managed Agents more dynamic, though less predictable, when determining cost structures. Enterprises will be charged a standard rate of $0.08 per hour when agents are actively running.

For example, at $0.70 per hour, a one-hour session could cost up to $37 to process 10,000 support tickets, depending on how long each agent runs and how many steps it takes to complete a task.

Microsoft, currently the leader according to VentureBeat’s directional survey, offers several orchestration offerings. Copilot Studio uses a capacity-based billing structure, so enterprises pay for blocks of interactions between users and agents rather than the number of steps an agent takes.

Microsoft’s approach tends to be more predictable than Anthropic’s pricing plan: Copilot Studio starts at $200 per month for 25,000 messages.

Compared with similar competitors like OpenAI's Agents SDK, the picture becomes murkier. The Agents SDK is technically free to use as an open-source project; however, OpenAI bills for the underlying API usage. Agents built and orchestrated with the Agents SDK using GPT-5.4, for example, will cost $2.50 per 1 million input tokens and $15 per 1 million output tokens.
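The three billing models compared above can be put side by side using only the rates quoted in this article; the workload sizes in the example at the bottom (agent-hours, message counts, token counts) are hypothetical assumptions for illustration, and the Anthropic figure covers only the runtime fee, since token billing applies on top:

```python
def claude_runtime_fee(hours: float, rate_per_hour: float = 0.08) -> float:
    """Anthropic's usage-based runtime fee; token-based billing is charged separately."""
    return hours * rate_per_hour

def copilot_studio_cost(messages: int) -> float:
    """Capacity-based: $200/month covers a block of 25,000 messages."""
    blocks = -(-messages // 25_000)  # ceiling division: you buy whole blocks
    return blocks * 200.0

def agents_sdk_cost(input_tokens: int, output_tokens: int) -> float:
    """The SDK itself is free; you pay the quoted GPT-5.4 API rates."""
    return input_tokens / 1e6 * 2.50 + output_tokens / 1e6 * 15.00

# Hypothetical month: 100 agent-hours, 50,000 Copilot messages,
# 10M input + 1M output tokens through the Agents SDK
print(claude_runtime_fee(100))                 # 8.0 (runtime fee only)
print(copilot_studio_cost(50_000))             # 400.0
print(agents_sdk_cost(10_000_000, 1_000_000))  # 40.0
```

The comparison makes the predictability trade-off concrete: Copilot's cost steps in fixed blocks regardless of agent behavior, while the other two scale with how long agents run and how many tokens they consume.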

The enterprise decision

Claude Managed Agents does give a reprieve to enterprises that find deploying production agents too complicated. It reduces their engineering overhead while adding speed and simplicity in a fast-changing enterprise environment. 

But that comes with a choice: lose control, observability and portability and risk further vendor lock-in.

Anthropic just made a case for its ecosystem becoming not just the foundation model of choice for enterprises, but also their orchestration infrastructure. The onus is now on enterprises to weigh that ease against reduced control. 

Google leaders including Demis Hassabis push back on claim of uneven AI adoption internally

A viral post on X from veteran programmer and former Google engineer Steve Yegge set off a rhetorical firestorm this week, drawing sharp public rebuttals from some of Google’s most prominent AI leaders and reopening a sensitive question for the company: how deeply are its own engineers really using the latest generation of AI coding tools?

The debate began after Yegge summarized what he said was the view of his friend, a current and longtime Google employee (or Googler), who claimed the Gemini maker's internal AI adoption looks much more ordinary and less cutting-edge than outsiders might expect.

Yegge said his Googler friend claimed Google engineering mirrors an "average" industry pattern of a 20%-60%-20% split: a small group of outright AI refusers (20%), a much larger middle still relying mainly on simpler chat and coding-assistant workflows (60%), and another small group of AI-first, cutting-edge engineers using agentic tools extensively and mastering them (20%).

A VentureBeat search of X using its parent company’s AI assistant Grok found that Yegge’s April 13 post spread quickly, topping 4,500 likes, 205 quote posts, 458 replies and 1.9 million views as of April 14.

We’ve reached out to Google for comment on the claims and will update when we receive a response.

A veteran, outspoken Googler voice

Why did the opinion of Yegge’s unnamed Googler friend land so hard? In part because Yegge is not just another commentator taking shots from the sidelines.

He spent about 13 years at Google after earlier stints at Amazon and GeoWorks, later joined Grab, and then became head of engineering at Sourcegraph in 2022. He has long been known in software circles for widely read essays on programming and engineering culture, and for an earlier internal Google memo that accidentally became public in 2011 and drew broad media attention.

That history helps explain why engineers and executives still take his critiques seriously, even when they reject them.

Yegge has built a reputation over many years as a blunt insider-outsider voice on software culture, someone with enough standing in the industry that his judgments can travel fast, especially when they touch nerves inside big technology companies.

Wikipedia’s summary of his career notes his long Google tenure and the outsized attention his blog posts and prior Google critiques have received.

Unpacking Yegge’s friend’s argument

In this case, Yegge’s argument was not simply that Google uses too little AI. It was that the company’s adoption may be uneven, culturally constrained and less transformed than its branding implies.

His friend reportedly argued that some Googlers could not use Anthropic's Claude Code because it was framed as "the enemy," and that Gemini was not yet sufficient for the fullest agentic coding workflows. He contrasted Google with what he described as a smaller set of companies moving much faster.

Pushback from Hassabis and current Googlers

The first major pushback came from Demis Hassabis, the co-founder and CEO of Google DeepMind, who replied directly and forcefully. “Maybe tell your buddy to do some actual work and to stop spreading absolute nonsense. This post is completely false and just pure clickbait,” Hassabis wrote.

Other Google leaders followed with lengthier defenses.

Addy Osmani, a director at Google Cloud AI, wrote that Yegge’s account “doesn’t match the state of agentic coding at our company.” He added, “Over 40K SWEs use agentic coding weekly here.”

Osmani said Googlers have access to internal tools and systems including “custom models, skills, CLIs and MCPs,” and pushed back on the idea that Google employees are sealed off from outside models, writing that “folks can even use @AnthropicAI’s models on Vertex” and concluding that “Google is anything but average.”

Other current Google employees reinforced that message. Jaana Dogan, a software engineer at Google, wrote in a quote tweet: “Everyone I work with uses @antigravity like every second of the day,” later following up with another X post stating: “Unpopular opinion: If you think tokens burned is a productivity metric, no one should take you seriously. Imagine you are a top 0.0001% writer and they are only counting the tokens you produce.”

Paige Bailey, a DevX engineering lead at Google DeepMind, said teams had agents “running 24/7.”

Several other Google and DeepMind figures also challenged Yegge’s characterization, some disputing the factual basis of his claims and others suggesting he lacked visibility into current internal usage.

Yegge’s rebuttal

Yegge, for his part, did not retreat. In a follow-up to Hassabis, he wrote, “I’m not trying to misrepresent anyone,” but argued that by his own standard for advanced AI adoption, Google still does not appear to be doing especially well.

He pointed to token usage and the replacement of older development habits with truly agentic workflows as the more meaningful benchmark, and said he would be willing to retract his criticism if Google could show its engineers were operating at that level.

AI adoption vs. AI transformation

That leaves the core dispute unresolved, but clearer. This is less a fight over whether Google engineers use AI at all than a fight over what should count as meaningful adoption.

Googlers are pointing to scale, weekly usage and the availability of internal and external tools. Yegge is arguing that those measures may capture broad exposure without proving a deeper change in how engineering work gets done, a genuine AI transformation. The clash reflects a wider industry split between visible usage metrics and more transformative, power-user behavior.

For Google, the subject is especially sensitive. Yegge has criticized the company before, including in a 2018 essay explaining why he left, where he argued Google had become too risk-averse and had lost much of its ability to innovate.

If his latest critique had come from a lesser-known poster, it might have faded. Coming from a former longtime Google engineer with a record of memorable public criticism, it instead drew direct responses from some of the company’s top AI figures — and turned a single post into a broader public argument about whether Google’s AI leadership is as deep internally as it looks from the outside.