Perplexity AI, the fast-growing search startup now valued at $20 billion, unveiled what it calls the first hybrid local-server inference orchestrator at Computex 2026 on Monday night, demonstrating software that autonomously decides — in real time and mid-task — which AI workloads stay on a user’s device and which get routed to frontier models in the cloud.
CEO Aravind Srinivas demonstrated the system onstage alongside Intel CEO Lip-Bu Tan during Intel’s keynote address, using Perplexity’s “Personal Computer” agent to process confidential deal materials. In the demonstration, local models running on Intel Core Ultra Series 3 processors determined which information should remain on the device and which could be sent to cloud-based models. Srinivas said the approach balances intelligence, accuracy, privacy, and cost.
The key claim is not that a model can run locally — dozens of tools already do that. It is that Perplexity’s system makes the routing decision itself, task by task, without requiring the user to choose in advance. Sensitive data like financial records or health information stays on the local machine; the heavier reasoning tasks that require frontier-scale models get sent to the cloud. One task, multiple execution locations, automatic orchestration.
“No product has done this before,” a Perplexity spokesperson said in an email to VentureBeat. The product is not yet available to users; according to the company, the hybrid inference feature will launch in the coming weeks.
To understand why the Computex demonstration matters, it helps to trace the product arc Perplexity has been building since early this year.
On February 25, Perplexity launched Computer, a multi-model AI agent that orchestrates 19 different AI models to complete complex, long-running tasks on behalf of users. The system ran entirely in the cloud, breaking goals into subtasks and routing each to whichever model (Claude, Gemini, GPT, Grok, or others) was best suited for the job. Perplexity pitched Computer as unifying current AI capabilities in a single system: a general-purpose digital worker that operates the same interfaces a user does.
Then, in March, Perplexity introduced Personal Computer at its inaugural Ask 2026 developer conference. That product launched as a new Mac app with support for a hybrid local-cloud AI agent, which Perplexity described as a “personal orchestrator” that hybridizes local and server environments for security and productivity. Personal Computer could access the Mac’s file system and native Mac apps to create and execute entire workflows, with files created in a secure sandbox and all actions auditable and reversible.
What Srinivas demonstrated at Computex extends this architecture in a fundamental way. Previously, even the Personal Computer product divided labor along relatively clear lines: local file access on the device, heavy computation on Perplexity’s servers.
The new hybrid inference orchestrator gives the system itself the ability to reason about where each piece of a task should execute — not just which model to use, but which physical location should process it. The system reportedly asks for user permission before sending sensitive tasks to the cloud, a design choice that addresses one of the central anxieties enterprises have about agentic AI: data governance.
The timing of the demonstration is not coincidental. Computex 2026 has been dominated by a single theme: on-device AI. Just hours before the Intel keynote, Nvidia CEO Jensen Huang unveiled the RTX Spark, a new Arm-based superchip that the company positions as the foundation for a new generation of AI-native Windows PCs.
At full strength, the RTX Spark Superchip offers up to 20 Arm CPU cores, a Blackwell GPU with 6,144 CUDA cores, 128GB of LPDDR5X RAM, and up to 300 GB/s of memory bandwidth — enough power and memory for AI agents and 120-billion-parameter models with context lengths stretching to a million tokens. RTX Spark systems will begin arriving in the fall.
Intel, not to be outdone, used its keynote to showcase Xeon 6+ processors with 288 efficiency cores built on 18A technology for the data center, and positioned its Core Ultra Series 3 as the client silicon that makes hybrid inference possible on the PC.
Perplexity’s hybrid orchestrator sits at the intersection of both strategies. If the system performs as advertised, it creates a direct economic incentive for users — and eventually enterprises — to invest in more powerful local silicon. The more capable the on-device chip, the more inference can run locally, reducing cloud costs and improving latency for sensitive workloads. That dynamic benefits Nvidia, Intel, and every other chipmaker competing for AI PC sockets.
The implications extend well beyond chip economics. “As chips become more powerful, more intelligence moves onto a person’s machine, alongside server inference for the complex tasks that still need frontier models,” a Perplexity spokesperson told VentureBeat. “Sensitive and sovereign work can stay local, which changes the need for massive country-level infrastructure.”
That last claim — about sovereign infrastructure — is the most provocative. Nations from the UAE to France to India have been investing billions in domestic AI compute capacity partly on the assumption that sensitive data must stay within their borders, which means building or buying access to local data centers. If meaningful inference can run on an end user’s device with no data leaving the machine, the calculus changes. It does not eliminate the need for data centers, but it could soften the urgency of the buildout.
Perplexity’s hybrid inference play rests on the same architectural bet the company has been making all year: that the orchestration layer matters more than any individual model. For AI engineers, that signals a fundamental shift in emphasis, away from picking a single best model and toward designing the system that coordinates many of them.
The key insight is separation of concerns: the orchestration layer handles task decomposition, state management, and tool coordination, while the model layer handles specific computations. This decoupling means teams can swap models as better alternatives emerge without redesigning the entire system.
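To make that decoupling concrete, here is a minimal sketch in Python. It is an illustration only, not Perplexity’s implementation: the `Orchestrator` class, the role names, and the stub models are all hypothetical. The orchestrator owns the plan and the task state, and it knows each model only by the role it plays, so a model can be swapped without touching anything else.

```python
from typing import Callable, Dict, List

# A "model" here is just a function from prompt text to response text.
Model = Callable[[str], str]


class Orchestrator:
    """Owns the plan and the task state; knows models only by role name."""

    def __init__(self, models: Dict[str, Model]) -> None:
        self.models = dict(models)
        self.history: List[str] = []  # state kept by the orchestrator

    def swap(self, role: str, model: Model) -> None:
        """Replace the model behind a role without touching the plan."""
        self.models[role] = model

    def run(self, goal: str) -> str:
        # Task decomposition, reduced to a fixed two-step plan for brevity.
        result = goal
        for role in ("summarize", "analyze"):
            result = self.models[role](result)
            self.history.append(f"{role}: {result}")
        return result


# Stub models standing in for real ones (hypothetical).
orchestrator = Orchestrator({
    "summarize": lambda text: f"summary({text})",
    "analyze": lambda text: f"analysis({text})",
})
first = orchestrator.run("deal memo")

# A better analysis model appears: swap it in, leave everything else alone.
orchestrator.swap("analyze", lambda text: f"analysis_v2({text})")
second = orchestrator.run("deal memo")
```

The plan and the history log never change when the model behind a role does, which is the practical payoff of keeping the two layers separate.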
Perplexity has leaned heavily into this philosophy. The company is doubling down on packaging frontier models in a consumer-friendly user experience, arguing that there is value in orchestrating multiple third-party LLMs to obtain the most cost-effective and accurate answers to queries. Models, in Perplexity’s view, are specializing, not commoditizing.
The hybrid inference extension takes that logic one step further. Perplexity is now orchestrating not just across models but across physical compute locations — choosing which model runs where. A lightweight local model might handle a privacy-sensitive document summarization task while a frontier cloud model tackles the complex reasoning required to analyze that summary against a broader market landscape. The orchestrator manages the handoff.
This is a technically ambitious claim. Making it work reliably in production will require the orchestrator to accurately assess the complexity of each subtask, understand the sensitivity of the data involved, know the capabilities and latency characteristics of whatever local hardware the user has, and manage the state of a task that may be bouncing between environments mid-execution.
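A rough sketch shows how those inputs could combine into a single routing decision. Perplexity has not published its routing logic, so everything here is an assumption made for illustration: the `sensitive` flag, the 0-to-1 `complexity` score, the per-device `max_local_complexity` ceiling, and the `cloud_allowed_for_sensitive` permission switch are all hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Location(Enum):
    LOCAL = "local"
    CLOUD = "cloud"


@dataclass(frozen=True)
class Subtask:
    name: str
    sensitive: bool    # hypothetical flag: does the input contain private data?
    complexity: float  # hypothetical score in [0, 1]


@dataclass(frozen=True)
class Device:
    max_local_complexity: float  # hypothetical ceiling for the local model


def route(task: Subtask, device: Device,
          cloud_allowed_for_sensitive: bool = False) -> Location:
    """Pick an execution location for one subtask."""
    fits_locally = task.complexity <= device.max_local_complexity
    if task.sensitive:
        # Sensitive data leaves the device only if the local model cannot
        # cope AND the user has explicitly granted permission.
        if fits_locally or not cloud_allowed_for_sensitive:
            return Location.LOCAL
        return Location.CLOUD
    # Non-sensitive work runs locally when it can, in the cloud otherwise.
    return Location.LOCAL if fits_locally else Location.CLOUD


laptop = Device(max_local_complexity=0.4)
memo = Subtask("summarize deal memo", sensitive=True, complexity=0.3)
market = Subtask("compare against market", sensitive=False, complexity=0.9)
hard_private = Subtask("model the full deal", sensitive=True, complexity=0.9)

memo_location = route(memo, laptop)
market_location = route(market, laptop)
denied_location = route(hard_private, laptop)
approved_location = route(hard_private, laptop, cloud_allowed_for_sensitive=True)
```

Even this toy version exposes the hard part: a hard, sensitive subtask without user permission stays on a local model that may not be up to it, and every branch depends on the sensitivity and complexity estimates being right in the first place.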
It is easy to imagine edge cases where the routing logic fails, sends something sensitive to the cloud, or degrades performance by assigning a task to an underpowered local model. Perplexity says the system will be chip-agnostic, though the initial Computex demo ran on Intel silicon. The company expressed enthusiasm in its communications about the new AI chips announced at Computex this week, suggesting it intends to optimize across vendors.
The hybrid inference announcement arrives at a complicated moment for Perplexity. The company has been on a remarkable growth trajectory: It secured $200 million in new capital at a $20 billion valuation, just two months after raising $100 million at an $18 billion valuation. Since its founding three years ago, the rapidly growing AI company has raised $1.5 billion in total funding, according to PitchBook data.
But the company also faces a mounting stack of legal challenges. Nine organizations have filed active suits against Perplexity for alleged copyright and trademark infringement as of May 31, 2026: CNN, the New York Times, News Corp and Dow Jones, the New York Post, the Chicago Tribune, Encyclopedia Britannica, Merriam-Webster, Reddit, and Japan’s Yomiuri Shimbun. The CNN lawsuit, filed just days ago on May 28, is the most recent, accusing Perplexity of scraping more than 17,000 CNN stories, photos, videos, and other content and using that material to train its products. Perplexity has responded with a consistent message. “You can’t copyright facts,” the company’s chief communications officer Jesse Dwyer said in a statement.
Other publishers have opted for partnership over litigation. Time, Gannett, Le Monde, and Der Spiegel have signed licensing arrangements with Perplexity. The company launched a Publishers Program in mid-2024 in which participating outlets receive a share of revenue generated when their content is cited in Perplexity answers.
According to CNBC, Perplexity’s chief business officer Dmitry Shevelenko confirmed at the time that publishers receive a flat, double-digit percentage of that revenue but declined to share specifics. As TechCrunch reported in December 2024, additional publishers including the LA Times, Adweek, The Independent, and Lee Enterprises subsequently joined the program, though not without internal controversy: reporters at some outlets told TechCrunch they were not informed of the deals before they were announced publicly.
The legal risk is not existential, but it is material, and with enterprises increasingly evaluating Perplexity’s tools for sensitive workflows — precisely the use case the hybrid inference system is designed to serve — unresolved intellectual property questions could dampen adoption.
The hybrid inference demo should be read alongside Perplexity’s broader push into enterprise software, a transformation that accelerated dramatically this year. At the Ask 2026 developer conference in March, VentureBeat reported that Perplexity announced Computer for Enterprise, positioning the three-year-old startup as a direct competitor to Microsoft, Salesforce, and the legacy enterprise software stack.
Beyond Computer’s existing 100-plus integrations, enterprise customers gained access to business-grade connectors for Snowflake, Datadog, Salesforce, SharePoint, and HubSpot, with administrators able to install custom connectors via the Model Context Protocol. The package also includes purpose-built workflow templates for legal contract review, finance audit support, sales call preparation, and customer support ticket triage, alongside SOC 2 Type II certification and the option for zero data retention.
Hybrid inference deepens this enterprise pitch considerably. For regulated industries — financial services, healthcare, defense, legal — the ability to keep sensitive data on a local device while still accessing the reasoning power of frontier cloud models is not a nice-to-have. It is a potential compliance requirement.
An investment bank parsing confidential deal documents, for instance, might be unable to send those materials to a third-party cloud under existing data handling agreements. A system that can run the sensitive parsing locally while routing non-sensitive analytical tasks to the cloud offers a middle path. IDC forecasts a tenfold increase in agent usage and a thousandfold growth in inference demands by 2027, and security and governance rank as the top evaluation factor for enterprise agentic platforms, according to a CrewAI survey. Hybrid inference speaks directly to that priority.
Several questions will determine whether Perplexity’s Computex demonstration becomes a landmark product or a compelling prototype.
The actual performance characteristics remain untested outside a controlled stage environment — how the routing logic handles varied hardware configurations, unreliable network connections, and ambiguous data sensitivity classifications is an open question.
The competitive response matters too: Google, Microsoft, Apple, and OpenAI are all building their own local-cloud AI architectures. Apple Intelligence already routes some tasks locally and some to Private Cloud Compute servers, Google’s Gemini Nano runs on-device, and Microsoft’s Copilot+ PCs are designed around local inference capabilities. None of these systems, however, currently offer the kind of dynamic, autonomous task-level routing Perplexity claims.
Even if the technology works as demonstrated, there is the question of whether the business can keep pace with the ambition. At a $20 billion valuation with approximately $200 million in annual recurring revenue, Perplexity trades at roughly 100x revenue, a premium requiring aggressive growth to justify. Management’s $656 million 2026 revenue target implies 230% growth, creating significant execution pressure.
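The arithmetic behind those two figures is simple enough to check. The short Python calculation below uses only the numbers cited above; the variable names are ours.

```python
valuation = 20_000_000_000      # $20 billion valuation
arr = 200_000_000               # roughly $200 million annual recurring revenue
revenue_target = 656_000_000    # management's 2026 revenue target

# Valuation divided by revenue gives the revenue multiple.
revenue_multiple = valuation / arr

# Growth needed to get from current revenue to the target, in percent.
implied_growth_pct = (revenue_target / arr - 1) * 100

print(f"{revenue_multiple:.0f}x revenue, {implied_growth_pct:.0f}% implied growth")
```

The exact result is 228% growth, which rounds to the 230% figure above.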
Perplexity has built its business on a bet that the future belongs not to any single model but to the system that orchestrates all of them. At Computex, it extended that bet from the software layer to the physical layer — from which model to which machine. In the AI industry’s relentless race to build bigger data centers and train larger models, Perplexity just argued that the most important computer in the stack might be the one already sitting on your desk.
Microsoft on Monday unveiled the Surface RTX Spark Dev Box, a compact desktop computer designed to let software developers run large AI models on their desks instead of paying for cloud computing — a move that directly challenges the per-token pricing model that has defined the AI industry’s economics since ChatGPT launched three and a half years ago.
The device, announced at Microsoft Build 2026, packs Nvidia’s new Blackwell-architecture RTX Spark processor and 128 gigabytes of unified memory into a small-form-factor chassis, delivering what Nvidia rates at one petaflop of AI compute. In practical terms, that means a developer can load, run and interact with AI models exceeding 120 billion parameters without sending a single API call to the cloud.
“These class of devices, we think, will get to about 100 billion parameter model running,” Pavan Davuluri, Microsoft’s executive vice president of Windows and Devices, said during a press briefing ahead of the event. He emphasized that raw model size is only part of the equation: “The model size is one thing, but for the model to be effective, it kind of needs to be able to have enough context, because a larger model, you feed it larger context.” At 100,000 tokens of context, he noted, the key-value cache alone can consume 40 to 50 gigabytes of memory — which is precisely why Microsoft and Nvidia engineered the device around a 128-gigabyte unified memory pool shared dynamically between the CPU and GPU.
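A back-of-the-envelope calculation shows where a number of that size comes from. The standard estimate for a transformer’s key-value cache is two tensors (keys and values) per layer, each holding one vector per attention head per token. The model shape below is a hypothetical configuration chosen for illustration, not the specification of any model Microsoft or Nvidia named.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2) -> int:
    """Estimate KV-cache size for one sequence.

    The leading 2 counts one key tensor plus one value tensor per layer.
    bytes_per_value=2 assumes 16-bit values.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Hypothetical large-model shape: 96 layers, 8 key-value heads of width 128.
cache = kv_cache_bytes(layers=96, kv_heads=8, head_dim=128, tokens=100_000)
cache_gb = cache / 1e9

print(f"{cache_gb:.1f} GB of KV cache at 100,000 tokens")
```

Under these assumptions the cache comes to about 39 gigabytes for a single 100,000-token sequence, close to the low end of the range Davuluri cited. More layers or more key-value heads push it higher, and that memory is needed on top of the model weights themselves.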
The machine will be available later this year in the United States, sold exclusively through Microsoft.com. The company did not disclose pricing.
The Surface RTX Spark Dev Box arrives at a moment when the economics of AI development have become a boardroom-level concern. Companies large and small are grappling with cloud GPU bills that scale unpredictably: every fine-tuning run, every inference call, every agentic workflow that loops through a frontier model accumulates cost. For a developer iterating rapidly on a prototype — running the same model dozens or hundreds of times a day — those charges compound fast.
Microsoft is framing the Dev Box as a release valve for that pressure. Andrew Hill, corporate vice president of Surface, wrote in the announcement blog post that the device “changes that equation” by letting developers “reserve frontier model calls for truly frontier problems and handle the rest on their own hardware.” The pitch is not that cloud computing is obsolete, but that much of the work currently being sent to remote data centers does not require state-of-the-art models and would be better served by capable local hardware with predictable, fixed costs.
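The fixed-versus-metered trade-off can be framed as a break-even calculation. Microsoft has not disclosed the price of the Dev Box, so the dollar figures in this Python sketch are placeholders chosen purely for illustration, not quoted prices, and the calculation ignores electricity, depreciation, and any quality difference between local and cloud models.

```python
def break_even_tokens(hardware_cost: float, cloud_price_per_million: float) -> float:
    """Tokens at which cumulative cloud spend equals a one-time hardware cost."""
    return hardware_cost * 1_000_000 / cloud_price_per_million


# Placeholder inputs: NOT real prices for the Dev Box or any cloud API.
hypothetical_hardware_cost = 3_000.0   # dollars, one-time
hypothetical_cloud_price = 2.0         # dollars per million tokens

tokens = break_even_tokens(hypothetical_hardware_cost, hypothetical_cloud_price)
print(f"Break-even after {tokens / 1e9:.1f} billion tokens")
```

With these placeholder inputs the hardware pays for itself after 1.5 billion tokens. The useful part is the shape of the formula rather than the output: the cheaper the cloud price per token, the longer a fixed-cost machine takes to justify itself.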
This is a significant strategic shift for Microsoft, a company that derives tens of billions of dollars in annual revenue from Azure cloud services. By selling hardware that explicitly reduces customers’ cloud dependency, Microsoft is acknowledging a tension that has been building across the industry: the marginal cost of AI inference at scale is unsustainable for many teams, and the market is demanding alternatives. The bet appears to be that developers who prototype locally will still deploy to Azure when they need to scale — and that owning both ends of that workflow is more valuable than owning only the cloud.
The technical architecture of the Dev Box reflects a set of deliberate engineering choices aimed at sustained, not peak, performance — a distinction that matters enormously for AI workloads that can run for hours.
At the center is Nvidia’s RTX Spark system-on-chip, which combines an ultra-efficient Arm-based CPU with a Blackwell-generation RTX GPU. In a traditional Windows PC, Davuluri explained during the briefing, this configuration would require four separate components: a CPU, a discrete GPU, dedicated graphics memory and system RAM. The RTX Spark collapses all of that into a single chip paired with a single unified memory pool.
That unification is the critical design decision. Conventional gaming laptops with high-end Nvidia GPUs top out at roughly 24 gigabytes of GPU-accessible memory. The Dev Box’s 128 gigabytes of unified memory — accessible to both the CPU and GPU through what Nvidia calls its Unified Memory Access architecture — is what makes it possible to load models that would otherwise require cloud GPU instances with specialty high-bandwidth memory configurations.
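The gap between 24 and 128 gigabytes is easiest to see by estimating how much memory a model’s weights alone occupy at different numeric precisions. The Python sketch below is a simplification under stated assumptions: it counts weights only, ignoring the key-value cache, activations and operating system overhead, and the precision levels are common choices rather than anything Microsoft or Nvidia specified for the Dev Box.

```python
def weight_gb(params_billions: float, bits_per_weight: int) -> float:
    """Memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: int, memory_gb: float) -> bool:
    """True if the weights alone fit in the given memory pool."""
    return weight_gb(params_billions, bits_per_weight) <= memory_gb


# A 120-billion-parameter model at three common precisions.
for bits in (16, 8, 4):
    size = weight_gb(120, bits)
    print(f"{bits}-bit: {size:.0f} GB, "
          f"fits in 24 GB: {fits(120, bits, 24)}, "
          f"fits in 128 GB: {fits(120, bits, 128)}")
```

By this estimate a 120-billion-parameter model needs 240 gigabytes at 16-bit precision, 120 at 8-bit and 60 at 4-bit. None of those fit in 24 gigabytes, and only the reduced-precision versions fit in 128, which suggests that running models of that size locally depends on quantized weights.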
Microsoft did substantial work at the operating system level to exploit this architecture. The company implemented new memory management logic in Windows that raises the ceiling on how much system memory the GPU can address, introduces smarter page-size allocation for shared memory regions and ensures that heavy GPU workloads do not starve the CPU of the resources it needs for multitasking. The Windows scheduler was also optimized for RTX Spark’s heterogeneous core layout, routing demanding workloads to performance cores while keeping efficiency cores available for background tasks.
The thermal design is equally deliberate. The Dev Box operates within an approximately 100-watt sustained thermal envelope — modest by desktop standards, but meaningful for a device intended to run training jobs and inference workloads continuously. The aluminum chassis itself is engineered to function as a passive heatsink, and the method Microsoft used to build it is among the most striking details about the machine.
The top panel is manufactured using metal 3D printing, a process that enables internal geometries too complex for conventional CNC machining or injection molding. The perforations are not simple through-holes; they are angled in multiple directions around the internal fan to optimize airflow from cold-air intake through heat dissipation. During the press briefing, Harry, a Surface industrial designer, explained the rationale: “The complexity is something other manufacturers wouldn’t be able to do, like CNC, or like any molding, because of the complexity of shape.”
When asked whether 3D printing would constrain mass production, the designer acknowledged the challenge but suggested Microsoft had developed a process robust enough to scale. The result is a machine that runs quietly enough for an open office while sustaining the kind of continuous GPU workloads that would throttle most conventional desktops of similar size. For a device that Microsoft expects developers to leave running overnight on fine-tuning jobs, quiet sustained performance is not a luxury — it is a requirement.
Microsoft is shipping the Dev Box with Windows 11 Pro pre-configured at the image level for development work — a detail that sounds minor but reflects a growing recognition that the out-of-box experience for developer hardware has historically been poor.
The machine boots into a dark theme with a simplified taskbar, widgets removed and Do Not Disturb enabled. Developer Mode is turned on. PowerShell 7 is the default shell. WSL 2 — the Windows Subsystem for Linux — comes pre-installed with GPU passthrough and CUDA support already configured. Visual Studio Code, GitHub Copilot, Git, Python and Node.js are all installed and ready.
“We’ve said, ‘Hey, you know what, we got you, you want to go fast,’” a Microsoft engineer who demonstrated the configuration during the briefing told VentureBeat. The philosophy, he explained, is that developers were going to install all of these tools anyway; the friction was in the hours of setup and configuration that stood between unboxing a machine and writing the first line of code.
The Dev Box also ships with integration points across Microsoft’s AI stack: AI Toolkit for VS Code for model conversion and fine-tuning, Windows ML and Windows Copilot Runtime for local inference, and Microsoft Foundry for connecting local prototypes to cloud deployment pipelines. For enterprises, the device integrates with Entra ID and Intune for identity and device management, and includes Secured-core PC architecture, BitLocker encryption and Microsoft Defender.
The most obvious competitive comparison is Apple’s Mac Mini, which has dominated the compact-desktop category and has been widely adopted by developers drawn to Apple Silicon’s unified memory architecture and power efficiency.
Davuluri addressed the comparison directly during the briefing, saying the Dev Box is “in a different class of performance than Mac Minis, intentionally.” He declined to share specific benchmarks, noting that detailed specifications and performance targets would come closer to the fall launch. But the architectural advantage Microsoft is claiming is clear: while the current Mac Mini with M4 Pro tops out at 48 gigabytes of unified memory and the M4 Max configuration reaches 128 gigabytes, the RTX Spark Dev Box pairs its 128 gigabytes with a Blackwell-class GPU that has a fundamentally different CUDA-based compute model — one that the vast majority of the AI/ML ecosystem’s tooling (PyTorch, TensorRT, llama.cpp, Hugging Face frameworks) is already optimized for.
That CUDA ecosystem advantage is difficult to overstate. While Apple’s Metal framework has made progress, the overwhelming majority of AI training and inference frameworks are built and tested first against Nvidia’s CUDA stack. A developer running models on the Dev Box can use the same code, the same libraries and the same workflows they would use on a cloud GPU instance — a level of portability that Apple Silicon cannot currently match.
The Dev Box is one piece of a three-tier hardware strategy Microsoft laid out at Build. The Surface Laptop Ultra, announced days earlier at Computex, brings the same RTX Spark silicon into a 15-inch laptop form factor for developers and creators who need portability. At the other end of the spectrum, the DGX Station for Windows — built on Nvidia’s GB300 Grace Blackwell Ultra Superchip — targets organizations that need to run frontier models up to one trillion parameters on a deskside system. That machine is expected in the fourth quarter of this year.
The three devices map to a tiered computing model that Microsoft is calling “unmetered intelligence”: small on-device language models (the company’s new Aion 1.0 family) handle lightweight tasks at zero marginal cost; RTX Spark-class hardware runs mid-range models locally for the bulk of development work; and cloud resources are reserved for genuinely frontier-scale problems.
The GitHub Copilot CLI is getting a concrete implementation of this model with a new feature called /fleet, which allows a cloud-based primary agent to build a plan, assess the complexity of each task and route appropriate subtasks to a local model running on the developer’s hardware. The cloud agent handles what requires frontier capability; the local model handles what does not. The result, in theory, is lower cost without lower quality.
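The tiering idea can be sketched in a few lines of Python. This is not how /fleet is implemented, which Microsoft did not detail; the complexity scores, the 0.3 and 0.7 thresholds, and the function names are all assumptions made for illustration.

```python
from enum import Enum
from typing import Dict


class Tier(Enum):
    ON_DEVICE_SMALL = "small on-device model"
    LOCAL_MIDRANGE = "mid-range local model"
    CLOUD_FRONTIER = "cloud frontier model"


def pick_tier(complexity: float) -> Tier:
    """Map a hypothetical complexity score in [0, 1] to a tier."""
    if complexity < 0.3:
        return Tier.ON_DEVICE_SMALL
    if complexity < 0.7:
        return Tier.LOCAL_MIDRANGE
    return Tier.CLOUD_FRONTIER


def route_plan(plan: Dict[str, float]) -> Dict[str, Tier]:
    """Assign every subtask in a plan to the cheapest tier that can handle it."""
    return {name: pick_tier(score) for name, score in plan.items()}


# A hypothetical plan, with complexity scores a planning agent might assign.
plan = {
    "rename variables": 0.1,
    "write unit tests": 0.5,
    "redesign the schema": 0.9,
}
assignments = route_plan(plan)

# Only the subtasks sent to the cloud incur per-token charges.
metered = [name for name, tier in assignments.items()
           if tier is Tier.CLOUD_FRONTIER]
```

In this toy plan only one of three subtasks reaches the metered tier. Whether real workloads split that favorably depends on how accurately the planner scores complexity.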
Whether Microsoft’s bet pays off depends on questions that will take months to answer. How does the Dev Box actually perform under sustained, real-world workloads? What will it cost? How quickly will the open-source model ecosystem continue to produce capable models in the 70-to-120-billion-parameter range that fit within its memory envelope? And perhaps most critically: will enterprise procurement teams, trained to think of AI as a cloud line item, accept a capital expenditure on desk hardware as an alternative?
The strategic logic, however, is difficult to dismiss. For three years, the AI industry has operated on an implicit assumption: serious AI work happens in the cloud, and the economics of that arrangement are simply the cost of doing business. Microsoft, a company with every incentive to reinforce that assumption, is now selling a machine that undermines it. That is not a contradiction — it is a recognition that the market is moving, and that the company that controls the developer’s local environment and the cloud they deploy to has a more durable advantage than one that controls only the cloud.
Every dollar a developer does not spend on cloud inference is a dollar that can fund another experiment, another iteration, another prototype. For years, the AI industry told developers they needed to rent their intelligence by the token. Microsoft is now asking a different question: what if you could just buy it?
For the past two years, the technology industry has raced to make AI agents more capable — teaching them to write code, navigate software interfaces, manage files, and orchestrate multi-step workflows with increasing autonomy. What the industry has not done, at least not with any consistency, is answer the question that keeps chief information security officers awake at night: what happens when an agent goes wrong?
On Tuesday at its annual Build developer conference, Microsoft offered what may become the definitive answer. The company introduced Microsoft Execution Containers, or MXC — a policy-driven execution layer, built into the Windows operating system itself, that lets developers and IT administrators declare exactly what an AI agent can and cannot access, with those boundaries enforced at runtime by the OS kernel.
The announcement, buried within a sweeping set of developer-focused updates, is arguably the most consequential platform move Microsoft made at Build this year, and it has the potential to reshape how every enterprise on Earth thinks about deploying autonomous AI software.
MXC is not a product you buy. It is an SDK and a policy model — a foundational primitive embedded in Windows and the Windows Subsystem for Linux — that provides what Microsoft calls a “composable sandbox spectrum.” That spectrum ranges from lightweight process isolation, already adopted by GitHub Copilot’s command-line interface, all the way up to micro-virtual machines, Linux containers, and full cloud instances running on Windows 365.
The system separates an agent’s execution from the user’s desktop, clipboard, user interface, and input devices. Critically, it binds every agent to a strong identity — either a local ID or a cloud-provisioned identity backed by Microsoft Entra — so that every action the agent takes can be attributed, audited, and governed.
The implications are enormous. Until now, the enterprise deployment of AI agents has been stuck in a paradox: the more autonomous and useful an agent becomes, the more dangerous it is to let it operate on a corporate network without guardrails. MXC is Microsoft’s attempt to break that paradox — not by making agents less capable, but by making the environment they operate in fundamentally more controlled.
To understand why MXC matters, consider what an AI agent actually does when it runs on your computer. Unlike a traditional application, which operates within well-understood boundaries — a word processor reads and writes documents, a browser fetches web pages — an AI agent is, by design, unpredictable. It receives a goal in natural language, reasons about how to achieve it, and then takes actions: opening files, executing code, calling APIs, browsing the web, interacting with other software. Each of those interactions creates what security professionals call “attack surface.”
Microsoft’s own blog post framed the challenge in stark terms. The company wrote that “as agents become more capable and autonomous, they’re delivering material productivity gains. But they’re also introducing new risk, and the issue isn’t just the agent. It’s the entire system the agent operates across.” Every interaction between agents and humans, tools, applications, models, and other agents “exposes new attack surface and introduces different failure modes.” Microsoft characterized this as “a multi-layer systems problem.”
This is not a theoretical concern. In the months leading up to Build, security researchers demonstrated numerous ways that AI agents could be manipulated — through prompt injection, through malicious tool calls, through data exfiltration disguised as normal workflow. For enterprises that handle sensitive data, proprietary models, and regulated information, the absence of a trusted execution environment has been the single biggest barrier to moving agents from demo to deployment.
MXC operates on a deceptively simple principle: declare what the agent can do before it runs, and let the operating system enforce those declarations at runtime. A developer or an IT administrator writes a policy that specifies which files, directories, and network resources an agent is allowed to access. MXC then creates a contained execution environment — a sandbox — that enforces those boundaries regardless of what the agent attempts to do.
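The declare-then-enforce pattern is easy to illustrate, with the caveat that the Python sketch below is not the MXC SDK, whose API Microsoft did not detail. The `AgentPolicy` class and its field names are hypothetical, and the checks run in ordinary user-space code, whereas Microsoft says MXC enforces its boundaries in the OS kernel. The point is the deny-by-default shape: anything the policy does not name is refused.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class AgentPolicy:
    """Declared up front; anything not listed is denied."""
    read_roots: Tuple[str, ...] = ()
    write_roots: Tuple[str, ...] = ()
    allowed_hosts: FrozenSet[str] = frozenset()

    @staticmethod
    def _inside(path: str, roots: Tuple[str, ...]) -> bool:
        candidate = PurePosixPath(path)
        # Refuse ".." outright so a path cannot climb out of its root.
        if ".." in candidate.parts:
            return False
        return any(
            candidate == PurePosixPath(root)
            or PurePosixPath(root) in candidate.parents
            for root in roots
        )

    def can_read(self, path: str) -> bool:
        # Writable locations are treated as readable too.
        return self._inside(path, self.read_roots + self.write_roots)

    def can_write(self, path: str) -> bool:
        return self._inside(path, self.write_roots)

    def can_connect(self, host: str) -> bool:
        return host in self.allowed_hosts


# Hypothetical policy: read the project, write only to its output folder.
policy = AgentPolicy(
    read_roots=("/project",),
    write_roots=("/project/out",),
    allowed_hosts=frozenset({"api.example.com"}),
)
```

Under this policy the agent can read `/project/src/main.py` and write to `/project/out/report.txt`, but a request to touch `/home/user/Desktop/notes.txt` fails no matter how the agent phrases it, because the desktop was never declared.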
What makes MXC unusual, and potentially very powerful, is the breadth of its isolation options. Microsoft designed the system so that a single SDK and policy model can map to the appropriate isolation construct for any given workload. For a lightweight coding assistant that just needs to read the current project directory, fast process isolation may be sufficient. For an autonomous agent that executes arbitrary code downloaded from the internet, a full micro-VM may be required. The system is designed to be “dynamically composable based on intent and risk,” meaning that the level of isolation can be adjusted based on what the agent is actually doing, not just what category it falls into.
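One way to picture isolation that follows behavior rather than category is a rule that picks the strictest level any of an agent’s actions demands. The Python sketch below is an assumption-laden illustration, not Microsoft’s design: it models only two rungs of the spectrum (process isolation and micro-VMs), and the action names and the mapping are hypothetical.

```python
from enum import IntEnum
from typing import Iterable


class Isolation(IntEnum):
    """Two rungs of the spectrum, ordered from lightest to strictest."""
    PROCESS = 1
    MICRO_VM = 2


# Hypothetical mapping from what an agent does to the isolation it needs.
REQUIRED = {
    "read_project_dir": Isolation.PROCESS,
    "run_downloaded_code": Isolation.MICRO_VM,
}


def isolation_for(actions: Iterable[str]) -> Isolation:
    """Return the strictest isolation level demanded by any action.

    Unknown actions are treated as high risk and get the strictest level.
    An agent with no actions gets the lightest level as a baseline.
    """
    levels = [REQUIRED.get(action, Isolation.MICRO_VM) for action in actions]
    return max(levels, default=Isolation.PROCESS)


coding_assistant = isolation_for(["read_project_dir"])
autonomous_agent = isolation_for(["read_project_dir", "run_downloaded_code"])
unknown_behavior = isolation_for(["something_unlisted"])
```

The same agent moves from process isolation to a micro-VM the moment its plan includes running downloaded code, which is the sense in which the level tracks intent and risk rather than a fixed label.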
Session isolation is a particularly important feature. MXC separates the agent’s execution from the user’s desktop, clipboard, UI, and input devices. This directly mitigates several classes of attacks that security researchers have identified as particularly dangerous for AI agents: UI spoofing, where an agent manipulates what the user sees to trick them into approving a malicious action; input injection, where an agent sends keystrokes or mouse clicks to other applications; and cross-session data leakage, where information from one user’s session bleeds into another.
During a pre-briefing with VentureBeat the night before the announcement, a Microsoft developer offered a vivid demonstration of the technology in action. He had set up the open-source agent framework OpenClaw running inside MXC’s sandbox on his personal development machine. He then instructed the agent to delete all the files on his desktop. The agent attempted to comply — but the sandbox prevented it. “If you look at my desktop here, you see how clean my desktop is,” the developer said during the demo. “That’s a lie.” The files, he explained, were completely safe because “the container won’t allow it.”
The demonstration went further, showcasing the granularity of MXC’s controls. Users can mark specific files as read-only for the agent, restrict access to the browser and screen capture, control whether the agent can see location data, and have all of those permissions managed centrally by an enterprise IT department through Intune policies. The agent operates inside what is effectively a one-way mirror: it can do the work it has been asked to do, but it cannot see or touch anything outside the boundaries that its policy defines.
Pavan Davuluri, Microsoft’s Executive Vice President for Windows and Devices, underscored during the pre-briefing that the primitives MXC introduces — security, containment, isolation, and user control — are essential to making AI agents commercially viable.
He emphasized that these capabilities are “not unique to OpenClaw” and that “this pattern repeats itself over and over” for any agent running on a Windows device. The primitives that exist in the operating system now “for the file around security, containment, isolating them, having users in control,” he said, are what will make agents safe enough for ordinary consumers and corporate deployments alike.
For corporate IT departments, the most significant element of the MXC announcement is not the SDK itself but its integration with Microsoft’s existing enterprise security stack through what the company calls Agent 365. Arriving in preview in July, Agent 365 layers Microsoft’s Entra identity service and Intune device management platform on top of MXC, so that IT administrators can govern agent containment centrally while developers choose the level of isolation their workload demands.
The integration goes further: Microsoft Defender will provide runtime threat protection, Entra will handle identity and access management, Intune will enforce device-level policies, and Microsoft Purview will extend its data governance and compliance capabilities to agent activity. This means that an enterprise could, in theory, allow employees to run AI agents on their corporate machines — even powerful, autonomous agents that execute code and manage files — while maintaining the same kind of centralized visibility and control that IT departments currently have over traditional applications.
Microsoft described the identity layer in its official blog: “Windows assigns agents a local ID or a cloud provisioned identity backed by Entra and attributes all activity from the container to that identity, so you can clearly differentiate human from agent.” For regulated industries — financial services, healthcare, government — the ability to produce an audit trail that distinguishes between human actions and agent actions on the same machine could prove to be a regulatory requirement, not merely a nice-to-have feature. Every agent action attributable to a specific identity, every containment boundary enforceable through the same policy infrastructure that already governs hundreds of millions of Windows devices — this is the architecture that could finally move AI agents from pilot programs to production.
Platform announcements at developer conferences are often aspirational. What distinguishes the MXC launch is the breadth and specificity of the partners already building on it. Microsoft named five: OpenAI, Nvidia, Manus, Nous Research (maker of the Hermes agent), and the OpenClaw open-source project. Each is integrating MXC in a distinct way that illuminates a different use case for the technology.
OpenAI’s involvement is particularly striking. David Wiesen, a member of OpenAI’s technical staff, said that “working with Microsoft on the Microsoft Execution Containers (MXC) allows us to explore new patterns for AI agents to safely and efficiently generate and execute code.” He added that by combining Codex’s capabilities with MXC’s execution environment, the goal is “to help developers move from intent to reliable execution faster, while maintaining the security and control enterprises need.” The reference to Codex — OpenAI’s code-generation agent — suggests that MXC could become the default execution environment for one of the most widely anticipated agent products in the industry.
Nvidia is bringing its OpenShell framework to Windows built on MXC, providing what Microsoft described as “an easy-to-deploy package for autonomous, always-on agents safely.” Manus, the Chinese-born AI agent startup that gained viral attention earlier this year, is also integrating. Tao Zhang, Manus’s Chief Product Officer, said that MXC “gives developers a policy-driven way to define what an agent can access and enforce those boundaries at runtime, so more autonomous agents can operate safely in enterprise environments.” And Dillon Rolnick, the CEO of Nous Research, offered what may be the most concise articulation of why MXC matters: “Continuously-running local agents, like Hermes Agent, require intentional isolation. Developers need control over what an agent can access and trust that those controls will hold.”
One of the more revealing stories behind the MXC announcement involves OpenClaw. During the press pre-briefing, a Microsoft developer described how the partnership came together organically — Peter Steinberger, OpenClaw’s creator, sent him a direct message in January expressing interest in collaborating. What began as a casual conversation evolved into a full-fledged platform partnership, with Microsoft developers contributing to the OpenClaw Windows companion app, built as a native WinUI application rather than a wrapped web app.
The OpenClaw integration serves as what Scott called “the ultimate test app for all the stuff that [the Windows platform team] is making.” If OpenClaw — which by its nature gives agents broad autonomy to execute tasks on a user’s machine — can run securely within MXC’s containment boundaries, then the containment system is robust enough for any agent. Scott explained the philosophy driving the work: “Think of OpenClaw Windows as the ultimate test app… If OpenClaw can succeed on Windows, that means that the Linux support is there, the container support is there, the containment is there.”
The companion app demonstrates the full spectrum of MXC’s enterprise controls — file permissions, network access, screen capture restrictions, location data — all manageable centrally through Intune policies. Microsoft donated the project to OpenClaw and plans to continue contributing to it as open source. As one member of the Windows leadership team put it during the briefing: “All agents, all comers, everyone is welcome on Windows… It’s going to run great on Windows, because the primitives are there. The base of the pyramid is solid.”
MXC arrives at a moment when the technology industry is grappling with a fundamental tension. AI agents represent what may be the most significant new category of software since mobile applications, and every major technology company is racing to build them. But the security and governance infrastructure required to deploy these agents responsibly in enterprise environments barely exists. Microsoft’s approach is distinctive because it locates the trust layer at the operating system level rather than in the agent framework, the model provider, or a third-party security product.
This is a deliberate architectural choice. By building containment into Windows itself, Microsoft ensures that the security guarantees hold regardless of which agent, which model, or which framework a developer chooses.
It also means that the hundreds of millions of Windows devices already managed through Intune and secured through Defender can, in principle, become agent-ready through a software update rather than a rip-and-replace deployment.
Apple’s approach to AI agents leans heavily on its walled-garden ecosystem, offering security through restriction — limiting which agents can run and what they can do. Google’s approach, centered on its cloud infrastructure, offers security through centralization. Microsoft’s approach offers security through declaration and enforcement — allowing any agent to run, but containing its impact through OS-level policy.
For enterprises that operate in heterogeneous environments with diverse toolchains and multiple AI providers, the Microsoft model may prove the most practical. The competitive dynamics are already shifting: with OpenAI’s Codex, Nvidia’s OpenShell, and independent agent frameworks like Manus and Hermes all building on MXC, Microsoft is positioning Windows not just as the platform where agents run, but as the platform where agents can be trusted to run.
MXC is available now in early preview, meaning developers can begin building against the SDK and testing containment policies. The Agent 365 integration with Defender, Entra, Intune, and Purview is scheduled for preview in July — a timeline aggressive enough to suggest that much of the engineering work is already done, but far enough out to allow for refinement based on developer feedback.
The real test, however, will come when enterprises begin deploying agents at scale on production networks. Containment is only as good as the policies that govern it, and writing effective agent policies for complex enterprise environments will be an entirely new discipline — one that IT departments have not yet developed and that no vendor has yet figured out how to teach. The technology is promising, but an empty sandbox is just an empty box. Filling it with the right rules, for the right agents, in the right contexts, will require a level of organizational sophistication that most companies are only beginning to contemplate.
Still, the significance of what Microsoft announced on Tuesday is difficult to overstate. For the first time, a major operating system vendor has proposed a comprehensive, kernel-level answer to the question of how autonomous AI software should be contained, identified, and governed on the devices where most of the world’s work actually gets done. The industry spent two years teaching agents to act. Microsoft is now betting that the bigger business — and the harder engineering problem — is teaching the operating system to watch.
Mistral AI used its inaugural conference on Wednesday to announce a sweeping expansion into industrial manufacturing, a new inference data center south of Paris, and a rebranding of its consumer-facing assistant — moves that collectively signal the three-year-old French startup’s ambition to become the enterprise AI provider of record for companies that refuse to hand their most sensitive data to American hyperscalers.
At the AI NOW Summit, held at a venue in central Paris, co-founder and CEO Arthur Mensch took the stage alongside CTO Timothée Lacroix and Chief Scientist Guillaume Lample to lay out a strategy that stretches from bare-metal GPU clusters to physics simulations for aircraft wings. The company disclosed that it now employs 1,000 people and is targeting €1 billion ($1.17B USD) in revenue for 2026 — a figure that, if achieved, would be an extraordinary growth trajectory for a company that began with 15 employees collaborating with its first customer, BNP Paribas, in 2023.
“We have two convictions at Mistral,” Mensch told the audience. “The first is that in order to deploy AI in the enterprise, you actually need, as an AI provider, to own the full stack.” He described Mistral’s business as fundamentally about “transforming electrons into tokens and intelligence,” arguing that physical infrastructure control matters as much as model quality.
The announcements come at a pivotal moment for Mistral and for the broader European AI ecosystem. The company has raised at least $3.9 billion across nine funding rounds, according to Clay’s funding tracker, including a massive €1.7 billion Series C led by Dutch semiconductor equipment maker ASML in September 2025 at an €11.7 billion valuation, and an $830 million debt financing round in March 2026 from a consortium of seven banks to fund data center construction. Mistral now finds itself in a peculiar competitive position: too large to be dismissed as a research lab, but still dwarfed by the resources of OpenAI, Google DeepMind, and Anthropic.
Its answer, articulated across nearly an hour of presentations Wednesday, is vertical depth — going industry by industry, workflow by workflow, and building the infrastructure to keep everything on premises.
The centerpiece announcement was Mistral for Industrial Engineering, a fully integrated AI stack that combines Mistral’s large language models with physics simulation capabilities acquired through its purchase of Emmi AI, completed earlier in May 2026. The platform targets the aerospace, automotive, and semiconductor industries with tools for accelerating product design, validating simulations, and optimizing production.
The launch came with headline partnerships. Mistral announced it is working with Airbus across its commercial aircraft, helicopter, defense, and space divisions, implementing AI from initial design through to on-board capabilities. For BMW Group, Mistral is serving as a central partner for what the automaker calls its “Large Industry Model” initiative, focused on multimodal reasoning models for crash simulation and other complex engineering tasks. ASML, already Mistral’s largest shareholder, is also an early adopter.
Mensch framed the industrial push as addressing a fundamental gap in how AI is currently deployed. “AI is great today at automating tasks for knowledge workers and for people that are doing software engineering,” he told the summit audience. “But once you move to all the kind of engineers, well, they are underserved.”
The reason, he explained, is structural. Simulating the behavior of a wing or a factory process requires compute-intensive physics solvers that can take hours or weeks per design variant. Traditional simulation creates a bottleneck that makes AI-assisted iteration impractical.
Mistral’s answer is what it calls “physics AI” — data-driven models trained on solver outputs that can predict physical behavior in seconds rather than hours, running on a single GPU. As Mistral’s own blog post on the technology acknowledges, physics AI is “not a replacement for first-principles solvers in every regime” — it is a throughput accelerator for the majority of design-loop iterations, with traditional solvers reserved for verification and edge cases.
“We now have both the language intelligence and the physical intelligence models, and by combining them together we are building delegation loops that allow us to create better tools, that allow us to create better objects that actually have an impact on the physical world,” Mensch said.
The ASML partnership offered a concrete illustration. In a video testimonial shown at the summit, an ASML representative described how the company’s lithography machines run around the clock at customer fabrication plants, and field service engineers need to diagnose issues as rapidly as possible. By combining ASML’s internal engineering expertise with Mistral’s models, “we were able to develop a solution that’s 120 times faster with a similar accuracy as we have today,” the representative said. Another ASML speaker described AI agents acting as “an always-on code reviewer” to catch software defects before they reach customers.
Mistral’s full-stack ambitions extend all the way down to the physical layer. Launched in June 2025, Mistral Compute is a €4 billion ($4.66B USD) investment in data centers in France and Sweden, with a stated roadmap of 200 MW of capacity by 2027 and 1 GW by 2030.
Lacroix described the company’s existing 40 MW facility at Bruyères-le-Châtel, south of Paris, which was built in collaboration with Eclarion and has been training models since early 2026. “It’s been very interesting to see how we can transfer rigor, which is one of our company values, into down to the hardware layer,” he said, describing the process of “fixing compute trays and fixing fibers, allowing us to reach the very best speeds possible on that hardware for training.”
On Wednesday, Mistral announced a new 10 MW facility at Les Ulis in the Essonne department, also south of Paris, dedicated to inference operations and scheduled to open in Q3 2026. Lacroix also referenced a site in Borlänge, Sweden, planned for development through 2027, which will host NVIDIA’s next-generation Vera Rubin GPUs. “One of the benefits for us of owning the hardware layer is also that it lets us be at the very bleeding edge of what infrastructure provides,” he told the audience.
The infrastructure push is funded in part by the $830 million debt financing round announced in March 2026, which Clay’s funding tracker attributes to a consortium of seven banks: Bpifrance, BNP Paribas, Crédit Agricole CIB, HSBC, La Banque Postale, MUFG, and Natixis CIB. And this infrastructure ownership is not merely a hedge against GPU scarcity — it is central to Mistral’s pitch to security-conscious enterprise and government customers. The company’s February 2026 acquisition of serverless platform Koyeb has been integrated into Mistral Studio to support both hosted and on-premises deployments, giving customers a choice between running inference on Mistral’s hardware or their own.
“More and more, the compute world has been getting supply constrained,” Lacroix told the audience. “One of the reasons we’ve been doing all of this and developing all of this data center capacity is to secure compute capacity not only for ourselves but also for our customers.”
In a consumer-facing rebrand with significant enterprise implications, Mistral announced that Le Chat — its conversational AI assistant launched in February 2024 — is being renamed Vibe and reimagined as a unified agent platform for enterprise productivity and software development.
“We are transitioning Le Chat to the Vibe family,” Lacroix told the audience, explaining that the evolution was driven by the growing power of agentic models, particularly the new Mistral Medium 3.5. As the team used Vibe’s coding CLI internally with increasingly complex tasks, “we realized that this really didn’t need to be bound to the CLI, it didn’t need to be limited to code, and we could do a lot more with it,” he said.
Vibe encompasses two primary modes. Vibe for Work is a web and mobile agent that connects to enterprise tools — Google Workspace, Outlook, SharePoint, Slack, GitHub — to perform multi-step tasks such as summarizing emails, analyzing spreadsheets, drafting reports, and scheduling recurring workflows. Vibe for Code is a coding agent available through a web interface, a new VS Code extension, and the existing CLI, capable of building features, fixing bugs, refactoring code, and shipping pull requests. Critically, the same underlying agent powers both modes. “When you access it through our web app or through the CLI, you have access to the same connections, the same tools, the same understanding of who you are, what you do, and what you’re trying to achieve,” Lacroix said.
Pricing starts at free for basic use, $14.99 per month for Pro, $24.99 per user per month for Teams, and custom pricing for Enterprise deployments. Alongside Vibe, Mistral also launched Search Toolkit, an open-source framework for building production search pipelines already in use by shipping giant CMA CGM, which uses it alongside Voxtral to process audio from multiple data sources and return alerts within 15 seconds.
Chief Scientist Guillaume Lample used his portion of the keynote to describe a philosophical shift in Mistral’s model strategy: consolidation of capabilities into fewer, more versatile models rather than maintaining separate specialized products.
Mistral Medium 3.5, the company’s current flagship, absorbs capabilities that previously required distinct models. Pixtral (image processing), Magistral (reasoning), and Devstral (coding) have all been deprecated as standalone products, with their capabilities folded natively into Medium 3.5. “Now all our models are natively multimodal,” Lample said. “We no longer have Magistral. This model is deprecated, because all our models will natively be doing reasoning.”
The company is also working on Mistral Large 4, which Lample said would arrive “in a couple of months at most, during the summer,” with expanded capabilities in industrial applications such as fluid dynamics, computational chemistry, computer-aided design, and cybersecurity. On the smaller end of the spectrum, Lample highlighted Mr. Lossier, a 1-billion-parameter OCR model that can process thousands of pages per minute on a single GPU, and the Voxtral speech model family, which has expanded from automatic speech recognition to include text-to-speech with voice cloning. A “duplex” model for real-time conversational speech is planned for release within months.
Lample also made the case for open-weight models becoming more — not less — important in the agentic era. “Today we are building these agentic workflows, these models are running in the background, they are doing a lot of actions, a lot of tool calls, so they are extremely token-hungry, much more than before,” he said. “What we are seeing today is actually a comeback of this small model and the efficient model.” Upcoming models will be trained on more than 200 languages, a multilingual strength now powering a partnership with Amazon to improve non-English interactions on Alexa+.
Mistral’s positioning stands in sharp contrast to the strategies of its most prominent American rivals. While OpenAI and Anthropic have each attracted hundreds of millions of consumer users and derive significant revenue from subscription products, Mistral has leaned almost entirely into enterprise and government deployments. As TechCrunch reported in March when Mistral announced its Forge customization platform at Nvidia GTC, CEO Mensch has described the company as being “on track to surpass $1 billion in annual recurring revenue” — a figure driven largely by corporate clients.
The Forge platform, which lets enterprises train custom models on their own data rather than simply fine-tuning or applying retrieval-augmented generation to existing models, represents the foundation on which the company’s industry-specific solutions are built. As Mistral’s head of product, Elisa Salamanca, told TechCrunch, Forge “lets enterprises and governments customize AI models for their specific needs.” Early partners include Ericsson, the European Space Agency, Italian consulting company Reply, and Singapore’s DSO and HTX, alongside ASML.
Mistral has also built an expanding network of systems integration partnerships to drive enterprise adoption. In February 2026, Accenture and Mistral announced a multi-year strategic collaboration, with Accenture itself becoming a Mistral customer. Mauro Macchi, Accenture’s CEO for Europe, Middle East, and Africa, said at the time that the partnership brings together “sovereign models and the capability to scale technology across industries, geographies and business functions.”
The BNP Paribas relationship offers the most detailed public case study. In a video testimonial at the summit, a BNP Paribas representative described deploying Mistral’s models on-premises to satisfy strict security requirements, developing AI agents for KYC processes that reduced incomplete files from 80% to 10% and compressed processing time from weeks to days. The bank’s LLM platform at its Corporate and Institutional Banking division has now rolled out to 65,000 users. Mensch noted the significance: “We started to collaborate in 2023 where we were 15 people, so that was, I think, really a leap of faith at the time.”
The industrial vertical is also being extended to government clients. Mistral disclosed that it is working with France, Luxembourg, Singapore, Morocco, Greece, and Slovakia to build citizen-facing AI services — from deploying agents that help job-seekers through France Travail to building models that understand Moroccan Darija and Amazigh languages. “We think that AI needs to be specialized and understand structural nuances,” Mensch told the audience. “It needs to speak languages as good as it speaks English.”
For Mistral, Wednesday’s announcements amount to a declaration that the company intends to compete not by matching American AI giants on any single dimension, but by assembling capabilities none of them are willing or able to offer in combination: open-weight models, owned infrastructure, on-premises deployment, physics simulation, and deep vertical customization — all under a single roof.
The strategy demands execution on multiple fronts simultaneously, each requiring enormous capital and specialized talent. The competition is formidable and accelerating. OpenAI has been rapidly expanding its enterprise offerings. Anthropic, backed by billions from Amazon, is building its own corporate AI practice. Google, Microsoft, and Amazon all offer AI platforms deeply integrated with cloud infrastructure that most enterprises already use.
But Mistral is wagering that the world’s most consequential AI deployments — the ones governing how aircraft get designed, how banks process compliance, how governments interact with citizens — will ultimately go to providers that offer sovereignty over data, models, and compute. “AI is too strategic to be left in the hands of a few,” Mensch said, echoing the conviction he described from Mistral’s founding three years ago.
Three years in, the company that started as a Paris research lab with a handful of employees now trains models in its own data centers, simulates physics for the manufacturers that build the world’s planes and cars, and is rewriting its assistant into an agent that can file your pull requests and summarize your inbox in the same conversation. Whether that sprawling ambition coheres into a durable business or stretches Mistral too thin is the €11.7 billion ($13.6B USD) question. The 1,000 people now working there are betting that in enterprise AI, owning the full stack is not a liability — it is the product.
DeepSeek’s announcement over the weekend that it has made its 75% price cut permanent on its flagship V4 Pro model is a disruptive assault on the capital-heavy business models of Silicon Valley’s frontier labs.
The reduction on DeepSeek V4 Pro directly undercuts comparable Western models used as workhorses for enterprise production. It is 7x cheaper on inputs and 17x cheaper on outputs than Anthropic’s Claude Sonnet or OpenAI’s GPT 5.5-Med, while the lightweight DeepSeek V4 Flash undercuts entry-tier alternatives like Claude Haiku by 10x to 25x.
The price cuts are enabled by a series of hardware-software innovations, especially around cache, that make DeepSeek’s models radically more efficient to run. When hosted natively in China, DeepSeek’s cache-read pricing is a whopping 87x cheaper than Western clouds — a deflationary floor so aggressive that handset giant Xiaomi just moved to match the exact pricing tier for its newly deployed MiMo architecture.
DeepSeek V4 Pro’s performance is ranked almost on par with Western frontier models, hitting 80.6% on coding-agent tasks via the SWE-bench Verified leaderboard and an elite reasoning score of 87.5 on the advanced MMLU-Pro technical index. Both V4 Pro and V4 Flash — a hyper-optimized speedy version for developers — are open-weight and issued under a permissive MIT license. This gives enterprises complete flexibility over deployment. This dual-model strategy allows technical teams to route their heaviest, multi-step autonomous agent workloads to the lightning-fast Flash model, while reserving the heavy Pro model for deep reasoning tasks, drastically lowering costs at a time when budget concerns have grown considerably.
This also comes at a time when the closed Western labs, in particular OpenAI and Anthropic, face intense return-on-investment scrutiny of their multibillion-dollar general-purpose hardware infrastructure investments.
This deflationary collapse will not affect all Silicon Valley labs equally, signaling a permanent bifurcation of the enterprise AI market. While a premium, deterministic tier will endure for mission-critical engineering workflows, the high-volume background agentic layer is being completely commoditized by open weights. Ultimately, it creates a much more dangerous exposure for OpenAI — whose revenue mix relies heavily on general-purpose commodity API streams — than for software-insulated peers like Anthropic.
Uber says it burned through its entire 2026 budget for Claude Code and Cursor in just the first four months of the year; its COO said that the cost related to high token usage by some of its engineers was getting “harder to justify” without better products to show for it. Airbnb’s Brian Chesky said last year that while the company uses OpenAI’s latest models, it doesn’t rely on them heavily in production, favoring faster, cheaper alternatives like Alibaba’s Qwen. And in the latest episode of VentureBeat’s podcast Beyond the Pilot, Pinterest CTO Matt Madrigal confirmed that the company went all-in on an open-source AI strategy, post-training Alibaba’s open Qwen model on the company’s proprietary “taste graph” to drive Pinterest’s assistant and achieving frontier-like quality at a 90% reduction in costs. DeepSeek’s subsequent price drop makes even larger cost differences possible.
Widespread enterprise adoption of Chinese models faces massive geopolitical headwinds in the West. For highly regulated U.S. giants in finance, healthcare, and defense, getting comfortable with DeepSeek will take time.
Even though an open-weights architecture under an MIT license allows a company to self-host the model locally and prevent active data exfiltration to foreign servers, corporate compliance boards remain deeply paranoid over software supply chain risks, potential hidden backdoors, and the legal threat of sudden federal sanctions.
Smaller, more nimble software teams, on the other hand, face far less bureaucratic gridlock. Free from multi-month security review cycles, these fast-moving organizations view the immediate 75% infrastructure savings as a massive competitive edge worth deploying right now.
Take the token usage metrics on OpenRouter, a leading public proxy for which models are most popular among developers. OpenRouter gives developers an easy way to compare and deploy models, and while its data is by no means a complete measure of real model popularity, it confirms that this structural migration is already taking place within company data pipelines. DeepSeek’s V4 Flash model has captured the No. 1 position on the OpenRouter leaderboard over the past week, surging 48% in token usage. Its advanced counterpart, V4 Pro, sits at No. 6. DeepSeek’s top three models processed nearly 6 trillion tokens on OpenRouter over the past week, giving it a huge lead over competitors. For example, OpenAI’s premium model, GPT-5.5, has slipped to No. 15 at 470B tokens.
It’s not clear exactly how much of the world’s token traffic runs through OpenRouter. Conservative estimates put it at about 3%. Its data does not show the massive volume of tokens served through the APIs that companies like Anthropic, OpenAI and Google offer directly to developers. But recent estimates suggest OpenRouter processes between 15% and 40% of each of OpenAI’s and Google’s token usage, and growing, making it a significant indicator of relative trends regardless of the exact percentage it represents.
While skeptics often dismiss aggregator traffic as an indie developer signal rather than a reflection of Fortune 500 IT spend, the corporate pipeline reality is shifting. An infrastructure analysis by a leading venture capital firm, Andreessen Horowitz, revealed that enterprise production environments deploy a median of 14 different models simultaneously to price-route workloads and avoid single-vendor lock-in. This structural architecture shift is why OpenRouter recently secured a massive $113 million Series B funding round backed directly by the big enterprise data and software vendors that serve corporate America — including ServiceNow Ventures, Snowflake Ventures, Databricks Ventures, Nvidia’s NVentures, and Google’s CapitalG. Stripe also cited OpenRouter’s enterprise customers in its decision to partner closely with the company.
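As a rough illustration of what price-routing means in practice, the sketch below picks the cheapest model that clears a quality bar for a given workload. Everything in it is assumed for illustration: the `ModelOption` type, the three catalog entries, their quality scores, and their prices are placeholders, not figures from the Andreessen Horowitz analysis or from any vendor's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    quality: float           # placeholder 0-1 score from an internal evaluation
    usd_per_m_output: float  # placeholder price per million output tokens


# Hypothetical catalog; a real deployment would hold many more entries.
CATALOG = [
    ModelOption("premium-closed", 0.95, 15.00),
    ModelOption("mid-open", 0.88, 0.90),
    ModelOption("small-open", 0.75, 0.30),
]


def price_route(min_quality, catalog=CATALOG):
    """Return the cheapest model whose quality meets the workload's bar."""
    eligible = [m for m in catalog if m.quality >= min_quality]
    if not eligible:
        raise ValueError("no model meets the quality bar")
    return min(eligible, key=lambda m: m.usd_per_m_output)


# Background agent loops tolerate a lower bar than mission-critical work.
print(price_route(0.70).name)
print(price_route(0.90).name)
```

The design choice worth noting is that the router never asks which vendor a model comes from, only whether it is good enough and what it costs, which is why a sharp price cut from any one provider can shift traffic quickly.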
That’s why DeepSeek’s surge on this leaderboard is so eye-opening. DeepSeek also offers an API directly to developers, so it too serves more token traffic than OpenRouter’s numbers show.
The DeepSeek spike on OpenRouter indicates a deeper structural shift in how automated software architectures consume machine intelligence. Technical teams are moving beyond trivial, single-turn chatbots and starting to deploy more sophisticated autonomous agents that persist for hours at a time, recursively looping through codebases and data lakes. Their huge number of tool calls and continuous rereading of long context histories mean AI token consumption grows steeply.
Running these recursive loops on closed, premium Western APIs quickly creates unsustainable infrastructure costs. While corporate tech teams spent last year experimenting freely with early, single-turn prototypes without worrying about budgets, the onset of token-prolific autonomous agents has triggered an enterprise line-item crisis. VentureBeat’s Q1 2026 research, which surveyed enterprise users at organizations with over 100 employees (n=65, in the U.S. software, finance and healthcare industries), confirms the shift: “Cost per token or licensing model” jumped from 25.4% in January to 36.7% in March, trailing only raw performance as the primary selection criterion for enterprise buyers.
DeepSeek tuned its model for exactly this trend of high-token agentic use. It has set a standard input price of $0.435 per million tokens and a standard output price of $0.87 per million tokens, alongside a rock-bottom prefix-cached read price of $0.003625 per million.
It’s this third cost item — for cache — which is arguably the most significant. “If you measure how all of these agents now are using tokens, 80 to 90% of the tokens are cache-read tokens,” said Val Bercovici, Chief AI Officer at WEKA, a company that provides fast storage for much of this cache. “Which means that [that price] is almost by far the most important price, making the others irrelevant — nearly a rounding error. So what DeepSeek did is not just say we’re going to be 5% cheaper, 10% cheaper, 20% cheaper. They’re like 87x cheaper on that cache-read price with DeepSeek V4 Pro. So that’s really set the industry on notice.”
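To see how much the cache-read price can matter, here is a minimal sketch that prices a single agent session using the three rates quoted above. The session size and the 85% / 5% / 10% token split are hypothetical, chosen only to fall inside the 80% to 90% cache-read range Bercovici describes.

```python
# Prices quoted in the article, in USD per million tokens.
INPUT_PRICE = 0.435
OUTPUT_PRICE = 0.87
CACHE_READ_PRICE = 0.003625


def session_cost(total_tokens, cache_read_share, output_share,
                 cache_price=CACHE_READ_PRICE):
    """Cost in USD of a session split into cache reads, fresh input and output."""
    cache_tokens = total_tokens * cache_read_share
    output_tokens = total_tokens * output_share
    input_tokens = total_tokens - cache_tokens - output_tokens
    total = (cache_tokens * cache_price
             + input_tokens * INPUT_PRICE
             + output_tokens * OUTPUT_PRICE)
    return total / 1_000_000


# Hypothetical session: 100M tokens, 85% cache reads, 5% output, 10% fresh input.
with_discount = session_cost(100_000_000, 0.85, 0.05)
# Same session if cache reads were billed at the full input price instead.
without_discount = session_cost(100_000_000, 0.85, 0.05, cache_price=INPUT_PRICE)

print(f"with cache pricing:    ${with_discount:.2f}")
print(f"without cache pricing: ${without_discount:.2f}")
```

Under these assumed proportions, the gap between the two figures comes entirely from how the cache-read tokens are billed.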
DeepSeek’s core innovations are around hardware-software alignment. This is where we get a little technical.
Western frontier labs like OpenAI have prioritized performance at all costs, investing billions in uncompressed “dense” neural architectures. DeepSeek, by contrast, has systematically sought to extract maximum intelligence from lower-grade hardware, given that it has lacked access to Nvidia’s top-tier GPUs. By pioneering deep software optimizations as early as its V2 architecture in 2024, the lab engineered a series of four interconnected hardware-software alignment breakthroughs that decoupled a model’s operational context from expensive computing overhead:
Breakthrough 1: Sequence Dimension Compression via CSA and HCA
The transformer architecture that most LLMs use is bottlenecked by something called the Key-Value (KV) cache. As an agent executes long, multi-step sessions, historical context keys clog the high-bandwidth memory (HBM) on the GPU, causing severe latency spikes and an expensive infrastructure tax.
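A rough sense of why the KV cache becomes a bottleneck comes from the standard sizing formula for an uncompressed cache: two vectors (one key, one value) per token, per layer, per KV head. The sketch below applies it to an illustrative configuration. The layer count, head count, head dimension and one-byte precision are assumptions for the example, not figures from this article.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value):
    """Size of an uncompressed KV cache for one sequence.

    The factor of 2 covers one key vector and one value vector
    per token, per layer, per KV head.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Illustrative (assumed) grouped-query-attention configuration.
size = kv_cache_bytes(layers=94, kv_heads=4, head_dim=128,
                      seq_len=1_000_000, bytes_per_value=1)
print(f"{size / 2**30:.1f} GiB for a 1M-token context")
```

Because the result scales linearly with sequence length, every doubling of an agent's context doubles the memory a single session pins on the GPU.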
DeepSeek resolved this structural bottleneck by introducing a hybrid attention mechanism — documented in the DeepSeek V4 Architecture Paper — that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to cut overall KV-cache usage by a massive 90% across its 1-million-token context window.
While traditional models try to keep a unique memory log for every individual word, DeepSeek compresses the rows of its memory cache. CSA acts as a local filter, condensing small windows of text into concise, indexable blocks so the model doesn’t sweat the fine-grained details. HCA acts as an aggressive global index, crushing massive spans of text deep within a session’s history into high-density summaries. By interleaving these layers, DeepSeek shrinks millions of memory rows down to a fraction of their size.
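The idea of shrinking the cache along the sequence dimension can be illustrated with a toy example. The sketch below averages consecutive rows into blocks, once with a small window and once with a large one. Mean pooling is a stand-in chosen for simplicity; it is not DeepSeek's CSA or HCA mechanism, and the block sizes are hypothetical.

```python
def compress_rows(rows, block_size):
    """Average each consecutive block of rows into a single row."""
    compressed = []
    for start in range(0, len(rows), block_size):
        block = rows[start:start + block_size]
        dim = len(block[0])
        compressed.append(
            [sum(row[i] for row in block) / len(block) for i in range(dim)]
        )
    return compressed


# A toy cache: 1,024 rows (one per token), two values per row.
cache = [[float(t), float(t) * 2.0] for t in range(1024)]

light = compress_rows(cache, block_size=4)    # fine-grained, local summary
heavy = compress_rows(cache, block_size=128)  # coarse, long-range summary

print(len(cache), len(light), len(heavy))
```

The trade-off is visible in the row counts: the larger the block, the fewer rows remain, and the less fine-grained detail each row preserves.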
Breakthrough 2: Native Memory Offloading via Multi-head Latent Attention (MLA)
Using something called Multi-head Latent Attention (MLA), DeepSeek strips the active memory footprint of its context history down to a fraction of standard models. It achieves this by running a physical division of labor between hardware chips. While traditional models force expensive GPUs to hold a session’s entire history, DeepSeek’s architecture keeps only the tiny, highly compressed search index tags (the Keys) on the GPU. Meanwhile, it offloads the heavy data payloads (the Values) entirely into cheaper system memory and local storage tiers. Once the GPU handles the high-speed matching to find relevant data, it calls the values from storage only on an as-needed basis.
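The division of labor described above, with keys matched in fast memory and values fetched only on demand, can be sketched in a few lines. This is a toy illustration of the pattern as the paragraph describes it, not DeepSeek's implementation; `SlowTier` and `top_k_lookup` are hypothetical names.

```python
class SlowTier:
    """Stand-in for cheaper system memory or storage; counts reads."""

    def __init__(self, values):
        self._values = values
        self.reads = 0

    def fetch(self, index):
        self.reads += 1
        return self._values[index]


def top_k_lookup(query, fast_keys, slow_values, k):
    """Score every key in the fast tier, then fetch only the k best values."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in fast_keys]
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return {i: slow_values.fetch(i) for i in best}


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
store = SlowTier(["a", "b", "c", "d"])
hits = top_k_lookup([1.0, 1.0], keys, store, k=1)
print(hits, "slow-tier reads:", store.reads)
```

Only the selected entries ever leave the slow tier, which is the property that keeps the fast tier's footprint small.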
DeepSeek’s architecture is so different that it is straining inference engines, the software that loads a model’s weights into GPU memory so it is ready for prompting. The three most popular engines (Nvidia’s TensorRT-LLM, SGLang, which has roots at UC Berkeley, and the hugely popular vLLM) “are all being stretched to keep up with being able to offer it, which is not normal,” explains WEKA’s Bercovici. “Every other open model has had some similarity to other open models. This one from DeepSeek is just built different.”
DeepSeek’s software engineering means its massive 1.6-trillion-parameter model requires an astonishingly small 5.48 GB of HBM to hold a 1-million-token context in production, according to calculations by an analyst using hardware modeling benchmarks. For comparison, smaller models built on standard grouped-query attention (GQA) architectures consume up to 89 GB of HBM under the same context load.
| Model Framework / Metric Tier | Active HBM Needed (1M Context) | Context Length Capacity | Multi-Step Cached Economics |
|---|---|---|---|
| DeepSeek V4-Pro (1.6T MoE) | 5.48 GB | 1,000,000 tokens | 80% to 90% of workflow tokens |
| Qwen3-235B-A22B (GQA Standard) | 89.00 GB | 1,000,000 tokens | Subject to steep hardware tax |
| GPT-5.5 / Claude 4.7-class (Western Frontier / MoE) | 180+ GB | 1,000,000 tokens | Prohibitive premium infrastructure tax |

DeepSeek’s extreme compression of the KV cache down to 5.48 GB of HBM is also a calculated geopolitical strategy to bypass U.S. export bans on top-tier Nvidia GPUs. By reducing the need for HBM and Nvidia’s CUDA ecosystem, DeepSeek’s software design allows frontier AI to run efficiently on domestic, lower-cost, and unsanctioned Chinese storage tiers like NAND flash, commodity SSDs, and LPDDR memory (produced by domestic giants like YMTC and CXMT).
Breakthrough 3: Ultra-Low Footprint Inference via FP4 Quantization-Aware Training (QAT)
To keep compute costs low over massive context windows, DeepSeek moved away from the old approach of scanning bulky, uncompressed numbers every time the model searches its memory. Instead, as detailed in the DeepSeek V4 Technical Report, the model is trained with low-precision FP4 quantization, a form of numeric compression, already applied to the pathways it uses to find information, so it learns to work accurately with the compressed values.
This compression slashes memory demands to deliver a 2x hardware speedup, yet it maintains a near-flawless 99.7% accuracy in how the system targets and indexes specific data blocks. That allows enterprise workflows to process massive, multi-step agent tasks smoothly while keeping 83.5% retrieval accuracy on extreme, million-token “needle-in-a-haystack” benchmarks, eliminating performance lags without draining expensive GPU power.
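The core operation in quantization-aware training is a "fake quantize" step: values are rounded onto a coarse grid and mapped back to floats during the forward pass, so the model learns to tolerate the rounding. The sketch below shows that step alone, using a uniform 4-bit integer grid as a simplified stand-in for the FP4 number format. It is not DeepSeek's code, and the sample weights are made up.

```python
def fake_quantize(values, bits=4):
    """Round values onto a symmetric grid, then map them back to floats.

    With 4 bits, the grid has 7 positive levels, 7 negative levels and zero.
    """
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / levels
    if scale == 0:
        return list(values)
    return [round(v / scale) * scale for v in values]


weights = [0.7, -0.32, 0.1, 0.0]
approx = fake_quantize(weights)
worst = max(abs(w - a) for w, a in zip(weights, approx))
print(approx, "worst rounding error:", worst)
```

Each value now needs only 4 bits plus a shared scale, and the rounding error per value is bounded by half a grid step.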
Breakthrough 4: Ultra-Scale Training Stability via Manifold-Constrained Hyper-Connections (mHC)
Training a 1.6-trillion-parameter model carries instability risk: signals passing through too many data pathways can cascade out of control and crash the run. DeepSeek resolved this with a framework called Manifold-Constrained Hyper-Connections (mHC), which uses a balancing routine to force the model’s internal data tables to always sum to one, a mathematical safety valve that lets complex data move through deep networks without runaway spikes.
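One well-known balancing routine of this kind is Sinkhorn-style normalization, which alternately rescales the rows and columns of a positive matrix until each sums to one. The sketch below assumes that is the sort of routine meant here; the article does not name it, and the input matrix is arbitrary.

```python
def sinkhorn(matrix, iterations=50):
    """Alternately normalize rows and columns of a positive square matrix
    so that every row and every column sums to (approximately) one."""
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iterations):
        for row in m:
            total = sum(row)
            for j in range(n):
                row[j] /= total
        for j in range(n):
            total = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= total
    return m


mixing = sinkhorn([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]])
row_sums = [sum(row) for row in mixing]
col_sums = [sum(mixing[i][j] for i in range(3)) for j in range(3)]
print(row_sums, col_sums)
```

A matrix balanced this way can redistribute a signal across pathways without amplifying its total size, which is the "safety valve" property the paragraph describes.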
DeepSeek’s cache efficiency alters the underlying unit economics for the cloud platforms hosting these models. On developer aggregators like OpenRouter, where third-party providers routinely offer advanced endpoints at a loss to capture developer mindshare, this hardware-software decoupling changes the balance sheet. DeepSeek’s extremely low costs likely let it turn a profit, at least when it comes to serving the model in China, Bercovici said.
This transformation in provider-side unit economics is mirrored on the buy-side, which shows a structural change happening across enterprise IT budgets. VentureBeat’s Q1 2026 AI Infrastructure and Compute tracker survey — which tracks enterprise technology buyers at organizations with over 100 employees (n=53 in January, n=39 in February) across software, financial services, healthcare, and manufacturing sectors — revealed that enterprise adoption of custom, self-managed inference stacks utilizing open-source frameworks like Triton, vLLM, Ray, and Kubernetes surged from 11.3% to 17.9%. Because these software layers allow corporate engineering teams to deploy open-weights architectures natively across their own clusters, they act as an operational escape hatch from closed cloud ecosystems.
This software shift is paired with an aggressive hardware migration: enterprise workloads moving to specialized, inference-first AI clouds like CoreWeave, Lambda, and Crusoe grew from 30.2% to 35.9% in the latest survey window. These infrastructure metrics indicate that corporate technology leaders are no longer just prototyping with open alternatives; they are actively laying down the physical plumbing required to host architectures like DeepSeek V4 independently, increasingly pricing away the premium markup of Western API gatekeepers.
This baseline cost reduction could soon fracture the competitive field in Silicon Valley, by rewriting the expectations for labs attempting to yield a return on massive infrastructure investments.
For now, though, the Silicon Valley music is unlikely to stop anytime soon. Anthropic remains on an extraordinary enterprise trajectory, driven by widespread adoption of Claude Code and its codebase-aware terminal execution. For enterprise engineering teams, paying a premium for Anthropic’s deterministic accuracy makes perfect sense for core production software development. Yet even an elite frontier lab scaling at this pace must watch DeepSeek with caution: an open-weights architecture under an MIT license offering near-frontier utility at a 75% cost reduction places downward pricing pressure on the high-volume operational layers of any multi-agent system.
The primary structural margin squeeze may land more squarely on OpenAI, despite its aggressive pivot toward a multi-cloud footprint. To support its staggering consumer and API token volumes, OpenAI fundamentally altered its historic seven-year exclusive alliance with Microsoft, unbundling its distribution so it can serve models across Azure, Oracle, AWS, and Google Cloud. Yet this multi-cloud strategy, while providing raw capacity at scale, leaves the company intensely exposed to infrastructure commodity pressure.
Unlike Anthropic, which has successfully insulated its margins by embedding its models into premium, high-utility software environments like Claude Code, OpenAI relies on high-volume, general-purpose API token streams for a massive portion of its enterprise revenue. To be fair, Western labs have already begun quietly retreating from this territory, launching deep batch API discounts, prompt caching features, and lightweight entry models to stem the bleed. Yet this tactical retreat only reinforces the structural crisis: Silicon Valley is actively conceding the high-volume commodity layer because it knows it cannot defend the margins. When those same automated background workflows can be handled natively by highly capable open-weights models like DeepSeek V4, a premium price point for raw cloud text completion is no longer defensible.
More significantly, unlike OpenAI or Anthropic, DeepSeek has much less interest in urgently building consumer wrappers or locking developers into subscription frameworks. Instead, DeepSeek is positioned for a longer-term ecosystem play. Supported by a massive state-backed funding round led by China’s “Big Fund” — which has pushed the startup’s targeted valuation into the $10 billion to $45 billion range — the lab’s more likely objective is to prove the viability of a self-sufficient, independent Chinese AI hardware stack that could one day be worth up to $10 trillion.
| Premium deterministic tier (Anthropic / OpenAI / Google) | High-volume agentic tier (DeepSeek / open ecosystems) |
|---|---|
| Core Codebase Refactoring | Recursive Multi-Agent Loops |
| Strict Corporate Compliance & Guardrails | Prefix-Cached Autonomous Tool Swarms |
| Mission-Critical Financial/Legal Precision | Massive Real-Time Ingestion Logs |
| High CapEx / R&D Premium Margins | Bare-Metal / Optimized HBM Economics |

The operational division between Western labs and models like DeepSeek V4 Pro is already showing up. Financial company Ramp benchmarked automated cybersecurity agent swarms and showed that while DeepSeek V4 Pro completely flatlines on the most complex security logic, it achieves a flawless 100% detection rate on high-volume baseline tasks like cloud configuration triage, significantly outperforming OpenAI’s GPT-5.5 (44%). For an enterprise CISO, the strategy is clear: Offload the high-volume token burn of routine background noise to cheap open weights, and reserve premium frontier models strictly for the high-level reasoning required to catch the most sophisticated flaws.
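In code, that two-tier strategy reduces to a routing rule. The sketch below is a minimal illustration; the `Task` fields and the model labels are hypothetical placeholders, not real endpoint names or anything Ramp has published.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    complex_reasoning: bool  # assumed to be set by an upstream triage step


def pick_model(task, cheap_model="open-weights-model",
               premium_model="frontier-model"):
    """Send routine work to the cheap tier, hard reasoning to the premium tier."""
    return premium_model if task.complex_reasoning else cheap_model


queue = [
    Task("cloud configuration triage", complex_reasoning=False),
    Task("multi-stage exploit analysis", complex_reasoning=True),
]
assignments = {task.name: pick_model(task) for task in queue}
print(assignments)
```

The hard part in practice is the triage flag itself: deciding which tasks count as complex is where a misrouted job either wastes money or misses a flaw.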
For IT operations directors and data pipeline managers, the choice to migrate to an open architecture like DeepSeek V4-Pro is a smart governance decision. The open model gives companies total architecture control, allowing them to host it on-premise or via any specialized cloud layer they choose. Crucially, it provides enterprise infrastructure leads with a strategic operational fallback that closed vendors can’t match: the power to download raw model weights and execute them privately for zero marginal token cost if public cloud pricing or API access conditions change.
The assumption that closed frontier labs hold a permanent monopoly on useful enterprise reasoning has collapsed. While engineering directors will continue to pay a premium to protect specialized, deterministic workflows, the financial foundation of the frontier lab model has fundamentally shifted. By diverting the immense, day-to-day token volume of recursive background agents onto highly optimized, open-source clusters, enterprise teams are starving proprietary clouds of their highest-margin fuel. Silicon Valley’s multi-billion dollar token moat didn’t just narrow — it was completely drained from the bottom up.
Merck is using AI agents to cut drug discovery cycles by a third and ship compliant marketing materials up to 80% faster — but VP of Digital Platforms Sean Finnerty says the only reason it’s working is because they built the infrastructure first.
And the pharmaceutical manufacturer is seeing promising early results: AI is generating marketing drafts that are “99% right” when it comes to compliance, shrinking review cycles from months to days and accelerating delivery by 70% to 80%. In the company’s medical research, meanwhile, one AI-assisted discovery cycle was reduced by 33%.
Still, agentic AI only works if companies first build the underlying “plumbing” of digital platforms and services, Finnerty said at a recent AI Impact Series event.
“If we do one-offs, we’re gonna end up with thousands and thousands of things that are ultimately just gonna be debt that we’ll have to deal with later,” he said. “And that’s gonna be a drag on any further innovation.”
Merck’s plumbing-first strategy comes from lessons learned during the early days of cloud in the 2010s “when nobody knew what the heck was going on,” Finnerty said.
Getting the cloud right meant building from the ground up; at Merck, that infrastructure now supports 2,500 AWS accounts, numerous Microsoft Azure subscriptions, and new Google Cloud Platform (GCP) integrations.
“AI is gonna be the same exact thing,” Finnerty said. “We’re going to have thousands and thousands of agents.” The questions then pile up: How do you register them? How do you secure them? How do you ensure they’re connected to the right tools, and have access to the right data and the right context?
Context delivery is also critical; Merck works with three hyperscalers and has 47 edge locations and hundreds of databases. “Many, many petabytes” of structured and unstructured data are stored in Oracle databases, SQL databases, Excel spreadsheets, phone transcripts, and other repositories, Finnerty said.
His team is building scaffolding to deliver meaningful context in various situations, he explained. Data must be organized and ingested into various platforms, because “there’s no one solution to solve every single problem.” Sometimes it’s Databricks, other times it’s Amazon Redshift, “plus four other things.”
The goal is: “Let’s make that easy and frictionless for people to do, and secure it, and make sure it’s well integrated with MCP [model context protocol], and A2A [Agent2Agent], and upstream compute,” Finnerty said. “If you wanna run stuff on GCP or you wanna run stuff on AWS, we’ve got the plumbing in place so you can run your adjacent workloads wherever you want.”
As it builds out its technical plumbing, Merck is experimenting with agents across regulated enterprise operations, scientific discovery workflows, and app modernization.
Notably, AI is accelerating drug discovery. Finnerty explained that scientists look at molecular structures and disease states to determine if a given condition is druggable. But even if a disease state is known, developing a drug to target it can take years.
Now with AI, teams are starting to see “very promising things,” such as cutting one particular research cycle down by one-third. “That’s a year off of the life of the discovery cycle,” Finnerty said. “Which means, theoretically, we can get it to a patient who needs that therapy a year faster.”
Once developed and approved, these products are regulated and marketing materials around them must be clearly and explicitly articulated. “The way you communicate that information per market, per country, per state, per region, is all very carefully governed and regulated,” Finnerty said. It’s also variable: An ad campaign for a vaccine in the state of Georgia looks much different from one launched in Canada.
Historically, humans did the due diligence to make sure the company complied with various laws. Draft materials go through iterations of reviews; when a mistake is discovered, it gets “kicked back to the beginning, and it goes through it again, and then it takes another however many weeks and months,” Finnerty said.
But now, AI can do that “much, much more effectively,” and the process is evolving from human-in-the-loop to essentially “human-as-governor.” With human oversight, AI can deliver a first draft in a day or a week that is 99% there, allowing teams to ship materials up to 80% faster.
Meanwhile, when it comes to app modernization, AI can discover architecture; document data interactions, APIs and network paths; and perform authentication and authorization checks. It can also write Terraform code for deployment and refactor JavaScript into Python.
Where the company would have previously spent weeks and months and hundreds of thousands of dollars to update one application, Finnerty said, agents are now handling the work through prompts.
That’s not to say there aren’t significant challenges. Finnerty noted that his team has run into some “wackiness,” for example in automated code and scenario testing. AI has blatantly made up scenarios, whether due to incorrect context, infrastructure, “or if it was just getting creative with, ‘You should be testing these three functions that don’t even exist in the code that you’re trying to test.’”
“That surprised me a little bit because I thought we were further past some of the hallucination challenges in these later models,” he said.
To address this, his team has engineered guardrails to keep hallucinations to a minimum, essentially using AI to supervise AI and applying confidence scores. So if Claude created the first output, they’ll instruct Microsoft Copilot to assess it.
“So if you ask something once, have AI check it, then ask it a third time, the confidence increases every time, and it minimizes some of the garbage that gets created in the early runs,” Finnerty said.
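The pattern Finnerty describes, one model generating and others checking with a confidence score attached, might look something like the sketch below. The generator and reviewers are stubs standing in for real model calls, and every name here is hypothetical rather than anything Merck has disclosed.

```python
KNOWN_FUNCTIONS = {"parse_order", "compute_tax"}


def stub_generator(prompt):
    # Stand-in for a first model; deliberately invents "apply_coupon".
    return ["parse_order", "compute_tax", "apply_coupon"]


def functions_exist(prompt, proposed):
    # Reviewer 1: every function named in the draft must exist in the code.
    return all(name in KNOWN_FUNCTIONS for name in proposed)


def draft_not_empty(prompt, proposed):
    # Reviewer 2: a trivial sanity check.
    return len(proposed) > 0


def cross_check(prompt, generator, reviewers, threshold):
    """Generate a draft, have each reviewer vote, and score the agreement."""
    draft = generator(prompt)
    approvals = sum(1 for review in reviewers if review(prompt, draft))
    confidence = approvals / len(reviewers)
    return draft, confidence, confidence >= threshold


draft, confidence, accepted = cross_check(
    "Propose functions to test", stub_generator,
    [functions_exist, draft_not_empty], threshold=1.0,
)
print(draft, confidence, accepted)
```

Here the draft is rejected because one reviewer catches a function that does not exist, the same failure Finnerty's team ran into.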
Meanwhile, at Mastercard, Chief Data Officer Andrew Reiskind and his team are focusing agentic experimentation on highly orchestrated transaction and dispute workflows. As he noted, a chargeback or fraud dispute is not a single event.
When a consumer disputes a charge (typically online), that “kicks off an entire other process on the back-end that tends to be very labor-intensive,” Reiskind said.
Mastercard has to collect specifics about the actual dispute; then the merchant has its own investigations (Was the card reported as lost or stolen? Does the consumer dispute charges often?). Further, the network sitting in the middle has its own rules for timing and information submission.
“You have each and every one of these steps, many of which are unstructured, but there are also structured data elements to this,” Reiskind said. Whether a card was lost or stolen tends to be structured, but the consumer complaint is “unstructured data of questionable reliability.”
“So you’re sitting there with a decisioning system that has deterministic decisions, but also probabilistic decisions,” he said.
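A decisioning flow that mixes the two kinds of signal could be sketched as follows. The rule, the credibility score and the threshold are all hypothetical, chosen to illustrate the structure Reiskind describes rather than Mastercard's actual policy.

```python
def decide_dispute(card_reported_stolen, complaint_credibility,
                   auto_threshold=0.9):
    """Combine a deterministic rule with a probabilistic score.

    card_reported_stolen: structured, reliable field.
    complaint_credibility: assumed model score in [0, 1] derived from
    the unstructured complaint text.
    """
    if card_reported_stolen:
        return "refund"  # deterministic: structured data settles it
    if complaint_credibility >= auto_threshold:
        return "refund"  # probabilistic, but confident enough to automate
    return "human_review"  # anything uncertain goes back to a person


print(decide_dispute(True, 0.10))
print(decide_dispute(False, 0.95))
print(decide_dispute(False, 0.50))
```

Routing every uncertain case to a human is one conservative way to avoid the reputational cost of wrongly doubting a customer.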
AI agents can speed up and potentially resolve this process, but deploying them raises complex questions: Which tasks are you handing off to agents? When are they kicking things back to human reps? How many agents are you ultimately using? What are the cost implications?
Then there are reputational questions and costs: Have you just called a consumer potentially a liar when they weren’t lying?
“It’s an exact problem where you want to, as a bank, maintain trust with your consumer,” Reiskind said. “But you also wanna make this efficient and take costs out of the system.”
There’s always going to be risk with AI, and enterprises should assess it from the beginning of product design, Reiskind said. There’s also the question of acceptable risk.
As an example: Did you serve a customer a peanut butter and jelly sandwich instead of a turkey sandwich (a minor inconvenience)? Or did you serve gluten to someone with celiac disease?
“Is it an acceptable risk if one percent of the time it makes the mistake? If it is, let’s go to the next stage of how you’re mitigating that risk,” Reiskind said.
Leaders must perform cost-benefit analysis, break problems down to their “constituent pieces,” and calculate cost for each one. But these are estimates; it’s near-impossible to forecast real usage, Reiskind said. “It is not a simple process to get to the cost,” he said. “But it is doable.”
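Breaking a workflow into constituent pieces and pricing each one can be sketched as simple arithmetic. Every number below (step names, call counts, per-call costs, the 1% error rate and the cost of a mistake) is invented for illustration.

```python
def expected_monthly_cost(steps, error_rate, cost_per_error, cases_per_month):
    """Sum per-case step costs, add the expected cost of mistakes,
    and scale by monthly volume."""
    per_case = sum(step["calls"] * step["cost_per_call"] for step in steps)
    expected_error_cost = error_rate * cost_per_error
    return cases_per_month * (per_case + expected_error_cost)


steps = [
    {"name": "collect_dispute_details", "calls": 2, "cost_per_call": 0.01},
    {"name": "check_card_status", "calls": 1, "cost_per_call": 0.002},
    {"name": "draft_decision", "calls": 3, "cost_per_call": 0.02},
]
total = expected_monthly_cost(steps, error_rate=0.01, cost_per_error=50.0,
                              cases_per_month=10_000)
print(f"${total:,.2f} per month")
```

In this made-up case the expected cost of mistakes outweighs the agent calls themselves, which is why the acceptable-risk question comes before the cost question.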
Resolve AI, the production-operations startup backed by Greylock and Lightspeed Venture Partners, today announced a sweeping expansion of its platform that introduces always-on background agents, a redesigned investigation architecture, and a shared workspace where engineers and AI agents collaborate in real time on live incidents.
The centerpiece of the release is a new multi-agent investigation system developed by Resolve AI’s in-house research lab. Instead of deploying a single AI agent to diagnose a production failure — analogous to a lone engineer pulling an on-call shift — the platform now dispatches a coordinated team of specialized agents that pursue multiple hypotheses in parallel, independently verify each other’s conclusions, and construct complete causal chains from root cause to symptom. The company says the architecture delivers more than a twofold improvement in root cause accuracy on its internal evaluation benchmarks compared to earlier versions of its platform.
“Think of a single agent being on call, the way a human would be,” Resolve AI CEO and co-founder Spiros Xanthos told VentureBeat in an exclusive interview ahead of the announcement. “We now have a team of agents that all work together, almost like a team of humans debugging an issue, and that has improved quality by 2x.”
The announcement arrives at a moment of acute tension in the software industry. AI-powered code generation has exploded in adoption, enabling engineering teams to ship dramatically more software than they could two years ago. But keeping that software running in production — debugging it when it breaks, monitoring it after deployment, auditing its health — remains overwhelmingly manual. For a company that raised a $125 million Series A at a $1 billion valuation earlier this year, Resolve AI is making a direct bet that the operational side of the software lifecycle is the next major frontier for AI investment.
Any accuracy claim from a startup warrants scrutiny, and Xanthos was candid about both the scale and limitations of the evaluation. The 2x figure comes from internal benchmarks, not a third-party audit, though the evaluation set was built to mirror the complexity that Resolve AI’s enterprise customers encounter daily.
“These are very hard, complex evals that we built over time to represent real-world examples,” Xanthos explained. “This is not customer data, but these evals represent difficult cases similar to what we’ve seen at some of the largest tech companies we work with.” He described the set as comprising hundreds of cases that reflect the kinds of production failures encountered at companies like Coinbase, Salesforce, DoorDash, and Zscaler — all named Resolve AI customers.
The practical impact of that accuracy gain is significant. Resolve AI’s agents now act as first responders for every on-call alert, typically triaging within five minutes before a human engineer even becomes involved. In previous public disclosures, the company has cited DoorDash reducing time to root cause by up to 87 percent. When asked to contextualize that figure, Xanthos described the typical baseline.
“When something goes wrong, it might take five to 10 minutes for a human to even get their laptop and connect,” he said. “The typical MTTR is in the tens of minutes, sometimes hours, depending on severity. So an improvement of 80-plus percent — four to five times faster — is actually huge. It’s something we’ve never achieved before with AI, tools, data, or observability.”
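The relationship between a percentage reduction in time and a speedup factor is easy to check: cutting time by a fraction r makes the process 1 / (1 - r) times faster. A minimal sketch:

```python
def speedup_from_reduction(reduction):
    """Convert a fractional time reduction into a speedup factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1 / (1 - reduction)


for r in (0.80, 0.87):
    print(f"{r:.0%} less time -> {speedup_from_reduction(r):.1f}x faster")
```

By this arithmetic, an 80% reduction is a fivefold speedup, and the 87% figure cited for DoorDash works out to roughly 7.7 times faster.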
One of the core challenges in applying large language models to high-stakes production environments is their tendency to generate plausible-sounding but incorrect answers — a failure mode that, in the context of a live outage, could send an engineering team chasing the wrong fix while a service stays down.
Xanthos acknowledged this directly. “This is a very common issue with models out of the box,” he said. “They always try to give you an answer, and if they don’t have enough evidence, they’ll give you the best possible answer — which is likely to be wrong.”
Resolve AI’s countermeasure is a system of layered verification among its agents. Each agent investigating a hypothesis must cite every piece of evidence it relies on and present that evidence to another agent for independent review. The investigating agent must construct the full causal chain — from root cause to symptom — and peer agents actively attempt to disprove the theory by identifying gaps in the logic.
“Often, agents actually disprove those theories because they find gaps,” Xanthos said. “There are many layers of defense and agentic checks that allow Resolve to be very accurate and not mislead.”
Equally important, he said, is the system’s willingness to say it does not know. “The bar to actually saying ‘I have the answer’ is very high. In those cases, it will say, ‘This is the evidence I found. Here are three or four paths you can take from here, but I wasn’t able to fully prove that this is the problem.’ A system like this that operates in production cannot be a black box.” In domains where wrong answers carry operational consequences, calibrated uncertainty can be more valuable than confident outputs. For an AI system integrated into an incident-response workflow, confidently pointing engineers in the wrong direction during a customer-facing outage could compound the very harm it was designed to prevent.
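The verification logic described here, where every step in a causal chain must be backed by cited evidence and the system declines to name a cause otherwise, can be sketched in miniature. The data structures and function names are hypothetical and are not Resolve AI's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    claim: str
    causal_chain: list                            # steps from root cause to symptom
    evidence: dict = field(default_factory=dict)  # step -> cited evidence


def find_gaps(hypothesis):
    """Peer review: list the chain steps that have no cited evidence."""
    return [s for s in hypothesis.causal_chain if not hypothesis.evidence.get(s)]


def conclude(hypotheses):
    """Name a root cause only if exactly one hypothesis survives review."""
    proven = [h for h in hypotheses if not find_gaps(h)]
    if len(proven) == 1:
        return {"status": "root_cause", "claim": proven[0].claim}
    return {"status": "unproven", "paths": [h.claim for h in hypotheses]}


bad_deploy = Hypothesis(
    "bad deploy", ["deploy", "error spike"],
    {"deploy": "release log entry", "error spike": "dashboard query"},
)
db_failover = Hypothesis(
    "database failover", ["failover", "timeouts"],
    {"timeouts": "trace sample"},  # no evidence for the failover itself
)
print(conclude([bad_deploy, db_failover]))
print(conclude([db_failover]))
```

The second call shows the "I don't know" path: with a gap in the only chain, the function returns the open leads instead of a verdict.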
Beyond incident response, Resolve AI is introducing a new class of background agents designed to handle the continuous, often invisible operational work that engineering teams are expected to perform but struggle to sustain at scale.
These agents run on schedules or wake automatically in response to events — a new deployment, a fired alert, a merged pull request — and accumulate institutional knowledge from every investigation and human interaction over time. When an engineer opens the Resolve AI interface, agents have already been working: pre-investigating priority issues, monitoring deployments, auditing alert hygiene, flagging configuration drift, and surfacing cost anomalies.
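Agents that wake in response to events follow a familiar publish-and-subscribe shape. The sketch below is a generic illustration of that shape; the scheduler class, event names and agent functions are hypothetical and say nothing about how Resolve AI's platform is built.

```python
from collections import defaultdict


class AgentScheduler:
    """Maps event types to the agents that should wake when they occur."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event_type, agent):
        self._handlers[event_type].append(agent)

    def emit(self, event_type, payload):
        # Run every agent registered for this event; ignore unknown events.
        return [agent(payload) for agent in self._handlers.get(event_type, [])]


def deployment_monitor(payload):
    return f"watching deployment {payload['id']}"


def alert_auditor(payload):
    return f"auditing alert {payload['id']}"


scheduler = AgentScheduler()
scheduler.on("deployment", deployment_monitor)
scheduler.on("alert_fired", alert_auditor)

results = scheduler.emit("deployment", {"id": "d-42"})
print(results)
```

Scheduled runs fit the same model if a timer is treated as just another event source.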
Xanthos drew a distinction between background agents and the incident-response agents that have been Resolve AI’s primary offering. “You can now have these agents run in the background at all times — not only when a human asks an agent to debug a problem or when an alert fires,” he said. “A lot of our customers are now monitoring changes that land in production before they cause an issue. There’s an agent that monitors those all the time.”
He described these background agents as “general-purpose SRE agents that are available to every developer,” capable of handling tasks that range from monitoring infrastructure changes that might increase cloud costs to performing post-incident follow-up work like generating code fixes based on incident learnings. The concept addresses a structural problem in software operations: the daily tasks required to keep production systems healthy — monitoring deployments, investigating alerts, tracking changes across complex environments — are critical but reactive and manual. Engineering organizations know this work needs to happen, but it competes for attention with feature development. Automated agents that perform this work continuously could shift teams from reactive firefighting to proactive operational management.
The third major component of the release is what the company calls a shared investigation surface — a workspace where engineers and AI agents work from the same live evidence during an active incident. Reports update dynamically as investigations evolve. Every finding is inspectable. Engineers can explore side investigations without interrupting the primary workflow. Source queries are pullable and modifiable in place, evidence is embedded directly into the workspace, and remediation actions can be triggered from the same interface without switching tools.
“Think of it as an interface to all the production tools, but also an interface where humans and agents can collaborate with each other — or agents with agents,” Xanthos said. “That’s what gradually leads to more trust and more automation, because you work with the agent, you teach it, you see the results.”
The company is also making its platform available as a REST API and an MCP (Model Context Protocol) server, enabling engineering teams to integrate Resolve AI into broader agentic workflows and infrastructure. According to Xanthos, this is already happening in practice. “A general-purpose agent that a company has built — when it comes to debugging, that agent could invoke Resolve,” he said. “Or somebody works on their coding agent on the laptop, and Resolve shows up there as an MCP. If there is some production-related activity, the coding agent can invoke it.” The interoperability play signals that Resolve AI sees itself not as a closed system but as a specialized node in a broader ecosystem of AI agents that will increasingly hand off tasks to one another — a pattern Xanthos compared to the open architecture of the web rather than the walled-garden model of an app store.
The agentic operations space has become crowded in the past year. Datadog, PagerDuty, and major cloud providers have all announced AI-augmented operations capabilities. When asked what separates Resolve AI from these incumbents, Xanthos pointed to the depth of the company’s technical foundation.
“We’re operating at the frontier here. There’s no blueprint for how you build a system like Resolve,” he said. He noted that he and co-founder Mayank Agarwal co-created OpenTelemetry, the most widely adopted open-source project in observability, which now serves as the de facto standard for collecting metrics, logs, and traces from modern software systems.
Xanthos also highlighted the company’s recent AI Lab, led by a researcher he described as the former post-training lead for Meta’s Llama models. “He managed to combine deep expertise of production observability with AI and models, and I think that’s very unique,” Xanthos said. “I don’t believe any other company, whether it comes from an observability background or it’s a startup, has all of that together.”
The company’s structural defenses, according to Xanthos, include a full environment model that Resolve builds for each customer, a memory system that learns within the customer’s specific production environment, and its multi-agent architecture. The lab is now post-training frontier models on production-specific data — the kind of procedural knowledge that experienced engineers use to debug production issues but that does not appear in standard model training sets. This approach reflects an increasingly common pattern among AI application companies: using frontier foundation models as a base layer but investing heavily in domain-specific fine-tuning, retrieval, and agent architectures to achieve accuracy levels that general-purpose models cannot reach alone.
Resolve AI’s pricing model departs from traditional enterprise software licensing. The company sells credits that are consumed when agents perform work — an outcome-based approach that ties cost directly to value delivered.
“We’re not selling software,” Xanthos said. “The way you buy and use Resolve is by buying credits that are consumed when Resolve performs an action. It’s outcome-based. Only when Resolve troubleshoots an alert — that’s the only time that it consumes credits.”
He addressed the cost question head-on, arguing that Resolve AI is actually cheaper than the alternative of building a similar system from scratch using frontier models and MCP integrations. “If you were to take Opus or GPT-5.4 and try to build a solution like Resolve with MCPs, we measured — you actually end up consuming a lot more in tokens than what you have to pay Resolve, because our system is very optimized in terms of context, in terms of how it reads time-series data.”
As for the always-on background agents, Xanthos said their continuous nature does not inherently add to cost. “The background agent doesn’t mean it does intensive work all the time. It means that it can be there; you can give it any task you want. A lot of these tasks are triggered based on some action: an alert happens, somebody merges a PR, and you want to see if it has an impact on production.”
For enterprise customers in regulated industries, the Coinbases and Zscalers of the world, data residency and security are non-negotiable. Resolve AI accommodates this with a flexible deployment model: the data plane sits wherever the customer’s existing tools already live, while the inference layer can run as a standard SaaS deployment or inside a customer-specific VPC. “We designed Resolve to work with the large enterprises where security standards are the highest,” Xanthos said. “There are many measures we take to ensure Resolve is secure, including not retaining data.”
The question of whether engineering teams will trust AI agents to take autonomous action in production — rolling back a deployment, adding capacity, generating a pull request — is one of the defining cultural challenges of this technology wave. Xanthos drew an analogy to autonomous vehicles.
“For us to allow a car to drive on its own on the street, we have to prove that it’s safer than a human. Agents in production is a very similar concept,” he said. He acknowledged that not every customer is comfortable with agents taking automated action, but described a gradient of trust that he expects to evolve rapidly.
“There is a set of actions that are relatively risk-free that most tech companies probably are comfortable having an agent take, and probably there is another set of actions for which the human has to approve,” he said. “But as quality keeps climbing the way we see at Resolve, I would say we’re going to cross the threshold this year where most of the actions will be taken by an agent automatically.”
He described the typical adoption arc: companies begin with agents providing recommendations, then a human decides whether to press the button. Over weeks or months, trust builds incrementally. “I don’t think this is a problem where we just let the agents run wild from the beginning,” Xanthos said. The incremental approach mirrors how enterprise technology adoption has always worked — from cloud migration to container orchestration, organizations move at the speed of trust, not the speed of capability.
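The gradient of trust Xanthos describes maps naturally onto a simple approval gate. What follows is a minimal sketch under stated assumptions, not Resolve AI’s implementation: the `ProposedAction` type, the `low_risk` label, and the choice of which actions count as low risk are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    name: str
    low_risk: bool  # hypothetical label a team would assign per action type


def next_step(action, allow_auto_low_risk, human_approved=False):
    """Return what happens to an agent's proposed action under the current policy."""
    if action.low_risk and allow_auto_low_risk:
        return "execute"  # the agent acts on its own
    if human_approved:
        return "execute"  # a person pressed the button
    return "await_approval"  # the agent's output stays a recommendation


rollback = ProposedAction("roll_back_deployment", low_risk=False)
open_pr = ProposedAction("open_pull_request", low_risk=True)

# Early adoption: everything is a recommendation until a human approves it.
early = [next_step(a, allow_auto_low_risk=False) for a in (rollback, open_pr)]
# Later: low-risk actions run automatically, riskier ones still wait.
later = [next_step(a, allow_auto_low_risk=True) for a in (rollback, open_pr)]
print(early, later)
```

Widening automation over time then amounts to flipping the policy flag or relabeling action types, which is one way to picture trust building incrementally.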
Perhaps the most provocative argument in Resolve AI’s thesis is that the explosion of AI-generated code is actually intensifying the production-operations problem. In a recent LinkedIn post, Xanthos framed the dynamic in stark terms, arguing that engineering leaders who celebrate faster code shipping without investing in production operations are effectively having their senior engineers “subsidize velocity” through increased incident-response burden.
In his interview with VentureBeat, he returned to this theme. “Now that coding agents are producing code, we produce a lot more code that we’re less familiar with — humans are less familiar with — so you need the AI to be the defense,” he said.
This framing positions Resolve AI not merely as a productivity tool but as a necessary counterweight to the AI coding revolution. As organizations deploy more code, written by tools that their engineers may not fully understand, running against production systems those engineers did not build, the argument is that the operational complexity — and the consequences of failure — will grow proportionally. On the Stack Overflow Podcast last October, Xanthos put numbers to this claim, estimating that engineers spend upwards of 70 percent of their time maintaining and troubleshooting production systems rather than building new features. “We’re facing a new crisis where we’re building faster than we can operate,” he said in that conversation.
Resolve AI was founded in early 2024 by Xanthos and Agarwal, who first met during their PhD programs at the University of Illinois and have worked together for more than a decade. Xanthos previously co-founded Pattern Insight (acquired by VMware) and Omnition (acquired by Splunk), where the pair helped create OpenTelemetry. The company raised a $35 million seed round from Greylock in 2024, followed by the $125 million Series A led by Lightspeed at a $1 billion valuation earlier this year. Named customers include Coinbase, DoorDash, MSCI, Salesforce, MongoDB, and Zscaler.
Xanthos’s long-term vision is expansive. “Over the long run, once agent ability surpasses that of a human software engineer, the end result is a lot more technology and a lot more software,” he said. “It’s not actually fewer people working on it. It’s technology becoming cheaper, becoming more accessible, producing a lot more technology for the benefit of the world.”
That vision will take years to realize. But the more immediate promise of today’s announcement comes down to something every on-call engineer understands viscerally: the 2 a.m. page, the scramble for a laptop, the frantic search through dashboards and logs for an answer that might take minutes or might take hours. Resolve AI is betting that the next time that alert fires, a team of agents will have already investigated, verified, and documented the root cause before the engineer’s phone even lights up. For a profession that has long measured its nights by mean time to resolution, the question is no longer whether AI can help — it is whether engineers will let it.
Less than a week after completing the largest tech IPO of 2026, Cerebras Systems is making its most aggressive play yet to dominate the fast-growing AI inference market. On Monday, the Sunnyvale-based chipmaker announced that it is now running Kimi K2.6 — a trillion-parameter open-weight model developed by Beijing-based Moonshot AI — for enterprise customers at nearly 1,000 tokens per second, a speed no GPU-based provider has come close to matching.
The result, independently verified by benchmarking firm Artificial Analysis, clocked in at 981 output tokens per second, making Cerebras 6.7 times faster than the next-fastest GPU-based cloud provider and 23 times faster than the median. For a standard agentic coding request involving 10,000 input tokens, Cerebras delivered the full response — including prompt processing, reasoning, and 500 output tokens — in 5.6 seconds, compared to 163.7 seconds on the official Kimi endpoint. That’s a 29-fold improvement in time to final answer.
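The ratios in those benchmark figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported above; the throughput values for the next-fastest and median providers are derived from the stated multiples rather than reported directly, and the last line assumes the 500 answer tokens stream at the measured 981 tokens per second.

```python
# Figures reported in the article.
cerebras_total_s = 5.6          # time to final answer on Cerebras
kimi_endpoint_total_s = 163.7   # time to final answer on the official Kimi endpoint
cerebras_tps = 981              # measured output tokens per second
final_answer_tokens = 500

speedup = kimi_endpoint_total_s / cerebras_total_s

# Derived, not reported: throughput implied by the "6.7 times" and "23 times" ratios.
implied_next_fastest_tps = cerebras_tps / 6.7
implied_median_tps = cerebras_tps / 23

# Assumption: the 500 answer tokens stream at the measured output rate.
answer_stream_s = final_answer_tokens / cerebras_tps

print(f"time-to-answer speedup: {speedup:.1f}x")
print(f"implied next-fastest GPU provider: {implied_next_fastest_tps:.0f} tokens/s")
print(f"implied median provider: {implied_median_tps:.0f} tokens/s")
print(f"time to stream the final answer: {answer_stream_s:.2f} s")
```

Under those assumptions, streaming the answer itself takes roughly half a second, which suggests most of the 5.6 seconds goes to prompt processing and reasoning.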
“We’re really wanting to be very clear and show that we can do the largest models,” James Wang, Cerebras’ director of product marketing, told VentureBeat in an exclusive interview ahead of the announcement. “In this case, Kimi K2.6 — a trillion-parameter MoE model on the wafer-scale architecture — and it runs also at this same incredible speed that we’re famous for.”
The announcement marks a critical inflection point for Cerebras, which has long battled a perception that its unorthodox wafer-scale chips, while blindingly fast, could only handle small and mid-sized models. Kimi K2.6 is the first trillion-parameter open-weight model the company has ever served in production. And with a freshly minted $95 billion market cap and $5.55 billion in IPO proceeds burning a hole in its balance sheet, Cerebras is signaling to Wall Street that it intends to compete not just at the frontier of speed, but at the frontier of model scale.
The choice of Kimi K2.6 reflects both a technical milestone and a commercial calculus. Released on April 20 by Moonshot AI — a Beijing-based company founded in 2023 by Tsinghua University alumni and dubbed one of China’s “AI Tiger” companies — K2.6 is a trillion-parameter Mixture-of-Experts model that has rapidly established itself as the most capable open-weight model available for coding and agentic tasks. The model tops SWE-Bench Pro at 58.6, outperforming Claude Opus 4.6 and matching GPT-5.4, while posting leading scores on agentic benchmarks like Humanity’s Last Exam and DeepSearchQA. Its architecture uses 32 billion activated parameters per token out of a total of 1 trillion, with 384 experts, of which 8 are selected plus 1 shared per forward pass, operating over a 256,000-token context window.
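To make the expert counts concrete, here is a toy illustration of top-k expert selection using the figures above: 384 experts, 8 selected, plus 1 shared. It is a simplified sketch, not Moonshot AI’s router. The scores are random stand-ins, and real routers run per token in each MoE layer and weight the selected experts’ outputs, which is omitted here.

```python
import random

NUM_EXPERTS = 384  # routed experts, per the article
TOP_K = 8          # experts selected for each token
SHARED_EXPERTS = 1 # always-on shared expert


def route_token(router_scores, top_k=TOP_K):
    """Return the indices of the top_k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return ranked[:top_k]


rng = random.Random(0)
scores = [rng.random() for _ in range(NUM_EXPERTS)]  # stand-in router scores
selected = route_token(scores)

experts_run = len(selected) + SHARED_EXPERTS
active_fraction = 32e9 / 1e12  # activated parameters / total parameters

print(f"experts run for this token: {experts_run} of {NUM_EXPERTS + SHARED_EXPERTS}")
print(f"share of parameters active per token: {active_fraction:.1%}")
```

The second figure is the practical point: only about 3.2 percent of the model’s parameters participate in any one token, which is why a trillion-parameter MoE model can be far cheaper to run than a dense model of the same size.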
In practical terms, K2.6 is one of the first open-weight models that enterprises can plausibly use as a drop-in replacement for expensive, capacity-constrained closed-source APIs from Anthropic and OpenAI, particularly for the coding and agentic workloads that have become the highest-value application of large language models. The 2.6 release extends the model’s capabilities from front-end design into full-stack workflows, including authentication, database operations, and long-horizon agent execution.
Wang was blunt about what is driving enterprise interest. “They’re very motivated, first of all, to have an alternative to Anthropic,” he told VentureBeat. “Anthropic’s models are fantastic. I use them. I’m sure you probably use them. But they’re quite expensive, and they’re constantly running out of capacity.” He described a personal experience in which an application running on Anthropic’s API failed over a weekend because it ran out of capacity — an anecdote that, he said, resonates deeply with enterprise buyers.
The geopolitical dimension of this arrangement is worth noting, however. Kimi K2.6 is a Chinese-developed model being served by an American chipmaker to American enterprise customers. Moonshot AI operates out of Beijing, and K2.6’s adoption in the West arrives during a period of heightened scrutiny of Chinese AI companies in the U.S. market. Enterprise buyers with strict compliance requirements — particularly those in financial services, healthcare, and defense — will need to evaluate this dimension alongside the model’s technical capabilities.
Understanding why Cerebras can achieve these speeds requires understanding what makes its hardware fundamentally different from anything else on the market. Most AI inference today runs on clusters of Nvidia GPUs — typically organized in racks of 72 GPUs, what Nvidia markets as the NVL72 configuration. In these setups, the model’s parameters are distributed across many discrete chips connected by high-speed networking fabric. Data must constantly shuttle between chips, and the interconnect bandwidth between GPUs becomes a bottleneck, particularly for large models with hundreds of billions or trillions of parameters.
Cerebras takes a radically different approach. Its Wafer-Scale Engine 3 is a single chip the size of an entire silicon wafer — roughly the size of a dinner plate — containing 44 gigabytes of on-chip SRAM. Unlike the high-bandwidth memory used in GPUs, SRAM sits directly on the processor die, offering dramatically lower latency and higher bandwidth for data access. For Kimi K2.6, Cerebras stores the model’s weights in their original 4-bit precision while performing computation at 16-bit floating point. The weights are distributed across multiple wafers in a cluster of approximately 20 CS-3 systems, with activations streamed between them. Critically, all the experts for a given MoE layer are placed on the same wafer, meaning the all-to-all communication required for expert routing happens at SRAM speeds. According to Cerebras’ technical description, the on-wafer network fabric delivers over 200 times the bandwidth of NVLink on NVL72.
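A back-of-the-envelope calculation shows why the model has to span many wafers. The sketch below assumes decimal gigabytes, treats every parameter as stored at 4 bits, assumes one wafer per CS-3 system, and ignores activations and cache; those simplifications are assumptions for illustration, not figures from Cerebras.

```python
import math

total_params = 1_000_000_000_000  # 1 trillion parameters
bits_per_weight = 4               # stored precision, per the article
sram_per_wafer_gb = 44            # on-chip SRAM per Wafer-Scale Engine 3
systems_in_cluster = 20           # "approximately 20 CS-3 systems"

weight_gb = total_params * bits_per_weight / 8 / 1e9
min_wafers_for_weights = math.ceil(weight_gb / sram_per_wafer_gb)
cluster_sram_gb = systems_in_cluster * sram_per_wafer_gb
headroom_gb = cluster_sram_gb - weight_gb

print(f"weights at 4-bit: {weight_gb:.0f} GB")
print(f"minimum wafers to hold the weights alone: {min_wafers_for_weights}")
print(f"SRAM across {systems_in_cluster} systems: {cluster_sram_gb} GB")
print(f"left over for everything else: {headroom_gb:.0f} GB")
```

On these assumptions the weights alone need about 500 GB, or at least 12 wafers, so a cluster of roughly 20 systems leaves a few hundred gigabytes of SRAM for activations and other working data.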
Wang explained the architecture using an analogy. “Our single units are much larger and much higher capacity — they’re on the order of 20 racks, as opposed to 72 GPUs,” he said. Each layer in the transformer can, in effect, serve a separate user simultaneously. “They’re just like a queue, like you’re queuing for bagels or something — they’re all occupying a different part of the hardware. But because they move across so fast, the actual experience, tokens per second, single user, on your end is still what you’re used to.” Combined with custom kernels and speculative decoding, this allows Cerebras to serve the trillion-parameter MoE model at close to 1,000 tokens per second — a speed the company calls a world record achievable only with wafer-scale hardware.
Cerebras is not opening K2.6 to the general public. Instead, the company is positioning this as an enterprise-first offering, with Fortune 500 companies in software, financial services, and healthcare currently running cloud trials of their production workloads on the platform. “These are logos that you’ve definitely heard of,” Wang said, though he declined to identify specific customers due to confidentiality agreements.
The enterprise-first approach is deliberate. Cerebras has historically prioritized its largest customers over its consumer-facing API, in part because of hardware capacity constraints. “Everyone is in a capacity crunch. We prioritize our enterprise customers, so we don’t show it in the consumer-facing gateway or the API, where you get very unpredictable traffic, where a single user can, in effect, take over your whole cluster,” Wang explained. Serving K2.6 also limits the company’s ability to simultaneously offer other large models. “We can’t simultaneously, you know, have six other models,” he acknowledged. “It’s just kind of a mutual constraint of reality.”
On pricing, Wang said that while the enterprise deployment does not carry public pricing, the company’s costs are broadly competitive with GPU-based providers. “On all the models we have served with pricing, the pricing is very comparable — maybe in the middle, kind of middle-upper range of GPU pricing,” he said. “It’s not like, because we run fast, it costs many, many fold more.” He drew a line, however, at the lowest end of the market: if you are willing to run K2.6 at 20 tokens per second on bargain GPU infrastructure, Cerebras will not try to compete on price. “We’re an automaker in the pickup truck market. We don’t do that market,” Wang said. For speed-sensitive workloads — particularly agentic coding, where developers wait in real time for the model to generate and iterate on code — the value proposition is straightforward: comparable per-token cost, but an order of magnitude faster delivery.
Cerebras’ announcement arrives at a pivotal moment in the AI chip industry, one in which the inference market is rapidly overtaking training as the most commercially important compute workload. As AI agents proliferate in enterprise software, the speed of inference directly determines how useful those agents are in practice — and the competitive pressures are intensifying accordingly.
The most significant competitive development in recent months was Nvidia’s acquisition of Groq for $20 billion, a deal that gave the GPU giant access to proprietary inference technology built around specialized Language Processing Units. Wang referenced the deal directly. “I think Nvidia is now sensing fast inference is an extremely important market,” he told VentureBeat. “That’s why they’re willing to spend $20 billion on acquiring a company like that.”
But Wang expressed confidence that Cerebras’ architectural advantages are durable. Both Nvidia and Cerebras operate on roughly annual hardware refresh cycles. “We refresh our hardware on a periodic cycle. You will hear some news about that from us soon,” Wang said, hinting at a forthcoming hardware announcement without providing details. On the software side, Wang pointed to the company’s track record of rapidly adapting to the fast-evolving open-weight model ecosystem. “We started with Llama, we supported all the Qwen models, and then when developers told us they wanted GLM, we brought GLM online. And now they’re telling us Kimi is the best — so we’re giving them Kimi,” he said. “At the same time, we’ve also supported the best companies in running their closed models — OpenAI, Cognition, Mistral.”
The mention of OpenAI underscores one of the most unusual business relationships in the AI industry. OpenAI and Cerebras struck a deal in early 2026 reportedly worth more than $20 billion for computing capacity and related services. Wang confirmed that Cerebras serves OpenAI’s “internal coding models forthcoming” but declined to disclose specifics, as neither party has publicly detailed the technical arrangement.
Wang framed the K2.6 deployment as a stepping stone, not a destination. Cerebras started serving inference in late 2024 with relatively small models and has spent over a year scaling from 70 billion parameters to 1 trillion-plus. “We couldn’t have launched that in November 2024,” he said. “But we’re there now.”
The company’s next challenge is to move from serving the best open-weight frontier model to serving the best frontier models, period — including closed-source models from the likes of Anthropic and OpenAI that sit at the absolute top of the intelligence leaderboards. “This is the first open-weight frontier one that we now have clear demonstrated evidence for,” Wang said. “I think over the course of the year, you will see us serving true frontier, frontier at the speed that we’re famous for. And you should hold us up for that.”
When asked whether the current rollout would be overtaken by the pace of hardware improvement at Nvidia and others, Wang was unfazed. “Nvidia has a very clear roadmap. They publish every year at GTC. They’re roughly on a yearly product cycle, and so are we. You will hear some news about that from us soon,” he said, hinting at new hardware without offering details.
He also addressed the question of vendor lock-in — a concern that any CTO evaluating a single-vendor inference provider would raise. “These enterprises rarely commit fully to one vendor,” Wang said. “They have strategies to make sure that some traffic can go to us, some traffic can go to someone else, and there’s load balancing between the two. This is not a new problem. This is just generally how you manage cloud resources.”
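The traffic-splitting strategy Wang describes is commonly implemented as weighted routing with failover. Here is a minimal sketch, assuming two interchangeable providers with a 70/30 split; the provider names, the weights, and the `healthy` set are hypothetical and not drawn from any Cerebras customer.

```python
import random


def pick_provider(weights, healthy, rng):
    """Choose a provider at random in proportion to its weight, skipping unhealthy ones."""
    candidates = [p for p in weights if p in healthy]
    if not candidates:
        raise RuntimeError("no healthy inference provider available")
    return rng.choices(candidates, weights=[weights[p] for p in candidates], k=1)[0]


weights = {"provider_a": 0.7, "provider_b": 0.3}  # hypothetical traffic split
rng = random.Random(42)  # seeded so the sketch is repeatable

sample = [pick_provider(weights, {"provider_a", "provider_b"}, rng)
          for _ in range(10_000)]
share_a = sample.count("provider_a") / len(sample)

# If provider_a is out of capacity, all traffic falls through to provider_b.
failover = pick_provider(weights, {"provider_b"}, rng)

print(f"share of requests sent to provider_a: {share_a:.2f}")
print(f"choice when provider_a is unavailable: {failover}")
```

Because the split lives in configuration rather than application code, shifting traffic between vendors is a matter of changing the weights.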
The pitch, ultimately, is about more than speeds and feeds. Wang sees the AI industry converging on a world in which autonomous agents — not human developers — are the primary consumers of inference compute, and in which the speed of those agents determines competitive outcomes for the companies that deploy them. “The world economy is kind of getting rebuilt on agents,” Wang said. “Speed will determine who wins or loses.”
It is a bold claim from a company that, until last week, had never traded on a public exchange. But for Cerebras, the logic is straightforward: if the future of enterprise software is built by AI agents that think at the speed of their hardware, then the company that provides the fastest hardware provides the fastest thinking. And in a market where enterprises are spending billions to shave seconds off their AI response times, a company that can serve a trillion-parameter model in the time it takes to pour a cup of coffee might just have the most compelling pitch in Silicon Valley.
Generative AI’s rapid transition from text-based chatbots to high-fidelity media—spanning images, video, spatial 3D, and audio—has exposed a glaring bottleneck in the modern tech stack: infrastructure. Rendering pixels in real-time requires a staggering amount of compute, and developers are increasingly struggling to manage fragmented GPU clusters just to keep their applications online.
Enter fal, a generative media creation platform that has quietly become the connective tissue for 2.5 million developers across the globe, offering literally hundreds of leading AI image, video, and audio creation and editing models — from proprietary ones like OpenAI’s ChatGPT-Images-2.0 and Google’s Nano Banana Pro 2 to open source rivals — all through its unified interface and APIs.
Today, the San Francisco-based startup, recently valued at a massive $4.5 billion following a $300 million Series D round led by Sequoia Capital, announced it has selected Amazon Web Services (AWS) as its preferred cloud provider.
While the financial terms of the deal weren’t made public, the move signals a maturation in the generative media space, shifting the focus from simply building foundational models to effectively scaling them for mass, commercial consumption.
“AWS has been there for distribution and monetization, and for the use of AI in creative pursuits — helping designers, developers, and the creative community think through how they can use AI responsibly, scalably, and at global scale,” said Samira Panah Bakhtiar, General Manager for Media, Entertainment, Games, and Sports at AWS, in an exclusive interview with VentureBeat.
At its core, fal operates as a unified gateway to the rapidly expanding generative AI ecosystem. Rather than forcing developers to provision their own servers, deal with latency issues, or string together disparate open-source model weights, fal provides a single, unified API. Through this API, users gain instant access to over 1,000 production-ready AI models.
Think of it as the Stripe or Plaid of generative media: abstracting away the devastatingly complex back-end plumbing so developers can focus solely on the user experience.
It is a “plug-and-play” solution that has already attracted independent creators and enterprise giants alike, powering generative workflows for enterprises including Canva, Adobe, and Amazon MGM Studios.
“Generative media workloads demand a fundamentally different infrastructure layer, one that can handle massive parallel inference, rapid model iteration, and production-grade reliability at scale,” said Gorkem Yurtseven, CTO and Co-founder of fal, in a statement provided to VentureBeat.
Neither AWS nor fal specified which cloud or GPU providers fal relied on before the deal. Asked directly, Bakhtiar did not name a prior provider, saying instead that fal is now using AWS services.
In a blog post, fal’s Head of Compute Partnerships Emir Lise described AWS as providing the “global scale and reliability layer” for its existing serverless generative-media infrastructure — framing the partnership around elasticity, reliability and enterprise scale rather than a replacement of a named incumbent.
A public search turned up Tigris as a storage provider for fal, with Tigris saying fal runs a “global fleet of GPUs across many clouds.” It also turned up an announcement from fal in September 2025 that it was available through Google Cloud Marketplace, allowing customers to buy fal through Google Cloud billing and governance. That listing does not state that Google Cloud powered fal’s GPU infrastructure.
By partnering with AWS, fal aims to merge its highly optimized inference engine with Amazon’s global reach to handle millions of daily API calls with 99.99% guaranteed uptime.
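For a sense of what a 99.99% uptime figure allows, the arithmetic is straightforward. The sketch below assumes a 365-day year and a 30-day month; how the guarantee is actually measured and enforced was not disclosed.

```python
uptime = 0.9999

minutes_per_year = 365 * 24 * 60   # 525,600
minutes_per_month = 30 * 24 * 60   # 43,200, assuming a 30-day month

downtime_year_min = (1 - uptime) * minutes_per_year
downtime_month_min = (1 - uptime) * minutes_per_month

print(f"allowed downtime per year: {downtime_year_min:.2f} minutes")
print(f"allowed downtime per 30-day month: {downtime_month_min:.2f} minutes")
```

That works out to under an hour of downtime per year, or a little more than four minutes in a 30-day month.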
In addition, Bakhtiar said fal users can expect to see “faster inference and performance, greater efficiency, more scalability, and more seamless service continuity — all things you would expect as a result of partnering with the world’s largest, broadly adopted cloud.”
Therefore, the primary benefit for fal users is better performance and reliability without changing how they work: faster inference, more scalability, smoother continuity, and access to production-ready AI models without managing their own infrastructure.
For fal, the partnership makes its platform stronger for creators, studios, and enterprise customers by backing it with AWS’s security, global scale, and cloud infrastructure.
For AWS, it helps push cloud and AI deeper into creative production, not just distribution or monetization. It positions AWS as a key infrastructure partner for studios, media companies, developers, and individual creators building AI-powered content workflows.
The partnership with AWS is designed to address the sheer physics and cost of rendering generative media. By migrating its operations to AWS, fal will be able to leverage Amazon’s broad suite of AI services, including the Bedrock platform, alongside custom-built silicon like Trainium and Graviton processors.
“You don’t have to manage like a GPU fleet to use the AI for creative pursuits,” Bakhtiar explained.
This is a critical pain point for large-scale media generation in 2026. Securing high-performance GPUs for parallel inference is both expensive and technically demanding.
By shifting that burden to AWS, fal ensures that creatives can focus on their workflows, without needing a dedicated DevOps team.
Bakhtiar also noted the powerful “network effect” of building on AWS. Because major studios and creative platforms (like Adobe and Canva) are already deeply entrenched in the AWS ecosystem, integrating fal’s API into their existing pipelines becomes a frictionless endeavor.
For IT leaders and developers, fal’s architecture offers a distinct advantage regarding licensing, security, and deployment.
Historically, utilizing frontier generative models meant either accepting strict vendor lock-in from a single provider or attempting to host open-source models locally.
The latter requires significant overhead and forces enterprises to navigate a minefield of disparate open-source licenses (such as MIT, Apache 2.0, or restrictive non-commercial licenses).
fal bypasses this friction by offering commercial API access to a curated ecosystem of models. Developers simply pay for the inference they consume.
Furthermore, the platform is SOC 2 compliant and explicitly built for “enterprise scale,” meaning it meets the stringent data privacy and security benchmarks required by heavily regulated industries and massive consumer platforms.
For large media conglomerates, this managed service approach allows them to experiment with the latest state-of-the-art tools securely, without the risk of exposing proprietary data or intellectual property.
The true impact of fal’s platform, however, is best observed at the developer level. By democratizing access to high-end infrastructure, fal is enabling a new class of builders—often referred to as “vibe coders”—to create complex, multimodal applications without traditional computer science backgrounds.
As Bakhtiar pointed out, access to these tools fundamentally “levels the playing field.” Whether it is an individual developer or hobbyist vibe coding a side project, or a fully funded editor or director rendering a blockbuster film, the underlying technology is now identical, infinitely scalable, and ready for production.
“More creatives — whether they’re full-fledged studios, indie brands, or individual content creators — are now going to be able to access these tools, and they’re going to be able to punch way above their weight as a result,” Bakhtiar said, casting the partnership as a way to serve even more users through fal thanks to the reliability of AWS’s servers and custom Trainium, Graviton and Inferentia chips.
The rollout of enhanced AWS capabilities for fal customers will occur in phases throughout 2026.
For a quarter century, the Google search box has been one of the most recognizable interfaces in computing: a thin white rectangle, a blinking cursor, a few typed words, and a list of blue links. On Tuesday, Google will formally retire that paradigm.
At its annual I/O developer conference, Google announced a sweeping redesign of the search box itself — the literal text field where billions of queries begin every day — transforming it from a simple keyword input into a dynamic, AI-driven conversation starter that can accept text, images, PDFs, videos, and even open Chrome tabs as inputs. The company is also merging its AI Overviews and AI Mode features into a single, seamless search flow, eliminating the friction that previously forced users to choose between a traditional results page and an AI-forward experience.
Liz Reid, Google’s vice president and head of Search, called it “the biggest upgrade to our iconic search box since its debut over 25 years ago” during a press briefing on Monday.
The announcement arrived alongside a blizzard of other news — new Gemini models, a personal AI agent called Spark, an intelligent shopping cart, a reimagined developer platform — but the search box redesign may prove to be the most consequential. It is the clearest signal yet that Google views the future of its flagship product not as a place where users type fragmented keywords, but as an interface where they hold open-ended, multimodal conversations with an AI system backed by the entire web.
The changes show a fundamental shift in how Google expects people to interact with the product that generates the vast majority of Alphabet’s revenue.
The box itself now dynamically expands to accommodate longer, more conversational queries. Where the old interface subtly encouraged brevity — a narrow field suited to two- or three-word keyword strings — the new design invites users to fully articulate complex questions in granular detail. It also now supports multimodal inputs directly. Users can upload images, PDFs, files, and videos, or drag in content from Chrome tabs, right from the main search interface. Previously, some of these capabilities existed in AI Mode, but reaching them required extra steps. Now they sit at the primary entry point.
Google is also deploying what it describes as an AI-powered query suggestion system that “goes beyond autocomplete.” Rather than simply predicting the next word a user might type based on popular searches, the system helps users formulate complex, nuanced queries — essentially coaching them toward the kind of detailed questions that AI Mode handles best.
The new search box begins rolling out immediately in all countries and languages where AI Mode is available.
Perhaps more significant than the box itself is the architectural change happening behind it. Google is unifying AI Overviews — the AI-generated summary panels that appear atop traditional search results — with AI Mode, the more immersive conversational search experience the company launched at I/O one year ago.
Starting Tuesday, this merged experience will be live across mobile and desktop worldwide. A user can type a question, receive an AI Overview alongside traditional results, and then continue directly into a back-and-forth AI Mode conversation to ask follow-up questions — all without navigating to a separate interface.
Reid explained the logic during the press briefing: the new AI search box is “an upgrade of our traditional search box, and so the results take you directly to main search rather than AI Mode.” She noted that while some power users actively sought out AI Mode, “for most users, they don’t actually want to have to think about, do they want more of a traditional page or an AI-forward search experience.”
The goal, she said, was to ensure that “for most users, they don’t have to think about where to go, they can just go to the search box they’re familiar with, and it feels like they get the best experience afterwards.”
Google’s decision to redesign the foundational interface of its most important product did not happen in a vacuum. The company shared a set of usage statistics during the briefing that reveal just how rapidly user behavior is already changing.
AI Mode, which launched in the United States at I/O 2025, has surpassed one billion monthly users in its first year. AI Mode queries have been doubling every quarter since launch. AI Overviews, the lighter-weight AI summaries, now reach more than 2.5 billion monthly users. And overall search query volume hit an all-time high last quarter — a data point the company had previously disclosed on its earnings call.
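To put the doubling claim in perspective, a short Python sketch shows how quickly quarterly doubling compounds. The four-quarter horizon is an assumption based on the roughly one year since launch, and the function name is illustrative. Google did not share absolute query counts, so the output is a relative multiple only.

```python
def growth_multiple(quarters: int, factor: float = 2.0) -> float:
    """Relative growth after compounding `factor` once per quarter."""
    return factor ** quarters

# Assumption: about four full quarters have passed since the I/O 2025 launch.
multiples = {q: growth_multiple(q) for q in range(1, 5)}
for quarter, multiple in multiples.items():
    print(f"after {quarter} doubling(s): {multiple:.0f}x the starting volume")
```

Under that assumption, four consecutive doublings would put query volume at 16 times its starting level.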
Sundar Pichai, Google’s CEO, framed these figures as evidence that AI features are additive, not cannibalistic, to search usage. “When people use our AI-powered features in search, they use search more,” he said. He added that he loves “how search has become less about individual queries and feels more like an ongoing conversation, giving users deeper insights and connecting you with the vastness of the web.”
Reid reinforced the point: “It’s not just that people are searching more, it’s that they’re searching differently. They’re fully expressing their questions in granular detail, asking those follow-up questions and searching across modalities.”
Under the hood, the new search experience runs on Gemini 3.5 Flash, Google’s newest AI model, which the company also introduced at I/O. Google upgraded AI Mode’s underlying model to 3.5 Flash to deliver what Reid described as “an even more powerful AI search experience.”
Gemini 3.5 Flash is the workhorse of this year’s announcements. Google claims it outperforms its previous frontier model, Gemini 3.1 Pro, on nearly all benchmarks while generating output tokens four times faster than comparable frontier models. Pichai described it as being “in a league of its own in the top right quadrant” of the Artificial Analysis index, which plots intelligence against speed. In other words, it pairs near-frontier quality with dramatically higher output speed.
That speed matters enormously for search. A conversational AI search experience that feels sluggish would be dead on arrival for a product that serves billions of queries daily. By coupling the redesigned interface with a model optimized for both quality and throughput, Google is attempting to make AI-powered search feel as instantaneous as the old keyword experience — while being dramatically more capable.
The redesigned search box is also the gateway to a set of new capabilities that push search far beyond text-based answers. Google announced what it calls “generative UI” — the ability for search to dynamically build custom widgets, interactive visualizations, and even mini applications in real time, tailored to a user’s specific question.
Reid offered a concrete example during the briefing: a user could ask “How do black holes affect space time?” and receive an interactive visual in an AI Overview that brings the concept to life. Follow-up questions would trigger the system to dynamically generate entirely new visuals in real time. This is possible, she explained, because of “a novel real-time code generation system we built in partnership with the Google DeepMind team” that runs on Gemini 3.5 Flash. Generative UI capabilities will roll out to everyone this summer, free of charge.
But Google is going further still. For ongoing tasks such as planning a wedding, organizing a move, or tracking a fitness routine, users will be able to build what the company describes as customizable, stateful experiences within search, powered by its Antigravity development platform. These require no coding expertise. Users simply describe what they want in natural language, and search builds it. Those experiences will be available in the coming months, starting with Google AI Pro and Ultra subscribers in the United States.
The redesign also opens the door to what Google calls “information agents” — AI agents that users can configure directly within search to monitor the web 24/7 for specific conditions and deliver synthesized updates when those conditions are met.
A user could, for example, set up an agent to track market movements in a particular sector with specific parameters. The agent would create a monitoring plan, tap into real-time finance data, and proactively notify the user when conditions are met — complete with links and context for further research. Other use cases include apartment hunting, tracking sneaker drops, or monitoring any topic a user cares about. Information agents will launch first for Google AI Pro and Ultra subscribers this summer.
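Google has not published an interface for information agents, so the following Python sketch is purely illustrative. It models the general pattern described above: a user-defined condition, a stream of incoming data, and a notification when the condition is met. Every name in it (`InformationAgent`, `check`, the sample readings) is hypothetical and not drawn from any Google product.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

@dataclass
class InformationAgent:
    """Hypothetical monitor: watches a topic and records alerts."""
    topic: str
    condition: Callable[[float], bool]   # user-specified trigger
    notifications: List[str] = field(default_factory=list)

    def check(self, readings: Iterable[float]) -> None:
        # Evaluate each new data point against the user's condition.
        for value in readings:
            if self.condition(value):
                self.notifications.append(
                    f"{self.topic}: condition met at {value}"
                )

# Made-up sample data standing in for a real-time finance feed.
agent = InformationAgent("semiconductor index", lambda v: v >= 105.0)
agent.check([98.2, 101.7, 105.3])
print(agent.notifications)
```

A production system would presumably add scheduling, source attribution, and summarization on top of this loop, but the core idea of pairing a standing condition with a data feed is the same.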
These agents sit within a much larger strategic pivot that Google articulated throughout the briefing: the company is going all-in on AI systems that don’t just answer questions but proactively take actions on users’ behalf. Beyond search, Google introduced Gemini Spark, a 24/7 personal AI agent that runs on dedicated virtual machines in Google Cloud. It unveiled the Universal Cart, an intelligent cross-merchant shopping cart. It announced the Agent Payments Protocol for agents to make secure purchases. And it expanded its Antigravity developer platform into a full ecosystem for building autonomous AI agents.
The redesign raises profound questions for the sprawling ecosystem — publishers, advertisers, SEO professionals — that has been built around the old model of keyword search and blue links.
If users increasingly express their needs as full, conversational sentences rather than fragmented keywords, the entire discipline of search engine optimization will need to evolve. Keyword-density strategies become less relevant when the AI is parsing natural language intent rather than matching strings. Content that answers deep, nuanced questions in authoritative ways becomes more valuable; content engineered to rank for two-word keyword fragments becomes less so.
For publishers, the stakes are existential. AI Overviews already synthesize information from across the web and present it directly in search results, reducing the need for users to click through to source material. The new seamless AI Mode integration deepens that dynamic: users can now get an AI-generated answer and ask multiple follow-up questions without ever leaving the search page. Google has consistently maintained that its AI features drive more traffic to publishers, but the redesign puts that claim under renewed scrutiny as the search results page becomes more self-contained.
For advertisers — who fund the vast majority of Google’s revenue — the shift from keywords to conversations changes the calculus of ad targeting. Conversational queries contain richer intent signals, which could make ad targeting more precise and valuable. But they also create new ambiguities: when a user is in the middle of a multi-turn conversation with AI Mode, where does an ad naturally fit? Google did not detail changes to its advertising model during the briefing, but the structural shift in the interface will inevitably reshape how ads are surfaced and measured.
There is a reason Google chose to redesign the search box rather than simply adding new features behind it. The search box is not just a product element at this point; it is a cultural artifact — one of the few pieces of digital infrastructure used by essentially the entire internet-connected world. Changing it sends an unmistakable message about where the company believes computing is headed.
For 25 years, the search box trained billions of people to think in keywords — to compress their curiosity into the shortest possible string of words. The new box invites them to do the opposite: to think out loud, to upload what they’re looking at, to ask follow-up questions, to let an AI system handle the compression.
Pichai tied the company’s broader ambitions to a striking statistic: Google’s surfaces now process over 3.2 quadrillion tokens per month, up seven-fold from a year ago. The company expects capital expenditures of approximately $180 to $190 billion in 2026 — roughly six times the $31 billion it spent four years ago — largely to support the infrastructure required for this AI transformation. When asked about the future of traditional search, he was direct. “Search is the most used AI product in the world,” he said.
The blinking cursor in Google’s search box still invites you to type. But after 25 years of teaching the world to speak in keywords, Google is now asking it to speak in sentences — and betting roughly $190 billion that it will.
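The figures Pichai cited can be sanity-checked with a few lines of arithmetic. The Python sketch below uses only the numbers quoted in the briefing; the variable names are illustrative.

```python
# Figures as quoted in the briefing.
tokens_per_month_now = 3.2e15            # 3.2 quadrillion tokens per month
token_growth = 7                         # up seven-fold from a year ago
capex_2026_low, capex_2026_high = 180e9, 190e9
capex_four_years_ago = 31e9

# Implied monthly token volume one year earlier.
tokens_per_month_year_ago = tokens_per_month_now / token_growth

# How many times larger the 2026 capex range is than the earlier figure.
ratio_low = capex_2026_low / capex_four_years_ago
ratio_high = capex_2026_high / capex_four_years_ago

print(f"implied tokens/month a year ago: {tokens_per_month_year_ago:.2e}")
print(f"capex multiple: {ratio_low:.1f}x to {ratio_high:.1f}x")
```

The implied year-ago volume is about 0.46 quadrillion tokens per month, and the capex range works out to between 5.8 and 6.1 times the earlier figure, consistent with the “roughly six times” characterization.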