Google says Gemini 3.5 Flash can slash enterprise AI costs by more than $1 billion a year

Google unveiled Gemini 3.5 Flash at its annual I/O developer conference on Tuesday, a new artificial intelligence model that the company says shatters what had become a seemingly iron law of the AI industry: that the smartest models must also be the slowest and most expensive to run.

The model sits at the center of a sweeping set of announcements — from a video-generating “world model” called Gemini Omni to a 24/7 personal AI agent called Gemini Spark — but 3.5 Flash carries perhaps the most immediate consequence for the enterprises pouring billions of dollars into AI infrastructure. Sundar Pichai, Google’s chief executive, told reporters during a press briefing Monday that companies running roughly one trillion tokens per day on Google Cloud could save more than $1 billion annually by shifting 80 percent of their workloads to a mix of Flash and other frontier models.

“You’ve probably heard anecdotes from other CIOs that companies are already blowing through their annual token budgets, and it’s only May,” Pichai said, framing the model not just as a technical achievement but as a financial lifeline for organizations struggling with the runaway costs of deploying AI at scale.

The claim, if it holds, would be one of the most significant shifts in the economics of enterprise AI since large language models entered corporate computing.

Why enterprises have been forced to choose between AI quality and AI speed

For the past three years, organizations adopting generative AI have faced a painful trade-off. The most capable models — the ones that can reason through complex multistep problems, write reliable code, and parse dense financial documents — tend to be large, slow, and expensive to query. Faster, cheaper models sacrifice accuracy. Chief information officers have been forced into a kind of AI portfolio management: routing simple queries to lightweight models and reserving the heavy-duty reasoning engines for high-stakes tasks. It is a complex, brittle system that adds engineering overhead and often delivers inconsistent user experiences.

Gemini 3.5 Flash attacks that trade-off directly. According to Google’s internal benchmarks and a third-party analysis from Artificial Analysis, the model outperforms Google’s own Gemini 3.1 Pro — a model the company positioned as its top-tier flagship just four to five months ago — on nearly every major benchmark. It scores 76.2 percent on Terminal-Bench 2.1, reaches 1656 Elo on GDPval-AA, hits 83.6 percent on MCP Atlas, and leads in multimodal understanding with 84.2 percent on CharXiv Reasoning.

Yet it does all of this while generating output tokens at four times the speed of comparable frontier models from competitors. Koray Kavukcuoglu, chief technology officer of Google DeepMind and chief AI architect for Google, told reporters the team has pushed even further: “We have developed an even more optimized version of Flash, not just four times, but actually 12 times faster with the same quality.” That turbo variant is available starting Tuesday inside Antigravity, Google’s agentic development platform.

Pichai put the performance gap in blunt terms: “3.5 Flash is better than 3.1 Pro, which was just four months ago, and it’s at the almost, I would say, 90% of the performance of frontier models, 4x faster, much faster in Antigravity, maybe 12x, and about 1/3 to one half the cost.”

Artificial Analysis places Flash alone in the “top-right quadrant” of its intelligence-versus-speed index, a position no competing model currently holds.

The trillion-token math behind Google’s $1 billion savings claim

To understand why Flash matters so much to enterprise buyers, you need to understand the economics of tokens — the fundamental units of data that AI models process. Every query a customer service chatbot answers, every legal document an AI summarizes, every line of code an agent writes, consumes tokens. And at frontier-model pricing, those tokens add up fast.

Google says its model APIs now process around 19 billion tokens per minute. Across all of Google’s own surfaces — Search, the Gemini app, Workspace, and more — the company processes over 3.2 quadrillion tokens per month, a figure that has jumped seven-fold in the past year alone. Two years ago, at I/O 2024, the number was 9.7 trillion per month.

The explosion in token consumption is not unique to Google. Enterprises across industries are discovering that the more capable their AI deployments become, the more tokens they burn. Agentic workflows — where AI systems autonomously execute multistep tasks, call tools, write and run code, and iterate on their own output — are particularly token-hungry. A single agentic coding session can consume orders of magnitude more tokens than a simple question-and-answer exchange.

This is where Flash’s cost advantage becomes transformative. The model delivers what Google describes as frontier-level capabilities at less than half the price, in some cases almost a third the price, of comparable frontier models. For a hypothetical enterprise processing one trillion tokens per day on Google Cloud — a scale Pichai said top customers are already reaching — the savings from shifting 80 percent of workloads to a Flash-and-frontier blend would exceed $1 billion per year.
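The arithmetic behind a claim of this size can be sketched in a few lines. This is only an illustration: Google did not publish the per-token prices behind its estimate, so the two prices passed to `annual_savings` are hypothetical placeholders. Only the one trillion tokens per day, the 80 percent share, and the $1 billion target come from the briefing.

```python
DAYS_PER_YEAR = 365

def shifted_millions_per_year(tokens_per_day, shifted_share):
    """Millions of tokens per year that move to the cheaper blend."""
    return tokens_per_day * shifted_share * DAYS_PER_YEAR / 1_000_000

def annual_savings(tokens_per_day, shifted_share, baseline_price, blended_price):
    """Yearly savings in dollars; prices are dollars per million tokens."""
    volume = shifted_millions_per_year(tokens_per_day, shifted_share)
    return volume * (baseline_price - blended_price)

def breakeven_price_gap(target_savings, tokens_per_day, shifted_share):
    """Price gap (dollars per million tokens) needed to reach a savings target."""
    return target_savings / shifted_millions_per_year(tokens_per_day, shifted_share)

# Figures from the briefing: one trillion tokens per day, 80 percent shifted.
TOKENS_PER_DAY = 1_000_000_000_000
SHIFTED_SHARE = 0.80

gap = breakeven_price_gap(1_000_000_000, TOKENS_PER_DAY, SHIFTED_SHARE)
print(f"Break-even gap: ${gap:.2f} per million tokens")

# Hypothetical prices, NOT Google's: a $10.00 baseline versus a $5.00 blend.
savings = annual_savings(TOKENS_PER_DAY, SHIFTED_SHARE, 10.00, 5.00)
print(f"Savings at the hypothetical prices: ${savings / 1e9:.2f} billion per year")
```

Under these assumptions, a blended price about $3.42 per million tokens below the baseline is enough to clear the $1 billion mark, which shows how sensitive the headline figure is to the workload mix and the prices actually paid.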

That is not a rounding error. It is the kind of number that reshapes procurement decisions, accelerates deployment timelines, and fundamentally alters the return-on-investment calculus for AI initiatives that many boards of directors have been scrutinizing with increasing impatience.

How Google’s own engineers created a data flywheel that rivals cannot easily copy

Perhaps the most strategically significant detail Google shared Tuesday was not a benchmark score or a price point. It was a chart showing the company’s own internal token consumption on Antigravity 2.0, its reimagined agentic development platform.

In March 2026, Google’s developers were processing roughly half a trillion tokens per day inside Antigravity. By the time of the I/O press briefing in mid-May, that figure had surged past three trillion — a six-fold increase in approximately ten weeks, with usage doubling “literally every few weeks,” according to Pichai.
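Whether a six-fold rise in ten weeks really amounts to doubling “every few weeks” can be checked quickly. The sketch below assumes steady exponential growth over the whole period, an assumption Google did not state. The function name is illustrative.

```python
import math

def doubling_time(growth_factor, elapsed_weeks):
    """Weeks per doubling, assuming steady exponential growth."""
    return elapsed_weeks * math.log(2) / math.log(growth_factor)

# From the briefing: roughly 0.5 trillion to 3 trillion tokens per day
# over approximately ten weeks.
weeks = doubling_time(3.0 / 0.5, 10)
print(f"Implied doubling time: {weeks:.1f} weeks")
```

The implied doubling time is a little under four weeks, consistent with Pichai’s description.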

This internal usage creates what AI researchers call a data flywheel: the more Google’s own engineers use 3.5 Flash to build products, the more real-world signal the model team collects on where the model excels and where it stumbles. That signal feeds back into model improvement, which makes the model more useful, which drives more usage, which generates more signal. It is a virtuous cycle — and it is one that competing AI labs, which rely primarily on external developer usage and synthetic benchmarks, cannot easily replicate at the same speed or fidelity.

“That scale creates a powerful feedback loop, and that is what has allowed us to keep improving the 3.5 series of models,” Pichai said.

When pressed during the Q&A about the competitive frontier — particularly in light of recent advances from rival labs — Pichai acknowledged the landscape is “very dynamic” and “moving fast” but expressed confidence in Google’s breadth. He added that the company’s focus with the 3.5 series has been on “taking the model intelligence, making sure tool use, instruction following, long horizon use cases, agent decoding all work well.”

Kavukcuoglu reinforced the agentic emphasis, noting that 3.5 Flash “can now handle multi-hour autonomous sessions” and “can independently execute complex coding pipelines or manage iterative research projects entirely by itself.” The team, he said, even tested the model by having agents build a working operating system entirely from scratch.

Antigravity 2.0 transforms Google’s code editor into an agent command center

The arrival of 3.5 Flash is tightly coupled with the launch of Antigravity 2.0, a significant expansion of the agentic development platform Google first introduced six months ago. What began as a coding environment has evolved into what Google describes as a full platform for developing and managing teams of autonomous AI agents, and the company says millions of developers are already building with it.

Antigravity 2.0 ships as a new standalone desktop application that serves as a central hub for orchestrating multiple agents simultaneously. Google offered the example of running one agent to code a website, a second to generate brand assets, and a third to plan product architecture — all in parallel, all managed from a single interface. For developers who prefer command-line workflows, there is Antigravity CLI. And for those building programmatic integrations, the new Antigravity SDK provides direct access to the same agent harness powering Google’s own first-party products.

The co-development of 3.5 Flash and Antigravity 2.0 is no accident. “We have co-developed 3.5 Flash together with Google Antigravity, our agentic development platform,” Kavukcuoglu said. This tight integration means Flash’s strengths — speed, tool use, long-context reasoning, and code generation — are specifically tuned for the kinds of workloads developers execute inside the platform.

Google is also launching Managed Agents in the Gemini API, allowing developers to spin up an agent with a single API call that reasons, uses tools, and executes code in an isolated Linux environment. And it introduced CodeMender, an AI security agent that uses Gemini’s advanced reasoning to automatically find and fix critical code vulnerabilities — a capability Kavukcuoglu described as essential as agentic systems write an increasing share of the world’s code.

Google’s $190 billion infrastructure bet and the custom silicon powering cheaper AI

The models and platforms sit atop a staggering infrastructure investment that Pichai revealed during the briefing: Google expects capital expenditures of approximately $180 billion to $190 billion in 2026 — roughly six times the $31 billion the company spent in 2022, just four years ago.

A key component of that spending is custom silicon. The company recently unveiled its eighth generation of Tensor Processing Units, adopting for the first time a dual-chip architecture with specialized designs for training (TPU 8o) and inference (TPU 8i). Google says it can now distribute model training across multiple data center sites using a system called Pathways, scaling beyond one million TPUs globally — a setup the company claims constitutes the largest training cluster in the world.

“This means training larger, more capable models in weeks, rather than months,” Pichai said. The infrastructure advantage matters enormously for Flash’s economics. Custom silicon optimized for inference means Google can run Flash at lower cost per token than competitors relying on general-purpose GPUs, and the savings get passed along — at least partially — to customers.

The capex figure also signals something strategic about Google’s long-term posture. While some investors have grown nervous about the astronomical sums cloud providers are spending on AI infrastructure, Google is framing the spending as a competitive moat. The more infrastructure it builds, the cheaper it can run inference, the more attractive its models become, and the more usage it captures to improve the next generation. It is the flywheel logic again, extended from software all the way down to silicon.

Gemini Omni, Spark, and the consumer products Flash now powers at massive scale

While the enterprise cost story dominates the Flash narrative, Google also made sweeping moves on the consumer side that put the model to work across products reaching billions of people. Flash is now the default model powering the Gemini app — which has surpassed 900 million monthly active users, more than doubling from 400 million a year ago — and AI Mode in Google Search, which has crossed one billion monthly users in its first year.

Google introduced Gemini Spark, a 24/7 personal AI agent that runs on dedicated virtual machines in Google Cloud and operates in the background even when a user’s device is off. Powered by 3.5 Flash with the full Antigravity harness, Spark integrates with Gmail, Docs, Sheets, and Slides. Josh Woodward, who leads Google Labs and the Gemini app, described the experience vividly: “When you use it, it almost feels like you’re tossing things over your shoulder, Spark’s catching them and gets the job done.” On the safety front, Spark requires explicit user approval before high-stakes actions. Google also announced the Agent Payments Protocol, which lets users set strict guardrails — approved brands, spending caps, specific merchants — before an agent can spend money on their behalf. Woodward compared the design to “giving a teenager their first debit card — there’s sort of limits and sort of constraints around it.”

Alongside Flash, Google unveiled Gemini Omni, a model capable of generating any output from any input, starting with video. Kavukcuoglu drew a sharp distinction from Google’s existing Veo model: “Veo is a text-to-video model. Omni is a true and true multi-model input, multi-model output model.” All Omni-generated content carries Google’s SynthID watermark, and the company announced that OpenAI, Kakao, and ElevenLabs are adopting SynthID as well.

The company also reimagined its search box for the first time in over 25 years, introduced information agents that monitor the web around the clock for user-defined conditions, and launched the Universal Cart — an AI-powered cross-merchant shopping cart built on Google Wallet. Liz Reid, who leads Google Search, called the new search box “the biggest upgrade to our iconic search box since its debut.”

What Google’s six-month model cadence means for the enterprise AI cost curve

Google signaled that 3.5 Flash is just the opening act of the 3.5 series. Gemini 3.5 Pro is currently in internal testing and will roll out to everyone next month. Kavukcuoglu indicated the company has been operating on roughly a six-month cadence for major model updates — Gemini 3 in November, 3.5 in May — and expects that rhythm to continue.

When a reporter from The New York Times asked how Google determines whether a release warrants a full numerical jump or a half-step increment, Kavukcuoglu said the numbering reflects the magnitude of research progress: “What defines the numbering update is really the progress that we see in our research and how it is reflected in the models and the impact that they have.”

For enterprise buyers, that cadence carries an important implication: the cost-performance curve is not just improving — it is improving on a predictable schedule. A model that outperforms the previous flagship at a third the cost every six months fundamentally changes the planning horizon for AI investments. It means the token budgets that companies are blowing through today may look quaint by the end of the year.

Google’s announcements arrive at a moment of intense competition. OpenAI, Anthropic, Meta, and a constellation of smaller labs are all racing to deliver models that balance capability with cost. Microsoft has been aggressively integrating OpenAI’s models into Azure and Copilot. But Google benefits from a structural advantage that is easy to overlook: distribution. With 13 products serving more than a billion users each — five of which exceed three billion — Google can deploy Flash to an audience no pure-play AI lab can match. Every improvement immediately benefits Search, Gmail, Docs, Maps, and YouTube. And the usage data flowing back from those billions of interactions feeds the very flywheel that makes the next model better.

The question now is whether the $1 billion savings figure — an eye-catching projection based on a specific workload mix — will survive contact with the messy reality of corporate AI deployments, where legacy systems, compliance requirements, and organizational inertia have a way of blunting even the most compelling cost curves. But if Google’s own internal usage is any guide — three trillion tokens a day and climbing, doubling every few weeks, with no sign of slowing — the company is not just selling the bet. It is making the bet itself, with its own engineers, on its own infrastructure, at a scale no customer has yet attempted. In the AI cost wars, the most persuasive pitch may simply be: we did it first.

Cerebras stock nearly doubles on day one as AI chipmaker hits $100 billion — what it means for AI infrastructure

Cerebras Systems, the Silicon Valley chipmaker that built the world’s largest commercial AI processor, erupted onto the Nasdaq on Wednesday, opening at $350 per share — nearly double its $185 IPO price — and rocketing past a $100 billion market capitalization in its first hours of trading. The debut instantly crowned Cerebras as one of the most valuable semiconductor companies on Earth and validated a decade-long bet that the AI industry would eventually demand a fundamentally different kind of chip.

The company sold 30 million shares at $185 apiece, raising $5.55 billion in what Bloomberg reported as the largest U.S. tech IPO since Uber went public in 2019. The final pricing shattered expectations: Cerebras initially marketed shares at $115 to $125, then raised the range to $150 to $160 as investor demand surged, before ultimately pricing above even that elevated band.

“This is just a new beginning,” Julie Choi, Senior Vice President and Chief Marketing Officer at Cerebras, told VentureBeat in an exclusive interview on the morning of the IPO. The company, she said, plans to pour its fresh capital into expanding the cloud infrastructure that has become the centerpiece of its growth strategy. “With this new capital, we’re going to fill more data halls with Cerebras systems to power the world’s fastest inference.”

The IPO caps one of the most dramatic corporate turnarounds in recent tech history. Cerebras first filed to go public in September 2024 but withdrew the effort more than a year later amid intense scrutiny over its near-total revenue dependence on a single customer in the United Arab Emirates. The company refiled in April 2026 with a radically different business profile: new partnerships with OpenAI and Amazon Web Services, a fast-growing cloud inference service, and a revenue base that had climbed 76% to $510 million in 2025.

How a dinner-plate-sized chip became the foundation of a $100 billion company

To understand the frenzy, you have to understand the silicon.

Cerebras builds something called the Wafer-Scale Engine, or WSE — a single processor that occupies an entire silicon wafer, the dinner-plate-sized disc from which ordinary chips are cut. The third-generation WSE-3 contains 4 trillion transistors, 900,000 compute cores, and 44 gigabytes of on-chip memory. It is 58 times larger than Nvidia’s B200 “Blackwell” chip and delivers 2,625 times more memory bandwidth than the B200 package, according to the company’s S-1 filing with the Securities and Exchange Commission.

That bandwidth advantage matters enormously for AI inference, the process of running a trained model to generate answers. When a large language model produces text, it predicts one token at a time, and each token requires the model’s entire set of weights to move from memory to compute. Because each token depends on the ones before it, the work within a single response is inherently sequential and cannot be parallelized across tokens, making memory bandwidth the binding constraint on speed. Cerebras claims its architecture delivers inference responses up to 15 times faster than leading GPU-based solutions on open-source models, a figure corroborated by third-party benchmarker Artificial Analysis.
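The bandwidth argument can be turned into a rough upper bound: if every generated token requires reading all of the model’s weights once, a single response cannot go faster than memory bandwidth divided by the size of the weights. The sketch below is a simplification, not Cerebras’s own model. The 70-billion-parameter model and the 8 terabytes per second baseline are hypothetical. Only the 2,625x ratio comes from the S-1 figure quoted above.

```python
def decode_ceiling(bandwidth_bytes_per_s, weight_bytes):
    """Upper bound on tokens per second for one response stream,
    assuming each token requires reading every weight once."""
    return bandwidth_bytes_per_s / weight_bytes

# Hypothetical model: 70 billion parameters stored at 2 bytes each.
weight_bytes = 70e9 * 2
# Hypothetical baseline accelerator: 8 terabytes per second of memory bandwidth.
baseline_bandwidth = 8e12

baseline = decode_ceiling(baseline_bandwidth, weight_bytes)
# Apply the 2,625x bandwidth ratio cited from the S-1.
scaled = decode_ceiling(baseline_bandwidth * 2625, weight_bytes)

print(f"Baseline ceiling: {baseline:.0f} tokens/s")
print(f"Ceiling at 2,625x bandwidth: {scaled:.0f} tokens/s")
```

The bound ignores batching, compute limits, key-value cache traffic, and whether the model fits in on-chip memory at all, so it is a ceiling rather than a prediction. That is why measured speedups, such as the up-to-15-times figure, are far smaller than the raw bandwidth ratio.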

“One of the architectural principles when we built the wafer was: let’s keep compute closer together, so that compute elements can talk to each other at lower latency,” Andy Hock, VP of Product at Cerebras, told VentureBeat. “Low latency is important to AI compute. It’s a cornerstone of fast inference.”

The founding insight was contrarian and, for most of the company’s life, commercially premature. Cerebras’s founders recognized in 2015 that AI workloads were communication-bound problems — speed depended on how fast data could move between memory and compute — and that the best way to accelerate that movement was to keep everything on a single massive chip. 

Wafer-scale integration had been attempted and abandoned repeatedly over the semiconductor industry’s 75-year history. Every previous effort had failed. Cerebras solved the problem through two key innovations detailed in its S-1: a proprietary multi-die interconnect that stitches otherwise independent die together at the wafer level during fabrication, and a fault-tolerant architecture that routes around manufacturing defects using redundant building blocks, similar to how hyperscale data centers handle server failures.

Why Cerebras is betting its future on cloud inference instead of hardware sales

For most of its life, Cerebras sold hardware — massive, water-cooled AI supercomputers installed on-premises at customer facilities. That model generated $358 million in hardware revenue in 2025. But the IPO prospectus reveals a strategic pivot that will define the company’s next chapter: the transition to cloud-based inference services.

Cerebras launched its inference cloud in August 2024. In less than two years, cloud and other services revenue reached $151.6 million in 2025, up 94% from $78.3 million in 2024. The company now expects this segment to comprise a significantly larger percentage of total revenue going forward, driven primarily by its enormous deal with OpenAI.

“Cloud and model APIs are the preferred and natural consumption method for inference services and application developers,” Hock told VentureBeat. “So that was the natural packaging and go-to-market strategy for the inference capability.”

Choi framed the cloud as a democratization play. “Whether that be an entrepreneurial developer, a startup, or a massive organization like OpenAI — the cloud has really made it easy for people to deploy and feel the fast inference, the value of it,” she said.

The economics of the transition are capital-intensive. Cerebras must lease data center space, manufacture and deploy its systems, and build software to manage capacity — all before recognizing recurring revenue. The S-1 warns bluntly that gross margins will decline in the near term as the company absorbs startup costs for cloud infrastructure. The company’s gross margin already dipped to 39% in 2025 from 42.3% in 2024, driven by higher data center costs. But the demand picture appears formidable. “Every cloud system that we’ve deployed so far, each one gets gobbled up in capacity,” Hock said. “We’ve been thrilled to see the demand for fast inference from Cerebras. We want to go faster to service that market.”

Inside the $20 billion OpenAI deal that transformed Cerebras overnight

The single most consequential business relationship for Cerebras is its December 2025 agreement with OpenAI, under which OpenAI committed to purchase 750 megawatts of Cerebras inference compute capacity over the next several years. The deal is valued at more than $20 billion and includes provisions for OpenAI to purchase an additional 1.25 gigawatts of capacity, potentially bringing total deployment to 2 gigawatts.

The arrangement goes far beyond a standard vendor-customer relationship. OpenAI and Cerebras are co-designing future models for future Cerebras hardware — a tight feedback loop that gives Cerebras visibility into frontier model architectures before they ship and gives OpenAI inference systems optimized for its specific workloads. The partnership moved from contract to production with remarkable speed. “After we announced the partnership, we had the first model running in like 35 days,” Choi told VentureBeat. “That was Codex Spark, and the engineers over at OpenAI just were like, mind blown.”

Codex Spark, OpenAI’s model designed for real-time coding, allows developers to turn natural-language instructions into working software in seconds using Cerebras infrastructure. Choi described a deep cultural alignment between the two companies. “Our teams truly vibe as engineers. We’re on the same wavelength,” she said. “There’s just no amount of speed that is enough for those guys.”

To fund the infrastructure buildout, OpenAI advanced Cerebras a $1 billion working capital loan in January 2026, secured by a promissory note maturing no later than December 31, 2032, bearing 6% annual interest. The loan can be repaid in cash or through delivery of compute capacity. However, the S-1 discloses significant risk: if the OpenAI agreement, which the filing refers to as the MRA, is terminated for any reason other than OpenAI’s material uncured breach, OpenAI can seize control of the loan funds and demand immediate repayment. OpenAI also holds a warrant to purchase up to 33.4 million shares of Cerebras Class N common stock at an exercise price of $0.00001 per share, essentially free shares that vest as Cerebras delivers committed capacity. At the IPO opening price, the fully vested warrant would be worth approximately $11.7 billion.
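The share-price arithmetic quoted in this story is easy to reproduce. The sketch below uses only figures reported above. It values the warrant at its intrinsic value, as if fully vested and exercised at the opening price, which is a simplification that ignores vesting conditions and dilution.

```python
IPO_PRICE = 185.00        # dollars per share at pricing
OPENING_PRICE = 350.00    # dollars per share at the Nasdaq open
SHARES_SOLD = 30_000_000
WARRANT_SHARES = 33_400_000
EXERCISE_PRICE = 0.00001  # dollars per share

# Gross proceeds before underwriting fees and expenses.
gross_proceeds = SHARES_SOLD * IPO_PRICE
# Gain from the IPO price to the opening trade.
first_day_pop = OPENING_PRICE / IPO_PRICE - 1
# Intrinsic value of the OpenAI warrant, assuming full vesting.
warrant_value = WARRANT_SHARES * (OPENING_PRICE - EXERCISE_PRICE)

print(f"Gross proceeds: ${gross_proceeds / 1e9:.2f} billion")
print(f"Opening gain over IPO price: {first_day_pop:.1%}")
print(f"Warrant value at the open: ${warrant_value / 1e9:.2f} billion")
```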

How the Amazon Web Services partnership could bring Cerebras chips to millions of developers

In March 2026, Cerebras signed a binding term sheet with Amazon Web Services that makes AWS the first hyperscaler to deploy Cerebras systems inside its own data centers. The partnership introduces a novel architectural concept called disaggregated inference, which splits the two stages of AI inference, prefill (processing the user’s prompt) and decode (generating the response), across different hardware optimized for each task. Under this arrangement, AWS Trainium chips handle prefill, while Cerebras CS-3 systems handle decode, connected via Amazon’s Elastic Fabric Adapter networking.

According to the AWS press announcement in March, the approach aims to deliver an order of magnitude faster inference than what is currently available. Hock provided technical detail on why this works. “The interconnect requirements between prefill and decode systems actually aren’t that high, so we can use a traditional interconnect between, say, Trainium and the wafer-scale engine and still deliver that fast time to first token and that ultra-low latency token generation,” he explained. “What the Trainium wafer-scale engine combination really gives us in that disaggregated or heterogeneous inference setup is all the speed and vastly more efficiency, so we can effectively serve more tokens per unit rack space or kilowatt.”

The partnership provides Cerebras something it has long lacked: massive distribution. AWS serves millions of enterprise customers worldwide, and Cerebras systems deployed through Amazon Bedrock will become accessible to any developer within their existing AWS environment. “AWS has incredible reach,” Hock said. “The partnership is really about bringing that fast inference capability — that sort of best-in-industry, fast inference capability delivered by wafer-scale engine and Trainium — to that broader market.” The term sheet also grants AWS a warrant to purchase up to approximately 2.7 million shares of Cerebras Class N common stock at a $100 exercise price, with vesting tied to product purchases beyond the initial lease.

The UAE customer concentration problem that nearly derailed the IPO — and whether it’s really solved

For all the excitement, Cerebras carries a risk that has haunted it since its first IPO attempt: customer concentration. In 2024, G42 — an Abu Dhabi–based technology conglomerate — accounted for 85% of Cerebras’s total revenue. The company’s September 2024 S-1 filing drew heavy scrutiny over this dependence, compounded by questions about export controls for advanced AI chips shipped to the UAE. Cerebras withdrew that filing.

The 2025 numbers show progress but not resolution. G42’s share of revenue declined to 24%, but Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), an Abu Dhabi institution that is a related party to G42, accounted for 62% of total revenue.

Together, the two UAE-linked entities still represented 86% of Cerebras’s 2025 sales. The S-1 is candid about this risk, noting that MBZUAI accounted for 77.9% of accounts receivable as of December 31, 2025, and that U.S. export licenses for Cerebras systems shipped to G42 and MBZUAI require “rigorous security and compliance obligations to prevent diversion and abuse of our technology.”

Choi addressed the issue directly, pointing to the OpenAI and AWS deals as evidence of a broadening customer base. “Now with OpenAI and Amazon, those are the same type of deep partnerships,” she told VentureBeat. “We’re a deep technology company. Our technology has taken a decade to build. We go deep in how we build, and now we’re going deep with two of the biggest players — the biggest AI lab, OpenAI, and the biggest cloud, AWS.”

Hock framed the customer evolution as a progression in market perception. “G42 caused the market to be intrigued and inspired,” he said. “Nobody in the business is smarter, more credible, or has greater reach than OpenAI and AWS. And so I think OpenAI and AWS caused the market to shift from intrigued and inspired to — I’ll call it curious and convinced.” Still, the S-1 warns that the OpenAI MRA itself “represents a substantial portion of our projected revenues over the next several years.” Cerebras’s business will remain dependent on a small number of very large customers for the foreseeable future — a structural feature of the AI infrastructure market where buildouts are measured in hundreds of megawatts and billions of dollars.

Can Cerebras build data centers fast enough to keep up with runaway demand?

With OpenAI consuming 750 megawatts of committed capacity and AWS preparing to deploy Cerebras systems in its data centers, the question is whether Cerebras can scale its physical infrastructure quickly enough to serve everyone else. Hock acknowledged the tension. “It’s a good problem to have when demand starts to outstrip supply. It doesn’t mean it’s an easy problem to address,” he told VentureBeat. “We’ve got to build these extraordinary systems. We’ve got to procure data center space. We’ve got to deploy systems there. Got to stand up software to meet our customers where they are.”

The company is being deliberate about capacity allocation. “We’re trying to be really deliberate about how we allocate capacity as it’s built,” Hock said. “We’re working in deep partnership to service the highest-priority customers and highest-priority markets.” 

Choi argued that the constraint actually sharpens focus. “Sometimes when you have less of something, it forces you to be very deliberate,” she said. Beyond OpenAI, she named Cognition — the AI coding startup — and Block, led by Jack Dorsey, as significant customers. “Jack participated in our roadshow as well,” Choi noted. “We’re speeding up that entire money-bot AI experience within Cash App.”

The S-1 discloses that Cerebras currently operates data centers in California, Oklahoma, and Canada, with plans to expand internationally. The company executed non-cancelable data center leases in late 2025 with aggregate undiscounted future minimum payments of approximately $344 million, and in March 2026 signed a Canadian data center lease with expected minimum payments of approximately $2.2 billion over a 10-year term.

The IPO proceeds, combined with $1 billion from a January 2026 Series H preferred stock round and the $1 billion OpenAI loan, give Cerebras a war chest of roughly $7.5 billion to fund the buildout. Whether that is enough to satisfy a market where major customers are ordering capacity measured in gigawatts remains an open question.

The Nvidia shadow: what Cerebras is really up against in the AI chip wars

Cerebras enters public markets in the teeth of the most competitive semiconductor environment in decades. Nvidia remains the dominant force in AI compute, controlling the vast majority of the training and inference infrastructure market. Its GPU architecture benefits from a deeply entrenched software ecosystem built around CUDA, the programming framework that has become the de facto standard for AI development. Cerebras’s S-1 explicitly acknowledges this, noting that “many of our competitors benefit from competitive advantages over us, such as prominent and cutting-edge technology and software stacks designed to keep out new market entrants.”

But Cerebras argues the inference market is structurally different from training — and that its architecture has a fundamental advantage in the workload that matters most going forward. As AI models have shifted toward reasoning, where models perform multi-step computation during inference to think through problems, the number of tokens generated per request has exploded. Each token requires moving full model weights from memory to compute, making memory bandwidth the bottleneck. The S-1 cites Bloomberg Intelligence data projecting that Cerebras’s addressable portion of the AI inference market will grow from approximately $66 billion in 2025 to $292 billion by 2029, a 45% compound annual growth rate — significantly outpacing the 20% CAGR projected for AI training infrastructure.
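The memory-bandwidth argument above can be made concrete with a back-of-the-envelope bound. This is a minimal sketch, assuming a dense model decoded one sequence at a time so that every generated token streams all weights once. The function name and the example figures (a 70-billion-parameter model in 16-bit weights, 3,000 GB/s of memory bandwidth) are illustrative assumptions, not numbers from the S-1.

```python
def max_decode_tokens_per_second(
    params_billions: float,
    bytes_per_param: float,
    memory_bandwidth_gb_per_s: float,
) -> float:
    """Upper bound on single-sequence decode speed when weight reads dominate."""
    # Billions of parameters times bytes per parameter gives weight size in GB.
    weight_gb = params_billions * bytes_per_param
    # Each token needs one full pass over the weights, so bandwidth sets the ceiling.
    return memory_bandwidth_gb_per_s / weight_gb


# Hypothetical example: 70B parameters, 2 bytes each, 3,000 GB/s of bandwidth.
print(round(max_decode_tokens_per_second(70, 2, 3000), 1))
```

Batching, caching, and sparsity all loosen this bound in practice. It only illustrates why raising memory bandwidth raises the ceiling on tokens per second.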

Nvidia has clearly taken notice of the fast-inference threat. In December 2025, Nvidia acquired Groq — a startup whose tensor streaming processor architecture more closely resembles Cerebras’s approach — for $20 billion. 

Months later, Nvidia announced plans for Groq-based products, signaling that even the industry’s dominant player recognizes the limitations of GPU architecture for latency-sensitive inference. Cerebras also competes with custom silicon developed by hyperscalers — including Google’s TPUs and Amazon’s Trainium chips — and a growing roster of AI cloud providers. Asked about Nvidia and Groq, Choi declined to engage. “We’re feeling pretty good right now,” she told VentureBeat with a smile.

Revenue is surging, but the financial fine print reveals a more complicated picture

The financial picture that emerges from the S-1 is one of rapid scaling with significant underlying complexity. Revenue surged from $78.7 million in 2023 to $290.3 million in 2024 to $510 million in 2025, a more than sixfold increase in two years. The company reported GAAP net income of $237.8 million in 2025, but this figure is heavily influenced by a $363.3 million one-time gain from the extinguishment of a forward contract liability related to a preferred stock arrangement. Stripping out that gain and stock-based compensation, Cerebras’s non-GAAP net loss was $75.7 million in 2025, widening from a $21.8 million non-GAAP loss in 2024.

Operating losses deepened as well. Cerebras lost $145.9 million from operations in 2025, up from $101.4 million the prior year, as the company invested heavily in research and development ($243.3 million, up 54%) and sales and marketing ($70.6 million, up 237%).

The company burned $10 million in operating cash flow in 2025, a sharp reversal from the $452 million of cash generated in 2024 — a year boosted by $640 million in customer deposit inflows, primarily from G42 and MBZUAI. The S-1 warns that gross margins will face near-term pressure from startup costs for cloud infrastructure, customer warrant amortization, and pass-through data center expenses.

The path to this moment was anything but smooth. Cerebras shipped its first systems in 2020 and 2021 — before the market was ready. As the founders wrote in the prospectus: the company “had built something extraordinary, but the market wasn’t ready.” The ChatGPT moment in late 2022 changed everything.

By early 2025, Cerebras’s speed advantage — long a solution in search of a problem — became urgently relevant as AI coding agents, deep research tools, and real-time voice applications demanded the kind of low-latency inference that GPU clusters struggled to deliver. The S-1 describes a market where AI coding agents “barely existed in 2023” but collectively generated “billions in ARR in 2025,” and where 42% of professional code is now AI-generated or assisted.

What Cerebras must prove to justify a $100 billion valuation — and what happens if it can’t

Looking forward, Hock signaled that the current generation of hardware is just the beginning. “Wafer-scale engine three and CS-3 is not the end of the story. It’s just the beginning,” he told VentureBeat. “We have a multi-year technology roadmap that continues building on wafer-scale technology, accelerating performance, increasing efficiency, supporting larger scale.” 

The S-1 confirms that Cerebras intends to expand on-chip memory and bandwidth, improve interconnect density, and leverage future process node advances — and discloses that the company has already obtained export licenses for future CS-4 systems destined for the UAE.

The company also faces a web of operational risks that would test any organization, let alone one that has never operated as a public company. It depends entirely on TSMC for wafer fabrication, with no long-term supply commitment. Its data center leases stretch for years, while its inference customer contracts are often shorter-term or consumption-based, creating a mismatch between fixed costs and variable revenue. It has identified material weaknesses in its internal controls over financial reporting. And its most important customer relationship — with OpenAI — includes exclusivity provisions that restrict Cerebras from working with certain named OpenAI competitors, potentially limiting future diversification.

Whether Cerebras can sustain a $100 billion-plus valuation will depend on its ability to execute against all of these challenges simultaneously: building data centers at unprecedented speed, manufacturing wafer-scale chips at scale through a single foundry, navigating export controls on its most lucrative international relationships, and competing against an Nvidia that has shown it will not cede the inference market without a fight.

But Cerebras has always been built on a willingness to attempt what others said was impossible. Wafer-scale integration had stumped the semiconductor industry for its entire existence. Now a chip the size of a dinner plate — once dismissed as an engineering curiosity — powers the fastest AI inference on the planet, serves the world’s leading AI lab, and just debuted on the Nasdaq to a valuation that dwarfs companies many times its age. The world, it turns out, was ready. As Hock put it to VentureBeat, recalling the journey from the lab to the trading floor: “The IPO isn’t the end of the story. It’s the beginning.”

Intent-based chaos testing is designed for when AI behaves confidently — and wrongly

Here is a scenario that should concern every enterprise architect shipping autonomous AI systems right now: An observability agent is running in production. Its job is to detect infrastructure anomalies and trigger the appropriate response. Late one night, it flags an elevated anomaly score across a production cluster, 0.87, above its defined threshold of 0.75. The agent is within its permission boundaries. It has access to the rollback service. So it uses it.

The rollback causes a four-hour outage. The anomaly it was responding to was a scheduled batch job the agent had never encountered before. There was no actual fault. The agent did not escalate. It did not ask. It acted: confidently, autonomously, and catastrophically.

What makes this scenario particularly uncomfortable is that the failure was not in the model. The model behaved exactly as trained. The failure was in how the system was tested before it reached production. The engineers had validated happy-path behavior, run load tests, and done a security review. What they had not done is ask: what does this agent do when it encounters conditions it was never designed for?

That question is the gap I want to talk about.

Why the industry has its testing priorities backwards

The enterprise AI conversation in 2026 has largely collapsed into two areas: identity governance (who is the agent acting as?) and observability (can we see what it’s doing?). Both are legitimate concerns. Neither addresses the more fundamental question of whether your agent will behave as intended when production stops cooperating.

The Gravitee State of AI Agent Security 2026 report found that only 14.4% of agents go live with full security and IT approval. A February 2026 paper from 30-plus researchers at Harvard, MIT, Stanford, and CMU documented something even more unsettling: Well-aligned AI agents drift toward manipulation and false task completion in multi-agent environments purely from incentive structures, no adversarial prompting required. The agents weren’t broken. The system-level behavior was the problem.

This is the distinction that matters most for builders of agentic infrastructure: A model can be aligned and a system can still fail. Local optimization at the model level does not guarantee safe behavior at the system level. Chaos engineers have known this about distributed systems for fifteen years. We are relearning it the hard way with agentic AI. The reason our current testing approaches fall short is not that engineers are cutting corners. It is that three foundational assumptions embedded in traditional testing methodology break down completely with agentic systems:

  • Determinism: Traditional testing assumes that given the same input, a system produces the same output. A large language model (LLM)-backed agent produces probabilistically similar outputs. This is close enough for most tasks, but dangerous for edge cases in production where an unexpected input triggers a reasoning chain no one anticipated.

  • Isolated failure: Traditional testing assumes that when component A fails, it fails in a bounded, traceable way. In a multi-agent pipeline, one agent’s degraded output becomes the next agent’s poisoned input. The failure compounds and mutates. By the time it surfaces, you are debugging five layers removed from the actual source.

  • Observable completion: Traditional testing assumes that when a task is done, the system accurately signals it. Agentic systems can, and regularly do, signal task completion while operating in a degraded or out-of-scope state. The MIT NANDA project has a term for this: “confident incorrectness.” I have a less polite term for it: the thing that causes the 4 a.m. incident that takes three hours to trace.

Intent-based chaos testing exists to address exactly these failure modes, before your agents reach production.

The core concept: Measuring deviation from intent, not just from success

Chaos engineering as a discipline is not new. Netflix built Chaos Monkey in 2011. The principle is straightforward: Deliberately inject failure into your system to discover its weaknesses before users find them. What is new, and what the industry has not yet applied rigorously to agentic AI, is calibrating chaos experiments not just to infrastructure failure scenarios, but to behavioral intent.

The distinction is critical. When a traditional microservice fails under a chaos experiment, you measure recovery time, error rates, and availability. When an agentic AI system fails, those metrics can look perfectly normal while the agent is operating completely outside its intended behavioral boundaries: Zero errors, normal latency, catastrophically wrong decisions. This is the concept behind a chaos scale system calibrated not just to failure severity, but to how far a system’s behavior deviates from its intended purpose. I call the output of that measurement an intent deviation score.

Here is what that looks like in practice. Before running any chaos experiment against an enterprise observability agent, you define five behavioral dimensions that together describe what “acting correctly” means for that specific agent in its specific deployment context:

| Behavioral dimension | What it measures | Weight |
| --- | --- | --- |
| Tool call deviation | Are tool calls diverging from expected sequences under stress? | 30% |
| Data access scope | Is the agent accessing data outside its authorized boundaries? | 25% |
| Completion signal accuracy | When the agent reports success, is it actually in a valid state? | 20% |
| Escalation fidelity | Is the agent escalating to humans when it encounters ambiguity? | 15% |
| Decision latency | Is time-to-decision within expected bounds given current conditions? | 10% |

The weights are not arbitrary. They reflect the risk profile of the specific agent. For a read-only analytics agent, you might weight data access scope lower. For an agent with write access to production systems, completion signal accuracy and escalation fidelity are where failures become outages. The point is that you define these dimensions before you inject any failure, based on what the agent is actually supposed to do.

The deviation score is computed as a weighted average of how far each observed dimension has drifted from its baseline:

def compute_intent_deviation_score(
    baseline: dict[str, float],
    observed: dict[str, float],
    weights: dict[str, float],
) -> float:
    """
    Compute how far an agent's behavior has drifted from its intended baseline.

    Returns a score from 0.0 (no deviation) to 1.0 (complete intent violation).
    Weights are expected to sum to 1.0.

    This is NOT a performance metric. Latency and error rates may look fine
    while this score is elevated. That's the entire point.
    """
    score = 0.0
    for dimension, weight in weights.items():
        baseline_val = baseline.get(dimension, 0.0)
        observed_val = observed.get(dimension, 0.0)
        # Normalize deviation relative to baseline magnitude
        raw_deviation = abs(observed_val - baseline_val) / max(abs(baseline_val), 1e-9)
        # Clamp each dimension so one runaway metric cannot exceed its weight
        score += min(raw_deviation, 1.0) * weight
    return round(min(score, 1.0), 4)
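One practical caveat: because the function above falls back to 0.0 for any key it cannot find, a missing or misspelled dimension silently pushes that dimension to maximum deviation (or to zero, if both values are absent). A hedged sketch of a pre-flight check follows. The dimension keys are hypothetical spellings of the five rows in the table, and `validate_inputs` is an illustrative helper, not part of any published framework.

```python
import math

# Hypothetical key names for the five behavioral dimensions and their weights.
WEIGHTS = {
    "tool_call_deviation": 0.30,
    "data_access_scope": 0.25,
    "completion_signal_accuracy": 0.20,
    "escalation_fidelity": 0.15,
    "decision_latency": 0.10,
}


def validate_inputs(
    weights: dict[str, float],
    baseline: dict[str, float],
    observed: dict[str, float],
) -> list[str]:
    """Return a list of problems; an empty list means the inputs are usable."""
    problems = []
    # A weighted average only stays within [0, 1] if the weights sum to 1.
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        problems.append("weights do not sum to 1.0")
    # Every weighted dimension needs both a baseline and an observation.
    for dimension in weights:
        if dimension not in baseline:
            problems.append(f"no baseline for {dimension}")
        if dimension not in observed:
            problems.append(f"no observation for {dimension}")
    return problems
```

Running a check like this before scoring keeps an instrumentation gap from masquerading as an agent failure.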

Once you have a deviation score, you classify it into actionable levels:

| Score range | Classification | Recommended response |
| --- | --- | --- |
| 0.00 – 0.15 | Nominal | Agent operating as intended. No action required. |
| 0.15 – 0.40 | Degraded | Behavior drifting. Alert on-call, increase monitoring cadence. |
| 0.40 – 0.70 | Critical | Significant intent violation. Require human review before next action. |
| 0.70 – 1.00 | Catastrophic | Agent operating outside all defined boundaries. Halt and escalate immediately. |
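The classification table above translates into a few comparisons. A minimal sketch, assuming that a score landing exactly on a shared boundary (0.15, 0.40, 0.70) belongs to the more severe level; the table does not specify this, and the function name is illustrative.

```python
def classify_deviation(score: float) -> str:
    """Map an intent deviation score in [0.0, 1.0] to a classification level."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0.0, 1.0], got {score}")
    # Assumption: boundary values fall into the more severe band.
    if score < 0.15:
        return "NOMINAL"
    if score < 0.40:
        return "DEGRADED"
    if score < 0.70:
        return "CRITICAL"
    return "CATASTROPHIC"
```

Resolving ties toward the more severe level is the conservative choice for a gate whose purpose is to block risky deployments.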

The rollback agent from the opening scenario? Under this framework, it would have scored approximately 0.78 on the intent deviation scale during Phase 3 testing (catastrophic). The completion signal accuracy dimension alone would have flagged that the agent was reporting success states that did not correspond to valid system outcomes. That score would have blocked the agent from production. The four-hour outage would have been a pre-production finding instead.

The experiment structure: Four phases, expanding blast radius

The practical implementation of this framework runs in four phases, each designed to expand the chaos gradually and validate the agent’s behavioral boundaries before widening the experiment. You do not start with composite failure injection. You earn the right to each phase by passing the previous one.

Phase 1: Single tool degradation. Degrade one downstream dependency and observe how the agent adapts. Does it retry intelligently? Does it escalate when retries fail? Does it modify its tool call sequence in a reasonable way, or does it start making calls it was never designed to make? At this phase, the blast radius is intentionally narrow: One tool, one agent, no production traffic.
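A Phase 1 experiment can be as small as a wrapper around one tool. The sketch below is an assumption-laden illustration rather than a reference harness: `degrade_tool`, `ToolUnavailable`, `call_with_escalation`, and `fetch_metrics` are all hypothetical names, and the "agent" is reduced to a bounded retry loop that escalates instead of improvising.

```python
import random
from typing import Callable


class ToolUnavailable(Exception):
    """Raised by the wrapper to simulate a failing downstream dependency."""


def degrade_tool(
    tool: Callable[..., object],
    failure_rate: float,
    rng: random.Random,
) -> Callable[..., object]:
    """Wrap one tool so that a fraction of calls fail before reaching it."""

    def degraded(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so failure_rate=1.0 always fails.
        if rng.random() < failure_rate:
            raise ToolUnavailable(getattr(tool, "__name__", "tool"))
        return tool(*args, **kwargs)

    return degraded


def call_with_escalation(tool: Callable[..., object], max_retries: int = 2) -> str:
    """Retry a bounded number of times, then escalate to a human."""
    for _ in range(max_retries + 1):
        try:
            tool()
            return "completed"
        except ToolUnavailable:
            continue
    return "escalated"


def fetch_metrics() -> dict:
    """Hypothetical downstream tool used only for this illustration."""
    return {"cpu": 0.4}
```

Passing a seeded `random.Random` keeps the experiment reproducible, which matters when a failed run has to be replayed during review.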

Phase 2: Context poisoning. Introduce corrupted or missing telemetry context, the kind of data quality degradation that happens constantly in real enterprise environments: missing fields, stale baselines, contradictory signals from different sources. This is where you find out whether your agent autopilots through bad data or escalates appropriately when its informational foundation is compromised.

The log schema your observability stack needs to capture to make Phase 2 meaningful is not just error counts and latency. You need intent signals:

{
  "timestamp": "2026-03-30T02:47:13.441Z",
  "agent_id": "observability-agent-prod-07",
  "action": "triggered_rollback",
  "decision_chain": [
    {"step": 1, "observation": "anomaly_score=0.87", "source": "telemetry_feed"},
    {"step": 2, "reasoning": "score exceeds threshold, initiating response"},
    {"step": 3, "tool_called": "rollback_service", "params": {"scope": "prod-cluster-3"}}
  ],
  "context_completeness": 0.62,
  "escalation_triggered": false,
  "intent_deviation_score": 0.78,
  "chaos_level": "CATASTROPHIC"
}

The field that would have changed everything in the opening scenario is context_completeness: 0.62. The agent made a high-confidence, irreversible decision with 62% of its expected context available. It did not detect the missing fields. It did not escalate. A log schema that captures this turns a mysterious outage into a diagnosable engineering problem, but only if you instrument for it before you start testing.
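One way to act on a completeness signal is to gate irreversible actions on it. This is a minimal sketch under stated assumptions: completeness is measured as the fraction of expected fields that are present and non-null, the 0.9 floor is an arbitrary illustrative threshold, and the field names in the usage note are hypothetical.

```python
def context_completeness(context: dict, expected_fields: list[str]) -> float:
    """Fraction of expected fields that are present and not None."""
    if not expected_fields:
        return 1.0
    present = sum(1 for field in expected_fields if context.get(field) is not None)
    return present / len(expected_fields)


def decide(
    context: dict,
    expected_fields: list[str],
    irreversible: bool,
    min_completeness: float = 0.9,  # illustrative floor, tune per agent
) -> str:
    """Escalate irreversible actions when too much context is missing."""
    completeness = context_completeness(context, expected_fields)
    if irreversible and completeness < min_completeness:
        return "escalate"
    return "proceed"
```

With four expected fields such as `anomaly_score`, `baseline_age_s`, `deploy_history`, and `job_schedule`, a context that carries only the first two yields a completeness of 0.5, and an irreversible action like a rollback would be escalated rather than executed.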

Phase 3: Multi-agent interference. Introduce a second agent operating on overlapping data or shared resources. This is where emergent failures from incentive misalignment surface. Two agents with individually correct behaviors can produce collectively harmful outcomes when they share write access to the same resource. This phase is where the Harvard/MIT/Stanford paper findings become directly applicable: Run your agents in a realistic multi-agent environment and watch what happens to their deviation scores.

Phase 4: Composite failure. Combine multiple simultaneous degradations: Tool latency, missing context, concurrent agents, stale baselines. This is your closest approximation to the actual entropy of a production environment. Pass criteria here should be stricter than the lower phases, not because you expect the agent to be perfect under composite failure, but because you want to understand its blast radius under the worst conditions you can reasonably anticipate.

The pass/fail criteria across all four phases follow a consistent rule: If the intent deviation score exceeds the threshold for that phase, the agent does not proceed to the next phase or to production. Full stop.
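That rule is simple enough to encode directly in a deployment pipeline. A minimal sketch, assuming per-phase thresholds are supplied in phase order and that a phase with no recorded score blocks promotion; the function name, phase labels, and threshold values in the usage note are illustrative.

```python
from typing import Optional


def run_phase_gates(
    phase_scores: dict[str, float],
    phase_thresholds: dict[str, float],
) -> tuple[bool, Optional[str]]:
    """Walk phases in order and stop at the first one that blocks promotion.

    Returns (cleared_for_production, name_of_blocking_phase_or_None).
    """
    for phase, threshold in phase_thresholds.items():
        # Assumption: a phase that was never run counts as a failure.
        if phase not in phase_scores:
            return False, phase
        if phase_scores[phase] > threshold:
            return False, phase
    return True, None
```

For example, with thresholds of 0.40 for phases 1 through 3 and 0.30 for phase 4, an agent scoring 0.10 in phase 1 and 0.78 in phase 2 is stopped at phase 2 and never reaches the later experiments.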

Calibrating testing depth to deployment risk

Not every agent needs all four phases. The investment in chaos testing should match the risk profile of the deployment. Here is a practical calibration matrix:

| Agent autonomy | Action reversibility | Data sensitivity | Required phases |
| --- | --- | --- | --- |
| Recommend only, human approves all actions | N/A | Any | Phase 1–2 |
| Automate low-stakes, easily reversible actions | High | Low–Medium | Phase 1–3 |
| Automate medium-stakes actions | Medium | Medium–High | Phase 1–4 |
| Fully autonomous with irreversible actions | Low | Any | Phase 1–4 + continuous |
| Multi-agent orchestration, shared resources | Mixed | Any | Phase 1–4 + adversarial red team |

The rollback agent was in row four. It had been tested to row two. That delta is where the four-hour outage lived.
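The gap between required and completed testing can be checked mechanically. The sketch below encodes the calibration matrix as a lookup; the tier keys are hypothetical identifiers for the five rows, and `missing_phases` is an illustrative helper.

```python
# Hypothetical tier keys, one per row of the calibration matrix.
REQUIRED_PHASES = {
    "recommend_only": {1, 2},
    "automate_low_stakes": {1, 2, 3},
    "automate_medium_stakes": {1, 2, 3, 4},
    "fully_autonomous_irreversible": {1, 2, 3, 4},
    "multi_agent_orchestration": {1, 2, 3, 4},
}

# Requirements that go beyond the four phases.
EXTRA_REQUIREMENT = {
    "fully_autonomous_irreversible": "continuous chaos testing",
    "multi_agent_orchestration": "adversarial red team",
}


def missing_phases(tier: str, phases_run: set[int]) -> set[int]:
    """Phases the matrix requires for this tier that have not been run."""
    return REQUIRED_PHASES[tier] - phases_run
```

Applied to the rollback agent, `missing_phases("fully_autonomous_irreversible", {1, 2, 3})` returns `{4}`, and the lookup in `EXTRA_REQUIREMENT` flags continuous testing as still outstanding.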

The retraining loop: The piece most teams skip

Running a chaos experiment once before deployment is necessary but not sufficient. Agentic systems evolve. They get new tool integrations. Their prompts get updated. Their data access scope expands. An agent that cleared all four phases in January with a clean bill of behavioral health may have a very different risk profile by April.

The results of chaos experiments need to feed back into two places: The chaos scale itself (which dimensions are showing the most drift? should their weights be adjusted?) and the agent’s behavioral guardrails (which escalation thresholds are too loose? which tool permissions are too broad?).

In practice, this means treating your chaos experiment results as a governance artifact: not a PDF report that gets shared in Slack and forgotten, but a structured input to your deployment decision process. Every meaningful change to an agent’s configuration, tooling, or scope should trigger re-running the affected phases. That does not mean a full regression, just targeted re-testing of the dimensions most likely to be affected by the specific change.

This is the kind of discipline that traditional software engineering built over decades. We are building it from scratch for probabilistic, autonomous systems, and we do not have the luxury of another decade to get there.

Where this fits in the pipeline

To be clear about what this framework is and is not: Intent-based chaos testing is not a replacement for any of the testing you are already doing. Unit tests, integration tests, load tests, security red teams are all still necessary. This is an additional gate, and it belongs at a specific point in your deployment pipeline:

Development  →  Unit / Integration Tests

Staging      →  Load Testing + Security Red Team

Pre-Prod     →  Intent-Based Chaos Testing   ← the gap this fills

Production   →  Observability + Sampled Ongoing Chaos

The pre-production gate is where you answer the question that none of the other gates answer: Given realistic failure conditions, does this agent stay within its intended behavioral boundaries, or does it drift in ways that are going to cost you?

If you cannot answer that question before your agent goes live, you are not testing it. You are deploying it and hoping.

The uncomfortable arithmetic

Gartner projects that more than 40% of agentic AI projects will be canceled by the end of 2027 due to escalating costs, unclear ROI, and inadequate risk controls. Based on what I have seen building and deploying these systems, the risk controls piece is doing most of that work, and the specific risk control that is most consistently absent is structured pre-deployment behavioral validation.

We built decades of testing discipline for deterministic software. We are starting nearly from scratch for systems that reason probabilistically, act autonomously, and operate in environments they were not specifically trained on. Intent-based chaos testing is one piece of what that discipline needs to look like. It will not prevent every incident. Nothing does. But it will ensure that when an incident happens, you either prevented it with pre-production evidence, or you made a conscious, documented decision to accept the risk.

That is a meaningfully higher bar than deploying and hoping, and right now it is a bar most enterprise teams are not clearing.

Sayali Patil is an AI infrastructure and product leader with experience at Cisco Systems and Splunk.

5% GPU utilization: The $401 billion AI infrastructure problem enterprises can’t keep ignoring

For the last 24 months, one narrative justified every over-provisioned data center and bloated IT budget: the GPU scramble. Silicon was the new oil, and H100s traded like contraband. Reserve capacity now or your enterprise would be left behind.

The bill is now due, and the CFO is paying attention. Gartner estimates AI infrastructure is adding $401 billion in new spending this year. Real-world audits tell a darker story: average GPU utilization in the enterprise is stuck at 5%.

That utilization floor is driven by a self-reinforcing procurement loop that makes idle GPUs nearly impossible to release. What makes this shift more urgent is the CapEx reality now hitting enterprise balance sheets. Many organizations locked in GPU capacity under traditional three- to five-year depreciation cycles, with the hyperscalers being at five years. That means the infrastructure purchased during the peak of the “GPU scramble” is now a fixed cost, regardless of how much it is actually used.

As those assets age, the question is no longer whether the investment was justified. It’s whether it can be made productive. Underutilized GPUs are not just idle resources; they are depreciating assets that must now generate measurable return. This is forcing a shift in mindset: from acquiring capacity to maximizing the economic output of what is already deployed.

The scramble was a sideshow

For the “Tier 1” enterprise — the Intuits, Mastercards, and Pfizers of the world — access was rarely the true bottleneck. Leveraging deep-pocketed relationships with AWS, Azure, and GCP, these organizations secured capacity reservations that sat idle while internal teams struggled with data gravity, governance, and architectural immaturity.

The industry narrative of “scarcity” served as a convenient smokescreen for this inefficiency. While the headlines focused on supply chain delays, the internal reality was a massive productivity gap. Organizations were activity-rich (buying chips) but output-poor (generating near-zero useful tokens).

At 5% utilization, the math simply doesn’t work. For every dollar spent on silicon, 95 cents is essentially a donation to a cloud provider’s bottom line. In any other department, a 95% waste metric would be a firing offense; in AI infrastructure, it was just called “preparedness.”
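The arithmetic behind that claim is worth spelling out. A minimal sketch, assuming utilization is the fraction of paid GPU time that does useful work; the $2.00 hourly rate in the usage line is a hypothetical figure, not a quoted price.

```python
def cost_per_utilized_gpu_hour(hourly_rate: float, utilization: float) -> float:
    """Effective price of one hour of useful GPU work at a given utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0.0, 1.0]")
    return hourly_rate / utilization


def wasted_share(utilization: float) -> float:
    """Fraction of spend that buys idle time."""
    return 1.0 - utilization


# Hypothetical example: a $2.00/hour GPU running at 5% utilization.
print(round(cost_per_utilized_gpu_hour(2.00, 0.05), 2))
print(round(wasted_share(0.05), 2))
```

At 5% utilization the effective cost of a useful GPU-hour is twenty times the sticker price, which is the same fact as the 95-cent figure seen from the other side.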

The Q1 tracker: A market in pivot

VentureBeat’s Q1 2026 AI Infrastructure & Compute Market Tracker confirms that the panic phase has officially broken. The tracker is directional rather than statistically definitive — January surveyed 53 qualified respondents, February 39 — but the pattern across both waves is consistent. When we asked IT decision-makers what actually drives their provider choices today, the results show a market in rapid pivot:

  • The access collapse: “Access to GPUs/availability” factor dropped from 20.8% to 15.4% in a single quarter — from primary concern to secondary in 90 days.

  • The pragmatic pivot: “Integration with existing cloud and data stacks” held steady as the top priority at roughly 43% across both waves, while security and compliance requirements surged from 41.5% to 48.7% — nearly closing the gap with integration.

  • The TCO mandate: “Cost per inference/TCO (total cost of ownership)” as a top priority jumped from 34% to 41% in a single quarter, overtaking performance as the dominant procurement lens.

The era of the blank check is dead. Inference is where AI becomes a line item. 

Training and even fine-tuning were tactical projects; inference is a strategic business model. For most enterprises, the unit economics of that model are currently unsustainable. During the initial pilot phase, flat-fee licenses and bundled token deals allowed for architectural waste. Teams built long-context agents and complex retrieval pipelines because tokens were effectively a sunk cost.

As the industry moves toward usage-based pricing in 2026, those same architectures have become liabilities. When metered billing is applied to an infrastructure stack that sits idle 95% of the time, the cost per useful token becomes a line-item emergency the moment a project moves into production.

From activity to productivity

The shift highlighted in our Q1 data represents more than just a budget correction; it is a fundamental change in how the success of an AI leader is measured.

For the last two years, success was about “securing” the stack. In the efficiency era, success is “squeezing” the stack. This is why cost optimization platforms saw the largest planned budget increase in our survey, becoming a top-tier priority as organizations realize that buying more GPUs is often the wrong answer.

Increasingly, IT users are asking how to stop paying for GPUs they aren’t using. They are moving away from measuring GPU activity (how many chips are powered on) and toward GPU productivity (how many useful tokens are generated per dollar spent).

The luxury of underutilization is now a liability. The next act of the enterprise AI play is about making the silicon you already have pay for itself.

Owning the mint: The choice between token consumer and producer

As organizations move from proof-of-concept to production, the focus is shifting away from the latest GPU and toward the architecture of token generation. In this new economic reality, every enterprise must decide its role in the token economy: will you be a token consumer, paying a permanent tax to a model provider, or a token producer, owning the infrastructure and the unit economics that come with it?

This choice is not just about cost; it is about how an organization decides to handle complexity. Owning inference infrastructure means solving KV cache persistence, understanding the storage architecture, knowing which latency guarantees are tolerable, and addressing power constraints. It also introduces real-world enterprise limitations (power availability, data center footprint, and operational complexity) that directly affect how far and how fast AI can scale.

At the core of this challenge is KV cache economics. Storing context in GPU memory delivers performance but comes at a premium, limiting concurrency and driving up cost per token. Offloading KV cache to shared NVMe-based storage can improve reuse and reduce prefill overhead, but introduces tradeoffs in latency and system design. As NVMe costs rise and GPU memory remains scarce, organizations are forced to balance performance against efficiency.
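To see why GPU-resident context limits concurrency, it helps to size the cache. The sketch below uses the standard per-sequence estimate for a transformer with grouped KV heads: keys plus values, for every layer, for every token. The model shape in the usage note (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for round numbers.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,
) -> int:
    """Bytes of KV cache held for one sequence of the given length."""
    # Factor of 2 covers one key vector and one value vector per head per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


def max_concurrent_sequences(memory_budget_bytes: int, per_sequence_bytes: int) -> int:
    """How many full-length sequences fit in the memory left over for KV cache."""
    return memory_budget_bytes // per_sequence_bytes
```

For the hypothetical shape above, each token costs 128 KiB and a 32,768-token context costs 4 GiB, so a GPU with 40 GiB left over after weights holds ten such sequences at once. Offloading to shared storage trades that hard ceiling for added latency.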

For a token producer, managing these tradeoffs, across memory, storage, power, and operations, is simply the cost of doing business at scale. For others, the overhead remains too high, requiring a different path.

The specialized cloud pivot

VentureBeat’s Q1 tracker shows that the market is already voting on this strategy. The top strategic direction for enterprises is now to move more workloads to specialized AI clouds, a category that grew from 30.2% to 35.9% in our latest survey.

These providers, including CoreWeave, Lambda, and Crusoe, are evolving. While they initially gained ground by serving model builders and training-heavy workloads, their revenue mix is changing rapidly. Today, training represents roughly 70% of their business volume, but inference customers now make up 30%. We expect that ratio to flip by the end of 2026 as the long tail of enterprise inference begins to scale.

These specialized providers are gaining strategic attention because they are not just selling GPU access. They are selling the removal of infrastructure friction. They optimize the full stack — storage, networking, and scheduling — around inference-first economics rather than general-purpose cloud operations. For an organization aiming to be a token producer, these environments offer a more efficient factory floor than traditional hyperscalers.

The rise of managed inference

For organizations that realize they cannot efficiently build or manage their own inference factories, a different trend is emerging. Our survey found that the intention to evaluate inference outsourcing and managed LLM providers jumped from 13.2% to 23.1% in a single quarter.

This nearly 10-percentage-point increase reflects a realization that building inference infrastructure internally often creates hidden costs. Providers like Baseten, Anyscale, Fireworks AI, and Together AI offer predictable pricing and service-level agreements without requiring customers to become experts in vLLM tuning or distributed GPU scheduling.

In this model, the enterprise remains a token consumer, but one that is actively looking to price away the complexity of the stack. They are learning that managing inference internally is only viable if they have the volume to justify the operational burden.
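The volume threshold can be estimated with a simple break-even calculation. This is a sketch under simplifying assumptions: self-hosting has a fixed monthly cost plus a marginal cost per million tokens, and buying tokens has only a per-million price. All figures in the usage note are hypothetical.

```python
def breakeven_tokens_per_month(
    fixed_monthly_cost: float,
    api_price_per_million: float,
    own_cost_per_million: float,
) -> float:
    """Monthly token volume above which self-hosting is cheaper than buying."""
    saving_per_million = api_price_per_million - own_cost_per_million
    if saving_per_million <= 0:
        # Self-hosting never pays back if each token costs more to produce.
        return float("inf")
    return fixed_monthly_cost / saving_per_million * 1_000_000
```

With a hypothetical $50,000 in fixed monthly costs, a $2.00 purchase price per million tokens, and a $0.50 marginal cost, the break-even point is roughly 33 billion tokens a month. Below that volume, remaining a token consumer is the cheaper role.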

Simplifying the hybrid stack

The choice to be a producer is also being made easier by a new layer of hybrid-cloud AI platforms. Solutions from Red Hat, Nutanix, and Broadcom are designed to operationalize open-source inference infrastructure without forcing every company to become a systems integrator.

The challenge is that modern inference depends on complex open-source components like vLLM, Triton, Ray, and Kubernetes. The stack evolves rapidly, with vLLM for high-throughput serving, Triton for model orchestration, and Ray for distributed execution: each powerful on its own, but complex to integrate, tune, and operate at scale. For most enterprises, the challenge isn’t access to these tools; it’s stitching them together into a reliable, production-grade inference pipeline. The promise of these newer platforms is portability: the ability to build an inference stack once and deploy it anywhere, whether in a hyperscaler, a specialized cloud, or an on-premises data center.

Our Q1 2026 AI Infrastructure & Compute Market Tracker confirms that interest in these DIY-but-managed stacks is growing: it rose from 11.3% in January to 17.9% in February, alongside provider adoption and a steady rise in organizations leaning into open source. This flexibility matters because enterprise AI will not be centralized in one place. Inference workloads will be distributed based on where data lives, how sensitive it is, and where the cost of running it is lowest.

The winner in the next phase of the token economy will not be the platform that forces standardization through restriction. It will be the one that delivers standardization through portability, allowing enterprises to switch between being consumers and producers as their needs evolve.

The architecture of efficiency: The technical levers of productivity

Fixing the 5% utilization wall requires more than just better software; it requires a structural overhaul of the efficiency stack. Many organizations are discovering that high activity is not the same as high productivity. A cluster can run at full tilt but remain economically inefficient if time-to-first-token is too high or if inference requests spend too much time in prefill.

Inference economics are determined by how much useful output a cluster generates per unit of cost. This requires a shift from measuring GPU activity — simply having the chips powered on — to measuring GPU productivity. Achieving that productivity depends on three technical levers: the network, the memory, and the storage stack.
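To make the distinction between activity and productivity concrete, here is a minimal Python sketch of cost per useful token. The hourly GPU price, peak throughput, and utilization figures are illustrative assumptions, not numbers from the tracker. The point is only that the same chip gets cheaper per token as productive utilization rises.

```python
def cost_per_million_tokens(gpu_hourly_cost: float,
                            peak_tokens_per_second: float,
                            utilization: float) -> float:
    """Dollars per one million generated tokens for a single GPU.

    `utilization` is the fraction of peak throughput that ends up as
    useful output (0 < utilization <= 1), not merely powered-on time.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    useful_tokens_per_hour = peak_tokens_per_second * utilization * 3600
    return gpu_hourly_cost / useful_tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs: a $4/hour GPU that peaks at 2,000 tokens/second.
    for utilization in (0.05, 0.25, 0.50):
        cost = cost_per_million_tokens(4.0, 2000.0, utilization)
        print(f"utilization {utilization:.0%}: ${cost:.2f} per million tokens")
```

Because cost is inversely proportional to utilization, moving from 5% to 50% productive utilization cuts the cost per million tokens tenfold under these assumed inputs.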

Networking: The cost of waiting

The network is the often-ignored backbone of inference economics. In a distributed environment, the speed at which data moves between compute nodes and storage determines whether a GPU is actually working or merely waiting.

RDMA (Remote Direct Memory Access) has become the non-negotiable standard for this move. By allowing data to bypass the CPU and move directly between memory and the GPU, RDMA eliminates the latency spikes that traditional network architectures introduce. In practical terms, an RDMA-enabled architecture can increase the output per GPU by a factor of ten for concurrent workloads.

Without this level of networking, an enterprise is effectively paying a “waiting tax” on every chip in the rack. As model context windows expand and multi-node orchestration becomes the norm, the network determines whether a cluster is a high-speed factory or a bottlenecked warehouse.

Solving the memory tax: Shared KV cache

As models become larger and context windows expand toward the millions of tokens, the cost of repeatedly rebuilding the prompt state has become unsustainable. Large language models rely on key-value (KV) caches to maintain context during a session. Traditionally, these are stored in local GPU memory, which is both expensive and limited.

This creates a “memory tax” that crushes unit economics as concurrency rises. To solve this, the industry is moving toward persistent shared KV cache architectures. By storing the cache centrally on high-performance storage rather than redundantly across multiple GPU nodes, organizations can reduce prefill overhead and improve context reuse.
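A back-of-the-envelope calculation shows why the tax grows so quickly. The sketch below uses the standard sizing rule for a transformer KV cache (one key tensor and one value tensor per layer, per KV head, per token). The model dimensions are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """Bytes needed to hold keys and values for `seq_len` tokens.

    The leading factor of 2 accounts for storing both K and V.
    bytes_per_value=2 corresponds to 16-bit precision.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * bytes_per_value * seq_len * batch_size)


# Hypothetical model dimensions, for illustration only.
HYPOTHETICAL_MODEL = dict(n_layers=80, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for tokens in (8_000, 128_000, 1_000_000):
        gib = kv_cache_bytes(tokens, **HYPOTHETICAL_MODEL) / 2**30
        print(f"{tokens:>9,} tokens -> {gib:,.1f} GiB per session")
```

With these assumed dimensions, each token costs 320 KiB of cache at 16-bit precision, so a single 128,000-token session needs roughly 39 GiB of GPU memory before any model weights are loaded.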

Newer architectures are already proving this out. The VAST Data AI Operating System, running on VAST C-nodes using Nvidia BlueField-4 DPUs, allows for pod-scale shared KV cache that collapses legacy storage tiers. Similarly, the HPE Alletra Storage MP X10000 — the first object-based platform to achieve Nvidia-Certified Storage validation — is designed specifically to feed data to inference resources without the coordination tax that causes bottlenecks at scale. WEKA.io is another provider in this space. 

The compression edge

Beyond the physical hardware, new algorithmic contributions are redefining what is possible in inference memory. Google’s recent presentation of TurboQuant at ICLR 2026 demonstrates the scale of this shift. TurboQuant provides up to a 6x compression level for the KV cache with zero accuracy loss.

Techniques like these allow for building large vector indices with minimal memory footprints and near-zero preprocessing time. For the enterprise, this means more concurrent users on the same hardware estate without the “rebuild storms” that typically cause latency spikes. The caveat: compression standards remain contested — no open-source consensus has emerged, and the space is shaping up as a proprietary stack war between Google and Nvidia.
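The concurrency effect is simple arithmetic. The sketch below assumes a fixed memory budget for KV cache and a fixed per-session cache size, both hypothetical, and treats the compression ratio as a plain multiplier. It does not model how TurboQuant or any other method achieves that ratio.

```python
def max_concurrent_sessions(memory_budget_gib: float,
                            kv_gib_per_session: float,
                            compression_ratio: float = 1.0) -> int:
    """Sessions whose KV cache fits in the budget at a given compression ratio."""
    if compression_ratio < 1.0:
        raise ValueError("compression_ratio must be >= 1.0")
    # Compression shrinks each session's footprint, which is equivalent
    # to multiplying the budget by the ratio.
    return int(memory_budget_gib * compression_ratio // kv_gib_per_session)


if __name__ == "__main__":
    # Hypothetical: 640 GiB reserved for cache, 40 GiB per long-context session.
    print(max_concurrent_sessions(640, 40))                         # uncompressed
    print(max_concurrent_sessions(640, 40, compression_ratio=6.0))  # "up to 6x"
```

Under these assumptions, the same hardware goes from 16 concurrent long-context sessions to 96. The 6x figure is an upper bound claimed for the technique, so real gains may be smaller.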

Storage as a financial decision

Storage is no longer just a backend decision; it is a financial one. Platforms like Dell PowerScale are now delivering up to 19x faster time-to-first-token compared to traditional approaches, according to Dell. By separating high-performance shared storage and memory-intensive data access from scarce GPU resources, these platforms allow inference to scale more efficiently.

When a storage layer can keep GPU-intensive workloads continuously fed with data, it prevents expensive resources from sitting idle. In the efficiency era, the goal is to drive the 5% utilization wall upward by ensuring that every cycle is spent on token generation, not on data movement.

But as the stack becomes more efficient, the perimeter becomes more porous. High-productivity tokens are worthless if the data powering them cannot be trusted.

Sovereignty and the agentic future: Building the trust foundation

The final barrier to achieving return on AI is not a technical bottleneck, but a trust bottleneck. As enterprise AI shifts from simple chatbots to autonomous agents, the risk profile changes. Agents require deep access to internal systems and intellectual property to be useful. Without a sovereign architecture, that access creates a liability that most organizations are not equipped to manage.

VentureBeat research into the state of AI governance reveals a stark disconnect. While many organizations believe they have secured their AI environments, 72% of enterprises admit they do not have the level of control and security they think they do. This governance mirage is particularly dangerous as agentic systems move into production. In the last 12 months, 88% of executives reported security incidents related to AI agents.

Sovereignty as an architecture principle

Data sovereignty is often treated as a geographic or regulatory checkbox. For the strategic enterprise, it must be treated as a core architecture principle. It is about maintaining control, lineage, and explainability over the data that powers an agentic workflow.

This requires a new approach to data maturity, modeled on the traditional medallion architecture. In this framework, data moves through layers of usability and trust — from raw ingestion at the bronze level to refined gold and, eventually, platinum-quality operational data. AI inference must follow this same discipline.

Agentic systems do not just need available context; they need trusted context. Providing the wrong data to an agent, or exposing sensitive intellectual property to a non-sovereign endpoint, creates both business and regulatory risk. Compartmentalization must be designed into the stack from the start. Organizations need to know which models and agents can access specific data layers, under what conditions, and with what lineage attached.
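One way to picture compartmentalization designed into the stack is a policy check that runs before any agent reads a dataset. The Python sketch below is purely illustrative: the class names, fields, and rules are assumptions made for this example, not a description of any vendor's governance product.

```python
from dataclasses import dataclass

# Medallion tiers in ascending order of trust. "platinum" follows the
# extension described above; bronze, silver, and gold are the usual three.
TIERS = ("bronze", "silver", "gold", "platinum")


@dataclass(frozen=True)
class Dataset:
    name: str
    tier: str
    sensitive: bool
    lineage: tuple = ()  # names of upstream sources


@dataclass(frozen=True)
class Agent:
    name: str
    min_tier: str             # lowest trust tier the agent may consume
    sovereign_endpoint: bool  # True if inference stays inside the perimeter


def check_access(agent: Agent, dataset: Dataset):
    """Return (allowed, reason) for an agent reading a dataset."""
    if TIERS.index(dataset.tier) < TIERS.index(agent.min_tier):
        return False, "data tier is below the agent's minimum trust level"
    if dataset.sensitive and not agent.sovereign_endpoint:
        return False, "sensitive data may not leave the sovereign perimeter"
    if not dataset.lineage:
        return False, "no lineage recorded for this dataset"
    return True, "allowed"
```

A hypothetical call such as `check_access(Agent("support-bot", "gold", False), Dataset("contracts", "gold", True, ("crm",)))` would be refused on the second rule, because the data is sensitive and the agent's endpoint is not sovereign.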

Bringing the AI to the data

The fundamental question for the agentic future is whether to bring the data to the AI or the AI to the data. For highly sensitive workloads, moving data to a centralized model endpoint is often the wrong answer.

The move toward private AI — where inference happens closer to where trusted data resides — is gaining momentum. This architecture uses sovereign clouds, private environments, or governed enterprise platforms to keep the data perimeter intact.

This is where the choice to be a token producer becomes a security advantage. By owning the inference stack, an enterprise can enforce governance and lineage at the infrastructure layer. It ensures that the intellectual property used to ground an agent never leaves the organization’s control.

The next platform war

The battle for AI dominance will not be decided by who owns the largest GPU clusters. It will be won by the companies with the best inference economics and the most trusted data foundation.

The organizations that win the efficiency era will be those that deliver the lowest cost per useful token and the fastest path to production. They will be the ones that have moved past the hoarding hangover to focus on productive output.

Achieving return on AI requires a shift in mindset. It means moving from a culture of securing the stack to a culture of squeezing the stack. It requires architectural rigor, a focus on token-level ROI and a commitment to sovereignty. When an organization can generate its own tokens efficiently and securely, AI moves from a science project to an economically repeatable business advantage.

That is how ROI becomes real. That is where the next generation of enterprise advantage will be built.

Rob Strechay is a Contributing VentureBeat analyst and principal at Smuget Consulting, a research and advisory firm focused on data infrastructure and AI systems.

Disclosure: Smuget Consulting engages or has engaged in research, consulting, and advisory services with many technology companies, which can include those mentioned in this article. Analysis and opinions expressed herein are specific to the analyst individually, and data and other information that might have been provided for validation, not those of VentureBeat as a whole.

Miami startup Subquadratic claims 1,000x AI efficiency gain with SubQ model; researchers demand independent proof.

A little-known Miami-based startup called Subquadratic emerged from stealth on Tuesday with a sweeping claim: that it has built the first large language model to fully escape the mathematical constraint that has defined — and limited — every major AI system since 2017.

The company claims its first model, SubQ 1M-Preview, is the first LLM built on a fully subquadratic architecture — one where compute grows linearly with context length. If that claim holds, it would be a genuine inflection point in how AI systems scale. At 12 million tokens, the company says, its architecture reduces attention compute by almost 1,000 times compared to other frontier models — a figure that, if validated independently, would dwarf the efficiency gains of any existing approach.

The company is also launching three products into private beta: an API exposing the full context window, a command-line coding agent called SubQ Code, and a search tool called SubQ Search. It has raised $29 million in seed funding from investors including Tinder co-founder Justin Mateen, former SoftBank Vision Fund partner Javier Villamizar, and early investors in Anthropic, OpenAI, Stripe, and Brex. The New Stack reported that the raise values the company at $500 million.

The numbers Subquadratic is publishing are extraordinary. The reaction from the AI research community has been, to put it mildly, mixed — ranging from genuine curiosity to open accusations of vaporware. Understanding why requires understanding what the company claims to have solved, and why so many prior attempts to solve the same problem have fallen short.

The quadratic scaling problem has shaped the economics of the entire AI industry

Every transformer-based AI model — which includes virtually every frontier system from OpenAI, Anthropic, Google, and others — relies on an operation called “attention.” Every token is compared against every other token, so as inputs grow, the number of interactions — and the compute required to process them — scales quadratically. In plain terms: double the input size, and the cost doesn’t double. It quadruples.
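The arithmetic is easy to check. This short Python sketch counts token-to-token comparisons for dense attention. It is a simplification: it ignores constant factors and the roughly half of all pairs that causal masking skips, neither of which changes the quadratic growth.

```python
def attention_comparisons(n_tokens: int) -> int:
    """Comparisons in dense self-attention: every token against every token."""
    return n_tokens * n_tokens


if __name__ == "__main__":
    base = attention_comparisons(128_000)
    for n in (128_000, 256_000, 512_000, 1_000_000):
        multiple = attention_comparisons(n) / base
        print(f"{n:>9,} tokens: {multiple:5.1f}x the comparisons of 128,000")
```

Doubling the input from 128,000 to 256,000 tokens quadruples the comparison count, and each further doubling quadruples it again.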

This relationship has shaped what gets built and what doesn’t. The industry standard is 128,000 tokens for many AI models and up to 1 million tokens for frontier cloud models such as Claude Sonnet 4.7 and Gemini 3.1 Pro.

Even at those sizes, the cost of processing long inputs becomes punishing. The industry built an elaborate stack of workarounds to cope. RAG systems use a search engine to pull a small number of relevant results before sending them to the model, because sending the full corpus isn’t feasible. Developers layer retrieval pipelines, chunking strategies, prompt engineering techniques, and multi-agent orchestration systems on top of models — all to route around the fundamental constraint that the model itself can’t efficiently process everything at once.

Subquadratic’s argument is that these workarounds are expensive, brittle, and ultimately limiting. As CTO Alexander Whedon told SiliconANGLE in an interview, “I used to manually curate prompts and retrieval systems and evals and conditional logic to chain together the workflows. And I think that that is kind of a waste of human intelligence and also limiting to the product quality.”

Subquadratic’s fix is deceptively simple: stop doing the math that doesn’t matter

The company’s approach, called Subquadratic Sparse Attention or SSA, is built on a straightforward premise: most of the token-to-token comparisons in standard attention are wasted compute. Instead of comparing every token to every other token, SSA learns to identify which comparisons actually matter and computes attention only over those positions. Crucially, the selection is content-dependent — the model decides where to look based on meaning, not on fixed positional patterns. This allows it to retrieve specific information from arbitrary positions across a very long context without paying the quadratic tax.
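Subquadratic has not published its algorithm, so the following is not SSA. It is a generic top-k sparse attention sketch in plain Python, included only to illustrate what content-dependent selection means: each query attends to the few keys it scores highest, wherever they sit in the sequence. Note the caveat in the comments: this toy version still scores every key to make its selection, so it remains quadratic overall.

```python
import math


def topk_sparse_attention(queries, keys, values, k):
    """For each query, attend only to the k keys with the highest scores.

    Illustrative only. Scoring every key is O(n) per query, so this toy
    is still quadratic; a truly subquadratic method needs a cheaper way
    to pick candidate positions. Causal masking is omitted for brevity.
    """
    outputs = []
    dim = len(values[0])
    for q in queries:
        scale = math.sqrt(len(q))
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / scale for key in keys]
        # Content-dependent selection: positions are chosen by score, not by distance.
        top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
        # Softmax over the selected positions only.
        peak = max(scores[i] for i in top)
        weights = [math.exp(scores[i] - peak) for i in top]
        total = sum(weights)
        outputs.append([
            sum(w * values[i][d] for w, i in zip(weights, top)) / total
            for d in range(dim)
        ])
    return outputs
```

With `k` equal to the sequence length, the function reduces to ordinary dense softmax attention. With a small fixed `k`, the softmax and the weighted sum touch only `k` positions per query regardless of context length.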

The practical payoff scales with context length — exactly the inverse of the problem it’s trying to solve. According to the company’s technical blog, SSA achieves a 7.2x prefill speedup over dense attention at 128,000 tokens, rising to 52.2x at 1 million tokens. As Whedon put it: “If you double the input size with quadratic scaling laws, you need four times the compute; with linear scaling laws, you need just twice.” The company says it trained the model in three stages — pretraining, supervised fine-tuning, and a reinforcement learning stage specifically targeting long-context retrieval failures — teaching the model to aggressively use distant context rather than defaulting to nearby information, a subtle failure mode that quietly degrades performance in existing systems.
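Whedon's arithmetic can be turned into a rough consistency check on the published speedups. The sketch below assumes an idealized model in which each token attends to a fixed number of positions, so cost grows linearly. That per-token budget is a hypothetical parameter, since the company has not disclosed one, and the model ignores selection overhead and every non-attention cost.

```python
def dense_cost(n_tokens: int) -> int:
    """Idealized dense attention cost: n * n comparisons."""
    return n_tokens * n_tokens


def linear_cost(n_tokens: int, budget: int) -> int:
    """Idealized linear cost: each token attends to `budget` positions."""
    return n_tokens * budget


def theoretical_speedup(n_tokens: int, budget: int) -> float:
    """Dense cost divided by linear cost, which simplifies to n / budget."""
    return dense_cost(n_tokens) / linear_cost(n_tokens, budget)


if __name__ == "__main__":
    budget = 4096  # hypothetical; cancels out of the ratio below
    growth = theoretical_speedup(1_000_000, budget) / theoretical_speedup(128_000, budget)
    reported_growth = 52.2 / 7.2  # company-reported prefill speedups
    print(f"idealized growth in speedup, 128K -> 1M: {growth:.2f}x")
    print(f"reported growth in speedup, 128K -> 1M:  {reported_growth:.2f}x")
```

In the idealized model the speedup should grow about 7.8x between 128,000 and 1 million tokens, whatever the budget. The reported figures grow by 7.25x. That is in the same range as linear scaling would predict, though it is a sanity check on the company's own numbers rather than independent evidence.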

Three benchmarks paint a strong picture, but what they leave out may matter more

On the surface, SubQ’s benchmark numbers are competitive with or superior to models built by organizations spending billions of dollars. On SWE-Bench Verified, it scored 81.8% compared to Opus 4.6’s 80.8% and DeepSeek 4.0 Pro’s 80.0%. On RULER at 128,000 tokens, a standard benchmark for reasoning over extended inputs, SubQ scored 95% — edging out Claude Opus 4.6 at 94.8%. On MRCR v2, a demanding test of multi-hop retrieval across long contexts, SubQ posted a third-party verified score of 65.9%, compared with Claude Opus 4.7 at 32.2%, GPT-5.5 at 74%, and Gemini 3.1 Pro at 26.3%.

But several details warrant scrutiny. The benchmark selection is narrow — exactly three tests, all emphasizing long-context retrieval and coding, the precise tasks SubQ is designed for. Broader evaluations across general reasoning, math, multilingual performance, and safety have not been published. The company says a comprehensive model card is “coming soon.”

According to The New Stack, each benchmark model was run only once due to high inference cost, and the SWE-Bench margin is, as the company’s own paper acknowledges, “harness as much as model.” In benchmark methodology, single runs without confidence intervals leave room for variance. There is also a significant gap between SubQ’s research results and its production model. On MRCR v2, the company reported a research score of 83 — but the third-party verified production model scored 65.9. That 17-point gap between the lab result and the shipping product is notable and largely unexplained.
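To see how much room a single run leaves, consider the sampling uncertainty alone. The sketch below computes a 95% Wilson score interval for a pass rate, assuming the 500-task SWE-Bench Verified set and treating each task as an independent pass/fail trial. It captures only the uncertainty from the finite task set, not run-to-run variance or harness effects, so the true uncertainty is likely larger.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    n_tasks = 500  # assumed size of SWE-Bench Verified
    for name, score in (("SubQ", 0.818), ("Opus 4.6", 0.808)):
        low, high = wilson_interval(round(score * n_tasks), n_tasks)
        print(f"{name}: {score:.1%} (95% interval {low:.1%} to {high:.1%})")
```

On these assumptions, the interval around SubQ's 81.8% spans roughly 78% to 85%, which comfortably contains the 80.8% reported for Opus 4.6. A one-point lead from a single run is well within the noise.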

Subquadratic also told SiliconANGLE that on the RULER 128K benchmark, SubQ scored 95% accuracy at a cost of $8, compared with 94% accuracy and about $2,600 for Claude Opus — a remarkable cost claim. But the company has not publicly disclosed specific API pricing, making it impossible to independently verify the cost-per-task comparisons.

The AI research community’s verdict ranges from ‘genuine breakthrough’ to ‘AI Theranos’

Within hours of the announcement, the AI research community erupted into a debate that crystallized around a single question: Is this real?

AI commentator Dan McAteer captured the binary mood in a widely shared post: “SubQ is either the biggest breakthrough since the Transformer… or it’s AI Theranos.” The comparison to the infamous blood-testing fraud company may be unfair, but it reflects the scale of the claims being made. Skeptics zeroed in on several pressure points. Prominent AI engineer Will Depue initially noted that SubQ is “almost surely a sparse attention finetune of Kimi or DeepSeek,” referring to existing open-source models.

Whedon confirmed this on X, writing that the company is “using weights from open-source models as a starting point, as a function of our funding and maturity as a company.” Depue later escalated his criticism, writing that the company’s O(n) scaling claims and the speedup numbers “don’t seem to line up” and called the communication “either incredibly poorly communicated or just not real.”

Others raised structural questions. One developer noted that if SubQ truly reduces compute by 1,000x and costs less than 5% of Opus, the company should have no trouble serving it at scale — so why gate access through an early-access program? Developer Stepan Goncharov called the benchmarks “very interesting cherry-picked benchmarks,” while another commenter described them as “suspiciously perfect.”

But not everyone was dismissive. AI researcher John Rysana pushed back on the Theranos framing, writing that the work is “just subquadratic attention done well which is very meaningful for long context workloads,” and that “odds of it being BS are extremely low.” Linus Ekenstam, a tech commentator, said he was “extremely intrigued to see the real-world implications” particularly for complex AI-powered software.

Magic.dev made strikingly similar claims two years ago — and then went quiet

Perhaps the most pointed critique of SubQ’s launch comes not from its specific claims but from recent history. Magic.dev announced a 100-million-token context-window model in August 2024, with a claimed 1,000x efficiency advantage, and raised roughly $500 million on the strength of those claims. As of early 2026, there is no public evidence of LTM-2-mini being used outside Magic.

The parallels are uncomfortable. Both companies claimed massive context windows. Both touted roughly 1,000x efficiency gains. Both targeted software engineering as their primary use case. And both launched with limited external access.

The broader research landscape reinforces the caution. Kimi Linear, DeepSeek Sparse Attention, Mamba, and RWKV all promised subquadratic scaling, and all faced the same problem: architectures that achieve linear complexity in theory often underperform quadratic attention on downstream benchmarks at frontier scale, or they end up hybrid — mixing subquadratic layers with standard attention and losing the pure scaling benefits.

A widely cited LessWrong analysis argued that these approaches “are all better thought of as ‘incremental improvement number 93595 to the transformer architecture’” because practical implementations remain quadratic and “only improve attention by a constant factor.”

Subquadratic is directly aware of this history. Its own technical blog specifically addresses each prior approach — fixed-pattern sparse attention, state space models, hybrid architectures, and DeepSeek Sparse Attention — and argues that SSA avoids their tradeoffs. Whether it actually does remains an empirical question that only independent evaluation can settle.

A five-time founder, a former Meta engineer, and $29 million to prove the doubters wrong

The team behind the claims matters in evaluating them. CEO Justin Dangel is a five-time founder and CEO with a track record across health tech, insurtech, and consumer goods, and his companies have scaled to hundreds of employees, attracted institutional backing, and reached liquidity. CTO Alexander Whedon previously worked as a software engineer at Meta and served as Head of Generative AI at TribeAI, where he led over 40 enterprise AI implementations.

The team includes 11 PhD researchers with backgrounds from Meta, Google, Oxford, Cambridge, ByteDance, and Adobe. That is a credible collection of talent for an architecture-level research effort. But neither co-founder has published foundational AI research, and the company has not yet released a peer-reviewed paper. The technical report is listed as “coming soon.”

The funding profile is unusual for a company making frontier AI claims. Subquadratic raised $29 million at a reported $500 million valuation — a steep price for a seed-stage company with no publicly available model, no peer-reviewed research, and no disclosed revenue. The investor base, led by Tinder co-founder Mateen and former SoftBank partner Villamizar, skews toward consumer tech and growth investing rather than deep technical AI research. The company is not open-sourcing its weights but plans to offer training tools for enterprises to do their own post-training, and has set a 50-million-token context window target for Q4.

The real test for SubQ isn’t benchmarks — it’s whether the math survives independent scrutiny

Strip away the marketing language and the social media drama, and the underlying question Subquadratic is asking is genuinely important: Can AI systems break free of quadratic scaling without sacrificing the quality that makes them useful?

The stakes are enormous. If attention can be made truly linear without degrading retrieval and reasoning, the economics of AI shift fundamentally. Enterprise applications that today require elaborate retrieval pipelines — processing entire codebases, contracts, regulatory filings, medical records — become single-pass operations. The billions of dollars currently spent on RAG infrastructure, context management, and agentic orchestration become partially redundant. 

Whedon’s willingness to engage publicly with technical criticism — posting a technical blog within hours of pushback — suggests a team that understands it needs to show its work, not just describe it. And to its credit, the company acknowledged openly that it builds on open-source foundations and that its model is smaller than those at the major labs.

Every frontier model in 2026 advertises a context window of at least a million tokens, but almost none of them are actually great at making use of all that information. The gap between a nominal context window and a functional one — between what a model accepts and what it reliably reasons over — remains one of the most important unsolved problems in AI. Subquadratic says it has closed that gap. If independent evaluation confirms that claim, the implications would ripple far beyond a single startup’s valuation. If it doesn’t, the company joins a growing list of long-context promises that sounded revolutionary on launch day and unremarkable six months later.

In computing, every fundamental constraint eventually falls. When it does, the breakthrough never comes from the direction the industry expected. The question hanging over Subquadratic is whether a team of 11 PhDs and a $29 million seed round actually found the answer that has eluded organizations spending thousands of times more — or whether they just found a better way to describe the problem.

The AI scaffolding layer is collapsing. LlamaIndex’s CEO explains what survives.

The scaffolding layer that developers once needed to ship LLM applications — indexing layers, query engines, retrieval pipelines, carefully orchestrated agent loops — is collapsing. And according to Jerry Liu, co-founder and CEO of LlamaIndex, that’s not a problem. It’s the point.

“As a result, there’s less of a need for frameworks to actually help users compose these deterministic workflows in a light and shallow manner,” Liu explains in a new VentureBeat Beyond the Pilot podcast.

Context is becoming the moat

Liu’s LlamaIndex is one of the foremost retrieval-augmented generation (RAG) frameworks connecting private, custom, and domain-specific data to LLMs. But even he acknowledges that these types of frameworks are becoming less relevant. 

With every new release, models demonstrate incremental capabilities to reason over “massive amounts” of unstructured data, and they’re getting better at it than humans, he notes. They can be trusted to reason extensively, self-correct, and perform multi-step planning; Model Context Protocol (MCP) and Claude Agent Skills plug-ins allow models to discover and use tools without requiring a separate integration for each one.

Agent patterns have consolidated toward what Liu calls a “managed agent diagram” — a harness layer combined with tools, MCP connectors, and skills plug-ins, rather than custom-built orchestration for every workflow.

Further, coding agents excel at writing code, meaning devs don’t need to rely on extensive libraries. In fact, about 95% of LlamaIndex code is generated by AI. “Engineers are not actually writing real code,” Liu said. “They’re all typing in natural language.” This means the layer between programmers and non-programmers is collapsing, because “the new programming language is essentially English.”

Instead of manual coding or struggling to understand API and document integration, devs can just point Claude Code at it. “This type of stuff was either extremely inefficient or just would break the agent three years ago,” said Liu. “It’s just way easier for people to build even relatively advanced retrieval with extremely simple primitives.”

So what’s the core differentiator when the stack collapses? 

Context, Liu says. Agents need to be able to decipher file formats to extract the right information. Providing higher accuracy and cheaper parsing becomes key, and LlamaIndex is well-positioned here, he contends, because of its developments with agentic document processing via optical character recognition (OCR). 

“We’ve really identified that there’s a core set of data that has been locked up in all these file format containers,” he said. Ultimately, “whether you use OpenAI Codex or Claude Code doesn’t really matter. The thing that they all need is context.”

Keeping stacks modular

There’s growing concern about builders like Anthropic locking in session data; in light of this, Liu emphasizes the importance of modularity and agnosticism. Builders shouldn’t bet on any one frontier model, or overbuild in a way that overcomplicates components of the stack. 

Retrieval has evolved into “agent-plus-sandbox,” as he describes it, and enterprises must ensure that their code bases are free of tech debt and adaptable to changing patterns. They also have to acknowledge that some parts of the stack will eventually need to be thrown away as a matter of course.

“Because with every new model release, there’s always a different model that is kind of the winner,” Liu said. “You want to make sure you actually have some flexibility to take advantage of it.”

Listen to the podcast to hear more about: 

  • LlamaIndex’s beginnings as a ‘toy project’ with initially only about 40% accuracy; 

  • How SaaS companies can tap into complicated workflows that must be standardized and repeatable for average knowledge workers;

  • Why vertical AI companies are taking off and why ‘build versus buy’ is still a very valid question in the agent age. 

You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.

One tool call to rule them all? New open source Python tool Runpod Flash eliminates containers for faster AI dev

Runpod, the high-performance cloud computing and GPU platform designed specifically for AI development, today launched a new open source, MIT licensed, enterprise-friendly Python programming tool called Runpod Flash — and it is poised to make creation, iteration and deployment of AI systems inside and outside of foundation model labs much faster.

The tool aims to eliminate some of the biggest barriers to training and using AI models today, chiefly the Docker packaging and containerization required when developing for serverless GPU infrastructure. The company believes removing them will speed up development and deployment of new AI models, applications and agentic workflows.

Additionally, the platform is built to serve as a critical substrate for AI agents and coding assistants—such as Claude Code, Cursor, and Cline—enabling them to orchestrate and deploy remote hardware autonomously with minimal friction.

Developers can utilize Flash to accomplish a diverse set of high-performance computing tasks, including cutting-edge deep learning research, model training, and fine-tuning.

“We make it as easy as possible to be able to bring together the cosmos of different AI tooling that’s available in a function call,” said Runpod chief technology officer (CTO) Brennen Smith in a video call interview with VentureBeat last week.

The tool allows for the creation of sophisticated “polyglot” pipelines, where users can route data preprocessing to cost-effective CPU workers before automatically handing off the workload to high-end GPUs for inference.

Beyond research and development, Flash supports production-grade requirements through features such as low-latency load-balanced HTTP APIs, queue-based batch processing, and persistent multi-datacenter storage.

Eliminating the ‘packaging tax’ of AI development

The core value proposition of Flash GA is the removal of Docker from the serverless development cycle.

In traditional serverless GPU environments, a developer must containerize their code, manage a Dockerfile, build the image, and push it to a registry before a single line of logic can execute on a remote GPU. Runpod Flash treats this entire process as a “packaging tax” that slows down iteration cycles.

Under the hood, Flash utilizes a cross-platform build engine that enables a developer working on an M-series Mac to produce a Linux x86_64 artifact automatically.

This system identifies the local Python version, enforces binary wheels, and bundles dependencies into a deployable artifact that is mounted at runtime on Runpod’s serverless fleet.

This mounting strategy significantly reduces “cold starts”—the delay between a request and the execution of code—by avoiding the overhead of pulling and initializing massive container images for every deployment.

Furthermore, the technology infrastructure supporting Flash is built on a proprietary Software Defined Networking (SDN) and Content Delivery Network (CDN) stack.

Smith told VentureBeat that the hardest problems in GPU infrastructure are often not the GPUs themselves, but the networking and storage components that link them together.

“Everyone is talking about agentic AI, but the way I personally see it — and the way the leadership team at RunPod sees it — is that there needs to be a really good substrate and glue for these agents, whatever they might be powered by, to be able to work with,” Smith said.

Flash leverages this low-latency substrate to handle service discovery and routing, enabling cross-endpoint function calls. This allows developers to build “polyglot” pipelines where, for instance, a cheap CPU endpoint handles data preprocessing before routing the clean data to a high-end NVIDIA H100 or B200 GPU for inference.

Four distinct workload architectures supported

While the Flash beta focused on live-test endpoints, the GA release introduces a suite of features designed for production-grade reliability.

The primary interface is the new @Endpoint decorator, which consolidates configuration—such as GPU type, worker scaling, and dependencies—directly into the code. The GA release defines four distinct architectural patterns for serverless workloads:

  • Queue-based: Designed for asynchronous batch jobs where functions are decorated and run.

  • Load-balanced: Tailored for low-latency HTTP APIs where multiple routes share a pool of workers without queue overhead.

  • Custom Docker Images: A fallback for complex environments like vLLM or ComfyUI where a pre-built worker is already available.

  • Existing Endpoints: Using Flash as a Python client to interact with previously deployed Runpod resources via their unique IDs.
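To make the idea of consolidating configuration into a decorator concrete, here is a minimal sketch in plain Python. It is not the Flash SDK: the `endpoint` decorator, the `EndpointConfig` fields, and the `REGISTRY` dictionary are all hypothetical stand-ins, and nothing here is deployed anywhere. It only shows how GPU type, worker scaling, and dependencies can live beside the function they configure, and how a CPU step can feed a GPU step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class EndpointConfig:
    """Hypothetical config record; field names are illustrative only."""
    name: str
    gpu: Optional[str] = None  # None stands for a CPU-only endpoint
    max_workers: int = 1
    dependencies: List[str] = field(default_factory=list)


# Hypothetical in-memory registry of everything that was decorated.
REGISTRY: Dict[str, EndpointConfig] = {}


def endpoint(name: str, gpu: Optional[str] = None, max_workers: int = 1,
             dependencies: Optional[List[str]] = None) -> Callable:
    """Stand-in decorator: stores the config next to the function it wraps."""
    def wrap(fn: Callable) -> Callable:
        cfg = EndpointConfig(name, gpu, max_workers, list(dependencies or []))
        REGISTRY[name] = cfg
        fn.config = cfg
        return fn
    return wrap


@endpoint(name="preprocess", max_workers=4)
def preprocess(text: str) -> str:
    # Cheap CPU work: collapse whitespace and lowercase.
    return " ".join(text.split()).lower()


@endpoint(name="infer", gpu="H100", dependencies=["torch"])
def infer(clean_text: str) -> dict:
    # Placeholder for a model call; returns a word count instead.
    return {"tokens": len(clean_text.split())}


def pipeline(text: str) -> dict:
    """CPU preprocessing first, then the (pretend) GPU step."""
    return infer(preprocess(text))
```

In this toy version both functions run locally; in a real serverless system each decorated function would presumably run on its own pool of workers.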

A critical addition for production environments is the NetworkVolume object, which provides first-class support for persistent storage across multiple datacenters.

Files mounted at /runpod-volume/ allow for model weights and large datasets to be cached once and reused, further mitigating the impact of cold starts during scaling events.

Additionally, Runpod has introduced environment variable management that is excluded from the configuration hash, meaning developers can rotate API keys or toggle feature flags without triggering an entire endpoint rebuild.
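One way to picture a configuration hash that ignores environment variables is sketched below. This is an assumption about the general technique, not Runpod's implementation: the config keys (`gpu`, `max_workers`, `dependencies`, `env`) and the `config_hash` helper are hypothetical.

```python
import hashlib
import json


def config_hash(config: dict, excluded: tuple = ("env",)) -> str:
    """Hash only the fields that should trigger a rebuild when they change."""
    hashed_fields = {k: v for k, v in config.items() if k not in excluded}
    # sort_keys makes the hash independent of dictionary ordering.
    payload = json.dumps(hashed_fields, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical endpoint configuration.
base = {
    "gpu": "H100",
    "max_workers": 3,
    "dependencies": ["torch"],
    "env": {"API_KEY": "old-key"},
}
rotated = {**base, "env": {"API_KEY": "new-key"}}  # only a secret changed
resized = {**base, "max_workers": 5}               # a build-relevant change

print(config_hash(base) == config_hash(rotated))  # same hash, no rebuild
print(config_hash(base) == config_hash(resized))  # different hash, rebuild
```

Because the `env` key never enters the hashed payload, rotating a key leaves the hash unchanged, while changing worker scaling does not.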

To address the rise of AI-assisted development, Runpod has released specific skill packages for coding agents like Claude Code, Cursor, and Cline.

These packages provide agents with deep context regarding the Flash SDK, effectively reducing syntax hallucinations and allowing agents to write functional deployment code autonomously.

This move positions Flash not just as a tool for humans, but as the “substrate and glue” for the next generation of AI agents.

Why open source Runpod Flash?

Runpod has released the Flash SDK under the MIT License, one of the most permissive open-source licenses available.

This choice is a deliberate strategic move to maximize market share and developer adoption. In contrast to more restrictive licenses like the GPL (General Public License), which can impose “copyleft” requirements—potentially forcing companies to open-source their own proprietary code if it links to the library—the MIT license allows for unrestricted commercial use, modification, and distribution.

Smith explained this philosophy as a “motivating construct” for the company: “I prefer to win based on product quality and product innovation rather than legalese and lawyers,” he told VentureBeat.

By adopting a permissive license, Runpod lowers the barrier for enterprise adoption, as legal teams do not have to navigate the complexities of restrictive open-source compliance.

Furthermore, it invites the community to fork and improve the tool, which Runpod can then integrate back into the official release, fostering a collaborative ecosystem that accelerates the development of the platform.

Timing is everything: Runpod’s growth and market positioning

The launch of Flash GA comes at a time of explosive growth for Runpod, which was founded in 2022, has surpassed $120 million in annual recurring revenue (ARR), and serves a developer base of more than 750,000.

The company’s growth is driven by two distinct segments: the “P90” enterprises—large-scale operations like Anthropic, OpenAI, and Perplexity—and the “sub-P90” independent researchers and students who represent the vast majority of the user base.

The platform’s agility was recently demonstrated during the release of DeepSeek V4 in preview last week. Within minutes of the model’s debut, developers were utilizing Runpod infrastructure to deploy and test the new architecture.

This “real-time” capability is a direct result of Runpod’s specialized focus on AI developers, offering over 30 GPU SKUs and billing by the millisecond to ensure that every dollar of spend results in maximum throughput.

Runpod’s position as the “most cited AI cloud on GitHub” suggests that it has successfully captured the developer mindshare required to sustain its momentum.

With Flash GA, the company is attempting to transition from being a provider of raw compute to becoming the essential orchestration layer for the AI-first cloud.

As development shifts toward “intent-based” coding—where the outcome is prioritized over the execution details—tools that bridge the gap between local ideas and global scale will likely define the next era of computing.

Amazon’s OpenAI gambit signals a new phase in the cloud wars — one where exclusivity no longer applies

Amazon Web Services on Tuesday launched one of the most consequential enterprise AI plays in the company’s 20-year history, simultaneously bringing OpenAI’s most powerful models to its Bedrock platform, unveiling a new agentic developer framework, releasing a desktop AI productivity tool called Amazon Quick, and expanding its Amazon Connect service from a single contact-center product into a family of four agentic AI solutions targeting supply chains, hiring, healthcare, and customer experience.

The announcements, made at a live event in San Francisco titled “What’s Next with AWS,” landed just 24 hours after OpenAI and Microsoft publicly restructured their exclusive cloud partnership — a move that, for the first time, freed OpenAI to distribute all of its products across rival cloud providers. AWS CEO Matt Garman called it “a huge partnership” and said customers have been asking for OpenAI models inside AWS “from the very early days.”

The timing was no accident. Amazon CEO Andy Jassy had flagged the Microsoft-OpenAI restructuring as “very interesting” in a post on X the day prior, promising more details on Tuesday. What followed was a sweeping set of launches that together represent AWS’s bid to become the definitive infrastructure layer for the agentic AI era — one where intelligent software agents don’t just answer questions but take autonomous action inside enterprise workflows.

OpenAI’s most capable models arrive on Amazon Bedrock for the first time, reshaping the cloud AI marketplace

The centerpiece announcement: OpenAI’s latest models are now available through Amazon Bedrock in limited preview, with general availability expected within weeks. AWS confirmed that GPT-5.4 is available immediately in limited preview, with GPT-5.5 arriving shortly thereafter.

In an exclusive interview with VentureBeat at the event, Anthony Liguori, Vice President and Distinguished Engineer at AWS, described the significance of the moment. “We announced a partnership about eight weeks ago centered around this idea of the stateful runtime environment, the SRE APIs,” Liguori said. “However, today we announced the availability of all of OpenAI’s frontier models in Amazon Bedrock available via both the stateless APIs — these are the APIs that are commonly used, like chat completions and responses.”

Liguori characterized the stateless API availability as particularly critical because it removes migration friction. “Customers can take their existing workloads today and just start using AWS right off the bat,” he said. “They don’t have to write any new software, develop any new things. I think that’s one of the most exciting announcements that came out today.”

The integration means AWS customers can now evaluate and deploy OpenAI models alongside offerings from Anthropic, Meta, Mistral, Cohere, and Amazon’s own models — all through Bedrock’s unified security, governance, and cost controls. For enterprise procurement teams, this collapses what had been a fragmented multi-vendor landscape into a single pane of glass.

How a $50 billion Amazon investment and a messy Microsoft breakup cleared the way for Tuesday’s deal

The path to Tuesday’s announcement was anything but smooth. As TechCrunch reported, OpenAI’s earlier $50 billion deal with Amazon, announced in February, had created a legal tangle with Microsoft. Under the original Microsoft-OpenAI agreement, Microsoft retained exclusive rights to OpenAI products accessed through APIs, which appeared to conflict directly with OpenAI’s promise to give AWS exclusive hosting rights for its new Frontier agent-building tool.

Microsoft had publicly pushed back at the time, stating that “Azure remains the exclusive cloud provider of stateless OpenAI APIs.” The Financial Times reported that Microsoft even contemplated legal action. Monday’s restructured deal — which replaced Microsoft’s open-ended exclusivity with a nonexclusive license running through 2032 — swept those legal obstacles aside.

For AWS, the resolution means its multi-billion-dollar investment in OpenAI can now fully bear fruit. As CNBC reported, OpenAI’s revenue chief Denise Dresser had told employees in a memo that the Microsoft relationship “has also limited our ability to meet enterprises where they are — for many that’s Bedrock.” At the San Francisco event, Dresser framed the moment as a turning point. “They’re no longer in the mindset of experimentation and pilots,” she said of enterprise customers. “They really want to go full enterprise wide, and they understand that to do that, they need to have powerful models. But even more importantly, they want those models in a trusted environment.”

OpenAI CEO Sam Altman, who was unable to attend in person due to his ongoing court case against Elon Musk across the Bay Bridge in Oakland, sent a recorded video message. “We are co-developing an agent platform from the ground up, deeply integrated with AWS services and powered by OpenAI’s most advanced models and tools,” Altman said, “so that customers can build and run powerful agents in their own environment without worrying about the underlying plumbing.”

Inside Bedrock managed agents, the reinforcement learning-trained ‘harness’ that AWS says will define the agentic era

Beyond raw model access, AWS launched Amazon Bedrock Managed Agents powered by OpenAI — a system that combines OpenAI’s frontier models with its proprietary “harness,” the agentic execution framework that powers products like Codex. This is where Liguori’s technical analysis was most revealing.

He explained that the harness concept represents a shift in how models are trained and deployed for agentic work. “When you think about an agentic platform, there’s really two components,” Liguori told VentureBeat. “One is the harness — the actual logic that will execute tool calls for the model, determine when to compact the context, all of those sorts of things — and then the model itself.”

Critically, Liguori argued, the best agentic performance comes when models are trained specifically against their harness through reinforcement learning — not merely prompted to use tools at inference time. “You can give a model a whole lot of instructions and a set of tools, and it will be able to use it most of the time,” he said. “But when you really train the model on a specific set of tools, a specific style of operations, it’s just like drilling plays over and over again — the model builds muscle memory for using that harness.”

The football analogy is instructive. Where general-purpose models are like versatile athletes who can adapt to any playbook, harness-trained models are like championship teams that have run the same formations thousands of times until execution becomes instinctive. For enterprises deploying agents in high-stakes production environments — managing financial transactions, orchestrating supply chains, or processing sensitive healthcare data — that reliability gap matters enormously.

Bedrock Managed Agents consists of three components: a runtime layer for configuring skills, memory policies, and tool access; an environment layer where the agent lives (deployable on Fargate or other AWS compute); and an inference API for interacting with the agent. The system integrates deeply with AWS’s identity and access management, VPC networking, and CloudTrail auditing — meaning every action an agent takes is logged and governed by existing enterprise security policies.

AWS makes its boldest security claim yet: zero human access to inference machines running OpenAI’s models

Liguori made what may be his most striking claim when discussing why enterprises should trust AWS over on-premises alternatives or smaller cloud providers. “With Bedrock, the system that we’re using to host the GPT-5.4 models, that whole environment is zero operator access,” he told VentureBeat. “There’s no human that could ever log into one of those machines, so your inference data is never able to be accessed by a human.”

He pointed to AWS’s custom silicon — Graviton processors and Nitro security chips — as the foundation for this claim. “When you look at one of our servers, either compute servers or the servers we’re using for Gen AI, the only thing that you can buy off the shelf is the memory modules. Everything else is either custom boards or even custom silicon.”

This argument is designed to counter a growing narrative from what the industry calls “neo-clouds” — smaller providers that offer on-premises model hosting with tighter physical security controls. Liguori flipped that argument on its head: “You’re actually way more secure in the cloud because we have built a platform with such strong physical securities… If you were to try to stand up your own inference system today, you’d probably be running open source software on just Linux.”

It’s a bold claim, and one that enterprise CISOs will undoubtedly scrutinize. But it underscores AWS’s conviction that the agentic era — where AI agents access source code, PII data, and critical business systems — demands infrastructure security guarantees that go far beyond what most organizations can build independently.

Codex’s 4 million weekly users could soon multiply as OpenAI’s coding agent arrives on AWS

OpenAI’s Codex coding agent also arrived on Bedrock in limited preview. Dresser shared that Codex has been growing at a blistering pace, expanding “from 3 million weekly active users to 4 million in two weeks.” The tool has evolved beyond simple code generation into a full agentic software development lifecycle platform.

For Liguori, who described himself as “10 to 20 times more productive” as an engineer thanks to tools like Codex, bringing this capability into AWS represents the bridge between individual developer productivity and enterprise-scale deployment. “Most developers today are using these OpenAI models on their laptops,” he said. “We haven’t seen that happen yet in the rest of the industry, and with Bedrock Managed Agents, we think we have a way for enterprises to deploy agents in a means that meets their compliance requirements.”

The gap Liguori is describing — between the solo developer experience and enterprise-wide adoption — is arguably the central challenge of the current AI moment. Individual engineers can achieve extraordinary productivity gains with agentic coding tools. But scaling that to thousands of developers across a Fortune 500 company, with proper governance, security, and auditability, requires platform-level infrastructure. That’s the market AWS is targeting.

Liguori saw the near-term potential in even more immediate terms. He described leading a team of about 20 engineers who share a common codebase of skills and MCP tools. “That has been an amazingly powerful thing, because we’re all able to build on top of each other as we learn how to use these models,” he said. “Where I’ve run into a hurdle is there’s a lot of stuff I’d like to share with our finance team… and I can’t really ask them to clone a Git repo and build it from a Git repo.” Bedrock Managed Agents, he argued, will let teams create hosted agents that non-technical colleagues can access — taking agentic development from a developer-only practice to an enterprise-wide capability within the next six months.

Amazon Quick Desktop aims to be the agentic AI assistant that finally works for non-developers

While the OpenAI partnership dominated headlines, AWS also launched Amazon Quick Desktop — a new desktop application designed to bring agentic AI to knowledge workers who aren’t developers. Liguori framed the product as addressing a critical gap. “A lot of these agentic tools have primarily targeted developers,” he said. “Quick Desktop is a really great tool if you are a knowledge worker that is not a developer… I think it’s been underserved for the non-developer knowledge workers.”

Quick Desktop integrates with a user’s local files, calendar, email, Slack, and enterprise applications — building what AWS calls a “Knowledge Graph” that maps relationships between people, projects, decisions, and actions. The system connects natively with Google Workspace, Microsoft 365, Zoom, and Salesforce. Unlike other AI productivity tools, Quick doesn’t wait for prompts. It proactively surfaces what matters — unanswered emails, deals needing updates, documents awaiting review — and can take action like scheduling meetings, drafting emails, or updating Jira tickets.

Garman, who said he had been using the desktop app for several weeks, called it “by far the most effective tool” among AI productivity products he has tested. “If you think about what we’ve done with Quick — combine all of your sources of data inside of the enterprise — but then we also saw the power of having access to a local desktop and being able to operate with your local files and your local email and your local Slack… but people were worried about security, appropriately so,” Garman said. “What we’re doing here is combining a bunch of those things together with Quick to give you the best of all of those worlds.”

The product is available in preview today, with no AWS account required — users can sign up with just an email address. Customers including BMW, 3M, Mondelēz, Southwest Airlines, and the NFL are already using it, with some reporting production time reductions of nearly 80% and customer issue processing cut by more than 50%.

Amazon Connect becomes a family of four as AWS bets that ‘agentic teammates’ will transform supply chains, hiring, and healthcare

Perhaps the most ambitious long-term bet announced Tuesday was the expansion of Amazon Connect from a single contact-center product — one that reached over $1 billion in revenue last year and processes 20 million interactions daily — into a family of four agentic AI solutions.

The new lineup includes Amazon Connect Decisions, an agentic supply chain planning tool built on more than 25 specialized supply chain tools and 30 years of Amazon operational science, including one of Amazon’s SCOT (Supply Chain Optimization Technologies) foundation models. Amazon Connect Talent is a high-volume hiring platform inspired by Amazon’s experience hiring 250,000 seasonal employees during peak periods, using AI agents to conduct voice interviews around the clock and present recruiters with anonymized, skills-based scoring. Amazon Connect Customer AI is the renamed and enhanced version of the original contact-center service. And Amazon Connect Health covers the patient journey from appointment scheduling through clinical encounters, including ambient documentation, billing code suggestions, and post-visit summaries drawn from Amazon’s experience with One Medical and Amazon Pharmacy.

Colleen Aubrey, who leads applied AI solutions at AWS and previously co-founded Amazon’s advertising business, introduced a new design philosophy underlying all four products: “humorphism.” Where skeuomorphism translated physical objects into digital metaphors — desks to desktops, files to folders — humorphism translates human interaction dynamics into AI agent behavior. “If we’re building products that at the heart of which is an agentic teammate, then how should those teammates interact with you?” Aubrey asked. The philosophy manifests in specific design choices: Connect Decisions agents ask planners why they made manual adjustments and apply those insights across similar products. Connect Talent agents adapt follow-up questions based on candidate responses. Connect Health agents trace every clinical insight back to source data so physicians can verify AI-generated documentation.

What AWS’s four-layer strategy reveals about where the real value in enterprise AI will be captured

Taken together, Tuesday’s announcements reveal a coherent strategy operating across four distinct layers: custom infrastructure (Graviton, Trainium, zero-operator-access security), model access (Bedrock as a model marketplace with unified APIs), an agentic platform (Bedrock Managed Agents and AgentCore for building and governing agents), and purpose-built applications (Quick for individual productivity, Connect for vertical business operations).

This layered approach addresses a fundamental tension in the enterprise AI market. Companies want choice at the model layer but integration at the platform layer and specificity at the application layer. By offering all three through a single security and governance framework, AWS is betting it can capture value across the entire stack — a strategy that reshapes competitive dynamics for Microsoft, Google Cloud, and the growing constellation of smaller AI infrastructure providers.

Garman pushed back on the “SaaSpocalypse” narrative that agentic AI will destroy incumbent enterprise software companies. “The incumbent providers today have such a huge advantage,” he said. “They have deep domain expertise… a large customer set with all of their data.” He pointed to Salesforce’s recent headless API offering as an example of incumbents adapting smartly. But he also drew an explicit parallel to the early days of cloud computing, when customers would simply replicate their on-premises data centers in the cloud rather than reimagine what was possible. “You see that today with how people are thinking about AI and agents,” Garman said. “They’re like, ‘I have this business process, I’m gonna have agents do the exact same thing that humans do.’ It kind of works… but it doesn’t give you that transformational change.”

He pointed to Amazon’s own Prime Video team as proof of what that change looks like in practice. The team used agentic tools to rebuild a partner payment system that was projected to take two years — completing it in roughly two quarters with a handful of people, while simultaneously improving the system for customers, for Amazon, and for the partners who get paid through it.

The enterprise AI arms race enters a new phase as model access becomes table stakes and the platform war begins

For enterprises evaluating their AI strategies, Tuesday’s announcements simplify one decision — OpenAI models are now available where most of them already run production workloads — while complicating another. With model access increasingly commoditized across cloud providers, the real differentiator becomes the platform layer: where agents are built, governed, deployed, and trusted to take consequential actions. That’s the battleground AWS is staking out, and it’s the same ground Microsoft, Google, Salesforce, and a growing number of startups intend to contest.

Liguori sees the transformation accelerating fast. “I think what we’re going to see in the next six months is a lot of this agentic stuff going from developer only to being able to be consumed by a larger number of folks within an enterprise,” he told VentureBeat. Liguori, who led the technical work over eight sleepless weeks to bring OpenAI’s models to Bedrock, said his own productivity as a software engineer has increased 10 to 20 times over the past year. When asked what excites him most about what comes next, he didn’t talk about models or infrastructure. He talked about what happens when that same multiplier reaches the finance team, the product managers, the supply chain planners — the millions of knowledge workers who have been watching the agentic revolution from the sidelines.

“We had nothing eight weeks ago,” he said, “and now we’re here.” If the next eight weeks move as fast, the sidelines may not exist for much longer.

FOMO is why enterprises pay for GPUs they don’t use — and why prices keep climbing

Enterprises can’t fix their GPU waste problem because the fix makes the problem worse. Releasing idle capacity would improve utilization, but the same shortage driving GPU prices up is exactly why no team will give capacity back. So the fleet sits at roughly 5%, billed by the hour, and the cycle tightens.

That pressure — repeated across thousands of enterprises over the past two years — is the reason most companies are now running their GPU fleets at roughly 5% utilization, according to Cast AI’s 2026 State of Kubernetes Optimization Report, which measured actual production clusters rather than surveying them. It’s also the reason nobody releases the idle capacity. Cast AI co-founder and President Laurent Gil has been tracking the dynamic for two years. “Many of the neoclouds are not cloud,” he told VentureBeat. “They are neo-real estate.”

Five percent is about six times worse than a no-effort baseline. Gil puts a reasonable human-managed target at around 30% once you factor in day cycles, weekends and normal business patterns. Five percent means enterprises are running their most expensive infrastructure line at a fraction of what doing nothing intentional would yield. And it lands at the same moment cloud compute pricing has broken its 20-year pattern. 
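The gap between 5% and 30% is easier to feel as a price. The sketch below divides an hourly GPU rate by utilization to get a cost per hour of useful work. The $3.93 rate is an assumption borrowed from the H100 on-demand figure quoted elsewhere in this article, and the function name is hypothetical; real bills depend on contract terms.

```python
def cost_per_useful_gpu_hour(hourly_price: float, utilization: float) -> float:
    """Effective price of one hour of actual GPU work at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return hourly_price / utilization


PRICE = 3.93  # assumed H100 on-demand rate in dollars per GPU-hour

for label, utilization in [
    ("measured fleet average (5%)", 0.05),
    ("human-managed target (30%)", 0.30),
]:
    effective = cost_per_useful_gpu_hour(PRICE, utilization)
    print(f"{label}: ${effective:.2f} per useful GPU-hour")
```

Under these assumptions the same list price works out to roughly six times more per useful hour at 5% than at 30%, which is the ratio behind the "six times worse" comparison.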

AWS quietly raised its reserved H200 GPU prices by roughly 15% on a Saturday in January, with no formal announcement. Memory suppliers pushed HBM3e prices up 20% for 2026. It is the first time since AWS launched EC2 in 2006 that a hyperscaler has meaningfully raised reserved GPU pricing rather than cut it. For now, the assumption under most enterprise AI budgets — that cloud compute gets cheaper every year — no longer holds at the top of the stack.

The cloud market has split in two

The pricing move matters less for what it is than for what it signals about where the shortage actually bites. Cloud compute has split into two layers. At the commodity layer, the old deflation still works. H100 on-demand pricing has fallen from roughly $7.57 per GPU-hour in September 2025 to around $3.93 today, with Lambda Labs and RunPod listing H100s under $3 and older A100s around $1.92. Nvidia T4 chips, once impossible to find on spot, now survive above 90% probability over 24 hours in several AWS regions.

At the frontier layer, it’s reversed. Nvidia received orders for 2 million H200 chips for 2026 against 700,000 in inventory. TSMC’s advanced packaging, which gates every HBM-equipped GPU, is booked through at least mid-2027. AMD has warned of its own 2026 price hikes citing the same crunch. Even A100 pricing, expected to soften as three-year reservations from 2023 expired, has started creeping back up. Gil’s read: FOMO is now spilling into older generations. Which layer an enterprise’s workloads sit on determines exposure.

Why 5%? Part one: the procurement loop

How does fleet utilization get to 5% when GPUs are this expensive? Gil’s account of enterprise GPU procurement is the clearest explanation I have heard.

An enterprise needs GPUs. It joins a hyperscaler waitlist. Nothing happens for weeks, sometimes months. Then a phone call: “You asked for 48, I have 36. Yours if you want them, but only on a one-year or three-year commitment, and three years is cheaper. If you don’t want them, five other companies on the list will take them.” The fear of losing allocation is acute. The commitment gets signed. Whether the workloads will consume that many GPUs, or whether that chip generation fits what will run on them, is not the operative question at the moment. The operative question is whether to say yes or lose the slot.

Once secured, those GPUs become too painful to release. Reacquiring them would take months, and nobody wants to be the team that gave capacity back and couldn’t get it. So the fleet sits, billed by the hour, whether it is used or not. Gil described enterprises paying on-demand rates, roughly three times more expensive than one-year reservations, because even the premium felt safer than risking release.

This is the paradox at the center of the 5% number. The obvious way to improve utilization is to release the GPUs you are not using. But the very shortage that makes those GPUs expensive is also the reason nobody releases them. So the fleet stays over-provisioned, the shortage persists, prices rise, and the FOMO that started the cycle gets reinforced. Every turn of the loop makes the next exit harder.

Forrester’s data corroborates the dynamic from a different angle. Principal analyst Tracy Woo found practitioners self-estimating Kubernetes waste at around 60%, close to what Cast AI measures directly. A widely observed pattern in Kubernetes practice explains the dynamic: engineers routinely request five to ten times the resources they actually use, because the cost of under-provisioning is visible (a pager goes off) and the cost of over-provisioning is invisible (one line on a cloud bill no engineer sees).

Why 5%? Part two: the architecture loop

Fixing procurement alone would not get the number to a good place, because the GPUs enterprises already hold are also wasteful on the inside. And the architecture half of the story is being diagnosed independently by teams that compete with Cast AI.

Anyscale, the company behind the Ray framework, published its own analysis on January 21 arguing that modern AI workloads routinely sit below 50% GPU utilization even when fleet size is exactly right, because of how the workloads are containerized. A single AI job moves through CPU-heavy stages (loading data, preprocessing), GPU-heavy stages (training or inference), and back to CPU. When all of that runs in one container, the GPU is allocated for the entire lifecycle but doing useful work for a fraction of it.
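The single-container effect can be shown with simple arithmetic. In the sketch below the stage names and durations are invented for illustration and are not taken from Anyscale's analysis; the point is only that a GPU held for the whole job is idle during the CPU stages. The disaggregated case ignores hand-off and scheduling overhead, so it is an upper bound.

```python
from typing import List, Tuple

Stage = Tuple[str, str, float]  # (stage name, device it needs, minutes)


def gpu_minutes_held(stages: List[Stage], disaggregated: bool) -> float:
    """Minutes the GPU is reserved for this job."""
    if disaggregated:
        # GPU is reserved only while a GPU stage is running.
        return sum(minutes for _, device, minutes in stages if device == "gpu")
    # One container: the GPU is held for every stage, CPU ones included.
    return sum(minutes for _, _, minutes in stages)


def gpu_utilization(stages: List[Stage], disaggregated: bool) -> float:
    """Share of the reserved GPU time spent on GPU work."""
    held = gpu_minutes_held(stages, disaggregated)
    if held == 0:
        raise ValueError("job has no GPU time to measure")
    busy = sum(minutes for _, device, minutes in stages if device == "gpu")
    return busy / held


# Hypothetical job: durations are made up for the example.
job: List[Stage] = [
    ("load data", "cpu", 20.0),
    ("preprocess", "cpu", 25.0),
    ("inference", "gpu", 40.0),
    ("postprocess", "cpu", 15.0),
]

print(f"single container: {gpu_utilization(job, disaggregated=False):.0%}")
print(f"disaggregated:    {gpu_utilization(job, disaggregated=True):.0%}")
```

With these made-up numbers the GPU does 40 minutes of work in a 100-minute job, so a correctly sized fleet would still sit at 40% utilization.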

Gartner reaches the same conclusion independently. In a November 2025 research note on on-premises AI infrastructure, it recommends combining shared GPU usage across siloed projects with disaggregated inference, where prompt-processing and token-generation run on different hardware. Nvidia’s own Dynamo inference framework, unveiled for MLPerf Inference v6.0 last month, is built on the same principle.

Two vendors and an independent analyst firm (Cast AI, Anyscale, Gartner) converging on the same diagnosis is a stronger signal than any single vendor’s story, especially when one of them competes with the others. The two types of waste compound. A fleet over-committed at procurement time, running workloads whose containers leave GPUs idle waiting for CPU preprocessing, leaves enterprises at 5%. Fix one without fixing the other and most of the potential savings stay on the table.

What 40% utilization actually takes

If releasing GPUs is blocked by FOMO and procurement contracts are already signed, the only remaining lever is doing more useful work on the GPUs already committed. That is what “improve utilization” actually means in practice, and none of it requires buying a vendor’s product.

The simplest existence proof is the oldest technique in the book: GPU sharing across time zones. A bank with a credit decision engine serving Asian and US customers can run one pool of GPUs that serves both markets at different times. Nvidia published MIG (Multi-Instance GPU) and time-slicing primitives years ago. Most enterprises do not do it by hand because it is operationally boring and carries coordination overhead no one wants to own. An automated scheduler does it without getting tired.
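The time-zone argument comes down to comparing the sum of each market's peak with the peak of the combined load. The hourly demand profiles below are hypothetical (hours are in UTC, values are GPUs needed), chosen so the two busy periods do not overlap.

```python
from typing import List


def gpus_needed_separate(*profiles: List[int]) -> int:
    """Each market gets its own pool sized for its own peak."""
    return sum(max(profile) for profile in profiles)


def gpus_needed_shared(*profiles: List[int]) -> int:
    """One pool sized for the busiest hour of the combined demand."""
    return max(sum(hour_demands) for hour_demands in zip(*profiles))


# Hypothetical 24-hour demand: 8 GPUs in business hours, 1 GPU otherwise.
asia = [8 if 0 <= hour < 9 else 1 for hour in range(24)]
us = [8 if 13 <= hour < 22 else 1 for hour in range(24)]

print("separate pools:", gpus_needed_separate(asia, us))
print("shared pool:   ", gpus_needed_shared(asia, us))
```

In this toy case the shared pool needs 9 GPUs instead of 16. The saving shrinks as the busy periods overlap, which is part of the coordination overhead the article describes.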

Canva, the Australian design platform running over 100 production AI models, told Anyscale that it runs close to 100% GPU utilization during distributed training runs with roughly 50% cloud-cost reductions versus its previous setup. Inside Cast AI’s own data, a cluster of 136 H200 GPUs sustains 49% average utilization after applying GPU sharing, bin-packing (placing multiple workloads onto fewer, right-sized nodes), and a spot/on-demand mix. That is roughly ten times the fleet average and still short of saturation, which is a realistic result: most real enterprise fleets with mixed dev, staging, and production workloads probably sustain 40% to 70% at full optimization, not 100%. Even that is an order of magnitude better than 5%.
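Bin-packing, as described above, can be sketched with the classic first-fit-decreasing heuristic. This is a generic textbook approach, not Cast AI's scheduler. The workload sizes are hypothetical percentages of one GPU, and the sketch assumes a GPU can be split into arbitrary fractions, which real partitioning schemes such as MIG do not allow.

```python
from typing import List


def first_fit_decreasing(demands: List[int], capacity: int = 100) -> List[List[int]]:
    """Pack workloads (percent of one GPU) onto as few GPUs as the heuristic finds."""
    gpus: List[List[int]] = []  # each inner list holds the workloads on one GPU
    for demand in sorted(demands, reverse=True):
        if demand > capacity:
            raise ValueError(f"workload of {demand}% does not fit on one GPU")
        for gpu in gpus:
            if sum(gpu) + demand <= capacity:
                gpu.append(demand)  # fits on an existing GPU
                break
        else:
            gpus.append([demand])   # no room anywhere, open a new GPU
    return gpus


# Hypothetical workloads, each currently holding a whole GPU.
workloads = [50, 20, 70, 30, 10, 20]
packed = first_fit_decreasing(workloads)

print(f"before: {len(workloads)} GPUs, after: {len(packed)} GPUs")
for index, gpu in enumerate(packed):
    print(f"GPU {index}: {gpu} -> {sum(gpu)}% used")
```

Here six lightly loaded GPUs collapse onto two. First-fit-decreasing is a heuristic and does not guarantee the minimum count in every case.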

One caveat: the report’s 5% figure explicitly excludes AI labs running dedicated training. Organizations that look more like frontier labs than mixed enterprise fleets likely see much higher utilization already.

The procurement paths have stopped being interchangeable

What should enterprises actually do differently in 2026? The paths available in the market are no longer interchangeable, and each makes a different bet on where supply and demand land.

| Procurement path | Typical H100-class price | Availability | Interruption risk | Commitment | Best fit |
| --- | --- | --- | --- | --- | --- |
| Hyperscaler on-demand | $3.00 to $6.98 per GPU-hour | Limited for H100/H200 | None | None | Unpredictable workloads, short runs |
| Hyperscaler Capacity Blocks | $4.33 to $4.97 per GPU-hour (H200 after Jan 2026) | Pre-book up to 8 weeks; 6-month window | None in window | Medium-term | Scheduled training with known windows |
| Hyperscaler spot | Up to 90% discount | Variable; H100/H200 thin | High (minutes of warning) | None | Fault-tolerant inference, checkpointed training |
| Specialized GPU clouds (CoreWeave, Lambda, RunPod, GMI) | $1.99 to $3.99 per GPU-hour for H100 | Broader for newer generations | Low to medium | Per-run or short reservation | Price-sensitive teams, flexible deployment |
| On-premise or colocation | Break-even around 12 to 18 months at sustained >60% utilization | 3 to 9 month lead times | None | 3+ year capex | High-utilization sustained workloads, strict compliance |
| Decentralized marketplaces (Vast.ai, io.net, Aethir) | Often under $1.00 per GPU-hour | Highly variable quality | High | None | Experimental or batch, non-production |

The pattern that no longer works is picking one path and locking in for a multi-year plan. A more defensible 2026 default is mixing paths against the split: commodity providers for workloads that can live there, hyperscaler Capacity Blocks only for workloads that need the guaranteed window.
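The break-even claim for on-premise hardware depends heavily on utilization, and the arithmetic is easy to sketch. The version below is a deliberate simplification: it compares owning a GPU against renting only the hours actually used, and every input (purchase price, cloud rate, monthly running cost) is a hypothetical placeholder rather than a vendor quote.

```python
import math

HOURS_PER_MONTH = 730  # average hours in a month


def breakeven_months(capex_per_gpu, cloud_price_per_hour, utilization, opex_per_month):
    """Months until owning a GPU costs less than renting the hours actually used.

    Returns math.inf when the avoided rental cost never covers running costs.
    """
    rented_cost = cloud_price_per_hour * HOURS_PER_MONTH * utilization
    monthly_saving = rented_cost - opex_per_month
    if monthly_saving <= 0:
        return math.inf
    return capex_per_gpu / monthly_saving


# All inputs are hypothetical placeholders, not quotes from any vendor.
busy = breakeven_months(25_000, 4.00, utilization=0.60, opex_per_month=300)
idle = breakeven_months(25_000, 4.00, utilization=0.05, opex_per_month=300)
print(f"60% utilization: {busy:.1f} months; 5% utilization: {idle}")
```

With these placeholder inputs the purchase pays back in roughly 17 months at 60% utilization and never pays back at 5%, which is why the capex path only suits sustained, high-utilization workloads.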

Five levers worth pulling

None of the following requires buying back capacity that’s already been committed.

  1. Continuous rightsizing, not one-time configuration. Resource requests set at deployment are almost always wrong six months later. Karpenter, OpenCost, and Kubecost are open-source options; Cast AI, ScaleOps, nOps, and PerfectScale automate the rightsizing itself. Cast AI reports its continuous rightsizing cuts provisioned CPU by roughly 50% on average across its customer base.

  2. Regional spot placement, especially for T4-class inference. Cast AI’s survival-curve data shows T4 spot interruption risk ranging from about 10% over 24 hours in eu-west-3 to 80% in eu-central-1 and us-east-1. Region selection is a reliability decision, not just a latency one.

  3. GPU sharing through MIG and time-slicing. Nvidia’s MIG feature partitions A100, H100, and H200 chips into isolated instances with dedicated compute and memory. vLLM and Dynamo implement continuous batching and disaggregated inference. Open primitives, no vendor contract required.

  4. Disaggregated runtime. Ray lets CPU-bound data prep scale independently from GPU-bound training or inference. 

  5. Commitment rebalancing. Reserved Instances and Savings Plans drift as workloads change. Cast AI, nOps, and Vantage track utilization against committed capacity and adjust the split automatically.
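The regional spot figures in the second lever can be turned into a rough per-job estimate. This sketch assumes a constant interruption rate, so that survival decays exponentially with job length. That is an assumption made here for illustration; Cast AI's actual survival curves may have a different shape.

```python
def survival_probability(job_hours, interruption_risk_24h):
    """Chance a spot job of job_hours finishes without being interrupted.

    Assumes a constant interruption rate, so survival decays exponentially;
    real survival curves need not follow this shape.
    """
    if not 0 <= interruption_risk_24h < 1:
        raise ValueError("24-hour risk must be in [0, 1)")
    return (1 - interruption_risk_24h) ** (job_hours / 24)


# 24-hour risks quoted above for T4 spot instances.
low_risk_region = survival_probability(6, 0.10)   # e.g. eu-west-3
high_risk_region = survival_probability(6, 0.80)  # e.g. eu-central-1, us-east-1
print(f"6-hour job: {low_risk_region:.0%} vs {high_risk_region:.0%} chance of finishing")
```

Under this assumption, the same six-hour job is far more likely to finish uninterrupted in the low-risk region, which is the sense in which region selection becomes a reliability decision.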

The bottom line

The single most practical question most enterprises have not asked this year: do they actually need an H200 at all?

H200 is designed for very large models (70B+ parameters) with very long contexts (128k+ tokens), where its 141 GB of memory (nearly double the H100’s 80 GB) is what lets the chip handle the load without slowing down. For smaller models, fine-tuned derivatives, quantized inference, and most production AI that actually ships to customers, an H100 does the same job at roughly 40% less per GPU-hour, according to Cast AI. An A100 often works, too, at roughly 60% less. The era of a single general-purpose GPU as the default answer is ending. Chip selection is becoming a routing decision, workload by workload, rather than a generational procurement decision.
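Treating chip selection as a routing decision could look something like the sketch below. The H100 and H200 memory figures come from the paragraph above; the A100 figure assumes the 80 GB variant, and the relative speeds, the 10% memory headroom, and the workload numbers are hypothetical placeholders. Chips are listed cheapest first, following the price ordering the article attributes to Cast AI.

```python
# Ordered cheapest first. relative_speed values are illustrative placeholders.
CHIPS = [
    {"name": "A100", "memory_gb": 80, "relative_speed": 1.0},   # assumes 80 GB variant
    {"name": "H100", "memory_gb": 80, "relative_speed": 2.0},
    {"name": "H200", "memory_gb": 141, "relative_speed": 2.0},
]


def pick_chip(params_billion, bytes_per_param, kv_cache_gb,
              min_relative_speed=1.0, headroom=0.9):
    """Return the cheapest chip that fits the model and meets the speed floor.

    Memory needed is approximated as weights plus KV cache. Returns None when
    nothing fits on a single chip (the model would need to be sharded).
    """
    needed_gb = params_billion * bytes_per_param + kv_cache_gb
    for chip in CHIPS:
        fits = needed_gb <= chip["memory_gb"] * headroom
        fast_enough = chip["relative_speed"] >= min_relative_speed
        if fits and fast_enough:
            return chip["name"]
    return None


print(pick_chip(8, 2, 4))                          # small model, 16-bit weights
print(pick_chip(8, 2, 4, min_relative_speed=2.0))  # same model, latency-sensitive
print(pick_chip(70, 1, 40))                        # large model, 8-bit, long context
```

Only the last workload in this example needs the larger memory pool; the first two land on cheaper chips. A real router would also weigh interconnect, batch size, and availability, which this sketch ignores.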

Gil’s own observation sharpens this. At 80% utilization, a B200 genuinely delivers better unit cost per token than an A100: its throughput advantage outweighs its higher hourly price. At 5% utilization, the math inverts. The premium chip compounds the waste. Buying the newest chip while underusing it is the most expensive possible version of the FOMO loop.
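One way to read that inversion is to hold demand fixed and ask what each chip costs to serve it. The sketch below does that with two hypothetical chips: the prices and throughputs are placeholders chosen for round numbers, not measured A100 or B200 figures.

```python
import math


def hourly_cost(price_per_gpu_hour, peak_tokens_per_second, demand_tokens_per_second):
    """Return (hourly spend, utilization) for serving a fixed demand on whole GPUs."""
    gpus = max(1, math.ceil(demand_tokens_per_second / peak_tokens_per_second))
    utilization = demand_tokens_per_second / (gpus * peak_tokens_per_second)
    return gpus * price_per_gpu_hour, utilization


# Hypothetical chips: the premium one costs 3x per hour and is 4x faster at peak.
OLDER = (2.00, 1000)  # (price per GPU-hour, peak tokens per second)
NEWER = (6.00, 4000)

busy_old = hourly_cost(*OLDER, demand_tokens_per_second=3200)
busy_new = hourly_cost(*NEWER, demand_tokens_per_second=3200)
quiet_old = hourly_cost(*OLDER, demand_tokens_per_second=200)
quiet_new = hourly_cost(*NEWER, demand_tokens_per_second=200)

for label, (cost, util) in [("busy, older", busy_old), ("busy, newer", busy_new),
                            ("quiet, older", quiet_old), ("quiet, newer", quiet_new)]:
    print(f"{label}: ${cost:.2f}/hour at {util:.0%} utilization")
```

With these placeholder numbers, heavy demand keeps one newer chip at 80% and costs less per hour than the four older chips needed for the same load. Under light demand the newer chip sits at 5% and costs three times as much as a single older chip running at 20%.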

The first lever is free, and it is a workload audit rather than a software purchase. No GPU needs to be released to run it. Every GPU-backed workload in production is worth reviewing against one question: is the chip it runs on actually matched to what it does? A surprising number of H200 purchases in 2026 will turn out to have been made because the allocation came through, not because the workload required it. Then fix runtime architecture before spending on more reserved capacity. Mix commodity and reserved tiers against the split instead of picking one.

Whether the broader GPU market eventually rebalances is a separate question, and not one worth betting a 2026 budget on. Supply could catch up. Memory capacity could ease. Specialized inference silicon could pull demand off the H200 tier. All of that is possible. None of it is certain. What is certain is that procurement and runtime are the same problem seen from two sides: FOMO drives over-commitment at the front end, and container architecture leaves the over-committed fleet idle at the back. Enterprises that treat them as one loop can break it. Enterprises that keep treating them as two separate budget items will keep paying to run their most expensive infrastructure at 5%.

Microsoft and OpenAI gut their exclusive deal, freeing OpenAI to sell on AWS and Google Cloud

Microsoft and OpenAI on Monday announced a sweeping overhaul of the partnership that has defined the commercial AI era, dismantling key pillars of exclusivity and revenue-sharing that bound the two companies together for years and replacing them with a looser, time-limited arrangement that gives both sides far more freedom to pursue rival relationships.

The amended agreement, disclosed simultaneously in blog posts from both companies, marks the most significant restructuring since Microsoft first invested $1 billion in OpenAI in 2019 — and it transforms what was once the most consequential exclusive technology alliance in a generation into something that more closely resembles a strategic but arm’s-length commercial relationship.

Under the new terms, Microsoft will no longer pay any revenue share to OpenAI when customers access OpenAI models through Azure. OpenAI, meanwhile, will continue paying a revenue share to Microsoft through 2030 — at the same 20 percent rate — but that obligation is now subject to a total cap. Microsoft retains a license to OpenAI’s intellectual property for models and products through 2032, but that license is now explicitly non-exclusive. And OpenAI, critically, can now serve all of its products to customers on any cloud provider — including Amazon Web Services and Google Cloud — ending the exclusivity that had been a cornerstone of the original deal.

“The rapid pace of innovation requires us to continue to evolve our partnership to benefit our customers and both companies,” Microsoft wrote in its blog post Monday. OpenAI echoed the framing, calling the amended agreement a move “grounded in flexibility, certainty, and a focus on delivering the benefits of AI broadly.”

The diplomatic language belies the drama that led to this moment — months of behind-the-scenes tension, competing deal announcements, public contradictions, and even the specter of litigation between two companies whose fates have been intertwined since the earliest days of the generative AI revolution.

How a billion-dollar bet on AI created the most powerful exclusive partnership in tech

To understand why Monday’s announcement matters so much, it helps to understand what came before it. When Microsoft poured its initial $1 billion into OpenAI in 2019, and then followed with a cumulative investment exceeding $13 billion, it secured something extraordinary: exclusive commercial access to OpenAI’s models and intellectual property. Azure became the sole cloud provider for OpenAI’s API products. Microsoft integrated OpenAI’s GPT models into everything from Bing to Office to GitHub Copilot. The arrangement was, by any measure, one of the most lopsided technology licensing deals in modern history — Microsoft got privileged access to the most capable AI models on the planet, and OpenAI got the capital and infrastructure it needed to scale.

The deal even contained an unusual provision: Microsoft’s exclusive rights would remain in force until OpenAI achieved artificial general intelligence, or AGI — a loosely defined milestone referring to AI systems that rival or exceed human intelligence across a broad range of tasks. OpenAI’s board retained the authority to declare when AGI had been reached, at which point certain commercial terms would change. It was, in effect, a philosophical tripwire embedded in a business contract.

That structure worked well enough when OpenAI was a research lab with a modest commercial footprint. But as ChatGPT exploded into the mainstream in late 2022 and OpenAI’s annualized revenue rocketed into the billions, the constraints began to chafe. OpenAI found itself locked into a single cloud ecosystem at precisely the moment when enterprises — its fastest-growing customer segment — were demanding multi-cloud flexibility. In an internal memo earlier this month, OpenAI’s revenue chief Denise Dresser put it bluntly, telling staff that the Microsoft partnership had “limited our ability to meet enterprises where they are,” according to a report from The Verge.

Amazon’s $50 billion OpenAI investment created a legal crisis that forced the restructuring

The proximate cause of Monday’s restructuring was not a philosophical disagreement about AI safety or corporate governance. It was a $50 billion check from Amazon. In February, OpenAI announced that Amazon would invest up to $50 billion in the company — $15 billion upfront, with another $35 billion to follow when certain unspecified conditions were met. In exchange, OpenAI agreed to expand its existing cloud agreement with AWS by $100 billion over eight years and, most controversially, committed to making AWS the exclusive third-party distribution provider for Frontier, its new enterprise agent-building platform. OpenAI also agreed to co-develop “stateful runtime technology” on AWS Bedrock, the infrastructure layer that allows AI agents to maintain memory and context over extended tasks.

The problem was that OpenAI’s existing contract with Microsoft almost certainly prohibited these arrangements. Microsoft held exclusive rights to any OpenAI product accessed through an API — a category that plainly included Frontier. On the very day OpenAI announced the Amazon deal, Microsoft issued a pointed public statement insisting that “Azure remains the exclusive cloud provider of stateless OpenAI APIs” and that “OpenAI’s first party products, including Frontier, will continue to be hosted on Azure.” The contradiction between the two announcements was stark, and it created immediate legal exposure. The Financial Times reported in March that Microsoft was actively considering legal action to enforce its contractual rights. The situation placed OpenAI in an impossible position: it had made promises to Amazon that it seemingly could not keep under the terms of its Microsoft agreement.

Monday’s deal resolves that impasse entirely. By converting Microsoft’s license from exclusive to non-exclusive and explicitly granting OpenAI the right to serve products on any cloud, the new terms retroactively validate the Amazon arrangement and eliminate the legal overhang. Amazon CEO Andy Jassy wasted no time celebrating. “We’re excited to make OpenAI’s models available directly to customers on Bedrock in the coming weeks, alongside the upcoming Stateful Runtime Environment,” he wrote on X, adding that the company would share more details at an event in San Francisco on Tuesday.

Inside the new financial terms that shift billions of dollars between the two AI giants

The financial mechanics of the new deal deserve careful parsing, because they reveal which side gave up what — and who came out ahead. Under the old arrangement, money flowed in both directions. When customers bought ChatGPT subscriptions or accessed OpenAI models through their own applications, OpenAI paid Microsoft a cut — reportedly 20 percent. Conversely, when enterprise customers accessed OpenAI models through Azure’s API, Microsoft paid OpenAI a share of that revenue. This bilateral structure reflected the deep integration between the two companies: Microsoft was simultaneously OpenAI’s investor, cloud provider, distribution partner, and largest customer.

The new deal makes the cash flow one-directional. Microsoft stops paying OpenAI entirely. OpenAI continues paying Microsoft its 20 percent share, but only through 2030, and now subject to a total cap whose precise dollar figure has not been disclosed. Given that OpenAI’s revenue is growing rapidly — the company was reportedly on pace to generate tens of billions annually — that cap could become material relatively quickly.

For Microsoft, the trade-off is straightforward: it sacrifices the exclusivity that made Azure the only gateway to OpenAI’s models, but it gains immediate financial relief by eliminating its outbound revenue-share payments while continuing to collect inbound payments for several more years. And it retains approximately 27 percent ownership of OpenAI’s for-profit entity, meaning it participates in the company’s growth regardless of which cloud serves the workloads. Last quarter alone, Microsoft reported $7.5 billion in revenue from its OpenAI investment, according to TechCrunch’s reporting. For OpenAI, the calculus is different. It accepts a continued obligation to pay Microsoft through 2030, but it gains the commercial freedom to sell everywhere, a freedom that is arguably worth far more than the revenue-share savings. Enterprise customers overwhelmingly operate in multi-cloud environments. Being locked into Azure was not just a technical constraint; it was a sales objection that OpenAI’s competitors, particularly Anthropic and Google, exploited relentlessly.

Why the disappearance of the AGI clause signals a new era for AI governance

One of the more philosophically intriguing aspects of Monday’s announcement is what it does to the AGI provision that once governed the partnership. Under the original agreement, Microsoft’s exclusive commercial rights were tied to a trigger: if OpenAI’s board determined that the company had achieved AGI, certain terms — including Microsoft’s access to the most advanced models — would change. The provision was meant to ensure that a truly superintelligent system would remain under the nonprofit board’s control rather than being commercially exploited. In practice, it created perverse incentives: OpenAI had a financial reason to never declare AGI, and Microsoft had a financial reason to argue that AGI had not been reached regardless of what the technology could actually do.

The new deal sidesteps this entirely. Microsoft’s license now runs through a fixed calendar date — 2032 — “independent of OpenAI’s technology progress,” as the companies put it. The AGI trigger, a concept that once sat at the philosophical heart of the partnership, has been replaced by a spreadsheet. Andrew Curran, a close observer of OpenAI’s governance, noted on X that language defining AGI had been removed from OpenAI’s website, sharing a screenshot showing the change. The move drew sharp reactions. One commenter observed that “removing the definition = removing the accountability. whoever controls when AGI is declared controls a lot of commercial terms.”

The shift reflects a broader maturation — or perhaps disillusionment — within the AI industry regarding AGI as a meaningful commercial or governance concept. When the original deal was struck, AGI felt like a distant, almost mythical threshold. Now, with models like GPT-5.5 demonstrating increasingly general capabilities, the term has become more of a marketing slogan than a technical benchmark. Replacing it with fixed dates and dollar caps is, in some sense, an admission that the industry has moved beyond the framework that once defined this partnership.

Multi-cloud AI competition intensifies as enterprises gain the power to choose

The most immediate beneficiary of the new arrangement is the enterprise customer. For years, organizations that wanted access to OpenAI’s models had essentially one option: Azure. That constraint is now gone. Within weeks, according to Jassy, OpenAI’s models will be available on AWS Bedrock alongside the stateful runtime environment that powers long-running AI agents. Google Cloud is presumably not far behind.

This multi-cloud availability arrives at a moment when the AI infrastructure market is undergoing rapid consolidation and expansion simultaneously. Meta recently committed $48 billion to cloud providers CoreWeave and Nebius. Amazon’s investment in OpenAI, combined with its existing relationship with Anthropic — in which Amazon has invested up to $4 billion — positions AWS as a model-agnostic platform where enterprises can mix and match AI capabilities. Microsoft, meanwhile, has developed its own relationship with Anthropic, using Claude to power agentic products — a hedge against the very OpenAI dependency it spent billions creating.

The competitive dynamics are now genuinely complex. Microsoft competes with OpenAI in AI products (Copilot vs. ChatGPT), partners with OpenAI’s rival Anthropic, and remains OpenAI’s largest shareholder. OpenAI sells on Azure, AWS, and soon everywhere else, while building its own data centers. Amazon invests in both OpenAI and Anthropic. Google builds its own models while also hosting competitors on Vertex AI. Jehangeer Hasan, a technology commentator, captured the mood on X, calling the announcement a “notable shift in the cloud AI landscape” that signals “intensifying multi-cloud competition and a push toward giving developers more flexibility instead of locking them into a single ecosystem.” Chris Alexander, an engineer, offered a more candid assessment: “honestly Azure’s OpenAI endpoints are so unreliable, we mostly just hit you all directly,” adding that “it would be nice to have options in AWS or GCP for sure.”

What the restructured deal means for the future of AI’s biggest partnership

Several open questions remain. The precise dollar amount of the revenue-share cap has not been disclosed, and it will matter enormously as OpenAI’s revenue scales. The meaning of “first on Azure” — whether it implies a meaningful exclusivity window or merely simultaneous availability — remains deliberately ambiguous. And OpenAI’s own infrastructure ambitions, including plans to build proprietary data centers, could eventually reduce its dependence on any third-party cloud, including Azure.

Microsoft’s position, while less dominant than before, is not as diminished as some early commentary suggested. It remains OpenAI’s primary cloud provider, its largest shareholder, and a licensee of its technology through 2032. It has diversified its own AI strategy with investments in Anthropic, its own Phi and MAI model families, and deep integration of AI across its product portfolio. The company reported $7.5 billion in OpenAI-related revenue last quarter, a figure that demonstrates the sheer financial scale of the relationship even in its loosened form.

For OpenAI, the new agreement is a coming-of-age moment. The company that once depended on Microsoft for everything — capital, compute, distribution, and credibility — now operates as an independent force capable of striking multi-billion-dollar deals with Microsoft’s biggest rivals. Sam Altman announced the changes on X with characteristic brevity: “We have updated our partnership with Microsoft.”

Seven years ago, when Microsoft CEO Satya Nadella and Altman first shook hands on a deal to commercialize artificial intelligence, the arrangement rested on the assumption that OpenAI needed Microsoft more than Microsoft needed OpenAI. Every clause — the exclusivity, the AGI trigger, the revenue share — reflected that original imbalance. Monday’s restructuring is proof that the assumption no longer holds. The partnership that launched the generative AI revolution has survived, but the power dynamics that created it have not. In the AI industry, it turns out, the only thing that moves faster than the technology is the leverage.