Context decay, orchestration drift, and the rise of silent failures in AI systems

The most expensive AI failure I have seen in enterprise deployments did not produce an error. No alert fired. No dashboard turned red. The system was fully operational; it was just consistently, confidently wrong. That is the reliability gap. And it is the problem most enterprise AI programs are not built to catch.

We have spent the last two years getting very good at evaluating models: benchmarks, accuracy scores, red-team exercises, retrieval quality tests. But in production, the model is rarely where the system breaks. It breaks in the infrastructure layer: the data pipelines feeding it, the orchestration logic wrapping it, the retrieval systems grounding it, the downstream workflows trusting its output. That layer is still being monitored with tools designed for a different kind of software.

The gap no one is measuring

Here’s what makes this problem hard to see: Operationally healthy and behaviorally reliable are not the same thing, and most monitoring stacks cannot tell the difference.

A system can show green across every infrastructure metric (latency within SLA, throughput normal, error rate flat) while simultaneously reasoning over retrieval results that are six months stale, silently falling back to cached context after a tool call degrades, or propagating a misinterpretation through five steps of an agentic workflow. None of that shows up in Prometheus. None of it trips a Datadog alert.

The reason is straightforward: Traditional observability was built to answer the question “Is the service up?” Enterprise AI requires answering a harder question: “Is the service behaving correctly?” Those questions call for different instruments.

What teams typically measure | What actually drives AI infrastructure failure
Uptime / latency / error rate | Retrieval freshness and grounding confidence
Token usage | Context integrity across multi-step workflows
Throughput | Semantic drift under real-world load
Model benchmark scores | Behavioral consistency when conditions degrade
Infrastructure error rate | Silent partial failure at the reasoning layer

Closing this gap requires adding a behavioral telemetry layer alongside the infrastructure one. The aim is not to replace what exists but to extend it, capturing what the model actually did with the context it received, not just whether the service responded.

Four failure patterns that standard monitoring will not catch

Across enterprise AI deployments in network operations, logistics, and observability platforms, I see four failure patterns repeat with enough consistency to name them.

The first is context degradation. The model reasons over incomplete or stale data in a way that is invisible to the end user. The answer looks polished. The grounding is gone. Detection usually happens weeks later, through downstream consequences rather than system alerts.

The second is orchestration drift. Agentic pipelines rarely fail because one component breaks. They fail because the sequence of interactions between retrieval, inference, tool use, and downstream action starts to diverge under real-world load. A system that looked stable in testing behaves very differently when latency compounds across steps and edge cases stack.

The third is a silent partial failure. One component underperforms without crossing an alert threshold. The system degrades behaviorally before it degrades operationally. These failures accumulate quietly and surface first as user mistrust, not incident tickets. By the time the signal reaches a postmortem, the erosion has been happening for weeks.

The fourth is the automation blast radius. In traditional software, a localized defect stays local. In AI-driven workflows, one misinterpretation early in the chain can propagate across steps, systems, and business decisions. The cost is not just technical. It becomes organizational, and it is very hard to reverse.

Metrics tell you what happened. They rarely tell you what almost happened.

Why classic chaos engineering is not enough and what needs to change

Traditional chaos engineering asks the right kind of question: What happens when things break? Kill a node. Drop a partition. Spike CPU. Observe. Those tests are necessary, and enterprises should run them.

But for AI systems, the most dangerous failures are not caused by hard infrastructure faults. They emerge at the interaction layer between data quality, context assembly, model reasoning, orchestration logic, and downstream action. You can stress the infrastructure all day and never surface the failure mode that costs you the most.

What AI reliability testing needs is an intent-based layer: Define what the system must do under degraded conditions, not just what it should do when everything works. Then test the specific conditions that challenge that intent. What happens if the retrieval layer returns content that is technically valid but six months outdated? What happens if a summarization agent loses 30% of its context window to unexpected token inflation upstream? What happens if a tool call succeeds syntactically but returns semantically incomplete data? What happens if an agent retries through a degraded workflow and compounds its own error with each step?

These scenarios are not edge cases. They are what production looks like. This is the framework I have applied in building reliability systems for enterprise infrastructure: Intent-based chaos level creation for distributed computing environments. The key insight: Intent defines the test, not just the fault.

What the infrastructure layer actually needs

None of this requires reinventing the stack. It requires extending four things.

Add behavioral telemetry alongside infrastructure telemetry. Track whether responses were grounded, whether fallback behavior was triggered, whether confidence dropped below a meaningful threshold, whether the output was appropriate for the downstream context it entered. This is the observability layer that makes everything else interpretable.
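As a rough illustration, a behavioral telemetry record could be emitted next to the usual latency and error metrics. This is a minimal sketch only: the `BehavioralEvent` class, its field names, and the 0.6 confidence and 30-day freshness thresholds are hypothetical choices, not taken from any particular monitoring product.

```python
from dataclasses import dataclass, asdict
import json
import time


@dataclass
class BehavioralEvent:
    """One record per model response, logged alongside infrastructure metrics."""
    request_id: str
    grounded: bool                 # did the answer rely on retrieved context?
    fallback_triggered: bool       # did the system fall back to cached context?
    confidence: float              # hypothetical 0..1 score supplied by the app
    oldest_source_age_days: float  # age of the stalest document used
    timestamp: float = 0.0

    def needs_review(self, min_confidence: float = 0.6,
                     max_age_days: float = 30.0) -> bool:
        """Flag responses that are operationally fine but behaviorally suspect."""
        return (
            not self.grounded
            or self.fallback_triggered
            or self.confidence < min_confidence
            or self.oldest_source_age_days > max_age_days
        )


def emit(event: BehavioralEvent) -> str:
    """Serialize the event as one JSON line for whatever log pipeline is in use."""
    if not event.timestamp:
        event.timestamp = time.time()
    return json.dumps(asdict(event), sort_keys=True)
```

A response grounded in six-month-old documents would return HTTP 200 and still be flagged by `needs_review()`, which is the distinction between operational and behavioral health described above.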

Introduce semantic fault injection into pre-production environments. Deliberately simulate stale retrieval, incomplete context assembly, tool-call degradation, and token-boundary pressure. The goal is not theatrical chaos. The goal is finding out how the system behaves when conditions are slightly worse than your staging environment — which is always what production is.
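Two of these faults, stale retrieval and lost context, can be simulated with very little code. The sketch below assumes a hypothetical `Document` type with a `last_updated` field; feeding the degraded inputs to the system under test and comparing its behavior against the stated intent is left out.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Document:
    text: str
    last_updated: datetime


def inject_stale_retrieval(docs: List[Document], age_days: int) -> List[Document]:
    """Return copies whose timestamps are pushed back to simulate outdated content."""
    return [
        replace(d, last_updated=d.last_updated - timedelta(days=age_days))
        for d in docs
    ]


def truncate_context(docs: List[Document], keep_fraction: float) -> List[Document]:
    """Keep only the leading share of documents to simulate token-boundary pressure."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    keep = max(1, int(len(docs) * keep_fraction))
    return docs[:keep]
```

Both helpers return new lists and leave the originals untouched, so the same fixture can be reused across several fault scenarios.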

Define safe halt conditions before deployment, not after the first incident. AI systems need the equivalent of circuit breakers at the reasoning layer. If a system cannot maintain grounding, validate context integrity, or complete a workflow with enough confidence to be trusted, it should stop cleanly, label the failure, and hand control to a human or a deterministic fallback. A graceful halt is almost always safer than a fluent error. Too many systems are designed to keep going because confident output creates the illusion of correctness.
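One way to picture a reasoning-layer circuit breaker is a check that runs after each workflow step and stops the run with a labeled reason. The following is a sketch under assumptions: `StepResult`, `SafeHalt`, the reason strings, and the 0.7 threshold are all illustrative names and values.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    output: str
    grounded: bool
    context_complete: bool
    confidence: float


class SafeHalt(Exception):
    """Raised when the workflow should stop and hand off instead of continuing."""

    def __init__(self, reason: str):
        super().__init__(reason)
        self.reason = reason


def check_halt_conditions(result: StepResult, min_confidence: float = 0.7) -> None:
    """Raise SafeHalt with a labeled reason if any halt condition is met."""
    if not result.grounded:
        raise SafeHalt("grounding_lost")
    if not result.context_complete:
        raise SafeHalt("context_incomplete")
    if result.confidence < min_confidence:
        raise SafeHalt("low_confidence")


def run_step(result: StepResult) -> str:
    """Return the step output, or a labeled halt message for human handoff."""
    try:
        check_halt_conditions(result)
    except SafeHalt as halt:
        return f"HALTED[{halt.reason}]: routed to human review"
    return result.output
```

The point of the design is that the halt is explicit and labeled, so a downstream system never mistakes a degraded answer for a confident one.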

Assign shared ownership for end-to-end reliability. The most common organizational failure is a clean separation between model teams, platform teams, data teams, and application teams. When the system is operationally up but behaviorally wrong, no one owns it clearly. Semantic failure needs an owner. Without one, it accumulates.

The maturity curve is shifting

For the last two years, the enterprise AI differentiator has been adoption — who gets to production fastest. That phase is ending. As models commoditize and baseline capability converges, competitive advantage will come from something harder to copy: The ability to operate AI reliably at scale, in real conditions, with real consequences.

Yesterday’s differentiator was model adoption. Today’s is system integration. Tomorrow’s will be reliability under production stress.

The enterprises that get there first will not have the most advanced models. They will have the most disciplined infrastructure around them — infrastructure that was tested against the conditions it would actually face, not the conditions that made the pilot look good.

The model is not the whole risk. The untested system around it is.

Sayali Patil is an AI infrastructure and product leader.

Monitoring LLM behavior: Drift, retries, and refusal patterns

The stochastic challenge

Traditional software is predictable: Input A plus function B always equals output C. This determinism allows engineers to develop robust tests. On the other hand, generative AI is stochastic and unpredictable. The exact same prompt often yields different results on Monday versus Tuesday, breaking the traditional unit testing that engineers know and love.

To ship enterprise-ready AI, engineers cannot rely on mere “vibe checks” that pass today but fail when customers use the product. Product builders need to adopt a new infrastructure layer: The AI Evaluation Stack.

This framework is informed by my extensive experience shipping AI products for Fortune 500 enterprise customers in high-stakes industries, where “hallucination” is not funny — it’s a huge compliance risk.

Defining the AI evaluation paradigm

Traditional software tests are binary assertions (pass/fail). While some AI evals use binary asserts, many evaluate on a gradient. An eval is not a single script; it is a structured pipeline of assertions — ranging from strict code syntax to nuanced semantic checks — that verify the AI system’s intended function.

The taxonomy of evaluation checks

To build a robust, cost-effective pipeline, asserts must be separated into two distinct architectural layers:

Layer 1: Deterministic assertions

A surprisingly large share of production AI failures aren’t semantic “hallucinations” — they are basic syntax and routing failures. Deterministic assertions serve as the pipeline’s first gate, using traditional code and regex to validate structural integrity.

Instead of asking if a response is “helpful,” these assertions ask strict, binary questions:

  • Did the model generate the correct JSON key/value schema?

  • Did it invoke the correct tool call with the required arguments?

  • Did it successfully slot-fill a valid GUID or email address?

// Example: Layer 1 Deterministic Tool Call Assertion
{
  "test_scenario": "User asks to look up an account",
  "assertion_type": "schema_validation",
  "expected_action": "Call API: get_customer_record",
  "actual_ai_output": "I found the customer.",
  "eval_result": "FAIL – AI hallucinated conversational text instead of generating the required API payload."
}

In the example above, the test failed instantly because the model generated conversational text instead of the required tool call payload.

Architecturally, deterministic assertions must be the first layer of the stack, operating on a computationally inexpensive “fail-fast” principle. If a downstream API requires a specific schema, a malformed JSON string is a fatal error. By failing the evaluation immediately at this layer, engineering teams prevent the pipeline from triggering expensive semantic checks (Layer 2) or wasting valuable human review time.
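A deterministic tool-call assertion of this kind might look like the sketch below. The payload shape (`tool` and `arguments` keys) and the `customer_id` argument are assumptions made for illustration; a real tool schema would define its own fields.

```python
import json
from typing import Tuple

# Hypothetical schema: required argument names per tool.
REQUIRED_ARGS = {"get_customer_record": {"customer_id"}}


def assert_tool_call(raw_output: str, expected_tool: str) -> Tuple[bool, str]:
    """Return (passed, reason) for a single model output, failing fast."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "output is not valid JSON"
    if not isinstance(payload, dict):
        return False, "output is not a JSON object"
    if payload.get("tool") != expected_tool:
        return False, f"expected tool {expected_tool!r}, got {payload.get('tool')!r}"
    args = payload.get("arguments")
    if not isinstance(args, dict):
        return False, "arguments must be a JSON object"
    missing = REQUIRED_ARGS.get(expected_tool, set()) - set(args)
    if missing:
        return False, f"missing required arguments: {sorted(missing)}"
    return True, "ok"
```

Run against the failing example, `assert_tool_call("I found the customer.", "get_customer_record")` fails at the first check because conversational text is not parseable JSON, and no later check is evaluated.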

Layer 2: Model-based assertions

When deterministic assertions pass, the pipeline must evaluate semantic quality. Because natural language is fluid, traditional code cannot easily assert if a response is “helpful” or “empathetic.” This introduces model-based evaluation, commonly referred to as “LLM-as-a-Judge” or “LLM-Judge.”

While using one non-deterministic system to evaluate another seems counterintuitive, it is an exceptionally powerful architectural pattern for use cases requiring nuance. It is virtually impossible to write a reliable regex to verify if a response is “actionable” or “polite.” While human reviewers excel at this nuance, they cannot scale to evaluate tens of thousands of CI/CD test cases. Thus, the LLM-as-a-Judge becomes the scalable proxy for human discernment.

3 critical inputs for model-based assertions

However, model-based assertions only yield reliable data when the LLM-as-a-Judge is provisioned with three critical inputs:

  1. A state-of-the-art reasoning model: The Judge must possess superior reasoning capabilities compared to the production model. If your app runs on a smaller, faster model for latency, the judge must be a frontier reasoning model to approximate human-level discernment.

  2. A strict assessment rubric: Vague evaluation prompts (“Rate how good this answer is”) yield noisy, stochastic evaluations. A robust rubric explicitly defines the gradients of failure and success. (For example, a “Helpfulness” rubric should define Score 1 as an irrelevant refusal, Score 2 as addressing the prompt but lacking actionable steps, and Score 3 as providing actionable next steps strictly within context.)

  3. Ground truth (golden outputs): While the rubric provides the rules, a human-vetted “expected answer” acts as the answer key. When the LLM-Judge can compare the production model’s output against a verified Golden Output, its scoring reliability increases dramatically.
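The rubric and the golden output can be combined into a single judge prompt, as in this sketch. The prompt wording and the reply format (score on the first line) are assumptions, and `call_model` is a placeholder for whatever frontier-model client a team actually uses; no specific vendor API is implied.

```python
from typing import Callable, Dict

# Rubric text mirrors the three-score "Helpfulness" example above.
HELPFULNESS_RUBRIC: Dict[int, str] = {
    1: "Irrelevant refusal.",
    2: "Addresses the prompt but lacks actionable steps.",
    3: "Provides actionable next steps strictly within context.",
}


def build_judge_prompt(user_input: str, model_output: str, golden_output: str,
                       rubric: Dict[int, str]) -> str:
    """Assemble rubric, golden answer, and actual answer into one grading prompt."""
    rubric_text = "\n".join(f"Score {s}: {d}" for s, d in sorted(rubric.items()))
    return (
        "You are grading an AI response.\n"
        f"Rubric:\n{rubric_text}\n\n"
        f"User input:\n{user_input}\n\n"
        f"Expected (golden) answer:\n{golden_output}\n\n"
        f"Actual answer:\n{model_output}\n\n"
        "Reply with the score on the first line, then your reasoning."
    )


def judge(prompt: str, call_model: Callable[[str], str],
          rubric: Dict[int, str]) -> int:
    """Parse the judge reply and reject scores that are not in the rubric."""
    lines = call_model(prompt).strip().splitlines()
    if not lines:
        raise ValueError("judge returned an empty reply")
    score = int(lines[0].strip())
    if score not in rubric:
        raise ValueError(f"judge returned out-of-rubric score: {score}")
    return score
```

Rejecting out-of-rubric scores keeps a misbehaving judge from silently skewing the aggregate results.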

Architecture: The offline vs online pipeline

A robust evaluation architecture requires two complementary pipelines. The online pipeline monitors post-deployment telemetry, while the offline pipeline provides the foundational baseline and deterministic constraints required to evaluate stochastic models safely.

The offline evaluation pipeline

The offline pipeline’s primary objective is regression testing: identifying failures, drift, and latency regressions before production. Deploying an enterprise LLM feature without a gating offline evaluation suite is an architectural anti-pattern; it is the equivalent of merging uncompiled code into a main branch.

Process

1. Curating the golden dataset

The offline lifecycle begins by curating a “golden dataset” — a static, version-controlled repository of 200 to 500 test cases representing the AI’s full operational envelope. Each case pairs an exact input payload with an expected “golden output” (ground truth).

Crucially, this dataset must reflect expected real-world traffic distributions. While most cases cover standard “happy-path” interactions, engineers must systematically incorporate edge cases, jailbreaks, and adversarial inputs. Evaluating “refusal capabilities” under stress remains a strict compliance requirement.

Example test case payload (standard tool use):

  • Input: “Schedule a 30-minute follow-up meeting with the client for next Tuesday at 10 a.m.”

  • Expected output (golden): The system successfully invokes the schedule_meeting tool with the correct JSON payload: {“duration_minutes”: 30, “day”: “Tuesday”, “time”: “10 AM”, “attendee”: “client_email”}.

While manually curating hundreds of edge cases is tedious, the process can be accelerated with synthetic data generation pipelines that use a specialized LLM to produce diverse TSV/CSV test payloads. However, relying entirely on AI-generated test cases introduces the risk of data contamination and bias. A human-in-the-loop (HITL) architecture is mandatory at this stage; domain experts must manually review, edit, and validate the synthetic dataset to ensure it accurately reflects real-world user intent and enterprise policy before it is committed to the repository.
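The human-review requirement can be enforced at load time. The sketch below assumes a hypothetical TSV layout with `case_id`, `category`, `input`, `expected_output`, and `human_reviewed` columns; the column names are illustrative, not a standard.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    category: str         # e.g. "happy_path", "edge_case", "adversarial"
    input_text: str
    expected_output: str
    human_reviewed: bool


def load_golden_cases(tsv_text: str) -> List[GoldenCase]:
    """Parse TSV test cases and refuse any that a human has not signed off on."""
    # QUOTE_NONE keeps JSON payloads with double quotes intact inside a cell.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    cases = [
        GoldenCase(
            case_id=row["case_id"],
            category=row["category"],
            input_text=row["input"],
            expected_output=row["expected_output"],
            human_reviewed=row["human_reviewed"].strip().lower() == "true",
        )
        for row in reader
    ]
    unreviewed = [c.case_id for c in cases if not c.human_reviewed]
    if unreviewed:
        raise ValueError(f"cases lacking human review: {unreviewed}")
    return cases
```

Failing the load outright, rather than skipping unreviewed rows, makes it harder for unvetted synthetic cases to slip into the committed dataset.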

2. Defining the evaluation criteria

Once the dataset is curated, engineers must design the evaluation criteria to compute a composite score for each model output. A robust architecture achieves this by assigning weighted points across a hybrid of Layer 1 (deterministic) and Layer 2 (model-based) asserts.

Consider an AI agent executing a “send email” tool. An evaluation framework might utilize a 10-point scoring system:

  • Layer 1: Deterministic asserts (6 points): Did the agent invoke the correct tool? (2 pts). Did it produce a valid JSON object? (2 pts). Does the JSON strictly adhere to the expected schema? (2 pts).

  • Layer 2: Model-based asserts (4 points): (Note: Semantic rubrics must be highly use-case specific). Does the subject line reflect user intent? (1 pt). Does the email body match expected outputs without hallucination? (1 pt). Were CC/BCC fields leveraged accurately? (1 pt). Was the appropriate priority flag inferred? (1 pt).

To understand why the LLM-Judge awarded these points, the engineer must prompt the judge to supply its reasoning for each score. This is crucial for debugging failures.

The passing threshold and short-circuit logic 

In this example, a test case passes only if it scores at least 8 out of 10 points. Crucially, the evaluation pipeline must enforce strict short-circuit evaluation (fail-fast logic). If the model fails any deterministic assertion, such as generating a malformed JSON payload, the system must instantly fail the entire test case (0/10). There is zero architectural value in invoking an expensive LLM-Judge to assess the semantic “politeness” of an email if the underlying API call is structurally broken.
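The 10-point scheme and its short-circuit rule can be sketched as follows. The check names are hypothetical, and `run_judge` stands in for the Layer 2 call; it is passed as a callable so that it is only invoked when every deterministic check has passed.

```python
from typing import Callable, Dict

PASSING_THRESHOLD = 8          # out of 10, as in the example above
POINTS_PER_DETERMINISTIC = 2   # three Layer 1 checks worth 2 points each


def score_case(deterministic: Dict[str, bool],
               run_judge: Callable[[], Dict[str, int]]) -> Dict[str, object]:
    """Score one test case; any Layer 1 failure short-circuits to 0/10."""
    failed = [name for name, ok in deterministic.items() if not ok]
    if failed:
        # Fail fast: the expensive judge is never called.
        return {"score": 0, "passed": False, "failed_checks": failed}
    layer1 = POINTS_PER_DETERMINISTIC * len(deterministic)
    layer2 = sum(run_judge().values())  # 0 or 1 point per semantic criterion
    total = layer1 + layer2
    return {"score": total, "passed": total >= PASSING_THRESHOLD,
            "failed_checks": []}
```

With all three deterministic checks passing and the judge awarding 3 of 4 semantic points, the case scores 9 and passes; with a single deterministic failure it scores 0 regardless of how good the email text is.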

3. Executing the pipeline and aggregating signals

Using an evaluation infrastructure of choice, the system executes the offline pipeline — typically integrated as a blocking CI/CD step during a pull request. The infrastructure iterates through the golden dataset, injecting each test payload into the production model, capturing the output, and executing defined assertions against it.

Each output is scored against the passing threshold. Once batch execution is complete, results are aggregated into an overall pass rate. For enterprise-grade applications, the baseline pass rate must typically exceed 95%, scaling to 99%-plus for strict compliance or high-risk domains.
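Aggregation and gating reduce to a small calculation. In this sketch the 95% baseline is treated as inclusive, which is an assumption; a team could equally require the rate to strictly exceed the threshold.

```python
from typing import Iterable


def pass_rate(results: Iterable[bool]) -> float:
    """Fraction of test cases that met the passing threshold."""
    outcomes = list(results)
    if not outcomes:
        raise ValueError("no test results to aggregate")
    return sum(outcomes) / len(outcomes)


def gate(results: Iterable[bool], min_pass_rate: float = 0.95) -> bool:
    """Return True if the batch is good enough to unblock the pull request."""
    return pass_rate(results) >= min_pass_rate
```

For example, 96 passing cases out of 100 clears the default gate but would be blocked under a 0.99 threshold for a high-risk domain.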

4. Assessment, iteration, and alignment

Based on aggregated failure data, engineering teams conduct a root-cause analysis of failing test cases. This assessment drives iterative updates to core components: refining system prompts, modifying tool descriptions, augmenting knowledge sources, or adjusting hyperparameters (like temperature or top-p). Continuous optimization remains best practice even after achieving a 95% pass rate.

Crucially, any system modification necessitates a full regression test. Because LLMs are inherently non-deterministic, an update intended to fix one specific edge case can easily cause unforeseen degradations in other areas. The entire offline pipeline must be rerun to validate that the update improved quality without introducing regressions.

The online evaluation pipeline

While the offline pipeline acts as a strict pre-deployment gatekeeper, the online pipeline is the post-deployment telemetry system. Its objective is to monitor real-world behavior, capture emergent edge cases, and quantify model drift. Architects must instrument applications to capture four distinct categories of telemetry:

1. Explicit user signals

Direct, deterministic feedback indicating model performance:

  • Thumbs up/down: Disproportionate negative feedback is the most immediate leading indicator of system degradation, directing immediate engineering investigation.

  • Verbatim in-app feedback: Systematically parsing written comments identifies novel failure modes to integrate back into the offline “golden dataset.”

2. Implicit behavioral signals

Behavioral telemetry reveals silent failures where users give up without explicit feedback:

  • Regeneration and retry rates: High frequencies of retries indicate the initial output failed to resolve user intent.

  • Apology rate: Programmatically scanning for heuristic triggers (“I’m sorry”) detects degraded capabilities or broken tool routing.

  • Refusal rate: Artificially high refusal rates (“I can’t do that”) indicate over-calibrated safety filters rejecting benign user queries.
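Apology and refusal rates can be approximated with simple pattern matching over logged responses. The phrase lists below are illustrative assumptions and would need tuning for a real product's tone and language.

```python
import re
from typing import Dict, Iterable

# Hypothetical heuristic triggers; extend per product and locale.
APOLOGY = re.compile(r"\b(i'?m sorry|i apologi[sz]e)\b", re.IGNORECASE)
REFUSAL = re.compile(r"\b(i can'?t|i cannot|i'?m unable to)\b", re.IGNORECASE)


def behavioral_rates(responses: Iterable[str]) -> Dict[str, float]:
    """Share of responses containing an apology or a refusal phrase."""
    # Normalize curly apostrophes so "I’m" and "I'm" match the same pattern.
    texts = [r.replace("\u2019", "'") for r in responses]
    n = len(texts)
    if n == 0:
        return {"apology_rate": 0.0, "refusal_rate": 0.0}
    return {
        "apology_rate": sum(bool(APOLOGY.search(t)) for t in texts) / n,
        "refusal_rate": sum(bool(REFUSAL.search(t)) for t in texts) / n,
    }
```

These heuristics are deliberately crude; their value is in the trend over time, since a sudden rise in either rate is a cheap signal that something upstream has changed.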

3. Production deterministic asserts (synchronous)

Because deterministic code checks execute in milliseconds, teams can reuse Layer 1 offline asserts (schema conformity, tool validity) to synchronously evaluate 100% of production traffic. Tracking these pass/fail rates surfaces anomalous spikes in malformed outputs, which are often the earliest warning sign of silent model drift or provider-side API changes.

4. Production LLM-as-a-Judge (asynchronous)

If strict data privacy agreements (DPAs) permit logging user inputs, teams can deploy model-based asserts. Architecturally, production LLM-Judges must never execute synchronously on the critical path, because doing so doubles latency and compute costs. Instead, a background LLM-Judge asynchronously samples a fraction of daily sessions (5%, for example), grading outputs against the offline rubric to generate a continuous quality dashboard.
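One simple way to pick the sampled sessions is to hash the session ID, so the decision is deterministic and repeatable without shared state. This is a sketch of that idea, not a prescribed method; the function name and the choice of SHA-256 are assumptions.

```python
import hashlib


def should_sample(session_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of sessions for background judging."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes to an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)
```

Sessions for which this returns True would be placed on a queue for the background judge, keeping the user-facing request path untouched.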

Engineering the feedback loop (the “flywheel”)

Evaluation pipelines are not “set-it-and-forget-it” infrastructure. Without continuous updates, static datasets suffer from “rot” (concept drift) as user behavior evolves and customers discover novel use cases.

For example, an HR chatbot might boast a pristine 99% offline pass rate for standard payroll questions. However, if the company suddenly announces a new equity plan, users will immediately begin prompting the AI about vesting schedules — a domain entirely missing from the offline evaluations.

To make the system smarter over time, engineers must architect a closed feedback loop that mines production telemetry for continuous improvement.

The continuous improvement workflow:

  1. Capture: A user triggers an explicit negative signal (a “thumbs down”) or an implicit behavioral flag in production.

  2. Triage: The specific session log is automatically flagged and routed for human review.

  3. Root-cause analysis: A domain expert investigates the failure, identifies the gap, and updates the AI system to successfully handle similar requests.

  4. Dataset augmentation: The novel user input, paired with the newly corrected expected output, is appended to the offline Golden Dataset alongside several synthetic variations.

  5. Regression testing: The model is continuously re-evaluated against this newly discovered edge case in all future runs.
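The dataset augmentation step might be sketched as below. The dictionary keys and the `production:` source tag are hypothetical, and the sketch assumes each synthetic variation shares the corrected expected output, which a domain expert would still need to confirm.

```python
from typing import Dict, Iterable, List


def augment_golden_dataset(dataset: List[Dict[str, str]], user_input: str,
                           corrected_output: str, session_id: str,
                           synthetic_variations: Iterable[str] = ()) -> int:
    """Append a production failure and its variations; return how many were added."""
    existing = {case["input"] for case in dataset}
    added = 0
    for text in [user_input, *synthetic_variations]:
        if text in existing:
            continue  # skip inputs already covered by the dataset
        dataset.append({
            "input": text,
            "expected_output": corrected_output,
            "source": f"production:{session_id}",
        })
        existing.add(text)
        added += 1
    return added
```

Tagging each case with its originating session keeps the dataset auditable when a later regression traces back to a production incident.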

Building an evaluation pipeline without monitoring production logs and updating datasets is fundamentally insufficient. Users are unpredictable. Evaluating on stale data creates a dangerous illusion: High offline pass rates masking a rapidly degrading real-world experience.

Conclusion: The new “definition of done”

In the era of generative AI, a feature or product is no longer “done” simply because the code compiles and the prompt returns a coherent response. It is only done when a rigorous, automated evaluation pipeline is deployed and stable — and when the model consistently passes against both a curated golden dataset and newly discovered production edge cases.

This guide has equipped you with a comprehensive blueprint for building that reality. From architecting offline regression pipelines and online telemetry to the continuous feedback flywheel and navigating enterprise anti-patterns, you now have the structural foundation required to deploy AI systems with greater confidence.

Now, it is your turn. Share this framework with your engineering, product, and legal teams to establish a unified, cross-functional standard for AI quality in your organization. Stop guessing whether your models are degrading in production, and start measuring.

Derah Onuorah is a Microsoft senior product manager.

Google’s Gemini can now run on a single air-gapped server — and vanish when you pull the plug

Cirrascale Cloud Services today announced it has expanded its partnership with Google Cloud to deliver the Gemini model on-premises through Google Distributed Cloud, making it the first neocloud provider to offer Google’s most advanced AI model as a fully private, disconnected appliance. The announcement, timed to coincide with Google Cloud Next 2026 in Las Vegas, addresses a stubborn problem that has plagued regulated industries since the generative AI boom began: how to access frontier-class AI models without surrendering control of your data.

The offering packages Gemini into a Dell-manufactured, Google-certified hardware appliance equipped with eight Nvidia GPUs and wrapped in confidential computing protections. Enterprises and government agencies can deploy the system inside Cirrascale’s data centers or their own facilities, fully disconnected from the internet and from Google’s cloud infrastructure. The product enters preview immediately, with general availability expected in June or July.

In an exclusive interview with VentureBeat ahead of the announcement, Dave Driggers, CEO of Cirrascale Cloud Services, described the deployment as “the next step of the partnership” and “being able to offer their most important model they have, which is Gemini.” He was emphatic about what customers would be getting: “It is full blown Gemini. It’s not pulled,” he told VentureBeat. “Nothing’s missing from it, and it’ll be available in a private scenario, so that we can guarantee them that their data is secure, their inputs are secure, their outputs are secure.”

The move signals a deepening shift in the enterprise AI market, where the most capable models are migrating out of hyperscaler data centers and into customers’ own racks — a reversal of the cloud computing orthodoxy that defined the past decade.

The impossible tradeoff that kept banks and governments on the AI sidelines

For years, organizations in financial services, healthcare, defense and government faced a binary choice: access the most powerful AI models through public cloud APIs, exposing sensitive data to third-party infrastructure, or settle for less capable open-source models they could host themselves. Cirrascale’s new offering attempts to eliminate that tradeoff entirely.

Driggers described how the trust problem escalated in stages. First, companies worried about handing their proprietary data to hyperscalers. Then came a deeper realization. “They started realizing, holy crap, when my users type stuff in, they’re giving private information away, and the output is private too,” Driggers told VentureBeat. “And then the hyperscalers said, ‘Your prompts and the responses? That’s our stuff. We need that in order to answer your question.’” That was the moment, he argued, when the demand for fully private AI became impossible to ignore.

Unlike Google Distributed Cloud, which Google already offers as its own on-premises cloud extension, the Cirrascale deployment places the actual model — weights and all — outside of Google’s infrastructure entirely. “Google doesn’t own this hardware. We own the hardware, or the customer owns the hardware,” Driggers said. “It is completely outside of Google.”

Driggers drew a sharp distinction between this offering and what competitors provide. When asked about Microsoft Azure’s on-premises deployments with OpenAI models and AWS Outposts, he was blunt: “Those are a lot different. This is the actual model being deployed on prem outside of their cloud. It’s not a cut down version. It’s the actual model.” 

Pull the plug and the model vanishes: how confidential computing guards Google’s crown jewel

The technical underpinnings of the deployment reveal how seriously both Google and Cirrascale are treating the security question. The Gemini model resides entirely in volatile memory — not on persistent storage. “As soon as the power is off, the model is gone,” Driggers explained. User sessions operate through caches that clear automatically when a session ends. “A company’s user inputs, once that session’s over, they’re gone. They can be saved, but by default, they’re gone,” he said.

Perhaps the most striking security feature is what happens when someone attempts to tamper with the appliance. Driggers described a mechanism that effectively renders the machine inoperable: “You do anything that is against confidential compute, and it’s gone. Not only does the machine turn off, and therefore the model is gone, it actually puts in a marker that says, ‘You violated the confidential compute.’ That machine has to come back to us, or back to Dell or back to Google.” He characterized the appliance as something that “does time bomb itself if something goes wrong.”

This level of protection reflects Google’s own anxiety about releasing its flagship model’s weights into environments it doesn’t control. The appliance is effectively a vault: the model runs inside it, but nobody — not even the customer — can extract or inspect the weights. The confidential computing envelope ensures that even physical possession of the hardware doesn’t grant access to the model’s intellectual property.

When Google releases a new version of Gemini, the appliance needs to reconnect — but only briefly, and through a private channel. “It does have to get connected back to Google to load the new model. But that can go via a private connection,” Driggers said. For the most security-sensitive customers who can never allow their machine to connect to an outside network, Cirrascale offers a physical swap: “The server will be unplugged, purged, all the data gone, guaranteed it’s gone, a new server will show up with a new version of the model.”

From Wall Street to drug labs, the rush for air-gapped AI is accelerating

Driggers identified three primary drivers of demand: trust, security and guaranteed performance. Financial services institutions top the list. “They’ve got regulatory issues where they can’t have something out of their control. They’ve got to be the one who determines where everything is. It’s got to be air gap,” Driggers said. The minimum deployment footprint — a single eight-GPU server — makes the product accessible in a way that Google’s own private offerings do not. Running Gemini on Google’s TPU-based infrastructure, Driggers noted, requires a much larger commitment. “If you want a private [instance] from Google, they require a much bigger bite, because to build something private for you, Google requires a gigantic footprint. Here we can do it down to a single machine.”

Beyond finance, Driggers pointed to drug discovery, medical data, public-sector research, and any business handling personal information. He also flagged an increasingly critical use case: data sovereignty. “How about your business that’s doing business outside of the United States, and now you’ve got data sovereignty laws in places where GCP is not? We can provide private Gemini in these smaller countries where the data can’t leave.”

The public sector is another major target. Cirrascale launched a dedicated Government Services division in March as part of its earlier partnership with Google Public Sector around the GPAR (Google Public Sector Program for Accelerated Research) initiative. That program provides higher education and research institutions access to AI tools including AlphaFold, AI Co-Scientist, and Gemini Enterprise for Education. Today’s announcement extends that relationship from the research tooling layer to the model itself.

The performance guarantee is the third pillar. Driggers noted that frontier models accessed through public APIs deliver inconsistent response times — a problem for mission-critical business applications. The private deployment eliminates that variability. Cirrascale layers management software on top of the Gemini appliance that allows administrators to prioritize users, allocate tokens by role, adjust context window sizes, and load-balance across multiple appliances and regions. “Your primary data scientists or your programmers may need to have really large context windows and get priority, especially maybe nine to five,” Driggers explained, “but yet, the rest of the time, they want to share the Gemini experience over a wider group of people.” He also noted that agentic AI workloads, which can run around the clock, benefit from the ability to consume unused capacity during off-peak hours — a scheduling flexibility that public cloud deployments don’t easily support.

Seat licenses, token billing and all-you-can-eat pricing: a model built for enterprise flexibility

The pricing model reflects Cirrascale’s broader philosophy of meeting customers where they are. Driggers described several consumption options: seat-based licensing (with both enterprise and standard tiers), per-token billing, and flat “all-you-can-eat” pricing per appliance. The minimum commitment is a single dedicated server — the appliances are not shared between customers in any configuration. “We’ll meet the customer, what they’re used to,” Driggers said. “If they’re currently taking a seat license, we’ll create a seat license for them.”

Customers can also choose to purchase the hardware outright while still consuming Gemini as a managed service, an arrangement Cirrascale has offered since its earliest days in the AI wave. Driggers said OpenAI has been a customer since 2016 or 2017, and in that engagement, OpenAI purchased its own GPUs while Cirrascale “took those GPUs, incorporated them into our servers and storage and networking, and then presented it back as a cloud service to them so they didn’t have to manage anything.”

That flexible ownership model is particularly relevant for universities and government-funded research institutions, where mandates often require a specific mix of capital expenditure, operating expenditure, and personnel investment. “A lot of government funding requires a mixture of CapEx, OPEX and employment development,” Driggers said. “So we allow that as well.”

Inside the neocloud that built the world’s first eight-GPU server — and just landed Google’s biggest AI model

Cirrascale’s announcement arrives during a period of explosive growth for the neocloud sector — the tier of specialized AI cloud providers that sit between the hyperscalers and traditional hosting companies. The neocloud market is projected to be worth $35.22 billion in 2026 and is growing at a compound annual growth rate of 46.37%, according to Mordor Intelligence. Leading neocloud providers include CoreWeave, Crusoe Cloud, Lambda, Nebius and Vultr, and these companies specialize in GPU-as-a-Service for AI and high-performance computing workloads.

But Cirrascale occupies a different niche within this booming category. While companies like CoreWeave have focused primarily on providing raw GPU compute at scale — CoreWeave boasts a $55.6 billion backlog — Cirrascale has positioned itself around private AI, managed services and longer-term engagements rather than on-demand elastic compute. Driggers described the company as “not an on-demand place” but rather a provider focused on “longer-term workloads where we’re really competing against somebody doing it back on prem.”

The company’s history supports that claim. Cirrascale traces its roots to a hardware company that “designed the world’s first eight GPU server in 2012 before anybody thought you’d ever need eight GPUs in a box,” as Driggers put it. It pivoted to pure cloud services roughly eight years ago and has since built a client roster that includes the Allen Institute for AI, which in August 2025 tapped Cirrascale as the managed services provider for a $152 million open AI initiative funded by the National Science Foundation and Nvidia. Earlier this month, Cirrascale announced a three-way alliance with Rafay Systems and Cisco to deliver end-to-end enterprise AI solutions combining Cirrascale’s inference platform, Rafay’s GPU orchestration, and Cisco’s networking and compute hardware.

The private AI era is arriving faster than anyone expected

The Gemini partnership is the highest-profile move yet — and it taps into a broader industry current. The push to move frontier AI out of the public cloud and into private infrastructure is no longer a niche demand. Industry analysts predict that by 2027, 40% of AI model training and inference will occur outside public cloud environments. That projection helps explain why Google is willing to let its crown-jewel model run on hardware it doesn’t own, in data centers it doesn’t operate, managed by a company in San Diego. The alternative — watching regulated enterprises default to open-source models or to Microsoft’s Azure OpenAI Service — is apparently a worse outcome.

The announcement also carries major implications for Google’s competitive positioning. Microsoft has built its enterprise AI strategy around the Azure OpenAI Service and its deep partnership with OpenAI, while AWS has invested in Amazon Bedrock and its own on-premises solutions through Outposts. Google Cloud Platform still trails both rivals in market share, though Q4 cloud revenue rose 48% year-over-year. Enabling Gemini to run on third-party infrastructure via partners like Cirrascale broadens its distribution surface in exactly the segments — government, finance, healthcare — where Microsoft and Amazon have historically held advantages. For Cirrascale, the partnership represents a chance to differentiate sharply in a market where most neoclouds are competing on GPU availability and price.

Driggers expects rapid uptake in the second half of 2026. “It’s going to be crazy towards the end of this year,” he said. “Major banks will finally do stuff like this, because they can secure it. They can do it globally. Big research institutions who have labs all over the world will do these types of things.” He predicted other frontier model providers will follow with similar offerings soon, and he doesn’t see Gemini as the end of the story. “We really think that the enterprise have been waiting for private AI, not just Gemini, but all sorts of private AI,” Driggers said.

That may be the most telling line of all. For three years, the AI revolution has been defined by a simple bargain: send your data to the cloud and get intelligence back. Cirrascale’s bet — and increasingly, Google’s — is that the biggest customers in the world are done accepting those terms. The most powerful AI on the planet is now available on a single locked box that can sit in a bank vault, a university basement, or a government facility in a country where Google has no data center. The cloud, it turns out, is finally ready to come back down to earth.

Building an Interregional Transmission Overlay for a Resilient U.S. Grid

Examining how a U.S. Interregional Transmission Overlay could address aging grid infrastructure, surging demand, and renewable integration challenges. What Attendees Will Learn: Why the current regional grid structure is approaching its limits — Explore …

Salesforce launches Headless 360 to turn its entire platform into infrastructure for AI agents

Salesforce on Wednesday unveiled the most ambitious architectural transformation in its 27-year history, introducing “Headless 360” — a sweeping initiative that exposes every capability in its platform as an API, MCP tool, or CLI command so AI agents can operate the entire system without ever opening a browser.

The announcement, made at the company’s annual TDX developer conference in San Francisco, ships more than 100 new tools and skills immediately available to developers. It marks a decisive response to the existential question hanging over enterprise software: In a world where AI agents can reason, plan, and execute, does a company still need a CRM with a graphical interface?

Salesforce’s answer: No — and that’s exactly the point.

“We made a decision two and a half years ago: Rebuild Salesforce for agents,” the company said in its announcement. “Instead of burying capabilities behind a UI, expose them so the entire platform will be programmable and accessible from anywhere.”

The timing is anything but coincidental. Salesforce finds itself navigating one of the most turbulent periods in enterprise software history — a sector-wide sell-off that has pushed the iShares Expanded Tech-Software Sector ETF down roughly 28% from its September peak. The fear driving the decline: that AI, particularly large language models from Anthropic, OpenAI, and others, could render traditional SaaS business models obsolete.

Jayesh Govindarjan, EVP of Salesforce and one of the key architects behind the Headless 360 initiative, described the announcement as rooted not in marketing theory but in hard-won lessons from deploying agents with thousands of enterprise customers.

“The problem that emerged is the lifecycle of building an agentic system for every one of our customers on any stack, whether it’s ours or somebody else’s,” Govindarjan told VentureBeat in an exclusive interview. “The challenge that they face is very much the software development challenge. How do I build an agent? That’s only step one.”

More than 100 new tools give coding agents full access to the Salesforce platform for the first time

Salesforce Headless 360 rests on three pillars that collectively represent the company’s attempt to redefine what an enterprise platform looks like in the agentic era.

The first pillar — build any way you want — delivers more than 60 new MCP (Model Context Protocol) tools and 30-plus preconfigured coding skills that give external coding agents like Claude Code, Cursor, Codex, and Windsurf complete, live access to a customer’s entire Salesforce org, including data, workflows, and business logic. Developers no longer need to work inside Salesforce’s own IDE. They can direct AI coding agents from any terminal to build, deploy, and manage Salesforce applications.

Agentforce Vibes 2.0, the company’s own native development environment, now includes what it calls an “open agent harness” supporting both the Anthropic agent SDK and the OpenAI agents SDK. As demonstrated during the keynote, developers can choose between Claude Code and OpenAI agents depending on the task, with the harness dynamically adjusting available capabilities based on the selected agent. The environment also adds multi-model support, including Claude Sonnet and GPT-5, along with full org awareness from the start.

A significant technical addition is native React support on the Salesforce platform. During the keynote demo, presenters built a fully functional partner service application using React — not Salesforce’s own Lightning framework — that connected to org metadata via GraphQL while inheriting all platform security primitives. This opens up dramatically more expressive front-end possibilities for developers who want complete control over the visual layer.

The second pillar — deploy on any surface — centers on the new Agentforce Experience Layer, which separates what an agent does from how it appears, rendering rich interactive components natively across Slack, mobile apps, Microsoft Teams, ChatGPT, Claude, Gemini, and any client supporting MCP apps. During the keynote, presenters defined an experience once and deployed it across six different surfaces without writing surface-specific code. The philosophical shift is significant: rather than pulling customers into a Salesforce UI, enterprises push branded, interactive agent experiences into whatever workspace their customers already inhabit.

The third pillar — build agents you can trust at scale — introduces an entirely new suite of lifecycle management tools spanning testing, evaluation, experimentation, observation, and orchestration. Agent Script, the company’s new domain-specific language for defining agent behavior deterministically, is now generally available and open-sourced. A new Testing Center surfaces logic gaps and policy violations before deployment. Custom Scoring Evals let enterprises define what “good” looks like for their specific use case. And a new A/B Testing API enables running multiple agent versions against real traffic simultaneously.

Why enterprise customers kept breaking their own AI agents — and how Salesforce redesigned its tooling in response

Perhaps the most technically significant — and candid — portion of VentureBeat’s interview with Govindarjan addressed the fundamental engineering tension at the heart of enterprise AI: agents are probabilistic systems, but enterprises demand deterministic outcomes.

Govindarjan explained that early Agentforce customers, after getting agents into production through “sheer hard work,” discovered a painful reality. “They were afraid to make changes to these agents, because the whole system was brittle,” he said. “You make one change and you don’t know whether it’s going to work 100% of the time. All the testing you did needs to be redone.”

This brittleness problem drove the creation of Agent Script, which Govindarjan described as a programming language that “brings together the determinism that’s in programming languages with the inherent flexibility in probabilistic systems that LLMs provide.” The language functions as a single flat file — versionable, auditable — that defines a state machine governing how an agent behaves. Within that machine, enterprises specify which steps must follow explicit business logic and which can reason freely using LLM capabilities.
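The general pattern, a state machine in which some steps are fixed business logic and others are delegated to a model, can be sketched in plain Python. This is not Agent Script syntax. The step names, the refund rule, and the `fake_llm` stand-in are assumptions made up for illustration.

```python
from typing import Callable, Dict, Optional

State = dict
# A step mutates the state and returns the next step name, or None to stop.
Step = Callable[[State], Optional[str]]


def fake_llm(prompt: str) -> str:
    """Stand-in for a model call so the sketch runs offline."""
    return f"[draft reply for: {prompt}]"


def verify_identity(state: State) -> Optional[str]:
    # Deterministic: explicit branching, no model involved.
    return "check_refund_policy" if state.get("verified") else "escalate"


def check_refund_policy(state: State) -> Optional[str]:
    # Deterministic: hypothetical 30-day refund rule.
    state["refund_allowed"] = state["order_age_days"] <= 30
    return "compose_reply"


def compose_reply(state: State) -> Optional[str]:
    # Free-form: the model words the reply, but the decision is already fixed.
    outcome = "approved" if state["refund_allowed"] else "declined"
    state["reply"] = fake_llm(f"refund {outcome}")
    return None


def escalate(state: State) -> Optional[str]:
    state["reply"] = "Routing to a human agent."
    return None


STEPS: Dict[str, Step] = {
    "verify_identity": verify_identity,
    "check_refund_policy": check_refund_policy,
    "compose_reply": compose_reply,
    "escalate": escalate,
}


def run(start: str, state: State, max_steps: int = 20) -> State:
    """Walk the state machine and record the path taken for auditing."""
    trace = []
    current: Optional[str] = start
    while current is not None:
        if len(trace) >= max_steps:
            raise RuntimeError("step limit exceeded")
        trace.append(current)
        current = STEPS[current](state)
    state["trace"] = trace
    return state
```

Because the transitions live in one table, a change to a single step can be reviewed and versioned on its own, which is the property the flat-file design is described as targeting.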

Salesforce open-sourced Agent Script this week, and Govindarjan noted that Claude Code can already generate it natively because of its clean documentation. The approach stands in sharp contrast to the “vibe coding” movement gaining traction elsewhere in the industry. As the Wall Street Journal recently reported, some companies are now attempting to vibe-code entire CRM replacements — a trend Salesforce’s Headless 360 directly addresses by making its own platform the most agent-friendly substrate available.

Govindarjan described the tooling as a product of Salesforce’s own internal practice. “We needed these tools to make our customers successful. Then our FDEs [forward-deployed engineers] needed them. We hardened them, and then we gave them to our customers,” he told VentureBeat. In other words, Salesforce productized its own pain.

Inside the two competing AI agent architectures Salesforce says every enterprise will need

Govindarjan drew a revealing distinction between two fundamentally different agentic architectures emerging in the enterprise — one for customer-facing interactions and one he linked to what he called the “Ralph Wiggum loop.”

Customer-facing agents — those deployed to interact with end customers for sales or service — demand tight deterministic control. “Before customers are willing to put these agents in front of their customers, they want to make sure that it follows a certain paradigm — a certain brand set of rules,” Govindarjan told VentureBeat. Agent Script encodes these as a static graph — a defined funnel of steps with LLM reasoning embedded within each step.

The “Ralph Wiggum loop,” by contrast, represents the opposite end of the spectrum: a dynamic graph that unrolls at runtime, where the agent autonomously decides its next step based on what it learned in the previous step, killing dead-end paths and spawning new ones until the task is complete. This architecture, Govindarjan said, manifests primarily in employee-facing scenarios — developers using coding agents, salespeople running deep research loops, marketers generating campaign materials — where an expert human reviews the output before it ships.

“Ralph Wiggum loops are great for employee-facing because employees are, in essence, experts at something,” Govindarjan explained. “Developers are experts at development, salespeople are experts at sales.”

The critical technical insight: both architectures run on the same underlying platform and the same graph engine. “This is a dynamic graph. This is a static graph,” he said. “It’s all a graph underneath.” That unified runtime — spanning the spectrum from tightly controlled customer interactions to free-form autonomous loops — may be Salesforce’s most important technical bet, sparing enterprises from maintaining separate platforms for different agent modalities.
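One way to picture "it's all a graph underneath" is a single traversal engine that accepts either a fixed edge table or a function that decides successors at runtime. The sketch below is an illustration under that assumption, not Salesforce's graph engine. The node names are invented, and the rule-based `dynamic_expand` stands in for what would be a model's decision.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

# An expander receives a node and returns the nodes to visit next.
Expand = Callable[[str], List[str]]


def run_graph(start: str, expand: Expand, max_nodes: int = 50) -> List[str]:
    """Shared engine: visit nodes breadth-first until no work remains."""
    frontier: Deque[str] = deque([start])
    visited: List[str] = []
    while frontier and len(visited) < max_nodes:
        node = frontier.popleft()
        visited.append(node)
        frontier.extend(expand(node))
    return visited


# Static graph: the funnel is fixed before the agent runs.
STATIC_EDGES: Dict[str, List[str]] = {
    "greet": ["qualify"],
    "qualify": ["quote"],
    "quote": [],
}


def static_expand(node: str) -> List[str]:
    return STATIC_EDGES[node]


# Dynamic graph: successors are chosen from what the previous step found.
def dynamic_expand(node: str) -> List[str]:
    if node == "research:topic":
        return ["research:topic/source-a", "research:topic/source-b"]
    if node.endswith("source-a"):
        return []             # dead end: this path is dropped
    if node.endswith("source-b"):
        return ["summarize"]  # promising: spawn a follow-up step
    return []                 # nothing left to do
```

Calling `run_graph("greet", static_expand)` walks the fixed funnel, while `run_graph("research:topic", dynamic_expand)` unrolls a path that did not exist until runtime. The `max_nodes` cap is a simple guard against a loop that never terminates.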

Salesforce hedges its bets on MCP while opening its ecosystem to every major AI model and tool

Salesforce’s embrace of openness at TDX was striking. The platform now integrates with OpenAI, Anthropic, Google Gemini, Meta’s LLaMA, and Mistral AI models. The open agent harness supports third-party agent SDKs. MCP tools work from any coding environment. And the new AgentExchange marketplace unifies 10,000 Salesforce apps, 2,600-plus Slack apps, and 1,000-plus Agentforce agents, tools, and MCP servers from partners including Google, Docusign, and Notion, backed by a new $50 million AgentExchange Builders Initiative.

Yet Govindarjan offered a surprisingly candid assessment of MCP itself — the protocol Anthropic created that has become a de facto standard for agent-tool communication.

“To be very honest, not at all sure” that MCP will remain the standard, he told VentureBeat. “When MCP first came along as a protocol, a lot of us engineers felt that it was a wrapper on top of a really well-written CLI — which now it is. A lot of people are saying that maybe CLI is just as good, if not better.”

His approach: pragmatic flexibility. “We’re not wedded to one or the other. We just use the best, and often we will offer all three. We offer an API, we offer a CLI, we offer an MCP.” This hedging explains the “Headless 360” naming itself — rather than betting on a single protocol, Salesforce exposes every capability across all three access patterns, insulating itself against protocol shifts.

Engine, the B2B travel management company featured prominently in the keynote demos, offered a real-world proof point for the open ecosystem approach. The company built its customer service agent, Ava, in 12 days using Agentforce and now handles 50% of customer cases autonomously. Engine runs five agents across customer-facing and employee-facing functions, with Data 360 at the heart of its infrastructure and Slack as its primary workspace. “CSAT goes up, costs to deliver go down. Customers are happier. We’re getting them answers faster. What’s the trade off? There’s no trade off,” an Engine executive said during the keynote.

Underpinning all of it is a shift in how Salesforce gets paid. The company is moving from per-seat licensing to consumption-based pricing for Agentforce — a transition Govindarjan described as “a business model change and innovation for us.” It’s a tacit acknowledgment that when agents, not humans, are doing the work, charging per user no longer makes sense.

Salesforce isn’t defending the old model — it’s dismantling it and betting the company on what comes next

Govindarjan framed the company’s evolution in architectural terms. Salesforce has organized its platform around four layers: a system of context (Data 360), a system of work (Customer 360 apps), a system of agency (Agentforce), and a system of engagement (Slack and other surfaces). Headless 360 opens every layer via programmable endpoints.

“What you saw today, what we’re doing now, is we’re opening up every single layer, right, with MCP tools, so we can go build the agentic experiences that are needed,” Govindarjan told VentureBeat. “I think you’re seeing a company transforming itself.”

Whether that transformation succeeds will depend on execution across thousands of customer deployments, the staying power of MCP and related protocols, and the fundamental question of whether incumbent enterprise platforms can move fast enough to remain relevant when AI agents can increasingly build new systems from scratch. The software sector’s bear market, the financial pressures bearing down on the entire industry, and the breathtaking pace of LLM improvement all conspire to make this one of the highest-stakes bets in enterprise technology.

But there is an irony embedded in Salesforce’s predicament that Headless 360 makes explicit. The very AI capabilities that threaten to displace traditional software are the same capabilities that Salesforce now harnesses to rebuild itself. Every coding agent that could theoretically replace a CRM is now, through Headless 360, a coding agent that builds on top of one. The company is not arguing that agents won’t change the game. It’s arguing that decades of accumulated enterprise data, workflows, trust layers, and institutional logic give it something no coding agent can generate from a blank prompt.

As Benioff declared on CNBC’s Mad Money in March: “The software industry is still alive, well and growing.” Headless 360 is his company’s most forceful attempt to prove him right — by tearing down the walls of the very platform that made Salesforce famous and inviting every agent in the world to walk through the front door.

Parker Harris, Salesforce’s co-founder, captured the bet most succinctly in a question he posed last month: “Why should you ever log into Salesforce again?”

If Headless 360 works as designed, the answer is: You shouldn’t have to. And that, Salesforce is wagering, is precisely what will keep you paying for it.

Are we getting what we paid for? How to turn AI momentum into measurable value

Enterprise AI is entering a new phase — one where the central question is no longer what can be built, but how to make the most of our AI investment.

At VentureBeat’s latest AI Impact Tour session, Brian Gracely, director of portfolio strategy at Red Hat, described the operational reality inside large organizations: AI sprawl, rising inference costs, and limited visibility into what those investments are actually returning.

It’s the “Day 2” moment — when pilots give way to production, and cost, governance, and sustainability become harder than building the system in the first place.

“We’ve seen customers who say, ‘I have 50,000 licenses of Copilot. I don’t really know what people are getting out of that. But I do know that I’m paying for the most expensive computing in the world, because it’s GPUs,'” Gracely said. “‘How am I going to get that under control?'”

Why enterprise AI costs are now a board-level problem

For much of the past two years, cost was not the primary concern for organizations evaluating generative AI. The experimental phase gave teams cover to spend freely, and the promise of productivity gains justified aggressive investment. That dynamic is shifting as enterprises enter their second and third budget cycles with AI. The focus has moved from “can we build something?” to “are we getting what we paid for?”

Enterprises that made large, early bets on managed AI services are conducting hard reviews of whether those investments are delivering measurable value. The issue isn’t just that GPU computing is expensive. It is that many organizations lack the instrumentation to connect spending to outcomes, making it nearly impossible to justify renewals or scale responsibly.

The strategic shift from token consumer to token producer

The dominant AI procurement model of the past few years has been straightforward: pay a vendor per token, per seat, or per API call, and let someone else manage the infrastructure. That model made sense as a starting point, but enterprises that have been through one AI cycle now have enough experience to compare alternatives, and they are starting to rethink it.

“Instead of being purely a token consumer, how can I start being a token generator?” Gracely said. “Are there use cases and workloads that make sense for me to own more? It may mean operating GPUs. It may mean renting GPUs. And then asking, ‘Does that workload need the greatest state-of-the-art model? Are there more capable open models or smaller models that fit?'”

The decision is not binary. The right answer depends on the workload, the organization, and the risk tolerance involved, but the math is getting more complicated as the number of capable open models, from DeepSeek to models now available through cloud marketplaces, grows. Now enterprises actually have real alternatives to the handful of providers that dominated the landscape two years ago.

Falling AI costs and rising usage create a paradox for enterprise budgets

Some enterprise leaders argue that locking into infrastructure investments now could mean significantly overpaying in the long run, pointing to the statement from Anthropic CEO Dario Amodei that AI inference costs are declining roughly 60% per year.

Over the last three years, the emergence of open-source models such as DeepSeek has meaningfully expanded the strategic options available to enterprises that are willing to invest in the underlying infrastructure.

But while costs per token are falling, usage is accelerating at a pace that more than offsets efficiency gains. It’s a version of Jevons Paradox, the economic principle that improvements in resource efficiency tend to increase total consumption rather than reduce it, as lower cost enables broader adoption.

For enterprise budget planners, this means declining unit costs do not translate into declining total bills. An organization that triples its AI usage while costs fall by half still ends up spending more than it did before. The consideration becomes which workloads genuinely require the most capable and most expensive models, and which can be handled just fine by smaller, cheaper alternatives.
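The arithmetic behind that example is short enough to check directly. The sketch below uses only the multipliers from the paragraph above (usage triples, unit cost halves), and the helper names are illustrative.

```python
def spend_ratio(usage_multiplier: float, cost_multiplier: float) -> float:
    """New total spend divided by old total spend.

    Total spend is unit cost times usage, so the ratio of new to old
    spend is simply the product of the two multipliers.
    """
    return usage_multiplier * cost_multiplier


def breakeven_usage_multiplier(cost_multiplier: float) -> float:
    """Usage growth at which total spend stays flat for a given cost change."""
    return 1.0 / cost_multiplier


# Usage triples while unit cost falls by half.
ratio = spend_ratio(usage_multiplier=3.0, cost_multiplier=0.5)
print(f"Total spend changes by a factor of {ratio}")  # prints 1.5

# With unit costs halved, the bill only shrinks if usage grows less than 2x.
print(breakeven_usage_multiplier(0.5))  # prints 2.0
```

In other words, the bill in the example rises 50% even though every token costs half as much. Any usage growth above the breakeven multiplier turns a falling unit cost into a rising total.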

The business case for investing in AI infrastructure flexibility

The prescription isn’t to slow down AI investment, but to build with flexibility top of mind. The organizations that will win aren’t necessarily the ones that move fastest or spend the most; they’re the ones building infrastructure and operating models capable of absorbing the next unexpected development.

“The more you can build some abstractions and give yourself some flexibility, the more you can experiment without running up costs, but also without jeopardizing your business. Those are as important as asking whether you’re doing everything best practice right now,” Gracely explained.

But despite how entrenched AI discussions have become in enterprise planning cycles, the practical experience most organizations have is still measured in years, not decades.

“It feels like we’ve been doing this forever. We’ve been doing this for three years,” Gracely added. “It’s early and it’s moving really fast. You don’t know what’s coming next. But the characteristics of what’s coming next — you should have some sense of what that looks like.”

For enterprise leaders still calibrating their AI investment strategies, that may be the most actionable takeaway: the goal is not to optimize for today’s cost structure, but to build the organizational and technical flexibility to adapt when, not if, it changes again.

AI lowered the cost of building software. Enterprise governance hasn’t caught up

Presented by Retool


The logic used to be: buying software is cheaper, faster, and safer for most use cases. Building was reserved for companies with large engineering teams, deep pockets, and problems so specific that no vendor could address them. But now, the cost to code a piece of software has dropped to near zero.

Anyone can build their own software now, but enterprise and governance models have yet to catch up. Retool’s 2026 Build vs. Buy Shift Report, based on a survey of 817 builders, traces exactly how this shift is playing out.

The cost curve changed; SaaS pricing didn’t

Two years ago, a custom internal tool might have taken an engineering team weeks or months and cost six figures. Today, an operations lead with the right platform can have a working prototype in a day or two. This structural shift is driven by AI-assisted development and the maturation of enterprise app-building platforms.

Meanwhile, SaaS pricing hasn’t adjusted: vendors still charge per seat for generic software that requires customization and integration costs on top. When the cost of building drops by an order of magnitude but the cost of buying stays flat, the math changes for every company, not just the ones with large engineering teams.

The data reflects this. Retool’s report found that 35% of teams have already replaced at least one SaaS tool with a custom build, and 78% plan to build more custom tooling in 2026.

Workflow automations and admin tools are among SaaS tools at risk

The shift isn’t happening uniformly. The top SaaS tools respondents have replaced or considered replacing include workflow automations (35%) and internal admin tools (33%), followed by BI tools (29%) and CRMs (25%).

A purchased workflow automation tool has to serve thousands of customers, so it optimizes for the average case — and the average case is nobody’s actual case. Every company’s internal workflows are different. They reflect org structure, compliance requirements, data systems, and business logic unique to that organization.

Internal admin tools carry the same problem: they’re inherently company-specific. These categories were always the most awkward fit for off-the-shelf software, and there’s now an affordable, accessible alternative (MIT’s State of AI in Business reported $2-10 million in savings annually for customer service and document processing tasks).

The replacement pattern tends to be additive rather than wholesale (nobody is just ripping out Salesforce). Teams are replacing the specific pieces that never quite fit: an approval flow that required three workarounds, the dashboard that couldn’t connect to their actual data … but those narrow replacements add up. Once a team builds one tool that works better than what they bought, the default question shifts from “What should we buy?” to “Can we build this?”

Builders go around IT, signaling broader procurement challenges

The clearest evidence that procurement processes haven’t kept up with building capability is the scale of shadow IT now occurring inside enterprises. Retool’s report found that 60% of builders have created tools, workflows, or automations outside of IT oversight in the past year — and 25% report doing so frequently.

Even experienced, high-judgment people choose speed over process. Nearly two-thirds of total survey respondents (64%) are senior managers and above. Existing procurement cycles weren’t designed for a world where building software takes days rather than months. When people quote the 95% generative AI pilot failure rate, they’re not accounting for the robust grassroots adoption happening under executives’ noses.

Shadow IT at this scale is a demand signal. The people closest to the problems are telling organizations that the existing process can’t keep up — 31% of those going around IT do so simply because they can build faster than IT can provision tools. So, suppression isn’t a productive response. The challenge is that the tools being built in the shadows are also the ones most likely to stall before they become useful.

A vibe-coded prototype running on sample data is impressive. A production tool connected to your actual Salesforce instance, with role-based access and a security review, is useful. The report found that 51% of builders have shipped production software currently in use by their teams, and among those, about half report saving six or more hours per week.

When building happens in an ungoverned environment, organizations get neither outcome reliably. Someone connects an AI-powered tool to production data with no audit trail, no access controls, and no owner. Multiply that by dozens of builders across an organization, and you have an expanding security surface that IT doesn’t even know exists.[1]

The teams whose homebuilt solutions reach production tend to have three things the others don’t: connectivity to real data sources, a security and permissions model they trust, and a review process for what gets deployed. Channeling builder energy into governed environments, where speed and security aren’t in conflict, is how organizations avoid shadow IT becoming a liability.
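A governed environment can enforce those prerequisites mechanically before anything ships. The following is a minimal sketch of such a gate, assuming a hypothetical `ToolRecord` registry entry. The field names and checks are illustrative and do not come from Retool's report or any specific platform.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToolRecord:
    """Hypothetical registry entry for an internally built tool."""
    name: str
    owner: str = ""
    data_sources: List[str] = field(default_factory=list)
    has_access_controls: bool = False
    has_audit_trail: bool = False
    reviewed: bool = False


def deployment_blockers(tool: ToolRecord) -> List[str]:
    """Return the reasons a tool may not go to production (empty if none)."""
    blockers: List[str] = []
    if not tool.owner:
        blockers.append("no owner")
    if not tool.data_sources:
        blockers.append("not connected to a real data source")
    if not tool.has_access_controls:
        blockers.append("no access controls")
    if not tool.has_audit_trail:
        blockers.append("no audit trail")
    if not tool.reviewed:
        blockers.append("not reviewed")
    return blockers


prototype = ToolRecord(name="refund-dashboard", data_sources=["crm"])
print(deployment_blockers(prototype))
```

For the prototype above, the gate reports a missing owner, access controls, audit trail, and review: the same gaps the ungoverned scenario describes.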

Governance will define the next era of SaaS

The build vs. buy shift is already underway. The more important question now is who controls the environment where that building happens.

Ungoverned building invites security risks and makes the ROI case difficult to close. You can’t measure time saved by tools that IT doesn’t know exist, or that run only in one individual’s workflow. You can’t enforce access controls on a prototype that someone connected to production data last Tuesday. And those aren’t hypothetical risks: in Deloitte’s 2026 State of AI in the Enterprise survey of 3,200+ leaders, data privacy and security ranked as the top AI concern at 73%, followed by governance capabilities at 46%. The 35% of organizations with no AI productivity metrics are missing more than just a dashboard. They’re missing the accountability infrastructure that justifies building over buying in the first place.

The organizations that treat governed environments as a prerequisite for building at scale will be the ones that can actually prove it’s working. The ones that don’t will find out when something breaks.

For a closer look at the data, including how enterprises are approaching AI-assisted building, read the full 2026 Build vs. Buy Shift Report.

[1] The cost of which can be steep: IBM’s 2025 Cost of a Data Breach Report found that AI-associated cases cost organizations more than $650,000 per breach.


David Hsu is CEO at Retool.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

AI-RAN is redefining enterprise edge intelligence and autonomy

Presented by Booz Allen


AI-RAN, or artificial intelligence radio access networks, is a reimagining of what wireless infrastructure can do. Rather than treating the network as a passive conduit for data, AI-RAN turns it into an active computational layer. It’s a sensor, a compute fabric, and a control plane for physical operations, all rolled into one. That shift has huge implications for industries from manufacturing and logistics to healthcare and smart infrastructure.

VentureBeat spoke with two leaders at the center of this transformation: Chris Christou, senior vice president at Booz Allen, and Shervin Gerami, managing director at Cerberus Operations Supply Chain Fund.

“AI-RAN can bring the promise of extending 5G and eventually 6G networks into the enterprise,” Christou said. “Proving that a platform can host inference at the edge to enable new types of AI — in particular, physical AI and autonomy-type use cases for things like smart manufacturing and smart warehousing — can make operations more efficient and effective.”

“AI-RAN lets enterprises move from digitizing processes to autonomously operating them,” Gerami added. “The enterprise investment should not look at AI-RAN as a networking upgrade. It’s an operating system for physical industries.”

The difference between AI for RAN, AI on RAN, and AI and RAN

The difference between AI for RAN, AI on RAN, and AI and RAN is critical. AI for RAN applies AI to the network itself, improving RAN performance and efficiency. AI on RAN runs enterprise AI workloads on edge compute infrastructure integrated with the RAN, enabling real-time applications like computer vision, robotics, and localized LLM inference.

AI and RAN represents the deeper convergence — where networks are designed to be AI-native, with AI workloads and radio infrastructure architected together as a coordinated, distributed system. At this stage, RAN evolves from a transport layer into a foundational layer of the AI economy.

“This is the transformational part,” Gerami said. “It’s jointly designing applications with networks. Now the application knows the network state, and the network understands the application’s intent. AI for RAN saves money. AI on RAN adds capability. Then AI and RAN together create entirely new business models.”

It’s this layered framework that makes AI-RAN more than an incremental evolution of existing wireless technology, and instead a platform shift that opens the network to the kind of developer ecosystem and application innovation that has historically been the domain of cloud computing.

How ISAC turns the network into a sensor

Integrated sensing and communications (ISAC) is the center of the infrastructure. The network becomes the sensor, a converged infrastructure simultaneously communicating and sensing its environment at the same time it hosts algorithms and applications at the edge. It will enable drone detection, pedestrian safety, and automotive sensing, and eventually even more innovative use cases.

The enterprise value proposition of ISAC and a network as the sensor is clear, Gerami says. Today, organizations rely on multiple discrete systems to achieve situational awareness: cameras, radar, asset trackers, motion sensors and more. Each comes with its own maintenance burden, integration overhead and vendor relationship. ISAC has the potential to handle many of those capabilities natively within the network.

“With ISAC you can do asset tracking at sub-meter precision inside factories and hospitals,” he explained. “You can detect movement patterns, perimeter breaches, and anomalies. Smart buildings can have occupancy-aware HVAC and energy optimization.”

How AI-RAN shaves milliseconds off edge AI and inference

With AI-RAN, edge AI and low-latency inference become supercharged in use cases like real-time robotics management, instant quality inspection, and predictive maintenance. These are the applications where the latency gap between cloud and edge is the difference between a system that works and one that doesn’t.

“Where edge AI kicks in is driving operations in milliseconds, not seconds, which is what cloud does,” Gerami explained.

Split inference can also change the game, Christou says.

“You have a lot of different use cases where the processing is done on the device, making that device more expensive and more power-hungry,” he said. “Now there’s the possibility of offloading that to a local AI-RAN stack, even getting into concepts like split inference, so you do some of the inference on the device, some on the edge AI-RAN stack, and some in the cloud, all appropriate to the use cases and the time scale of the processing required.”
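As a rough illustration of the tiering Christou describes, the sketch below assigns each stage of an inference pipeline to the device, a local AI-RAN edge stack, or the cloud according to its latency budget. The tier latencies, stage names, and budgets are hypothetical placeholders, not measurements from any deployment.

```python
from dataclasses import dataclass

# Hypothetical round-trip latencies per tier, in milliseconds (illustrative only).
TIER_LATENCY_MS = {"device": 1.0, "edge": 10.0, "cloud": 100.0}
# Try the most remote tier first so the device can stay cheap and low-power.
PREFERENCE = ["cloud", "edge", "device"]


@dataclass
class Stage:
    name: str
    latency_budget_ms: float  # how quickly this stage must respond


def place(stage: Stage) -> str:
    """Return the most remote tier whose round trip fits the stage's budget."""
    for tier in PREFERENCE:
        if TIER_LATENCY_MS[tier] <= stage.latency_budget_ms:
            return tier
    raise ValueError(f"no tier can meet {stage.latency_budget_ms} ms for {stage.name}")


# Hypothetical pipeline for a warehouse robot.
pipeline = [
    Stage("collision_avoidance", 5.0),
    Stage("defect_classification", 50.0),
    Stage("fleet_trend_analysis", 5000.0),
]
placement = {stage.name: place(stage) for stage in pipeline}
print(placement)
```

Under these assumed numbers, only the tightest control loop stays on the device, while slower stages move outward to the edge stack and the cloud.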

Why the timing of AI-RAN investment is critical now

AI-RAN investment has a narrow and strategically critical window, both Gerami and Christou said.

“5G infrastructure is already being deployed, almost getting to a point of completion. 6G standards are not yet locked in,” Gerami explained. “This is an architectural moment for AI-RAN to come in. It allows the ability to not make RAN become a telco-centric design only. It allows the enterprise to become the co-creator of the application, the revenue and value generator of that network infrastructure.”

Historically, enterprise IT has consumed wireless standards rather than shaped them. AI-RAN’s open architecture, built on software-defined, cloud-native, containerized components, changes that standardization dynamic.

“Previously in the wireless industry it was a very long cycle. Now we’re seeing a push to get it implemented, get it out there, get early pilots, and then we’ll see how the technology works,” Christou said. “Simultaneously, in parallel, you can start defining the standards. You have real-life implementation experience to help influence how those standards take shape.”

And the entry point is accessible, Gerami added.

“The barrier to entry is very low,” he said. “Right now, it’s all code-based, all software. It’s no different than downloading software. You get yourself an Nvidia box and you can deploy it with a radio.”

Why AI-RAN is the future of innovative AI use cases

“We see AI-RAN as being an open architecture that’s truly driving innovation,” Gerami said. “It’s a flywheel for innovation. We want to create everything to be microservices, open native, cloud native, to allow partners to build vertical AI apps. The future is about owning that physical inference.”

“There’s so much focus right now in the industry around how we can adopt AI effectively — how it will enable autonomy and robotics,” Christou said. “This is one of those foundational pieces that can help realize some of those use cases.”



Claude, OpenClaw and the new reality: AI agents are here — and so is the chaos

The age of agentic AI is upon us — whether we like it or not. What started with an innocent question-answer banter with ChatGPT back in 2022 has become an existential debate on job security and the rise of the machines.

More recently, fears of reaching artificial general intelligence (AGI) have become more real with the advent of powerful autonomous agents like Claude Cowork and OpenClaw. Having played with these tools for some time, I offer a comparison here.

First, we have OpenClaw (formerly known as Moltbot and Clawdbot). Surpassing 150,000 GitHub stars in days, OpenClaw is already being deployed on local machines with deep system access. This is like a robot “maid” (Irona for Richie Rich fans, for instance) that you give the keys to your house. It’s supposed to clean it, and you give it the necessary autonomy to take actions and manage your belongings (files and data) as it pleases. The whole purpose is to perform the task at hand — inbox triaging, auto-replies, content curation, travel planning, and more.

Next we have Google’s Antigravity, a coding agent with an IDE that accelerates the path from prompt to production. You can interactively create complete application projects and modify specific details over individual prompts. This is like having a junior developer who can not only code, but build, test, integrate, and fix issues. In the real world, this is like hiring an electrician: They are really good at a specific job and you only need to give them access to a specific item (your electric junction box).

Finally, we have the mighty Claude. The release of Anthropic’s Cowork, which featured AI agents for automating legal tasks like contract review and NDA triage, caused a sharp sell-off in legal-tech and software-as-a-service (SaaS) stocks (referred to as the SaaSpocalypse). Claude was already a go-to chatbot; now with Cowork, it has domain knowledge for specific industries like legal and finance. This is like hiring an accountant. They know the domain inside out and can complete taxes and manage invoices. Users provide specific access to highly sensitive financial details.

Making these tools work for you

The key to making these tools more impactful is giving them more power, but that increases the risk of misuse. Users must trust providers like Anthropic and Google to ensure that agent prompts will not cause harm, leak data, or provide an unfair (illegal) advantage to certain vendors. OpenClaw is open-source, which complicates things, as there is no central governing authority.

While these technological advancements are amazing and meant for the greater good, all it takes is one or two adverse events to cause panic. Imagine the agentic electrician frying all your house circuits by connecting the wrong wire. In an agent scenario, this could be injecting incorrect code, breaking down a bigger system or adding hidden flaws that may not be immediately evident. Cowork could miss major saving opportunities when doing a user’s taxes; on the flip side, it could include illegal writeoffs. Claude can do unimaginable damage when it has more control and authority.

But in the middle of this chaos, there is an opportunity to really take advantage. With the right guardrails in place, agents can focus on specific actions and avoid making random, unaccounted-for decisions. Principles of responsible AI — accountability, transparency, reproducibility, security, privacy — are extremely important. Logging agent steps and human confirmation are absolutely critical.
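To make the last point concrete, here is a minimal sketch of step logging combined with a human confirmation gate. It is not how Cowork, Antigravity, or OpenClaw work internally; the function names, the in-memory log, and the stand-in approver are all assumptions made for illustration.

```python
import json
import time
from typing import Callable

audit_log: list[dict] = []  # assumption: a real system would use an append-only store


def run_step(agent: str, action: str, risky: bool,
             execute: Callable[[], str],
             confirm: Callable[[str], bool]) -> str:
    """Record every step, and require approval before any risky action runs."""
    entry = {"ts": time.time(), "agent": agent, "action": action, "risky": risky}
    if risky and not confirm(action):
        entry["status"] = "rejected"  # the action is logged but never executed
        audit_log.append(entry)
        return "rejected"
    entry["result"] = execute()
    entry["status"] = "done"
    audit_log.append(entry)
    return "done"


def deny_deletes(action: str) -> bool:
    """Stand-in approver; a real deployment would prompt a person instead."""
    return "delete" not in action


run_step("inbox-agent", "draft reply", False, lambda: "draft saved", deny_deletes)
run_step("inbox-agent", "delete 400 emails", True, lambda: "deleted", deny_deletes)
print(json.dumps([{k: e[k] for k in ("action", "status")} for e in audit_log]))
```

The design choice worth noting is that rejected actions are still written to the log, so the record shows what the agent attempted as well as what it did.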

Also, when agents deal with so many diverse systems, it’s important they speak the same language. Ontology becomes very important so that events can be tracked, monitored, and accounted for. A shared domain-specific ontology can define a “code of conduct.” These ethics can help control the chaos. When tied together with a shared trust and distributed identity framework, we can build systems that enable agents to do truly useful work.

When done right, an agentic ecosystem can greatly offload the human “cognitive load” and enable our workforce to perform high-value tasks. Humans will benefit when agents handle the mundane.

Dattaraj Rao is innovation and R&D architect at Persistent Systems.

Nvidia-backed ThinkLabs AI raises $28 million to tackle a growing power grid crunch

ThinkLabs AI, a startup building artificial intelligence models that simulate the behavior of the electric grid, announced today that it has closed a $28 million Series A financing round led by Energy Impact Partners (EIP), one of the largest energy transition investment firms in the world. Nvidia’s venture capital arm NVentures and Edison International, the parent company of Southern California Edison, also participated in the round.

The funding marks a significant escalation in the race to apply AI not just to software and content generation, but to the physical infrastructure that powers modern life. While most AI investment headlines have centered on large language models and generative tools, ThinkLabs is pursuing a different and arguably more consequential application: using physics-informed AI to model the behavior of electrical grids in real time, compressing engineering studies that once took weeks or months into minutes.

“We are dead focused on the grid,” ThinkLabs CEO Josh Wong told VentureBeat in an exclusive interview ahead of the announcement. “We do AI models to model the grid, specifically transmission and distribution power flow related modeling. We can calculate things like interconnection of large loads — like data centers or electric vehicle charging — and understand the impact they have on the grid.”

The round drew participation from a deep bench of returning investors, including GE Vernova, Powerhouse Ventures, Active Impact Investments, Blackhorn Ventures, and Amplify Capital, along with an unnamed large North American investor-owned utility. The company initially set out to raise less than $28 million, according to Wong, but strong demand from strategic partners pushed the round higher.

“This was way oversubscribed,” Wong said. “We attracted the right ecosystem partners and the right capital partners to grow with, and that’s how we ended up at $28 million.”

Why surging electricity demand is breaking the grid’s legacy planning tools

The timing of the raise is no coincidence. U.S. electricity demand is projected to grow 25% by 2030, according to consultancy ICF International, driven largely by AI data centers, electrified transportation, and the broader push toward building and vehicle electrification. That surge is crashing into a grid that was engineered decades ago for a fundamentally different set of demands — and utilities are scrambling to keep up.

The core problem is one of computational capacity. When a utility needs to understand what will happen to its grid if a large data center connects to a particular substation, or if a cluster of EV chargers goes live in a residential neighborhood, engineers must run power flow simulations — complex calculations that model how electricity moves through the network. Those studies have traditionally relied on legacy software tools from companies like Siemens, GE, and Schneider Electric, and they can take weeks or months to complete for a single scenario.

ThinkLabs’ approach replaces that bottleneck with physics-informed AI models that learn from the same engineering simulators but can then run orders of magnitude faster. According to the company, its platform can compress a month-long grid study into under three minutes and run 10 million scenarios in 10 minutes, while maintaining greater than 99.7% accuracy on grid power flow calculations.

Wong draws a sharp distinction between what ThinkLabs does and the generative AI models that dominate public discourse. “We’re not hallucinating the heck out of things,” he said. “We are talking about engineering calculations here. I would really compare this to a computation of fluid dynamics, or like F1 cars, or aerospace, or climate models. We do have a source of truth from existing physics-based engineering models.”

That source of truth is crucial. ThinkLabs trains its AI on the outputs of first-principles physics simulators — the same tools utilities already trust — and then validates its models against those simulators. The result, Wong argues, is an AI system that is not only fast but fully explainable and auditable, a critical requirement in an industry where a miscalculation can cause blackouts or damage physical infrastructure.

How ThinkLabs’ three-phase power flow analysis differs from every other grid AI startup

The competitive landscape for AI in grid management has grown crowded over the past two years, with startups and incumbents alike racing to apply machine learning to utility workflows. But Wong contends that ThinkLabs occupies a fundamentally different position from most of its competitors.

“As far as we know, we’re the only ones actually doing AI-native grid simulation analysis,” he said. “Others might be using AI for forecasting, load disaggregation, or local energy management, but fundamentally, they’re not calculating a power flow.”

What ThinkLabs performs is a full three-phase AC power flow analysis — examining every node and bus on the electric grid to determine real and reactive power levels, line flows, and voltages. This is the same type of analysis that utility engineers perform today using legacy tools, but ThinkLabs can deliver it at a speed and scale that those tools simply cannot match.
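For readers unfamiliar with the term, the sketch below shows the kind of calculation a power flow study performs, reduced to its smallest form: one load bus fed from a source over a single line, solved in per-unit with a fixed-point iteration. It is a single-phase textbook simplification with made-up values, not ThinkLabs’ method or a three-phase model.

```python
def solve_two_bus(load_pu: complex, z_line: complex, v_slack: complex = 1 + 0j,
                  tol: float = 1e-10, max_iter: int = 100):
    """Solve for the load-bus voltage given a constant P + jQ load (per-unit).

    Returns (load bus voltage, line current, iterations used).
    """
    v = v_slack  # "flat start": assume the load bus sits at the source voltage
    for iteration in range(1, max_iter + 1):
        current = (load_pu / v).conjugate()   # I = conj(S / V)
        v_new = v_slack - z_line * current    # subtract the drop across the line
        if abs(v_new - v) < tol:
            return v_new, current, iteration
        v = v_new
    raise RuntimeError("power flow did not converge")


# Illustrative values only: a 0.5 + j0.2 pu load behind a 0.02 + j0.06 pu line.
load = 0.5 + 0.2j
z = 0.02 + 0.06j
v_load, i_line, iterations = solve_two_bus(load, z)

s_sent = (1 + 0j) * i_line.conjugate()      # real and reactive power leaving the source
s_delivered = v_load * i_line.conjugate()   # power arriving at the load
s_loss = s_sent - s_delivered               # line losses

print(f"|V_load| = {abs(v_load):.4f} pu after {iterations} iterations")
print(f"sent = {s_sent:.4f} pu, losses = {s_loss:.4f} pu")
```

A utility-scale study repeats this style of calculation for every bus and phase in the network, which is where the computational cost comes from.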

The distinction matters because utilities make capital investment decisions — worth billions of dollars — based on exactly these types of studies. If a power flow analysis shows that a proposed data center connection will overload a transmission line, the utility may need to build new infrastructure at enormous cost. But if the analysis can also suggest alternative solutions — battery storage placement, load flexibility scheduling, or topology optimization — the utility can potentially avoid or defer those capital expenditures.

“With many utilities, existing tools will basically show them all the problems, but they can only address solutions by trial and error,” Wong explained. “With AI, we can use reinforcement learning to generate more creative solutions, but also very effectively weigh the pros and cons of each of these solutions.”

Inside ThinkLabs’ strategic relationships with Nvidia, Edison, and Microsoft

The presence of NVentures in the round — Nvidia’s venture arm does not write many checks — signals a deeper strategic relationship that extends well beyond capital. Wong confirmed that ThinkLabs works extensively within the Nvidia ecosystem on the energy and utility side, leveraging CUDA for GPU-accelerated computation and integrating Nvidia’s Earth-2 climate simulation platform into ThinkLabs’ probabilistic forecasting and risk-adjusted analysis pipelines.

“We are what one utility mentioned as the only high-intensity GPU workload for the OT side — the operational technology side — that’s planning and operations,” Wong said. He added that ThinkLabs is also in discussions with Nvidia’s Omniverse team about additional utility use cases, though those efforts are still early.

Edison International’s participation carries a different kind of strategic weight. In January 2026, ThinkLabs publicly announced results from a collaboration with Southern California Edison (SCE), Edison International’s utility subsidiary, that demonstrated the real-world capabilities of its platform. As the Los Angeles Times reported at the time, the collaboration showed that ThinkLabs’ AI could train in minutes per circuit, process a full year of hourly power-flow data in under three minutes across more than 100 circuits, and produce engineering reports with bridging-solution recommendations in under 90 seconds — work that previously required dedicated engineers an average of 30 to 35 days.

In today’s announcement, Edison International’s Sergej Mahnovski, Managing Director of Strategy, Technology and Innovation, reinforced that urgency: “We must rapidly transition from legacy planning tools and processes to meet the growing demands on the electric grid — new AI-native solutions are needed to transform our capabilities.”

ThinkLabs also works closely with Microsoft, which hosted a webinar in mid-2025 featuring Wong alongside representatives from Southern Company, EPRI, and Microsoft’s own energy team. The SCE collaboration was built on Microsoft Azure AI Foundry, situating ThinkLabs within the cloud infrastructure that many large utilities already use.

The 20-year career path that led from Toronto Hydro to an autonomous grid startup

Wong’s biography reads like a deliberate preparation for this exact moment. He has spent more than 20 years in the utility industry, starting his career at Toronto Hydro before founding Opus One Solutions in 2012 — a smart-grid software company that he grew to over 100 employees serving customers across eight countries before selling it to GE in 2022, as previously reported by BetaKit.

After the acquisition, Wong joined what became GE Vernova and was asked to develop the company’s “grid of the future” roadmap. The thesis he developed there — that the grid is the central bottleneck to economic growth, electrification, and national security, and that autonomous grid orchestration powered by AI is the solution — became the intellectual foundation for ThinkLabs.

“I was pulling together the thesis that we need to electrify, but the grid is really at the center of attention,” Wong said. “The conclusion is we need to drive towards greater autonomy. We talk a lot about autonomous cars, but I would argue that autonomous grids is the much more pressing priority.”

ThinkLabs was incubated inside GE Vernova and spun out as an independent company in April 2024, coinciding with a $5 million seed round co-led by Powerhouse Ventures and Active Impact Investments, as reported by GlobeNewswire at the time. GE Vernova remains a shareholder and strategic partner. Wong is the sole founder.

The team composition reflects the company’s dual identity. “Half of our team are power system PhDs, but the other half are the AI folks — people who have been looking at hyper-scalable AI infrastructure platforms and MLOps for other industries,” Wong said. “We have really been blending the two.”

How ThinkLabs doubled its utility customer base in a single quarter

Utilities are famously among the most conservative technology buyers in the world, with procurement cycles that can stretch years and layers of regulatory oversight that slow adoption. Wong acknowledges this reality but says the landscape is shifting faster than many observers realize.

“I have noticed sales cycles really accelerating,” he said. “It’s still long and depends on which utility and how big the deal is, but we have been witnessing firsthand sales cycles going from the traditional one to two years to a shortest two to three months.”

On the commercial side, Wong declined to share specific revenue figures but offered several data points that suggest meaningful traction. ThinkLabs is working with more than 10 utilities on AI-native grid simulation for planning and operations, he said, and the company doubled its customer accounts in the first quarter of 2026 alone.

“So not one or two, but we’re working with 10-plus utilities,” Wong said. “Things have really picked up pace even before this A round.”

The company primarily targets investor-owned utilities and system operators — the organizations that own and operate the grid — though Wong noted that AI is also beginning to democratize grid simulation capabilities for smaller utilities that previously lacked the engineering resources to run sophisticated analyses.

Wong said the funds will go primarily toward advancing the product to enterprise grade and expanding the range of use cases the platform supports. The company sees a significant land-and-expand opportunity within individual utility accounts, moving from modeling a small region to training AI models across entire states or multi-state territories within a single customer.

EIP’s involvement as lead investor carries particular significance in this market. The firm is backed by more than half of North America’s investor-owned utilities, giving ThinkLabs a direct line into the executive suites of the customers it is trying to reach. “Utilities are being asked to add capacity on timelines the industry has never seen before, and the stakes extend far beyond the energy sector,” Sameer Reddy, Managing Partner at EIP, said in the press release.

What a 99.7% accuracy rate actually means for critical grid infrastructure

Any conversation about applying AI to critical infrastructure inevitably confronts the question of failure modes. A hallucination in a chatbot is an embarrassment; a miscalculation in a grid power flow analysis could contribute to equipment damage or widespread outages.

Wong addressed this head-on. The 99.7% accuracy figure, he explained, is an average across large-volume planning studies — specifically 8,760-hour analyses (every hour of the year) projected across three to 10 years with multiple sensitivity scenarios. For planning purposes, he argued, this level of accuracy is not only sufficient but may actually exceed what traditional methods deliver in practice.
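The scale of those studies is easy to underestimate. The arithmetic below counts the hourly power flow solves in a study of the shape Wong describes; the figure of five sensitivity scenarios is an assumption chosen for illustration, since the article does not give one.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hourly snapshots in a non-leap year


def power_flow_runs(years: int, scenarios: int) -> int:
    """Number of hourly power flow solves in a multi-year, multi-scenario study."""
    return HOURS_PER_YEAR * years * scenarios


# Assumed: 5 sensitivity scenarios, at both ends of the 3-to-10-year range.
low = power_flow_runs(years=3, scenarios=5)
high = power_flow_runs(years=10, scenarios=5)
print(f"{low:,} to {high:,} solves per circuit")
```

An accuracy figure averaged over that many solves describes aggregate behavior, which is why it suits planning studies better than any single real-time decision.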

“If you look at a source of truth, the data quality is actually the biggest limiting factor, not the accuracy of these AI models,” he said. “When we bring in traditional engineering analysis and actually snap it with telemetry — metering data, SCADA data — I would actually argue AI is far more accurate because it is data driven on actual measurements, rather than hypothetical planning analysis based on scenarios.”

For more critical real-time applications, ThinkLabs deploys what Wong called “hybrid models” that blend AI computation with traditional physics-based simulation. In the most stringent use cases, the AI handles roughly 99% of the computational workload before handing off to a physics-based engine for final validation — a technique Wong described as using AI to “warm start” the simulation.
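The warm-start idea can be shown on a toy problem. In the sketch below, an iterative physics solver is run twice on a two-bus, per-unit system: once from a generic starting point and once from a guess standing in for an AI model’s prediction. The solver, the network values, and the hard-coded `surrogate_guess` are all assumptions for illustration, not ThinkLabs’ implementation.

```python
def fixed_point_power_flow(load_pu: complex, z_line: complex, v_start: complex,
                           v_slack: complex = 1 + 0j, tol: float = 1e-10,
                           max_iter: int = 200):
    """Iterate V = V_slack - Z * conj(S / V) until the voltage stops changing.

    Returns (converged load-bus voltage, iterations used).
    """
    v = v_start
    for iteration in range(1, max_iter + 1):
        v_new = v_slack - z_line * (load_pu / v).conjugate()
        if abs(v_new - v) < tol:
            return v_new, iteration
        v = v_new
    raise RuntimeError("power flow did not converge")


# Illustrative heavily loaded line (per-unit values are made up).
load, z = 1.0 + 0.4j, 0.05 + 0.15j

# Hypothetical stand-in for an AI model's voltage prediction.
surrogate_guess = 0.85 - 0.13j

v_cold, cold_iters = fixed_point_power_flow(load, z, v_start=1 + 0j)
v_warm, warm_iters = fixed_point_power_flow(load, z, v_start=surrogate_guess)

print(f"cold start: {cold_iters} iterations, warm start: {warm_iters} iterations")
print(f"solutions agree to within {abs(v_cold - v_warm):.1e} pu")
```

Both runs end at the same physics-validated answer. The prediction only shortens the path to it, which is what makes the approach attractive where the final result must come from a trusted solver.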

The company also monitors for model drift and maintains strict training boundaries. “We’re not like ChatGPT training the internet here,” Wong said. “We’re training on the possibility of grid conditions. And if we do see a condition where we did not train, or outside of our training boundary, we can always run on-demand training on those certain solution spaces.”
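One simple way to picture a training boundary is as a per-feature range check performed before the model is trusted. The sketch below uses that simplification; the feature names, ranges, and the `route` helper are hypothetical, and a production system would likely use a richer notion of boundary than a box.

```python
from dataclasses import dataclass


@dataclass
class TrainingEnvelope:
    """Per-feature (min, max) ranges covered by the training data."""
    bounds: dict[str, tuple[float, float]]

    def violations(self, condition: dict[str, float]) -> list[str]:
        """Names of features that are missing or fall outside the trained range."""
        outside = []
        for name, (low, high) in self.bounds.items():
            value = condition.get(name)
            if value is None or not (low <= value <= high):
                outside.append(name)
        return outside


# Hypothetical envelope for one circuit.
envelope = TrainingEnvelope({"load_mw": (5.0, 120.0), "pv_output_mw": (0.0, 40.0)})


def route(condition: dict[str, float]) -> str:
    """Use the model inside the envelope; otherwise flag for on-demand training."""
    return "retrain" if envelope.violations(condition) else "infer"


print(route({"load_mw": 80.0, "pv_output_mw": 10.0}))
print(route({"load_mw": 150.0, "pv_output_mw": 10.0}))
```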

Why ThinkLabs says its value proposition survives even if the data center boom slows down

The bullish case for ThinkLabs — and for grid-focused AI more broadly — rests heavily on the assumption that electricity demand will surge dramatically over the coming decade. But some analysts have begun questioning whether those projections are inflated, particularly if AI investment cycles cool and data center build-outs decelerate.

Wong argued that his company’s value proposition is resilient to that scenario. Even without dramatic load growth, he said, utilities face a fundamental modernization challenge. They have been using tools and processes from the 1990s and 2000s, and the workforce that knows how to operate those tools is retiring at an alarming rate.

“Workforce renewal is a big factor,” he said. “These AI tools not only modernize the tool itself, but also modernize culture and transformation and become major points of retention for the next generation.”

He also pointed to energy affordability as a driver that exists independent of load growth projections. If utilities continue to plan based on worst-case deterministic scenarios — building enough infrastructure to cover every conceivable contingency — consumer rates will become unmanageable. AI-powered probabilistic analysis, Wong argued, allows utilities to make smarter, more cost-effective decisions regardless of whether the most aggressive demand forecasts materialize.

“A large part of this AI is not only enabling workload, but how do we act with intelligence — going from worst-case to time-series analysis, from deterministic to probabilistic and stochastic analysis, and also coming up with solutions,” he said.
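The gap between worst-case and probabilistic planning can be sketched with a small simulation. The example below compares sizing for the sum of every load’s peak against sizing for the 99th percentile of sampled coincident demand. The load ranges are invented, and the assumption that loads vary independently and uniformly is a deliberate oversimplification.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical loads on one substation: (minimum MW, maximum MW).
loads = [(2.0, 10.0), (1.0, 8.0), (0.5, 6.0), (3.0, 12.0)]

# Deterministic planning: assume every load peaks at the same moment.
worst_case_mw = sum(high for _, high in loads)

# Probabilistic planning: sample many coincident conditions
# (assumed independent and uniform) and size for the 99th percentile.
totals = sorted(
    sum(rng.uniform(low, high) for low, high in loads) for _ in range(10_000)
)
p99_mw = totals[int(0.99 * len(totals)) - 1]

print(f"worst case: {worst_case_mw:.1f} MW, 99th percentile: {p99_mw:.1f} MW")
```

The percentile can never exceed the worst case, and the difference between the two is capacity that deterministic planning builds but rarely uses.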

Wong frames the broader opportunity with an analogy that captures both the simplicity and the ambition of what ThinkLabs is attempting. For decades, he said, the utility industry’s default response to grid constraints has been the equivalent of building wider highways — more wires, more copper, more steel. ThinkLabs wants to be the navigation system that reroutes traffic instead.

“In the past, when we drive, we always drive with what we are familiar with — just the big roads,” he said. “But with AI, we can optimize the traffic patterns to drive on much more effective routes. In this case, it might be a mix of wires, flexibility, batteries, and operational decisions.”

Whether ThinkLabs can deliver on that vision at the scale the grid demands remains an open question. But Wong, who has spent two decades building and selling grid software companies, is not thinking in terms of incremental improvement. He sees a narrow window — measured in years, not decades — during which the foundational AI infrastructure for the grid will be built, and whoever builds it will shape the energy system for a generation.

“I truly believe the next two years of AI development for the grid will dictate the next decades of what can happen to the grid,” Wong said. “It’s really here now.”

The grid, in other words, is getting a copilot. The question is no longer whether utilities will trust AI with their most critical engineering decisions, but how long they can afford not to.