Testing autonomous agents (Or: how I learned to stop worrying and embrace chaos)

Look, we’ve spent the last 18 months building production AI systems, and we’ll tell you what keeps us up at night — and it’s not whether the model can answer questions. That’s table stakes now. What haunts us is the mental image of an agent autonomously approving a six-figure vendor contract at 2 a.m. because someone typo’d a config file.

We’ve moved past the era of “ChatGPT wrappers” (thank God), but the industry still treats autonomous agents like they’re just chatbots with API access. They’re not. When you give an AI system the ability to take actions without human confirmation, you’re crossing a fundamental threshold. You’re not building a helpful assistant anymore — you’re building something closer to an employee. And that changes everything about how we need to engineer these systems.

The autonomy problem nobody talks about

Here’s what’s wild: We’ve gotten really good at making models that *sound* confident. But confidence and reliability aren’t the same thing, and the gap between them is where production systems go to die.

We learned this the hard way during a pilot program where we let an AI agent manage calendar scheduling across executive teams. Seems simple, right? The agent could check availability, send invites, handle conflicts. Except, one Monday morning, it rescheduled a board meeting because it interpreted “let’s push this if we need to” in a Slack message as an actual directive. The model wasn’t wrong in its interpretation — it was plausible. But plausible isn’t good enough when you’re dealing with autonomy.

That incident taught us something crucial: The challenge isn’t building agents that work most of the time. It’s building agents that fail gracefully, know their limitations, and have the circuit breakers to prevent catastrophic mistakes.

What reliability actually means for autonomous systems

[Figure: Layered reliability architecture]

When we talk about reliability in traditional software engineering, we’ve got decades of patterns: Redundancy, retries, idempotency, graceful degradation. But AI agents break a lot of our assumptions.

Traditional software fails in predictable ways. You can write unit tests. You can trace execution paths. With AI agents, you’re dealing with probabilistic systems making judgment calls. A bug isn’t just a logic error—it’s the model hallucinating a plausible-sounding but completely fabricated API endpoint, or misinterpreting context in a way that technically parses but completely misses the human intent.

So what does reliability look like here? In our experience, it’s a layered approach.

Layer 1: Model selection and prompt engineering

This is foundational but insufficient. Yes, use the best model you can afford. Yes, craft your prompts carefully with examples and constraints. But don’t fool yourself into thinking that a great prompt is enough. We’ve seen too many teams ship “GPT-4 with a really good system prompt” and call it enterprise-ready.

Layer 2: Deterministic guardrails

Before the model does anything irreversible, run it through hard checks. Is it trying to access a resource it shouldn’t? Is the action within acceptable parameters? We’re talking old-school validation logic — regex, schema validation, allowlists. It’s not sexy, but it’s effective.

One pattern that’s worked well for us: Maintain a formal action schema. Every action an agent can take has a defined structure, required fields, and validation rules. The agent proposes actions in this schema, and we validate before execution. If validation fails, we don’t just block it — we feed the validation errors back to the agent and let it try again with context about what went wrong.
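A minimal sketch of that propose-validate-retry loop. The action types (`send_email`, `create_event`) and the dict-shaped schema are illustrative placeholders; a real deployment would use a fuller schema language such as JSON Schema or Pydantic models.

```python
# Hypothetical action schemas: action type -> required parameter fields.
ACTION_SCHEMAS = {
    "send_email": ["to", "subject", "body"],
    "create_event": ["title", "start", "attendees"],
}

def validate_action(action: dict) -> list:
    """Return a list of validation errors; an empty list means it may execute."""
    required = ACTION_SCHEMAS.get(action.get("type"))
    if required is None:
        return ["unknown action type: %r" % action.get("type")]
    params = action.get("params", {})
    return ["missing required field: " + f for f in required if f not in params]

def execute_with_feedback(agent_propose, max_attempts: int = 3):
    """Ask the agent to propose an action; on failure, feed the errors back."""
    feedback = None
    for _ in range(max_attempts):
        action = agent_propose(feedback)   # the agent sees prior errors, if any
        errors = validate_action(action)
        if not errors:
            return action                  # passed hard checks: hand to executor
        feedback = errors                  # retry with context about what failed
    return None                            # still invalid: escalate to a human
```

The important design choice is the last branch: validation failure is feedback for the agent, not just a hard stop, until the attempt budget runs out.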

Layer 3: Confidence and uncertainty quantification

Here’s where it gets interesting. We need agents that know what they don’t know. We’ve been experimenting with agents that can explicitly reason about their confidence before taking actions. Not just a probability score, but actual articulated uncertainty: “I’m interpreting this email as a request to delay the project, but the phrasing is ambiguous and could also mean…”

This doesn’t prevent all mistakes, but it creates natural breakpoints where you can inject human oversight. High-confidence actions go through automatically. Medium-confidence actions get flagged for review. Low-confidence actions get blocked with an explanation.
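The routing itself can be trivially simple once the agent emits a confidence signal. A sketch, with placeholder thresholds that would need tuning per action type:

```python
def route_by_confidence(confidence: float,
                        auto_threshold: float = 0.9,
                        review_threshold: float = 0.6) -> str:
    """Map an agent's stated confidence to a handling tier.
    The threshold values are illustrative policy knobs, not recommendations."""
    if confidence >= auto_threshold:
        return "execute"           # high confidence: proceed automatically
    if confidence >= review_threshold:
        return "flag_for_review"   # medium confidence: queue for a human
    return "block"                 # low confidence: refuse, with an explanation
```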

Layer 4: Observability and auditability

[Figure: Action validation pipeline]

If you can’t debug it, you can’t trust it. Every decision the agent makes needs to be loggable, traceable, and explainable. Not just “what action did it take” but “what was it thinking, what data did it consider, what was the reasoning chain?”

We’ve built a custom logging system that captures the full large language model (LLM) interaction — the prompt, the response, the context window, even the model temperature settings. It’s verbose as hell, but when something goes wrong (and it will), you need to be able to reconstruct exactly what happened. Plus, this becomes your dataset for fine-tuning and improvement.
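A sketch of what one such log record might look like, written as JSON Lines so records are append-only and replayable. The field names here are illustrative, not the authors’ actual schema:

```python
import json
import time
import uuid

def log_llm_call(log_path, prompt, response, context, temperature):
    """Append one structured record per LLM interaction (JSON Lines format)."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt": prompt,
        "response": response,
        "context": context,          # contents of the context window
        "temperature": temperature,  # sampling settings matter for replay
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["id"]
```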

Guardrails: The art of saying no

Let’s talk about guardrails, because this is where engineering discipline really matters. A lot of teams approach guardrails as an afterthought — “we’ll add some safety checks if we need them.” That’s backwards. Guardrails should be your starting point.

We think of guardrails in three categories.

Permission boundaries

What is the agent physically allowed to do? This is your blast radius control. Even if the agent hallucinates the worst possible action, what’s the maximum damage it can cause?

We use a principle called “graduated autonomy.” New agents start with read-only access. As they prove reliable, they graduate to low-risk writes (creating calendar events, sending internal messages). High-risk actions (financial transactions, external communications, data deletion) either require explicit human approval or are simply off-limits.

One technique that’s worked well: Action cost budgets. Each agent has a daily “budget” denominated in some unit of risk or cost. Reading a database record costs 1 unit. Sending an email costs 10. Initiating a vendor payment costs 1,000. The agent can operate autonomously until it exhausts its budget; then, it needs human intervention. This creates a natural throttle on potentially problematic behavior.
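Graduated autonomy and the cost budget compose naturally into a single gate in front of the executor. A sketch, with the tier names and the 1/10/1,000 cost units taken from the examples above and the rest made up for illustration:

```python
# Illustrative tiers: each tier's set of permitted action types.
TIER_ALLOWED = {
    "read_only": {"read_record"},
    "low_risk": {"read_record", "create_event", "send_internal_message"},
    "high_risk": {"read_record", "create_event", "send_internal_message",
                  "send_email", "initiate_payment"},
}

# Illustrative risk costs, in arbitrary budget units.
ACTION_COST = {"read_record": 1, "create_event": 5,
               "send_internal_message": 5, "send_email": 10,
               "initiate_payment": 1000}

class BudgetedAgentGate:
    def __init__(self, tier: str, daily_budget: int):
        self.allowed = TIER_ALLOWED[tier]
        self.remaining = daily_budget

    def authorize(self, action: str) -> str:
        if action not in self.allowed:
            return "forbidden"        # outside this tier's blast radius
        if ACTION_COST[action] > self.remaining:
            return "needs_human"      # budget exhausted: escalate
        self.remaining -= ACTION_COST[action]
        return "allowed"
```

Note that the two mechanisms fail differently: a tier violation is always `forbidden`, while an exhausted budget degrades to human approval rather than a hard block.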

[Figure: Graduated autonomy and action cost budgets]

Semantic boundaries

What should the agent understand as in-scope vs out-of-scope? This is trickier because it’s conceptual, not just technical.

We’ve found that explicit domain definitions help a lot. Our customer service agent has a clear mandate: handle product questions, process returns, escalate complaints. Anything outside that domain — someone asking for investment advice, technical support for third-party products, personal favors — gets a polite deflection and escalation.

The challenge is making these boundaries robust to prompt injection and jailbreaking attempts. Users will try to convince the agent to help with out-of-scope requests. Other parts of the system might inadvertently pass instructions that override the agent’s boundaries. You need multiple layers of defense here.

Operational boundaries

How much can the agent do, and how fast? This is your rate limiting and resource control.

We’ve implemented hard limits on everything: API calls per minute, maximum tokens per interaction, maximum cost per day, maximum number of retries before human escalation. These might seem like artificial constraints, but they’re essential for preventing runaway behavior.

We once saw an agent get stuck in a loop trying to resolve a scheduling conflict. It kept proposing times, getting rejections, and trying again. Without rate limits, it sent 300 calendar invites in an hour. With proper operational boundaries, it would’ve hit a threshold and escalated to a human after attempt number 5.
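A sketch of both limits: a sliding-window rate cap plus a retry ceiling that degrades to human escalation. The specific numbers are example values, not prescriptions:

```python
import time
from collections import deque

class OperationalLimiter:
    """Hard operational caps for an agent; limits here are example values."""
    def __init__(self, max_per_minute: int = 30, max_retries: int = 5):
        self.calls = deque()              # timestamps inside the last minute
        self.max_per_minute = max_per_minute
        self.max_retries = max_retries

    def check_rate(self, now=None) -> bool:
        """True if another call is permitted right now; False if throttled."""
        now = time.time() if now is None else now
        while self.calls and now - self.calls[0] > 60:
            self.calls.popleft()          # drop timestamps older than a minute
        if len(self.calls) >= self.max_per_minute:
            return False
        self.calls.append(now)
        return True

    def attempt_or_escalate(self, attempt_number: int) -> str:
        # After max_retries failed attempts, stop looping and hand off.
        return "escalate" if attempt_number > self.max_retries else "retry"
```

With `max_retries=5`, the scheduling loop above would have stopped at invite number six instead of number 300.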

Agents need their own style of testing

Traditional software testing doesn’t cut it for autonomous agents. You can’t just write test cases that cover all the edge cases, because with LLMs, everything is an edge case.

What’s worked for us:

Simulation environments

Build a sandbox that mirrors production but with fake data and mock services. Let the agent run wild. See what breaks. We do this continuously — every code change goes through 100 simulated scenarios before it touches production.

The key is making scenarios realistic. Don’t just test happy paths. Simulate angry customers, ambiguous requests, contradictory information, system outages. Throw in some adversarial examples. If your agent can’t handle a test environment where things go wrong, it definitely can’t handle production.
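A minimal scenario-runner sketch, assuming each scenario pairs an input with an expectation check. `agent_fn` stands in for whatever invokes the real agent against mocked services; a crash counts as a finding rather than aborting the run:

```python
def run_scenarios(agent_fn, scenarios):
    """Run the agent against every scenario; collect failures instead of raising.

    Each scenario is a (name, inputs, check) triple, where check is a
    predicate over the agent's outcome.
    """
    failures = []
    for name, inputs, check in scenarios:
        try:
            outcome = agent_fn(inputs)
        except Exception as exc:          # an unhandled crash is itself a result
            failures.append((name, "raised " + type(exc).__name__))
            continue
        if not check(outcome):
            failures.append((name, "unexpected outcome: %r" % (outcome,)))
    return failures
```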

Red teaming

Get creative people to try to break your agent. Not just security researchers, but domain experts who understand the business logic. Some of our best improvements came from sales team members who tried to “trick” the agent into doing things it shouldn’t.

Shadow mode

Before you go live, run the agent in shadow mode alongside humans. The agent makes decisions, but humans actually execute the actions. You log both the agent’s choices and the human’s choices, and you analyze the delta.

This is painful and slow, but it’s worth it. You’ll find all kinds of subtle misalignments you’d never catch in testing. Maybe the agent technically gets the right answer, but with phrasing that violates company tone guidelines. Maybe it makes legally correct but ethically questionable decisions. Shadow mode surfaces these issues before they become real problems.
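The core of shadow-mode analysis is just a diff between what the agent would have done and what the human actually did. A sketch, assuming decisions are comparable values keyed by case:

```python
def shadow_compare(cases, agent_fn, human_decisions):
    """Return the cases where the agent's proposed choice diverged from
    the human's executed choice. Both sides get logged; only deltas return."""
    deltas = []
    for case in cases:
        agent_choice = agent_fn(case)          # proposed, never executed
        human_choice = human_decisions[case]   # what actually happened
        if agent_choice != human_choice:
            deltas.append((case, agent_choice, human_choice))
    return deltas
```

In practice the interesting work starts after this function returns: categorizing the deltas into tone violations, judgment gaps, and outright errors.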

The human-in-the-loop pattern

[Figure: Three human-in-the-loop patterns]

Despite all the automation, humans remain essential. The question is: Where in the loop?

We’re increasingly convinced that “human-in-the-loop” is actually several distinct patterns:

Human-on-the-loop: The agent operates autonomously, but humans monitor dashboards and can intervene. This is your steady-state for well-understood, low-risk operations.

Human-in-the-loop: The agent proposes actions, humans approve them. This is your training wheels mode while the agent proves itself, and your permanent mode for high-risk operations.

Human-with-the-loop: Agent and human collaborate in real-time, each handling the parts they’re better at. The agent does the grunt work, the human does the judgment calls.
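One way to keep those three patterns consistent is to make the mode selection itself an explicit, testable function. A sketch in which the policy mapping is illustrative, not prescribed:

```python
def oversight_mode(risk: str, agent_is_proven: bool) -> str:
    """Pick an oversight pattern from operation risk and agent track record.
    The mapping here is an example policy; real ones are per-organization."""
    if risk == "high":
        return "human_in_the_loop"    # every action needs explicit approval
    if risk == "low" and agent_is_proven:
        return "human_on_the_loop"    # autonomous, monitored via dashboards
    return "human_with_the_loop"      # real-time collaboration on the rest
```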

The trick is making these transitions smooth. An agent shouldn’t feel like a completely different system when you move from autonomous to supervised mode. Interfaces, logging, and escalation paths should all be consistent.

Failure modes and recovery

Let’s be honest: Your agent will fail. The question is whether it fails gracefully or catastrophically.

We classify failures into three categories:

Recoverable errors: The agent tries to do something, it doesn’t work, the agent realizes it didn’t work and tries something else. This is fine. This is how complex systems operate. As long as the agent isn’t making things worse, let it retry with exponential backoff.
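The standard retry-with-backoff pattern applies directly here. A sketch with an injectable sleep function so tests and simulations can skip the real waiting:

```python
import time

def retry_with_backoff(operation, max_attempts=4, base_delay=0.5,
                       sleep=time.sleep):
    """Retry a recoverable operation, doubling the delay between attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                    # out of attempts: surface the failure
            sleep(base_delay * (2 ** attempt))   # 0.5s, 1s, 2s, ...
```

The caveat in the text matters: this is only appropriate when a failed attempt leaves the world unchanged, which is why idempotent actions are so valuable in agent design.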

Detectable failures: The agent does something wrong, but monitoring systems catch it before significant damage occurs. This is where your guardrails and observability pay off. The agent gets rolled back, humans investigate, you patch the issue.

Undetectable failures: The agent does something wrong, and nobody notices until much later. These are the scary ones. Maybe it’s been misinterpreting customer requests for weeks. Maybe it’s been making subtly incorrect data entries. These accumulate into systemic issues.

The defense against undetectable failures is regular auditing. We randomly sample agent actions and have humans review them. Not just pass/fail, but detailed analysis. Is the agent showing any drift in behavior? Are there patterns in its mistakes? Is it developing any concerning tendencies?
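The sampling step is straightforward; the human review is the expensive part. A sketch, with a seedable generator so audit selections are reproducible (the default rate is illustrative, not a recommendation):

```python
import random

def sample_for_audit(action_log, rate=0.02, seed=None):
    """Randomly select a fraction of logged agent actions for human review."""
    rng = random.Random(seed)   # seedable, so audit batches are reproducible
    return [action for action in action_log if rng.random() < rate]
```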

The cost-performance tradeoff

Here’s something nobody talks about enough: reliability is expensive.

Every guardrail adds latency. Every validation step costs compute. Multiple model calls for confidence checking multiply your API costs. Comprehensive logging generates massive data volumes.

You have to be strategic about where you invest. Not every agent needs the same level of reliability. A marketing copy generator can be looser than a financial transaction processor. A scheduling assistant can retry more liberally than a code deployment system.

We use a risk-based approach. High-risk agents get all the safeguards, multiple validation layers, extensive monitoring. Lower-risk agents get lighter-weight protections. The key is being explicit about these trade-offs and documenting why each agent has the guardrails it does.

Organizational challenges

We’d be remiss if we didn’t mention that the hardest parts aren’t technical — they’re organizational.

Who owns the agent when it makes a mistake? Is it the engineering team that built it? The business unit that deployed it? The person who was supposed to be supervising it?

How do you handle edge cases where the agent’s logic is technically correct but contextually inappropriate? If the agent follows its rules but violates an unwritten norm, who’s at fault?

What’s your incident response process when an agent goes rogue? Traditional runbooks assume human operators making mistakes. How do you adapt these for autonomous systems?

These questions don’t have universal answers, but they need to be addressed before you deploy. Clear ownership, documented escalation paths, and well-defined success metrics are just as important as the technical architecture.

Where we go from here

The industry is still figuring this out. There’s no established playbook for building reliable autonomous agents. We’re all learning in production, and that’s both exciting and terrifying.

What we know for sure: The teams that succeed will be the ones who treat this as an engineering discipline, not just an AI problem. You need traditional software engineering rigor — testing, monitoring, incident response — combined with new techniques specific to probabilistic systems.

You need to be paranoid but not paralyzed. Yes, autonomous agents can fail in spectacular ways. But with proper guardrails, they can also handle enormous workloads with superhuman consistency. The key is respecting the risks while embracing the possibilities.

We’ll leave you with this: Every time we deploy a new autonomous capability, we run a pre-mortem. We imagine it’s six months from now and the agent has caused a significant incident. What happened? What warning signs did we miss? What guardrails failed?

This exercise has saved us more times than we can count. It forces you to think through failure modes before they occur, to build defenses before you need them, to question assumptions before they bite you.

Because in the end, building enterprise-grade autonomous AI agents isn’t about making systems that work perfectly. It’s about making systems that fail safely, recover gracefully, and learn continuously.

And that’s the kind of engineering that actually matters.

Madhvesh Kumar is a principal engineer. Deepika Singh is a senior software engineer.

Views expressed are based on hands-on experience building and deploying autonomous agents, along with the occasional 3 a.m. incident response that makes you question your career choices.

Why enterprises are replacing generic AI with tools that know their users

The future of AI isn’t just agentic; it’s deep personalization. 

Rather than simple recommender systems that correlate user behavior to identify patterns and apply those to individual workflows, large language models (LLMs) and AI agents can analyze users directly to create deeply personalized experiences. 

It’s this kind of aggressive customization that users are increasingly demanding, and the savviest enterprises that provide it (and soon) will win.

The goal is: “Don’t try to randomize, or guess who I am. I tell you, this is what I care about,” explains Lijuan Qin, head of product at Zoom AI, in a new Beyond the Pilot podcast.

How Zoom is incorporating personalization

Zoom is one company that has adapted to this trend: Its generative assistant, AI Companion, goes beyond basic summarization, smart recordings, and after-meeting action items to opinion divergence and user alignment tracking. 

Users can customize meeting summaries based on their specific interests, and create targeted templates for follow-up emails to different personas (whether it be a salesperson or account executive). The AI assistant can then automatically populate these documents post-call. Meanwhile, a custom dictionary in Zoom AI Studio can process unique enterprise terminology and vocabulary for more relevant AI outputs, and a deep research mode can quickly deliver comprehensive analyses based on “internal expertise and external insights.”

Control is key here; the human can be “very specific [and] nail down” agent permissioning, Qin explained. They have “very clear controls” on follow-up actions, such as: Can the agent automatically send emails to specific recipients? Or will it trigger a verification step when it recognizes transcripts contain sensitive information (as dictated by the user)? 

Knowing that AI can go off the rails at times, human users can track agent behavior in Zoom, enable and disable features, and control data access. This can help prevent outputs that are inaccurate or off-target.  

“The most important thing is we do not assume AI is smart enough to get everything right,” Qin emphasized. 

Getting context right

In this new agentic AI age, there is essentially a “land grab for context,” Sam Witteveen, co-founder of Red Dragon AI and Beyond the Pilot host, explains in the podcast. 

“Definitely knowing your users is the big thing, right? Knowing what apps they are living in, what day-to-day tasks they are constantly doing?” he said. “Companies realize the more they have about you, the better the [AI] memory can get, the better they can customize.”

Claude Cowork is one app that is “really shining” at this, Witteveen says; OpenClaw is another. Models are good enough that they can begin to make decisions for users and respond to directions like: “You know a bunch of things about me. You’ve got all this context. Go and generate the skills that are going to help me do a better job.”

“With something like OpenClaw, you can customize it in any way you want, right? You can chat with it, you can tell it, ‘Hey, at 4 o’clock I want you to do this,’” Witteveen said. 

However, token usage and security must always be taken into account, he advised. OpenClaw has been plagued by security issues since its launch. This has prompted many enterprises to uninstall the autonomous agent or outright ban its use; however, these uninstalls must be done correctly so that IT leaders don’t inadvertently delete their entire enterprise stack. 

Meanwhile, in terms of token budget, personalization can run up costs. “You need to think about the metrics you are tracking,” Witteveen said. “This is very different from product to product, but metrics around these things are gonna be key.”

Watch the podcast to hear more about: 

  • Why the companies that don’t experiment with AI skills right now “may be toast”

  • How Zoom built an AI companion that tracks opinion divergence — not just action items — in your meetings

  • Why the build vs. buy question just got a lot more urgent for enterprise software

  • Why “skills” may matter more than MCP for the future of enterprise AI

You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.

Mistral AI launches Forge to help companies build proprietary AI models, challenging cloud giants

Mistral AI on Monday launched Forge, an enterprise model training platform that allows organizations to build, customize, and continuously improve AI models using their own proprietary data — a move that positions the French AI lab squarely against the hyperscale cloud providers in one of the most consequential and least understood markets in enterprise technology.

The announcement caps a remarkably aggressive week for Mistral, which also released its Mistral Small 4 model, unveiled Leanstral — an open-source code agent for formal verification — and joined the newly formed Nvidia Nemotron Coalition as a co-developer of the coalition’s first open frontier base model. Together, these moves paint the picture of a company that is no longer content to compete on model benchmarks alone and is instead racing to become the infrastructure backbone for organizations that want to own their AI rather than rent it.

Forge goes significantly beyond the fine-tuning APIs that Mistral and its competitors have offered for the past year. The platform supports the full model training lifecycle: pre-training on large internal datasets, post-training through supervised fine-tuning, DPO, and ODPO, and — critically — reinforcement learning pipelines designed to align models with internal policies, evaluation criteria, and operational objectives over time.

“Forge is Mistral’s model training platform,” said Maliena Guy, head of product at Mistral AI, in an exclusive interview with VentureBeat ahead of the launch. “We’ve been building this out behind the scenes with our AI scientists. What Forge actually brings to the table is that it lets enterprises and governments customize AI models for their specific needs.”

Why Mistral says fine-tuning APIs are no longer enough for serious enterprise AI

The distinction Mistral is drawing — between lightweight fine-tuning and full-cycle model training — is central to understanding why Forge exists and whom it serves.

For the past two years, most enterprise AI adoption has followed a familiar pattern: companies select a general-purpose model from OpenAI, Anthropic, Google, or an open-source provider, then apply fine-tuning through a cloud API to adjust the model’s behavior for a narrow set of tasks. This approach works well for proof-of-concept deployments and many production use cases. But Guy argues that it fundamentally plateaus when organizations try to solve their hardest problems.

“We had a fine-tuning API relying on supervised fine-tuning. I think it was kind of what was the standard a couple of months ago,” Guy told VentureBeat. “It gets you to a proof-of-concept state. Whenever you actually want to have the performance that you’re targeting, you need to go beyond. AI scientists today are not using fine-tuning APIs. They’re using much more advanced tools, and that’s what Forge is bringing to the table.”

What Forge packages, in Guy’s telling, is the training methodology that Mistral’s own AI scientists use internally to build the company’s flagship models — including data mixing strategies, data generation pipelines, distributed computing optimizations, and battle-tested training recipes. She drew a sharp line between Forge and the open-source tools and community tutorials that are freely available today.

“There’s no platform out there that provides you real-world training recipes that work,” Guy said. “Other open-source repositories or other tools can give you generic configurations or community tutorials, but they don’t give you the recipe that’s been validated — that we’ve been doing for all of our flagship models today.”

From ancient manuscripts to hedge fund quant languages, early customers reveal what off-the-shelf AI can’t do

The obvious question facing any product like Forge is demand. In a market where GPT-5, Claude, Gemini, and a growing fleet of open-source models can handle an enormous range of tasks, why would an enterprise invest the time, compute, and expertise required to train its own model from scratch?

Guy acknowledged the question head-on but argued that the need emerges quickly once companies move beyond generic use cases. “A lot of the existing models can get you very far,” she said. “But when you’re looking at what’s going to make you competitive compared to your competition — everyone can adopt and use the models that are out there. When you want to go a step beyond that, you actually need to create your own models. You need to leverage your proprietary information.”

The real-world examples she cited illustrate the edges of the current model ecosystem. In one case, Mistral worked with a public institution that had ancient manuscripts with missing text from damaged sections. “The models that were available were not able to do this because they’ve never seen the data,” Guy explained. “Digitization was not very good. There were some unique patterns and characters, and so we actually created a model for them to fill in the spans. This is now used by their researchers, and it’s accelerating their publication and understanding of these documents.”

In another engagement, Mistral partnered with Ericsson to customize its Codestral model for legacy-to-modern code translation. Ericsson, Guy said, has built up half a decade of proprietary knowledge around an internal calling language — a codebase so specialized that no off-the-shelf model has ever encountered it. “The concrete impact is like turning a year-long manual migration process, where each engineer needs six months of onboarding, to something that’s really more scalable and faster,” she said.

Perhaps the most telling example involves hedge funds. Guy described working with financial firms to customize models for proprietary quantitative languages — the kind of deeply guarded intellectual property that these firms keep on-premises and never expose to cloud-hosted AI services. Using Forge’s reinforcement learning capabilities, Mistral helped one hedge fund develop custom benchmarks and then trained the model to outperform on them, producing what Guy called “a unique model that was able to give them the competitive edge that was needed.”

How Forge makes money: license fees, data pipelines, and embedded AI scientists

Forge’s business model reflects the complexity of enterprise model training. According to Guy, it operates across several revenue streams. For customers who run training jobs on their own GPU clusters — a common requirement in highly regulated or IP-sensitive industries — Mistral does not charge for compute. Instead, the company charges a license fee for the Forge platform itself, along with optional fees for data pipeline services and what Mistral calls “forward-deployed scientists” — embedded AI researchers who work alongside the customer’s team.

“No competitor out there today is kind of selling this embedded scientist as part of their training platform offering,” Guy said.

This model has clear echoes of Palantir’s early playbook, where forward-deployed engineers served as the critical bridge between powerful software and the messy reality of enterprise data. It also suggests that Mistral recognizes a fundamental truth about the current state of enterprise AI: the technology alone is not enough. Most organizations lack the internal expertise to design effective training recipes, curate data at scale, or navigate the treacherous optimization landscape of distributed GPU training.

The infrastructure itself is flexible. Training can happen on Mistral’s own clusters, on Mistral Compute (the company’s dedicated infrastructure offering), or entirely on-premises within the customer’s own data centers. “We have all these different cases, and we support everything,” Guy said.

Keeping proprietary data off the cloud is Forge’s sharpest selling point

One of the sharpest points of differentiation Mistral is pressing with Forge is data privacy. When customers train on their own infrastructure, Guy emphasized that Mistral never sees the data at all.

“It’s on their clusters, it’s with their data — we don’t see anything of it, and so it’s completely under their control,” she said. “I think this is something that sets us apart from the competition, where you actually need to upload your data, and you have a black box effect.”

This matters enormously in sectors like defense, intelligence, financial services, and healthcare, where the legal and reputational risks of exposing proprietary data to a third-party cloud service can be deal-breakers. Mistral has already partnered with organizations including ASML, DSO National Laboratories Singapore, the European Space Agency, Home Team Science and Technology Agency Singapore, and Reply — a roster that suggests the company is deliberately targeting the most data-sensitive corners of the enterprise market.

Forge also includes data pipeline capabilities that Mistral has developed through its own model training: data acquisition, curation, and synthetic data generation. “Data is a critical piece of any training job today,” Guy said. “You need to have good data. You need to have a good amount of data to make sure that the model is going to be good performing. We’ve acquired, as a company, really great knowledge building out these data pipelines.”

In the age of AI agents, Mistral argues that custom models still matter more than MCP servers

The timing of Forge’s launch raises an important strategic question. The AI industry in 2026 has been consumed by agents — autonomous AI systems that can use tools, navigate multi-step workflows, and take actions on behalf of users. If the future belongs to agents, why does the underlying model matter? Can’t companies simply plug into the best available frontier model through an MCP server or API and focus their energy on orchestration?

Guy pushed back on this framing with conviction. “The customers that we’ve been working on — some of these specific problems are things that no MCP server would ever solve,” she said. “You actually need that intelligence. You actually need to create that model that will help you solve your most critical business problem.”

She also argued that model customization is essential even in purely agentic architectures. “There are some agentic behaviors that you need to bring to the model,” Guy said. “It can be about reasoning patterns, specific types of documentation, making sure that you have the right reasoning traces. Even in these cases where people are going completely agentic, you still need model customization — like reinforcement learning techniques — to actually get the right level of performance.”

Mistral’s press release makes this connection explicit, arguing that custom models make enterprise agents more reliable by providing deeper understanding of internal environments: more precise tool selection, more dependable multi-step workflows, and decisions that reflect internal policies rather than generic assumptions.

The platform also supports an “agent-first” design philosophy. Forge exposes interfaces that allow autonomous agents — including Mistral’s own Vibe coding agent — to launch training experiments, find optimal hyperparameters, schedule jobs, and generate synthetic data. “We’ve actually been building Forge in an AI-native way,” Guy said. “We’re already testing out how autonomous agents can actually launch training experiments.”

Mistral Small 4, Leanstral, and the Nvidia coalition: the week that redefined the company’s ambitions

To fully appreciate Forge’s significance, it helps to view it alongside the other announcements Mistral made in the same week — a barrage of releases that together represent the most ambitious expansion in the company’s short history.

Just yesterday, Mistral released Leanstral, the first open-source code agent for Lean 4, the proof assistant used in formal mathematics and software verification. Leanstral operates with just 6 billion active parameters and is designed for realistic formal repositories — not isolated math competition problems. On the same day, Mistral launched Mistral Small 4, a mixture-of-experts model with 119 billion total parameters but only 6 billion active per query, running 40 percent faster than its predecessor while handling three times more queries per second. Both models ship under the Apache 2.0 license — the most permissive open-source license in wide use.

And then there is the Nvidia Nemotron Coalition. Announced at Nvidia’s GTC conference, the coalition is a first-of-its-kind collaboration between Nvidia and a group of AI labs — including Mistral, Perplexity, LangChain, Cursor, Black Forest Labs, Reflection AI, Sarvam, and Thinking Machines Lab — to co-develop open frontier models. The coalition’s first project is a base model co-developed specifically by Mistral AI and Nvidia, trained on Nvidia DGX Cloud, which will underpin the upcoming Nvidia Nemotron 4 family of open models.

“Open frontier models are how AI becomes a true platform,” said Arthur Mensch, cofounder and CEO of Mistral AI, in Nvidia’s announcement. “Together with Nvidia, we will take a leading role in training and advancing frontier models at scale.”

This coalition role is strategically significant. It positions Mistral not merely as a consumer of Nvidia’s compute infrastructure but as a co-creator of the foundational models that the broader ecosystem will build upon. For a company that is still a fraction of the size of its American competitors, this is an outsized seat at the table.

Forge takes aim at Amazon, Microsoft, and Google — and says they can’t go deep enough

Forge enters a market that is already crowded — at least on the surface. Amazon Bedrock, Microsoft Azure AI Foundry, and Google Cloud Vertex AI all offer model training and customization capabilities. But Guy argued that these offerings are fundamentally limited in two respects.

First, they are cloud-only. “In one set of cases, it’s very easy to answer — they want to run this on their premises, and so all these tools that are available on the cloud are just not available for them,” Guy said. Second, she argued that the hyperscalers’ training tools largely offer simplified API interfaces that don’t provide the depth of control that serious model training requires.

There is also the dependency question. Guy described digital-native companies that had built products on top of closed-source models, only to have a new model release — more verbose than its predecessor — crash their production pipelines. “When you’re relying on closed-source models, you are also super dependent on the updates of the model that have side effects,” she warned.

This argument resonates with the broader sovereignty narrative that has powered Mistral’s rise in Europe and beyond. The company has positioned itself as the alternative for organizations that want to own their AI stack rather than lease it from American hyperscalers. Forge extends that argument from inference to training: not just running models you own, but building them in the first place.

The open-source foundation matters here, too. Mistral has been releasing models under permissive licenses since its founding, and Guy emphasized that the company is building Forge as an open platform. While it currently works with Mistral’s own models, she confirmed that support for other open-source architectures is planned. “We’re deeply rooted into open source. This has been part of our DNA since the beginning, and we have been building Forge to be an open platform — it’s just a matter of time before we open this to other open-source models.”

A co-founder’s departure to xAI underscores why Mistral is turning expertise into a product

The timing of Forge’s launch also arrives against a backdrop of fierce talent competition. As FinTech Weekly reported on March 14, Devendra Singh Chaplot — a co-founder of Mistral AI who headed the company’s multimodal group and contributed to training Mistral 7B, Mixtral 8x7B, and Mistral Large — left to join Elon Musk’s xAI, where he will work on Grok model training. Chaplot had previously also been a founding member of Thinking Machines Lab, the AI startup founded by former OpenAI CTO Mira Murati.

The loss of a co-founder is never insignificant, but Mistral appears to be compensating with institutional capability rather than individual brilliance. Forge is, in essence, a productization of the company’s collective training expertise — the recipes, the pipelines, the distributed computing optimizations — in a form that can scale beyond any single researcher. By packaging this knowledge into a platform and pairing it with forward-deployed scientists, Mistral is attempting to build a durable competitive asset that doesn’t walk out the door when a key hire departs.

Mistral’s big bet: the companies that own their AI models will be the ones that win

Forge is a bet on a specific theory of the enterprise AI future: that the most valuable AI systems will be those trained on proprietary knowledge, governed by internal policies, and operated under the organization’s direct control. This stands in contrast to the prevailing paradigm of the past two years, in which enterprises have largely consumed AI as a cloud service — powerful but generic, convenient but uncontrolled.

The question is whether enough enterprises will be willing to make the investment. Model training is expensive, technically demanding, and requires sustained organizational commitment. Forge lowers the barriers — through its infrastructure automation, its battle-tested recipes, and its embedded scientists — but it does not eliminate them.

What Mistral appears to be banking on is that the organizations with the most to gain from AI — the ones sitting on decades of proprietary knowledge in highly specialized domains — are precisely the ones for whom generic models are least sufficient. These are the companies where the gap between what a general-purpose model can do and what the business actually needs is widest, and where the competitive advantage of closing that gap is greatest.

Forge supports both dense and mixture-of-experts architectures, accommodating different trade-offs between performance, cost, and operational constraints. It handles multimodal inputs. It is designed for continuous adaptation rather than one-time training, with built-in evaluation frameworks that let enterprises test models against internal benchmarks before production deployment.

For the past two years, the enterprise AI playbook has been straightforward: pick a model, call an API, ship a feature. Mistral is now asking a harder question — whether the organizations willing to do the difficult, expensive, unglamorous work of training their own models will end up with something the API-callers never get.

An unfair advantage.

Nvidia’s DGX Station is a desktop supercomputer that runs trillion-parameter AI models without the cloud

Nvidia on Monday unveiled a deskside supercomputer powerful enough to run AI models with up to one trillion parameters — roughly the scale of GPT-4 — without touching the cloud. The machine, called the DGX Station, packs 748 gigabytes of coherent memory and 20 petaflops of compute into a box that sits next to a monitor, and it may be the most significant personal computing product since the original Mac Pro convinced creative professionals to abandon workstations.

The announcement, made at the company’s annual GTC conference in San Jose, lands at a moment when the AI industry is grappling with a fundamental tension: the most powerful models in the world require enormous data center infrastructure, but the developers and enterprises building on those models increasingly want to keep their data, their agents, and their intellectual property local. The DGX Station is Nvidia’s answer — a six-figure machine that collapses the distance between AI’s frontier and a single engineer’s desk.

What 20 petaflops on your desktop actually means

The DGX Station is built around the new GB300 Grace Blackwell Ultra Desktop Superchip, which fuses a 72-core Grace CPU and a Blackwell Ultra GPU through Nvidia’s NVLink-C2C interconnect. That link provides 1.8 terabytes per second of coherent bandwidth between the two processors — seven times the speed of PCIe Gen 6 — which means the CPU and GPU share a single, seamless pool of memory without the bottlenecks that typically cripple desktop AI work.
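The seven-times figure checks out on the back of an envelope, assuming the baseline is a x16 PCIe Gen 6 link counted across both directions, roughly 256 GB/s. That baseline is our assumption, not one Nvidia has stated:

```python
# Sanity check of the "seven times the speed of PCIe Gen 6" claim.
# Assumption (ours, not Nvidia's): the baseline is a x16 PCIe Gen 6 link
# counted across both directions, roughly 256 GB/s aggregate.

NVLINK_C2C_GBPS = 1800        # 1.8 TB/s coherent bandwidth, per the article
PCIE_GEN6_X16_GBPS = 256      # assumed aggregate bidirectional bandwidth

ratio = NVLINK_C2C_GBPS / PCIE_GEN6_X16_GBPS
print(f"NVLink-C2C is ~{ratio:.0f}x a PCIe Gen 6 x16 link")  # ~7x
```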

Twenty petaflops — 20 quadrillion operations per second — would have ranked this machine among the world’s top supercomputers less than a decade ago. The Summit system at Oak Ridge National Laboratory, which held the global No. 1 spot in 2018, delivered roughly ten times that performance but occupied a room the size of two basketball courts. Nvidia is packaging a meaningful fraction of that capability into something that plugs into a wall outlet.

The 748 GB of unified memory is arguably the more important number. Trillion-parameter models are enormous neural networks that must be loaded entirely into memory to run. Without sufficient memory, no amount of processing speed matters — the model simply won’t fit. The DGX Station clears that bar, and it does so with a coherent architecture that eliminates the latency penalties of shuttling data between CPU and GPU memory pools.
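A quick weight-only calculation shows why, and also shows that quantization is doing real work in the trillion-parameter claim. The precisions below are illustrative assumptions; real deployments also need memory for activations and key-value caches on top of the weights:

```python
# Rough weight-only memory footprint of a one-trillion-parameter model at
# several precisions. The precision choices are illustrative assumptions;
# activations and KV caches would add further overhead, so these are floors.

def weights_gb(params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in gigabytes."""
    return params * bits_per_param / 8 / 1e9

PARAMS = 1e12          # one trillion parameters
STATION_MEMORY = 748   # DGX Station unified memory, in GB

for bits in (16, 8, 4):
    gb = weights_gb(PARAMS, bits)
    verdict = "fits" if gb <= STATION_MEMORY else "does not fit"
    print(f"{bits}-bit: {gb:,.0f} GB -> {verdict} in {STATION_MEMORY} GB")
```

At 16-bit precision a trillion-parameter model needs about 2 TB for weights alone; only at roughly 4-bit quantization (about 500 GB) does it clear the 748 GB bar with headroom to spare.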

Always-on agents need always-on hardware

Nvidia designed the DGX Station explicitly for what it sees as the next phase of AI: autonomous agents that reason, plan, write code, and execute tasks continuously — not just systems that respond to prompts. Every major announcement at GTC 2026 reinforced this “agentic AI” thesis, and the DGX Station is where those agents are meant to be built and run.

The key pairing is NemoClaw, a new open-source stack that Nvidia also announced Monday. NemoClaw bundles Nvidia’s Nemotron open models with OpenShell, a secure runtime that enforces policy-based security, network, and privacy guardrails for autonomous agents. A single command installs the entire stack. Jensen Huang, Nvidia’s founder and CEO, framed the combination in unmistakable terms, calling OpenClaw — the broader agent platform NemoClaw supports — “the operating system for personal AI” and comparing it directly to Mac and Windows.

The argument is straightforward: cloud instances spin up and down on demand, but always-on agents need persistent compute, persistent memory, and persistent state. A machine under your desk, running 24/7 with local data and local models inside a security sandbox, is architecturally better suited to that workload than a rented GPU in someone else’s data center. The DGX Station can operate as a personal supercomputer for a solo developer or as a shared compute node for teams, and it supports air-gapped configurations for classified or regulated environments where data can never leave the building.

From desk prototype to data center production in zero rewrites

One of the cleverest aspects of the DGX Station’s design is what Nvidia calls architectural continuity. Applications built on the machine migrate seamlessly to the company’s GB300 NVL72 data center systems — 72-GPU racks designed for hyperscale AI factories — without rewriting a single line of code. Nvidia is selling a vertically integrated pipeline: prototype at your desk, then scale to the cloud when you’re ready.

This matters because the biggest hidden cost in AI development today isn’t compute — it’s the engineering time lost to rewriting code for different hardware configurations. A model fine-tuned on a local GPU cluster often requires substantial rework to deploy on cloud infrastructure with different memory architectures, networking stacks, and software dependencies. The DGX Station eliminates that friction by running the same Nvidia AI software stack that powers every tier of Nvidia’s infrastructure, from the DGX Spark to the Vera Rubin NVL72.

Nvidia also expanded the DGX Spark, the Station’s smaller sibling, with new clustering support. Up to four Spark units can now operate as a unified system with near-linear performance scaling — a “desktop data center” that fits on a conference table without rack infrastructure or an IT ticket. For teams that need to fine-tune mid-size models or develop smaller-scale agents, clustered Sparks offer a credible departmental AI platform at a fraction of the Station’s cost.

The early buyers reveal where the market is heading

The initial customer roster for DGX Station maps the industries where AI is transitioning fastest from experiment to daily operating tool. Snowflake is using the system to locally test its open-source Arctic training framework. EPRI, the Electric Power Research Institute, is advancing AI-powered weather forecasting to strengthen electrical grid reliability. Medivis is integrating vision language models into surgical workflows. Microsoft Research and Cornell have deployed the systems for hands-on AI training at scale.

Systems are available to order now and will ship in the coming months from ASUS, Dell Technologies, GIGABYTE, MSI, and Supermicro, with HP joining later in the year. Nvidia hasn’t disclosed pricing, but the GB300 components and the company’s historical DGX pricing suggest a six-figure investment — expensive by workstation standards, but remarkably cheap compared to the cloud GPU costs of running trillion-parameter inference at scale.

The list of supported models underscores how open the AI ecosystem has become: developers can run and fine-tune OpenAI’s gpt-oss-120b, Google Gemma 3, Qwen3, Mistral Large 3, DeepSeek V3.2, and Nvidia’s own Nemotron models, among others. The DGX Station is model-agnostic by design — a hardware Switzerland in an industry where model allegiances shift quarterly.

Nvidia’s real strategy: own every layer of the AI stack, from orbit to office

The DGX Station didn’t arrive in a vacuum. It was one piece of a sweeping set of GTC 2026 announcements that collectively map Nvidia’s ambition to supply AI compute at literally every physical scale.

At the top, Nvidia unveiled the Vera Rubin platform — seven new chips in full production — anchored by the Vera Rubin NVL72 rack, which integrates 72 next-generation Rubin GPUs and claims up to 10x higher inference throughput per watt compared to the current Blackwell generation. The Vera CPU, with 88 custom Olympus cores, targets the orchestration layer that agentic workloads increasingly demand. At the far frontier, Nvidia announced the Vera Rubin Space Module for orbital data centers, delivering 25x more AI compute for space-based inference than the H100.

Between orbit and office, Nvidia revealed partnerships spanning Adobe for creative AI, automakers like BYD and Nissan for Level 4 autonomous vehicles, a coalition with Mistral AI and seven other labs to build open frontier models, and Dynamo 1.0, an open-source inference operating system already adopted by AWS, Azure, Google Cloud, and a roster of AI-native companies including Cursor and Perplexity.

The pattern is unmistakable: Nvidia wants to be the computing platform — hardware, software, and models — for every AI workload, everywhere. The DGX Station is the piece that fills the gap between the cloud and the individual.

The cloud isn’t dead, but its monopoly on serious AI work is ending

For the past several years, the default assumption in AI has been that serious work requires cloud GPU instances — renting Nvidia hardware from AWS, Azure, or Google Cloud. That model works, but it carries real costs: data egress fees, latency, security exposure from sending proprietary data to third-party infrastructure, and the fundamental loss of control inherent in renting someone else’s computer.

The DGX Station doesn’t kill the cloud — Nvidia’s data center business dwarfs its desktop revenue and is accelerating. But it creates a credible local alternative for an important and growing category of workloads. Training a frontier model from scratch still demands thousands of GPUs in a warehouse. Fine-tuning a trillion-parameter open model on proprietary data? Running inference for an internal agent that processes sensitive documents? Prototyping before committing to cloud spend? A machine under your desk starts to look like the rational choice.

This is the strategic elegance of the product: it expands Nvidia’s addressable market into personal AI infrastructure while reinforcing the cloud business, because everything built locally is designed to scale up to Nvidia’s data center platforms. It’s not cloud versus desk. It’s cloud and desk, and Nvidia supplies both.

A supercomputer on every desk — and an agent that never sleeps on top of it

The PC revolution’s defining slogan was “a computer on every desk and in every home.” Four decades later, Nvidia is updating the premise with an uncomfortable escalation. The DGX Station puts genuine supercomputing power — the kind that ran national laboratories — beside a keyboard, and NemoClaw puts an autonomous AI agent on top of it that runs around the clock, writing code, calling tools, and completing tasks while its owner sleeps.

Whether that future is exhilarating or unsettling depends on your vantage point. But one thing is no longer debatable: the infrastructure required to build, run, and own frontier AI just moved from the server room to the desk drawer. And the company that sells nearly every serious AI chip on the planet just made sure it sells the desk drawer, too.

Nvidia introduces Vera Rubin, a seven-chip AI platform with OpenAI, Anthropic and Meta on board

Nvidia on Monday took the wraps off Vera Rubin, a sweeping new computing platform built from seven chips now in full production — and backed by an extraordinary lineup of customers that includes Anthropic, OpenAI, Meta and Mistral AI, along with every major cloud provider.

The message to the AI industry, and to investors, was unmistakable: Nvidia is not slowing down. The Vera Rubin platform claims up to 10x more inference throughput per watt and one-tenth the cost per token compared with the Blackwell systems that only recently began shipping. CEO Jensen Huang, speaking at the company’s annual GTC conference, called it “a generational leap” that would kick off “the greatest infrastructure buildout in history.” Amazon Web Services, Google Cloud, Microsoft Azure and Oracle Cloud Infrastructure will all offer the platform, and more than 80 manufacturing partners are building systems around it.

“Vera Rubin is a generational leap — seven breakthrough chips, five racks, one giant supercomputer — built to power every phase of AI,” Huang declared. “The agentic AI inflection point has arrived with Vera Rubin kicking off the greatest infrastructure buildout in history.”

In any other industry, such rhetoric might be dismissed as keynote theater. But Nvidia occupies a singular position in the global economy — a company whose products have become so essential to the AI boom that its market capitalization now rivals the GDP of mid-sized nations. When Huang says the infrastructure buildout is historic, the CEOs of the companies actually writing the checks are standing behind him, nodding.

Dario Amodei, the chief executive of Anthropic, said Nvidia’s platform “gives us the compute, networking and system design to keep delivering while advancing the safety and reliability our customers depend on.” Sam Altman, the chief executive of OpenAI, said that “with Nvidia Vera Rubin, we’ll run more powerful models and agents at massive scale and deliver faster, more reliable systems to hundreds of millions of people.”

Inside the seven-chip architecture designed to power the age of AI agents

The Vera Rubin platform brings together the Nvidia Vera CPU, Rubin GPU, NVLink 6 Switch, ConnectX-9 SuperNIC, BlueField-4 DPU, Spectrum-6 Ethernet switch and the newly integrated Groq 3 LPU — a purpose-built inference accelerator. Nvidia organized these into five interlocking rack-scale systems that function as a unified supercomputer.

The flagship NVL72 rack integrates 72 Rubin GPUs and 36 Vera CPUs connected by NVLink 6. Nvidia says it can train large mixture-of-experts models using one-quarter the GPUs required on Blackwell, a claim that, if validated in production, would fundamentally alter the economics of building frontier AI systems.

The Vera CPU rack packs 256 liquid-cooled processors into a single rack, sustaining more than 22,500 concurrent CPU environments — the sandboxes where AI agents execute code, validate results and iterate. Nvidia describes the Vera CPU as the first processor purpose-built for agentic AI and reinforcement learning, featuring 88 custom-designed Olympus cores and LPDDR5X memory delivering 1.2 terabytes per second of bandwidth at half the power of conventional server CPUs.
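That environment count appears to fall straight out of the core counts. Assuming one sandboxed environment per core, which is our reading rather than Nvidia’s stated formula:

```python
# The rack's numbers line up: 256 Vera CPUs x 88 Olympus cores each.
# Assumption (ours): one agent sandbox per physical core.

CPUS_PER_RACK = 256
CORES_PER_CPU = 88

environments = CPUS_PER_RACK * CORES_PER_CPU
print(environments)  # 22528, i.e. "more than 22,500 concurrent environments"
```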

The Groq 3 LPX rack, housing 256 inference processors with 128 gigabytes of on-chip SRAM, targets the low-latency demands of trillion-parameter models with million-token contexts. The BlueField-4 STX storage rack provides what Nvidia calls “context memory” — high-speed storage for the massive key-value caches that agentic systems generate as they reason across long, multi-step tasks. And the Spectrum-6 SPX Ethernet rack ties it all together with co-packaged optics delivering 5x greater optical power efficiency than traditional transceivers.

Why Nvidia is betting the future on autonomous AI agents — and rebuilding its stack around them

The strategic logic binding every announcement Monday into a single narrative is Nvidia’s conviction that the AI industry is crossing a threshold. The era of chatbots — AI that responds to a prompt and stops — is giving way to what Huang calls “agentic AI”: systems that reason autonomously for hours or days, write and execute software, call external tools, and continuously improve.

This isn’t just a branding exercise. It represents a genuine architectural shift in how computing infrastructure must be designed. A chatbot query might consume milliseconds of GPU time. An agentic system orchestrating a drug discovery pipeline or debugging a complex codebase might run continuously, consuming CPU cycles to execute code, GPU cycles to reason, and massive storage to maintain context across thousands of intermediate steps. That demands not just faster chips, but a fundamentally different balance of compute, memory, storage and networking.

Nvidia addressed this with the launch of its Agent Toolkit, which includes OpenShell, a new open-source runtime that enforces security and privacy guardrails for autonomous agents. The enterprise adoption list is remarkable: Adobe, Atlassian, Box, Cadence, Cisco, CrowdStrike, Dassault Systèmes, IQVIA, Red Hat, Salesforce, SAP, ServiceNow, Siemens and Synopsys are all integrating the toolkit into their platforms. Nvidia also launched NemoClaw, an open-source stack that lets users install its Nemotron models and OpenShell runtime in a single command to run secure, always-on AI assistants on everything from RTX laptops to DGX Station supercomputers.

The company separately announced Dynamo 1.0, open-source software it describes as the first “operating system” for AI inference at factory scale. Dynamo orchestrates GPU and memory resources across clusters and has already been adopted by AWS, Azure, Google Cloud, Oracle, Cursor, Perplexity, PayPal and Pinterest. Nvidia says it boosted Blackwell inference performance by up to 7x in recent benchmarks.

The Nemotron coalition and Nvidia’s play to shape the open-source AI landscape

If Vera Rubin represents Nvidia’s hardware ambition, the Nemotron Coalition represents its software ambition. Announced Monday, the coalition is a global collaboration of AI labs that will jointly develop open frontier models trained on Nvidia’s DGX Cloud. The inaugural members — Black Forest Labs, Cursor, LangChain, Mistral AI, Perplexity, Reflection AI, Sarvam and Thinking Machines Lab, the startup led by former OpenAI executive Mira Murati — will contribute data, evaluation frameworks and domain expertise.

The first model will be co-developed by Mistral AI and Nvidia and will underpin the upcoming Nemotron 4 family. “Open models are the lifeblood of innovation and the engine of global participation in the AI revolution,” Huang said.

Nvidia also expanded its own open model portfolio significantly. Nemotron 3 Ultra delivers what the company calls frontier-level intelligence with 5x throughput efficiency on Blackwell. Nemotron 3 Omni integrates audio, vision and language understanding. Nemotron 3 VoiceChat supports real-time, simultaneous conversations. And the company previewed GR00T N2, a next-generation robot foundation model that it says helps robots succeed at new tasks in new environments more than twice as often as leading alternatives, currently ranking first on the MolmoSpaces and RoboArena benchmarks.

The open-model push serves a dual purpose. It cultivates the developer ecosystem that drives demand for Nvidia hardware, and it positions Nvidia as a neutral platform provider rather than a competitor to the AI labs building on its chips — a delicate balancing act that grows more complex as Nvidia’s own models grow more capable.

From operating rooms to orbit: how Vera Rubin’s reach extends far beyond the data center

The vertical breadth of Monday’s announcements was almost disorienting. Roche revealed it is deploying more than 3,500 Blackwell GPUs across hybrid cloud and on-premises environments in the U.S. and Europe — the largest announced GPU footprint in the pharmaceutical industry. The company is using the infrastructure for biological foundation models, drug discovery and digital twins of manufacturing facilities, including its new GLP-1 facility in North Carolina. Nearly 90 percent of Genentech’s eligible small-molecule programs now integrate AI, Roche said, with one oncology molecule designed 25 percent faster and a backup candidate delivered in seven months instead of more than two years.

In autonomous vehicles, BYD, Geely, Isuzu and Nissan are building Level 4-ready vehicles on Nvidia’s Drive Hyperion platform. Nvidia and Uber expanded their partnership to launch autonomous vehicles across 28 cities on four continents by 2028, starting with Los Angeles and San Francisco in the first half of 2027. The company introduced Alpamayo 1.5, a reasoning model for autonomous driving already downloaded by more than 100,000 automotive developers, and Nvidia Halos OS, a safety architecture built on ASIL D-certified foundations for production-grade autonomy.

Nvidia also released the first domain-specific physical AI platform for healthcare robotics, anchored by Open-H — the world’s largest healthcare robotics dataset, with over 700 hours of surgical video. CMR Surgical, Johnson & Johnson MedTech and Medtronic are among the adopters.

And then there was space. The Vera Rubin Space Module delivers up to 25x more AI compute for orbital inferencing compared with the H100 GPU. Aetherflux, Axiom Space, Kepler Communications, Planet Labs and Starcloud are building on it. “Space computing, the final frontier, has arrived,” Huang said, deploying the kind of line that, from another executive, might draw eye-rolls — but from the CEO of a company whose chips already power the majority of the world’s AI workloads, lands differently.

The deskside supercomputer and Nvidia’s quiet push into enterprise hardware

Amid the spectacle of trillion-parameter models and orbital data centers, Nvidia made a quieter but potentially consequential move: it launched the DGX Station, a deskside system powered by the GB300 Grace Blackwell Ultra Desktop Superchip that delivers 748 gigabytes of coherent memory and up to 20 petaflops of AI compute performance. The system can run open models of up to one trillion parameters from a desk.

Snowflake, Microsoft Research, Cornell, EPRI and Sungkyunkwan University are among the early users. DGX Station supports air-gapped configurations for regulated industries, and applications built on it move seamlessly to Nvidia’s data center systems without rearchitecting — a design choice that creates a natural on-ramp from local experimentation to large-scale deployment.

Nvidia also updated DGX Spark, its more compact system, with support for clustering up to four units into a “desktop data center” with linear performance scaling. Both systems ship preconfigured with NemoClaw and the Nvidia AI software stack, and support models including Nemotron 3, Google Gemma 3, Qwen3, DeepSeek V3.2, Mistral Large 3 and others.

Adobe and Nvidia separately announced a strategic partnership to develop the next generation of Firefly models using Nvidia’s computing technology and libraries. Adobe will also build a cloud-native 3D digital twin solution for marketing on Nvidia Omniverse and integrate Nemotron capabilities into Adobe Acrobat. The partnership spans creative tools including Photoshop, Premiere Pro, Frame.io and Adobe Experience Platform.

Building the factories that build intelligence: Nvidia’s AI infrastructure blueprint

Perhaps the most telling indicator of where Nvidia sees the industry heading is the Vera Rubin DSX AI Factory reference design — essentially a blueprint for constructing entire buildings optimized to produce AI. The reference design outlines how to integrate compute, networking, storage, power and cooling into a system that maximizes what Nvidia calls “tokens per watt,” along with an Omniverse DSX Blueprint for creating digital twins of these facilities before they are built.

The software stack includes DSX Max-Q for dynamic power provisioning — which Nvidia says enables 30 percent more AI infrastructure within a fixed-power data center — and DSX Flex, which connects AI factories to power-grid services to unlock what the company estimates is 100 gigawatts of stranded grid capacity. Energy leaders Emerald AI, GE Vernova, Hitachi and Siemens Energy are using the architecture. Nscale and Caterpillar are building one of the world’s largest AI factories in West Virginia using the Vera Rubin reference design.

Industry partners Cadence, Dassault Systèmes, Eaton, Jacobs, Schneider Electric, Siemens, PTC, Switch, Trane Technologies and Vertiv are contributing simulation-ready assets and integrating their platforms. CoreWeave is using Nvidia’s DSX Air to run operational rehearsals of AI factories in the cloud before physical delivery.

“In the age of AI, intelligence tokens are the new currency, and AI factories are the infrastructure that generates them,” Huang said. It is the kind of formulation — tokens as currency, factories as mints — that reveals how Nvidia thinks about its place in the emerging economic order.

What Nvidia’s grand vision gets right — and what remains unproven

The scale and coherence of Monday’s announcements are genuinely impressive. No other company in the semiconductor industry — and arguably no other technology company, period — can present an integrated stack spanning custom silicon, systems architecture, networking, storage, inference software, open models, agent frameworks, safety runtimes, simulation platforms, digital twin infrastructure and vertical applications from drug discovery to autonomous driving to orbital computing.

But scale and coherence are not the same as inevitability. The performance claims for Vera Rubin, while dramatic, remain largely unverified by independent benchmarks. The agentic AI thesis that underpins the entire platform — the idea that autonomous, long-running AI agents will become the dominant computing workload — is a bet on a future that has not yet fully materialized. And Nvidia’s expanding role as a provider of models, software, and reference architectures raises questions about how long its hardware customers will remain comfortable depending so heavily on a single supplier for so many layers of their stack.

Competitors are not standing still. AMD continues to close the gap on data center GPU performance. Google’s TPUs power some of the world’s largest AI training runs. Amazon’s Trainium chips are gaining traction inside AWS. And a growing cohort of startups is attacking various pieces of the AI infrastructure puzzle.

Yet none of them showed up at GTC on Monday with endorsements from the CEOs of Anthropic and OpenAI. None of them announced seven new chips in full production simultaneously. And none of them presented a vision this comprehensive for what comes next.

There is a scene that repeats at every GTC: Huang, in his trademark leather jacket, holds up a chip the way a jeweler holds up a diamond, rotating it slowly under the stage lights. It is part showmanship, part sermon. But the congregation keeps growing, the chips keep getting faster, and the checks keep getting larger. Whether Nvidia is building the greatest infrastructure in history or simply the most profitable one may, in the end, be a distinction without a difference.

The team behind continuous batching says your idle GPUs should be running inference, not sitting dark

Every GPU cluster has dead time. Training jobs finish, workloads shift and hardware sits dark while power and cooling costs keep running. For neocloud operators, those empty cycles are lost margin.

The obvious workaround is spot GPU markets — renting spare capacity to whoever needs it. But spot instances mean the cloud vendor is still the one doing the renting, and engineers buying that capacity are still paying for raw compute with no inference stack attached.

FriendliAI’s answer is different: run inference directly on the unused hardware, optimize for token throughput, and split the revenue with the operator. FriendliAI was founded by Byung-Gon Chun, the researcher whose paper on continuous batching became foundational to vLLM, the open source inference engine used across most production deployments today.

Chun spent over a decade as a professor at Seoul National University studying efficient execution of machine learning models at scale. That research produced a paper called Orca, which introduced continuous batching. The technique processes inference requests dynamically rather than waiting to fill a fixed batch before executing. It is now industry standard and is the core mechanism inside vLLM.
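The intuition behind continuous batching can be sketched with a toy scheduler. This is illustrative only; vLLM's actual implementation manages KV caches, token streams, and GPU memory, none of which appear here:

```python
# Toy comparison of static vs. continuous batching (illustrative only).
# Each request needs a different number of decode steps; batch capacity is 2.

requests = {"a": 2, "b": 5, "c": 1, "d": 3}  # request -> decode steps needed

def static_batching(reqs, capacity=2):
    """Wait for a full batch, then run it until EVERY member finishes."""
    pending = list(reqs.items())
    ticks = 0
    while pending:
        batch, pending = pending[:capacity], pending[capacity:]
        ticks += max(steps for _, steps in batch)  # stragglers hold the batch
    return ticks

def continuous_batching(reqs, capacity=2):
    """Admit a waiting request the moment any slot frees up."""
    pending = list(reqs.items())
    running = []
    ticks = 0
    while pending or running:
        while pending and len(running) < capacity:
            running.append(list(pending.pop(0)))
        ticks += 1
        for r in running:
            r[1] -= 1  # one decode step for every active request
        running = [r for r in running if r[1] > 0]
    return ticks

print(static_batching(requests))      # static pays for stragglers
print(continuous_batching(requests))  # slots are refilled every step
```

With these numbers the static scheduler spends idle cycles waiting on the slowest member of each batch, while the continuous scheduler backfills freed slots immediately, which is exactly the throughput gain the Orca paper formalized.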

This week, FriendliAI is launching a new platform called InferenceSense. Just as publishers use Google AdSense to monetize unsold ad inventory, neocloud operators can use InferenceSense to fill unused GPU cycles with paid AI inference workloads and collect a share of the token revenue. The operator’s own jobs always take priority — the moment a scheduler reclaims a GPU, InferenceSense yields.

“What we are providing is that instead of letting GPUs be idle, by running inferences they can monetize those idle GPUs,” Chun told VentureBeat.

How a Seoul National University lab built the engine inside vLLM

Chun founded FriendliAI in 2021, before most of the industry had shifted attention from training to inference. The company’s primary product is a dedicated inference endpoint service for AI startups and enterprises running open-weight models. FriendliAI also appears as a deployment option on Hugging Face alongside Azure, AWS and GCP, and currently supports more than 500,000 open-weight models from the platform.

InferenceSense now extends that inference engine to the capacity problem GPU operators face between workloads.

How it works

InferenceSense runs on top of Kubernetes, which most neocloud operators are already using for resource orchestration. An operator allocates a pool of GPUs to a Kubernetes cluster managed by FriendliAI — declaring which nodes are available and under what conditions they can be reclaimed. Idle detection runs through Kubernetes itself.

“We have our own orchestrator that runs on the GPUs of these neocloud — or just cloud — vendors,” Chun said. “We definitely take advantage of Kubernetes, but the software running on top is a really highly optimized inference stack.”

When GPUs are unused, InferenceSense spins up isolated containers serving paid inference workloads on open-weight models including DeepSeek, Qwen, Kimi, GLM and MiniMax. When the operator’s scheduler needs hardware back, the inference workloads are preempted and GPUs are returned. FriendliAI says the handoff happens within seconds.
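The reclaim behavior described above resembles priority-based preemption. A minimal sketch of the policy, with hypothetical class and workload names (FriendliAI's actual orchestrator is proprietary):

```python
# Toy model of priority-based GPU reclaim: operator jobs always win,
# and opportunistic inference pods are evicted the moment capacity is needed.

class GPUPool:
    def __init__(self, gpus):
        self.free = gpus
        self.inference = []   # low-priority, preemptible workloads
        self.operator = []    # high-priority owner workloads

    def run_inference(self, name):
        """Place an inference pod only if a GPU is actually idle."""
        if self.free > 0:
            self.free -= 1
            self.inference.append(name)
            return True
        return False

    def reclaim_for_operator(self, name):
        """Operator job arrives: take a free GPU, else preempt inference."""
        if self.free == 0 and self.inference:
            evicted = self.inference.pop(0)   # the inference workload yields
            print(f"preempted {evicted}")
        elif self.free > 0:
            self.free -= 1
        self.operator.append(name)

pool = GPUPool(gpus=2)
pool.run_inference("deepseek-shard-0")
pool.run_inference("qwen-shard-0")
pool.reclaim_for_operator("training-job")  # evicts an inference pod
```

In Kubernetes terms this maps onto low-priority PriorityClasses for the inference pods, so the operator's own scheduler can evict them without negotiation.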

Demand is aggregated through FriendliAI’s direct clients and through inference aggregators like OpenRouter. The operator supplies the capacity; FriendliAI handles the demand pipeline, model optimization and serving stack. There are no upfront fees and no minimum commitments. A real-time dashboard shows operators which models are running, tokens being processed and revenue accrued.

Why token throughput beats raw capacity rental

Spot GPU markets from providers like CoreWeave, Lambda Labs and RunPod involve the cloud vendor renting out its own hardware to a third party. InferenceSense runs on hardware the neocloud operator already owns, with the operator defining which nodes participate and setting scheduling agreements with FriendliAI in advance. The distinction matters: spot markets monetize capacity, InferenceSense monetizes tokens.

Token throughput per GPU-hour determines how much InferenceSense can actually earn during unused windows. FriendliAI claims its engine delivers two to three times the throughput of a standard vLLM deployment, though Chun notes the figure varies by workload type.

Most competing inference stacks are built on Python-based open source frameworks. FriendliAI’s engine is written in C++ and uses custom GPU kernels rather than Nvidia’s cuDNN library. The company has built its own model representation layer for partitioning and executing models across hardware, with its own implementations of speculative decoding, quantization and KV-cache management.

Since FriendliAI’s engine processes more tokens per GPU-hour than a standard vLLM stack, operators should generate more revenue per unused cycle than they could by standing up their own inference service. 
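The economics reduce to simple arithmetic. All figures below are hypothetical, chosen only to show why throughput, not raw capacity, drives the payout:

```python
# Back-of-envelope: revenue per idle GPU-hour from token throughput.
# All numbers are hypothetical, for illustration only.

def revenue_per_gpu_hour(tokens_per_sec, price_per_million_tokens, rev_share):
    """Tokens served in one hour, priced per million, split with the operator."""
    tokens_per_hour = tokens_per_sec * 3600
    gross = tokens_per_hour / 1_000_000 * price_per_million_tokens
    return gross * rev_share

baseline = revenue_per_gpu_hour(2_000, 0.50, 0.5)   # standard vLLM-class stack
optimized = revenue_per_gpu_hour(5_000, 0.50, 0.5)  # ~2.5x-throughput engine

print(f"baseline:  ${baseline:.2f}/GPU-hour")
print(f"optimized: ${optimized:.2f}/GPU-hour")
```

Because price per million tokens and the revenue split are fixed at the margin, a 2.5x throughput advantage translates directly into 2.5x the payout per idle hour.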

What AI engineers evaluating inference costs should watch

For AI engineers evaluating where to run inference workloads, the neocloud versus hyperscaler decision has typically come down to price and availability.

InferenceSense adds a new consideration: if neoclouds can monetize idle capacity through inference, they have more economic incentive to keep token prices competitive.

That is not a reason to change infrastructure decisions today — it is still early. But engineers tracking total inference cost should watch whether neocloud adoption of platforms like InferenceSense puts downward pressure on API pricing for models like DeepSeek and Qwen over the next 12 months.

“When we have more efficient suppliers, the overall cost will go down,” Chun said. “With InferenceSense we can contribute to making those models cheaper.”

How to make your e-commerce product visible to AI agents? Use this new system trusted by L’Oréal, Unilever, Mars & Beiersdorf

For future-focused e-commerce brands, the primary customer is rapidly shifting from the person behind a screen to the AI agents that person deploys to research and, if projections hold, purchase products on their behalf.

Investment banking and financial services giant Morgan Stanley, for instance, has published research suggesting 10-20% of the entire U.S. commerce spend could be agentic by 2030 — amounting to $190 billion to $385 billion.

In response to this seismic shift, the four-year-old agentic AI e-commerce startup Azoma has unveiled the Agentic Merchant Protocol (AMP).

This new framework is designed to provide high-volume retailers — such as grocery brands, electronics manufacturers, and fashion labels — with a “brand-friendly” anchor in an ecosystem increasingly dominated by autonomous shoppers.

The idea is compelling and deceptively simple at its core. Under the current status quo, merchants selling physical products online manually enter information about each product (SKUs, materials, and so on) into every marketplace and listing aggregator separately: Walmart, Amazon, Google Shopping, and the rest. With AMP, brands enter that information once into Azoma’s platform and push it everywhere it needs to go, including pages optimized for AI agents to search and retrieve on behalf of users, so that agents can recommend the products that fit a shopper’s specific query.

Using technology to end the ‘black box’ era of early agentic AI e-commerce

Modern AI integration typically relies on siloed systems like OpenAI’s ACP or Google’s UCP. While these protocols manage the technical handshakes required for discovery and payment, they offer minimal oversight regarding brand integrity.

When an AI agent deployed by a customer “reasons” about their human consumer’s product query, it often synthesizes data from unverified corners of the web, such as Reddit or outdated affiliate sites, creating a “black box” effect where the brand’s intended messaging is lost.

AMP functions as a high-level “system of record” that bridges these disparate platforms. It allows companies to centralize their product intelligence—including legal guardrails and brand books—into a single, machine-native format.
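What a “machine-native” product record might look like can be sketched as structured data plus enforceable guardrails. The field names below are hypothetical; Azoma has not published AMP’s actual schema:

```python
# Sketch of a machine-native product record of the kind a "system of record"
# might centralize. Field names are hypothetical; AMP's real schema is not public.

product_record = {
    "sku": "GLOW-SERUM-30ML",
    "canonical_claims": [
        "Contains 10% niacinamide",
        "Dermatologically tested",
    ],
    "legal_guardrails": {
        "prohibited_phrases": ["cures", "FDA approved"],
        "regulatory_regime": "FDA/DSHEA",
    },
    "brand_voice": {"tone": "clinical, confident", "reading_level": "general"},
}

def violates_guardrails(copy: str, record: dict) -> list:
    """Return any prohibited phrases found in generated product copy."""
    banned = record["legal_guardrails"]["prohibited_phrases"]
    return [p for p in banned if p.lower() in copy.lower()]

print(violates_guardrails("This serum cures acne overnight!", product_record))
```

The point of centralizing claims and guardrails in one record is that every downstream surface, whether a marketplace listing or an agent-facing page, can be generated from and audited against the same source of truth.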

“AMP breaks the foundations of traditional ecommerce,” states Max Sinclair, CEO of Azoma, in a press release shared with VentureBeat ahead of the official announcement set for March 12 in London. “For decades, marketplaces like Amazon and Walmart acted as gatekeepers by controlling product detail pages, rankings, and distribution. Brands optimized a finite set of endpoints: PDPs, ads, search results. In an agentic world, those fixed pages no longer exist.”

The Azoma platform is specifically engineered for high-volume retailers and manufacturers of physical goods, with a primary concentration on the Consumer Packaged Goods (CPG) and fast-moving consumer goods (FMCG) sectors.

In an interview with VentureBeat, Sinclair explicitly distinguished the protocol’s utility from digital-only assets or services, noting that Azoma does not currently support NFTs, SaaS, or financial sectors like banking and insurance.

Whether facilitating the automated reordering of household staples like dishwasher soap or providing the “reasoning” data for high-consideration purchases like specialized supplements and ski hardware, the protocol serves as the digital connective tissue for brands whose value is rooted in the physical world.

Sovereignty in a multi-agent world

The protocol has already seen rapid adoption by a coalition of consumer goods giants, including L’Oréal, Unilever, Mars, Beiersdorf, and Reckitt. For these organizations, maintaining a consistent identity across various AI surfaces is an urgent priority.

“The fact that businesses like L’Oréal, Unilever, Mars & Beiersdorf have moved so quickly to adopt AMP tells you everything about the urgency they feel,” Sinclair remarked during a recent interview with VentureBeat. “These are companies that have spent decades building brand equity—they’re not about to hand control of how their products are represented to an AI black box.”

The AMP suite provides several critical levers for technical leaders:

  • Canonical Machine-Native Catalogues: Data structures designed specifically for LLM ingestion, enriched with persona-level signaling.

  • Programmatic Open Web Distribution: Ensuring that the data agents find on the open web matches the brand’s official documentation.

  • Agent-Agnostic Infrastructure: A design that prevents vendor lock-in by allowing brands to interface with any AI assistant or marketplace agent.

  • Performance Visibility: Tools to measure how agents “weigh” specific product attributes and verify compliance across the ecosystem.

Intelligence as a competitive moat

Beyond simple data distribution, Azoma provides an end-to-end workflow designed to secure market share in an AI-first economy.

The platform includes a proprietary “RegGuard™ Compliance” engine that automatically audits all generated content against strict brand guidelines and regulatory rules, such as FDA/DSHEA standards.

This automated oversight is paired with advanced citation tracking, allowing brands to see exactly which sources—ranging from Reddit and Quora to Wikipedia and YouTube—AI agents are citing when they make a recommendation.

This granular visibility has already yielded significant performance gains for early partners. The company reports that for the brand Ruroc, site traffic from ChatGPT has increased 14x, positioning it as the #1 recommended ski helmet brand in target geographies.

Similarly, clients have seen their share of mentions within specific retail agents like Amazon Rufus increase by 5x, while optimized content has demonstrated conversion lifts of up to 32% in split-testing.

By addressing technical “GEO blockers”—such as schema errors, crawlability gaps, and JavaScript-only content that traditional scrapers might miss—Azoma enables brands to transition from passive observation to active optimization of the AI conversation.
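One such blocker, missing or malformed structured data, can be checked mechanically. A minimal sketch of a schema audit (this is a generic JSON-LD check, not Azoma’s tooling; a real audit would also test crawlability and JavaScript-rendered content):

```python
# Minimal check for one "GEO blocker": missing JSON-LD product schema.
# Generic illustration only; not Azoma's actual audit pipeline.
import json
import re

def has_product_schema(html: str) -> bool:
    """True if the page embeds parseable JSON-LD with @type Product."""
    for m in re.finditer(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        html, re.DOTALL,
    ):
        try:
            data = json.loads(m.group(1))
        except json.JSONDecodeError:
            continue  # a schema error: markup present but unparseable
        if isinstance(data, dict) and data.get("@type") == "Product":
            return True
    return False

good = ('<script type="application/ld+json">'
        '{"@type": "Product", "name": "Helmet"}</script>')
bad = "<div>Rendered by JavaScript only</div>"
print(has_product_schema(good), has_product_schema(bad))
```

A page like `bad` may render fine for humans yet be invisible to an agent that only reads served HTML, which is precisely the gap the article describes.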

For rapidly growing firms like Perfect Ted, this visibility contributed to a +532% year-over-year revenue increase.

Fusing marketplace DNA with AI research

Azoma’s leadership team mirrors the intersection of high-scale retail and advanced computation.

Sinclair spent six years at Amazon, where he spearheaded the customer browse experience for the Singapore launch and managed the expansion of Amazon Grocery throughout the European Union.

This tenure at the world’s largest retailer highlighted the limitations of static listings in a dynamic, AI-driven market. “In the traditional e-commerce world… you’d write a product listing, publish it, and that would be that,” Sinclair observed. “In this new world, the product detail pages are generative… our customers lose all of the control.”

The technical backbone of the protocol is led by CTO Timur Luguev, a Fulbright Scholar and ERCIM Fellow with over a decade in multimodal deep learning.

Luguev views AMP as a way to indirectly influence the broader “online footprint” that informs AI reasoning. “We want to feed agents through, basically, indirectly, through open online footprint,” Luguev explained.

“That’s the focus: basically first define this kind of a standard, so centralize this information about the product and the brand in one place, then syndicated across the open surfaces, and then quantify and measure the impact.”

Licensing and market implications

Azoma is positioning its protocol as a neutral alternative to the walled-garden approaches of major tech providers. While search engines prioritize the consumer’s user experience, AMP focuses exclusively on the merchant’s requirement for predictability and accuracy.

| Feature | Platform Protocols (ACP/UCP) | Azoma AMP |
| --- | --- | --- |
| Primary Focus | Transaction execution | Brand control & multi-agent syndication |
| Data Reach | Internal ecosystem only | Cross-platform & open web |
| Brand Governance | No / partial oversight | Full enterprise-defined control |
| Integration | Developer-centric APIs | Marketing & commerce team-friendly |

This shift effectively replaces traditional Search Engine Optimization (SEO) with Agentic Commerce Optimization (ACO).

Sinclair argues that this transition is driven by a shift in consumer trust. “You’re going to trust ChatGPT acting on over your data [more] than just putting into Google, ‘what mattress should I use’ and just clicking on whoever paid for that top link,” he says.

Pricing structure

Azoma’s commercial strategy is designed to bridge the gap between traditional enterprise software procurement and the performance-driven metrics of the AI era. Currently, the company utilizes a standard enterprise model, engaging with its global partners through annual contracts that typically fall within the six-to-seven-figure range. This structure is intended to align with the existing budgetary frameworks of large-scale organizations, providing the predictability required for multi-national department planning.

However, the company’s long-term vision involves a fundamental pivot toward an outcome-based pricing model. By integrating directly into a brand’s data and revenue flows, Azoma can measure the specific financial impact of every syndicated intervention across the agentic ecosystem.

“Our ambition is the future is kind of… taking a cut when they [agents] deliver value,” explained Sinclair.

This goal would effectively transition the protocol from a SaaS expense into a performance-based asset, mirroring how modern advertising platforms operate by tying costs directly to incremental revenue growth.

Outcome-based agentic e-commerce

Luguev noted that by accessing a brand’s data flows, Azoma can provide rigorous ROI forecasting. “We have access to our actions, and then we measure what actions actually made the biggest impact… provide them ability to forecast which campaigns, which actions and where to syndicate based on this understanding.”

As the market prepares for the official unveiling of the protocol at the Agentic Commerce Optimization event in London on March 12th, the message to the C-suite is clear: the “fixed” product page is dead. “When L’Oréal, Unilever and Mars move together in the same direction, the rest of the market pays attention,” Sinclair concluded.