For the modern enterprise, the digital workspace risks descending into “coordination theater,” in which teams spend more time discussing work than executing it.
While traditional tools like Slack or Teams excel at rapid communication, they have proven a poor structural foundation for AI agents. The frustration runs deep enough that a Hacker News thread calling on OpenAI to build its own version of Slack for AI agents went viral in February 2026, amassing 327 comments.
That’s because agents typically lack the real-time context and secure data access required to be truly useful, which leads to “hallucinations” or repetitive re-explaining of codebase conventions.
PromptQL, a spin-off from the GraphQL unicorn Hasura, is addressing this by pivoting from an AI data tool into a comprehensive, AI-native workspace designed to turn everyday team interactions into a persistent, secure memory for agentic workflows. Rather than letting conversations fall by the wayside, to be hunted down again later by users and agents, the platform distills them into actionable, proprietary data stored in an organized format: an internal wiki the company can rely on going forward, approved and edited manually as needed.
Imagine two colleagues messaging about a bug that needs to be fixed. Instead of manually assigning it to an engineer or agent, the messaging platform automatically tags it, assigns it, and documents it all in the wiki with one click. Now do this for every issue or topic of discussion in your enterprise, and you’ll have an idea of what PromptQL is attempting. The idea is simple but powerful: the conversation that necessarily precedes work becomes an actual assignment, started automatically by your own messaging system.
“We don’t have conversations about work anymore,” CEO Tanmai Gopal said in a recent video call interview with VentureBeat. “You actually have conversations that do the work.”
Originally positioned as an AI data analyst, the company is pivoting into a full-scale AI-native workspace.
It isn’t just “Slack with a chatbot”; it is a fundamental re-architecting of how teams interact with their data, their tools, and each other.
“PromptQL is this workhorse in the background, this 24/7 intern that’s continuously cranking out the actual work—looking at code, confirming hypotheses, going to multiple places, actually doing the work,” Gopal said.
The technical soul of PromptQL is its Shared Wiki. Traditional LLMs suffer from a “memory” problem; they forget previous interactions or hallucinate based on outdated training data.
PromptQL solves this by capturing “shared context” as teams work. When an engineer fixes a bug or a marketer defines a “recycled lead,” they aren’t just typing into a void. They are teaching a living, internal Wikipedia. This wiki doesn’t require “documentation sprints” or manual YAML file updates; it accumulates context organically.
“Throughout every single conversation, you are teaching PromptQL, and that is going into this wiki that is being developed over time. This is our entire company’s knowledge gradually coming together.”
Interconnectivity: Much like cells in a Petri dish, small “islands” of knowledge—say, a Salesforce integration—eventually bridge to other islands, like product usage data in Snowflake.
Human-in-the-Loop: To prevent the AI from learning “junk” (like a reminder about a doctor’s appointment from 2024), humans must explicitly “Add to Wiki” to canonize a fact.
The Virtual Data Layer: Unlike traditional platforms that require data replication, PromptQL uses a virtual SQL layer. It queries your data in place across databases (Snowflake, Clickhouse, Postgres) and SaaS tools (Stripe, Zendesk, HubSpot), ensuring that nothing is ever extracted or cached.
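The query-in-place idea can be sketched in miniature with SQLite’s `ATTACH`, which lets a single SQL statement join two separate databases where they live, with no replication step. The schemas and figures below are invented for illustration; PromptQL’s actual virtual layer federates warehouses and SaaS APIs, not SQLite files:

```python
import sqlite3

# Two separate databases stand in for, say, a billing store and a product
# store; ATTACH lets one statement join them without copying either side.
con = sqlite3.connect(":memory:")                      # "billing"
con.execute("ATTACH DATABASE ':memory:' AS product")   # second, independent DB

con.execute("CREATE TABLE invoices (account TEXT, amount REAL)")
con.execute("CREATE TABLE product.events (account TEXT, n_events INTEGER)")
con.executemany("INSERT INTO invoices VALUES (?, ?)",
                [("acme", 900.0), ("globex", 120.0)])
con.executemany("INSERT INTO product.events VALUES (?, ?)",
                [("acme", 5), ("globex", 4000)])

# One SQL statement joins both sources in place.
rows = con.execute("""
    SELECT i.account, i.amount, e.n_events
    FROM invoices i JOIN product.events e ON i.account = e.account
    ORDER BY i.account
""").fetchall()
print(rows)
```

The same shape scales up: the virtual layer translates one logical query into per-source queries and stitches the results, so no pipeline has to copy data ahead of time.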
PromptQL is designed to be a highly integrable orchestration layer that supports both leading AI model providers and a vast ecosystem of existing enterprise tools.
AI Model Support: The platform allows users to delegate tasks to specific coding agents such as Claude Code and Cursor, or use custom agents built for specific internal needs.
Workflow Compatibility: The system is built to inherit context from existing team tools, enabling AI agents to understand codebase conventions or deployment patterns from your existing infrastructure without manual re-explanation.
The PromptQL interface looks familiar—threads, channels, and mentions—but the functionality is transformative. In a demonstration, an engineer identifies a failing checkout in a #eng-bugs channel.
Instead of tagging a human SRE, they delegate to Claude Code via PromptQL. The agent doesn’t just look at the code; it inherits the team’s shared context.
It knows, for instance, that “EU payments switched to Adyen on Jan 15” because that fact was added to the wiki weeks prior.
Within minutes, the AI identifies a currency mismatch, pushes a fix, opens a PR, and updates the wiki for future reference. This “multiplayer” AI approach is what sets the platform apart.
It allows a non-technical manager to ask, “Which accounts have growing Stripe billing but flat Mixpanel usage?” and receive a joined table of data pulled from two disparate sources instantly. The user can then schedule a recurring Slack DM of those results with a single follow-up command.
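The manager’s question reduces to a per-account comparison once the two sources are joined. A toy version, with invented monthly series and thresholds standing in for Stripe revenue and Mixpanel event counts:

```python
# Invented monthly series per account; the thresholds are arbitrary.
billing = {"acme": [100, 140, 190], "globex": [80, 82, 81]}
usage   = {"acme": [50, 51, 50],    "globex": [50, 90, 140]}

def growing(series, min_ratio=1.2):
    # Grew at least 20% over the window.
    return series[-1] >= series[0] * min_ratio

def flat(series, tol=0.10):
    # Stayed within a 10% band for the whole window.
    return max(series) <= min(series) * (1 + tol)

expansion_risk = [acct for acct in billing
                  if growing(billing[acct]) and flat(usage[acct])]
print(expansion_risk)  # -> ['acme']
```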
Also, users don’t even need to think about the integrity or cleanliness of their data — PromptQL handles it for them: “Connect all data in whatever state of shittiness it is, and let shared context build up on the fly as you use it,” Gopal said.
For Fortune 500 companies like McDonald’s and Cisco, “just connect your data” is a terrifying sentence. PromptQL addresses this with fine-grained access control. The system enforces attribute-based policies at the infrastructure level. If a regional ops manager asks for vendor rates across all regions, the AI will redact columns or rows they aren’t authorized to see, even if the LLM “knows” the answer. Furthermore, any high-stakes action, such as updating 38 payment statuses in NetSuite, requires a human “Approve/Deny” sign-off before execution.
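A minimal sketch of what infrastructure-level attribute-based enforcement plus a human sign-off gate could look like. Every name here is hypothetical; PromptQL’s actual policy engine is not public:

```python
# Hypothetical names throughout; illustrative only.
ROW_POLICIES = {
    # A regional ops manager only sees rows for their own region.
    "regional_ops": lambda row, user: row["region"] == user["region"],
    "global_admin": lambda row, user: True,
}

def query_vendor_rates(rows, user):
    """Enforce policy below the AI: redaction happens before the LLM sees data."""
    allowed = ROW_POLICIES[user["role"]]
    return [r for r in rows if allowed(r, user)]

def execute_mutation(action, human_approved):
    """High-stakes writes are held until a human clicks Approve."""
    if not human_approved:
        raise PermissionError(f"held for approval: {action}")
    return f"executed: {action}"

rates = [{"region": "EMEA", "rate": 12.0}, {"region": "APAC", "rate": 9.5}]
manager = {"role": "regional_ops", "region": "EMEA"}
print(query_vendor_rates(rates, manager))  # only the EMEA row survives
```

The key design point is that the filter runs in the data layer, so even a prompt-injected or over-eager agent cannot see rows the requesting user could not.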
In a departure from the “per-seat” SaaS status quo, PromptQL is entirely consumption-based.
Pricing: The company uses “Operational Language Units” (OLUs).
Philosophy: Gopal argues that charging per seat penalizes companies for onboarding their whole team. By charging for the value created (the OLU), PromptQL encourages users to connect “everyone and everything”.
Enterprise Storage: While smaller teams use dedicated accounts, enterprise customers get a dedicated VPC. Any data the AI “saves” (like a custom to-do list) is stored in the customer’s own S3 bucket using the Iceberg format, ensuring total data sovereignty.
“Philosophically, we want you to connect everyone and everything [to PromptQL], so we don’t penalize that,” Gopal said. “We just price based on consumption.”
So, is PromptQL a Teams or Slack killer? According to Gopal, the answer is yes: “That is what has happened for us. We’ve shut down our internal Slack for internal comms entirely,” he said.
The launch comes at a pivot point for the industry. Companies are realizing that “chatting with a PDF” isn’t enough. They need AI that can act, but they can’t afford the security risks of “unsupervised” agents.
By building a workspace that prioritizes shared context and human-in-the-loop verification, PromptQL is offering a middle ground: an AI that learns like a teammate and executes like an intern, all while staying within the guardrails of enterprise security.
For enterprises focused on making AI work at scale, PromptQL addresses the critical “how” of implementation by providing the orchestration and operational layer needed to deploy agentic systems.
By replacing the “coordination theater” of traditional chat tools with a workspace where AI agents have the same permissions and context as human teammates, it enables seamless multi-agent coordination and task-routing. This allows decision-makers to move beyond simple model selection to a reality where agents—such as Claude Code—use shared team context to execute complex workflows, like fixing production bugs or updating CRM records, directly within active threads.
From a data infrastructure perspective, the platform simplifies the management of real-time pipelines and RAG-ready architectures by utilizing a virtual SQL layer that queries data “in place”. This eliminates the need for expensive, time-consuming data preparation and replication sprints across hundreds of thousands of tables in databases like Snowflake or Postgres.
Furthermore, the system’s “Shared Wiki” serves as a superior alternative to standard vector databases or prompt-based memory, capturing tribal knowledge organically and creating a living metadata store that informs every AI interaction with company-specific reasoning.
Finally, PromptQL addresses the security governance required for modern AI stacks by enforcing fine-grained, attribute-based access control and role-based permissions.
Through human-in-the-loop verification, it ensures that high-stakes actions and data mutations are held for explicit approval, protecting against model misuse and unauthorized data leakage.
While it does not assist with physical infrastructure tasks such as GPU cluster optimization or hardware procurement, it provides the necessary software guardrails and auditability to ensure that agentic workflows remain compliant with enterprise standards like SOC 2, HIPAA, and GDPR.
The enterprise voice AI market is in the middle of a land grab. ElevenLabs and IBM announced a collaboration just this week to bring premium voice capabilities into IBM’s watsonx Orchestrate platform. Google Cloud has been expanding its Chirp 3 HD voices. OpenAI continues to iterate on its own speech synthesis. And the market underpinning all of this activity is enormous — voice AI crossed $22 billion globally in 2026, with the voice AI agents segment alone projected to reach $47.5 billion by 2034, according to industry estimates.
On Thursday morning, Mistral AI entered that fight with a fundamentally different proposition. The Paris-based AI startup released Voxtral TTS, what it calls the first frontier-quality, open-weight text-to-speech model designed specifically for enterprise use. Where every major competitor in the space operates a proprietary, API-first business — enterprises rent the voice, they don’t own it — Mistral is releasing the full model weights, inviting companies to download Voxtral TTS, run it on their own servers or even on a smartphone, and never send a single audio frame to a third party.
It is a bet that the future of enterprise voice AI will not be shaped by whoever builds the best-sounding model, but by whoever gives companies the most control over it. And it arrives at a moment when Mistral, valued at $13.8 billion after a $2 billion Series C round led by Dutch chipmaker ASML last September, has been aggressively assembling the building blocks of a complete, enterprise-owned AI stack — from its Forge customization platform announced at Nvidia GTC earlier this month, to its AI Studio production infrastructure, to the Voxtral Transcribe speech-to-text model released just weeks ago.
Voxtral TTS is the output layer that completes that picture, giving enterprises a speech-to-speech pipeline they can run end-to-end without relying on any external provider.
“We see audio as a big bet and as a critical and maybe the only future interface with all the AI models,” Pierre Stock, Mistral’s vice president of science and the first employee hired at the company, said in an exclusive interview with VentureBeat. “This is something customers have been asking for.”
The technical specifications of Voxtral TTS read like a deliberate inversion of industry norms. Where most frontier TTS models are large and resource-intensive, Mistral built its model to be roughly three times smaller than what it calls the industry standard for comparable quality.
The architecture comprises three components: a 3.4-billion-parameter transformer decoder backbone, a 390-million-parameter flow-matching acoustic transformer, and a 300-million-parameter neural audio codec that Mistral developed in-house. The system is built on top of Ministral 3B, the same pretrained backbone that powers the company’s Voxtral Transcribe model — a design choice that Stock described as emblematic of Mistral’s culture of efficiency and artifact reuse.
In practice, the model achieves a time-to-first-audio of 90 milliseconds for a typical input and generates speech at approximately six times real-time speed. When quantized for inference, it requires roughly three gigabytes of RAM. Stock confirmed it can run on any laptop or smartphone, and even on older hardware it still operates in real time.
“It’s a 3B model, so it can basically run on any laptop or any smartphone,” Stock told VentureBeat. “If you quantize it to infer, it’s actually three gigabytes of RAM. And you can run it on super old chips — it’s still going to be real time.”
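Taking the published figures at face value (90 ms time-to-first-audio, roughly 6x real-time generation), a simple additive model gives a feel for end-to-end latency. The additive assumption is our simplification, not Mistral’s:

```python
# Back-of-envelope latency from the stated figures.
TTFA_S = 0.090       # time to first audio, seconds
REALTIME_FACTOR = 6  # seconds of audio produced per second of compute

def wall_clock_seconds(audio_seconds):
    """Rough total time to fully render a clip of the given length."""
    return TTFA_S + audio_seconds / REALTIME_FACTOR

# A 30-second reply fully renders in roughly 5.1 s, but because audio
# streams, the listener hears the first sound after just 90 ms.
print(round(wall_clock_seconds(30.0), 2))
```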
The model supports nine languages — English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic — and can adapt to a custom voice with as little as five seconds of reference audio. Perhaps more remarkably, it demonstrates zero-shot cross-lingual voice adaptation without explicit training for that task.
Stock illustrated this with a personal example: he can feed the model 10 seconds of his own French-accented voice, type a prompt in German, and the model will generate German speech that sounds like him — complete with his natural accent and vocal characteristics. For enterprises operating across borders, this capability unlocks cascaded speech-to-speech translation that preserves speaker identity, a feature that has obvious applications in customer support, sales, and internal communications for multinational organizations.
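The cascaded speech-to-speech pipeline Stock describes has three stages: transcribe, translate, then synthesize conditioned on a short voice reference. A skeleton with placeholder functions; none of these are real Mistral APIs:

```python
# Placeholder stages; illustrative only, not Mistral's actual interfaces.
def transcribe(audio):
    return "fr", "Bonjour à tous"            # (detected language, text)

def translate(text, src, dst):
    return {"Bonjour à tous": "Hallo zusammen"}[text]

def synthesize(text, lang, voice_ref):
    # A real TTS call would condition on ~5-10 s of reference audio
    # to preserve the original speaker's voice and accent.
    return f"<{lang} audio in reference voice: {text}>"

def speak_across_languages(audio, target_lang, voice_ref):
    src_lang, text = transcribe(audio)
    return synthesize(translate(text, src_lang, target_lang),
                      target_lang, voice_ref)

print(speak_across_languages(b"...", "de", b"<10s of speaker audio>"))
```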
Mistral is not being coy about which competitor it intends to displace. In human evaluations conducted by the company, Voxtral TTS achieved a 62.8 percent listener preference rate against ElevenLabs Flash v2.5 on flagship voices and a 69.9 percent preference rate in voice customization tasks. Mistral also claims the model performs at parity with ElevenLabs v3 — the company’s premium, higher-latency tier — on emotional expressiveness, while maintaining similar latency to the much faster Flash model.
The evaluation methodology involved a comparative side-by-side test across all nine supported languages. Using two recognizable voices in their native dialects for each language, three annotators performed preference tests on naturalness, accent adherence, and acoustic similarity to the original reference. Mistral says Voxtral TTS widened the quality gap to ElevenLabs v2.5 Flash especially in zero-shot multilingual custom voice settings, highlighting what the company calls the “instant customizability” of the model.
ElevenLabs remains widely regarded as the benchmark for raw voice quality. Its Eleven v3 model has been described by multiple independent reviewers as the gold standard for emotionally nuanced AI speech. But ElevenLabs operates as a closed platform with tiered subscription pricing that scales from around $5 per month at the starter level to over $1,300 per month for business plans. It does not release model weights.
Mistral’s pitch is that enterprises shouldn’t have to choose between quality and control — and that at scale, the economics of an open-weight model are dramatically more favorable.
“What we want to underline is that we’re faster and cheaper as well — and open source,” Stock told VentureBeat. “When something is open source and cheap, people adopt it and people build on it.”
He framed the cost argument in terms that resonate with CTOs managing AI budgets: “AI is a transformative technology, but it has a cost. When you want to scale and have impact on a large business, that cost matters. And what we allow is to scale seamlessly while minimizing the cost and maximizing the accuracy.”
To understand why Mistral is entering text-to-speech now, you have to understand the broader strategic architecture the company has been building for the past year. While OpenAI and Anthropic have captured the imagination of consumers, Mistral has quietly assembled what may be the most comprehensive enterprise AI platform in Europe — and increasingly, globally.
CEO Arthur Mensch has said the company is on track to surpass $1 billion in annual recurring revenue this year, according to TechCrunch’s reporting on the Forge launch. The Financial Times has reported that Mistral’s annualized revenue run rate surged from $20 million to over $400 million within a single year. That growth has been powered by more than 100 major enterprise customers and a consistent thesis: companies should own their AI infrastructure, not rent it.
Voxtral TTS is the latest expression of that thesis, applied to what may be the most sensitive category of enterprise data there is. Voice recordings capture not just words but emotion, identity, and intent. They carry legal, regulatory, and reputational weight that text data often does not. For industries like financial services, healthcare, and government — all key Mistral verticals — sending voice data to a third-party API introduces risks that many compliance teams are unwilling to accept.
Stock made the data sovereignty argument forcefully. “Since the models are open weights, we have no trouble and no problem actually giving the weights to the enterprise and helping them customize the models,” he said. “We don’t see the weights anymore. We don’t see the data. We see nothing. And you are fully controlled.”
That message has particular resonance in Europe, where concern about technological dependence on American cloud providers has intensified throughout 2026. The EU currently sources more than 80 percent of its digital services from foreign providers, most of them American. Mistral has positioned itself as the answer to that anxiety — the only European frontier AI developer with the scale and technical capability to offer a credible alternative.
Voxtral TTS is the final piece in a pipeline Mistral has been methodically assembling. Voxtral Transcribe handles speech-to-text. Mistral’s language models — from Mistral Small to Mistral Large — provide the reasoning layer. Forge allows enterprises to customize any of these models on their own data. AI Studio provides the production infrastructure for observability, governance, and deployment. And Mistral Compute offers the underlying GPU resources.
Together, these pieces form what Stock described as a “full AI stack, fully controllable and customizable” for the enterprise. Voice agents — AI systems that can listen to a customer, understand what they need, reason about the answer, and respond in natural-sounding speech — are the use case that ties all of these layers together.
The applications Mistral envisions span customer support, where voice agents can route and resolve queries with brand-appropriate speech; sales and marketing, where a single voice can work across markets through cross-lingual emulation; real-time translation for cross-border operations; and even interactive storytelling and game design, where emotion-steering can control tone and personality.
Stock was most animated when discussing how Voxtral TTS fits into the broader agentic AI trend that has dominated enterprise technology discussions in 2026. “We are totally building for a world in which audio is a natural interface, in particular for agents to which you can delegate work — extensions of yourself,” he said. He described a scenario in which a user starts planning a vacation on a computer, commutes to work, and then picks up the workflow on a phone simply by asking for an update by voice.
“To make that happen, you need a model you can trust, you need a model that’s super efficient and super cheap to run — otherwise you won’t use it for long — and you need a model that sounds super conversational and that you can interrupt at any time,” Stock said.
That emphasis on interruptibility and real-time responsiveness reflects a broader insight about voice interfaces that distinguishes them from text. A chatbot can take two or three seconds to respond without breaking the user experience. A voice agent cannot. The 90-millisecond time-to-first-audio that Voxtral TTS achieves is not just a benchmark number — it is the threshold between a voice interaction that feels natural and one that feels robotic.
Mistral’s decision to release Voxtral TTS with open weights is consistent with a movement that has been gathering momentum across the AI industry. At Nvidia GTC earlier this month, Nvidia CEO Jensen Huang declared that “proprietary versus open is not a thing — it’s proprietary and open.” Nvidia announced the Nemotron Coalition, a first-of-its-kind collaboration of model builders working to advance open frontier-level foundation models, with Mistral as a founding member. The first project from that coalition will be a base model codeveloped by Mistral AI and Nvidia.
For Mistral, open weights serve a dual commercial purpose. They drive adoption — developers and enterprises can experiment without friction or commitment — while the company monetizes through its platform services, customization offerings, and managed infrastructure. The model is available to test in Mistral Studio and through the company’s API, but the strategic play is to become embedded in enterprise voice pipelines as an owned asset, not a metered service.
This mirrors the playbook that worked for Mistral’s language models. As Mensch told CNBC in February, “AI is making us able to develop software at the speed of light,” predicting that “more than half of what’s currently being bought by IT in terms of SaaS is going to shift to AI.” He described a “replatforming” taking place across enterprise technology, with businesses looking to replace legacy software systems with AI-native alternatives. An open-weight voice model that enterprises can customize and deploy on their own terms fits naturally into that narrative.
When asked what comes after Voxtral TTS, Stock outlined two directions. The first is expanding language and dialect support, with particular attention to cultural nuance. “It’s not the same to speak French in Paris than to speak French in Canada, in Montreal,” he said. “We want to respect both cultures, and we want our models to perform in both contexts with all the cultural specifics.”
The second direction is more ambitious: a fully end-to-end audio model that doesn’t just generate speech from text but understands the complete spectrum of human vocal communication.
“We convey some meaning with the words we speak,” Stock said. “We actually convey way more with the intonation, the rhythm, and how we say it. When people talk about end-to-end audio, that’s what they mean — the model is able to pick up that you’re in a hurry, for instance, and will go for the fastest answer. The model will know that you’re joyful today and crack a joke. It’s super adaptive to you, and that’s where we want to go.”
That vision — an AI that speaks naturally, listens with nuance, responds with emotional intelligence, and runs on a model small enough to fit in your pocket — is the frontier every major AI lab is racing toward. For now, Voxtral TTS gives Mistral a foundation to build on and enterprises a question they haven’t had to answer before: if you could own your voice AI stack outright, at lower cost and with competitive quality, why would you keep renting someone else’s?
Enterprise data teams moving agentic AI into production are hitting a consistent failure point at the data tier. Agents built across a vector store, a relational database, a graph store and a lakehouse require sync pipelines to keep context current. Under production load, that context goes stale.
Oracle, whose database infrastructure runs the transaction systems of 97% of Fortune Global 100 companies by the company’s own count, is making a direct architectural argument that the database is the right place to fix that problem.
This week the company announced a set of agentic AI capabilities for Oracle AI Database, built around that counter-argument to the fragmented pattern.
The core of the release is the Unified Memory Core, a single ACID (Atomicity, Consistency, Isolation, and Durability)-transactional engine that processes vector, JSON, graph, relational, spatial and columnar data without a sync layer. Alongside that, Oracle announced Vectors on Ice for native vector indexing on Apache Iceberg tables, a standalone Autonomous AI Vector Database service and an Autonomous AI Database MCP Server for direct agent access without custom integration code.
The news isn’t just that Oracle is adding new features; it’s that the world’s largest database vendor is acknowledging that the AI world has changed in ways its namesake database wasn’t built for.
“As much as I’d love to tell you that everybody stores all their data in an Oracle database today — you and I live in the real world,” Maria Colgan, Oracle’s vice president of product management for mission-critical data and AI engines, told VentureBeat. “We know that that’s not true.”
Oracle’s release spans four interconnected capabilities. Together they form the architectural argument that a converged database engine is a better foundation for production agentic AI than a stack of specialized tools.
Unified Memory Core. Agents reasoning across multiple data formats simultaneously — vector, JSON, graph, relational, spatial — require sync pipelines when those formats live in separate systems. The Unified Memory Core puts all of them in a single ACID-transactional engine. Under the hood it is an API layer over the Oracle database engine, meaning ACID consistency applies across every data type without a separate consistency mechanism.
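The idea of one transactional engine answering across formats can be illustrated in miniature with SQLite plus user-defined functions: a single statement mixes a relational filter, a JSON lookup, and a vector ranking. This is a toy stand-in, not Oracle’s implementation:

```python
import json, math, sqlite3

con = sqlite3.connect(":memory:")

# UDFs stand in for native JSON and vector operators in a converged engine.
con.create_function("jget", 2, lambda doc, key: json.loads(doc).get(key))
con.create_function("l2", 2,
                    lambda a, b: math.dist(json.loads(a), json.loads(b)))

con.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, meta TEXT, emb TEXT)")
con.executemany("INSERT INTO docs VALUES (?, ?, ?)", [
    (1, '{"team": "payments"}', "[0.0, 1.0]"),
    (2, '{"team": "growth"}',   "[1.0, 0.0]"),
])

# One statement, one engine: relational filter + JSON lookup + vector ranking.
row = con.execute("""
    SELECT id FROM docs
    WHERE jget(meta, 'team') = 'payments'
    ORDER BY l2(emb, '[0.1, 0.9]')
    LIMIT 1
""").fetchone()
print(row)  # -> (1,)
```

Because everything runs in one engine and one transaction, there is no sync layer whose lag an agent could observe as stale context.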
“By having the memory live in the same place that the data does, we can control what it has access to the same way we would control the data inside the database,” Colgan explained.
Vectors on Ice. For teams running data lakehouse architectures on the open-source Apache Iceberg table format, Oracle now creates a vector index inside the database that references the Iceberg table directly. The index updates automatically as the underlying data changes and works with Iceberg tables that are managed by Databricks and Snowflake. Teams can combine Iceberg vector search with relational, JSON, spatial or graph data stored inside Oracle in a single query.
Autonomous AI Vector Database. A fully managed, free-to-start vector database service built on the Oracle 26ai engine. The service is designed as a developer entry point with a one-click upgrade path to full Autonomous AI Database when workload requirements grow.
Autonomous AI Database MCP Server. Lets external agents and MCP clients connect to Autonomous AI Database without custom integration code. Oracle’s row-level and column-level access controls apply automatically when an agent connects, regardless of what the agent requests.
“Even though you are making the same standard API call you would make with other platforms, the privileges that user has continue to kick in when the LLM is asking those questions,” Colgan said.
Oracle’s Autonomous AI Vector Database enters a market occupied by purpose-built vector services including Pinecone, Qdrant and Weaviate. The distinction Oracle is drawing is about what happens when vector alone is not enough.
“Once you are done with vectors, you do not really have an option,” Steve Zivanic, Global Vice President, Database and Autonomous Services, Product Marketing at Oracle, told VentureBeat. “With this, you can get graph, spatial, time series — whatever you may need. It is not a dead end.”
Holger Mueller, principal analyst at Constellation Research, said that the architectural argument is credible precisely because other vendors cannot make it without moving data first. Other database vendors require transactional data to move to a data lake before agents can reason across it. Oracle’s converged legacy, in his view, gives it a structural advantage that is difficult to replicate without a ground-up rebuild.
Not everyone sees the feature set as differentiated. Steven Dickens, CEO and principal analyst at HyperFRAME Research, told VentureBeat that vector search, RAG integration and Apache Iceberg support are now standard requirements across enterprise databases — Postgres, Snowflake and Databricks all offer comparable capabilities.
“Oracle’s move to label the database itself as an AI Database is primarily a rebranding of its converged database strategy to match the current hype cycle,” Dickens said. In his view the real differentiation Oracle is claiming is not at the feature level but at the architectural level — and the Unified Memory Core is where that argument either holds or falls apart.
The four capabilities Oracle shipped this week are a response to a specific and well-documented production failure mode. Enterprise agent deployments are not breaking down at the model layer. They are breaking down at the data layer, where agents built across fragmented systems hit sync latency, stale context and inconsistent access controls the moment workloads scale.
Matt Kimball, vice president and principal analyst at Moor Insights and Strategy, told VentureBeat the data layer is where production constraints surface first.
“The struggle is running them in production,” Kimball said. “The gap is seen almost immediately at the data layer — access, governance, latency and consistency. These all become constraints.”
Dickens frames the core mismatch as a stateless-versus-stateful problem. Most enterprise agent frameworks store memory as a flat list of past interactions, which means agents are effectively stateless while the databases they query are stateful. The lag between the two is where decisions go wrong.
“Data teams are exhausted by fragmentation fatigue,” Dickens said. “Managing a separate vector store, graph database and relational system just to power one agent is a DevOps nightmare.”
That fragmentation is precisely what Oracle’s Unified Memory Core is designed to eliminate. The control plane question follows directly.
“In a traditional application model, control lives in the app layer,” Kimball said. “With agentic systems, access control breaks down pretty quickly because agents generate actions dynamically and need consistent enforcement of policy. By pushing all that control into the database, it can all be applied in a more uniform way.”
The question of where control lives in an enterprise agentic AI stack is not settled.
Most organizations are still building across fragmented systems, and the architectural decisions being made now — which engine anchors agent memory, where access controls are enforced, how lakehouse data gets pulled into agent context — will be difficult to undo at scale.
The distributed data challenge is still the real test.
“Data is increasingly distributed across SaaS platforms, lakehouses and event-driven systems, each with its own control plane and governance model,” Kimball said. “The opportunity now is extending that model across the broader, more distributed data estates that define most enterprise environments today.”
Engineers building browser agents today face a choice between closed APIs they cannot inspect and open-weight frameworks with no trained model underneath them. Ai2 is now offering a third option.
The Seattle-based nonprofit behind the open-source OLMo language models and Molmo vision-language family today is releasing MolmoWeb, an open-weight visual web agent available in 4 billion and 8 billion parameter sizes.
Until now, no open-weight visual web agent shipped with the training data and pipeline needed to audit or reproduce it. MolmoWeb does.
MolmoWebMix, the accompanying dataset, includes 30,000 human task trajectories across more than 1,100 websites, 590,000 individual subtask demonstrations and 2.2 million screenshot question-answer pairs — which Ai2 describes as the largest publicly released collection of human web-task execution ever assembled.
“Can you go from just passively understanding images, describing them and captioning them, to actually making them take action in some environment?” Tanmay Gupta, senior research scientist at Ai2, told VentureBeat. “That is exactly what MolmoWeb is.”
MolmoWeb operates entirely from browser screenshots. It does not parse HTML or rely on accessibility tree representations of a page. At each step it receives a task instruction, the current screenshot, a text log of previous actions and the current URL and page title. It produces a natural-language thought describing its reasoning, then executes the next browser action — clicking at screen coordinates, typing text, scrolling, navigating to a URL or switching tabs.
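The per-step loop just described can be sketched in a few lines. This is an illustrative reconstruction of the interface the article implies, not Ai2's code: the `Observation` fields match what the article says the model receives, while the model call, browser API and action vocabulary are stand-in assumptions.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    task: str            # natural-language task instruction
    screenshot: bytes    # raw pixels of the current page (no HTML, no a11y tree)
    action_log: list     # text log of previous actions
    url: str
    title: str

@dataclass
class Step:
    thought: str         # model's natural-language reasoning
    action: dict         # e.g. {"type": "click", "x": 412, "y": 83}

def run_agent(model, browser, task: str, max_steps: int = 30) -> list[Step]:
    """Observe-think-act loop: screenshot in, one browser action out, repeat."""
    history: list[Step] = []
    for _ in range(max_steps):
        obs = Observation(
            task=task,
            screenshot=browser.screenshot(),
            action_log=[s.action for s in history],
            url=browser.url(),
            title=browser.title(),
        )
        step = model.predict(obs)          # hypothetical model interface
        history.append(step)
        if step.action["type"] == "done":  # model signals task completion
            break
        browser.execute(step.action)       # click/type/scroll/navigate/switch tab
    return history
```

Because the loop needs only a screenshot provider and an action executor, any browser backend (local Chrome, Safari or a hosted service) can slot in behind the `browser` object, which is the browser-agnosticism the next paragraph describes.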
The model is browser-agnostic. It requires only a screenshot, which means it runs against local Chrome, Safari or a hosted browser service. The hosted demo uses Browserbase, a cloud browser infrastructure startup.
The model weights are only part of what Ai2 is releasing. MolmoWebMix, the accompanying training dataset, is the core differentiator from every other open-weight agent available today.
“The data basically looks like a sequence of screenshots and actions paired with instructions for what the intent behind that sequence of screenshots was,” Gupta said.
MolmoWebMix combines three components.
Human demonstrations. Human annotators completed browsing tasks using a custom Chrome extension that recorded actions and screenshots across more than 1,100 websites. The result is 30,000 task trajectories spanning more than 590,000 individual subtask demonstrations.
Synthetic trajectories. To scale beyond what human annotation alone can provide, Ai2 generated additional trajectories using text-based accessibility-tree agents — single-agent runs filtered for task success, multi-agent pipelines that decompose tasks into subgoals and deterministic navigation paths across hundreds of websites. Critically, no proprietary vision agents were used. The synthetic data came from text-only systems, not from OpenAI Operator or Anthropic’s computer use API.
GUI perception data. A third component trains the model to read and reason about page content directly from images. It includes more than 2.2 million screenshot question-answer pairs drawn from nearly 400 websites, covering element grounding and screenshot-based reasoning tasks.
“If you are able to perform a task and you’re able to record a trajectory from that, you should be able to train the web agent on that trajectory to do the exact same task,” Gupta said.
In Gupta’s view, there are two categories of technologies in the browser agent market.
The first is API-only systems, capable but closed, with no visibility into training or architecture. OpenAI Operator, Anthropic’s computer use API and Google’s Gemini computer use fall into this group.
The second is open-weight models, a significantly smaller category. Browser-use, the most widely adopted open alternative, is a framework rather than a trained model. It requires developers to supply their own LLM and build the agent layer on top.
MolmoWeb sits in the second category as a fully trained open-weight vision model. Ai2 reports it leads that group across four live-website benchmarks: WebVoyager, Online-Mind2Web, DeepShop and WebTailBench. According to Ai2, it also outperforms older API-based agents built on GPT-4o with accessibility tree plus screenshot input.
Ai2 documents several current limitations in the release. The model makes occasional errors reading text from screenshots, drag-and-drop interactions remain unreliable and performance degrades on ambiguous or heavily constrained instructions. The model was also not trained on tasks requiring logins or financial transactions.
Enterprise teams evaluating browser agents are not just choosing a model. They are deciding whether they can audit what they are running, fine-tune it on internal workflows, and avoid a per-call API dependency.
Getting AI agents to perform reliably in production — not just in demos — is turning out to be harder than enterprises anticipated. Fragmented data, unclear workflows, and runaway escalation rates are slowing deployments across industries.
“The technology itself often works well in demonstrations,” said Sanchit Vir Gogia, chief analyst with Greyhound Research. “The challenge begins when it is asked to operate inside the complexity of a real organization.”
Burley Kawasaki, who oversees agent deployment at Creatio, and his team have developed a methodology built around three disciplines: data virtualization to work around data lake delays; agent dashboards and KPIs as a management layer; and tightly bounded use-case loops to drive toward high autonomy.
In simpler use cases, Kawasaki says these practices have enabled agents to handle up to 80-90% of tasks on their own. With further tuning, he estimates they could support autonomous resolution in at least half of use cases, even in more complex deployments.
“People have been experimenting a lot with proof of concepts, they’ve been putting a lot of tests out there,” Kawasaki told VentureBeat. “But now in 2026, we’re starting to focus on mission-critical workflows that drive either operational efficiencies or additional revenue.”
Enterprises are eager to adopt agentic AI in some form or another, often out of fear of being left behind rather than because they have identified tangible real-world use cases, but they run into significant bottlenecks around data architecture, integration, monitoring, security and workflow design.
The first obstacle almost always has to do with data, Gogia said. Enterprise information rarely exists in a neat or unified form; it is spread across SaaS platforms, apps, internal databases, and other data stores. Some are structured, some are not.
But even when enterprises overcome the data retrieval problem, integration is a big challenge. Agents rely on APIs and automation hooks to interact with applications, but many enterprise systems were designed long before this kind of autonomous interaction was a reality, Gogia pointed out.
This can result in incomplete or inconsistent APIs, and systems can respond unpredictably when accessed programmatically. Organizations also run into snags when they attempt to automate processes that were never formally defined, Gogia said.
“Many business workflows depend on tacit knowledge,” he said. That is, employees know how to resolve exceptions they have seen before without explicit instructions, but those missing rules and instructions become startlingly obvious when workflows are translated into automation logic.
Creatio deploys agents in a “bounded scope with clear guardrails,” followed by an “explicit” tuning and validation phase, Kawasaki explained. Teams review initial outcomes, adjust as needed, then re-test until they’ve reached an acceptable level of accuracy.
That loop typically follows this pattern:
Design-time tuning (before go-live): Performance is improved through prompt engineering, context wrapping, role definitions, workflow design, and grounding in data and documents.
Human-in-the-loop correction (during execution): Devs approve, edit, or resolve exceptions. In the areas where humans intervene most (escalation or approval), users establish stronger rules, provide more context, update workflow steps, or narrow tool access.
Ongoing optimization (after go-live): Devs continue to monitor exception rates and outcomes, then tune repeatedly as needed, helping to improve accuracy and autonomy over time.
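The bounded-scope and exception-monitoring ideas above can be sketched as a simple routing gate. This is a hypothetical illustration, not Creatio's implementation: the allowed-action set, confidence floor and escalation metric are all assumptions standing in for a real guardrail policy.

```python
ALLOWED_ACTIONS = {"send_renewal_email", "update_crm_field"}  # bounded scope
CONFIDENCE_FLOOR = 0.85                                       # illustrative threshold

decisions = []  # every routing decision, the raw material for agent dashboards

def route(action: str, confidence: float) -> str:
    """Return 'auto' to execute autonomously or 'escalate' for human review."""
    in_scope = action in ALLOWED_ACTIONS and confidence >= CONFIDENCE_FLOOR
    verdict = "auto" if in_scope else "escalate"
    decisions.append({"action": action, "confidence": confidence,
                      "verdict": verdict})
    return verdict

def exception_rate() -> float:
    """Share of actions escalated to a human, the post-go-live tuning signal."""
    if not decisions:
        return 0.0
    return sum(d["verdict"] == "escalate" for d in decisions) / len(decisions)
```

Tuning in this model means widening `ALLOWED_ACTIONS` or lowering the floor as review outcomes accumulate, which is how early exception spikes decay into higher autonomy over time.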
Kawasaki’s team applies retrieval-augmented generation to ground agents in enterprise knowledge bases, CRM data, and other proprietary sources.
Once agents are deployed in the wild, they are monitored with a dashboard providing performance analytics, conversion insights, and auditability. Essentially, agents are treated like digital workers. They have their own management layer with dashboards and KPIs.
For instance, an onboarding agent will be incorporated as a standard dashboard interface providing agent monitoring and telemetry. This is part of the platform layer — orchestration, governance, security, workflow execution, monitoring, and UI embedding — that sits “above the LLM,” Kawasaki said.
Users see a dashboard of agents in use and each of their processes, workflows, and executed results. They can “drill down” into an individual record (like a referral or renewal) that shows a step-by-step execution log and related communications to support traceability, debugging, and agent tweaking. The most common adjustments involve logic and incentives, business rules, prompt context, and tool access, Kawasaki said.
The biggest issues that come up post-deployment:
Exception handling volume can be high: Early spikes in edge cases often occur until guardrails and workflows are tuned.
Data quality and completeness: Missing or inconsistent fields and documents can cause escalations; teams can identify which data to prioritize for grounding and which checks to automate.
Auditability and trust: Regulated customers, particularly, require clear logs, approvals, role-based access control (RBAC), and audit trails.
“We always explain that you have to allocate time to train agents,” Creatio’s CEO Katherine Kostereva told VentureBeat. “It doesn’t happen immediately when you switch on the agent, it needs time to understand fully, then the number of mistakes will decrease.”
When looking to deploy agents, “Is my data ready?” is a common early question. Enterprises know data access is important, but they can be put off by the prospect of a massive data consolidation project.
But virtual connections can allow agents access to underlying systems and get around typical data lake/lakehouse/warehouse delays. Kawasaki’s team built a platform that integrates with data, and is now working on an approach that will pull data into a virtual object, process it, and use it like a standard object for UIs and workflows. This way, they don’t have to “persist or duplicate” large volumes of data in their database.
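The virtual-object pattern just described might be sketched as a lazy wrapper around the source system. The class and fetch interface here are illustrative assumptions, not Creatio's code: the point is that records are read live and held only per-session, never persisted or duplicated into the platform's own database.

```python
class VirtualObject:
    """On-demand view over a source system: fetch live, cache per session,
    persist nothing. A stand-in sketch, not a vendor implementation."""

    def __init__(self, fetch):
        self._fetch = fetch    # callable hitting the source system's API
        self._cache = {}       # small in-memory session cache only

    def get(self, record_id):
        if record_id not in self._cache:
            self._cache[record_id] = self._fetch(record_id)  # live read
        return self._cache[record_id]
```

For a banking workload, `fetch` would wrap the transaction system's API, so millions of rows stay where they are while agents and workflows still address individual records like local objects.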
This technique can be helpful in areas like banking, where transaction volumes are simply too large to copy into CRM, but are “still valuable for AI analysis and triggers,” Kawasaki said.
Once integrations and virtual objects are established, teams can evaluate data completeness, consistency, and availability, and identify low-friction starting points (like document-heavy or unstructured workflows).
Kawasaki emphasized the importance of “really using the data in the underlying systems, which tends to actually be the cleanest or the source of truth anyway.”
The best fit for autonomous (or near-autonomous) agents are high-volume workflows with “clear structure and controllable risk,” Kawasaki said. For instance, document intake and validation in onboarding or loan preparation, or standardized outreach like renewals and referrals.
“Especially when you can link them to very specific processes inside an industry — that’s where you can really measure and deliver hard ROI,” he said.
For instance, financial institutions are often siloed by nature. Commercial lending teams perform in their own environment, wealth management in another. But an autonomous agent can look across departments and separate data stores to identify, for instance, commercial customers who might be good candidates for wealth management or advisory services.
“You think it would be an obvious opportunity, but no one is looking across all the silos,” Kawasaki said. Some banks that have applied agents to this very scenario have seen “benefits of millions of dollars of incremental revenue,” he claimed, without naming specific institutions.
However, in other cases — particularly in regulated industries — longer-context agents are not only preferable, but necessary. For instance, in multi-step tasks like gathering evidence across systems, summarizing, comparing, drafting communications, and producing auditable rationales.
“The agent isn’t giving you a response immediately,” Kawasaki said. “It may take hours, days, to complete full end-to-end tasks.”
This requires orchestrated agentic execution rather than a “single giant prompt,” he said. This approach breaks work down into deterministic steps to be performed by sub-agents. Memory and context management can be maintained across various steps and time intervals. Grounding with RAG can help keep outputs tied to approved sources, and users have the ability to dictate expansion to file shares and other document repositories.
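A minimal sketch of that orchestration pattern: the task is a fixed sequence of deterministic steps, each handled by a sub-agent (plain functions here), with a shared memory object carrying context and an audit trail between steps. The step names and memory schema are hypothetical, chosen to mirror the evidence-gathering example above.

```python
def gather_evidence(memory):
    memory["evidence"] = ["doc_a", "doc_b"]   # stand-in for cross-system queries

def summarize(memory):
    memory["summary"] = f"{len(memory['evidence'])} documents reviewed"

def draft_rationale(memory):
    memory["rationale"] = f"Based on {memory['summary']}, recommend approval."

PIPELINE = [gather_evidence, summarize, draft_rationale]  # deterministic order

def run_pipeline():
    memory = {"audit_log": []}        # context persisted across steps and time
    for step in PIPELINE:
        step(memory)
        memory["audit_log"].append(step.__name__)  # auditable rationale trail
    return memory
```

In a real deployment each step could pause for hours awaiting a human checkpoint, and each sub-agent could internally call an LLM grounded via RAG; the orchestrator, not a single giant prompt, owns sequencing and memory.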
This model typically doesn’t require custom retraining or a new foundation model. Whatever model enterprises use (GPT, Claude, Gemini), performance improves through prompts, role definitions, controlled tools, workflows, and data grounding, Kawasaki said.
The feedback loop puts “extra emphasis” on intermediate checkpoints, he said. Humans review intermediate artifacts (such as summaries, extracted facts, or draft recommendations) and correct errors. Those can then be converted into better rules and retrieval sources, narrower tool scopes, and improved templates.
“What is important for this style of autonomous agent, is you mix the best of both worlds: The dynamic reasoning of AI, with the control and power of true orchestration,” Kawasaki said.
Ultimately, agents require coordinated changes across enterprise architecture, new orchestration frameworks, and explicit access controls, Gogia said. Agents must be assigned identities to restrict their privileges and keep them within bounds. Observability is critical; monitoring tools can record task completion rates, escalation events, system interactions, and error patterns. This kind of evaluation must be a permanent practice, and agents should be tested to see how they react when encountering new scenarios and unusual inputs.
“The moment an AI system can take action, enterprises have to answer several questions that rarely appear during copilot deployments,” Gogia said. Such as: What systems is the agent allowed to access? What types of actions can it perform without approval? Which activities must always require a human decision? How will every action be recorded and reviewed?
“Those [enterprises] that underestimate the challenge often find themselves stuck in demonstrations that look impressive but cannot survive real operational complexity,” Gogia said.
Voice AI is moving faster than the tools we use to measure it. Every major AI lab — OpenAI, Google DeepMind, Anthropic, xAI — is racing to ship voice models capable of natural, real-time conversation.
But the benchmarks used to evaluate those models are largely still running on synthetic speech, English-only prompts, and scripted test sets that bear little resemblance to how people actually talk.
Scale AI, the large data annotation startup whose founder was poached by Meta last year to lead its Superintelligence Lab, is still going strong and is tackling the problem head-on: today it launches Voice Showdown, which it calls the first global preference-based arena designed to benchmark voice AI through the lens of real human interaction.
This product offers a unique strategic value to users: free access to the world’s leading frontier models. Through Scale’s ChatLab platform, users can interact with high-tier models—which typically require multiple $20-per-month subscriptions—at no cost. In exchange, users participate in occasional blind, head-to-head “battles” to choose which of two anonymized leading voice models offers a better experience, providing data for the industry’s most authentic, human-preference leaderboard of voice AI models.
“Voice AI is really the fastest moving frontier in AI right now,” said Janie Gu, product manager for Showdown at Scale AI. “But the way that we evaluate voice models hasn’t kept up.”
The results, drawn from thousands of spontaneous voice conversations across more than 60 languages, reveal capability gaps that other benchmarks have consistently missed.
Voice Showdown is built on ChatLab, Scale’s model-agnostic chat platform where users can freely interact with whichever frontier AI model they choose — for free — within a single app. The platform has been available to Scale’s global community of over 500,000 annotators, with roughly 300,000 having submitted at least one prompt. Scale is opening the platform to a public waitlist today.
The evaluation mechanism is elegant in its simplicity: while a user is having a natural voice conversation with a model, the system occasionally — on fewer than 5% of all voice prompts — surfaces a blind side-by-side comparison. The same prompt is sent to a second, anonymous model, and the user picks which response they prefer.
This design solves three problems that plague existing voice benchmarks.
First, every prompt comes from real human speech — with accents, background noise, half-finished sentences, and conversational filler — rather than synthesized audio generated from text.
Second, the platform spans more than 60 languages across 6 continents, with over a third of battles occurring in non-English languages including Spanish, Arabic, Japanese, Portuguese, Hindi, and French.
Third, because battles occur within users’ actual daily conversations, 81% of prompts are conversational or open-ended — questions without a single correct answer. That rules out automated scoring and makes human preference the only credible signal.
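The battle-sampling mechanism described above can be sketched as follows. This is an illustrative reconstruction from the article, not Scale's code: the 5% rate matches the stated figure, while the model-registry interface and injectable randomness are assumptions for the sketch.

```python
import random

BATTLE_RATE = 0.05   # fewer than 5% of voice prompts become blind battles

def maybe_battle(prompt, current_model, registry, rng=random.random,
                 pick=random.choice):
    """Return (response, battle): battle is None on most turns; occasionally it
    is a shuffled pair of anonymized responses for a blind side-by-side vote."""
    response = current_model(prompt)
    if rng() >= BATTLE_RATE:
        return response, None                 # ordinary conversational turn
    rivals = [m for m in registry if m is not current_model]
    pair = [response, pick(rivals)(prompt)]   # same prompt to a second model
    random.shuffle(pair)                      # blind: order reveals nothing
    return response, pair
```

Because the prompt comes from an already-in-progress conversation, every comparison inherits the accents, noise and filler of real speech, which is the property the three design points above depend on.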
Voice Showdown currently runs two evaluation modes: Dictate (users speak, models respond with text) and Speech-to-Speech, or S2S (users speak, models talk back). A third mode, Full Duplex, which captures real-time, interruptible conversation, is in development.
One design detail sets Voice Showdown apart from Chatbot Arena (LM Arena), the text benchmark it most closely resembles. In LM Arena, critics have noted that users sometimes cast throwaway votes with little stake in the outcome. Voice Showdown addresses this directly: after a user votes for the model they preferred, the app switches them to that model for the rest of their conversation. If you voted for GPT-4o Audio over Gemini, you’re now talking to GPT-4o Audio. That alignment of consequence with preference discourages casual or dishonest voting.
The system also controls for confounds that could corrupt comparisons: both model responses begin streaming simultaneously (eliminating speed bias), voice gender is matched across both options (eliminating gender preference bias), and neither model is identified by name during voting.
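The preference votes feed Elo-style ratings like those on the leaderboards below. Arena leaderboards typically fit something closer to a Bradley-Terry model with style controls; the classic pairwise Elo update below is a simplified stand-in showing how a single vote moves two scores.

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    """One standard Elo update after a head-to-head preference vote."""
    # Probability the winner was expected to win, given the rating gap.
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)   # upsets move ratings more
    return r_winner + delta, r_loser - delta
```

Two evenly rated models (expected win probability 0.5) each move by k/2 = 16 points, while beating a much higher-rated model yields a larger swing, so rankings converge toward stable gaps as votes accumulate.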
Voice Showdown launches with 11 frontier models evaluated across 52 model-voice pairs as of March 18, 2026. Not all models support both evaluation modes — the Dictate leaderboard includes 8 models, while S2S includes 6.
Dictate Leaderboard (Speech-In, Text-Out)
In this mode, users provide a spoken prompt and evaluate two side-by-side text responses. Here are the baseline scores:
Gemini 3 Pro (1073)
Gemini 3 Flash (1068)
GPT-4o Audio (1019)
Qwen 3 Omni (1000)
Voxtral Small (925)
Gemma 3n (918)
GPT Realtime (875)
Phi-4 Multimodal (729)
Note: Gemini 3 Pro and Gemini 3 Flash are statistically tied for the top rank.
Speech-to-Speech (S2S) Leaderboard
In this mode, users speak to the model and evaluate two competing audio responses. Again, the baseline scores:
Gemini 2.5 Flash Audio (1060)
GPT-4o Audio (1059)
Grok Voice (1024)
Qwen 3 Omni (1000)
GPT Realtime (962)
GPT Realtime 1.5 (920)
Note: Gemini 2.5 Flash Audio and GPT-4o Audio are statistically tied for the top rank in baseline evaluations.
Dictate rankings are led by Google’s Gemini 3 Pro and Gemini 3 Flash, which are statistically tied at #1 with Elo scores around 1,043-1,044 after style controls.
GPT-4o Audio holds a clear third place. Open-weight models including Gemma 3n, Voxtral Small and Phi-4 Multimodal trail significantly.
Speech-to-Speech (S2S) rankings show a tighter race at the top, with Gemini 2.5 Flash Audio and GPT-4o Audio statistically tied at #1 in the baseline rankings.
After adjusting for response length and formatting — factors that can inflate perceived quality — GPT-4o Audio pulls ahead (1,102 Elo vs. 1,075 for Gemini 2.5 Flash Audio).
Grok Voice jumps to a close second at 1,093 under style controls, suggesting its raw #3 ranking undersells its actual performance quality.
Qwen 3 Omni, the open-weight model from Alibaba’s Qwen team, performs better on pure preference than its popularity would suggest — ranking fourth in both modes, ahead of several higher-profile names.
“When people come in, they go for the big names,” Gu noted. “But for preference, lesser-known models like Qwen actually pull ahead.”
Beyond rankings, Voice Showdown’s real value is in the failure diagnostics — and those paint a more complicated picture of voice AI than most leaderboards reveal.
The multilingual gap is worse than you think
Language robustness is the starkest differentiator across models. In Dictate, Gemini 3 models lead across essentially every language tested.
In S2S, the winner depends heavily on which language is being spoken: GPT-4o Audio leads in Arabic and Turkish; Gemini 2.5 Flash Audio is strongest in French; Grok Voice is competitive in Japanese and Portuguese.
But the more alarming finding is how frequently some models simply stop responding in the user’s language at all.
GPT Realtime 1.5 — OpenAI’s newer real-time voice model — responds in English to non-English prompts roughly 20% of the time, even on high-resource, officially supported languages like Hindi, Spanish, and Turkish.
Its predecessor, GPT Realtime, mismatches at about half that rate (~10%). Gemini 2.5 Flash Audio and GPT-4o Audio sit at ~7%.
The phenomenon runs both directions: some models carry non-English context from earlier in a conversation into an English turn, or simply mishear a prompt and generate an unrelated response in the wrong language entirely.
User verbatims from the platform capture the frustration bluntly: “I said I have an interview today with Quest Management and instead of answering, it gave me information about ‘Risk Management.'”
“GPT Realtime 1.5 thought I was speaking incoherently and recommended mental health assistance, while Qwen 3 Omni correctly identified I was speaking a Nigerian local language.”
The reason existing benchmarks miss this: they’re built on synthetic speech optimized for clean acoustic conditions, and they’re rarely multilingual. Real speakers in real environments — with background noise, short utterances, and regional accents — break speech understanding in ways lab conditions don’t anticipate.
Voice Showdown evaluates models not just at the model level but at the individual voice level — and the variance within a single model’s voice catalog is striking.
For one unnamed model in the study, the best-performing voice won 30 percentage points more often than the worst-performing voice from the same underlying model. Both voices share the same reasoning and generation backend. The difference is purely in audio presentation.
The top-performing voices tend to win or lose on audio understanding and content completeness — whether the model heard you correctly and answered fully. But speech quality remains a deciding factor at the voice selection level, particularly when models are otherwise comparable. “Voice directly shapes how users evaluate the interaction,” Gu said.
Most benchmarks test a single turn. Voice Showdown tests how models hold up across extended conversations — and the results aren’t flattering.
On Turn 1, content quality accounts for 23% of model failures. By Turn 11 and beyond, it becomes the primary failure mode at 43%. Most models see their win rates decline as conversations extend, struggling to maintain coherence across multiple exchanges.
GPT Realtime variants are an exception, marginally improving on later turns — consistent with their known strengths on longer contexts, and their documented weakness on the brief, noisy utterances that dominate early interactions.
Prompt length shows a complementary pattern: short prompts (under 10 seconds) are dominated by audio understanding failures (38%), while long prompts (over 40 seconds) shift the primary failure toward content quality (31%). Shorter audio gives models less acoustic context to parse; longer requests are understood but harder to answer well.
After every S2S comparison, users tag why they preferred one response over the other across three axes: audio understanding, content quality, and speech output. The failure signatures differ meaningfully by model.
Qwen 3 Omni’s losses cluster around speech generation — its reasoning is competitive, but users are put off by how it sounds. GPT Realtime 1.5’s losses are dominated by audio understanding failures (51%), consistent with its language-switching behavior on challenging prompts. Grok Voice’s failures are more balanced across all three axes, indicating no single dominant weakness but no particular strength either.
The current leaderboard covers turn-based interaction — you speak, the model responds, repeat. But real voice conversations don’t work that way. People interrupt, change direction mid-sentence, and talk over each other.
Scale says Full Duplex evaluation — designed to capture these real-time dynamics through human preference rather than scripted scenarios or automated metrics — is coming to Showdown next. No existing benchmark captures full-duplex interaction through organic human preference data.
The leaderboard is live at scale.com/showdown. A public waitlist to join ChatLab and vote on comparisons is open today, with users receiving free access to frontier voice models including GPT-4o, Gemini, and Grok in exchange for occasional preference votes.
In 2026, data engineers working with multi-agent systems are hitting a familiar problem: Agents built on different platforms don’t operate from a shared understanding of the business. The result isn’t model failure — it’s hallucination driven by fragmented context.
Agents built on different platforms, by different teams, do not share a common understanding of how the business actually operates. Each one carries its own interpretation of what a customer, an order or a region means. When those definitions diverge across a workforce of agents, decisions break down.
A set of announcements from Microsoft this week directly targets that problem. The centerpiece is a significant expansion of Fabric IQ, the semantic intelligence layer the company debuted in November 2025. Fabric IQ’s business ontology is now accessible via MCP to any agent from any vendor, not just Microsoft’s. Alongside that, Microsoft is adding enterprise planning to Fabric IQ, unifying historical data, real-time signals and formal organizational goals in one queryable layer. The new Database Hub brings Azure SQL, Cosmos DB, PostgreSQL, MySQL and SQL Server under a single management plane inside Fabric. Fabric data agents reach general availability.
The overall goal is a unified platform where all data and semantics are available and accessible by any agent to get the context that enterprises require.
Amir Netz, CTO of Microsoft Fabric, reached for a film analogy to explain why the shared context layer matters. “It’s a little bit like the girl from 50 First Dates,” Netz told VentureBeat. “Every morning they wake up and they forget everything and you have to explain it again. This is the explanation that you give them every morning.”
Making the ontology MCP-accessible is the step that moves Fabric IQ from a Fabric-specific feature into shared infrastructure for multi-vendor agent deployments. Netz was explicit about the design intent.
“It doesn’t really matter whose agent it is, how it was built, what the role is,” Netz said. “There’s certain common knowledge, certain common context that all the agents will share.”
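What a vendor-agnostic call into the ontology might look like on the wire: MCP is a JSON-RPC 2.0 protocol and `tools/call` is its real method for invoking a server-side tool, but the tool name (`query_ontology`) and its arguments are hypothetical; Microsoft's actual Fabric IQ tool schema is not described in this article.

```python
import json

def build_ontology_request(entity: str, question: str, request_id: int = 1) -> str:
    """Assemble a JSON-RPC 2.0 request an MCP client would send to a server
    exposing a (hypothetical) ontology-query tool."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",                   # real MCP tool-invocation method
        "params": {
            "name": "query_ontology",             # hypothetical tool name
            "arguments": {"entity": entity, "question": question},
        },
    }
    return json.dumps(payload)
```

The point of the sketch is the shape, not the endpoint: any agent that speaks MCP, regardless of vendor or framework, can issue the same request and receive the same shared definitions back.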
That shared context is also where Netz draws a clear line between what the ontology does and what RAG does. He did not dismiss retrieval-augmented generation as a technique — he placed it specifically. RAG handles large document bodies such as regulations, company handbooks and technical documentation, where on-demand retrieval is more practical than loading everything into context.
“We don’t expect humans to remember everything by heart,” he said. “When somebody asks a question, you have to know to go and do a little bit of a search, find the right relevant part and bring it back.”
But RAG does not solve for real-time business state, he argued. It does not tell an agent which planes are in the air right now, whether a crew has enough rest hours, or what the current priority is on a given product line.
“The mistake of the past was they thought one technology can just give you everything,” Netz said. “The cognitive model of the agents is similar to humans. You have to have things that are available out of memory, things that are available on demand, things that are constantly observed and detected in real time.”
Industry analysts see the logic behind Microsoft’s direction but have questions about what comes next.
Robert Kramer, analyst at Moor Insights and Strategy, noted that Microsoft’s broad stack gives it a structural advantage in the race to become the default platform for enterprise agent deployments.
“Fabric ties into Power BI, Microsoft 365, Dynamics and Azure services. That gives Microsoft a natural path to connect enterprise data with business users, operational workflows and now AI systems operating across that environment,” he said. The trade-off, Kramer said, is that Microsoft is competing across a wider surface area than Databricks or Snowflake, which built their reputations on depth of the data platform itself.
The more immediate question for data teams, Kramer said, is whether MCP access actually reduces integration work.
“Most enterprises do not operate in a single AI environment. Finance might be using one set of tools, engineering another, supply chain something else,” Kramer told VentureBeat. “If Fabric IQ can act as a common data context layer those agents can access, it starts to reduce some of the fragmentation that typically shows up around enterprise data.”
But, he said, “If it just adds another protocol that still requires a lot of engineering work, adoption will be slower.”
Whether the engineering work is the harder problem is open to debate. Independent analyst Sanjeev Mohan told VentureBeat that the bigger challenge is organizational, not technical.
“I don’t think they fully understand the implications yet,” he said of enterprise data teams. “This is a classical capabilities overhang — capabilities are expanding faster than people’s imagination to use them. The harder work will be ensuring that the context layer is reliable and trustworthy.”
Holger Mueller, principal analyst at Constellation Research, sees MCP as the right mechanism but urges caution on execution.
“For enterprise to benefit from AI, they need to get access to their data — that is in many places unorganized, siloed — and they want that in a way that makes it easy for AI in a standard way to get there. That is what MCP does,” Mueller told VentureBeat. “The devil is in the details. How good is the access, how well does it perform and what does it cost. Access and governance still need to be sorted out.”
The Fabric IQ announcements arrive alongside the Database Hub, now in early access, which brings Azure SQL, Azure Cosmos DB, PostgreSQL, MySQL and SQL Server under a single management and observability layer inside Fabric. The intent is to give data operations teams one place to monitor, govern and optimize their database estate without changing how each service is deployed.
Devin Pratt, research director at IDC, said the integrated direction tracks with where the broader market is heading. IDC expects that by 2029, 60% of enterprise data platforms will unify transactional and analytical workloads.
“Microsoft’s angle is to bring more of those pieces together in one coordinated approach, while rivals are moving along similar lines from different starting points,” Pratt told VentureBeat.
For data engineers responsible for making pipelines AI-ready, the practical implication of this week’s announcements is a shift in where the hard work lives.
Connecting data sources to a platform is a solved problem. Defining what that data means in business terms, and making that definition consistently available to every agent that queries it, is not.
That shift has a concrete implication for data professionals. The semantic layer — the ontology that maps business entities, relationships and operational rules — is becoming production infrastructure. It will need to be built, versioned, governed and maintained with the same discipline as a data pipeline. That is a new category of responsibility for data engineering teams, and most organizations have not yet staffed or structured for it.
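What treating the semantic layer as versioned, governed infrastructure can look like is sketched below. This is a hypothetical, minimal model for illustration only; the names and structure are assumptions, not Fabric IQ's actual schema or any vendor's format.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a versioned semantic-layer artifact: the ontology
# entry maps business entities to physical tables and records the
# relationships agents can rely on. Versioned like code, reviewed like code.

@dataclass(frozen=True)
class Entity:
    name: str
    source_table: str          # where the physical data actually lives
    keys: tuple[str, ...]      # business keys agents can join on

@dataclass(frozen=True)
class Relationship:
    subject: str               # e.g. "Order"
    predicate: str             # e.g. "placed_by"
    obj: str                   # e.g. "Customer"

@dataclass
class SemanticModel:
    version: str               # bumped and reviewed on every change
    entities: list[Entity] = field(default_factory=list)
    relationships: list[Relationship] = field(default_factory=list)

    def entity(self, name: str) -> Entity:
        return next(e for e in self.entities if e.name == name)

model = SemanticModel(
    version="2026.02.1",
    entities=[
        Entity("Customer", "sales.dim_customer", ("customer_id",)),
        Entity("Order", "sales.fct_orders", ("order_id",)),
    ],
    relationships=[Relationship("Order", "placed_by", "Customer")],
)
print(model.entity("Order").source_table)  # sales.fct_orders
```

The point of the sketch is the shape of the responsibility, not the syntax: every agent that queries "Order" resolves it through the same governed definition, and a change to that definition is a reviewable diff rather than tribal knowledge.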
The broader trend this week’s announcements reflect is that the data platform race in 2026 is no longer primarily about compute or storage. It is about which platform can deliver the most reliable shared context to the widest range of agents.
When an AI agent loses context mid-task because traditional storage can’t keep pace with inference, it is not a model problem — it is a storage problem. At GTC 2026, Nvidia announced BlueField-4 STX, a modular reference architecture that inserts a dedicated context memory layer between GPUs and traditional storage, claiming 5x the token throughput, 4x the energy efficiency and 2x the data ingestion speed of conventional CPU-based storage.
The bottleneck STX targets is key-value cache data. KV cache is the stored record of what a model has already processed — the intermediate calculations an LLM saves so it does not have to recompute attention across the entire context on every inference step. It is what allows an agent to maintain coherent working memory across sessions, tool calls and reasoning steps. As context windows grow and agents take more steps, that cache grows with them. When it has to traverse a traditional storage path to get back to the GPU, inference slows and GPU utilization drops.
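The scale of that growth is easy to underestimate. A rough back-of-envelope sketch, using hypothetical model dimensions rather than any specific model Nvidia cited, shows why the cache quickly outgrows GPU memory:

```python
# Rough sketch of KV cache growth (hypothetical model dimensions, not any
# specific model). Per token, the cache stores a key and a value vector for
# every layer: 2 * layers * kv_heads * head_dim * bytes_per_value.

def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Approximate KV cache footprint for a single sequence."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

# For this hypothetical model, each token costs 128 KiB of cache, so a
# 128K-token agent session holds roughly 15.6 GiB for one sequence alone.
print(f"{kv_cache_bytes(128_000) / 2**30:.1f} GiB per sequence")
```

Multiply that by concurrent agent sessions and it becomes clear why evicting the cache to general-purpose storage, and paying the traversal cost to bring it back, is the bottleneck STX is aimed at.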
STX is not a product Nvidia sells directly. It is a reference architecture the company is distributing to its storage partner ecosystem so vendors can build AI-native infrastructure around it.
The architecture is built around a new storage-optimized BlueField-4 processor that combines Nvidia’s Vera CPU with the ConnectX-9 SuperNIC. It runs on Spectrum-X Ethernet networking and is programmable through Nvidia’s DOCA software platform.
The first rack-scale implementation is the Nvidia CMX context memory storage platform. CMX extends GPU memory with a high-performance context layer designed specifically for storing and retrieving KV cache data generated by large language models during inference. Keeping that cache accessible without forcing a round trip through general-purpose storage is what CMX is designed to do.
“Traditional data centers provide high-capacity, general-purpose storage, but generally lack the responsiveness required for interaction with AI agents that need to work across many steps, tools and different sessions,” Ian Buck, Nvidia’s vice president of hyperscale and high-performance computing, said in a briefing with press and analysts.
In response to a question from VentureBeat, Buck confirmed that STX also ships with a software reference platform alongside the hardware architecture. Nvidia is expanding DOCA to include a new component referred to in the briefing as DOCA Memo.
“Our storage providers can leverage the programmability of the BlueField-4 processor to optimize storage for the agentic AI factory,” Buck said. “In addition to having a reference rack architecture, we’re also providing a reference software platform for them to deliver those innovations and optimizations for their customers.”
Storage partners building on STX get both a hardware reference design and a software reference platform — a programmable foundation for context-optimized storage.
Storage providers co-designing STX-based infrastructure include Cloudian, DDN, Dell Technologies, Everpure, Hitachi Vantara, HPE, IBM, MinIO, NetApp, Nutanix, VAST Data and WEKA. Manufacturing partners building STX-based systems include AIC, Supermicro and Quanta Cloud Technology.
On the cloud and AI side, CoreWeave, Crusoe, IREN, Lambda, Mistral AI, Nebius, Oracle Cloud Infrastructure and Vultr have all committed to STX for context memory storage.
That combination of enterprise storage incumbents and AI-native cloud providers is the signal worth watching. Nvidia is not positioning STX as a specialty product for hyperscalers. It is positioning it as the reference standard for anyone building storage infrastructure that has to serve agentic AI workloads — which, within the next two to three years, is likely to include most enterprise AI deployments running multi-step inference at scale.
STX-based platforms will be available from partners in the second half of 2026.
IBM sits on both sides of the STX announcement. It is listed as a storage provider co-designing STX-based infrastructure, and Nvidia separately confirmed that it has selected IBM Storage Scale System 6000 — certified and validated on Nvidia DGX platforms — as the high-performance storage foundation for its own GPU-native analytics infrastructure.
IBM also announced a broader expanded collaboration with Nvidia at GTC, including GPU-accelerated integration between IBM’s watsonx.data Presto SQL engine and Nvidia’s cuDF library. A production proof of concept with Nestlé put numbers on what that acceleration looks like: a data refresh cycle across the company’s Order-to-Cash data mart, covering 186 countries and 44 tables, dropped from 15 minutes to three minutes. IBM reported 83% cost savings and a 30x price-performance improvement.
The Nestlé result is a structured analytics workload. It does not directly demonstrate agentic inference performance. But it makes IBM and Nvidia’s shared argument concrete: the data layer is where enterprise AI performance is currently constrained, and GPU-accelerating it produces material results in production.
STX is a signal that the storage layer is becoming a first-class concern in enterprise AI infrastructure planning, not an afterthought to GPU procurement.
General-purpose NAS and object storage were not designed to serve KV cache data at inference latency requirements. STX-based systems from partners including Dell, HPE, NetApp and VAST Data are what Nvidia is putting forward as the practical alternative, with the DOCA software platform providing the programmability layer to tune storage behavior for specific agentic workloads.
The performance claims — 5x token throughput, 4x energy efficiency, 2x data ingestion — are measured against traditional CPU-based storage architectures. Nvidia has not specified the exact baseline configuration for those comparisons. Before those numbers drive infrastructure decisions, the baseline is worth pinning down.
Platforms are expected from partners in the second half of 2026. Given that most major storage vendors are already co-designing on STX, enterprises evaluating storage refreshes for AI infrastructure in the next 12 months should expect STX-based options to be available from their existing vendor relationships.
What’s the role of vector databases in the agentic AI world? That’s a question that organizations have been coming to terms with in recent months.
The narrative had real momentum. As large language models scaled to million-token context windows, a credible argument circulated among enterprise architects: purpose-built vector search was a stopgap, not infrastructure. Agentic memory would absorb the retrieval problem. Vector databases were a RAG-era artifact.
The production evidence is running the other way.
Qdrant, the Berlin-based open source vector search company, announced a $50 million Series B on Thursday, two years after a $28 million Series A. The timing is not incidental. The company is also shipping version 1.17 of its platform. Together, they reflect a specific argument: The retrieval problem did not shrink when agents arrived. It scaled up and got harder.
“Humans make a few queries every few minutes,” Andre Zayarni, Qdrant’s CEO and co-founder, told VentureBeat. “Agents make hundreds or even thousands of queries per second, just gathering information to be able to make decisions.”
That shift changes the infrastructure requirements in ways that RAG-era deployments were never designed to handle.
Agents operate on information they were never trained on: proprietary enterprise data, current information, millions of documents that change continuously. Context windows manage session state. They don’t provide high-recall search across that data, maintain retrieval quality as it changes, or sustain the query volumes autonomous decision-making generates.
“The majority of AI memory frameworks out there are using some kind of vector storage,” Zayarni said.
The implication is direct: even the tools positioned as memory alternatives rely on retrieval infrastructure underneath.
Three failure modes surface when that retrieval layer isn’t purpose-built for the load. At document scale, a missed result is not a latency problem — it is a quality-of-decision problem that compounds across every retrieval pass in a single agent turn. Under write load, relevance degrades because newly ingested data sits in unoptimized segments before indexing catches up, making searches over the freshest data slower and less accurate precisely when current information matters most. Across distributed infrastructure, a single slow replica pushes latency across every parallel tool call in an agent turn — a delay a human user absorbs as inconvenience but an autonomous agent cannot.
Qdrant’s 1.17 release addresses each directly. A relevance feedback query improves recall by adjusting similarity scoring on the next retrieval pass using lightweight model-generated signals, without retraining the embedding model. A delayed fan-out feature queries a second replica when the first exceeds a configurable latency threshold. A new cluster-wide telemetry API replaces node-by-node troubleshooting with a single view across the entire cluster.
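The delayed fan-out idea generalizes beyond any one database. The sketch below is a generic model of the pattern in plain asyncio, not Qdrant's implementation or client API: query one replica, and only duplicate the request to a second replica if the first exceeds a latency threshold, taking whichever answer lands first.

```python
import asyncio

# Generic sketch of delayed fan-out (illustrative only, not Qdrant's code):
# hedge against a slow replica without doubling load on every request.

async def query_replica(name: str, delay: float) -> str:
    await asyncio.sleep(delay)            # simulated network + search time
    return f"results from {name}"

async def delayed_fanout(threshold: float = 0.05) -> str:
    primary = asyncio.create_task(query_replica("replica-1", delay=0.2))
    done, _ = await asyncio.wait({primary}, timeout=threshold)
    if done:                              # primary answered within threshold
        return primary.result()
    # Primary is slow: fan out to a backup replica, take the first result.
    backup = asyncio.create_task(query_replica("replica-2", delay=0.01))
    done, pending = await asyncio.wait({primary, backup},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

result = asyncio.run(delayed_fanout())
print(result)  # results from replica-2 — the fast backup wins here
```

The design choice worth noting is the threshold: an unconditional fan-out would double query load on every request, while the delayed version only pays for a second query in the tail of the latency distribution.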
Nearly every major database now supports vectors as a data type — from hyperscalers to traditional relational systems. That shift has changed the competitive question. The data type is now table stakes. What remains specialized is retrieval quality at production scale.
That distinction is why Zayarni no longer wants Qdrant called a vector database.
“We’re building an information retrieval layer for the AI age,” he said. “Databases are for storing user data. If the quality of search results matters, you need a search engine.”
His advice for teams starting out: use whatever vector support is already in your stack. The teams that migrate to purpose-built retrieval do so when scale forces the issue.
“We see companies come to us every day saying they started with Postgres and thought it was good enough — and it’s not.”
Qdrant’s architecture, written in Rust, gives it memory efficiency and low-level performance control that higher-level languages don’t match at the same cost. The open source foundation compounds that advantage — community feedback and developer adoption are what allow a company at Qdrant’s scale to compete with vendors that have far larger engineering resources.
“Without it, we wouldn’t be where we are right now at all,” Zayarni said.
The companies building production AI systems on Qdrant are making the same argument from different directions: agents need a retrieval layer, and conversational or contextual memory is not a substitute for it.
GlassDollar helps enterprises including Siemens and Mahle evaluate startups. Search is the core product: a user describes a need in natural language and gets back a ranked shortlist from a corpus of millions of companies. The architecture runs query expansion on every request – a single prompt fans out into multiple parallel queries, each retrieving candidates from a different angle, before results are combined and re-ranked. That is an agentic retrieval pattern, not a RAG pattern, and it requires purpose-built search infrastructure to sustain it at volume.
The company migrated from Elasticsearch as it scaled toward 10 million indexed documents. After moving to Qdrant it cut infrastructure costs by roughly 40%, dropped a keyword-based compensation layer it had maintained to offset Elasticsearch’s relevance gaps, and saw a 3x increase in user engagement.
“We measure success by recall,” Kamen Kanev, GlassDollar’s head of product, told VentureBeat. “If the best companies aren’t in the results, nothing else matters. The user loses trust.”
Agentic memory and extended context windows aren’t enough to absorb the workload GlassDollar runs, either.
“That’s an infrastructure problem, not a conversation state management task,” Kanev said. “It’s not something you solve by extending a context window.”
Another Qdrant user is &AI, which is building infrastructure for patent litigation. Its AI agent, Andy, runs semantic search across hundreds of millions of documents spanning decades and multiple jurisdictions. Patent attorneys will not act on AI-generated legal text, which means every result the agent surfaces has to be grounded in a real document.
“Our whole architecture is designed to minimize hallucination risk by making retrieval the core primitive, not generation,” Herbie Turner, &AI’s founder and CTO, told VentureBeat.
For &AI, the agent layer and the retrieval layer are distinct by design.
“Andy, our patent agent, is built on top of Qdrant,” Turner said. “The agent is the interface. The vector database is the ground truth.”
The practical starting point: use whatever vector capability is already in your stack. The evaluation question isn’t whether to add vector search — it’s when your current setup stops being adequate. Three signals mark that point: retrieval quality is directly tied to business outcomes; query patterns involve expansion, multi-stage re-ranking, or parallel tool calls; or data volume crosses into the tens of millions of documents.
At that point the evaluation shifts to operational questions: how much visibility your current setup gives you into what’s happening across a distributed cluster, and how much performance headroom it has when agent query volumes increase.
“There’s a lot of noise right now about what replaces the retrieval layer,” Kanev said. “But for anyone building a product where retrieval quality is the product, where missing a result has real business consequences, you need dedicated search infrastructure.”