Mistral AI, the Paris-based artificial intelligence company valued at €11.7 billion ($13.8 billion), today released Workflows in public preview — a production-grade orchestration layer designed to move enterprise AI systems out of proofs of concept and into the business processes that generate revenue.
The product, which launches as part of Mistral’s Studio platform, is the company’s clearest articulation yet of a thesis that is quietly reshaping the enterprise AI market: that the bottleneck for organizations adopting AI is no longer the model itself, but the infrastructure required to run it reliably at scale.
“What we’re seeing today is that organizations are struggling to go beyond isolated proofs of concept,” Elisa Salamanca, who leads go-to-market for Mistral’s enterprise products, told VentureBeat in an exclusive interview ahead of the launch. “The gap is operational. Workflows is the infrastructure to run AI systems reliably across business-critical processes.”
The release arrives at a pivotal moment for both Mistral and the broader AI industry. The dedicated agentic AI market has been valued at approximately $10.9 billion in 2026 and is projected to reach $199 billion by 2034. Yet despite that staggering growth trajectory, industry research points to a stark reality: over 40% of agentic AI projects will be aborted by 2027 due to high costs, unclear value, and complexity. Mistral is betting that Workflows can help its enterprise customers avoid becoming one of those statistics.
At its core, Workflows provides a structured system for defining, executing, and monitoring multi-step AI processes — from simple sequential tasks to complex, stateful operations that blend deterministic business rules with the probabilistic outputs of large language models.
Salamanca described Workflows as containing several key components. The first is a development kit that allows engineers to build orchestration logic in just a few lines of Python code. “We have also been able to expose MCP servers,” she explained, referring to the Model Context Protocol standard for connecting AI systems to external tools, “so that they can actually do this with agent authoring.”
The second — and arguably more technically significant — component is an architecture that separates orchestration from execution. “We’re decorrelating the orchestration from the execution,” Salamanca said. “Execution can happen close to the customer’s data — their critical systems — and orchestration can happen on the cloud or wherever they want to run it.” This means the data never has to leave the customer’s perimeter, a design decision with enormous implications for regulated industries where data sovereignty is non-negotiable. “Enterprises do not have to worry about us having access to the data,” she added.
The third pillar is observability. According to Mistral’s blog post announcing the release, every branch, retry, and state change within a workflow is recorded in Studio with native support for OpenTelemetry. Salamanca noted that this is not an afterthought: “You can easily see what decisions have been taken by the workflow, by the agent, and you can deep dive into where problems are happening.”
Workflows is fully customizable across models — engineers can select which model handles which step and can inject arbitrary code, allowing them to blend deterministic pipelines with agentic sections. The system also supports connectors that integrate directly with CRMs, ticketing systems, support platforms, and other enterprise tools, with built-in authentication and secrets management.
Unlike some competitors offering drag-and-drop workflow builders, Mistral has deliberately targeted developers and engineers rather than business users. “There are a couple of solutions out there that have click-and-drag, drag-and-drop solutions for workflows,” Salamanca acknowledged. “This is not the approach that we’ve been taking. We’ve been really focused towards developers and critical systems that will not scale if you’re doing these drag-and-drop workflows.”
The decision is part of a broader philosophy at Mistral: that enterprise AI systems handling mission-critical operations — cargo releases, compliance reviews, financial transactions — require the precision and version control that only code can provide. Business users are not excluded from the picture, but their role is downstream. Once engineers write a workflow in Python, it can be published to Le Chat, Mistral’s chatbot platform, so anyone in the organization can trigger it. Every step remains tracked and auditable in Studio.
Under the hood, Workflows runs on Temporal’s durable execution engine — a platform whose $5 billion valuation reflects how its durable execution capabilities, originally built for cloud workflow orchestration, have become essential infrastructure for AI agents requiring reliable, long-running, stateful processes. Temporal’s customers include OpenAI, Snap, Netflix, and JPMorgan Chase, and its technology powers orchestration at companies like Stripe and Salesforce.
Mistral extended Temporal’s core engine for AI-specific workloads by adding streaming, payload handling, multi-tenancy, and observability that the base engine does not provide out of the box. “Workflows is built on top of Temporal,” Salamanca confirmed. “We added all the AI requirements to make these AI workflows reliable. It provides out of the box durability, retries, state management. Whenever there’s a failure, it starts again wherever it stopped.” Originally spun out of Uber’s Cadence project, Temporal transparently handles retries, state persistence, and timeouts, providing durable execution across failures. In late 2025, Temporal joined the newly formed Agentic AI Foundation as a Gold Member and announced an official OpenAI Agents SDK integration. By building on this infrastructure rather than creating a proprietary alternative, Mistral inherits battle-tested reliability while focusing its own engineering efforts on the AI-specific layer that sits above it.
Mistral is not launching Workflows as a concept — the company says customers are already running the product in production, processing millions of executions daily across three primary use cases.
The first is cargo release automation in the logistics sector. Global shipping still runs on paperwork, and a single cargo release can involve customs declarations, dangerous goods classifications, safety inspections, and regulatory checks spanning multiple jurisdictions. Salamanca described the scope of the problem: “Their global shipping today runs on paperwork. They have to involve customs declaration, Dangerous Goods classification, safety inspections, regulatory checks, and Workflows is now powering that with our models and business rules inside.”
Critically, the system keeps humans in the loop at the right moments. According to Mistral’s blog, the human approval step in a workflow is a single line of code — wait_for_input() — that pauses the workflow indefinitely with no compute consumption, notifies the reviewer, and resumes exactly where it left off once approval is given. “Humans are still in the loop, but they’re in the loop at the right time,” Salamanca said. “They just get the validation — I don’t have to go into multiple tools — and the shipment gets released.”
The second production use case is document compliance checking for financial institutions, specifically Know Your Customer reviews. These reviews are manual, repetitive, and traditionally require hours of analyst time per case. Salamanca said Workflows now processes these reviews in minutes and provides outputs in an auditable manner — a requirement for meeting regulatory obligations.
The third example involves customer support in the banking sector. “You’d have millions of users actually asking to have credit cards blocked, or feedbacks on their account situation, on their credit feedbacks,” Salamanca said. With Workflows, incoming support tickets are analyzed, categorized by intent and urgency, and routed automatically. Each routing decision is visible and traceable in Studio, and when the system gets a categorization wrong, the team can correct it at the workflow level without retraining the model.
Workflows does not exist in isolation. It is the middle layer of a three-part enterprise platform that Mistral has been assembling at a rapid clip throughout 2026.
At the bottom sits Forge, the custom model training platform Mistral launched in March at Nvidia’s GTC conference. Forge allows organizations to build, customize, and continuously improve AI models using their own proprietary data. At the top sits Vibe, Mistral’s coding agent platform that provides the user-facing interaction layer — available on web, mobile, or desktop.
Salamanca connected the three explicitly: “We just released Forge. It enables you to create your own models. But the question is, how do you put these models to do valuable work for your enterprise? That’s where Workflows comes in, because this is the orchestration piece — how you blend in deterministic rules and agentic capabilities. And then if you really want to have your end users interact with these AI patterns, it’s where Vibe comes into play.”
Forge is already seeing strong traction, Salamanca said, across two distinct patterns of enterprise demand. “First, they wanted to really build completely dedicated models to solve unique problems — transformers-based architecture for time series in the financial sector, adding new types of modalities to the LLMs,” she explained. “And the second motion was about customers with really specific tasks they want to solve. Reinforcement learning really caught their attention as to how they can use Forge and Forge RL to actually have models do these tasks very well.”
This layered architecture — model customization, workflow orchestration, and end-user interfaces — positions Mistral as something more ambitious than a model provider. It is building a full-stack enterprise AI platform, a strategy that pits it directly against not just other AI labs like OpenAI and Anthropic, but also against the hyperscale cloud providers. The company’s product portfolio now ranges, as Salamanca put it, “from compute to end-user interfaces,” including data centers in Europe, document processing with its OCR model, and audio capabilities through its Voxtral models.
The Workflows launch comes as Mistral executes one of the most aggressive scaling campaigns in the history of the European technology industry. The French AI startup has increased its revenue twentyfold within a year, with co-founder and CEO Arthur Mensch putting the company’s annualized revenue run rate at over $400 million, compared to just $20 million the previous year. The Paris-based company aims to achieve recurring annual revenue of more than $1 billion by year-end.
The company’s fundraising trajectory has been equally dramatic. Mistral announced a €1.7 billion ($1.9 billion) Series C round at a €11.7 billion ($12.8 billion) valuation in September 2025. Bloomberg reported in September 2025 that the company was finalizing a €2 billion investment valuing it at €12 billion ($14 billion). ASML led the round and contributed €1.3 billion, a landmark investment that aligned chip manufacturing expertise with frontier AI development and underscored European industrial capital’s commitment to building a sovereign AI ecosystem. Mistral then secured $830 million in debt in March 2026 to buy 13,800 Nvidia chips for a new data center near Paris.
The financial picture illustrates why Workflows matters strategically. Mistral’s revenue growth is being driven primarily by enterprise adoption, with approximately 60% of revenue coming from Europe, according to CEO Mensch’s public statements. Those enterprise customers are not buying Mistral’s models for casual chatbot applications — they are deploying them in regulated, mission-critical environments where reliability and data sovereignty are table stakes. Workflows gives those customers the production infrastructure they need to actually deploy AI systems that matter.
In May 2025, Mistral released Mistral Medium 3, which was priced at $0.40 per million input tokens and $2 per million output tokens. The company said clients in financial services, energy, and healthcare had been beta testing it for customer service, workflow automation, and analyzing complex datasets. That model now becomes one of many that can be plugged into Workflows, creating a flywheel where better models drive more workflow adoption, which in turn drives more inference revenue.
Mistral’s entry into workflow orchestration arrives in an increasingly crowded field. AI orchestration platforms are quickly becoming the backbone of enterprise AI systems in 2026, and as businesses deploy multiple AI agents, tools, and LLMs, the need for unified control, oversight, and efficiency has never been greater.
Major cloud providers — Amazon with Bedrock AgentCore, Microsoft with Copilot Studio, Google with Vertex AI’s agent tools, and IBM with WatsonX — all offer some form of workflow or agent orchestration. Open-source frameworks like LangChain, LlamaIndex, and Microsoft AutoGen provide developer-level building blocks. And dedicated orchestration startups are proliferating.
Mistral’s differentiation rests on three pillars. First, vertical integration: because Workflows is native to Studio, the orchestration layer and the components it orchestrates — models, agents, connectors, observability — are built to work together, eliminating the integration tax that enterprises pay when stitching together disparate tools. Second, deployment flexibility: the split control-plane/data-plane architecture means customers in regulated industries can run execution workers in their own environments while still benefiting from managed orchestration. Third, data sovereignty: Mistral’s European roots and infrastructure investments give it a natural advantage with organizations wary of routing sensitive data through U.S.-headquartered cloud providers — a concern that has intensified amid ongoing geopolitical tensions and growing European anxiety about relying on foreign providers for over 80% of digital services and infrastructure.
Still, the challenges are real. OpenAI and Anthropic both have significantly larger model ecosystems and developer communities. The hyperscalers control the cloud infrastructure where most enterprise workloads actually run. And the enterprise sales cycles for production-grade AI deployments remain long and complex, requiring deep technical integration work that even well-funded startups can struggle to staff.
Salamanca outlined three areas of near-term development. First, Mistral plans to release a more managed version of Workflows that abstracts deployment logic for developers who don’t need granular control over worker placement. “Whenever you want to have this flexibility, you can, but if you want to be able to have this on a managed infrastructure, even if it’s running in your own VPC, this is something that we’re adding,” she said.
Second, the company intends to make Workflows accessible to business users, not just engineers. “With Vibe code, you can actually author a workflow. This can be executed at scale, and any end user, in the end, can actually do that with Workflows,” Salamanca explained. The third area is enterprise guardrails and safety controls for agentic applications — ensuring agents use the correct tools, run with appropriate permissions, and that administrators can enforce policies at scale. “Making sure that we have all these enterprise controls to be able to scale the authoring and the building of these workflows is something we’re actively working on,” she said.
The Python SDK for Workflows (v3.0) is now publicly available. Developers can try the product in Studio and access documentation and demo templates immediately. Mistral will be hosting its inaugural AI Now Summit in Paris on May 27–28, where the company is expected to provide additional details on its platform roadmap.
For three years, the AI industry has been captivated by a single question: who can build the most powerful model? Mistral’s Workflows launch suggests the company has moved on to a different question entirely — one that may prove far more consequential for the enterprises writing the checks. It’s not about which model is smartest. It’s about which one can actually show up for work.
AI R&D runs on a cycle of hypothesis, experiment, and analysis — each step demanding substantial manual engineering effort. A new framework from researchers at SII-GAIR aims to close that bottleneck by automating the full optimization loop for training data, model architectures, and learning algorithms.
A new framework called ASI-EVOLVE, developed by researchers at the Generative Artificial Intelligence Research Lab (SII-GAIR), aims to solve this bottleneck. Designed as an agentic system for AI-for-AI research, it uses a continuous “learn-design-experiment-analyze” cycle to automate the optimization of the foundational AI stack.
In experiments, this self-improvement loop autonomously discovered novel designs that significantly outperformed state-of-the-art human baselines. The system generated novel language model architectures, improved pretraining data pipelines to boost benchmark scores by over 18 points, and designed highly efficient reinforcement learning algorithms.
For enterprise teams running repeated optimization cycles on their AI systems, the framework offers a path to reducing manual engineering overhead while matching or exceeding the performance of human-designed baselines.
Engineering teams can only explore a tiny fraction of the vast possible design space for AI models at any given time. Executing experimental workflows requires costly manual effort and frequent human intervention. And the insights gained from these expensive cycles are often siloed as individual intuition or experience, making it difficult to systematically preserve and transfer that knowledge to future projects or across different teams. These constraints fundamentally limit the pace and scale of AI innovation.
AI has made incredible strides in scientific discovery, ranging from specialized tools like AlphaFold solving discrete biological problems to agentic systems answering basic scientific questions. However, current frameworks still struggle with open-ended AI innovation and are mostly limited to narrow optimization within very specific constraints.
Advancing core AI capabilities is far more complex. It requires modifying large interdependent codebases, running compute-heavy experiments that consume tens to hundreds of GPU hours, and analyzing multi-dimensional feedback from training dynamics.
“Existing frameworks have not yet demonstrated that AI can operate effectively in this regime in a unified way, nor that it can generate meaningful advances across the three foundational pillars of AI development rather than within a single narrowly scoped setting,” the researchers write.
To overcome the limitations of manual R&D, ASI-EVOLVE operates on a continuous loop between prior knowledge, hypothesis generation, experimentation, and refinement. The system learns relevant knowledge and historical experience from existing databases, designs a candidate program representing its next hypothesis, runs experiments to obtain evaluation signals, and analyzes outcomes into reusable, human-readable lessons that it feeds back into its knowledge base.
There are two key components that drive ASI-EVOLVE. The “Cognition Base” acts as the system’s foundational domain expertise. To speed up the search process, the system is pre-loaded with human knowledge, task-relevant heuristics, and known pitfalls extracted from existing literature. This steers the exploration toward promising directions right from the first iteration.
The second component is the “Analyzer,” which tackles the complex, multi-dimensional feedback from the experiments. It processes raw training logs, benchmark results, and efficiency traces, distilling them into compact, actionable insights and causal analyses.
Several other complementary modules bring the framework together. A “Researcher” agent reviews prior knowledge from the cognition base and past experimental results to generate new hypotheses, either proposing localized code modifications or writing new programs.
The “Engineer” component runs the actual experiments. Because AI training trials are incredibly costly, the Engineer is equipped with efficiency measures like wall-clock limits and early rejection quick tests to filter out flawed candidate programs before they consume excessive GPU hours.
Finally, the “Database” serves as the system’s persistent memory, storing the code, research motivations, raw results, and the Analyzer’s final reports for every iteration, ensuring that insights compound systematically over time.
By unifying these components, ASI-EVOLVE ensures that an AI agent systematically learns from complex, real-world experimental feedback without requiring constant human intervention.
While previous frameworks are designed to evolve candidate solutions, “ASI-EVOLVE evolves cognition itself,” the researchers write. “Accumulated experience and distilled insights are continuously stored and retrieved to inform future exploration, ensuring that the system grows not only in the quality of its solutions but in its capacity to reason about where to search next.”
In their experiments, the researchers showed that ASI-EVOLVE can successfully improve data curation, model architectures, and learning algorithms to create better AI systems.
For real-world enterprise applications, high-quality data is a persistent bottleneck. When tasked with designing category-specific cleaning strategies for massive pretraining corpora, ASI-EVOLVE inspected data samples and diagnosed quality issues like HTML artifacts and formatting inconsistencies. The system autonomously formulated custom curation rules, discovering that systematic cleaning combined with domain-aware preservation rules is far more effective than aggressive filtering.
In benchmark tests, 3B-parameter models trained on the AI-curated data saw an average score boost of nearly 4 points over models trained on raw data. The gains were highest in knowledge-intensive tasks, with performance increasing by over 18 points on Massive Multitask Language Understanding (MMLU), an LLM benchmark that covers tasks across STEM, humanities, and social sciences.
Beyond data, the system proved highly capable at neural architecture design. Across 1,773 autonomous exploration rounds, it generated 105 novel linear attention architectures that surpassed DeltaNet, a highly efficient human-designed baseline. To achieve these results, ASI-EVOLVE developed multi-scale routing mechanisms that dynamically adjust the model’s computational budget based on the specific content of the input.
Finally, in reinforcement learning algorithm design, ASI-EVOLVE discovered novel optimization mechanisms. It designed algorithms that outperformed the competitive GRPO baseline on complex mathematical reasoning benchmarks such as AMC32 and AIME24. One successful variant invented a “Budget-Constrained Dynamic Radius” that keeps model updates within a defined budget, effectively stabilizing training on noisy data.
Enterprise AI workflows constantly require optimizations to existing systems, from fine-tuning open-source models on proprietary data to making small changes to architectures and algorithms. Usually, the computational resources and engineering hours required to carry out such efforts are immense and beyond the capabilities of most organizations. As a result, many are left to run unoptimized versions of standard AI models.
The research team says the framework is designed so enterprises can integrate proprietary domain knowledge into the cognition repository and allow the autonomous loop to iterate on internal AI systems.
The research team has open-sourced the ASI-EVOLVE code, making the foundational framework available for developers and product builders.
Presented by EdgeverveSupply chains are where legacy integration models reach their limits. As partner networks expand and operational volatility increases, traditional middleware is buckling under costs and complexity. That’s why supply chain has emer…
For the past eighteen months, the corporate world has been obsessed with the “builder” phase of the generative AI revolution. Enterprises have raced to deploy autonomous agents to handle everything from customer support to complex codebase refactoring.
However, as these digital workers proliferate, a new, more structural problem has emerged: fragmentation. Agents built on LangChain cannot easily hand off tasks to those built on CrewAI; a Salesforce-embedded agent has no native way to coordinate with a custom-built Python script running on a private cloud.
Today, a new startup, BAND (also known as Thenvoi AI Ltd.) exited stealth with $17 million in Seed funding to provide the “interaction infrastructure” necessary to turn these isolated tools into a unified, collaborative workforce.
“In order for agents to become real players in the global economy, they need ways to communicate, just like humans do,” said co-founder and CEO Arick Goomanovsky in an interview with VentureBeat, continuing, “the communication solutions we have today for systems don’t work for agents, because agents are non-deterministic creatures. It’s not just about API integrations.”
By introducing a deterministic communication layer that functions as a “Slack for agents,” BAND aims to move the industry from a collection of fragile experiments to a scalable, “agentic economy”.
At the core of BAND’s thesis is that simply creating and plugging AI agents into human communication tools like Slack causes them to lose context or require constant “rehydration” if they fail and re-enter a conversation.
“You can’t take a bunch of agents and put them into Slack and expect it to miraculously work,” Goomanovsky said.
BAND solves this through a two-layer architecture designed to handle the unique telemetry of AI-to-AI interaction, a so called “agentic mesh.”
This is the “interaction layer” where agent discovery and structured delegation occur. It allows agents to find one another across different clouds and frameworks without requiring developers to write brittle “glue code” for every new connection.
Multi-Peer Collaboration: Unlike existing protocols that are primarily peer-to-peer or client-server, BAND supports full-duplex, multi-peer communication. This allows a group of agents—for example, a planning agent, a coding agent, and a QA agent—to work together in a shared “room” with synchronized context.
Deterministic Routing: Notably, BAND does not use Large Language Models (LLMs) to route messages. Using an LLM for routing would introduce the same non-deterministic errors the platform seeks to solve. Instead, the platform uses a patent-pending multi-layer architecture to ensure messages reach their destination reliably.
The WhatsApp Comparison: To handle the anticipated volume of agentic traffic, BAND’s infrastructure is built on the same technical stack utilized by global messaging giants like WhatsApp and Discord. This ensures the platform can scale to billions of messages as digital identities begin to outnumber human ones.
If the nesh is the “pipes,” the Control Plane is the “valve”. This layer provides the runtime governance that enterprises require before they can safely scale autonomous systems.
Authority Boundaries: The platform allows organizations to enforce strict rules on which agents can talk to each other and what topics they can discuss.
Credential Traversal: One of the most significant hurdles in multi-agent systems is identity. BAND manages how human permissions and security tokens traverse from agent to agent. For instance, if a human asks Agent A for information, and Agent A delegates that task to Agent B, BAND ensures Agent B only accesses data the original human is permitted to see.
BAND’s product suite is designed to be “framework-agnostic” and “cloud-agnostic,” positioning itself as an independent middleware that prevents vendor lock-in. In a market where hyperscalers like OpenAI or Anthropic want enterprises to stay within their specific ecosystems, BAND offers the flexibility to use the best model across multiple options for the job, including open source and fine-tuned, custom enterprise options.
“No matter where the agents run or how they were built, we can band them together, allow them to discover each other, delegate tasks, and have full-duplex, bidirectional communication,” Goomanovsky said, noting that despite competing first-part options from model providers like OpenAI’s workspace agents (announced yesterday) and Anthropic’s Claude Managed Agents (announced earlier this month), BAND “play[s] the role of the independent platform that allows an enterprise to avoid vendor lock-in.”
The company is currently seeing the most traction in “tech-forward” sectors, including telecommunications, financial services, and cybersecurity.
Coding Agents: This is currently the most popular use case. Developers often find that Claude is superior at planning, while Codex is better at reviewing code. BAND allows these agents to work simultaneously, delegating tasks to one another in real-time.
Customer Support and Operations: Beyond code, BAND enables “cross-boundary” automation. For example, a new employee could be onboarded by a Workday agent, which then communicates with a ServiceNow agent to open a ticket for equipment, which finally talks to a purchasing agent to finalize the order.
Understanding the sensitivity of enterprise data, BAND offers three primary ways to consume the platform:
SaaS: A straightforward cloud-based platform where agents connect via API.
Private Cloud/On-Premise: The entire platform can be deployed within a customer’s VPC or on-premise environment to ensure data never leaves their control.
The Edge: The infrastructure is lightweight enough to be deployed on “flying objects” like drones (UAVs) or even satellites, facilitating communication between agents in physically isolated environments.
Already, BAND’s early users — and enterprises more broadly — are mixing and matching AI agents powered by models from various providers, so the time to provide an overarching solution seems ripe.
As Goomanovsky put it: “Advanced developers are not using a single coding agent. They realize Claude is very good at planning, Codex is much better at reviewing, and today there is no way to create that bidirectional interaction between coding, review, and planning agents. We enable that.”
BAND operates as a commercial entity, focusing on providing “enterprise-grade” stability and security. While the platform integrates with open-source frameworks like LangChain and CrewAI, its own core routing and control technology is proprietary and patent-pending.
For enterprise IT leaders, the “Control Plane” is less about communication and more about auditability. BAND provides full observability into every agent interaction, creating a transcript and a “paper trail” for autonomous actions.
This is a “complementary” solution to existing guardrail products; while a guardrail might protect a single agent from a prompt injection, BAND protects the entire system from cascading failures caused by one agent misinforming another.
The company has launched with a tiered pricing model designed to capture everyone from individual “agent enthusiasts” to global corporations:
Free ($0/mo): Designed for individuals. It allows for up to 10 remote agents and 50 active chat rooms, though it only retains data for 24 hours.
Pro ($17.99/mo): Aimed at startups and growing R&D teams. This tier increases limits to 40 agents and 250 active chat rooms with email support.
Enterprise (Custom): Offers unlimited agents, custom data retention policies to meet compliance requirements, and full API access to BAND’s “Memory APIs”.
The emergence of BAND coincides with a shift in how analysts view the AI market. Gartner has predicted that by 2029, 90% of enterprises deploying multiple agents will require what they call a “Universal Orchestrator”. Similarly, Forrester has recognized the “Agent Control Plane” as a distinct and emerging market category.
The company was founded by Goomanovsky and Vlad Luzin, who combined their backgrounds in Israeli intelligence, cybersecurity, and multi-agent systems to build BAND.
Goomanovsky views the platform not just as a tool, but as a foundational layer for the next era of the internet.
“Communication is the most fundamental problem in computing,” Goomanovsky noted. “When new beings emerge, the first thing they need is a way to talk to each other… We are the agent internet”.
The $17 million Seed round was led by Sierra Ventures, Hetz Ventures, and Team8. Tim Guleri of Sierra Ventures emphasized that BAND is building the “missing layer” that makes large-scale collaboration practical.
This capital will be used to expand the engineering team and accelerate the development of the “design partner” ecosystem, which already includes leading North American telcos and European digital payment companies.
As agents transition from being digital novelties to becoming the primary drivers of enterprise workflows, the “glue code” that holds them together will become the most critical piece of the stack. BAND’s launch marks the first serious attempt to standardize that glue, turning a chaotic “band” of agents into a synchronized, governed symphony.
OpenAI introduced a new paradigm and product today that is likely to have huge implications for enterprises seeking to adopt and control fleets of AI agent workers.
Called “Workspace Agents,” OpenAI’s new offering essentially allows users on its ChatGPT Business ($20 per user per month) and variably priced Enterprise, Edu and Teachers subscription plans to design or select from pre-existing agent templates that can take on work tasks across third-party apps and data sources including Slack, Google Drive, Microsoft apps, Salesforce, Notion, Atlassian Rovo, and other popular enterprise applications.
Put simply: these agents can be created and accessed from ChatGPT, but users can also add them to third-party apps like Slack, communicate with them across disparate channels, ask them to use information from the channel they’re in and other third-party tools and apps, and the agents will go off and do work like drafting emails to the entire team, selected members, or pull data and make presentations.
Human users can trust that the agent will manage all this complexity and complete the task as requested, even if the user who requested it leaves.
It’s the end of “babysitting” agents and the start of letting them go off and get shit done for your business — according to your defined business processes and permissions, of course.
The product experience appears centered on the Agents tab in the ChatGPT sidebar, where teams can discover and manage shared agents.
This functions as a kind of team directory: a place where agents built by coworkers can be reused across a workspace. The broader idea is that AI becomes less of an individual productivity trick and more of a shared organizational resource.
In this sense, OpenAI is targeting one of office work’s oldest pain points: the handoff between people, systems, and steps in a process.
OpenAI says workspace agents will be free for the next two weeks, until May 6, 2026, after which credit-based pricing will begin. The company also says more capabilities are on the way, including new triggers to start work automatically, better dashboards, more ways for agents to take action across business tools, and support for workspace agents in its AI code generation app, Codex.
For more information on how to get started building and using them, OpenAI recommends heading over to its online academy page on them here and its help desk documentation here.
The most significant shift in this announcement is the move away from purely session-based interaction. Workspace agents are powered by Codex — the cloud-based, partially open-source AI coding harness that OpenAI has been aggressively expanding in 2026 — which gives them access to a workspace for files, code, tools, and memory.
OpenAI says the agents can do far more than answer a prompt. They can write or run code, use connected apps, remember what they have learned, and continue work across multiple steps.
That description lines up closely with the capabilities OpenAI shipped into Codex just six days ago, including background computer use, more than 90 new plugins spanning tools like Atlassian Rovo, CircleCI, GitLab, Microsoft Suite, Neon by Databricks, and Render, plus image generation, persistent memory, and the ability to schedule future work and wake up on its own to continue across days or weeks.
Workspace agents inherit that plumbing. When one pulls a Friday metrics report, it is effectively spinning up a Codex cloud session with the right tools attached, running code to fetch and transform data, rendering charts, writing the narrative, and persisting what it learned for next week.
When that same agent is deployed to a Slack channel, it is a Codex instance listening for mentions and threading its work back in.
This is the technical decision enterprise buyers should focus on. Building an agent on a code-execution substrate rather than a pure LLM-call-and-response loop is what gives workspace agents the ability to do real work — transforming a CSV, reconciling two systems of record, generating a chart that is actually correct — rather than describing what the work would look like.
In earlier AI assistant models, progress paused when the user stopped interacting. Workspace agents change that by running in the cloud and supporting long-running workflows. Teams can also set them to run on a schedule.
That means a recurring reporting agent can pull data on a set cadence, generate charts and summaries, and share the results with a team without anyone manually kicking off the process.
Here at VentureBeat, we analyze story traffic and user return rate on a weekly basis — exactly the kind of recurring, multi-step, multi-source task that could theoretically be automated with a single workspace agent. Any enterprise with a weekly reporting rhythm pulling from dynamic data sources is likely to find a use for these agents.
Agents also retain memory across runs. OpenAI says they can be guided and corrected in conversation, so they improve the more a team uses them.
Over time they start to reflect how a team actually works — its processes, its standards, its preferred ways of handling recurring jobs — which is a meaningfully different proposition from the static instruction-set GPTs that preceded them.
OpenAI’s claim is that agents should gather information and take action where work already happens, rather than forcing teams into a separate interface. That point becomes clearest in the Slack examples. OpenAI’s launch materials show a product-feedback agent operating inside a channel named #user-insights, answering a question about recent mobile-app feedback with a themed summary pulled from multiple sources.
The company’s demo lineup walks through a sample team directory of agents: Spark for lead qualification and follow-up, Slate for software-request review, Tally for metrics reporting, Scout for product feedback routing, Trove for third-party vendor risk, and Angle for marketing and web content.
OpenAI also shared more functional examples its own teams use internally — a Software Reviewer that checks employee requests against approved-tools policy and files IT tickets; an accounting agent that prepares parts of month-end close including journal entries, balance-sheet reconciliations, and variance analysis, with workpapers containing underlying inputs and control totals for review; and a Slack agent used by the product team that answers employee questions, links relevant documentation, and files tickets when it surfaces a new issue.
In a sense, it is a continuation of the philosophy OpenAI espoused for individuals with last week’s Codex desktop release: the agent joins the workflow where work is already happening, draws in context from the surrounding apps, takes action where permitted, and keeps moving.
Workspace agents are not a standalone launch. They sit inside a roughly 12-month arc in which OpenAI has been systematically rebuilding ChatGPT, the API, and the developer platform around agents.
Workspace agents are explicitly positioned by OpenAI as an evolution of its custom GPTs, introduced in late 2023, which gave users a way to create customized versions of ChatGPT for particular roles and use cases.
However, now OpenAI says it is deprecating the custom GPT standard for organizations in a yet-to-be determined future date, and will require Business, Enterprise, Edu and Teachers users to update their GPTs to be new workspace agents.
Individuals who have made custom GPTs can continue using them for the foreseeable future, according to our sources at the company.
In October 2025, OpenAI introduced AgentKit, a developer-focused suite that includes Agent Builder, a Connector Registry, and ChatKit for building, deploying, and optimizing agents.
In February 2026, it introduced Frontier, an enterprise platform focused on helping organizations manage AI coworkers with shared business context, execution environments, evaluation, and permissions.
Workspace agents arrive as the no-code, in-product entry point that sits on top of that stack — even if OpenAI does not explicitly describe the architectural relationship in its materials.
The subtext across all three launches is the same: OpenAI has decided that the future of ChatGPT-for-work is fleets of permissioned agents, not single chat windows — and that GPTs, its first attempt at letting businesses customize ChatGPT, were not enough.
Because workspace agents can act across business systems, OpenAI puts heavy emphasis on governance. Admins can control who is allowed to build, run, and publish agents, and which tools, apps, and actions those agents can reach.
The role-based controls are more granular than the ones most custom-GPT rollouts ever had: admins can toggle, per role, whether members can browse and run agents, whether they can build them, whether they can publish to the workspace directory, and — separately — whether they can publish agents that authenticate using personal credentials.
That last setting is the risky case, and OpenAI explicitly recommends keeping it narrowly scoped.
Authentication itself comes in two flavors, and the choice has real consequences. In end-user account mode, each person who runs the agent authenticates with their own credentials, so the agent only ever sees what that individual is allowed to see.
In agent-owned account mode, the agent uses a single shared connection so users don’t have to authenticate at run time. OpenAI’s documentation strongly recommends service accounts rather than personal accounts for the shared case, and flags the data-exfiltration risk of publishing an agent that authenticates as its creator.
Write actions — sending email, editing a spreadsheet, posting a message, filing a ticket — default to Always ask, requiring human approval before the agent executes.
Builders can relax specific actions to “Never ask” or configure a custom approval policy, but the default posture is human-in-the-loop.
OpenAI also claims built-in safeguards against prompt-injection attacks, where malicious content in a document or web page tries to hijack an agent. The claim is welcome but not yet proven in the wild.
For organizations that want deeper visibility, OpenAI says its Compliance API surfaces every agent’s configuration, updates, and run history.
Admins can suspend agents on the fly, and OpenAI says an admin-console view of every agent built across the organization, with usage patterns and connected data sources, is coming soon.
Two caveats worth flagging for security-sensitive buyers: workspace agents are off by default at launch for ChatGPT Enterprise workspaces pending admin enablement, and they are not available at all to Enterprise customers using Enterprise Key Management (EKM).
OpenAI also ships an analytics dashboard aimed at helping teams understand how their agents are being used. Screenshots in the launch materials show measures like total runs, unique users, and an activity feed of recent runs, including one by a user named Ethan Rowe completing a run in a #b2b-sales channel.
The mockup detail supports OpenAI’s broader point: the company wants organizations to measure not just whether agents exist, but whether they are being used.
The clearest early-adopter signal in the launch itself comes from Rippling. Ankur Bhatt, who leads AI Engineering at the HR platform, says workspace agents shortened the traditional development cycle enough that a sales consultant was able to build a sales agent without an engineering team. “It researches accounts, summarizes Gong calls, and posts deal briefs directly into the team’s Slack room,” Bhatt says. “What used to take reps 5–6 hours a week now runs automatically in the background on every deal.”
OpenAI’s announcement names SoftBank Corp., Better Mortgage, BBVA, and Hibob as additional early testers.
Workspace agents do not land in a vacuum. They land in the middle of a broader OpenAI push — through AgentKit, through Frontier, through the Codex overhaul — to make agents more persistent, more connected, and more useful inside real organizational workflows.
They also land in a deeply crowded field: Microsoft Copilot Studio is wired into the Microsoft 365 base, Google is pushing Agentspace, Salesforce has rebuilt itself as agent infrastructure with Agentforce, and Anthropic recently introduced Claude Managed Agents, all different flavors of similar ideas — agents that cut across your apps and tools, take actions on schedules repeatedly as desired, and retain some degree of memory, context, and permissions and policies.
But this launch matters because it turns OpenAI’s strategy into something concrete for the teams already paying for ChatGPT, and because it quietly retires the product those teams were most recently told to standardize on.
If workspace agents live up to the pitch — shared, reusable, scheduled, permissioned coworkers that follow approved processes and keep work moving when their human is offline — it would mark a meaningful change in what workplace software does. Less passive software waiting for input, more active systems helping teams coordinate, execute, and move faster together.
The era of the digital coworker has begun. And, on OpenAI’s plans at least, the era of the custom GPT is ending.
The era of enterprises stitching together prompt chains and shadow agents is nearing its end as more options for orchestrating complex multi-agent systems emerge. As organizations move AI agents into production, the question remains: "how will we …
Enterprise teams building multi-agent AI systems may be paying a compute premium for gains that don’t hold up under equal-budget conditions. New Stanford University research finds that single-agent systems match or outperform multi-agent architectures on complex reasoning tasks when both are given the same thinking token budget.
However, multi-agent systems come with the added baggage of computational overhead. Because they typically use longer reasoning traces and multiple interactions, it is often unclear whether their reported gains stem from architectural advantages or simply from consuming more resources.
To isolate the true driver of performance, researchers at Stanford University compared single-agent systems against multi-agent architectures on complex multi-hop reasoning tasks under equal “thinking token” budgets.
Their experiments show that in most cases, single-agent systems match or outperform multi-agent systems when compute is equal. Multi-agent systems gain a competitive edge when a single agent’s context becomes too long or corrupted.
In practice, this means that a single-agent model with an adequate thinking budget can deliver more efficient, reliable, and cost-effective multi-hop reasoning. Engineering teams should reserve multi-agent systems for scenarios where single agents hit a performance ceiling.
Multi-agent frameworks, such as planner agents, role-playing systems, or debate swarms, break down a problem by having multiple models operate on partial contexts. These components communicate with each other by passing their answers around.
While multi-agent solutions show strong empirical performance, comparing them to single-agent baselines is often an imprecise measurement. Comparisons are heavily confounded by differences in test-time computation. Multi-agent setups require multiple agent interactions and generate longer reasoning traces, meaning they consume significantly more tokens.
ddConsequently, when a multi-agent system reports higher accuracy, it is difficult to determine if the gains stem from better architecture design or from spending extra compute.
Recent studies show that when the compute budget is fixed, elaborate multi-agent strategies frequently underperform compared to strong single-agent baselines. However, they are mostly very broad comparisons that don’t account for nuances such as different multi-agent architectures or the difference between prompt and reasoning tokens.
“A central point of our paper is that many comparisons between single-agent systems (SAS) and multi-agent systems (MAS) are not apples-to-apples,” paper authors Dat Tran and Douwe Kiela told VentureBeat. “MAS often get more effective test-time computation through extra calls, longer traces, or more coordination steps.”
To create a fair comparison, the Stanford researchers set a strict “thinking token” budget. This metric controls the total number of tokens used exclusively for intermediate reasoning, excluding the initial prompt and the final output.
The study evaluated single- and multi-agent systems on multi-hop reasoning tasks, meaning questions that require connecting multiple pieces of disparate information to reach an answer.
During their experiments, the researchers noticed that single-agent setups sometimes stop their internal reasoning prematurely, leaving available compute budget unspent. To counter this, they introduced a technique called SAS-L (single-agent system with longer thinking).
Rather than jumping to multi-agent orchestration when a model gives up early, the researchers suggest a simple prompt-and-budgeting change.
“The engineering idea is simple,” Tran and Kiela said. “First, restructure the single-agent prompt so the model is explicitly encouraged to spend its available reasoning budget on pre-answer analysis.”
By instructing the model to explicitly identify ambiguities, list candidate interpretations, and test alternatives before committing to a final answer, developers can recover the benefits of collaboration inside a single-agent setup.
The results of their experiments confirm that a single agent is the strongest default architecture for multi-hop reasoning tasks. It produces the highest accuracy answers while consuming fewer reasoning tokens. When paired with specific models like Google’s Gemini 2.5, the longer-thinking variant produces even better aggregate performance.
The researchers rely on a concept called “Data Processing Inequality” to explain why a single agent outperforms a swarm. Multi-agent frameworks introduce inherent communication bottlenecks. Every time information is summarized and handed off between different agents, there is a risk of data loss.
In contrast, a single agent reasoning within one continuous context avoids this fragmentation. It retains access to the richest available representation of the task and is thus more information-efficient under a fixed budget.
The authors also note that enterprises often overlook the secondary costs of multi-agent systems.
“What enterprises often underestimate is that orchestration is not free,” they said. “Every additional agent introduces communication overhead, more intermediate text, more opportunities for lossy summarization, and more places for errors to compound.”
On the other hand, they discovered that multi-agent orchestration is superior when a single agent’s environment gets messy. If an enterprise application must handle highly degraded contexts, such as noisy data, long inputs filled with distractors, or corrupted information, a single agent struggles. In these scenarios, the structured filtering, decomposition, and verification of a multi-agent system can recover relevant information more reliably.
The study also warns about hidden evaluation traps that falsely inflate multi-agent performance. Relying purely on API-reported token counts heavily distorts how much computation an architecture is actually spending. The researchers found these accounting artifacts when testing models like Gemini 2.5, proving this is an active issue for enterprise applications today.
“For API models, the situation is trickier because budget accounting can be opaque,” the authors said. To evaluate architectures reliably, they advise developers to “log everything, measure the visible reasoning traces where available, use provider-reported reasoning-token counts when exposed, and treat those numbers cautiously.”
If a single-agent system matches the performance of multiple agents under equal reasoning budgets, it wins on total cost of ownership by offering fewer model calls, lower latency, and simpler debugging. Tran and Kiela warn that without this baseline, “some enterprises may be paying a large ‘swarm tax’ for architectures whose apparent advantage is really coming from spending more computation rather than reasoning more effectively.”
Another way to look at the decision boundary is not how complex the overall task is, but rather where the exact bottleneck lies.
“If it is mainly reasoning depth, SAS is often enough. If it is context fragmentation or degradation, MAS becomes more defensible,” Tran said.
Engineering teams should stay with a single agent when a task can be handled within one coherent context window. Multi-agent systems become necessary when an application handles highly degraded contexts.
Looking ahead, multi-agent frameworks will not disappear, but their role will evolve as frontier models improve their internal reasoning capabilities.
“The main takeaway from our paper is that multi-agent structure should be treated as a targeted engineering choice for specific bottlenecks, not as a default assumption that more agents automatically means better intelligence,” Tran said.
Every frontier AI lab right now is rationing two things: electricity and compute. Most of them buy their compute for model training from the same supplier, at the steep gross margins that have turned Nvidia into one of the most valuable companies in the world. Google does not.
On Tuesday night, inside a private gathering at F1 Plaza in Las Vegas, Google previewed its eighth-generation Tensor Processing Units. The pitch: two custom silicon designs shipping later this year, each purpose-built for a different half of the modern AI workload. TPU 8t targets training for frontier models, and TPU 8i targets the low-latency, memory-hungry world of agentic inference and real-time sampling.
Amin Vahdat, Google’s SVP and chief technologist for AI and infrastructure (pictured above left), used his time onstage to make a point that matters more to enterprise buyers than any individual spec: Google designs every layer of its AI stack end-to-end, and that vertical integration is starting to show up in cost-per-token economics that Google says its rivals cannot match.
The more interesting story behind v8t and v8i is when the decision to split the roadmap was made. The call came in 2024, according to Vahdat — a year before the industry at large pivoted to reasoning models, agents and reinforcement learning as the dominant frontier workload.
At the time, it was a contrarian read. “We realized two years ago that one chip a year wouldn’t be enough,” Vahdat said during the fireside. “This is our first shot at actually going with two super high-powered specialized chips.”
For enterprise buyers, the implication is concrete. Customers running fine-tuning or large-scale training on Google Cloud and customers serving production agents on Vertex AI have been renting the same accelerators and eating the inefficiency. V8 is the first generation where the silicon itself treats those as different problems with two sets of chips.
On paper, TPU 8t is an aggressive generational step. According to Google, 8t delivers 2.8x the FP4 EFlops per pod (121 vs 42.5) against Ironwood, the seventh-generation TPU that shipped in 2025, doubles bidirectional scale-up bandwidth to 19.2 Tb/s per chip, and quadruples scale-out networking to 400 Gb/s per chip. Pod size grows modestly from 9,216 to 9,600 chips, held together by Google’s 3D Torus topology.
The number that matters most to IT leaders evaluating where to run frontier-scale training: 8t clusters (Superpods) can scale beyond 1 million TPU chips in a single training job via a new interconnect Google is calling Virgo networking.
8t also introduces TPU Direct Storage, which moves data from Google’s managed storage tier directly into HBM without the usual CPU-mediated hops. For long training runs where wall-clock time is the cost driver, collapsing that data path reduces the number of pod-hours needed to finish each epoch.
If 8t is an evolutionary step, TPU 8i is the more architecturally interesting chip. It is also where the story for IT buyers gets most compelling.
The year-over-year spec jumps are, as Vahdat put it, “stunning.” According to Google, 8i delivers 9.8x the FP8 EFlops per pod (11.6 vs 1.2), 6.8x the HBM capacity per pod (331.8 TB vs 49.2), and a pod size that grows 4.5x from 256 to 1,152 chips.
What drove those numbers is a rethink of the network itself. Vahdat explained the insight directly: Google’s default way of connecting chips together supported bandwidth over latency — good for moving large amounts of data through, not built for the minimum time it takes a response to get back. That profile works for training. For agents, it does not. In partnership with Google DeepMind, the TPU team built what Google calls Boardfly topology specifically to reduce the network diameter — shrinking the number of hops between any two chips in a pod. Paired with a Collective Acceleration Engine and what Google describes as very large on-chip SRAM, 8i delivers a claimed 5x improvement in latency for real-time LLM sampling and reinforcement learning.
The subtext across Vahdat’s presentation was a six-layer diagram Google calls its AI stack: energy at the foundation, then data center land and enclosures, AI infrastructure hardware, AI infrastructure software, models (Gemini 3), and services on top. Vahdat noted that designing each layer in isolation forces you to the least common denominator for each layer. Google designs them together.
This is where the competitive story for IT buyers and analysts crystallizes. OpenAI, Anthropic, xAI and Meta all depend heavily on Nvidia silicon to train their frontier models. Every H200 and Blackwell GPU they buy carries Nvidia’s data-center gross margin — the informal “Nvidia tax” that industry analysts have flagged for two years running as a structural cost disadvantage for anyone renting rather than designing. Google pays fab, packaging and engineering costs on its TPUs. It does not pay that margin.
For procurement and infrastructure teams, TPUv8 reframes the 2026–2027 cloud evaluation in concrete ways.
Teams training large proprietary models should look at 8t availability windows, Virgo networking access, and goodput SLAs — not just headline EFlops. Teams serving agents or reasoning workloads should evaluate 8i availability on Vertex AI, independent latency benchmarks as they emerge, and whether HBM-per-pod sizing fits their context windows. Teams consuming Gemini through Gemini Enterprise should inherit the 8i lift and should expect the ceiling on what they can deploy in production to rise meaningfully through 2026.
The caveats are real. General availability is still “later in 2026.” The v8 is a roadmap signal, not a procurement decision today. Google’s benchmarks are self-reported; undoubtedly independent numbers will come from early cloud customers and third-party evaluators over the next two quarters. And portability between JAX/XLA and the CUDA/PyTorch ecosystem remains a friction cost worth thinking about when negotiating any multi-year commitment.
Looking further out, Vahdat made two predictions worth noting. First, general-purpose CPUs will see a resurgence inside AI systems — not as accelerators, but as orchestration compute for agent sandboxes, virtual machines and tool execution. Second, framed explicitly as an industry prediction rather than a Google roadmap preview, specialization also keeps going strong. As general-purpose CPUs gain plateau at a few percent a year, workloads that matter will demand purpose-built silicon. “Two chips might become more,” Vahdat said — without specifying whether the “more” would mean future TPU variants or other classes of specialized accelerators.
The frontier compute race used to be a question of who could buy the most H100s. It is now a question of who controls the stack. The shortlist of companies that genuinely do is, for the moment, two: Google and Nvidia.
When startup fundraising platform VentureCrowd began deploying AI coding agents, they saw the same gains as other enterprises: they cut the front-end development cycle by 90% in some projects.
However, it didn’t come easy or without a lot of trial and error.
VentureCrowd’s first challenge revolved around data and context quality, since Diego Mogollon, chief product officer at VentureCrowd, told VentureBeat that “agents reason against whatever data they can access at runtime” and would then be confidently “wrong” because they’re only basing their knowledge on the context given to them.
Their other roadblock, like many others, was messy data and unclear processes. Similar to context, Mogollon said coding agents would amplify bad data, so the company had to build a well-structured codebase first.
“The challenges are rarely about the coding agents themselves; they are about everything around them,” said Mogollon. “It’s a context problem disguised as an AI problem, and it is the number one failure mode I see across agentic implementations.”
Mogollon said VentureCrowd encountered several roadblocks in overhauling its software development.
VentureCrowd’s experience illustrates a broader issue in AI agent development. The models are not failing the agents; rather, they become overwhelmed by too much context and too many tools at once.
This comes from a phenomenon called Context bloat, when AI systems accumulate more and more data, tools or instructions, the more complex the workflows become.
The problem arises because agents need context to work better, but too much of it creates noise. And the more context an agent has to sift through, the more tokens it uses, the work slows down and the costs increase.
One way to curb context bloat is through context engineering. Context engineering helps agents understand code changes or pull requests and align them with their tasks.
However, context engineering often becomes an external task rather than built into the coding platforms enterprises use to build their agents.
VentureCrowd relied on one solution in particular to help it overcome the issues with context bloat plaguing its enterprise AI agent deployment: Salesforce’s Agentforce Vibes, a coding platform that lives within Salesforce and is available for all plans starting with the free one.
Salesforce recently updated Agentforce Vibes to version 2.0, expanding support for third-party frameworks like ReAct. Most important for companies like VentureCrowd, Agentforce Vibes added Abilities and Skills, which they can use to direct agent behavior.
“For context, our entire platform, frontend and backend, runs on the Salesforce ecosystem. So when Agentforce Vibes launched, it slotted naturally into an environment we already knew well,” Mogollon said.
Salesforce’s approach doesn’t minimize the context agents use; rather, it helps enterprises ensure that context stays within their data models or codebases. Agentforce Vibes adds additional execution through the new Skills and Abilities feature. Abilities define what agents want to accomplish, and Skills are the tools they will use to get there.
Other coding agent platforms manage context differently. For example, Claude Code and OpenAI’s Codex focus on autonomous execution, continuously reading files, running commands and as tasks evolve, expanding context. Claude Code has a context indicator that which compacts context when it becomes too large.
With these different approaches, the consistent pattern is that most systems manage growing contexts for agents, not necessarily to limit them. Context keeps growing, especially as workflows become more complex, making it more difficult for enterprises to control costs, latency and reliability.
Mogollon said his company chose Agentforce Vibes not only because a large portion of their data already lives on Salesforce, making it easier to integrate, but also because it would allow them to control more of the context they feed their agents.
There’s no single way to address context bloat, but the pattern is now clear: more context doesn’t always mean better results.
Along with investing in context engineering, enterprises have to experiment with the context constraint approach they are most comfortable with. For enterprises, that means the challenge isn’t just giving agents more information—it’s deciding what to leave out.
Google on Monday unveiled the most significant upgrade to its autonomous research agent capabilities since the product’s debut, launching two new agents — Deep Research and Deep Research Max — that for the first time allow developers to fuse open web data with proprietary enterprise information through a single API call, produce native charts and infographics inside research reports, and connect to arbitrary third-party data sources through the Model Context Protocol (MCP).
The release, built on Google’s Gemini 3.1 Pro model, marks an inflection point in the rapidly intensifying race to build AI systems that can autonomously conduct the kind of exhaustive, multi-source research that has traditionally consumed hours or days of human analyst time. It also represents Google’s clearest bid yet to position its AI infrastructure as the backbone for enterprise research workflows in finance, life sciences, and market intelligence — industries where the stakes of getting information wrong are extraordinarily high.
“We are launching two powerful updates to Deep Research in the Gemini API, now with better quality, MCP support, and native chart/infographics generation,” Google CEO Sundar Pichai wrote on X. “Use Deep Research when you want speed and efficiency, and use Max when you want the highest quality context gathering & synthesis using extended test-time compute — achieving 93.3% on DeepSearchQA and 54.6% on HLE.”
Both agents are available starting today in public preview via paid tiers of the Gemini API, accessible through the Interactions API that Google first introduced in December 2025.
The launch introduces a tiered architecture that reflects a fundamental tension in AI agent design: the tradeoff between speed and thoroughness.
Deep Research, the standard tier, replaces the preview agent Google released in December and is optimized for low-latency, interactive use cases. It delivers what Google describes as significantly reduced latency and cost at higher quality levels compared to its predecessor. The company positions it as ideal for applications where a developer wants to embed research capabilities directly into a user-facing interface — think a financial dashboard that can answer complex analytical questions in near-real time.
Deep Research Max occupies the opposite end of the spectrum. It leverages extended test-time compute — a technique where the model spends more computational cycles iteratively reasoning, searching, and refining its output before delivering a final report. Google designed it for asynchronous, background workflows: the kind of task where an analyst team kicks off a batch of due diligence reports before leaving the office and expects exhaustive, fully sourced analyses waiting for them the next morning.
The Google DeepMind team framed the distinction on X: “Deep Research: Optimized for speed and efficiency. Perfect for interactive apps needing quicker responses. Deep Research Max: It uses extra time to search and reason. Ideal for exhaustive context gathering and tasks happening in the background.”
“Deep Research was our first hosted agent in the API and has gained a ton of traction over the last 3 months, very excited for folks to test out the new agents and all the improvements, this is just the start of our agents journey,” Logan Kilpatrick, who leads developer relations for Google’s AI efforts, wrote on X.
Perhaps the most consequential feature in today’s release is the addition of Model Context Protocol support, which transforms Deep Research from a sophisticated web research tool into something more closely resembling a universal data analyst.
MCP , an emerging open standard for connecting AI models to external data sources, allows Deep Research to securely query private databases, internal document repositories, and specialized third-party data services — all without requiring sensitive information to leave its source environment. In practical terms, this means a hedge fund could point Deep Research at its internal deal-flow database and a financial data terminal simultaneously, then ask the agent to synthesize insights from both alongside publicly available information from the web.
Google disclosed that it is actively collaborating with FactSet, S&P, and PitchBook on their MCP server designs, a signal that the company is pursuing deep integration with the data providers that Wall Street and the broader financial services industry already rely on daily. The goal, according to the blog post authored by Google DeepMind product managers Lukas Haas and Srinivas Tadepalli, is to “let shared customers integrate financial data offerings into workflows powered by Deep Research, and to enable them to realize a leap in productivity by gathering context using their exhaustive data universes at lightning speed.”
This addresses one of the most persistent pain points in enterprise AI adoption: the gap between what a model can find on the open internet and what an organization actually needs to make decisions. Until now, bridging that gap required significant custom engineering. MCP support, combined with Deep Research’s autonomous browsing and reasoning capabilities, collapses much of that complexity into a configuration step. Developers can now run Deep Research with Google Search, remote MCP servers, URL Context, Code Execution, and File Search simultaneously — or turn off web access entirely to search exclusively over custom data. The system also accepts multimodal inputs including PDFs, CSVs, images, audio, and video as grounding context.
The second headline feature — native chart and infographic generation — may sound incremental, but it addresses a practical limitation that has constrained the usefulness of AI-generated research outputs in professional settings.
Previous versions of Deep Research produced text-only reports. Users who needed visualizations had to export the data and build charts themselves, a friction point that undermined the promise of end-to-end automation. The new agents generate high-quality charts and infographics inline within their reports, rendered in HTML or Google’s Nano Banana format, dynamically visualizing complex datasets as part of the analytical narrative.
“The agent generates HTML charts and infographics inline with the report. Not screenshots. Not suggestions to ‘visualize this data.’ Actual rendered charts inside the markdown output,” noted AI commentator Shruti Mishra on X, capturing the practical significance of the change.
For enterprise users — particularly those in finance and consulting who need to produce stakeholder-ready deliverables — this transforms Deep Research from a tool that accelerates the research phase into one that can potentially produce near-final analytical products. Combined with a new collaborative planning feature that lets users review, guide, and refine the agent’s research plan before execution, and real-time streaming of intermediate reasoning steps, the system gives developers granular control over the investigation’s scope while maintaining the transparency that regulated industries demand.
Today’s release crystallizes a strategic narrative Google has been building for months: Deep Research is not merely a consumer feature but a piece of infrastructure that powers multiple Google products and is now being offered to external developers as a platform.
The blog post explicitly notes that when developers build with the Deep Research agent, they tap into “the same autonomous research infrastructure that powers research capabilities within some of Google’s most popular products like Gemini App, NotebookLM, Google Search and Google Finance.” This suggests that the agent available through the API is not a stripped-down version of what Google uses internally but the same system, offered at platform scale.
The journey to this point has been remarkably rapid. Google first introduced Deep Research as a consumer feature in the Gemini app in December 2024, initially powered by Gemini 1.5 Pro. At the time, the company described it as a personal AI research assistant that could save users hours by synthesizing web information in minutes. By March 2025, Google upgraded Deep Research with Gemini 2.0 Flash Thinking Experimental and made it available for anyone to try. Then came the upgrade to Gemini 2.5 Pro Experimental, where Google reported that raters preferred its reports over competing deep research providers by more than a 2-to-1 margin. The December 2025 release was the pivot to developer access, when Google launched the Interactions API and made Deep Research available programmatically for the first time, powered by Gemini 3 Pro and accompanied by the open-source DeepSearchQA benchmark.
The underlying model driving today’s improvements is Gemini 3.1 Pro, which Google released on February 19, 2026. That model represented a significant leap in core reasoning: on ARC-AGI-2, a benchmark evaluating a model’s ability to solve novel logic patterns, 3.1 Pro scored 77.1% — more than double the performance of Gemini 3 Pro. Deep Research Max inherits that reasoning foundation and layers autonomous research behaviors on top of it, achieving 93.3% on DeepSearchQA (up from 66.1% in December) and 54.6% on Humanity’s Last Exam (up from 46.4%).
Google is not operating in a vacuum. The launch arrives amid intensifying competition in the autonomous research agent space. OpenAI has been developing its own agent capabilities within ChatGPT under the codename Hermes, which includes an agent builder, templates, scheduling, and Slack integration, according to reports circulating on social media. Perplexity has built its business around AI-powered research. And a growing ecosystem of startups is attacking various slices of the automated research workflow.
What distinguishes Google’s approach is the combination of its search infrastructure — which gives Deep Research access to the broadest and most current index of web information available — with the MCP-based connectivity to enterprise data sources. No other company currently offers a research agent that can simultaneously query the open web at Google Search’s scale and navigate proprietary data repositories through a standardized protocol. The pricing structure also signals Google’s intent to drive adoption: according to Sim.ai, which tracks model pricing, the Deep Research agent in the December preview was priced at $2 per million input tokens and $2 per million output tokens with a 1 million token context window — positioning it as cost-competitive for the volume of research output it generates.
Not everyone greeted the announcement with unalloyed enthusiasm, however. Several users on X noted that the new agents are available only through the API, not in the Gemini consumer app. “Not on Gemini app,” observed TestingCatalog News, while another user wrote, “Google keeps punishing Gemini App Pro subscribers for some reason.” Others raised concerns about the presentation of benchmark results, with one user arguing that Google’s charts could be “misleading” in how they represent percentage improvements. These complaints point to a broader tension in Google’s AI strategy: the company is increasingly directing its most advanced capabilities toward developers and enterprise customers who access them through APIs, while consumer-facing products sometimes lag behind.
The practical implications of today’s launch are most immediately felt in industries that depend on exhaustive, multi-source research as a core business function. In financial services, where analysts routinely spend hours assembling due diligence reports from scattered sources — SEC filings, earnings transcripts, market data terminals, internal deal memos — Deep Research Max offers the possibility of automating the initial research phase entirely. The FactSet, S&P, and PitchBook partnerships suggest Google is serious about making this work with the data infrastructure that financial professionals already use.
In life sciences, the blog post notes that Google has collaborated with Axiom Bio, which builds AI systems to predict drug toxicity, and found that Deep Research unlocked new levels of initial research depth across biomedical literature. In market research and consulting, the ability to produce stakeholder-ready reports with embedded visualizations and granular citations could compress project timelines from days to hours.
The key question is whether the quality and reliability of these automated outputs will meet the standards that professionals in these fields demand. Google’s benchmark numbers are impressive, but benchmarks measure performance on standardized tasks — real-world research is messier, more ambiguous, and often requires the kind of judgment that remains difficult to automate. Deep Research and Deep Research Max are available now in public preview via paid tiers of the Gemini API, with availability on Google Cloud for startups and enterprises coming soon.
Eighteen months ago, Deep Research was a feature that helped grad students avoid drowning in browser tabs. Today, Google is betting it can replace the first shift at an investment bank. The distance between those two ambitions — and whether the technology can actually close it — will define whether autonomous research agents become a transformative category of enterprise software or just another AI demo that dazzles on benchmarks and disappoints in the conference room.