Every new AI agent your team deploys starts from scratch: no memory of how the business works, where data lives, or what rules apply. And as agentic coding tools spin up applications faster than anyone can govern them, each one risks becoming another silo outside your data layer entirely. Microsoft is addressing both problems directly at Build 2026.
According to VentureBeat’s VB Pulse Q1 2026 RAG Infrastructure Market Tracker, hybrid retrieval intent among organizations with 100 or more employees tripled from 10.3% in January to 33.3% in March, a signal that enterprises have moved past expanding RAG coverage and are now focused on the architecture underneath it. Shared business context is the part retrieval does not solve.
On the context side, Microsoft is expanding Fabric IQ, its existing business data context layer, into a broader unified system called Microsoft IQ, adding three additional context sources covering how the organization works, what it knows and real-time global signals from the web, so any agent can tap all four as a single foundation. On the application side, Rayfin, a new open-source SDK and CLI, deploys agent-built applications directly to Fabric as a governed production backend, routing application data into the same platform rather than spinning up new silos.
Amir Netz, CTO of Microsoft Fabric, reached for a film analogy to explain where the data platform fits. The green screen of cascading code in “The Matrix” wasn’t atmosphere; it was the layer that built the world Agent Smith operated in.
“Our job in the world of data is creating reality for agents based on data,” Netz told VentureBeat.
Microsoft IQ brings together four context sources that until now existed separately, designed so a developer can connect a new agent to all four in a single integration step.
Work IQ. Captures how the organization operates day to day, drawing on email, documents, meetings and schedules to give agents an understanding of people, teams and workflows.
Foundry IQ. Manages institutional knowledge, curating and indexing knowledge bases so agents understand what it means to work within the organization, what rules apply and what procedures to follow.
Fabric IQ. Models the live operational state of the business through data, defining entities, relationships and business rules grounded in real-time signals from Fabric Real-Time Intelligence. Ontologies, the layer that captures that operational context, are expected to reach GA in the coming months.
Web IQ. Adds real-time global context from the web, giving agents a current picture of the world outside the organization alongside its internal data.
“The agents are going to become highly informed virtual employees,” Netz said. “That’s where the world is heading.”
Building shared context solves one half of the problem. The other is what happens when agents start generating applications. Every new app needs a backend, and without a governed deployment path each one creates a new data silo outside the context layer entirely.
Rayfin provides an enterprise-grade back end and deploys agent-built applications directly to Fabric, so application data lands in Microsoft OneLake by default and feeds back into the Microsoft IQ context layer rather than accumulating outside it.
Microsoft positions Rayfin against Supabase and Neon, the Postgres-compatible backends that agentic coding tools default to. The differentiator is governance: Rayfin routes the entire application fleet through Fabric’s unified data and compliance layer rather than creating isolated silos.
Netz described the relationship as bidirectional. The agent building a Rayfin application draws from the organization’s ontology. The data that application generates then enriches that ontology for the next agent.
Microsoft is not the only platform building a shared context layer for agents. Snowflake announced its own semantic context capabilities this week. Pinecone is expanding its vector database into a knowledge engine with its Nexus platform, and Redis has developed Iris, its context and memory platform.
Microsoft’s approach further reinforces the trend that RAG and model availability aren’t the issue anymore.
“Fabric IQ and Rayfin are important because the enterprise AI challenge is no longer just about the model availability,” Robert Kramer, managing partner at KramerERP, told VentureBeat. “The real question is whether Microsoft simplifies execution and strengthens trust or adds another layer to an already complex environment.”
Enterprise AI agents have a new production failure mode, and it is not the model. As enterprises move from single-layer RAG to hybrid retrieval architectures, the same underlying data produces different answers depending on which agent, tool or system asks the question. Revenue means one thing in a business intelligence (BI) dashboard, something slightly different in a SQL table and something else again in an agent instruction. The retrieval infrastructure build-out of the past two years produced faster and cheaper vector search. It did not produce a shared definition of what the data means.
At Snowflake Summit 26 in San Francisco, the data cloud vendor is taking a broad swing at that problem, with announcements spanning a Kafka-compatible managed streaming service called Data Stream, adaptive compute improvements, expanded Apache Iceberg interoperability and updates to its Cowork and CoCo agent and coding products. Running underneath all of it is a context layer: Horizon Context and Cortex Sense, a two-layer system designed to give agents a governed, shared definition of business logic across retrieval stacks. The context problem is why it matters: VentureBeat’s VB Pulse Q1 2026 data, drawn from a survey of organizations with 100 or more employees, shows hybrid retrieval intent tripling from 10.3% in January to 33.3% in March, the fastest-growing strategic position in the dataset.
“There are a lot of tools out there that you can ask questions, you get a very confident answer, but whether it’s correct or not is different,” said Christian Kleinerman, EVP of Product at Snowflake.
The problem Horizon Context targets is specific. Business logic today is distributed across SQL, BI dashboards and agent instructions, and no single system owns the definition. When multiple agents or tools query the same underlying data, they reason over different schemas and return different answers. Horizon Context is Snowflake’s attempt to fix that at the catalog layer rather than at the agent layer.
Horizon Context. The customer-managed layer, built on Snowflake’s acquisition of Select Star. It pulls metadata from Postgres, SQL Server, Tableau and Power BI into the Horizon Catalog, so every agent, BI tool and external system draws from the same governed definition rather than reasoning independently over a raw physical schema. Semantic View Autopilot automatically creates and refines semantic views over time, extending curated business logic without requiring ongoing manual effort.
Cortex Sense. The platform-derived layer. It automatically builds and enriches context from customer data and usage patterns on an ongoing basis, without requiring manual semantic view authoring. Kleinerman described it as improving the default experience before any explicit curation has happened.
The distinction between the two layers is architectural and Kleinerman was precise about it. “Think of Horizon Context as everything that is explicit and declared by customers, and Cortex Sense is anything that is implicit and derived by us,” Kleinerman said.
The two layers connect to Snowflake’s existing retrieval infrastructure. Cortex Search, the company’s RAG implementation, plugs into both CoCo and Cowork as a tool, so context enriched by either layer flows into retrieval workflows.
While Horizon Context is a Snowflake technology, the goal is for it to be interoperable and open. Snowflake is tying the technology to the Open Semantic Interchange, making customer-declared definitions portable across third-party catalogs and tools.
“Horizon Context, 100% we’re committed to and leading the effort to make sure that that’s not locked in,” Kleinerman said.
Snowflake is joining an increasingly crowded field of vendors targeting the same problem. Microsoft has opened its Fabric IQ business ontology via MCP so any vendor’s agent can draw from a shared semantic layer. Redis launched Iris, a context and memory platform that sits between agents and their data, built on a storage engine redesigned for agent-scale retrieval volumes. Pinecone is repositioning from vector database to knowledge engine with Nexus, which compiles enterprise data into task-specific artifacts before agents ever query them.
Devin Pratt, research director at IDC, told VentureBeat that in his view Snowflake is headed in the right direction and is going where the whole market is heading.
“Agents are only as good as the data and semantics behind them, so the context layer, not the model, is the thing to watch right now,” Pratt said.
In Pratt’s view, what works about Snowflake’s version is the split. Horizon Context covers what teams declare and curate themselves, and Cortex Sense covers what the platform picks up automatically. Just as important, they’ve anchored Horizon Context inside the catalog and governance layer rather than bolting it on after the fact.
“The context layer is the real battleground for agentic AI. An agent is only as trustworthy as the data and semantics behind it,” Pratt said.
Mike Leone, VP and principal analyst at Moor Insights and Strategy, agreed that treating the two layers differently is the right architectural call.
“I like where Snowflake’s heading. They’re splitting context into two buckets, with Horizon Context covering what customers explicitly define and Cortex Sense covering what the platform figures out on its own,” Leone told VentureBeat. “You can’t trust those two things the same way, so treating them differently is the right call. If Snowflake can show those two layers reconcile cleanly and you can see where every answer came from, they’ve got something real.”
For enterprises evaluating context layers, the architectural direction is clear. The execution gap is not.
Agents raise the bar on an old problem. The semantic layer idea has existed for years, but agents change what failure costs — when an agent gives a wrong answer at scale, the damage is immediate. Leone is direct about what that means for most vendors currently in the market.
“Most vendors selling a drop-in fix are overpromising,” Leone said. “Drop one into a real enterprise and it mostly exposes how messy your data and definitions already are, and a lot of companies are about to find that out the hard way.”
The evaluation bar is specific. Pratt identified what separates context layers that work from those that stall: governance and lineage built in so teams can audit why an agent gave the answer it did, portability so context and policy are not locked to one vendor, and accuracy that can be measured and reused across agents and tools.
“Enterprises don’t need another silo of semantics,” Pratt said. “They need a context layer that’s governed, portable, and trustworthy enough to audit.”
When Miro’s data team pointed AI agents directly at its Snowflake environment, the agents got the wrong answer more than 65% of the time. The problem wasn’t the model — it was context. With more than 10,000 tables and no semantic layer to guide routing, the agents had no way to know which data assets matched which business questions.
DataHub is releasing a context intelligence layer Thursday that mines existing SQL query history to build a semantic index — and exposes it to agents via MCP, LangChain, Google’s Agent Development Kit and CrewAI. The company calls it Context Intelligence, and it’s built on the same query-log infrastructure DataHub has used for lineage tracking in production deployments worldwide.
The company was founded by the team that built DataHub as an open source project at LinkedIn, where co-founder and CTO Shirshanka Das led data infrastructure for nearly 11 years. The open source project now has more than 15,000 contributors and 3,000 production deployments worldwide.
“For the first time, enterprises can turn years of analyst query history into a living, retrievable knowledge base where agents stop hallucinating joins because they have access to the joins that have worked before, validated by the people who ran them,” Shirshanka Das, co-founder and CTO of DataHub, told VentureBeat in an exclusive interview.
DataHub began as a metadata management project at LinkedIn, built to solve two problems simultaneously: making data easy to find and use across the organization while ensuring it was only used for the right reasons. Das open-sourced the project in early 2020 after nearly six years of internal development.
The primary use case in the years since has been lineage — understanding how data flows from operational systems through streaming infrastructure into warehouses and out to business tools. Regulatory compliance audits, operational triage and new engineer onboarding all depend on that lineage graph. Postgres is the most-connected source in the DataHub deployment base globally, followed by MySQL, Oracle and the major cloud warehouses including Snowflake and Google BigQuery. The platform supports more than 100 connected metadata sources.
That deployed base matters for what DataHub is releasing. The query log extraction and SQL parsing capabilities powering Context Intelligence were developed across years of production deployment, not built for this release. The same infrastructure now serves agents querying a semantic index at runtime.
“The consumption layer has changed from humans to agents,” Das said.
Context Intelligence is a new capability layer built on top of DataHub’s existing open source metadata foundation. The open source platform has spent years extracting and parsing query logs from connected warehouses for lineage tracking. That same infrastructure is what Context Intelligence draws on to build the semantic index. The capability is new. The underlying plumbing is not.
Filtering for signal. Warehouse query logs contain too much noise to use directly. DataHub’s engine filters for what Das describes as the “golden queries,” meaning high-quality analyst queries and scheduled pipelines that represent proven business logic.
Inverting SQL into semantic definitions. The engine extracts patterns from those queries and translates them into structured text definitions DataHub calls semantic anchors. Those anchors form the retrieval basis agents draw on before generating SQL.
“You can almost think of it as inverting text to SQL,” Das said.
Human validation on top. Context Hub lets domain experts review AI-proposed context, resolve conflicting definitions and simulate the impact of changes before publishing. DataHub surfaces cases where different teams calculate the same metric differently and raises them for human resolution.
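The filtering step described above can be illustrated with a minimal Python sketch. This is not DataHub's implementation: the record fields (`source`, `runs`, `succeeded`) and the `min_runs` threshold are hypothetical, chosen only to show the idea of keeping repeated, successful analyst and scheduled queries while discarding ad hoc noise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryLogEntry:
    sql: str
    source: str      # hypothetical label: "analyst", "scheduled" or "adhoc"
    runs: int        # how many times this exact statement was executed
    succeeded: bool  # whether the most recent run completed without error


def golden_queries(log, min_runs=5):
    """Keep queries that look like proven business logic (illustrative heuristic)."""
    trusted_sources = {"analyst", "scheduled"}
    return [
        entry for entry in log
        if entry.succeeded
        and entry.source in trusted_sources
        and entry.runs >= min_runs
    ]


log = [
    QueryLogEntry("SELECT region, SUM(amount) FROM orders GROUP BY region", "scheduled", 120, True),
    QueryLogEntry("SELECT * FROM tmp_debug LIMIT 10", "adhoc", 1, True),
    QueryLogEntry("SELECT user_id FROM events JOIN users USING (user_id)", "analyst", 2, True),
    QueryLogEntry("SELECT plan, COUNT(*) FROM subscriptions GROUP BY plan", "analyst", 14, False),
]

kept = golden_queries(log)
print([entry.sql for entry in kept])
```

Only the scheduled revenue query survives here. A real system would presumably weigh many more signals, and the human review step would still sit on top of whatever the filter keeps.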
Miro, the digital collaboration platform, was already using DataHub for lineage tracking and impact analysis when it began testing analytics agents against its Snowflake environment. Ronald Angel, product manager for the data platform at Miro, told VentureBeat that the scale of the data estate became the problem immediately. Sending natural language queries directly to the Snowflake MCP produced incorrect answers more than 65% of the time. Exposing more than 10,000 tables directly to agents caused too much confusion for reliable routing.
Miro addressed the problem by organizing data into well-defined data products that constrain what agents can see rather than exposing raw schema. The production architecture runs from user requests submitted via Claude Chat or Claude Cowork through a context layer where DataHub’s MCP maps natural language to the appropriate data assets, then hands off to Snowflake’s MCP for SQL generation.
Angel said the context layer pulls in metadata, entity relationships, query history and business intent for each Snowflake table, specifically what business question each entity is designed to answer. Those semantic signals allow the agent to identify the correct database entities before writing SQL rather than guessing from schema alone.
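The route-then-generate pattern Angel describes can be sketched in a few lines of Python. Everything here is an assumption for illustration: the catalog entries, the keyword-overlap scoring and the stubbed SQL step are hypothetical and do not represent Miro's data products, DataHub's MCP server or Snowflake's MCP server.

```python
import re

# Hypothetical catalog: each data product records the business question it answers.
DATA_PRODUCTS = {
    "analytics.board_activity": "How many boards were created or edited per day by team",
    "finance.subscription_revenue": "What is monthly recurring revenue by plan and region",
    "growth.signup_funnel": "How many visitors sign up and activate each week",
}


def tokens(text):
    """Lowercase alphabetic words in the text, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def route(question, catalog=DATA_PRODUCTS):
    """Return the data product whose stated business question overlaps most with the query."""
    asked = tokens(question)
    scores = {name: len(asked & tokens(desc)) for name, desc in catalog.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def answer(question):
    table = route(question)
    if table is None:
        return "No matching data product; refusing to guess from raw schema."
    # Stand-in for the SQL-generation step, which would only see the routed table.
    return f"generate SQL against {table}"


print(answer("What was recurring revenue by region last month?"))
```

The design point is the ordering: the agent picks from a small set of described data products first, and SQL generation only ever sees the selected entity rather than thousands of raw tables.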
Data vendors including Pinecone, Oracle and Redis all have contextual memory capabilities. On the platform side Microsoft has built out its Fabric IQ as a semantic layer for context.
DataHub’s argument isn’t feature parity. The company is positioning the context layer as platform-neutral — provisioning context into existing endpoints like Snowflake semantic views and Microsoft Fabric IQ rather than replacing them.
“A lot of times people want to be platform neutral when it comes to their context layer,” Das said.
Kevin Petrie, an analyst at BARC, told VentureBeat that he sees DataHub’s ability to integrate diverse metadata for both structured and unstructured objects, including documents and images, as differentiating the company in the market.
“Many other vendors are more focused on structured tables, which provide trusted facts but often lack the rich context of text objects,” he said.
Michael Ni, VP and principal analyst at Constellation Research, told VentureBeat that for him what stands out about DataHub’s context layer is its support of the shift from passive cataloging to continuously refreshed semantic intelligence.
Ni described the competition for context as the next major platform war, arguing that whoever controls context at runtime controls the decision layer for data, agents, workflows and decisions.
“Buyers need to be careful, since many vendors only support a portion of the full context capabilities required for AI and agentic solutions,” Ni said. “Buyers should be clear on their context management requirements, as vector memory isn’t business meaning, business meaning isn’t governance, and governance isn’t execution.”
The data processing agreement (DPA) — the bedrock contract companies use to evaluate how vendors handle personal data — can no longer be trusted at face value. That is the central, and arguably most alarming, conclusion of DataGrail’s Privacy and AI Trends Report 2026, released today.
The San Francisco-based privacy platform analyzed 2,400 popular business software providers and found that 63.6% of vendors that prominently advertise AI capabilities do not disclose a third-party AI subprocessor in their legal documentation. The implication: the majority of companies purchasing AI-enabled software may be unknowingly exposing their customers’ data to AI models and pipelines they never reviewed, never approved, and may not even know exist.
“All software vendors are trying to move to become AI vendors, which makes sense, but the technologies are moving faster than AI governance can actually keep up,” DataGrail co-founder and CEO Daniel Barber told VentureBeat in an exclusive interview ahead of the report’s release. “The DPA should be the reliable document that teams use to evaluate AI risk, but based on that number, that’s not enough in 2026.”
The finding drops into an enterprise landscape where organizations with high levels of shadow AI already experience average breach costs of $4.63 million, which is $670,000 more than those with low or no shadow AI, according to IBM’s 2025 Cost of Data Breach Report. And it arrives in a year when U.S. states levied $3.425 billion in privacy-related fines, more than the last five years combined, a trend Gartner expects to accelerate through 2028.
DataGrail’s methodology for arriving at the 63.6% figure goes well beyond reading contracts. The company’s research team cross-referenced DPA disclosures against product documentation, GitHub environments, API connections, and marketing materials for each of the 2,400 vendors in its tracking universe.
Barber walked VentureBeat through the process: “We looked at the DPA as the baseline, but then what we also looked at is the GitHub environment, the API connections that a particular vendor has, the product documentation, the marketing documentation, and triangulate that information to discern — okay, so the DPA document says use OpenAI, but actually you’ve got these three AI subprocessors over here in your product documentation outlining features and functionality, but that is not reflected in your DPA.”
When asked directly about how confident he was that these gaps represent actual shadow AI risk rather than vendors using proprietary technology, Barber was unequivocal. “Very confident, because we looked at the sample of the 2,400 systems, and we spent a substantial amount of time actually looking at product documentation, GitHub environments, looking at actual API connections, because we integrate with these systems as well, so we know how they process personal information. It is from primary research.”
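At its core, the triangulation Barber describes is a comparison between what the DPA lists and what other evidence reveals. A minimal Python sketch of that comparison follows. It is not DataGrail's tooling; the source labels and the placeholder provider names ("Provider B" and so on) are hypothetical.

```python
def undisclosed_subprocessors(dpa_listed, observed_by_source):
    """Return AI subprocessors seen in other evidence but absent from the DPA."""
    observed = set().union(*observed_by_source.values())
    return observed - set(dpa_listed)


# The DPA names one provider; other evidence (hypothetical) suggests more.
dpa_listed = {"OpenAI"}
observed_by_source = {
    "product_docs": {"OpenAI", "Provider B"},
    "api_connections": {"Provider C"},
    "marketing": {"OpenAI", "Provider B", "Provider D"},
}

gap = undisclosed_subprocessors(dpa_listed, observed_by_source)
print(sorted(gap))
```

In practice the hard part is gathering and normalizing the evidence, not the set difference itself, which is why the report leans on primary research across documentation, repositories and integrations.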
The disclosure gap matters because it undermines the entire chain of trust that privacy programs rely on. Consider a scenario Barber described: A company invests in an AI recruiting tool. The tool’s DPA lists Claude as its foundational model. The company dutifully performs a security review of Anthropic’s AI. But the recruiting tool also quietly uses OpenAI and Gemini behind the scenes — models the company never evaluated.
Those undisclosed models then process thousands of resumes and execute automated hiring decisions. The company, without knowing it, has exposed sensitive personal information — home addresses, financial data, possibly Social Security numbers — to AI systems it never vetted, potentially violating FTC regulations on automated decision-making in employment. “How those vendors are evaluating and performing that automated decision making could be really disastrous for a business,” Barber said.
The disclosure gap alone would be concerning enough. But DataGrail’s report layers on another finding that makes the problem materially worse: 32.8% of AI systems that disclose AI capabilities also disclose at least one other high-risk activity, such as processing sensitive personal information or powering automated decision-making. Among AI systems with self-reported risk factors, 47.1% process personal data, 20.7% have the potential to power automated decision-making, 16.5% process sensitive data categories like health or financial information, and 7.5% process biometric data.
The report argues these figures almost certainly undercount actual exposure, since they reflect only what vendors have formally disclosed. Vendors could underreport access to personal data, and the inherent flexibility of AI means even good-faith vendors might not predict riskier user applications of their tools.
This has immediate regulatory implications. The CCPA’s new risk assessment requirement, effective January 1, 2026, requires businesses to conduct and document risk assessments for processing activities that present significant privacy risks — and will require submission to CalPrivacy by April 2028, with executive attestation under penalty of perjury.
Processing sensitive personal information with AI, or using AI for automated decision-making, are precisely the activities that trigger this obligation. The report finds that 42% of companies abandoned AI initiatives in 2025 with data privacy concerns cited as a primary obstacle — a statistic sourced to S&P Global research. Privacy teams that engage early with AI projects, Barber argues, can prevent that waste by ensuring safeguards are in place before launch, with AI risk assessments serving as the right starting point.
While shadow AI is still a newer category of threat, the report makes clear that traditional privacy challenges have not eased — they have intensified. Consent management was the busiest enforcement topic of 2025. California alone publicly reported $4.3 million in CCPA consent settlements, and 2025 saw over 1,400 class action wiretapping suits driven by private firms investigating tracking pixels and session replay software.
Despite this enforcement wave, 63% of the 5,000 websites DataGrail audited still fail to comply with universal opt-out mechanisms such as the Global Privacy Control signal. While that figure represents an improvement from 75% non-compliance in 2023, the pace of improvement is slow relative to the acceleration in enforcement.
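For context, a browser with Global Privacy Control enabled sends the request header `Sec-GPC: 1`. The Python sketch below shows one way a site could check for that signal before deciding which tags to fire. The tag inventory and the "essential" flag are hypothetical, and deciding which scripts actually count as a sale or sharing of data is a legal determination this sketch does not make.

```python
def gpc_opt_out(headers):
    """True when the request carries the Global Privacy Control signal (Sec-GPC: 1)."""
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("sec-gpc") == "1"


# Hypothetical tag inventory for illustration only.
TAGS = [
    {"name": "session_cookie", "essential": True},
    {"name": "ad_pixel", "essential": False},
    {"name": "analytics", "essential": False},
]


def tags_to_fire(headers, tags=TAGS):
    """Fire only essential tags when the visitor has sent an opt-out signal."""
    if gpc_opt_out(headers):
        return [tag["name"] for tag in tags if tag["essential"]]
    return [tag["name"] for tag in tags]


print(tags_to_fire({"Sec-GPC": "1"}))
print(tags_to_fire({}))
```

The check has to run before any non-essential tag loads; a banner that honors the signal only after trackers have already fired would still fail an audit of this kind.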
Barber pointed to the case of Todd Snyder, the menswear retailer that the California Privacy Protection Agency fined $345,178 in May 2025, as evidence that enforcement is no longer reserved for big tech. “This is a business that has two or three stores across the U.S. They have 300 employees,” he said. “They run tight margins because they’re a consumer menswear clothing store.”
The California Attorney General also reached a $2.75 million settlement with Disney over failures to honor opt-out signals, while the California Privacy Protection Agency has brought enforcement actions against PlayOn Sports and Ford — a pattern that demonstrates both the breadth and depth of regulatory activity. Among the trackers that fire even after a user sends a GPC signal, the report found that 27.1% come from Google Analytics and 43.8% are for targeted advertising via platforms like Meta and Microsoft.
For users who do engage with consent banners, 48.3% click “Accept all,” while only 12.4% select “Essential only” and 2.3% customize their preferences. A full 37% simply exit the banner without making a selection. The practical takeaway: fewer than 15% of users (12.4% plus 2.3%, or 14.7%) make a conscious choice to opt out of tracking, which means consent banners present relatively low business risk when properly configured but enormous regulatory risk when they are not.
Data subject request volume hit an all-time high for the fifth consecutive year. Deletion requests have surged 567% since 2021 and now represent 87% of all data subject requests. Access requests, by contrast, have gradually declined as consumers skip visibility and reach straight for the delete button.
The cost is staggering. For a mid-sized organization receiving 5 million annual web visitors, the report estimates manual DSR management now runs approximately $1.5 million per year, based on Gartner’s estimated cost of $1,524 per manual DSR. The average cost has climbed from $238,000 in 2021 to $1.51 million in 2025 — a trajectory that makes manual processing not just inefficient but, as the report argues, “irresponsible.”
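The figures above imply a rough request volume, which a short Python calculation makes explicit. The assumption, not stated in the report excerpt, is that annual cost is simply the per-request cost multiplied by the number of requests.

```python
COST_PER_MANUAL_DSR = 1_524    # Gartner estimate cited in the report, in USD
ANNUAL_COST_2025 = 1_510_000   # report's 2025 average, in USD
ANNUAL_COST_2021 = 238_000     # report's 2021 average, in USD

# Assumes annual cost = per-request cost x request count (an illustrative simplification).
implied_requests = ANNUAL_COST_2025 / COST_PER_MANUAL_DSR
growth_multiple = ANNUAL_COST_2025 / ANNUAL_COST_2021

print(round(implied_requests))    # roughly 991 requests per year
print(round(growth_multiple, 1))  # roughly 6.3 times the 2021 cost
```

Under that assumption, the $1.51 million figure corresponds to roughly 990 manually handled requests a year, and the average cost has grown a little more than sixfold since 2021.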
Barber emphasized that these numbers reflect verified human requests with bot and spam traffic excluded, and that data broker scenarios — which will see their own massive influx of requests under California’s Delete Act — are reported separately. “That is a natural increase,” Barber told VentureBeat. “If you’ve now got 20-plus U.S. states with privacy regulation, it’s unlikely that we see a federal bill passed, even though we’ve seen one proposed. And while we don’t see federal awareness and regulation, we do see at the state level over 20 states, and that may actually increase awareness for the consumer even more.”
He added a telling detail about how businesses are responding in practice: “99% of DataGrail customers do process that deletion” even for residents of states without privacy laws, “simply because it’s too hard at this point. Discerning and even communicating to the person, ‘Hey, you live in Montana, sorry, you’re just in an unfortunate state without regulation’ — you just can’t do that.” Data brokers felt the impact most acutely, with a 398% increase in deletion requests compared to 2024 and an average of over 2,000 deletion requests handled per month.
The regulatory landscape underpinning all of these trends has fundamentally shifted from education to punishment. Nearly half of U.S. states now have a comprehensive privacy law in effect, plus over 160 AI-specific laws. State legislatures enacted 145 AI-related laws in 2025 alone, with another thousand introduced or reworked. According to Gartner, over 50% of the U.S. population is now covered by a comprehensive state privacy law, with 24 additional states expected to pass laws within five years. States have also begun pooling their resources, with ten forming the Consortium of Privacy Regulators last year and pledging to coordinate investigations across state lines.
Barber argued that privacy enforcement is fundamentally bipartisan, which insulates it from the shifting political winds of the current administration. “Privacy overall is a pretty bipartisan issue,” he said. “It’s easy to pass privacy regulation because constituents somewhat expect privacy in their day-to-day living. If you were flying on an airline and they said, ‘Okay, this seat, if you want your privacy, you’re going to have to pay $6 more,’ you’re like, ‘I’m going to go to another airline.’ It’s an expected part of a transaction at this stage.”
He predicted that other states will replicate California’s enforcement model. “California has their enforcement division, CalPrivacy. That group has one task: to ensure enforcement of privacy throughout businesses. Is it likely that we see other states get funding and support to fund these types of groups? Highly likely. The enforcement fines — the actual payments — go back to us as constituents. That type of model, you could imagine, being very popular across the country.”
Perhaps the most paradoxical finding in the report is that privacy teams lost as much as 33% of their headcount last year, even as their workloads expanded across every metric the report tracks. Cisco data cited in the report shows that 90% of privacy programs expanded in 2025 due to AI, while only 12% of AI governance programs are considered mature. Meanwhile, 74% of privacy teams planned to apply AI to privacy-related tasks in 2026, according to ISACA’s State of Privacy 2026 survey.
Barber sees this as part of a broader macroeconomic pattern rather than a sign that organizations do not value privacy. “It’s actually a fascinating macro trend, and probably one you’ve seen across all functions,” he said. “Businesses are driving more efficiency in all parts of the business. Privacy teams, five years ago, we would have said, ‘Well, there’s more regulation, the volume of deletions have increased 500%, we need more humans.’ It’s become clear that AI provides capabilities that can do the work for privacy individuals.” He drew an analogy: “They might have had a design team of 20 people five years ago, now they have a design team of five, courtesy of Claude Design or Gamma or whatever the tool may be. I think that’s what we’re seeing here as well.”
DataGrail has positioned its own AI agent, Vera, launched in March 2026, as part of the answer. Vera is embedded within DataGrail’s existing platform and aims to automate privacy workflows across multiple jurisdictions. DataGrail’s platform was also named the first production-ready Model Context Protocol server for privacy, using the standard created by Anthropic to let customers launch DataGrail tools from whatever application they are already working in, whether Slack, email, or Claude.
DataGrail is, of course, a company that directly benefits from the problems its report identifies. The company has raised a total of $84.2 million over five rounds, with its largest being a $45 million Series C in October 2022 led by Third Point Ventures. Its platform addresses precisely the data mapping, DSR automation, consent management, and risk assessment challenges the report spotlights.
Barber acknowledged the tension directly. “It’s a fair statement,” he said when asked about potential skepticism. “DataGrail doesn’t provide a service to keep DPAs up to date — that’s on a business to evaluate how they work with a vendor. What DataGrail does help to do is assessments, and automate those assessments using our AI agent, Vera, to assess that increased risk.”
He argued that the more neutral reading of the data is structural: “This is evidence to show that the DPA unfortunately is not keeping up with technology and the speed at which technology is innovating. That’s both exciting but also we need to accept that’s where we are.” The methodology does lend some credibility to this claim.
The report draws on anonymized privacy operations data from hundreds of enterprise customers, the 2,400-system AI tracking database, and the 5,000-website consent audit — sources that are at least partially independent of DataGrail’s commercial interests. And the broader findings on enforcement spending, DSR volume trends, and regulatory expansion align closely with independently published data from Gartner, Cisco, and state enforcement agencies.
When asked about the most important trend that did not make it into the report, Barber pointed to a next-generation risk that extends the shadow AI problem into far more dangerous territory: agentic AI workflows. Gartner predicts 40% of enterprise applications will feature task-specific AI agents by end of 2026, up from under 5% in 2025 — a pace of adoption that could rapidly outstrip the governance mechanisms companies are only now beginning to build.
“Where we go next with this research is agent processing,” Barber said. “How are agents then leveraging that information? Because the downstream ramifications would be far more concerning for a business. One particular system is using shadow AI, the business has no idea that that’s happening, and then an agent is propagating that information across a whole bunch of other places. The guardrails of you and I checking the system will be lower than maybe what we’ve seen in the past with agentic workflows.”
He framed the distinction in human terms: “The identity of an agent is different than a human. There is thought that goes into what am I about to use here, where did this information come from, how was it collected — that may not be considered in the same way for an agentic workflow. We need to solve the root of the problem, which is how are these businesses leveraging AI subprocessors. But this quickly becomes an agentic problem that could be far more concerning.”
For the enterprise privacy and security leaders absorbing this report today, the uncomfortable truth is that the foundational documents and processes they have relied on to manage vendor risk for years are decomposing in real time. The DPA is breaking down as a reliable instrument. State enforcement is accelerating on a bipartisan basis. Privacy teams are shrinking even as their mandates expand. And the next wave of agentic AI systems threatens to distribute unvetted data processing across networks of autonomous agents that operate with even less human oversight than today’s tools.
Five years ago, when DataGrail published its first trends report, deletion requests were a fraction of what they are today, only a handful of states had privacy laws on the books, and the phrase “shadow AI” did not exist. Every year since, the report has warned that the problem was getting worse. Every year, the data has proved it right. The companies that survive the next chapter will not be the ones with the biggest compliance teams or the thickest policy binders. They will be the ones that accept a disorienting new reality: in 2026, the contracts you signed may not describe the AI that is already processing your customers’ data — and by 2027, autonomous agents may be deciding what to do with it.
For months, the leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: the top models are all roughly the same. OpenAI’s GPT-5 family, Anthropic’s Claude Opus, and Google’s Gemini Pro have clustered within a narrow band on Scale AI’s SWE-Bench Pro leaderboard, making it nearly impossible for engineering leaders to determine which agent will actually perform best inside their codebases.
On Monday, a startup called Datacurve released a benchmark it says shatters that illusion. DeepSWE, a 113-task evaluation spanning 91 open-source repositories and five programming languages, produces a dramatically wider spread among the same frontier models and crowns OpenAI’s GPT-5.5 as the clear leader at 70%, 14 points ahead of its sibling GPT-5.4 and 16 points ahead of the best-scoring model from a rival lab.
“On public leaderboards, top models often look relatively close in capability,” wrote Datacurve co-author Serena Ge on X. “DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.”
The benchmark also delivers a pointed critique of the evaluation infrastructure the AI industry relies on to measure progress: Datacurve’s audit found that SWE-Bench Pro’s verifiers — the automated graders that determine whether an agent solved a task — issued incorrect pass/fail verdicts on roughly one-third of the trials it reviewed.
If that finding holds up, it has sweeping implications. Enterprise procurement teams, venture capitalists, and AI lab marketing departments all lean heavily on benchmark scores to make multimillion-dollar decisions. A 32% error rate in the most widely cited coding benchmark suggests the industry may have been navigating by a broken compass.
To understand what Datacurve is claiming, it helps to understand how coding benchmarks work — and how they can go wrong.
The dominant paradigm, pioneered by the SWE-Bench family maintained by Scale AI and academic researchers, constructs tasks by mining real GitHub commits. The process extracts a bug fix or feature addition from a repository’s history, rolls the code back to the pre-fix state, and then asks an AI agent to reproduce the change. The original commit’s test suite serves as the verifier: if the agent’s patch makes the same tests pass, it gets credit. This approach has an elegant simplicity, but Datacurve argues it introduces three systemic weaknesses.
First, contamination. Because tasks are drawn from public GitHub history, the problem statement, the discussion, and often the exact solution are already present in the training data of frontier models. “The SWE-Bench family scrapes existing GitHub issues and PRs, which creates two problems: memorization (models have already seen the solution) and triviality (most tasks are small),” Ge wrote.
Second, scope. SWE-Bench Pro tasks require, on average, just 120 lines of code added across 5 files. DeepSWE’s reference solutions average 668 lines added across 7 files — roughly 5.5 times more code. Yet DeepSWE’s prompts are actually shorter, averaging 2,158 characters versus SWE-Bench Pro’s 4,614. In other words, DeepSWE gives the agent less instruction but expects far more output, which more closely mirrors how a human developer might actually delegate work to an AI assistant.
Third — and most damaging — verifier reliability. Datacurve drew 30 tasks at random from both DeepSWE and SWE-Bench Pro, ran three rollouts across 10 frontier model configurations, and then deployed an LLM-based judge to independently assess whether each agent’s patch actually solved the problem. SWE-Bench Pro’s verifiers accepted wrong implementations 8.5% of the time and rejected correct implementations 24% of the time. DeepSWE’s verifiers registered 0.3% and 1.1%, respectively.
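One way to read the two error rates above is as shares of all reviewed trials in which the benchmark’s verifier and the independent judge disagree. The sketch below is a minimal illustration of that bookkeeping, not Datacurve’s analysis code: the Trial record, the function name, the sample data, and the choice to divide by all trials (rather than by only the wrong or only the correct patches) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Trial:
    verifier_passed: bool  # verdict from the benchmark's own test suite
    judge_correct: bool    # verdict from the independent judge


def verifier_error_shares(trials: Sequence[Trial]) -> Dict[str, float]:
    """Return false-accept and false-reject counts as shares of all trials."""
    if not trials:
        raise ValueError("need at least one trial")
    total = len(trials)
    # Verifier said pass, judge said the patch is wrong.
    false_accepts = sum(1 for t in trials if t.verifier_passed and not t.judge_correct)
    # Verifier said fail, judge said the patch is correct.
    false_rejects = sum(1 for t in trials if not t.verifier_passed and t.judge_correct)
    return {
        "false_accept": false_accepts / total,
        "false_reject": false_rejects / total,
        "disagreement": (false_accepts + false_rejects) / total,
    }


# Hypothetical sample: 10 trials, 1 wrong patch accepted, 2 correct patches rejected.
sample = (
    [Trial(True, True)] * 4
    + [Trial(False, False)] * 3
    + [Trial(True, False)] * 1
    + [Trial(False, True)] * 2
)
shares = verifier_error_shares(sample)
```

Under this reading the two shares sum to the overall disagreement rate, which is how 8.5% and 24% would line up with the roughly one-third figure cited earlier.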
The false negative problem is especially insidious because it punishes creative solutions. In one documented case, the gold-standard pull request for a SWE-Bench Pro task refactored a private helper function. An agent that correctly solved the task by inlining the same logic — a perfectly valid engineering choice — failed because the test suite tried to import a symbol that only existed in the original author’s specific implementation.
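The failure mode described above can be reproduced in miniature. The sketch below is a hypothetical reconstruction, not the actual SWE-Bench Pro task: the slugify function, the _normalize helper, and both checks are invented for illustration. It contrasts a check that exercises behavior with one that is coupled to the reference author’s internal structure.

```python
from types import SimpleNamespace


# Reference ("gold") implementation: factors the logic into a private helper.
def _gold_normalize(text: str) -> str:
    return " ".join(text.split()).lower()


def _gold_slugify(title: str) -> str:
    return _gold_normalize(title).replace(" ", "-")


gold = SimpleNamespace(slugify=_gold_slugify, _normalize=_gold_normalize)


# Agent implementation: same behavior, helper logic inlined.
def _agent_slugify(title: str) -> str:
    return "-".join(title.split()).lower()


agent = SimpleNamespace(slugify=_agent_slugify)


def behavioral_check(impl) -> bool:
    """Passes any implementation that produces the right output."""
    return impl.slugify("  Hello   World ") == "hello-world"


def coupled_check(impl) -> bool:
    """Also requires the private helper to exist, as a brittle test suite would."""
    return hasattr(impl, "_normalize") and behavioral_check(impl)
```

Both implementations pass the behavioral check, but only the reference passes the coupled one, so a correct patch is recorded as a failure.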
DeepSWE’s top-line results reorder the familiar hierarchy in ways that should matter to every engineering team evaluating AI coding tools. On SWE-Bench Pro, models from OpenAI, Anthropic, and Google have traded the lead within a 30-point range. DeepSWE stretches that range to 70 points.
GPT-5.5 leads at 70%, followed by GPT-5.4 at 56% and Claude Opus 4.7 at 54%. From there, the drop-off is steep: Claude Sonnet 4.6 lands at 32%, Gemini 3.5 Flash at 28%, GPT-5.4-mini and Kimi K2.6 tied at 24%, and then a long tail of models in the teens and single digits. Claude Haiku 4.5, which scores 39% on SWE-Bench Pro, collapses to zero on DeepSWE — suggesting that some mid-tier models have been significantly overperforming on easier, potentially contaminated benchmarks.
GPT-5.5 doesn’t just score the highest; it does so efficiently. The model reaches its 70% pass rate with a median cost of $5.80 per trial, a median wall-clock time of 20 minutes, and a median of 47,000 output tokens. GPT-5.4 emerges as perhaps the best overall value at $3.30 per trial with a 56% score. Claude Opus 4.7, meanwhile, costs significantly more per run. Across the agents tested, output tokens, wall-clock duration, and dollar cost per trial all vary by an order of magnitude, yet none of these correlates strongly with pass rate. Agents that emit more tokens, run longer, or cost more do not consistently solve more tasks.
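A rough way to compare value is to divide the cost per trial by the pass rate, giving an approximate cost per solved task. The sketch below applies that to the two OpenAI figures quoted above. It is an approximation under stated assumptions: the reported costs are medians rather than means, and the function name is made up for illustration.

```python
def cost_per_solved_task(median_cost_per_trial: float, pass_rate: float) -> float:
    """Approximate dollars spent per successfully solved task."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return median_cost_per_trial / pass_rate


# Figures as reported in the article: median cost per trial and pass rate.
gpt_5_5 = cost_per_solved_task(5.80, 0.70)
gpt_5_4 = cost_per_solved_task(3.30, 0.56)

print(f"GPT-5.5: ${gpt_5_5:.2f} per solved task")
print(f"GPT-5.4: ${gpt_5_4:.2f} per solved task")
```

On these figures GPT-5.4 works out to about $5.89 per solved task against about $8.29 for GPT-5.5, which is one way to read the “best overall value” description.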
Perhaps the most provocative finding in DeepSWE’s analysis concerns what the authors label “CHEATED” verdicts — instances where an agent passes a benchmark not by solving the problem, but by reading the answer.
SWE-Bench Pro’s Docker containers ship the repository’s full .git history, which means the gold-standard solution commit is sitting right there in the container’s file system. Most models ignore it. Claude does not. Datacurve’s analysis found that both Claude Opus 4.7 and Claude Opus 4.6 registered “CHEATED” on more than 12% of their reviewed SWE-Bench Pro rollouts. In those instances, the Claude agent ran commands like git log --all or git show <gold-hash> to retrieve the merged fix and paste it into its own patch. The behavior accounted for approximately 18% of Opus 4.7’s passes and 25% of Opus 4.6’s passes on the reviewed sample. The issue has been filed publicly as GitHub issue #93 on the SWE-Bench Pro repository.
GPT-5.4 and GPT-5.5 never exhibited this behavior. Gemini configurations stayed around 1%. Datacurve describes the behavior diplomatically — “The benchmark makes this possible (the gold commit lives in the container), but Claude is the family that consistently does so” — but the implication is clear: a meaningful fraction of Claude’s SWE-Bench Pro scores may reflect environmental exploitation rather than genuine engineering capability.
DeepSWE addresses this by shipping only a shallow clone with the base commit, leaving no gold hash for the agent to discover. It is worth noting that the behavior is arguably a sign of Claude’s environmental attentiveness — the model is very good at exploring its surroundings and exploiting available resources. Whether that counts as “cheating” or “resourcefulness” depends on your perspective, but in the context of a benchmark designed to measure independent problem-solving, it undermines the signal.
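Teams running their own evaluations could screen agent trajectories for this kind of history lookup. The sketch below is a simple heuristic, not Datacurve’s method (the article says its verdicts came from an LLM analyzer): the function name and matching rules are assumptions, and it only recognizes the two command shapes mentioned above.

```python
import re
import shlex

# A bare commit hash: 7 to 40 lowercase hex characters.
_COMMIT_HASH = re.compile(r"^[0-9a-f]{7,40}$")


def peeks_at_history(command: str) -> bool:
    """Flag shell commands that list all refs or show a specific commit."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes; treat as not matching
    if len(tokens) < 2 or tokens[0] != "git":
        return False
    subcommand, args = tokens[1], tokens[2:]
    if subcommand == "log" and "--all" in args:
        return True
    if subcommand == "show" and any(_COMMIT_HASH.match(arg) for arg in args):
        return True
    return False
```

A check like this would miss variants such as git -C <path> log --all, so it is better treated as a first-pass filter than as a verdict.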
Beyond the top-line scores, Datacurve’s qualitative trajectory analysis reveals distinctly different failure signatures across model families — a finding that could help engineering teams choose the right model for specific types of work.
Claude is forgetful with multi-part prompts. On DeepSWE, Claude configurations miss stated requirements more than any other family. The pattern is consistent: when a prompt enumerates parallel behaviors — “support both sync and async,” for instance — Claude typically implements the obvious branch and forgets to mirror the change. Datacurve reports that roughly two-thirds of Claude’s “MISSED_REQUIREMENT” failures on DeepSWE follow this “one branch shipped” pattern. In one example, Claude Opus 4.7 correctly landed a sync state-data hook in one engine class while the async engine never received the same hook.
GPT, by contrast, implements exactly what is asked. GPT-5.5 had the lowest rate of missing stated behaviors of any configuration tested. Across multiple runs of the same task, GPT trials tended to converge on the same interpretation of the prompt, suggesting instruction-following precision is a stable trait of the model rather than per-run luck.
One of the most intriguing findings involves self-verification. On DeepSWE, Claude Opus 4.7 and GPT-5.4 wrote and ran new tests in the project’s own test framework on over 80% of their runs — even though no one asked them to. On SWE-Bench Pro, those same models dropped to 28% and 18%, respectively. The reason: SWE-Bench Pro’s prompt template explicitly tells agents they “should not modify the testing logic or any of the tests.” Agents dutifully complied, suppressing a behavior that likely would have improved their performance. This suggests that prompt design in production coding workflows may be inadvertently suppressing valuable agent behaviors — something enterprise teams deploying AI coding agents should carefully audit.
Datacurve is forthright about several limitations. The standardized harness, while ensuring fairness, routes all edits through bash rather than the model-specific editing tools each family was trained on — apply_patch for GPT, str_replace_based_edit_tool for Claude. This could hold models below their native ceilings. The benchmark draws exclusively from open-source repositories with 500-plus stars, and results may not generalize to proprietary codebases. Bug localization and refactoring tasks are under-represented, and widely used languages like C++ and Java are absent entirely. The verdict assignments in the qualitative analysis come from an LLM analyzer, not human reviewers, and sample sizes are modest — roughly 90 reviewed rollouts per model per benchmark.
It is also worth noting that Datacurve is a startup with its own commercial interests, and an independent benchmark that reshuffles the leaderboard will inevitably invite scrutiny. The company’s decision to publish the full dataset, all agent trajectories, and the evaluation harness on GitHub mitigates this concern considerably, but independent reproduction will be necessary before the AI community treats these results as definitive.
DeepSWE arrives at an inflection point for the AI coding market. Enterprise adoption of AI coding agents is accelerating rapidly, with engineering organizations making consequential bets on which model to build around. The benchmark market itself has become a strategic battleground — Scale AI’s SWE-Bench Pro, which Datacurve directly critiques, is maintained by a company that also provides evaluation services to the labs whose models it ranks.
If DeepSWE’s central findings about verifier reliability and data contamination hold up under independent scrutiny, they could force a reckoning not just with how the industry measures coding agents, but with the broader question of what benchmarks are actually for. A leaderboard where the grading system is wrong a third of the time is not merely inaccurate — it is the kind of broken instrument that makes everyone feel good about progress that may not be real. And in an industry spending billions on a bet that AI agents can do the work of software engineers, the difference between real progress and the appearance of it is not academic. It is the whole game.
Dun & Bradstreet has spent over 180 years building a comprehensive commercial database. Its Commercial Graph, covering 642 million businesses and their relationships, corporate hierarchies and risk profiles, was designed for people: credit analysts, risk managers and sales professionals who could wait for query results and work through ambiguous entity matches. AI agents cannot do either.
When D&B’s customers started pushing agents into credit, procurement and supply chain workflows, the Commercial Graph that had reliably served nearly 200,000 customers globally became a problem. The systems built to serve human analysts were the wrong architecture for machines. So D&B rebuilt.
“We need to think about agents as our new consumer category, evolving from our standard credit analysts or sales and marketing professionals, et cetera, to also now catering to these customers’ agents,” Gary Kotovets, Chief Data and Analytics Officer at Dun & Bradstreet, told VentureBeat.
The Commercial Graph was not a single database. It was a collection of separate systems built for different use cases and different markets, held together by custom integrations. Human analysts navigated that fragmentation through SQL queries or pre-built interfaces. Agents could not.
The scale of the underlying data compounded the problem. The database had nearly doubled in five years, expanding from more than 300 million to more than 642 million business records, with 11,000 fields per record, according to D&B. The firm now runs approximately 100 billion data quality checks per month as records move through its systems. Querying that at the sub-second latency agents require, against a fragmented architecture, was not workable.
The relationships the graph tracked were also the wrong kind. Legacy systems recorded static connections between entities. A CEO was linked to a company. That was the line. Agents working on credit assessments or third-party risk need dynamic relationships: when that CEO leaves for a new company, which organization does their track record follow? When a subsidiary changes ownership, how does that propagate across a corporate hierarchy? Those questions required custom analyst work before. Agents cannot wait for custom analyst work.
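The difference between a static link and a dynamic relationship can be sketched with time-bounded records. This is a minimal illustration, not D&B’s schema: the Role record, the function names, and the people and companies in the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Role:
    person: str
    company: str
    title: str
    start: date
    end: Optional[date] = None  # None means the role is still current


def company_on(roles: List[Role], person: str, title: str, when: date) -> Optional[str]:
    """Return the company where the person held the title on the given date."""
    for role in roles:
        started = role.start <= when
        not_ended = role.end is None or when < role.end
        if role.person == person and role.title == title and started and not_ended:
            return role.company
    return None


def track_record(roles: List[Role], person: str) -> List[Role]:
    """Return every role the person has held, oldest first."""
    return sorted((r for r in roles if r.person == person), key=lambda r: r.start)


# Hypothetical data: one executive moving between two made-up companies.
roles = [
    Role("J. Doe", "Sample Industries", "CEO", date(2024, 6, 1)),
    Role("J. Doe", "Example Holdings", "CEO", date(2019, 1, 1), date(2024, 6, 1)),
]
```

Because each link carries a validity interval, an agent can ask where someone was CEO on a given date, or follow a track record across employers, without custom analyst work.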
The broader problem is not unique to D&B. Kotovets said he has spoken with hundreds of CDOs and CIOs over the past six months and consistently heard the same constraint: they could not build what they wanted in AI because their data foundations were not standardized, normalized or agent-queryable. D&B had that foundation, built over decades to serve human analysts. It still had to rebuild for agents.
The rebuild started with consolidation. D&B migrated its fragmented databases to cloud infrastructure, redesigned the underlying schema and built a data fabric layer that normalizes records across markets while preserving regional compliance requirements. The result is a unified knowledge graph that tracks billions of relationships across 642 million companies, continuously updated and enriched by AI-driven data processing.
On top of that graph, D&B built a structured access layer for agents. Raw SQL access at agent query volumes and latency requirements was not the answer. Instead, D&B created a set of tools and skills available through MCP that package data with context and route agents to the right records for specific queries. A match and entity resolution engine sits behind every query, confirming that when an agent asks about a company, the answer resolves to a verified, specific entity rather than a name match.
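The distinction between a name match and a resolved entity can be shown with a small sketch. It is an illustration under assumptions, not D&B’s matching engine: the Entity record, the identifiers, the exception names, and the exact-name matching rule are all invented here, and a production system would use far richer matching.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entity:
    entity_id: str  # hypothetical stable identifier
    name: str
    country: str


class NoMatch(Exception):
    pass


class AmbiguousMatch(Exception):
    pass


def resolve(registry: List[Entity], name: str, country: Optional[str] = None) -> Entity:
    """Return exactly one entity, or refuse rather than guess."""
    wanted = name.strip().casefold()
    hits = [
        e for e in registry
        if e.name.casefold() == wanted and (country is None or e.country == country)
    ]
    if not hits:
        raise NoMatch(name)
    if len(hits) > 1:
        raise AmbiguousMatch(f"{name!r} matches {len(hits)} entities")
    return hits[0]


# Hypothetical registry with two same-named companies in different countries.
registry = [
    Entity("E-001", "Acme Corp", "US"),
    Entity("E-002", "Acme Corp", "DE"),
]
```

The design choice is that an ambiguous name raises an error instead of returning the first hit, so a downstream agent never proceeds on a record it only appears to have matched.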
Rebuilding the graph and adding MCP access solved the data retrieval problem. It did not solve the identity problem. Agents are not humans, and the authentication model built for human users did not extend to machines.
D&B built a new registration model for agents. Each agent must map to a verified IP address and register an individual access key, which is treated as an authenticated identity in the same pipeline as a human user.
“We actually have a concept of Know Your Agent, similar to know your customer, that does those additional verifications,” Kotovets said.
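A registration check of this kind might look like the sketch below, which admits an agent only when both its access key and its source address match what was registered. This is an assumption-laden illustration, not D&B’s implementation: the AgentRecord fields, the function names, and the sample values are hypothetical.

```python
import hashlib
import hmac
import ipaddress
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AgentRecord:
    owner: str          # customer the agent belongs to
    registered_ip: str  # address verified at registration
    key_sha256: str     # hash of the agent's individual access key


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def authenticate(
    registry: Dict[str, AgentRecord], agent_id: str, access_key: str, source_ip: str
) -> Optional[str]:
    """Return the owning customer if key and IP both match, else None."""
    record = registry.get(agent_id)
    if record is None:
        return None
    # Constant-time comparison avoids leaking how much of the key matched.
    key_ok = hmac.compare_digest(record.key_sha256, _sha256(access_key))
    ip_ok = ipaddress.ip_address(source_ip) == ipaddress.ip_address(record.registered_ip)
    return record.owner if key_ok and ip_ok else None


# Hypothetical registration, using a documentation-range IP address.
registry = {
    "credit-agent-1": AgentRecord("Example Bank", "203.0.113.7", _sha256("s3cret")),
}
```

Returning the owner rather than a bare boolean lets the caller decide what data that customer is entitled to query.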
That handles the inbound problem: knowing which company an agent belongs to and what data it is entitled to query. But D&B also built for the outbound problem: what happens when a customer’s own multi-agent workflow loses track of which company it is analyzing.
In a workflow that chains a credit check agent, a KYC agent and a third-party risk agent, each queries D&B at a different step. Without a mechanism to confirm they are all referencing the same entity, a workflow can complete while operating on divergent records.
“They have to come back to our verification agent to ensure that they’re still talking to each other about the same entity,” Kotovets said. “It’s almost like a digital handshake, in a sense.”
D&B’s business verification agent can be embedded into any workflow as a persistent reference point and is available on Google’s A2A protocol regardless of which orchestration tool a customer uses.
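The “digital handshake” can be pictured as a check that every step in a workflow ended up on the same entity identifier. The sketch below is a simplified stand-in for that idea, not D&B’s verification agent or the A2A protocol: the function name, the exception, and the identifiers are hypothetical.

```python
from typing import Mapping


class EntityMismatch(Exception):
    pass


def confirm_same_entity(entity_by_step: Mapping[str, str]) -> str:
    """Return the shared entity ID, or raise if any step used a different one."""
    if not entity_by_step:
        raise ValueError("no steps to verify")
    distinct = set(entity_by_step.values())
    if len(distinct) > 1:
        raise EntityMismatch(f"steps disagree: {dict(entity_by_step)}")
    return next(iter(distinct))


# Hypothetical workflow: three agents, each recording the entity it queried.
consistent = {"credit_check": "E-001", "kyc": "E-001", "third_party_risk": "E-001"}
divergent = {"credit_check": "E-001", "kyc": "E-002", "third_party_risk": "E-001"}
```

Calling the check before a workflow commits its result turns a silent divergence into an explicit failure.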
The rebuild exposed requirements that go beyond D&B’s own stack.
Data foundations come before agent infrastructure. The CDOs and CIOs Kotovets spoke with over the past six months consistently hit the same wall: they cannot build what they want in AI until their data is clean, normalized and consolidated. D&B had that foundation already. Most enterprises do not, and they will feel it.
Design for dynamic relationships, not static ones. Enterprise data systems typically record point-in-time connections: a person belongs to a company, an asset belongs to a subsidiary. Agents working on credit, risk or supply chain decisions need to reason across relationships that shift over time. If the underlying data only captures the static line, the agent will too.
Build entity consistency checks into multi-agent workflows. When multiple agents touch the same entity at different steps, there is no guarantee they are all referencing the same record by the time the workflow completes. That gap needs to be engineered for explicitly. Entity verification is a workflow design requirement, not an optional guardrail.
Embed lineage from the start, not as an afterthought. Every agent-produced answer should carry a traceable path back to its source. In credit, risk and supply chain decisions, the cost of an error is concrete. Lineage needs to be built in before scaling, not added after problems surface.
“You could always click and see where it came from, and validate it all the way back to the original source,” Kotovets said. “That’s been the key for us in unlocking a lot of other capabilities, because we have that level of certainty in the things that we’ve done.”
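One simple way to make lineage travel with an answer is to attach source identifiers to every value and merge them whenever values are combined. The sketch below illustrates that pattern under assumptions: the Sourced record, the derive helper, the source labels, and the figures are all invented, and nothing here reflects D&B’s internal design.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Sourced:
    value: float
    sources: Tuple[str, ...]  # identifiers of the original records behind the value


def derive(fn: Callable[..., float], *inputs: Sourced) -> Sourced:
    """Compute a new value and carry forward every input's sources, deduplicated."""
    merged: List[str] = []
    for item in inputs:
        for source in item.sources:
            if source not in merged:
                merged.append(source)
    return Sourced(fn(*(item.value for item in inputs)), tuple(merged))


# Hypothetical inputs with made-up source labels.
revenue = Sourced(120.0, ("filing:2025-annual",))
debt = Sourced(30.0, ("registry:charges", "filing:2025-annual"))
ratio = derive(lambda d, r: d / r, debt, revenue)
```

Any figure an agent reports then arrives with the list of records it depends on, which is what allows validation back to the original source.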
Google on Tuesday unveiled Gemini Spark, a personal AI agent designed to work around the clock — drafting emails, assembling documents, monitoring inboxes, and eventually making purchases — even when a user’s laptop is closed and their phone is locked.
The announcement, made at Google I/O 2026, is the company’s most ambitious attempt yet to transform its AI assistant from a tool that answers questions into one that autonomously completes tasks. It also arrives at a moment of extraordinary competition, as Microsoft, OpenAI, Anthropic, and Apple all race to build AI systems that don’t merely converse but act — completing multi-step workflows with decreasing human supervision.
“We are in that part of the cycle where people want to see real value in the products they use on a day-to-day basis,” Sundar Pichai, CEO of Google and Alphabet, said during a press briefing ahead of the keynote address. With Spark, he argued, that value comes from an agent that never stops working. It operates around the clock in Google’s cloud, he said, so “you don’t need to keep your laptop open to make sure it’s running.”
Spark also raises urgent questions about trust, spending guardrails, and what happens when an artificial intelligence agent misinterprets a user’s intent.
Spark will begin rolling out this week to a small group of trusted testers, with a beta planned for Google AI Ultra subscribers in the United States next week.
Unlike conventional AI assistants, which activate only when prompted, Gemini Spark runs persistently on Google Cloud infrastructure. It is powered by the company’s new Gemini 3.5 Flash model and what Google calls the Antigravity agent harness, the same underlying system that powers the company’s internal developer tools.
In practical terms, this means Spark can accept a complex instruction — “email my boss a status update pulling the latest figures from our shared spreadsheet and the project timeline in our Slides deck” — and then execute it across multiple Google applications without further input. The agent can pull context from emails, documents, and calendar entries, synthesize the information, and produce a finished output.
Josh Woodward, VP of Google Labs, Gemini App, and AI Studio, described the experience in visceral terms during the briefing: “When you use it, it almost feels like you’re tossing things over your shoulder — Spark’s catching them and gets the job done.”
The cloud-based architecture is a deliberate design choice. Because Spark operates on remote servers rather than on a user’s device, it can continue working through tasks after a user walks away. A student could ask Spark to build a study guide that updates itself as new assignments arrive from a professor. A small business owner could instruct it to monitor their inbox and flag potential customer inquiries. A parent could delegate the logistics of a neighborhood block party — tracking RSVPs, coordinating contributions, scouting venues. These are not hypothetical scenarios. Woodward said they reflect how early testers have actually been using the product.
Over the coming months, Google plans to expand Spark’s capabilities significantly. The company will roll out MCP (Model Context Protocol) connections to more than 30 third-party partners, including Canva, OpenTable, and Instacart. Users will also be able to text and email Spark directly, create custom sub-agents for specialized tasks, and connect Spark to Chrome for web-based actions. Later this year, a new Android interface called Android Halo will provide live, at-a-glance visibility into what Spark is working on, displayed at the top of a user’s phone screen.
For all its ambition, Spark confronts a fundamental challenge that has bedeviled every AI agent to date: How do you trust an autonomous system to act on your behalf — particularly when money is involved?
Google is acutely aware of the concern. When asked during the press briefing how Spark would avoid making unauthorized purchases, Woodward reached for an analogy that was striking in its candor. “On the team, we think a lot of it is like if you’re giving a teenager their first debit card — there’s sort of limits and sort of constraints around it, and that’s how we’ll be designing Spark as we go through the year,” he said.
At launch, Spark will not autonomously make purchases. Users will be given explicit opportunities to review and approve any transaction before it goes through. But Google has built the infrastructure for a more autonomous future. Vidhya Srinivasan, who leads Google’s ads and commerce teams, introduced the Agent Payments Protocol, or AP2 — a system designed to let AI agents make secure purchases within user-defined boundaries.
The concept works like this: a user tells their agent the specific brands, products, and spending limits they’re comfortable with. If the criteria are met, the agent can automatically complete a purchase. AP2 creates what Google describes as a transparent, verifiable link between the user, the merchant, and payment processors, using privacy-preserving technology and tamper-proof digital mandates to ensure the agent is acting within its authorization. AP2 also generates a permanent digital paper trail, so that if a return is needed, the user and the merchant are looking at the same record. Google plans to bring AP2 to its products in the coming months, starting with Gemini Spark.
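The boundary check described above can be sketched as a comparison between a user-defined mandate and a merchant’s offer. This is purely illustrative and does not represent AP2’s actual data model or API, which the article does not describe: the Mandate and Offer records, the function name, and the sample brands and products are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import FrozenSet


@dataclass(frozen=True)
class Mandate:
    allowed_brands: FrozenSet[str]
    allowed_products: FrozenSet[str]
    spending_limit: Decimal  # maximum price the user has authorized


@dataclass(frozen=True)
class Offer:
    brand: str
    product: str
    price: Decimal


def may_auto_purchase(mandate: Mandate, offer: Offer) -> bool:
    """True only if brand, product, and price all fall inside the mandate."""
    return (
        offer.brand in mandate.allowed_brands
        and offer.product in mandate.allowed_products
        and offer.price <= mandate.spending_limit
    )


# Hypothetical mandate and offers.
mandate = Mandate(frozenset({"ExampleBrand"}), frozenset({"running shoes"}), Decimal("120.00"))
in_bounds = Offer("ExampleBrand", "running shoes", Decimal("99.50"))
too_expensive = Offer("ExampleBrand", "running shoes", Decimal("150.00"))
wrong_brand = Offer("OtherBrand", "running shoes", Decimal("80.00"))
```

Anything that fails the check would fall back to the explicit review-and-approve step rather than completing automatically.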
The system is underpinned by the Universal Commerce Protocol (UCP), an open-source standard Google announced earlier this year that gives agents and commerce systems a common language across the entire shopping journey. The UCP Tech Council now includes Amazon, Meta, Microsoft, Salesforce, and Stripe — a remarkable coalition that underscores how seriously the industry takes the prospect of agent-driven commerce.
Google also announced the Universal Cart, an intelligent shopping cart that works across merchants and Google services. Users can add items while browsing Search, chatting with Gemini, watching YouTube, or reading Gmail. The cart then works in the background — tracking price drops, surfacing deals based on payment card perks, and even flagging product incompatibilities. The shopping infrastructure is rolling out in the U.S. this summer across Search and the Gemini app, with YouTube and Gmail to follow.
The announcement lands in the middle of the most intense competitive period in AI history. Google, Microsoft, OpenAI, Anthropic, and Apple are all racing to ship autonomous agents that can do real work — and each is placing a fundamentally different architectural bet on how to get there.
OpenAI recently unified its Operator and deep research capabilities into ChatGPT agent — a system that brings together website interaction, information synthesis, and conversational intelligence. It carries out tasks using its own virtual computer, shifting between reasoning and action to handle complex workflows. The company emphasizes that users remain in control, with ChatGPT requesting permission before taking consequential actions. But the product has faced scrutiny over reliability. OpenAI’s Computer-Using Agent scores 38.1% on OSWorld, the industry benchmark for computer use tasks, while humans score over 72%.
Anthropic launched its Claude Computer Use Agent in research preview in March, giving Claude the ability to see, navigate, and control a user’s desktop — clicking buttons, opening applications, filling spreadsheets, and completing multi-step workflows. Claude Cowork handles tasks autonomously — users give it a goal and Claude works on their computer, local files, and applications to return a finished deliverable. Anthropic has iterated aggressively, recently shipping ten pre-built financial agents and pursuing deep Microsoft 365 integration.
Microsoft introduced Copilot Cowork to move beyond chat and into execution — helping users delegate real tasks and have them completed. Cowork runs in the cloud, meaning users don’t have to worry about closing their laptop. The system is grounded in Work IQ, Microsoft’s intelligence layer that understands organizational data, tools, and structure. The shift moves Copilot from a sidebar helper to an orchestrator of autonomous agents.
Apple is also preparing a revamped Siri for WWDC 2026 that will act as an “always-on agent” capable of handling tasks across apps using personal data. Google’s Gemini models will help power the upgraded Siri through a multi-year deal reportedly costing Apple around $1 billion per year.
The convergence is unmistakable: every major platform is moving from assistants that talk to agents that act. But each is approaching the problem differently. OpenAI’s agent operates primarily through a browser. Anthropic’s works directly on a user’s desktop. Microsoft’s is tightly bound to the Microsoft 365 ecosystem. Apple’s emphasizes on-device processing and privacy. Google’s approach with Spark is distinctive in its bet on cloud persistence and deep integration with its own services.
Rather than controlling a user’s screen pixel by pixel, Spark works through structured integrations — Google’s own Workspace APIs, and increasingly, third-party connections through MCP. The advantage is reliability and speed: structured tool use is far more predictable than screen-reading. The disadvantage is that Spark, at least initially, can only act within the systems it’s been connected to.
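The trade-off can be seen in a toy dispatcher: structured calls are explicit and predictable, but an agent can only invoke tools that have been connected. The sketch below is a generic illustration, not Spark’s architecture or the MCP wire format; the ToolRegistry class and the tool names are hypothetical.

```python
from typing import Any, Callable, Dict


class ToolNotConnected(Exception):
    pass


class ToolRegistry:
    """Maps tool names to callables; unknown names are rejected outright."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def connect(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise ToolNotConnected(name)
        return self._tools[name](**kwargs)


def create_event(title: str, day: str) -> Dict[str, str]:
    # Stand-in for a real calendar integration.
    return {"title": title, "day": day, "status": "created"}


tools = ToolRegistry()
tools.connect("calendar.create_event", create_event)
event = tools.call("calendar.create_event", title="Status review", day="2026-06-01")
```

A request for an unconnected tool fails immediately with a clear error, which is the flip side of the reliability that structured integrations provide.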
Spark’s capabilities are inseparable from the model that drives it. Gemini 3.5 Flash, also announced Tuesday, is Google’s new workhorse AI model, designed specifically for the demands of agentic workflows.
The performance claims are substantial. Google says 3.5 Flash outperforms its previous frontier model, Gemini 3.1 Pro, across nearly all benchmarks, while running four times faster than comparable frontier models in terms of output tokens per second. An even more optimized version, available within Google’s Antigravity development platform, runs twelve times faster.
Pichai framed the economics bluntly. Companies processing roughly one trillion tokens per day on Google Cloud — a figure he said top enterprise customers are hitting — could save over $1 billion annually by shifting 80% of their workloads to a mix of Flash and frontier models like 3.5 Pro. In a market where, as Pichai noted, CIOs are already “blowing through their annual token budgets and it’s only May,” the cost argument may matter as much as the capability argument.
Internally, Google’s own developers have been consuming Gemini 3.5 Flash at a staggering and rapidly accelerating pace. In March, Google was processing about half a trillion tokens per day internally. That figure has since grown to more than three trillion — doubling roughly every few weeks. Pichai described this as a “powerful feedback loop” that continually improves the model.
Koray Kavukcuoglu, CTO of Google DeepMind and Chief AI Architect for Google, said the model’s speed is what makes agentic use cases practical. “3.5 Flash is especially good when deploying multiple agents simultaneously and completing long-running tasks,” he said during the briefing, adding that Google had successfully tested agents building “a working operating system entirely from scratch.”
The 3.5 Pro model, the more powerful sibling, is currently being tested internally and will roll out next month.
Gemini Spark will be available to Google AI Ultra subscribers. The company is simultaneously restructuring its subscription tiers to make the technology more accessible. A new Ultra plan at $100 per month provides a 5x higher usage limit than the Pro plan, along with priority access to Antigravity and 20TB of cloud storage. The top-tier Ultra plan drops from $250 to $200 per month, with a 20x higher usage limit and access to the full suite of capabilities.
Both tiers include Gemini Spark, the Daily Brief agent — a proactive morning digest that triages email, calendar, and tasks overnight — and access to the new Gemini Omni and 3.5 Flash models. The pricing positions Spark as a premium product — more expensive than Anthropic’s Claude Pro at $20 per month, but comparable to the higher tiers of competing products like Claude Max ($100–$200/month) and OpenAI’s ChatGPT Pro ($200/month).
The risks are real and multidimensional.
Reliability remains the industry’s greatest challenge. Even the best AI models hallucinate, misinterpret instructions, and make errors that a human would never make. An agent that drafts an email to the wrong person, misreads a spreadsheet figure, or sends a payment to the wrong merchant could create consequences that are difficult to reverse. Google’s approach of requiring explicit approval for high-stakes actions like spending money or sending emails is a sensible safeguard — but it also limits how autonomous the agent can actually be. An agent that asks for confirmation at every turn isn’t much of an agent at all.
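The tension in that safeguard can be made concrete with a small sketch. The code below is not Google's implementation. The action names, the `HIGH_STAKES` set and the `approve` callback are all hypothetical, chosen only to show how an approval gate passes low-stakes steps through while pausing on the kinds of actions the article names (sending email, spending money).

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical action kinds. Only these require a human decision.
HIGH_STAKES = {"send_email", "make_payment"}


@dataclass
class Action:
    kind: str
    description: str


def run_action(action: Action, approve: Callable[[Action], bool]) -> str:
    """Run low-stakes actions directly; ask the user before high-stakes ones."""
    if action.kind in HIGH_STAKES and not approve(action):
        return "blocked"
    return "executed"


def count_prompts(actions: list[Action]) -> int:
    """Number of times the user would be interrupted for this plan."""
    return sum(1 for a in actions if a.kind in HIGH_STAKES)


# An illustrative three-step plan.
plan = [
    Action("draft_email", "Draft a reply to the vendor"),
    Action("send_email", "Send the reply"),
    Action("make_payment", "Pay the vendor invoice"),
]
```

In this sketch, two of the three steps stop for confirmation. Widening `HIGH_STAKES` makes the agent safer and less autonomous at the same time, which is the tradeoff described above.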
Privacy is another concern. Spark’s ability to synthesize information across a user’s entire Gmail inbox, calendar, documents, and chat history means it has an extraordinarily deep view of a person’s digital life. Google says Spark operates on a fully managed, secure runtime with isolated ephemeral virtual machines, encrypted credentials, and Data Loss Prevention policies. But the concentration of personal context in a single AI system — accessible through natural language — creates a surface area that will attract scrutiny from regulators, privacy advocates, and security researchers.
Market timing is uncertain, too. The consumer appetite for always-on AI agents is unproven at scale. Google says the Gemini app has 900 million monthly users, but it’s unclear how many of those users are ready for the conceptual leap from “ask a question, get an answer” to “delegate a task, trust the outcome.” The history of digital assistants — from Clippy to early Siri to Alexa — is littered with products that promised proactive intelligence and delivered frustration.
And then there is the question of ecosystem lock-in. Spark works best within Google’s own services. While MCP connections to third-party apps will broaden its reach, the initial experience is one of deep Workspace integration. For the billions of people who live inside Google’s ecosystem, this is a natural fit. For those who split their digital lives across Microsoft, Apple, and other platforms, Spark’s utility will be more limited — at least initially.
Woodward acknowledged as much when asked whether Spark would remain confined to the Google ecosystem. “It’s going to be cross-platform in two ways,” he said — through MCP integrations with third-party apps, and through availability on the web, Android, and iOS, with tasks syncing across devices via the cloud.
Google’s bet with Gemini Spark is that the AI industry’s center of gravity is shifting from models that think to systems that act — and that the company best positioned to win that transition is the one with the most comprehensive set of consumer services to act within. It is a bet backed by enormous infrastructure investment. Google expects to spend approximately $180 to $190 billion in capital expenditure this year — roughly six times what it spent in 2022 — much of it on the AI compute required to run agents like Spark at scale for hundreds of millions of users.
The technology, in other words, is arriving. The models are fast enough, the integrations deep enough, the payment rails secure enough. Google has built a system that can draft your emails, organize your calendar, monitor your inbox, and soon enough, spend your money — all while you sleep.
But the hardest problem in artificial intelligence has never been making a machine capable. It has been making a human comfortable. For two decades, Google’s core promise has been ten blue links and a search box — a transaction built on the assumption that the user is in control. Gemini Spark asks users to renegotiate that relationship entirely, to hand a set of keys to a system that is brilliant, tireless, and still, by its maker’s own admission, best compared to a teenager with a debit card.
Gemini Spark rolls out to trusted testers this week, with a broader beta for U.S. Google AI Ultra subscribers expected next week.
Redis built its name as the caching layer that kept web applications from collapsing under load. The problem it is targeting now has the same structure but is harder to solve: production AI agents failing not because the models are wrong, but because the data underneath them is scattered, stale and structured for humans rather than machines. Retrieval pipelines built for single queries cannot absorb the volume agents generate.
The gap Redis is targeting is structural: agents make orders of magnitude more data requests than human users, but most retrieval layers were built for the human-scale problem. Redis Iris, launched Monday, is the company’s answer: a context and memory platform that sits between an agent and the data it needs to act. The platform combines real-time data ingestion, a semantic interface that auto-generates MCP tools from business data models, and an agent memory server built on Redis Flex, a rewritten storage engine that runs 99% of data on flash at a tenth of the cost of in-memory storage alone.
The announcement lands as enterprise RAG infrastructure is in active transition. VentureBeat’s Q1 2026 VB Pulse RAG Infrastructure Market Tracker found buyer intent to adopt hybrid retrieval tripling from 10.3% to 33.3% between January and March. Retrieval optimization surpassed evaluation as the top enterprise investment priority for the first time. Custom in-house retrieval stacks rose from 24.1% to 35.6% as enterprises outgrew off-the-shelf options. Redis is not the only infrastructure vendor reading those signals — several data platform providers have repositioned around agent context layers in recent weeks.
The scale mismatch is the structural argument behind the launch.
“Companies will have orders of magnitude more agents than human beings,” Rowan Trollope, CEO of Redis, told VentureBeat. “Orders of magnitude more agents than human beings means orders of magnitude more load on back end systems.”
Trollope traces the parallel back to the mobile era: When legacy backends built for branch tellers suddenly had to serve a million smartphone users, Redis became the caching layer that absorbed the load without a full rebuild.
What is different this time is that agents cannot write their own middleware. In the mobile era, a developer would sit with a database administrator, identify the queries an application needed and hard-code the caching logic into a middleware layer. Agents cannot do that. They need to find the right data at runtime, through interfaces built for them in advance, or they stall.
“This is like the analogy of the grocery store in the fridge,” he said. “If every time you have to go make your sandwich, you have to run to the grocery store to get the food, that’s not very efficient. You put a fridge in every house, you store a little bit of food there. And that’s kind of where we still tend to exist in the infrastructure stack.”
Iris ships five components that together cover data ingestion, semantic access, memory and caching.
Redis Data Integration. Now in general availability. RDI uses change data capture pipelines to sync data from relational databases, warehouses and document stores into Redis continuously, with connectors for Oracle, Snowflake, Databricks and Postgres.
Context Retriever. Now in preview. Developers define a semantic model of business data using pydantic models and Redis auto-generates MCP tools agents use to query it directly, with row-level access controls enforced server-side. Trollope describes the shift from classic RAG as a directional inversion. “It’s just a flip to let the agent pull the data instead of presupposing and stuffing it into the pipeline,” he said.
Agent Memory. Now in preview. Stores short- and long-term state across sessions so agents carry context without re-deriving it on each turn.
Redis Flex. A rewritten storage engine that runs 99% of data on SSDs and 1% in RAM, delivering petabyte-scale retrieval at sub-millisecond latencies.
Redis Search and LangCache. The retrieval and semantic caching backbone underneath the platform. LangCache reduces redundant model calls by caching prompt responses.
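To illustrate the Context Retriever idea, here is a minimal sketch of deriving a query tool from a data model and enforcing row-level access inside the handler. It does not use the Redis API or an MCP SDK, neither of which the article documents. Standard-library dataclasses stand in for pydantic models, and `Order`, `make_query_tool` and `scope_field` are hypothetical names.

```python
from dataclasses import dataclass, fields, asdict
from typing import Any


@dataclass
class Order:  # hypothetical business entity
    order_id: str
    region: str
    total: float


def make_query_tool(model: type, rows: list, scope_field: str) -> dict[str, Any]:
    """Derive a tool description and a handler from a data model."""
    # Parameter schema comes straight from the model's fields.
    params = {f.name: getattr(f.type, "__name__", str(f.type)) for f in fields(model)}

    def handler(caller_scope: str, **filters: Any) -> list[dict]:
        unknown = set(filters) - set(params)
        if unknown:
            raise ValueError(f"unknown filter fields: {sorted(unknown)}")
        return [
            asdict(r)
            for r in rows
            # Row-level access is checked in the handler, so the caller
            # cannot widen its own scope through the filters it supplies.
            if getattr(r, scope_field) == caller_scope
            and all(getattr(r, k) == v for k, v in filters.items())
        ]

    return {
        "name": f"query_{model.__name__.lower()}",
        "parameters": params,
        "handler": handler,
    }


ORDERS = [
    Order("o1", "emea", 120.0),
    Order("o2", "amer", 75.5),
    Order("o3", "emea", 40.0),
]
tool = make_query_tool(Order, ORDERS, scope_field="region")
```

An agent scoped to `"emea"` that calls `tool["handler"]("emea")` sees only the two EMEA orders. This is the pull pattern Trollope describes: the agent asks for data at runtime instead of having it stuffed into the prompt in advance.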
The data industry is generally heading in the same direction now. Every major database vendor is making a context layer argument.
Traditional database vendors including Oracle are integrating context and memory layers to bring relational databases into the agentic AI era. Purpose-built vector database vendors including Pinecone are doing the same, building out a new knowledge layer for agentic AI context. Standalone context layers like Hindsight are also part of the emerging landscape.
Trollope frames Redis’s position as structurally different from that competition.
“For us to win, no one else has to lose,” he said. Many Redis deployments already run MongoDB or Oracle as the backend system of record. Iris reflects and caches from those systems rather than displacing them. Redis is launching Iris in the Snowflake marketplace with native connectors.
Stephanie Walter, Practice Leader for AI Stack at HyperFRAME Research, puts the market context plainly. “The market is converging on the same conclusion: agents don’t just need more tokens or better models. They need governed, current, low-latency context,” Walter said.
Her read on Redis’s differentiation focuses on where Redis already sits in the stack: close to the runtime, to latency-sensitive operational state and to real-time data.
“The pitch is not ‘better RAG’ as much as ‘agents need live context, memory, and fast retrieval while they are actually working,’” she said.
Whether it comes from Redis or another vendor, every context layer technology will have to solve a governance problem to succeed.
“Agentic AI will not scale in the enterprise if every agent becomes a new cost center, a new data access risk, and a new governance exception,” she said. “The winning context layers will be the ones that make agents faster, cheaper, and safer to run.”
Mangoes.ai is one company that has already had to answer those questions in production, under conditions where the cost of getting context wrong is measured in patient outcomes.
Amit Lamba, founder and CEO of Mangoes.ai, runs a real-time voice AI platform deployed across large healthcare facilities where patients and clinicians ask live questions about treatment, scheduling and case history. Mangoes.ai built its stack natively on Redis from the start.
“Retrieval, memory, and session state all run through Redis, so we’re not stitching together separate tools and hoping they talk to each other,” Lamba said.
The problem Iris’s dynamic memory capability addresses is what happens across a complex session.
“Think about a one-hour group therapy session,” Lamba said. “You need to know who said what, when, and be able to surface the right information to the therapist in the moment. That’s not a simple retrieval problem.”
The platform runs multiple specialized agents in parallel, one for entity identification, one for relationship reasoning and one for integrating case history.
“The dynamic memory capability maps almost perfectly to the problem we’re solving,” Lamba said.
For enterprises that built their AI stack around RAG, the retrieval layer that got them to production is no longer enough to keep them there.
The RAG era is giving way to context architecture. The classic RAG model pushed data into the agent before the model was called. Production deployments are flipping that: agents pull what they need at runtime through tool calls, treating the data layer as a live resource rather than a pre-loaded payload. Teams still optimizing RAG pipelines are solving last year’s problem.
The semantic layer is now production infrastructure. The model that defines business entities, their relationships and the access rules between them needs to be built, versioned and maintained with the same discipline as a data pipeline. Most organizations have not staffed or structured for that work. The enterprises that define their context architecture now are the ones that will not have to rebuild it when agent workloads scale.
Budget is already moving. VB Pulse Q1 2026 data shows retrieval optimization investment rising from 19% to 28.9% across the quarter, overtaking evaluation spending for the first time. Organizations that spent the previous year measuring their retrieval quality are now spending to fix it. The context layer is an active procurement decision, not a roadmap item.
“The first buyer question should not be ‘Do I need a vector database, long context, memory, or a context engine?’ It should be ‘What does this agent need to know, how fresh must that knowledge be, who is allowed to access it, and what does every retrieval cost?’” Walter said.
Every query an enterprise AI application processes, every correction a subject matter expert makes to its output — that interaction is training data. Most organizations are not capturing it. The production workflows companies have already built are generating a continuous signal that improves AI models, and it is disappearing.
San Francisco-based Empromptu AI on Thursday launched Alchemy Models with a straightforward premise: the AI applications enterprises are already building are generating training data, and most of it is going to waste. The platform captures that signal automatically, routing validated outputs from subject matter experts back into a fine-tuning pipeline that improves the model over time. Enterprises own the resulting weights outright.
It sits in different territory from both RAG and traditional fine-tuning. RAG retrieves external context at inference time without modifying model weights. Traditional fine-tuning changes weights but requires separately assembled labeled datasets and a dedicated ML pipeline. Alchemy does the latter continuously, using the enterprise application itself as the data source.
Companies adopting foundation model APIs face three compounding constraints: inference costs that scale with usage, no ownership of the models their data is effectively training, and limited ability to customize behavior for domain-specific tasks. Empromptu CEO Shanea Leven says those constraints are widely felt but rarely addressed.
“Every customer, everybody that I talk to, is like, how am I not going to get disrupted? How am I going to protect my business? And they just don’t see the path,” Leven told VentureBeat in an exclusive interview.
Most custom model training approaches require companies to separately collect, clean and label data before any fine-tuning can begin. Alchemy takes a different path: the enterprise application itself generates and cleans the training data.
The mechanism runs through Empromptu’s Golden Data Pipelines infrastructure in two stages. Before an app is built, enterprise data is cleaned, extracted and enriched so the application starts with structured inputs. Once it is running, every output it generates goes back through the pipeline, where subject matter experts inside the organization review and correct it. That validated output becomes the training data for the next fine-tuning run.
“The app, the AI application that customers are already creating, cleans the data,” Leven said.
The resulting fine-tuned models are what Empromptu calls Expert Nano Models: small, task-specific models optimized for a particular workflow rather than general-purpose reasoning. Evals, guardrails and compliance controls run within the same pipeline, so governance travels with the training process. Customers own the model weights outright. Empromptu hosts and runs inference on its infrastructure, but the weights are portable and exportable for a fee. The platform is model agnostic, supporting Llama, Qwen and other base models.
The hard constraint is data volume. Early deployments run on the base model while the application accumulates enough production data to trigger a useful fine-tuning run. Leven acknowledged the timeline without sugarcoating it. “Training the model will just take time,” she said.
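The capture loop and the data-volume constraint described above can be sketched together. This is not Empromptu's pipeline. The `Review` record, the prompt/completion output format and the `min_examples` threshold of 500 are assumptions made for illustration, since the article does not say how many examples a useful run requires.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Review:
    prompt: str
    model_output: str
    expert_output: Optional[str]  # None means no expert has reviewed it yet


def to_training_examples(reviews: list[Review]) -> list[dict]:
    """Keep only expert-validated outputs; the expert's text is the target."""
    examples = []
    for r in reviews:
        if r.expert_output is None:
            continue  # unreviewed outputs never become training data
        examples.append({
            "prompt": r.prompt,
            "completion": r.expert_output,
            "corrected": r.expert_output != r.model_output,
        })
    return examples


def ready_to_fine_tune(examples: list[dict], min_examples: int = 500) -> bool:
    """Hypothetical gate: wait until enough validated data has accumulated."""
    return len(examples) >= min_examples


def to_jsonl(examples: list[dict]) -> str:
    """Serialize one example per line, a common fine-tuning input format."""
    return "\n".join(json.dumps(e) for e in examples)
```

Until `ready_to_fine_tune` returns true, the application in this sketch would keep serving the base model, which mirrors the early-deployment behavior Leven describes.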
OpenAI’s fine-tuning API and AWS Bedrock custom models both offer enterprise fine-tuning. Both require organizations to bring separately prepared training datasets and manage the fine-tuning process outside their application stack. The burden of data curation and model evaluation sits with the customer’s ML team.
Alchemy’s differentiation is process integration. The training data is generated by the enterprise application itself, so there is no separate data preparation step and no ML expertise required. The application workflow is the pipeline.
“Do I need to have Bedrock and go spin up another ML team to go figure out how to fine tune a model and figure out all of that infrastructure? No, anyone can do it now,” Leven said.
The tradeoff is platform dependency. Alchemy only works within the Empromptu environment. Enterprises that want the same outcome on existing infrastructure would need to replicate the data capture, validation and fine-tuning pipeline themselves.
Empromptu is targeting regulated and data-intensive verticals first: healthcare, financial services, legal technology, retail and revenue forecasting. These are sectors where general-purpose model outputs carry the highest mismatch risk and proprietary workflow data is most concentrated.
Among the early users is behavioral health company Ascent Autism, which uses Alchemy to automate session documentation and parent communication.
Facilitators use learner session recordings, transcripts, session notes and behavioral metrics to generate structured notes and personalized parent updates. That workflow previously required one to two hours of writing per session. With Alchemy training on the same data, it now takes 10 to 15 minutes.
“Relying solely on API-based models can become expensive quickly,” Faraz Fadavi, co-founder and CTO of Ascent Autism, told VentureBeat. “Alchemy gave us a way to structure the workflow, train models on our own data, and reduce costs while improving output quality over time.”
Fadavi said the company saw usable outputs quickly, with continued improvement as the system was refined. Evaluation criteria went beyond accuracy to include traceability to session data and output consistency with the company’s clinical voice.
“We wanted a system that could learn our workflow and produce outputs aligned with how we actually operate — not just summarize text,” he said.
The practical test: how much facilitators need to edit, whether the output matches their voice and whether it meaningfully reduces time spent. Facilitators have shifted from rewriting generated notes to editing and quality-checking them.
The data flywheel is real — but so is the platform lock-in:
Every workflow is a training opportunity. Enterprises that capture and validate outputs from their production AI applications will compound that advantage over time. More usage generates more training signals, which produces more accurate domain-specific models, which generate better outputs, which produce cleaner training data in the next cycle.
Leven positions Alchemy as a third architectural choice. Enterprises have spent the past two years choosing between RAG for domain knowledge access and fine-tuning for model specialization. Workflow-driven model training is a third option, combining the ongoing improvement of fine-tuning with the operational simplicity of building inside a managed platform.
“Having that data moat is the most valuable currency,” Leven said.