AI IQ is here: a new site scores frontier AI models on the human IQ scale. The results are already dividing tech.

For decades, the IQ test has been one of the most familiar — and most contested — yardsticks for human intelligence. Now, a startup project called AI IQ is applying the same metaphor to artificial intelligence, assigning estimated intelligence quotients to more than 50 of the world’s most powerful language models and plotting them on a standard bell curve.

The result is a set of interactive visualizations at aiiq.org that have ricocheted across social media in the past week, drawing praise from enterprise technologists who say the charts make an impossibly complex market legible — and sharp criticism from researchers and commentators who warn the entire framework is misleading.

“This is super useful,” wrote Thibaut Mélen, a technology commentator, on X. “Much easier to understand model progress when it’s mapped like this instead of another giant leaderboard table.”

Brian Vellmure, a business strategist, offered a similar endorsement: “This is helpful. Anecdotally tracks with personal experience.”

But the backlash arrived just as quickly. “It’s nonsense. AI is far too jagged. The map is not the territory,” posted AI Deeply, an artificial intelligence commentary account, crystallizing a worry shared by many researchers: that reducing a language model’s sprawling, uneven capabilities to a single number creates a dangerous illusion of precision.

Twelve benchmarks, four dimensions, and one controversial number: how AI IQ actually works

AI IQ was created by Ryan Shea, an engineer, entrepreneur, and angel investor best known as a co-founder of the blockchain platform Stacks. Shea also co-founded Voterbase and has invested in the early stages of several unicorns, including OpenSea, Lattice, Anchorage, and Mercury. He holds a Bachelor of Science in Mechanical Engineering from Princeton University.

The site’s methodology rests on a deceptively simple formula. AI IQ groups 12 benchmarks into four reasoning dimensions: abstract, mathematical, programmatic, and academic. The composite IQ is a straight average of those four dimension scores: IQ = ¼ (IQ_Abstract + IQ_Math + IQ_Prog + IQ_Acad).

The abstract reasoning dimension draws from ARC-AGI-1 and ARC-AGI-2, the notoriously difficult pattern-recognition benchmarks designed to test general fluid intelligence. Mathematical reasoning includes FrontierMath (Tiers 1–3 and Tier 4), AIME, and ProofBench. Programmatic reasoning uses Terminal-Bench 2.0, SWE-Bench Verified, and SciCode. Academic reasoning pulls from Humanity’s Last Exam, CritPt, and GPQA Diamond.

Each raw benchmark score gets mapped to an implied IQ through what the site describes as “hand-calibrated difficulty curves.” Crucially, the methodology compresses ceilings for benchmarks considered easier or more susceptible to data contamination, preventing them from inflating scores above 100. Harder, less gameable benchmarks retain higher ceilings. The system also handles missing data conservatively: models need scores on at least two of the four dimensions to receive a derived IQ, and when benchmarks are absent, the pipeline deliberately pulls scores down rather than up. The site states that “every derived IQ averages all four dimensions, so missing coverage cannot make a model look better by omission.”
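As a rough illustration of the pipeline described above, the mapping and averaging steps can be sketched in Python. This is a minimal sketch, not the site's code: the function names, the anchor points on the two curves, and the rule for filling a missing dimension (the lowest available dimension score minus a fixed penalty) are all assumptions, since AI IQ does not publish its exact transformations.

```python
from bisect import bisect_right

DIMENSIONS = ("abstract", "math", "prog", "acad")

# Hypothetical calibration curves: (raw benchmark score in %, implied IQ).
# The "easy" curve is capped at 100, mirroring the compressed ceilings.
EASY_CURVE = [(0.0, 70.0), (100.0, 100.0)]
HARD_CURVE = [(0.0, 70.0), (50.0, 100.0), (100.0, 130.0)]


def implied_iq(raw_score, curve):
    """Piecewise-linear interpolation over sorted anchors, clamped at both ends."""
    xs = [p[0] for p in curve]
    ys = [p[1] for p in curve]
    if raw_score <= xs[0]:
        return ys[0]
    if raw_score >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, raw_score)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (raw_score - x0) / (x1 - x0)


def composite_iq(dimension_iqs, missing_penalty=5.0, min_dimensions=2):
    """Average all four dimensions; return None if fewer than two are covered.

    Assumption: a missing dimension is filled with the lowest available
    dimension score minus `missing_penalty`, so gaps pull the result down.
    """
    present = {d: dimension_iqs[d] for d in DIMENSIONS
               if dimension_iqs.get(d) is not None}
    if len(present) < min_dimensions:
        return None
    fallback = min(present.values()) - missing_penalty
    total = sum(present.get(d, fallback) for d in DIMENSIONS)
    return total / len(DIMENSIONS)
```

Under these assumed rules, a model with only two covered dimensions always scores at or below the average of the dimensions it does report, which matches the site's stated intent that missing coverage cannot improve a score.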

OpenAI leads the bell curve, but the gap between the top AI models has never been smaller

As of mid-May 2026, the AI IQ charts tell a story of rapid convergence at the top of the frontier — and widening diversity in the tiers below.

According to the Frontier IQ Over Time chart, GPT-5.5 from OpenAI currently sits at the peak of the bell curve, with an estimated IQ near 136 — the highest of any model tracked. It is closely followed by GPT-5.4 (approximately 131), Opus 4.7 from Anthropic (approximately 132), and Opus 4.6 (approximately 129). Google’s Gemini 3.1 Pro lands near 131, making the top cluster extraordinarily tight.

That compression is not unique to AI IQ’s framework. Visual Capitalist, drawing from a separate Mensa-based ranking by TrackingAI, recently observed the same dynamic, noting that “the biggest takeaway is how compressed the top of the leaderboard has become.” On that scale, Grok-4.20 Expert Mode and GPT 5.4 Pro tied at 145, with Gemini 3.1 Pro at 141.

Below the frontier cluster, the AI IQ charts show a crowded midfield. Models from Chinese labs — Kimi K2.6, GLM-5, DeepSeek-V3.2, Qwen3.6, MiniMax-M2.7 — bunch between roughly 112 and 118, making the cost-performance tier increasingly competitive for enterprise buyers who don’t need the absolute best model for every task. One X user, ovsky, noted that the data “confirms experience with sonnet 4.6 being an absolute workhorse as opposed to opus 4.5” — pointing to the way the charts can validate practitioner intuitions that headline rankings often miss.

Why emotional intelligence scores are becoming the new battleground in AI model rankings

What distinguishes AI IQ from most other benchmarking efforts is its inclusion of an “EQ” — emotional intelligence — score. The site maps each model’s EQ-Bench 3 Elo score and Arena Elo score to an estimated EQ using calibrated piecewise-linear scales, then takes a 50/50 weighted composite of the two.

The EQ scores produce a meaningfully different ranking than IQ alone. On the IQ vs. EQ scatter plot, Anthropic’s Opus 4.7 leads on EQ with a score near 132, pushing it into the upper-right quadrant — the most desirable position, signaling both high cognitive and high emotional intelligence. OpenAI’s GPT-5.5 and GPT-5.4 cluster in the high-IQ zone but lag slightly on EQ. Google’s Gemini 3.1 Pro sits in a strong middle position on both axes.

One notable methodological choice has drawn attention: EQ-Bench 3 is judged by Claude, an Anthropic model, which the site acknowledges “creates potential scoring bias in favor of Anthropic models.” To correct for this, AI IQ subtracts a 200-point Elo penalty from the EQ-Bench component for all Anthropic models before mapping to implied EQ. The Arena component is unaffected since it uses human judges. That self-correction is unusual in the benchmarking world, and it suggests Shea is aware of the methodological minefield he has entered. Still, the EQ dimension captures something IQ alone cannot: the growing importance of conversational quality, collaboration, and trust in models deployed for user-facing work.
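The EQ calculation described above can be sketched the same way. The 200-point penalty and the 50/50 weighting come from the site's description; the anchor points on both scales and all function names are hypothetical, chosen only to make the example run.

```python
ANTHROPIC_ELO_PENALTY = 200  # stated on the site for the EQ-Bench component

# Hypothetical piecewise-linear scales: (Elo, estimated EQ).
EQ_BENCH_SCALE = [(1000, 90.0), (1500, 130.0)]
ARENA_SCALE = [(1200, 90.0), (1500, 130.0)]


def interpolate(x, points):
    """Piecewise-linear interpolation over sorted anchors, clamped at the ends."""
    if x <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return points[-1][1]


def estimated_eq(eq_bench_elo, arena_elo, provider):
    """50/50 composite of the two mapped scores.

    The penalty applies only to the EQ-Bench component, and only for
    Anthropic models, because that benchmark is judged by Claude.
    """
    if provider.lower() == "anthropic":
        eq_bench_elo -= ANTHROPIC_ELO_PENALTY
    eq_bench_part = interpolate(eq_bench_elo, EQ_BENCH_SCALE)
    arena_part = interpolate(arena_elo, ARENA_SCALE)
    return 0.5 * eq_bench_part + 0.5 * arena_part
```

With these made-up scales, two models with identical raw Elo scores of 1500 on both components would receive 130 if built by OpenAI and 122 if built by Anthropic, which shows how much the correction can move the final number.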

The AI cost-performance chart that enterprise buyers actually need to see

Perhaps the most practically useful chart on the site is not the bell curve but the IQ vs. Effective Cost scatter plot. It maps each model’s estimated IQ against an “effective cost” metric — defined as the token cost for a task using 2 million input tokens and 1 million output tokens, multiplied by a usage efficiency factor.
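The effective cost definition is simple enough to express directly. In the sketch below, the 2 million input and 1 million output token workload comes from the site; the per-million-token prices and the interpretation of the usage efficiency factor as a plain multiplier (values above 1.0 meaning a model burns more tokens than the baseline) are assumptions for illustration.

```python
INPUT_TOKENS = 2_000_000   # reference workload from the site's definition
OUTPUT_TOKENS = 1_000_000


def effective_cost(input_price_per_mtok, output_price_per_mtok,
                   usage_efficiency=1.0):
    """Token cost of the reference task, scaled by a usage efficiency factor.

    Prices are in dollars per million tokens. The efficiency factor is
    treated as a simple multiplier, which is an assumption.
    """
    token_cost = (INPUT_TOKENS / 1_000_000) * input_price_per_mtok \
        + (OUTPUT_TOKENS / 1_000_000) * output_price_per_mtok
    return token_cost * usage_efficiency
```

For a hypothetical model priced at $5 per million input tokens and $25 per million output tokens, the reference task costs $35 at a factor of 1.0 and $52.50 at a factor of 1.5.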

The chart reveals a familiar pattern in enterprise technology: the best models are not always the best value. GPT-5.5 and Opus 4.7 sit in the upper-left corner — high IQ, high cost, with effective per-task costs north of $30 and $50 respectively. Meanwhile, models like GPT-5.4-mini, DeepSeek-V3.2, and MiniMax-M2.7 occupy a sweet spot in the middle: respectable IQ scores between 112 and 120, at effective costs ranging from roughly $1 to $5 per task. At the cheapest extreme, GPT-oss-20b (an open-source OpenAI model) appears near $0.20 effective cost with an IQ around 107 — potentially the most economical option for bulk classification or extraction workloads.

The site also offers a 3D visualization mapping IQ, EQ, and effective cost simultaneously. A dashed line running through the cube points toward the ideal: higher IQ, higher EQ, and lower cost. Models near the “green end” of that axis are stronger all-around deals; those near the “red end” sacrifice capability, cost efficiency, or both. For CIOs staring at API invoices, the implication is clear: the intelligence gap between a $50 model and a $3 model has narrowed enough that routing — using expensive models for hard problems and cheap ones for everything else — is no longer optional. It is the dominant architecture for serious AI deployments.
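The routing idea can be sketched as a rule that picks the cheapest model clearing a task's capability bar. The catalog below uses rounded figures loosely based on the chart values quoted above, plus a hypothetical "mid-tier" entry; the `Model` type, the `route` function, and the thresholds are illustrative assumptions, not a description of any vendor's router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    iq: float    # estimated IQ, as plotted on the chart
    cost: float  # effective cost per reference task, in dollars


# Illustrative catalog; "mid-tier" is a hypothetical stand-in.
CATALOG = [
    Model("GPT-5.5", 136.0, 30.0),
    Model("mid-tier", 115.0, 3.0),
    Model("gpt-oss-20b", 107.0, 0.20),
]


def route(models, required_iq):
    """Return the cheapest model whose IQ meets the requirement.

    If no model qualifies, fall back to the most capable one available.
    """
    eligible = [m for m in models if m.iq >= required_iq]
    if not eligible:
        return max(models, key=lambda m: m.iq)
    return min(eligible, key=lambda m: m.cost)
```

In practice the hard part is estimating `required_iq` for a given task, which a single composite score may not capture well for the reasons critics raise.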

Critics say AI’s “jagged” capabilities make a single IQ score dangerously misleading

The loudest objection to AI IQ is philosophical, and it cuts deep. Critics argue that collapsing a model’s uneven capabilities into a single score obscures more than it reveals.

“IQ as a proxy is fading — we’re seeing reasoning density spikes that don’t map to g-factor,” posted Zaya, a technology commentator, on X. “GPT-5.5 already hit saturation on MMLU-Pro, but still fails ClockBench 50% of the time.”

That observation touches on what AI researchers call the “jaggedness” problem: large language models often exhibit wildly uneven capabilities, excelling at graduate-level physics while failing at tasks a child could do. A composite score can paper over those gaps.

Pressureangle, another X user, posted a more granular critique, calling out “complete lack of transparency” and arguing the site never fully discloses how its calibration curves were created or validated. In fairness, AI IQ does list its 12 benchmarks and shows the shape of each calibration curve in its methodology modal. But the raw data and precise mathematical transformations are not published as open datasets — a gap that matters to researchers accustomed to fully reproducible methods.

Others questioned the premise itself. “As useless as human IQ testing,” wrote haashim on X. Shubham Sharma, an AI and technology writer, offered a constructive alternative: “Why not having the Models take an official (MENSA-Grade) test? Wouldn’t this be the most accurate and most ‘human-comparable’ way to benchmark intelligence?” That approach already exists through TrackingAI, which administers the Mensa Norway IQ test to language models. But Mensa-style tests measure only abstract pattern recognition, while AI IQ attempts a broader composite across coding, mathematics, and academic reasoning. As Visual Capitalist noted, “an IQ-style benchmark captures only one slice of capability.” Each approach has tradeoffs — and neither has won the argument yet.

The real race isn’t for the highest score — it’s for the smartest model stack

For all the debate about methodology, the most important signal in AI IQ’s data may not be any single model’s score. It is the shape of the market the charts reveal.

There are now more than 50 frontier-class models available through APIs, from at least 14 major providers spanning the United States, China, and Europe. Each provider publishes its own benchmarks, often cherry-picked to showcase strengths. The result is a Tower of Babel where no two companies measure the same thing in the same way. Academic research has highlighted that “most benchmarks introduce bias by focusing on a particular type of domain,” and the Frontier IQ Over Time chart on AI IQ shows just how fast the targets are moving: in October 2023, GPT-4-turbo sat near an estimated IQ of 75. By early 2026, the top models were brushing 135 — roughly 60 points of improvement in 30 months.

That pace raises a fundamental question about whether any scoring system can keep up. The site compresses ceilings for saturated benchmarks, but as models continue to max out even the hardest tests — ARC-AGI-2, FrontierMath Tier 4, Humanity’s Last Exam — the framework will face the same ceiling effects that have plagued every AI evaluation before it. Connor Forsyth pointed to this dynamic on X: “ARC AGI 3 disagrees,” he wrote, referencing a next-generation benchmark that may already be undermining current scores.

AI IQ is not perfect. Its methodology is partially opaque. Its IQ metaphor can mislead. And its creator acknowledges known biases while likely missing others. But the alternative — wading through dozens of provider-specific benchmark tables, each using different test suites and scoring conventions — is worse. The site offers enterprise buyers something genuinely scarce: a single framework for comparing models across providers, dimensions, and price points, updated regularly, with enough nuance to show that the right answer to “which model is best?” is almost always “it depends on the task.”

As Debdoot Ghosh mused on X after viewing the charts: “Now a human’s role is just to orchestrate?”

Maybe. But if the AI IQ data shows anything clearly, it is that orchestration — knowing which model to deploy, when, and at what price — has become its own form of intelligence. And for that, there is no benchmark yet.

Anthropic finally beat OpenAI in business AI adoption — but 3 big threats could erase its lead

For the first time since the AI race began, more American businesses are paying for Anthropic’s Claude than for OpenAI’s ChatGPT.

Adoption of Anthropic rose 3.8 percentage points in April to 34.4% of businesses, according to the May 2026 release of the Ramp AI Index. OpenAI’s adoption fell 2.9 percentage points to 32.3%. Overall AI adoption among businesses rose 0.2 percentage points to 50.6%.

The crossover — published Tuesday by Ramp, the corporate card and finance automation platform that tracks spending patterns across more than 50,000 U.S. businesses — marks the culmination of a yearlong surge by Anthropic that few in the industry predicted. Anthropic has quadrupled its business adoption over the past year, while OpenAI grew its business adoption by only 0.3%.

But the same report that crowns a new market leader also warns that Anthropic’s position may be more fragile than it appears — threatened by escalating costs, compute constraints, and the very token-based pricing model that has fueled the company’s extraordinary revenue growth.

How Anthropic went from a niche player to the most popular AI model in corporate America

To appreciate the scale of the shift, consider where the two companies stood a year ago. In April 2025, OpenAI commanded roughly 32% of business AI adoption according to Ramp’s underlying data, while Anthropic stood at under 8%. OpenAI had built an early, commanding lead as the consumer default — ChatGPT was where most people first encountered AI, and that momentum carried into corporate purchasing decisions.

Anthropic’s path was different. The company was popular early on with the earliest adopters — engineers, AI evangelists, the technical vanguard inside organizations. As Ramp lead economist Ara Kharazian noted in the March 2026 edition of the index, Anthropic leveraged that early-adopter base to go mainstream. By February, Anthropic was winning about 70% of head-to-head matchups against OpenAI among businesses purchasing AI services for the first time — a complete reversal of the trends observed in 2025.

The trajectory is visible in Ramp’s underlying data. The company’s adoption figures show Anthropic climbing from 0.03% of businesses in June 2023 to 7.94% by April 2025, then rocketing to 34.44% by April 2026.

OpenAI, meanwhile, peaked near 36.5% in mid-2025 and has been slowly declining since. The engine behind much of Anthropic’s growth is a single product: Claude Code, the company’s agentic AI coding tool, which has become the fastest-growing product in Anthropic’s history. A recent analysis estimated that 4% of all GitHub public commits worldwide were being authored by Claude Code, double the percentage from just one month prior.

Business Insider reported in April that the crossover was imminent. A Ramp spokesperson told the outlet that “at the current pace, Anthropic is on track to surpass OpenAI within the next two months,” noting that it already led “among early adopters, including VC-backed companies, and in key sectors like software, finance, and professional services.” That prediction proved accurate almost to the day.

AI adoption reaches a workplace tipping point, but the productivity revolution hasn’t arrived yet

The Ramp data on business spending finds its complement in a separate workforce survey that underscores just how deeply AI has embedded itself into American economic life. For the first time in Gallup’s measurement, half of employed American adults say they use AI in their role at least a few times a year, up from 46% the previous quarter. Frequent use is also increasing, with 13% of employees now saying they use AI daily and 28% reporting they use it a few times a week or more.

But the Gallup data, based on a February 2026 survey of 23,717 U.S. employees, also suggests that the benefits of AI remain concentrated at the level of individual tasks rather than organizational transformation. Only about one in 10 employees in AI-adopting organizations strongly agree that artificial intelligence has transformed how work gets done. That finding is consistent with firm-level studies across the U.S., U.K., Germany, and Australia showing chief executives reporting minimal broad productivity effects from AI over the past three years — a notable gap between the hype cycle and operational reality.

The Ramp methodology captures a different but complementary signal. Where Gallup asks employees whether they use AI, Ramp measures whether their employer is writing checks for it. The index counts corporate card and invoice-based payments, identifying firms as AI adopters if they have a positive transaction amount for an AI product or service in a given month. As Ramp’s methodology page notes, its results likely underestimate actual adoption because many employees use free AI tools or personal accounts for work tasks. Taken together, the two datasets paint a picture of AI that is ubiquitous in the American workplace but has not yet delivered on its promise to fundamentally transform how organizations operate.
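The adopter rule Ramp describes can be sketched as a small calculation. The tuple layout, the function name, and the choice to net all of a business's transactions within the month before checking for a positive amount are assumptions; the sample data is invented.

```python
from collections import defaultdict


def adoption_rate(transactions, all_businesses, month, vendors):
    """Share of businesses with a positive AI spend in the given month.

    `transactions` is an iterable of (business_id, month, vendor, amount).
    Assumption: amounts are netted per business, so a refund alone does
    not count as adoption.
    """
    businesses = set(all_businesses)
    if not businesses:
        raise ValueError("all_businesses must not be empty")
    totals = defaultdict(float)
    for business_id, tx_month, vendor, amount in transactions:
        if tx_month == month and vendor in vendors and business_id in businesses:
            totals[business_id] += amount
    adopters = {b for b, total in totals.items() if total > 0}
    return len(adopters) / len(businesses)


# Invented sample data for illustration only.
SAMPLE_TRANSACTIONS = [
    ("a", "2026-04", "Anthropic", 120.0),
    ("b", "2026-04", "OpenAI", 20.0),
    ("c", "2026-04", "Anthropic", -15.0),  # refund only, not an adopter
    ("a", "2026-03", "OpenAI", 20.0),      # different month, ignored
]
SAMPLE_BUSINESSES = ["a", "b", "c", "d"]
```

Because the rule sees only card and invoice payments, a business whose employees rely on free tiers or personal accounts would look like business "d" here: present in the denominator but never counted as an adopter.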

Why Anthropic’s biggest threat might be the success of its own best-selling product

Perhaps the most striking aspect of Ramp’s analysis is its refusal to declare a lasting winner. Kharazian identified three specific risks facing Anthropic even as the company takes the lead — and the most serious one stems from a structural tension baked into the company’s business model.

Anthropic makes more money when businesses purchase more tokens, meaning the company is incentivized to drive users toward more expensive models even when cheaper ones are sufficient. This dynamic is already creating budget crises at major enterprises. Uber’s CTO revealed that the company spent its entire 2026 AI budget in just four months, largely on Claude Code and Cursor, with engineers reporting monthly API costs between $500 and $2,000 per person. Adoption jumped from 32% to 84% of Uber engineers in a matter of months, and about 70% of committed code at Uber now comes from AI. The Uber case is a microcosm of a broader tension: Claude Code works — perhaps too well. When a productivity tool becomes so valuable that an organization’s $3.4 billion R&D operation can’t afford to keep the lights on, the resulting cost scrutiny could push enterprises toward cheaper alternatives.

At the same time, quality and reliability have suffered under the weight of demand. In recent weeks, users have experienced frequent outages, rate limits, and increasing dissatisfaction with Claude’s results. Anthropic has responded by resetting usage limits and by striking a compute deal with SpaceX to access more than 300 megawatts of new capacity at the Colossus 1 data center in Memphis. CEO Dario Amodei said the company saw “80x growth per year in revenue and usage” for Q1 2026, when it had only planned for 10x. And Ramp economist Rafael Hajjar found that Anthropic’s latest model update would triple token costs for any prompt that includes an image — a change that seems at odds with the company’s already-acute cost and compute problems.

Open-source models and OpenAI’s Codex could quickly erode Anthropic’s narrow lead

The Ramp report points to competitive dynamics that could reshape the market within months. Some of the fastest-growing vendors on Ramp’s platform in April were AI inference platforms that give companies access to cheap, open-source models — offering enterprises a way to get “good enough” AI at a fraction of the cost, particularly for routine tasks that don’t require frontier model capabilities.

OpenAI’s Codex presents an even more direct threat. By most measures, it is a strong product that does many of the same tasks as Claude Code at a lower price point — and the switching cost between models is minimal. Uber itself is already testing Codex as a hedge, a move that could preview a broader pattern across enterprise tech. OpenAI also retains enormous structural advantages. ChatGPT reached 900 million weekly active users by March 2026, dwarfing Claude’s consumer footprint. Enterprise revenue now makes up more than 40% of OpenAI’s total and is on track to reach parity with consumer revenue by the end of 2026. And OpenAI’s $122 billion funding round, closed in March at an $852 billion valuation, gives it vast resources to compete on pricing, capacity, and product development.

Anthropic is not standing still on distribution. AWS recently launched Claude Platform on AWS, giving enterprises direct access to Anthropic’s native platform through existing AWS credentials, billing, and access controls — a move that lowers procurement friction considerably. Anthropic has also announced compute agreements totaling billions of dollars with Amazon, Google, Microsoft, Nvidia, and others, though much of that capacity won’t come online until late 2026 or 2027. Anthropic is reportedly in talks to raise another $50 billion at a valuation approaching $900 billion.

The unlikely reason businesses are choosing Claude over cheaper alternatives

Beneath the spending data and market share charts lies a more intriguing question: Why are businesses choosing Anthropic over a cheaper, comparably performing alternative?

Kharazian explored this in his March analysis. Claude Code and OpenAI’s Codex are roughly comparable products — on certain benchmarks, Codex is arguably better, and it’s also cheaper. Yet Anthropic can’t meet its own demand. Every plan still has usage limits and rate caps. The company is actively turning away revenue because it doesn’t have the compute to serve it. Despite charging more for roughly equivalent performance, Anthropic’s demand is growing.

Kharazian suggested the answer might be cultural. Earlier this year, Anthropic refused to agree to the Pentagon’s terms of use for Claude, resulting in a blacklisting by the Department of Defense. OpenAI stepped in to offer its services in Anthropic’s place. In the wake of that episode, users rallied around Anthropic, and Claude temporarily surpassed ChatGPT on the App Store. The question, Kharazian wrote, is whether choosing an AI model is becoming less like an enterprise procurement decision and “more like the green bubble/blue bubble distinction in iMessage: a signal of identity as much as a choice of technology.”

That observation may sound absurd for an enterprise software category. But Ramp’s data tells a story that pure economics cannot fully explain. In a market where the products perform similarly, where the cheaper option is arguably better on benchmarks, and where switching costs are negligible, something other than spreadsheet logic is driving the biggest shift in AI market share since the industry began. As Kharazian noted in his report: “We have never seen a software industry as dynamic, where newcomers can disrupt market leaders in a matter of months, and where the pace of development overrides the typical forces of vendor stickiness.”

That dynamism cuts both ways. The same forces that propelled a company from 8% to 34% market share in twelve months could just as easily work in reverse. Anthropic’s two-point lead was earned in the most volatile software market in modern history — and in this market, the distance between the throne and the floor has never been shorter.

Market research is too slow for the AI era, so Brox built 60,000 identical ‘digital twins’ of real people you can survey instantly, repeatedly

In a world where a viral TikTok video can cause a brand to trend globally in mere hours, the traditional market research cycle — often spanning 12 weeks — is becoming a liability.

As industry experts have observed, the lag between posing a survey question and receiving answers from a wide (or targeted) pool of respondents has become a primary bottleneck for Fortune 500 decision-makers, who must navigate volatile geopolitical and economic shifts with data that is frequently outdated by the time it reaches a slide deck.

Brox, a predictive human intelligence startup, recently announced a strategic funding round following a year in which it reported 10X revenue growth. Its proposition is as ambitious as it is technical: the creation of a “parallel universe” populated by 60,000 digital twins of real, living human beings, complete with their demographic profiles and consumer preferences, allowing enterprises to run unlimited experiments in hours rather than months.

“These digital twins are one-to-one replicas of actual, real individuals,” said Brox CEO Hamish Brocklebank in a recent video call interview with VentureBeat. “We recruit real people like a normal panel company does, pay them to interview them, and capture all the data around them — fully consent-driven.”

The company, currently a lean 14-person operation, is positioning itself as the antithesis of the “insane” research industry. By replacing statistical models with behavioral replicas, Brox aims to transform how the world’s largest banks and pharmaceutical giants anticipate human reactions to high-stakes global and market-shifting events, or narrow, targeted product releases and personnel news, and everything in between.

The kinds of surveys and specific questions that Brox asks its digital twins are completely open-ended and can be customized to fit any conceivable business customer’s use cases and goals.

According to Brocklebank, examples of survey questions include: “What happens if America invades Iran or Greenland? Will depositors at Bank of America put more money into their account or take more money out? Or, in pharmaceuticals, if RFK Jr. says something next week, will that make people more likely to take vaccines or less likely?”

Not synthetic people — AI copies of real ones

The core differentiator of Brox’s technology lies in the fidelity of its input data.

While many competitors in the “digital audience” space rely on purely synthetic identities (generic personas generated by large language models, or LLMs), Brocklebank argues that these methods inevitably produce “AI slop.”

Purely synthetic audiences often cluster around a tight distribution of answers, over-indexing for “correct” or “healthy” behaviors (such as eating broccoli) because of inherent biases in the underlying models.

Brox’s “Digital Twins” are instead one-to-one behavioral replicas of real individuals who have been recruited and interviewed with exhaustive depth. The process is intensive:

  • Deep Interviews: The company conducts hours of real and AI-driven interviews with each participant.

  • Psychological Depth: The data collection seeks to understand fundamental “decision drivers,” including upbringing, relationships, and even marital stability.

  • Data Density: For some twins, Brox maintains up to 300 pages of text data, representing what Brocklebank calls “the deepest per person data set that exists”.

To solve the “black box” problem common in AI, Brox utilizes a “reasoning chain” for its predictive outputs. When a digital twin predicts a reaction — such as how a $2 billion net-worth individual might respond to a specific interest rate hike — the model introspects and provides a step-by-step explanation for that decision.

This allows clients to understand not just what will happen, but the underlying psychology of why it is happening.

Scaling the “unscalable” interview

The product offering is currently live in the US, UK, Japan, and Turkey. Brox has successfully digitized specific, high-value cohorts that are traditionally difficult for researchers to access.

This includes a panel of “high-net-worth” individuals (those worth over $5 million), among them a multibillionaire, as well as specialized medical professionals such as dermatologists.

However, the largest value for customers is likely in the aggregate mass of all individuals that can be polled en masse and/or segmented across demographics, especially those of medium and lower income levels, whose purchasing power and decision-making is more constrained and whose market-

One of the more unique aspects of the Brox platform is its incentive structure. To ensure twins remain up-to-date, real-world counterparts are re-contacted frequently.

For high-value individuals who are not motivated by small cash payments, Brox has issued Stock Appreciation Rights (SARs), essentially making these participants “investors” in the company’s success to ensure they continue to provide high-fidelity personal updates. The platform’s use cases currently focus on two primary sectors:

  1. Pharmaceuticals: Predicting vaccine hesitancy or how physicians might react to new biologics based on shifting political climates.

  2. Finance: Simulating how depositors at major banks might move funds in response to geopolitical events, such as conflicts in the Middle East.

Asked why Brox goes to the trouble of interviewing and digitally cloning real people instead of generating wholly fictitious synthetic personas with LLMs and other AI models, Brocklebank offered his perspective.

“You can create 10,000 truly synthetic digital twins, but the answers will still normalize into a very tight distribution, which is not realistic when you’re actually asking real people,” Brocklebank said.

By maintaining a pre-built audience of 60,000 twins, the company enables clients to bypass the recruitment phase of research. A large US bank or a global pharma giant can now “query” the digital population and receive a validated analysis in a matter of hours.

Pricing and accessibility

Unlike traditional research firms that charge on a per-project or per-respondent basis, Brox operates as a high-end Software-as-a-Service (SaaS) platform with enterprise-level commercial licensing. The company avoids the “seat” or “usage” limits that often hinder rapid experimentation within large organizations.

  • Pricing Tiers: Subscriptions are sold as blanket flat fees, starting at a minimum of $100,000 per year.

  • Top-Tier Contracts: For larger deployments involving multiple teams and global data access, contracts scale up to $1.5 million per year.

  • Usage Rights: Clients are granted unlimited usage during the contract period. This allows them to run thousands of simulations without worrying about incremental costs, encouraging a culture of “testing everything” before deployment.

From a legal and privacy standpoint, the digital twins are built on a “fully consent-driven” framework. While the twins can be traced back to real human data for internal validation, the platform is designed to provide aggregated behavioral insights that protect the anonymity of the participants while maintaining the predictive power of their digital replicas.

Rejecting the rise of Kalshi, Polymarket and ‘prediction markets’

The tech industry has recently seen a surge in valuations and interest in “prediction markets” like Polymarket and Kalshi, which allow users to bet on the outcomes of various global events.

However, the leadership at Brox maintains a distinct distance from these platforms, citing a “personal disdain” for betting markets from both a moral and intellectual perspective.

Brocklebank argues that while betting markets can predict outcomes (e.g., who wins an election), they offer zero utility for business decision-makers because they fail to provide the “why”.

Knowing there is a 60% chance of a certain candidate winning does not help a company adjust its consumer strategy; knowing why a specific cohort of depositors is feeling anxious does.

Investors including Scribble Ventures, Wonder Ventures, and Vela Partners have backed this “human-first” approach to AI, betting that the moat created by deep human data will prove more resilient than the commoditized models of synthetic data providers.

As Brox prepares for launches in the Middle East and APAC, the company is moving toward its ultimate goal: simulating the entire world as a “parallel universe” for risk-free decision-making.

OpenAI turns its sold-out GPT-5.5 party into a monthlong Codex giveaway for 8,000 developers

OpenAI on Monday began emailing more than 8,000 developers who applied for its invite-only GPT-5.5 party with a surprise consolation prize: a tenfold increase in Codex rate limits on their personal ChatGPT accounts, effective immediately and lasting through June 5.

“We had over 8,000 people express interest in just 24 hours, and while we wish our office was big enough to welcome everyone, we weren’t able to make space for every person who applied,” the company wrote in the email, which VentureBeat obtained. “As a small token of appreciation, we’ve 10x’ed your Codex rate limits until June 5th on your personal ChatGPT account.”

The gift is not limited to the lucky few who scored invitations to the party itself. Everyone who raised their hand — whether they were accepted, waitlisted, or turned away — received the rate limit boost, according to the email and confirmed by multiple recipients on social media.

CEO Sam Altman telegraphed the move on X shortly before inboxes started lighting up. “We are gonna do something nice for everyone who applied for the GPT-5.5 party and that we didn’t have space for,” he wrote. “Hope you enjoy!” The post amassed more than 521,000 views within hours.

What a month of supercharged Codex access actually means for developers

The practical implications are huge. Codex, OpenAI’s AI-powered coding agent, operates under daily usage caps that vary by subscription tier. A tenfold increase to those caps gives developers dramatically more room to prototype, debug, and ship code using GPT-5.5 — which OpenAI says matches GPT-5.4’s per-token latency while performing at a higher level of intelligence and using significantly fewer tokens to complete tasks.

The 31-day window is generous enough to reshape habits. By flooding thousands of developers with expanded access during a critical adoption period, OpenAI is effectively subsidizing the kind of deep, sustained usage that turns a curious trial into a daily dependency. It is a bet that once developers experience Codex at full throttle, they won’t want to go back — and that when the limits reset on June 5, a meaningful number will upgrade their subscriptions to preserve the workflow they’ve built.

The developer community responded with a mix of glee and regret. “I’m literally not taking my Codex hat off for the month,” one developer declared on X. Others kicked themselves for not signing up. “That’s the last time I don’t sign up just because I’m not in SF,” one wrote.

Several users raised a question OpenAI has yet to answer publicly: does the boost stack with the existing Pro $200 tier’s 20x multiplier? One user reported that OpenAI support said no — users get whichever limit is higher, not a combined total. “The key question isn’t whether the 10x boost is only for party applicants,” they wrote. “It’s whether it stacks with Pro.”

OpenAI did not immediately respond to a request for comment on whether the boost stacks with Pro-tier limits.

Inside the low-key meetup that an AI planned for itself

The rate limit gift is a sidecar to the main event: “GPT-5.5 on 5/5,” an invite-only gathering running tonight from 5:55 p.m. to 8:55 p.m. PDT at an undisclosed San Francisco venue. OpenAI billed the evening as “a low-key meetup with Sam and the team behind GPT-5.5,” promising food, drinks, community, giveaways, and swag — not a product announcement. Even the address remained secret until invitations were confirmed — a touch of exclusivity that generated its own buzz.

In a detail that doubles as a product demo, Altman revealed that GPT-5.5 itself planned the party. The model proposed the May 5 date, suggested that human developers give the toasts rather than the AI, and recommended setting up a suggestion box for the next-generation model. Altman described this as “weird emergent behavior.” Registrations closed shortly after opening due to overwhelming demand, with Codex handling the selection process.

Altman also extended an unlikely invitation. He publicly asked Elon Musk to attend, saying, “He can come if he wants… the world needs more love.” The gesture arrives amid Musk’s ongoing lawsuit against OpenAI seeking up to $150 billion in damages — a fact that makes the invitation read less like diplomacy and more like performance art.

Anthropic’s competing reception turns a scheduling overlap into a Silicon Valley spectacle

VentureBeat has confirmed that Anthropic is hosting its own invite-only event in San Francisco on Tuesday evening: a “Media VIP Welcome Reception” scheduled at nearly the same time as OpenAI’s party. The reception serves as a warm-up for Anthropic’s Code with Claude developer conference, the company’s second annual gathering focused on its API, CLI tools, and Model Context Protocol (MCP). The conference proper takes place tomorrow.

The scheduling overlap is difficult to dismiss as coincidence. Both companies are hosting developer-focused events on the same evening, in the same city, targeting many of the same people. Whether this was deliberate counter-programming or genuine coincidence, the optics neatly capture where things stand in the industry’s most consequential rivalry.

Anthropic’s conference will feature its executive and product teams discussing Claude Code, agent implementation strategies, and the product roadmap — all squarely aimed at the same developer audience that just received a month of free Codex upgrades from OpenAI.

How Anthropic overtook OpenAI in revenue — and what it means for the coding wars

The dueling cocktail hours are a social manifestation of a far more consequential battle playing out in revenue, developer adoption, and investor confidence — one that has tilted sharply in Anthropic’s favor.

According to Counterpoint Research data, Anthropic surpassed OpenAI for the first time in global LLM revenue market share in Q1 2026, capturing 31.4% compared to OpenAI’s 29%. But the headline near-tie obscures a dramatic structural divergence. Counterpoint estimates Anthropic achieved that share with roughly 134 million monthly active users, compared to approximately 900 million for OpenAI — yielding average monthly revenue per active user of $16.20 for Anthropic versus $2.20 for OpenAI. OpenAI commands massive scale; Anthropic extracts roughly seven times more revenue per user. That gap is the central tension in this rivalry.
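The per-user gap can be checked with a few lines of arithmetic. The sketch below uses only the Counterpoint figures quoted above; the variable and function names are illustrative, and the implied revenue is simply users multiplied by revenue per user, not a number Counterpoint published.

```python
# Figures as quoted in the text (Counterpoint Research estimates, Q1 2026).
anthropic = {"share_pct": 31.4, "mau_millions": 134, "arpu_usd": 16.20}
openai = {"share_pct": 29.0, "mau_millions": 900, "arpu_usd": 2.20}


def implied_monthly_revenue_millions(company):
    """Monthly active users (millions) times monthly revenue per user (USD)."""
    return company["mau_millions"] * company["arpu_usd"]


# How much more revenue Anthropic earns per active user.
arpu_ratio = anthropic["arpu_usd"] / openai["arpu_usd"]

# Two independent ways to compare total revenue: users x ARPU, and market share.
revenue_ratio = (
    implied_monthly_revenue_millions(anthropic)
    / implied_monthly_revenue_millions(openai)
)
share_ratio = anthropic["share_pct"] / openai["share_pct"]

print(f"Revenue per user ratio: {arpu_ratio:.2f}x")
print(f"Implied revenue ratio:  {revenue_ratio:.2f}x")
print(f"Market share ratio:     {share_ratio:.2f}x")
```

The per-user ratio comes out a little above seven, and the two revenue comparisons land within a few hundredths of each other, so the quoted user counts, per-user revenue and market shares are mutually consistent.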

The enterprise shift has been building for over a year. Menlo Ventures, whose portfolio includes Anthropic, estimates the company now captures 40% of enterprise LLM spend, up from 24% the prior year and 12% in 2023, while OpenAI’s share fell to 27% from 50% over the same period. Anthropic has maintained an almost unparalleled 18 months atop the LLM leaderboards for coding, starting with Claude 3.5 Sonnet in June 2024. That dominance in code, AI’s first true killer app, has become the on-ramp to broader enterprise adoption and the engine behind Anthropic’s revenue acceleration.

The top-line numbers tell the rest of the story. Anthropic said earlier this month that its annualized revenue has topped $30 billion, up from $9 billion at the end of 2025, with more than 1,000 business customers now spending over $1 million annually — a figure the company says has more than doubled since February.

Sources familiar with Anthropic’s financials told TechCrunch the run rate is currently closer to $40 billion, driven largely by demand for Claude Code and Cowork. OpenAI, meanwhile, topped $25 billion in annualized revenue as of February, according to Reuters — but the Wall Street Journal reported that the company has recently missed its own projections for user growth and revenue, with CFO Sarah Friar warning colleagues that if growth doesn’t accelerate, the company could face difficulty funding future compute agreements.

The momentum has carried into fundraising at a pace that could redraw the industry’s power map. Anthropic raised $30 billion at a valuation of $380 billion in February. Bloomberg reported last week that the company has begun weighing a fresh funding round that would value it at more than $900 billion, potentially leapfrogging OpenAI as the world’s most valuable AI startup. OpenAI was valued at $852 billion in late March after closing a record-breaking $122 billion funding round. If Anthropic proceeds at the terms described, the company would not only more than double its valuation but would also surpass OpenAI — a reversal that seemed unthinkable six months ago.

Two parties, two visions, and one city at the center of the AI industry’s defining rivalry

For the 8,000-plus developers who applied for the GPT-5.5 party, the immediate value is straightforward: a full month of dramatically expanded Codex usage, free of charge, during a period when both companies are shipping at a breakneck pace. For the industry, the signal is just as hard to miss. The two most valuable private companies in the world are competing for developer loyalty with a combination of free perks, invite-only parties, celebrity CEO engagement, and multi-billion-dollar enterprise ventures, all within the same 24-hour window, in the same seven-by-seven-mile city.

The broader stakes extend well beyond cocktail napkins and rate limits. Both companies are barreling toward potential IPOs. Both are courting the same Wall Street backers for enterprise joint ventures. Both are racing to define how the next generation of software gets built — and by whom. The developers caught between them are, for the moment, the beneficiaries of a spending war that shows no sign of cooling.

Tonight in San Francisco, the Anthropic reception starts at 5 p.m. The OpenAI party starts at 5:55 p.m. VentureBeat will be at both. And somewhere between the two venues, 8,000 developers who couldn’t get into either room will be burning through their new rate limits, building the future with whichever model they opened first.


Michael Nunez is an editor at VentureBeat covering artificial intelligence. He is attending both the Anthropic Code with Claude Media VIP Welcome Reception and the OpenAI GPT-5.5 launch party tonight in San Francisco.

This story is developing and will be updated.

The RAG era is ending for agentic AI — a new compilation-stage knowledge layer is what comes next

The vector database category is undergoing a shift in response to the needs of agentic AI. 

The retrieval-augmented generation (RAG)-to-vector database pipeline doesn’t cut it anymore; agentic AI requires a different approach that incorporates context. VentureBeat’s Q1 2026 Pulse survey underscores this trend: Every standalone vector database is losing adoption share, while hybrid retrieval intent has tripled to 33.3%, the fastest-growing strategic position in the dataset.

Vector database pioneer Pinecone recognizes this and is pivoting to meet the specific needs of agentic AI.

The company today announced Nexus, which it positions as a knowledge engine rather than an improvement on retrieval. Nexus introduces a context compiler that converts raw enterprise data into persistent, task-specific knowledge artifacts before agents query them, and a composable retriever that serves those artifacts with field-level citations and deterministic conflict resolution.

Alongside Nexus, Pinecone is releasing KnowQL, a declarative query language that gives agents a vocabulary to specify output shape, confidence requirements, and latency budgets. In Pinecone’s own internal benchmark, one financial analysis task that previously consumed 2.8 million tokens was completed by Nexus with just 4,000. On those figures, that is a reduction of more than 99%, although the company has not yet validated it in customer production deployments. Nexus is in early access starting today.

“RAG was built for human users,” Pinecone CEO Ash Ashutosh told VentureBeat. “Nexus was built for agentic users, because their language is very different. The responses they expect are very different. The task that an agent is assigned to do is very different from what a chatbot is supposed to do.”

Why RAG was never built for what agents actually do

RAG encompasses one query, one response, and a person in the loop to interpret the result. But agents work differently. They are assigned tasks, not questions — and completing these requires assembling context from multiple sources, resolving conflicts, tracking what has already been retrieved, and deciding what to query next.

The distinction matters. A RAG pipeline retrieves documents and hands them to a model at inference time. Each agent session starts cold, with no compiled understanding of the enterprise data estate — which tables relate to which, which sources are authoritative for which questions, and which formats an agent downstream will actually be able to consume. Every session re-discovers that from scratch.

“At the heart of all this stuff was a very simple problem,” Ashutosh said. “You’re asking agents — machines — to work on systems and data that was designed for humans.”

Pinecone estimates that 85% of agent compute effort goes to the re-discovery cycle rather than task completion. The downstream effects compound: unpredictable latency, runaway token costs, and non-deterministic results. Run the same task twice against the same data, and an agent may return different answers with no record of which sources drove either result. For enterprises where auditability is a compliance requirement, that is a structural disqualifier, not a tuning problem.

What Nexus is and how it works

Nexus moves reasoning work from inference time to compilation time. In a conventional RAG pipeline, the reasoning required to interpret, contextualize, and structure knowledge happens at the moment an agent queries — every session, every time, burning tokens on work that could have been done in advance. But Nexus reasons just once during a compilation stage that runs before any agent query, then stores the result as a reusable knowledge artifact. The agent receives structured, task-ready context rather than raw documents to interpret on the fly.
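The difference between reasoning at inference time and reasoning at compilation time can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the class names, the `compile` and `query` methods, and the tag-matching stand-in for the expensive reasoning step are hypothetical and do not describe how Nexus is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeArtifact:
    """Hypothetical task-ready context produced ahead of any agent query."""
    task: str
    context: str


class CompileOnceStore:
    """Toy store that does its reasoning in a compile step, not per query."""

    def __init__(self, documents):
        self.documents = documents
        self.artifacts = {}
        self.reasoning_calls = 0  # counts how often the expensive step runs

    def _reason_over_sources(self, task):
        # Stand-in for the expensive interpretation work (for example an LLM
        # call); here it only gathers the documents tagged for the task.
        self.reasoning_calls += 1
        relevant = [d["text"] for d in self.documents if task in d["tasks"]]
        return KnowledgeArtifact(task=task, context=" ".join(relevant))

    def compile(self, tasks):
        """Compilation stage: runs once, before any agent session starts."""
        for task in tasks:
            self.artifacts[task] = self._reason_over_sources(task)

    def query(self, task):
        """Query time: a plain lookup of the stored artifact, no reasoning."""
        return self.artifacts[task]


documents = [
    {"text": "Contract C-1 bills quarterly.", "tasks": {"finance"}},
    {"text": "Deal D-9 is in negotiation.", "tasks": {"sales"}},
    {"text": "Contract C-1 renews in January.", "tasks": {"finance", "sales"}},
]

store = CompileOnceStore(documents)
store.compile(["finance", "sales"])

# Three separate agent sessions reuse the same compiled artifact.
finance_sessions = [store.query("finance") for _ in range(3)]
print(finance_sessions[0].context)
print("reasoning calls:", store.reasoning_calls)
```

The reasoning counter stays at two (one per compiled task) no matter how many sessions query the store, which is the cost shift the compilation approach is meant to deliver.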

The architecture Pinecone is shipping has three distinct components, each addressing a different layer of the agent retrieval problem.

  1. Context compiler. Nexus takes raw source data and a task specification and builds specialized knowledge artifacts — structured, task-optimized representations that agents consume directly without interpretation overhead. The same underlying data estate produces different artifacts for different agents: a sales agent gets deal context synthesized from CRM and call records, a finance agent gets revenue context linking contracts to billing schedules. Artifacts are persistent and reused across agent sessions, not regenerated at inference time.

  2. Composable retriever. Compiled artifacts are served at query time with typed fields, per-field citations with confidence levels, and deterministic conflict resolution. Output is shaped to match the agent’s specified format rather than returned as raw text for the agent to re-parse.

  3. KnowQL. Pinecone describes this as the first declarative query language designed for agents rather than humans. Six primitives (intent, filter, provenance, output shape, confidence, and budget) allow agents to specify structured responses, source grounding, and latency envelopes in a single interface. Ashutosh compared the structural gap that KnowQL fills to what SQL did for relational databases: Before a standard interface existed, every application built its own data access layer from scratch.
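Pinecone has not published KnowQL syntax in this announcement, so the following is not KnowQL. It is a hypothetical Python data structure that simply gives each of the six named primitives a field, to show what a single declarative request from an agent might carry. The class name, field types, units and validation rules are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class AgentKnowledgeRequest:
    """Hypothetical request object with one field per named primitive."""
    intent: str                                        # what the agent is trying to do
    filter: dict = field(default_factory=dict)         # which data is in scope
    provenance: bool = True                            # whether citations are required
    output_shape: dict = field(default_factory=dict)   # field name -> expected type
    confidence: float = 0.0                            # minimum acceptable confidence, 0 to 1
    budget_ms: int = 1000                              # latency budget in milliseconds

    def validate(self):
        """Reject requests that could not be satisfied as stated."""
        if not self.intent.strip():
            raise ValueError("intent must not be empty")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if self.budget_ms <= 0:
            raise ValueError("budget_ms must be positive")
        return self


request = AgentKnowledgeRequest(
    intent="summarize revenue recognized per contract",
    filter={"region": "EMEA", "fiscal_year": 2025},
    provenance=True,
    output_shape={"contract_id": "str", "recognized_revenue": "float"},
    confidence=0.9,
    budget_ms=500,
).validate()

print(request.intent, "| min confidence:", request.confidence)
```

Declaring the shape, confidence floor and budget up front is what lets a retriever decide how to answer, instead of returning raw text for the agent to re-parse.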

The relationship between Nexus and Pinecone’s underlying vector database is additive. The context compiler produces knowledge artifacts that are indexed and stored in the vector database; the compilation layer shapes and serves knowledge; the vector layer handles storage, retrieval speed, and scale.

“The vectors are still stored and managed by the Pinecone vector database,” Ashutosh said.

What analysts make of the architectural claim

Moving reasoning upstream from inference to a compilation stage is not a novel concept — ontologies, data catalogs, and semantic layers have pursued versions of it for years. What has changed is the ability to do this at scale without dedicated engineering teams for every domain. That is the specific argument Nexus is making, and it is where analysts see the genuine advance.

Stephanie Walter, practice leader for AI stack at HyperFRAME Research, told VentureBeat that Nexus is directionally important because it shifts knowledge work from runtime chaos to pre-compiled structure. She stressed, however, that it is an evolution of RAG architecture, not a complete reinvention. 

“The real innovation isn’t the idea itself, but the productization of knowledge compilation as a first-class infrastructure layer,” Walter said. “If Pinecone can operationalize that reliably, it becomes meaningful infrastructure, not just another RAG tuning trick.”

The technical mechanism behind that claim is what Gartner distinguished VP analyst Arun Chandrasekaran called the meaningful architectural distinction.

“Unlike traditional RAG, which relies on pure semantic search at runtime, architectural compilation embeds structural logic into the metadata layer, which can boost time to response and provide better reasoning,” Chandrasekaran told VentureBeat. “This is an important leap from simple retrieval to enhanced reasoning, allowing agents to navigate enterprise schemas and acquire better memory for contextualization.”

The competitive landscape

Multiple vendors acknowledge that a vector database and traditional RAG are not enough for agentic AI.

Microsoft has extended its FabricIQ technology to provide semantic context for agentic AI. Google recently announced its Agentic Data Cloud as an approach to help solve the same issues. There are also standalone contextual memory technologies, like Hindsight, that provide yet another option for users.

But analysts are less focused on the feature comparison than on what buyers should actually be evaluating.

“The agentic AI stack is fragmenting into dozens of features, but enterprise buyers shouldn’t chase features,” Walter said. “They should chase control: cost control, governance control, and security control.”

Most enterprise failures in agentic AI, she argued, will not be technical. They will be operational — tied to cost overruns, governance gaps, and security discipline.

The capability bar goes beyond retrieval speed.

“The true differentiator is deterministic grounding,” Chandrasekaran said, pointing to techniques like knowledge graphs that ensure agents understand structural relationships within enterprise data rather than returning surface-level matches. Interoperability is a related consideration: Standards like Model Context Protocol (MCP) matter for connecting agents to legacy data sources without creating new dependencies.

What this means for enterprises

RAG and standalone vector databases were built for a different era. Agentic workloads are exposing the limits of both.

The retrieval cost problem is architectural

Teams running complex agentic workloads on conventional RAG pipelines are burning tokens at inference time on work that could be done in advance — interpreting, contextualizing, and structuring knowledge, every session, from scratch. That is a design problem. Tuning the retrieval layer will not fix it. The question for data engineering teams is whether their current stack is structurally capable of pre-compiling knowledge for specific agent tasks, or whether it was built for a human user who never needed that capability.

Governance is what separates a pilot from a production deployment

The capabilities that determine whether agentic AI gets approved for enterprise use are not performance metrics.

“The real enterprise value proposition isn’t just faster retrieval, but governed knowledge pipelines,” Walter said. “Those are the capabilities that turn agentic AI from an experiment into something finance and risk teams will actually approve.” 

The budget has shifted

VentureBeat’s Q1 Pulse data shows that retrieval optimization investment rose to 28.9% in March, overtaking evaluation spending for the first time in the quarter. Enterprises have finished measuring their retrieval problems. They are now spending to fix them. 

“The future of agentic AI won’t be decided by who has the longest context window,” Walter said. “It will be decided by who can operationalize trusted knowledge at scale without blowing up cost or governance.”

The retrieval rebuild: Why hybrid retrieval intent tripled as enterprise RAG programs hit the scale wall

Something shifted in enterprise RAG in Q1 2026. VB Pulse data spanning January through March tells a consistent story: the market stopped adding retrieval layers and started fixing the ones it already has. Call it the retrieval rebuild.

The survey covered three consecutive monthly waves from organizations with 100 or more employees, with between 45 and 58 qualified respondents per month across platform adoption, buyer intent, architecture outlook and evaluation criteria. The data should be treated as directional.

Enterprise intent to adopt hybrid retrieval tripled from 10.3% to 33.3% in a single quarter — even as 22% of qualified enterprise respondents reported having no production RAG systems at all. For data engineers and enterprise architects building agentic AI infrastructure, the data reveals a market in active transition: the RAG architecture most enterprises built to scale is not the one they expect to run by year-end. 

Hybrid retrieval has become the consensus enterprise strategy. Unlike single-method RAG pipelines that rely on vector similarity alone, hybrid retrieval combines dense embeddings with sparse keyword search and reranking layers, trading simplicity for the retrieval accuracy and access control that production agentic workloads require.
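One common way to combine a dense (embedding) ranking with a sparse (keyword) ranking is reciprocal rank fusion. The sketch below is an illustration of that general technique, not a description of what survey respondents run. The two input rankings are hard-coded stand-ins for real retrievers, and the constant `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by more than one retriever rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Stand-ins for the output of a vector search and a keyword search.
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

In a production pipeline the fused list would typically be passed to a reranking model and filtered by access-control rules before anything reaches the agent.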

The standalone vector database category is under pressure. Weaviate, Milvus, Pinecone and Qdrant each lost adoption share across the quarter in the VB Pulse data. Custom stacks and provider-native retrieval are absorbing their displaced share.

A growing minority of enterprises are stepping back from RAG altogether — a signal that the market’s maturity narrative has meaningful exceptions.

Organizations that went wide on RAG in 2025 are hitting the same failure point: the architecture built for document retrieval does not hold at agentic scale.

Enterprises that scaled RAG fast are now paying to rebuild it

The two largest intent movements in Q1 are directly connected — enterprises confronting retrieval quality problems at scale, and hybrid retrieval emerging as the consensus answer.

Investment priorities shifted in parallel. Evaluation and relevance testing led budget intent in January at 32.8% and fell to 15.6% by March. Retrieval optimization moved in the opposite direction, from 19.0% to 28.9% — overtaking evaluation as the top growth investment area for the first time. 

Steven Dickens, vice president and practice lead at HyperFRAME Research, described the operational burden enterprise data teams are facing in a VentureBeat interview in March on Oracle’s agentic AI data stack. “Data teams are exhausted by fragmentation fatigue,” Dickens said. “Managing a separate vector store, graph database and relational system just to power one agent is a DevOps nightmare.”

That fatigue shows directly in the platform data. The custom stack rise to 35.6% is not a rejection of managed retrieval — many organizations run both. It is a consolidation response from engineering teams that have hit the limits of assembling too many components.

Not every enterprise has made it that far. The VB Pulse data includes a signal that complicates the market’s overall growth narrative: 22.2% of qualified respondents reported no production RAG by March, up from 8.6% in January.  The report attributes this cohort to organizations that have “not yet committed to any retrieval infrastructure, or have paused programs” — concentrated in Healthcare, Education and Government, the same sectors showing the highest rates of flat budgets.

Standalone vector databases are losing the adoption argument but winning the reliability one

Recent VentureBeat reporting on two enterprises building on Qdrant illustrates why a purpose-built retrieval layer still matters in production.

&AI builds patent litigation infrastructure and runs semantic search across hundreds of millions of documents. Grounding every result in a real source document is not optional: patent attorneys will not act on AI-generated text. That requirement makes the architectural choice clear.

“The agent is the interface,” Herbie Turner, &AI’s founder and CTO, told VentureBeat in March. “The vector database is the ground truth.”

GlassDollar, a startup that helps Siemens and Mahle evaluate startups, runs an agentic retrieval pattern across a corpus approaching 10 million indexed documents. A single user prompt fans out into multiple parallel queries, each retrieving candidates from a different angle before results are combined and re-ranked. That query volume and precision requirement is what drove the choice of purpose-built vector infrastructure.
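The fan-out pattern described here can be sketched in a few lines. This is a generic illustration rather than GlassDollar’s implementation: the `fan_out_search` helper, the way query variants are built, the toy corpus and the keyword-overlap scoring are all assumptions chosen to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out_search(prompt, angles, search_fn, top_n=5):
    """Run one query per angle in parallel, then merge and re-rank results."""
    if not angles:
        return []
    queries = [f"{prompt} {angle}" for angle in angles]
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        result_lists = list(pool.map(search_fn, queries))

    # Merge: keep each document's best score across all query variants.
    best = {}
    for results in result_lists:
        for doc_id, score in results:
            best[doc_id] = max(score, best.get(doc_id, float("-inf")))

    # Re-rank by score, breaking ties by ID for a stable order.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))[:top_n]


# Toy corpus: company name -> descriptive keywords (illustrative only).
CORPUS = {
    "acme": {"battery", "recycling"},
    "volt": {"battery", "software"},
    "loop": {"recycling", "logistics"},
}


def toy_search(query):
    """Score each company by how many query words match its keywords."""
    words = set(query.lower().split())
    return [
        (name, float(len(words & keywords)))
        for name, keywords in CORPUS.items()
        if words & keywords
    ]


merged = fan_out_search("battery", ["recycling", "software"], toy_search)
print(merged)
```

Taking the maximum score per document is only one merge policy; summing scores or applying a learned re-ranker are equally plausible choices.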

“We measure success by recall,” Kamen Kanev, GlassDollar’s head of product, told VentureBeat in March. “If the best companies aren’t in the results, nothing else matters. The user loses trust.”
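Recall has a simple definition: of all the items that should have been returned, what fraction actually appear in the results. A minimal sketch of recall at a cutoff of k follows; the startup IDs and the relevance labels are made up for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top k retrieved results."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("recall is undefined when there are no relevant items")
    found = set(retrieved[:k]) & relevant_set
    return len(found) / len(relevant_set)


# Ranked results returned by a hypothetical search, best first.
retrieved = ["s1", "s7", "s3", "s9", "s2"]
# Startups a human reviewer judged to be the right answers.
relevant = {"s1", "s2", "s4"}

print(f"recall@3 = {recall_at_k(retrieved, relevant, 3):.2f}")
print(f"recall@5 = {recall_at_k(retrieved, relevant, 5):.2f}")
```

Because `s4` never appears in the results, recall cannot reach 1.0 at any cutoff, which is exactly the failure Kanev describes: the best candidates are missing, regardless of how well the rest are ordered.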

The VB Pulse data shows that framing — retrieval as ground truth rather than feature — is gaining traction across the broader enterprise market, even as standalone vector database adoption declines. 

Why enterprises say they need a dedicated vector layer shifted significantly across Q1. In January the top reasons were access control complexity (20.7%) and retrieval precision (19.0%). By March, operational reliability at scale had surged to 31.1% — more than doubling and overtaking everything else. Enterprises are no longer keeping vector infrastructure primarily for precision. They are keeping it because it is the part of the stack they can rely on when query volumes scale.

How enterprises are redefining what good retrieval means

How enterprises judge their retrieval systems shifted notably across Q1 — and the direction of that shift points to a market getting more sophisticated about what good retrieval actually means.

In January, response correctness dominated evaluation criteria at 67.2% — far above anything else. By March, response correctness (53.3%), retrieval accuracy (53.3%) and answer relevance (53.3%) had converged exactly. Getting the right answer is no longer enough if it came from the wrong document or missed the context of the question.

Answer relevance was the only criterion that rose across the quarter, gaining five percentage points. It is also the hardest to measure — whether the retrieved context is actually the right context for that specific question requires purpose-built evaluation infrastructure, not just pass-or-fail correctness checks. Its rise signals that a meaningful share of enterprise buyers have moved past basic RAG testing entirely. 

The market’s verdict: RAG isn’t dead. The original architecture is

The “RAG is dead” narrative had real momentum heading into 2026. It rested on two claims. The first: that long-context windows — models capable of processing hundreds of thousands of tokens in a single prompt — would make dedicated retrieval unnecessary. The second: that agentic memory systems, which store what an agent learns across sessions rather than retrieving it fresh each time, would absorb the knowledge access problem entirely.

The VB Pulse data is the enterprise market’s answer to the first claim. The long-context-as-dominant-architecture position collapsed from 15.5% in January to 3.5% in February before partially recovering to 6.7% in March. January’s sample was heavily weighted toward Technology and Software respondents — the segment most exposed to long-context model announcements in late 2025. As the sample diversified, the position evaporated.

On the memory question, Jonathan Frankle, chief AI scientist at Databricks, framed the architecture clearly in a March interview with VentureBeat: a vector database with millions of entries sits at the base of the agentic memory stack, too large to fit in context. The LLM context window sits at the top. Between them, new caching and compression layers are emerging — but none of them replace the retrieval layer at the base. New agentic memory systems like Hindsight, developed by Vectorize, and observational memory approaches like those in the Mastra framework address session continuity and agent context over time — a different problem than high-recall search across millions of changing enterprise documents.

The most consequential signal: the share of respondents not expecting large-scale RAG deployments by year-end grew from 3.4% to 15.6%, a more than fourfold increase. That is not a verdict against retrieval. It is a verdict against the retrieval architecture most enterprises built first.

The retrieval rebuild is not optional

The retrieval rebuild is the cost of scaling RAG without first deciding what architecture could actually support it.

If your organization is among the 43.1% that entered Q1 planning to expand RAG into more workflows, the VB Pulse data suggests that plan has already changed for many of your peers — and may need to change for you. Hybrid retrieval is the consensus destination. Custom stack growth to 35.6% reflects teams building retrieval infrastructure around requirements that off-the-shelf products do not fully address.

RAG is not dead. The architecture most enterprises used to implement it is. The data suggests the rebuild is not a future decision. For 33% of enterprises, the rebuild is already the stated priority.

Definity embeds agents inside Spark pipelines to catch failures before they reach agentic AI systems

For most data engineering teams, managing pipeline reliability means waiting for an alert, manually tracing failures across distributed jobs and clusters, and fixing problems after they’ve already hit the business. Agentic AI needs the data to be there, clean and on time. A pipeline that fails silently or delivers stale data doesn’t just break a dashboard; it breaks the AI system depending on it.

That gap is what Definity, a Chicago-based data pipeline operations startup, is targeting by embedding agents directly inside the Spark or dbt driver to act during a pipeline run, not after it. One enterprise customer identified 33% of its optimization opportunities in the first week of deployment and cut troubleshooting and optimization effort by 70%, according to Definity. The company also claims customers are resolving complex Spark issues up to 10x faster.

“You need three big things for agentic data operations: full stack context that is real time and production aware. Control of the pipeline. And the ability to validate in a feedback loop. Without that, you can be outside looking in and read only,” Roy Daniel, CEO and co-founder of Definity, told VentureBeat in an exclusive interview.

The company on Wednesday announced that it has raised $12 million in Series A financing led by GreatPoint Ventures, with participation from Dynatrace and existing investors StageOne Ventures and Hyde Park Venture Partners.

Why existing pipeline monitoring breaks down at scale

Existing tools approach the problem from outside the execution layer — Datadog, which acquired data quality monitor Metaplane last year, Databricks system tables, and platforms like Unravel Data and Acceldata all read metrics after a job completes. Dynatrace has monitoring capabilities; it also participated in Definity’s Series A.

Definity differs in where its solution sits architecturally. Because those tools read metrics only after a job completes, Daniel said, by the time a platform monitoring tool surfaces a problem, the pipeline has already run, and the failure, the wasted compute or the bad data is already downstream.

“It’s always after the fact,” Daniel said. “By the time you know something happened, it already happened.”

How Definity’s in-execution agents work

The core architectural difference is where the agent sits — inside the pipeline rather than watching from outside it.

Inline instrumentation. The Definity system installs a JVM agent directly inside the pipeline execution layer via a single line of code, running below the platform layer and pulling execution data directly from Spark.

Execution context during the run. The agent captures query execution behavior, memory pressure, data skew, shuffle patterns and infrastructure utilization as the pipeline runs. It also infers lineage between pipelines and tables dynamically — no predefined data catalog is required.

Intervention, not just observation. The agent can modify resource allocation mid-run, stop a job before bad data propagates or preempt a pipeline based on upstream data conditions. Daniel described one production deployment where the agent detected that an upstream job had been preempted and the input table it was supposed to write was stale — and stopped the downstream pipeline before it started, before bad data reached any dependent system.

What is and isn’t real time. Detection and prevention are real time. Root cause analysis and optimization recommendations run on demand when an engineer queries the assistant, with full execution context already assembled.

Overhead and data residency. The agent adds approximately one second of compute on an hour-long run. Only metadata transmits externally; full on-premises deployment is available for environments where no metadata can leave the perimeter.
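The stale-input scenario Daniel described can be illustrated with a minimal freshness gate. This is a sketch only, not Definity's implementation: the `TableState` type, the `should_run_downstream` function, the table name and the 24-hour threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TableState:
    """Hypothetical metadata about an upstream table."""
    name: str
    last_written: datetime


def should_run_downstream(upstream: TableState, now: datetime,
                          max_age: timedelta) -> bool:
    """Return True only if the upstream table was written recently enough."""
    return now - upstream.last_written <= max_age


# Illustrative values: one table written 2 hours ago, one 30 hours ago.
now = datetime(2025, 1, 2, 6, 0, tzinfo=timezone.utc)
fresh = TableState("events_daily", now - timedelta(hours=2))
stale = TableState("events_daily", now - timedelta(hours=30))

print(should_run_downstream(fresh, now, timedelta(hours=24)))  # True
print(should_run_downstream(stale, now, timedelta(hours=24)))  # False
```

A check like this only helps if it runs before the downstream job starts, which is the point of placing the agent inside the execution layer rather than in a post-run monitor.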

What in-execution intelligence looks like in a production environment

One early user of the Definity platform is Nexxen, an ad tech platform that runs large-scale Spark pipelines on-premises for mission-critical advertising workloads.

Dennis Meyer, Director of Data Engineering at Nexxen, told VentureBeat that the core problem he was facing was not pipeline failures but the accumulating cost of inefficiency in an environment with no elastic cloud capacity to absorb waste.

“The main challenge wasn’t about pipelines breaking, but about managing an increasingly complex and large-scale environment,” Meyer said. “Because we operate on-prem, we don’t have the flexibility of instant elasticity, so inefficiencies have a direct cost impact.”

Existing monitoring tools gave Nexxen partial visibility but not enough to act on systematically. “We had existing monitoring tools in place, but needed full-stack visibility to understand workload behavior holistically and to systematically prioritize optimizations,” Meyer said.

Nexxen deployed Definity with no pipeline code changes. According to Meyer, the team identified 33% of its optimization opportunities within the first week, and engineering effort on troubleshooting and optimization dropped by 70%. The platform freed infrastructure capacity, allowing the team to support workload growth without additional hardware investment.

“The key shift was moving from reactive troubleshooting to proactive, continuous optimization,” Meyer said. “At scale, the biggest gap often isn’t tooling — it’s actionable visibility.”

What this means for enterprise data teams

For data engineering teams running production Spark environments, the shift from reactive monitoring to in-execution intelligence has architectural and organizational implications worth thinking through.

Pipeline ops is becoming an AI infrastructure problem. Data pipelines that previously supported analytics now carry AI workloads with direct business dependencies. Failures that were once an inconvenience are now blocking production AI delivery.

Troubleshooting time is a recoverable cost. According to Meyer, Nexxen cut engineering effort on troubleshooting and optimization by 70% after deploying Definity. For teams running lean, that time going back to the roadmap is the most direct near-term case for evaluating this category.

RAG precision tuning can quietly cut retrieval accuracy by 40%, putting agentic pipelines at risk

Enterprise teams that fine-tune their RAG embedding models for better precision may be unintentionally degrading the retrieval quality those pipelines depend on, according to new research from Redis.

The paper, “Training for Compositional Sensitivity Reduces Dense Retrieval Generalization,” tested what happens when teams train embedding models for compositional sensitivity: the ability to catch sentences that look nearly identical but mean something different, such as “the dog bit the man” versus “the man bit the dog,” or a negation flip that reverses a statement’s meaning entirely. That training consistently broke dense retrieval generalization, meaning how well a model retrieves correctly across broad topics and domains it wasn’t specifically trained on. Performance dropped by 8 to 9 percent on smaller models and by 40 percent on a current mid-size embedding model that teams are actively using in production.

The findings have direct implications for enterprise teams building agentic AI pipelines, where retrieval quality determines what context flows into an agent’s reasoning chain. A retrieval error in a single-stage pipeline returns a wrong answer. The same error in an agentic pipeline can trigger a cascade of wrong actions downstream.

Srijith Rajamohan, AI Research Leader at Redis and one of the paper’s authors, said the finding challenges a widespread assumption about how embedding-based retrieval actually works. 

“There’s this general notion that when you use semantic search or similar semantic similarity, we get correct intent. That’s not necessarily true,” Rajamohan told VentureBeat. “A close or high semantic similarity does not actually mean an exact intent.”

The geometry behind the retrieval tradeoff

Embedding models work by compressing an entire sentence into a single point in a high-dimensional space, then finding the closest points to a query at retrieval time. That works well for broad topical matching — documents about similar subjects end up near each other. The problem is that two sentences with nearly identical words but opposite meanings also end up near each other, because the model is working from word content rather than structure.

That is what the research quantified. When teams fine-tune an embedding model to push structurally different sentences apart — teaching it that a negation flip which reverses a statement’s meaning is not the same as the original — the model uses representational space it was previously using for broad topical recall. The two objectives compete for the same vector.
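A toy example makes the geometry concrete. The sketch below uses a bag-of-words vector as a deliberately crude stand-in for an embedding model. Real embedding models are not pure bag-of-words, so this exaggerates the effect, but it shows why two sentences with the same words and opposite meanings can land on the same point. The function names are illustrative.

```python
import math
from collections import Counter


def bag_of_words_vector(sentence: str) -> Counter:
    """Toy 'embedding': word counts only, so word order is discarded."""
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


original = bag_of_words_vector("the dog bit the man")
reversed_roles = bag_of_words_vector("the man bit the dog")
unrelated = bag_of_words_vector("the cat sat on the mat")

# Same words, opposite meaning: the two vectors are identical.
print(round(cosine(original, reversed_roles), 3))  # 1.0
# A sentence on a different topic scores clearly lower.
print(cosine(original, unrelated) < 0.9)  # True
```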

The research also found the regression is not uniform across failure types. Negation and spatial flip errors improved measurably with structured training. Binding errors — where a model confuses which modifier applies to which word, such as which party a contract obligation falls on — barely moved. For enterprise teams, that means the precision problem is harder to fix in exactly the cases where getting it wrong has the most consequences.

The reason most teams don’t catch it is that fine-tuning metrics measure the task being trained for, not what happens to general retrieval across unrelated topics. A model can show strong improvement on near-miss rejection during training while quietly regressing on the broader retrieval job it was hired to do. The regression only surfaces in production.
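One way to surface that regression before production is to track a general retrieval metric alongside the metric being trained for, and to flag any that drop. The sketch below is a minimal illustration. The metric names and scores are made up for the example and are not figures from the paper.

```python
def regressed_metrics(before: dict, after: dict, tolerance: float = 0.01) -> list:
    """Return the names of metrics that dropped by more than `tolerance`."""
    return sorted(
        name for name, score in before.items()
        if score - after[name] > tolerance
    )


# Hypothetical evaluation scores before and after fine-tuning.
before = {"near_miss_rejection": 0.60, "general_recall_at_10": 0.80}
after = {"near_miss_rejection": 0.85, "general_recall_at_10": 0.55}

print(regressed_metrics(before, after))  # ['general_recall_at_10']
```

The fine-tuning target improves while the broader retrieval metric falls, which is the pattern a training dashboard limited to the target task would not show.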

Rajamohan said the instinct most teams reach for — moving to a larger embedding model — does not address the underlying architecture.

“You can’t scale your way out of this,” he said. “It’s not a problem you can solve with more dimensions and more parameters.”

Why the standard alternatives all fall short

The natural instinct when retrieval precision fails is to layer on additional approaches. The research tested several of them and found each fails in a different way.

Hybrid search. Combining embedding-based retrieval with keyword search is already standard practice for closing precision gaps. But Rajamohan said keyword search cannot catch the failure mode this research identifies, because the problem is not missing words — it is misread structure.

“If you have a sentence like ‘Rome is closer than Paris’ and another that says ‘Paris is closer than Rome,’ and you do an embedding retrieval followed by a text search, you’re not going to be able to tell the difference,” he said. “The same words exist in both sentences.”

MaxSim reranking. Some teams add a second scoring layer that compares individual query words against individual document words rather than relying on the single compressed vector. This approach, known as MaxSim or late interaction and used in systems like ColBERT, did improve relevance benchmark scores in the research. But it completely failed to reject structural near-misses, assigning them near-identity similarity scores. 

The problem is that relevance and identity are different objectives. MaxSim is optimized for the former and blind to the latter. A team that adds MaxSim and sees benchmark improvement may be solving a different problem than the one they have.

Cross-encoders. These work by feeding the query and candidate document into the model simultaneously, letting it compare every word against every word before making a decision. That full comparison is what makes them accurate — and what makes them too expensive to run at production scale. Rajamohan said his team investigated them. They work in the lab and break under real query volumes.

Contextual memory. Also sometimes referred to as agentic memory, these systems are increasingly cited as the path beyond RAG, but Rajamohan said moving to that type of architecture does not eliminate the structural retrieval problem. Those systems still depend on retrieval at query time, which means the same failure modes apply. The main difference is looser latency requirements, not a precision fix.
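Rajamohan's Rome and Paris example can be reproduced with toy versions of the first two approaches. In the sketch below, keyword overlap is a Jaccard score over word sets, and MaxSim takes, for each query token, the best match among document tokens. Two simplifications are assumptions of this example: token similarity is an exact-match stand-in for the contextual token embeddings a system like ColBERT would use, and the MaxSim sum is divided by query length so the score falls between 0 and 1.

```python
def tokens(text: str) -> list:
    return text.lower().split()


def jaccard(a: str, b: str) -> float:
    """Keyword overlap: shared words divided by all distinct words."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb)


def token_similarity(q_tok: str, d_tok: str) -> float:
    """Toy stand-in for a token-embedding dot product."""
    return 1.0 if q_tok == d_tok else 0.0


def maxsim(query: str, doc: str) -> float:
    """Late interaction: best document match per query token, averaged."""
    q, d = tokens(query), tokens(doc)
    return sum(max(token_similarity(qt, dt) for dt in d) for qt in q) / len(q)


query = "Rome is closer than Paris"
near_miss = "Paris is closer than Rome"
other = "Berlin is farther than Madrid"

print(jaccard(query, near_miss), maxsim(query, near_miss))  # 1.0 1.0
print(jaccard(query, other), maxsim(query, other))          # 0.25 0.4
```

Both scores treat the reversed sentence as a perfect match, because every query word finds an identical word somewhere in the document and neither score looks at order.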

The two-stage fix the research validated

The common thread across every failed approach is the same: a single scoring mechanism trying to handle both recall and precision at once. The research validated a different architecture: stop trying to do both jobs with one vector, and assign each job to a dedicated stage.

Stage one: recall. The first stage works exactly as standard dense retrieval does today — the embedding model compresses documents into vectors and retrieves the closest matches to a query. Nothing changes here. The goal is to cast a wide net and bring back a set of strong candidates quickly. Speed and breadth are what matter at this stage, not perfect precision.

Stage two: precision. The second stage is where the fix lives. Rather than scoring candidates with a single similarity number, a small learned Transformer model examines the query and each candidate at the token level — comparing individual words against individual words to detect structural mismatches like negation flips or role reversals. This is the verification step the single-vector approach cannot perform.

The results. Under end-to-end training, the Transformer verifier outperformed every other approach the research tested on structural near-miss rejection. It was the only approach that reliably caught the failure modes the single-vector system missed.

The tradeoff. Adding a verification stage costs latency. The latency cost depends on how much verification a team runs. For precision-sensitive workloads like legal or accounting applications, full verification at every query is warranted. For general-purpose search, lighter verification may be sufficient. 
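The shape of the two-stage design can be sketched in a few lines. This is an illustration under stated assumptions, not the architecture from the paper: stage one uses a bag-of-words cosine in place of a dense embedding model, and stage two uses an ordered-bigram check in place of the learned Transformer verifier. The function names, `k` and `threshold` are hypothetical.

```python
import math
from collections import Counter


def tokens(text: str) -> list:
    return text.lower().split()


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def recall_stage(query: str, corpus: list, k: int) -> list:
    """Stage one: order-insensitive similarity, keep the top k candidates."""
    q = Counter(tokens(query))
    ranked = sorted(corpus, key=lambda doc: cosine(q, Counter(tokens(doc))),
                    reverse=True)
    return ranked[:k]


def verify(query: str, candidate: str) -> float:
    """Stage two stand-in: share of ordered query bigrams found in the candidate."""
    q, c = tokens(query), tokens(candidate)
    q_bigrams = list(zip(q, q[1:]))
    if not q_bigrams:  # single-token query: fall back to exact match
        return 1.0 if q == c else 0.0
    c_bigrams = set(zip(c, c[1:]))
    return sum(bg in c_bigrams for bg in q_bigrams) / len(q_bigrams)


def two_stage_search(query: str, corpus: list, k: int = 2,
                     threshold: float = 0.9) -> list:
    """Cast a wide net first, then keep only candidates that pass verification."""
    return [doc for doc in recall_stage(query, corpus, k)
            if verify(query, doc) >= threshold]


corpus = [
    "Paris is closer than Rome",
    "Rome is closer than Paris",
    "The invoice is due on Friday",
]
query = "Rome is closer than Paris"

print(recall_stage(query, corpus, 2))  # both Rome/Paris sentences survive recall
print(two_stage_search(query, corpus))  # ['Rome is closer than Paris']
```

Raising or lowering `threshold`, or running `verify` only for some queries, is one way to picture the latency tradeoff described above: the more candidates pass through the second stage, the more work each query costs.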

The research grew out of a real production problem. Enterprise customers running semantic caching systems were getting fast but semantically incorrect responses back — the retrieval system was treating similar-sounding queries as identical even when their meaning differed. The two-stage architecture is Redis’s proposed fix, with incorporation into its LangCache product on the roadmap but not yet available to customers.

What this means for enterprise teams

The research does not require enterprise teams to rebuild their retrieval pipelines from scratch. But it does ask them to pressure-test assumptions most teams have never examined — about what their embedding models are actually doing, which metrics are worth trusting and where the real precision gaps live in production.

Recognize the tradeoff before tuning around it. Rajamohan said the first practical step is understanding the regression exists. He evaluates any LLM-based retrieval system on three criteria: correctness, completeness and usefulness. Correctness failures cascade directly into the other two, which means a retrieval system that scores well on relevance benchmarks but fails on structural near-misses is producing a false sense of production readiness.

RAG is not obsolete — but know what it can’t do. Rajamohan pushed back firmly on claims that RAG has been superseded. “That’s a massive oversimplification,” he said. “RAG is a very simple pipeline that can be productionized by almost anyone with very little lift.” The research does not argue against RAG as an architecture. It argues against assuming a single-stage RAG pipeline with a fine-tuned embedding model is production-ready for precision-sensitive workloads.

The fix is real but not free. For teams that do need higher precision, Rajamohan said the two-stage architecture is not a prohibitive implementation lift, but adding a verification stage costs latency. “It’s a mitigation problem,” he said. “Not something we can actually solve.”

OpenAI launches Privacy Filter, an open source, on-device data sanitization model that removes personal information from enterprise datasets

In a significant shift toward local-first privacy infrastructure, OpenAI has released Privacy Filter, a specialized open-source model designed to detect and redact personally identifiable information (PII) before it ever reaches a cloud-based server.

Launched today on AI code sharing community Hugging Face under a permissive Apache 2.0 license, the tool addresses a growing industry bottleneck: the risk of sensitive data “leaking” into training sets or being exposed during high-throughput inference.

By providing a 1.5-billion-parameter model that can run on a standard laptop or directly in a web browser, the company is effectively handing developers a “privacy-by-design” toolkit that functions as a sophisticated, context-aware digital shredder.

Though OpenAI was founded with a focus on open source models such as this, the company shifted during the ChatGPT era to providing more proprietary (“closed source”) models available only through its website, apps, and API — only to return to open source in a big way last year with the launch of the gpt-oss family of language models.

In that light, and combined with OpenAI’s recent open sourcing of agentic orchestration tools and frameworks, it’s safe to say that the generative AI giant is clearly still heavily invested in fostering this less immediately lucrative part of the AI ecosystem.

Technology: a gpt-oss variant with bidirectional token classifier that reads from both directions

Architecturally, Privacy Filter is a derivative of OpenAI’s gpt-oss family, a series of open-weight reasoning models released last year.

However, while standard large language models (LLMs) are typically autoregressive—predicting the next token in a sequence—Privacy Filter is a bidirectional token classifier.

This distinction is critical for accuracy. By looking at a sentence from both directions simultaneously, the model gains a deeper understanding of context that a forward-only model might miss.

For instance, it can better distinguish whether “Alice” refers to a private individual or a public literary character based on the words that follow the name, not just those that precede it.

The model utilizes a Sparse Mixture-of-Experts (MoE) framework. Although it contains 1.5 billion total parameters, only 50 million parameters are active during any single forward pass.
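As a quick worked calculation from the two figures above, the share of parameters active on any single forward pass comes out to roughly one in thirty:

```python
total_params = 1_500_000_000   # 1.5 billion total parameters
active_params = 50_000_000     # 50 million active per forward pass

active_fraction = active_params / total_params
print(f"{active_fraction:.1%}")  # 3.3%
```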

This sparse activation allows for high throughput without the massive computational overhead typically associated with LLMs. Furthermore, it features a massive 128,000-token context window, enabling it to process entire legal documents or long email threads in a single pass without the need for fragmenting text—a process that often causes traditional PII filters to lose track of entities across page breaks.

To ensure the redacted output remains coherent, OpenAI implemented a constrained Viterbi decoder. Rather than making an independent decision for every single word, the decoder evaluates the entire sequence to enforce logical transitions.

It uses a “BIOES” (Begin, Inside, Outside, End, Single) labeling scheme, which ensures that if the model identifies “John” as the start of a name, it is statistically inclined to label “Smith” as the continuation or end of that same name, rather than a separate entity.
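The idea behind constrained decoding over BIOES tags can be sketched as follows. This is a simplified illustration, not OpenAI's decoder: it assumes a single entity type, hard allow/deny transitions rather than learned transition scores, and made-up per-token log-scores. The names `ALLOWED`, `START_OK`, `END_OK` and `constrained_viterbi` are hypothetical.

```python
import math

TAGS = ["O", "B", "I", "E", "S"]

# Which tag may follow which. B and I must continue into I or E.
ALLOWED = {
    "O": {"O", "B", "S"},
    "B": {"I", "E"},
    "I": {"I", "E"},
    "E": {"O", "B", "S"},
    "S": {"O", "B", "S"},
}
START_OK = {"O", "B", "S"}  # a sequence cannot open mid-entity
END_OK = {"O", "E", "S"}    # a sequence cannot end mid-entity


def constrained_viterbi(emissions: list) -> list:
    """Find the best-scoring tag path that obeys the BIOES transition rules.

    `emissions` holds one dict per token, mapping tag -> log-score.
    """
    if not emissions:
        return []
    best = [{t: (emissions[0][t] if t in START_OK else -math.inf) for t in TAGS}]
    back = []
    for scores in emissions[1:]:
        prev, cur, ptr = best[-1], {}, {}
        for t in TAGS:
            # Best predecessor among tags that are allowed to precede t.
            score, p = max((prev[p], p) for p in TAGS if t in ALLOWED[p])
            cur[t] = score + scores[t]
            ptr[t] = p
        best.append(cur)
        back.append(ptr)
    last = max((best[-1][t], t) for t in TAGS if t in END_OK)[1]
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Illustrative log-scores for the tokens "John", "Smith", "called".
emissions = [
    {"O": -3.0, "B": -0.2, "I": -5.0, "E": -5.0, "S": -2.0},
    {"O": -3.0, "B": -4.0, "I": -4.0, "E": -0.7, "S": -0.5},
    {"O": -0.1, "B": -4.0, "I": -4.0, "E": -4.0, "S": -4.0},
]

greedy = [max(e, key=e.get) for e in emissions]
print(greedy)                          # ['B', 'S', 'O']
print(constrained_viterbi(emissions))  # ['B', 'E', 'O']
```

Picking the top tag per token independently yields B followed by S, which opens a name and never closes it. Decoding over the whole sequence instead labels "Smith" as the end of the name that "John" began.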

On-device data sanitization

Privacy Filter is designed for high-throughput workflows where data residency is a non-negotiable requirement. It currently supports the detection of eight primary PII categories, grouped here into four families:

  • Private Names: Individual persons.

  • Contact Info: Physical addresses, email addresses, and phone numbers.

  • Digital Identifiers: URLs, account numbers, and dates.

  • Secrets: A specialized category for credentials, API keys, and passwords.

In practice, this allows enterprises to deploy the model on-premises or within their own private clouds. By masking data locally before sending it to a more powerful reasoning model (like GPT-5 or gpt-oss-120b), companies can maintain compliance with strict GDPR or HIPAA standards while still leveraging the latest AI capabilities.
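The mask-locally-then-send pattern can be sketched as a small redaction step. The example assumes a detector has already returned non-overlapping character spans. The span format, the labels and the `redact` function are hypothetical and do not reflect Privacy Filter's actual output format.

```python
def redact(text: str, spans: list) -> str:
    """Replace each (start, end, label) span with a [LABEL] placeholder.

    Assumes spans do not overlap.
    """
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "Email Jane Doe at jane@example.com about the renewal."

# Stand-in for detector output: spans located by plain string search.
name, email = "Jane Doe", "jane@example.com"
spans = [
    (text.index(name), text.index(name) + len(name), "NAME"),
    (text.index(email), text.index(email) + len(email), "EMAIL"),
]

print(redact(text, spans))
# Email [NAME] at [EMAIL] about the renewal.
```

Only the redacted string would then leave the local environment for the larger model.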

Initial benchmarks are promising: the model reportedly hits a 96% F1 score on the PII-Masking-300k benchmark out of the box.

For developers, the model is available via Hugging Face, with native support for transformers.js, allowing it to run entirely within a user’s browser using WebGPU.

Fully open source, commercially viable Apache 2.0 license

Perhaps the most significant aspect of the announcement for the developer community is the Apache 2.0 license. Unlike “available-weight” licenses that often restrict commercial use or require “copyleft” sharing of derivative works, Apache 2.0 is one of the most permissive licenses in the software world. For startups and dev-tool makers, this means:

  1. Commercial Freedom: Companies can integrate Privacy Filter into their proprietary products and sell them without paying royalties to OpenAI.

  2. Customization: Teams can fine-tune the model on their specific datasets (such as medical jargon or proprietary log formats) to improve accuracy for niche industries.

  3. No Viral Obligations: Unlike the GPL license, builders do not have to open-source their entire codebase if they use Privacy Filter as a component.

By choosing this licensing path, OpenAI is positioning Privacy Filter as a standard utility for the AI era—essentially the “SSL for text”.

Community reactions

The tech community reacted quickly to the release, with many noting the impressive technical constraints OpenAI managed to hit.

Elie Bakouch (@eliebakouch), a research engineer at agentic model training platform startup Prime Intellect, praised the efficiency of Privacy Filter’s architecture on X:

“Very nice release by @OpenAI! A 50M active, 1.5B total gpt-oss arch MoE, to filter private information from trillion scale data cheaply. keeping 128k context with such a small model is quite impressive too”.

The sentiment reflects a broader industry trend toward “small but mighty” models. While the world has focused on massive, trillion-parameter giants, the practical reality of enterprise AI often requires small, fast models that can perform one task, like privacy filtering, exceptionally well and at a low cost.

However, OpenAI included a “High-Risk Deployment Caution” in its documentation. The company warned that the tool should be viewed as a “redaction aid” rather than a “safety guarantee,” noting that over-reliance on a single model could lead to “missed spans” in highly sensitive medical or legal workflows.

OpenAI’s Privacy Filter is clearly an effort by the company to make the AI pipeline fundamentally safer.

By combining the efficiency of a Mixture-of-Experts architecture with the openness of an Apache 2.0 license, OpenAI is providing a way for many enterprises to more easily, cheaply and safely redact PII data.

The modern data stack was built for humans asking questions. Google just rebuilt its own for agents taking action.

Enterprise data stacks were built for humans running scheduled queries. As AI agents increasingly act autonomously on behalf of businesses around the clock, that architecture is breaking down — and vendors are racing to rebuild it. Google’s answer, announced at Cloud Next on Wednesday, is the Agentic Data Cloud.

The architecture has three pillars:

  • Knowledge Catalog. Automates semantic metadata curation, inferring business logic from query logs without manual data steward intervention

  • Cross-cloud lakehouse. Lets BigQuery query Iceberg tables on AWS S3 via private network with no egress fees

  • Data Agent Kit. Drops MCP tools into VS Code, Claude Code and Gemini CLI so data engineers describe outcomes rather than write pipelines

“The data architecture has to change now,” Andi Gutmans, VP and GM of Data Cloud at Google Cloud, told VentureBeat. “We’re moving from human scale to agent scale.”

From system of intelligence to system of action

The core premise behind Agentic Data Cloud is that enterprises are moving from human‑scale to agent‑scale operations.

Historically, data platforms have been optimized for reporting, dashboarding, and some forecasting — what Google characterizes as “reactive intelligence.” In that model, humans interpret data and decide what to do.

Now, with AI agents increasingly expected to take actions directly on behalf of the business, Gutmans argued that data platforms must evolve into systems of action.

“We need to make sure that all of enterprise data can be activated with AI, that includes both structured and unstructured data,” Gutmans said. “We need to make sure that there’s the right level of trust, which also means it’s not just about getting access to the data, but really understanding the data.”

The Knowledge Catalog is Google’s answer to that problem. It is an evolution of Dataplex, Google’s existing data governance product, with a materially different architecture underneath. Where traditional data catalogs required data stewards to manually label tables, define business terms and build glossaries, the Knowledge Catalog automates that process using agents.

The practical implication for data engineering teams is that the Knowledge Catalog scales to the full data estate, not just the curated subset that a small team of data stewards can maintain by hand. The catalog covers BigQuery, Spanner, AlloyDB and Cloud SQL natively, and federates with third-party catalogs including Collibra, Atlan and Datahub. Zero-copy federation extends semantic context from SaaS applications including SAP, Salesforce Data360, ServiceNow and Workday without requiring data movement.

Google’s lakehouse goes cross cloud

Google has had a data lakehouse called BigLake since 2022. Initially it was limited to Google data, but in recent years it has gained some limited federation capabilities that let enterprises query data stored in other locations.

Gutmans explained that the previous federation worked through query APIs, which limited the features and optimizations BigQuery could bring to bear on external data. The new approach is storage-based sharing via the open Apache Iceberg format. As a result, he argued, it makes no difference whether the data is in Amazon S3 or in Google Cloud.

“This truly means we can bring all the goodness and all the AI capabilities to those third-party data sets,” he said.

The practical result is that BigQuery can query Iceberg tables sitting on Amazon S3 via Google’s Cross-Cloud Interconnect, a dedicated private networking layer, with no egress fees and price-performance Google says is comparable to native AWS warehouses. All BigQuery AI functions run against that cross-cloud data without modification. Bidirectional federation in preview extends to Databricks Unity Catalog on S3, Snowflake Polaris and the AWS Glue Data Catalog using the open Iceberg REST Catalog standard.

From writing pipelines to describing outcomes

The Knowledge Catalog and cross-cloud lakehouse solve the data access and context problems. The third pillar addresses what happens when a data engineer actually sits down to build something with all of it.

The Data Agent Kit ships as a portable set of skills, MCP tools and IDE extensions that drop into VS Code, Claude Code, Gemini CLI and Codex. It does not introduce a new interface.

The architectural shift it enables is a move from what Gutmans called a “prescriptive copilot experience” to intent-driven engineering. Rather than writing a Spark pipeline to move data from source A to destination B, a data engineer describes the outcome — a cleaned dataset ready for model training, a transformation that enforces a governance rule — and the agent selects whether to use BigQuery, the Lightning Engine for Apache Spark or Spanner to execute it, then generates production-ready code.

“Customers are kind of sick of building their own pipelines,” Gutmans said. “They’re truly more in the review kind of mode, than they are in the writing the code mode.”

Where Google and its rivals diverge

The premise that agents require semantic context, not just data access, is shared across the market. 

Databricks has Unity Catalog, which provides governance and a semantic layer across its lakehouse. Snowflake has Cortex, its AI and semantic layer offering. Microsoft Fabric includes a semantic model layer built for business intelligence and, increasingly, agent grounding.

The dispute is not over whether semantics matter — everyone agrees they do. The dispute is over who builds and maintains them.

“Our goal is just to get all the semantics you can get,” Gutmans explained, noting that Google will federate with third-party semantic models rather than require customers to start over.

Google is also positioning openness as a differentiator, with bidirectional federation into Databricks Unity Catalog and Snowflake Polaris via the open Iceberg REST Catalog standard.

What this means for enterprises

Google’s argument — and one echoed across the data infrastructure market — is that enterprises are behind on three fronts:

Semantic context is becoming infrastructure. If your data catalog is still manually curated, it will not scale to agent workloads — and Gutmans argues that gap will only widen as agent query volumes increase.

Cross-cloud egress costs are a hidden tax on agentic AI. Storage-based federation via open Iceberg standards is emerging as the architectural answer across Google, Databricks and Snowflake. Enterprises locked into proprietary federation approaches should be stress-testing those costs at agent-scale query volumes.

Gutmans argues the pipeline-writing era is ending. Data engineers who move toward outcome-based orchestration now will have a significant head start.