TTT-Discover optimizes GPU kernels 2x faster than human experts — by training during inference

Researchers from Stanford, Nvidia, and Together AI have developed a new technique that can discover new solutions to very complex problems. For example, they managed to optimize a critical GPU kernel to run 2x faster than the previous state-of-the-art written by human experts.

Their technique, called “Test-Time Training to Discover” (TTT-Discover), challenges the current paradigm of letting models “think longer” for reasoning problems. TTT-Discover allows the model to continue training during the inference process and update its weights for the problem at hand.

The limits of ‘frozen’ reasoning

Current enterprise AI strategies often rely on “frozen” models. Whether you use a closed or open reasoning model, the model’s parameters are static. When you prompt these models, they search for answers within the fixed manifold of their training data. This works well for problems that resemble what the model has seen before.

However, true discovery problems, like inventing a novel algorithm or proving a new mathematical theorem, are, by definition, out-of-distribution. If the solution requires a leap of logic that doesn’t exist in the training set, a frozen model will likely fail, no matter how much compute you throw at it during inference.

In comments to VentureBeat, Mert Yuksekgonul, a co-author of the paper and doctoral student at Stanford, illustrated this distinction using a famous mathematical breakthrough:

“I believe that thinking models wouldn’t be able to prove, for example, P != NP, without test-time training, just like Andrew Wiles wouldn’t be able to prove Fermat’s Last Theorem without the 7 years he spent pursuing this single problem in isolation and continuously learning from his own failures.”

TTT-Discover treats the test problem not as a query to be answered, but as an environment to be mastered. As the model attempts to solve the problem, it generates different types of data: failures, partial successes, and errors. Instead of discarding this data, TTT-Discover uses it to update the model’s weights in real-time, effectively allowing the model to laser focus on that specific challenge as opposed to developing a very general problem-solving framework.

A different approach to reinforcement learning

TTT-Discover represents a fundamental shift in how reasoning models are trained. In standard reinforcement learning (RL) training, the goal is a generalist policy that performs well on average across many tasks. In TTT-Discover, the goal is to find the best solution to a very specific problem, and the policy is “a means towards this end,” according to the authors. Once the model discovers the artifact (i.e., the optimized code, the proof, or the molecule), the neural network that produced it can be discarded.

To achieve this, the researchers engineered two specific components that differentiate TTT-Discover from standard reinforcement learning:

  1. Entropic objective: Standard RL optimizes for the average expected reward. If a model tries a risky path and fails, standard RL punishes it. TTT-Discover flips this. It uses an “entropic objective” that exponentially weighs high-reward outcomes. This forces the model to ignore “safe,” average answers and aggressively hunt for “eureka” outliers, solutions that have a low probability of being found but offer a massive reward.

  2. PUCT search: The system introduces PUCT, a tree-search algorithm inspired by AlphaZero. It explores different solution paths, building a dataset of attempts. The model then trains on this dataset in real-time, learning to recognize which partial steps lead to high-reward outcomes.
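In the PUCT rule popularized by AlphaZero, each candidate branch is scored by its observed value plus an exploration bonus that favors under-visited, high-prior branches. A minimal sketch of the selection step (the constant `c_puct` and all numbers are illustrative, not taken from the paper):

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """AlphaZero-style PUCT: exploit the observed value q, plus an
    exploration bonus that favors under-visited, high-prior branches."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

def select_child(children, parent_visits):
    """Pick the branch with the highest PUCT score."""
    return max(children, key=lambda ch: puct_score(
        ch["q"], ch["prior"], parent_visits, ch["visits"]))

children = [
    {"name": "safe",  "q": 0.60, "prior": 0.7, "visits": 50},
    {"name": "risky", "q": 0.55, "prior": 0.3, "visits": 2},
]
# The barely explored "risky" branch wins despite its lower observed value,
# because its exploration bonus dominates.
best = select_child(children, parent_visits=52)
```

The bonus term shrinks as a branch accumulates visits, so the search naturally rotates attention toward paths it hasn't yet tried.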

Crucially, this method works best on problems with a continuous reward signal. The system needs a way to measure incremental progress such as “runtime in microseconds” or “error rate” rather than a binary “pass/fail” signal. This allows the model to follow the gradual improvement toward the optimal solution.
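The entropic objective described above can be pictured as exponential tilting of rollout rewards: weighting each rollout by exp(r/τ) concentrates almost all of the learning signal on rare high-reward outliers rather than the average case. A toy sketch (the temperature τ and exact form are illustrative, not the paper's implementation):

```python
import math

def mean_reward(rewards):
    """Standard RL signal: the average over rollouts."""
    return sum(rewards) / len(rewards)

def entropic_weights(rewards, tau=0.1):
    """Exponentially tilt toward high-reward rollouts: weight ∝ exp(r / tau).
    A small tau concentrates nearly all weight on the best outlier."""
    m = max(rewards)  # subtract the max for numerical stability
    w = [math.exp((r - m) / tau) for r in rewards]
    total = sum(w)
    return [x / total for x in w]

# Nine "safe" rollouts and one rare eureka outlier.
rewards = [0.5] * 9 + [0.9]
weights = entropic_weights(rewards, tau=0.1)
```

Under the plain average, the outlier contributes only a tenth of the signal; under the tilted weighting, it dominates, which is what pushes the model to hunt for low-probability, high-reward solutions.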

The economics of ‘heavy inference’

For enterprises accustomed to paying fractions of a cent per API call, the cost profile of TTT-Discover requires a mindset shift. In their experiments, the researchers reported that a single discovery run involves approximately 50 training steps and thousands of rollouts, costing roughly $500 per problem.

TTT-Discover is best suited for “static, high-value assets,” as opposed to trivial, recurring problems that can be solved with existing models and approaches.

Consider a cloud-native enterprise running a data pipeline that processes petabytes of information nightly. If that pipeline relies on a specific SQL query or GPU kernel, optimizing that code by just 1% could save hundreds of thousands of dollars in annual compute costs. In this context, spending $500 to find a kernel that is 50% faster is a trivial expense with an immediate ROI.

“This makes the most sense for low-frequency, high-impact decisions where a single improvement is worth far more than the compute cost,” Yuksekgonul said. “Supply chain routing, drug design, and material discovery qualify. In these settings, spending hundreds of dollars on a single discovery step can easily pay for itself.”

Implementation considerations

One of the most significant findings for enterprise adoption is that TTT-Discover does not require a proprietary frontier model. The researchers achieved state-of-the-art results using gpt-oss-120b, OpenAI’s open-weights model, and have released the code for TTT-Discover so that researchers and developers can use it with their own models.

Because the technique works with open models, companies can run this “discovery loop” entirely within their own secure VPCs or on-premise H100 clusters without sending their proprietary data to third-party servers.

“If a company already runs reinforcement learning, there is no additional infrastructure required,” Yuksekgonul said. “TTT-Discover uses the same training stack (GPUs, rollout workers, optimizers, checkpointing).” 

If they don’t already run RL, they would need to build that infrastructure. But enterprises can also use existing solutions to reduce the complexity of the process. The researchers orchestrated these training runs using the Tinker API from Thinking Machines, which manages the complexity of distributed training and inference.

“Tooling such as Tinker (and open variants, e.g., OpenTinker) lowers the setup cost, and both labor and compute costs are likely to drop over time,” he said.

Real-world use cases

The researchers deployed TTT-Discover across four distinct technical domains: systems engineering, algorithm design, biology, and mathematics. In almost every instance, the method set a new state-of-the-art.

In one experiment, the model optimized GPU kernels for matrix multiplication (including the “TriMul” kernel used in AlphaFold), achieving execution speeds up to 2x faster than prior state-of-the-art and outperforming the best human-written kernels on the leaderboard.

In competitive programming scenarios (AtCoder), it solved complex heuristic problems (e.g., optimizing geometric constraints for fishing nets) better than top human experts and prior AI baselines.

For the enterprise, the transition from these academic benchmarks to business value hinges on one specific constraint: the existence of a verifiable, scalar signal. Unlike a chatbot that generates text, TTT-Discover needs a hard metric (e.g., runtime, error rate, or profit margin) to optimize against.

Yuksekgonul said that this requirement draws a clear line between where this technology should and shouldn’t be used. “At the moment, the key requirement is a reliable scalar signal of progress — cost, error, molecular properties — that the system can optimize against,” he said.

This directs enterprise adoption toward “hard” engineering and operations challenges such as logistics, supply chain, and resource management, where problems like fleet routing or crew scheduling often rely on static heuristics. TTT-Discover can treat these as optimization environments, spending hours to find a route structure that shaves 5% off daily fuel costs.
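A toy illustration of treating such a problem as an optimization environment: define a verifiable scalar cost over candidate routes, then search for the cheapest one. The stops, costs, and brute-force search here are entirely invented for illustration and stand in for a much smarter optimizer:

```python
import itertools

# Hypothetical pairwise travel costs between four stops (symmetric).
COST = {
    frozenset({"A", "B"}): 4, frozenset({"A", "C"}): 2,
    frozenset({"A", "D"}): 7, frozenset({"B", "C"}): 3,
    frozenset({"B", "D"}): 5, frozenset({"C", "D"}): 6,
}

def route_cost(route):
    """The verifiable scalar signal: total cost of visiting stops in order."""
    return sum(COST[frozenset(pair)] for pair in zip(route, route[1:]))

# Exhaustive search stands in for the discovery loop on this tiny instance.
best = min(itertools.permutations("ABCD"), key=route_cost)
```

The point is the shape of the problem: as long as `route_cost` is cheap to evaluate and hard to game, any improvement the search finds is automatically verified.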

The requirement for clear verifiers rules out qualitative tasks like “write a better marketing strategy,” where verification is subjective and prone to noise.

“Hard to verify problems are still an open question,” Yuksekgonul said.

With current technology, the best path forward is to try to design verifiers, but “making those verifiers robust and hard to game is challenging, and we don’t have a good solution yet,” he added.

From inference to invention

The broader implication is that enterprise AI stacks may need to evolve to support this kind of per-problem learning.

“Systems built around a frozen model will need to support per-problem (or per-domain) adaptation, and enterprises will need better problem specifications and internal feedback signals to make test-time learning effective,” Yuksekgonul said. “If training runs inside a private VPC, the training loop can also be integrated with more of the company’s internal environment, not just a central lab pipeline.”

For the enterprise, the value lies in identifying “million-dollar problems,” optimization challenges where a verifiable metric exists, but human progress has stalled. These are the candidates for TTT-Discover. By accepting higher latency and cost for specific queries, enterprises can turn their inference compute into an automated R&D lab, discovering solutions that were previously out of reach for both humans and frozen AI models.

Anthropic’s Claude Opus 4.6 brings 1M token context and ‘agent teams’ to take on OpenAI’s Codex

Anthropic on Thursday released Claude Opus 4.6, a major upgrade to its flagship artificial intelligence model that the company says plans more carefully, sustains longer autonomous workflows, and outperforms competitors including OpenAI’s GPT-5.2 on key enterprise benchmarks — a release that arrives at a tumultuous moment for the AI industry and global software markets.

The launch comes just three days after OpenAI released its own Codex desktop application in a direct challenge to Anthropic’s Claude Code momentum, and amid a $285 billion rout in software and services stocks that investors attribute partly to fears that Anthropic’s AI tools could disrupt established enterprise software businesses.

For the first time, Anthropic’s Opus-class models will feature a 1 million token context window, allowing the AI to process and reason across vastly more information than previous versions. The company also introduced “agent teams” in Claude Code — a research preview feature that enables multiple AI agents to work simultaneously on different aspects of a coding project, coordinating autonomously.

“We’re focused on building the most capable, reliable, and safe AI systems,” an Anthropic spokesperson told VentureBeat about the announcements. “Opus 4.6 is even better at planning, helping solve the most complex coding tasks. And the new agent teams feature means users can split work across multiple agents — one on the frontend, one on the API, one on the migration — each owning its piece and coordinating directly with the others.”

Why OpenAI and Anthropic are locked in an all-out war for enterprise developers

The release intensifies an already fierce competition between Anthropic and OpenAI, the two most valuable privately held AI companies in the world. OpenAI on Monday released a new desktop application for its Codex artificial intelligence coding system, a tool the company says transforms software development from a collaborative exercise with a single AI assistant into something more akin to managing a team of autonomous workers.

AI coding assistants have exploded in popularity over the last year, and OpenAI said more than 1 million developers have used Codex in the past month. The new Codex app is part of OpenAI’s ongoing effort to lure users and market share away from rivals like Anthropic and Cursor.

The timing of Anthropic’s release — just 72 hours after OpenAI’s Codex launch — underscores the breakneck pace of competition in AI development tools. OpenAI faces intensifying competition from Anthropic, which posted the largest share increase of any frontier lab since May 2025, according to a recent Andreessen Horowitz survey. Forty-four percent of enterprises now use Anthropic in production, driven by rapid capability gains in software development since late 2024. The desktop launch is a strategic counter to Claude Code’s momentum.

According to Anthropic’s announcement, Opus 4.6 achieves the highest score on Terminal-Bench 2.0, an agentic coding evaluation, and leads all other frontier models on Humanity’s Last Exam, a complex multi-discipline reasoning test. On GDPval-AA — a benchmark measuring performance on economically valuable knowledge work tasks in finance, legal and other domains — Opus 4.6 outperforms OpenAI’s GPT-5.2 by approximately 144 ELO points, which translates to obtaining a higher score approximately 70% of the time.
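The Elo-to-win-rate conversion behind that claim is the standard one: the expected score of the stronger side is 1 / (1 + 10^(−Δ/400)), which for a 144-point gap works out to roughly 70%:

```python
def elo_win_probability(delta):
    """Expected score for the stronger side given an Elo gap `delta`."""
    return 1 / (1 + 10 ** (-delta / 400))

p = elo_win_probability(144)  # ≈ 0.70, matching the article's figure
```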

Inside Claude Code’s $1 billion revenue milestone and growing enterprise footprint

The stakes are substantial. Asked about Claude Code’s financial performance, the Anthropic spokesperson noted that in November, the company announced that Claude Code reached $1 billion in run rate revenue only six months after becoming generally available in May 2025.

The spokesperson highlighted major enterprise deployments: “Claude Code is used by Uber across teams like software engineering, data science, finance, and trust and safety; wall-to-wall deployment across Salesforce’s global engineering org; tens of thousands of devs at Accenture; and companies across industries like Spotify, Rakuten, Snowflake, Novo Nordisk, and Ramp.”

That enterprise traction has translated into skyrocketing valuations. Earlier this month, Anthropic signed a term sheet for a $10 billion funding round at a $350 billion valuation. Bloomberg reported that Anthropic is simultaneously working on a tender offer that would allow employees to sell shares at that valuation, offering liquidity to staffers who have watched the company’s worth multiply since its 2021 founding.

How Opus 4.6 solves the ‘context rot’ problem that has plagued AI models

One of Opus 4.6’s most significant technical improvements addresses what the AI industry calls “context rot” — the degradation of model performance as conversations grow longer. Anthropic says Opus 4.6 scores 76% on MRCR v2, a needle-in-a-haystack benchmark testing a model’s ability to retrieve information hidden in vast amounts of text, compared to just 18.5% for Sonnet 4.5.

“This is a qualitative shift in how much context a model can actually use while maintaining peak performance,” the company said in its announcement.

The model also supports outputs of up to 128,000 tokens — enough to complete substantial coding tasks or documents without breaking them into multiple requests.

For developers, Anthropic is introducing several new API features alongside the model: adaptive thinking, which allows Claude to decide when deeper reasoning would be helpful rather than requiring a binary on-off choice; four effort levels (low, medium, high, max) to control intelligence, speed and cost tradeoffs; and context compaction, a beta feature that automatically summarizes older context to enable longer-running tasks.

Anthropic’s delicate balancing act: Building powerful AI agents without losing control

Anthropic, which has built its brand around AI safety research, emphasized that Opus 4.6 maintains alignment with its predecessors despite its enhanced capabilities. On the company’s automated behavior audit measuring misaligned behaviors such as deception, sycophancy, and cooperation with misuse, Opus 4.6 “showed a low rate” of problematic responses while also achieving “the lowest rate of over-refusals — where the model fails to answer benign queries — of any recent Claude model.”

When asked how Anthropic thinks about safety guardrails as Claude becomes more agentic, particularly with multiple agents coordinating autonomously, the spokesperson pointed to the company’s published framework: “Agents have tremendous potential for positive impacts in work but it’s important that agents continue to be safe, reliable, and trustworthy. We outlined our framework for developing safe and trustworthy agents last year which shares core principles developers should consider when building agents.”

The company said it has developed six new cybersecurity probes to detect potentially harmful uses of the model’s enhanced capabilities, and is using Opus 4.6 to help find and patch vulnerabilities in open-source software as part of defensive cybersecurity efforts.

Sam Altman vs. Dario Amodei: The Super Bowl ad battle that exposed AI’s deepest divisions

The rivalry between Anthropic and OpenAI has spilled into consumer marketing in dramatic fashion. Both companies will feature prominently during Sunday’s Super Bowl. Anthropic is airing commercials that mock OpenAI’s decision to begin testing advertisements in ChatGPT, with the tagline: “Ads are coming to AI. But not to Claude.”

OpenAI CEO Sam Altman responded by calling the ads “funny” but “clearly dishonest,” posting on X that his company would “obviously never run ads in the way Anthropic depicts them” and that “Anthropic wants to control what people do with AI” while serving “an expensive product to rich people.”

The exchange highlights a fundamental strategic divergence: OpenAI has moved to monetize its massive free user base through advertising, while Anthropic has focused almost exclusively on enterprise sales and premium subscriptions.

The $285 billion stock selloff that revealed Wall Street’s AI anxiety

The launch occurs against a backdrop of historic market volatility in software stocks. A new AI automation tool from Anthropic PBC sparked a $285 billion rout in stocks across the software, financial services and asset management sectors on Tuesday as investors raced to dump shares with even the slightest exposure. A Goldman Sachs basket of US software stocks sank 6%, its biggest one-day decline since April’s tariff-fueled selloff.

The selloff was triggered by Anthropic’s launch on Friday of plug-ins for its Claude Cowork agent, which enable automated tasks across legal, sales, marketing and data analysis. The move underscored the AI industry’s growing push into industries that can unlock the lucrative enterprise revenue needed to fund massive investments in the technology.

Thomson Reuters plunged 15.83% Tuesday, its biggest single-day drop on record; and LegalZoom.com sank 19.68%. European legal software providers including RELX, owner of LexisNexis, and Wolters Kluwer experienced their worst single-day performances in decades.

Not everyone agrees the selloff is warranted. Nvidia CEO Jensen Huang said on Tuesday that fears AI would replace software and related tools were “illogical” and “time will prove itself.” Mark Murphy, head of U.S. enterprise software research at JPMorgan, said in a Reuters report it “feels like an illogical leap” to say a new plug-in from an LLM would “replace every layer of mission-critical enterprise software.”

What Claude’s new PowerPoint integration means for Microsoft’s AI strategy

Among the more notable product announcements: Anthropic is releasing Claude in PowerPoint in research preview, allowing users to create presentations using the same AI capabilities that power Claude’s document and spreadsheet work. The integration puts Claude directly inside a core Microsoft product — an unusual arrangement given Microsoft’s 27% stake in OpenAI.

The Anthropic spokesperson framed the move pragmatically in an interview with VentureBeat: “Microsoft has an official add-in marketplace for Office products with multiple add-ins available to help people with slide creation and iteration. Any developer can build a plugin for Excel or PowerPoint. We’re participating in that ecosystem to bring Claude into PowerPoint. This is about participating in the ecosystem and giving users the ability to work with the tools that they want, in the programs they want.”

The data behind enterprise AI adoption: Who’s winning and who’s losing ground

Data from a16z’s recent enterprise AI survey suggests both Anthropic and OpenAI face an increasingly competitive landscape. While OpenAI remains the most widely used AI provider in the enterprise, with approximately 77% of surveyed companies using it in production in January 2026, Anthropic’s adoption is rising rapidly — from near-zero in March 2024 to approximately 40% using it in production by January 2026.

The survey data also shows that 75% of Anthropic’s enterprise customers are using it in production, with 89% either testing or in production — figures that exceed OpenAI’s corresponding rates of 46% in production and 73% testing or in production among its customer base.

Enterprise spending on AI continues to accelerate. Average enterprise LLM spend reached $7 million in 2025, up 180% from $2.5 million in 2024, with projections suggesting $11.6 million in 2026 — a 65% increase year-over-year.

Pricing, availability, and what developers need to know about Claude Opus 4.6

Opus 4.6 is available immediately on claude.ai, the Claude API, and major cloud platforms. Developers can access it via claude-opus-4-6 through the API. Pricing remains unchanged at $5 per million input tokens and $25 per million output tokens, with premium pricing of $10/$37.50 for prompts exceeding 200,000 tokens using the 1 million token context window.
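At those rates, cost scales linearly with tokens, with a step up once a prompt crosses the 200,000-token threshold. A quick calculator (the rates are copied from the article; the example token counts are illustrative):

```python
def opus_46_cost(input_tokens, output_tokens):
    """Estimate a request's cost in dollars at the article's listed rates.
    Premium long-context pricing applies when the prompt exceeds 200K tokens."""
    if input_tokens > 200_000:
        in_rate, out_rate = 10.00, 37.50   # premium $/M tokens
    else:
        in_rate, out_rate = 5.00, 25.00    # standard $/M tokens
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

standard = opus_46_cost(100_000, 8_000)   # a typical coding request
long_ctx = opus_46_cost(800_000, 8_000)   # deep into the 1M-token window
```

The premium tier makes million-token prompts materially more expensive per call, which is worth budgeting for before leaning on the full context window.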

For users who find Opus 4.6 “overthinking” simpler tasks — a characteristic Anthropic acknowledges can add cost and latency — the company recommends adjusting the effort parameter from its default high setting to medium.

The recommendation captures something essential about where the AI industry now stands. These models have grown so capable that their creators must now teach customers how to make them think less. Whether that represents a breakthrough or a warning sign depends entirely on which side of the disruption you’re standing on — and whether you remembered to sell your software stocks before Tuesday.

The ‘brownie recipe problem’: why LLMs must have fine-grained context to deliver real-time results

Today’s LLMs excel at reasoning, but can still struggle with context. This is particularly true in real-time ordering systems like Instacart

Instacart CTO Anirban Kundu calls it the “brownie recipe problem.”

It’s not as simple as telling an LLM ‘I want to make brownies.’ To be truly assistive in planning the meal, the model must go beyond that simple directive to understand what’s available in the user’s market based on their preferences — say, organic eggs versus regular eggs — and factor in what’s deliverable in their geography so food doesn’t spoil, among other critical factors.

For Instacart, the challenge is balancing latency against the right mix of context to deliver experiences in, ideally, less than one second.

“If reasoning itself takes 15 seconds, and if every interaction is that slow, you’re gonna lose the user,” Kundu said at a recent VB event. 

Mixing reasoning, real-world state, personalization

In grocery delivery, there’s a “world of reasoning” and a “world of state” (what’s available in the real world), Kundu noted, both of which must be understood by an LLM along with user preference. But it’s not as simple as loading the entirety of a user’s purchase history and known interests into a reasoning model.

“Your LLM is gonna blow up into a size that will be unmanageable,” said Kundu. 

To get around this, Instacart splits processing into chunks. First, data is fed into a large foundational model that can understand intent and categorize products. That processed data is then routed to small language models (SLMs) designed for catalog context (the types of food or other items that work together) and semantic understanding. 
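The chunked pipeline can be pictured as a simple dispatcher: a (stubbed) foundation model tags the intent, and the request is then routed to the appropriate specialist model. Every name and rule below is illustrative; Instacart's actual system is far more involved:

```python
def classify_intent(query):
    """Stub for the large foundation model: tag the query's intent.
    A real system would call a model here, not match keywords."""
    if "replace" in query or "substitute" in query:
        return "substitution"
    return "semantic_search"

def catalog_slm(query):
    """Stub SLM for catalog context: pairings and substitutions."""
    return f"catalog-context answer for: {query}"

def semantic_slm(query):
    """Stub SLM for semantic understanding of vague requests."""
    return f"semantic answer for: {query}"

ROUTES = {"substitution": catalog_slm, "semantic_search": semantic_slm}

def handle(query):
    """Route the pre-processed query to the specialist small model."""
    return ROUTES[classify_intent(query)](query)
```

The design point is that the expensive general model runs once up front, while the cheap specialists absorb the per-request work, keeping latency and context size manageable.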

In the case of catalog context, the SLM must be able to process multiple levels of details around the order itself as well as the different products. For instance, what products go together and what are their relevant replacements if the first choice isn’t in stock? These substitutions are “very, very important” for a company like Instacart, which Kundu said has “over double digit cases” where a product isn’t available in a local market. 

In terms of semantic understanding, say a shopper is looking to buy healthy snacks for children. The model needs to understand what a healthy snack is and what foods are appropriate for, and appeal to, an 8-year-old, then identify relevant products. And when those particular products aren’t available in a given market, the model also has to find related subsets of products.

Then there’s the logistical element. For example, a product like ice cream melts quickly, and frozen vegetables also don’t fare well when left out in warmer temperatures. The model must have this context and calculate an acceptable deliverability time. 
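A toy version of that deliverability check: each perishable item caps how long the order can sit in transit, and the order's window is the tightest cap. The item thresholds here are invented for illustration:

```python
# Hypothetical max minutes each item tolerates unrefrigerated in transit.
SHELF_MINUTES = {"ice cream": 20, "frozen vegetables": 35, "bread": 240}

def delivery_window(items, default=120):
    """An order can only be out as long as its most fragile item allows."""
    return min(SHELF_MINUTES.get(item, default) for item in items)

window = delivery_window(["ice cream", "bread"])  # capped by the ice cream
```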

“So you have this intent understanding, you have this categorization, then you have this other portion about logistically, how do you do it?” Kundu noted.

Avoiding ‘monolithic’ agent systems

Like many other companies, Instacart is experimenting with AI agents, finding that a mix of agents works better than a “single monolith” that handles many different tasks. Following the Unix philosophy of a modular operating system built from smaller, focused tools helps address, for instance, different payment systems with varying failure modes, Kundu explained.

“Having to build all of that within a single environment was very unwieldy,” he said. Further, agents on the back end talk to many third-party platforms, including point-of-sale (POS) and catalog systems. Naturally, not all of them behave the same way; some are more reliable than others, and they have different update intervals and feeds. 

“So being able to handle all of those things, we’ve gone down this route of microagents rather than agents that are dominantly large in nature,” said Kundu. 

To manage agents, Instacart has integrated with Anthropic’s Model Context Protocol (MCP), which standardizes and simplifies the process of connecting AI models to different tools and data sources.

The company also uses Google’s Universal Commerce Protocol (UCP) open standard, which allows AI agents to directly interact with merchant systems. 

However, Kundu’s team still deals with challenges. As he noted, it’s not about whether integration is possible, but how reliably those integrations behave and how well they’re understood by users. Discovery can be difficult, not just in identifying available services, but understanding which ones are appropriate for which task.

Instacart has had to implement MCP and UCP in “very different” cases, and the biggest problems they’ve run into are failure modes and latency, Kundu noted. “The response times and understandings of both of those services are very, very different. I would say we spend probably two-thirds of the time fixing those error cases.”
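Handling third-party services with very different failure modes typically means wrapping each call in a latency budget with a fallback. A generic sketch of the pattern (not Instacart's code; the POS and cache stubs are invented):

```python
import concurrent.futures

def call_with_fallback(primary, fallback, timeout_s=1.0):
    """Run `primary` under a latency budget; on timeout or error, use `fallback`."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(primary)
        try:
            return future.result(timeout=timeout_s)
        except Exception:  # covers the timeout and any upstream failure
            return fallback()

def flaky_pos_lookup():
    """Stub for an unreliable third-party point-of-sale feed."""
    raise ConnectionError("POS feed unavailable")

def cached_catalog():
    """Stub fallback: serve the last known catalog state from cache."""
    return {"sku": "123", "in_stock": True, "source": "cache"}

result = call_with_fallback(flaky_pos_lookup, cached_catalog)
```

Per-integration budgets and fallbacks like this are where the "two-thirds of the time fixing error cases" effort tends to go.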

Vercel rebuilt v0 to tackle the 90% problem: Connecting AI-generated code to existing production infrastructure, not prototypes

Before Claude Code wrote its first line of code, Vercel was already in the vibe coding space with its v0 service.

The basic idea behind the original v0, which launched in 2024, was to be exactly that: version zero, the earliest iteration of an application, helping developers solve the blank-canvas problem. Developers could prompt their way to user interface (UI) scaffolding that looked good, but the code was disposable. Getting those prototypes into production required rewrites.

More than 4 million people have used v0 to build millions of prototypes, but the platform was missing elements required to get into production. The challenge is a familiar one with vibe coding tools: there is a gap between what the tools provide and what enterprise builders require. Claude Code, for instance, generates backend logic and scripts effectively, but does not deploy production UIs within existing company design systems while enforcing security policies.

This creates what Vercel CPO Tom Occhino calls “the world’s largest shadow IT problem.” AI-enabled software creation is already happening inside every enterprise. Credentials are copied into prompts. Company data flows to unmanaged tools. Apps deploy outside approved infrastructure. There’s no audit trail.

Vercel rebuilt v0 to address this production deployment gap. The new version, generally available today, imports existing GitHub repositories and automatically pulls environment variables and configurations. It generates code in a sandbox-based runtime that maps directly to real Vercel deployments and enforces security controls and proper git workflows while allowing non-engineers to ship production code.

“What’s really nice about v0 is that you still have the code visible and reviewable and governed,” Occhino told VentureBeat in an exclusive interview. “Teams end up collaborating on the product, not on PRDs and stuff.”

This shift matters because most enterprise software work happens on existing applications, not new prototypes. Teams need tools that integrate with their current codebases and infrastructure.

How v0’s sandbox runtime connects AI-generated code to existing repositories

The original v0 generated UI scaffolding from prompts and let users iterate through conversations. But the code lived in v0’s isolated environment, which meant moving it to production required copying files, rewriting imports and manually wiring everything together.

The rebuilt v0 fundamentally changes this by directly importing existing GitHub repositories. A sandbox-based runtime automatically pulls environment variables, deployments and configurations from Vercel, so every prompt generates production-ready code that already understands the company’s infrastructure. The code lives in the repository, not a separate prototyping tool.

Previously, v0 was a separate prototyping environment. Now, it’s connected to the actual codebase with full VS Code built into the interface, which means developers can edit code directly without switching tools.

A new git panel handles proper workflows. Anyone on a team can create branches from within v0, open pull requests against main and deploy on merge. Pull requests are first-class citizens and previews map directly to real Vercel deployments, not isolated demos.

This matters because product managers and marketers can now ship production code through proper git workflows without needing local development environments or handing code snippets to engineers for integration. The new version also adds direct integrations with Snowflake and AWS databases, so teams can wire apps to production data sources with proper access controls built in, rather than requiring manual work.

Vercel’s React and Next.js experience explains v0’s deployment infrastructure

Prior to joining Vercel in 2023, Occhino spent a dozen years as an engineer at Meta (formerly Facebook) and helped lead that company’s development of the widely-used React JavaScript framework.

Vercel’s claim to fame is that its company founder, Guillermo Rauch, is the creator of Next.js, a full-stack framework built on top of React. In the vibe coding era, Next.js has become an increasingly popular framework. The company recently published a list of React best practices specifically designed to help AI agents and LLMs work.

The Vercel platform encapsulates best practices and learnings from Next.js and React. That decade of building frameworks and infrastructure together means v0 outputs production-ready code that deploys on the same infrastructure Vercel uses for millions of deployments annually. The platform includes agentic workflow support, MCP integration, web application firewall, SSO and deployment protections. Teams can open any project in a cloud dev environment and push changes in a single click to a Vercel preview or production deployment.

With no shortage of competitive offerings in the vibe coding space, including Replit, Lovable and Cursor, Occhino sees Vercel’s core foundational infrastructure as what sets v0 apart.

“The biggest differentiator for us is the Vercel infrastructure,” Occhino said. “It’s been building managed infrastructure, framework-defined infrastructure, now self-driving infrastructure for the past 10 years.”

Why vibe coding security requires infrastructure control, not just policy

The shadow IT problem isn’t that employees are using AI tools. It’s that most vibe coding tools operate entirely outside enterprise infrastructure. Credentials are copied into prompts because there’s no secure way to connect generated code to enterprise databases. Apps deploy to public URLs because the tools don’t integrate with company deployment pipelines. Data leaks happen because visibility controls don’t exist.

The technical challenge is that securing AI-generated code requires controlling where it runs and what it can access. Policy documents don’t help if the tooling itself can’t enforce those policies.

This is where infrastructure matters. When vibe coding tools operate on separate platforms, enterprises face a choice: Block the tools entirely or accept the security risks. When the vibe coding tool runs on the same infrastructure as production deployments, security controls can be enforced automatically.

v0 runs on Vercel’s infrastructure, which means enterprises can set deployment protections, visibility controls and access policies that apply to AI-generated code the same way they apply to hand-written code. Direct integrations with Snowflake and AWS databases let teams connect to production data with proper access controls rather than copying credentials into prompts.

“IT teams are comfortable with what their teams are building because they have control over who has access,” Occhino said. “They have control over what those applications have access to from Snowflake or data systems.”

Generative UI vs. generative software

In addition to the new version of v0, Vercel has recently introduced a generative UI technology called json-render.

v0 is what Vercel calls generative software. This differs from the company’s json-render framework, which delivers true generative UI. Vercel software engineer Chris Tate explained that v0 builds full-stack apps and agents, not just UIs or frontends. In contrast, json-render enables AI to generate UI components directly at runtime by outputting JSON instead of code.

“The AI doesn’t write software,” Tate told VentureBeat. “It plugs directly into the rendering layer to create spontaneous, personalized interfaces on demand.”

The distinction matters for enterprise use cases. Teams use v0 when they need to build complete applications, custom components or production software.

They use json-render for dynamic, personalized UI elements within applications: dashboards that adapt to individual users, contextual widgets and interfaces that respond to changing data without code changes.

Both leverage the AI SDK infrastructure that Vercel has built for streaming and structured outputs.
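The generative UI pattern itself is straightforward to illustrate. json-render is a React library, so the Python sketch below is purely illustrative of the pattern rather than json-render's actual API: a model emits a JSON spec at runtime, and a renderer maps it onto a registry of pre-approved components. All names here are hypothetical.

```python
def render(spec: dict, registry: dict) -> str:
    """Recursively map a JSON UI spec onto a registry of approved components."""
    component = registry[spec["type"]]
    children = "".join(render(child, registry) for child in spec.get("children", []))
    return component(spec.get("props", {}), children)

# Hypothetical component registry; a real one would hold React components.
registry = {
    "card": lambda props, children: f"<div class='card'>{children}</div>",
    "text": lambda props, children: f"<p>{props.get('value', '')}</p>",
}

# The model emits JSON like this at runtime, instead of writing code:
spec = {
    "type": "card",
    "children": [{"type": "text", "props": {"value": "Revenue up 12%"}}],
}
print(render(spec, registry))  # → <div class='card'><p>Revenue up 12%</p></div>
```

Because the model can only reference registered component types, the interface stays within whatever design system and guardrails the team has approved.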

Three lessons enterprises learned from vibe coding adoption

As enterprises adopted vibe coding tools over the past two years, several patterns emerged about AI-generated code in production environments.

Lesson 1: Prototyping without production deployment creates false progress. Enterprises saw teams generate impressive demos in v0’s early versions, then hit a wall moving those demos to production. The problem wasn’t the quality of generated code. It was that prototypes lived in isolated environments disconnected from production infrastructure.

“While demos are easy to generate, I think most of the iteration that’s happening on these code bases is happening on real production apps,” Occhino said. “90% of what we need to do is make changes to an existing code base.”

Lesson 2: The software development lifecycle has already changed, whether enterprises planned for it or not. Domain experts are building software directly instead of writing product requirement documents (PRDs) for engineers to interpret. Product managers and marketers ship features without waiting for engineering sprints.

This shift means enterprises need tools that maintain code visibility and governance while enabling non-engineers to ship. The alternative is creating bottlenecks by forcing all AI-generated code through traditional development workflows.

Lesson 3: Blocking vibe coding tools doesn’t stop vibe coding. It just pushes the activity outside IT’s visibility. Enterprises that try to restrict AI-powered development find employees using tools anyway, creating the shadow IT problem at scale.

The practical implication is that enterprises should focus less on whether to allow vibe coding and more on ensuring it happens within infrastructure that can enforce existing security and deployment policies. 

Shared memory is the missing layer in AI orchestration

The key to successful AI agents within an enterprise? Shared memory and context. 

This, according to Asana CPO Arnab Bose, provides detailed history and direct access from the get-go — with guardrail checkpoints and human oversight, of course. 

This way, “when you assign a task, you’re not having to go ahead and re-provide all of the context about how your business works,” Bose said at a recent VB event in San Francisco. 

AI as an active teammate, rather than a passive add-on

Asana launched Asana AI Teammates last year with the philosophy that, just like humans, AI agents should be plugged directly into a team or project to create a collaborative system. To further this mission, the project management company has fully integrated with Anthropic’s Claude.  

Users can choose from 12 pre-built agents — for common use cases like IT ticket deflection — or build their own, then assign them to project teams and immediately provide a historical record of what tasks have already been completed and what is still yet to be resolved. Agents also have access to third-party resources like Microsoft 365 or Google Drive. 

“When that agent gets created, it’s not acting on behalf of someone, it manifests itself as a teammate and it gets all of the same sharing permissions, it inherits that,” Bose explained. Everything anyone does — humans and AI included — is documented to allow for “ease of explainability” and a “very transparent and trustworthy system.”

But just like human workers, AI agents are kept in check: Critically, workflows incorporate checkpoints, where humans can give feedback and ask the agent to tweak certain elements of a project or adjust research plans. This is documented in what Bose called a “very human-readable way.” 

Also importantly, the UI provides instructions and knowledge about agent behavior, and approved admins can pause, edit and redirect models in the API when they take actions based on conflicting directions or start acting “in a weird way.”

“The person with edit rights can delete those things that are conflicting and make it go back to its correct behavior,” said Bose. “We’re leaning into that common human-understandable interaction pattern.”

Overcoming challenges of authorization, integration 

But because AI agents are so new, there are still many challenges around security, accessibility and compatibility. 

Asana users, for instance, must go through an OAuth flow and grant Claude access to Asana via its MCP server and other public APIs. But ensuring every knowledge worker knows that the integration exists — and, more importantly, which OAuth grants are OK and which should be avoided — can be a tall order.

Some of the challenges around direct OAuth grants between applications could be addressed by centralizing them through identity providers, Bose noted, or through a centralized listing of approved enterprise AI agents and their skill sets, “almost like an active directory or universal directory of agents.”

Right now, though, beyond what Asana is doing, there’s no standard protocol around shared knowledge and memory, said Bose. His team has been getting “a lot of interesting inbound asks” from partners who want their agents to operate on the Asana work graph and benefit from shared work.

“But because the protocol or standard doesn’t exist, today it has to be a very custom bespoke conversation,” said Bose. 

Ultimately, there are three questions the CPO called “extremely interesting” in AI orchestration right now: 

  • How do you build, manage and secure an authoritative list of known approved AI agents? 

  • How can you enable app-to-app integrations as an IT team without potentially configuring dangerous or harmful agents?

  • Today’s agent-to-agent interactions are very single player. Claude can independently be connected to Asana or Figma or Slack. How can we finally get to a unified, multi-player outcome?

The increased adoption of the Model Context Protocol (MCP) — the open standard introduced by Anthropic that connects AI agents to external systems through a single interface, rather than custom integrations for every pairing — is promising, he noted, and its widespread adoption could open up new and exciting use cases.

However, “I think there probably isn’t a silver bullet standard out there right now,” said Bose. 

Enterprises are measuring the wrong part of RAG

Enterprises have moved quickly to adopt RAG to ground LLMs in proprietary data. In practice, however, many organizations are discovering that retrieval is no longer a feature bolted onto model inference — it has become a foundational system dependency….

Most RAG systems don’t understand sophisticated documents — they shred them

By now, many enterprises have deployed some form of RAG. The promise is seductive: index your PDFs, connect an LLM and instantly democratize your corporate knowledge.

But for industries dependent on heavy engineering, the reality has been underwhelming. Engineers ask specific questions about infrastructure, and the bot hallucinates.

The failure isn’t in the LLM. The failure is in the preprocessing.

Standard RAG pipelines treat documents as flat strings of text. They use “fixed-size chunking” (cutting a document every 500 characters). This works for prose, but it destroys the logic of technical manuals. It slices tables in half, severs captions from images, and ignores the visual hierarchy of the page.

Improving RAG reliability isn’t about buying a bigger model; it’s about fixing the “dark data” problem through semantic chunking and multimodal textualization.

Here is the architectural framework for building a RAG system that can actually read a manual.

The fallacy of fixed-size chunking

In a standard Python RAG tutorial, you split text by character count. In an enterprise PDF, this is disastrous.

If a safety specification table spans 1,000 tokens, and your chunk size is 500, you have just split the “voltage limit” header from the “240V” value. The vector database stores them separately. When a user asks, “What is the voltage limit?”, the retrieval system finds the header but not the value. The LLM, forced to answer, often guesses.
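To make the failure concrete, here is a minimal, self-contained sketch (the table and chunk size are invented for illustration) showing how character-based chunking separates a table header from its value:

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive fixed-size chunking: cut the document every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

doc = (
    "Safety specification table\n"
    "Parameter        | Limit\n"
    "Voltage limit    | 240V\n"
    "Current limit    | 13A\n"
)

chunks = fixed_size_chunks(doc, size=70)
header_chunk = next(c for c in chunks if "Voltage limit" in c)

# The chunk that mentions "Voltage limit" no longer contains "240V":
# the header and its value now live in separate vectors.
print("240V" in header_chunk)  # → False
```

A retriever that matches the query “What is the voltage limit?” against these chunks returns the header chunk, and the LLM never sees the value.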

The solution: Semantic chunking

The first step to fixing production RAG is abandoning arbitrary character counts in favor of document intelligence.

Using layout-aware parsing tools (such as Azure Document Intelligence), we can segment data based on document structure such as chapters, sections and paragraphs, rather than token count.

  • Logical cohesion: A section describing a specific machine part is kept as a single vector, even if it varies in length.

  • Table preservation: The parser identifies a table boundary and forces the entire grid into a single chunk, preserving the row-column relationships that are vital for accurate retrieval.

In our internal qualitative benchmarks, moving from fixed to semantic chunking significantly improved the retrieval accuracy of tabular data, effectively stopping the fragmentation of technical specs.
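In code, the idea reduces to grouping parser output by structure instead of by character count. The sketch below assumes a layout-aware parser has already labeled each block; the `Block` type and the grouping rules are illustrative, not Azure Document Intelligence's actual output format.

```python
from dataclasses import dataclass

@dataclass
class Block:
    kind: str  # "heading", "paragraph" or "table"
    text: str

def semantic_chunks(blocks: list[Block]) -> list[str]:
    """Chunk by document structure: a new chunk starts at each heading,
    and every table is kept whole in its own chunk."""
    chunks: list[str] = []
    current: list[str] = []

    def flush():
        if current:
            chunks.append("\n".join(current))
            current.clear()

    for block in blocks:
        if block.kind == "heading":
            flush()                  # section boundary starts a new chunk
            current.append(block.text)
        elif block.kind == "table":
            flush()                  # never split a table across chunks
            chunks.append(block.text)
        else:
            current.append(block.text)
    flush()
    return chunks

blocks = [
    Block("heading", "3.2 Electrical limits"),
    Block("paragraph", "The unit must stay within the limits below."),
    Block("table", "Voltage limit | 240V\nCurrent limit | 13A"),
]
print(semantic_chunks(blocks))
```

The section prose becomes one chunk and the table another, so “voltage limit” and “240V” are always retrieved together.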

Unlocking visual dark data

The second failure mode of enterprise RAG is blindness. A massive amount of corporate IP exists not in text, but in flowcharts, schematics and system architecture diagrams. Standard embedding models (like text-embedding-3-small) cannot “see” these images. They are skipped during indexing.

If your answer lies in a flowchart, your RAG system will say, “I don’t know.”

The solution: Multimodal textualization

To make diagrams searchable, we implemented a multimodal preprocessing step using vision-capable models (specifically GPT-4o) before the data ever hits the vector store.

  1. OCR extraction: High-precision optical character recognition pulls text labels from within the image.

  2. Generative captioning: The vision model analyzes the image and generates a detailed natural language description (“A flowchart showing that process A leads to process B if the temperature exceeds 50 degrees”).

  3. Hybrid embedding: This generated description is embedded and stored as metadata linked to the original image.

Now, when a user searches for “temperature process flow,” the vector search matches the description, even though the original source was a PNG file.
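Wired together, the three steps look like the sketch below. The service calls are stubbed out as plain callables because the real pipeline depends on external OCR, vision and embedding APIs; every name here is illustrative.

```python
from dataclasses import dataclass

@dataclass
class IndexedImage:
    image_path: str   # link back to the original file, used later for visual citation
    ocr_text: str
    caption: str
    embedding: list

def textualize_image(image_path, ocr, caption_model, embed) -> IndexedImage:
    """OCR -> generative caption -> embed, stored with a link to the image."""
    labels = ocr(image_path)                      # 1. pull text labels from the image
    caption = caption_model(image_path, labels)   # 2. natural-language description
    vector = embed(f"{caption}\n{labels}")        # 3. embed the textualized form
    return IndexedImage(image_path, labels, caption, vector)

# Stub services, for demonstration only.
fake_ocr = lambda path: "A -> B if temp > 50"
fake_caption = lambda path, labels: (
    "A flowchart showing that process A leads to process B "
    "if the temperature exceeds 50 degrees"
)
fake_embed = lambda text: [0.0] * 8  # placeholder vector

record = textualize_image("diagrams/flow.png", fake_ocr, fake_caption, fake_embed)
print(record.caption)
```

A query for “temperature process flow” now matches the stored caption, and `image_path` lets the UI surface the original diagram as a visual citation.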

The trust layer: Evidence-based UI

For enterprise adoption, accuracy is only half the battle. The other half is verifiability.

In a standard RAG interface, the chatbot gives a text answer and cites a filename. This forces the user to download the PDF and hunt for the page to verify the claim. For high-stakes queries (“Is this chemical flammable?”), users simply won’t trust the bot.

The architecture should implement visual citation. Because we preserved the link between the text chunk and its parent image during the preprocessing phase, the UI can display the exact chart or table used to generate the answer alongside the text response.

This “show your work” mechanism allows humans to verify the AI’s reasoning instantly, bridging the trust gap that kills so many internal AI projects.

Future-proofing: Native multimodal embeddings

While the “textualization” method (converting images to text descriptions) is the practical solution for today, the architecture is rapidly evolving.

We are already seeing the emergence of native multimodal embeddings (such as Cohere’s Embed 4). These models can map text and images into the same vector space without the intermediate step of captioning. While we currently use a multi-stage pipeline for maximum control, the future of data infrastructure will likely involve “end-to-end” vectorization where the layout of a page is embedded directly.

Furthermore, as long-context LLMs become cost-effective, the need for chunking may diminish. We may soon pass entire manuals into the context window. However, until the latency and cost of million-token calls drop significantly, semantic preprocessing remains the most economically viable strategy for real-time systems.

Conclusion

The difference between a RAG demo and a production system is how it handles the messy reality of enterprise data.

Stop treating your documents as simple strings of text. If you want your AI to understand your business, you must respect the structure of your documents. By implementing semantic chunking and unlocking the visual data within your charts, you transform your RAG system from a “keyword searcher” into a true “knowledge assistant.”

Dippu Kumar Singh is an AI architect and data engineer.

This tree search framework hits 98.7% on documents where vector search fails

A new open-source framework called PageIndex addresses one of the oldest problems in retrieval-augmented generation (RAG): handling very long documents.

The classic RAG workflow (chunk documents, calculate embeddings, store them in a vector database, and retrieve the top matches based on semantic similarity) works well for basic tasks such as Q&A over small documents.

But as enterprises try to move RAG into high-stakes workflows — auditing financial statements, analyzing legal contracts, navigating pharmaceutical protocols — they’re hitting an accuracy barrier that chunk optimization can’t solve.

PageIndex abandons the standard “chunk-and-embed” method entirely and treats document retrieval not as a search problem, but as a navigation problem.

AlphaGo for documents

PageIndex addresses these limitations by borrowing a concept from game-playing AI rather than search engines: tree search.

When humans need to find specific information in a dense textbook or a long annual report, they do not scan every paragraph linearly. They consult the table of contents to identify the relevant chapter, then the section, and finally the specific page. PageIndex forces the LLM to replicate this human behavior.

Instead of pre-calculating vectors, the framework builds a “Global Index” of the document’s structure, creating a tree where nodes represent chapters, sections, and subsections. When a query arrives, the LLM performs a tree search, explicitly classifying each node as relevant or irrelevant based on the full context of the user’s request.
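A minimal sketch of this navigation loop follows, with the LLM's relevance judgment abstracted as a callable. The `Node` structure and the keyword stand-in are illustrative, not PageIndex's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    summary: str
    children: list = field(default_factory=list)

def tree_search(node: Node, query: str, is_relevant) -> list:
    """Descend only into branches judged relevant; return matching leaf sections.

    `is_relevant(query, node)` stands in for an LLM call that classifies
    a section given the full context of the user's request.
    """
    if not is_relevant(query, node):
        return []
    if not node.children:        # leaf: candidate evidence for the answer
        return [node]
    hits = []
    for child in node.children:
        hits.extend(tree_search(child, query, is_relevant))
    return hits

toc = Node("Annual Report", "Full report", [
    Node("Financials", "Income statement and EBITDA definition", [
        Node("EBITDA", "Defines the EBITDA calculation and adjustments"),
    ]),
    Node("Governance", "Board composition and committees"),
])

# Toy stand-in: always explore internal nodes; match leaves on text.
toy_llm = lambda q, n: bool(n.children) or q.lower() in (n.title + n.summary).lower()
print([n.title for n in tree_search(toc, "EBITDA", toy_llm)])  # → ['EBITDA']
```

In the real system the relevance call is an LLM that reasons over the node's summary and the full conversation, which is what lets it follow footnotes and structural cues a keyword matcher would miss.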

“In computer science terms, a table of contents is a tree-structured representation of a document, and navigating it corresponds to tree search,” said Mingtian Zhang, co-founder of PageIndex. “PageIndex applies the same core idea — tree search — to document retrieval, and can be thought of as an AlphaGo-style system for retrieval rather than for games.”

This shifts the architectural paradigm from passive retrieval, where the system simply fetches matching text, to active navigation, where an agentic model decides where to look.

The limits of semantic similarity

There is a fundamental flaw in how traditional RAG handles complex data. Vector retrieval assumes that the text most semantically similar to a user’s query is also the most relevant. In professional domains, this assumption frequently breaks down.

Mingtian Zhang, co-founder of PageIndex, points to financial reporting as a prime example of this failure mode. If a financial analyst asks an AI about “EBITDA” (earnings before interest, taxes, depreciation, and amortization), a standard vector database will retrieve every chunk where that acronym or a similar term appears.

“Multiple sections may mention EBITDA with similar wording, yet only one section defines the precise calculation, adjustments, or reporting scope relevant to the question,” Zhang told VentureBeat. “A similarity-based retriever struggles to distinguish these cases because the semantic signals are nearly indistinguishable.”

This is the “intent vs. content” gap. The user does not want to find the word “EBITDA”; they want to understand the “logic” behind it for that specific quarter.

Furthermore, traditional embeddings strip the query of its context. Because embedding models have strict input-length limits, the retrieval system usually only sees the specific question being asked, ignoring the previous turns of the conversation. This detaches the retrieval step from the user’s reasoning process. The system matches documents against a short, decontextualized query rather than the full history of the problem the user is trying to solve.

Solving the multi-hop reasoning problem

The real-world impact of this structural approach is most visible in “multi-hop” queries that require the AI to follow a trail of breadcrumbs across different parts of a document.

In a recent benchmark test known as FinanceBench, a system built on PageIndex called “Mafin 2.5” achieved a state-of-the-art accuracy score of 98.7%. The performance gap between this approach and vector-based systems becomes clear when analyzing how they handle internal references.

Zhang offers the example of a query regarding the total value of deferred assets in a Federal Reserve annual report. The main section of the report describes the “change” in value but does not list the total. However, the text contains a footnote: “See Appendix G of this report … for more detailed information.”

A vector-based system typically fails here. The text in Appendix G looks nothing like the user’s query about deferred assets; it is likely just a table of numbers. Because there is no semantic match, the vector database ignores it.

The reasoning-based retriever, however, reads the cue in the main text, follows the structural link to Appendix G, locates the correct table, and returns the accurate figure.

The latency trade-off and infrastructure shift

For enterprise architects, the immediate concern with an LLM-driven search process is latency. Vector lookups occur in milliseconds; having an LLM “read” a table of contents implies a significantly slower user experience.

However, Zhang explains that the perceived latency for the end-user may be negligible due to how the retrieval is integrated into the generation process. In a classic RAG setup, retrieval is a blocking step: the system must search the database before it can begin generating an answer. With PageIndex, retrieval happens inline, during the model’s reasoning process.

“The system can start streaming immediately, and retrieve as it generates,” Zhang said. “That means PageIndex does not add an extra ‘retrieval gate’ before the first token, and Time to First Token (TTFT) is comparable to a normal LLM call.”

This architectural shift also simplifies the data infrastructure. By removing reliance on embeddings, enterprises no longer need to maintain a dedicated vector database. The tree-structured index is lightweight enough to sit in a traditional relational database like PostgreSQL.

This addresses a growing pain point in LLM systems with retrieval components: the complexity of keeping vector stores in sync with living documents. PageIndex separates structure indexing from text extraction. If a contract is amended or a policy updated, the system can handle small edits by re-indexing only the affected subtree rather than reprocessing the entire document corpus.

A decision matrix for the enterprise

While the accuracy gains are compelling, tree-search retrieval is not a universal replacement for vector search. The technology is best viewed as a specialized tool for “deep work” rather than a catch-all for every retrieval task.

For short documents, such as emails or chat logs, the entire context often fits within a modern LLM’s context window, making any retrieval system unnecessary. Conversely, for tasks purely based on semantic discovery, such as recommending similar products or finding content with a similar “vibe,” vector embeddings remain the superior choice because the goal is proximity, not reasoning.

PageIndex fits squarely in the middle: long, highly structured documents where the cost of error is high. This includes technical manuals, FDA filings, and merger agreements. In these scenarios, the requirement is auditability. An enterprise system needs to be able to explain not just the answer, but the path it took to find it (e.g., confirming that it checked Section 4.1, followed the reference to Appendix B, and synthesized the data found there).

The future of agentic retrieval

The rise of frameworks like PageIndex signals a broader trend in the AI stack: the move toward “Agentic RAG.” As models become more capable of planning and reasoning, the responsibility for finding data is moving from the database layer to the model layer.

We are already seeing this in the coding space, where agents like Claude Code and Cursor are moving away from simple vector lookups in favor of active codebase exploration. Zhang believes generic document retrieval will follow the same trajectory.

“Vector databases still have suitable use cases,” Zhang said. “But their historical role as the default database for LLMs and AI will become less clear over time.”

AI agents can talk to each other — they just can’t think together yet

AI agents can talk to each other now — they just can’t understand what the other one is trying to do. That’s the problem Cisco’s Outshift is trying to solve with a new architectural approach it calls the Internet of Cognition.

The gap is practical: protocols like MCP and A2A let agents exchange messages and identify tools, but they don’t share intent or context. Without that, multi-agent systems burn cycles on coordination and can’t compound what they learn.

“The bottom line is, we can send messages, but agents do not understand each other, so there is no grounding, negotiation or coordination or common intent,” Vijoy Pandey, general manager and senior vice president of Outshift, told VentureBeat. 

The practical impact:

Consider a patient scheduling a specialist appointment. With MCP alone, a symptom assessment agent passes a diagnosis code to a scheduling agent, which finds available appointments. An insurance agent verifies coverage. A pharmacy agent checks drug availability.

Each agent completes its task, but none of them reasons together about the patient’s needs. The pharmacy agent might recommend a drug that conflicts with the patient’s history — information the symptom agent has but didn’t pass along because “potential drug interactions” wasn’t in its scope. The scheduling agent books the nearest available appointment without knowing the insurance agent found better coverage at a different facility.

They’re connected, but they’re not aligned on the goal: Find the right care for this patient’s specific situation.

Current protocols handle the mechanics of agent communication — MCP, A2A, and Outshift’s AGNTCY, which it donated to the Linux Foundation, let agents discover tools and exchange messages. But these operate at what Pandey calls the “connectivity and identification layer.” They handle syntax, not semantics.

The missing piece is shared context and intent. An agent completing a task knows what it’s doing and why, but that reasoning isn’t transmitted when it hands off to another agent. Each agent interprets goals independently, which means coordination requires constant clarification and learned insights stay siloed.

For agents to move from communication to collaboration, they need to share three things, according to Outshift: pattern recognition across datasets, causal relationships between actions, and explicit goal states.

“Without shared intent and shared context, AI agents remain semantically isolated. They are capable individually, but goals get interpreted differently; coordination burns cycles, and nothing compounds. One agent learns something valuable, but the rest of the multi-agent-human organization still starts from scratch,” Outshift said in a paper.

Outshift said the industry needs “open, interoperable, enterprise-grade agentic systems that semantically collaborate” and proposes a new architecture it calls the “Internet of Cognition,” where multi-agent environments work within a shared system.

The proposed architecture introduces three layers:

Cognition State Protocols: A semantic layer that sits above message-passing protocols. Agents share not just data but intent — what they’re trying to accomplish and why. This lets agents align on goals before acting, rather than clarifying after the fact.

Cognition Fabric: Infrastructure for building and maintaining shared context. Think of it as distributed working memory: context graphs that persist across agent interactions, with policy controls for what gets shared and who can access it. System designers can define what “common understanding” looks like for their use case.

Cognition Engines: Two types of capability. Accelerators let agents pool insights and compound learning — one agent’s discovery becomes available to others solving related problems. Guardrails enforce compliance boundaries so shared reasoning doesn’t violate regulatory or policy constraints.
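As a thought experiment, a Cognition State Protocol message might carry explicit intent and shared context alongside the payload that today's protocols already move. Outshift has not published a spec, so every field name below is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class CognitionMessage:
    sender: str
    payload: dict                                    # what MCP/A2A already carry
    intent: str                                      # explicit goal: the what and why
    context: dict = field(default_factory=dict)      # shared working memory
    constraints: list = field(default_factory=list)  # guardrails on what may be shared

msg = CognitionMessage(
    sender="scheduling-agent",
    payload={"appointment": "2025-03-14T09:00", "facility": "A"},
    intent="Find the right care for this patient's specific situation",
    context={"insurance_note": "better coverage available at facility B"},
    constraints=["share patient history only with care-team agents"],
)
print(msg.intent)
```

In the healthcare example above, a message shaped like this would let the scheduling agent see the insurance agent's coverage finding before booking, instead of each agent acting on its payload alone.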

Outshift positioned the framework as a call to action rather than a finished product. The company is working on implementation but emphasized that semantic agent collaboration will require industry-wide coordination — much like early internet protocols needed buy-in to become standards. Outshift is in the process of writing the code, publishing the specs and releasing research around the Internet of Cognition. It hopes to have a demo of the protocols soon.

Noah Goodman, co-founder of frontier AI company Humans& and a professor of computer science at Stanford, said during VentureBeat’s AI Impact event held in San Francisco that innovation happens when “other humans figure out which humans to pay attention to.” The same dynamic applies to agent systems: as individual agents learn, the value multiplies when other agents can identify and leverage that knowledge.

The practical question for teams deploying multi-agent systems now: Are your agents just connected, or are they actually working toward the same goal?