The average Fortune 1000 company has more than 30,000 employees and engineering, sales and marketing teams with hundreds of members. Equally large teams exist in government, science and defense organizations. And yet, research shows that the ideal size…
When an AI agent visits a website, it’s essentially a tourist who doesn’t speak the local language. Whether built on LangChain, Claude Code, or the increasingly popular OpenClaw framework, the agent is reduced to guessing which buttons to press: scraping raw HTML, firing off screenshots to multimodal models, and burning through thousands of tokens just to figure out where a search bar is.
That era may be ending. Earlier this week, the Google Chrome team launched WebMCP — Web Model Context Protocol — as an early preview in Chrome 146 Canary. WebMCP, which was developed jointly by engineers at Google and Microsoft and incubated through the W3C’s Web Machine Learning community group, is a proposed web standard that lets any website expose structured, callable tools directly to AI agents through a new browser API: navigator.modelContext.
The implications for enterprise IT are significant. Instead of building and maintaining separate back-end MCP servers in Python or Node.js to connect their web applications to AI platforms, development teams can now wrap their existing client-side JavaScript logic into agent-readable tools — without re-architecting a single page.
The cost and reliability issues with current approaches to web-agent (browser agents) interaction are well understood by anyone who has deployed them at scale. The two dominant methods — visual screen-scraping and DOM parsing — both suffer from fundamental inefficiencies that directly affect enterprise budgets.
With screenshot-based approaches, agents pass images into multimodal models (like Claude and Gemini) and hope the model can identify not only what is on the screen, but where buttons, form fields, and interactive elements are located. Each image consumes thousands of tokens and can have a long latency. With DOM-based approaches, agents ingest raw HTML and JavaScript — a foreign language full of various tags, CSS rules, and structural markup that is irrelevant to the task at hand but still consumes context window space and inference cost.
In both cases, the agent is translating between what the website was designed for (human eyes) and what the model needs (structured data about available actions). A single product search that a human completes in seconds can require dozens of sequential agent interactions — clicking filters, scrolling pages, parsing results — each one an inference call that adds latency and cost.
WebMCP proposes two complementary APIs that serve as a bridge between websites and AI agents.
The Declarative API handles standard actions that can be defined directly in existing HTML forms. For organizations with well-structured forms already in production, this pathway requires minimal additional work; by adding tool names and descriptions to existing form markup, developers can make those forms callable by agents. If your HTML forms are already clean and well-structured, you are probably already 80% of the way there.
The Imperative API handles more complex, dynamic interactions that require JavaScript execution. This is where developers define richer tool schemas — conceptually similar to the tool definitions sent to the OpenAI or Anthropic API endpoints, but running entirely client-side in the browser. Through the registerTool(), a website can expose functions like searchProducts(query, filters) or orderPrints(copies, page_size) with full parameter schemas and natural language descriptions.
The key insight is that a single tool call through WebMCP can replace what might have been dozens of browser-use interactions. An e-commerce site that registers a searchProducts tool lets the agent make one structured function call and receive structured JSON results, rather than having the agent click through filter dropdowns, scroll through paginated results, and screenshot each page.
For IT decision makers evaluating agentic AI deployments, WebMCP addresses three persistent pain points simultaneously.
Cost reduction is the most immediately quantifiable benefit. By replacing sequences of screenshot captures, multimodal inference calls, and iterative DOM parsing with single structured tool calls, organizations can expect significant reductions in token consumption.
Reliability improves because agents are no longer guessing about page structure. When a website explicitly publishes a tool contract — “here are the functions I support, here are their parameters, here is what they return” — the agent operates with certainty rather than inference. Failed interactions due to UI changes, dynamic content loading, or ambiguous element identification are largely eliminated for any interaction covered by a registered tool.
Development velocity accelerates because web teams can leverage their existing front-end JavaScript rather than standing up separate backend infrastructure. The specification emphasizes that any task a user can accomplish through a page’s UI can be made into a tool by reusing much of the page’s existing JavaScript code. Teams do not need to learn new server frameworks or maintain separate API surfaces for agent consumers.
A critical architectural decision separates WebMCP from the fully autonomous agent paradigm that has dominated recent headlines. The standard is explicitly designed around cooperative, human-in-the-loop workflows — not unsupervised automation.
According to Khushal Sagar, a staff software engineer for Chrome, the WebMCP specification identifies three pillars that underpin this philosophy.
Context: All the data agents need to understand what the user is doing, including content that is often not currently visible on screen.
Capabilities: Actions the agent can take on the user’s behalf, from answering questions to filling out forms.
Coordination: Controlling the handoff between user and agent when the agent encounters situations it cannot resolve autonomously.
The specification’s authors at Google and Microsoft illustrate this with a shopping scenario: a user named Maya asks her AI assistant to help find an eco-friendly dress for a wedding. The agent suggests vendors, opens a browser to a dress site, and discovers the page exposes WebMCP tools like getDresses() and showDresses(). When Maya’s criteria go beyond the site’s basic filters, the agent calls those tools to fetch product data, uses its own reasoning to filter for “cocktail-attire appropriate,” and then calls showDresses()to update the page with only the relevant results. It’s a fluid loop of human taste and agent capability, exactly the kind of collaborative browsing that WebMCP is designed to enable.
This is not a headless browsing standard. The specification explicitly states that headless and fully autonomous scenarios are non-goals. For those use cases, the authors point to existing protocols like Google’s Agent-to-Agent (A2A) protocol. WebMCP is about the browser — where the user is present, watching, and collaborating.
WebMCP is not a replacement for Anthropic’s Model Context Protocol, despite sharing a conceptual lineage and a portion of its name. It does not follow the JSON-RPC specification that MCP uses for client-server communication. Where MCP operates as a back-end protocol connecting AI platforms to service providers through hosted servers, WebMCP operates entirely client-side within the browser.
The relationship is complementary. A travel company might maintain a back-end MCP server for direct API integrations with AI platforms like ChatGPT or Claude, while simultaneously implementing WebMCP tools on its consumer-facing website so that browser-based agents can interact with its booking flow in the context of a user’s active session. The two standards serve different interaction patterns without conflict.
The distinction matters for enterprise architects. Back-end MCP integrations are appropriate for service-to-service automation where no browser UI is needed. WebMCP is appropriate when the user is present and the interaction benefits from shared visual context — which describes the majority of consumer-facing web interactions that enterprises care about.
WebMCP is currently available in Chrome 146 Canary behind the “WebMCP for testing” flag at chrome://flags. Developers can join the Chrome Early Preview Program for access to documentation and demos. Other browsers have not yet announced implementation timelines, though Microsoft’s active co-authorship of the specification suggests Edge support is likely.
Industry observers expect formal browser announcements by mid-to-late 2026, with Google Cloud Next and Google I/O as probable venues for broader rollout announcements. The specification is transitioning from community incubation within the W3C to a formal draft — a process that historically takes months but signals serious institutional commitment.
The comparison that Sagar has drawn is instructive: WebMCP aims to become the USB-C of AI agent interactions with the web. A single, standardized interface that any agent can plug into, replacing the current tangle of bespoke scraping strategies and fragile automation scripts.
Whether that vision is realized depends on adoption — by both browser vendors and web developers. But with Google and Microsoft jointly shipping code, the W3C providing institutional scaffolding, and Chrome 146 already running the implementation behind a flag, WebMCP has cleared the most difficult hurdle any web standard faces: getting from proposal to working software.
Lowering the cost of inference is typically a combination of hardware and software. A new analysis released Thursday by Nvidia details how four leading inference providers are reporting 4x to 10x reductions in cost per token.
The dramatic cost reductions were achieved using Nvidia’s Blackwell platform with open-source models. Production deployment data from Baseten, DeepInfra, Fireworks AI and Together AI shows significant cost improvements across healthcare, gaming, agentic chat, and customer service as enterprises scale AI from pilot projects to millions of users.
The 4x to 10x cost reductions reported by inference providers required combining Blackwell hardware with two other elements: optimized software stacks and switching from proprietary to open-source models that now match frontier-level intelligence. Hardware improvements alone delivered 2x gains in some deployments, according to the analysis. Reaching larger cost reductions required adopting low-precision formats like NVFP4 and moving away from closed source APIs that charge premium rates.
The economics prove counterintuitive. Reducing inference costs requires investing in higher-performance infrastructure because throughput improvements translate directly into lower per-token costs.
“Performance is what drives down the cost of inference,” Dion Harris, senior director of HPC and AI hyperscaler solutions at Nvidia, told VentureBeat in an exclusive interview. “What we’re seeing in inference is that throughput literally translates into real dollar value and driving down the cost.”
Nvidia detailed four customer deployments in a blog post showing how the combination of Blackwell infrastructure, optimized software stacks and open-source models delivers cost reductions across different industry workloads. The case studies span high-volume applications where inference economics directly determines business viability.
Sully.ai cut healthcare AI inference costs by 90% (a 10x reduction) while improving response times 65% by switching from proprietary models to open-source models running on Baseten’s Blackwell-powered platform, according to Nvidia. The company returned over 30 million minutes to physicians by automating medical coding and note-taking tasks that previously required manual data entry.
Nvidia also reported that Latitude reduced gaming inference costs 4x for its AI Dungeon platform by running large mixture-of-experts (MoE) models on DeepInfra’s Blackwell deployment. Cost per million tokens dropped from 20 cents on Nvidia’s previous Hopper platform to 10 cents on Blackwell, then to 5 cents after adopting Blackwell’s native NVFP4 low-precision format. Hardware alone delivered 2x improvement, but reaching 4x required the precision format change.
Sentient Foundation achieved 25% to 50% better cost efficiency for its agentic chat platform using Fireworks AI’s Blackwell-optimized inference stack, according to Nvidia. The platform orchestrates complex multi-agent workflows and processed 5.6 million queries in a single week during its viral launch while maintaining low latency.
Nvidia said Decagon saw 6x cost reduction per query for AI-powered voice customer support by running its multimodel stack on Together AI’s Blackwell infrastructure. Response times stayed under 400 milliseconds, even when processing thousands of tokens per query, critical for voice interactions where delays cause users to hang up or lose trust.
The range from 4x to 10x cost reductions across deployments reflects different combinations of technical optimizations rather than just hardware differences. Three factors emerge as primary drivers: precision format adoption, model architecture choices, and software stack integration.
Precision formats show the clearest impact. Latitude’s case demonstrates this directly. Moving from Hopper to Blackwell delivered 2x cost reduction through hardware improvements. Adopting NVFP4, Blackwell’s native low-precision format, doubled that improvement to 4x total. NVFP4 reduces the number of bits required to represent model weights and activations, allowing more computation per GPU cycle while maintaining accuracy. The format works particularly well for MoE models where only a subset of the model activates for each inference request.
Model architecture matters. MoE models, which activate different specialized sub-models based on input, benefit from Blackwell’s NVLink fabric that enables rapid communication between experts. “Having those experts communicate across that NVLink fabric allows you to reason very quickly,” Harris said. Dense models that activate all parameters for every inference don’t leverage this architecture as effectively.
Software stack integration creates additional performance deltas. Harris said that Nvidia’s co-design approach — where Blackwell hardware, NVL72 scale-up architecture, and software like Dynamo and TensorRT-LLM are optimized together — also makes a difference. Baseten’s deployment for Sully.ai used this integrated stack, combining NVFP4, TensorRT-LLM and Dynamo to achieve the 10x cost reduction. Providers running alternative frameworks like vLLM may see lower gains.
Workload characteristics matter. Reasoning models show particular advantages on Blackwell because they generate significantly more tokens to reach better answers. The platform’s ability to process these extended token sequences efficiently through disaggregated serving, where context prefill and token generation are handled separately, makes reasoning workloads cost-effective.
Teams evaluating potential cost reductions should examine their workload profiles against these factors. High token generation workloads using mixture-of-experts models with the integrated Blackwell software stack will approach the 10x range. Lower token volumes using dense models on alternative frameworks will land closer to 4x.
While these case studies focus on Nvidia Blackwell deployments, enterprises have multiple paths to reducing inference costs. AMD’s MI300 series, Google TPUs, and specialized inference accelerators from Groq and Cerebras offer alternative architectures. Cloud providers also continue optimizing their inference services. The question isn’t whether Blackwell is the only option but whether the specific combination of hardware, software and models fits particular workload requirements.
Enterprises considering Blackwell-based inference should start by calculating whether their workloads justify infrastructure changes.
“Enterprises need to work back from their workloads and use case and cost constraints,” Shruti Koparkar, AI product marketing at Nvidia, told VentureBeat.
The deployments achieving 6x to 10x improvements all involved high-volume, latency-sensitive applications processing millions of requests monthly. Teams running lower volumes or applications with latency budgets exceeding one second should explore software optimization or model switching before considering infrastructure upgrades.
Testing matters more than provider specifications. Koparkar emphasizes that providers publish throughput and latency metrics, but these represent ideal conditions.
“If it’s a highly latency-sensitive workload, they might want to test a couple of providers and see who meets the minimum they need while keeping the cost down,” she said. Teams should run actual production workloads across multiple Blackwell providers to measure real performance under their specific usage patterns and traffic spikes rather than relying on published benchmarks.
The staged approach Latitude used provides a model for evaluation. The company first moved to Blackwell hardware and measured 2x improvement, then adopted NVFP4 format to reach 4x total reduction. Teams currently on Hopper or other infrastructure can test whether precision format changes and software optimization on existing hardware capture meaningful savings before committing to full infrastructure migrations. Running open source models on current infrastructure might deliver half the potential cost reduction without new hardware investments.
Provider selection requires understanding software stack differences. While multiple providers offer Blackwell infrastructure, their software implementations vary. Some run Nvidia’s integrated stack using Dynamo and TensorRT-LLM, while others use frameworks like vLLM. Harris acknowledges performance deltas exist between these configurations. Teams should evaluate what each provider actually runs and how it matches their workload requirements rather than assuming all Blackwell deployments perform identically.
The economic equation extends beyond cost per token. Specialized inference providers like Baseten, DeepInfra, Fireworks and Together offer optimized deployments but require managing additional vendor relationships. Managed services from AWS, Azure or Google Cloud may have higher per-token costs but lower operational complexity. Teams should calculate total cost including operational overhead, not just inference pricing, to determine which approach delivers better economics for their specific situation.
A team of researchers led by Nvidia has released DreamDojo, a new AI system designed to teach robots how to interact with the physical world by watching tens of thousands of hours of human video — a development that could significantly reduce the time and cost required to train the next generation of humanoid machines.
The research, published this month and involving collaborators from UC Berkeley, Stanford, the University of Texas at Austin, and several other institutions, introduces what the team calls “the first robot world model of its kind that demonstrates strong generalization to diverse objects and environments after post-training.”
At the core of DreamDojo is what the researchers describe as “a large-scale video dataset” comprising “44k hours of diverse human egocentric videos, the largest dataset to date for world model pretraining.” The dataset, called DreamDojo-HV, is a dramatic leap in scale — “15x longer duration, 96x more skills, and 2,000x more scenes than the previously largest dataset for world model training,” according to the project documentation.
The system operates in two distinct phases. First, DreamDojo “acquires comprehensive physical knowledge from large-scale human datasets by pre-training with latent actions.” Then it undergoes “post-training on the target embodiment with continuous robot actions” — essentially learning general physics from watching humans, then fine-tuning that knowledge for specific robot hardware.
For enterprises considering humanoid robots, this approach addresses a stubborn bottleneck. Teaching a robot to manipulate objects in unstructured environments traditionally requires massive amounts of robot-specific demonstration data — expensive and time-consuming to collect. DreamDojo sidesteps this problem by leveraging existing human video, allowing robots to learn from observation before ever touching a physical object.
One of the technical breakthroughs is speed. Through a distillation process, the researchers achieved “real-time interactions at 10 FPS for over 1 minute” — a capability that enables practical applications like live teleoperation and on-the-fly planning. The team demonstrated the system working across multiple robot platforms, including the GR-1, G1, AgiBot, and YAM humanoid robots, showing what they call “realistic action-conditioned rollouts” across “a wide range of environments and object interactions.”
The release comes at a pivotal moment for Nvidia’s robotics ambitions — and for the broader AI industry. At the World Economic Forum in Davos last month, CEO Jensen Huang declared that AI robotics represents a “once-in-a-generation” opportunity, particularly for regions with strong manufacturing bases. According to Digitimes, Huang has also stated that the next decade will be “a critical period of accelerated development for robotics technology.”
The financial stakes are enormous. Huang told CNBC’s “Halftime Report” on February 6 that the tech industry’s capital expenditures — potentially reaching $660 billion this year from major hyperscalers — are “justified, appropriate and sustainable.” He characterized the current moment as “the largest infrastructure buildout in human history,” with companies like Meta, Amazon, Google, and Microsoft dramatically increasing their AI spending.
That infrastructure push is already reshaping the robotics landscape. Robotics startups raised a record $26.5 billion in 2025, according to data from Dealroom. European industrial giants including Siemens, Mercedes-Benz, and Volvo have announced robotics partnerships in the past year, while Tesla CEO Elon Musk has claimed that 80 percent of his company’s future value will come from its Optimus humanoid robots.
For technical decision-makers evaluating humanoid robots, DreamDojo’s most immediate value may lie in its simulation capabilities. The researchers highlight downstream applications including “reliable policy evaluation without real-world deployment and model-based planning for test-time improvement” — capabilities that could let companies simulate robot behavior extensively before committing to costly physical trials.
This matters because the gap between laboratory demonstrations and factory floors remains significant. A robot that performs flawlessly in controlled conditions often struggles with the unpredictable variations of real-world environments — different lighting, unfamiliar objects, unexpected obstacles. By training on 44,000 hours of diverse human video spanning thousands of scenes and nearly 100 distinct skills, DreamDojo aims to build the kind of general physical intuition that makes robots adaptable rather than brittle.
The research team, led by Linxi “Jim” Fan, Joel Jang, and Yuke Zhu, with Shenyuan Gao and William Liang as co-first authors, has indicated that code will be released publicly, though a timeline was not specified.
Whether DreamDojo translates into commercial robotics products remains to be seen. But the research signals where Nvidia’s ambitions are heading as the company increasingly positions itself beyond its gaming roots. As Kyle Barr observed at Gizmodo earlier this month, Nvidia now views “anything related to gaming and the ‘personal computer'” as “outliers on Nvidia’s quarterly spreadsheets.”
The shift reflects a calculated bet: that the future of computing is physical, not just digital. Nvidia has already invested $10 billion in Anthropic and signaled plans to invest heavily in OpenAI’s next funding round. DreamDojo suggests the company sees humanoid robots as the next frontier where its AI expertise and chip dominance can converge.
For now, the 44,000 hours of human video at the heart of DreamDojo represent something more fundamental than a technical benchmark. They represent a theory — that robots can learn to navigate our world by watching us live in it. The machines, it turns out, have been taking notes.
Researchers from Stanford, Nvidia, and Together AI have developed a new technique that can discover new solutions to very complex problems. For example, they managed to optimize a critical GPU kernel to run 2x faster than the previous state-of-the-art written by human experts.
Their technique, called “Test-Time Training to Discover” (TTT-Discover), challenges the current paradigm of letting models “think longer” for reasoning problems. TTT-Discover allows the model to continue training during the inference process and update its weights for the problem at hand.
Current enterprise AI strategies often rely on “frozen” models. Whether you use a closed or open reasoning model, the model’s parameters are static. When you prompt these models, they search for answers within the fixed manifold of their training data. This works well for problems that resemble what the model has seen before.
However, true discovery problems, like inventing a novel algorithm or proving a new mathematical theorem, are, by definition, out-of-distribution. If the solution requires a leap of logic that doesn’t exist in the training set, a frozen model will likely fail, no matter how much compute you throw at it during inference.
In comments to VentureBeat, Mert Yuksekgonul, a co-author of the paper and doctorate student at Stanford, illustrated this distinction using a famous mathematical breakthrough:
“I believe that thinking models wouldn’t be able to prove, for example, P != NP, without test-time training, just like Andrew Wiles wouldn’t be able to prove Fermat’s Last Theorem without the 7 years he spent pursuing this single problem in isolation and continuously learning from his own failures.”
TTT-Discover treats the test problem not as a query to be answered, but as an environment to be mastered. As the model attempts to solve the problem, it generates different types of data: failures, partial successes, and errors. Instead of discarding this data, TTT-Discover uses it to update the model’s weights in real-time, effectively allowing the model to laser focus on that specific challenge as opposed to developing a very general problem-solving framework.
TTT-Discover provides a fundamental shift on how reasoning models are trained. In standard reinforcement learning (RL) training, the goal is a generalist policy that performs well on average across many tasks. In TTT-Discover, the goal is to find the best solution to a very specific problem, and the policy is “a means towards this end,” according to the authors. Once the model discovers the artifact (i.e., the optimized code, the proof, or the molecule) the neural network that produced it can be discarded.
To achieve this, the researchers engineered two specific components that differentiate TTT-Discover from standard reinforcement learning:
Entropic objective: Standard RL optimizes for the average expected reward. If a model tries a risky path and fails, standard RL punishes it. TTT-Discover flips this. It uses an “entropic objective” that exponentially weighs high-reward outcomes. This forces the model to ignore “safe,” average answers and aggressively hunt for “eureka” outliers, solutions that have a low probability of being found but offer a massive reward.
PUCT search: The system introduces PUCT, a tree-search algorithm inspired by AlphaZero. It explores different solution paths, building a dataset of attempts. The model then trains on this dataset in real-time, learning to recognize which partial steps lead to high-reward outcomes.
Crucially, this method works best on problems with a continuous reward signal. The system needs a way to measure incremental progress such as “runtime in microseconds” or “error rate” rather than a binary “pass/fail” signal. This allows the model to follow the gradual improvement toward the optimal solution.
For enterprises accustomed to paying fractions of a cent per API call, the cost profile of TTT-Discover requires a mindset shift. In their experiments, the researchers reported that a single discovery run involves approximately 50 training steps and thousands of rollouts, costing roughly $500 per problem.
TTT-Discover could be for “static, high-value assets” as opposed to trivial and recurring problems that can be solved with existing models and approaches.
Consider a cloud-native enterprise running a data pipeline that processes petabytes of information nightly. If that pipeline relies on a specific SQL query or GPU kernel, optimizing that code by just 1% could save hundreds of thousands of dollars in annual compute costs. In this context, spending $500 to find a kernel that is 50% faster is a trivial expense with an immediate ROI.
“This makes the most sense for low-frequency, high-impact decisions where a single improvement is worth far more than the compute cost,” Yuksekgonul said. “Supply chain routing, drug design, and material discovery qualify. In these settings, spending hundreds of dollars on a single discovery step can easily pay for itself.”
One of the most significant findings for enterprise adoption is that TTT-Discover does not require a proprietary frontier model. The researchers achieved state-of-the-art results using gpt-oss-120b, OpenAI’s open-weights model. The researchers have released the code for TTT-Discover to enable researchers and developers to use it for their own models.
Because the technique works with open models, companies can run this “discovery loop” entirely within their own secure VPCs or on-premise H100 clusters without sending their proprietary data to third-party servers.
“If a company already runs reinforcement learning, there is no additional infrastructure required,” Yuksekgonul said. “TTT-Discover uses the same training stack (GPUs, rollout workers, optimizers, checkpointing).”
If they don’t already run RL, they would need to build that infrastructure. But enterprises can also use existing solutions to reduce the complexity of the process. The researchers orchestrated these training runs using the Tinker API by Thinking Machines, an API that manages the complexity of distributed training and inference.
“Tooling such as Tinker (and open variants, e.g., OpenTinker) lowers the setup cost, and both labor and compute costs are likely to drop over time,” he said.
The researchers deployed TTT-Discover across four distinct technical domains: systems engineering, algorithm design, biology, and mathematics. In almost every instance, the method set a new state-of-the-art.
In one experiment, the model optimized GPU kernels for matrix multiplication (including the “TriMul” kernel used in AlphaFold), achieving execution speeds up to 2x faster than prior state-of-the-art and outperforming the best human-written kernels on the leaderboard.
In competitive programming scenarios (AtCoder), it solved complex heuristic problems (e.g., optimizing geometric constraints for fishing nets) better than top human experts and prior AI baselines.
For the enterprise, the transition from these academic benchmarks to business value hinges on one specific constraint: the existence of a verifiable, scalar signal. Unlike a chatbot that generates text, TTT-Discover needs a hard metric (e.g., runtime, error rate, or profit margin) to optimize against.
Yuksekgonul said that this requirement draws a clear line between where this technology should and shouldn’t be used. “At the moment, the key requirement is a reliable scalar signal of progress — cost, error, molecular properties — that the system can optimize against,” he said.
This directs enterprise adoption toward “hard” engineering and operations challenges such as logistics, supply chain, and resource management, where problems like fleet routing or crew scheduling often rely on static heuristics. TTT-Discover can treat these as optimization environments, spending hours to find a route structure that shaves 5% off daily fuel costs.
The requirement for clear verifiers rules out qualitative tasks like “write a better marketing strategy,” where verification is subjective and prone to noise.
“Hard to verify problems are still an open question,” Yuksekgonul said.
With current technology, the best path forward is to try to design verifiers, but “making those verifiers robust and hard to game is challenging, and we don’t have a good solution yet,” he added.
The broader implication is that enterprise AI stacks may need to evolve to support this kind of per-problem learning.
“Systems built around a frozen model will need to support per-problem (or per-domain) adaptation, and enterprises will need better problem specifications and internal feedback signals to make test-time learning effective,” Yuksekgonul said. “If training runs inside a private VPC, the training loop can also be integrated with more of the company’s internal environment, not just a central lab pipeline.”
For the enterprise, the value lies in identifying “million-dollar problems,” optimization challenges where a verifiable metric exists, but human progress has stalled. These are the candidates for TTT-Discover. By accepting higher latency and cost for specific queries, enterprises can turn their inference compute into an automated R&D lab, discovering solutions that were previously out of reach for both humans and frozen AI models.
We’re starting to see the idea of Musk-owned orbital AI data clusters cohere into an actual plan.
Anthropic on Thursday released Claude Opus 4.6, a major upgrade to its flagship artificial intelligence model that the company says plans more carefully, sustains longer autonomous workflows, and outperforms competitors including OpenAI’s GPT-5.2 on key enterprise benchmarks — a release that arrives at a tumultuous moment for the AI industry and global software markets.
The launch comes just three days after OpenAI released its own Codex desktop application in a direct challenge to Anthropic’s Claude Code momentum, and amid a $285 billion rout in software and services stocks that investors attribute partly to fears that Anthropic’s AI tools could disrupt established enterprise software businesses.
For the first time, Anthropic’s Opus-class models will feature a 1 million token context window, allowing the AI to process and reason across vastly more information than previous versions. The company also introduced “agent teams” in Claude Code — a research preview feature that enables multiple AI agents to work simultaneously on different aspects of a coding project, coordinating autonomously.
“We’re focused on building the most capable, reliable, and safe AI systems,” an Anthropic spokesperson told VentureBeat about the announcements. “Opus 4.6 is even better at planning, helping solve the most complex coding tasks. And the new agent teams feature means users can split work across multiple agents — one on the frontend, one on the API, one on the migration — each owning its piece and coordinating directly with the others.”
The release intensifies an already fierce competition between Anthropic and OpenAI, the two most valuable privately held AI companies in the world. OpenAI on Monday released a new desktop application for its Codex artificial intelligence coding system, a tool the company says transforms software development from a collaborative exercise with a single AI assistant into something more akin to managing a team of autonomous workers.
AI coding assistants have exploded in popularity over the last year, and OpenAI said more than 1 million developers have used Codex in the past month. The new Codex app is part of OpenAI’s ongoing effort to lure users and market share away from rivals like Anthropic and Cursor.
The timing of Anthropic’s release — just 72 hours after OpenAI’s Codex launch — underscores the breakneck pace of competition in AI development tools. OpenAI faces intensifying competition from Anthropic, which posted the largest share increase of any frontier lab since May 2025, according to a recent Andreessen Horowitz survey. Forty-four percent of enterprises now use Anthropic in production, driven by rapid capability gains in software development since late 2024. The desktop launch is a strategic counter to Claude Code’s momentum.
According to Anthropic’s announcement, Opus 4.6 achieves the highest score on Terminal-Bench 2.0, an agentic coding evaluation, and leads all other frontier models on Humanity’s Last Exam, a complex multi-discipline reasoning test. On GDPval-AA — a benchmark measuring performance on economically valuable knowledge work tasks in finance, legal and other domains — Opus 4.6 outperforms OpenAI’s GPT-5.2 by approximately 144 ELO points, which translates to obtaining a higher score approximately 70% of the time.
The stakes are substantial. Asked about Claude Code’s financial performance, the Anthropic spokesperson noted that in November, the company announced that Claude Code reached $1 billion in run rate revenue only six months after becoming generally available in May 2025.
The spokesperson highlighted major enterprise deployments: “Claude Code is used by Uber across teams like software engineering, data science, finance, and trust and safety; wall-to-wall deployment across Salesforce’s global engineering org; tens of thousands of devs at Accenture; and companies across industries like Spotify, Rakuten, Snowflake, Novo Nordisk, and Ramp.”
That enterprise traction has translated into skyrocketing valuations. Earlier this month, Anthropic signed a term sheet for a $10 billion funding round at a $350 billion valuation. Bloomberg reported that Anthropic is simultaneously working on a tender offer that would allow employees to sell shares at that valuation, offering liquidity to staffers who have watched the company’s worth multiply since its 2021 founding.
One of Opus 4.6’s most significant technical improvements addresses what the AI industry calls “context rot“—the degradation of model performance as conversations grow longer. Anthropic says Opus 4.6 scores 76% on MRCR v2, a needle-in-a-haystack benchmark testing a model’s ability to retrieve information hidden in vast amounts of text, compared to just 18.5% for Sonnet 4.5.
“This is a qualitative shift in how much context a model can actually use while maintaining peak performance,” the company said in its announcement.
The model also supports outputs of up to 128,000 tokens — enough to complete substantial coding tasks or documents without breaking them into multiple requests.
For developers, Anthropic is introducing several new API features alongside the model: adaptive thinking, which allows Claude to decide when deeper reasoning would be helpful rather than requiring a binary on-off choice; four effort levels (low, medium, high, max) to control intelligence, speed and cost tradeoffs; and context compaction, a beta feature that automatically summarizes older context to enable longer-running tasks.
Anthropic, which has built its brand around AI safety research, emphasized that Opus 4.6 maintains alignment with its predecessors despite its enhanced capabilities. On the company’s automated behavior audit measuring misaligned behaviors such as deception, sycophancy, and cooperation with misuse, Opus 4.6 “showed a low rate” of problematic responses while also achieving “the lowest rate of over-refusals — where the model fails to answer benign queries — of any recent Claude model.”
When asked how Anthropic thinks about safety guardrails as Claude becomes more agentic, particularly with multiple agents coordinating autonomously, the spokesperson pointed to the company’s published framework: “Agents have tremendous potential for positive impacts in work but it’s important that agents continue to be safe, reliable, and trustworthy. We outlined our framework for developing safe and trustworthy agents last year which shares core principles developers should consider when building agents.”
The company said it has developed six new cybersecurity probes to detect potentially harmful uses of the model’s enhanced capabilities, and is using Opus 4.6 to help find and patch vulnerabilities in open-source software as part of defensive cybersecurity efforts.
The rivalry between Anthropic and OpenAI has spilled into consumer marketing in dramatic fashion. Both companies will feature prominently during Sunday’s Super Bowl. Anthropic is airing commercials that mock OpenAI’s decision to begin testing advertisements in ChatGPT, with the tagline: “Ads are coming to AI. But not to Claude.”
OpenAI CEO Sam Altman responded by calling the ads “funny” but “clearly dishonest,” posting on X that his company would “obviously never run ads in the way Anthropic depicts them” and that “Anthropic wants to control what people do with AI” while serving “an expensive product to rich people.”
The exchange highlights a fundamental strategic divergence: OpenAI has moved to monetize its massive free user base through advertising, while Anthropic has focused almost exclusively on enterprise sales and premium subscriptions.
The launch occurs against a backdrop of historic market volatility in software stocks. A new AI automation tool from Anthropic PBC sparked a $285 billion rout in stocks across the software, financial services and asset management sectors on Tuesday as investors raced to dump shares with even the slightest exposure. A Goldman Sachs basket of US software stocks sank 6%, its biggest one-day decline since April’s tariff-fueled selloff.
The selloff was triggered by a new legal tool from Anthropic, which showed the AI industry’s growing push into industries that can unlock lucrative enterprise revenue needed to fund massive investments in the technology. One trigger for Tuesday’s selloff was Anthropic’s launch of plug-ins for its Claude Cowork agent on Friday, enabling automated tasks across legal, sales, marketing and data analysis.
Thomson Reuters plunged 15.83% Tuesday, its biggest single-day drop on record; and Legalzoom.com sank 19.68%. European legal software providers including RELX, owner of LexisNexis, and Wolters Kluwer experienced their worst single-day performances in decades.
Not everyone agrees the selloff is warranted. Nvidia CEO Jensen Huang said on Tuesday that fears AI would replace software and related tools were “illogical” and “time will prove itself.” Mark Murphy, head of U.S. enterprise software research at JPMorgan, said in a Reuters report it “feels like an illogical leap” to say a new plug-in from an LLM would “replace every layer of mission-critical enterprise software.”
Among the more notable product announcements: Anthropic is releasing Claude in PowerPoint in research preview, allowing users to create presentations using the same AI capabilities that power Claude’s document and spreadsheet work. The integration puts Claude directly inside a core Microsoft product — an unusual arrangement given Microsoft’s 27% stake in OpenAI.
The Anthropic spokesperson framed the move pragmatically in an interview with VentureBeat: “Microsoft has an official add-in marketplace for Office products with multiple add-ins available to help people with slide creation and iteration. Any developer can build a plugin for Excel or PowerPoint. We’re participating in that ecosystem to bring Claude into PowerPoint. This is about participating in the ecosystem and giving users the ability to work with the tools that they want, in the programs they want.”
Data from a16z’s recent enterprise AI survey suggests both Anthropic and OpenAI face an increasingly competitive landscape. While OpenAI remains the most widely used AI provider in the enterprise, with approximately 77% of surveyed companies using it in production in January 2026, Anthropic’s adoption is rising rapidly — from near-zero in March 2024 to approximately 40% using it in production by January 2026.
The survey data also shows that 75% of Anthropic’s enterprise customers are using it in production, with 89% either testing or in production — figures that slightly exceed OpenAI’s 46% in production and 73% testing or in production rates among its customer base.
Enterprise spending on AI continues to accelerate. Average enterprise LLM spend reached $7 million in 2025, up 180% from $2.5 million in 2024, with projections suggesting $11.6 million in 2026 — a 65% increase year-over-year.
Opus 4.6 is available immediately on claude.ai, the Claude API, and major cloud platforms. Developers can access it via claude-opus-4-6 through the API. Pricing remains unchanged at $5 per million input tokens and $25 per million output tokens, with premium pricing of $10/$37.50 for prompts exceeding 200,000 tokens using the 1 million token context window.
For users who find Opus 4.6 “overthinking” simpler tasks — a characteristic Anthropic acknowledges can add cost and latency — the company recommends adjusting the effort parameter from its default high setting to medium.
The recommendation captures something essential about where the AI industry now stands. These models have grown so capable that their creators must now teach customers how to make them think less. Whether that represents a breakthrough or a warning sign depends entirely on which side of the disruption you’re standing on — and whether you remembered to sell your software stocks before Tuesday.
Today’s LLMs excel at reasoning, but can still struggle with context. This is particularly true in real-time ordering systems like Instacart.
Instacart CTO Anirban Kundu calls it the “brownie recipe problem.”
It’s not as simple as telling an LLM ‘I want to make brownies.’ To be truly assistive when planning the meal, the model must go beyond that simple directive to understand what’s available in the user’s market based on their preferences — say, organic eggs versus regular eggs — and factor that into what’s deliverable in their geography so food doesn’t spoil. This among other critical factors.
For Instacart, the challenge is juggling latency with the right mix of context to provide experiences in, ideally, less than one second’s time.
“If reasoning itself takes 15 seconds, and if every interaction is that slow, you’re gonna lose the user,” Kundu said at a recent VB event.
In grocery delivery, there’s a “world of reasoning” and a “world of state” (what’s available in the real world), Bose noted, both of which must be understood by an LLM along with user preference. But it’s not as simple as loading the entirety of a user’s purchase history and known interests into a reasoning model.
“Your LLM is gonna blow up into a size that will be unmanageable,” said Kundu.
To get around this, Instacart splits processing into chunks. First, data is fed into a large foundational model that can understand intent and categorize products. That processed data is then routed to small language models (SLMs) designed for catalog context (the types of food or other items that work together) and semantic understanding.
In the case of catalog context, the SLM must be able to process multiple levels of details around the order itself as well as the different products. For instance, what products go together and what are their relevant replacements if the first choice isn’t in stock? These substitutions are “very, very important” for a company like Instacart, which Kundu said has “over double digit cases” where a product isn’t available in a local market.
In terms of semantic understanding, say a shopper is looking to buy healthy snacks for children. The model needs to understand what a healthy snack is and what foods are appropriate for, and appeal to, an 8 year old, then identify relevant products. And, when those particular products aren’t available in a given market, the model has to also find related subsets of products.
Then there’s the logistical element. For example, a product like ice cream melts quickly, and frozen vegetables also don’t fare well when left out in warmer temperatures. The model must have this context and calculate an acceptable deliverability time.
“So you have this intent understanding, you have this categorization, then you have this other portion about logistically, how do you do it?”, Kundu noted.
Like many other companies, Instacart is experimenting with AI agents, finding that a mix of agents works better than a “single monolith” that does multiple different tasks. The Unix philosophy of a modular operating system with smaller, focused tools helps address different payment systems, for instance, that have varying failure modes, Kundu explained.
“Having to build all of that within a single environment was very unwieldy,” he said. Further, agents on the back end talk to many third-party platforms, including point-of-sale (POS) and catalog systems. Naturally, not all of them behave the same way; some are more reliable than others, and they have different update intervals and feeds.
“So being able to handle all of those things, we’ve gone down this route of microagents rather than agents that are dominantly large in nature,” said Kundu.
To manage agents, Instacart has integrated with OpenAI’s model context protocol (MCP), which standardizes and simplifies the process of connecting AI models to different tools and data sources.
The company also uses Google’s Universal Commerce Protocol (UCP) open standard, which allows AI agents to directly interact with merchant systems.
However, Kundu’s team still deals with challenges. As he noted, it’s not about whether integration is possible, but how reliably those integrations behave and how well they’re understood by users. Discovery can be difficult, not just in identifying available services, but understanding which ones are appropriate for which task.
Instacart has had to implement MCP and UCP in “very different” cases, and the biggest problems they’ve run into are failure modes and latency, Kundu noted. “The response times and understandings of both of those services are very, very different I would say we spend probably two thirds of the time fixing those error cases.”
Before Claude Code wrote its first line of code, Vercel was already in the vibe coding space with its v0 service.
The basic idea behind the original v0, which launched in 2024, was essentially to be version 0. That is, the earliest version of an application, helping developers solve the blank canvas problem. Developers could prompt their way to a user interface (UI) scaffolding that looked good, but the code was disposable. Getting those prototypes into production required rewrites.
More than 4 million people have used v0 to build millions of prototypes, but the platform was missing elements required to get into production. The challenge is a familiar one with vibe coding tools, as there is a gap in what tools provide and what enterprise builders require. Claude Code, for instance, generates backend logic and scripts effectively, but does not deploy production UIs within existing company design systems while enforcing security policies
This creates what Vercel CPO Tom Occhino calls “the world’s largest shadow IT problem.” AI-enabled software creation is already happening inside every enterprise. Credentials are copied into prompts. Company data flows to unmanaged tools. Apps deploy outside approved infrastructure. There’s no audit trail.
Vercel rebuilt v0 to address this production deployment gap. The new version, generally available today, imports existing GitHub repositories and automatically pulls environment variables and configurations. It generates code in a sandbox-based runtime that maps directly to real Vercel deployments and enforces security controls and proper git workflows while allowing non-engineers to ship production code.
“What’s really nice about v0 is that you still have the code visible and reviewable and governed,” Occhino told VentureBeat in an exclusive interview. “Teams end up collaborating on the product, not on PRDs and stuff.”
This shift matters because most enterprise software work happens on existing applications, not new prototypes. Teams need tools that integrate with their current codebases and infrastructure.
The original v0 generated UI scaffolding from prompts and let users iterate through conversations. But the code lived in v0’s isolated environment, which meant moving it to production required copying files, rewriting imports and manually wiring everything together.
The rebuilt v0 fundamentally changes this by directly importing existing GitHub repositories. A sandbox-based runtime automatically pulls environment variables, deployments and configurations from Vercel, so every prompt generates production-ready code that already understands the company’s infrastructure. The code lives in the repository, not a separate prototyping tool.
Previously, v0 was a separate prototyping environment. Now, it’s connected to the actual codebase with full VS Code built into the interface, which means developers can edit code directly without switching tools.
A new git panel handles proper workflows. Anyone on a team can create branches from within v0, open pull requests against main and deploy on merge. Pull requests are first-class citizens and previews map directly to real Vercel deployments, not isolated demos.
This matters because product managers and marketers can now ship production code through proper git workflows without needing local development environments or handing code snippets to engineers for integration. The new version also adds direct integrations with Snowflake and AWS databases, so teams can wire apps to production data sources with proper access controls built in, rather than requiring manual work.
Prior to joining Vercel in 2023, Occhino spent a dozen years as an engineer at Meta (formerly Facebook) and helped lead that company’s development of the widely-used React JavaScript framework.
Vercel’s claim to fame is that its company founder, Guillermo Rauch, is the creator of Next.js, a full-stack framework built on top of React. In the vibe coding era, Next.js has become an increasingly popular framework. The company recently published a list of React best practices specifically designed to help AI agents and LLMs work.
The Vercel platform encapsulates best practices and learnings from Next.js and React. That decade of building frameworks and infrastructure together means v0 outputs production-ready code that deploys on the same infrastructure Vercel uses for millions of deployments annually. The platform includes agentic workflow support, MCP integration, web application firewall, SSO and deployment protections. Teams can open any project in a cloud dev environment and push changes in a single click to a Vercel preview or production deployment.
With no shortage of competitive offerings in the vibe coding space, including Replit, Lovable and Cursor among others, it’s the core foundational infrastructure that Occhino sees as standing out.
“The biggest differentiator for us is the Vercel infrastructure,” Occhino said. “It’s been building managed infrastructure, framework-defined infrastructure, now self-driving infrastructure for the past 10 years.”
The shadow IT problem isn’t that employees are using AI tools. It’s that most vibe coding tools operate entirely outside enterprise infrastructure. Credentials are copied into prompts because there’s no secure way to connect generated code to enterprise databases. Apps deploy to public URLs because the tools don’t integrate with company deployment pipelines. Data leaks happen because visibility controls don’t exist.
The technical challenge is that securing AI-generated code requires controlling where it runs and what it can access. Policy documents don’t help if the tooling itself can’t enforce those policies.
This is where infrastructure matters. When vibe coding tools operate on separate platforms, enterprises face a choice: Block the tools entirely or accept the security risks. When the vibe coding tool runs on the same infrastructure as production deployments, security controls can be enforced automatically.
v0 runs on Vercel’s infrastructure, which means enterprises can set deployment protections, visibility controls and access policies that apply to AI-generated code the same way they apply to hand-written code. Direct integrations with Snowflake and AWS databases let teams connect to production data with proper access controls rather than copying credentials into prompts.
“IT teams are comfortable with what their teams are building because they have control over who has access,” Occhino said. “They have control over what those applications have access to from Snowflake or data systems.”
In addition to the new version of v0, Vercel has recently introduced a generative UI technology called json-render.
v0 is what Vercel calls generative software. This differs from the company’s json-render framework for a true generative UI. Vercel software engineer Chris Tate explained that v0 builds full-stack apps and agents, not just UIs or frontends. In contrast, json-render is a framework that enables AI to generate UI components directly at runtime by outputting JSON instead of code.
“The AI doesn’t write software,” Tate told VentureBeat. “It plugs directly into the rendering layer to create spontaneous, personalized interfaces on demand.”
The distinction matters for enterprise use cases. Teams use v0 when they need to build complete applications, custom components or production software.
They use JSON-render for dynamic, personalized UI elements within applications, dashboards that adapt to individual users, contextual widgets and interfaces that respond to changing data without code changes.
Both leverage the AI SDK infrastructure that Vercel has built for streaming and structured outputs.
As enterprises adopted vibe coding tools over the past two years, several patterns emerged about AI-generated code in production environments.
Lesson 1: Prototyping without production deployment creates false progress. Enterprises saw teams generate impressive demos in v0’s early versions, then hit a wall moving those demos to production. The problem wasn’t the quality of generated code. It was that prototypes lived in isolated environments disconnected from production infrastructure.
“While demos are easy to generate, I think most of the iteration that’s happening on these code bases is happening on real production apps,” Occhino said. “90% of what we need to do is make changes to an existing code base.”
Lesson 2: The software development lifecycle has already changed, whether enterprises planned for it or not. Domain experts are building software directly instead of writing product requirement documents (PRDs) for engineers to interpret. Product managers and marketers ship features without waiting for engineering sprints.
This shift means enterprises need tools that maintain code visibility and governance while enabling non-engineers to ship. The alternative is creating bottlenecks by forcing all AI-generated code through traditional development workflows.
Lesson 3: Blocking vibe coding tools doesn’t stop vibe coding. It just pushes the activity outside IT’s visibility. Enterprises that try to restrict AI-powered development find employees using tools anyway, creating the shadow IT problem at scale.
The practical implication is that enterprises should focus less on whether to allow vibe coding and more on ensuring it happens within infrastructure that can enforce existing security and deployment policies.
The key to successful AI agents within an enterprise? Shared memory and context.
This, according to Asana CPO Arnab Bose, provides detailed history and direct access from the get-go — with guardrail checkpoints and human oversight, of course.
This way, “when you assign a task, you’re not having to go ahead and re-provide all of the context about how your business works,” Bose said at a recent VB event in San Francisco.
Asana launched Asana AI Teammates last year with the philosophy that, just like humans, AI agents should be plugged directly into a team or project to create a collaborative system. To further this mission, the project management company has fully integrated with Anthropic’s Claude.
Users can choose from 12 pre-built agents — for common use cases like IT ticket deflection — or build their own, then assign them to project teams and immediately provide a historical record of what tasks have already been completed and what is still yet to be resolved. Agents also have access to third-party resources like Microsoft 365 or Google Drive.
“When that agent gets created, it’s not acting on behalf of someone, it manifests itself as a teammate and it gets all of the same sharing permissions, it inherits that,” Bose explained. Everything anyone does — humans and AI included — is documented to allow for “ease of explainability” and a “very transparent and trustworthy system.”
But just like human workers, AI agents are kept in check: Critically, workflows incorporate checkpoints, where humans can give feedback and ask the agent to tweak certain elements of a project or adjust research plans. This is documented in what Bose called a “very human-readable way.”
Also importantly, the UI provides instructions and knowledge about agent behavior, and approved admins can pause, edit and redirect models in the API when they take actions based on conflicting directions or start acting “in a weird way.”
“The person with edit rights can delete those things that are conflicting and make it go back to its correct behavior,” said Bose. “We’re leaning into that common human-understandable interaction pattern.”
But because AI agents are so new, there are still many challenges around security, accessibility and compatibility.
Asana users, for instance, must go through an OAuth flow and grant Claude access to Asana via their MCP and other public APIs. But getting all knowledge workers to know that that integration exists — and more importantly, which OAuth grants are OK and which are to be avoided — can be a tall order.
Some of the challenges around direct OAuth grants between applications could be centralized by identity providers, Bose noted, or a centralized listing of approved enterprise AI agents with their skill sets, “almost like an active directory or universal directory of agents.”
Right now, though, beyond what Asana is doing, there’s no standard protocol around shared knowledge and memory, said Bose. His team has been getting “a lot of interesting inbound asks” from partners who want their agents to operate on the Asana work graph and benefit from shared work.
“But because the protocol or standard doesn’t exist, today it has to be a very custom bespoke conversation,” said Bose.
Ultimately, there are three questions the CPO called “extremely interesting” in AI orchestration right now:
How do you build, manage and secure an authoritative list of known approved AI agents?
How can you enable app-to-app integrations as an IT team without potentially configuring dangerous or harmful agents?
Today’s agent-to-agent interactions are very single player. Clouds can independently be connected to Asana or Figma or Slack. How can we finally get to a unified, multi-player outcome?
The increased adoption of modern context protocol (MCP) — the open standard introduced by Anthropic that connects AI agents to external systems in a single action, rather than custom integrations for every single pairing — is promising, he noted, and its widespread adoption could open up new and exciting use cases.
However, “I think there probably isn’t a silver bullet standard out there right now,” said Bose.