Alison.ai is pushing creative validation upstream, using computer vision and predictive scoring to help brands assess video ads earlier and improve performance before campaigns launch.
Despite growing chatter about a future when much human work is automated by AI, one of the ironies of this current tech boom is how stubbornly reliant on human beings it remains, specifically the process of training AI models using reinforcement learning from human feedback (RLHF).
At its simplest, RLHF is a tutoring system: after an AI is trained on curated data, it still makes mistakes or sounds robotic. Human contractors are then hired en masse by AI labs to rate and rank a new model’s outputs while it trains, and the model learns from their ratings, adjusting its behavior to offer higher-rated outputs. This process is all the more important as AI expands to produce multimedia outputs like video, audio, and imagery which may have more nuanced and subjective measures of quality.
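To make that ranking step concrete, here is a minimal, purely illustrative Python sketch: a toy linear reward model nudged by a single pairwise human preference, in the spirit of the Bradley-Terry-style losses used in RLHF. It is not any lab's production pipeline; real systems train neural reward models on millions of such comparisons.

```python
import math

def reward(output: str, weights: dict) -> float:
    """Toy linear reward model: scores an output from simple word features."""
    return sum(weights.get(word, 0.0) for word in output.split())

def ranking_update(weights: dict, preferred: str, rejected: str, lr: float = 0.1) -> dict:
    """Bradley-Terry-style update: widen the score margin between the output a
    human rater preferred and the one they rejected."""
    margin = reward(preferred, weights) - reward(rejected, weights)
    p_agree = 1.0 / (1.0 + math.exp(-margin))   # model's current agreement with the rater
    grad = 1.0 - p_agree                        # push harder when the model disagrees
    for word in preferred.split():
        weights[word] = weights.get(word, 0.0) + lr * grad
    for word in rejected.split():
        weights[word] = weights.get(word, 0.0) - lr * grad
    return weights

# One human judgment: the rater preferred the first phrasing over the second.
weights = ranking_update({}, "thanks for flagging this, here is a fix",
                         "your request has been processed")
```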
Historically, this tutoring process has been a massive logistical headache and PR nightmare for AI companies, relying on fragmented networks of foreign contractors and static labeling pools in specific, low-income geographic hubs, cast by the media as low wage — even exploitative. It’s also inefficient, requiring AI labs to wait weeks or months for a single batch of feedback and delaying model progress.
Now a new startup has emerged to make the process far more efficient: Rapidata’s platform effectively “gamifies” RLHF by distributing review tasks around the globe to nearly 20 million users of popular apps such as Duolingo and Candy Crush, as short, opt-in tasks they can choose to complete in place of watching mobile ads, with data sent back to the commissioning AI lab instantly.
As shared with VentureBeat in a press release, this platform allows AI labs to “iterate on models in near-real-time,” significantly shortening development timelines compared to traditional methods.
CEO and founder Jason Corkill stated in the same release that Rapidata makes “human judgment available at a global scale and near real time, unlocking a future where AI teams can run constant feedback loops and build systems that evolve every day instead of every release cycle.”
Rapidata treats RLHF as high-speed infrastructure rather than a manual labor problem. Today, the company announced exclusively to VentureBeat its emergence with an $8.5 million seed round co-led by Canaan Partners and IA Ventures, with participation from Acequia Capital and BlueYard, to scale its unique approach to on-demand human data.
Rapidata’s genesis came not in a boardroom, but at a table over a few beers. Corkill was a student at ETH Zurich, working in robotics and computer vision, when he hit the wall that every AI engineer eventually faces: the data annotation bottleneck.
“Specifically, I’ve been working in robotics, AI and computer vision for quite a few years now, studied at ETH here in Zurich, and just always was frustrated with data annotation,” Corkill recalled in a recent interview. “Always when you needed humans or human data annotation, that’s kind of when your project was stopped in its tracks, because up until then, you could move it forward by just pushing longer nights. But when you needed the large scale human annotation, you had to go to someone and then wait for a few weeks”.
Frustrated by this delay, Corkill and his co-founders realized that the existing labor model for AI was fundamentally broken for a world moving at the speed of modern compute. While compute scales exponentially, the traditional human workforce—bound by manual onboarding, regional hiring, and slow payment cycles—does not. Rapidata was born from the idea that human judgment could be delivered as a globally distributed, near-instantaneous service.
The core innovation of Rapidata lies in its distribution method. Rather than hiring full-time annotators in specific regions, Rapidata leverages the existing attention economy of the mobile app world. By partnering with third-party apps like Candy Crush or Duolingo, Rapidata offers users a choice: watch a traditional ad or spend a few seconds providing feedback for an AI model.
“The users are asked, ‘Hey, would you rather instead of watching ads and having, you know, companies buy your eyeballs like that, would you rather like annotate some data, give feedback?'” Corkill explained. According to Corkill, between 50% and 60% of users opt for the feedback task over a traditional video advertisement.
This “crowd intelligence” approach allows AI teams to tap into a diverse, global demographic at an unprecedented scale.
The global network: Rapidata currently reaches between 15 and 20 million people.
Massive parallelism: The platform can process 1.5 million human annotations in a single hour.
Speed: Feedback cycles that previously took weeks or months are reduced to hours or even minutes.
Quality control: The platform builds trust and expertise profiles for respondents over time, ensuring that complex questions are matched with the most relevant human judges.
Anonymity: While users are tracked via anonymized IDs to ensure consistency and reliability, Rapidata does not collect personal identities, maintaining privacy while optimizing for data quality.
The most significant technological leap Rapidata is enabling is what Corkill describes as “online RLHF”. Traditionally, AI is trained in disconnected batches: you train the model, stop, send data to humans, wait weeks for labels, and then resume. This creates a “circle” of information that often lacks fresh human input.
Rapidata is moving this judgment directly into the training loop. Because their network is so fast, they can integrate via API directly with the GPUs running the model.
“We’ve always had this idea of reinforcement learning for human feedback… so far, you always had to do it like in batches,” Corkill said. “Now, if you go all the way down, we have a few clients now where, because we’re so fast, we can be directly, basically in the process, like in in the processor on the GPU right, and the GPU calculate some output, and it can immediately request from us in a distributed fashion. ‘Oh, I need, I need, I need a human to look at this.’ I get the answer and then apply that loss, which has not been possible so far”.
Currently, the platform supports roughly 5,500 humans per minute providing live feedback to models running on thousands of GPUs. This prevents “reward model hacking,” where two AI models trick each other in a feedback loop, by grounding the training in actual human nuance.
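To make the “online RLHF” pattern concrete, here is a hedged Python sketch of a training loop that requests a human rating the moment an output is generated rather than batching annotations for weeks. The request_human_rating() call is a hypothetical stand-in, not Rapidata’s actual API, and the parameter update is deliberately toy-sized.

```python
import random

def generate(prompt: str) -> str:
    """Placeholder for a forward pass on the GPU."""
    return f"{prompt} -> candidate output {random.randint(0, 9)}"

def request_human_rating(output: str) -> float:
    """Stand-in for a distributed human-feedback call returning a 0-1 score.
    In the setup described above, this would fan out to app users and return
    within seconds rather than weeks."""
    return random.random()

def online_rlhf_step(params: dict, prompt: str, lr: float = 0.01) -> dict:
    output = generate(prompt)
    score = request_human_rating(output)        # human judgment inside the training loop
    loss = 1.0 - score                          # low ratings produce larger corrections
    params["bias"] = params.get("bias", 0.0) - lr * loss   # toy parameter update
    return params

params: dict = {}
for _ in range(3):
    params = online_rlhf_step(params, "write a friendly support reply")
```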
As AI moves beyond simple object recognition into generative media, the requirements for data labeling have evolved from objective tagging to subjective “taste-based” curation. It is no longer just about “is this a cat?” but rather “is this voice synthesis convincing?” or “which of these two summaries feels more professional?”.
Lily Clifford, CEO of the voice AI startup Rime, notes that Rapidata has been transformative for testing models in real-world contexts. “Previously, gathering meaningful feedback meant cobbling together vendors and surveys, segment by segment, or country by country, which didn’t scale,” Clifford said. Using Rapidata, Rime can reach the right audiences—whether in Sweden, Serbia, or the United States—and see how models perform in real customer workflows in days, not months.
“Most models are factually correct, but I’m sure you’re you have received emails that feel, you know, not authentic, right?” Corkill noted. “You can smell an AI email, you can smell an AI image or a video, it’s immediately clear to you… these models still don’t feel human, and you need human feedback to do that”.
From an operational standpoint, Rapidata positions itself as an infrastructure layer that eliminates the need for companies to manage their own custom annotation operations. By providing a scalable network, the company is lowering the barrier to entry for AI teams that previously struggled with the cost and complexity of traditional feedback loops.
Jared Newman of Canaan Partners, who led the investment, suggests that this infrastructure is essential for the next generation of AI. “Every serious AI deployment depends on human judgment somewhere in the lifecycle,” Newman said. “As models move from expertise-based tasks to taste-based curation, the demand for scalable human feedback will grow dramatically”.
While the current focus is on the model labs of the Bay Area, Corkill sees a future where the AI models themselves become the primary customers of human judgment. He calls this “human use”.
In this vision, a car designer AI wouldn’t just generate a generic vehicle; it could programmatically call Rapidata to ask 25,000 people in the French market what they think of a specific aesthetic, iterate on that feedback, and refine its design within hours.
“Society is in constant flux,” Corkill noted, addressing the trend of using AI to simulate human behavior. “If they simulate a society now, the simulation will be stable for and maybe mirror ours for a few months, but then it completely changes, because society has changed and has developed completely differently”.
By creating a distributed, programmatic way to access human brain capacity worldwide, Rapidata is positioning itself as the vital interconnect between silicon and society. With $8.5 million in new funding, the company plans to move aggressively to ensure that as AI scales, the human element is no longer a bottleneck, but a real-time feature.
Traditional ETL tools like dbt or Fivetran prepare data for reporting: structured analytics and dashboards with stable schemas. AI applications need something different: preparing messy, evolving operational data for model inference in real-time.
Empromptu calls this distinction “inference integrity” versus “reporting integrity.” Instead of treating data preparation as a separate discipline, its “golden pipeline” approach integrates normalization directly into the AI application workflow, collapsing what typically requires 14 days of manual engineering into under an hour while keeping the data accurate, the company says.
The company works primarily with mid-market and enterprise customers in regulated industries where data accuracy and compliance are non-negotiable. Fintech is Empromptu’s fastest-growing vertical, with additional customers in healthcare and legal tech. The platform is HIPAA compliant and SOC 2 certified.
“Enterprise AI doesn’t break at the model layer, it breaks when messy data meets real users,” Shanea Leven, CEO and co-founder of Empromptu told VentureBeat in an exclusive interview. “Golden pipelines bring data ingestion, preparation and governance directly into the AI application workflow so teams can build systems that actually work in production.”
Golden pipelines operate as an automated layer that sits between raw operational data and AI application features.
The system handles five core functions. First, it ingests data from any source including files, databases, APIs and unstructured documents. It then processes that data through automated inspection and cleaning, structuring with schema definitions, and labeling and enrichment to fill gaps and classify records. Built-in governance and compliance checks include audit trails, access controls and privacy enforcement.
The technical approach combines deterministic preprocessing with AI-assisted normalization. Instead of hard-coding every transformation, the system identifies inconsistencies, infers missing structure and generates classifications based on model context. Every transformation is logged and tied directly to downstream AI evaluation.
The evaluation loop is central to how golden pipelines function. If data normalization reduces downstream accuracy, the system catches it through continuous evaluation against production behavior. That feedback coupling between data preparation and model performance distinguishes golden pipelines from traditional ETL tools, according to Leven.
Golden pipelines are embedded directly into the Empromptu Builder and run automatically as part of creating an AI application. From the user’s perspective, teams are building AI features. Under the hood, golden pipelines ensure the data feeding those features is clean, structured, governed and ready for production use.
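The shape of the five functions and the evaluation loop can be sketched in a few lines of Python. This is an illustrative toy, not Empromptu’s implementation; every function, field, and threshold below is invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    raw: dict
    clean: dict = field(default_factory=dict)
    labels: dict = field(default_factory=dict)
    audit: list = field(default_factory=list)   # audit trail for governance

def ingest(rows: list) -> list:
    """Ingest from any source; plain dicts stand in for files, APIs or DB rows."""
    return [Record(raw=r) for r in rows]

def inspect_and_clean(rec: Record) -> Record:
    rec.clean = {k.strip().lower(): (v.strip() if isinstance(v, str) else v)
                 for k, v in rec.raw.items()}
    rec.audit.append("cleaned")
    return rec

def label_and_enrich(rec: Record) -> Record:
    rec.labels["has_contact"] = "@" in str(rec.clean.get("contact", ""))
    rec.audit.append("labeled")
    return rec

def govern(rec: Record) -> Record:
    rec.clean.pop("ssn", None)                  # toy privacy enforcement
    rec.audit.append("governed")
    return rec

def evaluate(records: list, baseline_accuracy: float) -> bool:
    """Stand-in for the evaluation loop: flag normalization that hurts a downstream metric."""
    accuracy = sum(r.labels["has_contact"] for r in records) / max(len(records), 1)
    return accuracy >= baseline_accuracy

records = [govern(label_and_enrich(inspect_and_clean(r)))
           for r in ingest([{" Contact ": " a@b.com ", "ssn": "123-45-6789"}])]
print(evaluate(records, baseline_accuracy=0.5), records[0].audit)
```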
Leven positions golden pipelines as solving a fundamentally different problem than traditional ETL tools like dbt, Fivetran or Databricks.
“Dbt and Fivetran are optimized for reporting integrity. Golden pipelines are optimized for inference integrity,” Leven said. “Traditional ETL tools are designed to move and transform structured data based on predefined rules. They assume schema stability, known transformations and relatively static logic.”
“We’re not replacing dbt or Fivetran, enterprises will continue to use those for warehouse integrity and structured reporting,” Leven said. “Golden pipelines sit closer to the AI application layer. They solve the last-mile problem: how do you take real-world, imperfect operational data and make it usable for AI features without months of manual wrangling?”
The trust argument for AI-driven normalization rests on auditability and continuous evaluation.
“It is not unsupervised magic. It is reviewable, auditable and continuously evaluated against production behavior,” Leven said. “If normalization reduces downstream accuracy, the evaluation loop catches it. That feedback coupling between data preparation and model performance is something traditional ETL pipelines do not provide.”
The golden pipeline approach is already having an impact in the real world.
Event management platform VOW handles high-profile events for organizations like GLAAD as well as multiple sports organizations. When GLAAD plans an event, data populates across sponsor invites, ticket purchases, tables, seats and more. The process happens quickly and data consistency is non-negotiable.
“Our data is more complex than the average platform,” Jennifer Brisman, CEO of VOW, told VentureBeat. “When GLAAD plans an event that data gets populated across sponsor invites, ticket purchases, tables and seats, and more. And it all has to happen very quickly.”
VOW had been writing regex scripts manually. When the company decided to build an AI-generated floor plan feature that updated data in near real-time and populated information across the platform, ensuring data accuracy became critical. Golden pipelines automated the process of extracting data from floor plans that often arrived messy, inconsistent and unstructured, then formatting and sending it onward without extensive manual effort from the engineering team.
VOW initially used Empromptu for AI-generated floor plan analysis that neither Google’s AI team nor Amazon’s AI team could solve. The company is now rewriting its entire platform on Empromptu’s system.
Golden pipelines target a specific deployment pattern: organizations building integrated AI applications where data preparation is currently a manual bottleneck between prototype and production.
The approach makes less sense for teams that already have mature data engineering organizations with established ETL processes optimized for their specific domains, or for organizations building standalone AI models rather than integrated applications.
The decision point is whether data preparation is blocking AI velocity in the organization. If data scientists are preparing datasets for experimentation that engineering teams then rebuild from scratch for production, integrated data prep addresses that gap.
If the bottleneck is elsewhere in the AI development lifecycle, it won’t. The trade-off is platform integration vs tool flexibility. Teams using golden pipelines commit to an integrated approach where data preparation, AI application development and governance happen in a single platform. Organizations that prefer assembling best-of-breed tools for each function will find that approach limiting. The benefit is eliminating handoffs between data prep and application development. The cost is reduced optionality in how those functions are implemented.
Building retrieval-augmented generation (RAG) systems for AI agents often involves using multiple layers and technologies for structured data, vectors and graph information. In recent months it has also become increasingly clear that agentic AI systems need memory, sometimes referred to as contextual memory, to operate effectively.
The complexity and synchronization of having different data layers to enable context can lead to performance and accuracy issues. It’s a challenge that SurrealDB is looking to solve.
SurrealDB on Tuesday launched version 3.0 of its namesake database alongside a $23 million Series A extension, bringing total funding to $44 million. The company takes a different architectural approach than relational databases like PostgreSQL, native vector databases like Pinecone, or graph databases like Neo4j. The OpenAI engineering team recently detailed how it scaled Postgres to 800 million users using read replicas — an approach that works for read-heavy workloads. SurrealDB instead stores agent memory, business logic, and multi-modal data directly inside the database. Rather than synchronizing across multiple systems, vector search, graph traversal, and relational queries all run transactionally in a single Rust-native engine that maintains consistency.
“People are running DuckDB, Postgres, Snowflake, Neo4j, Qdrant or Pinecone all together, and then they’re wondering why they can’t get good accuracy in their agents,” CEO and co-founder Tobie Morgan Hitchcock told VentureBeat. “It’s because they’re having to send five different queries to five different databases which only have the knowledge or the context that they deal with.”
The architecture has resonated with developers, with 2.3 million downloads and 31,000 GitHub stars to date for the database. Existing deployments span edge devices in cars and defense systems, product recommendation engines for major New York retailers, and Android ad serving technologies, according to Hitchcock.
SurrealDB stores agent memory as graph relationships and semantic metadata directly in the database, not in application code or external caching layers.
The Surrealism plugin system in SurrealDB 3.0 lets developers define how agents build and query this memory; the logic runs inside the database with transactional guarantees rather than in middleware.
Here’s what that means in practice: When an agent interacts with data, it creates context graphs that link entities, decisions and domain knowledge as database records. These relationships are queryable through the same SurrealQL interface used for vector search and structured data. An agent asking about a customer issue can traverse graph connections to related past incidents, pull vector embeddings of similar cases, and join with structured customer data — all in one transactional query.
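A hypothetical SurrealQL statement, shown here as a Python string, gives a feel for that single-query pattern: a vector KNN search, a graph traversal, and structured fields returned together. The schema (customer, incident, a raised edge, an embedding field) is invented for illustration and is not SurrealDB’s or any customer’s actual data model.

```python
query = """
LET $similar = (
    SELECT id, title, vector::similarity::cosine(embedding, $query_vec) AS score
    FROM incident
    WHERE embedding <|5|> $query_vec            -- vector KNN search
);
SELECT
    name,
    plan,                                        -- plain structured fields
    ->raised->incident.* AS past_incidents,      -- graph traversal to related records
    $similar AS similar_cases
FROM customer
WHERE id = $customer_id;
"""
# The statement would be sent through a SurrealDB SDK or the HTTP /sql endpoint,
# roughly: results = await db.query(query, {"query_vec": vec, "customer_id": cid})
```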
“People don’t want to store just the latest data anymore,” Hitchcock said. “They want to store all that data. They want to analyze and have the AI understand and run through all the data of an organization over the last year or two, because that informs their model, their AI agent about context, about history, and that can therefore deliver better results.”
Traditional RAG systems query databases based on data types. Developers write separate queries for vector similarity search, graph traversal, and relational joins, then merge results in application code. This creates synchronization delays as queries round-trip between systems.
In contrast, Hitchcock explained that SurrealDB stores data as binary-encoded documents with graph relationships embedded directly alongside them. A single query through SurrealQL can traverse graph relationships, perform vector similarity searches, and join structured records without leaving the database.
That architecture also affects how consistency works at scale: Every node maintains transactional consistency, even at 50+ node scale, Hitchcock said. When an agent writes new context to node A, a query on node B immediately sees that update. No caching, no read replicas.
“A lot of our use cases, a lot of our deployments are where data is constantly updated and the relationships, the context, the semantic understanding, or the graph connections between that data needs to be constantly refreshed,” he said. “So no caching. There’s no read replicas. In SurrealDB, every single thing is transactional.”
“It’s important to say SurrealDB is not the best database for every task. I’d love to say we are, but it’s not. And you can’t be,” Hitchcock said. “If you only need analysis over petabytes of data and you’re never really updating that data, then you’re going to be best going with object storage or a columnar database. If you’re just dealing with vector search, then you can go with a vector database like Qdrant or Pinecone, and that’s going to suffice.”
The inflection point comes when you need multiple data types together. The practical benefit shows up in development timelines. What used to take months to build with multi-database orchestration can now launch in days, Hitchcock said.
OpenAI on Thursday launched GPT-5.3-Codex-Spark, a stripped-down coding model engineered for near-instantaneous response times, marking the company’s first significant inference partnership outside its traditional Nvidia-dominated infrastructure. The model runs on hardware from Cerebras Systems, a Sunnyvale-based chipmaker whose wafer-scale processors specialize in low-latency AI workloads.
The partnership arrives at a pivotal moment for OpenAI. The company finds itself navigating a frayed relationship with longtime chip supplier Nvidia, mounting criticism over its decision to introduce advertisements into ChatGPT, a newly announced Pentagon contract, and internal organizational upheaval that has seen a safety-focused team disbanded and at least one researcher resign in protest.
“GPUs remain foundational across our training and inference pipelines and deliver the most cost effective tokens for broad usage,” an OpenAI spokesperson told VentureBeat. “Cerebras complements that foundation by excelling at workflows that demand extremely low latency, tightening the end-to-end loop so use cases such as real-time coding in Codex feel more responsive as you iterate.”
The careful framing — emphasizing that GPUs “remain foundational” while positioning Cerebras as a “complement” — underscores the delicate balance OpenAI must strike as it diversifies its chip suppliers without alienating Nvidia, the dominant force in AI accelerators.
Codex-Spark represents OpenAI’s first model purpose-built for real-time coding collaboration. The company claims the model delivers generation speeds 15 times faster than its predecessor, though it declined to provide specific latency metrics such as time-to-first-token or tokens-per-second figures.
“We aren’t able to share specific latency numbers, however Codex-Spark is optimized to feel near-instant—delivering 15x faster generation speeds while remaining highly capable for real-world coding tasks,” the OpenAI spokesperson said.
The speed gains come with acknowledged capability tradeoffs. On SWE-Bench Pro and Terminal-Bench 2.0 — two industry benchmarks that evaluate AI systems’ ability to perform complex software engineering tasks autonomously — Codex-Spark underperforms the full GPT-5.3-Codex model. OpenAI positions this as an acceptable exchange: developers get responses fast enough to maintain creative flow, even if the underlying model cannot tackle the most sophisticated multi-step programming challenges.
The model launches with a 128,000-token context window and supports text only — no image or multimodal inputs. OpenAI has made it available as a research preview to ChatGPT Pro subscribers through the Codex app, command-line interface, and Visual Studio Code extension. A small group of enterprise partners will receive API access to evaluate integration possibilities.
“We are making Codex-Spark available in the API for a small set of design partners to understand how developers want to integrate Codex-Spark into their products,” the spokesperson explained. “We’ll expand access over the coming weeks as we continue tuning our integration under real workloads.”
The technical architecture behind Codex-Spark tells a story about inference economics that increasingly matters as AI companies scale consumer-facing products. Cerebras’s Wafer Scale Engine 3 — a single chip roughly the size of a dinner plate containing 4 trillion transistors — eliminates much of the communication overhead that occurs when AI workloads spread across clusters of smaller processors.
For training massive models, that distributed approach remains necessary and Nvidia’s GPUs excel at it. But for inference — the process of generating responses to user queries — Cerebras argues its architecture can deliver results with dramatically lower latency. Sean Lie, Cerebras’s CTO and co-founder, framed the partnership as an opportunity to reshape how developers interact with AI systems.
“What excites us most about GPT-5.3-Codex-Spark is partnering with OpenAI and the developer community to discover what fast inference makes possible — new interaction patterns, new use cases, and a fundamentally different model experience,” Lie said in a statement. “This preview is just the beginning.”
OpenAI’s infrastructure team did not limit its optimization work to the Cerebras hardware. The company announced latency improvements across its entire inference stack that benefit all Codex models regardless of underlying hardware, including persistent WebSocket connections and optimizations within the Responses API. The results: 80 percent reduction in overhead per client-server round trip, 30 percent reduction in per-token overhead, and 50 percent reduction in time-to-first-token.
The Cerebras partnership takes on additional significance given the increasingly complicated relationship between OpenAI and Nvidia. Last fall, when OpenAI announced its Stargate infrastructure initiative, Nvidia publicly committed to investing $100 billion to support OpenAI as it built out AI infrastructure. The announcement appeared to cement a strategic alliance between the world’s most valuable AI company and its dominant chip supplier.
Five months later, that megadeal has effectively stalled, according to multiple reports. Nvidia CEO Jensen Huang has publicly denied tensions, telling reporters in late January that there is “no drama” and that Nvidia remains committed to participating in OpenAI’s current funding round. But the relationship has cooled considerably, with friction stemming from multiple sources.
OpenAI has aggressively pursued partnerships with alternative chip suppliers, including the Cerebras deal and separate agreements with AMD and Broadcom. From Nvidia’s perspective, OpenAI may be using its influence to commoditize the very hardware that made its AI breakthroughs possible. From OpenAI’s perspective, reducing dependence on a single supplier represents prudent business strategy.
“We will continue working with the ecosystem on evaluating the most price-performant chips across all use cases on an ongoing basis,” OpenAI’s spokesperson told VentureBeat. “GPUs remain our priority for cost-sensitive and throughput-first use cases across research and inference.” The statement reads as a careful effort to avoid antagonizing Nvidia while preserving flexibility — and reflects a broader reality that training frontier AI models still requires exactly the kind of massive parallel processing that Nvidia GPUs provide.
The Codex-Spark launch comes as OpenAI navigates a series of internal challenges that have intensified scrutiny of the company’s direction and values. Earlier this week, reports emerged that OpenAI disbanded its mission alignment team, a group established in September 2024 to promote the company’s stated goal of ensuring artificial general intelligence benefits humanity. The team’s seven members have been reassigned to other roles, with leader Joshua Achiam given a new title as OpenAI’s “chief futurist.”
OpenAI previously disbanded another safety-focused group, the superalignment team, in 2024. That team had concentrated on long-term existential risks from AI. The pattern of dissolving safety-oriented teams has drawn criticism from researchers who argue that OpenAI’s commercial pressures are overwhelming its original non-profit mission.
The company also faces fallout from its decision to introduce advertisements into ChatGPT. Researcher Zoë Hitzig resigned this week over what she described as the “slippery slope” of ad-supported AI, warning in a New York Times essay that ChatGPT’s archive of intimate user conversations creates unprecedented opportunities for manipulation. Anthropic seized on the controversy with a Super Bowl advertising campaign featuring the tagline: “Ads are coming to AI. But not to Claude.”
Separately, the company agreed to provide ChatGPT to the Pentagon through Genai.mil, a new Department of Defense program that requires OpenAI to permit “all lawful uses” without company-imposed restrictions — terms that Anthropic reportedly rejected. And reports emerged that Ryan Beiermeister, OpenAI’s vice president of product policy who had expressed concerns about a planned explicit content feature, was terminated in January following a discrimination allegation she denies.
Despite the surrounding turbulence, OpenAI’s technical roadmap for Codex suggests ambitious plans. The company envisions a coding assistant that seamlessly blends rapid-fire interactive editing with longer-running autonomous tasks — an AI that handles quick fixes while simultaneously orchestrating multiple agents working on more complex problems in the background.
“Over time, the modes will blend — Codex can keep you in a tight interactive loop while delegating longer-running work to sub-agents in the background, or fanning out tasks to many models in parallel when you want breadth and speed, so you don’t have to choose a single mode up front,” the OpenAI spokesperson told VentureBeat.
This vision would require not just faster inference but sophisticated task decomposition and coordination across models of varying sizes and capabilities. Codex-Spark establishes the low-latency foundation for the interactive portion of that experience; future releases will need to deliver the autonomous reasoning and multi-agent coordination that would make the full vision possible.
For now, Codex-Spark operates under separate rate limits from other OpenAI models, reflecting constrained Cerebras infrastructure capacity during the research preview. “Because it runs on specialized low-latency hardware, usage is governed by a separate rate limit that may adjust based on demand during the research preview,” the spokesperson noted. The limits are designed to be “generous,” with OpenAI monitoring usage patterns as it determines how to scale.
The Codex-Spark announcement arrives amid intense competition for AI-powered developer tools. Anthropic’s Claude Cowork product triggered a selloff in traditional software stocks last week as investors considered whether AI assistants might displace conventional enterprise applications. Microsoft, Google, and Amazon continue investing heavily in AI coding capabilities integrated with their respective cloud platforms.
OpenAI’s Codex app has demonstrated rapid adoption since launching ten days ago, with more than one million downloads and weekly active users growing 60 percent week-over-week. More than 325,000 developers now actively use Codex across free and paid tiers. But the fundamental question facing OpenAI — and the broader AI industry — is whether speed improvements like those promised by Codex-Spark translate into meaningful productivity gains or merely create more pleasant experiences without changing outcomes.
Early evidence from AI coding tools suggests that faster responses encourage more iterative experimentation. Whether that experimentation produces better software remains contested among researchers and practitioners alike. What seems clear is that OpenAI views inference latency as a competitive frontier worth substantial investment, even as that investment takes it beyond its traditional Nvidia partnership into untested territory with alternative chip suppliers.
The Cerebras deal is a calculated bet that specialized hardware can unlock use cases that general-purpose GPUs cannot cost-effectively serve. For a company simultaneously battling competitors, managing strained supplier relationships, and weathering internal dissent over its commercial direction, it is also a reminder that in the AI race, standing still is not an option. OpenAI built its reputation by moving fast and breaking conventions. Now it must prove it can move even faster — without breaking itself.
RAG isn’t always fast enough or intelligent enough for modern agentic AI workflows. As teams move from short-lived chatbots to long-running, tool-heavy agents embedded in production systems, those limitations are becoming harder to work around.
In response, teams are experimenting with alternative memory architectures — sometimes called contextual memory or agentic memory — that prioritize persistence and stability over dynamic retrieval.
One of the more recent implementations of this approach is “observational memory,” an open-source technology developed by Mastra, which was founded by the engineers who previously built and sold the Gatsby framework to Netlify.
Unlike RAG systems that retrieve context dynamically, observational memory uses two background agents (Observer and Reflector) to compress conversation history into a dated observation log. The compressed observations stay in context, eliminating retrieval entirely. For text content, the system achieves 3-6x compression. For tool-heavy agent workloads generating large outputs, compression ratios hit 5-40x.
The tradeoff is that observational memory prioritizes what the agent has already seen and decided over searching a broader external corpus, making it less suitable for open-ended knowledge discovery or compliance-heavy recall use cases.
The system scored 94.87% on LongMemEval using GPT-5-mini, while maintaining a completely stable, cacheable context window. On the standard GPT-4o model, observational memory scored 84.23% compared to Mastra’s own RAG implementation at 80.05%.
“It has this great characteristic of being both simpler and it is more powerful, like it scores better on the benchmarks,” Sam Bhagwat, co-founder and CEO of Mastra, told VentureBeat.
The architecture is simpler than traditional memory systems but delivers better results.
Observational memory divides the context window into two blocks. The first contains observations — compressed, dated notes extracted from previous conversations. The second holds raw message history from the current session.
Two background agents manage the compression process. When unobserved messages hit 30,000 tokens (configurable), the Observer agent compresses them into new observations and appends them to the first block. The original messages get dropped. When observations reach 40,000 tokens (also configurable), the Reflector agent restructures and condenses the observation log, combining related items and removing superseded information.
“The way that you’re sort of compressing these messages over time is you’re actually just sort of getting messages, and then you have an agent sort of say, ‘OK, so what are the key things to remember from this set of messages?'” Bhagwat said. “You kind of compress it, and then you get in another 30,000 tokens, and you compress that.”
The format is text-based, not structured objects. No vector databases or graph databases required.
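The mechanics can be sketched in a few lines of Python. This is a simplified illustration, not Mastra’s actual implementation: the Observer and Reflector are reduced to stub functions, and only the thresholds mirror the defaults described above.

```python
OBSERVE_AT = 30_000      # unobserved-message threshold cited above (configurable)
REFLECT_AT = 40_000      # observation-log threshold cited above (configurable)

def count_tokens(texts: list) -> int:
    return sum(len(t) // 4 for t in texts)      # rough token estimate

def observe(messages: list) -> list:
    """Stand-in Observer agent: compress raw messages into dated observations."""
    return [f"2026-02-12: condensed {len(messages)} messages about the current task"]

def reflect(observations: list) -> list:
    """Stand-in Reflector agent: restructure and condense the observation log."""
    return observations[-10:]                   # keep only the most salient items

observations: list = []
history: list = []

def add_message(msg: str) -> None:
    global observations, history
    history.append(msg)
    if count_tokens(history) >= OBSERVE_AT:         # Observer pass: raw messages are dropped
        observations.extend(observe(history))
        history = []
    if count_tokens(observations) >= REFLECT_AT:    # Reflector pass: log is condensed
        observations = reflect(observations)

def build_context(system_prompt: str) -> list:
    # Stable prefix (system prompt + observations) followed by raw history:
    # the layout that keeps provider-side prompt caches warm between turns.
    return [system_prompt, *observations, *history]
```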
The economics of observational memory come from prompt caching. Anthropic, OpenAI, and other providers reduce token costs by 4-10x for cached prompts versus those that are uncached. Most memory systems can’t take advantage of this because they change the prompt every turn by injecting dynamically retrieved context, which invalidates the cache. For production teams, that instability translates directly into unpredictable cost curves and harder-to-budget agent workloads.
Observational memory keeps the context stable. The observation block is append-only until reflection runs, which means the system prompt and existing observations form a consistent prefix that can be cached across many turns. Messages keep getting appended to the raw history block until the 30,000 token threshold hits. Every turn before that is a full cache hit.
When observation runs, messages are replaced with new observations appended to the existing observation block. The observation prefix stays consistent, so the system still gets a partial cache hit. Only during reflection (which runs infrequently) is the entire cache invalidated.
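Back-of-the-envelope arithmetic shows why this matters for cost. The prices below are placeholders rather than real provider rates; the point is the gap between a turn that reuses a cached prefix and one where dynamic retrieval rewrote the prefix and invalidated the cache.

```python
PRICE_PER_TOKEN = 1.0      # arbitrary units for uncached input tokens
CACHED_DISCOUNT = 10       # cached prefix tokens cost 1/10th in this example (top of the 4-10x range)

def turn_cost(prefix_tokens: int, new_tokens: int, prefix_cached: bool) -> float:
    prefix_rate = PRICE_PER_TOKEN / CACHED_DISCOUNT if prefix_cached else PRICE_PER_TOKEN
    return prefix_tokens * prefix_rate + new_tokens * PRICE_PER_TOKEN

# A 30,000-token stable context plus 500 new tokens on this turn:
stable_prefix = turn_cost(30_000, 500, prefix_cached=True)      # 3,500 units
rewritten_prefix = turn_cost(30_000, 500, prefix_cached=False)  # 30,500 units
print(f"cached prefix: {stable_prefix:.0f}, invalidated prefix: {rewritten_prefix:.0f}")
```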
The average context window size for Mastra’s LongMemEval benchmark run was around 30,000 tokens, far smaller than the full conversation history would require.
Most coding agents use compaction to manage long context. Compaction lets the context window fill all the way up, then compresses the entire history into a summary when it’s about to overflow. The agent continues, the window fills again, and the process repeats.
Compaction produces documentation-style summaries. It captures the gist of what happened but loses specific events, decisions and details. The compression happens in large batches, which makes each pass computationally expensive. That works for human readability, but it often strips out the specific decisions and tool interactions agents need to act consistently over time.
The Observer, on the other hand, runs more frequently, processing smaller chunks. Instead of summarizing the conversation, it produces an event-based decision log — a structured list of dated, prioritized observations about what specifically happened. Each observation cycle handles less context and compresses it more efficiently.
The log never gets summarized into a blob. Even during reflection, the Reflector reorganizes and condenses the observations to find connections and drop redundant data. But the event-based structure persists. The result reads like a log of decisions and actions, not documentation.
Mastra’s customers span several categories. Some build in-app chatbots for CMS platforms like Sanity or Contentful. Others create AI SRE systems that help engineering teams triage alerts. Document processing agents handle paperwork for traditional businesses moving toward automation.
What these use cases share is the need for long-running conversations that maintain context across weeks or months. An agent embedded in a content management system needs to remember that three weeks ago the user asked for a specific report format. An SRE agent needs to track which alerts were investigated and what decisions were made.
“One of the big goals for 2025 and 2026 has been building an agent inside their web app,” Bhagwat said about B2B SaaS companies. “That agent needs to be able to remember that, like, three weeks ago, you asked me about this thing, or you said you wanted a report on this kind of content type, or views segmented by this metric.”
In those scenarios, memory stops being an optimization and becomes a product requirement — users notice immediately when agents forget prior decisions or preferences.
Observational memory keeps months of conversation history present and accessible. The agent can respond while remembering the full context, without requiring the user to re-explain preferences or previous decisions.
The system shipped as part of Mastra 1.0 and is available now. The team released plug-ins this week for LangChain, Vercel’s AI SDK, and other frameworks, enabling developers to use observational memory outside the Mastra ecosystem.
Observational memory offers a different architectural approach than the vector database and RAG pipelines that dominate current implementations. The simpler architecture (text-based, no specialized databases) makes it easier to debug and maintain. The stable context window enables aggressive caching that cuts costs. The benchmark performance suggests that the approach can work at scale.
For enterprise teams evaluating memory approaches, the key questions are:
How much context do your agents need to maintain across sessions?
What’s your tolerance for lossy compression versus full-corpus search?
Do you need the dynamic retrieval that RAG provides, or would stable context work better?
Are your agents tool-heavy, generating large amounts of output that needs compression?
The answers determine whether observational memory fits your use case. Bhagwat positions memory as one of the top primitives needed for high-performing agents, alongside tool use, workflow orchestration, observability, and guardrails. For enterprise agents embedded in products, forgetting context between sessions is unacceptable. Users expect agents to remember their preferences, previous decisions and ongoing work.
“The hardest thing for teams building agents is the production, which can take time,” Bhagwat said. “Memory is a really important bit in that, because it’s just jarring if you use any sort of agentic tool and you sort of told it something and then it just kind of forgot it.”
As agents move from experiments to embedded systems of record, how teams design memory may matter as much as which model they choose.
A team of researchers led by Nvidia has released DreamDojo, a new AI system designed to teach robots how to interact with the physical world by watching tens of thousands of hours of human video — a development that could significantly reduce the time and cost required to train the next generation of humanoid machines.
The research, published this month and involving collaborators from UC Berkeley, Stanford, the University of Texas at Austin, and several other institutions, introduces what the team calls “the first robot world model of its kind that demonstrates strong generalization to diverse objects and environments after post-training.”
At the core of DreamDojo is what the researchers describe as “a large-scale video dataset” comprising “44k hours of diverse human egocentric videos, the largest dataset to date for world model pretraining.” The dataset, called DreamDojo-HV, is a dramatic leap in scale — “15x longer duration, 96x more skills, and 2,000x more scenes than the previously largest dataset for world model training,” according to the project documentation.
The system operates in two distinct phases. First, DreamDojo “acquires comprehensive physical knowledge from large-scale human datasets by pre-training with latent actions.” Then it undergoes “post-training on the target embodiment with continuous robot actions” — essentially learning general physics from watching humans, then fine-tuning that knowledge for specific robot hardware.
For enterprises considering humanoid robots, this approach addresses a stubborn bottleneck. Teaching a robot to manipulate objects in unstructured environments traditionally requires massive amounts of robot-specific demonstration data — expensive and time-consuming to collect. DreamDojo sidesteps this problem by leveraging existing human video, allowing robots to learn from observation before ever touching a physical object.
One of the technical breakthroughs is speed. Through a distillation process, the researchers achieved “real-time interactions at 10 FPS for over 1 minute” — a capability that enables practical applications like live teleoperation and on-the-fly planning. The team demonstrated the system working across multiple robot platforms, including the GR-1, G1, AgiBot, and YAM humanoid robots, showing what they call “realistic action-conditioned rollouts” across “a wide range of environments and object interactions.”
The release comes at a pivotal moment for Nvidia’s robotics ambitions — and for the broader AI industry. At the World Economic Forum in Davos last month, CEO Jensen Huang declared that AI robotics represents a “once-in-a-generation” opportunity, particularly for regions with strong manufacturing bases. According to Digitimes, Huang has also stated that the next decade will be “a critical period of accelerated development for robotics technology.”
The financial stakes are enormous. Huang told CNBC’s “Halftime Report” on February 6 that the tech industry’s capital expenditures — potentially reaching $660 billion this year from major hyperscalers — are “justified, appropriate and sustainable.” He characterized the current moment as “the largest infrastructure buildout in human history,” with companies like Meta, Amazon, Google, and Microsoft dramatically increasing their AI spending.
That infrastructure push is already reshaping the robotics landscape. Robotics startups raised a record $26.5 billion in 2025, according to data from Dealroom. European industrial giants including Siemens, Mercedes-Benz, and Volvo have announced robotics partnerships in the past year, while Tesla CEO Elon Musk has claimed that 80 percent of his company’s future value will come from its Optimus humanoid robots.
For technical decision-makers evaluating humanoid robots, DreamDojo’s most immediate value may lie in its simulation capabilities. The researchers highlight downstream applications including “reliable policy evaluation without real-world deployment and model-based planning for test-time improvement” — capabilities that could let companies simulate robot behavior extensively before committing to costly physical trials.
This matters because the gap between laboratory demonstrations and factory floors remains significant. A robot that performs flawlessly in controlled conditions often struggles with the unpredictable variations of real-world environments — different lighting, unfamiliar objects, unexpected obstacles. By training on 44,000 hours of diverse human video spanning thousands of scenes and nearly 100 distinct skills, DreamDojo aims to build the kind of general physical intuition that makes robots adaptable rather than brittle.
The research team, led by Linxi “Jim” Fan, Joel Jang, and Yuke Zhu, with Shenyuan Gao and William Liang as co-first authors, has indicated that code will be released publicly, though a timeline was not specified.
Whether DreamDojo translates into commercial robotics products remains to be seen. But the research signals where Nvidia’s ambitions are heading as the company increasingly positions itself beyond its gaming roots. As Kyle Barr observed at Gizmodo earlier this month, Nvidia now views “anything related to gaming and the ‘personal computer'” as “outliers on Nvidia’s quarterly spreadsheets.”
The shift reflects a calculated bet: that the future of computing is physical, not just digital. Nvidia has already invested $10 billion in Anthropic and signaled plans to invest heavily in OpenAI’s next funding round. DreamDojo suggests the company sees humanoid robots as the next frontier where its AI expertise and chip dominance can converge.
For now, the 44,000 hours of human video at the heart of DreamDojo represent something more fundamental than a technical benchmark. They represent a theory — that robots can learn to navigate our world by watching us live in it. The machines, it turns out, have been taking notes.
Presented by F5
As enterprises pour billions into GPU infrastructure for AI workloads, many are discovering that their expensive compute resources sit idle far more than expected. The culprit isn’t the hardware. It’s the often-invisible data delivery layer between storage and compute that’s starving GPUs of the information they need.
“While people are focusing their attention, justifiably so, on GPUs, because they’re very significant investments, those are rarely the limiting factor,” says Mark Menger, solutions architect at F5. “They’re capable of more work. They’re waiting on data.”
AI performance increasingly depends on an independent, programmable control point between AI frameworks and object storage — one that most enterprises haven’t deliberately architected. As AI workloads scale, bottlenecks and instability emerge when AI frameworks are tightly coupled to specific storage endpoints during scaling events, failures, and cloud transitions.
“Traditional storage access patterns were not designed for highly parallel, bursty, multi-consumer AI workloads,” says Maggie Stringfellow, VP of product management for BIG-IP at F5. “Efficient AI data movement requires a distinct data delivery layer designed to abstract, optimize, and secure data flows independently of storage systems, because GPU economics make inefficiency immediately visible and expensive.”
These bidirectional patterns include massive ingestion from continuous data capture, simulation output, and model checkpoints. Combined with read-intensive training and inference workloads, they stress the tightly coupled infrastructure that storage systems rely on.
While storage vendors have done significant work in scaling the data throughput into and out of their systems, that focus on throughput alone creates knock-on effects across the switching, traffic management, and security layers coupled to storage.
The stress on S3-compatible systems from AI workloads is multidimensional and differs significantly from traditional application patterns. It’s less about raw throughput and more about concurrency, metadata pressure, and fan-out considerations. Training and fine-tuning create particularly challenging patterns, like massive parallel reads of small to mid-size objects. These workloads also involve repeated passes through training data across epochs and periodic checkpoint write bursts.
RAG workloads introduce their own complexity through request amplification. A single request can fan out into dozens or hundreds of additional data chunks, cascading into further detail, related chunks, and more complex documents. The stress concentrates less on capacity or raw storage speed and more on request management and traffic shaping.
When AI frameworks connect directly to storage endpoints without an intermediate delivery layer, operational fragility compounds quickly during scaling events, failures, and cloud transitions, which can have major consequences.
“Any instability in the storage service now has an uncontained blast radius,” Menger says. “Anything here becomes a system failure, not a storage failure. Or frankly, aberrant behavior in one application can have knock-on effects to all consumers of that storage service.”
Menger describes a pattern he’s seen with three different customers, where tight coupling cascaded into complete system failures.
“We see large training or fine-tuning workloads overwhelm the storage infrastructure, and the storage infrastructure goes down,” he explains. “At that scale, the recovery is never measured in seconds. Minutes if you’re lucky. Usually hours. The GPUs are now not being fed. They’re starved for data. These high value resources, for that entire time the system is down, are negative ROI.”
The financial impact of introducing an independent data delivery layer extends beyond preventing catastrophic failures.
Decoupling allows data access to be optimized independently of storage hardware, improving GPU utilization by reducing idle time and contention while improving cost predictability and system performance as scale increases, Stringfellow says.
“It enables intelligent caching, traffic shaping, and protocol optimization closer to compute, which lowers cloud egress and storage amplification costs,” she explains. “Operationally, this isolation protects storage systems from unbounded AI access patterns, resulting in more predictable cost behavior and stable performance under growth and variability.”
F5’s answer is to position its Application Delivery and Security Platform, powered by BIG-IP, as a “storage front door” that provides health-aware routing, hotspot avoidance, policy enforcement, and security controls without requiring application rewrites.
“Introducing a delivery tier in between compute and storage helps define boundaries of accountability,” Menger says. “Compute is about execution. Storage is about durability. Delivery is about reliability.”
The programmable control point, which uses event-based, conditional logic rather than generative AI, enables intelligent traffic management that goes beyond simple load balancing. Routing decisions are based on real backend health, with the system monitoring leading indicators to detect early signs of trouble. And when problems emerge, it can isolate misbehaving components without taking down the entire service.
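As a rough illustration of what event-based, health-aware routing means in practice (a generic Python sketch, not F5’s BIG-IP implementation or iRules syntax), consider a router that watches leading indicators and steers around a degraded backend:

```python
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    queue_depth: int = 0      # leading indicator: rising queues signal trouble early
    error_rate: float = 0.0   # leading indicator: errors before an outright outage

    def healthy(self) -> bool:
        return self.queue_depth < 100 and self.error_rate < 0.05

def route(backends: list) -> str:
    """Steer each request to the least-loaded healthy backend; a degraded node is
    isolated without taking the whole storage service down."""
    candidates = [b for b in backends if b.healthy()]
    if not candidates:
        raise RuntimeError("no healthy storage backends available")
    target = min(candidates, key=lambda b: b.queue_depth)   # hotspot avoidance
    target.queue_depth += 1
    return target.name

backends = [Backend("s3-a", queue_depth=12),
            Backend("s3-b", queue_depth=80),
            Backend("s3-c", queue_depth=5, error_rate=0.2)]  # misbehaving node is skipped
print(route(backends))   # -> "s3-a"
```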
“An independent, programmable data delivery layer becomes necessary because it allows policy, optimization, security, and traffic control to be applied uniformly across both ingestion and consumption paths without modifying storage systems or AI frameworks,” Stringfellow says. “By decoupling data access from storage implementation, organizations can safely absorb bursty writes, optimize reads, and protect backend systems from unbounded AI access patterns.”
AI isn’t just pushing storage teams on throughput; it’s forcing them to treat data movement as both a performance and a security problem, Stringfellow says. Security can no longer be assumed simply because data sits deep in the data center. AI introduces automated, high-volume access patterns that must be authenticated, encrypted, and governed at speed. That’s where F5 BIG-IP comes into play.
“F5 BIG-IP sits directly in the AI data path to deliver high-throughput access to object storage while enforcing policy, inspecting traffic, and making payload-informed traffic management decisions,” Stringfellow says. “Feeding GPUs quickly is necessary, but not sufficient; storage teams now need confidence that AI data flows are optimized, controlled, and secure.”
Looking ahead, the requirements for data delivery will only intensify, Stringfellow says.
“AI data delivery will shift from bulk optimization toward real-time, policy-driven data orchestration across distributed systems,” she says. “Agentic and RAG-based architectures will require fine-grained runtime control over latency, access scope, and delegated trust boundaries. Enterprises should start treating data delivery as programmable infrastructure, not a byproduct of storage or networking. The organizations that do this early will scale faster and with less risk.”
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
OpenAI on Wednesday released GPT-5.3-Codex, which the company calls its most capable coding agent to date, in an announcement timed to land at the exact same moment Anthropic unveiled its own flagship model upgrade, Claude Opus 4.6. The synchronized launches mark the opening salvo in what industry observers are calling the AI coding wars — a high-stakes battle to capture the enterprise software development market.
The dueling announcements came amid an already heated week between the two AI giants, who are also set to air competing Super Bowl advertisements on Sunday, and whose executives have been trading barbs publicly over business models, access, and corporate ethics.
“I love building with this model; it feels like more of a step forward than the benchmarks suggest,” OpenAI CEO Sam Altman wrote on X minutes after the launch. He later added: “It was amazing to watch how much faster we were able to ship 5.3-Codex by using 5.3-Codex, and for sure this is a sign of things to come.”
That claim — that the model helped build itself — is a significant milestone in AI development. According to OpenAI’s announcement, the Codex team used early versions of GPT-5.3-Codex to debug its own training runs, manage deployment infrastructure, and diagnose test results and evaluations. The company describes it as “our first model that was instrumental in creating itself.”
The new model posts substantial gains across multiple industry benchmarks. GPT-5.3-Codex achieves 57% on SWE-Bench Pro, a rigorous evaluation of real-world software engineering that spans four programming languages and tests contamination-resistant, industrially relevant challenges. It scores 77.3% on Terminal-Bench 2.0, which measures the terminal skills essential for coding agents, and 64% on OSWorld, an agentic computer-use benchmark where models must complete productivity tasks in visual desktop environments.
The Terminal-Bench 2.0 result is particularly striking. According to performance data released Wednesday, GPT-5.3-Codex scored 77.3% compared to GPT-5.2-Codex’s 64.0% and the base GPT-5.2 model’s 62.2% — a 13-percentage-point leap in a single generation. One user on X noted that the score “absolutely demolished” Anthropic’s Opus 4.6, which reportedly achieved 65.4% on the same benchmark.
OpenAI also claims the model accomplishes these results with dramatically improved efficiency: less than half the tokens of its predecessor for equivalent tasks, plus more than 25% faster inference per token.
“Notably, GPT-5.3-Codex does so with fewer tokens than any prior model, letting users simply build more,” the company stated in its announcement.
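Taken together, those two efficiency claims compound. A back-of-envelope sketch, assuming a hypothetical workload and treating per-token latency as the only cost (fixed overheads ignored), suggests roughly 40% of the previous wall-clock time:

```python
# Back-of-envelope check of the stated efficiency gains (illustrative only).
prev_tokens = 100_000                 # hypothetical task size on the prior model
new_tokens = prev_tokens * 0.5        # "less than half the tokens" (upper bound)
per_token_speedup = 1.25              # "more than 25% faster inference per token"

# Relative wall-clock time if per-token latency were the only cost:
relative_time = (new_tokens / prev_tokens) / per_token_speedup
print(f"~{relative_time:.0%} of the previous runtime")  # ~40%
```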
Perhaps more significant than the benchmark improvements is OpenAI’s positioning of GPT-5.3-Codex as a model that transcends pure coding. The company explicitly states that “Codex goes from an agent that can write and review code to an agent that can do nearly anything developers and professionals can do on a computer.”
This expanded capability set includes debugging, deploying, monitoring, writing product requirement documents, editing copy, conducting user research, building slide decks, and analyzing data in spreadsheet applications. The model shows strong performance on GDPval, an OpenAI evaluation released in 2025 that measures performance on well-specified knowledge-work tasks across 44 occupations.
The expansion signals OpenAI’s ambition to capture not just the developer tools market but the broader enterprise productivity software space — a market that includes established players like Microsoft, Salesforce, and ServiceNow, all of whom are racing to embed AI agents into their platforms.
The pivot toward general-purpose computing brings new security considerations. In a notable disclosure, OpenAI revealed that GPT-5.3-Codex is the first model it classifies as “High capability” for cybersecurity-related tasks under its Preparedness Framework, and the first directly trained to identify software vulnerabilities.
“While we don’t have definitive evidence it can automate cyber attacks end-to-end, we’re taking a precautionary approach and deploying our most comprehensive cybersecurity safety stack to date,” the company stated. Mitigations include dual-use safety training, automated monitoring, trusted access for advanced capabilities, and enforcement pipelines incorporating threat intelligence.
Altman highlighted this development on X: “This is our first model that hits ‘high’ for cybersecurity on our preparedness framework. We are piloting a Trusted Access framework, and committing $10 million in API credits to accelerate cyber defense.”
The company is also expanding the private beta of Aardvark, its security research agent, and partnering with open-source maintainers to provide free codebase scanning for widely used projects. OpenAI cited Next.js as an example where a security researcher used Codex to discover vulnerabilities disclosed last week.
The cybersecurity announcement, however, has been overshadowed by the increasingly personal nature of the OpenAI-Anthropic rivalry. The timing of Wednesday’s release cannot be understood without the context of OpenAI’s intensifying competition with Anthropic, the AI safety-focused startup founded in 2021 by former OpenAI researchers, including Dario and Daniela Amodei.
Both companies scheduled major product announcements for 10 a.m. Pacific Time on Wednesday. Anthropic unveiled Claude Opus 4.6, which it describes as its “smartest model” that “plans more carefully, sustains agentic tasks for longer, operates reliably in massive codebases, and catches its own mistakes.”
The head-to-head timing follows a week of escalating tensions. Anthropic announced it will air Super Bowl advertisements mocking OpenAI’s recent decision to begin testing ads within ChatGPT for free users.
Altman responded with unusual directness, calling the advertisements “funny” but “clearly dishonest” in an extensive X post.
“We would obviously never run ads in the way Anthropic depicts them. We are not stupid and we know our users would reject that,” Altman wrote. “I guess it’s on brand for Anthropic doublespeak to use a deceptive ad to critique theoretical deceptive ads that aren’t real, but a Super Bowl ad is not where I would expect it.”
He went further, characterizing Anthropic as an “authoritarian company” that “wants to control what people do with AI.”
“Anthropic serves an expensive product to rich people,” Altman wrote. “More Texans use ChatGPT for free than total people use Claude in the US, so we have a differently-shaped problem than they do.”
The public sparring masks a deadly serious business competition. The rivalry plays out against a backdrop of explosive enterprise AI adoption, where both companies are fighting for position in a rapidly expanding market.
According to survey data from Andreessen Horowitz released this week, enterprise spending on large language models has dramatically outpaced even bullish projections. Average enterprise LLM spending reached $7 million in 2025, 180% higher than 2024’s actual spending of $2.5 million — and 56% above what enterprises had projected for 2025 just a year earlier. Spending is projected to reach $11.6 million per enterprise in 2026, a further 65% increase.
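The cited growth rates are internally consistent, as a quick check of the survey figures shows (values in millions of dollars, taken from the a16z numbers above):

```python
# Sanity check on the a16z enterprise spending figures cited above ($M).
spend_2024, spend_2025, spend_2026_proj = 2.5, 7.0, 11.6

growth_2025 = (spend_2025 - spend_2024) / spend_2024        # 1.80   -> +180%
growth_2026 = (spend_2026_proj - spend_2025) / spend_2025   # ~0.657 -> ~66%

print(f"2024 -> 2025: +{growth_2025:.0%}")   # +180%
print(f"2025 -> 2026: +{growth_2026:.1%}")   # +65.7%, reported as 65%
```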
The a16z data reveals shifting market dynamics that help explain the intensity of the competition. OpenAI maintains the largest average share of enterprise AI wallet, but that share is shrinking — from 62% in 2024 to a projected 53% in 2026. Anthropic’s share, meanwhile, has grown from 14% to a projected 18% over the same period, with Google showing similar gains.
Enterprise adoption patterns tell a more nuanced story. While OpenAI leads in overall usage, only 46% of surveyed OpenAI customers are using its most capable models in production, compared to 75% for Anthropic and 76% for Google. When including testing environments, 89% of Anthropic customers are testing or using the company’s most capable models — the highest rate among major providers.
For software development specifically — one of the primary use cases for both companies’ coding agents — the a16z survey shows OpenAI with approximately 35% market share, with Anthropic claiming a substantial and growing portion of the remainder.
These market dynamics explain why both companies are positioning themselves as platforms rather than mere model providers. OpenAI on Wednesday also launched Frontier, a new platform designed to serve as a comprehensive hub for businesses adopting a range of AI tools — including those developed by third parties — that can operate together seamlessly.
“We can be the partner of choice for AI transformation for enterprise. The sky is the limit in terms of revenue we can generate from a platform like that,” Fidji Simo, OpenAI’s CEO of applications, told reporters this week.
This follows Monday’s launch of the Codex desktop application for macOS, which OpenAI says has already surpassed 500,000 downloads. The app enables users to manage multiple AI coding agents simultaneously — a capability that becomes increasingly important as enterprises deploy agents for complex, long-running tasks.
The platform ambitions require extraordinary capital. The dueling launches underscore the staggering financial requirements of frontier AI development, with both companies burning through billions while racing to establish market dominance.
Anthropic is currently in discussions for a funding round that could bring in more than $20 billion at a valuation of at least $350 billion, according to Bloomberg, and is simultaneously planning an employee tender offer at that valuation.
OpenAI, meanwhile, has disclosed that it owes more than $1 trillion in financial obligations to backers — including Oracle, Microsoft, and Nvidia — that are essentially fronting compute costs in expectation of future returns.
GPT-5.3-Codex was “co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems,” according to OpenAI’s announcement—a reference to Nvidia’s latest Blackwell-generation AI supercomputing architecture.
The financial pressure adds urgency to both companies’ enterprise strategies. Unlike established tech giants with diversified revenue streams, both Anthropic and OpenAI must prove they can generate sufficient revenue from AI products to justify their extraordinary valuations and infrastructure costs.
Looking ahead, OpenAI says GPT-5.3-Codex is available immediately for paid ChatGPT users across all Codex surfaces: the desktop app, command-line interface, IDE extensions, and web interface. API access is expected to follow.
The model includes a new interactivity feature: users can choose between “pragmatic” or “friendly” personalities — a customization Altman suggests users feel strongly about. More substantively, the model provides frequent progress updates during tasks, allowing users to interact in real time, ask questions, discuss approaches, and steer toward solutions without losing context.
“Instead of waiting for a final output, you can interact in real time,” OpenAI stated. “GPT-5.3-Codex talks through what it’s doing, responds to feedback, and keeps you in the loop from start to finish.”
The company promises more capabilities in the coming weeks, with Altman declaring: “I believe Codex is going to win.”
He concluded his response to Anthropic with a philosophical statement that frames the competition in stark terms: “This time belongs to the builders, not the people who want to control them.”
Whether that message resonates with enterprise customers — who according to a16z data cite trust, security, and compliance as their top concerns — remains to be seen. What’s clear is that the AI coding wars have begun in earnest, and neither company intends to cede ground.
Anthropic on Thursday released Claude Opus 4.6, a major upgrade to its flagship artificial intelligence model that the company says plans more carefully, sustains longer autonomous workflows, and outperforms competitors including OpenAI’s GPT-5.2 on key enterprise benchmarks — a release that arrives at a tumultuous moment for the AI industry and global software markets.
The launch comes just three days after OpenAI released its own Codex desktop application in a direct challenge to Anthropic’s Claude Code momentum, and amid a $285 billion rout in software and services stocks that investors attribute partly to fears that Anthropic’s AI tools could disrupt established enterprise software businesses.
For the first time, Anthropic’s Opus-class models will feature a 1 million token context window, allowing the AI to process and reason across vastly more information than previous versions. The company also introduced “agent teams” in Claude Code — a research preview feature that enables multiple AI agents to work simultaneously on different aspects of a coding project, coordinating autonomously.
“We’re focused on building the most capable, reliable, and safe AI systems,” an Anthropic spokesperson told VentureBeat about the announcements. “Opus 4.6 is even better at planning, helping solve the most complex coding tasks. And the new agent teams feature means users can split work across multiple agents — one on the frontend, one on the API, one on the migration — each owning its piece and coordinating directly with the others.”
The release intensifies an already fierce competition between Anthropic and OpenAI, the two most valuable privately held AI companies in the world. OpenAI on Monday released a new desktop application for its Codex artificial intelligence coding system, a tool the company says transforms software development from a collaborative exercise with a single AI assistant into something more akin to managing a team of autonomous workers.
AI coding assistants have exploded in popularity over the last year, and OpenAI said more than 1 million developers have used Codex in the past month. The new Codex app is part of OpenAI’s ongoing effort to lure users and market share away from rivals like Anthropic and Cursor.
The timing of Anthropic’s release — just 72 hours after OpenAI’s Codex launch — underscores the breakneck pace of competition in AI development tools. OpenAI faces intensifying competition from Anthropic, which posted the largest share increase of any frontier lab since May 2025, according to a recent Andreessen Horowitz survey. Forty-four percent of enterprises now use Anthropic in production, driven by rapid capability gains in software development since late 2024. The desktop launch is a strategic counter to Claude Code’s momentum.
According to Anthropic’s announcement, Opus 4.6 achieves the highest score on Terminal-Bench 2.0, an agentic coding evaluation, and leads all other frontier models on Humanity’s Last Exam, a complex multi-discipline reasoning test. On GDPval-AA — a benchmark measuring performance on economically valuable knowledge work tasks in finance, legal and other domains — Opus 4.6 outperforms OpenAI’s GPT-5.2 by approximately 144 Elo points, which translates to obtaining a higher score approximately 70% of the time.
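The roughly 70% figure follows from the standard Elo conversion, in which a rating gap maps to an expected head-to-head win rate; a minimal check using only the 144-point gap cited above:

```python
# Convert an Elo-style rating gap into an expected head-to-head win rate:
# P(win) = 1 / (1 + 10 ** (-gap / 400))
elo_gap = 144  # reported Opus 4.6 advantage over GPT-5.2 on GDPval-AA

win_probability = 1 / (1 + 10 ** (-elo_gap / 400))
print(f"Expected win rate: {win_probability:.0%}")  # ~70%, matching the claim
```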
The stakes are substantial. Asked about Claude Code’s financial performance, the Anthropic spokesperson noted that in November, the company announced that Claude Code reached $1 billion in run rate revenue only six months after becoming generally available in May 2025.
The spokesperson highlighted major enterprise deployments: “Claude Code is used by Uber across teams like software engineering, data science, finance, and trust and safety; wall-to-wall deployment across Salesforce’s global engineering org; tens of thousands of devs at Accenture; and companies across industries like Spotify, Rakuten, Snowflake, Novo Nordisk, and Ramp.”
That enterprise traction has translated into skyrocketing valuations. Earlier this month, Anthropic signed a term sheet for a $10 billion funding round at a $350 billion valuation. Bloomberg reported that Anthropic is simultaneously working on a tender offer that would allow employees to sell shares at that valuation, offering liquidity to staffers who have watched the company’s worth multiply since its 2021 founding.
One of Opus 4.6’s most significant technical improvements addresses what the AI industry calls “context rot”: the degradation of model performance as conversations grow longer. Anthropic says Opus 4.6 scores 76% on MRCR v2, a needle-in-a-haystack benchmark testing a model’s ability to retrieve information hidden in vast amounts of text, compared to just 18.5% for Sonnet 4.5.
“This is a qualitative shift in how much context a model can actually use while maintaining peak performance,” the company said in its announcement.
The model also supports outputs of up to 128,000 tokens — enough to complete substantial coding tasks or documents without breaking them into multiple requests.
For developers, Anthropic is introducing several new API features alongside the model: adaptive thinking, which allows Claude to decide when deeper reasoning would be helpful rather than requiring a binary on-off choice; four effort levels (low, medium, high, max) to control intelligence, speed and cost tradeoffs; and context compaction, a beta feature that automatically summarizes older context to enable longer-running tasks.
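Anthropic’s announcement does not spell out the exact request shape for these controls, but a minimal sketch of how the effort levels might be exercised through the existing Messages API could look like the following. The `effort` field name, its placement in the request body, and the example prompt are assumptions for illustration; only the model identifier and the four effort levels come from the announcement:

```python
# Minimal sketch only: the "effort" field is an assumption based on the
# features described above, not a confirmed API parameter name.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",   # identifier cited in Anthropic's announcement
    max_tokens=4096,
    # Hypothetical: one of "low" | "medium" | "high" | "max", trading off
    # intelligence, speed, and cost per the four effort levels described above.
    extra_body={"effort": "medium"},
    messages=[
        {"role": "user", "content": "Refactor this module and summarize the changes."}
    ],
)
print(response.content[0].text)
```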
Anthropic, which has built its brand around AI safety research, emphasized that Opus 4.6 maintains alignment with its predecessors despite its enhanced capabilities. On the company’s automated behavior audit measuring misaligned behaviors such as deception, sycophancy, and cooperation with misuse, Opus 4.6 “showed a low rate” of problematic responses while also achieving “the lowest rate of over-refusals — where the model fails to answer benign queries — of any recent Claude model.”
When asked how Anthropic thinks about safety guardrails as Claude becomes more agentic, particularly with multiple agents coordinating autonomously, the spokesperson pointed to the company’s published framework: “Agents have tremendous potential for positive impacts in work but it’s important that agents continue to be safe, reliable, and trustworthy. We outlined our framework for developing safe and trustworthy agents last year which shares core principles developers should consider when building agents.”
The company said it has developed six new cybersecurity probes to detect potentially harmful uses of the model’s enhanced capabilities, and is using Opus 4.6 to help find and patch vulnerabilities in open-source software as part of defensive cybersecurity efforts.
The rivalry between Anthropic and OpenAI has spilled into consumer marketing in dramatic fashion. Both companies will feature prominently during Sunday’s Super Bowl. Anthropic is airing commercials that mock OpenAI’s decision to begin testing advertisements in ChatGPT, with the tagline: “Ads are coming to AI. But not to Claude.”
OpenAI CEO Sam Altman responded by calling the ads “funny” but “clearly dishonest,” posting on X that his company would “obviously never run ads in the way Anthropic depicts them” and that “Anthropic wants to control what people do with AI” while serving “an expensive product to rich people.”
The exchange highlights a fundamental strategic divergence: OpenAI has moved to monetize its massive free user base through advertising, while Anthropic has focused almost exclusively on enterprise sales and premium subscriptions.
The launch occurs against a backdrop of historic market volatility in software stocks. A new AI automation tool from Anthropic PBC sparked a $285 billion rout in stocks across the software, financial services and asset management sectors on Tuesday as investors raced to dump shares with even the slightest exposure. A Goldman Sachs basket of US software stocks sank 6%, its biggest one-day decline since April’s tariff-fueled selloff.
The selloff was triggered by Anthropic’s launch on Friday of plug-ins for its Claude Cowork agent, which enable automated tasks across legal, sales, marketing and data analysis. The move underscored the AI industry’s growing push into industries that can unlock the lucrative enterprise revenue needed to fund massive investments in the technology.
Thomson Reuters plunged 15.83% Tuesday, its biggest single-day drop on record; and Legalzoom.com sank 19.68%. European legal software providers including RELX, owner of LexisNexis, and Wolters Kluwer experienced their worst single-day performances in decades.
Not everyone agrees the selloff is warranted. Nvidia CEO Jensen Huang said on Tuesday that fears AI would replace software and related tools were “illogical,” adding that “time will prove itself.” Mark Murphy, head of U.S. enterprise software research at JPMorgan, said in a Reuters report that it “feels like an illogical leap” to say a new plug-in from an LLM would “replace every layer of mission-critical enterprise software.”
Among the more notable product announcements: Anthropic is releasing Claude in PowerPoint in research preview, allowing users to create presentations using the same AI capabilities that power Claude’s document and spreadsheet work. The integration puts Claude directly inside a core Microsoft product — an unusual arrangement given Microsoft’s 27% stake in OpenAI.
The Anthropic spokesperson framed the move pragmatically in an interview with VentureBeat: “Microsoft has an official add-in marketplace for Office products with multiple add-ins available to help people with slide creation and iteration. Any developer can build a plugin for Excel or PowerPoint. We’re participating in that ecosystem to bring Claude into PowerPoint. This is about participating in the ecosystem and giving users the ability to work with the tools that they want, in the programs they want.”
Data from a16z’s recent enterprise AI survey suggests both Anthropic and OpenAI face an increasingly competitive landscape. While OpenAI remains the most widely used AI provider in the enterprise, with approximately 77% of surveyed companies using it in production in January 2026, Anthropic’s adoption is rising rapidly — from near-zero in March 2024 to approximately 40% using it in production by January 2026.
The survey data also shows that 75% of Anthropic’s enterprise customers are using its most capable models in production, with 89% either testing or in production, figures that comfortably exceed OpenAI’s corresponding rates of 46% and 73% among its customer base.
Enterprise spending on AI continues to accelerate. Average enterprise LLM spend reached $7 million in 2025, up 180% from $2.5 million in 2024, with projections suggesting $11.6 million in 2026 — a 65% increase year-over-year.
Opus 4.6 is available immediately on claude.ai, the Claude API, and major cloud platforms. Developers can access it via claude-opus-4-6 through the API. Pricing remains unchanged at $5 per million input tokens and $25 per million output tokens, with premium pricing of $10/$37.50 for prompts exceeding 200,000 tokens using the 1 million token context window.
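At those list prices, per-request costs are easy to estimate. A small worked example, with the request sizes chosen purely for illustration and the premium tier applied to the whole request once the prompt passes 200,000 tokens, as the pricing above describes:

```python
# Estimate the cost of a single Opus 4.6 request at the listed prices.
def opus_4_6_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost, applying the long-context premium
    when the prompt exceeds 200K tokens."""
    if input_tokens > 200_000:
        input_rate, output_rate = 10.00, 37.50   # premium, per 1M tokens
    else:
        input_rate, output_rate = 5.00, 25.00    # standard, per 1M tokens
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A hypothetical 500K-token prompt with a 20K-token response:
print(f"${opus_4_6_cost(500_000, 20_000):.2f}")   # $5.75
```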
For users who find Opus 4.6 “overthinking” simpler tasks — a characteristic Anthropic acknowledges can add cost and latency — the company recommends adjusting the effort parameter from its default high setting to medium.
The recommendation captures something essential about where the AI industry now stands. These models have grown so capable that their creators must now teach customers how to make them think less. Whether that represents a breakthrough or a warning sign depends entirely on which side of the disruption you’re standing on — and whether you remembered to sell your software stocks before Tuesday.