Anthropic’s Claude Opus 4.6 brings 1M token context and ‘agent teams’ to take on OpenAI’s Codex

Anthropic on Thursday released Claude Opus 4.6, a major upgrade to its flagship artificial intelligence model that the company says plans more carefully, sustains longer autonomous workflows, and outperforms competitors including OpenAI’s GPT-5.2 on key enterprise benchmarks — a release that arrives at a tumultuous moment for the AI industry and global software markets.

The launch comes just three days after OpenAI released its own Codex desktop application in a direct challenge to Anthropic’s Claude Code momentum, and amid a $285 billion rout in software and services stocks that investors attribute partly to fears that Anthropic’s AI tools could disrupt established enterprise software businesses.

For the first time, Anthropic’s Opus-class models will feature a 1 million token context window, allowing the AI to process and reason across vastly more information than previous versions. The company also introduced “agent teams” in Claude Code — a research preview feature that enables multiple AI agents to work simultaneously on different aspects of a coding project, coordinating autonomously.

“We’re focused on building the most capable, reliable, and safe AI systems,” an Anthropic spokesperson told VentureBeat about the announcements. “Opus 4.6 is even better at planning, helping solve the most complex coding tasks. And the new agent teams feature means users can split work across multiple agents — one on the frontend, one on the API, one on the migration — each owning its piece and coordinating directly with the others.”

Why OpenAI and Anthropic are locked in an all-out war for enterprise developers

The release intensifies an already fierce competition between Anthropic and OpenAI, the two most valuable privately held AI companies in the world. OpenAI on Monday released a new desktop application for its Codex artificial intelligence coding system, a tool the company says transforms software development from a collaborative exercise with a single AI assistant into something more akin to managing a team of autonomous workers.

AI coding assistants have exploded in popularity over the last year, and OpenAI said more than 1 million developers have used Codex in the past month. The new Codex app is part of OpenAI’s ongoing effort to lure users and market share away from rivals like Anthropic and Cursor.

The timing of Anthropic’s release — just 72 hours after OpenAI’s Codex launch — underscores the breakneck pace of competition in AI development tools. OpenAI faces intensifying competition from Anthropic, which posted the largest share increase of any frontier lab since May 2025, according to a recent Andreessen Horowitz survey. Forty-four percent of enterprises now use Anthropic in production, driven by rapid capability gains in software development since late 2024. The desktop launch is a strategic counter to Claude Code’s momentum.

According to Anthropic’s announcement, Opus 4.6 achieves the highest score on Terminal-Bench 2.0, an agentic coding evaluation, and leads all other frontier models on Humanity’s Last Exam, a complex multi-discipline reasoning test. On GDPval-AA, a benchmark measuring performance on economically valuable knowledge work tasks in finance, legal and other domains, Opus 4.6 outperforms OpenAI’s GPT-5.2 by approximately 144 Elo points, which translates to obtaining a higher score approximately 70% of the time.
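The link between a 144-point rating gap and a roughly 70% win rate can be checked with the standard Elo expected-score formula. The sketch below assumes GDPval-AA ratings follow the conventional Elo logistic curve with a 400-point scale, which the announcement does not spell out; the function name is illustrative.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated side under the standard Elo curve.

    Assumes the conventional base-10 logistic with a 400-point scale; the
    benchmark's exact rating method is not described in the article.
    """
    return 1.0 / (1.0 + 10 ** (-rating_gap / scale))


if __name__ == "__main__":
    gap = 144  # reported gap between Opus 4.6 and GPT-5.2 on GDPval-AA
    print(f"Expected win rate at +{gap} Elo: {elo_expected_score(gap):.1%}")
```

Under that assumption the expected score comes out just under 70%, in line with the figure Anthropic cites.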

Inside Claude Code’s $1 billion revenue milestone and growing enterprise footprint

The stakes are substantial. Asked about Claude Code’s financial performance, the Anthropic spokesperson noted that in November, the company announced that Claude Code reached $1 billion in run rate revenue only six months after becoming generally available in May 2025.

The spokesperson highlighted major enterprise deployments: “Claude Code is used by Uber across teams like software engineering, data science, finance, and trust and safety; wall-to-wall deployment across Salesforce’s global engineering org; tens of thousands of devs at Accenture; and companies across industries like Spotify, Rakuten, Snowflake, Novo Nordisk, and Ramp.”

That enterprise traction has translated into skyrocketing valuations. Earlier this month, Anthropic signed a term sheet for a $10 billion funding round at a $350 billion valuation. Bloomberg reported that Anthropic is simultaneously working on a tender offer that would allow employees to sell shares at that valuation, offering liquidity to staffers who have watched the company’s worth multiply since its 2021 founding.

How Opus 4.6 solves the ‘context rot’ problem that has plagued AI models

One of Opus 4.6’s most significant technical improvements addresses what the AI industry calls “context rot,” the degradation of model performance as conversations grow longer. Anthropic says Opus 4.6 scores 76% on MRCR v2, a needle-in-a-haystack benchmark testing a model’s ability to retrieve information hidden in vast amounts of text, compared to just 18.5% for Sonnet 4.5.

“This is a qualitative shift in how much context a model can actually use while maintaining peak performance,” the company said in its announcement.

The model also supports outputs of up to 128,000 tokens — enough to complete substantial coding tasks or documents without breaking them into multiple requests.

For developers, Anthropic is introducing several new API features alongside the model: adaptive thinking, which allows Claude to decide when deeper reasoning would be helpful rather than requiring a binary on-off choice; four effort levels (low, medium, high, max) to control intelligence, speed and cost tradeoffs; and context compaction, a beta feature that automatically summarizes older context to enable longer-running tasks.

Anthropic’s delicate balancing act: Building powerful AI agents without losing control

Anthropic, which has built its brand around AI safety research, emphasized that Opus 4.6 maintains alignment with its predecessors despite its enhanced capabilities. On the company’s automated behavior audit measuring misaligned behaviors such as deception, sycophancy, and cooperation with misuse, Opus 4.6 “showed a low rate” of problematic responses while also achieving “the lowest rate of over-refusals — where the model fails to answer benign queries — of any recent Claude model.”

When asked how Anthropic thinks about safety guardrails as Claude becomes more agentic, particularly with multiple agents coordinating autonomously, the spokesperson pointed to the company’s published framework: “Agents have tremendous potential for positive impacts in work but it’s important that agents continue to be safe, reliable, and trustworthy. We outlined our framework for developing safe and trustworthy agents last year which shares core principles developers should consider when building agents.”

The company said it has developed six new cybersecurity probes to detect potentially harmful uses of the model’s enhanced capabilities, and is using Opus 4.6 to help find and patch vulnerabilities in open-source software as part of defensive cybersecurity efforts.

Sam Altman vs. Dario Amodei: The Super Bowl ad battle that exposed AI’s deepest divisions

The rivalry between Anthropic and OpenAI has spilled into consumer marketing in dramatic fashion. Both companies will feature prominently during Sunday’s Super Bowl. Anthropic is airing commercials that mock OpenAI’s decision to begin testing advertisements in ChatGPT, with the tagline: “Ads are coming to AI. But not to Claude.”

OpenAI CEO Sam Altman responded by calling the ads “funny” but “clearly dishonest,” posting on X that his company would “obviously never run ads in the way Anthropic depicts them” and that “Anthropic wants to control what people do with AI” while serving “an expensive product to rich people.”

The exchange highlights a fundamental strategic divergence: OpenAI has moved to monetize its massive free user base through advertising, while Anthropic has focused almost exclusively on enterprise sales and premium subscriptions.

The $285 billion stock selloff that revealed Wall Street’s AI anxiety

The launch occurs against a backdrop of historic market volatility in software stocks. A new AI automation tool from Anthropic PBC sparked a $285 billion rout in stocks across the software, financial services and asset management sectors on Tuesday as investors raced to dump shares with even the slightest exposure. A Goldman Sachs basket of US software stocks sank 6%, its biggest one-day decline since April’s tariff-fueled selloff.

The selloff was triggered by a new legal tool from Anthropic, which showed the AI industry’s growing push into industries that can unlock lucrative enterprise revenue needed to fund massive investments in the technology. One trigger for Tuesday’s selloff was Anthropic’s launch of plug-ins for its Claude Cowork agent on Friday, enabling automated tasks across legal, sales, marketing and data analysis.

Thomson Reuters plunged 15.83% on Tuesday, its biggest single-day drop on record, and LegalZoom.com sank 19.68%. European legal software providers including RELX, owner of LexisNexis, and Wolters Kluwer experienced their worst single-day performances in decades.

Not everyone agrees the selloff is warranted. Nvidia CEO Jensen Huang said on Tuesday that fears AI would replace software and related tools were “illogical” and “time will prove itself.” Mark Murphy, head of U.S. enterprise software research at JPMorgan, said in a Reuters report it “feels like an illogical leap” to say a new plug-in from an LLM would “replace every layer of mission-critical enterprise software.”

What Claude’s new PowerPoint integration means for Microsoft’s AI strategy

Among the more notable product announcements: Anthropic is releasing Claude in PowerPoint in research preview, allowing users to create presentations using the same AI capabilities that power Claude’s document and spreadsheet work. The integration puts Claude directly inside a core Microsoft product — an unusual arrangement given Microsoft’s 27% stake in OpenAI.

The Anthropic spokesperson framed the move pragmatically in an interview with VentureBeat: “Microsoft has an official add-in marketplace for Office products with multiple add-ins available to help people with slide creation and iteration. Any developer can build a plugin for Excel or PowerPoint. We’re participating in that ecosystem to bring Claude into PowerPoint. This is about participating in the ecosystem and giving users the ability to work with the tools that they want, in the programs they want.”

The data behind enterprise AI adoption: Who’s winning and who’s losing ground

Data from a16z’s recent enterprise AI survey suggests both Anthropic and OpenAI face an increasingly competitive landscape. While OpenAI remains the most widely used AI provider in the enterprise, with approximately 77% of surveyed companies using it in production in January 2026, Anthropic’s adoption is rising rapidly — from near-zero in March 2024 to approximately 40% using it in production by January 2026.

The survey data also shows that 75% of Anthropic’s enterprise customers are using it in production, with 89% either testing or in production. Both figures exceed OpenAI’s 46% in production and 73% testing or in production among its own customer base.

Enterprise spending on AI continues to accelerate. Average enterprise LLM spend reached $7 million in 2025, up 180% from $2.5 million in 2024, with projections suggesting $11.6 million in 2026 — a 65% increase year-over-year.
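The growth figures above follow from a simple percentage-change calculation. The short sketch below reproduces it from the dollar amounts quoted in the survey summary; the helper name and dictionary layout are illustrative only.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change going from `old` to `new`."""
    return (new - old) / old * 100.0


# Average enterprise LLM spend in millions of US dollars, as quoted above.
# The 2026 value is a projection, not an observed figure.
spend = {2024: 2.5, 2025: 7.0, 2026: 11.6}

growth_2025 = pct_change(spend[2024], spend[2025])
growth_2026 = pct_change(spend[2025], spend[2026])

if __name__ == "__main__":
    print(f"2024 -> 2025: {growth_2025:.1f}%")
    print(f"2025 -> 2026: {growth_2026:.1f}%")
```

The first step works out to 180%, and the second to a little under 66%, consistent with the roughly 65% increase cited.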

Pricing, availability, and what developers need to know about Claude Opus 4.6

Opus 4.6 is available immediately on claude.ai, the Claude API, and major cloud platforms. Developers can access it via claude-opus-4-6 through the API. Pricing remains unchanged at $5 per million input tokens and $25 per million output tokens, with premium pricing of $10/$37.50 for prompts exceeding 200,000 tokens using the 1 million token context window.
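For budgeting purposes, the published rates translate into a per-request estimate along the following lines. This is a rough sketch, not a billing tool: it assumes the premium rate applies to the whole request once the prompt exceeds 200,000 tokens, and it ignores any discounts (such as caching or batch pricing) that the article does not discuss. The function and constant names are hypothetical.

```python
# Rates in US dollars per million tokens, as quoted in the article.
STANDARD_RATES = (5.00, 25.00)   # (input, output)
PREMIUM_RATES = (10.00, 37.50)   # (input, output) for long-context prompts
LONG_CONTEXT_THRESHOLD = 200_000  # prompt tokens


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request.

    Assumption: once the prompt exceeds the threshold, the premium rate
    applies to every input and output token in that request.
    """
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        in_rate, out_rate = PREMIUM_RATES
    else:
        in_rate, out_rate = STANDARD_RATES
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


if __name__ == "__main__":
    print(f"100K in / 10K out: ${estimate_cost_usd(100_000, 10_000):.3f}")
    print(f"500K in / 10K out: ${estimate_cost_usd(500_000, 10_000):.3f}")
```

With these assumptions, a 100,000-token prompt with a 10,000-token reply costs about $0.75, while a 500,000-token prompt with the same reply costs about $5.38.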

For users who find Opus 4.6 “overthinking” simpler tasks — a characteristic Anthropic acknowledges can add cost and latency — the company recommends adjusting the effort parameter from its default high setting to medium.

The recommendation captures something essential about where the AI industry now stands. These models have grown so capable that their creators must now teach customers how to make them think less. Whether that represents a breakthrough or a warning sign depends entirely on which side of the disruption you’re standing on — and whether you remembered to sell your software stocks before Tuesday.

AI for transformation: How SAP’s Joule for Consultants reimagines project delivery

Presented by SAP


SAP’s AI solution, Joule, has already transformed how business users work — turning siloed data and tasks into intelligent, connected workflows. But consultants on SAP projects face a different set of needs: navigating complex implementations, evolving best practices, and the pressure to deliver faster. They need timely, expert-level guidance. They need an AI partner that can deliver accurate SAP knowledge in an instant. Enter SAP Joule for Consultants, purpose-built to help system integrators and consulting teams drive smarter, faster outcomes for their clients.

“The consultant’s focus is not using our business applications — it’s to help implement those systems, perform upgrades, and at the largest scale, help our customers transform their on-premises SAP ERP to SAP Business Suite in the cloud,” says Victor Alvarez, head of product marketing, Joule, at SAP. “Acting as a knowledge-grounded AI teammate, SAP Joule for Consultants delivers trusted answers at critical moments, guides design decisions, and keeps projects aligned with SAP’s latest best practices.”

It can both compress timelines and improve project execution at every stage of delivery, resulting in a higher quality of service, he adds.

“Our customers want to complete their SAP projects as quickly as possible to realize the value of those initiatives sooner,” Alvarez says. “SAP Joule for Consultants puts the most accurate and up-to-date information at the fingertips of the consultants and system integrators performing these projects. It helps consultants decide and act with agility, reduce overall costs, and shorten project timelines without sacrificing quality.”

Building a better AI teammate

AI is no longer optional in consulting. It’s reshaping how projects are designed, delivered, and scaled. For SAP consulting practices, this shift is especially pronounced as SAP cloud transformation and implementation projects demand deep knowledge and expertise. AI can help surface knowledge in seconds, making consultants more productive and improving project outcomes. As a result, practice leaders are not considering whether to use AI, but which AI solution to put at the core of their delivery model.

What sets SAP Joule for Consultants apart is its underlying knowledge base. Any AI solution can generate an answer. What matters most is whether the answer is accurate and can be trusted to guide project execution. SAP Joule for Consultants is grounded in SAP’s most authoritative, comprehensive, and up-to-date knowledge base. The knowledge base is expertly curated by SAP and includes exclusive content not available to third-party solutions or AI-enabled search engines.

“We recognized an opportunity to help our partners transform their consulting practices with the most reliable AI assistance available,” Alvarez says. “We’ve leveraged our expertise and vast set of non-public customer enablement content to ground SAP Joule for Consultants with the most complete, accurate, and trustworthy knowledge.”

For SAP consultants at every level

Consultants at all experience levels have traditionally spent enormous time searching for information before they can take the next step — whether it’s making an architectural decision that shapes an entire implementation or solving a specific, tactical issue.

All that research time is slashed to an extraordinary degree with SAP Joule for Consultants. By delivering expert-level answers drawn from a continually updated knowledge base, the solution helps consultants keep work moving and make smarter decisions that minimize rework.

“If you think about cloud transformation, there’s a huge return on investment,” Alvarez explains. “For SAP customers, this ROI goes beyond traditional cloud benefits to also unlock the AI innovations available in SAP Business Suite. SAP Joule for Consultants is part of our commitment to help our customers reduce time to value and achieve AI impact at scale.”

Enabling SAP customers too

The solution isn’t just for consultants; it’s also for the IT departments within SAP customers. Organizations don’t have the luxury of engaging a consultant for every SAP project — sometimes the work needs to, or can be, done in-house. So, in the same way that SAP Joule for Consultants enables consultants with expert-level knowledge, internal IT teams can benefit from on-demand answers and guidance to accelerate internally staffed projects.

“They can use SAP Joule for Consultants to access the knowledge needed to take on more complex projects too,” Alvarez says. “That opens up more opportunity to engage system integrators on the more strategic and impactful projects.”

IT teams also benefit from elevated collaboration with their consulting partners. SAP Joule for Consultants helps bridge knowledge gaps that traditionally limit how much internal staff can contribute to and influence project outcomes.

“SAP Joule for Consultants makes customers more informed, so that they can collaborate with their consultant partners in a more impactful way,” he explains. “It illuminates the nuances of a project, helping clients ask the right questions and participate more actively in critical project decisions to ensure the best possible project outcomes.”

Powerful new features coming soon

SAP Joule for Consultants has been generally available for almost one year, and the demand from both large firms like KPMG and smaller organizations has been tremendous, Alvarez says — a strong validation of its value proposition. The next major milestone is already on the horizon: partner knowledge integration.

“Our system integrator partners have institutional knowledge that embodies their industry experience and the best practices that distinguish their services and create predictable project outcomes,” he explains. “Soon, they’ll be able to add that institutional knowledge to their instance of SAP Joule for Consultants — creating a single, unified solution that puts both SAP and firm-specific knowledge at their consultants’ fingertips.”


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

OpenAI launches centralized agent platform as enterprises push for multi-vendor flexibility

OpenAI launched Frontier, a platform for building and governing enterprise AI agents, as companies increasingly question whether to commit to single-vendor systems or maintain multi-model flexibility.

The platform offers integrated tools for agent execution, evaluation, and governance in one place. But Frontier also reflects OpenAI’s push into enterprise AI at a moment when organizations are actively moving toward multi-vendor architectures — creating tension between OpenAI’s centralized approach and what enterprises say they want.

Tatyana Mamut, CEO of the agent observability company Wayfound, told VentureBeat that enterprises don’t want to be locked into a single vendor or platform because AI strategies are ever-evolving. 

“They’re not ready to fully commit. Everybody I talk to knows that eventually they’ll move to a one-size-fits-all solution, but right now, things are moving too fast for us to commit,” Mamut said. “This is the reason why most AI contracts are not traditional SaaS contracts; nobody is signing multi-year contracts anymore because if something great comes out next month, I need to be able to pivot, and I can’t be locked in.”

How Frontier compares to AWS Bedrock

OpenAI is not the first to offer an end-to-end platform for building, prototyping, testing, deploying, and monitoring agents. AWS launched Bedrock AgentCore with the idea that there will be enterprise customers who don’t want to assemble an extensive collection of tools and platforms for their agentic AI projects. 

However, AWS offers a significant advantage: access to multiple LLMs for building agents. Enterprises can choose a hybrid system in which an agent selects the best LLM for each task. OpenAI has not made it clear if it will open Frontier to models and tools from other vendors.

OpenAI did not say whether Frontier users can bring any third-party tools they already use to the platform, and it didn’t comment on why it chose to release Frontier now when enterprises are considering more hybrid systems.

But the company is working with companies including Clay, Abridge, Harvey, Decagon, Ambience, and Sierra to design solutions within Frontier. 

What is Frontier

Frontier is a single platform that offers access to different enterprise-grade tools from OpenAI. The company told VentureBeat that Frontier will not replace offerings such as the Agents SDK, AgentKit, or its suite of APIs. 

OpenAI said Frontier helps bring context, agent execution, and evaluation into a single platform rather than multiple systems and tools.

“Frontier gives agents the same skills people need to succeed at work: shared context, onboarding, hands-on learning with feedback, and clear permissions and boundaries. That’s how teams move beyond isolated use cases to AI co-workers that work across the business,” OpenAI said in a blog post.

Users can connect their data sources, CRM tools, and other internal applications directly to Frontier, effectively creating a semantic layer that normalizes permissions and retrieval logic for agents built on the platform to pull information from. Frontier has an agent execution environment, which can run on local environments, cloud infrastructures, or “OpenAI-hosted runtimes without forcing teams to reinvent how work gets done.”

Built-in evaluation structures, security, and governance dashboards allow teams to monitor agent behavior and performance. These give organizations visibility into their agents’ success rates, accuracy, and latency. OpenAI said Frontier incorporates its enterprise-grade data security layer, including the option for companies to choose where to store their data at rest.

Frontier launched with a small group of initial customers, including HP, Intuit, Oracle, State Farm, Thermo Fisher, and Uber.

Security and governance concerns

Frontier is available only to a select group of customers with wider availability coming soon. Enterprise providers are already weighing what the platform needs to address.

Ellen Boehm, senior vice president for IoT and AI Identity Innovation at Keyfactor, told VentureBeat that companies will still need to focus their agents on security and identity. 

“Agent platforms like OpenAI’s Frontier model are critical for democratizing AI adoption beyond the enterprise,” she said. “This levels the playing field — startups get enterprise-grade capabilities without enterprise-scale infrastructure, which means more innovation and healthier competition across the market. But accessible doesn’t mean you skip the fundamentals.” 

Salesforce AI executive vice president and GM Madhav Thattai, who is overseeing an agent builder and library platform at his company, noted that no matter the platform, enterprises need to focus agents on value.

“What we’re finding is that to build an agent that actually does something at scale that creates real ROI is pretty challenging,” Thattai said. “The true business value for enterprises doesn’t reside in the AI model alone. It’s in the ‘last mile.’”

“That is the software layer that translates raw technology into trusted, autonomous execution. To traverse this last mile, agents must be able to reason through complexity and operate on trusted business data, which is exactly where we are focusing.” 

The ‘brownie recipe problem’: why LLMs must have fine-grained context to deliver real-time results

Today’s LLMs excel at reasoning, but can still struggle with context. This is particularly true in real-time ordering systems like Instacart.

Instacart CTO Anirban Kundu calls it the “brownie recipe problem.”

It’s not as simple as telling an LLM ‘I want to make brownies.’ To be truly assistive when planning the meal, the model must go beyond that simple directive to understand what’s available in the user’s market based on their preferences (say, organic eggs versus regular eggs) and factor that into what’s deliverable in their geography so food doesn’t spoil, among other critical factors.

For Instacart, the challenge is juggling latency with the right mix of context to provide experiences in, ideally, less than one second’s time. 

“If reasoning itself takes 15 seconds, and if every interaction is that slow, you’re gonna lose the user,” Kundu said at a recent VB event. 

Mixing reasoning, real-world state, personalization

In grocery delivery, there’s a “world of reasoning” and a “world of state” (what’s available in the real world), Kundu noted, both of which must be understood by an LLM along with user preference. But it’s not as simple as loading the entirety of a user’s purchase history and known interests into a reasoning model.

“Your LLM is gonna blow up into a size that will be unmanageable,” said Kundu. 

To get around this, Instacart splits processing into chunks. First, data is fed into a large foundational model that can understand intent and categorize products. That processed data is then routed to small language models (SLMs) designed for catalog context (the types of food or other items that work together) and semantic understanding. 

In the case of catalog context, the SLM must be able to process multiple levels of details around the order itself as well as the different products. For instance, what products go together and what are their relevant replacements if the first choice isn’t in stock? These substitutions are “very, very important” for a company like Instacart, which Kundu said has “over double digit cases” where a product isn’t available in a local market. 

In terms of semantic understanding, say a shopper is looking to buy healthy snacks for children. The model needs to understand what a healthy snack is and what foods are appropriate for, and appeal to, an 8-year-old, then identify relevant products. And, when those particular products aren’t available in a given market, the model also has to find related subsets of products.

Then there’s the logistical element. For example, a product like ice cream melts quickly, and frozen vegetables also don’t fare well when left out in warmer temperatures. The model must have this context and calculate an acceptable deliverability time. 

“So you have this intent understanding, you have this categorization, then you have this other portion about logistically, how do you do it?” Kundu noted.
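The three stages Kundu describes (intent understanding, catalog-aware substitution, and a logistics check) can be pictured as a small pipeline. The toy sketch below is not Instacart’s implementation: the catalog, the substitution table, the perishability limits, and every function name are invented for illustration, and simple lookups stand in for the large model and the small language models.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Product:
    name: str
    in_stock: bool
    max_transit_minutes: int  # hypothetical limit before the item spoils


# Hypothetical local catalog and substitution preferences.
CATALOG = {
    "organic eggs": Product("organic eggs", False, 120),
    "regular eggs": Product("regular eggs", True, 120),
    "ice cream": Product("ice cream", True, 30),
}
SUBSTITUTES = {"organic eggs": ["regular eggs"]}
RECIPES = {"brownies": ["organic eggs", "ice cream"]}


def understand_intent(request: str) -> list[str]:
    """Stand-in for the large model: turn a request into wanted items."""
    wanted: list[str] = []
    for dish, items in RECIPES.items():
        if dish in request.lower():
            wanted.extend(items)
    return wanted


def resolve_catalog(item: str) -> str | None:
    """Stand-in for the catalog SLM: the item, or its first in-stock substitute."""
    for candidate in [item, *SUBSTITUTES.get(item, [])]:
        if CATALOG[candidate].in_stock:
            return candidate
    return None


def deliverable(item: str, delivery_minutes: int) -> bool:
    """Logistics check: can the item survive the estimated delivery time?"""
    return delivery_minutes <= CATALOG[item].max_transit_minutes


def plan_order(request: str, delivery_minutes: int) -> list[str]:
    plan = []
    for wanted in understand_intent(request):
        chosen = resolve_catalog(wanted)
        if chosen is not None and deliverable(chosen, delivery_minutes):
            plan.append(chosen)
    return plan


if __name__ == "__main__":
    print(plan_order("I want to make brownies", delivery_minutes=25))
    print(plan_order("I want to make brownies", delivery_minutes=45))
```

In this toy setup, the out-of-stock organic eggs are swapped for regular eggs, and the ice cream drops out of the plan once the delivery estimate exceeds its assumed 30-minute limit.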

Avoiding ‘monolithic’ agent systems

Like many other companies, Instacart is experimenting with AI agents, and it has found that a mix of agents works better than a “single monolith” that handles many different tasks. Following the Unix philosophy of small, focused, modular tools helps when dealing with, for instance, different payment systems that have varying failure modes, Kundu explained.

“Having to build all of that within a single environment was very unwieldy,” he said. Further, agents on the back end talk to many third-party platforms, including point-of-sale (POS) and catalog systems. Naturally, not all of them behave the same way; some are more reliable than others, and they have different update intervals and feeds. 

“So being able to handle all of those things, we’ve gone down this route of microagents rather than agents that are dominantly large in nature,” said Kundu. 

To manage agents, Instacart has integrated with Anthropic’s Model Context Protocol (MCP), which standardizes and simplifies the process of connecting AI models to different tools and data sources.

The company also uses Google’s Universal Commerce Protocol (UCP) open standard, which allows AI agents to directly interact with merchant systems. 

However, Kundu’s team still deals with challenges. As he noted, it’s not about whether integration is possible, but how reliably those integrations behave and how well they’re understood by users. Discovery can be difficult, not just in identifying available services, but understanding which ones are appropriate for which task.

Instacart has had to implement MCP and UCP in “very different” cases, and the biggest problems they’ve run into are failure modes and latency, Kundu noted. “The response times and understandings of both of those services are very, very different. I would say we spend probably two thirds of the time fixing those error cases.”

Mistral drops Voxtral Transcribe 2, an open-source speech model that runs on-device for pennies

Mistral AI, the Paris-based startup positioning itself as Europe’s answer to OpenAI, released a pair of speech-to-text models on Wednesday that the company says can transcribe audio faster, more accurately, and far more cheaply than anything else on the market — all while running entirely on a smartphone or laptop.

The announcement marks the latest salvo in an increasingly competitive battle over voice AI, a technology that enterprise customers see as essential for everything from automated customer service to real-time translation. But unlike offerings from American tech giants, Mistral’s new Voxtral Transcribe 2 models are designed to process sensitive audio without ever transmitting it to remote servers — a feature that could prove decisive for companies in regulated industries like healthcare, finance, and defense.

“You’d like your voice and the transcription of your voice to stay close to where you are, meaning you want it to happen on device—on a laptop, a phone, or a smartwatch,” Pierre Stock, Mistral’s vice president of science operations, said in an interview with VentureBeat. “We make that possible because the model is only 4 billion parameters. It’s small enough to fit almost anywhere.”

Mistral splits its new AI transcription technology into batch processing and real-time applications

Mistral released two distinct models under the Voxtral Transcribe 2 banner, each engineered for different use cases.

  • Voxtral Mini Transcribe V2 handles batch transcription, processing pre-recorded audio files in bulk. The company says it achieves the lowest word error rate of any transcription service and is available via API at $0.003 per minute, roughly one-fifth the price of major competitors. The model supports 13 languages, including English, Mandarin Chinese, Japanese, Arabic, Hindi, and several European languages.

  • Voxtral Realtime, as its name suggests, processes live audio with a latency that can be configured down to 200 milliseconds — the blink of an eye. Mistral claims this is a breakthrough for applications where even a two-second delay proves unacceptable: live subtitling, voice agents, and real-time customer service augmentation.

The Realtime model ships under an Apache 2.0 open-source license, meaning developers can download the model weights from Hugging Face, modify them, and deploy them without paying Mistral a licensing fee. For companies that prefer not to run their own infrastructure, API access costs $0.006 per minute.
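To put the per-minute API prices in context, a back-of-the-envelope estimate for a given volume of audio might look like the sketch below. It uses only the two rates quoted above and assumes simple linear billing by the minute with no volume discounts; the names are illustrative.

```python
# Per-minute API prices in US dollars, as quoted in the article.
BATCH_USD_PER_MINUTE = 0.003     # Voxtral Mini Transcribe V2
REALTIME_USD_PER_MINUTE = 0.006  # Voxtral Realtime via API


def transcription_cost_usd(hours_of_audio: float, usd_per_minute: float) -> float:
    """Estimated API cost, assuming linear per-minute billing."""
    return hours_of_audio * 60 * usd_per_minute


if __name__ == "__main__":
    hours = 1_000
    batch = transcription_cost_usd(hours, BATCH_USD_PER_MINUTE)
    realtime = transcription_cost_usd(hours, REALTIME_USD_PER_MINUTE)
    print(f"{hours} hours, batch API:    ${batch:,.2f}")
    print(f"{hours} hours, realtime API: ${realtime:,.2f}")
```

Under these assumptions, 1,000 hours of audio would cost about $180 through the batch API and about $360 through the realtime API. Self-hosting the openly licensed Realtime weights avoids the API fee but shifts the cost to the operator’s own hardware.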

Stock said Mistral is betting on the open-source community to expand the model’s reach. “The open-source community is very imaginative when it comes to applications,” he said. “We’re excited to see what they’re going to do.”

Why on-device AI processing matters for enterprises handling sensitive data

The decision to engineer models small enough to run locally reflects a calculation about where the enterprise market is heading. As companies integrate AI into ever more sensitive workflows — transcribing medical consultations, financial advisory calls, legal depositions — the question of where that data travels has become a dealbreaker.

Stock painted a vivid picture of the problem during his interview. Current note-taking applications with audio capabilities, he explained, often pick up ambient noise in problematic ways: “It might pick up the lyrics of the music in the background. It might pick up another conversation. It might hallucinate from a background noise.”

Mistral invested heavily in training data curation and model architecture to address these issues. “All of that, we spend a lot of time ironing out the data and the way we train the model to robustify it,” Stock said.

The company also added enterprise-specific features that its American competitors have been slower to implement. Context biasing allows customers to upload a list of specialized terminology — medical jargon, proprietary product names, industry acronyms — and the model will automatically favor those terms when transcribing ambiguous audio. Unlike fine-tuning, which requires retraining the model, context biasing works through a simple API parameter.

“You only need a text list,” Stock explained. “And then the model will automatically bias the transcription toward these acronyms or these weird words. And it’s zero shots, no need for retraining, no need for weird stuff.”
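
The idea of steering a transcript with nothing more than a text list can be illustrated with a toy example. The sketch below is not Mistral's method and does not use Mistral's API: Voxtral applies the bias inside the model, whereas this stand-in corrects a finished transcript after the fact by fuzzy-matching each word against the term list with Python's standard `difflib` module. The function name, the 0.8 similarity cutoff, and the sample sentence are all assumptions made for illustration.

```python
import difflib
from typing import List


def bias_toward_terms(transcript: str, terms: List[str], cutoff: float = 0.8) -> str:
    """Snap near-miss words in a transcript to the closest term in a list.

    A post-hoc stand-in for context biasing: no retraining, only a text list.
    """
    lookup = {term.lower(): term for term in terms}
    corrected = []
    for word in transcript.split():
        # get_close_matches returns at most n candidates scoring >= cutoff.
        match = difflib.get_close_matches(word.lower(), list(lookup), n=1, cutoff=cutoff)
        corrected.append(lookup[match[0]] if match else word)
    return " ".join(corrected)


print(bias_toward_terms("patient takes metformen daily", ["metformin"]))
```

A production system would also need to handle punctuation, multi-word terms, and false positives, which is part of why biasing inside the model is attractive.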

From factory floors to call centers, Mistral targets high-noise industrial environments

Stock described two scenarios that capture how Mistral envisions the technology being deployed.

The first involves industrial auditing. Imagine technicians walking through a manufacturing facility, inspecting heavy machinery while shouting observations over the din of factory noise. “In the end, imagine like a perfect timestamped notes identifying who said what — so diarization — while being super robust,” Stock said. The challenge is handling what he called “weird technical language that no one is able to spell except these people.”

The second scenario targets customer service operations. When a caller contacts a support center, Voxtral Realtime can transcribe the conversation in real time, feeding text to backend systems that pull up relevant customer records before the caller finishes explaining the problem.

“The status will appear for the operator on the screen before the customer stops the sentence and stops complaining,” Stock explained. “Which means you can just interact and say, ‘Okay, I can see the status. Let me correct the address and send back the shipment.’”

He estimated this could reduce typical customer service interactions from multiple back-and-forth exchanges to just two interactions: the customer explains the problem, and the agent resolves it immediately.

Real-time translation across languages could arrive by the end of 2026

For all the focus on transcription, Stock made clear that Mistral views these models as foundational technology for a more ambitious goal: real-time speech-to-speech translation that feels natural.

“Maybe the end goal application and what the model is laying the groundwork for is live translation,” he said. “I speak French, you speak English. It’s key to have minimal latency, because otherwise you don’t build empathy. Your face is not out of sync with what you said one second ago.”

That goal puts Mistral in direct competition with Apple and Google, both of which have been racing to solve the same problem. Google’s latest translation model operates at a two-second delay — ten times slower than what Mistral claims for Voxtral Realtime.

Mistral positions itself as the privacy-first alternative for enterprise customers

Mistral occupies an unusual position in the AI landscape. Founded in 2023 by alumni of Meta and Google DeepMind, the company has raised over $2 billion and now carries a valuation of approximately $13.6 billion. Yet it operates with a fraction of the compute resources available to American hyperscalers — and has built its strategy around efficiency rather than brute force.

“The models we release are enterprise grade, industry leading, efficient — in particular, in terms of cost — can be embedded into the edge, unlocks privacy, unlocks control, transparency,” Stock said.

That approach has resonated particularly with European customers wary of dependence on American technology. In January, France’s Ministry of the Armed Forces signed a framework agreement giving the country’s military access to Mistral’s AI models—a deal that explicitly requires deployment on French-controlled infrastructure.

“I think a big barrier to adoption of voice AI is that, hey, if you’re in a sensitive industry like finance or in manufacturing or healthcare or insurance, you can’t have information you’re talking about just go to the cloud,” Howard Cohen, who participated in the interview alongside Stock, noted. “It needs to be either on device or needs to be on your premise.”

Mistral faces stiff competition from OpenAI, Google, and a rising China

The transcription market has grown fiercely competitive. OpenAI’s Whisper model has become something of an industry standard, available both through API and as downloadable open-source weights. Google, Amazon, and Microsoft all offer enterprise-grade speech services. Specialized players like Assembly AI and Deepgram have built substantial businesses serving developers who need reliable, scalable transcription.

Mistral claims its new models outperform all of them on accuracy benchmarks while undercutting them on price. “We are better than them on the benchmarks,” Stock said. Independent verification of those claims will take time, but the company points to performance on FLEURS, a widely used multilingual speech benchmark, where Voxtral models achieve word error rates competitive with or superior to alternatives from OpenAI and Google.

Perhaps more significantly, Mistral’s CEO Arthur Mensch has warned that American AI companies face pressure from an unexpected direction. Speaking at the World Economic Forum in Davos last month, Mensch dismissed the notion that Chinese AI lags behind the West as “a fairy tale.”

“The capabilities of China’s open-source technology is probably stressing the CEOs in the US,” he said.

The French startup bets that trust will determine the winner in enterprise voice AI

Stock predicted that 2026 would be “the year of note-taking” — the moment when AI transcription becomes reliable enough that users trust it completely.

“You need to trust the model, and the model basically cannot make any mistake, otherwise you would just lose trust in the product and stop using it,” he said. “The threshold is super, super hard.”

Whether Mistral has crossed that threshold remains to be seen. Enterprise customers will be the ultimate judges, and they tend to move slowly, testing claims against reality before committing budgets and workflows to new technology. The audio playground in Mistral Studio, where developers can test Voxtral Transcribe 2 with their own files, went live today.

But Stock’s broader argument deserves attention. In a market where American giants compete by throwing billions of dollars at ever-larger models, Mistral is making a different wager: that in the age of AI, smaller and local might beat bigger and distant. For the executives who spend their days worrying about data sovereignty, regulatory compliance, and vendor lock-in, that pitch may prove more compelling than any benchmark.

The race to dominate enterprise voice AI is no longer just about who builds the most powerful model. It’s about who builds the model you’re willing to let listen.

Kilo CLI 1.0 brings open source vibe coding to your terminal with support for 500+ models

Remote-first AI coding startup Kilo doesn’t think software developers should have to pledge their undying allegiance to any one development environment — and certainly not any one model or harness.

This week, the startup, backed by GitLab co-founder Sid Sijbrandij, unveiled Kilo CLI 1.0, a complete rebuild of its command-line tool that offers support for more than 500 different underlying AI models from proprietary leaders and open source rivals like Alibaba’s Qwen.

It comes just weeks after Kilo launched a Slackbot, powered by the Chinese AI startup MiniMax, that lets developers ship code directly from Slack, Salesforce’s popular messaging service (which VentureBeat also uses).

The release marks a strategic pivot away from the IDE-centric “sidebar” model popularized by industry giants like Cursor and GitHub Copilot, from dedicated apps like the new OpenAI Codex, and even from terminal-based rivals like Codex CLI and Claude Code. Kilo aims instead to embed AI capabilities into every part of the professional software workflow.

By launching a model-agnostic CLI on the heels of its Slack bot, Kilo is making a calculated bet: the future of AI development isn’t about a single interface, but about tools that travel with the engineer between IDEs, terminals, remote servers, and team chat threads.

In a recent interview with VentureBeat, Kilo CEO and co-founder Scott Breitenother explained the necessity of this fluidity: “This experience just feels a little bit too fragmented right now… as an engineer, sometimes I’m going to use the CLI, sometimes I’m going to be in VS Code, and sometimes I’m going to be kicking off an agent from Slack, and folks shouldn’t have to be jumping around.”

He noted that Kilo CLI 1.0 is specifically “built for this world… for the developer who moves between their local IDE, a remote server via SSH, and a terminal session at 2 a.m. to fix a production bug.”

Technology: Rebuilding for ‘Kilo Speed’

Kilo CLI 1.0 is a fundamental architectural shift. While 2025 was the year senior engineers began to take AI vibe coding seriously, Kilo believes 2026 will be defined by the adoption of agents that can manage end-to-end tasks independently.

The new CLI is built on an MIT-licensed, open-source foundation, specifically designed to function in terminal sessions where developers often find themselves during critical production incidents or deep infrastructure work.

For Breitenother, building in the open is non-negotiable: “When you build in the open, you build better products. You get this great flywheel of contributors… your community is not just passive users. They’re actually part of your team that’s helping you develop your product… Honestly, some people might say open source is a weakness, but I think it’s our superpower.”

The core of this “agentic” experience is Kilo’s ability to move beyond simple autocompletion. The CLI supports multiple operational modes:

  • Code Mode: For high-speed generation and multi-file refactors.

  • Architect Mode: For high-level planning and technical strategy.

  • Debug Mode: For systematic problem diagnosis and resolution.

Solving multi-session memory

To solve the persistent issue of “AI amnesia”—where an agent loses context between sessions—Kilo utilizes a “Memory Bank” feature.

This system maintains state by storing context in structured Markdown files within the repository, ensuring that an agent operating in the CLI has the same understanding of the codebase as the one working in a VS Code sidebar or a Slack thread.
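
A minimal sketch of the general pattern, assuming nothing about Kilo's actual file layout: notes are written as Markdown files in a directory inside the repository, and any interface that starts a session reads them back into a single context string. The directory name, file naming scheme, and both function names are hypothetical.

```python
import tempfile
from pathlib import Path


def save_note(bank: Path, topic: str, body: str) -> Path:
    """Write one Markdown note per topic into the memory directory."""
    bank.mkdir(parents=True, exist_ok=True)
    path = bank / f"{topic}.md"
    path.write_text(f"# {topic}\n\n{body.strip()}\n", encoding="utf-8")
    return path


def load_context(bank: Path) -> str:
    """Concatenate all notes in a stable (alphabetical) order."""
    notes = sorted(bank.glob("*.md"))
    return "\n".join(note.read_text(encoding="utf-8") for note in notes)


if __name__ == "__main__":
    # Demo in a throwaway directory standing in for a repository checkout.
    with tempfile.TemporaryDirectory() as repo:
        bank = Path(repo) / "memory-bank"  # hypothetical directory name
        save_note(bank, "architecture", "The API lives in services/api.")
        save_note(bank, "conventions", "Use snake_case for Python modules.")
        print(load_context(bank))
```

Because the notes are ordinary files under version control, every interface that can read the repository sees the same state, which is the property the article describes.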

The synergy between the new CLI and “Kilo for Slack” is central to the company’s “Agentic Anywhere” strategy. Launched in January, the Slack integration allows teams to fix bugs and push pull requests directly from a conversation.

Unlike competing integrations from Cursor or Claude Code, which Kilo claims are limited by single-repo configurations or a lack of persistent thread state, Kilo’s bot can ingest context from across multiple repositories simultaneously.

“Engineering teams don’t make decisions in IDE sidebars. They make them in Slack,” Breitenother emphasized.

Extensibility and the ‘superpower’ of open source

A critical component of Kilo’s technical depth is its support for the Model Context Protocol (MCP). This open standard allows Kilo to communicate with external servers, extending its capabilities beyond local file manipulation.

Through MCP, Kilo agents can integrate with custom tools and resources, such as internal documentation servers or third-party monitoring tools, effectively turning the agent into a specialized member of the engineering team.
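
For readers unfamiliar with the protocol, MCP messages are JSON-RPC 2.0 objects, and the specification defines methods such as `tools/list` and `tools/call`. The Python sketch below only builds the request payloads; the transport (stdio or HTTP) and the initialization handshake are omitted, and the tool name `search_docs` and its arguments are hypothetical stand-ins for something like an internal documentation server.

```python
import json
from itertools import count
from typing import Optional

_request_ids = count(1)  # JSON-RPC requests carry a unique id


def mcp_request(method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request of the kind MCP clients send."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# Ask the server which tools it exposes.
list_tools = mcp_request("tools/list")

# Invoke a (hypothetical) tool by name with structured arguments.
call_tool = mcp_request(
    "tools/call",
    {"name": "search_docs", "arguments": {"query": "deploy runbook"}},
)

print(list_tools)
print(call_tool)
```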

This extensibility is part of Kilo’s broader commitment to model agnosticism. While MiniMax is the default for Slack, the CLI and extension support more than 500 models, including those from Anthropic, OpenAI, and Google (Gemini).

Pricing: The economy of ‘AI output per dollar’

Kilo is also attempting to disrupt the economics of AI development with “Kilo Pass,” a subscription service designed for transparency.

The company charges exact provider API rates with zero commission—$1 of Kilo credits is equivalent to $1 of provider costs.

Breitenother is critical of the “black box” subscription models used by others in the space: “We’re selling infrastructure here… you hit some sort of arbitrary, unclear line, and then you start to get throttled. That’s not how the world’s going to work.”

The Kilo Pass tiers offer “momentum rewards,” providing bonus credits for active subscribers:

  • Starter ($19/mo): Up to $26.60 in credits.

  • Pro ($49/mo): Up to $68.60 in credits.

  • Expert ($199/mo): Up to $278.60 in credits.
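
The maximum bonus implied by each tier can be checked with simple arithmetic. This short Python sketch uses only the prices and credit ceilings listed above; the function name is an illustrative choice.

```python
# (monthly price, maximum credits) in USD, as listed in the article.
TIERS = {
    "Starter": (19.00, 26.60),
    "Pro": (49.00, 68.60),
    "Expert": (199.00, 278.60),
}


def max_bonus_rate(price: float, max_credits: float) -> float:
    """Return the maximum bonus as a fraction of the subscription price."""
    return round(max_credits / price - 1, 4)


for name, (price, max_credits) in TIERS.items():
    print(f"{name}: up to {max_bonus_rate(price, max_credits):.0%} bonus credits")
```

All three tiers top out at the same rate, a 40% bonus over the subscription price.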

To incentivize early adoption, Kilo is currently offering a “Double Welcome Bonus” until February 6th, giving users 50% free credits for their first two months.

For power users like Sylvain, this flexibility is a major draw: “Kilo Pass is exactly what I’ve been waiting for. I can use my credits when I need them and save them when I don’t—it finally fits how I actually use AI.”

Community, security, and competition

The arrival of Kilo CLI 1.0 places it in direct conversation with terminal-native heavyweights: Anthropic’s Claude Code and Block’s Goose.

Outside of the terminal, in the more full-featured IDE space, OpenAI recently launched a new Codex desktop app for macOS.

Claude Code offers a highly polished experience, but it comes with vendor lock-in and high costs—up to $200 per month for tiers that still include token-based usage caps and rate limits. Independent analysis suggests these limits are often exhausted within minutes of intensive work on large codebases.

OpenAI’s new Codex app similarly favors a platform-locked approach, functioning as a “command center for agents” that allows developers to supervise AI systems running independently for up to 30 minutes.

While Codex introduces powerful features like “Skills” to connect to tools like Figma and Linear, it is fundamentally designed to defend OpenAI’s ecosystem in a highly contested market.

Conversely, Kilo CLI 1.0 utilizes the MIT-licensed OpenCode foundation to deliver a production-ready Terminal User Interface (TUI) that allows engineers to swap between 500+ models.

This portability allows teams to select the best cost-to-performance ratio—perhaps using a lightweight model for documentation but swapping to a frontier model for complex debugging.
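
One way to picture that trade-off is a small routing table that picks the cheapest model meeting a task's minimum capability. Everything in this Python sketch is a placeholder: the model names, the prices, the capability scores, and the task thresholds are invented for illustration and do not describe Kilo's catalog or any real provider's pricing.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    cost_per_mtok: float  # placeholder USD per million tokens
    capability: int       # placeholder score from 1 (weak) to 10 (strong)


CATALOG = [
    Model("lightweight-model", 0.5, 4),
    Model("mid-model", 3.0, 7),
    Model("frontier-model", 15.0, 9),
]

# Hypothetical minimum capability each task type needs.
REQUIRED_CAPABILITY = {"documentation": 3, "refactor": 6, "debugging": 8}


def pick_model(task: str) -> Model:
    """Return the cheapest catalog model that meets the task's threshold."""
    needed = REQUIRED_CAPABILITY[task]
    eligible = [m for m in CATALOG if m.capability >= needed]
    return min(eligible, key=lambda m: m.cost_per_mtok)


for task in REQUIRED_CAPABILITY:
    print(f"{task}: {pick_model(task).name}")
```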

Regarding security, Kilo ensures that models are hosted on U.S.-compliant infrastructure like AWS Bedrock, allowing proprietary code to remain within trusted perimeters while leveraging the most efficient intelligence available.

Goose provides an open-source alternative that runs entirely on a user’s local machine for free, but seems more localized and experimental.

Kilo positions itself as the middle path: a production-hardened tool that maintains open-source transparency while providing the infrastructure to scale across an enterprise.

This contrasts with the broader market’s dual-use concerns; while OpenAI builds sandboxes to secure autonomous agents, Kilo’s open-core nature allows for a “superpower” level of community auditing and contribution.

The future: A ‘mech suit’ for the mind

With $8 million in seed funding and a “Right of First Refusal” agreement with GitLab lasting until August 2026, Kilo is positioning itself as the backbone of the next-generation developer stack.

Breitenother views these tools as “exoskeletons” or “mech suits” for the mind, rather than replacements for human engineers.

“We’ve actually moved our engineers to be product owners,” Breitenother reveals. “The time they freed up from writing code, they’re actually doing much more thinking. They’re setting the strategy for the product.”

By unbundling the engineering stack—separating the agentic interface from the model and the model from the IDE—Kilo provides a roadmap for a future where developers think architecturally while machines build the structure.

“It’s the closest thing to magic that I think we can encounter in our life,” Breitenother concludes. For those seeking “Kilo Speed,” the IDE sidebar is just the beginning.

The hidden tax of “Franken-stacks” that sabotages AI strategies

Presented by Certinia


The initial euphoria around Generative and Agentic AI has shifted to a pragmatic, often frustrated, reality. CIOs and technical leaders are asking why their pilot programs, even those designed to automate the simplest of workflows, aren’t delivering the magic promised in demos.

When AI fails to answer a basic question or complete an action correctly, the instinct is to blame the model. We assume the LLM isn’t “smart” enough. But that blame is misplaced. AI doesn’t struggle because it lacks intelligence. It struggles because it lacks context.

In the modern enterprise, context is trapped in a maze of disconnected point solutions, brittle APIs, and latency-ridden integrations — a “Franken-stack” of disparate technologies. And for services-centric organizations in particular, where the real truth of the business lives in the handoffs between sales, delivery, success, and finance, this fragmentation is existential. If your architecture walls off these functions, your AI roadmap is destined for failure.

Context can’t travel through an API

For the last decade, the standard IT strategy was “best-of-breed.” You bought the best CRM for sales, a separate tool for managing projects, a standalone customer success platform (CSP), and an ERP for finance, stitched them together with APIs and middleware (if you were lucky), and declared victory.

For human workers, this was annoying but manageable. A human knows that the project status in the project management tool might be 72 hours behind the invoice data in the ERP. Humans possess the intuition to bridge the gap between systems.

But AI doesn’t have intuition. It has queries. When you ask an AI agent to “staff this new project we won for margin and utilization impact,” it executes a query based on the data it can access now. If your architecture relies on integrations to move data, the AI is working with a delay. It sees the signed contract, but not the resource shortage. It sees the revenue target, but not the churn risk.

The result is not only a wrong answer, but a confident, plausible-sounding wrong answer based on partial truths. Acting on that creates costly operational pitfalls that go far beyond failed AI pilots alone.

Why agentic AI requires a platform-native architecture

This is why the conversation is shifting from “which model should we use?” to “where does our data live?”

To support a hybrid workforce where human experts work alongside duly capable AI agents, the underlying data can’t be stitched together; it must be native to the core business platform. A platform-native approach, specifically one built on a common data model (e.g. Salesforce), eliminates the translation layer and provides the single source of truth that good, reliable AI requires.

In a native environment, data lives in a single object model. A scope change in delivery is a revenue change in finance. There is no sync, no latency, and no loss of state.

This is the only way to achieve real certainty with AI. If you want an agent to autonomously staff a project or forecast revenue, it’s going to require a 360-degree view of the truth, not a series of snapshots taped together by middleware.

The security tax of the side door: APIs as attack surface

Once you solve for intelligence, you must solve for sovereignty. The argument for a unified platform is usually framed around efficiency, but an increasingly pressing argument is security.

In a best-of-breed Franken-stack, every API connection you build is effectively a new door you have to lock. When you rely on third-party point solutions for critical functions like customer success or resource management, you’re constantly piping sensitive customer data out of your core system of record and into satellite apps. This movement is the risk.

We’ve seen this play out in recent high-profile supply chain breaches. Hackers didn’t need to storm the castle gates of the core platform. They simply walked in through the side door by exploiting the persistent authentication tokens of connected third-party apps.

A platform-native strategy solves this through security by inheritance. When your data stays resident on a single platform, it inherits the massive security investment and trust boundary of that platform. You aren’t moving data across the wire to a different vendor’s cloud just to analyze it. The gold never leaves the vault.

Fix the architecture, then curate the context

The pressure to deploy AI is immense, but layering intelligent agents on top of unintelligent architecture is a waste of time and resources.

Leaders often hesitate because they fear their data isn’t “clean enough.” They believe they have to scrub every record from the last ten years before they can deploy a single agent. On a fragmented stack, this fear is valid.

A platform-native architecture changes the math. Because the data, metadata, and agents live in the same house, you don’t need to boil the ocean. Simply ring-fence specific, trusted fields — like active customer contracts or current resource schedules — and tell the agent, ‘Work here. Ignore the rest.’ By eliminating the need for complex API translations and third-party middleware, a unified platform allows you to ground agents in your most reliable, connected data today, bypassing the mess without waiting for a ‘perfect’ state that may never arrive.
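
In code terms, ring-fencing amounts to an allowlist applied before a record ever reaches the agent. The Python sketch below is a generic illustration rather than a description of any vendor's product; the field names and the sample record are hypothetical.

```python
from typing import Any, Dict, FrozenSet

# Hypothetical set of fields the team has agreed to trust.
TRUSTED_FIELDS: FrozenSet[str] = frozenset(
    {"contract_status", "contract_end_date", "assigned_resources"}
)


def ring_fence(record: Dict[str, Any], allowed: FrozenSet[str] = TRUSTED_FIELDS) -> Dict[str, Any]:
    """Keep only allowlisted fields so the agent never sees the rest."""
    return {key: value for key, value in record.items() if key in allowed}


record = {
    "contract_status": "active",
    "contract_end_date": "2026-12-31",
    "legacy_notes": "unverified import from an old system",
}
print(ring_fence(record))
```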

We often fear that AI will hallucinate because it’s too creative. The real danger is that it will fail because it’s blind. And you cannot automate a complex business with fragmented visibility. Deny your new agentic workforce access to the full context of your operations on a unified platform, and you’re building a foundation that is sure to fail.


Raju Malhotra is Chief Product & Technology Officer at Certinia.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Apple integrates Anthropic’s Claude and OpenAI’s Codex into Xcode 26.3 in push for ‘agentic coding’

Apple on Tuesday announced a major update to its flagship developer tool that gives artificial intelligence agents unprecedented control over the app-building process, a move that signals the iPhone maker’s aggressive push into an emerging and controversial practice known as “agentic coding.”

Xcode 26.3, available immediately as a release candidate, integrates Anthropic’s Claude Agent and OpenAI’s Codex directly into Apple’s development environment, allowing the AI systems to autonomously write code, build projects, run tests, and visually verify their own work — all with minimal human oversight.

The update is Apple’s most significant embrace of AI-assisted software development since introducing intelligence features in Xcode 26 last year, and arrives as “vibe coding” — the practice of delegating software creation to large language models — has become one of the most debated topics in technology.

“Integrating intelligence into the Xcode developer workflow is powerful, but the model itself still has a somewhat limited aperture,” Tim Sneath, an Apple executive, said during a press conference Tuesday morning. “It answers questions based on what the developer provides, but it doesn’t have access to the full context of the project, and it’s not able today to take action on its own. And so that changes today.”

How Apple’s new AI coding features let developers build apps faster than ever

The key innovation in Xcode 26.3 is the depth of integration between AI agents and Apple’s development tools. Unlike previous iterations that offered code suggestions and autocomplete features, the new system grants AI agents access to nearly every aspect of the development process.

During a live demonstration, Jerome Bouvard, an Apple engineer, showed how the Claude agent could receive a simple prompt — “add a new feature to show the weather at a landmark” — and then independently analyze the project’s file structure, consult Apple’s documentation, write the necessary code, build the project, and take screenshots of the running application to verify its work matched the requested design.

“The agent is able to use the tools like build or, you know, grabbing a preview of the screenshots to verify its work, visually analyze the image and confirm that everything has been built accordingly,” Bouvard explained. “Before that, when you’re interacting with a model, the model will provide you an answer and it will just stop there.”

The system creates automatic checkpoints as developers interact with the AI, allowing them to roll back changes if results prove unsatisfactory — a safeguard that acknowledges the unpredictable nature of AI-generated code.

Apple worked directly with Anthropic and OpenAI to optimize the experience, Sneath said, with particular attention paid to reducing token usage — the computational units that determine costs when using cloud-based AI models — and improving the efficiency of tool calling.

“Developers can download new agents with a single click, and they update automatically,” Sneath noted.

Why Apple’s adoption of the Model Context Protocol could reshape the AI development landscape

Underlying the integration is the Model Context Protocol, or MCP, an open standard that Anthropic developed for connecting AI agents with external tools. Apple’s adoption of MCP means that any compatible agent — not just Claude or Codex — can now interact with Xcode’s capabilities.

“This also works for agents that are running outside of Xcode,” Sneath explained. “Any agent that is compatible with MCP can now work with Xcode to do all the same things: project discovery and change management, building and testing apps, working with previews and code snippets, and accessing the latest documentation.”

The decision to embrace an open protocol, rather than building a proprietary system, represents a notable departure for Apple, which has historically favored closed ecosystems. It also positions Xcode as a potential hub for a growing universe of AI development tools.

Xcode’s troubled history with AI tools — and why Apple says this time is different

The announcement comes against a backdrop of mixed experiences with AI-assisted coding in Apple’s tools. During the press conference, one developer described previous attempts to use AI agents with Xcode as “horrible,” citing constant crashes and an inability to complete basic tasks.

Sneath acknowledged the concerns while arguing that the new integration addresses fundamental limitations of earlier approaches.

“The big shift here is that Claude and Codex have so much more visibility into the breadth of the project,” he said. “If they hallucinate and write code that doesn’t work, they can now build. They can see the compile errors, and they can iterate in real time to fix those issues, and we’ll do so in this case before you even, you know, presented it as a finished work.”
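
The loop Sneath describes (build, read the compile errors, revise, build again) can be sketched generically. The Python below is not Apple's implementation and does not call Xcode or any agent API: `build` and `fix` are caller-supplied functions, and the stand-ins used in the demo are toy stubs invented to make the example runnable.

```python
from typing import Callable, List, Tuple


def build_until_green(
    source: str,
    build: Callable[[str], List[str]],
    fix: Callable[[str, List[str]], str],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Rebuild until there are no errors, revising the source between tries.

    Returns the passing source and the number of builds it took.
    """
    for attempt in range(1, max_attempts + 1):
        errors = build(source)
        if not errors:
            return source, attempt
        if attempt < max_attempts:
            source = fix(source, errors)  # the agent revises using the errors
    raise RuntimeError(f"still failing after {max_attempts} build attempts")


# Toy stand-ins for a compiler and a model (hypothetical).
def fake_build(source: str) -> List[str]:
    return ["unknown identifier 'retrun'"] if "retrun" in source else []


def fake_fix(source: str, errors: List[str]) -> str:
    return source.replace("retrun", "return")


print(build_until_green("retrun 42", fake_build, fake_fix))
```

Capping the number of attempts matters in practice: without a limit, an agent that cannot resolve an error would keep spending tokens on builds that never pass.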

The power of IDE integration, Sneath argued, extends beyond error correction. Agents can now automatically add entitlements to projects when needed to access protected APIs — a task that would be “otherwise very difficult to do” for an AI operating outside the development environment and “dealing with binary file that it may not have the file format for.”

From Andrej Karpathy’s tweet to LinkedIn certifications: The unstoppable rise of vibe coding

Apple’s announcement arrives at a crucial moment in the evolution of AI-assisted development. The term “vibe coding,” coined by AI researcher Andrej Karpathy in early 2025, has transformed from a curiosity into a genuine cultural phenomenon that is reshaping how software gets built.

LinkedIn announced last week that it will begin offering official certifications in AI coding skills, drawing on usage data from platforms like Lovable and Replit. Job postings requiring AI proficiency doubled in the past year, according to edX research, with Indeed’s Hiring Lab reporting that 4.2% of U.S. job listings now mention AI-related keywords.

The enthusiasm is driven by genuine productivity gains. Casey Newton, the technology journalist, recently described building a complete personal website using Claude Code in about an hour — a task that previously required expensive Squarespace subscriptions and years of frustrated attempts with various website builders.

More dramatically, Jaana Dogan, a principal engineer at Google, posted that she gave Claude Code “a description of the problem” and “it generated what we built last year in an hour.” Her post, which accumulated more than 8 million views, began with the disclaimer: “I’m not joking and this isn’t funny.”

Security experts warn that AI-generated code could lead to ‘catastrophic explosions’

But the rapid adoption of agentic coding has also sparked significant concerns among security researchers and software engineers.

David Mytton, founder and CEO of developer security provider Arcjet, warned last month that the proliferation of vibe-coded applications “into production will lead to catastrophic problems for organizations that don’t properly review AI-developed software.”

“In 2026, I expect more and more vibe-coded applications hitting production in a big way,” Mytton wrote. “That’s going to be great for velocity… but you’ve still got to pay attention. There’s going to be some big explosions coming!”

Simon Willison, co-creator of the Django web framework, drew an even starker comparison. “I think we’re due a Challenger disaster with respect to coding agent security,” he said, referring to the 1986 space shuttle explosion that killed all seven crew members. “So many people, myself included, are running these coding agents practically as root. We’re letting them do all of this stuff.”

A pre-print paper from researchers this week warned that vibe coding could pose existential risks to the open-source software ecosystem. The study found that AI-assisted development pulls user interaction away from community projects, reduces visits to documentation websites and forums, and makes launching new open-source initiatives significantly harder.

Stack Overflow usage has plummeted as developers increasingly turn to AI chatbots for answers—a shift that could ultimately starve the very knowledge bases that trained the AI models in the first place.

Previous research painted an even more troubling picture: a 2024 report found that AI-assisted coding with tools like GitHub Copilot “offered no real benefits unless adding 41% more bugs is a measure of success.”

The hidden mental health cost of letting AI write your code

Even enthusiastic adopters have begun acknowledging the darker aspects of AI-assisted development.

Peter Steinberger, creator of the viral AI agent originally known as Clawdbot (now OpenClaw), recently revealed that he had to step back from vibe coding after it consumed his life.

“I was out with my friends and instead of joining the conversation in the restaurant, I was just like, vibe coding on my phone,” Steinberger said in a recent podcast interview. “I decided, OK, I have to stop this more for my mental health than for anything else.”

Steinberger warned that the constant building of increasingly powerful AI tools creates the “illusion of making you more productive” without necessarily advancing real goals. “If you don’t have a vision of what you’re going to build, it’s still going to be slop,” he added.

Google CEO Sundar Pichai has expressed similar reservations, saying he won’t vibe code on “large codebases where you really have to get it right.”

“The security has to be there,” Pichai said in a November podcast interview.

Boris Cherny, the Anthropic engineer who created Claude Code, acknowledged that vibe coding works best for “prototypes or throwaway code, not software that sits at the core of a business.”

“You want maintainable code sometimes. You want to be very thoughtful about every line sometimes,” Cherny said.

Apple is gambling that deep IDE integration can make AI coding safe for production

Apple appears to be betting that the benefits of deep IDE integration can mitigate many of these concerns. By giving AI agents access to build systems, test suites, and visual verification tools, the company is essentially arguing that Xcode can serve as a quality control mechanism for AI-generated code.

Susan Prescott, Apple’s vice president of Worldwide Developer Relations, framed the update as part of Apple’s broader mission.

“At Apple, our goal is to make tools that put industry-leading technologies directly in developers’ hands so they can build the very best apps,” Prescott said in a statement. “Agentic coding supercharges productivity and creativity, streamlining the development workflow so developers can focus on innovation.”

But the question remains whether the safeguards will prove sufficient as AI agents grow more autonomous. Asked about debugging capabilities, Bouvard noted that while Xcode has “a very powerful debugger built in,” there is “no direct MCP tool for debugging.”

Developers can run the debugger and manually relay information to the agent, but the AI cannot yet independently investigate runtime issues — a limitation that could prove significant as the complexity of AI-generated code increases.

The update also does not currently support running multiple agents simultaneously on the same project, though Sneath noted that developers can open projects in multiple Xcode windows using Git worktrees as a workaround.

The future of software development hangs in the balance — and Apple just raised the stakes

Xcode 26.3 is available immediately as a release candidate for members of the Apple Developer Program, with a general release expected soon on the App Store. The release candidate designation — Apple’s final beta before production — means developers who download today will automatically receive the finished version when it ships.

The integration supports both API keys and direct account credentials from OpenAI and Anthropic, offering developers flexibility in managing their AI subscriptions. But those conveniences belie the magnitude of what Apple is attempting: nothing less than a fundamental reimagining of how software comes into existence.

For the world’s most valuable company, the calculus is straightforward. Apple’s ability to attract and retain developers has always underpinned its platform dominance. If agentic coding delivers on its promise of radical productivity gains, early and deep integration could cement Apple’s position for another generation. If it doesn’t — if the security disasters and “catastrophic explosions” that critics predict come to pass — Cupertino could find itself at the epicenter of a very different kind of transformation.

The technology industry has spent decades building systems to catch human errors before they reach users. Now it must answer a more unsettling question: What happens when the errors aren’t human at all?

As Sneath conceded during Tuesday’s press conference, with what may prove to be unintentional understatement: “Large language models, as agents sometimes do, sometimes hallucinate.”

Millions of lines of code are about to find out how often.

Shared memory is the missing layer in AI orchestration

The key to successful AI agents within an enterprise? Shared memory and context. 

This, according to Asana CPO Arnab Bose, provides detailed history and direct access from the get-go — with guardrail checkpoints and human oversight, of course. 

This way, “when you assign a task, you’re not having to go ahead and re-provide all of the context about how your business works,” Bose said at a recent VB event in San Francisco. 

AI as an active teammate, rather than a passive add-on

Asana launched Asana AI Teammates last year with the philosophy that, just like humans, AI agents should be plugged directly into a team or project to create a collaborative system. To further this mission, the project management company has fully integrated with Anthropic’s Claude.  

Users can choose from 12 pre-built agents — for common use cases like IT ticket deflection — or build their own, then assign them to project teams and immediately provide a historical record of what tasks have already been completed and what is still yet to be resolved. Agents also have access to third-party resources like Microsoft 365 or Google Drive. 

“When that agent gets created, it’s not acting on behalf of someone, it manifests itself as a teammate and it gets all of the same sharing permissions, it inherits that,” Bose explained. Everything anyone does — humans and AI included — is documented to allow for “ease of explainability” and a “very transparent and trustworthy system.”

But just like human workers, AI agents are kept in check: Critically, workflows incorporate checkpoints, where humans can give feedback and ask the agent to tweak certain elements of a project or adjust research plans. This is documented in what Bose called a “very human-readable way.” 

Also importantly, the UI provides instructions and knowledge about agent behavior, and approved admins can pause, edit and redirect models in the API when they take actions based on conflicting directions or start acting “in a weird way.”

“The person with edit rights can delete those things that are conflicting and make it go back to its correct behavior,” said Bose. “We’re leaning into that common human-understandable interaction pattern.”

Overcoming challenges of authorization, integration 

But because AI agents are so new, there are still many challenges around security, accessibility and compatibility. 

Asana users, for instance, must go through an OAuth flow and grant Claude access to Asana via their MCP and other public APIs. But getting all knowledge workers to know that that integration exists — and more importantly, which OAuth grants are OK and which are to be avoided — can be a tall order.

Some of the challenges around direct OAuth grants between applications could be handled centrally by identity providers, Bose noted, or through a centralized listing of approved enterprise AI agents with their skill sets, “almost like an active directory or universal directory of agents.”
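To make the “directory of agents” idea concrete, here is a minimal sketch of what such a listing might check before an OAuth grant is accepted. Everything in it is hypothetical: the class names, the scope strings and the skill labels are invented for illustration and do not reflect Asana’s product or any identity provider’s API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentRecord:
    """One approved agent and the OAuth scopes IT allows it to request."""
    name: str
    skills: frozenset
    allowed_scopes: frozenset


@dataclass
class AgentDirectory:
    """Hypothetical central listing of approved enterprise agents."""
    records: dict = field(default_factory=dict)

    def approve(self, name, skills, allowed_scopes):
        # IT registers an agent together with its skills and permitted scopes.
        self.records[name] = AgentRecord(
            name, frozenset(skills), frozenset(allowed_scopes)
        )

    def is_grant_ok(self, name, requested_scopes):
        # A grant is acceptable only for a listed agent asking for listed scopes.
        record = self.records.get(name)
        if record is None:
            return False
        return set(requested_scopes) <= record.allowed_scopes

    def find_by_skill(self, skill):
        # Look up approved agents by skill, sorted by name.
        return sorted(r.name for r in self.records.values() if skill in r.skills)
```

In this sketch, an unlisted agent or a request for an unlisted scope is simply refused, which is the kind of decision individual knowledge workers would otherwise have to make themselves.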

Right now, though, beyond what Asana is doing, there’s no standard protocol around shared knowledge and memory, said Bose. His team has been getting “a lot of interesting inbound asks” from partners who want their agents to operate on the Asana work graph and benefit from shared work.

“But because the protocol or standard doesn’t exist, today it has to be a very custom bespoke conversation,” said Bose. 

Ultimately, there are three questions the CPO called “extremely interesting” in AI orchestration right now: 

  • How do you build, manage and secure an authoritative list of known approved AI agents? 

  • How can you enable app-to-app integrations as an IT team without potentially configuring dangerous or harmful agents?

  • Today’s agent-to-agent interactions are very single player. Claude can independently be connected to Asana or Figma or Slack. How can we finally get to a unified, multi-player outcome?

The increased adoption of the Model Context Protocol (MCP), the open standard introduced by Anthropic that connects AI agents to external systems in a single action rather than through custom integrations for every single pairing, is promising, he noted, and its widespread adoption could open up new and exciting use cases.

However, “I think there probably isn’t a silver bullet standard out there right now,” said Bose. 

OpenAI launches a Codex desktop app for macOS to run multiple AI coding agents in parallel

OpenAI on Monday released a new desktop application for its Codex artificial intelligence coding system, a tool the company says transforms software development from a collaborative exercise with a single AI assistant into something more akin to managing a team of autonomous workers.

The Codex app for macOS functions as what OpenAI executives describe as a “command center for agents,” allowing developers to delegate multiple coding tasks simultaneously, automate repetitive work, and supervise AI systems that can run for up to 30 minutes independently before returning completed code.

“This is the most loved internal product we’ve ever had,” Sam Altman, OpenAI’s chief executive, told VentureBeat in a press briefing ahead of Monday’s launch. “It’s been totally an amazing thing for us to be using recently at OpenAI.”

The release arrives at a pivotal moment for the enterprise AI market. According to a survey of 100 Global 2000 companies published last week by venture capital firm Andreessen Horowitz, 78% of enterprise CIOs now use OpenAI models in production, though competitors Anthropic and Google are gaining ground rapidly. Anthropic posted the largest share increase of any frontier lab since May 2025, growing 25% in enterprise penetration, with 44% of enterprises now using Anthropic in production.

The timing of OpenAI’s Codex app launch — with its focus on professional software engineering workflows — appears designed to defend the company’s position in what has become the most contested segment of the AI market: coding tools.

Why developers are abandoning their IDEs for AI agent management

The Codex app introduces a fundamentally different approach to AI-assisted coding. While previous tools like GitHub Copilot focused on autocompleting lines of code in real-time, the new application enables developers to “effortlessly manage multiple agents at once, run work in parallel, and collaborate with agents over long-running tasks.”

Alexander Embiricos, the product lead for Codex, explained the evolution during the press briefing by tracing the product’s lineage back to 2021, when OpenAI first introduced a model called Codex that powered GitHub Copilot.

“Back then, people were using AI to write small chunks of code in their IDEs,” Embiricos said. “GPT-5 in August last year was a big jump, and then 5.2 in December was another massive jump, where people started doing longer and longer tasks, asking models to do work end to end. So what we saw is that developers, instead of working closely with the model, pair coding, they started delegating entire features.”

The shift has been so profound that Altman said he recently completed a substantial coding project without ever opening a traditional integrated development environment.

“I was astonished by this…I did this fairly big project in a few days earlier this week and over the weekend. I did not open an IDE during the process. Not a single time,” Altman said. “I did look at some code, but I was not doing it the old-fashioned way, and I did not think that was going to be happening by now.”

How skills and automations extend AI coding beyond simple code generation

The Codex app introduces several new capabilities designed to extend AI coding beyond writing lines of code. Chief among these are “Skills,” which bundle instructions, resources, and scripts so that Codex can “reliably connect to tools, run workflows, and complete tasks according to your team’s preferences.”

The app includes a dedicated interface for creating and managing skills, and users can explicitly invoke specific skills or allow the system to automatically select them based on the task at hand. OpenAI has published a library of skills for common workflows, including tools to fetch design context from Figma, manage projects in Linear, deploy web applications to cloud hosts like Cloudflare and Vercel, generate images using GPT Image, and create professional documents in PDF, spreadsheet, and Word formats.
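The two ways of choosing a skill described above, explicit invocation and automatic selection, can be pictured with a small sketch. OpenAI has not described how Codex picks skills, so the keyword matching here is purely an assumption, and the `Skill` fields and `select_skills` function are hypothetical names, not part of the Codex app.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """A bundle of instructions, resources, and scripts (all names hypothetical)."""
    name: str
    instructions: str
    scripts: tuple = ()
    keywords: tuple = ()  # assumption: used only for naive auto-selection


def select_skills(task, skills, explicit=None):
    """Return explicitly named skills, else those whose keywords appear in the task."""
    if explicit:
        wanted = set(explicit)
        return [s for s in skills if s.name in wanted]
    text = task.lower()
    return [s for s in skills if any(k in text for k in s.keywords)]


figma = Skill("figma-context", "Fetch design context", keywords=("figma", "design"))
deploy = Skill("deploy-web", "Deploy the app", scripts=("deploy.sh",), keywords=("deploy",))

# Automatic selection from the task text, then an explicit override.
auto = select_skills("Deploy the landing page", [figma, deploy])
manual = select_skills("anything", [figma, deploy], explicit=["figma-context"])
```

A real system would presumably use the model itself rather than keyword matching to decide which skill fits a task.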

To demonstrate the system’s capabilities, OpenAI asked Codex to build a racing game from a single prompt. Using an image generation skill and a web game development skill, Codex built the game by working independently using more than 7 million tokens with just one initial user prompt, taking on “the roles of designer, game developer, and QA tester to validate its work by actually playing the game.”

The company has also introduced “Automations,” which allow developers to schedule Codex to work in the background on an automatic schedule. “When an Automation finishes, the results land in a review queue so you can jump back in and continue working if needed.”

Thibault Sottiaux, who leads the Codex team at OpenAI, described how the company uses these automations internally: “We’ve been using Automations to handle the repetitive but important tasks, like daily issue triage, finding and summarizing CI failures, generating daily release briefs, checking for bugs, and more.”

The app also includes built-in support for “worktrees,” allowing multiple agents to work on the same repository without conflicts. “Each agent works on an isolated copy of your code, allowing you to explore different paths without needing to track how they impact your codebase.”
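For readers unfamiliar with the underlying Git feature, the sketch below builds the `git worktree add` commands that would give each agent task its own branch and checkout directory. It illustrates Git worktrees in general, not how the Codex app drives them; the `agent/` branch prefix, the directory naming scheme and the `worktree_commands` helper are all assumptions made for this example.

```python
from pathlib import Path


def worktree_commands(repo, task_names, base="HEAD"):
    """Return one `git worktree add` argv list per task (nothing is executed)."""
    repo = Path(repo)
    commands = []
    for task in task_names:
        # Hypothetical naming: one branch and one sibling directory per task.
        branch = f"agent/{task}"
        checkout = repo.parent / f"{repo.name}-agent-{task}"
        commands.append(
            ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(checkout), base]
        )
    return commands


cmds = worktree_commands("/src/app", ["fix-login", "add-tests"])
```

Each command could then be passed to `subprocess.run(cmd, check=True)`. Because every worktree has its own working directory and branch, edits made for one task do not touch the files another task is working on.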

OpenAI battles Anthropic and Google for control of enterprise AI spending

The launch comes as enterprise spending on AI coding tools accelerates dramatically. According to the Andreessen Horowitz survey, average enterprise AI spend on large language models has risen from approximately $4.5 million to $7 million over the last two years, with enterprises expecting growth of another 65% this year to approximately $11.6 million.
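The survey figures are internally consistent, as a quick check shows. The snippet uses only the numbers quoted above; the variable names are ours.

```python
from decimal import Decimal

spend_two_years_ago = Decimal("4.5")  # $ millions, from the survey
spend_now = Decimal("7")              # $ millions, from the survey
expected_growth = Decimal("0.65")     # 65% expected growth this year

# Growth over the last two years: (7 - 4.5) / 4.5, about 56%.
growth_so_far = (spend_now - spend_two_years_ago) / spend_two_years_ago

# Projected spend: 7 * 1.65 = 11.55, which the survey reports as roughly $11.6 million.
projected = spend_now * (1 + expected_growth)
```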

Leadership in the enterprise AI market varies significantly by use case. OpenAI dominates “early, horizontal use cases like general purpose chatbots, enterprise knowledge management and customer support,” while Anthropic leads in “software development and data analysis, where CIOs consistently cite rapid capability gains since the second half of 2024.”

When asked during the press briefing how Codex differentiates from Anthropic’s Claude Code, which has been described as having its “ChatGPT moment,” Sottiaux emphasized OpenAI’s focus on model capability for long-running tasks.

“One of the things that our models are extremely good at—they really sit at the frontier of intelligence and doing reliable work for long periods of time,” Sottiaux said. “This is also what we’re optimizing this new surface to be very good at, so that you can start many parallel agents and coordinate them over long periods of time and not get lost.”

Altman added that while many tools can handle “vibe coding front ends,” OpenAI’s 5.2 model remains “the strongest model by far” for sophisticated work on complex systems.

“Taking that level of model capability and putting it in an interface where you can do what Thibault was saying, we think is going to matter quite a bit,” Altman said. “That’s probably the, at least listening to users and sort of looking at the chatter on social, that’s the single biggest differentiator.”

The surprising bottleneck on AI progress: how fast humans can type

The philosophical underpinning of the Codex app reflects a view that OpenAI executives have been articulating for months: that human limitations — not AI capabilities — now constitute the primary constraint on productivity.

In a December appearance on Lenny’s Podcast, Embiricos described human typing speed as “the current underappreciated limiting factor” to achieving artificial general intelligence. The logic: if AI can perform complex coding tasks but humans can’t write prompts or review outputs fast enough, progress stalls.

The Codex app attempts to address this by enabling what the team calls an “abundance mindset” — running multiple tasks in parallel rather than perfecting single requests. During the briefing, Embiricos described how power users at OpenAI work with the tool.

“Last night, I was working on the app, and I was making a few changes, and all of these changes are able to run in parallel together. And I was just sort of going between them, managing them,” Embiricos said. “Behind the scenes, all these tasks are running on something called Git worktrees, which means that the agents are running independently, and you don’t have to manage them.”

In the Sequoia Capital podcast “Training Data,” Embiricos elaborated on this mindset shift: “The mindset that works really well for Codex is, like, kind of like this abundance mindset and, like, hey, let’s try anything. Let’s try anything even multiple times and see what works.” He noted that when users run 20 or more tasks in a day or an hour, “they’ve probably understood basically how to use the tool.”

Building trust through sandboxes: how OpenAI secures autonomous coding agents

OpenAI has built security measures into the Codex architecture from the ground up. The app uses “native, open-source and configurable system-level sandboxing,” and by default, “Codex agents are limited to editing files in the folder or branch where they’re working and using cached web search, then asking for permission to run commands that require elevated permissions like network access.”

Embiricos elaborated on the security approach during the briefing, noting that OpenAI has open-sourced its sandbox technology.

“Codex has this sandbox that we’re actually incredibly proud of, and it’s open source, so you can go check it out,” Embiricos said. The sandbox “basically ensures that when the agent is working on your computer, it can only make writes in a specific folder that you want it to make writes into, and it doesn’t access network without permission.”

The system also includes a granular permission model that allows users to configure persistent approvals for specific actions, avoiding the need to repeatedly authorize routine operations. “If the agent wants to do something and you find yourself annoyed that you’re constantly having to approve it, instead of just saying, ‘All right, you can do everything,’ you can just say, ‘Hey, remember this one thing — I’m actually okay with you doing this going forward,'” Embiricos explained.
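The behavior described here, writes confined to one folder plus approvals that can be remembered, can be sketched in a few lines. This is a toy model for illustration only: it is not OpenAI’s open-source sandbox, which works at the system level, and the `SandboxPolicy` class, its methods and the action strings are hypothetical.

```python
from pathlib import Path


class SandboxPolicy:
    """Toy policy: writes stay inside one folder; other actions need approval."""

    def __init__(self, workspace):
        self.workspace = Path(workspace).resolve()
        self.remembered = set()  # actions the user approved "going forward"

    def can_write(self, target):
        # Resolve first so "../" segments or absolute paths cannot escape.
        resolved = (self.workspace / target).resolve()
        return resolved == self.workspace or self.workspace in resolved.parents

    def request(self, action, ask_user):
        """Return True if the elevated action may run.

        ask_user(action) must return "once", "always", or "deny".
        """
        if action in self.remembered:
            return True  # previously approved, so the user is not asked again
        answer = ask_user(action)
        if answer == "always":
            self.remembered.add(action)
        return answer in ("once", "always")
```

Answering “always” for a single action is the middle ground Embiricos describes: the user stops being prompted for that one operation without granting blanket permission for everything else.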

Altman emphasized that the permission architecture signals a broader philosophy about AI safety in agentic systems.

“I think this is going to be really important. I mean, it’s been so clear to us using this, how much you want it to have control of your computer, and how much you need it,” Altman said. “And the way the team built Codex such that you can sensibly limit what’s happening and also pick the level of control you’re comfortable with is important.”

He also acknowledged the dual-use nature of the technology. “We do expect to get to our internal cybersecurity high moment of our models very soon. We’ve been preparing for this. We’ve talked about our mitigation plan,” Altman said. “A real thing for the world to contend with is going to be defending against a lot of capable cybersecurity threats using these models very quickly.”

The same capabilities that make Codex valuable for fixing bugs and refactoring code could, in the wrong hands, be used to discover vulnerabilities or write malicious software—a tension that will only intensify as AI coding agents become more capable.

From Android apps to research breakthroughs: how Codex transformed OpenAI’s own operations

Perhaps the most compelling evidence for Codex’s capabilities comes from OpenAI’s own use of the tool. Sottiaux described how the system has accelerated internal development.

“A Sora Android app is an example of that where four engineers shipped in only 18 days internally, and then within the month we give access to the world,” Sottiaux said. “I had never noticed such speed at this scale before.”

Beyond product development, Sottiaux described how Codex has become integral to OpenAI’s research operations.

“Codex is really involved in all parts of the research: making new data sets, investigating its own training runs,” he said. “When I sit in meetings with researchers, they all send Codex off to do an investigation while we’re having a chat, and then it will come back with useful information, and we’re able to debug much faster.”

The tool has also begun contributing to its own development. “Codex also is starting to build itself,” Sottiaux noted. “There’s no screen within the Codex engineering team that doesn’t have Codex running on multiple, six, eight, ten, tasks at a time.”

When asked whether this constitutes evidence of “recursive self-improvement” — a concept that has long concerned AI safety researchers — Sottiaux was measured in his response.

“There is a human in the loop at all times,” he said. “I wouldn’t necessarily call it recursive self-improvement, a glimpse into the future there.”

Altman offered a more expansive view of the research implications.

“There’s two parts of what people talk about when they talk about automating research to a degree where you can imagine that happening,” Altman said. “One is, can you write software, extremely complex infrastructure, software to run training jobs across hundreds of thousands of GPUs and babysit them. And the second is, can you come up with the new scientific ideas that make algorithms more efficient.”

He noted that OpenAI is “seeing early but promising signs on both of those.”

The end of technical debt? AI agents take on the work engineers hate most

One of the more unexpected applications of Codex has been addressing technical debt — the accumulated maintenance burden that plagues most software projects.

Altman described how AI coding agents excel at the unglamorous work that human engineers typically avoid.

“The kind of work that human engineers hate to do — go refactor this, clean up this code base, rewrite this, write this test — this is where the model doesn’t care. The model will do anything, whether it’s fun or not,” Altman said.

He reported that some infrastructure teams at OpenAI that “had sort of like, given up hope that you were ever really going to long term win the war against tech debt, are now like, we’re going to win this, because the model is going to constantly be working behind us, making sure we have great test coverage, making sure that we refactor when we’re supposed to.”

The observation speaks to a broader theme that emerged repeatedly during the briefing: AI coding agents don’t experience the motivational fluctuations that affect human programmers. As Altman noted, a team member recently observed that “the hardest mental adjustment to make about working with these sort of like AI coding teammates, unlike a human, is the models just don’t run out of dopamine. They keep trying. They don’t run out of motivation. They don’t get, you know, they don’t lose energy when something’s not working. They just keep going and, you know, they figure out how to get it done.”

What the Codex app costs and who can use it starting today

The Codex app launches today on macOS and is available to anyone with a ChatGPT Plus, Pro, Business, Enterprise, or Edu subscription. Usage is included in ChatGPT subscriptions, with the option to purchase additional credits if needed.

In a promotional push, OpenAI is temporarily making Codex available to ChatGPT Free and Go users “to help more people try agentic workflows.” The company is also doubling rate limits for existing Codex users across all paid plans during this promotional period.

The pricing strategy reflects OpenAI’s determination to establish Codex as the default tool for AI-assisted development before competitors can gain further traction. More than a million developers have used Codex in the past month, and usage has nearly doubled since the launch of GPT-5.2-Codex in mid-December, building on more than 20x usage growth since August 2025.

Customers using Codex include large enterprises like Cisco, Ramp, Virgin Atlantic, Vanta, Duolingo, and Gap, as well as startups like Harvey, Sierra, and Wonderful. Individual developers have also embraced the tool: Peter Steinberger, creator of OpenClaw, built the project entirely with Codex and reports that since fully switching to the tool, his productivity has roughly doubled across more than 82,000 GitHub contributions.

OpenAI’s ambitious roadmap: Windows support, cloud triggers, and continuous background agents

OpenAI outlined an aggressive development roadmap for Codex. The company plans to make the app available on Windows, continue pushing “the frontier of model capabilities,” and roll out faster inference.

Within the app, OpenAI will “keep refining multi-agent workflows based on real-world feedback” and is “building out Automations with support for cloud-based triggers, so Codex can run continuously in the background—not just when your computer is open.”

The company also announced a new “plan mode” feature that allows Codex to read through complex changes in read-only mode, then discuss with the user before executing. “This means that it lets you build a lot of confidence before, again, sending it to do a lot of work by itself, independently, in parallel to you,” Embiricos explained.

Additionally, OpenAI is introducing customizable personalities for Codex. “The default personality for Codex has been quite terse. A lot of people love it, but some people want something more engaging,” Embiricos said. Users can access the new personalities using the /personality command.

Altman also hinted at future integration with ChatGPT’s broader ecosystem.

“There will be all kinds of cool things we can do over time to connect people’s ChatGPT accounts and leverage sort of all the history they’ve built up there,” Altman said.

Microsoft still dominates enterprise AI, but the window for disruption is open

The Codex launch occurs as most enterprises have moved beyond single-vendor strategies. According to the Andreessen Horowitz survey, “81% now use three or more model families in testing or production, up from 68% less than a year ago.”

Despite the proliferation of AI coding tools, Microsoft continues to dominate enterprise adoption through its existing relationships. “Microsoft 365 Copilot leads enterprise chat though ChatGPT has closed the gap meaningfully,” and “Github Copilot is still the coding leader for enterprises.” The survey found that “65% of enterprises noted they preferred to go with incumbent solutions when available,” citing trust, integration, and procurement simplicity.

However, the survey also suggests significant opportunity for challengers: “Enterprises consistently say they value faster innovation, deeper AI focus, and greater flexibility paired with cutting edge capabilities that AI native startups bring.”

OpenAI appears to be positioning Codex as a bridge between these worlds. “Codex is built on a simple premise: everything is controlled by code,” the company stated. “The better an agent is at reasoning about and producing code, the more capable it becomes across all forms of technical and knowledge work.”

The company’s ambition extends beyond coding. “We’ve focused on making Codex the best coding agent, which has also laid the foundation for it to become a strong agent for a broad range of knowledge work tasks that extend beyond writing code.”

When asked whether AI coding tools could eventually move beyond early adopters to become mainstream, Altman suggested the transition may be closer than many expect.

“Can it go from vibe coding to serious software engineering? That’s what this is about,” Altman said. “I think we are over the bar on that. I think this will be the way that most serious coders do their job — and very rapidly from now.”

He then pivoted to an even bolder prediction: that code itself could become the universal interface for all computer-based work.

“Code is a universal language to get computers to do what you want. And it’s gotten so good that I think, very quickly, we can go not just from vibe coding silly apps but to doing all the non-coding knowledge work,” Altman said.

At the close of the briefing, Altman urged journalists to try the product themselves: “Please try the app. There’s no way to get this across just by talking about it. It’s a crazy amount of power.”

For developers who have spent careers learning to write code, the message was clear: the future belongs to those who learn to manage the machines that write it for them.