Remote-first AI coding startup Kilo doesn’t think software developers should have to pledge their undying allegiance to any one development environment — and certainly not any one model or harness.
This week, the startup — backed by GitLab co-founder Sid Sijbrandij — unveiled Kilo CLI 1.0, a complete rebuild of its command-line tool that offers support for more than 500 different underlying AI models from proprietary leaders and open source rivals like Alibaba’s Qwen.
It comes just weeks after Kilo launched a Slackbot, powered by Chinese AI startup MiniMax, that lets developers ship code directly from Slack, Salesforce’s popular messaging service (which VentureBeat also uses).
The release marks a strategic pivot away from the IDE-centric “sidebar” model popularized by industry giants like Cursor and GitHub Copilot, dedicated apps like the new OpenAI Codex, and terminal-based rivals like Codex CLI and Claude Code. Instead, Kilo aims to embed AI capabilities into every corner of the professional software workflow.
By launching a model-agnostic CLI on the heels of its Slack bot, Kilo is making a calculated bet: the future of AI development isn’t about a single interface, but about tools that travel with the engineer between IDEs, terminals, remote servers, and team chat threads.
In a recent interview with VentureBeat, Kilo CEO and co-founder Scott Breitenother explained the necessity of this fluidity: “This experience just feels a little bit too fragmented right now… as an engineer, sometimes I’m going to use the CLI, sometimes I’m going to be in VS Code, and sometimes I’m going to be kicking off an agent from Slack, and folks shouldn’t have to be jumping around.”
He noted that Kilo CLI 1.0 is specifically “built for this world… for the developer who moves between their local IDE, a remote server via SSH, and a terminal session at 2 a.m. to fix a production bug.”
Kilo CLI 1.0 is a fundamental architectural shift. While 2025 was the year senior engineers began to take AI vibe coding seriously, Kilo believes 2026 will be defined by the adoption of agents that can manage end-to-end tasks independently.
The new CLI is built on an MIT-licensed, open-source foundation, specifically designed to function in terminal sessions where developers often find themselves during critical production incidents or deep infrastructure work.
For Breitenother, building in the open is non-negotiable: “When you build in the open, you build better products. You get this great flywheel of contributors… your community is not just passive users. They’re actually part of your team that’s helping you develop your product… Honestly, some people might say open source is a weakness, but I think it’s our superpower.”
The core of this “agentic” experience is Kilo’s ability to move beyond simple autocompletion. The CLI supports multiple operational modes:
Code Mode: For high-speed generation and multi-file refactors.
Architect Mode: For high-level planning and technical strategy.
Debug Mode: For systematic problem diagnosis and resolution.
To solve the persistent issue of “AI amnesia”—where an agent loses context between sessions—Kilo utilizes a “Memory Bank” feature.
This system maintains state by storing context in structured Markdown files within the repository, ensuring that an agent operating in the CLI has the same understanding of the codebase as the one working in a VS Code sidebar or a Slack thread.
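Kilo hasn’t published the exact file layout of the Memory Bank, but the mechanism it describes — repo-resident Markdown files read at session start — can be sketched. In this illustration, the directory path and file names are hypothetical; only the idea of concatenating committed Markdown into an agent’s context comes from the article:

```python
from pathlib import Path

def load_memory_bank(repo_root: str, bank_dir: str = ".kilo/memory-bank") -> str:
    """Concatenate repo-resident Markdown memory files into one context
    string that any agent (CLI, IDE sidebar, or Slack bot) can prepend
    to its prompt, so all surfaces share the same codebase understanding."""
    bank = Path(repo_root) / bank_dir
    sections = []
    for md_file in sorted(bank.glob("*.md")):
        sections.append(f"## {md_file.stem}\n{md_file.read_text()}")
    return "\n\n".join(sections)
```

Because the files live in the repository itself, they travel with every clone and branch, which is what lets a terminal session and a Slack thread see identical state.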
The synergy between the new CLI and “Kilo for Slack” is central to the company’s “Agentic Anywhere” strategy. Launched in January, the Slack integration allows teams to fix bugs and push pull requests directly from a conversation.
Unlike competing integrations from Cursor or Claude Code — which Kilo claims are limited by single-repo configurations or a lack of persistent thread state — Kilo’s bot can ingest context from across multiple repositories simultaneously.
“Engineering teams don’t make decisions in IDE sidebars. They make them in Slack,” Breitenother emphasized.
A critical component of Kilo’s technical depth is its support for the Model Context Protocol (MCP). This open standard allows Kilo to communicate with external servers, extending its capabilities beyond local file manipulation.
Through MCP, Kilo agents can integrate with custom tools and resources, such as internal documentation servers or third-party monitoring tools, effectively turning the agent into a specialized member of the engineering team.
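MCP messages are plain JSON-RPC 2.0, which is what makes the protocol tool-agnostic: any client or server that speaks it can interoperate. A minimal sketch of building and inspecting a `tools/call` request follows; the tool name and arguments are hypothetical, invented for illustration:

```python
import json

def mcp_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Build an MCP 'tools/call' request as a JSON-RPC 2.0 message,
    the wire format MCP clients exchange with servers over stdio or HTTP."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# A hypothetical call to an internal documentation-search tool:
msg = mcp_tool_call(1, "search_docs", {"query": "deploy runbook"})
```

An agent that emits messages in this shape can drive any MCP-compatible server, which is why a single integration point covers internal docs, monitoring tools, and third-party services alike.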
This extensibility is part of Kilo’s broader commitment to model agnosticism. While MiniMax is the default for Slack, the CLI and extension support a massive array of over 500 models, including models from Anthropic, OpenAI, and Google.
Kilo is also attempting to disrupt the economics of AI development with “Kilo Pass,” a subscription service designed for transparency.
The company charges exact provider API rates with zero commission—$1 of Kilo credits is equivalent to $1 of provider costs.
Breitenother is critical of the “black box” subscription models used by others in the space: “We’re selling infrastructure here… you hit some sort of arbitrary, unclear line, and then you start to get throttled. That’s not how the world’s going to work.”
The Kilo Pass tiers offer “momentum rewards,” providing bonus credits for active subscribers:
Starter ($19/mo): Up to $26.60 in credits.
Pro ($49/mo): Up to $68.60 in credits.
Expert ($199/mo): Up to $278.60 in credits.
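The three published tiers all work out to the same 40% “momentum” bonus on the sticker price, which can be checked directly. Note the flat 40% rate is inferred from the listed numbers, not a figure Kilo states:

```python
def credits_with_bonus(monthly_price: float, bonus_rate: float = 0.40) -> float:
    """Monthly Kilo credits assuming the flat 40% momentum bonus
    implied by the published tier numbers."""
    return round(monthly_price * (1 + bonus_rate), 2)

for price in (19, 49, 199):
    print(f"${price}/mo -> ${credits_with_bonus(price)} in credits")
```

This reproduces the listed $26.60, $68.60, and $278.60 figures exactly.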
To incentivize early adoption, Kilo is currently offering a “Double Welcome Bonus” until February 6th, giving users 50% free credits for their first two months.
For power users like Sylvain, this flexibility is a major draw: “Kilo Pass is exactly what I’ve been waiting for. I can use my credits when I need them and save them when I don’t—it finally fits how I actually use AI.”
The arrival of Kilo CLI 1.0 places it in direct conversation with terminal-native heavyweights: Anthropic’s Claude Code and Block’s Goose.
Outside of the terminal, in the more full-featured IDE space, OpenAI recently launched a new Codex desktop app for macOS.
Claude Code offers a highly polished experience, but it comes with vendor lock-in and high costs—up to $200 per month for tiers that still include token-based usage caps and rate limits. Independent analysis suggests these limits are often exhausted within minutes of intensive work on large codebases.
OpenAI’s new Codex app similarly favors a platform-locked approach, functioning as a “command center for agents” that allows developers to supervise AI systems running independently for up to 30 minutes.
While Codex introduces powerful features like “Skills” to connect to tools like Figma and Linear, it is fundamentally designed to defend OpenAI’s ecosystem in a highly contested market.
Conversely, Kilo CLI 1.0 utilizes the MIT-licensed OpenCode foundation to deliver a production-ready Terminal User Interface (TUI) that allows engineers to swap between 500+ models.
This portability allows teams to select the best cost-to-performance ratio—perhaps using a lightweight model for documentation but swapping to a frontier model for complex debugging.
Regarding security, Kilo ensures that models are hosted on U.S.-compliant infrastructure like AWS Bedrock, allowing proprietary code to remain within trusted perimeters while leveraging the most efficient intelligence available.
Goose provides a free, open-source alternative that runs entirely on a user’s local machine, but it remains more of a local, experimental tool than an enterprise-ready platform.
Kilo positions itself as the middle path: a production-hardened tool that maintains open-source transparency while providing the infrastructure to scale across an enterprise.
This contrasts with the broader market’s dual-use concerns; while OpenAI builds sandboxes to secure autonomous agents, Kilo’s open-core nature allows for a “superpower” level of community auditing and contribution.
With $8 million in seed funding and a “Right of First Refusal” agreement with GitLab lasting until August 2026, Kilo is positioning itself as the backbone of the next-generation developer stack.
Breitenother views these tools as “exoskeletons” or “mech suits” for the mind, rather than replacements for human engineers.
“We’ve actually moved our engineers to be product owners,” Breitenother reveals. “The time they freed up from writing code, they’re actually doing much more thinking. They’re setting the strategy for the product.”
By unbundling the engineering stack—separating the agentic interface from the model and the model from the IDE—Kilo provides a roadmap for a future where developers think architecturally while machines build the structure.
“It’s the closest thing to magic that I think we can encounter in our life,” Breitenother concludes. For those seeking “Kilo Speed,” the IDE sidebar is just the beginning.
Presented by Certinia
The initial euphoria around Generative and Agentic AI has shifted to a pragmatic, often frustrated, reality. CIOs and technical leaders are asking why their pilot programs, even those designed to automate the simplest of workflows, aren’t delivering the magic promised in demos.
When AI fails to answer a basic question or complete an action correctly, the instinct is to blame the model. We assume the LLM isn’t “smart” enough. But that blame is misplaced. AI doesn’t struggle because it lacks intelligence. It struggles because it lacks context.
In the modern enterprise, context is trapped in a maze of disconnected point solutions, brittle APIs, and latency-ridden integrations — a “Franken-stack” of disparate technologies. And for services-centric organizations in particular, where the real truth of the business lives in the handoffs between sales, delivery, success, and finance, this fragmentation is existential. If your architecture walls off these functions, your AI roadmap is destined for failure.
For the last decade, the standard IT strategy was “best-of-breed.” You bought the best CRM for sales, a separate tool for managing projects, a standalone CSP for success, and an ERP for finance, stitched them together with APIs and middleware (if you were lucky), and declared victory.
For human workers, this was annoying but manageable. A human knows that the project status in the project management tool might be 72 hours behind the invoice data in the ERP. Humans possess the intuition to bridge the gap between systems.
But AI doesn’t have intuition. It has queries. When you ask an AI agent to “staff this new project we won for margin and utilization impact,” it executes a query based on the data it can access now. If your architecture relies on integrations to move data, the AI is working with a delay. It sees the signed contract, but not the resource shortage. It sees the revenue target, but not the churn risk.
The result is not only a wrong answer, but a confident, plausible-sounding wrong answer based on partial truths. Acting on that answer creates costly operational pitfalls that go far beyond failed AI pilots.
This is why the conversation is shifting from “which model should we use?” to “where does our data live?”
To support a hybrid workforce where human experts work alongside equally capable AI agents, the underlying data can’t be stitched together; it must be native to the core business platform. A platform-native approach, specifically one built on a common data model (e.g., Salesforce), eliminates the translation layer and provides the single source of truth that reliable AI requires.
In a native environment, data lives in a single object model. A scope change in delivery is a revenue change in finance. There is no sync, no latency, and no loss of state.
This is the only way to achieve real certainty with AI. If you want an agent to autonomously staff a project or forecast revenue, it’s going to require a 360-degree view of the truth, not a series of snapshots taped together by middleware.
Once you solve for intelligence, you must solve for sovereignty. The argument for a unified platform is usually framed around efficiency, but an increasingly pressing argument is security.
In a best-of-breed Franken-stack, every API connection you build is effectively a new door you have to lock. When you rely on third-party point solutions for critical functions like customer success or resource management, you’re constantly piping sensitive customer data out of your core system of record and into satellite apps. This movement is the risk.
We’ve seen this play out in recent high-profile supply chain breaches. Hackers didn’t need to storm the castle gates of the core platform. They simply walked in through the side door by exploiting the persistent authentication tokens of connected third-party apps.
A platform-native strategy solves this through security by inheritance. When your data stays resident on a single platform, it inherits the massive security investment and trust boundary of that platform. You aren’t moving data across the wire to a different vendor’s cloud just to analyze it. The gold never leaves the vault.
The pressure to deploy AI is immense, but layering intelligent agents on top of unintelligent architecture is a waste of time and resources.
Leaders often hesitate because they fear their data isn’t “clean enough.” They believe they have to scrub every record from the last ten years before they can deploy a single agent. On a fragmented stack, this fear is valid.
A platform-native architecture changes the math. Because the data, metadata, and agents live in the same house, you don’t need to boil the ocean. Simply ring-fence specific, trusted fields — like active customer contracts or current resource schedules — and tell the agent, “Work here. Ignore the rest.” By eliminating the need for complex API translations and third-party middleware, a unified platform allows you to ground agents in your most reliable, connected data today, bypassing the mess without waiting for a “perfect” state that may never arrive.
We often fear that AI will hallucinate because it’s too creative. The real danger is that it will fail because it’s blind. And you cannot automate a complex business with fragmented visibility. Deny your new agentic workforce access to the full context of your operations on a unified platform, and you’re building a foundation that is sure to fail.
Raju Malhotra is Chief Product & Technology Officer at Certinia.
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
Apple on Tuesday announced a major update to its flagship developer tool that gives artificial intelligence agents unprecedented control over the app-building process, a move that signals the iPhone maker’s aggressive push into an emerging and controversial practice known as “agentic coding.”
Xcode 26.3, available immediately as a release candidate, integrates Anthropic’s Claude Agent and OpenAI’s Codex directly into Apple’s development environment, allowing the AI systems to autonomously write code, build projects, run tests, and visually verify their own work — all with minimal human oversight.
The update is Apple’s most significant embrace of AI-assisted software development since introducing intelligence features in Xcode 26 last year, and arrives as “vibe coding” — the practice of delegating software creation to large language models — has become one of the most debated topics in technology.
“Integrating intelligence into the Xcode developer workflow is powerful, but the model itself still has a somewhat limited aperture,” Tim Sneath, an Apple executive, said during a press conference Tuesday morning. “It answers questions based on what the developer provides, but it doesn’t have access to the full context of the project, and it’s not able today to take action on its own. And so that changes today.”
The key innovation in Xcode 26.3 is the depth of integration between AI agents and Apple’s development tools. Unlike previous iterations that offered code suggestions and autocomplete features, the new system grants AI agents access to nearly every aspect of the development process.
During a live demonstration, Jerome Bouvard, an Apple engineer, showed how the Claude agent could receive a simple prompt — “add a new feature to show the weather at a landmark” — and then independently analyze the project’s file structure, consult Apple’s documentation, write the necessary code, build the project, and take screenshots of the running application to verify its work matched the requested design.
“The agent is able to use the tools like build or, you know, grabbing a preview of the screenshots to verify its work, visually analyze the image and confirm that everything has been built accordingly,” Bouvard explained. “Before that, when you’re interacting with a model, the model will provide you an answer and it will just stop there.”
The system creates automatic checkpoints as developers interact with the AI, allowing them to roll back changes if results prove unsatisfactory — a safeguard that acknowledges the unpredictable nature of AI-generated code.
Apple worked directly with Anthropic and OpenAI to optimize the experience, Sneath said, with particular attention paid to reducing token usage — the computational units that determine costs when using cloud-based AI models — and improving the efficiency of tool calling.
“Developers can download new agents with a single click, and they update automatically,” Sneath noted.
Underlying the integration is the Model Context Protocol, or MCP, an open standard that Anthropic developed for connecting AI agents with external tools. Apple’s adoption of MCP means that any compatible agent — not just Claude or Codex — can now interact with Xcode’s capabilities.
“This also works for agents that are running outside of Xcode,” Sneath explained. “Any agent that is compatible with MCP can now work with Xcode to do all the same things—Project Discovery and change management, building and testing apps, working with previews and code snippets, and accessing the latest documentation.”
The decision to embrace an open protocol, rather than building a proprietary system, represents a notable departure for Apple, which has historically favored closed ecosystems. It also positions Xcode as a potential hub for a growing universe of AI development tools.
The announcement comes against a backdrop of mixed experiences with AI-assisted coding in Apple’s tools. During the press conference, one developer described previous attempts to use AI agents with Xcode as “horrible,” citing constant crashes and an inability to complete basic tasks.
Sneath acknowledged the concerns while arguing that the new integration addresses fundamental limitations of earlier approaches.
“The big shift here is that Claude and Codex have so much more visibility into the breadth of the project,” he said. “If they hallucinate and write code that doesn’t work, they can now build. They can see the compile errors, and they can iterate in real time to fix those issues, and we’ll do so in this case before you even, you know, presented it as a finished work.”
The power of IDE integration, Sneath argued, extends beyond error correction. Agents can now automatically add entitlements to projects when needed to access protected APIs — a task that would be “otherwise very difficult to do” for an AI operating outside the development environment and “dealing with binary file that it may not have the file format for.”
Apple’s announcement arrives at a crucial moment in the evolution of AI-assisted development. The term “vibe coding,” coined by AI researcher Andrej Karpathy in early 2025, has transformed from a curiosity into a genuine cultural phenomenon that is reshaping how software gets built.
LinkedIn announced last week that it will begin offering official certifications in AI coding skills, drawing on usage data from platforms like Lovable and Replit. Job postings requiring AI proficiency doubled in the past year, according to edX research, with Indeed’s Hiring Lab reporting that 4.2% of U.S. job listings now mention AI-related keywords.
The enthusiasm is driven by genuine productivity gains. Casey Newton, the technology journalist, recently described building a complete personal website using Claude Code in about an hour — a task that previously required expensive Squarespace subscriptions and years of frustrated attempts with various website builders.
More dramatically, Jaana Dogan, a principal engineer at Google, posted that she gave Claude Code “a description of the problem” and “it generated what we built last year in an hour.” Her post, which accumulated more than 8 million views, began with the disclaimer: “I’m not joking and this isn’t funny.”
But the rapid adoption of agentic coding has also sparked significant concerns among security researchers and software engineers.
David Mytton, founder and CEO of developer security provider Arcjet, warned last month that the proliferation of vibe-coded applications “into production will lead to catastrophic problems for organizations that don’t properly review AI-developed software.”
“In 2026, I expect more and more vibe-coded applications hitting production in a big way,” Mytton wrote. “That’s going to be great for velocity… but you’ve still got to pay attention. There’s going to be some big explosions coming!”
Simon Willison, co-creator of the Django web framework, drew an even starker comparison. “I think we’re due a Challenger disaster with respect to coding agent security,” he said, referring to the 1986 space shuttle explosion that killed all seven crew members. “So many people, myself included, are running these coding agents practically as root. We’re letting them do all of this stuff.”
A pre-print paper from researchers this week warned that vibe coding could pose existential risks to the open-source software ecosystem. The study found that AI-assisted development pulls user interaction away from community projects, reduces visits to documentation websites and forums, and makes launching new open-source initiatives significantly harder.
Stack Overflow usage has plummeted as developers increasingly turn to AI chatbots for answers—a shift that could ultimately starve the very knowledge bases that trained the AI models in the first place.
Previous research painted an even more troubling picture: a 2024 report found that vibe coding using tools like GitHub Copilot “offered no real benefits unless adding 41% more bugs is a measure of success.”
Even enthusiastic adopters have begun acknowledging the darker aspects of AI-assisted development.
Peter Steinberger, creator of the viral AI agent originally known as Clawdbot (now OpenClaw), recently revealed that he had to step back from vibe coding after it consumed his life.
“I was out with my friends and instead of joining the conversation in the restaurant, I was just like, vibe coding on my phone,” Steinberger said in a recent podcast interview. “I decided, OK, I have to stop this more for my mental health than for anything else.”
Steinberger warned that the constant building of increasingly powerful AI tools creates the “illusion of making you more productive” without necessarily advancing real goals. “If you don’t have a vision of what you’re going to build, it’s still going to be slop,” he added.
Google CEO Sundar Pichai has expressed similar reservations, saying he won’t vibe code on “large codebases where you really have to get it right.”
“The security has to be there,” Pichai said in a November podcast interview.
Boris Cherny, the Anthropic engineer who created Claude Code, acknowledged that vibe coding works best for “prototypes or throwaway code, not software that sits at the core of a business.”
“You want maintainable code sometimes. You want to be very thoughtful about every line sometimes,” Cherny said.
Apple appears to be betting that the benefits of deep IDE integration can mitigate many of these concerns. By giving AI agents access to build systems, test suites, and visual verification tools, the company is essentially arguing that Xcode can serve as a quality control mechanism for AI-generated code.
Susan Prescott, Apple’s vice president of Worldwide Developer Relations, framed the update as part of Apple’s broader mission.
“At Apple, our goal is to make tools that put industry-leading technologies directly in developers’ hands so they can build the very best apps,” Prescott said in a statement. “Agentic coding supercharges productivity and creativity, streamlining the development workflow so developers can focus on innovation.”
But the question remains whether the safeguards will prove sufficient as AI agents grow more autonomous. Asked about debugging capabilities, Bouvard noted that while Xcode has “a very powerful debugger built in,” there is “no direct MCP tool for debugging.”
Developers can run the debugger and manually relay information to the agent, but the AI cannot yet independently investigate runtime issues — a limitation that could prove significant as the complexity of AI-generated code increases.
The update also does not currently support running multiple agents simultaneously on the same project, though Sneath noted that developers can open projects in multiple Xcode windows using Git worktrees as a workaround.
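The worktree workaround amounts to giving each agent its own checkout of the same repository, so two Xcode windows never fight over one working directory. A minimal sketch using standard `git worktree` commands; the repository and branch names are made up for illustration:

```shell
# Create a throwaway repo to demonstrate the pattern.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q myapp && cd myapp
git -c user.name=dev -c user.email=dev@example.com \
    commit -q --allow-empty -m "initial commit"

# One extra worktree per agent, each on its own branch, each openable
# as a separate project window:
git worktree add ../myapp-agent-b -b agent-b
git worktree list
```

Each worktree shares the same object database but has an independent index and HEAD, which is exactly the isolation parallel agents need until Xcode supports multiple agents natively.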
Xcode 26.3 is available immediately as a release candidate for members of the Apple Developer Program, with a general release expected soon on the App Store. The release candidate designation — Apple’s final beta before production — means developers who download today will automatically receive the finished version when it ships.
The integration supports both API keys and direct account credentials from OpenAI and Anthropic, offering developers flexibility in managing their AI subscriptions. But those conveniences belie the magnitude of what Apple is attempting: nothing less than a fundamental reimagining of how software comes into existence.
For the world’s most valuable company, the calculus is straightforward. Apple’s ability to attract and retain developers has always underpinned its platform dominance. If agentic coding delivers on its promise of radical productivity gains, early and deep integration could cement Apple’s position for another generation. If it doesn’t — if the security disasters and “catastrophic explosions” that critics predict come to pass — Cupertino could find itself at the epicenter of a very different kind of transformation.
The technology industry has spent decades building systems to catch human errors before they reach users. Now it must answer a more unsettling question: What happens when the errors aren’t human at all?
As Sneath conceded during Tuesday’s press conference, with what may prove to be unintentional understatement: “Large language models, as agents sometimes do, sometimes hallucinate.”
Millions of lines of code are about to find out how often.
The key to successful AI agents within an enterprise? Shared memory and context.
This, according to Asana CPO Arnab Bose, provides detailed history and direct access from the get-go — with guardrail checkpoints and human oversight, of course.
This way, “when you assign a task, you’re not having to go ahead and re-provide all of the context about how your business works,” Bose said at a recent VB event in San Francisco.
Asana launched Asana AI Teammates last year with the philosophy that, just like humans, AI agents should be plugged directly into a team or project to create a collaborative system. To further this mission, the project management company has fully integrated with Anthropic’s Claude.
Users can choose from 12 pre-built agents — for common use cases like IT ticket deflection — or build their own, then assign them to project teams and immediately provide a historical record of what tasks have already been completed and what is still yet to be resolved. Agents also have access to third-party resources like Microsoft 365 or Google Drive.
“When that agent gets created, it’s not acting on behalf of someone, it manifests itself as a teammate and it gets all of the same sharing permissions, it inherits that,” Bose explained. Everything anyone does — humans and AI included — is documented to allow for “ease of explainability” and a “very transparent and trustworthy system.”
But just like human workers, AI agents are kept in check: Critically, workflows incorporate checkpoints, where humans can give feedback and ask the agent to tweak certain elements of a project or adjust research plans. This is documented in what Bose called a “very human-readable way.”
Also importantly, the UI provides instructions and knowledge about agent behavior, and approved admins can pause, edit and redirect models in the API when they take actions based on conflicting directions or start acting “in a weird way.”
“The person with edit rights can delete those things that are conflicting and make it go back to its correct behavior,” said Bose. “We’re leaning into that common human-understandable interaction pattern.”
But because AI agents are so new, there are still many challenges around security, accessibility and compatibility.
Asana users, for instance, must go through an OAuth flow and grant Claude access to Asana via their MCP and other public APIs. But getting all knowledge workers to know that that integration exists — and more importantly, which OAuth grants are OK and which are to be avoided — can be a tall order.
Some of the challenges around direct OAuth grants between applications could be centralized by identity providers, Bose noted, or a centralized listing of approved enterprise AI agents with their skill sets, “almost like an active directory or universal directory of agents.”
Right now, though, beyond what Asana is doing, there’s no standard protocol around shared knowledge and memory, said Bose. His team has been getting “a lot of interesting inbound asks” from partners who want their agents to operate on the Asana work graph and benefit from shared work.
“But because the protocol or standard doesn’t exist, today it has to be a very custom bespoke conversation,” said Bose.
Ultimately, there are three questions the CPO called “extremely interesting” in AI orchestration right now:
How do you build, manage and secure an authoritative list of known approved AI agents?
How can you enable app-to-app integrations as an IT team without potentially configuring dangerous or harmful agents?
Today’s agent-to-agent interactions are very single-player. Claude can independently be connected to Asana or Figma or Slack. How can we finally get to a unified, multi-player outcome?
The increased adoption of the Model Context Protocol (MCP) — the open standard introduced by Anthropic that connects AI agents to external systems through one common interface, rather than a custom integration for every pairing — is promising, he noted, and its widespread adoption could open up new and exciting use cases.
However, “I think there probably isn’t a silver bullet standard out there right now,” said Bose.
OpenAI on Monday released a new desktop application for its Codex artificial intelligence coding system, a tool the company says transforms software development from a collaborative exercise with a single AI assistant into something more akin to managing a team of autonomous workers.
The Codex app for macOS functions as what OpenAI executives describe as a “command center for agents,” allowing developers to delegate multiple coding tasks simultaneously, automate repetitive work, and supervise AI systems that can run for up to 30 minutes independently before returning completed code.
“This is the most loved internal product we’ve ever had,” Sam Altman, OpenAI’s chief executive, told VentureBeat in a press briefing ahead of Monday’s launch. “It’s been totally an amazing thing for us to be using recently at OpenAI.”
The release arrives at a pivotal moment for the enterprise AI market. According to a survey of 100 Global 2000 companies published last week by venture capital firm Andreessen Horowitz, 78% of enterprise CIOs now use OpenAI models in production, though competitors Anthropic and Google are gaining ground rapidly. Anthropic posted the largest share increase of any frontier lab since May 2025, growing 25% in enterprise penetration, with 44% of enterprises now using Anthropic in production.
The timing of OpenAI’s Codex app launch — with its focus on professional software engineering workflows — appears designed to defend the company’s position in what has become the most contested segment of the AI market: coding tools.
The Codex app introduces a fundamentally different approach to AI-assisted coding. While previous tools like GitHub Copilot focused on autocompleting lines of code in real-time, the new application enables developers to “effortlessly manage multiple agents at once, run work in parallel, and collaborate with agents over long-running tasks.”
Alexander Embiricos, the product lead for Codex, explained the evolution during the press briefing by tracing the product’s lineage back to 2021, when OpenAI first introduced a model called Codex that powered GitHub Copilot.
“Back then, people were using AI to write small chunks of code in their IDEs,” Embiricos said. “GPT-5 in August last year was a big jump, and then 5.2 in December was another massive jump, where people started doing longer and longer tasks, asking models to do work end to end. So what we saw is that developers, instead of working closely with the model, pair coding, they started delegating entire features.”
The shift has been so profound that Altman said he recently completed a substantial coding project without ever opening a traditional integrated development environment.
“I was astonished by this…I did this fairly big project in a few days earlier this week and over the weekend. I did not open an IDE during the process. Not a single time,” Altman said. “I did look at some code, but I was not doing it the old-fashioned way, and I did not think that was going to be happening by now.”
The Codex app introduces several new capabilities designed to extend AI coding beyond writing lines of code. Chief among these are “Skills,” which bundle instructions, resources, and scripts so that Codex can “reliably connect to tools, run workflows, and complete tasks according to your team’s preferences.”
The app includes a dedicated interface for creating and managing skills, and users can explicitly invoke specific skills or allow the system to automatically select them based on the task at hand. OpenAI has published a library of skills for common workflows, including tools to fetch design context from Figma, manage projects in Linear, deploy web applications to cloud hosts like Cloudflare and Vercel, generate images using GPT Image, and create professional documents in PDF, spreadsheet, and Word formats.
To demonstrate the system’s capabilities, OpenAI asked Codex to build a racing game from a single prompt. Using an image generation skill and a web game development skill, Codex built the game by working independently using more than 7 million tokens with just one initial user prompt, taking on “the roles of designer, game developer, and QA tester to validate its work by actually playing the game.”
The company has also introduced “Automations,” which allow developers to schedule Codex to work in the background on an automatic schedule. “When an Automation finishes, the results land in a review queue so you can jump back in and continue working if needed.”
Thibault Sottiaux, who leads the Codex team at OpenAI, described how the company uses these automations internally: “We’ve been using Automations to handle the repetitive but important tasks, like daily issue triage, finding and summarizing CI failures, generating daily release briefs, checking for bugs, and more.”
The app also includes built-in support for “worktrees,” allowing multiple agents to work on the same repository without conflicts. “Each agent works on an isolated copy of your code, allowing you to explore different paths without needing to track how they impact your codebase.”
The launch comes as enterprise spending on AI coding tools accelerates dramatically. According to the Andreessen Horowitz survey, average enterprise AI spend on large language models has risen from approximately $4.5 million to $7 million over the last two years, with enterprises expecting growth of another 65% this year to approximately $11.6 million.
Leadership in the enterprise AI market varies significantly by use case. OpenAI dominates “early, horizontal use cases like general purpose chatbots, enterprise knowledge management and customer support,” while Anthropic leads in “software development and data analysis, where CIOs consistently cite rapid capability gains since the second half of 2024.”
When asked during the press briefing how Codex differentiates from Anthropic’s Claude Code, which has been described as having its “ChatGPT moment,” Sottiaux emphasized OpenAI’s focus on model capability for long-running tasks.
“One of the things that our models are extremely good at—they really sit at the frontier of intelligence and doing reliable work for long periods of time,” Sottiaux said. “This is also what we’re optimizing this new surface to be very good at, so that you can start many parallel agents and coordinate them over long periods of time and not get lost.”
Altman added that while many tools can handle “vibe coding front ends,” OpenAI’s 5.2 model remains “the strongest model by far” for sophisticated work on complex systems.
“Taking that level of model capability and putting it in an interface where you can do what Thibault was saying, we think is going to matter quite a bit,” Altman said. “At least listening to users and looking at the chatter on social, that’s the single biggest differentiator.”
The philosophical underpinning of the Codex app reflects a view that OpenAI executives have been articulating for months: that human limitations — not AI capabilities — now constitute the primary constraint on productivity.
In a December appearance on Lenny’s Podcast, Embiricos described human typing speed as “the current underappreciated limiting factor” to achieving artificial general intelligence. The logic: if AI can perform complex coding tasks but humans can’t write prompts or review outputs fast enough, progress stalls.
The Codex app attempts to address this by enabling what the team calls an “abundance mindset” — running multiple tasks in parallel rather than perfecting single requests. During the briefing, Embiricos described how power users at OpenAI work with the tool.
“Last night, I was working on the app, and I was making a few changes, and all of these changes are able to run in parallel together. And I was just sort of going between them, managing them,” Embiricos said. “Behind the scenes, all these tasks are running on something called Git worktrees, which means that the agents are running independently, and you don’t have to manage them.”
In the Sequoia Capital podcast “Training Data,” Embiricos elaborated on this mindset shift: “The mindset that works really well for Codex is, like, kind of like this abundance mindset and, like, hey, let’s try anything. Let’s try anything even multiple times and see what works.” He noted that when users run 20 or more tasks in a day or an hour, “they’ve probably understood basically how to use the tool.”
OpenAI has built security measures into the Codex architecture from the ground up. The app uses “native, open-source and configurable system-level sandboxing,” and by default, “Codex agents are limited to editing files in the folder or branch where they’re working and using cached web search, then asking for permission to run commands that require elevated permissions like network access.”
Embiricos elaborated on the security approach during the briefing, noting that OpenAI has open-sourced its sandbox technology.
“Codex has this sandbox that we’re actually incredibly proud of, and it’s open source, so you can go check it out,” Embiricos said. The sandbox “basically ensures that when the agent is working on your computer, it can only make writes in a specific folder that you want it to make writes into, and it doesn’t access the network without permission.”
The system also includes a granular permission model that allows users to configure persistent approvals for specific actions, avoiding the need to repeatedly authorize routine operations. “If the agent wants to do something and you find yourself annoyed that you’re constantly having to approve it, instead of just saying, ‘All right, you can do everything,’ you can just say, ‘Hey, remember this one thing — I’m actually okay with you doing this going forward,'” Embiricos explained.
Altman emphasized that the permission architecture signals a broader philosophy about AI safety in agentic systems.
“I think this is going to be really important. I mean, it’s been so clear to us using this, how much you want it to have control of your computer, and how much you need it,” Altman said. “And the way the team built Codex such that you can sensibly limit what’s happening and also pick the level of control you’re comfortable with is important.”
He also acknowledged the dual-use nature of the technology. “We do expect to get to our internal cybersecurity high moment of our models very soon. We’ve been preparing for this. We’ve talked about our mitigation plan,” Altman said. “A real thing for the world to contend with is going to be defending against a lot of capable cybersecurity threats using these models very quickly.”
The same capabilities that make Codex valuable for fixing bugs and refactoring code could, in the wrong hands, be used to discover vulnerabilities or write malicious software—a tension that will only intensify as AI coding agents become more capable.
Perhaps the most compelling evidence for Codex’s capabilities comes from OpenAI’s own use of the tool. Sottiaux described how the system has accelerated internal development.
“The Sora Android app is an example of that, where four engineers shipped it internally in only 18 days, and then within the month we gave access to the world,” Sottiaux said. “I had never seen speed at this scale before.”
Beyond product development, Sottiaux described how Codex has become integral to OpenAI’s research operations.
“Codex is really involved in all parts of the research — making new data sets, investigating its own screening runs,” he said. “When I sit in meetings with researchers, they all send Codex off to do an investigation while we’re having a chat, and then it will come back with useful information, and we’re able to debug much faster.”
The tool has also begun contributing to its own development. “Codex also is starting to build itself,” Sottiaux noted. “There’s no screen within the Codex engineering team that doesn’t have Codex running on multiple, six, eight, ten, tasks at a time.”
When asked whether this constitutes evidence of “recursive self-improvement” — a concept that has long concerned AI safety researchers — Sottiaux was measured in his response.
“There is a human in the loop at all times,” he said. “I wouldn’t necessarily call it recursive self-improvement, but it’s a glimpse into the future there.”
Altman offered a more expansive view of the research implications.
“There’s two parts of what people talk about when they talk about automating research, to a degree where you can imagine that happening,” Altman said. “One is, can you write extremely complex infrastructure software to run training jobs across hundreds of thousands of GPUs and babysit them. And the second is, can you come up with the new scientific ideas that make algorithms more efficient.”
He noted that OpenAI is “seeing early but promising signs on both of those.”
One of the more unexpected applications of Codex has been addressing technical debt — the accumulated maintenance burden that plagues most software projects.
Altman described how AI coding agents excel at the unglamorous work that human engineers typically avoid.
“The kind of work that human engineers hate to do — go refactor this, clean up this code base, rewrite this, write this test — this is where the model doesn’t care. The model will do anything, whether it’s fun or not,” Altman said.
He reported that some infrastructure teams at OpenAI that “had sort of like, given up hope that you were ever really going to long-term win the war against tech debt, are now like, we’re going to win this, because the model is going to constantly be working behind us, making sure we have great test coverage, making sure that we refactor when we’re supposed to.”
The observation speaks to a broader theme that emerged repeatedly during the briefing: AI coding agents don’t experience the motivational fluctuations that affect human programmers. As Altman noted, a team member recently observed that “the hardest mental adjustment to make about working with these sort of like AI coding teammates, unlike a human, is the models just don’t run out of dopamine. They keep trying. They don’t run out of motivation. They don’t lose energy when something’s not working. They just keep going and, you know, they figure out how to get it done.”
The Codex app launches today on macOS and is available to anyone with a ChatGPT Plus, Pro, Business, Enterprise, or Edu subscription. Usage is included in ChatGPT subscriptions, with the option to purchase additional credits if needed.
In a promotional push, OpenAI is temporarily making Codex available to ChatGPT Free and Go users “to help more people try agentic workflows.” The company is also doubling rate limits for existing Codex users across all paid plans during this promotional period.
The pricing strategy reflects OpenAI’s determination to establish Codex as the default tool for AI-assisted development before competitors can gain further traction. More than a million developers have used Codex in the past month, and usage has nearly doubled since the launch of GPT-5.2-Codex in mid-December, building on more than 20x usage growth since August 2025.
Customers using Codex include large enterprises like Cisco, Ramp, Virgin Atlantic, Vanta, Duolingo, and Gap, as well as startups like Harvey, Sierra, and Wonderful. Individual developers have also embraced the tool: Peter Steinberger, creator of OpenClaw, built the project entirely with Codex and reports that since fully switching to the tool, his productivity has roughly doubled across more than 82,000 GitHub contributions.
OpenAI outlined an aggressive development roadmap for Codex. The company plans to make the app available on Windows, continue pushing “the frontier of model capabilities,” and roll out faster inference.
Within the app, OpenAI will “keep refining multi-agent workflows based on real-world feedback” and is “building out Automations with support for cloud-based triggers, so Codex can run continuously in the background—not just when your computer is open.”
The company also announced a new “plan mode” feature that allows Codex to read through complex changes in read-only mode, then discuss with the user before executing. “This means that it lets you build a lot of confidence before, again, sending it to do a lot of work by itself, independently, in parallel to you,” Embiricos explained.
Additionally, OpenAI is introducing customizable personalities for Codex. “The default personality for Codex has been quite terse. A lot of people love it, but some people want something more engaging,” Embiricos said. Users can access the new personalities using the /personality command.
Altman also hinted at future integration with ChatGPT’s broader ecosystem.
“There will be all kinds of cool things we can do over time to connect people’s ChatGPT accounts and leverage sort of all the history they’ve built up there,” Altman said.
The Codex launch occurs as most enterprises have moved beyond single-vendor strategies. According to the Andreessen Horowitz survey, “81% now use three or more model families in testing or production, up from 68% less than a year ago.”
Despite the proliferation of AI coding tools, Microsoft continues to dominate enterprise adoption through its existing relationships. “Microsoft 365 Copilot leads enterprise chat though ChatGPT has closed the gap meaningfully,” and “GitHub Copilot is still the coding leader for enterprises.” The survey found that “65% of enterprises noted they preferred to go with incumbent solutions when available,” citing trust, integration, and procurement simplicity.
However, the survey also suggests significant opportunity for challengers: “Enterprises consistently say they value faster innovation, deeper AI focus, and greater flexibility paired with cutting edge capabilities that AI native startups bring.”
OpenAI appears to be positioning Codex as a bridge between these worlds. “Codex is built on a simple premise: everything is controlled by code,” the company stated. “The better an agent is at reasoning about and producing code, the more capable it becomes across all forms of technical and knowledge work.”
The company’s ambition extends beyond coding. “We’ve focused on making Codex the best coding agent, which has also laid the foundation for it to become a strong agent for a broad range of knowledge work tasks that extend beyond writing code.”
When asked whether AI coding tools could eventually move beyond early adopters to become mainstream, Altman suggested the transition may be closer than many expect.
“Can it go from vibe coding to serious software engineering? That’s what this is about,” Altman said. “I think we are over the bar on that. I think this will be the way that most serious coders do their job — and very rapidly from now.”
He then pivoted to an even bolder prediction: that code itself could become the universal interface for all computer-based work.
“Code is a universal language to get computers to do what you want. And it’s gotten so good that I think, very quickly, we can go not just from vibe coding silly apps but to doing all the non-coding knowledge work,” Altman said.
At the close of the briefing, Altman urged journalists to try the product themselves: “Please try the app. There’s no way to get this across just by talking about it. It’s a crazy amount of power.”
For developers who have spent careers learning to write code, the message was clear: the future belongs to those who learn to manage the machines that write it for them.
Enterprises have moved quickly to adopt RAG to ground LLMs in proprietary data. In practice, however, many organizations are discovering that retrieval is no longer a feature bolted onto model inference — it has become a foundational system dependency.
By now, many enterprises have deployed some form of RAG. The promise is seductive: index your PDFs, connect an LLM and instantly democratize your corporate knowledge.
But for industries dependent on heavy engineering, the reality has been underwhelming. Engineers ask specific questions about infrastructure, and the bot hallucinates.
The failure isn’t in the LLM. The failure is in the preprocessing.
Standard RAG pipelines treat documents as flat strings of text. They use “fixed-size chunking” (cutting a document every 500 characters). This works for prose, but it destroys the logic of technical manuals. It slices tables in half, severs captions from images, and ignores the visual hierarchy of the page.
Improving RAG reliability isn’t about buying a bigger model; it’s about fixing the “dark data” problem through semantic chunking and multimodal textualization.
Here is the architectural framework for building a RAG system that can actually read a manual.
In a standard Python RAG tutorial, you split text by character count. In an enterprise PDF, this is disastrous.
If a safety specification table spans 1,000 tokens, and your chunk size is 500, you have just split the “voltage limit” header from the “240V” value. The vector database stores them separately. When a user asks, “What is the voltage limit?”, the retrieval system finds the header but not the value. The LLM, forced to answer, often guesses.
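The failure mode is easy to reproduce. A toy sketch, with an invented spec table and a chunk size shrunk to 40 characters so the split is visible:

```python
# Demonstration of how fixed-size chunking severs a spec table: the header
# row and its value land in different chunks, so vector search can retrieve
# one without the other. The document text is invented for illustration.

doc = (
    "Section 4.2 Electrical safety limits.\n"
    "| Parameter      | Limit |\n"
    "| Voltage limit  | 240V  |\n"
)

def fixed_chunks(text: str, size: int) -> list[str]:
    """Naive fixed-size chunking, as in many RAG tutorials."""
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = fixed_chunks(doc, 40)

# The chunk matching a "voltage limit" query does not contain the answer.
header_chunk = next(c for c in chunks if "Voltage" in c)
print("240V" in header_chunk)  # → False
```

The value "240V" sits in a later chunk with no retrievable context of its own, which is exactly the situation that pushes the LLM toward guessing.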
The first step to fixing production RAG is abandoning arbitrary character counts in favor of document intelligence.
Using layout-aware parsing tools (such as Azure Document Intelligence), we can segment data based on document structure such as chapters, sections and paragraphs, rather than token count.
Logical cohesion: A section describing a specific machine part is kept as a single vector, even if it varies in length.
Table preservation: The parser identifies a table boundary and forces the entire grid into a single chunk, preserving the row-column relationships that are vital for accurate retrieval.
In our internal qualitative benchmarks, moving from fixed to semantic chunking significantly improved the retrieval accuracy of tabular data, effectively stopping the fragmentation of technical specs.
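A minimal sketch of what layout-aware chunking looks like downstream of the parser. The `elements` structure and its field names are a simplified, illustrative stand-in, not the actual Azure Document Intelligence response format:

```python
# Semantic chunking over hypothetical layout-aware parser output: group
# elements by section heading, and keep each table intact as a single
# element inside its section chunk.

elements = [
    {"type": "heading",   "text": "4.2 Electrical safety limits"},
    {"type": "paragraph", "text": "All installations must respect these limits."},
    {"type": "table",     "text": "Parameter | Limit\nVoltage limit | 240V"},
    {"type": "heading",   "text": "4.3 Mechanical tolerances"},
    {"type": "paragraph", "text": "Tolerances are measured at 20 C."},
]

def semantic_chunks(elements: list[dict]) -> list[str]:
    """One chunk per section, regardless of length; tables are never split."""
    chunks, current = [], []
    for el in elements:
        if el["type"] == "heading" and current:
            chunks.append("\n".join(current))  # close the previous section
            current = []
        current.append(el["text"])
    if current:
        chunks.append("\n".join(current))
    return chunks

chunks = semantic_chunks(elements)
print(len(chunks))  # → 2: one chunk per section, header and value together
```

Here "Voltage limit" and "240V" travel together into the vector store as part of the same section chunk, so a query matching either one retrieves both.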
The second failure mode of enterprise RAG is blindness. A massive amount of corporate IP exists not in text, but in flowcharts, schematics and system architecture diagrams. Standard embedding models (like text-embedding-3-small) cannot “see” these images. They are skipped during indexing.
If your answer lies in a flowchart, your RAG system will say, “I don’t know.”
To make diagrams searchable, we implemented a multimodal preprocessing step using vision-capable models (specifically GPT-4o) before the data ever hits the vector store.
OCR extraction: High-precision optical character recognition pulls text labels from within the image.
Generative captioning: The vision model analyzes the image and generates a detailed natural language description (“A flowchart showing that process A leads to process B if the temperature exceeds 50 degrees”).
Hybrid embedding: This generated description is embedded and stored as metadata linked to the original image.
Now, when a user searches for “temperature process flow,” the vector search matches the description, even though the original source was a PNG file.
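The three steps can be sketched as a single indexing function. `caption_image` is a hypothetical stub standing in for the vision-model call; a production version would also merge in the OCR output:

```python
# Multimodal textualization sketch: at index time, each diagram is
# represented by a generated text description that gets embedded, with a
# pointer back to the source image kept in metadata.

def caption_image(path: str) -> str:
    # Stub: a real implementation would send the image to a vision model
    # (e.g. GPT-4o) and combine the caption with OCR-extracted labels.
    return ("A flowchart showing that process A leads to process B "
            "if the temperature exceeds 50 degrees")

def index_image(path: str) -> dict:
    """Build the record that goes into the vector store."""
    description = caption_image(path)
    return {
        "text": description,  # this string is what gets embedded
        "metadata": {"source_image": path, "modality": "image"},
    }

record = index_image("manuals/cooling_flowchart.png")
print(record["metadata"]["source_image"])
```

The `source_image` pointer is what later makes visual citation possible: the UI can fetch and display the original PNG next to any answer that was grounded in its description.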
For enterprise adoption, accuracy is only half the battle. The other half is verifiability.
In a standard RAG interface, the chatbot gives a text answer and cites a filename. This forces the user to download the PDF and hunt for the page to verify the claim. For high-stakes queries (“Is this chemical flammable?”), users simply won’t trust the bot.
The architecture should implement visual citation. Because we preserved the link between the text chunk and its parent image during the preprocessing phase, the UI can display the exact chart or table used to generate the answer alongside the text response.
This “show your work” mechanism allows humans to verify the AI’s reasoning instantly, bridging the trust gap that kills so many internal AI projects.
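Because each chunk carries its parent-image pointer from the preprocessing phase, assembling a visually cited answer is just a matter of collecting that metadata at response time. A sketch, with an invented chunk structure:

```python
# Visual citation sketch: the response payload ships the answer together
# with every source image the retrieved chunks were derived from, so the
# UI can render them side by side.

retrieved = [
    {"text": "Voltage limit: 240V",
     "metadata": {"source_image": "manual_p12_table3.png"}},
    {"text": "Applies to EU installations.", "metadata": {}},
]

def build_response(answer: str, chunks: list[dict]) -> dict:
    """Attach every cited visual to the answer for side-by-side display."""
    citations = [
        c["metadata"]["source_image"]
        for c in chunks
        if "source_image" in c["metadata"]
    ]
    return {"answer": answer, "visual_citations": citations}

resp = build_response("The voltage limit is 240V.", retrieved)
print(resp["visual_citations"])  # → ['manual_p12_table3.png']
```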
While the “textualization” method (converting images to text descriptions) is the practical solution for today, the architecture is rapidly evolving.
We are already seeing the emergence of native multimodal embeddings (such as Cohere’s Embed 4). These models can map text and images into the same vector space without the intermediate step of captioning. While we currently use a multi-stage pipeline for maximum control, the future of data infrastructure will likely involve “end-to-end” vectorization where the layout of a page is embedded directly.
Furthermore, as long context LLMs become cost-effective, the need for chunking may diminish. We may soon pass entire manuals into the context window. However, until latency and cost for million-token calls drop significantly, semantic preprocessing remains the most economically viable strategy for real-time systems.
The difference between a RAG demo and a production system is how it handles the messy reality of enterprise data.
Stop treating your documents as simple strings of text. If you want your AI to understand your business, you must respect the structure of your documents. By implementing semantic chunking and unlocking the visual data within your charts, you transform your RAG system from a “keyword searcher” into a true “knowledge assistant.”
Dippu Kumar Singh is an AI architect and data engineer.
A new study by Google suggests that advanced reasoning models achieve high performance by simulating multi-agent-like debates involving diverse perspectives, personality traits, and domain expertise.
Their experiments demonstrate that this internal debate, which they dub “society of thought,” significantly improves model performance in complex reasoning and planning tasks. The researchers found that leading reasoning models such as DeepSeek-R1 and QwQ-32B, which are trained via reinforcement learning (RL), inherently develop this ability to engage in society of thought conversations without explicit instruction.
These findings offer a roadmap for how developers can build more robust LLM applications and how enterprises can train superior models using their own internal data.
The core premise of society of thought is that reasoning models learn to emulate social, multi-agent dialogues to refine their logic. This hypothesis draws on cognitive science, specifically the idea that human reason evolved primarily as a social process to solve problems through argumentation and engagement with differing viewpoints.
The researchers write that “cognitive diversity, stemming from variation in expertise and personality traits, enhances problem solving, particularly when accompanied by authentic dissent.” Consequently, they suggest that integrating diverse perspectives allows LLMs to develop robust reasoning strategies. By simulating conversations between different internal personas, models can perform essential checks (such as verification and backtracking) that help avoid common pitfalls like unwanted biases and sycophancy.
In models like DeepSeek-R1, this “society” manifests directly within the chain of thought. The researchers note that you do not need separate models or prompts to force this interaction; the debate emerges autonomously within the reasoning process of a single model instance.
The study provides tangible examples of how this internal friction leads to better outcomes. In one experiment involving a complex organic chemistry synthesis problem, DeepSeek-R1 simulated a debate among multiple distinct internal perspectives, including a “Planner” and a “Critical Verifier.”
The Planner initially proposed a standard reaction pathway. However, the Critical Verifier (characterized as having high conscientiousness and low agreeableness) interrupted to challenge the assumption and provided a counter argument with new facts. Through this adversarial check, the model discovered the error, reconciled the conflicting views, and corrected the synthesis path.
A similar dynamic appeared in creative tasks. When asked to rewrite the sentence, “I flung my hatred into the burning fire,” the model simulated a negotiation between a “Creative Ideator” and a “Semantic Fidelity Checker.” After the ideator suggested a version using the word “deep-seated,” the checker retorted, “But that adds ‘deep-seated,’ which wasn’t in the original. We should avoid adding new ideas.” The model eventually settled on a compromise that maintained the original meaning while improving the style.
Perhaps the most striking evolution occurred in “Countdown Game,” a math puzzle where the model must use specific numbers to reach a target value. Early in training, the model tried to solve the problem using a monologue approach. As it learned via RL, it spontaneously split into two distinct personas: a “Methodical Problem-Solver” performing calculations and an “Exploratory Thinker” monitoring progress, who would interrupt failed paths with remarks like “Again no luck … Maybe we can try using negative numbers,” prompting the Methodical Solver to switch strategies.
These findings challenge the assumption that longer chains of thought automatically result in higher accuracy. Instead, diverse behaviors such as looking at responses through different lenses, verifying earlier assumptions, backtracking, and exploring alternatives drive the improvements in reasoning. The researchers reinforced this by artificially steering a model’s activation space to trigger conversational surprise; this intervention activated a wider range of personality- and expertise-related features, doubling accuracy on complex tasks.
The implication is that social reasoning emerges autonomously through RL as a function of the model’s drive to produce correct answers, rather than through explicit human supervision. In fact, training models on monologues underperformed raw RL that naturally developed multi-agent conversations. Conversely, performing supervised fine-tuning (SFT) on multi-party conversations and debate significantly outperformed SFT on standard chains of thought.
For developers and enterprise decision-makers, these insights offer practical guidelines for building more powerful AI applications.
Developers can enhance reasoning in general-purpose models by explicitly prompting them to adopt a society of thought structure. However, it is not enough to simply ask the model to chat with itself.
“It’s not enough to ‘have a debate’ but to have different views and dispositions that make debate inevitable and allow that debate to explore and discriminate between alternatives,” James Evans, co-author of the paper, told VentureBeat.
Instead of generic roles, developers should design prompts that assign opposing dispositions (e.g., a risk-averse compliance officer versus a growth-focused product manager) to force the model to discriminate between alternatives. Even simple cues that steer the model to express “surprise” can trigger these superior reasoning paths.
As developers scale test-time compute to allow models to “think” longer, they should structure this time as a social process. Applications should facilitate a “societal” process where the model uses pronouns like “we,” asks itself questions, and explicitly debates alternatives before converging on an answer.
This approach can also expand to multi-agent systems, where distinct personalities assigned to different agents engage in critical debate to reach better decisions.
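A minimal sketch of such a prompt builder; the persona wording and template here are our own invention for illustration, not anything prescribed by the paper:

```python
# Society-of-thought prompting sketch: assign opposing dispositions so the
# model's internal debate has something genuine to disagree about, rather
# than generic "chat with yourself" instructions.

personas = [
    ("Risk-averse compliance officer",
     "High conscientiousness, low agreeableness; challenges every claim."),
    ("Growth-focused product manager",
     "High openness; proposes alternatives and pushes for exploration."),
]

def society_prompt(task: str) -> str:
    """Frame the task as a debate between opposed personas."""
    roles = "\n".join(
        f"- {name}: {disposition}" for name, disposition in personas
    )
    return (
        "Reason about the task as a debate between these personas, "
        "expressing disagreement and surprise before converging on "
        "a single answer:\n"
        f"{roles}\n"
        f"Task: {task}"
    )

prompt = society_prompt("Should we ship the feature behind a flag?")
print("compliance officer" in prompt)  # → True
```

The same persona list could instead be split across separate agents in a multi-agent setup, with each agent holding one disposition and the debate happening between model instances rather than inside one chain of thought.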
Perhaps the most significant implication lies in how companies train or fine-tune their own models. Traditionally, data teams scrub their datasets to create “Golden Answers” that provide perfect, linear paths to a solution. The study suggests this might be a mistake.
Models fine-tuned on conversational data (e.g., transcripts of multi-agent debate and resolution) improve reasoning significantly faster than those trained on clean monologues. There is even value in debates that don’t lead to the correct answer.
“We trained on conversational scaffolding that led to the wrong answer, then reinforced the model and found that it performed just as well as reinforcing on the right answer, suggesting that the conversational habits of exploring solutions was the most important for new problems,” Evans said.
This implies enterprises should stop discarding “messy” engineering logs or Slack threads where problems were solved iteratively. The “messiness” is where the model learns the habit of exploration.
For high-stakes enterprise use cases, simply getting an answer isn’t enough. Evans argues that users need to see the internal dissent to trust the output, suggesting a shift in user interface design.
“We need a new interface that systematically exposes internal debates to us so that we ‘participate’ in calibrating the right answer,” Evans said. “We do better with debate; AIs do better with debate; and we do better when exposed to AI’s debate.”
These findings provide a new argument in the “build vs. buy” debate regarding open-weight models versus proprietary APIs. Many proprietary reasoning models hide their chain-of-thought, treating the internal debate as a trade secret or a safety liability.
Evans acknowledges that “no one has really provided a justification for exposing this society of thought before,” but argues that the value of auditing these internal conflicts is becoming undeniable. Until proprietary providers offer full transparency, enterprises in high-compliance sectors may find that open-weight models offer a distinct advantage: the ability to see the dissent, not just the decision.
“I believe that large, proprietary models will begin serving (and licensing) the information once they realize that there is value in it,” Evans said.
The research suggests that the job of an AI architect is shifting from pure model training to something closer to organizational psychology.
“I believe that this opens up a whole new frontier of small group and organizational design within and between models that is likely to enable new classes of performance,” Evans said. “My team is working on this, and I hope that others are too.”
Presented by SAP
The consumer packaged goods industry is experiencing a fundamental shift that’s forcing even the most established brands to rethink how they operate. It’s what some call the CPG squeeze: a convergence of margin compression, trade policy headwinds, and the sobering reality that pricing-led growth is no longer a viable strategy. For companies that have relied on price increases to drive revenue, it’s a structural change that demands new approaches to operations, strategy, and competitive positioning.
CPG companies now need to achieve annual productivity gains of 5% or more just to stay competitive. Traditional cost-cutting measures like travel freezes, hiring pauses, and other age-old efficiency drives from simpler times might yield a couple of percentage points at best. The solution lies in a more sophisticated approach: identifying which processes can be digitally enabled before making organizational changes, confronting questions about process efficiency, manual workflows, and opportunities for automation.
But piecemeal solutions that address isolated problems can’t deliver the systemic efficiency gains that CPG companies now require. This is driving increased interest in integrated technology platforms that can support decision-making and execution across all functional areas simultaneously.
Modern CPG operations run on data, but not all data strategies are created equal. Companies face a dual challenge: they need deep insights into their internal operations while simultaneously understanding external market dynamics and consumer behavior. Historically, this has meant extracting operational data, losing critical business context in the process, and then investing heavily to reconstitute that context so it can be analyzed alongside consumer and retail data.
The disconnect creates real problems. When data loses its business context during extraction, companies spend significant time and money trying to rebuild an understanding of what the numbers actually mean. Meanwhile, market conditions change, promotional windows close, and opportunities disappear. In an industry where timing often determines success or failure, this lag in analytical capability becomes a competitive disadvantage.
To address this challenge, advanced data platforms like SAP’s Business Data Cloud can combine imported external data with internal SAP operational data while preserving full business context. CPG brands can bring together point-of-sale data from retailers, insights on consumer behavior, and internal transactional information without the traditional extract-and-reconstruct workflow — fundamentally changing the speed at which companies can move from analysis to decision to action.
The impact is particularly significant for promotional planning and revenue management. Instead of spending weeks preparing data for analysis, companies can run scenarios, model outcomes, and adjust strategies in near real-time, which is huge in an industry where promotional windows are measured in days or weeks.
High-stakes promotional moments like the Super Bowl expose how fragile CPG operations have become. Demand spikes are intense, localized, and short-lived, leaving little margin for delayed insights or disconnected execution. In this environment, promotional success depends less on creative merchandising and more on how quickly companies can sense demand, model outcomes, and align pricing, inventory, and execution while the window is still open.
The decision-making behind these promotions involves complex analysis of multiple variables: which products to feature, optimal discount levels, store-specific positioning, and even regional variations in consumer preferences. What resonates with shoppers in one geography may fall flat in another, so effective promotional strategy requires granular analysis down to individual store locations.
Tools like SAP’s Revenue Growth Management solution enable this level of sophistication, helping brands calculate and model promotional lifts and translate those insights into execution-ready decisions. The analysis accounts for regional taste preferences, local competitive dynamics, and historical performance data to optimize every promotional decision.
But promotional planning is only valuable if it can be executed effectively. This is where many CPG companies encounter friction between strategy and operations. Data analysis might pinpoint the perfect promotional mix, but without ensuring product availability, maintaining shelf presence, and executing physical merchandising, the analysis is pretty much academic. That’s why integration between promotional planning systems, supply chain and financial planning systems, and ERP platforms is critical.
For high-velocity promotional periods, companies must forecast demand accurately, position inventory strategically, and execute distribution flawlessly. This is particularly complex for categories like snacks and beverages, where direct store delivery models are common. Managing shelf presence is critical, because an empty shelf means consumers will switch to competitive products or abandon the purchase entirely. And it requires real-time visibility into multiple layers of the supply chain, across a variety of data sources, plus the operational capability to act on that visibility quickly.
Modern warehouse management systems, including SAP Extended Warehouse Management, provide the granular visibility needed to track inventory across these multiple layers. When combined with DSD-specific applications, such as SAP’s last mile distribution solution, that optimize driver routes, delivery schedules, and in-store execution, CPG companies can maintain the shelf presence that drives promotional success. Sales execution tools, such as SAP’s retail execution offering in SAP Sales Cloud, allow field teams to audit stores and report on actual conditions. This gives headquarters clear, accurate visibility into what’s happening at the point of purchase.
Artificial intelligence is moving beyond experimental use cases to practical applications across CPG operations. In warehouse environments, AI-enhanced systems can optimize task management, improve forecasting accuracy, and streamline returns processing. For supply chain planning, AI assists in generating demand scenarios that account for multiple variables affecting product movement.
SAP’s integration of Joule into Integrated Business Planning software demonstrates how conversational AI can transform planning workflows. Instead of navigating complex interfaces to access supply chain data, planners can ask natural language questions and receive immediate, AI-driven responses based on real-time information. This reduces the friction in accessing insights and accelerates decision-making during critical planning cycles.
Advanced warehouse operations are benefiting from AI agents that can enhance inventory risk analysis, optimize task management, and improve forecast accuracy. These aren’t just faster versions of existing processes. Instead, they represent qualitatively different capabilities that can identify patterns and risks that human analysts might miss amid the volume and complexity of modern supply chain operations.
Revenue management, or determining optimal pricing and promotional strategies, is particularly well-suited to AI assistance, because analyzing how different price points, promotional tactics, and positioning strategies interact across thousands of stores and products is complex beyond human analytical capacity. Machine learning can identify patterns and optimize decisions at a scale and speed that manual analysis cannot match. AI capabilities being built into revenue growth management platforms promise to make promotional planning both more sophisticated and more efficient.
Perhaps most significantly for CPG companies facing the productivity imperative, intelligent inventory management systems are using machine learning to predict delivery dates and provide real-time analytics for distribution decisions. Sales order fulfillment monitoring can predict fulfillment risks before they materialize, enabling proactive intervention. These AI capabilities address issues like product availability and reliable delivery during critical promotional windows, which are some of the highest-stakes challenges in CPG operations.
But the most impactful AI applications in CPG won’t necessarily be the most visible. Instead of flashy consumer-facing features, the real value comes from embedding intelligence into core operational processes. Incremental improvements across dozens of workflows compound into substantial competitive advantages over time.
The CPG squeeze isn’t a temporary condition that companies can wait out. The structural factors driving margin compression and limiting pricing power reflect fundamental market changes. Trade policies will continue evolving. Consumer behavior will keep shifting. The companies that emerge stronger won’t be those with the best products alone; they’ll be those that build the most efficient, responsive operations.
Jon Dano is Industry Advisor for Consumer Products at SAP.
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
Two days after releasing what analysts call the most powerful open-source AI model ever created, researchers from China’s Moonshot AI logged onto Reddit to face a restless audience. The Beijing-based startup had reason to show up. Kimi K2.5 had just landed headlines about closing the gap with American AI giants and testing the limits of U.S. chip export controls. But the developers waiting on r/LocalLLaMA, a forum where engineers trade advice on running powerful language models on everything from a single consumer GPU to a small rack of prosumer hardware, had a different concern.
They wanted to know when they could actually use it.
The three-hour Ask Me Anything session became an unexpectedly candid window into frontier AI development in 2026 — not the polished version that appears in corporate blogs, but the messy reality of debugging failures, managing personality drift, and confronting a fundamental tension that defines open-source AI today.
Moonshot had published the model’s weights for anyone to download and customize. The file runs roughly 595 gigabytes. For most of the developers in the thread, that openness remained theoretical.
Three Moonshot team members participated under the usernames ComfortableAsk4494, zxytim, and ppwwyyxx. Over approximately 187 comments, they fielded questions about architecture, training methodology, and the philosophical puzzle of what gives an AI model its “soul.” They also offered a picture of where the next round of progress will come from — and it wasn’t simply “more parameters.”
The very first wave of questions treated Kimi K2.5 less like a breakthrough and more like a logistics headache.
One user asked bluntly why Moonshot wasn’t creating smaller models alongside the flagship. “Small sizes like 8B, 32B, 70B are great spots for the intelligence density,” they wrote. Another said huge models had become difficult to celebrate because many developers simply couldn’t run them. A third pointed to American competitors as size targets, requesting coder-focused variants that could fit on modest GPUs.
Moonshot’s team didn’t announce a smaller model on the spot. But it acknowledged the demand in terms that suggested the complaint was familiar. “Requests well received!” one co-host wrote. Another noted that Moonshot’s model collection already includes some smaller mixture-of-experts models on Hugging Face, while cautioning that small and large models often require different engineering investments.
The most revealing answer came when a user asked whether Moonshot might build something around 100 billion parameters optimized for local use. The Kimi team responded by floating a different compromise: a 200 billion or 300 billion parameter model that could stay above what it called a “usability threshold” across many tasks.
That reply captured the bind open-weight labs face. A 200-to-300 billion parameter model would broaden access compared to a trillion-parameter system, but it still assumes multi-GPU setups or aggressive quantization. The developers in the thread weren’t asking for “somewhat smaller.” They were asking for models sized for the hardware they actually own — and for a roadmap that treats local deployment as a first-class constraint rather than a hobbyist afterthought.
As the thread moved past hardware complaints, it turned to what many researchers now consider the central question in large language models: have scaling laws begun to plateau?
One participant asked directly whether scaling had “hit a wall.” A Kimi representative replied with a diagnosis that has become increasingly common across the industry. “The amount of high-quality data does not grow as fast as the available compute,” they wrote, “so scaling under the conventional ‘next token prediction with Internet data’ will bring less improvement.”
Then the team offered its preferred escape route. It pointed to Agent Swarm, Kimi K2.5’s ability to coordinate up to 100 sub-agents working in parallel, as a form of “test-time scaling” that could open a new path to capability gains. In the team’s framing, scaling doesn’t have to mean only larger pretraining runs. It can also mean increasing the amount of structured work done at inference time, then folding those insights back into training through reinforcement learning.
“There might be new paradigms of scaling that can possibly happen,” one co-host wrote. “Looking forward, it’s likely to have a model that learns with less or even zero human priors.”
The claim implies that the unit of progress may be shifting from parameter count and pretraining loss curves toward systems that can plan, delegate, and verify — using tools and sub-agents as building blocks rather than relying on a single massive forward pass.
On paper, Agent Swarm sounds like a familiar idea in a new wrapper: many AI agents collaborating on a task. The AMA surfaced the more important details — where the memory goes, how coordination happens, and why orchestration doesn’t collapse into noise.
A developer raised a classic multi-agent concern. At a scale of 100 sub-agents, an orchestrator agent often becomes a bottleneck, both in latency and in what the community calls “context rot” — the degradation in performance that occurs as a conversation history fills with internal chatter and tool traces until the model loses the thread.
A Kimi co-host answered with a design choice that matters for anyone building agent systems in enterprise settings. The sub-agents run with their own working memory and send back results to the orchestrator, rather than streaming everything into a shared context. “This allows us to scale the total context length in a new dimension!” they wrote.
Another developer pressed on performance claims. Moonshot has publicly described Agent Swarm as capable of achieving about 4.5 times speedup on suitable workflows, but skeptics asked whether that figure simply reflects how parallelizable a given task is. The team agreed: it depends. In some cases, the system decides that a task doesn’t require parallel agents and avoids spending the extra compute. It also described sub-agent token budgets as something the orchestrator must manage, assigning each sub-agent a task of appropriate size.
Read as engineering rather than marketing, Moonshot was describing a familiar enterprise pattern: keep the control plane clean, bound the outputs from worker processes, and avoid flooding a coordinator with logs it can’t digest.
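That pattern is easy to sketch. In the hypothetical snippet below, each sub-agent keeps its own working memory and only a bounded result crosses back to the orchestrator; the names and budgets are illustrative, not Moonshot's actual implementation:

```python
# Minimal sketch of the isolation pattern: sub-agents accumulate private
# traces, but the orchestrator only ever sees a bounded summary per task,
# keeping its own context clean and avoiding "context rot".
def run_subagent(task: str, budget_chars: int) -> str:
    # In a real system this would be a full agent loop with its own
    # message history and tool calls; here we simulate a verbose trace.
    private_memory = [f"step {i}: working on {task}" for i in range(50)]
    result = f"DONE {task}: " + "; ".join(private_memory[-2:])
    # Only a bounded result crosses back to the orchestrator.
    return result[:budget_chars]

def orchestrate(tasks: list[str], budget_chars: int = 120) -> list[str]:
    # The orchestrator's context holds one bounded line per sub-agent,
    # never the sub-agents' full internal traces.
    return [run_subagent(t, budget_chars) for t in tasks]

results = orchestrate(["parse logs", "summarize diffs", "draft tests"])
```

The `budget_chars` knob plays the role of the per-sub-agent token budget the team described: the orchestrator sizes each task and caps what comes back.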
The most consequential shift hinted at in the AMA wasn’t a new benchmark score. It was a statement about priorities.
One question asked whether Moonshot was moving compute from “System 1” pretraining to “System 2” reinforcement learning — shorthand for shifting from broad pattern learning toward training that explicitly rewards reasoning and correct behavior over multi-step tasks. A Kimi representative replied that RL compute will keep increasing, and suggested that new RL objective functions are likely, “especially in the agent space.”
That line reads like a roadmap. As models become more tool-using and task-decomposing, labs will spend more of their budget training models to behave well as agents — not merely to predict tokens.
For enterprises, this matters because RL-driven improvements often arrive with tradeoffs. A model can become more decisive, more tool-happy, or more aligned to reward signals that don’t map neatly onto a company’s expectations. The AMA didn’t claim Moonshot had solved those tensions. It did suggest the team sees reinforcement learning as the lever that will matter more in the next cycle than simply buying more GPUs.
When asked about the compute gap between Moonshot and American labs with vastly larger GPU fleets, the team was candid. “The gap is not closing I would say,” one co-host wrote. “But how much compute does one need to achieve AGI? We will see.”
Another offered a more philosophical framing: “There are too many factors affecting available compute. But no matter what, innovation loves constraints.”
Open-weight releases now come with a standing suspicion: did the model learn too much from competitors? That suspicion can harden quickly into accusations of distillation, where one AI learns by training on another AI’s outputs.
A user raised one of the most uncomfortable claims circulating in open-model circles — that K2.5 sometimes identifies itself as “Claude,” Anthropic’s flagship model. The implication was heavy borrowing.
Moonshot didn’t dismiss the behavior. Instead it described the conditions under which it happens. With the right system prompt, the team said, the model has a high probability of answering “Kimi,” particularly in thinking mode. But with an empty system prompt, the model drifts into what the team called an “undefined area,” which reflects pretraining data distributions rather than deliberate training choices.
Then it offered a specific explanation tied to a training decision. Moonshot said it had upsampled newer internet coding data during pretraining, and that this data appears more associated with the token “Claude” — likely because developers discussing AI coding assistants frequently reference Anthropic’s model.
The team pushed back on the distillation accusation with benchmark results. “In fact, K2.5 seems to outperform Claude on many benchmarks,” one co-host wrote. “HLE, BrowseComp, MMMU Pro, MathVision, just to name a few.”
For enterprise adopters, the important point isn’t the internet drama. It’s that identity drift is a real failure mode — and one that organizations can often mitigate by controlling system prompts rather than leaving the model’s self-description to chance. The AMA treated prompt governance not as a user-experience flourish, but as operational hygiene.
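In practice, prompt governance can be as simple as ensuring the application always supplies a pinned system prompt rather than an empty one. The sketch below mirrors common chat-API message formats; the prompt wording and function are illustrative assumptions, not Moonshot's recommended text:

```python
# Hedged sketch of "prompt governance": pin identity at the application
# layer so self-description never falls into the model's "undefined area".
IDENTITY_PROMPT = (
    "You are Kimi, an assistant built by Moonshot AI. "
    "If asked who you are, answer 'Kimi'."
)

def build_messages(user_text, history=None):
    # Always lead with the governed system prompt, never an empty one.
    msgs = [{"role": "system", "content": IDENTITY_PROMPT}]
    msgs.extend(history or [])
    msgs.append({"role": "user", "content": user_text})
    return msgs

msgs = build_messages("Who are you?")
```

Centralizing message construction in one function like this also makes the system prompt auditable and versionable, which is the operational-hygiene point the AMA was making.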
A recurring theme in the thread was that K2.5’s writing style feels more generic than earlier Kimi models. Users described it as more like a standard “helpful assistant” — a tone many developers now see as the default personality of heavily post-trained models. One user said they loved the personality of Kimi K2 and asked what happened.
A Kimi co-host acknowledged that each new release brings some personality change and described personality as subjective and hard to evaluate. “This is a quite difficult problem,” they wrote. The team said it wants to improve the issue and make personality more customizable per user.
In a separate exchange about whether strengthening coding capability compromises creative writing and emotional intelligence, a Kimi representative argued there’s no inherent conflict if the model is large enough. But maintaining “writing taste” across versions is difficult, they said, because the reward model is constantly evolving. The team relies on internal benchmarks — a kind of meta-evaluation — to track creative writing progress and adjust reward models accordingly.
Another response went further, using language that would sound unusual in a corporate AI specification but familiar to people who use these tools daily. The team talked about the “soul” of a reward model and suggested the possibility of storing a user “state” reflecting taste and using it to condition the model’s outputs.
That exchange points to a product frontier that enterprises often underestimate. Style drift isn’t just aesthetics. It can change how a model explains decisions, how it hedges, how it handles ambiguity, and how it interacts with customers and employees. The AMA made clear that labs increasingly treat “taste” as both an alignment variable and a differentiator — but it remains hard to measure and even harder to hold constant across training runs.
The most revealing cultural insight came in response to a question about surprises during training and reinforcement learning. A co-host answered with a single word, bolded for emphasis: debugging.
“Whether it’s pre-training or post-training, one thing constantly manifests itself as the utmost priority: debugging,” they wrote.
The comment illuminated a theme running through the entire session. When asked about their “scaling ladder” methodology for evaluating new ideas at different model sizes, zxytim offered an anecdote about failure. The team had once hurried to incorporate Kimi Linear, an experimental linear-attention architecture, into the previous model generation. It failed the scaling ladder at a certain scale. They stepped back and went through what the co-host called “a tough debugging process,” and after months finally made it work.
“Statistically, most ideas that work at small scale won’t pass the scaling ladder,” they continued. “Those that do are usually simple, effective, and mathematically grounded. Research is mostly about managing failure, not celebrating success.”
For technical leaders evaluating AI vendors, the admission is instructive. Frontier capability doesn’t emerge from elegant breakthroughs alone. It emerges from relentless fault isolation — and from organizational cultures willing to spend months on problems that might not work.
The AMA also acted as a subtle teaser for Kimi’s next generation.
Developers asked whether Kimi K3 would adopt Moonshot’s linear attention research, which aims to handle long context more efficiently than traditional attention mechanisms. Team members suggested that linear approaches are a serious option. “It’s likely that Kimi Linear will be part of K3,” one wrote. “We will also include other optimizations.”
In another exchange, a co-host predicted K3 “will be much, if not 10x, better than K2.5.”
The team also highlighted continual learning as a direction it is actively exploring, suggesting a future where agents can work effectively over longer time horizons — a critical enterprise need if agents are to handle ongoing projects rather than single-turn tasks. “We believe that continual learning will improve agency and allow the agents to work effectively for much longer durations,” one co-host wrote.
On Agent Swarm specifically, the team said it plans to make the orchestration scaffold available to developers once the system becomes more stable. “Hopefully very soon,” they added.
The session didn’t resolve every question. Some of the most technical prompts — about multimodal training recipes, defenses against reward hacking, and data governance — were deferred to a forthcoming technical report. That’s not unusual. Many labs now treat the most operationally decisive details as sensitive.
But the thread still revealed where the real contests in AI have moved. The gap that matters most isn’t between China and the United States, or between open and closed. It’s the gap between what models promise and what systems can actually deliver.
Orchestration is becoming the product. Moonshot isn’t only shipping a model. It’s shipping a worldview that says the next gains come from agents that can split work, use tools, and return structured results fast. Open weights are colliding with hardware reality, as developers demand openness that runs locally rather than openness that requires a data center. And the battleground is shifting from raw intelligence to reliability — from beating a benchmark by two points to debugging tool-calling discipline, managing memory in multi-agent workflows, and preserving the hard-to-quantify “taste” that determines whether users trust the output.
Moonshot showed up on Reddit in the wake of a high-profile release and a growing geopolitical narrative. The developers waiting there cared about a more practical question: When does “open” actually mean “usable”?
In that sense, the AMA didn’t just market Kimi K2.5. It offered a snapshot of an industry in transition — from larger models to more structured computation, from closed APIs to open weights that still demand serious engineering to deploy, and from celebrating success to managing failure.
“Research is mostly about managing failure,” one of the Moonshot engineers had written. By the end of the thread, it was clear that deployment is, too.