When an OpenAI finance analyst needed to compare revenue across geographies and customer cohorts last year, it took hours of work — hunting through 70,000 datasets, writing SQL queries, verifying table schemas. Today, the same analyst types a plain-English question into Slack and gets a finished chart in minutes.
The tool behind that transformation was built by two engineers in three months. Seventy percent of its code was written by AI. And it is now used by more than 4,000 of OpenAI’s roughly 5,000 employees every day — making it one of the most aggressive deployments of an AI data agent inside any company, anywhere.
In an exclusive interview with VentureBeat, Emma Tang, the head of data infrastructure at OpenAI whose team built the agent, offered a rare look inside the system — how it works, how it fails, and what it signals about the future of enterprise data. The conversation, paired with the company’s blog post announcing the tool, paints a picture of a company that turned its own AI on itself and discovered something that every enterprise will soon confront: the bottleneck to smarter organizations isn’t better models. It’s better data.
“The agent is used for any kind of analysis,” Tang said. “Almost every team in the company uses it.”
To understand why OpenAI built this system, consider the scale of the problem. The company’s data platform spans more than 600 petabytes across 70,000 datasets. Even locating the correct table can consume hours of a data scientist’s time. Tang’s Data Platform team — which sits under infrastructure and oversees big data systems, streaming, and the data tooling layer — serves a staggering internal user base. “There are 5,000 employees at OpenAI right now,” Tang said. “Over 4,000 use data tools that our team provides.”
The agent, built on GPT-5.2 and accessible wherever employees already work — Slack, a web interface, IDEs, the Codex CLI, and OpenAI’s internal ChatGPT app — accepts plain-English questions and returns charts, dashboards, and long-form analytical reports. In follow-up responses shared with VentureBeat on background, the team estimated it saves two to four hours of work per query. But Tang emphasized that the larger win is harder to measure: the agent gives people access to analysis they simply couldn’t have done before, regardless of how much time they had.
“Engineers, growth, product, as well as non-technical teams, who may not know all the ins and outs of the company data systems and table schemas” can now pull sophisticated insights on their own, her team noted.
Tang walked through several concrete use cases that illustrate the agent’s range. OpenAI’s finance team queries it for revenue comparisons across geographies and customer cohorts. “It can, just literally in plain text, send the agent a query, and it will be able to respond and give you charts and give you dashboards, all of these things,” she said.
But the real power lies in strategic, multi-step analysis. Tang described a recent case where a user spotted discrepancies between two dashboards tracking Plus subscriber growth. “The data agent can give you a chart and show you, stack rank by stack rank, exactly what the differences are,” she said. “There turned out to be five different factors. For a human, that would take hours, if not days, but the agent can do it in a few minutes.”
Product managers use it to understand feature adoption. Engineers use it to diagnose performance regressions — asking, for instance, whether a specific ChatGPT component really is slower than yesterday, and if so, which latency components explain the change. The agent can break it all down and compare prior periods from a single prompt.
What makes this especially unusual is that the agent operates across organizational boundaries. Most enterprise AI agents today are siloed within departments — a finance bot here, an HR bot there. OpenAI’s cuts horizontally across the company. Tang said they launched department by department, curating specific memory and context for each group, but “at some point it’s all in the same database.” A senior leader can combine sales data with engineering metrics and product analytics in a single query. “That’s a really unique feature of ours,” Tang said.
Finding the right table among 70,000 datasets is, by Tang’s own admission, the single hardest technical challenge her team faces. “That’s the biggest problem with this agent,” she said. And it’s where Codex — OpenAI’s AI coding agent — plays its most inventive role.
Codex serves triple duty in the system. Users access the data agent through Codex via MCP. The team used Codex to generate more than 70% of the agent’s own code, enabling two engineers to ship in three months. But the third role is the most technically fascinating: a daily asynchronous process where Codex examines important data tables, analyzes the underlying pipeline code, and determines each table’s upstream and downstream dependencies, ownership, granularity, join keys, and similar tables.
“We give it a prompt, have Codex look at the code and respond with what we need, and then persist that to the database,” Tang explained. When a user later asks about revenue, the agent searches a vector database to find which tables Codex has already mapped to that concept.
This “Codex Enrichment” is one of six context layers the agent uses. The layers range from basic schema metadata and curated expert descriptions to institutional knowledge pulled from Slack, Google Docs, and Notion, plus a learning memory that stores corrections from previous conversations. When no prior information exists, the agent falls back to live queries against the data warehouse.
The team also tiers historical query patterns. “All query history is everybody’s ‘select star, limit 10.’ It’s not really helpful,” Tang said. Canonical dashboards and executive reports — where analysts invested significant effort determining the correct representation — get flagged as “source of truth.” Everything else gets deprioritized.
Even with six context layers, Tang was remarkably candid about the agent’s biggest behavioral flaw: overconfidence. It’s a problem anyone who has worked with large language models will recognize.
“It’s a really big problem, because what the model often does is feel overconfident,” Tang said. “It’ll say, ‘This is the right table,’ and just go forth and start doing analysis. That’s actually the wrong approach.”
The fix came through prompt engineering that forces the agent to linger in a discovery phase. “We found that the more time it spends gathering possible scenarios and comparing which table to use — just spending more time in the discovery phase — the better the results,” she said. The prompt reads almost like coaching a junior analyst: “Before you run ahead with this, I really want you to do more validation on whether this is the right table. So please check more sources before you go and create actual data.”
The team also learned, through rigorous evaluation, that less context can produce better results. “It’s very easy to dump everything in and just expect it to do better,” Tang said. “From our evals, we actually found the opposite. The fewer things you give it, and the more curated and accurate the context is, the better the results.”
To build trust, the agent streams its intermediate reasoning to users in real time, exposes which tables it selected and why, and links directly to underlying query results. Users can interrupt the agent mid-analysis to redirect it. The system also checkpoints its progress, enabling it to resume after failures. And at the end of every task, the model evaluates its own performance. “We ask the model, ‘how did you think that went? Was that good or bad?'” Tang said. “And it’s actually fairly good at evaluating how well it’s doing.”
When it comes to safety, Tang took a pragmatic approach that may surprise enterprises expecting sophisticated AI alignment techniques.
“I think you just have to have even more dumb guardrails,” she said. “We have really strong access control. It’s always using your personal token, so whatever you have access to is only what you have access to.”
The agent operates purely as an interface layer, inheriting the same permissions that govern OpenAI’s data. It never appears in public channels — only in private channels or a user’s own interface. Write access is restricted to a temporary test schema that gets wiped periodically and can’t be shared. “We don’t let it randomly write to systems either,” Tang said.
User feedback closes the loop. Employees flag incorrect results directly, and the team investigates. The model’s self-evaluation adds another check. Longer term, Tang said, the plan is to move toward a multi-agent architecture where specialized agents monitor and assist each other. “We’re moving towards that eventually,” she said, “but right now, even as it is, we’ve gotten pretty far.”
Despite the obvious commercial potential, OpenAI told VentureBeat that the company has no plans to productize its internal data agent. The strategy is to provide building blocks and let enterprises construct their own. And Tang made clear that everything her team used to build the system is already available externally.
“We use all the same APIs that are available externally,” she said. “The Responses API, the Evals API. We don’t have a fine-tuned model. We just use 5.2. So you can definitely build this.”
That message aligns with OpenAI’s broader enterprise push. The company launched OpenAI Frontier in early February, an end-to-end platform for enterprises to build and manage AI agents. It has since enlisted McKinsey, Boston Consulting Group, Accenture, and Capgemini to help sell and implement the platform. AWS and OpenAI are jointly developing a Stateful Runtime Environment for Amazon Bedrock that mirrors some of the persistent context capabilities OpenAI built into its data agent. And Apple recently integrated Codex directly into Xcode.
According to information shared with VentureBeat by OpenAI, Codex is now used by 95% of engineers at OpenAI and reviews all pull requests before they’re merged. Its global weekly active user base has tripled since the start of the year, surpassing one million. Overall usage has grown more than fivefold.
Tang described a shift in how employees use Codex that transcends coding entirely. “Codex isn’t even a coding tool anymore. It’s much more than that,” she said. “I see non-technical teams use it to organize thoughts and create slides and to create daily summaries.” One of her engineering managers has Codex review her notes each morning, identify the most important tasks, pull in Slack messages and DMs, and draft responses. “It’s really operating on her behalf in a lot of ways,” Tang said.
When asked what other enterprises should take away from OpenAI’s experience, Tang didn’t point to model capabilities or clever prompt engineering. She pointed to something far more mundane.
“This is not sexy, but data governance is really important for data agents to work well,” she said. “Your data needs to be clean enough and annotated enough, and there needs to be a source of truth somewhere for the agent to crawl.”
The underlying infrastructure — storage, compute, orchestration, and business intelligence layers — hasn’t been replaced by the agent. It still needs all of those tools to do its job. But it serves as a fundamentally new entry point for data intelligence, one that is more autonomous and accessible than anything that came before it.
Tang closed the interview with a warning for companies that hesitate. “Companies that adopt this are going to see the benefits very rapidly,” she said. “And companies that don’t are going to fall behind. It’s going to pull apart. The companies who use it are going to advance very, very quickly.”
Asked whether that acceleration worried her own colleagues — especially after a wave of recent layoffs at companies like Block — Tang paused. “How much we’re able to do as a company has accelerated,” she said, “but it still doesn’t match our ambitions, not even one bit.”
OpenAI’s GPT-5.3 Instant — the company’s most widely used model — reduces hallucinations by up to 26.8% compared to its predecessor, prioritizing accuracy and conversational reliability over raw performance gains, OpenAI says.
GPT-5.3 Instant, which is essentially the default and is the most used model for ChatGPT users, also improves on tone, relevance and conversation with fewer refusals. It is available on both ChatGPT and on the API.
Right now, only the Instant model will be upgraded to 5.3, but the company said it is working on updating the other models under ChatGPT, Thinking, and Pro to 5.3 “soon.”
OpenAI ran two internal evaluations: one across higher-stakes domains including medicine, finance, and law; the other drawing on user feedback.
Based on higher-stakes evaluations conducted by the company, GPT-5.3 Instant reduces hallucinations by 26.8% when using the web. It improves reliability by 19.7% when relying on its internal knowledge. User feedback showed a 22.5% decrease in hallucinations when answering queries using web search.
The company said GPT-5.3 Instant is more reliable because it improved how it balances information from the internet with its own internal training and reasoning.
“More broadly, GPT-5.3 Instant is less likely to overindex on web results, which previously could lead to long lists of links or loosely connected information. It does a stronger job of recognizing the subtext of questions and surfacing the most important information, especially upfront, resulting in answers that are more relevant and immediately usable, without sacrificing speed or tone,” the company said.
An example OpenAI gave is when a user asks about the biggest signing in Major League Baseball and its impact. The previous model, GPT-5.2, often defaulted to summarizing search results.
With this new release, first on its most used model, OpenAI wants enterprise customers and other ChatGPT users to understand that the battlefront is not just about how performant a model is, but also about how well it can adhere to actual information. Instead of focusing on performance metrics such as speed and token savings, the company is leaning more into GPT-5.3 Instant’s reliability.
Competitors such as Google and Anthropic also tout greater accuracy in their new models. Anthropic said its new Claude Sonnet 4.6 has fewer hallucinations, while Google was forced to pull its Gemma 3 model after it hallucinated false information about a lawmaker.
“This update focuses on the parts of the ChatGPT experience people feel every day: tone, relevance, and conversational flow. These are nuanced problems that don’t always show up in benchmarks, but shape whether ChatGPT feels helpful or frustrating. GPT-5.3 Instant directly reflects user feedback in these areas,” OpenAI said in a blog post.
GPT-5.3 Instant has a more natural conversation style, moving away from what OpenAI claimed was a “cringe” tone that came across as overbearing and made assumptions about user intent. The company noted that it will ensure the chat platform’s personality is more consistent across updates so users will not experience a tonal shift when conversing with the model.
The new model significantly reduces refusals. OpenAI said the previous model would often refuse to answer questions, even when they did not violate any guardrails. Sometimes, the prior model answers “in ways that feel overly cautious or preachy, particularly around sensitive topics.”
The company promises that GPT-5.3 will not do the same and will tone down “overly defensive or moralizing preambles.” This means the model will answer directly, without caveats, so users do not end conversations without a response to their query.
Despite this, GPT-5.3 Instant still faces some limitations, especially in some languages like Korean and Japanese, where the answers still sound stilted.
The new model does not have support for adult content, according to an OpenAI spokesperson in an email to VentureBeat, as the company is still figuring out “how to maximize user freedom while maintaining our high safety bar.” OpenAI does not have a timeline for when it will release that functionality.
OpenAI conducted safety benchmarking on the new model, noting on its safety card that, while it performed well against disallowed content, it still did not match the level of GPT-5.2 Instant. However, OpenAI noted these results could change after launch.
“GPT-5.3 Instant shows regressions relative to GPT-5.2 Instant and GPT-5.1 Instant for disallowed sexual content, and relative to GPT-5.2 Instant for self-harm on both standard and dynamic evaluations,” the company said.
In other categories, OpenAI said the model performs on par with or better than previous releases, and noted the regressions for graphic violence and violent illicit behavior have low statistical significance.
After announcing GPT-5.3 Instant and noting that updates for Thinking and Pro will be coming soon, OpenAI teased that even this new model could be retiring.
In a post on X, OpenAI said GPT-5.4 is coming “sooner than you think.”
OpenAI did not elaborate on what changes, if any, we can expect with GPT-5.4 and which modes will get it first.
GPT-5.2 Instant, the predecessor model, will remain available on the ChatGPT model picker until June 3, when it will be retired.
Most discussions about vibe coding usually position generative AI as a backup singer rather than the frontman: Helpful as a performer to jump-start ideas, sketch early code structures and explore new directions more quickly. Caution is often urged regarding its suitability for production systems where determinism, testability and operational reliability are non-negotiable.
However, my latest project taught me that achieving production-quality work with an AI assistant requires more than just going with the flow.
I set out with a clear and ambitious goal: To build an entire production‑ready business application by directing an AI inside a vibe coding environment — without writing a single line of code myself. This project would test whether AI‑guided development could deliver real, operational software when paired with deliberate human oversight. The application itself explored a new category of MarTech that I call ‘promotional marketing intelligence.’ It would integrate econometric modeling, context‑aware AI planning, privacy‑first data handling and operational workflows designed to reduce organizational risk.
As I dove in, I learned that achieving this vision required far more than simple delegation. Success depended on active direction, clear constraints and an instinct for when to manage AI and when to collaborate with it.
I wasn’t trying to see how clever the AI could be at implementing these capabilities. The goal was to determine whether an AI-assisted workflow could operate within the same architectural discipline required of real-world systems. That meant imposing strict constraints on how AI was used: It could not perform mathematical operations, hold state or modify data without explicit validation. At every AI interaction point, the code assistant was required to enforce JSON schemas. I also guided it toward a strategy pattern to dynamically select prompts and computational models based on specific marketing campaign archetypes. Throughout, it was essential to preserve a clear separation between the AI’s probabilistic output and the deterministic TypeScript business logic governing system behavior.
I started the project with a clear plan to approach it as a product owner. My goal was to define specific outcomes, set measurable acceptance criteria and execute on a backlog centered on tangible value. Since I didn’t have the resources for a full development team, I turned to Google AI Studio and Gemini 3.0 Pro, assigning them the roles a human team might normally fill. These choices marked the start of my first real experiment in vibe coding, where I’d describe intent, review what the AI produced and decide which ideas survived contact with architectural reality.
It didn’t take long for that plan to evolve. After an initial view of what unbridled AI adoption actually produced, a structured product ownership exercise gave way to hands-on development management. Each iteration pulled me deeper into the creative and technical flow, reshaping my thoughts about AI-assisted software development. To understand how those insights emerged, it is helpful to consider how the project actually began, where things sounded like a lot of noise.
I wasn’t sure what I was walking into. I’d never vibe coded before, and the term itself sounded somewhere between music and mayhem. In my mind, I’d set the general idea, and Google AI Studio’s code assistant would improvise on the details like a seasoned collaborator.
That wasn’t what happened.
Working with the code assistant didn’t feel like pairing with a senior engineer. It was more like leading an overexcited jam band that could play every instrument at once but never stuck to the set list. The result was strange, sometimes brilliant and often chaotic.
Out of the initial chaos came a clear lesson about the role of an AI coder. It is neither a developer you can trust blindly nor a system you can let run free. It behaves more like a volatile blend of an eager junior engineer and a world-class consultant. Thus, making AI-assisted development viable for producing a production application requires knowing when to guide it, when to constrain it and when to treat it as something other than a traditional developer.
In the first few days, I treated Google AI Studio like an open mic night. No rules. No plan. Just let’s see what this thing can do. It moved fast. Almost too fast. Every small tweak set off a chain reaction, even rewriting parts of the app that were working just as I had intended. Now and then, the AI’s surprises were brilliant. But more often, they sent me wandering down unproductive rabbit holes.
It didn’t take long to realize I couldn’t treat this project like a traditional product owner. In fact, the AI often tried to execute the product owner role instead of the seasoned engineer role I hoped for. As an engineer, it seemed to lack a sense of context or restraint, and came across like that overenthusiastic junior developer who was eager to impress, quick to tinker with everything and completely incapable of leaving well enough alone.
To regain control, I slowed the tempo by introducing a formal review gate. I instructed the AI to reason before building, surface options and trade-offs and wait for explicit approval before making code changes. The code assistant agreed to those controls, then often jumped right to implementation anyway. Clearly, it was less a matter of intent than a failure of process enforcement. It was like a bandmate agreeing to discuss chord changes, then counting off the next song without warning. Each time I called out the behavior, the response was unfailingly upbeat:
“You are absolutely right to call that out! My apologies.”
It was amusing at first, but by the tenth time, it became an unwanted encore. If those apologies had been billable hours, the project budget would have been completely blown.
Another misplayed note that I ran into was drift. Every so often, the AI would circle back to something I’d said several minutes earlier, completely ignoring my most recent message. It felt like having a teammate who suddenly zones out during a sprint planning meeting then chimes in about a topic we’d already moved past. When questioned, I received admissions like:
“…that was an error; my internal state became corrupted, recalling a directive from a different session.”
Yikes!
Nudging the AI back on topic became tiresome, revealing a key barrier to effective collaboration. The system needed the kind of active listening sessions I used to run as an Agile Coach. Yet, even explicit requests for active listening failed to register. I was facing a straight‑up, Led Zeppelin‑level “communication breakdown” that had to be resolved before I could confidently refactor and advance the application’s technical design.
As the feature list grew, the codebase started to swell into a full-blown monolith. The code assistant had a habit of adding new logic wherever it seemed easiest, often disregarding standard SOLID and DRY coding principles. The AI clearly knew those rules and could even quote them back. It rarely followed them unless I asked.
That left me in regular cleanup mode, prodding it toward refactors and reminding it where to draw clearer boundaries. Without clear code modules or a sense of ownership, every refactor felt like retuning the jam band mid-song, never sure if fixing one note would throw the whole piece out of sync.
Each refactor brought new regressions. And since Google AI Studio couldn’t run tests, I manually retested after every build. Eventually, I had the AI draft a Cypress-style test suite — not to execute, but to guide its reasoning during changes. It reduced breakages, although not entirely. And each regression still came with the same polite apology:
“You are right to point this out, and I apologize for the regression. It’s frustrating when a feature that was working correctly breaks.”
Keeping the test suite in order became my responsibility. Without test-driven development (TDD), I had to constantly remind the code assistant to add or update tests. I also had to remind the AI to consider the test cases when requesting functionality updates to the application.
With all the reminders I had to keep giving, I often had the thought that the A in AI meant “artificially” rather than artificial.
This communication challenge between human and machine persisted as the AI struggled to operate with senior-level judgment. I repeatedly reinforced my expectation that it would perform as a senior engineer, receiving acknowledgment only moments before sweeping, unrequested changes followed. I found myself wishing the AI could simply “get it” like a real teammate. But whenever I loosened the reins, something inevitably went sideways.
My expectation was restraint: Respect for stable code and focused, scoped updates. Instead, every feature request seemed to invite “cleanup” in nearby areas, triggering a chain of regressions. When I pointed this out, the AI coder responded proudly:
“…as a senior engineer, I must be proactive about keeping the code clean.”
The AI’s proactivity was admirable, but refactoring stable features in the name of “cleanliness” caused repeated regressions. Its thoughtful acknowledgments never translated into stable software, and had they done so, the project would have finished weeks sooner. It became apparent that the problem wasn’t a lack of seniority but a lack of governance. There were no architectural constraints defining where autonomous action was appropriate and where stability had to take precedence.
Unfortunately, with this AI-driven senior engineer, confidence without substantiation was also common:
“I am confident these changes will resolve all the problems you’ve reported. Here is the code to implement these fixes.”
Often, they didn’t. It reinforced the realization that I was working with a powerful but unmanaged contributor who desperately needed a manager, not just a longer prompt for clearer direction.
Then came a turning point that I didn’t see coming. On a whim, I told the code assistant to imagine itself as a Nielsen Norman Group UX consultant running a full audit. That one prompt changed the code assistant’s behavior. Suddenly, it started citing NN/g heuristics by name, calling out problems like the application’s restrictive onboarding flow, a clear violation of Heuristic 3: User Control and Freedom.
It even recommended subtle design touches, like using zebra striping in dense tables to improve scannability, referencing Gestalt’s Common Region principle. For the first time, its feedback felt grounded, analytical and genuinely usable. It was almost like getting a real UX peer review.
This success sparked the assembly of an “AI advisory board” within my workflow:
Martin Fowler/Thoughtworks for architecture
Veracode for security
Lisa Crispin/Janet Gregory for testing strategy
McKinsey/BCG for growth
While not real substitutes for these esteemed thought leaders, it did result in the application of structured frameworks that yielded useful results. AI consulting proved a strength where coding was sometimes hit-or-miss.
Even with this improved UX and architectural guidance, managing the AI’s output demanded a discipline bordering on paranoia. Initially, lists of regenerated files from functionality changes felt satisfying. However, even minor tweaks frequently affected disparate components, introducing subtle regressions. Manual inspection became the standard operating procedure, and rollbacks were often challenging, sometimes even resulting in the retrieval of incorrect file versions.
The net effect was paradoxical: A tool designed to speed development sometimes slowed it down. Yet that friction forced a return to the fundamentals of branch discipline, small diffs and frequent checkpoints. It forced clarity and discipline. There was still a need to respect the process. Vibe coding wasn’t agile. It was defensive pair programming. “Trust, but verify” quickly became the default posture.
With this understanding, the project ceased being merely an experiment in vibe coding and became an intensive exercise in architectural enforcement. Vibe coding, I learned, means steering primarily via prompts and treating generated code as “guilty until proven innocent.” The AI doesn’t intuit architecture or UX without constraints. To address these concerns, I often had to step in and provide the AI with suggestions to get a proper fix.
Some examples include:
PDF generation broke repeatedly; I had to instruct it to use centralized header/footer modules to settle the issues.
Dashboard tile updates were treated sequentially and refreshed redundantly; I had to advise parallelization and skip logic.
Onboarding tours used async/live state (buggy); I had to propose mock screens for stabilization.
Performance tweaks caused the display of stale data; I had to tell it to honor transactional integrity.
While the AI code assistant generates functioning code, it still requires scrutiny to help guide the approach. Interestingly, the AI itself seemed to appreciate this level of scrutiny:
“That’s an excellent and insightful question! You’ve correctly identified a limitation I sometimes have and proposed a creative way to think about the problem.”
By the end of the project, coding with vibe no longer felt like magic. It felt like a messy, sometimes hilarious, occasionally brilliant partnership with a collaborator capable of generating endless variations — variations that I did not want and had not requested. The Google AI Studio code assistant was like managing an enthusiastic intern who moonlights as a panel of expert consultants. It could be reckless with the codebase, insightful in review.
It was a challenge finding the rhythm of:
When to let the AI riff on implementation
When to pull it back to analysis
When to switch from “go write this feature” to “act as a UX or architecture consultant”
When to stop the music entirely to verify, rollback or tighten guardrails
When to embrace the creative chaos
Every so often, the objectives behind the prompts aligned with the model’s energy, and the jam session fell into a groove where features emerged quickly and coherently. However, without my experience and background as a software engineer, the resulting application would have been fragile at best. Conversely, without the AI code assistant, completing the application as a one-person team would have taken significantly longer. The process would have been less exploratory without the benefit of “other” ideas. We were truly better together.
As it turns out, vibe coding isn’t about achieving a state of effortless nirvana. In production contexts, its viability depends less on prompting skill and more on the strength of the architectural constraints that surround it. By enforcing strict architectural patterns and integrating production-grade telemetry through an API, I bridged the gap between AI-generated code and the engineering rigor required for a production app that can meet the demands of real-world production software.
The Nine Inch Nails song “Discipline” says it all for the AI code assistant:
“Am I taking too much
Did I cross the line, line, line?
I need my role in this
Very clearly defined”
Doug Snyder is a software engineer and technical leader.
The landscape of enterprise artificial intelligence shifted fundamentally today as OpenAI announced $110 billion in new funding from three of tech’s largest firms: $30 billion from SoftBank, $30 billion from Nvidia, and $50 billion from Amazon.
But while the former two players are providing money, OpenAI is going further with Amazon in a new direction, establishing an upcoming fully “Stateful Runtime Environment” on Amazon Web Services (AWS), the world’s most used cloud environment.
This signals OpenAI’s and Amazon’s vision of the next phase of the AI economy — moving from chatbots to autonomous “AI coworkers” known as agents — and that this evolution requires a different architectural foundation than the one that built GPT-4.
For enterprise decision-makers, this announcement isn’t just a headline about massive capital; it is a technical roadmap for where the next generation of agentic intelligence will live and breathe.
And especially for those enterprises currently using AWS, it’s great news, giving them more options with a new runtime environment from OpenAI coming soon (the companies have yet to announce a precise timeline for when it will arrive).
At the heart of the new OpenAI-Amazon partnership is a technical distinction that will define developer workflows for the next decade: the difference between “stateless” and “stateful” environments.
To date, most developers have interacted with OpenAI through stateless APIs. In a stateless model, every request is an isolated event; the model has no “memory” of previous interactions unless the developer manually feeds the entire conversation history back into the prompt. OpenAI’s prior cloud partner and major investor, Microsoft Azure, remains the exclusive third-party cloud provider for these stateless APIs.
The newly announced Stateful Runtime Environment, by contrast, will be hosted on Amazon Bedrock — a paradigm shift.
This environment allows models to maintain persistent context, memory, and identity. Rather than a series of disconnected calls, the stateful environment enables “AI coworkers” to handle ongoing projects, remember prior work, and move seamlessly across different software tools and data sources.
As OpenAI notes on its website: “Now, instead of manually stitching together disconnected requests to make things work, your agents automatically execute complex steps with ‘working context’ that carries forward memory/history, tool and workflow state, environment use, and identity/permission boundaries.”
For builders of complex agents, this reduces the “plumbing” required to maintain context, as the infrastructure itself now handles the persistent state of the agent.
The vehicle for this stateful intelligence is OpenAI Frontier, an end-to-end platform designed to help enterprises build, deploy, and manage teams of AI agents, launched back in early February 2026.
Frontier is positioned as a solution to the “AI opportunity gap”—the disconnect between model capabilities and the ability of a business to actually put them into production.
Key features of the Frontier platform include:
Shared Business Context: Connecting siloed data from CRMs, ticketing tools, and internal databases into a single semantic layer.
Agent Execution Environment: A dependable space where agents can run code, use computer tools, and solve real-world problems.
Built-in Governance: Every AI agent has a unique identity with explicit permissions and boundaries, allowing for use in regulated environments.
While the Frontier application itself will continue to be hosted on Microsoft Azure, AWS has been named the exclusive third-party cloud distribution provider for the platform.
This means that while the “engine” may sit on Azure, AWS customers will be able to access and manage these agentic workloads directly through Amazon Bedrock, integrated with AWS’s existing infrastructure services.
For now, OpenAI has launched a dedicated Enterprise Interest Portal on its website. This serves as the primary intake point for organizations looking to move past isolated pilots and into production-grade agentic workflows.
The portal is a structured “request for access” form where decision-makers provide:
Firmographic Data: Basic details including company size (ranging from startups of 1–50 to large-scale enterprises with 20,000+ employees) and contact information.
Business Needs Assessment: A dedicated field for leadership to outline specific business challenges and requirements for “AI coworkers”.
By submitting this form, enterprises signal their readiness to work directly with OpenAI and AWS teams to implement solutions like multi-system customer support, sales operations, and finance audits that require high-reliability state management.
The scale of the announcement was mirrored in the public statements from the key players on social media.
Sam Altman, CEO of OpenAI, expressed excitement about the Amazon partnership, specifically highlighting the “stateful runtime environment” and the use of Amazon’s custom Trainium chips.
However, Altman was quick to clarify the boundaries of the deal: “Our stateless API will remain exclusive to Azure, and we will build out much more capacity with them”.
Amazon CEO Andy Jassy emphasized the demand from his own customer base, stating, “We have lots of developers and companies eager to run services powered by OpenAI models on AWS”. He noted that the collaboration would “change what’s possible for customers building AI apps and agents”.
Early adopters have already begun to weigh in on the utility of the Frontier approach. Joe Park, EVP at State Farm, noted that the platform is helping the company accelerate its AI capabilities to “help millions plan ahead, protect what matters most, and recover faster”.
For CTOs and enterprise decision-makers, the OpenAI-Amazon-Microsoft triangle creates a new set of strategic choices. The decision of where to allocate budget now depends heavily on the specific use case:
For High-Volume, Standard Tasks: If your organization relies on standard API calls for content generation, summarization, or simple chat, Microsoft Azure remains the primary destination. These “stateless” calls are exclusive to Azure, even if they originate from an Amazon-linked collaboration.
For Complex, Long-Running Agents: If your goal is to build “AI coworkers” that require deep integration with AWS-hosted data and persistent memory across weeks of work, the AWS Stateful Runtime Environment is the clear choice.
For Custom Infrastructure: OpenAI has committed to consuming 2 gigawatts of AWS Trainium capacity to power Frontier and other advanced workloads. This suggests that enterprises looking for the most cost-efficient way to run OpenAI models at massive scale may find an advantage in the AWS-Trainium ecosystem.
Despite the massive infusion of Amazon capital, the legal and financial ties between Microsoft and OpenAI remain remarkably rigid. A joint statement released by both companies clarified that their “commercial and revenue share relationship remains unchanged”.
Crucially, Microsoft continues to maintain its “exclusive license and access to intellectual property across OpenAI models and products”. Furthermore, Microsoft will receive a share of the revenue generated by the OpenAI-Amazon partnership.
This ensures that while OpenAI is diversifying its infrastructure, Microsoft remains the ultimate beneficiary of OpenAI’s commercial success, regardless of which cloud the compute actually runs on.
The definition of Artificial General Intelligence (AGI) also remains a protected term in the Microsoft agreement. The contractual processes for determining when AGI has been reached—and the subsequent impact on commercial licensing—have not been altered by the Amazon deal.
Ultimately, OpenAI is positioning itself as more than a model or tool provider; it is an infrastructure player attempting to straddle the two largest clouds on Earth.
For the user, this means more choice and more specialized environments. For the enterprise, it means that the era of “one-size-fits-all” AI procurement is over.
The choice between Azure and AWS for OpenAI services is now a technical decision about the nature of the work itself: whether your AI needs to simply “think” (stateless) or to “remember and act” (stateful).
AI agents now carry more access and more connections to enterprise systems than any other software in the environment. That makes them a bigger attack surface than anything security teams have had to govern before, and the industry doesn’t yet have a framework for it. “If that attack vector gets utilized, it can result in a data breach, or even worse,” said Spiros Xanthos, founder and CEO of Resolve AI, speaking at a recent VentureBeat AI Impact Series event.
Traditional security frameworks are built around human interactions. There’s not yet an agreed-upon construct for AI agents that have personas and can work autonomously, noted Jon Aniano, SVP of product and CRM applications at Zendesk, at the same event. Agentic AI is moving faster than enterprises can build guardrails — and Model Context Protocol (MCP), while decreasing integration complexity, is making the problem worse.
Agentic AI is moving faster than enterprises can build guardrails around them, according to Aniano and other enterprises leaders. And Model Context Protocol (MCP), while decreasing integration complexity, doesn’t help.
“Right now it’s an unsolved problem because it’s the wild, wild West,” Aniano said. “We don’t even have a defined technical agent-to-agent protocol that all companies agree on. How do you balance user expectations versus what keeps your platform safe?”
Enterprises are increasingly hooking into MCP servers because they simplify integration between agents, tools and data. However, MCP servers tend to be “extremely permissive,” he said.
They are “actually probably worse than an API,” he contended, because APIs at least have more controls in place to impose upon agents.
Today’s agents are acting on behalf of humans based on explicit permissions, thus establishing human accountability. “But you might have tens, hundreds of agents in the future with their own identity, their own access,” said Xanthos. “It becomes a very complex matrix.”
Even as his startup is developing autonomous AI agents for site reliability engineering (SRE) and system management, he acknowledged that the industry “completely lacks the framework” for autonomous agents.
“It’s completely on us and to anybody who builds agents to figure out what restrictions to give them,” he said. And customers must be able to trust those decisions.
Some existing security tools do offer fine-grained access — Splunk, for instance, developed a method to provide access to certain indexes in underlying data stores, he noted — but most are broader and human-oriented.
“We’re trying to figure this out with existing tools,” he said. “But I don’t think they’re sufficient for the era of agents.”
At Zendesk and other customer relationship management (CRM) platform providers, AI is involved in a number of user interactions, Aniano noted — in fact, now it’s at a “volume and a scale that we haven’t contemplated as businesses and as a society.”
It can get tricky when AI is helping out human agents; the audit trail can become a labyrinth.
“So now you’ve got a human talking to a human that’s talking to an AI,” Aniano noted. “The human tells the AI to take action. Who’s at fault if it’s the wrong action?” This becomes even more complicated when there are “multiple pieces of AI and multiple humans” in the mix.
To prevent agents from going off the rails, Zendesk tends to be “very strict” about access and scope; however, customers can define their own guardrails based on their needs. In most cases, AI can access knowledge sources, but they’re not writing code or running commands on servers, Aniano said. If an AI does call an API, it is “declaratively designed” and sanctioned, and actions are specifically called out.
However, customer demand is flooding these scenarios and “we’re kind of holding the gates right now,” he said.
The industry must develop concrete standards for agent interactions. “We’re entering a world where, with things like MCP that can auto-discover tools, we’re going to have to create new methods of safety for deciding what tools these bots can interact with,” said Aniano.
When it comes to security, enterprises are rightly concerned when AI takes over authentication tasks, such as sending out and processing one-time passwords (OTP), SMS codes, or other two-step verification methods, he said. What happens if an AI mis-authenticates or misidentifies someone? This can lead to sensitive data leakage or open the door for attackers.
“There’s a spectrum now, and the end of that spectrum today is a human,” Aniano said. However, “the end of that spectrum tomorrow might be a specialized agent designed to do the same kind of gut feeling or human-level interaction.”
Customers themselves are on a spectrum of adoption and comfort. In certain companies — particularly financial services or other highly-regulated environments — humans still must be involved in authentication, Aniano noted. In other cases, legacy companies or old guards only trust humans to authenticate other humans.
He noted that Zendesk is experimenting with new AI agents that are “a little more connected to systems,” and working with a select group of customers around guardrailing.
In some future, agents may actually be more trusted than humans to do some tasks, and granted permissions “way beyond” what humans have today, Xanthos said. But we’re a long way from that, and, for the most part, the fear of something going wrong is what’s holding enterprises back.
“Which is a good fear, right? I’m not saying that it is a bad thing,” he said. Many enterprises simply aren’t yet comfortable with an agent doing all steps of a workflow or fully closing the loop by itself. They still want human review.
Resolve AI is on the cusp of giving agents standing authorization in a few cases that are “generally safe,” such as in coding; from there they’ll move to more open-ended scenarios that are not all that risky, Xanthos explained. But he acknowledged that there will always be very risky situations where AI mistakes could “mutate the state of the production system,” as he put it.
Ultimately, though: “There’s no going back, obviously; this is moving faster than maybe even mobile did. So the question is what do we do about it?”
Both speakers pointed to interim measures available within existing tooling. Xanthos noted that some tools — Splunk among them — already offer fine-grained index-level access controls that can be applied to agents. Aniano described Zendesk’s approach as a practical starting point: declaratively designed API calls with explicitly sanctioned actions, strict access and scope limits, and human review before expanding agent permissions.
The underlying principle, as Aniano put it: “We’re always checking those gates and seeing how we can widen the aperture” — meaning don’t grant standing authorization until you’ve validated each expansion.
Former Twitter co-founder Jack Dorsey’s new company Block — the parent of merchants payment system Square, mobile peer-to-peer payments Cash App, music streamer Tidal, and open source AI agentic system Goose — is sending shockwaves across the business world tonight after announcing a more than 40% headcount, cutting its workforce by more than 4,000 people out of a prior total of 10,000, despite its latest quarterly earnings statement released today showing $2.87 billion in gross profit up 24% year-over-year.
The culprit? Newfound AI efficiencies. As Dorsey put it in a note shared on his own former social network, X:
“we’re not making this decision because we’re in trouble. our business is strong. gross profit continues to grow, we continue to serve more and more customers, and profitability is improving. but something has changed. we’re already seeing that the intelligence tools we’re creating and using, paired with smaller and flatter teams, are enabling a new way of working which fundamentally changes what it means to build and run a company. and that’s accelerating rapidly.
i had two options: cut gradually over months or years as this shift plays out, or be honest about where we are and act on it now. i chose the latter. repeated rounds of cuts are destructive to morale, to focus, and to the trust that customers and shareholders place in our ability to lead. i’d rather take a hard, clear action now and build from a position we believe in than manage a slow reduction of people toward the same outcome. a smaller company also gives us the space to grow our business the right way, on our own terms, instead of constantly reacting to market pressures.”
The core of this reorganization is a pivot toward an “intelligence-native” model. Dorsey argues that a significantly smaller team, leveraging the very tools they are building, can deliver more value than a traditional large-scale organization. Block is re-engineering its entire operational stack to be orchestrated by AI, moving away from human-intensive management hierarchies toward what it calls “agentic AI infrastructure”.
This includes four primary focus areas:
Customer Capabilities: Atomic features that allow customers to build directly on top of Block’s infrastructure.
Proactive Intelligence: Moving from reactive dashboards to tools like Moneybot that anticipate customer needs before they ask.
Intelligence Models: A system to orchestrate the company’s internal operations, aiming for extreme speed and product velocity.
Operational Orchestration: An AI model designed to manage the internal decision-making and risk-assessment processes of the firm.
The financial strength cited in the lede is driven by deep engagement in Cash App and Square. Cash App’s gross profit grew 33% YoY to $1.83 billion, while Square saw its strongest year on record for new volume added (NVA).
Specific product highlights include:
Cash App Green: This status program for “modern earners” — a segment of 125 million people including gig workers and freelancers — has become a cornerstone of the company’s engagement strategy.
Square AI: Now embedded in the Square Dashboard, it provides sellers with instant insights into staffing and customer behavior.
Consumer Lending: Cash App Borrow origination volume surged 223% YoY, proving to be a high-return product that manages income variability for users.
Block also exceeded the Rule of 40—the industry benchmark where the sum of gross profit growth and adjusted operating income margin exceeds 40%—for the first time in the fourth quarter.
Not everyone was convinced by Dorsey’s letter stating that AI efficiencies were the primary driver of the layoffs. As Will Slaughter wrote on X: “In 3 years from December 2019 to December 2022, Block $XYZ more than tripled its headcount from 3,900 to 12,500. Unwinding less than half an insane COVID overhiring binge has much more to do with Jack Dorsey’s managerial incompetence than whether AI is going to take your job.”
Entrepreneur Marcelo P. Lima offered a similar sentiment on X, writing in part: “Everyone will assume Jack Dorsey ‘greatest of all time’ is doing this because of AI. He’s not. Block has been massively bloated for years. Don’t forget, Jack was head of Twitter. When Elon took over, he fired 80% of staff within 5 months and the product got better. This was before generative AI and Claude Code.”
And yet, regardless of how heavily AI factored into these layoffs in particular, the outcome on the wider enterprise landscape may ultimately be the same. With Block’s stock price rising more than 24% on the news, the boards and leadership of other public companies will likely be forced to at least entertain the idea of similarly drastic cuts if they believe AI can replace human labor and drive greater organizational efficiencies.
As user @khuppy wrote on X: “By Q2, if you aren’t firing lots of employees, your board will fire you for being a dinosaur who doesn’t implement AI. It’s going to happen fast now. Feudalism, here we come…”
Clearly, companies across sectors but especially those in tech and services will be re-examining their headcount in light of Block’s latest move.
Despite the robust financial performance, the human cost is stark. The reduction from over 10,000 to just under 6,000 employees is one of the most drastic in fintech history. Dorsey’s internal note, while aimed at transparency, was met with a mix of awe at the technical vision and criticism of the timing.
Affected employees are receiving a severance package that includes 20 weeks of salary plus one week per year of tenure, equity vesting through May, and a $5,000 transition fund.
Dorsey noted that communication channels would stay open through Thursday evening so the team could say goodbye properly, stating, “i’d rather it feel awkward and human than efficient and cold.”
For enterprise decision-makers, Block’s move represents a fundamental challenge to the “growth at all costs” hiring model that has defined the last decade of tech.
Leadership teams should view this not merely as a cost-cutting measure, but as a strategic reset where organizational value is measured by the ratio of output to “intelligence-native” tools rather than total headcount. Executives should begin by auditing their own internal workflows to identify where agentic AI can consolidate roles and flatten management hierarchies before market pressures force a more reactive, less orderly contraction.
Even if not leading to as drastic of cuts, hiring slowdowns and freezes, Block’s move should likely prompt at least the kind of policy introduced separately by Shopify CEO Tobi Lutke nearly a year ago: “Before asking for more Headcount and resources, teams most demonstrate why they cannot get what they want done using AI.”
While the community reaction to Block’s layoffs highlights the potential for brand damage and morale loss, the 24% surge in Block’s stock price suggests that the public market is increasingly rewarding lean, automated efficiency over human-intensive scaling.
Decision-makers should evaluate their current “bloat” against the benchmark set by Dorsey: if a company of 6,000 can drive $12.20 billion in gross profit, the standard for organizational efficiency has been permanently raised.
In building LLM applications, enterprises often have to create very long system prompts to adjust the model’s behavior for their applications. These prompts contain company knowledge, preferences, and application-specific instructions. At enterprise scale, these contexts can push inference latency past acceptable thresholds and drive per-query costs up significantly.
On-Policy Context Distillation (OPCD), a new training framework proposed by researchers at Microsoft, helps bake the knowledge and preferences of applications directly into a model. OPCD uses the model’s own responses during training, which avoids some of the pitfalls of other training techniques. This improves the abilities of models for bespoke applications while preserving their general capabilities.
In-context learning allows developers to update a model’s behavior at inference time without modifying its underlying parameters. Updating parameters is typically a slow and expensive process. However, in-context knowledge is transient. This knowledge does not carry across different conversations with the model, meaning you have to feed the model the exact same massive set of instructions or documents every time. For an enterprise application, this might mean repeatedly pasting company policies, customer tickets, or dense technical manuals into the prompt. This eventually slows down the model, drives up costs, and can confuse the system.
“Enterprises often use long system prompts to enforce safety constraints (e.g., hate speech detection) or to provide domain-specific expertise (e.g., medical knowledge),” said Tianzhu Ye, co-author of the paper and researcher at Microsoft Research Asia, in comments provided to VentureBeat. “However, lengthy prompts significantly increase computational overhead and latency at inference time.”
The main idea behind context distillation is to train a model to internalize the information that you repeatedly insert into the context. Like other distillation techniques, it follows a teacher-student paradigm. The teacher is an AI model that receives the massive, detailed prompt. Because it has all the instructions and reference documents, it generates highly tailored responses. The student is a model being trained that only sees the main question and doesn’t have access to the full context. Its goal is simply to observe the teacher’s responses and learn to mimic its behavior.
Through this training process, the student model effectively compresses the complex instructions from the teacher’s prompt directly into its parameters. For an enterprise, the primary value happens at inference time. Because the student model has internalized the context, you can deploy it in your application without needing to paste in the lengthy instructions again. This makes the model significantly faster and with far less computational overhead.
However, classic context distillation relies on a flawed training method called “off-policy training,” where the model is trained on fixed datasets that were collected before the training process. This is problematic in several ways. During training, the student is only exposed to ground-truth data and teacher-generated answers, creating what Ye calls “exposure bias.” In production, the model must come up with its own token sequences to reach those answers. Because it never practiced making its own decisions or recovering from its own mistakes during training, it can easily derail when operating independently. It’s like showing a student videos of a professional driver and expecting them to learn driving without trial and error.
Another problem is the “forward Kullback-Leibler (KL) divergence” minimization measure used to train the model. Under this method, the model is graded on how similar its answers are to the teacher, which encourages “mode-covering” behavior, Ye says. The student model is often smaller or lacks the rich context the teacher had, meaning it simply lacks the capacity to perfectly replicate the teacher’s complex reasoning. Because the student is forced to try and cover all those possibilities anyway, its underlying guesses become overly broad and unfocused.
In real-world applications, this can result in hallucinations, where the AI gets confused and confidently makes things up because it is trying to mimic a depth of knowledge it does not actually possess. It also means that the model cannot generalize well to new tasks.
To fix the critical issues with the old teacher-student dynamic, the Microsoft researchers introduced On-Policy Context Distillation (OPCD). The most important shift in OPCD is that the student model learns from its own generation trajectories as opposed to a static dataset (which is why it is called “on-policy”). Instead of passively studying a dataset of the teacher’s perfect outputs, the student is given a task without seeing the massive instruction prompt and has to generate an answer entirely on its own.
As the student generates its answer, the teacher acts as a live instructor. The teacher has access to the full, customized prompt and evaluates the student’s output. At every step along the student’s generation, the system compares the student’s token distribution against what the context-aware teacher would do.
OPCD uses “reverse KL divergence” to grade the student. “By minimizing reverse KL divergence, it promotes ‘mode-seeking’ behavior. It focuses on high-probability regions of the student’s distribution,” Ye said. “It suppresses tokens that the student considers unlikely, even if the teacher’s belief assigned them high probability. This alignment helps the student correct its own mistakes and avoid the broad, hallucinatory distributions of standard distillation.”
Because the student model actively practices making its own decisions and learns to correct its own mistakes during training, it behaves more reliably when deployed in a live application. It successfully bakes complex business rules, safety constraints, or specialized knowledge directly into its permanent memory.
The researchers tested OPCD in two key areas: experiential knowledge distillation and system prompt distillation. For experiential knowledge distillation, the researchers wanted to see if an LLM could learn from its own past successes and permanently adopt those lessons. They tested this on models of various sizes, using mathematical reasoning problems.
First, the model solved problems and was asked to write down general rules it learned from its successes. Then, using OPCD, they baked those written lessons directly into the model’s parameters. The results showed that the models improved dramatically without needing the learned experience pasted into their prompts anymore. On complex math problems, an 8-billion-parameter model improved from a 75.0% baseline to 80.9%. For example, on the Frozen Lake navigation game, a small 1.7-billion parameter model initially had a success rate of 6.3%. After OPCD baked in the learned experience, its accuracy jumped to 38.3%.
The second set of experiments were on long system prompts. Enterprises often use massive system prompts to enforce strict behavioral guidelines, like maintaining a professional tone, ensuring medical accuracy, or filtering out toxic language. The researchers tested whether OPCD could permanently bake these dense behavioral rules into the models so they would not have to be sent with every single user query. Their experiments show that OPCD successfully internalized these complex rules and massively boosted performance. When testing a 3-billion parameter Llama model on safety and toxicity classification, the base model scored 30.7%. After using OPCD to internalize the safety prompt, its accuracy spiked to 83.1%. On medical question answering, the same model improved from 59.4% to 76.3%.
One of the key challenges of fine-tuning models is catastrophic forgetting, where the model becomes too focused on the fine-tune task and worse at general tasks. The researchers tracked out-of-distribution performance to test for this tunnel vision. When they distilled strict safety rules into a model, they immediately tested its ability to answer unrelated medical questions. OPCD successfully maintained the model’s general medical knowledge, outperforming the old off-policy methods by approximately 4 percentage points. It specialized without losing its broader intelligence.
While OPCD is a powerful tool for internalizing static knowledge and complex rules, it does not replace all external context methods. “RAG is better when the required information is highly dynamic or involves a massive, frequently updated external database that cannot be compressed into model weights,” Ye said.
For enterprise teams evaluating their pipelines, adopting OPCD does not require overhauling existing systems or investing in specialized hardware. “OPCD can be integrated into existing workflows with very little friction,” Ye said. “Any team already running standard RLVR [Reinforcement Learning from Verifiable Rewards] pipelines can adopt OPCD without major architectural changes.”
In practice, the student model acts as the policy model performing rollouts, while the frozen teacher model serves as a reference providing logits. The hardware requirements are highly accessible. According to Ye, enterprise teams can reproduce the researchers’ experiments using about eight A100 GPUs.
The data requirements are similarly lightweight. For experiential knowledge distillation, developers only need around 30 seed examples to generate solution traces. Because the technique is applied to previously unoptimized environments, even a small amount of data yields the majority of the performance improvement. For system prompt distillation, existing optimized prompts and standard task datasets are sufficient.
The researchers built their own implementation on verl, an open-source RLVR codebase, proving that the technique fits cleanly within conventional reinforcement learning frameworks. They plan to release their implementation as open source following internal reviews.
Looking ahead, OPCD paves the way for genuinely self-improving models that continuously adapt to bespoke enterprise environments. Once deployed, a model can extract lessons from real-world interactions and use OPCD to progressively internalize those characteristics without requiring manual supervision or data annotation from model trainers.
“This represents a fundamental paradigm shift in model improvement: the core improvements to the model would move from training time to test time,” Ye said. “Using the model—and allowing it to gather experience—would become the primary driver of its advancement.”
When your average daily token usage is 8 billion a day, you have a massive scale problem.
This was the case at AT&T, and chief data officer Andy Markus and his team recognized that it simply wasn’t feasible (or economical) to push everything through large reasoning models.
So, when building out an internal Ask AT&T personal assistant, they reconstructed the orchestration layer. The result: A multi-agent stack built on LangChain where large language model “super agents” direct smaller, underlying “worker” agents performing more concise, purpose-driven work.
This flexible orchestration layer has dramatically improved latency, speed and response times, Markus told VentureBeat. Most notably, his team has seen up to 90% cost savings.
“I believe the future of agentic AI is many, many, many small language models (SLMs),” he said. “We find small language models to be just about as accurate, if not as accurate, as a large language model on a given domain area.”
Most recently, Markus and his team used this re-architected stack along with Microsoft Azure to build and deploy Ask AT&T Workflows, a graphical drag-and-drop agent builder for employees to automate tasks.
The agents pull from a suite of proprietary AT&T tools that handle document processing, natural language-to-SQL conversion, and image analysis. “As the workflow is executed, it’s AT&T’s data that’s really driving the decisions,” Markus said. Rather than asking general questions, “we’re asking questions of our data, and we bring our data to bear to make sure it focuses on our information as it makes decisions.”
Still, a human always oversees the “chain reaction” of agents. All agent actions are logged, data is isolated throughout the process, and role-based access is enforced when agents pass workloads off to one another.
“Things do happen autonomously, but the human on the loop still provides a check and balance of the entire process,” Markus said.
AT&T doesn’t take a “build everything from scratch” mindset, Markus noted; it’s more relying on models that are “interchangeable and selectable” and “never rebuilding a commodity.” As functionality matures across the industry, they’ll deprecate homegrown tools in lieu of off the shelf options, he explained.
“Because in this space, things change every week, if we’re lucky, sometimes multiple times a week,” he said. “We need to be able to pilot, plug in and plug out different components.”
They do “really rigorous” evaluations of available options as well as their own; for instance, their Ask Data with Relational Knowledge Graph has topped the Spider 2.0 text to SQL accuracy leaderboard, and other tools have scored highly on the BERT SQL benchmark.
In the case of homegrown agentic tools, his team uses LangChain as a core framework, fine-tunes models with standard retrieval-augmented generation (RAG) and other in-house algorithms, and partners closely with Microsoft, using the tech giant’s search functionality for their vector store.
Ultimately, though, it’s important not to just fuse agentic AI or other advanced tools into everything for the sake of it, Markus advised. “Sometimes we over complicate things,” he said. “Sometimes I’ve seen a solution over engineered.”
Instead, builders should ask themselves whether a given tool actually needs to be agentic. This could include questions like: What accuracy level could be achieved if it was a simpler, single-turn generative solution? How could they break it down into smaller pieces where each piece could be delivered “way more accurately”?, as Markus put it.
Accuracy, cost and tool responsiveness should be core principles. “Even as the solutions have gotten more complicated, those three pretty basic principles still give us a lot of direction,” he said.
Ask AT&T Workflows has been rolled out to 100,000-plus employees. More than half say they use it every day, and active adopters report productivity gains as high as 90%, Markus said.
“We’re looking at, are they using the system repeatedly? Because stickiness is a good indicator of success,” he said.
The agent builder offers “two journeys” for employees. One is pro-code, where users can program Python behind the scenes, dictating rules for how agents should work. The other is no-code, featuring a drag-and-drop visual interface for a “pretty light user experience,” Markus said.
Interestingly, even proficient users are gravitating toward the latter option. At a recent hackathon geared to a technical audience, participants were given a choice of both, and more than half chose low code. “This was a surprise to us, because these people were all very competent in the programming aspect,” Markus said.
Employees are using agents across a variety of functions; for instance, a network engineer may build a series of them to address alerts and reconnect customers when they lose connectivity. In this scenario, one agent can correlate telemetry to identify the network issue and its location, pull change logs and check for known issues. Then, it can open a trouble ticket.
Another agent could then come up with ways to solve the issue and even write new code to patch it. Once the problem is resolved, a third agent can then write up a summary with preventative measures for the future.
“The [human] engineer would watch over all of it, making sure the agents are performing as expected and taking the right actions,” Markus said.
That same engineering discipline — breaking work into smaller, purpose-built pieces — is now reshaping how AT&T writes code itself, through what Markus calls “AI-fueled coding.”
He compared the process to RAG; devs use agile coding methods in an integrated development environment (IDE) along with “function-specific” build archetypes that dictates how code should interact.
The output is not loose code; the code is “very close to production grade,” and could reach that quality in one turn. “We’ve all worked with vibe coding, where we have an agentic kind of code editor,” Markus noted. But AI-fueled coding “eliminates a lot of the back and forth iterations that you might see in vibe coding.”
He sees this coding technique as “tangibly redefining” the software development cycle, ultimately shortening development timelines and increasing output of production-grade code. Non-technical teams can also get in on the action, using plain language prompts to build software prototypes.
His team, for instance, has used the technique to build an internal curated data product in 20 minutes; without AI, building it would have taken six weeks. “We develop software with it, modify software with it, do data science with it, do data analytics with it, do data engineering with it,” Markus said. “So it’s a game changer.”
ServiceNow is handling 90% of its own employee IT requests autonomously, resolving cases 99% faster than human agents. On Thursday it announced the product technology it wants to use to do the same for everyone else.
Organizations have spent three years running pilots that stall when AI gets to the execution layer. The agent can identify the problem and recommend a fix, then hand it back to a human because it lacks the permissions to finish the job or because no one trusts it to act autonomously inside a governed environment.
The gap most teams are hitting isn’t capability. It’s governance and workflow continuity.
ServiceNow’s answer is a new framework called Autonomous Workforce; a new employee-facing product called EmployeeWorks built on its December acquisition of Moveworks; and an underlying architectural approach it calls “role automation.”
ServiceNow has been building toward this for two decades. The platform started as a ticketing system, evolved into a workflow automation engine, and spent the last two years layering AI onto that foundation through its Now Assist product.
What’s different is that the new approach stops treating AI as a feature sitting on top of workflows and starts treating it as a worker operating inside them. That shift, from AI that assists to AI that executes, is where the broader enterprise market is headed. ServiceNow is making a specific architectural bet about how to get there.
The announcement has three parts: ServiceNow EmployeeWorks lets employees describe a problem in plain language and have it fixed without filing a ticket; Autonomous Workforce executes work end to end; and role automation is the architectural layer that governs how those specialists operate inside existing enterprise permissions.
Most enterprise AI assistants including Microsoft Copilot and Google Gemini require employees to know which tool handles which problem. Moveworks, which had 5.5 million enterprise users before the December acquisition, was built around a single entry point that routes across that ambiguity automatically.
Bhavin Shah, founder of Moveworks and now SVP at ServiceNow following the acquisition, framed the problem directly in a briefing with press and analysts.
“Over the last two years, organizations have raced to adopt AI, but in many cases that rush has created fragmented tools, disconnected AI experiences and employees bouncing between systems just to get simple things done,” he said.
ServiceNow is proposing a new architectural layer it calls role automation, and it differs from the agents most enterprises are already running.
Conventional AI agents are task-oriented: they’re given a goal, they reason toward it and in doing so they figure out what they’re allowed to do at runtime. That creates problems in enterprise environments where governance, audit trails and permission boundaries aren’t optional.
With role automation, an AI specialist does not reason its way into permissions. It inherits them. The same access control framework, CMDB(configuration management database) context, SLA (service level agreement) logic and entitlement rules that govern human workers on the ServiceNow platform govern the AI specialist from the moment it is deployed. It cannot exceed its defined scope. It cannot self-escalate privileges based on what it learns mid-task.
The company draws a three-tier distinction: task agents handle individual automation steps, agentic workflows mix deterministic and probabilistic execution, and role automation sits above both as a fully virtualized employee role with defined responsibilities and pre-inherited governance.
The first product built on this architecture, the Level 1 Service Desk AI Specialist, handles common IT requests end to end — password resets, software access provisioning and network troubleshooting — documenting each resolution and escalating to a human agent only when it hits something outside its defined scope.
Alan Rosa has seen what happens when AI governance fails in healthcare. As CISO and SVP of infrastructure and operations at CVS Health, he manages AI deployment across 300,000 employees where compliance isn’t optional.
Speaking at the same briefing, his framework for scaling AI maps directly onto what ServiceNow is claiming architecturally. CVS Health was already a customer of both ServiceNow and Moveworks before the December acquisition. Rosa said the combination of the two platforms is encouraging and that the potential is “coming to life,” though CVS Health has not committed publicly to deploying Autonomous Workforce.
“Boring is beautiful,” Rosa said. “Predictable. Stable. You have to start with responsible, explainable AI. No bias, no hallucinations, clear guardrails. Everyone understands the rules.”
On the temptation to chase the newest AI capabilities before governance is in place, he was direct: “Don’t chase butterflies. Focus on gritty, unsexy, operational use cases. The ones with real ROI that have an impact on people’s lives.”
Rosa’s approach treats AI as a continuously evolving set of capabilities requiring dynamic rather than static testing. CVS Health runs every AI use case through clinical, legal, privacy and security review before it touches production.
“Static review doesn’t cut it when AI is learning and adapting,” he said. “Wash, rinse, repeat.”
Rosa’s framework requires governance to be embedded in the deployment architecture from the start, not retrofitted after a problem surfaces. That is precisely the claim ServiceNow is making about role automation. AI specialists that inherit existing enterprise permissions and workflow logic are structurally less likely to break governance boundaries than agents that determine their own scope at runtime.
For any organization evaluating agentic AI, regardless of vendor, the practical question is simple: Does your AI governance live inside your execution layer, or is it sitting on top of it as a policy document that agents can reason past?
That is what ServiceNow is trying to solve with Autonomous Workforce and EmployeeWorks, baking governance and workflow context directly into the agentic layer rather than bolting it on afterward. For practitioners, the starting point is governance architecture, not capability. Before deploying any agentic AI, map where your permissions, workflow logic and audit requirements actually live. If that foundation isn’t in place, no agent framework will hold at enterprise scale.
“Scale and trust go together,” Rosa said. “If you lose trust, you lose the right to scale.”
Perplexity, the AI-powered search company valued at $20 billion, on Wednesday launched what it calls the most ambitious product in its three-year history: a multi-model agent orchestration platform called Computer that coordinates 19 different AI models to complete complex, long-running workflows entirely in the background.
The product, currently available only to Perplexity Max subscribers at $200 per month, is the company’s clearest articulation yet of a thesis it has been refining for more than a year: that AI models are not converging into general-purpose commodities but are instead specializing — and that the company best positioned to win the next era of AI is the one that can orchestrate all of them together.
“What has Perplexity been up to last two months? We’ve silently been working on the next big thing,” CEO Aravind Srinivas wrote on X, announcing that “Computer unifies every current capability of AI into a single system.” Srinivas said the system treats models as interchangeable tools rather than core products. “It’s multi-model by design,” he wrote. “When models specialise, they just become tools similar to the file system, CLI tools, connectors, browser, search.”
Computer arrives at a moment when the AI industry is grappling with a fundamental question: now that foundation models have become extraordinarily capable, who captures the value? The model makers — OpenAI, Anthropic, Google — or the companies that sit above them and turn raw intelligence into reliable, accurate products?
Perplexity is making a $20 billion bet on the latter.
At its core, Computer functions as what Perplexity describes as “a general-purpose digital worker” — a system that can accept a high-level objective from a user, decompose it into subtasks, and delegate those subtasks to whichever AI model is best suited for each one. The Verge described it as existing “somewhere between OpenClaw and Claude Cowork,” referring to the viral open-source autonomous agent and Anthropic’s enterprise collaboration tool, respectively.
The platform’s central reasoning engine runs on Anthropic’s Claude Opus 4.6, which handles orchestration logic and coding tasks. Google’s Gemini powers deep research queries. Google’s Nano Banana generates images, and Veo 3.1 handles video. xAI’s Grok is deployed for lightweight, speed-sensitive tasks. OpenAI’s GPT-5.2 manages long-context recall and expansive web search. In total, the system coordinates 19 models on the backend, according to the company.
That model roster is not fixed. Perplexity says new models can be added as they demonstrate strength in specific domains, and the existing lineup will shift as models evolve. Users can also step into the orchestrator role themselves, manually assigning subtasks to particular models if they prefer.
What makes Computer distinct from existing agent tools is its combination of scope and accessibility. A user can describe a desired outcome — say, “Plan a weeklong trip to Japan, find flights under $1,200, and build a full itinerary with restaurant reservations” — and Computer will autonomously break that project into components, assign each to the right model, and work on it in the background. Perplexity says the system can operate quietly for extended periods, checking in with the user only when it genuinely needs input.
The intellectual foundation of Computer rests on data that Perplexity has been collecting across its enterprise customer base — data that, according to the company, no other AI company has access to at the same scale.
At a recent press briefing that VentureBeat attended with other reporters in San Francisco, Perplexity executives shared enterprise usage statistics that illustrated a dramatic shift in how businesses use AI models. In January 2025, more than 90 percent of enterprise tasks on the Perplexity platform were spread across just two models. By December 2025, no single model commanded more than 25 percent of usage across businesses and task types.
That shift, executives said, was driven partly by increasingly intelligent model routing on Perplexity’s side, and partly by a simple reality: models are getting better at different things, not the same things. A new frontier model emerged on average every 17.5 days in 2025, and each one brought distinct strengths rather than uniform improvement.
Claude, for instance, has emerged as the model of choice for software engineering tasks — a reputation so strong that even OpenClaw, the viral autonomous agent created by Austrian programmer Peter Steinberger (who was subsequently hired by OpenAI), was originally built on Claude’s code capabilities. But Claude’s strengths in coding do not translate to writing or creative generation, where Gemini tends to outperform. And in long-context retrieval and broad web search, GPT-5.2 holds advantages.
“What we’ve learned in this time is that they are not commoditizing. They’re specializing,” a senior Perplexity executive said at the briefing, characterizing Claude Opus 4.6 as “a terrible writer” while noting its coding prowess, and adding: “Everybody has job security on that one.”
This specialization dynamic creates what Perplexity sees as a structural advantage. A marketing team using Claude, executives argued, will generally produce worse results than one using Gemini. An engineering team using Gemini will underperform one using Claude. No company operates with only one type of team — and no single model can serve all of them equally well.
Computer’s launch arrives in the immediate wake of OpenClaw, the open-source autonomous agent that went viral earlier this month and prompted OpenAI to hire its creator. OpenClaw captured the imagination of the AI community by demonstrating what a fully autonomous agent could accomplish when given broad access to a user’s entire digital ecosystem — files, email, messaging apps, API keys, and more.
But it also demonstrated the risks. In a widely shared incident this week, Meta AI security researcher Summer Yue posted screenshots on X of her frantic attempts to stop OpenClaw from deleting her entire email inbox — a process the agent had initiated and was refusing to halt. “I had to RUN to my Mac Mini like I was diffusing a bomb,” Yue wrote.
Perplexity has been vocal about why Computer runs entirely in the cloud rather than accessing a user’s local machine — an approach taken by rivals like Anthropic’s Claude and OpenAI’s Operator.
The company argues that local access creates unnecessary risk, comparing it to malware in how easily it can damage data or expose sensitive information. Computer instead operates inside what Perplexity describes as a safe and secure development sandbox, meaning security failures are contained and cannot spread to a user’s primary network or device. The company also said it has run thousands of tasks internally using Computer, from publishing web copy to building apps.
The distinction extends to accessibility. Where OpenClaw requires terminal access, API key configuration, and a dedicated machine (typically a Mac mini), Computer is designed to be invoked from a phone, a Slack message, or the Perplexity app.
At the press briefing, executives elaborated on the philosophy, positioning Computer’s browser agent capabilities — built on Perplexity’s Comet browser technology — as central to the product. One executive noted that Perplexity’s browser agent usage numbers are three to five times higher than ChatGPT’s agent numbers published by The Information in January, despite Perplexity’s much smaller user base.
Perplexity’s product ambitions are backed by a business that, by the company’s own metrics, is growing faster than its user base — and executives say the company has barely begun to focus on monetization.
At the press briefing, executives disclosed that Perplexity grew users by 3.7x in 2025 and revenue by 4.7x, meaning the company is extracting more value from its existing users over time. Consumer subscriptions remain the largest revenue component, but the enterprise business is ramping with what executives acknowledged is a remarkably lean operation.
“We only have five people on our enterprise sales team,” one executive said, before adding that the company’s revenue per employee working on deals may be unmatched in the industry. Another executive noted that 92 percent of the Fortune 500 have Perplexity usage — though that figure encompasses employees signing up with personal accounts and work email addresses for the consumer version, not necessarily formal enterprise contracts.
A common enterprise sales conversation, executives said, starts with: “Did you know that there’s already 3,000 of your employees using Perplexity, and they’re using the consumer version that doesn’t adhere to all of your security policies?”
Notably, Perplexity is not pursuing advertising revenue, even as competitors like OpenAI move toward ad-supported models. Executives said advertising is fundamentally misaligned with the company’s accuracy mission. “The challenge with ads is, you know, a user will just start doubting everything,” one executive said. The company confirmed it has taken no economics on its shopping integrations and expressed doubt that any shopping-based monetization would materialize this year.
On the question of an IPO, Srinivas indicated the company has “very good properties of a company that can go public” given its low capital expenditure and healthy margins, but stopped short of committing to a timeline. Another executive warned that “a lot of IPO talk is hype” and that “if you over promise and under deliver the market punches you severely.”
TestingCatalog also reported this week that a new “Usage and Credits” settings area has appeared in Perplexity’s development builds, which would let users purchase additional credits to extend usage — potentially easing backlash from subscribers who saw their Deep Research query limits cut from roughly 500 per day to as few as 20 per month between late 2025 and early 2026.
Perhaps the least-discussed but most strategically significant element of Perplexity’s story is its search API business — an infrastructure play that positions the company not just as a consumer product but as a foundational layer for the broader AI ecosystem.
At the press briefing, executives revealed that Perplexity launched its search API approximately four months ago and already has four of the “Mag Seven” — the seven largest technology companies by market capitalization — using it in production at significant scale. “You guys cover the Mag Seven, you know that they don’t turn on a feature in production unless they’ve run rigorous evals and compared it,” one executive told reporters.
This disclosure suggests that the world’s largest technology companies have evaluated Perplexity’s search index against alternatives and concluded it is better optimized for AI-native use cases — a fundamentally different optimization target than Google’s traditional index, which was designed for humans scanning lists of links.
“Everything in our index is optimized, not for a human to see 10 blue links,” one executive explained. “It’s for an AI to be able to take those snippets and consume it in this context window and then reason through it.”
The company also confirmed it has fully independent search infrastructure, no longer relying on any third-party APIs from Google or Bing for its index — a significant departure from its earlier years.
For Chinese open-source models, which Perplexity uses in its orchestration stack, the company runs all inference from its own U.S. data centers, post-training the models for accuracy, removing what executives described as “state-infused propaganda,” and building custom inference kernels. The company open-sourced its methodology for depropagandizing Chinese models for others to use as well.
The search API creates a powerful data flywheel, executives argued: Perplexity can observe which snippets its search ranker surfaces for a given query, then track which of those snippets the LLM actually uses in its final output. That feedback loop makes the next query on a similar topic smarter — an advantage that pure API search businesses like Exa cannot replicate because they lack the consumer product generating user queries and feedback.
Perplexity’s ambitions are not without complications. The company faces active lawsuits from multiple publishers, and the legal landscape grew more contentious this week.
As Business Insider’s Melia Russell reported, Perplexity filed a motion on February 24 in its ongoing legal battle with Dow Jones (publisher of The Wall Street Journal) and the New York Post, alleging that the publishers “cherry-picked” responses from Perplexity’s search engine to support their copyright claims. The company said it identified hundreds of prompts the publishers submitted that “were clear attempts to induce copyright-infringing answers,” including one instance where a user allegedly hit the “retry” button more than 50 times.
At the press briefing, Perplexity executives framed the broader copyright debate in historical terms, noting that waves of lawsuits have accompanied every major technology shift since radio. They expressed confidence that AI companies will ultimately prevail, particularly on the question of whether underlying knowledge — as distinct from unique creative expression — can be freely accessed by AI systems. “Countries have copyright law for one reason: to promote innovation,” one executive said, noting that the law protects unique expression while keeping the underlying knowledge open.
On user agents specifically, executives argued that a user’s AI agent is legally and technologically an extension of the user, not an independent actor. In the Amazon lawsuit, which challenges Perplexity’s ability to act as a purchasing agent on behalf of users, one executive offered a pointed analogy: “What Amazon’s claiming is that you shouldn’t be able to have your personal shopper be employed by you. It needs to be employed by them. They want you to use Rufus.”
Executives also clarified the company’s approach to citations, noting that citing a source like The New York Times (which is currently suing the company) does not necessarily mean Perplexity crawled that publication directly. “We can get the summary of that somewhere else, but we cite, we always try to cite that original source,” one executive said. “So drive that traffic to the New York Times if somebody clicks instead of driving them to a summary.”
Computer’s launch crystallizes a tension that has been building in the AI industry for months. The major model makers — OpenAI, Anthropic, Google — have been racing to build end-to-end products that keep users within their ecosystems. OpenAI’s Codex and ChatGPT, Anthropic’s Claude Code and Cowork, Google’s Gemini — all assume that one model family can handle the full range of user needs.
Perplexity is making the opposite bet: that the future belongs to the orchestration layer, not the model layer. It is a bet with historical parallels. In the early days of cloud computing, the companies that built the best abstraction layers above commodity infrastructure — not the infrastructure providers themselves — often captured outsized value. Perplexity is positioning itself as that abstraction layer for AI.
The risk, of course, is that model makers could restrict API access or degrade service to platform competitors. Srinivas has said he isn’t worried, noting that he received congratulations from Anthropic and Google after Computer’s launch and that model makers benefit when their systems are part of broader workflows. But the AI industry’s history of platform dynamics suggests this détente may not last forever.
For enterprise technology leaders evaluating their AI strategies, Computer raises a practical question: should organizations standardize on a single model provider’s ecosystem, accepting its limitations in exchange for simplicity? Or should they invest in multi-model orchestration, gaining access to the best capabilities across providers at the cost of additional complexity?
Perplexity is betting that as models continue to specialize and the gap between their respective strengths widens, the answer will become obvious. The company’s enterprise usage data — showing a market that went from two-model dominance to no-model dominance in just 12 months — suggests the shift is already underway.
Computer is currently available to Perplexity Max subscribers, with a rollout to Pro and Enterprise users planned in the coming weeks. The company has also announced a developer event on March 11, where it plans to share more details about its search API, ranking embeddings, and the infrastructure powering its orchestration stack.