It’s always the same story: A new technology appears and everyone starts talking about how it’ll change everything. Then capital rushes in, companies form overnight, and valuations climb faster than anyone can justify. Then, many, many months later, the…
“When you get a demo and something works 90% of the time, that’s just the first nine.” — Andrej Karpathy

The “March of Nines” frames a common production reality: You can reach the first 90% reliability with a strong demo, and each additional nine often …
As models get smarter and more capable, the “harnesses” around them must also evolve.
This “harness engineering” is an extension of context engineering, says LangChain co-founder and CEO Harrison Chase in a new VentureBeat Beyond the Pilot podcast episode. Whereas traditional AI harnesses have tended to constrain models from running in loops and calling tools, harnesses specifically built for AI agents allow them to interact more independently and effectively perform long-running tasks.
Chase also weighed in on OpenAI’s acquisition of OpenClaw, arguing that its viral success came down to a willingness to “let it rip” in ways that no major lab would — and questioning whether the acquisition actually gets OpenAI closer to a safe enterprise version of the product.
“The trend in harnesses is to actually give the large language model (LLM) itself more control over context engineering, letting it decide what it sees and what it doesn’t see,” Chase says. “Now, this idea of a long-running, more autonomous assistant is viable.”
While the concept of allowing LLMs to run in a loop and call tools seems relatively simple, it’s difficult to pull off reliably, Chase noted. For a while, models were “below the threshold of usefulness” and simply couldn’t run in a loop, so devs used graphs and wrote chains to get around that. Chase pointed to AutoGPT — once the fastest-growing GitHub project ever — as a cautionary example: same architecture as today’s top agents, but the models weren’t good enough yet to run reliably in a loop, so it faded fast.
But as LLMs keep improving, teams can construct environments where models can run in loops and plan over longer horizons, and they can continually improve these harnesses. Previously, “you couldn’t really make improvements to the harness because you couldn’t actually run the model in a harness,” Chase said.
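The loop-plus-tools pattern Chase describes can be sketched in a few lines. This is a minimal illustration, not LangChain's implementation: the "model" is a hard-coded stub standing in for a real LLM call, and the message format is invented for the example.

```python
# Minimal agent-harness sketch: a model runs in a loop, choosing tools
# until it emits a final answer. The harness caps runaway loops.

def stub_model(messages):
    """Pretend LLM: asks for a tool on the first turn, then finishes."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"final": f"The result is {messages[-1]['content']}"}

TOOLS = {"add": lambda a, b: a + b}

def run_agent(task, model=stub_model, max_steps=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):          # hard cap on loop iterations
        action = model(messages)
        if "final" in action:
            return action["final"]
        result = TOOLS[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("step budget exhausted")

print(run_agent("What is 2 + 3?"))  # → The result is 5
```

The early graph-and-chain workarounds Chase mentions existed precisely because models could not be trusted to drive a loop like this one.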
LangChain’s answer to this is Deep Agents, a customizable general-purpose harness.
Built on LangChain and LangGraph, it has planning capabilities, a virtual filesystem, context and token management, code execution, and skills and memory functions. Further, it can delegate tasks to subagents; these are specialized with different tools and configurations and can work in parallel. Context is also isolated, meaning subagent work doesn’t clutter the main agent’s context, and large subtask context is compressed into a single result for token efficiency.
All of these agents have access to file systems, Chase explained, and can essentially create to-do lists that they can execute on and track over time.
“When it goes on to the next step, and it goes on to step two or step three or step four out of a 200-step process, it has a way to track its progress and keep that coherence,” Chase said. “It comes down to letting the LLM write its thoughts down as it goes along, essentially.”
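The "write your thoughts down" pattern can be sketched as an agent keeping a to-do file in its virtual filesystem and checking items off as it goes, so progress survives across steps. The file name and checkbox format here are illustrative, not Deep Agents' actual layout.

```python
# Toy in-memory virtual filesystem: path -> contents
fs = {}

def write_plan(steps):
    """Agent writes its plan down as a checklist file."""
    fs["todo.md"] = "\n".join(f"[ ] {s}" for s in steps)

def complete_step(step):
    """Mark one step done without losing the rest of the plan."""
    lines = fs["todo.md"].splitlines()
    fs["todo.md"] = "\n".join(
        l.replace("[ ]", "[x]", 1) if l.endswith(step) else l for l in lines
    )

def next_step():
    """Re-read the file to recover where the agent left off."""
    for line in fs["todo.md"].splitlines():
        if line.startswith("[ ]"):
            return line[4:]
    return None  # plan finished

write_plan(["fetch data", "clean data", "write report"])
complete_step("fetch data")
print(next_step())  # → clean data
```

Because the plan lives in a file rather than in the context window, it remains recoverable even after the context is compacted.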
He emphasized that harnesses should be designed so that models can maintain coherence over longer tasks, and be “amenable” to models deciding when to compact context at points they determine are “advantageous.”
Also, giving agents access to code interpreters and bash tools increases flexibility. And providing agents with skills, as opposed to tools loaded up front, allows them to pull in information only when they need it. “So rather than hard code everything into one big system prompt,” Chase explained, “you could have a smaller system prompt, ‘This is the core foundation, but if I need to do X, let me read the skill for X. If I need to do Y, let me read the skill for Y.'”
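The skills idea reduces to lazy prompt assembly: a small core prompt, plus skill documents loaded on demand. The skill names and contents below are made up for illustration.

```python
# Skills-as-files sketch: unused skills never enter the context window.
SKILLS = {
    "send_invoice": "To send an invoice: look up the client, fill the template...",
    "book_travel": "To book travel: check the policy doc, compare fares...",
}

CORE_PROMPT = "You are a helpful assistant. Load a skill before using it."

def build_context(task_skill):
    """Assemble the prompt lazily: core prompt plus one needed skill."""
    return CORE_PROMPT + "\n\n" + SKILLS[task_skill]

ctx = build_context("send_invoice")
print("book_travel" in ctx)  # → False: the travel skill stayed on disk
```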
Essentially, context engineering is a “really fancy” way of saying: What is the LLM seeing? Because that’s different from what developers see, he noted. When human devs can analyze agent traces, they can put themselves in the AI’s “mindset” and answer questions like: What is the system prompt? How is it created? Is it static or is it populated? What tools does the agent have? When it makes a tool call, and gets a response back, how is that presented?
“When agents mess up, they mess up because they don’t have the right context; when they succeed, they succeed because they have the right context,” Chase said. “I think of context engineering as bringing the right information in the right format to the LLM at the right time.”
Listen to the podcast to hear more about:
How LangChain built its stack: LangGraph as the core pillar, LangChain at the center, Deep Agents on top.
Why code sandboxes will be the next big thing.
How a different type of UX will evolve as agents run at longer intervals (or continuously).
Why traces and observability are core to building an agent that actually works.
You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.
Enterprise AI applications that handle large documents or long-horizon tasks face a severe memory bottleneck. As the context grows longer, so does the KV cache, the area where the model’s working memory is stored.
A new technique developed by researchers at MIT addresses this challenge with a fast compression method for the KV cache. The technique, called Attention Matching, manages to compact the context by up to 50x with very little loss in quality.
While it is not the only memory compaction technique available, Attention Matching stands out for its execution speed and impressive information-preserving capabilities.
Large language models generate their responses sequentially, one token at a time. To avoid recalculating the entire conversation history from scratch for every predicted word, the model stores a mathematical representation of every previous token it has processed, also known as the key and value pairs. This critical working memory is known as the KV cache.
The KV cache scales with conversation length because the model is forced to retain these keys and values for all previous tokens in a given interaction. This consumes expensive hardware resources. “In practice, KV cache memory is the biggest bottleneck to serving models at ultra-long context,” Adam Zweiger, co-author of the paper, told VentureBeat. “It caps concurrency, forces smaller batches, and/or requires more aggressive offloading.”
In modern enterprise use cases, such as analyzing massive legal contracts, maintaining multi-session customer dialogues, or running autonomous coding agents, the KV cache can balloon to many gigabytes of memory for a single user request.
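The "many gigabytes" claim is easy to sanity-check with the standard KV cache size formula: two vectors per token, per layer, per KV head, times head dimension, times bytes per element. The concrete model shape below is an assumption for illustration, roughly a Llama-3.1-70B-like configuration with grouped-query attention and 16-bit weights.

```python
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Size of the KV cache: 2 (key + value) vectors per token per layer."""
    return 2 * tokens * layers * kv_heads * head_dim * dtype_bytes

gb = kv_cache_bytes(128_000) / 1e9
print(f"{gb:.1f} GB for one 128K-token request")  # → 41.9 GB
```

At roughly 42 GB for a single long-context request, the cache alone can exceed the memory of many inference GPUs, which is why it caps concurrency and batch size.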
To solve this massive bottleneck, the AI industry has tried several strategies, but these methods fall short when deployed in enterprise environments where extreme compression is necessary. A class of technical fixes includes optimizing the KV cache by either evicting tokens the model deems less important or merging similar tokens into a single representation. These techniques work for mild compression but “degrade rapidly at high reduction ratios,” according to the authors.
Real-world applications often rely on simpler techniques, with the most common approach being to simply drop the older context once the memory limit is reached. But this approach causes the model to lose older information as the context grows long. Another alternative is context summarization, where the system pauses, writes a short text summary of the older context, and replaces the original memory with that summary. While this is an industry standard, summarization is highly lossy and heavily damages downstream performance because it might remove pertinent information from the context.
Recent research has proven that it is technically possible to highly compress this memory using a method called Cartridges. However, this approach requires training latent KV cache models through slow, end-to-end mathematical optimization. This gradient-based training can take several hours on expensive GPUs just to compress a single context, making it completely unviable for real-time enterprise applications.
Attention Matching achieves high-level compaction ratios and quality while being orders of magnitude faster than gradient-based optimization. It bypasses the slow training process through clever mathematical tricks.
The researchers realized that to perfectly mimic how an AI interacts with its memory, they need to preserve two mathematical properties when compressing the original key and value vectors into a smaller footprint. The first is the “attention output,” which is the actual information the AI extracts when it queries its memory. The second is the “attention mass,” which acts as the mathematical weight that a token has relative to everything else in the model’s working memory. If the compressed memory can match these two properties, it will behave exactly like the massive, original memory, even when new, unpredictable user prompts are added later.
“Attention Matching is, in some ways, the ‘correct’ objective for doing latent context compaction in that it directly targets preserving the behavior of each attention head after compaction,” Zweiger said. While token-dropping and related heuristics can work, explicitly matching attention behavior simply leads to better results.
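Written schematically, the two matched quantities look as follows. The notation here is reconstructed from the description above, not taken verbatim from the paper: for a reference query $q$, the full cache $(K, V)$ with head dimension $d$ produces an attention output, and the compacted cache $(\tilde K, \tilde V)$, together with a per-key bias $\beta$, is fit so that both the output and the total attention mass are approximately preserved.

```latex
% Full cache: attention output for a reference query q
o(q) = \mathrm{softmax}\!\left(\frac{qK^{\top}}{\sqrt{d}}\right) V

% Compacted cache (\tilde K, \tilde V) with per-key bias \beta, fit so that
\mathrm{softmax}\!\left(\frac{q\tilde K^{\top}}{\sqrt{d}} + \beta\right)\tilde V
\;\approx\; o(q),
\qquad
\sum_{j} \exp\!\left(\frac{q\tilde k_{j}^{\top}}{\sqrt{d}} + \beta_{j}\right)
\;\approx\; \sum_{i} \exp\!\left(\frac{q k_{i}^{\top}}{\sqrt{d}}\right)
```

The bias term is what lets one retained key stand in for the attention mass of many removed keys.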
Before compressing the memory, the system generates a small set of “reference queries” that act as a proxy for the types of internal searches the model is likely to perform when reasoning about the specific context. If the compressed memory can accurately answer these reference queries, it will very likely succeed at answering the user’s actual questions later. The authors suggest various methods for generating these reference queries, including appending a hidden prompt to the document telling the model to repeat the previous context, known as the “repeat-prefill” technique. They also suggest a “self-study” approach where the model is prompted to perform a few quick synthetic tasks on the document, such as aggregating all key facts or structuring dates and numbers into a JSON format.
With these queries in hand, the system picks a set of keys to preserve in the compacted KV cache based on signals like the highest attention value. It then uses the keys and reference queries to calculate the matching values along with a scalar bias term. This bias ensures that pertinent information is preserved, allowing each retained key to represent the mass of many removed keys.
This formulation makes it possible to fit the values with simple algebraic techniques, such as ordinary least squares and nonnegative least squares, entirely avoiding compute-heavy gradient-based optimization. This is what makes Attention Matching super fast in comparison to optimization-heavy compaction methods. The researchers also apply chunked compaction, processing contiguous chunks of the input independently and concatenating them, to further improve performance on long contexts.
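The least-squares step can be illustrated on toy data. This is a simplified single-head sketch, not the paper's full procedure (no bias term, arbitrary key selection, random "reference queries"): given queries Q and a retained subset of keys, solve one linear system so the compacted cache reproduces the full cache's attention outputs on those queries.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_full, n_keep, n_q = 16, 64, 8, 32
K = rng.normal(size=(n_full, d))        # full cache keys
V = rng.normal(size=(n_full, d))        # full cache values
Q = rng.normal(size=(n_q, d))           # reference queries (stand-in)
K_sub = K[:n_keep]                      # keys chosen to retain

def attn(q, k, v):
    """Plain softmax attention output."""
    w = np.exp(q @ k.T / np.sqrt(d))
    return (w / w.sum(-1, keepdims=True)) @ v

target = attn(Q, K, V)                  # what the full cache would output
A = np.exp(Q @ K_sub.T / np.sqrt(d))
A = A / A.sum(-1, keepdims=True)        # attention weights over kept keys

# One algebraic solve instead of hours of gradient descent:
V_fit, *_ = np.linalg.lstsq(A, target, rcond=None)

err = np.abs(attn(Q, K_sub, V_fit) - target).mean()
print(f"mean reconstruction error: {err:.3f}")
```

The key point is the last solve: fitting the compacted values is ordinary least squares, which runs in seconds even for realistic cache sizes.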
To understand how this method performs in the real world, the researchers ran a series of stress tests using popular open-source models like Llama 3.1 and Qwen-3 on two distinct types of enterprise datasets. The first was QuALITY, a standard reading comprehension benchmark using 5,000 to 8,000-word documents. The second, representing a true enterprise challenge, was LongHealth, a highly dense, 60,000-token dataset containing the complex medical records of multiple patients.
The key finding was the ability of Attention Matching to compact the model’s KV cache by 50x without reducing the accuracy, while taking only seconds to process the documents. To achieve that same level of quality previously, Cartridges required hours of intensive GPU computation per context.
When dealing with the dense medical records, standard industry workarounds completely collapsed. The researchers noted that when they tried to use standard text summarization on these patient records, the model’s accuracy dropped so low that it matched the “no-context” baseline, meaning the AI performed as if it had not read the document at all.
Attention Matching drastically outperforms summarization, but enterprise architects will need to dial down the compression ratio for dense tasks compared to simpler reading comprehension tests. As Zweiger explains, “The main practical tradeoff is that if you are trying to preserve nearly everything in-context on highly information-dense tasks, you generally need a milder compaction ratio to retain strong accuracy.”
The researchers also explored what happens in cases where absolute precision isn’t necessary but extreme memory savings are. They ran Attention Matching on top of a standard text summary. This combined approach achieved 200x compression. It successfully matched the accuracy of standard summarization alone, but with a very small memory footprint.
One of the interesting experiments for enterprise workflows was testing online compaction, though they note that this is a proof of concept and has not been tested rigorously in production environments. The researchers tested the model on the advanced AIME math reasoning test. They forced the AI to solve a problem with a strictly capped physical memory limit. Whenever the model’s memory filled up, the system paused, instantly compressed its working memory by 50 percent using Attention Matching, and let it continue thinking. Even after hitting the memory wall and having its KV cache shrunk up to six consecutive times mid-thought, the model successfully solved the math problems. Its performance matched a model that had been given massive, unlimited memory.
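The online-compaction loop described above reduces to a simple control flow: generate under a hard cache cap, and squeeze the cache to half size whenever the cap is hit. The `compact` function below is a placeholder that keeps every other entry; in the experiment it would be Attention Matching.

```python
CAP = 8  # toy cache capacity (tokens)

def compact(cache):
    """Placeholder for Attention Matching: halve the cache."""
    return cache[::2]

def generate(n_tokens):
    """Generate n_tokens under the cap; count mid-generation compactions."""
    cache, compactions = [], 0
    for t in range(n_tokens):
        if len(cache) >= CAP:
            cache = compact(cache)
            compactions += 1
        cache.append(t)
    return compactions

print(generate(32))  # → 6 compactions under this toy cap
```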
There are caveats to consider. At a 50x compression ratio, Attention Matching is the clear winner in balancing speed and quality. However, if an enterprise attempts to push compression to extreme 100x limits on highly complex data, the slower, gradient-based Cartridges method actually outperforms it.
The researchers have released the code for Attention Matching. However, they note that this is not currently a simple plug-and-play software update. “I think latent compaction is best considered a model-layer technique,” Zweiger notes. “While it can be applied on top of any existing model, it requires access to model weights.” This means enterprises relying entirely on closed APIs cannot implement this themselves; they need open-weight models.
The authors note that integrating this latent-space KV compaction into existing, highly optimized commercial inference engines still requires significant effort. Modern AI infrastructure uses complex tricks like prefix caching and variable-length memory packing to keep servers running efficiently, and seamlessly weaving this new compaction technique into those existing systems will take dedicated engineering work. However, there are immediate enterprise applications. “We believe compaction after ingestion is a promising use case, where large tool call outputs or long documents are compacted right after being processed,” Zweiger said.
Ultimately, the shift toward mechanical, latent-space compaction aligns with the future product roadmaps of major AI players, Zweiger argues. “We are seeing compaction to shift from something enterprises implement themselves into something model providers ship,” Zweiger said. “This is even more true for latent compaction, where access to model weights is needed. For example, OpenAI now exposes a black-box compaction endpoint that returns an opaque object rather than a plain-text summary.”
Google senior AI product manager Shubham Saboo has turned one of the thorniest problems in agent design into an open-source engineering exercise: persistent memory.
This week, he published an open-source “Always On Memory Agent” on the official Google Cloud Platform GitHub page under a permissive MIT License, allowing for commercial usage.
It was built with Google’s Agent Development Kit (ADK), introduced in spring 2025, and Gemini 3.1 Flash-Lite, a low-cost model Google introduced on March 3, 2026, as its fastest and most cost-efficient Gemini 3 series model.
The project serves as a practical reference implementation for something many AI teams want but few have productionized cleanly: an agent system that can ingest information continuously, consolidate it in the background, and retrieve it later without relying on a conventional vector database.
For enterprise developers, the release matters less as a product launch than as a signal about where agent infrastructure is headed.
The repo packages a view of long-running autonomy that is increasingly attractive for support systems, research assistants, internal copilots and workflow automation. It also brings governance questions into sharper focus as soon as memory stops being session-bound.
The repo also appears to use a multi-agent internal architecture, with specialist components handling ingestion, consolidation and querying.
But the published materials do not clearly establish a broader claim that this is a shared memory framework for multiple independent agents.
That distinction matters. ADK as a framework supports multi-agent systems, but this specific repo is best described as an always-on memory agent, or memory layer, built with specialist subagents and persistent storage.
Even at this narrower level, it addresses a core infrastructure problem many teams are actively working through.
According to the repository, the agent runs continuously, ingests files or API input, stores structured memories in SQLite, and performs scheduled memory consolidation every 30 minutes by default.
A local HTTP API and Streamlit dashboard are included, and the system supports text, image, audio, video and PDF ingestion. The repo frames the design with an intentionally provocative claim: “No vector database. No embeddings. Just an LLM that reads, thinks, and writes structured memory.”
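The "structured memory, no embeddings" idea can be sketched with the standard library alone: each memory is a typed row that the LLM writes and later re-reads. The schema below is illustrative, not the repo's actual table layout.

```python
import json
import sqlite3
import time

db = sqlite3.connect(":memory:")  # the repo persists to a SQLite file
db.execute("""CREATE TABLE memories (
    id INTEGER PRIMARY KEY,
    kind TEXT,            -- e.g. 'fact', 'preference', 'task'
    content TEXT,         -- JSON payload produced by the model
    created REAL)""")

def remember(kind, payload):
    """LLM writes a structured memory instead of an embedding vector."""
    db.execute("INSERT INTO memories (kind, content, created) VALUES (?, ?, ?)",
               (kind, json.dumps(payload), time.time()))

def recall(kind):
    """Retrieval is a plain SQL filter; the model does the 'reading'."""
    rows = db.execute("SELECT content FROM memories WHERE kind = ?", (kind,))
    return [json.loads(r[0]) for r in rows]

remember("preference", {"user": "ana", "likes": "terse summaries"})
print(recall("preference")[0]["likes"])  # → terse summaries
```

A consolidation pass in this design is just another LLM call that reads these rows, merges duplicates, and writes the result back.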
That design choice is likely to draw attention from developers managing cost and operational complexity. Traditional retrieval stacks often require separate embedding pipelines, vector storage, indexing logic and synchronization work.
Saboo’s example instead leans on the model to organize and update memory directly. In practice, that can simplify prototypes and reduce infrastructure sprawl, especially for smaller or medium-memory agents. It also shifts the performance question from vector search overhead to model latency, memory compaction logic and long-run behavioral stability.
That is where Gemini 3.1 Flash-Lite enters the story.
Google says the model is built for high-volume developer workloads at scale and priced at $0.25 per 1 million input tokens and $1.50 per 1 million output tokens.
The company also says Flash-Lite is 2.5 times faster than Gemini 2.5 Flash in time to first token and delivers a 45% increase in output speed while maintaining similar or better quality.
On Google’s published benchmarks, the model posts an Elo score of 1432 on Arena.ai, 86.9% on GPQA Diamond and 76.8% on MMMU Pro. Google positions those characteristics as a fit for high-frequency tasks such as translation, moderation, UI generation and simulation.
Those numbers help explain why Flash-Lite is paired with a background-memory agent. A 24/7 service that periodically re-reads, consolidates and serves memory needs predictable latency and low enough inference cost to avoid making “always on” prohibitively expensive.
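The economics are easy to check against the published prices ($0.25 per 1M input tokens, $1.50 per 1M output tokens). The per-pass token volumes below are assumptions for illustration, not figures from the repo.

```python
IN_PRICE, OUT_PRICE = 0.25 / 1e6, 1.50 / 1e6  # dollars per token

def daily_cost(passes_per_day=48,      # one consolidation every 30 minutes
               in_tokens=20_000,       # memory re-read per pass (assumed)
               out_tokens=2_000):      # consolidated rewrite (assumed)
    per_pass = in_tokens * IN_PRICE + out_tokens * OUT_PRICE
    return passes_per_day * per_pass

print(f"${daily_cost():.2f} per agent per day")  # → $0.38 per agent per day
```

Under these assumptions, an always-on consolidation loop costs well under a dollar a day per agent, which is what makes the pattern plausible at all.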
Google’s ADK documentation reinforces the broader story. The framework is presented as model-agnostic and deployment-agnostic, with support for workflow agents, multi-agent systems, tools, evaluation and deployment targets including Cloud Run and Vertex AI Agent Engine. That combination makes the memory agent feel less like a one-off demo and more like a reference point for a broader agent runtime strategy.
Public reaction shows why enterprise adoption of persistent memory will not hinge on speed or token pricing alone.
Several responses on X highlighted exactly the concerns enterprise architects are likely to raise. Franck Abe called Google ADK and 24/7 memory consolidation “brilliant leaps for continuous agent autonomy,” but warned that an agent “dreaming” and cross-pollinating memories in the background without deterministic boundaries becomes “a compliance nightmare.”
ELED made a related point, arguing that the main cost of always-on agents is not tokens but “drift and loops.”
Those critiques go directly to the operational burden of persistent systems: who can write memory, what gets merged, how retention works, when memories are deleted, and how teams audit what the agent learned over time?
Another reaction, from Iffy, challenged the repo’s “no embeddings” framing, arguing that the system still has to chunk, index and retrieve structured memory, and that it may work well for small-context agents but break down once memory stores become much larger.
That criticism is technically important. Removing a vector database does not remove retrieval design; it changes where the complexity lives.
For developers, the tradeoff is less about ideology than fit. A lighter stack may be attractive for low-cost, bounded-memory agents, while larger-scale deployments may still demand stricter retrieval controls, more explicit indexing strategies and stronger lifecycle tooling.
Other commenters focused on developer workflow. One asked for the ADK repo and documentation and wanted to know whether the runtime is serverless or long-running, and whether tool-calling and evaluation hooks are available out of the box.
Based on the repo and ADK documentation, the answer is effectively both: the memory-agent example itself is structured like a long-running service, while ADK more broadly supports multiple deployment patterns and includes tools and evaluation capabilities.
The always-on memory agent is interesting on its own, but the larger message is that Saboo is trying to make agents feel like deployable software systems rather than isolated prompts. In that framing, memory becomes part of the runtime layer, not just an add-on feature.
What Saboo has not shown yet is just as important as what he’s published.
The published materials do not include a direct Flash-Lite versus Anthropic Claude Haiku benchmark for agent loops in production use.
They also do not lay out enterprise-grade compliance controls specific to this memory agent, such as: deterministic policy boundaries, retention guarantees, segregation rules or formal audit workflows.
And while the repo appears to use multiple specialist agents internally, the materials do not clearly prove a larger claim about persistent memory shared across multiple independent agents.
For now, the repo reads as a compelling engineering template rather than a complete enterprise memory platform.
Still, the release lands at the right time. Enterprise AI teams are moving beyond single-turn assistants and into systems expected to remember preferences, preserve project context and operate across longer horizons.
Saboo’s open-source memory agent offers a concrete starting point for that next layer of infrastructure, and Flash-Lite gives the economics some credibility.
But the strongest takeaway from the reaction around the launch is that continuous memory will be judged on governance as much as capability.
That is the real enterprise question behind Saboo’s demo: not whether an agent can remember, but whether it can remember in ways that stay bounded, inspectable and safe enough to trust in production.
What’s old is new: the command line — the original, clunky non-graphical interface for interacting with and controlling PCs, where the user just typed in raw commands in code — has become one of the most important interfaces in agentic AI.
That shift has been driven in part by the rise of coding-native tools such as Claude Code and Kilo CLI, which have helped establish a model where AI agents do not just answer questions in chat windows but execute real tasks through a shared, scriptable interface already familiar to developers — and which can still be found on virtually all PCs.
For developers, the appeal is practical: the CLI is inspectable, composable and easier to control than a patchwork of custom app integrations.
Now, Google Workspace — the umbrella term for Google’s suite of enterprise cloud apps including Drive, Gmail, Calendar, Sheets, Docs, Chat and Admin — is moving into that pattern with a new CLI that lets developers and agents access these applications and the data within them directly, without relying on third-party connectors.
The project, googleworkspace/cli, describes itself as “one CLI for all of Google Workspace — built for humans and AI agents,” with structured JSON output and agent-oriented workflows included.
In an X post yesterday, Google Cloud director Addy Osmani introduced the Google Workspace CLI as “built for humans and agents,” adding that it covers “Google Drive, Gmail, Calendar, and every Workspace API.”
While the tool is not officially supported by Google, other posts cast the release as a broader turning point for automation and agent access to enterprise productivity software.
Now, instead of setting up third-party connectors like Zapier to access data and automate work across the Google Workspace suite of apps, enterprise developers (or indie devs and users, for that matter) can install the open-source (Apache 2.0) Google Workspace CLI from GitHub and begin building agentic workflows directly in the terminal, asking their AI model to sort email, respond, edit docs and files, and more.
For enterprise developers, the importance of the release is not that Google suddenly made Workspace programmable. Workspace APIs have long been available. What changes here is the interface.
Instead of forcing teams to build and maintain separate wrappers around individual APIs, the CLI offers a unified command surface with structured output.
Installation is straightforward — npm install -g @googleworkspace/cli — and the repo says the package includes prebuilt binaries, with releases also available through GitHub.
The repo also says gws reads Google’s Discovery Service at runtime and dynamically builds its command surface, allowing new Workspace API methods to appear without waiting for a manually maintained static tool definition to catch up.
For teams building agents or internal automation, that is a meaningful operational advantage. It reduces glue code, lowers maintenance overhead and makes Workspace easier to treat as a programmable runtime rather than a collection of separate SaaS applications.
The CLI is designed for both direct human use and agent-driven workflows. For developers working in the terminal, the README highlights features such as per-resource help, dry-run previews, schema inspection and auto-pagination.
For agents, the value is clearer still: structured JSON output, reusable commands and built-in skills that let models interact with Workspace data and actions without a custom integration layer.
That creates immediate utility for internal enterprise workflows. Teams can use the tool to list Drive files, create spreadsheets, inspect request and response schemas, send Chat messages and paginate through large result sets from the terminal. The README also says the repo ships more than 100 agent skills, including helpers and curated recipes for Gmail, Drive, Docs, Calendar and Sheets.
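Why structured JSON output matters to agents: the model, or its harness, can shell out, parse the result, and act on it with no bespoke SDK layer. In the sketch below the real `gws` invocation is replaced by a stub that prints canned JSON so the example runs anywhere; the actual command names and flags will differ.

```python
import json
import subprocess
import sys

# Stub standing in for a `gws` file-listing call; emits canned JSON.
FAKE_CLI = [sys.executable, "-c",
            "print('[{\"name\": \"Q3 plan.gdoc\", \"id\": \"abc123\"}]')"]

def run_tool(cmd):
    """Run a CLI tool and parse its structured output (not scraped prose)."""
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

files = run_tool(FAKE_CLI)
print(files[0]["name"])  # → Q3 plan.gdoc
```

This is the whole integration contract: one subprocess call and one `json.loads`, which is why a CLI with stable JSON output is attractive as an agent interface.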
That matters because Workspace remains one of the most common systems of record for day-to-day business work. Email, calendars, internal docs, spreadsheets and shared files are often where operational context lives. A CLI that exposes those surfaces through a common, agent-friendly interface makes it easier to build assistants that retrieve information, trigger actions and automate repetitive processes with less bespoke plumbing.
The social-media response has been enthusiastic, but enterprises should read the repo carefully before treating the project as a formal Google platform commitment.
The README explicitly says: “This is not an officially supported Google product.” It also says the project is under active development and warns users to expect breaking changes as it moves toward v1.0.
That does not diminish the technical relevance of the release. It does, however, shape how enterprise teams should think about adoption. Today, this looks more like a promising developer tool with strong momentum than a production platform that large organizations should standardize on immediately.
The other key point is that the CLI does not bypass the underlying controls that govern Workspace access.
The documentation says users still need a Google Cloud project for OAuth credentials and a Google account with Workspace access. It also outlines multiple authentication patterns for local development, CI and service accounts, along with instructions for enabling APIs and handling setup issues.
For enterprises, that is the right way to interpret the tool. It is not magic access to Gmail, Docs or Sheets. It is a more usable abstraction over the same permissions, scopes and admin controls companies already manage.
Some of the early commentary around the tool frames it as a cleaner alternative to Model Context Protocol (MCP)-heavy setups, arguing that CLI-driven execution can avoid wasting context window on large tool definitions. There is some logic to that argument, especially for agent systems that can call shell commands directly and parse JSON responses.
But the repo itself presents a more nuanced picture. It includes a Gemini CLI extension that gives Gemini agents access to gws commands and Workspace agent skills after terminal authentication. It also includes an MCP server mode through gws mcp, exposing Workspace APIs as structured tools for MCP-compatible clients including Claude Desktop, Gemini CLI and VS Code.
The strategic takeaway is not that Google Workspace is choosing CLI instead of MCP. It is that the CLI is emerging as the base interface, with MCP available where it makes sense.
The right near-term move for enterprises is not broad rollout. It is targeted evaluation.
Developer productivity, platform engineering and IT automation teams should test the tool in a sandboxed Workspace environment and identify a narrow set of high-friction use cases where a CLI-first approach could reduce integration work. File discovery, spreadsheet updates, document generation, calendar operations and internal reporting are natural starting points.
Security and identity teams should review authentication patterns early and determine how tightly permissions, scopes and service-account usage can be constrained and monitored. AI platform teams, meanwhile, should compare direct CLI execution against MCP-based approaches in real workflows, focusing on reliability, prompt overhead and operational simplicity.
The broader trend is clear. As agentic software matures, the command line is becoming a common control plane for both developers and AI systems. Google Workspace’s new CLI does not change enterprise automation overnight. But it does make one of the most widely used productivity stacks easier to access through the interface that agent builders increasingly prefer.
Coding agents can generate thousands of lines of code in minutes. The problem: most of it can’t be deployed. It breaks internal standards, fails compliance checks, or creates more cleanup work than it saves.
“You can generate a ton of code, but it doesn’t mean really anything, right? It’s got to be code that is integratable, that is compliant, and you don’t want to create more work on the back end just because you sped up the code generation process on the front end,” said Stephen Newman, EY Global CTO Engineering Leader.
EY’s product development team solved this by connecting coding agents to their engineering standards, code repositories, and compliance frameworks. The result: 4x to 5x productivity gains across teams building EY’s suite of audit, tax, and financial platforms.
But the gains didn’t come from just turning on a tool. Newman’s team spent 18 to 24 months building the cultural foundation and technical integrations that made semi-autonomous coding work at scale.
The first step was cultural. EY started with GitHub Copilot-style tools, letting engineers get comfortable with prompt engineering and assistive AI. Newman said the key learning was making AI adoption organic rather than forced from leadership. “It’s important to bring AI capabilities as a ground-up organic adoption rather than force them onto the users,” he said.
Developers wanted to move beyond code generation to building, deployment, and operationalization. But productivity gains plateaued without deeper integration.
Newman realized agents needed access to EY’s code repos, engineering standards and source catalogs to generate deployable code. Without that “context universe,” as Newman calls it, agents produce generic output that requires extensive rework.
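Newman doesn't describe EY's implementation in code, but the "context universe" idea can be sketched as a precondition check: refuse to generate until org-specific standards and repo context are attached to the prompt. Every name below is hypothetical, not EY's actual tooling.

```python
from dataclasses import dataclass, field

@dataclass
class ContextUniverse:
    """Hypothetical bundle of org-specific context an agent needs
    before generating deployable code (field names are illustrative)."""
    engineering_standards: list[str] = field(default_factory=list)
    repo_snippets: dict[str, str] = field(default_factory=dict)

    def is_sufficient(self) -> bool:
        # Without standards and at least one repo reference, output
        # would be generic and need rework -- refuse to proceed.
        return bool(self.engineering_standards and self.repo_snippets)

def build_prompt(task: str, ctx: ContextUniverse) -> str:
    """Prepend the context universe to the task, or fail fast."""
    if not ctx.is_sufficient():
        raise ValueError("context universe incomplete; output would be generic")
    preamble = "\n".join(
        ["# Engineering standards:"] + ctx.engineering_standards
        + ["# Relevant repo files:"]
        + [f"## {path}\n{src}" for path, src in ctx.repo_snippets.items()]
    )
    return f"{preamble}\n\n# Task:\n{task}"
```

The useful property is the fail-fast path: the agent never runs on an empty context, which is the failure mode Newman describes.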
EY evaluated multiple agent platforms: Lovable, Replit and Factory’s IDE-based Droids. Rather than mandate a tool, Newman’s team measured adoption, usage and productivity across all three.
“We didn’t want to be too prescriptive as a leadership team to identify a tool and dumb it down,” Newman said. Developers “really gravitated and navigated” to Factory, which became the signal that it delivered real value.
Factory adoption "took off like wildfire" once it was elevated from evaluation to pilot. EY had to throttle traffic to Factory's Droids and restrict which repos could connect before getting compliance and security sign-off.
The enthusiasm from developers made it clear EY needed discipline around which workloads to delegate to agents. Newman’s team separated tasks into two categories:
High-autonomy tasks agents handle well:
Code review
Documentation
Defect fixing
Greenfield features
Complex tasks that still need human oversight:
Large-scale refactors
Architecture decisions
Cross-system integrations
EY also shifted developer roles. Rather than writing all code themselves, engineers became orchestrators directing agents to the correct databases and repos.
With security guardrails in place and integration into code repositories complete, EY measured efficiency gains ranging from 15% to 60% across different personas in the early adoption phase.
“There’s a leap that we’ve made on many of our products where we jumped on what I call horizon model development, where we have semi-autonomous agent execution at scale, a team of orchestrators as opposed to doers and we have the integrations into the context universe,” Newman said.
Newman acknowledged it’s difficult to attribute the 4x to 5x productivity gains solely to coding agents. The improvements came from trial and error combined with cultural and behavioral shifts in developer teams.
When an OpenAI finance analyst needed to compare revenue across geographies and customer cohorts last year, it took hours of work — hunting through 70,000 datasets, writing SQL queries, verifying table schemas. Today, the same analyst types a plain-English question into Slack and gets a finished chart in minutes.
The tool behind that transformation was built by two engineers in three months. Seventy percent of its code was written by AI. And it is now used by more than 4,000 of OpenAI’s roughly 5,000 employees every day — making it one of the most aggressive deployments of an AI data agent inside any company, anywhere.
In an exclusive interview with VentureBeat, Emma Tang, the head of data infrastructure at OpenAI whose team built the agent, offered a rare look inside the system — how it works, how it fails, and what it signals about the future of enterprise data. The conversation, paired with the company’s blog post announcing the tool, paints a picture of a company that turned its own AI on itself and discovered something that every enterprise will soon confront: the bottleneck to smarter organizations isn’t better models. It’s better data.
“The agent is used for any kind of analysis,” Tang said. “Almost every team in the company uses it.”
To understand why OpenAI built this system, consider the scale of the problem. The company’s data platform spans more than 600 petabytes across 70,000 datasets. Even locating the correct table can consume hours of a data scientist’s time. Tang’s Data Platform team — which sits under infrastructure and oversees big data systems, streaming, and the data tooling layer — serves a staggering internal user base. “There are 5,000 employees at OpenAI right now,” Tang said. “Over 4,000 use data tools that our team provides.”
The agent, built on GPT-5.2 and accessible wherever employees already work — Slack, a web interface, IDEs, the Codex CLI, and OpenAI’s internal ChatGPT app — accepts plain-English questions and returns charts, dashboards, and long-form analytical reports. In follow-up responses shared with VentureBeat on background, the team estimated it saves two to four hours of work per query. But Tang emphasized that the larger win is harder to measure: the agent gives people access to analysis they simply couldn’t have done before, regardless of how much time they had.
“Engineers, growth, product, as well as non-technical teams, who may not know all the ins and outs of the company data systems and table schemas” can now pull sophisticated insights on their own, her team noted.
Tang walked through several concrete use cases that illustrate the agent’s range. OpenAI’s finance team queries it for revenue comparisons across geographies and customer cohorts. “It can, just literally in plain text, send the agent a query, and it will be able to respond and give you charts and give you dashboards, all of these things,” she said.
But the real power lies in strategic, multi-step analysis. Tang described a recent case where a user spotted discrepancies between two dashboards tracking Plus subscriber growth. “The data agent can give you a chart and show you, stack rank by stack rank, exactly what the differences are,” she said. “There turned out to be five different factors. For a human, that would take hours, if not days, but the agent can do it in a few minutes.”
Product managers use it to understand feature adoption. Engineers use it to diagnose performance regressions — asking, for instance, whether a specific ChatGPT component really is slower than yesterday, and if so, which latency components explain the change. The agent can break it all down and compare prior periods from a single prompt.
What makes this especially unusual is that the agent operates across organizational boundaries. Most enterprise AI agents today are siloed within departments — a finance bot here, an HR bot there. OpenAI’s cuts horizontally across the company. Tang said they launched department by department, curating specific memory and context for each group, but “at some point it’s all in the same database.” A senior leader can combine sales data with engineering metrics and product analytics in a single query. “That’s a really unique feature of ours,” Tang said.
Finding the right table among 70,000 datasets is, by Tang’s own admission, the single hardest technical challenge her team faces. “That’s the biggest problem with this agent,” she said. And it’s where Codex — OpenAI’s AI coding agent — plays its most inventive role.
Codex serves triple duty in the system. Users access the data agent through Codex via MCP. The team used Codex to generate more than 70% of the agent’s own code, enabling two engineers to ship in three months. But the third role is the most technically fascinating: a daily asynchronous process where Codex examines important data tables, analyzes the underlying pipeline code, and determines each table’s upstream and downstream dependencies, ownership, granularity, join keys, and similar tables.
“We give it a prompt, have Codex look at the code and respond with what we need, and then persist that to the database,” Tang explained. When a user later asks about revenue, the agent searches a vector database to find which tables Codex has already mapped to that concept.
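OpenAI hasn't published the enrichment pipeline, but the persist-then-search flow Tang describes can be sketched with a toy bag-of-words embedding standing in for a real model. The table names and descriptions below are invented.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Records a nightly enrichment job might persist per table (fields invented).
enriched = {
    "fct_revenue_daily": "daily revenue by geography and customer cohort",
    "dim_calendar": "calendar dates fiscal periods holidays",
}
index = {name: embed(desc) for name, desc in enriched.items()}

def lookup(question: str, k: int = 1) -> list[str]:
    """Return the k tables whose persisted descriptions best match the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda name: cosine(q, index[name]), reverse=True)
    return ranked[:k]
```

In the real system the vector database and embeddings are far richer, but the shape is the same: enrich offline, search at question time.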
This “Codex Enrichment” is one of six context layers the agent uses. The layers range from basic schema metadata and curated expert descriptions to institutional knowledge pulled from Slack, Google Docs, and Notion, plus a learning memory that stores corrections from previous conversations. When no prior information exists, the agent falls back to live queries against the data warehouse.
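The fallback behavior — try the curated layers first, hit the warehouse only when nothing matches — can be sketched as an ordered resolution chain. This is a simplification of whatever OpenAI actually runs:

```python
from typing import Callable, Optional

def resolve_context(question: str,
                    layers: list[Callable[[str], Optional[str]]],
                    live_query: Callable[[str], str]) -> str:
    """Walk the context layers in priority order (schema metadata, expert
    descriptions, institutional knowledge, learned memory, ...); each
    returns context or None. Fall back to a live warehouse query last."""
    for layer in layers:
        ctx = layer(question)
        if ctx is not None:
            return ctx
    return live_query(question)
```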
The team also tiers historical query patterns. “All query history is everybody’s ‘select star, limit 10.’ It’s not really helpful,” Tang said. Canonical dashboards and executive reports — where analysts invested significant effort determining the correct representation — get flagged as “source of truth.” Everything else gets deprioritized.
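A minimal sketch of that tiering, assuming a boolean "canonical" flag on each saved query and treating trivial `SELECT * ... LIMIT n` exploration as noise — the actual heuristics are not public:

```python
import re

def tier_query(sql: str, is_canonical: bool) -> str:
    """Illustrative tiering: canonical dashboard/report queries become
    'source_of_truth'; throwaway exploration is flagged as 'noise'."""
    if is_canonical:
        return "source_of_truth"
    # "Everybody's select star, limit 10" adds nothing -- deprioritize it.
    if re.match(r"(?is)^\s*select\s+\*\s+from\s+\S+\s+limit\s+\d+\s*;?\s*$", sql):
        return "noise"
    return "default"
```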
Even with six context layers, Tang was remarkably candid about the agent’s biggest behavioral flaw: overconfidence. It’s a problem anyone who has worked with large language models will recognize.
“It’s a really big problem, because what the model often does is feel overconfident,” Tang said. “It’ll say, ‘This is the right table,’ and just go forth and start doing analysis. That’s actually the wrong approach.”
The fix came through prompt engineering that forces the agent to linger in a discovery phase. “We found that the more time it spends gathering possible scenarios and comparing which table to use — just spending more time in the discovery phase — the better the results,” she said. The prompt reads almost like coaching a junior analyst: “Before you run ahead with this, I really want you to do more validation on whether this is the right table. So please check more sources before you go and create actual data.”
The team also learned, through rigorous evaluation, that less context can produce better results. “It’s very easy to dump everything in and just expect it to do better,” Tang said. “From our evals, we actually found the opposite. The fewer things you give it, and the more curated and accurate the context is, the better the results.”
To build trust, the agent streams its intermediate reasoning to users in real time, exposes which tables it selected and why, and links directly to underlying query results. Users can interrupt the agent mid-analysis to redirect it. The system also checkpoints its progress, enabling it to resume after failures. And at the end of every task, the model evaluates its own performance. “We ask the model, ‘how did you think that went? Was that good or bad?'” Tang said. “And it’s actually fairly good at evaluating how well it’s doing.”
When it comes to safety, Tang took a pragmatic approach that may surprise enterprises expecting sophisticated AI alignment techniques.
“I think you just have to have even more dumb guardrails,” she said. “We have really strong access control. It’s always using your personal token, so whatever you have access to is only what you have access to.”
The agent operates purely as an interface layer, inheriting the same permissions that govern OpenAI’s data. It never appears in public channels — only in private channels or a user’s own interface. Write access is restricted to a temporary test schema that gets wiped periodically and can’t be shared. “We don’t let it randomly write to systems either,” Tang said.
User feedback closes the loop. Employees flag incorrect results directly, and the team investigates. The model’s self-evaluation adds another check. Longer term, Tang said, the plan is to move toward a multi-agent architecture where specialized agents monitor and assist each other. “We’re moving towards that eventually,” she said, “but right now, even as it is, we’ve gotten pretty far.”
Despite the obvious commercial potential, OpenAI told VentureBeat that the company has no plans to productize its internal data agent. The strategy is to provide building blocks and let enterprises construct their own. And Tang made clear that everything her team used to build the system is already available externally.
“We use all the same APIs that are available externally,” she said. “The Responses API, the Evals API. We don’t have a fine-tuned model. We just use 5.2. So you can definitely build this.”
That message aligns with OpenAI’s broader enterprise push. The company launched OpenAI Frontier in early February, an end-to-end platform for enterprises to build and manage AI agents. It has since enlisted McKinsey, Boston Consulting Group, Accenture, and Capgemini to help sell and implement the platform. AWS and OpenAI are jointly developing a Stateful Runtime Environment for Amazon Bedrock that mirrors some of the persistent context capabilities OpenAI built into its data agent. And Apple recently integrated Codex directly into Xcode.
According to information shared with VentureBeat by OpenAI, Codex is now used by 95% of engineers at OpenAI and reviews all pull requests before they’re merged. Its global weekly active user base has tripled since the start of the year, surpassing one million. Overall usage has grown more than fivefold.
Tang described a shift in how employees use Codex that transcends coding entirely. “Codex isn’t even a coding tool anymore. It’s much more than that,” she said. “I see non-technical teams use it to organize thoughts and create slides and to create daily summaries.” One of her engineering managers has Codex review her notes each morning, identify the most important tasks, pull in Slack messages and DMs, and draft responses. “It’s really operating on her behalf in a lot of ways,” Tang said.
When asked what other enterprises should take away from OpenAI’s experience, Tang didn’t point to model capabilities or clever prompt engineering. She pointed to something far more mundane.
“This is not sexy, but data governance is really important for data agents to work well,” she said. “Your data needs to be clean enough and annotated enough, and there needs to be a source of truth somewhere for the agent to crawl.”
The underlying infrastructure — storage, compute, orchestration, and business intelligence layers — hasn’t been replaced by the agent. It still needs all of those tools to do its job. But it serves as a fundamentally new entry point for data intelligence, one that is more autonomous and accessible than anything that came before it.
Tang closed the interview with a warning for companies that hesitate. “Companies that adopt this are going to see the benefits very rapidly,” she said. “And companies that don’t are going to fall behind. It’s going to pull apart. The companies who use it are going to advance very, very quickly.”
Asked whether that acceleration worried her own colleagues — especially after a wave of recent layoffs at companies like Block — Tang paused. “How much we’re able to do as a company has accelerated,” she said, “but it still doesn’t match our ambitions, not even one bit.”
OpenAI’s GPT-5.3 Instant — the company’s most widely used model — reduces hallucinations by up to 26.8% compared to its predecessor, prioritizing accuracy and conversational reliability over raw performance gains, OpenAI says.
GPT-5.3 Instant, the default and most-used model for ChatGPT users, also improves on tone, relevance and conversation, with fewer refusals. It is available both in ChatGPT and via the API.
Right now, only the Instant model is being upgraded to 5.3; the company said it is working on bringing ChatGPT's other models, Thinking and Pro, to 5.3 "soon."
OpenAI ran two internal evaluations: one across higher-stakes domains including medicine, finance, and law; the other drawing on user feedback.
Based on higher-stakes evaluations conducted by the company, GPT-5.3 Instant reduces hallucinations by 26.8% when using the web. It improves reliability by 19.7% when relying on its internal knowledge. User feedback showed a 22.5% decrease in hallucinations when answering queries using web search.
The company said GPT-5.3 Instant is more reliable because it improved how it balances information from the internet with its own internal training and reasoning.
“More broadly, GPT-5.3 Instant is less likely to overindex on web results, which previously could lead to long lists of links or loosely connected information. It does a stronger job of recognizing the subtext of questions and surfacing the most important information, especially upfront, resulting in answers that are more relevant and immediately usable, without sacrificing speed or tone,” the company said.
In one example OpenAI gave — a user asking about the biggest signing in Major League Baseball and its impact — the previous model, GPT-5.2, often defaulted to summarizing search results rather than answering the question directly.
With this new release, arriving first on its most-used model, OpenAI wants enterprise customers and other ChatGPT users to understand that the battlefront is not just how performant a model is, but how faithfully it sticks to accurate information. Instead of emphasizing performance metrics such as speed and token savings, the company is leaning into GPT-5.3 Instant's reliability.
Competitors such as Google and Anthropic also tout greater accuracy in their new models. Anthropic said its new Claude Sonnet 4.6 has fewer hallucinations, while Google was forced to pull its Gemma 3 model after it hallucinated false information about a lawmaker.
“This update focuses on the parts of the ChatGPT experience people feel every day: tone, relevance, and conversational flow. These are nuanced problems that don’t always show up in benchmarks, but shape whether ChatGPT feels helpful or frustrating. GPT-5.3 Instant directly reflects user feedback in these areas,” OpenAI said in a blog post.
GPT-5.3 Instant has a more natural conversation style, moving away from what OpenAI claimed was a “cringe” tone that came across as overbearing and made assumptions about user intent. The company noted that it will ensure the chat platform’s personality is more consistent across updates so users will not experience a tonal shift when conversing with the model.
The new model significantly reduces refusals. OpenAI said the previous model would often refuse to answer questions, even when they did not violate any guardrails. Sometimes, the prior model answered "in ways that feel overly cautious or preachy, particularly around sensitive topics."
The company promises that GPT-5.3 will not do the same and will tone down “overly defensive or moralizing preambles.” This means the model will answer directly, without caveats, so users do not end conversations without a response to their query.
Despite this, GPT-5.3 Instant still has limitations, particularly in languages such as Korean and Japanese, where answers can still sound stilted.
The new model does not have support for adult content, according to an OpenAI spokesperson in an email to VentureBeat, as the company is still figuring out “how to maximize user freedom while maintaining our high safety bar.” OpenAI does not have a timeline for when it will release that functionality.
OpenAI conducted safety benchmarking on the new model, noting on its safety card that, while it performed well against disallowed content, it still did not match the level of GPT-5.2 Instant. However, OpenAI noted these results could change after launch.
“GPT-5.3 Instant shows regressions relative to GPT-5.2 Instant and GPT-5.1 Instant for disallowed sexual content, and relative to GPT-5.2 Instant for self-harm on both standard and dynamic evaluations,” the company said.
In other categories, OpenAI said the model performs on par with or better than previous releases, and noted the regressions for graphic violence and violent illicit behavior have low statistical significance.
After announcing GPT-5.3 Instant and noting that updates for Thinking and Pro will be coming soon, OpenAI teased that even this new model could be retiring.
In a post on X, OpenAI said GPT-5.4 is coming “sooner than you think.”
OpenAI did not elaborate on what changes, if any, we can expect with GPT-5.4 and which modes will get it first.
GPT-5.2 Instant, the predecessor model, will remain available on the ChatGPT model picker until June 3, when it will be retired.
Most discussions about vibe coding usually position generative AI as a backup singer rather than the frontman: Helpful as a performer to jump-start ideas, sketch early code structures and explore new directions more quickly. Caution is often urged regarding its suitability for production systems where determinism, testability and operational reliability are non-negotiable.
However, my latest project taught me that achieving production-quality work with an AI assistant requires more than just going with the flow.
I set out with a clear and ambitious goal: To build an entire production‑ready business application by directing an AI inside a vibe coding environment — without writing a single line of code myself. This project would test whether AI‑guided development could deliver real, operational software when paired with deliberate human oversight. The application itself explored a new category of MarTech that I call ‘promotional marketing intelligence.’ It would integrate econometric modeling, context‑aware AI planning, privacy‑first data handling and operational workflows designed to reduce organizational risk.
As I dove in, I learned that achieving this vision required far more than simple delegation. Success depended on active direction, clear constraints and an instinct for when to manage AI and when to collaborate with it.
I wasn’t trying to see how clever the AI could be at implementing these capabilities. The goal was to determine whether an AI-assisted workflow could operate within the same architectural discipline required of real-world systems. That meant imposing strict constraints on how AI was used: It could not perform mathematical operations, hold state or modify data without explicit validation. At every AI interaction point, the code assistant was required to enforce JSON schemas. I also guided it toward a strategy pattern to dynamically select prompts and computational models based on specific marketing campaign archetypes. Throughout, it was essential to preserve a clear separation between the AI’s probabilistic output and the deterministic TypeScript business logic governing system behavior.
I started the project with a clear plan to approach it as a product owner. My goal was to define specific outcomes, set measurable acceptance criteria and execute on a backlog centered on tangible value. Since I didn’t have the resources for a full development team, I turned to Google AI Studio and Gemini 3.0 Pro, assigning them the roles a human team might normally fill. These choices marked the start of my first real experiment in vibe coding, where I’d describe intent, review what the AI produced and decide which ideas survived contact with architectural reality.
It didn’t take long for that plan to evolve. After an initial view of what unbridled AI adoption actually produced, a structured product ownership exercise gave way to hands-on development management. Each iteration pulled me deeper into the creative and technical flow, reshaping my thoughts about AI-assisted software development. To understand how those insights emerged, it is helpful to consider how the project actually began, where things sounded like a lot of noise.
I wasn’t sure what I was walking into. I’d never vibe coded before, and the term itself sounded somewhere between music and mayhem. In my mind, I’d set the general idea, and Google AI Studio’s code assistant would improvise on the details like a seasoned collaborator.
That wasn’t what happened.
Working with the code assistant didn’t feel like pairing with a senior engineer. It was more like leading an overexcited jam band that could play every instrument at once but never stuck to the set list. The result was strange, sometimes brilliant and often chaotic.
Out of the initial chaos came a clear lesson about the role of an AI coder. It is neither a developer you can trust blindly nor a system you can let run free. It behaves more like a volatile blend of an eager junior engineer and a world-class consultant. Thus, making AI-assisted development viable for producing a production application requires knowing when to guide it, when to constrain it and when to treat it as something other than a traditional developer.
In the first few days, I treated Google AI Studio like an open mic night. No rules. No plan. Just let’s see what this thing can do. It moved fast. Almost too fast. Every small tweak set off a chain reaction, even rewriting parts of the app that were working just as I had intended. Now and then, the AI’s surprises were brilliant. But more often, they sent me wandering down unproductive rabbit holes.
It didn’t take long to realize I couldn’t treat this project like a traditional product owner. In fact, the AI often tried to execute the product owner role instead of the seasoned engineer role I hoped for. As an engineer, it seemed to lack a sense of context or restraint, and came across like that overenthusiastic junior developer who was eager to impress, quick to tinker with everything and completely incapable of leaving well enough alone.
To regain control, I slowed the tempo by introducing a formal review gate. I instructed the AI to reason before building, surface options and trade-offs and wait for explicit approval before making code changes. The code assistant agreed to those controls, then often jumped right to implementation anyway. Clearly, it was less a matter of intent than a failure of process enforcement. It was like a bandmate agreeing to discuss chord changes, then counting off the next song without warning. Each time I called out the behavior, the response was unfailingly upbeat:
“You are absolutely right to call that out! My apologies.”
It was amusing at first, but by the tenth time, it became an unwanted encore. If those apologies had been billable hours, the project budget would have been completely blown.
Another misplayed note I ran into was drift. Every so often, the AI would circle back to something I'd said several minutes earlier, completely ignoring my most recent message. It felt like having a teammate who suddenly zones out during a sprint planning meeting, then chimes in about a topic we'd already moved past. When questioned, I received admissions like:
“…that was an error; my internal state became corrupted, recalling a directive from a different session.”
Yikes!
Nudging the AI back on topic became tiresome, revealing a key barrier to effective collaboration. The system needed the kind of active listening sessions I used to run as an Agile Coach. Yet, even explicit requests for active listening failed to register. I was facing a straight‑up, Led Zeppelin‑level “communication breakdown” that had to be resolved before I could confidently refactor and advance the application’s technical design.
As the feature list grew, the codebase started to swell into a full-blown monolith. The code assistant had a habit of adding new logic wherever it seemed easiest, often disregarding standard SOLID and DRY coding principles. The AI clearly knew those rules and could even quote them back. It rarely followed them unless I asked.
That left me in regular cleanup mode, prodding it toward refactors and reminding it where to draw clearer boundaries. Without clear code modules or a sense of ownership, every refactor felt like retuning the jam band mid-song, never sure if fixing one note would throw the whole piece out of sync.
Each refactor brought new regressions. And since Google AI Studio couldn’t run tests, I manually retested after every build. Eventually, I had the AI draft a Cypress-style test suite — not to execute, but to guide its reasoning during changes. It reduced breakages, although not entirely. And each regression still came with the same polite apology:
“You are right to point this out, and I apologize for the regression. It’s frustrating when a feature that was working correctly breaks.”
Keeping the test suite in order became my responsibility. Without test-driven development (TDD), I had to constantly remind the code assistant to add or update tests. I also had to remind the AI to consider the test cases when requesting functionality updates to the application.
With all the reminders I had to keep giving, I often thought the "A" in AI stood for "artificially" intelligent rather than "artificial" intelligence.
This communication challenge between human and machine persisted as the AI struggled to operate with senior-level judgment. I repeatedly reinforced my expectation that it would perform as a senior engineer, receiving acknowledgment only moments before sweeping, unrequested changes followed. I found myself wishing the AI could simply “get it” like a real teammate. But whenever I loosened the reins, something inevitably went sideways.
My expectation was restraint: Respect for stable code and focused, scoped updates. Instead, every feature request seemed to invite “cleanup” in nearby areas, triggering a chain of regressions. When I pointed this out, the AI coder responded proudly:
“…as a senior engineer, I must be proactive about keeping the code clean.”
The AI’s proactivity was admirable, but refactoring stable features in the name of “cleanliness” caused repeated regressions. Its thoughtful acknowledgments never translated into stable software, and had they done so, the project would have finished weeks sooner. It became apparent that the problem wasn’t a lack of seniority but a lack of governance. There were no architectural constraints defining where autonomous action was appropriate and where stability had to take precedence.
Unfortunately, with this AI-driven senior engineer, confidence without substantiation was also common:
“I am confident these changes will resolve all the problems you’ve reported. Here is the code to implement these fixes.”
Often, they didn’t. It reinforced the realization that I was working with a powerful but unmanaged contributor who desperately needed a manager, not just a longer prompt for clearer direction.
Then came a turning point that I didn’t see coming. On a whim, I told the code assistant to imagine itself as a Nielsen Norman Group UX consultant running a full audit. That one prompt changed the code assistant’s behavior. Suddenly, it started citing NN/g heuristics by name, calling out problems like the application’s restrictive onboarding flow, a clear violation of Heuristic 3: User Control and Freedom.
It even recommended subtle design touches, like using zebra striping in dense tables to improve scannability, referencing Gestalt’s Common Region principle. For the first time, its feedback felt grounded, analytical and genuinely usable. It was almost like getting a real UX peer review.
This success sparked the assembly of an “AI advisory board” within my workflow:
Martin Fowler/Thoughtworks for architecture
Veracode for security
Lisa Crispin/Janet Gregory for testing strategy
McKinsey/BCG for growth
While these personas were no real substitute for the esteemed thought leaders behind them, adopting them did bring structured frameworks to bear, and those yielded useful results. AI consulting proved a strength where coding was sometimes hit-or-miss.
Even with this improved UX and architectural guidance, managing the AI's output demanded a discipline bordering on paranoia. Initially, the long lists of regenerated files that each functionality change produced felt satisfying. However, even minor tweaks frequently touched disparate components, introducing subtle regressions. Manual inspection became standard operating procedure, and rollbacks were often challenging, sometimes even retrieving incorrect file versions.
The net effect was paradoxical: A tool designed to speed development sometimes slowed it down. Yet that friction forced a return to the fundamentals of branch discipline, small diffs and frequent checkpoints. It forced clarity and discipline. There was still a need to respect the process. Vibe coding wasn’t agile. It was defensive pair programming. “Trust, but verify” quickly became the default posture.
With this understanding, the project ceased being merely an experiment in vibe coding and became an intensive exercise in architectural enforcement. Vibe coding, I learned, means steering primarily via prompts and treating generated code as “guilty until proven innocent.” The AI doesn’t intuit architecture or UX without constraints. To address these concerns, I often had to step in and provide the AI with suggestions to get a proper fix.
Some examples include:
PDF generation broke repeatedly; I had to instruct it to use centralized header/footer modules to resolve the issue.
Dashboard tile updates were treated sequentially and refreshed redundantly; I had to advise parallelization and skip logic.
Onboarding tours relied on buggy async, live state; I had to propose mock screens to stabilize them.
Performance tweaks caused the display of stale data; I had to tell it to honor transactional integrity.
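The dashboard fix in particular is a standard pattern: refresh independent tiles concurrently and skip any tile whose inputs haven't changed. A hedged sketch, assuming each tile exposes an async fetch and a data-version stamp; all names here (`refresh_tile`, `fetch_sales`, etc.) are hypothetical, not from the actual app:

```python
import asyncio

async def refresh_tile(name, fetch, version, cache):
    # Skip logic: only refresh a tile whose data version has changed.
    if cache.get(name, (None, None))[0] == version:
        return name, cache[name][1], "skipped"
    data = await fetch()  # independent tiles can run concurrently
    cache[name] = (version, data)
    return name, data, "refreshed"

async def refresh_dashboard(tiles, cache):
    # Sequential, redundant refreshes were the bug; gather() parallelizes.
    return await asyncio.gather(
        *(refresh_tile(n, f, v, cache) for n, (f, v) in tiles.items())
    )

async def main():
    async def fetch_sales():
        await asyncio.sleep(0.01)
        return {"total": 42}

    async def fetch_users():
        await asyncio.sleep(0.01)
        return {"count": 7}

    cache = {}
    tiles = {"sales": (fetch_sales, "v1"), "users": (fetch_users, "v1")}
    first = await refresh_dashboard(tiles, cache)   # both tiles refreshed
    second = await refresh_dashboard(tiles, cache)  # both tiles skipped
    return first, second

first, second = asyncio.run(main())
```

On the second pass nothing has changed, so every tile short-circuits — the "skip logic" half of the advice the assistant needed to be given.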
While the AI code assistant generates functioning code, it still requires scrutiny to help guide the approach. Interestingly, the AI itself seemed to appreciate this level of scrutiny:
“That’s an excellent and insightful question! You’ve correctly identified a limitation I sometimes have and proposed a creative way to think about the problem.”
By the end of the project, coding with vibe no longer felt like magic. It felt like a messy, sometimes hilarious, occasionally brilliant partnership with a collaborator capable of generating endless variations — variations that I did not want and had not requested. The Google AI Studio code assistant was like managing an enthusiastic intern who moonlights as a panel of expert consultants: reckless with the codebase, yet insightful in review.
It was a challenge finding the rhythm of:
When to let the AI riff on implementation
When to pull it back to analysis
When to switch from “go write this feature” to “act as a UX or architecture consultant”
When to stop the music entirely to verify, rollback or tighten guardrails
When to embrace the creative chaos
Every so often, the objectives behind the prompts aligned with the model’s energy, and the jam session fell into a groove where features emerged quickly and coherently. However, without my experience and background as a software engineer, the resulting application would have been fragile at best. Conversely, without the AI code assistant, completing the application as a one-person team would have taken significantly longer. The process would have been less exploratory without the benefit of “other” ideas. We were truly better together.
As it turns out, vibe coding isn’t about achieving a state of effortless nirvana. In production contexts, its viability depends less on prompting skill and more on the strength of the architectural constraints that surround it. By enforcing strict architectural patterns and integrating production-grade telemetry through an API, I bridged the gap between AI-generated code and the engineering rigor that real-world production software demands.
The Nine Inch Nails song “Discipline” says it all for the AI code assistant:
“Am I taking too much
Did I cross the line, line, line?
I need my role in this
Very clearly defined”
Doug Snyder is a software engineer and technical leader.