admin.codes » Category » DataDecisionMakers

Your developers are already running AI locally: Why on-device inference is the CISO’s new blind spot

For the last 18 months, the CISO playbook for generative AI has been relatively simple: Control the browser.

Security teams tightened cloud access security broker (CASB) policies, blocked or monitored traffic to well-known AI endpoints, and routed usage through sanctioned gateways. The operating model was clear: If sensitive data leaves the network for an external API call, we can observe it, log it, and stop it. But that model is starting to break.

A quiet hardware shift is pushing large language model (LLM) usage off the network and onto the endpoint. Call it Shadow AI 2.0, or the “bring your own model” (BYOM) era: Employees running capable models locally on laptops, offline, with no API calls and no obvious network signature. The governance conversation is still framed as “data exfiltration to the cloud,” but the more immediate enterprise risk is increasingly “unvetted inference inside the device.”

When inference happens locally, traditional data loss prevention (DLP) doesn’t see the interaction. And when security can’t see it, it can’t manage it.

Why local inference is suddenly practical

Two years ago, running a useful LLM on a work laptop was a niche stunt. Today, it’s routine for technical teams.

Three things converged:

Consumer-grade accelerators got serious: A MacBook Pro with 64GB unified memory can often run quantized 70B-class models at usable speeds (with practical limits on context length). What once required multi-GPU servers is now feasible on a high-end laptop for many real workflows.
Quantization went mainstream: It’s now easy to compress models into smaller, faster formats that fit within laptop memory often with acceptable quality tradeoffs for many tasks.
Distribution is frictionless: Open-weight models are a single command away, and the tooling ecosystem makes “download → run → chat” trivial.

The result: An engineer can pull down a multi‑GB model artifact, turn off Wi‑Fi, and run sensitive workflows locally, source code review, document summarization, drafting customer communications, even exploratory analysis over regulated datasets. No outbound packets, no proxy logs, no cloud audit trail.

From a network-security perspective, that activity can look indistinguishable from “nothing happened”.

The risk isn’t only data leaving the company anymore

If the data isn’t leaving the laptop, why should a CISO care?

Because the dominant risks shift from exfiltration to integrity, provenance, and compliance. In practice, local inference creates three classes of blind spots that most enterprises have not operationalized.

1. Code and decision contamination (integrity risk)

Local models are often adopted because they’re fast, private, and “no approval required.” The downside is that they’re frequently unvetted for the enterprise environment.

A common scenario: A senior developer downloads a community-tuned coding model because it benchmarks well. They paste in internal auth logic, payment flows, or infrastructure scripts to “clean it up.” The model returns output that looks competent, compiles, and passes unit tests, but subtly degrades security posture (weak input validation, unsafe defaults, brittle concurrency changes, dependency choices that aren’t allowed internally). The engineer commits the change.

If that interaction happened offline, you may have no record that AI influenced the code path at all. And when you later do incident response, you’ll be investigating the symptom (a vulnerability) without visibility into a key cause (uncontrolled model usage).

2. Licensing and IP exposure (compliance risk)

Many high-performing models ship with licenses that include restrictions on commercial use, attribution requirements, field-of-use limits, or obligations that can be incompatible with proprietary product development. When employees run models locally, that usage can bypass the organization’s normal procurement and legal review process.

If a team uses a non-commercial model to generate production code, documentation, or product behavior, the company can inherit risk that shows up later during M&A diligence, customer security reviews, or litigation. The hard part is not just the license terms, it’s the lack of inventory and traceability. Without a governed model hub or usage record, you may not be able to prove what was used where.

3. Model supply chain exposure (provenance risk)

Local inference also changes the software supply chain problem. Endpoints begin accumulating large model artifacts and the toolchains around them: ownloaders, converters, runtimes, plugins, UI shells, and Python packages.

There is a critical technical nuance here: The file format matters. While newer formats like Safetensors are designed to prevent arbitrary code execution, older Pickle-based PyTorch files can execute malicious payloads simply when loaded. If your developers are grabbing unvetted checkpoints from Hugging Face or other repositories, they aren’t just downloading data — they could be downloading an exploit.

Security teams have spent decades learning to treat unknown executables as hostile. BYOM requires extending that mindset to model artifacts and the surrounding runtime stack. The biggest organizational gap today is that most companies have no equivalent of a software bill of materials for models: Provenance, hashes, allowed sources, scanning, and lifecycle management.

Mitigating BYOM: treat model weights like software artifacts

You can’t solve local inference by blocking URLs. You need endpoint-aware controls and a developer experience that makes the safe path the easy path.

Here are three practical ways:

1. Move governance down to the endpoint

Network DLP and CASB still matter for cloud usage, but they’re not sufficient for BYOM. Start treating local model usage as an endpoint governance problem by looking for specific signals:

Inventory and detection: Scan for high-fidelity indicators like .gguf files larger than 2GB, processes like llama.cpp or Ollama, and local listeners on common default port 11434.
Process and runtime awareness: Monitor for repeated high GPU/NPU (neural processing unit) utilization from unapproved runtimes or unknown local inference servers.
Device policy: Use mobile device management (MDM) and endpoint detection and response (EDR) policies to control installation of unapproved runtimes and enforce baseline hardening on engineering devices. The point isn’t to punish experimentation. It’s to regain visibility.

2. Provide a paved road: An internal, curated model hub

Shadow AI is often an outcome of friction. Approved tools are too restrictive, too generic, or too slow to approve. A better approach is to offer a curated internal catalog that includes:

Approved models for common tasks (coding, summarization, classification)
Verified licenses and usage guidance
Pinned versions with hashes (prioritizing safer formats like Safetensors)
Clear documentation for safe local usage, including where sensitive data is and isn’t allowed. If you want developers to stop scavenging, give them something better.

3. Update policy language: “Cloud services” isn’t enough anymore

Most acceptable use policies talk about SaaS and cloud tools. BYOM requires policy that explicitly covers:

Downloading and running model artifacts on corporate endpoints
Acceptable sources
License compliance requirements
Rules for using models with sensitive data
Retention and logging expectations for local inference tools This doesn’t need to be heavy-handed. It needs to be unambiguous.

The perimeter is shifting back to the device

For a decade we moved security controls “up” into the cloud. Local inference is pulling a meaningful slice of AI activity back “down” to the endpoint.

5 signals shadow AI has moved to endpoints:

Large model artifacts: Unexplained storage consumption by .gguf or .pt files.
Local inference servers: Processes listening on ports like 11434 (Ollama).
GPU utilization patterns: Spikes in GPU usage while offline or disconnected from VPN.
Lack of model inventory: Inability to map code outputs to specific model versions.
License ambiguity: Presence of “non-commercial” model weights in production builds.

Shadow AI 2.0 isn’t a hypothetical future, it’s a predictable consequence of fast hardware, easy distribution, and developer demand. CISOs who focus only on network controls will miss what’s happening on the silicon sitting right on employees’ desks.

The next phase of AI governance is less about blocking websites and more about controlling artifacts, provenance, and policy at the endpoint, without killing productivity.

Jayachander Reddy Kandakatla is a senior MLOps engineer.

DataDecisionMakers, Security

Claude, OpenClaw and the new reality: AI agents are here — and so is the chaos

The age of agentic AI is upon us — whether we like it or not. What started with an innocent question-answer banter with ChatGPT back in 2022 has become an existential debate on job security and the rise of the machines.

More recently, fears of reaching artificial general intelligence (AGI) have become more real with the advent of powerful autonomous agents like Claude Cowork and OpenClaw. Having played with these tools for some time, here is a comparison.

First, we have OpenClaw (formerly known as Moltbot and Clawdbot). Surpassing 150,000 GitHub stars in days, OpenClaw is already being deployed on local machines with deep system access. This is like a robot “maid” (Irona for Richie Rich fans, for instance) that you give the keys to your house. It’s supposed to clean it, and you give it the necessary autonomy to take actions and manage your belongings (files and data) as it pleases. The whole purpose is to perform the task at hand — inbox triaging, auto-replies, content curation, travel planning, and more.

Next we have Google’s Antigravity, a coding agent with an IDE that accelerates the path from prompt to production. You can interactively create complete application projects and modify specific details over individual prompts. This is like having a junior developer that can not only code, but build, test, integrate, and fix issues. In the realworld, this is like hiring an electrician: They are really good at a specific job and you only need to give them access to a specific item (your electric junction box).

Finally, we have the mighty Claude. The release of Anthropic’s Cowork, which featured AI agents for automating legal tasks like contract review and NDA triage, caused a sharp sell-off in legal-tech and software-as-a-service (SaaS) stocks (referred to as the SaaSpocalypse). Claude has anyway been the go-to chatbot; now with Cowork, it has domain knowledge for specific industries like legal and finance. This is like hiring an accountant. They know the domain inside-out and can complete taxes and manage invoices. Users provide specific access to highly-sensitive financial details.

Making these tools work for you

The key to making these tools more impactful is giving them more power, but that increases the risk of misuse. Users must trust providers like Anthorpic and Google to ensure that agent prompts will not cause harm, leak data, or provide unfair (illegal) advantage to certain vendors. OpenClaw is open-source, which complicates things, as there is no central governing authority.

While these technological advancements are amazing and meant for the greater good, all it takes is one or two adverse events to cause panic. Imagine the agentic electrician frying all your house circuits by connecting the wrong wire. In an agent scenario, this could be injecting incorrect code, breaking down a bigger system or adding hidden flaws that may not be immediately evident. Cowork could miss major saving opportunities when doing a user’s taxes; on the flip side, it could include illegal writeoffs. Claude can do unimaginable damage when it has more control and authority.

But in the middle of this chaos, there is an opportunity to really take advantage. With the right guardrails in place, agents can focus on specific actions and avoid making random, unaccounted-for decisions. Principles of responsible AI — accountability, transparency, reproducibility, security, privacy — are extremely important. Logging agent steps and human confirmation are absolutely critical.

Also, when agents deal with so many diverse systems, it’s important they speak the same language. Ontology becomes very important so that events can be tracked, monitored, and accounted for. A shared domain-specific ontology can define a “code of conduct.” These ethics can help control the chaos. When tied together with a shared trust and distributed identity framework, we can build systems that enable agents to do truly useful work.

When done right, an agentic ecosystem can greatly offload the human “cognitive load” and enable our workforce to perform high-value tasks. Humans will benefit when agents handle the mundane.

Dattaraj Rao is innovation and R&D architect at Persistent Systems.

DataDecisionMakers, Infrastructure

OCSF explained: The shared data language security teams have been missing

The security industry has spent the last year talking about models, copilots, and agents, but a quieter shift is happening one layer below all of that: Vendors are lining up around a shared way to describe security data. The Open Cybersecurity Schema Framework (OCSF), is emerging as one of the strongest candidates for that job.

It gives vendors, enterprises, and practitioners a common way to represent security events, findings, objects, and context. That means less time rewriting field names and custom parsers and more time correlating detections, running analytics, and building workflows that can work across products. In a market where every security team is stitching together endpoint, identity, cloud, SaaS, and AI telemetry, a common infrastructure long felt like a pipe dream, and OCSF now puts it within reach.

OCSF in plain language

OCSF is an open-source framework for cybersecurity schemas. It’s vendor neutral by design and deliberately agnostic to storage format, data collection, and ETL choices. In practical terms, it gives application teams and data engineers a shared structure for events so analysts can work with a more consistent language for threat detection and investigation.

That sounds dry until you look at the daily work inside a security operations center (SOC). Security teams have to spend a lot of effort normalizing data from different tools so that they can correlate events. For example, detecting an employee logging in from San Francisco at 10 a.m. on their laptop, then accessing a cloud resource from New York at 10:02 a.m. could reveal a leaked credential.

Setting up a system that can correlate those events, however, is no easy task: Different tools describe the same idea with different fields, nesting structures, and assumptions. OCSF was built to lower this tax. It helps vendors map their own schemas into a common model and helps customers move data through lakes, pipelines, security incident and event management (SIEM) tools without requiring time consuming translation at every hop.

The last two years have been unusually fast

Most of OCSF’s visible acceleration has happened in the last two years. The project was announced in August 2022 by Amazon AWS and Splunk, building on worked contributed by Symantec, Broadcom, and other well known infrastructure giants Cloudflare, CrowdStrike, IBM, Okta, Palo Alto Networks, Rapid7, Salesforce, Securonix, Sumo Logic, Tanium, Trend Micro, and Zscaler.

The OCSF community has kept up a steady cadence of releases over the last two years

The community has grown quickly. AWS said in August 2024 that OCSF had expanded from a 17-company initiative into a community with more than 200 participating organizations and 800 contributors, which expanded to 900 wen OCSF joined the Linux Foundation in November 2024.

OCSF is showing up across the industry

In the observability and security space, OCSF is everywhere. AWS Security Lake converts natively supported AWS logs and events into OCSF and stores them in Parquet. AWS AppFabric can output OCSF — normalized audit data. AWS Security Hub findings use OCSF, and AWS publishes an extension for cloud-specific resource details.

Splunk can translate incoming data into OCSF with edge processor and ingest processor. Cribl supports seamless converting streaming data into OCSF and compatible formats.

Palo Alto Networks can forward Strata sogging Service data into Amazon Security Lake in OCSF. CrowdStrike positions itself on both sides of the OCSF pipe, with Falcon data translated into OCSF for Security Lake and Falcon Next-Gen SIEM positioned to ingest and parse OCSF-formatted data. OCSF is one of those rare standards that has crossed the chasm from an abstract standard into standard operational plumbing across the industry.

AI is giving the OCSF story fresh urgency

When enterprises deploy AI infrastructure, large language models (LLMs) sit at the core, surrounded by complex distributed systems such as model gateways, agent runtimes, vector stores, tool calls, retrieval systems, and policy engines. These components generate new forms of telemetry, much of which spans product boundaries. Security teams across the SOC are increasingly focused on capturing and analyzing this data. The central question often becomes what an agentic AI system actually did, rather than only the text it produced, and whether its actions led to any security breaches.

That puts more pressure on the underlying data model. An AI assistant that calls the wrong tool, retrieves the wrong data, or chains together a risky sequence of actions creates a security event that needs to be understood across systems. A shared security schema becomes more valuable in that world, especially when AI is also being used on the analytics side to correlate more data, faster.

For OCSF, 2025 was all about AI

Imagine a company uses an AI assistant to help employees look up internal documents and trigger tools like ticketing systems or code repositories. One day, the assistant starts pulling the wrong files, calling tools it should not use, and exposing sensitive information in its responses.

Updates in OCSF versions 1.5.0, 1.6.0, and 1.7.0 help security teams piece together what happened by flagging unusual behavior, showing who had access to the connected systems, and tracing the assistant’s tool calls step by step. Instead of only seeing the final answer the AI gave, the team can investigate the full chain of actions that led to the problem.

What’s on the horizon

Imagine a company uses an AI customer support bot, and one day the bot begins giving long, detailed answers that include internal troubleshooting guidance meant only for staff. With the kinds of changes being developed for OCSF 1.8.0, the security team could see which model handled the exchange, which provider supplied it, what role each message played, and how the token counts changed across the conversation.

A sudden spike in prompt or completion tokens could signal that the bot was fed an unusually large hidden prompt, pulled in too much background data from a vector database, or generated an overly long response that increased the chance of sensitive information leaking. That gives investigators a practical clue about where the interaction went off course, instead of leaving them with only the final answer.

Why this matters to the broader market

The bigger story is that OCSF has moved quickly from being a community effort to becoming a real standard that security products use every day. Over the past two years, it has gained stronger governance, frequent releases, and practical support across data lakes, ingest pipelines, SIEM workflows, and partner ecosystems.

In a world where AI expands the security landscape through scams, abuse, and new attack paths, security teams rely on OCSF to connect data from many systems without losing context along the way to keep your data safe.

Nikhil Mungel has been building distributed systems and AI teams at SaaS companies for more than 15 years.

DataDecisionMakers, Security

When product managers ship code: AI just broke the software org chart

Last week, one of our product managers (PMs) built and shipped a feature. Not spec’d it. Not filed a ticket for it. Built it, tested it, and shipped it to production. In a day.

A few days earlier, our designer noticed that the visual appearance of our IDE plugins had drifted from the design system. In the old world, that meant screenshots, a JIRA ticket, a conversation to explain the intent, and a sprint slot. Instead, he opened an agent, adjusted the layout himself, experimented, iterated, and tuned in real time, then pushed the fix. The person with the strongest design intuition fixed the design directly. No translation layer required.

None of this is new in theory. Vibe coding opened the gates of software creation to millions. That was aspiration. When I shared the data on how our engineers doubled throughput, shifted from coding to validation, brought design upfront for rapid experimentation, it was still an engineering story. What changed is that the theory became practice. Here’s how it actually played out.

The bottleneck moved

When we went AI-first in 2025, implementation cost collapsed. Agents took over scaffolding, tests, and the repetitive glue code that used to eat half the sprint. Cycle times dropped from weeks to days, from days to hours. Engineers started thinking less in files and functions and more in architecture, constraints, and execution plans.

But once engineering capacity stopped being the bottleneck, we noticed something: Decision velocity was. All the coordination mechanisms we’d built to protect engineering time (specs, tickets, handoffs, backlog grooming) were now the slowest part of the system. We were optimizing for a constraint that no longer existed.

What happens when building is cheaper than coordination

We started asking a different question: What would it look like if the people closest to the intent could ship the software directly?

PMs already think in specifications. Designers already define structure, layout, and behavior. They don’t think in syntax. They think in outcomes. When the cost of turning intent into working software dropped far enough, these roles didn’t need to “learn to code.” The cost of implementation simply fell to their level.

I asked one of our PMs, Dmitry, to describe what changed from his perspective. He told me: “While agents are generating tasks in Zenflow, there’s a few minutes of idle time. Just dead air. I wanted to build a small game, something to interact with while you wait.”

If you’ve ever run a product team, you know this kind of idea. It doesn’t move a KPI. It’s impossible to justify in a prioritization meeting. It gets deferred forever. But it adds personality. It makes the product feel like someone cared about the small details. These are exactly the things that get optimized out of every backlog grooming session, and exactly the things users remember.

He built it in a day.

In the past, that idea would have died in a prioritization spreadsheet. Not because it was bad, but because the cost of implementation made it irrational to pursue. When that cost drops to near zero, the calculus changes completely.

Shipping became cheaper than explaining

As more people started building directly, entire layers of process quietly vanished. Fewer tickets. Fewer handoffs. Fewer “can you explain what you mean by…” conversations. Fewer lost-in-translation moments.

For a meaningful class of tasks, it became faster to just build the thing than to describe what you wanted and wait for someone else to build it. Think about that for a second. Every modern software organization is structured around the assumption that implementation is the expensive part. When that assumption breaks, the org has to change with it.

Our designer fixing the plugin UI is a perfect example. The old workflow (screenshot the problem, file a ticket, explain the gap between intent and implementation, wait for a sprint slot, review the result, request adjustments) existed entirely to protect engineering bandwidth. When the person with the design intuition can act on it directly, that whole stack disappears. Not because we eliminated process for its own sake, but because the process was solving a problem that no longer existed.

The compounding effect

Here’s what surprised me most: It compounds.

When PMs build their own ideas, their specifications get sharper, because they now understand what the agent needs to execute well. Sharper specs produce better agent output. Better output means fewer iteration cycles. We’re seeing velocity compound week over week, not just because the models improved, but because the people using them got closer to the work.

Dmitry put it well: The feedback loop between intent and outcome went from weeks to minutes. When you can see the result of your specification immediately, you learn what precision the system needs, and you start providing it instinctively.

There’s a second-order effect that’s harder to measure but impossible to miss: Ownership. People stop waiting. They stop filing tickets for things they could just fix. “Builder” stopped being a job title. It became the default behavior.

What this means for the industry

A lot of the “everyone can code” narrative last year was theoretical, or focused on solo founders and tiny teams. What we experienced is different. We have ~50 engineers working in a complex brownfield codebase: Multiple surfaces and programming languages, enterprise integrations, the full weight of a real production system.

I don’t think we’re unique. I think we’re early. And with each new generation of models, the gap between who can build and who can’t is closing faster than most organizations realize. Every software company is about to discover that their PMs and designers are sitting on unrealized building capacity, blocked not by skill, but by the cost of implementation. As that cost continues to fall, the organizational implications are profound.

We started with an intent to accelerate software engineering. What we’re becoming is something different: A company where everyone ships.

Andrew Filev is founder and CEO of Zencoder.

DataDecisionMakers, Orchestration, technology

When AI turns software development inside-out: 170% throughput at 80% headcount

Many people have tried AI tools and walked away unimpressed. I get it — many demos promise magic, but in practice, the results can feel underwhelming.

That’s why I want to write this not as a futurist prediction, but from lived experience. Over the past six months, I turned my engineering organization AI-first. I’ve shared before about the system behind that transformation — how we built the workflows, the metrics, and the guardrails. Today, I want to zoom out from the mechanics and talk about what I’ve learned from that experience — about where our profession is heading when software development itself turns inside out.

Before I do, a couple of numbers to illustrate the scale of change. Subjectively, it feels that we are moving twice as fast. Objectively, here’s how the throughput evolved. Our total engineering team headcount floated from 36 at the beginning of the year to 30. So you get ~170% throughput on ~80% headcount, which matches the subjective ~2x.

Zooming in, I picked a couple of our senior engineers who started the year in a more traditional software engineering process and ended it in the AI-first way. [The dips correspond to vacations and off-sites]:

Note that our PRs are tied to JIRA tickets, and the average scope of those tickets didn’t change much through the year, so it’s as good a proxy as the data can give us.

Qualitatively, looking at the business value, I actually see even higher uplift. One reason is that, as we started last year, our quality assurance (QA) team couldn’t keep up with our engineers’ velocity. As the company leader, I wasn’t happy with the quality of some of our early releases. As we progressed through the year, and tooled our AI workflows to include writing unit and end-to-end tests, our coverage improved, the number of bugs dropped, users became fans, and the business value of engineering work multiplied.

From big design to rapid experimentation

Before AI, we spent weeks perfecting user flows before writing code. It made sense when change was expensive. Agile helped, but even then, testing multiple product ideas was too costly.

Once we went AI-first, that trade-off disappeared. The cost of experimentation collapsed. An idea could go from whiteboard to a working prototype in a day: From idea to AI-generated product requirements document (PRD), to AI-generated tech spec, to AI-assisted implementation.

It manifested itself in some amazing transformations. Our website—central to our acquisition and inbound demand—is now a product-scale system with hundreds of custom components, all designed, developed, and maintained directly in code by our creative director.

Now, instead of validating with slides or static prototypes, we validate with working products. We test ideas live, learn faster, and release major updates every other month, a pace I couldn’t imagine three years ago.

For example, Zen CLI was first written in Kotlin, but then we changed our mind and moved it to TypeScript with no release velocity lost.

Instead of mocking the features, our UX designers and project managers vibe code them. And when the release-time crunch hit everyone, they jumped into action and fixed dozens of small details with production-ready PRs to help us ship a great product. This included an overnight UI layout change.

From coding to validation

The next shift came where I least expected it: Validation.

In a traditional org, most people write code and a smaller group tests it. But when AI generates much of the implementation, the leverage point moves. The real value lies in defining what “good” looks like — in making correctness explicit.

We support 70-plus programming languages and countless integrations. Our QA engineers have evolved into system architects. They build AI agents that generate and maintain acceptance tests directly from requirements. And those agents are embedded into the codified AI workflows that allow us to achieve predictable engineering outcomes by using a system.

This is what “shift left” really means. Validation isn’t a stand-alone function, it’s an integral part of the production process. If the agent can’t validate it’s work, it can’t be trusted to generate production code. For QA professionals, this is a moment of reinvention, where, with the right upskilling, their work becomes a critical enabler and accelerator of the AI adoption.

Product managers, tech leads, and data engineers now share this responsibility as well, because defining correctness has become a cross-functional skill, not a role confined to QA.

From diamond to double funnel

For decades, software development followed a “diamond” shape: A small product team handed off to a large engineering team, then narrowed again through QA.

Today, that geometry is flipping. Humans engage more deeply at the beginning — defining intent, exploring options — and again at the end, validating outcomes. The middle, where AI executes, is faster and narrower.

It’s not just a new workflow; it’s a structural inversion.

The model looks less like an assembly line and more like a control tower. Humans set direction and constraints, AI handles execution at speed, and people step back in to validate outcomes before decisions land in production.

Engineering at a higher level of abstraction

Every major leap in software raised our level of abstraction — from punch cards to high-level programming languages, from hardware to cloud. AI is the next step. Our engineers now work at a meta-layer: Orchestrating AI workflows, tuning agentic instructions and skills, and defining guardrails. The machines build; the humans decide what and why.

Teams now routinely decide when AI output is safe to merge without review, how tightly to bound agent autonomy in production systems, and what signals actually indicate correctness at scale, decisions that simply didn’t exist before.

And that’s the paradox of AI-first engineering — it feels less like coding, and more like thinking. Welcome to the new era of human intelligence, powered by AI.

Andrew Filev is founder and CEO of Zencoder

DataDecisionMakers, Orchestration, technology

You thought the generalist was dead — in the ‘vibe work’ era, they’re more important than ever

Not long ago, the idea of being a “generalist” in the workplace had a mixed reputation. The stereotype was the “jack of all trades” who could dabble in many disciplines but was a “master of none.” And for years, that was more or less true.

Most people simply didn’t have access to the expertise required to do highly cross-functional work. If you needed a new graphic, you waited for a designer. If you needed to change a contract, you waited for legal. In smaller organizations and startups, this waiting game was typically replaced with inaction or improvization — often with questionable results.

AI is changing this faster than any technology shift I’ve seen. It’s allowing people to succeed at tasks beyond their normal area of expertise.

Anthropic found that AI is “enabling engineers to become more full-stack in their work,” meaning they’re able to make competent decisions across a much wider range of interconnected technologies. A direct consequence of this is tasks that would have been left aside due to lack of time or expertise are now being accomplished (27% of AI-assisted work per Anthropic’s study).

This shift is closely mirroring the effects of past revolutionary technologies. The invention of the automobile or the computer did not bring us a wealth of leisure time — it mainly led us to start doing work that could not be done before.

With AI as a guide, anyone can now expand their skillsets and augment their expertise to accomplish more. This fundamentally changes what people can do, who can do it, how teams operate, and what leaders should expect.

Well, not so fast.

The AI advances have been incredible, and if 2025 may not have fully delivered its promise of bringing AI agents to the workforce, there’s no reason to doubt it’s well on its way. But for now, it’s not perfect. If to err is human, to trust AI not to err is foolish.

One of the biggest challenges of working with AI is identifying hallucinations. The term was coined, I assume, not as a cute way to refer to factual errors, but as quite an apt way of describing the conviction that AI exhibits in its erroneous answers. We humans have a clear bias toward confident people, which probably explains the number of smart people getting burned after taking ChatGPT at face value.

And if experts can get fooled by an overconfident AI, how can generalists hope to harness the power of AI without making the same mistake?

Citizen guardrails give way to vibe freedom

It’s tempting to compare today’s AI vibe coding wave to the rise of low- and no-code tools. No-code tools gave users freedom to build custom software tailored to their needs. However, the comparison doesn’t quite hold. The so-called “citizen developers” could only operate inside the boundaries the tool allowed. These tight constraints were limiting, but they had the benefit of saving the users from themselves — preventing anything catastrophic.

AI removes those boundaries almost entirely, and with great freedom comes responsibilities that most people aren’t quite prepared for.

The first stage of ‘vibe freedom’ is one of unbridled optimism encouraged by a sycophantic AI. “You’re absolutely correct!” The dreaded report that would have taken all night looks better than anything you could have done yourself and only took a few minutes.

The next stage comes almost by surprise — there’s something that’s not quite right. You start doubting the accuracy of the work — you review and then wonder if it wouldn’t have been quicker to just do it yourself in the first place.

Then comes bargaining and acceptance. You argue with the AI, you’re led down confusing paths, but slowly you start developing an understanding — a mental model of the AI mind. You learn to recognize the confidently incorrect, you learn to push back and cross-check, you learn to trust and verify.

The generalist becomes the trust layer

This is a skill that can be learned, and it can only be learned on the job, through regular practice. This doesn’t require deep specialization, but it does require awareness. Curiosity becomes essential. So does the willingness to learn quickly, think critically, spot inconsistencies, and to rely on judgment rather than treating AI as infallible.

That’s the new job of the generalist: Not to be an expert in everything, but to understand the AI mind enough to catch when something is off, and to defer to a true specialist when the stakes are high.

The generalist becomes the human trust layer sitting between the AI’s output and the organization’s standards. They decide what passes and what gets a second opinion.

That said, this only works if the generalist clears a minimum bar of fluency. There’s a big difference between “broadly informed” and “confidently unaware.” AI makes that gap easier to miss.

Impact on teams and hiring

Clearly, specialists will not be replaced by AI anytime soon. Their work remains critical. It will evolve to become more strategic.

What AI changes is everything around the edges. Roles that felt important but were hard to fill, tasks that sat in limbo because no expert was available, backlogs created by waiting for highly skilled people to review simple work. Now, a generalist can get much farther on their own, and specialists can focus on the hardest problems.

We’re already starting to see an impact in the hiring landscape. Companies are looking to bring on individuals who are comfortable navigating AI. People who embrace it and use it to take on projects outside of their comfort zone.

Performance expectations will shift too. Many leaders are already looking less at productivity alone, and more at how effectively someone uses AI. We see token usage not as a measure of cost, but as an indicator of AI adoption, and perhaps optimistically, as a proxy for productivity.

Making vibe work viable

Use AI to enhance work, not to wing it: You will get burned letting AI loose. It requires guidance and oversight.
Learn when to trust and when to verify: Build an understanding of the AI mind so you can exercise good judgement on the work produced. When in doubt or when the stakes are high, defer to specialists.
Set clear organizational standards: AI thrives on context and humans, too. Invest in documentation of processes, procedures, and best practices.
Keep humans in the loop: AI shouldn’t remove oversight. It should make oversight easier.

Without these factors, AI work stays in the “vibe” stage. With them, it becomes something the business can actually rely on.

Return of the generalist

The emerging, AI-empowered generalist is defined by curiosity, adaptability, and the ability to evaluate the work AI produces. They can span multiple functions, not because they’re experts in each one, but because AI gives them access to specialist-level expertise. Most importantly, this new generation of generalists knows when and how to apply their human judgment and critical thinking. That’s the real determining factor for turning vibes into something reliable, sustainable, and viable in the long run.

Cedric Savarese is founder and CEO of FormAssembly.

DataDecisionMakers, technology

Testing autonomous agents (Or: how I learned to stop worrying and embrace chaos)

Look, we’ve spent the last 18 months building production AI systems, and we’ll tell you what keeps us up at night — and it’s not whether the model can answer questions. That’s table stakes now. What haunts us is the mental image of an agent autonomously approving a six-figure vendor contract at 2 a.m. because someone typo’d a config file.

We’ve moved past the era of “ChatGPT wrappers” (thank God), but the industry still treats autonomous agents like they’re just chatbots with API access. They’re not. When you give an AI system the ability to take actions without human confirmation, you’re crossing a fundamental threshold. You’re not building a helpful assistant anymore — you’re building something closer to an employee. And that changes everything about how we need to engineer these systems.

The autonomy problem nobody talks about

Here’s what’s wild: We’ve gotten really good at making models that *sound* confident. But confidence and reliability aren’t the same thing, and the gap between them is where production systems go to die.

We learned this the hard way during a pilot program where we let an AI agent manage calendar scheduling across executive teams. Seems simple, right? The agent could check availability, send invites, handle conflicts. Except, one Monday morning, it rescheduled a board meeting because it interpreted “let’s push this if we need to” in a Slack message as an actual directive. The model wasn’t wrong in its interpretation — it was plausible. But plausible isn’t good enough when you’re dealing with autonomy.

That incident taught us something crucial: The challenge isn’t building agents that work most of the time. It’s building agents that fail gracefully, know their limitations, and have the circuit breakers to prevent catastrophic mistakes.

What reliability actually means for autonomous systems

Layered reliability architecture

When we talk about reliability in traditional software engineering, we’ve got decades of patterns: Redundancy, retries, idempotency, graceful degradation. But AI agents break a lot of our assumptions.

Traditional software fails in predictable ways. You can write unit tests. You can trace execution paths. With AI agents, you’re dealing with probabilistic systems making judgment calls. A bug isn’t just a logic error—it’s the model hallucinating a plausible-sounding but completely fabricated API endpoint, or misinterpreting context in a way that technically parses but completely misses the human intent.

So what does reliability look like here? In our experience, it’s a layered approach.

Layer 1: Model selection and prompt engineering

This is foundational but insufficient. Yes, use the best model you can afford. Yes, craft your prompts carefully with examples and constraints. But don’t fool yourself into thinking that a great prompt is enough. I’ve seen too many teams ship “GPT-4 with a really good system prompt” and call it enterprise-ready.

Layer 2: Deterministic guardrails

Before the model does anything irreversible, run it through hard checks. Is it trying to access a resource it shouldn’t? Is the action within acceptable parameters? We’re talking old-school validation logic — regex, schema validation, allowlists. It’s not sexy, but it’s effective.

One pattern that’s worked well for us: Maintain a formal action schema. Every action an agent can take has a defined structure, required fields, and validation rules. The agent proposes actions in this schema, and we validate before execution. If validation fails, we don’t just block it — we feed the validation errors back to the agent and let it try again with context about what went wrong.

Layer 3: Confidence and uncertainty quantification

Here’s where it gets interesting. We need agents that know what they don’t know. We’ve been experimenting with agents that can explicitly reason about their confidence before taking actions. Not just a probability score, but actual articulated uncertainty: “I’m interpreting this email as a request to delay the project, but the phrasing is ambiguous and could also mean…”

This doesn’t prevent all mistakes, but it creates natural breakpoints where you can inject human oversight. High-confidence actions go through automatically. Medium-confidence actions get flagged for review. Low-confidence actions get blocked with an explanation.

Layer 4: Observability and auditability

Action Validation Pipeline

If you can’t debug it, you can’t trust it. Every decision the agent makes needs to be loggable, traceable, and explainable. Not just “what action did it take” but “what was it thinking, what data did it consider, what was the reasoning chain?”

We’ve built a custom logging system that captures the full large language model (LLM) interaction — the prompt, the response, the context window, even the model temperature settings. It’s verbose as hell, but when something goes wrong (and it will), you need to be able to reconstruct exactly what happened. Plus, this becomes your dataset for fine-tuning and improvement.

Guardrails: The art of saying no

Let’s talk about guardrails, because this is where engineering discipline really matters. A lot of teams approach guardrails as an afterthought — “we’ll add some safety checks if we need them.” That’s backwards. Guardrails should be your starting point.

We think of guardrails in three categories.

Permission boundaries

What is the agent physically allowed to do? This is your blast radius control. Even if the agent hallucinates the worst possible action, what’s the maximum damage it can cause?

We use a principle called “graduated autonomy.” New agents start with read-only access. As they prove reliable, they graduate to low-risk writes (creating calendar events, sending internal messages). High-risk actions (financial transactions, external communications, data deletion) either require explicit human approval or are simply off-limits.

One technique that’s worked well: Action cost budgets. Each agent has a daily “budget” denominated in some unit of risk or cost. Reading a database record costs 1 unit. Sending an email costs 10. Initiating a vendor payment costs 1,000. The agent can operate autonomously until it exhausts its budget; then, it needs human intervention. This creates a natural throttle on potentially problematic behavior.

Graduated Autonomy and Action Cost Budget

Semantic Houndaries

What should the agent understand as in-scope vs out-of-scope? This is trickier because it’s conceptual, not just technical.

I’ve found that explicit domain definitions help a lot. Our customer service agent has a clear mandate: handle product questions, process returns, escalate complaints. Anything outside that domain — someone asking for investment advice, technical support for third-party products, personal favors — gets a polite deflection and escalation.

The challenge is making these boundaries robust to prompt injection and jailbreaking attempts. Users will try to convince the agent to help with out-of-scope requests. Other parts of the system might inadvertently pass instructions that override the agent’s boundaries. You need multiple layers of defense here.

Operational boundaries

How much can the agent do, and how fast? This is your rate limiting and resource control.

We’ve implemented hard limits on everything: API calls per minute, maximum tokens per interaction, maximum cost per day, maximum number of retries before human escalation. These might seem like artificial constraints, but they’re essential for preventing runaway behavior.

We once saw an agent get stuck in a loop trying to resolve a scheduling conflict. It kept proposing times, getting rejections, and trying again. Without rate limits, it sent 300 calendar invites in an hour. With proper operational boundaries, it would’ve hit a threshold and escalated to a human after attempt number 5.

Agents need their own style of testing

Traditional software testing doesn’t cut it for autonomous agents. You can’t just write test cases that cover all the edge cases, because with LLMs, everything is an edge case.

What’s worked for us:

Simulation environments

Build a sandbox that mirrors production but with fake data and mock services. Let the agent run wild. See what breaks. We do this continuously — every code change goes through 100 simulated scenarios before it touches production.

The key is making scenarios realistic. Don’t just test happy paths. Simulate angry customers, ambiguous requests, contradictory information, system outages. Throw in some adversarial examples. If your agent can’t handle a test environment where things go wrong, it definitely can’t handle production.

Red teaming

Get creative people to try to break your agent. Not just security researchers, but domain experts who understand the business logic. Some of our best improvements came from sales team members who tried to “trick” the agent into doing things it shouldn’t.

Shadow mode

Before you go live, run the agent in shadow mode alongside humans. The agent makes decisions, but humans actually execute the actions. You log both the agent’s choices and the human’s choices, and you analyze the delta.

This is painful and slow, but it’s worth it. You’ll find all kinds of subtle misalignments you’d never catch in testing. Maybe the agent technically gets the right answer, but with phrasing that violates company tone guidelines. Maybe it makes legally correct but ethically questionable decisions. Shadow mode surfaces these issues before they become real problems.

The human-in-the-loop pattern

Three Human-in-the-Loop Patterns

Despite all the automation, humans remain essential. The question is: Where in the loop?

We’re increasingly convinced that “human-in-the-loop” is actually several distinct patterns:

Human-on-the-loop: The agent operates autonomously, but humans monitor dashboards and can intervene. This is your steady-state for well-understood, low-risk operations.

Human-in-the-loop: The agent proposes actions, humans approve them. This is your training wheels mode while the agent proves itself, and your permanent mode for high-risk operations.

Human-with-the-loop: Agent and human collaborate in real-time, each handling the parts they’re better at. The agent does the grunt work, the human does the judgment calls.

The trick is making these transitions smooth. An agent shouldn’t feel like a completely different system when you move from autonomous to supervised mode. Interfaces, logging, and escalation paths should all be consistent.

Failure modes and recovery

Let’s be honest: Your agent will fail. The question is whether it fails gracefully or catastrophically.

We classify failures into three categories:

Recoverable errors: The agent tries to do something, it doesn’t work, the agent realizes it didn’t work and tries something else. This is fine. This is how complex systems operate. As long as the agent isn’t making things worse, let it retry with exponential backoff.

Detectable failures: The agent does something wrong, but monitoring systems catch it before significant damage occurs. This is where your guardrails and observability pay off. The agent gets rolled back, humans investigate, you patch the issue.

Undetectable failures: The agent does something wrong, and nobody notices until much later. These are the scary ones. Maybe it’s been misinterpreting customer requests for weeks. Maybe it’s been making subtly incorrect data entries. These accumulate into systemic issues.

The defense against undetectable failures is regular auditing. We randomly sample agent actions and have humans review them. Not just pass/fail, but detailed analysis. Is the agent showing any drift in behavior? Are there patterns in its mistakes? Is it developing any concerning tendencies?

The cost-performance tradeoff

Here’s something nobody talks about enough: reliability is expensive.

Every guardrail adds latency. Every validation step costs compute. Multiple model calls for confidence checking multiply your API costs. Comprehensive logging generates massive data volumes.

You have to be strategic about where you invest. Not every agent needs the same level of reliability. A marketing copy generator can be looser than a financial transaction processor. A scheduling assistant can retry more liberally than a code deployment system.

We use a risk-based approach. High-risk agents get all the safeguards, multiple validation layers, extensive monitoring. Lower-risk agents get lighter-weight protections. The key is being explicit about these trade-offs and documenting why each agent has the guardrails it does.

Organizational challenges

We’d be remiss if we didn’t mention that the hardest parts aren’t technical — they’re organizational.

Who owns the agent when it makes a mistake? Is it the engineering team that built it? The business unit that deployed it? The person who was supposed to be supervising it?

How do you handle edge cases where the agent’s logic is technically correct but contextually inappropriate? If the agent follows its rules but violates an unwritten norm, who’s at fault?

What’s your incident response process when an agent goes rogue? Traditional runbooks assume human operators making mistakes. How do you adapt these for autonomous systems?

These questions don’t have universal answers, but they need to be addressed before you deploy. Clear ownership, documented escalation paths, and well-defined success metrics are just as important as the technical architecture.

Where we go from here

The industry is still figuring this out. There’s no established playbook for building reliable autonomous agents. We’re all learning in production, and that’s both exciting and terrifying.

What we know for sure: The teams that succeed will be the ones who treat this as an engineering discipline, not just an AI problem. You need traditional software engineering rigor — testing, monitoring, incident response — combined with new techniques specific to probabilistic systems.

You need to be paranoid but not paralyzed. Yes, autonomous agents can fail in spectacular ways. But with proper guardrails, they can also handle enormous workloads with superhuman consistency. The key is respecting the risks while embracing the possibilities.

We’ll leave you with this: Every time we deploy a new autonomous capability, we run a pre-mortem. We imagine it’s six months from now and the agent has caused a significant incident. What happened? What warning signs did we miss? What guardrails failed?

This exercise has saved us more times than we can count. It forces you to think through failure modes before they occur, to build defenses before you need them, to question assumptions before they bite you.

Because in the end, building enterprise-grade autonomous AI agents isn’t about making systems that work perfectly. It’s about making systems that fail safely, recover gracefully, and learn continuously.

And that’s the kind of engineering that actually matters.

Madhvesh Kumar is a principal engineer. Deepika Singh is a senior software engineer.

Views expressed are based on hands-on experience building and deploying autonomous agents, along with the occasional 3 AM incident response that makes you question your career choices.

DataDecisionMakers, Infrastructure, Orchestration

Rethinking AEO when software agents navigate the web on behalf of users

For more than two decades, digital businesses have relied on a simple assumption: When someone interacts with a website, that activity reflects a human making a conscious choice. Clicks are treated as signals of interest. Time on page is assumed to ind…

DataDecisionMakers, Infrastructure, Orchestration, technology

Fixing AI failure: Three changes enterprises should make now

Recent reports about AI project failure rates have raised uncomfortable questions for organizations investing heavily in AI. Much of the discussion has focused on technical factors like model accuracy and data quality, but after watching dozens of AI initiatives launch, I’ve noticed that the biggest opportunities for improvement are often cultural, not technical.

Internal projects that struggle tend to share common issues. For example, engineering teams build models that product managers don’t know how to use. Data scientists build prototypes that operations teams struggle to maintain. And AI applications sit unused because the people they were built for weren’t involved in deciding what “useful” really meant.

In contrast, organizations that achieve meaningful value with AI have figured out how to create the right kind of collaboration across departments, and established shared accountability for outcomes. The technology matters, but the organizational readiness matters just as much.

Here are three practices I’ve observed that address the cultural and organizational barriers that can impede AI success.

Expand AI literacy beyond engineering

When only engineers understand how an AI system works and what it’s capable of, collaboration breaks down. Product managers can’t evaluate trade-offs they don’t understand. Designers can’t create interfaces for capabilities they can’t articulate. Analysts can’t validate outputs they can’t interpret.

The solution isn’t making everyone a data scientist. It’s helping each role understand how AI applies to their specific work. Product managers need to grasp what kinds of generated content, predictions or recommendations are realistic given available data. Designers need to understand what the AI can actually do so they can design features users will find useful. Analysts need to know which AI outputs require human validation versus which can be trusted.

When teams share this working vocabulary, AI stops being something that happens in the engineering department and becomes a tool the entire organization can use effectively.

Establish clear rules for AI autonomy

The second challenge involves knowing where AI can act on its own versus where human approval is required. Many organizations default to extremes, either bottlenecking every AI decision through human review, or letting AI systems operate without guardrails.

What’s needed is a clear framework that defines where and how AI can act autonomously. This means establishing rules upfront: Can AI approve routine configuration changes? Can it recommend schema updates but not implement them? Can it deploy code to staging environments but not production?

These rules should include three elements: auditability (can you trace how the AI reached its decision?), reproducibility (can you recreate the decision path?), and observability (can teams monitor AI behavior as it happens?). Without this framework, you either slow down to the point where AI provides no advantage, or you create systems making decisions nobody can explain or control.

Create cross-functional playbooks

The third step is codifying how different teams actually work with AI systems. When every department develops its own approach, you get inconsistent results and redundant effort.

Cross-functional playbooks work best when teams develop them together rather than having them imposed from above. These playbooks answer concrete questions like: How do we test AI recommendations before putting them into production? What’s our fallback procedure when an automated deployment fails – does it hand off to human operators or try a different approach first? Who needs to be involved when we override an AI decision? How do we incorporate feedback to improve the system?

The goal isn’t to add bureaucracy. It’s ensuring everyone understands how AI fits into their existing work, and what to do when results don’t match expectations.

Moving forward

Technical excellence in AI remains important, but enterprises that over-index on model performance while ignoring organizational factors are setting themselves up for avoidable challenges. The successful AI deployments I’ve seen treat cultural transformation and workflows just as seriously as technical implementation.

The question isn’t whether your AI technology is sophisticated enough. It’s whether your organization is ready to work with it.

Adi Polak is director for advocacy and developer experience engineering at Confluent.

DataDecisionMakers, technology

The limits of bubble thinking: How AI breaks every historical analogy

It’s always the same story: A new technology appears and everyone starts talking about how it’ll change everything. Then capital rushes in, companies form overnight, and valuations climb faster than anyone can justify. Then, many many months later, the…

DataDecisionMakers, Infrastructure, Orchestration, technology