In the race to bring artificial intelligence into the enterprise, a small but well-funded startup is making a bold claim: The problem holding back AI adoption in complex industries has never been the models themselves.
Contextual AI, a two-and-a-half-year-old company backed by investors including Bezos Expeditions and Bain Capital Ventures, on Monday unveiled Agent Composer, a platform designed to help engineers in aerospace, semiconductor manufacturing, and other technically demanding fields build AI agents that can automate the kind of knowledge-intensive work that has long resisted automation.
The announcement arrives at a pivotal moment for enterprise AI. Four years after ChatGPT ignited a frenzy of corporate AI initiatives, many organizations remain stuck in pilot programs, struggling to move experimental projects into full-scale production. Chief financial officers and business unit leaders are growing impatient with internal efforts that have consumed millions of dollars but delivered limited returns.
Douwe Kiela, Contextual AI’s chief executive, believes the industry has been focused on the wrong bottleneck. “The model is almost commoditized at this point,” Kiela said in an interview with VentureBeat. “The bottleneck is context — can the AI actually access your proprietary docs, specs, and institutional knowledge? That’s the problem we solve.”
To understand what Contextual AI is attempting, it helps to understand a concept that has become central to modern AI development: retrieval-augmented generation, or RAG.
When large language models like those from OpenAI, Google, or Anthropic generate responses, they draw on knowledge embedded during training. But that knowledge has a cutoff date, and it cannot include the proprietary documents, engineering specifications, and institutional knowledge that make up the lifeblood of most enterprises.
RAG systems attempt to solve this by retrieving relevant documents from a company’s own databases and feeding them to the model alongside the user’s question. The model can then ground its response in actual company data rather than relying solely on its training.
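The retrieve-then-ground loop at the heart of RAG can be sketched in a few lines of Python. This is a toy illustration, not Contextual AI's pipeline: the keyword-overlap retriever stands in for real dense vector search, and the assembled prompt would then be sent to a language model.

```python
def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query.
    Production systems use dense vector search; this stand-in just
    illustrates the retrieval step."""
    query_terms = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(query_terms & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_grounded_prompt(query: str, documents: list[str]) -> str:
    """Assemble what a RAG pipeline sends to the generator: retrieved
    company documents first, then the user's question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return (
        "Answer using ONLY the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )

docs = [
    "Spec 4.2: the X100 sensor tolerates 85C ambient temperature.",
    "HR policy: vacation requests need two weeks notice.",
    "Errata: X100 firmware 1.3 misreports temperature above 80C.",
]
prompt = build_grounded_prompt("What temperature can the X100 sensor handle?", docs)
print(prompt)
```

The generator's answer is then constrained by the retrieved context rather than by whatever the model memorized in training, which is exactly the grounding property RAG is meant to provide.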
Kiela helped pioneer this approach during his time as a research scientist at Facebook AI Research and later as head of research at Hugging Face, the influential open-source AI company. He holds a Ph.D. from Cambridge and serves as an adjunct professor in symbolic systems at Stanford University.
But early RAG systems, Kiela acknowledges, were crude.
“Early RAG was pretty crude — grab an off-the-shelf retriever, connect it to a generator, hope for the best,” he said. “Errors compounded through the pipeline. Hallucinations were common because the generator wasn’t trained to stay grounded.”
When Kiela founded Contextual AI in June 2023, he set out to solve these problems systematically. The company developed what it calls a “unified context layer” — a set of tools that sit between a company’s data and its AI models, ensuring that the right information reaches the model in the right format at the right time.
The approach has earned recognition. According to a Google Cloud case study, Contextual AI achieved the highest performance on Google’s FACTS benchmark for grounded, hallucination-resistant results. The company fine-tuned Meta’s open-source Llama models on Google Cloud’s Vertex AI platform, focusing specifically on reducing the tendency of AI systems to invent information.
Agent Composer extends Contextual AI’s existing platform with orchestration capabilities — the ability to coordinate multiple AI tools across multiple steps to complete complex workflows.
The platform offers three ways to create AI agents. Users can start with pre-built agents designed for common technical workflows like root cause analysis or compliance checking. They can describe a workflow in natural language and let the system automatically generate a working agent architecture. Or they can build from scratch using a visual drag-and-drop interface that requires no coding.
What distinguishes Agent Composer from competing approaches, the company says, is its hybrid architecture. Teams can combine strict, deterministic rules for high-stakes steps — compliance checks, data validation, approval gates — with dynamic reasoning for exploratory analysis.
“For highly critical workflows, users can choose completely deterministic steps to control agent behavior and avoid uncertainty,” Kiela said.
The platform also includes what the company calls “one-click agent optimization,” which takes user feedback and automatically adjusts agent performance. Every step of an agent’s reasoning process can be audited, and responses come with sentence-level citations showing exactly where information originated in source documents.
Contextual AI says early customers have reported significant efficiency gains, though the company acknowledges these figures come from customer self-reporting rather than independent verification.
“These come directly from customer evals, which are approximations of real-world workflows,” Kiela said. “The numbers are self-reported by our customers as they describe the before-and-after scenario of adopting Contextual AI.”
The claimed results are nonetheless striking. An advanced manufacturer reduced root-cause analysis from eight hours to 20 minutes by automating sensor data parsing and log correlation. A specialty chemicals company reduced product research from hours to minutes using agents that search patents and regulatory databases. A test equipment maker now generates test code in minutes instead of days.
Keith Schaub, vice president of technology and strategy at Advantest, a semiconductor test equipment company, offered an endorsement. “Contextual AI has been an important part of our AI transformation efforts,” Schaub said. “The technology has been rolled out to multiple teams across Advantest and select end customers, saving meaningful time across tasks ranging from test code generation to customer engineering workflows.”
The company’s other customers include Qualcomm, the semiconductor giant; ShipBob, a tech-enabled logistics provider that claims to have achieved 60 times faster issue resolution; and Nvidia, the chip maker whose graphics processors power most AI systems.
Perhaps the biggest challenge Contextual AI faces is not competing products but the instinct among engineering organizations to build their own solutions.
“The biggest objection is ‘we’ll build it ourselves,'” Kiela acknowledged. “Some teams try. It sounds exciting to do, but is exceptionally hard to do this well at scale. Many of our customers started with DIY, and found themselves still debugging retrieval pipelines instead of solving actual problems 12-18 months later.”
The alternative — off-the-shelf point solutions — presents its own problems, the company argues. Such tools deploy quickly but often prove inflexible and difficult to customize for specific use cases.
Agent Composer attempts to occupy a middle ground, offering a platform approach that combines pre-built components with extensive customization options. The system supports models from OpenAI, Anthropic, and Google, as well as Contextual AI’s own Grounded Language Model, which was specifically trained to stay faithful to retrieved content.
Pricing starts at $50 per month for self-serve usage, with custom enterprise pricing for larger deployments.
“The justification to CFOs is really about increasing productivity and getting them to production faster with their AI initiatives,” Kiela said. “Every technical team is struggling to hire top engineering talent, so making their existing teams more productive is a huge priority in these industries.”
Looking ahead, Kiela outlined three priorities for the coming year: workflow automation with actual write actions across enterprise systems rather than just reading and analyzing; better coordination among multiple specialized agents working together; and faster specialization through automatic learning from production feedback.
“The compound effect matters here,” he said. “Every document you ingest, every feedback loop you close, those improvements stack up. Companies building this infrastructure now are going to be hard to catch.”
The enterprise AI market remains fiercely competitive, with offerings from major cloud providers, established software vendors, and scores of startups all chasing the same customers. Whether Contextual AI’s bet on context over models will pay off depends on whether enterprises come to share Kiela’s view that the foundation model wars matter less than the infrastructure that surrounds them.
But there is a certain irony in the company’s positioning. For years, the AI industry has fixated on building ever-larger, ever-more-powerful models — pouring billions into the race for artificial general intelligence. Contextual AI is making a quieter argument: that for most real-world work, the magic isn’t in the model. It’s in knowing where to look.
Chinese AI company Moonshot AI has upgraded its open-source Kimi K2 model, transforming it into a coding and vision model with an architecture that supports agent swarm orchestration.

The new model, Moonshot Kimi K2.5, is a good option for enterprises that want agents that can hand off actions among themselves rather than relying on a framework as a central decision-maker.
The company characterized Kimi K2.5 as an “all-in-one model” that supports both visual and text inputs, letting users leverage the model for more visual coding projects.
Moonshot did not publicly disclose K2.5's parameter count, but Kimi K2, the model it is based on, had 1 trillion total parameters and 32 billion activated parameters thanks to its mixture-of-experts architecture.
This is the latest open-source model to offer an alternative to the more closed options from Google, OpenAI, and Anthropic, and Moonshot claims it outperforms them on key metrics spanning agentic workflows, coding, and vision.
On the Humanity’s Last Exam (HLE) benchmark, Kimi K2.5 scored 50.2% (with tools), surpassing OpenAI’s GPT-5.2 (xhigh) and Claude Opus 4.5. It also achieved 76.8% on SWE-bench Verified, cementing its status as a top-tier coding model, though GPT-5.2 and Opus 4.5 overtake it there at 80% and 80.9%, respectively.
Moonshot said in a press release that it’s seen a 170% increase in users between September and November for Kimi K2 and Kimi K2 Thinking, which was released in early November.
Moonshot aims to leverage the self-directed agents and agent swarm paradigm built into Kimi K2.5. Agent swarms have been touted as the next frontier in enterprise AI and agent-based systems, attracting significant attention in the past few months.
For enterprises, this means agent ecosystems built on Kimi K2.5 should scale more efficiently. Rather than scaling "up" by growing model sizes to create larger agents, Moonshot is betting on scaling "out": making more agents that can essentially orchestrate themselves.
Kimi K2.5 “creates and coordinates a swarm of specialized agents working in parallel.” The company compared it to a beehive where each agent performs a task while contributing to a common goal. The model learns to self-direct up to 100 sub-agents and can execute parallel workflows of up to 1,500 tool calls.
“Benchmarks only tell half the story. Moonshot AI believes AGI should ultimately be evaluated by its ability to complete real-world tasks efficiently under real-world time constraints. The real metric they care about is: how much of your day did AI actually give back to you? Running in parallel substantially reduces the time needed for a complex task — tasks that required days of work now can be accomplished in minutes,” the company said.
Enterprises considering their orchestration strategies have begun looking at agentic platforms where agents communicate and pass off tasks, rather than following a rigid orchestration framework that dictates when an action is completed.
While Kimi K2.5 may offer a compelling option for organizations that want this form of orchestration, some may feel more comfortable avoiding orchestration baked into the model, instead using a separate platform that keeps the choice of model decoupled from the agentic task.
This is because enterprises often want more flexibility in which models make up their agents, so they can build an ecosystem of agents that tap LLMs that work best for specific actions.
Some agent platforms, from vendors such as Salesforce, AWS (Bedrock), and IBM, offer separate observability, management, and monitoring tools that help users orchestrate AI agents built with different models and enable them to work together.
The model lets users code visual layouts, including user interfaces and interactions. It reasons over images and videos to understand tasks encoded in visual inputs. For example, K2.5 can reconstruct a website’s code simply by analyzing a video recording of the site in action, translating visual cues into interactive layouts and animations.
“Interfaces, layouts, and interactions that are difficult to describe precisely in language can be communicated through screenshots or screen recordings, which the model can interpret and turn into fully functional websites. This enables a new class of vibe coding experiences,” Moonshot said.
This capability is integrated into Kimi Code, a new terminal-based tool that works with IDEs like VSCode and Cursor.
It supports “autonomous visual debugging,” where the model visually inspects its own output — such as a rendered web page — references documentation, and iterates on the code to fix layout shifts or aesthetic errors without human intervention.
Unlike other multimodal models that can create and understand images, Kimi K2.5 can build frontend interactions for websites with visuals, not just the code behind them.
Moonshot AI has aggressively priced the K2.5 API to compete with major U.S. labs, offering significant reductions compared to its previous K2 Turbo model.
Input: 60 cents per million tokens (a 47.8% decrease).
Cached Input: 10 cents per million tokens (a 33.3% decrease).
Output: $3 per million tokens (a 62.5% decrease).
The low cost of cached inputs ($0.10/M tokens) is particularly relevant for the “Agent Swarm” features, which often require maintaining large context windows across multiple sub-agents and extensive tool usage.
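A quick back-of-the-envelope calculation shows why. The per-token rates below come from Moonshot's published pricing; the swarm workload figures are hypothetical, chosen to mirror the many-sub-agents-rereading-one-context pattern described above.

```python
# Published K2.5 API rates in USD per million tokens.
INPUT = 0.60         # fresh input tokens
CACHED_INPUT = 0.10  # cache-hit input tokens
OUTPUT = 3.00        # output tokens

def run_cost(fresh_m: float, cached_m: float, output_m: float) -> float:
    """Dollar cost for a workload measured in millions of tokens."""
    return fresh_m * INPUT + cached_m * CACHED_INPUT + output_m * OUTPUT

# Hypothetical swarm job: 100 sub-agents each reading a shared 0.5M-token
# context. Without caching, every read bills at the fresh-input rate.
no_cache = run_cost(fresh_m=100 * 0.5, cached_m=0, output_m=2)
# With caching, only the first read is fresh; the other 99 hit the cache.
with_cache = run_cost(fresh_m=0.5, cached_m=99 * 0.5, output_m=2)
print(f"without cache: ${no_cache:.2f}, with cache: ${with_cache:.2f}")
```

In this sketch the same job drops from $36.00 to $11.25, and the savings grow with the number of sub-agents sharing the context.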
While Kimi K2.5 is open-sourced, it is released under a Modified MIT License that includes a specific clause targeting “hyperscale” commercial users.
The license grants standard permissions to use, copy, modify, and sell the software.
However, it stipulates that if the software or any derivative work is used for a commercial product or service that has more than 100 million monthly active users (MAU) or more than $20 million USD in monthly revenue, the entity must prominently display “Kimi K2.5” on the user interface.
This clause ensures that while the model remains free and open for the vast majority of the developer community and startups, major tech giants cannot white-label Moonshot’s technology without providing visible attribution.
It’s not fully "open source," but the terms are more permissive than Meta's similar Llama license for its "open source" model family, which requires companies with 700 million or more monthly users to obtain a special enterprise license from Meta.
For the practitioners defining the modern AI stack — from LLM decision-makers optimizing deployment cycles to AI orchestration leaders setting up agents and AI-powered automated business processes — Kimi K2.5 represents a fundamental shift in leverage.
By embedding swarm orchestration directly into the model, Moonshot AI effectively hands these resource-constrained builders a synthetic workforce, allowing a single engineer to direct a hundred autonomous sub-agents as easily as a single prompt.
This “scale-out” architecture directly addresses data decision-makers’ dilemma of balancing complex pipelines with limited headcount, while the slashed pricing structure transforms high-context data processing from a budget-breaking luxury into a routine commodity.
Ultimately, K2.5 suggests a future where the primary constraint on an engineering team is no longer the number of hands on keyboards, but the ability of its leaders to choreograph a swarm.
One of the biggest constraints currently facing AI builders who want to deploy agents in service of their individual or enterprise goals is the “working memory” required to manage complex, multi-stage engineering projects.
Typically, when an AI agent operates purely on a stream of text- or voice-based conversation, it lacks the structural permanence to handle dependencies. It knows what to do, but it often forgets why it is doing it, or in what order.
With the release of Tasks for Claude Code (introduced in v2.1.16) last week, Anthropic has introduced a solution that is less about “AI magic” and more about sound software engineering principles.
By moving from ephemeral “To-dos” to persistent “Tasks,” the company is fundamentally re-architecting how the model interacts with time, complexity, and system resources.
This update transforms the tool from a reactive coding assistant into a state-aware project manager, creating the infrastructure necessary to execute the sophisticated workflows outlined in Anthropic's just-released Best Practices guide. Meanwhile, recent changelog updates (v2.1.19) signal a focus on the stability required for enterprise adoption.
To understand the significance of this release for engineering teams, we must look at the mechanical differences between the old “To-do” system and the new “Task” primitive.
Previously, Claude Code utilized a “To-do” list—a lightweight, chat-resident checklist.
As Anthropic engineer Thariq Shihipar wrote in a post on X: “Todos (orange) = ‘help Claude remember what to do’.” These were effective for single-session scripts but fragile for actual engineering. If the session ended, the terminal crashed, or the context window drifted, the plan evaporated.
Tasks (Green) introduce a new layer of abstraction designed for “coordinating work across sessions, subagents, and context windows.” This is achieved through three key architectural decisions:
Dependency Graphs vs. Linear Lists: Unlike a flat Todo list, Tasks support directed acyclic graphs (DAGs). A task can explicitly “block” another. As seen in community demonstrations, the system can determine that Task 3 (Run Tests) cannot start until Task 1 (Build API) and Task 2 (Configure Auth) are complete. This enforcement prevents the “hallucinated completion” errors common in LLM workflows, where a model attempts to test code it hasn’t written yet.
Filesystem Persistence & Durability: Anthropic chose a “UNIX-philosophy” approach to state management. Rather than locking project state inside a proprietary cloud database, Claude Code writes tasks directly to the user’s local filesystem (~/.claude/tasks). This creates durable state. A developer can shut down their terminal, switch machines, or recover from a system crash, and the agent reloads the exact state of the project. For enterprise teams, this persistence is critical—it means the “plan” is now an artifact that can be audited, backed up, or version-controlled, independent of the active session.
Orchestration via Environment Variables: The most potent technical unlock is the ability to share state across sessions. By setting the CLAUDE_CODE_TASK_LIST_ID environment variable, developers can point multiple instances of Claude at the same task list. This allows updates to be “broadcast” to all active sessions, enabling a level of coordination that was previously impossible without external orchestration tools.
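These decisions can be illustrated with a short Python sketch: tasks persisted to disk as JSON survive a session restart, and a task only becomes runnable once everything it is blocked on is done. The file layout and field names here are assumptions for illustration, not Anthropic's actual on-disk format.

```python
import json
from pathlib import Path

TASKS_FILE = Path("tasks.json")  # stand-in for the ~/.claude/tasks store

def save(tasks: dict) -> None:
    TASKS_FILE.write_text(json.dumps(tasks, indent=2))

def load() -> dict:
    return json.loads(TASKS_FILE.read_text())

def runnable(tasks: dict) -> list[str]:
    """A task is runnable only when all of its blockers are done,
    mirroring how a dependency graph prevents 'hallucinated completion.'"""
    return [
        name for name, t in tasks.items()
        if not t["done"] and all(tasks[dep]["done"] for dep in t["blocked_on"])
    ]

tasks = {
    "build_api":      {"done": False, "blocked_on": []},
    "configure_auth": {"done": False, "blocked_on": []},
    "run_tests":      {"done": False, "blocked_on": ["build_api", "configure_auth"]},
}
save(tasks)

state = load()  # durable: survives a crashed or restarted session
assert runnable(state) == ["build_api", "configure_auth"]
state["build_api"]["done"] = True
state["configure_auth"]["done"] = True
save(state)
print(runnable(load()))  # run_tests is now unblocked
```

Because the state lives in a plain file, multiple sessions pointed at the same store see each other's updates, which is the essence of the environment-variable orchestration described above.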
The release of Tasks makes the “Parallel Sessions” described in Anthropic’s Best Practices guide practical. The documentation suggests a Writer/Reviewer pattern that leverages this shared state:
Session A (Writer) picks up Task #1 (“Implement Rate Limiter”).
Session A marks it complete.
Session B (Reviewer), observing the shared state update, sees Task #2 (“Review Rate Limiter”) is now unblocked.
Session B begins the review in a clean context, unbiased by the generation process.
This aligns with the guide’s advice to “fan out” work across files, using scripts to loop through tasks and call Claude in parallel. Crucially, patch v2.1.17 fixed “out-of-memory crashes when resuming sessions with heavy subagent usage,” indicating that Anthropic is actively optimizing the runtime for these high-load, multi-agent scenarios.
For decision-makers evaluating Claude Code for production pipelines, the recent changelogs (v2.1.16–v2.1.19) reveal a focus on reliability and integration.
The Best Practices guide explicitly endorses running Claude in Headless Mode (claude -p). This allows engineering teams to integrate the agent into CI/CD pipelines, pre-commit hooks, or data processing scripts.
For example, a nightly cron job could instantiate a Claude session to “Analyze the day’s log files for anomalies,” using a Task list to track progress through different log shards.
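Such a job might be scripted along these lines. Only the `claude -p` (headless) invocation comes from Anthropic's guide; the directory layout and prompt wording are illustrative.

```python
import subprocess
from pathlib import Path

def headless_command(shard: Path) -> list[str]:
    """Build a non-interactive Claude Code invocation for one log shard.
    `claude -p` runs a single prompt headlessly and exits; the prompt
    text here is a made-up example."""
    return ["claude", "-p", f"Analyze {shard} for anomalies and summarize them."]

def analyze_shards(log_dir: Path) -> None:
    """The loop a nightly cron job would run over the day's log shards."""
    for shard in sorted(log_dir.glob("*.log")):
        subprocess.run(headless_command(shard), check=True)

# Show the command that would be executed for one shard.
cmd = headless_command(Path("logs/2025-01-15.log"))
print(" ".join(cmd))
```

A cron entry would then simply invoke this script each night, with a Task list (or a results file) tracking which shards have been processed.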
The move to autonomous agents introduces new failure modes, which recent patches have addressed:
Dangling Processes: v2.1.19 fixed an issue where Claude Code processes would hang when the terminal closed; the system now catches EIO errors and ensures a clean exit (using SIGKILL as a fallback).
Hardware Compatibility: Fixes for crashes on processors without AVX support ensure broader deployment compatibility.
Git Worktrees: Fixes for resume functionality when working across different directories or git worktrees ensure that the “state” follows the code, not just the shell session.
Recognizing that enterprise workflows cannot turn on a dime, Anthropic introduced the CLAUDE_CODE_ENABLE_TASKS environment variable (v2.1.19). Setting this to false allows teams to opt-out of the new system temporarily, preserving existing workflows while they migrate to the Task-based architecture.
For the individual developer, the Task system solves the “context economy” problem. Anthropic’s documentation warns that “Claude’s context window… is the most important resource to manage,” and that performance degrades as it fills.
Before Tasks, clearing the context was dangerous—you wiped the agent’s memory of the overall plan. Now, because the plan is stored on disk, users can follow the best practice of “aggressive context management.” Developers can run /clear or /compact to free up tokens for the model’s reasoning, without losing the project roadmap.
The changelog also highlights quality-of-life improvements for power users building complex scripts:
Shorthand Arguments: Users can now access custom command arguments via $0, $1, etc., making it easier to script reusable “Skills” (e.g., a /refactor command that takes a filename as an argument).
Keybindings: Fully customizable keyboard shortcuts (/keybindings) allow for faster interaction loops.
With the introduction of Tasks, Anthropic is signaling that the future of coding agents is project management.
By giving Claude Code a persistent memory, a way to understand dependency, and the stability fixes required for long-running processes, they have moved the tool from a “copilot” that sits next to you to a “subagent” that can be trusted to run in the background — especially when powered by Anthropic’s most performant model, Claude Opus 4.5.
It is a technical evolution that acknowledges a simple truth: in the enterprise, the code is cheap; it is the context, the plan, and the reliability that are precious.
When Anthropic announced Monday that it was embedding nine workplace applications directly inside Claude, transforming its AI chatbot into what I earlier described as a “workplace command center,” Asana was among the headliners.
But while the broader launch signals a new era of AI-native productivity tools, Asana’s participation reflects a deeper strategic bet — one that positions the project management company not as an AI competitor, but as the essential context layer that makes any AI model more useful.
In an exclusive interview with VentureBeat, Arnab Bose, Asana’s Chief Product Officer, explained the thinking behind the partnership and why the company chose to embrace external AI providers rather than build proprietary models.
“The AI landscape is advancing at a breakneck pace,” Bose said. “We believe our customers are best served when they have access to the latest, most powerful reasoning capabilities from best-in-class providers like Anthropic, rather than being locked into a single, proprietary model that may fall behind quickly.”
The integration arrives at a pivotal moment for Asana: the company is navigating a leadership transition after co-founder Dustin Moskovitz’s retirement, competing against rivals racing to embed AI into productivity software, and betting that its proprietary “Work Graph” — the company’s mapping of how tasks, people, and goals connect inside organizations — can differentiate it in an increasingly crowded market.
The strategic logic Bose outlined goes beyond simply offering Claude users another tool to connect. At its core, Asana is making a bet about where value will accrue in the AI era — and the company believes context will matter more than raw model capability.
“An LLM in isolation is context-starved,” Bose told VentureBeat. “It knows how to write, but it doesn’t know your business—your goals, your knowledge, your specific approvals, or your historical relationships. Asana provides the scaffolding—the Work Graph data model—that grounds those external models in the reality of how your company actually operates.”
It’s a framing that positions Asana as essential infrastructure rather than a replaceable application. If Bose is right, then even as AI models from Anthropic, OpenAI, and Google grow more powerful, they will remain fundamentally limited without deep integration into how organizations actually function.
“Most errors happen because models are context-starved,” Bose said. “Asana solves this with context that is unique to each business.”
The argument has implications beyond Asana. It suggests a future where AI capability becomes increasingly commoditized, while the companies that control rich organizational data — project histories, approval workflows, team relationships — become the essential partners that make AI useful in enterprise settings.
In practice, the Claude integration allows users to create and manage Asana projects entirely through natural conversation. When a user connects their Asana account via OAuth authentication, Claude gains the ability to read project data, create new tasks, and build entire project structures based on natural language instructions.
A marketing team discussing a product launch in Claude can simply say: “Create a Q2 product launch project with phases for creative development, partner outreach, press kit, and launch day.” Claude then generates the project structure, complete with sections and tasks, which the user can review before pushing it live to Asana.
“When you use Claude to explore a new initiative, like brainstorming a campaign structure, outlining a project plan, or mapping out a cross-functional launch, you can turn that thinking into real, structured work in Asana without breaking your flow,” the company said in its press release announcing the integration.
The synchronization runs in real time. Changes made through Claude appear immediately in Asana, and status updates from Asana can be pulled into Claude conversations for on-the-fly reporting. Users can ask questions like “What’s behind schedule in our marketing campaigns right now?” and receive answers grounded in their actual project data.
One of the key design decisions in the integration is a strict requirement for human oversight. Bose emphasized that Claude cannot act autonomously within Asana — every consequential action requires explicit user approval.
“Our architecture follows a strict human-in-the-loop philosophy where AI actions—from drafting project plans to summarizing risks—has a human in the loop to course correct, check quality, and ultimately give final sign-off when working with AI,” Bose told VentureBeat. “Users review and approve before tasks are created and projects are built.”
When asked whether Claude could potentially access projects or tasks that a user wouldn’t normally have permission to see, Bose was direct: “No. Users need to authenticate via OAuth with their Asana credentials to use this integration, and Claude respects their permissions and access.”
The approach is an increasingly common pattern in enterprise AI — giving artificial intelligence significant capabilities while maintaining human control over final decisions. It addresses one of the core anxieties around AI in workplace settings: the fear that automated systems will make mistakes that propagate through organizations before anyone notices.
When asked about audit capabilities for enterprise administrators, Bose said admins can monitor usage information about Claude in Asana’s Admin App Management portal, with deeper audit log visibility potentially coming based on customer feedback.
Notably, Asana is not betting exclusively on Claude. Bose emphasized the company’s commitment to working with multiple AI providers, positioning Asana as a neutral platform that works with whichever AI systems its customers prefer.
“Our philosophy is to meet users where they want to work,” Bose said. “We are building the work platform for today and the future which means being the best front-end for any vendor’s agents.”
He confirmed that Asana offers “foundational connectors” with both ChatGPT and Google Gemini and is working to deepen those integrations. The company is also committed to emerging industry standards for AI agent interoperability, including the Agent-to-Agent protocol and MCP.
“We want to be the best front-end for agents from any vendor,” Bose said, describing a vision where Asana becomes the coordination layer through which various AI systems — whether from Anthropic, OpenAI, Google, or others — can operate within enterprise workflows.
This multi-provider approach differs from companies that have tied themselves exclusively to a single AI partner. It reflects both a pragmatic recognition that the AI landscape remains volatile and a strategic bet that Asana’s value lies in its data and workflow capabilities rather than any particular AI model.
The Claude integration arrives as Asana navigates significant organizational change. Dustin Moskovitz, the company’s co-founder and longtime CEO, retired earlier this year after announcing his departure during Asana’s fourth-quarter earnings report in March. Moskovitz’s departure triggered immediate market reaction, with Asana’s stock dropping more than 25 percent in after-hours trading following the announcement.
The company subsequently hired Dan Rogers — formerly CEO of software startup LaunchDarkly and previously president of Rubrik and marketing chief at ServiceNow — to take over as chief executive. Rogers started in July, with Moskovitz transitioning to the role of board chairman.
In a recent appearance on the Stratechery podcast, Moskovitz reflected candidly on his tenure. “I don’t like to manage teams, and it wasn’t my intention when we started Asana,” he said. “I’d intended to be more of a independent or head of engineering or something again. Then one thing led to another and I was CEO for 13 years and I just found it quite exhausting.”
Moskovitz — who co-founded Facebook alongside Mark Zuckerberg before leaving to start Asana in 2008 — retains approximately 39 percent of outstanding Asana shares. He said he plans to focus more on his philanthropic endeavors, including Good Ventures and Open Philanthropy, which lists “potential risks from advanced AI” among its focus areas.
When asked about the long-term trajectory of AI in Asana, Bose outlined a vision that balances automation with human judgment — what he described as a “self-driving” organization where humans nonetheless remain at the wheel.
“Our vision is for customers to work however suits them best, alongside AI agents that actually have the context to be helpful and productive,” he said. “But the goal is not for agents to make important decisions on their own. That is where humans provide value: having the judgment, relationships, and nuance to make complex decisions.”
He described a future in which AI handles “orchestration” — spotting patterns, flagging risks, managing follow-ups — while humans retain authority over strategy and trade-offs. As an example, Bose pointed to Asana’s AI Teammates feature, which the company introduced in beta last year.
“Asana AI Teammates — built on the Work Graph, so they understand who is doing what, by when, and why — can flag that three teams are behind on dependencies for a launch and draft a mitigation plan,” Bose said. “But a human reviews it, adjusts based on business priorities, and makes the call on what happens next.”
The question is whether that boundary will hold as AI capabilities advance. Anthropic and OpenAI are both racing to build more capable “agentic” systems that can execute multi-step tasks with less human oversight. If those systems become reliable enough, the human-in-the-loop requirement may shift from necessity to preference — a transition Asana appears to be preparing for, even as it emphasizes human control today.
The Asana integration in Claude is available immediately to all Asana customers who have a paid Claude subscription. Users can connect Asana through Claude’s app directory or request that their administrator enable the integration for their workspace.
The interactive app feature is available on Claude’s web and desktop applications for Pro, Max, Team, and Enterprise subscribers. Once connected, users can mention Asana in any Claude conversation to start creating projects, assigning tasks, or pulling status updates from their existing work.
The industry consensus is that 2026 will be the year of “agentic AI.” We are rapidly moving past chatbots that simply summarize text. We are entering the era of autonomous agents that execute tasks. We expect them to book flights, diagnose system outages, manage cloud infrastructure and personalize media streams in real-time.
As a technology executive overseeing platforms that serve 30 million concurrent users during massive global events like the Olympics and the Super Bowl, I have seen the unsexy reality behind the hype: Agents are incredibly fragile.
Executives and VCs obsess over model benchmarks. They debate Llama 3 versus GPT-4. They focus on maximizing context window sizes. Yet they are ignoring the actual failure point: autonomous agents fail in production primarily because of poor data hygiene.
In the previous era of “human-in-the-loop” analytics, data quality was a manageable nuisance. If an ETL pipeline broke, a dashboard might display an incorrect revenue number. A human analyst would spot the anomaly, flag it and fix it. The blast radius was contained.
In the new world of autonomous agents, that safety net is gone.
If a data pipeline drifts today, an agent doesn’t just report the wrong number. It takes the wrong action. It provisions the wrong server type. It recommends a horror movie to a user watching cartoons. It hallucinates a customer service answer based on corrupted vector embeddings.
To run AI at the scale of the NFL or the Olympics, I realized that standard data cleaning is insufficient. We cannot just “monitor” data. We must legislate it.
One solution to this problem is a “data quality creed” framework that functions as a “data constitution.” It enforces thousands of automated rules before a single byte of data is allowed to touch an AI model. While I applied this specifically to the streaming architecture at NBCUniversal, the methodology is universal for any enterprise looking to operationalize AI agents.
Here is why “defensive data engineering” and the Creed philosophy are the only ways to survive the Agentic era.
The core problem with AI Agents is that they trust the context you give them implicitly. If you are using RAG, your vector database is the agent’s long-term memory.
Standard data quality issues are catastrophic for vector databases. In traditional SQL databases, a null value is just a null value. In a vector database, a null value or a schema mismatch can warp the semantic meaning of the entire embedding.
Consider a scenario where metadata drifts. Suppose your pipeline ingests video metadata, but a race condition causes the “genre” tag to slip. Your metadata might tag a video as “live sports,” but the embedding was generated from a “news clip.” When an agent queries the database for “touchdown highlights,” it retrieves the news clip because the vector similarity search is operating on a corrupted signal. The agent then serves that clip to millions of users.
At scale, you cannot rely on downstream monitoring to catch this. By the time an anomaly alarm goes off, the agent has already made thousands of bad decisions. Quality controls must shift to the absolute “left” of the pipeline.
The Creed framework acts as a gatekeeper. It is a multi-tenant quality architecture that sits between ingestion sources and AI models.
For technology leaders looking to build their own “constitution,” here are the three non-negotiable principles I recommend.
1. The “quarantine” pattern is mandatory: In many modern data organizations, engineers favor the “ELT” approach. They dump raw data into a lake and clean it up later. For AI Agents, this is unacceptable. You cannot let an agent drink from a polluted lake.
The Creed methodology enforces a strict “dead letter queue.” If a data packet violates a contract, it is immediately quarantined. It never reaches the vector database. It is far better for an agent to say “I don’t know” due to missing data than to confidently lie due to bad data. This “circuit breaker” pattern is essential for preventing high-profile hallucinations.
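A minimal sketch of the quarantine pattern, assuming a hypothetical data contract (the field names `video_id`, `genre`, and `timestamp` are illustrative, not from any real Creed implementation):

```python
from dataclasses import dataclass, field

# Hypothetical data contract for this sketch: required fields and their types.
CONTRACT = {"video_id": str, "genre": str, "timestamp": float}

@dataclass
class QuarantineGate:
    """Validates records against the contract before they can reach the vector store."""
    accepted: list = field(default_factory=list)
    dead_letter_queue: list = field(default_factory=list)

    def ingest(self, record: dict) -> bool:
        for key, expected_type in CONTRACT.items():
            if not isinstance(record.get(key), expected_type):
                # Contract violation: quarantine the record, never index it.
                self.dead_letter_queue.append(record)
                return False
        self.accepted.append(record)
        return True

gate = QuarantineGate()
gate.ingest({"video_id": "v1", "genre": "live sports", "timestamp": 1.0})  # passes
gate.ingest({"video_id": "v2", "genre": None, "timestamp": 2.0})           # quarantined
```

The key design choice is that a failed record returns `False` and goes to the dead letter queue for inspection; nothing downstream ever sees it.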
2. Schema is law: For years, the industry moved toward “schemaless” flexibility to move fast. We must reverse that trend for core AI pipelines. We must enforce strict typing and referential integrity.
In my experience, a robust system requires scale. The implementation I oversee currently enforces more than 1,000 active rules running across real-time streams. These aren’t just checking for nulls. They check for business logic consistency.
Example: Does the “user_segment” in the event stream match the active taxonomy in the feature store? If not, block it.
Example: Is the timestamp within the acceptable latency window for real-time inference? If not, drop it.
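As a sketch, the two rules above can be expressed as executable checks. The taxonomy values and the five-second latency window are assumptions for illustration:

```python
ACTIVE_TAXONOMY = {"premium", "standard", "trial"}  # assumed feature-store taxonomy
MAX_LATENCY_SECONDS = 5.0                           # assumed real-time inference window

def check_event(event: dict, now: float) -> list:
    """Return a list of business-logic violations; an empty list means the event passes."""
    violations = []
    # Rule: user_segment must exist in the active taxonomy. If not, block it.
    if event.get("user_segment") not in ACTIVE_TAXONOMY:
        violations.append("unknown user_segment")
    # Rule: timestamp must be within the latency window. If not, drop it.
    if now - event.get("ts", 0.0) > MAX_LATENCY_SECONDS:
        violations.append("stale timestamp")
    return violations

ok = check_event({"user_segment": "premium", "ts": 999.0}, now=1000.0)
bad = check_event({"user_segment": "vip", "ts": 100.0}, now=1000.0)
```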
3. Vector consistency checks are non-negotiable: This is the new frontier for SREs. We must implement automated checks to ensure that the text chunks stored in a vector database actually match the embedding vectors associated with them. “Silent” failures in an embedding model API often leave you with vectors that point to nothing. This causes agents to retrieve pure noise.
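A sketch of such a check, using a deterministic toy embedder in place of a real embedding API. With a deterministic embedder, exact comparison works; a production audit would re-embed a sample and compare against a cosine-similarity threshold instead:

```python
def toy_embed(text: str) -> list:
    # Stand-in for an embedding API: deterministic vowel-count features.
    return [float(text.count(c)) for c in "aeiou"]

def audit(rows: list) -> list:
    """Re-embed each stored chunk and flag rows whose stored vector no longer matches."""
    return [r["id"] for r in rows if toy_embed(r["text"]) != r["vector"]]

rows = [
    {"id": 1, "text": "touchdown highlights", "vector": toy_embed("touchdown highlights")},
    # Silent failure: the stored vector was generated from different text than the chunk.
    {"id": 2, "text": "touchdown highlights", "vector": toy_embed("evening news clip")},
]
stale = audit(rows)
```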
Implementing a framework like Creed is not just a technical challenge. It is a cultural one.
Engineers generally hate guardrails. They view strict schemas and data contracts as bureaucratic hurdles that slow down deployment velocity. When introducing a data constitution, leaders often face pushback. Teams feel they are returning to the “waterfall” era of rigid database administration.
To succeed, you must flip the incentive structure. We demonstrated that Creed was actually an accelerator. By guaranteeing the purity of the input data, we eliminated the weeks data scientists used to spend debugging model hallucinations. We turned data governance from a compliance task into a “quality of service” guarantee.
If you are building an AI strategy for 2026, stop buying more GPUs. Stop worrying about which foundation model is slightly higher on the leaderboard this week.
Start auditing your data contracts.
An AI Agent is only as autonomous as its data is reliable. Without a strict, automated data constitution like the Creed framework, your agents will eventually go rogue. In an SRE’s world, a rogue agent is far worse than a broken dashboard. It is a silent killer of trust, revenue, and customer experience.
Manoj Yerrasani is a senior technology executive.
The modern customer has just one need that matters: getting the thing they want when they want it. The standard RAG pipeline (embed, retrieve, then generate with an LLM) misunderstands intent, overloads context and misses freshness, repeatedly sending customers down the wrong paths.
Instead, intent-first architecture uses a lightweight language model to parse the query for intent and context, before delivering to the most relevant content sources (documents, APIs, people).
Enterprise AI is a speeding train headed for a cliff. Organizations are deploying LLM-powered search applications at a record pace, while a fundamental architectural issue is setting most up for failure.
A recent Coveo study revealed that 72% of enterprise search queries fail to deliver meaningful results on the first attempt, and Gartner predicts that the majority of conversational AI deployments will fall short of enterprise expectations.
The problem isn’t the underlying models. It’s the architecture around them.
After designing and running live AI-driven customer interaction platforms at scale, serving millions of customer and citizen users at some of the world’s largest telecommunications and healthcare organizations, I’ve come to see a pattern. It’s the difference between successful AI-powered interaction deployments and multi-million-dollar failures.
It’s a cloud-native architecture pattern that I call Intent-First. And it’s reshaping the way enterprises build AI-powered experiences.
Gartner projects the global conversational AI market will balloon to $36 billion by 2032. Enterprises are scrambling to get a slice. The demos are irresistible. Plug your LLM into your knowledge base, and suddenly it can answer customer questions in natural language. Magic.
Then production happens.
A major telecommunications provider I work with rolled out a RAG system with the expectation of driving down the support call rate. Instead, the rate increased. Callers tried AI-powered search, were provided incorrect answers with a high degree of confidence and called customer support angrier than before.
This pattern is repeated over and over. In healthcare, customer-facing AI assistants are providing patients with formulary information that’s outdated by weeks or months. Financial services chatbots are spitting out answers from both retail and institutional product content. Retailers are seeing discontinued products surface in product searches.
The issue isn’t a failure of AI technology. It’s a failure of architecture.
The standard RAG pattern — embedding the query, retrieving semantically similar content, passing to an LLM — works beautifully in demos and proofs of concept. But it falls apart in production use cases for three systematic reasons:
Intent is not context. But standard RAG architectures don’t account for this.
Say a customer types “I want to cancel.” What does that mean? Cancel a service? Cancel an order? Cancel an appointment? During our telecommunications deployment, we found that 65% of queries for “cancel” were actually about orders or appointments, not service cancellation. The RAG system had no way of understanding this intent, so it consistently returned service cancellation documents.
Intent matters. In healthcare, when a patient types “I need to cancel” to cancel an appointment, a prescription refill or a procedure, routing them to medication content instead of scheduling content is not only frustrating — it’s also dangerous.
Enterprise knowledge is vast, spanning dozens of sources such as product catalogs, billing, support articles, policies, promotions and account data. Standard RAG models treat all of it the same, searching everything for every query.
When a customer asks “How do I activate my new phone,” they don’t care about billing FAQs, store locations or network status updates. But a standard RAG model retrieves semantically similar content from every source, returning search results that are a half-step off the mark.
Vector space is time-blind. Semantically, last quarter’s promotion is identical to this quarter’s. But presenting customers with outdated offers shatters trust. We linked a significant percentage of customer complaints to search results that surfaced expired products, offers or features.
The Intent-First architecture pattern is the mirror image of the standard RAG deployment. In the RAG model, you retrieve, then route. In the Intent-First model, you classify before you route or retrieve.
Intent-First architectures use a lightweight language model to parse a query for intent and context, before dispatching to the most relevant content sources (documents, APIs, agents).
The Intent-First pattern is designed for cloud-native deployment, leveraging microservices, containerization and elastic scaling to handle enterprise traffic patterns.
The classifier determines user intent before any retrieval occurs:
ALGORITHM: Intent Classification
INPUT: user_query (string)
OUTPUT: intent_result (object)

1. PREPROCESS query (normalize, expand contractions)
2. CLASSIFY using transformer model:
   - primary_intent ← model.predict(query)
   - confidence ← model.confidence_score()
3. IF confidence < 0.70 THEN
   - RETURN {
       requires_clarification: true,
       suggested_question: generate_clarifying_question(query)
     }
4. EXTRACT sub_intent based on primary_intent:
   - IF primary = "ACCOUNT" → check for ORDER_STATUS, PROFILE, etc.
   - IF primary = "SUPPORT" → check for DEVICE_ISSUE, NETWORK, etc.
   - IF primary = "BILLING" → check for PAYMENT, DISPUTE, etc.
5. DETERMINE target_sources based on intent mapping:
   - ORDER_STATUS → [orders_db, order_faq]
   - DEVICE_ISSUE → [troubleshooting_kb, device_guides]
   - MEDICATION → [formulary, clinical_docs] (healthcare)
6. RETURN {
     primary_intent,
     sub_intent,
     confidence,
     target_sources,
     requires_personalization: true/false
   }
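The classification step can be sketched in Python. Here a simple keyword scorer stands in for the transformer model; the intent names and source mappings follow the pseudocode, while the keyword lists and `billing_kb` source name are invented for illustration:

```python
INTENT_KEYWORDS = {
    "ACCOUNT": ["order", "profile", "delivery"],
    "SUPPORT": ["broken", "activate", "network"],
    "BILLING": ["payment", "charge", "dispute"],
}
SOURCE_MAP = {
    "ACCOUNT": ["orders_db", "order_faq"],
    "SUPPORT": ["troubleshooting_kb", "device_guides"],
    "BILLING": ["billing_kb"],  # illustrative source name
}

def classify(query: str) -> dict:
    """Classify intent before any retrieval; ask a clarifying question if unsure."""
    q = query.lower()
    scores = {i: sum(kw in q for kw in kws) for i, kws in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    total = sum(scores.values())
    confidence = scores[best] / total if total else 0.0
    if confidence < 0.70:
        return {"requires_clarification": True,
                "suggested_question": f"Could you say more about what you need? ({query})"}
    return {"primary_intent": best, "confidence": confidence,
            "target_sources": SOURCE_MAP[best]}

result = classify("where is my order delivery")
```

The structural point survives the toy classifier: retrieval targets are decided by intent first, and low-confidence queries never reach the search index at all.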
Once intent is classified, retrieval becomes targeted:
ALGORITHM: Context-Aware Retrieval
INPUT: query, intent_result, user_context
OUTPUT: ranked_documents

1. GET source_config for intent_result.sub_intent:
   - primary_sources ← sources to search
   - excluded_sources ← sources to skip
   - freshness_days ← max content age
2. IF intent requires personalization AND user is authenticated:
   - FETCH account_context from Account Service
   - IF intent = ORDER_STATUS:
     - FETCH recent_orders (last 60 days)
     - ADD to results
3. BUILD search filters:
   - content_types ← primary_sources only
   - max_age ← freshness_days
   - user_context ← account_context (if available)
4. FOR EACH source IN primary_sources:
   - documents ← vector_search(query, source, filters)
   - ADD documents to results
5. SCORE each document:
   - relevance_score ← vector_similarity × 0.40
   - recency_score ← freshness_weight × 0.20
   - personalization_score ← user_match × 0.25
   - intent_match_score ← type_match × 0.15
   - total_score ← SUM of above
6. RANK by total_score descending
7. RETURN top 10 documents
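The scoring step can be made concrete. The weights are the ones given in the algorithm; the two document entries and their signal values are invented for illustration:

```python
WEIGHTS = {"relevance": 0.40, "recency": 0.20, "personalization": 0.25, "intent_match": 0.15}

def total_score(doc: dict) -> float:
    """Weighted blend of the four retrieval signals."""
    return sum(doc[signal] * weight for signal, weight in WEIGHTS.items())

def rank(docs: list, k: int = 10) -> list:
    return sorted(docs, key=total_score, reverse=True)[:k]

docs = [
    {"id": "current_guide", "relevance": 0.80, "recency": 1.0,
     "personalization": 0.9, "intent_match": 1.0},
    {"id": "expired_promo", "relevance": 0.95, "recency": 0.0,
     "personalization": 0.2, "intent_match": 0.5},
]
top = rank(docs)
```

Note how the recency and intent-match weights let a slightly less similar but fresh, on-intent document outrank a stale one with higher raw vector similarity; that is precisely the time-blindness fix.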
In healthcare deployments, the Intent-First pattern includes additional safeguards:
Healthcare intent categories:
Clinical: Medication questions, symptoms, care instructions
Coverage: Benefits, prior authorization, formulary
Scheduling: Appointments, provider availability
Billing: Claims, payments, statements
Account: Profile, dependents, ID cards
Critical safeguard: Clinical queries always include disclaimers and never replace professional medical advice. The system routes complex clinical questions to human support.
The edge cases are where systems fail. The Intent-First pattern includes specific handlers:
Frustration detection keywords:
Anger: “terrible,” “worst,” “hate,” “ridiculous”
Time: “hours,” “days,” “still waiting”
Failure: “useless,” “no help,” “doesn’t work”
Escalation: “speak to human,” “real person,” “manager”
When frustration is detected, skip search entirely and route to human support.
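As a sketch, the frustration check can sit in front of the classifier as a simple substring scan; the keyword lists are the ones above, and the route names are illustrative:

```python
FRUSTRATION_KEYWORDS = [
    "terrible", "worst", "hate", "ridiculous",    # anger
    "hours", "days", "still waiting",             # time
    "useless", "no help", "doesn't work",         # failure
    "speak to human", "real person", "manager",   # escalation
]

def route(query: str) -> str:
    """Send frustrated users straight to a human; everyone else goes to intent search."""
    q = query.lower()
    if any(keyword in q for keyword in FRUSTRATION_KEYWORDS):
        return "human_support"  # skip search entirely
    return "intent_search"
```

A production version would use a sentiment classifier rather than substrings, but the routing decision, bypassing search entirely on detected frustration, is the pattern that matters.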
The Intent-First pattern applies wherever enterprises deploy conversational AI over heterogeneous content:
| Industry | Intent categories | Key benefit |
| --- | --- | --- |
| Telecommunications | Sales, Support, Billing, Account, Retention | Prevents “cancel” misclassification |
| Healthcare | Clinical, Coverage, Scheduling, Billing | Separates clinical from administrative |
| Financial services | Retail, Institutional, Lending, Insurance | Prevents context mixing |
| Retail | Product, Orders, Returns, Loyalty | Ensures promotional freshness |
After implementing Intent-First architecture across telecommunications and healthcare platforms:
| Metric | Impact |
| --- | --- |
| Query success rate | Nearly doubled |
| Support escalations | Reduced by more than half |
| Time to resolution | Reduced approximately 70% |
| User satisfaction | Improved roughly 50% |
| Return user rate | More than doubled |
The return user rate proved most significant. When search works, users come back. When it fails, they abandon the channel entirely, increasing costs across all other support channels.
The conversational AI market will continue to experience hypergrowth.
But enterprises that build and deploy typical RAG architectures will continue to fail … repeatedly.
AI will confidently give wrong answers, users will abandon digital channels out of frustration and support costs will go up instead of down.
Intent-First is a fundamental shift in how enterprises need to architect and build AI-powered customer conversations. It’s not about better models or more data. It’s about understanding what a user wants before you try to help them.
The sooner an organization recognizes this architectural imperative, the sooner it can capture the efficiency gains this technology is supposed to enable. Those that don’t will spend years debugging why their AI investments haven’t produced the expected business outcomes.
The demo is easy. Production is hard. But the pattern for production success is clear: Intent First.
Sreenivasa Reddy Hulebeedu Reddy is a lead software engineer and enterprise architect.
Claude Cowork is now available to more Claude users, alongside new updates aimed at team workflows. Anthropic made Claude Cowork accessible to users on Team and Enterprise plans, and it brings the platform closer to being a collaborative AI infrastructure…
Presented by Insight Enterprises
Organizations today are trapped in proof-of-concept purgatory because yesterday’s models don’t work for today’s AI challenges.
Everyone’s racing to prove what AI could do. But the real winners are those who have realized that AI deployment is not a technology project — it is a core operational capability.
Success depends on execution, not just far-reaching visions of optimization.
At Insight, we’ve seen this cycle before. For more than 35 years, from our roots as a Value-Added Reseller (VAR) to our evolution as the leading Solutions Integrator, we’ve helped clients cut through the hype and make emerging technology actually work.
AI is following the same pattern. But this time, the stakes are higher, and the timelines are tighter. The organizations making real progress aren’t chasing pilots. They’re building the muscle to deploy, turning experiments and early momentum into measurable outcomes for the business.
MIT research estimates that 95% of enterprise AI initiatives fail to deliver measurable business value. This isn’t a failure of ambition. It’s a failure of deployment.
Too often, leaders are stuck in the “what”, obsessing over which model to use or how fast they can automate a single task. They get locked into long, costly discovery phases with traditional consultants that are all about theory and very little action.
We know this because we’ve lived it. When Insight first began experimenting with generative AI, our early pilots suffered from the same issues we see in the market: they looked great on slides but failed to scale.
We also hit cultural resistance and skills gaps. To overcome this, we had to stop treating AI as a “tool” and start treating it as a “capability.”
We started asking questions like, “Where will AI truly change how our people work and how our business performs — and how do we get there now?” or “Given the AI tech advances, what is the art of the possible? How can we re-imagine our business processes and the work our people do to drive 10x improvement?”
Now, 93% of our 14,000+ teammates are using generative AI tools in their daily work, saving more than 8,500 hours every week through automation and productivity gains.
If there’s one thing we’ve learned from decades of transformation, it’s that success isn’t born from strategy decks or proofs of concept.
It’s earned in the details.
As we brought together our AI experts from across our business, we saw that the most successful client engagements shared three common traits, but not the kind that fit neatly into a diagram. They’re about how work gets done:
Fees tied to outcomes. The old model of billing for time and materials is broken. Commercial models need to put skin in the game. We win when you see measurable business value, not when we complete a project.
Use tech to accelerate past theory. Instead of manual, multi-month discovery phases, look for partners who can accelerate your journey. We do this by providing our clients with an inventory of high-value use cases on day zero, so our consulting engagement starts with a roadmap to action, not just a listening tour.
Look at internal transformation. You cannot successfully deploy for your customers what you haven’t mastered internally. At Insight, we built our suite of AI offerings by first transforming our own business. Our internal story isn’t just a data point. It’s our proof of concept for cultural and operational change. It’s how we break the old perceptions and prove we understand the human side of deployment. In our 2024 survey of IT leaders, 44% identified skills gaps as a top barrier to transformation, and 74% said they have focused time and budget on building custom AI tools. Yet most still lack the deployment discipline to embed them.
That’s the real craft of deployment. It’s not theory, and it’s not hype. It is execution at scale.
And over the past few years, we’ve built on those lessons to give organizations a clear roadmap from ideation to ROI. Real success comes from connecting expertise, tools, and a robust delivery engine to get beyond vision and experimentation.
I love this concept from Boston Consulting Group (BCG) called the 10-20-70 rule.
10% of success comes from algorithms, 20% from data and technology, and 70% from people, process, and culture.
Most companies invest nearly all their energy in the first 30%. But the real advantage (yes, the durable kind) lives in the 70%. That’s where execution happens.
At Insight, we’ve built our entire business around that principle. From cloud to AI, our mission hasn’t changed. We turn technology into a capability that clients can scale and continuously improve.
The “AI theory” era is ending. This next chapter belongs to the doers. To organizations ready to apply intelligence the same way they operationalized cloud or digital transformation.
It requires a delicate balance of innovation and governance, and certainly bold ideas with disciplined execution.
In fact, that philosophy is exactly what inspired Prism, our way of helping organizations bring clarity to complexity. Clients can get a full inventory of AI use cases for their entire business on day zero, skipping the months-long discovery phase of traditional consulting and prioritizing opportunities for immediate impact.
We know that transformation doesn’t begin with algorithms. It begins with mastery, and it’s the kind we’ve earned through decades of deploying and scaling what’s next.
How are you moving from hype to how?
Joyce Mullen is President & CEO at Insight Enterprises.
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
Despite lots of hype, “voice AI” has so far largely been a euphemism for a request-response loop. You speak, a cloud server transcribes your words, a language model thinks, and a robotic voice reads the text back. Functional, but not really conversational.
That all changed in the past week with a rapid succession of powerful, fast, and more capable voice AI model releases from Nvidia, Inworld, FlashLabs, and Alibaba’s Qwen team, combined with a massive talent acquisition and tech licensing deal by Google DeepMind and Hume AI.
Now, the industry has effectively solved the four “impossible” problems of voice computing: latency, fluidity, efficiency, and emotion.
For enterprise builders, the implications are immediate. We have moved from the era of “chatbots that speak” to the era of “empathetic interfaces.”
Here is how the landscape has shifted, the specific licensing models for each new tool, and what it means for the next generation of applications.
The “magic number” in human conversation is roughly 200 milliseconds. That is the typical gap between one person finishing a sentence and another beginning theirs. Anything longer than 500ms feels like a satellite delay; anything over a second breaks the illusion of intelligence entirely.
Until now, chaining together ASR (speech recognition), LLMs (intelligence), and TTS (text-to-speech) resulted in latencies of 2–5 seconds.
Inworld AI’s release of TTS 1.5 directly attacks this bottleneck. By achieving a P90 latency of under 120ms, Inworld has effectively pushed the technology faster than human perception.
For developers building customer service agents or interactive training avatars, this means the “thinking pause” is dead.
Crucially, Inworld claims this model achieves “viseme-level synchronization,” meaning the lip movements of a digital avatar will match the audio frame by frame — a requirement for high-fidelity gaming and VR training.
It’s available via commercial API (pricing tiers based on usage) with a free tier for testing.
Simultaneously, FlashLabs released Chroma 1.0, an end-to-end model that integrates the listening and speaking phases. By processing audio tokens directly via an interleaved text-audio token schedule (1:2 ratio), the model bypasses the need to convert speech to text and back again.
This “streaming architecture” allows the model to generate acoustic codes while it is still generating text, effectively “thinking out loud” in data form before the audio is even synthesized. This one is open source on Hugging Face under the enterprise-friendly, commercially viable Apache 2.0 license.
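The interleaving idea can be illustrated with a toy. This sketches only the 1:2 text-to-audio schedule, not FlashLabs’ actual tokenizer or decoding logic:

```python
def interleave(text_tokens: list, audio_tokens: list, ratio: int = 2) -> list:
    """Emit one text token followed by `ratio` audio tokens, repeating."""
    out, audio = [], iter(audio_tokens)
    for t in text_tokens:
        out.append(("text", t))
        for _ in range(ratio):
            out.append(("audio", next(audio)))
    return out

stream = interleave(["hel", "lo"], [101, 102, 103, 104])
# Audio codes begin flowing after the first text token, before the sentence is complete.
```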
Together, they signal that speed is no longer a differentiator; it is a commodity. If your voice application has a 3-second delay, it is now obsolete. The standard for 2026 is immediate, interruptible response.
Speed is useless if the AI is rude. Traditional voice bots are “half-duplex”—like a walkie-talkie, they cannot listen while they are speaking. If you try to interrupt a banking bot to correct a mistake, it keeps talking over you.
Nvidia’s PersonaPlex, released last week, introduces a 7-billion parameter “full-duplex” model.
Built on the Moshi architecture (originally from Kyutai), it uses a dual-stream design: one stream for listening (via the Mimi neural audio codec) and one for speaking (via the Helium language model). This allows the model to update its internal state while the user is speaking, enabling it to handle interruptions gracefully.
Crucially, it understands “backchanneling”—the non-verbal “uh-huhs,” “rights,” and “okays” that humans use to signal active listening without taking the floor. This is a subtle but profound shift for UI design.
An AI that can be interrupted allows for efficiency. A customer can cut off a long legal disclaimer by saying, “I got it, move on,” and the AI will instantly pivot. This mimics the dynamics of a high-competence human operator.
The model weights are released under the Nvidia Open Model License (permissive for commercial use but with attribution/distribution terms), while the code is MIT Licensed.
While Inworld and Nvidia focused on speed and behavior, Alibaba Cloud’s open source AI powerhouse, the Qwen team, quietly solved the bandwidth problem.
Earlier today, the team released Qwen3-TTS, featuring a breakthrough 12Hz tokenizer. In plain English, this means the model can represent high-fidelity speech using an incredibly small amount of data—just 12 tokens per second.
For comparison, previous state-of-the-art models required significantly higher token rates to maintain audio quality. Qwen’s benchmarks show it outperforming competitors like FireredTTS 2 on key reconstruction metrics (MCD, CER, WER) while using fewer tokens.
Why does this matter for the enterprise? Cost and scale.
A model that requires less data to generate speech is cheaper to run and faster to stream, especially on edge devices or in low-bandwidth environments (like a field technician using a voice assistant on a 4G connection). It turns high-quality voice AI from a server-hogging luxury into a lightweight utility.
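Back-of-the-envelope math makes the efficiency gain concrete. The 12 tokens-per-second figure is from the release; the 50 tokens-per-second comparison rate is an assumption for illustration only:

```python
QWEN_RATE = 12     # tokens per second, per the Qwen3-TTS release
LEGACY_RATE = 50   # assumed rate for an older speech codec, for comparison only

def tokens_for(seconds: float, rate: int) -> int:
    """Token budget needed to represent a stretch of speech at a given rate."""
    return int(seconds * rate)

hour = 3600
qwen_tokens = tokens_for(hour, QWEN_RATE)      # 43,200 tokens per hour of speech
legacy_tokens = tokens_for(hour, LEGACY_RATE)  # 180,000 tokens per hour of speech
```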
It’s available on Hugging Face now under a permissive Apache 2.0 license, perfect for research and commercial application.
Perhaps the most significant news of the week—and the most complex—is Google DeepMind’s move to license Hume AI’s technology and hire its CEO, Alan Cowen, along with key research staff.
While Google integrates this tech into Gemini to power the next generation of consumer assistants, Hume AI itself is pivoting to become the infrastructure backbone for the enterprise.
Under new CEO Andrew Ettinger, Hume is doubling down on the thesis that “emotion” is not a UI feature, but a data problem.
In an exclusive interview with VentureBeat regarding the transition, Ettinger explained that as voice becomes the primary interface, the current stack is insufficient because it treats all inputs as flat text.
“I saw firsthand how the frontier labs are using data to drive model accuracy,” Ettinger says. “Voice is very clearly emerging as the de facto interface for AI. If you see that happening, you would also conclude that emotional intelligence around that voice is going to be critical—dialects, understanding, reasoning, modulation.”
The challenge for enterprise builders has been that LLMs are sociopaths by design—they predict the next word, not the emotional state of the user. A healthcare bot that sounds cheerful when a patient reports chronic pain is a liability. A financial bot that sounds bored when a client reports fraud is a churn risk.
Ettinger emphasizes that this isn’t just about making bots sound nice; it’s about competitive advantage.
When asked about the increasingly competitive landscape and the role of open source versus proprietary models, Ettinger remained pragmatic.
He noted that while open-source models like PersonaPlex are raising the baseline for interaction, the proprietary advantage lies in the data—specifically, the high-quality, emotionally annotated speech data that Hume has spent years collecting.
“The team at Hume ran headfirst into a problem shared by nearly every team building voice models today: the lack of high-quality, emotionally annotated speech data for post-training,” he wrote on LinkedIn. “Solving this required rethinking how audio data is sourced, labeled, and evaluated… This is our advantage. Emotion isn’t a feature; it’s a foundation.”
Hume’s models and data infrastructure are available via proprietary enterprise licensing.
With these pieces in place, the “Voice Stack” for 2026 looks radically different.
The Brain: An LLM (like Gemini or GPT-4o) provides the reasoning.
The Body: Efficient, open-weight models like PersonaPlex (Nvidia), Chroma (FlashLabs), or Qwen3-TTS handle the turn-taking, synthesis, and compression, allowing developers to host their own highly responsive agents.
The Soul: Platforms like Hume provide the annotated data and emotional weighting to ensure the AI “reads the room,” preventing the reputational damage of a tone-deaf bot.
Ettinger claims the market demand for this specific “emotional layer” is exploding beyond just tech assistants.
“We are seeing that very deeply with the frontier labs, but also in healthcare, education, finance, and manufacturing,” Ettinger told me. “As people try to get applications into the hands of thousands of workers across the globe who have complex SKUs… we’re seeing dozens and dozens of use cases by the day.”
This aligns with his comments on LinkedIn, where he revealed that Hume signed “multiple 8-figure contracts in January alone,” validating the thesis that enterprises are willing to pay a premium for AI that doesn’t just understand what a customer said, but how they felt.
For years, enterprise voice AI was graded on a curve. If it understood the user’s intent 80% of the time, it was a success.
The technologies released this week have removed the technical excuses for bad experiences. Latency is solved. Interruption is solved. Bandwidth is solved. Emotional nuance is solvable.
“Just like GPUs became foundational for training models,” Ettinger wrote on his LinkedIn, “emotional intelligence will be the foundational layer for AI systems that actually serve human well-being.”
For the CIO or CTO, the message is clear: The friction has been removed from the interface. The only remaining friction is in how quickly organizations can adopt the new stack.
A new technique developed by researchers at Shanghai Jiao Tong University and other institutions enables large language model agents to learn new skills without the need for expensive fine-tuning.
The researchers propose MemRL, a framework that gives agents episodic memory: the capacity to retrieve and reuse past experiences when solving unseen tasks. MemRL allows agents to use environmental feedback to refine their problem-solving strategies continuously.
MemRL is part of a broader push in the research community to develop continual learning capabilities for AI applications. In experiments on key industry benchmarks, the framework outperformed baselines such as RAG and other memory organization techniques, particularly in complex environments that require exploration and experimentation. This suggests MemRL could become a critical component for building AI applications that must operate in dynamic real-world settings where requirements and tasks constantly shift.
One of the central challenges in deploying agentic applications is adapting the underlying model to new knowledge and tasks after the initial training phase. Current approaches generally fall into two categories: parametric approaches, such as fine-tuning, and non-parametric approaches, such as RAG. But both come with significant trade-offs.
Fine-tuning, while effective for baking in new information, is computationally expensive and slow. More critically, it often leads to catastrophic forgetting, a phenomenon where newly acquired knowledge overwrites previously learned data, degrading the model’s general performance.
Conversely, non-parametric methods like RAG are fundamentally passive; they retrieve information based solely on semantic similarity (typically computed over vector embeddings), without evaluating the actual utility of the information to the input query. This approach assumes that “similar implies useful,” which is often flawed in complex reasoning tasks.
The researchers argue that human intelligence solves this problem by maintaining “the delicate balance between the stability of cognitive reasoning and the plasticity of episodic memory.” In the human brain, stable reasoning (associated with the cortex) is decoupled from dynamic episodic memory. This allows humans to adapt to new tasks without “rewiring neural circuitry” (the rough equivalent of model fine-tuning).
Inspired by humans’ use of episodic memory and cognitive reasoning, MemRL is designed to enable an agent to continuously improve its performance after deployment without compromising the stability of its backbone LLM. Instead of changing the model’s parameters, the framework shifts the adaptation mechanism to an external, self-evolving memory structure.
In this architecture, the LLM’s parameters remain completely frozen. The model acts effectively as the “cortex,” responsible for general reasoning, logic, and code generation, but it is not responsible for storing specific successes or failures encountered after deployment. This structure ensures stable cognitive reasoning and prevents catastrophic forgetting.
To handle adaptation, MemRL maintains a dynamic episodic memory component. Instead of storing plain text documents and static embedding values, as is common in RAG, MemRL organizes memory into “intent-experience-utility” triplets. These contain the user’s query (the intent), the specific solution trajectory or action taken (the experience), and a score, known as the Q-value, that represents how successful this specific experience was in the past (the utility).
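In code, each memory entry can be modeled as a small record. This is an illustrative sketch only; the field names and types are assumptions, not the paper's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryTriplet:
    """One 'intent-experience-utility' entry in an episodic memory bank.

    Field names are illustrative; MemRL's actual schema may differ.
    """
    intent: str            # the user's query or task description
    experience: str        # the solution trajectory or actions taken
    q_value: float = 0.0   # learned utility score, updated from feedback
    embedding: list = field(default_factory=list)  # vector for semantic search

# Example: storing one successful trajectory
entry = MemoryTriplet(
    intent="list all files modified in the last 24 hours",
    experience="ran: find . -mtime -1 -type f",
    q_value=0.8,
)
print(entry.q_value)  # 0.8
```

Because the utility score lives alongside the stored experience rather than inside the model's weights, it can be inspected, corrected, or reset without touching the LLM.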
Crucially for enterprise architects, this new data structure doesn’t require ripping out existing infrastructure. “MemRL is designed to be a ‘drop-in’ replacement for the retrieval layer in existing technology stacks and is compatible with various vector databases,” Muning Wen, a co-author of the paper and PhD candidate at Shanghai Jiao Tong University, told VentureBeat. “The existence and updating of ‘Q-Value’ is solely for better evaluation and management of dynamic data… and is independent of the storage format.”
This utility score is the key differentiator from classic RAG systems. At inference time, MemRL agents employ a “two-phase retrieval” mechanism. First, the system identifies memories that are semantically close to the query to ensure relevance. It then re-ranks these candidates based on their Q-value, effectively prioritizing proven strategies.
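A minimal sketch of such two-phase retrieval, assuming a plain in-memory list and cosine similarity (the paper's retrieval layer is designed to plug into existing vector databases, so this stands in for that infrastructure):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def two_phase_retrieve(query_vec, memory, k_semantic=10, k_final=3):
    """Phase 1: shortlist by semantic similarity. Phase 2: re-rank by Q-value."""
    # memory: list of dicts with 'embedding' and 'q_value' keys (illustrative schema)
    shortlist = sorted(memory,
                       key=lambda m: cosine(query_vec, m["embedding"]),
                       reverse=True)[:k_semantic]
    # Among relevant candidates, prioritize strategies that have proven useful
    return sorted(shortlist, key=lambda m: m["q_value"], reverse=True)[:k_final]

memory = [
    {"experience": "A", "embedding": [1.0, 0.0], "q_value": 0.2},
    {"experience": "B", "embedding": [0.9, 0.1], "q_value": 0.9},
    {"experience": "C", "embedding": [0.0, 1.0], "q_value": 1.0},  # high utility, but irrelevant
]
best = two_phase_retrieve([1.0, 0.0], memory, k_semantic=2, k_final=1)
print(best[0]["experience"])  # "B": both relevant and highest-utility
```

Note that entry "C" is never returned despite its perfect Q-value: the semantic phase filters it out first, which is what keeps the utility re-ranking from surfacing off-topic memories.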
The framework incorporates reinforcement learning directly into the memory retrieval process. When an agent attempts a solution and receives environmental feedback (i.e., success or failure), it updates the Q-value of the retrieved memory. This creates a closed feedback loop: over time, the agent learns to ignore distractor memories and prioritize high-value strategies without ever needing to retrain the underlying LLM.
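The feedback loop can be as simple as an incremental value update. The update rule, learning rate, and binary reward below are assumptions for illustration; the paper's exact formulation may differ:

```python
def update_q(q_value, reward, alpha=0.3):
    """Move a memory's utility estimate toward the observed outcome.

    reward: 1.0 if the retrieved experience led to success, 0.0 on failure.
    alpha: learning rate (illustrative choice, not from the paper).
    """
    return q_value + alpha * (reward - q_value)

# A memory that keeps failing loses utility and sinks in the re-ranking...
q = 0.8
for _ in range(5):
    q = update_q(q, reward=0.0)
print(round(q, 3))  # 0.134

# ...while a successful one rises toward 1.0
q2 = update_q(0.1, reward=1.0)
print(round(q2, 3))  # 0.37
```

This kind of update is also what makes the "poisoned memory" risk discussed below tractable: resetting a contaminated entry's Q-value is a single write, not a retraining run.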
While adding a reinforcement learning step might sound like it adds significant latency, Wen noted that the computational overhead is minimal. “Our Q-value calculation is performed entirely on the CPU,” he said.
MemRL also possesses runtime continual learning capabilities. When the agent encounters a new scenario, the system uses the frozen LLM to summarize the new trajectory and adds it to the memory bank as a new triplet. This allows the agent to expand its knowledge base dynamically as it interacts with the world.
It is worth noting that the automation of the value assignment comes with a risk: If the system mistakenly validates a bad interaction, the agent could learn the wrong lesson. Wen acknowledges this “poisoned memory” risk but notes that unlike black-box neural networks, MemRL remains transparent and auditable. “If a bad interaction is mistakenly classified as a positive example… it may spread more widely,” Wen said. “However … we can easily fix it by removing the contaminated data from the memory bank or resetting their Q-values.”
The researchers evaluated MemRL against several baselines on four diverse industry benchmarks: BigCodeBench (code generation), ALFWorld (embodied navigation), Lifelong Agent Bench (OS and database interaction), and Humanity’s Last Exam (complex multidisciplinary reasoning).
The results showed that MemRL consistently outperformed baselines in both runtime learning (improving during the session) and transfer learning (generalizing to unseen tasks).
The advantages of this value-aware retrieval mechanism were most pronounced in exploration-heavy environments like ALFWorld. In this benchmark, which requires agents to navigate and interact with a simulated household environment, MemRL achieved a relative improvement of approximately 56% over MemP, another agentic memory framework. The researchers found that the reinforcement learning component effectively encouraged the agent to explore and discover solutions for complex tasks that similarity-based retrieval methods often failed to solve.
When the memory bank was frozen and tested on held-out sets to measure generalization, MemRL achieved the highest accuracy across benchmarks. For example, on the Lifelong Agent Bench, it improved significantly upon the standard RAG baseline on OS tasks. This indicates that the system does not merely memorize training data but effectively filters out low-value memories to retain high-utility experiences that generalize to new situations.
MemRL fits within a growing body of research focused on Memory-Based Markov Decision Processes (M-MDP), a formulation that frames memory retrieval as an active decision-making step rather than a passive search function. By treating retrieval as an action that can be optimized via reinforcement learning, frameworks like MemRL and similar approaches such as Memento are paving the way for more autonomous systems.
For enterprise AI, this shift is significant. It suggests a future where agents can be deployed with a general-purpose LLM and then rapidly adapt to specific company workflows, proprietary databases, and unique problem sets through interaction alone. The key shift is that these frameworks treat applications as dynamic environments the agent can learn from.
These emerging capabilities will allow organizations to maintain consistent, high-performance agents that evolve alongside their business needs, solving the problem of stale models without incurring the prohibitive costs of constant retraining.
More broadly, the work marks a transition in how we value data. “In a future where static data is about to be exhausted, the interaction experience generated by each intelligent agent during its lifespan will become the new fuel,” Wen said.