Enterprises have moved quickly to adopt RAG to ground LLMs in proprietary data. In practice, however, many organizations are discovering that retrieval is no longer a feature bolted onto model inference — it has become a foundational system dependency….
By now, many enterprises have deployed some form of RAG. The promise is seductive: index your PDFs, connect an LLM and instantly democratize your corporate knowledge.
But for industries dependent on heavy engineering, the reality has been underwhelming. Engineers ask specific questions about infrastructure, and the bot hallucinates.
The failure isn’t in the LLM. The failure is in the preprocessing.
Standard RAG pipelines treat documents as flat strings of text. They use “fixed-size chunking” (cutting a document every 500 characters). This works for prose, but it destroys the logic of technical manuals. It slices tables in half, severs captions from images, and ignores the visual hierarchy of the page.
Improving RAG reliability isn’t about buying a bigger model; it’s about fixing the “dark data” problem through semantic chunking and multimodal textualization.
Here is the architectural framework for building a RAG system that can actually read a manual.
In a standard Python RAG tutorial, you split text by character count. In an enterprise PDF, this is disastrous.
If a safety specification table spans 1,000 tokens, and your chunk size is 500, you have just split the “voltage limit” header from the “240V” value. The vector database stores them separately. When a user asks, “What is the voltage limit?”, the retrieval system finds the header but not the value. The LLM, forced to answer, often guesses.
The first step to fixing production RAG is abandoning arbitrary character counts in favor of document intelligence.
Using layout-aware parsing tools (such as Azure Document Intelligence), we can segment data based on document structure such as chapters, sections and paragraphs, rather than token count.
Logical cohesion: A section describing a specific machine part is kept as a single vector, even if it varies in length.
Table preservation: The parser identifies a table boundary and forces the entire grid into a single chunk, preserving the row-column relationships that are vital for accurate retrieval.
In our internal qualitative benchmarks, moving from fixed to semantic chunking significantly improved the retrieval accuracy of tabular data, effectively stopping the fragmentation of technical specs.
The second failure mode of enterprise RAG is blindness. A massive amount of corporate IP exists not in text, but in flowcharts, schematics and system architecture diagrams. Standard embedding models (like text-embedding-3-small) cannot “see” these images. They are skipped during indexing.
If your answer lies in a flowchart, your RAG system will say, “I don’t know.”
To make diagrams searchable, we implemented a multimodal preprocessing step using vision-capable models (specifically GPT-4o) before the data ever hits the vector store.
OCR extraction: High-precision optical character recognition pulls text labels from within the image.
Generative captioning: The vision model analyzes the image and generates a detailed natural language description (“A flowchart showing that process A leads to process B if the temperature exceeds 50 degrees”).
Hybrid embedding: This generated description is embedded and stored as metadata linked to the original image.
Now, when a user searches for “temperature process flow,” the vector search matches the description, even though the original source was a PNG file.
For enterprise adoption, accuracy is only half the battle. The other half is verifiability.
In a standard RAG interface, the chatbot gives a text answer and cites a filename. This forces the user to download the PDF and hunt for the page to verify the claim. For high-stakes queries (“Is this chemical flammable?”), users simply won’t trust the bot.
The architecture should implement visual citation. Because we preserved the link between the text chunk and its parent image during the preprocessing phase, the UI can display the exact chart or table used to generate the answer alongside the text response.
This “show your work” mechanism allows humans to verify the AI’s reasoning instantly, bridging the trust gap that kills so many internal AI projects.
While the “textualization” method (converting images to text descriptions) is the practical solution for today, the architecture is rapidly evolving.
We are already seeing the emergence of native multimodal embeddings (such as Cohere’s Embed 4). These models can map text and images into the same vector space without the intermediate step of captioning. While we currently use a multi-stage pipeline for maximum control, the future of data infrastructure will likely involve “end-to-end” vectorization where the layout of a page is embedded directly.
Furthermore, as long context LLMs become cost-effective, the need for chunking may diminish. We may soon pass entire manuals into the context window. However, until latency and cost for million-token calls drop significantly, semantic preprocessing remains the most economically viable strategy for real-time systems.
The difference between a RAG demo and a production system is how it handles the messy reality of enterprise data.
Stop treating your documents as simple strings of text. If you want your AI to understand your business, you must respect the structure of your documents. By implementing semantic chunking and unlocking the visual data within your charts, you transform your RAG system from a “keyword searcher” into a true “knowledge assistant.”
Dippu Kumar Singh is an AI architect and data engineer.
A new study by Google suggests that advanced reasoning models achieve high performance by simulating multi-agent-like debates involving diverse perspectives, personality traits, and domain expertise.
Their experiments demonstrate that this internal debate, which they dub “society of thought,” significantly improves model performance in complex reasoning and planning tasks. The researchers found that leading reasoning models such as DeepSeek-R1 and QwQ-32B, which are trained via reinforcement learning (RL), inherently develop this ability to engage in society of thought conversations without explicit instruction.
These findings offer a roadmap for how developers can build more robust LLM applications and how enterprises can train superior models using their own internal data.
The core premise of society of thought is that reasoning models learn to emulate social, multi-agent dialogues to refine their logic. This hypothesis draws on cognitive science, specifically the idea that human reason evolved primarily as a social process to solve problems through argumentation and engagement with differing viewpoints.
The researchers write that “cognitive diversity, stemming from variation in expertise and personality traits, enhances problem solving, particularly when accompanied by authentic dissent.” Consequently, they suggest that integrating diverse perspectives allows LLMs to develop robust reasoning strategies. By simulating conversations between different internal personas, models can perform essential checks (such as verification and backtracking) that help avoid common pitfalls like unwanted biases and sycophancy.
In models like DeepSeek-R1, this “society” manifests directly within the chain of thought. The researchers note that you do not need separate models or prompts to force this interaction; the debate emerges autonomously within the reasoning process of a single model instance.
The study provides tangible examples of how this internal friction leads to better outcomes. In one experiment involving a complex organic chemistry synthesis problem, DeepSeek-R1 simulated a debate among multiple distinct internal perspectives, including a “Planner” and a “Critical Verifier.”
The Planner initially proposed a standard reaction pathway. However, the Critical Verifier (characterized as having high conscientiousness and low agreeableness) interrupted to challenge the assumption and provided a counter argument with new facts. Through this adversarial check, the model discovered the error, reconciled the conflicting views, and corrected the synthesis path.
A similar dynamic appeared in creative tasks. When asked to rewrite the sentence, “I flung my hatred into the burning fire,” the model simulated a negotiation between a “Creative Ideator” and a “Semantic Fidelity Checker.” After the ideator suggested a version using the word “deep-seated,” the checker retorted, “But that adds ‘deep-seated,’ which wasn’t in the original. We should avoid adding new ideas.” The model eventually settled on a compromise that maintained the original meaning while improving the style.
Perhaps the most striking evolution occurred in “Countdown Game,” a math puzzle where the model must use specific numbers to reach a target value. Early in training, the model tried to solve the problem using a monologue approach. As it learned via RL, it spontaneously split into two distinct personas: a “Methodical Problem-Solver” performing calculations and an “Exploratory Thinker” monitoring progress, who would interrupt failed paths with remarks like “Again no luck … Maybe we can try using negative numbers,” prompting the Methodical Solver to switch strategies.
These findings challenge the assumption that longer chains of thought automatically result in higher accuracy. Instead, diverse behaviors such as looking at responses through different lenses, verifying earlier assumptions, backtracking, and exploring alternatives, drive the improvements in reasoning. The researchers reinforced this by artificially steering a model’s activation space to trigger conversational surprise; this intervention activated a wider range of personality- and expertise-related features, doubling accuracy on complex tasks.
The implication is that social reasoning emerges autonomously through RL as a function of the model’s drive to produce correct answers, rather than through explicit human supervision. In fact, training models on monologues underperformed raw RL that naturally developed multi-agent conversations. Conversely, performing supervised fine-tuning (SFT) on multi-party conversations, and debate significantly outperformed SFT on standard chains of thought.
For developers and enterprise decision-makers, these insights offer practical guidelines for building more powerful AI applications.
Developers can enhance reasoning in general-purpose models by explicitly prompting them to adopt a society of thought structure. However, it is not enough to simply ask the model to chat with itself.
“It’s not enough to ‘have a debate’ but to have different views and dispositions that make debate inevitable and allow that debate to explore and discriminate between alternatives,” James Evans, co-author of the paper, told VentureBeat.
Instead of generic roles, developers should design prompts that assign opposing dispositions (e.g., a risk-averse compliance officer versus a growth-focused product manager) to force the model to discriminate between alternatives. Even simple cues that steer the model to express “surprise” can trigger these superior reasoning paths.
As developers scale test-time compute to allow models to “think” longer, they should structure this time as a social process. Applications should facilitate a “societal” process where the model uses pronouns like “we,” asks itself questions, and explicitly debates alternatives before converging on an answer.
This approach can also expand to multi-agent systems, where distinct personalities assigned to different agents engage in critical debate to reach better decisions.
Perhaps the most significant implication lies in how companies train or fine-tune their own models. Traditionally, data teams scrub their datasets to create “Golden Answers” that provide perfect, linear paths to a solution. The study suggests this might be a mistake.
Models fine-tuned on conversational data (e.g., transcripts of multi-agent debate and resolution) improve reasoning significantly faster than those trained on clean monologues. There is even value in debates that don’t lead to the correct answer.
“We trained on conversational scaffolding that led to the wrong answer, then reinforced the model and found that it performed just as well as reinforcing on the right answer, suggesting that the conversational habits of exploring solutions was the most important for new problems,” Evans said.
This implies enterprises should stop discarding “messy” engineering logs or Slack threads where problems were solved iteratively. The “messiness” is where the model learns the habit of exploration.
For high-stakes enterprise use cases, simply getting an answer isn’t enough. Evans argues that users need to see the internal dissent to trust the output, suggesting a shift in user interface design.
“We need a new interface that systematically exposes internal debates to us so that we ‘participate’ in calibrating the right answer,” Evans said. “We do better with debate; AIs do better with debate; and we do better when exposed to AI’s debate.”
These findings provide a new argument in the “build vs. buy” debate regarding open-weight models versus proprietary APIs. Many proprietary reasoning models hide their chain-of-thought, treating the internal debate as a trade secret or a safety liability.
But Evans argues that “no one has really provided a justification for exposing this society of thought before,” but that the value of auditing these internal conflicts is becoming undeniable. Until proprietary providers offer full transparency, enterprises in high-compliance sectors may find that open-weight models offer a distinct advantage: the ability to see the dissent, not just the decision.
“I believe that large, proprietary models will begin serving (and licensing) the information once they realize that there is value in it,” Evans said.
The research suggests that the job of an AI architect is shifting from pure model training to something closer to organizational psychology.
“I believe that this opens up a whole new frontier of small group and organizational design within and between models that is likely to enable new classes of performance,” Evans said. “My team is working on this, and I hope that others are too.”
Presented by SAP
The consumer packaged goods industry is experiencing a fundamental shift that’s forcing even the most established brands to rethink how they operate. It’s what some folks call the CPG squeeze, or a convergence of margin compression, trade policy headwinds, and the sobering reality that pricing-led growth is no longer a viable strategy. For companies that have relied on price increases to drive revenue, it’s a structural change that demands new approaches to operations, strategy, and competitive positioning.
CPG companies now need to achieve annual productivity gains of 5% or more just to stay competitive. Traditional cost-cutting measures like travel freezes, hiring pauses, and other age-old efficiency drives from simpler times might yield a couple of percentage points at best. The solution lies in a more sophisticated approach: identifying which processes can be digitally enabled before making organizational changes, confronting questions about process efficiency, manual workflows, and opportunities for automation.
But piecemeal solutions that address isolated problems can’t deliver the systemic efficiency gains that CPG companies now require. This is driving increased interest in integrated technology platforms that can support decision-making and execution across all functional areas simultaneously.
Modern CPG operations run on data, but of course not all data strategies are created equal. Companies are facing a dual-barreled challenge: they need deep insights into their internal operations, while simultaneously understanding external market dynamics and consumer behavior. Historically, this has meant extracting operational data, which means losing critical business context in the process, and then needing to invest big on reconstituting that context so it can be analyzed alongside consumer and retail data.
The disconnect creates real problems. When data loses its business context during extraction, companies spend significant time and money trying to rebuild an understanding of what the numbers actually mean. Meanwhile, market conditions change, promotional windows close, and opportunities disappear. In an industry where timing often determines success or failure, this lag in analytical capability becomes the competitive disadvantage.
To address this challenge, advanced data platforms like SAP’s Business Data Cloud are able to import external data with internal SAP operational data that has full business context. CPG brands can combine point-of-sale data from retailers, insights on consumer behavior, and internal transactional information without the traditional extract-and-reconstruct workflow — fundamentally changing the speed at which companies can move from analysis to decision to action.
The impact is particularly significant for promotional planning and revenue management. Instead of spending weeks preparing data for analysis, companies can run scenarios, model outcomes, and adjust strategies in near real-time, which is huge in an industry where promotional windows are measured in days or weeks.
High-stakes promotional moments like the Super Bowl expose how fragile CPG operations have become. Demand spikes are intense, localized, and short-lived, leaving little margin for delayed insights or disconnected execution. In this environment, promotional success depends less on creative merchandising and more on how quickly companies can sense demand, model outcomes, and align pricing, inventory, and execution while the window is still open.
The decision-making behind these promotions involves complex analysis of multiple variables: which products to feature, optimal discount levels, store-specific positioning, and even regional variations in consumer preferences. What resonates with shoppers in one geography may fall flat in another, so effective promotional strategy requires granular analysis down to individual store locations.
Tools like SAP’s Revenue Growth Management solution enable this level of sophistication, helping brands calculate and model promotional lifts and translate those insights into execution-ready decisions. The analysis accounts for regional taste preferences, local competitive dynamics, and historical performance data to optimize every promotional decision.
But promotional planning is only valuable if it can be executed effectively. This is where many CPG companies encounter friction between strategy and operations. Data analysis might pinpoint the perfect promotional mix, but without ensuring product availability, maintaining shelf presence, and executing physical merchandising, the analysis is pretty much academic. That’s why integration between promotional planning systems, supply chain and financial planning systems and ERP platforms are critical.
For high-velocity promotional periods, companies must forecast demand accurately, position inventory strategically, and execute distribution flawlessly. This is particularly complex for categories like snacks and beverages, where direct store delivery models are common. Managing shelf presence is critical, because an empty shelf means consumers will switch to competitive products or abandon the purchase entirely. And it requires real-time visibility into multiple layers of the supply chain across a variety of data sources, and the operational capabilities to act upon quickly.
Modern warehouse management systems, including SAP Extended Warehouse Management, provide the granular visibility needed to track inventory across these multiple states. When combined with DSD-specific applications, such as SAP’s last mile distribution solution, that optimize driver routes, delivery schedules, and in-store execution, CPG companies can maintain the shelf presence that drives promotional success. Sales execution tools, such as SAP’s retail execution offering in SAP Sales Cloud, allow field teams to audit stores and report on actual conditions. This helps gives headquarters clear, accurate visibility into what’s happening at the point of purchase.
Artificial intelligence is moving beyond experimental use cases to practical applications across CPG operations. In warehouse environments, AI-enhanced systems can optimize task management, improve forecasting accuracy, and streamline returns processing. For supply chain planning, AI assists in generating demand scenarios that account for multiple variables affecting product movement.
SAP’s integration of Joule into Integrated Business Planning software demonstrates how conversational AI can transform planning workflows. Instead of navigating complex interfaces to access supply chain data, planners can ask natural language questions and receive immediate, AI-driven responses based on real-time information. This reduces the friction in accessing insights and accelerates decision-making during critical planning cycles.
Advanced warehouse operations are benefiting from AI agents that can enhance inventory risk analysis, optimize task management, and improve forecast accuracy. These aren’t just faster versions of existing processes. Instead, they represent qualitatively different capabilities that can identify patterns and risks that human analysts might miss amid the volume and complexity of modern supply chain operations.
Revenue management, or determining optimal pricing and promotional strategies, is particularly well-suited to AI assistance, because analyzing how different price points, promotional tactics, and positioning strategies interact across thousands of stores and products is complex beyond human analytical capacity. Machine learning can identify patterns and optimize decisions at a scale and speed that manual analysis cannot match. AI capabilities being built into revenue growth management platforms promise to make promotional planning both more sophisticated and more efficient.
Perhaps most significantly for CPG companies facing the productivity imperative, intelligent inventory management systems are using machine learning to predict delivery dates and provide real-time analytics for distribution decisions. Sales order fulfillment monitoring can predict fulfillment risks before they materialize, enabling proactive intervention. These AI capabilities address issues like product availability and reliable delivery during critical promotional windows, which are some of the highest-stakes challenges in CPG operations.
But the most impactful AI applications in CPG won’t necessarily be the most visible. Instead of flashy consumer-facing features, the real value comes from embedding intelligence into core operational processes. Incremental improvements across dozens of workflows compound into substantial competitive advantages over time.
The CPG squeeze isn’t a temporary condition that companies can wait out. The structural factors driving margin compression and limiting pricing power reflect fundamental market changes. Trade policies will continue evolving. Consumer behavior will keep shifting. The companies that emerge stronger won’t be those with the best products alone, they’ll be those that built the most efficient, responsive operations.
Jon Dano is Industry Advisor for Consumer Products, at SAP.
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
Two days after releasing what analysts call the most powerful open-source AI model ever created, researchers from China’s Moonshot AI logged onto Reddit to face a restless audience. The Beijing-based startup had reason to show up. Kimi K2.5 had just landed headlines about closing the gap with American AI giants and testing the limits of US. chip export controls. But the developers waiting on r/LocalLLaMA, a forum where engineers trade advice on running powerful language models on everything from a single consumer GPU to a small rack of prosumer hardware, had a different concern.
They wanted to know when they could actually use it.
The three-hour Ask Me Anything session became an unexpectedly candid window into frontier AI development in 2026 — not the polished version that appears in corporate blogs, but the messy reality of debugging failures, managing personality drift, and confronting a fundamental tension that defines open-source AI today.
Moonshot had published the model’s weights for anyone to download and customize. The file runs roughly 595 gigabytes. For most of the developers in the thread, that openness remained theoretical.
Three Moonshot team members participated under the usernames ComfortableAsk4494, zxytim, and ppwwyyxx. Over approximately 187 comments, they fielded questions about architecture, training methodology, and the philosophical puzzle of what gives an AI model its “soul.” They also offered a picture of where the next round of progress will come from — and it wasn’t simply “more parameters.”
The very first wave of questions treated Kimi K2.5 less like a breakthrough and more like a logistics headache.
One user asked bluntly why Moonshot wasn’t creating smaller models alongside the flagship. “Small sizes like 8B, 32B, 70B are great spots for the intelligence density,” they wrote. Another said huge models had become difficult to celebrate because many developers simply couldn’t run them. A third pointed to American competitors as size targets, requesting coder-focused variants that could fit on modest GPUs.
Moonshot’s team didn’t announce a smaller model on the spot. But it acknowledged the demand in terms that suggested the complaint was familiar. “Requests well received!” one co-host wrote. Another noted that Moonshot’s model collection already includes some smaller mixture-of-experts models on Hugging Face, while cautioning that small and large models often require different engineering investments.
The most revealing answer came when a user asked whether Moonshot might build something around 100 billion parameters optimized for local use. The Kimi team responded by floating a different compromise: a 200 billion or 300 billion parameter model that could stay above what it called a “usability threshold” across many tasks.
That reply captured the bind open-weight labs face. A 200-to-300 billion parameter model would broaden access compared to a trillion-parameter system, but it still assumes multi-GPU setups or aggressive quantization. The developers in the thread weren’t asking for “somewhat smaller.” They were asking for models sized for the hardware they actually own — and for a roadmap that treats local deployment as a first-class constraint rather than a hobbyist afterthought.
As the thread moved past hardware complaints, it turned to what many researchers now consider the central question in large language models: have scaling laws begun to plateau?
One participant asked directly whether scaling had “hit a wall.” A Kimi representative replied with a diagnosis that has become increasingly common across the industry. “The amount of high-quality data does not grow as fast as the available compute,” they wrote, “so scaling under the conventional ‘next token prediction with Internet data’ will bring less improvement.”
Then the team offered its preferred escape route. It pointed to Agent Swarm, Kimi K2.5’s ability to coordinate up to 100 sub-agents working in parallel, as a form of “test-time scaling” that could open a new path to capability gains. In the team’s framing, scaling doesn’t have to mean only larger pretraining runs. It can also mean increasing the amount of structured work done at inference time, then folding those insights back into training through reinforcement learning.
“There might be new paradigms of scaling that can possibly happen,” one co-host wrote. “Looking forward, it’s likely to have a model that learns with less or even zero human priors.”
The claim implies that the unit of progress may be shifting from parameter count and pretraining loss curves toward systems that can plan, delegate, and verify — using tools and sub-agents as building blocks rather than relying on a single massive forward pass.
On paper, Agent Swarm sounds like a familiar idea in a new wrapper: many AI agents collaborating on a task. The AMA surfaced the more important details — where the memory goes, how coordination happens, and why orchestration doesn’t collapse into noise.
A developer raised a classic multi-agent concern. At a scale of 100 sub-agents, an orchestrator agent often becomes a bottleneck, both in latency and in what the community calls “context rot” — the degradation in performance that occurs as a conversation history fills with internal chatter and tool traces until the model loses the thread.
A Kimi co-host answered with a design choice that matters for anyone building agent systems in enterprise settings. The sub-agents run with their own working memory and send back results to the orchestrator, rather than streaming everything into a shared context. “This allows us to scale the total context length in a new dimension!” they wrote.
Another developer pressed on performance claims. Moonshot has publicly described Agent Swarm as capable of achieving about 4.5 times speedup on suitable workflows, but skeptics asked whether that figure simply reflects how parallelizable a given task is. The team agreed: it depends. In some cases, the system decides that a task doesn’t require parallel agents and avoids spending the extra compute. It also described sub-agent token budgets as something the orchestrator must manage, assigning each sub-agent a task of appropriate size.
Read as engineering rather than marketing, Moonshot was describing a familiar enterprise pattern: keep the control plane clean, bound the outputs from worker processes, and avoid flooding a coordinator with logs it can’t digest.
The most consequential shift hinted at in the AMA wasn’t a new benchmark score. It was a statement about priorities.
One question asked whether Moonshot was moving compute from “System 1” pretraining to “System 2” reinforcement learning — shorthand for shifting from broad pattern learning toward training that explicitly rewards reasoning and correct behavior over multi-step tasks. A Kimi representative replied that RL compute will keep increasing, and suggested that new RL objective functions are likely, “especially in the agent space.”
That line reads like a roadmap. As models become more tool-using and task-decomposing, labs will spend more of their budget training models to behave well as agents — not merely to predict tokens.
For enterprises, this matters because RL-driven improvements often arrive with tradeoffs. A model can become more decisive, more tool-happy, or more aligned to reward signals that don’t map neatly onto a company’s expectations. The AMA didn’t claim Moonshot had solved those tensions. It did suggest the team sees reinforcement learning as the lever that will matter more in the next cycle than simply buying more GPUs.
When asked about the compute gap between Moonshot and American labs with vastly larger GPU fleets, the team was candid. “The gap is not closing I would say,” one co-host wrote. “But how much compute does one need to achieve AGI? We will see.”
Another offered a more philosophical framing: “There are too many factors affecting available compute. But no matter what, innovation loves constraints.”
Open-weight releases now come with a standing suspicion: did the model learn too much from competitors? That suspicion can harden quickly into accusations of distillation, where one AI learns by training on another AI’s outputs.
A user raised one of the most uncomfortable claims circulating in open-model circles — that K2.5 sometimes identifies itself as “Claude,” Anthropic’s flagship model. The implication was heavy borrowing.
Moonshot didn’t dismiss the behavior. Instead it described the conditions under which it happens. With the right system prompt, the team said, the model has a high probability of answering “Kimi,” particularly in thinking mode. But with an empty system prompt, the model drifts into what the team called an “undefined area,” which reflects pretraining data distributions rather than deliberate training choices.
Then it offered a specific explanation tied to a training decision. Moonshot said it had upsampled newer internet coding data during pretraining, and that this data appears more associated with the token “Claude” — likely because developers discussing AI coding assistants frequently reference Anthropic’s model.
The team pushed back on the distillation accusation with benchmark results. “In fact, K2.5 seems to outperform Claude on many benchmarks,” one co-host wrote. “HLE, BrowseComp, MMMU Pro, MathVision, just to name a few.”
For enterprise adopters, the important point isn’t the internet drama. It’s that identity drift is a real failure mode — and one that organizations can often mitigate by controlling system prompts rather than leaving the model’s self-description to chance. The AMA treated prompt governance not as a user-experience flourish, but as operational hygiene.
A recurring theme in the thread was that K2.5’s writing style feels more generic than earlier Kimi models. Users described it as more like a standard “helpful assistant” — a tone many developers now see as the default personality of heavily post-trained models. One user said they loved the personality of Kimi K2 and asked what happened.
A Kimi co-host acknowledged that each new release brings some personality change and described personality as subjective and hard to evaluate. “This is a quite difficult problem,” they wrote. The team said it wants to improve the issue and make personality more customizable per user.
In a separate exchange about whether strengthening coding capability compromises creative writing and emotional intelligence, a Kimi representative argued there’s no inherent conflict if the model is large enough. But maintaining “writing taste” across versions is difficult, they said, because the reward model is constantly evolving. The team relies on internal benchmarks — a kind of meta-evaluation — to track creative writing progress and adjust reward models accordingly.
Another response went further, using language that would sound unusual in a corporate AI specification but familiar to people who use these tools daily. The team talked about the “soul” of a reward model and suggested the possibility of storing a user “state” reflecting taste and using it to condition the model’s outputs.
That exchange points to a product frontier that enterprises often underestimate. Style drift isn’t just aesthetics. It can change how a model explains decisions, how it hedges, how it handles ambiguity, and how it interacts with customers and employees. The AMA made clear that labs increasingly treat “taste” as both an alignment variable and a differentiator — but it remains hard to measure and even harder to hold constant across training runs.
The most revealing cultural insight came in response to a question about surprises during training and reinforcement learning. A co-host answered with a single word, bolded for emphasis: debugging.
“Whether it’s pre-training or post-training, one thing constantly manifests itself as the utmost priority: debugging,” they wrote.
The comment illuminated a theme running through the entire session. When asked about their “scaling ladder” methodology for evaluating new ideas at different model sizes, zxytim offered an anecdote about failure. The team had once hurried to incorporate Kimi Linear, an experimental linear-attention architecture, into the previous model generation. It failed the scaling ladder at a certain scale. They stepped back and went through what the co-host called “a tough debugging process,” and after months finally made it work.
“Statistically, most ideas that work at small scale won’t pass the scaling ladder,” they continued. “Those that do are usually simple, effective, and mathematically grounded. Research is mostly about managing failure, not celebrating success.”
For technical leaders evaluating AI vendors, the admission is instructive. Frontier capability doesn’t emerge from elegant breakthroughs alone. It emerges from relentless fault isolation — and from organizational cultures willing to spend months on problems that might not work.
The AMA also acted as a subtle teaser for Kimi’s next generation.
Developers asked whether Kimi K3 would adopt Moonshot’s linear attention research, which aims to handle long context more efficiently than traditional attention mechanisms. Team members suggested that linear approaches are a serious option. “It’s likely that Kimi Linear will be part of K3,” one wrote. “We will also include other optimizations.”
In another exchange, a co-host predicted K3 “will be much, if not 10x, better than K2.5.”
The team also highlighted continual learning as a direction it is actively exploring, suggesting a future where agents can work effectively over longer time horizons — a critical enterprise need if agents are to handle ongoing projects rather than single-turn tasks. “We believe that continual learning will improve agency and allow the agents to work effectively for much longer durations,” one co-host wrote.
On Agent Swarm specifically, the team said it plans to make the orchestration scaffold available to developers once the system becomes more stable. “Hopefully very soon,” they added.
The session didn’t resolve every question. Some of the most technical prompts — about multimodal training recipes, defenses against reward hacking, and data governance — were deferred to a forthcoming technical report. That’s not unusual. Many labs now treat the most operationally decisive details as sensitive.
But the thread still revealed where the real contests in AI have moved. The gap that matters most isn’t between China and the United States, or between open and closed. It’s the gap between what models promise and what systems can actually deliver.
Orchestration is becoming the product. Moonshot isn’t only shipping a model. It’s shipping a worldview that says the next gains come from agents that can split work, use tools, and return structured results fast. Open weights are colliding with hardware reality, as developers demand openness that runs locally rather than openness that requires a data center. And the battleground is shifting from raw intelligence to reliability — from beating a benchmark by two points to debugging tool-calling discipline, managing memory in multi-agent workflows, and preserving the hard-to-quantify “taste” that determines whether users trust the output.
Moonshot showed up on Reddit in the wake of a high-profile release and a growing geopolitical narrative. The developers waiting there cared about a more practical question: When does “open” actually mean “usable”?
In that sense, the AMA didn’t just market Kimi K2.5. It offered a snapshot of an industry in transition — from larger models to more structured computation, from closed APIs to open weights that still demand serious engineering to deploy, and from celebrating success to managing failure.
“Research is mostly about managing failure,” one of the Moonshot engineers had written. By the end of the thread, it was clear that deployment is, too.
In the race to bring artificial intelligence into the enterprise, a small but well-funded startup is making a bold claim: The problem holding back AI adoption in complex industries has never been the models themselves.
Contextual AI, a two-and-a-half-year-old company backed by investors including Bezos Expeditions and Bain Capital Ventures, on Monday unveiled Agent Composer, a platform designed to help engineers in aerospace, semiconductor manufacturing, and other technically demanding fields build AI agents that can automate the kind of knowledge-intensive work that has long resisted automation.
The announcement arrives at a pivotal moment for enterprise AI. Four years after ChatGPT ignited a frenzy of corporate AI initiatives, many organizations remain stuck in pilot programs, struggling to move experimental projects into full-scale production. Chief financial officers and business unit leaders are growing impatient with internal efforts that have consumed millions of dollars but delivered limited returns.
Douwe Kiela, Contextual AI’s chief executive, believes the industry has been focused on the wrong bottleneck. “The model is almost commoditized at this point,” Kiela said in an interview with VentureBeat. “The bottleneck is context — can the AI actually access your proprietary docs, specs, and institutional knowledge? That’s the problem we solve.”
To understand what Contextual AI is attempting, it helps to understand a concept that has become central to modern AI development: retrieval-augmented generation, or RAG.
When large language models like those from OpenAI, Google, or Anthropic generate responses, they draw on knowledge embedded during training. But that knowledge has a cutoff date, and it cannot include the proprietary documents, engineering specifications, and institutional knowledge that make up the lifeblood of most enterprises.
RAG systems attempt to solve this by retrieving relevant documents from a company’s own databases and feeding them to the model alongside the user’s question. The model can then ground its response in actual company data rather than relying solely on its training.
Kiela helped pioneer this approach during his time as a research scientist at Facebook AI Research and later as head of research at Hugging Face, the influential open-source AI company. He holds a Ph.D. from Cambridge and serves as an adjunct professor in symbolic systems at Stanford University.
But early RAG systems, Kiela acknowledges, were crude.
“Early RAG was pretty crude — grab an off-the-shelf retriever, connect it to a generator, hope for the best,” he said. “Errors compounded through the pipeline. Hallucinations were common because the generator wasn’t trained to stay grounded.”
When Kiela founded Contextual AI in June 2023, he set out to solve these problems systematically. The company developed what it calls a “unified context layer” — a set of tools that sit between a company’s data and its AI models, ensuring that the right information reaches the model in the right format at the right time.
The approach has earned recognition. According to a Google Cloud case study, Contextual AI achieved the highest performance on Google’s FACTS benchmark for grounded, hallucination-resistant results. The company fine-tuned Meta’s open-source Llama models on Google Cloud’s Vertex AI platform, focusing specifically on reducing the tendency of AI systems to invent information.
Agent Composer extends Contextual AI’s existing platform with orchestration capabilities — the ability to coordinate multiple AI tools across multiple steps to complete complex workflows.
The platform offers three ways to create AI agents. Users can start with pre-built agents designed for common technical workflows like root cause analysis or compliance checking. They can describe a workflow in natural language and let the system automatically generate a working agent architecture. Or they can build from scratch using a visual drag-and-drop interface that requires no coding.
What distinguishes Agent Composer from competing approaches, the company says, is its hybrid architecture. Teams can combine strict, deterministic rules for high-stakes steps — compliance checks, data validation, approval gates — with dynamic reasoning for exploratory analysis.
“For highly critical workflows, users can choose completely deterministic steps to control agent behavior and avoid uncertainty,” Kiela said.
The platform also includes what the company calls “one-click agent optimization,” which takes user feedback and automatically adjusts agent performance. Every step of an agent’s reasoning process can be audited, and responses come with sentence-level citations showing exactly where information originated in source documents.
Contextual AI says early customers have reported significant efficiency gains, though the company acknowledges these figures come from customer self-reporting rather than independent verification.
“These come directly from customer evals, which are approximations of real-world workflows,” Kiela said. “The numbers are self-reported by our customers as they describe the before-and-after scenario of adopting Contextual AI.”
The claimed results are nonetheless striking. An advanced manufacturer reduced root-cause analysis from eight hours to 20 minutes by automating sensor data parsing and log correlation. A specialty chemicals company reduced product research from hours to minutes using agents that search patents and regulatory databases. A test equipment maker now generates test code in minutes instead of days.
Keith Schaub, vice president of technology and strategy at Advantest, a semiconductor test equipment company, offered an endorsement. “Contextual AI has been an important part of our AI transformation efforts,” Schaub said. “The technology has been rolled out to multiple teams across Advantest and select end customers, saving meaningful time across tasks ranging from test code generation to customer engineering workflows.”
The company’s other customers include Qualcomm, the semiconductor giant; ShipBob, a tech-enabled logistics provider that claims to have achieved 60 times faster issue resolution; and Nvidia, the chip maker whose graphics processors power most AI systems.
Perhaps the biggest challenge Contextual AI faces is not competing products but the instinct among engineering organizations to build their own solutions.
“The biggest objection is ‘we’ll build it ourselves,'” Kiela acknowledged. “Some teams try. It sounds exciting to do, but is exceptionally hard to do this well at scale. Many of our customers started with DIY, and found themselves still debugging retrieval pipelines instead of solving actual problems 12-18 months later.”
The alternative — off-the-shelf point solutions — presents its own problems, the company argues. Such tools deploy quickly but often prove inflexible and difficult to customize for specific use cases.
Agent Composer attempts to occupy a middle ground, offering a platform approach that combines pre-built components with extensive customization options. The system supports models from OpenAI, Anthropic, and Google, as well as Contextual AI’s own Grounded Language Model, which was specifically trained to stay faithful to retrieved content.
Pricing starts at $50 per month for self-serve usage, with custom enterprise pricing for larger deployments.
“The justification to CFOs is really about increasing productivity and getting them to production faster with their AI initiatives,” Kiela said. “Every technical team is struggling to hire top engineering talent, so making their existing teams more productive is a huge priority in these industries.”
Looking ahead, Kiela outlined three priorities for the coming year: workflow automation with actual write actions across enterprise systems rather than just reading and analyzing; better coordination among multiple specialized agents working together; and faster specialization through automatic learning from production feedback.
“The compound effect matters here,” he said. “Every document you ingest, every feedback loop you close, those improvements stack up. Companies building this infrastructure now are going to be hard to catch.”
The enterprise AI market remains fiercely competitive, with offerings from major cloud providers, established software vendors, and scores of startups all chasing the same customers. Whether Contextual AI’s bet on context over models will pay off depends on whether enterprises come to share Kiela’s view that the foundation model wars matter less than the infrastructure that surrounds them.
But there is a certain irony in the company’s positioning. For years, the AI industry has fixated on building ever-larger, ever-more-powerful models — pouring billions into the race for artificial general intelligence. Contextual AI is making a quieter argument: that for most real-world work, the magic isn’t in the model. It’s in knowing where to look.
Chinese company Moonshot AI upgraded its open-sourced Kimi K2 model, transforming it into a coding and vision model with an architecture that supports an agent swarm orchestration.
The new model, Moonshot Kimi K2.5, is a good option for enterprises that want agents that can automatically pass off actions instead of having a framework be a central decision-maker.
The company characterized Kimi K2.5 as an “all-in-one model” that supports both visual and text inputs, letting users leverage the model for more visual coding projects.
Moonshot did not publicly disclose K2.5’s parameter count, but the Kimi K2 model that it’s based on, had 1 trillion total parameters and 32 billion activated parameters thanks to its mixture-of-experts architecture.
This is the latest open-source model to offer an alternative to the more closed options from Google, OpenAI, and Anthropic, and it outperforms them on key metrics including agentic workflows, coding, and vision.
On the Humanity’s Last Exam (HLE) benchmark, Kimi K2.5 scored 50.2% (with tools), surpassing OpenAI’s GPT-5.2 (xhigh) and Claude Opus 4.5. It also achieved 76.8% on SWE-bench Verified, cementing its status as a top-tier coding model, though GPT-5.2 and Opus 4.5 overtake it here at 80 and 80.9, respectively.
Moonshot said in a press release that it’s seen a 170% increase in users between September and November for Kimi K2 and Kimi K2 Thinking, which was released in early November.
Moonshot aims to leverage self-directed agents and the agent swarm paradigm built into Kimi K2.5. Agent swarm has been touted as the next frontier in enterprise AI development and agent-based systems. It has attracted significant attention in the past few months.
For enterprises, this means that if they build agent ecosystems with Kimi K2.5, they can expect to scale more efficiently. But instead of scaling “up” or growing model sizes to create larger agents, it’s betting on making more agents that can essentially orchestrate themselves.
Kimi K2.5 “creates and coordinates a swarm of specialized agents working in parallel.” The company compared it to a beehive where each agent performs a task while contributing to a common goal. The model learns to self-direct up to 100 sub-agents and can execute parallel workflows of up to 1,500 tool calls.
“Benchmarks only tell half the story. Moonshot AI believes AGI should ultimately be evaluated by its ability to complete real-world tasks efficiently under real-world time constraints. The real metric they care about is: how much of your day did AI actually give back to you? Running in parallel substantially reduces the time needed for a complex task — tasks that required days of work now can be accomplished in minutes,” the company said.
Enterprises considering their orchestration strategies have begun looking at agentic platforms where agents communicate and pass off tasks, rather than following a rigid orchestration framework that dictates when an action is completed.
While Kimi K2.5 may offer a compelling option for organizations that want to use this form of orchestration, some may feel more comfortable avoiding agent-based orchestration baked into the model and instead using a different platform to differentiate the model training from the agentic task.
This is because enterprises often want more flexibility in which models make up their agents, so they can build an ecosystem of agents that tap LLMs that work best for specific actions.
Some agent platforms, such as Salesforce, AWS Bedrock, and IBM, offer separate observability, management, and monitoring tools that help users orchestrate AI agents built with different models and enable them to work together.
The model lets users code visual layouts, including user interfaces and interactions. It reasons over images and videos to understand tasks encoded in visual inputs. For example, K2.5 can reconstruct a website’s code simply by analyzing a video recording of the site in action, translating visual cues into interactive layouts and animations.
“Interfaces, layouts, and interactions that are difficult to describe precisely in language can be communicated through screenshots or screen recordings, which the model can interpret and turn into fully functional websites. This enables a new class of vibe coding experiences,” Moonshot said.
This capability is integrated into Kimi Code, a new terminal-based tool that works with IDEs like VSCode and Cursor.
It supports “autonomous visual debugging,” where the model visually inspects its own output — such as a rendered web page — references documentation, and iterates on the code to fix layout shifts or aesthetic errors without human intervention.
Unlike other multimodal models that can create and understand images, Kimi K2.5 can build frontend interactions for websites with visuals, not just the code behind them.
Moonshot AI has aggressively priced the K2.5 API to compete with major U.S. labs, offering significant reductions compared to its previous K2 Turbo model.
Input: 60 cents per million tokens (a 47.8% decrease).
Cached Input: 10 cents per million tokens (a 33.3% decrease).
Output: $3 per million tokens (a 62.5% decrease).
The low cost of cached inputs ($0.10/M tokens) is particularly relevant for the “Agent Swarm” features, which often require maintaining large context windows across multiple sub-agents and extensive tool usage.
While Kimi K2.5 is open-sourced, it is released under a Modified MIT License that includes a specific clause targeting “hyperscale” commercial users.
The license grants standard permissions to use, copy, modify, and sell the software.
However, it stipulates that if the software or any derivative work is used for a commercial product or service that has more than 100 million monthly active users (MAU) or more than $20 million USD in monthly revenue, the entity must prominently display “Kimi K2.5” on the user interface.
This clause ensures that while the model remains free and open for the vast majority of the developer community and startups, major tech giants cannot white-label Moonshot’s technology without providing visible attribution.
It’s not full “open source” but it is better than Meta’s similar Llama Licensing terms for its “open source” family of models, which required those companies with 700 million or more monthly users to obtain a special enterprise license from the company.
For the practitioners defining the modern AI stack — from LLM decision-makers optimizing deployment cycles to AI orchestration leaders setting up agents and AI-powered automated business processes — Kimi K2.5 represents a fundamental shift in leverage.
By embedding swarm orchestration directly into the model, Moonshot AI effectively hands these resource-constrained builders a synthetic workforce, allowing a single engineer to direct a hundred autonomous sub-agents as easily as a single prompt.
This “scale-out” architecture directly addresses data decision-makers’ dilemma of balancing complex pipelines with limited headcount, while the slashed pricing structure transforms high-context data processing from a budget-breaking luxury into a routine commodity.
Ultimately, K2.5 suggests a future where the primary constraint on an engineering team is no longer the number of hands on keyboards, but the ability of its leaders to choreograph a swarm.
One of the biggest constraints currently facing AI builders who want to deploy agents in service of their individual or enterprise goals is the “working memory” required to manage complex, multi-stage engineering projects.
Typically, when a AI agent operates purely on a stream of text or voice-based conversation, it lacks the structural permanence to handle dependencies. It knows what to do, but it often forgets why it is doing it, or in what order.
With the release of Tasks for Claude Code (introduced in v2.1.16) last week, Anthropic has introduced a solution that is less about “AI magic” and more about sound software engineering principles.
By moving from ephemeral “To-dos” to persistent “Tasks,” the company is fundamentally re-architecting how the model interacts with time, complexity, and system resources.
This update transforms the tool from a reactive coding assistant into a state-aware project manager, creating the infrastructure necessary to execute the sophisticated workflows outlined in Anthropic’s just-released Best Practices guide, while recent changelog updates (v2.1.19) signal a focus on the stability required for enterprise adoption.
To understand the significance of this release for engineering teams, we must look at the mechanical differences between the old “To-do” system and the new “Task” primitive.
Previously, Claude Code utilized a “To-do” list—a lightweight, chat-resident checklist.
As Anthropic engineer Thariq Shihipar wrote in an article on X: “Todos (orange) = ‘help Claude remember what to do’.” These were effective for single-session scripts but fragile for actual engineering. If the session ended, the terminal crashed, or the context window drifted, the plan evaporated.
Tasks (Green) introduce a new layer of abstraction designed for “coordinating work across sessions, subagents, and context windows.” This is achieved through three key architectural decisions:
Dependency Graphs vs. Linear Lists: Unlike a flat Todo list, Tasks support directed acyclic graphs (DAGs). A task can explicitly “block” another. As seen in community demonstrations, the system can determine that Task 3 (Run Tests) cannot start until Task 1 (Build API) and Task 2 (Configure Auth) are complete. This enforcement prevents the “hallucinated completion” errors common in LLM workflows, where a model attempts to test code it hasn’t written yet.
Filesystem Persistence & Durability: Anthropic chose a “UNIX-philosophy” approach to state management. Rather than locking project state inside a proprietary cloud database, Claude Code writes tasks directly to the user’s local filesystem (~/.claude/tasks). This creates durable state. A developer can shut down their terminal, switch machines, or recover from a system crash, and the agent reloads the exact state of the project. For enterprise teams, this persistence is critical—it means the “plan” is now an artifact that can be audited, backed up, or version-controlled, independent of the active session.
Orchestration via Environment Variables: The most potent technical unlock is the ability to share state across sessions. By setting the CLAUDE_CODE_TASK_LIST_ID environment variable, developers can point multiple instances of Claude at the same task list. This allows updates to be “broadcast” to all active sessions, enabling a level of coordination that was previously impossible without external orchestration tools.
The release of Tasks makes the “Parallel Sessions” described in Anthropic’s Best Practices guide practical. The documentation suggests a Writer/Reviewer pattern that leverages this shared state:
Session A (Writer) picks up Task #1 (“Implement Rate Limiter”).
Session A marks it complete.
Session B (Reviewer), observing the shared state update, sees Task #2 (“Review Rate Limiter”) is now unblocked.
Session B begins the review in a clean context, unbiased by the generation process.
This aligns with the guide’s advice to “fan out” work across files, using scripts to loop through tasks and call Claude in parallel. Crucially, patch v2.1.17 fixed “out-of-memory crashes when resuming sessions with heavy subagent usage,” indicating that Anthropic is actively optimizing the runtime for these high-load, multi-agent scenarios.
For decision-makers evaluating Claude Code for production pipelines, the recent changelogs (v2.1.16–v2.1.19) reveal a focus on reliability and integration.
The Best Practices guide explicitly endorses running Claude in Headless Mode (claude -p). This allows engineering teams to integrate the agent into CI/CD pipelines, pre-commit hooks, or data processing scripts.
For example, a nightly cron job could instantiate a Claude session to “Analyze the day’s log files for anomalies,” using a Task list to track progress through different log shards.
The move to autonomous agents introduces new failure modes, which recent patches have addressed:
Dangling Processes: v2.1.19 fixed an issue where Claude Code processes would hang when the terminal closed; the system now catches EIO errors and ensures a clean exit (using SIGKILL as a fallback).
Hardware Compatibility: Fixes for crashes on processors without AVX support ensure broader deployment compatibility.
Git Worktrees: Fixes for resume functionality when working across different directories or git worktrees ensure that the “state” follows the code, not just the shell session.
Recognizing that enterprise workflows cannot turn on a dime, Anthropic introduced the CLAUDE_CODE_ENABLE_TASKS environment variable (v2.1.19). Setting this to false allows teams to opt-out of the new system temporarily, preserving existing workflows while they migrate to the Task-based architecture.
For the individual developer, the Task system solves the “context economy” problem. Anthropic’s documentation warns that “Claude’s context window… is the most important resource to manage,” and that performance degrades as it fills.
Before Tasks, clearing the context was dangerous—you wiped the agent’s memory of the overall plan. Now, because the plan is stored on disk, users can follow the best practice of “aggressive context management.” Developers can run /clear or /compact to free up tokens for the model’s reasoning, without losing the project roadmap.
The changelog also highlights quality-of-life improvements for power users building complex scripts:
Shorthand Arguments: Users can now access custom command arguments via $0, $1, etc., making it easier to script reusable “Skills” (e.g., a /refactor command that takes a filename as an argument).
Keybindings: Fully customizable keyboard shortcuts (/keybindings) allow for faster interaction loops.
With the introduction of Tasks, Anthropic is signaling that the future of coding agents is a project management.
By giving Claude Code a persistent memory, a way to understand dependency, and the stability fixes required for long-running processes, they have moved the tool from a “copilot” that sits next to you to a “subagent” that can be trusted to run in the background — especially when powered by Anthropic’s most performant model, Claude Opus 4.5.
It is a technical evolution that acknowledges a simple truth: in the enterprise, the code is cheap; it is the context, the plan, and the reliability that are precious.
When Anthropic announced Monday that it was embedding nine workplace applications directly inside Claude, transforming its AI chatbot into what I earlier described as a “workplace command center,” Asana was among the headliners.
But while the broader launch signals a new era of AI-native productivity tools, Asana’s participation reflects a deeper strategic bet — one that positions the project management company not as an AI competitor, but as the essential context layer that makes any AI model more useful.
In an exclusive interview with VentureBeat, Arnab Bose, Asana’s Chief Product Officer, explained the thinking behind the partnership and why the company chose to embrace external AI providers rather than build proprietary models.
“The AI landscape is advancing at a breakneck pace,” Bose said. “We believe our customers are best served when they have access to the latest, most powerful reasoning capabilities from best-in-class providers like Anthropic, rather than being locked into a single, proprietary model that may fall behind quickly.”
The integration arrives at a pivotal moment for Asana: the company is navigating a leadership transition after co-founder Dustin Moskovitz’s retirement, competing against rivals racing to embed AI into productivity software, and betting that its proprietary “Work Graph” — the company’s mapping of how tasks, people, and goals connect inside organizations — can differentiate it in an increasingly crowded market.
The strategic logic Bose outlined goes beyond simply offering Claude users another tool to connect. At its core, Asana is making a bet about where value will accrue in the AI era — and the company believes context will matter more than raw model capability.
“An LLM in isolation is context-starved,” Bose told VentureBeat. “It knows how to write, but it doesn’t know your business—your goals, your knowledge, your specific approvals, or your historical relationships. Asana provides the scaffolding—the Work Graph data model—that grounds those external models in the reality of how your company actually operates.”
It’s a framing that positions Asana as essential infrastructure rather than a replaceable application. If Bose is right, then even as AI models from Anthropic, OpenAI, and Google grow more powerful, they will remain fundamentally limited without deep integration into how organizations actually function.
“Most errors happen because models are context-starved,” Bose said. “Asana solves this with context that is unique to each business.”
The argument has implications beyond Asana. It suggests a future where AI capability becomes increasingly commoditized, while the companies that control rich organizational data — project histories, approval workflows, team relationships — become the essential partners that make AI useful in enterprise settings.
In practice, the Claude integration allows users to create and manage Asana projects entirely through natural conversation. When a user connects their Asana account via OAuth authentication, Claude gains the ability to read project data, create new tasks, and build entire project structures based on natural language instructions.
A marketing team discussing a product launch in Claude can simply say: “Create a Q2 product launch project with phases for creative development, partner outreach, press kit, and launch day.” Claude then generates the project structure, complete with sections and tasks, which the user can review before pushing it live to Asana.
“When you use Claude to explore a new initiative, like brainstorming a campaign structure, outlining a project plan, or mapping out a cross-functional launch, you can turn that thinking into real, structured work in Asana without breaking your flow,” the company said in its press release announcing the integration.
The synchronization runs in real time. Changes made through Claude appear immediately in Asana, and status updates from Asana can be pulled into Claude conversations for on-the-fly reporting. Users can ask questions like “What’s behind schedule in our marketing campaigns right now?” and receive answers grounded in their actual project data.
One of the key design decisions in the integration is a strict requirement for human oversight. Bose emphasized that Claude cannot act autonomously within Asana — every consequential action requires explicit user approval.
“Our architecture follows a strict human-in-the-loop philosophy where AI actions—from drafting project plans to summarizing risks—has a human in the loop to course correct, check quality, and ultimately give final sign-off when working with AI,” Bose told VentureBeat. “Users review and approve before tasks are created and projects are built.”
When asked whether Claude could potentially access projects or tasks that a user wouldn’t normally have permission to see, Bose was direct: “No. Users need to authenticate via OAuth with their Asana credentials to use this integration, and Claude respects their permissions and access.”
The approach is an increasingly common pattern in enterprise AI — giving artificial intelligence significant capabilities while maintaining human control over final decisions. It addresses one of the core anxieties around AI in workplace settings: the fear that automated systems will make mistakes that propagate through organizations before anyone notices.
When asked about audit capabilities for enterprise administrators, Bose said admins can monitor usage information about Claude in Asana’s Admin App Management portal, with deeper audit log visibility potentially coming based on customer feedback.
Notably, Asana is not betting exclusively on Claude. Bose emphasized the company’s commitment to working with multiple AI providers, positioning Asana as a neutral platform that works with whichever AI systems its customers prefer.
“Our philosophy is to meet users where they want to work,” Bose said. “We are building the work platform for today and the future which means being the best front-end for any vendor’s agents.”
He confirmed that Asana offers “foundational connectors” with both ChatGPT and Google Gemini and is working to deepen those integrations. The company is also committed to emerging industry standards for AI agent interoperability, including the Agent-to-Agent protocol and MCP.
“We want to be the best front-end for agents from any vendor,” Bose said, describing a vision where Asana becomes the coordination layer through which various AI systems — whether from Anthropic, OpenAI, Google, or others — can operate within enterprise workflows.
This multi-provider approach differs from companies that have tied themselves exclusively to a single AI partner. It reflects both a pragmatic recognition that the AI landscape remains volatile and a strategic bet that Asana’s value lies in its data and workflow capabilities rather than any particular AI model.
The Claude integration arrives as Asana navigates significant organizational change. Dustin Moskovitz, the company’s co-founder and longtime CEO, retired earlier this year after announcing his departure during Asana’s fourth-quarter earnings report in March. Moskovitz’s departure triggered immediate market reaction, with Asana’s stock dropping more than 25 percent in after-hours trading following the announcement.
The company subsequently hired Dan Rogers — formerly CEO of software startup LaunchDarkly and previously president of Rubrik and marketing chief at ServiceNow — to take over as chief executive. Rogers started in July, with Moskovitz transitioning to the role of board chairman.
In a recent appearance on the Stratechery podcast, Moskovitz reflected candidly on his tenure. “I don’t like to manage teams, and it wasn’t my intention when we started Asana,” he said. “I’d intended to be more of a independent or head of engineering or something again. Then one thing led to another and I was CEO for 13 years and I just found it quite exhausting.”
Moskovitz — who co-founded Facebook alongside Mark Zuckerberg before leaving to start Asana in 2008 — retains approximately 39 percent of outstanding Asana shares. He said he plans to focus more on his philanthropic endeavors, including Good Ventures and Open Philanthropy, which lists “potential risks from advanced AI” among its focus areas.
When asked about the long-term trajectory of AI in Asana, Bose outlined a vision that balances automation with human judgment — what he described as a “self-driving” organization where humans nonetheless remain at the wheel.
“Our vision is for customers to work however suits them best, alongside AI agents that actually have the context to be helpful and productive,” he said. “But the goal is not for agents to make important decisions on their own. That is where humans provide value: having the judgment, relationships, and nuance to make complex decisions.”
He described a future in which AI handles “orchestration” — spotting patterns, flagging risks, managing follow-ups — while humans retain authority over strategy and trade-offs. As an example, Bose pointed to Asana’s AI Teammates feature, which the company introduced in beta last year.
“Asana AI Teammates — built on the Work Graph, so they understand who is doing what, by when, and why — can flag that three teams are behind on dependencies for a launch and draft a mitigation plan,” Bose said. “But a human reviews it, adjusts based on business priorities, and makes the call on what happens next.”
The question is whether that boundary will hold as AI capabilities advance. Anthropic and OpenAI are both racing to build more capable “agentic” systems that can execute multi-step tasks with less human oversight. If those systems become reliable enough, the human-in-the-loop requirement may shift from necessity to preference — a transition Asana appears to be preparing for, even as it emphasizes human control today.
The Asana integration in Claude is available immediately to all Asana customers who have a paid Claude subscription. Users can connect Asana through Claude’s app directory or request that their administrator enable the integration for their workspace.
The interactive app feature is available on Claude’s web and desktop applications for Pro, Max, Team, and Enterprise subscribers. Once connected, users can mention Asana in any Claude conversation to start creating projects, assigning tasks, or pulling status updates from their existing work.
The industry consensus is that 2026 will be the year of “agentic AI.” We are rapidly moving past chatbots that simply summarize text. We are entering the era of autonomous agents that execute tasks. We expect them to book flights, diagnose system outages, manage cloud infrastructure and personalize media streams in real-time.
As a technology executive overseeing platforms that serve 30 million concurrent users during massive global events like the Olympics and the Super Bowl, I have seen the unsexy reality behind the hype: Agents are incredibly fragile.
Executives and VCs obsess over model benchmarks. They debate Llama 3 versus GPT-4. They focus on maximizing context window sizes. Yet they are ignoring the actual failure point. The primary reason autonomous agents fail in production is often due to data hygiene issues.
In the previous era of “human-in-the-loop” analytics, data quality was a manageable nuisance. If an ETL pipeline experiences an issue, a dashboard may display an incorrect revenue number. A human analyst would spot the anomaly, flag it and fix it. The blast radius was contained.
In the new world of autonomous agents, that safety net is gone.
If a data pipeline drifts today, an agent doesn’t just report the wrong number. It takes the wrong action. It provisions the wrong server type. It recommends a horror movie to a user watching cartoons. It hallucinates a customer service answer based on corrupted vector embeddings.
To run AI at the scale of the NFL or the Olympics, I realized that standard data cleaning is insufficient. We cannot just “monitor” data. We must legislate it.
A solution to this specific problem could be in the form of a ‘data quality – creed’ framework. It functions as a ‘data constitution.’ It enforces thousands of automated rules before a single byte of data is allowed to touch an AI model. While I applied this specifically to the streaming architecture at NBCUniversal, the methodology is universal for any enterprise looking to operationalize AI agents.
Here is why “defensive data engineering” and the Creed philosophy are the only ways to survive the Agentic era.
The core problem with AI Agents is that they trust the context you give them implicitly. If you are using RAG, your vector database is the agent’s long-term memory.
Standard data quality issues are catastrophic for vector databases. In traditional SQL databases, a null value is just a null value. In a vector database, a null value or a schema mismatch can warp the semantic meaning of the entire embedding.
Consider a scenario where metadata drifts. Suppose your pipeline ingests video metadata, but a race condition causes the “genre” tag to slip. Your metadata might tag a video as “live sports,” but the embedding was generated from a “news clip.” When an agent queries the database for “touchdown highlights,” it retrieves the news clip because the vector similarity search is operating on a corrupted signal. The agent then serves that clip to millions of users.
At scale, you cannot rely on downstream monitoring to catch this. By the time an anomaly alarm goes off, the agent has already made thousands of bad decisions. Quality controls must shift to the absolute “left” of the pipeline.
The Creed framework is expected to act as a gatekeeper. It is a multi-tenant quality architecture that sits between ingestion sources and AI models.
For technology leaders looking to build their own “constitution,” here are the three non-negotiable principles I recommend.
1. The “quarantine” pattern is mandatory: In many modern data organizations, engineers favor the “ELT” approach. They dump raw data into a lake and clean it up later. For AI Agents, this is unacceptable. You cannot let an agent drink from a polluted lake.
The Creed methodology enforces a strict “dead letter queue.” If a data packet violates a contract, it is immediately quarantined. It never reaches the vector database. It is far better for an agent to say “I don’t know” due to missing data than to confidently lie due to bad data. This “circuit breaker” pattern is essential for preventing high-profile hallucinations.
2. Schema is law: For years, the industry moved toward “schemaless” flexibility to move fast. We must reverse that trend for core AI pipelines. We must enforce strict typing and referential integrity.
In my experience, a robust system requires scale. The implementation I oversee currently enforces more than 1,000 active rules running across real-time streams. These aren’t just checking for nulls. They check for business logic consistency.
Example: Does the “user_segment” in the event stream match the active taxonomy in the feature store? If not, block it.
Example: Is the timestamp within the acceptable latency window for real-time inference? If not, drop it.
3. Vector consistency checks This is the new frontier for SREs. We must implement automated checks to ensure that the text chunks stored in a vector database actually match the embedding vectors associated with them. “Silent” failures in an embedding model API often leave you with vectors that point to nothing. This causes agents to retrieve pure noise.
Implementing a framework like Creed is not just a technical challenge. It is a cultural one.
Engineers generally hate guardrails. They view strict schemas and data contracts as bureaucratic hurdles that slow down deployment velocity. When introducing a data constitution, leaders often face pushback. Teams feel they are returning to the “waterfall” era of rigid database administration.
To succeed, you must flip the incentive structure. We demonstrated that Creed was actually an accelerator. By guaranteeing the purity of the input data, we eliminated the weeks data scientists used to spend debugging model hallucinations. We turned data governance from a compliance task into a “quality of service” guarantee.
If you are building an AI strategy for 2026, stop buying more GPUs. Stop worrying about which foundation model is slightly higher on the leaderboard this week.
Start auditing your data contracts.
An AI Agent is only as autonomous as its data is reliable. Without a strict, automated data constitution like the Creed framework, your agents will eventually go rogue. In an SRE’s world, a rogue agent is far worse than a broken dashboard. It is a silent killer of trust, revenue, and customer experience.
Manoj Yerrasani is a senior technology executive.