Claude Code just got updated with one of the most-requested user features

Anthropic’s open source standard, the Model Context Protocol (MCP), released in late 2024, allows users to connect AI models and the agents atop them to external tools in a structured, reliable format. It is the engine behind Anthropic’s hit AI agentic programming harness, Claude Code, allowing it to access numerous functions like web browsing and file creation immediately when asked.

But there was one problem: Claude Code typically had to “read” the instruction manual for every single tool available, regardless of whether it was needed for the immediate task, using up the available context that could otherwise be filled with more information from the user’s prompts or the agent’s responses.

At least until last night. The Claude Code team released an update that fundamentally alters this equation. Dubbed MCP Tool Search, the feature introduces “lazy loading” for AI tools, allowing agents to dynamically fetch tool definitions only when necessary.

It is a shift that moves AI agents from a brute-force architecture to something resembling modern software engineering—and according to early data, it effectively solves the “bloat” problem that was threatening to stifle the ecosystem.

The ‘Startup Tax’ on Agents

To understand the significance of Tool Search, one must understand the friction of the previous system. The Model Context Protocol, released by Anthropic in 2024 as an open source standard, was designed to be a universal way of connecting AI models to data sources and tools—everything from GitHub repositories to local file systems.

However, as the ecosystem grew, so did the “startup tax.”

Thariq Shihipar, a member of the technical staff at Anthropic, highlighted the scale of the problem in the announcement.

“We’ve found that MCP servers may have up to 50+ tools,” Shihipar wrote. “Users were documenting setups with 7+ servers consuming 67k+ tokens.”

In practical terms, this meant a developer using a robust set of tools might sacrifice a third or more of the 200,000-token context window before typing a single character of a prompt, as AI newsletter author Aakash Gupta pointed out in a post on X.

The model was effectively “reading” hundreds of pages of technical documentation for tools it might never use during that session.

Community analysis provided even starker examples.

Gupta further noted that a single Docker MCP server could consume 125,000 tokens just to define its 135 tools.

“The old constraint forced a brutal tradeoff,” he wrote. “Either limit your MCP servers to 2-3 core tools, or accept that half your context budget disappears before you start working.”

How Tool Search Works

The solution Anthropic rolled out — which Shihipar called “one of our most-requested features on GitHub” — is elegant in its restraint. Instead of preloading every definition, Claude Code now monitors context usage.

According to the release notes, the system automatically detects when tool descriptions would consume more than 10% of the available context.

When that threshold is crossed, the system switches strategies. Instead of dumping raw documentation into the prompt, it loads a lightweight search index.

When the user asks for a specific action—say, “deploy this container”—Claude Code doesn’t scan a massive, pre-loaded list of 200 commands. Instead, it queries the index, finds the relevant tool definition, and pulls only that specific tool into the context.
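Conceptually, the pattern resembles a lazy-loaded registry: keep a lightweight index of tool names and short descriptions in context, and fetch a full definition only when a query matches. The sketch below illustrates that general idea in Python; it is not Anthropic’s implementation, and the class and matching logic are purely illustrative.

class LazyToolRegistry:
    def __init__(self, summaries, loader):
        self.summaries = summaries      # tool name -> one-line description (cheap, always available)
        self.loader = loader            # callable: tool name -> full schema/definition (expensive)
        self._loaded = {}

    def search(self, query, top_k=3):
        """Naive keyword match over summaries; a production system would use embeddings or BM25."""
        terms = query.lower().split()
        scored = [
            (sum(term in desc.lower() for term in terms), name)
            for name, desc in self.summaries.items()
        ]
        return [name for score, name in sorted(scored, reverse=True)[:top_k] if score > 0]

    def get_definition(self, name):
        """Load the full tool schema only the first time it is needed."""
        if name not in self._loaded:
            self._loaded[name] = self.loader(name)
        return self._loaded[name]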

“Tool Search flips the architecture,” Gupta analyzed. “The token savings are dramatic: from ~134k to ~5k in Anthropic’s internal testing. That’s an 85% reduction while maintaining full tool access.”

For developers maintaining MCP servers, this shifts the optimization strategy.

Shihipar noted that the `server instructions` field in the MCP definition—previously a “nice to have”—is now critical. It acts as the metadata that helps Claude “know when to search for your tools, similar to skills.”
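For server authors, that means treating the instructions field as searchable metadata rather than boilerplate. A minimal sketch, assuming the official Python MCP SDK’s FastMCP server (which accepts an instructions argument at construction time); the server name and tool below are hypothetical:

from mcp.server.fastmcp import FastMCP

# Hypothetical deployment server; the instructions string is what helps a client
# like Claude Code decide when to search for these tools.
mcp = FastMCP(
    "container-deploy",
    instructions=(
        "Tools for building, tagging, and deploying containers. "
        "Use when the user asks to ship, release, or roll back a service."
    ),
)

@mcp.tool()
def deploy_container(image: str, environment: str = "staging") -> str:
    """Deploy the given image to the target environment."""
    return f"Deployed {image} to {environment}"

if __name__ == "__main__":
    mcp.run()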

‘Lazy Loading’ and Accuracy Gains

While the token savings are the headline metric—saving money and memory is always popular—the secondary effect of this update might be more important: focus.

LLMs are notoriously sensitive to “distraction.” When a model’s context window is stuffed with thousands of lines of irrelevant tool definitions, its ability to reason decreases. It creates a “needle in a haystack” problem where the model struggles to differentiate between similar commands, such as `notification-send-user` versus `notification-send-channel`.

Boris Cherny, Head of Claude Code, emphasized this in his reaction to the launch on X: “Every Claude Code user just got way more context, better instruction following, and the ability to plug in even more tools.”

The data backs this up. Internal benchmarks shared by the community indicate that enabling Tool Search improved the accuracy of the Opus 4 model on MCP evaluations from 49% to 74%.

For the newer Opus 4.5, accuracy jumped from 79.5% to 88.1%.

By removing the noise of hundreds of unused tools, the model can dedicate its “attention” mechanisms to the user’s actual query and the relevant active tools.

Maturing the Stack

This update signals a maturation in how we treat AI infrastructure. In the early days of any software paradigm, brute force is common. But as systems scale, efficiency becomes the primary engineering challenge.

Aakash Gupta drew a parallel to the evolution of Integrated Development Environments (IDEs) like VSCode or JetBrains. “The bottleneck wasn’t ‘too many tools.’ It was loading tool definitions like 2020-era static imports instead of 2024-era lazy loading,” he wrote. “VSCode doesn’t load every extension at startup. JetBrains doesn’t inject every plugin’s docs into memory.”

By adopting “lazy loading”—a standard best practice in web and software development—Anthropic is acknowledging that AI agents are no longer just novelties; they are complex software platforms that require architectural discipline.

Implications for the Ecosystem

For the end user, this update is seamless: Claude Code simply feels “smarter” and retains more memory of the conversation. But for the developer ecosystem, it opens the floodgates.

Previously, there was a “soft cap” on how capable an agent could be. Developers had to curate their toolsets carefully to avoid lobotomizing the model with excessive context. With Tool Search, that ceiling is effectively removed. An agent can theoretically have access to thousands of tools—database connectors, cloud deployment scripts, API wrappers, local file manipulators—without paying a penalty until those tools are actually touched.

It turns the “context economy” from a scarcity model into an access model. As Gupta summarized, “They’re not just optimizing context usage. They’re changing what ‘tool-rich agents’ can mean.”

The update is rolling out immediately for Claude Code users. For developers building MCP clients, Anthropic recommends implementing the `ToolSearchTool` to support this dynamic loading, ensuring that as the agentic future arrives, it doesn’t run out of memory before it even says hello.

Why MongoDB thinks better retrieval — not bigger models — is the key to trustworthy enterprise AI

Agentic systems and enterprise search depend on data retrieval that is both efficient and accurate. Database provider MongoDB believes its newest embedding models can help reverse the decline in retrieval quality that emerges as more AI systems go into production.

As agentic and RAG systems move into production, retrieval quality is emerging as a quiet failure point — one that can undermine accuracy, cost, and user trust even when models themselves perform well.

The company launched four new versions of its embedding and reranking models. Voyage 4 will be available in four variants: voyage-4, voyage-4-large, voyage-4-lite, and voyage-4-nano.

MongoDB said voyage-4 serves as its general-purpose model, while voyage-4-large is its flagship. Voyage-4-lite targets low-latency, lower-cost tasks, and voyage-4-nano is intended for local development and testing environments or for on-device data retrieval.

Voyage-4-nano is also MongoDB’s first open-weight model. All models are available via an API and on MongoDB’s Atlas platform. 
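For teams evaluating the models, calling them looks much like any other embedding API. Below is a minimal sketch assuming the voyageai Python client; the model string "voyage-4" follows the announced naming and should be confirmed against current documentation.

import voyageai

# Assumes VOYAGE_API_KEY is set in the environment.
vo = voyageai.Client()

docs = [
    "Refunds are issued within 5 business days of receiving the returned item.",
    "Enterprise plans include SSO and audit logging.",
]

# Model identifier is an assumption based on the announced naming.
result = vo.embed(docs, model="voyage-4", input_type="document")
print(len(result.embeddings), len(result.embeddings[0]))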

The company said the models outperform similar models from Google and Cohere on the RTEB benchmark. Hugging Face’s RTEB benchmark puts Voyage 4 as the top embedding model. 

“Embedding models are one of those invisible choices that can really make or break AI experiences,” Frank Liu, product manager at MongoDB, said in a briefing. “You get them wrong, your search results will feel pretty random and shallow, but if you get them right, your application suddenly feels like it understands your users and your data.”

He added that the goal of the Voyage 4 models is to improve the retrieval of real-world data, which often collapses once agentic and RAG pipelines go into production. 

MongoDB also released a new multimodal embedding model, voyage-multimodal-3.5, that can handle documents that include text, images, and video. This model vectorizes the data and extracts semantic meaning from the tables, graphics, figures, and slides typically found in enterprise documents.

Enterprises’ embedding problems

For enterprises, an agentic system is only as good as its ability to reliably retrieve the right information at the right time. This requirement becomes harder as workloads scale and context windows fragment.

Several model providers target that layer of agentic AI. Google’s Gemini Embedding model topped the embedding leaderboards, and Cohere launched its Embed 4 multimodal model, which processes documents more than 200 pages long. Mistral said its coding-embedding model, Codestral Embedding, outperforms Cohere, Google, and even MongoDB’s Voyage Code 3. MongoDB argues that benchmark performance alone doesn’t address the operational complexity enterprises face in production.

MongoDB said many clients have found that their data stacks cannot handle context-aware, retrieval-intensive workloads in production. The company said it’s seeing more fragmentation with enterprises having to stitch together different solutions to connect databases with a retrieval or reranking model. To help customers who don’t want fragmented solutions, the company is offering its models through a single data platform, Atlas. 

MongoDB’s bet is that retrieval can’t be treated as a loose collection of best-of-breed components anymore. For enterprise agents to work reliably at scale, embeddings, reranking, and the data layer need to operate as a tightly integrated system rather than a stitched-together stack.

AI agents can talk — orchestration is what makes them work together

Rather than asking how AI agents can work for them, a key question for enterprises is now: Are agents playing well together?

This makes orchestration across multi-agent systems and platforms a critical concern — and a key differentiator. 

“Agent-to-agent communications is emerging as a really big deal,” G2’s chief innovation officer Tim Sanders told VentureBeat. “Because if you don’t orchestrate it, you get misunderstandings, like people speaking foreign languages to each other. Those misunderstandings reduce the quality of actions and raise the specter of hallucinations, which could be security incidents or data leakage.”

Allowing agents to talk and coordinate

Orchestration to this point has largely been around data, but that’s quickly turning to action. “Conductor-like solutions” are increasingly bringing together agents, robotic process automation (RPA), and data repositories. Sanders likened the progression to that of answer engine optimization, which initially began with monitoring and now creates bespoke content and code. 

“Orchestration platforms coordinate a variety of different agentic solutions to increase the consistency of outcomes,” he said. 

Early providers include Salesforce MuleSoft, UiPath Maestro, and IBM Watsonx Orchestrate. These “phase one” software-based observability dashboards help IT leaders see all agentic actions across an enterprise. 

The critical element of risk management

But coordination can only add so much value; these platforms will morph into technical risk management tools that provide greater quality control. This could include, for instance, agent assessments, policy recommendations, and proactive scoring (such as how reliable agents are when they call on enterprise tools, or how often they hallucinate and when).

Enterprise leaders have become wary of relying on vendors to minimize risks and errors; many IT decision-makers, in fact, do not trust a vendor’s statements about the reliability of their agents, he said. 

Third-party tools are beginning to bridge the gap and automate tedious guardrail processes and escalation tickets. Teams are already experiencing “ticket exhaustion” in semi-automated systems, where agents hit guardrails and require human permission to proceed.

As an example: The loan process at a bank requires 17 steps for approval, and an agent keeps interrupting human workflows with approval requests when it runs into established guardrails.

Third-party orchestration platforms can manage these tickets, approving or denying them, or even challenging the need for approval altogether. They can eventually eliminate the need for persistent human-in-the-loop oversight so organizations can experience “true velocity gains” measured not in percentages but in multiples (that is, 3X versus 30%).

“Where it goes from there is remote management of the entire agentic process for organizations,” Sanders said. 

‘Human-on-the-loop’ versus ‘human-in-the-loop’ 

In another critical evolution in the agentic era, human evaluators will become designers, moving from human-in-the-loop to human-on-the-loop, according to Sanders. That is: They will begin designing agents to automate workflows. 

Agent builder platforms continue to innovate their no-code solutions, Sanders said, meaning nearly anyone can now stand up an agent using natural language. “This will democratize agentic AI, and the super skill will be the ability to express a goal, provide context and envision pitfalls, very similar to a good people manager today.”

What enterprise leaders should be doing now

Agent-first automation stacks “dramatically outperform” hybrid automation stacks in almost every attribute, he noted: satisfaction, quality of actions, security, cost savings.

Organizations should begin “expeditious programs” to infuse agents across workflows, especially with highly repetitive work that poses bottlenecks. Likely at first, there will be a strong human-in-the-loop element to ensure quality and promote change management. 

“Serving as an evaluator will strengthen the understanding of how these systems work,” Sanders said, “and eventually enable all of us to operate upstream in agentic workflows instead of downstream.” 

IT leaders should take inventory today of all the different elements of their automation stack. Whether these elements are rules-based automation, RPA, or agentic automation, they must learn everything going on in the organization to optimally use emerging orchestration platforms.

“If they don’t, there could actually be dis-synergies across organizations where old school technology and cutting edge technology clash at the point of delivery, oftentimes customer-facing,” Sanders said. “You can’t orchestrate what you can’t see clearly.”

This new, dead simple prompt technique boosts accuracy on LLMs by up to 76% on non-reasoning tasks

In the chaotic world of Large Language Model (LLM) optimization, engineers have spent the last few years developing increasingly esoteric rituals to get better answers.

We’ve seen “Chain of Thought” (asking the model to think step by step and, often, show those “reasoning traces” to the user), “Emotional Blackmail” (telling the model its career depends on the answer, or that it is being accused of sexual misconduct), and complex multi-shot prompting frameworks.

But a new paper released by Google Research suggests that we may have been overthinking it. The researchers found that simply repeating the input query—literally copying and pasting the prompt so it appears twice—consistently improves performance across major models including Gemini, GPT-4o, Claude, and DeepSeek.

The paper, titled “Prompt Repetition Improves Non-Reasoning LLMs,” released last month just before the holidays, presents a finding that is almost suspiciously simple: for tasks that don’t require complex reasoning steps, stating the prompt twice yields significantly better results than stating it once.

Even better, because of how transformer architecture works, this “one weird trick” comes with virtually zero penalty in terms of generation speed.

The Causal Blind Spot

To understand why repeating a question makes a supercomputer smarter, you have to look at the architectural limitations of the standard Transformer model.

Most modern LLMs are trained as “causal” language models. This means they process text strictly from left to right. When the model is processing the 5th token in your sentence, it can “attend” (pay attention) to tokens 1 through 4, but it has zero knowledge of token 6, because it hasn’t happened yet.

This creates a fundamental constraint in how models understand user queries. As the authors note, the order of information matters immensely.

A query formatted as <CONTEXT> <QUESTION> often yields different results than <QUESTION> <CONTEXT> because, in the latter case, the model reads the question before it knows the context it’s supposed to apply it to.

Prompt repetition hacks this limitation by transforming an input of <QUERY> into <QUERY><QUERY>.

By the time the model begins processing the second iteration of the query, it has already “read” the first iteration. This allows the tokens in the second copy to attend to every single token in the first copy.

Effectively, the second repetition enjoys a form of bidirectional attention—it can “look back” at the entire query to resolve ambiguities or retrieve specific details that might have been missed in a single pass.
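In practice, the technique is a one-line transformation of the input. A minimal sketch in Python, where the doubled string is simply what gets sent as the user message to whatever model or client you use:

def repeat_prompt(query: str, separator: str = "\n\n") -> str:
    """Return the query stated twice, so tokens in the second copy can attend to the first."""
    return f"{query}{separator}{query}"

# Example: the doubled prompt becomes the user message content.
query = "From the list of names below, return the 25th name only: ..."
messages = [{"role": "user", "content": repeat_prompt(query)}]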

The Benchmarks: 47 Wins, 0 Losses

The researchers, Yaniv Leviathan, Matan Kalman, and Yossi Matias, tested this hypothesis across a suite of seven popular benchmarks, including ARC, OpenBookQA, GSM8K, and MMLU-Pro. They evaluated seven different models, ranging from lightweight models like Gemini 2.0 Flash Lite and GPT-4o-mini to heavyweights like Claude 3.7 Sonnet and DeepSeek V3.

The results were statistically stark. When asking models not to use explicit reasoning (i.e., just giving a direct answer), prompt repetition won 47 out of 70 head-to-head tests against the baseline, with zero losses.

The gains were particularly dramatic in tasks requiring precise retrieval from a prompt. The team designed a custom “NameIndex” benchmark, where the model is given a list of 50 names and asked to identify the 25th one.

  • Baseline Performance: Gemini 2.0 Flash-Lite scored a dismal 21.33% accuracy.

  • With Repetition: Accuracy skyrocketed to 97.33%.

This massive jump illustrates the “causal blind spot” perfectly. In a single pass, the model might lose track of the count by the time it reaches the 25th name. In the repeated pass, the model effectively has the entire list in its “working memory” before it attempts to solve the retrieval task.

The “Free Lunch” of Latency

Usually, adding text to a prompt increases costs and latency. If you double the input, surely you double the wait time?

Surprisingly, no. The paper demonstrates that prompt repetition is essentially “free” regarding user-perceived latency.

LLM processing is divided into two stages:

  1. Prefill: The model processes the input prompt. This is highly parallelizable; the GPU can crunch the entire prompt matrix simultaneously.

  2. Generation (Decoding): The model generates the answer one token at a time. This is serial and slow.

Prompt repetition only increases the work in the prefill stage. Because modern hardware handles prefill so efficiently, the user barely notices the difference. The researchers found that repeating the prompt did not increase the length of the generated answer, nor did it increase the “time to first token” latency for most models.

The only exceptions were Anthropic’s models (Claude Haiku and Sonnet) on extremely long requests, where the prefill stage eventually hit a bottleneck. But for the vast majority of use cases, the technique improves accuracy without slowing down the chat experience.

Reasoning vs. Repetition

There is a caveat: this technique is primarily for “non-reasoning” tasks—scenarios where you want a direct answer rather than a step-by-step derivation.

When the researchers tested prompt repetition combined with “Chain of Thought” (asking the model to “think step by step”), the gains largely vanished, showing neutral to slightly positive results (5 wins, 1 loss, 22 ties).

The authors posit that reasoning models naturally perform a version of repetition themselves. When a model “thinks,” it often restates the premise of the question in its generated output before solving it. Therefore, explicitly repeating the prompt in the input becomes redundant.

However, for applications where you need a fast, direct answer without the verbosity (and cost) of a long reasoning trace, prompt repetition offers a powerful alternative.

Strategic Implementation for the Enterprise

For enterprise leadership, this research represents that rarest of things in AI development: a “free” optimization. But capitalization requires nuance; this isn’t a setting to toggle blindly across an entire organization, but rather a tactical adjustment that ripples across engineering, orchestration, and security.

For technical leads balancing the eternal triangle of speed, quality, and cost, prompt repetition offers a way to punch above your weight class. The data shows that smaller, faster models—like Gemini 2.0 Flash Lite—can achieve near-perfect retrieval accuracy (jumping from 21.33% to 97.33%) simply by processing the input twice.

This changes the calculus for model selection: before upgrading to a larger, more expensive model to solve an accuracy bottleneck, engineers should first test whether simple repetition allows their current “Lite” models to close the gap. It is a potential strategy for retaining the speed and cost benefits of lightweight infrastructure without sacrificing performance on extraction and retrieval tasks.

This logic naturally shifts the burden to the orchestration layer. For those managing the middleware and API gateways that glue AI applications together, prompt repetition should likely become a standard, invisible component of the pipeline logic rather than a user behavior.

However, because the technique is neutral for reasoning-heavy tasks but highly effective for direct answers, it requires conditional application. A smart orchestration harness would automatically identify requests routed to non-reasoning endpoints—such as entity extraction, classification, or simple Q&A—and double the prompt before passing it to the model. This optimizes performance at the infrastructure level, delivering better results without requiring action from end-users or increasing the generation budget.
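As a sketch of what that routing might look like (the task labels and any upstream classifier are hypothetical; only non-reasoning categories get the doubled prompt):

NON_REASONING_TASKS = {"extraction", "classification", "direct_qa"}

def prepare_prompt(query: str, task_type: str) -> str:
    """Double the prompt for direct-answer tasks; leave reasoning tasks untouched."""
    if task_type in NON_REASONING_TASKS:
        return f"{query}\n\n{query}"
    return query

# Usage: task_type would come from a classifier or from the route itself.
prompt = prepare_prompt("Extract all company names from the text below: ...", "extraction")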

Finally, this heightened attentiveness introduces a new variable for security teams.

If repeating a prompt clarifies a user’s intent to the model, it stands to reason that malicious intents might be clarified as well. Security directors will need to update their red-teaming protocols to test “repeated injection” attacks—verifying whether repeating a jailbreak command (e.g., “Ignore previous instructions”) makes the model “attend” to the breach more effectively. Conversely, this mechanism offers a new defensive tool: repeating System Prompts.

Stating safety guardrails twice at the start of the context window could force the model to attend to safety constraints more rigorously, acting as a low-cost reinforcement for robust security operations.

Why This Matters

This research highlights a crucial insight for developers building on top of LLMs: our current models are still deeply constrained by their unidirectional nature. While we wait for new architectures that might solve causal blindness, crude but effective workarounds like prompt repetition offer immediate value.

The authors suggest this could become a default behavior for future systems.

We might soon see inference engines that silently double our prompts in the background before sending them to the model, or “reasoning” models trained to internalize this repetition strategy to be more efficient.

For now, if you are struggling to get a model to follow complex instructions or retrieve specific details from a long document, the solution might not be a better prompt. You might just need to say it again.

Why Egnyte keeps hiring junior engineers despite the rise of AI coding tools

Egnyte, the $1.5 billion cloud content governance company, has embedded AI coding tools across its global team of more than 350 developers — but not to reduce headcount. Instead, the company continues to hire junior engineers, using AI to accelerate onboarding, deepen codebase understanding, and shorten the path from junior to senior contributor.

The approach challenges a dominant 2025 narrative that automation will replace developers, showing instead how enterprises are using AI to scale engineering capacity while keeping humans firmly in the loop.

“To have engineers disappear or us not hiring junior engineers doesn’t look like the likely outcome,” Amrit Jassal, Egnyte CTO and co-founder, told VentureBeat. “You’ve got to have people, you’re training and doing all types of succession planning. The junior engineer of today is the senior engineer of tomorrow.”

How Egnyte coders are using AI — without ceding control

Egnyte — which has more than 22,000 users including NASDAQ, Red Bull, and BuzzFeed — has rolled out Claude Code, Cursor, Augment, and Gemini CLI coding tools across its developer base to support its core business strategies and expand its newer AI offerings like customer-facing copilots and customizable AI agents.

Devs use these tools across a variety of tasks, the simplest of which include data retrieval, code comprehension, smart search, and code lookup. Egnyte’s code base has lots of Java code, which uses numerous libraries, each with different versions, Jassal explained. AI tools are great for peer-to-peer programming, helping new users get the lay of the land or letting existing users probe different code repositories.

“We have a pretty big code base, right?” Jassal said. “Let’s say you’re looking at an iOS application, but you’re not well versed; you will fire up Google CLI or an Augment, and ask it to discover the code base.”

Some Egnyte devs are moving into automatic pull request summaries, which provide simple overviews of code changes that essentially explain the “what,” “how,” and “why” of proposed modifications.

“But obviously, any change that’s made, we don’t want to hear that AI made the change; it has to be that developer made the change,” Jassal pointed out. “I would not trust AI to commit to the production code base.” 

Commits still pass through human review and security validation, and anything red-flagged is escalated to senior engineers. Devs are warned of the dangers of settling into autopilot mode or blindly trusting code. A model may not have been exposed to, or given enough samples of, certain coding components and infrastructure in its training. 

Another growing, and closely monitored, use case for AI is unit testing, where code components are run in isolation to ensure they work as intended. “At the end of the day, it is a productivity improvement tool,” he said. “It is really a continuation, it’s like any other tool, it’s not some magic.”

Beyond core engineering, AI is helping other teams collaborate with programmers. Product management, for instance, is using tools like Vercel to bring “demo-worthy” prototypes, rather than just ideas, to devs, who can then move ahead with mock-ups. Or, if UX teams are looking to change certain elements on a dashboard, AI can quickly spin up a handful of options, like different widgets or buttons. 

“Then you come to engineering with that, and the engineer immediately knows what you really intend to do with it,” Jassal said. 

Setting expectations, meeting devs where they are

However, day-to-day activities for all Egnyte engineers, including junior developers, extend beyond just coding. 

Junior developers are given hands-on tasks across the full development lifecycle to accelerate their growth and experience, Jassal said. For instance, they assist with requirement analysis in early software engineering phases, as well as deployment, productization and post-deployment maintenance.

In turn, these activities require “Egnyte-specific tacit knowledge and experience” offered by senior engineers. One clear example of work that sits firmly with senior engineers is authoring architecture notes, as these cut across the platform and require a more holistic, system-level view, Jassal said. 

“Many of the traditional roadblocks are navigated faster these days with AI; for example, understanding the codebase, dissecting requirements, auto-testing,” he said. “This faster track allows our talented junior hires to progress more quickly and provide higher value to the company sooner.”

The company expects a much faster learning curve from junior to mid-level engineers, Jassal said.  “It’s always the case that people coming straight into the workforce are much more excited about trying new things,” Jassal said. But that has to be colored with reality to temper expectations, he added. 

On the other hand, some senior engineers may need to be ramped up in their adoption because they’re hesitant or had ho-hum or bad experiences with earlier generation tools. This requires incremental introduction.

“The senior people, having been burnt multiple times, bring that perspective,” he said. “So both [types of engineers] play an important role.”

Hiring will continue for scale and fresh perspective

“In general, I would say it has been really hyped by folks who want to sell you tokens,” Jassal said referring to people who talk about human coders becoming obsolete. 

“Vibe coding” could be construed in a similar vein: Like others in software development, he prefers the term “AI-assisted coding,” wherein programmers have a self-driven loop, generating code, analyzing exceptions, then correcting and scaling.

At least in Egnyte’s case, hiring will continue, even if at a slower clip as people become more productive thanks to AI, Jassal said.

“We are not just hiring for scale, but to develop the next generation of senior developers and inject fresh perspectives into our development practices,” he said.

The takeaway for technical decision-makers is not that AI will eliminate engineering jobs — but that it will reshape how talent is developed.

At Egnyte, AI-assisted coding is compressing learning curves and raising expectations, not removing humans from the process. Enterprises that treat AI as a replacement risk hollowing out their future senior talent pipeline; those that treat it as infrastructure can move faster without losing the judgment, creativity, and accountability that only engineers provide.

Why your LLM bill is exploding — and how semantic caching can cut it by 73%

Our LLM API bill was growing 30% month-over-month. Traffic was increasing, but not that fast. When I analyzed our query logs, I found the real problem: Users ask the same questions in different ways.

“What’s your return policy?,” “How do I return something?”, and “Can I get a refund?” were all hitting our LLM separately, generating nearly identical responses, each incurring full API costs.

Exact-match caching, the obvious first solution, captured only 18% of these redundant calls. The same semantic question, phrased differently, bypassed the cache entirely.

So, I implemented semantic caching based on what queries mean, not how they’re worded. After implementing it, our cache hit rate increased to 67%, reducing LLM API costs by 73%. But getting there requires solving problems that naive implementations miss.

Why exact-match caching falls short

Traditional caching uses query text as the cache key. This works when queries are identical:

# Exact-match caching: the query text itself is the cache key
def get_exact(query_text: str, cache: dict):
    cache_key = hash(query_text)
    if cache_key in cache:
        return cache[cache_key]
    return None

But users don’t phrase questions identically. My analysis of 100,000 production queries found:

  • Only 18% were exact duplicates of previous queries

  • 47% were semantically similar to previous queries (same intent, different wording)

  • 35% were genuinely novel queries

That 47% represented massive cost savings we were missing. Each semantically-similar query triggered a full LLM call, generating a response nearly identical to one we’d already computed.

Semantic caching architecture

Semantic caching replaces text-based keys with embedding-based similarity lookup:

from datetime import datetime
from typing import Optional

class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.92):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.vector_store = VectorStore()  # FAISS, Pinecone, etc.
        self.response_store = ResponseStore()  # Redis, DynamoDB, etc.

    def get(self, query: str) -> Optional[str]:
        """Return cached response if semantically similar query exists."""
        query_embedding = self.embedding_model.encode(query)
        # Find most similar cached query
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= self.threshold:
            cache_id = matches[0].id
            return self.response_store.get(cache_id)
        return None

    def set(self, query: str, response: str):
        """Cache query-response pair."""
        query_embedding = self.embedding_model.encode(query)
        cache_id = generate_id()
        self.vector_store.add(cache_id, query_embedding)
        self.response_store.set(cache_id, {
            'query': query,
            'response': response,
            'timestamp': datetime.utcnow()
        })

The key insight: Instead of hashing query text, I embed queries into vector space and find cached queries within a similarity threshold.
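Wiring the cache in front of the model then becomes a standard get-or-compute pattern. A minimal sketch, where embedding_model stands in for your embedding model and call_llm is a hypothetical wrapper around whichever provider you use:

cache = SemanticCache(embedding_model, similarity_threshold=0.92)

def answer(query: str) -> str:
    cached = cache.get(query)
    if cached is not None:
        return cached                  # semantic hit: no LLM call
    response = call_llm(query)         # hypothetical provider wrapper
    cache.set(query, response)
    return response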

The threshold problem

The similarity threshold is the critical parameter. Set it too high, and you miss valid cache hits. Set it too low, and you return wrong responses.

Our initial threshold of 0.85 seemed reasonable; 85% similar should be “the same question,” right?

Wrong. At 0.85, we got cache hits like:

  • Query: “How do I cancel my subscription?”

  • Cached: “How do I cancel my order?”

  • Similarity: 0.87

These are different questions with different answers. Returning the cached response would be incorrect.

I discovered that optimal thresholds vary by query type:

Query type            | Optimal threshold | Rationale
FAQ-style questions   | 0.94              | High precision needed; wrong answers damage trust
Product searches      | 0.88              | More tolerance for near-matches
Support queries       | 0.92              | Balance between coverage and accuracy
Transactional queries | 0.97              | Very low tolerance for errors

I implemented query-type-specific thresholds:

class AdaptiveSemanticCache:
    def __init__(self):
        # embedding_model, vector_store, and response_store are set up
        # exactly as in SemanticCache above (omitted here for brevity)
        self.thresholds = {
            'faq': 0.94,
            'search': 0.88,
            'support': 0.92,
            'transactional': 0.97,
            'default': 0.92
        }
        self.query_classifier = QueryClassifier()

    def get_threshold(self, query: str) -> float:
        query_type = self.query_classifier.classify(query)
        return self.thresholds.get(query_type, self.thresholds['default'])

    def get(self, query: str) -> Optional[str]:
        threshold = self.get_threshold(query)
        query_embedding = self.embedding_model.encode(query)
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= threshold:
            return self.response_store.get(matches[0].id)
        return None

Threshold tuning methodology

I couldn’t tune thresholds blindly. I needed ground truth on which query pairs were actually “the same.”

Our methodology:

Step 1: Sample query pairs. I sampled 5,000 query pairs at various similarity levels (0.80-0.99).

Step 2: Human labeling. Annotators labeled each pair as “same intent” or “different intent.” I used three annotators per pair and took a majority vote.

Step 3: Compute precision/recall curves. For each threshold, we computed:

  • Precision: Of cache hits, what fraction had the same intent?

  • Recall: Of same-intent pairs, what fraction did we cache-hit?

def compute_precision_recall(pairs, labels, threshold):
    """Compute precision and recall at given similarity threshold."""
    predictions = [1 if pair.similarity >= threshold else 0 for pair in pairs]
    true_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 1)
    false_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 0)
    false_negatives = sum(1 for p, l in zip(predictions, labels) if p == 0 and l == 1)
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    return precision, recall

Step 4: Select threshold based on cost of errors. For FAQ queries where wrong answers damage trust, I optimized for precision (0.94 threshold gave 98% precision). For search queries where missing a cache hit just costs money, I optimized for recall (0.88 threshold).
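In code, that selection step can be a simple sweep over candidate thresholds using the function above, taking the loosest threshold that still meets the precision target for a given query type (the 0.98 target here mirrors the FAQ example; precision is not guaranteed to be perfectly monotonic, so treat this as a heuristic):

def select_threshold(pairs, labels, candidates, min_precision=0.98):
    """Return the lowest candidate threshold whose precision meets the target."""
    for t in sorted(candidates):
        precision, recall = compute_precision_recall(pairs, labels, t)
        if precision >= min_precision:
            # Lower thresholds admit more hits, so this is typically the highest-recall choice
            return t, precision, recall
    return None

# e.g. select_threshold(pairs, labels, [0.80 + 0.01 * i for i in range(20)])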

Latency overhead

Semantic caching adds latency: You must embed the query and search the vector store before knowing whether to call the LLM.

Our measurements:

Operation          | Latency (p50) | Latency (p99)
Query embedding    | 12ms          | 28ms
Vector search      | 8ms           | 19ms
Total cache lookup | 20ms          | 47ms
LLM API call       | 850ms         | 2400ms

The 20ms overhead is negligible compared to the 850ms LLM call we avoid on cache hits. Even at p99, the 47ms overhead is acceptable.

However, cache misses now take 20ms longer than before (embedding + search + LLM call). At our 67% hit rate, the math works out favorably:

  • Before: 100% of queries × 850ms = 850ms average

  • After: (33% × 870ms) + (67% × 20ms) = 287ms + 13ms = 300ms average

Net latency improvement of 65% alongside the cost reduction.

Cache invalidation

Cached responses go stale. Product information changes, policies update and yesterday’s correct answer becomes today’s wrong answer.

I implemented three invalidation strategies:

  1. Time-based TTL

Simple expiration based on content type:

from datetime import timedelta

TTL_BY_CONTENT_TYPE = {
    'pricing': timedelta(hours=4),       # Changes frequently
    'policy': timedelta(days=7),         # Changes rarely
    'product_info': timedelta(days=1),   # Daily refresh
    'general_faq': timedelta(days=14),   # Very stable
}
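Enforcement can then happen at lookup time by comparing the stored timestamp against the TTL for the entry’s content type. A minimal sketch, assuming each cached entry also records a content_type alongside the timestamp stored by set():

from datetime import datetime, timedelta

def is_expired(entry: dict) -> bool:
    """Treat an entry as stale once the TTL for its content type has elapsed."""
    ttl = TTL_BY_CONTENT_TYPE.get(entry.get('content_type'), timedelta(days=1))
    return datetime.utcnow() - entry['timestamp'] > ttl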

  2. Event-based invalidation

When underlying data changes, invalidate related cache entries:

class CacheInvalidator:
    def on_content_update(self, content_id: str, content_type: str):
        """Invalidate cache entries related to updated content."""
        # Find cached queries that referenced this content
        affected_queries = self.find_queries_referencing(content_id)
        for query_id in affected_queries:
            self.cache.invalidate(query_id)
        self.log_invalidation(content_id, len(affected_queries))

  3. Staleness detection

For responses that might become stale without explicit events, I implemented periodic freshness checks:

def check_freshness(self, cached_response: dict) -> bool:
    """Verify cached response is still valid."""
    # Re-run the query against current data
    fresh_response = self.generate_response(cached_response['query'])
    # Compare semantic similarity of responses
    cached_embedding = self.embed(cached_response['response'])
    fresh_embedding = self.embed(fresh_response)
    similarity = cosine_similarity(cached_embedding, fresh_embedding)
    # If responses diverged significantly, invalidate
    if similarity < 0.90:
        self.cache.invalidate(cached_response['id'])
        return False
    return True

We run freshness checks on a sample of cached entries daily, catching staleness that TTL and event-based invalidation miss.

Production results

After three months in production:

Metric                              | Before     | After        | Change
Cache hit rate                      | 18%        | 67%          | +272%
LLM API costs                       | $47K/month | $12.7K/month | -73%
Average latency                     | 850ms      | 300ms        | -65%
False-positive rate                 | N/A        | 0.8%         |
Customer complaints (wrong answers) | Baseline   | +0.3%        | Minimal increase

The 0.8% false-positive rate (queries where we returned a cached response that was semantically incorrect) was within acceptable bounds. These cases occurred primarily at the boundaries of our threshold, where similarity was just above the cutoff but intent differed slightly.

Pitfalls to avoid

Don’t use a single global threshold. Different query types have different tolerance for errors. Tune thresholds per category.

Don’t skip the embedding step on cache hits. You might be tempted to skip embedding overhead when returning cached responses, but you need the embedding for cache key generation. The overhead is unavoidable.

Don’t forget invalidation. Semantic caching without invalidation strategy leads to stale responses that erode user trust. Build invalidation from day one.

Don’t cache everything. Some queries shouldn’t be cached: Personalized responses, time-sensitive information, transactional confirmations. Build exclusion rules.

def should_cache(self, query: str, response: str) -> bool:
    """Determine if response should be cached."""
    # Don't cache personalized responses
    if self.contains_personal_info(response):
        return False
    # Don't cache time-sensitive information
    if self.is_time_sensitive(query):
        return False
    # Don't cache transactional confirmations
    if self.is_transactional(query):
        return False
    return True

Key takeaways

Semantic caching is a practical pattern for LLM cost control that captures redundancy exact-match caching misses. The key challenges are threshold tuning (use query-type-specific thresholds based on precision/recall analysis) and cache invalidation (combine TTL, event-based and staleness detection).

At 73% cost reduction, this was our highest-ROI optimization for production LLM systems. The implementation complexity is moderate, but the threshold tuning requires careful attention to avoid quality degradation.

Sreenivasa Reddy Hulebeedu Reddy is a lead software engineer.

Orchestral replaces LangChain’s complexity with reproducible, provider-agnostic LLM orchestration

A new framework from researchers Alexander and Jacob Roman rejects the complexity of current AI tools, offering a synchronous, type-safe alternative designed for reproducibility and cost-conscious science.

In the rush to build autonomous AI agents, developers have largely been forced into a binary choice: surrender control to massive, complex ecosystems like LangChain, or lock themselves into single-vendor SDKs from providers like Anthropic or OpenAI. For software engineers, this is an annoyance. For scientists trying to use AI for reproducible research, it is a dealbreaker.

Enter Orchestral AI, a new Python framework released on GitHub this week that attempts to chart a third path.

Developed by theoretical physicist Alexander Roman and software engineer Jacob Roman, Orchestral positions itself as the “scientific computing” answer to agent orchestration—prioritizing deterministic execution and debugging clarity over the “magic” of async-heavy alternatives.

The ‘anti-framework’ architecture

The core philosophy behind Orchestral is an intentional rejection of the complexity that plagues the current market. While frameworks like AutoGPT and LangChain rely heavily on asynchronous event loops—which can make error tracing a nightmare—Orchestral utilizes a strictly synchronous execution model.

“Reproducibility demands understanding exactly what code executes and when,” the founders argue in their technical paper. By forcing operations to happen in a predictable, linear order, the framework ensures that an agent’s behavior is deterministic—a critical requirement for scientific experiments where a “hallucinated” variable or a race condition could invalidate a study.

Despite this focus on simplicity, the framework is provider-agnostic. It ships with a unified interface that works across OpenAI, Anthropic, Google Gemini, Mistral, and local models via Ollama. This allows researchers to write an agent once and swap the underlying “brain” with a single line of code—crucial for comparing model performance or managing grant money by switching to cheaper models for draft runs.

LLM-UX: designing for the model, not the end user

Orchestral introduces a concept the founders call “LLM-UX”—user experience designed from the perspective of the model itself.

The framework simplifies tool creation by automatically generating JSON schemas from standard Python type hints. Instead of writing verbose descriptions in a separate format, developers can simply annotate their Python functions. Orchestral handles the translation, ensuring that the data types passed between the LLM and the code remain safe and consistent.
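The general mechanism is easy to picture even without Orchestral’s own API: Python’s typing and inspect modules expose enough information to build a JSON-schema-style tool definition automatically. The sketch below illustrates the idea in generic Python; it is not Orchestral’s implementation, and the type mapping and example function are deliberately minimal.

import inspect
from typing import get_type_hints

PY_TO_JSON = {int: "integer", float: "number", str: "string", bool: "boolean"}

def tool_schema(fn) -> dict:
    """Build a minimal JSON-schema-style tool definition from a function's type hints."""
    hints = get_type_hints(fn)
    hints.pop("return", None)
    sig = inspect.signature(fn)
    required = [n for n, p in sig.parameters.items() if p.default is inspect.Parameter.empty]
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),
        "parameters": {
            "type": "object",
            "properties": {n: {"type": PY_TO_JSON.get(t, "string")} for n, t in hints.items()},
            "required": required,
        },
    }

def run_simulation(steps: int, tolerance: float = 1e-6) -> str:
    """Run the N-body simulation for a given number of steps."""
    return f"ran {steps} steps at tolerance {tolerance}"

print(tool_schema(run_simulation))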

This philosophy extends to the built-in tooling. The framework includes a persistent terminal tool that maintains its state (like working directories and environment variables) between calls. This mimics how human researchers interact with command lines, reducing the cognitive load on the model and preventing the common failure mode where an agent “forgets” it changed directories three steps ago.

Built for the lab (and the budget)

Orchestral’s origins in high-energy physics and exoplanet research are evident in its feature set. The framework includes native support for LaTeX export, allowing researchers to drop formatted logs of agent reasoning directly into academic papers.

It also tackles the practical reality of running LLMs: cost. The framework includes an automated cost-tracking module that aggregates token usage across different providers, allowing labs to monitor burn rates in real-time.

Perhaps most importantly for safety-conscious fields, Orchestral implements “read-before-edit” guardrails. If an agent attempts to overwrite a file it hasn’t read in the current session, the system blocks the action and prompts the model to read the file first. This prevents the “blind overwrite” errors that terrify anyone using autonomous coding agents.

The licensing caveat

While Orchestral is easy to install via pip install orchestral-ai, potential users should look closely at the license. Unlike the MIT or Apache licenses common in the Python ecosystem, Orchestral is released under a Proprietary license.

The documentation explicitly states that “unauthorized copying, distribution, modification, or use… is strictly prohibited without prior written permission”. This “source-available” model allows researchers to view and use the code, but restricts them from forking it or building commercial competitors without an agreement. This suggests a business model focused on enterprise licensing or dual-licensing strategies down the road.

Furthermore, early adopters will need to be on the bleeding edge of Python environments: the framework requires Python 3.13 or higher, explicitly dropping support for the widely used Python 3.12 due to compatibility issues.

Why it matters

“Civilization advances by extending the number of important operations which we can perform without thinking about them,” the founders write, quoting mathematician Alfred North Whitehead.

Orchestral attempts to operationalize this for the AI era. By abstracting away the “plumbing” of API connections and schema validation, it aims to let scientists focus on the logic of their agents rather than the quirks of the infrastructure. Whether the academic and developer communities will embrace a proprietary tool in an ecosystem dominated by open source remains to be seen, but for those drowning in async tracebacks and broken tool calls, Orchestral offers a tempting promise of sanity.

How KPMG is redefining the future of SAP consulting on a global scale

Presented by SAP


SAP consulting projects today involve a vast amount of documentation, multiple stakeholders, and compressed timelines, which often require manual knowledge retrieval from online SAP documentation. At the same time, cloud ERP programs now demand faster design cycles, continuous enhancements rather than big-bang rollouts, and near-real-time decision-making. Joule for Consultants, SAP’s conversational AI solution, was designed to help meet these expectations and support consultants throughout their daily tasks, from reconciling best practices and validating design considerations, to navigating SAP’s expanding AI, data, and application landscape.

The result: consultants work more productively than ever before, with superior results, and deliver faster, high-quality SAP cloud transformations.

That promise attracted early attention from KPMG firms, which became some of the largest SAP enablers participating in the Joule early access program, and one of SAP’s largest customers overall. The organization has onboarded 29 KPMG member firms around the world to this point, and now thousands of KPMG consultants are using Joule for Consultants in their daily work.

“For us it wasn’t about experimenting,” says Valentino Koester, global head of the SAP360 and SAP AI program at KPMG International. “It was more about positioning our people and member firm clients at the forefront of AI-enabled consulting.”

Knowledge on a global scale

“Competitive pressure is intense in the SAP implementation market,” Koester says. “The core asset you have as a consultancy is knowledge and experience, bundled in all kinds of ways. AI and Joule for Consultants allow us to scale that knowledge instantly across our global network, making sure customers can access it from all over the world, no matter who they talk to in the organization.”

Whether it’s a junior consultant or a senior manager, Joule ensures SAP best practices, industry benchmarks, and the innovation an organization has invested in for years is not siloed somewhere in a team, or somewhere in a country where it can’t be shared at speed and with accuracy.

“This makes our teams more agile in responding to emerging client needs or regulatory shifts, when new market entrants emerge or technology changes,” Koester says. “The agility our consultants gain allows us to advise customers not just reactively, but proactively. In many cases, it becomes a form of early forecasting.”

For example, consultants can spot potential supply chain risks early through AI-enabled process mining they’ve implemented for a customer, or apply analytical capabilities with SAP Analytics Cloud before those risks actually materialize.

Overcoming challenges, alleviating pain points

KPMG customer transformations typically follow KPMG’s Transformation methodology, which is closely aligned with SAP’s RISE methodology. RISE moves through six SAP Activate phases: discover, prepare, explore, realize, deploy, and run, and each has recurring challenges that can slow momentum if not addressed early, Koester explains.

In the discover phase, teams invest heavily in business-case modeling, benchmarking, and stakeholder alignment — activities that are time-intensive and difficult to complete without full visibility. The prepare phase introduces extensive project mobilization, reporting demands, early risk identification, and governance setup, any of which can stall progress before execution begins.

During the explore and realize phases, long design workshops and piles of documentation can bog down decision-making. Defects and bottlenecks must be identified as early as possible, or they risk cascading downstream as rework. In the deploy and run phases, organizations must develop and deliver training content while overcoming change resistance, which requires sustained communication to maintain adoption. Once live, continuous KPI monitoring and process optimization helps prevent issues from settling into the operating model and eroding value over time. AI can help consultants perform these tasks, as well as any necessary review and corrections, with a higher degree of accuracy, and more quickly than ever before.

“By adopting tools like Joule for Consultants, we want to enhance the work our professionals do, making them more effective and more productive, so they have more time to focus on what matters most,” Koester said. “That’s the client relationship, strategic decision-making, and delivering measurable business outcomes.”

How AI changes the transformation approach

Joule for Consultants isn’t only reducing repetitive work; it’s also reshaping how KPMG approaches SAP-enabled transformations. The AI tool has surfaced insights that traditionally required deep expertise and immediate recall, enhancing consultants’ ability to respond to market dynamics and competitive pressure.

For example, in early design workshops with customers, unless there were highly experienced consultants on site who, in real time, could answer every question about business processes, as well as explain the design considerations and all the technical logic in the new system, consultants frequently had to defer answers or validate them later, especially when novel questions or edge cases came up, slowing the momentum of the transformation.

“With Joule for Consultants, we were hoping to validate the guidance our consultants can give on the spot, in real time, to maintain implementation momentum and strengthen client confidence in their SAP system and KPMG advisors,” Koester said. “Our people can instantly surface SAP best practices, guidelines, and risk scenarios during workshops now, allowing our teams to move forward, helping reduce delay within the engagement.”

They’re seeing the same success during internal learning and enablement or sales-related activities. Joule has supercharged KPMG’s internal SAP University, a learning program for new employees. Recently, junior consultants have been able to prepare and present complex and technical RFP responses with confidence, despite limited experience, as Joule guided them through each step in a structured, high-quality way.

Creating a successful roll-out

To ensure a smooth internal rollout to analysts, KPMG positioned Joule not as the introduction of a new tool, but as the adoption of a new way of working.

“We emphasized the organizational impact it would have, with Joule enabling smarter, more effective ways of working,” Koester said. “Consultants quickly saw benefits, such as less time spent searching for technical information and more time to spend advising their customers and doing the functional work we’re very strong at.”

Responsible AI remained a core pillar of the rollout. In addition to aligning the initiative with KPMG’s Trusted AI Framework, every participating KPMG firm conducted a risk assessment to mitigate potential risks for clients and ensure compliant, secure use of Joule for Consultants. Early adoption therefore began with awareness-building conversations across the network, helping teams understand what these “new ways of working” look like and where AI can genuinely support delivery. KPMG also appointed a dedicated enablement team to ensure Joule for Consultants is integrated into the broader organizational picture.

“We’re working week after week on increasing the active adoption by consultants, while also using the chance to get their feedback,” he explains. “Our grassroots innovation feedback tools don’t just ask what they like and dislike, but what successful use cases they’re uncovering, and how much time they think they’re saving.”

The goal, he adds, isn’t to automate consulting, but to enable consultants to do their jobs better. And because Joule for Consultants is continuing to evolve, KPMG expects its role to expand significantly over time. SAP continues to enhance feature sets and response quality with each release, while it works on developing AI agents that collaborate with consultants and automate some portions of selected workflows intelligently. KPMG is collaborating closely with SAP to responsibly integrate these capabilities into more phases of transformation projects as they mature, so that Joule becomes a part of its overall approach to consulting.

“If early signals hold, Joule for Consultants isn’t just a helpful tool — it’s on track to become a standard in how SAP projects are delivered,” Koester says.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Claude Code 2.1.0 arrives with smoother workflows and smarter agents

Anthropic has released Claude Code v2.1.0, a notable update to its “vibe coding” development environment for autonomously building software, spinning up AI agents, and completing a wide range of computer tasks, according to Head of Claude Code Boris Cherny in a post on X last night.

The release introduces improvements across agent lifecycle control, skill development, session portability, and multilingual output — all bundled in a dense package of 1,096 commits.

It comes amid a growing wave of praise for Claude Code from software developers and startup founders on X, as they increasingly use the system — powered by Anthropic’s Claude model family, including the flagship Opus 4.5 — to push beyond simple completions and into long-running, modular workflows.

Enterprise Relevance: Agent Lifecycle and Orchestration Improvements

Claude Code was originally released as a command-line tool in February 2025, almost a year ago, alongside Anthropic’s then-cutting-edge Claude 3.7 Sonnet large language model (LLM). It has been updated many times since, as Anthropic has advanced its underlying LLMs.

The new version, Claude Code 2.1.0, introduces infrastructure-level features aimed at developers deploying structured workflows and reusable skills. These changes reduce the manual scaffolding required to manage agents across sessions, tools, and environments — letting teams spend less time on configuration and more time building.

Key additions include:

  • Hooks for agents, skills, and slash commands, enabling scoped PreToolUse, PostToolUse, and Stop logic. This gives developers fine-grained control over state management, tool constraints, and audit logging — reducing unexpected behavior and making agent actions easier to debug and reproduce.

  • Hot reload for skills, so new or updated skills in ~/.claude/skills or .claude/skills become available immediately without restarting sessions. Developers can iterate on skill logic in real time, eliminating the stop-start friction that slows down experimentation.

  • Forked sub-agent context via context: fork in skill frontmatter, allowing skills and slash commands to run in isolated contexts. This prevents unintended side effects and makes it safer to test new logic without polluting the main agent’s state.

  • Wildcard tool permissions (e.g., Bash(npm *), Bash(*-h*)) for easier rule configuration and access management. Teams can define broader permission patterns with fewer rules, reducing configuration overhead and the risk of mismatched permissions blocking legitimate workflows; a minimal configuration sketch follows this list.

  • Language-specific output via a language setting, enabling workflows that require output in Japanese, Spanish, or other languages. Global teams and multilingual projects no longer need post-processing workarounds to localize Claude’s responses.

  • Session teleportation via /teleport and /remote-env slash commands, which allow claude.ai subscribers to resume and configure remote sessions at claude.ai/code. Developers can seamlessly move work between local terminals and the web interface — ideal for switching devices or sharing sessions with collaborators.

  • Improved terminal UX, including Shift+Enter working out of the box in iTerm2, Kitty, Ghostty, and WezTerm without modifying terminal configs. This removes a common setup frustration and lets developers start working immediately in their preferred terminal.

  • Unified Ctrl+B behavior for backgrounding both agents and shell commands simultaneously. Developers can push long-running tasks to the background with a single keystroke, freeing up the terminal for other work without losing progress.

  • New Vim motions including ; and , to repeat f/F/t/T motions, yank operator (y, yy, Y), paste (p/P), text objects, indent/dedent (>>, <<), and line joining (J). Power users who rely on Vim-style editing can now work faster without switching mental models or reaching for the mouse.

  • MCP list_changed notifications, allowing MCP servers to dynamically update their available tools, prompts, and resources without requiring reconnection. This keeps workflows running smoothly when tool configurations change, avoiding interruptions and manual restarts.

  • Agents continue after permission denial, allowing subagents to try alternative approaches rather than stopping entirely. This makes autonomous workflows more resilient, reducing the need for human intervention when an agent hits a permissions wall.
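
The wildcard-permission and hook changes above are both driven from a project settings file rather than from code. The sketch below is illustrative only: the permissions.allow/deny and hooks.PreToolUse key names follow Claude Code's published settings.json documentation, but treat them as assumptions to verify against the current docs, and the matcher value and the ./scripts/log-tool-use.sh audit script are hypothetical.

    import json
    from pathlib import Path

    # Hypothetical sketch of a project-level .claude/settings.json combining the
    # wildcard tool permissions and PreToolUse hooks described above. Key names
    # are assumptions based on Claude Code's settings documentation; verify
    # against code.claude.com/docs before relying on them.
    settings = {
        "permissions": {
            # Wildcard patterns: allow any npm subcommand, block raw curl calls.
            "allow": ["Bash(npm *)"],
            "deny": ["Bash(curl *)"],
        },
        "hooks": {
            # Run a (hypothetical) audit-logging script before every Bash tool call.
            "PreToolUse": [
                {
                    "matcher": "Bash",
                    "hooks": [
                        {"type": "command", "command": "./scripts/log-tool-use.sh"}
                    ],
                }
            ]
        },
    }

    Path(".claude").mkdir(exist_ok=True)
    Path(".claude/settings.json").write_text(json.dumps(settings, indent=2) + "\n")
    print("Wrote .claude/settings.json")

Because hooks can now be scoped to agents, skills, and slash commands as well as the whole session, the same pattern extends to auditing or constraining individual workflows rather than everything at once.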

Developer Experience Improvements

Beyond the headline features, this release includes numerous quality-of-life improvements designed to reduce daily friction and help developers stay in flow.

  • /plan command shortcut to enable plan mode directly from the prompt — fewer keystrokes to switch modes means less context-switching and faster iteration on complex tasks.

  • Slash command autocomplete now works when / appears anywhere in input, not just at the beginning. Developers can compose commands more naturally without backtracking to the start of a line.

  • Real-time thinking block display in Ctrl+O transcript mode, giving developers visibility into Claude’s reasoning as it happens. This makes it easier to catch misunderstandings early and steer the agent before it goes down the wrong path.

  • respectGitignore support in settings.json for per-project control over @-mention file picker behavior. Teams can keep sensitive or irrelevant files out of suggestions, reducing noise and preventing accidental exposure of ignored content.

  • IS_DEMO environment variable to hide email and organization from the UI, useful for streaming or recording sessions. Developers can share their work publicly without leaking personal or company information.

  • Skills progress indicators showing tool uses as they happen during execution. Developers get real-time feedback on what Claude is doing, reducing uncertainty during long-running operations and making it easier to spot issues mid-flight.

  • Skills visible in slash command menu by default from /skills/ directories (opt out with user-invocable: false in frontmatter). Custom skills are now more discoverable, helping teams adopt shared workflows without hunting through documentation; a sketch of a skill file using these options follows this list.

  • Improved permission prompt UX, with the Tab hint moved to the footer and cleaner Yes/No input labels with contextual placeholders. Clearer prompts mean fewer mistakes and faster decisions when approving tool access.

  • Multiple startup performance optimizations and improved terminal rendering performance, especially for text with emoji, ANSI codes, and Unicode characters. Faster startup and smoother rendering reduce waiting time and visual distractions, keeping developers focused on the task at hand.
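
For the skill-related items, the new flags live in the skill file's frontmatter. The following is a minimal sketch under stated assumptions: the .claude/skills/<name>/SKILL.md layout and the name/description fields are taken from Claude Code's skills documentation and should be double-checked, while context: fork and user-invocable: false are the options named in the release notes above. The skill's name and contents are invented for illustration.

    from pathlib import Path

    # Hypothetical sketch of a project-local skill that stays out of the slash
    # command menu (user-invocable: false) and runs in a forked sub-agent
    # context (context: fork), per the 2.1.0 notes above. The directory layout
    # and the name/description fields are assumptions based on Claude Code's
    # skills documentation.
    skill_dir = Path(".claude/skills/release-notes")
    skill_dir.mkdir(parents=True, exist_ok=True)

    skill_file = (
        "---\n"
        "name: release-notes\n"
        "description: Draft release notes from the git history since the last tag.\n"
        "context: fork\n"
        "user-invocable: false\n"
        "---\n"
        "\n"
        "Summarize the commits since the most recent tag into grouped,\n"
        "user-facing release notes.\n"
    )
    (skill_dir / "SKILL.md").write_text(skill_file)
    print(f"Wrote {skill_dir / 'SKILL.md'}")

Combined with the hot-reload behavior from the earlier list, edits to a file like this should become available without restarting the session.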

The release also ships numerous bug fixes, including a security fix for a case where sensitive data (OAuth tokens, API keys, passwords) could be exposed in debug logs, fixes for session persistence after transient server errors, and a resolution of API context overflow when background tasks produce large output. Together, these fixes improve reliability and reduce the risk of data leaks or lost work.

Why This Matters: Claude Code Hits a Turning Point with Power Users

Claude Code 2.1.0 arrives in the midst of a significant shift in developer behavior. Originally built as an internal tool at Anthropic, Claude Code is now gaining real traction among external power users — especially those building autonomous workflows, experimenting with agent tooling, and integrating Claude into terminal-based pipelines.

According to X discussions in late December 2025 and early January 2026, enthusiasm surged as developers began describing Claude Code as a game-changer for “vibe coding,” agent composition, and productivity at scale.

@JsonBasedman captured the prevailing sentiment: “I don’t even see the timeline anymore, it’s just ‘Holy shit Claude code is so good’…”

“Claude Code addiction is real,” opined Matt Shumer, co-founder and CEO of Hyperwrite/Otherside AI, in another X post.

Non-developers have embraced the accessibility. @LegallyInnovate, a lawyer, noted: “Trying Claude code for the first time today. I’m a lawyer not a developer. It’s AMAZING. I am blown away and probably not even scratching the surface.”

Some users are shifting away from popular alternatives — @troychaplin switched from Cursor, calling Claude Code “so much better!” for standalone use.

Claude Code has even fueled discussion that Anthropic has actually achieved artificial general intelligence (AGI), the so-called “holy grail” of AI development — something that outperforms humans at most “economically valuable work,” according to the definition offered by Anthropic rival OpenAI.

@deepfates argued that Claude Code may not be AGI, but that “if Claude Code is good enough to to do that, combine ideas on the computer, then I think it is ‘artificial general intellect’ at least. And that is good enough to create a new frontier…”

A clear pattern emerges: users who engage with Claude Code as an orchestration layer — configuring tools, defining reusable components, and layering logic — report transformative results. Those treating it as a standard AI assistant often find its limitations more apparent.

Claude Code 2.1.0 doesn’t try to paper over those divisions — it builds for the advanced tier. Features like agent lifecycle hooks, hot-reloading of skills, wildcard permissioning, and session teleportation reinforce Claude Code’s identity as a tool for builders who treat agents not as chatbots, but as programmable infrastructure.

In total, these updates don’t reinvent Claude Code, but they do lower friction for repeat users and unlock more sophisticated workflows. For teams orchestrating multi-step agent logic, Claude Code 2.1.0 makes Claude feel less like a model — and more like a framework.

Pricing and Availability

Claude Code is available to Claude Pro ($20/month), Claude Max ($100/month), Claude Team (Premium Seat, $150/month), and Claude Enterprise (variable pricing) subscribers.

The /teleport and /remote-env commands require access to Claude Code’s web interface at claude.ai/code. Full installation instructions and documentation are available at code.claude.com/docs/en/setup.

What’s Next?

With reusable skills, lifecycle hooks, and improved agent control, Claude Code continues evolving from a chat-based coding assistant into a structured environment for programmable, persistent agents.

As enterprise teams and solo builders increasingly test Claude in real workflows — from internal copilots to complex bash-driven orchestration — version 2.1.0 makes it easier to treat agents as first-class components of a production stack.

Anthropic appears to be signaling that it views Claude Code not as an experiment, but as infrastructure. And with this release, it’s building like it means it.

Nvidia’s Cosmos Reason 2 aims to bring reasoning VLMs into the physical world

Nvidia CEO Jensen Huang said last year that we are now entering the age of physical AI. While the company continues to offer LLMs for software use cases, Nvidia is increasingly positioning itself as a provider of AI models for fully AI-powered systems — including agentic AI in the physical world.

At CES 2026, Nvidia announced a slate of new models designed to push AI agents beyond chat interfaces and into physical environments.

Nvidia launched Cosmos Reason 2, the latest version of its vision-language model designed for embodied reasoning. Cosmos Reason 1, released last year, introduced a two-dimensional ontology for embodied reasoning and currently leads Hugging Face’s physical reasoning for video leaderboard.

Cosmos Reason 2 builds on the same ontology while giving enterprises more flexibility to customize applications and enabling physical agents to plan their next actions, similar to how software-based agents reason through digital workflows.

Nvidia also released a new version of Cosmos Transfer, a model that lets developers generate training simulations for robots.

Other vision-language models, such as Google’s PaliGemma and Pixtral Large from Mistral, can process visual inputs, but not all commercially available VLMs support reasoning.

“Robotics is at an inflection point. We are moving from specialist robots limited to single tasks to generalist-specialist systems,” said Kari Briski, Nvidia vice president for generative AI software, in a briefing with reporters. She was referring to robots that combine broad foundational knowledge with deep task-specific skills. “These new robots combine broad fundamental knowledge with deep proficiency in complex tasks.”

She added that Cosmos Reason 2 “enhances the reasoning capabilities that robots need to navigate the unpredictable physical world.”

Moving to physical agents

Briski noted that Nvidia’s roadmap follows “the same pattern of assets across all of our open models.”

“In building specialized AI agents, a digital workforce, or the physical embodiment of AI in robots and autonomous vehicles, more than just the model is needed,” Briski said. “First, the AI needs the compute resources to train and to simulate the world around it. Data is the fuel for AI to learn and improve, and we contribute to the world’s largest collection of open and diverse datasets, going beyond just opening the weights of the models. The open libraries and training scripts give developers the tools to purpose-build AI for their applications, and we publish blueprints and examples to help deploy AI as systems of models.”

The company now has open models for physical AI (Cosmos), for robotics (the open reasoning vision-language-action, or VLA, model GR00T), and for agentic AI (its Nemotron family).

Nvidia is making the case that open models across different branches of AI form a shared enterprise ecosystem that feeds data, training, and reasoning to agents in both the digital and physical worlds. 

Additions to the Nemotron family

Briski said Nvidia plans to continue expanding its open models, including its Nemotron family, beyond reasoning to include a new RAG and embeddings model to make information more readily available to agents. The company released Nemotron 3, the latest version of its agentic reasoning models, in December. 

Nvidia announced three new additions to the Nemotron family: Nemotron Speech, Nemotron RAG and Nemotron Safety. 

In a blog post, Nvidia said Nemotron Speech delivers “real-time low-latency speech recognition for live captions and speech AI applications” and is 10 times faster than other speech models. 

Nemotron RAG is technically two models: an embedding model and a reranking model, both of which can understand images to provide more multimodal insights for data agents to tap.

“Nemotron RAG is on top of what we call the MMTEB, or the Massive Multilingual Text Embedding Benchmark, with strong multilingual performance while using less compute and memory, so they are a good fit for systems that must handle a lot of requests very quickly and with low delay,” Briski said.
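
Nvidia has not published implementation details here, but the embed-then-rerank split Briski describes follows a common retrieval pattern: a cheap embedding model narrows a large corpus to a shortlist, and a more expensive reranker reorders that shortlist before it reaches the agent. The toy Python sketch below illustrates only that pattern; the scoring functions are stand-ins, not Nemotron RAG's actual models, and the corpus and query are invented.

    from collections import Counter
    from math import sqrt

    # Toy stand-ins for the two-stage pipeline described above: an "embedding"
    # step narrows the corpus to a shortlist, then a "rerank" step reorders it.
    # Real systems would call an embedding model and a reranking model here;
    # these bag-of-words scorers exist only to show the flow.

    def embed(text: str) -> Counter:
        return Counter(text.lower().split())

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
        # Stage 1: rank the whole corpus by (toy) embedding similarity.
        q = embed(query)
        return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

    def rerank(query: str, shortlist: list[str]) -> list[str]:
        # Stage 2: stand-in for a reranker, favoring exact term overlap.
        q_terms = set(query.lower().split())
        return sorted(shortlist, key=lambda d: len(q_terms & set(d.lower().split())), reverse=True)

    corpus = [
        "Nemotron RAG pairs an embedding model with a reranking model.",
        "Cosmos Reason 2 targets embodied reasoning for physical agents.",
        "Nemotron Speech focuses on low-latency speech recognition.",
    ]
    query = "how does the RAG embedding and rerank pipeline work?"
    print(rerank(query, retrieve(query, corpus)))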

Nemotron Safety detects sensitive data so AI agents do not accidentally expose personally identifiable information.