Recursive language models (RLMs) are an inference technique developed by researchers at MIT CSAIL that treat long prompts as an external environment to the model. Instead of forcing the entire prompt into the model’s context window, the framework allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the text.
Rather than expanding context windows or summarizing old information, the MIT team reframes long-context reasoning as a systems problem. By letting models treat prompts as something they can inspect with code, recursive language models allow LLMs to reason over millions of tokens without retraining. This offers enterprises a practical path to long-horizon tasks like codebase analysis, legal review, and multi-step reasoning that routinely break today’s models.
Because the framework is designed as a wrapper around existing models, it can serve as a drop-in replacement for applications that make direct calls to LLMs.
While frontier models are becoming increasingly sophisticated at reasoning, their ability to process massive amounts of information is not scaling at the same rate. This bottleneck is driven by two distinct limitations: the hard physical constraint on how much text a model can process at once (context length) and “context rot.”
The challenge, the researchers argue, is whether it’s possible to scale the effective context size of general-purpose LLMs by orders of magnitude without retraining them. This capability is becoming increasingly important for enterprise applications, where LLMs are adopted for long-horizon tasks requiring the processing of millions of tokens — a challenge Zhang argues can’t be solved by simply expanding context windows.
“There is an entropy argument that implies you need exponentially more data samples as you increase the effective context window size,” Alex Zhang, a co-author of the paper, told VentureBeat.
Current approaches to extending context often rely on compaction, where the model summarizes older parts of the conversation to free up space. However, this method fails for tasks requiring random access to specific details located in earlier parts of the prompt.
The concept behind RLMs is drawn from “out-of-core” algorithms used in classical computing. These algorithms are designed to process datasets too large to fit into a computer’s main memory by keeping the data on a hard drive and fetching only the necessary chunks as needed.
RLMs apply this logic to generative AI. Instead of feeding a long prompt directly into the neural network, the framework loads the text as a string variable inside a Python coding environment. The LLM is given general context about the data (such as the total character count) but does not “see” the text initially.
Once the prompt is stored as a variable, the LLM acts as a programmer. It writes Python code to interact with the external variable, using standard commands to peek into the data. For example, the model might use regular expressions to search for specific keywords like “Chapter 1” or “financial results.”
When the code execution finds a relevant snippet, the RLM pulls only that specific chunk into its active context window for analysis.
For example, if the prompt is a massive book, the LLM might write a loop that identifies chapter boundaries and then triggers a sub-call to summarize each chapter individually.
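As a rough illustration, the code a root model writes inside the REPL might look like the sketch below; the `prompt` variable and the `llm_subcall` helper are assumptions for illustration, not the actual interface published by the MIT team.

# Hypothetical sketch of code a root LM might emit inside the REPL.
# `prompt` holds the long input as a Python string in the environment;
# `llm_subcall` stands in for whatever sub-call primitive the framework exposes.
import re

chapter_starts = [m.start() for m in re.finditer(r"Chapter \d+", prompt)]
chapter_starts.append(len(prompt))

summaries = []
for start, end in zip(chapter_starts, chapter_starts[1:]):
    chunk = prompt[start:end]  # pull only this slice into active context
    summaries.append(llm_subcall(f"Summarize this chapter:\n{chunk}"))

answer = llm_subcall("Combine these chapter summaries into one overview:\n" + "\n".join(summaries))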
The architecture typically involves two agents. A “root language model,” often a capability-heavy model like GPT-5, acts as the orchestrator. It plans the approach, writes the code, and manages the data flow within the REPL environment. A “recursive language model,” often a faster and cheaper model, acts as the worker. The root LM calls this worker to process the specific text snippets isolated by the code.
Because the prompt resides in the environment’s memory rather than the model’s context window, the system can handle inputs far larger than the model’s training limit. Importantly, to the end-user, the RLM behaves exactly like a standard model: It accepts a string and returns an answer. This allows enterprise teams to swap standard API calls for RLMs.
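In practice, that swap can be as small as changing a single call site. A minimal sketch, assuming hypothetical `RLM` and `complete` names rather than the repository's real API:

# Hypothetical usage; class and method names are illustrative.
# Before: response = call_llm(long_prompt)
rlm = RLM(root_model="gpt-5", sub_model="gpt-5-mini")
response = rlm.complete(long_prompt)  # same contract as a plain LLM call: string in, answer out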
For developers looking to experiment, the RLM code is currently available on GitHub.
“A key argument for RLMs is that most complex tasks can be decomposed into smaller, ‘local’ sub-tasks,” Zhang said. “However, how to perform this context/problem decomposition is non-trivial, and the model must be capable of performing this.”
To validate the framework, the researchers tested RLMs against base models and other agentic approaches like CodeAct and summary agents across a variety of long-context tasks, including retrieval and multi-hop question answering.
The results demonstrated strong performance gains at the 10 million+ token scale. On BrowseComp-Plus, a benchmark involving inputs of 6 to 11 million tokens, standard base models failed completely, scoring 0%. In contrast, the RLM powered by GPT-5 achieved a score of 91.33%, significantly outperforming the Summary Agent (70.47%) and CodeAct (51%).
The framework also excelled at tasks with high computational complexity. On OOLONG-Pairs, an information-dense reasoning benchmark where the difficulty scales quadratically with input length, base GPT-5 models failed catastrophically with a score of just 0.04%. The RLM achieved an F1 score (a balanced measure of precision and recall) of 58%, demonstrating emergent capabilities to handle dense tasks that paralyze standard models. Similarly, on code understanding tasks (CodeQA benchmark), the RLM more than doubled the performance of the base GPT-5 model, jumping from 24% to 62%.
Regarding the context rot problem, the data showed that while the base GPT-5 performance degrades rapidly as task complexity increases, RLM performance holds steady, consistently outperforming the base model on contexts longer than 16,000 tokens.
Despite the increased complexity of the workflow, RLMs often maintained comparable or lower average costs than the baselines. On the BrowseComp-Plus benchmark, the RLM was up to three times cheaper than the summarization baseline.
However, the researchers noted that while median costs are low, RLM trajectories are “long-tailed.” Outlier runs can become expensive if the model gets stuck in loops or performs redundant verifications. While GPT-5 was conservative in its sub-calls, the open-source Qwen3-Coder model sometimes attempted thousands of sub-calls for simple tasks.
“Today, you likely will have to implement your own guardrails and logic to control RLM behavior,” Zhang said. However, he hypothesizes that future models could be trained to manage their own compute budgets more effectively. Companies like Prime Intellect are planning to integrate RLM into the training process of models, possibly addressing the edge cases where the model’s inference budget spikes.
For enterprise architects deciding where to place their bets, the RLM framework offers a new tool for handling information-dense problems.
“I think RLMs are still extremely useful for chatbots (think long chat histories), but ultimately they argue for an alternative way of using LMs,” Zhang said. “I think RLMs work in tandem with standard retrieval methods like RAG; they do not serve as a replacement, and can be used in different settings or together.”
Every year, NeurIPS produces hundreds of impressive papers, and a handful that subtly reset how practitioners think about scaling, evaluation and system design. In 2025, the most consequential works weren’t about a single breakthrough model. Instead, they challenged fundamental assumptions that academia and industry have quietly relied on: Bigger models mean better reasoning, RL creates new capabilities, attention is “solved” and generative models inevitably memorize.
This year’s top papers collectively point to a deeper shift: AI progress is now constrained less by raw model capacity and more by architecture, training dynamics and evaluation strategy.
Below is a technical deep dive into five of the most influential NeurIPS 2025 papers — and what they mean for anyone building real-world AI systems.
Paper: Artificial Hivemind: The Open-Ended Homogeneity of Language Models
For years, LLM evaluation has focused on correctness. But in open-ended or ambiguous tasks like brainstorming, ideation or creative synthesis, there often is no single correct answer. The risk instead is homogeneity: Models producing the same “safe,” high-probability responses.
This paper introduces Infinity-Chat, a benchmark designed explicitly to measure diversity and pluralism in open-ended generation. Rather than scoring answers as right or wrong, it measures:
Intra-model collapse: How often the same model repeats itself
Inter-model homogeneity: How similar different models’ outputs are
The result is uncomfortable but important: Across architectures and providers, models increasingly converge on similar outputs — even when multiple valid answers exist.
For corporations, this reframes “alignment” as a trade-off. Preference tuning and safety constraints can quietly reduce diversity, leading to assistants that feel too safe, predictable or biased toward dominant viewpoints.
Takeaway: If your product relies on creative or exploratory outputs, diversity metrics need to be first-class citizens.
Paper: Gated Attention for Large Language Models
Transformer attention has been treated as settled engineering. This paper proves it isn’t.
The authors introduce a small architectural change: Apply a query-dependent sigmoid gate after scaled dot-product attention, per attention head. That’s it. No exotic kernels, no massive overhead.
Across dozens of large-scale training runs — including dense and mixture-of-experts (MoE) models trained on trillions of tokens — this gated variant:
Improved stability
Reduced “attention sinks”
Enhanced long-context performance
Consistently outperformed vanilla attention
The gate introduces:
Non-linearity in attention outputs
Implicit sparsity, suppressing pathological activations
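A minimal sketch of the mechanism, based on the paper's description rather than the authors' released code; the gate projection shape and its placement here are an illustrative reading:

import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    """Query-dependent, per-head sigmoid gate applied after scaled dot-product attention."""
    def __init__(self, head_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(head_dim, 1)  # one gate scalar per head, per position

    def forward(self, attn_out: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
        # attn_out, queries: (batch, num_heads, seq_len, head_dim)
        gate = torch.sigmoid(self.gate_proj(queries))  # (batch, num_heads, seq_len, 1)
        return attn_out * gate  # implicitly suppresses heads the query does not need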
This challenges the assumption that attention failures are purely data or optimization problems.
Takeaway: Some of the biggest LLM reliability issues may be architectural — not algorithmic — and solvable with surprisingly small changes.
Paper: 1,000-Layer Networks for Self-Supervised Reinforcement Learning
Conventional wisdom says RL doesn’t scale well without dense rewards or demonstrations. This paper shows that assumption is incomplete.
By scaling network depth aggressively, from the typical 2 to 5 layers to nearly 1,000 layers, the authors demonstrate dramatic gains in self-supervised, goal-conditioned RL, with performance improvements ranging from 2X to 50X.
The key isn’t brute force. It’s pairing depth with contrastive objectives, stable optimization regimes and goal-conditioned representations.
For agentic systems and autonomous workflows, this suggests that representation depth — not just data or reward shaping — may be a critical lever for generalization and exploration.
Takeaway: RL’s scaling limits may be architectural, not fundamental.
Paper: Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training
Diffusion models are massively overparameterized, yet they often generalize remarkably well. This paper explains why.
The authors identify two distinct training timescales:
One where generative quality rapidly improves
Another — much slower — where memorization emerges
Crucially, the memorization timescale grows linearly with dataset size, creating a widening window where models improve without overfitting.
This reframes early stopping and dataset scaling strategies. Memorization isn’t inevitable — it’s predictable and delayed.
Takeaway: For diffusion training, dataset size doesn’t just improve quality — it actively delays overfitting.
Paper: Does Reinforcement Learning Really Incentivize Reasoning in LLMs?
Perhaps the most strategically important result of NeurIPS 2025 is also the most sobering.
This paper rigorously tests whether reinforcement learning with verifiable rewards (RLVR) actually creates new reasoning abilities in LLMs — or simply reshapes existing ones.
Their conclusion: RLVR primarily improves sampling efficiency, not reasoning capacity. At large sample sizes, the base model often already contains the correct reasoning trajectories.
RL is better understood as:
A distribution-shaping mechanism
Not a generator of fundamentally new capabilities
Takeaway: To truly expand reasoning capacity, RL likely needs to be paired with mechanisms like teacher distillation or architectural changes — not used in isolation.
Taken together, these papers point to a common theme:
The bottleneck in modern AI is no longer raw model size — it’s system design.
Diversity collapse requires new evaluation metrics
Attention failures require architectural fixes
RL scaling depends on depth and representation
Memorization depends on training dynamics, not parameter count
Reasoning gains depend on how distributions are shaped, not just optimized
For builders, the message is clear: Competitive advantage is shifting from “who has the biggest model” to “who understands the system.”
Maitreyi Chatterjee is a software engineer.
Devansh Agarwal currently works as an ML engineer at FAANG.
Anthropic’s open source standard, the Model Context Protocol (MCP), released in late 2024, allows users to connect AI models and the agents atop them to external tools in a structured, reliable format. It is the engine behind Anthropic’s hit AI agentic programming harness, Claude Code, allowing it to access numerous functions like web browsing and file creation immediately when asked.
But there was one problem: Claude Code typically had to “read” the instruction manual for every single tool available, regardless of whether it was needed for the immediate task, using up the available context that could otherwise be filled with more information from the user’s prompts or the agent’s responses.
At least until last night. The Claude Code team released an update that fundamentally alters this equation. Dubbed MCP Tool Search, the feature introduces “lazy loading” for AI tools, allowing agents to dynamically fetch tool definitions only when necessary.
It is a shift that moves AI agents from a brute-force architecture to something resembling modern software engineering—and according to early data, it effectively solves the “bloat” problem that was threatening to stifle the ecosystem.
To understand the significance of Tool Search, one must understand the friction of the previous system. The Model Context Protocol (MCP), released in 2024 by Anthropic as an open source standard was designed to be a universal standard for connecting AI models to data sources and tools—everything from GitHub repositories to local file systems.
However, as the ecosystem grew, so did the “startup tax.”
Thariq Shihipar, a member of the technical staff at Anthropic, highlighted the scale of the problem in the announcement.
“We’ve found that MCP servers may have up to 50+ tools,” Shihipar wrote. “Users were documenting setups with 7+ servers consuming 67k+ tokens.”
In practical terms, this meant a developer using a robust set of tools might sacrifice 33% or more of their available context window limit of 200,000 tokens before they even typed a single character of a prompt, as AI newsletter author Aakash Gupta pointed out in a post on X.
The model was effectively “reading” hundreds of pages of technical documentation for tools it might never use during that session.
Community analysis provided even starker examples.
Gupta further noted that a single Docker MCP server could consume 125,000 tokens just to define its 135 tools.
“The old constraint forced a brutal tradeoff,” he wrote. “Either limit your MCP servers to 2-3 core tools, or accept that half your context budget disappears before you start working.”
The solution Anthropic rolled out — which Shihipar called “one of our most-requested features on GitHub” — is elegant in its restraint. Instead of preloading every definition, Claude Code now monitors context usage.
According to the release notes, the system automatically detects when tool descriptions would consume more than 10% of the available context.
When that threshold is crossed, the system switches strategies. Instead of dumping raw documentation into the prompt, it loads a lightweight search index.
When the user asks for a specific action—say, “deploy this container”—Claude Code doesn’t scan a massive, pre-loaded list of 200 commands. Instead, it queries the index, finds the relevant tool definition, and pulls only that specific tool into the context.
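Conceptually, the lookup works something like the sketch below; the function and index names are illustrative assumptions, since Anthropic has not published the internal implementation:

# Hypothetical sketch of lazy tool loading; names are illustrative, not Claude Code's real API.
def resolve_tools(user_request: str, tool_index, context_budget: int) -> list[dict]:
    """Return only the tool definitions relevant to this request."""
    # Step 1: search a lightweight index instead of preloading every definition
    candidates = tool_index.search(user_request, top_k=5)

    # Step 2: fetch full definitions only for the matches, respecting the budget
    selected, used_tokens = [], 0
    for match in candidates:
        definition = tool_index.fetch_definition(match.tool_id)
        if used_tokens + definition["token_count"] > context_budget:
            break
        selected.append(definition)
        used_tokens += definition["token_count"]
    return selected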
“Tool Search flips the architecture,” Gupta analyzed. “The token savings are dramatic: from ~134k to ~5k in Anthropic’s internal testing. That’s an 85% reduction while maintaining full tool access.”
For developers maintaining MCP servers, this shifts the optimization strategy.
Shihipar noted that the `server instructions` field in the MCP definition—previously a “nice to have”—is now critical. It acts as the metadata that helps Claude “know when to search for your tools, similar to skills.”
While the token savings are the headline metric—saving money and memory is always popular—the secondary effect of this update might be more important: focus.
LLMs are notoriously sensitive to “distraction.” When a model’s context window is stuffed with thousands of lines of irrelevant tool definitions, its ability to reason decreases. It creates a “needle in a haystack” problem where the model struggles to differentiate between similar commands, such as `notification-send-user` versus `notification-send-channel`.
Boris Cherny, Head of Claude Code, emphasized this in his reaction to the launch on X: “Every Claude Code user just got way more context, better instruction following, and the ability to plug in even more tools.”
The data backs this up. Internal benchmarks shared by the community indicate that enabling Tool Search improved the accuracy of the Opus 4 model on MCP evaluations from 49% to 74%.
For the newer Opus 4.5, accuracy jumped from 79.5% to 88.1%.
By removing the noise of hundreds of unused tools, the model can dedicate its “attention” mechanisms to the user’s actual query and the relevant active tools.
This update signals a maturation in how we treat AI infrastructure. In the early days of any software paradigm, brute force is common. But as systems scale, efficiency becomes the primary engineering challenge.
Aakash Gupta drew a parallel to the evolution of integrated development environments (IDEs) like VSCode or JetBrains. “The bottleneck wasn’t ‘too many tools.’ It was loading tool definitions like 2020-era static imports instead of 2024-era lazy loading,” he wrote. “VSCode doesn’t load every extension at startup. JetBrains doesn’t inject every plugin’s docs into memory.”
By adopting “lazy loading”—a standard best practice in web and software development—Anthropic is acknowledging that AI agents are no longer just novelties; they are complex software platforms that require architectural discipline.
For the end user, this update is seamless: Claude Code simply feels “smarter” and retains more memory of the conversation. But for the developer ecosystem, it opens the floodgates.
Previously, there was a “soft cap” on how capable an agent could be. Developers had to curate their toolsets carefully to avoid lobotomizing the model with excessive context. With Tool Search, that ceiling is effectively removed. An agent can theoretically have access to thousands of tools—database connectors, cloud deployment scripts, API wrappers, local file manipulators—without paying a penalty until those tools are actually touched.
It turns the “context economy” from a scarcity model into an access model. As Gupta summarized, “They’re not just optimizing context usage. They’re changing what ‘tool-rich agents’ can mean.”
The update is rolling out immediately for Claude Code users. For developers building MCP clients, Anthropic recommends implementing the `ToolSearchTool` to support this dynamic loading, ensuring that as the agentic future arrives, it doesn’t run out of memory before it even says hello.
Agentic systems and enterprise search depend on data retrieval that works efficiently and accurately. Database provider MongoDB thinks its newest embedding models can help address the decline in retrieval quality that emerges as more AI systems go into production.
As agentic and RAG systems move into production, retrieval quality is emerging as a quiet failure point — one that can undermine accuracy, cost, and user trust even when models themselves perform well.
The company launched four new versions of its embedding and reranking models. Voyage 4 will be available in four modes: voyage-4, voyage-4-large, voyage-4-lite, and voyage-4-nano.
MongoDB said the voyage-4 embedding serves as its general-purpose model; MongoDB considers Voyage-4-large its flagship model. Voyage-4-lite focuses on tasks requiring little latency and lower costs, and voyage-4-nano is intended for more local development and testing environments or for on-device data retrieval.
Voyage-4-nano is also MongoDB’s first open-weight model. All models are available via an API and on MongoDB’s Atlas platform.
The company said the models outperform similar models from Google and Cohere on the RTEB benchmark. Hugging Face’s RTEB benchmark puts Voyage 4 as the top embedding model.
“Embedding models are one of those invisible choices that can really make or break AI experiences,” Frank Liu, product manager at MongoDB, said in a briefing. “You get them wrong, your search results will feel pretty random and shallow, but if you get them right, your application suddenly feels like it understands your users and your data.”
He added that the goal of the Voyage 4 models is to improve the retrieval of real-world data, which often collapses once agentic and RAG pipelines go into production.
MongoDB also released a new multimodal embedding model, voyage-multimodal-3.5, that can handle documents that include text, images, and video. This model vectorizes the data and extracts semantic meaning from the tables, graphics, figures, and slides typically found in enterprise documents.
For enterprises, an agentic system is only as good as its ability to reliably retrieve the right information at the right time. This requirement becomes harder as workloads scale and context windows fragment.
Several model providers target that layer of agentic AI. Google’s Gemini Embedding model topped the embedding leaderboards, and Cohere launched its Embed 4 multimodal model, which processes documents more than 200 pages long. Mistral said its coding-embedding model, Codestral Embedding, outperforms Cohere, Google, and even MongoDB’s Voyage Code 3. MongoDB argues that benchmark performance alone doesn’t address the operational complexity enterprises face in production.
MongoDB said many clients have found that their data stacks cannot handle context-aware, retrieval-intensive workloads in production. The company said it’s seeing more fragmentation with enterprises having to stitch together different solutions to connect databases with a retrieval or reranking model. To help customers who don’t want fragmented solutions, the company is offering its models through a single data platform, Atlas.
MongoDB’s bet is that retrieval can’t be treated as a loose collection of best-of-breed components anymore. For enterprise agents to work reliably at scale, embeddings, reranking, and the data layer need to operate as a tightly integrated system rather than a stitched-together stack.
Rather than asking how AI agents can work for them, a key question in enterprise is now: Are agents playing well together?
This makes orchestration across multi-agent systems and platforms a critical concern — and a key differentiator.
“Agent-to-agent communications is emerging as a really big deal,” G2’s chief innovation officer Tim Sanders told VentureBeat. “Because if you don’t orchestrate it, you get misunderstandings, like people speaking foreign languages to each other. Those misunderstandings reduce the quality of actions and raise the specter of hallucinations, which could be security incidents or data leakage.”
Orchestration to this point has largely been around data, but that’s quickly turning to action. “Conductor-like solutions” are increasingly bringing together agents, robotic process automation (RPA), and data repositories. Sanders likened the progression to that of answer engine optimization, which initially began with monitoring and now creates bespoke content and code.
“Orchestration platforms coordinate a variety of different agentic solutions to increase the consistency of outcomes,” he said.
Early providers include Salesforce MuleSoft, UiPath Maestro, and IBM Watsonx Orchestrate. These “phase one” software-based observability dashboards help IT leaders see all agentic actions across an enterprise.
But coordination can only add so much value; these platforms will morph into technical risk management tools that provide greater quality control. This could include, for instance, agent assessments, policy recommendations and proactive scoring (such as how reliable agents are when they call on enterprise tools, or how often they hallucinate and when).
Enterprise leaders have become wary of relying on vendors to minimize risks and errors; many IT decision-makers, in fact, do not trust a vendor’s statements about the reliability of their agents, he said.
Third-party tools are beginning to bridge the gap and automate tedious guardrail processes and escalation tickets. Teams are already experiencing “ticket exhaustion” in semi-automated systems, where agents hit guardrails and require human permission to proceed.
As an example: The loan process at a bank requires 17 steps for approval, and an agent keeps interrupting human workflows with approval requests when it runs into established guardrails.
Third-party orchestration platforms can manage these tickets, approving or denying them, or even challenging the need for approval altogether. They can eventually eliminate the need for persistent human-in-the-loop oversight so organizations can experience “true velocity gains” measured not in percentages but in multiples (that is, 3X versus 30%).
“Where it goes from there is remote management of the entire agentic process for organizations,” Sanders said.
In another critical evolution in the agentic era, human evaluators will become designers, moving from human-in-the-loop to human-on-the-loop, according to Sanders. That is: They will begin designing agents to automate workflows.
Agent builder platforms continue to innovate their no-code solutions, Sanders said, meaning nearly anyone can now stand up an agent using natural language. “This will democratize agentic AI, and the super skill will be the ability to express a goal, provide context and envision pitfalls, very similar to a good people manager today.”
Agent-first automation stacks “dramatically outperform” hybrid automation stacks in almost every attribute, he noted: satisfaction, quality of actions, security, cost savings.
Organizations should begin “expeditious programs” to infuse agents across workflows, especially with highly repetitive work that poses bottlenecks. Likely at first, there will be a strong human-in-the-loop element to ensure quality and promote change management.
“Serving as an evaluator will strengthen the understanding of how these systems work,” Sanders said, “and eventually enable all of us to operate upstream in agentic workflows instead of downstream.”
IT leaders should take inventory today of all the different elements of their automation stack. Whether these elements are rules-based automation, RPA, or agentic automation, they must learn everything going on in the organization to optimally use emerging orchestration platforms.
“If they don’t, there could actually be dis-synergies across organizations where old school technology and cutting edge technology clash at the point of delivery, oftentimes customer-facing,” Sanders said. “You can’t orchestrate what you can’t see clearly.”
In the chaotic world of Large Language Model (LLM) optimization, engineers have spent the last few years developing increasingly esoteric rituals to get better answers.
We’ve seen “Chain of Thought” (asking the model to think step-by-step and often, show those “reasoning traces” to the user), “Emotional Blackmail” (telling the model its career depends on the answer, or that it is being accused of sexual misconduct), and complex multi-shot prompting frameworks.
But a new paper released by Google Research suggests that we may have been overthinking it. The researchers found that simply repeating the input query—literally copying and pasting the prompt so it appears twice—consistently improves performance across major models including Gemini, GPT-4o, Claude, and DeepSeek.
The paper, titled “Prompt Repetition Improves Non-Reasoning LLMs,” released last month just before the holidays, presents a finding that is almost suspiciously simple: for tasks that don’t require complex reasoning steps, stating the prompt twice yields significantly better results than stating it once.
Even better, because of how transformer architecture works, this “one weird trick” comes with virtually zero penalty in terms of generation speed.
To understand why repeating a question makes a supercomputer smarter, you have to look at the architectural limitations of the standard Transformer model.
Most modern LLMs are trained as “causal” language models. This means they process text strictly from left to right. When the model is processing the 5th token in your sentence, it can “attend” (pay attention) to tokens 1 through 4, but it has zero knowledge of token 6, because it hasn’t happened yet.
This creates a fundamental constraint in how models understand user queries. As the authors note, the order of information matters immensely.
A query formatted as <CONTEXT> <QUESTION> often yields different results than <QUESTION> <CONTEXT> because, in the latter case, the model reads the question before it knows the context it’s supposed to apply it to.
Prompt repetition hacks this limitation by transforming an input of <QUERY> into <QUERY><QUERY>.
By the time the model begins processing the second iteration of the query, it has already “read” the first iteration. This allows the tokens in the second copy to attend to every single token in the first copy.
Effectively, the second repetition enjoys a form of bidirectional attention—it can “look back” at the entire query to resolve ambiguities or retrieve specific details that might have been missed in a single pass.
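Applying the trick client-side is trivial. A minimal sketch, assuming a generic `call_model` client rather than any particular SDK:

# Minimal sketch of prompt repetition; `call_model` is a placeholder for any LLM client.
def ask_with_repetition(query: str) -> str:
    repeated_prompt = f"{query}\n\n{query}"  # state the full query twice
    return call_model(repeated_prompt)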
The researchers, Yaniv Leviathan, Matan Kalman, and Yossi Matias, tested this hypothesis across a suite of seven popular benchmarks, including ARC, OpenBookQA, GSM8K, and MMLU-Pro. They evaluated seven different models, ranging from lightweight models like Gemini 2.0 Flash Lite and GPT-4o-mini to heavyweights like Claude 3.7 Sonnet and DeepSeek V3.
The results were statistically stark. When models were asked not to use explicit reasoning (i.e., just to give a direct answer), prompt repetition won 47 out of 70 head-to-head tests against the baseline, with zero losses.
The gains were particularly dramatic in tasks requiring precise retrieval from a prompt. The team designed a custom “NameIndex” benchmark, where the model is given a list of 50 names and asked to identify the 25th one.
Baseline Performance: Gemini 2.0 Flash-Lite scored a dismal 21.33% accuracy.
With Repetition: Accuracy skyrocketed to 97.33%.
This massive jump illustrates the “causal blind spot” perfectly. In a single pass, the model might lose track of the count by the time it reaches the 25th name. In the repeated pass, the model effectively has the entire list in its “working memory” before it attempts to solve the retrieval task.
Usually, adding text to a prompt increases costs and latency. If you double the input, surely you double the wait time?
Surprisingly, no. The paper demonstrates that prompt repetition is essentially “free” regarding user-perceived latency.
LLM processing is divided into two stages:
Prefill: The model processes the input prompt. This is highly parallelizable; the GPU can crunch the entire prompt matrix simultaneously.
Generation (Decoding): The model generates the answer one token at a time. This is serial and slow.
Prompt repetition only increases the work in the prefill stage. Because modern hardware handles prefill so efficiently, the user barely notices the difference. The researchers found that repeating the prompt did not increase the length of the generated answer, nor did it increase the “time to first token” latency for most models.
The only exceptions were Anthropic’s models (Claude Haiku and Sonnet) on extremely long requests, where the prefill stage eventually hit a bottleneck. But for the vast majority of use cases, the technique improves accuracy without slowing down the chat experience.
There is a caveat: this technique is primarily for “non-reasoning” tasks—scenarios where you want a direct answer rather than a step-by-step derivation.
When the researchers tested prompt repetition combined with “Chain of Thought” (asking the model to “think step by step”), the gains largely vanished, showing neutral to slightly positive results (5 wins, 1 loss, 22 ties).
The authors posit that reasoning models naturally perform a version of repetition themselves. When a model “thinks,” it often restates the premise of the question in its generated output before solving it. Therefore, explicitly repeating the prompt in the input becomes redundant.
However, for applications where you need a fast, direct answer without the verbosity (and cost) of a long reasoning trace, prompt repetition offers a powerful alternative.
For enterprise leadership, this research represents that rarest of things in AI development: a “free” optimization. But capitalizing on it requires nuance; this isn’t a setting to toggle blindly across an entire organization, but rather a tactical adjustment that ripples across engineering, orchestration, and security.
For technical leads balancing the eternal triangle of speed, quality, and cost, prompt repetition offers a way to punch above your weight class. The data shows that smaller, faster models—like Gemini 2.0 Flash Lite—can achieve near-perfect retrieval accuracy (jumping from 21.33% to 97.33%) simply by processing the input twice.
This changes the calculus for model selection: before upgrading to a larger, more expensive model to solve an accuracy bottleneck, engineers should first test whether simple repetition allows their current “Lite” models to close the gap. It is a potential strategy for retaining the speed and cost benefits of lightweight infrastructure without sacrificing performance on extraction and retrieval tasks.
This logic naturally shifts the burden to the orchestration layer. For those managing the middleware and API gateways that glue AI applications together, prompt repetition should likely become a standard, invisible component of the pipeline logic rather than a user behavior.
However, because the technique is neutral for reasoning-heavy tasks but highly effective for direct answers, it requires conditional application. A smart orchestration harness would automatically identify requests routed to non-reasoning endpoints—such as entity extraction, classification, or simple Q&A—and double the prompt before passing it to the model. This optimizes performance at the infrastructure level, delivering better results without requiring action from end-users or increasing the generation budget.
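A rough sketch of that routing logic at the gateway layer might look like the following; the task categories and function names are assumptions for illustration:

# Hypothetical middleware hook; routing categories and names are illustrative.
NON_REASONING_TASKS = {"entity_extraction", "classification", "simple_qa"}

def prepare_prompt(task_type: str, prompt: str) -> str:
    """Double the prompt only for direct-answer endpoints; leave reasoning tasks untouched."""
    if task_type in NON_REASONING_TASKS:
        return f"{prompt}\n\n{prompt}"
    return prompt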
Finally, this heightened attentiveness introduces a new variable for security teams.
If repeating a prompt clarifies a user’s intent to the model, it stands to reason that malicious intents might be clarified as well. Security directors will need to update their red-teaming protocols to test “repeated injection” attacks—verifying whether repeating a jailbreak command (e.g., “Ignore previous instructions”) makes the model “attend” to the breach more effectively. Conversely, this mechanism offers a new defensive tool: repeating System Prompts.
Stating safety guardrails twice at the start of the context window could force the model to attend to safety constraints more rigorously, acting as a low-cost reinforcement for robust security operations.
This research highlights a crucial insight for developers building on top of LLMs: our current models are still deeply constrained by their unidirectional nature. While we wait for new architectures that might solve causal blindness, crude but effective workarounds like prompt repetition offer immediate value.
The authors suggest this could become a default behavior for future systems.
We might soon see inference engines that silently double our prompts in the background before sending them to the model, or “reasoning” models trained to internalize this repetition strategy to be more efficient.
For now, if you are struggling to get a model to follow complex instructions or retrieve specific details from a long document, the solution might not be a better prompt. You might just need to say it again.
Egnyte, the $1.5 billion cloud content governance company, has embedded AI coding tools across its global team of more than 350 developers — but not to reduce headcount. Instead, the company continues to hire junior engineers, using AI to accelerate onboarding, deepen codebase understanding, and shorten the path from junior to senior contributor.
The approach challenges a dominant 2025 narrative that automation will replace developers, showing instead how enterprises are using AI to scale engineering capacity while keeping humans firmly in the loop.
“To have engineers disappear or us not hiring junior engineers doesn’t look like the likely outcome,” Amrit Jassal, Egnyte CTO and co-founder, told VentureBeat. “You’ve got to have people, you’re training and doing all types of succession planning. The junior engineer of today is the senior engineer of tomorrow.”
Egnyte — which has more than 22,000 users including NASDAQ, Red Bull, and BuzzFeed — has rolled out Claude Code, Cursor, Augment, and Gemini CLI coding tools across its developer base to support its core business strategies and expand its newer AI offerings like customer-facing copilots and customizable AI agents.
Devs use these tools across a variety of tasks, the simplest of which include data retrieval, code comprehension, smart search, and code lookup. Egnyte’s code base has lots of Java code, which uses numerous libraries, each with different versions, Jassal explained. AI tools are great for peer-to-peer programming, helping new users get a lay of the land, or existing users probe into different code repositories.
“We have a pretty big code base, right?” Jassal said. “Let’s say you’re looking at an iOS application, but you’re not well versed; you will fire up Google CLI or an Augment, and ask it to discover the code base.”
Some Egnyte devs are moving into automatic pull request summaries, which provide simple overviews of code changes that essentially explain the “what,” “how,” and “why” of proposed modifications.
“But obviously, any change that’s made, we don’t want to hear that AI made the change; it has to be that developer made the change,” Jassal pointed out. “I would not trust AI to commit to the production code base.”
Commits still pass through human review and security validation, and anything red-flagged is escalated to senior engineers. Devs are warned of the dangers of settling into autopilot mode or blindly trusting code. A model may not have been exposed to, or given enough samples of, certain coding components and infrastructure in its training.
Another growing, and closely monitored, use case for AI is unit testing, where code components are run in isolation to ensure they work as intended. “At the end of the day, it is a productivity improvement tool,” he said. “It is really a continuation, it’s like any other tool, it’s not some magic.”
Beyond core engineering, AI is helping other teams collaborate with programmers. Product management, for instance, is using tools like Vercel to bring “demo-worthy” prototypes, rather than just ideas, to devs, who can then move ahead with mock-ups. Or, if UX teams are looking to change certain elements on a dashboard, AI can quickly spin up a handful of options, like different widgets or buttons.
“Then you come to engineering with that, and the engineer immediately knows what you really intend to do with it,” Jassal said.
However, day-to-day activities for all Egnyte engineers, including junior developers, extend beyond just coding.
Junior developers are given hands-on tasks across the full development lifecycle to accelerate their growth and experience, Jassal said. For instance, they assist with requirement analysis in early software engineering phases, as well as deployment, productization and post-deployment maintenance.
In turn, these activities require “Egnyte-specific tacit knowledge and experience” offered by senior engineers. One clear example of work that sits firmly with senior engineers is authoring architecture notes, as these cut across the platform and require a more holistic, system-level view, Jassal said.
“Many of the traditional roadblocks are navigated faster these days with AI; for example, understanding the codebase, dissecting requirements, auto-testing,” he said. “This faster track allows our talented junior hires to progress more quickly and provide higher value to the company sooner.”
The company expects a much faster learning curve from junior to mid-level engineers. “It’s always the case that people coming straight into the workforce are much more excited about trying new things,” Jassal said. But that has to be colored with reality to temper expectations, he added.
On the other hand, some senior engineers may need to be ramped up in their adoption because they’re hesitant or had ho-hum or bad experiences with earlier generation tools. This requires incremental introduction.
“The senior people, having been burnt multiple times, bring that perspective,” he said. “So both [types of engineers] play an important role.”
“In general, I would say it has been really hyped by folks who want to sell you tokens,” Jassal said referring to people who talk about human coders becoming obsolete.
“Vibe coding” could be construed in a similar vein: Like others in software development, he prefers the term “AI assisted coding,” wherein programmers have a self-driven loop, generating code, analyzing exceptions, then correcting and scaling.
At least in Egnyte’s case, hiring will continue, even if at a slower clip as people become more productive thanks to AI, Jassal said.
“We are not just hiring for scale, but to develop the next generation of senior developers and inject fresh perspectives into our development practices,” he said.
The takeaway for technical decision-makers is not that AI will eliminate engineering jobs — but that it will reshape how talent is developed.
At Egnyte, AI-assisted coding is compressing learning curves and raising expectations, not removing humans from the process. Enterprises that treat AI as a replacement risk hollowing out their future senior talent pipeline; those that treat it as infrastructure can move faster without losing the judgment, creativity, and accountability that only engineers provide.
Our LLM API bill was growing 30% month-over-month. Traffic was increasing, but not that fast. When I analyzed our query logs, I found the real problem: Users ask the same questions in different ways.
“What’s your return policy?”, “How do I return something?” and “Can I get a refund?” were all hitting our LLM separately, generating nearly identical responses, each incurring full API costs.
Exact-match caching, the obvious first solution, captured only 18% of these redundant calls. The same semantic question, phrased differently, bypassed the cache entirely.
So, I implemented semantic caching based on what queries mean, not how they’re worded. After implementing it, our cache hit rate increased to 67%, reducing LLM API costs by 73%. But getting there requires solving problems that naive implementations miss.
Traditional caching uses query text as the cache key. This works when queries are identical:
# Exact-match caching
cache_key = hash(query_text)
if cache_key in cache:
    return cache[cache_key]
But users don’t phrase questions identically. My analysis of 100,000 production queries found:
Only 18% were exact duplicates of previous queries
47% were semantically similar to previous queries (same intent, different wording)
35% were genuinely novel queries
That 47% represented massive cost savings we were missing. Each semantically-similar query triggered a full LLM call, generating a response nearly identical to one we’d already computed.
Semantic caching replaces text-based keys with embedding-based similarity lookup:
from datetime import datetime
from typing import Optional

class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.92):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.vector_store = VectorStore()      # FAISS, Pinecone, etc.
        self.response_store = ResponseStore()  # Redis, DynamoDB, etc.

    def get(self, query: str) -> Optional[str]:
        """Return cached response if a semantically similar query exists."""
        query_embedding = self.embedding_model.encode(query)
        # Find the most similar cached query
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= self.threshold:
            cache_id = matches[0].id
            return self.response_store.get(cache_id)
        return None

    def set(self, query: str, response: str):
        """Cache a query-response pair."""
        query_embedding = self.embedding_model.encode(query)
        cache_id = generate_id()
        self.vector_store.add(cache_id, query_embedding)
        self.response_store.set(cache_id, {
            'query': query,
            'response': response,
            'timestamp': datetime.utcnow()
        })
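For context, this is roughly how the cache sits in front of the model in a request handler; `generate_llm_response` stands in for whatever LLM client the application already uses:

# Illustrative request path; `generate_llm_response` and `embedding_model` are placeholders
# for the client and encoder the application already has.
cache = SemanticCache(embedding_model)

def answer(query: str) -> str:
    cached = cache.get(query)
    if cached is not None:
        return cached                        # cache hit: no LLM call
    response = generate_llm_response(query)  # cache miss: pay for one LLM call
    cache.set(query, response)
    return response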
The key insight: Instead of hashing query text, I embed queries into vector space and find cached queries within a similarity threshold.
The similarity threshold is the critical parameter. Set it too high, and you miss valid cache hits. Set it too low, and you return wrong responses.
Our initial threshold of 0.85 seemed reasonable; 85% similar should be “the same question,” right?
Wrong. At 0.85, we got cache hits like:
Query: “How do I cancel my subscription?”
Cached: “How do I cancel my order?”
Similarity: 0.87
These are different questions with different answers. Returning the cached response would be incorrect.
I discovered that optimal thresholds vary by query type:
| Query type | Optimal threshold | Rationale |
| --- | --- | --- |
| FAQ-style questions | 0.94 | High precision needed; wrong answers damage trust |
| Product searches | 0.88 | More tolerance for near-matches |
| Support queries | 0.92 | Balance between coverage and accuracy |
| Transactional queries | 0.97 | Very low tolerance for errors |
I implemented query-type-specific thresholds:
class AdaptiveSemanticCache:
    def __init__(self, embedding_model):
        # Same components as SemanticCache above
        self.embedding_model = embedding_model
        self.vector_store = VectorStore()
        self.response_store = ResponseStore()
        self.thresholds = {
            'faq': 0.94,
            'search': 0.88,
            'support': 0.92,
            'transactional': 0.97,
            'default': 0.92
        }
        self.query_classifier = QueryClassifier()

    def get_threshold(self, query: str) -> float:
        query_type = self.query_classifier.classify(query)
        return self.thresholds.get(query_type, self.thresholds['default'])

    def get(self, query: str) -> Optional[str]:
        threshold = self.get_threshold(query)
        query_embedding = self.embedding_model.encode(query)
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= threshold:
            return self.response_store.get(matches[0].id)
        return None
I couldn’t tune thresholds blindly. I needed ground truth on which query pairs were actually “the same.”
Our methodology:
Step 1: Sample query pairs. I sampled 5,000 query pairs at various similarity levels (0.80-0.99).
Step 2: Human labeling. Annotators labeled each pair as “same intent” or “different intent.” I used three annotators per pair and took a majority vote.
Step 3: Compute precision/recall curves. For each threshold, we computed:
Precision: Of cache hits, what fraction had the same intent?
Recall: Of same-intent pairs, what fraction did we cache-hit?
def compute_precision_recall(pairs, labels, threshold):
    """Compute precision and recall at a given similarity threshold."""
    predictions = [1 if pair.similarity >= threshold else 0 for pair in pairs]
    true_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 1)
    false_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 0)
    false_negatives = sum(1 for p, l in zip(predictions, labels) if p == 0 and l == 1)
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    return precision, recall
Step 4: Select threshold based on cost of errors. For FAQ queries where wrong answers damage trust, I optimized for precision (0.94 threshold gave 98% precision). For search queries where missing a cache hit just costs money, I optimized for recall (0.88 threshold).
Semantic caching adds latency: You must embed the query and search the vector store before knowing whether to call the LLM.
Our measurements:
| Operation | Latency (p50) | Latency (p99) |
| --- | --- | --- |
| Query embedding | 12ms | 28ms |
| Vector search | 8ms | 19ms |
| Total cache lookup | 20ms | 47ms |
| LLM API call | 850ms | 2400ms |
The 20ms overhead is negligible compared to the 850ms LLM call we avoid on cache hits. Even at p99, the 47ms overhead is acceptable.
However, cache misses now take 20ms longer than before (embedding + search + LLM call). At our 67% hit rate, the math works out favorably:
Before: 100% of queries × 850ms = 850ms average
After: (33% × 870ms) + (67% × 20ms) = 287ms + 13ms = 300ms average
Net latency improvement of 65% alongside the cost reduction.
Cached responses go stale. Product information changes, policies update and yesterday’s correct answer becomes today’s wrong answer.
I implemented three invalidation strategies:
Simple expiration based on content type:
from datetime import timedelta

TTL_BY_CONTENT_TYPE = {
    'pricing': timedelta(hours=4),      # Changes frequently
    'policy': timedelta(days=7),        # Changes rarely
    'product_info': timedelta(days=1),  # Daily refresh
    'general_faq': timedelta(days=14),  # Very stable
}
When underlying data changes, invalidate related cache entries:
class CacheInvalidator:
    def on_content_update(self, content_id: str, content_type: str):
        """Invalidate cache entries related to updated content."""
        # Find cached queries that referenced this content
        affected_queries = self.find_queries_referencing(content_id)
        for query_id in affected_queries:
            self.cache.invalidate(query_id)
        self.log_invalidation(content_id, len(affected_queries))
For responses that might become stale without explicit events, I implemented periodic freshness checks:
def check_freshness(self, cached_response: dict) -> bool:
    """Verify the cached response is still valid."""
    # Re-run the query against current data
    fresh_response = self.generate_response(cached_response['query'])
    # Compare semantic similarity of the responses
    cached_embedding = self.embed(cached_response['response'])
    fresh_embedding = self.embed(fresh_response)
    similarity = cosine_similarity(cached_embedding, fresh_embedding)
    # If responses diverged significantly, invalidate
    if similarity < 0.90:
        self.cache.invalidate(cached_response['id'])
        return False
    return True
We run freshness checks on a sample of cached entries daily, catching staleness that TTL and event-based invalidation miss.
After three months in production:
| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Cache hit rate | 18% | 67% | +272% |
| LLM API costs | $47K/month | $12.7K/month | -73% |
| Average latency | 850ms | 300ms | -65% |
| False-positive rate | N/A | 0.8% | — |
| Customer complaints (wrong answers) | Baseline | +0.3% | Minimal increase |
The 0.8% false-positive rate (queries where we returned a cached response that was semantically incorrect) was within acceptable bounds. These cases occurred primarily at the boundaries of our threshold, where similarity was just above the cutoff but intent differed slightly.
Don’t use a single global threshold. Different query types have different tolerance for errors. Tune thresholds per category.
Don’t skip the embedding step on cache hits. You might be tempted to skip embedding overhead when returning cached responses, but you need the embedding for cache key generation. The overhead is unavoidable.
Don’t forget invalidation. Semantic caching without invalidation strategy leads to stale responses that erode user trust. Build invalidation from day one.
Don’t cache everything. Some queries shouldn’t be cached: Personalized responses, time-sensitive information, transactional confirmations. Build exclusion rules.
def should_cache(self, query: str, response: str) -> bool:
    """Determine if a response should be cached."""
    # Don't cache personalized responses
    if self.contains_personal_info(response):
        return False
    # Don't cache time-sensitive information
    if self.is_time_sensitive(query):
        return False
    # Don't cache transactional confirmations
    if self.is_transactional(query):
        return False
    return True
Semantic caching is a practical pattern for LLM cost control that captures redundancy exact-match caching misses. The key challenges are threshold tuning (use query-type-specific thresholds based on precision/recall analysis) and cache invalidation (combine TTL, event-based and staleness detection).
At 73% cost reduction, this was our highest-ROI optimization for production LLM systems. The implementation complexity is moderate, but the threshold tuning requires careful attention to avoid quality degradation.
Sreenivasa Reddy Hulebeedu Reddy is a lead software engineer.
A new framework from researchers Alexander and Jacob Roman rejects the complexity of current AI tools, offering a synchronous, type-safe alternative designed for reproducibility and cost-conscious science.
In the rush to build autonomous AI agents, developers have largely been forced into a binary choice: surrender control to massive, complex ecosystems like LangChain, or lock themselves into single-vendor SDKs from providers like Anthropic or OpenAI. For software engineers, this is an annoyance. For scientists trying to use AI for reproducible research, it is a dealbreaker.
Enter Orchestral AI, a new Python framework released on GitHub this week that attempts to chart a third path.
Developed by theoretical physicist Alexander Roman and software engineer Jacob Roman, Orchestral positions itself as the “scientific computing” answer to agent orchestration—prioritizing deterministic execution and debugging clarity over the “magic” of async-heavy alternatives.
The core philosophy behind Orchestral is an intentional rejection of the complexity that plagues the current market. While frameworks like AutoGPT and LangChain rely heavily on asynchronous event loops—which can make error tracing a nightmare—Orchestral utilizes a strictly synchronous execution model.
“Reproducibility demands understanding exactly what code executes and when,” the founders argue in their technical paper. By forcing operations to happen in a predictable, linear order, the framework ensures that an agent’s behavior is deterministic—a critical requirement for scientific experiments where a “hallucinated” variable or a race condition could invalidate a study.
Despite this focus on simplicity, the framework is provider-agnostic. It ships with a unified interface that works across OpenAI, Anthropic, Google Gemini, Mistral, and local models via Ollama. This allows researchers to write an agent once and swap the underlying “brain” with a single line of code—crucial for comparing model performance or managing grant money by switching to cheaper models for draft runs.
Orchestral introduces a concept the founders call “LLM-UX”—user experience designed from the perspective of the model itself.
The framework simplifies tool creation by automatically generating JSON schemas from standard Python type hints. Instead of writing verbose descriptions in a separate format, developers can simply annotate their Python functions. Orchestral handles the translation, ensuring that the data types passed between the LLM and the code remain safe and consistent.
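The general pattern resembles the sketch below, which derives a JSON schema from ordinary type hints; this is a conceptual illustration, not Orchestral's actual decorator or API:

from typing import get_type_hints

# Illustrative only: maps Python type hints to JSON Schema types the way a
# framework like Orchestral might when exposing a function as an LLM tool.
PY_TO_JSON = {int: "integer", float: "number", str: "string", bool: "boolean"}

def tool_schema(fn) -> dict:
    hints = get_type_hints(fn)
    hints.pop("return", None)
    params = {name: {"type": PY_TO_JSON.get(tp, "string")} for name, tp in hints.items()}
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),
        "parameters": {"type": "object", "properties": params, "required": list(params)},
    }

def orbital_period(semi_major_axis_au: float, star_mass_solar: float) -> float:
    """Return the orbital period in years using Kepler's third law."""
    return (semi_major_axis_au ** 3 / star_mass_solar) ** 0.5

schema = tool_schema(orbital_period)  # ready to hand to a tool-calling LLM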
This philosophy extends to the built-in tooling. The framework includes a persistent terminal tool that maintains its state (like working directories and environment variables) between calls. This mimics how human researchers interact with command lines, reducing the cognitive load on the model and preventing the common failure mode where an agent “forgets” it changed directories three steps ago.
Orchestral’s origins in high-energy physics and exoplanet research are evident in its feature set. The framework includes native support for LaTeX export, allowing researchers to drop formatted logs of agent reasoning directly into academic papers.
It also tackles the practical reality of running LLMs: cost. The framework includes an automated cost-tracking module that aggregates token usage across different providers, allowing labs to monitor burn rates in real-time.
Perhaps most importantly for safety-conscious fields, Orchestral implements “read-before-edit” guardrails. If an agent attempts to overwrite a file it hasn’t read in the current session, the system blocks the action and prompts the model to read the file first. This prevents the “blind overwrite” errors that terrify anyone using autonomous coding agents.
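The guardrail amounts to a small piece of session state. A hedged sketch of the pattern, not Orchestral's implementation:

# Conceptual sketch of a read-before-edit guardrail; not Orchestral's actual code.
class FileGuardrail:
    def __init__(self):
        self.read_this_session: set[str] = set()

    def record_read(self, path: str):
        self.read_this_session.add(path)

    def check_write(self, path: str):
        if path not in self.read_this_session:
            # Block the edit and tell the model what to do instead
            raise PermissionError(f"Read {path} before editing it.")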
While Orchestral is easy to install via pip install orchestral-ai, potential users should look closely at the license. Unlike the MIT or Apache licenses common in the Python ecosystem, Orchestral is released under a Proprietary license.
The documentation explicitly states that “unauthorized copying, distribution, modification, or use… is strictly prohibited without prior written permission”. This “source-available” model allows researchers to view and use the code, but restricts them from forking it or building commercial competitors without an agreement. This suggests a business model focused on enterprise licensing or dual-licensing strategies down the road.
Furthermore, early adopters will need to be on the bleeding edge of Python environments: the framework requires Python 3.13 or higher, explicitly dropping support for the widely used Python 3.12 due to compatibility issues.
“Civilization advances by extending the number of important operations which we can perform without thinking about them,” the founders write, quoting mathematician Alfred North Whitehead.
Orchestral attempts to operationalize this for the AI era. By abstracting away the “plumbing” of API connections and schema validation, it aims to let scientists focus on the logic of their agents rather than the quirks of the infrastructure. Whether the academic and developer communities will embrace a proprietary tool in an ecosystem dominated by open source remains to be seen, but for those drowning in async tracebacks and broken tool calls, Orchestral offers a tempting promise of sanity.