By now, many enterprises have deployed some form of RAG. The promise is seductive: index your PDFs, connect an LLM and instantly democratize your corporate knowledge.
But for industries dependent on heavy engineering, the reality has been underwhelming. Engineers ask specific questions about infrastructure, and the bot hallucinates.
The failure isn’t in the LLM. The failure is in the preprocessing.
Standard RAG pipelines treat documents as flat strings of text. They use “fixed-size chunking” (cutting a document every 500 characters). This works for prose, but it destroys the logic of technical manuals. It slices tables in half, severs captions from images, and ignores the visual hierarchy of the page.
Improving RAG reliability isn’t about buying a bigger model; it’s about fixing the “dark data” problem through semantic chunking and multimodal textualization.
Here is the architectural framework for building a RAG system that can actually read a manual.
In a standard Python RAG tutorial, you split text by character count. In an enterprise PDF, this is disastrous.
If a safety specification table spans 1,000 tokens, and your chunk size is 500, you have just split the “voltage limit” header from the “240V” value. The vector database stores them separately. When a user asks, “What is the voltage limit?”, the retrieval system finds the header but not the value. The LLM, forced to answer, often guesses.
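The failure is easy to reproduce. A minimal sketch (toy strings and a deliberately tiny chunk size for illustration) shows a character-count splitter separating the header from its value:

```python
# Fixed-size chunking: cut every `size` characters, regardless of structure.
def fixed_chunks(text, size=40):
    return [text[i:i + size] for i in range(0, len(text), size)]

# A toy "table" flattened to text, as a PDF extractor might emit it:
table = "Parameter: Voltage limit ............ Value: 240V"

chunks = fixed_chunks(table, 30)
# "Voltage limit" lands in one chunk and "240V" in another, so a vector
# search can retrieve the header without the value it belongs to.
```

A retrieval query for "voltage limit" now matches a chunk that contains no answer at all.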
The first step to fixing production RAG is abandoning arbitrary character counts in favor of document intelligence.
Using layout-aware parsing tools (such as Azure Document Intelligence), we can segment data based on document structure (chapters, sections, paragraphs) rather than token count. This yields two properties:
Logical cohesion: A section describing a specific machine part is kept as a single vector, even if it varies in length.
Table preservation: The parser identifies a table boundary and forces the entire grid into a single chunk, preserving the row-column relationships that are vital for accurate retrieval.
In our internal qualitative benchmarks, moving from fixed-size to semantic chunking significantly improved the retrieval accuracy of tabular data, effectively stopping the fragmentation of technical specs.
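A minimal sketch of structure-based splitting (the `## ` heading marker is an assumed stand-in for the richer structure a layout-aware parser actually emits):

```python
# Split on section boundaries instead of character counts; a table row
# stays inside its section chunk.
def semantic_chunks(doc):
    sections, current = [], []
    for line in doc.splitlines():
        if line.startswith("## ") and current:  # assumed section marker
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

doc = "## Motor assembly\nTorque specs...\n## Safety limits\n| Voltage limit | 240V |"
chunks = semantic_chunks(doc)
# Each section becomes one chunk; "Voltage limit" and "240V" stay together.
```

Chunks now vary in length, but each one is logically complete.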
The second failure mode of enterprise RAG is blindness. A massive amount of corporate IP exists not in text, but in flowcharts, schematics and system architecture diagrams. Standard embedding models (like text-embedding-3-small) cannot “see” these images. They are skipped during indexing.
If your answer lies in a flowchart, your RAG system will say, “I don’t know.”
To make diagrams searchable, we implemented a multimodal preprocessing step using vision-capable models (specifically GPT-4o) before the data ever hits the vector store.
OCR extraction: High-precision optical character recognition pulls text labels from within the image.
Generative captioning: The vision model analyzes the image and generates a detailed natural language description (“A flowchart showing that process A leads to process B if the temperature exceeds 50 degrees”).
Hybrid embedding: This generated description is embedded and stored as metadata linked to the original image.
Now, when a user searches for “temperature process flow,” the vector search matches the description, even though the original source was a PNG file.
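The three steps above can be sketched as a record builder (names are hypothetical; in production the caption would come from a vision-capable model rather than being passed in):

```python
def textualize_image(image_path, ocr_text, caption):
    """Build the searchable record for an image: the generated caption and
    OCR labels become the embeddable text, linked back to the source file."""
    return {
        "text": f"{caption}\nLabels: {ocr_text}",  # what gets embedded
        "metadata": {"source_image": image_path, "modality": "image"},
    }

record = textualize_image(
    "flowchart_p12.png",  # hypothetical file
    "Process A, Process B, 50 degrees",
    "Flowchart showing that process A leads to process B if the temperature "
    "exceeds 50 degrees",
)
# A query for "temperature process flow" can now match record["text"], and
# the metadata link lets the UI show the original PNG alongside the answer.
```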
For enterprise adoption, accuracy is only half the battle. The other half is verifiability.
In a standard RAG interface, the chatbot gives a text answer and cites a filename. This forces the user to download the PDF and hunt for the page to verify the claim. For high-stakes queries (“Is this chemical flammable?”), users simply won’t trust the bot.
The architecture should implement visual citation. Because we preserved the link between the text chunk and its parent image during the preprocessing phase, the UI can display the exact chart or table used to generate the answer alongside the text response.
This “show your work” mechanism allows humans to verify the AI’s reasoning instantly, bridging the trust gap that kills so many internal AI projects.
While the “textualization” method (converting images to text descriptions) is the practical solution for today, the architecture is rapidly evolving.
We are already seeing the emergence of native multimodal embeddings (such as Cohere’s Embed 4). These models can map text and images into the same vector space without the intermediate step of captioning. While we currently use a multi-stage pipeline for maximum control, the future of data infrastructure will likely involve “end-to-end” vectorization where the layout of a page is embedded directly.
Furthermore, as long context LLMs become cost-effective, the need for chunking may diminish. We may soon pass entire manuals into the context window. However, until latency and cost for million-token calls drop significantly, semantic preprocessing remains the most economically viable strategy for real-time systems.
The difference between a RAG demo and a production system is how it handles the messy reality of enterprise data.
Stop treating your documents as simple strings of text. If you want your AI to understand your business, you must respect the structure of your documents. By implementing semantic chunking and unlocking the visual data within your charts, you transform your RAG system from a “keyword searcher” into a true “knowledge assistant.”
Dippu Kumar Singh is an AI architect and data engineer.
The industry consensus is that 2026 will be the year of “agentic AI.” We are rapidly moving past chatbots that simply summarize text. We are entering the era of autonomous agents that execute tasks. We expect them to book flights, diagnose system outages, manage cloud infrastructure and personalize media streams in real-time.
As a technology executive overseeing platforms that serve 30 million concurrent users during massive global events like the Olympics and the Super Bowl, I have seen the unsexy reality behind the hype: Agents are incredibly fragile.
Executives and VCs obsess over model benchmarks. They debate Llama 3 versus GPT-4. They focus on maximizing context window sizes. Yet they are ignoring the actual failure point. The primary reason autonomous agents fail in production is poor data hygiene.
In the previous era of “human-in-the-loop” analytics, data quality was a manageable nuisance. If an ETL pipeline broke, a dashboard might display an incorrect revenue number. A human analyst would spot the anomaly, flag it and fix it. The blast radius was contained.
In the new world of autonomous agents, that safety net is gone.
If a data pipeline drifts today, an agent doesn’t just report the wrong number. It takes the wrong action. It provisions the wrong server type. It recommends a horror movie to a user watching cartoons. It hallucinates a customer service answer based on corrupted vector embeddings.
To run AI at the scale of the NFL or the Olympics, I realized that standard data cleaning is insufficient. We cannot just “monitor” data. We must legislate it.
A solution to this specific problem is a “data quality creed” framework, which functions as a “data constitution.” It enforces thousands of automated rules before a single byte of data is allowed to touch an AI model. While I applied this specifically to the streaming architecture at NBCUniversal, the methodology is universal for any enterprise looking to operationalize AI agents.
Here is why “defensive data engineering” and the Creed philosophy are the only ways to survive the Agentic era.
The core problem with AI Agents is that they trust the context you give them implicitly. If you are using RAG, your vector database is the agent’s long-term memory.
Standard data quality issues are catastrophic for vector databases. In traditional SQL databases, a null value is just a null value. In a vector database, a null value or a schema mismatch can warp the semantic meaning of the entire embedding.
Consider a scenario where metadata drifts. Suppose your pipeline ingests video metadata, but a race condition causes the “genre” tag to slip. Your metadata might tag a video as “live sports,” but the embedding was generated from a “news clip.” When an agent queries the database for “touchdown highlights,” it retrieves the news clip because the vector similarity search is operating on a corrupted signal. The agent then serves that clip to millions of users.
At scale, you cannot rely on downstream monitoring to catch this. By the time an anomaly alarm goes off, the agent has already made thousands of bad decisions. Quality controls must shift to the absolute “left” of the pipeline.
The Creed framework acts as a gatekeeper: a multi-tenant quality architecture that sits between ingestion sources and AI models.
For technology leaders looking to build their own “constitution,” here are the three non-negotiable principles I recommend.
1. The “quarantine” pattern is mandatory: In many modern data organizations, engineers favor the “ELT” approach. They dump raw data into a lake and clean it up later. For AI Agents, this is unacceptable. You cannot let an agent drink from a polluted lake.
The Creed methodology enforces a strict “dead letter queue.” If a data packet violates a contract, it is immediately quarantined. It never reaches the vector database. It is far better for an agent to say “I don’t know” due to missing data than to confidently lie due to bad data. This “circuit breaker” pattern is essential for preventing high-profile hallucinations.
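A minimal sketch of the quarantine gate (the contract fields are hypothetical): records that violate the contract go to the dead letter queue and never reach the vector store.

```python
# Assumed data contract for this illustration:
CONTRACT = {"user_id": str, "genre": str, "timestamp": float}

def gate(record, vector_store, dead_letter_queue):
    violations = [
        field for field, expected_type in CONTRACT.items()
        if not isinstance(record.get(field), expected_type)
    ]
    if violations:
        dead_letter_queue.append({"record": record, "violations": violations})
        return False  # quarantined: the agent sees nothing, not bad data
    vector_store.append(record)
    return True

store, dlq = [], []
gate({"user_id": "u1", "genre": "live_sports", "timestamp": 1.0}, store, dlq)
gate({"user_id": "u2", "genre": None, "timestamp": 2.0}, store, dlq)  # quarantined
```

The second record (a drifted `genre` tag) never pollutes the agent's memory; it sits in the queue for a human to inspect.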
2. Schema is law: For years, the industry moved toward “schemaless” flexibility to move fast. We must reverse that trend for core AI pipelines. We must enforce strict typing and referential integrity.
In my experience, a robust system requires scale. The implementation I oversee currently enforces more than 1,000 active rules running across real-time streams. These aren’t just checking for nulls. They check for business logic consistency.
Example: Does the “user_segment” in the event stream match the active taxonomy in the feature store? If not, block it.
Example: Is the timestamp within the acceptable latency window for real-time inference? If not, drop it.
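The two example rules can be sketched as a single predicate (the taxonomy and latency threshold are assumptions for illustration):

```python
ACTIVE_SEGMENTS = {"free", "premium", "trial"}  # hypothetical active taxonomy
MAX_LATENCY_SECONDS = 5.0                       # assumed real-time window

def passes_rules(event, now):
    # Rule 1: user_segment must match the active taxonomy, else block.
    if event["user_segment"] not in ACTIVE_SEGMENTS:
        return False
    # Rule 2: timestamp must be within the latency window, else drop.
    if now - event["timestamp"] > MAX_LATENCY_SECONDS:
        return False
    return True

passes_rules({"user_segment": "premium", "timestamp": 98.0}, 100.0)  # accepted
passes_rules({"user_segment": "vip", "timestamp": 99.0}, 100.0)      # blocked
```

Real deployments chain hundreds of such predicates; the point is that each one is a cheap, automated check at ingestion time.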
3. Vector consistency checks: This is the new frontier for SREs. We must implement automated checks to ensure that the text chunks stored in a vector database actually match the embedding vectors associated with them. “Silent” failures in an embedding model API often leave you with vectors that point to nothing. This causes agents to retrieve pure noise.
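One way to implement such a check is a periodic audit that re-embeds a sample of stored chunks and flags rows whose stored vector no longer matches (`toy_embed` stands in for the real embedding model; the 0.99 threshold is an assumption):

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def audit(rows, embed, threshold=0.99, sample=10):
    """Re-embed a random sample of chunks; flag rows whose stored vector
    has drifted from what the chunk embeds to today."""
    flagged = []
    for row in random.sample(rows, min(sample, len(rows))):
        if cosine(embed(row["chunk"]), row["vector"]) < threshold:
            flagged.append(row["id"])  # silent failure: vector points to noise
    return flagged

def toy_embed(text):  # stand-in for the real embedding model
    return [float(len(text)), float(text.count("a")) + 1.0]

rows = [
    {"id": 1, "chunk": "alpha", "vector": toy_embed("alpha")},  # consistent
    {"id": 2, "chunk": "beta", "vector": [9.0, -3.0]},          # corrupted
]
bad = audit(rows, toy_embed, sample=2)  # flags row 2
```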
Implementing a framework like Creed is not just a technical challenge. It is a cultural one.
Engineers generally hate guardrails. They view strict schemas and data contracts as bureaucratic hurdles that slow down deployment velocity. When introducing a data constitution, leaders often face pushback. Teams feel they are returning to the “waterfall” era of rigid database administration.
To succeed, you must flip the incentive structure. We demonstrated that Creed was actually an accelerator. By guaranteeing the purity of the input data, we eliminated the weeks data scientists used to spend debugging model hallucinations. We turned data governance from a compliance task into a “quality of service” guarantee.
If you are building an AI strategy for 2026, stop buying more GPUs. Stop worrying about which foundation model is slightly higher on the leaderboard this week.
Start auditing your data contracts.
An AI Agent is only as autonomous as its data is reliable. Without a strict, automated data constitution like the Creed framework, your agents will eventually go rogue. In an SRE’s world, a rogue agent is far worse than a broken dashboard. It is a silent killer of trust, revenue, and customer experience.
Manoj Yerrasani is a senior technology executive.
The modern customer has just one need that matters: Getting the thing they want when they want it. The standard RAG pipeline (embed, retrieve, LLM) misunderstands intent, overloads context and misses freshness, repeatedly sending customers down the wrong paths.
Instead, intent-first architecture uses a lightweight language model to parse the query for intent and context, before delivering to the most relevant content sources (documents, APIs, people).
Enterprise AI is a speeding train headed for a cliff. Organizations are deploying LLM-powered search applications at a record pace, while a fundamental architectural issue is setting most up for failure.
A recent Coveo study revealed that 72% of enterprise search queries fail to deliver meaningful results on the first attempt, while Gartner predicts that the majority of conversational AI deployments will fall short of enterprise expectations.
The problem isn’t the underlying models. It’s the architecture around them.
After designing and running live AI-driven customer interaction platforms at scale, serving millions of customer and citizen users at some of the world’s largest telecommunications and healthcare organizations, I’ve come to see a pattern. It’s the difference between successful AI-powered interaction deployments and multi-million-dollar failures.
It’s a cloud-native architecture pattern that I call Intent-First. And it’s reshaping the way enterprises build AI-powered experiences.
Gartner projects the global conversational AI market will balloon to $36 billion by 2032. Enterprises are scrambling to get a slice. The demos are irresistible. Plug your LLM into your knowledge base, and suddenly it can answer customer questions in natural language. Magic.
Then production happens.
A major telecommunications provider I work with rolled out a RAG system with the expectation of driving down the support call rate. Instead, the rate increased. Callers tried AI-powered search, were provided incorrect answers with a high degree of confidence and called customer support angrier than before.
This pattern is repeated over and over. In healthcare, customer-facing AI assistants are providing patients with formulary information that’s outdated by weeks or months. Financial services chatbots are spitting out answers from both retail and institutional product content. Retailers are seeing discontinued products surface in product searches.
The issue isn’t a failure of AI technology. It’s a failure of architecture.
The standard RAG pattern (embedding the query, retrieving semantically similar content, passing the results to an LLM) works beautifully in demos and proofs of concept. But it falls apart in production use cases for three systematic reasons:
Intent is not context. But standard RAG architectures don’t account for this.
Say a customer types “I want to cancel.” What does that mean? Cancel a service? Cancel an order? Cancel an appointment? During our telecommunications deployment, we found that 65% of queries for “cancel” were actually about orders or appointments, not service cancellation. The RAG system had no way of understanding this intent, so it consistently returned service cancellation documents.
Intent matters. In healthcare, if a patient types “I need to cancel” because they’re trying to cancel an appointment, a prescription refill or a procedure, routing them to medication content instead of scheduling content is not only frustrating — it’s also dangerous.
Enterprise knowledge is vast, spanning dozens of sources such as product catalogs, billing, support articles, policies, promotions and account data. Standard RAG treats all of it the same, searching every source for every query.
When a customer asks “How do I activate my new phone,” they don’t care about billing FAQs, store locations or network status updates. But a standard RAG model retrieves semantically similar content from every source, returning search results that are a half-step off the mark.
Vector space is time-blind. Semantically, last quarter’s promotion is identical to this quarter’s. But presenting customers with outdated offers shatters trust. We linked a significant percentage of customer complaints to search results that surfaced expired products, offers or features.
The Intent-First architecture pattern is the mirror image of the standard RAG deployment. In the RAG model, you retrieve, then route. In the Intent-First model, you classify before you route or retrieve.
Intent-First architectures use a lightweight language model to parse a query for intent and context, before dispatching to the most relevant content sources (documents, APIs, agents).
The Intent-First pattern is designed for cloud-native deployment, leveraging microservices, containerization and elastic scaling to handle enterprise traffic patterns.
The classifier determines user intent before any retrieval occurs:
ALGORITHM: Intent Classification
INPUT: user_query (string)
OUTPUT: intent_result (object)

1. PREPROCESS query (normalize, expand contractions)
2. CLASSIFY using transformer model:
       primary_intent ← model.predict(query)
       confidence ← model.confidence_score()
3. IF confidence < 0.70 THEN
       RETURN {
           requires_clarification: true,
           suggested_question: generate_clarifying_question(query)
       }
4. EXTRACT sub_intent based on primary_intent:
       IF primary = "ACCOUNT" → check for ORDER_STATUS, PROFILE, etc.
       IF primary = "SUPPORT" → check for DEVICE_ISSUE, NETWORK, etc.
       IF primary = "BILLING" → check for PAYMENT, DISPUTE, etc.
5. DETERMINE target_sources based on intent mapping:
       ORDER_STATUS → [orders_db, order_faq]
       DEVICE_ISSUE → [troubleshooting_kb, device_guides]
       MEDICATION → [formulary, clinical_docs] (healthcare)
6. RETURN {
       primary_intent,
       sub_intent,
       confidence,
       target_sources,
       requires_personalization: true/false
   }
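The confidence gate in steps 2-3 and the routing in step 5 can be sketched as runnable Python (the transformer classifier is stubbed with a keyword lookup; the 0.70 threshold and source names come from the algorithm above):

```python
# Intent -> source mapping, from step 5 of the algorithm:
INTENT_SOURCES = {
    "ORDER_STATUS": ["orders_db", "order_faq"],
    "DEVICE_ISSUE": ["troubleshooting_kb", "device_guides"],
}

def classify(query, predict):
    primary_intent, confidence = predict(query)
    if confidence < 0.70:  # low confidence: ask, don't guess
        return {"requires_clarification": True}
    return {
        "primary_intent": primary_intent,
        "confidence": confidence,
        "target_sources": INTENT_SOURCES.get(primary_intent, []),
    }

def stub_predict(query):  # stand-in for the transformer classifier
    if "order" in query.lower():
        return "ORDER_STATUS", 0.92
    return "UNKNOWN", 0.40

result = classify("Where is my order?", stub_predict)
# High-confidence ORDER_STATUS: retrieval is scoped to two sources
# instead of the entire knowledge base.
```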
Once intent is classified, retrieval becomes targeted:
ALGORITHM: Context-Aware Retrieval
INPUT: query, intent_result, user_context
OUTPUT: ranked_documents

1. GET source_config for intent_result.sub_intent:
       primary_sources ← sources to search
       excluded_sources ← sources to skip
       freshness_days ← max content age
2. IF intent requires personalization AND user is authenticated:
       FETCH account_context from Account Service
       IF intent = ORDER_STATUS:
           FETCH recent_orders (last 60 days)
           ADD to results
3. BUILD search filters:
       content_types ← primary_sources only
       max_age ← freshness_days
       user_context ← account_context (if available)
4. FOR EACH source IN primary_sources:
       documents ← vector_search(query, source, filters)
       ADD documents to results
5. SCORE each document:
       relevance_score ← vector_similarity × 0.40
       recency_score ← freshness_weight × 0.20
       personalization_score ← user_match × 0.25
       intent_match_score ← type_match × 0.15
       total_score ← SUM of above
6. RANK by total_score descending
7. RETURN top 10 documents
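The weighted scoring in step 5 is a simple linear combination; a small sketch shows how the recency term lets a fresh document outrank a slightly more similar but expired one (the documents are hypothetical):

```python
def score(doc):
    # Weights from step 5 of the retrieval algorithm.
    return (doc["vector_similarity"] * 0.40
            + doc["freshness_weight"] * 0.20
            + doc["user_match"] * 0.25
            + doc["type_match"] * 0.15)

docs = [
    {"id": "promo_expired", "vector_similarity": 0.95, "freshness_weight": 0.1,
     "user_match": 0.5, "type_match": 1.0},
    {"id": "promo_current", "vector_similarity": 0.90, "freshness_weight": 1.0,
     "user_match": 0.5, "type_match": 1.0},
]
ranked = sorted(docs, key=score, reverse=True)
# The current promotion outranks the expired one despite lower raw similarity.
```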
In healthcare deployments, the Intent-First pattern includes additional safeguards:
Healthcare intent categories:
Clinical: Medication questions, symptoms, care instructions
Coverage: Benefits, prior authorization, formulary
Scheduling: Appointments, provider availability
Billing: Claims, payments, statements
Account: Profile, dependents, ID cards
Critical safeguard: Clinical queries always include disclaimers and never replace professional medical advice. The system routes complex clinical questions to human support.
The edge cases are where systems fail. The Intent-First pattern includes specific handlers:
Frustration detection keywords:
Anger: “terrible,” “worst,” “hate,” “ridiculous”
Time: “hours,” “days,” “still waiting”
Failure: “useless,” “no help,” “doesn’t work”
Escalation: “speak to human,” “real person,” “manager”
When frustration is detected, skip search entirely and route to human support.
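A minimal escalation check over those keyword lists (abbreviated here) might look like:

```python
FRUSTRATION_KEYWORDS = {
    "terrible", "worst", "hate", "ridiculous",  # anger
    "still waiting",                            # time
    "useless", "no help", "doesn't work",       # failure
    "speak to human", "real person", "manager", # escalation
}

def should_escalate(query):
    q = query.lower()
    return any(keyword in q for keyword in FRUSTRATION_KEYWORDS)

should_escalate("This is useless, I want a real person")  # routes to a human
should_escalate("How do I activate my new phone?")        # proceeds to search
```

Production systems would add sentiment models on top, but even this keyword pass catches the cases where another search result would only add fuel.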
The Intent-First pattern applies wherever enterprises deploy conversational AI over heterogeneous content:
| Industry | Intent categories | Key benefit |
| --- | --- | --- |
| Telecommunications | Sales, Support, Billing, Account, Retention | Prevents “cancel” misclassification |
| Healthcare | Clinical, Coverage, Scheduling, Billing | Separates clinical from administrative |
| Financial services | Retail, Institutional, Lending, Insurance | Prevents context mixing |
| Retail | Product, Orders, Returns, Loyalty | Ensures promotional freshness |
After implementing Intent-First architecture across telecommunications and healthcare platforms:
| Metric | Impact |
| --- | --- |
| Query success rate | Nearly doubled |
| Support escalations | Reduced by more than half |
| Time to resolution | Reduced approximately 70% |
| User satisfaction | Improved roughly 50% |
| Return user rate | More than doubled |
The return user rate proved most significant. When search works, users come back. When it fails, they abandon the channel entirely, increasing costs across all other support channels.
The conversational AI market will continue to experience hyper growth.
But enterprises that build and deploy typical RAG architectures will continue to fail … repeatedly.
AI will confidently give wrong answers, users will abandon digital channels out of frustration and support costs will go up instead of down.
Intent-First is a fundamental shift in how enterprises need to architect and build AI-powered customer conversations. It’s not about better models or more data. It’s about understanding what a user wants before you try to help them.
The sooner an organization realizes this as an architectural imperative, the sooner they will be able to capture the efficiency gains this technology is supposed to enable. Those that don’t will be debugging why their AI investments haven’t been producing expected business outcomes for many years to come.
The demo is easy. Production is hard. But the pattern for production success is clear: Intent First.
Sreenivasa Reddy Hulebeedu Reddy is a lead software engineer and enterprise architect.
It’s the question on everyone’s minds and lips: Are we in an AI bubble?
It’s the wrong question. The real question is: Which AI bubble are we in, and when will each one burst?
The debate over whether AI represents a transformative technology or an economic time bomb has reached a fever pitch. Even tech leaders like Meta CEO Mark Zuckerberg have acknowledged evidence of an unstable financial bubble forming around AI. OpenAI CEO Sam Altman and Microsoft co-founder Bill Gates see clear bubble dynamics: overexcited investors, frothy valuations and plenty of doomed projects — but they still believe AI will ultimately transform the economy.
But treating “AI” as a single monolithic entity destined for a uniform collapse is fundamentally misguided. The AI ecosystem is actually three distinct layers, each with different economics, defensibility and risk profiles. Understanding these layers is critical, because they won’t all pop at once.
The most vulnerable segment isn’t building AI — it’s repackaging it.
These are the companies that take OpenAI’s API, add a slick interface and some prompt engineering, then charge $49/month for what amounts to a glorified ChatGPT wrapper. Some have achieved rapid initial success, like Jasper.ai, which reached approximately $42 million in annual recurring revenue (ARR) in its first year by wrapping GPT models in a user-friendly interface for marketers.
But the cracks are already showing. These businesses face threats from every direction:
Feature absorption: Microsoft can bundle your $50/month AI writing tool into Office 365 tomorrow. Google can make your AI email assistant a free Gmail feature. Salesforce can build your AI sales tool natively into their CRM. When large platforms decide your product is a feature, not a product, your business model evaporates overnight.
The commoditization trap: Wrapper companies are essentially just passing inputs and outputs; if OpenAI improves its own prompting, these tools lose value overnight. As foundation models become more similar in capability and pricing continues to fall, margins compress to nothing.
Zero switching costs: Most wrapper companies don’t own proprietary data, embedded workflows or deep integrations. A customer can switch to a competitor, or directly to ChatGPT, in minutes. There’s no moat, no lock-in, no defensibility.
The white-label AI market exemplifies this fragility. Companies using white-label platforms face vendor lock-in risks from proprietary systems and API limitations that can hinder integration. These businesses are building on rented land, and the landlord can change the terms, or bulldoze the property, at any moment.
The exception that proves the rule: Cursor stands as a rare wrapper-layer company that has built genuine defensibility. By deeply integrating into developer workflows, creating proprietary features beyond simple API calls and establishing strong network effects through user habits and custom configurations, Cursor has demonstrated how a wrapper can evolve into something more substantial. But companies like Cursor are outliers, not the norm — most wrapper companies lack this level of workflow integration and user lock-in.
Timeline: Expect significant failures in this segment by late 2025 through 2026, as large platforms absorb functionality and users realize they’re paying premium prices for commoditized capabilities.
The companies building LLMs — OpenAI, Anthropic, Mistral — occupy a more defensible but still precarious position.
Economic researcher Richard Bernstein points to OpenAI as an example of the bubble dynamic, noting that the company has made around $1 trillion in AI deals, including a $500 billion data center buildout project, despite being set to generate only $13 billion in revenue. The divergence between investment and plausible earnings “certainly looks bubbly,” Bernstein notes.
Yet, these companies possess genuine technological moats: Model training expertise, compute access and performance advantages. The question is whether these advantages are sustainable or whether models will commoditize to the point where they’re indistinguishable — turning foundation model providers into low-margin infrastructure utilities.
Engineering will separate winners from losers: As foundation models converge in baseline capabilities, the competitive edge will increasingly come from inference optimization and systems engineering. Companies that can scale the memory wall through innovations like extended KV cache architectures, achieve superior token throughput and deliver faster time-to-first-token will command premium pricing and market share. The winners won’t just be those with the largest training runs, but those who can make AI inference economically viable at scale. Technical breakthroughs in memory management, caching strategies and infrastructure efficiency will determine which frontier labs survive consolidation.
Another concern is the circular nature of investments. For instance, Nvidia is pumping $100 billion into OpenAI to bankroll data centers, and OpenAI is then filling those facilities with Nvidia’s chips. Nvidia is essentially subsidizing one of its biggest customers, potentially artificially inflating actual AI demand.
Still, these companies have massive capital backing, genuine technical capabilities and strategic partnerships with major cloud providers and enterprises. Some will consolidate, some will be acquired, but the category will survive.
Timeline: Consolidation in 2026 to 2028, with 2 to 3 dominant players emerging while smaller model providers are acquired or shuttered.
Here’s the contrarian take: The infrastructure layer — including Nvidia, data centers, cloud providers, memory systems and AI-optimized storage — is the least bubbly part of the AI boom.
Yes, the latest estimates suggest global AI capital expenditures and venture capital investments already exceed $600 billion in 2025, with Gartner estimating that all AI-related spending worldwide might top $1.5 trillion. That sounds like bubble territory.
But infrastructure has a critical characteristic: It retains value regardless of which specific applications succeed. The fiber optic cables laid during the dot-com bubble weren’t wasted — they enabled YouTube, Netflix and cloud computing. Twenty-five years ago, the original dot-com bubble burst after debt financing built out fiber-optic cables for a future that had not yet arrived, but that future eventually did arrive, and the infrastructure was there waiting.
Despite stock pressure, Nvidia’s Q3 fiscal year 2025 revenue hit about $57 billion, up 22% quarter-over-quarter and 62% year-over-year, with the data center division alone generating roughly $51.2 billion. These aren’t vanity metrics; they represent real demand from companies making genuine infrastructure investments.
The chips, data centers, memory systems and storage infrastructure being built today will power whatever AI applications ultimately succeed, whether that’s today’s chatbots, tomorrow’s autonomous agents or applications we haven’t even imagined yet. Unlike commoditized storage alone, modern AI infrastructure encompasses the entire memory hierarchy — from GPU HBM to DRAM to high-performance storage systems that serve as token warehouses for inference workloads. This integrated approach to memory and storage represents a fundamental architectural innovation, not a commodity play.
Timeline: Short-term overbuilding and lazy engineering are possible (2026), but long-term value retention is expected as AI workloads expand over the next decade.
The current AI boom won’t end with one dramatic crash. Instead, we’ll see a cascade of failures beginning with the most vulnerable companies, and the warning signs are already here.
Phase 1: Wrapper and white-label companies face margin compression and feature absorption. Hundreds of AI startups with thin differentiation will shut down or sell for pennies on the dollar. More than 1,300 AI startups now have valuations of over $100 million, with 498 AI “unicorns” valued at $1 billion or more, many of which won’t justify those valuations.
Phase 2: Foundation model consolidation as performance converges and only the best-capitalized players survive. Expect 3 to 5 major acquisitions as tech giants absorb promising model companies.
Phase 3: Infrastructure spending normalizes but remains elevated. Some data centers will sit partially empty for a few years (like fiber optic cables in 2002), but they’ll eventually fill as AI workloads genuinely expand.
The most significant risk isn’t being a wrapper — it’s staying one. If you own the experience the user operates in, you own the user.
If you’re building in the application layer, you need to move upstack immediately:
From wrapper → application layer: Stop just generating outputs. Own the workflow before and after the AI interaction.
From application → vertical SaaS: Build execution layers that force users to stay inside your product. Create proprietary data, deep integrations and workflow ownership that makes switching painful.
The distribution moat: Your real advantage isn’t the LLM, it’s how you get users, keep them and expand what they do inside your platform. Winning AI businesses aren’t just software companies — they’re distribution companies.
It’s time to stop asking whether we’re in “the” AI bubble. We’re in multiple bubbles with different characteristics and timelines.
The wrapper companies will pop first, probably within 18 months. Foundation models will consolidate over the next 2 to 4 years. I predict that current infrastructure investments will ultimately prove justified over the long term, although not without some short-term overbuilding pains.
This isn’t a reason for pessimism, it’s a roadmap. Understanding which layer you’re operating in and which bubble you might be caught in is the difference between becoming the next casualty and building something that survives the shakeout.
The AI revolution is real. But not every company riding the wave will make it to shore.
Val Bercovici is CAIO at WEKA.
Every year, NeurIPS produces hundreds of impressive papers, and a handful that subtly reset how practitioners think about scaling, evaluation and system design. In 2025, the most consequential works weren’t about a single breakthrough model. Instead, they challenged fundamental assumptions that academics and industry have quietly relied on: Bigger models mean better reasoning, RL creates new capabilities, attention is “solved” and generative models inevitably memorize.
This year’s top papers collectively point to a deeper shift: AI progress is now constrained less by raw model capacity and more by architecture, training dynamics and evaluation strategy.
Below is a technical deep dive into five of the most influential NeurIPS 2025 papers — and what they mean for anyone building real-world AI systems.
Paper: Artificial Hivemind: The Open-Ended Homogeneity of Language Models
For years, LLM evaluation has focused on correctness. But in open-ended or ambiguous tasks like brainstorming, ideation or creative synthesis, there often is no single correct answer. The risk instead is homogeneity: Models producing the same “safe,” high-probability responses.
This paper introduces Infinity-Chat, a benchmark designed explicitly to measure diversity and pluralism in open-ended generation. Rather than scoring answers as right or wrong, it measures:
Intra-model collapse: How often the same model repeats itself
Inter-model homogeneity: How similar different models’ outputs are
The result is uncomfortable but important: Across architectures and providers, models increasingly converge on similar outputs — even when multiple valid answers exist.
For corporations, this reframes “alignment” as a trade-off. Preference tuning and safety constraints can quietly reduce diversity, leading to assistants that feel too safe, predictable or biased toward dominant viewpoints.
Takeaway: If your product relies on creative or exploratory outputs, diversity metrics need to be first-class citizens.
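Making diversity a first-class metric can start small. Below is a minimal sketch (not the Infinity-Chat benchmark itself) of an intra-model collapse score: sample the same open-ended prompt several times and measure the mean pairwise cosine similarity of the response embeddings. The toy `embed()` function is a stand-in for whatever sentence-embedding model you use in production.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy embedding for illustration: normalized character-frequency vector.
    Swap in a real sentence-embedding model in practice."""
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def intra_model_homogeneity(responses: list[str]) -> float:
    """Mean pairwise cosine similarity across sampled responses.
    Higher values mean the model is repeating itself more."""
    embs = [embed(r) for r in responses]
    sims = [
        float(np.dot(embs[i], embs[j]))
        for i in range(len(embs))
        for j in range(i + 1, len(embs))
    ]
    return sum(sims) / len(sims)

# Identical outputs score 1.0; genuinely diverse outputs score lower.
same = intra_model_homogeneity(["the cat sat", "the cat sat", "the cat sat"])
varied = intra_model_homogeneity(["the cat sat", "quantum flux", "zebras run"])
assert varied < same
```

The same function applied across outputs from *different* models gives a crude inter-model homogeneity score, the paper’s second axis.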
Paper: Gated Attention for Large Language Models
Transformer attention has been treated as settled engineering. This paper proves it isn’t.
The authors introduce a small architectural change: Apply a query-dependent sigmoid gate after scaled dot-product attention, per attention head. That’s it. No exotic kernels, no massive overhead.
Across dozens of large-scale training runs — including dense and mixture-of-experts (MoE) models trained on trillions of tokens — this gated variant:
Improved stability
Reduced “attention sinks”
Enhanced long-context performance
Consistently outperformed vanilla attention
The gate introduces:
Non-linearity in attention outputs
Implicit sparsity, suppressing pathological activations
This challenges the assumption that attention failures are purely data or optimization problems.
Takeaway: Some of the biggest LLM reliability issues may be architectural — not algorithmic — and solvable with surprisingly small changes.
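To make the mechanism concrete, here is a small numpy sketch of the idea — an illustration under my own simplifications, not the paper’s exact implementation: standard scaled dot-product attention, followed by a query-dependent sigmoid gate applied elementwise to the head’s output. `w_gate` is a learned projection in the real architecture; here it is just a random matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(q, k, v, w_gate):
    """Single-head SDPA with a query-dependent sigmoid output gate.
    q, k, v: (seq, d_head); w_gate: (d_head, d_head), learned in practice."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                # (seq, seq) attention logits
    attn_out = softmax(scores) @ v               # vanilla SDPA output
    gate = 1.0 / (1.0 + np.exp(-(q @ w_gate)))   # sigmoid gate, depends on q
    return gate * attn_out                       # elementwise suppression

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
w_gate = rng.standard_normal((8, 8))
out = gated_attention(q, k, v, w_gate)
assert out.shape == (4, 8)
```

Because the gate lies in (0, 1), it can only attenuate each output coordinate — which is exactly the implicit-sparsity effect that suppresses attention sinks and pathological activations.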
Paper: 1,000-Layer Networks for Self-Supervised Reinforcement Learning
Conventional wisdom says RL doesn’t scale well without dense rewards or demonstrations. This paper shows that assumption is incomplete.
By scaling network depth aggressively from typical 2 to 5 layers to nearly 1,000 layers, the authors demonstrate dramatic gains in self-supervised, goal-conditioned RL, with performance improvements ranging from 2X to 50X.
The key isn’t brute force. It’s pairing depth with contrastive objectives, stable optimization regimes and goal-conditioned representations.
For agentic systems and autonomous workflows, this suggests that representation depth — not just data or reward shaping — may be a critical lever for generalization and exploration.
Takeaway: RL’s scaling limits may be architectural, not fundamental.
Paper: Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training
Diffusion models are massively overparameterized, yet they often generalize remarkably well. This paper explains why.
The authors identify two distinct training timescales:
One where generative quality rapidly improves
Another — much slower — where memorization emerges
Crucially, the memorization timescale grows linearly with dataset size, creating a widening window where models improve without overfitting.
This reframes early stopping and dataset scaling strategies. Memorization isn’t inevitable — it’s predictable and delayed.
Takeaway: For diffusion training, dataset size doesn’t just improve quality — it actively delays overfitting.
Paper: Does Reinforcement Learning Really Incentivize Reasoning in LLMs?
Perhaps the most strategically important result of NeurIPS 2025 is also the most sobering.
This paper rigorously tests whether reinforcement learning with verifiable rewards (RLVR) actually creates new reasoning abilities in LLMs — or simply reshapes existing ones.
Their conclusion: RLVR primarily improves sampling efficiency, not reasoning capacity. At large sample sizes, the base model often already contains the correct reasoning trajectories.
RL is better understood as:
A distribution-shaping mechanism
Not a generator of fundamentally new capabilities
Takeaway: To truly expand reasoning capacity, RL likely needs to be paired with mechanisms like teacher distillation or architectural changes — not used in isolation.
Taken together, these papers point to a common theme:
The bottleneck in modern AI is no longer raw model size — it’s system design.
Diversity collapse requires new evaluation metrics
Attention failures require architectural fixes
RL scaling depends on depth and representation
Memorization depends on training dynamics, not parameter count
Reasoning gains depend on how distributions are shaped, not just optimized
For builders, the message is clear: Competitive advantage is shifting from “who has the biggest model” to “who understands the system.”
Maitreyi Chatterjee is a software engineer.
Devansh Agarwal currently works as an ML engineer at FAANG.
Our LLM API bill was growing 30% month-over-month. Traffic was increasing, but not that fast. When I analyzed our query logs, I found the real problem: Users ask the same questions in different ways.
“What’s your return policy?”, “How do I return something?” and “Can I get a refund?” were all hitting our LLM separately, generating nearly identical responses, each incurring full API costs.
Exact-match caching, the obvious first solution, captured only 18% of these redundant calls. The same semantic question, phrased differently, bypassed the cache entirely.
So, I implemented semantic caching based on what queries mean, not how they’re worded. After implementing it, our cache hit rate increased to 67%, reducing LLM API costs by 73%. But getting there requires solving problems that naive implementations miss.
Traditional caching uses query text as the cache key. This works when queries are identical:
```python
# Exact-match caching: the cache key is a hash of the raw query text,
# so any rewording of the same question misses the cache
cache_key = hash(query_text)
if cache_key in cache:
    return cache[cache_key]
```
But users don’t phrase questions identically. My analysis of 100,000 production queries found:
Only 18% were exact duplicates of previous queries
47% were semantically similar to previous queries (same intent, different wording)
35% were genuinely novel queries
That 47% represented massive cost savings we were missing. Each semantically similar query triggered a full LLM call, generating a response nearly identical to one we’d already computed.
Semantic caching replaces text-based keys with embedding-based similarity lookup:
```python
from datetime import datetime
from typing import Optional

# VectorStore, ResponseStore and generate_id are application-specific
# components (vector index, key-value store, ID generator)

class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.92):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.vector_store = VectorStore()      # FAISS, Pinecone, etc.
        self.response_store = ResponseStore()  # Redis, DynamoDB, etc.

    def get(self, query: str) -> Optional[str]:
        """Return a cached response if a semantically similar query exists."""
        query_embedding = self.embedding_model.encode(query)
        # Find the most similar cached query
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= self.threshold:
            cache_id = matches[0].id
            return self.response_store.get(cache_id)
        return None

    def set(self, query: str, response: str):
        """Cache a query-response pair."""
        query_embedding = self.embedding_model.encode(query)
        cache_id = generate_id()
        self.vector_store.add(cache_id, query_embedding)
        self.response_store.set(cache_id, {
            'query': query,
            'response': response,
            'timestamp': datetime.utcnow()
        })
```
The key insight: Instead of hashing query text, I embed queries into vector space and find cached queries within a similarity threshold.
The similarity threshold is the critical parameter. Set it too high, and you miss valid cache hits. Set it too low, and you return wrong responses.
Our initial threshold of 0.85 seemed reasonable; 85% similar should be “the same question,” right?
Wrong. At 0.85, we got cache hits like:
Query: “How do I cancel my subscription?”
Cached: “How do I cancel my order?”
Similarity: 0.87
These are different questions with different answers. Returning the cached response would be incorrect.
I discovered that optimal thresholds vary by query type:
| Query type | Optimal threshold | Rationale |
| --- | --- | --- |
| FAQ-style questions | 0.94 | High precision needed; wrong answers damage trust |
| Product searches | 0.88 | More tolerance for near-matches |
| Support queries | 0.92 | Balance between coverage and accuracy |
| Transactional queries | 0.97 | Very low tolerance for errors |
I implemented query-type-specific thresholds:
```python
class AdaptiveSemanticCache:
    def __init__(self, embedding_model):
        # Same stores as SemanticCache, plus per-type thresholds
        self.embedding_model = embedding_model
        self.vector_store = VectorStore()
        self.response_store = ResponseStore()
        self.thresholds = {
            'faq': 0.94,
            'search': 0.88,
            'support': 0.92,
            'transactional': 0.97,
            'default': 0.92
        }
        self.query_classifier = QueryClassifier()

    def get_threshold(self, query: str) -> float:
        query_type = self.query_classifier.classify(query)
        return self.thresholds.get(query_type, self.thresholds['default'])

    def get(self, query: str) -> Optional[str]:
        threshold = self.get_threshold(query)
        query_embedding = self.embedding_model.encode(query)
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= threshold:
            return self.response_store.get(matches[0].id)
        return None
```
I couldn’t tune thresholds blindly. I needed ground truth on which query pairs were actually “the same.”
Our methodology:
Step 1: Sample query pairs. I sampled 5,000 query pairs at various similarity levels (0.80-0.99).
Step 2: Human labeling. Annotators labeled each pair as “same intent” or “different intent.” I used three annotators per pair and took a majority vote.
Step 3: Compute precision/recall curves. For each threshold, we computed:
Precision: Of cache hits, what fraction had the same intent?
Recall: Of same-intent pairs, what fraction did we cache-hit?
```python
def compute_precision_recall(pairs, labels, threshold):
    """Compute precision and recall at a given similarity threshold."""
    predictions = [1 if pair.similarity >= threshold else 0 for pair in pairs]
    true_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 1)
    false_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 0)
    false_negatives = sum(1 for p, l in zip(predictions, labels) if p == 0 and l == 1)
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    return precision, recall
```
Step 4: Select threshold based on cost of errors. For FAQ queries where wrong answers damage trust, I optimized for precision (0.94 threshold gave 98% precision). For search queries where missing a cache hit just costs money, I optimized for recall (0.88 threshold).
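This selection step can be sketched as a simple sweep: evaluate candidate thresholds over the labeled pairs and take the lowest (highest-recall) one that still meets a precision target. The code below is self-contained for illustration, with made-up similarities and labels standing in for the human-labeled data; the precision/recall logic mirrors the function above.

```python
from collections import namedtuple

Pair = namedtuple("Pair", ["similarity"])

def precision_at(pairs, labels, threshold):
    """Precision of 'same intent' predictions at a similarity cutoff."""
    preds = [1 if p.similarity >= threshold else 0 for p in pairs]
    tp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 1)
    fp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 0)
    return tp / (tp + fp) if tp + fp else 0.0

def pick_threshold(pairs, labels, candidates, min_precision=0.95):
    """Lowest threshold (highest recall) that meets the precision target."""
    for t in sorted(candidates):
        if precision_at(pairs, labels, t) >= min_precision:
            return t
    return max(candidates)  # fall back to the strictest candidate

# Illustrative labeled pairs: similarity score + same-intent label
pairs = [Pair(0.85), Pair(0.90), Pair(0.93), Pair(0.96), Pair(0.98)]
labels = [0, 0, 1, 1, 1]
best = pick_threshold(pairs, labels, [0.85, 0.90, 0.92, 0.95])
# best == 0.92: the two mislabeled-risk pairs below 0.92 are excluded
```

The `min_precision` knob is where the cost-of-errors judgment lives: set it high for FAQ-style queries, lower for searches where a miss only costs money.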
Semantic caching adds latency: You must embed the query and search the vector store before knowing whether to call the LLM.
Our measurements:
| Operation | Latency (p50) | Latency (p99) |
| --- | --- | --- |
| Query embedding | 12ms | 28ms |
| Vector search | 8ms | 19ms |
| Total cache lookup | 20ms | 47ms |
| LLM API call | 850ms | 2400ms |
The 20ms overhead is negligible compared to the 850ms LLM call we avoid on cache hits. Even at p99, the 47ms overhead is acceptable.
However, cache misses now take 20ms longer than before (embedding + search + LLM call). At our 67% hit rate, the math works out favorably:
Before: 100% of queries × 850ms = 850ms average
After: (33% × 870ms) + (67% × 20ms) = 287ms + 13ms = 300ms average
Net latency improvement of 65% alongside the cost reduction.
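The arithmetic above can be sanity-checked in a few lines, using the measured p50 figures (a hit costs only the ~20ms lookup; a miss pays lookup plus the LLM call):

```python
hit_rate = 0.67   # measured cache hit rate
lookup_ms = 20    # p50 embedding + vector search
llm_ms = 850      # p50 LLM API call

before = llm_ms   # previously, every query paid the full LLM call
after = (1 - hit_rate) * (lookup_ms + llm_ms) + hit_rate * lookup_ms

improvement = 1 - after / before
# after ≈ 300ms, improvement ≈ 65%
```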
Cached responses go stale. Product information changes, policies update and yesterday’s correct answer becomes today’s wrong answer.
I implemented three invalidation strategies:
Simple expiration based on content type:
```python
from datetime import timedelta

TTL_BY_CONTENT_TYPE = {
    'pricing': timedelta(hours=4),      # Changes frequently
    'policy': timedelta(days=7),        # Changes rarely
    'product_info': timedelta(days=1),  # Daily refresh
    'general_faq': timedelta(days=14),  # Very stable
}
```
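A minimal sketch of how such a TTL map might be enforced at read time — the entry shape (`content_type`, `timestamp`) is illustrative, standing in for whatever metadata the cache stores alongside each response:

```python
from datetime import datetime, timedelta

def is_expired(entry: dict, ttl_map: dict, now: datetime = None) -> bool:
    """Treat an entry older than its content type's TTL as a cache miss."""
    now = now or datetime.utcnow()
    ttl = ttl_map.get(entry["content_type"], timedelta(days=1))  # default TTL
    return now - entry["timestamp"] > ttl

ttl_map = {"pricing": timedelta(hours=4)}  # illustrative single entry
stale = {"content_type": "pricing",
         "timestamp": datetime.utcnow() - timedelta(hours=5)}
assert is_expired(stale, ttl_map)  # a 5h-old pricing entry exceeds its 4h TTL
```

Expired entries can be evicted lazily on lookup or swept in a background job; either way the staleness check stays this simple.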
When underlying data changes, invalidate related cache entries:
```python
class CacheInvalidator:
    def on_content_update(self, content_id: str, content_type: str):
        """Invalidate cache entries related to updated content."""
        # Find cached queries that referenced this content
        affected_queries = self.find_queries_referencing(content_id)
        for query_id in affected_queries:
            self.cache.invalidate(query_id)
        self.log_invalidation(content_id, len(affected_queries))
```
For responses that might become stale without explicit events, I implemented periodic freshness checks:
```python
def check_freshness(self, cached_response: dict) -> bool:
    """Verify the cached response is still valid."""
    # Re-run the query against current data
    fresh_response = self.generate_response(cached_response['query'])
    # Compare semantic similarity of the cached and fresh responses
    cached_embedding = self.embed(cached_response['response'])
    fresh_embedding = self.embed(fresh_response)
    similarity = cosine_similarity(cached_embedding, fresh_embedding)
    # If the responses diverged significantly, invalidate
    if similarity < 0.90:
        self.cache.invalidate(cached_response['id'])
        return False
    return True
```
We run freshness checks on a sample of cached entries daily, catching staleness that TTL and event-based invalidation miss.
After three months in production:
| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Cache hit rate | 18% | 67% | +272% |
| LLM API costs | $47K/month | $12.7K/month | -73% |
| Average latency | 850ms | 300ms | -65% |
| False-positive rate | N/A | 0.8% | — |
| Customer complaints (wrong answers) | Baseline | +0.3% | Minimal increase |
The 0.8% false-positive rate (queries where we returned a cached response that was semantically incorrect) was within acceptable bounds. These cases occurred primarily at the boundaries of our threshold, where similarity was just above the cutoff but intent differed slightly.
Don’t use a single global threshold. Different query types have different tolerance for errors. Tune thresholds per category.
Don’t skip the embedding step on cache hits. You might be tempted to skip embedding overhead when returning cached responses, but you need the embedding for cache key generation. The overhead is unavoidable.
Don’t forget invalidation. Semantic caching without invalidation strategy leads to stale responses that erode user trust. Build invalidation from day one.
Don’t cache everything. Some queries shouldn’t be cached: Personalized responses, time-sensitive information, transactional confirmations. Build exclusion rules.
```python
def should_cache(self, query: str, response: str) -> bool:
    """Determine whether a response should be cached."""
    # Don't cache personalized responses
    if self.contains_personal_info(response):
        return False
    # Don't cache time-sensitive information
    if self.is_time_sensitive(query):
        return False
    # Don't cache transactional confirmations
    if self.is_transactional(query):
        return False
    return True
```
Semantic caching is a practical pattern for LLM cost control that captures redundancy exact-match caching misses. The key challenges are threshold tuning (use query-type-specific thresholds based on precision/recall analysis) and cache invalidation (combine TTL, event-based and staleness detection).
At 73% cost reduction, this was our highest-ROI optimization for production LLM systems. The implementation complexity is moderate, but the threshold tuning requires careful attention to avoid quality degradation.
Sreenivasa Reddy Hulebeedu Reddy is a lead software engineer.
AI is evolving faster than our vocabulary for describing it. We may need a few new words. We have “cognition” for how a single mind thinks, but we don’t have a word for what happens when human and machine intelligence work together to perceive, decide, create and act. Let’s call that process intelition.
Intelition isn’t a feature; it’s the organizing principle for the next wave of software where humans and AI operate inside the same shared model of the enterprise. Today’s systems treat AI models as things you invoke from the outside. You act as a “user,” prompting for responses or wiring a “human in the loop” step into agentic workflows. But that’s evolving into continuous co-production: People and agents are shaping decisions, logic and actions together, in real time.
Read on for a breakdown of the three forces driving this new paradigm.
In a recent shareholder letter, Palantir CEO Alex Karp wrote that “all the value in the market is going to go to chips and what we call ontology,” and argued that this shift is “only the beginning of something much larger and more significant.” By ontology, Karp means a shared model of objects (customers, policies, assets, events) and their relationships. This also includes what Palantir calls an ontology’s “kinetic layer” that defines the actions and security permissions connecting objects.
In the SaaS era, every enterprise application creates its own object and process models. Combined with a host of legacy systems and often chaotic models, enterprises face the challenge of stitching all this together. It’s a big and difficult job, with redundancies, incomplete structures and missing data. The reality: No matter how many data warehouse or data lake projects are commissioned, few enterprises come close to creating a consolidated enterprise ontology.
A unified ontology is essential for today’s agentic AI tools. As organizations link and federate ontologies, a new software paradigm emerges: Agentic AI can reason and act across suppliers, regulators, customers and operations, not just within a single app.
As Karp describes it, the aim is “to tether the power of artificial intelligence to objects and relationships in the real world.”
Today’s models can hold extensive context, but holding information isn’t the same as learning from it. Continual learning requires the accumulation of understanding, rather than resets with each retraining.
To this aim, Google recently announced “Nested Learning” as a potential solution, grounded directly in existing LLM architecture and training data. The authors don’t claim to have solved the challenges of building world models. But Nested Learning could supply the raw ingredients for them: Durable memory with continual learning layered into the system. The endpoint would make retraining obsolete.
In June 2022, Meta’s chief AI scientist Yann LeCun created a blueprint for “autonomous machine intelligence” that featured a hierarchical approach to using joint embeddings to make predictions using world models. He called the technique H-JEPA, and later put it bluntly: “LLMs are good at manipulating language, but not at thinking.”
Over the past three years, LeCun and his colleagues at Meta have moved H-JEPA theory into practice with open source models V-JEPA and I-JEPA, which learn image and video representations of the world.
The third force in this agentic, ontology-driven world is the personal interface. This puts people at the center rather than as “users” on the periphery. This is not another app; it is the primary way a person participates in the next era of work and life. Rather than treating AI as something we visit through a chat window or API call, the personal intelition interface will be always-on, aware of our context, preferences and goals and capable of acting on our behalf across the entire federated economy.
Let’s analyze how this is already coming together.
In May, Jony Ive sold his AI device company io to OpenAI to accelerate a new AI device category. He noted at the time: “If you make something new, if you innovate, there will be consequences unforeseen, and some will be wonderful, and some will be harmful. While some of the less positive consequences were unintentional, I still feel responsibility. And the manifestation of that is a determination to try and be useful.” That is, getting the personal intelligence device right means more than an attractive venture opportunity.
Apple is looking beyond LLMs for on-device solutions that require less processing power and result in less latency when creating AI apps that understand “user intent.” Last year, it created UI-JEPA, an innovation that moves to “on-device analysis” of what the user wants. This strikes directly at the business model of today’s digital economy, where centralized profiling of “users” transforms intent and behavior data into vast revenue streams.
Tim Berners-Lee, the inventor of the World Wide Web, recently noted: “The user has been reduced to a consumable product for the advertiser … there’s still time to build machines that work for humans, and not the other way around.” Moving user intent to the device will drive interest in a secure personal data management standard, Solid, that Berners-Lee and his colleagues have been developing since 2022. The standard is ideally suited to pair with new personal AI devices. For instance, Inrupt, Inc., a company founded by Berners-Lee, recently combined Solid with Anthropic’s MCP standard for Agentic Wallets. Personal control is more than a feature of this paradigm; it is the architectural safeguard as systems gain the ability to learn and act continuously.
Ultimately, these three forces are moving and converging faster than most realize. Enterprise ontologies provide the nouns and verbs, world-model research supplies durable memory and learning and the personal interface becomes the permissioned point of control. The next software era isn’t coming. It’s already here.
Brian Mulconrey is SVP at Sureify Labs.
For decades, we have adapted to software. We learned shell commands, memorized HTTP method names and wired together SDKs. Each interface assumed we would speak its language. In the 1980s, we typed ‘grep’, ‘ssh’ and ‘ls’ into a shell; by the mid-2000s, we were invoking REST endpoints like GET /users; by the 2010s, we imported SDKs (client.orders.list()) so we didn’t have to think about HTTP. But underlying each of those steps was the same premise: Expose capabilities in a structured form so others can invoke them.
But now we are entering the next interface paradigm. Modern LLMs are challenging the notion that a user must choose a function or remember a method signature. Instead of “Which API do I call?” the question becomes: “What outcome am I trying to achieve?” In other words, the interface is shifting from code → to language. In this shift, Model Context Protocol (MCP) emerges as the abstraction that allows models to interpret human intent, discover capabilities and execute workflows, effectively exposing software functions not as programmers know them, but as natural-language requests.
MCP is not a hype-term; multiple independent studies identify the architectural shift required for “LLM-consumable” tool invocation. One blog by Akamai engineers describes the transition from traditional APIs to “language-driven integrations” for LLMs. Another academic paper on “AI agentic workflows and enterprise APIs” talks about how enterprise API architecture must evolve to support goal-oriented agents rather than human-driven calls. In short: We are no longer merely designing APIs for code; we are designing capabilities for intent.
Why does this matter for enterprises? Because enterprises are drowning in internal systems, integration sprawl and user training costs. Workers struggle not because they don’t have tools, but because they have too many tools, each with its own interface. When natural language becomes the primary interface, the barrier of “which function do I call?” disappears. One recent business blog observed that natural-language interfaces (NLIs) are enabling self-serve data access for marketers who previously had to wait for analysts to write SQL. When the user just states intent (like “fetch last quarter revenue for region X and flag anomalies”), the system underneath can translate that into calls, orchestration and context memory, and deliver results.
To understand how this evolution works, consider the interface ladder:
| Era | Interface | Who it was built for |
| --- | --- | --- |
| CLI | Shell commands | Expert users typing text |
| API | Web or RPC endpoints | Developers integrating systems |
| SDK | Library functions | Programmers using abstractions |
| Natural language (MCP) | Intent-based requests | Human + AI agents stating what they want |
Through each step, humans had to “learn the machine’s language.” With MCP, the machine absorbs the human’s language and works out the rest. That’s not just a UX improvement; it’s an architectural shift.
Under MCP, the code-level functions are still there: data access, business logic and orchestration. But they’re discovered rather than invoked manually. For example, rather than calling “billingApi.fetchInvoices(customerId=…)”, you say “Show all invoices for Acme Corp since January and highlight any late payments.” The model resolves the entities, calls the right systems, filters and returns structured insight. The developer’s work shifts from wiring endpoints to defining capability surfaces and guardrails.
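Concretely, a capability surface for the invoice example might look something like the MCP-style tool declaration below. The tool name and fields are illustrative (consult the MCP specification for the exact shape); the point is that the developer publishes a name, a description and a JSON Schema for the inputs, and the model decides when to invoke it and fills in the arguments from the natural-language request.

```python
# Hypothetical MCP-style tool declaration for the invoice example.
# The agent maps "Show all invoices for Acme Corp since January and
# highlight any late payments" onto this capability and its arguments.
list_invoices_tool = {
    "name": "list_invoices",  # illustrative name, not a real API
    "description": (
        "Return invoices for a customer over a date range, "
        "optionally flagging late payments."
    ),
    "inputSchema": {  # JSON Schema describing what the model must supply
        "type": "object",
        "properties": {
            "customer_name": {"type": "string"},
            "since": {"type": "string", "format": "date"},
            "flag_late": {"type": "boolean", "default": False},
        },
        "required": ["customer_name"],
    },
}

assert list_invoices_tool["inputSchema"]["required"] == ["customer_name"]
```

Note what is absent: no endpoint URL, no call order, no parameter wiring in the caller. The guardrails (who may invoke this, with what data scope) live alongside the declaration rather than in client code.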
This shift transforms developer experience and enterprise integration. Teams often struggle to onboard new tools because they require mapping schemas, writing glue code and training users. With a natural-language front, onboarding involves defining business entity names, declaring capabilities and exposing them via the protocol. The human (or AI agent) no longer needs to know parameter names or call order. Studies show that using LLMs as interfaces to APIs can reduce the time and resources required to develop chatbots or tool-invoked workflows.
The change also brings productivity benefits. Enterprises that adopt LLM-driven interfaces can turn data access latency (hours/days) into conversation latency (seconds). For instance, if an analyst previously had to export CSVs, run transforms and build slides, a language interface allows “Summarize the top five risk factors for churn over the last quarter” and generates narrative + visuals in one go. The human then reviews, adjusts and acts — shifting from data plumber to decision maker. That matters: According to a survey by McKinsey & Company, 63% of organizations using gen AI are already creating text outputs, and more than one-third are generating images or code. While many are still in the early days of capturing enterprise-wide ROI, the signal is clear: Language as interface unlocks new value.
In architectural terms, this means software design must evolve. MCP demands systems that publish capability metadata, support semantic routing, maintain context memory and enforce guardrails. An API design no longer needs to ask “What function will the user call?”, but rather “What intent might the user express?” A recently published framework for improving enterprise APIs for LLMs shows how APIs can be enriched with natural-language-friendly metadata so that agents can select tools dynamically. The implication: Software becomes modular around intent surfaces rather than function surfaces.
Language-first systems also bring risks and requirements. Natural language is ambiguous by nature, so enterprises must implement authentication, logging, provenance and access control, just as they did for APIs. Without these guardrails, an agent might call the wrong system, expose data or misinterpret intent. One post on “prompt collapse” calls out the danger: As natural-language UI becomes dominant, software may turn into “a capability accessed through conversation” and the company into “an API with a natural-language frontend”. That transformation is powerful, but only safe if systems are designed for introspection, audit and governance.
The shift also has cultural and organizational ramifications. For decades, enterprises hired integration engineers to design APIs and middleware. With MCP-driven models, companies will increasingly hire ontology engineers, capability architects and agent enablement specialists. These roles focus on defining the semantics of business operations, mapping business entities to system capabilities and curating context memory. Because the interface is now human-centric, skills such as domain knowledge, prompt framing, oversight and evaluation become central.
What should enterprise leaders do today? First, think of natural language as the interface layer, not as a fancy add-on. Map your business workflows that can safely be invoked via language. Then catalogue the underlying capabilities you already have: data services, analytics and APIs. Then ask: “Are these discoverable? Can they be called via intent?” Finally, pilot an MCP-style layer: Build a small domain (customer support triage) where a user or agent can express outcomes in language, and let systems do the orchestration. Then iterate and scale.
Natural language is not just the new front-end. It is becoming the default interface layer for software, replacing CLI, then APIs, then SDKs. MCP is the abstraction that makes this possible. Benefits include faster integration, modular systems, higher productivity and new roles. For those organizations still tethered to calling endpoints manually, the shift will feel like learning a new platform all over again. The question is no longer “which function do I call?” but “what do I want to do?”
Dhyey Mavani is accelerating gen AI and computational mathematics.