Why your LLM bill is exploding — and how semantic caching can cut it by 73%

Our LLM API bill was growing 30% month-over-month. Traffic was increasing, but not that fast. When I analyzed our query logs, I found the real problem: Users ask the same questions in different ways.

“What’s your return policy?”, “How do I return something?” and “Can I get a refund?” were all hitting our LLM separately, generating nearly identical responses, each incurring full API costs.

Exact-match caching, the obvious first solution, captured only 18% of these redundant calls. The same semantic question, phrased differently, bypassed the cache entirely.

So, I implemented semantic caching: caching based on what queries mean, not how they’re worded. Our cache hit rate rose to 67%, reducing LLM API costs by 73%. But getting there requires solving problems that naive implementations miss.

Why exact-match caching falls short

Traditional caching uses query text as the cache key. This works when queries are identical:

# Exact-match caching
cache_key = hash(query_text)
if cache_key in cache:
    return cache[cache_key]

But users don’t phrase questions identically. My analysis of 100,000 production queries found:

  • Only 18% were exact duplicates of previous queries

  • 47% were semantically similar to previous queries (same intent, different wording)

  • 35% were genuinely novel queries

That 47% represented massive cost savings we were missing. Each semantically similar query triggered a full LLM call, generating a response nearly identical to one we’d already computed.

Semantic caching architecture

Semantic caching replaces text-based keys with embedding-based similarity lookup:

from datetime import datetime
from typing import Optional

# VectorStore, ResponseStore and generate_id are stand-ins for your infrastructure of choice
class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.92):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.vector_store = VectorStore()      # FAISS, Pinecone, etc.
        self.response_store = ResponseStore()  # Redis, DynamoDB, etc.

    def get(self, query: str) -> Optional[str]:
        """Return cached response if a semantically similar query exists."""
        query_embedding = self.embedding_model.encode(query)

        # Find the most similar cached query
        matches = self.vector_store.search(query_embedding, top_k=1)

        if matches and matches[0].similarity >= self.threshold:
            cache_id = matches[0].id
            return self.response_store.get(cache_id)

        return None

    def set(self, query: str, response: str):
        """Cache a query-response pair."""
        query_embedding = self.embedding_model.encode(query)
        cache_id = generate_id()

        self.vector_store.add(cache_id, query_embedding)
        self.response_store.set(cache_id, {
            'query': query,
            'response': response,
            'timestamp': datetime.utcnow()
        })

The key insight: Instead of hashing query text, I embed queries into vector space and find cached queries within a similarity threshold.
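In use, the cache wraps the LLM call in a simple get-or-compute pattern. A minimal sketch, assuming a call_llm helper that stands in for your LLM API client:

def answer(query: str, cache: SemanticCache) -> str:
    # Try the semantic cache first
    cached = cache.get(query)
    if cached is not None:
        return cached

    # Cache miss: pay for the LLM call once, then store the result for similar queries
    response = call_llm(query)  # placeholder for the actual LLM API client
    cache.set(query, response)
    return response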

The threshold problem

The similarity threshold is the critical parameter. Set it too high, and you miss valid cache hits. Set it too low, and you return wrong responses.

Our initial threshold of 0.85 seemed reasonable; 85% similar should be “the same question,” right?

Wrong. At 0.85, we got cache hits like:

  • Query: “How do I cancel my subscription?”

  • Cached: “How do I cancel my order?”

  • Similarity: 0.87

These are different questions with different answers. Returning the cached response would be incorrect.

I discovered that optimal thresholds vary by query type:

Query type            | Optimal threshold | Rationale
FAQ-style questions   | 0.94              | High precision needed; wrong answers damage trust
Product searches      | 0.88              | More tolerance for near-matches
Support queries       | 0.92              | Balance between coverage and accuracy
Transactional queries | 0.97              | Very low tolerance for errors

I implemented query-type-specific thresholds:

class AdaptiveSemanticCache(SemanticCache):
    def __init__(self, embedding_model):
        super().__init__(embedding_model)
        self.thresholds = {
            'faq': 0.94,
            'search': 0.88,
            'support': 0.92,
            'transactional': 0.97,
            'default': 0.92
        }
        self.query_classifier = QueryClassifier()

    def get_threshold(self, query: str) -> float:
        query_type = self.query_classifier.classify(query)
        return self.thresholds.get(query_type, self.thresholds['default'])

    def get(self, query: str) -> Optional[str]:
        # Same lookup as before, but with a per-query-type threshold
        threshold = self.get_threshold(query)
        query_embedding = self.embedding_model.encode(query)
        matches = self.vector_store.search(query_embedding, top_k=1)

        if matches and matches[0].similarity >= threshold:
            return self.response_store.get(matches[0].id)

        return None
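The QueryClassifier above is left abstract. Purely as an illustration, a minimal keyword-based version could look like the sketch below; the categories match the table, but the keywords are hypothetical and a production system would likely use a small trained classifier instead.

class QueryClassifier:
    # Hypothetical keyword rules; replace with a trained classifier in production
    RULES = {
        'transactional': ('order', 'payment', 'invoice', 'charge'),
        'support': ('help', 'issue', 'problem', 'broken'),
        'search': ('find', 'show', 'list', 'search'),
        'faq': ('policy', 'how do i', 'what is', 'can i'),
    }

    def classify(self, query: str) -> str:
        q = query.lower()
        for query_type, keywords in self.RULES.items():
            if any(keyword in q for keyword in keywords):
                return query_type
        return 'default'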

Threshold tuning methodology

I couldn’t tune thresholds blindly. I needed ground truth on which query pairs were actually “the same.”

Our methodology:

Step 1: Sample query pairs. I sampled 5,000 query pairs at various similarity levels (0.80-0.99).

Step 2: Human labeling. Annotators labeled each pair as “same intent” or “different intent.” I used three annotators per pair and took a majority vote.

Step 3: Compute precision/recall curves. For each threshold, we computed:

  • Precision: Of cache hits, what fraction had the same intent?

  • Recall: Of same-intent pairs, what fraction did we cache-hit?

def compute_precision_recall(pairs, labels, threshold):
    """Compute precision and recall at a given similarity threshold."""
    predictions = [1 if pair.similarity >= threshold else 0 for pair in pairs]

    true_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 1)
    false_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 0)
    false_negatives = sum(1 for p, l in zip(predictions, labels) if p == 0 and l == 1)

    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0

    return precision, recall

Step 4: Select threshold based on cost of errors. For FAQ queries where wrong answers damage trust, I optimized for precision (0.94 threshold gave 98% precision). For search queries where missing a cache hit just costs money, I optimized for recall (0.88 threshold).
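The same sweep can be automated per query type: For a given precision target, pick the lowest threshold that meets it, which maximizes recall among the acceptable options. A small sketch built on compute_precision_recall (the candidate range and the strict fallback are assumptions):

def select_threshold(pairs, labels, min_precision=0.98):
    """Pick the lowest threshold meeting the precision target (maximizing recall)."""
    best = None
    for threshold in [t / 100 for t in range(80, 100)]:  # sweep 0.80-0.99
        precision, recall = compute_precision_recall(pairs, labels, threshold)
        if precision >= min_precision and (best is None or recall > best[1]):
            best = (threshold, recall)
    return best[0] if best else 0.99  # fall back to a very strict threshold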

Latency overhead

Semantic caching adds latency: You must embed the query and search the vector store before knowing whether to call the LLM.

Our measurements:

Operation          | Latency (p50) | Latency (p99)
Query embedding    | 12ms          | 28ms
Vector search      | 8ms           | 19ms
Total cache lookup | 20ms          | 47ms
LLM API call       | 850ms         | 2400ms

The 20ms overhead is negligible compared to the 850ms LLM call we avoid on cache hits. Even at p99, the 47ms overhead is acceptable.

However, cache misses now take 20ms longer than before (embedding + search + LLM call). At our 67% hit rate, the math works out favorably:

  • Before: 100% of queries × 850ms = 850ms average

  • After: (33% × 870ms) + (67% × 20ms) = 287ms + 13ms = 300ms average

Net latency improvement of 65% alongside the cost reduction.
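That expected-latency arithmetic generalizes to any hit rate; a small helper makes it easy to re-check as traffic patterns shift (the defaults below are the p50 figures from the table):

def expected_latency_ms(hit_rate: float, lookup_ms: float = 20, llm_ms: float = 850) -> float:
    """Average request latency with a semantic cache in front of the LLM."""
    miss_rate = 1 - hit_rate
    return hit_rate * lookup_ms + miss_rate * (lookup_ms + llm_ms)

print(expected_latency_ms(0.67))  # ~300ms at our hit rate
print(expected_latency_ms(0.18))  # ~717ms at exact-match-level hit rates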

Cache invalidation

Cached responses go stale. Product information changes, policies update, and yesterday’s correct answer becomes today’s wrong answer.

I implemented three invalidation strategies:

  1. Time-based TTL

Simple expiration based on content type:

from datetime import timedelta

TTL_BY_CONTENT_TYPE = {
    'pricing': timedelta(hours=4),       # Changes frequently
    'policy': timedelta(days=7),         # Changes rarely
    'product_info': timedelta(days=1),   # Daily refresh
    'general_faq': timedelta(days=14),   # Very stable
}
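A brief sketch of how those TTLs would be enforced at lookup time, assuming each cached entry carries the timestamp written by set() and that the entry's content type is known (the default TTL here is an assumption):

def is_expired(entry: dict, content_type: str) -> bool:
    """Treat a cached entry as expired once its content-type TTL has elapsed."""
    ttl = TTL_BY_CONTENT_TYPE.get(content_type, timedelta(days=1))  # assumed default
    return datetime.utcnow() - entry['timestamp'] > ttl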

  2. Event-based invalidation

When underlying data changes, invalidate related cache entries:

class CacheInvalidator:
    def on_content_update(self, content_id: str, content_type: str):
        """Invalidate cache entries related to updated content."""
        # Find cached queries that referenced this content
        affected_queries = self.find_queries_referencing(content_id)

        for query_id in affected_queries:
            self.cache.invalidate(query_id)

        self.log_invalidation(content_id, len(affected_queries))
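find_queries_referencing is left abstract above; it only works if each cache entry records which content it was generated from. One illustrative way to do that, assuming the retrieval layer can report the document IDs it used, is a small reverse index populated when the response is cached:

from collections import defaultdict

class ContentIndex:
    """Reverse index from content ID to the cache entries built from it (illustrative)."""
    def __init__(self):
        self.entries_by_content = defaultdict(set)

    def tag(self, cache_id: str, source_content_ids: list) -> None:
        for content_id in source_content_ids:
            self.entries_by_content[content_id].add(cache_id)

    def find_queries_referencing(self, content_id: str) -> set:
        return self.entries_by_content.get(content_id, set())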

  3. Staleness detection

For responses that might become stale without explicit events, I implemented periodic freshness checks:

def check_freshness(self, cached_response: dict) -> bool:
    """Verify a cached response is still valid."""
    # Re-run the query against current data
    fresh_response = self.generate_response(cached_response['query'])

    # Compare semantic similarity of the cached and fresh responses
    cached_embedding = self.embed(cached_response['response'])
    fresh_embedding = self.embed(fresh_response)
    similarity = cosine_similarity(cached_embedding, fresh_embedding)

    # If the responses diverged significantly, invalidate
    if similarity < 0.90:
        self.cache.invalidate(cached_response['id'])
        return False

    return True

We run freshness checks on a sample of cached entries daily, catching staleness that TTL and event-based invalidation miss.

Production results

After three months in production:

Metric                              | Before     | After        | Change
Cache hit rate                      | 18%        | 67%          | +272%
LLM API costs                       | $47K/month | $12.7K/month | -73%
Average latency                     | 850ms      | 300ms        | -65%
False-positive rate                 | N/A        | 0.8%         | N/A
Customer complaints (wrong answers) | Baseline   | +0.3%        | Minimal increase

The 0.8% false-positive rate (queries where we returned a cached response that was semantically incorrect) was within acceptable bounds. These cases occurred primarily at the boundaries of our threshold, where similarity was just above the cutoff but intent differed slightly.

Pitfalls to avoid

Don’t use a single global threshold. Different query types have different tolerance for errors. Tune thresholds per category.

Don’t try to skip the embedding step. You might be tempted to avoid the embedding overhead on lookups, but the embedding is the cache key: Without it, there is no way to find a semantically similar entry in the first place. The overhead is unavoidable.

Don’t forget invalidation. Semantic caching without invalidation strategy leads to stale responses that erode user trust. Build invalidation from day one.

Don’t cache everything. Some queries shouldn’t be cached: Personalized responses, time-sensitive information, transactional confirmations. Build exclusion rules.

def should_cache(self, query: str, response: str) -> bool:
    """Determine whether a response should be cached."""
    # Don't cache personalized responses
    if self.contains_personal_info(response):
        return False

    # Don't cache time-sensitive information
    if self.is_time_sensitive(query):
        return False

    # Don't cache transactional confirmations
    if self.is_transactional(query):
        return False

    return True

Key takeaways

Semantic caching is a practical pattern for LLM cost control that captures redundancy exact-match caching misses. The key challenges are threshold tuning (use query-type-specific thresholds based on precision/recall analysis) and cache invalidation (combine TTL, event-based invalidation and staleness detection).

At 73% cost reduction, this was our highest-ROI optimization for production LLM systems. The implementation complexity is moderate, but the threshold tuning requires careful attention to avoid quality degradation.

Sreenivasa Reddy Hulebeedu Reddy is a lead software engineer.

‘Intelition’ changes everything: AI is no longer a tool you invoke

AI is evolving faster than our vocabulary for describing it. We may need a few new words. We have “cognition” for how a single mind thinks, but we don’t have a word for what happens when human and machine intelligence work together to perceive, decide, create and act. Let’s call that process intelition.

Intelition isn’t a feature; it’s the organizing principle for the next wave of software where humans and AI operate inside the same shared model of the enterprise. Today’s systems treat AI models as things you invoke from the outside. You act as a “user,” prompting for responses or wiring a “human in the loop” step into agentic workflows. But that’s evolving into continuous co-production: People and agents are shaping decisions, logic and actions together, in real time.

Read on for a breakdown of the three forces driving this new paradigm.

A unified ontology is just the beginning

In a recent shareholder letter, Palantir CEO Alex Karp wrote that “all the value in the market is going to go to chips and what we call ontology,” and argued that this shift is “only the beginning of something much larger and more significant.” By ontology, Karp means a shared model of objects (customers, policies, assets, events) and their relationships. This also includes what Palantir calls an ontology’s “kinetic layer” that defines the actions and security permissions connecting objects.

In the SaaS era, every enterprise application creates its own object and process models. Add a host of legacy systems and often chaotic data models, and enterprises face the challenge of stitching all of this together. It’s a big and difficult job, with redundancies, incomplete structures and missing data. The reality: No matter how many data warehouse or data lake projects they commission, few enterprises come close to creating a consolidated enterprise ontology.

A unified ontology is essential for today’s agentic AI tools. As organizations link and federate ontologies, a new software paradigm emerges: Agentic AI can reason and act across suppliers, regulators, customers and operations, not just within a single app.  

As Karp describes it, the aim is “to tether the power of artificial intelligence to objects and relationships in the real world.”

World models and continuous learning

Today’s models can hold extensive context, but holding information isn’t the same as learning from it. Continual learning requires the accumulation of understanding, rather than resets with each retraining.

To this aim, Google recently announced “Nested Learning” as a potential solution, grounded directly in existing LLM architecture and training data. The authors don’t claim to have solved the challenges of building world models. But Nested Learning could supply the raw ingredients for them: Durable memory with continual learning layered into the system. The endpoint would make retraining obsolete.

In June 2022, Meta’s chief AI scientist Yann LeCun created a blueprint for “autonomous machine intelligence” that featured a hierarchical approach to using joint embeddings to make predictions with world models. He called the technique H-JEPA, and later put it bluntly: “LLMs are good at manipulating language, but not at thinking.”

Over the past three years, LeCun and his colleagues at Meta have moved H-JEPA theory into practice with open source models V-JEPA and I-JEPA, which learn image and video representations of the world.

The personal intelition interface 

The third force in this agentic, ontology-driven world is the personal interface. This puts people at the center rather than as “users” on the periphery. This is not another app; it is the primary way a person participates in the next era of work and life. Rather than treating AI as something we visit through a chat window or API call, the personal intelition interface will be always-on, aware of our context, preferences and goals, and capable of acting on our behalf across the entire federated economy.

Let’s analyze how this is already coming together.

In May, Jony Ive sold his AI device company io to OpenAI to accelerate a new AI device category. He noted at the time: “If you make something new, if you innovate, there will be consequences unforeseen, and some will be wonderful, and some will be harmful. While some of the less positive consequences were unintentional, I still feel responsibility. And the manifestation of that is a determination to try and be useful.” In other words, getting the personal intelligence device right is about more than an attractive venture opportunity.

Apple is looking beyond LLMs to on-device approaches that require less processing power and add less latency for AI apps that understand “user intent.” Last year, its researchers created UI-JEPA, which moves the analysis of what the user wants onto the device. This strikes directly at the business model of today’s digital economy, where centralized profiling of “users” transforms intent and behavior data into vast revenue streams.

Tim Berners-Lee, the inventor of the World Wide Web, recently noted: “The user has been reduced to a consumable product for the advertiser … there’s still time to build machines that work for humans, and not the other way around.” Moving user intent to the device will drive interest in Solid, a secure personal data management standard that Berners-Lee and his colleagues have been developing for years. The standard is ideally suited to pair with new personal AI devices. For instance, Inrupt, Inc., a company founded by Berners-Lee, recently combined Solid with Anthropic’s MCP standard for Agentic Wallets. Personal control is more than a feature of this paradigm; it is the architectural safeguard as systems gain the ability to learn and act continuously.

Ultimately, these three forces are moving and converging faster than most realize. Enterprise ontologies provide the nouns and verbs, world-model research supplies durable memory and learning, and the personal interface becomes the permissioned point of control. The next software era isn’t coming. It’s already here.

Brian Mulconrey is SVP at Sureify Labs.

Why “which API do I call?” is the wrong question in the LLM era

For decades, we have adapted to software. We learned shell commands, memorized HTTP method names and wired together SDKs. Each interface assumed we would speak its language. First we typed ‘grep’, ‘ssh’ and ‘ls’ into a shell; by the mid-2000s, we were invoking REST endpoints like GET /users; by the 2010s, we imported SDKs (client.orders.list()) so we didn’t have to think about HTTP. But underlying each of those steps was the same premise: Expose capabilities in a structured form so others can invoke them.

But now we are entering the next interface paradigm. Modern LLMs are challenging the notion that a user must choose a function or remember a method signature. Instead of “Which API do I call?” the question becomes: “What outcome am I trying to achieve?” In other words, the interface is shifting from code to language. In this shift, Model Context Protocol (MCP) emerges as the abstraction that allows models to interpret human intent, discover capabilities and execute workflows, effectively exposing software functions not as programmers know them, but as natural-language requests.

MCP is not a hype-term; multiple independent studies identify the architectural shift required for “LLM-consumable” tool invocation. One blog by Akamai engineers describes the transition from traditional APIs to “language-driven integrations” for LLMs. Another academic paper on “AI agentic workflows and enterprise APIs” talks about how enterprise API architecture must evolve to support goal-oriented agents rather than human-driven calls. In short: We are no longer merely designing APIs for code; we are designing capabilities for intent.

Why does this matter for enterprises? Because enterprises are drowning in internal systems, integration sprawl and user training costs. Workers struggle not because they don’t have tools, but because they have too many tools, each with its own interface. When natural language becomes the primary interface, the barrier of “which function do I call?” disappears. One recent business blog observed that natural-language interfaces (NLIs) are enabling self-serve data access for marketers who previously had to wait for analysts to write SQL. When the user simply states intent (like “fetch last quarter revenue for region X and flag anomalies”), the system underneath can translate that into the right calls, orchestration and context, and deliver the results.

Natural language becomes not a convenience, but the interface

To understand how this evolution works, consider the interface ladder:

Era                    | Interface             | Who it was built for
CLI                    | Shell commands        | Expert users typing text
API                    | Web or RPC endpoints  | Developers integrating systems
SDK                    | Library functions     | Programmers using abstractions
Natural language (MCP) | Intent-based requests | Human + AI agents stating what they want

Through each step, humans had to “learn the machine’s language.” With MCP, the machine absorbs the human’s language and works out the rest. That’s not just a UX improvement; it’s an architectural shift.

Under MCP, functions of code are still there: data access, business logic and orchestration. But they’re discovered rather than invoked manually. For example, rather than calling “billingApi.fetchInvoices(customerId=…),” you say “Show all invoices for Acme Corp since January and highlight any late payments.” The model resolves the entities, calls the right systems, filters and returns structured insight. The developer’s work shifts from wiring endpoints to defining capability surfaces and guardrails.
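To make that concrete, here is a purely illustrative sketch of a capability surface: a declaration with natural-language-friendly metadata that an agent could discover, sitting in front of ordinary code. The names, fields and sample data are assumptions for illustration, not an official MCP schema.

# Hypothetical capability declaration; field names are illustrative, not an official MCP schema
INVOICE_CAPABILITY = {
    "name": "list_invoices",
    "description": "List invoices for a customer since a given date and flag late payments.",
    "parameters": {
        "customer": "Customer name or ID, e.g. 'Acme Corp'",
        "since": "Start date in ISO format (YYYY-MM-DD)",
        "flag_late": "Whether to highlight overdue invoices (boolean)",
    },
    "guardrails": {"requires_role": "finance_viewer", "audit_log": True},
}

def list_invoices(customer: str, since: str, flag_late: bool = False) -> list:
    """The capability's implementation: ordinary data access behind the declaration."""
    invoices = [  # stand-in for a real billing query
        {"customer": customer, "date": "2025-02-01", "amount": 1200, "overdue": True},
        {"customer": customer, "date": "2025-03-10", "amount": 800, "overdue": False},
    ]
    results = [inv for inv in invoices if inv["date"] >= since]
    if flag_late:
        for inv in results:
            inv["flagged"] = inv["overdue"]
    return results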

This shift transforms developer experience and enterprise integration. Teams often struggle to onboard new tools because they require mapping schemas, writing glue code and training users. With a natural-language front, onboarding involves defining business entity names, declaring capabilities and exposing them via the protocol. The human (or AI agent) no longer needs to know parameter names or call order. Studies show that using LLMs as interfaces to APIs can reduce the time and resources required to develop chatbots or tool-invoked workflows.

The change also brings productivity benefits. Enterprises that adopt LLM-driven interfaces can turn data access latency (hours or days) into conversation latency (seconds). For instance, if an analyst previously had to export CSVs, run transforms and build slides, a language interface allows “Summarize the top five risk factors for churn over the last quarter” to generate narrative plus visuals in one go. The human then reviews, adjusts and acts, shifting from data plumber to decision maker. That matters: According to a survey by McKinsey & Company, 63% of organizations using gen AI are already creating text outputs, and more than one-third are generating images or code. While many are still in the early days of capturing enterprise-wide ROI, the signal is clear: Language as an interface unlocks new value.

In architectural terms, this means software design must evolve. MCP demands systems that publish capability metadata, support semantic routing, maintain context memory and enforce guardrails. An API design no longer needs to ask “What function will the user call?”, but rather “What intent might the user express?” A recently published framework for improving enterprise APIs for LLMs shows how APIs can be enriched with natural-language-friendly metadata so that agents can select tools dynamically. The implication: Software becomes modular around intent surfaces rather than function surfaces.

Language-first systems also bring risks and requirements. Natural language is ambiguous by nature, so enterprises must implement authentication, logging, provenance and access control, just as they did for APIs. Without these guardrails, an agent might call the wrong system, expose data or misinterpret intent. One post on “prompt collapse” calls out the danger: As natural-language UI becomes dominant, software may turn into “a capability accessed through conversation” and the company into “an API with a natural-language frontend”. That transformation is powerful, but only safe if systems are designed for introspection, audit and governance.

The shift also has cultural and organizational ramifications. For decades, enterprises hired integration engineers to design APIs and middleware. With MCP-driven models, companies will increasingly hire ontology engineers, capability architects and agent enablement specialists. These roles focus on defining the semantics of business operations, mapping business entities to system capabilities and curating context memory. Because the interface is now human-centric, skills such as domain knowledge, prompt framing, oversight and evaluation become central.

What should enterprise leaders do today? First, think of natural language as the interface layer, not as a fancy add-on, and map the business workflows that can safely be invoked via language. Next, catalogue the underlying capabilities you already have (data services, analytics and APIs) and ask: “Are these discoverable? Can they be called via intent?” Finally, pilot an MCP-style layer: Build out a small domain (such as customer support triage) where a user or agent can express outcomes in language and let the systems do the orchestration. Then iterate and scale.

Natural language is not just the new front-end. It is becoming the default interface layer for software, the next step in a progression that ran from CLIs to APIs to SDKs. MCP is the abstraction that makes this possible. The benefits include faster integration, more modular systems, higher productivity and new roles. For organizations still tethered to calling endpoints manually, the shift will feel like learning a new platform all over again. The question is no longer “which function do I call?” but “what do I want to do?”

Dhyey Mavani is accelerating gen AI and computational mathematics.