Our LLM API bill was growing 30% month-over-month. Traffic was increasing, but not that fast. When I analyzed our query logs, I found the real problem: users ask the same questions in different ways.
"What's your return policy?", "How do I return something?", and "Can I get a refund?" were all hitting our LLM separately, generating nearly identical responses, each incurring full API costs.
Exact-match caching, the obvious first solution, captured only 18% of these redundant calls. The same semantic question, phrased differently, bypassed the cache entirely.
So I implemented semantic caching: caching based on what queries mean, not how they're worded. Our cache hit rate climbed to 67%, cutting LLM API costs by 73%. But getting there required solving problems that naive implementations miss.
Why exact-match caching falls short
Traditional caching uses query text as the cache key. This works when queries are identical:
# Exact-match caching
cache_key = hash(query_text)
if cache_key in cache:
    return cache[cache_key]
But users don't phrase questions identically. My analysis of 100,000 production queries found:
Only 18% were exact duplicates of previous queries
47% were semantically similar to previous queries (same intent, different wording)
35% were genuinely novel queries
That 47% represented massive cost savings we were missing. Each semantically similar query triggered a full LLM call, generating a response nearly identical to one we'd already computed.
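To make the miss pattern concrete, here's a minimal sketch of why hash-based keys treat every paraphrase as a brand-new query (the hashing scheme and sample queries are illustrative, not our production code):

```python
import hashlib

def exact_key(query: str) -> str:
    # Even with light normalization, different wordings hash differently
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

queries = [
    "What's your return policy?",
    "How do I return something?",
    "Can I get a refund?",
]

keys = {exact_key(q) for q in queries}
print(len(keys))  # 3 distinct keys -> 3 separate LLM calls for one intent
```

Three phrasings of the same intent produce three distinct cache keys, so each one pays for a full LLM call.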
Semantic caching architecture
Semantic caching replaces text-based keys with embedding-based similarity lookup:
from typing import Optional

class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.92):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.vector_store = VectorStore()      # FAISS, Pinecone, etc.
        self.response_store = ResponseStore()  # Redis, DynamoDB, etc.

    def get(self, query: str) -> Optional[str]:
        """Return cached response if semantically similar query exists."""
        query_embedding = self.embedding_model.embed(query)
        # Nearest-neighbor lookup against previously cached query embeddings
        match = self.vector_store.nearest(query_embedding)
        if match is not None and match.similarity >= self.threshold:
            return self.response_store.get(match.id)
        return None
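The class above leans on external stores, so here is a self-contained, in-memory sketch of the same idea using brute-force cosine similarity. The `stub_embed` function is a toy stand-in for a real sentence-embedding model (which is also why the demo threshold is lower than the 0.92 we use in production):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class InMemorySemanticCache:
    """Illustrative in-memory version: lists instead of FAISS/Redis."""
    def __init__(self, embed_fn, similarity_threshold=0.92):
        self.embed = embed_fn
        self.threshold = similarity_threshold
        self.vectors = []    # cached query embeddings
        self.responses = []  # parallel list of cached responses

    def get(self, query):
        if not self.vectors:
            return None
        q = self.embed(query)
        sims = [cosine(q, v) for v in self.vectors]
        best = max(range(len(sims)), key=sims.__getitem__)
        return self.responses[best] if sims[best] >= self.threshold else None

    def put(self, query, response):
        self.vectors.append(self.embed(query))
        self.responses.append(response)

# Toy embedding: keyword presence plus a small offset to avoid zero vectors.
# A real system would call a sentence-embedding model here instead.
def stub_embed(text):
    vocab = ["return", "refund", "policy", "ship", "track"]
    t = text.lower()
    return [1.01 if w in t else 0.01 for w in vocab]

cache = InMemorySemanticCache(stub_embed, similarity_threshold=0.7)
cache.put("What's your return policy?", "You have 30 days to return items.")
print(cache.get("How do I return something?"))  # similar embedding -> cache hit
print(cache.get("Where is my package?"))        # None (below threshold -> miss)
```

The mechanics match the production version: embed, find the nearest cached query, and serve the stored response only when similarity clears the threshold.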