Prompt Caching Explained (and When It Saves Money)

Prompt caching is one of the highest-leverage cost optimisations for LLM apps with large, repeated context — but it isn't free, and it doesn't always help. Here's how it works and when to use it.

What it is

When you send a prompt, the provider can store the work it has done on a prefix of that prompt. Later requests that start with exactly the same tokens reuse that work, and the reused tokens are billed at a discounted cached-input rate (often 10–25% of the normal input price) instead of full price.

The catch: write cost

The cache is populated on a cache miss, and on some providers that write costs slightly more than a normal request. Caching therefore only pays off when the prefix is reused often enough to outweigh the write penalties. The hit rate at which the two balance out is the break-even hit rate.
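The break-even point can be worked out with a little arithmetic. If cache writes are billed at w times the normal input price and cache reads at r times, the expected price of a prefix token at hit rate h is h*r + (1 - h)*w, and caching breaks even when that equals 1, giving h = (w - 1) / (w - r). Here is a minimal sketch of that calculation; the function name and the example multipliers (1.25 and 0.10) are illustrative assumptions, not any particular provider's pricing.

```python
def break_even_hit_rate(write_multiplier: float, read_multiplier: float) -> float:
    """Hit rate above which caching is cheaper than not caching.

    Both multipliers are relative to the normal input price (1.0).
    The values passed in are assumptions; check your provider's price sheet.
    """
    if read_multiplier >= 1.0:
        raise ValueError("cached reads must be cheaper than normal input")
    if write_multiplier <= 1.0:
        # No write premium: any reuse at all saves money.
        return 0.0
    # Solve h * read + (1 - h) * write = 1 for h.
    return (write_multiplier - 1.0) / (write_multiplier - read_multiplier)


if __name__ == "__main__":
    # Assumed example: writes cost 25% extra, reads cost 10% of normal.
    rate = break_even_hit_rate(write_multiplier=1.25, read_multiplier=0.10)
    print(f"break-even hit rate: {rate:.1%}")
```

Under those assumed multipliers the break-even hit rate is a little over a fifth, so a prefix that hits the cache on most requests clears it comfortably.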

When it wins

Two conditions must both hold: the cacheable prefix is large, and it's frequently reused. Agents, coding assistants and RAG apps that resend the same system prompt, tool schemas or context on every call are perfect candidates.

Maximising hit rate

Put stable content first; keep it byte-for-byte identical (no changing timestamps).
Place dynamic content (the user query) after the cached prefix.
Be aware of the cache TTL (time to live): rarely used prefixes may expire before they are reused.
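The ordering advice above can be sketched in code. This example is provider-agnostic and does not call any real API: SYSTEM_PROMPT, TOOL_SCHEMAS, build_prompt and prefix_fingerprint are all hypothetical names chosen for illustration. It builds the stable prefix deterministically, appends the user query afterwards, and hashes the prefix so you can check in logs or tests that it really is identical between requests.

```python
import hashlib
import json

# Hypothetical stable content; in a real app this is your system prompt and tools.
SYSTEM_PROMPT = "You are a support assistant for ExampleCo."
TOOL_SCHEMAS = [{"name": "lookup_order", "parameters": {"order_id": "string"}}]


def build_prompt(user_query: str) -> tuple[str, str]:
    """Return (stable_prefix, dynamic_suffix) for one request."""
    # Serialise with sorted keys and fixed separators so the bytes never vary.
    tools = json.dumps(TOOL_SCHEMAS, sort_keys=True, separators=(",", ":"))
    # No timestamps, request IDs or other per-request values in the prefix.
    prefix = SYSTEM_PROMPT + "\n" + tools + "\n"
    # Everything that changes per request goes after the prefix.
    suffix = "User: " + user_query
    return prefix, suffix


def prefix_fingerprint(prefix: str) -> str:
    """SHA-256 of the prefix bytes; a changed fingerprint means a cache miss."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    first, _ = build_prompt("Where is my order?")
    second, _ = build_prompt("Can I change my address?")
    print(prefix_fingerprint(first) == prefix_fingerprint(second))
```

Logging the fingerprint alongside each request is one cheap way to spot an accidental change to the prefix, such as a reordered tool list, before it shows up as a drop in hit rate.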

Estimate your savings

Use the prompt caching savings calculator to compare cost with and without caching and find your break-even hit rate.
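For a rough idea of what such a comparison involves, here is a minimal sketch that estimates input-token cost with and without caching. It is not the calculator's implementation: the function name, its parameters and the default multipliers (1.25 for writes, 0.10 for reads) are assumptions for illustration, and it ignores output tokens and any minimum cacheable prefix length.

```python
def estimate_input_cost(
    requests: int,
    prefix_tokens: int,
    dynamic_tokens: int,
    hit_rate: float,
    price_per_mtok: float,
    write_multiplier: float = 1.25,  # assumed cache-write premium
    read_multiplier: float = 0.10,   # assumed cached-read discount
) -> tuple[float, float]:
    """Return (cost_without_caching, cost_with_caching) for input tokens only."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    per_token = price_per_mtok / 1_000_000
    # Without caching, every token is billed at the normal input price.
    without = requests * (prefix_tokens + dynamic_tokens) * per_token
    # With caching, prefix tokens are reads on hits and writes on misses.
    prefix_factor = hit_rate * read_multiplier + (1.0 - hit_rate) * write_multiplier
    with_cache = requests * (prefix_tokens * prefix_factor + dynamic_tokens) * per_token
    return without, with_cache


if __name__ == "__main__":
    # Assumed workload: 10,000 requests, 8,000-token prefix, 200-token query.
    base, cached = estimate_input_cost(10_000, 8_000, 200, 0.9, price_per_mtok=3.00)
    print(f"without caching: ${base:.2f}, with caching: ${cached:.2f}")
```

Sweeping hit_rate from 0 to 1 shows where the two costs cross, which is the break-even hit rate for your own prices and prompt sizes.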