1. Use the smallest model that passes your eval
This is the single biggest lever. Small models are often 20–100× cheaper and handle classification, extraction, routing and simple chat perfectly well. Reserve frontier models for genuine reasoning.
2. Route by difficulty
Send easy requests to a cheap model and escalate only hard ones to a premium model. A lightweight classifier decides — and costs almost nothing.
3. Cap output length
Output tokens cost 2–5× input. Set max_tokens and prompt for concise answers. Going from 800 to 400 output tokens can halve the more expensive half of your bill.
4. Cache repeated context
If you resend a big system prompt, examples or retrieved context, prompt caching bills that prefix at a fraction of the price. Ideal for agents, coding tools and RAG.
5. Batch non-urgent work
Evals, enrichment and overnight jobs don't need instant responses — the Batch API takes ~50% off.
6. Trim the system prompt
Every token in your system prompt is paid on every call. Cut waffle, move rarely-needed instructions to a tool, and compress examples.
7. Retrieve fewer RAG chunks
Each retrieved chunk is input tokens. Add a reranker so you can retrieve 3 great chunks instead of 8 mediocre ones.
8. Summarise long histories
In long chats, summarise old turns instead of resending them verbatim every message.
9. Reduce retries
Validate tool inputs, use structured outputs, and add guardrails so the model gets it right first time.
10. Stream and stop early
Stream responses and stop generation once you have what you need, especially for structured extraction.
11. Deduplicate identical requests
Cache results for identical inputs (e.g. the same document summarised twice) at the application layer.
12. Monitor and alert
Track tokens per feature and per user. A single long-context feature or a runaway agent loop can dominate the bill — you can't cut what you don't measure.