AI Cost Glossary
Plain-English definitions of the terms used across our calculators.
Token
The unit models read and bill by: roughly ¾ of an English word (about 4 characters) on average, though the exact count depends on the model's tokenizer. Prices are usually quoted per million tokens.
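As a rough illustration of the rule of thumb above, the sketch below estimates a token count from character length and converts it to a cost. The function names and the example price are hypothetical, and the 4-characters-per-token ratio is only an approximation; a real tokenizer gives the exact count.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Approximate token count from character length (rule of thumb, not a tokenizer)."""
    return round(len(text) / chars_per_token)


def token_cost(tokens: int, price_per_million: float) -> float:
    """Cost of `tokens` at a price quoted per million tokens."""
    return tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    text = "a" * 400                      # 400 characters of placeholder text
    tokens = estimate_tokens(text)        # 100 under the 4-chars-per-token assumption
    # 3.00 per million tokens is a made-up price for illustration only.
    print(tokens, token_cost(tokens, price_per_million=3.00))
```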
Input vs output tokens
Input (prompt) tokens are what you send; output (completion) tokens are what the model generates. Output tokens usually cost 2–5× as much per token as input tokens.
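Because the two token types are priced separately, the cost of one request is the sum of two terms. A minimal sketch, with a hypothetical function name and made-up prices (output at 4× input, inside the 2–5× range mentioned above):

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one request when input and output tokens are priced separately."""
    input_cost = input_tokens * input_price_per_million
    output_cost = output_tokens * output_price_per_million
    return (input_cost + output_cost) / 1_000_000


if __name__ == "__main__":
    # Illustrative prices only: 1.00 per million input, 4.00 per million output.
    cost = request_cost(2_000, 500, 1.00, 4.00)
    print(f"{cost:.4f}")
```

In this example the 500 output tokens cost as much as the 2,000 input tokens, which is why output-heavy workloads are often more expensive than their token counts suggest.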
Context window
The maximum number of tokens (input + output) a model can handle in one request. Larger windows enable longer documents, but filling them raises cost because every token sent is billed.
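Since input and output share the same window, a long prompt leaves less room for the answer. The sketch below computes the remaining output budget; the function name and the window size in the example are hypothetical.

```python
def output_budget(context_window: int, input_tokens: int) -> int:
    """Tokens left for the model's output after the input is counted."""
    remaining = context_window - input_tokens
    if remaining < 0:
        raise ValueError("input alone exceeds the context window")
    return remaining


if __name__ == "__main__":
    # A made-up 128,000-token window with a 100,000-token prompt.
    print(output_budget(128_000, 100_000))
```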
Prompt caching
Reusing a stored prompt prefix so repeated context is billed at a discounted cached rate. See prompt caching explained.
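To see the effect on a bill, the sketch below prices the cached prefix at a discounted rate and the rest of the prompt at the normal rate. The function name and both prices are hypothetical, and the sketch ignores any surcharge a provider may apply when the prefix is first written to the cache.

```python
def input_cost_with_cache(
    cached_prefix_tokens: int,
    fresh_tokens: int,
    price_per_million: float,
    cached_price_per_million: float,
) -> float:
    """Input cost when a stored prefix is billed at a discounted cached rate."""
    cached_cost = cached_prefix_tokens * cached_price_per_million
    fresh_cost = fresh_tokens * price_per_million
    return (cached_cost + fresh_cost) / 1_000_000


if __name__ == "__main__":
    # Made-up prices: 2.00 per million normally, 0.20 per million when cached.
    with_cache = input_cost_with_cache(9_000, 1_000, 2.00, 0.20)
    without_cache = input_cost_with_cache(0, 10_000, 2.00, 0.20)
    print(f"with cache: {with_cache:.4f}, without: {without_cache:.4f}")
```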
Batch API
An asynchronous mode that trades latency for roughly a 50% discount — ideal for non-urgent bulk jobs.
Embedding
A numeric vector representing text, used for semantic search and retrieval. Embedding models are far cheaper than generation models.
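Semantic search typically compares embeddings with a similarity measure such as cosine similarity. The sketch below uses tiny hand-written 2-dimensional vectors purely for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy vectors, not output from any real embedding model.
    documents = {"pricing": [0.9, 0.1], "weather": [0.1, 0.9]}
    query = [0.8, 0.2]
    best = max(documents, key=lambda name: cosine_similarity(query, documents[name]))
    print(best)
```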
RAG (retrieval-augmented generation)
Searching a knowledge base for relevant chunks and passing them to the model as context before it answers. See the RAG cost calculator.
Reranker
A model that re-orders retrieved chunks by relevance, letting you pass fewer chunks to the model while keeping quality.
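The saving from a reranker shows up in the input token count of each query. A rough sketch, assuming fixed-size chunks; the function name and all numbers are hypothetical, and the reranker's own cost is not included.

```python
def rag_input_tokens(question_tokens: int, chunks: int, tokens_per_chunk: int) -> int:
    """Input tokens for one RAG query: the question plus the chunks passed as context."""
    return question_tokens + chunks * tokens_per_chunk


if __name__ == "__main__":
    # Made-up sizes: a 50-token question and 300-token chunks.
    without_reranker = rag_input_tokens(50, chunks=10, tokens_per_chunk=300)
    with_reranker = rag_input_tokens(50, chunks=3, tokens_per_chunk=300)
    print(without_reranker, with_reranker)
```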
Agent
An LLM system that completes multi-step tasks using planning, tools and retries — typically many model calls per task.
Failure-adjusted cost
Nominal cost per attempt divided by success rate, giving the expected cost per successful result. It reflects that you pay for failed attempts too.
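The definition above translates directly into one division. A minimal sketch with a hypothetical function name and made-up numbers; it assumes every attempt costs the same and that the success rate is expressed as a fraction between 0 and 1.

```python
def failure_adjusted_cost(nominal_cost: float, success_rate: float) -> float:
    """Nominal cost per attempt divided by the success rate (a fraction in (0, 1])."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be greater than 0 and at most 1")
    return nominal_cost / success_rate


if __name__ == "__main__":
    # A 0.10 task that succeeds 80% of the time costs about 0.125 per success.
    print(f"{failure_adjusted_cost(0.10, 0.8):.3f}")
```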
Self-hosting
Running an open-weight model on your own GPUs instead of paying a per-token API. See API vs self-hosted.
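One way to frame the comparison is a break-even volume: the monthly token count at which a fixed self-hosting bill equals the API bill. The sketch below is deliberately simplified. The function name and numbers are hypothetical, it uses a single blended per-token API price, and it ignores engineering time, utilization and throughput limits.

```python
def break_even_tokens_per_month(
    monthly_self_host_cost: float,
    api_price_per_million: float,
) -> float:
    """Monthly tokens at which a fixed self-hosting cost equals the API bill."""
    if api_price_per_million <= 0:
        raise ValueError("api_price_per_million must be positive")
    return monthly_self_host_cost / api_price_per_million * 1_000_000


if __name__ == "__main__":
    # Made-up figures: 2,000 per month for GPUs, 4.00 per million API tokens.
    tokens = break_even_tokens_per_month(2_000.0, 4.00)
    print(f"{tokens:,.0f} tokens per month")
```

Below that volume the API is cheaper under these assumptions; above it, the fixed GPU cost is spread over enough tokens to win.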