AI Cost Glossary

Plain-English definitions of the terms used across our calculators.

Token

The unit that models read text in and bill by: roughly ¾ of an English word, or about 4 characters. API prices are usually quoted per million tokens.
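
A minimal sketch of turning the ~4-characters-per-token rule of thumb into a cost estimate. The helper names and the $3.00-per-million price are hypothetical. Real counts depend on each model's tokenizer, so treat this as a ballpark only.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count from character length (a heuristic, not a tokenizer)."""
    # Round up so any non-empty string counts as at least one token.
    return math.ceil(len(text) / chars_per_token)


def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of `tokens` at a price quoted per million tokens."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical example: a 4,000-character prompt at $3.00 per million tokens.
prompt = "x" * 4_000
tokens = estimate_tokens(prompt)  # 1000 tokens by this heuristic
print(tokens, cost_usd(tokens, price_per_million=3.00))
```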

Input vs output tokens

Input (prompt) tokens are what you send; output (completion) tokens are what the model generates. Output tokens are usually priced 2–5× higher than input tokens.
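
Because the two sides are billed at different rates, a request's cost is the sum of two per-million-token terms. A minimal sketch; the function name and the $1 / $4 per-million prices (a 4× output premium) are hypothetical placeholders, not any provider's actual rates.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """USD cost of one request, with input and output billed separately."""
    return (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000


# Hypothetical prices: $1 per million input tokens, $4 per million output tokens.
cost = request_cost(2_000, 500, input_price_per_m=1.0, output_price_per_m=4.0)
```

Here the 500 output tokens are only 20% of the tokens but half of the $0.004 total, which is why response length matters so much for cost.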

Context window

The maximum number of tokens (input plus output combined) a model can handle in one request. Larger windows let you send longer documents, but every token you put in the window is billed, so filling it raises cost.
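
Since input and output share one window, a request only fits if the prompt leaves room for the longest reply you allow. A minimal sketch; the function names and the 128,000-token window are illustrative assumptions, and real limits vary by model.

```python
def fits_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved reply budget fits in one request."""
    return input_tokens + max_output_tokens <= context_window


def max_prompt_tokens(context_window: int, max_output_tokens: int) -> int:
    """Largest prompt that still leaves room for the reserved reply budget."""
    return max(0, context_window - max_output_tokens)


WINDOW = 128_000  # hypothetical context window size
ok = fits_context(120_000, 8_000, WINDOW)       # exactly fills the window
too_big = fits_context(121_000, 8_000, WINDOW)  # exceeds it by 1,000 tokens
```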

Prompt caching

Reusing a stored prompt prefix so that repeated context is billed at a discounted cached rate instead of the full input rate. See prompt caching explained.
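
On a cache hit, only the reused prefix gets the cached rate; the new part of the prompt is still billed normally. A minimal sketch of the input-side cost; the function name and both prices ($3.00 per million normally, $0.30 cached, i.e. a 90% discount) are assumptions, and it ignores any surcharge some providers apply when the cache is first written.

```python
def prompt_input_cost(
    prefix_tokens: int,
    new_tokens: int,
    input_price_per_m: float,
    cached_price_per_m: float,
    cache_hit: bool,
) -> float:
    """Input-side USD cost of one request whose prompt starts with a reusable prefix."""
    # The shared prefix gets the discounted rate only when it is served from cache.
    prefix_rate = cached_price_per_m if cache_hit else input_price_per_m
    return (prefix_tokens * prefix_rate + new_tokens * input_price_per_m) / 1_000_000


# Hypothetical: a 10,000-token prefix reused, plus 500 new tokens per request.
miss = prompt_input_cost(10_000, 500, 3.00, 0.30, cache_hit=False)
hit = prompt_input_cost(10_000, 500, 3.00, 0.30, cache_hit=True)
```

With these assumed prices a hit costs one-seventh of a miss ($0.0045 vs $0.0315 of input per request).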

Batch API

An asynchronous mode that trades latency for roughly a 50% discount, ideal for non-urgent bulk jobs.
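
The batch discount applies to the whole job, so the saving scales linearly with volume. A minimal sketch; the function name, the 50% default and the per-request cost are assumptions, and actual batch discounts and turnaround times vary by provider.

```python
def job_cost(
    requests: int,
    cost_per_request: float,
    use_batch: bool,
    batch_discount: float = 0.5,
) -> float:
    """Total USD cost of a bulk job, optionally run through a batch mode."""
    total = requests * cost_per_request
    # Batch mode trades latency for a flat percentage off the whole job.
    return total * (1 - batch_discount) if use_batch else total


# Hypothetical job: 100,000 requests at $0.004 each.
realtime = job_cost(100_000, 0.004, use_batch=False)
batched = job_cost(100_000, 0.004, use_batch=True)
```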

Embedding

A numeric vector representing text, used for semantic search and retrieval. Embedding models are far cheaper per token than generation models.
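
Semantic search compares embedding vectors, often with cosine similarity. A minimal pure-Python sketch; the three-dimensional vectors are made up for illustration (real embeddings come from an embedding model and have hundreds or thousands of dimensions), and production systems usually use a vector database rather than a linear scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], docs: dict[str, list[float]], k: int) -> list[str]:
    """IDs of the k documents most similar to the query (linear scan)."""
    ranked = sorted(
        docs, key=lambda doc_id: cosine_similarity(query, docs[doc_id]), reverse=True
    )
    return ranked[:k]


# Made-up 3-dimensional "embeddings" for illustration only.
docs = {
    "pricing": [0.9, 0.1, 0.0],
    "refunds": [0.1, 0.9, 0.0],
    "billing": [0.8, 0.3, 0.1],
}
results = top_k([1.0, 0.0, 0.0], docs, k=2)
```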

RAG (retrieval-augmented generation)

Searching a knowledge base for relevant chunks and passing them to the model as context before it answers. See the RAG cost calculator.
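
Each retrieved chunk becomes extra input tokens, so per-query cost grows with how many chunks you pass in and how large they are. A minimal sketch; the function name, the token counts and the $1 / $4 per-million prices are hypothetical, and it ignores the (usually small) cost of embedding the query.

```python
def rag_query_cost(
    question_tokens: int,
    num_chunks: int,
    tokens_per_chunk: int,
    answer_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """USD cost of one RAG answer: question plus retrieved context in, answer out."""
    input_tokens = question_tokens + num_chunks * tokens_per_chunk
    return (
        input_tokens * input_price_per_m + answer_tokens * output_price_per_m
    ) / 1_000_000


# Hypothetical: 100-token question, 5 chunks of 500 tokens, 300-token answer.
cost = rag_query_cost(100, 5, 500, 300, input_price_per_m=1.0, output_price_per_m=4.0)
```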

Reranker

A model that reorders retrieved chunks by relevance, so you can pass fewer chunks to the generation model without losing answer quality.
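
A common pattern is to retrieve a generous candidate set, score each candidate against the query, and forward only the best few. A minimal sketch; `overlap_score` is a toy word-overlap stand-in for a real reranker model (which would score query-chunk pairs with a trained model), and all names here are hypothetical.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    keep: int,
) -> list[str]:
    """Sort candidates by relevance score (highest first) and keep the top `keep`."""
    ranked = sorted(candidates, key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:keep]


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: number of lowercase words shared with the query."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


candidates = [
    "tokens are the billing unit",
    "batch jobs get a discount",
    "batch mode is asynchronous",
]
top = rerank("batch discount", candidates, overlap_score, keep=2)
```

Keeping 5 of 20 retrieved 500-token chunks, for instance, cuts the context from 10,000 to 2,500 input tokens per query; the reranker's own cost has to be weighed against that saving.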

Agent

An LLM system that completes multi-step tasks using planning, tools and retries, typically making many model calls per task.
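
An agent task is billed as the sum of every call it makes, retries included, and input often grows at each step because the conversation history is re-sent. A minimal sketch; the `Call` record, the per-step token counts and the $1 / $4 per-million prices are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Token usage of one model call made during an agent task."""
    input_tokens: int
    output_tokens: int


def task_cost(calls: list[Call], input_price_per_m: float, output_price_per_m: float) -> float:
    """USD cost of one agent task: the sum over all of its model calls."""
    total_in = sum(c.input_tokens for c in calls)
    total_out = sum(c.output_tokens for c in calls)
    return (total_in * input_price_per_m + total_out * output_price_per_m) / 1_000_000


# Hypothetical trace: plan, tool call, a retry of that tool call, final answer.
trace = [Call(1_500, 200), Call(3_000, 300), Call(3_000, 300), Call(4_000, 500)]
cost = task_cost(trace, input_price_per_m=1.0, output_price_per_m=4.0)
```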

Effective cost

Nominal cost per attempt divided by the success rate, reflecting that you also pay for failed attempts.
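
If failed attempts are retried until one succeeds, and attempts succeed independently with probability p, the expected number of attempts per success is 1/p, which is where the division comes from. A minimal sketch; the function name and the example figures are hypothetical.

```python
def effective_cost(nominal_cost: float, success_rate: float) -> float:
    """Expected spend per successful result when failures are retried."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in the range (0, 1]")
    # Expected attempts per success is 1 / success_rate.
    return nominal_cost / success_rate


# Hypothetical: $0.10 per attempt, 80% of attempts succeed.
per_success = effective_cost(0.10, 0.80)
```

With these figures each successful task costs $0.125 rather than the nominal $0.10.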

Self-hosting

Running an open-weight model on your own GPUs instead of paying a per-token API. See API vs self-hosted.
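
One way to compare self-hosting with a per-token API is to convert GPU rental cost into an effective price per million tokens. A minimal sketch; the function name, the $3.60/hour GPU price and the 1,000 tokens-per-second aggregate throughput are hypothetical, and the estimate ignores engineering time, storage, networking and other overheads.

```python
def self_hosted_price_per_m(
    gpu_hourly_usd: float,
    num_gpus: int,
    tokens_per_second: float,
    utilization: float = 1.0,
) -> float:
    """Effective USD per million tokens for a self-hosted model.

    tokens_per_second is aggregate throughput across all concurrent requests;
    utilization is the fraction of paid GPU time spent actually serving traffic.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    tokens_per_hour = tokens_per_second * 3_600 * utilization
    return gpu_hourly_usd * num_gpus / tokens_per_hour * 1_000_000


# Hypothetical: one $3.60/hour GPU serving 1,000 tokens/s in aggregate.
busy = self_hosted_price_per_m(3.60, 1, 1_000, utilization=1.0)
half_idle = self_hosted_price_per_m(3.60, 1, 1_000, utilization=0.5)
```

Halving utilization doubles the effective price (about $1.00 vs $2.00 per million tokens here), so idle capacity can decide whether self-hosting beats an API.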