The fundamental difference
APIs are pure pay-per-token: you pay only for what you use. Self-hosting is a fixed cost: you rent GPUs by the hour whether they're busy or idle. That single difference drives the entire decision.
Break-even volume
Below a certain monthly token volume, the API always wins because you're not paying for idle hardware. Above it, owned GPUs can be cheaper per token. The crossover is often hundreds of millions to billions of tokens per month — model yours in the self-hosted vs API calculator.
Utilisation is decisive
A GPU busy 20% of the time costs the same as one busy 90% of the time but serves far fewer tokens, so its cost-per-token is much higher. Without high, steady utilisation, self-hosting rarely pays.
The hidden cost: people
Self-hosting means running model servers, autoscaling, monitoring, patching and on-call. For small teams this engineering time often exceeds the GPU bill. APIs bundle all of it into the per-token price.
When self-hosting wins
Very high steady volume, strict data-residency/compliance requirements, or customisation needs (fine-tuned models, special decoding). A hybrid — API for bursty traffic, self-host the steady base load — is often optimal.