Model Pricing
Auto AI Router supports per-model cost calculation for spend logging. Prices are loaded from a JSON file or remote URL at startup and merged with any prices stored in the LiteLLM database.
Configuration
| Value | Description |
|---|---|
file://price.json |
Relative path to a local file |
file:///data/prices.json |
Absolute path |
https://prices.example.com/default.json |
Remote HTTPS URL |
https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json |
LiteLLM's upstream prices |
The file must be valid JSON and must not exceed 100 MB.
Price File Format
The file is a JSON object where each key is a model name and each value is a price descriptor:
{
"gpt-4o-mini": {
"input_cost_per_token": 1.5e-07,
"output_cost_per_token": 6e-07
},
"gemini-2.5-flash": {
"input_cost_per_token": 3e-07,
"output_cost_per_token": 2.5e-06,
"input_cost_per_audio_token": 1e-06,
"output_cost_per_reasoning_token": 2.5e-06
},
"claude-opus-4-1": {
"input_cost_per_token": 1.5e-05,
"output_cost_per_token": 7.5e-05,
"cache_read_input_token_cost": 1.5e-06,
"cache_creation_input_token_cost": 1.875e-05
},
"imagen-4.0-fast-generate-001": {
"output_cost_per_image": 0.02
}
}
Why prices are per 1 token
All per-token prices are expressed as cost per one token (not per 1 000 or per 1 million). This matches the format used by LiteLLM's model_prices_and_context_window.json, making it straightforward to use the upstream file directly or maintain a custom override file in the same format.
For reference:
$1.50 / 1M tokens→1.5e-06(0.0000015)$0.15 / 1M tokens→1.5e-07(0.00000015)
Available fields
| Field | Description |
|---|---|
input_cost_per_token |
Regular input tokens |
output_cost_per_token |
Regular output tokens |
input_cost_per_token_above_200k_tokens |
Input rate for tokens beyond the 200k threshold |
output_cost_per_token_above_200k_tokens |
Output rate for tokens beyond the 200k threshold |
input_cost_per_audio_token |
Audio input tokens (falls back to input_cost_per_token if absent) |
output_cost_per_audio_token |
Audio output tokens (falls back to output_cost_per_token if absent) |
input_cost_per_image_token |
Image input tokens |
output_cost_per_image_token |
Image output tokens |
output_cost_per_reasoning_token |
Reasoning/thinking tokens (falls back to output_cost_per_token) |
input_cost_per_cached_token |
Cached prompt read cost (alias: cache_read_input_token_cost) |
cache_read_input_token_cost |
LiteLLM-compatible alias for input_cost_per_cached_token |
cache_creation_input_token_cost |
Anthropic cache write cost (falls back to input_cost_per_token) |
output_cost_per_cached_token |
Cached output tokens (falls back to output_cost_per_token) |
output_cost_per_prediction_token |
Accepted predicted-output tokens (falls back to output_cost_per_token) |
output_cost_per_image |
Cost per generated image (takes priority over output_cost_per_image_token) |
Cost Calculation
All providers return specialised token counts as subsets of the totals:
prompt_tokens(Vertex AI, OpenAI) already includesaudio_input_tokens,cached_input_tokenscompletion_tokens(all providers) already includesreasoning_tokens,audio_output_tokens, prediction tokens- Anthropic is the exception:
cached_input_tokensandcache_creation_tokensare reported separately and are not included inprompt_tokens
To avoid billing the same tokens at two different rates, the calculator first computes regular (base-rate) token counts by subtracting all specialised sub-types, then adds each sub-type back at its own rate:
regular_input = prompt_tokens - audio_input_tokens - cached_input_tokens - cache_creation_tokens
regular_output = completion_tokens - audio_output_tokens - reasoning_tokens
- accepted_prediction_tokens - rejected_prediction_tokens
total = regular_input × input_cost_per_token
+ regular_output × output_cost_per_token
+ audio_input_tokens × input_cost_per_audio_token
+ audio_output_tokens × output_cost_per_audio_token
+ cached_input_tokens × cache_read_input_token_cost
+ cache_creation_tokens × cache_creation_input_token_cost
+ cached_output_tokens × output_cost_per_cached_token
+ reasoning_tokens × output_cost_per_reasoning_token
+ accepted_prediction_tokens × output_cost_per_prediction_token
+ rejected_prediction_tokens × output_cost_per_token
+ image_count × output_cost_per_image
This means every token is billed exactly once regardless of how the provider reported it.
Regular input tokens
Vertex AI and OpenAI include audio and cached tokens inside prompt_tokens. Anthropic reports cached tokens separately. The formula above handles both:
- Vertex/OpenAI:
100 prompt − 5 audio − 20 cached = 75 regular, then +5 audio +20 cached at their rates - Anthropic:
100 prompt − 0 − 20 cached = 80 regular(cached was separate, so subtracted here keeps the math consistent)
Regular output tokens
All providers include reasoning inside completion_tokens:
- OpenAI
o-series:completion_tokens_details.reasoning_tokensis a subset ofcompletion_tokens - Vertex Gemini 2.5+: thinking tokens are included in
candidatesTokenCount - Anthropic with extended thinking: thinking tokens are included in
output_tokens
The subtraction ensures reasoning is billed at output_cost_per_reasoning_token (not double-charged at the base output rate as well).
Tiered pricing (200k threshold)
Some models charge a higher rate once the context exceeds 200 000 tokens. When input_cost_per_token_above_200k_tokens is set:
below = min(prompt_tokens, 200_000)
above = prompt_tokens - 200_000 # only when prompt_tokens > 200_000
# regular tokens are split proportionally between below/above
regular_above = regular_input × above / prompt_tokens
regular_below = regular_input - regular_above
input_cost = regular_below × input_cost_per_token
+ regular_above × input_cost_per_token_above_200k_tokens
The same logic applies to output tokens using output_cost_per_token_above_200k_tokens.
Specialised token types
| Type | Formula |
|---|---|
| Audio input | audio_input_tokens × input_cost_per_audio_token (falls back to regular input rate) |
| Audio output | audio_output_tokens × output_cost_per_audio_token (falls back to regular output rate) |
| Cached read | cached_input_tokens × cache_read_input_token_cost (falls back to input_cost_per_cached_token, then regular input rate) |
| Cache creation | cache_creation_tokens × cache_creation_input_token_cost (falls back to regular input rate) |
| Reasoning | reasoning_tokens × output_cost_per_reasoning_token (falls back to regular output rate) |
| Accepted prediction | accepted_prediction_tokens × output_cost_per_prediction_token (falls back to regular output rate) |
| Rejected prediction | rejected_prediction_tokens × output_cost_per_token (always at regular output rate) |
| Images | image_count × output_cost_per_image OR image_count × output_cost_per_image_token |
How Prices Are Loaded
Loading is handled by internal/models/price_loader.go:
- The value of
model_prices_linkis inspected to determine the source: - Paths starting with
file://or containing no://are read from disk. - Paths starting with
http://orhttps://are fetched via HTTP with a 100 MB limit. - The JSON is parsed into a
map[string]*ModelPrice. - Every key is normalised: the provider prefix is stripped and the name is lowercased.
"openai/gpt-4-turbo"→"gpt-4-turbo""vertex_ai/gemini-2.5-pro"→"gemini-2.5-pro"- If two keys normalise to the same string, the last one wins and a warning is logged.
- The resulting map is stored in a
ModelPriceRegistry(thread-safe,sync.RWMutex).
DB price merging
When the LiteLLM database is enabled, prices defined in LiteLLM_ModelTable are merged on top of the file-based registry via MergeDB. Database prices take precedence for any model that appears in both sources. The file-based prices remain intact for all other models.
Lookup
When a request completes, the router calls GetPrice(modelName) which normalises the name and returns the *ModelPrice. If no entry is found, cost calculation is skipped and null is stored in the spend log.