Skip to content

Model Pricing

Auto AI Router supports per-model cost calculation for spend logging. Prices are loaded from a JSON file or remote URL at startup and merged with any prices stored in the LiteLLM database.

Configuration

server:
  model_prices_link: "file://price.json"
Value Description
file://price.json Relative path to a local file
file:///data/prices.json Absolute path
https://prices.example.com/default.json Remote HTTPS URL
https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json LiteLLM's upstream prices

The file must be valid JSON and must not exceed 100 MB.

Price File Format

The file is a JSON object where each key is a model name and each value is a price descriptor:

{
  "gpt-4o-mini": {
    "input_cost_per_token": 1.5e-07,
    "output_cost_per_token": 6e-07
  },
  "gemini-2.5-flash": {
    "input_cost_per_token": 3e-07,
    "output_cost_per_token": 2.5e-06,
    "input_cost_per_audio_token": 1e-06,
    "output_cost_per_reasoning_token": 2.5e-06
  },
  "claude-opus-4-1": {
    "input_cost_per_token": 1.5e-05,
    "output_cost_per_token": 7.5e-05,
    "cache_read_input_token_cost": 1.5e-06,
    "cache_creation_input_token_cost": 1.875e-05
  },
  "imagen-4.0-fast-generate-001": {
    "output_cost_per_image": 0.02
  }
}

Why prices are per 1 token

All per-token prices are expressed as cost per one token (not per 1 000 or per 1 million). This matches the format used by LiteLLM's model_prices_and_context_window.json, making it straightforward to use the upstream file directly or maintain a custom override file in the same format.

For reference:

  • $1.50 / 1M tokens1.5e-06 (0.0000015)
  • $0.15 / 1M tokens1.5e-07 (0.00000015)

Available fields

Field Description
input_cost_per_token Regular input tokens
output_cost_per_token Regular output tokens
input_cost_per_token_above_200k_tokens Input rate for tokens beyond the 200k threshold
output_cost_per_token_above_200k_tokens Output rate for tokens beyond the 200k threshold
input_cost_per_audio_token Audio input tokens (falls back to input_cost_per_token if absent)
output_cost_per_audio_token Audio output tokens (falls back to output_cost_per_token if absent)
input_cost_per_image_token Image input tokens
output_cost_per_image_token Image output tokens
output_cost_per_reasoning_token Reasoning/thinking tokens (falls back to output_cost_per_token)
input_cost_per_cached_token Cached prompt read cost (alias: cache_read_input_token_cost)
cache_read_input_token_cost LiteLLM-compatible alias for input_cost_per_cached_token
cache_creation_input_token_cost Anthropic cache write cost (falls back to input_cost_per_token)
output_cost_per_cached_token Cached output tokens (falls back to output_cost_per_token)
output_cost_per_prediction_token Accepted predicted-output tokens (falls back to output_cost_per_token)
output_cost_per_image Cost per generated image (takes priority over output_cost_per_image_token)

Cost Calculation

All providers return specialised token counts as subsets of the totals:

  • prompt_tokens (Vertex AI, OpenAI) already includes audio_input_tokens, cached_input_tokens
  • completion_tokens (all providers) already includes reasoning_tokens, audio_output_tokens, prediction tokens
  • Anthropic is the exception: cached_input_tokens and cache_creation_tokens are reported separately and are not included in prompt_tokens

To avoid billing the same tokens at two different rates, the calculator first computes regular (base-rate) token counts by subtracting all specialised sub-types, then adds each sub-type back at its own rate:

regular_input  = prompt_tokens - audio_input_tokens - cached_input_tokens - cache_creation_tokens
regular_output = completion_tokens - audio_output_tokens - reasoning_tokens
                                   - accepted_prediction_tokens - rejected_prediction_tokens

total = regular_input  × input_cost_per_token
      + regular_output × output_cost_per_token
      + audio_input_tokens  × input_cost_per_audio_token
      + audio_output_tokens × output_cost_per_audio_token
      + cached_input_tokens    × cache_read_input_token_cost
      + cache_creation_tokens  × cache_creation_input_token_cost
      + cached_output_tokens   × output_cost_per_cached_token
      + reasoning_tokens            × output_cost_per_reasoning_token
      + accepted_prediction_tokens  × output_cost_per_prediction_token
      + rejected_prediction_tokens  × output_cost_per_token
      + image_count × output_cost_per_image

This means every token is billed exactly once regardless of how the provider reported it.

Regular input tokens

Vertex AI and OpenAI include audio and cached tokens inside prompt_tokens. Anthropic reports cached tokens separately. The formula above handles both:

  • Vertex/OpenAI: 100 prompt − 5 audio − 20 cached = 75 regular, then +5 audio +20 cached at their rates
  • Anthropic: 100 prompt − 0 − 20 cached = 80 regular (cached was separate, so subtracted here keeps the math consistent)

Regular output tokens

All providers include reasoning inside completion_tokens:

  • OpenAI o-series: completion_tokens_details.reasoning_tokens is a subset of completion_tokens
  • Vertex Gemini 2.5+: thinking tokens are included in candidatesTokenCount
  • Anthropic with extended thinking: thinking tokens are included in output_tokens

The subtraction ensures reasoning is billed at output_cost_per_reasoning_token (not double-charged at the base output rate as well).

Tiered pricing (200k threshold)

Some models charge a higher rate once the context exceeds 200 000 tokens. When input_cost_per_token_above_200k_tokens is set:

below = min(prompt_tokens, 200_000)
above = prompt_tokens - 200_000          # only when prompt_tokens > 200_000

# regular tokens are split proportionally between below/above
regular_above = regular_input × above / prompt_tokens
regular_below = regular_input - regular_above

input_cost = regular_below × input_cost_per_token
           + regular_above × input_cost_per_token_above_200k_tokens

The same logic applies to output tokens using output_cost_per_token_above_200k_tokens.

Specialised token types

Type Formula
Audio input audio_input_tokens × input_cost_per_audio_token (falls back to regular input rate)
Audio output audio_output_tokens × output_cost_per_audio_token (falls back to regular output rate)
Cached read cached_input_tokens × cache_read_input_token_cost (falls back to input_cost_per_cached_token, then regular input rate)
Cache creation cache_creation_tokens × cache_creation_input_token_cost (falls back to regular input rate)
Reasoning reasoning_tokens × output_cost_per_reasoning_token (falls back to regular output rate)
Accepted prediction accepted_prediction_tokens × output_cost_per_prediction_token (falls back to regular output rate)
Rejected prediction rejected_prediction_tokens × output_cost_per_token (always at regular output rate)
Images image_count × output_cost_per_image OR image_count × output_cost_per_image_token

How Prices Are Loaded

Loading is handled by internal/models/price_loader.go:

  1. The value of model_prices_link is inspected to determine the source:
  2. Paths starting with file:// or containing no :// are read from disk.
  3. Paths starting with http:// or https:// are fetched via HTTP with a 100 MB limit.
  4. The JSON is parsed into a map[string]*ModelPrice.
  5. Every key is normalised: the provider prefix is stripped and the name is lowercased.
  6. "openai/gpt-4-turbo""gpt-4-turbo"
  7. "vertex_ai/gemini-2.5-pro""gemini-2.5-pro"
  8. If two keys normalise to the same string, the last one wins and a warning is logged.
  9. The resulting map is stored in a ModelPriceRegistry (thread-safe, sync.RWMutex).

DB price merging

When the LiteLLM database is enabled, prices defined in LiteLLM_ModelTable are merged on top of the file-based registry via MergeDB. Database prices take precedence for any model that appears in both sources. The file-based prices remain intact for all other models.

Lookup

When a request completes, the router calls GetPrice(modelName) which normalises the name and returns the *ModelPrice. If no entry is found, cost calculation is skipped and null is stored in the spend log.