Responses API

Auto AI Router implements the OpenAI Responses API and routes requests natively to Anthropic, Vertex AI, and AWS Bedrock — without converting through Chat Completions format as an intermediary.

Endpoints

Method	Path	Description
`POST`	`/v1/responses`	Create a response (HTTP, optionally streaming)
`GET`	`/v1/responses`	Create a response via WebSocket
`GET`	`/v1/responses/{id}`	Retrieve a stored response by ID
`POST`	`/v1/responses/compact`	Compact a conversation into a summary item

Request Parameters

All standard Responses API parameters are supported. The table below lists the full set recognized by the router:

Parameter	Type	Description
`model`	string	Model ID (required)
`input`	string \| array	Conversation input: plain string or input items
`instructions`	string \| null	System-level instructions prepended to the request
`max_output_tokens`	integer	Maximum tokens in the response
`max_tool_calls`	integer	Maximum number of tool calls per response
`temperature`	float	Sampling temperature
`top_p`	float	Top-p (nucleus) sampling
`presence_penalty`	float	Presence penalty
`frequency_penalty`	float	Frequency penalty
`top_logprobs`	integer	Number of log probabilities to return
`stop`	string \| array	Stop sequences
`stream`	boolean	Enable SSE streaming
`background`	boolean	Run as a background job
`tools`	array	Tools available to the model
`tool_choice`	string \| object	Tool selection mode
`reasoning`	object	Reasoning/thinking configuration
`text`	object	Text output configuration (e.g. `response_format`)
`store`	boolean	Persist the response (enables `GET /v1/responses/{id}`)
`previous_response_id`	string	Continue a multi-turn conversation
`metadata`	object	Key-value metadata attached to the response
`include`	array	Extra fields to include in the response
`truncation`	string	Truncation mode (`"auto"` \| `"disabled"`)
`user`	string	User identifier
`parallel_tool_calls`	boolean	Allow parallel tool calls
`service_tier`	string	Service tier hint
`prompt_cache_key`	string	Cache key for prompt caching
`prompt_cache_retention`	string	Cache retention duration
`conversation`	interface	Conversation context (passthrough)

Provider coverage

Not all providers support every parameter. See the Provider Support table below.

Content Types

The input array accepts items of different types. Supported ContentPart types within messages:

`type`	Fields	Description
`input_text`	`text`	Plain text
`input_image`	`image_url` (string or `{url, detail}`), `file_id`, `detail`	Image from URL or file ID
`input_audio`	`data` (base64), `format`	Audio clip
`input_file`	`file_id`, `filename`, `file_url`	File reference

Input items can also be function call / function call output items for multi-turn tool use:

{"type": "function_call", "call_id": "call_abc", "name": "get_weather", "arguments": "{\"city\":\"Paris\"}"}
{"type": "function_call_output", "call_id": "call_abc", "output": "{\"temp\":22}"}

Multi-Turn Conversations

Storing Responses

Set "store": true to persist a response. A stored response can be retrieved later:

curl http://localhost:8080/v1/responses/resp_01abc... \
  -H "Authorization: Bearer sk-your-key"

Continuing a Conversation

Pass previous_response_id to continue from a prior response. The router reconstructs the previous output as input context before sending to the provider:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-your-key")

first = client.responses.create(
    model="claude-sonnet-4-20250514",
    input="What is the capital of France?",
    store=True,
)

second = client.responses.create(
    model="claude-sonnet-4-20250514",
    input="And what language do they speak there?",
    previous_response_id=first.id,
    store=True,
)

Streaming

Add "stream": true to receive Server-Sent Events. The event sequence follows the Responses API specification:

response.created
response.in_progress
  response.output_item.added
  response.content_part.added
  response.output_text.delta  (repeated)
  response.output_text.done
  response.content_part.done
  response.output_item.done
response.completed
[DONE]

stream = client.responses.create(
    model="gemini-2.5-flash",
    input="Tell me about Paris",
    stream=True,
)

for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)

WebSocket Protocol

The router accepts WebSocket connections on GET /v1/responses (with Upgrade: websocket header). This allows multiple request-response turns on a single persistent connection.

Connection

const ws = new WebSocket("ws://localhost:8080/v1/responses", {
  headers: { "Authorization": "Bearer sk-your-key" }
});

Sending a Request

Send a JSON message with "type": "response.create" and any standard Responses API fields:

{
  "type": "response.create",
  "model": "claude-sonnet-4-20250514",
  "input": "Hello! What is 2+2?",
  "stream": true
}

The type field is stripped before forwarding to the provider.

Receiving Events

The server sends each SSE event as a plain JSON text message (no data: prefix, no [DONE]). Turn completion is signaled by a terminal event (response.completed, response.failed, response.incomplete, error).

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === "response.output_text.delta") {
    process.stdout.write(data.delta);
  } else if (data.type === "response.completed") {
    console.log("\nDone");
  } else if (data.type === "error") {
    console.error(data.error.message);
  }
};

Error Events

HTTP errors are converted to structured WebSocket error events:

{
  "type": "error",
  "sequence_number": 0,
  "error": {
    "code": "api_error",
    "message": "Rate limit exceeded",
    "type": "server_error",
    "param": null
  }
}

Connection-Local Cache

When store: false is explicitly set, completed responses are cached in connection-local memory for the duration of the WebSocket connection. This allows previous_response_id continuations within the same session without a persistent store. The cache is cleared on reconnect.

When store is absent or true, the persistent response store handles continuations across reconnects.

Multi-Turn Example

// First turn
ws.send(JSON.stringify({
  type: "response.create",
  model: "claude-sonnet-4-20250514",
  input: "What is the capital of France?",
  store: false,
}));

// Wait for response.completed, capture response ID, then:
ws.send(JSON.stringify({
  type: "response.create",
  model: "claude-sonnet-4-20250514",
  input: "What language do they speak there?",
  previous_response_id: "<id from first turn>",
  store: false,
}));

Compact API

POST /v1/responses/compact summarizes a conversation into a single compaction item. This is useful for reducing context size while preserving essential information.

Request

curl -X POST http://localhost:8080/v1/responses/compact \
  -H "Authorization: Bearer sk-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-20250514",
    "input": [
      {"role": "user", "content": "What is photosynthesis?"},
      {"role": "assistant", "content": "Photosynthesis is the process by which plants..."}
    ]
  }'

Requirements:

model is required
Request body limit: 10 MB

Response

{
  "id": "resp_01abc...",
  "object": "response.compaction",
  "created_at": 1234567890,
  "output": [
    {
      "type": "compaction",
      "id": "compact_01xyz...",
      "encrypted_content": "<summary of the conversation>"
    }
  ],
  "usage": {
    "input_tokens": 120,
    "output_tokens": 45,
    "total_tokens": 165
  }
}

The encrypted_content field contains the model's summary. Use this item in input for subsequent requests to continue the conversation from the compacted context.

Native vs Passthrough Mode

The router uses two modes for Responses API requests:

Mode	Description
Native	Responses API request → provider-specific format directly. Preserves all provider features
Passthrough	Responses API request → Chat Completions → provider, then Chat Completions → Responses API

Native mode is used automatically for Anthropic, Vertex AI, and AWS Bedrock. Passthrough is used for OpenAI and other providers that already speak Responses API natively.

The mode can be overridden via model configuration:

models:
  - name: "my-model"
    passthrough_responses: true  # force passthrough

Provider Support

Feature	Anthropic	Vertex AI	Bedrock	OpenAI
Non-streaming	✅	✅	✅	✅
Streaming (SSE)	✅	✅	✅	✅
WebSocket	✅	✅	✅	✅
`store` / response store	✅	✅	✅	✅
`previous_response_id`	✅	✅	✅	✅
`tools` (function)	✅	✅	✅	✅
`reasoning`	✅	✅	✅	✅
`presence_penalty`	❌	✅	❌	✅
`frequency_penalty`	❌	✅	❌	✅
`top_logprobs`	❌	✅	❌	✅
`compact` endpoint	✅	✅	✅	✅

Retry and Fallback

When a provider credential returns a rate-limit error (429), the router automatically tries the next available credential of the same type. The original HTTP error code is preserved in the final response — the client receives 429 (not 502) when all credentials of the appropriate type are exhausted.

When no credentials are available at all, the router returns 503 Service Unavailable.