API REFERENCE

Chat Completions

Generate chat responses from more than 100 models through a single API. Drop-in compatible with OpenAI Chat Completions, Anthropic Messages, and OpenAI Responses.

Airforce speaks both wire formats, OpenAI Chat Completions and Anthropic Messages, on top of one set of models. Pick the SDK you already use and change only the base URL; non-Claude models are proxied transparently behind both interfaces.

This page covers authentication, request and response shapes for both interfaces, streaming, tool calling, vision, reasoning, and prompt caching. New here? Start with the basic example below, get a single call working, and then add streaming, tools, or caching.

Authentication

Every request requires a Bearer token (your Airforce API key). The Anthropic x-api-key header is also accepted on /v1/messages for SDK compatibility.

Authorization: Bearer sk-air-YOUR_API_KEY
# alt for /v1/messages:
x-api-key: sk-air-YOUR_API_KEY

POST /v1/chat/completions

OpenAI-compatible chat completions. Works with the official openai SDK by overriding base_url to https://api.airforce/v1.

POST https://api.airforce/v1/chat/completions

Request body

Parameter	Type	Required	Description
model	string	Required	Model identifier. Use GET /v1/models to discover the available IDs.
messages	array	Required	Conversation history. Each entry has { role: "system" | "user" | "assistant" | "tool", content }. Content is a string or an array of content blocks (vision, see below).
max_tokens	integer	Optional	Maximum number of tokens to generate. Capped at the model's max_output_tokens.
temperature	float	Optional	Sampling temperature, 0–2. Lower values are more deterministic. The default depends on the upstream provider.
top_p	float	Optional	Nucleus sampling. Use either temperature or top_p, not both.
stream	boolean	Optional	If true, the response is a stream of Server-Sent Events. See "Streaming" below.
models	array	Optional	Fallback models (max 3), e.g. ["deepseek-v3.2", "gpt-4o-mini"]. If every channel of the primary model fails, each candidate is tried in order. You are billed for — and response.model reports — the model that actually answered. Unknown or plan-gated candidates are skipped. With the OpenAI SDK pass it via extra_body.
transforms	array	Optional	Prompt transforms. Supported: ["middle-out"] — when the conversation overflows the model's context window, whole messages are dropped from the middle (system prompts, the first message and the most recent turns are kept), so long roleplay or agent histories keep working instead of erroring. Opt-in; off by default.
stream_options	object	Optional	{ include_usage: boolean }. Usage is always included in the final stream chunk; this field is accepted for OpenAI compatibility, but usage cannot be turned off.
stop	string | array	Optional	Up to 4 stop sequences. Generation stops as soon as one of them is encountered.
tools	array	Optional	Definitions of the functions the model may call. See "Tool calling" below.
tool_choice	string | object	Optional	"auto" (default), "none", or { type: "function", function: { name } } to force a call to a specific function.
response_format	object	Optional	{ type: "json_object" } makes the model emit valid JSON. Ignored for models that do not support it. See "Structured outputs" below for the json_schema mode.
reasoning_effort	string	Optional	Reasoning depth: "low" | "medium" | "high" | "xhigh" | "max". Any model with supports_reasoning: true (Claude, OpenAI o/GPT-5, Gemini, Qwen, DeepSeek, …). See "Reasoning & thinking".
thinking	string | object	Optional	Cross-model thinking switch. "on" | "off" | "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" | "disabled" }. See "Reasoning & thinking".
thinking_budget	integer	Optional	Token limit for the model's reasoning trace (when the provider exposes one).
ignore_defaults	boolean	Optional	Skip your saved per-model default parameters (configured in the dashboard) for this request.
skill	string	Optional	ID of a single marketplace skill to apply to this request. The skill transforms your messages/parameters before the upstream call and overrides any installed-skill defaults. Consumed by Airforce, never forwarded upstream. See the Skills catalog at /docs/api/extend.
skills	array	Optional	Array of marketplace skill IDs applied in order, for stacking multiple skills on one request.
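
As a rough illustration of the models field described in the table, the sketch below attaches a fallback list to a request body and enforces the documented cap of three candidates on the client side. The helper name with_fallbacks is hypothetical and not part of any SDK.

```python
def with_fallbacks(body: dict, fallbacks: list) -> dict:
    """Return a copy of the request body with a `models` fallback list attached."""
    if len(fallbacks) > 3:
        # The table above documents a maximum of 3 fallback models.
        raise ValueError("at most 3 fallback models are accepted")
    return {**body, "models": list(fallbacks)}


body = with_fallbacks(
    {"model": "gpt-5.1-chat", "messages": [{"role": "user", "content": "Hi"}]},
    ["deepseek-v3.2", "gpt-4o-mini"],
)
```

With the openai SDK the same list would go into extra_body, as the table notes, because models is not a standard OpenAI parameter.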

Basic example

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
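
The same call can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: build_chat_request and send are hypothetical helper names, and the request is only constructed here, not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.airforce/v1"


def build_chat_request(api_key, model, messages, **params):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=60.0):
    """Perform the HTTP call and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_chat_request(
    "sk-air-YOUR_API_KEY",
    "gpt-5.1-chat",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=200,
    temperature=0.7,
)
# reply = send(request)  # uncomment to perform the network call
```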

Response shape

Parameter	Type	Required	Description
id	string	Optional	Stable completion identifier, for example "chatcmpl-abc123".
object	string	Optional	"chat.completion" for non-streaming, "chat.completion.chunk" for streaming.
created	integer	Optional	Unix timestamp (in seconds).
model	string	Optional	The model that answered. Normally an echo of the requested model ID; when a fallback from models answered, the ID of that fallback.
choices	array	Optional	Array of completion candidates: [{ index, message: { role, content, tool_calls? }, finish_reason }].
choices[].finish_reason	string	Optional	"stop" | "length" | "tool_calls" | "content_filter".
usage	object	Optional	{ prompt_tokens, completion_tokens, total_tokens, completion_tokens_details?, prompt_tokens_details?, cache_creation_input_tokens?, cache_creation? }. completion_tokens_details.reasoning_tokens is set when the model generated a reasoning trace. Cache fields appear when the upstream returned prompt-caching information: prompt_tokens_details.cached_tokens reports cache reads (OpenAI standard), cache_creation_input_tokens aggregates writes, and cache_creation.ephemeral_5m_input_tokens / ephemeral_1h_input_tokens break the writes down by TTL.

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "gpt-5.1-chat",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 8,
    "total_tokens": 28
  }
}
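
A short sketch of reading the fields above from a decoded response. The function name summarize_completion is hypothetical; optional fields are read defensively because the table marks them as possibly absent.

```python
def summarize_completion(response: dict) -> dict:
    """Pull the commonly needed fields out of a chat.completion object."""
    choice = response["choices"][0]
    message = choice["message"]
    usage = response.get("usage") or {}
    details = usage.get("completion_tokens_details") or {}
    return {
        "text": message.get("content"),
        "finish_reason": choice.get("finish_reason"),
        # "length" means generation stopped at the max_tokens limit.
        "truncated": choice.get("finish_reason") == "length",
        "tool_calls": message.get("tool_calls") or [],
        "total_tokens": usage.get("total_tokens"),
        "reasoning_tokens": details.get("reasoning_tokens"),
    }


example = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "model": "gpt-5.1-chat",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "The capital of France is Paris."},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 20, "completion_tokens": 8, "total_tokens": 28},
}
summary = summarize_completion(example)
```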

Reasoning & thinking

Reasoning/thinking is a cross-model feature for every model ID with supports_reasoning: true — Claude, OpenAI o-series/GPT-5, Gemini, Qwen, DeepSeek, and others. You send the same canonical parameters; Airforce maps them to each provider's native shape. This is not a DeepSeek-only API.

Source of truth: check for supports_reasoning: true on the model in GET /v1/models (or GET /api/models/{id}/allowed-params). Prefer that flag over guessing from the model name.

Canonical parameters

Parameter	Type	Required	Description
reasoning_effort	string	Optional	"low" | "medium" | "high" | "xhigh" | "max". Accepted on every model with supports_reasoning: true. Some upstreams only honour a subset (e.g. high/max); others clamp unsupported levels to the nearest served value.
thinking	string | object	Optional	Three accepted shapes (we normalise): "on" | "off" | "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" | "disabled" }. Mapped onto Claude extended thinking, OpenAI effort profiles, Gemini thinking_config, Qwen enable_thinking, DeepSeek hybrid, etc.
thinking_budget	integer	Optional	Maximum tokens the model may spend reasoning before emitting visible output. Mirrors budget_tokens when the upstream exposes a budget; takes precedence over reasoning_effort when both are sent and a budget is available.

What differs by family (mapping only)

Parameters are the same everywhere. Only how we map them (and how hard "off" is) differs:

Claude — Thinking on/off + budget; often also reasoning_effort via the gateway.
OpenAI (o1/o3, GPT-5) — Mainly reasoning_effort. A full "thinking off" is often not available — you control how strongly the model reasons, not always whether it reasons at all.
Gemini — thinking_config / budget mapped internally.
Qwen / Xiaomi / Alibaba — thinking + enable_thinking-style controls.
DeepSeek (generic) — Hybrid on/off is especially clear: thinking: { type: enabled|disabled } plus optional reasoning_effort.
Resellers / other — Often generic passthrough of the same canonical fields.

Controlling where the trace appears

An optional reasoning object on the request decides what happens to the thinking trace. It is consumed by Airforce and never forwarded upstream.

Parameter	Type	Required	Description
reasoning.format	string	Optional	"separate" (default) puts the trace in message.reasoning (and delta.reasoning while streaming). "inline" keeps the legacy inline <think>…</think> form inside content.
reasoning.exclude	boolean	Optional	When true, the reasoning trace is dropped entirely from the response. Reasoning tokens are still counted and billed if the model produced them.

"reasoning": { "format": "separate", "exclude": false }

Reasoning effort (OpenAI-style)

Primary control for o-series and GPT-5: how much the model may reason. Same canonical field as on every other supports_reasoning model — OpenAI is included, but behaviour is not 1:1 with DeepSeek's hard on/off.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "o3-mini",
    "messages": [{"role": "user", "content": "Prove the Pythagorean theorem."}],
    "reasoning_effort": "high"
  }'

Extended thinking (Anthropic-style)

Budget-based thinking for Claude (and gateways that accept the Anthropic shape). You can still send reasoning_effort; we map when the channel supports it.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "messages": [{"role": "user", "content": "Plan a 7-day Italy trip."}],
    "thinking": {"type": "enabled", "budget_tokens": 4000}
  }'

Hybrid thinking (e.g. DeepSeek V3.2/V4)

Example of a hybrid model family with a clear Thinking / Non-Thinking switch — not a separate protocol. deepseek-v3.2, deepseek-v4-flash and deepseek-v4-pro accept the same canonical fields as every other supports_reasoning model. Toggle thinking and optionally set effort in one request:

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Solve this step by step: integrate x^2 * e^x."}],
    "thinking": {"type": "enabled"},
    "reasoning_effort": "high"
  }'

Turn thinking off (faster, cheaper when you only need the final answer) — this hard off is clearer on hybrid models than on many OpenAI o-series profiles:

"thinking": {"type": "disabled"}
// or simply: "thinking": "off"

Native docs for this family often list effort levels such as "high" and "max". We accept the full low…max scale and map unsupported levels to the nearest value that reaches the model. Prefer the hybrid IDs above over retired deepseek-chat / deepseek-reasoner names when you need an explicit on/off switch.

The reasoning trace itself appears in choices[0].message.reasoning (OpenAI shape) or as thinking blocks in content (Anthropic shape). Reasoning tokens are billed and reported in usage.completion_tokens_details.reasoning_tokens.

The completion_tokens_details.reasoning_tokens breakdown is present only when the upstream provider reports it. In a streamed response, the trace arrives in delta.reasoning_content for each chunk.
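
Because the trace can arrive either in message.reasoning or inline as <think>…</think> inside content (see reasoning.format), client code may want to handle both. The following is a minimal sketch under that assumption; split_reasoning is a hypothetical helper, and it only covers the OpenAI-shaped message.

```python
import re

_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(message: dict) -> tuple:
    """Return (reasoning, answer) for a chat.completions assistant message."""
    content = message.get("content") or ""
    if message.get("reasoning"):
        # "separate" format: the trace has its own field.
        return message["reasoning"], content
    # "inline" format: strip <think> blocks out of the visible content.
    traces = [t.strip() for t in _THINK.findall(content)]
    answer = _THINK.sub("", content).strip()
    return "\n".join(traces), answer


separate = split_reasoning({"content": "Paris.", "reasoning": "France's capital is Paris."})
inline = split_reasoning({"content": "<think>France's capital is Paris.</think>Paris."})
```

Both calls yield the same (reasoning, answer) pair, so downstream code does not need to know which format was requested.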

Vision and image input

Models with supports_vision: true accept images embedded in content blocks. Either a public URL or a base64 data URL works; size limits depend on the upstream model.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
      ]
    }]
  }'
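
For the base64 variant, a data URL can be built from raw image bytes. This is a small sketch with hypothetical helper names (image_block_from_bytes, vision_message); the image/jpeg default is an assumption, so pass the MIME type that matches your file.

```python
import base64


def image_block_from_bytes(data: bytes, mime_type: str = "image/jpeg") -> dict:
    """Wrap raw image bytes in an image_url content block using a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def vision_message(prompt: str, image_block: dict) -> dict:
    """Combine a text prompt and one image block into a user message."""
    return {
        "role": "user",
        "content": [{"type": "text", "text": prompt}, image_block],
    }


# In practice the bytes would come from a file, e.g. open("cat.jpg", "rb").read().
message = vision_message("What is in this image?", image_block_from_bytes(b"abc"))
```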

Tool calling

Models with supports_tools: true can call functions that you define. The model returns a tool_calls array; you run the call and then send the result back in a tool message.

Request

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto"
  }'

Tool call response

{
  "id": "chatcmpl-abc123",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"Paris\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

Tool result follow-up

{
  "model": "gpt-5.1-chat",
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}
      }]
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 14, \"sky\": \"cloudy\"}"}
  ]
}
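
The round trip above can be sketched as a small dispatcher that turns an assistant message with tool_calls into the tool messages for the follow-up request. The get_weather implementation and the TOOLS registry are hypothetical stand-ins for your own functions.

```python
import json


def get_weather(location: str) -> dict:
    # Hypothetical local implementation; a real one would query a weather service.
    return {"temp_c": 14, "sky": "cloudy"}


TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list:
    """Execute each requested tool call and build the matching tool messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        function = TOOLS[call["function"]["name"]]
        # Arguments arrive as a JSON-encoded string, not as an object.
        arguments = json.loads(call["function"]["arguments"])
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(function(**arguments)),
        })
    return results


assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"},
    }],
}
tool_messages = run_tool_calls(assistant_message)
```

The follow-up request then carries the original user message, the assistant message unchanged, and tool_messages appended in order.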

Assistant prefill

End your messages array with an assistant message that already contains some text, and the model continues from it instead of starting a fresh turn. This is a reliable way to force a response to begin a specific way — a leading "{" for JSON, a chosen language, or a fixed prefix. The same trick works on /v1/messages. Providers that reject native prefill are handled automatically: the gateway retries once with a compatible rewrite, so you do not have to special-case them.

{
  "model": "claude-sonnet-4.6",
  "messages": [
    {"role": "user", "content": "List three primary colors as a JSON array."},
    {"role": "assistant", "content": "["}
  ]
}

Structured outputs

Set response_format to make the model return JSON. Two modes are supported:

{ "type": "json_object" } — the response is a single valid JSON value.
{ "type": "json_schema", "json_schema": { "name", "schema", "strict" } } — the model is steered to produce JSON that matches your JSON Schema.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Extract the city and country: I live in Paris, France."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "location",
        "schema": {
          "type": "object",
          "properties": { "city": {"type": "string"}, "country": {"type": "string"} },
          "required": ["city", "country"]
        }
      }
    }
  }'

Reliability: even when a model wraps its answer in prose or a markdown code fence, Airforce extracts the JSON payload so you always receive parseable content. If no valid JSON can be recovered, the original text is returned unchanged — so the guarantee never makes a response worse. This applies to non-streamed responses; streamed responses are passed through unchanged.
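
Since streamed responses are passed through unchanged, a client that streams may want its own fallback extraction once the full text is assembled. The sketch below is one possible approach and makes no claim to mirror the gateway's implementation; extract_json is a hypothetical helper.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable
_FENCED = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_json(text: str):
    """Return the parsed JSON found in text, or the original text if none parses."""
    candidates = [text] + _FENCED.findall(text)
    # Also try the outermost {...} and [...] spans for JSON wrapped in prose.
    for opener, closer in (("{", "}"), ("[", "]")):
        start, end = text.find(opener), text.rfind(closer)
        if start != -1 and end > start:
            candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            return json.loads(candidate.strip())
        except json.JSONDecodeError:
            continue
    return text


fenced_reply = "Here you go:\n" + FENCE + 'json\n{"city": "Paris", "country": "France"}\n' + FENCE
location = extract_json(fenced_reply)
```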

Streaming

Set stream: true to receive partial completions as Server-Sent Events. Each event is a single JSON chunk with the same shape as the non-streaming response, except that message is replaced by delta. The stream ends with data: [DONE].

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Write a haiku about Berlin."}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'

Wire format

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"Cold "},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"stone "},"finish_reason":null}]}

…

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":17,"total_tokens":29}}

data: [DONE]
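
A minimal sketch of consuming this wire format: it accepts any iterable of text lines (for example the decoded lines of an HTTP response body), concatenates delta.content, and keeps the last usage object seen. collect_stream is a hypothetical name, and the sample chunks are trimmed to the fields the function reads.

```python
import json


def collect_stream(lines):
    """Accumulate text, finish_reason and usage from OpenAI-style SSE lines."""
    parts = []
    finish_reason = None
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        usage = chunk.get("usage") or usage
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason, usage


sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Cold "},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"stone "},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],'
    '"usage":{"prompt_tokens":12,"completion_tokens":17,"total_tokens":29}}',
    "",
    "data: [DONE]",
]
text, finish_reason, usage = collect_stream(sample)
```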

Reliability & smart routing

Every model ID resolves to a pool of upstream providers behind the scenes. If the first one errors or times out, the request is automatically retried against the next provider for the same model, in order, before any failure is returned — you do not configure or trigger this. The model field in the response always reports the variant that actually answered. This is independent of the optional models / fallbacks array, which adds your own cross-model candidates on top: first the primary model exhausts its own provider chain, then each fallback model exhausts its chain.

POST /v1/messages

Anthropic-compatible Messages API. Works with the official @anthropic-ai/sdk by setting baseURL to https://api.airforce. Requests for non-Claude models are forwarded transparently to OpenAI, Google, and other providers.

POST https://api.airforce/v1/messages

Request body

Parameter	Type	Required	Description
model	string	Required	Model identifier (Anthropic format or a routable alias).
messages	array	Required	Each entry: { role: "user" | "assistant", content: string | array }.
max_tokens	integer	Required	Required by Anthropic. Token limit for the response.
system	string | array	Optional	System prompt. Pass an array of { type: "text", text, cache_control? } blocks to mark cacheable prefix segments. See "Prompt caching".
temperature	float	Optional	0–1.
top_p	float	Optional	Nucleus sampling.
top_k	integer	Optional	Restrict the sampling pool to the top K tokens.
stop_sequences	array	Optional	Up to 4 stop sequences.
stream	boolean	Optional	If true, emits an Anthropic-style SSE event stream (see "Streaming events" below).
fallbacks	array	Optional	Fallback models (max 3) in Anthropic form: [{"model": "gpt-4o-mini"}]. If every channel of the primary model fails, each candidate is tried in order; you are billed for — and the response model field reports — the model that actually answered. A plain models string array is accepted too.
tools	array	Optional	Anthropic tool definitions: { name, description, input_schema }. The response may contain tool_use content blocks.
tool_choice	object	Optional	{ type: "auto" | "any" | "tool", name? }.
thinking	object	Optional	Anthropic extended thinking: { type: "enabled", budget_tokens: N }.

Example

curl https://api.airforce/v1/messages \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "max_tokens": 256,
    "system": "You are a helpful assistant.",
    "messages": [
      {"role": "user", "content": "Hello, Claude!"}
    ]
  }'

Response shape

Parameter	Type	Required	Description
id	string	Optional	Message identifier, for example "msg_01ABCxyz".
type	string	Optional	Always "message".
role	string	Optional	Always "assistant".
content	array	Optional	Array of content blocks: { type: "text" | "tool_use" | "thinking", … }.
model	string	Optional	The model that answered. Normally an echo of the requested model; when a fallback answered, that fallback's ID.
stop_reason	string	Optional	"end_turn" | "max_tokens" | "stop_sequence" | "tool_use".
usage	object	Optional	{ input_tokens, output_tokens, cache_read_input_tokens?, cache_creation_input_tokens?, cache_creation? }. Cache fields appear when prompt caching is used. cache_creation.ephemeral_5m_input_tokens and ephemeral_1h_input_tokens break the writes down by TTL.

Streaming events

Anthropic SSE uses named events instead of bare JSON chunks. Each event has both an event: name and a data: JSON payload.

event: message_start
data: {"type":"message_start","message":{"id":"msg_01","role":"assistant","content":[],"model":"claude-sonnet-4.6","stop_reason":null,"usage":{"input_tokens":12,"output_tokens":1}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}

event: message_stop
data: {"type":"message_stop"}
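
A minimal sketch of reading these named events: it pairs each data: line with the preceding event: name, concatenates text_delta fragments, and picks up stop_reason and the final output token count from message_delta. collect_anthropic_stream is a hypothetical name; other delta types, such as thinking or tool input, are ignored here.

```python
import json


def collect_anthropic_stream(lines):
    """Accumulate text, stop_reason and output_tokens from Anthropic-style SSE lines."""
    event = None
    parts = []
    stop_reason = None
    output_tokens = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data = json.loads(line[len("data:"):].strip())
            if event == "content_block_delta":
                delta = data.get("delta", {})
                if delta.get("type") == "text_delta":
                    parts.append(delta.get("text", ""))
            elif event == "message_delta":
                stop_reason = data.get("delta", {}).get("stop_reason", stop_reason)
                output_tokens = data.get("usage", {}).get("output_tokens", output_tokens)
            elif event == "message_stop":
                break
    return "".join(parts), stop_reason, output_tokens


sample = [
    "event: content_block_start",
    'data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}',
    "",
    "event: content_block_delta",
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}',
    "",
    "event: message_delta",
    'data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}',
    "",
    "event: message_stop",
    'data: {"type":"message_stop"}',
]
text, stop_reason, output_tokens = collect_anthropic_stream(sample)
```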

POST /v1/messages/count_tokens

Anthropic-compatible token counting. Send the same system / messages / tools you would pass to /v1/messages and get an input-token estimate back without running the model — nothing is billed.

POST https://api.airforce/v1/messages/count_tokens

curl https://api.airforce/v1/messages/count_tokens \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "system": "You are a helpful assistant.",
    "messages": [{"role": "user", "content": "Hello, Claude!"}]
  }'

# → {"input_tokens": 34}

The count is a fast character-based estimate (about 4 characters per token) over system, messages and tools — close enough for context-budget checks, not an exact tokenizer run.
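
For a quick local check without a round trip, the same rule of thumb can be approximated on the client. This is only a sketch of the "about 4 characters per token" heuristic: the rounding and any per-message overhead the endpoint applies are not documented here, so the numbers will not necessarily match its output. Both function names are hypothetical.

```python
import math


def estimate_tokens(*texts: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: total characters divided by chars_per_token, rounded up."""
    total_chars = sum(len(text) for text in texts)
    return math.ceil(total_chars / chars_per_token)


def fits_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Check that the prompt plus the requested output budget fits the context window."""
    return prompt_tokens + max_output_tokens <= context_window


prompt_estimate = estimate_tokens("You are a helpful assistant.", "Hello, Claude!")
```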

Prompt caching

On /v1/messages with Claude models, mark a prefix as cacheable by passing system as an array of blocks in which the cacheable segment carries cache_control: { type: "ephemeral" }. Subsequent requests that start with the same prefix are billed at the lower cache-read rate. Models with supports_caching: true in /v1/models support this.

Write vs read pricing

Cache writes are typically charged slightly above normal input (about 1.25× on Claude-family models). Cache reads are much cheaper (about 0.1× input). A large write with almost no later read is the expensive case — not a “cache discount”. Only reusing the same prefix turns the write into savings.

Tools like Claude Code often attach a large project context with cache markers on the first turns. Expect cache-write spend while the repo/system prefix is loaded; later turns only get cheap if that prefix is stable and reused. Subagents and multi-step agents can multiply large contexts across several requests.

{
  "model": "claude-sonnet-4.6",
  "max_tokens": 1024,
  "system": [
    {"type": "text", "text": "You are a senior staff engineer at Airforce."},
    {
      "type": "text",
      "text": "<repository-snapshot>...</repository-snapshot>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Where is rate limiting enforced?"}
  ]
}

How cache counters appear in the response

Cache token counters are passed through in each format's native shape, so the SDKs (openai, @anthropic-ai/sdk, @google/genai) read them without special code. Fields are omitted when zero, which keeps responses without caching compact.

/v1/chat/completions (OpenAI shape)

"usage": {
  "prompt_tokens": 2104,
  "completion_tokens": 147,
  "total_tokens": 2251,
  "prompt_tokens_details": { "cached_tokens": 1980 },
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1/messages (Anthropic shape)

"usage": {
  "input_tokens": 2104,
  "output_tokens": 147,
  "cache_read_input_tokens": 1980,
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1beta/.../generateContent (Gemini shape)

"usageMetadata": {
  "promptTokenCount": 2104,
  "candidatesTokenCount": 147,
  "totalTokenCount": 2251,
  "cachedContentTokenCount": 1980
}
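
If one code path handles several interfaces, the three shapes above can be reduced to a common pair of counters. This is a sketch with a hypothetical helper name (cache_counters); it treats missing fields as zero, and reports zero writes for the Gemini shape because that shape carries no write counter in the example above.

```python
def cache_counters(usage: dict) -> dict:
    """Normalize cache read/write token counts across the three usage shapes."""
    if "promptTokenCount" in usage:
        # Gemini usageMetadata: only cache reads are exposed.
        return {"read": usage.get("cachedContentTokenCount", 0), "write": 0}
    if "prompt_tokens" in usage:
        # OpenAI shape: reads live under prompt_tokens_details.
        read = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
    else:
        # Anthropic shape.
        read = usage.get("cache_read_input_tokens", 0)
    return {"read": read, "write": usage.get("cache_creation_input_tokens", 0)}


openai_usage = {
    "prompt_tokens": 2104, "completion_tokens": 147, "total_tokens": 2251,
    "prompt_tokens_details": {"cached_tokens": 1980},
    "cache_creation_input_tokens": 124,
}
anthropic_usage = {
    "input_tokens": 2104, "output_tokens": 147,
    "cache_read_input_tokens": 1980, "cache_creation_input_tokens": 124,
}
gemini_usage = {
    "promptTokenCount": 2104, "candidatesTokenCount": 147,
    "totalTokenCount": 2251, "cachedContentTokenCount": 1980,
}
```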

Where caching applies

Explicit cache_control markers are honored on /v1/messages and /v1/chat/completions for Claude models; place them on system or message content blocks. Many other providers (the OpenAI family, DeepSeek, Gemini) cache automatically: you send no markers and simply see cached_tokens in the response once a sufficiently long prefix is reused.

Cache duration: 5 minutes or 1 hour

A cached prefix lives for 5 minutes by default, and the timer is refreshed on every hit. For a longer-lived prefix, add ttl: "1h" to the marker. The response reports each TTL separately in cache_creation.

"cache_control": { "type": "ephemeral", "ttl": "1h" }

Example: write first, then read

Send the same request twice (the caching example above). The first call to see the prefix pays for a one-time cache write; identical calls within the TTL pay for a much cheaper cache read.

First call, cache write (usage fragment):

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 1980,
  "cache_read_input_tokens": 0
}

Second identical call within the TTL, cache read:

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 1980
}

Limits and cost

Claude requires a minimum cacheable prefix (about 1024 tokens; more for some models). Shorter prefixes are simply not cached.
Up to 4 cache breakpoints per request, and the cached prefix must be byte-for-byte identical between calls; changing even a single character misses the cache.
Cache writes cost more than normal input (5m ≈ 1.25×, 1h ≈ 2×); reads cost far less (≈ 0.1×). Per-model cache prices are listed on the pricing page.
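
To see when a cache write pays off, the approximate multipliers above can be plugged into a small calculation. This sketch works in units of "one normal input token" and assumes every call after the first hits the cache within the TTL; real prices vary by model, and prefix_cost is a hypothetical helper.

```python
def prefix_cost(prefix_tokens, calls, write_multiplier=1.25, read_multiplier=0.1):
    """Return (uncached, cached) prefix cost over `calls` identical requests.

    Costs are expressed in normal-input-token units: one write on the first
    call, then one read on each later call.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    uncached = float(prefix_tokens * calls)
    cached = prefix_tokens * (write_multiplier + read_multiplier * (calls - 1))
    return uncached, cached


single = prefix_cost(1980, calls=1)                       # write only, never reused
reused = prefix_cost(1980, calls=2)                       # 5m TTL, one reuse
hourly = prefix_cost(1980, calls=2, write_multiplier=2.0)  # 1h TTL, one reuse
```

Under these assumptions a 5-minute cache is already cheaper on the second call (about 1.35 prefix-units versus 2), while a 1-hour cache needs a third call to come out ahead (about 2.2 versus 3). A single call with no reuse is the expensive case in both.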

POST /v1/responses

OpenAI Responses API for stateful conversations. Same Bearer/x-api-key authentication. Cache counters appear as input_tokens_details.cached_tokens (reads) plus the flat cache_creation_input_tokens and cache_creation.ephemeral_* (writes), for parity with /v1/chat/completions.

POST https://api.airforce/v1/responses

POST /v1beta/models/{model}:generateContent

Google Gemini-compatible endpoint. Works with the official @google/genai SDK and the Gemini CLI by pointing the base URL at https://api.airforce/v1beta. Any routed model works — requests are translated to and from the native Gemini shape, and the model is taken from the URL path (not the body).

POST https://api.airforce/v1beta/models/{model}:generateContent

Authentication

Pass your Airforce API key in any of the three ways Google clients use:

# 1) query parameter (Google default)
?key=sk-air-YOUR_API_KEY

# 2) header
x-goog-api-key: sk-air-YOUR_API_KEY

# 3) bearer token
Authorization: Bearer sk-air-YOUR_API_KEY

Request body

Parameter	Type	Required	Description
contents	array	Required	Conversation turns. Each: { role: "user" | "model", parts: [...] }. A part is { text }, { functionCall: { name, args } }, or { functionResponse: { name, response } }. "model" is Gemini's term for the assistant role.
systemInstruction	object	Optional	System prompt: { parts: [{ text }] }.
generationConfig	object	Optional	{ temperature, maxOutputTokens, topP, stopSequences } — mapped to the canonical sampling parameters.
tools	array	Optional	Tool definitions: [{ functionDeclarations: [{ name, description, parameters }] }]. functionDeclarations are flattened across entries.
toolConfig	object	Optional	Tool-choice control: { functionCallingConfig: { mode: "AUTO" | "ANY" | "NONE" } }. ANY forces a call, NONE disables tools.

Example

curl "https://api.airforce/v1beta/models/gemini-3.1-pro:generateContent" \
  -H "x-goog-api-key: sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {"role": "user", "parts": [{"text": "What is the capital of France?"}]}
    ],
    "systemInstruction": {"parts": [{"text": "You are a helpful assistant."}]},
    "generationConfig": {"temperature": 0.7, "maxOutputTokens": 256}
  }'
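
When existing code already holds OpenAI-style messages, the Gemini request body can be derived from them. The sketch below covers text-only system, user and assistant messages and is not a full translation layer; to_gemini_request is a hypothetical helper.

```python
def to_gemini_request(messages, temperature=None, max_output_tokens=None):
    """Convert text-only OpenAI-style messages into a generateContent body."""
    body = {"contents": []}
    system_texts = []
    for message in messages:
        role = message["role"]
        if role == "system":
            system_texts.append(message["content"])
        elif role in ("user", "assistant"):
            # Gemini calls the assistant role "model".
            gemini_role = "model" if role == "assistant" else "user"
            body["contents"].append(
                {"role": gemini_role, "parts": [{"text": message["content"]}]}
            )
        else:
            raise ValueError(f"unsupported role in this sketch: {role}")
    if system_texts:
        body["systemInstruction"] = {"parts": [{"text": "\n".join(system_texts)}]}
    config = {}
    if temperature is not None:
        config["temperature"] = temperature
    if max_output_tokens is not None:
        config["maxOutputTokens"] = max_output_tokens
    if config:
        body["generationConfig"] = config
    return body


gemini_body = to_gemini_request(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.7,
    max_output_tokens=256,
)
```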

Response shape

Parameter	Type	Required	Description
candidates	array	Optional	Generated turns: [{ content: { role: "model", parts }, finishReason, index }]. Only the first candidate is populated.
candidates[].finishReason	string	Optional	"STOP" | "MAX_TOKENS" | "SAFETY" | "OTHER".
usageMetadata	object	Optional	{ promptTokenCount, candidatesTokenCount, totalTokenCount, cachedContentTokenCount? }. cachedContentTokenCount appears when the upstream reported a cache read.
modelVersion	string	Optional	Echo of the requested model.

{
  "candidates": [{
    "content": {
      "role": "model",
      "parts": [{"text": "The capital of France is Paris."}]
    },
    "finishReason": "STOP",
    "index": 0
  }],
  "usageMetadata": {
    "promptTokenCount": 16,
    "candidatesTokenCount": 8,
    "totalTokenCount": 24
  },
  "modelVersion": "gemini-3.1-pro"
}

POST /v1beta/models/{model}:streamGenerateContent

Streaming uses the :streamGenerateContent action and returns Server-Sent Events. Each data: line is a full Gemini-shaped chunk (not a delta object); the final chunk carries usageMetadata.

data: {"candidates":[{"content":{"role":"model","parts":[{"text":"The capital"}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[{"text":" is Paris."}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[]},"finishReason":"STOP","index":0}],"usageMetadata":{"promptTokenCount":16,"candidatesTokenCount":8,"totalTokenCount":24}}
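
A minimal sketch of reading this stream: it assumes, as in the sample above, that each chunk carries the next piece of text in its parts, and concatenates them in arrival order. collect_gemini_stream is a hypothetical name, and function-call parts are ignored.

```python
import json


def collect_gemini_stream(lines):
    """Accumulate text, finishReason and usageMetadata from Gemini-style SSE lines."""
    parts = []
    finish_reason = None
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue
        chunk = json.loads(line[len("data:"):].strip())
        usage = chunk.get("usageMetadata", usage)
        for candidate in chunk.get("candidates", []):
            finish_reason = candidate.get("finishReason", finish_reason)
            for part in candidate.get("content", {}).get("parts", []):
                parts.append(part.get("text", ""))
    return "".join(parts), finish_reason, usage


sample = [
    'data: {"candidates":[{"content":{"role":"model","parts":[{"text":"The capital"}]},"index":0}]}',
    "",
    'data: {"candidates":[{"content":{"role":"model","parts":[{"text":" is Paris."}]},"index":0}]}',
    "",
    'data: {"candidates":[{"content":{"role":"model","parts":[]},"finishReason":"STOP","index":0}],'
    '"usageMetadata":{"promptTokenCount":16,"candidatesTokenCount":8,"totalTokenCount":24}}',
]
text, finish_reason, usage = collect_gemini_stream(sample)
```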

List models

The catalog is also exposed in Gemini Model-resource shape so Google clients can enumerate models.

curl https://api.airforce/v1beta/models

Notes: the base URL is https://api.airforce/v1beta (or /v1), not Google's host. The model name comes from the URL path, not the request body. Only the first candidate is returned, and a subset of Gemini fields is translated — safetySettings and cachedContent are currently ignored. Billing, rate limits and smart routing apply exactly as on /v1/chat/completions.

Errors

Airforce returns standard HTTP status codes and a single error envelope for both endpoints.

Status	Type	Description
400	invalid_request_error	Malformed JSON, missing required field, or unknown model.
401	invalid_request_error / auth_required	API key is missing or invalid.
402	insufficient_quota	The model requires an active subscription or a positive Pay-as-you-Go balance.
403	model_access_denied / insufficient_scope	Plan-level or per-key permissions deny this request.
404	model_not_found	The requested model does not exist or you do not have access to it.
429	rate_limit_error	Request rate or daily token limit exceeded.
503	api_error / moderation_unavailable	All upstream keys for the requested provider failed.

{
  "error": {
    "message": "The requested model does not exist or you do not have access to it.",
    "type": "model_not_found",
    "param": null,
    "code": "404"
  }
}

The descriptive slug is in type. code is the HTTP status as a string (for example "404"), and param is null except for parameter range validation errors, where it names the offending parameter.
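
A small sketch of handling this envelope on the client. Treating 429 and 503 as retryable is an assumption of this example, not documented gateway guidance, and parse_error is a hypothetical helper; it falls back gracefully when the body is not the expected JSON.

```python
import json

# Assumption: rate limits and upstream failures are worth retrying with backoff.
RETRYABLE_STATUSES = {429, 503}


def parse_error(status: int, body: str) -> dict:
    """Decode the error envelope and flag whether a retry looks reasonable."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = None
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        error = {}
    return {
        "status": status,
        "type": error.get("type", "unknown"),
        "message": error.get("message", body),
        "param": error.get("param"),
        "retryable": status in RETRYABLE_STATUSES,
    }


not_found = parse_error(404, json.dumps({
    "error": {
        "message": "The requested model does not exist or you do not have access to it.",
        "type": "model_not_found",
        "param": None,
        "code": "404",
    }
}))
```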

Discover models

For the full list of model IDs and their capability flags (vision, tools, reasoning, caching, context length, and so on), see /docs/api/models.

curl https://api.airforce/v1/models \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY"
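
Once the catalog is fetched, capability flags can be used to pick models. The sketch below assumes an OpenAI-style list envelope ({"data": [...]}) with the supports_* flags on each entry; the envelope shape is an assumption, and the model IDs and flag values in the sample are made up for illustration.

```python
def models_with(catalog: dict, *flags: str) -> list:
    """Return the IDs of models whose listed capability flags are all true."""
    return [
        model["id"]
        for model in catalog.get("data", [])
        if all(model.get(flag) is True for flag in flags)
    ]


# Hypothetical catalog; real IDs and flags come from GET /v1/models.
catalog = {
    "data": [
        {"id": "example-model-a", "supports_vision": True, "supports_tools": True},
        {"id": "example-model-b", "supports_vision": False, "supports_tools": True},
        {"id": "example-model-c", "supports_reasoning": True},
    ]
}
vision_and_tools = models_with(catalog, "supports_vision", "supports_tools")
```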