API REFERENCE

Chat Completions

Generate chat responses for more than 100 models through a single API. Drop-in compatible with OpenAI Chat Completions, Anthropic Messages, and OpenAI Responses.

Airforce supports both the OpenAI Chat Completions and the Anthropic Messages wire formats across the same set of models. Pick the SDK you already use and change only the base URL; non-Claude models are passed through transparently under either surface.

This page covers authentication, request and response shapes for both surfaces, streaming, tool calling, vision, reasoning, and prompt caching. Just getting started? Begin with the basic example below, get a single call working, and then add streaming, tools, or caching once it works.

Authentication

Every request requires a Bearer token (your Airforce API key). The Anthropic x-api-key header is also accepted on /v1/messages for SDK compatibility.

Authorization: Bearer sk-air-YOUR_API_KEY
# alt for /v1/messages:
x-api-key: sk-air-YOUR_API_KEY

POST /v1/chat/completions

OpenAI-compatible Chat Completions. Works with the official openai SDK by replacing base_url with https://api.airforce/v1.

POST https://api.airforce/v1/chat/completions

Request body

Parameter	Type	Required	Description
model	string	Required	Model identifier. Use GET /v1/models to discover the available IDs.
messages	array	Required	Conversation history. Each entry is { role: "system" | "user" | "assistant" | "tool", content }. content is a string or an array of content blocks (vision, see below).
max_tokens	integer	Optional	Maximum number of tokens to generate. Capped at the model's max_output_tokens.
temperature	float	Optional	Sampling temperature, 0–2. Lower values are more deterministic. The default depends on the upstream provider.
top_p	float	Optional	Nucleus sampling. Use temperature or top_p, not both.
stream	boolean	Optional	When true, the response is a stream of server-sent events. See "Streaming" below.
models	array	Optional	Fallback models (max 3), e.g. ["deepseek-v3.2", "gpt-4o-mini"]. If every channel of the primary model fails, each candidate is tried in order. You are billed for — and response.model reports — the model that actually answered. Unknown or plan-gated candidates are skipped. With the OpenAI SDK pass it via extra_body.
transforms	array	Optional	Prompt transforms. Supported: ["middle-out"] — when the conversation overflows the model's context window, whole messages are dropped from the middle (system prompts, the first message and the most recent turns are kept), so long roleplay or agent histories keep working instead of erroring. Opt-in; off by default.
stream_options	object	Optional	{ include_usage: boolean }. Usage is always attached to the final stream chunk; this field is accepted for OpenAI compatibility, but usage reporting cannot be turned off.
stop	string | array	Optional	Up to 4 stop sequences. Generation stops as soon as one of them is produced.
tools	array	Optional	Function definitions the model may call. See "Tool calling" below.
tool_choice	string | object	Optional	"auto" (default), "none", or { type: "function", function: { name } } to force a specific call.
response_format	object	Optional	{ type: "json_object" } forces the model to emit valid JSON. Ignored for models that do not support it. See "Structured outputs" for the json_schema mode.
reasoning_effort	string	Optional	Reasoning depth: "low" | "medium" | "high" | "xhigh" | "max". Accepted on any model with supports_reasoning: true (Claude, OpenAI o/GPT-5, Gemini, Qwen, DeepSeek, …). See "Reasoning & thinking".
thinking	string | object	Optional	Cross-model thinking switch. "on" | "off" | "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" | "disabled" }. See "Reasoning & thinking".
thinking_budget	integer	Optional	Token limit for the model's reasoning trace (when the provider exposes one).
ignore_defaults	boolean	Optional	Skip your saved per-model default parameters (configured in the dashboard) for this request.
skill	string	Optional	ID of a single marketplace skill to apply to this request. The skill transforms your messages/parameters before the upstream call and overrides any installed-skill defaults. Consumed by Airforce, never forwarded upstream. See the Skills catalog at /docs/api/extend.
skills	array	Optional	Array of marketplace skill IDs applied in order, for stacking multiple skills on one request.

Basic example

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
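
The same call can be sketched in Python with only the standard library. This is a minimal illustration rather than an official client: the helper name build_chat_request is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.airforce/v1"  # base URL taken from this page


def build_chat_request(api_key: str, model: str, messages: list, **params) -> urllib.request.Request:
    """Build (but do not send) a POST request for /chat/completions."""
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "sk-air-YOUR_API_KEY",
    "gpt-5.1-chat",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=200,
    temperature=0.7,
)
# To send it: json.load(urllib.request.urlopen(req))
```

Optional fields from the request table are passed as keyword arguments and merged into the JSON body unchanged.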

Response shape

Parameter	Type	Required	Description
id	string	Optional	Stable completion identifier, e.g. "chatcmpl-abc123".
object	string	Optional	"chat.completion" for non-streamed responses, "chat.completion.chunk" for streamed ones.
created	integer	Optional	Unix timestamp (seconds).
model	string	Optional	Identifier of the model that answered. This normally echoes the requested ID; when a fallback answered, it reports that model instead.
choices	array	Optional	Array of completion candidates: [{ index, message: { role, content, tool_calls? }, finish_reason }].
choices[].finish_reason	string	Optional	"stop" | "length" | "tool_calls" | "content_filter".
usage	object	Optional	{ prompt_tokens, completion_tokens, total_tokens, completion_tokens_details?, prompt_tokens_details?, cache_creation_input_tokens?, cache_creation? }. completion_tokens_details.reasoning_tokens is set when the model produced a reasoning trace. Cache fields appear when the upstream returned prompt-caching information: prompt_tokens_details.cached_tokens reports cache reads (OpenAI standard), cache_creation_input_tokens aggregates writes, and cache_creation.ephemeral_5m_input_tokens / ephemeral_1h_input_tokens break writes down by TTL.

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "gpt-5.1-chat",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 8,
    "total_tokens": 28
  }
}
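
Reading the fields described above might look like the following sketch. The helper summarize_completion is a hypothetical name; it only touches fields listed in the response table and treats usage as optional.

```python
import json


def summarize_completion(response: dict) -> dict:
    """Pull the commonly needed fields out of a non-streamed completion."""
    choice = response["choices"][0]
    usage = response.get("usage") or {}
    return {
        "text": choice["message"].get("content"),
        "finish_reason": choice.get("finish_reason"),
        # "length" means the reply was cut off by max_tokens
        "truncated": choice.get("finish_reason") == "length",
        "total_tokens": usage.get("total_tokens"),
    }


sample = json.loads("""
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gpt-5.1-chat",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "The capital of France is Paris."},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 20, "completion_tokens": 8, "total_tokens": 28}
}
""")
summary = summarize_completion(sample)
```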

Reasoning & thinking

Reasoning/thinking is a cross-model feature for every model ID with supports_reasoning: true — Claude, OpenAI o-series/GPT-5, Gemini, Qwen, DeepSeek, and others. You send the same canonical parameters; Airforce maps them to each provider's native shape. This is not a DeepSeek-only API.

Source of truth: check for supports_reasoning: true on the model in GET /v1/models (or GET /api/models/{id}/allowed-params). Prefer that flag over guessing from the model name.

Canonical parameters

Parameter	Type	Required	Description
reasoning_effort	string	Optional	"low" | "medium" | "high" | "xhigh" | "max". Accepted on every model with supports_reasoning: true. Some upstreams only honour a subset (e.g. high/max); others clamp unsupported levels to the nearest served value.
thinking	string | object	Optional	Three accepted shapes (we normalise): "on" | "off" | "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" | "disabled" }. Mapped onto Claude extended thinking, OpenAI effort profiles, Gemini thinking_config, Qwen enable_thinking, DeepSeek hybrid, etc.
thinking_budget	integer	Optional	Maximum tokens the model may spend reasoning before emitting visible output. Mirrors budget_tokens when the upstream exposes a budget; takes precedence over reasoning_effort when both are sent and a budget is available.

What differs by family (mapping only)

Parameters are the same everywhere. Only how we map them (and how hard "off" is) differs:

Claude — Thinking on/off + budget; often also reasoning_effort via the gateway.
OpenAI (o1/o3, GPT-5) — Mainly reasoning_effort. A full "thinking off" is often not available — you control how strongly the model reasons, not always whether it reasons at all.
Gemini — thinking_config / budget mapped internally.
Qwen / Xiaomi / Alibaba — thinking + enable_thinking-style controls.
DeepSeek (generic) — Hybrid on/off is especially clear: thinking: { type: enabled|disabled } plus optional reasoning_effort.
Resellers / other — Often generic passthrough of the same canonical fields.

Controlling where the trace appears

An optional reasoning object on the request decides what happens to the thinking trace. It is consumed by Airforce and never forwarded upstream.

Parameter	Type	Required	Description
reasoning.format	string	Optional	"separate" (default) puts the trace in message.reasoning (and delta.reasoning while streaming). "inline" keeps the legacy inline <think>…</think> form inside content.
reasoning.exclude	boolean	Optional	When true, the reasoning trace is dropped entirely from the response. Reasoning tokens are still counted and billed if the model produced them.

"reasoning": { "format": "separate", "exclude": false }

Reasoning effort (OpenAI style)

Primary control for o-series and GPT-5: how much the model may reason. Same canonical field as on every other supports_reasoning model — OpenAI is included, but behaviour is not 1:1 with DeepSeek's hard on/off.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "o3-mini",
    "messages": [{"role": "user", "content": "Prove the Pythagorean theorem."}],
    "reasoning_effort": "high"
  }'

Extended thinking (Anthropic style)

Budget-based thinking for Claude (and gateways that accept the Anthropic shape). You can still send reasoning_effort; we map when the channel supports it.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "messages": [{"role": "user", "content": "Plan a 7-day Italy trip."}],
    "thinking": {"type": "enabled", "budget_tokens": 4000}
  }'

Hybrid thinking (e.g. DeepSeek V3.2/V4)

Example of a hybrid model family with a clear Thinking / Non-Thinking switch — not a separate protocol. deepseek-v3.2, deepseek-v4-flash and deepseek-v4-pro accept the same canonical fields as every other supports_reasoning model. Toggle thinking and optionally set effort in one request:

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Solve this step by step: integrate x^2 * e^x."}],
    "thinking": {"type": "enabled"},
    "reasoning_effort": "high"
  }'

Turn thinking off (faster, cheaper when you only need the final answer) — this hard off is clearer on hybrid models than on many OpenAI o-series profiles:

"thinking": {"type": "disabled"}
// or simply: "thinking": "off"

Native docs for this family often list effort levels such as "high" and "max". We accept the full low…max scale and map unsupported levels to the nearest value that reaches the model. Prefer the hybrid IDs above over retired deepseek-chat / deepseek-reasoner names when you need an explicit on/off switch.

The reasoning trace itself appears in choices[0].message.reasoning (OpenAI shape) or as thinking blocks in content (Anthropic shape). Reasoning tokens are billed and reported in usage.completion_tokens_details.reasoning_tokens.

The completion_tokens_details.reasoning_tokens breakdown is present only when the upstream provider reports it. In a streamed response the trace arrives in delta.reasoning_content on each chunk.
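
Because the trace lives in a different place on each surface, a client that talks to both may want one accessor. The sketch below is an assumption-laden illustration: extract_reasoning is a hypothetical helper, and the thinking text field inside an Anthropic thinking block follows Anthropic's Messages format, which this page does not spell out.

```python
from typing import Optional


def extract_reasoning(response: dict) -> Optional[str]:
    """Return the reasoning trace from either response shape, or None."""
    if "choices" in response:
        # OpenAI shape: trace sits next to content on the message
        return response["choices"][0]["message"].get("reasoning")
    # Anthropic shape: trace is carried by content blocks of type "thinking"
    parts = [
        block.get("thinking", "")  # assumed field name for the block text
        for block in response.get("content", [])
        if block.get("type") == "thinking"
    ]
    return "".join(parts) or None


openai_shape = {
    "choices": [{"message": {"role": "assistant", "content": "4", "reasoning": "2 + 2 = 4"}}]
}
anthropic_shape = {
    "content": [
        {"type": "thinking", "thinking": "2 + 2 = 4"},
        {"type": "text", "text": "4"},
    ]
}
```

If reasoning.exclude was set to true on the request, both branches simply return None.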

Vision and image input

Models with supports_vision: true accept images embedded as content blocks. Either a public URL or a base64 data URL works; size limits depend on the upstream model.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
      ]
    }]
  }'
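
For a local file, the base64 data URL variant can be built as in this sketch. image_block_from_bytes is a hypothetical helper, and the three bytes used below are just the JPEG magic number standing in for real image data.

```python
import base64


def image_block_from_bytes(data: bytes, mime_type: str = "image/jpeg") -> dict:
    """Wrap raw image bytes in an image_url content block using a data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


# In practice: data = open("cat.jpg", "rb").read()
block = image_block_from_bytes(b"\xff\xd8\xff", "image/jpeg")
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        block,
    ],
}
```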

Tool calling

Models with supports_tools: true can call functions that you define. The model returns a tool_calls array; you run the call and then send the result back in a tool message.

Request

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto"
  }'

Response with a tool call

{
  "id": "chatcmpl-abc123",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"Paris\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

Follow-up with tool results

{
  "model": "gpt-5.1-chat",
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}
      }]
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 14, \"sky\": \"cloudy\"}"}
  ]
}
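
The step between the two payloads above (run each call, then build the tool messages) can be sketched as follows. The registry, the run_tool_calls helper, and the canned get_weather implementation are all hypothetical; only the message shapes come from this page.

```python
import json


def get_weather(location: str) -> dict:
    # Hypothetical local implementation returning canned data
    return {"temp_c": 14, "sky": "cloudy"}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list:
    """Execute every tool call and return the matching tool messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = TOOL_REGISTRY[call["function"]["name"]]
        # arguments arrive as a JSON-encoded string, not as an object
        args = json.loads(call["function"]["arguments"])
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return results


assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"},
    }],
}
tool_messages = run_tool_calls(assistant_message)
```

Append assistant_message and then tool_messages to the conversation before sending the follow-up request, so each tool_call_id can be matched to its call.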

Assistant prefill

End your messages array with an assistant message that already contains some text, and the model continues from it instead of starting a fresh turn. This is a reliable way to force a response to begin a specific way — a leading "{" for JSON, a chosen language, or a fixed prefix. The same trick works on /v1/messages. Providers that reject native prefill are handled automatically: the gateway retries once with a compatible rewrite, so you do not have to special-case them.

{
  "model": "claude-sonnet-4.6",
  "messages": [
    {"role": "user", "content": "List three primary colors as a JSON array."},
    {"role": "assistant", "content": "["}
  ]
}

Structured outputs

Set response_format to make the model return JSON. Two modes are supported:

{ "type": "json_object" } — the response is a single valid JSON value.
{ "type": "json_schema", "json_schema": { "name", "schema", "strict" } } — the model is steered to produce JSON that matches your JSON Schema.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Extract the city and country: I live in Paris, France."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "location",
        "schema": {
          "type": "object",
          "properties": { "city": {"type": "string"}, "country": {"type": "string"} },
          "required": ["city", "country"]
        }
      }
    }
  }'

Reliability: even when a model wraps its answer in prose or a markdown code fence, Airforce extracts the JSON payload so you always receive parseable content. If no valid JSON can be recovered, the original text is returned unchanged — so the guarantee never makes a response worse. This applies to non-streamed responses; streamed responses are passed through unchanged.
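
Since streamed responses are passed through unchanged, a client that streams may want its own lenient parser for the assembled text. The sketch below is a client-side fallback under that assumption; it is not the gateway's extraction logic, and parse_json_lenient is a hypothetical name.

```python
import json
import re

TICKS = "`" * 3  # markdown code-fence marker, built indirectly to keep this example readable
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)


def parse_json_lenient(text: str):
    """Try the raw text, then a fenced block, then the outermost braces."""
    candidates = [text]
    match = FENCE.search(text)
    if match:
        candidates.append(match.group(1))
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None  # nothing parseable; the caller keeps the original text


wrapped = 'Sure! {"city": "Paris", "country": "France"} Hope that helps.'
fenced = TICKS + 'json\n{"city": "Paris"}\n' + TICKS
```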

Streaming

Set stream: true to receive partial completions as server-sent events. Each event is one JSON chunk with the same shape as the non-streamed response, except that message is replaced by delta. The stream ends with data: [DONE].

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Write a haiku about Berlin."}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'

Wire format

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"Cold "},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"stone "},"finish_reason":null}]}

…

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":17,"total_tokens":29}}

data: [DONE]
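
Consuming this wire format comes down to reading data: lines until [DONE] and concatenating delta.content. A minimal sketch, assuming the lines have already been read from the HTTP response (collect_stream is a hypothetical helper, and the sample chunks are shortened versions of the ones above):

```python
import json


def collect_stream(lines):
    """Accumulate text, finish_reason and usage from OpenAI-style SSE lines."""
    text_parts = []
    finish_reason = None
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text_parts.append(choice.get("delta", {}).get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
        usage = chunk.get("usage") or usage  # carried by the last chunk
    return "".join(text_parts), finish_reason, usage


SAMPLE = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Cold "},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"stone "},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],'
    '"usage":{"prompt_tokens":12,"completion_tokens":17,"total_tokens":29}}',
    "",
    "data: [DONE]",
]
text, finish_reason, usage = collect_stream(SAMPLE)
```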

Reliability & smart routing

Every model ID resolves to a pool of upstream providers behind the scenes. If the first one errors or times out, the request is automatically retried against the next provider for the same model, in order, before any failure is returned — you do not configure or trigger this. The model field in the response always reports the variant that actually answered. This is independent of the optional models / fallbacks array, which adds your own cross-model candidates on top: first the primary model exhausts its own provider chain, then each fallback model exhausts its chain.

POST /v1/messages

Anthropic-compatible Messages API. Works with the official @anthropic-ai/sdk by setting baseURL to https://api.airforce. Requests for non-Claude models are passed through transparently to OpenAI, Google, and other providers.

POST https://api.airforce/v1/messages

Request body

Parameter	Type	Required	Description
model	string	Required	Model identifier (Anthropic-format alias or routed ID).
messages	array	Required	Each entry: { role: "user" | "assistant", content: string | array }.
max_tokens	integer	Required	Required by Anthropic. Token limit for the response.
system	string | array	Optional	System prompt. Pass an array of { type: "text", text, cache_control? } blocks to mark cacheable prefix segments. See "Prompt caching".
temperature	float	Optional	0–1.
top_p	float	Optional	Nucleus sampling.
top_k	integer	Optional	Restrict the sampling pool to the k most likely tokens.
stop_sequences	array	Optional	Up to 4 stop sequences.
stream	boolean	Optional	When true, emits an Anthropic-style SSE event stream (see "Streaming events" below).
fallbacks	array	Optional	Fallback models (max 3) in Anthropic form: [{"model": "gpt-4o-mini"}]. If every channel of the primary model fails, each candidate is tried in order; you are billed for — and the response model field reports — the model that actually answered. A plain models string array is accepted too.
tools	array	Optional	Anthropic tool definitions: { name, description, input_schema }. The response may contain tool_use content blocks.
tool_choice	object	Optional	{ type: "auto" | "any" | "tool", name? }.
thinking	object	Optional	Anthropic extended thinking: { type: "enabled", budget_tokens: N }.

Example

curl https://api.airforce/v1/messages \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "max_tokens": 256,
    "system": "You are a helpful assistant.",
    "messages": [
      {"role": "user", "content": "Hello, Claude!"}
    ]
  }'

Response shape

Parameter	Type	Required	Description
id	string	Optional	Message identifier, e.g. "msg_01ABCxyz".
type	string	Optional	Always "message".
role	string	Optional	Always "assistant".
content	array	Optional	Array of content blocks: { type: "text" | "tool_use" | "thinking", … }.
model	string	Optional	Identifier of the model that answered. This normally echoes the requested model; when a fallback answered, it reports that model instead.
stop_reason	string	Optional	"end_turn" | "max_tokens" | "stop_sequence" | "tool_use".
usage	object	Optional	{ input_tokens, output_tokens, cache_read_input_tokens?, cache_creation_input_tokens?, cache_creation? }. Cache fields appear when prompt caching was used. cache_creation.ephemeral_5m_input_tokens and ephemeral_1h_input_tokens break writes down by TTL.

Streaming events

Anthropic SSE uses named events instead of bare JSON chunks. Each event has both an event: name and a data: JSON payload.

event: message_start
data: {"type":"message_start","message":{"id":"msg_01","role":"assistant","content":[],"model":"claude-sonnet-4.6","stop_reason":null,"usage":{"input_tokens":12,"output_tokens":1}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}

event: message_stop
data: {"type":"message_stop"}
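
A client can rebuild the message from these events by switching on the type field of each data: payload. The following sketch assumes the lines are already available as strings and handles only text deltas; collect_anthropic_stream is a hypothetical helper, and tool_use or thinking deltas would need extra branches.

```python
import json


def collect_anthropic_stream(lines):
    """Accumulate text, stop_reason and output_tokens from Anthropic-style SSE."""
    text_parts, stop_reason, output_tokens = [], None, None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # event: lines repeat the payload's "type", so skip them
        event = json.loads(line[len("data:"):])
        kind = event.get("type")
        if kind == "content_block_delta" and event["delta"].get("type") == "text_delta":
            text_parts.append(event["delta"]["text"])
        elif kind == "message_delta":
            stop_reason = event["delta"].get("stop_reason", stop_reason)
            output_tokens = event.get("usage", {}).get("output_tokens", output_tokens)
        elif kind == "message_stop":
            break
    return "".join(text_parts), stop_reason, output_tokens


SAMPLE_EVENTS = [
    "event: content_block_start",
    'data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}',
    "event: content_block_delta",
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}',
    "event: content_block_stop",
    'data: {"type":"content_block_stop","index":0}',
    "event: message_delta",
    'data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}',
    "event: message_stop",
    'data: {"type":"message_stop"}',
]
reply_text, stop_reason, output_tokens = collect_anthropic_stream(SAMPLE_EVENTS)
```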

POST /v1/messages/count_tokens

Anthropic-compatible token counting. Send the same system / messages / tools you would pass to /v1/messages and get an input-token estimate back without running the model — nothing is billed.

POST https://api.airforce/v1/messages/count_tokens

curl https://api.airforce/v1/messages/count_tokens \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "system": "You are a helpful assistant.",
    "messages": [{"role": "user", "content": "Hello, Claude!"}]
  }'

# → {"input_tokens": 34}

The count is a fast character-based estimate (about 4 characters per token) over system, messages and tools — close enough for context-budget checks, not an exact tokenizer run.

Prompt caching

On /v1/messages with Claude models, mark a prefix as cacheable by passing system as an array of blocks in which the cacheable segment carries cache_control: { type: "ephemeral" }. Subsequent requests that start with the same prefix are charged the lower cache-read rate. Models with supports_caching: true in /v1/models support this.

Write vs read pricing

Cache writes are typically charged slightly above normal input (about 1.25× on Claude-family models). Cache reads are much cheaper (about 0.1× input). A large write with almost no later read is the expensive case — not a “cache discount”. Only reusing the same prefix turns the write into savings.

Tools like Claude Code often attach a large project context with cache markers on the first turns. Expect cache-write spend while the repo/system prefix is loaded; later turns only get cheap if that prefix is stable and reused. Subagents and multi-step agents can multiply large contexts across several requests.

{
  "model": "claude-sonnet-4.6",
  "max_tokens": 1024,
  "system": [
    {"type": "text", "text": "You are a senior staff engineer at Airforce."},
    {
      "type": "text",
      "text": "<repository-snapshot>...</repository-snapshot>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Where is rate limiting enforced?"}
  ]
}

How cache counters are reported in the response

Cache token counters are passed through in each format's native form, so the SDKs (openai, @anthropic-ai/sdk, @google/genai) read them without custom code. Fields are omitted when their value is zero, which keeps uncached responses lean.

/v1/chat/completions (OpenAI shape)

"usage": {
  "prompt_tokens": 2104,
  "completion_tokens": 147,
  "total_tokens": 2251,
  "prompt_tokens_details": { "cached_tokens": 1980 },
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1/messages (Anthropic shape)

"usage": {
  "input_tokens": 2104,
  "output_tokens": 147,
  "cache_read_input_tokens": 1980,
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1beta/.../generateContent (Gemini shape)

"usageMetadata": {
  "promptTokenCount": 2104,
  "candidatesTokenCount": 147,
  "totalTokenCount": 2251,
  "cachedContentTokenCount": 1980
}
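
Code that logs cache activity across all three surfaces can normalise these shapes into one pair of numbers. A sketch, assuming shape detection by the presence of the prompt-count field (cache_counters is a hypothetical helper; the Gemini shape shown on this page carries no write counter, so writes are reported as 0 there):

```python
def cache_counters(usage: dict) -> dict:
    """Map OpenAI, Anthropic or Gemini usage objects to {read, written}."""
    if "promptTokenCount" in usage:  # Gemini usageMetadata
        read = usage.get("cachedContentTokenCount", 0)
        written = 0
    elif "prompt_tokens" in usage:  # OpenAI usage
        read = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
        written = usage.get("cache_creation_input_tokens", 0)
    else:  # Anthropic usage
        read = usage.get("cache_read_input_tokens", 0)
        written = usage.get("cache_creation_input_tokens", 0)
    return {"read": read, "written": written}


openai_usage = {
    "prompt_tokens": 2104, "completion_tokens": 147, "total_tokens": 2251,
    "prompt_tokens_details": {"cached_tokens": 1980},
    "cache_creation_input_tokens": 124,
}
anthropic_usage = {
    "input_tokens": 2104, "output_tokens": 147,
    "cache_read_input_tokens": 1980, "cache_creation_input_tokens": 124,
}
gemini_usage = {
    "promptTokenCount": 2104, "candidatesTokenCount": 147,
    "totalTokenCount": 2251, "cachedContentTokenCount": 1980,
}
```

Missing fields default to 0, which matches the rule that zero-valued cache fields are omitted.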

Where caching works

Explicit cache_control markers are honored on /v1/messages and /v1/chat/completions for Claude models; place them on system or message content blocks. Many other providers (the OpenAI family, DeepSeek, Gemini) cache automatically: you send no markers and simply see cached_tokens in the response once a sufficiently long prefix is reused.

Cache lifetime: 5 minutes or 1 hour

A cached prefix lives for 5 minutes by default, and the timer is refreshed on every hit. To keep the prefix alive longer, add ttl: "1h" to the marker. The response reports each TTL separately in cache_creation.

"cache_control": { "type": "ephemeral", "ttl": "1h" }

Example: write first, then read

Send exactly the same request twice (the cache example above). The first call that sees the prefix pays a one-time cache write; identical calls within the TTL pay the much cheaper cache read.

First call, cache write (usage fragment):

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 1980,
  "cache_read_input_tokens": 0
}

Second identical call within the TTL, cache read:

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 1980
}

Limits and cost

Claude requires a minimum cacheable prefix (about 1024 tokens; more on some models). Shorter prefixes are simply not cached.
Up to 4 cache breakpoints per request, and the cached prefix must be byte-for-byte identical between calls; changing even a single character misses the cache.
Cache writes cost more than regular input (5m ≈ 1.25×, 1h ≈ 2×); reads cost much less (≈ 0.1×). Per-model cache prices are listed on the pricing page.
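
The multipliers above make the break-even point easy to check. The sketch below expresses the cost of a shared prefix in units of one uncached read of that prefix; it assumes every call lands within the TTL and uses the approximate multipliers quoted on this page, not actual per-model prices. relative_prefix_cost is a hypothetical helper.

```python
def relative_prefix_cost(requests: int, write_multiplier: float = 1.25,
                         read_multiplier: float = 0.1):
    """Return (uncached, cached) cost of the prefix over `requests` calls."""
    uncached = float(requests)  # every call pays full input price
    # first call writes the cache, the remaining calls read it
    cached = write_multiplier + (requests - 1) * read_multiplier
    return uncached, cached


# 5-minute TTL (write about 1.25x): one call costs more, two calls already save
single_call = relative_prefix_cost(1)
two_calls = relative_prefix_cost(2)
# 1-hour TTL (write about 2x): the saving starts at the third call
two_calls_1h = relative_prefix_cost(2, write_multiplier=2.0)
three_calls_1h = relative_prefix_cost(3, write_multiplier=2.0)
```

With the 5-minute TTL, a single call pays 1.25 units instead of 1, while two calls pay about 1.35 instead of 2. With the 1-hour TTL, two calls still cost slightly more than uncached (about 2.1 versus 2), and the saving appears from the third call on.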

POST /v1/responses

OpenAI Responses API surface for stateful conversations. It uses the same Bearer/x-api-key authentication. Cache counters appear as input_tokens_details.cached_tokens (reads) plus the flat cache_creation_input_tokens and cache_creation.ephemeral_* fields (writes), for parity with /v1/chat/completions.

POST https://api.airforce/v1/responses

POST /v1beta/models/{model}:generateContent

Google Gemini-compatible endpoint. Works with the official @google/genai SDK and the Gemini CLI by pointing the base URL at https://api.airforce/v1beta. Any routed model works — requests are translated to and from the native Gemini shape, and the model is taken from the URL path (not the body).

POST https://api.airforce/v1beta/models/{model}:generateContent

Authentication

Pass your Airforce API key any of the three ways Google clients use:

# 1) query parameter (Google default)
?key=sk-air-YOUR_API_KEY

# 2) header
x-goog-api-key: sk-air-YOUR_API_KEY

# 3) bearer token
Authorization: Bearer sk-air-YOUR_API_KEY

Request body

Parameter	Type	Required	Description
contents	array	Required	Conversation turns. Each: { role: "user" | "model", parts: [...] }. A part is { text }, { functionCall: { name, args } }, or { functionResponse: { name, response } }. "model" is Gemini's term for the assistant role.
systemInstruction	object	Optional	System prompt: { parts: [{ text }] }.
generationConfig	object	Optional	{ temperature, maxOutputTokens, topP, stopSequences } — mapped to the canonical sampling parameters.
tools	array	Optional	Tool definitions: [{ functionDeclarations: [{ name, description, parameters }] }]. functionDeclarations are flattened across entries.
toolConfig	object	Optional	Tool-choice control: { functionCallingConfig: { mode: "AUTO" | "ANY" | "NONE" } }. ANY forces a call, NONE disables tools.

Example

curl "https://api.airforce/v1beta/models/gemini-3.1-pro:generateContent" \
  -H "x-goog-api-key: sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {"role": "user", "parts": [{"text": "What is the capital of France?"}]}
    ],
    "systemInstruction": {"parts": [{"text": "You are a helpful assistant."}]},
    "generationConfig": {"temperature": 0.7, "maxOutputTokens": 256}
  }'
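
If an application already holds its history as OpenAI-style messages, the body above can be derived from them. This is a client-side sketch under simplifying assumptions: content is a plain string, tool messages are not handled, and to_gemini_body is a hypothetical helper, not the gateway's translation layer.

```python
def to_gemini_body(messages: list, temperature=None, max_output_tokens=None) -> dict:
    """Convert OpenAI-style text messages into a generateContent body."""
    system_texts = [m["content"] for m in messages if m["role"] == "system"]
    contents = [
        {
            # Gemini calls the assistant role "model"
            "role": "model" if m["role"] == "assistant" else "user",
            "parts": [{"text": m["content"]}],
        }
        for m in messages
        if m["role"] in ("user", "assistant")
    ]
    body = {"contents": contents}
    if system_texts:
        body["systemInstruction"] = {"parts": [{"text": "\n".join(system_texts)}]}
    config = {}
    if temperature is not None:
        config["temperature"] = temperature
    if max_output_tokens is not None:
        config["maxOutputTokens"] = max_output_tokens
    if config:
        body["generationConfig"] = config
    return body


body = to_gemini_body(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.7,
    max_output_tokens=256,
)
```

The model name is deliberately absent from the body, because this endpoint takes it from the URL path.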

Response shape

Parameter	Type	Required	Description
candidates	array	Optional	Generated turns: [{ content: { role: "model", parts }, finishReason, index }]. Only the first candidate is populated.
candidates[].finishReason	string	Optional	"STOP" | "MAX_TOKENS" | "SAFETY" | "OTHER".
usageMetadata	object	Optional	{ promptTokenCount, candidatesTokenCount, totalTokenCount, cachedContentTokenCount? }. cachedContentTokenCount appears when the upstream reported a cache read.
modelVersion	string	Optional	Echo of the requested model.

{
  "candidates": [{
    "content": {
      "role": "model",
      "parts": [{"text": "The capital of France is Paris."}]
    },
    "finishReason": "STOP",
    "index": 0
  }],
  "usageMetadata": {
    "promptTokenCount": 16,
    "candidatesTokenCount": 8,
    "totalTokenCount": 24
  },
  "modelVersion": "gemini-3.1-pro"
}

POST /v1beta/models/{model}:streamGenerateContent

Streaming uses the :streamGenerateContent action and returns Server-Sent Events. Each data: line is a full Gemini-shaped chunk (not a delta object); the final chunk carries usageMetadata.

data: {"candidates":[{"content":{"role":"model","parts":[{"text":"The capital"}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[{"text":" is Paris."}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[]},"finishReason":"STOP","index":0}],"usageMetadata":{"promptTokenCount":16,"candidatesTokenCount":8,"totalTokenCount":24}}

List models

The catalog is also exposed in Gemini Model-resource shape so Google clients can enumerate models.

curl https://api.airforce/v1beta/models

Notes: the base URL is https://api.airforce/v1beta (or /v1), not Google's host. The model name comes from the URL path, not the request body. Only the first candidate is returned, and a subset of Gemini fields is translated — safetySettings and cachedContent are currently ignored. Billing, rate limits and smart routing apply exactly as on /v1/chat/completions.

Errors

Airforce returns standard HTTP status codes and a uniform error envelope for both endpoints.

Status	Type	Description
400	invalid_request_error	Malformed JSON, missing required field, unknown model.
401	invalid_request_error / auth_required	Missing or invalid API key.
402	insufficient_quota	The model requires an active subscription or a positive Pay-as-you-Go balance.
403	model_access_denied / insufficient_scope	Plan or key permissions reject this request.
404	model_not_found	The requested model does not exist or you do not have access to it.
429	rate_limit_error	Request rate or daily token limit exceeded.
503	api_error / moderation_unavailable	All upstream keys for the requested provider failed.

{
  "error": {
    "message": "The requested model does not exist or you do not have access to it.",
    "type": "model_not_found",
    "param": null,
    "code": "404"
  }
}

The descriptive slug is in type. code is the HTTP status as a string (e.g. "404"), and param is null except for parameter-range validation errors, where it names the invalid parameter.
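
Client code usually wants the envelope as a typed value plus a hint about whether a retry makes sense. In the sketch below, ApiError and parse_error are hypothetical names, and treating 429 and 503 as retryable is an assumption based on the status descriptions, not a documented policy.

```python
import json
from dataclasses import dataclass
from typing import Optional

RETRYABLE_STATUSES = {429, 503}  # assumption: rate limits and upstream failures


@dataclass
class ApiError:
    status: int
    type: str
    message: str
    param: Optional[str]

    @property
    def retryable(self) -> bool:
        return self.status in RETRYABLE_STATUSES


def parse_error(status: int, body: str) -> ApiError:
    """Turn an error response body into an ApiError."""
    err = json.loads(body).get("error", {})
    return ApiError(
        status=status,
        type=err.get("type", "unknown"),
        message=err.get("message", ""),
        param=err.get("param"),
    )


error = parse_error(404, """
{"error": {"message": "The requested model does not exist or you do not have access to it.",
           "type": "model_not_found", "param": null, "code": "404"}}
""")
```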

Discover models

See the full list of model identifiers and their capability flags (vision, tools, reasoning, caching, context length, …) at /docs/api/models.

curl https://api.airforce/v1/models \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY"