API REFERENCE

Chat Completions

Generate chat responses across 100+ models from a single API. Drop-in compatible with OpenAI Chat Completions, Anthropic Messages, and OpenAI Responses.

Airforce speaks both the OpenAI Chat Completions and the Anthropic Messages wire formats over the same models. Pick the SDK you already use and change only the base URL; non-Claude models are routed transparently behind both surfaces.

This page covers authentication, the request and response formats of both surfaces, streaming, tool calling, vision, reasoning, and prompt caching. New here? Start with the basic example below, get one call working, then add streaming, tools, or caching.

Authentication

Every request needs a Bearer token (your Airforce API key). For SDK compatibility, the Anthropic x-api-key header is also accepted on /v1/messages.

Authorization: Bearer sk-air-YOUR_API_KEY
# alt for /v1/messages:
x-api-key: sk-air-YOUR_API_KEY
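
The two header styles above can be sketched in Python as follows. The helper name auth_headers and its surface argument are illustrative only and not part of any SDK.

```python
def auth_headers(api_key: str, surface: str = "openai") -> dict:
    """Build request headers for one of the two auth styles described above."""
    headers = {"Content-Type": "application/json"}
    if surface == "anthropic":
        # Alternative accepted on /v1/messages for Anthropic SDK compatibility.
        headers["x-api-key"] = api_key
    else:
        # Bearer token: the style accepted on every request.
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


print(auth_headers("sk-air-YOUR_API_KEY"))
print(auth_headers("sk-air-YOUR_API_KEY", surface="anthropic"))
```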

POST /v1/chat/completions

OpenAI-compatible Chat Completions. Works with the official openai SDK when you set base_url to https://api.airforce/v1.

POST https://api.airforce/v1/chat/completions

Request body

Parameter	Type	Required	Description
model	string	Required	Model ID. Use GET /v1/models to find available IDs.
messages	array	Required	Conversation history. Each entry has { role: "system" \| "user" \| "assistant" \| "tool", content }. content is a string or an array of content blocks (vision, see below).
max_tokens	integer	Optional	Maximum number of tokens to generate. Capped by the model's max_output_tokens.
temperature	float	Optional	Sampling temperature, 0–2. Lower is more deterministic. The default depends on the upstream provider.
top_p	float	Optional	Nucleus sampling. Use either temperature or top_p, not both.
stream	boolean	Optional	When true, the response is a stream of Server-Sent Events. See "Streaming" below.
models	array	Optional	Fallback models (max 3), e.g. ["deepseek-v3.2", "gpt-4o-mini"]. If every channel of the primary model fails, each candidate is tried in order. You are billed for — and response.model reports — the model that actually answered. Unknown or plan-gated candidates are skipped. With the OpenAI SDK pass it via extra_body.
transforms	array	Optional	Prompt transforms. Supported: ["middle-out"] — when the conversation overflows the model's context window, whole messages are dropped from the middle (system prompts, the first message and the most recent turns are kept), so long roleplay or agent histories keep working instead of erroring. Opt-in; off by default.
stream_options	object	Optional	{ include_usage: boolean }. Usage is always sent in the final streaming chunk; this field is accepted for OpenAI compatibility, but usage reporting cannot be turned off.
stop	string \| array	Optional	Up to 4 stop sequences. Generation stops as soon as one is produced.
tools	array	Optional	Function definitions the model may call. See "Tool calling" below.
tool_choice	string \| object	Optional	"auto" (default), "none", or { type: "function", function: { name } } to force a specific call.
response_format	object	Optional	{ type: "json_object" } forces the model to emit valid JSON; a json_schema mode is described under "Structured outputs". Ignored for models that do not support it.
reasoning_effort	string	Optional	Reasoning depth: "low" \| "medium" \| "high" \| "xhigh" \| "max". Any model with supports_reasoning: true (Claude, OpenAI o/GPT-5, Gemini, Qwen, DeepSeek, …). See "Reasoning & thinking".
thinking	string \| object	Optional	Cross-model thinking switch. "on" \| "off" \| "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" \| "disabled" }. See "Reasoning & thinking".
thinking_budget	integer	Optional	Token limit for the model's reasoning trace (when the provider exposes one).
ignore_defaults	boolean	Optional	Skips the per-model default parameters saved in the dashboard for this request.
skill	string	Optional	ID of a single marketplace skill to apply to this request. The skill transforms your messages/parameters before the upstream call and overrides any installed-skill defaults. Consumed by Airforce, never forwarded upstream. See the Skills catalog at /docs/api/extend.
skills	array	Optional	Array of marketplace skill IDs applied in order, for stacking multiple skills on one request.
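
As a rough illustration of the models and transforms rows above, the sketch below assembles a request body in Python. The helper build_payload and its arguments are hypothetical; only the field names, the limit of 3 fallbacks, and the "middle-out" value come from the table.

```python
import json

MAX_FALLBACKS = 3  # limit stated in the models row above


def build_payload(model, messages, fallbacks=(), middle_out=False):
    """Assemble a /v1/chat/completions body with optional fallbacks and transforms."""
    if len(fallbacks) > MAX_FALLBACKS:
        raise ValueError(f"at most {MAX_FALLBACKS} fallback models are accepted")
    payload = {"model": model, "messages": list(messages)}
    if fallbacks:
        payload["models"] = list(fallbacks)
    if middle_out:
        payload["transforms"] = ["middle-out"]  # opt-in, off by default
    return payload


payload = build_payload(
    "gpt-5.1-chat",
    [{"role": "user", "content": "Hello"}],
    fallbacks=["deepseek-v3.2", "gpt-4o-mini"],
    middle_out=True,
)
print(json.dumps(payload, indent=2))
```

With the OpenAI SDK, the non-standard keys (models, transforms) would go into extra_body rather than the top-level call arguments, as the table notes.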

Basic example

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'

Response format

Parameter	Type	Required	Description
id	string	Optional	Stable completion ID, e.g. "chatcmpl-abc123".
object	string	Optional	"chat.completion" for non-streamed responses, "chat.completion.chunk" for streamed ones.
created	integer	Optional	Unix timestamp (seconds).
model	string	Optional	ID of the model that answered: the requested model ID, or the fallback that took over when models is set.
choices	array	Optional	Array of completion candidates: [{ index, message: { role, content, tool_calls? }, finish_reason }].
choices[].finish_reason	string	Optional	"stop" \| "length" \| "tool_calls" \| "content_filter".
usage	object	Optional	{ prompt_tokens, completion_tokens, total_tokens, completion_tokens_details?, prompt_tokens_details?, cache_creation_input_tokens?, cache_creation? }. completion_tokens_details.reasoning_tokens is set when the model produced a reasoning trace. Cache fields appear when the upstream returned prompt-caching information: prompt_tokens_details.cached_tokens reports cache reads (OpenAI standard), cache_creation_input_tokens aggregates writes, and cache_creation.ephemeral_5m_input_tokens / ephemeral_1h_input_tokens give the split per TTL.

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "gpt-5.1-chat",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 8,
    "total_tokens": 28
  }
}
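
For readers not using an SDK, here is a minimal standard-library sketch of the same round trip in Python. The helper names build_request and first_answer are illustrative; the request is only constructed here, and sending it is left as a comment.

```python
import json
import urllib.request

BASE_URL = "https://api.airforce/v1"


def build_request(api_key, payload):
    """Prepare (but do not send) a POST to /v1/chat/completions."""
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def first_answer(response):
    """Return (text, finish_reason, total_tokens) from a non-streamed response."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return choice["message"]["content"], choice["finish_reason"], usage.get("total_tokens")


# To send for real: json.load(urllib.request.urlopen(build_request(key, payload)))
sample = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "model": "gpt-5.1-chat",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "The capital of France is Paris."},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 20, "completion_tokens": 8, "total_tokens": 28},
}
print(first_answer(sample))
```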

Reasoning & Thinking

Reasoning/thinking is a cross-model feature for every model ID with supports_reasoning: true: Claude, OpenAI o-series/GPT-5, Gemini, Qwen, DeepSeek, and others. You send the same canonical parameters; Airforce maps them onto each provider's native form. This is not a DeepSeek-only API.

Source of truth: check for supports_reasoning: true on a model in GET /v1/models (or GET /api/models/{id}/allowed-params). Prefer this flag over guessing from the model name.

Canonical parameters

Parameter	Type	Required	Description
reasoning_effort	string	Optional	"low" \| "medium" \| "high" \| "xhigh" \| "max". Accepted on every model with supports_reasoning: true. Some upstreams only honour a subset (e.g. high/max); others clamp unsupported levels to the nearest served value.
thinking	string \| object	Optional	Three accepted shapes (we normalise): "on" \| "off" \| "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" \| "disabled" }. Mapped onto Claude extended thinking, OpenAI effort profiles, Gemini thinking_config, Qwen enable_thinking, DeepSeek hybrid, etc.
thinking_budget	integer	Optional	Maximum tokens the model may spend reasoning before emitting visible output. Mirrors budget_tokens when the upstream exposes a budget; takes precedence over reasoning_effort when both are sent and a budget is available.

What differs per family (mapping only)

The parameters are the same everywhere. Only how we map them (and how hard "off" is) differs:

Claude: thinking on/off plus budget; often reasoning_effort as well via the gateway.
OpenAI (o1/o3, GPT-5): mainly reasoning_effort. A true "thinking off" is often unavailable; you control how much the model thinks, not always whether it thinks at all.
Gemini: thinking_config / budget, mapped internally.
Qwen / Xiaomi / Alibaba: thinking plus enable_thinking-style controls.
DeepSeek (generic): hybrid on/off is especially clear: thinking: { type: "enabled" | "disabled" } plus optional reasoning_effort.
Resellers / others: often a generic passthrough of the same canonical fields.

Controlling where the trace appears

An optional reasoning object on the request decides what happens to the thinking trace. It is consumed by Airforce and never forwarded upstream.

Parameter	Type	Required	Description
reasoning.format	string	Optional	"separate" (default) puts the trace in message.reasoning (and delta.reasoning while streaming). "inline" keeps the legacy inline <think>…</think> form inside content.
reasoning.exclude	boolean	Optional	When true, the reasoning trace is dropped entirely from the response. Reasoning tokens are still counted and billed if the model produced them.

"reasoning": { "format": "separate", "exclude": false }

Reasoning effort (OpenAI style)

The primary control for the o-series and GPT-5: how much the model is allowed to think. It is the same canonical field as on every supports_reasoning model; OpenAI is included, but it does not behave 1:1 like DeepSeek's hard on/off.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "o3-mini",
    "messages": [{"role": "user", "content": "Prove the Pythagorean theorem."}],
    "reasoning_effort": "high"
  }'

Extended thinking (Anthropic style)

Budget-based thinking for Claude (and gateways with the Anthropic shape). You can still send reasoning_effort; we map it when the channel supports it.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "messages": [{"role": "user", "content": "Plan a 7-day Italy trip."}],
    "thinking": {"type": "enabled", "budget_tokens": 4000}
  }'

Hybrid thinking (e.g. DeepSeek V3.2/V4)

An example of a hybrid family with a clear thinking/non-thinking switch, not a separate protocol. deepseek-v3.2, deepseek-v4-flash, and deepseek-v4-pro accept the same canonical fields as any other supports_reasoning model. Toggle thinking and optionally set effort in one request:

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Solve this step by step: integrate x^2 * e^x."}],
    "thinking": {"type": "enabled"},
    "reasoning_effort": "high"
  }'

Thinking off (faster and cheaper when only the final answer matters). This hard off is clearer on hybrid models than on many OpenAI o-series profiles:

"thinking": {"type": "disabled"}
// or simply: "thinking": "off"

Native docs for this family often list effort levels such as "high" and "max". We accept the full low…max scale and map unsupported levels to the nearest value the model serves. Prefer the hybrid IDs above over the retired deepseek-chat / deepseek-reasoner when you need an explicit on/off.

The reasoning trace itself appears in choices[0].message.reasoning (OpenAI format) or as thinking blocks in content (Anthropic format). Reasoning tokens are billed and reported in usage.completion_tokens_details.reasoning_tokens.

This completion_tokens_details.reasoning_tokens breakdown is only present when the upstream provider reports it. In a streamed response, the trace arrives chunk by chunk via delta.reasoning_content.
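
A client that has to cope with both trace placements (the separate message.reasoning field and the legacy inline <think>…</think> form) could separate trace and answer roughly as below. This is a Python sketch for non-streamed OpenAI-format messages; the helper name split_reasoning is hypothetical.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(message):
    """Return (reasoning, answer) for either placement of the trace."""
    content = message.get("content") or ""
    if message.get("reasoning") is not None:
        # "separate" format: the trace has its own field.
        return message["reasoning"], content
    match = THINK_RE.search(content)
    if match:
        # "inline" format: the trace is wrapped in <think>...</think> inside content.
        answer = THINK_RE.sub("", content, count=1).strip()
        return match.group(1).strip(), answer
    # No trace present, for example when it was excluded from the response.
    return None, content


print(split_reasoning({"role": "assistant", "content": "4", "reasoning": "2 + 2 = 4"}))
print(split_reasoning({"role": "assistant", "content": "<think>2 + 2 = 4</think>\n4"}))
```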

Vision & image input

Models with supports_vision: true accept images as embedded content blocks. Pass either a public URL or a base64 data URL; size limits depend on the upstream model.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
      ]
    }]
  }'
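
The curl example above uses a public URL. For the base64 data-URL variant, a minimal Python sketch could look like this; the helper names are illustrative, and the three bytes used here are placeholder data rather than a real image.

```python
import base64


def image_block_from_bytes(data: bytes, mime_type: str = "image/jpeg") -> dict:
    """Wrap raw image bytes in an image_url content block using a data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}}


def vision_message(prompt: str, image_block: dict) -> dict:
    """Combine a text block and an image block into one user message."""
    return {"role": "user", "content": [{"type": "text", "text": prompt}, image_block]}


# Placeholder bytes (the JPEG magic number); read a real file with open(path, "rb").read().
block = image_block_from_bytes(b"\xff\xd8\xff", "image/jpeg")
message = vision_message("What is in this image?", block)
print(message["content"][1]["image_url"]["url"])
```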

Tool calling

Models with supports_tools: true can call functions that you define. The model returns a tool_calls array; you execute the call and then send the result back in a tool message.

Request

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto"
  }'

Response with tool call

{
  "id": "chatcmpl-abc123",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"Paris\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

Follow-up request with tool result

{
  "model": "gpt-5.1-chat",
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}
      }]
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 14, \"sky\": \"cloudy\"}"}
  ]
}
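
The step between the tool-call response and the follow-up request happens in your code. A minimal Python sketch of that step, assuming a local stand-in get_weather function and a hypothetical run_tool_calls helper:

```python
import json


def get_weather(location: str) -> dict:
    """Stand-in tool; a real implementation would call a weather service."""
    return {"temp_c": 14, "sky": "cloudy"}


TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list:
    """Execute each requested call and return the matching tool messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        func = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments is a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties the result to the call
            "content": json.dumps(func(**args)),
        })
    return results


assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"},
    }],
}
messages = [
    {"role": "user", "content": "What is the weather in Paris?"},
    assistant_message,
    *run_tool_calls(assistant_message),
]
print(json.dumps(messages[-1]))
```

The resulting messages list is what goes into the follow-up request, together with the same model.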

Assistant prefill

End your messages array with an assistant message that already contains some text, and the model continues from it instead of starting a fresh turn. This is a reliable way to force a response to begin a specific way — a leading "{" for JSON, a chosen language, or a fixed prefix. The same trick works on /v1/messages. Providers that reject native prefill are handled automatically: the gateway retries once with a compatible rewrite, so you do not have to special-case them.

{
  "model": "claude-sonnet-4.6",
  "messages": [
    {"role": "user", "content": "List three primary colors as a JSON array."},
    {"role": "assistant", "content": "["}
  ]
}

Structured outputs

Set response_format to make the model return JSON. Two modes are supported:

{ "type": "json_object" } — the response is a single valid JSON value.
{ "type": "json_schema", "json_schema": { "name", "schema", "strict" } } — the model is steered to produce JSON that matches your JSON Schema.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Extract the city and country: I live in Paris, France."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "location",
        "schema": {
          "type": "object",
          "properties": { "city": {"type": "string"}, "country": {"type": "string"} },
          "required": ["city", "country"]
        }
      }
    }
  }'

Reliability: even when a model wraps its answer in prose or a markdown code fence, Airforce extracts the JSON payload so you always receive parseable content. If no valid JSON can be recovered, the original text is returned unchanged — so the guarantee never makes a response worse. This applies to non-streamed responses; streamed responses are passed through unchanged.
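
Because the original text can still come back unchanged (and streamed responses are passed through), a client may want to guard its parsing. A small Python sketch, with the hypothetical helper parse_structured checking the two keys from the example schema above:

```python
import json


def parse_structured(content: str, required=("city", "country")):
    """Return the parsed object, or None when content is not the expected JSON."""
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        return None  # plain text came back instead of JSON
    if not isinstance(data, dict) or any(key not in data for key in required):
        return None  # valid JSON, but not the shape we asked for
    return data


print(parse_structured('{"city": "Paris", "country": "France"}'))
print(parse_structured("I live in Paris, France."))
```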

Streaming

Set stream: true to receive partial completions as Server-Sent Events. Each event is a JSON chunk with the same shape as the non-streamed response, except that message is replaced by delta. The stream ends with data: [DONE].

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Write a haiku about Berlin."}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'

Wire format

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"Cold "},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"stone "},"finish_reason":null}]}

…

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":17,"total_tokens":29}}

data: [DONE]
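
A client that reads this stream line by line could reassemble the text roughly as follows. This is a Python sketch over an in-memory list of lines (abbreviated from the wire format above); collect_stream is an illustrative name, and a real client would iterate over the HTTP response instead.

```python
import json


def collect_stream(lines):
    """Accumulate delta.content and the final usage from OpenAI-style SSE lines."""
    text, usage, finish_reason = [], None, None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        usage = chunk.get("usage") or usage
        for choice in chunk.get("choices", []):
            text.append(choice.get("delta", {}).get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(text), finish_reason, usage


sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    '',
    'data: {"choices":[{"index":0,"delta":{"content":"Cold "},"finish_reason":null}]}',
    '',
    'data: {"choices":[{"index":0,"delta":{"content":"stone "},"finish_reason":null}]}',
    '',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],'
    '"usage":{"prompt_tokens":12,"completion_tokens":17,"total_tokens":29}}',
    '',
    'data: [DONE]',
]
print(collect_stream(sample))
```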

Reliability & smart routing

Every model ID resolves to a pool of upstream providers behind the scenes. If the first one errors or times out, the request is automatically retried against the next provider for the same model, in order, before any failure is returned — you do not configure or trigger this. The model field in the response always reports the variant that actually answered. This is independent of the optional models / fallbacks array, which adds your own cross-model candidates on top: first the primary model exhausts its own provider chain, then each fallback model exhausts its chain.
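
The order of attempts described above can be pictured with a small Python sketch. It only models the ordering, not the gateway's implementation; the provider names and the attempt_order helper are made up for illustration.

```python
def attempt_order(primary, fallbacks, provider_pools):
    """List (model, provider) attempts: primary chain first, then each fallback chain."""
    order = []
    for model in [primary, *fallbacks]:
        # Models without a known pool contribute no attempts (they are skipped).
        for provider in provider_pools.get(model, []):
            order.append((model, provider))
    return order


# Hypothetical provider pools; real pools are internal to the gateway.
pools = {
    "gpt-5.1-chat": ["provider-a", "provider-b"],
    "deepseek-v3.2": ["provider-c"],
}
for step in attempt_order("gpt-5.1-chat", ["deepseek-v3.2", "unknown-model"], pools):
    print(step)
```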

POST /v1/messages

Anthropic-compatible Messages API. Set baseURL to https://api.airforce to use it with the official @anthropic-ai/sdk. For non-Claude models, requests are routed transparently to OpenAI/Google/etc.

POST https://api.airforce/v1/messages

Request body

Parameter	Type	Required	Description
model	string	Required	Model ID (Anthropic format or a routed alias).
messages	array	Required	Each entry: { role: "user" \| "assistant", content: string \| array }.
max_tokens	integer	Required	Required by Anthropic. Token limit for the response.
system	string \| array	Optional	System prompt. Pass an array of { type: "text", text, cache_control? } blocks to mark cache-prefix segments. See "Prompt caching".
temperature	float	Optional	0–1.
top_p	float	Optional	Nucleus sampling.
top_k	integer	Optional	Restricts the sampling pool to the top K tokens.
stop_sequences	array	Optional	Up to 4 stop sequences.
stream	boolean	Optional	When true, sends an Anthropic-style SSE event stream (see "Streaming events").
fallbacks	array	Optional	Fallback models (max 3) in Anthropic form: [{"model": "gpt-4o-mini"}]. If every channel of the primary model fails, each candidate is tried in order; you are billed for — and the response model field reports — the model that actually answered. A plain models string array is accepted too.
tools	array	Optional	Anthropic tool definitions: { name, description, input_schema }. The response may contain tool_use content blocks.
tool_choice	object	Optional	{ type: "auto" \| "any" \| "tool", name? }.
thinking	object	Optional	Anthropic extended thinking: { type: "enabled", budget_tokens: N }.

Example

curl https://api.airforce/v1/messages \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "max_tokens": 256,
    "system": "You are a helpful assistant.",
    "messages": [
      {"role": "user", "content": "Hello, Claude!"}
    ]
  }'

Response format

Parameter	Type	Required	Description
id	string	Optional	Message ID, e.g. "msg_01ABCxyz".
type	string	Optional	Always "message".
role	string	Optional	Always "assistant".
content	array	Optional	Array of content blocks: { type: "text" \| "tool_use" \| "thinking", … }.
model	string	Optional	ID of the model that answered: the requested model, or the fallback that took over when fallbacks is set.
stop_reason	string	Optional	"end_turn" \| "max_tokens" \| "stop_sequence" \| "tool_use".
usage	object	Optional	{ input_tokens, output_tokens, cache_read_input_tokens?, cache_creation_input_tokens?, cache_creation? }. Cache fields appear when prompt caching was used. cache_creation.ephemeral_5m_input_tokens and ephemeral_1h_input_tokens give the write split per TTL.

Streaming events

Anthropic SSE uses named events rather than bare JSON chunks. Every event has both an event: name and a data: JSON payload.

event: message_start
data: {"type":"message_start","message":{"id":"msg_01","role":"assistant","content":[],"model":"claude-sonnet-4.6","stop_reason":null,"usage":{"input_tokens":12,"output_tokens":1}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}

event: message_stop
data: {"type":"message_stop"}
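
Reading this stream means pairing each event: line with the data: line that follows it. A minimal Python sketch over an in-memory list of lines (shortened from the events above); collect_anthropic_stream is an illustrative name, not an SDK function.

```python
import json


def collect_anthropic_stream(lines):
    """Join text_delta fragments and pick up stop_reason from named SSE events."""
    text, stop_reason, event = [], None, None
    for line in lines:
        line = line.strip()
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            payload = json.loads(line[len("data:"):].strip())
            if event == "content_block_delta" and payload["delta"].get("type") == "text_delta":
                text.append(payload["delta"]["text"])
            elif event == "message_delta":
                stop_reason = payload["delta"].get("stop_reason", stop_reason)
            elif event == "message_stop":
                break
    return "".join(text), stop_reason


sample = [
    'event: message_start',
    'data: {"type":"message_start","message":{"id":"msg_01","role":"assistant","content":[]}}',
    '',
    'event: content_block_delta',
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}',
    '',
    'event: message_delta',
    'data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}',
    '',
    'event: message_stop',
    'data: {"type":"message_stop"}',
]
print(collect_anthropic_stream(sample))
```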

POST /v1/messages/count_tokens

Anthropic-compatible token counting. Send the same system / messages / tools you would pass to /v1/messages and get an input-token estimate back without running the model — nothing is billed.

POST https://api.airforce/v1/messages/count_tokens

curl https://api.airforce/v1/messages/count_tokens \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "system": "You are a helpful assistant.",
    "messages": [{"role": "user", "content": "Hello, Claude!"}]
  }'

# → {"input_tokens": 34}

The count is a fast character-based estimate (about 4 characters per token) over system, messages and tools — close enough for context-budget checks, not an exact tokenizer run.

Prompt caching

On /v1/messages with Claude models, you mark a prefix as cached by passing system as an array of blocks in which the cached segment carries cache_control: { type: "ephemeral" }. Follow-up requests with the same prefix pay the cheaper cache-read rate. Models with supports_caching: true in /v1/models support this.

Write vs. read pricing

Cache writes are typically billed slightly above normal input (about 1.25× on Claude models). Cache reads are much cheaper (about 0.1× input). A large write with almost no later read is the expensive case, not a "cache discount". Only reusing the same prefix turns the write into a saving.

Tools such as Claude Code often attach a large project context with cache markers to the first turns. Expect cache-write costs while the repo/system prefix is loaded; later turns only get cheap if that prefix stays stable and is reused. Subagents and multi-step agents can multiply large contexts across several requests.

{
  "model": "claude-sonnet-4.6",
  "max_tokens": 1024,
  "system": [
    {"type": "text", "text": "You are a senior staff engineer at Airforce."},
    {
      "type": "text",
      "text": "<repository-snapshot>...</repository-snapshot>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Where is rate limiting enforced?"}
  ]
}

How cache counts appear in the response

Cache token counts are returned in each surface's native format, so SDKs (openai, @anthropic-ai/sdk, @google/genai) read them without custom code. Fields with a value of 0 are omitted so non-cache responses stay lean.

/v1/chat/completions (OpenAI format)

"usage": {
  "prompt_tokens": 2104,
  "completion_tokens": 147,
  "total_tokens": 2251,
  "prompt_tokens_details": { "cached_tokens": 1980 },
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1/messages (Anthropic format)

"usage": {
  "input_tokens": 2104,
  "output_tokens": 147,
  "cache_read_input_tokens": 1980,
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1beta/.../generateContent (Gemini format)

"usageMetadata": {
  "promptTokenCount": 2104,
  "candidatesTokenCount": 147,
  "totalTokenCount": 2251,
  "cachedContentTokenCount": 1980
}
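
If one code path has to read cache hits from all three shapes, a small normalising helper is enough. The Python sketch below is illustrative (cached_read_tokens is not an SDK function) and treats a missing field as 0, in line with zero-valued fields being omitted.

```python
def cached_read_tokens(usage: dict) -> int:
    """Return the cache-read token count from any of the three usage shapes."""
    if "cache_read_input_tokens" in usage:      # /v1/messages (Anthropic format)
        return usage["cache_read_input_tokens"]
    if "cachedContentTokenCount" in usage:      # usageMetadata (Gemini format)
        return usage["cachedContentTokenCount"]
    # /v1/chat/completions (OpenAI format); absent details mean no cache read.
    return usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)


openai_usage = {"prompt_tokens": 2104, "prompt_tokens_details": {"cached_tokens": 1980}}
anthropic_usage = {"input_tokens": 2104, "cache_read_input_tokens": 1980}
gemini_usage = {"promptTokenCount": 2104, "cachedContentTokenCount": 1980}
for usage in (openai_usage, anthropic_usage, gemini_usage):
    print(cached_read_tokens(usage))
```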

Where caching applies

Explicit cache_control markers are honored on /v1/messages and /v1/chat/completions for Claude models; set them on system or message content blocks. Many other providers (OpenAI family, DeepSeek, Gemini) cache automatically: you send no markers and simply see cached_tokens in the response once a sufficiently long prefix is reused.

Cache duration: 5 minutes or 1 hour

A cached prefix lives for 5 minutes by default, and the timer is refreshed on every hit. For a longer-lived prefix, add ttl: "1h" to the marker. The response reports each TTL separately under cache_creation.

"cache_control": { "type": "ephemeral", "ttl": "1h" }

Example: write first, then read

Send the same request twice (the caching example above). The first call that sees the prefix pays a one-time cache write; identical calls within the TTL pay the much cheaper cache read.

First call, cache write (usage excerpt):

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 1980,
  "cache_read_input_tokens": 0
}

Second identical call within the TTL, cache read:

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 1980
}

Limits & costs

Claude requires a minimum prefix length (about 1024 tokens; more on some models). Shorter prefixes are simply not cached.
Up to 4 cache breakpoints per request, and the cached prefix must be byte-identical across calls; even a one-character change misses the cache.
Cache writes cost more than normal input (5m ≈ 1.25×, 1h ≈ 2×); cache reads cost much less (≈ 0.1×). Per-model cache prices are listed on the pricing page.
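
To see when a cache write pays off, the approximate multipliers above can be plugged into a short Python calculation. It is a back-of-the-envelope sketch: it assumes every call lands within the TTL, reuses the identical prefix, and that the rounded multipliers hold for the model in question.

```python
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.0}  # approximate values quoted above
READ_MULTIPLIER = 0.1


def relative_prefix_cost(calls: int, ttl: str = "5m"):
    """Return (cached, uncached) prefix cost in units of one normal-priced prefix."""
    # One write on the first call, then cheap reads on every later call.
    cached = WRITE_MULTIPLIER[ttl] + (calls - 1) * READ_MULTIPLIER
    return round(cached, 2), float(calls)


for ttl in ("5m", "1h"):
    for calls in (1, 2, 3):
        print(ttl, calls, relative_prefix_cost(calls, ttl))
```

Under these assumptions a 5-minute write is already cheaper than no caching on the second call, while a 1-hour write needs a third call to come out ahead.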

POST /v1/responses

OpenAI Responses API for stateful conversations. Same Bearer/x-api-key auth. Cache counts appear as input_tokens_details.cached_tokens (reads) plus cache_creation_input_tokens and cache_creation.ephemeral_* (writes), analogous to /v1/chat/completions.

POST https://api.airforce/v1/responses

POST /v1beta/models/{model}:generateContent

Google Gemini-compatible endpoint. Works with the official @google/genai SDK and the Gemini CLI by pointing the base URL at https://api.airforce/v1beta. Any routed model works — requests are translated to and from the native Gemini shape, and the model is taken from the URL path (not the body).

POST https://api.airforce/v1beta/models/{model}:generateContent

Authentication

Pass your Airforce API key any of the three ways Google clients use:

# 1) query parameter (Google default)
?key=sk-air-YOUR_API_KEY

# 2) header
x-goog-api-key: sk-air-YOUR_API_KEY

# 3) bearer token
Authorization: Bearer sk-air-YOUR_API_KEY

Request body

Parameter	Type	Required	Description
contents	array	Required	Conversation turns. Each: { role: "user" \| "model", parts: [...] }. A part is { text }, { functionCall: { name, args } }, or { functionResponse: { name, response } }. "model" is Gemini's term for the assistant role.
systemInstruction	object	Optional	System prompt: { parts: [{ text }] }.
generationConfig	object	Optional	{ temperature, maxOutputTokens, topP, stopSequences } — mapped to the canonical sampling parameters.
tools	array	Optional	Tool definitions: [{ functionDeclarations: [{ name, description, parameters }] }]. functionDeclarations are flattened across entries.
toolConfig	object	Optional	Tool-choice control: { functionCallingConfig: { mode: "AUTO" \| "ANY" \| "NONE" } }. ANY forces a call, NONE disables tools.

Example

curl "https://api.airforce/v1beta/models/gemini-3.1-pro:generateContent" \
  -H "x-goog-api-key: sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {"role": "user", "parts": [{"text": "What is the capital of France?"}]}
    ],
    "systemInstruction": {"parts": [{"text": "You are a helpful assistant."}]},
    "generationConfig": {"temperature": 0.7, "maxOutputTokens": 256}
  }'
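
When porting code from the OpenAI-style surface, the main difference is the message shape. The Python sketch below converts text-only system, user, and assistant messages into a generateContent body; to_gemini_body is an illustrative helper, and tool messages or image blocks are deliberately not handled.

```python
def to_gemini_body(messages):
    """Convert text-only OpenAI-style messages into a generateContent request body."""
    body = {"contents": []}
    system_parts = []
    for message in messages:
        part = {"text": message["content"]}
        if message["role"] == "system":
            system_parts.append(part)  # system prompts move to systemInstruction
        else:
            # "model" is Gemini's term for the assistant role.
            role = "model" if message["role"] == "assistant" else "user"
            body["contents"].append({"role": role, "parts": [part]})
    if system_parts:
        body["systemInstruction"] = {"parts": system_parts}
    return body


body = to_gemini_body([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])
print(body)
```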

Response shape

Parameter	Type	Required	Description
candidates	array	Optional	Generated turns: [{ content: { role: "model", parts }, finishReason, index }]. Only the first candidate is populated.
candidates[].finishReason	string	Optional	"STOP" \| "MAX_TOKENS" \| "SAFETY" \| "OTHER".
usageMetadata	object	Optional	{ promptTokenCount, candidatesTokenCount, totalTokenCount, cachedContentTokenCount? }. cachedContentTokenCount appears when the upstream reported a cache read.
modelVersion	string	Optional	Echo of the requested model.

{
  "candidates": [{
    "content": {
      "role": "model",
      "parts": [{"text": "The capital of France is Paris."}]
    },
    "finishReason": "STOP",
    "index": 0
  }],
  "usageMetadata": {
    "promptTokenCount": 16,
    "candidatesTokenCount": 8,
    "totalTokenCount": 24
  },
  "modelVersion": "gemini-3.1-pro"
}

POST /v1beta/models/{model}:streamGenerateContent

Streaming uses the :streamGenerateContent action and returns Server-Sent Events. Each data: line is a full Gemini-shaped chunk (not a delta object); the final chunk carries usageMetadata.

data: {"candidates":[{"content":{"role":"model","parts":[{"text":"The capital"}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[{"text":" is Paris."}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[]},"finishReason":"STOP","index":0}],"usageMetadata":{"promptTokenCount":16,"candidatesTokenCount":8,"totalTokenCount":24}}
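
A reader for this stream could be sketched in Python as below. It assumes, as in the sample chunks above, that each chunk carries only the newly generated text, so the parts are simply concatenated; collect_gemini_stream is an illustrative name.

```python
import json


def collect_gemini_stream(lines):
    """Concatenate text parts and capture finishReason and usageMetadata."""
    text, finish_reason, usage = [], None, None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines
        chunk = json.loads(line[len("data:"):].strip())
        usage = chunk.get("usageMetadata", usage)  # only on the final chunk
        for candidate in chunk.get("candidates", []):
            finish_reason = candidate.get("finishReason", finish_reason)
            for part in candidate.get("content", {}).get("parts", []):
                text.append(part.get("text", ""))
    return "".join(text), finish_reason, usage


sample = [
    'data: {"candidates":[{"content":{"role":"model","parts":[{"text":"The capital"}]},"index":0}]}',
    '',
    'data: {"candidates":[{"content":{"role":"model","parts":[{"text":" is Paris."}]},"index":0}]}',
    '',
    'data: {"candidates":[{"content":{"role":"model","parts":[]},"finishReason":"STOP","index":0}],'
    '"usageMetadata":{"promptTokenCount":16,"candidatesTokenCount":8,"totalTokenCount":24}}',
]
print(collect_gemini_stream(sample))
```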

List models

The catalog is also exposed in Gemini Model-resource shape so Google clients can enumerate models.

curl https://api.airforce/v1beta/models

Notes: the base URL is https://api.airforce/v1beta (or /v1), not Google's host. The model name comes from the URL path, not the request body. Only the first candidate is returned, and a subset of Gemini fields is translated — safetySettings and cachedContent are currently ignored. Billing, rate limits and smart routing apply exactly as on /v1/chat/completions.

Errors

Airforce returns standard HTTP status codes and a uniform error envelope for both endpoints.

Status	Type	Description
400	invalid_request_error	Malformed JSON, missing required field, unknown model.
401	invalid_request_error / auth_required	API key is missing or invalid.
402	insufficient_quota	The model requires an active subscription or a positive pay-as-you-go balance.
403	model_access_denied / insufficient_scope	Plan or per-key permissions deny this request.
404	model_not_found	The requested model does not exist or you do not have access to it.
429	rate_limit_error	Request rate or daily token limit exceeded.
503	api_error / moderation_unavailable	All upstream keys for the requested provider are down.

{
  "error": {
    "message": "The requested model does not exist or you do not have access to it.",
    "type": "model_not_found",
    "param": null,
    "code": "404"
  }
}

The descriptive slug is in type. code is the HTTP status as a string (e.g. "404"), and param is null except for parameter-range validation errors, where it names the offending parameter.
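
Client code can branch on the envelope roughly as in this Python sketch. The parse_error helper is illustrative, and treating 429 and 503 as retryable is an assumption of this example, not documented gateway behavior.

```python
import json

RETRYABLE_STATUSES = {429, 503}  # assumption: rate limits and upstream outages may clear


def parse_error(status: int, body: str) -> dict:
    """Flatten the error envelope and add a retry hint."""
    error = json.loads(body).get("error", {})
    return {
        "status": status,
        "type": error.get("type"),        # descriptive slug
        "message": error.get("message"),
        "param": error.get("param"),      # set only for parameter validation errors
        "retryable": status in RETRYABLE_STATUSES,
    }


sample_body = json.dumps({
    "error": {
        "message": "The requested model does not exist or you do not have access to it.",
        "type": "model_not_found",
        "param": None,
        "code": "404",
    }
})
print(parse_error(404, sample_body))
```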

Discovering models

See the full list of all model IDs and their capability flags (vision, tools, reasoning, caching, context length, …) at /docs/api/models.

curl https://api.airforce/v1/models \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY"
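
Once the catalog is fetched, filtering by capability flag is a short step. The Python sketch below assumes an OpenAI-style list response of the form {"data": [{"id": …, "supports_vision": …}]}; that envelope, the sample entries, and the models_with helper are assumptions for illustration, so check the actual response of GET /v1/models.

```python
def models_with(catalog: dict, flag: str) -> list:
    """Return the IDs of all models whose capability flag is exactly true."""
    return [model["id"] for model in catalog.get("data", []) if model.get(flag) is True]


# Made-up sample catalog; real entries come from GET /v1/models.
catalog = {
    "data": [
        {"id": "model-a", "supports_reasoning": True, "supports_vision": False},
        {"id": "model-b", "supports_reasoning": False, "supports_vision": True},
        {"id": "model-c", "supports_reasoning": True, "supports_vision": True},
    ]
}
print(models_with(catalog, "supports_reasoning"))
print(models_with(catalog, "supports_vision"))
```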