API REFERENCE

Chat Completions

Generate chat responses for more than 100 models from a single API. Drop-in compatible with OpenAI Chat Completions, OpenAI Responses, Anthropic Messages and Google Gemini generateContent.

Airforce speaks both the OpenAI Chat Completions and the Anthropic Messages wire formats over the same set of models. Pick the SDK you already use and change only the base URL; non-Claude models are forwarded transparently behind both surfaces.

This page covers authentication, the request and response shapes for both surfaces, streaming, tool calling, vision, reasoning and prompt caching. New here? Start with the basic example below, get one call working, and then add streaming, tools or caching once that works.

Authentication

Every request requires a Bearer token (your Airforce API key). The Anthropic x-api-key header is also accepted on /v1/messages for SDK compatibility.

Authorization: Bearer sk-air-YOUR_API_KEY
# alt for /v1/messages:
x-api-key: sk-air-YOUR_API_KEY
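Below is a minimal Python sketch of choosing the header for each surface. `auth_headers` and its surface labels are hypothetical names. The anthropic-version value is the one used in the /v1/messages examples on this page.

```python
def auth_headers(api_key: str, surface: str = "openai") -> dict:
    """Build request headers for an Airforce call (hypothetical helper).

    surface="openai": Bearer token, for /v1/chat/completions and /v1/responses.
    surface="anthropic": x-api-key plus anthropic-version, as the Anthropic SDK
    sends them to /v1/messages (a Bearer token is accepted there too).
    """
    headers = {"Content-Type": "application/json"}
    if surface == "openai":
        headers["Authorization"] = f"Bearer {api_key}"
    elif surface == "anthropic":
        headers["x-api-key"] = api_key
        headers["anthropic-version"] = "2023-06-01"
    else:
        raise ValueError(f"unknown surface: {surface!r}")
    return headers
```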

POST /v1/chat/completions

OpenAI-compatible chat completions. Works with the official openai SDK by overriding base_url to https://api.airforce/v1.

POST https://api.airforce/v1/chat/completions

Request body

Parameter	Type	Required	Description
model	string	Required	Model ID. Use GET /v1/models to discover available IDs.
messages	array	Required	Conversation history. Each item has { role: "system" | "user" | "assistant" | "tool", content }. content is a string or an array of content blocks (for vision, see below).
max_tokens	integer	Optional	Maximum number of tokens to generate. Capped at the model's max_output_tokens.
temperature	float	Optional	Sampling temperature, 0–2. Lower values are more deterministic. The default depends on the upstream provider.
top_p	float	Optional	Nucleus sampling. Set either temperature or top_p, not both.
stream	boolean	Optional	When true, the response is a stream of server-sent events. See "Streaming" below.
models	array	Optional	Fallback models (max 3), e.g. ["deepseek-v3.2", "gpt-4o-mini"]. If every channel of the primary model fails, each candidate is tried in order. You are billed for — and response.model reports — the model that actually answered. Unknown or plan-gated candidates are skipped. With the OpenAI SDK pass it via extra_body.
transforms	array	Optional	Prompt transforms. Supported: ["middle-out"] — when the conversation overflows the model's context window, whole messages are dropped from the middle (system prompts, the first message and the most recent turns are kept), so long roleplay or agent histories keep working instead of erroring. Opt-in; off by default.
stream_options	object	Optional	{ include_usage: boolean }. Usage is always included in the final streaming chunk; this field is accepted for OpenAI compatibility but cannot turn that off.
stop	string | array	Optional	Up to 4 stop sequences. Generation stops as soon as one is produced.
tools	array	Optional	Function definitions the model can call. See "Tool calling" below.
tool_choice	string | object	Optional	"auto" (default), "none", or { type: "function", function: { name } } to force a specific call.
response_format	object	Optional	{ type: "json_object" } forces the model to emit valid JSON; { type: "json_schema", json_schema: { … } } steers it toward your schema (see "Structured outputs"). Ignored for models that do not support it.
reasoning_effort	string	Optional	Reasoning depth: "low" | "medium" | "high" | "xhigh" | "max". Accepted by any model with supports_reasoning: true (Claude, OpenAI o/GPT-5, Gemini, Qwen, DeepSeek, …). See "Reasoning & thinking".
thinking	string | object	Optional	Cross-model thinking switch. Accepts "on" | "off" | "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; or hybrid { type: "enabled" | "disabled" }. See "Reasoning & thinking".
thinking_budget	integer	Optional	Token limit for the model's reasoning trace (when the provider exposes one).
ignore_defaults	boolean	Optional	Skip your saved per-model default parameters (configured in the dashboard) for this request.
skill	string	Optional	ID of a single marketplace skill to apply to this request. The skill transforms your messages/parameters before the upstream call and overrides any installed-skill defaults. Consumed by Airforce, never forwarded upstream. See the Skills catalog at /docs/api/extend.
skills	array	Optional	Array of marketplace skill IDs applied in order, for stacking multiple skills on one request.
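Below is a sketch of attaching fallback models to a request body you build yourself. `with_fallbacks` is a hypothetical helper that only enforces the documented limit of 3 candidates. With the OpenAI Python SDK you would instead pass extra_body={"models": [...]}, as the table notes.

```python
MAX_FALLBACKS = 3  # documented limit for the models parameter


def with_fallbacks(body: dict, fallbacks: list) -> dict:
    """Return a copy of a chat-completions body with a fallback model list.

    Hypothetical helper: it only checks the documented maximum. Unknown or
    plan-gated candidates are skipped server-side, not here.
    """
    if len(fallbacks) > MAX_FALLBACKS:
        raise ValueError(f"at most {MAX_FALLBACKS} fallback models are allowed")
    return {**body, "models": list(fallbacks)}


body = with_fallbacks(
    {"model": "gpt-5.1-chat", "messages": [{"role": "user", "content": "Hi"}]},
    ["deepseek-v3.2", "gpt-4o-mini"],
)
# After the call, response["model"] names whichever model actually answered.
```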

Basic example

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
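Here is the same request in Python using only the standard library, as a sketch. `build_chat_request` and `send` are hypothetical helper names, and the 60-second timeout is an arbitrary choice. The request is built without being sent, so you can inspect it first.

```python
import json
import urllib.request

API_BASE = "https://api.airforce/v1"


def build_chat_request(api_key: str, model: str, messages: list, **params) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/chat/completions request."""
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_chat_request(
    "sk-air-YOUR_API_KEY",
    "gpt-5.1-chat",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=200,
    temperature=0.7,
)
# reply = send(req)  # uncomment to make the real call
```

With the official openai package, the equivalent is OpenAI(api_key=..., base_url="https://api.airforce/v1") followed by client.chat.completions.create(model=..., messages=...).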

Response shape

Parameter	Type	Required	Description
id	string	Optional	Stable completion ID, e.g. "chatcmpl-abc123".
object	string	Optional	"chat.completion" for non-streamed responses, "chat.completion.chunk" for streamed ones.
created	integer	Optional	Unix timestamp (seconds).
model	string	Optional	ID of the model that actually answered: normally the requested ID, or the fallback model's ID when one was used (see models).
choices	array	Optional	Array of completion candidates: [{ index, message: { role, content, tool_calls? }, finish_reason }].
choices[].finish_reason	string	Optional	"stop" | "length" | "tool_calls" | "content_filter".
usage	object	Optional	{ prompt_tokens, completion_tokens, total_tokens, completion_tokens_details?, prompt_tokens_details?, cache_creation_input_tokens?, cache_creation? }. completion_tokens_details.reasoning_tokens is set when the model produced a reasoning trace. Cache fields appear when the upstream returned prompt-caching info: prompt_tokens_details.cached_tokens reports cache reads (OpenAI standard), cache_creation_input_tokens aggregates writes, and cache_creation.ephemeral_5m_input_tokens / ephemeral_1h_input_tokens give the split by TTL.

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "gpt-5.1-chat",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 8,
    "total_tokens": 28
  }
}
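Below is a sketch of pulling the fields above out of a decoded response. `summarize_completion` is a hypothetical helper. It treats prompt_tokens_details and completion_tokens_details as optional, because they only appear when the upstream reports them.

```python
def summarize_completion(resp: dict) -> dict:
    """Collect the commonly used fields of a non-streamed chat completion."""
    choice = resp["choices"][0]
    usage = resp.get("usage") or {}
    # These detail objects are only present when the upstream reported them.
    prompt_details = usage.get("prompt_tokens_details") or {}
    completion_details = usage.get("completion_tokens_details") or {}
    return {
        "model": resp.get("model"),
        "text": choice["message"].get("content"),  # None when the model only called tools
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": usage.get("total_tokens", 0),
        "cached_tokens": prompt_details.get("cached_tokens", 0),
        "reasoning_tokens": completion_details.get("reasoning_tokens", 0),
    }


example = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "created": 1710000000,
    "model": "gpt-5.1-chat",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "The capital of France is Paris."},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 20, "completion_tokens": 8, "total_tokens": 28},
}
summary = summarize_completion(example)
```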

Reasoning & thinking

Reasoning/thinking is a cross-model feature for every model ID with supports_reasoning: true — Claude, OpenAI o-series/GPT-5, Gemini, Qwen, DeepSeek, and others. You send the same canonical parameters; Airforce maps them to each provider's native shape. This is not a DeepSeek-only API.

Source of truth: check for supports_reasoning: true on a model in GET /v1/models (or GET /api/models/{id}/allowed-params). Prefer that flag over guessing from the model name.

Models with reasoning support: every entry in GET /v1/models with supports_reasoning: true.

Canonical parameters

Parameter	Type	Required	Description
reasoning_effort	string	Optional	"low" | "medium" | "high" | "xhigh" | "max". Accepted on every model with supports_reasoning: true. Some upstreams only honour a subset (e.g. high/max); others clamp unsupported levels to the nearest served value.
thinking	string | object	Optional	Three accepted shapes (we normalise): "on" | "off" | "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" | "disabled" }. Mapped onto Claude extended thinking, OpenAI effort profiles, Gemini thinking_config, Qwen enable_thinking, DeepSeek hybrid, etc.
thinking_budget	integer	Optional	Maximum tokens the model may spend reasoning before emitting visible output. Mirrors budget_tokens when the upstream exposes a budget; takes precedence over reasoning_effort when both are sent and a budget is available.

What differs by family (mapping only)

Parameters are the same everywhere. Only how we map them (and how hard "off" is) differs:

Claude — Thinking on/off + budget; often also reasoning_effort via the gateway.
OpenAI (o1/o3, GPT-5) — Mainly reasoning_effort. A full "thinking off" is often not available — you control how strongly the model reasons, not always whether it reasons at all.
Gemini — thinking_config / budget mapped internally.
Qwen / Xiaomi / Alibaba — thinking + enable_thinking-style controls.
DeepSeek (generic) — Hybrid on/off is especially clear: thinking: { type: enabled|disabled } plus optional reasoning_effort.
Resellers / other — Often generic passthrough of the same canonical fields.

Controlling where the trace appears

An optional reasoning object on the request decides what happens to the thinking trace. It is consumed by Airforce and never forwarded upstream.

Parameter	Type	Required	Description
reasoning.format	string	Optional	"separate" (default) puts the trace in message.reasoning (and delta.reasoning while streaming). "inline" keeps the legacy inline <think>…</think> form inside content.
reasoning.exclude	boolean	Optional	When true, the reasoning trace is dropped entirely from the response. Reasoning tokens are still counted and billed if the model produced them.

"reasoning": { "format": "separate", "exclude": false }

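Below is a sketch of reading the trace under either reasoning.format. `split_reasoning` is a hypothetical helper. With "separate" it reads message.reasoning. With "inline" it strips <think>…</think> spans out of content. With exclude: true it simply returns an empty trace.

```python
import re

# Matches one inline <think>...</think> span, across newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(message: dict) -> tuple:
    """Return (trace, answer) from a chat-completions message.

    format "separate": trace in message["reasoning"], answer in content.
    format "inline":   trace inside <think>...</think> within content.
    exclude: true:     no trace anywhere, so the trace comes back empty.
    """
    content = message.get("content") or ""
    if message.get("reasoning"):
        return message["reasoning"], content
    traces = [t.strip() for t in THINK_RE.findall(content)]
    answer = THINK_RE.sub("", content).strip()
    return "\n".join(traces), answer
```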
Reasoning effort (OpenAI style)

Primary control for o-series and GPT-5: how much the model may reason. Same canonical field as on every other supports_reasoning model — OpenAI is included, but behaviour is not 1:1 with DeepSeek's hard on/off.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "o3-mini",
    "messages": [{"role": "user", "content": "Prove the Pythagorean theorem."}],
    "reasoning_effort": "high"
  }'

Extended thinking (Anthropic style)

Budget-based thinking for Claude (and gateways that accept the Anthropic shape). You can still send reasoning_effort; we map when the channel supports it.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "messages": [{"role": "user", "content": "Plan a 7-day Italy trip."}],
    "thinking": {"type": "enabled", "budget_tokens": 4000}
  }'

Hybrid thinking (e.g. DeepSeek V3.2/V4)

Example of a hybrid model family with a clear Thinking / Non-Thinking switch — not a separate protocol. deepseek-v3.2, deepseek-v4-flash and deepseek-v4-pro accept the same canonical fields as every other supports_reasoning model. Toggle thinking and optionally set effort in one request:

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Solve this step by step: integrate x^2 * e^x."}],
    "thinking": {"type": "enabled"},
    "reasoning_effort": "high"
  }'

Turn thinking off (faster, cheaper when you only need the final answer) — this hard off is clearer on hybrid models than on many OpenAI o-series profiles:

"thinking": {"type": "disabled"}
// or simply: "thinking": "off"

Native docs for this family often list effort levels such as "high" and "max". We accept the full low…max scale and map unsupported levels to the nearest value that reaches the model. Prefer the hybrid IDs above over retired deepseek-chat / deepseek-reasoner names when you need an explicit on/off switch.

The reasoning trace itself appears in choices[0].message.reasoning (OpenAI shape) or as thinking blocks inside content (Anthropic shape). Reasoning tokens are billed and reported in usage.completion_tokens_details.reasoning_tokens.

The completion_tokens_details.reasoning_tokens breakdown is only present when the upstream provider reports it. In a streamed response, the trace arrives in delta.reasoning on each chunk.
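Below is a sketch of splitting a stream into trace and answer once the chunks are parsed into dicts. `accumulate_stream` is a hypothetical helper. It reads delta.reasoning and, as a defensive assumption, also delta.reasoning_content, the name some upstreams (DeepSeek, for one) use natively.

```python
def accumulate_stream(chunks: list) -> dict:
    """Fold parsed chat.completion.chunk dicts into trace, answer and usage."""
    trace, answer, usage = [], [], None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            # delta.reasoning per this page; reasoning_content accepted defensively.
            piece = delta.get("reasoning") or delta.get("reasoning_content")
            if piece:
                trace.append(piece)
            if delta.get("content"):
                answer.append(delta["content"])
        if chunk.get("usage"):
            usage = chunk["usage"]  # the final chunk carries usage
    details = (usage or {}).get("completion_tokens_details") or {}
    return {
        "reasoning": "".join(trace),
        "content": "".join(answer),
        "usage": usage,
        "reasoning_tokens": details.get("reasoning_tokens"),  # None if not reported
    }
```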

Vision & image input

Models with supports_vision: true accept images embedded as content blocks. Either a public URL or a base64 data URL works; size limits depend on the upstream model.

Models with vision support: every entry in GET /v1/models with supports_vision: true.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
      ]
    }]
  }'
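For local files, a base64 data URL avoids having to host the image. Below is a minimal sketch. `image_block` and `vision_message` are hypothetical helpers, and the MIME type you pass must match the actual image.

```python
import base64


def image_block(data: bytes, mime_type: str = "image/png") -> dict:
    """Wrap raw image bytes as an image_url content block with a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}}


def vision_message(question: str, data: bytes, mime_type: str = "image/png") -> dict:
    """A user message pairing a text question with one inline image."""
    return {
        "role": "user",
        "content": [{"type": "text", "text": question}, image_block(data, mime_type)],
    }


# Usage with a local file (the path is illustrative):
# with open("cat.jpg", "rb") as f:
#     message = vision_message("What is in this image?", f.read(), "image/jpeg")
```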

Tool calling

Models with supports_tools: true can call functions you define. The model returns a tool_calls array; you execute each call and then send the result back in a tool message.

Models with tool-calling support: every entry in GET /v1/models with supports_tools: true.

Request

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto"
  }'

Response with a tool call

{
  "id": "chatcmpl-abc123",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"Paris\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

Follow-up with the tool result

{
  "model": "gpt-5.1-chat",
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}
      }]
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 14, \"sky\": \"cloudy\"}"}
  ]
}
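The execute-and-reply step can be sketched in Python as below. `get_weather` is a stub standing in for your real function, and `tool_messages` is a hypothetical helper. Note that function.arguments arrives as a JSON string, and each reply's tool_call_id must echo the id the model issued.

```python
import json
from typing import Any, Callable, Dict


def get_weather(location: str) -> dict:
    """Stub standing in for a real weather lookup (hypothetical)."""
    return {"temp_c": 14, "sky": "cloudy"}


TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def tool_messages(assistant_message: dict, registry: Dict[str, Callable[..., Any]]) -> list:
    """Run every tool call in an assistant message and build the role "tool" replies."""
    replies = []
    for call in assistant_message.get("tool_calls") or []:
        fn = registry[call["function"]["name"]]
        # arguments is a JSON-encoded string, not an object
        args = json.loads(call["function"]["arguments"] or "{}")
        replies.append({
            "role": "tool",
            "tool_call_id": call["id"],  # must match the id the model issued
            "content": json.dumps(fn(**args)),
        })
    return replies
```

Append the assistant message and then these replies to messages, and send the request again, as in the follow-up above.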

Assistant prefill

End your messages array with an assistant message that already contains some text, and the model continues from it instead of starting a fresh turn. This is a reliable way to force a response to begin a specific way — a leading "{" for JSON, a chosen language, or a fixed prefix. The same trick works on /v1/messages. Providers that reject native prefill are handled automatically: the gateway retries once with a compatible rewrite, so you do not have to special-case them.

{
  "model": "claude-sonnet-4.6",
  "messages": [
    {"role": "user", "content": "List three primary colors as a JSON array."},
    {"role": "assistant", "content": "["}
  ]
}
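Below is a sketch of building the prefilled request and reassembling the answer. This page does not state whether the returned content repeats the prefill, so `join_prefill` (a hypothetical helper) checks for it. That check is a heuristic and can misfire if the continuation legitimately starts with the same characters.

```python
import json


def prefill_messages(prompt: str, prefill: str) -> list:
    """End the conversation with a partial assistant turn for the model to continue."""
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": prefill},
    ]


def join_prefill(prefill: str, reply: str) -> str:
    """Rebuild the full answer from the prefill and the model's reply.

    Heuristic: if the reply already starts with the prefill, it is returned
    unchanged; otherwise it is treated as a pure continuation.
    """
    return reply if reply.startswith(prefill) else prefill + reply


body = {
    "model": "claude-sonnet-4.6",
    "messages": prefill_messages("List three primary colors as a JSON array.", "["),
}
# Hypothetical reply text, standing in for the model's continuation:
colors = json.loads(join_prefill("[", '"red", "yellow", "blue"]'))
```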

Structured outputs

Set response_format to make the model return JSON. Two modes are supported:

{ "type": "json_object" } — the response is a single valid JSON value.
{ "type": "json_schema", "json_schema": { "name", "schema", "strict" } } — the model is steered to produce JSON that matches your JSON Schema.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Extract the city and country: I live in Paris, France."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "location",
        "schema": {
          "type": "object",
          "properties": { "city": {"type": "string"}, "country": {"type": "string"} },
          "required": ["city", "country"]
        }
      }
    }
  }'

Reliability: even when a model wraps its answer in prose or a markdown code fence, Airforce extracts the JSON payload so you receive parseable content. If no valid JSON can be recovered, the original text is returned unchanged, so the extraction never makes a response worse. This applies to non-streamed responses; streamed responses are passed through unchanged.
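Streamed responses skip that extraction, so a client-side fallback can help after you have joined the stream. This is a best-effort sketch, not Airforce's own algorithm. `extract_json` is a hypothetical helper that tries the whole text, then fenced blocks, then the widest {...} or [...] span, and returns None when nothing parses.

```python
import json
import re

# A fenced block: three backticks, an optional "json" tag, then the payload.
FENCE_RE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


def extract_json(text: str):
    """Best-effort JSON recovery from model text; returns None if nothing parses."""
    candidates = [text.strip()]
    candidates += [block.strip() for block in FENCE_RE.findall(text)]
    # Last resort: the widest {...} or [...] span in the text.
    for open_ch, close_ch in (("{", "}"), ("[", "]")):
        start, end = text.find(open_ch), text.rfind(close_ch)
        if start != -1 and end > start:
            candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None
```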

Streaming

Set stream: true to receive partial completions as server-sent events. Each event is one JSON chunk with the same shape as the non-streamed response, except that message is replaced by delta. The stream ends with data: [DONE].

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Write a haiku about Berlin."}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'

Wire format

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"Cold "},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"stone "},"finish_reason":null}]}

…

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":17,"total_tokens":29}}

data: [DONE]
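Below is a stdlib sketch of consuming this wire format line by line. `iter_sse_chunks` and `collect` are hypothetical helpers. They skip blank lines, stop at data: [DONE], and keep the usage object from the final chunk.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed JSON chunks from OpenAI-style SSE lines, stopping at [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators and any non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect(lines: Iterable[str]) -> Tuple[str, Optional[str], Optional[dict]]:
    """Join content deltas; also return the last finish_reason and usage seen."""
    text, finish, usage = [], None, None
    for chunk in iter_sse_chunks(lines):
        for choice in chunk.get("choices", []):
            text.append((choice.get("delta") or {}).get("content") or "")
            finish = choice.get("finish_reason") or finish
        usage = chunk.get("usage") or usage
    return "".join(text), finish, usage
```

An HTTP response from urllib.request.urlopen can be iterated line by line, so you can pass (line.decode("utf-8") for line in resp) to collect.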

Reliability & smart routing

Every model ID resolves to a pool of upstream providers behind the scenes. If the first one errors or times out, the request is automatically retried against the next provider for the same model, in order, before any failure is returned — you do not configure or trigger this. The model field in the response always reports the variant that actually answered. This is independent of the optional models / fallbacks array, which adds your own cross-model candidates on top: first the primary model exhausts its own provider chain, then each fallback model exhausts its chain.
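The ordering can be made concrete with a small Python model. This is an illustration only: routing happens server-side, and the provider names in `pools` are hypothetical because the real pools are internal to Airforce. `attempt_order` is likewise a hypothetical name.

```python
def attempt_order(primary: str, fallbacks: list, pools: dict) -> list:
    """List (model, provider) attempts in the order described above.

    pools maps a model ID to its upstream providers (hypothetical names here).
    A model missing from pools, like an unknown or plan-gated fallback, is skipped.
    """
    order = []
    for model in [primary, *fallbacks]:
        for provider in pools.get(model, []):
            order.append((model, provider))
    return order


pools = {"gpt-5.1-chat": ["provider-a", "provider-b"], "deepseek-v3.2": ["provider-c"]}
plan = attempt_order("gpt-5.1-chat", ["deepseek-v3.2", "not-a-model"], pools)
```

The first attempt that succeeds answers the request, and its model is what the response's model field reports.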

POST /v1/messages

Anthropic-compatible Messages API. Works with the official @anthropic-ai/sdk by setting baseURL to https://api.airforce. Requests for non-Claude models are forwarded transparently to OpenAI, Google, etc.

POST https://api.airforce/v1/messages

Request body

Parameter	Type	Required	Description
model	string	Required	Model ID (Anthropic format or a routed alias).
messages	array	Required	Each item: { role: "user" | "assistant", content: string | array }.
max_tokens	integer	Required	Required by Anthropic. Token limit for the response.
system	string | array	Optional	System prompt. Pass an array of { type: "text", text, cache_control? } blocks to mark cached prefix segments. See "Prompt caching".
temperature	float	Optional	0–1.
top_p	float	Optional	Nucleus sampling.
top_k	integer	Optional	Restrict the sampling pool to the top K tokens.
stop_sequences	array	Optional	Up to 4 stop sequences.
stream	boolean	Optional	When true, an Anthropic-style SSE event stream is emitted (see "Streaming events").
fallbacks	array	Optional	Fallback models (max 3) in Anthropic form: [{"model": "gpt-4o-mini"}]. If every channel of the primary model fails, each candidate is tried in order; you are billed for — and the response model field reports — the model that actually answered. A plain models string array is accepted too.
tools	array	Optional	Anthropic tool definitions: { name, description, input_schema }. The response can contain tool_use content blocks.
tool_choice	object	Optional	{ type: "auto" | "any" | "tool", name? }.
thinking	object	Optional	Anthropic extended thinking: { type: "enabled", budget_tokens: N }.

Example

curl https://api.airforce/v1/messages \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "max_tokens": 256,
    "system": "You are a helpful assistant.",
    "messages": [
      {"role": "user", "content": "Hello, Claude!"}
    ]
  }'

Response shape

Parameter	Type	Required	Description
id	string	Optional	Message ID, e.g. "msg_01ABCxyz".
type	string	Optional	Always "message".
role	string	Optional	Always "assistant".
content	array	Optional	Array of content blocks: { type: "text" | "tool_use" | "thinking", … }.
model	string	Optional	The model that actually answered: normally the requested model, or a fallback model when one was used (see fallbacks).
stop_reason	string	Optional	"end_turn" | "max_tokens" | "stop_sequence" | "tool_use".
usage	object	Optional	{ input_tokens, output_tokens, cache_read_input_tokens?, cache_creation_input_tokens?, cache_creation? }. Cache fields appear when prompt caching was used; cache_creation.ephemeral_5m_input_tokens and ephemeral_1h_input_tokens give the write split per TTL.
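Below is a sketch of flattening an Anthropic-shape response. `read_message` is a hypothetical helper. It assumes thinking blocks carry their text in a thinking field, as in Anthropic's native format.

```python
def read_message(resp: dict) -> dict:
    """Split an Anthropic-shape message into its text, thinking and tool_use parts."""
    blocks = resp.get("content") or []
    usage = resp.get("usage") or {}
    return {
        "text": "".join(b.get("text", "") for b in blocks if b.get("type") == "text"),
        # Assumes Anthropic's native thinking block shape: {"type": "thinking", "thinking": "..."}
        "thinking": "".join(b.get("thinking", "") for b in blocks if b.get("type") == "thinking"),
        "tool_uses": [b for b in blocks if b.get("type") == "tool_use"],
        "stop_reason": resp.get("stop_reason"),
        "cache_read_tokens": usage.get("cache_read_input_tokens", 0),
    }
```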

Streaming events

Anthropic SSE uses named events instead of anonymous JSON chunks. Each event has both an event: name and a data: JSON payload.

event: message_start
data: {"type":"message_start","message":{"id":"msg_01","role":"assistant","content":[],"model":"claude-sonnet-4.6","stop_reason":null,"usage":{"input_tokens":12,"output_tokens":1}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}

event: message_stop
data: {"type":"message_stop"}
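Below is a stdlib sketch of folding these events into a final result. `collect_anthropic_stream` is a hypothetical helper. It keys off the event: line and treats the usage in message_delta as cumulative, which is how Anthropic's native stream reports output_tokens.

```python
import json
from typing import Iterable


def collect_anthropic_stream(lines: Iterable[str]) -> dict:
    """Fold Anthropic-style named SSE events into final text, stop_reason and usage."""
    text, stop_reason, usage = [], None, {}
    event = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data = json.loads(line[len("data:"):].strip())
            if event == "message_start":
                usage.update(data["message"].get("usage", {}))
            elif event == "content_block_delta" and data["delta"].get("type") == "text_delta":
                text.append(data["delta"]["text"])
            elif event == "message_delta":
                stop_reason = data["delta"].get("stop_reason", stop_reason)
                usage.update(data.get("usage", {}))  # cumulative counts replace earlier ones
    return {"text": "".join(text), "stop_reason": stop_reason, "usage": usage}
```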

POST /v1/messages/count_tokens

Anthropic-compatible token counting. Send the same system / messages / tools you would pass to /v1/messages and get an input-token estimate back without running the model — nothing is billed.

POST https://api.airforce/v1/messages/count_tokens

curl https://api.airforce/v1/messages/count_tokens \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "system": "You are a helpful assistant.",
    "messages": [{"role": "user", "content": "Hello, Claude!"}]
  }'

# → {"input_tokens": 34}

The count is a fast character-based estimate (about 4 characters per token) over system, messages and tools — close enough for context-budget checks, not an exact tokenizer run.
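The same heuristic can run locally as a free pre-flight check. Below is a sketch with a hypothetical `estimate_input_tokens`. It counts raw text only, so it will not reproduce the server's figure: for the request above it returns 11, not 34. Call the endpoint when the number matters.

```python
import json
import math
from typing import Optional, Union


def estimate_input_tokens(system: Union[str, list], messages: list,
                          tools: Optional[list] = None) -> int:
    """Rough ~4-characters-per-token estimate over system, messages and tools."""
    # String content counts as-is; block arrays are serialised and counted roughly.
    chars = len(system) if isinstance(system, str) else len(json.dumps(system))
    for message in messages:
        content = message["content"]
        chars += len(content) if isinstance(content, str) else len(json.dumps(content))
    if tools:
        chars += len(json.dumps(tools))
    return math.ceil(chars / 4)
```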

Prompt caching

On /v1/messages with Claude models, you mark a prefix as cached by passing system as an array of blocks in which the cached segment carries cache_control: { type: "ephemeral" }. Subsequent requests that start with the same prefix are billed at the cheaper cache-read rate. Models with supports_caching: true in /v1/models support this.

Write vs read pricing

Cache writes are typically charged slightly above normal input (about 1.25× on Claude-family models). Cache reads are much cheaper (about 0.1× input). A large write with almost no later read is the expensive case — not a “cache discount”. Only reusing the same prefix turns the write into savings.

Tools like Claude Code often attach a large project context with cache markers on the first turns. Expect cache-write spend while the repo/system prefix is loaded; later turns only get cheap if that prefix is stable and reused. Subagents and multi-step agents can multiply large contexts across several requests.
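Simple arithmetic shows when caching pays off. This sketch prices one cached prefix in units of its uncached input cost, using the approximate 1.25× write and 0.1× read multipliers above. `prefix_cost` is a hypothetical helper, and real prices vary per model.

```python
def prefix_cost(calls: int, write_mult: float = 1.25, read_mult: float = 0.1) -> tuple:
    """Cost of sending one prefix `calls` times, in units of its uncached price.

    Returns (uncached, cached): uncached pays 1.0 per call; cached pays one
    write, then one read for every later call within the TTL.
    """
    if calls < 1:
        raise ValueError("calls must be >= 1")
    uncached = float(calls)
    cached = write_mult + (calls - 1) * read_mult
    return uncached, cached
```

With those defaults, one call costs 1.25 units instead of 1.0 (the expensive case above), two calls cost 1.35 instead of 2.0, and ten calls cost 2.15 instead of 10.0.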

Models with prompt caching support: every entry in GET /v1/models with supports_caching: true.

{
  "model": "claude-sonnet-4.6",
  "max_tokens": 1024,
  "system": [
    {"type": "text", "text": "You are a senior staff engineer at Airforce."},
    {
      "type": "text",
      "text": "<repository-snapshot>...</repository-snapshot>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Where is rate limiting enforced?"}
  ]
}
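Below is a sketch of generating that system array so the stable segment is rebuilt from the same source on every call, which keeps the prefix byte-identical between calls. `cached_system` is a hypothetical helper. Its optional ttl argument writes the "1h" marker described further down this page.

```python
from typing import Optional


def cached_system(instructions: str, stable_context: str, ttl: Optional[str] = None) -> list:
    """Build a /v1/messages system array whose second block is a cache breakpoint.

    Pass the same stable_context on every call: any byte change misses the cache.
    """
    marker = {"type": "ephemeral"}
    if ttl is not None:
        marker["ttl"] = ttl  # e.g. "1h"; omit for the 5-minute default
    return [
        {"type": "text", "text": instructions},
        {"type": "text", "text": stable_context, "cache_control": marker},
    ]


system = cached_system(
    "You are a senior staff engineer at Airforce.",
    "<repository-snapshot>...</repository-snapshot>",
)
```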

How cache counts are reported in the response

Cache token counts are passed through in each format's native shape, so SDKs (openai, @anthropic-ai/sdk, @google/genai) read them without custom code. Fields are omitted when their value is zero, which keeps non-cached responses compact.

/v1/chat/completions (OpenAI shape)

"usage": {
  "prompt_tokens": 2104,
  "completion_tokens": 147,
  "total_tokens": 2251,
  "prompt_tokens_details": { "cached_tokens": 1980 },
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1/messages (Anthropic shape)

"usage": {
  "input_tokens": 2104,
  "output_tokens": 147,
  "cache_read_input_tokens": 1980,
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1beta/.../generateContent (Gemini shape)

"usageMetadata": {
  "promptTokenCount": 2104,
  "candidatesTokenCount": 147,
  "totalTokenCount": 2251,
  "cachedContentTokenCount": 1980
}
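Below is a sketch that reads cache counts from any of these shapes, and also from /v1/responses, which reports reads in input_tokens_details.cached_tokens. `cache_counts` is a hypothetical helper. The Gemini shape carries no write count, so writes come back as 0 there.

```python
def cache_counts(usage: dict) -> tuple:
    """Return (cache_read_tokens, cache_write_tokens) from any supported usage shape."""
    reads = (
        usage.get("cache_read_input_tokens")  # Anthropic shape
        or (usage.get("prompt_tokens_details") or {}).get("cached_tokens")  # OpenAI shape
        or (usage.get("input_tokens_details") or {}).get("cached_tokens")  # /v1/responses
        or usage.get("cachedContentTokenCount")  # Gemini usageMetadata
        or 0
    )
    writes = usage.get("cache_creation_input_tokens") or 0  # absent in the Gemini shape
    return reads, writes
```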

Where caching applies

Explicit cache_control markers are honoured on /v1/messages and /v1/chat/completions for Claude models; put them on system or message content blocks. Many other providers (OpenAI family, DeepSeek, Gemini) cache automatically: you send no markers and simply see cached_tokens in the response once a long enough prefix is reused.

Cache duration: 5 minutes or 1 hour

A cached prefix lives for 5 minutes by default, and the timer refreshes on every hit. For a longer-lived prefix, add ttl: "1h" to the marker. The response reports each TTL separately under cache_creation.

"cache_control": { "type": "ephemeral", "ttl": "1h" }

Example: first a write, then a read

Send exactly the same request twice (the caching example above). The first call that sees the prefix pays a one-time cache write; identical calls within the TTL pay the much cheaper cache read.

First call, cache write (usage fragment):

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 1980,
  "cache_read_input_tokens": 0
}

Second identical call within the TTL, cache read:

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 1980
}

Limits & costs

Claude requires a minimum cacheable prefix (about 1024 tokens; more for some models). Shorter prefixes are simply not cached.
Up to 4 cache breakpoints per request, and the cached prefix must be byte-identical between calls: even a one-character change misses the cache.
Cache writes cost more than normal input (5m ≈ 1.25×, 1h ≈ 2×); reads cost much less (≈ 0.1×). See the per-model cache prices on the pricing page.
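Below is a sketch of checking the breakpoint limit before sending. `count_breakpoints` and `check_breakpoints` are hypothetical helpers. They only look at system and message content blocks, which is where this page says to put markers.

```python
MAX_BREAKPOINTS = 4  # per-request limit stated above


def count_breakpoints(body: dict) -> int:
    """Count cache_control markers on system and message content blocks."""
    blocks = list(body["system"]) if isinstance(body.get("system"), list) else []
    for message in body.get("messages", []):
        if isinstance(message.get("content"), list):
            blocks += message["content"]
    return sum(1 for b in blocks if isinstance(b, dict) and "cache_control" in b)


def check_breakpoints(body: dict) -> None:
    """Raise before sending if the request would exceed the breakpoint limit."""
    n = count_breakpoints(body)
    if n > MAX_BREAKPOINTS:
        raise ValueError(f"{n} cache breakpoints; at most {MAX_BREAKPOINTS} are allowed")
```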

POST /v1/responses

OpenAI Responses API surface for stateful conversations, with the same Bearer/x-api-key auth. Cache counts appear as input_tokens_details.cached_tokens (reads) plus the flat cache_creation_input_tokens and cache_creation.ephemeral_* fields (writes), for parity with /v1/chat/completions.

POST https://api.airforce/v1/responses

POST /v1beta/models/{model}:generateContent

Google Gemini-compatible endpoint. Works with the official @google/genai SDK and the Gemini CLI by pointing the base URL at https://api.airforce/v1beta. Any routed model works — requests are translated to and from the native Gemini shape, and the model is taken from the URL path (not the body).

POST https://api.airforce/v1beta/models/{model}:generateContent

Authentication

Pass your Airforce API key any of the three ways Google clients use:

# 1) query parameter (Google default)
?key=sk-air-YOUR_API_KEY

# 2) header
x-goog-api-key: sk-air-YOUR_API_KEY

# 3) bearer token
Authorization: Bearer sk-air-YOUR_API_KEY

Request body

Parameter	Type	Required	Description
contents	array	Required	Conversation turns. Each: { role: "user" | "model", parts: [...] }. A part is { text }, { functionCall: { name, args } }, or { functionResponse: { name, response } }. "model" is Gemini's term for the assistant role.
systemInstruction	object	Optional	System prompt: { parts: [{ text }] }.
generationConfig	object	Optional	{ temperature, maxOutputTokens, topP, stopSequences } — mapped to the canonical sampling parameters.
tools	array	Optional	Tool definitions: [{ functionDeclarations: [{ name, description, parameters }] }]. functionDeclarations are flattened across entries.
toolConfig	object	Optional	Tool-choice control: { functionCallingConfig: { mode: "AUTO" | "ANY" | "NONE" } }. ANY forces a call, NONE disables tools.

Example

curl "https://api.airforce/v1beta/models/gemini-3.1-pro:generateContent" \
  -H "x-goog-api-key: sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {"role": "user", "parts": [{"text": "What is the capital of France?"}]}
    ],
    "systemInstruction": {"parts": [{"text": "You are a helpful assistant."}]},
    "generationConfig": {"temperature": 0.7, "maxOutputTokens": 256}
  }'
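Below is a sketch of producing this body from OpenAI-style messages. `gemini_url` and `to_gemini_body` are hypothetical helpers. They handle plain-text system, user and assistant turns only (assistant becomes "model"), and they leave tool turns out.

```python
from typing import Optional

GEMINI_BASE = "https://api.airforce/v1beta"


def gemini_url(model: str, stream: bool = False) -> str:
    """The model is part of the URL path, not the request body."""
    action = "streamGenerateContent" if stream else "generateContent"
    return f"{GEMINI_BASE}/models/{model}:{action}"


def to_gemini_body(messages: list, temperature: Optional[float] = None,
                   max_tokens: Optional[int] = None) -> dict:
    """Convert plain-text OpenAI-style messages into a Gemini request body."""
    body = {"contents": []}
    system = [m["content"] for m in messages if m["role"] == "system"]
    if system:
        body["systemInstruction"] = {"parts": [{"text": text} for text in system]}
    for m in messages:
        if m["role"] in ("user", "assistant"):
            role = "model" if m["role"] == "assistant" else "user"  # Gemini's name for the assistant
            body["contents"].append({"role": role, "parts": [{"text": m["content"]}]})
    config = {}
    if temperature is not None:
        config["temperature"] = temperature
    if max_tokens is not None:
        config["maxOutputTokens"] = max_tokens
    if config:
        body["generationConfig"] = config
    return body
```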

Response shape

Parameter	Type	Required	Description
candidates	array	Optional	Generated turns: [{ content: { role: "model", parts }, finishReason, index }]. Only the first candidate is populated.
candidates[].finishReason	string	Optional	"STOP" | "MAX_TOKENS" | "SAFETY" | "OTHER".
usageMetadata	object	Optional	{ promptTokenCount, candidatesTokenCount, totalTokenCount, cachedContentTokenCount? }. cachedContentTokenCount appears when the upstream reported a cache read.
modelVersion	string	Optional	Echo of the requested model.

{
  "candidates": [{
    "content": {
      "role": "model",
      "parts": [{"text": "The capital of France is Paris."}]
    },
    "finishReason": "STOP",
    "index": 0
  }],
  "usageMetadata": {
    "promptTokenCount": 16,
    "candidatesTokenCount": 8,
    "totalTokenCount": 24
  },
  "modelVersion": "gemini-3.1-pro"
}

POST /v1beta/models/{model}:streamGenerateContent

Streaming uses the :streamGenerateContent action and returns Server-Sent Events. Each data: line is a full Gemini-shaped chunk (not a delta object); the final chunk carries usageMetadata.

data: {"candidates":[{"content":{"role":"model","parts":[{"text":"The capital"}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[{"text":" is Paris."}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[]},"finishReason":"STOP","index":0}],"usageMetadata":{"promptTokenCount":16,"candidatesTokenCount":8,"totalTokenCount":24}}
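Below is a sketch of joining these chunks, each of which carries only the new text in its parts, as in the example above. `collect_gemini_stream` is a hypothetical helper that keeps the last finishReason and usageMetadata it sees.

```python
import json
from typing import Iterable


def collect_gemini_stream(lines: Iterable[str]) -> dict:
    """Join the text parts of Gemini-shaped SSE chunks; keep the last finishReason and usage."""
    text, finish, usage = [], None, None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue
        chunk = json.loads(line[len("data:"):].strip())
        for candidate in chunk.get("candidates", []):
            for part in (candidate.get("content") or {}).get("parts", []):
                text.append(part.get("text", ""))
            finish = candidate.get("finishReason", finish)
        usage = chunk.get("usageMetadata", usage)
    return {"text": "".join(text), "finishReason": finish, "usageMetadata": usage}
```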

List models

The catalog is also exposed in Gemini Model-resource shape so Google clients can enumerate models.

curl https://api.airforce/v1beta/models

Notes: the base URL is https://api.airforce/v1beta (or /v1), not Google's host. The model name comes from the URL path, not the request body. Only the first candidate is returned, and a subset of Gemini fields is translated — safetySettings and cachedContent are currently ignored. Billing, rate limits and smart routing apply exactly as on /v1/chat/completions.

Errors

Airforce returns standard HTTP status codes and a uniform error envelope on both the OpenAI- and the Anthropic-compatible endpoints.

Status	Type	Description
400	invalid_request_error	Malformed JSON, a missing required field, or an unknown model.
401	invalid_request_error / auth_required	Missing or invalid API key.
402	insufficient_quota	The model requires an active subscription or a positive Pay-as-you-Go balance.
403	model_access_denied / insufficient_scope	Plan or per-key permissions deny this request.
404	model_not_found	The requested model does not exist or you do not have access to it.
429	rate_limit_error	Request rate or daily token limit exceeded.
503	api_error / moderation_unavailable	All upstream keys for the requested provider failed.

{
  "error": {
    "message": "The requested model does not exist or you do not have access to it.",
    "type": "model_not_found",
    "param": null,
    "code": "404"
  }
}

The descriptive slug is in type. code is the HTTP status as a string (e.g. "404"), and param is null except for validation errors on parameter ranges, where it names the offending parameter.
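Below is a sketch of turning the envelope into values a client can branch on. `parse_error` is a hypothetical helper. Treating 429 and 503 as retryable is an assumption (rate limits and upstream failures are usually transient), not a documented Airforce rule.

```python
import json

# Assumption, not a documented rule: 429 and 503 are usually transient.
RETRYABLE_STATUSES = {429, 503}


def parse_error(status: int, raw_body: str) -> dict:
    """Decode the error envelope into fields a client can branch on."""
    err = json.loads(raw_body).get("error") or {}
    return {
        "status": status,
        "type": err.get("type"),    # descriptive slug, e.g. "model_not_found"
        "code": err.get("code"),    # the HTTP status as a string, e.g. "404"
        "param": err.get("param"),  # set only for parameter-range validation errors
        "message": err.get("message"),
        "retryable": status in RETRYABLE_STATUSES,
    }
```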

Discover models

See the full list of model IDs and their capability flags (vision, tools, reasoning, caching, context length, ...) at /docs/api/models.

curl https://api.airforce/v1/models \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY"
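Below is a sketch of filtering the catalog by capability flag. `fetch_models` and `models_with` are hypothetical helpers. The listing shape is an assumption: an OpenAI-style {"data": [...]} object whose entries carry flags such as supports_reasoning as top-level keys.

```python
import json
import urllib.request


def fetch_models(api_key: str) -> dict:
    """GET /v1/models and decode the JSON listing."""
    req = urllib.request.Request(
        "https://api.airforce/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def models_with(listing: dict, flag: str) -> list:
    """IDs whose capability flag (e.g. supports_reasoning) is exactly true."""
    return [m["id"] for m in listing.get("data", []) if m.get(flag) is True]


# reasoning_ids = models_with(fetch_models("sk-air-YOUR_API_KEY"), "supports_reasoning")
```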