API REFERENCE

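Chat Completions

Generate chat responses across 100+ models through a single API. Compatible with OpenAI Chat Completions, Anthropic Messages and OpenAI Responses.

Airforce supports both the OpenAI Chat Completions and Anthropic Messages wire formats on the same set of models. Pick the SDK you already use and change only the base URL; non-Claude models are forwarded transparently behind either interface.

This page covers authentication, the request and response structure of both interfaces, streaming, tool calling, vision, reasoning and prompt caching. New here? Start with the basic example below to get a single call working, then add streaming, tools or caching step by step.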

Authentication

Every request requires a Bearer token (your Airforce API key). The Anthropic x-api-key header is also accepted on /v1/messages for SDK compatibility.

Authorization: Bearer sk-air-YOUR_API_KEY
# alt for /v1/messages:
x-api-key: sk-air-YOUR_API_KEY

POST /v1/chat/completions

OpenAI-compatible chat completions. Works with the official openai SDK by overriding base_url to https://api.airforce/v1.

POST https://api.airforce/v1/chat/completions

Request body

Parameter	Type	Required	Description
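model	string	Required	Model ID. Use GET /v1/models to discover the available IDs.
messages	array	Required	Conversation history. Each entry has { role: "system" \| "user" \| "assistant" \| "tool", content }. content is a string or an array of content blocks (for vision, see below).
max_tokens	integer	Optional	Maximum number of tokens to generate. Capped at the model's max_output_tokens.
temperature	float	Optional	Sampling temperature, 0–2. Lower values are more deterministic. The default depends on the upstream provider.
top_p	float	Optional	Nucleus sampling. Use temperature or top_p, not both.
stream	boolean	Optional	When true, the response is a server-sent event stream. See "Streaming" below.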
models	array	Optional	Fallback models (max 3), e.g. ["deepseek-v3.2", "gpt-4o-mini"]. If every channel of the primary model fails, each candidate is tried in order. You are billed for — and response.model reports — the model that actually answered. Unknown or plan-gated candidates are skipped. With the OpenAI SDK pass it via extra_body.
transforms	array	Optional	Prompt transforms. Supported: ["middle-out"] — when the conversation overflows the model's context window, whole messages are dropped from the middle (system prompts, the first message and the most recent turns are kept), so long roleplay or agent histories keep working instead of erroring. Opt-in; off by default.
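stream_options	object	Optional	{ include_usage: boolean }. Usage is always included in the final stream chunk; this field is accepted for OpenAI compatibility but cannot turn usage off.
stop	string \| array	Optional	Up to 4 stop sequences. Generation stops as soon as one of them is produced.
tools	array	Optional	Function definitions the model may call. See "Tool calling" below.
tool_choice	string \| object	Optional	"auto" (default), "none", or { type: "function", function: { name } } to force a specific call.
response_format	object	Optional	{ type: "json_object" } forces the model to emit valid JSON. Ignored on models that do not support it. See "Structured outputs" for the json_schema mode.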
reasoning_effort	string	Optional	Reasoning depth: "low" \| "medium" \| "high" \| "xhigh" \| "max". Any model with supports_reasoning: true (Claude, OpenAI o/GPT-5, Gemini, Qwen, DeepSeek, …). See "Reasoning & thinking".
thinking	string \| object	Optional	Cross-model thinking switch. "on" \| "off" \| "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" \| "disabled" }. See "Reasoning & thinking".
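thinking_budget	integer	Optional	Token cap for the model's reasoning trace (when the provider exposes one).
ignore_defaults	boolean	Optional	Skips, for this request, the per-model default parameters the user has saved (configured in the dashboard).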
skill	string	Optional	ID of a single marketplace skill to apply to this request. The skill transforms your messages/parameters before the upstream call and overrides any installed-skill defaults. Consumed by Airforce, never forwarded upstream. See the Skills catalog at /docs/api/extend.
skills	array	Optional	Array of marketplace skill IDs applied in order, for stacking multiple skills on one request.

Basic example

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
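The same call can be assembled in Python. The sketch below uses only the standard library and builds the request without sending it; build_chat_request is a hypothetical helper name, not part of any SDK, and the payload mirrors the curl example above.

```python
import json
import urllib.request

BASE_URL = "https://api.airforce/v1"  # base URL given on this page


def build_chat_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "model": "gpt-5.1-chat",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "max_tokens": 200,
    "temperature": 0.7,
}
request = build_chat_request("sk-air-YOUR_API_KEY", payload)
# To send it: urllib.request.urlopen(request), then json.load() the response body.
```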

Response shape

Parameter	Type	Required	Description
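id	string	Optional	Stable completion ID, e.g. "chatcmpl-abc123".
object	string	Optional	"chat.completion" for non-streamed responses, "chat.completion.chunk" for streamed ones.
created	integer	Optional	Unix timestamp (seconds).
model	string	Optional	ID of the model that answered. Normally echoes the requested ID; when a fallback from models answered, it reports that model instead.
choices	array	Optional	Array of completion candidates: [{ index, message: { role, content, tool_calls? }, finish_reason }].
choices[].finish_reason	string	Optional	"stop" \| "length" \| "tool_calls" \| "content_filter".
usage	object	Optional	{ prompt_tokens, completion_tokens, total_tokens, completion_tokens_details?, prompt_tokens_details?, cache_creation_input_tokens?, cache_creation? }. completion_tokens_details.reasoning_tokens is set when the model produced a reasoning trace. The cache fields appear when the upstream returns prompt-cache information: prompt_tokens_details.cached_tokens reports cache reads (the OpenAI standard), cache_creation_input_tokens totals the writes, and cache_creation.ephemeral_5m_input_tokens / ephemeral_1h_input_tokens give the split by TTL.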

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "gpt-5.1-chat",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 8,
    "total_tokens": 28
  }
}
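A minimal sketch of reading such a response in Python, assuming the body has already been decoded into a dict. summarize is a hypothetical helper; it only touches fields listed in the table above.

```python
def summarize(response: dict) -> dict:
    """Pull the answer text, finish reason and token total from a completion."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    finish_reason = choice.get("finish_reason")
    return {
        "text": choice["message"].get("content"),
        "finish_reason": finish_reason,
        # "length" means the answer was cut off at max_tokens.
        "truncated": finish_reason == "length",
        "total_tokens": usage.get("total_tokens"),
    }


sample = {
    "model": "gpt-5.1-chat",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "The capital of France is Paris."},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 20, "completion_tokens": 8, "total_tokens": 28},
}
summary = summarize(sample)
```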

Reasoning & thinking

Reasoning/thinking is a cross-model feature for every model ID with supports_reasoning: true — Claude, OpenAI o-series/GPT-5, Gemini, Qwen, DeepSeek, and others. You send the same canonical parameters; Airforce maps them to each provider's native shape. This is not a DeepSeek-only API.

Source of truth: check for supports_reasoning: true on the model in GET /v1/models (or GET /api/models/{id}/allowed-params). Prefer that flag over guessing from the model name.

Canonical parameters

Parameter	Type	Required	Description
reasoning_effort	string	Optional	"low" \| "medium" \| "high" \| "xhigh" \| "max". Accepted on every model with supports_reasoning: true. Some upstreams only honour a subset (e.g. high/max); others clamp unsupported levels to the nearest served value.
thinking	string \| object	Optional	Three accepted shapes (we normalise): "on" \| "off" \| "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" \| "disabled" }. Mapped onto Claude extended thinking, OpenAI effort profiles, Gemini thinking_config, Qwen enable_thinking, DeepSeek hybrid, etc.
thinking_budget	integer	Optional	Maximum tokens the model may spend reasoning before emitting visible output. Mirrors budget_tokens when the upstream exposes a budget; takes precedence over reasoning_effort when both are sent and a budget is available.

What differs by family (mapping only)

Parameters are the same everywhere. Only how we map them (and how hard "off" is) differs:

Claude — Thinking on/off + budget; often also reasoning_effort via the gateway.
OpenAI (o1/o3, GPT-5) — Mainly reasoning_effort. A full "thinking off" is often not available — you control how strongly the model reasons, not always whether it reasons at all.
Gemini — thinking_config / budget mapped internally.
Qwen / Xiaomi / Alibaba — thinking + enable_thinking-style controls.
DeepSeek (generic) — Hybrid on/off is especially clear: thinking: { type: enabled|disabled } plus optional reasoning_effort.
Resellers / other — Often generic passthrough of the same canonical fields.

Controlling where the trace appears

An optional reasoning object on the request decides what happens to the thinking trace. It is consumed by Airforce and never forwarded upstream.

Parameter	Type	Required	Description
reasoning.format	string	Optional	"separate" (default) puts the trace in message.reasoning (and delta.reasoning while streaming). "inline" keeps the legacy inline <think>…</think> form inside content.
reasoning.exclude	boolean	Optional	When true, the reasoning trace is dropped entirely from the response. Reasoning tokens are still counted and billed if the model produced them.

"reasoning": { "format": "separate", "exclude": false }

Reasoning effort (OpenAI style)

Primary control for o-series and GPT-5: how much the model may reason. Same canonical field as on every other supports_reasoning model — OpenAI is included, but behaviour is not 1:1 with DeepSeek's hard on/off.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "o3-mini",
    "messages": [{"role": "user", "content": "Prove the Pythagorean theorem."}],
    "reasoning_effort": "high"
  }'

Extended thinking (Anthropic style)

Budget-based thinking for Claude (and gateways that accept the Anthropic shape). You can still send reasoning_effort; we map when the channel supports it.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "messages": [{"role": "user", "content": "Plan a 7-day Italy trip."}],
    "thinking": {"type": "enabled", "budget_tokens": 4000}
  }'

Hybrid thinking (e.g. DeepSeek V3.2/V4)

Example of a hybrid model family with a clear Thinking / Non-Thinking switch — not a separate protocol. deepseek-v3.2, deepseek-v4-flash and deepseek-v4-pro accept the same canonical fields as every other supports_reasoning model. Toggle thinking and optionally set effort in one request:

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Solve this step by step: integrate x^2 * e^x."}],
    "thinking": {"type": "enabled"},
    "reasoning_effort": "high"
  }'

Turn thinking off (faster, cheaper when you only need the final answer) — this hard off is clearer on hybrid models than on many OpenAI o-series profiles:

"thinking": {"type": "disabled"}
// or simply: "thinking": "off"

Native docs for this family often list effort levels such as "high" and "max". We accept the full low…max scale and map unsupported levels to the nearest value that reaches the model. Prefer the hybrid IDs above over retired deepseek-chat / deepseek-reasoner names when you need an explicit on/off switch.

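The reasoning trace itself appears in choices[0].message.reasoning (OpenAI shape) or as a thinking block in content (Anthropic format). Reasoning tokens are billed and reported in usage.completion_tokens_details.reasoning_tokens.

The completion_tokens_details.reasoning_tokens breakdown only appears when the upstream provider reports it. In streamed responses, the trace arrives chunk by chunk via delta.reasoning_content.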
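A small helper can read the trace from either shape. This is a sketch under one assumption that this page does not spell out: that an Anthropic-format thinking block carries its text under the key "thinking", as in Anthropic's native format. extract_reasoning is a hypothetical name.

```python
from typing import Optional


def extract_reasoning(message: dict) -> Optional[str]:
    """Return the reasoning trace from either response shape, or None."""
    # OpenAI shape: the trace is a sibling of content.
    if isinstance(message.get("reasoning"), str):
        return message["reasoning"]
    # Anthropic shape: content is a list of blocks, some of type "thinking".
    content = message.get("content")
    if isinstance(content, list):
        parts = [
            block.get("thinking", "")  # assumed field name for the block text
            for block in content
            if isinstance(block, dict) and block.get("type") == "thinking"
        ]
        return "".join(parts) or None
    return None
```

If the request set reasoning.exclude to true, neither shape contains a trace and the helper returns None.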

Vision & image input

Models with supports_vision: true accept images embedded as content blocks. Both public URLs and Base64 data URLs work; size limits depend on the upstream model.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
      ]
    }]
  }'
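For a local file, the image can be sent as a Base64 data URL instead of a public URL. The sketch below shows one way to build such a content block in Python; the helper names are hypothetical and the MIME type must match the actual image.

```python
import base64


def image_block_from_bytes(data: bytes, mime_type: str = "image/jpeg") -> dict:
    """Wrap raw image bytes in an image_url content block using a data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def vision_message(prompt: str, image: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build a user message carrying one text block and one image block."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            image_block_from_bytes(image, mime_type),
        ],
    }
```

Typical use would read the bytes with open(path, "rb").read() and place the returned message in the messages array.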

Tool calling

Models with supports_tools: true can call functions you define. The model returns a tool_calls array; you run the call, then pass the result back in a tool message.

Request

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto"
  }'

Response with a tool call

{
  "id": "chatcmpl-abc123",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"Paris\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

Follow-up with the tool result

{
  "model": "gpt-5.1-chat",
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}
      }]
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 14, \"sky\": \"cloudy\"}"}
  ]
}
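The round trip above can be automated with a small dispatcher. The following is a sketch, not a prescribed client: get_weather is a hypothetical local function returning canned data, and TOOLS and run_tool_calls are illustrative names.

```python
import json


def get_weather(location: str) -> dict:
    """Hypothetical local tool; a real one would query a weather service."""
    return {"temp_c": 14, "sky": "cloudy"}


TOOLS = {"get_weather": get_weather}


def run_tool_calls(messages: list, assistant_message: dict) -> list:
    """Return a new history: prior messages, the assistant turn, tool results."""
    follow_up = messages + [assistant_message]
    for call in assistant_message.get("tool_calls") or []:
        func = TOOLS[call["function"]["name"]]
        # arguments arrives as a JSON-encoded string, not an object.
        args = json.loads(call["function"]["arguments"])
        result = func(**args)
        follow_up.append({
            "role": "tool",
            "tool_call_id": call["id"],  # must match the id of the call
            "content": json.dumps(result),
        })
    return follow_up
```

The returned list is what goes into messages on the follow-up request.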

Assistant prefill

End your messages array with an assistant message that already contains some text, and the model continues from it instead of starting a fresh turn. This is a reliable way to force a response to begin a specific way — a leading "{" for JSON, a chosen language, or a fixed prefix. The same trick works on /v1/messages. Providers that reject native prefill are handled automatically: the gateway retries once with a compatible rewrite, so you do not have to special-case them.

{
  "model": "claude-sonnet-4.6",
  "messages": [
    {"role": "user", "content": "List three primary colors as a JSON array."},
    {"role": "assistant", "content": "["}
  ]
}

Structured outputs

Set response_format to make the model return JSON. Two modes are supported:

{ "type": "json_object" } — the response is a single valid JSON value.
{ "type": "json_schema", "json_schema": { "name", "schema", "strict" } } — the model is steered to produce JSON that matches your JSON Schema.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Extract the city and country: I live in Paris, France."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "location",
        "schema": {
          "type": "object",
          "properties": { "city": {"type": "string"}, "country": {"type": "string"} },
          "required": ["city", "country"]
        }
      }
    }
  }'

Reliability: even when a model wraps its answer in prose or a markdown code fence, Airforce extracts the JSON payload, so you receive parseable content whenever valid JSON can be recovered. If none can be recovered, the original text is returned unchanged, so the extraction never makes a response worse. This applies to non-streamed responses; streamed responses are passed through unchanged.
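Because streamed responses are passed through unchanged, a client that streams may want a similar recovery step of its own. The sketch below is a client-side approximation, not Airforce's actual extraction logic: it tries the whole text, then any fenced block, then the outermost braces. recover_json is a hypothetical name, and a None result is indistinguishable from a JSON null.

```python
import json
import re
from typing import Any, Optional

FENCE_MARK = "`" * 3  # a markdown code fence, built without typing it literally
FENCE = re.compile(FENCE_MARK + r"(?:json)?\s*(.*?)" + FENCE_MARK, re.DOTALL)


def recover_json(text: str) -> Optional[Any]:
    """Best-effort JSON recovery from a model answer; None if nothing parses."""
    candidates = [text]
    candidates += FENCE.findall(text)  # contents of fenced blocks
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start:end + 1])  # outermost {...} span
    for candidate in candidates:
        try:
            return json.loads(candidate.strip())
        except json.JSONDecodeError:
            continue
    return None
```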

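Streaming

Set stream: true to receive partial completions as server-sent events. Each event is a JSON chunk with the same shape as the non-streamed response, except that message is replaced by delta. The stream ends with data: [DONE].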

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Write a haiku about Berlin."}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'

Wire format

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"Cold "},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"stone "},"finish_reason":null}]}

…

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":17,"total_tokens":29}}

data: [DONE]
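The wire format above can be consumed line by line. This sketch assumes the caller already has an iterable of decoded text lines (for example from an HTTP response object) and only handles the fields shown in the sample; collect_stream is a hypothetical name.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Concatenate delta.content across chunks and return (text, usage)."""
    text_parts = []
    usage = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        usage = chunk.get("usage") or usage  # carried by the final chunk
        for choice in chunk.get("choices", []):
            text_parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(text_parts), usage
```

A real client would usually print or forward each delta as it arrives instead of buffering the whole answer.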

Reliability & smart routing

Every model ID resolves to a pool of upstream providers behind the scenes. If the first one errors or times out, the request is automatically retried against the next provider for the same model, in order, before any failure is returned — you do not configure or trigger this. The model field in the response always reports the variant that actually answered. This is independent of the optional models / fallbacks array, which adds your own cross-model candidates on top: first the primary model exhausts its own provider chain, then each fallback model exhausts its chain.
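The attempt order described above can be pictured with a short simulation. It is an illustration of the ordering only, not the gateway's implementation: the provider pools, the attempt callback and the name first_success are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple


def first_success(
    primary: str,
    fallbacks: List[str],
    providers: Dict[str, List[str]],
    attempt: Callable[[str, str], bool],
) -> Optional[Tuple[str, str]]:
    """Try every provider of the primary model, then each fallback's chain."""
    for model in [primary, *fallbacks]:
        # A model with no known provider pool is skipped.
        for provider in providers.get(model, []):
            if attempt(model, provider):
                return model, provider  # the model that actually answered
    return None  # every chain was exhausted


pools = {"gpt-5.1-chat": ["provider-a", "provider-b"], "deepseek-v3.2": ["provider-c"]}
tried = []


def fake_attempt(model: str, provider: str) -> bool:
    tried.append((model, provider))
    return provider == "provider-c"  # pretend only this provider is healthy


winner = first_success("gpt-5.1-chat", ["unknown-model", "deepseek-v3.2"], pools, fake_attempt)
```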

POST /v1/messages

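Anthropic-compatible Messages API. Works with the official @anthropic-ai/sdk by setting baseURL to https://api.airforce. Requests for non-Claude models are forwarded transparently to OpenAI, Google and other providers.

POST https://api.airforce/v1/messages

Request body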

Parameter	Type	Required	Description
model	string	Required	Model ID (Anthropic format or a routing alias).
messages	array	Required	Each entry: { role: "user" \| "assistant", content: string \| array }.
max_tokens	integer	Required	Required by Anthropic. Token cap for the response.
system	string \| array	Optional	System prompt. Pass an array of { type: "text", text, cache_control? } blocks to mark cached prefix segments. See "Prompt caching".
temperature	float	Optional	0–1.
top_p	float	Optional	Nucleus sampling.
top_k	integer	Optional	Restricts the sampling pool to the top K tokens.
stop_sequences	array	Optional	Up to 4 stop sequences.
stream	boolean	Optional	When true, emits an Anthropic-style SSE event stream (see "Streaming events").
fallbacks	array	Optional	Fallback models (max 3) in Anthropic form: [{"model": "gpt-4o-mini"}]. If every channel of the primary model fails, each candidate is tried in order; you are billed for — and the response model field reports — the model that actually answered. A plain models string array is accepted too.
tools	array	Optional	Anthropic tool definitions: { name, description, input_schema }. The response may contain tool_use content blocks.
tool_choice	object	Optional	{ type: "auto" \| "any" \| "tool", name? }.
thinking	object	Optional	Anthropic extended thinking: { type: "enabled", budget_tokens: N }.

Example

curl https://api.airforce/v1/messages \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "max_tokens": 256,
    "system": "You are a helpful assistant.",
    "messages": [
      {"role": "user", "content": "Hello, Claude!"}
    ]
  }'

Response shape

Parameter	Type	Required	Description
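id	string	Optional	Message ID, e.g. "msg_01ABCxyz".
type	string	Optional	Always "message".
role	string	Optional	Always "assistant".
content	array	Optional	Array of content blocks: { type: "text" \| "tool_use" \| "thinking", … }.
model	string	Optional	ID of the model that answered. Normally echoes the requested model; when a fallback answered, it reports that model instead.
stop_reason	string	Optional	"end_turn" \| "max_tokens" \| "stop_sequence" \| "tool_use".
usage	object	Optional	{ input_tokens, output_tokens, cache_read_input_tokens?, cache_creation_input_tokens?, cache_creation? }. The cache fields appear when prompt caching was used. cache_creation.ephemeral_5m_input_tokens and ephemeral_1h_input_tokens give the write split by TTL.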

Streaming events

Anthropic SSE uses named events rather than uniform JSON chunks. Each event has an event: name and a data: JSON payload.

event: message_start
data: {"type":"message_start","message":{"id":"msg_01","role":"assistant","content":[],"model":"claude-sonnet-4.6","stop_reason":null,"usage":{"input_tokens":12,"output_tokens":1}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}

event: message_stop
data: {"type":"message_stop"}
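A sketch of a reader for this event stream, assuming an iterable of decoded text lines and handling only the event types shown above. collect_anthropic_stream is a hypothetical name.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_anthropic_stream(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Accumulate text_delta fragments and return (text, stop_reason)."""
    text_parts = []
    stop_reason = None
    event = None
    for line in lines:
        line = line.strip()
        if line.startswith("event:"):
            event = line[len("event:"):].strip()  # name applies to the next data line
        elif line.startswith("data:"):
            data = json.loads(line[len("data:"):].strip())
            if event == "content_block_delta":
                delta = data.get("delta", {})
                if delta.get("type") == "text_delta":
                    text_parts.append(delta.get("text", ""))
            elif event == "message_delta":
                stop_reason = data.get("delta", {}).get("stop_reason", stop_reason)
            elif event == "message_stop":
                break
    return "".join(text_parts), stop_reason
```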

POST /v1/messages/count_tokens

Anthropic-compatible token counting. Send the same system / messages / tools you would pass to /v1/messages and get an input-token estimate back without running the model — nothing is billed.

POST https://api.airforce/v1/messages/count_tokens

curl https://api.airforce/v1/messages/count_tokens \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "system": "You are a helpful assistant.",
    "messages": [{"role": "user", "content": "Hello, Claude!"}]
  }'

# → {"input_tokens": 34}

The count is a fast character-based estimate (about 4 characters per token) over system, messages and tools — close enough for context-budget checks, not an exact tokenizer run.

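Prompt caching

On /v1/messages with Claude models, mark a prefix as cacheable by passing system as an array of blocks in which the segment to cache carries cache_control: { type: "ephemeral" }. Subsequent requests that start with the same prefix are charged the cheaper cache-read rate. Models with supports_caching: true in /v1/models support this.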

Write vs read pricing

Cache writes are typically charged slightly above normal input (about 1.25× on Claude-family models). Cache reads are much cheaper (about 0.1× input). A large write with almost no later read is the expensive case — not a “cache discount”. Only reusing the same prefix turns the write into savings.

Tools like Claude Code often attach a large project context with cache markers on the first turns. Expect cache-write spend while the repo/system prefix is loaded; later turns only get cheap if that prefix is stable and reused. Subagents and multi-step agents can multiply large contexts across several requests.

{
  "model": "claude-sonnet-4.6",
  "max_tokens": 1024,
  "system": [
    {"type": "text", "text": "You are a senior staff engineer at Airforce."},
    {
      "type": "text",
      "text": "<repository-snapshot>...</repository-snapshot>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Where is rate limiting enforced?"}
  ]
}

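How cache counts are reported in responses

Cache token counts are delivered in each format's native shape, so the SDKs (openai, @anthropic-ai/sdk, @google/genai) can read them without custom code. Fields are omitted when their value is zero, which keeps uncached responses lean.

/v1/chat/completions (OpenAI format)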

"usage": {
  "prompt_tokens": 2104,
  "completion_tokens": 147,
  "total_tokens": 2251,
  "prompt_tokens_details": { "cached_tokens": 1980 },
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1/messages (Anthropic format)

"usage": {
  "input_tokens": 2104,
  "output_tokens": 147,
  "cache_read_input_tokens": 1980,
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1beta/.../generateContent (Gemini format)

"usageMetadata": {
  "promptTokenCount": 2104,
  "candidatesTokenCount": 147,
  "totalTokenCount": 2251,
  "cachedContentTokenCount": 1980
}
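Code that talks to more than one endpoint may want a single view of these counts. The sketch below normalises the three usage shapes shown above into read and write totals, defaulting to zero for omitted fields. cache_counts is a hypothetical helper, and the Gemini branch reports no writes only because the Gemini shape on this page exposes no write field.

```python
def cache_counts(usage: dict) -> dict:
    """Map an OpenAI, Anthropic or Gemini usage object to read/write counts."""
    if "promptTokenCount" in usage:  # Gemini usageMetadata
        read = usage.get("cachedContentTokenCount", 0)
        write = 0  # no write count in this shape
    elif "prompt_tokens" in usage:  # OpenAI chat completions
        read = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
        write = usage.get("cache_creation_input_tokens", 0)
    else:  # Anthropic messages
        read = usage.get("cache_read_input_tokens", 0)
        write = usage.get("cache_creation_input_tokens", 0)
    return {"read": read, "write": write}
```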

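When caching takes effect

For Claude models, explicit cache_control markers take effect on both /v1/messages and /v1/chat/completions; place them on system or message content blocks. Many other providers (the OpenAI family, DeepSeek, Gemini) cache automatically: you send no markers, and cached_tokens appears in the response once you reuse a long enough prefix.

Cache duration: 5 minutes or 1 hour

A cached prefix lives for 5 minutes by default, and every hit refreshes the timer. For a longer-lived prefix, add ttl: "1h" to the marker. The response reports each TTL separately under cache_creation.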

"cache_control": { "type": "ephemeral", "ttl": "1h" }

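Example: write first, then read

Send the exact same request twice (the caching example above). The first call that sees the prefix pays a one-time cache write; identical calls within the TTL pay the much cheaper cache read.

First call, cache write (usage excerpt):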

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 1980,
  "cache_read_input_tokens": 0
}

Second identical call within the TTL, cache read:

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 1980
}

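Limits and costs

Claude requires a minimum cacheable prefix (about 1024 tokens; larger on some models). Shorter prefixes are not cached at all.
At most 4 cache breakpoints are allowed per request, and the cached prefix must be byte-identical across calls; changing even a single character misses the cache.
Cache writes cost more than regular input (5m ≈ 1.25×, 1h ≈ 2×); reads are much cheaper (≈ 0.1×). See the pricing page for per-model cache prices.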
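Whether caching pays off depends on how often the prefix is reused. The sketch below turns the approximate multipliers above into a comparison, measured in units of one uncached input token. It is a simplification: it assumes every reuse lands within the TTL and ignores output tokens and the uncached remainder of the prompt. Real prices are per model.

```python
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.0}  # approximate figures from this page
READ_MULTIPLIER = 0.1


def cached_cost(prefix_tokens: int, reads: int, ttl: str = "5m") -> float:
    """One cache write followed by `reads` cache reads of the same prefix."""
    return prefix_tokens * (WRITE_MULTIPLIER[ttl] + READ_MULTIPLIER * reads)


def uncached_cost(prefix_tokens: int, reads: int) -> float:
    """The same 1 + reads calls, each paying full input price for the prefix."""
    return float(prefix_tokens * (1 + reads))


def caching_pays_off(prefix_tokens: int, reads: int, ttl: str = "5m") -> bool:
    return cached_cost(prefix_tokens, reads, ttl) < uncached_cost(prefix_tokens, reads)
```

Under these figures a 5-minute cache is already cheaper after one reuse (1.25 + 0.1 = 1.35 against 2), while a 1-hour cache needs two reuses (2 + 0.2 = 2.2 against 3). With no reuse at all, the write is pure extra cost.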

POST /v1/responses

OpenAI Responses API surface for stateful conversations. It uses the same Bearer / x-api-key authentication. Cache counts appear as input_tokens_details.cached_tokens (reads) plus the flat cache_creation_input_tokens and cache_creation.ephemeral_* (writes), at parity with /v1/chat/completions.

POST https://api.airforce/v1/responses

POST /v1beta/models/{model}:generateContent

Google Gemini-compatible endpoint. Works with the official @google/genai SDK and the Gemini CLI by pointing the base URL at https://api.airforce/v1beta. Any routed model works — requests are translated to and from the native Gemini shape, and the model is taken from the URL path (not the body).

POST https://api.airforce/v1beta/models/{model}:generateContent

Authentication

Pass your Airforce API key in any of the three ways Google clients use:

# 1) query parameter (Google default)
?key=sk-air-YOUR_API_KEY

# 2) header
x-goog-api-key: sk-air-YOUR_API_KEY

# 3) bearer token
Authorization: Bearer sk-air-YOUR_API_KEY

Request body

Parameter	Type	Required	Description
contents	array	Required	Conversation turns. Each: { role: "user" \| "model", parts: [...] }. A part is { text }, { functionCall: { name, args } }, or { functionResponse: { name, response } }. "model" is Gemini's term for the assistant role.
systemInstruction	object	Optional	System prompt: { parts: [{ text }] }.
generationConfig	object	Optional	{ temperature, maxOutputTokens, topP, stopSequences } — mapped to the canonical sampling parameters.
tools	array	Optional	Tool definitions: [{ functionDeclarations: [{ name, description, parameters }] }]. functionDeclarations are flattened across entries.
toolConfig	object	Optional	Tool-choice control: { functionCallingConfig: { mode: "AUTO" \| "ANY" \| "NONE" } }. ANY forces a call, NONE disables tools.

Example

curl "https://api.airforce/v1beta/models/gemini-3.1-pro:generateContent" \
  -H "x-goog-api-key: sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {"role": "user", "parts": [{"text": "What is the capital of France?"}]}
    ],
    "systemInstruction": {"parts": [{"text": "You are a helpful assistant."}]},
    "generationConfig": {"temperature": 0.7, "maxOutputTokens": 256}
  }'

Response shape

Parameter	Type	Required	Description
candidates	array	Optional	Generated turns: [{ content: { role: "model", parts }, finishReason, index }]. Only the first candidate is populated.
candidates[].finishReason	string	Optional	"STOP" \| "MAX_TOKENS" \| "SAFETY" \| "OTHER".
usageMetadata	object	Optional	{ promptTokenCount, candidatesTokenCount, totalTokenCount, cachedContentTokenCount? }. cachedContentTokenCount appears when the upstream reported a cache read.
modelVersion	string	Optional	Echo of the requested model.

{
  "candidates": [{
    "content": {
      "role": "model",
      "parts": [{"text": "The capital of France is Paris."}]
    },
    "finishReason": "STOP",
    "index": 0
  }],
  "usageMetadata": {
    "promptTokenCount": 16,
    "candidatesTokenCount": 8,
    "totalTokenCount": 24
  },
  "modelVersion": "gemini-3.1-pro"
}

POST /v1beta/models/{model}:streamGenerateContent

Streaming uses the :streamGenerateContent action and returns Server-Sent Events. Each data: line is a full Gemini-shaped chunk (not a delta object); the final chunk carries usageMetadata.

data: {"candidates":[{"content":{"role":"model","parts":[{"text":"The capital"}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[{"text":" is Paris."}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[]},"finishReason":"STOP","index":0}],"usageMetadata":{"promptTokenCount":16,"candidatesTokenCount":8,"totalTokenCount":24}}
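A sketch of reading this stream in Python, assuming an iterable of decoded text lines. Each chunk is a complete Gemini-shaped object, but as the sample shows, the text in its parts is incremental, so the fragments are concatenated. collect_gemini_stream is a hypothetical name.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_gemini_stream(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Join the text parts of every chunk and return (text, usageMetadata)."""
    text_parts = []
    usage = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines
        chunk = json.loads(line[len("data:"):].strip())
        usage = chunk.get("usageMetadata", usage)  # present on the final chunk
        for candidate in chunk.get("candidates", []):
            for part in candidate.get("content", {}).get("parts", []):
                text_parts.append(part.get("text", ""))
    return "".join(text_parts), usage
```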

List models

The catalog is also exposed in Gemini Model-resource shape so Google clients can enumerate models.

curl https://api.airforce/v1beta/models

Notes: the base URL is https://api.airforce/v1beta (or /v1), not Google's host. The model name comes from the URL path, not the request body. Only the first candidate is returned, and a subset of Gemini fields is translated — safetySettings and cachedContent are currently ignored. Billing, rate limits and smart routing apply exactly as on /v1/chat/completions.

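Errors

Airforce returns standard HTTP status codes and a unified error envelope on both endpoints.

Status	Type	Description
400	invalid_request_error	Malformed JSON, a missing required field, or an unknown model.
401	invalid_request_error / auth_required	The API key is missing or invalid.
402	insufficient_quota	This model requires an active subscription or a positive Pay-as-you-Go balance.
403	model_access_denied / insufficient_scope	The plan or per-key permissions deny this request.
404	model_not_found	The requested model does not exist or you do not have access to it.
429	rate_limit_error	The request rate or the daily token limit was exceeded.
503	api_error / moderation_unavailable	All upstream keys for the requested provider failed.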

{
  "error": {
    "message": "The requested model does not exist or you do not have access to it.",
    "type": "model_not_found",
    "param": null,
    "code": "404"
  }
}

The descriptive slug is in type. code is the HTTP status as a string (e.g. "404"), and param is null except for parameter-range validation errors, where it names the offending parameter.
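A client can unpack this envelope once and branch on the result. The sketch below is one possible shape; classify_error is a hypothetical helper, and treating 429 and 503 as retryable is a client-side assumption, not something this page prescribes.

```python
import json

RETRYABLE_STATUSES = {429, 503}  # assumption: rate limits and upstream outages


def classify_error(status: int, body: str) -> dict:
    """Flatten the error envelope and flag statuses worth retrying."""
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        parsed = None  # non-JSON body, e.g. from an intermediate proxy
    error = parsed.get("error", {}) if isinstance(parsed, dict) else {}
    return {
        "status": status,
        "type": error.get("type"),
        "message": error.get("message"),
        "param": error.get("param"),
        "retryable": status in RETRYABLE_STATUSES,
    }
```

A retry loop built on this would typically back off between attempts and give up after a small number of tries.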

Discover models

See the full list of model IDs and their capability flags (vision, tools, reasoning, caching, context length and more): /docs/api/models.

curl https://api.airforce/v1/models \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY"