API REFERENCE

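Chat Completions

Generate chat responses across 100+ models through a single API. Drop-in compatible with OpenAI Chat Completions, OpenAI Responses and Anthropic Messages, plus a Google Gemini-compatible generateContent endpoint.

Airforce serves the OpenAI Chat Completions and Anthropic Messages wire formats on the same set of models. Keep the SDK you already use and change only the base URL; non-Claude models are forwarded transparently through either interface.

This page covers authentication, the request and response shapes of both interfaces, streaming, tool calling, vision, reasoning and prompt caching. New here? Start with the basic example below, get a single call working, then layer streaming, tools or caching on top.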

Authentication

Every request needs a Bearer token (your Airforce API key). For SDK compatibility, /v1/messages also accepts Anthropic's x-api-key header.

Authorization: Bearer sk-air-YOUR_API_KEY
# alt for /v1/messages:
x-api-key: sk-air-YOUR_API_KEY

POST /v1/chat/completions

OpenAI-compatible chat completions. Works with the official openai SDK: override base_url to https://api.airforce/v1.

POST https://api.airforce/v1/chat/completions

Request body

Parameter	Type	Required	Description
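model	string	Required	Model ID. Use GET /v1/models to discover available IDs.
messages	array	Required	Conversation history. Each entry has { role: "system" \| "user" \| "assistant" \| "tool", content }. content is either a string or an array of content blocks (used for vision/image input, see below).
max_tokens	integer	Optional	Maximum number of tokens to generate. Capped at the model's max_output_tokens.
temperature	float	Optional	Sampling temperature, 0–2. Lower values are more deterministic. The default depends on the upstream provider.
top_p	float	Optional	Nucleus sampling. Set either temperature or top_p, not both.
stream	boolean	Optional	When true, the response is a stream of server-sent events. See "Streaming" below.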
models	array	Optional	Fallback models (max 3), e.g. ["deepseek-v3.2", "gpt-4o-mini"]. If every channel of the primary model fails, each candidate is tried in order. You are billed for — and response.model reports — the model that actually answered. Unknown or plan-gated candidates are skipped. With the OpenAI SDK pass it via extra_body.
transforms	array	Optional	Prompt transforms. Supported: ["middle-out"] — when the conversation overflows the model's context window, whole messages are dropped from the middle (system prompts, the first message and the most recent turns are kept), so long roleplay or agent histories keep working instead of erroring. Opt-in; off by default.
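stream_options	object	Optional	{ include_usage: boolean }. Usage is always included in the final streamed chunk; the field is accepted for OpenAI compatibility but cannot turn that off.
stop	string \| array	Optional	Up to 4 stop sequences. Generation stops as soon as one of them is produced.
tools	array	Optional	Function definitions the model may call. See "Tool calling" below.
tool_choice	string \| object	Optional	"auto" (default), "none", or { type: "function", function: { name } } to force a specific call.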
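response_format	object	Optional	{ type: "json_object" } forces the model to emit valid JSON; { type: "json_schema", json_schema } steers it toward your JSON Schema (see "Structured outputs"). Ignored by models that don't support it.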
reasoning_effort	string	Optional	Reasoning depth: "low" \| "medium" \| "high" \| "xhigh" \| "max". Any model with supports_reasoning: true (Claude, OpenAI o/GPT-5, Gemini, Qwen, DeepSeek, …). See "Reasoning & thinking".
thinking	string \| object	Optional	Cross-model thinking switch. "on" \| "off" \| "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" \| "disabled" }. See "Reasoning & thinking".
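thinking_budget	integer	Optional	Token cap for the model's reasoning trace (where the provider exposes one).
ignore_defaults	boolean	Optional	Skip the per-model default parameters you saved in the dashboard, for this request only.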
skill	string	Optional	ID of a single marketplace skill to apply to this request. The skill transforms your messages/parameters before the upstream call and overrides any installed-skill defaults. Consumed by Airforce, never forwarded upstream. See the Skills catalog at /docs/api/extend.
skills	array	Optional	Array of marketplace skill IDs applied in order, for stacking multiple skills on one request.

Basic example

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
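The same request can be built from Python with only the standard library. This is a minimal sketch: build_chat_request is a hypothetical helper, not part of any Airforce or OpenAI SDK, and it only constructs the request; sending it needs a real key and network access.

```python
import json
import urllib.request

BASE_URL = "https://api.airforce/v1"


def build_chat_request(api_key: str, model: str, messages: list, **params) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/chat/completions request.

    Hypothetical helper: any extra keyword arguments are copied into the
    JSON body unchanged.
    """
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "sk-air-YOUR_API_KEY",
    "gpt-5.1-chat",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=200,
    temperature=0.7,
)

# Sending it (requires a valid key and network access):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because extra keyword arguments pass straight into the body, optional fields from the request table, such as models=["deepseek-v3.2", "gpt-4o-mini"] for fallbacks, need no special handling.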

Response shape

Parameter	Type	Required	Description
id	string	Optional	Stable completion ID, e.g. "chatcmpl-abc123".
object	string	Optional	"chat.completion" for non-streamed responses, "chat.completion.chunk" for streamed ones.
created	integer	Optional	Unix timestamp (seconds).
model	string	Optional	The model that actually answered: the requested model ID unless a fallback from models served the request.
choices	array	Optional	Array of completion candidates: [{ index, message: { role, content, tool_calls? }, finish_reason }].
choices[].finish_reason	string	Optional	"stop" \| "length" \| "tool_calls" \| "content_filter".
usage	object	Optional	{ prompt_tokens, completion_tokens, total_tokens, completion_tokens_details?, prompt_tokens_details?, cache_creation_input_tokens?, cache_creation? }. completion_tokens_details.reasoning_tokens is set when the model produces a reasoning trace. Cache fields appear when the upstream returns prompt-cache information: prompt_tokens_details.cached_tokens reports cache reads (the OpenAI standard), cache_creation_input_tokens aggregates writes, and cache_creation.ephemeral_5m_input_tokens / ephemeral_1h_input_tokens give the split by TTL.

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "gpt-5.1-chat",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 8,
    "total_tokens": 28
  }
}
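A minimal sketch of reading this shape in Python. summarize_completion is a hypothetical helper; it treats the optional usage breakdowns as possibly missing, since they only appear when the upstream reports them.

```python
def summarize_completion(resp: dict) -> dict:
    """Extract the commonly needed fields from a non-streamed chat.completion (hypothetical helper)."""
    choice = resp["choices"][0]
    usage = resp.get("usage") or {}
    # Optional breakdowns may be absent entirely; treat missing counts as 0.
    prompt_details = usage.get("prompt_tokens_details") or {}
    completion_details = usage.get("completion_tokens_details") or {}
    return {
        "model": resp.get("model"),
        "text": choice["message"].get("content"),
        "finish_reason": choice.get("finish_reason"),
        # "length" means the token limit cut the answer short.
        "truncated": choice.get("finish_reason") == "length",
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
        "cached_tokens": prompt_details.get("cached_tokens", 0),
        "reasoning_tokens": completion_details.get("reasoning_tokens", 0),
    }


# The example response above, trimmed to the fields used here.
example = {
    "model": "gpt-5.1-chat",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "The capital of France is Paris."},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 20, "completion_tokens": 8, "total_tokens": 28},
}
summary = summarize_completion(example)
```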

Reasoning & thinking

Reasoning/thinking is a cross-model feature for every model ID with supports_reasoning: true — Claude, OpenAI o-series/GPT-5, Gemini, Qwen, DeepSeek, and others. You send the same canonical parameters; Airforce maps them to each provider's native shape. This is not a DeepSeek-only API.

Source of truth: check for supports_reasoning: true on the model in GET /v1/models (or GET /api/models/{id}/allowed-params). Prefer that flag over guessing from the model name.

Canonical parameters

Parameter	Type	Required	Description
reasoning_effort	string	Optional	"low" \| "medium" \| "high" \| "xhigh" \| "max". Accepted on every model with supports_reasoning: true. Some upstreams only honour a subset (e.g. high/max); others clamp unsupported levels to the nearest served value.
thinking	string \| object	Optional	Three accepted shapes (we normalise): "on" \| "off" \| "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" \| "disabled" }. Mapped onto Claude extended thinking, OpenAI effort profiles, Gemini thinking_config, Qwen enable_thinking, DeepSeek hybrid, etc.
thinking_budget	integer	Optional	Maximum tokens the model may spend reasoning before emitting visible output. Mirrors budget_tokens when the upstream exposes a budget; takes precedence over reasoning_effort when both are sent and a budget is available.

What differs by family (mapping only)

Parameters are the same everywhere. Only how we map them (and how hard "off" is) differs:

Claude — Thinking on/off + budget; often also reasoning_effort via the gateway.
OpenAI (o1/o3, GPT-5) — Mainly reasoning_effort. A full "thinking off" is often not available — you control how strongly the model reasons, not always whether it reasons at all.
Gemini — thinking_config / budget mapped internally.
Qwen / Xiaomi / Alibaba — thinking + enable_thinking-style controls.
DeepSeek (generic) — Hybrid on/off is especially clear: thinking: { type: enabled|disabled } plus optional reasoning_effort.
Resellers / other — Often generic passthrough of the same canonical fields.

Controlling where the trace appears

An optional reasoning object on the request decides what happens to the thinking trace. It is consumed by Airforce and never forwarded upstream.

Parameter	Type	Required	Description
reasoning.format	string	Optional	"separate" (default) puts the trace in message.reasoning (and delta.reasoning while streaming). "inline" keeps the legacy inline <think>…</think> form inside content.
reasoning.exclude	boolean	Optional	When true, the reasoning trace is dropped entirely from the response. Reasoning tokens are still counted and billed if the model produced them.

"reasoning": { "format": "separate", "exclude": false }
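With "format": "inline" the trace stays inside content, so you split it off yourself. A minimal sketch assuming one <think>…</think> block per message, as in the legacy inline form; split_inline_reasoning is a hypothetical helper.

```python
import re

# One <think>...</think> block; DOTALL lets the trace span several lines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_inline_reasoning(content: str) -> tuple[str, str]:
    """Return (reasoning, answer) from content that may embed a <think> block (hypothetical helper)."""
    match = _THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    # Everything outside the block is the visible answer.
    answer = (content[: match.start()] + content[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_inline_reasoning("<think>2 + 2 is 4.</think>\nThe answer is 4.")
```

With the default "separate" format none of this is needed: read message.reasoning instead.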

Reasoning effort (OpenAI style)

Primary control for o-series and GPT-5: how much the model may reason. Same canonical field as on every other supports_reasoning model — OpenAI is included, but behaviour is not 1:1 with DeepSeek's hard on/off.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "o3-mini",
    "messages": [{"role": "user", "content": "Prove the Pythagorean theorem."}],
    "reasoning_effort": "high"
  }'

Extended thinking (Anthropic style)

Budget-based thinking for Claude (and gateways that accept the Anthropic shape). You can still send reasoning_effort; we map when the channel supports it.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "messages": [{"role": "user", "content": "Plan a 7-day Italy trip."}],
    "thinking": {"type": "enabled", "budget_tokens": 4000}
  }'

Hybrid thinking (e.g. DeepSeek V3.2/V4)

Example of a hybrid model family with a clear Thinking / Non-Thinking switch — not a separate protocol. deepseek-v3.2, deepseek-v4-flash and deepseek-v4-pro accept the same canonical fields as every other supports_reasoning model. Toggle thinking and optionally set effort in one request:

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Solve this step by step: integrate x^2 * e^x."}],
    "thinking": {"type": "enabled"},
    "reasoning_effort": "high"
  }'

Turn thinking off (faster, cheaper when you only need the final answer) — this hard off is clearer on hybrid models than on many OpenAI o-series profiles:

"thinking": {"type": "disabled"}
// or simply: "thinking": "off"

Native docs for this family often list effort levels such as "high" and "max". We accept the full low…max scale and map unsupported levels to the nearest value that reaches the model. Prefer the hybrid IDs above over retired deepseek-chat / deepseek-reasoner names when you need an explicit on/off switch.

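The reasoning trace itself appears in choices[0].message.reasoning (OpenAI shape) or as thinking blocks in content (Anthropic shape). Reasoning tokens are billed and reported in usage.completion_tokens_details.reasoning_tokens.

The completion_tokens_details.reasoning_tokens breakdown is present only when the upstream provider reports it. In streamed responses, the trace arrives chunk by chunk in delta.reasoning_content.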

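Vision & image input

Models with supports_vision: true accept images embedded as content blocks. Either a public URL or a base64 data URL works; size limits depend on the upstream model.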

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
      ]
    }]
  }'
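For a local file, the base64 data URL form saves hosting the image. A sketch using only the standard library; image_block and image_block_from_file are hypothetical helpers, and the MIME type is guessed from the file extension, so it must match the actual bytes.

```python
import base64
import mimetypes
from pathlib import Path


def image_block(data: bytes, mime_type: str) -> dict:
    """Wrap raw image bytes as an image_url content block with a data URL (hypothetical helper)."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}}


def image_block_from_file(path: str) -> dict:
    """Guess the MIME type from the extension, then read and wrap the file."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {path!r}")
    return image_block(Path(path).read_bytes(), mime_type)


# The 8-byte PNG signature stands in for a real image here.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        image_block(b"\x89PNG\r\n\x1a\n", "image/png"),
    ],
}
```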

Tool calling

Models with supports_tools: true can call functions you define. The model returns a tool_calls array; you run the call yourself and send the result back in a tool message.

Request

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto"
  }'

Response with a tool call

{
  "id": "chatcmpl-abc123",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"Paris\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

Follow-up with the tool result

{
  "model": "gpt-5.1-chat",
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}
      }]
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 14, \"sky\": \"cloudy\"}"}
  ]
}
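The round trip above can be automated: run every requested call locally and append one tool message per tool_call_id. A sketch; get_weather is a hypothetical stand-in returning fixed data, and a full loop would send the extended messages back to /v1/chat/completions until finish_reason is no longer "tool_calls".

```python
import json


def get_weather(location: str) -> dict:
    # Hypothetical local implementation; replace with a real weather lookup.
    return {"temp_c": 14, "sky": "cloudy"}


# Maps the function names declared in "tools" to local callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute each tool call and build the matching role="tool" messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = TOOLS[call["function"]["name"]]
        # arguments arrive as a JSON-encoded string, not as an object
        args = json.loads(call["function"]["arguments"] or "{}")
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return results


assistant = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"},
    }],
}
messages = [{"role": "user", "content": "What is the weather in Paris?"}, assistant]
messages += run_tool_calls(assistant)
```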

Assistant prefill

End your messages array with an assistant message that already contains some text, and the model continues from it instead of starting a fresh turn. This is a reliable way to force a response to begin a specific way — a leading "{" for JSON, a chosen language, or a fixed prefix. The same trick works on /v1/messages. Providers that reject native prefill are handled automatically: the gateway retries once with a compatible rewrite, so you do not have to special-case them.

{
  "model": "claude-sonnet-4.6",
  "messages": [
    {"role": "user", "content": "List three primary colors as a JSON array."},
    {"role": "assistant", "content": "["}
  ]
}
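Because the model continues from the prefill, you typically glue the prefill back on before parsing. A sketch that assumes the returned content starts after the prefill and does not repeat it (check this for your model); with_prefill and join_prefill are hypothetical helpers, and the continuation string is made up for illustration.

```python
import json


def with_prefill(messages: list, prefill: str) -> list:
    """Append an assistant turn that the model will continue from."""
    return [*messages, {"role": "assistant", "content": prefill}]


def join_prefill(prefill: str, continuation: str) -> str:
    # Assumption: the response content does not echo the prefill.
    return prefill + continuation


messages = with_prefill(
    [{"role": "user", "content": "List three primary colors as a JSON array."}], "["
)
# Hypothetical continuation, used only to show the reassembly step.
colors = json.loads(join_prefill("[", '"red", "yellow", "blue"]'))
```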

Structured outputs

Set response_format to make the model return JSON. Two modes are supported:

{ "type": "json_object" } — the response is a single valid JSON value.
{ "type": "json_schema", "json_schema": { "name", "schema", "strict" } } — the model is steered to produce JSON that matches your JSON Schema.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Extract the city and country: I live in Paris, France."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "location",
        "schema": {
          "type": "object",
          "properties": { "city": {"type": "string"}, "country": {"type": "string"} },
          "required": ["city", "country"]
        }
      }
    }
  }'

Reliability: even when a model wraps its answer in prose or a markdown code fence, Airforce extracts the JSON payload so you always receive parseable content. If no valid JSON can be recovered, the original text is returned unchanged — so the guarantee never makes a response worse. This applies to non-streamed responses; streamed responses are passed through unchanged.
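Streamed responses skip that recovery step, so after concatenating the chunks you may want something similar on the client. A rough sketch of the idea, not Airforce's server-side code: try the raw text, then the body of any markdown code fence, then the first decodable JSON value. extract_json is a hypothetical helper.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built indirectly
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str):
    """Return the first JSON value recoverable from text, or None (hypothetical helper)."""
    candidates = [text.strip()]
    candidates += [m.group(1).strip() for m in _FENCE_RE.finditer(text)]
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            pass
    # Fall back to the first '{' or '[' that starts a valid JSON value.
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "{[":
            try:
                value, _ = decoder.raw_decode(text, i)
                return value
            except json.JSONDecodeError:
                continue
    return None
```

Returning None instead of raising lets the caller keep the original text, in the same spirit as leaving unrecoverable output unchanged.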

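Streaming

Set stream: true to receive partial completions as server-sent events. Each event is a JSON chunk shaped like the non-streamed response, except that object is "chat.completion.chunk" and message is replaced by delta. The stream ends with data: [DONE].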

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Write a haiku about Berlin."}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'

Wire format

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"Cold "},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"stone "},"finish_reason":null}]}

…

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":17,"total_tokens":29}}

data: [DONE]
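A minimal client-side reader for this wire format. It is a sketch that assumes each event's JSON fits on a single data: line, as shown above; in a real client the lines come from iterating the HTTP response body. read_chat_stream is hypothetical, and it accepts the reasoning trace under either delta.reasoning or delta.reasoning_content, the two names this page uses.

```python
import json


def read_chat_stream(lines):
    """Accumulate content, reasoning, finish_reason and usage from chat.completion.chunk lines."""
    content, reasoning = [], []
    finish_reason, usage = None, None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators and keep-alive comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]  # arrives on the final chunk
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                content.append(delta["content"])
            trace = delta.get("reasoning") or delta.get("reasoning_content")
            if trace:
                reasoning.append(trace)
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(content), "".join(reasoning), finish_reason, usage


stream = [
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    '',
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Cold "},"finish_reason":null}]}',
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"stone "},"finish_reason":null}]}',
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":17,"total_tokens":29}}',
    'data: [DONE]',
]
text, trace, finish, usage = read_chat_stream(stream)
```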

Reliability & smart routing

Every model ID resolves to a pool of upstream providers behind the scenes. If the first one errors or times out, the request is automatically retried against the next provider for the same model, in order, before any failure is returned — you do not configure or trigger this. The model field in the response always reports the variant that actually answered. This is independent of the optional models / fallbacks array, which adds your own cross-model candidates on top: first the primary model exhausts its own provider chain, then each fallback model exhausts its chain.
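The attempt order described above can be written down as a generator. This only lists the order, it calls nothing; provider pools are invisible to clients, so attempt_order and the pool names are hypothetical, and dropping fallbacks beyond the documented maximum of 3 is this sketch's choice.

```python
from typing import Iterator


def attempt_order(primary: str, fallbacks: list[str], pools: dict[str, list[str]]) -> Iterator[tuple[str, str]]:
    """Yield (model, provider) pairs: the primary's whole chain first, then each fallback's chain."""
    for model in [primary, *fallbacks[:3]]:  # the sketch ignores fallbacks past the max of 3
        # Models with no pool here (unknown or plan-gated) yield nothing, i.e. are skipped.
        for provider in pools.get(model, []):
            yield model, provider


pools = {
    "gpt-5.1-chat": ["provider-a", "provider-b"],  # hypothetical provider names
    "deepseek-v3.2": ["provider-c"],
}
order = list(attempt_order("gpt-5.1-chat", ["deepseek-v3.2", "not-a-model"], pools))
```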

POST /v1/messages

Anthropic-compatible Messages API. Works with the official @anthropic-ai/sdk: set baseURL to https://api.airforce. Non-Claude models are forwarded transparently to OpenAI, Google and other providers.

POST https://api.airforce/v1/messages

Request body

Parameter	Type	Required	Description
model	string	Required	Model ID (Anthropic format or a routed alias).
messages	array	Required	Each entry: { role: "user" \| "assistant", content: string \| array }.
max_tokens	integer	Required	Required by Anthropic. Token cap for the response.
system	string \| array	Optional	System prompt. Pass an array of { type: "text", text, cache_control? } blocks to mark prefix segments for caching. See "Prompt caching".
temperature	float	Optional	0–1.
top_p	float	Optional	Nucleus sampling.
top_k	integer	Optional	Limit sampling to the top K tokens.
stop_sequences	array	Optional	Up to 4 stop sequences.
stream	boolean	Optional	When true, emits an Anthropic-style SSE event stream (see "Streaming events").
fallbacks	array	Optional	Fallback models (max 3) in Anthropic form: [{"model": "gpt-4o-mini"}]. If every channel of the primary model fails, each candidate is tried in order; you are billed for — and the response model field reports — the model that actually answered. A plain models string array is accepted too.
tools	array	Optional	Anthropic tool definitions: { name, description, input_schema }. The response may contain tool_use content blocks.
tool_choice	object	Optional	{ type: "auto" \| "any" \| "tool", name? }。
thinking	object	Optional	Anthropic extended thinking: { type: "enabled", budget_tokens: N }.

Example

curl https://api.airforce/v1/messages \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "max_tokens": 256,
    "system": "You are a helpful assistant.",
    "messages": [
      {"role": "user", "content": "Hello, Claude!"}
    ]
  }'

Response shape

Parameter	Type	Required	Description
id	string	Optional	Message ID, e.g. "msg_01ABCxyz".
type	string	Optional	Always "message".
role	string	Optional	Always "assistant".
content	array	Optional	Array of content blocks: { type: "text" \| "tool_use" \| "thinking", … }.
model	string	Optional	The model that actually answered: the requested model unless a fallback served the request.
stop_reason	string	Optional	"end_turn" \| "max_tokens" \| "stop_sequence" \| "tool_use".
usage	object	Optional	{ input_tokens, output_tokens, cache_read_input_tokens?, cache_creation_input_tokens?, cache_creation? }. Cache fields appear when prompt caching was used. cache_creation.ephemeral_5m_input_tokens and ephemeral_1h_input_tokens give the write split by TTL.
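A sketch that groups a /v1/messages response by block type. split_content_blocks is hypothetical; the fields inside tool_use blocks (id, name, input) and thinking blocks (thinking) follow Anthropic's published Messages format and are assumed to pass through unchanged. The sample message is illustrative, not a captured response.

```python
def split_content_blocks(message: dict) -> dict:
    """Group an Anthropic-style message's content blocks by type (hypothetical helper)."""
    text_parts, tool_uses, thinking = [], [], []
    for block in message.get("content", []):
        kind = block.get("type")
        if kind == "text":
            text_parts.append(block["text"])
        elif kind == "tool_use":
            tool_uses.append(block)  # keep id, name and input together
        elif kind == "thinking":
            thinking.append(block.get("thinking", ""))  # assumed field name
    return {
        "text": "".join(text_parts),
        "tool_uses": tool_uses,
        "thinking": "".join(thinking),
        "stop_reason": message.get("stop_reason"),
    }


# Illustrative message, not a real response.
msg = {
    "type": "message",
    "role": "assistant",
    "content": [
        {"type": "text", "text": "Let me check."},
        {"type": "tool_use", "id": "toolu_01", "name": "get_weather", "input": {"location": "Paris"}},
    ],
    "stop_reason": "tool_use",
}
parts = split_content_blocks(msg)
```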

Streaming events

Anthropic SSE uses named events instead of bare data-only JSON chunks. Each event has an event: name and a data: JSON payload.

event: message_start
data: {"type":"message_start","message":{"id":"msg_01","role":"assistant","content":[],"model":"claude-sonnet-4.6","stop_reason":null,"usage":{"input_tokens":12,"output_tokens":1}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}

event: message_stop
data: {"type":"message_stop"}
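A sketch of reading these named events on the client, assuming one data: line per event as shown above. read_messages_stream is hypothetical; events it does not need for plain text, such as content_block_start, are simply ignored.

```python
import json


def read_messages_stream(lines):
    """Rebuild text, stop_reason and usage from Anthropic-style named SSE events."""
    event = None
    text, stop_reason, usage = [], None, {}
    for raw in lines:
        line = raw.strip()
        if line.startswith("event:"):
            event = line[len("event:"):].strip()  # applies to the following data: line
            continue
        if not line.startswith("data:"):
            continue  # blank line between events
        data = json.loads(line[len("data:"):].strip())
        if event == "message_start":
            usage.update(data["message"].get("usage", {}))
        elif event == "content_block_delta" and data["delta"].get("type") == "text_delta":
            text.append(data["delta"]["text"])
        elif event == "message_delta":
            stop_reason = data["delta"].get("stop_reason") or stop_reason
            # output_tokens here replaces the provisional value from message_start
            usage.update(data.get("usage", {}))
        elif event == "message_stop":
            break
    return "".join(text), stop_reason, usage


events = [
    'event: message_start',
    'data: {"type":"message_start","message":{"id":"msg_01","role":"assistant","content":[],"model":"claude-sonnet-4.6","stop_reason":null,"usage":{"input_tokens":12,"output_tokens":1}}}',
    '',
    'event: content_block_start',
    'data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}',
    '',
    'event: content_block_delta',
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}',
    '',
    'event: message_delta',
    'data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}',
    '',
    'event: message_stop',
    'data: {"type":"message_stop"}',
]
text, stop_reason, usage = read_messages_stream(events)
```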

POST /v1/messages/count_tokens

Anthropic-compatible token counting. Send the same system / messages / tools you would pass to /v1/messages and get an input-token estimate back without running the model — nothing is billed.

POST https://api.airforce/v1/messages/count_tokens

curl https://api.airforce/v1/messages/count_tokens \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "system": "You are a helpful assistant.",
    "messages": [{"role": "user", "content": "Hello, Claude!"}]
  }'

# → {"input_tokens": 34}

The count is a fast character-based estimate (about 4 characters per token) over system, messages and tools — close enough for context-budget checks, not an exact tokenizer run.
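You can run a comparable budget check locally before calling the endpoint. A sketch under stated assumptions: the page only says "about 4 characters per token", so rounding up, serializing non-string content and tools as JSON, and the context_window argument are this sketch's choices, and results will not match the endpoint exactly.

```python
import json
import math


def estimate_input_tokens(system="", messages=(), tools=()) -> int:
    """Rough estimate: total characters / 4, rounded up (assumed rounding)."""
    chars = len(system) if isinstance(system, str) else len(json.dumps(system))
    for message in messages:
        content = message["content"]
        chars += len(content) if isinstance(content, str) else len(json.dumps(content))
    for tool in tools:
        chars += len(json.dumps(tool))  # assumed serialization for tool definitions
    return math.ceil(chars / 4)


def fits_context(estimate: int, max_tokens: int, context_window: int) -> bool:
    """Leave room for the reply: prompt estimate plus max_tokens must fit the window."""
    return estimate + max_tokens <= context_window
```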

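Prompt caching

On /v1/messages with Claude models, mark a prefix for caching by passing system as an array of blocks whose cached segment carries cache_control: { type: "ephemeral" }. Later requests that start with the same prefix are charged the cheaper cache-read rate. Models with supports_caching: true in /v1/models support this.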

Write vs read pricing

Cache writes are typically charged slightly above normal input (about 1.25× on Claude-family models). Cache reads are much cheaper (about 0.1× input). A large write with almost no later read is the expensive case — not a “cache discount”. Only reusing the same prefix turns the write into savings.

Tools like Claude Code often attach a large project context with cache markers on the first turns. Expect cache-write spend while the repo/system prefix is loaded; later turns only get cheap if that prefix is stable and reused. Subagents and multi-step agents can multiply large contexts across several requests.

{
  "model": "claude-sonnet-4.6",
  "max_tokens": 1024,
  "system": [
    {"type": "text", "text": "You are a senior staff engineer at Airforce."},
    {
      "type": "text",
      "text": "<repository-snapshot>...</repository-snapshot>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Where is rate limiting enforced?"}
  ]
}
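When the system prompt is assembled in code, a small helper can keep the marker on the last stable segment. A sketch; system_with_cache is hypothetical, and its optional ttl argument simply copies the "1h" value this page documents into the marker.

```python
from typing import Optional


def system_with_cache(static_parts: list, ttl: Optional[str] = None) -> list:
    """Turn stable system-prompt segments into text blocks, marking the last one for caching."""
    if not static_parts:
        raise ValueError("need at least one system segment")
    blocks = [{"type": "text", "text": part} for part in static_parts]
    marker = {"type": "ephemeral"}
    if ttl is not None:
        marker["ttl"] = ttl  # e.g. "1h" for the longer-lived cache
    # Everything up to and including this block forms the cached prefix.
    blocks[-1]["cache_control"] = marker
    return blocks


request = {
    "model": "claude-sonnet-4.6",
    "max_tokens": 1024,
    "system": system_with_cache([
        "You are a senior staff engineer at Airforce.",
        "<repository-snapshot>...</repository-snapshot>",
    ]),
    "messages": [{"role": "user", "content": "Where is rate limiting enforced?"}],
}
```

Keep anything that changes per request (timestamps, user names) out of these segments, since the cached prefix must be byte-for-byte identical.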

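How cache counts are reported in responses

Cache token counts are delivered in each format's native shape, so the SDKs (openai, @anthropic-ai/sdk, @google/genai) read them without custom code. A field is omitted when its value is zero, which keeps uncached responses compact.

/v1/chat/completions (OpenAI format)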

"usage": {
  "prompt_tokens": 2104,
  "completion_tokens": 147,
  "total_tokens": 2251,
  "prompt_tokens_details": { "cached_tokens": 1980 },
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1/messages (Anthropic format)

"usage": {
  "input_tokens": 2104,
  "output_tokens": 147,
  "cache_read_input_tokens": 1980,
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1beta/.../generateContent (Gemini format)

"usageMetadata": {
  "promptTokenCount": 2104,
  "candidatesTokenCount": 147,
  "totalTokenCount": 2251,
  "cachedContentTokenCount": 1980
}
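If one client talks to several of these endpoints, it helps to normalize the counts into one shape. A sketch; cache_counts is hypothetical, it also covers the /v1/responses shape (input_tokens_details.cached_tokens), and it defaults omitted fields to 0.

```python
def cache_counts(usage: dict) -> dict:
    """Normalize cache reads/writes from OpenAI, Responses, Anthropic or Gemini usage objects."""
    usage = usage.get("usageMetadata", usage)  # also accept a whole Gemini response
    if "promptTokenCount" in usage:
        # Gemini shape: only a read count is reported.
        return {"read": usage.get("cachedContentTokenCount", 0), "write": 0}
    if "prompt_tokens" in usage:
        # /v1/chat/completions shape.
        read = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
    elif "input_tokens_details" in usage:
        # /v1/responses shape.
        read = (usage.get("input_tokens_details") or {}).get("cached_tokens", 0)
    else:
        # /v1/messages (Anthropic) shape.
        read = usage.get("cache_read_input_tokens", 0)
    # Zero-valued fields are omitted, so default to 0.
    return {"read": read, "write": usage.get("cache_creation_input_tokens", 0)}
```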

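When caching applies

For Claude models, explicit cache_control markers work on both /v1/messages and /v1/chat/completions; put them on content blocks in system or in messages. Many other providers (the OpenAI family, DeepSeek, Gemini) cache automatically: you send no markers, and cached_tokens appears in the response whenever you reuse a sufficiently long prefix.

Cache lifetime: 5 minutes or 1 hour

A cached prefix lives for 5 minutes by default, and every hit resets the timer. For a longer-lived prefix, add ttl: "1h" to the marker. Responses report each TTL separately under cache_creation.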

"cache_control": { "type": "ephemeral", "ttl": "1h" }

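Example: write, then read

Send the exact same request twice (the caching example above). The first call that sees the prefix pays a one-off cache write; an identical call within the TTL pays the much cheaper cache read.

First call, cache write (usage excerpt):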

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 1980,
  "cache_read_input_tokens": 0
}

Second identical call within the TTL, cache read:

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 1980
}

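Limits and costs

Claude requires a minimum cacheable prefix (about 1024 tokens; larger on some models). Shorter prefixes are not cached at all.
At most 4 cache breakpoints per request, and the cached prefix must be byte-for-byte identical across calls; changing a single character misses the cache.
Cache writes cost more than normal input (5m ≈ 1.25×, 1h ≈ 2×); reads are much cheaper (≈ 0.1×). See the pricing page for per-model cache prices.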
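These multipliers make the trade-off easy to check. A worked sketch assuming the approximate factors listed above and a price of 1.0 per input token (costs are in "input-token units"); prefix_cost and break_even_reads are hypothetical helpers, and real per-model rates live on the pricing page.

```python
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.0}  # approximate, per the list above
READ_MULTIPLIER = 0.1


def prefix_cost(prefix_tokens: int, calls: int, ttl: str, price_per_token: float = 1.0) -> tuple[float, float]:
    """Cost of sending one prefix `calls` times: (uncached, cached as one write then reads)."""
    uncached = calls * prefix_tokens * price_per_token
    cached = prefix_tokens * price_per_token * (WRITE_MULTIPLIER[ttl] + (calls - 1) * READ_MULTIPLIER)
    return uncached, cached


def break_even_reads(ttl: str) -> int:
    """Fewest reads after the initial write at which caching beats resending uncached."""
    reads = 0
    while WRITE_MULTIPLIER[ttl] + reads * READ_MULTIPLIER >= 1 + reads:
        reads += 1
    return reads


# The write-then-read example above: a 1980-token prefix sent twice.
uncached, cached = prefix_cost(1980, 2, "5m")
```

So a 5-minute write pays for itself on the first read, while a 1-hour write needs two reads within the hour; a write that is never read costs 25% or 100% more than sending the prefix uncached.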

POST /v1/responses

OpenAI Responses API surface for stateful conversations, with the same Bearer / x-api-key authentication. Cache counts appear as input_tokens_details.cached_tokens (reads) plus the flat cache_creation_input_tokens and cache_creation.ephemeral_* fields (writes), matching /v1/chat/completions.

POST https://api.airforce/v1/responses

POST /v1beta/models/{model}:generateContent

Google Gemini-compatible endpoint. Works with the official @google/genai SDK and the Gemini CLI by pointing the base URL at https://api.airforce/v1beta. Any routed model works — requests are translated to and from the native Gemini shape, and the model is taken from the URL path (not the body).

POST https://api.airforce/v1beta/models/{model}:generateContent

Authentication

Pass your Airforce API key any of the three ways Google clients use:

# 1) query parameter (Google default)
?key=sk-air-YOUR_API_KEY

# 2) header
x-goog-api-key: sk-air-YOUR_API_KEY

# 3) bearer token
Authorization: Bearer sk-air-YOUR_API_KEY

Request body

Parameter	Type	Required	Description
contents	array	Required	Conversation turns. Each: { role: "user" \| "model", parts: [...] }. A part is { text }, { functionCall: { name, args } }, or { functionResponse: { name, response } }. "model" is Gemini's term for the assistant role.
systemInstruction	object	Optional	System prompt: { parts: [{ text }] }.
generationConfig	object	Optional	{ temperature, maxOutputTokens, topP, stopSequences } — mapped to the canonical sampling parameters.
tools	array	Optional	Tool definitions: [{ functionDeclarations: [{ name, description, parameters }] }]. functionDeclarations are flattened across entries.
toolConfig	object	Optional	Tool-choice control: { functionCallingConfig: { mode: "AUTO" \| "ANY" \| "NONE" } }. ANY forces a call, NONE disables tools.

Example

curl "https://api.airforce/v1beta/models/gemini-3.1-pro:generateContent" \
  -H "x-goog-api-key: sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {"role": "user", "parts": [{"text": "What is the capital of France?"}]}
    ],
    "systemInstruction": {"parts": [{"text": "You are a helpful assistant."}]},
    "generationConfig": {"temperature": 0.7, "maxOutputTokens": 256}
  }'
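Because the model lives in the URL path rather than the body, a Gemini-style request is easy to get subtly wrong. A standard-library sketch that builds (but does not send) one, using the x-goog-api-key header option; build_generate_content is a hypothetical helper.

```python
import json
import urllib.parse
import urllib.request

GEMINI_BASE = "https://api.airforce/v1beta"


def build_generate_content(api_key: str, model: str, prompt: str, stream: bool = False,
                           **generation_config) -> urllib.request.Request:
    """Build a Gemini-style request; the model goes in the path, never in the body."""
    action = "streamGenerateContent" if stream else "generateContent"
    url = f"{GEMINI_BASE}/models/{urllib.parse.quote(model, safe='')}:{action}"
    body = {"contents": [{"role": "user", "parts": [{"text": prompt}]}]}
    if generation_config:
        body["generationConfig"] = generation_config  # e.g. temperature, maxOutputTokens
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"x-goog-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_content("sk-air-YOUR_API_KEY", "gemini-3.1-pro",
                             "What is the capital of France?",
                             temperature=0.7, maxOutputTokens=256)
```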

Response shape

Parameter	Type	Required	Description
candidates	array	Optional	Generated turns: [{ content: { role: "model", parts }, finishReason, index }]. Only the first candidate is populated.
candidates[].finishReason	string	Optional	"STOP" \| "MAX_TOKENS" \| "SAFETY" \| "OTHER".
usageMetadata	object	Optional	{ promptTokenCount, candidatesTokenCount, totalTokenCount, cachedContentTokenCount? }. cachedContentTokenCount appears when the upstream reported a cache read.
modelVersion	string	Optional	Echo of the requested model.

{
  "candidates": [{
    "content": {
      "role": "model",
      "parts": [{"text": "The capital of France is Paris."}]
    },
    "finishReason": "STOP",
    "index": 0
  }],
  "usageMetadata": {
    "promptTokenCount": 16,
    "candidatesTokenCount": 8,
    "totalTokenCount": 24
  },
  "modelVersion": "gemini-3.1-pro"
}

POST /v1beta/models/{model}:streamGenerateContent

Streaming uses the :streamGenerateContent action and returns Server-Sent Events. Each data: line is a full Gemini-shaped chunk (not a delta object); the final chunk carries usageMetadata.

data: {"candidates":[{"content":{"role":"model","parts":[{"text":"The capital"}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[{"text":" is Paris."}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[]},"finishReason":"STOP","index":0}],"usageMetadata":{"promptTokenCount":16,"candidatesTokenCount":8,"totalTokenCount":24}}
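Since every streamed chunk is a full Gemini-shaped object, the same text extraction works for both non-streamed responses and stream chunks. A sketch assuming one data: line per chunk as shown above; gemini_text and read_gemini_stream are hypothetical helpers.

```python
import json


def gemini_text(chunk: dict) -> str:
    """Concatenate the text parts of the first candidate (the only one populated)."""
    candidates = chunk.get("candidates") or []
    if not candidates:
        return ""
    parts = candidates[0].get("content", {}).get("parts", [])
    return "".join(part.get("text", "") for part in parts)


def read_gemini_stream(lines):
    """Join the text of every chunk; keep the last finishReason and usageMetadata seen."""
    text, finish, usage = [], None, None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue
        chunk = json.loads(line[len("data:"):].strip())
        text.append(gemini_text(chunk))
        usage = chunk.get("usageMetadata", usage)  # carried by the final chunk
        candidates = chunk.get("candidates") or []
        if candidates and candidates[0].get("finishReason"):
            finish = candidates[0]["finishReason"]
    return "".join(text), finish, usage


stream_lines = [
    'data: {"candidates":[{"content":{"role":"model","parts":[{"text":"The capital"}]},"index":0}],"modelVersion":"gemini-3.1-pro"}',
    'data: {"candidates":[{"content":{"role":"model","parts":[{"text":" is Paris."}]},"index":0}],"modelVersion":"gemini-3.1-pro"}',
    'data: {"candidates":[{"content":{"role":"model","parts":[]},"finishReason":"STOP","index":0}],"usageMetadata":{"promptTokenCount":16,"candidatesTokenCount":8,"totalTokenCount":24}}',
]
response = {"candidates": [{"content": {"role": "model", "parts": [{"text": "The capital of France is Paris."}]},
                            "finishReason": "STOP", "index": 0}]}
```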

List models

The catalog is also exposed in Gemini Model-resource shape so Google clients can enumerate models.

curl https://api.airforce/v1beta/models

Notes: the base URL is https://api.airforce/v1beta (or /v1), not Google's host. The model name comes from the URL path, not the request body. Only the first candidate is returned, and a subset of Gemini fields is translated — safetySettings and cachedContent are currently ignored. Billing, rate limits and smart routing apply exactly as on /v1/chat/completions.

Errors

Airforce returns standard HTTP status codes and a uniform error envelope on both endpoints.

Status	Type	Description
400	invalid_request_error	Malformed JSON, a missing required field, or an unknown model.
401	invalid_request_error / auth_required	Missing or invalid API key.
402	insufficient_quota	The model requires an active subscription or a positive Pay-as-you-Go balance.
403	model_access_denied / insufficient_scope	Your plan or per-key permissions deny this request.
404	model_not_found	The requested model does not exist or you do not have access to it.
429	rate_limit_error	Request rate or daily token cap exceeded.
503	api_error / moderation_unavailable	Every upstream key for the requested provider failed.

{
  "error": {
    "message": "The requested model does not exist or you do not have access to it.",
    "type": "model_not_found",
    "param": null,
    "code": "404"
  }
}

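The descriptive identifier is in type. code is the HTTP status as a string (e.g. "404"). param is usually null; the exception is parameter range-validation errors, where it names the offending parameter.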
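A sketch of client-side handling for this envelope. parse_error and backoff_seconds are hypothetical helpers, and retrying only 429 and 503 is a suggested policy rather than something the API mandates: the other statuses in the table generally need a changed request, key or plan before a retry can succeed.

```python
import json

# Suggested policy (an assumption, not an API rule): retry only transient statuses.
RETRYABLE_STATUSES = {429, 503}


def parse_error(status: int, body: bytes) -> dict:
    """Decode the error envelope, tolerating bodies that are not the expected JSON."""
    try:
        err = json.loads(body)["error"]
    except (ValueError, KeyError, TypeError):
        err = {"message": body.decode("utf-8", "replace"), "type": None, "param": None}
    return {
        "status": status,
        "type": err.get("type"),
        "message": err.get("message"),
        "param": err.get("param"),
        "retryable": status in RETRYABLE_STATUSES,
    }


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff: 1s, 2s, 4s, ... capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))
```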

Explore models

See /docs/api/models for the full list of model IDs and their capability flags (vision, tools, reasoning, caching, context length, etc.).

curl https://api.airforce/v1/models \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY"
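To pick models programmatically, filter the listing on the capability flags. A sketch that assumes an OpenAI-style list ({"data": [...]}) with flags such as supports_reasoning at the top level of each entry; verify the shape against a live response. models_with is a hypothetical helper and the sample entries are placeholders, not real models.

```python
def models_with(listing: dict, *flags: str) -> list[str]:
    """Return IDs of models whose entries set every requested flag to true."""
    return [
        entry["id"]
        for entry in listing.get("data", [])
        if all(entry.get(flag) is True for flag in flags)
    ]


# Placeholder listing; real entries come from GET /v1/models.
listing = {"data": [
    {"id": "model-a", "supports_reasoning": True, "supports_vision": True},
    {"id": "model-b", "supports_reasoning": False, "supports_tools": True},
    {"id": "model-c", "supports_reasoning": True, "supports_tools": True},
]}
reasoning_models = models_with(listing, "supports_reasoning")
```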