API REFERENCE

Chat Completions

Generate chat responses across more than 100 models from a single API. Drop-in compatible with OpenAI Chat Completions, Anthropic Messages and OpenAI Responses.

Airforce supports both the OpenAI Chat Completions and the Anthropic Messages wire formats across the same set of models. Pick the SDK you already use and just change the base URL; non-Claude models are proxied transparently behind both surfaces.

This page covers authentication, the request and response structure for both surfaces, streaming, tool calling, vision, reasoning and prompt caching. New here? Start with the basic example below, get one call working, then layer on streaming, tools or caching once it runs.

Authentication

Every request requires a Bearer token (your Airforce API key). The Anthropic x-api-key header is also accepted on /v1/messages for SDK compatibility.

Authorization: Bearer sk-air-YOUR_API_KEY
# alt for /v1/messages:
x-api-key: sk-air-YOUR_API_KEY
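
The two header styles above can be captured in a small helper. This is a minimal sketch: the function name and the `surface` labels are made up for illustration and are not part of the API.

```python
def auth_headers(api_key: str, surface: str = "openai") -> dict:
    """Return the auth header for one request style.

    `surface` is a hypothetical label used only in this sketch:
      "openai"    -> Authorization: Bearer <key>
      "anthropic" -> x-api-key: <key> (accepted on /v1/messages)
    """
    if surface == "openai":
        return {"Authorization": f"Bearer {api_key}"}
    if surface == "anthropic":
        return {"x-api-key": api_key}
    raise ValueError(f"unknown surface: {surface!r}")


print(auth_headers("sk-air-YOUR_API_KEY"))
print(auth_headers("sk-air-YOUR_API_KEY", "anthropic"))
```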

POST /v1/chat/completions

OpenAI-compatible chat completions. Works with the official openai SDK by overriding base_url to https://api.airforce/v1.

POST https://api.airforce/v1/chat/completions

Request body

Parameter	Type	Required	Description
model	string	Required	Model ID. Use GET /v1/models to discover the available IDs.
messages	array	Required	Conversation history. Each entry has { role: "system" | "user" | "assistant" | "tool", content }. Content is a string or an array of content blocks (vision, see below).
max_tokens	integer	Optional	Maximum number of tokens to generate. Capped at the model's max_output_tokens.
temperature	float	Optional	Sampling temperature, 0–2. Lower is more deterministic. The default depends on the upstream provider.
top_p	float	Optional	Nucleus sampling. Use temperature or top_p, not both.
stream	boolean	Optional	When true, the response is a Server-Sent Events stream. See "Streaming" below.
models	array	Optional	Fallback models (max 3), e.g. ["deepseek-v3.2", "gpt-4o-mini"]. If every channel of the primary model fails, each candidate is tried in order. You are billed for — and response.model reports — the model that actually answered. Unknown or plan-gated candidates are skipped. With the OpenAI SDK pass it via extra_body.
transforms	array	Optional	Prompt transforms. Supported: ["middle-out"] — when the conversation overflows the model's context window, whole messages are dropped from the middle (system prompts, the first message and the most recent turns are kept), so long roleplay or agent histories keep working instead of erroring. Opt-in; off by default.
stream_options	object	Optional	{ include_usage: boolean }. Usage is always included in the final streaming chunk; this field is accepted for OpenAI compatibility but cannot turn usage off.
stop	string | array	Optional	Up to 4 stop sequences. Generation stops as soon as one of them is produced.
tools	array	Optional	Function definitions the model can call. See "Tool calling" below.
tool_choice	string | object	Optional	"auto" (default), "none", or { type: "function", function: { name } } to force a specific call.
response_format	object	Optional	{ type: "json_object" } forces the model to emit valid JSON. Ignored by models that do not support it.
reasoning_effort	string	Optional	Reasoning depth: "low" | "medium" | "high" | "xhigh" | "max". Any model with supports_reasoning: true (Claude, OpenAI o/GPT-5, Gemini, Qwen, DeepSeek, …). See "Reasoning & thinking".
thinking	string | object	Optional	Cross-model thinking switch. "on" | "off" | "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" | "disabled" }. See "Reasoning & thinking".
thinking_budget	integer	Optional	Token cap for the model's reasoning trace (when the provider exposes one).
ignore_defaults	boolean	Optional	Skips the user's saved per-model default parameters (configured in the dashboard) for this request.
skill	string	Optional	ID of a single marketplace skill to apply to this request. The skill transforms your messages/parameters before the upstream call and overrides any installed-skill defaults. Consumed by Airforce, never forwarded upstream. See the Skills catalog at /docs/api/extend.
skills	array	Optional	Array of marketplace skill IDs applied in order, for stacking multiple skills on one request.

Basic example

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
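
The same request can be assembled with Python's standard library. The sketch below only builds the request object and does not send it; the helper name `build_chat_request` is an assumption made for illustration, while the URL, headers and body fields come from the curl call above.

```python
import json
import urllib.request

BASE_URL = "https://api.airforce/v1"  # base URL documented on this page


def build_chat_request(api_key: str, model: str, messages: list,
                       **params) -> urllib.request.Request:
    """Build (but do not send) a POST to /chat/completions."""
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "sk-air-YOUR_API_KEY",
    "gpt-5.1-chat",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(req.get_method(), req.full_url)
```

To send it, pass `req` to `urllib.request.urlopen` and decode the JSON body of the result.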

Response shape

Parameter	Type	Required	Description
id	string	Optional	Stable completion ID, e.g. "chatcmpl-abc123".
object	string	Optional	"chat.completion" for non-streamed responses, "chat.completion.chunk" for streamed ones.
created	integer	Optional	Unix timestamp (seconds).
model	string	Optional	Echo of the requested model ID; when a fallback answered, the model that actually served the request.
choices	array	Optional	Array of completion candidates: [{ index, message: { role, content, tool_calls? }, finish_reason }].
choices[].finish_reason	string	Optional	"stop" | "length" | "tool_calls" | "content_filter".
usage	object	Optional	{ prompt_tokens, completion_tokens, total_tokens, completion_tokens_details?, prompt_tokens_details?, cache_creation_input_tokens?, cache_creation? }. completion_tokens_details.reasoning_tokens is set when the model produced a reasoning trace. The cache fields appear when the upstream returns prompt-caching information: prompt_tokens_details.cached_tokens reports cache reads (OpenAI standard), cache_creation_input_tokens totals the writes, and cache_creation.ephemeral_5m_input_tokens / ephemeral_1h_input_tokens give the split by TTL.

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "gpt-5.1-chat",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 8,
    "total_tokens": 28
  }
}
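
Reading the fields that matter out of this response might look like the following. It is a minimal sketch that only looks at the first choice; `first_answer` is a hypothetical helper name.

```python
import json
from typing import Optional, Tuple

SAMPLE = json.loads("""
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "gpt-5.1-chat",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "The capital of France is Paris."},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 20, "completion_tokens": 8, "total_tokens": 28}
}
""")


def first_answer(response: dict) -> Tuple[Optional[str], str, int]:
    """Return (text, finish_reason, total_tokens) for the first choice."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return (
        choice["message"].get("content"),  # None when the model only calls tools
        choice["finish_reason"],
        usage.get("total_tokens", 0),
    )


print(first_answer(SAMPLE))
```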

Reasoning & thinking

Reasoning/thinking is a cross-model feature for every model ID with supports_reasoning: true — Claude, OpenAI o-series/GPT-5, Gemini, Qwen, DeepSeek, and others. You send the same canonical parameters; Airforce maps them to each provider's native shape. This is not a DeepSeek-only API.

Source of truth: check for supports_reasoning: true on a model in GET /v1/models (or GET /api/models/{id}/allowed-params). Prefer that flag over guessing from the model name.

Canonical parameters

Parameter	Type	Required	Description
reasoning_effort	string	Optional	"low" | "medium" | "high" | "xhigh" | "max". Accepted on every model with supports_reasoning: true. Some upstreams only honour a subset (e.g. high/max); others clamp unsupported levels to the nearest served value.
thinking	string | object	Optional	Three accepted shapes (we normalise): "on" | "off" | "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" | "disabled" }. Mapped onto Claude extended thinking, OpenAI effort profiles, Gemini thinking_config, Qwen enable_thinking, DeepSeek hybrid, etc.
thinking_budget	integer	Optional	Maximum tokens the model may spend reasoning before emitting visible output. Mirrors budget_tokens when the upstream exposes a budget; takes precedence over reasoning_effort when both are sent and a budget is available.
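
To make the three accepted thinking shapes concrete, here is one way they could be reduced to a single form. This is an illustrative sketch, not Airforce's actual normaliser: the function name and the output dictionary are assumptions.

```python
from typing import Any


def normalise_thinking(thinking: Any) -> dict:
    """Reduce the three documented `thinking` shapes to one dict.

    Returns {"mode": "on" | "off" | "auto", "budget_tokens": int or None}.
    The output shape is an assumption made for this sketch.
    """
    if isinstance(thinking, str):
        if thinking not in ("on", "off", "auto"):
            raise ValueError(f"unsupported thinking value: {thinking!r}")
        return {"mode": thinking, "budget_tokens": None}
    if isinstance(thinking, dict):
        kind = thinking.get("type")
        if kind == "enabled":
            # Anthropic-style carries budget_tokens; the hybrid form omits it.
            return {"mode": "on", "budget_tokens": thinking.get("budget_tokens")}
        if kind == "disabled":
            return {"mode": "off", "budget_tokens": None}
    raise ValueError(f"unsupported thinking value: {thinking!r}")


print(normalise_thinking("auto"))
print(normalise_thinking({"type": "enabled", "budget_tokens": 4000}))
print(normalise_thinking({"type": "disabled"}))
```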

What differs by family (mapping only)

Parameters are the same everywhere. Only how we map them (and how hard "off" is) differs:

Claude — Thinking on/off + budget; often also reasoning_effort via the gateway.
OpenAI (o1/o3, GPT-5) — Mainly reasoning_effort. A full "thinking off" is often not available — you control how strongly the model reasons, not always whether it reasons at all.
Gemini — thinking_config / budget mapped internally.
Qwen / Xiaomi / Alibaba — thinking + enable_thinking-style controls.
DeepSeek (generic) — Hybrid on/off is especially clear: thinking: { type: enabled|disabled } plus optional reasoning_effort.
Resellers / other — Often generic passthrough of the same canonical fields.

Controlling where the trace appears

An optional reasoning object on the request decides what happens to the thinking trace. It is consumed by Airforce and never forwarded upstream.

Parameter	Type	Required	Description
reasoning.format	string	Optional	"separate" (default) puts the trace in message.reasoning (and delta.reasoning while streaming). "inline" keeps the legacy inline <think>…</think> form inside content.
reasoning.exclude	boolean	Optional	When true, the reasoning trace is dropped entirely from the response. Reasoning tokens are still counted and billed if the model produced them.

"reasoning": { "format": "separate", "exclude": false }

Reasoning effort (OpenAI-style)

Primary control for o-series and GPT-5: how much the model may reason. Same canonical field as on every other supports_reasoning model — OpenAI is included, but behaviour is not 1:1 with DeepSeek's hard on/off.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "o3-mini",
    "messages": [{"role": "user", "content": "Prove the Pythagorean theorem."}],
    "reasoning_effort": "high"
  }'

Extended thinking (Anthropic-style)

Budget-based thinking for Claude (and gateways that accept the Anthropic shape). You can still send reasoning_effort; we map when the channel supports it.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "messages": [{"role": "user", "content": "Plan a 7-day Italy trip."}],
    "thinking": {"type": "enabled", "budget_tokens": 4000}
  }'

Hybrid thinking (e.g. DeepSeek V3.2/V4)

Example of a hybrid model family with a clear Thinking / Non-Thinking switch — not a separate protocol. deepseek-v3.2, deepseek-v4-flash and deepseek-v4-pro accept the same canonical fields as every other supports_reasoning model. Toggle thinking and optionally set effort in one request:

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Solve this step by step: integrate x^2 * e^x."}],
    "thinking": {"type": "enabled"},
    "reasoning_effort": "high"
  }'

Turn thinking off (faster, cheaper when you only need the final answer) — this hard off is clearer on hybrid models than on many OpenAI o-series profiles:

"thinking": {"type": "disabled"}
// or simply: "thinking": "off"

Native docs for this family often list effort levels such as "high" and "max". We accept the full low…max scale and map unsupported levels to the nearest value that reaches the model. Prefer the hybrid IDs above over retired deepseek-chat / deepseek-reasoner names when you need an explicit on/off switch.

The reasoning trace itself appears in choices[0].message.reasoning (OpenAI shape) or as thinking blocks in content (Anthropic shape). Reasoning tokens are billed and reported in usage.completion_tokens_details.reasoning_tokens.

That completion_tokens_details.reasoning_tokens breakdown only appears when the upstream provider reports it. On a streamed response, the trace arrives via delta.reasoning_content, chunk by chunk.
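
Because the trace can arrive either in message.reasoning or, with reasoning.format set to "inline", inside <think> tags in content, a client may want one helper that handles both. The sketch below assumes a non-streamed OpenAI-shaped message; `split_reasoning` is a hypothetical name.

```python
import re
from typing import Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(message: dict) -> Tuple[str, str]:
    """Return (reasoning, answer) for either reasoning.format."""
    content = message.get("content") or ""
    if message.get("reasoning"):  # "separate" format
        return message["reasoning"], content
    # "inline" format: pull the <think> blocks out of the content
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(content))
    answer = THINK_RE.sub("", content).strip()
    return reasoning, answer


print(split_reasoning({"content": "Paris.", "reasoning": "France -> Paris"}))
print(split_reasoning({"content": "<think>France -> Paris</think>\nParis."}))
```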

Vision & image input

Models with supports_vision: true accept images embedded as content blocks. Public URLs and base64 data URLs both work; size limits depend on the upstream model.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
      ]
    }]
  }'
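
For a local file, the image has to be sent as a base64 data URL instead of a public URL. A minimal sketch of building such a content block follows; the helper names are made up, and the MIME type must match the actual image.

```python
import base64


def image_block(data: bytes, mime_type: str = "image/png") -> dict:
    """Wrap raw image bytes in an image_url content block (data URL form)."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def vision_message(prompt: str, data: bytes, mime_type: str = "image/png") -> dict:
    """Build a user message that pairs a text prompt with one image."""
    return {
        "role": "user",
        "content": [{"type": "text", "text": prompt}, image_block(data, mime_type)],
    }


# Placeholder bytes stand in for a real image file read with open(path, "rb").
message = vision_message("What is in this image?", b"abc")
print(message["content"][1]["image_url"]["url"])
```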

Tool calling

Models with supports_tools: true can call functions you define. The model returns a tool_calls array; you execute the call, then send the result back in a tool message.

Request

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto"
  }'

Response with a tool call

{
  "id": "chatcmpl-abc123",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"Paris\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

Follow-up with the tool result

{
  "model": "gpt-5.1-chat",
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}
      }]
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 14, \"sky\": \"cloudy\"}"}
  ]
}
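
The client-side half of this round trip (run each requested call, then append one tool message per call) can be sketched as below. The `get_weather` body is a stand-in that returns fixed data, and `TOOLS` and `tool_messages` are hypothetical names.

```python
import json


def get_weather(location: str) -> dict:
    """Stand-in tool; a real one would query a weather service."""
    return {"temp_c": 14, "sky": "cloudy"}


TOOLS = {"get_weather": get_weather}  # name -> callable registry (assumption)


def tool_messages(assistant_message: dict) -> list:
    """Run every requested tool call and return the matching tool messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments is a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],  # must echo the id from the tool call
            "content": json.dumps(fn(**args)),
        })
    return results


assistant = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"},
    }],
}
followup = [
    {"role": "user", "content": "What is the weather in Paris?"},
    assistant,
    *tool_messages(assistant),
]
print(followup[-1])
```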

Assistant prefill

End your messages array with an assistant message that already contains some text, and the model continues from it instead of starting a fresh turn. This is a reliable way to force a response to begin a specific way — a leading "{" for JSON, a chosen language, or a fixed prefix. The same trick works on /v1/messages. Providers that reject native prefill are handled automatically: the gateway retries once with a compatible rewrite, so you do not have to special-case them.

{
  "model": "claude-sonnet-4.6",
  "messages": [
    {"role": "user", "content": "List three primary colors as a JSON array."},
    {"role": "assistant", "content": "["}
  ]
}

Structured outputs

Set response_format to make the model return JSON. Two modes are supported:

{ "type": "json_object" } — the response is a single valid JSON value.
{ "type": "json_schema", "json_schema": { "name", "schema", "strict" } } — the model is steered to produce JSON that matches your JSON Schema.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Extract the city and country: I live in Paris, France."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "location",
        "schema": {
          "type": "object",
          "properties": { "city": {"type": "string"}, "country": {"type": "string"} },
          "required": ["city", "country"]
        }
      }
    }
  }'

Reliability: even when a model wraps its answer in prose or a markdown code fence, Airforce extracts the JSON payload so you always receive parseable content. If no valid JSON can be recovered, the original text is returned unchanged — so the guarantee never makes a response worse. This applies to non-streamed responses; streamed responses are passed through unchanged.
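
Since streamed responses are passed through unchanged, a client that streams may want to do a similar recovery itself. The sketch below is one plausible client-side approach, not Airforce's actual extraction logic; it only recovers JSON objects, and like the behaviour described above it returns the original text when nothing parses.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # a markdown code fence, built this way to keep the sample readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str) -> Any:
    """Return the parsed JSON payload, or the original text if none is found."""
    candidates = [text]                      # 1) the whole reply
    candidates += FENCE_RE.findall(text)     # 2) any fenced block
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start:end + 1])  # 3) outermost {...} span
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return text


fenced = "Sure:\n" + FENCE + "json\n{\"city\": \"Paris\"}\n" + FENCE
print(extract_json(fenced))
print(extract_json('Here you go: {"city": "Paris"} hope it helps'))
print(extract_json("no json here"))
```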

Streaming

Set stream: true to receive partial completions as Server-Sent Events. Each event is a JSON chunk with the same shape as the non-streamed response, except that message is replaced by delta. The stream ends with data: [DONE].

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Write a haiku about Berlin."}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'

Wire format

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"Cold "},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"stone "},"finish_reason":null}]}

…

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":17,"total_tokens":29}}

data: [DONE]
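
A client consumes this stream by reading the data: lines, stopping at [DONE], and concatenating each delta.content. The sketch below works on an iterable of already-decoded lines (how they are read from the socket is left out), and the sample chunks are shortened versions of the ones above.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[str], Optional[dict]]:
    """Fold SSE lines into (text, finish_reason, usage)."""
    text, finish_reason, usage = [], None, None
    for line in lines:
        if not line.startswith("data:"):
            continue  # blank separator lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text.append(choice.get("delta", {}).get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
        usage = chunk.get("usage") or usage  # carried by the final chunk
    return "".join(text), finish_reason, usage


STREAM = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    '',
    'data: {"choices":[{"index":0,"delta":{"content":"Cold "},"finish_reason":null}]}',
    '',
    'data: {"choices":[{"index":0,"delta":{"content":"stone "},"finish_reason":null}]}',
    '',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],'
    '"usage":{"prompt_tokens":12,"completion_tokens":17,"total_tokens":29}}',
    '',
    'data: [DONE]',
]
print(collect_stream(STREAM))
```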

Reliability & smart routing

Every model ID resolves to a pool of upstream providers behind the scenes. If the first one errors or times out, the request is automatically retried against the next provider for the same model, in order, before any failure is returned — you do not configure or trigger this. The model field in the response always reports the variant that actually answered. This is independent of the optional models / fallbacks array, which adds your own cross-model candidates on top: first the primary model exhausts its own provider chain, then each fallback model exhausts its chain.
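
The attempt order described above (primary model's provider chain first, then each fallback model's chain) can be illustrated with a toy function. Provider names and the `chains` mapping are invented for the example; the real pools are internal to the gateway.

```python
from typing import Dict, List, Tuple


def attempt_order(chains: Dict[str, List[str]], primary: str,
                  fallbacks: List[str]) -> List[Tuple[str, str]]:
    """List (model, provider) attempts in the documented order."""
    order = []
    for model in [primary, *fallbacks[:3]]:      # at most 3 fallback models
        for provider in chains.get(model, []):   # unknown models are skipped
            order.append((model, provider))
    return order


# Hypothetical provider pools, for illustration only.
chains = {
    "gpt-5.1-chat": ["provider-a", "provider-b"],
    "deepseek-v3.2": ["provider-c"],
}
for step in attempt_order(chains, "gpt-5.1-chat", ["deepseek-v3.2", "unknown-model"]):
    print(step)
```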

POST /v1/messages

Anthropic-compatible Messages API. Works with the official @anthropic-ai/sdk by setting baseURL to https://api.airforce. Requests for non-Claude models are forwarded transparently to OpenAI, Google and other providers.

POST https://api.airforce/v1/messages

Request body

Parameter	Type	Required	Description
model	string	Required	Model ID (Anthropic format or a routed alias).
messages	array	Required	Each entry: { role: "user" | "assistant", content: string | array }.
max_tokens	integer	Required	Required by Anthropic. Token cap for the response.
system	string | array	Optional	System prompt. Pass an array of { type: "text", text, cache_control? } blocks to mark cached prefix segments. See "Prompt caching".
temperature	float	Optional	0–1.
top_p	float	Optional	Nucleus sampling.
top_k	integer	Optional	Restricts the sampling pool to the top-K tokens.
stop_sequences	array	Optional	Up to 4 stop sequences.
stream	boolean	Optional	When true, emits an Anthropic-style SSE event stream (see "Streaming events").
fallbacks	array	Optional	Fallback models (max 3) in Anthropic form: [{"model": "gpt-4o-mini"}]. If every channel of the primary model fails, each candidate is tried in order; you are billed for — and the response model field reports — the model that actually answered. A plain models string array is accepted too.
tools	array	Optional	Anthropic tool definitions: { name, description, input_schema }. The response may contain tool_use content blocks.
tool_choice	object	Optional	{ type: "auto" | "any" | "tool", name? }.
thinking	object	Optional	Anthropic extended thinking: { type: "enabled", budget_tokens: N }.

Example

curl https://api.airforce/v1/messages \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "max_tokens": 256,
    "system": "You are a helpful assistant.",
    "messages": [
      {"role": "user", "content": "Hello, Claude!"}
    ]
  }'

Response shape

Parameter	Type	Required	Description
id	string	Optional	Message ID, e.g. "msg_01ABCxyz".
type	string	Optional	Always "message".
role	string	Optional	Always "assistant".
content	array	Optional	Array of content blocks: { type: "text" | "tool_use" | "thinking", … }.
model	string	Optional	Echo of the requested model.
stop_reason	string	Optional	"end_turn" | "max_tokens" | "stop_sequence" | "tool_use".
usage	object	Optional	{ input_tokens, output_tokens, cache_read_input_tokens?, cache_creation_input_tokens?, cache_creation? }. The cache fields appear when prompt caching is used. cache_creation.ephemeral_5m_input_tokens and ephemeral_1h_input_tokens give the write split by TTL.
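
Because content is an array of typed blocks, clients usually walk it instead of reading a single string. The sketch below assumes the per-block fields follow Anthropic's Messages format (text on text blocks, thinking on thinking blocks, name and input on tool_use blocks); this page does not spell those fields out, and the sample response is invented.

```python
def summarise_message(response: dict) -> dict:
    """Split an Anthropic-shaped response into text, thinking and tool calls."""
    text, thinking, tool_uses = [], [], []
    for block in response.get("content", []):
        if block["type"] == "text":
            text.append(block["text"])
        elif block["type"] == "thinking":
            thinking.append(block.get("thinking", ""))
        elif block["type"] == "tool_use":
            tool_uses.append((block["name"], block["input"]))
    return {
        "text": "".join(text),
        "thinking": "\n".join(thinking),
        "tool_uses": tool_uses,
        "stop_reason": response.get("stop_reason"),
    }


# Invented sample in the documented shape.
RESPONSE = {
    "id": "msg_01ABCxyz",
    "type": "message",
    "role": "assistant",
    "content": [
        {"type": "thinking", "thinking": "The user greets me."},
        {"type": "text", "text": "Hello!"},
    ],
    "stop_reason": "end_turn",
    "usage": {"input_tokens": 12, "output_tokens": 17},
}
print(summarise_message(RESPONSE))
```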

Streaming events

Anthropic's SSE uses named events instead of bare JSON chunks. Each event has an event: name and a data: JSON payload.

event: message_start
data: {"type":"message_start","message":{"id":"msg_01","role":"assistant","content":[],"model":"claude-sonnet-4.6","stop_reason":null,"usage":{"input_tokens":12,"output_tokens":1}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}

event: message_stop
data: {"type":"message_stop"}
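
A client pairs each event: line with the data: line that follows and reacts by event name. The sketch below only gathers text deltas and the stop reason from already-decoded lines; the function names are hypothetical and the sample is a shortened copy of the stream above.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_events(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, payload) pairs from raw SSE lines."""
    name = None
    for line in lines:
        if line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:") and name is not None:
            yield name, json.loads(line[len("data:"):].strip())
            name = None


def collect_message(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Return (text, stop_reason) accumulated over the whole stream."""
    text, stop_reason = [], None
    for name, data in iter_events(lines):
        if name == "content_block_delta" and data["delta"].get("type") == "text_delta":
            text.append(data["delta"]["text"])
        elif name == "message_delta":
            stop_reason = data["delta"].get("stop_reason", stop_reason)
    return "".join(text), stop_reason


EVENTS = [
    'event: message_start',
    'data: {"type":"message_start","message":{"id":"msg_01","role":"assistant","content":[]}}',
    '',
    'event: content_block_delta',
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}',
    '',
    'event: message_delta',
    'data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}',
    '',
    'event: message_stop',
    'data: {"type":"message_stop"}',
]
print(collect_message(EVENTS))
```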

POST /v1/messages/count_tokens

Anthropic-compatible token counting. Send the same system / messages / tools you would pass to /v1/messages and get an input-token estimate back without running the model — nothing is billed.

POST https://api.airforce/v1/messages/count_tokens

curl https://api.airforce/v1/messages/count_tokens \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "system": "You are a helpful assistant.",
    "messages": [{"role": "user", "content": "Hello, Claude!"}]
  }'

# → {"input_tokens": 34}

The count is a fast character-based estimate (about 4 characters per token) over system, messages and tools — close enough for context-budget checks, not an exact tokenizer run.
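
For a rough local pre-check, the 4-characters-per-token rule of thumb can be applied directly. This is only a sketch of the heuristic, not the endpoint's algorithm: how the server accounts for roles, structure and tools is not specified here, so its figure will differ (the sample above returns 34, while this sketch gives 11 for the same text).

```python
import json
import math
from typing import Sequence


def estimate_tokens(system: str = "", messages: Sequence[dict] = (),
                    tools: Sequence[dict] = ()) -> int:
    """Estimate input tokens as ceil(characters / 4)."""
    chars = len(system)
    for message in messages:
        content = message["content"]
        # Block arrays are measured by their JSON text (an assumption).
        chars += len(content) if isinstance(content, str) else len(json.dumps(content))
    for tool in tools:
        chars += len(json.dumps(tool))
    return math.ceil(chars / 4)


print(estimate_tokens(
    "You are a helpful assistant.",
    [{"role": "user", "content": "Hello, Claude!"}],
))
```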

Prompt caching

On /v1/messages with Claude models, mark a prefix as cacheable by passing system as an array of blocks in which the cached segment carries cache_control: { type: "ephemeral" }. Subsequent requests that start with the same prefix are charged the cheaper cache-read rate. Models with supports_caching: true in /v1/models support this.

Write vs read pricing

Cache writes are typically charged slightly above normal input (about 1.25× on Claude-family models). Cache reads are much cheaper (about 0.1× input). A large write with almost no later read is the expensive case — not a “cache discount”. Only reusing the same prefix turns the write into savings.
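
The arithmetic behind that statement is easy to check. The sketch below uses the approximate multipliers quoted above (1.25× write, 0.1× read) as defaults; actual per-model rates may differ, and the unit is "one prefix token at the normal input rate".

```python
def cached_prefix_cost(prefix_tokens: int, calls: int,
                       write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    """Relative input cost of the prefix over `calls` requests with caching.

    The first call pays a cache write, every later call a cache read.
    """
    if calls < 1:
        return 0.0
    return prefix_tokens * write_mult + prefix_tokens * read_mult * (calls - 1)


def uncached_prefix_cost(prefix_tokens: int, calls: int) -> float:
    """Relative input cost of resending the prefix uncached every time."""
    return float(prefix_tokens * calls)


for calls in (1, 2, 5):
    print(calls, cached_prefix_cost(2000, calls), uncached_prefix_cost(2000, calls))
```

With these multipliers a single call is more expensive with caching (1.25 against 1.0 per prefix token), while two identical calls already come out ahead (1.25 + 0.1 = 1.35 against 2.0).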

Tools like Claude Code often attach a large project context with cache markers on the first turns. Expect cache-write spend while the repo/system prefix is loaded; later turns only get cheap if that prefix is stable and reused. Subagents and multi-step agents can multiply large contexts across several requests.

{
  "model": "claude-sonnet-4.6",
  "max_tokens": 1024,
  "system": [
    {"type": "text", "text": "You are a senior staff engineer at Airforce."},
    {
      "type": "text",
      "text": "<repository-snapshot>...</repository-snapshot>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Where is rate limiting enforced?"}
  ]
}

How cache metrics are reported in responses

Cache token metrics are passed through in each format's native shape, so the SDKs (openai, @anthropic-ai/sdk, @google/genai) read them without custom code. Fields are omitted when their value is zero, which keeps non-cached responses lean.

/v1/chat/completions (OpenAI shape)

"usage": {
  "prompt_tokens": 2104,
  "completion_tokens": 147,
  "total_tokens": 2251,
  "prompt_tokens_details": { "cached_tokens": 1980 },
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1/messages (Anthropic shape)

"usage": {
  "input_tokens": 2104,
  "output_tokens": 147,
  "cache_read_input_tokens": 1980,
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1beta/.../generateContent (Gemini shape)

"usageMetadata": {
  "promptTokenCount": 2104,
  "candidatesTokenCount": 147,
  "totalTokenCount": 2251,
  "cachedContentTokenCount": 1980
}
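
Code that talks to more than one surface may want a single view of these counters. The sketch below detects the shape from its field names and returns read and write totals; `cache_stats` is a hypothetical helper, and the Gemini shape shown above carries no write counter, so 0 is reported there.

```python
def cache_stats(usage: dict) -> dict:
    """Return {"read": int, "write": int} from any of the three usage shapes."""
    if "promptTokenCount" in usage:  # Gemini usageMetadata
        # No write counter appears in this shape, so report 0.
        return {"read": usage.get("cachedContentTokenCount", 0), "write": 0}
    if "prompt_tokens" in usage:  # OpenAI shape
        read = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    else:  # Anthropic shape
        read = usage.get("cache_read_input_tokens", 0)
    return {"read": read, "write": usage.get("cache_creation_input_tokens", 0)}


openai_usage = {
    "prompt_tokens": 2104, "completion_tokens": 147, "total_tokens": 2251,
    "prompt_tokens_details": {"cached_tokens": 1980},
    "cache_creation_input_tokens": 124,
}
anthropic_usage = {
    "input_tokens": 2104, "output_tokens": 147,
    "cache_read_input_tokens": 1980, "cache_creation_input_tokens": 124,
}
gemini_usage = {
    "promptTokenCount": 2104, "candidatesTokenCount": 147,
    "totalTokenCount": 2251, "cachedContentTokenCount": 1980,
}
for usage in (openai_usage, anthropic_usage, gemini_usage):
    print(cache_stats(usage))
```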

Where caching applies

Explicit cache_control markers are honoured on /v1/messages and /v1/chat/completions for Claude models; place them on a system or message content block. Many other providers (the OpenAI family, DeepSeek, Gemini) cache automatically: you send no marker and simply see cached_tokens in the response when a long enough prefix is reused.

Cache lifetime: 5 minutes or 1 hour

A cached prefix lives for 5 minutes by default, and the timer resets on every hit. To keep a prefix alive longer, add ttl: "1h" to the marker. The response reports each TTL separately under cache_creation.

"cache_control": { "type": "ephemeral", "ttl": "1h" }

Example: write first, read later

Send the exact same request twice (the caching example above). The first call that sees the prefix pays for a cache write; identical calls within the TTL pay for the much cheaper cache read.

First call, cache write (usage excerpt):

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 1980,
  "cache_read_input_tokens": 0
}

Second identical call within the TTL, cache read:

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 1980
}

Limits & costs

Claude requires a minimum cacheable prefix (about 1024 tokens; more for some models). Shorter prefixes are simply not cached.
At most 4 cache breakpoints per request, and the cached prefix must be byte-for-byte identical between calls; a single differing character misses the cache.
Cache writes cost more than normal input (5m ≈ 1.25×, 1h ≈ 2×); reads are much cheaper (≈ 0.1×). See each model's cache prices on the pricing page.

POST /v1/responses

OpenAI Responses API surface for stateful conversations. Same Bearer/x-api-key authentication. Cache metrics appear as input_tokens_details.cached_tokens (reads) plus a flat cache_creation_input_tokens and cache_creation.ephemeral_* (writes), for parity with /v1/chat/completions.

POST https://api.airforce/v1/responses

POST /v1beta/models/{model}:generateContent

Google Gemini-compatible endpoint. Works with the official @google/genai SDK and the Gemini CLI by pointing the base URL at https://api.airforce/v1beta. Any routed model works — requests are translated to and from the native Gemini shape, and the model is taken from the URL path (not the body).

POST https://api.airforce/v1beta/models/{model}:generateContent

Authentication

Pass your Airforce API key any of the three ways Google clients use:

# 1) query parameter (Google default)
?key=sk-air-YOUR_API_KEY

# 2) header
x-goog-api-key: sk-air-YOUR_API_KEY

# 3) bearer token
Authorization: Bearer sk-air-YOUR_API_KEY

Request body

Parameter	Type	Required	Description
contents	array	Required	Conversation turns. Each: { role: "user" | "model", parts: [...] }. A part is { text }, { functionCall: { name, args } }, or { functionResponse: { name, response } }. "model" is Gemini's term for the assistant role.
systemInstruction	object	Optional	System prompt: { parts: [{ text }] }.
generationConfig	object	Optional	{ temperature, maxOutputTokens, topP, stopSequences } — mapped to the canonical sampling parameters.
tools	array	Optional	Tool definitions: [{ functionDeclarations: [{ name, description, parameters }] }]. functionDeclarations are flattened across entries.
toolConfig	object	Optional	Tool-choice control: { functionCallingConfig: { mode: "AUTO" | "ANY" | "NONE" } }. ANY forces a call, NONE disables tools.

Example

curl "https://api.airforce/v1beta/models/gemini-3.1-pro:generateContent" \
  -H "x-goog-api-key: sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {"role": "user", "parts": [{"text": "What is the capital of France?"}]}
    ],
    "systemInstruction": {"parts": [{"text": "You are a helpful assistant."}]},
    "generationConfig": {"temperature": 0.7, "maxOutputTokens": 256}
  }'
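
If your code already holds OpenAI-style messages, the Gemini body differs mainly in naming: assistant becomes model, system text moves to systemInstruction, and sampling settings go under generationConfig. The sketch below covers plain-text messages only, and the helper name is made up.

```python
from typing import List, Optional


def to_gemini_body(messages: List[dict], temperature: Optional[float] = None,
                   max_tokens: Optional[int] = None) -> dict:
    """Convert plain-text OpenAI-style messages into a generateContent body."""
    body = {"contents": []}
    system_parts = []
    for message in messages:
        if message["role"] == "system":
            system_parts.append({"text": message["content"]})
            continue
        role = "model" if message["role"] == "assistant" else "user"
        body["contents"].append({"role": role, "parts": [{"text": message["content"]}]})
    if system_parts:
        body["systemInstruction"] = {"parts": system_parts}
    config = {}
    if temperature is not None:
        config["temperature"] = temperature
    if max_tokens is not None:
        config["maxOutputTokens"] = max_tokens
    if config:
        body["generationConfig"] = config
    return body


body = to_gemini_body(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(body)
```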

Response shape

Parameter	Type	Required	Description
candidates	array	Optional	Generated turns: [{ content: { role: "model", parts }, finishReason, index }]. Only the first candidate is populated.
candidates[].finishReason	string	Optional	"STOP" | "MAX_TOKENS" | "SAFETY" | "OTHER".
usageMetadata	object	Optional	{ promptTokenCount, candidatesTokenCount, totalTokenCount, cachedContentTokenCount? }. cachedContentTokenCount appears when the upstream reported a cache read.
modelVersion	string	Optional	Echo of the requested model.

{
  "candidates": [{
    "content": {
      "role": "model",
      "parts": [{"text": "The capital of France is Paris."}]
    },
    "finishReason": "STOP",
    "index": 0
  }],
  "usageMetadata": {
    "promptTokenCount": 16,
    "candidatesTokenCount": 8,
    "totalTokenCount": 24
  },
  "modelVersion": "gemini-3.1-pro"
}

POST /v1beta/models/{model}:streamGenerateContent

Streaming uses the :streamGenerateContent action and returns Server-Sent Events. Each data: line is a full Gemini-shaped chunk (not a delta object); the final chunk carries usageMetadata.

data: {"candidates":[{"content":{"role":"model","parts":[{"text":"The capital"}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[{"text":" is Paris."}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[]},"finishReason":"STOP","index":0}],"usageMetadata":{"promptTokenCount":16,"candidatesTokenCount":8,"totalTokenCount":24}}
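
In the sample above each chunk carries the next piece of text, so a client can concatenate the text parts in order. A minimal sketch over already-decoded lines follows; the function name is hypothetical.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_gemini_stream(lines: Iterable[str]) -> Tuple[str, Optional[str], Optional[dict]]:
    """Fold Gemini-shaped SSE lines into (text, finishReason, usageMetadata)."""
    text, finish_reason, usage = [], None, None
    for line in lines:
        if not line.startswith("data:"):
            continue  # blank separator lines
        chunk = json.loads(line[len("data:"):].strip())
        for candidate in chunk.get("candidates", []):
            for part in candidate.get("content", {}).get("parts", []):
                text.append(part.get("text", ""))
            finish_reason = candidate.get("finishReason", finish_reason)
        usage = chunk.get("usageMetadata", usage)  # carried by the final chunk
    return "".join(text), finish_reason, usage


CHUNKS = [
    'data: {"candidates":[{"content":{"role":"model","parts":[{"text":"The capital"}]},"index":0}]}',
    '',
    'data: {"candidates":[{"content":{"role":"model","parts":[{"text":" is Paris."}]},"index":0}]}',
    '',
    'data: {"candidates":[{"content":{"role":"model","parts":[]},"finishReason":"STOP","index":0}],'
    '"usageMetadata":{"promptTokenCount":16,"candidatesTokenCount":8,"totalTokenCount":24}}',
]
print(collect_gemini_stream(CHUNKS))
```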

List models

The catalog is also exposed in Gemini Model-resource shape so Google clients can enumerate models.

curl https://api.airforce/v1beta/models

Notes: the base URL is https://api.airforce/v1beta (or /v1), not Google's host. The model name comes from the URL path, not the request body. Only the first candidate is returned, and a subset of Gemini fields is translated — safetySettings and cachedContent are currently ignored. Billing, rate limits and smart routing apply exactly as on /v1/chat/completions.

Errors

Airforce returns standard HTTP status codes and a uniform error envelope across both endpoints.

Status	Type	Description
400	invalid_request_error	Malformed JSON, missing required field, unknown model.
401	invalid_request_error / auth_required	Missing or invalid API key.
402	insufficient_quota	The model requires an active subscription plan or a positive Pay-as-you-Go balance.
403	model_access_denied / insufficient_scope	Plan permissions or per-key permissions deny this request.
404	model_not_found	The requested model does not exist or you do not have access to it.
429	rate_limit_error	Request rate or daily token limit exceeded.
503	api_error / moderation_unavailable	All upstream keys for the requested provider failed.

{
  "error": {
    "message": "The requested model does not exist or you do not have access to it.",
    "type": "model_not_found",
    "param": null,
    "code": "404"
  }
}

The descriptive slug lives in type. code is the HTTP status code as a string (for example "404"), and param is null except for parameter-range validation errors, where it names the offending parameter.

Model discovery

See the full list of model IDs and their capability flags (vision, tools, reasoning, caching, context length, ...) at /docs/api/models.
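
A client can unpack this envelope once and branch on type and status. In the sketch below, treating 429 and 503 as retryable is a common client-side convention assumed for illustration, not something this page prescribes; `parse_error` is a hypothetical helper.

```python
import json

RETRYABLE_STATUSES = {429, 503}  # assumption: rate limits and upstream outages


def parse_error(status: int, body: str) -> dict:
    """Flatten the error envelope and flag whether a retry may help."""
    error = json.loads(body).get("error", {})
    return {
        "status": status,
        "type": error.get("type"),
        "message": error.get("message"),
        "param": error.get("param"),
        "retryable": status in RETRYABLE_STATUSES,
    }


SAMPLE_ERROR = json.dumps({
    "error": {
        "message": "The requested model does not exist or you do not have access to it.",
        "type": "model_not_found",
        "param": None,
        "code": "404",
    }
})
print(parse_error(404, SAMPLE_ERROR))
```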

curl https://api.airforce/v1/models \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY"