API REFERENCE

Chat Completions

Generate chat responses across 100+ models with a single API. Drop-in compatible with OpenAI Chat Completions, Anthropic Messages, and OpenAI Responses.

Airforce supports both the OpenAI Chat Completions and the Anthropic Messages wire formats over the same set of models. Pick the SDK you already use and change only the base URL; non-Claude models are forwarded transparently behind either surface.

This page covers authentication, the request and response shapes of both surfaces, streaming, tool calling, vision, reasoning, and prompt caching. New here? Start with the basic example below and get a single call working first, then layer on streaming, tools, and caching.

Authentication

Every request requires a Bearer token (your Airforce API key). The Anthropic x-api-key header is also accepted, for SDK compatibility on /v1/messages.

Authorization: Bearer sk-air-YOUR_API_KEY
# alt for /v1/messages:
x-api-key: sk-air-YOUR_API_KEY

POST /v1/chat/completions

OpenAI-compatible chat completions. Use the official openai SDK by overriding base_url to https://api.airforce/v1.

POST https://api.airforce/v1/chat/completions

Request body

Parameter	Type	Required	Description
model	string	Required	Model ID. Use GET /v1/models to discover the available IDs.
messages	array	Required	Conversation history. Each item has the form { role: "system" | "user" | "assistant" | "tool", content }. content is a string or an array of content blocks (for vision, see below).
max_tokens	integer	Optional	Maximum number of tokens to generate. Capped by the model's max_output_tokens.
temperature	float	Optional	Sampling temperature, 0–2. Lower values are more deterministic. The default depends on the upstream provider.
top_p	float	Optional	Nucleus sampling. Use either temperature or top_p, not both.
stream	boolean	Optional	If true, the response is a stream of server-sent events. See "Streaming" below.
models	array	Optional	Fallback models (max 3), e.g. ["deepseek-v3.2", "gpt-4o-mini"]. If every channel of the primary model fails, each candidate is tried in order. You are billed for the model that actually answered, and response.model reports it. Unknown or plan-gated candidates are skipped. With the OpenAI SDK, pass it via extra_body.
transforms	array	Optional	Prompt transforms. Supported: ["middle-out"]. When the conversation overflows the model's context window, whole messages are dropped from the middle (system prompts, the first message and the most recent turns are kept), so long roleplay or agent histories keep working instead of erroring. Opt-in; off by default.
stream_options	object	Optional	{ include_usage: boolean }. usage is always included in the final streamed chunk. This field is accepted for OpenAI compatibility, but usage reporting cannot be turned off.
stop	string | array	Optional	Up to 4 stop sequences. Generation stops as soon as one of them is produced.
tools	array	Optional	Function definitions the model may call. See "Tool calling" below.
tool_choice	string | object	Optional	"auto" (default), "none", or { type: "function", function: { name } } to force a specific call.
response_format	object	Optional	{ type: "json_object" } forces the model to emit valid JSON. Ignored for models that do not support it. See "Structured outputs" for the json_schema mode.
reasoning_effort	string	Optional	Reasoning depth: "low" | "medium" | "high" | "xhigh" | "max". Any model with supports_reasoning: true (Claude, OpenAI o/GPT-5, Gemini, Qwen, DeepSeek, …). See "Reasoning & thinking".
thinking	string | object	Optional	Cross-model thinking switch. "on" | "off" | "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" | "disabled" }. See "Reasoning & thinking".
thinking_budget	integer	Optional	Token cap for the model's reasoning trace (when the provider exposes one).
ignore_defaults	boolean	Optional	Skips your saved per-model default parameters (configured in the dashboard) for this request.
skill	string	Optional	ID of a single marketplace skill to apply to this request. The skill transforms your messages/parameters before the upstream call and overrides any installed-skill defaults. Consumed by Airforce, never forwarded upstream. See the Skills catalog at /docs/api/extend.
skills	array	Optional	Array of marketplace skill IDs applied in order, for stacking multiple skills on one request.

Basic example

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
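
The same call can be assembled from Python without any SDK. The sketch below is a minimal illustration using only the standard library; build_chat_request is a hypothetical helper name, and the request is built but not sent so the example stays side-effect free.

```python
import json
import urllib.request

BASE_URL = "https://api.airforce/v1"


def build_chat_request(api_key: str, model: str, messages: list, **params) -> urllib.request.Request:
    """Build (but do not send) a POST to /chat/completions."""
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "sk-air-YOUR_API_KEY",
    "gpt-5.1-chat",
    [{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=200,
    temperature=0.7,
)
# To send it: response = json.load(urllib.request.urlopen(req))
```

Extra body fields such as models or transforms can be passed as additional keyword arguments, since they are ordinary top-level keys of the request body.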

Response shape

Parameter	Type	Required	Description
id	string	Optional	Stable completion ID (e.g. "chatcmpl-abc123").
object	string	Optional	"chat.completion" when not streaming, "chat.completion.chunk" when streaming.
created	integer	Optional	Unix timestamp in seconds.
model	string	Optional	ID of the model that answered. This echoes the requested model ID unless a fallback from models answered instead.
choices	array	Optional	Array of completion candidates: [{ index, message: { role, content, tool_calls? }, finish_reason }].
choices[].finish_reason	string	Optional	"stop" | "length" | "tool_calls" | "content_filter".
usage	object	Optional	{ prompt_tokens, completion_tokens, total_tokens, completion_tokens_details?, prompt_tokens_details?, cache_creation_input_tokens?, cache_creation? }. completion_tokens_details.reasoning_tokens is set when the model produced a reasoning trace. The cache fields appear when the upstream returned prompt-caching information: prompt_tokens_details.cached_tokens reports cache reads (the OpenAI standard), cache_creation_input_tokens counts writes, and cache_creation.ephemeral_5m_input_tokens / ephemeral_1h_input_tokens give the split by TTL.

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "gpt-5.1-chat",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 8,
    "total_tokens": 28
  }
}

Reasoning & thinking

Reasoning/thinking is a cross-model feature for every model ID with supports_reasoning: true — Claude, OpenAI o-series/GPT-5, Gemini, Qwen, DeepSeek, and others. You send the same canonical parameters; Airforce maps them to each provider's native shape. This is not a DeepSeek-only API.

Source of truth: check for supports_reasoning: true on the model in GET /v1/models (or GET /api/models/{id}/allowed-params). Prefer that flag over guessing from the model name.

Canonical parameters

Parameter	Type	Required	Description
reasoning_effort	string	Optional	"low" | "medium" | "high" | "xhigh" | "max". Accepted on every model with supports_reasoning: true. Some upstreams only honour a subset (e.g. high/max); others clamp unsupported levels to the nearest served value.
thinking	string | object	Optional	Three accepted shapes (we normalise): "on" | "off" | "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" | "disabled" }. Mapped onto Claude extended thinking, OpenAI effort profiles, Gemini thinking_config, Qwen enable_thinking, DeepSeek hybrid, etc.
thinking_budget	integer	Optional	Maximum tokens the model may spend reasoning before emitting visible output. Mirrors budget_tokens when the upstream exposes a budget; takes precedence over reasoning_effort when both are sent and a budget is available.

What differs by family (mapping only)

Parameters are the same everywhere. Only how we map them (and how hard "off" is) differs:

Claude — Thinking on/off + budget; often also reasoning_effort via the gateway.
OpenAI (o1/o3, GPT-5) — Mainly reasoning_effort. A full "thinking off" is often not available — you control how strongly the model reasons, not always whether it reasons at all.
Gemini — thinking_config / budget mapped internally.
Qwen / Xiaomi / Alibaba — thinking + enable_thinking-style controls.
DeepSeek (generic) — Hybrid on/off is especially clear: thinking: { type: enabled|disabled } plus optional reasoning_effort.
Resellers / other — Often generic passthrough of the same canonical fields.

Controlling where the trace appears

An optional reasoning object on the request decides what happens to the thinking trace. It is consumed by Airforce and never forwarded upstream.

Parameter	Type	Required	Description
reasoning.format	string	Optional	"separate" (default) puts the trace in message.reasoning (and delta.reasoning while streaming). "inline" keeps the legacy inline <think>…</think> form inside content.
reasoning.exclude	boolean	Optional	When true, the reasoning trace is dropped entirely from the response. Reasoning tokens are still counted and billed if the model produced them.

"reasoning": { "format": "separate", "exclude": false }
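
If you opt into "format": "inline", the trace arrives inside content and has to be separated on the client. The following is a minimal sketch of that split, assuming well-formed, non-nested <think> tags; split_inline_reasoning is a hypothetical helper name, not part of any SDK.

```python
import re

# Non-greedy match so several <think> sections are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_inline_reasoning(content: str) -> tuple[str, str]:
    """Return (reasoning, answer) from content in the inline <think> form."""
    traces = THINK_RE.findall(content)
    answer = THINK_RE.sub("", content).strip()
    return "\n".join(t.strip() for t in traces), answer


reasoning, answer = split_inline_reasoning(
    "<think>2 + 2 is 4.</think>\nThe answer is 4."
)
```

With the default "separate" format none of this is needed, because the trace is already delivered outside content.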

Reasoning effort (OpenAI style)

Primary control for o-series and GPT-5: how much the model may reason. Same canonical field as on every other supports_reasoning model — OpenAI is included, but behaviour is not 1:1 with DeepSeek's hard on/off.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "o3-mini",
    "messages": [{"role": "user", "content": "Prove the Pythagorean theorem."}],
    "reasoning_effort": "high"
  }'

Extended thinking (Anthropic style)

Budget-based thinking for Claude (and gateways that accept the Anthropic shape). You can still send reasoning_effort; we map when the channel supports it.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "messages": [{"role": "user", "content": "Plan a 7-day Italy trip."}],
    "thinking": {"type": "enabled", "budget_tokens": 4000}
  }'

Hybrid thinking (e.g. DeepSeek V3.2/V4)

Example of a hybrid model family with a clear Thinking / Non-Thinking switch — not a separate protocol. deepseek-v3.2, deepseek-v4-flash and deepseek-v4-pro accept the same canonical fields as every other supports_reasoning model. Toggle thinking and optionally set effort in one request:

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Solve this step by step: integrate x^2 * e^x."}],
    "thinking": {"type": "enabled"},
    "reasoning_effort": "high"
  }'

Turn thinking off (faster, cheaper when you only need the final answer) — this hard off is clearer on hybrid models than on many OpenAI o-series profiles:

"thinking": {"type": "disabled"}
// or simply: "thinking": "off"

Native docs for this family often list effort levels such as "high" and "max". We accept the full low…max scale and map unsupported levels to the nearest value that reaches the model. Prefer the hybrid IDs above over retired deepseek-chat / deepseek-reasoner names when you need an explicit on/off switch.

The reasoning trace itself is returned in choices[0].message.reasoning (OpenAI shape) or as a thinking block in content (Anthropic shape). Reasoning tokens are billed and reported as usage.completion_tokens_details.reasoning_tokens.

The completion_tokens_details.reasoning_tokens breakdown is included only when the upstream provider reports it. In streamed responses, the trace arrives chunk by chunk as delta.reasoning_content.
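
A short sketch of reading the trace from either response shape. It assumes the thinking block carries its text in a thinking field, as in Anthropic's native format; both function names are hypothetical. Missing fields are treated as "not reported" rather than as errors, since the token breakdown is optional.

```python
def reasoning_from_chat(resp: dict) -> tuple:
    """Return (trace, reasoning_tokens) from a /v1/chat/completions response."""
    message = resp["choices"][0]["message"]
    details = resp.get("usage", {}).get("completion_tokens_details") or {}
    # Either value is None when the upstream did not report it.
    return message.get("reasoning"), details.get("reasoning_tokens")


def reasoning_from_messages(resp: dict) -> str:
    """Join the text of all thinking blocks in a /v1/messages response."""
    return "".join(
        block.get("thinking", "")
        for block in resp.get("content", [])
        if block.get("type") == "thinking"
    )
```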

Vision & image input

Models with supports_vision: true accept images embedded as content blocks. Both public URLs and base64 data URLs work. Size limits depend on the upstream model.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
      ]
    }]
  }'
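
For local files, the image has to be sent as a base64 data URL instead of a public URL. A minimal sketch of building such a content block follows; image_block is a hypothetical helper, and the MIME type is the caller's responsibility.

```python
import base64


def image_block(data: bytes, mime: str = "image/jpeg") -> dict:
    """Wrap raw image bytes as an image_url content block with a data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


# Usage: combine with a text block in one user message.
# with open("cat.jpg", "rb") as f:
#     content = [{"type": "text", "text": "What is in this image?"}, image_block(f.read())]
```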

Tool calling

Models with supports_tools: true can call functions that you define. The model returns a tool_calls array; execute the calls, then send the results back as tool messages.

Request

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto"
  }'

Response with tool calls

{
  "id": "chatcmpl-abc123",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"Paris\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

Follow-up with the tool result

{
  "model": "gpt-5.1-chat",
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}
      }]
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 14, \"sky\": \"cloudy\"}"}
  ]
}
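
The round trip above can be sketched as a small dispatcher that turns an assistant message with tool_calls into the tool messages to append. This is an illustrative outline only: get_weather is a hypothetical local stand-in that returns fixed data, and TOOLS and run_tool_calls are names chosen for the example.

```python
import json


def get_weather(location: str) -> dict:
    # Hypothetical stand-in for a real weather lookup.
    return {"temp_c": 14, "sky": "cloudy"}


TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list:
    """Execute each requested call and return the tool messages to append."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = TOOLS[call["function"]["name"]]
        # Arguments arrive as a JSON-encoded string, not as an object.
        args = json.loads(call["function"]["arguments"])
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],  # must match the id of the call
            "content": json.dumps(fn(**args)),
        })
    return results
```

Append the assistant message and then the returned tool messages to messages, and send the request again to get the final answer.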

Assistant prefill

End your messages array with an assistant message that already contains some text, and the model continues from it instead of starting a fresh turn. This is a reliable way to force a response to begin a specific way — a leading "{" for JSON, a chosen language, or a fixed prefix. The same trick works on /v1/messages. Providers that reject native prefill are handled automatically: the gateway retries once with a compatible rewrite, so you do not have to special-case them.

{
  "model": "claude-sonnet-4.6",
  "messages": [
    {"role": "user", "content": "List three primary colors as a JSON array."},
    {"role": "assistant", "content": "["}
  ]
}

Structured outputs

Set response_format to make the model return JSON. Two modes are supported:

{ "type": "json_object" } — the response is a single valid JSON value.
{ "type": "json_schema", "json_schema": { "name", "schema", "strict" } } — the model is steered to produce JSON that matches your JSON Schema.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Extract the city and country: I live in Paris, France."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "location",
        "schema": {
          "type": "object",
          "properties": { "city": {"type": "string"}, "country": {"type": "string"} },
          "required": ["city", "country"]
        }
      }
    }
  }'

Reliability: even when a model wraps its answer in prose or a markdown code fence, Airforce extracts the JSON payload so you always receive parseable content. If no valid JSON can be recovered, the original text is returned unchanged — so the guarantee never makes a response worse. This applies to non-streamed responses; streamed responses are passed through unchanged.
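
Because streamed responses are passed through unchanged, a client that streams may want a similar fallback of its own. The sketch below illustrates the general idea only; it is not Airforce's extraction logic, and recover_json is a hypothetical helper. It tries the whole text first, then the body of any fenced block, and returns the original text when nothing parses.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built here to keep this example readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def recover_json(text: str):
    """Return the parsed JSON payload, or the original text if none is found."""
    candidates = [text.strip()] + [m.strip() for m in FENCE_RE.findall(text)]
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return text
```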

Streaming

Set stream: true to receive partial completions as server-sent events. Each event is a single JSON chunk with the same shape as the non-streamed response, except that message is replaced by delta. The stream ends with data: [DONE].

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Write a haiku about Berlin."}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'

Wire format

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"Cold "},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"stone "},"finish_reason":null}]}

…

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":17,"total_tokens":29}}

data: [DONE]
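
A minimal sketch of consuming this wire format: it takes an iterable of already-decoded text lines (for example from an HTTP response read line by line), concatenates the content deltas, and keeps the usage object from the final chunk. collect_stream is a hypothetical helper name.

```python
import json


def collect_stream(lines) -> tuple:
    """Accumulate content deltas from SSE lines; return (text, usage)."""
    parts, usage = [], None
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip the blank separator lines between events
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        usage = chunk.get("usage") or usage
        for choice in chunk.get("choices", []):
            # The first chunk carries only the role, the last an empty delta.
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts), usage
```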

Reliability & smart routing

Every model ID resolves to a pool of upstream providers behind the scenes. If the first one errors or times out, the request is automatically retried against the next provider for the same model, in order, before any failure is returned — you do not configure or trigger this. The model field in the response always reports the variant that actually answered. This is independent of the optional models / fallbacks array, which adds your own cross-model candidates on top: first the primary model exhausts its own provider chain, then each fallback model exhausts its chain.
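
The ordering described above can be pictured with a short sketch. It only models the documented sequence; the provider pools are internal to Airforce and not exposed, so the providers mapping, its provider names, and attempt_order itself are hypothetical.

```python
def attempt_order(primary: str, fallbacks: list, providers: dict) -> list:
    """List (model, provider) pairs in the order attempts would be made."""
    order = []
    for model in [primary] + list(fallbacks)[:3]:  # at most 3 fallback models
        # A model with no known pool contributes nothing, i.e. it is skipped.
        for provider in providers.get(model, []):
            order.append((model, provider))
    return order


# Hypothetical pools, for illustration only.
pools = {"gpt-5.1-chat": ["provider-a", "provider-b"], "deepseek-v3.2": ["provider-c"]}
order = attempt_order("gpt-5.1-chat", ["deepseek-v3.2", "not-a-model"], pools)
```

Here the primary model exhausts provider-a and provider-b before deepseek-v3.2 is tried, and the unknown candidate never appears in the list.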

POST /v1/messages

Anthropic-compatible Messages API. Use the official @anthropic-ai/sdk by setting baseURL to https://api.airforce. Non-Claude models are forwarded transparently to OpenAI, Google, and other upstreams.

POST https://api.airforce/v1/messages

Request body

Parameter	Type	Required	Description
model	string	Required	Model ID (Anthropic form or a routed alias).
messages	array	Required	Each item: { role: "user" | "assistant", content: string | array }.
max_tokens	integer	Required	Required by the Anthropic format. Token cap for the response.
system	string | array	Optional	System prompt. To mark cached prefix segments, pass an array of { type: "text", text, cache_control? } blocks. See "Prompt caching".
temperature	float	Optional	0–1.
top_p	float	Optional	Nucleus sampling.
top_k	integer	Optional	Restricts the sampling pool to the top K tokens.
stop_sequences	array	Optional	Up to 4 stop sequences.
stream	boolean	Optional	If true, emits an Anthropic-style SSE event stream (see "Streaming events").
fallbacks	array	Optional	Fallback models (max 3) in Anthropic form: [{"model": "gpt-4o-mini"}]. If every channel of the primary model fails, each candidate is tried in order; you are billed for the model that actually answered, and the response model field reports it. A plain models string array is accepted too.
tools	array	Optional	Anthropic tool definitions: { name, description, input_schema }. The response may contain tool_use content blocks.
tool_choice	object	Optional	{ type: "auto" | "any" | "tool", name? }.
thinking	object	Optional	Anthropic extended thinking: { type: "enabled", budget_tokens: N }.

Example

curl https://api.airforce/v1/messages \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "max_tokens": 256,
    "system": "You are a helpful assistant.",
    "messages": [
      {"role": "user", "content": "Hello, Claude!"}
    ]
  }'

Response shape

Parameter	Type	Required	Description
id	string	Optional	Message ID (e.g. "msg_01ABCxyz").
type	string	Optional	Always "message".
role	string	Optional	Always "assistant".
content	array	Optional	Array of content blocks: { type: "text" | "tool_use" | "thinking", … }.
model	string	Optional	ID of the model that answered. This echoes the requested model unless a fallback answered instead.
stop_reason	string	Optional	"end_turn" | "max_tokens" | "stop_sequence" | "tool_use".
usage	object	Optional	{ input_tokens, output_tokens, cache_read_input_tokens?, cache_creation_input_tokens?, cache_creation? }. The cache fields appear when prompt caching was used. cache_creation.ephemeral_5m_input_tokens and ephemeral_1h_input_tokens give the write split by TTL.

Streaming events

Anthropic SSE uses named events instead of a single JSON chunk shape. Each event has an event: name and a data: JSON payload.

event: message_start
data: {"type":"message_start","message":{"id":"msg_01","role":"assistant","content":[],"model":"claude-sonnet-4.6","stop_reason":null,"usage":{"input_tokens":12,"output_tokens":1}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}

event: message_stop
data: {"type":"message_stop"}
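
A minimal sketch of consuming these named events, assuming the input is an iterable of decoded text lines. It pairs each data: line with the preceding event: name, collects text_delta fragments, and records the stop reason; collect_message_stream is a hypothetical helper name, and other delta types (tool use, thinking) are ignored here.

```python
import json


def collect_message_stream(lines) -> dict:
    """Accumulate text and the stop reason from Anthropic-style SSE lines."""
    text, stop_reason, event = [], None, None
    for line in lines:
        if line.startswith("event: "):
            event = line[len("event: "):].strip()
        elif line.startswith("data: "):
            data = json.loads(line[len("data: "):])
            if event == "content_block_delta" and data["delta"].get("type") == "text_delta":
                text.append(data["delta"]["text"])
            elif event == "message_delta":
                stop_reason = data["delta"].get("stop_reason")
            elif event == "message_stop":
                break
    return {"text": "".join(text), "stop_reason": stop_reason}
```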

POST /v1/messages/count_tokens

Anthropic-compatible token counting. Send the same system / messages / tools you would pass to /v1/messages and get an input-token estimate back without running the model — nothing is billed.

POST https://api.airforce/v1/messages/count_tokens

curl https://api.airforce/v1/messages/count_tokens \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "system": "You are a helpful assistant.",
    "messages": [{"role": "user", "content": "Hello, Claude!"}]
  }'

# → {"input_tokens": 34}

The count is a fast character-based estimate (about 4 characters per token) over system, messages and tools — close enough for context-budget checks, not an exact tokenizer run.

Prompt caching

On /v1/messages with Claude models, mark a prefix as cached by passing system as an array of blocks in which the cached segment carries cache_control: { type: "ephemeral" }. Subsequent requests that start with the same prefix are charged the cheaper cache-read rate. Models with supports_caching: true in /v1/models support this.

Write vs read pricing

Cache writes are typically charged slightly above normal input (about 1.25× on Claude-family models). Cache reads are much cheaper (about 0.1× input). A large write with almost no later read is the expensive case — not a “cache discount”. Only reusing the same prefix turns the write into savings.

Tools like Claude Code often attach a large project context with cache markers on the first turns. Expect cache-write spend while the repo/system prefix is loaded; later turns only get cheap if that prefix is stable and reused. Subagents and multi-step agents can multiply large contexts across several requests.

{
  "model": "claude-sonnet-4.6",
  "max_tokens": 1024,
  "system": [
    {"type": "text", "text": "You are a senior staff engineer at Airforce."},
    {
      "type": "text",
      "text": "<repository-snapshot>...</repository-snapshot>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Where is rate limiting enforced?"}
  ]
}

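How cache counts are reported in responses

Cache token counts are delivered in each format's native shape, so the SDKs (openai, @anthropic-ai/sdk, @google/genai) read them without custom code. Fields whose value is 0 are omitted to keep uncached responses light.

/v1/chat/completions (OpenAI format)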

"usage": {
  "prompt_tokens": 2104,
  "completion_tokens": 147,
  "total_tokens": 2251,
  "prompt_tokens_details": { "cached_tokens": 1980 },
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1/messages (Anthropic format)

"usage": {
  "input_tokens": 2104,
  "output_tokens": 147,
  "cache_read_input_tokens": 1980,
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1beta/.../generateContent (Gemini format)

"usageMetadata": {
  "promptTokenCount": 2104,
  "candidatesTokenCount": 147,
  "totalTokenCount": 2251,
  "cachedContentTokenCount": 1980
}
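
If one code path has to handle all three surfaces, the three shapes can be folded into a single read/write pair. The sketch below is one way to do that; cache_counts is a hypothetical helper, absent fields are treated as 0, and the Gemini shape shown above has no write field, so its write count is reported as 0.

```python
def cache_counts(usage: dict) -> dict:
    """Return {'read': n, 'write': n} from any of the three usage shapes."""
    if "promptTokenCount" in usage:  # Gemini usageMetadata
        return {"read": usage.get("cachedContentTokenCount", 0), "write": 0}
    if "prompt_tokens" in usage:  # OpenAI chat completions
        read = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
    else:  # Anthropic messages
        read = usage.get("cache_read_input_tokens", 0)
    return {"read": read, "write": usage.get("cache_creation_input_tokens", 0)}
```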

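Where caching applies

Explicit cache_control markers are honored on both /v1/messages and /v1/chat/completions for Claude models; attach them to system or message content blocks. Many other providers (the OpenAI family, DeepSeek, Gemini) cache automatically. Even without a marker, cached_tokens appears in the response when a sufficiently long prefix is reused.

Cache lifetime: 5 minutes or 1 hour

A cached prefix lives for 5 minutes by default, and the timer is refreshed on every hit. To keep it longer, add ttl: "1h" to the marker. The response reports each TTL separately under cache_creation.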

"cache_control": { "type": "ephemeral", "ttl": "1h" }

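Example: write first, then read

Send exactly the same request twice (the caching example above). The call that first sees the prefix pays a one-time cache write, and the identical call within the TTL pays the much cheaper cache read.

First call, cache write (usage excerpt):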

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 1980,
  "cache_read_input_tokens": 0
}

Second identical call within the TTL, cache read:

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 1980
}

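Limits and costs

Claude requires a minimum cacheable prefix (about 1024 tokens, larger on some models). Shorter prefixes are not cached.
At most 4 cache breakpoints are allowed per request, and the cached prefix must be byte-for-byte identical between calls; changing a single character misses the cache.
Cache writes cost more than normal input (5m ≈ 1.25×, 1h ≈ 2×) and reads cost far less (≈ 0.1×). See the pricing page for per-model cache prices.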
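
To see when a cache write pays for itself, the approximate multipliers above can be plugged into a small calculation. This is a rough sketch only: it measures cost in input-token equivalents for the cached prefix, assumes one write followed by reads that all land within the TTL, and uses the approximate multipliers quoted here rather than real per-model prices. The function names are hypothetical.

```python
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.0}  # approximate, from the text above
READ_MULTIPLIER = 0.1                       # approximate, from the text above


def cached_cost(prefix_tokens: int, calls: int, ttl: str = "5m") -> float:
    """Prefix cost over `calls` >= 1 requests: one write, then reads."""
    reads = max(calls - 1, 0)
    return prefix_tokens * (WRITE_MULTIPLIER[ttl] + READ_MULTIPLIER * reads)


def uncached_cost(prefix_tokens: int, calls: int) -> float:
    """Prefix cost when every request pays the normal input rate."""
    return float(prefix_tokens * calls)
```

Under these assumptions a single call with a 5m marker costs more than no caching (1.25× versus 1×), while the second call already tips the balance (1.35× versus 2×). With a 1h marker the break-even moves to the third call (2.2× versus 3×).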

POST /v1/responses

OpenAI Responses-API surface for stateful conversations. Same Bearer / x-api-key authentication. Cache counts appear as input_tokens_details.cached_tokens (reads) and as the flat cache_creation_input_tokens plus cache_creation.ephemeral_* (writes), equivalent to /v1/chat/completions.

POST https://api.airforce/v1/responses

POST /v1beta/models/{model}:generateContent

Google Gemini-compatible endpoint. Works with the official @google/genai SDK and the Gemini CLI by pointing the base URL at https://api.airforce/v1beta. Any routed model works — requests are translated to and from the native Gemini shape, and the model is taken from the URL path (not the body).

POST https://api.airforce/v1beta/models/{model}:generateContent

Authentication

Pass your Airforce API key any of the three ways Google clients use:

# 1) query parameter (Google default)
?key=sk-air-YOUR_API_KEY

# 2) header
x-goog-api-key: sk-air-YOUR_API_KEY

# 3) bearer token
Authorization: Bearer sk-air-YOUR_API_KEY

Request body

Parameter	Type	Required	Description
contents	array	Required	Conversation turns. Each: { role: "user" | "model", parts: [...] }. A part is { text }, { functionCall: { name, args } }, or { functionResponse: { name, response } }. "model" is Gemini's term for the assistant role.
systemInstruction	object	Optional	System prompt: { parts: [{ text }] }.
generationConfig	object	Optional	{ temperature, maxOutputTokens, topP, stopSequences } — mapped to the canonical sampling parameters.
tools	array	Optional	Tool definitions: [{ functionDeclarations: [{ name, description, parameters }] }]. functionDeclarations are flattened across entries.
toolConfig	object	Optional	Tool-choice control: { functionCallingConfig: { mode: "AUTO" | "ANY" | "NONE" } }. ANY forces a call, NONE disables tools.

Example

curl "https://api.airforce/v1beta/models/gemini-3.1-pro:generateContent" \
  -H "x-goog-api-key: sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {"role": "user", "parts": [{"text": "What is the capital of France?"}]}
    ],
    "systemInstruction": {"parts": [{"text": "You are a helpful assistant."}]},
    "generationConfig": {"temperature": 0.7, "maxOutputTokens": 256}
  }'
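
If your application already holds OpenAI-style messages, the request body above can be derived from them. The sketch below illustrates the shape differences only (the assistant role becomes "model", and system messages move to systemInstruction); it handles plain-text messages, is not the gateway's own translation, and to_gemini_body is a hypothetical helper name.

```python
def to_gemini_body(messages: list) -> dict:
    """Convert OpenAI-style text messages into a generateContent body."""
    body = {"contents": []}
    system = [m["content"] for m in messages if m["role"] == "system"]
    if system:
        body["systemInstruction"] = {"parts": [{"text": "\n".join(system)}]}
    for m in messages:
        if m["role"] == "system":
            continue  # already moved to systemInstruction
        role = "model" if m["role"] == "assistant" else "user"
        body["contents"].append({"role": role, "parts": [{"text": m["content"]}]})
    return body
```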

Response shape

Parameter	Type	Required	Description
candidates	array	Optional	Generated turns: [{ content: { role: "model", parts }, finishReason, index }]. Only the first candidate is populated.
candidates[].finishReason	string	Optional	"STOP" | "MAX_TOKENS" | "SAFETY" | "OTHER".
usageMetadata	object	Optional	{ promptTokenCount, candidatesTokenCount, totalTokenCount, cachedContentTokenCount? }. cachedContentTokenCount appears when the upstream reported a cache read.
modelVersion	string	Optional	Echo of the requested model.

{
  "candidates": [{
    "content": {
      "role": "model",
      "parts": [{"text": "The capital of France is Paris."}]
    },
    "finishReason": "STOP",
    "index": 0
  }],
  "usageMetadata": {
    "promptTokenCount": 16,
    "candidatesTokenCount": 8,
    "totalTokenCount": 24
  },
  "modelVersion": "gemini-3.1-pro"
}

POST /v1beta/models/{model}:streamGenerateContent

Streaming uses the :streamGenerateContent action and returns Server-Sent Events. Each data: line is a full Gemini-shaped chunk (not a delta object); the final chunk carries usageMetadata.

data: {"candidates":[{"content":{"role":"model","parts":[{"text":"The capital"}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[{"text":" is Paris."}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[]},"finishReason":"STOP","index":0}],"usageMetadata":{"promptTokenCount":16,"candidatesTokenCount":8,"totalTokenCount":24}}

List models

The catalog is also exposed in Gemini Model-resource shape so Google clients can enumerate models.

curl https://api.airforce/v1beta/models

Notes: the base URL is https://api.airforce/v1beta (or /v1), not Google's host. The model name comes from the URL path, not the request body. Only the first candidate is returned, and a subset of Gemini fields is translated — safetySettings and cachedContent are currently ignored. Billing, rate limits and smart routing apply exactly as on /v1/chat/completions.

Errors

Airforce returns standard HTTP status codes and a uniform error envelope for both endpoints.

Status	Type	Description
400	invalid_request_error	Malformed JSON, a missing required field, or an unknown model.
401	invalid_request_error / auth_required	The API key is missing or invalid.
402	insufficient_quota	This model requires an active subscription or a positive Pay-as-you-Go balance.
403	model_access_denied / insufficient_scope	Your plan or per-key permissions deny this request.
404	model_not_found	The requested model does not exist or you do not have access to it.
429	rate_limit_error	The request rate or daily token limit was exceeded.
503	api_error / moderation_unavailable	All upstream keys for the requested provider failed.

{
  "error": {
    "message": "The requested model does not exist or you do not have access to it.",
    "type": "model_not_found",
    "param": null,
    "code": "404"
  }
}

The descriptive slug is in type. code is the HTTP status as a string (e.g. "404"), and param is null except for parameter-scoped validation errors, where it names the offending parameter.
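
A minimal sketch of unpacking this envelope on the client. The retry policy is an assumption made for the example (429 and 503 are treated as transient), not something the API prescribes; parse_error and RETRYABLE are hypothetical names.

```python
# Assumption: rate limits and upstream outages are worth retrying with backoff.
RETRYABLE = {429, 503}


def parse_error(body: dict) -> dict:
    """Flatten the error envelope and flag statuses that may be retried."""
    err = body["error"]
    status = int(err["code"])  # code is the HTTP status as a string
    return {
        "status": status,
        "type": err.get("type"),
        "message": err.get("message"),
        "param": err.get("param"),
        "retryable": status in RETRYABLE,
    }
```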

Model discovery

For the full list of model IDs and their capability flags (vision, tools, reasoning, caching, context length, and so on), see /docs/api/models.

curl https://api.airforce/v1/models \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY"