API REFERENCE

Chat Completions

Generate chat responses across more than 100 models from a single API. Drop-in compatible with OpenAI Chat Completions, Anthropic Messages and OpenAI Responses.

Airforce speaks both the OpenAI Chat Completions wire format and the Anthropic Messages wire format over the same set of models. Pick the SDK you already use and just change the base URL; non-Claude models are routed transparently on either interface.

This page covers authentication, the request and response formats for both interfaces, streaming, tool calling, vision, reasoning and prompt caching. New here? Start with the basic example below, get one call working, then add streaming, tools or caching.

Authentication

Every request needs a bearer token (your Airforce API key). The Anthropic x-api-key header is also accepted on /v1/messages for SDK compatibility.

Authorization: Bearer sk-air-YOUR_API_KEY
# alt for /v1/messages:
x-api-key: sk-air-YOUR_API_KEY

POST /v1/chat/completions

OpenAI-compatible chat completions. Works with the official openai SDK by setting base_url to https://api.airforce/v1.

POST https://api.airforce/v1/chat/completions

Request body

Parameter	Type	Required	Description
model	string	Required	Model ID. Use GET /v1/models to discover the available IDs.
messages	array	Required	Conversation history. Each entry has { role: "system" \| "user" \| "assistant" \| "tool", content }. content is a string or an array of content blocks (vision, see below).
max_tokens	integer	Optional	Maximum number of tokens to generate. Capped at the model's max_output_tokens.
temperature	float	Optional	Sampling temperature, 0–2. Lower is more deterministic. The default depends on the upstream provider.
top_p	float	Optional	Nucleus sampling. Use temperature or top_p, not both.
stream	boolean	Optional	When true, the response is a stream of server-sent events. See "Streaming" below.
models	array	Optional	Fallback models (max 3), e.g. ["deepseek-v3.2", "gpt-4o-mini"]. If every channel of the primary model fails, each candidate is tried in order. You are billed for — and response.model reports — the model that actually answered. Unknown or plan-gated candidates are skipped. With the OpenAI SDK pass it via extra_body.
transforms	array	Optional	Prompt transforms. Supported: ["middle-out"] — when the conversation overflows the model's context window, whole messages are dropped from the middle (system prompts, the first message and the most recent turns are kept), so long roleplay or agent histories keep working instead of erroring. Opt-in; off by default.
stream_options	object	Optional	{ include_usage: boolean }. Usage is always included in the last chunk of the stream; this field is accepted for OpenAI compatibility but cannot turn it off.
stop	string \| array	Optional	Up to 4 stop sequences. Generation stops as soon as one is produced.
tools	array	Optional	Function definitions the model may call. See "Tool calling" below.
tool_choice	string \| object	Optional	"auto" (default), "none", or { type: "function", function: { name } } to force a specific call.
response_format	object	Optional	{ type: "json_object" } forces the model to emit valid JSON. Ignored for models that do not support it.
reasoning_effort	string	Optional	Reasoning depth: "low" \| "medium" \| "high" \| "xhigh" \| "max". Any model with supports_reasoning: true (Claude, OpenAI o/GPT-5, Gemini, Qwen, DeepSeek, …). See "Reasoning & thinking".
thinking	string \| object	Optional	Cross-model thinking switch. "on" \| "off" \| "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" \| "disabled" }. See "Reasoning & thinking".
thinking_budget	integer	Optional	Token limit for the model's reasoning trace (when the provider exposes one).
ignore_defaults	boolean	Optional	Ignore the user's saved per-model default parameters (configured in the dashboard) for this request.
skill	string	Optional	ID of a single marketplace skill to apply to this request. The skill transforms your messages/parameters before the upstream call and overrides any installed-skill defaults. Consumed by Airforce, never forwarded upstream. See the Skills catalog at /docs/api/extend.
skills	array	Optional	Array of marketplace skill IDs applied in order, for stacking multiple skills on one request.
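The models and transforms rows above can be combined in one request body. The following is a minimal sketch of how that body might be assembled; build_body is a hypothetical helper (not part of any SDK), and the limit of three fallbacks is taken from the table.

```python
import json

MAX_FALLBACKS = 3  # limit stated in the parameter table above


def build_body(model, messages, fallbacks=(), middle_out=False, **params):
    """Assemble a /v1/chat/completions body (hypothetical helper)."""
    if len(fallbacks) > MAX_FALLBACKS:
        raise ValueError(f"at most {MAX_FALLBACKS} fallback models are allowed")
    body = {"model": model, "messages": list(messages), **params}
    if fallbacks:
        body["models"] = list(fallbacks)  # tried in order if the primary fails
    if middle_out:
        body["transforms"] = ["middle-out"]  # opt-in context trimming
    return body


body = build_body(
    "gpt-5.1-chat",
    [{"role": "user", "content": "Summarise this thread."}],
    fallbacks=["deepseek-v3.2", "gpt-4o-mini"],
    middle_out=True,
    max_tokens=200,
)
print(json.dumps(body, indent=2))
```

When calling through the OpenAI SDK, the table notes that models is passed via extra_body rather than as a named argument.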

Basic example

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
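For readers who prefer Python without extra dependencies, the same call can be sketched with the standard library. The AIRFORCE_API_KEY environment variable and the make_request helper are assumptions made for this example; the request is only built here, and the commented lines show how it could be sent.

```python
import json
import os
import urllib.request

BASE_URL = "https://api.airforce/v1"


def make_request(payload, api_key):
    """Build (but do not send) a POST to /v1/chat/completions."""
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "model": "gpt-5.1-chat",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "max_tokens": 200,
    "temperature": 0.7,
}
# AIRFORCE_API_KEY is a hypothetical variable name chosen for this sketch.
req = make_request(payload, os.environ.get("AIRFORCE_API_KEY", "sk-air-YOUR_API_KEY"))

# To send the request:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```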

Response shape

Parameter	Type	Required	Description
id	string	Optional	Stable completion ID, e.g. "chatcmpl-abc123".
object	string	Optional	"chat.completion" for non-streamed responses, "chat.completion.chunk" for streamed ones.
created	integer	Optional	Unix timestamp (seconds).
model	string	Optional	ID of the model that answered. Matches the requested model unless a fallback from models served the request.
choices	array	Optional	Array of completion candidates: [{ index, message: { role, content, tool_calls? }, finish_reason }].
choices[].finish_reason	string	Optional	"stop" \| "length" \| "tool_calls" \| "content_filter".
usage	object	Optional	{ prompt_tokens, completion_tokens, total_tokens, completion_tokens_details?, prompt_tokens_details?, cache_creation_input_tokens?, cache_creation? }. completion_tokens_details.reasoning_tokens is set when the model produced a reasoning trace. The cache fields appear when the upstream returned prompt-cache information: prompt_tokens_details.cached_tokens reports cache reads (OpenAI convention), cache_creation_input_tokens aggregates the writes, and cache_creation.ephemeral_5m_input_tokens / ephemeral_1h_input_tokens give the split by TTL.

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "gpt-5.1-chat",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 8,
    "total_tokens": 28
  }
}
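Reading the fields described above is mostly dictionary access. As a small sketch, the summarize helper below (a hypothetical name) pulls out the reply text, flags truncation via finish_reason, and reads the token total from the sample response.

```python
import json

RAW = """
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "gpt-5.1-chat",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "The capital of France is Paris."},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 20, "completion_tokens": 8, "total_tokens": 28}
}
"""


def summarize(response):
    """Extract the parts of a non-streamed response most callers need."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["message"].get("content"),
        "finish_reason": choice["finish_reason"],
        # "length" means generation stopped at max_tokens
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": usage.get("total_tokens"),
    }


summary = summarize(json.loads(RAW))
print(summary["text"])
```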

Reasoning & thinking

Reasoning/thinking is a cross-model feature for every model ID with supports_reasoning: true — Claude, OpenAI o-series/GPT-5, Gemini, Qwen, DeepSeek, and others. You send the same canonical parameters; Airforce maps them to each provider's native shape. This is not a DeepSeek-only API.

Source of truth: check for supports_reasoning: true on a model in GET /v1/models (or GET /api/models/{id}/allowed-params). Prefer that flag over guessing from the model name.

Canonical parameters

Parameter	Type	Required	Description
reasoning_effort	string	Optional	"low" \| "medium" \| "high" \| "xhigh" \| "max". Accepted on every model with supports_reasoning: true. Some upstreams only honour a subset (e.g. high/max); others clamp unsupported levels to the nearest served value.
thinking	string \| object	Optional	Three accepted shapes (we normalise): "on" \| "off" \| "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" \| "disabled" }. Mapped onto Claude extended thinking, OpenAI effort profiles, Gemini thinking_config, Qwen enable_thinking, DeepSeek hybrid, etc.
thinking_budget	integer	Optional	Maximum tokens the model may spend reasoning before emitting visible output. Mirrors budget_tokens when the upstream exposes a budget; takes precedence over reasoning_effort when both are sent and a budget is available.

What differs by family (mapping only)

Parameters are the same everywhere. Only how we map them (and how hard "off" is) differs:

Claude — Thinking on/off + budget; often also reasoning_effort via the gateway.
OpenAI (o1/o3, GPT-5) — Mainly reasoning_effort. A full "thinking off" is often not available — you control how strongly the model reasons, not always whether it reasons at all.
Gemini — thinking_config / budget mapped internally.
Qwen / Xiaomi / Alibaba — thinking + enable_thinking-style controls.
DeepSeek (generic) — Hybrid on/off is especially clear: thinking: { type: enabled|disabled } plus optional reasoning_effort.
Resellers / other — Often generic passthrough of the same canonical fields.

Controlling where the trace appears

An optional reasoning object on the request decides what happens to the thinking trace. It is consumed by Airforce and never forwarded upstream.

Parameter	Type	Required	Description
reasoning.format	string	Optional	"separate" (default) puts the trace in message.reasoning (and delta.reasoning while streaming). "inline" keeps the legacy inline <think>…</think> form inside content.
reasoning.exclude	boolean	Optional	When true, the reasoning trace is dropped entirely from the response. Reasoning tokens are still counted and billed if the model produced them.

"reasoning": { "format": "separate", "exclude": false }

Reasoning effort (OpenAI style)

Primary control for o-series and GPT-5: how much the model may reason. Same canonical field as on every other supports_reasoning model — OpenAI is included, but behaviour is not 1:1 with DeepSeek's hard on/off.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "o3-mini",
    "messages": [{"role": "user", "content": "Prove the Pythagorean theorem."}],
    "reasoning_effort": "high"
  }'

Extended thinking (Anthropic style)

Budget-based thinking for Claude (and gateways that accept the Anthropic shape). You can still send reasoning_effort; we map when the channel supports it.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "messages": [{"role": "user", "content": "Plan a 7-day Italy trip."}],
    "thinking": {"type": "enabled", "budget_tokens": 4000}
  }'

Hybrid thinking (e.g. DeepSeek V3.2/V4)

Example of a hybrid model family with a clear Thinking / Non-Thinking switch — not a separate protocol. deepseek-v3.2, deepseek-v4-flash and deepseek-v4-pro accept the same canonical fields as every other supports_reasoning model. Toggle thinking and optionally set effort in one request:

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Solve this step by step: integrate x^2 * e^x."}],
    "thinking": {"type": "enabled"},
    "reasoning_effort": "high"
  }'

Turn thinking off (faster, cheaper when you only need the final answer) — this hard off is clearer on hybrid models than on many OpenAI o-series profiles:

"thinking": {"type": "disabled"}
// or simply: "thinking": "off"

Native docs for this family often list effort levels such as "high" and "max". We accept the full low…max scale and map unsupported levels to the nearest value that reaches the model. Prefer the hybrid IDs above over retired deepseek-chat / deepseek-reasoner names when you need an explicit on/off switch.

The reasoning trace itself appears in choices[0].message.reasoning (OpenAI shape) or as thinking blocks in content (Anthropic shape). Reasoning tokens are billed and reported in usage.completion_tokens_details.reasoning_tokens.

The completion_tokens_details.reasoning_tokens detail is only present when the upstream provider reports it. In a streamed response, the trace arrives in delta.reasoning_content on each chunk.
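Because reasoning.format can place the trace either in message.reasoning or inline as <think>…</think> inside content, client code that must cope with both can normalise them. This is a minimal client-side sketch covering non-streamed OpenAI-shape messages only; split_reasoning is a hypothetical helper.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(message):
    """Return (reasoning, answer) for either reasoning.format setting."""
    content = message.get("content") or ""
    # "separate" format: the trace is already in its own field.
    if message.get("reasoning"):
        return message["reasoning"], content
    # "inline" format: pull <think> blocks out of the visible text.
    traces = THINK_RE.findall(content)
    answer = THINK_RE.sub("", content).strip()
    reasoning = "\n".join(t.strip() for t in traces) or None
    return reasoning, answer


separate = {"role": "assistant", "content": "Paris.", "reasoning": "France -> Paris"}
inline = {"role": "assistant", "content": "<think>France -> Paris</think>\nParis."}
print(split_reasoning(separate))
print(split_reasoning(inline))
```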

Vision and image input

Models with supports_vision: true accept images embedded as content blocks. Either a public URL or a base64 data URL works; size limits depend on the upstream model.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
      ]
    }]
  }'
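When the image is a local file rather than a public URL, it can be sent as a base64 data URL in the same image_url block. A short sketch, assuming the image bytes are already in memory; image_block and vision_message are hypothetical helper names.

```python
import base64


def image_block(data: bytes, mime: str = "image/png") -> dict:
    """Wrap raw image bytes as an image_url content block with a data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


def vision_message(prompt: str, data: bytes, mime: str = "image/png") -> dict:
    """Build a user message with a text block followed by one image block."""
    return {
        "role": "user",
        "content": [{"type": "text", "text": prompt}, image_block(data, mime)],
    }


# In practice `data` would come from open("cat.jpg", "rb").read().
message = vision_message("What is in this image?", b"abc", mime="image/jpeg")
print(message["content"][1]["image_url"]["url"])
```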

Tool calling

Models with supports_tools: true can call functions you define. The model returns a tool_calls array; you execute the call and send the result back in a tool message.

Request

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto"
  }'

Response with tool call

{
  "id": "chatcmpl-abc123",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"Paris\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

Follow-up with tool result

{
  "model": "gpt-5.1-chat",
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}
      }]
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 14, \"sky\": \"cloudy\"}"}
  ]
}
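The three steps above (request, tool call, follow-up) can be tied together on the client. The sketch below turns an assistant message containing tool_calls into the tool messages for the follow-up request. The get_weather body is a stand-in that returns fixed data, and run_tool_calls is a hypothetical helper.

```python
import json


def get_weather(location):
    # Stand-in implementation; a real one would query a weather service.
    return {"temp_c": 14, "sky": "cloudy"}


TOOLS = {"get_weather": get_weather}  # name -> local callable


def run_tool_calls(assistant_message):
    """Execute each requested call and return the matching tool messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments is a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],  # must echo the id from the model
            "content": json.dumps(fn(**args)),
        })
    return results


assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"},
    }],
}
messages = [
    {"role": "user", "content": "What is the weather in Paris?"},
    assistant_message,
    *run_tool_calls(assistant_message),
]
print(json.dumps(messages[-1]))
```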

Assistant prefill

End your messages array with an assistant message that already contains some text, and the model continues from it instead of starting a fresh turn. This is a reliable way to force a response to begin a specific way — a leading "{" for JSON, a chosen language, or a fixed prefix. The same trick works on /v1/messages. Providers that reject native prefill are handled automatically: the gateway retries once with a compatible rewrite, so you do not have to special-case them.

{
  "model": "claude-sonnet-4.6",
  "messages": [
    {"role": "user", "content": "List three primary colors as a JSON array."},
    {"role": "assistant", "content": "["}
  ]
}

Structured outputs

Set response_format to make the model return JSON. Two modes are supported:

{ "type": "json_object" } — the response is a single valid JSON value.
{ "type": "json_schema", "json_schema": { "name", "schema", "strict" } } — the model is steered to produce JSON that matches your JSON Schema.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Extract the city and country: I live in Paris, France."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "location",
        "schema": {
          "type": "object",
          "properties": { "city": {"type": "string"}, "country": {"type": "string"} },
          "required": ["city", "country"]
        }
      }
    }
  }'

Reliability: even when a model wraps its answer in prose or a markdown code fence, Airforce extracts the JSON payload so you always receive parseable content. If no valid JSON can be recovered, the original text is returned unchanged — so the guarantee never makes a response worse. This applies to non-streamed responses; streamed responses are passed through unchanged.
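Since streamed responses are passed through unchanged, a client that streams may want a similar recovery step on its side. The following is a rough client-side sketch, not the gateway's own extraction logic; extract_json is a hypothetical helper that tries the raw text, then fenced blocks, then the outermost bracketed span.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, spelled out to keep this example readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text):
    """Return the first JSON value recoverable from text, or None."""
    candidates = [text] + FENCE_RE.findall(text)
    # Last resort: the outermost {...} or [...] span in the text.
    for opener, closer in (("{", "}"), ("[", "]")):
        start, end = text.find(opener), text.rfind(closer)
        if 0 <= start < end:
            candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            return json.loads(candidate.strip())
        except json.JSONDecodeError:
            continue
    return None


fenced = "Here you go:\n" + FENCE + 'json\n{"city": "Paris", "country": "France"}\n' + FENCE
print(extract_json(fenced))
```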

Streaming

Set stream: true to receive partial completions as server-sent events. Each event is a JSON chunk with the same shape as the non-streamed response, except that message is replaced by delta. The stream ends with data: [DONE].

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Write a haiku about Berlin."}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'

Wire format

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"Cold "},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"stone "},"finish_reason":null}]}

…

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":17,"total_tokens":29}}

data: [DONE]
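A consumer of this wire format reads data: lines, stops at [DONE], and concatenates delta.content. The sketch below does that over an in-memory list of lines (trimmed versions of the chunks above); with a live connection the lines would come from the HTTP response body instead. collect_stream is a hypothetical helper.

```python
import json


def collect_stream(lines):
    """Accumulate text, finish_reason and usage from OpenAI-style SSE lines."""
    text, finish_reason, usage = [], None, None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators and anything that is not a data line
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        usage = chunk.get("usage") or usage  # carried by the last chunk
        for choice in chunk.get("choices", []):
            text.append(choice.get("delta", {}).get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(text), finish_reason, usage


SAMPLE = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Cold "},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"stone "},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],'
    '"usage":{"prompt_tokens":12,"completion_tokens":17,"total_tokens":29}}',
    "",
    "data: [DONE]",
]
result = collect_stream(SAMPLE)
print(result)
```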

Reliability & smart routing

Every model ID resolves to a pool of upstream providers behind the scenes. If the first one errors or times out, the request is automatically retried against the next provider for the same model, in order, before any failure is returned — you do not configure or trigger this. The model field in the response always reports the variant that actually answered. This is independent of the optional models / fallbacks array, which adds your own cross-model candidates on top: first the primary model exhausts its own provider chain, then each fallback model exhausts its chain.

POST /v1/messages

Anthropic-compatible Messages API. Works with the official @anthropic-ai/sdk by setting baseURL to https://api.airforce. Requests for non-Claude models are routed transparently to OpenAI, Google and other providers.

POST https://api.airforce/v1/messages

Request body

Parameter	Type	Required	Description
model	string	Required	Model ID (Anthropic format or routed alias).
messages	array	Required	Each entry: { role: "user" \| "assistant", content: string \| array }.
max_tokens	integer	Required	Required by Anthropic. Token limit for the response.
system	string \| array	Optional	System prompt. Pass an array of { type: "text", text, cache_control? } blocks to mark cached prefix segments. See "Prompt caching".
temperature	float	Optional	0–1.
top_p	float	Optional	Nucleus sampling.
top_k	integer	Optional	Restricts the sampling pool to the top K tokens.
stop_sequences	array	Optional	Up to 4 stop sequences.
stream	boolean	Optional	When true, emits an Anthropic-style SSE event stream (see "Streaming events" below).
fallbacks	array	Optional	Fallback models (max 3) in Anthropic form: [{"model": "gpt-4o-mini"}]. If every channel of the primary model fails, each candidate is tried in order; you are billed for — and the response model field reports — the model that actually answered. A plain models string array is accepted too.
tools	array	Optional	Anthropic tool definitions: { name, description, input_schema }. The response may contain tool_use content blocks.
tool_choice	object	Optional	{ type: "auto" \| "any" \| "tool", name? }.
thinking	object	Optional	Anthropic extended thinking: { type: "enabled", budget_tokens: N }.

Example

curl https://api.airforce/v1/messages \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "max_tokens": 256,
    "system": "You are a helpful assistant.",
    "messages": [
      {"role": "user", "content": "Hello, Claude!"}
    ]
  }'

Response shape

Parameter	Type	Required	Description
id	string	Optional	Message ID, e.g. "msg_01ABCxyz".
type	string	Optional	Always "message".
role	string	Optional	Always "assistant".
content	array	Optional	Array of content blocks: { type: "text" \| "tool_use" \| "thinking", … }.
model	string	Optional	ID of the model that answered. Matches the requested model unless a fallback served the request.
stop_reason	string	Optional	"end_turn" \| "max_tokens" \| "stop_sequence" \| "tool_use".
usage	object	Optional	{ input_tokens, output_tokens, cache_read_input_tokens?, cache_creation_input_tokens?, cache_creation? }. The cache fields appear when prompt caching was used. cache_creation.ephemeral_5m_input_tokens and ephemeral_1h_input_tokens give the write split by TTL.

Streaming events

Anthropic's SSE uses named events instead of bare JSON chunks. Each event has an event: name and a data: JSON payload.

event: message_start
data: {"type":"message_start","message":{"id":"msg_01","role":"assistant","content":[],"model":"claude-sonnet-4.6","stop_reason":null,"usage":{"input_tokens":12,"output_tokens":1}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}

event: message_stop
data: {"type":"message_stop"}
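The named-event stream above can be consumed by pairing each event: line with the data: line that follows it. This is a minimal sketch over a subset of the sample lines; parse_events and collect_message are hypothetical helpers, and only text deltas are handled.

```python
import json


def parse_events(lines):
    """Yield (event_name, payload) pairs from Anthropic-style SSE lines."""
    event = None
    for line in lines:
        line = line.strip()
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            yield event, json.loads(line[len("data:"):].strip())


def collect_message(lines):
    """Accumulate text, stop_reason and output token count from the stream."""
    text, stop_reason, output_tokens = [], None, None
    for event, data in parse_events(lines):
        if event == "content_block_delta" and data["delta"].get("type") == "text_delta":
            text.append(data["delta"]["text"])
        elif event == "message_delta":
            stop_reason = data["delta"].get("stop_reason", stop_reason)
            output_tokens = data.get("usage", {}).get("output_tokens", output_tokens)
    return "".join(text), stop_reason, output_tokens


SAMPLE = [
    "event: content_block_start",
    'data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}',
    "",
    "event: content_block_delta",
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}',
    "",
    "event: content_block_stop",
    'data: {"type":"content_block_stop","index":0}',
    "",
    "event: message_delta",
    'data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}',
    "",
    "event: message_stop",
    'data: {"type":"message_stop"}',
]
result = collect_message(SAMPLE)
print(result)
```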

POST /v1/messages/count_tokens

Anthropic-compatible token counting. Send the same system / messages / tools you would pass to /v1/messages and get an input-token estimate back without running the model — nothing is billed.

POST https://api.airforce/v1/messages/count_tokens

curl https://api.airforce/v1/messages/count_tokens \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "system": "You are a helpful assistant.",
    "messages": [{"role": "user", "content": "Hello, Claude!"}]
  }'

# → {"input_tokens": 34}

The count is a fast character-based estimate (about 4 characters per token) over system, messages and tools — close enough for context-budget checks, not an exact tokenizer run.

Prompt caching

On /v1/messages with Claude models, mark a prefix as cacheable by passing system as an array of blocks where the cached segment carries cache_control: { type: "ephemeral" }. Subsequent requests that start with the same prefix are charged the cheaper cache-read rate. Models with supports_caching: true in /v1/models support this.

Write vs read pricing

Cache writes are typically charged slightly above normal input (about 1.25× on Claude-family models). Cache reads are much cheaper (about 0.1× input). A large write with almost no later read is the expensive case — not a “cache discount”. Only reusing the same prefix turns the write into savings.

Tools like Claude Code often attach a large project context with cache markers on the first turns. Expect cache-write spend while the repo/system prefix is loaded; later turns only get cheap if that prefix is stable and reused. Subagents and multi-step agents can multiply large contexts across several requests.

{
  "model": "claude-sonnet-4.6",
  "max_tokens": 1024,
  "system": [
    {"type": "text", "text": "You are a senior staff engineer at Airforce."},
    {
      "type": "text",
      "text": "<repository-snapshot>...</repository-snapshot>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Where is rate limiting enforced?"}
  ]
}

How cache counts are reported in the response

Cache token counts are passed through in each format's native shape, so SDKs (openai, @anthropic-ai/sdk, @google/genai) read them without custom code. Fields are omitted when the value is zero, keeping uncached responses lean.

/v1/chat/completions (OpenAI shape)

"usage": {
  "prompt_tokens": 2104,
  "completion_tokens": 147,
  "total_tokens": 2251,
  "prompt_tokens_details": { "cached_tokens": 1980 },
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1/messages (Anthropic shape)

"usage": {
  "input_tokens": 2104,
  "output_tokens": 147,
  "cache_read_input_tokens": 1980,
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1beta/.../generateContent (Gemini shape)

"usageMetadata": {
  "promptTokenCount": 2104,
  "candidatesTokenCount": 147,
  "totalTokenCount": 2251,
  "cachedContentTokenCount": 1980
}
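Code that logs cache behaviour across all three interfaces can fold these shapes into one pair of numbers. A small sketch, assuming the field names shown in the three samples above; cache_counts is a hypothetical helper, and the Gemini shape yields a write count of 0 because the sample exposes no write field.

```python
def cache_counts(response):
    """Return (cache_read_tokens, cache_write_tokens) for any of the three shapes."""
    if "usageMetadata" in response:  # Gemini shape
        return response["usageMetadata"].get("cachedContentTokenCount", 0), 0
    usage = response.get("usage", {})
    if "prompt_tokens" in usage:  # OpenAI shape
        read = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
    else:  # Anthropic shape
        read = usage.get("cache_read_input_tokens", 0)
    # Both the OpenAI and Anthropic shapes use the same flat write counter.
    return read, usage.get("cache_creation_input_tokens", 0)


openai_shape = {"usage": {
    "prompt_tokens": 2104, "completion_tokens": 147, "total_tokens": 2251,
    "prompt_tokens_details": {"cached_tokens": 1980},
    "cache_creation_input_tokens": 124,
}}
anthropic_shape = {"usage": {
    "input_tokens": 2104, "output_tokens": 147,
    "cache_read_input_tokens": 1980, "cache_creation_input_tokens": 124,
}}
gemini_shape = {"usageMetadata": {
    "promptTokenCount": 2104, "candidatesTokenCount": 147,
    "totalTokenCount": 2251, "cachedContentTokenCount": 1980,
}}
for shape in (openai_shape, anthropic_shape, gemini_shape):
    print(cache_counts(shape))
```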

Where caching applies

Explicit cache_control markers are honoured on /v1/messages and /v1/chat/completions for Claude models; place them on system or message content blocks. Many other providers (OpenAI family, DeepSeek, Gemini) cache automatically: you send no markers and simply see cached_tokens in the response when a long enough prefix is reused.

Cache lifetime: 5 minutes or 1 hour

A cached prefix lives for 5 minutes by default, and the timer is refreshed on every hit. For a longer-lived prefix, add ttl: "1h" to the marker. The response reports each TTL separately in cache_creation.

"cache_control": { "type": "ephemeral", "ttl": "1h" }

Example: first a write, then a read

Send exactly the same request twice (the caching example above). The first call that sees the prefix pays a one-time cache write; identical calls within the TTL pay the much cheaper cache read.

First call, cache write (usage excerpt):

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 1980,
  "cache_read_input_tokens": 0
}

Second identical call within the TTL, cache read:

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 1980
}

Limits and cost

Claude requires a minimum cacheable prefix (about 1024 tokens; more on some models). Shorter prefixes are simply not cached.
Up to 4 cache breakpoints per request, and the cached prefix must be byte-for-byte identical between calls: even a one-character change misses the cache.
Cache writes cost more than normal input (5m ≈ 1.25×, 1h ≈ 2×); reads cost far less (≈ 0.1×). See each model's cache prices on the pricing page.
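The multipliers above imply a break-even point: a cache write only pays off once the prefix is reused. The sketch below works through that arithmetic using the approximate figures quoted here (1.25× and 2× for writes, 0.1× for reads) and assumes every later call lands within the TTL and hits the cache. Actual per-model prices may differ.

```python
WRITE_MULT = {"5m": 1.25, "1h": 2.0}  # approximate write multipliers quoted above
READ_MULT = 0.1                       # approximate read multiplier quoted above


def relative_cost(calls, ttl="5m"):
    """Cost of a cached prefix over `calls` requests, in units of one uncached pass."""
    if calls < 1:
        raise ValueError("calls must be at least 1")
    # One write on the first call, then a read on every later call.
    return WRITE_MULT[ttl] + READ_MULT * (calls - 1)


def break_even_calls(ttl="5m"):
    """Smallest number of calls at which caching is cheaper than not caching."""
    calls = 1
    while relative_cost(calls, ttl) >= calls:  # uncached cost is 1 unit per call
        calls += 1
    return calls


for ttl in ("5m", "1h"):
    print(ttl, "break-even at", break_even_calls(ttl), "calls")
```

Under these assumptions a single call with a cache marker costs more than an uncached one, which matches the warning about large writes with no later reads.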

POST /v1/responses

OpenAI Responses API surface for stateful conversations. Same Bearer/x-api-key authentication. Cache counts appear as input_tokens_details.cached_tokens (reads) plus a flat cache_creation_input_tokens and cache_creation.ephemeral_* (writes), for parity with /v1/chat/completions.

POST https://api.airforce/v1/responses

POST /v1beta/models/{model}:generateContent

Google Gemini-compatible endpoint. Works with the official @google/genai SDK and the Gemini CLI by pointing the base URL at https://api.airforce/v1beta. Any routed model works — requests are translated to and from the native Gemini shape, and the model is taken from the URL path (not the body).

POST https://api.airforce/v1beta/models/{model}:generateContent

Authentication

Pass your Airforce API key any of the three ways Google clients use:

# 1) query parameter (Google default)
?key=sk-air-YOUR_API_KEY

# 2) header
x-goog-api-key: sk-air-YOUR_API_KEY

# 3) bearer token
Authorization: Bearer sk-air-YOUR_API_KEY

Request body

Parameter	Type	Required	Description
contents	array	Required	Conversation turns. Each: { role: "user" \| "model", parts: [...] }. A part is { text }, { functionCall: { name, args } }, or { functionResponse: { name, response } }. "model" is Gemini's term for the assistant role.
systemInstruction	object	Optional	System prompt: { parts: [{ text }] }.
generationConfig	object	Optional	{ temperature, maxOutputTokens, topP, stopSequences } — mapped to the canonical sampling parameters.
tools	array	Optional	Tool definitions: [{ functionDeclarations: [{ name, description, parameters }] }]. functionDeclarations are flattened across entries.
toolConfig	object	Optional	Tool-choice control: { functionCallingConfig: { mode: "AUTO" \| "ANY" \| "NONE" } }. ANY forces a call, NONE disables tools.

Example

curl "https://api.airforce/v1beta/models/gemini-3.1-pro:generateContent" \
  -H "x-goog-api-key: sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {"role": "user", "parts": [{"text": "What is the capital of France?"}]}
    ],
    "systemInstruction": {"parts": [{"text": "You are a helpful assistant."}]},
    "generationConfig": {"temperature": 0.7, "maxOutputTokens": 256}
  }'
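Because this endpoint uses "model" for the assistant role and moves the system prompt into systemInstruction, an OpenAI-style message list needs a small conversion before it can be sent here. A minimal text-only sketch; to_gemini_body is a hypothetical helper, and tool or image content is deliberately out of scope.

```python
def to_gemini_body(messages, temperature=None, max_output_tokens=None):
    """Convert OpenAI-style text messages into a generateContent request body."""
    system_parts, contents = [], []
    for msg in messages:
        role, text = msg["role"], msg["content"]
        if role == "system":
            system_parts.append({"text": text})
        elif role in ("user", "assistant"):
            # "model" is Gemini's name for the assistant role.
            gemini_role = "model" if role == "assistant" else "user"
            contents.append({"role": gemini_role, "parts": [{"text": text}]})
        else:
            raise ValueError(f"unsupported role in this sketch: {role}")
    body = {"contents": contents}
    if system_parts:
        body["systemInstruction"] = {"parts": system_parts}
    config = {}
    if temperature is not None:
        config["temperature"] = temperature
    if max_output_tokens is not None:
        config["maxOutputTokens"] = max_output_tokens
    if config:
        body["generationConfig"] = config
    return body


body = to_gemini_body(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.7,
    max_output_tokens=256,
)
print(body)
```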

Response shape

Parameter	Type	Required	Description
candidates	array	Optional	Generated turns: [{ content: { role: "model", parts }, finishReason, index }]. Only the first candidate is populated.
candidates[].finishReason	string	Optional	"STOP" \| "MAX_TOKENS" \| "SAFETY" \| "OTHER".
usageMetadata	object	Optional	{ promptTokenCount, candidatesTokenCount, totalTokenCount, cachedContentTokenCount? }. cachedContentTokenCount appears when the upstream reported a cache read.
modelVersion	string	Optional	Echo of the requested model.

{
  "candidates": [{
    "content": {
      "role": "model",
      "parts": [{"text": "The capital of France is Paris."}]
    },
    "finishReason": "STOP",
    "index": 0
  }],
  "usageMetadata": {
    "promptTokenCount": 16,
    "candidatesTokenCount": 8,
    "totalTokenCount": 24
  },
  "modelVersion": "gemini-3.1-pro"
}

POST /v1beta/models/{model}:streamGenerateContent

Streaming uses the :streamGenerateContent action and returns Server-Sent Events. Each data: line is a full Gemini-shaped chunk (not a delta object); the final chunk carries usageMetadata.

data: {"candidates":[{"content":{"role":"model","parts":[{"text":"The capital"}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[{"text":" is Paris."}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[]},"finishReason":"STOP","index":0}],"usageMetadata":{"promptTokenCount":16,"candidatesTokenCount":8,"totalTokenCount":24}}

List models

The catalog is also exposed in Gemini Model-resource shape so Google clients can enumerate models.

curl https://api.airforce/v1beta/models

Notes: the base URL is https://api.airforce/v1beta (or /v1), not Google's host. The model name comes from the URL path, not the request body. Only the first candidate is returned, and a subset of Gemini fields is translated — safetySettings and cachedContent are currently ignored. Billing, rate limits and smart routing apply exactly as on /v1/chat/completions.

Errors

Airforce returns standard HTTP status codes and a uniform error envelope for both endpoints.

Status	Type	Description
400	invalid_request_error	Malformed JSON, missing required field, unknown model.
401	invalid_request_error / auth_required	Missing or invalid API key.
402	insufficient_quota	The model requires an active subscription or a positive Pay-as-you-Go balance.
403	model_access_denied / insufficient_scope	Plan or per-key permissions deny this request.
404	model_not_found	The requested model does not exist or you do not have access to it.
429	rate_limit_error	Request rate or daily token limit exceeded.
503	api_error / moderation_unavailable	All upstream keys for the requested provider failed.

{
  "error": {
    "message": "The requested model does not exist or you do not have access to it.",
    "type": "model_not_found",
    "param": null,
    "code": "404"
  }
}

The descriptive slug is in type. code is the HTTP status as a string (e.g. "404"), and param is null except on parameter range validation errors, where it names the offending parameter.
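A client can branch on the envelope's type and on the HTTP status. The sketch below parses an error body defensively and marks 429 and 503 as retryable; treating those two as transient is an assumption of this example, not something the table guarantees. parse_error is a hypothetical helper.

```python
import json

RETRYABLE = {429, 503}  # assumption: rate limits and upstream failures are transient


def parse_error(status, body_text):
    """Turn an HTTP status and raw body into a flat error description."""
    try:
        parsed = json.loads(body_text)
    except json.JSONDecodeError:
        parsed = None  # e.g. a proxy returned plain text
    err = parsed.get("error") if isinstance(parsed, dict) else None
    if not isinstance(err, dict):
        err = {}
    return {
        "status": status,
        "type": err.get("type", "unknown"),
        "message": err.get("message", body_text),
        "param": err.get("param"),
        "retryable": status in RETRYABLE,
    }


SAMPLE = json.dumps({"error": {
    "message": "The requested model does not exist or you do not have access to it.",
    "type": "model_not_found",
    "param": None,
    "code": "404",
}})
info = parse_error(404, SAMPLE)
print(info["type"], info["retryable"])
```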

Discover models

See the full list of model IDs and their capability flags (vision, tools, reasoning, caching, context length, …) at /docs/api/models.

curl https://api.airforce/v1/models \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY"
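Once the catalog is fetched, the capability flags can be used to pick models programmatically. This sketch assumes the response follows the OpenAI list convention of a top-level data array with the supports_* flags on each entry; that wrapper, the models_with helper, and the sample entries (with made-up IDs) are assumptions for illustration.

```python
def models_with(catalog, *flags):
    """Return IDs of models whose entries set every given flag to true."""
    return [
        model["id"]
        for model in catalog.get("data", [])
        if all(model.get(flag) is True for flag in flags)
    ]


# Hypothetical catalog entries; real IDs and flags come from GET /v1/models.
catalog = {"data": [
    {"id": "example-model-a", "supports_reasoning": True, "supports_vision": True},
    {"id": "example-model-b", "supports_reasoning": False, "supports_vision": True},
    {"id": "example-model-c", "supports_reasoning": True},
]}
print(models_with(catalog, "supports_reasoning"))
print(models_with(catalog, "supports_reasoning", "supports_vision"))
```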