API REFERENCE

Chat Completions

Generate chat responses across more than 100 models from a single API. Drop-in compatible with OpenAI Chat Completions, Anthropic Messages, and OpenAI Responses.

Airforce speaks both the OpenAI Chat Completions format and the Anthropic Messages format over the same set of models. Pick the SDK you already use and change only the base URL; non-Claude models are forwarded transparently behind either surface.

This page covers authentication, the request and response structures for both surfaces, streaming, tool calling, vision, reasoning, and prompt caching. New here? Start with the basic example below, get one call working, and then add streaming, tools, or caching.

Authentication

Every request needs a Bearer token (your Airforce API key). The Anthropic x-api-key header is also accepted on /v1/messages for SDK compatibility.

Authorization: Bearer sk-air-YOUR_API_KEY
# alt for /v1/messages:
x-api-key: sk-air-YOUR_API_KEY

POST /v1/chat/completions

OpenAI-compatible chat completions. Works with the official openai SDK by overriding base_url to https://api.airforce/v1.

POST https://api.airforce/v1/chat/completions

Request body

Parameter	Type	Required	Description
model	string	Required	Model ID. Use GET /v1/models to discover the available IDs.
messages	array	Required	Conversation history. Each entry is { role: "system" \| "user" \| "assistant" \| "tool", content }. content is a string or an array of content blocks (for vision, see below).
max_tokens	integer	Optional	Maximum number of tokens to generate. Capped at the model's max_output_tokens.
temperature	float	Optional	Sampling temperature, 0–2. Lower is more deterministic. The default depends on the upstream provider.
top_p	float	Optional	Nucleus sampling. Use temperature or top_p, not both.
stream	boolean	Optional	When true, the response is a stream of server-sent events. See "Streaming" below.
models	array	Optional	Fallback models (max 3), e.g. ["deepseek-v3.2", "gpt-4o-mini"]. If every channel of the primary model fails, each candidate is tried in order. You are billed for — and response.model reports — the model that actually answered. Unknown or plan-gated candidates are skipped. With the OpenAI SDK pass it via extra_body.
transforms	array	Optional	Prompt transforms. Supported: ["middle-out"] — when the conversation overflows the model's context window, whole messages are dropped from the middle (system prompts, the first message and the most recent turns are kept), so long roleplay or agent histories keep working instead of erroring. Opt-in; off by default.
stream_options	object	Optional	{ include_usage: boolean }. Usage is always included in the last chunk of the stream; this field is accepted for OpenAI compatibility, but usage reporting cannot be turned off.
stop	string \| array	Optional	Up to 4 stop sequences. Generation stops as soon as one is produced.
tools	array	Optional	Definitions of functions the model may call. See "Tool calling" below.
tool_choice	string \| object	Optional	"auto" (default), "none", or { type: "function", function: { name } } to force a specific call.
response_format	object	Optional	{ type: "json_object" } makes the model emit valid JSON. Ignored for models that do not support it. See "Structured outputs" below.
reasoning_effort	string	Optional	Reasoning depth: "low" \| "medium" \| "high" \| "xhigh" \| "max". Any model with supports_reasoning: true (Claude, OpenAI o/GPT-5, Gemini, Qwen, DeepSeek, …). See "Reasoning & thinking".
thinking	string \| object	Optional	Cross-model thinking switch. "on" \| "off" \| "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" \| "disabled" }. See "Reasoning & thinking".
thinking_budget	integer	Optional	Token limit for the model's reasoning trace (when the provider exposes one).
ignore_defaults	boolean	Optional	Skip the user's saved per-model default parameters (configured in the dashboard) for this request.
skill	string	Optional	ID of a single marketplace skill to apply to this request. The skill transforms your messages/parameters before the upstream call and overrides any installed-skill defaults. Consumed by Airforce, never forwarded upstream. See the Skills catalog at /docs/api/extend.
skills	array	Optional	Array of marketplace skill IDs applied in order, for stacking multiple skills on one request.
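The models and transforms fields are gateway extensions, so with the official openai SDK they travel in extra_body, as the table notes. The following is a minimal sketch of a helper that builds that dictionary and enforces the three-fallback cap on the client side. The name gateway_extras is hypothetical, and the commented call assumes an already configured OpenAI client.

```python
def gateway_extras(fallbacks=(), middle_out=False):
    """Build the Airforce-specific fields to pass as extra_body (hypothetical helper)."""
    fallbacks = list(fallbacks)
    if len(fallbacks) > 3:
        # The parameter table caps fallback models at three.
        raise ValueError("at most 3 fallback models are allowed")
    extras = {}
    if fallbacks:
        extras["models"] = fallbacks
    if middle_out:
        # Opt in to dropping messages from the middle of an overflowing history.
        extras["transforms"] = ["middle-out"]
    return extras


extras = gateway_extras(["deepseek-v3.2", "gpt-4o-mini"], middle_out=True)
print(extras)

# Usage with the OpenAI SDK (not executed here):
# client.chat.completions.create(model="gpt-5.1-chat", messages=msgs, extra_body=extras)
```

Because the model that answers may be a fallback, read response.model rather than assuming the primary model served the request.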

Basic example

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
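The same call can be prepared from Python with only the standard library. This is a sketch, not an official client: build_chat_request is a hypothetical helper that constructs the request without sending it, so it can be inspected or tested offline.

```python
import json
import urllib.request

BASE_URL = "https://api.airforce/v1"


def build_chat_request(api_key, model, messages, **params):
    """Return an unsent urllib Request for POST /v1/chat/completions."""
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "sk-air-YOUR_API_KEY",
    "gpt-5.1-chat",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(req.get_method(), req.full_url)

# To actually send it: json.load(urllib.request.urlopen(req))
```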

Response shape

Parameter	Type	Required	Description
id	string	Optional	Stable completion ID, e.g. "chatcmpl-abc123".
object	string	Optional	"chat.completion" for non-streamed responses, "chat.completion.chunk" for streamed ones.
created	integer	Optional	Unix timestamp (seconds).
model	string	Optional	ID of the model that answered: the requested model, or a fallback from models when one was used.
choices	array	Optional	Array of completion candidates: [{ index, message: { role, content, tool_calls? }, finish_reason }].
choices[].finish_reason	string	Optional	"stop" \| "length" \| "tool_calls" \| "content_filter".
usage	object	Optional	{ prompt_tokens, completion_tokens, total_tokens, completion_tokens_details?, prompt_tokens_details?, cache_creation_input_tokens?, cache_creation? }. completion_tokens_details.reasoning_tokens is set when the model produced a reasoning trace. The cache fields appear when the upstream returned prompt-cache information: prompt_tokens_details.cached_tokens reports cache reads (OpenAI standard), cache_creation_input_tokens aggregates the writes, and cache_creation.ephemeral_5m_input_tokens / ephemeral_1h_input_tokens give the per-TTL breakdown.

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "gpt-5.1-chat",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 8,
    "total_tokens": 28
  }
}
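Reading this shape in client code usually comes down to a few lookups. The sketch below uses a hypothetical helper, summarize_completion, that pulls out the assistant text and flags truncation when finish_reason is "length".

```python
def summarize_completion(response):
    """Pull the assistant text, finish reason and token total out of a response dict."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["message"].get("content"),
        "finish_reason": choice.get("finish_reason"),
        # "length" means the output hit max_tokens before the model finished.
        "truncated": choice.get("finish_reason") == "length",
        "total_tokens": usage.get("total_tokens"),
    }


sample = {
    "model": "gpt-5.1-chat",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "The capital of France is Paris."},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 20, "completion_tokens": 8, "total_tokens": 28},
}
print(summarize_completion(sample))
```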

Reasoning & thinking

Reasoning/thinking is a cross-model feature for every model ID with supports_reasoning: true — Claude, OpenAI o-series/GPT-5, Gemini, Qwen, DeepSeek, and others. You send the same canonical parameters; Airforce maps them to each provider's native shape. This is not a DeepSeek-only API.

Source of truth: check for supports_reasoning: true on a model in GET /v1/models (or GET /api/models/{id}/allowed-params). Prefer that flag over guessing from the model name.

Canonical parameters

Parameter	Type	Required	Description
reasoning_effort	string	Optional	"low" \| "medium" \| "high" \| "xhigh" \| "max". Accepted on every model with supports_reasoning: true. Some upstreams only honour a subset (e.g. high/max); others clamp unsupported levels to the nearest served value.
thinking	string \| object	Optional	Three accepted shapes (we normalise): "on" \| "off" \| "auto"; Anthropic-style { type: "enabled", budget_tokens: N }; hybrid { type: "enabled" \| "disabled" }. Mapped onto Claude extended thinking, OpenAI effort profiles, Gemini thinking_config, Qwen enable_thinking, DeepSeek hybrid, etc.
thinking_budget	integer	Optional	Maximum tokens the model may spend reasoning before emitting visible output. Mirrors budget_tokens when the upstream exposes a budget; takes precedence over reasoning_effort when both are sent and a budget is available.
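To make the three accepted thinking shapes concrete, here is a small client-side sketch that reduces each of them to a (mode, budget) pair. It is an illustration for validating input before sending, not the gateway's own normalisation code; normalize_thinking is a hypothetical name.

```python
def normalize_thinking(value):
    """Reduce the three accepted `thinking` shapes to (mode, budget_tokens)."""
    if isinstance(value, str):
        if value not in ("on", "off", "auto"):
            raise ValueError(f"unknown thinking switch: {value!r}")
        return value, None
    if isinstance(value, dict):
        kind = value.get("type")
        if kind == "enabled":
            # Anthropic style carries budget_tokens; the hybrid style omits it.
            return "on", value.get("budget_tokens")
        if kind == "disabled":
            return "off", None
    raise ValueError(f"unsupported thinking value: {value!r}")


print(normalize_thinking("auto"))
print(normalize_thinking({"type": "enabled", "budget_tokens": 4000}))
print(normalize_thinking({"type": "disabled"}))
```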

What differs by family (mapping only)

Parameters are the same everywhere. Only how we map them (and how hard "off" is) differs:

Claude — Thinking on/off + budget; often also reasoning_effort via the gateway.
OpenAI (o1/o3, GPT-5) — Mainly reasoning_effort. A full "thinking off" is often not available — you control how strongly the model reasons, not always whether it reasons at all.
Gemini — thinking_config / budget mapped internally.
Qwen / Xiaomi / Alibaba — thinking + enable_thinking-style controls.
DeepSeek (generic) — Hybrid on/off is especially clear: thinking: { type: enabled|disabled } plus optional reasoning_effort.
Resellers / other — Often generic passthrough of the same canonical fields.

Controlling where the trace appears

An optional reasoning object on the request decides what happens to the thinking trace. It is consumed by Airforce and never forwarded upstream.

Parameter	Type	Required	Description
reasoning.format	string	Optional	"separate" (default) puts the trace in message.reasoning (and delta.reasoning while streaming). "inline" keeps the legacy inline <think>…</think> form inside content.
reasoning.exclude	boolean	Optional	When true, the reasoning trace is dropped entirely from the response. Reasoning tokens are still counted and billed if the model produced them.

"reasoning": { "format": "separate", "exclude": false }

Reasoning effort (OpenAI style)

Primary control for o-series and GPT-5: how much the model may reason. Same canonical field as on every other supports_reasoning model — OpenAI is included, but behaviour is not 1:1 with DeepSeek's hard on/off.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "o3-mini",
    "messages": [{"role": "user", "content": "Prove the Pythagorean theorem."}],
    "reasoning_effort": "high"
  }'

Extended thinking (Anthropic style)

Budget-based thinking for Claude (and gateways that accept the Anthropic shape). You can still send reasoning_effort; we map when the channel supports it.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "messages": [{"role": "user", "content": "Plan a 7-day Italy trip."}],
    "thinking": {"type": "enabled", "budget_tokens": 4000}
  }'

Hybrid thinking (e.g. DeepSeek V3.2/V4)

Example of a hybrid model family with a clear Thinking / Non-Thinking switch — not a separate protocol. deepseek-v3.2, deepseek-v4-flash and deepseek-v4-pro accept the same canonical fields as every other supports_reasoning model. Toggle thinking and optionally set effort in one request:

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Solve this step by step: integrate x^2 * e^x."}],
    "thinking": {"type": "enabled"},
    "reasoning_effort": "high"
  }'

Turn thinking off (faster, cheaper when you only need the final answer) — this hard off is clearer on hybrid models than on many OpenAI o-series profiles:

"thinking": {"type": "disabled"}
// or simply: "thinking": "off"

Native docs for this family often list effort levels such as "high" and "max". We accept the full low…max scale and map unsupported levels to the nearest value that reaches the model. Prefer the hybrid IDs above over retired deepseek-chat / deepseek-reasoner names when you need an explicit on/off switch.

The reasoning trace itself appears in choices[0].message.reasoning (OpenAI shape) or as thinking blocks in content (Anthropic format). Reasoning tokens are billed and reported in usage.completion_tokens_details.reasoning_tokens.

The completion_tokens_details.reasoning_tokens breakdown is present only when the upstream provider reports it. In a streamed response the trace arrives in delta.reasoning_content, chunk by chunk.
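A client that talks to both surfaces can read the trace with one function. The sketch below is an assumption-laden illustration: extract_reasoning is a hypothetical helper, and it assumes each Anthropic-style thinking block carries its text under a "thinking" key, as in Anthropic's native format; this page only documents the block type.

```python
def extract_reasoning(response):
    """Return the reasoning trace from either response shape, or None if absent."""
    if "choices" in response:  # OpenAI shape
        return response["choices"][0]["message"].get("reasoning")
    # Anthropic shape: collect thinking blocks from the content array.
    parts = [
        block.get("thinking", "")  # assumed field name for the trace text
        for block in response.get("content", [])
        if block.get("type") == "thinking"
    ]
    return "".join(parts) or None


openai_style = {"choices": [{"message": {"role": "assistant", "content": "4", "reasoning": "2 + 2"}}]}
anthropic_style = {"content": [{"type": "thinking", "thinking": "2 + 2"}, {"type": "text", "text": "4"}]}
print(extract_reasoning(openai_style), extract_reasoning(anthropic_style))
```

With reasoning.exclude set to true, or on a model that produced no trace, the function returns None.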

Vision and image input

Models with supports_vision: true accept inline images as content blocks. Either a public URL or a base64 data URL works; size limits depend on the upstream model.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
      ]
    }]
  }'
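For a local file, the base64 data URL variant can be built as follows. This is a minimal sketch; image_block is a hypothetical helper, and the MIME type must match the actual image bytes.

```python
import base64


def image_block(data, mime_type="image/jpeg"):
    """Wrap raw image bytes in an image_url content block using a data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


# In practice: data = open("cat.jpg", "rb").read()
block = image_block(b"abc", "image/png")
message = {
    "role": "user",
    "content": [{"type": "text", "text": "What is in this image?"}, block],
}
print(block["image_url"]["url"])
```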

Tool calling

Models with supports_tools: true can call functions that you define. The model returns a tool_calls array; you execute the call and then send the result back in a tool message.

Request

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto"
  }'

Response with a tool call

{
  "id": "chatcmpl-abc123",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"Paris\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

Follow-up with the tool result

{
  "model": "gpt-5.1-chat",
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}
      }]
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 14, \"sky\": \"cloudy\"}"}
  ]
}
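The middle step, running the requested function and producing the tool message, can be sketched like this. The get_weather body is a stand-in that returns fixed data, and tool_result_messages is a hypothetical helper; the point is that arguments arrives as a JSON string and each result must echo the matching tool_call_id.

```python
import json


def get_weather(location):
    # Stand-in implementation; a real tool would query a weather service.
    return {"temp_c": 14, "sky": "cloudy"}


TOOLS = {"get_weather": get_weather}


def tool_result_messages(assistant_message, tools=TOOLS):
    """Run each requested tool call and return the matching tool messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = tools[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments is a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return results


assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"},
    }],
}
print(tool_result_messages(assistant_message))
```

Append the assistant message and the returned tool messages to the history, then call the endpoint again to get the final answer.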

Assistant prefill

End your messages array with an assistant message that already contains some text, and the model continues from it instead of starting a fresh turn. This is a reliable way to force a response to begin a specific way — a leading "{" for JSON, a chosen language, or a fixed prefix. The same trick works on /v1/messages. Providers that reject native prefill are handled automatically: the gateway retries once with a compatible rewrite, so you do not have to special-case them.

{
  "model": "claude-sonnet-4.6",
  "messages": [
    {"role": "user", "content": "List three primary colors as a JSON array."},
    {"role": "assistant", "content": "["}
  ]
}

Structured outputs

Set response_format to make the model return JSON. Two modes are supported:

{ "type": "json_object" } — the response is a single valid JSON value.
{ "type": "json_schema", "json_schema": { "name", "schema", "strict" } } — the model is steered to produce JSON that matches your JSON Schema.

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Extract the city and country: I live in Paris, France."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "location",
        "schema": {
          "type": "object",
          "properties": { "city": {"type": "string"}, "country": {"type": "string"} },
          "required": ["city", "country"]
        }
      }
    }
  }'

Reliability: even when a model wraps its answer in prose or a markdown code fence, Airforce extracts the JSON payload so you always receive parseable content. If no valid JSON can be recovered, the original text is returned unchanged — so the guarantee never makes a response worse. This applies to non-streamed responses; streamed responses are passed through unchanged.
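Since streamed responses are passed through unchanged, a client that streams may want a similar best-effort recovery once the full text is assembled. The sketch below is one possible approach, not the gateway's algorithm; extract_json is a hypothetical helper that tries the raw text, then fenced blocks, then the widest brace or bracket span.

```python
import json
import re

# Matches a markdown code fence (three backticks), optionally tagged "json".
FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


def extract_json(text):
    """Best-effort JSON recovery from model text; returns None when nothing parses."""
    candidates = [text.strip()]
    candidates += [match.strip() for match in FENCE.findall(text)]
    # Fall back to the widest {...} or [...] span in the text.
    for opener, closer in ("{}", "[]"):
        start, end = text.find(opener), text.rfind(closer)
        if 0 <= start < end:
            candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None


fence = "`" * 3
wrapped = f'Here you go:\n{fence}json\n{{"city": "Paris", "country": "France"}}\n{fence}'
print(extract_json(wrapped))
```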

Streaming

Set stream: true to receive partial completions as server-sent events. Each event is a JSON chunk with the same shape as the non-streamed response, except that message is replaced by delta. The stream ends with data: [DONE].

curl https://api.airforce/v1/chat/completions \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-chat",
    "messages": [{"role": "user", "content": "Write a haiku about Berlin."}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'

Wire format

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"Cold "},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{"content":"stone "},"finish_reason":null}]}

…

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1710000000,"model":"gpt-5.1-chat","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":17,"total_tokens":29}}

data: [DONE]
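A consumer of this wire format only needs to strip the data: prefix, stop at [DONE], and concatenate delta.content. The following sketch works on any iterable of lines (for example, the lines of an HTTP response body); collect_stream is a hypothetical name and the sample chunks are abbreviated versions of the ones above.

```python
import json


def collect_stream(lines):
    """Accumulate text, finish reason and usage from an iterable of SSE lines."""
    text, usage, finish_reason = [], None, None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        usage = chunk.get("usage") or usage  # usage arrives on the last chunk
        for choice in chunk.get("choices", []):
            text.append(choice.get("delta", {}).get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(text), finish_reason, usage


sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    '',
    'data: {"choices":[{"index":0,"delta":{"content":"Cold "},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"stone "},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],'
    '"usage":{"prompt_tokens":12,"completion_tokens":17,"total_tokens":29}}',
    'data: [DONE]',
]
text, finish_reason, usage = collect_stream(sample)
print(repr(text), finish_reason, usage["total_tokens"])
```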

Reliability & smart routing

Every model ID resolves to a pool of upstream providers behind the scenes. If the first one errors or times out, the request is automatically retried against the next provider for the same model, in order, before any failure is returned — you do not configure or trigger this. The model field in the response always reports the variant that actually answered. This is independent of the optional models / fallbacks array, which adds your own cross-model candidates on top: first the primary model exhausts its own provider chain, then each fallback model exhausts its chain.

POST /v1/messages

Anthropic-compatible Messages API. Works with the official @anthropic-ai/sdk by setting baseURL to https://api.airforce. Non-Claude models are forwarded transparently to OpenAI, Google, and other providers.

POST https://api.airforce/v1/messages

Request body

Parameter	Type	Required	Description
model	string	Required	Model ID (Anthropic format or routed alias).
messages	array	Required	Each entry: { role: "user" \| "assistant", content: string \| array }.
max_tokens	integer	Required	Required by Anthropic. Token limit for the response.
system	string \| array	Optional	System prompt. Pass an array of { type: "text", text, cache_control? } blocks to mark prefix segments for caching. See "Prompt caching".
temperature	float	Optional	0–1.
top_p	float	Optional	Nucleus sampling.
top_k	integer	Optional	Limit the sampling pool to the top K tokens.
stop_sequences	array	Optional	Up to 4 stop sequences.
stream	boolean	Optional	When true, emits an Anthropic-style SSE event stream (see "Streaming events").
fallbacks	array	Optional	Fallback models (max 3) in Anthropic form: [{"model": "gpt-4o-mini"}]. If every channel of the primary model fails, each candidate is tried in order; you are billed for — and the response model field reports — the model that actually answered. A plain models string array is accepted too.
tools	array	Optional	Anthropic tool definitions: { name, description, input_schema }. The response may contain tool_use content blocks.
tool_choice	object	Optional	{ type: "auto" \| "any" \| "tool", name? }.
thinking	object	Optional	Anthropic extended thinking: { type: "enabled", budget_tokens: N }.

Example

curl https://api.airforce/v1/messages \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "max_tokens": 256,
    "system": "You are a helpful assistant.",
    "messages": [
      {"role": "user", "content": "Hello, Claude!"}
    ]
  }'

Response shape

Parameter	Type	Required	Description
id	string	Optional	Message ID, e.g. "msg_01ABCxyz".
type	string	Optional	Always "message".
role	string	Optional	Always "assistant".
content	array	Optional	Array of content blocks: { type: "text" \| "tool_use" \| "thinking", … }.
model	string	Optional	ID of the model that answered: the requested model, or a fallback when one was used.
stop_reason	string	Optional	"end_turn" \| "max_tokens" \| "stop_sequence" \| "tool_use".
usage	object	Optional	{ input_tokens, output_tokens, cache_read_input_tokens?, cache_creation_input_tokens?, cache_creation? }. The cache fields appear when prompt caching was used. cache_creation.ephemeral_5m_input_tokens and ephemeral_1h_input_tokens give the per-TTL write breakdown.

Streaming events

Anthropic SSE uses named events instead of bare JSON chunks. Each event has both an event: name and a data: JSON payload.

event: message_start
data: {"type":"message_start","message":{"id":"msg_01","role":"assistant","content":[],"model":"claude-sonnet-4.6","stop_reason":null,"usage":{"input_tokens":12,"output_tokens":1}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}

event: message_stop
data: {"type":"message_stop"}
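Consuming this stream means tracking the most recent event: name and acting on the data: line that follows it. The sketch below handles only text deltas and the stop reason; collect_anthropic_stream is a hypothetical helper and the sample is a shortened copy of the events above.

```python
import json


def collect_anthropic_stream(lines):
    """Accumulate text and stop_reason from Anthropic-style SSE lines."""
    text, stop_reason, event = [], None, None
    for line in lines:
        line = line.strip()
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data = json.loads(line[len("data:"):])
            if event == "content_block_delta" and data["delta"].get("type") == "text_delta":
                text.append(data["delta"]["text"])
            elif event == "message_delta":
                stop_reason = data["delta"].get("stop_reason", stop_reason)
            elif event == "message_stop":
                break
    return "".join(text), stop_reason


sample = [
    'event: content_block_start',
    'data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}',
    '',
    'event: content_block_delta',
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}',
    '',
    'event: message_delta',
    'data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}',
    '',
    'event: message_stop',
    'data: {"type":"message_stop"}',
]
print(collect_anthropic_stream(sample))
```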

POST /v1/messages/count_tokens

Anthropic-compatible token counting. Send the same system / messages / tools you would pass to /v1/messages and get an input-token estimate back without running the model — nothing is billed.

POST https://api.airforce/v1/messages/count_tokens

curl https://api.airforce/v1/messages/count_tokens \
  -H "x-api-key: sk-air-YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "system": "You are a helpful assistant.",
    "messages": [{"role": "user", "content": "Hello, Claude!"}]
  }'

# → {"input_tokens": 34}

The count is a fast character-based estimate (about 4 characters per token) over system, messages and tools — close enough for context-budget checks, not an exact tokenizer run.
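For a quick offline budget check, the same four-characters-per-token heuristic can be approximated locally. This sketch does not claim to reproduce the endpoint's number (the sample above returns 34 for two short strings, which is more than their characters alone would give); estimate_tokens and fits_context are hypothetical helpers.

```python
import math


def estimate_tokens(*texts, chars_per_token=4):
    """Rough token estimate: total characters divided by ~4, rounded up."""
    total_chars = sum(len(t) for t in texts)
    return math.ceil(total_chars / chars_per_token)


def fits_context(texts, context_window, reserve_for_output):
    """True when the estimated input plus the reserved output fits the window."""
    return estimate_tokens(*texts) + reserve_for_output <= context_window


prompt = ["You are a helpful assistant.", "Hello, Claude!"]
print(estimate_tokens(*prompt), fits_context(prompt, context_window=200, reserve_for_output=100))
```

Use the endpoint itself when the decision is close to the limit; the local estimate is only a coarse pre-filter.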

Prompt caching

On /v1/messages with Claude models, mark a prefix as cacheable by passing system as an array of blocks in which the segment to cache carries cache_control: { type: "ephemeral" }. Later requests that start with the same prefix are charged the cheaper cache-read rate. Models with supports_caching: true in /v1/models support this.

Write vs read pricing

Cache writes are typically charged slightly above normal input (about 1.25× on Claude-family models). Cache reads are much cheaper (about 0.1× input). A large write with almost no later read is the expensive case — not a “cache discount”. Only reusing the same prefix turns the write into savings.

Tools like Claude Code often attach a large project context with cache markers on the first turns. Expect cache-write spend while the repo/system prefix is loaded; later turns only get cheap if that prefix is stable and reused. Subagents and multi-step agents can multiply large contexts across several requests.

{
  "model": "claude-sonnet-4.6",
  "max_tokens": 1024,
  "system": [
    {"type": "text", "text": "You are a senior staff engineer at Airforce."},
    {
      "type": "text",
      "text": "<repository-snapshot>...</repository-snapshot>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Where is rate limiting enforced?"}
  ]
}

How cache counts are reported in the response

Cache token counts are returned in each format's native shape, so the SDKs (openai, @anthropic-ai/sdk, @google/genai) read them without custom code. Fields are omitted when their value is zero, which keeps uncached responses light.

/v1/chat/completions (OpenAI shape)

"usage": {
  "prompt_tokens": 2104,
  "completion_tokens": 147,
  "total_tokens": 2251,
  "prompt_tokens_details": { "cached_tokens": 1980 },
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1/messages (Anthropic shape)

"usage": {
  "input_tokens": 2104,
  "output_tokens": 147,
  "cache_read_input_tokens": 1980,
  "cache_creation_input_tokens": 124,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 124,
    "ephemeral_1h_input_tokens": 0
  }
}

/v1beta/.../generateContent (Gemini shape)

"usageMetadata": {
  "promptTokenCount": 2104,
  "candidatesTokenCount": 147,
  "totalTokenCount": 2251,
  "cachedContentTokenCount": 1980
}
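Code that logs cache hits across all three surfaces can fold the shapes above into one lookup. The sketch below uses a hypothetical helper, cache_read_tokens, and treats a missing field as zero because zero-valued fields are omitted.

```python
def cache_read_tokens(usage):
    """Return cache-read tokens from an OpenAI, Anthropic or Gemini usage object."""
    if "cachedContentTokenCount" in usage:  # Gemini usageMetadata
        return usage["cachedContentTokenCount"]
    if "cache_read_input_tokens" in usage:  # Anthropic usage
        return usage["cache_read_input_tokens"]
    details = usage.get("prompt_tokens_details") or {}  # OpenAI usage
    return details.get("cached_tokens", 0)


openai_usage = {"prompt_tokens": 2104, "prompt_tokens_details": {"cached_tokens": 1980}}
anthropic_usage = {"input_tokens": 2104, "cache_read_input_tokens": 1980}
gemini_usage = {"promptTokenCount": 2104, "cachedContentTokenCount": 1980}
print([cache_read_tokens(u) for u in (openai_usage, anthropic_usage, gemini_usage)])
```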

Where caching applies

Explicit cache_control markers are honored on /v1/messages and /v1/chat/completions for Claude models; place them on system or message content blocks. Many other providers (the OpenAI family, DeepSeek, Gemini) cache automatically: you send no markers and simply see cached_tokens in the response when a sufficiently long prefix is reused.

Cache lifetime: 5 minutes or 1 hour

A cached prefix lives for 5 minutes by default, and the timer resets on every hit. For a longer-lived prefix, add ttl: "1h" to the marker. The response reports each TTL separately under cache_creation.

"cache_control": { "type": "ephemeral", "ttl": "1h" }

Example: write first, then read

Send exactly the same request twice (the caching example above). The first call that sees the prefix pays a one-time cache write; identical calls within the TTL pay the much cheaper cache read.

First call, cache write (usage excerpt):

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 1980,
  "cache_read_input_tokens": 0
}

Second identical call within the TTL, cache read:

"usage": {
  "input_tokens": 2104,
  "output_tokens": 12,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 1980
}

Limits and cost

Claude requires a minimum cacheable prefix (about 1024 tokens; more on some models). Shorter prefixes are simply not cached.
Up to 4 cache breakpoints per request, and the cached prefix must be byte-for-byte identical between calls; even a one-character change misses the cache.
Cache writes cost more than normal input (5m ≈ 1.25×, 1h ≈ 2×); reads cost far less (≈ 0.1×). See each model's cache prices on the pricing page.
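The break-even point follows directly from those multipliers. The sketch below expresses prefix cost in input-token equivalents, assuming one write followed by reads that all land within the TTL; the multipliers are the approximate figures quoted above, not per-model prices, and the function names are hypothetical.

```python
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.0}  # approximate, from the list above
READ_MULTIPLIER = 0.1


def cached_cost(prefix_tokens, calls, ttl="5m"):
    """Input-token equivalents billed for the prefix: one write, then reads."""
    if calls < 1:
        return 0.0
    return prefix_tokens * (WRITE_MULTIPLIER[ttl] + READ_MULTIPLIER * (calls - 1))


def uncached_cost(prefix_tokens, calls):
    """Input-token equivalents when the prefix is sent at the normal rate every time."""
    return float(prefix_tokens * calls)


for calls in (1, 2, 3):
    print(calls, cached_cost(2000, calls), cached_cost(2000, calls, "1h"), uncached_cost(2000, calls))
```

Under these assumptions a 5-minute cache pays off from the second call, while a 1-hour cache needs a third call before it is cheaper than sending the prefix uncached.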

POST /v1/responses

OpenAI Responses API surface for stateful conversations. Same Bearer/x-api-key authentication. Cache counts appear as input_tokens_details.cached_tokens (reads) plus the flat cache_creation_input_tokens and cache_creation.ephemeral_* fields (writes), for parity with /v1/chat/completions.

POST https://api.airforce/v1/responses

POST /v1beta/models/{model}:generateContent

Google Gemini-compatible endpoint. Works with the official @google/genai SDK and the Gemini CLI by pointing the base URL at https://api.airforce/v1beta. Any routed model works — requests are translated to and from the native Gemini shape, and the model is taken from the URL path (not the body).

POST https://api.airforce/v1beta/models/{model}:generateContent

Authentication

Pass your Airforce API key any of the three ways Google clients use:

# 1) query parameter (Google default)
?key=sk-air-YOUR_API_KEY

# 2) header
x-goog-api-key: sk-air-YOUR_API_KEY

# 3) bearer token
Authorization: Bearer sk-air-YOUR_API_KEY

Request body

Parameter	Type	Required	Description
contents	array	Required	Conversation turns. Each: { role: "user" \| "model", parts: [...] }. A part is { text }, { functionCall: { name, args } }, or { functionResponse: { name, response } }. "model" is Gemini's term for the assistant role.
systemInstruction	object	Optional	System prompt: { parts: [{ text }] }.
generationConfig	object	Optional	{ temperature, maxOutputTokens, topP, stopSequences } — mapped to the canonical sampling parameters.
tools	array	Optional	Tool definitions: [{ functionDeclarations: [{ name, description, parameters }] }]. functionDeclarations are flattened across entries.
toolConfig	object	Optional	Tool-choice control: { functionCallingConfig: { mode: "AUTO" \| "ANY" \| "NONE" } }. ANY forces a call, NONE disables tools.

Example

curl "https://api.airforce/v1beta/models/gemini-3.1-pro:generateContent" \
  -H "x-goog-api-key: sk-air-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {"role": "user", "parts": [{"text": "What is the capital of France?"}]}
    ],
    "systemInstruction": {"parts": [{"text": "You are a helpful assistant."}]},
    "generationConfig": {"temperature": 0.7, "maxOutputTokens": 256}
  }'
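When migrating existing OpenAI-style histories to this endpoint, the role and field mapping can be sketched as below. It is a client-side illustration covering plain-text system, user and assistant messages only (no tools or images), and to_gemini_request is a hypothetical helper.

```python
def to_gemini_request(messages):
    """Convert OpenAI-style text messages into a generateContent request body."""
    body = {"contents": []}
    system_parts = []
    for message in messages:
        part = {"text": message["content"]}
        role = message["role"]
        if role == "system":
            system_parts.append(part)
        elif role in ("user", "assistant"):
            # "model" is Gemini's term for the assistant role.
            gemini_role = "model" if role == "assistant" else "user"
            body["contents"].append({"role": gemini_role, "parts": [part]})
        else:
            raise ValueError(f"unsupported role in this sketch: {role!r}")
    if system_parts:
        body["systemInstruction"] = {"parts": system_parts}
    return body


body = to_gemini_request([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
])
print(body)
```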

Response shape

Parameter	Type	Required	Description
candidates	array	Optional	Generated turns: [{ content: { role: "model", parts }, finishReason, index }]. Only the first candidate is populated.
candidates[].finishReason	string	Optional	"STOP" \| "MAX_TOKENS" \| "SAFETY" \| "OTHER".
usageMetadata	object	Optional	{ promptTokenCount, candidatesTokenCount, totalTokenCount, cachedContentTokenCount? }. cachedContentTokenCount appears when the upstream reported a cache read.
modelVersion	string	Optional	Echo of the requested model.

{
  "candidates": [{
    "content": {
      "role": "model",
      "parts": [{"text": "The capital of France is Paris."}]
    },
    "finishReason": "STOP",
    "index": 0
  }],
  "usageMetadata": {
    "promptTokenCount": 16,
    "candidatesTokenCount": 8,
    "totalTokenCount": 24
  },
  "modelVersion": "gemini-3.1-pro"
}

POST /v1beta/models/{model}:streamGenerateContent

Streaming uses the :streamGenerateContent action and returns Server-Sent Events. Each data: line is a full Gemini-shaped chunk (not a delta object); the final chunk carries usageMetadata.

data: {"candidates":[{"content":{"role":"model","parts":[{"text":"The capital"}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[{"text":" is Paris."}]},"index":0}],"modelVersion":"gemini-3.1-pro"}

data: {"candidates":[{"content":{"role":"model","parts":[]},"finishReason":"STOP","index":0}],"usageMetadata":{"promptTokenCount":16,"candidatesTokenCount":8,"totalTokenCount":24}}
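Because each chunk is a complete Gemini-shaped object, reassembling the answer means concatenating the text parts of every chunk in order. A minimal sketch, with collect_gemini_stream as a hypothetical helper and the sample lines shortened from the ones above:

```python
import json


def collect_gemini_stream(lines):
    """Join text parts across Gemini-shaped SSE chunks; keep the last usageMetadata."""
    text, finish_reason, usage = [], None, None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators
        chunk = json.loads(line[len("data:"):])
        usage = chunk.get("usageMetadata", usage)
        for candidate in chunk.get("candidates", []):
            finish_reason = candidate.get("finishReason", finish_reason)
            for part in candidate.get("content", {}).get("parts", []):
                text.append(part.get("text", ""))
    return "".join(text), finish_reason, usage


sample = [
    'data: {"candidates":[{"content":{"role":"model","parts":[{"text":"The capital"}]},"index":0}]}',
    '',
    'data: {"candidates":[{"content":{"role":"model","parts":[{"text":" is Paris."}]},"index":0}]}',
    '',
    'data: {"candidates":[{"content":{"role":"model","parts":[]},"finishReason":"STOP","index":0}],'
    '"usageMetadata":{"promptTokenCount":16,"candidatesTokenCount":8,"totalTokenCount":24}}',
]
print(collect_gemini_stream(sample))
```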

List models

The catalog is also exposed in Gemini Model-resource shape so Google clients can enumerate models.

curl https://api.airforce/v1beta/models

Notes: the base URL is https://api.airforce/v1beta (or /v1), not Google's host. The model name comes from the URL path, not the request body. Only the first candidate is returned, and a subset of Gemini fields is translated — safetySettings and cachedContent are currently ignored. Billing, rate limits and smart routing apply exactly as on /v1/chat/completions.

Errors

Airforce returns standard HTTP status codes and a uniform error envelope for both endpoints.

Status	Type	Description
400	invalid_request_error	Malformed JSON, missing required field, or unknown model.
401	invalid_request_error / auth_required	Missing or invalid API key.
402	insufficient_quota	The model requires an active subscription or a positive Pay-as-you-Go balance.
403	model_access_denied / insufficient_scope	Plan or per-key permissions deny this request.
404	model_not_found	The requested model does not exist or you do not have access to it.
429	rate_limit_error	The request rate or the daily token limit was exceeded.
503	api_error / moderation_unavailable	All upstream keys for the requested provider failed.

{
  "error": {
    "message": "The requested model does not exist or you do not have access to it.",
    "type": "model_not_found",
    "param": null,
    "code": "404"
  }
}

The descriptive slug is in type. code is the HTTP status as a string (e.g. "404"), and param is null except on parameter-range validation errors, where it names the offending parameter.
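Client code can unpack the envelope and decide whether a retry makes sense. In the sketch below, treating 429 and 503 as retryable is an assumption of this example rather than documented gateway guidance, and parse_error is a hypothetical helper.

```python
import json

# Assumption: rate limits and upstream failures are transient; other statuses are not.
RETRYABLE_STATUSES = {429, 503}


def parse_error(status, body):
    """Flatten the error envelope and flag statuses worth retrying."""
    error = json.loads(body).get("error", {})
    return {
        "status": status,
        "type": error.get("type"),
        "message": error.get("message"),
        "param": error.get("param"),
        "retryable": status in RETRYABLE_STATUSES,
    }


body = json.dumps({
    "error": {
        "message": "The requested model does not exist or you do not have access to it.",
        "type": "model_not_found",
        "param": None,
        "code": "404",
    }
})
print(parse_error(404, body))
```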

Discover models

See the full list of model IDs and their capability flags (vision, tools, reasoning, caching, context length, etc.) at /docs/api/models.

curl https://api.airforce/v1/models \
  -H "Authorization: Bearer sk-air-YOUR_API_KEY"