The native ByteSpike protocol. Speaks the Anthropic Messages API verbatim, including tool_use, cache_control, and thinking blocks. Cross-vendor models (GPT, Gemini, DeepSeek, Doubao, etc.) are transparently translated under the hood — the request you send is Anthropic-shape, regardless of which model value you pick.

When to use

Pick this endpoint when you want:
  • Anthropic SDK / Claude Code / Claude Desktop to talk to any model in our catalog
  • Tool use with the cleanest schema (no JSON-string wrapping like OpenAI’s tool_calls)
  • Prompt caching (cache_control blocks) on long static system prompts
  • Extended thinking on Opus / Sonnet 4.x
For a strict OpenAI-shape request, use /chat/completions. For Google Native, use /v1beta/models/{model}:generateContent.

Request

curl https://llm.bytespike.ai/v1/messages \
  -H "x-api-key: $BYTESPIKE_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": "You are a helpful assistant.",
    "messages": [
      {"role": "user", "content": "Summarize the first chapter of Moby Dick."}
    ]
  }'

Headers

| Header | Required | Notes |
| --- | --- | --- |
| x-api-key | yes | Your ByteSpike key (sk-byts-…). |
| anthropic-version | yes | Pin 2023-06-01. |
| content-type | yes | application/json. |
| anthropic-beta | no | Forwarded to the model for Anthropic beta features. |

Body

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| model | string | yes | Model slug. Any catalog model works; see Cross-model routing below. |
| messages | array | yes | Conversation history (Anthropic shape). |
| max_tokens | integer | yes | Hard cap on response length. |
| system | string or array | no | System prompt (string for simple cases, array for cache_control). |
| tools | array | no | Tool definitions (Anthropic input_schema format). |
| tool_choice | object | no | {"type": "auto"} / {"type": "any"} / {"type": "tool", "name": "…"}. |
| temperature | number | no | Default 1.0. |
| top_p | number | no | Nucleus sampling. |
| stop_sequences | string[] | no | Custom stop tokens. |
| stream | boolean | no | Server-sent events. See Streaming. |
| metadata | object | no | {"user_id": "..."}; forwarded to the model and logged on our side. |
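The headers and body fields above can be assembled in Python roughly as follows. This is a minimal sketch using only the standard library; `build_request` and `send` are hypothetical helper names, and the empty-string fallback for the key is only there so the snippet runs without credentials.

```python
import json
import os
import urllib.request

API_URL = "https://llm.bytespike.ai/v1/messages"  # endpoint from the curl example


def build_request(model, messages, max_tokens, api_key, system=None):
    """Assemble a urllib Request carrying the required headers and body fields."""
    body = {"model": model, "messages": messages, "max_tokens": max_tokens}
    if system is not None:
        body["system"] = system  # optional: a string, or an array for cache_control
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request):
    """Perform the call and decode the JSON body (needs network access and a valid key)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_request(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Summarize the first chapter of Moby Dick."}],
    max_tokens=1024,
    api_key=os.environ.get("BYTESPIKE_API_KEY", ""),
    system="You are a helpful assistant.",
)
print(request.get_method(), request.full_url)
```

Calling `send(request)` would issue the actual POST; it is left out of the example run so the sketch works offline.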

Response

{
  "id": "msg_01AbCdEf",
  "type": "message",
  "role": "assistant",
  "content": [
    {"type": "text", "text": "Ishmael, the narrator, signs onto a whaling ship..."}
  ],
  "model": "claude-sonnet-4-6",
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 23,
    "output_tokens": 87
  }
}

Response fields

| Field | Type | Notes |
| --- | --- | --- |
| id | string | Server-generated id, prefix msg_. |
| type | string | Always "message" on a non-error response. |
| role | string | Always "assistant". |
| content | array | Content blocks: text, tool_use, thinking (Opus / Sonnet 4.x). |
| stop_reason | string | end_turn, max_tokens, stop_sequence, tool_use. |
| usage.input_tokens | integer | Tokens billed for input. |
| usage.output_tokens | integer | Tokens billed for output. |
| usage.cache_read_input_tokens | integer | Tokens served from cache (discounted rate). |
| usage.cache_creation_input_tokens | integer | Tokens written to cache (full rate). |
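A response with these fields can be unpacked with a few small helpers. The sketch below is illustrative only: the function names are hypothetical, and it assumes the response has already been decoded into a Python dict.

```python
def extract_text(message):
    """Join the text of every `text` block, skipping tool_use and thinking blocks."""
    return "".join(
        block["text"]
        for block in message.get("content", [])
        if block.get("type") == "text"
    )


def tool_calls(message):
    """Return the tool_use blocks, if any, in the order the model emitted them."""
    return [b for b in message.get("content", []) if b.get("type") == "tool_use"]


def was_truncated(message):
    """True when the reply stopped because it hit the max_tokens cap."""
    return message.get("stop_reason") == "max_tokens"


# The sample response from the Response section above.
sample = {
    "id": "msg_01AbCdEf",
    "type": "message",
    "role": "assistant",
    "content": [
        {"type": "text", "text": "Ishmael, the narrator, signs onto a whaling ship..."}
    ],
    "model": "claude-sonnet-4-6",
    "stop_reason": "end_turn",
    "usage": {"input_tokens": 23, "output_tokens": 87},
}

print(extract_text(sample))
print(was_truncated(sample), len(tool_calls(sample)))
```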

Accounting headers

Every response — success or failure, streamed or not — carries the gateway’s quota envelope:
X-RateLimit-Limit: 50.00
X-RateLimit-Remaining: 42.18
X-RateLimit-Reset: 1716705600
X-Quota-Remaining-Credits: 192.40
  • X-RateLimit-Limit / Remaining — USD budget for the rate-limit bucket closest to constraining you (5h / 1d / 7d, whichever is tightest).
  • X-RateLimit-Reset — Unix timestamp for when that bucket resets.
  • X-Quota-Remaining-Credits — lifetime credits remaining on this key (USD; 1 USD = 1,000,000 credits). Failed requests don’t move this number.
  • X-Org-Quota-Remaining-Credits — org wallet remaining, on org-owned keys.
For the actual per-request cost, query GET /api/v1/usage — it returns the prompt + completion tokens and the final billed credits per call.
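A client might convert the quota envelope into typed values before acting on it. The following is a sketch under the assumption that headers arrive as a plain string-to-string mapping; `parse_quota_headers` and the keys of the returned dict are hypothetical names.

```python
from datetime import datetime, timezone


def parse_quota_headers(headers):
    """Convert the quota envelope into typed values; absent headers map to None."""
    # Header names are case-insensitive, so normalise before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}

    def as_float(name):
        value = lowered.get(name)
        return float(value) if value is not None else None

    reset = lowered.get("x-ratelimit-reset")
    return {
        "limit_usd": as_float("x-ratelimit-limit"),
        "remaining_usd": as_float("x-ratelimit-remaining"),
        # The reset header is a Unix timestamp; expose it as an aware UTC datetime.
        "reset_at": datetime.fromtimestamp(int(reset), tz=timezone.utc) if reset else None,
        "key_credits": as_float("x-quota-remaining-credits"),
        "org_credits": as_float("x-org-quota-remaining-credits"),
    }


quota = parse_quota_headers(
    {
        "X-RateLimit-Limit": "50.00",
        "X-RateLimit-Remaining": "42.18",
        "X-RateLimit-Reset": "1716705600",
        "X-Quota-Remaining-Credits": "192.40",
    }
)
print(quota["remaining_usd"], quota["reset_at"].isoformat())
```

On keys that are not org-owned, `org_credits` simply comes back as `None`.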

Streaming

Set "stream": true. The response is SSE in the standard Anthropic format:
event: message_start
data: {"type":"message_start","message":{...}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Ishmael"}}



event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":87}}

event: message_stop
data: {"type":"message_stop"}
The full SSE event sequence is identical to the Anthropic Messages spec: message_start → one or more (content_block_start → content_block_delta × N → content_block_stop) → message_delta → message_stop. Tool calls arrive as tool_use content blocks whose input is streamed in chunks via input_json_delta.
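One way to consume this stream is to split it into events and accumulate deltas per content block index. The sketch below works on any iterable of already-decoded lines; the helper names are hypothetical, and the `partial_json` field on `input_json_delta` is taken from the Anthropic Messages spec rather than from an example on this page.

```python
import json


def iter_sse_events(lines):
    """Yield (event_name, payload) pairs from an iterable of decoded SSE lines."""
    event_name, data_parts = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_parts.append(line[len("data:"):].strip())
        elif line == "" and data_parts:  # a blank line terminates one event
            yield event_name, json.loads("\n".join(data_parts))
            event_name, data_parts = None, []
    if data_parts:  # final event without a trailing blank line
        yield event_name, json.loads("\n".join(data_parts))


def collect_stream(events):
    """Accumulate text and tool-input JSON per content block index."""
    text, tool_json, stop_reason = {}, {}, None
    for _name, payload in events:
        kind = payload.get("type")
        if kind == "content_block_delta":
            delta, index = payload["delta"], payload["index"]
            if delta["type"] == "text_delta":
                text[index] = text.get(index, "") + delta["text"]
            elif delta["type"] == "input_json_delta":
                tool_json[index] = tool_json.get(index, "") + delta["partial_json"]
        elif kind == "message_delta":
            stop_reason = payload["delta"].get("stop_reason")
    # Tool input only becomes valid JSON once all of its chunks have arrived.
    tool_inputs = {i: json.loads(s) if s else {} for i, s in tool_json.items()}
    return {"text": text, "tool_inputs": tool_inputs, "stop_reason": stop_reason}


sample_lines = [
    "event: content_block_delta",
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Ishmael"}}',
    "",
    "event: message_delta",
    'data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":87}}',
    "",
]
result = collect_stream(iter_sse_events(sample_lines))
print(result["text"][0], result["stop_reason"])
```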

Tool use (multi-turn)

Tool calling round-trips through two requests. The first carries the tool schema; the model responds with a tool_use block; you execute the tool and POST the result back in a second request.

Round 1 — tool offered

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "tools": [
    {
      "name": "get_weather",
      "description": "Get current weather for a city.",
      "input_schema": {
        "type": "object",
        "properties": {
          "city": {"type": "string", "description": "Name of the city."}
        },
        "required": ["city"]
      }
    }
  ],
  "messages": [
    {"role": "user", "content": "What's the weather in Tokyo?"}
  ]
}
Response:
{
  "id": "msg_01...",
  "role": "assistant",
  "content": [
    {"type": "text", "text": "Let me check that for you."},
    {
      "type": "tool_use",
      "id": "toolu_01ABC",
      "name": "get_weather",
      "input": {"city": "Tokyo"}
    }
  ],
  "stop_reason": "tool_use"
}

Round 2 — tool result returned

Execute get_weather({city: "Tokyo"}) locally, then send the result as a tool_result block referencing the original tool_use.id:
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "tools": [ /* same schema */ ],
  "messages": [
    {"role": "user", "content": "What's the weather in Tokyo?"},
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "Let me check that for you."},
        {
          "type": "tool_use",
          "id": "toolu_01ABC",
          "name": "get_weather",
          "input": {"city": "Tokyo"}
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_01ABC",
          "content": "18°C, partly cloudy"
        }
      ]
    }
  ]
}
The model now returns a final text answer with stop_reason: "end_turn".
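The step between the two rounds can be sketched as a function that turns the round 1 response into the round 2 message list. `build_tool_followup` and the `run_tool` callback are hypothetical names; the callback stands in for whatever local code actually executes the tool.

```python
def build_tool_followup(messages, assistant_message, run_tool):
    """Return the round 2 messages: history + assistant turn + tool_result turn."""
    results = []
    for block in assistant_message["content"]:
        if block.get("type") == "tool_use":
            output = run_tool(block["name"], block["input"])
            results.append(
                {
                    "type": "tool_result",
                    "tool_use_id": block["id"],  # must echo the tool_use id
                    "content": str(output),
                }
            )
    if not results:
        return list(messages)  # no tool was requested; nothing to send back
    return list(messages) + [
        {"role": "assistant", "content": assistant_message["content"]},
        {"role": "user", "content": results},
    ]


history = [{"role": "user", "content": "What's the weather in Tokyo?"}]
round_one = {
    "role": "assistant",
    "content": [
        {"type": "text", "text": "Let me check that for you."},
        {
            "type": "tool_use",
            "id": "toolu_01ABC",
            "name": "get_weather",
            "input": {"city": "Tokyo"},
        },
    ],
    "stop_reason": "tool_use",
}


def fake_weather(name, arguments):
    """Stand-in tool implementation returning the sample result used above."""
    return "18°C, partly cloudy"


followup = build_tool_followup(history, round_one, fake_weather)
print(len(followup), followup[-1]["content"][0]["tool_use_id"])
```

The returned list goes into the `messages` field of the second request, alongside the same `tools` array as in round 1.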

Image / multimodal content

Send images as image content blocks. Both base64 and URL sources work; the gateway forwards the bytes directly to vision-capable models (Claude Sonnet/Opus 4.x, GPT-5-x, Gemini, Doubao Vision, etc.).
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image",
          "source": {
            "type": "base64",
            "media_type": "image/jpeg",
            "data": "/9j/4AAQSkZJRg..."
          }
        },
        {"type": "text", "text": "What's in this image?"}
      ]
    }
  ]
}
For URL inputs:
{
  "type": "image",
  "source": {"type": "url", "url": "https://example.com/photo.jpg"}
}
Non-vision models reject image blocks with a 400; check the model’s capability tags via GET /v1/models.
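Building these image blocks by hand is error-prone, so a small helper may be useful. This is a sketch with hypothetical function names; it only constructs the JSON structures shown above and does not check whether the chosen model is vision-capable.

```python
import base64


def image_block_from_bytes(data, media_type="image/jpeg"):
    """Wrap raw image bytes as a base64 image content block."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(data).decode("ascii"),
        },
    }


def image_block_from_url(url):
    """Reference a remotely hosted image by URL."""
    return {"type": "image", "source": {"type": "url", "url": url}}


def user_message_with_image(image_block, prompt):
    """Pair one image block with a text question in a single user turn."""
    return {"role": "user", "content": [image_block, {"type": "text", "text": prompt}]}


# The first four bytes of a JPEG file, used here as a stand-in for real image data.
jpeg_magic = b"\xff\xd8\xff\xe0"
message = user_message_with_image(
    image_block_from_bytes(jpeg_magic), "What's in this image?"
)
print(message["content"][0]["source"]["data"])
```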

Cross-model routing

This endpoint accepts any ByteSpike catalog model in the model field — the gateway translates the request to each model’s native protocol transparently. Pick whatever fits your latency / cost / capability mix:
{"model": "claude-opus-4-8", "messages": [...]}
{"model": "gpt-5-4", "messages": [...]}
{"model": "gemini-3-1-pro", "messages": [...]}
{"model": "deepseek-v4-pro", "messages": [...]}
{"model": "doubao-seed-2-0-pro", "messages": [...]}
Caveats:
  • Model-specific features that don’t translate (e.g. OpenAI response_format: {"type": "json_schema"}) need the matching protocol endpoint.
  • stop_reason and usage fields are normalised back to Anthropic shape regardless of the model.
Full model list: GET /v1/models. Pricing per model: bytespike.ai/pricing.

Cache control

cache_control blocks work identically to the Anthropic Messages spec. Costs are billed at the discounted cache-read rate when a hit occurs; the rate is visible in the pricing table under “cache read”.
{
  "model": "claude-sonnet-4-6",
  "system": [
    {
      "type": "text",
      "text": "<long static system prompt>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [...]
}
The usage.cache_read_input_tokens and usage.cache_creation_input_tokens fields in the response report hits and writes respectively.
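As an illustration, the cached system prompt and the hit/write check could be wrapped like this. The helper names are hypothetical and the token counts in the sample usage object are made up for the example, not measured values.

```python
def cached_system(prompt_text):
    """Build the array form of `system` with an ephemeral cache_control marker."""
    return [
        {
            "type": "text",
            "text": prompt_text,
            "cache_control": {"type": "ephemeral"},
        }
    ]


def cache_summary(usage):
    """Report cache activity from a usage object; absent fields count as zero."""
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    return {"hit": read > 0, "read_tokens": read, "written_tokens": written}


system = cached_system("<long static system prompt>")

# Illustrative usage objects: a first call that writes the cache, then a later hit.
first_call = {"input_tokens": 12, "output_tokens": 40, "cache_creation_input_tokens": 2048}
second_call = {"input_tokens": 12, "output_tokens": 38, "cache_read_input_tokens": 2048}

print(cache_summary(first_call))
print(cache_summary(second_call))
```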

Rate limiting & quota headers

| Header | Notes |
| --- | --- |
| x-ratelimit-limit-requests | Requests/min cap for your tier. |
| x-ratelimit-remaining-requests | Remaining in the current window. |
| x-ratelimit-reset-requests | Seconds until the bucket refills. |
| x-ratelimit-limit-tokens | Tokens/min cap. |
| x-ratelimit-remaining-tokens | Tokens remaining in this window. |
On a 429, inspect the x-ratelimit-reset-* header to know when to retry.
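A retry delay could be derived from these headers along the following lines. This is only a sketch: the fallback delay and the cap are arbitrary assumptions, and waiting for the longest reported reset is a conservative choice made here because a 429 alone does not say which bucket was exhausted.

```python
def retry_delay_seconds(headers, default=1.0, cap=60.0):
    """Pick a wait time after a 429 from the x-ratelimit-reset-* headers."""
    delays = []
    for name, value in headers.items():
        # Matches only the per-bucket headers (note the trailing hyphen), which
        # hold seconds; X-RateLimit-Reset holds a Unix timestamp and is skipped.
        if name.lower().startswith("x-ratelimit-reset-"):
            try:
                delays.append(float(value))
            except (TypeError, ValueError):
                continue  # ignore values that are not plain numbers
    if not delays:
        return default  # assumption: fall back to a short fixed delay
    return min(max(delays), cap)


print(retry_delay_seconds({"x-ratelimit-reset-requests": "12"}))
```

A caller would typically sleep for the returned number of seconds (for example with `time.sleep`) before resending the same request.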

Errors

All non-2xx responses are free — failures don’t bill.
| Status | error.type | Trigger |
| --- | --- | --- |
| 400 | invalid_request_error | Body validation failed (per Anthropic schema). The message tells you which field. |
| 400 | unsupported_model | model slug not in your scope or retired. |
| 400 | unsupported_feature | E.g. image block sent to a text-only model, or tools to a model that can’t tool-use. |
| 401 | authentication_error | Missing / revoked key. |
| 402 | insufficient_credits | Wallet exhausted. Top up at console.bytespike.ai/billing. |
| 403 | permission_error | Scope denied, IP not allowlisted, or model gated. |
| 404 | not_found_error | Path typo (/v1/messages vs /messages) or unknown model id. |
| 429 | rate_limit_error | Tier rate-limit. Backoff per x-ratelimit-reset-*. |
| 5xx | api_error / overloaded_error | Upstream provider issue. Free + automatic retry envelope. |
Body shape:
{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "You have exceeded your requests-per-minute budget."
  }
}
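Client code might map this body shape onto an exception type. The sketch below uses hypothetical names (`GatewayError`, `parse_error`), and its notion of which errors are worth retrying is an assumption based on the table above, not documented gateway behaviour.

```python
import json

# Assumption: these error types are transient and worth retrying.
RETRYABLE_TYPES = {"rate_limit_error", "api_error", "overloaded_error"}


class GatewayError(Exception):
    """Typed view of the gateway's error body (hypothetical helper)."""

    def __init__(self, status, error_type, message):
        super().__init__(f"{status} {error_type}: {message}")
        self.status = status
        self.error_type = error_type
        self.message = message

    @property
    def retryable(self):
        return self.status >= 500 or self.error_type in RETRYABLE_TYPES


def parse_error(status, body_text):
    """Build a GatewayError from a non-2xx status and raw body, tolerating non-JSON."""
    try:
        payload = json.loads(body_text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        payload = None
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        error = {}
    return GatewayError(
        status,
        error.get("type", "unknown"),
        error.get("message", body_text),
    )


body = json.dumps(
    {
        "type": "error",
        "error": {
            "type": "rate_limit_error",
            "message": "You have exceeded your requests-per-minute budget.",
        },
    }
)
err = parse_error(429, body)
print(err.error_type, err.retryable)
```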