Vendor: Anthropic
Model ID: claude-haiku-4-5
Capability: 200K context · tool use · vision · prompt caching · streaming
Pricing: per-token, Haiku tier (live rate)

Haiku 4.5 is the model you reach for when the plan is to make a lot of LLM calls: agent loops, tool-heavy workflows, sub-LLM judges, embedding pipelines that need a quick rewrite step. It’s not the model you ship when one shot has to be perfect; for that, use Sonnet or Opus. But its latency floor is low enough that you can chain four or five Haiku calls in the time Sonnet takes for one, and the quality holds for routine classification, extraction, and routing tasks.

Request

curl https://llm.bytespike.ai/v1/messages \
  -H "x-api-key: $BYTESPIKE_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-haiku-4-5",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Classify this support ticket: My order is late."}
    ]
  }'

Body parameters

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| model | string | yes | | claude-haiku-4-5 |
| messages | array | yes | | Conversation history. |
| max_tokens | integer | yes | | Hard cap on response length. Max for this model: 8192. |
| system | string or array | no | | System prompt. Array form supports cache_control. |
| temperature | number | no | 1.0 | Range 0.0–1.0. |
| top_p | number | no | 1.0 | Nucleus sampling. |
| tools | array | no | | Supported. |
| tool_choice | object | no | {"type":"auto"} | auto / any / tool (named). |
| stream | boolean | no | false | SSE streaming. |
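
The limits in the parameters above can be checked client-side before a request is sent, which avoids a round trip that would end in a 400. The following is a minimal sketch, not part of any ByteSpike SDK: `build_body` and `MAX_OUTPUT_TOKENS` are hypothetical names, and the checks cover only the ranges stated on this page.

```python
MAX_OUTPUT_TOKENS = 8192  # max_tokens ceiling stated for this model on this page


def build_body(messages, max_tokens, model="claude-haiku-4-5", **optional):
    """Return a request body dict, raising ValueError on out-of-range fields.

    Hypothetical helper: it only enforces the limits documented above.
    """
    if not messages:
        raise ValueError("messages must be a non-empty list")
    if not 1 <= max_tokens <= MAX_OUTPUT_TOKENS:
        raise ValueError(f"max_tokens must be between 1 and {MAX_OUTPUT_TOKENS}")

    temperature = optional.get("temperature")
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be within 0.0 and 1.0")

    body = {"model": model, "max_tokens": max_tokens, "messages": messages}
    # Forward only the optional fields the caller actually set
    # (system, top_p, tools, tool_choice, stream, ...).
    body.update({k: v for k, v in optional.items() if v is not None})
    return body
```

For example, `build_body([{"role": "user", "content": "hi"}], 256, temperature=0.2)` returns a dict ready to be JSON-encoded, while passing `max_tokens=9000` raises `ValueError`.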

Response

{
  "id": "msg_haiku_…",
  "type": "message",
  "role": "assistant",
  "model": "claude-haiku-4-5",
  "content": [
    {"type": "text", "text": "Logistics — delivery delay."}
  ],
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 18,
    "output_tokens": 6
  }
}

Response fields

| Field | Type | Notes |
| --- | --- | --- |
| id | string | ByteSpike-issued message ID. |
| model | string | Echoes the request model. |
| content | array | Text in {"type": "text"}; tool calls in {"type": "tool_use"}. |
| stop_reason | string | end_turn / max_tokens / tool_use / stop_sequence. |
| usage.input_tokens | integer | Prompt tokens billed. |
| usage.output_tokens | integer | Generated tokens billed. |
| usage.cache_read_input_tokens | integer | Present when a cache_control block hits. |
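
Because `content` is an array of typed blocks rather than a single string, callers usually need a small amount of unpacking. The sketch below shows one way to do that with the fields listed above; `summarize_response` is a hypothetical helper name, and it assumes the response has already been decoded from JSON into a dict.

```python
def summarize_response(message):
    """Split a decoded response dict into text, tool calls, and token counts."""
    blocks = message.get("content", [])
    usage = message.get("usage", {})
    return {
        # Join every text block; a response may contain more than one.
        "text": "".join(b.get("text", "") for b in blocks if b.get("type") == "text"),
        # Tool calls are passed through untouched for the caller to dispatch.
        "tool_calls": [b for b in blocks if b.get("type") == "tool_use"],
        # stop_reason "max_tokens" means the output hit the max_tokens cap.
        "truncated": message.get("stop_reason") == "max_tokens",
        "total_tokens": usage.get("input_tokens", 0) + usage.get("output_tokens", 0),
        # Only present on a cache hit, so default to 0.
        "cache_read_tokens": usage.get("cache_read_input_tokens", 0),
    }
```

Checking `truncated` before trusting the text is worthwhile on classification tasks, where a cut-off label can look like a valid one.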

Code examples

curl https://llm.bytespike.ai/v1/messages \
  -H "x-api-key: $BYTESPIKE_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-haiku-4-5",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Classify this ticket: My order is late."}]
  }'
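
The same request can be issued from Python using only the standard library. This is a rough equivalent of the curl call above, not an official client: `build_request` and `send` are hypothetical helper names, and the 30-second timeout is an arbitrary choice.

```python
import json
import urllib.request

API_URL = "https://llm.bytespike.ai/v1/messages"


def build_request(prompt, api_key, max_tokens=1024):
    """Build (but do not send) a POST request mirroring the curl example."""
    body = {
        "model": "claude-haiku-4-5",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With `BYTESPIKE_API_KEY` exported, `send(build_request("Classify this ticket: My order is late.", os.environ["BYTESPIKE_API_KEY"]))` (after `import os`) should return a dict shaped like the response shown earlier. Keeping construction and sending separate makes the request easy to inspect without touching the network.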

Streaming

Set "stream": true. The response arrives as server-sent events (SSE) in the standard Anthropic format. Estimated credits ship in the HTTP response before the first SSE event, so you can short-circuit a long completion before paying for it.
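
On the client side, consuming the stream mostly means picking the text deltas out of the `data:` lines. The sketch below assumes the standard Anthropic event shapes (`content_block_delta` carrying a `text_delta`, and `message_stop` as the terminator) and takes an iterable of already-decoded lines, so it is independent of the HTTP library in use. `collect_text` is a hypothetical helper name.

```python
import json


def collect_text(sse_lines):
    """Concatenate text deltas from an iterable of decoded SSE lines."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip "event:" lines and blank event separators
        payload = json.loads(line[len("data:"):])
        kind = payload.get("type")
        if kind == "content_block_delta":
            delta = payload.get("delta", {})
            # Tool-use streams send other delta types; only text is kept here.
            if delta.get("type") == "text_delta":
                parts.append(delta.get("text", ""))
        elif kind == "message_stop":
            break
    return "".join(parts)
```

A real consumer would typically yield each delta as it arrives rather than joining at the end; the accumulate-then-return form is used here only to keep the example short.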

Cache control

cache_control blocks reduce cost on repeated prompts. Cache reads are billed at the discounted rate listed in the pricing table under “cache read”. Caching is cost-effective on Haiku for retrieval-heavy agent loops where the system prompt and tool definitions are stable across calls.
{
  "model": "claude-haiku-4-5",
  "system": [
    {
      "type": "text",
      "text": "<long static system prompt>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [...]
}
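
In an agent loop, the cacheable system block is usually built once and reused on every call, with the usage counters checked to confirm the cache is actually being hit. A minimal sketch of both halves follows; `cached_system` and `cache_hit` are hypothetical helper names, and the hit check relies only on the `usage.cache_read_input_tokens` field described under Response fields.

```python
def cached_system(prompt_text):
    """Wrap a static system prompt in the array form with cache_control set."""
    return [
        {
            "type": "text",
            "text": prompt_text,
            "cache_control": {"type": "ephemeral"},
        }
    ]


def cache_hit(usage):
    """Return True when the response's usage block reports cached input tokens."""
    return usage.get("cache_read_input_tokens", 0) > 0
```

The first call with a given prompt is not expected to report a hit; if later calls with an unchanged system prompt still return `False`, the prompt text is probably changing between calls (for example, an embedded timestamp).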

Errors

| Code | Trigger | Billed? |
| --- | --- | --- |
| 400 | Body validation failed | No |
| 401 | Missing / revoked key | No |
| 402 | Wallet exhausted | No |
| 403 | Scope denied / IP not allowlisted | No |
| 422 | Param not supported (rare on Haiku) | No |
| 429 | Rate-limited | No |
| 5xx | Upstream provider issue | No (auto-retry envelope) |
See Error Handling for the full enum.
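
The status codes above split naturally into errors worth retrying and errors that need a fix first. The sketch below is one possible client-side policy, not documented ByteSpike behavior: it assumes 429 and 5xx are transient and everything else in the list (bad body, bad key, empty wallet, denied scope, unsupported param) will fail again unchanged. `should_retry` and `backoff_delay` are hypothetical names, and the base and cap values are arbitrary.

```python
import random


def should_retry(status):
    """Treat rate limits and upstream failures as transient (an assumption)."""
    return status == 429 or 500 <= status <= 599


def backoff_delay(attempt, base=0.5, cap=8.0):
    """Exponential backoff with full jitter; attempt starts at 0.

    Returns a random delay in seconds between 0 and min(cap, base * 2**attempt).
    """
    upper = min(cap, base * (2 ** attempt))
    return random.uniform(0, upper)
```

A caller would loop while `should_retry(status)` holds, sleeping for `backoff_delay(attempt)` between tries and giving up after a fixed number of attempts. Since 5xx responses are already marked as covered by an auto-retry envelope, a client may prefer a low attempt limit for those.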

When to use

  • Production agent loops where you make 3+ LLM calls per user action.
  • Routing / triage / classification ahead of a heavier model.
  • Embedding pipelines that need a quick rewrite or cleanup step.
  • For one-shot quality where latency is secondary, see Sonnet 4.6.
  • For long-context reasoning, see Opus 4.7.

Limits

| Limit | Value |
| --- | --- |
| Context window | 200K tokens |
| Max output | 8192 tokens |
| Supports tool use | Yes |
| Supports vision | Yes |
| Supports streaming | Yes |
| Supports prompt caching | Yes |