Claude Haiku 4.5 - ByteSpike

Vendor: Anthropic
Model ID: claude-haiku-4-5
Capability: 200K context · tool use · vision · prompt caching · streaming
Pricing: per-token, Haiku tier (live rate)

Haiku 4.5 is the model you reach for when the plan is to make a lot of LLM calls: agent loops, tool-heavy workflows, sub-LLM judges, embedding pipelines that need a quick rewrite step. It's not the model you ship when one shot has to be perfect; for that, use Sonnet or Opus. But its latency floor is low enough that you can chain four or five Haiku calls in the time Sonnet takes for one, and the quality holds for routine classification, extraction, and routing tasks.

Request

curl https://llm.bytespike.ai/v1/messages \
  -H "x-api-key: $BYTESPIKE_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-haiku-4-5",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Classify this support ticket: My order is late."}
    ]
  }'

Body parameters

Field	Type	Required	Default	Notes
`model`	string	yes	—	`claude-haiku-4-5`
`messages`	array	yes	—	Conversation history.
`max_tokens`	integer	yes	—	Hard cap on response length. Max for this model: 8192.
`system`	string | array	no	—	System prompt. Array form supports `cache_control`.
`temperature`	number	no	1.0	Range 0.0–1.0.
`top_p`	number	no	1.0	Nucleus sampling.
`tools`	array	no	—	Supported.
`tool_choice`	object	no	`{"type":"auto"}`	`auto` / `any` / `tool` (named).
`stream`	boolean	no	false	SSE streaming.
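
The constraints in the table can be checked client-side before a request is sent. The sketch below is one way to do that, not the gateway's own validation: `validate_body` and `MAX_OUTPUT_TOKENS` are hypothetical names, and the checks cover only the required fields, the `max_tokens` cap, and the `temperature` range listed above.

```python
MAX_OUTPUT_TOKENS = 8192  # "Max for this model" from the table above


def validate_body(body: dict) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []

    # model, messages and max_tokens are the three required fields.
    for field in ("model", "messages", "max_tokens"):
        if field not in body:
            problems.append(f"missing required field: {field}")

    max_tokens = body.get("max_tokens")
    if max_tokens is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(max_tokens, bool) or not isinstance(max_tokens, int) or max_tokens < 1:
            problems.append("max_tokens must be a positive integer")
        elif max_tokens > MAX_OUTPUT_TOKENS:
            problems.append(f"max_tokens exceeds {MAX_OUTPUT_TOKENS}")

    temperature = body.get("temperature")
    if temperature is not None:
        if not isinstance(temperature, (int, float)) or not 0.0 <= temperature <= 1.0:
            problems.append("temperature must be a number in the range 0.0-1.0")

    return problems
```

A body that fails here would most likely come back as a 400, so the pre-check mainly saves a round trip.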

Response

{
  "id": "msg_haiku_…",
  "type": "message",
  "role": "assistant",
  "model": "claude-haiku-4-5",
  "content": [
    {"type": "text", "text": "Logistics — delivery delay."}
  ],
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 18,
    "output_tokens": 6
  }
}

Response fields

Field	Type	Notes
`id`	string	ByteSpike-issued message ID.
`model`	string	Echoes the request `model`.
`content`	array	Text in `{"type": "text"}`; tool calls in `{"type": "tool_use"}`.
`stop_reason`	string	`end_turn` / `max_tokens` / `tool_use` / `stop_sequence`.
`usage.input_tokens`	integer	Prompt tokens billed.
`usage.output_tokens`	integer	Generated tokens billed.
`usage.cache_read_input_tokens`	integer	Present when a `cache_control` block hits.
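
A minimal sketch of reading these fields from a decoded response body, assuming the JSON shape shown above. `summarize_response` is a hypothetical helper name, and the keys of the returned dict are illustrative rather than part of the API.

```python
def summarize_response(resp: dict) -> dict:
    """Pull text, tool calls and token counts out of a decoded response."""
    text_parts = []
    tool_calls = []
    for block in resp.get("content", []):
        if block.get("type") == "text":
            text_parts.append(block.get("text", ""))
        elif block.get("type") == "tool_use":
            tool_calls.append(block)

    usage = resp.get("usage", {})
    return {
        "text": "".join(text_parts),
        "tool_calls": tool_calls,
        # stop_reason "max_tokens" means the reply was cut off at the cap.
        "truncated": resp.get("stop_reason") == "max_tokens",
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        # Only present on a cache hit, so default to 0.
        "cache_read_input_tokens": usage.get("cache_read_input_tokens") or 0,
    }
```

Checking `truncated` before trusting the text is useful in classification loops, where a clipped label can look like a valid one.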

Code examples

curl https://llm.bytespike.ai/v1/messages \
  -H "x-api-key: $BYTESPIKE_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-haiku-4-5",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Classify this ticket: My order is late."}]
  }'
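
The same request can be sketched in Python using only the standard library. The endpoint, headers, and body mirror the curl call above; `build_request` and `send` are hypothetical helper names, and the timeout is an arbitrary choice.

```python
import json
import urllib.request

API_URL = "https://llm.bytespike.ai/v1/messages"


def build_request(prompt: str, api_key: str, max_tokens: int = 1024) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example."""
    body = {
        "model": "claude-haiku-4-5",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (performs a real, billed request):
#   import os
#   req = build_request("Classify this ticket: My order is late.",
#                       os.environ["BYTESPIKE_API_KEY"])
#   print(send(req)["content"])
```

Splitting construction from sending keeps the request inspectable in tests without touching the network.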

Streaming

Set "stream": true. The response is a server-sent events (SSE) stream in the standard Anthropic format. Estimated credits ship in the HTTP response before the first SSE event, so you can short-circuit a long completion before paying for it.
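
A minimal sketch of consuming the stream, assuming the standard Anthropic event shapes (`content_block_delta` events carrying a `text_delta`, ending with `message_stop`) and one JSON object per `data:` line. `iter_sse_data` and `collect_text` are hypothetical helpers that take any iterable of decoded lines. The estimated-credits value is not handled, because its field name is not given on this page.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each `data:` line."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload:
                yield json.loads(payload)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate text deltas until the stream signals message_stop."""
    parts = []
    for event in iter_sse_data(lines):
        kind = event.get("type")
        if kind == "content_block_delta":
            delta = event.get("delta", {})
            if delta.get("type") == "text_delta":
                parts.append(delta.get("text", ""))
        elif kind == "message_stop":
            break
    return "".join(parts)
```

Because `collect_text` pulls from an iterator, breaking out of the loop early (for example once a classification label is complete) stops reading the rest of the stream.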

Cache control

cache_control blocks reduce cost on repeated prompts. Cache reads are billed at the discounted rate listed in the pricing table under "cache read". This is cost-effective on Haiku for retrieval-heavy agent loops where the system prompt and tool definitions are stable across calls.

{
  "model": "claude-haiku-4-5",
  "system": [
    {
      "type": "text",
      "text": "<long static system prompt>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [...]
}
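
A short sketch of attaching a cached system prompt to an existing request body and checking whether a later call hit the cache. `with_cached_system` and `cache_hit` are hypothetical helper names; the block shape and the `usage.cache_read_input_tokens` field are taken from this page.

```python
def with_cached_system(body: dict, system_text: str) -> dict:
    """Return a copy of body whose system prompt is one cacheable text block."""
    cached = dict(body)  # shallow copy, so the caller's dict is not mutated
    cached["system"] = [
        {
            "type": "text",
            "text": system_text,
            "cache_control": {"type": "ephemeral"},
        }
    ]
    return cached


def cache_hit(resp: dict) -> bool:
    """True when the response reports tokens read from the cache."""
    usage = resp.get("usage", {})
    return (usage.get("cache_read_input_tokens") or 0) > 0
```

In an agent loop, logging `cache_hit` per call is a quick way to confirm that the system prompt really is byte-stable between requests.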

Errors

Code	Trigger	Billed?
400	Body validation failed	No
401	Missing / revoked key	No
402	Wallet exhausted	No
403	Scope denied / IP not allowlisted	No
422	Param not supported (rare on Haiku)	No
429	Rate-limited	No
5xx	Upstream provider issue	No (auto-retry envelope)

See Error Handling for the full enum.
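
One way a client might act on these codes is sketched below. The retry policy is an assumption, not documented gateway behavior: it retries 429 and 5xx with jittered exponential backoff and treats every other code in the table as a caller-side problem to fix first. `is_retryable` and `backoff_delay` are hypothetical names, and the base and cap values are arbitrary.

```python
import random


def is_retryable(status: int) -> bool:
    """429 and 5xx may succeed on a later attempt; the other 4xx codes will not."""
    return status == 429 or 500 <= status <= 599


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Since the table lists 5xx as covered by an auto-retry envelope, a client-side retry on 5xx may be redundant; keeping the attempt count small limits the overlap.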

When to use

Production agent loops where you make 3+ LLM calls per user action.
Routing / triage / classification ahead of a heavier model.
Embedding pipelines that need a quick rewrite or cleanup step.
For one-shot quality where latency is secondary, see Sonnet 4.6.
For long-context reasoning, see Opus 4.7.

Limits

Limit	Value
Context window	200K tokens
Max output	8192 tokens
Supports tool use	Yes
Supports vision	Yes
Supports streaming	Yes
Supports prompt caching	Yes