claude-haiku-4-5
Capability: 200K context · tool use · vision · prompt caching · streaming
Pricing: per-token, Haiku tier (live rate)
Haiku 4.5 is the model you reach for when the plan is to make a lot of
LLM calls — agent loops, tool-heavy workflows, sub-LLM judges, embeddings
pipelines that need a quick rewrite step. It’s not the model you ship
when one shot has to be perfect; for that use Sonnet or Opus. But its
latency floor is low enough that you can chain four or five Haiku calls
in the time Sonnet takes for one, and the quality holds for routine
classification, extraction, and routing tasks.
Request
Body parameters
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
model | string | yes | — | claude-haiku-4-5 |
messages | array | yes | — | Conversation history. |
max_tokens | integer | yes | — | Hard cap on response length. Max for this model: 8192. |
system | string | array | no | — | System prompt. Array form supports cache_control. |
temperature | number | no | 1.0 | Range 0.0–1.0. |
top_p | number | no | 1.0 | Nucleus sampling. |
tools | array | no | — | Supported. |
tool_choice | object | no | {"type":"auto"} | auto / any / tool (named). |
stream | boolean | no | false | SSE streaming. |
Response
Response fields
| Field | Type | Notes |
|---|---|---|
id | string | ByteSpike-issued message ID. |
model | string | Echoes the request model. |
content | array | Text in {"type": "text"}; tool calls in {"type": "tool_use"}. |
stop_reason | string | end_turn / max_tokens / tool_use / stop_sequence. |
usage.input_tokens | integer | Prompt tokens billed. |
usage.output_tokens | integer | Generated tokens billed. |
usage.cache_read_input_tokens | integer | Present when a cache_control block hits. |
Code examples
Streaming
Set"stream": true. Response is SSE in the standard Anthropic format.
Estimated credits ship in the HTTP response before the first SSE event,
so you can short-circuit a long completion before paying for it.
Cache control
cache_control blocks reduce cost on repeated prompts. Cache reads at
the discounted rate visible in the
pricing table under “cache read”.
Cost-effective on Haiku for retrieval-heavy agent loops where the
system prompt and tool definitions are stable across calls.
Errors
| Code | Trigger | Billed? |
|---|---|---|
| 400 | Body validation failed | No |
| 401 | Missing / revoked key | No |
| 402 | Wallet exhausted | No |
| 403 | Scope denied / IP not allowlisted | No |
| 422 | Param not supported (rare on Haiku) | No |
| 429 | Rate-limited | No |
| 5xx | Upstream provider issue | No (auto-retry envelope) |
When to use
- Production agent loops where you make 3+ LLM calls per user action.
- Routing / triage / classification ahead of a heavier model.
- Embedding pipelines that need a quick rewrite or cleanup step.
- For one-shot quality where latency is secondary, see Sonnet 4.6.
- For long-context reasoning, see Opus 4.7.
Limits
| Limit | Value |
|---|---|
| Context window | 200K tokens |
| Max output | 8192 tokens |
| Supports tool use | Yes |
| Supports vision | Yes |
| Supports streaming | Yes |
| Supports prompt caching | Yes |