veo3.1 is Google’s Veo 3.1 model. Same two-phase task-based protocol as the other video models, with one differentiator worth knowing about: native audio generation alongside the video track. The same submit → poll flow produces an MP4 with an audio layer the model invented to match the scene — useful for one-shot deliverables that won’t get a separate sound-design pass.
Pricing: $0.40 per second of generated footage (see the rate card). Billing is based on the length of the generated clip, failed tasks are not billed, and audio does not add a separate line item on this tier.
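For budgeting, the per-second rate multiplies directly by clip length. The sketch below is illustrative only: `estimate_cost` is a hypothetical helper, and the rate is hard-coded from the figure quoted here, so treat the rate card as the source of truth.

```python
from decimal import Decimal

# Rate quoted on this page; hard-coded here as an assumption for illustration.
RATE_PER_SECOND = Decimal("0.40")


def estimate_cost(duration_seconds: int, clips: int = 1) -> Decimal:
    """Estimate the USD charge for `clips` successful generations.

    Failed tasks are not billed, so count only clips expected to succeed.
    """
    if duration_seconds <= 0 or clips < 0:
        raise ValueError("duration_seconds must be positive and clips non-negative")
    return RATE_PER_SECOND * duration_seconds * clips


print(estimate_cost(5))      # 2.00
print(estimate_cost(10, 3))  # 12.00
```

`Decimal` is used instead of `float` so the totals stay exact to the cent.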
Protocols
| Protocol | Path | Purpose |
|---|---|---|
| OpenAI Video — submit | POST https://llm.bytespike.ai/v1/videos/generations | enqueues; returns task_id |
| OpenAI Video — poll | GET https://llm.bytespike.ai/v1/videos/tasks/{task_id} | returns status, result_url, and audio_url when ready |
Quickstart
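A minimal sketch of the submit → poll flow using only the Python standard library. The two endpoints and the `task_id`, `status`, `result_url`, and `audio_url` fields come from the protocol table above; everything else is an assumption made for illustration: the Bearer-token header, the `prompt` field name, the `1280x720` size string, and the specific status values. Check them against the API reference before relying on this.

```python
import json
import time
import urllib.request

BASE_URL = "https://llm.bytespike.ai/v1/videos"


def http_json(method, url, api_key, body=None):
    """Send one JSON request and decode the JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Authorization", f"Bearer {api_key}")  # assumed auth scheme
    if data is not None:
        req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def generate_video(prompt, api_key, *, duration_seconds=5, size="1280x720",
                   audio=True, poll_interval=5.0, max_polls=120,
                   call=http_json, sleep=time.sleep):
    """Submit a generation task, then poll until it reaches a terminal status."""
    payload = {
        "model": "veo3.1",
        "prompt": prompt,                      # assumed field name
        "duration_seconds": duration_seconds,
        "size": size,                          # assumed wire format for 1280×720
        "audio": audio,
    }
    task = call("POST", f"{BASE_URL}/generations", api_key, payload)
    task_id = task["task_id"]

    for _ in range(max_polls):
        state = call("GET", f"{BASE_URL}/tasks/{task_id}", api_key)
        status = state.get("status")
        if status in ("succeeded", "completed"):   # assumed success values
            return state
        if status in ("failed", "cancelled"):      # assumed failure values
            raise RuntimeError(f"task {task_id} ended with status {status!r}")
        sleep(poll_interval)
    raise TimeoutError(f"task {task_id} still pending after {max_polls} polls")


# Example call (performs real network requests, so it is left commented out):
# result = generate_video("Rain on a neon-lit street at night", "YOUR_API_KEY")
# print(result["result_url"], result.get("audio_url"))
```

The `call` and `sleep` parameters exist only so the loop can be exercised without network access; in normal use the defaults are fine.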
Capabilities
| Capability | Supported |
|---|---|
| Text-to-video | ✅ |
| Image-to-video (with source_image) | ✅ |
| Native audio generation | ✅ (set audio: true) |
| duration_seconds 5 / 10 | ✅ |
| size 1280×720 / 1920×1080 | ✅ |
| Modality | video |
| Capability bucket | video_generate |
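The supported values in the table can be checked client-side before a task is submitted, which avoids a round trip for requests that would be rejected anyway. This is a sketch under assumptions: `build_request` is a hypothetical helper, the `prompt` field name and the `1280x720` / `1920x1080` size strings are guesses at the wire format, and the expected form of `source_image` (URL or encoded data) is not specified on this page.

```python
SUPPORTED_DURATIONS = (5, 10)
SUPPORTED_SIZES = ("1280x720", "1920x1080")  # assumed wire format for the sizes above


def build_request(prompt, *, duration_seconds=5, size="1280x720",
                  audio=False, source_image=None):
    """Return a request body, rejecting options outside the capability table."""
    if duration_seconds not in SUPPORTED_DURATIONS:
        raise ValueError(
            f"duration_seconds must be one of {SUPPORTED_DURATIONS}, got {duration_seconds!r}"
        )
    if size not in SUPPORTED_SIZES:
        raise ValueError(f"size must be one of {SUPPORTED_SIZES}, got {size!r}")

    body = {
        "model": "veo3.1",
        "prompt": prompt,              # assumed field name
        "duration_seconds": duration_seconds,
        "size": size,
        "audio": bool(audio),          # native audio is opt-in via audio: true
    }
    if source_image is not None:
        # Image-to-video: pass the value through unchanged, since its
        # expected format is not documented here.
        body["source_image"] = source_image
    return body


print(build_request("Wind over a wheat field", duration_seconds=10, audio=True))
```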
When to use
- One-shot deliverable — clip is the final output, no sound-design pass coming.
- Ambient / atmospheric footage — rain, wind, city noise, where Veo’s native audio is more authentic than dubbing-over-silent footage.
- Alternative to Sora, when Sora’s particular motion style isn’t the right fit and Google’s render feels closer to your brand.
When not to use
- You already have your own sound design: audio is a small premium that’s wasted in that flow, so drop to veo3.1-fast without audio.
- You need Sora-specific motion characteristics: go to sora2 or sora2-pro.
Next
- veo3.1-fast: cheaper tier
- sora2: OpenAI alternative
- Multimodal endpoints: overview