veo-3
Veo 3 via Vertex AI - Generate videos with audio from text prompts and images
Google's Veo is the largest and most differentiated video generation family available today. Six models spanning three generations, from the original Veo 2 through the latest Veo 3.1 variants - each with distinct price points, speed profiles, and capabilities. It's a lot to navigate. I've spent months running all of them through inference.sh and have opinions about when each one earns its cost and when you're better off with a cheaper tier.
Let me be direct about the state of AI video before we get into specifics. It's still early. Every model in this family - and every competitor - shares fundamental limitations. Temporal coherence breaks down. Physics gets weird. Hands remain cursed. These are architectural constraints of current diffusion-based approaches, not bugs that a faster GPU will fix. The question isn't whether Veo produces perfect video. It doesn't. The question is whether it produces useful video reliably enough for professional workflows. For specific applications - short-form content, concept visualization, B-roll, motion prototyping - the answer is increasingly yes.
the generational leap from veo 2 to veo 3
Veo 2 launched in December 2024 as Google's first widely available video generation model. It produces realistic clips from text or image prompts. No audio. Just prompts in, video out. By today's standards it feels limited, but the visual quality holds up surprisingly well for simple compositions - landscapes, product shots, anything without complex human motion.
The jump to Veo 3, released in May 2025, introduced the feature that actually matters: native audio synthesis. This isn't a separate model processing the video after the fact. Audio generates in the same forward pass as the visual frames, which means synchronization happens naturally rather than requiring alignment in post. Ocean waves sound like ocean waves at the right moment. Footsteps land when feet hit ground. A door closes and you hear it close.
Veo 3 full quality is actually cheaper than Veo 2 for video-only work, which tells you something about how quickly this field moves - the newer model costs less and does more. The generation times are long though. We're talking minutes, not seconds. For final hero assets that's fine. For iterative prompt development it's painful.
understanding the speed tiers
From Veo 3 onward, each generation ships with variants optimized for different points on the quality-speed curve. The naming is straightforward: "Fast" means shorter generation times with slightly reduced fidelity. "Lite", currently offered only in the 3.1 generation, means even faster and cheaper, suitable for drafts and rapid iteration.
Veo 3 Fast costs a fraction of the full Veo 3 price. The quality difference is real but contextual. On simple scenes - a single subject, smooth camera movement, natural lighting - the fast variant produces results that most viewers couldn't distinguish from the full model without a direct comparison. On complex multi-element scenes with lots of motion, the full model shows its advantage in cleaner edges and more coherent physics.
My workflow recommendation is simple: use the fast tier for everything during exploration. You might generate fifteen or twenty variations before finding the right prompt language and camera direction. Paying full price for each iteration is wasteful. Once you've locked in a prompt that produces what you want, switch to the full quality tier for the final render. Because drafts outnumber final renders many times over, this approach typically cuts total generation spend on a project by more than half.
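To make the arithmetic concrete, here is a minimal sketch of the draft-then-final budget. The per-clip prices are placeholders chosen for illustration, not actual Veo pricing, and the function names are hypothetical. Substitute the current rates for whichever tiers you use.

```python
def project_cost(drafts: int, draft_price: float, final_price: float,
                 finals: int = 1) -> float:
    """Total spend when drafts run on a cheap tier and finals on the full tier."""
    return drafts * draft_price + finals * final_price


def savings_fraction(drafts: int, draft_price: float, final_price: float,
                     finals: int = 1) -> float:
    """Share of spend saved versus running every generation on the full tier."""
    all_full = (drafts + finals) * final_price
    tiered = project_cost(drafts, draft_price, final_price, finals)
    return 1 - tiered / all_full


# Placeholder prices (NOT real Veo pricing): drafts at 1.0 per clip,
# final render at 4.0 per clip, 19 drafts before the one final.
print(f"{savings_fraction(drafts=19, draft_price=1.0, final_price=4.0):.0%}")  # 71%
```

The saving grows with the number of drafts and with the price gap between tiers. With only two or three drafts, the difference is much smaller.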
veo 3.1 - the current generation
Veo 3.1, announced in October 2025, represents the latest evolution. Three tiers here: full, fast, and lite. The full model adds scene extension for chaining clips into longer narratives, frames-to-video transitions between two images, 4K upscaling, and improved temporal coherence over Veo 3. Pricing scales with resolution and whether audio is enabled, with 4K and audio-enabled options at the premium end.
Veo 3.1 Fast splits the difference between the full model and the lite tier. The lower-resolution video-only option is the sweet spot for most production workflows. Audio-enabled output remains cheaper than standard Veo 3 with audio.
Then there's Veo 3.1 Lite, which became available on the Gemini API in March 2026. It's the cheapest option in the current generation. The lite tier is positioned for high-volume applications where you need thousands of clips and can tolerate some quality variance. Social media automation, thumbnail generation, content farm workflows - anywhere that volume matters more than per-clip perfection.
audio changes the creative equation
I keep coming back to the audio capability because it genuinely shifts what's possible. Before Veo 3, generating a video clip meant either shipping it silent, finding stock audio to match, or running a separate audio generation model and manually aligning the results. Each approach has problems. Silent clips feel incomplete. Stock audio rarely matches exactly. Manual alignment is tedious and imprecise.
With native audio synthesis, you describe a scene and get both the visual and the sonic environment in one generation. The quality of environmental audio - rain, wind, machinery, traffic, nature sounds - is consistently convincing. Ambient soundscapes work well. Synchronized sound effects for specific actions (impacts, doors, footsteps) are surprisingly accurate in their timing.
Where it falls short: dialogue and music. Synthesized speech is intelligible but lands in uncanny territory. Music generation is basic - it can produce generic atmospheric scoring but nothing you'd mistake for composed work. These limitations matter less than you'd think for most practical applications. The majority of short-form video content relies on environmental audio and simple sound design rather than complex dialogue or original music.
The cost premium for audio varies by tier but is significant across the board. Whether that premium is worth paying depends entirely on your downstream workflow. If the clip will be re-scored with licensed music anyway, skip audio and save the money. If it's going straight to social media as a self-contained piece, the audio generation eliminates a production step that would cost more in time than the added expense.
resolution choices and when they matter
Every current Veo model supports 720p, 1080p, and some form of higher resolution. The practical guidance is less complicated than the pricing grids suggest.
For prompt development and iteration: 720p. It renders faster, costs the same or less, and shows you everything you need to evaluate whether a prompt is working. You're not judging pixel quality at this stage - you're judging composition, motion, timing, and whether the model understood your intent.
For final output destined for screens: 1080p. It's the native resolution of most web video players, social media platforms, and presentation displays. The price is comparable to 720p on most Veo tiers, which makes the 720p option primarily a speed optimization rather than a cost one.
For broadcast, large-format display, or maximum flexibility in post: 4K. The price jump is significant. Reserve this for final renders where you've already confirmed the prompt produces what you want. The extra resolution also gives you room to crop and reframe in editing without dropping below 1080p output quality, which is genuinely useful for multi-format delivery.
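The guidance above boils down to a small lookup, sketched here along with the crop headroom that 4K buys you. The stage names and resolution labels are illustrative and are not values taken from the API schema. 4K is assumed to mean 3840x2160 UHD.

```python
# Stage names and resolution labels are illustrative, not API values.
RESOLUTION_BY_STAGE = {
    "iteration": "720p",     # prompt development
    "final": "1080p",        # web, social, presentations
    "large_format": "4K",    # broadcast, large displays, reframing in post
}

# Pixel heights for the labels above (4K assumed to be 3840x2160 UHD).
HEIGHT_PX = {"720p": 720, "1080p": 1080, "4K": 2160}


def pick_resolution(stage: str) -> str:
    """Return the resolution suggested for a given workflow stage."""
    try:
        return RESOLUTION_BY_STAGE[stage]
    except KeyError:
        raise ValueError(f"unknown stage: {stage!r}") from None


def crop_headroom(source: str, delivery: str = "1080p") -> float:
    """How far you can punch in before dropping below the delivery height."""
    return HEIGHT_PX[source] / HEIGHT_PX[delivery]


print(pick_resolution("iteration"))  # 720p
print(crop_headroom("4K"))           # 2.0
```

A headroom of 2.0 means a 4K render can be cropped to half its width and height and still deliver a full 1080p frame.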
choosing between models - an honest assessment
Here's how I think about model selection in practice. Veo 2 is the legacy option. It's more expensive than Veo 3 Fast for video-only generation and offers fewer capabilities. The only reason to use it is if you've already built a pipeline around it and don't want to change. For new work, start with Veo 3 or later.
Veo 3 Fast is the workhorse for budget-conscious video generation without audio. It's cheap enough to run at high volume and produces quality that satisfies most social media and web applications. Even with audio enabled, the price stays very reasonable.
Veo 3 full is for hero content where quality justifies patience. Use it when the output will be scrutinized - a header video on a landing page, a key social post, a client deliverable.
Veo 3.1 Fast is the best all-around option for most professional workflows right now. It combines the latest model improvements with practical generation speeds. Its more precise image-to-video handling adds flexibility that the Veo 3 variants can't match.
Veo 3.1 full is the maximum quality option. Frame interpolation produces smoother motion. Reference image handling is more precise. For cinematic work and high-end content, it's worth the premium.
Veo 3.1 Lite is the volume play. When you're generating hundreds or thousands of clips and need the latest generation's capabilities at minimum cost, this is the tier.
what the veo family does well and where it still struggles
Across all tiers and generations, the Veo family excels at naturalistic scenes with smooth camera movement. Landscapes, architecture, atmospheric conditions, slow reveals, orbital shots around objects - these consistently produce cinematic results. Lighting transitions are handled with sophistication. Depth of field effects look physically accurate. Color grading responds well to prompt direction.
The persistent weaknesses span the family too. Human faces in motion degrade past about three seconds. Hands remain unreliable. Text rendered within scenes is garbled more often than not. Fast action sequences with multiple interacting objects produce temporal artifacts. Very long clips (beyond 8 seconds) show increasing coherence drift where the model gradually forgets what it was generating.
These limitations are shared with every video generation model currently available. They're the edges of what diffusion architectures can currently achieve. I mention them not to single out Veo but to set honest expectations. If your use case requires sustained close-up human faces with natural expressions across a 10-second clip, no model reliably delivers that today. Plan your prompts and your projects around what works, not what you wish worked.
the competitive landscape briefly
Seedance handles character consistency better and generates longer clips with native audio. Kling offers free tiers for experimentation. Runway has strong creative community tooling and continues iterating on Gen-4.
What distinguishes Veo from all of them is the combination of native audio, multiple quality tiers at different price points, and the breadth of the family. No competitor offers six models spanning three generations with clear upgrade paths between them. For professional workflows where you need both a cheap iteration tier and a premium final-output tier within the same ecosystem, Veo is uniquely positioned.
That said, I'd encourage using multiple models rather than committing exclusively to one family. Different architectures have different strengths. A clip that Veo struggles with might render cleanly in Seedance, and vice versa. The video generation space is moving fast enough that loyalty to any single provider is premature.
frequently asked questions
how do I choose between veo 3 fast and veo 3.1 fast?
Veo 3.1 Fast costs more than Veo 3 Fast at base resolution but includes architectural improvements that show up as better temporal coherence and more accurate physics simulation. It also supports image-to-video, which Veo 3 Fast handles less precisely. For text-only generation of simple scenes, Veo 3 Fast remains excellent value. For image-to-video work, complex multi-element scenes, or anything where you need the latest quality improvements, the 3.1 Fast premium is justified. Start with 3 Fast for budget work and upgrade when quality demands it.
is the audio generation good enough to ship without editing?
For environmental and ambient audio - yes, frequently. Rain, wind, ocean, city noise, nature sounds, and mechanical ambience regularly pass without any listener noticing they're synthetic. For sound effects synchronized to specific visual events, it works about 80% of the time with convincing timing. Dialogue and music are the weak points - synthesized speech sounds robotic and music lacks structure. My practical rule: if the clip is ambient or atmospheric, ship the generated audio directly. If it involves speech or needs scored music, plan to replace the audio track in post regardless.
what's the maximum clip duration and can I generate longer videos?
Most Veo models produce individual clips in the 5-8 second range. Veo 3.1 Lite supports configurable durations of 4, 6, or 8 seconds. You can request specific durations but the model treats these as targets rather than guarantees - actual output may be a second or two shorter. For longer content, Veo 3.1 introduced scene extension, which generates new clips that connect to the end of your previous video, maintaining visual continuity by basing each new generation on the final second of the prior clip. You can chain up to 20 clips this way for narratives exceeding two minutes. Alternatively, generating separate clips and cutting them together in an editor gives you tighter creative control over each segment's composition.
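As a quick check on the "exceeding two minutes" figure, the sketch below adds up a chained sequence. It assumes an 8-second starting clip, reads "up to 20 clips" as 20 extensions, and takes the 7 seconds per extension from the schema section of this page. The function name is hypothetical.

```python
def chained_duration(base_s: int = 8, extensions: int = 20, step_s: int = 7) -> int:
    """Total length in seconds of a base clip plus a number of extensions."""
    return base_s + extensions * step_s


total = chained_duration()
minutes, seconds = divmod(total, 60)
print(f"{total} s = {minutes} min {seconds} s")  # 148 s = 2 min 28 s
```

If the 20-clip limit includes the starting clip, the total drops by one step to 141 seconds, which is still over two minutes.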
api reference
1. calling the api
install the client
the client provides a convenient way to interact with the api.
    pip install inferencesh

setup your api key
set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.
    export INFERENCE_API_KEY="inf_your_key"

run and get result
submit a request and wait for the final result. best for batch processing or when you don't need progress updates.
    from inferencesh import inference

    client = inference()

    result = client.run({
        "app": "google/veo-3",
        "input": {}
    })

    print(result["output"])

stream live updates
get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.
    from inferencesh import inference

    client = inference()

    # stream=True yields updates as they arrive
    for update in client.run({
        "app": "google/veo-3",
        "input": {}
    }, stream=True):
        if update.get("progress"):
            print(f"progress: {update['progress']}%")
        if update.get("output"):
            print(f"output: {update['output']}")

2. authentication
the api uses api keys for authentication. see the authentication docs for detailed setup instructions.
3. files
file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.
automatic upload
the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.
    # local file paths are automatically uploaded
    result = client.run({
        "app": "google/veo-3",
        "input": {
            "image": "/path/to/local/image.png",  # detected & uploaded
            "audio": "https://example.com/audio.mp3",  # url passed through
        }
    })

4. webhooks
get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.
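On the receiving side, an endpoint only has to parse the JSON body and branch on the result. The sketch below uses Python's standard library and the payload fields documented in this section (id, status, output, error). Treating a non-empty error string as the failure signal is an assumption, as are the function names and port. The meaning of the numeric status codes is not documented here, so the code passes status through untouched.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_task(body: bytes) -> dict:
    """Parse a webhook body and flag whether it carries an error message."""
    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    return {
        "id": payload.get("id"),
        "status": payload.get("status"),       # numeric code, passed through as-is
        "failed": bool(payload.get("error")),  # assumption: non-empty error means failure
        "output": payload.get("output"),
    }


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            summary = summarize_task(self.rfile.read(length))
        except ValueError:  # covers malformed JSON and non-object bodies
            self.send_response(400)
            self.end_headers()
            return
        print(summary)
        self.send_response(204)  # acknowledge with no response body
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Run the receiver until interrupted."""
    HTTPServer(("0.0.0.0", port), WebhookHandler).serve_forever()
```

Calling serve() starts the listener. In production you would put this behind HTTPS and verify that requests really come from the API. The request that registers the webhook url follows.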
    result = client.run({
        "app": "google/veo-3",
        "input": {},
        "webhook": "https://your-server.com/webhook"
    }, wait=False)

webhook payload
your endpoint receives a JSON POST with the task result:
    {
      "id": "task_abc123",
      "status": 9,
      "output": { ... },
      "error": "",
      "session_id": null,
      "created_at": "2024-01-15T10:30:00Z",
      "updated_at": "2024-01-15T10:30:05Z"
    }

5. schema
input
text prompt describing the desired video content.
optional first frame image for image-to-video generation.
optional last frame image for frame interpolation. requires first frame image.
optional video to extend (1-30s mp4, 24fps, 720p/1080p). extends by 7 seconds.
video aspect ratio. 16:9 for landscape, 9:16 for portrait.
video duration in seconds.
output video resolution.
whether to generate audio for the video.
number of videos to generate.
person generation setting. allow_adult: only adults, disallow: no people/faces.
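The constraints on the video-to-extend input (1-30 s, mp4, 24fps, 720p or 1080p) are easy to check locally before spending an upload on a file the API would reject. This is a minimal sketch with a hypothetical function name. It assumes you already know the file's properties, for example from your editor or a media probe tool, and does not inspect the file itself.

```python
def extension_source_problems(duration_s: float, fps: float,
                              height: int, container: str) -> list[str]:
    """Return the reasons a clip does not meet the documented extension limits."""
    problems = []
    if not 1 <= duration_s <= 30:
        problems.append(f"duration {duration_s}s is outside 1-30s")
    if fps != 24:
        problems.append(f"frame rate {fps} is not 24fps")
    if height not in (720, 1080):
        problems.append(f"height {height} is not 720p or 1080p")
    if container.lower() != "mp4":
        problems.append(f"container {container!r} is not mp4")
    return problems


print(extension_source_problems(8, 24, 1080, "mp4"))        # []
print(len(extension_source_problems(45, 30, 480, "mov")))   # 4
```

An empty list means the clip fits the stated limits. Anything else tells you what to re-encode first.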