Google's Veo 3.1 Fast is a video generation model that turns text prompts and images into video clips with optional synchronized audio. Available on inference.sh as a serverless app, it has handled 989 tasks for 87 paying users who need quick, affordable video generation without managing GPU infrastructure. The "Fast" variant prioritizes generation speed over maximum quality, making it practical for iteration, prototyping, and production workflows where turnaround time matters.
What makes Veo 3.1 stand out in the current landscape is Google's native audio generation capability. Unlike models that produce silent video requiring separate audio work, Veo 3.1 can generate synchronized sound — dialogue, ambient noise, sound effects — directly in the output. Combined with resolution options from 720p to 4K, it covers everything from social media content to high-resolution production assets.
what it does
Veo 3.1 Fast generates video from text descriptions. You describe the scene, camera movement, action, and mood in natural language, and the model produces a video clip that follows your instructions. It supports both text-to-video (generating entirely from a prompt) and image-to-video (using a reference image as the first frame or visual guide).
The optional audio generation adds synchronized sound to the video output. This is not generic background music — the model generates contextually appropriate audio that matches the visual action. A scene of waves crashing will have ocean sounds. A person speaking will have synthesized dialogue. The audio is generated in the same pass as the video, so sync is built in rather than applied after the fact.
key features
Text-to-video generation — Describe a scene in natural language and receive a video clip. The model handles camera movement, subject motion, lighting changes, and environmental effects.
Image-to-video — Provide a reference image to anchor the visual style, subject appearance, or starting composition. The model animates from that starting point according to your text instructions.
Native audio generation — Generate synchronized audio alongside the video. Sound effects, ambient noise, and dialogue are produced in context with the visual content. Toggle this on or off per request.
Resolution tiers — Output at 720p for fast previews and social media, 1080p for standard high-definition, or 4K for production-quality assets.
Fast inference — The "Fast" variant is optimized for generation speed. You get results quicker than the standard Veo 3.1 model, with a quality tradeoff that is acceptable for most use cases outside premium production work.
use cases
Social media content — Generate short video clips for social posts, stories, and reels. The combination of speed and audio means you can produce complete, ready-to-post video without editing.
Product demos and explainers — Animate product concepts, show usage scenarios, or create visual explanations. Use image-to-video with product photos to bring static assets to life.
Creative prototyping — Quickly visualize scene concepts, storyboard ideas, or narrative sequences before committing to expensive production. Iterate on prompts until you find the right direction.
Marketing and advertising — Generate ad creatives, promotional clips, and brand content. The audio generation means you get complete assets rather than silent video that needs post-production.
Content automation — Build pipelines that generate video content programmatically. News summaries, weather visualizations, data-driven stories, and automated social content all benefit from API-driven video generation.
Game and app assets — Generate background videos, loading animations, cutscene concepts, or UI motion graphics.
how to run
belt CLI
Basic text-to-video generation:
```shell
belt app run google/veo-3-1-fast --input '{"prompt": "A drone shot slowly rising over a foggy forest at sunrise, golden light breaking through the canopy, cinematic"}'
```

Video generation with audio:

```shell
belt app run google/veo-3-1-fast --input '{"prompt": "A busy Tokyo street crossing at night with neon signs, car horns, and crowd chatter", "generate_audio": true}'
```

Image-to-video with a reference frame:

```shell
belt app run google/veo-3-1-fast --input '{"prompt": "The camera slowly zooms out as wind blows through the scene", "image": "./landscape-photo.jpg"}'
```

High-resolution output:

```shell
belt app run google/veo-3-1-fast --input '{"prompt": "Close-up of a glassblower shaping molten glass, warm orange glow, shallow depth of field", "resolution": "4K", "generate_audio": true}'
```

API

```shell
curl -X POST https://api.inference.sh/v1/apps/google/veo-3-1-fast/run \
  -H "Authorization: Bearer $INFERENCE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A timelapse of clouds rolling over mountain peaks, dramatic lighting shifts from golden to deep purple as the sun sets",
    "resolution": "1080p",
    "generate_audio": true
  }'
```

input parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `prompt` | string | yes | Text description of the video to generate. Include details about subject, motion, camera movement, lighting, mood, and style. Longer, more specific prompts produce better results. |
| `image` | string | no | Reference image for image-to-video generation. The model uses this as a visual anchor for the starting frame or overall composition. |
| `resolution` | string | no | Output resolution: "720p", "1080p", or "4K". Higher resolutions cost more per second of video. |
| `generate_audio` | boolean | no | Generate synchronized audio alongside the video. When enabled, the model produces contextually appropriate sound effects, ambient audio, or dialogue. Increases cost per second. |
| `duration` | number | no | Target duration of the generated video in seconds. |
| `aspect_ratio` | string | no | Aspect ratio for the output video (e.g., "16:9", "9:16", "1:1"). |
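The parameters above compose into a single JSON input object. As a sketch, a vertical short-form request might be built like this (the prompt and values are illustrative, and accepted duration and aspect-ratio values may vary):

```shell
# Build an input payload for a vertical, 8-second clip with audio.
# Parameter names come from the table above; the specific values are examples.
input='{
  "prompt": "A barista pouring latte art, overhead shot, soft morning light",
  "resolution": "1080p",
  "duration": 8,
  "aspect_ratio": "9:16",
  "generate_audio": true
}'

# Then pass the payload to the CLI:
#   belt app run google/veo-3-1-fast --input "$input"
echo "$input"
```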
output
The app returns:
- Video file — The generated video clip hosted on inference.sh cloud storage, accessible via URL. Format is typically MP4.
- Audio track — When `generate_audio` is enabled, the audio is embedded in the video file. No separate audio download is needed.
- `output_meta` — Metadata including actual resolution, duration, audio presence, and billing information.
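A response can be post-processed with standard shell tools. The snippet below is a minimal sketch against a hypothetical response shape — the field names (`video_url`, `output_meta`) are assumptions, so inspect your actual payload before relying on them:

```shell
# Hypothetical response shape for illustration; field names are assumptions.
response='{"video_url":"https://storage.inference.sh/outputs/clip.mp4","output_meta":{"resolution":"1080p","duration":5,"has_audio":true}}'

# Pull the video URL out with POSIX sed (no jq required).
video_url=$(printf '%s' "$response" | sed -n 's/.*"video_url":"\([^"]*\)".*/\1/p')
echo "$video_url"
```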
pricing
Pricing is per second of generated video:
| Configuration | Price per second |
|---|---|
| 720p / 1080p, video only | $0.10 |
| 720p / 1080p, with audio | $0.25 |
| 4K, video only | $0.30 |
| 4K, with audio | $0.35 |
A typical 5-second clip at 1080p with audio costs $1.25. A 4K clip of the same length with audio costs $1.75.
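Because pricing is per second, cost estimation is a single multiplication. A minimal sketch, with the rates hard-coded from the table above:

```shell
# Estimate clip cost: duration in seconds times the per-second rate.
# Rates are copied from the pricing table above.
estimate_cost() {
  awk -v d="$1" -v r="$2" 'BEGIN { printf "%.2f", d * r }'
}

cost_hd_audio=$(estimate_cost 5 0.25)   # 1080p with audio
cost_4k_audio=$(estimate_cost 5 0.35)   # 4K with audio
echo "5s at 1080p+audio: \$$cost_hd_audio, at 4K+audio: \$$cost_4k_audio"
```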
when to use this vs alternatives
Choose Veo 3.1 Fast when you need quick turnaround, native audio generation, or Google's video quality at competitive pricing. Best for iterative workflows where you want to try multiple prompts quickly.
Choose Seedance 2.0 when you need longer clips (up to 20 seconds), multimodal reference input (combining images, video clips, and audio references), or maximum motion quality and character consistency.
Choose Kling when you want a free tier for experimentation or need the fastest possible iteration on short clips.
Choose Veo 3.1 (standard) when you need maximum quality for premium production work and can accept longer generation times.
FAQ
How long are the generated videos?
Veo 3.1 Fast generates video clips typically in the 4-8 second range. The exact duration depends on the complexity of the prompt and the model's interpretation of the scene. You can specify a target duration in your request.
Does the audio generation support dialogue?
Yes, the audio generation can produce synthesized speech when your prompt describes people talking. It also generates environmental sounds, sound effects, and ambient audio. The audio is contextually matched to the visual content, so a scene of a rainstorm will have rain sounds, and a café scene will have background chatter and clinking.
Can I use a reference image as the first frame?
Yes, pass an image in the `image` parameter. The model will use it as a visual anchor — this can be a first frame that gets animated, a style reference, or a subject reference. Combine this with a descriptive prompt about the motion and camera work you want.
What is the difference between Veo 3.1 Fast and standard Veo 3.1?
The Fast variant is optimized for generation speed at a slight quality tradeoff. For iteration, prototyping, and most production use cases, the quality difference is minimal. Standard Veo 3.1 takes longer but produces marginally higher fidelity output — choose it for final hero assets where maximum quality justifies the wait.
What video format is output?
Videos are delivered as MP4 files with H.264 encoding. When audio is enabled, it is embedded in the MP4 as an AAC audio track. The files are ready to use directly in most applications without transcoding.