p-video
Fast text-to-video and image-to-video in 720p/1080p with audio support
There's a category of tool I reach for when the goal is volume over virtuosity. Pruna, a Munich-based startup specializing in model optimization through quantization, pruning, and distillation techniques, built P-Video to sit in that space for video generation - not the model you pick when you need a single hero clip that will hold up on a 4K display, but the one you reach for when you need twenty iterations before lunch and you'd rather not burn through your budget. It's the cheapest general-purpose video generator I've used that still produces usable results.
I want to be clear about what "usable" means here. P-Video generates 720p and 1080p clips up to 10 seconds with both text-to-video and image-to-video modes. The output is good enough for social content, prototyping, and automated pipelines. It is not going to match Wan 2.7 or Veo on motion fidelity or temporal coherence in complex scenes. That's fine. The value proposition is speed of iteration at negligible cost, and on those terms it delivers.
This is also the model with native audio support built in - both audio-conditioned generation (feed it a soundtrack and it tries to match the visual energy) and audio synthesis on the output (set save_audio to true and get a video with sound included). For the price point, having sound baked in without a separate pipeline is genuinely useful.
One important note: Pruna also offers P-Video Avatar for talking-head generation. That's a different model with its own guide. This article covers the general-purpose P-Video tool for scenes, motion, products, and creative content.
why the budget matters
P-Video's positioning becomes clear when you think about how it changes your workflow. Draft mode is so cheap that you can generate hundreds of variations without thinking about cost. That changes how you work - instead of carefully crafting one prompt and hoping it lands, you throw twenty variations at the wall and keep what works. The iterative cost is effectively zero for any professional workflow.
Full quality costs more but remains a fraction of what you'd spend on Wan 2.7 or Veo 3.1 Fast. P-Video is roughly 5-10x cheaper depending on configuration. The quality gap exists but it's not proportionally worse - it's maybe 30-40% worse on complex motion and fine detail. For many applications, that's an easy trade to accept.
text-to-video and what it handles well
The text-to-video mode is straightforward. You provide a prompt describing your scene and the model generates a clip. Duration ranges from 1 to 10 seconds, resolution is either 720p or 1080p, frame rate is 24 or 48 fps, and you can pick from seven aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3).
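To make this concrete, here's roughly what a text-to-video request looks like through the Python client documented in the api reference below. The duration and aspect_ratio names appear in the schema; prompt, resolution, and fps are my best guesses at the remaining field names, so verify them against the schema before building on this.

from inferencesh import inference

client = inference()

# minimal text-to-video sketch; "prompt", "resolution", and "fps" are
# assumed field names - "duration" and "aspect_ratio" are named in the schema
result = client.run({
    "app": "pruna/p-video",
    "input": {
        "prompt": "slow dolly forward over a foggy pine forest at dawn",
        "duration": 5,            # seconds, 1-10 per the schema
        "aspect_ratio": "16:9",
        "resolution": "1080p",    # assumed name; options are 720p or 1080p
        "fps": 24,                # assumed name; options are 24 or 48
    }
})

print(result["output"])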
Prompt upsampling is enabled by default, which means an LLM rewrites your prompt to add detail before generation. I leave this on for exploratory work - it fills in lighting, camera angle, and atmospheric details that I might not specify. For precise control over the output, turn it off. The rewrite can sometimes shift the scene in directions you didn't intend, particularly with abstract or conceptual prompts where the LLM "helpfully" adds concrete elements that dilute your original idea.
What P-Video handles well: simple camera movements over landscapes, product shots with gentle rotation or lighting changes, atmospheric scenes (fog, rain, clouds), abstract motion graphics, and anything where temporal coherence over long durations isn't the primary concern. It's solid at 3-5 second clips. At 8-10 seconds, you start seeing the drift that plagues all diffusion-based video models, but it's manageable for non-critical applications.
What it struggles with: complex multi-subject interactions, human faces in close-up (especially during motion), readable text in scene, and fast action sequences. These are universal limitations across all video generators in this tier, but they're more pronounced here than on premium models. If your use case demands people talking, hands manipulating objects, or athletic movement - budget more time for prompt iteration or consider upgrading to Wan 2.7.
image-to-video for controlled composition
Providing an input image gives you control over the starting visual that text-to-video can't match. You pass an image URL, write a prompt describing the motion you want, and the model animates from that first frame. This is where P-Video becomes particularly useful for product work - you have your product photo dialed in already, you just need it to rotate slowly or have the background come alive.
A few things to know about the image-to-video mode: the aspect_ratio parameter is ignored, and the output matches your input image's dimensions. This means you need to prepare your source image at the aspect ratio you want for the final video. If you feed in a square product shot, you get a square video.
The quality of the input image matters more than you might expect. High-resolution, well-lit source images produce dramatically better animations than compressed or noisy inputs. I've found that images generated by other models (FLUX, Qwen, even Pruna's own P-Image) work as excellent starting frames because they're clean and well-composed by default.
The prompt in image-to-video mode should focus purely on motion and temporal change rather than visual description. The model already knows what the scene looks like - it has the image. What it needs from you is choreography: "the camera slowly pulls back revealing the full scene," "wind moves through the foliage while light shifts," "the product rotates 90 degrees counterclockwise on a reflective surface." Describing the scene's appearance in image-to-video mode wastes prompt space and can sometimes conflict with what the model sees in the actual image.
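Here's what that advice looks like as a request - a sketch of an image-to-video call where the prompt carries only choreography. The image key matches the file-upload example in the api reference; prompt is again an assumed field name.

from inferencesh import inference

client = inference()

# image-to-video sketch: the prompt describes motion only, since the
# model takes composition from the input image itself
result = client.run({
    "app": "pruna/p-video",
    "input": {
        "image": "https://example.com/product-shot.png",
        "prompt": "the product rotates 90 degrees counterclockwise "
                  "on a reflective surface, static camera",
    }
})

print(result["output"])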
audio - the underrated feature at this price
P-Video has two audio capabilities that are easy to overlook. First, save_audio is true by default, which means your output videos come with synthesized sound already embedded. This is ambient audio matched to the visual content - waves for ocean scenes, city noise for urban environments, that sort of thing. The quality is reasonable for social content. It won't fool an audio engineer, but it's miles better than silence and saves a post-production step.
Second, audio-conditioned generation lets you provide your own audio file and the model attempts to generate video that matches its energy and rhythm. This is not lip-sync or precise beat-matching - it's more of a general vibe alignment. Provide an upbeat track and the generated motion tends to be more dynamic. Provide something ambient and slow and you get smoother, gentler movement. I think of it as mood-setting rather than synchronization.
For social media creators producing daily content, having audio baked into the output at no additional cost removes an entire step from the pipeline. You don't need a separate audio generation tool, you don't need to manually align tracks, you don't need editing software to merge the two. The clip arrives ready to post. That's a meaningful workflow improvement even if the audio quality is merely adequate rather than exceptional.
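A sketch of the audio side, using the audio key from the file-upload example in the api reference and the save_audio flag described above:

from inferencesh import inference

client = inference()

# audio-conditioned sketch; per the schema, duration is ignored when an
# audio file is provided, and save_audio (true by default) keeps sound
# embedded in the output video
result = client.run({
    "app": "pruna/p-video",
    "input": {
        "prompt": "neon-lit city street at night, handheld camera drift",
        "audio": "https://example.com/upbeat-track.mp3",  # flac, mp3, or wav
        "save_audio": True,
    }
})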
prompt tips that actually improve output
After running a few hundred generations through P-Video, I've landed on a few patterns that consistently produce better results.
Keep prompts focused on one subject and one type of motion. "A single red rose petal falling in slow motion against a dark background" will give you something clean and usable. "A garden scene with butterflies and a cat and flowers waving in the wind and a person walking through" will give you a mess. The model has limited coherence bandwidth - spend it on one thing done well.
Specify camera behavior explicitly. "Static camera," "slow dolly forward," "gentle orbital motion" - these phrases constrain the model in ways that reduce artifacts. Unspecified camera movement tends to result in subtle jitter that makes the clip feel unstable. Pinning the camera to a specific behavior eliminates this class of problem entirely.
Lighting descriptions improve temporal stability. "Soft overcast lighting" or "golden hour directional light" gives the model a consistent illumination reference that prevents the frame-to-frame brightness fluctuation you sometimes see in clips with unspecified lighting. Natural lighting descriptions work better than artificial ones in my experience.
For draft mode exploration, write your prompts at full detail even though the visual output will be lower quality. The composition, framing, and motion characteristics still respond to your prompt - you're evaluating whether the model understood your intent, not whether every pixel is sharp. Once you have a prompt that produces the right motion and composition in draft, switching to full quality typically preserves those characteristics at higher fidelity.
Disable prompt upsampling when your prompts are already detailed. The LLM rewrite adds value to short, vague prompts but can dilute or redirect carefully crafted ones. If you've written three sentences describing exactly what you want, the upsampler might add a fourth that contradicts your intent.
where p-video fits in the hierarchy
I think of video generation on inference.sh as roughly three tiers for general-purpose work. At the top, Veo 3.1 and standard Wan 2.7 deliver maximum quality with longer durations and better temporal coherence. In the middle, Wan 2.7 at lower resolutions and Veo 3.1 Fast offer a good balance of quality and cost. And at the bottom, P-Video gives you acceptable quality at a fraction of the price.
"Bottom" isn't pejorative here. Budget options exist because not every task demands maximum quality. Generating fifty thumbnail concepts for A/B testing? P-Video in draft mode. Building an automated pipeline that generates daily recap videos from text feeds? P-Video keeps costs negligible. Prototyping motion ideas before committing to an expensive model for the final render? P-Video all day.
The smart workflow, if budget matters to you at all, is to develop your prompts on P-Video draft mode, validate your creative direction at trivial cost, then render final assets on P-Video full quality or upgrade to a premium model depending on how critical the output is. This costs a fraction of iterating directly on expensive models and still gets you to the same creative destination.
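Sketched as a script, that draft-then-upgrade loop looks something like this. I'm assuming the draft toggle and the seed are literally named draft and seed - the schema describes both options without showing exact field names, so treat these as placeholders.

from inferencesh import inference

client = inference()

prompt = "golden hour light over rolling sand dunes, slow aerial push-in"

# explore cheaply: a handful of draft-mode variations with fixed seeds
# so the winner can be re-rendered; "draft" and "seed" are assumed names
drafts = []
for seed in range(5):
    result = client.run({
        "app": "pruna/p-video",
        "input": {"prompt": prompt, "draft": True, "seed": seed},
    })
    drafts.append((seed, result["output"]))

# after reviewing the drafts by eye, re-render the pick at full quality
chosen_seed = 3
final = client.run({
    "app": "pruna/p-video",
    "input": {"prompt": prompt, "draft": False, "seed": chosen_seed},
})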
the 48 fps option and when to use it
P-Video supports both 24 fps and 48 fps output. Higher frame rates produce smoother motion but double the frame count, which increases generation time. For most content, 24 fps is the right default - it's the cinematic standard and gives motion a natural weight and rhythm.
The 48 fps option becomes relevant for specific content types: smooth panning shots over detailed environments, flowing water or particle effects, slow-motion footage where you want extra temporal resolution, and motion graphics with geometric elements that benefit from smoother interpolation. Sports-style content or anything with fast lateral movement also benefits from the higher frame rate.
I wouldn't use 48 fps for everything by default. It can make some content feel unnaturally smooth in a way that triggers the "soap opera effect" that people associate with cheap television. For cinematic and narrative content, 24 fps produces results that feel more deliberately crafted. Reserve 48 fps for cases where smoothness is specifically what you're after.
honest limitations
P-Video won't give you consistent character identity across multiple generations. Every clip is a fresh start - the same prompt twice produces different faces, different environments, different compositions. If you need recurring characters across a series of clips, this isn't the tool. Look at Wan 2.7's reference-to-video mode instead.
Duration caps at 10 seconds, and providing an audio file for conditioning overrides the duration parameter entirely. You cannot generate a 30-second clip in one pass. For longer content, you'd need to stitch multiple generations together, which introduces continuity challenges.
The model's understanding of physics is approximate. Water flows downhill most of the time. Gravity usually works. But fabrics, hair, and fluids in unusual configurations can produce uncanny results. Simple physics scenes (object falling, water flowing, clouds drifting) are reliable. Complex physics interactions are not.
Safety filtering is present by default and tends toward conservative. You can disable it, but be aware that the model may still refuse certain generations based on its training. This isn't unusual for the category but it's worth knowing before you plan a workflow around content that might trigger filters.
frequently asked questions
how does p-video compare to pruna's wan-based models?
Pruna also offers optimized versions of Wan models (pruna/wan-t2v and pruna/wan-i2v) which are faster variants of Alibaba's architecture. P-Video is Pruna's own general-purpose model rather than an optimization of someone else's work. In practice, the Wan variants tend to produce slightly better temporal coherence on longer clips, while P-Video is cheaper and faster for shorter generations. If you're already using Pruna's ecosystem, P-Video is the budget workhorse and the Wan variants are for when you need more polish on a specific clip.
is draft mode good enough to actually use the output, or just for previewing?
Draft mode is genuinely usable for low-stakes applications. Social media stories, internal presentations, mood boards, prototype demonstrations - all fine at draft quality. The resolution is still 720p or 1080p, the motion characteristics are preserved, and the framing works the same way. What you lose is fine detail, sharpness on edges, and consistency on textures. I'd post a draft-mode clip to Instagram Stories without hesitation. I wouldn't use one in a client pitch deck. The 75% cost saving is worth it for anything below your quality threshold for "good enough."
can I chain p-video with other tools for a production workflow?
Yes, and this is where the low cost becomes strategically useful. Generate a still image with P-Image or FLUX, feed it into P-Video's image-to-video mode to create a clip, then add professional audio with ElevenLabs if the built-in audio isn't sufficient. The entire pipeline costs very little per clip at full quality. You can also use P-Video drafts to explore motion ideas, then regenerate the best concepts on Wan 2.7 or Veo for higher fidelity finals. The API-first design means all of this is scriptable and automatable.
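As a rough sketch of that chain - assuming pruna/p-image is the image model's app id (check the catalog for the exact name) and that the image result's output can be passed straight through as the video input; inspect the actual payload before wiring this up:

from inferencesh import inference

client = inference()

# step 1: generate a clean starting frame (app id assumed)
still = client.run({
    "app": "pruna/p-image",
    "input": {"prompt": "minimalist perfume bottle on black marble"},
})

# step 2: animate it; the exact shape of still["output"] is an
# assumption - it may be a url string or a structured object
clip = client.run({
    "app": "pruna/p-video",
    "input": {
        "image": still["output"],
        "prompt": "slow orbital camera motion, soft studio lighting",
        "save_audio": True,
    }
})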
api reference
1. calling the api
install the client
the client provides a convenient way to interact with the api.
pip install inferencesh

setup your api key
set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.
export INFERENCE_API_KEY="inf_your_key"

run and get result
submit a request and wait for the final result. best for batch processing or when you don't need progress updates.
from inferencesh import inference

client = inference()

result = client.run({
    "app": "pruna/p-video",
    "input": {}
})

print(result["output"])

stream live updates
get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "pruna/p-video",
    "input": {}
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication
the api uses api keys for authentication. see the authentication docs for detailed setup instructions.
3. files
file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.
automatic upload
the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.
# local file paths are automatically uploaded
result = client.run({
    "app": "pruna/p-video",
    "input": {
        "image": "/path/to/local/image.png",  # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})

4. webhooks
get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.
result = client.run({
    "app": "pruna/p-video",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload
your endpoint receives a JSON POST with the task result:
{
  "id": "task_abc123",
  "status": 9,
  "output": { ... },
  "error": "",
  "session_id": null,
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:05Z"
}
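If you want a quick endpoint to test against, a minimal Flask handler along these lines would work - my sketch, not part of the sdk:

from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    task = request.get_json()
    # fields match the payload shown above: id, status, output, error
    if task.get("output"):
        print(f"task {task['id']} finished: {task['output']}")
    else:
        print(f"task {task['id']} ended with error: {task.get('error')}")
    return "", 200

if __name__ == "__main__":
    app.run(port=8000)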
5. schema

input
text description for video generation.
input image for image-to-video. when provided, aspect_ratio is ignored.
audio file for audio-conditioned video. when provided, duration is ignored. supports flac, mp3, wav.
video duration in seconds (1-10). ignored if audio is provided.
video resolution: 720p or 1080p.
frames per second: 24 or 48.
aspect ratio. ignored when input image is provided.
draft mode for cheaper, lower-quality previews.
include audio in output video.
enhance prompt with llm.
random seed for reproducible generation.
disable safety filter for prompts and input images.