wan-i2v
Transform static images into animated videos with text prompts
There is a persistent tension in generative AI between what you want and what you can afford. Alibaba's Wan 2.7 models represent one end of that spectrum - genuinely impressive video and image generators with pricing that reflects their quality. Pruna's optimized versions of the same Wan architecture sit at the other end. Pruna, a Munich-based startup that open-sourced its model optimization framework in March 2025, applies quantization, pruning, and distillation to reduce computational load, trading some fidelity for dramatically lower costs. If you have ever abandoned a creative idea because the per-generation cost made experimentation feel reckless, these are worth understanding.
Pruna has brought three optimized Wan models to inference.sh: a text-to-video generator, an image-to-video animator, and a small fast image model. Together they cover the most common generative workflows at prices that make iteration feel free. I have been running them for various projects and the value proposition is clear, even if the output is not going to win any visual fidelity awards against the full-size Alibaba versions.
what pruna actually does to these models
Pruna specializes in model optimization - taking open-source architectures and making them run faster on less hardware without completely destroying the output quality. The specifics involve quantization, pruning, and inference pipeline optimizations that reduce the computational load per generation. The result is models that can run on cheaper infrastructure, which translates directly to lower prices for the end user.
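To make that concrete, here is a minimal sketch of post-training weight quantization, the most familiar of those techniques. This is purely illustrative - it is not Pruna's actual pipeline, and the function names are invented for the example - but it shows where the speed comes from and where the quality goes.

import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 plus a per-tensor scale factor."""
    scale = np.abs(weights).max() / 127.0  # largest value maps to 127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# the reconstruction error below is the "lost nuance" - int8 storage is
# 4x smaller and faster to compute with, but the rounding never comes back
print(np.abs(w - w_hat).max())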
This is not magic. There is a real quality cost. When you compress a model's weights and strip out redundant computation, you lose some of the nuance that the full model captured during training. Fine details get softer. Complex scenes lose some coherence at the edges. Motion in video can feel slightly less natural. But the degree of degradation depends heavily on what you are generating and whether your use case actually demands the highest tier of output.
The honest framing is this: Pruna's Wan models are to Alibaba's native Wan 2.7 what a competent stock photo is to a commissioned illustration. Both serve real purposes. The question is which purpose you are serving right now.
text-to-video with flat pricing
The Pruna Wan text-to-video model uses flat per-video pricing rather than per-second billing. This is unusual in the video generation market and it changes how you think about duration.
Most competing services charge per second of output, which creates a constant pressure to keep clips short. When you are paying a flat rate, you can let scenes breathe. A slow camera pan that takes eight seconds costs exactly the same as a quick two-second cut. That freedom is worth more than it might seem on paper - it changes the kinds of videos you attempt in the first place.
The model handles standard generative video tasks competently. Describe a landscape with moving clouds, a person walking through a city street, abstract particle effects, or a product turntable, and you will get something usable. The motion is generally coherent within the 1-10 second duration range. Aspect ratios cover the formats you actually need: widescreen for YouTube, square for social feeds, portrait for Stories and Reels.
Where the optimization shows its hand is in fine motion detail. Fabric draping, water splashing, facial expressions during complex emotions - these are areas where the full Wan 2.7 text-to-video model visibly outperforms the Pruna version. Hair movement can look slightly rigid. Particle effects sometimes lose their organic quality. These are not failures in isolation, but placed next to the unoptimized model's output, the difference is apparent.
For reference, Alibaba's native Wan 2.7 text-to-video on the same platform costs more but produces noticeably smoother motion and better handling of complex prompts. If you are generating hero content for a campaign landing page, spend the extra money. If you are generating fifty variations to find the right visual direction for a storyboard, Pruna's version at five cents each is the rational choice.
image-to-video for budget animation
The image-to-video model takes a different pricing approach - per-second billing rather than the flat rate of the text-to-video model. It remains extremely affordable by market standards, though longer clips do cost proportionally more.
I find image-to-video more forgiving of optimization artifacts than pure text-to-video. When you provide a source image, the model has concrete visual information to anchor its generation. Colors, composition, lighting - these are all defined by the input rather than hallucinated from text. The model's job narrows to animating what already exists, and that constrained task survives compression better than open-ended generation from descriptions alone.
The practical applications here are straightforward. Product photography that needs subtle motion for social ads. Illustrations that want a parallax-style animation. Concept art that needs to come alive for a pitch deck. Photographs turned into cinemagraphs. In each case, you already have a strong static image and want to add motion without paying for a full video production pipeline.
The quality ceiling is real though. Complex motion that departs significantly from the source image - like turning a portrait into a full-body walking animation - will produce artifacts. The model works best when the implied motion is relatively contained. A breeze moving through hair, clouds drifting behind a building, a slight zoom with parallax separation between foreground and background. Stay within those bounds and the results are genuinely good for the price.
budget images for batch workflows
The smallest model in the trio is Pruna's Wan image generator, priced comparably to FLUX Dev. The differentiator is not price but aesthetic character - different models produce different visual signatures even at similar quality tiers.
This is positioned explicitly for rapid prototyping and batch generation. The model supports multiple outputs per request, configurable aspect ratios, and a "juiced" mode for even faster generation at some quality cost. When you need to produce dozens or hundreds of images quickly - populating a design system with placeholder imagery, generating training data for a visual classifier, producing social media content at scale - the economics are essentially negligible.
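As a sketch of what a batch request might look like through the platform's client (shown in the api reference below) - the input field names num_images, aspect_ratio, and juiced are assumptions based on the capabilities described above, so check the app's actual schema before relying on them:

from inferencesh import inference

client = inference()

# generate several candidates in one request; field names are assumed
result = client.run({
    "app": "pruna/wan-image-small",
    "input": {
        "prompt": "isometric illustration of a rooftop garden, soft morning light",
        "num_images": 4,        # multiple outputs per request
        "aspect_ratio": "1:1",  # square for social feeds
        "juiced": True,         # faster generation at some quality cost
    },
})

# output shape is also assumed here; inspect result["output"] for your app
print(result["output"])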
The output quality is what you would expect from an optimized small model. Good enough for web thumbnails, social posts, and iterative exploration. Not good enough for hero images, print production, or anything where someone will spend more than a few seconds examining the result. Textures can look soft. Complex compositions sometimes lose spatial coherence. But these same criticisms apply to every model in this price tier, and Pruna's version holds up respectably against its direct competitors.
What I appreciate about having this in the same family as the video models is workflow continuity. You can generate a batch of images with wan-image-small, pick the ones that work, and feed them directly into wan-i2v to animate them. The entire exploration-to-animation pipeline is extremely cheap, which means you can run it dozens of times without any budget anxiety.
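Here is a rough sketch of that pipeline. The pruna/wan-image-small app id, the batch field names, and the output shape are assumptions for illustration; the client usage follows the api reference below.

from inferencesh import inference

client = inference()

# step 1: a cheap batch of stills (app id and field names assumed)
batch = client.run({
    "app": "pruna/wan-image-small",
    "input": {"prompt": "a lighthouse at dusk, fog rolling in", "num_images": 4},
})

# step 2: pick a candidate (here just the first) and animate it
chosen = batch["output"]["images"][0]
video = client.run({
    "app": "pruna/wan-i2v",
    "input": {
        "image": chosen,  # urls are passed through, per the files section
        "prompt": "slow push-in as the fog drifts past the lighthouse",
        "duration": 6,
    },
})

print(video["output"])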
the honest comparison with alibaba's native wan 2.7
I want to be direct about this because it matters for decision-making. Alibaba's Wan 2.7 models - the non-Pruna, non-optimized versions - produce better results. The video is smoother. The motion is more naturalistic. Complex prompts are interpreted with more nuance. Image quality has more detail in textures and better coherence in multi-subject scenes. This is not a close call in terms of raw output quality.
The full Wan 2.7 lineup on inference.sh includes specialized variants - a pro image model, dedicated video editing capabilities, reference-to-video generation that maintains character consistency. These are capabilities the Pruna versions simply do not offer. If your project requires any of those specialized workflows, the choice is already made for you.
But quality comparisons in isolation miss the point. The relevant question is always quality relative to cost relative to volume. If you need ten videos and they all need to be excellent, spend the money on Wan 2.7. If you need two hundred videos and they need to be acceptable, Pruna's versions make that financially viable where the full models would not.
I have seen this play out in practice. Teams doing content marketing at scale, generating social video variations for A/B testing, building datasets for model training, creating quick visual references for internal communication - in all these cases, the cheaper models win not because they are better, but because the volume they enable creates more value than the quality they sacrifice destroys.
when to use which
The decision framework is simpler than it might seem. Use the Pruna Wan models when you are in exploration mode, when you are generating at volume, when the output is intermediate rather than final, or when your budget constrains your options more than your quality standards do. Use the full Wan 2.7 models when the output is customer-facing, when you need specialized capabilities like video editing or reference consistency, or when you are generating a small number of high-stakes assets.
There is also a hybrid approach that makes economic sense for many teams. Use Pruna's cheap models for the divergent phase of creative work - generating many options, exploring directions, validating concepts. Then switch to the full models for the convergent phase - producing the final polished assets from validated directions. Your total spend ends up lower than using premium models throughout, and your creative output is arguably better because you explored more broadly in the cheap phase.
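A sketch of what that looks like in code, with hypothetical app ids standing in for the cheap and premium tiers - neither id below is confirmed, so substitute the real ones from the platform:

from inferencesh import inference

client = inference()

prompts = [
    "aerial shot of a coastline at sunrise, slow drift",
    "aerial shot of a coastline at sunrise, fast fly-over",
    "coastline at sunrise from a fixed tripod, waves rolling in",
]

# divergent phase: cheap flat-rate generations to compare directions
# (app id is a placeholder for the Pruna text-to-video model)
drafts = {p: client.run({"app": "pruna/wan-t2v", "input": {"prompt": p}})
          for p in prompts}

# ...review the drafts, then rerun only the winner on the premium tier
winner = prompts[0]  # whichever draft held up
final = client.run({"app": "wan/wan-2.7-t2v", "input": {"prompt": winner}})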
The video models in particular benefit from this approach. Video generation is inherently more expensive than image generation, which makes the exploration phase proportionally more costly. Getting that phase down to a trivial cost per attempt means you can try dozens of different prompt formulations for very little. That kind of freedom transforms how you approach video content ideation.
the optimization tradeoff is permanent and that is fine
Every model optimization involves an irreversible tradeoff. You cannot un-quantize weights. You cannot restore pruned connections. The information that Pruna removes to achieve faster inference is gone. This means these models will never match their parent models regardless of how good your prompts are or how carefully you configure parameters.
Some people find this uncomfortable. I think it reflects a misunderstanding of how creative tools work in practice. A $20 brush and a $200 brush both make marks on canvas. The expensive one makes more controlled marks with more nuance, but plenty of compelling work gets made with cheap brushes. The tool's ceiling matters less than whether it clears the floor your project requires.
Pruna's Wan models clear the floor for a surprisingly wide range of projects. They do not clear it for everything. Know which category your current project falls into and choose accordingly. The platform makes switching between tiers trivial, which means you are never locked into a decision - you can always upgrade to the full models for specific generations that demand it.
frequently asked questions
how much quality do you actually lose compared to the full wan 2.7 models?
The difference is most visible in video generation, where motion smoothness and fine detail take the biggest hit from optimization. In my experience, the Pruna versions produce output that is roughly 70-80% of the visual quality of the full Wan 2.7 models. That gap is immediately noticeable in side-by-side comparisons but often invisible when viewing the Pruna output in isolation. For images, the gap is narrower since single frames are less demanding than temporal coherence across video. The practical question is whether your audience will ever see both versions next to each other - if not, the quality difference may be irrelevant to your use case.
can I use the image model output as input for the video model?
Yes, and this is one of the more useful workflows available with this model family. Generate a batch of images with wan-image-small, select the ones with promising compositions, then feed them into wan-i2v to produce animated versions. The total cost for this kind of pipeline is trivial. This workflow works well because the image-to-video model is forgiving of input quality variations, and the visual consistency within the Wan model family means outputs from one model animate naturally in the other.
is the flat per-video pricing on text-to-video actually better than per-second pricing?
It depends entirely on your typical clip length. If you mostly generate short 1-2 second clips, per-second pricing like the i2v model uses would be cheaper. But the flat rate shines for longer generations - the effective per-second cost drops as duration increases. The flat pricing also removes the mental friction of optimizing duration, which means you focus on getting the right output rather than minimizing seconds. For most creative workflows where you want clips in the 4-10 second range, the flat rate is the better deal.
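If you want to sanity-check the break-even for your own clip lengths, the arithmetic is trivial. The five-cent flat rate comes from earlier in this piece; the per-second rate below is a placeholder you should replace with current pricing:

FLAT_RATE = 0.05   # flat per-video price cited above
PER_SECOND = 0.02  # assumed per-second rate; substitute the real one

# compare the two billing models across the supported 1-10s range
for seconds in range(1, 11):
    per_second_cost = seconds * PER_SECOND
    cheaper = "flat" if FLAT_RATE < per_second_cost else "per-second"
    print(f"{seconds:2d}s  flat ${FLAT_RATE:.2f}  "
          f"per-second ${per_second_cost:.2f}  -> {cheaper}")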
api reference
about
transform static images into animated videos with text prompts
1. calling the api
install the client
the client provides a convenient way to interact with the api.
pip install inferencesh

setup your api key
set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.
export INFERENCE_API_KEY="inf_your_key"

run and get result
submit a request and wait for the final result. best for batch processing or when you don't need progress updates.
from inferencesh import inference

client = inference()

result = client.run({
    "app": "pruna/wan-i2v",
    "input": {}
})

print(result["output"])

stream live updates
get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "pruna/wan-i2v",
    "input": {}
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication
the api uses api keys for authentication. see the authentication docs for detailed setup instructions.
3. files
file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.
automatic upload
the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.
# local file paths are automatically uploaded
result = client.run({
    "app": "pruna/wan-i2v",
    "input": {
        "image": "/path/to/local/image.png",       # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})

4. webhooks
get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.
result = client.run({
    "app": "pruna/wan-i2v",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload
your endpoint receives a JSON POST with the task result:
{
    "id": "task_abc123",
    "status": 9,
    "output": { ... },
    "error": "",
    "session_id": null,
    "created_at": "2024-01-15T10:30:00Z",
    "updated_at": "2024-01-15T10:30:05Z"
}
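a minimal receiver for this payload might look like the following. the framework choice here (FastAPI) is ours, not the platform's - any server that accepts a JSON POST works:

from fastapi import FastAPI, Request

app = FastAPI()

# minimal endpoint to receive the POST described above
@app.post("/webhook")
async def webhook(request: Request):
    task = await request.json()
    print(task["id"], task["status"])
    if task.get("output"):
        # persist or process the result here
        print(task["output"])
    return {"ok": True}

5. schema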
input
text description for video generation.
input image for image-to-video generation.
video resolution: 480p or 720p.
video duration in seconds (1-10).
random seed for reproducible generation.
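putting the schema together, a full request might look like this. only the image field name is confirmed by the files example above; prompt, resolution, duration, and seed are inferred from the descriptions and should be verified against the app page:

from inferencesh import inference

client = inference()

result = client.run({
    "app": "pruna/wan-i2v",
    "input": {
        # field names inferred from the schema descriptions; verify before use
        "prompt": "gentle breeze moves through the subject's hair",
        "image": "https://example.com/portrait.png",
        "resolution": "720p",  # 480p or 720p
        "duration": 5,         # seconds, 1-10
        "seed": 42,            # reproducible generation
    },
})

print(result["output"])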