wan-i2v
Transform static images into animated videos with text prompts
There is a persistent tension in generative AI between what you want and what you can afford. Alibaba's Wan 2.7 models represent one end of that spectrum - genuinely impressive video and image generators with pricing that reflects their quality. Pruna's optimized versions of the same Wan architecture sit at the other end. Pruna, a Munich-based startup that open-sourced its model optimization framework in March 2025, applies quantization, pruning, and distillation to reduce computational load, trading some fidelity for dramatically lower costs. If you have ever abandoned a creative idea because the per-generation cost made experimentation feel reckless, these are worth understanding.
Pruna has brought three optimized Wan models to inference.sh: a text-to-video generator, an image-to-video animator, and a small fast image model. Together they cover the most common generative workflows at prices that make iteration feel free. I have been running them for various projects and the value proposition is clear, even if the output is not going to win any visual fidelity awards against the full-size Alibaba versions.
what pruna actually does to these models
Pruna specializes in model optimization - taking open-source architectures and making them run faster on less hardware without completely destroying the output quality. The specifics involve quantization, pruning, and inference pipeline optimizations that reduce the computational load per generation. The result is models that can run on cheaper infrastructure, which translates directly to lower prices for the end user.
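To make that concrete, here is a minimal sketch of post-training weight quantization, the most familiar of those techniques. This is purely illustrative - it is not Pruna's actual pipeline, and the function names are invented for the example - but it shows where the speed comes from and where the quality goes.

import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 plus a per-tensor scale factor."""
    scale = np.abs(weights).max() / 127.0  # largest value maps to 127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# the reconstruction error below is the "lost nuance" - int8 storage is
# 4x smaller and faster to compute with, but the rounding never comes back
print(np.abs(w - w_hat).max())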
This is not magic. There is a real quality cost. When you compress a model's weights and strip out redundant computation, you lose some of the nuance that the full model captured during training. Fine details get softer. Complex scenes lose some coherence at the edges. Motion in video can feel slightly less natural. But the degree of degradation depends heavily on what you are generating and whether your use case actually demands the highest tier of output.
The honest framing is this: Pruna's Wan models are to Alibaba's native Wan 2.7 what a competent stock photo is to a commissioned illustration. Both serve real purposes. The question is which purpose you are serving right now.
text-to-video with flat pricing
The Pruna Wan text-to-video model uses flat per-video pricing rather than per-second billing. This is unusual in the video generation market and it changes how you think about duration.
Most competing services charge per second of output, which creates a constant pressure to keep clips short. When you are paying a flat rate, you can let scenes breathe. A slow camera pan that takes eight seconds costs exactly the same as a quick two-second cut. That freedom is worth more than it might seem on paper - it changes the kinds of videos you attempt in the first place.
The model handles standard generative video tasks competently. Describe a landscape with moving clouds, a person walking through a city street, abstract particle effects, or a product turntable, and you will get something usable. The motion is generally coherent within the 1-10 second duration range. Aspect ratios cover the formats you actually need: widescreen for YouTube, square for social feeds, portrait for Stories and Reels.
Where the optimization shows its hand is in fine motion detail. Fabric draping, water splashing, facial expressions during complex emotions - these are areas where the full Wan 2.7 text-to-video model visibly outperforms the Pruna version. Hair movement can look slightly rigid. Particle effects sometimes lose their organic quality. These are not failures in isolation, but placed next to the unoptimized model's output, the difference is apparent.
For reference, Alibaba's native Wan 2.7 text-to-video on the same platform costs more but produces noticeably smoother motion and better handling of complex prompts. If you are generating hero content for a campaign landing page, spend the extra money. If you are generating fifty variations to find the right visual direction for a storyboard, Pruna's version at five cents each is the rational choice.
image-to-video for budget animation
The image-to-video model takes a different pricing approach - per-second billing rather than the flat rate of the text-to-video model. It remains extremely affordable by market standards, though longer clips do cost proportionally more.
I find image-to-video more forgiving of optimization artifacts than pure text-to-video. When you provide a source image, the model has concrete visual information to anchor its generation. Colors, composition, lighting - these are all defined by the input rather than hallucinated from text. The model's job narrows to animating what already exists, and that constrained task survives compression better than open-ended generation from descriptions alone.
The practical applications here are straightforward. Product photography that needs subtle motion for social ads. Illustrations that want a parallax-style animation. Concept art that needs to come alive for a pitch deck. Photographs turned into cinemagraphs. In each case, you already have a strong static image and want to add motion without paying for a full video production pipeline.
The quality ceiling is real though. Complex motion that departs significantly from the source image - like turning a portrait into a full-body walking animation - will produce artifacts. The model works best when the implied motion is relatively contained. A breeze moving through hair, clouds drifting behind a building, a slight zoom with parallax separation between foreground and background. Stay within those bounds and the results are genuinely good for the price.
budget images for batch workflows
The smallest model in the trio is Pruna's Wan image generator, priced comparably to FLUX Dev. The differentiator is not price but aesthetic character - different models produce different visual signatures even at similar quality tiers.
This is positioned explicitly for rapid prototyping and batch generation. The model supports multiple outputs per request, configurable aspect ratios, and a "juiced" mode for even faster generation at some quality cost. When you need to produce dozens or hundreds of images quickly - populating a design system with placeholder imagery, generating training data for a visual classifier, producing social media content at scale - the economics are essentially negligible.
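As a sketch of what a batch request might look like through the platform's client (shown in the api reference below) - the input field names num_images, aspect_ratio, and juiced are assumptions based on the capabilities described above, so check the app's actual schema before relying on them:

from inferencesh import inference

client = inference()

# generate several candidates in one request; field names are assumed
result = client.run({
    "app": "pruna/wan-image-small",
    "input": {
        "prompt": "isometric illustration of a rooftop garden, soft morning light",
        "num_images": 4,        # multiple outputs per request
        "aspect_ratio": "1:1",  # square for social feeds
        "juiced": True,         # faster generation at some quality cost
    },
})

# output shape is also assumed here; inspect result["output"] for your app
print(result["output"])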
The output quality is what you would expect from an optimized small model. Good enough for web thumbnails, social posts, and iterative exploration. Not good enough for hero images, print production, or anything where someone will spend more than a few seconds examining the result. Textures can look soft. Complex compositions sometimes lose spatial coherence. But these same criticisms apply to every model in this price tier, and Pruna's version holds up respectably against its direct competitors.
What I appreciate about having this in the same family as the video models is workflow continuity. You can generate a batch of images with wan-image-small, pick the ones that work, and feed them directly into wan-i2v to animate them. The entire exploration-to-animation pipeline is extremely cheap, which means you can run it dozens of times without any budget anxiety.
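Here is a rough sketch of that pipeline. The pruna/wan-image-small app id, the batch field names, and the output shape are assumptions for illustration; the client usage follows the api reference below.

from inferencesh import inference

client = inference()

# step 1: a cheap batch of stills (app id and field names assumed)
batch = client.run({
    "app": "pruna/wan-image-small",
    "input": {"prompt": "a lighthouse at dusk, fog rolling in", "num_images": 4},
})

# step 2: pick a candidate (here just the first) and animate it
chosen = batch["output"]["images"][0]
video = client.run({
    "app": "pruna/wan-i2v",
    "input": {
        "image": chosen,  # urls are passed through, per the files section
        "prompt": "slow push-in as the fog drifts past the lighthouse",
        "duration": 6,
    },
})

print(video["output"])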
the honest comparison with alibaba's native wan 2.7
I want to be direct about this because it matters for decision-making. Alibaba's Wan 2.7 models - the non-Pruna, non-optimized versions - produce better results. The video is smoother. The motion is more naturalistic. Complex prompts are interpreted with more nuance. Image quality has more detail in textures and better coherence in multi-subject scenes. This is not a close call in terms of raw output quality.
The full Wan 2.7 lineup on inference.sh includes specialized variants - a pro image model, dedicated video editing capabilities, reference-to-video generation that maintains character consistency. These are capabilities the Pruna versions simply do not offer. If your project requires any of those specialized workflows, the choice is already made for you.
But quality comparisons in isolation miss the point. The relevant question is always quality relative to cost relative to volume. If you need ten videos and they all need to be excellent, spend the money on Wan 2.7. If you need two hundred videos and they need to be acceptable, Pruna's versions make that financially viable where the full models would not.
I have seen this play out in practice. Teams doing content marketing at scale, generating social video variations for A/B testing, building datasets for model training, creating quick visual references for internal communication - in all these cases, the cheaper models win not because they are better, but because the volume they enable creates more value than the quality they sacrifice destroys.
when to use which
The decision framework is simpler than it might seem. Use the Pruna Wan models when you are in exploration mode, when you are generating at volume, when the output is intermediate rather than final, or when your budget constrains your options more than your quality standards do. Use the full Wan 2.7 models when the output is customer-facing, when you need specialized capabilities like video editing or reference consistency, or when you are generating a small number of high-stakes assets.
There is also a hybrid approach that makes economic sense for many teams. Use Pruna's cheap models for the divergent phase of creative work - generating many options, exploring directions, validating concepts. Then switch to the full models for the convergent phase - producing the final polished assets from validated directions. Your total spend ends up lower than using premium models throughout, and your creative output is arguably better because you explored more broadly in the cheap phase.
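A sketch of what that looks like in code, with hypothetical app ids standing in for the cheap and premium tiers - neither id below is confirmed, so substitute the real ones from the platform:

from inferencesh import inference

client = inference()

prompts = [
    "aerial shot of a coastline at sunrise, slow drift",
    "aerial shot of a coastline at sunrise, fast fly-over",
    "coastline at sunrise from a fixed tripod, waves rolling in",
]

# divergent phase: cheap flat-rate generations to compare directions
# (app id is a placeholder for the Pruna text-to-video model)
drafts = {p: client.run({"app": "pruna/wan-t2v", "input": {"prompt": p}})
          for p in prompts}

# ...review the drafts, then rerun only the winner on the premium tier
winner = prompts[0]  # whichever draft held up
final = client.run({"app": "wan/wan-2.7-t2v", "input": {"prompt": winner}})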
The video models in particular benefit from this approach. Video generation is inherently more expensive than image generation, which makes the exploration phase proportionally more costly. Getting that phase down to a trivial cost per attempt means you can try dozens of different prompt formulations for very little. That kind of freedom transforms how you approach video content ideation.
the optimization tradeoff is permanent and that is fine
Every model optimization involves an irreversible tradeoff. You cannot un-quantize weights. You cannot restore pruned connections. The information that Pruna removes to achieve faster inference is gone. This means these models will never match their parent models regardless of how good your prompts are or how carefully you configure parameters.
Some people find this uncomfortable. I think it reflects a misunderstanding of how creative tools work in practice. A $20 brush and a $200 brush both make marks on canvas. The expensive one makes more controlled marks with more nuance, but plenty of compelling work gets made with cheap brushes. The tool's ceiling matters less than whether it clears the floor your project requires.
Pruna's Wan models clear the floor for a surprisingly wide range of projects. They do not clear it for everything. Know which category your current project falls into and choose accordingly. The platform makes switching between tiers trivial, which means you are never locked into a decision - you can always upgrade to the full models for specific generations that demand it.
frequently asked questions
how much quality do you actually lose compared to the full wan 2.7 models?
The difference is most visible in video generation, where motion smoothness and fine detail take the biggest hit from optimization. In my experience, the Pruna versions produce output that is roughly 70-80% of the visual quality of the full Wan 2.7 models. That gap is immediately noticeable in side-by-side comparisons but often invisible when viewing the Pruna output in isolation. For images, the gap is narrower since single frames are less demanding than temporal coherence across video. The practical question is whether your audience will ever see both versions next to each other - if not, the quality difference may be irrelevant to your use case.
can I use the image model output as input for the video model?
Yes, and this is one of the more useful workflows available with this model family. Generate a batch of images with wan-image-small, select the ones with promising compositions, then feed them into wan-i2v to produce animated versions. The total cost for this kind of pipeline is trivial. This workflow works well because the image-to-video model is forgiving of input quality variations, and the visual consistency within the Wan model family means outputs from one model animate naturally in the other.
is the flat per-video pricing on text-to-video actually better than per-second pricing?
It depends entirely on your typical clip length. If you mostly generate short 1-2 second clips, per-second pricing like the i2v model uses would be cheaper. But the flat rate shines for longer generations - the effective per-second cost drops as duration increases. The flat pricing also removes the mental friction of optimizing duration, which means you focus on getting the right output rather than minimizing seconds. For most creative workflows where you want clips in the 4-10 second range, the flat rate is the better deal.
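If you want to sanity-check the break-even for your own clip lengths, the arithmetic is trivial. The five-cent flat rate comes from earlier in this piece; the per-second rate below is a placeholder you should replace with current pricing:

FLAT_RATE = 0.05   # flat per-video price cited above
PER_SECOND = 0.02  # assumed per-second rate; substitute the real one

# compare the two billing models across the supported 1-10s range
for seconds in range(1, 11):
    per_second_cost = seconds * PER_SECOND
    cheaper = "flat" if FLAT_RATE < per_second_cost else "per-second"
    print(f"{seconds:2d}s  flat ${FLAT_RATE:.2f}  "
          f"per-second ${per_second_cost:.2f}  -> {cheaper}")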
api reference
about
transform static images into animated videos with text prompts
1. calling the api
install the client
the client provides a convenient way to interact with the api.
pip install inferencesh

setup your api key
set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.
export INFERENCE_API_KEY="inf_your_key"

run and get result
submit a request and wait for the final result. best for batch processing or when you don't need progress updates.
from inferencesh import inference

client = inference()

result = client.run({
    "app": "pruna/wan-i2v",
    "input": {}
})

print(result["output"])

stream live updates
get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "pruna/wan-i2v",
    "input": {}
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication
the api uses api keys for authentication. see the authentication docs for detailed setup instructions.
3. files
file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.
automatic upload
the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.
# local file paths are automatically uploaded
result = client.run({
    "app": "pruna/wan-i2v",
    "input": {
        "image": "/path/to/local/image.png",       # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})

4. webhooks
get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.
result = client.run({
    "app": "pruna/wan-i2v",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload
your endpoint receives a JSON POST with the task result:
{
    "id": "task_abc123",
    "status": 9,
    "output": { ... },
    "error": "",
    "session_id": null,
    "created_at": "2024-01-15T10:30:00Z",
    "updated_at": "2024-01-15T10:30:05Z"
}
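a minimal receiver for this payload might look like the following. the framework choice here (FastAPI) is ours, not the platform's - any server that accepts a JSON POST works:

from fastapi import FastAPI, Request

app = FastAPI()

# minimal endpoint to receive the POST described above
@app.post("/webhook")
async def webhook(request: Request):
    task = await request.json()
    print(task["id"], task["status"])
    if task.get("output"):
        # persist or process the result here
        print(task["output"])
    return {"ok": True}

5. schema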
input
text description for video generation.
input image for image-to-video generation.
video resolution: 480p or 720p.
video duration in seconds (1-10).
random seed for reproducible generation.
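putting the schema together, a full request might look like this. only the image field name is confirmed by the files example above; prompt, resolution, duration, and seed are inferred from the descriptions and should be verified against the app page:

from inferencesh import inference

client = inference()

result = client.run({
    "app": "pruna/wan-i2v",
    "input": {
        # field names inferred from the schema descriptions; verify before use
        "prompt": "gentle breeze moves through the subject's hair",
        "image": "https://example.com/portrait.png",
        "resolution": "720p",  # 480p or 720p
        "duration": 5,         # seconds, 1-10
        "seed": 42,            # reproducible generation
    },
})

print(result["output"])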