
wan-2-7-i2v

Wan 2.7 Image-to-Video generates videos from images using multi-modal input (text, images, audio, video). It supports first-frame generation, first+last-frame interpolation, and video continuation at 720P/1080P resolution.

run with your agent
# install belt
$ curl -fsSL https://cli.inference.sh | sh
# view schema & details
$ belt app get alibaba/wan-2-7-i2v
# run
$ belt app run alibaba/wan-2-7-i2v

Alibaba's Wan 2.7, released in late March 2026 by the Tongyi Lab team, is not a single model. It's four distinct video generation tools that share a Diffusion Transformer architecture with Flow Matching but solve different problems. Text-to-video, image-to-video, reference-to-video, and video editing - each designed for a specific stage of the creative pipeline rather than trying to be a universal solution. I've been running all four through inference.sh for weeks, and what strikes me most is how well they compose together. You can ideate with text-to-video, lock in visual direction with image-to-video, maintain character consistency with reference-to-video, and then refine the result with video editing. It's a production workflow disguised as four API calls.

One of the headline features is what Alibaba calls "Thinking Mode" - the model first deeply understands the prompt, logically plans the composition, then generates the final output. The quality ceiling is genuinely high. Wan 2.7 produces motion that looks physically plausible - not perfect, nothing is yet - but consistent enough that the clips don't immediately scream "AI generated" to a casual viewer. Temporal coherence across frames is strong, especially on subjects with repetitive motion like walking, waves, or machinery. The model handles both 720P and 1080P output, with durations stretching to 15 seconds on most modes. That's enough for a complete social media clip or a meaningful B-roll segment without needing to stitch anything together.

text-to-video as the starting point

The text-to-video model (alibaba/wan-2-7-t2v) is where most people will begin, and it's the simplest entry point. You write a prompt, choose a resolution and duration, and get back an MP4. Durations range from 2 to 15 seconds. The model supports a prompt extension feature that rewrites your input through an LLM before generation - I'd recommend leaving this on for casual exploration and turning it off when you've refined a prompt you're happy with and want precise control.

What makes Wan 2.7's text-to-video competitive is the motion quality at longer durations. Many video models produce impressive 3-second clips but fall apart at 10 or 15 seconds as temporal drift accumulates. Wan handles this better than most. A 15-second shot of a person walking through a market maintains consistent limb movement, correct shadow direction, and stable background geometry throughout. It's not flawless - you'll still see occasional frame-to-frame jitter on fine details like jewelry or text on signs - but the overall coherence is above average for the current generation of models.

The negative prompt parameter does meaningful work here. Unlike image models where negative prompts feel like superstition, video generation benefits substantially from explicit exclusions. Telling the model "no camera shake, no morphing, no extra fingers" actually reduces those failure modes in measurable ways. I use a standard negative prompt for most generations and only remove items when I specifically want them.
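To make that concrete, here's how I wire a standard exclusion list into the input dict the SDK expects. This is a minimal sketch: the helper and the default list are my own, not part of the inferencesh SDK.

```python
# baseline exclusion list reused across generations (my own list, not an
# SDK constant); trim or extend per shot
DEFAULT_NEGATIVE = "camera shake, morphing, extra fingers, warped faces, flicker"

def t2v_input(prompt, negative=DEFAULT_NEGATIVE, drop=()):
    """Build a text-to-video input dict, removing any exclusions you
    deliberately want back in, e.g. drop=("camera shake",)."""
    kept = [t.strip() for t in negative.split(",") if t.strip() not in drop]
    return {"prompt": prompt, "negative_prompt": ", ".join(kept)}

# handheld look wanted here, so let camera shake back in
payload = t2v_input("a handheld shot of waves at dusk", drop=("camera shake",))
```

The resulting dict drops straight into the `input` field of a `client.run` call.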

image-to-video for visual precision

The gap between imagining something and describing it in words is where image-to-video (alibaba/wan-2-7-i2v) becomes essential. You provide a first frame - a photograph, a render, a generated image from any source - and the model animates outward from it. The visual identity of your starting frame persists through the clip in a way that text-to-video simply cannot guarantee.

Wan 2.7's image-to-video mode offers several input configurations that go beyond simple first-frame animation. You can provide both a first frame and a last frame, letting the model interpolate between two known states. This is powerful for creating specific transitions or movements where you need both endpoints defined. You can also feed in a video clip for continuation, extending existing footage seamlessly.
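As a sketch of how these configurations map onto the app's input fields (`first_frame`, `last_frame`, `first_clip` are the schema's names; the helper itself is mine):

```python
def i2v_input(prompt, first_frame=None, last_frame=None,
              driving_audio=None, first_clip=None):
    """Assemble an i2v input from whichever assets you have, using the
    field names from the app's schema; unused modes are simply omitted."""
    fields = {"prompt": prompt, "first_frame": first_frame,
              "last_frame": last_frame, "driving_audio": driving_audio,
              "first_clip": first_clip}
    return {k: v for k, v in fields.items() if v is not None}

# first+last frame interpolation between two known states
payload = i2v_input("the door slowly swings open",
                    first_frame="closed.png", last_frame="open.png")
```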

The driving audio input is particularly interesting. You can provide an audio file and the model will generate video that responds to its characteristics - matching lip movement to speech, syncing action to music beats, or reflecting audio energy in camera movement. This isn't a novelty feature; it's genuinely useful for music video production and dialogue visualization where audio exists before video.

Duration constraints shift slightly depending on mode. First-frame-only generation supports the full 2-15 second range at 720P, but drops to 2-10 seconds at 1080P. First-plus-last-frame mode is capped at 5 seconds. These aren't arbitrary limitations - they reflect the computational cost of maintaining coherence across different generation strategies. Plan accordingly.
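Those mode-dependent caps are easy to encode up front so a bad duration fails before you spend credits. A sketch, with mode labels of my own choosing:

```python
def max_duration(mode, resolution="1080P"):
    """Documented upper duration bounds in seconds; the mode labels
    ("first_frame", "first_last") are my own shorthand."""
    if mode == "first_last":
        return 5  # first+last frame mode is capped at 5 seconds
    if mode == "first_frame":
        return 15 if resolution == "720P" else 10  # 1080P drops to 10s
    raise ValueError(f"unknown mode: {mode!r}")
```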

reference-to-video and the character consistency problem

Character consistency across multiple generated clips is the hardest unsolved problem in AI video. You generate a clip of a person, love how they look, then generate a second clip and get someone entirely different. Reference-to-video (alibaba/wan-2-7-r2v) attacks this problem directly, and while it doesn't solve it perfectly, it gets meaningfully closer than anything else I've used at this price point.

The approach is straightforward: you provide reference images or reference videos of the characters you want to appear, then describe the scene in your prompt. Wan 2.7 can accept up to five mixed references - images, video clips, or audio files - and extracts identity embeddings from all of them simultaneously. A single generation can lock in a character's facial geometry, their voice tone and lip sync, camera movement style from a reference clip, and a specific visual effect, all at once. You can specify multiple characters with different references, creating scenes with consistent multi-person interaction.

Voice timbre cloning adds another dimension. You provide an audio reference for a character's voice, and the output video includes speech that matches that vocal character. This collapses what would otherwise be a multi-step pipeline - generate video, clone voice separately, sync audio to lips - into a single generation pass. The quality of the voice cloning is respectable, though it works best with clean reference audio of at least a few seconds.

The pricing structure here differs from the other models. Reference-to-video is significantly cheaper than the text-to-video or image-to-video endpoints - an order of magnitude less per second. The catch is that billing includes both input reference duration and output video duration, so providing a long reference video adds to the cost. Keep references short and representative.
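A rough way to sanity-check a reference set before submitting. The five-reference cap and the input-plus-output billing are stated in the docs; the field name `references` and the helper are assumptions, so verify against the r2v schema.

```python
MAX_REFERENCES = 5  # documented cap on mixed references per generation

def r2v_request(prompt, references, output_seconds):
    """Validate a reference set and estimate billable seconds; r2v bills
    input reference duration plus output duration. 'references' is a list
    of (url, duration_seconds) pairs."""
    if len(references) > MAX_REFERENCES:
        raise ValueError(f"at most {MAX_REFERENCES} references per call")
    ref_seconds = sum(seconds for _, seconds in references)
    payload = {"app": "alibaba/wan-2-7-r2v",
               "input": {"prompt": prompt,
                         "references": [url for url, _ in references]}}
    return payload, ref_seconds + output_seconds
```

A 30-second reference clip billed alongside a 5-second output costs seven times the seconds of the output alone, which is why short, representative references pay off.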

video editing as the finishing layer

The video editing model (alibaba/wan-2-7-videoedit) completes the family by letting you modify existing footage rather than generating from scratch. You provide a source video and an instruction describing what to change, and the model outputs an edited version. Style transfer, object modification, scene alteration, atmospheric changes - all expressed as natural language instructions.

This is where the family's composability becomes most apparent. Generate a clip with text-to-video, decide you want a different color palette or time of day, and run it through the editor rather than regenerating from scratch. The editor preserves the motion and composition of your source while applying the requested changes. It's faster and cheaper than regeneration, and it maintains the specific motion characteristics you liked about the original.

Reference images work here too. You can provide a style reference image and instruct the model to apply that visual style to your source video. The model handles the temporal application of the style - ensuring consistency across frames rather than applying it independently per-frame like a naive approach would.

Input videos for editing must be between 2 and 10 seconds, with resolution between 240 and 4096 pixels on each side. The audio handling offers three modes: auto (model decides whether to preserve or replace audio), keep_original (pass through the source audio unchanged), or mute. For most editing tasks, keeping the original audio makes sense unless you're doing dramatic style transfer that would make the visual-audio mismatch jarring.
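Those constraints are worth checking client-side before you pay for an upload. A sketch encoding the documented limits:

```python
AUDIO_MODES = {"auto", "keep_original", "mute"}

def validate_edit_source(duration_s, width, height, audio_mode="keep_original"):
    """Check an edit source against the documented limits before upload:
    2-10s duration, 240-4096px per side, one of three audio modes."""
    if not 2 <= duration_s <= 10:
        raise ValueError("source video must be 2-10 seconds")
    if not all(240 <= side <= 4096 for side in (width, height)):
        raise ValueError("each side must be 240-4096 pixels")
    if audio_mode not in AUDIO_MODES:
        raise ValueError(f"audio_mode must be one of {sorted(AUDIO_MODES)}")
    return True
```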

prompt writing that actually works

Wan 2.7 responds well to cinematic language. Specifying camera movement (slow dolly forward, orbital tracking shot, static wide angle), lighting conditions (golden hour, overcast diffused, harsh noon sun), and motion characteristics (slow motion, time-lapse, natural speed) all produce meaningfully different results. The model understands film terminology and translates it into appropriate visual output.

Prompt length matters. Short prompts like "a dog running" produce generic results with default everything. Longer prompts that specify the breed, the surface it's running on, the camera angle, the lighting, the background, and the emotional tone produce substantially better clips. I've found the sweet spot is 2-4 sentences - enough to constrain the important variables without overwhelming the model with conflicting instructions.

Temporal language is particularly effective. Phrases like "the camera slowly reveals," "gradually the light shifts," or "the subject turns to face the viewer" give the model narrative direction that produces more purposeful-feeling clips than static descriptions. You're writing a micro-screenplay, not a photograph description.

The prompt_extend feature rewrites your prompt through an LLM to add detail and cinematic language. It's useful when you have a rough idea but want the model to fill in the visual specifics. Turn it off when you've already written a detailed prompt - the rewrite can sometimes change your intent in unwanted ways. I leave it off by default and only enable it for rapid brainstorming sessions.
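In practice I gate `prompt_extend` on whether I'm still exploring. A trivial sketch (the helper is mine; `prompt_extend` is the schema's field):

```python
def prompt_settings(prompt, brainstorming=False):
    """Enable the LLM rewrite only while exploring; keep it off once the
    prompt is refined so the rewrite can't shift intent."""
    return {"prompt": prompt, "prompt_extend": brainstorming}

draft = prompt_settings("a dog running", brainstorming=True)
final = prompt_settings(
    "a border collie sprinting across wet sand at golden hour, "
    "low tracking shot, shallow depth of field, joyful tone")
```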

pricing across the family

The pricing is competitive and consistent across most of the lineup. Text-to-video, image-to-video, and video editing share the same per-second rate at each resolution tier. Reference-to-video is the outlier - an order of magnitude cheaper per second, which makes it practical to run many iterations when working on character consistency. That's exactly the use case where you'd want to experiment most.

Compared to competing video generation models at similar quality levels, Wan 2.7 sits at the lower end while delivering quality that competes with models priced higher. The economics work out particularly well for production workflows where you're generating dozens of candidates before selecting the best ones.

where the limits are

No model family deserves pure praise, and Wan 2.7 has real constraints. The maximum duration of 15 seconds means longer narrative sequences require stitching clips together, and maintaining consistency between separately generated clips remains challenging even with reference-to-video. The 1080P ceiling means no 4K output for broadcast or large-screen applications.

Fast motion remains a weakness. Rapid camera pans, quick cuts, or subjects moving at speed introduce blur and frame inconsistency that slow, deliberate motion doesn't trigger. If your creative vision involves high-energy action, you'll need to either adjust expectations or plan for multiple shorter clips edited together externally.

Text rendering in generated videos is unreliable. Signage, books, screens showing text - these will almost certainly contain garbled characters. This is a universal limitation across current video models, not specific to Wan, but worth noting if your use case involves legible on-screen text.

building a production workflow

The most effective way to use the Wan 2.7 family isn't as four independent tools but as stages in a pipeline. Start with text-to-video to explore directions quickly and cheaply at 720P. Once you find a direction you like, generate a still frame (or use Wan's image generation models) that nails the visual identity. Feed that into image-to-video for higher-fidelity motion generation with locked visual direction. If you need character consistency across multiple clips, establish references and switch to reference-to-video. Finally, use the editor for color grading, style transfer, or tweaking specific elements without regenerating the entire clip.
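The pipeline can be sketched as a list of staged calls. This is hypothetical staging only: the input field names per app are assumptions, so check each schema with `belt app get` before wiring it up.

```python
def wan_pipeline(idea, still_url, style_note, character_refs=None):
    """Stage the family as a pipeline: ideate (t2v), lock visuals (i2v),
    optionally lock identity (r2v), then refine (videoedit)."""
    stages = [
        ("alibaba/wan-2-7-t2v", {"prompt": idea, "resolution": "720P"}),
        ("alibaba/wan-2-7-i2v", {"prompt": idea, "first_frame": still_url,
                                 "resolution": "1080P"}),
    ]
    if character_refs:  # only when consistency across clips matters
        stages.append(("alibaba/wan-2-7-r2v",
                       {"prompt": idea, "references": character_refs}))
    stages.append(("alibaba/wan-2-7-videoedit", {"instruction": style_note}))
    return stages
```

Each stage's output feeds the next as an asset URL, which is the composability argument in code form.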

Each step narrows the creative space, trading freedom for precision. Text-to-video is maximum freedom, minimum control. Video editing is minimum freedom, maximum control. The family is designed to move between these modes fluidly, and that's where its real value lies compared to using a single model for everything.

how does wan 2.7 compare to seedance or kling for video generation?

Wan 2.7 sits in a strong position for general-purpose video work. Its main advantages are the family approach - having text, image, reference, and editing models that share an architecture - and competitive pricing. Seedance tends to produce slightly more cinematic results on complex human motion but costs more per second. Kling offers longer durations but with less consistent quality. Wan 2.7's sweet spot is production workflows where you need multiple generation modes working together rather than a single best-in-class model for one task.

what resolution and duration should I use for social media content?

For TikTok, Instagram Reels, and YouTube Shorts, use 720P at 9:16 aspect ratio with 5-8 second durations. The 720P resolution is sufficient for mobile viewing and keeps costs low during iteration. Only switch to 1080P for the final render you plan to publish. Most social platforms compress uploaded video anyway, so the perceptual difference between 720P and 1080P on a phone screen is minimal. Save your budget for generating more variations rather than higher resolution on every attempt.
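That iterate-cheap, publish-sharp split is worth baking into a preset (my own helper, not an SDK feature; aspect ratio comes from your input frame or per-model parameters and isn't encoded here):

```python
def social_preset(final_render=False):
    """720P while iterating, 1080P only for the publish render;
    5-8s suits Reels/Shorts, so 6s is a reasonable middle."""
    return {"resolution": "1080P" if final_render else "720P",
            "duration": 6}
```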

can I maintain a consistent character across multiple wan 2.7 clips?

Yes, using the reference-to-video model (alibaba/wan-2-7-r2v). Provide clear, well-lit reference images of your character from multiple angles if possible, then describe different scenes in your prompts. Consistency is best when references show the character's full face and distinctive features clearly. It's not perfect - subtle details like exact hair length or clothing patterns may drift between clips - but for most purposes the character reads as the same person across generations. The low per-second cost makes it practical to generate many candidates and select the most consistent ones.

api reference

about

wan 2.7 image-to-video generates videos from images using multi-modal input (text, images, audio, video). supports first-frame generation, first+last frame, and video continuation at 720p/1080p resolution.

1. calling the api

install the client

the client provides a convenient way to interact with the api.

bash
pip install inferencesh

setup your api key

set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.

bash
export INFERENCE_API_KEY="inf_your_key"

run and get result

submit a request and wait for the final result. best for batch processing or when you don't need progress updates.

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "alibaba/wan-2-7-i2v",
    "input": {}
})

print(result["output"])

stream live updates

get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.

python
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
        "app": "alibaba/wan-2-7-i2v",
        "input": {}
    }, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication

the api uses api keys for authentication. see the authentication docs for detailed setup instructions.

3. files

file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.

automatic upload

the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.

python
# local file paths are automatically uploaded
result = client.run({
    "app": "alibaba/wan-2-7-i2v",
    "input": {
        "image": "/path/to/local/image.png",  # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})

manual upload

you can also upload files manually and use the returned url.

python
# upload and get a hosted URL
file = client.files.upload("/path/to/file.png")
print(file.uri)  # https://cloud.inference.sh/...

4. webhooks

get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.

python
result = client.run({
    "app": "alibaba/wan-2-7-i2v",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload

your endpoint receives a JSON POST with the task result:

json
{
  "id": "task_abc123",
  "status": 9,
  "output": { ... },
  "error": "",
  "session_id": null,
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:05Z"
}
id (string): task id
status (number): terminal status (9=completed, 10=failed, 11=cancelled)
output (object): task output (when completed)
error (string): error message (when failed)
session_id (string): session id (if using sessions)
created_at (string): iso timestamp
updated_at (string): iso timestamp
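A minimal receiver can dispatch on the documented terminal status codes. This is a sketch, not a prescribed handler; wire it into whatever web framework you use.

```python
TERMINAL = {9: "completed", 10: "failed", 11: "cancelled"}

def handle_webhook(payload):
    """Dispatch on the terminal status codes from the webhook payload;
    returns a (state, detail) pair."""
    state = TERMINAL.get(payload.get("status"))
    if state is None:
        raise ValueError(f"unexpected status: {payload.get('status')}")
    if state == "completed":
        return state, payload.get("output")
    if state == "failed":
        return state, payload.get("error")
    return state, None
```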

5. schema

input

prompt (string)

text prompt describing video content. supports chinese and english, up to 5000 characters.

negative_prompt (string)

content to exclude from the video. up to 500 characters.

first_frame (string, file)

first frame image. formats: jpeg, jpg, png, bmp, webp. resolution: 240-8000px. up to 20mb.

last_frame (string, file)

last frame image for first+last frame generation. same format limits as first_frame.

driving_audio (string, file)

audio file for driving video generation (lip-sync, action timing). wav/mp3, 2-30s, up to 15mb.

first_clip (string, file)

video clip for continuation. mp4/mov, 2-10s, 240-4096px, up to 100mb.

resolution (string)

video resolution: 720P or 1080P (default).

default: "1080P"
options: "720P", "1080P"
duration (integer)

video duration in seconds (2-15). for continuation, total output length including input clip.

default: 5, min: 2, max: 15
prompt_extend (boolean)

enable prompt rewriting via llm for better results.

default: true
watermark (boolean)

add 'ai generated' watermark to bottom-right corner.

default: false
seed (integer)

random seed for reproducibility (0-2147483647).

min: 0, max: 2147483647

output

video (string, file, required)

generated video in mp4 format.

output_meta (object)

structured metadata about inputs/outputs for pricing calculation.

