wan-2-7-r2v
Wan 2.7 Reference-to-Video generates videos featuring characters from reference images and videos, supporting multi-character interaction, voice timbre cloning, and first-frame control.
Alibaba's Wan 2.7, released in late March 2026 by the Tongyi Lab team, is not a single model. It's four distinct video generation tools that share a Diffusion Transformer architecture with Flow Matching but solve different problems. Text-to-video, image-to-video, reference-to-video, and video editing - each designed for a specific stage of the creative pipeline rather than trying to be a universal solution. I've been running all four through inference.sh for weeks, and what strikes me most is how well they compose together. You can ideate with text-to-video, lock in visual direction with image-to-video, maintain character consistency with reference-to-video, and then refine the result with video editing. It's a production workflow disguised as four API calls.
One of the headline features is what Alibaba calls "Thinking Mode" - the model first deeply understands the prompt, logically plans the composition, then generates the final output. The quality ceiling is genuinely high. Wan 2.7 produces motion that looks physically plausible - not perfect, nothing is yet - but consistent enough that the clips don't immediately scream "AI generated" to a casual viewer. Temporal coherence across frames is strong, especially on subjects with repetitive motion like walking, waves, or machinery. The model handles both 720P and 1080P output, with durations stretching to 15 seconds on most modes. That's enough for a complete social media clip or a meaningful B-roll segment without needing to stitch anything together.
text-to-video as the starting point
The text-to-video model (alibaba/wan-2-7-t2v) is where most people will begin, and it's the simplest entry point. You write a prompt, choose a resolution and duration, and get back an MP4. Durations range from 2 to 15 seconds. The model supports a prompt extension feature that rewrites your input through an LLM before generation - I'd recommend leaving this on for casual exploration and turning it off when you've refined a prompt you're happy with and want precise control.
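To make that concrete, here is a minimal sketch of a text-to-video call through the inference.sh client, following the request pattern shown in the API reference further down. The input field names (prompt, resolution, duration, prompt_extend) are my assumptions based on the schema descriptions, not confirmed parameter names.

from inferencesh import inference

client = inference()

# minimal text-to-video request; input field names are assumed, not verified
result = client.run({
    "app": "alibaba/wan-2-7-t2v",
    "input": {
        "prompt": "A fishing boat leaves a misty harbor at dawn, slow dolly forward, golden hour light.",
        "resolution": "720p",   # stay at 720p while iterating
        "duration": 8,          # seconds, within the 2-15 range
        "prompt_extend": True   # let the LLM rewrite rough prompts during exploration
    }
})

print(result["output"])  # URL of the generated MP4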
What makes Wan 2.7's text-to-video competitive is the motion quality at longer durations. Many video models produce impressive 3-second clips but fall apart at 10 or 15 seconds as temporal drift accumulates. Wan handles this better than most. A 15-second shot of a person walking through a market maintains consistent limb movement, correct shadow direction, and stable background geometry throughout. It's not flawless - you'll still see occasional frame-to-frame jitter on fine details like jewelry or text on signs - but the overall coherence is above average for the current generation of models.
The negative prompt parameter does meaningful work here. Unlike image models where negative prompts feel like superstition, video generation benefits substantially from explicit exclusions. Telling the model "no camera shake, no morphing, no extra fingers" actually reduces those failure modes in measurable ways. I use a standard negative prompt for most generations and only remove items when I specifically want them.
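The pattern I use is a reusable default exclusion list, shown in the sketch below; the negative_prompt field name is an assumption, and the client object is the one from the earlier example.

# a standard exclusion list reused across generations; remove items only when you want them
DEFAULT_NEGATIVE = "camera shake, morphing, extra fingers, warped faces, garbled text, flickering"

result = client.run({
    "app": "alibaba/wan-2-7-t2v",
    "input": {
        "prompt": "A street market at dusk, steady framing, natural speed.",
        "negative_prompt": DEFAULT_NEGATIVE,  # assumed field name
    }
})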
image-to-video for visual precision
The gap between imagining something and describing it in words is where image-to-video (alibaba/wan-2-7-i2v) becomes essential. You provide a first frame - a photograph, a render, a generated image from any source - and the model animates outward from it. The visual identity of your starting frame persists through the clip in a way that text-to-video simply cannot guarantee.
Wan 2.7's image-to-video mode offers several input configurations that go beyond simple first-frame animation. You can provide both a first frame and a last frame, letting the model interpolate between two known states. This is powerful for creating specific transitions or movements where you need both endpoints defined. You can also feed in a video clip for continuation, extending existing footage seamlessly.
The driving audio input is particularly interesting. You can provide an audio file and the model will generate video that responds to its characteristics - matching lip movement to speech, syncing action to music beats, or reflecting audio energy in camera movement. This isn't a novelty feature; it's genuinely useful for music video production and dialogue visualization where audio exists before video.
Duration constraints shift slightly depending on mode. First-frame-only generation supports the full 2-15 second range at 720P, but drops to 2-10 seconds at 1080P. First-plus-last-frame mode is capped at 5 seconds. These aren't arbitrary limitations - they reflect the computational cost of maintaining coherence across different generation strategies. Plan accordingly.
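A sketch of a first-plus-last-frame generation with a driving audio track follows; the field names (first_frame, last_frame, audio) and file paths are assumptions for illustration, and local paths are uploaded automatically by the SDK as described in the API reference.

# first-plus-last-frame interpolation with a driving audio track (field names assumed)
result = client.run({
    "app": "alibaba/wan-2-7-i2v",
    "input": {
        "prompt": "The dancer rises from a crouch into a spin as the beat drops.",
        "first_frame": "/renders/dancer_crouch.png",  # local files are uploaded automatically
        "last_frame": "/renders/dancer_spin.png",
        "audio": "/audio/beat_drop.wav",              # motion follows the audio energy
        "duration": 5,                                # first+last frame mode caps at 5 seconds
        "resolution": "720p"
    }
})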
reference-to-video and the character consistency problem
Character consistency across multiple generated clips is the hardest unsolved problem in AI video. You generate a clip of a person, love how they look, then generate a second clip and get someone entirely different. Reference-to-video (alibaba/wan-2-7-r2v) attacks this problem directly, and while it doesn't solve it perfectly, it gets meaningfully closer than anything else I've used at this price point.
The approach is straightforward: you provide reference images or reference videos of the characters you want to appear, then describe the scene in your prompt. Wan 2.7 accepts up to five mixed references - images and video clips - plus an audio file for voice timbre, and extracts identity embeddings from all of them simultaneously. A single generation can lock in a character's facial geometry, their voice tone and lip sync, camera movement style from a reference clip, and a specific visual effect, all at once. You can specify multiple characters with different references, creating scenes with consistent multi-person interaction.
Voice timbre cloning adds another dimension. You provide an audio reference for a character's voice, and the output video includes speech that matches that vocal character. This collapses what would otherwise be a multi-step pipeline - generate video, clone voice separately, sync audio to lips - into a single generation pass. The quality of the voice cloning is respectable, though it works best with clean reference audio of at least a few seconds.
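Here is a sketch of a multi-reference request combining character images, a camera-style video, and a voice clip. The 'image 1' / 'video 1' prompt convention comes from the schema below; the field names (images, videos, audio) and paths are assumptions.

# two characters from image references, camera style from a video reference,
# and a cloned voice from a short audio clip (field names assumed)
result = client.run({
    "app": "alibaba/wan-2-7-r2v",
    "input": {
        "prompt": "image 1 and image 2 argue across a diner table; video 1 sets the handheld camera style.",
        "images": ["/refs/anna_face.png", "/refs/marco_face.png"],  # one subject per image
        "videos": ["/refs/handheld_diner.mp4"],                     # 1-30s reference clip
        "audio": "/refs/anna_voice.wav",                            # 1-10s clean speech for timbre
        "duration": 8,
        "resolution": "720p"
    }
})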
The pricing structure here differs from the other models. Reference-to-video is significantly cheaper than the text-to-video or image-to-video endpoints - an order of magnitude less per second. The catch is that billing includes both input reference duration and output video duration, so providing a long reference video adds to the cost. Keep references short and representative.
video editing as the finishing layer
The video editing model (alibaba/wan-2-7-videoedit) completes the family by letting you modify existing footage rather than generating from scratch. You provide a source video and an instruction describing what to change, and the model outputs an edited version. Style transfer, object modification, scene alteration, atmospheric changes - all expressed as natural language instructions.
This is where the family's composability becomes most apparent. Generate a clip with text-to-video, decide you want a different color palette or time of day, and run it through the editor rather than regenerating from scratch. The editor preserves the motion and composition of your source while applying the requested changes. It's faster and cheaper than regeneration, and it maintains the specific motion characteristics you liked about the original.
Reference images work here too. You can provide a style reference image and instruct the model to apply that visual style to your source video. The model handles the temporal application of the style - ensuring consistency across frames rather than applying it independently per-frame like a naive approach would.
Input videos for editing must be between 2 and 10 seconds, with resolution between 240 and 4096 pixels on each side. The audio handling offers three modes: auto (model decides whether to preserve or replace audio), keep_original (pass through the source audio unchanged), or mute. For most editing tasks, keeping the original audio makes sense unless you're doing dramatic style transfer that would make the visual-audio mismatch jarring.
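An editing call might look like the sketch below. The app slug and the three audio modes come from the text above; the field names (video, prompt, style_image, audio_mode) are assumptions.

# edit an existing clip in place rather than regenerating it (field names assumed)
result = client.run({
    "app": "alibaba/wan-2-7-videoedit",
    "input": {
        "video": "/clips/market_walk_v3.mp4",         # 2-10s source, 240-4096px per side
        "prompt": "Shift the scene to golden hour with warm rim lighting; keep the motion unchanged.",
        "style_image": "/refs/kodak_gold_still.png",  # optional style reference
        "audio_mode": "keep_original"                 # auto / keep_original / mute
    }
})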
prompt writing that actually works
Wan 2.7 responds well to cinematic language. Specifying camera movement (slow dolly forward, orbital tracking shot, static wide angle), lighting conditions (golden hour, overcast diffused, harsh noon sun), and motion characteristics (slow motion, time-lapse, natural speed) all produce meaningfully different results. The model understands film terminology and translates it into appropriate visual output.
Prompt length matters. Short prompts like "a dog running" produce generic results with default everything. Longer prompts that specify the breed, the surface it's running on, the camera angle, the lighting, the background, and the emotional tone produce substantially better clips. I've found the sweet spot is 2-4 sentences - enough to constrain the important variables without overwhelming the model with conflicting instructions.
Temporal language is particularly effective. Phrases like "the camera slowly reveals," "gradually the light shifts," or "the subject turns to face the viewer" give the model narrative direction that produces more purposeful-feeling clips than static descriptions. You're writing a micro-screenplay, not a photograph description.
The prompt_extend feature rewrites your prompt through an LLM to add detail and cinematic language. It's useful when you have a rough idea but want the model to fill in the visual specifics. Turn it off when you've already written a detailed prompt - the rewrite can sometimes change your intent in unwanted ways. I leave it off by default and only enable it for rapid brainstorming sessions.
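The structure I keep coming back to is sketched below: subject and setting, camera, light, then a line of temporal direction, with the LLM rewrite disabled because the prompt is already detailed. The field names are the same assumptions as in the earlier sketches.

# the 2-4 sentence structure: subject + setting, camera, light, temporal direction
prompt = (
    "A border collie sprints along a wet boardwalk at dawn, gulls scattering ahead of it. "
    "Low tracking shot just above the planks, shallow depth of field. "
    "Golden hour light with long soft shadows. "
    "The camera slowly pulls ahead as the dog closes the distance."
)

result = client.run({
    "app": "alibaba/wan-2-7-t2v",
    "input": {
        "prompt": prompt,
        "prompt_extend": False  # the prompt is already detailed, skip the LLM rewrite
    }
})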
pricing across the family
The pricing is competitive and consistent across most of the lineup. Text-to-video, image-to-video, and video editing share the same per-second rate at each resolution tier. Reference-to-video is the outlier - an order of magnitude cheaper per second, which makes it practical to run many iterations when working on character consistency. That's exactly the use case where you'd want to experiment most.
Compared to competing video generation models at similar quality levels, Wan 2.7 sits at the lower end of the price range while delivering quality that competes with models priced higher. The economics work out particularly well for production workflows where you're generating dozens of candidates before selecting the best ones.
where the limits are
No model family deserves pure praise, and Wan 2.7 has real constraints. The maximum duration of 15 seconds means longer narrative sequences require stitching clips together, and maintaining consistency between separately generated clips remains challenging even with reference-to-video. The 1080P ceiling means no 4K output for broadcast or large-screen applications.
Fast motion remains a weakness. Rapid camera pans, quick cuts, or subjects moving at speed introduce blur and frame inconsistency that slow, deliberate motion doesn't trigger. If your creative vision involves high-energy action, you'll need to either adjust expectations or plan for multiple shorter clips edited together externally.
Text rendering in generated videos is unreliable. Signage, books, screens showing text - these will almost certainly contain garbled characters. This is a universal limitation across current video models, not specific to Wan, but worth noting if your use case involves legible on-screen text.
building a production workflow
The most effective way to use the Wan 2.7 family isn't as four independent tools but as stages in a pipeline. Start with text-to-video to explore directions quickly and cheaply at 720P. Once you find a direction you like, generate a still frame (or use Wan's image generation models) that nails the visual identity. Feed that into image-to-video for higher-fidelity motion generation with locked visual direction. If you need character consistency across multiple clips, establish references and switch to reference-to-video. Finally, use the editor for color grading, style transfer, or tweaking specific elements without regenerating the entire clip.
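Compressed into code, the pipeline looks roughly like this. Prompts and file paths are placeholders, the field names are the same assumptions used earlier, and I'm assuming the output value from one call can be passed as the video input to the next (it's returned as a URL in the earlier examples).

# a compressed sketch of the four-stage pipeline; everything concrete here is a placeholder
rough_idea = "a courier cycles through neon-lit rain, low tracking shot"

# 1. explore directions cheaply at 720p
draft = client.run({"app": "alibaba/wan-2-7-t2v",
                    "input": {"prompt": rough_idea, "resolution": "720p", "duration": 5}})

# 2. lock visual identity by animating a chosen still frame
locked = client.run({"app": "alibaba/wan-2-7-i2v",
                     "input": {"prompt": rough_idea + ", heavier rain, slower pace",
                               "first_frame": "/stills/courier_keyframe.png"}})

# 3. keep the same character across additional scenes
scene2 = client.run({"app": "alibaba/wan-2-7-r2v",
                     "input": {"prompt": "image 1 locks the bike and ducks into a noodle bar",
                               "images": ["/refs/courier_face.png"]}})

# 4. finish by editing rather than regenerating
final = client.run({"app": "alibaba/wan-2-7-videoedit",
                    "input": {"video": scene2["output"],
                              "prompt": "cooler color grade with light film grain",
                              "audio_mode": "keep_original"}})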
Each step narrows the creative space, trading freedom for precision. Text-to-video is maximum freedom, minimum control. Video editing is minimum freedom, maximum control. The family is designed to move between these modes fluidly, and that's where its real value lies compared to using a single model for everything.
how does wan 2.7 compare to seedance or kling for video generation?
Wan 2.7 sits in a strong position for general-purpose video work. Its main advantages are the family approach - having text, image, reference, and editing models that share an architecture - and competitive pricing. Seedance tends to produce slightly more cinematic results on complex human motion but costs more per second. Kling offers longer durations but with less consistent quality. Wan 2.7's sweet spot is production workflows where you need multiple generation modes working together rather than a single best-in-class model for one task.
what resolution and duration should I use for social media content?
For TikTok, Instagram Reels, and YouTube Shorts, use 720P at 9:16 aspect ratio with 5-8 second durations. The 720P resolution is sufficient for mobile viewing and keeps costs low during iteration. Only switch to 1080P for the final render you plan to publish. Most social platforms compress uploaded video anyway, so the perceptual difference between 720P and 1080P on a phone screen is minimal. Save your budget for generating more variations rather than higher resolution on every attempt.
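In practice that means two presets, one for iteration and one for the final render, as sketched below with the same assumed field names.

# iteration vs. final-render presets for vertical social clips (field names assumed)
ITERATE = {"resolution": "720p",  "aspect_ratio": "9:16", "duration": 6}
PUBLISH = {"resolution": "1080p", "aspect_ratio": "9:16", "duration": 6}

result = client.run({
    "app": "alibaba/wan-2-7-t2v",
    "input": {"prompt": "Espresso pour in macro, steam curling upward, static close-up.", **ITERATE}
})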
can I maintain a consistent character across multiple wan 2.7 clips?
Yes, using the reference-to-video model (alibaba/wan-2-7-r2v). Provide clear, well-lit reference images of your character from multiple angles if possible, then describe different scenes in your prompts. Consistency is best when references show the character's full face and distinctive features clearly. It's not perfect - subtle details like exact hair length or clothing patterns may drift between clips - but for most purposes the character reads as the same person across generations. The low per-second cost makes it practical to generate many candidates and select the most consistent ones.
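A loop that reuses one reference set across several scenes is the simplest way to do this; the sketch below assumes the same field names as earlier and uses placeholder reference images.

# reuse the same reference set across scenes so the character stays recognizable
character_refs = ["/refs/mira_front.png", "/refs/mira_profile.png", "/refs/mira_threequarter.png"]

scenes = [
    "image 1 reads a letter by a rain-streaked window, static medium shot",
    "image 1 walks through a crowded night market, slow tracking shot",
    "image 1 laughs across a cafe table, handheld close-up",
]

clips = []
for scene in scenes:
    result = client.run({
        "app": "alibaba/wan-2-7-r2v",
        "input": {"prompt": scene, "images": character_refs, "duration": 6}
    })
    clips.append(result["output"])  # pick the most consistent candidates afterwards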
api reference
about
wan 2.7 reference-to-video generates videos featuring characters from reference images and videos, supporting multi-character interaction, voice timbre cloning, and first-frame control
1. calling the api
install the client
the client provides a convenient way to interact with the api.
pip install inferencesh

setup your api key
set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.
export INFERENCE_API_KEY="inf_your_key"

run and get result
submit a request and wait for the final result. best for batch processing or when you don't need progress updates.
from inferencesh import inference

client = inference()

result = client.run({
    "app": "alibaba/wan-2-7-r2v",
    "input": {}
})

print(result["output"])

stream live updates
get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "alibaba/wan-2-7-r2v",
    "input": {}
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication
the api uses api keys for authentication. see the authentication docs for detailed setup instructions.
3. files
file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.
automatic upload
the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.
# local file paths are automatically uploaded
result = client.run({
    "app": "alibaba/wan-2-7-r2v",
    "input": {
        "image": "/path/to/local/image.png",       # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})

4. webhooks
get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.
result = client.run({
    "app": "alibaba/wan-2-7-r2v",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload
your endpoint receives a JSON POST with the task result:
{
    "id": "task_abc123",
    "status": 9,
    "output": { ... },
    "error": "",
    "session_id": null,
    "created_at": "2024-01-15T10:30:00Z",
    "updated_at": "2024-01-15T10:30:05Z"
}

5. schema
input
prompt: text prompt describing the video scene. use 'image 1', 'image 2' to reference images and 'video 1', 'video 2' to reference videos in order of the media arrays. up to 5000 characters.
negative prompt: content to exclude from the video. up to 500 characters.
reference images: images for characters/objects/scenes. each must contain a single subject. images + videos <= 5 total.
reference videos: videos for characters and voice timbre. mp4/mov, 1-30s, up to 100mb each. avoid empty-scene videos.
first frame: first frame image for precise scene composition. max 1.
audio: audio for voice timbre reference. wav/mp3, 1-10s, up to 15mb. overrides reference video audio.
resolution: video resolution, 720p or 1080p (default).
aspect ratio: ignored if first_frame is provided (uses frame ratio instead).
duration: video duration in seconds. 2-10 if reference videos provided, 2-15 otherwise.
prompt extend: enable prompt rewriting via llm for better results.
watermark: add 'ai generated' watermark to bottom-right corner.
seed: random seed for reproducibility (0-2147483647).
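As a sketch, a fully populated request body covering the inputs above might look like this; the field names are inferred from the descriptions and the earlier examples, not confirmed by the schema, and the file paths are placeholders.

# illustrative request covering the inputs above (field names inferred, not confirmed)
result = client.run({
    "app": "alibaba/wan-2-7-r2v",
    "input": {
        "prompt": "image 1 greets image 2 at the station as video 1's crane shot descends.",
        "negative_prompt": "camera shake, morphing, extra fingers",
        "images": ["/refs/person_a.png", "/refs/person_b.png"],  # images + videos <= 5 total
        "videos": ["/refs/crane_shot.mp4"],
        "audio": "/refs/person_a_voice.wav",                     # overrides reference video audio
        "resolution": "1080p",
        "duration": 8,                                           # 2-10s when reference videos are used
        "prompt_extend": True,
        "watermark": False,
        "seed": 1234
    }
})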