wan-2-7-image-pro
Wan 2.7 Image Pro is Alibaba's professional image generation model supporting text-to-image, image editing, and multi-reference generation with up to 4K high-definition output
Alibaba's Tongyi Lab has been building image generation models for years, mostly for their own ecosystem. Wan 2.7 Image - released in April 2026 as the latest major upgrade to the Wanxiang series - is what happens when that internal investment gets packaged for external consumption. Available on inference.sh in both standard and Pro variants, it occupies an interesting middle ground in the current market - cheaper than the premium models that dominate conversations, more capable than the budget options that cut too many corners. The pricing sits in a tier that forces you to actually evaluate the output rather than dismissing it on cost alone.
I have spent time with both variants, and my take is nuanced. Wan 2.7 is genuinely good at certain things - Asian aesthetics, detailed photography-style compositions, scenes with lots of visual information - and mediocre at others. It is not a universal replacement for anything. But for specific workflows, particularly those involving cultural content, product photography, and high-density visual scenes, it produces output that surprised me.
the two-tier system explained
The standard Wan 2.7 Image model outputs up to 2K resolution (2048x2048). The Pro variant pushes output to 4K resolution (up to 4096x4096) at a higher price point. Both share the same input interface - text prompt, optional reference images, thinking mode, batch generation up to four images. The difference is purely in output fidelity and the internal model capacity dedicated to each generation.
This is a sensible split. Most use cases do not require 4K output. Social media posts, web assets, design explorations, content illustrations - these live at 1K or 2K comfortably. You only need Pro when the final output will be viewed at large scale or when you need the additional coherence that comes with a more computationally intensive generation pass. I find myself using standard for 80% of work and switching to Pro only when I know the result needs to survive close inspection.
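If you drive this from the inference.sh client (covered in the api reference below), the switch is a single input field. A minimal sketch - the `size` field name is my assumption based on the schema section; the '1k'/'2k'/'4k' values are documented there:

```python
from inferencesh import inference

client = inference()

# default to '2k' for screen work; reach for '4k' only when output must
# survive close inspection ('4k' applies to text-to-image only per the schema)
result = client.run({
    "app": "alibaba/wan-2-7-image-pro",
    "input": {
        "prompt": "Editorial photo of a ceramics studio, natural window light",  # assumed field name
        "size": "2k",  # '1k', '2k' (default), '4k', or explicit dimensions like '1024*768'
    }
})
print(result["output"])
```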
In terms of positioning, Wan standard sits between the budget tier and the premium tier, while Wan Pro competes directly with mid-range options from Google and OpenAI.
where wan 2.7 genuinely excels
Every model has a signature strength, and Wan 2.7's is visual density. Give it a prompt describing a crowded market scene, a detailed interior with lots of objects, or a landscape packed with visual information, and it handles the complexity with unusual grace. Where FLUX Dev might simplify or smear background elements, Wan 2.7 tends to render them with individual attention. This is not a subtle difference - in scenes with ten or more distinct objects, the gap is noticeable.
The model also shows clear strength with East Asian visual aesthetics. Traditional architecture, calligraphy-influenced compositions, ink wash painting styles, scenes involving Chinese or Japanese cultural elements - these feel native rather than approximated. I suspect the training data heavily represented this content, and it shows. If your work involves Asian markets, cultural content, or any aesthetic that draws from that tradition, Wan 2.7 should be on your shortlist regardless of what else you are using.
Another standout feature is precise color control - the model accepts HEX codes and color palette specifications for brand-accurate visuals. If you need output that matches a specific brand color system, you can specify exact values rather than hoping the model interprets "corporate blue" the way you intend.
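A sketch of what that looks like in a request - the `prompt` field name is assumed from the schema below, and the palette values are placeholders, not a real brand system:

```python
from inferencesh import inference

client = inference()

# exact HEX values instead of hoping "corporate blue" lands the way you intend
result = client.run({
    "app": "alibaba/wan-2-7-image-pro",
    "input": {
        "prompt": (  # "prompt" is an assumed field name; see the schema section
            "Minimal product banner for a coffee brand, "
            "background #1B3A4B, accent elements #E8A13D, headline text in #F5F1E8"
        )
    }
})
```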
Photography-style output is another area where the model punches above its price point. Prompts that specify camera characteristics, lighting setups, and photographic techniques get interpreted with sophistication. Asking for "overhead flat lay product photography, soft diffused lighting, white marble surface" produces results that look like they came from a product shoot rather than a generation pipeline. The model understands depth of field, bokeh characteristics, and the way different lens lengths compress perspective.
thinking mode and what it actually does
Both variants support a thinking mode toggle that enables internal reasoning before generation begins. This is not marketing fluff - it produces measurably different results on complex prompts. When you have a prompt describing multiple subjects with specific spatial relationships, color requirements, and stylistic constraints, thinking mode helps the model plan the composition before committing pixels.
The tradeoff is generation time. Thinking mode adds latency, and on prompts that are already simple and direct, it adds nothing of value. I use it selectively: for prompts with four or more distinct elements that need to coexist in specific arrangements, or when I have tried a prompt without thinking mode and gotten confused compositions. For single-subject prompts or straightforward scene descriptions, skip it.
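A sketch of the selective use I describe - the toggle's field name here is my guess, since the schema below only describes an "enable thinking mode" boolean without naming it:

```python
from inferencesh import inference

client = inference()

# four distinct elements with spatial constraints: worth the extra latency
result = client.run({
    "app": "alibaba/wan-2-7-image-pro",
    "input": {
        "prompt": (
            "Three musicians on a rooftop at dusk: drummer on the left, "
            "bassist center on a wooden crate, guitarist on the right, "
            "city skyline behind them"
        ),
        "enable_thinking": True,  # assumed name; skip for single-subject prompts
    }
})
```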
reference images and editing workflows
Both models accept up to nine reference images, which opens up editing and guided generation workflows that pure text-to-image models cannot touch. You can provide a product photo and ask for it placed in a lifestyle context. You can supply style reference images and get generations that match that visual language. You can pass in multiple reference angles of a subject and generate new views.
This capability puts Wan 2.7 in a different category from FLUX Dev, which has no reference image support at all. It competes more directly with GPT Image 2's editing mode and Gemini Flash's image understanding, though the interface is different. Rather than masking and inpainting, Wan 2.7 treats references as compositional guidance - it synthesizes new images informed by the references rather than surgically modifying existing ones.
The reference system works best when you provide clear, well-lit source images and pair them with prompts that explicitly describe what you want done with those references. Vague prompts with reference images produce unpredictable results. Specific prompts with good references produce surprisingly controlled output.
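Here is a hedged sketch of a reference-driven request. Field names are assumptions based on the schema section, and the sdk uploads local paths automatically (see the files section below):

```python
from inferencesh import inference

client = inference()

# one product shot plus two style references; array order defines image order
result = client.run({
    "app": "alibaba/wan-2-7-image-pro",
    "input": {
        "prompt": (
            "Place the product from the first image on a sunlit cafe table, "
            "matching the warm film look of the other two references"
        ),
        "images": [  # assumed field name; up to 9 references
            "/path/to/product.png",             # local path, uploaded by the sdk
            "https://example.com/style-a.jpg",  # urls passed through as-is
            "https://example.com/style-b.jpg",
        ],
    }
})
```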
prompt tips that actually matter
After generating several hundred images across both variants, I have opinions about what works.
Be specific about lighting. Wan 2.7 responds strongly to lighting descriptions - "golden hour side lighting," "overhead fluorescent," "single point source from upper left" all produce distinctly different results. Generic prompts without lighting information tend to default to flat, even illumination that looks stock-photo-ish.
Front-load the subject. The model weights earlier tokens more heavily than later ones. Put your main subject and its key characteristics in the first clause, then add environmental and stylistic details after. "A weathered fishing boat with peeling blue paint, docked at a misty morning harbor, shot from water level" works better than burying the boat description after paragraphs of atmosphere.
Use photography terminology when you want photorealistic output. Specifying "shot on 85mm f/1.4" or "wide angle lens distortion" or "macro photography, 1:1 magnification" gives the model concrete visual parameters to work with rather than leaving it to guess what "photorealistic" means in context.
For the Pro model specifically, detail your texture expectations. Pro has the resolution to render fine textures meaningfully, so prompts that mention "visible fabric weave," "individual hair strands," or "water droplet surface tension" get rewarded with corresponding detail. The standard model will attempt these but cannot resolve them at 2K the way Pro does at 4K.
Avoid prompt overloading. Both models handle complex prompts better than most competitors, but there is still a point where adding more descriptors starts contradicting earlier ones. I try to keep prompts under 100 words and use thinking mode instead of prompt length to handle complexity.
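To pull these together, here is an illustrative prompt assembled tip by tip - an example of the structure, not a tested recipe:

```python
# subject and key traits first, then environment, then lighting, then lens
prompt = (
    "A weathered fishing boat with peeling blue paint, "          # front-loaded subject
    "docked at a misty morning harbor, shot from water level, "   # environment and viewpoint
    "golden hour side lighting, "                                 # explicit lighting
    "shot on 85mm f/1.4, shallow depth of field"                  # photography terminology
)
# well under 100 words; let thinking mode, not prompt length, carry complexity
```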
honest comparison with the competition
Against FLUX Dev: FLUX Dev wins decisively on cost and is perfectly adequate for simple, single-subject generations. Wan 2.7 standard costs more but produces noticeably better results on complex scenes, handles reference images, and offers thinking mode. If your workflow is bulk generation of straightforward images, FLUX Dev remains the rational choice. If you need more sophisticated compositions or editing capabilities, the price premium buys real capability.
Against Gemini Flash Image: Similar price tier to Wan Pro. Both models handle text rendering competently - Wan 2.7 supports 12 languages natively, while Gemini brings Google's search grounding for factual accuracy in generated scenes. Wan Pro counters with a higher maximum resolution (4K output that Gemini does not match), stronger performance on dense visual scenes, and better handling of non-Western aesthetics. Gemini is more versatile; Wan Pro is more precise in its areas of strength.
Against GPT Image 2: At comparable price points, GPT Image 2 produces more universally polished output. Wan 2.7's text rendering in 12 languages is competitive, though GPT Image 2 still edges ahead on English typography consistency. Wan 2.7 offers better value on batch generation (four images per request at the same per-image cost), superior performance on Asian aesthetic content, and a reference image system with editing flexibility that GPT Image 2's approach does not match. For volume work with Asian aesthetics or reference-driven generation, Wan 2.7 is the better tool.
what it cannot do
Text rendering is actually one of Wan 2.7's strengths - a significant departure from earlier versions. The model supports long-text rendering across 12 languages, handling signs, labels, tables, and even formulas with solid accuracy. It handles prompts of over 3,000 tokens, which means detailed typographic instructions get parsed rather than truncated. For signage, posters, and labeled diagrams, Wan 2.7 is a legitimate option alongside Qwen Image 2 Pro. Where it still falls short compared to Qwen is in the most complex information-design layouts - dense multi-section infographics with hierarchical typography.
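For instance, a signage-style prompt of this shape (illustrative wording only):

```python
# long typographic instructions get parsed rather than truncated
prompt = (
    "Storefront poster for a bakery, headline 'Fresh Every Morning' in bold serif, "
    "a subheading in simplified Chinese beneath it, "
    "and a three-row price table at the bottom listing item names and prices"
)
```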
The model does not support inpainting or masked editing. Reference images guide new generations rather than surgically modifying existing ones. If you need to change one element of an existing image while preserving everything else, Wan 2.7 is the wrong tool.
Consistency across multiple generations of the same character or subject is not guaranteed, even with seed locking. You can get close by combining seed control with detailed descriptions and reference images, but true character consistency for sequential storytelling requires purpose-built solutions that Wan 2.7 does not provide.
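A sketch of that combination - field names are again assumptions from the schema below:

```python
from inferencesh import inference

client = inference()

# seed + detailed description + reference image gets close to consistency,
# but does not guarantee the same character across generations
result = client.run({
    "app": "alibaba/wan-2-7-image-pro",
    "input": {
        "prompt": (
            "Portrait of the woman from the reference image, short silver hair, "
            "green raincoat, standing on a subway platform"
        ),
        "images": ["/path/to/character_ref.png"],
        "seed": 421337,  # 0-2147483647; same seed yields similar, not identical, output
    }
})
```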
The Pro model generates more slowly than competitors at similar price points. The 4K output takes time to render, and thinking mode adds further latency. For interactive applications where users are waiting, the standard model with thinking mode disabled provides the most responsive experience.
when to choose wan 2.7
The decision tree is straightforward. Choose Wan 2.7 standard when you need better quality than FLUX Dev provides, your content involves complex scenes or Asian aesthetics, and you want reference image capability without paying premium prices. Choose Pro when the output needs to survive scrutiny at large display sizes or print, or when you need maximum detail fidelity and can tolerate the additional cost and latency.
Do not choose either variant if you need character consistency across a narrative sequence, or if your use case is simple enough that FLUX Dev handles it adequately. Paying six times more for output that looks the same on a social media thumbnail is waste, not quality.
The sequential image set feature deserves mention for specific workflows. Enabling it produces coordinated sets of related images - useful for storyboard creation, product photography series, or any application where you need visual coherence across a batch. Combined with reference images, this creates a lightweight production pipeline for visual content that would otherwise require manual coordination across multiple generation passes.
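A sketch of a storyboard batch - the toggle and count field names are assumptions; the schema notes the per-request maximum rises to 12 when image set mode is enabled:

```python
from inferencesh import inference

client = inference()

# coordinated four-frame storyboard in one request
result = client.run({
    "app": "alibaba/wan-2-7-image-pro",
    "input": {
        "prompt": (
            "Storyboard of a bicycle courier: leaving the depot at dawn, "
            "crossing a fog-covered bridge, weaving through midday traffic, "
            "arriving at a glass office tower"
        ),
        "image_set": True,  # assumed name for the image set toggle
        "num_images": 4,    # assumed name; up to 12 with image set mode on
    }
})
```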
is wan 2.7 image better than flux dev?
Not universally, no. FLUX Dev remains the better choice for high-volume, straightforward generation where cost efficiency is the primary concern. Wan 2.7 is better at complex multi-element scenes, reference-based generation, and content with East Asian visual aesthetics. The two models serve different positions in a generation pipeline - FLUX Dev for cheap exploration and simple assets, Wan 2.7 for compositions that demand more visual intelligence. The significant price difference means you should only reach for Wan 2.7 when you genuinely need what it offers beyond FLUX Dev's capabilities.
should I use standard or pro?
Use standard for anything destined for screens at normal viewing distance - web content, social media, email campaigns, design mockups. Switch to Pro when output will be displayed at large physical sizes, printed, or scrutinized closely by art directors. The per-image difference between standard and Pro is meaningful at volume, so defaulting to standard and upgrading selectively keeps costs rational. Pro's 4K output is genuinely impressive for photography-style content, but that resolution is wasted on a 400-pixel-wide blog thumbnail.
how does thinking mode affect quality?
Thinking mode makes a real difference on prompts with four or more distinct elements that need specific spatial arrangements. For a prompt like "three people at a dinner table, one standing, candles between them, window behind showing rain" - thinking mode helps the model plan the composition logically before generating. On simple prompts like "a red car on a highway" it adds latency without visible improvement. I recommend leaving it off by default and enabling it when you notice compositional confusion in outputs, or when your prompt reads like a paragraph rather than a phrase.
api reference
about
wan 2.7 image pro is alibaba's professional image generation model supporting text-to-image, image editing, and multi-reference generation with up to 4k high-definition output
1. calling the api
install the client
the client provides a convenient way to interact with the api.
```bash
pip install inferencesh
```

setup your api key
set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.
```bash
export INFERENCE_API_KEY="inf_your_key"
```

run and get result
submit a request and wait for the final result. best for batch processing or when you don't need progress updates.
```python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "alibaba/wan-2-7-image-pro",
    "input": {}
})

print(result["output"])
```

stream live updates
get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.
```python
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "alibaba/wan-2-7-image-pro",
    "input": {}
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")
```

2. authentication
the api uses api keys for authentication. see the authentication docs for detailed setup instructions.
3. files
file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.
automatic upload
the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.
```python
# local file paths are automatically uploaded
result = client.run({
    "app": "alibaba/wan-2-7-image-pro",
    "input": {
        "image": "/path/to/local/image.png",       # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})
```

4. webhooks
get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.
```python
result = client.run({
    "app": "alibaba/wan-2-7-image-pro",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)
```

webhook payload
your endpoint receives a JSON POST with the task result:
```json
{
  "id": "task_abc123",
  "status": 9,
  "output": { ... },
  "error": "",
  "session_id": null,
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:05Z"
}
```
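If you want a concrete receiver for this payload, here is a minimal sketch using Flask (my choice of framework, not something the api requires):

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def handle_webhook():
    task = request.get_json()
    # fires on terminal states: completed, failed, or cancelled
    if task.get("error"):
        print(f"task {task['id']} failed: {task['error']}")
    else:
        print(f"task {task['id']} done, output: {task['output']}")
    return "", 200

if __name__ == "__main__":
    app.run(port=8000)  # expose this publicly and pass its url as "webhook"
```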
5. schema

input
- text prompt describing what to generate or edit. supports chinese and english, up to 5000 characters. for editing, provide reference images.
- reference images for editing (0-9 images). order in array defines image order.
- number of images to generate (1-4). when image set mode is enabled, max is 12.
- output resolution: '1k' (1024x1024), '2k' (2048x2048, default), '4k' (4096x4096, text-to-image only). or specify pixel dimensions like '1024*768'.
- add 'ai generated' watermark to bottom-right corner.
- enable thinking mode for better quality. only effective for text-to-image without image input or image set mode.
- enable image set output mode for generating consistent multi-image sets (e.g., same character in different scenes).
- random seed for reproducibility (0-2147483647). same seed yields similar outputs.