
qwen-image

Advanced text-to-image generation with optional LoRA weights and prompt enhancement

run with your agent
# install belt
$ curl -fsSL https://cli.inference.sh | sh
# view schema & details
$ belt app get pruna/qwen-image
# run
$ belt app run pruna/qwen-image

Alibaba's Qwen-Image foundation model - a 20B parameter MMDiT (multimodal diffusion transformer) - arrived with strong fundamentals: excellent text rendering, solid prompt comprehension, and that unmistakable aesthetic sensibility that produces images with a different character than what you get from Western-trained models. Pruna, the Munich-based AI optimization startup, took those foundations and applied their compression and optimization expertise to create three distinct tiers: a premium generator, a budget speed demon, and a dedicated image editor. The result is a family that covers the full spectrum from quick drafts to polished final output, all running through inference.sh.

I've been running these models head-to-head against the Alibaba-native Qwen Image 2 variants, and the honest summary is that Pruna's optimizations produce meaningfully different tradeoffs rather than straight upgrades. The premium model adds capabilities that the base Qwen models lack. The fast model sacrifices things the base models preserve. And the editor takes a different architectural approach to image manipulation entirely. Understanding which one to reach for in a given situation matters more than picking a favorite.

the premium tier: pruna/qwen-image

This is the full-featured version and it's genuinely loaded. You get img2img mode, negative prompts, LoRA weight support, prompt enhancement, configurable guidance and inference steps, and multiple output formats. It's the model you use when you know exactly what you want and need fine control over how you get there.

The negative prompt support is something I particularly value. Most modern generation models have moved away from negative prompts, treating them as a legacy feature from the Stable Diffusion era. But negative prompts remain one of the most efficient ways to steer output away from failure modes you've already identified. If your generations keep producing watermarks, or you're fighting a specific type of artifacting, a negative prompt resolves it in one field rather than requiring you to re-engineer your entire positive prompt.
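As a rough sketch of what that looks like through the python client documented in the api reference below - the field names come from the input schema, while the prompt text itself is just illustrative:

python
from inferencesh import inference

client = inference()

# one negative_prompt field instead of re-engineering the positive prompt
result = client.run({
    "app": "pruna/qwen-image",
    "input": {
        "prompt": "product photo of a ceramic mug on a walnut desk",
        "negative_prompt": "text watermark, JPEG compression artifacts, banding in gradients"
    }
})

print(result["output"])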

Prompt enhancement is the other standout. Enable it and the model expands your input into a more detailed specification before generating. This works well when you have a general concept but don't want to spend ten minutes writing an exhaustive description. A casual "sunset over a mountain lake" becomes a richly specified scene with atmospheric detail, lighting direction, and compositional choices. The model's taste isn't always your taste, but it consistently produces more interesting output than a bare prompt.
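The toggle is a single schema field (enhance_prompt, which defaults to false). A minimal sketch using the example prompt from above:

python
from inferencesh import inference

client = inference()

# the model expands the terse prompt into a detailed spec before generating
result = client.run({
    "app": "pruna/qwen-image",
    "input": {
        "prompt": "sunset over a mountain lake",
        "enhance_prompt": True
    }
})

print(result["output"])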

The LoRA support opens up style transfer and character consistency workflows. Point it at a trained LoRA weights file, dial in the application strength, and the base model adapts its output accordingly. This matters for anyone doing series work - product lines that need visual coherence, character-driven content, or brand-specific aesthetics that can't be achieved through prompting alone.
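A sketch of that workflow, assuming you already have a hosted weights file - the URL below is a placeholder, not a real file:

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "pruna/qwen-image",
    "input": {
        "prompt": "hero shot of a hiking boot, studio lighting",
        "lora_weights": "https://example.com/brand-style.safetensors",  # placeholder url
        "lora_scale": 0.8  # schema default is 1; lower keeps more base-model character
    }
})

print(result["output"])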

Where does it sit relative to the native Alibaba Qwen Image 2? The Pruna version gives you more control surfaces. More steps, explicit guidance values, img2img with adjustable strength, LoRA integration. The Alibaba version has its own strengths - a unified generation-editing pipeline in a single 7B model with native 2K resolution and strong multilingual text rendering. The choice comes down to whether you want more parameters to tune or a more opinionated model that handles decisions internally.
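For a sense of those control surfaces in practice, here's a hedged sketch of an img2img call using the schema fields documented below (the local path and parameter values are illustrative):

python
from inferencesh import inference

client = inference()

# the sdk uploads the local path automatically (see the files section below)
result = client.run({
    "app": "pruna/qwen-image",
    "input": {
        "prompt": "same scene, golden-hour lighting, cinematic color grade",
        "image": "/path/to/draft.png",
        "strength": 0.6,            # lower preserves more of the input image
        "guidance": 4,              # schema default 3, range 0-10
        "num_inference_steps": 40   # schema default 30, max 50
    }
})

print(result["output"])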

the speed tier: pruna/qwen-image-fast

The fast tier strips the Qwen architecture down to its essentials and runs it with aggressive optimization. You get a prompt, an aspect ratio, a creativity slider, and a seed. That's it.

The creative constraint here is real. No negative prompts. No guidance parameter. No step count control. No img2img. No LoRAs. You write a prompt, you pick a shape, and you trust the model. The creativity parameter is your only real lever - push it higher for more diverse and unexpected output, keep it low for safer and more predictable results.

What you get for that constraint is speed and cost that enable entirely different use cases. The fast model is cheap enough that generating a hundred variations is trivial. You can use this model for rapid exploration in a way that would be absurd at premium pricing. Need to test twenty different prompt phrasings to find the right direction? Run them all. Want to generate a dozen aspect ratio variations to see which composition works? Do it without thinking about cost.
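A sketch of what that exploration loop might look like. The fast app's exact input schema isn't documented on this page, so the prompt/aspect_ratio/seed field names below are assumed from the controls described above:

python
from inferencesh import inference

client = inference()

phrasings = [
    "neon-lit alley in the rain, cyberpunk mood",
    "rainy alley at night, neon signage, cinematic",
    "wet asphalt reflecting neon storefronts after dark"
]

# cheap enough to run every variant and eyeball the results
for seed, prompt in enumerate(phrasings):
    result = client.run({
        "app": "pruna/qwen-image-fast",
        "input": {"prompt": prompt, "aspect_ratio": "16:9", "seed": seed}
    })
    print(seed, result["output"])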

The quality gap versus the premium tier is visible but not catastrophic. Images have less fine detail, occasionally produce softer textures, and sometimes miss subtle prompt requirements. For social media content, placeholder imagery, moodboards, storyboards, and exploration workflows, the output is entirely usable. For client-facing final assets or anything that will be viewed at large scale, you'll want to graduate promising concepts up to the premium model.

the editor: pruna/qwen-image-edit-plus

This is architecturally distinct from the other two. Rather than generating from text alone, it takes one or two input images and transforms them according to text instructions. Pose transfer, background swaps, style changes, element composition, object manipulation. The prompt describes what you want changed, not what you want created from scratch.

The multi-image support is where it gets interesting. Feed it two images - say, a person in one pose and a reference image showing a different pose or outfit - and ask it to combine elements. Or provide a product photo and a background plate and ask for compositing with matched lighting. The model handles the spatial reasoning required to make these combinations look natural.
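A hypothetical sketch of a two-image compositing call. The editor's input schema isn't shown on this page, so the images field name is an assumption - run belt app get pruna/qwen-image-edit-plus for the real schema:

python
from inferencesh import inference

client = inference()

# composite a product shot onto a background plate with matched lighting
result = client.run({
    "app": "pruna/qwen-image-edit-plus",
    "input": {
        "prompt": "place the product on the beach background and match the warm evening light",
        "images": [                    # field name assumed, not confirmed
            "/path/to/product.png",    # local paths upload automatically
            "/path/to/beach-plate.jpg"
        ]
    }
})

print(result["output"])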

The editor's pricing sits in a sweet spot for iterative refinement workflows. Generate your base image with the premium model, then refine it through the editor at a fraction of the cost per pass. Multiple rounds of editing are far cheaper than regenerating from scratch until you get lucky.

The go_fast flag is worth knowing about. It trades some quality for speed when you're iterating quickly and don't need final-quality output on every pass. Turn it off for your last refinement step when the edit needs to be clean.
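Sketching the iterate-then-finalize pattern - same caveat as above on the editor's input field names, and the output field is assumed to mirror the generator's; go_fast itself is the documented flag:

python
from inferencesh import inference

client = inference()

image = "/path/to/base.png"
draft_edits = [
    "remove the lamp post on the left",
    "warm up the skin tones slightly"
]

# fast passes while iterating; the url returned by one pass feeds the next
for instruction in draft_edits:
    result = client.run({
        "app": "pruna/qwen-image-edit-plus",
        "input": {"prompt": instruction, "images": [image], "go_fast": True}
    })
    image = result["output"]["image"]  # output field name assumed

# clean final pass with go_fast off
final = client.run({
    "app": "pruna/qwen-image-edit-plus",
    "input": {
        "prompt": "clean up any residual artifacts around the edit",
        "images": [image],
        "go_fast": False
    }
})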

One limitation: aspect ratio defaults to matching the input image. You can override this, but the model works best when it isn't simultaneously trying to edit content and recompose for a different frame shape. Keep those as separate operations.

prompt strategies that actually work

For the premium model, I've found that specificity in material and lighting language pays dividends. "Brushed aluminum with warm overhead lighting casting soft shadows" produces dramatically better metallic surfaces than "metal object." The model responds well to photographic terminology - focal length references, depth of field descriptions, lighting setup language. Write prompts like you're giving a brief to a photographer and the model will interpret them with that same visual vocabulary.

For the fast model, simplicity wins. Long elaborate prompts don't hurt but they also don't help as much as you'd expect given the reduced inference steps. Front-load the most important information. Put your subject and primary style direction in the first sentence. Details about background, lighting, and atmosphere can follow but may receive less attention.

For the editor, instructional clarity matters more than descriptive richness. "Change the background to a beach at sunset" works better than elaborate descriptions of what the beach should look like. The model needs to understand the operation you want performed, not receive a full generation prompt. Think of it as giving directions to a retoucher rather than writing a creative brief.

Negative prompts on the premium model respond well to specific artifact types rather than general quality terms. "Blurry, low quality" is less effective than "chromatic aberration, JPEG compression artifacts, banding in gradients, text watermark." Name the specific failure modes you want to avoid.

the workflow advantage

The three-tier structure maps cleanly to workflow stages. Explore with the fast model - throw prompts at the wall, find directions that work, identify compositions worth pursuing. Generate final candidates with the premium model - full control, maximum quality, all the parameters tuned. Refine with the editor - fix specific issues, adjust elements, iterate toward perfection.
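Stitched together, the three stages might look like this (fast and editor field names assumed as in the earlier sketches; the premium call uses the documented schema):

python
from inferencesh import inference

client = inference()

# 1. explore cheaply on the fast tier
draft = client.run({
    "app": "pruna/qwen-image-fast",
    "input": {"prompt": "minimalist poster of a lighthouse at dawn", "seed": 7}
})

# 2. regenerate the winning concept with full control on the premium tier
final = client.run({
    "app": "pruna/qwen-image",
    "input": {
        "prompt": "minimalist poster of a lighthouse at dawn, flat color blocks, subtle grain",
        "num_inference_steps": 40,
        "output_format": "png"
    }
})

# 3. refine specifics through the editor, feeding it the premium output
polished = client.run({
    "app": "pruna/qwen-image-edit-plus",
    "input": {
        "prompt": "smooth the sky gradient",
        "images": [final["output"]["image"]]  # assuming the output image is a url
    }
})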

Compared to other models - FLUX Dev at similar premium positioning, Seedream for quality-per-dollar, GPT Image for instruction following - the Pruna Qwen family's advantage is the integrated tier system. You stay within the same model family's aesthetic sensibility as you move between speed, quality, and editing. Outputs from the fast model have the same underlying style character as premium outputs, which means concepts that look good at the exploration stage will scale up predictably.

when to use alibaba's native qwen models instead

The native Alibaba Qwen Image 2 lineup has its own strengths. Released in February 2026, Qwen-Image-2.0 is a leaner 7B parameter model (compared to the original 20B Qwen-Image) with native 2K resolution support and a unified generation-editing pipeline. It uses an 8B Qwen3-VL vision-language encoder paired with a 7B diffusion transformer decoder, which gives it strong prompt comprehension and compositional awareness.

The native models handle text rendering in images with particular skill - Alibaba invested heavily in multilingual text accuracy, supporting precise bilingual Chinese and English typography, and it shows. If generating images with embedded text (posters, UI mockups, signage, book covers) is your primary use case, compare results between both families before committing to a workflow.

Pruna's versions win on parameter control (premium), raw cost efficiency (fast), and dedicated editing capability (editor). Alibaba's native versions win on unified generation-editing in a single model and text rendering fidelity. Neither family obsoletes the other.

honest limitations

The fast model occasionally produces images that feel undercooked - textures that read as painted rather than photographic, backgrounds that simplify into gradients when you wanted detail. At half a cent per image this is an acceptable tradeoff but it means you can't use it as a straight replacement for the premium model on cost-sensitive projects.

The premium model isn't the cheapest option by current market standards. FLUX Dev with LoRA support offers different strengths at competitive pricing. If you don't need the Qwen model's specific characteristics - its text rendering capability, its particular aesthetic, its prompt enhancement feature - you might get equivalent results from a cheaper alternative.

The editor requires good source images to work from. Garbage in, garbage out applies here more than with pure generation. If your input image has issues - poor lighting, low resolution, compression artifacts - the editor will preserve those problems or introduce new ones trying to work around them.

is pruna/qwen-image-fast good enough for production use?

It depends on what "production" means. For social media posts, blog illustrations, placeholder content, internal presentations, and any context where images are viewed quickly at moderate size, the fast model produces acceptable output. For hero images, print work, large-format display, or anywhere that quality will be scrutinized, use it for exploration only and generate finals with the premium tier. The massive cost difference between tiers means you can afford to be generous with exploration and selective with final production.

how does the editor compare to inpainting workflows?

The editor uses a text-instruction approach rather than a mask-based inpainting approach. You describe what you want changed in natural language rather than painting selection areas. This is faster for broad changes - background swaps, style transfers, pose adjustments - but less precise for targeted local edits where you need exact boundary control. For pixel-precise editing, traditional inpainting tools still have an edge. For conceptual edits where you want the model to make intelligent decisions about what to change and what to preserve, the text-instruction approach is more efficient.

should I use these or the native alibaba qwen image 2 models?

Use Pruna's versions when you want more control parameters (premium tier), need budget-friendly exploration (fast tier), or want dedicated image editing (editor). Use Alibaba's native Qwen Image 2 when text rendering in images is your primary concern, when you want the unified generation-editing pipeline in a single model, or when you prefer a more opinionated model that makes good default choices without extensive parameter tuning. Both families share the same underlying Qwen-Image architecture and produce output with similar aesthetic character, so switching between them for different tasks within the same project is perfectly viable.

api reference

about

advanced text-to-image generation with optional lora weights and prompt enhancement

1. calling the api

install the client

the client provides a convenient way to interact with the api.

bash
pip install inferencesh

setup your api key

set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.

bash
export INFERENCE_API_KEY="inf_your_key"

run and get result

submit a request and wait for the final result. best for batch processing or when you don't need progress updates.

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "pruna/qwen-image",
    "input": {}
})

print(result["output"])

stream live updates

get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.

python
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "pruna/qwen-image",
    "input": {}
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication

the api uses api keys for authentication. see the authentication docs for detailed setup instructions.

3. files

file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.

automatic upload

the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.

python
# local file paths are automatically uploaded
result = client.run({
    "app": "pruna/qwen-image",
    "input": {
        "image": "/path/to/local/image.png",  # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})

manual upload

you can also upload files manually and use the returned url.

python
# upload and get a hosted URL
file = client.files.upload("/path/to/file.png")
print(file.uri)  # https://cloud.inference.sh/...

4. webhooks

get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.

python
result = client.run({
    "app": "pruna/qwen-image",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload

your endpoint receives a JSON POST with the task result:

json
{
  "id": "task_abc123",
  "status": 9,
  "output": { ... },
  "error": "",
  "session_id": null,
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:05Z"
}
id (string): task id
status (number): terminal status (9=completed, 10=failed, 11=cancelled)
output (object): task output (when completed)
error (string): error message (when failed)
session_id (string): session id (if using sessions)
created_at (string): iso timestamp
updated_at (string): iso timestamp

5. schema

input

prompt (string, required)

text description of the image to generate.

enhance_prompt (boolean)

auto-enhance prompt for better results.

default: false

go_fast (boolean)

run faster with optimizations.

default: true

guidance (number)

how closely to follow the prompt.

default: 3, min: 0, max: 10

negative_prompt (string)

things to avoid (e.g., 'blurry, low quality').

default: ""

num_inference_steps (integer)

number of denoising steps.

default: 30, min: 1, max: 50

seed (integer)

random seed.

disable_safety_checker (boolean)

disable safety checker.

default: false

image (file)

input image for img2img mode.

strength (number)

strength for img2img.

default: 0.9, min: 0, max: 1

lora_weights (string)

url to lora weights file.

lora_scale (number)

lora application strength.

default: 1

aspect_ratio (string)

aspect ratio.

default: "16:9"
options: "1:1", "16:9", "9:16", "4:3", "3:4", "3:2", "2:3"

image_size (string)

optimize for quality or speed.

default: "optimize_for_quality"
options: "optimize_for_quality", "optimize_for_speed"

output_format (string)

output format.

default: "webp"
options: "webp", "jpg", "png"

output_quality (integer)

quality for jpg/webp.

default: 80, min: 0, max: 100

output

image (file, required)

generated image file.

output_meta (object)

structured metadata about inputs/outputs for pricing calculation.
