
gemini-3-pro-image-preview

Gemini 3 Pro Image Preview (NanoBanana Pro) via Vertex AI - Advanced image generation model powered by Google Cloud

run with your agent
# install belt
$ curl -fsSL https://cli.inference.sh | sh
# view schema & details
$ belt app get google/gemini-3-pro-image-preview
# run
$ belt app run google/gemini-3-pro-image-preview

Google's Gemini 3 Pro Image Preview - codenamed "Nano Banana Pro" and released in November 2025 - is the model you reach for when image quality matters more than speed or cost. It sits at the top of Google's image generation lineup, incorporating what Google calls a "World Simulator" reasoning engine that constructs internal scene representations before generating pixels. It is slower and roughly twice the price of its Flash sibling, but capable of producing output with noticeably better fidelity, richer detail, and more accurate prompt adherence. If Flash is the workhorse, Pro is the specialist you bring in when the brief demands precision.

I've spent enough time with both tiers to say the difference is real but situational. For social media thumbnails or rapid iteration, Flash gets you there faster and cheaper. For hero images, editorial work, or anything where a client will zoom in and scrutinize, Pro earns its premium. The question isn't which is better in absolute terms - it's whether the use case justifies the extra cost and latency.

how it actually works

Gemini Pro Image isn't a traditional diffusion model with a CLIP encoder bolted on. It's built on Google's multimodal architecture, which means the same system that understands language at a deep semantic level is the one deciding what pixels to produce. The practical effect is that complex, multi-clause prompts get interpreted with genuine comprehension rather than keyword extraction. You can write a paragraph of creative direction - conditional instructions, spatial relationships, stylistic nuance - and the model parses it the way a human art director would.

This matters more than it sounds. With pure diffusion models, there's always a translation gap between what you write and what the encoder actually captures. Gemini Pro closes that gap considerably. It won't misinterpret "a red car behind a blue house" as "a blue car behind a red house" the way simpler architectures sometimes do. Compositional accuracy is where the language model backbone pays dividends.

The model outputs at 1K, 2K, or 4K resolution. At 4K you're getting print-ready files. Aspect ratios span the standard range - square, portrait, landscape, cinematic - plus an auto mode that lets the model choose what suits the content. You can generate up to four images per request, which is useful for exploring variations without burning through separate API calls.
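Putting those options together, a request payload might look like the sketch below. The field names come from the input schema in the api reference further down this page; the prompt is illustrative, and the payload would be passed to the SDK's `client.run` call.

```python
# sketch of a generation request using the documented input fields;
# prompt text is illustrative
payload = {
    "app": "google/gemini-3-pro-image-preview",
    "input": {
        "prompt": "a lighthouse on a rocky coast at golden hour",
        "resolution": "4K",      # "1K", "2K", or "4K"
        "aspect_ratio": "16:9",  # "auto" is also supported
        "num_images": 4,         # schema maximum is four per request
    },
}

# the schema caps num_images at 4
assert 1 <= payload["input"]["num_images"] <= 4
```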

search grounding and factual imagery

The feature that genuinely separates Gemini from the field is Google Search grounding. Enable it and the model queries live search results during generation, pulling in current visual knowledge about real products, places, people, and events. This isn't the model hallucinating what it thinks a Tesla Cybertruck looks like based on training data from two years ago - it's actively verifying against current information.

For anyone generating images that need to be factually correct rather than imaginatively plausible, this changes the calculus entirely. Product shots that reference real designs. Location imagery that matches how a building actually looks today. Educational materials where accuracy isn't optional. Search grounding adds a small per-request fee, which is negligible given what it enables.

I find this most useful for commercial work where a client will immediately notice if something looks wrong. Fashion brands want their actual garments rendered correctly. Architecture firms need real building materials and proportions. Search grounding doesn't guarantee perfection, but it raises the floor dramatically.
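In request terms, grounding is a single boolean. A sketch using the `enable_google_search` field from the input schema below, with an illustrative prompt:

```python
# grounding request sketch; enable_google_search is the documented
# schema field, the prompt is illustrative
payload = {
    "app": "google/gemini-3-pro-image-preview",
    "input": {
        "prompt": "a storefront product display as it looks today",
        "enable_google_search": True,  # query live search during generation
    },
}

assert payload["input"]["enable_google_search"] is True
```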

image editing through conversation

Pass one or more reference images alongside your text prompt and Gemini Pro becomes an editor rather than a generator. The model understands what to preserve and what to change based on your instructions - background swaps, object removal, style transfers, element compositing. It handles these operations through natural language rather than masks or manual selection tools.

The editing capability is where Pro's quality advantage over Flash shows up most clearly. Subtle edits - matching lighting across a composited scene, preserving texture consistency when swapping elements, maintaining skin tone accuracy during background changes - require the kind of fine-grained control that benefits from the higher-fidelity model. Flash handles broad strokes well enough, but Pro manages the details that make edits look seamless rather than obviously synthetic.

Multi-image input means you can feed the model several references and ask it to combine elements, match styles across sources, or use one image as a guide while editing another. The workflow feels closer to directing a skilled retoucher than operating a tool.
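As a sketch of what a multi-image edit request might look like: providing an `images` array switches the model from generation to editing, per the input schema below. The URLs and prompt here are illustrative placeholders.

```python
# editing request sketch: supplying images makes the model edit rather
# than generate; the schema allows up to 14 input images
payload = {
    "app": "google/gemini-3-pro-image-preview",
    "input": {
        "prompt": (
            "replace the background of the first image with the beach "
            "from the second, matching the lighting"
        ),
        "images": [
            "https://example.com/portrait.png",  # image to edit (placeholder url)
            "https://example.com/beach.jpg",     # background reference (placeholder url)
        ],
    },
}

assert 1 <= len(payload["input"]["images"]) <= 14
```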

text rendering

Gemini Pro renders text in images more reliably than most generation models, achieving 94% text rendering accuracy according to Google's benchmarks. Short text elements - signs, labels, titles, watermarks - come out clean and legible the majority of the time. The language model backbone helps here because the system actually understands what the text says, so it doesn't produce the garbled letterforms that plague pure diffusion approaches.

For complex typography, dense paragraphs, or precise font matching, dedicated text-rendering models like Qwen Image 2 still have an edge. Gemini Pro is good enough for most practical purposes but it's not a typography engine. If your entire use case is generating infographics with lots of body copy, look elsewhere. If you need a product mockup with a clean logo and tagline, Pro handles it well.

the quality tradeoff

Pro costs roughly double what Flash charges for equivalent output, but 2K costs the same as 1K - so there's no reason not to generate at 2K unless you specifically need smaller files for bandwidth reasons. At that resolution you get genuinely usable output for web, print, and production work.

Whether the Pro tier makes sense depends entirely on volume and use case. If you're generating thousands of images for A/B testing ad creatives, Flash is the rational choice - the quality difference won't move the needle on click-through rates. If you're producing a handful of hero images for a campaign landing page, quality is everything. Pro exists for the second scenario.

Generation times run longer than Flash - sometimes noticeably so for complex prompts or high resolutions. This isn't a model for real-time applications or user-facing generation where someone is watching a loading spinner. Batch workflows, async pipelines, and pre-production use cases are where the latency is invisible.

where it fits in the landscape of options

Running Gemini 3 Pro Image through inference.sh puts it alongside a dozen other generation models, and the honest positioning is that no single model wins everywhere. Pro's strengths are instruction following, editing capability, and search grounding. Its weaknesses are speed and cost.

If you need custom style adaptation through fine-tuned LoRAs or precise control over the diffusion process with CFG and step parameters, FLUX Dev LoRA gives you knobs that Gemini doesn't expose. If you want the best quality per dollar for straightforward generation without editing needs, Seedream 4.5 often delivers comparable visual quality at a lower price point. And if text rendering is your primary concern, Qwen Image 2 remains the specialist.

Pro earns its place when you need the combination - high fidelity, strong instruction adherence, editing capability, and factual grounding all in one model. That combination doesn't exist elsewhere in a single system. You're paying for the integration, not just any single capability.

safety controls

The model offers configurable safety filtering through a tolerance parameter ranging from minimal blocking to aggressive filtering. The default setting works for general commercial use. Loosening it makes sense for artistic or editorial work where the model might otherwise reject legitimate creative prompts. Tightening it makes sense for consumer-facing products where you need to guarantee family-safe output regardless of input.

Safety filtering on generation models is always a compromise. Too aggressive and you get false rejections on perfectly reasonable prompts - portraits blocked because skin is visible, historical imagery blocked because conflict is depicted. Too loose and you open liability exposure. Google's implementation lets you choose where on that spectrum you want to sit, which is the right approach even if finding the right setting takes some experimentation.
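In the API, that spectrum maps to the `safety_tolerance` field documented in the schema below. A hypothetical editorial configuration, ordered here from most to least permissive as an assumption about how the thresholds relate:

```python
# documented safety_tolerance values, ordered (assumed) from most to
# least permissive
tolerance_spectrum = [
    "BLOCK_NONE",
    "BLOCK_ONLY_HIGH",
    "BLOCK_MEDIUM_AND_ABOVE",
    "BLOCK_LOW_AND_ABOVE",
]

# illustrative input for editorial work that might trip stricter filters
payload_input = {
    "prompt": "a renaissance-style oil painting of a battle scene",
    "safety_tolerance": "BLOCK_ONLY_HIGH",  # one of the four documented values
}

assert payload_input["safety_tolerance"] in tolerance_spectrum
```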

the generation count question

Four images maximum per request. This is generous enough for variation exploration - generating a set and picking the best one - but not designed for bulk production runs. If you need fifty variations, you're making thirteen requests. The practical implication is that Pro works best in workflows where you're generating a focused set, evaluating quality, refining your prompt, and iterating. It rewards precision over volume.
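The arithmetic above is just ceiling division against the four-image cap. A small sketch of a batch plan, with `batch_plan` a hypothetical helper name:

```python
PER_REQUEST = 4  # schema maximum for num_images

def batch_plan(total_images: int, per_request: int = PER_REQUEST) -> list[int]:
    """Split a target image count into request-sized batches."""
    batches = []
    remaining = total_images
    while remaining > 0:
        take = min(per_request, remaining)
        batches.append(take)
        remaining -= take
    return batches

plan = batch_plan(50)
print(len(plan))  # 13 requests, as noted above
```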

The aspect ratio auto-detection is worth mentioning specifically. Rather than forcing a ratio and potentially cropping or distorting the composition, auto lets the model decide what shape suits the content. A landscape scene gets landscape proportions. A portrait gets portrait. It's a small thing but it eliminates one decision from the workflow and the model's choices are reasonable.

who should actually use this

Production teams creating final assets rather than drafts. Agencies delivering client work where quality is contractually expected. Editorial teams generating illustrations that will sit alongside professional photography. E-commerce operations needing accurate product visualizations. Anyone whose workflow includes "this needs to look right" as a hard requirement rather than an aspiration.

If your workflow is "generate a bunch of options quickly and pick one that's close enough," Flash or another faster model will serve you better. Pro is for workflows where you generate fewer images but need each one to be closer to done when it arrives.

is gemini 3 pro worth the premium over flash?

It depends on whether quality or throughput matters more to your specific workflow. For production assets, editorial imagery, and client-facing work where individual image quality is scrutinized, the Pro tier produces noticeably better results - finer detail, more accurate prompt interpretation, and cleaner editing output. For high-volume generation where speed and cost efficiency matter more than per-image perfection, Flash delivers excellent results at a fraction of the cost. The 2K resolution sweet spot makes Pro particularly compelling for mid-resolution work.

how does search grounding affect generation quality?

Search grounding queries Google's live index during generation to verify visual facts - what real products look like, current architectural designs, accurate geographic features. The quality improvement is most dramatic when generating images of real-world subjects where training data might be outdated or incomplete. For purely imaginative or artistic generation, search grounding adds minimal value since there's nothing factual to verify. The per-request cost is low enough to enable by default for any commercial or reference imagery work.

can gemini pro replace dedicated editing software?

For a growing category of edits, yes. Background replacement, object removal, style transfer, and element compositing through natural language instructions work well enough to skip Photoshop for many production tasks. The limitation is precision control - you can't mask a specific region, adjust feathering, or fine-tune blend modes. Complex retouching with pixel-level requirements still needs traditional tools. But for the 80% of edits that are conceptually simple even if manually tedious, describing what you want in plain language and letting the model execute is faster and often produces cleaner results than manual work.

api reference

about

gemini 3 pro image preview (nanobanana pro) via vertex ai - advanced image generation model powered by google cloud

1. calling the api

install the client

the client provides a convenient way to interact with the api.

bash
pip install inferencesh

setup your api key

set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.

bash
export INFERENCE_API_KEY="inf_your_key"

run and get result

submit a request and wait for the final result. best for batch processing or when you don't need progress updates.

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "google/gemini-3-pro-image-preview",
    "input": {}
})

print(result["output"])

stream live updates

get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.

python
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
        "app": "google/gemini-3-pro-image-preview",
        "input": {}
    }, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication

the api uses api keys for authentication. see the authentication docs for detailed setup instructions.

3. files

file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.

automatic upload

the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.

python
# local file paths are automatically uploaded
result = client.run({
    "app": "google/gemini-3-pro-image-preview",
    "input": {
        "image": "/path/to/local/image.png",  # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})

manual upload

you can also upload files manually and use the returned url.

python
# upload and get a hosted URL
file = client.files.upload("/path/to/file.png")
print(file.uri)  # https://cloud.inference.sh/...

4. webhooks

get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.

python
result = client.run({
    "app": "google/gemini-3-pro-image-preview",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload

your endpoint receives a JSON POST with the task result:

json
{
  "id": "task_abc123",
  "status": 9,
  "output": { ... },
  "error": "",
  "session_id": null,
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:05Z"
}

id (string): task id
status (number): terminal status (9=completed, 10=failed, 11=cancelled)
output (object): task output (when completed)
error (string): error message (when failed)
session_id (string): session id (if using sessions)
created_at (string): iso timestamp
updated_at (string): iso timestamp
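A minimal sketch of a receiver for this payload, mapping the terminal status codes listed above to readable labels. `handle_webhook` is a hypothetical helper name; in practice the body would arrive as the POST body of your endpoint.

```python
import json

# terminal status codes from the payload reference above
STATUS = {9: "completed", 10: "failed", 11: "cancelled"}

def handle_webhook(body: str) -> str:
    """Parse a webhook POST body and summarize the task outcome."""
    task = json.loads(body)
    state = STATUS.get(task["status"], "unknown")
    if state == "failed":
        return f"task {task['id']} failed: {task['error']}"
    return f"task {task['id']} {state}"

example = '{"id": "task_abc123", "status": 9, "output": {}, "error": ""}'
print(handle_webhook(example))  # task task_abc123 completed
```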

5. schema

input

prompt (string, required)

the prompt for image generation or editing. describe what you want to create or change.

images (array)

optional list of input images for editing (up to 14 images). when provided, the model will edit these images based on the prompt. when not provided, the model will generate new images from the text prompt. supported formats: jpeg, png, webp

num_images (integer)

number of images to generate.

default: 1, min: 1, max: 4

aspect_ratio (string)

aspect ratio for the output image. use 'auto' to automatically match the first input image's aspect ratio. default: 1:1

default: "1:1"
options: "auto", "21:9", "16:9", "3:2", "4:3", "5:4", "1:1", "4:5", "3:4", "2:3", "9:16"

resolution (string)

output resolution. options: 1k, 2k, 4k. default: 1k

default: "1K"
options: "1K", "2K", "4K"

output_format (string)

output format for the generated images.

default: "png"
options: "png", "jpeg", "webp"

enable_google_search (boolean)

enable google search grounding for real-time information (weather, news, etc.)

default: false

safety_tolerance (string)

safety filter threshold. options: block_none, block_low_and_above, block_medium_and_above, block_only_high

default: "BLOCK_NONE"
options: "BLOCK_NONE", "BLOCK_LOW_AND_ABOVE", "BLOCK_MEDIUM_AND_ABOVE", "BLOCK_ONLY_HIGH"

retry_count (integer)

number of automatic retries on 429 rate limit errors using exponential backoff with jitter. set to 0 to disable retries. example: retry_count=2 means up to 3 total attempts (1 initial + 2 retries).

default: 2, min: 0, max: 5
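The backoff behavior described for retry_count can be sketched as follows. The base delay and cap are illustrative assumptions, not documented service parameters, and "full jitter" is one common jitter scheme; the SDK's exact implementation may differ.

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """exponential backoff with full jitter: the window doubles each
    attempt (capped), and the actual delay is a uniform draw from it.
    base and cap are illustrative assumptions, not documented values."""
    window = min(cap, base * (2 ** attempt))
    return random.uniform(0, window)

# retry_count=2 means up to 3 total attempts: only attempts 0 and 1 are retried
for attempt in range(2):
    delay = backoff_delay(attempt)
    assert 0 <= delay <= 2.0  # windows are 1s, then 2s
```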

output

imagesarray*

the generated or edited images

descriptionstring

text description or response from the model

output_metaobject

structured metadata about inputs/outputs for pricing calculation

