Gemini 3 Pro Image Generation on inference.sh

Google's Gemini 3 Pro Image Preview is the most popular image generation app on inference.sh, and for good reason. It combines Google's multimodal language understanding with high-fidelity image generation, producing outputs that follow complex prompts with remarkable accuracy. The model supports text-to-image generation, image editing with reference inputs, and Google Search grounding for real-time visual knowledge. Available at app.inference.sh/apps/google/gemini-3-pro-image-preview.

what it does

Gemini 3 Pro Image Preview generates images from text prompts and edits existing images based on natural language instructions. Unlike pure diffusion models, it leverages Google's multimodal architecture to deeply understand what you are asking for. The model interprets compositional instructions, spatial relationships, and stylistic directions with the comprehension of a language model rather than the pattern matching of a CLIP encoder.

The Google Search grounding feature sets it apart from other generators. Enable it and the model can reference current visual knowledge when generating images - real products, recent events, specific people or places it can verify through search. This makes it particularly effective for generating images that need to be factually grounded rather than purely imagined.

key features

Multimodal understanding - The model processes prompts with full language model comprehension. Complex multi-sentence descriptions, conditional instructions, and nuanced creative direction all get interpreted correctly.

Image editing - Pass reference images alongside your prompt to edit, transform, or build upon existing visuals. The model understands what to preserve and what to change based on your instructions.

Google Search grounding - Enable real-time web search to ground generations in current visual knowledge. Useful for generating images of real products, locations, or concepts that require factual accuracy.

Resolution options - Generate at 1K, 2K, or 4K resolution depending on your needs. The 4K option produces print-ready outputs.

Batch generation - Generate up to 4 images per request for variation exploration or asset production.

Safety controls - Configurable safety tolerance from BLOCK_NONE to BLOCK_ONLY_HIGH, giving you control over content filtering thresholds.

use cases

Product visualization - Generate realistic product shots, packaging mockups, and marketing materials. Google Search grounding helps the model understand real product designs and brand aesthetics.

Content creation - Blog illustrations, social media graphics, and editorial imagery that matches specific creative briefs without stock photo limitations.

Image editing workflows - Remove backgrounds, swap elements, change styles, or composite multiple images through natural language instructions rather than manual editing tools.

Rapid prototyping - UI mockups, architectural concepts, and design explorations generated from detailed text descriptions. The model's instruction following makes iteration fast.

Knowledge-grounded visuals - Educational content, infographics, and reference imagery where factual accuracy matters more than pure creativity.

how to run

belt CLI

Basic text-to-image generation:

```bash
belt app run google/gemini-3-pro-image-preview --prompt "A minimalist Japanese zen garden at sunrise, raked sand patterns forming concentric circles around moss-covered stones, soft golden light filtering through bamboo, 35mm film grain"
```

With Google Search grounding enabled:

```bash
belt app run google/gemini-3-pro-image-preview --prompt "The latest Tesla Cybertruck parked in a desert landscape at golden hour, photorealistic" --enable_google_search true --resolution 2K
```

Image editing with a reference:

```bash
belt app run google/gemini-3-pro-image-preview --prompt "Change the background to a tropical beach at sunset, keep the subject unchanged" --images '["https://example.com/portrait.jpg"]' --resolution 2K
```

Generate multiple variations:

```bash
belt app run google/gemini-3-pro-image-preview --prompt "Watercolor illustration of a cozy bookshop interior, warm lighting, cats sleeping on bookshelves" --num_images 4 --aspect_ratio "3:4"
```

API

```python
from inference import Client

client = Client()
result = client.run("google/gemini-3-pro-image-preview", {
    "prompt": "Professional headshot photo, woman in business attire, neutral studio background, soft directional lighting, shallow depth of field, Canon EOS R5 aesthetic",
    "resolution": "2K",
    "aspect_ratio": "3:4",
    "output_format": "png"
})
```

Image editing example:

```python
result = client.run("google/gemini-3-pro-image-preview", {
    "prompt": "Remove the text overlay and restore the underlying image naturally",
    "images": ["https://example.com/image-with-text.png"],
    "resolution": "2K"
})
```

input parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| prompt | string | yes | Text description of the desired image or editing instruction |
| resolution | enum | no | Output resolution: 1K, 2K, or 4K. Default is 1K |
| aspect_ratio | enum | no | Output aspect ratio. Options include 1:1, 3:4, 4:3, 16:9, 9:16, and auto |
| images | array | no | Input images for editing workflows. Pass URLs or base64 data |
| num_images | integer | no | Number of images to generate (1-4). Default is 1 |
| output_format | enum | no | Output format: png or jpeg |
| enable_google_search | boolean | no | Enable Google Search grounding for factual accuracy |
| safety_tolerance | enum | no | Safety filter level: BLOCK_NONE, BLOCK_ONLY_HIGH, BLOCK_MEDIUM_AND_ABOVE, BLOCK_LOW_AND_ABOVE |
| retry_count | integer | no | Automatic retries on rate limit (429) errors |
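As a quick client-side sanity check, the enum and range constraints above can be encoded in a small validator before a request is sent. This is only a sketch based on the parameter table; the platform performs its own validation, and the allowed values here are assumptions drawn from that table:

```python
ALLOWED = {
    "resolution": {"1K", "2K", "4K"},
    "aspect_ratio": {"1:1", "3:4", "4:3", "16:9", "9:16", "auto"},
    "output_format": {"png", "jpeg"},
    "safety_tolerance": {"BLOCK_NONE", "BLOCK_ONLY_HIGH",
                         "BLOCK_MEDIUM_AND_ABOVE", "BLOCK_LOW_AND_ABOVE"},
}

def validate_params(params: dict) -> list[str]:
    """Return a list of problems with a request payload (empty if valid)."""
    errors = []
    if not params.get("prompt"):
        errors.append("prompt is required")
    # Check every enum-typed parameter against its allowed values
    for key, allowed in ALLOWED.items():
        if key in params and params[key] not in allowed:
            errors.append(f"{key} must be one of {sorted(allowed)}")
    # num_images is documented as 1-4
    n = params.get("num_images", 1)
    if not 1 <= n <= 4:
        errors.append("num_images must be between 1 and 4")
    return errors
```

Catching a bad enum value locally is cheaper than waiting for the API to reject the request.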

output

The app returns an images array containing the generated image files, plus an optional description field with text commentary from the model about the generation. The output_meta field provides structured metadata about token usage and processing details.

Each image in the output array is a file reference you can download directly or pass to downstream workflows.
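Assuming the result deserializes to a plain dict with the fields described above (an assumption; adapt this to the client library's actual return type), downstream handling might look like:

```python
def summarize_result(result: dict) -> dict:
    """Pull out the fields a downstream workflow step typically needs.

    Assumes the shape described above: an 'images' list of file references,
    an optional 'description', and an 'output_meta' dict. The meta keys
    shown in the sample below are hypothetical.
    """
    return {
        "image_count": len(result.get("images", [])),
        "has_description": bool(result.get("description")),
        "meta": result.get("output_meta", {}),
    }

# Example payload shaped like the documented output
sample = {
    "images": ["gen_001.png", "gen_002.png"],
    "description": "Two variations of the requested scene.",
    "output_meta": {"tokens_used": 1024},
}
```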

pricing

  • 1K resolution: $0.15 per image
  • 2K resolution: $0.15 per image
  • 4K resolution: $0.30 per image
  • Google Search grounding: +$0.015 per request

Generating 4 images at 2K resolution with search grounding costs $0.615 total (4 × $0.15 + $0.015). The pricing is competitive for the quality level, particularly at 2K, where you get high-resolution output at the same price as 1K.
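The arithmetic above generalizes to a small cost estimator. The rates are the ones listed here; treat them as a snapshot, since pricing can change:

```python
PRICE_PER_IMAGE = {"1K": 0.15, "2K": 0.15, "4K": 0.30}
SEARCH_GROUNDING_FEE = 0.015  # flat per-request fee, not per image

def estimate_cost(resolution: str, num_images: int = 1,
                  google_search: bool = False) -> float:
    """Estimated USD cost for one request at the listed rates."""
    cost = PRICE_PER_IMAGE[resolution] * num_images
    if google_search:
        cost += SEARCH_GROUNDING_FEE
    return round(cost, 3)

# The worked example from the text: 4 images at 2K with grounding
# estimate_cost("2K", 4, True) -> 0.615
```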

when to use gemini vs alternatives

Choose Gemini 3 Pro when you need strong instruction following, image editing capabilities, or Google Search grounding for factual imagery. It excels at complex compositional prompts and multi-step editing workflows.

Choose FLUX Dev LoRA when you need custom style adaptation through fine-tuned LoRAs or want the most control over the diffusion process with CFG and step tuning.

Choose Seedream 4.5 when you want the best price-to-quality ratio for straightforward text-to-image generation at high resolution.

Choose Qwen Image 2 Pro when text rendering, infographic generation, or document-style outputs are your primary need.

FAQ

Can Gemini 3 Pro generate text in images?

Yes. The model's language understanding makes it better than most diffusion models at rendering text, though dedicated text-rendering models like Qwen Image 2 Pro still handle complex typography more reliably. For short text elements like signs, labels, or titles, Gemini 3 Pro performs well.

How does Google Search grounding work?

When enabled, the model queries Google Search during generation to verify or reference real-world visual information. This helps with generating accurate depictions of real products, locations, public figures, or current events. It adds $0.015 to the per-request cost.

What aspect ratios are supported?

The model supports 1:1, 3:4, 4:3, 16:9, 9:16, and an auto option that lets the model choose the best ratio for the content. Auto is useful when you want the model to decide based on the subject matter.

Can I use it for image editing?

Yes. Pass one or more images in the images array alongside a text prompt describing the desired edit. The model can handle background removal, style transfer, object removal, element swapping, and compositional changes.

What is the safety_tolerance parameter?

It controls how aggressively the safety filter blocks content. BLOCK_NONE applies minimal filtering, while BLOCK_LOW_AND_ABOVE is the most restrictive. The default is appropriate for general use. Adjust based on your content requirements and compliance needs.
