gemini-2-5-flash-image
Gemini 2.5 Flash Image (NanoBanana) via Vertex AI - Advanced image generation model powered by Google Cloud
Google's Gemini 2.5 Flash Image - internally codenamed "Nano Banana" - has quietly become the most-used image generation model on inference.sh. Over 53,000 tasks have run through it, 138 paying users call it daily, and the trajectory shows no signs of slowing. It currently holds the number one ranking on the Artificial Analysis Image Arena leaderboard. The reasons are not mysterious. It generates quickly, renders text you can actually read, edits existing images without a separate pipeline, and costs less than its Pro sibling while delivering results that most people cannot tell apart.
The model sits at an interesting intersection. It is not the cheapest option available - FLUX Dev still wins on raw cost per image. It is not the highest-fidelity either - Gemini Pro Image takes that crown. But for the daily reality of production workloads where you need good images quickly and reliably, Flash has found its lane and stayed there.
the flash advantage
There is something specific about how Flash operates that matters for anyone building products rather than just generating pretty pictures. The "Flash" in the name refers to Google's fastest inference architecture, which means you get results back in seconds rather than the longer waits associated with Pro-tier models. When you are iterating on a design concept or running image generation inside a user-facing application, that latency difference compounds fast.
Think about it from a product perspective. Your user uploads a product photo, types "put this on a beach at sunset," and waits. Two seconds versus eight seconds. The gap between those two experiences is the gap between an app that feels responsive and one that feels like it is thinking too hard. Flash gives you the responsive version without asking you to sacrifice much on output quality.
The quality difference between Flash and Pro is real but narrower than you might expect. Side by side, a trained eye can spot more refined lighting and slightly better compositional choices in Pro outputs. But for the vast majority of production use cases - social media assets, e-commerce product shots, marketing mockups, automated thumbnails - Flash outputs are indistinguishable from what you would get at the higher tier.
text that actually reads
I keep coming back to this because it genuinely changed what I expect from image generators. Most models produce text that looks like text from a distance but falls apart on closer inspection. Letters blur into each other, spacing goes wrong, characters get invented. You learn to work around it - avoiding text in prompts, adding it in post-production, accepting the limitation.
Gemini Flash does not have that limitation. Or rather, it has it much less. Generate a poster with a headline and the headline is legible. Create a product mockup with a label and the label says what you told it to say. Make a greeting card and the message renders cleanly enough to actually send.
This matters enormously for automation. If you are building a system that generates social media images with text overlays, or product labels, or educational materials with captions, you no longer need a separate text rendering step after generation. The model handles it in a single pass, which simplifies your pipeline and reduces the number of failure points.
The text rendering also works across languages. Generate in English, then regenerate the same concept with Hindi or Spanish or Japanese text, and the model handles the character sets correctly while preserving the visual composition. For teams producing localized marketing materials, that is hours of design work replaced by a second API call.
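As a sketch of what that second API call looks like - the "prompt" field name follows the schema in the api reference below, but treat the exact payload shape as an assumption:

```python
from inferencesh import inference

client = inference()

# same visual concept, localized headline text
headlines = {
    "en": "Summer Sale - 50% Off",
    "es": "Rebajas de Verano - 50% de Descuento",
    "ja": "サマーセール - 50%オフ",
}

for lang, headline in headlines.items():
    result = client.run({
        "app": "google/gemini-2-5-flash-image",
        "input": {
            # "prompt" is the field name implied by the schema below
            "prompt": f"Retail poster, bold sans-serif headline reading '{headline}', "
                      "sunlit beach scene, product centered",
        },
    })
    print(lang, result["output"])
```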
editing without the pipeline tax
One thing that surprised me when I first used this model is how naturally it handles image editing. You pass in one or more reference images alongside your prompt, and it understands what you want changed. Not in a crude inpainting way where you draw a mask and fill it - in a conversational way where you describe the edit and the model figures out what stays and what changes.
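A minimal sketch of such a call, assuming the "prompt" and "images" field names from the schema in the api reference below:

```python
from inferencesh import inference

client = inference()

# edit an existing photo with a plain-language instruction;
# the sdk detects the local path and uploads it automatically
result = client.run({
    "app": "google/gemini-2-5-flash-image",
    "input": {
        "prompt": "Replace the background with a modern office, keep the subject unchanged",
        "images": ["/path/to/portrait.png"],  # "images" field name assumed from the schema
    },
})
print(result["output"])
```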
"Replace the background with a modern office." "Make the lighting warmer and more dramatic." "Remove the person on the left." "Combine these two product shots into one scene." These instructions work because the model processes the input images with the same understanding it brings to text prompts. It sees the image, comprehends the composition, and makes targeted modifications while preserving what you did not ask it to change.
The multi-image input capability extends this further. You can pass multiple reference images - style references, composition guides, product photos from different angles, texture samples - and the model synthesizes them into a single coherent output. The model can maintain visual coherence across up to five characters and fourteen objects in a single scene. This replaces workflows that previously required multiple generation passes, manual compositing in Photoshop, or specialized tools for each step.
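A multi-reference composition, under the same assumptions about field names (the schema below allows up to 14 input images):

```python
from inferencesh import inference

client = inference()

# synthesize several references into one coherent output
result = client.run({
    "app": "google/gemini-2-5-flash-image",
    "input": {
        "prompt": "Place the product from the first image on the wooden table from the "
                  "second image, matching the warm lighting of the third image",
        "images": [
            "/refs/product.png",                           # local files are uploaded
            "/refs/table.jpg",
            "https://example.com/lighting-reference.jpg",  # urls pass through as-is
        ],
    },
})
```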
google search grounding
Here is a feature that no other image generator currently matches. When you enable search grounding, the model can query Google Search during generation to ensure factual accuracy. This sounds like a small thing until you try generating images of real places, products, or concepts.
Ask a standard image generator for "the Sagrada Familia cathedral" and you get something that looks vaguely like a gothic church. Ask Gemini Flash with grounding enabled and you get something that actually resembles the Sagrada Familia, because the model referenced current images of it during generation. The difference between "a thing that looks plausible" and "a thing that looks correct" matters a lot when you are creating educational content, reference materials, or anything that represents reality.
The practical applications are specific and valuable. Generate an infographic about cloud formations and the model knows what cumulus versus cirrus actually looks like. Create a diagram of a Mars rover and it references the real design rather than inventing a generic sci-fi rover. Make a product visualization of an existing object and it matches the actual product rather than hallucinating features.
Search grounding is optional and toggled per request. Pure creative work where you want the model to imagine freely - leave it off. Anything that needs to reflect reality accurately - turn it on. The flexibility means you do not pay for grounding when you do not need it, but it is there the moment accuracy matters.
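As a sketch, toggling grounding per request might look like this - the "search_grounding" field name is an assumption, since the api reference below describes the toggle but its exact name is not confirmed here:

```python
from inferencesh import inference

client = inference()

def generate(prompt: str, grounded: bool):
    # "search_grounding" is an assumed field name for the per-request toggle
    return client.run({
        "app": "google/gemini-2-5-flash-image",
        "input": {
            "prompt": prompt,
            "search_grounding": grounded,
        },
    })

# factual subject: turn grounding on so the model references real imagery
real = generate("The Sagrada Familia cathedral at golden hour", grounded=True)

# pure invention: leave it off and let the model imagine freely
imagined = generate("A cathedral grown from coral on an alien reef", grounded=False)
```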
resolution and scaling
The model outputs at multiple resolution tiers, from 512 pixels for quick previews, through 1K and 2K for standard production work, up to 4K (4096px) for print and high-fidelity applications, giving you detail and clarity appropriate to the target size.
For production pipelines, this resolution control is practical rather than aspirational. Generate previews at 512px during iteration when you are exploring concepts and speed matters more than detail. Switch to 2K or 4K for final outputs when you have locked in the direction and need publication-ready assets. The pricing scales with resolution, so you are not overpaying for throwaway exploration.
Aspect ratio support covers ten options: 1:1, 5:4, 4:5, 4:3, 3:4, 3:2, 2:3, 16:9, 9:16, and 21:9. There is also an auto-detect mode for editing workflows that matches the aspect ratio of your input image. Between resolution tiers and aspect ratio options, you can generate assets that fit specific placements - Instagram posts, YouTube thumbnails, website headers, print layouts - without post-generation cropping or resizing. Every generated image also carries an invisible SynthID watermark for provenance tracking.
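To make the draft-then-final pattern concrete, here is a sketch. Note that "resolution" and "aspect_ratio" are assumed field names - the published schema below lists an aspect ratio option but no explicit resolution field, so verify against the live schema before relying on this:

```python
from inferencesh import inference

client = inference()

def render(resolution: str):
    # "resolution" and "aspect_ratio" are assumptions, not confirmed field names
    return client.run({
        "app": "google/gemini-2-5-flash-image",
        "input": {
            "prompt": "Minimalist website header, mountain skyline at dawn",
            "aspect_ratio": "21:9",    # wide placement, no post-crop needed
            "resolution": resolution,
        },
    })

draft = render("512")   # cheap preview while exploring concepts
final = render("4K")    # publication-ready once the direction is locked in
```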
where flash wins and where it does not
I want to be honest about positioning because there are legitimate reasons to choose other models depending on what you are building.
Flash wins decisively on speed-to-quality ratio. If you need good images fast - in a user-facing application, in a content pipeline running thousands of generations, in an iterative creative process - nothing else matches the combination of quality and latency at this price point. The text rendering and search grounding features add capabilities that most competitors simply do not offer.
GPT Image 2 from OpenAI offers a different aesthetic sensibility, strong text rendering with roughly 99% character-level accuracy, and better mask-based inpainting for precision editing workflows. If your product is built around OpenAI's ecosystem and the visual style matches what your users expect, it remains a valid choice. But it lacks search grounding and costs more at its high quality tier.
FLUX Dev from Black Forest Labs wins on cost alone. For pure text-to-image generation where you need volume and the quality bar is "good enough," FLUX is hard to beat economically. But it does not do editing, does not render text reliably, and cannot ground in real-world information.
Qwen Image 2 excels at dense informational outputs - complex infographics, detailed diagrams, long-form visual documents with heavy text content. If your use case is specifically "generate a dense information graphic," Qwen handles it better than Flash.
The honest answer for most production use cases is that Flash covers 80% of what you need, and you reach for specialized models for the remaining 20%.
running it on inference.sh
The model runs as a serverless app on inference.sh, which means no GPU provisioning, no cold starts to manage, no infrastructure to maintain. You call the API, pass your prompt and any reference images, and get generated images back. The same interface works whether you are generating one image from a terminal or processing thousands through a production pipeline.
This is the standard inference.sh experience - one API endpoint, consistent interface across all models, no operational overhead. If you are already running other AI workloads through inference.sh, adding Flash image generation is just another app call with no new concepts to learn.
FAQ
how does flash compare to gemini pro image for real production work?
Flash delivers roughly 90% of Pro's quality at significantly higher speed and lower cost. The remaining 10% shows up in subtle ways - slightly more refined lighting, marginally better compositional choices, more detailed textures at high resolution. For user-facing applications where latency matters, marketing assets, automated pipelines, and most creative workflows, Flash is the better choice. Reserve Pro for hero images, print-quality assets, or situations where you need maximum fidelity and can tolerate longer generation times.
can I use this for batch processing thousands of images?
Yes, and this is one of the strongest use cases. The combination of speed and reliability makes Flash well-suited for high-volume pipelines. Generate product shots across an entire catalog, create social media variants for a campaign, produce thumbnails for a content library. The API handles concurrent requests and the pricing is predictable at scale. Built-in retry handling on rate limits means your batch jobs complete without manual intervention.
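A minimal batch sketch, assuming the "prompt" and "retry_count" field names from the schema below and using standard-library threading for modest concurrency:

```python
from concurrent.futures import ThreadPoolExecutor
from inferencesh import inference

client = inference()

def render(sku: str, description: str):
    return sku, client.run({
        "app": "google/gemini-2-5-flash-image",
        "input": {
            "prompt": f"Clean e-commerce product shot of {description}, white background",
            "retry_count": 2,  # up to 3 total attempts on 429s, per the schema below
        },
    })

catalog = {
    "sku-001": "a ceramic pour-over coffee set",
    "sku-002": "a walnut desk organizer",
}

# the api handles concurrent requests; keep the worker count modest
with ThreadPoolExecutor(max_workers=4) as pool:
    for sku, result in pool.map(lambda kv: render(*kv), catalog.items()):
        print(sku, result["output"])
```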
what makes the text rendering better than other models?
Google trained Flash specifically on text rendering accuracy as a primary objective rather than treating it as a secondary capability. The model understands letterforms, spacing, kerning, and character sets across languages. It also benefits from Google's font rendering expertise. The practical result is text that reads correctly at intended sizes - you can generate a poster and the headline says what you typed, not a garbled approximation.
api reference
about
gemini 2.5 flash image (nanobanana) via vertex ai - advanced image generation model powered by google cloud
1. calling the api
install the client
the client provides a convenient way to interact with the api.
```bash
pip install inferencesh
```

setup your api key
set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.
```bash
export INFERENCE_API_KEY="inf_your_key"
```

run and get result
submit a request and wait for the final result. best for batch processing or when you don't need progress updates.
```python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "google/gemini-2-5-flash-image",
    "input": {}
})

print(result["output"])
```

stream live updates
get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.
```python
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "google/gemini-2-5-flash-image",
    "input": {}
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")
```

2. authentication
the api uses api keys for authentication. see the authentication docs for detailed setup instructions.
3. files
file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.
automatic upload
the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.
```python
# local file paths are automatically uploaded
result = client.run({
    "app": "google/gemini-2-5-flash-image",
    "input": {
        "image": "/path/to/local/image.png",       # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})
```

4. webhooks
get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.
```python
result = client.run({
    "app": "google/gemini-2-5-flash-image",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)
```

webhook payload
your endpoint receives a JSON POST with the task result:
```json
{
  "id": "task_abc123",
  "status": 9,
  "output": { ... },
  "error": "",
  "session_id": null,
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:05Z"
}
```
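on the receiving side, a minimal handler only needs to accept that POST and read those fields. a sketch using Flask - the framework choice is arbitrary and not part of the inference.sh sdk:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def on_task_complete():
    task = request.get_json()
    # terminal states arrive here: completed, failed, or cancelled
    if task.get("error"):
        print(f"task {task['id']} failed: {task['error']}")
    else:
        print(f"task {task['id']} finished with output: {task['output']}")
    return "", 204
```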
5. schema

input
- the prompt for image generation or editing. describe what you want to create or change.
- optional list of input images for editing (up to 14 images). max file size: 7mb (inline). supported formats: png, jpeg, webp, heic, heif.
- number of images to generate.
- aspect ratio for the output image. use 'auto' to automatically match the first input image's aspect ratio. default: 1:1
- output format for the generated images.
- enable google search grounding for real-time information (weather, news, etc.)
- controls randomness in token selection. range: 0.0 - 2.0. default: 1.0
- nucleus sampling probability. range: 0.0 - 1.0. default: 0.95
- top-k sampling. fixed at 64 for this model.
- maximum number of tokens to generate. max: 32768
- safety filter threshold. options: block_none, block_low_and_above, block_medium_and_above, block_only_high
- number of automatic retries on 429 rate limit errors, using exponential backoff with jitter. set to 0 to disable retries. example: retry_count=2 means up to 3 total attempts (1 initial + 2 retries).
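pulling these together, a request exercising most of the schema might look like the following sketch. the field names are inferred from the descriptions above, not confirmed, so verify them against the live schema:

```python
from inferencesh import inference

client = inference()

# field names below are inferred from the schema descriptions above
result = client.run({
    "app": "google/gemini-2-5-flash-image",
    "input": {
        "prompt": "Infographic comparing cumulus and cirrus clouds, labeled diagram",
        "num_images": 1,
        "aspect_ratio": "4:3",
        "output_format": "png",
        "search_grounding": True,             # ground the cloud types in real references
        "temperature": 1.0,                   # range 0.0 - 2.0, default 1.0
        "top_p": 0.95,                        # range 0.0 - 1.0, default 0.95
        "safety_threshold": "block_only_high",
        "retry_count": 2,                     # up to 3 total attempts on rate limits
    },
})
print(result["output"])
```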
ready to run gemini-2-5-flash-image?