apps/google/gemini-2-5-flash-image

gemini-2-5-flash-image

Gemini 2.5 Flash Image (NanoBanana) via Vertex AI - Advanced image generation model powered by Google Cloud

run in browser run via API

run with your agent

# install belt

$curl -fsSL https://cli.inference.sh | sh

# view schema & details

$belt app get google/gemini-2-5-flash-image

# run

$belt app run google/gemini-2-5-flash-image

Google's Gemini 3.1 Flash Image Preview - internally codenamed "Nano Banana 2" and released in February 2026 - has quietly become the most-used image generation model on inference.sh. Over 53,000 tasks run through it, 138 paying users calling it daily, and a trajectory that shows no signs of slowing. It currently holds the number one ranking on the Artificial Analysis Image Arena leaderboard. The reasons are not mysterious. It generates fast, renders text you can actually read, edits existing images without a separate pipeline, and costs less than its Pro sibling while delivering results that most people cannot tell apart.

The model sits at an interesting intersection. It is not the cheapest option available - FLUX Dev still wins on raw cost per image. It is not the highest-fidelity either - Gemini Pro Image takes that crown. But for the daily reality of production workloads where you need good images quickly and reliably, Flash has found its lane and stayed there.

the flash advantage

There is something specific about how Flash operates that matters for anyone building products rather than just generating pretty pictures. The "Flash" in the name refers to Google's fastest inference architecture, which means you get results back in seconds rather than the longer waits associated with Pro-tier models. When you are iterating on a design concept or running image generation inside a user-facing application, that latency difference compounds fast.

Think about it from a product perspective. Your user uploads a product photo, types "put this on a beach at sunset," and waits. Two seconds versus eight seconds. The gap between those two experiences is the gap between an app that feels responsive and one that feels like it is thinking too hard. Flash gives you the responsive version without asking you to sacrifice much on output quality.

The quality difference between Flash and Pro is real but narrower than you might expect. Side by side, a trained eye can spot more refined lighting and slightly better compositional choices in Pro outputs. But for the vast majority of production use cases - social media assets, e-commerce product shots, marketing mockups, automated thumbnails - Flash outputs are indistinguishable from what you would get at the higher tier.

text that actually reads

I keep coming back to this because it genuinely changed what I expect from image generators. Most models produce text that looks like text from a distance but falls apart on closer inspection. Letters blur into each other, spacing goes wrong, characters get invented. You learn to work around it - avoiding text in prompts, adding it in post-production, accepting the limitation.

Gemini Flash does not have that limitation. Or rather, it has it much less. Generate a poster with a headline and the headline is legible. Create a product mockup with a label and the label says what you told it to say. Make a greeting card and the message renders cleanly enough to actually send.

This matters enormously for automation. If you are building a system that generates social media images with text overlays, or product labels, or educational materials with captions, you no longer need a separate text rendering step after generation. The model handles it in a single pass, which simplifies your pipeline and reduces the number of failure points.

The text rendering also works across languages. Generate in English, then regenerate the same concept with Hindi or Spanish or Japanese text, and the model handles the character sets correctly while preserving the visual composition. For teams producing localized marketing materials, that is hours of design work replaced by a second API call.

editing without the pipeline tax

One thing that surprised me when I first used this model is how naturally it handles image editing. You pass in one or more reference images alongside your prompt, and it understands what you want changed. Not in a crude inpainting way where you draw a mask and fill it - in a conversational way where you describe the edit and the model figures out what stays and what changes.

"Replace the background with a modern office." "Make the lighting warmer and more dramatic." "Remove the person on the left." "Combine these two product shots into one scene." These instructions work because the model processes the input images with the same understanding it brings to text prompts. It sees the image, comprehends the composition, and makes targeted modifications while preserving what you did not ask it to change.

The multi-image input capability extends this further. You can pass multiple reference images - style references, composition guides, product photos from different angles, texture samples - and the model synthesizes them into a single coherent output. The model can maintain visual coherence across up to five characters and fourteen objects in a single scene. This replaces workflows that previously required multiple generation passes, manual compositing in Photoshop, or specialized tools for each step.

google search grounding

Here is a feature that no other image generator currently matches. When you enable search grounding, the model can query Google Search during generation to ensure factual accuracy. This sounds like a small thing until you try generating images of real places, products, or concepts.

Ask a standard image generator for "the Sagrada Familia cathedral" and you get something that looks vaguely like a gothic church. Ask Gemini Flash with grounding enabled and you get something that actually resembles the Sagrada Familia, because the model referenced current images of it during generation. The difference between "a thing that looks plausible" and "a thing that looks correct" matters a lot when you are creating educational content, reference materials, or anything that represents reality.

The practical applications are specific and valuable. Generate an infographic about cloud formations and the model knows what cumulus versus cirrus actually looks like. Create a diagram of a Mars rover and it references the real design rather than inventing a generic sci-fi rover. Make a product visualization of an existing object and it matches the actual product rather than hallucinating features.

Search grounding is optional and toggled per request. Pure creative work where you want the model to imagine freely - leave it off. Anything that needs to reflect reality accurately - turn it on. The flexibility means you do not pay for grounding when you do not need it, but it is there the moment accuracy matters.

resolution and scaling

The model outputs at multiple resolution tiers, from 512 pixels for quick previews up through 1K and 2K for standard production work to 4K for print and high-fidelity applications. The model supports resolutions from 512px up to 4096px, giving you detail and clarity appropriate to the target size.

For production pipelines, this resolution control is practical rather than aspirational. Generate previews at 512px during iteration when you are exploring concepts and speed matters more than detail. Switch to 2K or 4K for final outputs when you have locked in the direction and need publication-ready assets. The pricing scales with resolution, so you are not overpaying for throwaway exploration.

Aspect ratio support covers ten options including 1:1, 4:3, 3:2, 16:9, 21:9, and their portrait inverses plus 5:4 and 4:5. There is also an auto-detect mode during editing workflows that matches the aspect ratio of your input image. Between resolution tiers and aspect ratio options, you can generate assets that fit into specific placements - Instagram posts, YouTube thumbnails, website headers, print layouts - without post-generation cropping or resizing. Every generated image includes an invisible SynthID watermark for provenance tracking.

where flash wins and where it does not

I want to be honest about positioning because there are legitimate reasons to choose other models depending on what you are building.

Flash wins decisively on speed-to-quality ratio. If you need good images fast - in a user-facing application, in a content pipeline running thousands of generations, in an iterative creative process - nothing else matches the combination of quality and latency at this price point. The text rendering and search grounding features add capabilities that most competitors simply do not offer.

GPT Image 2 from OpenAI offers a different aesthetic sensibility, strong text rendering with roughly 99% character-level accuracy, and better mask-based inpainting for precision editing workflows. If your product is built around OpenAI's ecosystem and the visual style matches what your users expect, it remains a valid choice. But it lacks search grounding and costs more at its high quality tier.

FLUX Dev from Black Forest Labs wins on cost alone. For pure text-to-image generation where you need volume and the quality bar is "good enough," FLUX is hard to beat economically. But it does not do editing, does not render text reliably, and cannot ground in real-world information.

Qwen Image 2 excels at dense informational outputs - complex infographics, detailed diagrams, long-form visual documents with heavy text content. If your use case is specifically "generate a dense information graphic," Qwen handles it better than Flash.

The honest answer for most production use cases is that Flash covers 80% of what you need, and you reach for specialized models for the remaining 20%.

running it on inference.sh

The model runs as a serverless app on inference.sh, which means no GPU provisioning, no cold starts to manage, no infrastructure to maintain. You call the API, pass your prompt and any reference images, and get generated images back. The same interface works whether you are generating one image from a terminal or processing thousands through a production pipeline.

This is the standard inference.sh experience - one API endpoint, consistent interface across all models, no operational overhead. If you are already running other AI workloads through inference.sh, adding Flash image generation is just another app call with no new concepts to learn.

FAQ

how does flash compare to gemini pro image for real production work?

Flash delivers roughly 90% of Pro's quality at significantly higher speed and lower cost. The remaining 10% shows up in subtle ways - slightly more refined lighting, marginally better compositional choices, more detailed textures at high resolution. For user-facing applications where latency matters, marketing assets, automated pipelines, and most creative workflows, Flash is the better choice. Reserve Pro for hero images, print-quality assets, or situations where you need maximum fidelity and can tolerate longer generation times.

can I use this for batch processing thousands of images?

Yes, and this is one of the strongest use cases. The combination of speed and reliability makes Flash well-suited for high-volume pipelines. Generate product shots across an entire catalog, create social media variants for a campaign, produce thumbnails for a content library. The API handles concurrent requests and the pricing is predictable at scale. Built-in retry handling on rate limits means your batch jobs complete without manual intervention.

what makes the text rendering better than other models?

Google trained Flash specifically on text rendering accuracy as a primary objective rather than treating it as a secondary capability. The model understands letterforms, spacing, kerning, and character sets across languages. It also benefits from Google's font rendering expertise. The practical result is text that reads correctly at intended sizes - you can generate a poster and the headline says what you typed, not a garbled approximation.

api reference

about

gemini 2.5 flash image (nanobanana) via vertex ai - advanced image generation model powered by google cloud

1. calling the api

install the client

the client provides a convenient way to interact with the api.

bash

1pip install inferencesh

setup your api key

set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.

bash

1export INFERENCE_API_KEY="inf_your_key"

run and get result

submit a request and wait for the final result. best for batch processing or when you don't need progress updates.

python

1from inferencesh import inference23client = inference()456result = client.run({7        "app": "google/gemini-2-5-flash-image",8        "input": {}9    })1011print(result["output"])

stream live updates

get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.

python

1from inferencesh import inference23client = inference()456# stream=True yields updates as they arrive7for update in client.run({8        "app": "google/gemini-2-5-flash-image",9        "input": {}10    }, stream=True):11    if update.get("progress"):12        print(f"progress: {update['progress']}%")13    if update.get("output"):14        print(f"output: {update['output']}")

2. authentication

the api uses api keys for authentication. see the authentication docs for detailed setup instructions.

3. files

file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.

automatic upload

the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.

python

1# local file paths are automatically uploaded2result = client.run({3    "app": "google/gemini-2-5-flash-image",4    "input": {5        "image": "/path/to/local/image.png",  # detected & uploaded6        "audio": "https://example.com/audio.mp3",  # url passed through7    }8})

manual upload

you can also upload files manually and use the returned url.

python

1# upload and get a hosted URL2file = client.files.upload("/path/to/file.png")3print(file.uri)  # https://cloud.inference.sh/...

4. webhooks

get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.

python

1result = client.run({2    "app": "google/gemini-2-5-flash-image",3    "input": {},4    "webhook": "https://your-server.com/webhook"5}, wait=False)

webhook payload

your endpoint receives a JSON POST with the task result:

json

1{2  "id": "task_abc123",3  "status": 9,4  "output": { ... },5  "error": "",6  "session_id": null,7  "created_at": "2024-01-15T10:30:00Z",8  "updated_at": "2024-01-15T10:30:05Z"9}

idstring— task id

statusnumber— terminal status (9=completed, 10=failed, 11=cancelled)

outputobject— task output (when completed)

errorstring— error message (when failed)

session_idstring— session id (if using sessions)

created_atstring— iso timestamp

updated_atstring— iso timestamp

5. schema

input

promptstring*

the prompt for image generation or editing. describe what you want to create or change.

imagesarray

optional list of input images for editing (up to 14 images). max file size: 7mb (inline). supported formats: png, jpeg, webp, heic, heif.

num_imagesinteger

number of images to generate.

default: 1min:1max:4

aspect_ratiostring

aspect ratio for the output image. use 'auto' to automatically match the first input image's aspect ratio. default: 1:1

default: "1:1"

options:"auto""21:9""16:9""3:2""4:3""5:4""1:1""4:5""3:4""2:3""9:16"

output_formatstring

output format for the generated images.

default: "png"

options:"png""jpeg""webp""heic""heif"

enable_google_searchboolean

enable google search grounding for real-time information (weather, news, etc.)

default: false

temperaturenumber

controls randomness in token selection. range: 0.0 - 2.0. default: 1.0

default: 1min:0max:2

top_pnumber

nucleus sampling probability. range: 0.0 - 1.0. default: 0.95

default: 0.95min:0max:1

top_kinteger

top-k sampling. fixed at 64 for this model.

default: 64

max_output_tokensinteger

maximum number of tokens to generate. max: 32768

default: 32768

safety_tolerancestring

safety filter threshold. options: block_none, block_low_and_above, block_medium_and_above, block_only_high

default: "BLOCK_NONE"

options:"BLOCK_NONE""BLOCK_LOW_AND_ABOVE""BLOCK_MEDIUM_AND_ABOVE""BLOCK_ONLY_HIGH"

retry_countinteger

number of automatic retries on 429 rate limit errors using exponential backoff with jitter. set to 0 to disable retries. example: retry_count=2 means up to 3 total attempts (1 initial + 2 retries).

default: 2min:0max:5

output

imagesarray*

the generated or edited images

descriptionstring

text description or response from the model

output_metaobject

structured metadata about inputs/outputs for pricing calculation

ready to run gemini-2-5-flash-image?

try in browser browse all tools

we use cookies

we use cookies to ensure you get the best experience on our website. for more information on how we use cookies, please see our cookie policy.

by clicking "accept", you agree to our use of cookies.
learn more.