
gpt-image-2

Generate and edit images using OpenAI's GPT Image 2 model. Supports text-to-image, image editing with reference images, and mask-based inpainting.

run with your agent
# install belt
$ curl -fsSL https://cli.inference.sh | sh
# view schema & details
$ belt app get openai/gpt-image-2
# run
$ belt app run openai/gpt-image-2

OpenAI's image generation models have become a kind of baseline. When someone says "AI-generated image," there's a good chance the mental image they conjure looks like something DALL-E produced - clean, well-lit, slightly too perfect. GPT Image 2, released in April 2026 as the successor to DALL-E 3, carries that lineage forward with a fundamentally different architecture - generating images more like an LLM generates text rather than using traditional diffusion. It brings better instruction following, more flexible output dimensions, and a mask-based editing system that actually works for production use cases. OpenAI is shutting down DALL-E 2 and DALL-E 3 in May 2026, making GPT Image 2 their sole image model going forward.

I've been watching how people use this model on inference.sh, and the pattern is interesting. It's not the cheapest option. It's not the fastest. But it handles the edit-in-place workflow better than almost anything else available right now, and that's what keeps people coming back.

the editing pipeline is the real product

Most image generation models treat generation and editing as fundamentally different operations. You have one endpoint for text-to-image and another, often clunkier system for modifications. GPT Image 2 collapses these into a single pipeline with three modes that share the same underlying architecture.

The simplest mode is pure text-to-image. Describe what you want, get an image back. The model has a distinctive aesthetic here - compositions tend toward clean negative space, lighting that feels intentional, and a photographic quality that reads as polished without being sterile. If you've seen enough AI-generated images to develop preferences, you'll recognize the GPT Image 2 look within a few generations.
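If you want to see that mode through the API documented later on this page, a minimal text-to-image call looks like the sketch below. The prompt and dimensions are placeholders; the field names come from the schema in the api reference section.

python
from inferencesh import inference

client = inference()

# plain text-to-image: a prompt and explicit output dimensions
result = client.run({
    "app": "openai/gpt-image-2",
    "input": {
        "prompt": "a ceramic mug on a walnut desk, soft window light, product photo",
        "width": 1024,
        "height": 1024,
        "quality": "medium",
    }
})

print(result["output"])  # the generated image files, per the output schema below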

The second mode accepts reference images alongside your prompt. Pass in a product photo and ask for it placed on a marble countertop with morning light. Pass in a brand style guide and request new compositions that match. The model uses these references as visual context, and it does a reasonable job maintaining consistency - though I'd note it's not infallible. Complex style matching across very different subject matter can drift.

The third mode is where things get genuinely useful for iterative work. Mask-based inpainting lets you define exactly which pixels should be regenerated. White areas in your mask get filled based on the prompt. Black areas stay untouched. This sounds simple, but the execution quality matters enormously. Bad inpainting systems produce visible seams, inconsistent lighting at mask boundaries, or ignore the surrounding context entirely. GPT Image 2 handles these boundaries well enough that you can do multi-pass edits on the same image without accumulating artifacts.

resolution flexibility and quality tiers

One underappreciated aspect of this model is the resolution system. Rather than offering a fixed set of aspect ratios, it accepts custom dimensions with both edges as multiples of 16, a maximum edge of 3840 pixels, an aspect ratio no wider than 3:1, and total pixel count between roughly 655K and 8.3 million. That means 1920x1080 for widescreen content, 1080x1920 for mobile stories, 768x1024 for product listings, or whatever specific dimensions your layout requires. The native output ceiling is 2K (2048px). No cropping, no letterboxing, no workarounds.
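If you're generating dimensions programmatically, it's worth checking them against those constraints before submitting a request. The helper below is an illustrative sketch of the documented rules, not part of the SDK, and the pixel-count bounds are the approximate figures quoted above.

python
MAX_EDGE = 3840
MIN_PIXELS = 655_000     # approximate lower bound quoted above
MAX_PIXELS = 8_300_000   # approximate upper bound quoted above

def snap16(x: int) -> int:
    """Round to the nearest multiple of 16 (e.g. 1080 -> 1088)."""
    return max(16, round(x / 16) * 16)

def valid_dimensions(width: int, height: int) -> bool:
    """Check a width/height pair against the documented constraints."""
    if width % 16 or height % 16:                       # both edges multiples of 16
        return False
    if max(width, height) > MAX_EDGE:                   # longest edge capped at 3840px
        return False
    if max(width, height) / min(width, height) > 3:     # aspect ratio no wider than 3:1
        return False
    return MIN_PIXELS <= width * height <= MAX_PIXELS   # total pixel budget

print(valid_dimensions(1920, snap16(1080)))   # True  (1080 snaps to 1088)
print(valid_dimensions(3840, 1024))           # False (3.75:1 is wider than 3:1)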

The quality tier system splits generation into three levels. Low quality produces fast drafts - useful when you're iterating on prompts and need to see results quickly without committing budget. Medium provides solid output suitable for most applications. High maximizes detail and coherence for production assets.

The cost difference between tiers is significant, which makes the tiered approach genuinely useful as a workflow tool rather than just a pricing gimmick. You can burn through dozens of low-quality iterations at minimal cost, then generate your final at high quality once you've nailed the prompt.
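In practice that means iterating at the low tier and only rerunning the winning prompt at high quality. A rough sketch, reusing the client pattern from the api reference with a placeholder prompt:

python
from inferencesh import inference

client = inference()

base_input = {"prompt": "flat-lay of hiking gear on slate, overcast daylight"}

# cheap drafts while the prompt wording is still moving
draft = client.run({
    "app": "openai/gpt-image-2",
    "input": {**base_input, "quality": "low"},
})

# once the wording is settled, spend on a single high-quality final
final = client.run({
    "app": "openai/gpt-image-2",
    "input": {**base_input, "quality": "high"},
})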

where it falls short

I want to be honest about the limitations because they matter for choosing the right tool.

First, a limitation that no longer applies: text rendering is now one of the model's strongest features, achieving roughly 99% character-level accuracy across Latin, CJK, Hindi, Bengali, and Arabic scripts - a dramatic leap from DALL-E 3's roughly 60-70%. If you need words accurately placed within an image - a storefront sign, a book cover title, text on a screen - GPT Image 2 handles it reliably, and it's competitive with or better than Gemini's text rendering for most use cases.

There's no search grounding. Gemini Flash Image can pull in real-world knowledge through Google Search to produce factually accurate depictions of places, people, or objects it wasn't trained on. GPT Image 2 works purely from its training data. If you ask for a specific restaurant that opened last month, you'll get a plausible hallucination rather than an accurate representation.

Cost at the high quality tier is steep relative to alternatives. FLUX Dev is dramatically cheaper for basic generation. FLUX lacks the editing and inpainting capabilities, but if you just need bulk text-to-image generation and don't need surgical edits, the economics point elsewhere.

The model also doesn't support video or animation. It's purely static images. If your pipeline eventually needs motion, you'll want to consider whether starting with a different model's aesthetic makes sense for consistency downstream.

the mask workflow in practice

The inpainting system deserves more attention because it's the feature that most differentiates this model from cheaper alternatives. In practice, the workflow looks like this: you generate or upload a base image, create a mask that isolates the region you want to change, then write a prompt describing what should fill that region.

The mask is a separate image at the same dimensions as your source. White pixels mark regeneration zones. Black pixels mark preservation zones. The model considers the surrounding context when filling masked areas, which means you get coherent lighting, perspective, and texture continuity across the boundary.

This matters for production work in ways that whole-image regeneration doesn't address. When a client says "I like everything about this except the background" or "can you replace just the product in the hero image," you need precision. Regenerating the entire image means losing elements you've already approved. Mask-based editing preserves approved work and modifies only what needs changing.

The tradeoff is that creating good masks requires additional work. You need a separate image editing step - whether that's Photoshop, a programmatic mask generator, or a segmentation model - to produce the mask before you can use this feature. It's not a one-click operation.
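Assuming you already have a source image and a matching mask on disk, the call itself is compact. The file names below are placeholders; images and mask are the field names from the schema in the api reference, and the SDK uploads local paths automatically.

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "openai/gpt-image-2",
    "input": {
        "prompt": "replace the background with a marble countertop in soft morning light",
        "images": ["approved_hero.png"],   # base image; the mask applies to the first image
        "mask": "hero_mask.png",           # must match the base image's dimensions exactly
        "quality": "high",
    }
})

print(result["output"])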

batch generation and format options

You can request up to 10 images per API call using the n parameter. Each comes back as an independently generated variation from the same prompt. This is useful for exploration - seeing multiple interpretations of the same description helps you understand what the model is and isn't capturing from your prompt language. Requesting n=4 in a single call is more efficient than four separate requests for the same prompt.
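A sketch of that exploration pattern, with placeholder prompt text and assuming the output dict mirrors the schema in the api reference (an images array):

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "openai/gpt-image-2",
    "input": {
        "prompt": "isometric illustration of a tiny greenhouse, pastel palette",
        "n": 4,              # four independent variations in one call
        "quality": "low",    # the cheap tier is a natural fit for exploration
    }
})

# assumption: the output object carries the `images` array named in the schema
for i, image in enumerate(result["output"]["images"]):
    print(i, image)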

Output format options cover the practical range: PNG for lossless quality when you need to continue editing, JPEG for smaller files when compression artifacts are acceptable, and WebP for the best size-to-quality ratio in web delivery. Compression is configurable for JPEG and WebP outputs, letting you dial in the exact tradeoff your application needs.
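The format and compression settings are just two more input fields; the values below are illustrative:

python
# web delivery: webp with explicit compression
web_input = {
    "prompt": "hero banner, minimalist desk setup, warm light",
    "output_format": "webp",
    "output_compression": 80,   # 0-100; ignored for png
}

# downstream editing: lossless png, no compression setting needed
editing_input = {
    "prompt": "hero banner, minimalist desk setup, warm light",
    "output_format": "png",
}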

Content moderation ships with two modes. The default applies standard filtering. A relaxed mode exists for creative applications that need more permissive content, though hard safety limits remain regardless of setting.

who should actually use this

The model finds its best fit with people doing iterative creative work where precision editing matters. Product photography teams that need to composite items into different environments. Design workflows where you're refining a specific composition across multiple rounds. Marketing teams that need variations on an approved visual concept without starting from scratch each time.

If you just need fast, cheap text-to-image generation and don't care about editing capability, this isn't the most economical choice. The editing pipeline is what justifies the price premium, and if you're not using it, you're paying for capabilities that sit idle.

For applications where text accuracy matters - infographics, diagrams with labels, screenshots with UI text - GPT Image 2 is now genuinely strong, with roughly 99% character-level accuracy. For very dense text-heavy outputs like full-page infographics or slide decks, Qwen's image generation may still have an edge in layout handling.

the aesthetic question

Every image model has a visual signature, and GPT Image 2's signature is "competent photography." The default output looks like it was shot by someone who knows how to light a scene and compose a frame. Shadows are soft, highlights don't blow out, and compositions follow conventional rules of balance.

This is both a strength and a limitation. The output looks professional without much prompt engineering, but it can be difficult to push toward genuinely unusual or experimental aesthetics. The model seems to have strong priors about what constitutes a "good" image, and those priors lean commercial. If you want deliberately rough, lo-fi, or unconventional compositions, you'll fight the model's tendencies more than you would with something like FLUX.

For commercial work - product shots, marketing assets, social media content - this bias toward polish is exactly what you want. The model's defaults align with professional standards, which reduces the prompting effort needed to get usable results.

how does gpt image 2 compare to dall-e 3?

GPT Image 2 represents a fundamental architectural shift from DALL-E 3 - it generates images more like an LLM generates text, moving away from the diffusion process used by DALL-E 3 and most prior models. The practical improvements are substantial: text rendering accuracy jumped from roughly 60-70% to 99%, resolution is now flexible rather than limited to fixed aspect ratios, and the mask-based inpainting pipeline is entirely new. The quality ceiling is higher, the prompt adherence is tighter, and the editing workflow transforms it from a generation-only tool into something useful for iterative creative work. OpenAI is retiring DALL-E 2 and 3 in May 2026, making this the sole path forward.

can I use reference images without a mask?

Yes. Passing images without a mask engages the reference-based editing mode rather than inpainting. The model uses your reference images as visual context alongside the text prompt. This works for style matching, subject consistency, and compositional guidance. The model interprets references loosely rather than literally copying them, so think of it as "generate something inspired by these" rather than "composite these together." Results vary based on how closely your prompt aligns with the visual content of the references.
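Concretely, that just means populating images and omitting mask - a sketch with placeholder file names:

python
from inferencesh import inference

client = inference()

# reference-based editing: images supplied, no mask
result = client.run({
    "app": "openai/gpt-image-2",
    "input": {
        "prompt": "the same wallet on a marble countertop with morning light",
        "images": ["product_photo.jpg", "brand_style_frame.png"],
    }
})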

what happens if my mask dimensions don't match the source image?

The mask must match the source image dimensions exactly. If there's a mismatch, the request will fail rather than silently resizing or cropping. This is actually preferable to automatic adjustment, since mask alignment is pixel-precise by design. If you're generating masks programmatically, ensure your pipeline produces output at the same resolution as your source. For manual mask creation, open your source image and paint directly on a layer at native resolution to avoid dimension drift.
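One way to rule out dimension drift is to derive the mask canvas from the source image itself. A minimal sketch with Pillow, using the white-regenerate / black-preserve convention described earlier (double-check the mask field description in the schema for the convention your pipeline should follow); the file names and rectangle are placeholders:

python
from PIL import Image, ImageDraw

source = Image.open("approved_hero.png")

# build the mask on a canvas taken from the source's own size
mask = Image.new("L", source.size, 0)              # black: preserve
draw = ImageDraw.Draw(mask)
draw.rectangle((600, 200, 1400, 900), fill=255)    # white: regenerate this region

assert mask.size == source.size
mask.save("hero_mask.png")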

api reference

about

generate and edit images using openai's gpt image 2 model. supports text-to-image, image editing with reference images, and mask-based inpainting.

1. calling the api

install the client

the client provides a convenient way to interact with the api.

bash
pip install inferencesh

setup your api key

set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.

bash
export INFERENCE_API_KEY="inf_your_key"

run and get result

submit a request and wait for the final result. best for batch processing or when you don't need progress updates.

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "openai/gpt-image-2",
    "input": {}
})

print(result["output"])

stream live updates

get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.

python
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "openai/gpt-image-2",
    "input": {}
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication

the api uses api keys for authentication. see the authentication docs for detailed setup instructions.

3. files

file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.

automatic upload

the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.

python
# local file paths are automatically uploaded
result = client.run({
    "app": "openai/gpt-image-2",
    "input": {
        "image": "/path/to/local/image.png",  # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})

manual upload

you can also upload files manually and use the returned url.

python
# upload and get a hosted URL
file = client.files.upload("/path/to/file.png")
print(file.uri)  # https://cloud.inference.sh/...

4. webhooks

get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.

python
result = client.run({
    "app": "openai/gpt-image-2",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload

your endpoint receives a JSON POST with the task result:

json
1{2  "id": "task_abc123",3  "status": 9,4  "output": { ... },5  "error": "",6  "session_id": null,7  "created_at": "2024-01-15T10:30:00Z",8  "updated_at": "2024-01-15T10:30:05Z"9}
id (string): task id
status (number): terminal status (9=completed, 10=failed, 11=cancelled)
output (object): task output (when completed)
error (string): error message (when failed)
session_id (string): session id (if using sessions)
created_at (string): iso timestamp
updated_at (string): iso timestamp
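
A minimal receiver sketch - Flask is used here only as an example, and the route must match the webhook url you passed to client.run:

python
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def handle_task_result():
    payload = request.get_json()
    if payload["status"] == 9:        # completed
        print("output:", payload["output"])
    elif payload["status"] == 10:     # failed
        print("error:", payload["error"])
    else:                             # 11 = cancelled
        print("cancelled:", payload["id"])
    return "", 200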

5. schema

input

prompt (string, required)

text prompt describing the desired image.

example: "A cat wearing a tiny top hat, oil painting style"

images (array)

optional reference image(s) for editing. when a mask is provided, it applies to the first image.

mask (string, file)

optional mask image indicating areas to edit (requires input images). transparent areas in the mask indicate where the image should be edited. applied to the first image.

width (integer)

output image width in pixels. must be a multiple of 16.

default: 1024, min: 256, max: 3840

height (integer)

output image height in pixels. must be a multiple of 16.

default: 1024, min: 256, max: 3840

quality (string)

rendering quality. 'low' for fast drafts, 'high' for final assets.

default: "auto"
options: "auto", "low", "medium", "high"

n (integer)

number of images to generate (1-10).

default: 1, min: 1, max: 10

output_format (string)

output file format.

default: "png"
options: "png", "jpeg", "webp"

output_compression (integer)

compression level for jpeg/webp (0-100). ignored for png.

min: 0, max: 100

moderation (string)

content moderation strictness. 'auto' applies standard filtering; 'low' is less restrictive.

default: "auto"
options: "auto", "low"

output

images (array, required)

the generated image files.

output_meta (object)

structured metadata about inputs/outputs for pricing calculation.

ready to run gpt-image-2?
