GPT Image 2 on inference.sh

OpenAI's GPT Image 2 is their most capable image generation model, combining text-to-image generation, reference-based editing, and mask-based inpainting in a single endpoint. Available on inference.sh as a serverless app, it has processed over 6,800 tasks for 57 paying users who rely on it for everything from product mockups to creative exploration. The model produces images with a distinctive aesthetic — clean compositions, strong lighting, and a photographic quality that immediately reads as "polished."

What makes GPT Image 2 compelling is the flexibility of its editing pipeline. You can generate from text alone, pass reference images for context-aware editing, or use precise mask-based inpainting to control exactly which regions of an image get modified. The quality tiers (low, medium, high) let you balance speed against fidelity depending on whether you are iterating on concepts or producing final assets.

what it does

GPT Image 2 generates images from text descriptions and edits existing images using natural language instructions. It supports three distinct workflows:

Text-to-image — Describe what you want and get a polished image back. The model excels at photorealistic scenes, product photography, illustrations, and graphic design compositions.

Reference-based editing — Pass one or more images alongside your prompt. The model uses them as visual context to guide generation, enabling style matching, subject consistency, and compositional references.

Mask-based inpainting — Provide an image plus a mask indicating which areas should be regenerated. The model fills masked regions while maintaining consistency with the surrounding content. This is precise surgical editing rather than whole-image regeneration.

key features

Flexible resolution — Output any dimension from 256px to 4096px in both width and height, as long as dimensions are multiples of 64. This gives you exact control over output size for any use case, from thumbnails to large-format prints.
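Assuming the multiples-of-64 rule stated above, a small helper (our own, not part of any SDK) can snap an arbitrary target size to the nearest dimension the model accepts:

```python
# Snap a requested pixel size to the documented valid range:
# 256-4096 px, in multiples of 64.

def snap_dimension(px: int) -> int:
    """Clamp to [256, 4096] and round to the nearest multiple of 64."""
    clamped = max(256, min(4096, px))
    return round(clamped / 64) * 64

print(snap_dimension(1920))  # 1920 (already valid)
print(snap_dimension(1000))  # 1024 (rounded up to the nearest multiple)
print(snap_dimension(100))   # 256  (clamped to the minimum)
```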

Quality tiers — Three rendering quality levels. "Low" generates fast drafts for iteration. "Medium" provides good quality at moderate cost. "High" delivers maximum fidelity for final assets. Pricing scales accordingly.

Batch generation — Generate up to 10 images per request with the n parameter. Useful for exploring variations or producing asset sets from a single prompt.

Output format control — Choose between PNG (lossless), JPEG, or WebP output. Compression level is configurable for JPEG and WebP, letting you balance file size against quality.

Content moderation — Configurable moderation strictness. "Auto" applies standard filters, "low" relaxes restrictions for creative applications.

Mask-based precision — The mask input enables pixel-level control over which parts of an image get regenerated, making it ideal for targeted edits without disturbing the rest of the composition.

use cases

Product photography — Generate product shots in various settings, lighting conditions, and compositions without physical photoshoots. Edit existing product images to swap backgrounds or adjust context.

Creative iteration — Rapidly explore visual directions with low-quality drafts, then generate high-quality finals once you settle on a concept. The quality tiers make this workflow economical.

Inpainting and repair — Fix specific regions of images using masks. Remove unwanted objects, fill in missing areas, or replace specific elements while keeping everything else intact.

UI and design mockups — Generate interface screenshots, app mockups, and design concepts at exact pixel dimensions matching your target screens.

Batch asset generation — Produce multiple variations of a concept in a single request. Generate 10 social media post variations, icon sets, or color explorations simultaneously.

Style transfer and consistency — Pass reference images to maintain visual style across a series of generations. Build consistent asset libraries without manually matching aesthetics.

how to run

belt CLI

Basic text-to-image:

```bash
belt app run openai/gpt-image-2 --input '{"prompt": "A flat-lay photograph of a developer workspace: mechanical keyboard, espresso, notebook with handwritten code sketches, warm desk lamp lighting"}'
```

High-quality generation at specific dimensions:

```bash
belt app run openai/gpt-image-2 --input '{"prompt": "Isometric illustration of a server room with glowing blue cables and blinking status lights, clean vector style", "quality": "high", "width": 1920, "height": 1080}'
```

Reference-based editing:

```bash
belt app run openai/gpt-image-2 --input '{"prompt": "Place this product on a marble kitchen counter with natural morning light from the left", "images": ["./product-cutout.png"]}'
```

Batch generation with multiple outputs:

```bash
belt app run openai/gpt-image-2 --input '{"prompt": "App icon for a meditation app, minimal design, gradient background, centered lotus symbol", "n": 5, "width": 1024, "height": 1024, "quality": "medium"}'
```

Inpainting with mask:

```bash
belt app run openai/gpt-image-2 --input '{"prompt": "A lush garden with flowering plants", "images": ["./backyard-photo.png"], "mask": "./mask-sky-area.png"}'
```

API

```bash
curl -X POST https://api.inference.sh/v1/apps/openai/gpt-image-2/run \
  -H "Authorization: Bearer $INFERENCE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Professional headshot photograph, studio lighting, neutral gray background, sharp focus, corporate style",
    "quality": "high",
    "width": 1024,
    "height": 1024,
    "output_format": "png",
    "n": 1
  }'
```
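The same request can be assembled from Python. The helper below is our own sketch, not part of an official inference.sh SDK; it mirrors the documented constraints (dimensions in multiples of 64, quality tiers, n between 1 and 10) as client-side validation before building the JSON body for the endpoint above.

```python
import json

API_URL = "https://api.inference.sh/v1/apps/openai/gpt-image-2/run"
VALID_QUALITIES = {"low", "medium", "high"}

def build_payload(prompt, *, quality="medium", width=1024, height=1024,
                  n=1, output_format="png"):
    """Validate inputs against the documented limits, then return the
    JSON-serializable request body for the run endpoint."""
    for name, px in (("width", width), ("height", height)):
        if not (256 <= px <= 4096 and px % 64 == 0):
            raise ValueError(f"{name} must be 256-4096 px in multiples of 64")
    if quality not in VALID_QUALITIES:
        raise ValueError("quality must be low, medium, or high")
    if not 1 <= n <= 10:
        raise ValueError("n must be between 1 and 10")
    return {"prompt": prompt, "quality": quality, "width": width,
            "height": height, "n": n, "output_format": output_format}

body = json.dumps(build_payload(
    "Professional headshot photograph, studio lighting", quality="high"))

# To actually send it, POST `body` with urllib.request or any HTTP client,
# setting the Authorization and Content-Type headers as in the curl example.
```

Validating locally fails fast on bad dimensions instead of burning a billed request on a 4xx response.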

input parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | string | yes | Text description of the desired image or editing instruction. Be specific about subject, style, lighting, composition, and mood. |
| width | integer | no | Output width in pixels. Must be between 256 and 4096, in multiples of 64. Default is 1024. |
| height | integer | no | Output height in pixels. Must be between 256 and 4096, in multiples of 64. Default is 1024. |
| quality | string | no | Rendering quality tier: "low" (fast drafts), "medium" (balanced), or "high" (maximum fidelity). Affects both generation time and cost. |
| n | integer | no | Number of images to generate, 1 to 10. Default is 1. |
| images | array | no | Reference images for editing workflows. The model uses these as visual context alongside your prompt. |
| mask | string | no | Mask image for inpainting. White areas indicate regions to regenerate; black areas are preserved. Must match the dimensions of the input image. |
| output_format | string | no | Output format: "png", "jpeg", or "webp". Default is "png". |
| output_compression | integer | no | Compression level for JPEG/WebP output, 0-100. Higher values mean more compression (smaller files, lower quality). |
| moderation | string | no | Content moderation strictness: "auto" (standard) or "low" (relaxed). |

output

The app returns:

  • images — Array of generated image file URLs hosted on inference.sh cloud storage.
  • output_meta — Metadata including the actual resolution produced, quality tier used, and billing details.
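A minimal sketch of consuming that response shape: the "images" array holds the hosted file URLs and "output_meta" describes what was actually produced. The sample dict below is an illustrative placeholder, not a captured API response.

```python
def extract_urls(response: dict) -> list:
    """Pull the hosted image URLs out of a run response."""
    return list(response.get("images", []))

# Placeholder response following the documented shape
sample = {
    "images": ["https://example.invalid/out-0.png",
               "https://example.invalid/out-1.png"],
    "output_meta": {"width": 1024, "height": 1024, "quality": "medium"},
}

urls = extract_urls(sample)
print(len(urls))  # 2 -- one URL per generated image (n=2)
```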

pricing

Pricing varies by quality tier and resolution. Representative examples:

| Quality | Resolution | Price per image |
| --- | --- | --- |
| Low | 1024x1024 | $0.006 |
| Medium | 1024x1024 | $0.024 |
| High | 1024x1024 | $0.21 |

Larger resolutions cost more within each quality tier. The range spans from $0.006 for quick low-quality drafts up to $0.21 for high-quality large images. Batch generation (n > 1) multiplies the per-image cost.
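Since batch generation multiplies the per-image price, a quick estimator helps budget a request. The figures below are only this page's representative 1024x1024 examples; larger resolutions cost more.

```python
# Representative 1024x1024 prices from the table above (USD per image)
PRICE_PER_IMAGE = {"low": 0.006, "medium": 0.024, "high": 0.21}

def estimate_cost(quality: str, n: int = 1) -> float:
    """Estimated cost of one request generating n images."""
    return round(PRICE_PER_IMAGE[quality] * n, 4)

print(estimate_cost("medium", 5))  # 0.12 -- five medium-tier variations
print(estimate_cost("high"))       # 0.21 -- one final high-tier asset
```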

when to use this vs alternatives

Choose GPT Image 2 when you need mask-based inpainting precision, flexible arbitrary dimensions (not just aspect ratios), OpenAI's distinctive aesthetic quality, or when iterating with quality tiers to control cost.

Choose Gemini Flash Image when you need Google Search grounding for factual accuracy, text rendering within images, or faster generation at lower cost for standard resolutions.

Choose FLUX Dev when budget is the primary constraint. At $0.005/image, FLUX is 40x cheaper than GPT Image 2 at high quality, though it lacks editing and inpainting capabilities.

Choose Qwen Image 2 when you need complex infographic generation with dense text layouts or document-style outputs.

FAQ

What dimensions can I output?

Any width and height between 256 and 4096 pixels, as long as both are multiples of 64. This gives you precise control — 1920x1080 for widescreen, 1080x1920 for mobile stories, 512x512 for thumbnails, or any custom size your application needs.

How does mask-based inpainting work?

Provide your source image in the images array and a mask image in the mask parameter. The mask should be the same dimensions as the source image. White regions in the mask indicate areas the model should regenerate based on your prompt. Black regions are preserved exactly as-is. This lets you surgically edit specific parts of an image.
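The mask contract (white = regenerate, black = preserve) can be illustrated on a tiny pixel grid. This is a pure-Python sketch of the semantics only; a real mask would be an image file matching the source's dimensions.

```python
# White (255) marks pixels the model regenerates; black (0) is preserved.

def masked_regions(mask):
    """Return (row, col) coordinates that a white mask pixel selects."""
    return [(r, c) for r, row in enumerate(mask)
            for c, v in enumerate(row) if v == 255]

mask = [
    [0,   0,   255],
    [0,   255, 255],
]
print(masked_regions(mask))  # [(0, 2), (1, 1), (1, 2)]
```

Only those coordinates change in the output; everything under black pixels is carried through untouched.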

What is the difference between quality levels?

"Low" generates rough drafts quickly and cheaply — good for brainstorming. "Medium" provides solid quality suitable for most applications. "High" maximizes detail, coherence, and fidelity — use it for final production assets. The cost difference is significant (roughly 35x between low and high at the same resolution).

Can I generate multiple variations at once?

Yes, set n to any value from 1 to 10. Each request returns that many independently generated images from the same prompt. Useful for exploring variations, A/B testing creative options, or generating asset sets.

How does the moderation filter work?

The default "auto" mode applies standard content moderation. Setting moderation to "low" relaxes the filter for creative applications that need more permissive content generation. Both modes still enforce hard safety limits.
