
Gemini 3 Pro Preview

run with your agent
# install belt
$ curl -fsSL https://cli.inference.sh | sh
# view schema & details
$ belt app get openrouter/gemini-3-pro-preview
# run
$ belt app run openrouter/gemini-3-pro-preview

Google has been iterating on the Gemini family at a pace that makes it hard to keep score. Gemini 3 Pro Preview is a flagship reasoning model from Google DeepMind, and it lands in a market that has gotten brutally competitive in the last six months. It undercuts Anthropic's Claude Opus on price by a meaningful margin while positioning itself as a genuine alternative for complex reasoning tasks. The "Preview" label is honest about where this stands - Google is still refining the model, which means you get access to frontier capability with the caveat that behavior may shift between updates.

I have been running Gemini 3 Pro alongside Claude models for several weeks now, and the picture that emerges is more nuanced than any ranking would suggest. This is not a simple "better or worse" story. It is a model with distinct strengths that make it the right choice for specific workloads, and real weaknesses that make it the wrong choice for others. The interesting question is not whether it beats Claude or GPT on some aggregate benchmark, but where it excels in ways those models do not.

native multimodal understanding

Here is where Gemini 3 Pro genuinely pulls ahead of the competition, and it is not close. Google built this model from the ground up as a multimodal system. It processes text, images, video, and audio natively rather than routing through separate encoder modules that were bolted on after training. The practical difference shows up the moment you start working with mixed inputs.

Pass the model a screenshot of a complex dashboard and ask it to identify the anomaly in the data. Feed it a video clip and ask for a timeline of events. Upload an audio recording of a meeting and request structured notes. These are not party tricks - they are workflows that other models either cannot do at all or handle with visible seams where the modality-specific components fail to integrate properly.

The video understanding is particularly striking. You can pass substantial video clips and the model will track objects, understand temporal relationships, identify actions, and reason about causality across frames. Claude can process images well, but video remains a gap. For teams building applications that need to analyze visual media - content moderation systems, surveillance analytics, media archival tools, accessibility services - Gemini 3 Pro offers capabilities that simply do not exist in competing models.

Audio processing follows a similar pattern. The model transcribes, understands tone and context, identifies speakers, and can reason about the content of audio alongside text and visual inputs in the same conversation. If your workflow involves processing podcast episodes, analyzing call recordings, or building voice-first applications, the native audio understanding removes an entire layer of preprocessing that other approaches require.
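
To make that concrete, here is a minimal sketch of what a meeting-notes workflow might look like through the inference.sh python client documented in the api reference below. The "audio" input key mirrors the files example further down; treat the exact field name as an assumption and verify it against the live schema.

python
from inferencesh import inference

client = inference()

# hypothetical meeting-notes request: the "audio" key follows the
# files example in the api reference; confirm it in the live schema
result = client.run({
    "app": "openrouter/gemini-3-pro-preview",
    "input": {
        "audio": "/path/to/meeting.mp3",  # local file, uploaded automatically
        "text": "produce structured notes: attendees, decisions, action items",
    }
})

print(result["output"])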

the context window advantage

Gemini 3 Pro supports a context window that dwarfs what most competitors offer. Google has been pushing context length as a core differentiator since the original Gemini launch, and with this model they continue to lead. You can feed the model enormous documents - entire codebases, full research papers with appendices, lengthy transcripts - without the summarization tricks or retrieval augmentation that shorter-context models require.

The practical value depends entirely on your use case. For retrieval-augmented generation pipelines where you have already chunked your documents and are passing focused context, a massive window buys you little. But for workflows where you need the model to reason over a large, interconnected body of information - legal document analysis, codebase comprehension, long-form research synthesis - the ability to fit everything into a single context changes what is possible.

I have found the long context most useful for code review scenarios where understanding requires seeing how components interact across files. Rather than summarizing each file separately and hoping the model connects the dots, you can present the full picture and ask targeted questions. The quality of analysis genuinely improves when the model can reference the actual code rather than a compressed representation of it.
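
As a sketch of that workflow: concatenate the relevant files into the `text` input and steer the review with `system_prompt`, both fields from the schema below. The file-gathering logic here is illustrative, not part of the sdk.

python
from pathlib import Path
from inferencesh import inference

client = inference()

# present the full picture: every python file in the package, labeled by path
files = sorted(Path("src").rglob("*.py"))
codebase = "\n\n".join(f"# file: {p}\n{p.read_text()}" for p in files)

result = client.run({
    "app": "openrouter/gemini-3-pro-preview",
    "input": {
        "system_prompt": "you are a senior engineer reviewing a pull request.",
        "text": codebase + "\n\nquestion: where does the retry logic "
                "duplicate work already done in the http client?",
    }
})

print(result["output"])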

There is a cost consideration here though. Larger context means more input tokens, which means higher per-request costs. For production systems that hit this endpoint frequently, those costs accumulate. The long context is a capability, not a default operating mode - use it deliberately for tasks that require it rather than treating it as a convenience to avoid good context management.

search grounding and live knowledge

Google's unique advantage in the LLM space has always been its access to the world's information through Search, and Gemini 3 Pro makes that advantage tangible. Search grounding allows the model to query Google Search during inference, pulling in current information that goes beyond its training data cutoff.

For applications that deal with current events, real-time information, or rapidly changing domains, this feature is transformative. Ask about a company's latest quarterly results, a recent policy change, or current market conditions, and the model can retrieve and reason over up-to-date information rather than confabulating from stale training data. Every LLM hallucinates when pushed past its knowledge boundary. Search grounding moves that boundary to roughly "what Google has indexed in the last few hours."

The feature is optional and adds cost per request, but for applications where currency of information matters, it eliminates an entire class of reliability problems. Compare this to the alternative - building and maintaining your own retrieval pipeline with a search API, a chunking strategy, an embedding model, and a reranking layer - and the simplicity of a single parameter toggle becomes very appealing.
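
The call shape stays the same when grounding is enabled. The schema in the api reference below does not list a grounding parameter for this app, so the `web_search` flag in this sketch is a hypothetical stand-in - check the live schema for the real toggle before relying on it.

python
from inferencesh import inference

client = inference()

# "web_search" is a HYPOTHETICAL flag standing in for the grounding
# toggle; it is not in the schema below - confirm the real parameter name
result = client.run({
    "app": "openrouter/gemini-3-pro-preview",
    "input": {
        "text": "summarize this week's changes to the EU AI Act timeline",
        "web_search": True,  # hypothetical grounding toggle
    }
})

print(result["output"])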

where google falls short

I would not be doing this model justice if I only talked about its strengths. Gemini 3 Pro has real limitations that matter for production workloads, and being clear about them up front saves you the frustration of discovering them the hard way.

Instruction following is looser than Claude's. This is the single most consistent observation across my testing. When you give Claude a complex system prompt with multiple constraints - output format requirements, tone guidelines, content restrictions, behavioral rules - it tends to follow them with high fidelity. Gemini 3 Pro follows them most of the time but drifts more frequently. It paraphrases when you asked for verbatim reproduction. It reorganizes output when you specified a particular structure. It interprets constraints creatively when you needed them interpreted literally.

For casual applications this barely registers. For production systems where your downstream parsing depends on consistent output formatting, or where regulatory requirements dictate specific response structures, the reduced instruction adherence creates real engineering overhead. You end up adding validation layers, retry logic, and output parsing safeguards that you would not need with a more compliant model.
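
What that overhead looks like in practice: a minimal validate-and-retry sketch, assuming the python client from the api reference below and that the output object carries its text in a `response` field, as the output schema describes. The json contract itself is illustrative.

python
import json
from inferencesh import inference

client = inference()

def run_with_validation(prompt: str, retries: int = 3) -> dict:
    """retry until the model returns the json structure we asked for."""
    for attempt in range(retries):
        result = client.run({
            "app": "openrouter/gemini-3-pro-preview",
            "input": {
                "system_prompt": "reply with a json object: "
                                 '{"summary": str, "risks": [str]}. no prose.',
                "text": prompt,
            }
        })
        try:
            parsed = json.loads(result["output"]["response"])
            if "summary" in parsed and "risks" in parsed:
                return parsed  # passed validation
        except (json.JSONDecodeError, KeyError, TypeError):
            pass  # drifted from the format - retry
    raise ValueError(f"no valid response after {retries} attempts")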

The safety filters are also more aggressive than what you encounter with Claude or GPT. Google applies content filtering that can block legitimate requests in domains like medical information, legal analysis, and creative writing that touches mature themes. You can adjust filter sensitivity in some configurations, but the default posture is conservative in ways that occasionally frustrate professional use cases. A medical AI assistant that refuses to discuss symptoms in plain language, or a legal tool that balks at describing criminal conduct in the context of case analysis, creates problems that prompt engineering alone cannot always solve.

Then there is the "Preview" question. Google has historically used preview labels to indicate that the model's behavior, capabilities, and even availability may change without notice. If you are building a production system that needs to behave consistently over months, that uncertainty introduces risk. The model you test against today may not behave identically next month. Google's track record on API stability has improved substantially, but the preview label means they are explicitly not making stability guarantees.

the google ecosystem play

One angle that gets underappreciated is how Gemini 3 Pro fits into the broader Google model ecosystem. On inference.sh, you already have access to Gemini Flash for image generation, Gemini Pro Image for high-fidelity visual work, and now Gemini 3 Pro for chat and reasoning. Using a single provider family for text reasoning, image generation, and multimodal understanding simplifies your architecture in ways that matter at scale.

You do not need separate API integrations, separate billing relationships, or separate error handling patterns for different modalities. Your prompt engineering knowledge transfers across the family. The model's understanding of visual concepts aligns with how the image models generate them, which means you can build pipelines where the chat model describes what it sees, reasons about what changes to make, and the image model executes those changes with consistent interpretation.
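
A sketch of such a pipeline, under the assumption that both models are called through the same client; the image-model app id is a placeholder, since this page documents only the chat model.

python
from inferencesh import inference

client = inference()

# step 1: the chat model inspects the image and proposes an edit
analysis = client.run({
    "app": "openrouter/gemini-3-pro-preview",
    "input": {
        "images": ["/path/to/hero-banner.png"],
        "text": "describe this banner and suggest one concrete visual change",
    }
})

# step 2: a sibling image model executes the suggestion;
# "google/gemini-pro-image" is a PLACEHOLDER - look up the real app id
edit = client.run({
    "app": "google/gemini-pro-image",  # placeholder id
    "input": {"text": analysis["output"]["response"]},
})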

This is not a theoretical benefit. Teams that build multi-model pipelines spend significant engineering time managing the impedance mismatch between providers. Different tokenization, different rate limiting patterns, different error codes, different content policies. A single provider family reduces that integration surface meaningfully.

reasoning quality in practice

On pure reasoning benchmarks, Gemini 3 Pro performs well. It handles multi-step mathematical problems, complex logical deductions, and sophisticated analytical tasks with a level of competence that places it firmly in the frontier tier. The gap between it and Claude Opus on structured reasoning is narrower than you might expect.

Where I notice the quality difference most is in coding tasks. Claude remains stronger at generating complex, production-ready code - particularly for tasks that require understanding architectural patterns, maintaining consistency across files, and following project-specific conventions. Gemini 3 Pro generates functional code and explains its reasoning clearly, but the code tends to need more iteration before it is production-ready. For coding-heavy workloads, the additional review and refinement cycles may offset any cost advantage.

For analytical tasks - summarizing research, comparing arguments, extracting structured data from unstructured text, answering questions over long documents - Gemini 3 Pro is genuinely competitive. The combination of strong reasoning and massive context makes it particularly effective for tasks where understanding requires digesting large volumes of information before producing a focused answer.

who should consider this model

Gemini 3 Pro Preview is the right choice for teams that need strong multimodal capabilities, value access to current information through search grounding, or work with large context windows regularly. It is also worth serious consideration for teams looking for a more affordable alternative to Opus, as long as the instruction following gap does not create downstream problems.

It is not the right choice for teams that need maximum instruction adherence, are building tightly constrained production systems that depend on consistent output formatting, or need the stability guarantees that come with a generally available release. It is also not ideal if your primary workload is code generation where Claude's stronger coding performance makes it the better choice.

The most pragmatic approach is probably not choosing a single model at all. Use Gemini 3 Pro for the multimodal and knowledge-grounded workloads where it excels, and Claude for the coding and instruction-heavy workloads where it leads. The cost of calling two different models is lower than the cost of forcing a single model to handle tasks it was not optimized for.
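
In code, that routing can be as simple as a dispatch table keyed by workload type. The claude app id below is a placeholder; substitute whichever model you actually run for those workloads.

python
from inferencesh import inference

client = inference()

# route each task class to the model that leads on it;
# the claude app id is a PLACEHOLDER - substitute the one you use
ROUTES = {
    "multimodal": "openrouter/gemini-3-pro-preview",
    "long_context": "openrouter/gemini-3-pro-preview",
    "coding": "openrouter/claude-app-id-here",              # placeholder
    "structured_output": "openrouter/claude-app-id-here",   # placeholder
}

def run_task(kind: str, payload: dict) -> dict:
    return client.run({"app": ROUTES[kind], "input": payload})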

FAQ

how does gemini 3 pro compare to claude sonnet 4.5 at a similar price?

The two models are in a similar competitive tier but excel in different areas. Gemini 3 Pro has stronger native multimodal capabilities - particularly video and audio understanding - and benefits from search grounding for current information. Claude Sonnet 4.5 offers tighter instruction following, more consistent output formatting, and generally stronger code generation. For analytical and multimodal workloads, Gemini has the edge. For coding, structured output, and tasks requiring precise behavioral control, Claude tends to deliver more reliable results. The right choice depends on which capabilities your specific application demands.

is the "preview" label a real concern for production use?

It depends on your tolerance for change. Google's preview models can receive updates that alter behavior, adjust capabilities, or shift performance characteristics without the versioning guarantees that come with generally available releases. If your application has tight regression testing and you validate model behavior before each deployment, preview models are manageable. If you ship a product and expect it to behave identically six months later without intervention, the preview label represents genuine risk. Google's recent track record on stability has been better than historical patterns suggest, but the label exists for a reason and you should take it at face value.

what makes the multimodal capabilities different from other models that accept images?

Most competing models process images through a vision encoder that was trained separately and integrated into the language model through an adapter layer. Gemini 3 Pro was trained as a unified multimodal system from the start, which means visual understanding and language reasoning share the same internal representations. The practical result is more coherent reasoning across modalities - the model does not just describe what it sees in an image and then reason over that description, it reasons directly over the visual information. This difference is most apparent with video and audio inputs, where temporal understanding requires the kind of deep integration that bolted-on encoders struggle to achieve.

api reference


1. calling the api

install the client

the client provides a convenient way to interact with the api.

bash
pip install inferencesh

setup your api key

set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.

bash
export INFERENCE_API_KEY="inf_your_key"

run and get result

submit a request and wait for the final result. best for batch processing or when you don't need progress updates.

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "openrouter/gemini-3-pro-preview",
    "input": {}
})

print(result["output"])

stream live updates

get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.

python
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "openrouter/gemini-3-pro-preview",
    "input": {}
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication

the api uses api keys for authentication. see the authentication docs for detailed setup instructions.

3. files

file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.

automatic upload

the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.

python
# local file paths are automatically uploaded
result = client.run({
    "app": "openrouter/gemini-3-pro-preview",
    "input": {
        "image": "/path/to/local/image.png",  # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})

manual upload

you can also upload files manually and use the returned url.

python
# upload and get a hosted URL
file = client.files.upload("/path/to/file.png")
print(file.uri)  # https://cloud.inference.sh/...

4. webhooks

get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.

python
result = client.run({
    "app": "openrouter/gemini-3-pro-preview",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload

your endpoint receives a JSON POST with the task result:

json
{
  "id": "task_abc123",
  "status": 9,
  "output": { ... },
  "error": "",
  "session_id": null,
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:05Z"
}
id (string): task id
status (number): terminal status (9=completed, 10=failed, 11=cancelled)
output (object): task output (when completed)
error (string): error message (when failed)
session_id (string): session id (if using sessions)
created_at (string): iso timestamp
updated_at (string): iso timestamp
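
for reference, a minimal receiver for that payload, sketched with python's standard library so it stays dependency-free; the port and routing are assumptions.

python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        task = json.loads(body)
        # 9=completed, 10=failed, 11=cancelled (statuses documented above)
        if task["status"] == 9:
            print("completed:", task["id"], task.get("output"))
        else:
            print("terminal failure:", task["id"], task.get("error"))
        self.send_response(200)
        self.end_headers()

HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()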

5. schema

input

reasoning_exclude (boolean)
exclude reasoning tokens from response
default: false

context_size (integer)
the context size for the model.
default: 200000

stream (boolean)
stream the response (true) or return complete response (false)
default: true

images (array)
the images to use for the model

tools (array)
tool definitions for function calling

tool_call_id (string)
the tool call id for tool role messages

reasoning (string)
the reasoning input of the message

reasoning_effort (string)
enable step-by-step reasoning
default: "none"
options: "low", "medium", "high", "none"

reasoning_max_tokens (integer)
the maximum number of tokens to use for reasoning

system_prompt (string)
the system prompt to use for the model
default: "you are a helpful assistant that can answer questions and help with tasks."
example: "you are a helpful assistant that can answer questions and help with tasks."

context (array)
the context to use for the model
default: []
example: [{"content":[{"text":"What is the capital of France?","type":"text"}],"role":"user"},{"content":[{"text":"The capital of France is Paris.","type":"text"}],"role":"assistant"}]

role (string)
the role of the input text
default: "user"
options: "user", "assistant", "system", "tool"

text (string, required)
the input text to use for the model
example: "write a haiku about artificial general intelligence"

temperature (number)
temperature
default: 0.7, min: 0, max: 1

top_p (number)
top p
default: 0.95, min: 0, max: 1

max_tokens (integer)
max tokens
default: 64000
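
putting several of these fields together, a request that enables reasoning and carries a short conversation history might look like the sketch below; values are illustrative, and the `response` access follows the output schema.

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "openrouter/gemini-3-pro-preview",
    "input": {
        "system_prompt": "you are a terse research assistant.",
        "context": [
            {"role": "user",
             "content": [{"type": "text", "text": "What is the capital of France?"}]},
            {"role": "assistant",
             "content": [{"type": "text", "text": "The capital of France is Paris."}]},
        ],
        "text": "and its population, roughly?",
        "reasoning_effort": "medium",  # "low" | "medium" | "high" | "none"
        "temperature": 0.3,            # 0-1 per the schema above
        "max_tokens": 2048,
    }
})

print(result["output"]["response"])  # generated text per the output schema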

output

images (array)
images

output_meta (object)
structured metadata about inputs/outputs for pricing calculation

response (string, required)
the generated text response

usage (object)
token usage statistics

tool_calls (array)
tool calls for function calling

reasoning (string)
the reasoning output of the model

ready to run gemini-3-pro-preview?
