
BYOK on inference.sh

Most AI platforms work the same way. You send a request, they run it on their compute, they bill you. Simple enough when you are prototyping. Less simple when your company already has six-figure cloud commitments, data residency requirements, or a procurement process that takes longer than the project itself.

Bring Your Own Keys changes this. Instead of routing all your model runs through a platform's pooled infrastructure, you connect your existing cloud accounts and API keys. The platform handles orchestration, retries, observability, and the unified API. The actual compute runs on your cloud.

inference.sh built BYOK as a first-class feature because we kept hearing the same story from teams: they wanted a unified AI API, but they could not justify paying twice for compute they already had commitments on.

What BYOK Actually Means

BYOK stands for Bring Your Own Keys. In practice, it means you register your cloud provider credentials with inference.sh, and when you make API calls, the underlying model execution routes through your account instead of ours.

You still call the same inference.sh API. Your code does not change. The request hits our API layer, which handles authentication, input validation, tool orchestration, and retry logic. Then, instead of dispatching the model run to our pooled compute, it dispatches to your cloud account using your credentials.

The response flows back through inference.sh, where we handle output formatting, logging, and observability. From your application's perspective, nothing is different. From your cloud billing perspective, the compute charges appear on your existing account.

Currently supported providers include Google Cloud (Vertex AI), Azure, AWS, Fal, and more. The list grows as providers expand their API offerings and customers request new integrations.

Why This Matters More Than It Sounds

BYOK is not just a billing convenience. It solves real organizational problems that block AI adoption at scale.

Existing Cloud Commitments

Large organizations negotiate committed use discounts with cloud providers. A company with a three-year GCP commitment has already allocated budget for that compute. Running AI workloads through a third-party platform means paying full price on the platform while leaving committed capacity underutilized.

With BYOK, those AI workloads count against existing commitments. The compute is not wasted. The finance team does not need to justify a new line item. The existing cloud relationship absorbs the new workload.

Data Residency and Compliance

Some industries and regions require that data processing happens within specific geographic boundaries or on infrastructure that meets particular compliance standards. Healthcare, finance, and government organizations often have strict requirements about where data lives and who processes it.

When you use BYOK, the model execution happens on your cloud account. You control the region. You control the compliance posture. You control the data retention policies. inference.sh's API layer processes the request metadata, but the heavy computation - where your actual data gets processed by models - happens on infrastructure you manage.

This does not eliminate all compliance considerations, but it addresses the biggest one: ensuring model execution happens within your controlled environment.

Procurement and Vendor Management

Adding a new compute vendor to an enterprise's approved list is rarely fast. Procurement reviews, security assessments, contract negotiations, and budget approvals can take months. Meanwhile, the engineering team needs to ship.

BYOK sidesteps most of this friction. The compute runs on an already-approved cloud provider. The inference.sh API layer is a much smaller surface area to evaluate - it handles orchestration, not data processing at scale. This distinction often simplifies and accelerates vendor approval.

How It Works in Practice

Setting up BYOK on inference.sh takes a few steps. You do not need to change your application code.

Step 1: Register Your Provider Credentials

You add your cloud provider credentials to your inference.sh account. For Google Cloud, this means a service account key with the right permissions. For Azure, it is an API key or managed identity configuration. For AWS, it is IAM credentials scoped to the services you want to use.

inference.sh stores these credentials securely and uses them only when routing requests on your behalf.
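As a sketch of what registration could look like programmatically: the /v1/credentials endpoint and payload below are hypothetical placeholders, not inference.sh's documented API. In practice you would add keys through the dashboard or the documented endpoint.

python
import json

import requests

# Hypothetical sketch - the endpoint and payload shape below are
# placeholders, not inference.sh's documented API.
with open("vertex-service-account.json") as f:
    service_account_key = json.load(f)  # GCP key scoped to Vertex AI

requests.post(
    "https://api.inference.sh/v1/credentials",  # placeholder endpoint
    headers={"Authorization": "Bearer YOUR_INFERENCE_KEY"},
    json={"provider": "gcp", "credentials": service_account_key},
)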

Step 2: Configure Routing

You specify which models or which requests should route through your keys. This can be broad ("route all requests through my GCP account") or targeted: image generation on your Fal account, LLM calls on your Azure account, and everything else on inference.sh's default compute.

This routing configuration lives at the account level, not in your code. Your API calls remain identical regardless of where compute runs.
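To make this concrete, account-level routing rules might conceptually look like the mapping below. The shape is a hypothetical illustration only; the real configuration lives in your inference.sh account settings, not in your codebase.

python
# Hypothetical illustration of account-level routing rules.
# This lives in your inference.sh account settings, not in code.
routing_config = {
    "rules": [
        {"match": "fal/*", "route": "byok:fal"},      # image generation on your Fal account
        {"match": "azure/*", "route": "byok:azure"},  # LLM calls on your Azure account
    ],
    "default": "inference.sh",  # everything else on default compute
}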

Step 3: Make API Calls as Usual

Your application code does not change. You call the same inference.sh endpoints, pass the same parameters, and receive the same response format. The routing happens behind the scenes based on your configuration.

python
# This code works exactly the same whether using
# inference.sh compute or your own keys
import requests

# Placeholder: URL of the document to summarize
document_url = "https://example.com/q3-report.pdf"

response = requests.post(
    "https://api.inference.sh/v1/run",
    headers={"Authorization": "Bearer YOUR_INFERENCE_KEY"},
    json={
        "app": "google/gemini-2.5-pro",
        "input": {
            "prompt": "Summarize this quarterly report",
            "document": document_url
        }
    }
)

result = response.json()

If you have BYOK configured for Google models, this request routes through your GCP account. If you have not, it routes through inference.sh's default compute. Same code either way.

What You Still Get with BYOK

Routing compute through your own keys does not mean you lose the platform benefits. That is the whole point - you get both.

Unified API

You do not need to learn and maintain integrations for every cloud provider's model API. Google's Vertex AI API is different from Azure's OpenAI Service API, which is different from AWS Bedrock. inference.sh normalizes all of this into one API. BYOK changes where compute runs, not how you interact with it.
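As a quick sketch of what that normalization means in practice: the request from the earlier example works across providers by changing only the app identifier. The second identifier below is an assumption for illustration, not a confirmed catalog entry.

python
import requests

# Same endpoint and request shape for every provider - only the
# app identifier changes. "openai/gpt-4o" is an illustrative name;
# check the model catalog for exact identifiers.
for app in ["google/gemini-2.5-pro", "openai/gpt-4o"]:
    response = requests.post(
        "https://api.inference.sh/v1/run",
        headers={"Authorization": "Bearer YOUR_INFERENCE_KEY"},
        json={"app": app, "input": {"prompt": "Summarize this quarterly report"}},
    )
    print(app, response.json())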

Retries and Reliability

Cloud APIs fail. Rate limits hit. Regions have outages. inference.sh's retry logic, exponential backoff, and failover handling work the same whether the underlying compute is ours or yours. Your application gets resilient execution without you building that resilience into every integration.
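For a sense of what this replaces, here is a minimal sketch of the retry-with-backoff logic you would otherwise write and maintain for each provider integration. It is a generic illustration, not inference.sh's actual implementation.

python
import random
import time

import requests

def post_with_backoff(url, max_retries=5, **kwargs):
    # Generic retry sketch: back off exponentially on rate limits,
    # transient server errors, and connection failures.
    for attempt in range(max_retries):
        try:
            response = requests.post(url, timeout=30, **kwargs)
            if response.status_code not in (429, 500, 502, 503, 504):
                return response
        except requests.ConnectionError:
            pass  # network blip - retry
        time.sleep(2 ** attempt + random.random())  # ~1s, 2s, 4s, ... plus jitter
    raise RuntimeError(f"gave up after {max_retries} attempts")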

Observability

Every request - whether routed through your keys or ours - gets logged, timed, and tracked in inference.sh's observability layer. You see latency, token usage, error rates, and cost across all your models in one place. No need to correlate logs across multiple cloud consoles.

Tool Orchestration

If you are using inference.sh's tools for function calling, MCP integrations, or multi-step workflows, all of that works identically with BYOK. The orchestration layer sits above the compute layer. Changing where models run does not affect how they are orchestrated.

BYOK vs. the Alternatives

Let us look at what happens without BYOK when you want a unified AI API.

Option 1: Use a Platform Without BYOK

Platforms like Replicate and Fal offer clean APIs for running models. But they run everything on their own compute. You cannot bring your existing cloud credentials. If you have committed spend on GCP, that spend sits unused while you pay the platform's markup on their infrastructure.

You are also locked into their pricing structure. If they raise prices or change terms, your options are to accept it or migrate everything - a non-trivial effort when your application is tightly integrated.

Option 2: Build Direct Integrations

You skip the platform and integrate directly with each cloud provider. You get to use your own keys by default. But you also get to build and maintain every integration yourself.

That means learning each provider's API, handling their different authentication schemes, building retry logic for each one, normalizing response formats, and maintaining all of this as APIs change. For one provider, this is manageable. For three or four, it becomes a meaningful engineering cost.
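To give a flavor of that cost: even the trivial task of extracting generated text differs per provider, and the response shapes below (accurate at the time of writing) drift as APIs evolve. Multiply this across authentication, retries, and streaming, and the maintenance burden adds up.

python
# A small taste of the normalization glue you own with direct
# integrations. These response shapes reflect current provider
# APIs and can change - keeping up is your job.
def extract_text(provider: str, response: dict) -> str:
    if provider == "openai":     # OpenAI / Azure OpenAI chat completions
        return response["choices"][0]["message"]["content"]
    if provider == "anthropic":  # Anthropic Messages API
        return response["content"][0]["text"]
    if provider == "gemini":     # Vertex AI Gemini generateContent
        return response["candidates"][0]["content"]["parts"][0]["text"]
    raise ValueError(f"unknown provider: {provider}")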

Option 3: BYOK on inference.sh

You get the unified API, the reliability layer, the observability, and the tool orchestration. And you route compute through your own cloud accounts. You keep your existing cloud commitments working. You maintain control over where data gets processed. You avoid lock-in to any single compute provider.

When you need to switch providers - say, moving from Azure to GCP for a particular model - you change a configuration, not your application code.

When BYOK Makes Sense

BYOK is not for everyone. If you are an individual developer or a small team without existing cloud commitments, using inference.sh's default compute is simpler. You do not need to manage cloud credentials or worry about provider-specific configurations.

BYOK makes sense when:

You have existing cloud contracts. If your organization has committed spend on GCP, Azure, or AWS, routing AI workloads through those accounts maximizes the value of those commitments.

You have data residency requirements. If regulations or internal policies dictate where your data gets processed, BYOK lets you maintain that control while still using a unified API.

You are running at scale. At high volume, the economics of using your own negotiated cloud rates versus a platform's standard pricing become significant. BYOK lets you benefit from whatever pricing you have negotiated with your cloud provider.

You need multi-cloud flexibility. If your organization uses multiple cloud providers, BYOK lets you route different workloads to different providers based on cost, performance, or compliance needs - all through one API.

You want to avoid vendor lock-in. BYOK means the compute layer is something you control. If you ever need to move away from inference.sh, your cloud infrastructure and credentials are already yours. The migration is about the API layer, not the compute layer.

Cost Optimization Strategies with BYOK

BYOK opens up cost strategies that are not possible on locked platforms.

Use Committed Spend First

Route your baseline workloads through your committed cloud capacity. Use inference.sh's default compute for burst capacity that exceeds your commitments. This way, you maximize the value of what you have already paid for while still having elastic capacity when you need it.

Provider-Specific Routing

Different providers price different models differently. Route each model type to the provider where you get the best rate. LLM inference might be cheapest on your Azure commitment. Image generation might be cheapest on Fal. Video models might work best through your GCP account. BYOK lets you optimize at the model level without any code changes.

Development vs. Production

Use inference.sh's default compute for development and testing where volume is low and convenience matters most. Route production traffic through your own keys where volume is high and cost optimization matters most. Same code, different routing configuration per environment.
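One way to wire this up, assuming you keep a separate inference.sh API key per environment with different routing configured for each (a common pattern, not a documented requirement):

python
import os

import requests

# Assumption: two inference.sh keys, one whose account-level routing
# uses your own cloud keys (production) and one that uses default
# compute (development).
key = (
    os.environ["INFERENCE_KEY_PROD"]
    if os.environ.get("APP_ENV") == "production"
    else os.environ["INFERENCE_KEY_DEV"]
)

response = requests.post(
    "https://api.inference.sh/v1/run",
    headers={"Authorization": f"Bearer {key}"},
    json={"app": "google/gemini-2.5-pro", "input": {"prompt": "smoke test"}},
)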

Getting Started

Setting up BYOK takes minutes, not days. You need an inference.sh account and credentials for the cloud provider you want to use.

The process is straightforward: add your provider credentials, configure your routing preferences, and start making API calls. Your existing code works without modification. If you are already using inference.sh, the switch is a configuration change.

If you are not yet using inference.sh, you can start with default compute and add BYOK later. The API is the same either way, so there is no migration effort when you decide to bring your own keys.

Frequently Asked Questions

Is my data sent through inference.sh servers when using BYOK?

Your request metadata passes through inference.sh's API layer for routing, authentication, and orchestration. The actual model execution - where your input data gets processed by the model - happens on your cloud account. Response data flows back through inference.sh for logging and formatting. If you have specific requirements about what data touches which infrastructure, reach out to discuss your architecture needs.

Can I use BYOK for some models and inference.sh compute for others?

Yes. Routing is configurable per model, per provider, or as a blanket default. Many teams route their high-volume production models through their own keys while using inference.sh compute for lower-volume or experimental models. You can change this configuration at any time without modifying your application code.

What happens if my cloud provider credentials expire or hit a quota?

inference.sh's retry and error handling applies to BYOK requests just like any other request. If your credentials are invalid or your quota is exceeded, you get clear error messages indicating the issue. You can also configure fallback behavior - for example, falling back to inference.sh compute if your own keys fail - so your application stays running while you resolve the credential or quota issue.
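A hypothetical sketch of what such a fallback rule could look like, extending the routing illustration from earlier (the shape is illustrative, not inference.sh's documented configuration format):

python
# Hypothetical fallback rule - illustrative shape only.
routing_config = {
    "rules": [
        {
            "match": "google/*",
            "route": "byok:gcp",
            "fallback": "inference.sh",  # use default compute if your keys fail
        }
    ],
}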
