
250+ Tools, One API

If you are building an AI agent, a product that calls AI models, or any automation that touches more than one external service, you have a tool problem. Not a model problem. Models are getting good fast. The hard part is wiring those models to everything else they need to do useful work - generate images, send emails, search the web, post to social media, render video, run code, manage projects.

Most teams solve this piecemeal. You sign up for Replicate for image generation, Fal for fast video inference, Resend for email, Tavily for search, Puppeteer for browser automation. Each service has its own SDK, its own auth, its own billing, its own failure modes. You write glue code. You build retry logic. You manage API keys. You handle rate limits. Every new capability means another integration.

This is the problem inference.sh exists to solve. One API, one SDK, one billing relationship - 250+ tools that span AI models, video rendering, email, search, social media, project management, browser automation, and code execution.

This guide breaks down what that actually means, how it compares to the alternatives, and when it makes sense.

What "250+ Tools" Actually Covers

Replicate and Fal give you AI model inference. You send a prompt or an image, you get a generation back. They do this well. But AI models are only one category of tool.

Here is what inference.sh covers, with real app names from the store:

AI model inference - the same territory as Replicate and Fal. Image generation with pruna/flux-dev, video generation with falai/seedance-2-t2v, text-to-speech with elevenlabs/tts. These run the same underlying models, often from the same providers.

Video rendering - infsh/remotion-render lets you render data-driven video programmatically. This is not AI generation. This is template-based rendering - the kind you need for personalized marketing videos, automated reports, or dynamic content.

Search and research - tavily/search-assistant gives your agent web search capabilities without managing a search API integration directly.

Social media - x/post-create lets your agent or automation post to X (Twitter). Managing OAuth tokens and API changes for social platforms is its own category of pain. This abstracts it.

Browser automation - infsh/agent-browser runs a headful browser your agent can control. Scraping, testing, form filling, screenshot capture - without managing Puppeteer infrastructure.

Email, project management, code execution - the long tail of capabilities that agents need to be genuinely useful.

The point is not that any single tool is unique. You can find each of these capabilities somewhere. The point is that all of them share the same API surface, the same authentication, the same execution model, and the same billing.

Three Sources of Tools

The 250+ built-in apps are just the starting point. inference.sh has three distinct sources of tools, and understanding all three matters.

Built-in Apps

These are the tools in the store - curated, hosted, ready to call. pruna/flux-dev for images, falai/seedance-2-t2v for video, elevenlabs/tts for audio, and so on. You do not deploy or manage anything. You call them through the API and pay per execution.

Connected MCP Servers

MCP (Model Context Protocol) is becoming the standard way for AI agents to discover and call tools. inference.sh lets you connect any MCP server and expose its tools through the same API.

This means if your company has internal tools exposed via MCP - a database query tool, a deployment trigger, a custom CRM action - you can connect those servers and your agents can access them alongside the built-in tools. Same interface, same execution model.

For teams already investing in MCP, this is significant. You are not locked into a fixed catalog. You extend the platform with whatever tools you need.
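As a rough sketch of what that looks like in code, the call below assumes a connected MCP server exposing a hypothetical order-lookup tool. The tool name and its input fields are placeholders - the real identifier comes from whatever server you connect - but the calling pattern is the same app.run used for built-in apps.

// Hedged sketch: calling a tool exposed by a connected MCP server.
// 'acme/query-orders' and its input fields are hypothetical placeholders;
// the real tool name depends on the MCP server you connect.
import Inference from '@inference/sdk'

const client = new Inference()

const result = await client.app.run('acme/query-orders', {
  input: { customerId: 'cus_123', limit: 10 }
})

console.log(result)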

Composed Flows

Flows let you chain tools into new tools. Take a search result, feed it to a language model, post the summary to Slack. Take an image generation, run it through upscaling, then through a watermark renderer. Each step is a tool call; the composition itself becomes a reusable tool.

This is not a visual workflow builder (though you can build one on top). It is a programmable composition layer. You define flows in code, and they execute with the same durability guarantees as individual tool calls.
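The flow-definition API itself is not shown in this guide, but the composition idea can be sketched as plain sequential SDK calls, each step's output feeding the next step's input. The language-model app name and all input/output field names below are assumptions for illustration; a real flow would register this chain as a reusable tool rather than inlining it.

// Hedged sketch of a composition: search -> summarize -> post.
// App names other than tavily/search-assistant, and all input/output field
// names, are assumptions for illustration; the actual flow API may differ.
import Inference from '@inference/sdk'

const client = new Inference()

// Step 1: web search (app name from the catalog).
const search = await client.app.run('tavily/search-assistant', {
  input: { query: 'latest developments in MCP protocol' }
})

// Step 2: summarize the results with a language model (hypothetical app name).
const summary = await client.app.run('some-provider/llm-chat', {
  input: { prompt: `Summarize in two sentences: ${JSON.stringify(search.output)}` }
})

// Step 3: post the summary to X (input field name is an assumption).
await client.app.run('x/post-create', {
  input: { text: summary.output }
})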

BYOK: Bring Your Own Keys

This is where inference.sh diverges sharply from both Replicate and Fal.

When you run an AI model on Replicate or Fal, you pay their markup on compute. You cannot bring your own API keys. If you have negotiated pricing with Google Cloud, Azure, or AWS - or if you have credits with Fal or other providers - you cannot use them.

inference.sh supports BYOK (Bring Your Own Keys). You can route model runs through your own API keys for supported providers including Fal, Google, Azure, and AWS. The platform handles orchestration, retries, and tracking. You pay the provider directly at your negotiated rate.

For startups burning through cloud credits, this is straightforward cost savings. For enterprises with existing cloud contracts, it means you do not pay twice for compute. For anyone who wants full control over their provider relationships, it removes the middleman from the billing equation while keeping the middleware for everything else.
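How BYOK is configured is not spelled out here - it may well be a dashboard-level setting rather than a per-call option - so the snippet below is only a hypothetical sketch. The credentials field and its shape are assumptions for illustration, not the documented API.

// Hypothetical sketch: routing a model run through your own Fal key.
// The `credentials` field and its shape are assumptions for illustration;
// the real BYOK configuration may live in the dashboard instead.
import Inference from '@inference/sdk'

const client = new Inference()

const result = await client.app.run('falai/seedance-2-t2v', {
  input: { prompt: 'a drone shot over a coastline at dusk' },
  credentials: { provider: 'fal', apiKey: process.env.FAL_API_KEY }
})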

Neither Replicate nor Fal offers this. It is a fundamental difference in business model.

Durable Execution

Most API platforms treat tool calls as fire-and-forget. You make a request, you get a response (or a timeout), and managing everything around that is your problem.

inference.sh treats every tool call as a durable execution. This means:

Automatic retries on failure. If a tool call fails due to a transient error - network blip, provider rate limit, temporary outage - the platform retries automatically with appropriate backoff. You do not write retry logic for each tool.

State persistence. Every execution persists its state. If a long-running video render is interrupted, it does not vanish. You can query its status, retrieve partial results, or let the platform resume it.

Tracking and observability. Every execution is logged with inputs, outputs, timing, and status. When something goes wrong at 3 AM, you can trace exactly what happened without adding custom logging to every tool call.

This matters most when you are running agents that chain multiple tool calls. A typical agent workflow might call five to ten tools in sequence. Without durable execution, a failure in step seven means re-running steps one through six. With durable execution, step seven retries on its own and the workflow continues.
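The status-querying surface is not documented in this guide, but the idea can be sketched roughly as follows. The executions.get method, the id field on the result, and the status values are all assumptions for illustration - only the app.run call itself is taken from the SDK example above.

// Hypothetical sketch: checking on a long-running execution instead of
// re-running it. Method names, the id field, and status values are assumptions.
import Inference from '@inference/sdk'

const client = new Inference()

const render = await client.app.run('infsh/remotion-render', {
  input: { templateId: 'weekly-report', data: { week: 42 } }  // placeholder fields
})

// Later, or from another process: look the execution up by id rather than
// resubmitting the work.
const status = await client.executions.get(render.id)  // hypothetical method
console.log(status.state, status.output)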

The alternative is building this yourself. Every team that has scaled an agent past demo stage has written some version of retry logic, state persistence, and execution tracking. It is undifferentiated work that slows you down.

The Belt CLI

Everything available through the API is also available from your terminal through the Belt CLI.

belt app run pruna/flux-dev -i '{"prompt": "a cat on a skateboard"}'

belt app run elevenlabs/tts -i '{"text": "Hello from the terminal", "voice": "alloy"}'

belt app run tavily/search-assistant -i '{"query": "latest developments in MCP protocol"}'

This is not a secondary interface. It is the same API, the same execution engine, the same tracking. Every tool call from Belt shows up in your dashboard alongside API and SDK calls.

For development workflows, this is practical. You can test tool calls before writing integration code. You can script complex workflows in bash. You can run one-off tasks without spinning up a project.

For operations, Belt enables cron jobs, CI/CD integrations, and shell scripts that use the full tool catalog. Need to generate a daily report image and post it to X? That is a three-line shell script.

SDKs: JS, Python, Go

The API is language-agnostic, but SDKs remove boilerplate. inference.sh ships SDKs for the three languages that cover most agent development:

import Inference from '@inference/sdk'

const client = new Inference()
const result = await client.app.run('pruna/flux-dev', {
  input: { prompt: 'a cat on a skateboard' }
})

The SDK handles authentication, request formatting, polling for async results, and error handling. Same capabilities in Python and Go.

If you are already using the Replicate or Fal SDK for model inference, switching is straightforward. The calling pattern is similar - you specify an app, pass inputs, get outputs. The difference is that the same pattern works for non-AI tools too.
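For instance, the same run call that generates an image can drive a web search; only the app name and its input fields change. A minimal sketch using two apps and input fields named earlier in this guide:

// Same SDK, same calling pattern, two different tool categories.
import Inference from '@inference/sdk'

const client = new Inference()

// AI model inference.
const image = await client.app.run('pruna/flux-dev', {
  input: { prompt: 'a cat on a skateboard' }
})

// Web search - not an AI model, but the same call shape.
const results = await client.app.run('tavily/search-assistant', {
  input: { query: 'latest developments in MCP protocol' }
})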

Comparing the Alternatives

Let's be direct about when each option makes sense.

Replicate

Replicate is a well-built platform for running AI models. If your needs are strictly AI model inference - image generation, language models, audio models - and you do not need non-AI tools, BYOK, MCP integration, or flows, Replicate works.

Where it falls short: no non-AI tools (you cannot send emails, search the web, or automate browsers through Replicate), no MCP server support, no BYOK, no tool composition. If your agent needs to do anything beyond AI model inference, you need additional integrations.

Fal

Fal optimizes for fast inference, particularly for image and video models. If raw inference speed on supported models is your primary concern and nothing else matters, Fal delivers.

Where it falls short: same limitations as Replicate regarding scope. No non-AI tools, no MCP support, no BYOK, no flows. Fast inference on a narrow set of models.

Calling APIs Directly

You can always skip platforms entirely and call each provider's API directly. Google for Gemini, OpenAI for GPT, Stability for image generation, Resend for email, Tavily for search.

This works. Many teams start here. The cost becomes clear as you scale:

Auth management - each provider has different authentication. OAuth flows, API keys, JWTs, webhook signatures. You manage all of them.

Retry logic - each provider has different failure modes, rate limits, and backoff requirements. You write retry logic for each one.

Billing complexity - you have a billing relationship with every provider. Reconciling costs across ten providers is non-trivial.

Monitoring fragmentation - your observability is split across provider dashboards. Correlating a failed workflow across four providers means checking four different logging systems.

For one or two integrations, direct API calls are fine. For ten or twenty, the overhead dominates your engineering time.

inference.sh

inference.sh makes sense when you need breadth - multiple tool categories, not just AI models. When you want a single API surface for everything your agents call. When BYOK matters for cost control. When durable execution saves you from building reliability infrastructure. When MCP compatibility is part of your architecture.

It makes less sense if you need exactly one AI model and nothing else. Use the provider directly in that case.

Pay-Per-Execution Pricing

inference.sh charges per execution. No idle costs, no reserved capacity, no monthly minimums for tool access.

You pay when tools run. When they do not run, you pay nothing. This aligns cost with value - you are charged for work done, not for infrastructure standing by.

For workloads that are bursty (most agent workloads are), this matters. An agent that runs ten tool calls during a user interaction and then sits idle for hours costs you ten executions, not 24 hours of uptime.

Combined with BYOK for the AI model portion of your costs, the economics can be materially better than alternatives where you pay platform markup on compute plus per-seat fees.

When This Matters Most

The tool platform question becomes urgent at specific points:

When your agent needs its third non-model tool. One or two direct integrations are manageable. By the third, you are spending more time on plumbing than on the agent itself.

When you hit your first production outage caused by a tool failure. Retry logic and state persistence stop being nice-to-haves.

When you need to add MCP tools alongside hosted tools. Running two separate tool systems with different execution models creates complexity that compounds.

When your cloud bill matters. BYOK with your own negotiated rates versus paying platform markup on every model call.

Getting Started

The fastest path is the Belt CLI. Install it, run a tool call, see the result. Then move to the SDK when you are ready to integrate.

Every tool in the store has documented inputs and outputs. You can browse the catalog, pick what you need, and start calling tools in minutes. No infrastructure to provision, no containers to deploy, no GPUs to manage.

FAQ

Can I use inference.sh just for AI models and ignore the other tools?

Yes. The AI model tools work standalone. You can use pruna/flux-dev for image generation or falai/seedance-2-t2v for video without touching email, search, or browser tools. Many teams start with models and expand to other tool categories as their agents mature.

How does MCP integration work in practice?

You connect an MCP server to your inference.sh account. The server's tools become available through the same API and SDK you use for built-in tools. Your agent code does not need to know whether a tool is built-in or comes from a connected MCP server. The execution model - retries, tracking, state persistence - applies uniformly.

What happens if I already use Replicate or Fal?

You do not have to migrate everything at once. inference.sh complements existing setups. You might keep Replicate for specific models where you have established workflows and use inference.sh for non-AI tools, flows, or BYOK routing. The SDKs can coexist in the same codebase. Over time, consolidating onto one platform reduces integration surface area, but it is not required on day one.
