You are building something with AI. Maybe it is an agent that books meetings, a pipeline that generates product images, or a chatbot that needs to search the web and send emails. You have the model calls working. The prompts are good. Then you hit the part that takes ten times longer than the AI itself: making everything around it actually work.

You need to call a video model, but the provider requires separate authentication and a different SDK. You need to send an email after the video renders, but now you are managing a queue. The agent needs to wait for human approval before posting to social media, but your framework does not support that. You want to share this with your team, but the tool configs live on your laptop.

This is the problem inference.sh exists to solve. It is an infrastructure layer for AI applications - the tools, the runtime, and the knowledge layer that sit between your agent logic and the outside world.

The Core Idea

Most AI infrastructure today is fragmented. You use one service for model inference, another for tool integrations, another for workflow orchestration, and yet another for knowledge management. Each has its own auth, its own SDK, its own pricing model, its own failure modes.

inference.sh consolidates this into a single platform. One API key. One execution model. One billing relationship. You get access to 250+ serverless tools spanning AI models, video rendering, email, search, social media, and project management. You get a runtime that handles failures and state persistence. You get a skills system that packages agent knowledge into portable, versioned files.

The design philosophy is straightforward: your agent code should focus on decisions and logic. Everything else - calling tools, persisting state, recovering from failures, managing credentials - should be handled by the platform.

Tools: 250+ Integrations, One API

The tool catalog covers the categories that AI applications actually need. Not hundreds of variations on the same thing, but working integrations across the domains where agents and pipelines operate.

AI models include image generation (FLUX, Seedream), video generation (Seedance), language models (Claude, GPT), and specialized models for tasks like upscaling and text-to-speech. You call them the same way regardless of provider. The platform handles authentication, rate limits, and response normalization.

Media and rendering includes Remotion for programmatic video rendering. Agents that produce visual content can generate videos without you standing up a rendering pipeline.

Search and data includes Tavily for web search and extraction. Research agents and RAG pipelines get structured search results without managing search infrastructure.

Communication includes email sending and social media tools for X/Twitter. Agents that need to notify, publish, or engage can do so through standard tool calls.

Project management includes Linear integration for issue tracking. Agents that manage workflows can create, update, and query issues as part of their execution.
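
One way to picture "the same call regardless of provider" is an adapter layer that maps each provider's response shape onto one common result type. The sketch below is illustrative only: the provider names, response shapes, and the `run_tool` function are hypothetical and do not describe the inference.sh API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ToolResult:
    """Common result shape handed to the caller, whatever the provider."""
    provider: str
    output_url: str


# Hypothetical raw responses: each provider nests its output differently.
def fake_provider_a(prompt: str) -> Dict[str, Any]:
    return {"images": [{"url": f"https://example.com/a/{len(prompt)}.png"}]}


def fake_provider_b(prompt: str) -> Dict[str, Any]:
    return {"result": {"asset": f"https://example.com/b/{len(prompt)}.png"}}


# One adapter per provider: call it, then normalize the response.
def adapt_a(prompt: str) -> ToolResult:
    raw = fake_provider_a(prompt)
    return ToolResult("provider-a", raw["images"][0]["url"])


def adapt_b(prompt: str) -> ToolResult:
    raw = fake_provider_b(prompt)
    return ToolResult("provider-b", raw["result"]["asset"])


ADAPTERS: Dict[str, Callable[[str], ToolResult]] = {
    "provider-a": adapt_a,
    "provider-b": adapt_b,
}


def run_tool(name: str, prompt: str) -> ToolResult:
    """Single entry point with the same signature for every tool."""
    if name not in ADAPTERS:
        raise KeyError(f"unknown tool: {name}")
    return ADAPTERS[name](prompt)
```

The caller only ever sees `ToolResult`, so swapping one provider for another does not change the calling code.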

Every tool runs serverless. You pay per execution, not per hour of uptime. There is no infrastructure to manage, no containers to keep warm, no scaling to configure.

BYOK: Bring Your Own Keys

You might already have API keys for some of these services. inference.sh supports bring-your-own-key for providers where you want to route calls through your own account. Your cloud credits, your rate limits, your billing - the platform handles the execution mechanics either way.

This matters for teams that have negotiated enterprise pricing or need calls to originate from their own accounts for compliance reasons.
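
The routing decision behind bring-your-own-key can be sketched as a simple lookup: use the caller's key when one is supplied for a provider, otherwise fall back to a platform-managed key. Everything named here (`PLATFORM_KEYS`, `resolve_key`, the `billed_to` field) is a hypothetical illustration, not the platform's actual mechanism.

```python
from typing import Dict, Optional

# Hypothetical platform-managed credentials, keyed by provider name.
PLATFORM_KEYS: Dict[str, str] = {"image-provider": "platform-key-123"}


def resolve_key(provider: str, user_keys: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Prefer the caller's own key; otherwise fall back to the platform's."""
    user_keys = user_keys or {}
    if provider in user_keys:
        # The call goes through the user's account: their limits, their bill.
        return {"key": user_keys[provider], "billed_to": "user"}
    if provider in PLATFORM_KEYS:
        return {"key": PLATFORM_KEYS[provider], "billed_to": "platform"}
    raise LookupError(f"no credentials available for {provider}")
```

Either branch returns the same shape, which is what lets the execution path stay identical whichever key is used.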

Skills: Versioned Agent Knowledge

Tools give agents capabilities. Skills give agents knowledge about how to use those capabilities well.

A skill is a versioned markdown file that contains instructions, context, and patterns for a specific domain or task. Think of it as the difference between handing someone a hammer and teaching them carpentry. The hammer is the tool. The skill is the knowledge of when to use it, how to hold it, and what mistakes to avoid.

Skills are designed to be portable. They work in Claude Code, Cursor, Cline, Windsurf, Codex, and any other agent runtime that supports markdown-based instructions. They are not locked to the inference.sh runtime. You install them with a single command:

    belt skill use namespace/skill-name

The skills registry provides versioning and security scanning. Every skill version is scanned before publication. When you install a skill, you get a specific version. When the skill author publishes an update, you can upgrade on your own schedule.

This solves a real problem. Today, agent knowledge lives in system prompts that are copy-pasted between projects, never versioned, never audited. Skills formalize this into a proper package management system for agent instructions.
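
The "install pins a version, upgrade is explicit" behavior described above can be modeled in a few lines. This is a toy in-memory sketch under assumed names (`SkillRegistry`, `Workspace`); it says nothing about how the real registry stores or resolves versions.

```python
from typing import Dict, List


class SkillRegistry:
    """Toy registry: skill name -> list of published versions, oldest first."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}

    def publish(self, name: str, version: str) -> None:
        self._versions.setdefault(name, []).append(version)

    def latest(self, name: str) -> str:
        return self._versions[name][-1]


class Workspace:
    """Holds the pinned version of each installed skill (a lockfile, in effect)."""

    def __init__(self, registry: SkillRegistry) -> None:
        self.registry = registry
        self.pinned: Dict[str, str] = {}

    def install(self, name: str) -> str:
        # Installing records one specific version; it does not follow "latest".
        self.pinned[name] = self.registry.latest(name)
        return self.pinned[name]

    def upgrade(self, name: str) -> str:
        # Upgrading is a separate, explicit step taken by the user.
        self.pinned[name] = self.registry.latest(name)
        return self.pinned[name]
```

If an author publishes a new version after `install`, the workspace's pinned version stays where it was until `upgrade` is called.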

Why Markdown?

Markdown is the lingua franca of LLMs. Every major model understands it natively. It requires no special parser. It is human-readable, version-controllable, and diff-friendly. A skill written in markdown works everywhere a model can read text - which is everywhere.

The alternative would be a proprietary format that requires specific tooling to use. That would make skills faster to parse but impossible to use outside the ecosystem. Markdown trades a small amount of structure for universal compatibility.
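
To make "no special parser" concrete: using a markdown skill can be as small as reading a text file and appending it to a prompt. The function name and file layout below are assumptions for illustration; real agent runtimes each have their own way of loading instructions.

```python
from pathlib import Path


def build_system_prompt(base_prompt: str, skill_path: Path) -> str:
    """Append a skill's markdown to a system prompt as plain text."""
    # No parsing step: the model reads the markdown exactly as written.
    skill_text = skill_path.read_text(encoding="utf-8")
    return f"{base_prompt}\n\n{skill_text}"
```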

The Agent Runtime

Calling a tool once is simple. Orchestrating dozens of tool calls across a multi-step agent workflow - where any call might fail, where some steps need human approval, and where the whole thing needs to survive server restarts - is hard.

The inference.sh runtime handles this with durable execution. When your agent makes a tool call, the runtime persists the state of that execution. If the call fails due to a transient error, the runtime retries automatically. If the server restarts mid-execution, the agent resumes from where it left off, not from the beginning.

This is the same pattern used by workflow engines like Temporal, applied specifically to agent execution. The difference is that you do not need to learn a new programming model or deploy additional infrastructure. Your agent makes tool calls. The runtime makes them durable.
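
The durable execution pattern described above (persist state after each step, retry transient failures, resume instead of restarting) can be sketched with a JSON checkpoint file. This is a minimal illustration under stated assumptions: `run_durable`, `TransientError`, and the file-based checkpoint are hypothetical stand-ins, and a production runtime would add backoff between retries and a more robust store.

```python
import json
import os
from typing import Callable, Dict, List, Tuple


class TransientError(Exception):
    """Stands in for a retryable failure such as a timeout."""


def run_durable(
    steps: List[Tuple[str, Callable[[], str]]],
    checkpoint_path: str,
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Run steps in order, persisting each result so a rerun skips finished work."""
    results: Dict[str, str] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            results = json.load(f)  # resume from the last checkpoint

    for name, fn in steps:
        if name in results:
            continue  # completed in an earlier run; do not repeat it
        for attempt in range(1, max_attempts + 1):
            try:
                results[name] = fn()
                break
            except TransientError:
                if attempt == max_attempts:
                    raise  # give up; earlier steps remain checkpointed
        with open(checkpoint_path, "w") as f:
            json.dump(results, f)  # persist after every completed step
    return results
```

Calling `run_durable` a second time with the same checkpoint path models a restart: steps whose results are already on disk are skipped, and execution picks up at the first unfinished step.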

Human-in-the-Loop Approval Gates

Some actions should not happen without a human saying yes. Sending an email to a customer. Posting on social media. Transferring money. Deleting records. These are actions where the cost of a mistake is high enough that automated execution is not acceptable.

The runtime supports approval gates - points in execution where the agent pauses and waits for human confirmation before proceeding. The state is persisted while waiting. The human reviews the proposed action, approves or rejects it, and the agent continues accordingly.

This is not a bolted-on feature. It is a first-class part of the execution model. Approval gates work the same way whether the wait is five seconds or five days.
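
An approval gate can be modeled as a small state machine whose state is serialized while the agent waits. The sketch below assumes hypothetical function names (`propose`, `decide`, `resume`) and a JSON string as the persisted state; it illustrates the pause-and-resume idea rather than the runtime's actual interface.

```python
import json
from typing import Callable, Optional


def propose(action: str, payload: dict) -> str:
    """Pause point: serialize the proposed action instead of executing it."""
    return json.dumps({"action": action, "payload": payload, "status": "pending"})


def decide(saved: str, approved: bool) -> str:
    """Record the human decision on a pending action."""
    state = json.loads(saved)
    if state["status"] != "pending":
        raise ValueError("action has already been decided")
    state["status"] = "approved" if approved else "rejected"
    return json.dumps(state)


def resume(saved: str, execute: Callable[[dict], str]) -> Optional[str]:
    """Continue after the gate: run the action only if it was approved."""
    state = json.loads(saved)
    if state["status"] == "pending":
        raise RuntimeError("still waiting for approval")
    if state["status"] == "approved":
        return execute(state["payload"])
    return None  # rejected: skip the action
```

Because the pending state is just stored data, nothing in this model depends on how long the human takes to respond.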

Belt: The CLI

Belt is the command-line interface for inference.sh. Install it with one command:

    curl -fsSL https://cli.inference.sh | sh

From there, you can run tools directly from the terminal:

    belt run flux-dev --prompt "a mountain lake at sunset"

You can manage skills:

    belt skill use namespace/skill-name
    belt skill list

And you can connect MCP servers, manage configurations, and interact with the full platform without leaving your terminal.

Belt is designed for developers who live in the terminal. It complements the web UI rather than replacing it - use whichever interface fits your workflow.

MCP: The Protocol Layer

inference.sh supports the Model Context Protocol in both directions.

You can use inference.sh as an MCP server, exposing its 250+ tools to any MCP-compatible client. If your agent framework speaks MCP, it can call inference.sh tools without a custom integration.

You can also connect external MCP servers to inference.sh. If you have an internal service that exposes an MCP interface, the runtime can call it alongside the built-in tools. Your agents get a unified tool surface that spans both the inference.sh catalog and your own services.

This bidirectional MCP support means inference.sh fits into existing setups rather than requiring you to replace them.
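
For readers new to the protocol: MCP messages are JSON-RPC 2.0, and tools are discovered and invoked with the `tools/list` and `tools/call` methods. The sketch below builds those two requests and shows one way a client could merge tool lists from several servers into a single surface. The tool names and server labels are hypothetical, and the merge logic is an assumption, not a description of how inference.sh combines its catalog with external servers.

```python
import json
from typing import Any, Dict, List


def tools_list_request(request_id: int) -> str:
    """JSON-RPC 2.0 request asking an MCP server which tools it exposes."""
    return json.dumps({"jsonrpc": "2.0", "id": request_id, "method": "tools/list"})


def tools_call_request(request_id: int, name: str, arguments: Dict[str, Any]) -> str:
    """JSON-RPC 2.0 request invoking one tool by name with its arguments."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


def merge_tool_surfaces(servers: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every tool name to the server that provides it; reject name clashes."""
    surface: Dict[str, str] = {}
    for server, tool_names in servers.items():
        for tool in tool_names:
            if tool in surface:
                raise ValueError(
                    f"tool {tool!r} is exposed by both {surface[tool]} and {server}"
                )
            surface[tool] = server
    return surface
```

With a merged map like this, the agent addresses tools by name and the client routes each `tools/call` request to whichever server owns that name.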

UI Components

Not every AI application is a CLI tool or a backend pipeline. Many need a user interface - a chat window, a tool approval dialog, a streaming response display.

inference.sh provides AI-native React components for these common patterns. Chat interfaces with streaming support. Generative UI that renders dynamic content based on agent output. Tool approval components that integrate with the runtime's human-in-the-loop system.

These are opinionated components built for the specific needs of AI applications, not generic UI primitives. They handle the streaming, state management, and interaction patterns that AI interfaces require.

Teams and Workspaces

AI work is rarely solo. Teams need shared access to tools, shared skills that encode organizational knowledge, and shared visibility into what agents are doing.

inference.sh workspaces give teams a shared environment. Team members share tool configurations and credentials. Skills evolve as the team learns - when one person improves a skill, everyone benefits. Memory persists across sessions, so agents retain context about the team's work.

Automations are coming soon, enabling teams to set up recurring agent workflows that run on schedule or in response to events.

Self-Hosted Option

Some organizations cannot send data to external services. Regulatory requirements, data residency rules, or security policies might prohibit it.

inference.sh offers a self-hosted option for these cases. Run the platform on your own infrastructure, behind your own firewall, with your own data governance. The same tools, the same runtime, the same skills system - just on hardware you control.

How It Fits Together

The pieces work independently but are better together. You can use just the tools without the runtime. You can use skills without the UI components. You can use Belt without ever opening the web interface.

But the full picture looks like this: you write agent logic in your preferred framework. You install skills that give the agent domain knowledge. The agent calls tools through the inference.sh API. The runtime handles durability, retries, and approval gates. Your team shares the workspace. Belt and the web UI give you visibility and control.

The result is less time building infrastructure and more time building the thing that actually matters - the agent behavior, the pipeline logic, the product experience.

Who Uses inference.sh

The platform serves a few distinct groups.

Solo developers and small teams building AI-powered products. They need tools and infrastructure but do not want to build and maintain it themselves. Pay-per-execution pricing means they start with zero fixed costs and scale with usage.

Agent builders working with frameworks like Claude Code, Cursor, or custom setups. They need reliable tool execution, knowledge management, and durability for production agents.

Companies adding AI capabilities to existing products. They need a consistent API for diverse tool types, approval workflows for sensitive actions, and team features for collaboration.

The common thread is that all of these groups want to focus on their application logic rather than on the infrastructure that supports it.

Getting Started

The fastest path is Belt:

    curl -fsSL https://cli.inference.sh | sh

From there, explore the tool catalog, install a skill, or run your first tool call. The platform has a free tier, so you can experiment before committing.

If you prefer to start with the API, the documentation covers authentication, tool calling, and runtime features. If you want UI components, the React package is available through the standard package registry.

Pick the entry point that matches how you work. The platform meets you where you are.

FAQ

How is inference.sh different from calling model APIs directly?

Direct API calls give you model inference. inference.sh gives you model inference plus 250+ non-model tools (search, email, social media, rendering), a durable execution runtime that handles failures and state, a skills system for portable agent knowledge, and team features. If you only need to call one model, direct API access is simpler. If you are building an agent or pipeline that coordinates multiple tools and needs production reliability, the platform can save you the integration work of wiring up each provider, queue, and retry path yourself.

Do I have to use all the features?

No. The tools, runtime, skills, and UI components are independent. Many users start with just the tool API, calling inference.sh tools from their existing agent framework. Others start with skills, using Belt to install knowledge packages into Claude Code or Cursor. Adopt what you need, ignore the rest.

What happens if inference.sh goes down?

The durable execution model means in-progress work is persisted, not lost. When the platform recovers, executions resume from their last checkpoint. Skills are markdown files stored locally after installation, so they remain available regardless of platform status. The self-hosted option provides full independence from the managed service for teams that need maximum control over availability.