
Building Your First Durable Agent

Most agent tutorials skip the hard part. They show you how to call an LLM, wire up a tool, and get a response. The agent works on your laptop with a stable connection and your full attention. Then you deploy it, a network blip hits mid-task, and all progress vanishes. The user sees nothing. You see nothing. The work is gone.

This guide builds an agent that survives the real world. You will create a simple agent that uses tools, handles failures through durable execution, and includes human-in-the-loop approval for actions that matter. Think of it as the "hello world" for production-grade agents on inference.sh - simpler than a full research agent, but with the foundations that separate toy projects from software you can actually ship.

What You Are Building

The agent you will build does something straightforward: it takes a topic from a user, generates an image based on that topic, and posts the result. Simple enough to understand in one sitting. Complex enough to demonstrate the patterns that matter.

Along the way, you will encounter every problem that breaks agents in production - and see how durable execution solves each one.

Why a Script Falls Short

Before building the agent, it helps to understand what you are replacing. A script that does the same job might look something like this:

Conceptual pseudocode - not a working example:

js
// A naive script approach
const topic = getInput()
const image = await callImageAPI(topic)      // What if this times out?
const caption = await callLLM(topic, image)  // What if the process restarts here?
await postToChannel(image, caption)          // What if this runs twice?

This script has three problems that get worse as tasks get longer.

No state persistence. If the image generation takes two minutes and the process crashes at minute one, you start over. The API call you already paid for? Wasted. The time the user waited? Gone.

No retry logic. If the LLM call hits a rate limit, the script throws an error and stops. You could add try-catch blocks and retry loops, but now you are writing infrastructure instead of agent logic.

No observability. When something goes wrong - and in production, something always goes wrong - you have no record of what happened. No trace of which calls succeeded before the failure. No way to diagnose the problem without adding logging everywhere.

A durable agent handles all three automatically. You write the agent logic. The runtime handles persistence, retries, and logging.
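To see how much infrastructure that is, here is a toy sketch of the retry-with-backoff plumbing a plain script would need to hand-roll. None of this is inference.sh API; `backoffDelay` and `withRetries` are illustrative names for the boilerplate the runtime absorbs for you:

```js
// Exponential backoff: 500ms, 1s, 2s, 4s... capped at maxMs.
function backoffDelay(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, maxMs)
}

// Retry a flaky async call, sleeping between attempts.
async function withRetries(fn, maxAttempts = 4) {
  let lastError
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (err) {
      lastError = err
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)))
    }
  }
  throw lastError
}
```

And this sketch only covers retries. It still does nothing about persistence or observability, which is the point: every line of it is infrastructure, not agent logic.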

The Runtime Model

Before writing code, you need a mental model of how the inference.sh agent runtime works. It is different from a typical server process or a serverless function.

Your agent does not run as a long-lived process sitting in memory waiting for events. Instead, the runtime is event-driven. When something happens - a user message arrives, a tool call completes, an approval comes in - the runtime loads your agent's checkpointed state, executes the next step, and checkpoints the new state.

This means three things for how you think about your agent:

State survives everything. Process crashes, deployments, network failures, server migrations. Your agent's progress is stored externally, not in process memory. When execution resumes, the agent picks up exactly where it left off.

Each step is a checkpoint. After every LLM call, every tool execution, every decision point, the runtime saves the agent's complete state. If a failure happens between step three and step four, steps one through three are already persisted. The agent resumes at step four.

Retries are automatic. If a tool call fails due to a transient error - a rate limit, a timeout, a temporary outage - the runtime retries with appropriate backoff. You do not write retry loops.

This is what "durable execution" means in practice. Your agent's progress is durable - it persists independently of any single process, connection, or machine.
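The checkpoint-and-resume model above can be made concrete with a toy sketch. A `Map` stands in for the external storage the real runtime uses, and `CheckpointStore`/`runStep` are invented names for illustration, not inference.sh API:

```js
// A toy checkpoint store: each completed step's result is saved,
// so re-running the same pipeline after a crash skips finished work.
class CheckpointStore {
  constructor() {
    this.completed = new Map() // stepId -> saved result
  }

  async runStep(stepId, fn) {
    // If this step already ran before a failure, return the saved
    // result instead of re-executing (and re-paying for) the work.
    if (this.completed.has(stepId)) return this.completed.get(stepId)
    const result = await fn()
    this.completed.set(stepId, result) // checkpoint before moving on
    return result
  }
}
```

Running the same step twice executes the underlying work only once, which is exactly the behavior that makes "resume at step four" possible.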

Setting Up

You will use the inference.sh SDK to build the agent. The SDK gives you access to over 250 tools through a single, consistent interface - image generators, language models, search engines, social media APIs, and more.

Conceptual setup - refer to SDK docs for exact installation steps:

js
import Inference from '@inference/sdk'

const client = new Inference()

The client handles authentication, routing to the right model providers, and the underlying HTTP connections. You interact with tools through a single run method regardless of what the tool actually does.

You can also use the Belt CLI to run agents directly from your terminal with belt app run. This is useful for testing during development and for running agents in CI/CD pipelines or automation scripts.

Calling Tools

The core of any agent is its ability to use tools. On inference.sh, every tool - whether it generates images, runs a language model, searches the web, or posts to social media - is available through the same API pattern.

Conceptual example - illustrates the pattern, not exact API:

js
// Generate an image
const imageResult = await client.run('pruna/flux-dev', {
  prompt: "A sunset over mountain peaks, photorealistic"
})

// Use a language model
const captionResult = await client.run('anthropic/claude', {
  prompt: "Write a short caption for this mountain sunset image"
})

The consistency matters. You do not need different SDKs, different authentication flows, or different result formats for different providers. One interface, hundreds of tools. Your agent code stays clean regardless of how many tools it uses.

When the runtime executes a tool call, it checkpoints the result. If the agent is interrupted after the image generates but before the caption is written, the image result is already saved. On resume, the runtime skips the image generation and moves straight to captioning.

Adding Intelligence

A tool-calling script becomes an agent when a language model makes decisions about which tools to call and in what order. Instead of hardcoding the sequence, you give the agent a goal and let it figure out the steps.

Conceptual agent definition - illustrates the structure:

js
const agent = {
  instructions: `You are a creative assistant. Given a topic, you:
    1. Generate an image that captures the topic visually
    2. Write an engaging caption for the image
    3. Post the image with the caption

    Choose appropriate styles and compositions based on the topic.
    If the topic is abstract, find a concrete visual metaphor.`,

  tools: ['pruna/flux-dev', 'anthropic/claude', 'post/channel'],
}

The agent receives the topic, decides how to interpret it visually, picks the right prompt for image generation, evaluates the result, writes a caption that fits, and posts the final output. Each decision is a step. Each step is checkpointed.

This is where durable execution becomes visibly different from a script. The agent might take multiple attempts to get an image it likes. It might revise the caption. It might decide the first approach did not work and try a different visual metaphor. Each of those iterations involves tool calls that cost time and money. Durable execution means none of that work is lost to infrastructure failures.
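The decide-execute-checkpoint cycle described above can be sketched as a minimal loop. The `decide` and `execute` functions are stand-ins (in reality the model and the tool runtime fill those roles), and every name here is illustrative rather than inference.sh API:

```js
// A minimal agent loop: a decision function picks the next tool call,
// the step executes, and the result is appended to a checkpointed log.
async function agentLoop(goal, decide, execute, log = []) {
  while (true) {
    // The model chooses the next step (or decides it is done)
    // based on the goal and everything done so far.
    const action = decide(goal, log)
    if (action.type === 'done') return log
    const result = await execute(action.tool, action.params)
    log.push({ action, result }) // checkpointed after every step
  }
}
```

Because the log is saved after each iteration, an interruption at any point loses at most the step in flight, never the steps already recorded.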

Human-in-the-Loop Approval

Some actions should not happen without a human saying yes. Posting content publicly is a good example. You want the agent to do the creative work autonomously, but you want a person to approve the final result before it goes live.

On inference.sh, you add approval gates to specific tools. When the agent reaches a tool marked for approval, the runtime pauses execution automatically. It sends a notification to the appropriate person. That person sees exactly what the agent wants to do - the specific tool, the specific parameters, the full context. They approve, reject, or modify.

Conceptual example - illustrates the approval pattern:

js
const agent = {
  instructions: "...",
  tools: [
    'pruna/flux-dev',         // Runs automatically
    'anthropic/claude',       // Runs automatically
    {
      name: 'post/channel',   // Requires human approval
      requiresApproval: true
    },
  ],
}

Here is what happens at runtime:

  1. The agent generates an image. No approval needed - this is a creative step with no external side effects.
  2. The agent writes a caption. Same - no approval needed.
  3. The agent attempts to post. The runtime pauses. A notification goes out.
  4. The approver sees the image, the caption, and the target channel. They approve.
  5. The runtime resumes. The post goes live.

If the approver rejects, the agent receives that feedback and adapts. Maybe it generates a different image. Maybe it rewrites the caption. The agent keeps working toward the goal, guided by human judgment at the points that matter.

The key insight: during the pause, no resources are consumed. The agent is not sitting in memory waiting for a response. Its state is checkpointed. When approval arrives - whether that is thirty seconds or three hours later - the runtime loads the checkpoint and continues. This is durable execution making human-in-the-loop practical at scale.
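The pause-and-resume mechanics can be sketched as a toy approval gate. In the real runtime the pending request lives in the checkpointed state, so it survives restarts; here an instance field stands in for that storage, and every name is illustrative, not inference.sh API:

```js
// A toy approval gate: requesting approval records the intended action
// and stops; resolving it later applies the human's decision.
class ApprovalGate {
  constructor() {
    this.pending = null // checkpointed externally in the real runtime
  }

  request(tool, params) {
    // Pause: record exactly what the agent wants to do, then stop.
    this.pending = { tool, params, status: 'awaiting-approval' }
    return this.pending
  }

  resolve(decision) {
    // Resume: load the recorded request and apply the decision.
    const request = this.pending
    this.pending = null
    return { ...request, status: decision } // 'approved' or 'rejected'
  }
}
```

Nothing runs between `request` and `resolve`, which is why a thirty-second approval and a three-hour approval cost the same.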

Real-Time Streaming

While the agent works, you can watch. Every step streams in real time - the agent's reasoning, its tool calls, the results coming back. You see the image being generated, the caption being drafted, the approval request going out.

This is not just nice to have. When you are building and debugging agents, visibility into what the agent is doing and why is the difference between productive iteration and frustrated guessing. You can see when the agent makes a bad decision, understand its reasoning, and adjust the instructions accordingly.

In production, streaming keeps users informed. Instead of staring at a spinner while the agent works, they see progress. "Generating image... Writing caption... Waiting for approval..." Transparency builds trust.
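Consuming a stream of steps typically looks like iterating an async sequence of events. The event names below are invented for illustration (check the SDK docs for the real streaming interface); the shape of the loop is the point:

```js
// Stand-in for events arriving from the runtime as the agent works.
async function* agentEvents() {
  yield { type: 'tool-start', tool: 'pruna/flux-dev' }
  yield { type: 'tool-result', tool: 'pruna/flux-dev' }
  yield { type: 'approval-requested', tool: 'post/channel' }
}

// Watch every event as it arrives and hand it to a callback,
// e.g. to update a progress indicator in a UI.
async function watch(stream, onEvent) {
  for await (const event of stream) onEvent(event)
}
```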

Observability

Every tool call, every LLM decision, every approval event is logged and traceable. This happens automatically - you do not add logging statements or configure log levels.

When something goes wrong in production - and something always does eventually - you can trace the exact sequence of events. Which tool was called with which parameters. What the LLM decided and why. Where the failure occurred. How the retry handled it.

This observability serves three purposes:

Debugging. When an agent produces unexpected output, you trace backward from the result through every decision and tool call to find where things went wrong. Was the image prompt bad? Did the LLM misinterpret the topic? Did a tool return unexpected data?

Cost awareness. You see exactly which tool calls happened, how many retries occurred, and where time was spent. This lets you optimize - maybe a different image model is faster for your use case, or your prompts are causing unnecessary retries.

Compliance. For teams that need audit trails, every action is recorded with timestamps, inputs, outputs, and approval decisions. The records exist without any extra code.

The Difference in Practice

Here is what your agent now handles that a script does not:

The image API times out. A script crashes. Your agent retries automatically with backoff. If the timeout persists, the checkpoint preserves all prior progress. When the API recovers, the agent continues from where it stopped.

The process restarts mid-task. A script loses everything. Your agent's state is checkpointed externally. The new process loads the checkpoint and picks up at the exact step where execution was interrupted.

A tool call runs twice. A script might post duplicate content. Your agent's runtime tracks which actions have completed, preventing duplicate side effects on retry.

The user closes their browser. A script stops. Your agent keeps working (or pauses at an approval gate), and the user can check back later to see the result.

You need to know what happened last Tuesday. A script has whatever logs you remembered to add. Your agent has a complete trace of every decision and action.
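The "runs twice" scenario above comes down to idempotency: the runtime remembers which side effects already happened. A toy sketch of that guard, with invented names and a `Set` standing in for persisted state:

```js
// A toy idempotency guard: a side effect keyed by actionKey
// executes at most once, even if the step is retried.
class IdempotencyGuard {
  constructor() {
    this.executed = new Set() // persisted with checkpoints in a real runtime
  }

  async runOnce(actionKey, fn) {
    if (this.executed.has(actionKey)) return { skipped: true }
    const result = await fn()
    this.executed.add(actionKey)
    return { skipped: false, result }
  }
}
```

A retry that replays the same action key becomes a no-op instead of a duplicate post.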

Running with Belt

You can also run agents from the command line using Belt, the inference.sh CLI. This is useful for testing agents during development, running them in automation scripts, and integrating them into existing workflows.

bash
belt app run your-agent-name

Belt connects to the same runtime, so you get the same durable execution, the same observability, and the same approval flows. The difference is the interface - terminal output instead of a web UI.

Where to Go From Here

This guide covered the fundamentals - tools, durable execution, human-in-the-loop, streaming, and observability - through the simplest possible agent. Real agents build on these same patterns with more tools, more complex decision-making, and more sophisticated approval workflows.

Some directions to explore:

More tools. The 250+ tools on inference.sh cover image generation, video, audio, search, social media, code execution, and more. Your agent can combine any of them.

Multi-step workflows. Instead of a single generate-caption-post flow, build agents that iterate. Generate multiple options, evaluate them, refine the best one, then seek approval.

Memory across sessions. Agents can remember information across runs. A content creation agent might learn your preferred styles, your audience's preferences, and what performed well in the past.

Multi-agent systems. Complex tasks benefit from multiple specialized agents coordinating. One agent researches, another creates visuals, another writes copy, and an orchestrator coordinates them all.

The patterns you learned here - checkpointed state, tool composition, approval gates - remain the foundation regardless of complexity.

FAQ

How is a durable agent different from a serverless function with retry logic?

A serverless function with retries can handle individual call failures, but it does not preserve state across those retries. If your function calls three APIs in sequence and the process crashes after the second call, retrying the function re-executes all three calls from the start. A durable agent checkpoints after each step, so a crash after step two means resuming at step three. The distinction becomes significant as tasks grow longer and involve more expensive operations.

What happens if an approval request is never answered?

The agent's state stays checkpointed indefinitely. There is no process consuming resources while waiting. If an approval times out (configurable), the agent can be designed to handle that - perhaps by notifying a different approver, proceeding with a safer default action, or alerting the operator. The state persists regardless, so even a very late approval can still resume execution.

Can I test durable execution locally during development?

Yes. The Belt CLI (belt app run) connects to the same runtime, so your local development workflow gets the same durability guarantees as production. You can simulate failures by stopping and restarting the agent mid-task to verify that checkpointing works as expected. Streaming output in the terminal lets you watch each step execute and confirm the agent behaves correctly before deploying.
