inference.sh

Durable Execution for AI Agents

When your agent crashes mid-task, does it lose all progress? Durable execution uses checkpoints to make agents resilient to failures, network issues, and process restarts. See durable execution in action →

Every demo agent runs for thirty seconds on a stable connection with an attentive operator. Every production agent eventually faces a network blip, a rate limit, a crashed process, or a user who closes their laptop mid-task. The difference between an agent that handles these gracefully and one that loses all progress is durable execution - a pattern borrowed from workflow orchestration that makes agent runs resilient to the failures that inevitably occur in real-world environments.

The Problem with Ephemeral Agent Runs

Consider what happens during a typical agent task. A user asks for a research report on market trends. The agent begins by searching multiple sources, spending several minutes gathering information. It then reads and extracts key points from each source, cross-references findings, and starts assembling a structured report. Fifteen minutes in, the user's VPN reconnects and drops the WebSocket connection. In a standard agent implementation, everything vanishes. The searches already completed, the documents already analyzed, the partial report already drafted - all gone because agent state existed only in the memory of a now-terminated process.

The user returns, frustrated, and starts the same request again. The agent performs the same searches, re-reads the same documents, and re-extracts the same information. The API costs double. The user's time doubles. The experience feels broken even though, technically, the agent worked correctly both times. It simply was not designed to survive interruptions.

This pattern repeats across every failure mode that production environments present. A deployment kills running processes. A dependent API goes down temporarily. The LLM provider rate-limits your requests during a traffic spike. Memory pressure causes the container to restart. Each failure throws away whatever progress the agent had made.

For simple question-answering tasks, this is annoying but recoverable. Users can ask again. But as agents take on more complex work - research projects spanning twenty minutes, data processing involving multiple stages, multi-step workflows with external side effects - the cost of losing progress becomes unacceptable.

What Durable Execution Means

Durable execution is a pattern where the execution state of a long-running process is persisted externally so that the process can resume from its last checkpoint after any interruption. The concept comes from workflow orchestration systems like Temporal and Azure Durable Functions, adapted here for the specific needs of agent workloads.

The core idea involves four components working together. State checkpointing saves the complete agent state after each meaningful step - after every LLM call completes, after every tool returns results, after every decision is made. Resumability means that if execution stops for any reason, it can restart from the most recent checkpoint instead of from the beginning. Retry logic handles transient failures in individual steps by attempting them again with appropriate backoff. Idempotency ensures that retrying steps does not cause duplicate side effects - the same tool call executed twice produces the same outcome as executing it once.

Together, these components make agent execution durable because the progress persists independently of any single process, connection, or server.
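In code terms, these four responsibilities can be thought of as the interface a durable runtime exposes to the agent loop. The sketch below is illustrative only - the class and method names are hypothetical, not inference.sh's actual API.

```python
# Illustrative only: a minimal interface a durable-execution runtime might expose.
# The names (DurableRuntime, save_checkpoint, etc.) are hypothetical, not a real API.
from abc import ABC, abstractmethod
from typing import Any, Callable, Optional


class DurableRuntime(ABC):
    @abstractmethod
    def save_checkpoint(self, run_id: str, state: dict) -> None:
        """State checkpointing: persist agent state after each completed step."""

    @abstractmethod
    def load_checkpoint(self, run_id: str) -> Optional[dict]:
        """Resumability: return the most recent checkpoint, or None for a fresh run."""

    @abstractmethod
    def retry(self, step: Callable[[], Any], max_attempts: int = 3) -> Any:
        """Retry logic: re-attempt a step that failed transiently, with backoff."""

    @abstractmethod
    def idempotency_key(self, run_id: str, step_index: int) -> str:
        """Idempotency: a stable key so a re-executed step does not duplicate side effects."""
```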

Why Standard Approaches Fail

Most agent implementations follow a straightforward loop pattern. Accept user input, call the model, execute any tools the model requests, feed results back to the model, repeat until done, return the final response. This loop runs synchronously in memory. If anything interrupts it - an exception, a timeout, a process termination - the loop stops and its state disappears.
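To make the fragility concrete, here is a minimal sketch of that ephemeral loop. The call_model and execute_tool callables are hypothetical stand-ins for your LLM client and tool layer; the point is that every piece of progress lives in a local variable.

```python
# A typical ephemeral agent loop: all progress lives in local variables.
# call_model and execute_tool are hypothetical stand-ins passed in by the caller.
def run_agent(user_input, call_model, execute_tool):
    messages = [{"role": "user", "content": user_input}]
    while True:
        response = call_model(messages)            # network call; may fail or hang
        messages.append(response)
        if not response.get("tool_calls"):
            return response["content"]             # done
        for tool_call in response["tool_calls"]:
            result = execute_tool(tool_call)       # more network calls, more waiting
            messages.append({"role": "tool", "content": result})
        # If the process dies anywhere in this loop, `messages` is gone and the
        # next run starts from zero.
```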

Adding basic error handling helps with transient failures but does not solve the durability problem. You can retry a failed API call, but if the process itself terminates, all the conversation history, all the tool results gathered so far, all the intermediate reasoning - none of that survives.

Some teams attempt to add checkpointing to their agent loop. After each step, serialize the state and save it to a database. On startup, check for existing state and resume if found. This works in principle but introduces substantial complexity. You need to design a state schema that captures everything relevant. You need serialization logic that handles all the types in your state. You need storage infrastructure that is fast enough to not slow down execution but reliable enough to trust. You need cleanup logic for abandoned sessions. You need to handle the edge cases where state was partially written before a crash. You need to verify that resumed execution produces correct behavior.
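To give a sense of the scale of that work, here is a sketch of just one of those edge cases - guarding against partially written checkpoints with an atomic write. Everything else on the list above (schema design, custom-type serialization, cleanup, resume verification) still has to be built and maintained around it.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write a checkpoint atomically so a crash mid-write never leaves a corrupt file."""
    payload = json.dumps({"version": 1, "state": state})
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())          # make sure bytes hit disk before the swap
        os.replace(tmp_path, path)        # atomic rename: readers see old or new, never partial
    except BaseException:
        os.unlink(tmp_path)               # discard the half-written temp file
        raise
```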

This is infrastructure work, not agent work. Every hour spent building checkpoint storage is an hour not spent improving agent capabilities.


inference.sh handles checkpointing automatically. Every agent gets durable execution by default. State persists across failures, and agents resume from where they left off. Learn more →


How Durable Execution Works in Practice

When an agent runs on infrastructure designed for durable execution, each step in the agent loop triggers a checkpoint automatically. The conversation history, the memory the agent has stored, the results from completed tool calls, the current position in any multi-step plan - all of this persists to durable storage before execution continues.

If a connection drops, the stored state remains. When the user reconnects or sends another message, the system loads the checkpoint and continues from where execution stopped. The agent does not re-execute the tool calls it already completed. It picks up mid-task, with full context of what it was doing and why.
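A simplified sketch of what such a runtime does on every step is shown below. The storage, call_model, and execute_tool objects are hypothetical stand-ins; the real mechanics are more involved, but the shape - load the checkpoint, skip completed work, save after each step - is the same.

```python
# Simplified sketch of a checkpoint-and-resume loop. storage, call_model, and
# execute_tool are hypothetical stand-ins; a real runtime does this for you.
def run_durable_agent(run_id, user_input, storage, call_model, execute_tool):
    # Resume from the most recent checkpoint if one exists, otherwise start fresh.
    state = storage.get(run_id) or {
        "messages": [{"role": "user", "content": user_input}],
        "pending_tool_calls": [],      # requested by the model but not yet executed
    }
    while True:
        if not state["pending_tool_calls"]:
            response = call_model(state["messages"])
            state["messages"].append(response)
            state["pending_tool_calls"] = list(response.get("tool_calls") or [])
            storage.put(run_id, state)                 # checkpoint after the LLM step
            if not state["pending_tool_calls"]:
                return response["content"]             # task finished
        while state["pending_tool_calls"]:
            tool_call = state["pending_tool_calls"][0]
            result = execute_tool(tool_call)
            state["messages"].append({"role": "tool", "content": result})
            state["pending_tool_calls"].pop(0)         # completed work leaves the queue
            storage.put(run_id, state)                 # checkpoint after each tool step
```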

From the user's perspective, the experience is seamless. They might see "Resuming your research task..." instead of starting fresh. The work already done remains done. The time already invested is not lost.

From the operator's perspective, failures become routine events rather than incidents. A process termination during a task is no longer a problem to debug and apologize for - it is a normal occurrence that the system handles automatically.

The Anatomy of an Agent Checkpoint

Understanding what gets checkpointed helps clarify why durable execution requires specific infrastructure rather than a quick add-on.

The conversation history is the most obvious component - all the messages exchanged between user and agent, including the tool calls and their results. This history is what gives the agent context for its next decision. Without it, the agent cannot continue a task because it does not know what task it was doing or what it has already learned.

The agent memory contains information the agent has explicitly stored for later reference. A research agent might store key facts it has discovered. A customer service agent might store the user's account details. This memory is separate from the conversation history and must be preserved separately.

The plan state tracks progress through multi-step tasks. If an agent decided to perform five searches and has completed three, the plan state records that the first three are done and the fourth is next. Without this, the agent might repeat work or skip steps upon resume.

The sub-agent state matters when agents delegate to other agents. The parent agent needs to know which delegations are pending, which have completed, and what results came back. Resuming a parent agent requires resuming any in-progress child agents as well.

The execution context includes metadata about the run itself - timing information, configuration settings, authentication tokens in use. Some of this can be reconstructed from other sources, but including it in the checkpoint simplifies resumption.

All of this state must be serialized in a format that can be stored and later deserialized to recreate the exact execution context. This is not trivial when state might include custom objects, circular references, or large binary data.
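As a rough illustration, a checkpoint might be modeled like the dataclass below. The field names are hypothetical, not inference.sh's actual schema, but they show how each category of state gets an explicit, serializable home.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class AgentCheckpoint:
    """Illustrative checkpoint schema; field names are hypothetical, not a real API."""
    run_id: str
    # Conversation history: every message, tool call, and tool result so far.
    messages: list[dict[str, Any]] = field(default_factory=list)
    # Agent memory: facts the agent explicitly stored for later reference.
    memory: dict[str, Any] = field(default_factory=dict)
    # Plan state: which steps of a multi-step plan are done and which comes next.
    plan_steps: list[str] = field(default_factory=list)
    completed_steps: int = 0
    # Sub-agent state: delegated child runs and their status.
    child_runs: dict[str, str] = field(default_factory=dict)   # child_run_id -> "pending" | "done"
    # Execution context: metadata about the run itself.
    started_at: Optional[str] = None
    config: dict[str, Any] = field(default_factory=dict)
```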

Failure Scenarios That Durability Addresses

Different failure types require different handling, all unified under the durability model.

Network interruptions are the most common. The user's connection drops, but the server-side agent process might continue running for a while. When the connection cannot be restored, the process eventually terminates, but the checkpointed state survives. When the user reconnects, they continue from the checkpoint.

Process crashes from bugs, memory exhaustion, or infrastructure issues terminate execution immediately. The most recent checkpoint represents the recoverable state. Any work done since the last checkpoint is lost, which is why checkpointing at appropriate granularity matters.

Timeouts occur when operations take longer than expected. LLM providers sometimes respond slowly. External APIs might hang. Durable execution systems typically use checkpoint-and-resume rather than long-lived connections, so timeouts in the underlying transport do not lose progress.

Rate limits and throttling affect agents that make many API calls in succession. Rather than failing the entire task, durable execution allows pausing, waiting for the rate limit window to pass, and resuming. The task takes longer but completes successfully.
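The pause-and-resume behavior around rate limits often reduces to an exponential backoff wrapper around the failing step, as in the sketch below. The RateLimitError name is a placeholder for whatever exception your provider raises; because state was checkpointed before the step, waiting costs time but never loses completed work.

```python
import random
import time


class RateLimitError(Exception):
    """Placeholder for whatever exception your LLM provider raises on HTTP 429."""


def with_backoff(step, max_attempts=5, base_delay=1.0):
    """Retry a transiently failing step with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return step()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise                                   # give up after the final attempt
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)                           # wait out the rate-limit window
```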

Deployments and scaling cause process termination as a normal part of operations. New code versions require restarting processes. Scaling down kills some instances. In traditional architectures, long-running tasks cannot survive these events. With durable execution, they migrate to new processes seamlessly.

Building Agents That Survive

If you are building agents that need to handle real-world workloads, durability should be a design consideration from the start. The most important decision is whether to build the durability infrastructure yourself or use a platform that provides it.

Building yourself means implementing the checkpoint storage, designing the state schema, handling all the serialization edge cases, integrating with your deployment infrastructure, and maintaining the system over time. This makes sense if you have unusual requirements or an existing platform team equipped for this kind of work.

Using a runtime that provides durable execution means these concerns are handled for you. Your agent code focuses on the task logic. The runtime handles checkpointing, storage, and resumption automatically.

Either way, designing agents with durability in mind means avoiding state that cannot be serialized (like open file handles or active network connections), using tools that can handle being called multiple times safely, and thinking about what happens if any step is the last step before an interruption.
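One practical way to follow the serializability rule is to checkpoint the configuration needed to rebuild a resource rather than the live resource itself. The sketch below uses Python's pickle hooks and a SQLite connection as a stand-in for any unserializable handle.

```python
import sqlite3


class ResearchNotes:
    """Keeps only serializable fields in the checkpoint; the live connection is rebuilt on demand."""

    def __init__(self, db_path: str):
        self.db_path = db_path          # serializable configuration
        self._conn = None               # live resource, never checkpointed

    def _connection(self) -> sqlite3.Connection:
        if self._conn is None:
            self._conn = sqlite3.connect(self.db_path)   # lazily rebuilt after a resume
        return self._conn

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_conn"] = None           # drop the open connection before serialization
        return state
```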

The Experience Difference

For users, durable execution transforms agents from unreliable tools into dependable assistants. A research task can be started on a laptop, continued on a phone during a commute, and finished on a desktop at work. Interruptions become pauses, not restarts. Complex tasks that take time are now viable because the time investment is protected.

For developers, durability removes an entire category of support issues. The "it lost my work" complaints disappear. Debugging becomes easier because you can examine the checkpointed state at any step. Confidence in deploying changes increases because you know running tasks will survive.

For operators, durability changes the reliability characteristics of the system. Instead of worrying about keeping processes alive, you let them terminate and restart freely. The system's correctness does not depend on nothing ever going wrong - it depends on recovering correctly when things do go wrong.

This shift - from hoping nothing breaks to expecting recovery from breaks - is what makes durable execution essential for production agent systems. It is the foundation that enables agents to take on substantial, long-running work that creates real value.

For teams building agents that need to survive the real world, inference.sh provides durable execution as a core capability. Checkpointing, resumption, and retry logic are built into the runtime. Your agents can focus on being useful while the infrastructure handles being reliable.

FAQ

How often should agent state be checkpointed?

The right checkpoint frequency balances durability against overhead. Checkpointing too rarely means losing significant progress when failures occur. Checkpointing too frequently adds latency and storage costs. The practical answer is to checkpoint after each complete step - after every LLM call completes and after every tool returns results. This granularity means you lose at most one step's worth of progress, which is usually acceptable. Checkpointing mid-step (like during LLM streaming) adds complexity without much benefit since incomplete steps generally need to be re-executed anyway.

Does durable execution add latency to agent runs?

Yes, but typically not enough to matter for user experience. Each checkpoint involves serializing state and writing to storage. With efficient serialization and fast storage, this adds milliseconds per checkpoint. For tasks measured in seconds or minutes, this overhead is negligible compared to the time spent waiting for LLM calls and tool executions. The latency cost is also much lower than the cost of restarting failed tasks from scratch, so the net effect on user experience is positive. If you need to optimize, focus on state size rather than checkpoint frequency - smaller state serializes faster.

Can any agent be made durable, or does it require specific design?

Most agents can be made durable with minimal changes, but some design patterns work better than others. The main requirement is that agent state must be serializable. Avoid keeping open connections, file handles, or other resources that cannot be serialized. Tools should be designed so that calling them multiple times with the same input produces the same result, or at least does not cause harmful side effects. If a tool makes an irreversible change like sending an email, ensure it tracks that the action was taken so resumed execution does not send duplicate emails. These considerations are good practice regardless of durability.
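A common way to get that guarantee is an idempotency key checked before the side effect and recorded after it, as in the hypothetical sketch below. Here deliver() stands in for whatever email API you use; a production version would keep the key set in durable storage and, ideally, pass the key to a provider that supports exactly-once delivery.

```python
# Sketch of an idempotent side effect: record a key so a resumed run can tell the
# email was already sent. deliver() is a hypothetical stand-in for your email API.
_sent_keys: set[str] = set()        # in practice this lives in durable storage, not memory


def send_email_once(idempotency_key: str, to: str, body: str, deliver) -> None:
    if idempotency_key in _sent_keys:
        return                       # sent before the interruption; do nothing
    deliver(to=to, body=body)
    _sent_keys.add(idempotency_key)  # a retry after this point becomes a no-op
    # A crash between deliver() and the add() can still duplicate the send; true
    # exactly-once behavior needs the key stored transactionally or honored by the provider.
```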
