The most powerful agents can write and execute code. This capability transforms agents from systems that can only use predefined tools into general-purpose problem solvers. Need to analyze a dataset? Write Python code to process it. Need to transform some data? Write a script. Need to test a hypothesis? Write code to check it. But arbitrary code execution is also the most dangerous capability an agent can have. Sandboxing makes code execution safe enough for production use.
Why Code Execution Matters
Predefined tools limit agents to what their builders anticipated. A web search tool searches the web. A calculator calculates. Each tool does one thing. If the user's request does not match any available tool, the agent cannot help.
Code execution changes this. An agent that can write and run code can do essentially anything that code can do. Data analysis, transformation, computation, validation, automation - these become possible even if no specific tool exists for the exact task.
Consider a user who uploads a spreadsheet and asks for analysis. Without code execution, you need a specific data analysis tool configured for the exact analysis they want. With code execution, the agent writes Python code to load the data, perform whatever analysis makes sense, and return the results. The agent adapts to the task rather than requiring pre-built tools for every possibility.
This flexibility is why the most capable agent systems include code execution. It is also why code execution requires careful handling - the flexibility that makes it powerful also makes it dangerous.
The Security Challenge
Arbitrary code execution is the ultimate attack surface. Code can do anything the process it runs in can do. Without containment, agent-generated code could access files it should not read, make network connections to external servers, consume unlimited computational resources, or interact with other processes and systems.
Even without malicious intent, code can cause problems. A bug in agent-generated code might enter an infinite loop, consuming resources until something crashes. A misunderstanding about file paths might overwrite important data. An unintended network call might expose internal services to external traffic.
The challenge is providing the power of code execution while containing its potential damage. This is the job of sandboxing.
Sandboxing Approaches
Several approaches provide different levels of isolation and security.
Language-level restrictions attempt to sandbox within the language runtime. Python's exec() with restricted builtins is a common example. This approach is lightweight but inherently leaky - determined code can often escape language-level restrictions through edge cases and implementation details.
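A minimal sketch makes the leak concrete. The restricted namespace below removes open() and __import__, yet a well-known dunder-attribute chain walks back from an empty tuple to every class loaded in the interpreter, including file and process wrappers:

```python
# Sketch of a language-level "sandbox": exec() with stripped-down builtins.
# Illustrative only -- this technique should not be relied on for isolation.
SAFE_BUILTINS = {"len": len, "range": range, "sum": sum}

def run_restricted(code: str) -> dict:
    """Execute code with a restricted builtins namespace."""
    namespace = {"__builtins__": SAFE_BUILTINS, "out": []}
    exec(code, namespace)
    return namespace

# Looks contained: open(), __import__, and friends are unavailable.
ns = run_restricted("out.append(sum(range(10)))")
print(ns["out"][0])  # 45

# But dunder attributes escape back to the full interpreter:
escape = "out.append(().__class__.__bases__[0].__subclasses__())"
leaked = run_restricted(escape)["out"][0]
print(len(leaked))  # every loaded class -- the "sandbox" leaks the interpreter
```

Variants of this escape have been rediscovered repeatedly, which is why language-level restriction is best treated as defense in depth, never as the primary boundary.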
Process isolation runs code in a separate process with limited permissions. Better than language-level restrictions, but the process still shares the host's filesystem, network, and resources. Configuring permission restrictions is complex and error-prone.
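As a sketch of the process-isolation approach, the snippet below runs code in a child Python process and uses the POSIX-only resource module to cap CPU time and address space before the child starts. Note how little it actually restricts: the child still shares the host's filesystem and network, which is exactly the weakness described above.

```python
import resource  # POSIX only; unavailable on Windows
import subprocess
import sys

def _limit_child():
    # Runs in the child just before exec: cap CPU seconds and total
    # address space so runaway code is killed by the kernel.
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))
    resource.setrlimit(resource.RLIMIT_AS, (512 * 1024**2, 512 * 1024**2))

def run_in_process(code: str) -> subprocess.CompletedProcess:
    """Execute code in a separate, resource-limited Python process."""
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=5,
        preexec_fn=_limit_child,
    )

result = run_in_process("print(2 + 2)")
print(result.stdout.strip())  # 4
```

Getting even this far requires platform-specific calls, and the limits above say nothing about which files or network destinations the child may touch; locking those down per-process is the complex, error-prone part.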
Container isolation runs code in containers with their own filesystem namespace, network isolation, and resource limits. This provides strong isolation from the host system. The overhead is higher, but security is substantially better.
Managed sandbox services handle sandboxing as infrastructure. You send code to the service; it executes in an isolated environment and returns results. The security complexity is the service provider's responsibility.
For production agent systems, container-level or managed sandboxing is typically necessary. Language-level and process-level restrictions are insufficient against sophisticated attacks or even accidental damage.
What Sandboxing Provides
A properly sandboxed execution environment provides several guarantees.
Filesystem isolation means code sees a limited filesystem, not the host's real filesystem. Only explicitly provided files are accessible. Code cannot read sensitive files or overwrite important data on the host.
Network isolation means code cannot make arbitrary network connections. Outbound connections to external services are blocked unless explicitly allowed. Data exfiltration becomes much harder.
Resource limits bound the CPU time, memory, and disk space code can use. Infinite loops exhaust their time limit and terminate. Memory-intensive code hits limits rather than crashing the host.
Process isolation means code cannot spawn other processes, access other running processes, or affect the host operating system. Fork bombs and similar attacks are contained.
Clean environment means each execution starts fresh. State from previous executions does not persist. Code cannot poison the environment for future runs.
Together, these guarantees allow executing untrusted code - code that might be buggy, might be maliciously crafted, or might simply do unexpected things - with confidence that damage is contained.
Practical Code Execution
For agents, sandboxed code execution typically works as follows.
The agent decides to execute code and generates the code to run. This code goes to the sandboxed execution environment along with any input files the agent provides. The sandbox executes the code, enforcing all isolation constraints. Results - standard output, created files, returned values - come back to the agent. Any errors or constraint violations are reported rather than causing broader failure.
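The round trip described above can be sketched as a single function: stage the agent's input files into a scratch directory, run the generated code there, then collect stdout, stderr, and any newly created files. This sketch uses a plain subprocess for clarity; in production the execution step would be a container or managed sandbox service, and the function names here are illustrative.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def execute_sandboxed(code: str, input_files=None, timeout=10.0) -> dict:
    """Stage inputs, run agent code in a scratch dir, collect results."""
    input_files = input_files or {}
    with tempfile.TemporaryDirectory() as workdir:
        work = Path(workdir)
        for name, data in input_files.items():
            (work / name).write_bytes(data)      # stage agent-provided inputs
        proc = subprocess.run(
            [sys.executable, "-c", code],
            cwd=work, capture_output=True, text=True, timeout=timeout,
        )
        outputs = {p.name: p.read_bytes()        # collect newly created files
                   for p in work.iterdir() if p.name not in input_files}
    return {"stdout": proc.stdout, "stderr": proc.stderr,
            "returncode": proc.returncode, "files": outputs}

result = execute_sandboxed(
    "data = open('in.txt').read()\nopen('out.txt', 'w').write(data.upper())",
    input_files={"in.txt": b"hello"},
)
print(result["files"]["out.txt"])  # b'HELLO'
```

Returning errors as structured data rather than raising keeps the contract from the paragraph above: constraint violations are reported to the agent instead of causing broader failure.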
The agent's prompt should explain when code execution is appropriate and provide guidance on writing code that works within sandbox constraints. For example, the prompt might explain that network access is not available, that specific libraries are pre-installed, and that output should be written to particular locations.
Users interacting with code-executing agents should understand that code runs in isolation. Files they upload become available in the sandbox. Generated outputs can be downloaded. The code cannot access anything beyond what is explicitly provided.
Common Use Cases
Code execution enables several agent capabilities that would otherwise require specialized tools.
Data analysis is perhaps the most common. Users upload data files, agents write Python code using pandas, numpy, or similar libraries to analyze the data, and results come back. The variety of possible analyses means no finite set of predefined tools could cover all cases.
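The kind of code an agent generates for such a request might look like the following: a stdlib-only group-and-summarize over an uploaded CSV (the data and column names here are invented for illustration; in practice the agent would more likely reach for pandas if the sandbox provides it).

```python
import csv
import io
import statistics

# Hypothetical uploaded file, staged into the sandbox by the platform.
uploaded = "region,revenue\nnorth,120\nsouth,95\nnorth,140\n"

# Group revenue values by region.
by_region = {}
for row in csv.DictReader(io.StringIO(uploaded)):
    by_region.setdefault(row["region"], []).append(float(row["revenue"]))

# Summarize each group.
summary = {region: {"total": sum(vals), "mean": statistics.mean(vals)}
           for region, vals in by_region.items()}
print(summary)  # {'north': {'total': 260.0, 'mean': 130.0}, ...}
```

The point is not this particular analysis but that the agent writes whichever analysis the question calls for, which no finite tool catalog could anticipate.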
Data transformation converts data between formats, applies filtering or mapping, and restructures information. Again, the variety of transformations is too large for predefined tools.
Computation handles math, statistics, simulations, and other calculations too complex for simple tools. Agents can write algorithms to solve specific problems.
Validation checks data, tests conditions, and verifies properties. Code can implement arbitrary validation logic.
Automation covers generating scripts that users can take away and use elsewhere. The sandbox executes test runs; users get working scripts for their own environments.
Sandbox Configuration
Different use cases warrant different sandbox configurations.
Minimal sandboxes for simple computation might allow only pure Python with limited libraries, no filesystem access, and no network. The attack surface is tiny; the capability is limited to computation.
Data analysis sandboxes add filesystem access for input/output files and data science libraries. Still no network, but rich computational capability.
Extended sandboxes might allow specific network destinations for cases where code needs to call approved APIs. Each allowed destination expands capability but also attack surface.
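These tiers can be expressed as declarative capability profiles. The sketch below is hypothetical - the field names and limits are illustrative, not any platform's actual configuration schema - but it shows how each tier enables only what its use case requires:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SandboxProfile:
    """Hypothetical capability profile; fields are illustrative."""
    libraries: tuple = ()           # pre-installed packages
    filesystem_access: bool = False # scratch dir for input/output files
    allowed_hosts: tuple = ()       # outbound network allowlist
    timeout_s: float = 10.0
    memory_mb: int = 256

# Minimal: pure computation, nothing else.
MINIMAL = SandboxProfile()

# Data analysis: files and data science libraries, still no network.
DATA_ANALYSIS = SandboxProfile(
    libraries=("pandas", "numpy"), filesystem_access=True,
    timeout_s=60.0, memory_mb=2048,
)

# Extended: specific approved API destinations only.
EXTENDED = SandboxProfile(
    libraries=("pandas", "requests"), filesystem_access=True,
    allowed_hosts=("api.example.com",), timeout_s=60.0, memory_mb=2048,
)
```

Making the profile explicit also makes the attack surface auditable: each field that moves away from its restrictive default is a deliberate, reviewable decision.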
The principle is minimal capability: enable only what the use case requires. A sandbox that allows everything is not really sandboxing; it is just running code.
Timeout configuration deserves attention. Long-running computations might be legitimate or might be bugs or attacks. Setting appropriate timeouts balances allowing real work with preventing resource exhaustion.
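A minimal sketch of that balance: enforce a wall-clock limit on the child process and, when it fires, return a structured error the agent can reason about rather than letting the failure propagate.

```python
import subprocess
import sys

def run_with_timeout(code: str, timeout_s: float) -> dict:
    """Run code in a child process; report timeouts as structured errors."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return {"ok": True, "stdout": proc.stdout}
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing leaks.
        return {"ok": False,
                "error": f"execution exceeded {timeout_s}s limit"}

# An infinite loop exhausts its limit and terminates cleanly.
print(run_with_timeout("while True: pass", timeout_s=1.0))
```

Given a structured error like this, the agent can retry with a cheaper approach or report the limit to the user, as discussed in the FAQ below.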
Building vs Using Sandboxing
Building sandbox infrastructure requires substantial effort. Container orchestration, security configuration, resource management, network policy, and operational monitoring all need implementation. Doing this correctly requires security expertise that not all teams have.
Using a platform with managed sandboxing shifts this complexity. The platform provides the execution environment; you provide the code to execute. Security is the platform's responsibility and specialty.
For most agent deployments, using managed sandboxing makes more sense than building custom infrastructure. Sandboxing is infrastructure that should work reliably; it is not a competitive differentiator for most agent products.
For teams building agents that need code execution, inference.sh provides sandboxed Python execution as a built-in capability. Code runs in isolated containers with appropriate resource limits. The agent sends code; the platform handles secure execution. You focus on what code the agent should write, not on how to contain its execution.
Code execution is the capability that makes agents truly flexible. Sandboxing is what makes that capability safe to deploy. The combination enables agents that can adapt to arbitrary tasks while operating within appropriate safety bounds.
FAQ
How do I handle code that needs external network access?
Network-enabled code execution is risky and should be used sparingly. If legitimate use cases require network access, use whitelisting to allow only specific destinations rather than general internet access. Require human approval for network-enabled execution, especially for user-facing agents. Consider whether the network call could be made outside the sandbox by the agent itself, with only the processing done in sandbox. Log all network connections for audit purposes. Accept that network-enabled sandboxes have larger attack surface and monitor accordingly. Default to network-disabled execution and enable only when truly necessary.
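An allowlist check might be sketched as follows. The hostnames are hypothetical, and in a real deployment enforcement must live at the network layer (container network policy or an egress proxy), not in-process where sandboxed code could bypass it; a check like this is useful as a pre-flight validation and audit point.

```python
from urllib.parse import urlparse

# Hypothetical approved destinations for this sandbox profile.
ALLOWED_HOSTS = {"api.example.com", "internal-data.example.com"}

def check_outbound(url: str) -> str:
    """Reject any URL whose host is not explicitly allowlisted."""
    host = urlparse(url).hostname
    if host not in ALLOWED_HOSTS:
        raise PermissionError(f"outbound connection to {host!r} is not allowed")
    return host  # log this for audit purposes

check_outbound("https://api.example.com/v1/data")  # passes
try:
    check_outbound("https://evil.example.net/exfil")
except PermissionError as err:
    print(err)  # blocked and reportable
```

Denying by default and enumerating exceptions keeps the attack surface proportional to the number of approved destinations rather than to the whole internet.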
What happens when sandbox resource limits are exceeded?
Exceeding resource limits should terminate the execution cleanly rather than crashing the host or leaking resources. The agent receives an error indicating what limit was exceeded - time, memory, disk, or another constraint. The agent can then decide how to proceed: inform the user, try a different approach, or break the task into smaller pieces. Limits should be configured to allow legitimate work while preventing runaway consumption. Start with conservative limits and adjust based on actual usage patterns. Monitor limit violations to identify whether limits are too tight for legitimate use cases or whether code is systematically exceeding appropriate bounds.
Can I let users control what libraries are available in the sandbox?
User-controlled libraries create security risk. Libraries might have vulnerabilities, might provide unexpected capabilities, or might be maliciously crafted. The safest approach is providing a fixed set of pre-approved libraries that cover common needs - data science, file processing, utilities. If user-requested libraries are necessary, consider a curation process where requested libraries are reviewed before being added to the approved set. Never allow arbitrary user-specified code to install arbitrary packages. The tradeoff between flexibility and security strongly favors a curated library set for production deployments.