SUBJECT: Coding agent sandboxes: what actually works
PREVIEW: Most devs reach for subprocess or Docker. Both are wrong. Here's the architecture that works.
TITLE: Freestyle Sandboxes for Coding Agents — What Devs Actually Need to Know
The Subprocess Trap — Where Every Agent Builder Starts (and Gets Stuck)
You're building a coding agent. It needs to run code. You write subprocess.run(agent_output, shell=True) and move on.
That's the trap. Not slightly wrong — architecturally wrong in ways that compound the moment your agent tries to do anything real.
The "freestyle sandbox" problem is one of the most underspecified problems in the agent ecosystem right now. Everyone's shipping coding agents. Devin runs code. Claude Code runs code. SWE-agent runs code. OpenHands runs code. But the sandboxing underneath varies wildly — from outright dangerous to "works until it doesn't" to genuinely production-ready.
This week it landed on the Hacker News front page. Good. Because most devs building agents haven't thought through what their agent actually does to a system — and what it needs in return.
Here's the practical breakdown. No evangelism. Just the tradeoffs.
What Coding Agents Actually Do to a Filesystem
A "coding agent" is not a code interpreter. That distinction matters.
A code interpreter runs a snippet and returns output. A coding agent:
- installs packages (pip install, npm install, cargo add)
- reads and writes files across a repository
- runs test suites, builds, linters
- makes git commits and creates branches
- spawns subprocesses (compilers, formatters, test runners)
- sometimes makes HTTP requests — to clone repos, fetch docs, call APIs
- fails mid-task and needs to recover
That's not a REPL. That's a developer machine. And when you try to run a coding agent in a REPL, you find out fast.
The moment your agent runs pip install in a shared Python environment, it can break the next agent run. The moment it touches /tmp without isolation, state leaks across sessions. The moment it spawns npm install inside a container without a volume mount, you've just paid for 200MB of re-download on every single invocation.
The real question isn't "how do I run code safely" — it's "how do I give an agent a full developer environment that's isolated, fast, and reproducible."
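A minimal mitigation for the shared-interpreter half of that problem is to give every agent run its own virtual environment. A sketch using only the standard library (the naming and layout are illustrative, not any provider's API):

```python
import subprocess
import tempfile
import venv
from pathlib import Path

def fresh_env(run_id: str) -> Path:
    """Create an isolated venv for one agent run; the caller discards it afterwards."""
    root = Path(tempfile.mkdtemp(prefix=f"agent-{run_id}-")).resolve()
    venv.EnvBuilder(with_pip=False).create(root / "venv")  # skip pip here for speed
    return root / "venv"

env = fresh_env("task-a")
python = env / "bin" / "python"  # env / "Scripts" / "python.exe" on Windows
out = subprocess.run(
    [str(python), "-c", "import sys; print(sys.prefix)"],
    capture_output=True, text=True, check=True,
)
# The subprocess reports the venv as its prefix, not the shared interpreter,
# so pip installs from this run can never leak into the next one.
```

Per-run venvs fix Python package bleed but nothing else; filesystem and process isolation still need a real sandbox underneath.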
The Sandbox Spectrum — Four Levels, Four Different Problems
There's no single "sandbox." There's a spectrum, and where you sit determines what your agent can do and what breaks.
| Level | Examples | Isolation | Cold Start | Full OS? | Cost |
|---|---|---|---|---|---|
| Process/REPL | exec(), Jupyter kernel | None | <50ms | No | Free |
| Container | Docker, Podman | Moderate | 2-30s | Partial | Low |
| microVM | Fly Machines, Firecracker, Lambda | Strong | 100-500ms | Yes | Medium |
| Full VM | EC2, GCP Compute, Hetzner | Complete | 30-120s | Yes | High |
Each level exists because the previous one hit a wall.
Process/REPL dies when agents need package installs that outlive the session, or when two agents run concurrently and clobber each other's environment.
Containers die when cold start latency makes agent UX unbearable (Docker image pull alone can be 15s on a cold host), when agents need privileged operations, or when container state from the previous run poisons the next one.
microVMs are where the frontier is. Firecracker (AWS's open-source microVM hypervisor, used by Lambda and Fly.io) can boot a full Linux VM in under 200ms. That's the number that changes what's possible. Below 500ms cold start, users don't notice. Above it, they do.
Full VMs are for long-running agent sessions — the "autonomous agent working for hours on a feature branch" use case. The latency is acceptable when the session is measured in minutes, not milliseconds.
Cold Start: The Hidden Agent Killer
Cold start is the thing that kills agent UX before users can articulate why.
Here's what happens in a naive Docker-based agent setup:
```bash
# What your infrastructure does on every agent invocation:
docker pull your-agent-env:latest              # 2-8s if not cached, 0ms if cached
docker run --rm your-agent-env python agent.py # 500ms-2s startup
# Agent waits for tool calls...
# User already hit refresh.
```
With warm containers and pre-pulled images on a dedicated host, you get that down to 300-800ms. Still noticeable. Still makes your agent feel sluggish.
Firecracker + pre-warmed snapshots is a different world:
```
snapshot restore → 80ms
kernel boot      → (skipped — restored from memory snapshot)
agent ready      → 80ms total
```
E2B uses this pattern. They maintain a pool of pre-warmed sandbox VMs, each snapshotted at the "ready for work" state. Your agent gets an environment in under 300ms because the VM is already running — it's restored from a snapshot, not booted fresh.
Modal uses a similar trick with container snapshots and a persistent warm pool. The difference: Modal snaps at the container level (no full kernel boot), E2B snaps at the VM level (stronger isolation, slower restore).
The practical implication: if you're building an agent that responds to user requests synchronously, cold start above 500ms will hurt your conversion. Pre-warm or use a managed service. If you're running batch/async agents, it matters less.
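The pre-warm pattern itself is simple to sketch: keep a pool of ready sandboxes, hand one out instantly, and replenish in the background. A generic version with a stand-in `create_sandbox` factory (any real provider SDK slots into that callable; this is not E2B's or Modal's internal design):

```python
import queue
import threading

class WarmPool:
    """Keep `size` sandboxes pre-created so acquire() never pays cold start."""

    def __init__(self, create_sandbox, size: int = 4):
        self._create = create_sandbox
        self._pool: queue.Queue = queue.Queue()
        for _ in range(size):
            self._pool.put(create_sandbox())  # pay cold start up front, off the hot path

    def acquire(self):
        sandbox = self._pool.get()  # instant when the pool is non-empty
        # Refill asynchronously so the next caller also gets a warm one
        threading.Thread(
            target=lambda: self._pool.put(self._create()), daemon=True
        ).start()
        return sandbox

# Stand-in factory; replace with your provider's sandbox constructor
counter = iter(range(1000))
pool = WarmPool(lambda: f"sandbox-{next(counter)}", size=2)
sbx = pool.acquire()  # returns immediately from the warm pool
```

The pool trades idle cost for latency, which is exactly the trade the managed providers make on your behalf.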
State Bleed: The Silent Correctness Bug
State bleed is harder to spot than a cold start. It doesn't crash your agent — it makes it wrong in ways you don't notice immediately.
Consider this: your agent runs on task-A, installs requests==2.28.0 because a test demanded it, then gets scheduled on task-B which needed requests==2.31.0. Task B passes its tests — on the wrong version. You shipped a bug.
This is not hypothetical. It happened to us running Stagehand pipeline stages sequentially on a shared host — stages were writing to ~/.stagehand/active/ atomically using file locks, but the Python environment underneath was shared:
```python
# checkpoint.py — what we got right
import fcntl
from contextlib import contextmanager

@contextmanager
def _lock(pipeline_id: str):
    """Acquire an exclusive file lock for this pipeline_id."""
    lock_path = _lock_path(pipeline_id)  # helper defined elsewhere in checkpoint.py
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    with open(lock_path, "w") as lf:
        try:
            fcntl.flock(lf, fcntl.LOCK_EX)
            yield
        finally:
            fcntl.flock(lf, fcntl.LOCK_UN)
```
```python
# what we got wrong — no environment isolation between stages
# stage A ran: pip install pillow==9.5.0
# stage B assumed system pillow — got 9.5.0 instead of 10.x
# tests passed, wrong output, silent failure
```
The fix isn't a better lock — it's a fresh environment per stage. Every stage in a pipeline should boot into a clean state. The checkpoint file handles data continuity. The sandbox handles environment continuity.
Two patterns that actually work:
Pattern 1: Ephemeral sandbox per invocation. Each agent call gets a clean VM or container. State is passed explicitly via structured output, not implicit via filesystem. Expensive but correct.
Pattern 2: Layered filesystem with copy-on-write. Base image is read-only. Agent writes land on a writable overlay layer. After the task, overlay is discarded. Docker's overlay2 driver does this — but the base layer still needs to be disciplined (no mutable global state at build time).
Pattern 1 is safer. Pattern 2 is faster. Most production agent platforms use Pattern 2 with careful base image design.
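Pattern 1 reduces to a context manager: a throwaway workspace per invocation, with results passed back as structured output rather than left on disk. A minimal sketch (names are illustrative):

```python
import json
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def ephemeral_workspace():
    """Fresh directory per agent call; everything inside is destroyed on exit."""
    workdir = Path(tempfile.mkdtemp(prefix="agent-task-"))
    try:
        yield workdir
    finally:
        shutil.rmtree(workdir, ignore_errors=True)  # nothing survives the task

def run_task(payload: dict) -> dict:
    with ephemeral_workspace() as ws:
        # The agent writes freely inside ws...
        (ws / "result.json").write_text(json.dumps({"ok": True, "input": payload}))
        # ...but state leaves only as explicit structured output:
        return json.loads((ws / "result.json").read_text())

result = run_task({"task": "demo"})
# By the time the caller sees `result`, the workspace directory is already gone.
```

The same shape holds whether the workspace is a tempdir, a container, or a microVM; only the cost of "fresh" changes.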
The Filesystem Problem: Ephemeral vs. Persistent State
Here's the design decision that trips up most agent builders: what should survive between runs?
Answer: almost nothing. But "almost" is doing a lot of work there.
Things that should be ephemeral:
- installed packages (always rebuild from a lockfile)
- scratch files, temp dirs
- build artifacts
- agent's working directory
Things that should persist:
- the repository being worked on (if the agent is doing multi-session work)
- the agent's final output (commits, generated files, reports)
- credentials and tokens (mounted at runtime, never baked in)
The SE-Workflow pattern handles this well: the agent's workspace is the GitHub repo itself. The agent clones, works, commits, and PRs. The sandbox is fully ephemeral. Persistence is Git.
```
# SE-Workflow agent execution model
# 1. Spawn fresh container/VM
# 2. Clone repo (clean state guaranteed by git, not by sandbox)
# 3. Agent works, commits, pushes
# 4. Container dies — no state to clean up
```
The insight: Git is your persistence layer. Treat it that way. Don't fight your sandbox trying to preserve filesystem state across runs — commit it instead.
For agents that can't commit every intermediate state (e.g., multi-hour research tasks), you need explicit snapshot/restore. E2B supports pausing and resuming sandboxes. Fly Machines can be suspended and restarted with full memory state. These are the right primitives for long-running agents.
Network: The Access Control Layer Nobody Configures
Most developers leave network access completely open in their agent sandboxes. Full egress, unrestricted. That's a mistake with three failure modes.
Failure mode 1: Data exfiltration. A compromised agent prompt (prompt injection from a malicious webpage the agent browsed) can exfiltrate environment variables, repo contents, or credentials via HTTP. If your sandbox has unrestricted egress, there's no containment.
Failure mode 2: Cost explosion. An agent with a bug in a loop that makes API calls will drain your budget before you notice. Egress rate limiting at the network level — not just in your agent code — is the backstop.
Failure mode 3: Non-determinism. Agents that call external APIs mid-task produce different results based on what those APIs return. For reproducible evaluations, air-gapped sandboxes (or recorded/replayed network responses) are the only path.
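The network-level backstop from failure mode 2 is, at its core, a token bucket: each request spends a token, tokens refill at a fixed rate, and an agent stuck in a loop runs dry instead of running up a bill. A sketch of the mechanism (this is the algorithm, not any provider's implementation):

```python
import time

class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # egress denied: the agent loop hits this wall, not your bill

bucket = TokenBucket(rate=1.0, capacity=3)
burst = [bucket.allow() for _ in range(5)]  # first 3 pass, then the bucket is empty
```

Enforcing this in the sandbox's network layer, rather than in agent code, is what makes it a backstop: a buggy or compromised agent cannot opt out.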
The right defaults by use case:
| Agent type | Network policy |
|---|---|
| Coding tasks (no web research) | Whitelist: GitHub, PyPI, npm, crates.io |
| Research agents | Full egress, rate-limited, logged |
| Evaluation/benchmarking | Air-gapped or fully recorded |
| Long-running autonomous | Egress allowed, no ingress |
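The whitelist policy for coding tasks reduces to a hostname check at the egress proxy. A sketch of that check (the allowlist entries are illustrative; note that pip also fetches wheels from files.pythonhosted.org, and the subdomain handling is the part people get wrong):

```python
from urllib.parse import urlparse

ALLOWED = {
    "github.com",
    "pypi.org",
    "files.pythonhosted.org",
    "registry.npmjs.org",
    "crates.io",
}

def egress_allowed(url: str) -> bool:
    """Permit a request only if its host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    # Match exact domain or "*.domain"; never a bare substring, which would let
    # pypi.org.evil.com through.
    return any(host == d or host.endswith("." + d) for d in ALLOWED)

egress_allowed("https://pypi.org/simple/requests/")   # allowed
egress_allowed("https://api.github.com/repos")        # subdomain, allowed
egress_allowed("https://pypi.org.evil.com/exfil")     # suffix trick, blocked
```

In production this check lives in the proxy or firewall the sandbox routes through, not in the agent's own process.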
gVisor (Google's container sandbox runtime) gives you syscall interception — you can see every network call the agent makes. Slower than runc (~15% overhead on compute-heavy workloads), but the audit trail is worth it for high-stakes agent workflows.
Choosing Your Stack — The Decision Table
Stop building your own. The infrastructure work is real and the failure modes are non-obvious. Here's the honest matrix:
E2B — best for: coding agents that need a full, clean Linux environment fast; Jupyter-compatible workflows; teams that want managed infra. Firecracker-based microVMs, sub-300ms cold start, Python and JS SDKs. ~$0.000111 per sandbox-second. Cap at ~1000 concurrent sandboxes on growth tier.
```python
from e2b_code_interpreter import Sandbox

with Sandbox() as sbx:
    # Full Ubuntu VM, packages pre-installed, internet access
    sbx.files.write("/repo/main.py", agent_generated_code)
    result = sbx.commands.run("python /repo/main.py")
    output = result.stdout
```
Modal — best for: batch agent workloads, parallel scaling, teams already on Modal for other compute. Container-based (not VM), faster cold start on warm containers, worse isolation than microVMs. Generous free tier. Better for high-volume async than interactive agent use.
```python
import modal

app = modal.App("agent-tasks")

@app.function(image=modal.Image.debian_slim().pip_install("pytest"))
def run_agent_task(code: str, test_suite: str) -> str:
    # Fresh container per call, no state bleed
    # Write code, run tests, return output
    ...
```
Self-hosted Docker + gVisor — best for: teams with compliance requirements, on-prem deployments, full control over egress. Higher ops burden. Cold start of 500ms-2s depending on image size. Use if E2B/Modal's data residency doesn't meet your requirements.
Fly Machines — best for: agents that need to be geographically distributed, or that need fast HTTP-accessible endpoints for incoming webhooks. Firecracker-based, ~300ms cold start, suspend/resume supported. More networking flexibility than E2B.
Daytona — best for: teams building developer workspace products (not just task-specific agents). CDE (Cloud Development Environment) model — full persistent workspace per developer/agent. Heavier than what most agent tasks need, right for IDE-integrated coding agents.
Where This Falls Short
Let's be honest about the current state.
Snapshot restore isn't magic. E2B's sub-300ms cold start is real for standard environments. The moment you need a custom base image (CUDA, unusual system deps, large model weights), you're either waiting for custom snapshot builds or you're back to 2-5s cold starts. The benchmark is best-case.
Parallel agent scaling hits account limits fast. Running 50 concurrent agent sandboxes for an evaluation suite? You'll hit E2B's default concurrency limits and need to contact sales. Modal handles parallelism better — it auto-scales containers — but you're trading isolation strength.
Persistent filesystem across runs is genuinely unsolved at scale. Every solution has a tradeoff. Git-as-persistence works for code but not for large binary assets, ML models, or datasets. Network filesystems (EFS, NFS) introduce latency and single points of failure. No provider has nailed cheap, fast, ephemeral-but-optionally-persistent storage for agents yet.
Egress logging is incomplete. The providers that offer network egress control (mostly self-hosted options) don't offer clean attribution — you can see that a process made a request, not which agent call caused it in a multi-step pipeline. Observability at the agent level, through the sandbox, is still duct tape and luck.
Cost accounting is opaque. Your per-sandbox-second cost looks fine in a spreadsheet. The real cost includes the data transfer, the API calls your agent makes from within the sandbox, and the engineering time debugging sandbox-specific failures. Budget 2x what the calculator says until you have 30 days of production data.
The Forward View: What Changes Next
The current generation of sandbox providers were built for short-lived, interactive code execution tasks. What's coming is agents that run for hours, manage long-horizon tasks, and need richer observability hooks — not just stdout/stderr, but structured tool call traces, resource consumption by stage, and the ability to pause/inspect/resume at arbitrary points.
The platforms that add those primitives — debuggable agent execution, not just sandboxed execution — will own the next wave of agent infrastructure.
What to Know Monday
Sandboxes for coding agents aren't an ops detail. They're a product decision.
Your sandbox choice determines what your agent can do, how fast it responds, whether it leaks state, and how much you spend at scale. Getting it right early means not refactoring your entire agent architecture when you hit the wall.
The synthesis: use a managed microVM provider (E2B or Fly Machines) for interactive coding agents, Modal for batch/parallel workloads, and Git as your persistence layer across all of them. Don't build your own unless you have specific compliance or infra requirements that make the managed options impossible.
If you're building a coding agent today: spin up a single E2B sandbox, run your agent's typical task end-to-end inside it, and measure the cold start latency you actually observe. That number will tell you everything.
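Measuring that number is a thin timing wrapper around whatever sandbox constructor you use. A generic harness with a stand-in factory (swap in your real constructor, e.g. a lambda around E2B's `Sandbox()`; the sleep below just simulates a boot):

```python
import statistics
import time

def measure_cold_start(create_sandbox, runs: int = 5) -> dict:
    """Time sandbox creation; report median and worst case in milliseconds."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        create_sandbox()  # the call whose latency your users actually feel
        samples.append((time.perf_counter() - t0) * 1000)
    return {"median_ms": statistics.median(samples), "worst_ms": max(samples)}

# Stand-in factory simulating a ~20ms boot; replace with your provider's constructor
report = measure_cold_start(lambda: time.sleep(0.02), runs=3)
```

Run it at different times of day: warm-pool depth varies with provider load, and the worst case is what your unluckiest user gets.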
If this was useful, paid subscribers get this depth 2-3x per week — architecture deep dives, build logs, and the failure modes nobody writes about.
[Subscribe — $8/month or $80/year]
Next week: How to build an agent evaluation harness that doesn't lie to you — the gap between "passed the benchmark" and "works in production."
What are you using for agent sandboxing right now? I'm collecting real production data — reply and I'll aggregate it in a follow-up.