
Hacker News · Feb 27, 2026 · Collected from RSS
Article URL: https://www.shayon.dev/post/2026/52/lets-discuss-sandbox-isolation/ Comments URL: https://news.ycombinator.com/item?id=47184049 Points: 4 # Comments: 0
There is a lot of energy right now around sandboxing untrusted code. AI agents generating and executing code, multi-tenant platforms running customer scripts, RL training pipelines evaluating model outputs—basically, you have code you did not write, and you need to run it without letting it compromise the host, other tenants, or itself in unexpected ways.The word “isolation” gets used loosely. A Docker container is “isolated.” A microVM is “isolated.” A WebAssembly module is “isolated.” But these are fundamentally different things, with different boundaries, different attack surfaces, and different failure modes. I wanted to write down my learnings on what each layer actually provides, because I think the distinctions matter and allow you to make informed decisions for the problems you are looking to solve.When any code runs on Linux, it interacts with the hardware through the kernel via system calls. The Linux kernel exposes roughly 340 syscalls, and the kernel implementation is tens of millions of lines of C code. Every syscall is an entry point into that codebase.Untrusted Code ─( Syscall )─→ Host Kernel ─( Hardware API )─→ Hardware [ 40M LOC C ] Every isolation technique is answering the same question of how to reduce or eliminate the untrusted code’s access to that massive attack surface.A useful mental model here is shared state versus dedicated state. Because standard containers share the host kernel, they also share its internal data structures like the TCP/IP stack, the Virtual File System caches, and the memory allocators. A vulnerability in parsing a malformed TCP packet in the kernel affects every container on that host. Stronger isolation models push this complex state up into the sandbox, exposing only simple, low-level interfaces to the host, like raw block I/O or a handful of syscalls.The approaches differ in where they draw the boundary. Namespaces use the same kernel but restrict visibility. Seccomp uses the same kernel but restricts the allowed syscall set. Projects like gVisor use a completely separate user-space kernel and make minimal host syscalls. MicroVMs provide a dedicated guest kernel and a hardware-enforced boundary. Finally, WebAssembly provides no kernel access at all, relying instead on explicit capability imports. Each step is a qualitatively different boundary, not just a stronger version of the same thing.Namespaces as visibility wallsLinux namespaces wrap global system resources so that processes appear to have their own isolated instance. There are eight types, and each isolates a specific resource.NamespaceWhat it isolatesWhat the process seesPIDProcess IDsOwn process tree, starts at PID 1MountFilesystem mount pointsOwn mount table, can have different rootNetworkNetwork interfaces, routingOwn interfaces, IP addresses, portsUserUID/GID mappingCan be root inside, nobody outsideUTSHostnameOwn hostnameIPCSysV IPC, POSIX message queuesOwn shared memory, semaphoresCgroupCgroup root directoryOwn cgroup hierarchyTimeSystem clocks (monotonic, boot)Own system uptime and clock offsetsNamespaces are what Docker containers use. When you run a container, it gets its own PID namespace (cannot see host processes), its own mount namespace (own filesystem view), its own network namespace (own interfaces), and so on.The critical thing to understand is namespaces are visibility walls, not security boundaries. They prevent a process from seeing things outside its namespace. They do not prevent a process from exploiting the kernel that implements the namespace. The process still makes syscalls to the same host kernel. If there is a bug in the kernel’s handling of any syscall, the namespace boundary does not help.In January 2024, CVE-2024-21626 showed that a file descriptor leak in runc (the standard container runtime) allowed containers to access the host filesystem. The container’s mount namespace was intact — the escape happened through a leaked fd that runc failed to close before handing control to the container. In 2025, three more runc CVEs (CVE-2025-31133, CVE-2025-52565, CVE-2025-52881) demonstrated mount race conditions that allowed writing to protected host paths from inside containers.Cgroups: accounting is not securityCgroups (control groups) limit and account for resource usage: CPU, memory, disk I/O, number of processes. They prevent a container from consuming all available memory or spinning up thousands of processes.Cgroups are important for stability, but they are not a security boundary. They prevent denial-of-service, not escape. A process constrained by cgroups still makes syscalls to the same kernel with the same attack surface.Seccomp-BPF as a filterSeccomp-BPF lets you attach a Berkeley Packet Filter program that decides which syscalls a process is allowed to make. You can deny dangerous syscalls like process tracing, filesystem manipulation, kernel extension loading, and performance monitoring.Docker applies a default seccomp profile that blocks around 40 to 50 syscalls. This meaningfully reduces the attack surface. But the key limitation is that seccomp is a filter on the same kernel. The syscalls you allow still enter the host kernel’s code paths. If there is a vulnerability in the write implementation, or in the network stack, or in any allowed syscall path, seccomp does not help.Without Seccomp: Untrusted Code ─( ~340 syscalls )─→ Host Kernel With Seccomp: Untrusted Code ─( ~300 syscalls )─→ Host Kernel The attack surface is smaller. The boundary is the same.Running a container in privileged modeThis is worth calling out because it comes up surprisingly often. Some isolation approaches require Docker’s privileged flag. For example, building a custom sandbox that uses nested PID namespaces inside a container often leads developers to use privileged mode, because mounting a new /proc filesystem for the nested sandbox requires the CAP_SYS_ADMIN capability (unless you also use user namespaces).If you enable --privileged just to get CAP_SYS_ADMIN for nested process isolation, you have added one layer (nested process visibility) while removing several others (seccomp, all capability restrictions, device isolation). The net effect is arguably weaker isolation than a standard unprivileged container. This is a real trade-off that shows up in production. The ideal solutions are either to grant only the specific capability needed instead of all of them, or to use a different isolation approach entirely that does not require host-level privileges.gVisor and user-space kernelsgVisor is where the isolation model changes qualitatively. To understand the difference, it helps to look at the attack surface of a standard container.Standard Container (Docker) ┌───────────────────────┐ │ Untrusted Code │ └──────────┬────────────┘ │ ~340 syscalls ▼ [ Seccomp Filter ] │ ~300 allowed syscalls ▼ ┌───────────────────────┐ │ Host Kernel (Ring 0) │ ◄── FULL ATTACK SURFACE └───────────────────────┘ The code runs as a standard Linux process. Seccomp acts as a strict allowlist filter, reducing the set of permitted system calls. However, any allowed syscall still executes directly against the shared host kernel. Once a syscall is permitted, the kernel code processing that request is the exact same code used by the host and every other container. The failure mode here is that a vulnerability in an allowed syscall lets the code compromise the host kernel, bypassing the namespace boundaries.Instead of filtering syscalls to the host kernel, gVisor interposes a completely separate kernel implementation called the Sentry between the untrusted code and the host. The Sentry does not access the host filesystem directly; instead, a separate process called the Gofer handles file operations on the Sentry’s behalf, communicating over a restricted protocol. This means even the Sentry’s own file access is mediated.gVisor ┌───────────────────────┐ │ Untrusted Code │ └──────────┬────────────┘ │ ~340 syscalls ▼ ┌───────────────────────┐ │ gVisor Sentry (Ring 3)│ ◄── USER-SPACE KERNEL └──────┬────────┬───────┘ │ │ 9P / LISAFS │ ▼ │ ┌───────────┐ │ │ Gofer │ ◄── FILE I/O PROXY │ └─────┬─────┘ │ │ ▼ ▼ ┌───────────────────────┐ │ Host Kernel (Ring 0) │ ◄── REDUCED ATTACK SURFACE └───────────────────────┘ (~70 host syscalls from Sentry) The Sentry intercepts the untrusted code’s syscalls and handles them in user-space. It reimplements around 200 Linux syscalls in Go, which is enough to run most applications. When the Sentry actually needs to interact with the host to read a file, it makes its own highly restricted set of roughly 70 host syscalls. This is not just a smaller filter on the same surface; it is a completely different surface. The failure mode changes significantly. An attacker must first find a bug in gVisor’s Go implementation of a syscall to compromise the Sentry process, and then find a way to escape from the Sentry to the host using only those limited host syscalls.The Sentry intercepts syscalls using one of several mechanisms, such as seccomp traps or KVM, with the default since 2023 being the seccomp-trap approach known as systrap.What this means in practice is that if someone discovers a bug in the Linux kernel’s I/O implementation, containers using Docker are directly exposed. A gVisor sandbox is not, because those syscalls are handled by the Sentry, and the Sentry does not expose them to the host kernel.The trade-off is performance. Every syscall goes through user-space interception, which adds overhead. I/O-heavy workloads feel this the most. For short-lived code execution like scripts and tests, it is usually fine, but for sustained high-throughput I/O, it can matter.Also, by adopting gVisor, you are betting that it’s easier to audit and maintain a smaller footprint of code (the Sentry and its limited host interactions) than to secure the entire massive Linux kernel surface against untrusted execution. That bet is not free of risk, gVisor itself has had security vulnerabilities in the Sentry but th