When two users exec into the same pod simultaneously, there is no way to know who ran a specific command. The first post in this series established this gap and proposed two architectural directions to close it. Neither was evaluated for feasibility. Time to find out.

Two Directions, One Problem

The core problem is that the control plane and the runtime plane are independent. The API server knows who made the request. The runtime knows what process ran. Neither plane speaks the other’s language. Closing the gap requires a shared attribute, something that exists in both planes and can be used to join them. The two directions propose different answers to what that attribute should be.

  1. Push the audit requestId down to the runtime layer. Every kubectl exec request generates an audit event with a unique requestId already present in the Kubernetes audit log, tied to the authenticated caller. The proposal is to propagate this existing identifier through kubelet to the container runtime, making it available to security tooling at exec time. The identity never leaves the API server. The requestId is the bridge between the two planes. Security tooling on the node correlates it back to the caller’s identity via the audit log.

  2. Pull process information up to the audit layer. Have the runtime surface the host PID of the spawned process back up to the API server, embedding it in the audit log. If your runtime agent logs that PID 1236 with parent PID 1234 ran cat /etc/shadow, and your audit log shows alice opened an exec session whose process spawned as PID 1234, you have direct attribution. No timestamps. No overlap problem.

The PID That Does Not Exist Yet

Pulling the PID information up seems like the simpler of the two options, so it is worth evaluating first. The findings, however, are disappointing. The reason lies in the exec mechanism between the kubelet and the CRI. When an ExecRequest comes in, the container runtime creates a streaming endpoint and returns its URL as part of the ExecResponse. The URL is the only field in the ExecResponse protobuf.

message ExecResponse {
    // Fully qualified URL of the exec streaming server.
    string url = 1;
}

Source: k8s.io/cri-api/pkg/apis/runtime/v1

This gets propagated back up to the API server and recorded in the audit log. The process has not been spawned yet at this point. It is created only when the kubelet connects to that streaming endpoint.

The containerd source confirms this. The CRI gRPC Exec handler in internal/cri/server/container_exec.go does exactly one thing: it delegates to streamServer.GetExec, which registers a token and returns the streaming URL:

// internal/cri/server/container_exec.go
func (c *criService) Exec(ctx context.Context, r *runtime.ExecRequest) (*runtime.ExecResponse, error) {
    ...
    return c.streamServer.GetExec(r) // delegates to streaming server, no process spawned here
}

// vendor/k8s.io/cri-streaming/pkg/streaming/server.go
func (s *server) GetExec(req *runtimeapi.ExecRequest) (*runtimeapi.ExecResponse, error) {
    ...
    return &runtimeapi.ExecResponse{
        Url: s.buildURL("exec", token), // only a URL is returned, the process does not exist yet
    }, nil
}

Source: containerd/internal/cri/server, containerd/vendor/k8s.io/cri-streaming/pkg/streaming

No process is spawned. The function returns before any exec happens. The process is created only when the kubelet connects to that URL. For a detailed walkthrough of the full streaming mechanism, see CRI Streaming Explained.

The host PID does not exist when ExecResponse is returned. It cannot be included in the audit log entry at ResponseStarted time because the process has not been created yet. Direction 2 asks the audit pipeline to transport something that does not exist at the moment of writing.

One might ask: what if the runtime made an out of band call to the API server after the process is created, amending the audit record with the host PID? The problem is threefold.

  1. The ResponseStarted audit event has already been written before the process exists and there is no amendment mechanism in the Kubernetes audit pipeline.
  2. An asynchronous callback introduces a race condition where the security agent on the node sees the execve before the amended audit record arrives.
  3. The callback itself requires a new protocol between the runtime and kubelet that does not exist today, making this approach more complex than the problem it was supposed to simplify.

Even if it were hypothetically possible, two more problems remain.

  1. PIDs can be reused within minutes on high volume systems. A PID that belonged to alice’s exec session at 09:14:07 may belong to an entirely different process by the time you query it. You cannot rely on a PID as a stable attribution key across time.
  2. A kubectl exec session typically spawns a shell, which spawns further sub-processes. If attribution is anchored to the initial host PID, it breaks the moment the user runs a command inside that shell. Recovering it requires expensive backtracking through parent PID chains, which is exactly the problem this direction was supposed to avoid.

Direction 2 eliminates itself. The PID does not exist when it is needed.

The Ray of Hope: Pushing the requestId Down

Now we need a mechanism to carry the requestId down to the runtime layer. The success criteria are straightforward. The attribution must be per process, and child processes must be able to inherit it. This leads us directly to environment variables. They are scoped to a process and inherited by child processes by default.

The mechanism already exists. The Security Profiles Operator (SPO), a Kubernetes operator that runs as a mutating webhook, intercepts the exec request at the API server, reads the requestId already present in the Kubernetes audit log, and prepends env K8S_REQUEST_ID=<id> to the command. What was kubectl exec pod -- bash becomes kubectl exec pod -- env K8S_REQUEST_ID=<id> bash. The requestId travels down with the process. No protocol changes. No new fields. No kernel dependency.
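The mutation itself is a one-line rewrite of the exec command. A minimal sketch (the function name and wiring are illustrative, not SPO's actual code):

```go
package main

import "fmt"

// injectRequestID prepends the env(1) wrapper that smuggles the audit
// requestId into the exec'd process's environment. The original command
// becomes a child of env and inherits the variable.
func injectRequestID(cmd []string, requestID string) []string {
	return append([]string{"env", "K8S_REQUEST_ID=" + requestID}, cmd...)
}

func main() {
	// ["bash"] becomes ["env", "K8S_REQUEST_ID=<id>", "bash"].
	fmt.Println(injectRequestID([]string{"bash"}, "28dd3dae-85b1-4bd4-9c60-61f6eee31f47"))
}
```

Because child processes inherit the environment by default, every command the user runs inside the resulting shell carries the same requestId.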

But working for the common case is not enough. A robust attribution mechanism must also satisfy these properties.

  1. The requestId must not be forgeable or erasable by an adversary. Environment variables can be unset or overwritten by any process inside the container. A malicious workload can forge K8S_REQUEST_ID before spawning a child process, attributing arbitrary commands to a different user or to a non-existent session.
  2. The artifact must be available for the entire attribution window, from before the first userspace instruction executes until security tooling has consumed it. A mechanism that arrives late or disappears before tooling reads it fails attribution at either end.
  3. The requestId must be independently scoped per exec session. Two concurrent exec sessions into the same pod must carry different requestIds. A shared or overwritable value breaks attribution the moment sessions overlap, which is exactly the scenario this work is trying to solve.
  4. The solution must not require a kernel patch. Changes to the upstream Linux kernel take years to reach production clusters. Any mechanism that requires a new kernel interface is not viable for near-term deployment.

Environment variables satisfy property 3, trivially satisfy property 4, and partially satisfy property 2. They fail on property 1, which is non-negotiable from a security standpoint. This means environment variables are a viable transport mechanism but not a reliable storage mechanism. We need something else to hold the requestId once it arrives.

The Limits of Tamper Resistance

Tamper resistance is what guarantees that an attacker cannot forge or destroy the requestId to evade attribution. But we need to be realistic about what these guarantees can and cannot cover. To do that, we define three layers of attacker capability, each more powerful than the last, and ask whether the guarantee holds at each one.

  1. An attacker has access to an unprivileged container with no CAP_SYS_ADMIN and no access to the host filesystem.
  2. An attacker has access to a privileged container with all Linux capabilities.
  3. An attacker compromises the node itself, gaining control of the container runtime and kubelet.

Layer 3 is the trust boundary ceiling. Any mechanism that relies on the container runtime to deliver the requestId cannot survive a compromised runtime. The runtime controls the write path. If it is malicious, it can inject a forged requestId or omit one entirely, and the storage layer will faithfully protect the wrong value. This is not a solvable problem because the node is the root of trust. A compromised node means a compromised runtime, a compromised kubelet, and any security tooling running on the node. There is no higher authority to appeal to. It must be acknowledged explicitly and accepted as the ceiling.

The goal is a solution that holds at layers 1 and 2. Layer 3 is out of scope by definition.

Storage, Attribution and the Search for a Complete Solution

Several candidate mechanisms partially satisfy the properties and hold against some of the threats modelled. The question is whether any of them ticks all the boxes.

The solutions fall into two broad categories: those that travel with the process (like an environment variable), and those that live outside it (like a database or socket). Each category has a different failure mode.

External Store Solutions

Cgroup File

All container processes are attached to cgroups for resource constraints. The requestId can be written to a file in the container’s cgroup directory on the host, and security tooling can read it from there. The problem is that this is not kernel enforcement. The tamper resistance guarantee is mount namespace isolation only. The host cgroup filesystem is outside the container’s mount namespace by default, but a privileged container with bind mount capabilities can potentially reach it.

More fundamentally, the cgroup is shared across all processes inside the container. Two concurrent exec sessions write to the same file. There is no way to distinguish alice’s requestId from bob’s. Per-session scoping fails immediately. Child cgroups per exec session seem like a fix but they require non-trivial containerd changes to create and manage a new cgroup subtree per exec call. The fundamental problem remains unchanged. The tamper resistance guarantee is still mount namespace isolation, not kernel enforcement.

No matter how the subtree is structured, mount namespace isolation is not kernel enforcement. A motivated adversary with a privileged container will find a way through. The cgroup file fails on tamper resistance at layer 2 and on per-session scoping.

Out of Band Mapping

The requestId could live in an external store outside the process context entirely, fetched by security tooling when needed. A Unix domain socket, an HTTP service, a database. The delivery mechanism does not matter much. The idea is simple: maintain a mapping somewhere, and let tooling query it.

The problem is the attribution chain. Security tooling sees a process. To attribute it, it needs to resolve: process to session, then session to requestId. Two hops. The process ID is not known before the process spawns, so the external map cannot be pre-populated. You only have session information, which is a Kubernetes and CRI abstraction. It does not travel with the process by default. The only reliable way to deliver session information into the process is via an environment variable. Which means you still depend on env var injection, plus the overhead of an external lookup on top. You have not replaced the env var dependency. You have added complexity around it, and you are back to square one with extra steps.
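The circular dependency can be made concrete. In this sketch, the store layout and the EXEC_SESSION_ID variable name are hypothetical; the point is that the first hop only exists if a session identifier was already delivered by environment variable:

```go
package main

import "fmt"

// sessionStore is an out of band map maintained outside the process
// context: Unix socket, HTTP service, database -- the transport is
// irrelevant to the attribution chain.
type sessionStore struct {
	bySession map[string]string // sessionID -> requestId
}

// attribute resolves process -> session -> requestId. The first hop only
// works if the session ID travelled with the process, i.e. via an env var.
func (s *sessionStore) attribute(processEnv map[string]string) (string, bool) {
	sid, ok := processEnv["EXEC_SESSION_ID"] // hypothetical variable name
	if !ok {
		return "", false // no env var, no first hop: attribution dead-ends
	}
	id, ok := s.bySession[sid]
	return id, ok
}

func main() {
	store := &sessionStore{bySession: map[string]string{"sess-42": "req-alice"}}
	fmt.Println(store.attribute(map[string]string{"EXEC_SESSION_ID": "sess-42"}))
	fmt.Println(store.attribute(map[string]string{})) // env var stripped: no attribution
}
```

Strip the env var and the external store is unreachable, which is why this design never escapes the env var dependency.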

Process-Carried Solutions

Sealed memfd

A memfd is an anonymous file backed by memory. It has no path in the regular filesystem but is accessible as a file descriptor and discoverable via /proc/<pid>/fd/. The requestId can be written into a memfd and sealed using fcntl with write, shrink, and grow seals. Once sealed, the kernel prevents any further writes absolutely. No process, privileged or not, can modify the contents. This is genuine kernel enforcement, stronger than anything filesystem-based.

However, the kernel can only prevent writes. It cannot prevent a process from closing the file descriptor. If a malicious workload closes the fd, the requestId disappears from /proc/<pid>/fd/ and security tooling sees nothing. Write integrity holds but existence integrity does not. Additionally, this requires a runc change to create, seal, and pass the fd into the exec’d process. Without a corresponding OCI spec change, this would be a runc-specific behaviour with no guarantee of support in other runtimes.

Even if the implementation hurdle were cleared, the fd closure problem remains unsolved. The attribution window cannot be guaranteed. Sealed memfd fails on the attribution window requirement.

Kernel Keyring

The kernel keyring is a temporary credential store scoped to a session and inherited by child processes through fork and exec. The interface is keyctl. The requestId can be stored as a key in the session keyring with kernel-enforced permission bits. A non-privileged process cannot modify a key it does not own. This is genuine kernel enforcement and layer 1 holds.

But a privileged container with CAP_SYS_ADMIN can call keyctl directly and manipulate the keyring, revoking or replacing keys. Layer 2 fails. The lookup path is also not clean. Security tooling cannot query another process’s session keyring directly by PID. It has to parse /proc/<pid>/keys, find the key ID, and then call keyctl(KEYCTL_READ) to get the value. Not a single lookup. And if the process exits before tooling reads the key, the session keyring is destroyed and the requestId is gone with it. The attribution window cannot be guaranteed.

The kernel keyring fails on tamper resistance at layer 2, on the attribution window, and has no clean inline read path for security tooling.
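To make the awkward lookup path concrete, here is a sketch of step one alone: scraping /proc/<pid>/keys for the key serial before keyctl(KEYCTL_READ) can even be issued. The column layout is taken from the procfs documentation; the sample line and key description are illustrative:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseKeySerial extracts a key's serial number from one line of
// /proc/<pid>/keys, matching on key type and description prefix.
// Assumed columns: serial flags usage timeout perms uid gid type description.
func parseKeySerial(line, keyType, descPrefix string) (int64, bool) {
	f := strings.Fields(line)
	if len(f) < 9 || f[7] != keyType || !strings.HasPrefix(f[8], descPrefix) {
		return 0, false
	}
	serial, err := strconv.ParseInt(f[0], 16, 64) // serial is printed in hex
	return serial, err == nil
}

func main() {
	// Illustrative line; a real one comes from /proc/<pid>/keys.
	line := "3d3a462f I--Q---     1 perm 3f010000  1000  1000 user      k8s-request-id: 36"
	serial, ok := parseKeySerial(line, "user", "k8s-request-id")
	fmt.Println(serial, ok)
	// Step two would be keyctl(KEYCTL_READ, serial, ...) -- and if the
	// process has already exited, the session keyring is gone by then.
}
```

Two hops, text scraping in the middle, and a race against process exit: nothing like the single inline lookup attribution needs.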


The Grey Area

Every solution explored so far hits the same wall: write the artifact early enough to be present at exec time, and it tends to disappear before out of band tooling can read it. Write it persistently, and you introduce a race at the start.

This points to a different way of thinking about the problem. Rather than satisfying both conditions with a persistent artifact, what if the artifact were written early and the attribution captured inline, before the process even begins executing? Security tooling does not need to wait for the process to complete. It fires at the same kernel event where the process starts.

eBPF to the Rescue

This is exactly the kind of problem BPF specialises in. BPF programs are hooks that execute custom code inline with the system call flow. They fire inside the kernel event handler at execve time, before any userspace instruction runs. The requestId can be read from the environment variable at that exact moment and written into kernel-managed BPF storage. Nothing inside the container can interact with that storage. Only the BPF program that owns the map can read or write it.

This means environment variables can be used purely as a transport mechanism. The BPF program intercepts the exec event, copies the requestId from the env var into a BPF map or task storage, and the kernel takes over from there. For persistent sessions like an interactive bash shell, the same hooks log every command executed and attribute it using the requestId already present in the BPF storage. The attribution window problem disappears because the BPF program fires inline at the same event where attribution needs to happen. There is no race condition. There is no out of band write. The kernel is both the executor and the trust anchor.

BPF sits in the grey area between process-carried and external store. The requestId is keyed on the process, like a process-carried mechanism. But it is written by a trusted external program, and it happens inline. Because the BPF program fires at the same kernel event where the process starts, the attribution is captured before any userspace instruction runs. The requestId does not need to persist beyond the process lifetime. Inline capture eliminates the attribution window problem entirely.
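What the BPF program must do at the execve hook reduces to scanning the NUL-separated environment block for the K8S_REQUEST_ID entry before the first userspace instruction runs. The kernel-side version is written in restricted C against BPF helpers; the extraction logic itself, sketched here in Go, is this simple:

```go
package main

import (
	"bytes"
	"fmt"
)

// extractRequestID scans an environment block as it appears at execve
// (NUL-separated KEY=VALUE entries) for the injected requestId. A BPF
// program does the equivalent walk in kernel context, then stores the
// value in a BPF map or task storage that the container cannot touch.
func extractRequestID(environ []byte) (string, bool) {
	for _, kv := range bytes.Split(environ, []byte{0}) {
		if v, ok := bytes.CutPrefix(kv, []byte("K8S_REQUEST_ID=")); ok {
			return string(v), true
		}
	}
	return "", false
}

func main() {
	env := []byte("HOME=/root\x00K8S_REQUEST_ID=req-alice\x00TERM=xterm\x00")
	fmt.Println(extractRequestID(env))
}
```

Once the value is copied into kernel-managed storage, unsetting or overwriting the env var inside the container changes nothing: the attribution was already captured.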

The Right Weapon, To Be Wielded

| Mechanism | Tamper Resistant (L1) | Tamper Resistant (L2) | Attribution Window | Per-session Scoped | No Kernel Patch |
| --- | --- | --- | --- | --- | --- |
| Cgroup file | ✓ Mount namespace isolation holds | ✗ Privileged container can reach host cgroup filesystem | ✓ File persists on host after process exits | ✗ Shared across all container processes | ✓ No patch needed |
| Out of band mapping | ✓ External store is outside container reach | ✓ Privileged container cannot reach external store | ✗ Map cannot be pre-populated without PID | ✓ Each session can have its own mapping | ✓ No patch needed |
| Sealed memfd | ✗ Any process can close its own fd | ✗ Any process can close its own fd | ✗ fd closure destroys requestId before tooling reads it | ✓ Each exec session gets its own fd | ✓ runc change needed, not a kernel patch |
| Kernel keyring | ✓ Unprivileged process cannot modify a key it does not own | ✗ CAP_SYS_ADMIN allows keyctl manipulation | ✗ Session keyring destroyed when session ends | ✓ Each exec session gets its own session keyring | ✓ No patch needed |
| eBPF | ✓ Container cannot access BPF map without fd | ✓ Unpinned map, fd held exclusively by daemonset | ✓ Inline capture at execve, emitted to ring buffer | ✓ task_storage keyed on task struct per process | ✓ Kernel 5.11+ required, no patch needed |

A viable solution for the exec identity problem carries significant architectural constraints. Direction 2 eliminated itself before it could be properly evaluated. Direction 1 survived, but env vars alone were not enough. The solution needed to be inline, tamper resistant, per-session scoped, and free of kernel patches.

Every mechanism explored here was a serious candidate at some point. Each one failed on specific, verifiable grounds. The conclusion that eBPF is the right mechanism is not because it is a popular technology or a buzzword. It is because every other option was ruled out with clear reasoning, and eBPF is what remained. That is the only kind of conclusion worth trusting.

The implementation is the subject of the next post.


Previously in this series: The Kubernetes Exec Identity Gap: Kubernetes cannot tell you who ran that command, where identity disappears between the API server, kubelet, and the container runtime, and the two upstream-shaped directions to close the gap.