Bridging the Kubernetes Exec Identity Gap: One Hook Was Never Enough

This is the third and final post in a series documenting a Kubernetes KEP, from gap discovery to implementation evidence.

The second post concluded that the most capable solution to the exec identity gap captures the request_id inline at execve using eBPF and uses it to attribute every subsequent command. But building it meant navigating constraints imposed by the kernel and the BPF verifier. Each attempt surfaced a new kernel constraint. Each constraint forced a new design decision.

The Delivery Guarantee

Environment variables are the right transport mechanism: they are straightforward to inject and propagate to child processes by default. They are not suitable as storage, because a process can rewrite its own environment at any time. This post assumes K8S_REQUEST_ID is delivered to the process, whether via a mutating webhook like the Security Profiles Operator today or via an in-tree KEP change in the future, and focuses entirely on what happens when a BPF program tries to read and protect it.

Iteration 1: One Hook, One Job

Choosing The Hook

Before introducing any non-trivial mechanism, the goal is a simple eBPF program that prints the details of a process (pid, ppid, command) if its environment contains K8S_REQUEST_ID. As with any BPF program, the choice of hook matters. sched_process_exec is a reasonable hook for the first iteration: only successful execs reach it, so execs that fail for lack of permissions or any other reason are filtered out naturally.

The Search Loop

The environment variables need to be read. They are stored as a single NUL-separated buffer in the process address space. Finding the variable means copying up to a fixed maximum number of bytes from that buffer and scanning them for the string K8S_REQUEST_ID=.

    long ret = bpf_probe_read_user(scratch->buf, env_size, (void *)env_start);
    if (ret < 0)
        return 0;

    // Search for the environment variable K8S_REQUEST_ID
    for (int i = 0; i < MAX_ENV_SIZE - 15; i++)
    {
        if (scratch->buf[i] == 'K' &&
            scratch->buf[i + 1] == '8' &&
            scratch->buf[i + 2] == 'S' &&
            scratch->buf[i + 3] == '_' &&
            scratch->buf[i + 4] == 'R' &&
            scratch->buf[i + 5] == 'E' &&
            scratch->buf[i + 6] == 'Q' &&
            scratch->buf[i + 7] == 'U' &&
            scratch->buf[i + 8] == 'E' &&
            scratch->buf[i + 9] == 'S' &&
            scratch->buf[i + 10] == 'T' &&
            scratch->buf[i + 11] == '_' &&
            scratch->buf[i + 12] == 'I' &&
            scratch->buf[i + 13] == 'D' &&
            scratch->buf[i + 14] == '=')
        {
            found_off = i + 15;
            break;
        }
    }

The loop that traverses through the search buffer to find the environment variable. source: command-logger.bpf.c:100

The BPF verifier limits how large that scan window can be. Every comparison in the loop body adds branches the verifier has to explore, so the state space grows with MAX_ENV_SIZE and the 1 million instruction limit is exceeded quickly.

BPF program is too large. Processed 1000001 insn
	processed 1000001 insns (limit 1000000) max_states_per_insn 43 total_states 27933 peak_states 631 mark_read 551
2026/04/21 07:15:40 failed to load BPF collection: program handle_exec: load program: argument list too long: BPF program is too large. Processed 1000001 insn (33174 line(s) omitted)
exit status 1.

The verifier error thrown when trying to increase MAX_ENV_SIZE.

A straightforward loop passes the verifier only up to a 512-byte buffer. Optimising the search logic into a state machine stretches this to 2048 bytes. Even so, a few environment variables with large values ahead of the target can easily push it out of the search window.

    long ret = bpf_probe_read_user(scratch->buf, env_size, (void *)env_start);
    if (ret < 0)
        return 0;

    {
        const char needle[] = "K8S_REQUEST_ID=";
        u32 state = 0;

        for (int i = 0; i < MAX_ENV_SIZE - 15; i++) {
            unsigned char c = scratch->buf[i];

            if (c == (unsigned char)needle[state]) {
                state++;
                if (state == 15) {
                    found_off = i + 1;
                    break;
                }
            } else {
                state = (c == 'K') ? 1 : 0;
            }
        }
    }

The optimised state machine approach. source: command-logger.bpf.c:112
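The state machine is easy to exercise outside the verifier. The following is a minimal userspace sketch, assuming the same needle and a NUL-separated buffer. The name find_request_id_offset is hypothetical and does not come from command-logger.bpf.c, and the loop bound is the buffer length rather than MAX_ENV_SIZE - 15.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace port of the state machine above, for testing only.
 * Returns the offset of the first byte after "K8S_REQUEST_ID=", or -1 if the
 * needle does not occur in the first `len` bytes of `buf`. */
static long find_request_id_offset(const char *buf, size_t len)
{
    static const char needle[] = "K8S_REQUEST_ID=";
    const size_t needle_len = sizeof(needle) - 1; /* 15, excluding the NUL */
    size_t state = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)buf[i];

        if (c == (unsigned char)needle[state]) {
            state++;
            if (state == needle_len)
                return (long)(i + 1);
        } else {
            /* 'K' appears only at needle[0], so after a mismatch the match
             * can restart at state 1 at best. */
            state = (c == 'K') ? 1 : 0;
        }
    }
    return -1;
}
```

Like the BPF loop, this sketch matches the needle anywhere in the buffer, including inside another variable's name or value (for example MY_K8S_REQUEST_ID=). Requiring a preceding NUL, or offset 0, would be one way to tighten it.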

Attempt 1 output: handler screenshot, host process with REQUEST_ID visible — Top: BPF handler output on the host. Bottom: the pod session with `K8S_REQUEST_ID=beefed` injected, then overwritten to `SPOOFED` from inside the shell.

In Retrospect

This hook is the right choice for emitting exec events because failed execs never reach it. But for reading the environment variable, a fundamental constraint remains: the env vars live in a flat buffer and the scan window is bounded by the verifier limit. At sys_enter_execve, individual pointers per variable are available with no flat buffer constraint.

At this stage, the program has no tamper resistance. Any process inside the container can overwrite K8S_REQUEST_ID before spawning a child, and the forged value is what the child's events are attributed to.

Iteration 2: The Storage Mechanism

BPF Task Storage

Environment variables are not tamper resistant. This point has been reiterated across the last post and the previous iteration. The request_id needs to be stored inside a kernel object whose lifecycle is tied to the process and that is reachable only from the BPF program that holds a reference to it at load time. A BPF task storage map fits this constraint directly.

Write In, Read Out

The first step to solving tamper resistance is finding an appropriate storage mechanism. The goal here is only to test the storage mechanism by writing into it and reading from it again. Ensuring that what enters this storage cannot be tampered with is a problem reserved for the next iteration.

The previous BPF program is extended to write the request_id into the task storage when found in the environment, then read it back when emitting the event.

struct request_data {
    char request_id[REQUEST_ID_MAX];
};
// Task storage is a BPF map type whose entries live and die with the
// task they are attached to.
struct {
    __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
    __uint(map_flags, BPF_F_NO_PREALLOC);
    __type(key, int);
    __type(value, struct request_data);
} task_storage_map SEC(".maps");

Declaration of the task storage map. The kernel keys entries by task_struct pointer. The userspace loader owns the FD. source: command-logger.bpf.c:31

    struct request_data *rd = bpf_task_storage_get(
                &task_storage_map, task, NULL,
                BPF_LOCAL_STORAGE_GET_F_CREATE);
    if (!rd) 
        return 0;
    // Copy the string from scratch buffer into the task storage
    bpf_probe_read_kernel_str(rd->request_id,
                           sizeof(rd->request_id), scratch->buf + found_off);
    storage_written = 1;
    
    ...

    // Copy RequestID from task storage to event in ring buffer
    bpf_probe_read_kernel_str(e->request_id,
                            sizeof(e->request_id), rd->request_id);

Creating a task storage entry if it does not exist, writing the request_id into it, then reading it back when emitting the event. source: command-logger.bpf.c:144

The write-then-read pattern is deliberate: it confirms both paths behave as expected.

Attempt 2 output: handler screenshot, host process with STORED visible — Top: BPF handler output on the host. `STORED=1` confirms the `request_id` was written to task storage and read back. Bottom: the pod session with `K8S_REQUEST_ID=beefed` injected, then overwritten to `SPOOFED` from inside the shell.

The Chain Of Trust

The current iteration does not solve the spoofing problem. Once the request_id enters task storage, the value is tamper resistant. Task storage is kernel resident and keyed by task_struct, and nothing inside the container can write to it. But the source of the value written into task storage is still the environment variable, and the environment variable is writable from inside the container. An adversary can overwrite K8S_REQUEST_ID before the next execve, and the forged value is what reaches task storage.

Iteration 3: Parent Task Storage Precedence

The Chain Of Trust, Revisited

At sched_process_exec, the only source for the request_id that is already inside the kernel and not writable from userspace in the container is the parent task’s task_storage. The resulting flow:

check parent_process_task_storage for request_id
if present:
    - copy the request_id from the parent task storage to the current process task storage
    - skip reading the env var
if not present:
    - read env var and write request_id to current process task storage

In a Kubernetes exec flow into a pod:

1. When an exec session is opened, the request_id is injected into the environment variable using a mutating webhook (for now).
2. The hook on sched_process_exec captures this request_id from the environment variable and writes it into the task storage of the exec’d process. For example, if the command provided is /bin/sh, the request_id is stored in the task_storage entry keyed by that process’s task_struct.
3. For every subsequent command run inside the exec session, the same hook fires, but this time it first checks the parent process’s (/bin/sh) task storage for the request_id. Once found, the current process’s environment variable is not consulted. The request_id fetched from the parent process is written into the current process’s task storage.
4. The event is emitted to the Go handler, which prints or forwards the event.

This way, the trust anchor moves from the environment variable to two fixed points: the mutating webhook that issues the value and the task storage that preserves it.
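The precedence rule itself is small enough to model in plain C. This is a minimal sketch under stated assumptions, not the BPF program: resolve_request_id is a hypothetical function, REQUEST_ID_MAX is given an assumed value of 64, and the return value mirrors the PARENT_SRC column the handler prints.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REQUEST_ID_MAX 64 /* assumed size; the real constant is not shown */

/* Hypothetical model of the precedence rule: the parent's stored value wins,
 * and the environment is consulted only when the parent has none.
 * Returns 1 if the value came from the parent, 0 if it came from the
 * environment, and -1 if neither source had a value. */
static int resolve_request_id(const char *parent_stored, const char *env_value,
                              char out[REQUEST_ID_MAX])
{
    const char *src;
    int from_parent;

    assert(out != NULL);
    if (parent_stored && parent_stored[0] != '\0') {
        src = parent_stored;
        from_parent = 1;
    } else if (env_value && env_value[0] != '\0') {
        src = env_value;
        from_parent = 0;
    } else {
        out[0] = '\0';
        return -1;
    }

    /* Bounded copy that always leaves the result NUL-terminated. */
    strncpy(out, src, REQUEST_ID_MAX - 1);
    out[REQUEST_ID_MAX - 1] = '\0';
    return from_parent;
}
```

For the first exec in a session the parent has no entry, so the call falls through to the environment. For every later exec the parent entry exists and the environment value, forged or not, is never read.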

Fetching The Parent Task Storage

The parent process’s task_struct must be fetched to access its task storage, and the pointer passed to bpf_task_storage_get() must be one the verifier treats as trusted. Trust here is a verifier property: the helper accepts only pointers whose provenance the verifier has tracked, and pointers returned by bpf_get_current_task_btf() carry that property. Fields accessed by direct dereference inherit it, so task->real_parent is still trusted when passed to bpf_task_storage_get().

    // get a handle to the current task struct
    task = bpf_get_current_task_btf();
    if (!task)
        return 0;

    /* Direct field access keeps the pointer "trusted" for the verifier;
     * BPF_CORE_READ would turn it into a scalar and bpf_task_storage_get
     * rejects that. */
    parent = task->real_parent;
    if (parent) {
        prd = bpf_task_storage_get(&task_storage_map, parent, NULL, 0);
        ...
    }

Retrieving the parent’s task storage by dereferencing the BTF-typed pointer chain. source: command-logger.bpf.c:91

If the direct real_parent walk were not available, the alternative would be bpf_task_from_pid(), a kfunc rather than a built-in BPF helper. The verifier does not allow bpf_task_from_pid() from this tracepoint program: the kfunc acquires a reference to an arbitrary task whose lifecycle is outside the current context, and a kfunc can only be called from the program types it is registered for. LSM programs are among those types. The logic would then have to move to lsm/bprm_check_security, and the acquired reference would have to be released explicitly with bpf_task_release() before the program returns.

// under SEC("lsm/bprm_check_security")
task = bpf_get_current_task_btf();
if (!task)
    return 0;

pid  = bpf_get_current_pid_tgid() >> 32;
ppid = BPF_CORE_READ(task, real_parent, tgid);

parent = bpf_task_from_pid(ppid);
if (!parent)
    return 0;

rd = bpf_task_storage_get(&task_storage_map, parent, NULL, 0);
bpf_task_release(parent);

Using an LSM hook to read the parent’s task storage via the bpf_task_from_pid kfunc. source: command-logger.bpf.c:123

Defeating The Spoof

Previous iterations let an adversary evade attribution by overwriting the environment variable. With the parent precedence rule in place, an overwrite of K8S_REQUEST_ID is ignored for every command after the first exec in the session.

Attempt 3 output: handler screenshot, POC for defeating spoofing — Top: BPF handler output on the host. `PARENT_SRC=1` means the `request_id` was fetched from the parent task storage. Bottom: the pod session with `K8S_REQUEST_ID=beefed` injected, then overwritten to `SPOOFED` from inside the shell. The handler still emits `beefed`, confirming the parent precedence rule holds.

The attribution loop is now closed. Every command executed inside the exec session is tied back to the request_id injected at session open, and spoofing the environment variable no longer changes that attribution. But the emitted event still carries only the process metadata (pid, ppid, comm). An operator auditing a kubectl exec session sees which session a command belongs to and its name, but not what it was run against.

Iteration 4: Enriching Events with Args

The File Matters

Consider two concurrent exec sessions running the same cat command, one on server.go and the other on /etc/shadow. In the iterations so far, the logs only record that cat was invoked, each tagged with the corresponding request_id. Which file was read distinguishes a benign debug session from exfiltration of a sensitive file.

Adding The Args

At sched_process_exec, the kernel has already installed the new mm, so mm->arg_start and mm->arg_end point into the new image’s argv region. The same hook that emits the event can read the args, without adding a second hook.

    // under SEC("tp/sched/sched_process_exec")
    ...
    // Ship the raw argv region (NUL-separated, starts with argv[0]) as a
    // single buffer. User space is responsible for splitting and trimming.
    __builtin_memset(e->args, 0, sizeof(e->args));
    mm = task->mm;
    if (mm) {
        unsigned long arg_start = BPF_CORE_READ(mm, arg_start);
        unsigned long arg_end   = BPF_CORE_READ(mm, arg_end);

        if (arg_start && arg_end && arg_end > arg_start) {
            long arg_size = arg_end - arg_start;
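            // Clamp to MAX_ARGS_SIZE - 1: the mask below would turn a
            // length of exactly MAX_ARGS_SIZE into 0 and read nothing.
            // Assumes MAX_ARGS_SIZE is a power of two.
            if (arg_size > MAX_ARGS_SIZE - 1)
                arg_size = MAX_ARGS_SIZE - 1;

            bpf_probe_read_user(e->args,
                                arg_size & (MAX_ARGS_SIZE - 1),
                                (void *)arg_start);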
        }
    }

Reading the argv region from mm->arg_start at sched_process_exec. source: command-logger.bpf.c:214
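The comment in the snippet leaves splitting and trimming to user space. The handler in this series is written in Go; the sketch below shows the same idea in C for consistency with the other examples. join_args is a hypothetical helper, and it assumes the buffer arrives zero-padded as the memset above leaves it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: turn a NUL-separated argv region of `len` bytes into
 * a single space-separated string in `out`. Trailing zero bytes are treated
 * as padding and trimmed, so trailing empty arguments are lost.
 * Returns the number of arguments that were copied (fully or partly). */
static int join_args(const char *raw, size_t len, char *out, size_t out_len)
{
    size_t o = 0;
    int count = 0;

    assert(raw != NULL && out != NULL && out_len > 0);

    /* Drop the zero padding that follows the last argument. */
    while (len > 0 && raw[len - 1] == '\0')
        len--;

    for (size_t i = 0; i < len && o + 1 < out_len; i++) {
        if (i == 0 || raw[i - 1] == '\0')
            count++; /* position i is the start of a new argument */
        out[o++] = (raw[i] == '\0') ? ' ' : raw[i];
    }
    out[o] = '\0';
    return count;
}
```

Joining with spaces is lossy when an argument itself contains a space. A handler that needs exact arguments would keep them as a list instead.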

Attempt 4 output: handler now includes argv — Top: BPF handler output on the host. The `ARGS` column is now populated (`ARGS=/etc/passwd` for both `ls` and `cat`). Bottom: the pod session runs `ls /etc/passwd`, overwrites `K8S_REQUEST_ID=beefed` to `SPOOFED`, then runs `cat /etc/passwd`. The handler continues to emit `REQUEST_ID=beefed` after the spoof, now with the argument captured.

Is This Good Enough?

The emitted event now carries enough detail for an operator to tell cat server.go from cat /etc/shadow and attribute each to the originating exec session. Argv reading at sched_process_exec needs no search: mm->arg_start and mm->arg_end give the exact bounds of the region, so the only limit is the MAX_ARGS_SIZE cap on the copy. The env var reading, however, is still the flat-buffer scan from the first iteration: K8S_REQUEST_ID is found by walking a single userspace buffer of at most 2048 bytes. In a container whose environment places the variable beyond that window, the scan ends before reaching it. The lookup then fails silently: no task_storage entry is created, the emitted event carries no request_id, and attribution for that exec is lost with no signal to the handler that anything went wrong. The next iteration returns to that constraint.

Iteration 5: Two Hooks, Larger Reach

Envp As A Pointer Array

The fourth iteration solves the attribution problem end to end, but K8S_REQUEST_ID can still fall outside the scan window when the environment is large. The first iteration already noted the verifier-imposed limits of a flat search buffer and pointed to the per-variable pointers available at sys_enter_execve. There, ctx->args[2] is a pointer to the envp array. Each entry is a pointer to a NUL-terminated KEY=VALUE string in userspace. Reading one entry takes two user-memory reads: bpf_probe_read_user for the pointer and bpf_probe_read_user_str to follow it. There is no single-buffer ceiling: the bounds are a per-entry read length and a loop iteration count, both well within what the verifier accepts.

// under SEC("tracepoint/syscalls/sys_enter_execve")
...
const char **envp = (const char **)ctx->args[2];
if (!envp)
    return 0;

for (int i = 0; i < MAX_ENV_VARS; i++) {
    const char *entry = NULL;
    if (bpf_probe_read_user(&entry, sizeof(entry), &envp[i]) < 0)
        break;
    if (!entry)
        break; // NULL terminator of envp[]

    char var[REQUEST_ID_MAX + 16] = {};
    long n = bpf_probe_read_user_str(var, sizeof(var), entry);
    // var now contains the env var string 
    ...
}

Accessing environment variables from sys_enter_execve. source: command-logger.bpf.c:52
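The elided part of the loop has to decide whether an entry is the target variable and, if so, where its value starts. This is a minimal userspace sketch of that check. request_id_value is a hypothetical name, and inside the BPF program the comparison would be a bounded byte-by-byte check rather than libc's strncmp.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if `var` is "K8S_REQUEST_ID=<value>", return a pointer
 * to <value>; otherwise return NULL. The prefix must start at var[0], so a
 * name such as "MY_K8S_REQUEST_ID" does not match. */
static const char *request_id_value(const char *var)
{
    static const char key[] = "K8S_REQUEST_ID=";

    assert(var != NULL);
    if (strncmp(var, key, sizeof(key) - 1) != 0)
        return NULL;
    return var + (sizeof(key) - 1);
}
```

Because each envp entry is its own string, the match is anchored at the start of the variable name, which the flat-buffer scan could not guarantee.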

The Improvised Mechanism

Environment variable reading pivots into sys_enter_execve. Parent precedence, args capture, and event emission stay at sched_process_exec. Args cannot move to sys_enter_execve because the new mm is not installed until after the execve succeeds, which is the same property that made the fourth iteration’s mm->arg_start read at sched_process_exec valid. The mechanism looks like this:

hook sys_enter_execve
    - Check if K8S_REQUEST_ID is present in the environment variables
    if present:
        Write the request_id into the current process task storage
    if not:
        do nothing

hook sched_process_exec
    - Check if parent process task storage contains request_id
    if present:
        Overwrite the current process task storage with the inherited request_id
    if not:
        do nothing. The request_id set from the env var at sys_enter_execve remains

With this structure, the flat buffer limitation is averted and the tamper resistance guarantees of the third iteration are preserved.
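The ordering of the two hooks can be modelled in userspace to check that the later hook wins. The sketch below is an illustration under stated assumptions, not the BPF program: struct task, store, on_sys_enter_execve and on_sched_process_exec are hypothetical stand-ins, and REQUEST_ID_MAX is given an assumed value of 64.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REQUEST_ID_MAX 64 /* assumed size; the real constant is not shown */

/* Hypothetical stand-in for one task and its task-storage entry. */
struct task {
    struct task *parent;
    char request_id[REQUEST_ID_MAX]; /* empty string means "no entry" */
    int parent_src;                  /* 1 if inherited from the parent */
};

/* Bounded copy into the task's entry, always NUL-terminated. */
static void store(struct task *t, const char *id, int parent_src)
{
    strncpy(t->request_id, id, REQUEST_ID_MAX - 1);
    t->request_id[REQUEST_ID_MAX - 1] = '\0';
    t->parent_src = parent_src;
}

/* Models sys_enter_execve: record the env value, if there is one. */
static void on_sys_enter_execve(struct task *t, const char *env_value)
{
    assert(t != NULL);
    if (env_value && env_value[0] != '\0')
        store(t, env_value, 0);
}

/* Models sched_process_exec: a value in the parent's storage overrides
 * whatever the first hook wrote; otherwise the entry is left alone. */
static void on_sched_process_exec(struct task *t)
{
    assert(t != NULL);
    if (t->parent && t->parent->request_id[0] != '\0')
        store(t, t->parent->request_id, 1);
}
```

Running both functions for a shell whose parent has no entry, and then for a child of that shell with a forged environment value, reproduces the PARENT_SRC=0 and PARENT_SRC=1 rows described in the next section.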

What This Settles

Attempt 5 output: two-hook architecture end to end — Top: BPF handler output on the host. The `bash` row shows `PARENT_SRC=0` (the `request_id` was captured from `envp` at `sys_enter_execve`). The three child rows (`ls`, `whoami`, `cat`) show `PARENT_SRC=1` (inherited from the bash’s task storage). Bottom: the pod session launched with `K8S_REQUEST_ID=beefed`, overwritten to `SPOOFED` between `whoami` and `cat /etc/passwd`. The handler continues to emit `REQUEST_ID=beefed` for every event, with `ARGS=/etc/passwd` captured on the `cat` row.

Nothing changes much visually. The handler output is shaped like the fourth iteration’s. What the screenshot confirms is that the two-hook split holds. The bash row is sourced from envp at sys_enter_execve with PARENT_SRC=0. The three children inherit through parent task storage at sched_process_exec with PARENT_SRC=1. The SPOOFED overwrite between whoami and cat is ignored because the parent lookup takes precedence. ARGS=/etc/passwd is captured on the cat row. The final architecture delivers the env-var reach of sys_enter_execve, the tamper-resistant inheritance of parent task_storage, and the full argv from mm->arg_start, in one program.

Closing The Identity Gap

All the iterations lead to this point. The research and experimentation only matter if the original problem is solved: when alice and bob execute commands inside concurrent exec sessions, each command must be unambiguously attributable to the user who ran it.

End-to-end demo: two concurrent kubectl exec sessions, two request_ids, two identities recovered from the audit log — Top: BPF handler output on the host. Middle: alice’s and bob’s concurrent pod sessions. Bottom: audit-log lookup resolving each `request_id` back to its `kubectl` caller.

The handler records five exec events across the two concurrent sessions. The two bash rows show PARENT_SRC=0, meaning each request_id was captured from envp at sys_enter_execve. The three child rows show PARENT_SRC=1, meaning the request_id was inherited through the parent’s task storage at sched_process_exec. Alice’s UUID 943eb393... appears on bash, whoami, and ls -l. Bob’s UUID 8e7bde12... appears on bash and cat /etc/shadow, with ARGS=/etc/shadow captured on the cat row. The audit log at the bottom, queried by each request_id through jq, resolves 943eb393... back to alice and 8e7bde12... back to bob.

The problem stated in the first post is solved. Two concurrent exec sessions in the same pod carry distinct request_ids, and every command inside each session attributes back through the audit log to the caller that opened it. The second post defined five criteria any solution had to clear: tamper resistance at layer 1, tamper resistance at layer 2, an inline attribution window, per-session scoping, and no kernel patch. The program above clears all five on a stock 5.11+ kernel.

From POC To Upstream

That closes the three-part series. The POC is the foundation, not the finished article. Parts of the solution are ready for standardisation, parts are not, and the rest is implementation work.

What A KEP Can Carry

The KEP can propose to propagate the request_id as an environment variable by:

1. Modifying the API server to include request_id on the call to kubelet.
2. Including request_id in the gRPC proto between kubelet and CRI.
3. Injecting request_id as an environment variable in the container runtime, sourced from the CRI field above.

Alternatively, the KEP can propose to inject the request_id by modifying the exec command at admission, as SPO does through its mutating admission webhook.

What A KEP Cannot

The eBPF-based solution requires kernel 5.11+ for BPF_MAP_TYPE_TASK_STORAGE. Mandating an eBPF program as part of the node baseline is an uphill task: it crosses sig-node, distribution maintainers, and every runtime-security vendor’s existing stack. It might instead do better as an out-of-tree eBPF daemonset, maintained alongside a consumer like Falco, with the CRI-side env-var delivery covered by the KEP above.

Next Steps

Discussion toward an in-tree change is tracked in kubernetes/enhancements#6035; the proposed KEP is in kubernetes/enhancements#6036.

Three pieces of work stand between the POC and production:

1. Benchmark both hooks on realistic exec loads. Per-event overhead at sys_enter_execve and sched_process_exec, ring buffer drop rate when the userspace handler lags, and the baseline exec-latency delta all need numbers before the program runs anywhere beyond a demo.
2. Package the POC as a daemonset after production hardening and ship it as a separate tool focused on command logging.
3. Evaluate feasibility of integrating this mechanism with existing runtime-security projects like Falco, Tetragon and SPO.

The attribution problem has a kernel answer. The productionisation problem does not yet.
