Building a Self-Healing k3s Homelab (Part 4): Containerd's Ghost Sandboxes
Part 3 covered the NFS outage series and the fixes: soft mounts, then SSH rsync, then full NFS removal. By November 9, no kernel NFS mount existed anywhere in the cluster.
The cluster kept going down.
This is Part 4: the containerd sandbox leak pattern that appeared after every hard reboot, why it happens, what it does to the cluster, and how I finally automated the fix.
The Pattern
Every time Jarvis suffered a hard unclean shutdown (power cycle to recover from a freeze, or hardware watchdog reboot), the same thing happened when k3s came back up:
- Some pods wouldn’t start
- The errors didn’t look like OOM or resource exhaustion
- Eventually longhorn-csi-plugin would fail to start
- Every StatefulSet in the cluster would get stuck in ContainerCreating
- Home Assistant and everything else would be down until someone manually cleaned up
This happened on: November 10, November 12, November 13, November 14, November 30.
Five times. Each time required manual investigation to identify and fix. Each time, once I knew what to look for, the fix was about 2 minutes. But the first time it took over an hour to diagnose.
What Are Sandbox Leaks?
In Kubernetes, when a pod is scheduled to a node, the CRI (Container Runtime Interface) creates a “sandbox” for it. On k3s the runtime is containerd. A sandbox is a namespaced environment: it sets up the network namespace, the cgroup hierarchy, the seccomp profile. The actual containers inside the pod run inside this sandbox.
Each sandbox has a name. The name is deterministic and includes the pod name, namespace, and UID.
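As a sketch of that formula (the exact construction lives inside containerd's CRI plugin, so treat the details as illustrative; the pod name matches the error below but the UID is invented), the reserved name joins the fields with underscores and ends with an attempt counter:

```shell
# Illustrative reconstruction of containerd's sandbox naming scheme:
#   <pod-name>_<namespace>_<pod-uid>_<attempt>
# The UID below is made up for the example.
pod="longhorn-manager-w6dd4"
ns="longhorn-system"
uid="0f9c2a1e-8a4b-4f0e-9c1d-2e3f4a5b6c7d"
attempt=0
sandbox_name="${pod}_${ns}_${uid}_${attempt}"
echo "$sandbox_name"
```

This is the same shape that appears in the "failed to reserve sandbox name" error quoted below.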
When a pod is gracefully deleted, Kubernetes drain-kills the containers, the kubelet tells containerd to clean up the sandbox, and containerd removes the internal reservation. The name is freed.
When a node suffers an unclean shutdown (hard reset, power cut, kernel freeze forcing a hard reboot), containerd is killed mid-operation. The sandbox cleanup steps never run. containerd restarts but still has the old sandbox name records in its internal state: pods that no longer exist, whose names are still marked as “reserved.”
When the kubelet comes back up and tries to recreate the pods, it generates sandbox names with the same deterministic formula. For pods that still exist in the API server (DaemonSet and StatefulSet pods survive a node reboot), the pod name, namespace, and UID are all unchanged, so the formula produces exactly the same name, and the stale record is still holding it. containerd refuses to create the new sandbox:
Failed to create pod sandbox: rpc error: code = Unknown
desc = failed to reserve sandbox name "longhorn-manager-w6dd4_longhorn-system_<uid>_0":
name "longhorn-manager-w6dd4_longhorn-system_<uid>_0" is reserved for "<old-sandbox-id>"
The pod goes into CrashLoopBackOff. The old sandbox is a ghost: NotReady in containerd’s internal state, not actually running, just occupying the name slot.
Why Longhorn CSI Is the Chokepoint
Any pod on Jarvis can be blocked by sandbox leaks, but the one that matters most is longhorn-csi-plugin.
longhorn-csi-plugin is a DaemonSet pod that runs on every node. It implements the CSI Node interface: it’s responsible for mounting and unmounting volumes on the local node. If it doesn’t start, the CSI socket file (/csi/csi.sock) is never created.
The CSI sidecar pods (attacher, provisioner, resizer, snapshotter) all connect through this socket to do their work: attaching Longhorn volumes, provisioning new PVCs, resizing volumes. If the socket doesn’t exist, every sidecar immediately crashes:
Error: no such file or directory: /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock
They enter CrashLoopBackOff. With exponential backoff, a new crash attempt happens every 2-3 seconds per pod. With 8 sidecar pods doing this, that’s a new container creation event every few hundred milliseconds, all of it generating runc:[2:INIT] invocations and veth pair creation on Jarvis. Kryptonite for the RCU-stall-vulnerable Pi 5 kernel.
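The arithmetic behind that churn rate, as a back-of-envelope sketch (the 2.5-second per-pod interval is my midpoint of the 2-3 second range above):

```shell
# Back-of-envelope churn rate: 8 crash-looping pods, each restarting every ~2.5 s
pods=8
interval_s=2.5
# awk's %d truncates, so 312.5 prints as 312
ms_between_events=$(awk -v p="$pods" -v i="$interval_s" 'BEGIN { printf "%d", i / p * 1000 }')
echo "one runc:[2:INIT] roughly every ${ms_between_events} ms"
```

Roughly one container creation event every ~300 ms, around the clock, which is where "a new container creation event every few hundred milliseconds" comes from.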
Without working CSI, no Longhorn PVC can be attached to any pod on any node. Every StatefulSet that uses Longhorn gets stuck in ContainerCreating or Init:CrashLoopBackOff. Prometheus, Grafana, Gitea, PostgreSQL, Valkey. All down. The cluster is effectively headless even though most pods are technically still scheduled.
flowchart TD
A[Hard reboot / kernel freeze] --> B[containerd killed\nwithout sandbox cleanup]
B --> C[Stale NotReady sandboxes\nremain in containerd state]
C --> D[k3s restarts\nkubelet tries to recreate pods]
D --> E{Sandbox name conflict?}
E -->|yes| F[Failed to reserve sandbox name\nCrashLoopBackOff]
E -->|no| G[Pod starts normally]
F --> H[longhorn-csi-plugin\ncannot start]
H --> I[csi.sock never created]
I --> J[All CSI sidecars crash\nno such file or directory]
J --> K["8 crash-looping pods\nrunc:[2:INIT] every 2s"]
K --> L[RCU stall acceleration\nfrom container churn]
L --> M[All Longhorn PVC attachments fail\nStatefulSets stuck ContainerCreating]
M --> N[Entire cluster effectively down\nuntil manual cleanup]
style A fill:#c00,color:#fff
style N fill:#c00,color:#fff
Diagnosing It
Step 1: Check for stale sandboxes
When pods are failing with sandbox reservation errors, the first thing to check:
# Count stale sandboxes
crictl pods --state NotReady | wc -l
# On November 12: 49
# On November 13: 37
# On November 14: 36
# On November 30: dozens
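The `--state NotReady -q` selection can be sanity-checked offline. This sketch emulates it with awk against a fabricated listing (the pod IDs and names are invented, and real `crictl pods` output has more columns):

```shell
# Fabricated 'crictl pods'-style listing; only the STATE column matters here
sample='POD-ID        STATE     NAME
a1b2c3d4e5f6  NotReady  longhorn-manager-w6dd4
f6e5d4c3b2a1  Ready     coredns-6799fbcd5-x7k2q'
# Equivalent of '--state NotReady -q': filter on STATE, print only the pod IDs
echo "$sample" | awk '$2 == "NotReady" { print $1 }'
```

Piping those IDs into `xargs -r crictl rmp --force` removes each matching sandbox; the `-r` flag makes xargs a no-op when the list is empty, which is what makes the same pipeline safe to run unconditionally at boot.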
If kubectl describe pod shows failed to reserve sandbox name errors, there will be stale NotReady sandboxes in crictl.
Step 2: Check Longhorn CSI socket
kubectl describe pod -n longhorn-system -l app=longhorn-csi-plugin
# Look for:
# Warning FailedCreatePodSandBox
# Error csi.sock: no such file or directory
If longhorn-csi-plugin is in CrashLoopBackOff with sandbox errors, that’s the chokepoint. Everything else will follow once it’s fixed.
Step 3: Check stuck VolumeAttachments
kubectl get volumeattachment -A
# Look for VolumeAttachment objects where attached=false and they've been there for >5 min
After sandbox cleanup, these may also need to be manually deleted to unblock pods that were waiting on them.
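A sketch of picking the stuck ones out of the tabular output (the names and PVs below are fabricated; ATTACHED is the fifth column in the default `kubectl get volumeattachment` output):

```shell
# Fabricated 'kubectl get volumeattachment' output
sample='NAME        ATTACHER            PV          NODE     ATTACHED   AGE
csi-abc123  driver.longhorn.io  pvc-11111   jarvis   false      27m
csi-def456  driver.longhorn.io  pvc-22222   vision   true       3d'
# Candidates for manual deletion: ATTACHED=false (column 5) for longer than a few minutes
echo "$sample" | awk 'NR > 1 && $5 == "false" { print $1 }'
```

Each printed name would then go to `kubectl delete volumeattachment <name>`; the attach/detach controller recreates the object when a pod still needs the volume.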
The Fix (Manual)
Every time this happened, the fix was the same two steps:
Step 1: Clean stale sandboxes
# If you have SSH to Jarvis:
crictl pods --state NotReady -q | xargs -r crictl rmp --force
# If SSH is broken (as it was on November 12, 'agent refused operation'):
kubectl debug node/jarvis -n kube-system --profile=sysadmin -it --image=alpine -- \
sh -c 'chroot /host crictl pods --state NotReady -q | xargs -r chroot /host crictl rmp'
The kubectl debug node approach is worth knowing. It creates a privileged pod on the target node with access to the host filesystem via /host. The --profile=sysadmin flag is required: other profiles don’t grant host filesystem access. Use kube-system as the namespace, because namespaces that enforce the restricted PodSecurity standard reject privileged pods, and the debug pod must be privileged.
Step 2: Delete CrashLoopBackOff CSI pods
Once stale sandboxes are gone, the existing CSI sidecar pods are still in CrashLoopBackOff with exponential backoff timers (they may be waiting minutes before their next restart). Delete them to force an immediate restart:
kubectl -n longhorn-system delete pods -l app=csi-attacher
kubectl -n longhorn-system delete pods -l app=csi-provisioner
kubectl -n longhorn-system delete pods -l app=csi-resizer
kubectl -n longhorn-system delete pods -l app=csi-snapshotter
kubectl -n longhorn-system delete pods -l app=longhorn-csi-plugin
Within 15-60 seconds, longhorn-csi-plugin creates the CSI socket, the sidecars connect, and all the stuck StatefulSets start scheduling their volumes.
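To watch for that recovery from the node rather than polling kubectl, a tiny helper works (the function and the 90-second deadline are mine, not part of Longhorn; the socket path is the real one from the errors above):

```shell
# Poll until a path exists, up to a deadline in seconds; non-zero exit on timeout
wait_for_path() {
  timeout "$2" sh -c "until [ -e '$1' ]; do sleep 0.2; done"
}
# On the node, after deleting the pods:
#   wait_for_path /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock 90 \
#     && echo "CSI socket is back"
```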
The Automated Fix
Manual fixes are fine once. Manual fixes five times are a process problem.
The right approach is to clean stale sandboxes automatically before k3s starts on each boot. k3s manages containerd internally (it’s not a separate systemd service), so a standalone containerd.service-dependent systemd unit won’t work. The correct hook is an ExecStartPre in the k3s systemd service.
The fix that actually worked (applied November 30, 2025):
Script: /usr/local/bin/k3s-pre-start-cleanup.sh
#!/bin/bash
# Clean stale containerd sandboxes before k3s starts.
# These accumulate on unclean shutdowns and cause cascade failures on restart.
export PATH=/usr/local/bin:$PATH
crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock \
pods --state NotReady -q 2>/dev/null | \
xargs -r crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock \
rmp --force 2>/dev/null || true
Systemd drop-in: /etc/systemd/system/k3s.service.d/sandbox-cleanup.conf
[Service]
ExecStartPre=/usr/local/bin/k3s-pre-start-cleanup.sh
This runs before k3s starts on every boot. If there are stale sandboxes, they get cleaned. If there aren’t (clean shutdown), the script runs in ~0.1 seconds and does nothing. No downside.
The earlier attempt at a standalone k3s-sandbox-cleanup.service failed because it declared After=containerd.service, a target that doesn’t exist when containerd is managed internally by k3s. The ExecStartPre approach doesn’t have this problem: it hooks directly into the k3s service lifecycle.
The November 14 Cascade: A Fully Self-Reinforcing Loop
The November 14 outage is worth examining in detail because it demonstrates how these failures compound over time.
Timeline:
- November 13, 06:02 MST: Jarvis hard-reboots (unthrottled rsync saturated the CPU, covered in Part 3). 36 stale sandboxes persist.
- November 13, 06:02 – November 14, 16:33 MST (37 hours): longhorn-csi-plugin is stuck in CrashLoopBackOff the entire time. 8 crash-looping pods, one new runc:[2:INIT] container every 2-3 seconds. Jarvis accumulates 1,191 RCU expedited stalls over those 37 hours.
- November 14, 16:33 MST: The API server stops responding. Not from OOM, not from disk pressure. Jarvis’s CPU was at ~12%, RAM at 56%. The kernel’s network softirq path stopped processing inbound connections, making the node appear dead while actually still running (etcd was compacting, cron was firing, journald was writing).
- November 14, 19:38 MST: Hard reboot.
- November 14, 19:38 MST: 36 new stale sandboxes from the unclean shutdown. Crash loops resume immediately.
The full causal chain:
Hard reboot (Nov 13)
↓ 36 stale containerd sandboxes (cleanup service broken)
↓ longhorn-csi-plugin blocked for 37 hours
↓ 8 CSI sidecar pods in CrashLoopBackOff
↓ runc:[2:INIT] every ~2-3 seconds, 24/7
↓ 1,191 RCU expedited stalls over 37 hours
↓ k3s API server network stack freeze at 16:33
↓ Hard reboot (Nov 14)
↓ 36 new stale sandboxes
↓ cycle repeats
This is a perfectly self-reinforcing failure loop. Each reboot causes the next reboot. Without external intervention, it runs indefinitely.
After November 14, I stopped trying to fix it with ad-hoc cleanup procedures and started working toward a permanent, automated solution: the ExecStartPre approach that finally landed on November 30.
What the Recovery Looks Like After ExecStartPre Is In Place
November 30, 2025: k3s was restarted manually (user intervention for a different issue). The sandbox leak cascade started again. At this point ExecStartPre was not yet deployed.
06:33 MST: k3s restarted manually
06:33–07:53: Same cascade (longhorn-csi-plugin blocked, CSI sidecars crash, PVCs stuck)
07:48: Investigation begins
07:53: Fix applied (manual sandbox cleanup + CSI pod deletion)
08:04: All 11 ArgoCD apps Synced/Healthy. 11 minutes from fix to all-green.
Then ExecStartPre was deployed.
Compare the next reboot sequence: k3s starts, ExecStartPre runs the cleanup script, stale sandboxes are gone before k3s schedules any pods. longhorn-csi-plugin creates the CSI socket. Every pod starts cleanly. No intervention required.
Debugging Tools Reference
For future incidents like this, the tools that matter:
# Check stale sandboxes
crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock pods --state NotReady
# Clean all stale sandboxes
crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock \
pods --state NotReady -q | xargs -r \
crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock rmp --force
# Check CSI socket exists
ls -la /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock
# Check stuck VolumeAttachments
kubectl get volumeattachment -A
kubectl describe volumeattachment <name> # look for 'attached: false' with long age
# Get node-level debug access when SSH is broken
kubectl debug node/jarvis -n kube-system --profile=sysadmin -it --image=alpine
# Inside the debug pod (/host = node filesystem):
chroot /host crictl pods --state NotReady
chroot /host crictl rmp <sandbox-id>
# Check pod events (most useful first pass)
kubectl get events -n longhorn-system --sort-by='.lastTimestamp' | tail -30
Takeaways
- Unclean shutdowns leave containerd sandbox name reservations behind. This is inherent to how containerd works. It’s not a k3s bug or a Pi bug; it’s a property of any containerd-based runtime under unexpected shutdown.
- longhorn-csi-plugin is the most important pod on the control-plane node. If it doesn’t start, nothing that needs a PVC starts. Check it first when storage-dependent pods are stuck.
- The fix is crictl rmp on stale sandboxes. 2 minutes. Done. Then delete the crash-looping CSI pods so they restart immediately instead of waiting out exponential backoff.
- Automate the fix with ExecStartPre in the k3s systemd service. Standalone containerd.service dependencies don’t work with k3s’s embedded containerd. Hook into the k3s service lifecycle directly.
- CrashLoopBackOff pods generate RCU stalls. If you’re seeing elevated RCU stall rates, check for crash-looping pods in longhorn-system. 8 pods each restarting every 2-3 seconds is enough to destabilize the kernel on a Pi 5 over 24-36 hours.
Part 5 covers the RCU stall problem directly: what it is, why it keeps coming back even without NFS or sandbox leaks, and how the cluster eventually learned to heal itself without physical intervention.