Skip to main content

Building a Self-Healing k3s Homelab (Part 5): RCU Stalls, Watchdogs, and Actually Healing

Part 4 covered the containerd sandbox leak problem and the ExecStartPre fix. The sandbox leaks were solved. But there was still an underlying issue that kept forcing those hard reboots in the first place.

This is Part 5: the kernel RCU stall problem, why it’s dangerous on a Raspberry Pi running k3s, the mitigations I layered on, and the moment the cluster finally handled an outage without me.


What Is an RCU Stall?

RCU stands for Read-Copy-Update. It’s a synchronization mechanism built into the Linux kernel for situations where reads are very frequent and writes are rare. The basic idea: readers don’t take locks. Writers make a copy of the data, update it, and wait until all current readers are done before pointing the system at the new copy. The period readers are finishing up is called a “grace period.”

RCU is used everywhere in the kernel: memory management, networking, filesystem layers, scheduling. When something goes wrong in a grace period, when a reader holds an RCU read-side critical section open far longer than expected, the kernel logs:

rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P<pid> } 21 jiffies

This is the kernel saying: “something is blocking an RCU grace period, and I’ve been waiting 21 scheduler ticks for it to release.” Usually this resolves in a few milliseconds. Occasionally it doesn’t.

The Raspberry Pi kernel issue: The 6.17.0-1008-raspi kernel (Ubuntu 25.10 for Raspberry Pi) has a bug where RCU expedited stalls are triggered during runc:[2:INIT] container initialization, specifically when Flannel creates virtual ethernet pairs (veth). Every container start on a Jarvis pod goes through this path:

container start → runc → fork runc:[2:INIT] → Flannel creates veth pair → RCU stall

The stalls are normally brief (21-22 jiffies ≈ 200ms) and the system recovers. The kernel can have dozens of these per hour and still run fine.

The problem is what happens when they accumulate.


How Stalls Kill the Node

Individual RCU stalls are survivable. A rate of 0-5 per 10 minutes is normal background noise on this cluster. But under certain conditions the rate climbs, and once it climbs past a threshold the failure mode is dramatic.

The “network black-hole” failure mode:

When the RCU stall rate becomes high enough for long enough, the kernel’s softirq processing path (interrupt-driven network packet handling) stops making forward progress. The CPU is too busy trying to resolve RCU grace periods to process incoming network packets.

From the outside, the node looks dead: no ping response, no SSH, no API server connections, nothing. From the inside, the kernel is still running: etcd compacts, cron fires, journald writes. The hardware watchdog might not trigger because the kernel is technically alive. The kernel just isn’t servicing the network stack.

This is different from an OOM kill (there’s no crash, no dmesg panic) and different from a D-state hang (there’s no single stuck process). It’s an emergent network blackout from aggregate CPU pressure on the interrupt path. I called it a “network black-hole” because that’s what it looks like from the outside.

Evidence from the November 30 outage:

13:31 MDT: First RCU stall of the day
13:31–16:04 MDT: Stalls fire every 2-4 minutes (avg 1.4/10min)
13:59:43: Longhorn admission webhook: context deadline exceeded
13:59:53: remotedialer: i/o timeout — Ultron loses link to Jarvis API
14:04:49: k3s taint-eviction-controller deletes ALL Ultron pods simultaneously
14:14:45: Mass "client disconnected" — all watchers lose streams

And throughout all of this, from Jarvis’s perspective:

  • CPU: ~12% (not saturated)
  • RAM: 56% (not under pressure)
  • etcd: compacting every 5 minutes (still working)
  • journald: still writing

The node was alive. It just couldn’t talk to anyone.

flowchart LR
    subgraph NETWORK[Network Path - FAILING]
        NIC[NIC hardware interrupts]
        SOFTIRQ[softirq handlers\nnet_rx_action]
        TCP[TCP stack]
    end
    subgraph KERNEL[Kernel - RUNNING]
        SCHED[Scheduler]
        RCU[RCU subsystem\nstalled]
        ETCD[etcd\nstill compacting]
        JOURNALD[journald\nstill writing]
    end
    subgraph OUTSIDE[External - DEAD]
        SSH[SSH connections blocked]
        API[k3s API blocked]
        PING[ICMP blocked]
    end

    RCU -->|saturates CPU cycles| SOFTIRQ
    SOFTIRQ -.->|cannot process| NIC
    NIC -.->|packets dropped| OUTSIDE

    SCHED --> ETCD
    SCHED --> JOURNALD

The Conditions That Amplify Stalls

Not every workload configuration produces this. The following conditions make it worse:

Container churn. Every container lifecycle event (create, start, stop, delete) goes through runc:[2:INIT] and veth pair creation. A normal cluster lifecycle produces some of this, but crash-looping pods (as in Part 4) produce it at extreme rates. Eight crash-looping CSI pods means ~4 container events per second, sustained.

Prometheus TSDB compaction. Under load, Prometheus compaction runs burn through CPU cycles. This raises the background pressure on the kernel scheduler, reducing its ability to absorb RCU stall handling.

Unthrottled rsync over SSH. Before I added --bwlimit=5000 to the backup job, rsync over a gigabit link fully utilized Jarvis’s I/O and network stack. This produced an interrupt storm that generated RCU stalls equivalent to a full crash.

The November 13 outage was specifically caused by unthrottled rsync: the backup job at 02:30 MST pushed the CPU from ~20% to ~85% and triggered continuous RCU stalls that froze the node in under 2 seconds.


Layer 1: hung_task_panic

This was the first mitigation (applied after the November 2 outage):

kernel.hung_task_timeout_secs = 120
kernel.hung_task_panic = 1

If any kernel task stays in D-state (uninterruptible sleep) for 120+ seconds, the kernel panics and reboots automatically (via the panic=10 boot cmdline parameter k3s sets by default).

This works for the NFS case: hard NFS mounts create D-state threads. After 2 minutes, panic, automatic reboot. No physical intervention needed.

It does NOT work for the “network black-hole” case. RCU stalls produce network layer saturation, not D-state tasks. Individual tasks return from NFS operations (with EIO), so no single task exceeds the 120-second threshold. The system looks fine at the task level while being functionally dead at the network level.


Layer 2: Hardware Watchdog (systemd)

The Raspberry Pi BCM2835/ BCM2711 has a hardware watchdog: a timer that resets the board if the kernel doesn’t “pet” it (reset the timer) within the configured window.

The k3s default configuration doesn’t enable the watchdog. I added it via systemd:

# /etc/systemd/system.conf.d/watchdog.conf
[Manager]
RuntimeWatchdogSec=30
RebootWatchdogSec=10min
WatchdogDevice=/dev/watchdog

With this, systemd pets the hardware watchdog every 30 seconds. If systemd can’t run (kernel panic, hard lock, etc.), the watchdog fires and the board resets.

This also doesn’t catch the network black-hole. In that failure mode, the kernel is running fine. Systemd is running fine. The watchdog gets petted every 30 seconds. No reset. The node just silently stops serving traffic until someone pushes the power button.


Layer 3: Prometheus Alerts for Reboot Events

After the November 22 outage I added PrometheusRule resources to alert on reboot events:

# NodeUnexpectedReboot: fires within 10 min of any unexpected reboot
- alert: NodeUnexpectedReboot
  expr: |
    (node_time_seconds - node_boot_time_seconds < 600)
    and
    (node_time_seconds - node_boot_time_seconds offset 5m > 600)
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: "Unexpected reboot detected on {{ $labels.instance }}"

# NodeRepeatedReboots: >1 reboot in 24h
- alert: NodeRepeatedReboots
  expr: changes(node_boot_time_seconds[24h]) > 1
  for: 0m
  labels:
    severity: warning

These would have fired immediately after any of the hard resets, giving an Alertmanager notification instead of relying on me noticing the Home Assistant tile was grey.

Detection, not healing. Useful but not the goal.


Layer 4: The RCU Stall Watchdog

This is the one that actually made a difference.

After analyzing the November 30 outage logs, I had the RCU stall rate data I needed:

PhaseStall rateNotes
Boot burst (first 60s)~55 stalls in 1 minNormal — CNI initialization, containerd pulling state
Stable operation0-2 per 10 minBackground noise level
Pre-blackout cascadeavg 1.4/10min, peaks at 3/10minDegrading toward blackout
Post-hard-reboot noiseavg 3.9/10min, peaks at 8/10minBoot-time burst, not a crisis

The threshold: 3 stalls in a 10-minute window is the signal to act. Under stable operation, that doesn’t happen. In the 2-3 hours before a blackout, it becomes consistent.

The action: issue a controlled systemctl reboot before the network stack goes dark. A clean reboot takes ~2 minutes. A network black-hole can last 2 hours or more (the November 30 blackout lasted from 13:59 to 16:10, over 2 hours of dead services).

Script deployed to Jarvis as /usr/local/bin/rcu-stall-watchdog.sh:

#!/bin/bash
LOOKBACK_MIN=10
GRACE_MIN=15    # Skip checks during boot burst
THRESHOLD=3

uptime_seconds=$(awk '{print int($1)}' /proc/uptime)
uptime_minutes=$(( uptime_seconds / 60 ))

if [[ "$uptime_minutes" -lt "$GRACE_MIN" ]]; then
    logger -p kern.info -t rcu-stall-watchdog \
        "uptime=${uptime_minutes}min — skipping (grace period ${GRACE_MIN}min)"
    exit 0
fi

stall_count=$(journalctl -k --since "${LOOKBACK_MIN} minutes ago" --no-pager -q 2>/dev/null \
    | grep -c "rcu_preempt detected expedited stalls" || true)

logger -p kern.info -t rcu-stall-watchdog \
    "uptime=${uptime_minutes}min stalls_${LOOKBACK_MIN}min=${stall_count} threshold=${THRESHOLD}"

if [[ "$stall_count" -ge "$THRESHOLD" ]]; then
    logger -p kern.crit -t rcu-stall-watchdog \
        "CRITICAL: ${stall_count} stalls in ${LOOKBACK_MIN}min. Rebooting."
    systemctl reboot
fi

Systemd timer to run it every 5 minutes:

# /etc/systemd/system/rcu-stall-watchdog.timer
[Timer]
OnBootSec=15min
OnUnitActiveSec=5min
AccuracySec=30s

[Install]
WantedBy=timers.target

The 15-minute OnBootSec aligns with the GRACE_MIN=15 in the script: during the first 15 minutes after boot, even if stall rates are high (they always are during CNI initialization), the watchdog takes no action.


The First Real-World Trigger

November 30, 2025. Same day the watchdog was deployed.

The cluster went through the network black-hole failure at 13:59-16:10 MDT (the last manual outage). I power-cycled, investigated, deployed the watchdog, and went on with my day.

At 17:50:44 MDT:

Nov 30 17:50:44 Jarvis rcu-stall-watchdog[...]: CRITICAL: 7 stalls in last 10 min
  (threshold=3, uptime=95min). Initiating controlled reboot.

The RCU stall cascade was building again. Seven stalls in 10 minutes. The watchdog caught it.

At 17:52:41 MDT, Jarvis was back online.

2 minutes. Compared to 2+ hours of dead services from the morning’s manual incident.

Nodes Ready, all 11 ArgoCD apps Synced/Healthy within 5 minutes of the watchdog reboot.

flowchart LR
    subgraph MANUAL["Manual Outage (2+ hrs)"]
        A["13:31 RCU stalls begin"] --> B["13:59 Network blackout"]
        B --> C["14:04 Ultron pods evicted"]
        C --> D["14:14 All watchers disconnect"]
        D --> E["16:10 Manual power cycle"]
    end
    subgraph AUTO["Watchdog Recovery (2 min)"]
        F["17:31 RCU stalls resume"] --> G["17:50 Watchdog detects 7 stalls"]
        G --> H["17:50 Controlled reboot"]
        H --> I["17:52 Jarvis back online"]
    end
    E --> F

    style E fill:#c00,color:#fff
    style I fill:#0a0,color:#fff

The Full Mitigation Stack

After all the work documented across this series, the defense-in-depth looks like this:

flowchart TD
    subgraph PREVENT[Prevention Layer]
        SSH_RSYNC[SSH rsync backup\ninstead of NFS]
        BWLIMIT["bwlimit=5000 on rsync\nno I/O saturation"]
        TOPOLOGY[topologySpreadConstraints\nCSI sidecars across both nodes]
        AFFINITY[requiredDuringScheduling\nmonitoring on Ultron]
    end
    subgraph DETECT[Detection Layer]
        PROM_ALERTS[PrometheusRule\nNodeUnexpectedReboot\nLocalPathBackupFailed]
        RCU_WD[rcu-stall-watchdog.service\nchecks every 5 min]
    end
    subgraph RECOVER[Recovery Layer]
        HUNG_TASK[hung_task_panic=1\nauto-reboot on D-state hang]
        HW_WD[Hardware watchdog\n30s systemd pet]
        RCU_REBOOT[Controlled systemctl reboot\nbefore network blackout]
        EXECSTARTPRE[ExecStartPre sandbox cleanup\nauto-cleanup on k3s start]
    end

    PREVENT --> DETECT
    DETECT --> RECOVER

    SSH_RSYNC -->|eliminates NFS hang triggers| RECOVER
    TOPOLOGY -->|reduces veth churn on Jarvis| DETECT
    RCU_WD -->|detects cascade early| RCU_REBOOT
    RCU_REBOOT -->|2 min recovery| EXECSTARTPRE
    EXECSTARTPRE -->|cleans sandbox leaks| RECOVER

Each layer addresses a different failure mode:

LayerWhat it catches
SSH rsyncNFS-related D-state hangs (eliminates the trigger)
--bwlimitI/O saturation from backup jobs
topologySpreadConstraintsContainer churn concentration on Jarvis
hung_task_panicD-state hangs from any cause (>120s in uninterruptible sleep)
Hardware watchdogComplete kernel hangs where systemd can’t run
RCU stall watchdogNetwork black-hole from sustained RCU stall cascade
ExecStartPre cleanupSandbox leaks after any unclean shutdown
Prometheus alertsHuman notification for any reboot event

What’s Still Open

The kernel bug is still there. The rcu_preempt stall issue in 6.17.0-1008-raspi is a known regression in the Raspberry Pi kernel for Ubuntu 25.10. It’s not fixed in the current kernel. The RCU stall watchdog is a mitigation, not a fix. When Ubuntu ships a patched linux-raspi kernel, the watchdog gets disabled.

Ubuntu 25.10 goes EOL in July 2026. Both nodes need to migrate to 26.04 LTS. The upgrade process needs to be staggered: upgrade Ultron first (agent, lower blast radius), verify, then upgrade Jarvis. The 26.04 LTS kernel will almost certainly include a fix for the RCU stall regression.

Longhorn backup still has no off-node destination. The NFS backup target was removed, the daily snapshot runs locally, and the SSH rsync backup copies PVC data to Ultron. But Ultron and Jarvis are on the same LAN. If the LAN goes down, both backups are on equipment I can’t reach. A MinIO S3 backup target (either a local MinIO instance outside the cluster, or a cloud bucket) would give true off-node backup. This is on the list.

External reachability monitoring. The November 30 network black-hole lasted 2 hours because I discovered it manually. An external ping monitor (Uptime Kuma on a separate host, or a router cron job) that sends a notification if Jarvis is unreachable for >2 minutes would reduce detection time to under 5 minutes. The RCU watchdog handles internal detection; external detection is still human.


Takeaways from the Full Series

  1. Kernel NFS mounts on a control-plane node are catastrophic. Replace them with userspace alternatives at the first opportunity.

  2. GitOps is non-negotiable. An emergency fix that isn’t committed and pushed is not a fix. ArgoCD will revert it.

  3. Stale containerd sandboxes are inevitable after unclean shutdowns. Automate the cleanup with ExecStartPre in the k3s systemd service.

  4. RCU stalls escalate to network black-holes under sustained container churn. The watchdog pattern (measure the rate, trigger a controlled reboot at a threshold) works. 2 minutes of intentional downtime beats 2 hours of invisible dead services.

  5. Layer your mitigations. No single fix caught everything. The combination of hung_task_panic, hardware watchdog, RCU stall watchdog, sandbox cleanup, and SSH rsync with bandwidth limiting covers the failure space adequately.

  6. Build your monitoring dashboards in Git. Every Grafana dashboard committed as a ConfigMap, ArgoCD-managed. When the cluster gets rebuilt, everything comes back on the first sync.

  7. Required affinity beats preferred affinity for critical workloads. Preferred sounds safer. It isn’t. The scheduler will make resource-destroying decisions when one node isn’t ready.

The cluster is not perfect. The kernel bug is still there. The backup situation could be better. The OS needs upgrading before summer. But right now, when Jarvis starts accumulating RCU stalls, it reboots itself, cleans up the sandbox leaks, and comes back online in under 3 minutes. Without anyone pressing a power button.

That’s the self-healing cluster. It took nine weeks and nine outages to get here.