Building a Self-Healing k3s Homelab (Part 3): The NFS Nightmare
Part 2 ended with a functioning two-node cluster, all workloads placed correctly, resource limits set, monitoring green. Late October 2025, everything looked good.
Early November brought five outages in nine days, all rooted in the same mistake: putting a kernel-level NFS mount on the control-plane node. This is the story of that mistake, in all its embarrassing detail.
The Backup Plan
Longhorn provides volume snapshots natively. A snapshot is taken locally, stored in Longhorn’s own format, and can be restored. That’s fine for a bad deployment. For a node failure followed by a disk failure, you need something off-node.
Longhorn supports backup targets: an S3 bucket or an NFS share where it sends backup data. Since I had Ultron right there on the same LAN with a few hundred GB of free space on its SSD, I mounted a directory over NFS:
```
Ultron: /backup/longhorn (NFS export)
Jarvis: mounts via Longhorn's BackupTarget CRD
```
I also set up a local-path-backup CronJob to copy the HA config PVCs from Jarvis’s local-path storage to Ultron’s /backup/local-path directory. Same approach: NFS.
At the time this seemed completely reasonable. Two machines on a gigabit LAN. NFS is fast. Easy to set up. What could go wrong.
November 2 – Outage #1 (05:38 MST)
The first symptom was the cluster becoming completely unresponsive at 05:38 MST. Home Assistant offline. ArgoCD offline. Grafana flat-lined. kubectl couldn’t reach the API server. Physical power cycle fixed it.
After the reboot, I pulled the kernel logs and started reading. What I found was a multi-factor failure chain.
Factor 1: Hard NFS mount semantics
Linux NFS mounts default to “hard” mode. In hard mode, when the NFS server becomes unreachable, the kernel threads waiting on NFS I/O block in uninterruptible D-state indefinitely. They cannot be signaled, killed, or interrupted. They just sit there, holding kernel locks, waiting for a server that isn’t coming back.
The NFS server was Ultron’s nfs-kernel-server. The previous day, both Jarvis and Ultron had been rebooted for maintenance. Ultron came back online about 7 minutes after Jarvis. During those 7 minutes, Longhorn tried to reconnect to the NFS backup target. The connection failed, leaving the NFS client state in Jarvis’s kernel inconsistent.
About 22 hours later (04:44 MST on November 2), those stale NFS sessions started timing out. Jarvis kernel began logging:
```
Nov 02 04:44:29 Jarvis kernel: nfs: server 10.0.1.11 not responding, timed out
```
And kept logging it every 30-60 seconds, continuously, until 05:38 MST.
Factor 2: RCU preemption stalls
The 6.17.0-1006-raspi kernel (Raspberry Pi kernel for Ubuntu 25.10) has a known bug: it generates RCU (Read-Copy-Update) preemption stalls whenever runc:[2:INIT] processes create virtual ethernet pairs during Flannel CNI operations. This shows up in kernel logs as:
```
kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P<pid> }
```

immediately followed by:

```
kernel: cni0: port N(vethXXXXXXXX) entered blocking state
```
These stalls are normally brief and recoverable. The system had been running with them for 54 days without issue. But with D-state NFS threads holding RCU read-side critical sections open, the grace periods that normally let RCU stalls recover could not complete.
Factor 3: All CSI sidecar pods on Jarvis
All 8 CSI sidecar pods (attacher ×2, provisioner ×2, resizer ×2, snapshotter ×2) had landed on Jarvis. Every Longhorn volume attach/detach kicks off container activity on Jarvis, spawning runc:[2:INIT] processes, creating veth pairs, and generating RCU stalls. With 8 pods producing continuous container churn, all of that load was concentrated on the already-struggling node.
The failure cascade looked like this:
```mermaid
flowchart TD
    A[Both nodes rebooted simultaneously\nNov 1 04:22 MST] --> B[Ultron NFS server delayed\n~7 min late]
    B --> C[Longhorn NFS client state\nleft inconsistent on Jarvis]
    C --> D[NFS sessions time out\nNov 2 04:44 MST]
    D --> E[Kernel NFS threads enter\nD-state, uninterruptible]
    E --> F[D-state threads hold\nRCU read-side locks open]
    F --> G[8 CSI sidecar pods\nall on Jarvis, continuous veth churn]
    G --> H[veth creation triggers\nRCU stalls]
    H --> I["RCU stalls escalate:\n41/hr to 216/hr final hour"]
    I --> J[Scheduler cannot make\nforward progress]
    J --> K[Jarvis completely frozen\n05:38 MST, physical reboot required]
    style K fill:#c00,color:#fff
```
Immediate fixes applied:
1. NFS soft mount: Changed the Longhorn BackupTarget URL to include `?nfsOptions=soft,timeo=100,retrans=3`. With soft semantics, NFS threads give up and return `EIO` after ~30 seconds instead of blocking forever. The backup will fail (recoverable), but the kernel won’t hang. The full BackupTarget URL committed to `argocd-apps.yaml`:

   ```
   nfs://10.0.1.11:/backup/longhorn?nfsOptions=soft,timeo=100,retrans=3
   ```

   Side note: there is no `nfsMountTimeout` setting in Longhorn v1.11.0, despite what the documentation implies. The URL query param approach is the only way.

2. hung_task_panic: Added kernel sysctls to auto-reboot if a kernel task is stuck in D-state for more than 120 seconds:

   ```
   sudo sysctl -w kernel.hung_task_timeout_secs=120
   sudo sysctl -w kernel.hung_task_panic=1
   # persisted to /etc/sysctl.d/99-hung-task.conf
   ```

   This means: if a kernel thread hangs in D-state for 2 minutes → kernel panic → automatic reboot in 10 seconds (via the `panic=10` cmdline parameter k3s adds). Physical intervention no longer required.

3. CSI pod spread: Added `topologySpreadConstraints` to all 4 CSI Deployments to force them to distribute across both nodes instead of piling up on Jarvis.
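The spread fix can be sketched as follows. This is a minimal illustration, not the exact manifest: the Deployment name and `app` label are placeholders, and only the `topologySpreadConstraints` stanza is the point.

```yaml
# Sketch: one of the four CSI sidecar Deployments with a hostname spread
# constraint. "csi-attacher" naming here is illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: csi-attacher
  namespace: longhorn-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: csi-attacher
  template:
    metadata:
      labels:
        app: csi-attacher
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                          # replica counts per node may differ by at most 1
          topologyKey: kubernetes.io/hostname # spread across nodes
          whenUnsatisfiable: DoNotSchedule    # hard constraint, not best-effort
          labelSelector:
            matchLabels:
              app: csi-attacher
      containers:
        - name: csi-attacher
          image: longhornio/csi-attacher      # tag omitted; version-specific
```

With `replicas: 2`, `maxSkew: 1`, and two nodes, the scheduler is forced to put one replica on each node instead of both on Jarvis.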
All three fixes committed to Git, synced via ArgoCD.
November 5 – Three Outages in One Night
One week later. Three outages. Same night.
Outage #1 (00:58 MST): Ultron’s kernel started hitting its own RCU stalls from container churn under Prometheus TSDB compaction load. These stalls caused Ultron’s NFS server to stop responding. Jarvis’s NFS client (now with soft mounts) started generating EIO errors. This should have been non-fatal. Instead, the volume of rapid-fire soft-timeout failures saturated Jarvis’s kernel networking layer. The node froze.
Wait. The soft mount fix worked as designed: NFS threads returned EIO in 30 seconds instead of blocking forever. hung_task_panic did not trigger because no single task was stuck for >120 seconds. But the aggregate effect of many short-lived NFS failures flooding the kernel’s network path still killed the node.
hung_task_panic is a mechanism for detecting one stuck task. This failure mode was emergent: thousands of short failures, each individually fine, collectively fatal.
Outage #2 (05:49 MST): Same night. The daily-backup-to-nfs recurring job fired at 03:00, as scheduled. The backup ran. The NFS connection did what it was supposed to: connected, transferred data, timed out when Ultron was under pressure. Over 2h49m the soft-timeout failures accumulated until the network stack froze again.
Outage #3 (the lesson – 05:44 MST the following day): After Outage #2, I applied the P0 fixes: deleted the daily-backup-to-nfs RecurringJob from the live cluster with kubectl delete, cleared the BackupTarget URL. Then I went to sleep.
The backup job was recreated by ArgoCD within 20 seconds.
I hadn’t committed the changes to Git. The local file edits were in my working directory but never staged or pushed. ArgoCD’s selfHeal: true saw the live cluster diverging from the Git state (the job was gone from the live cluster but still present in Gitea), so it reconciled: it recreated the job. Every time I deleted it, ArgoCD brought it back. This continued through the next night’s 03:00 backup window.
The third outage was caused by the same mechanism, made inevitable by the same GitOps lesson I should have learned two weeks earlier:
In a GitOps cluster with selfHeal: true, a live kubectl change without a matching Git commit is not a fix. It will be reverted within minutes.
```mermaid
sequenceDiagram
    participant ME as Operator
    participant LIVE as Live Cluster
    participant GIT as Gitea
    participant ARGOCD as ArgoCD
    ME->>LIVE: kubectl delete recurringjob daily-backup-to-nfs
    LIVE-->>ME: deleted
    Note over ARGOCD: Reconcile loop runs (20s)
    ARGOCD->>GIT: Check desired state
    GIT-->>ARGOCD: daily-backup-to-nfs still in longhorn-backups.yaml
    ARGOCD->>LIVE: Create recurringjob daily-backup-to-nfs
    LIVE-->>ME: job back
    ME->>LIVE: kubectl delete recurringjob daily-backup-to-nfs
    Note over ME: This loop repeats until you push to Git
```
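The loop above is exactly what selfHeal is designed to do. A minimal sketch of the ArgoCD Application sync policy behind it, assuming an in-LAN Gitea URL and repo path (both illustrative):

```yaml
# Sketch of the Application driving this reconcile behavior.
# repoURL and path are assumptions, not the real repo layout.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: longhorn-backups
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitea.lan/homelab/infrastructure.git  # illustrative
    path: infrastructure
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: longhorn-system
  syncPolicy:
    automated:
      prune: true     # delete live resources that were removed from Git
      selfHeal: true  # revert ANY live drift, including emergency kubectl deletes
```

With `selfHeal: true`, ArgoCD treats a `kubectl delete` as drift to repair, not as an operator decision to respect.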
The permanent fix was:

- Edit `infrastructure/longhorn-backups.yaml`: replace `daily-backup-to-nfs` with `daily-snapshot` (local snapshot, no NFS)
- Edit `argocd-apps.yaml`: set `backupTarget: ""`
- `git add` → `git commit` → `git push`
- `kubectl apply -f argocd-apps.yaml -n argocd`
- ArgoCD sync
After that: job gone for real, BackupTarget empty, no more Longhorn polling NFS every 5 minutes.
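The replacement job, roughly as it would appear in longhorn-backups.yaml. This is a sketch against Longhorn's RecurringJob CRD; the cron, retain, and group values shown are illustrative, not my exact config.

```yaml
# Sketch: snapshot-only RecurringJob. task: snapshot keeps data on-node;
# task: backup would have pushed to the (now-empty) backup target.
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: daily-snapshot
  namespace: longhorn-system
spec:
  name: daily-snapshot
  task: snapshot        # local Longhorn snapshot, no NFS/S3 involved
  cron: "0 3 * * *"     # illustrative schedule
  retain: 7             # keep a week of snapshots
  concurrency: 1
  groups:
    - default           # applies to volumes in the default group
```

Since `task: snapshot` never touches the backup target, there is no NFS client activity at all, which is the entire point of the change.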
November 7-8 – The CronJob That Kept Killing Jarvis
On November 7 at 05:10 MST, Jarvis froze. Physical reboot. November 8 at 05:58 MST, frozen again. Both times: same kernel signature, same D-state threads.
I had removed the Longhorn NFS backup. But there was still an NFS mount on Jarvis.
The local-path-backup CronJob, which backed up over NFS, had also been put on my “remove this” list, but I had only suspended it, not fully replaced it. The CronJob was scheduled for 02:30 MST. It ran, mounted the NFS PV nfs-backup-pv (pointing at 10.0.1.11:/backup/local-path), and the exact same failure chain played out: Ultron RCU stalls, NFS I/O hang on Jarvis, kernel freeze.
The correct fix was to replace the NFS mount entirely with something that runs in userspace and can’t take down the kernel if the server is unreachable.
The SSH rsync approach
Instead of mounting NFS and running a backup inside a pod, I replaced the CronJob with a pod that does rsync over SSH. SSH is entirely userspace. If the SSH connection fails (ConnectTimeout=30), the rsync process exits with a clean error code. No kernel threads block. No D-state. The backup fails gracefully instead of crashing the node.
Architecture: local-path-backup (after fix)
```
┌─────────────────────────────────────────────────────────────┐
│ Jarvis (02:30 MST daily)                                    │
│                                                             │
│  CronJob: local-path-backup                                 │
│  Image: alpine:3.21 (pre-cached on-node)                    │
│                                                             │
│  Volumes:                                                   │
│   /var/lib/rancher/k3s/storage → /storage (read-only)       │
│   /usr → /host-usr (read-only, host binaries)               │
│   /lib/aarch64-linux-gnu → /host-lib (host glibc)           │
│   Secret: local-path-backup-ssh-key → /ssh-key (0400)       │
│   ConfigMap: backup script → /scripts/backup.sh             │
│                                                             │
│  Command:                                                   │
│   rsync -az --bwlimit=5000 \                                │
│     -e "ssh -i /ssh-key/id_ed25519 ..." \                   │
│     /storage/ victor@10.0.1.11:/backup/local-path/          │
└────────────────────────────┬────────────────────────────────┘
                             │ SSH (TCP 22) — userspace
                             ▼
┌─────────────────────────────────────────────────────────────┐
│ Ultron                                                      │
│ /backup/local-path/ (4 IoT PVCs, ~550MB)                    │
└─────────────────────────────────────────────────────────────┘
```
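Condensed into a manifest, the architecture above looks roughly like this. A sketch: the mounts, image, and schedule match the diagram, but the volume names, `concurrencyPolicy`, and `restartPolicy` are my assumptions.

```yaml
# Sketch of the SSH-rsync backup CronJob. Host paths come straight from
# the diagram; everything else is illustrative scaffolding.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: local-path-backup
spec:
  schedule: "30 2 * * *"        # 02:30 daily (controller-local time)
  concurrencyPolicy: Forbid     # assumption: never overlap backup runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: alpine:3.21          # pre-cached on-node, no registry pull
              command: ["/bin/sh", "/scripts/backup.sh"]
              volumeMounts:
                - { name: storage,  mountPath: /storage,  readOnly: true }
                - { name: host-usr, mountPath: /host-usr, readOnly: true }
                - { name: host-lib, mountPath: /host-lib, readOnly: true }
                - { name: ssh-key,  mountPath: /ssh-key,  readOnly: true }
                - { name: scripts,  mountPath: /scripts }
          volumes:
            - name: storage
              hostPath: { path: /var/lib/rancher/k3s/storage }
            - name: host-usr
              hostPath: { path: /usr }
            - name: host-lib
              hostPath: { path: /lib/aarch64-linux-gnu }
            - name: ssh-key
              secret:
                secretName: local-path-backup-ssh-key
                defaultMode: 0400
            - name: scripts
              configMap: { name: local-path-backup-script }  # name is illustrative
```

Everything inside the container is userspace: if SSH can't connect, the job fails with a nonzero exit code and the kernel never gets involved.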
The implementation had a few interesting sub-problems.
Problem: Pods can’t reach the internet
My first instinct was to use alpine:3.21 and run apk add rsync openssh. This produced no error and no package installation. Turns out pods on Jarvis can’t reach external HTTPS (TCP RST to CDN IPs). apk add fails silently when it can’t reach the mirror. Great.
Problem: Pre-packaged images aren’t available either
Tried using instrumentisto/rsync (a Docker Hub image with rsync + ssh pre-installed). Docker Hub was also unreachable from the pod. Can’t pull the image.
Solution: Mount host binaries
Jarvis host (Ubuntu 25.10, arm64) has rsync 3.4.1 and OpenSSH 10.0p2 installed. The solution was to mount the host’s /usr directory and the glibc dynamic linker into the Alpine container, then invoke the host binaries using the host’s dynamic linker:
```
LD=/host-lib/ld-linux-aarch64.so.1
exec $LD --library-path /host-lib /host-usr/bin/rsync -az ...
```
This bypasses the container’s libc entirely and uses the host’s. It’s a bit unorthodox but works reliably.
Problem: YAML quoting destroys the SSH wrapper script
The initial approach put the SSH wrapper script in the CronJob’s args field as an inline shell command. The "$@" in the script’s printf format string was being mangled to "" by the YAML-to-JSON-to-shell pipeline. Jobs were failing instantly with no useful error.
The fix was moving the entire script to a ConfigMap and mounting it at /scripts/backup.sh. ConfigMap content isn’t subject to YAML inline escaping.
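A sketch of that ConfigMap follows. The script body here is abbreviated and the ConfigMap name is illustrative; the real setup used a separate SSH wrapper script, while this version inlines the ssh invocation for brevity (rsync splits the `-e` string on whitespace, so no shell quoting layer is involved).

```yaml
# Sketch: the backup script as literal ConfigMap file content, immune to
# the YAML-args quoting that mangled "$@" in the inline-args approach.
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-path-backup-script   # illustrative name
data:
  backup.sh: |
    #!/bin/sh
    set -eu
    # Run host binaries via the host's dynamic linker (linker trick above).
    LD=/host-lib/ld-linux-aarch64.so.1
    exec $LD --library-path /host-lib /host-usr/bin/rsync -az --bwlimit=5000 \
      -e "$LD --library-path /host-lib /host-usr/bin/ssh -i /ssh-key/id_ed25519 -o ConnectTimeout=30" \
      /storage/ victor@10.0.1.11:/backup/local-path/
```

Because the script arrives as file content rather than a YAML string field, characters like `"$@"` survive untouched.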
Problem: SSH key permissions
The private key was mounted from a Kubernetes Secret with defaultMode: 0400. OpenSSH requires this. But the pod securityContext had fsGroup: 0, which causes Kubernetes to set the group ownership of Secret volume files to 0 and add group-read bit, resulting in 0440 permissions. SSH refused to use it.
Fix: remove fsGroup: 0 from the pod security context. The defaultMode: 0400 then takes full effect.
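As a pod-spec fragment, the final Secret volume configuration looks like this (the commented-out line is the one that caused the 0440 problem):

```yaml
# Fragment: with no fsGroup in securityContext, defaultMode applies as-is.
spec:
  # securityContext:
  #   fsGroup: 0        # removed: forced group-read bit, producing 0440
  volumes:
    - name: ssh-key
      secret:
        secretName: local-path-backup-ssh-key
        defaultMode: 0400   # owner read-only, strict enough for OpenSSH
```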
After all that, the backup ran successfully:
```
Number of files: 6,016 (reg: 5,767, dir: 249)
Number of regular files transferred: 5
Total transferred file size: 68.41K bytes
sent 1.94M bytes  received 85.53K bytes  368.58K bytes/sec
=== backup complete: 2025-11-08T23:27:27+00:00 ===
```
Nine seconds. Four IoT PVCs backed up to Ultron. No NFS. No kernel threads at risk.
NFS Fully Removed
With every backup mechanism replaced, I removed NFS entirely from the cluster:
- Stopped and disabled `nfs-kernel-server` on Ultron
- Removed all NFS export entries from Ultron’s `/etc/exports`
- Deleted the `allow-nfs-backup-egress` NetworkPolicy from `longhorn-system` (the policy allowing TCP 2049 from pods to Ultron)
- Committed the NetworkPolicy deletion to Git, synced via ArgoCD
From this point on, no pod on Jarvis ever touches a kernel NFS mount. The backup mechanism is SSH rsync, entirely userspace, proven to fail gracefully.
Takeaways
Kernel NFS mounts on a control-plane node are a single point of failure tied to the NFS server’s health. If the server has any problem (RCU stalls, reboot, network glitch) the kernel threads on the client will block indefinitely with hard mounts, and fail-flood with soft mounts. Both paths lead to cluster instability.
Soft NFS mounts reduce severity but don’t eliminate the coupling. The soft mount fix gave me more time and avoided permanent hangs, but rapid-fire EIO returns can still saturate the kernel’s networking layer.
In a GitOps cluster with `selfHeal: true`, you cannot fix a production issue with `kubectl` alone. The Git repository is the authoritative source. Any change that isn’t committed and pushed is gone within one reconcile cycle. Emergency fix procedure: `kubectl apply` the immediate mitigation, then immediately `git push` the same change. Do not sleep until both are done.

SSH rsync is a safer backup mechanism than NFS for this use case. It is userspace, has clean failure modes, and involves no kernel threads.

Bandwidth-throttle your backups. Unthrottled rsync over gigabit saturates a Pi’s CPU and network stack; I found this out in November 13’s outage. `--bwlimit=5000` (~5 MB/s) keeps the system stable. More on that in Part 4.
Part 4 is about what happens after the NFS fixes: the recurring containerd sandbox leak pattern that appeared every time the cluster hard-rebooted, and how it eventually got automated away.