
Building a Self-Healing k3s Homelab (Part 3): The NFS Nightmare

Part 2 ended with a functioning two-node cluster: workloads placed correctly, resource limits set, monitoring green. As of late October 2025, everything looked good.

Early November brought five outages in nine days, all rooted in the same mistake: putting a kernel-level NFS mount on the control-plane node. This is the story of that mistake, in all its embarrassing detail.


The Backup Plan

Longhorn provides volume snapshots natively. A snapshot is taken locally, stored in Longhorn’s own format, and can be restored. That’s fine for recovering from a bad deployment. For a node failure followed by a disk failure, you need something off-node.

Longhorn supports backup targets: an S3 bucket or an NFS share where it sends backup data. Since I had Ultron right there on the same LAN with a few hundred GB of free space on its SSD, I mounted a directory over NFS:

Ultron: /backup/longhorn (NFS export)
Jarvis: mounts via Longhorn's BackupTarget CRD

I also set up a local-path-backup CronJob to copy the HA config PVCs from Jarvis’s local-path storage to Ultron’s /backup/local-path directory. Same approach: NFS.

At the time this seemed completely reasonable. Two machines on a gigabit LAN. NFS is fast. Easy to set up. What could go wrong.
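
For context, the server side was just a stock nfs-kernel-server export. A minimal /etc/exports on Ultron would look something like this — a sketch, since the article only gives the paths and the LAN; the export options themselves are assumptions:

```yaml
# /etc/exports on Ultron (sketch — rw/sync/no_subtree_check are
# assumed options, not copied from the original file)
# /backup/longhorn     10.0.1.0/24(rw,sync,no_subtree_check)
# /backup/local-path   10.0.1.0/24(rw,sync,no_subtree_check)
```

After editing, `sudo exportfs -ra` reloads the export table, and `showmount -e 10.0.1.11` from Jarvis confirms the shares are visible.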


November 2 – Outage #1 (05:38 MST)

The first symptom was the cluster becoming completely unresponsive at 05:38 MST. Home Assistant offline. ArgoCD offline. Grafana flat-lined. kubectl couldn’t reach the API server. A physical power cycle fixed it.

After the reboot, I pulled the kernel logs and started reading. What I found was a multi-factor failure chain.

Factor 1: Hard NFS mount semantics

Linux NFS mounts default to “hard” mode. In hard mode, when the NFS server becomes unreachable, the kernel threads waiting on NFS I/O block in uninterruptible D-state indefinitely. They cannot be signaled, killed, or interrupted. They just sit there, holding kernel locks, waiting for a server that isn’t coming back.
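
You can watch this happen from userspace: D-state tasks show up in the `ps` state column. A generic one-liner (nothing cluster-specific) to list them:

```shell
# List tasks in uninterruptible sleep (state "D"). The wchan column
# shows the kernel function each one is blocked in — with a hung hard
# NFS mount you'll typically see nfs/rpc wait functions there.
# These tasks ignore SIGKILL; only the server coming back (or a
# reboot) releases them.
ps -eo state,pid,wchan:30,comm | awk 'NR == 1 || $1 ~ /^D/'
```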

The NFS server was Ultron’s nfs-kernel-server. The previous day, both Jarvis and Ultron had been rebooted for maintenance. Ultron came back online about 7 minutes after Jarvis. During those 7 minutes, Longhorn tried to reconnect to the NFS backup target. The connection failed, leaving the kernel’s NFS client state on Jarvis inconsistent.

About 22 hours later (04:44 MST on November 2), those stale NFS sessions started timing out, and the Jarvis kernel began logging:

Nov 02 04:44:29 Jarvis kernel: nfs: server 10.0.1.11 not responding, timed out

And kept logging it every 30-60 seconds, continuously, until 05:38 MST.

Factor 2: RCU preemption stalls

The 6.17.0-1006-raspi kernel (Raspberry Pi kernel for Ubuntu 25.10) has a known bug: it generates RCU (Read-Copy-Update) preemption stalls whenever runc:[2:INIT] processes create virtual ethernet pairs during Flannel CNI operations. This shows up in kernel logs as:

kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P<pid> }
(immediately followed by)
kernel: cni0: port N(vethXXXXXXXX) entered blocking state

These stalls are normally brief and recoverable. The system had been running with them for 54 days without issue. But with D-state NFS threads holding RCU read-side critical sections open, the grace periods that normally let RCU stalls recover could not complete.

Factor 3: All CSI sidecar pods on Jarvis

All 8 CSI sidecar pods (attacher ×2, provisioner ×2, resizer ×2, snapshotter ×2) had landed on Jarvis. Every Longhorn volume attach/detach kicks off container activity on Jarvis, spawning runc:[2:INIT] processes, creating veth pairs, and generating RCU stalls. With 8 pods producing continuous container churn, all of that load was concentrated on the already-struggling Jarvis node.

The failure cascade looked like this:

flowchart TD
    A[Both nodes rebooted simultaneously\nNov 1 04:22 MST] --> B[Ultron NFS server delayed\n~7 min late]
    B --> C[Longhorn NFS client state\nleft inconsistent on Jarvis]
    C --> D[NFS sessions time out\nNov 2 04:44 MST]
    D --> E[Kernel NFS threads enter\nD-state, uninterruptible]
    E --> F[D-state threads hold\nRCU read-side locks open]
    F --> G[8 CSI sidecar pods\nall on Jarvis, continuous veth churn]
    G --> H[veth creation triggers\nRCU stalls]
    H --> I["RCU stalls escalate:\n41/hr to 216/hr final hour"]
    I --> J[Scheduler cannot make\nforward progress]
    J --> K[Jarvis completely frozen\n05:38 MST, physical reboot required]

    style K fill:#c00,color:#fff

Immediate fixes applied:

  1. NFS soft mount: Changed the Longhorn BackupTarget URL to include ?nfsOptions=soft,timeo=100,retrans=3. With soft semantics, NFS threads will give up and return EIO after ~30 seconds instead of blocking forever. The backup will fail (recoverable), the kernel won’t hang.

    The full BackupTarget URL committed to argocd-apps.yaml:

    nfs://10.0.1.11:/backup/longhorn?nfsOptions=soft,timeo=100,retrans=3
    

    Side note: there is no nfsMountTimeout setting in Longhorn v1.11.0 despite what the documentation implies. The URL query param approach is the only way.

  2. hung_task_panic: Added kernel sysctl to auto-reboot if a kernel task is stuck in D-state for >120 seconds:

    sudo sysctl -w kernel.hung_task_timeout_secs=120
    sudo sysctl -w kernel.hung_task_panic=1
    # persisted to /etc/sysctl.d/99-hung-task.conf
    

    This means: if a kernel thread hangs in D-state for 2 minutes → kernel panic → automatic reboot in 10 seconds (via the panic=10 cmdline parameter k3s adds). Physical intervention no longer required.

  3. CSI pod spread: Added topologySpreadConstraints to all 4 CSI Deployments to force them to distribute across both nodes instead of piling up on Jarvis.
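
The constraint itself is standard Kubernetes scheduling config. Roughly this shape on each of the four Deployments — a sketch, with the label selector as an illustrative assumption rather than the exact Longhorn labels:

```yaml
# Patch applied to each CSI Deployment (csi-attacher shown) — sketch.
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                            # at most 1 pod imbalance
          topologyKey: kubernetes.io/hostname   # spread across nodes
          whenUnsatisfiable: DoNotSchedule      # hard requirement
          labelSelector:
            matchLabels:
              app: csi-attacher                 # assumed pod label
```

With 2 replicas per Deployment and maxSkew: 1, the scheduler has to place one replica on each node.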

All three fixes committed to Git, synced via ArgoCD.
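
For reference, if the Longhorn settings flow through Helm values in argocd-apps.yaml, the committed soft-mount change is essentially one string — this is a sketch of the shape, not the actual file:

```yaml
# Longhorn Helm values (sketch). The query params after "?" are passed
# through as NFS mount options, which is how the client ends up soft.
defaultSettings:
  backupTarget: "nfs://10.0.1.11:/backup/longhorn?nfsOptions=soft,timeo=100,retrans=3"
```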


November 5 – Three Outages in One Night

One week later: three more outages, two in one night and a third the following morning.

Outage #1 (00:58 MST): Ultron’s kernel started hitting its own RCU stalls from container churn under Prometheus TSDB compaction load. These stalls caused Ultron’s NFS server to stop responding. Jarvis’s NFS client (now with soft mounts) started generating EIO errors. This should have been non-fatal. Instead, the volume of rapid-fire soft-timeout failures saturated Jarvis’s kernel networking layer. The node froze.

Wait. The soft mount fix worked as designed: NFS threads returned EIO in 30 seconds instead of blocking forever. hung_task_panic did not trigger because no single task was stuck for >120 seconds. But the aggregate effect of many short-lived NFS failures flooding the kernel’s network path still killed the node.

hung_task_panic is a mechanism for detecting one stuck task. This failure mode was emergent: thousands of short failures, each individually fine, collectively fatal.

Outage #2 (05:49 MST): Same night. The daily-backup-to-nfs recurring job fired at 03:00, as scheduled. The backup ran. The NFS connection did what it was supposed to: connected, transferred data, timed out when Ultron was under pressure. Over 2h49m the soft-timeout failures accumulated until the network stack froze again.

Outage #3 (the lesson – 05:44 MST the following day): After Outage #2, I applied the P0 fixes: deleted the daily-backup-to-nfs RecurringJob from the live cluster with kubectl delete, cleared the BackupTarget URL. Then I went to sleep.

The backup job was recreated by ArgoCD within 20 seconds.

I hadn’t committed the changes to Git. The local file edits were in my working directory but never staged or pushed. ArgoCD’s selfHeal: true saw the live cluster diverging from the desired state in Git (the job had been deleted live but was still declared in Gitea), so it reconciled: it recreated the job. Every time I deleted it, ArgoCD brought it back. This continued through the next night’s 03:00 backup window.

The third outage was caused by the same mechanism, and made inevitable because I had ignored the same GitOps lesson I should have learned two weeks earlier:

In a GitOps cluster with selfHeal: true, a live kubectl change without a matching Git commit is not a fix. It will be reverted within minutes.

sequenceDiagram
    participant ME as Operator
    participant LIVE as Live Cluster
    participant GIT as Gitea
    participant ARGOCD as ArgoCD

    ME->>LIVE: kubectl delete recurringjob daily-backup-to-nfs
    LIVE-->>ME: deleted
    Note over ARGOCD: Reconcile loop runs (20s)
    ARGOCD->>GIT: Check desired state
    GIT-->>ARGOCD: daily-backup-to-nfs still in longhorn-backups.yaml
    ARGOCD->>LIVE: Create recurringjob daily-backup-to-nfs
    LIVE-->>ME: job back
    ME->>LIVE: kubectl delete recurringjob daily-backup-to-nfs
    Note over ME: This loop repeats until you push to Git

The permanent fix was:

  1. Edit infrastructure/longhorn-backups.yaml: replace daily-backup-to-nfs with daily-snapshot (local snapshot, no NFS)
  2. Edit argocd-apps.yaml: set backupTarget: ""
  3. git add → git commit → git push
  4. kubectl apply -f argocd-apps.yaml -n argocd
  5. ArgoCD sync

After that: job gone for real, BackupTarget empty, no more Longhorn polling NFS every 5 minutes.
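
The replacement job is an ordinary Longhorn RecurringJob with task: snapshot — something like this (a sketch; the cron schedule, retention, and group are assumptions):

```yaml
# Sketch of the local-only replacement in longhorn-backups.yaml.
# task: snapshot keeps everything on-node — no backup target, no NFS.
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: daily-snapshot
  namespace: longhorn-system
spec:
  task: snapshot        # local Longhorn snapshot, not a backup
  cron: "0 3 * * *"     # assumed schedule
  retain: 7             # assumed retention count
  concurrency: 1
  groups:
    - default           # assumed: volumes in the default group
```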


November 7-8 – The CronJob That Kept Killing Jarvis

On November 7 at 05:10 MST, Jarvis froze. Physical reboot. November 8 at 05:58 MST, frozen again. Both times: same kernel signature, same D-state threads.

I had removed the Longhorn NFS backup. But there was still an NFS mount on Jarvis.

The local-path-backup CronJob backup-to-NFS had also been put on my “remove this” list, but I had only suspended it, not fully replaced it. The CronJob was scheduled for 02:30 MST. It ran, mounted the NFS PV backed by nfs-backup-pv (pointing at 10.0.1.11:/backup/local-path), and the exact same failure chain played out: Ultron RCU stalls, NFS I/O hang on Jarvis, kernel freeze.
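
The PV behind that mount was a plain in-tree NFS volume — reconstructed here as a sketch (the name, server, and path come from the article; capacity and the rest are assumptions):

```yaml
# Sketch of nfs-backup-pv — the kernel NFS client mounts this on
# whichever node runs the CronJob pod, i.e. Jarvis.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-backup-pv
spec:
  capacity:
    storage: 50Gi                # assumed size
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 10.0.1.11
    path: /backup/local-path
```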

The correct fix was to replace the NFS mount entirely with something that runs in userspace and can’t take down the kernel if the server is unreachable.

The SSH rsync approach

Instead of mounting NFS and running a backup inside a pod, I replaced the CronJob with a pod that does rsync over SSH. SSH is entirely userspace. If the SSH connection fails (ConnectTimeout=30), the rsync process exits with a clean error code. No kernel threads block. No D-state. The backup fails gracefully instead of crashing the node.

Architecture: local-path-backup (after fix)

┌─────────────────────────────────────────────────────────────┐
│ Jarvis (02:30 MST daily)                                    │
│                                                             │
│  CronJob: local-path-backup                                 │
│  Image: alpine:3.21 (pre-cached on-node)                   │
│                                                             │
│  Volumes:                                                   │
│    /var/lib/rancher/k3s/storage → /storage (read-only)      │
│    /usr → /host-usr (read-only, host binaries)              │
│    /lib/aarch64-linux-gnu → /host-lib (host glibc)         │
│    Secret: local-path-backup-ssh-key → /ssh-key (0400)     │
│    ConfigMap: backup script → /scripts/backup.sh           │
│                                                             │
│  Command:                                                   │
│    rsync -az --bwlimit=5000 \                               │
│      -e "ssh -i /ssh-key/id_ed25519 ..." \                  │
│      /storage/ victor@10.0.1.11:/backup/local-path/     │
└────────────────────────────┬────────────────────────────────┘
                             │ SSH (TCP 22) — userspace
                             ▼
┌─────────────────────────────────────────────────────────────┐
│ Ultron                                                      │
│  /backup/local-path/ (4 IoT PVCs, ~550MB)                  │
└─────────────────────────────────────────────────────────────┘

The implementation had a few interesting sub-problems.

Problem: Pods can’t reach the internet

My first instinct was to use alpine:3.21 and run apk add rsync openssh. This produced no error and no package installation. Turns out pods on Jarvis can’t reach external HTTPS (TCP RST to CDN IPs). apk add fails silently when it can’t reach the mirror. Great.

Problem: Pre-packaged images aren’t available either

Tried using instrumentisto/rsync (a Docker Hub image with rsync + ssh pre-installed). Docker Hub was also unreachable from the pod. Can’t pull the image.

Solution: Mount host binaries

Jarvis host (Ubuntu 25.10, arm64) has rsync 3.4.1 and OpenSSH 10.0p2 installed. The solution was to mount the host’s /usr directory and the glibc dynamic linker into the Alpine container, then invoke the host binaries using the host’s dynamic linker:

LD = /host-lib/ld-linux-aarch64.so.1
exec $LD --library-path /host-lib /host-usr/bin/rsync -az ...

This bypasses the container’s libc entirely and uses the host’s. It’s a bit unorthodox but works reliably.

Problem: YAML quoting destroys the SSH wrapper script

The initial approach put the SSH wrapper script in the CronJob’s args field as an inline shell command. The "$@" in the script’s printf format string was being mangled to "" by the YAML-to-JSON-to-shell pipeline. Jobs were failing instantly with no useful error.

The fix was moving the entire script to a ConfigMap and mounting it at /scripts/backup.sh. ConfigMap content isn’t subject to YAML inline escaping.
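
The resulting ConfigMap looked roughly like this — a reconstruction, not the committed file: the rsync flags and ConnectTimeout come from the article, while the ConfigMap name, namespace, StrictHostKeyChecking option, and error handling are my additions:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-path-backup-script   # assumed name
  namespace: kube-system           # assumed namespace
data:
  backup.sh: |
    #!/bin/sh
    set -eu
    # Run host binaries through the host's dynamic linker so the
    # Alpine container's musl libc is never involved.
    LD=/host-lib/ld-linux-aarch64.so.1
    SSH="$LD --library-path /host-lib /host-usr/bin/ssh \
      -i /ssh-key/id_ed25519 -o ConnectTimeout=30 \
      -o StrictHostKeyChecking=accept-new"
    echo "=== backup start: $(date -Iseconds) ==="
    $LD --library-path /host-lib /host-usr/bin/rsync \
      -az --stats --bwlimit=5000 -e "$SSH" \
      /storage/ victor@10.0.1.11:/backup/local-path/
    echo "=== backup complete: $(date -Iseconds) ==="
```

Because the script arrives as a mounted file, nothing in it passes through YAML inline escaping, and "$@"-style constructs survive intact.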

Problem: SSH key permissions

The private key was mounted from a Kubernetes Secret with defaultMode: 0400, which OpenSSH requires: it refuses private keys that are group- or world-readable. But the pod securityContext had fsGroup: 0, which makes Kubernetes set the group ownership of Secret volume files to GID 0 and add the group-read bit, resulting in 0440 permissions. SSH refused to use the key.

Fix: remove fsGroup: 0 from the pod security context. The defaultMode: 0400 then takes full effect.
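
In pod-spec terms, the working combination is just these two pieces (a sketch; the secret name comes from the article, the volume name is illustrative):

```yaml
# No fsGroup in securityContext, so defaultMode survives untouched.
securityContext: {}                  # deliberately no fsGroup: 0
volumes:
  - name: ssh-key                    # assumed volume name
    secret:
      secretName: local-path-backup-ssh-key
      defaultMode: 0400              # owner read-only; with fsGroup: 0
                                     # this would have become 0440
```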

After all that, the backup ran successfully:

Number of files: 6,016 (reg: 5,767, dir: 249)
Number of regular files transferred: 5
Total transferred file size: 68.41K bytes
sent 1.94M bytes  received 85.53K bytes  368.58K bytes/sec
=== backup complete: 2025-11-08T23:27:27+00:00 ===

Nine seconds. Four IoT PVCs backed up to Ultron. No NFS. No kernel threads at risk.


NFS Fully Removed

With every backup mechanism replaced, I removed NFS entirely from the cluster:

  • Stopped and disabled nfs-kernel-server on Ultron
  • Removed all NFS export entries from Ultron’s /etc/exports
  • Deleted the allow-nfs-backup-egress NetworkPolicy from longhorn-system (the policy allowing TCP 2049 from pods to Ultron)
  • Committed the NetworkPolicy deletion to Git, synced via ArgoCD
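
The deleted policy had the standard egress shape — reconstructed here as a sketch (the name, namespace, port, and target come from the article; the pod selector is an assumption):

```yaml
# Sketch of the now-deleted allow-nfs-backup-egress policy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-nfs-backup-egress
  namespace: longhorn-system
spec:
  podSelector: {}              # assumed: all pods in longhorn-system
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.1.11/32   # Ultron
      ports:
        - protocol: TCP
          port: 2049             # NFS
```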

From this point on, no pod on Jarvis ever touches a kernel NFS mount. The backup mechanism is SSH rsync, entirely userspace, proven to fail gracefully.


Takeaways

  1. Kernel NFS mounts on a control-plane node are a single point of failure tied to the NFS server’s health. If the server has any problem (RCU stalls, reboot, network glitch) the kernel threads on the client will block indefinitely with hard mounts, and fail-flood with soft mounts. Both paths lead to cluster instability.

  2. Soft NFS mounts reduce severity but don’t eliminate the coupling. The soft mount fix gave me more time and avoided permanent hangs, but rapid-fire EIO returns can still saturate the kernel’s networking layer.

  3. In a GitOps cluster with selfHeal: true, you cannot fix a production issue with kubectl alone. The Git repository is the authoritative source. Any change that isn’t committed and pushed is gone within one reconcile cycle. Emergency fix procedure: kubectl apply the immediate mitigation, then immediately git push the same change. Do not sleep until both are done.

  4. SSH rsync is a safer backup mechanism than NFS for this use case. Userspace, clean failure modes, no kernel threads involved.

  5. Bandwidth-throttle your backups. Unthrottled rsync over gigabit saturates a Pi’s CPU and network stack; I found that out during the November 13 outage. --bwlimit=5000 (roughly 5 MB/s) keeps the system stable. More on that in Part 4.

Part 4 is about what happens after the NFS fixes: the recurring containerd sandbox leak pattern that appeared every time the cluster hard-rebooted, and how it eventually got automated away.