

Building a Self-Healing k3s Homelab (Part 1): Foundation

Over the past few years I’ve been accumulating Raspberry Pis and single-board computers like some people accumulate unfinished side projects. It started back in school in France: breaking Linux installations, running a Pi-hole server, building a Magic Mirror. The nice thing about Pis is they’re small enough to throw in a bag. Through multiple moves and transatlantic travel they came with me, always finding some new use. Now I’ve finally settled in Calgary, and I have a pile of these devices sitting around.

Building a Self-Healing k3s Homelab (Part 2): Multi-Node, GitOps, and Growing Pains

Part 1 covered the hardware, k3s, the GitOps setup with ArgoCD and Gitea, Longhorn storage, and the monitoring stack. On paper, everything was clean. In practice, the first few weeks were messier.

This is Part 2: the story of expanding from one node to two, migrating workloads, hardening resources, and accidentally wiping the entire ArgoCD control plane.


Starting Point: Why Two Nodes

The original cluster was single-node. Jarvis ran everything: Home Assistant, GitOps, monitoring, storage, all of it. That works fine until you try to schedule any workload with real memory demands. Prometheus needs at least 700MB. Grafana takes another 300MB. Add Gitea and its PostgreSQL instance, and you’re staring at 2GB of non-home-automation workloads on a node that also has to run the k3s API server, Longhorn, and every IoT integration.
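Those numbers only help the scheduler if they show up as explicit requests. A hedged sketch of what a heavy pod's resource stanza might look like; the values are illustrative, loosely based on the rough figures above, not taken from the actual manifests:

```yaml
# Illustrative only: explicit requests make the scheduler account for the
# memory these components actually use instead of overcommitting the node.
resources:
  requests:
    memory: "700Mi"   # the Prometheus floor cited above
    cpu: "250m"       # assumed value, not from the post
  limits:
    memory: "1Gi"     # assumed headroom above the request
```

With requests set, a second node gives the scheduler somewhere to put pods that no longer fit, rather than silently stacking them on one box.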

Building a Self-Healing k3s Homelab (Part 3): The NFS Nightmare

Part 2 ended with a functioning two-node cluster: all workloads placed correctly, resource limits set, monitoring green. In late October 2025, everything looked good.

Early November brought five outages in nine days, all rooted in the same mistake: putting a kernel-level NFS mount on the control-plane node. This is the story of that mistake, in all its embarrassing detail.


The Backup Plan

Longhorn provides volume snapshots natively. A snapshot is taken locally, stored in Longhorn’s own format, and can be restored. That’s fine for rolling back a bad deployment. For a node failure followed by a disk failure, you need something off-node.
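Longhorn’s off-node mechanism is a backup target (S3-compatible or NFS) plus a recurring backup job. A minimal sketch, assuming a backup target is already configured in Longhorn’s settings; the name, schedule, and retention below are illustrative, not the post’s actual config:

```yaml
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: nightly-backup        # illustrative name
  namespace: longhorn-system
spec:
  task: backup                # full backup to the configured backup target
  cron: "0 3 * * *"           # nightly at 03:00
  retain: 7                   # keep a week of backups
  concurrency: 1
  groups:
    - default                 # applies to volumes in the default group
```

Unlike a local snapshot, a backup produced this way survives losing both the node and its disk.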

Building a Self-Healing k3s Homelab (Part 4): Containerd's Ghost Sandboxes

Part 3 covered the NFS outage series and the fixes: soft mounts, then SSH rsync, then full NFS removal. By November 9, no kernel NFS mount existed anywhere in the cluster.

The cluster kept going down.

This is Part 4: the containerd sandbox leak pattern that appeared after every hard reboot, why it happens, what it does to the cluster, and how I finally automated the fix.


The Pattern

Every time Jarvis suffered a hard, unclean shutdown (a power cycle to recover from a freeze, or a hardware watchdog reboot), the same thing happened when k3s came back up:

Building a Self-Healing k3s Homelab (Part 5): RCU Stalls, Watchdogs, and Actually Healing

Part 4 covered the containerd sandbox leak problem and the ExecStartPre fix. The sandbox leaks were solved. But there was still an underlying issue that kept forcing those hard reboots in the first place.
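The ExecStartPre fix referenced above can be sketched as a systemd drop-in. The path and the exact cleanup command here are assumptions for illustration, not necessarily what the post ships:

```ini
# /etc/systemd/system/k3s.service.d/10-sandbox-cleanup.conf
# Assumed sketch: remove leftover pod sandboxes from an unclean shutdown
# before k3s starts. The leading "-" tells systemd to ignore failures,
# so a clean boot (with nothing to remove) doesn't block startup.
[Service]
ExecStartPre=-/usr/local/bin/k3s crictl rmp --all --force
```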

This is Part 5: the kernel RCU stall problem, why it’s dangerous on a Raspberry Pi running k3s, the mitigations I layered on, and the moment the cluster finally handled an outage without me.


What Is an RCU Stall?

RCU stands for Read-Copy-Update. It’s a synchronization mechanism built into the Linux kernel for situations where reads are very frequent and writes are rare. The basic idea: readers don’t take locks. Writers make a copy of the data, update the copy, publish it by swapping a pointer, and then wait until all pre-existing readers are done before freeing the old copy. That wait is called a “grace period.”