Part 4 covered the containerd sandbox leak problem and the ExecStartPre fix. That solved the sandbox leaks, but not the underlying issue that kept forcing those hard reboots in the first place.
This is Part 5: the kernel RCU stall problem, why it’s dangerous on a Raspberry Pi running k3s, the mitigations I layered on, and the moment the cluster finally handled an outage without me.
What Is an RCU Stall?
RCU stands for Read-Copy-Update. It’s a synchronization mechanism built into the Linux kernel for situations where reads are very frequent and writes are rare. The basic idea: readers don’t take locks. A writer makes a copy of the data, updates the copy, and points the system at the new version, so new readers see it right away. Readers that started before the switch may still be using the old version, so the writer waits until all of them are done before freeing it. That waiting period is called a “grace period.”
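To make that sequence concrete, here is a toy model in Python. It is a single-threaded sketch for illustration only, not how the kernel implements RCU. `ToyRcu` and all of its method names are made up for this example (`read_lock` and `read_unlock` loosely echo the kernel's `rcu_read_lock()` and `rcu_read_unlock()`), and the "route" data is just a placeholder.

```python
class ToyRcu:
    """Toy model of the RCU idea; names and structure are hypothetical."""

    def __init__(self, data):
        self.current = data    # the published version new readers will see
        self.active = set()    # ids of readers currently inside a read section
        self.pending = []      # (old_version, readers we still have to wait for)
        self.reclaimed = []    # old versions that were safe to free

    def read_lock(self, reader_id):
        # Readers never wait: they just grab whatever version is published.
        self.active.add(reader_id)
        return self.current

    def read_unlock(self, reader_id):
        self.active.discard(reader_id)
        for _, waiting in self.pending:
            waiting.discard(reader_id)
        self._reclaim()

    def update(self, change):
        old = self.current
        new = dict(old)        # copy
        change(new)            # update the copy
        self.current = new     # publish: new readers see this immediately
        # Grace period: wait only for readers that started before the publish.
        self.pending.append((old, set(self.active)))
        self._reclaim()

    def _reclaim(self):
        # A grace period ends once nobody is left who could hold the old version.
        done = [old for old, waiting in self.pending if not waiting]
        self.pending = [(old, w) for old, w in self.pending if w]
        self.reclaimed.extend(done)


rcu = ToyRcu({"route": "eth0"})
seen_by_a = rcu.read_lock("A")                  # A starts reading the old version
rcu.update(lambda d: d.update(route="wlan0"))   # writer publishes a new version
seen_by_b = rcu.read_lock("B")                  # B starts after the publish
grace_period_open = bool(rcu.pending)           # still waiting on A
rcu.read_unlock("A")                            # A finishes; grace period ends
```

Reader A keeps seeing the old version for as long as it stays in its read section, reader B sees the new one, and the old version is only moved to `reclaimed` once A calls `read_unlock`. In this model, if A never called `read_unlock`, the old version would sit in `pending` indefinitely.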