Over the past few years I’ve been accumulating Raspberry Pis and single-board computers like some people accumulate unfinished side projects. It started back in school in France: breaking Linux installations, running a Pi-hole server, building a Magic Mirror. The nice thing about Pis is they’re small enough to throw in a bag. Through multiple moves and transatlantic travel they came with me, always finding some new use. Now I’ve finally settled in Calgary, and I have a pile of these devices sitting around.
Building a Self-Healing k3s Homelab (Part 2): Multi-Node, GitOps, and Growing Pains
Part 1 covered the hardware, k3s, the GitOps setup with ArgoCD and Gitea, Longhorn storage, and the monitoring stack. On paper, everything was clean. In practice, the first few weeks were messier.
This is Part 2: the story of expanding from one node to two, migrating workloads, hardening resources, and accidentally wiping the entire ArgoCD control plane.
Starting Point: Why Two Nodes
The original cluster was single-node. Jarvis ran everything: Home Assistant, GitOps, monitoring, storage, all of it. That works fine until you try to schedule any memory-hungry workload. Prometheus needs at least 700MB. Grafana takes another 300MB. Add Gitea and its PostgreSQL instance, and you’re staring at 2GB of non-home-automation workloads on a node that also has to run the k3s API server, Longhorn, and every IoT integration.
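Those numbers are exactly the kind of thing worth pinning down as explicit requests and limits so the scheduler can reason about them. A sketch of what that might look like for the Prometheus container (the values mirror the figures above; the exact stanza in my manifests may differ):

```yaml
# Hypothetical resource stanza for the Prometheus container.
# The 700Mi request reflects the observed memory floor; the limit
# leaves headroom before the OOM killer steps in.
resources:
  requests:
    cpu: 250m
    memory: 700Mi
  limits:
    memory: 1Gi
```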
Building a Self-Healing k3s Homelab (Part 3): The NFS Nightmare
Part 2 ended with a functioning two-node cluster, all workloads placed correctly, resource limits set, monitoring green. Late October 2025, everything looked good.
Early November brought five outages in nine days, all rooted in the same mistake: a kernel-level NFS mount on the control-plane node. This is that story, in all its embarrassing detail.
The Backup Plan
Longhorn provides volume snapshots natively. A snapshot is taken locally, stored in Longhorn’s own format, and can be restored. That’s fine for a bad deployment. For a node failure followed by a disk failure, you need something off-node.
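Longhorn's answer to the off-node problem is a backup target plus a recurring job. A sketch of the recurring-job side (the name, schedule, and retention here are placeholders, and it assumes a backup target such as an S3 bucket is already configured in Longhorn's settings):

```yaml
# Hypothetical Longhorn RecurringJob: nightly backups to the configured
# off-node backup target, keeping one week of history.
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: nightly-backup
  namespace: longhorn-system
spec:
  cron: "0 3 * * *"   # 03:00 every night
  task: backup
  retain: 7            # keep the last 7 backups
  concurrency: 1
  groups:
    - default          # applies to volumes in the default group
```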
Building a Self-Healing k3s Homelab (Part 4): Containerd's Ghost Sandboxes
Part 3 covered the NFS outage series and the fixes: soft mounts, then SSH rsync, then full NFS removal. By November 9, no kernel NFS mount existed anywhere in the cluster.
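For reference, the intermediate "soft mounts" stage amounts to swapping NFS's default hard semantics for bounded retries, so a dead server eventually returns an I/O error instead of leaving processes stuck in uninterruptible sleep. A hypothetical fstab entry (host, path, and timeout values are placeholders, not the ones from Part 3):

```
# /etc/fstab - soft mount: retry twice with a 5s timeout, then fail
# with EIO rather than hanging the client forever.
nas.local:/export/backups  /mnt/backups  nfs  soft,timeo=50,retrans=2,noatime  0  0
```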
The cluster kept going down.
This is Part 4: the containerd sandbox leak pattern that appeared after every hard reboot, why it happens, what it does to the cluster, and how I finally automated the fix.
The Pattern
Every time Jarvis suffered an unclean shutdown (a power cycle to recover from a freeze, or a hardware-watchdog reboot), the same thing happened when k3s came back up:
Building a Self-Healing k3s Homelab (Part 5): RCU Stalls, Watchdogs, and Actually Healing
Part 4 covered the containerd sandbox leak problem and the ExecStartPre fix. The sandbox leaks were solved. But there was still an underlying issue that kept forcing those hard reboots in the first place.
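The shape of that ExecStartPre fix, sketched as a systemd drop-in (the file path and script name are my placeholders; Part 4 has the real version and the cleanup logic itself):

```ini
# /etc/systemd/system/k3s.service.d/10-sandbox-cleanup.conf (hypothetical path)
[Service]
# Before k3s starts, sweep up containerd pod sandboxes left over from the
# previous unclean shutdown so the kubelet doesn't trip over stale state.
# The leading "-" tells systemd to continue even if the script fails.
ExecStartPre=-/usr/local/bin/cleanup-stale-sandboxes.sh
```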
This is Part 5: the kernel RCU stall problem, why it’s dangerous on a Raspberry Pi running k3s, the mitigations I layered on, and the moment the cluster finally handled an outage without me.
What Is an RCU Stall?
RCU stands for Read-Copy-Update. It’s a synchronization mechanism built into the Linux kernel for situations where reads are very frequent and writes are rare. The basic idea: readers don’t take locks. Writers make a copy of the data, update the copy, publish it with a single pointer swap, then wait until every reader that started before the swap has finished before freeing the old version. That wait is called a “grace period.”
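The reader/writer split looks roughly like this in kernel code (a schematic sketch of the kernel RCU API, not a compilable module; `struct config` and the surrounding locking for writers are elided):

```c
/* Schematic use of the kernel RCU API (not a standalone program). */
struct config *global_cfg;  /* RCU-protected pointer */

/* Reader: no lock taken; just marks a read-side critical section. */
int read_value(void)
{
    int v;
    rcu_read_lock();
    v = rcu_dereference(global_cfg)->value;
    rcu_read_unlock();
    return v;
}

/* Writer: copy, update, publish, then wait out the grace period. */
void update_value(int v)
{
    struct config *old = global_cfg;
    struct config *new = kmalloc(sizeof(*new), GFP_KERNEL);

    *new = *old;
    new->value = v;
    rcu_assign_pointer(global_cfg, new); /* point readers at the copy */
    synchronize_rcu();                   /* wait for pre-existing readers */
    kfree(old);                          /* now no reader can see old */
}
```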
How B2B Sales Did Not Teach Me About CloudFront Functions
You’ve probably seen the posts:
- “How B2B sales helped me run a marathon”
- “How cold calling made me a better engineer”
This isn’t that. Unfortunately.
Redirects, DNS, and Terraform
This one started simple: I wanted to redirect the apex domain (vakintosh.com) to the www subdomain.
```mermaid
flowchart TD
    Start([Start]) --> A[User types vakintosh.com]
    A --> B[/Browser sends HTTP request/]
    B --> C[DNS resolves apex domain to CloudFront edge node]
    C --> D[CloudFront Function fires on Viewer Request event]
    D --> E{Is host vakintosh.com?}
    E -->|Yes| F[Return 301 Redirect Location: www.vakintosh.com]
    F --> G[/301 Response sent to browser/]
    G --> H[Browser follows redirect to www.vakintosh.com]
    H --> I[CloudFront forwards request to S3 origin]
    E -->|No| J[Rewrite URI e.g. /blog → /blog/index.html]
    J --> I
    I --> K[/Static content served from S3/]
    K --> Finish([Finish])
```
- The user’s browser sends a request to `vakintosh.com`, which DNS resolves to a CloudFront edge node.
- A CloudFront Function fires on the Viewer Request event, before the request ever reaches the S3 origin.
- If the host is the apex domain, the function returns a 301 redirect to `www.vakintosh.com` directly from the edge.
- If the host is already `www`, the function rewrites pretty URLs (e.g. `/blog` → `/blog/index.html`) before forwarding to S3.
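Put together, a CloudFront Function doing both jobs might look like this. This is a sketch in the CloudFront Functions JavaScript runtime: the host names match the post, but the implementation details are my own guess, not the function actually deployed.

```javascript
// Hypothetical CloudFront Function (viewer-request event): redirect the
// apex domain to www, and rewrite pretty URLs for requests already on www.
function handler(event) {
    var request = event.request;
    var host = request.headers.host.value;

    // Apex domain: bounce to www at the edge with a 301.
    if (host === 'vakintosh.com') {
        return {
            statusCode: 301,
            statusDescription: 'Moved Permanently',
            headers: {
                location: { value: 'https://www.vakintosh.com' + request.uri }
            }
        };
    }

    // Already on www: rewrite pretty URLs to index.html objects in S3.
    if (request.uri.endsWith('/')) {
        request.uri += 'index.html';
    } else if (!request.uri.includes('.')) {
        request.uri += '/index.html';
    }
    return request;
}
```

The nice property is that the redirect never touches the origin: it is answered entirely at the edge node the DNS lookup landed on.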
I figured I’d just do it manually in the Porkbun DNS console. Bad idea.
GitHub OIDC + AWS IAM + Terraform: A Practical Guide (and Pain Log)
I wanted to deploy my Hugo website using Terraform and GitHub Actions, securely, with least privilege, without Route 53, using my domain on Porkbun, and leveraging AWS Free Tier services.
Day 1 – AWS Account Setup + Role Plumbing
Started from scratch.
- Created the AWS account
- Set up MFA, secured the root account, all that
- Made a single `Admin` IAM user (for CLI/debug, not daily use)
Then I created a role: GitHubAction-AssumeRoleWithAction.
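In Terraform, that role and its GitHub OIDC trust relationship look roughly like this. The role name comes from the post; the repo placeholder, conditions, and thumbprint are assumptions for illustration:

```hcl
# Hypothetical sketch: GitHub's OIDC provider plus a role GitHub Actions
# can assume via AssumeRoleWithWebIdentity, scoped to one repository.
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
}

resource "aws_iam_role" "github_actions" {
  name = "GitHubAction-AssumeRoleWithAction"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
        StringLike = {
          # Placeholder: restrict which repo (and refs) may assume the role.
          "token.actions.githubusercontent.com:sub" = "repo:USER/REPO:*"
        }
      }
    }]
  })
}
```

The `sub` condition is what enforces least privilege here: without it, any GitHub repository could mint a token against this provider and assume the role.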