Building a Self-Healing k3s Homelab (Part 2): Multi-Node, GitOps, and Growing Pains

Part 1 covered the hardware, k3s, GitOps setup with ArgoCD and Gitea, Longhorn storage, and the monitoring stack. Everything on paper was clean. The actual first few weeks were messier.

This is Part 2: the story of expanding from one node to two, migrating workloads, hardening resources, and accidentally wiping the entire ArgoCD control plane.


Starting Point: Why Two Nodes

The original cluster was single-node. Jarvis ran everything: Home Assistant, GitOps, monitoring, storage, all of it. This works fine until you try to schedule any memory-hungry workload. Prometheus needs at least 700MB. Grafana takes another 300MB. Add Gitea and its PostgreSQL instance, and you’re staring at 2GB of non-home-automation workloads on a node that also has to run the k3s API server, Longhorn, and every IoT integration.

The addition of Ultron (Pi 4, 8GB) solved the resource problem by creating a new one: now I had a 2-node cluster, and I needed the workloads to actually stay on the right node. The scheduler doesn’t care about your intentions.


Workload Migration: Prometheus, Grafana, Gitea

The migrations I did:

Phase 1: Grafana from Jarvis β†’ Ultron. Simple: change the nodeAffinity, let ArgoCD sync, wait for the pod to reschedule. The Longhorn volume migrated automatically because Longhorn can attach a volume to any node.

Phase 2: Gitea + PostgreSQL + Valkey from Jarvis β†’ Ultron. Gitea uses local-path for some things, which is node-pinned. Had to recreate the PVC on Ultron-local storage. Brief downtime, worth it.

Phase 3: Prometheus + Alertmanager + Prometheus Operator + kube-state-metrics from Jarvis to Ultron. This was the big one. Prometheus’s PVC (a 20GB Longhorn volume) was backed by a PV whose nodeAffinity had been pinned to Jarvis when it was first created, and nodeAffinity on a PersistentVolume is immutable once the PV exists. I had to accept the data loss: delete the PV, recreate it on Ultron, and let Prometheus build a fresh TSDB. The historical data up to that point was gone. I called it a fair trade.

After all three phases:

Node      RAM before       RAM after
Jarvis    ~6.4 GB (83%)    ~4.0 GB (50%)
Ultron    ~2.0 GB (25%)    ~3.2 GB (42%)

A lot more breathing room on Jarvis, and Ultron well within safe operating range.


The preferred Affinity Problem

I initially set all the monitoring affinities to preferredDuringSchedulingIgnoredDuringExecution, which means “try to schedule on Ultron, but if you can’t, use Jarvis.” This seemed reasonable: if Ultron is down, monitoring should still run somewhere.

What preferred actually means at 4am: on a cold reboot, Ultron registers with the k3s API about 30 seconds after Jarvis. In those 30 seconds, the scheduler looks at the pending Prometheus pod, sees Ultron hasn’t registered yet, and decides Jarvis is fine. Prometheus (700MB) lands on the control-plane node. Within two hours, Jarvis is at 74% RAM and the kernel is unhappy.

The fix was requiredDuringSchedulingIgnoredDuringExecution for all five monitoring components. Yes, they’ll stay Pending for 30 seconds on a cold reboot. That’s correct behavior. The 30-second wait is safe. Running out of RAM on the control-plane is not.

sequenceDiagram
    participant J as Jarvis (k3s server)
    participant U as Ultron (k3s agent)
    participant SCH as Scheduler

    J->>SCH: Node Ready (t=0)
    Note over U: Still booting...
    SCH->>SCH: Prometheus pending, where to place?
    SCH-->>J: Only node available, schedule on Jarvis
    U->>SCH: Node Ready (t=+30s)
    Note over J: Already running Prometheus 700MB
    Note over J: RAM at 74% - danger zone

With required affinity, that conversation changes:

sequenceDiagram
    participant J as Jarvis (k3s server)
    participant U as Ultron (k3s agent)
    participant SCH as Scheduler

    J->>SCH: Node Ready (t=0)
    Note over U: Still booting...
    SCH->>SCH: Prometheus pending, Ultron required, not available
    SCH-->>SCH: Keep Pending
    U->>SCH: Node Ready (t=+30s)
    SCH->>U: Schedule Prometheus on Ultron

Boring. Correct.
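In manifest form, the fix is a one-keyword swap plus a hard node selector term. A minimal sketch of the required affinity (the hostname value matches my node name; adapt to yours):

```yaml
# Hard-pin the pod to Ultron: if the node is not Ready yet,
# the pod stays Pending instead of falling back to Jarvis.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - ultron
```

The preferred variant uses `preferredDuringSchedulingIgnoredDuringExecution` with a weighted term instead, which is exactly the fallback behavior that bit me.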


ArgoCD Helm Migration

I ran ArgoCD on plain manifests (kubectl apply -f install.yaml) from the beginning, which meant all my resource limits and configuration tweaks were imperative: kubectl set resources, kubectl patch, etc. None of it was in Git. If anything restarted ArgoCD, all those settings were gone.

The fix was migrating ArgoCD to be managed by Helm directly (not via an ArgoCD Application; see the warning box below). I created argocd/values.yaml with all the resource limits and the nodeAffinity to pin the application-controller to Jarvis. Then:

# Adopt all existing ArgoCD resources into the Helm release
kubectl annotate deployment argocd-server \
  'meta.helm.sh/release-name=argo-cd' \
  'meta.helm.sh/release-namespace=argocd' --overwrite
kubectl label deployment argocd-server \
  'app.kubernetes.io/managed-by=Helm' --overwrite
# ...repeat for StatefulSets, Services, CRDs, ConfigMaps, Secrets

# Dry-run first
helm upgrade --install argo-cd argo/argo-cd --version 7.8.26 \
  --namespace argocd --dry-run

# Actual upgrade
helm upgrade --install argo-cd argo/argo-cd --version 7.8.26 \
  --namespace argocd -f argocd/values.yaml

The whole process took about 25 minutes, with two iterations on the adoption annotations. The main gotcha: Helm renames resources with the release prefix. argocd-server becomes argo-cd-argocd-server. Every ingress, ServiceMonitor, and NetworkPolicy that referenced the old name needed updating.

The second gotcha: ArgoCD’s Helm chart defaults to HTTPS mode (redirects HTTP β†’ HTTPS). Traefik is already handling TLS termination at the ingress. When both terminate TLS, things break. Fix: configs.params."server.insecure": "true" in values.yaml.
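In the values file that setting lives under the chart’s configs block. A sketch, following the argo-cd chart’s structure (worth double-checking against your chart version):

```yaml
configs:
  params:
    # Traefik already terminates TLS at the ingress; serve plain HTTP
    # behind it so the chart's HTTP -> HTTPS redirect doesn't fight
    # the proxy and cause a redirect loop.
    server.insecure: "true"
```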

After migration, the controller pinned to Jarvis. Resource limits committed to Git. No more imperative debt.

Do NOT manage ArgoCD with an ArgoCD Application. I did this initially (an Application called argocd-manifests managing ArgoCD itself). It worked until it didn’t. The moment I deleted that Application to fix a sync issue, the resources-finalizer.argocd.argoproj.io finalizer on the Application object triggered a cascade deletion of every ArgoCD Deployment, StatefulSet, and Ingress. The entire GitOps control plane disappeared. Recovery was: reinstall ArgoCD from upstream manifests, re-apply all Application CRs, re-apply all resource patches. It took about an hour and caused some secondary cascades (Traefik restart, HA 400 errors). Not doing that again.


Resource Hardening

Before hardening, almost everything was BestEffort (no requests, no limits). In Kubernetes, BestEffort pods are the first to be OOM-killed under memory pressure. On a cluster with tight RAM budgets, that’s not a configuration state. It’s a timer.

The target: every pod at Burstable or Guaranteed. The approach:

Infrastructure first: ArgoCD (7 pods), Longhorn manager, CSI components, Prometheus, Grafana. These are the things that will take down the cluster if they get OOM-killed.

Applications second: Gitea, Home Assistant, ESPHome, Mosquitto, Matter Server. Important but more recoverable.
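For reference, the QoS class falls out of the requests/limits you set: requests equal to limits on every container gives Guaranteed, any requests at all gives Burstable, and none gives BestEffort. A minimal Burstable container spec (values illustrative):

```yaml
containers:
  - name: app
    image: example:latest   # illustrative image
    resources:
      requests:            # setting requests lifts the pod out of BestEffort
        cpu: 50m
        memory: 128Mi
      limits:              # limits differ from requests, so QoS is Burstable
        memory: 256Mi
```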

For ArgoCD I had to use imperative kubectl set resources because ArgoCD v2.10.4 (at the time) rejected partial Deployment manifests via kubectl apply. It wanted the full spec including the image field which I didn’t want to hardcode. After the Helm migration, the resource limits moved into argocd/values.yaml and the imperative debt was cleared.
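The imperative stopgap looked something like this (component and values are illustrative; the point is that none of it lived in Git):

```shell
# Patch resource limits directly on the live Deployment.
# Survives pod restarts, but NOT a re-apply of the upstream manifests.
kubectl -n argocd set resources deployment argocd-repo-server \
  --requests=cpu=50m,memory=128Mi \
  --limits=cpu=500m,memory=256Mi
```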

For Longhorn the Helm values expose resource limits for the manager and CSI components:

longhornManager:
  resources:
    requests:
      cpu: 50m
      memory: 150Mi
    limits:
      cpu: 500m
      memory: 500Mi
defaultSettings:
  systemManagedCSIComponentsResourceLimits: '{"cpu":"200m","memory":"200Mi"}'

After hardening, I also added ttlSecondsAfterFinished to CronJob specs to automatically garbage-collect completed pods. Without it, pod debris accumulates. I found a 28-day-old helm-install-traefik-crd job pod still sitting in kube-system. It wasn’t doing anything, but it’s the kind of invisible clutter that turns into a problem if you’re not watching for it.
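The knob is set per Job, via the CronJob’s jobTemplate. A sketch (name, schedule, and command are made up):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-backup   # hypothetical job name
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      # Garbage-collect the finished Job (and its pod) an hour after
      # completion instead of leaving debris in the namespace.
      ttlSecondsAfterFinished: 3600
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: alpine:3.20
              command: ["sh", "-c", "echo backup"]
```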


The Prometheus Retention Debugging Session

I changed Prometheus TSDB retention from 10d to 7d by updating argocd-apps.yaml and pushing to Git. ArgoCD said Synced. Ultron’s load average was still 11.0 two hours later. I confirmed the change was in Git. ArgoCD said Synced. The Prometheus pod was still running with --storage.tsdb.retention.time=10d in its actual args.

This took me a while to understand.

argocd-apps.yaml contains the ArgoCD Application CRD definitions with inline Helm values. When you edit this file and push it to Git, it updates the file in Gitea. But the ArgoCD Application object in the cluster is not automatically updated. It still has the old inline values. You have to re-apply the Application objects to the cluster:

kubectl apply -f argocd-apps.yaml -n argocd

That’s what makes ArgoCD pick up the new Helm values and reconcile. Without this step, ArgoCD is happily syncing the old Helm values (cached in its Application object) to the cluster, and reporting Synced. Because from its perspective, the cluster matches what it knows the Application should look like.

The correct workflow is: git push, then kubectl apply -f argocd-apps.yaml -n argocd, then an ArgoCD force sync. In that order.
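As a shell sketch (the app name and commit message are placeholders; the final sync can equally be triggered from the UI):

```shell
# 1. Commit the new inline Helm values to Git
git add argocd-apps.yaml
git commit -m "prometheus: retention 10d -> 7d"
git push

# 2. Update the Application objects in the cluster -- this is the
#    step that ArgoCD's "Synced" status quietly hides from you
kubectl apply -f argocd-apps.yaml -n argocd

# 3. Force a sync so the new values reconcile immediately
argocd app sync prometheus
```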

After this was fixed and Prometheus actually ran with 7d retention, Ultron’s load average went from 11.0 to 1.08 over about 55 minutes as the TSDB compaction drained. That 55-minute drop is documented in the session notes because I was watching it in real time.

Time     Load avg (1m)
16:25    11.00
16:46    5.57
17:08    1.99
17:22    1.08

The load that looks like a crisis is sometimes just TSDB compaction on a 4-core ARM chip.


Monitoring Stack Architecture

After all the migration and hardening work, here’s the monitoring topology:

flowchart TD
    subgraph JARVIS["Jarvis - Pi 5"]
        NE_J["node-exporter\nhostNetwork"]
        HA_APP[Home Assistant + IoT]
        ARGOCD_CTRL["ArgoCD\napplication-controller"]
    end
    subgraph ULTRON["Ultron - Pi 4"]
        NE_U[node-exporter\nhostNetwork]
        PROM[Prometheus\n20GB TSDB]
        GRAF[Grafana]
        AM[Alertmanager]
        KSM[kube-state-metrics]
    end
    subgraph INFRA[Cluster-wide]
        TRAEFIK[Traefik\ningress]
        LH[Longhorn]
        ARGOCD_SRV[ArgoCD server\nrepo-server\nredis]
    end
    NE_J -->|metrics :9100| PROM
    NE_U -->|metrics :9100| PROM
    LH -->|ServiceMonitor| PROM
    ARGOCD_CTRL -->|ServiceMonitor| PROM
    PROM -->|datasource| GRAF
    PROM -->|alerting rules| AM
    GRAF -->|ingress| TRAEFIK
    AM -->|ingress| TRAEFIK

Dashboards are committed to Git as ConfigMap objects with embedded JSON. ArgoCD manages them. Every time I change a panel, it’s a commit. Every dashboard is reproducible. If the cluster gets rebuilt from scratch, all dashboards come back automatically on the first ArgoCD sync.
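A dashboard ConfigMap in this setup looks roughly like the following, assuming the common kube-prometheus-stack sidecar convention where a grafana_dashboard label marks ConfigMaps for the Grafana sidecar to load (names abbreviated, JSON truncated to a stub):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-node-overview   # illustrative name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # picked up by the Grafana dashboard sidecar
data:
  node-overview.json: |
    {
      "title": "Node Overview",
      "panels": []
    }
```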


Security: NetworkPolicies per Namespace

The ArgoCD cascade deletion incident had a silver lining: it forced me to think carefully about what “accidentally delete everything” looks like in a GitOps cluster. The answer is: it looks like exactly what happened, and the only protection is not putting ArgoCD’s finalizer on its own Application.

For the rest of the namespaces, NetworkPolicies are the primary defense against unexpected traffic. Each namespace gets a default-deny-all policy first:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: home-assistant
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

Then explicit allow-list policies for each traffic pattern: ArgoCD sync (TCP 8080), Longhorn engine (TCP 9500-9504), Prometheus scraping (TCP 9090 from monitoring namespace), Traefik ingress (TCP 80/443 from kube-system), external secrets (TCP 443 to 1Password API).

The External Secrets Operator one was a debugging session in itself: the webhook needs to accept connections from the kube-apiserver, which comes from the node’s LAN IP (10.0.1.0/24), not from a pod CIDR. Missing that allow caused every ExternalSecret to fail with a 502 and ArgoCD to report the app as degraded. Fix was adding a NetworkPolicy in external-secrets namespace allowing TCP 10250 (webhook port) from both the LAN and pod CIDR ranges.
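Reconstructed, the policy looked roughly like this (the pod label is an assumption about the ESO webhook Deployment; the pod CIDR shown is k3s’s default 10.42.0.0/16, adjust to your cluster):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-apiserver-to-webhook
  namespace: external-secrets
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: external-secrets-webhook   # label assumed
  policyTypes:
    - Ingress
  ingress:
    - from:
        # kube-apiserver calls arrive from the node's LAN IP, not a pod IP
        - ipBlock:
            cidr: 10.0.1.0/24
        # pod CIDR (k3s default; adjust if customized)
        - ipBlock:
            cidr: 10.42.0.0/16
      ports:
        - protocol: TCP
          port: 10250   # webhook port
```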


The State of Things at End of October

After all this work, at the end of October 2025, the cluster was in a good place:

  • 2 nodes, both healthy
  • 11 ArgoCD applications, all Synced/Healthy
  • All workloads on the right nodes, with required or preferred affinity
  • All workloads at Burstable or Guaranteed QoS
  • NetworkPolicies in every namespace
  • Longhorn configured with a backup target (NFS on Ultron’s local disk)
  • Monitoring in Grafana, dashboards for both nodes

Jarvis RAM: 50%. Ultron RAM: 42%. Both in the green.

I congratulated myself on a solid foundation and went to sleep.

At 04:44 the next morning, the cluster fell over for the first time.

That story is Part 3.