Five years on-call taught me the pager’s first demand is context, not heroics. From half-built clusters to rogue upgrades, this post shares war stories, triage tactics, and manager tips for keeping incidents—and engineers—under control.
Cluster idles at 8% yet the bill soars? Learn how Prometheus data + KRR, a 50-line Python wrapper, Grafana and ArgoCD reclaimed 500 vCPU and 200 GiB across dozens of Kubernetes clusters: no magic, no incidents, just rightsizing done right.
When df lies and du swears, look for Loki’s orphaned WAL segments. Our prod cluster filled up every week until we purged legacy boltdb-shipper data from S3. Postmortem, fix steps, and preventive checks summarized.
After seven years running clusters I finally built my first Kubernetes Operator—a Redis PoC in Go with Operator-SDK. This post demystifies CRDs, reconcile loops, defaults, secrets, and status conditions, sharing hard-won lessons and next steps for anyone Operator-curious.
Deploying on Kubernetes doesn't equal disaster recovery. My MSc research made this clear: comparing recovery scenarios on AWS EKS and GKE with Velero backups taught me invaluable, practical lessons.