Kubernetes Operators 101: A Hands-On Tale of Go, CRDs, and Redis

Piper, our chick with a piercing gaze—Fall 2023 © Sergio Fernández
Never send a human to do a machine’s job

The quote is from The Matrix, but in 2025 you’ll hear a variant of it every time the Kubernetes community starts talking about Operators. Operators are our way of encoding an experienced SRE or DBA’s brain into a controller so that nobody has to wake up at 03:00 to type kubectl patch and fix a StatefulSet ever again. Or maybe somebody still will; it's Kubernetes, after all 😎.

After seven-plus years of keeping clusters alive—in one form or another—I'd done just about every K8s trick you can imagine: multi-tenant ingress topologies, multi-cluster service meshes, storage-class gymnastics… but I had never written an Operator. I love programming (my GitHub history shows commits long before ChatGPT was a thing) and I’d been itching to learn Go properly, so writing an Operator felt like the perfect side-quest.

Enter my Redis Operator proof-of-concept (PoC). Does the world need another Redis Operator? Probably not. Did I need to build one to finally scratch this itch? Absolutely—so I did. It’s open-sourced here. This post tells the story: what Operators are, what decisions you face while writing one, and what I learned along the way.

What Exactly Is an Operator?

In a nutshell, Operators are a way to extend Kubernetes functionality with application-specific logic. Think of three building blocks:

  1. CustomResourceDefinition (CRD) – teaches the API server about a new noun (Redis, Backup, EtcdCluster, …).
  2. Controller (a.k.a. “the Operator”) – a Go program that reconciles desired state to actual state.
  3. Intent – users express what they want (spec.replicas: 3, version: 8.0), the Operator decides how.

Kubernetes already reconciles Pods, Deployments, Services and so on; an Operator reconciles the things Kubernetes doesn’t know about (fail-over, backup, schema migration, licence keys, you name it).

In practice you:

  • Scaffold a project with Kubebuilder or Operator-SDK
  • Define Go structs for Spec and Status
  • Write a reconcile loop that reads the CR, diffs reality, and applies changes
  • Package and ship it like any other container image.
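
The read, diff, apply cycle can be sketched without any Kubernetes libraries. The types below (RedisSpec, Observed) and the plan function are hypothetical stand-ins rather than the PoC's actual API, and the redis:8.0 image is an assumed value; a real controller would talk to the API server through a client instead of returning a list of strings.

```go
package main

import "fmt"

// RedisSpec is a hypothetical, trimmed-down desired state (what the user asks for).
type RedisSpec struct {
	Replicas int
	Image    string
}

// Observed is a hypothetical snapshot of what currently exists in the cluster.
type Observed struct {
	Exists   bool
	Replicas int
	Image    string
}

// plan diffs desired state against observed state and returns the changes
// needed to converge. It has no side effects, so calling it twice with the
// same inputs yields the same plan.
func plan(spec RedisSpec, obs Observed) []string {
	if !obs.Exists {
		return []string{"create deployment"}
	}
	var actions []string
	if obs.Replicas != spec.Replicas {
		actions = append(actions, fmt.Sprintf("scale %d -> %d", obs.Replicas, spec.Replicas))
	}
	if obs.Image != spec.Image {
		actions = append(actions, fmt.Sprintf("set image %s", spec.Image))
	}
	return actions
}

func main() {
	spec := RedisSpec{Replicas: 3, Image: "redis:8.0"}
	fmt.Println(plan(spec, Observed{}))
	fmt.Println(plan(spec, Observed{Exists: true, Replicas: 1, Image: "redis:8.0"}))
	fmt.Println(plan(spec, Observed{Exists: true, Replicas: 3, Image: "redis:8.0"}))
}
```

When reality already matches the spec the plan is empty, which is the steady state a reconcile loop spends most of its life in.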

Design Notes From My Redis PoC

Below are the decisions, head-scratches, and “next time I’d do X” moments that shaped the PoC.

  • Defaults: API vs. Controller vs. Webhook

Operator-SDK will happily generate a mutating webhook if you ask, but for a lean PoC I stuffed defaults inside api/v1alpha1/redis_types.go (SetDefaults()). The trade-offs:

| Option | Pros | Cons |
|---|---|---|
| API | Zero plumbing, unit-testable, survives kubectl apply --server-side | Cannot look at cluster state; no "dry run" |
| Controller Shim | One place to patch objects | Defaults resolved after admission; kubectl diff shows "missing" fields |
| Webhook | Cluster-aware (can query StorageClass, TLS Secret, etc.) | Cert management, latency, more moving parts |

If you only need static defaults (image tag, CPU/mem, readiness probe), put them in the API and move on. If you need to inject cluster-specific data (CA bundles, storage classes), invest in the webhook.
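
As a rough illustration of static defaults living on the API type, here is a minimal sketch. The field names and default values are assumptions for the example, not what the PoC's redis_types.go actually contains.

```go
package main

import "fmt"

// RedisSpec is a hypothetical spec; the field names and default values are
// illustrative assumptions, not the PoC's actual API.
type RedisSpec struct {
	Image    string
	Replicas *int32 // pointer so "unset" is distinguishable from an explicit 0
	CPU      string
	Memory   string
}

// SetDefaults fills only the fields the user left empty, so running it
// repeatedly never overwrites an explicit choice.
func (s *RedisSpec) SetDefaults() {
	if s.Image == "" {
		s.Image = "redis:8.0"
	}
	if s.Replicas == nil {
		one := int32(1)
		s.Replicas = &one
	}
	if s.CPU == "" {
		s.CPU = "100m"
	}
	if s.Memory == "" {
		s.Memory = "128Mi"
	}
}

func main() {
	spec := RedisSpec{Memory: "256Mi"} // user set memory only
	spec.SetDefaults()
	fmt.Println(spec.Image, *spec.Replicas, spec.CPU, spec.Memory)
}
```

Because the function touches nothing outside the struct, it can be unit-tested with plain go test and no cluster, which is the "zero plumbing" advantage in the table.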

  • Password Secret

Every Redis CR owns exactly one Secret (<name>-secret). On first reconcile:

  1. If the Secret doesn’t exist, generate 32 random bytes, base64.RawURLEncoding them (URL-safe, no padding), and mark the Secret immutable: true.
  2. Annotate the Pod template with a SHA-256 of the password so that changing the Secret triggers a rollout (redis.cache.geiser.cloud/secret-hash).
  3. Emit a Normal event (PasswordGenerated) so operators can audit.

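Why a SHA-256 instead of resourceVersion? Because the hash depends only on the password's content. It survives a kubectl replace --force on the Secret: the object is recreated with a new resourceVersion, but an unchanged password yields the same hash, so no needless rollout is triggered.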
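
Steps 1 and 2 can be sketched with the standard library alone. The helper names (generatePassword, secretHash) are hypothetical, and hex-encoding the digest is an assumption; the post only says the annotation carries a SHA-256 of the password.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// generatePassword returns 32 random bytes encoded as URL-safe base64
// without padding, which always yields a 43-character string.
func generatePassword() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

// secretHash returns the hex-encoded SHA-256 of the password. The value
// changes only when the password itself changes.
func secretHash(password string) string {
	sum := sha256.Sum256([]byte(password))
	return hex.EncodeToString(sum[:])
}

func main() {
	pw, err := generatePassword()
	if err != nil {
		panic(err)
	}
	// Annotation that would be set on the Pod template.
	annotations := map[string]string{
		"redis.cache.geiser.cloud/secret-hash": secretHash(pw),
	}
	fmt.Println(len(pw), len(annotations["redis.cache.geiser.cloud/secret-hash"]))
}
```

Any change to a Pod template annotation changes the template, and that is what makes the Deployment roll its Pods.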

  • Status Conditions

status.conditions carries three conditions, each reporting True or False:

  • PasswordGenerated – true after step 1 above.
  • DeploymentReady – true when deployment.status.readyReplicas == spec.replicas.
  • Ready – AND-ed version of the first two (classic upstream pattern).

This turns kubectl get redis into an SRE dashboard:

NAME        READY   PASS   DEPLOY   AGE
redis-demo  True    True   True     4m

…and makes tests trivial (Eventually(hasCondition("Ready", True)).Should(BeTrue())).
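
The aggregation behind the Ready column might look like the sketch below. Condition is a simplified stand-in for the upstream metav1.Condition type, and setCondition, isTrue and computeStatus are hypothetical helpers, not the PoC's code.

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True" or "False" in this sketch
}

// setCondition updates the condition of the given type in place, or appends
// it if it is not present yet, so repeated calls never create duplicates.
func setCondition(conds []Condition, typ string, ok bool) []Condition {
	status := "False"
	if ok {
		status = "True"
	}
	for i := range conds {
		if conds[i].Type == typ {
			conds[i].Status = status
			return conds
		}
	}
	return append(conds, Condition{Type: typ, Status: status})
}

// isTrue reports whether the named condition exists and has status "True".
func isTrue(conds []Condition, typ string) bool {
	for _, c := range conds {
		if c.Type == typ {
			return c.Status == "True"
		}
	}
	return false
}

// computeStatus derives all three conditions; Ready is the AND of the others.
func computeStatus(conds []Condition, passwordGenerated bool, readyReplicas, wantReplicas int32) []Condition {
	deployReady := readyReplicas == wantReplicas
	conds = setCondition(conds, "PasswordGenerated", passwordGenerated)
	conds = setCondition(conds, "DeploymentReady", deployReady)
	conds = setCondition(conds, "Ready", passwordGenerated && deployReady)
	return conds
}

func main() {
	conds := computeStatus(nil, true, 2, 3)
	fmt.Println(isTrue(conds, "Ready"))
	conds = computeStatus(conds, true, 3, 3)
	fmt.Println(isTrue(conds, "Ready"))
}
```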

  • Reconciling Resources vs. Config

The Operator only touches three resource kinds:

  1. Secret – immutable password.
  2. Deployment – replicas + resource limits.
  3. Pod – observed via ownerRef for restart counting.

Resource size changes (CPU/memory) are detected by hashing the desired ResourceList and comparing it to deployment.spec.template.spec.containers[0].resources. If different, I patch the Pod template; Kubernetes restarts Pods automatically.
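
One way to hash a resource list deterministically is sketched below. It models the ResourceList as a plain map[string]string, which is a simplification, and hashResources and needsPatch are hypothetical names. Note the caveat that string comparison treats "1" and "1000m" CPU as different; real code would more likely compare parsed resource.Quantity values.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// hashResources hashes a resource list in sorted key order. Go randomizes map
// iteration order, so sorting is what makes the hash stable across calls.
func hashResources(res map[string]string) string {
	keys := make([]string, 0, len(res))
	for k := range res {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s\n", k, res[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// needsPatch reports whether the live resources differ from the desired ones.
func needsPatch(desired, live map[string]string) bool {
	return hashResources(desired) != hashResources(live)
}

func main() {
	desired := map[string]string{"cpu": "100m", "memory": "256Mi"}
	live := map[string]string{"cpu": "100m", "memory": "128Mi"}
	fmt.Println(needsPatch(desired, live))
	fmt.Println(needsPatch(desired, desired))
}
```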

  • Testing: envtest + Ginkgo

Unit tests run in-process against envtest, so CI finishes in seconds and never needs a real cluster. End-to-end tests (Kind + Ginkgo) prove that when a Pod restarts 3 times the Controller emits a Kubernetes Warning event, because humans actually look at those.
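
The restart check that the end-to-end test exercises could be reduced to a pure function like this one. The Event struct, the PodRestarting reason and the message format are assumptions for illustration; only the threshold of 3 and the Warning event type come from the text above.

```go
package main

import (
	"fmt"
	"sort"
)

// restartThreshold is the restart count at which a warning is emitted.
const restartThreshold = 3

// Event is a simplified stand-in for a Kubernetes event.
type Event struct {
	Type    string
	Reason  string
	Message string
}

// restartWarnings returns one Warning event per Pod whose restart count has
// reached the threshold. Pods are visited in name order so the result is
// deterministic.
func restartWarnings(restarts map[string]int32) []Event {
	names := make([]string, 0, len(restarts))
	for name := range restarts {
		names = append(names, name)
	}
	sort.Strings(names)

	var events []Event
	for _, name := range names {
		if n := restarts[name]; n >= restartThreshold {
			events = append(events, Event{
				Type:    "Warning",
				Reason:  "PodRestarting", // hypothetical reason string
				Message: fmt.Sprintf("pod %s has restarted %d times", name, n),
			})
		}
	}
	return events
}

func main() {
	for _, e := range restartWarnings(map[string]int32{"redis-demo-a": 3, "redis-demo-b": 1}) {
		fmt.Println(e.Type, e.Reason, e.Message)
	}
}
```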

  • Operator Image and Delivery

For the PoC I publish a multi-arch image to Docker Hub (drumsergio/redis-operator:0.0.1) and hand out an all-in-one manifest (config/deploy-redis-operator.yaml). In the real world the menu is richer:

  • Helm chart – values for image, leader-election, RBAC, etc.
  • OLM bundle – declarative upgrades, alpha/beta channels.
  • GitOps – commit the CRDs + Deployment into ArgoCD or Flux and let them drift-manage.

I would push a bundle to OperatorHub just for fun, but I'll leave that for next time.

  • Operator-SDK niceties I leaned on

  • operator-sdk scorecard – sanity-checks CRD validation, OpenAPI schema coverage.
  • make bundle – spits out an OLM-ready CSV with RBAC baked in.
  • envtest integration – same libs as Kubebuilder; drop-in for unit tests.

Lessons From the Wider Community

I binge-read Google’s “Best practices for building operators and stateful apps” and Anynines CEO Julian Fischer’s “Principles for Building Kubernetes Operators” (which is in fact a recording). A summary of the advice that resonated:

  • One Operator per app – don’t ship a “mega-operator” that does Redis + Postgres + RabbitMQ; separation of concerns matters.
  • Declarative, not imperative – users say what, Operator figures out how. If your CRD has verbs (“backupNow: true”) you’re probably off-track.
  • Async reconcile loops – never block; set RequeueAfter and come back later.
  • Idempotency everywhere – reconciliation might run 10× a second; every create must be “create-if-not-exists.”
  • Backups are a first-class CRD – treat RedisBackup the way K8s treats Job. Controllers for ops like restore, purge, verify belong outside the main reconcile loop.
  • Observability is part of UX – emit Prometheus metrics and structured events; label everything (app.kubernetes.io/name, instance, version).
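
The idempotency and non-blocking advice can be shown together in a small sketch. Store is an in-memory stand-in for the API server, Result only mirrors the RequeueAfter field of controller-runtime's reconcile result, and ensureSecret, reconcile and the 10-second delay are all assumptions for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Store is an in-memory stand-in for the API server (name -> value).
type Store map[string]string

// Result mirrors the RequeueAfter field of controller-runtime's reconcile
// result: zero means "done", non-zero means "call me again after this delay".
type Result struct {
	RequeueAfter time.Duration
}

// ensureSecret creates the entry only if it does not exist yet and reports
// whether it created anything. Calling it again is a harmless no-op.
func ensureSecret(s Store, name, value string) bool {
	if _, ok := s[name]; ok {
		return false
	}
	s[name] = value
	return true
}

// reconcile never waits for readiness. If the workload is not ready yet it
// returns immediately and asks to be requeued.
func reconcile(s Store, name string, ready bool) Result {
	ensureSecret(s, name+"-secret", "generated-password")
	if !ready {
		return Result{RequeueAfter: 10 * time.Second}
	}
	return Result{}
}

func main() {
	s := Store{}
	fmt.Println(reconcile(s, "redis-demo", false).RequeueAfter)
	fmt.Println(reconcile(s, "redis-demo", true).RequeueAfter)
	fmt.Println(len(s)) // still one Secret after two reconciles
}
```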

Known Gaps / Next Steps

The README is brutally honest and I’ll go from there:

  • Persistence – right now Redis runs ephemeral; PVC support is next.
  • StatefulSet – migrate from a Deployment to a StatefulSet to improve state handling.
  • High Availability – no Sentinel, no Cluster mode yet.
  • Backup CRD – need RedisBackup + RedisBackupSchedule and a side-car.
  • Finalizer – ownerRefs are fine until the Namespace is deleted; a finalizer would let me roll my own clean-up logic.
  • OLM packaging – bundle, install modes, upgrade e2e.
  • Cross-compatibility table – so users know which Redis versions work with which K8s versions.
  • Prometheus – it will never be production-ready if observability isn't there.

All perfectly fine for a learning project; none acceptable for production. Baby steps.

Takeaway

Writing an Operator is half coding, half systems-thinking. You juggle Kubernetes’ eventual consistency model, the quirks of the stateful app you’re automating, and the developer ergonomics of your CRD.

It forced me to learn more about informers, caches, and client-go idempotency than years of plain YAML ever did. If you’re Operator-curious, fire up Operator-SDK, pick a stateful app you love, and let the reconcile loop show you the rest.

Sergio Fernández


Senior Cloud DevOps Engineer specializing in Kubernetes.
Murcia, Spain