Kubernetes Operators 101: A Hands-On Tale of Go, CRDs, and Redis

Never send a human to do a machine’s job
The quote is from The Matrix, but in 2025 you’ll hear a variant of it every time the Kubernetes community starts talking about Operators. Operators are our way of encoding an experienced SRE or DBA’s brain into a controller so that nobody has to wake up at 03:00 to type `kubectl patch` and fix a StatefulSet ever again—or maybe they still will; it’s Kubernetes, after all 😎.
After seven-plus years of keeping clusters alive—in one form or another—I'd done just about every K8s trick you can imagine: multi-tenant ingress topologies, multi-cluster service meshes, storage-class gymnastics… but I had never written an Operator. I love programming (my GitHub history shows commits long before ChatGPT was a thing) and I’d been itching to learn Go properly, so writing an Operator felt like the perfect side-quest.
Enter my Redis Operator proof-of-concept (PoC). Does the world need another Redis Operator? Probably not. Did I need to build one to finally scratch this itch? Absolutely—so I did. It’s open-sourced here. This post tells the story: what Operators are, what decisions you face while writing one, and what I learned along the way.
What Exactly Is an Operator?
In a nutshell, Operators are a way to extend Kubernetes functionality with application-specific logic. Think of three building blocks:
- CustomResourceDefinition (CRD) – teaches the API server about a new noun (`Redis`, `Backup`, `EtcdCluster`, …).
- Controller (a.k.a. “the Operator”) – a Go program that reconciles desired state to actual state.
- Intent – users express what they want (`spec.replicas: 3`, `version: 8.0`), the Operator decides how.
Kubernetes already reconciles Pods, Deployments, Services and so on; an Operator reconciles the things Kubernetes doesn’t know about (fail-over, backup, schema migration, licence keys, you name it).
In practice you:
- Scaffold a project with Kubebuilder or Operator-SDK
- Define Go structs for `Spec` and `Status`
- Write a reconcile loop that reads the CR, diffs reality, and applies changes
- Package and ship it like any other container image.
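For orientation, here is roughly what step 3 looks like with controller-runtime. This is a minimal sketch, not the PoC’s actual code; the `Redis` type and module path are placeholders:

```go
package controller

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	cachev1alpha1 "github.com/example/redis-operator/api/v1alpha1" // placeholder path
)

// RedisReconciler gets wired up by the scaffolded SetupWithManager.
type RedisReconciler struct {
	client.Client
}

func (r *RedisReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Read the CR the event points at.
	var redis cachev1alpha1.Redis
	if err := r.Get(ctx, req.NamespacedName, &redis); err != nil {
		// CR deleted between event and reconcile: nothing left to do.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// 2. Diff desired state (redis.Spec) against reality and create or
	//    patch the owned Secret/Deployment here.

	// 3. Never block; return (and optionally set RequeueAfter) instead.
	return ctrl.Result{}, nil
}
```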
Design Notes From My Redis PoC
Below are the decisions, head-scratches, and “next time I’d do X” moments that shaped the PoC.
- Defaults: API vs. Controller vs. Webhook
Operator-SDK will happily generate a mutating webhook if you ask, but for a lean PoC I stuffed defaults inside `api/v1alpha1/redis_types.go` (`SetDefaults()`). The trade-offs:
| Option | Pros | Cons |
|---|---|---|
| API | Zero plumbing, unit-testable, survives `kubectl apply --server-side` | Cannot look at cluster state; no "dry run" |
| Controller Shim | One place to patch objects | Defaults resolved after admission; `kubectl diff` shows "missing" fields |
| Webhook | Cluster-aware (can query StorageClass, TLS Secret, etc.) | Cert management, latency, more moving parts |
If you only need static defaults (image tag, CPU/mem, readiness probe), put them in the API and move on. If you need to inject cluster-specific data (CA bundles, storage classes), invest in the webhook.
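For the static route, defaulting is just a method on the API struct. A sketch, with illustrative field names rather than the repo’s exact ones:

```go
// api/v1alpha1/redis_types.go (sketch)

// SetDefaults fills in anything the user left blank. It is a pure function
// of the struct, so it unit-tests without a cluster; the reconciler calls
// it right after fetching the CR.
func (r *Redis) SetDefaults() {
	if r.Spec.Image == "" {
		r.Spec.Image = "redis:8.0"
	}
	if r.Spec.Replicas == nil {
		one := int32(1)
		r.Spec.Replicas = &one
	}
}
```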
- Password Secret
Every `Redis` CR owns exactly one Secret (`<name>-secret`). On first reconcile:
- If the Secret doesn’t exist, generate 32 random bytes, `base64.RawURLEncoding` them (URL-safe, no padding), and mark the Secret `immutable: true`.
- Annotate the Pod template with a SHA-256 of the password so that changing the Secret triggers a rollout (`redis.cache.geiser.cloud/secret-hash`).
- Emit a `Normal` event (`PasswordGenerated`) so operators can audit.
Why SHA instead of resourceVersion? Because it survives a `kubectl replace --force` on the Secret.
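A sketch of that first-reconcile path; the helper name and wiring are mine, not lifted from the repo:

```go
package controller

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// newPassword returns a fresh password plus the SHA-256 hex digest that goes
// into the redis.cache.geiser.cloud/secret-hash Pod-template annotation.
func newPassword() (password, hash string, err error) {
	raw := make([]byte, 32) // 32 random bytes
	if _, err = rand.Read(raw); err != nil {
		return "", "", err
	}
	password = base64.RawURLEncoding.EncodeToString(raw) // URL-safe, no padding
	hash = fmt.Sprintf("%x", sha256.Sum256([]byte(password)))
	return password, hash, nil
}
```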
- Status Conditions
`status.conditions` carries three booleans:
- `PasswordGenerated` – true after step 1 above.
- `DeploymentReady` – true when `deployment.status.readyReplicas == spec.replicas`.
- `Ready` – AND-ed version of the first two (classic upstream pattern).
This turns `kubectl get redis` into an SRE dashboard:

```
NAME         READY   PASS   DEPLOY   AGE
redis-demo   True    True   True     4m
```

…and makes tests trivial (`Eventually(hasCondition("Ready", True)).Should(BeTrue())`).
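Wiring those up is mostly the upstream `meta` helpers; a sketch, assuming `Status.Conditions` is a `[]metav1.Condition`:

```go
package controller

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setReady derives Ready from the other two conditions. SetStatusCondition
// deduplicates by Type and only bumps LastTransitionTime on real changes.
func setReady(conds *[]metav1.Condition, generation int64) {
	ready := meta.IsStatusConditionTrue(*conds, "PasswordGenerated") &&
		meta.IsStatusConditionTrue(*conds, "DeploymentReady")

	status, reason := metav1.ConditionFalse, "ComponentsNotReady"
	if ready {
		status, reason = metav1.ConditionTrue, "ComponentsReady"
	}
	meta.SetStatusCondition(conds, metav1.Condition{
		Type:               "Ready",
		Status:             status,
		Reason:             reason,
		ObservedGeneration: generation,
	})
}
```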
- Reconciling Resources vs. Config
The Operator only touches three resource kinds:
- `Secret` – immutable password.
- `Deployment` – replicas + resource limits.
- `Pod` – observed via ownerRef for restart counting.
Resource size changes (CPU/memory) are detected by hashing the desired `ResourceList` and comparing it to `deployment.spec.template.spec.containers[0].resources`. If different, I patch the Pod template; Kubernetes restarts Pods automatically.
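The hash itself can be as simple as marshalling the requirements and digesting them. A sketch; note it is format-sensitive (`0.1` and `100m` are equal quantities but hash differently unless canonicalized first):

```go
package controller

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// resourcesHash digests the desired requirements so the reconciler can
// compare them to the live container's resources with one string equality.
func resourcesHash(rr corev1.ResourceRequirements) (string, error) {
	b, err := json.Marshal(rr) // Go marshals map keys in sorted order
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", sha256.Sum256(b)), nil
}
```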
- Testing: envtest + Ginkgo
Unit tests run in-process against `envtest`, so CI finishes in seconds and never needs a real cluster. End-to-end tests (Kind + Ginkgo) prove that when a Pod restarts 3 times the Controller emits a Kubernetes `Warning` event—because humans actually look at those.
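A sketch of such a spec, assuming the standard Kubebuilder suite scaffolding (which provides `k8sClient` and `ctx` in `suite_test.go`) and the dot-imported Ginkgo/Gomega DSL:

```go
import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"

	"k8s.io/apimachinery/pkg/api/meta"
	"sigs.k8s.io/controller-runtime/pkg/client"

	cachev1alpha1 "github.com/example/redis-operator/api/v1alpha1" // placeholder path
)

var _ = Describe("Redis controller", func() {
	It("eventually reports Ready", func() {
		redis := &cachev1alpha1.Redis{ /* metadata + spec elided */ }
		Expect(k8sClient.Create(ctx, redis)).To(Succeed())

		// Poll until the controller has flipped the condition.
		Eventually(func() bool {
			_ = k8sClient.Get(ctx, client.ObjectKeyFromObject(redis), redis)
			return meta.IsStatusConditionTrue(redis.Status.Conditions, "Ready")
		}).Should(BeTrue())
	})
})
```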
- Operator Image and Delivery
For the PoC I publish a multi-arch image to Docker Hub (`drumsergio/redis-operator:0.0.1`) and hand out an all-in-one manifest (`config/deploy-redis-operator.yaml`). In the real world the menu is richer:
- Helm chart – values for image, leader-election, RBAC, etc.
- OLM bundle – declarative upgrades, alpha/beta channels.
- GitOps – commit the CRDs + Deployment into ArgoCD or Flux and let them drift-manage.
I’d push a bundle to OperatorHub just for fun, but I’ll save that for next time.
- Operator-SDK niceties I leaned on
- `operator-sdk scorecard` – sanity-checks CRD validation, OpenAPI schema coverage.
- `make bundle` – spits out an OLM-ready CSV with RBAC baked in.
- `envtest` integration – same libs as Kubebuilder; drop-in for unit tests.
Lessons From the Wider Community
I binge-read Google’s “Best practices for building operators and stateful apps” and Anynines CEO Julian Fischer’s “Principles for Building Kubernetes Operators” (the latter is actually a recording). A summary of the advice that resonated:
• One Operator per app – don’t ship a “mega-operator” that does Redis + Postgres + RabbitMQ; separation of concerns matters.
• Declarative, not imperative – users say what, Operator figures out how. If your CRD has verbs (“backupNow: true”) you’re probably off-track.
• Async reconcile loops – never block; set `RequeueAfter` and come back later.
• Idempotency everywhere – reconciliation might run 10× a second; every create must be “create-if-not-exists” (see the sketch after this list).
• Backups are a first-class CRD – treat `RedisBackup` the way K8s treats `Job`. Controllers for ops like restore, purge, verify belong outside the main reconcile loop.
• Observability is part of UX – emit Prometheus metrics and structured events; label everything (`app.kubernetes.io/name`, `instance`, `version`).
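In controller-runtime, `controllerutil.CreateOrUpdate` is the usual way to get that create-if-not-exists behavior; a sketch with a hypothetical helper:

```go
package controller

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// ensureDeployment is safe to run on every reconcile: CreateOrUpdate reads
// the live object, applies the mutate closure, and only writes if it changed.
func ensureDeployment(ctx context.Context, c client.Client, scheme *runtime.Scheme,
	owner client.Object, name, ns string, replicas *int32) error {
	deploy := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
	}
	_, err := controllerutil.CreateOrUpdate(ctx, c, deploy, func() error {
		deploy.Spec.Replicas = replicas // restate desired state every pass
		// Selector and pod template omitted for brevity.
		// ownerRef so garbage collection cleans up with the CR.
		return ctrl.SetControllerReference(owner, deploy, scheme)
	})
	return err
}
```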
Known Gaps / Next Steps
The README is brutally honest and I’ll go from there:
- Persistence – right now Redis runs ephemeral; PVC support is next.
- StatefulSet – migrate to an STS to improve state handling.
- High Availability – no Sentinel, no Cluster mode yet.
- Backup CRD – need `RedisBackup` + `RedisBackupSchedule` and a side-car.
- Finalizer – ownerRefs are fine until the Namespace is deleted; a finalizer would let me roll my own clean-up logic.
- OLM packaging – bundle, install modes, upgrade e2e.
- Cross-compatibility table – so users know which Redis versions match which K8s version.
- Prometheus – it will never be production-ready if observability isn't there
All perfectly fine for a learning project; none acceptable for production. Baby steps.
Takeaway
Writing an Operator is half coding, half systems-thinking. You juggle Kubernetes’ eventual consistency model, the quirks of the stateful app you’re automating, and the developer ergonomics of your CRD.
It forced me to learn more about informers, caches, and client-go idempotency than years of plain YAML ever did. If you’re Operator-curious, fire up Operator-SDK, pick a stateful app you love, and let the reconcile loop show you the rest.