Upgrade Management at Scale — An Invisible Art

Summertime 2023 in Schaukelbaum Bad Grönenbach, Germany © Sergio Fernández
"Aren't you a developer if one of your projects is managing software updates?"

That's what I heard from a good friend, a seasoned developer himself, when I explained that one of my main responsibilities was coordinating, automating, and executing large-scale upgrade management in the context of Kubernetes. Yes, it caught me off guard, and it was a little painful to hear. But trust me, my friend: this post exists thanks to you, and to everyone else wondering what it actually means to perform upgrade management at scale in a Managed Services Provider (MSP) company.

Spoiler alert: It's neither boring nor trivial. It's chaotic, messy, challenging, inspiring, maddening, creative, exhausting, and yes, deeply rewarding. Let me explain by sharing a candid, behind-the-scenes look at my daily work: the real human element, the hidden struggles, the organizational complexities, and, of course, some of the automation magic that makes it all feasible.

What does "Upgrade management at scale" actually mean?

Managing upgrades at scale doesn't mean logging in and hitting "update" on some dashboard. It means carefully orchestrating software updates for critical infrastructure and applications spanning a staggering number of Kubernetes clusters, hundreds of namespaces, thousands of running workloads, many clients, diverse business structures, and—of course—people.

In our team, we came up with a way of categorizing our upgrades into different "classes":

  • Class 1 (C1): Infrastructure-level upgrades with high criticality (such as Kubernetes version upgrades, ingress controllers, and critical databases).
  • Class 2 (C2): Applications that are co-managed with the client or fully managed by us, often with added complexity from client-specific customizations.
  • Class 3 (C3): Applications fully managed and maintained by us, upgraded "transparently" for the end client, with no immediate client-visible impact.

Every category presents its own challenges. C1 upgrades can critically impact system availability, C2 changes require extremely detailed coordination with customers, and even the relatively simpler C3 upgrades may unexpectedly reveal hidden dependencies or legacy complexity neglected for years. The line between C2 and C3 is also blurrier than it might seem, as is the decision to promote certain C2 databases to C1.
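Purely for illustration, here is one hypothetical way such a classification could be encoded so that later tooling (downtime scheduling, client notices) can branch on it; none of these names come from our actual codebase.

```python
from enum import Enum


class UpgradeClass(Enum):
    """Hypothetical encoding of the upgrade classes for use by tooling."""
    C1 = "infrastructure"    # Kubernetes versions, ingress controllers, critical databases
    C2 = "co-managed-app"    # co-managed or fully managed apps with client customizations
    C3 = "managed-app"       # fully managed apps, upgraded transparently for the client


# Example: tooling could use the class to decide whether a client-facing
# maintenance notice is required before the upgrade window.
REQUIRES_CLIENT_NOTICE = {
    UpgradeClass.C1: True,
    UpgradeClass.C2: True,
    UpgradeClass.C3: False,
}
```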

The human, organizational, and technical challenges

A significant part of my job involves coordinating a small team of brilliant and motivated engineers. Assigning them tasks isn't a matter of "giving work to someone else." Rather, it involves ongoing mentoring, training, and careful supervision.

Humans… Yes, humans

Nearly every upgrade cycle, something unexpected happens, usually due to small, understandable mistakes (and sometimes I'm the one responsible, too). For example:

  • Important software is left out of the rollout, either forgotten or never mentioned by another engineer.
  • PRs don't fully follow clearly defined templates and protocol, despite documentation.
  • Deployments unintentionally trigger customer-visible incidents.
  • Upgrades for certain applications are prepared in haste, causing uncertainty.
  • Scheduled deployment windows are overlooked or missed.
  • Small differences between dev and production environments lead to unexpected issues.
  • Some problems are quietly introduced and only surface the next day(s).
  • Maintenance mode is left enabled afterward (unless someone remembered to set a timer).
  • Recovery procedures aren't thoroughly prepared, much less tested.
  • Colleagues introduce uncommunicated or undocumented changes.
  • Errors surface in applications inherited from other colleagues.

I'm not blaming anyone—these are normal human mistakes. Each of them is an opportunity to coach, teach, translate complexity, support, and also to bear the occasional burden silently. I've learned to accept gracefully that it's part of being a mentor and the project lead. You help your colleagues grow, even when no one sees the extra effort that goes into avoiding future errors.

Handling old Helm charts and forgotten software

One constant problem is the regular discovery of countless legacy Helm charts: applications left untouched for years. They only appear on our radar when something goes wrong, like the recent DockerHub rate-limit issue that threatened to break our clusters. Uncovering these "lost" workloads in Kubernetes means navigating outdated documentation pages or ancient Helm charts that were never migrated to modern tooling. Every forgotten chart means researching, reading (sometimes non-existent) docs, inspecting source code, or reverse-engineering configurations directly from Kubernetes.
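As a rough illustration of how such an inventory could start, the sketch below shells out to the Helm CLI and flags releases that haven't been touched in a long time. The staleness threshold and the date handling are assumptions; the real investigation is far messier than listing releases.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone

# Assumption: anything untouched for a year deserves a closer look.
STALE_AFTER = timedelta(days=365)


def list_releases():
    """Return every Helm release in the cluster as a list of dicts."""
    out = subprocess.run(
        ["helm", "list", "--all-namespaces", "--output", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(out.stdout)


def is_stale(release):
    # Helm prints timestamps like "2021-03-04 10:22:33.123456 +0000 UTC";
    # parsing only the date part keeps the sketch simple.
    updated = datetime.strptime(release["updated"][:10], "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc) - updated > STALE_AFTER


if __name__ == "__main__":
    for rel in list_releases():
        if is_stale(rel):
            print(f"{rel['namespace']}/{rel['name']}: chart {rel['chart']}, last updated {rel['updated']}")
```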

ArgoCD, Helmfile, and GitOps—Salvations and frustrations

Back then, we managed deployments exclusively through Helmfile or manual Helm upgrades, relying on this somewhat prehistoric approach to Continuous Delivery and applying upgrades by hand as needed. It wasn't ideal: the work was repetitive and left plenty of room for human error. We clearly needed a better way forward.

Migrating to ArgoCD was a career highlight for our team. The promise was magical: no more "drift" between Git and the clusters. Still, a few realities are worth mentioning:

  • Migration path: The initial ArgoCD migrations were laborious, requiring careful state reconciliation and coordination among colleagues and, at times, client teams.
  • ArgoCD Web UI: As the number of applications grew, the Web UI frequently became sluggish and unresponsive, even after scaling its underlying resources several times. For now, that is simply a limitation we accept with ArgoCD.
  • ArgoCD CLI: I ended up training colleagues on how to perform their upgrades with the ArgoCD command-line interface (which is still quite slow), backed by step-by-step documentation and checklists; a simplified sketch of that flow follows below. We still encounter mistakes (broken syncs, bad manifests from PRs), but I proactively handle the cleanup behind the scenes.
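To give an idea of what that CLI-driven checklist boils down to, here is a much-simplified sketch wrapping the argocd CLI. The application name and timeout are placeholders, and the real checklist includes many more manual verification steps.

```python
import subprocess
import sys


def run(cmd, check=True):
    """Run a CLI command, echoing it first so the engineer can follow along."""
    print("+", " ".join(cmd))
    return subprocess.run(cmd, check=check)


def upgrade_app(app_name: str, timeout: int = 600):
    # 1. Show what would change; the engineer reviews this against the PR.
    #    `argocd app diff` exits non-zero when differences exist, so don't treat that as failure.
    run(["argocd", "app", "diff", app_name], check=False)

    # 2. Trigger the sync once the diff has been reviewed.
    run(["argocd", "app", "sync", app_name, "--timeout", str(timeout)])

    # 3. Wait until the application reports a synced and healthy state.
    run(["argocd", "app", "wait", app_name, "--sync", "--health", "--timeout", str(timeout)])


if __name__ == "__main__":
    upgrade_app(sys.argv[1])
```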

Coordinating with clients—A job in itself

Every upgrade cycle starts with me preparing separate communications for dozens of stakeholders across a staggering number of clusters:

  1. Carefully preparing client notices (every client has its own time windows and its own set of software).
  2. Dealing proactively yet empathetically with the additional client-side requirements or scheduling changes that come back in reply.
  3. Carefully handling open tickets in the ticketing system, ensuring nothing gets overlooked and that clients always receive updates when necessary.

Every interaction matters, every misunderstanding can escalate, and every unacknowledged nuance leads to hours of unnecessary toil.

Automation—My painkiller and daydream

No human alone could orchestrate this complexity again and again without automation. I've personally created infrastructure scripts and tools that make this job bearable.

Downtime automation – Reducing alert fatigue

As anyone who's been on call can attest, false-positive alerts triggered by upgrades are both frustrating and disruptive, not just for the on-call engineer but also for the client. To mitigate this, I developed a comprehensive Python tool that automates scheduling downtimes for alerts (a simplified sketch follows after the list below):

  • Flexibly configurable for multiple scenarios (project-wide, cluster-wide, specific namespaces, specific clients, etc.).
  • Allows categorizations for different upgrade scenarios (C1, C2, or C3 upgrades), avoiding accidental downtimes on unrelated systems.

This not only saves the precious minutes otherwise spent manually enabling and disabling alerts, but also gives engineers and clients alike peace of mind during maintenance windows.
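The sketch below captures only the core idea; it assumes a Prometheus Alertmanager-style silences API and hypothetical "cluster" and "namespace" labels, which is not necessarily what our monitoring stack looks like.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Placeholder endpoint; the real tool targets our own monitoring stack.
ALERTMANAGER_URL = "http://alertmanager.example.internal:9093"


def schedule_downtime(matchers, hours, comment, created_by="upgrade-automation"):
    """Create an Alertmanager silence covering the given label matchers."""
    now = datetime.now(timezone.utc)
    silence = {
        "matchers": matchers,
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }
    req = urllib.request.Request(
        f"{ALERTMANAGER_URL}/api/v2/silences",
        data=json.dumps(silence).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["silenceID"]


# Example: silence one namespace in one cluster for a two-hour C2 upgrade window.
if __name__ == "__main__":
    silence_id = schedule_downtime(
        matchers=[
            {"name": "cluster", "value": "client-a-prod", "isRegex": False, "isEqual": True},
            {"name": "namespace", "value": "shop-backend", "isRegex": False, "isEqual": True},
        ],
        hours=2,
        comment="C2 upgrade: shop-backend chart bump",
    )
    print(f"Created silence {silence_id}")
```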

Kubernetes upgrades: Blue/Green deployment automation (C1 upgrades)

The most critical infrastructure-level upgrades (e.g., Kubernetes versions, ingress controllers, databases) require flawlessly executed blue/green deployments. Together with a colleague, I created an almost thousand-line Bash script to handle complex blue/green Kubernetes migrations. Key features include (a much-simplified sketch follows after the list):

  • Statefile tracking: The script keeps a record of already migrated applications, which helps when timing issues or unexpected human intervention surface halfway through.
  • Safe incremental migrations: Migrating only a fixed number of pods at a time (e.g., batches of 5) ensures we don't overwhelm the underlying infrastructure or trigger unexpected race conditions.
  • Automated backups & downtimes: Using tools like Velero, the script performs pre-upgrade backups and verifies post-migration health before continuing to subsequent workloads. It also integrates monitoring checks and creates downtimes.
  • Environment awareness: It includes provider-specific logic for the different cloud providers, handling node cordon, uncordon, and drain operations, and it reminds the engineer about remaining administrative tasks.
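The real thing is the Bash script described above; purely to illustrate the statefile and batching ideas, here is a heavily simplified Python sketch in which `backup`, `move_to_green`, and `healthy` are hypothetical stand-ins for the actual Velero, rescheduling, and monitoring logic.

```python
import json
from pathlib import Path

STATE_FILE = Path("bluegreen-state.json")  # tracks applications already migrated
BATCH_SIZE = 5                             # migrate only a few workloads at a time


def load_state():
    return set(json.loads(STATE_FILE.read_text())) if STATE_FILE.exists() else set()


def save_state(done):
    STATE_FILE.write_text(json.dumps(sorted(done)))


def migrate(apps, backup, move_to_green, healthy):
    """Migrate apps in small batches, recording progress so a rerun can resume."""
    done = load_state()
    pending = [app for app in apps if app not in done]

    for i in range(0, len(pending), BATCH_SIZE):
        for app in pending[i:i + BATCH_SIZE]:
            backup(app)         # pre-upgrade backup (e.g., via Velero)
            move_to_green(app)  # reschedule the workload onto the green side
            if not healthy(app):
                raise RuntimeError(f"{app} unhealthy after migration; stopping here")
            done.add(app)
            save_state(done)    # persist after every app so an interruption is recoverable
```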

Looking ahead: Enhancing automation into a unified upgrade tool

Despite these significant automation gains, my personal dream is still to consolidate all of this intricate tooling into a single, unified "Upgrade Manager": a master tool that flexibly handles parameters such as the type of upgrade, the client configuration, and the scripts that dynamically cover specialized cases and integrations, reducing the number of manual interventions.

This is challenging given all the intricacies: multiple upgrade classes, unique client scenarios, varied downtime and recovery procedures, alert handling, and a different script for nearly every scenario. But it's the improvement I envision as the next transformative step for our entire organization: even greater scale with stability, fewer mistakes, easier onboarding for newly trained junior engineers, less burnout, and fewer unexpected nights spent troubleshooting human errors or missed edge cases.
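If I ever sit down to build it, the entry point might look something like the sketch below: one CLI that takes the upgrade class, client, and cluster and dispatches to the existing specialized tooling. Every name here is hypothetical; this is a thought experiment, not a design document.

```python
import argparse
import subprocess

# Hypothetical mapping from upgrade class to the existing specialized tooling.
HANDLERS = {
    "c1": ["./bluegreen-upgrade.sh"],
    "c2": ["python3", "app_upgrade.py"],
    "c3": ["python3", "app_upgrade.py", "--transparent"],
}


def main():
    parser = argparse.ArgumentParser(description="Unified upgrade manager (sketch)")
    parser.add_argument("--upgrade-class", choices=sorted(HANDLERS), required=True)
    parser.add_argument("--client", required=True, help="client identifier")
    parser.add_argument("--cluster", required=True, help="target cluster")
    args = parser.parse_args()

    # 1. Schedule monitoring downtimes for the affected scope (see the earlier sketch).
    # 2. Run the specialized handler for this upgrade class.
    cmd = HANDLERS[args.upgrade_class] + ["--client", args.client, "--cluster", args.cluster]
    subprocess.run(cmd, check=True)
    # 3. Remove the downtimes, verify health, and send the client notice afterwards.


if __name__ == "__main__":
    main()
```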

Yet, even with the finest automation tooling in place, the true art of upgrade management (carefully and thoughtfully preparing each upgrade) remains a fundamentally human task. This delicate preparation involves meticulously reviewing lengthy changelogs, interpreting the subtle impact of changes, accommodating each client's individual preferences and customizations, and anticipating potential interdependencies across clusters, applications, and infrastructure.

Navigating these nuances goes beyond the scope of scripts or automation policies; it requires the intuition, awareness, empathy, and critical judgment that only human experience can provide. So while I pursue this dream of full tool unification and automation, I still deeply value, and keep practicing, the invisible art of personally making sure that every detail, every variable, and every client's needs in their own context are thoughtfully considered, understood, and addressed.

Final reflections: "Invisible" work—Yet deeply rewarding

When done effectively, upgrade management at scale lets our clients focus on their core business instead of software maintenance headaches. From the outside, upgrades seem invisible: seamless, easy, almost trivial. Yet beneath the surface lies a vast space of human, organizational, and technical complexity: interdependencies, internal communication, training, mentoring, automation, troubleshooting, documentation, and, yes, endless cups of coffee.

My good friend—the engineer who asked if I'm "just managing software upgrades"—wasn't entirely incorrect. Ultimately, this invisible art is a nuanced form of development: scripting, automation, pipeline improvements, infrastructure as code—it involves all of this. But at its heart, it's always about people and processes; making life simpler for the next engineer behind us and ensuring "invisible" remains synonymous with "effortlessly successful."

It's complex. It's invisible. It's challenging. And yes—I wouldn't trade it.

Sergio Fernández

Senior Cloud DevOps Engineer specializing in Kubernetes.
Murcia, Spain