Paging into the Night—Assess Before You Fix: Five Years of On-Call Lessons

Amazing sunset in Nova Zagora, Bulgaria. Summer 2018 © Sergio Fernández
Being on-call is the tax you pay for running software in production.

The first time my phone buzzed at 3 a.m. I sprang out of bed, half superhero, half imposter, certain I was about to save the internet. Five years later the ringtone still yanks me upright, but the impulse has changed: I open the laptop first to assess, not to save. That single shift in mindset—from “fix everything” to “understand what’s happening, how bad it is, and who really owns it”—has kept me (mostly) sane while our pager volume has snowballed with every new client, every new cluster and every new teammate who thinks “Kubernetes-as-a-Service” means “call the cloud folks for everything.”

This post braids together a handful of war stories, a pocket guide for junior engineers, and a short wish-list for managers who still only say “just fix it.” Sprinkled throughout are some hard-won lessons that echo what Google’s SRE book formalised years ago: the goal of on-call is sustainable reliability, not personal heroics.

More Clients, More Noise

When I joined the cloud-Kubernetes team we had three customers and a pager rotation that was more theory than practice. Today we support many SaaS tenants, hybrid clusters, dozens of AWS and GCP projects and an on-prem Kubernetes estate. Pager volume followed something like Metcalfe’s Law: each new system created combinatorially more ways to wake us up.

What changed wasn’t just scale; it was ambiguity. The midnight page no longer meant “our stuff is down.” It meant “something, somewhere, attached to us by an undocumented IAM role, is unhappy.” At that point assessment—knowing where the boundary lies and who is supposed to act—became the primary skill.

Duty #1: Assess, Not Solve

A page is a request for situational awareness, not instant remediation. Google’s SRE guideline of at most two incidents per 12-hour shift is built around this idea: diagnose, mitigate if you can, wake the right people if you can’t, and capture enough context for daylight follow-up. The incident commander’s checklist at Google starts with two lines:

  • Is this real user impact?
  • Do I own the fix?

Those two questions decide whether you type furiously, escalate, or hit Snooze-Until-Morning. Everything that follows in this post hangs off that principle.

War Stories (and the Take-aways)


1 - The Half-Built Cluster that Stopped a Car Factory


The page hit at exactly midnight: “K8s / API unreachable”. This was the brand-new environment for a well-known German car maker, half-built by a teammate who had left the runbook littered with TODOs. Because only the alerts were live, not the workloads (something I only pieced together properly later), every metric screamed red.

I spent four hours hopping between Grafana graphs that showed zero traffic, ssh sessions that timed out, and a sleepy German SRE on Teams who kept saying, “Alles grün bei uns” (“all green on our side”). From my side every control-plane node was dark—no observability, no API server, nothing.

At 03:45 I finally wrangled access to their internal status board and saw the real culprit: a planned power-maintenance window had taken the whole platform team offline. Nothing for me to fix, nothing for him to see.

Impact assessed, scope confirmed, ticket parked, pillow reclaimed.

Take-away: Always verify ownership. If it isn’t your platform, escalate early and save the heroics—and your sleep—for issues you can actually control. Double-check this especially when, as in my case, the documentation is half-finished or missing.

2 - The NGINX Zero-Day and the Flu

I was on call, but this one wasn’t my responsibility. Reading Hacker News at midnight I spotted CVE-2025-blah-blah hitting NGINX (the second time NGINX had pulled this on me; the previous one, in 2024, I patched promptly). Pagers were silent—this falls under the security team, not platform. Still, I dropped the link in our #security-alerts channel, tagged the on-duty analyst, and went back to fighting a 39 °C fever.

Fast-forward two days: my manager wanted to know why I hadn’t patched production. My Slack ping hadn’t vanished into the noise (a colleague picked it up the next morning), yet somehow he still wanted more.

Take-away: Spotting a risk isn’t the same as owning the fix. When it’s not your mandate (and especially when you’re sick), create a trackable ticket, hand it to the proper department, and make sure your manager knows you stayed on rotation despite the fever 😄—then rest.

3 - PVCs at 100 %: My Favourite “Go-Back-to-Sleep” Alert

When Prometheus cries VolumeFull, I already know the playbook: kubectl patch pvc …, then update the chart; or scale the workload if it’s a managed DB in the cloud. Ten minutes later Grafana is happy and I’m horizontal again.
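
For illustration, the single-command version of that playbook looks roughly like this on our clusters; the namespace, PVC and StorageClass names below are made up, and expansion only works if the StorageClass allows it.

# Minimal VolumeFull remediation sketch (namespace, PVC and StorageClass
# names are hypothetical). Expansion only works when the StorageClass
# has allowVolumeExpansion: true.
kubectl get storageclass gp3 -o jsonpath='{.allowVolumeExpansion}'
kubectl -n prod patch pvc data-postgres-0 \
  -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'
# Watch the resize complete, then persist the new size in the Helm chart
# so the next deploy doesn't try to shrink it back on paper.
kubectl -n prod get pvc data-postgres-0 -w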

Take-away: Automate the boring stuff, but if you can’t yet, at least make the remediation single-command and well-documented.

4 - Upgrades Gone Sideways: DefectDojo & RabbitMQ

I wear the Upgrade Manager hat, so every bump lands on my desk. This DefectDojo jump looked routine until the migration script garbled the Postgres schema and the DB backup we’d taken refused to restore, no matter which restore procedure we threw at it.

Ten hours later we were still coaxing data back: the client had to run throw-away security scans so we could verify findings, and Semgrep’s JSON format had changed, which broke half the import pipeline. By dawn the platform was healthy, but only because the rollback/restore playbook was printed, highlighted, and applied.

Regarding RabbitMQ: A junior engineer kicked off what should have been a painless point-release. The cluster never came back—version skew meant half the nodes refused to re-join. I wasn’t on call, but the “if an upgrade tanks, ping me” rule applies, so we paired up: snapshot restore, config diff, rolling restart, back in service before midnight. The documented backup-first protocol saved us; the mentoring moment was a bonus.
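
The version-skew trap is cheap to catch up front. A rough pre-flight we now pair with every RabbitMQ bump (pod and namespace names are illustrative):

# RabbitMQ pre-upgrade checks (pod and namespace names are illustrative).
kubectl -n messaging exec rabbitmq-0 -- rabbitmqctl cluster_status
kubectl -n messaging exec rabbitmq-0 -- rabbitmqctl list_feature_flags
# Enable all stable feature flags *before* bumping the version; nodes on
# the new release may refuse to re-join a cluster that hasn't done this.
kubectl -n messaging exec rabbitmq-0 -- rabbitmqctl enable_feature_flag all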

Take-away: Detailed, drill-tested upgrade runbooks turn chaos into a (long) checklist. Backups are only real when you’ve proved you can restore them, and juniors can steer the ship if the map is crystal clear.
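
“Proved you can restore” can be as cheap as a periodic drill against a scratch instance. A minimal sketch with pg_dump and pg_restore, assuming made-up hosts, database and table names:

# Backup-and-restore drill; hosts, users and table names are placeholders.
pg_dump --format=custom --file=defectdojo.dump \
  --host=db.internal --username=dojo defectdojo
createdb --host=scratch-db.internal restore_test
pg_restore --host=scratch-db.internal --dbname=restore_test \
  --exit-on-error defectdojo.dump
# Cheap sanity check: row counts on a table you actually care about.
psql --host=scratch-db.internal --dbname=restore_test \
  -c 'SELECT count(*) FROM dojo_finding;'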

5 - Mystery Alert: Cloud SQL That Wasn’t

The pager said Cloud SQL was down. In reality Prometheus was timing out while scraping the Stackdriver exporter. Two clicks to bump the ServiceMonitor scrape timeout and the “database” magically healed.
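
For the curious, the “two clicks” boil down to something like this on a Prometheus Operator setup; the ServiceMonitor name and values are approximate.

# Inspect the current scrape settings, then raise the timeout.
# Name, namespace and values are approximate; a JSON merge patch replaces
# the entire endpoints list, so keep every existing entry in the payload.
kubectl -n monitoring get servicemonitor stackdriver-exporter -o yaml
kubectl -n monitoring patch servicemonitor stackdriver-exporter --type merge \
  -p '{"spec":{"endpoints":[{"port":"metrics","interval":"60s","scrapeTimeout":"55s"}]}}'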

Take-away: Not every symptom is the disease. Monitor the monitoring stack or you’ll chase false leads forever.

Patterns That Actually Work


Signal over noise

If an alert isn’t actionable in the next 10 minutes, it’s not a page, it’s an e-mail. Our rotation’s sanity improved the day we deleted 40 % of our rules and demoted another 30 % to ticket-only.

Playbooks as executable checklists

A good entry answers three questions: What does this alert really mean? How do I confirm impact? What’s the first safe mitigation? Anything longer belongs in docs or, better yet, scripts.

Escalation paths that managers respect

If the runbook says “call Security before patching,” managers must back the on-caller who follows the rule, even when Twitter is roasting the company. Psychological safety isn’t a feel-good slogan; it’s the only way people keep picking up the pager.

Rotation hygiene

We don’t yet track pager load as a trailing 21-day average, as the SRE book suggests. The idea is gold, but we still have to roll it out in a way that keeps all parties satisfied.

A Pocket Guide for First-Time On-Callers


Prep before your first shift:

  • Know where the dashboards live.
  • Bookmark the runbooks.
  • Have working VPN/kubectl/ssh keys on two devices.
  • Read yesterday’s hand-off.

The 5-minute triage loop:

  1. Is it real user impact?
  2. Does the playbook exist? Follow it.
  3. No playbook? Check logs & dashboards for the last change.
  4. Can I mitigate safely within 1 hour? Yes → do it. No → escalate.
  5. Write three lines in the ticket: Symptom, Suspected scope, Action taken / next step.

After the storm

Grab screenshots, logs and timeline while fresh. File bugs for any manual step you had to take. Go get coffee—or sleep.

A Short List for Managers


Fund rest and redundancy

How many people a rotation needs, and for how long, varies wildly from project to project; staff rest and redundancy accordingly. If the budget won’t cover that, budget for attrition instead.

Reward reduction, not reaction

Celebrate the engineer who deletes 200 noisy alerts, not the one who slacks #general about “another all-nighter, fixed!” Hero culture keeps pager volume high.

Protect project time

Google asks SREs to spend at least 50 % on engineering, not ops. The math is brutal: if every shift runs hot, nothing ever gets automated and tomorrow is worse.

Provide escalation air-cover

When the on-caller says “We’re outside our remit, engaging vendor,” back them publicly. Second-guessing in the incident channel erodes confidence instantly.

Lead from the front

As the technical lead, take part in both the planning and the on-call rotation. Opting out signals that pager duty is “someone else’s problem,” and your engineers will notice—and resent it.

Designing Alerts that Don’t Hate You


  • Start with an SLO: Pick a user-visible metric (latency, error rate, availability). Page only when the SLO is being violated fast enough to burn the week’s error budget in an hour or two; a sketch of such a rule follows this list.
  • Categorise like adults: P1 – wake me now. P2 – business hours ticket. P3 – dashboard only. Anything else goes to /dev/null.
  • Test in shadow mode: Run new rules as e-mails for a week; predict their pager-budget cost before promotion.
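
To make the first bullet concrete, here is roughly what a fast-burn page looks like as a Prometheus rule, applied the way we ship most config, straight through kubectl. The metric names, the 99.9 % SLO and the 14.4x factor follow a single-window simplification of the SRE workbook’s burn-rate alerts; your numbers will differ.

# Fast-burn page for a 99.9 % availability SLO (metric names and
# thresholds are illustrative; the full workbook pattern adds a shorter
# confirmation window).
kubectl -n monitoring apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-burn-rate
spec:
  groups:
    - name: slo.rules
      rules:
        - alert: ErrorBudgetFastBurn
          expr: |
            sum(rate(http_requests_total{code=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > 14.4 * 0.001
          for: 5m
          labels:
            severity: page   # P1 - wake me now
          annotations:
            summary: Burning the error budget at more than 14x the sustainable rate
EOF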

Incident Reviews that Teach, Not Blame


Blameless & detailed

Write “The pod eviction caused a cascading failure” not “Alice fat-fingered kubectl.” List contributing factors, detection gaps, and concrete actions with owners and due dates.

Automate one thing

Every post-mortem must spawn at least one automation or monitoring improvement. Otherwise you’re producing literature, not learning.

Share widely

Circulate highlights to product teams and even sales. Outages are free training material—milk them.

Conclusion


Five years of living with the pager have convinced me of two truths:

  1. Reliable systems are the by-product of reliable processes, not heroic individuals.
  2. The most valuable skill an on-call engineer can cultivate is calm, surgical assessment.

Everything else—speed, tools, even deep technical knowledge—flows from knowing when to act and when to escalate.

So to the junior engineer dreading their first rotation: breathe, read the playbook, and remember your real job is to understand, not to instantly repair. And to the manager asking why the problem isn’t “simply fixed” yet: give your team the time, tools and trust to do assessment right. The fixes will follow—usually before the next 3 a.m. alarm.

Sergio Fernández

Senior Cloud DevOps Engineer specializing in Kubernetes.
Murcia, Spain