Zero-downtime is a marketing word for an engineering claim that's almost never actually tested. Most upgrades succeed because they didn't surface the case the rollback was designed for. The opposite story — the upgrade hit the case and the rollback worked — is rarer and worth more.
The pieces that have to exist before you call something zero-downtime
- A migration runbook reviewed by someone other than the author.
- A rollback script that's been executed end-to-end against a production-shaped snapshot.
- A pre-flight script that verifies the assumed schema state before the migration runs.
- A canary path that lets a small percentage of traffic hit the new code before the cutover, with explicit success criteria.
- An evidence-collection step that writes the state before, during, and after the migration to a separate audit log.
What we've learned from doing it badly
Skip step 3 and the migration silently no-ops on a row your code thought existed. Skip step 5 and the post-mortem becomes archeology. The expensive lesson, every time, is that the runbook is the rehearsal — if it hasn't been run on a snapshot, it hasn't been written.