DevOps & Cloud Engineering / Lesson 32 — Disaster Recovery & Backups

Disaster Recovery & Backups

RPO, RTO, restore drills — preparing for the day a region disappears.


What Disasters Look Like

An "incident" is something breaking that you fix in minutes to hours. A "disaster" is a failure too big for your normal incident response to handle: you're restoring data or rebuilding infrastructure, not just restarting a service.

Real-world disasters teams face:
• Cloud region outage (rare but happens — multi-hour, sometimes day-long)
• Database corruption from a bad migration or script
• Accidental deletion (DROP TABLE, S3 bucket nuked)
• Ransomware encrypting your systems
• Cloud account compromise
• Service provider going out of business
• Compliance forcing you to rebuild from scratch

Most teams underestimate disasters until they happen. Then they invest. You can decide which side of that learning curve to be on.


RPO and RTO — The Two Numbers

Disaster planning starts with two questions:

RPO (Recovery Point Objective) — how much data are you willing to lose?
• 0 minutes = never lose any committed data (synchronous replication)
• 5 minutes = could lose 5 minutes of data (async replication)
• 1 hour = hourly backups
• 24 hours = daily backups

RTO (Recovery Time Objective) — how long can you be down?
• 1 minute = automated failover required
• 1 hour = manual failover practiced
• 24 hours = restore from backup with people working
• 1 week = "we'll figure it out"

Each tier costs more. RPO and RTO aren't engineering decisions; they're business decisions about how much downtime and data loss are worth avoiding.

Map RPO/RTO to specific systems:

                        RPO            RTO
Production database:    1 minute       30 minutes
User uploads (S3):      0              1 hour
Analytics warehouse:    1 hour         8 hours
Reporting tools:        1 day          1 day
Internal wiki:          1 week         1 week

Critical systems get tight RPO/RTO; less critical can be looser.
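Targets like these are easier to enforce when they live in code rather than a spreadsheet. A minimal sketch, where the `TARGETS` table and `meets_objectives` helper are hypothetical names mirroring the table above:

```python
from datetime import timedelta

# Hypothetical RPO/RTO targets, mirroring the table above.
TARGETS = {
    "production_db": {"rpo": timedelta(minutes=1), "rto": timedelta(minutes=30)},
    "user_uploads":  {"rpo": timedelta(0),         "rto": timedelta(hours=1)},
    "analytics":     {"rpo": timedelta(hours=1),   "rto": timedelta(hours=8)},
}

def meets_objectives(system, data_loss, downtime):
    """Return True if a drill's measured data loss and downtime meet targets."""
    t = TARGETS[system]
    return data_loss <= t["rpo"] and downtime <= t["rto"]
```

A restore drill can then record its measured data loss and downtime and assert against the targets, so a slipping RTO fails loudly instead of quietly.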


Backup Strategies

Three rules of backups (the 3-2-1 rule):
• 3 copies of data (production + 2 backups)
• 2 different storage media (or types)
• 1 copy off-site (different region or different cloud)
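The 3-2-1 rule is mechanical enough to check automatically. A sketch, assuming you keep a simple inventory of where each dataset's copies live (the `copies` list and `satisfies_3_2_1` helper are hypothetical):

```python
# Hypothetical inventory of where one dataset's copies live.
copies = [
    {"location": "us-east-1", "medium": "rds-snapshot", "offsite": False},  # production
    {"location": "us-east-1", "medium": "s3",           "offsite": False},
    {"location": "eu-west-1", "medium": "s3",           "offsite": True},
]

def satisfies_3_2_1(copies):
    """3 copies, on at least 2 media/storage types, with at least 1 off-site."""
    return (
        len(copies) >= 3
        and len({c["medium"] for c in copies}) >= 2
        and any(c["offsite"] for c in copies)
    )
```

Run a check like this per dataset in CI or a nightly job; the datasets that fail are usually the forgotten ones.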

For databases:

Automated backups — RDS / Cloud SQL do this by default. Daily snapshots, 7-day retention typical.

Point-in-time recovery (PITR) — replay transaction logs to restore to a specific second. AWS RDS supports this for up to 35 days back. Critical for human-error recovery.

Logical exports — pg_dump / mysqldump for periodic full exports. Stored in S3 with lifecycle policies.

Cross-region copies — RDS snapshots can be auto-copied to another region. Survives region failure.
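The retention logic behind those lifecycle policies can be sketched in a few lines. This hypothetical `backups_to_keep` helper keeps a daily tier plus a weekly tier, a simplified grandfather-father-son scheme; real lifecycle rules in RDS or S3 are configured, not hand-rolled:

```python
from datetime import date, timedelta

def backups_to_keep(dates, today, daily=7, weekly=4):
    """Given snapshot dates, keep the last `daily` days of snapshots,
    plus one snapshot per ISO week for the last `weekly` weeks."""
    keep = set()
    dates = sorted(dates, reverse=True)  # newest first
    for d in dates:
        if (today - d).days < daily:
            keep.add(d)
    # Weekly tier: the newest snapshot in each of the most recent weeks.
    seen_weeks = set()
    for d in dates:
        week = d.isocalendar()[:2]  # (ISO year, ISO week)
        if week not in seen_weeks and len(seen_weeks) < weekly:
            seen_weeks.add(week)
            keep.add(d)
    return keep
```

Everything not in the returned set is eligible for deletion, which is exactly the decision an S3 lifecycle rule or snapshot retention policy makes for you.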

For S3:
• Versioning — enable on critical buckets
• Cross-region replication — automatically copies new objects to a bucket in another region
• Object Lock — write-once-read-many
• Backup to a separate AWS account — protects against account compromise

For application state:
• Code is in Git — distributed by design; every clone plus the hosting provider is a copy
• Configuration in Terraform / IaC repo (also Git)
• Secrets — backup the secret store

The principle: every piece of state your business depends on must be backed up. List them out. There's almost always a forgotten one.


Test Your Backups

Untested backups are not backups. They're hopes.

The single most common DR failure: discovering during a real disaster that backups don't actually restore. Reasons:
• Backup files are corrupt
• Restore process was never documented
• Restoration takes 10x longer than expected
• Some data wasn't being backed up at all
• Restore script depends on systems that are also down

Test backups quarterly:
1. Pick a backup at random
2. Restore it to a test environment
3. Verify the data is correct
4. Time the process
5. Document anything that surprised you
6. Update RTO if reality is worse than your target
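Step 3, verifying the data, should be more than eyeballing row counts. One common approach is to record a checksum at backup time and compare it after restore; a sketch, where `verify_restore` is a hypothetical helper:

```python
import hashlib

def checksum(path, algo="sha256"):
    """Stream a file through a hash so large backups never load into memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore(restored_path, expected_digest):
    """Confirm the restored file matches the digest recorded at backup time."""
    return checksum(restored_path) == expected_digest
```

For databases, the same idea applies at a higher level: record row counts or per-table checksums at backup time and compare after restore.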

Some teams run a full DR drill annually:
• Pretend a region is down
• Restore systems in the DR region
• Switch traffic
• Verify business operations work
• Switch back

This is expensive (a day or two of engineering time) but invaluable. The first one is always disastrous; subsequent ones get smoother.


Multi-Region Strategies

For systems with tight RTO that can't tolerate region failure, multi-region is the answer. Expensive and complex; only adopt when warranted.

Active-passive (warm standby):
• Primary region serves all traffic
• Standby region has infrastructure provisioned but idle
• Database replicates async (some data loss possible)
• On disaster: failover to standby
• RTO: 30-60 minutes typical
• Cost: ~30% more than single region

Active-active:
• Both regions serve traffic simultaneously
• Database is active in both (multi-master or globally distributed)
• Load balancer routes by latency
• On disaster: traffic shifts to surviving region
• RTO: minutes (or seconds with health checks)
• Cost: 2x or more

The hard problem: data
• Synchronous replication across regions adds 50-200ms latency to every write
• Asynchronous replication = small data loss on failover
• Globally distributed databases (Spanner, Cosmos DB, CockroachDB, Aurora Global) handle this but cost more

Most teams settle for:
• Stateless services in multiple regions
• Single primary database with cross-region read replicas
• Async replication with documented data loss window
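That documented data loss window shouldn't be a guess. With async replication, the worst-case loss on failover is roughly the replication lag at the moment of failure, so derive the window from observed lag. A sketch (the `worst_case_rpo` helper is hypothetical, assuming lag samples in seconds):

```python
def worst_case_rpo(lag_samples_s, percentile=0.99):
    """Estimate the data-loss window for async replication from observed
    replica lag (in seconds). Document the tail, not the average:
    failovers tend to happen precisely when the system is unhealthy."""
    ordered = sorted(lag_samples_s)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[idx]
```

Feed it lag samples from your monitoring system and the number it returns is the honest RPO to put in the runbook.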

The 80/20 of multi-region:
• Static assets via CDN — already global
• Read traffic to nearest region's read replicas
• Write traffic to single primary
• On primary outage: promote a replica (manual decision)

For most apps, this is enough.


People-Based DR

Technology aside, your DR plan has people dependencies. What if:
• The on-call engineer is on a flight?
• Two SMEs are at the same conference?
• The only person who knows the secret rotation process is on vacation?
• The CEO is unreachable when status page decisions need to be made?

Plan for human availability:
• Document everything important (no "Bob knows how to do that")
• At least two people know each critical procedure
• Cross-train. Run incident drills with the "wrong" people leading.
• Have escalation paths that don't require a single person being reachable
• Communication procedures that don't depend on one channel — phone, Slack, and email should each work as a fallback

The runbook test: print your runbooks and try to recover from a tabletop disaster using only the documents. What's missing?

The next lesson covers FinOps — the discipline of managing cloud costs.

