Disaster Recovery & Backups
RPO, RTO, restore drills — preparing for the day a region disappears.
What Disasters Look Like
An "incident" is something breaking that you fix in minutes-to-hours. A "disaster" is something breaking that takes longer than your normal incident response can handle.
Real-world disasters teams face:
• Cloud region outage (rare but happens — multi-hour, sometimes day-long)
• Database corruption from a bad migration or script
• Accidental deletion (DROP TABLE, S3 bucket nuked)
• Ransomware encrypting your systems
• Cloud account compromise
• Service provider going out of business
• Compliance forcing you to rebuild from scratch
Most teams underestimate disasters until they happen. Then they invest. You can decide which side of that learning curve to be on.
RPO and RTO — The Two Numbers
Disaster planning starts with two questions:
RPO (Recovery Point Objective) — how much data are you willing to lose?
• 0 minutes = never lose any committed data (synchronous replication)
• 5 minutes = could lose 5 minutes of data (async replication)
• 1 hour = hourly backups
• 24 hours = daily backups
RTO (Recovery Time Objective) — how long can you be down?
• 1 minute = automated failover required
• 1 hour = manual failover practiced
• 24 hours = restore from backup with people working
• 1 week = "we'll figure it out"
Each tighter tier costs more. RPO/RTO aren't engineering decisions; they're business decisions.
Map RPO/RTO to specific systems:
| System | RPO | RTO |
| --- | --- | --- |
| Production database | 1 minute | 30 minutes |
| User uploads (S3) | 0 | 1 hour |
| Analytics warehouse | 1 hour | 8 hours |
| Reporting tools | 1 day | 1 day |
| Internal wiki | 1 week | 1 week |
Critical systems get tight RPO/RTO; less critical can be looser.
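A cheap way to keep those targets from drifting into fiction is to encode them next to the systems they cover and alert when a backup gets older than its RPO. A minimal sketch in Python; the system names and the ages fed into `check_rpo` are hypothetical stand-ins for whatever your backup tooling actually reports.

```python
from datetime import timedelta

# RPO targets per system, mirroring the table above. The RPO-0 case
# (user uploads) is handled by replication rather than backups, so it
# isn't checked here.
RPO_TARGETS = {
    "production-db": timedelta(minutes=1),
    "analytics-warehouse": timedelta(hours=1),
    "reporting": timedelta(days=1),
    "wiki": timedelta(weeks=1),
}

def check_rpo(latest_backup_age):
    """Return systems whose newest backup is older than its RPO target."""
    violations = []
    for system, rpo in RPO_TARGETS.items():
        age = latest_backup_age.get(system)
        if age is None or age > rpo:
            violations.append(f"{system}: newest backup age {age}, RPO {rpo}")
    return violations

# Hypothetical ages reported by backup tooling; wire this to an alert.
print(check_rpo({
    "production-db": timedelta(minutes=3),   # violation: RPO is 1 minute
    "analytics-warehouse": timedelta(minutes=20),
    "reporting": timedelta(hours=6),
    "wiki": timedelta(days=2),
}))
```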
Backup Strategies
Three rules of backups (the 3-2-1 rule):
• 3 copies of data (production + 2 backups)
• 2 different storage media (or types)
• 1 copy off-site (different region or different cloud)
For databases:
Automated backups — RDS / Cloud SQL do this by default. Daily snapshots, 7-day retention typical.
Point-in-time recovery (PITR) — replay transaction logs to restore to a specific second. AWS RDS supports this for up to 35 days back. Critical for human-error recovery.
Logical exports — pg_dump / mysqldump for periodic full exports. Stored in S3 with lifecycle policies.
Cross-region copies — RDS snapshots can be auto-copied to another region. Survives region failure.
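The cross-region copy in the last item can also be scripted, for cases where the built-in feature isn't available or you want an extra copy on your own schedule. A minimal sketch with boto3, assuming a hypothetical instance `prod-db` in us-east-1 and us-west-2 as the DR region.

```python
import boto3

SOURCE_REGION = "us-east-1"    # where prod-db lives (hypothetical)
DR_REGION = "us-west-2"        # region the copy should survive into
INSTANCE_ID = "prod-db"        # hypothetical instance identifier

source = boto3.client("rds", region_name=SOURCE_REGION)
target = boto3.client("rds", region_name=DR_REGION)

# Find the newest completed automated snapshot of the instance.
snapshots = [
    s for s in source.describe_db_snapshots(
        DBInstanceIdentifier=INSTANCE_ID, SnapshotType="automated"
    )["DBSnapshots"]
    if s["Status"] == "available"
]
latest = max(snapshots, key=lambda s: s["SnapshotCreateTime"])

# Copy it into the DR region. The call goes to the *destination* region and
# references the source snapshot by ARN. Encrypted snapshots also need a
# KmsKeyId that exists in the destination region.
target.copy_db_snapshot(
    SourceDBSnapshotIdentifier=latest["DBSnapshotArn"],
    TargetDBSnapshotIdentifier=f"{INSTANCE_ID}-dr-copy",
    SourceRegion=SOURCE_REGION,  # lets boto3 presign the cross-region request
)
```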
For S3:
• Versioning — enable on critical buckets
• Cross-region replication — copies new objects automatically to a bucket in another region (requires versioning on both buckets; see the sketch after this list)
• Object Lock — write-once-read-many, so backup objects can't be deleted or overwritten during their retention period (useful against ransomware)
• Backup to a separate AWS account — protects against account compromise
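A minimal sketch of the first two items with boto3. The bucket names and replication role ARN are hypothetical, and versioning must be enabled on both buckets before the replication rule takes effect.

```python
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "my-critical-bucket"                          # hypothetical
REPLICA_ARN = "arn:aws:s3:::my-critical-bucket-replica"       # bucket in another region
ROLE_ARN = "arn:aws:iam::123456789012:role/s3-replication"    # hypothetical IAM role

# 1. Versioning: a prerequisite for replication, and what lets you recover
#    from accidental overwrites and deletes.
s3.put_bucket_versioning(
    Bucket=SOURCE_BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# 2. Replicate every new object to the replica bucket in another region.
s3.put_bucket_replication(
    Bucket=SOURCE_BUCKET,
    ReplicationConfiguration={
        "Role": ROLE_ARN,
        "Rules": [{
            "ID": "replicate-all",
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {},  # empty filter = replicate the whole bucket
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": REPLICA_ARN},
        }],
    },
)
```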
For application state:
• Code is in Git (already backed up by GitHub)
• Configuration in Terraform / IaC repo (also Git)
• Secrets — back up the secret store (see the sketch below)
The principle: every piece of state your business depends on must be backed up. List them out. There's almost always a forgotten one.
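For the secret store in that list, "back it up" usually means a periodic export that is itself encrypted and locked down at least as tightly as the store it copies. A rough sketch against AWS Secrets Manager with boto3; the backup bucket is hypothetical and should have SSE-KMS and Object Lock enabled, ideally in a separate account.

```python
import json
import boto3

secrets = boto3.client("secretsmanager")
s3 = boto3.client("s3")

BACKUP_BUCKET = "dr-secrets-backup"   # hypothetical; SSE-KMS + Object Lock, separate account

def export_secrets():
    """Fetch every secret's current value. Treat the result as highly sensitive."""
    dump = {}
    paginator = secrets.get_paginator("list_secrets")
    for page in paginator.paginate():
        for meta in page["SecretList"]:
            value = secrets.get_secret_value(SecretId=meta["ARN"])
            dump[meta["Name"]] = value.get("SecretString", "<binary secret omitted>")
    return dump

# Write a single export object; the bucket's KMS key and a strict bucket
# policy are what keep this copy safe.
s3.put_object(
    Bucket=BACKUP_BUCKET,
    Key="secrets-export.json",
    Body=json.dumps(export_secrets()).encode(),
    ServerSideEncryption="aws:kms",
)
```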
Test Your Backups
Untested backups are not backups. They're hopes.
The single most common DR failure: discovering during a real disaster that backups don't actually restore. Reasons:
• Backup files are corrupt
• Restore process was never documented
• Restoration takes 10x longer than expected
• Some data wasn't being backed up at all
• Restore script depends on systems that are also down
Test backups quarterly:
1. Pick a backup at random
2. Restore it to a test environment
3. Verify the data is correct
4. Time the process
5. Document anything that surprised you
6. Update RTO if reality is worse than your target
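Steps 1 through 4 can be scripted so the quarterly drill actually happens. A rough sketch with boto3 that restores the newest snapshot of a hypothetical `prod-db` instance into a throwaway instance and times it; the data-verification queries are left out because they depend on your schema.

```python
import time
import boto3

rds = boto3.client("rds")

INSTANCE_ID = "prod-db"               # hypothetical source instance
DRILL_ID = "prod-db-restore-drill"    # throwaway instance; delete it after the drill

# 1. Pick a backup (latest shown for brevity; choosing one at random is better).
snapshots = [
    s for s in rds.describe_db_snapshots(DBInstanceIdentifier=INSTANCE_ID)["DBSnapshots"]
    if s["Status"] == "available"
]
snapshot = max(snapshots, key=lambda s: s["SnapshotCreateTime"])

# 2. Restore it into an isolated test instance and time the whole process.
started = time.monotonic()
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier=DRILL_ID,
    DBSnapshotIdentifier=snapshot["DBSnapshotIdentifier"],
    PubliclyAccessible=False,
)
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=DRILL_ID)
elapsed_min = (time.monotonic() - started) / 60

# 3-4. Run schema-specific verification queries here, then compare the timing
#      against the RTO target and record both.
print(f"Restore of {snapshot['DBSnapshotIdentifier']} took {elapsed_min:.1f} minutes")
```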
Some teams run a full DR drill annually:
• Pretend a region is down
• Restore systems in the DR region
• Switch traffic
• Verify business operations work
• Switch back
This is expensive (a day or two of engineering time) but invaluable. The first one is always disastrous; subsequent ones get smoother.
Multi-Region Strategies
For systems with tight RTO that can't tolerate region failure, multi-region is the answer. Expensive and complex; only adopt when warranted.
Active-passive (warm standby):
• Primary region serves all traffic
• Standby region has infrastructure provisioned but idle
• Database replicates async (some data loss possible)
• On disaster: failover to standby
• RTO: 30-60 minutes typical
• Cost: ~30% more than single region
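The failover step itself is often just DNS. A minimal sketch of a Route 53 failover record pair with boto3; the hosted zone ID, domain, endpoint IPs, and health check ID are hypothetical, and a real failover also includes promoting the standby database.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z123EXAMPLE"           # hypothetical hosted zone
NAME = "api.example.com."
PRIMARY_HEALTH_CHECK = "hc-primary-id"   # hypothetical health check on the primary

def failover_record(set_id, role, target_ip, health_check=None):
    """Build one half of a PRIMARY/SECONDARY failover pair."""
    record = {
        "Name": NAME,
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,                # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": target_ip}],
    }
    if health_check:
        record["HealthCheckId"] = health_check
    return {"Action": "UPSERT", "ResourceRecordSet": record}

# Route 53 answers with the primary while its health check passes and shifts
# traffic to the standby when it fails.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        failover_record("primary", "PRIMARY", "203.0.113.10", PRIMARY_HEALTH_CHECK),
        failover_record("standby", "SECONDARY", "198.51.100.20"),
    ]},
)
```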
Active-active:
• Both regions serve traffic simultaneously
• Database is active in both (multi-master OR globally distributed)
• Load balancer routes by latency
• On disaster: traffic shifts to surviving region
• RTO: minutes (or seconds with health checks)
• Cost: 2x or more
The hard problem: data
• Synchronous replication across regions adds 50-200ms latency to every write
• Asynchronous replication = small data loss on failover
• Globally distributed databases (Spanner, Cosmos DB, CockroachDB, Aurora Global) handle this but cost more
Most teams settle for:
• Stateless services in multiple regions
• Single primary database with cross-region read replicas
• Async replication with documented data loss window
The 80/20 of multi-region:
• Static assets via CDN — already global
• Read traffic to nearest region's read replicas
• Write traffic to single primary
• On primary outage: promote a replica (manual decision)
For most apps, this is enough.
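When the primary region is gone and someone has made the call, the promotion itself is one API call plus a connection-string or DNS update. A minimal sketch with boto3, assuming a hypothetical replica `prod-db-replica-west` in the DR region.

```python
import boto3

# Talk to RDS in the DR region, where the replica lives (hypothetical region).
rds = boto3.client("rds", region_name="us-west-2")

REPLICA_ID = "prod-db-replica-west"   # hypothetical cross-region read replica

# Promotion detaches the replica from the unreachable primary and makes it a
# standalone, writable instance. It is one-way: restoring replication later
# means creating a fresh replica.
rds.promote_read_replica(DBInstanceIdentifier=REPLICA_ID)
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=REPLICA_ID)

# Point the application at the promoted instance (DNS CNAME, config, etc.).
info = rds.describe_db_instances(DBInstanceIdentifier=REPLICA_ID)["DBInstances"][0]
print(f"New primary endpoint: {info['Endpoint']['Address']}")
```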
People-Based DR
Technology aside, your DR plan has people dependencies. What if:
• The on-call engineer is on a flight?
• Two SMEs are at the same conference?
• The only person who knows the secret rotation process is on vacation?
• The CEO is unreachable when status page decisions need to be made?
Plan for human availability:
• Document everything important (no "Bob knows how to do that")
• At least two people know each critical procedure
• Cross-train. Run incident drills with the "wrong" people leading.
• Have escalation paths that don't require a single person being reachable
• Communication procedures that still work when any one of phone, Slack, or email is unavailable
The runbook test: print your runbooks and try to recover from a tabletop disaster using only the documents. What's missing?
The next lesson covers FinOps — the discipline of managing cloud costs.