DevOps & Cloud Engineering / Lesson 31 — Incident Response

Incident Response

When things break: how good teams handle the first 60 minutes and the next 60 days.


Incidents Are Inevitable

Every system breaks eventually. The variable isn't whether but how often, how badly, and how fast you recover.

What determines outcome isn't avoiding incidents — it's how the team responds. The difference between a 5-minute blip and a 5-hour outage is usually preparation, not luck.

Good incident response has phases:
1. Detection — knowing something is wrong
2. Response — gathering people, communicating, mitigating
3. Resolution — fixing the immediate problem
4. Recovery — confirming normal operation
5. Postmortem — learning from it

Each phase has its own discipline.


Detection — Knowing Something's Wrong

The faster you know, the less damage. Your detection capability defines your floor for incident impact.

Sources of detection (best to worst):

  1. Internal monitoring (best) — your alerts fire before users notice
  2. Synthetic checks — black-box tests of critical user journeys, every minute
  3. Customer support tickets
  4. Social media and downdetector.com (worst — many users are already affected and the problem is publicly visible)

A good detection setup:
• Paging alerts on critical metrics: error rate, latency, key transactions
• Synthetic checks for each critical user flow
• Status page subscriptions for upstream dependencies
• Customer support has a fast path to escalate
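
Synthetic checks don't need a vendor to get started — a scheduled script that exercises one critical flow and pages on failure is enough. A minimal sketch (the endpoint, latency budget, and paging hook are placeholder assumptions):

Python
import requests  # third-party HTTP client; any HTTP library works

CHECK_URL = "https://api.example.com/healthz"  # hypothetical critical-flow endpoint
LATENCY_BUDGET_SECONDS = 2.0                   # hypothetical latency budget for this check

def page_on_call(message: str) -> None:
    # Placeholder: in practice this would create a PagerDuty/Opsgenie alert.
    print(message)

def run_synthetic_check() -> None:
    """Hit one critical user journey and page if it fails or is too slow."""
    try:
        response = requests.get(CHECK_URL, timeout=10)
        healthy = response.status_code == 200
        fast_enough = response.elapsed.total_seconds() <= LATENCY_BUDGET_SECONDS
        if not (healthy and fast_enough):
            page_on_call(f"Synthetic check failed: status={response.status_code}, "
                         f"latency={response.elapsed.total_seconds():.2f}s")
    except requests.RequestException as exc:
        page_on_call(f"Synthetic check errored: {exc}")

if __name__ == "__main__":
    run_synthetic_check()

Run it every minute from a scheduler outside your own infrastructure, so the check sees what customers see.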

The mean time to detect (MTTD) is measurable. Track it. Reduce it. Going from "users tell us" (10+ minutes) to "Datadog tells us" (1-2 minutes) saves real money.


Response — The First Few Minutes

Page fires. What happens next matters enormously.

A simple incident response framework:

Roles (assigned at incident start):
• Incident Commander (IC) — coordinates the response. NOT a debugger; their job is to keep things organized.
• Communications Lead — updates status page, talks to customers
• Subject Matter Experts (SMEs) — actually debugging the problem

For small incidents, one engineer plays all roles. For big incidents, separate them.

The IC's checklist:
1. Assess severity (SEV-1, SEV-2, SEV-3)
2. Open an incident channel (#incident-2024-05-05-api-degradation)
3. Page additional people if needed
4. Update the status page if customers are affected
5. Run a check-in every 10-15 minutes: what do we know, what are we doing, what's the impact
6. Decide when to declare resolved
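
Parts of this checklist are easy to script. A rough sketch of an incident-declaration helper using the slack_sdk library — the bot token, severity handling, and channel-naming convention are assumptions, and platforms like Incident.io do this for you:

Python
import os
from datetime import date

from slack_sdk import WebClient  # assumes the official slack_sdk package is installed

def declare_incident(slug: str, severity: str, summary: str) -> str:
    """Open a dedicated incident channel and post the initial facts (checklist steps 1-3)."""
    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])  # hypothetical bot token

    # Step 2: open an incident channel, e.g. #incident-2024-05-05-api-degradation
    channel_name = f"incident-{date.today().isoformat()}-{slug}"
    channel_id = client.conversations_create(name=channel_name)["channel"]["id"]

    # Post the initial assessment so everyone who joins sees the same facts
    client.chat_postMessage(
        channel=channel_id,
        text=f"{severity} declared — {summary}\n"
             f"IC will run check-ins every 10-15 minutes in this channel.",
    )
    return channel_name

# Usage (hypothetical):
# declare_incident("api-degradation", "SEV-2", "Elevated 5xx on the public API")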

Communication discipline:
• Status updates on a fixed cadence (every 15-30 min)
• Customer-facing messages avoid technical detail and avoid blame
• Post in the incident channel; don't DM
• Use threads for sub-discussions, keep main channel for facts

Common pitfall: too many cooks. The IC gates who's actively trying things. "Alice, you're investigating. Bob, you're standing by. Don't both touch the database."

Tools that help:
• PagerDuty / Opsgenie — paging
• Slack — coordination
• Atlassian Statuspage (or similar) — customer-facing communication
• Incident.io, Rootly, FireHydrant — full incident management platforms


Resolution — Stop the Bleeding First

The instinct is to find the root cause. Resist it. The first goal is RESTORING SERVICE, not understanding why.

The hierarchy:
1. Mitigate — make customers stop hurting
2. Stabilize — confirm the system is healthy
3. Investigate — figure out what happened
4. Fix — actually solve the underlying problem

Common mitigations (often before understanding):
• Rollback — undo the recent change
• Scale up — throw more capacity at it
• Failover — switch traffic to a different region/AZ
• Disable a feature — flip the flag
• Restart services — often cargo-culted, but sometimes it genuinely works
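
As a concrete illustration of the "disable a feature" mitigation: a kill switch is just a check at the top of the risky code path. A minimal sketch — the flag name and environment-variable storage are assumptions; most teams read a flag service instead so the flip takes effect without a restart:

Python
import os

def feature_enabled(flag_name: str) -> bool:
    """Read a kill switch; defaults to ON so flipping the value is the mitigation."""
    # Hypothetical convention: FEATURE_<NAME>=off disables the feature.
    return os.environ.get(f"FEATURE_{flag_name.upper()}", "on").lower() != "off"

def handle_request(payload: dict) -> dict:
    if feature_enabled("new_pricing_engine"):   # hypothetical flag name
        return run_new_pricing_engine(payload)
    return run_old_pricing_path(payload)        # known-good fallback

def run_new_pricing_engine(payload: dict) -> dict:
    return {"price": 42, "engine": "new"}       # stand-in for the risky code path

def run_old_pricing_path(payload: dict) -> dict:
    return {"price": 42, "engine": "old"}       # stand-in for the stable path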

Roll back, then investigate. "Was the deploy bad?" is something you can determine afterward, once service is restored.

Document what you do as you do it:

Text
14:32 - PR #1234 merged
14:47 - Started seeing 5xx spike
14:52 - Page fired
14:54 - alice ack'd, started investigating
15:01 - Suspected migration in #1234
15:03 - Decided to roll back
15:08 - Rollback initiated
15:19 - Errors back to baseline, calling resolved

This becomes your timeline for the postmortem. Write it as it happens, not from memory after.


Recovery, Communication, Postmortems

Recovery checklist:
• All metrics back to baseline (error rate, latency)
• Customer-facing flows tested
• Backlogged work processed (queue depth back to normal)
• No lingering effects (failed jobs need replay? cache needs warming?)
• On-call standby for at least an hour
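
"Metrics back to baseline" should be a measurement, not a gut feel. A rough sketch that compares the current error ratio against a pre-incident baseline, assuming a Prometheus server and hypothetical metric names:

Python
import requests  # third-party HTTP client

PROMETHEUS_URL = "http://prometheus.internal:9090"  # hypothetical Prometheus endpoint
ERROR_RATIO_QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total[5m]))'
)  # hypothetical metric names

def current_error_ratio() -> float:
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": ERROR_RATIO_QUERY},
        timeout=10,
    )
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def back_to_baseline(baseline: float, tolerance: float = 1.2) -> bool:
    """True if the error ratio is within 20% of its pre-incident baseline."""
    return current_error_ratio() <= baseline * tolerance

# Usage (hypothetical baseline measured before the incident):
# if back_to_baseline(baseline=0.002):
#     print("Error rate recovered; keep watching latency and queue depth too.")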

Customer communication is its own art:

DURING:

Text
[Investigating] We're seeing elevated error rates on our API. We're investigating.
[Identified] We've identified the issue and are working on a fix.
[Monitoring] A fix has been deployed. We're monitoring for full recovery.
[Resolved] The issue has been resolved. Full incident report coming within 5 days.

Things that work:
• Acknowledge the issue quickly even before you understand it
• Update on a regular cadence even with "still investigating"
• Be honest about scope and impact
• No marketing-speak ("a small number of users may have experienced...")
• Apologize when warranted

Postmortems (covered in depth in Module 26):
• Within 5 business days of a SEV-1 or SEV-2
• Blameless — focus on systems and processes, not individuals
• Timeline with timestamps
• Identify what went well, what went poorly
• 3-5 concrete action items, each owned, with deadlines
• Distribute the document widely
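
A minimal skeleton that covers those points (a sketch — adapt to your team's template):

Text
Postmortem: <incident title> — <date>
Severity: <SEV-N> · Duration: <minutes> · Customer impact: <who, what, how long>
Timeline: (paste the timestamped log from the incident channel)
What went well:
What went poorly:
Action items (each with an owner and a deadline):
  1. ...
  2. ...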

Track action item completion. A team that writes great postmortems but never ships the action items is just performing theater.


Building the Muscle

Most teams handle their first big incident badly. That's normal. The goal is to get better, not be born good.

Practices that build incident response skill:

1. Game days — scheduled "we're going to break something on purpose" exercises. Inject failures and practice the response. Chaos engineering tools (Gremlin, ChaosMesh, AWS Fault Injection) make this easier.

2. Tabletop exercises — talk through hypothetical scenarios. "DNS provider goes down for 4 hours. What do we do?"

3. On-call shadowing — junior engineers shadow seniors before going solo.

4. Runbook discipline — every alert has a runbook. Engineers update them after responding to incidents. Stale runbooks indicate a stale process.

5. Postmortem templates — standardize so people don't reinvent each time.

6. Incident metrics:
• MTTD (mean time to detect) — alerts working?
• MTTR (mean time to recover) — process working?
• Recurrence rate — same root cause twice = action items not shipped
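
These roll up from incident records you already have. A sketch of how they might be computed — the record fields and the definition of recurrence (the same root-cause label appearing more than once) are assumptions:

Python
from dataclasses import dataclass
from datetime import datetime, timedelta
from collections import Counter

@dataclass
class Incident:
    started_at: datetime    # when the problem actually began
    detected_at: datetime   # when an alert or report surfaced it
    resolved_at: datetime   # when service was restored
    root_cause: str         # hypothetical label assigned in the postmortem

def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detect: are the alerts working?"""
    return sum((i.detected_at - i.started_at for i in incidents), timedelta()) / len(incidents)

def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to recover, measured here from start of impact to restoration."""
    return sum((i.resolved_at - i.started_at for i in incidents), timedelta()) / len(incidents)

def recurrence_rate(incidents: list[Incident]) -> float:
    """Share of incidents whose root cause has been seen before."""
    counts = Counter(i.root_cause for i in incidents)
    repeats = sum(c - 1 for c in counts.values() if c > 1)
    return repeats / len(incidents)

If the recurrence rate is climbing, the postmortem action items aren't landing.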

Teams that take incidents seriously become formidable. The incidents you DO have, you handle well. Customer trust grows even from outages, because the response is so visibly competent.

