DevOps & Cloud Engineering / Lesson 26 — SRE Principles

SRE Principles

SLIs, SLOs, error budgets — Google's framework for reliability without sacrificing velocity.


What SRE Is

Site Reliability Engineering (SRE) is Google's specific implementation of DevOps, formalized in the Site Reliability Engineering book in 2016. SRE's core insight: reliability is engineering, not luck.

The premise: 100% reliability is impossible AND undesirable. Some unreliability is the cost of moving fast. The question isn't "how do we never fail?" but "how much failure can we afford while shipping fast?"

SRE answers this with three concepts:
• SLIs (Service Level Indicators) — measurable signals of service health
• SLOs (Service Level Objectives) — targets for SLIs
• Error budgets — how much you can fail before pumping the brakes

This framework changes the conversation between dev and ops:
• OLD: "We need 100% reliability." "We need to ship features." → conflict
• SRE: "Are we within our error budget?" → if yes, ship features; if no, focus on reliability

It's a contract that aligns incentives.


SLIs — What to Measure

An SLI is a quantifiable measure of service reliability. The key: measure what users actually experience.

Common SLIs:

Availability — % of requests that succeed

Promql
# Fraction of requests that did not return a 5xx, over the last 5 minutes
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

Latency — % of requests faster than some threshold

Promql
# Fraction of requests completing in under 500ms, over the last 5 minutes
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

Throughput — requests per second handled (less common as an SLI)

Quality — % of requests with full / correct responses (some apps return degraded responses under load)

Choosing SLIs — RED method (for request-driven services):
• Rate — request rate
• Errors — error rate
• Duration — latency

Or USE method (for resources):
• Utilization — % busy
• Saturation — queue/wait
• Errors — error count

Best practices:
• Pick SLIs at the user-visible layer. "API returns 200 in <500ms" not "Database query <100ms".
• Average across all relevant traffic, not single endpoints (unless that endpoint is the user experience).
• Use ratios: good_events / total_events.
• Use a window long enough to be stable (5-30 minutes) but short enough to catch real issues.


SLOs — Targets That Matter

An SLO is a target value for an SLI over a window of time.

Example SLOs:
• 99.9% of requests succeed (per month)
• 95% of requests complete in under 500ms (per month)
• 99.5% of writes are durable (per quarter)

Choosing the right number is harder than it looks:

99.9% sounds easy. Per year that's 8.76 hours of downtime. Per month: 43.2 minutes. Per week: 10 minutes. That's enough budget for one big incident per month.

99.99%: 52.6 minutes per year. 4.3 minutes per month. Suddenly EVERY component must be nearly perfect.

99.999%: 5.26 minutes per year. Very expensive — global active-active setup, deep automation, dedicated SRE team.
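
These figures are just (1 - SLO) multiplied by the window length. A minimal sketch that reproduces them, assuming a 30-day month and a 7-day week:

Python
# Allowed downtime for an availability SLO over a given window.
def allowed_downtime_minutes(slo: float, window_days: float) -> float:
    return (1 - slo) * window_days * 24 * 60

for slo in (0.999, 0.9999, 0.99999):
    per_year = allowed_downtime_minutes(slo, 365)
    per_month = allowed_downtime_minutes(slo, 30)
    per_week = allowed_downtime_minutes(slo, 7)
    print(f"{slo:.3%}: {per_year:.1f} min/year, {per_month:.1f} min/month, {per_week:.1f} min/week")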

Real-world reference:
• Internal admin tool — 99% might be fine
• B2C app — 99.9% is reasonable
• Banking, healthcare — 99.99% is the bar
• Five-nines (99.999%) — telecoms, only when really needed

The cost grows non-linearly. Going from 99.9% to 99.99% might 10x your engineering cost. From 99.99% to 99.999% might 100x.

Match your SLO to actual business value. An overly ambitious SLO you can't meet is worse than a realistic one that drives good engineering decisions.

User-perceived availability — sometimes higher than your service availability. If users retry transient failures and the second try works, the user-perceived experience is better than your raw error rate. Some teams measure this.
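
A rough model shows why, assuming each failed request is retried exactly once and failures are independent (a big assumption; during a real outage, retries tend to fail together):

Python
# Perceived availability when clients retry failed requests.
def perceived_availability(raw: float, retries: int = 1) -> float:
    # A request only fails for the user if every attempt fails.
    return 1 - (1 - raw) ** (retries + 1)

print(perceived_availability(0.999))  # 0.999999: one retry turns 99.9% into ~99.9999%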


Error Budgets — The Trade-Off

An error budget is the inverse of your SLO: the amount of failure you're allowed.

99.9% SLO → 0.1% error budget → 43.2 min of failure per month allowed.

How to use it:
• Track budget consumption month by month
• Budget remaining → ship features fast
• Budget low → focus on reliability
• Budget exhausted → freeze risky changes, prioritize fixing root causes

This sounds bureaucratic. In practice it's liberating. Teams that adopt error budgets stop arguing about reliability vs features. The data decides.

Concrete policy:

Text
If error budget for the month > 50% remaining:
  → ship as fast as possible
  → take risks on new features
  → relax review processes (slightly)

If error budget 0-50% remaining:
  → ship features but require careful review
  → no risky changes after 5 PM
  → small batch sizes

If error budget exhausted:
  → freeze risky changes
  → all engineering effort goes to reliability work
  → fix root causes from recent incidents
  → improve test coverage of failure scenarios
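
The same policy can be encoded as a simple decision function. A sketch with made-up request counts; in practice the error and request totals come from your metrics backend:

Python
# Map error-budget consumption for the current window to a policy tier.
def budget_remaining(slo: float, bad_requests: int, total_requests: int) -> float:
    allowed_errors = (1 - slo) * total_requests
    return max(0.0, 1 - bad_requests / allowed_errors)

def policy(remaining: float) -> str:
    if remaining > 0.5:
        return "ship fast, take risks on new features"
    if remaining > 0.0:
        return "careful review, small batches, no risky changes after 5 PM"
    return "freeze risky changes, all effort on reliability"

# 12,000 errors out of 50M requests against a 99.9% SLO: 76% of the budget is left.
print(policy(budget_remaining(0.999, bad_requests=12_000, total_requests=50_000_000)))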

Burn rate alerts — instead of waking on-call when SLO is violated (it's too late), alert when you're consuming the budget too fast.

Promql
# Error rate over the last hour exceeds 14.4x the budget rate (0.1%);
# at this pace, a 30-day error budget is gone in about two days
(
  sum(rate(http_requests_total{status=~"5.."}[1h])) 
  / 
  sum(rate(http_requests_total[1h]))
) > 14.4 * 0.001

Page on burn rate, not on every error spike.
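
The 14.4 multiplier is not arbitrary: it is the burn rate at which one hour of errors consumes 2% of a 30-day budget, the "fast burn" threshold recommended in the Google SRE Workbook. A sketch of the arithmetic:

Python
# Burn rate = observed error rate / budgeted error rate.
# Alert when sustaining the current rate would spend too much budget too fast.
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: int = 30) -> float:
    return budget_fraction * (slo_window_days * 24) / alert_window_hours

print(burn_rate_threshold(0.02, 1))   # 14.4: 2% of the monthly budget in 1 hour
print(burn_rate_threshold(0.05, 6))   # 6.0: a slower companion alert, 5% in 6 hours

Pairing a fast-burn and a slow-burn alert like this catches both sharp outages and slow leaks without paging on every blip.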


Postmortems — Learning from Failure

When things break, you write a postmortem. Done well, this is one of the most valuable practices in engineering.

A good postmortem is blameless, focuses on systemic causes rather than on individuals, and ends with concrete action items that have owners and deadlines.

Template:

Text
## Incident: API outage on May 5
- Severity: SEV-1
- Duration: 47 minutes (merge to resolution)
- Affected: ~20% of API traffic returned 503
- Author: alice (incident commander)

## Timeline
14:32 - PR #1234 merged with database migration
14:47 - Migration completed, started receiving error spikes  
14:52 - PagerDuty alert fired: high 5xx rate
14:54 - On-call (alice) acknowledged
15:03 - Identified migration as cause
15:08 - Decided to roll back
15:19 - Rollback complete, traffic recovering
15:19 - Incident declared resolved

## Impact
- 20% of API requests returned 503 for ~32 minutes (14:47-15:19)
- ~120,000 user-visible errors
- 3 customers escalated to support

## Root Cause
The migration added a NOT NULL column without a default. The old app version was still running on some instances and didn't know about the column. Inserts from old instances failed.

## What Went Well
- On-call response time was 2 minutes
- Rollback was clean, no data loss

## What Went Poorly
- Migration policy is documented but wasn't followed
- We don't have automated checks for the "expand-contract" pattern
- Old/new app coexistence wasn't tested

## Action Items
- [ ] Add CI check that flags non-null columns without defaults (alice, 1 week)
- [ ] Document the expand-contract pattern more prominently (bob, 2 weeks)
- [ ] Set up canary deploys for migrations (charlie, 1 month)

The blamelessness is real. If engineers fear punishment for incidents, they'll hide problems and you'll never improve.

Conduct postmortems within 5 business days. Share broadly — across teams, even publicly when appropriate. Track action item completion. Patterns across postmortems reveal the deeper system issues worth solving.


On-Call That Doesn't Burn People Out

On-call is hard. Bad on-call — being woken constantly for non-actionable pages — destroys engineers.

Practices that keep on-call sustainable:

1. Page on symptoms, not causes (covered earlier).

2. Compensation. On-call is work. Engineers carrying a pager deserve extra pay or comp time. This isn't a perk — it's the cost of taking it seriously.

3. Reasonable rotation. 1 week of primary on-call per 4-6 week cycle is typical. Shorter cycles burn people out; longer cycles mean stale context.

4. Runbooks for every alert. The page should link to a runbook explaining: what does this mean, what to check first, how to mitigate, who else to call. Don't make the 3 AM engineer think too hard.

5. Page volume budget. Track pages per engineer per week. If it's > 5, you have an alert problem (too many false positives) or a reliability problem. Either way, fix it.

6. Hand-off rituals. Outgoing on-call updates incoming on-call: ongoing issues, recent changes, things to watch.

7. Shadow rotations before going on-call. New engineers shadow an existing on-call rotation before taking the pager solo. Don't throw new hires into production debugging cold.

8. Postmortems for missed pages. If on-call slept through a page, that's a system problem (alert too quiet, wrong routing, exhaustion), not a personal failing.

The ultimate test: does your team WANT to be on call? If yes, your reliability work is in good shape. If no, fix the systemic issues.

