Monitoring with Prometheus & Grafana
Metrics that matter, dashboards that don't lie, and alerts that fire when (and only when) things break.
What Monitoring Actually Solves
Without monitoring, you operate blind:
• "Is the site up?" — refresh and hope
• "Why was it slow yesterday?" — guess
• "Are we close to capacity?" — wait for the explosion
• "Did the deploy break anything?" — wait for users to complain
Monitoring gives you eyes. It answers:
• How much traffic are we handling?
• How fast is each service responding?
• Are error rates rising?
• Are we close to running out of resources?
• Did this deploy cause a regression?
Three pillars of observability (covered in Backend Module 15):
• Metrics — numerical measurements over time. CPU, latency, request rate.
• Logs — discrete events with context. "User 123 logged in at 10:15 AM"
• Traces — request flow across services. "This request took 2s; here's where each second went."
This lesson focuses on metrics. The next two cover logs and traces.
Prometheus — The De Facto Standard
Prometheus is the most widely used metrics system, especially in cloud-native environments. It's a graduated CNCF project, used by practically everyone running Kubernetes.
How it works:
1. Your apps expose a /metrics endpoint with current values
2. Prometheus periodically SCRAPES that endpoint
3. Prometheus stores time-series data efficiently
4. You query it with PromQL
App #1 ─┐
App #2 ─┼─── /metrics endpoint
App #3 ─┘
▲
│ every 15s
│
┌────────────────┐
│ Prometheus │
│ scrapes, │
│ stores TS db │
└────────────────┘
▲
│ PromQL queries
│
┌────────────────┐
│ Grafana / UI │
└────────────────┘
Key insight: PULL not PUSH. Prometheus scrapes; apps don't push. This makes the system simpler — apps just expose data, Prometheus controls cadence.
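Which targets get scraped, and how often, is driven by Prometheus's config file. A minimal prometheus.yml sketch (the job name and target are placeholders for your own services):
global:
  scrape_interval: 15s          # how often to scrape each target
scrape_configs:
  - job_name: 'api'             # placeholder job name
    static_configs:
      - targets: ['api:8000']   # host:port where the app exposes /metrics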
A /metrics endpoint looks like:
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 12453
http_requests_total{method="POST",status="200"} 1083
http_requests_total{method="GET",status="500"} 12
# HELP http_request_duration_seconds Request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 9234
http_request_duration_seconds_bucket{le="0.5"} 11892
http_request_duration_seconds_bucket{le="1.0"} 12410
http_request_duration_seconds_bucket{le="+Inf"} 12453
http_request_duration_seconds_sum 1845.3
http_request_duration_seconds_count 12453
Most languages have client libraries (prometheus_client for Python, prom-client for Node, etc.) that produce this format from your code.
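For instance, a minimal sketch with the Python prometheus_client (the port and metric name are illustrative): start_http_server serves /metrics in exactly the format above.
from prometheus_client import Counter, start_http_server
import time

# Illustrative metric; a real app would instrument its handlers instead of a loop
jobs_processed_total = Counter('jobs_processed_total', 'Jobs processed by the worker')

if __name__ == '__main__':
    start_http_server(8000)        # serves /metrics on port 8000
    while True:
        jobs_processed_total.inc()
        time.sleep(1)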
The Four Metric Types
Prometheus has four metric types. Each tells a different story.
Counter — only goes up (or resets to zero on restart)
from prometheus_client import Counter
http_requests_total = Counter('http_requests_total', 'Total HTTP requests', ['method', 'status'])
# In handler:
http_requests_total.labels(method='GET', status='200').inc()
Use for: total requests, errors, bytes processed. To get RATE, query rate(metric[5m]).
Gauge — goes up and down
from prometheus_client import Gauge
queue_length = Gauge('queue_length', 'Pending jobs in queue')
queue_length.set(42)
Use for: current resource usage, queue depth, temperature, anything that fluctuates.
Histogram — distribution of values (buckets)
from prometheus_client import Histogram
request_duration = Histogram('http_request_duration_seconds', 'Request latency')
with request_duration.time():
    handle_request()
Use for: latencies, sizes. Lets you compute percentiles (p50, p95, p99).
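If the default buckets don't match your latency profile, prometheus_client lets you pass your own (the boundaries below are illustrative; the +Inf bucket is added automatically):
request_duration = Histogram(
    'http_request_duration_seconds', 'Request latency',
    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2, 5]   # bucket boundaries in seconds
)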
Summary — like a histogram, but percentiles are computed client-side
Less commonly used: pre-computed quantiles can't be aggregated across instances, so histograms are more flexible.
Naming convention:
• <noun>_<unit>_<suffix> — e.g. http_request_duration_seconds
• Counters end in _total
• Always include the unit (_seconds, _bytes, _count)
Labels — slice your metrics by dimension
http_requests_total{method="GET", endpoint="/users", status="200"}
http_requests_total{method="POST", endpoint="/users", status="201"}
Each unique combination of label values is a separate time series. Don't put high-cardinality values in labels (user IDs, request IDs, raw URLs): the number of series explodes, and with it storage and query cost.
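A quick illustration of the cardinality trap, using a hypothetical metric:
# BAD: one time series per user; series count grows without bound
logins_total{user_id="8472"} 1
logins_total{user_id="8473"} 3
# GOOD: a bounded set of label values
logins_total{method="password"} 5313
logins_total{method="oauth"} 1207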
PromQL — Querying Metrics
PromQL is the query language. Examples:
Current value:
http_requests_total
Filter by labels:
http_requests_total{status="500"}
http_requests_total{status=~"5.."} # regex match
Rate (per-second average over 5 minutes):
rate(http_requests_total[5m])
Sum across instances:
sum(rate(http_requests_total[5m]))
Sum by status code:
sum by (status) (rate(http_requests_total[5m]))
Error rate as percentage:
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100
P95 latency from histogram:
histogram_quantile(0.95,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
Compared to a week ago:
sum(rate(http_requests_total[5m]))
/
sum(rate(http_requests_total[5m] offset 1w))
PromQL has a learning curve. The trick: most queries you need are variations on sum(rate(metric[range])) or histogram_quantile(0.95, ...). Master those two shapes and you can build 80% of your dashboards.
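Dashboards aren't the only consumer of PromQL: the same queries are available over Prometheus's HTTP API at /api/v1/query. A minimal sketch using the Python requests library (the Prometheus URL is a placeholder):
import requests

PROM_URL = 'http://localhost:9090'   # placeholder; point at your Prometheus

resp = requests.get(
    f'{PROM_URL}/api/v1/query',
    params={'query': 'sum(rate(http_requests_total[5m]))'},
)
data = resp.json()
# Each result is {"metric": {...labels...}, "value": [timestamp, "value"]}
for series in data['data']['result']:
    print(series['metric'], series['value'][1])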
Grafana — Dashboards & Visualization
Grafana queries Prometheus (and many other data sources) and renders dashboards. The combination is the standard observability stack.
Setup:
# Prometheus + Grafana via Docker Compose
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports: ['9090:9090']
  grafana:
    image: grafana/grafana:latest
    ports: ['3000:3000']
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana-data:/var/lib/grafana
volumes:
  grafana-data:
In Grafana, add Prometheus as a data source (URL http://prometheus:9090). Then build dashboards.
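If you'd rather not click through the UI, Grafana can also provision the data source from a file. A minimal sketch of the provisioning YAML (mount it into the grafana container under /etc/grafana/provisioning/datasources/):
# e.g. ./grafana/datasources.yml mounted at /etc/grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    access: proxy        # Grafana's backend proxies the queries
    isDefault: true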
The four dashboards every service needs:
1. RED metrics dashboard
• Rate — requests per second
• Errors — error rate
• Duration — p50, p95, p99 latency
These tell you: are users having a good time? (A minimal instrumentation sketch appears at the end of this section.)
2. USE metrics dashboard
• Utilization — CPU%, memory%, disk%, network%
• Saturation — queue depth, lock contention
• Errors — failed allocations, OOM kills
These tell you: is the infrastructure happy?
3. Business metrics
• Active users, signups per minute, revenue per hour
• Whatever your business actually cares about
These tell you: is the BUSINESS healthy?
4. Per-deploy diff
• Pre-vs-post-deploy comparison of key metrics
• Did the last release regress anything?
Grafana has thousands of community dashboards on grafana.com/dashboards. Common ones (Node Exporter, Kubernetes, PostgreSQL) are imported with one click.
Anti-pattern: dashboards with 50 graphs nobody looks at. A good dashboard has 5-10 graphs and one purpose. Make many dashboards rather than one giant one.
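The RED dashboard only works if the app records those numbers in the first place. A minimal sketch for a Flask app, assuming Flask and prometheus_client are installed (the route and handler are illustrative):
from flask import Flask, Response, request, g
import time
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

REQUESTS = Counter('http_requests_total', 'Total HTTP requests',
                   ['method', 'endpoint', 'status'])
LATENCY = Histogram('http_request_duration_seconds', 'Request latency', ['endpoint'])

@app.before_request
def start_timer():
    g.start_time = time.time()                # remember when the request started

@app.after_request
def record_metrics(response):
    # Label by the matched route pattern, not the raw path, to keep cardinality bounded
    endpoint = request.url_rule.rule if request.url_rule else 'unmatched'
    LATENCY.labels(endpoint=endpoint).observe(time.time() - g.start_time)
    REQUESTS.labels(method=request.method, endpoint=endpoint,
                    status=str(response.status_code)).inc()
    return response

@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

@app.route('/users')
def users():
    return {'users': []}                      # placeholder handler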
Alerting — When to Wake Someone Up
Alerts that aren't actionable destroy on-call cultures. Every page should require human action.
Prometheus Alertmanager handles routing alerts to PagerDuty/Opsgenie/Slack.
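A minimal alertmanager.yml routing sketch, assuming a PagerDuty integration and a Slack webhook (the receiver names, integration key, and webhook URL are placeholders):
route:
  receiver: slack-warnings            # default route for everything else
  routes:
    - match:
        severity: page                # matches the severity label set in alerts.yml
      receiver: pagerduty-oncall
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'
  - name: slack-warnings
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/ME'
        channel: '#alerts'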
Define alerts in Prometheus:
# alerts.yml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for 5 minutes"
          runbook: "https://wiki.internal/runbooks/api-errors"
      - alert: SlowResponses
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "P95 latency above 2s for 10 minutes"
The structure:
• expr — the PromQL query that evaluates to "alert" or "no alert"
• for — alert only fires after the condition holds for this duration (avoids flapping)
• severity: page vs severity: warning — routed differently
• runbook URL — engineer woken at 3 AM should have a guide
What to alert on (the SRE rule of thumb): symptoms, not causes.
• PAGE on: error rate spike, latency spike, queue depth growing unboundedly, all instances down
• PAGE on: customer-facing impact
• DON'T page on: high CPU (might be normal load), individual instance failures (auto-replaced), specific log lines (often false positives)
• EMAIL/Slack: things requiring attention but not 3 AM action — slow disk fill, certificate expiring in 30 days
Alert fatigue is real. If on-call is paged 5 times a night, real alerts get ignored. Audit your alerts quarterly: "did anyone act on this in the last month?" If no, delete or downgrade it.
Cloud-Native Alternatives
If you don't want to run Prometheus yourself, alternatives:
Hosted / managed:
• AWS CloudWatch — built-in for AWS resources. Limited but free for basics.
• GCP Cloud Monitoring — equivalent for GCP.
• Datadog — most popular SaaS. Excellent UX, expensive at scale.
• New Relic — established, broad feature set.
• Grafana Cloud — Prometheus + Grafana managed. Generous free tier.
• Honeycomb — events-based, great for tracing-heavy workloads.
Self-hosted alternatives to Prometheus:
• Mimir, Cortex — Prometheus-compatible, multi-tenant, scalable
• VictoriaMetrics — Prometheus-compatible, faster, more efficient
• InfluxDB — different model, push-based, popular for IoT
Cloud-native pattern in 2026:
• OpenTelemetry to instrument apps (vendor-neutral)
• Prometheus / Mimir for storage
• Grafana for visualization
• Alertmanager for routing
• Self-host or managed (Grafana Cloud)
The next lesson covers logging — the second pillar.