
Monitoring with Prometheus & Grafana

Metrics that matter, dashboards that don't lie, and alerts that fire when (and only when) things break.


What Monitoring Actually Solves

Without monitoring, you operate blind:
• "Is the site up?" — refresh and hope
• "Why was it slow yesterday?" — guess
• "Are we close to capacity?" — wait for the explosion
• "Did the deploy break anything?" — wait for users to complain

Monitoring gives you eyes. It answers:
• How much traffic are we handling?
• How fast is each service responding?
• Are error rates rising?
• Are we close to running out of resources?
• Did this deploy cause a regression?

Three pillars of observability (covered in Backend Module 15):
• Metrics — numerical measurements over time. CPU, latency, request rate.
• Logs — discrete events with context. "User 123 logged in at 10:15 AM"
• Traces — request flow across services. "This request took 2s; here's where each second went."

This lesson focuses on metrics. The next two cover logs and traces.


Prometheus — The De Facto Standard

Prometheus is the most widely used metrics system, especially in cloud-native environments. It's a graduated CNCF project, used by basically everyone running K8s.

How it works:
1. Your apps expose a /metrics endpoint with current values
2. Prometheus periodically SCRAPES that endpoint
3. Prometheus stores time-series data efficiently
4. You query it with PromQL

Text
   App #1 ─┐
   App #2 ─┼─── /metrics endpoint
   App #3 ─┘
                       ▲
                       │  every 15s
                       │
               ┌────────────────┐
               │  Prometheus    │
               │  scrapes,      │
               │  stores TS db  │
               └────────────────┘
                       ▲
                       │  PromQL queries
                       │
               ┌────────────────┐
               │ Grafana / UI   │
               └────────────────┘

Key insight: PULL not PUSH. Prometheus scrapes; apps don't push. This makes the system simpler — apps just expose data, Prometheus controls cadence.
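
Controlling cadence means the scrape interval lives in Prometheus's config, not in every app. A minimal prometheus.yml sketch (the job name and target address are placeholders):

YAML
# prometheus.yml - job name and target are placeholders
global:
  scrape_interval: 15s          # how often Prometheus pulls every /metrics

scrape_configs:
  - job_name: 'my-api'
    static_configs:
      - targets: ['app:8000']   # host:port exposing /metrics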

A /metrics endpoint looks like:

Text
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 12453
http_requests_total{method="POST",status="200"} 1083
http_requests_total{method="GET",status="500"} 12

# HELP http_request_duration_seconds Request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 9234
http_request_duration_seconds_bucket{le="0.5"} 11892
http_request_duration_seconds_bucket{le="1.0"} 12410
http_request_duration_seconds_bucket{le="+Inf"} 12453
http_request_duration_seconds_sum 1845.3
http_request_duration_seconds_count 12453

Most languages have client libraries (prometheus_client for Python, prom-client for Node, etc.) that produce this format from your code.
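
For instance, a minimal Python service (port and metric name chosen arbitrarily here) needs only a few lines to expose the endpoint:

Python
# Minimal sketch with prometheus_client; port and metric name are arbitrary
import time
from prometheus_client import Counter, start_http_server

jobs_done = Counter('jobs_done_total', 'Jobs processed')

start_http_server(8000)   # serves /metrics on :8000 in a background thread
while True:
    jobs_done.inc()       # pretend work
    time.sleep(1)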


The Four Metric Types

Prometheus has four metric types. Each tells a different story.

Counter — only goes up (or resets to zero on restart)

Python
from prometheus_client import Counter
http_requests_total = Counter('http_requests_total', 'Total HTTP requests', ['method', 'status'])

# In handler:
http_requests_total.labels(method='GET', status='200').inc()

Use for: total requests, errors, bytes processed. To get RATE, query rate(metric[5m]).

Gauge — goes up and down

Python
from prometheus_client import Gauge
queue_length = Gauge('queue_length', 'Pending jobs in queue')
queue_length.set(42)   # or .inc() / .dec() for deltas

Use for: current resource usage, queue depth, temperature, anything that fluctuates.

Histogram — distribution of values (buckets)

Python
from prometheus_client import Histogram
request_duration = Histogram('http_request_duration_seconds', 'Request latency')

# The context manager observes elapsed seconds when the block exits
with request_duration.time():
    handle_request()

Use for: latencies, sizes. Lets you compute percentiles (p50, p95, p99).
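
The default buckets rarely match a specific service's latency profile. The buckets keyword argument of prometheus_client sets explicit boundaries, here matching the /metrics sample above:

Python
# Same histogram with explicit buckets (the boundaries from the sample above)
from prometheus_client import Histogram

request_duration = Histogram(
    'http_request_duration_seconds', 'Request latency',
    buckets=[0.1, 0.5, 1.0],   # +Inf is appended automatically
)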

Summary — like histogram but computes percentiles client-side
Less commonly used; histograms are more flexible.
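
A sketch for completeness. Note that the Python client's Summary exposes only _sum and _count; it does not implement client-side quantiles:

Python
from prometheus_client import Summary

# Produces http_request_size_bytes_sum and http_request_size_bytes_count
request_size = Summary('http_request_size_bytes', 'Request payload size')
request_size.observe(512)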

Naming convention:
<name>_<unit> — e.g. http_request_duration_seconds
• Counters end in _total
• Always include the base unit (_seconds, _bytes); _count is reserved for histogram and summary series

Labels — slice your metrics by dimension

Text
http_requests_total{method="GET", endpoint="/users", status="200"}
http_requests_total{method="POST", endpoint="/users", status="201"}

Each unique combination of label values is a separate time series. Don't put high-cardinality values in labels (user IDs, request IDs) — you'll explode storage.
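
To check whether a metric's cardinality is getting out of hand, count its series:

Promql
count(http_requests_total)   # number of active series for this metric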


PromQL — Querying Metrics

PromQL is the query language. Examples:

Current value:

Promql
http_requests_total

Filter by labels:

Promql
http_requests_total{status="500"}
http_requests_total{status=~"5.."}      # regex match

Rate (per-second average over 5 minutes):

Promql
rate(http_requests_total[5m])

Sum across instances:

Promql
sum(rate(http_requests_total[5m]))

Sum by status code:

Promql
sum by (status) (rate(http_requests_total[5m]))

Error rate as percentage:

Promql
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

P95 latency from histogram:

Promql
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

Compared to a week ago:

Promql
sum(rate(http_requests_total[5m]))
/
sum(rate(http_requests_total[5m] offset 1w))

PromQL has a learning curve. The trick is: most queries you need are variations on sum(rate(metric[range])) or histogram_quantile(0.95, ...). Master those two and 80% of dashboards are buildable.


Grafana — Dashboards & Visualization

Grafana queries Prometheus (and many other data sources) and renders dashboards. The combination is the standard observability stack.

Setup:

YAML
# Prometheus + Grafana via Docker Compose
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports: ['9090:9090']

  grafana:
    image: grafana/grafana:latest
    ports: ['3000:3000']
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  grafana-data:

In Grafana, add Prometheus as a data source (URL http://prometheus:9090). Then build dashboards.
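
Adding the data source by hand works, but it can also be provisioned from a file. A sketch, assuming it is mounted into Grafana's standard provisioning directory:

YAML
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true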

The four dashboards every service needs:

1. RED metrics dashboard
   • Rate — requests per second
   • Errors — error rate
   • Duration — p50, p95, p99 latency
   These tell you: are users having a good time?

2. USE metrics dashboard
   • Utilization — CPU%, memory%, disk%, network%
   • Saturation — queue depth, lock contention
   • Errors — failed allocations, OOM kills
   These tell you: is the infrastructure happy?

3. Business metrics
   • Active users, signups per minute, revenue per hour
   • Whatever your business actually cares about
   These tell you: is the BUSINESS healthy?

4. Per-deploy diff
   • Pre-vs-post-deploy comparison of key metrics
   • Did the last release regress anything?

Grafana has thousands of community dashboards on grafana.com/dashboards. Common ones (Node Exporter, Kubernetes, PostgreSQL) are imported with one click.

Anti-pattern: dashboards with 50 graphs nobody looks at. A good dashboard has 5-10 graphs and one purpose. Make many dashboards rather than one giant one.


Alerting — When to Wake Someone Up

Alerts that aren't actionable destroy on-call cultures. Every page should require human action.

Prometheus Alertmanager handles routing alerts to PagerDuty/Opsgenie/Slack.
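
A minimal alertmanager.yml routing sketch; the Slack webhook and PagerDuty key here are placeholders:

YAML
# alertmanager.yml - receiver credentials are placeholders
route:
  receiver: slack-warnings            # default route
  routes:
    - matchers: ['severity="page"']   # pages go to the on-call
      receiver: pagerduty-oncall

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'
  - name: slack-warnings
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'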

Define alerts in Prometheus:

YAML
# alerts.yml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) 
          / 
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for 5 minutes"
          runbook: "https://wiki.internal/runbooks/api-errors"

      - alert: SlowResponses
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "P95 latency above 2s for 10 minutes"

The structure:
expr — the PromQL query that evaluates to "alert" or "no alert"
for — alert only fires after the condition holds for this duration (avoids flapping)
severity: page vs severity: warning — routed differently
runbook URL — engineer woken at 3 AM should have a guide
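
These rules only take effect once prometheus.yml registers the file and points at Alertmanager; roughly (the Alertmanager address assumes its usual default port):

YAML
# In prometheus.yml
rule_files:
  - alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']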

What to alert on (the SRE rule of thumb): symptoms, not causes. Page on what users experience (rising error rate, slow responses, downtime), not on internal causes like high CPU; a hot CPU that isn't hurting users can wait for business hours.

Alert fatigue is real. If on-call is paged 5 times a night, real alerts get ignored. Audit your alerts quarterly: "did anyone act on this in the last month?" If no, delete or downgrade it.


Cloud-Native Alternatives

If you don't want to run Prometheus yourself, alternatives:

Hosted / managed:
• AWS CloudWatch — built-in for AWS resources. Limited but free for basics.
• GCP Cloud Monitoring — equivalent for GCP.
• Datadog — the most popular monitoring SaaS. Excellent UX, expensive at scale.
• New Relic — established, broad feature set.
• Grafana Cloud — Prometheus + Grafana managed. Generous free tier.
• Honeycomb — events-based, great for tracing-heavy workloads.

Self-hosted alternatives to Prometheus:
• Mimir, Cortex — Prometheus-compatible, multi-tenant, scalable
• VictoriaMetrics — Prometheus-compatible, faster, more efficient
• InfluxDB — different model, push-based, popular for IoT

Cloud-native pattern in 2026:
• OpenTelemetry to instrument apps (vendor-neutral; see the sketch after this list)
• Prometheus / Mimir for storage
• Grafana for visualization
• Alertmanager for routing
• Self-host or managed (Grafana Cloud)
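
A hedged sketch of that pattern in Python, using the opentelemetry-sdk and opentelemetry-exporter-prometheus packages; the service and metric names are illustrative:

Python
# Instrument with OpenTelemetry, expose in Prometheus format
from prometheus_client import start_http_server
from opentelemetry import metrics
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider

# The reader bridges OTel metrics into the prometheus_client registry
reader = PrometheusMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout-service")
orders = meter.create_counter("orders_processed", description="Orders handled")
orders.add(1, {"payment_method": "card"})

start_http_server(8000)   # Prometheus scrapes this endpoint as usual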

The next lesson covers logging — the second pillar.

