Logging, Monitoring & Observability
Logs, metrics, traces — the three pillars. Correlation IDs. What to alert on.
The Three Pillars of Observability
Observability is the ability to understand what's happening inside your system from its external outputs. Three pillars:
Logs — Discrete events with timestamps. "User 123 logged in at 10:32:15."
Metrics — Numeric measurements over time. "95th percentile response time: 245ms."
Traces — Request flows across services. "Request R1 went through: API→Auth→DB→Cache."
You need all three. Logs tell you WHAT happened. Metrics tell you WHEN and HOW OFTEN. Traces tell you WHERE time was spent.
Structured Logging
Never log plain strings in production. Log structured JSON so logs are searchable and parseable.
Bad:
console.log("User 123 created order 456")
Good:
logger.info({
  event: "order.created",
  userId: "123",
  orderId: "456",
  amount: 49.99,
  currency: "USD",
  duration_ms: 145,
  requestId: "req_abc123"
});
Tools: Pino (Node.js), Zap (Go), structlog (Python).
Log levels:
DEBUG — verbose, dev only (never in production)
INFO — normal operations, key events
WARN — unexpected but handled (deprecated API used, retry succeeded)
ERROR — something failed and needs attention
FATAL — system cannot continue, about to crash
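The level hierarchy above is just a numeric threshold: a logger drops anything below its configured level. A minimal sketch of how that filtering works (a toy logger, not a real library; real loggers like Pino read the threshold from configuration the same way):

```javascript
// Toy level-filtered logger. Each level maps to a number; entries
// below the configured threshold are silently dropped.
const LEVELS = { DEBUG: 10, INFO: 20, WARN: 30, ERROR: 40, FATAL: 50 };

function createLogger(minLevel, sink = console.log) {
  const threshold = LEVELS[minLevel];
  const log = (level, entry) => {
    if (LEVELS[level] < threshold) return; // drop below-threshold entries
    sink(JSON.stringify({ level, time: new Date().toISOString(), ...entry }));
  };
  return {
    debug: (e) => log('DEBUG', e),
    info:  (e) => log('INFO', e),
    warn:  (e) => log('WARN', e),
    error: (e) => log('ERROR', e),
  };
}

// In production you'd set the threshold from the environment:
const logger = createLogger(process.env.LOG_LEVEL || 'INFO');
logger.debug({ event: 'cache.lookup' });                 // dropped at INFO
logger.info({ event: 'order.created', orderId: '456' }); // emitted
```

Flipping one environment variable to DEBUG during an incident, then back, is the standard workflow this enables.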
Metrics to Track
The RED Method (for services):
• Rate — requests per second
• Errors — error rate (%)
• Duration — response time (p50, p95, p99)
The USE Method (for resources):
• Utilization — CPU %, memory %, disk %
• Saturation — queue depth, pending requests
• Errors — hardware errors, packet drops
Key metrics for a backend API:
• Request rate by endpoint
• Error rate by status code and endpoint
• p50/p95/p99 latency by endpoint
• DB query duration
• Cache hit/miss ratio
• Queue depth and processing rate
• Active connections
• Memory and CPU usage
Tools: Prometheus + Grafana, DataDog, New Relic, CloudWatch.
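The p50/p95/p99 latencies above are percentiles over a window of observed request durations. A minimal sketch using the nearest-rank method (monitoring backends like Prometheus approximate this with histogram buckets instead of sorting raw samples):

```javascript
// Compute the p-th percentile of a sample window (nearest-rank method).
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // nearest rank, 1-based
  return sorted[Math.max(rank - 1, 0)];
}

const durationsMs = [12, 15, 20, 22, 25, 30, 45, 60, 120, 800];
console.log(percentile(durationsMs, 50)); // 25  (median: the typical request)
console.log(percentile(durationsMs, 95)); // 800 (tail dominated by the outlier)
console.log(percentile(durationsMs, 99)); // 800
```

This is why averages lie: the mean of that window is ~115ms, yet half of all requests finish in 25ms while the slowest one takes 800ms. Tail percentiles surface what the average hides.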
Distributed Tracing
In a microservices system, a single user request can touch five or more services. Tracing follows it across all of them.
Each request gets a Trace ID (unique per user request) and spans (timed operations within the trace).
Request → [API Gateway] → [Auth Service] → [Order Service] → [DB]
            span1(50ms)     span2(10ms)      span3(200ms)     span4(30ms)
A trace visualization shows the total time and exactly where latency lives.
How it works:
1. API Gateway creates Trace ID, injects into request headers
2. Each service extracts Trace ID, creates a child span
3. Spans are sent to a collector (Jaeger, Zipkin, DataDog APM)
4. Collector assembles the full trace
Standards: OpenTelemetry (OTEL), the merger of OpenTracing and OpenCensus, is now the de facto standard. Instrument once, export to any backend.
Alerting
Monitoring without alerting is useless. Set up alerts for:
- Error rate > 1% for 5 minutes → PagerDuty alert
- p99 latency > 2 seconds → Slack warning
- CPU > 90% for 10 minutes → Scale-up trigger
- DB connection pool exhausted → Immediate alert
- Queue depth > 10,000 → Scale workers alert
- Disk > 85% full → Warning (at 95%: critical)
- Memory leak detected (ever-increasing memory) → Alert
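A rule like "error rate > 1% for 5 minutes" boils down to evaluating a condition over a sliding window of recent requests. A minimal sketch (alerting systems like Prometheus evaluate this server-side; the threshold and window here are illustrative):

```javascript
// Evaluate a simple alert rule over a window of request outcomes.
// Each outcome: { ts: epoch ms, error: boolean }.
function errorRateAlert(outcomes, { windowMs, threshold }, now = Date.now()) {
  const recent = outcomes.filter((o) => now - o.ts <= windowMs);
  if (recent.length === 0) return { firing: false, rate: 0 };
  const rate = recent.filter((o) => o.error).length / recent.length;
  return { firing: rate > threshold, rate };
}

const now = Date.now();
const outcomes = [
  { ts: now - 1000, error: true },
  { ts: now - 2000, error: false },
  { ts: now - 3000, error: false },
  { ts: now - 4000, error: false },
];
// 1 error in 4 recent requests = 25%, above the 1% threshold → firing
console.log(errorRateAlert(outcomes, { windowMs: 5 * 60 * 1000, threshold: 0.01 }, now));
```

The "for 5 minutes" part matters: requiring the condition to hold over a window, rather than firing on a single bad sample, is what keeps transient blips from paging anyone.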
Alert fatigue is real. Only alert on things that require human action. Too many alerts → people start ignoring them.
On-call rotation: Someone is always responsible. Incidents have runbooks. Postmortems are blameless.
Correlation IDs — Tracing One Request Through Many Services
In a microservices system, a single user click can pass through five different services. Without a way to tie those logs together, debugging is detective work — "I think THIS log entry belongs to the same request as THAT log entry, but I can't be sure."
A correlation ID solves this. The first service to handle a request generates a unique ID and includes it in every log message AND in the headers of every downstream call. Every other service does the same. Now you can grep one ID across all your logs and see the full lifecycle of one specific request.
// Express middleware — assign a correlation ID to every incoming request
import { randomUUID } from 'crypto';

app.use((req, res, next) => {
  req.correlationId = req.headers['x-correlation-id'] || randomUUID();
  res.setHeader('X-Correlation-ID', req.correlationId);
  next();
});
// Every log line includes it
logger.info({
  event: 'payment_processed',
  correlationId: req.correlationId,
  userId: req.user.id,
  amount: payment.amount
});
When you call a downstream service, pass the correlation ID along:
await axios.post('/orders', payload, {
  headers: { 'X-Correlation-ID': req.correlationId }
});
Now investigating any issue is one search across all your services. This single technique — costing maybe ten lines of code — is one of the highest-leverage things you'll add to a production system.
Don't Log PII
The biggest pitfall in logging is quietly building a copy of your most sensitive data inside your log aggregator.
Things you should never log:
• Passwords (even hashed; a password hash belongs in the password column and nowhere else)
• API keys, secrets, JWT tokens
• Full credit card numbers, CVVs
• Social security numbers, government IDs
Things to log carefully:
• Email addresses — these are PII. Log a user ID instead and look up the email in the database when you actually need it for an investigation.
• Phone numbers, addresses — same.
• Free-text user content that may contain PII — be conscious about what makes it into logs.
A good defensive habit: redact-by-default in your logger configuration. Pino supports this natively via its redact option; Winston and structlog achieve the same with custom formatters and processors.
const logger = pino({
  redact: {
    paths: ['password', 'apiKey', 'token', 'creditCard.number', '*.password'],
    censor: '[REDACTED]'
  }
});
When something does leak, it'll be quiet — no alarm, no crash. Logs accumulate forever, and a year from now you'll have a copy of every email address that ever signed up sitting in CloudWatch. Build the habit now.
The Backend from First Principles series is based on what I learnt from Sriniously's YouTube playlist — a thoughtful, framework-agnostic walk through backend engineering. If this material helped you, please go check the original out: youtube.com/@Sriniously. The notes here are my own restatement for revisiting later.