Logging, Monitoring & Observability
Logs, metrics, traces — the three pillars. Correlation IDs. What to alert on.
The Three Pillars of Observability
Observability is the ability to understand what's happening inside your system from its external outputs. Three pillars:
Logs — Discrete events with timestamps. "User 123 logged in at 10:32:15."
Metrics — Numeric measurements over time. "95th percentile response time: 245ms."
Traces — Request flows across services. "Request R1 went through: API→Auth→DB→Cache."
You need all three. Logs tell you WHAT happened. Metrics tell you WHEN and HOW OFTEN. Traces tell you WHERE time was spent.
Structured Logging
Never log plain strings in production. Log structured JSON so logs are searchable and parseable.
Bad:
console.log("User 123 created order 456")
Good:
logger.info({
  event: "order.created",
  userId: "123",
  orderId: "456",
  amount: 49.99,
  currency: "USD",
  duration_ms: 145,
  requestId: "req_abc123"
});
Tools: Pino (Node.js), Zap (Go), structlog (Python).
Log levels:
DEBUG — verbose, dev only (never in production)
INFO — normal operations, key events
WARN — unexpected but handled (deprecated API used, retry succeeded)
ERROR — something failed and needs attention
FATAL — system cannot continue, about to crash
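The level hierarchy above is just a numeric threshold: a logger drops anything below its configured level. A minimal sketch of how that filtering works (a toy logger, not a real library; real loggers like Pino read the threshold from configuration the same way):

```javascript
// Toy level-filtered logger. Each level maps to a number; entries
// below the configured threshold are silently dropped.
const LEVELS = { DEBUG: 10, INFO: 20, WARN: 30, ERROR: 40, FATAL: 50 };

function createLogger(minLevel, sink = console.log) {
  const threshold = LEVELS[minLevel];
  const log = (level, entry) => {
    if (LEVELS[level] < threshold) return; // drop below-threshold entries
    sink(JSON.stringify({ level, time: new Date().toISOString(), ...entry }));
  };
  return {
    debug: (e) => log('DEBUG', e),
    info:  (e) => log('INFO', e),
    warn:  (e) => log('WARN', e),
    error: (e) => log('ERROR', e),
  };
}

// In production you'd set the threshold from the environment:
const logger = createLogger(process.env.LOG_LEVEL || 'INFO');
logger.debug({ event: 'cache.lookup' });                 // dropped at INFO
logger.info({ event: 'order.created', orderId: '456' }); // emitted
```

Flipping one environment variable to DEBUG during an incident, then back, is the standard workflow this enables.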
Metrics to Track
The RED Method (for services):
• Rate — requests per second
• Errors — error rate (%)
• Duration — response time (p50, p95, p99)
The USE Method (for resources):
• Utilization — CPU %, memory %, disk %
• Saturation — queue depth, pending requests
• Errors — hardware errors, packet drops
Key metrics for a backend API:
• Request rate by endpoint
• Error rate by status code and endpoint
• p50/p95/p99 latency by endpoint
• DB query duration
• Cache hit/miss ratio
• Queue depth and processing rate
• Active connections
• Memory and CPU usage
Tools: Prometheus + Grafana, DataDog, New Relic, CloudWatch.
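The p50/p95/p99 latencies above are percentiles over a window of observed request durations. A minimal sketch using the nearest-rank method (monitoring backends like Prometheus approximate this with histogram buckets instead of sorting raw samples):

```javascript
// Compute the p-th percentile of a sample window (nearest-rank method).
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // nearest rank, 1-based
  return sorted[Math.max(rank - 1, 0)];
}

const durationsMs = [12, 15, 20, 22, 25, 30, 45, 60, 120, 800];
console.log(percentile(durationsMs, 50)); // 25  (median: the typical request)
console.log(percentile(durationsMs, 95)); // 800 (tail dominated by the outlier)
console.log(percentile(durationsMs, 99)); // 800
```

This is why averages lie: the mean of that window is ~115ms, yet half of all requests finish in 25ms while the slowest one takes 800ms. Tail percentiles surface what the average hides.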
Distributed Tracing
In a microservices system, a single user request can touch five or more services. Tracing follows it across all of them.
Each request gets a Trace ID (unique per user request) and spans (timed operations within the trace).
Request → [API Gateway] → [Auth Service] → [Order Service] → [DB]
            span1(50ms)     span2(10ms)      span3(200ms)     span4(30ms)
A trace visualization shows the total time and exactly where latency lives.
How it works:
1. API Gateway creates Trace ID, injects into request headers
2. Each service extracts Trace ID, creates a child span
3. Spans are sent to a collector (Jaeger, Zipkin, DataDog APM)
4. Collector assembles the full trace
Standards: OpenTelemetry (OTEL), the merger of OpenTracing and OpenCensus, is now the de facto standard. Instrument once, export to any backend.
Alerting
Monitoring without alerting is useless. Set up alerts for:
- Error rate > 1% for 5 minutes → PagerDuty alert
- p99 latency > 2 seconds → Slack warning
- CPU > 90% for 10 minutes → Scale-up trigger
- DB connection pool exhausted → Immediate alert
- Queue depth > 10,000 → Scale workers alert
- Disk > 85% full → Warning (at 95%: critical)
- Memory leak detected (ever-increasing memory) → Alert
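A rule like "error rate > 1% for 5 minutes" boils down to evaluating a condition over a sliding window of recent requests. A minimal sketch (alerting systems like Prometheus evaluate this server-side; the threshold and window here are illustrative):

```javascript
// Evaluate a simple alert rule over a window of request outcomes.
// Each outcome: { ts: epoch ms, error: boolean }.
function errorRateAlert(outcomes, { windowMs, threshold }, now = Date.now()) {
  const recent = outcomes.filter((o) => now - o.ts <= windowMs);
  if (recent.length === 0) return { firing: false, rate: 0 };
  const rate = recent.filter((o) => o.error).length / recent.length;
  return { firing: rate > threshold, rate };
}

const now = Date.now();
const outcomes = [
  { ts: now - 1000, error: true },
  { ts: now - 2000, error: false },
  { ts: now - 3000, error: false },
  { ts: now - 4000, error: false },
];
// 1 error in 4 recent requests = 25%, above the 1% threshold → firing
console.log(errorRateAlert(outcomes, { windowMs: 5 * 60 * 1000, threshold: 0.01 }, now));
```

The "for 5 minutes" part matters: requiring the condition to hold over a window, rather than firing on a single bad sample, is what keeps transient blips from paging anyone.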
Alert fatigue is real. Only alert on things that require human action. Too many alerts → people start ignoring them.
On-call rotation: Someone is always responsible. Incidents have runbooks. Postmortems are blameless.
Correlation IDs — Tracing One Request Through Many Services
In a microservices system, a single user click can pass through five different services. Without a way to tie those logs together, debugging is detective work — "I think THIS log entry belongs to the same request as THAT log entry, but I can't be sure."
A correlation ID solves this. The first service to handle a request generates a unique ID and includes it in every log message AND in the headers of every downstream call. Every other service does the same. Now you can grep one ID across all your logs and see the full lifecycle of one specific request.
// Express middleware — assign a correlation ID to every incoming request
import { randomUUID } from 'crypto';

app.use((req, res, next) => {
  req.correlationId = req.headers['x-correlation-id'] || randomUUID();
  res.setHeader('X-Correlation-ID', req.correlationId);
  next();
});
// Every log line includes it
logger.info({
  event: 'payment_processed',
  correlationId: req.correlationId,
  userId: req.user.id,
  amount: payment.amount
});
When you call a downstream service, pass the correlation ID along:
await axios.post('/orders', payload, {
  headers: { 'X-Correlation-ID': req.correlationId }
});
Now investigating any issue is one search across all your services. This single technique — costing maybe ten lines of code — is one of the highest-leverage things you'll add to a production system.
Don't Log PII
The biggest pitfall in logging is quietly building a copy of your most sensitive data inside your log aggregator.
Things you should never log:
• Passwords (even hashed; a password hash belongs in the password column and nowhere else)
• API keys, secrets, JWT tokens
• Full credit card numbers, CVVs
• Social security numbers, government IDs
Things to log carefully:
• Email addresses — these are PII. Log a user ID instead and look up the email in the database when you actually need it for an investigation.
• Phone numbers, addresses — same.
• Free-text user content that may contain PII — be conscious about what makes it into logs.
A good defensive habit: redact-by-default in your logger configuration. Pino supports this natively via its redact option; Winston and structlog achieve the same with custom formatters and processors.
const logger = pino({
  redact: {
    paths: ['password', 'apiKey', 'token', 'creditCard.number', '*.password'],
    censor: '[REDACTED]'
  }
});
When something does leak, it'll be quiet — no alarm, no crash. Logs accumulate forever, and a year from now you'll have a copy of every email address that ever signed up sitting in CloudWatch. Build the habit now.
The Backend from First Principles series is based on what I learnt from Sriniously's YouTube playlist — a thoughtful, framework-agnostic walk through backend engineering. If this material helped you, please go check the original out: youtube.com/@Sriniously. The notes here are my own restatement for revisiting later.