Logging at Scale
Structured logs, ELK, Loki, and the discipline of logging things you'll actually want at 3 AM.
Why Centralized Logging
A single server's logs are easy: SSH in, tail -f /var/log/app.log. Done.
A fleet of servers is different. With 50 instances of your service, where IS the log line for the request that just failed? Which instance handled it? When did it start failing? Did this happen on other instances?
Centralized logging solves this by:
• Collecting logs from every instance
• Shipping them to one central, searchable store
• Letting you query across the whole fleet at once
• Correlating related lines across services
Without it, debugging a distributed system means SSHing into boxes one by one and guessing. With it, you can answer "what happened to user 12345's request at 3:42 PM?" in seconds.
Structured Logging — The Foundation
Old-school logs are unstructured strings:
2024-05-05 10:23:45 INFO User alice logged in from 192.168.1.5
2024-05-05 10:23:46 ERROR Payment failed for order 12345: insufficient funds
Searchable? Sort of, with grep. Aggregatable ("how many login attempts per minute")? No.
Structured logs are JSON (or similar) with explicit fields:
{
  "timestamp": "2024-05-05T10:23:45.123Z",
  "level": "info",
  "event": "user_login",
  "user_id": "alice",
  "ip": "192.168.1.5",
  "trace_id": "abc123def456"
}
{
  "timestamp": "2024-05-05T10:23:46.789Z",
  "level": "error",
  "event": "payment_failed",
  "order_id": 12345,
  "reason": "insufficient_funds",
  "amount_cents": 4999,
  "trace_id": "abc124ghi789"
}
Now you can:
• Search by field: event:user_login AND user_id:alice
• Aggregate: count of payment_failed by reason
• Correlate: all logs with this trace_id
Use a logging library that emits structured logs rather than hand-rolling JSON (a sketch follows the field lists below):
• Python — structlog, loguru
• Node — pino, winston
• Go — zap, slog
• Java — Logback with JSON encoder
Required fields in every log line:
• timestamp (ISO 8601 with timezone)
• level (debug, info, warn, error)
• event or message
• service name
• trace ID (if available — for correlation across services)
Optional but valuable:
• user_id (for support investigations — but never email/PII directly)
• request_id / correlation_id
• version / commit SHA (which deploy emitted this)
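Putting it together — a minimal sketch with Python's structlog (the service name, version, and trace ID values here are illustrative):

import structlog

# Emit one JSON object per line, with an ISO-8601 UTC timestamp and a level field.
structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ]
)

# Bind the fields that belong on every line from this service.
log = structlog.get_logger(service="payments", version="a1b2c3d")

# Event name first; everything else is an explicit field, never string interpolation.
log.error(
    "payment_failed",
    order_id=12345,
    reason="insufficient_funds",
    amount_cents=4999,
    trace_id="abc124ghi789",
)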
The ELK Stack — Elasticsearch + Logstash + Kibana
The classic logging stack:
Filebeat / Fluent Bit — collects logs from files, containers, applications. Lightweight agent on each host.
Logstash — parses, enriches, transforms logs. Powerful but heavy.
Elasticsearch — stores logs, indexes for search.
Kibana — UI for searching and visualizing.
Production deployment:
App writes logs ──► Container ──► Filebeat agent
                                        │
                                        ▼
                             Logstash (parse, enrich)
                                        │
                                        ▼
                           Elasticsearch (store, index)
                                        │
                                        ▼
                                   Kibana (UI)
Searching in Kibana:
event:payment_failed AND amount_cents:>10000
level:error AND timestamp:[now-1h TO now]
trace_id:abc123def456
Pros:
• Mature, battle-tested
• Excellent search and visualization
• Big ecosystem
Cons:
• Operationally heavy (Elasticsearch is hungry for disk and RAM)
• Expensive at scale
• Licensing churn — SSPL/Elastic License since 2021, with an AGPL option added back in 2024
For most teams, ELK works but is overkill. Smaller alternatives are usually a better fit.
Loki — Logs the Cloud-Native Way
Loki is Prometheus' sibling for logs, built by Grafana Labs. It's designed differently from Elasticsearch:
• Doesn't full-text-index logs (cheaper to run)
• Indexes only LABELS (service, environment, region)
• Stores log content compressed
• Query via LogQL (similar to PromQL)
Why this matters: a small Loki cluster handles what would require a much larger Elasticsearch cluster.
Architecture:
Apps ──► Promtail (or Fluent Bit, OTel) ──► Loki
                                             │
                                             ▼
                                          Grafana
Grafana is the UI for both metrics and logs — you can jump between a Prometheus panel and the matching Loki logs in one dashboard.
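A minimal Promtail scrape config, as a sketch — the Loki URL, file paths, and label values here are illustrative:

server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml              # where Promtail remembers read offsets
clients:
  - url: http://loki:3100/loki/api/v1/push   # your Loki endpoint
scrape_configs:
  - job_name: app
    static_configs:
      - targets: [localhost]
        labels:
          service: api
          environment: production
          __path__: /var/log/app/*.log       # __path__ selects the files to tail

Keep labels low-cardinality (service, environment — never user IDs): every unique label combination becomes a separate stream, and labels are exactly what you query by in LogQL.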
LogQL examples:
{service="api"} |= "error"
{service="api", environment="production"}
| json
| level="error"
# Rate of errors
rate({service="api"} |= "error" [5m])
# Logs containing trace_id
{service="api"} | json | trace_id="abc123"
When Loki is great:
• You're already using Prometheus + Grafana
• You want cheap log storage
• You don't need full-text search across log content
• You query logs by service/environment/level mostly
When Loki is less great:
• You need full-text search across logs (Elasticsearch wins)
• Complex log enrichment / parsing pipelines (Logstash wins)
• Compliance requirements that mandate specific tools
Loki is gaining ground in 2026. The combination of Prometheus + Loki + Grafana + Tempo (tracing, next lesson) is the cloud-native observability default.
Cloud-Provider Logging
Cloud providers have built-in logging:
AWS CloudWatch Logs — collects logs from EC2, Lambda, ECS, anything. Stores them in log groups. Searchable via the CloudWatch Logs Insights query language.
GCP Cloud Logging — collects from every Google Cloud service. Powerful filter language. Generous free tier.
Azure Monitor Logs — the Azure equivalent, queried with Kusto (KQL).
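Each comes with its own query language. For comparison with the Kibana and LogQL examples above, a CloudWatch Logs Insights query — the field names assume your app emits structured JSON:

fields @timestamp, event, trace_id
| filter level = "error"
| sort @timestamp desc
| limit 50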
Pros:
• Zero setup for cloud-native services (Lambda logs go straight there)
• Integrated with cloud IAM, alerting
• No infrastructure to manage
Cons:
• Lock-in
• Search performance varies
• Cost can balloon at scale (especially CloudWatch Logs)
Cost gotcha for AWS CloudWatch Logs:
• Ingestion: $0.50/GB
• Storage: $0.03/GB/month
• Query: $0.005 per GB scanned
A chatty app logging 10 GB/day = $5/day in ingestion = $150/month, before storage and queries.
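A back-of-the-envelope helper hard-coding the list prices above — a rough sketch, not a billing calculator (prices vary by region, and the 30-day month is an approximation):

def cloudwatch_logs_monthly_cost(gb_per_day, gb_scanned=0.0, months_retained=1.0):
    """Rough monthly CloudWatch Logs bill in USD at the list prices above."""
    ingestion = gb_per_day * 30 * 0.50                  # $0.50 per GB ingested
    storage = gb_per_day * 30 * months_retained * 0.03  # $0.03 per GB-month stored
    queries = gb_scanned * 0.005                        # $0.005 per GB scanned
    return ingestion + storage + queries

print(cloudwatch_logs_monthly_cost(10))  # the chatty app above: 159.0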
Many teams use CloudWatch Logs for cloud services they can't avoid (Lambda, RDS) and ship application logs elsewhere (Loki, Datadog) for cost reasons.
What to Log — and What Not To
Logging decisions you'll thank yourself for:
DO log:
• Significant business events (signup, login, purchase)
• Errors with full context (stack trace, request data, user ID)
• Slow operations (any DB query > 1s, any external call > 5s)
• Authentication events (logins, failures, token refreshes)
• Configuration changes
• Service start/stop, deploy markers
• Cache misses for expensive operations
DON'T log:
• Routine successful requests (you have metrics for that — counter incr)
• Sensitive data (passwords, full credit card numbers, social security numbers)
• Health-check responses (they happen every few seconds)
• Verbose debug output in production (use DEBUG level + sampling instead)
The PII trap (Backend Module 15 covered this). Three rules:
1. Never log passwords, full card numbers, secrets
2. Log user IDs, not emails — look up email from ID when you need it
3. Use redaction in your logger config:
const pino = require('pino');

const logger = pino({
  redact: {
    paths: ['password', 'apiKey', '*.password', 'creditCard.number'],
    censor: '[REDACTED]'
  }
});
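The same idea in Python, as a custom structlog processor — a sketch, with illustrative key names:

SENSITIVE_KEYS = {"password", "api_key", "credit_card_number", "ssn"}

def redact_sensitive(logger, method_name, event_dict):
    # A structlog processor receives the event dict before rendering;
    # overwrite sensitive values in place, then pass the dict along.
    for key in event_dict:
        if key in SENSITIVE_KEYS:
            event_dict[key] = "[REDACTED]"
    return event_dict

Add it to the processor list before the JSONRenderer, so nothing sensitive ever reaches disk.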
Log levels — discipline:
• ERROR — system failure that needs human attention
• WARN — recoverable issue, but unusual
• INFO — significant business events
• DEBUG — diagnostic detail, off in production
Don't log user errors at ERROR level. "Wrong password" is INFO at most. Otherwise your error rate metric is meaningless.
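Concretely (event names illustrative):

# A user typo is business as usual — INFO, with context for support.
log.info("login_failed", user_id="u_12345", reason="bad_password")

# The gateway being unreachable needs a human — ERROR.
log.error("payment_gateway_unreachable", gateway="primary", timeout_s=5)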
Operational Discipline
A few habits that compound:
1. Log retention policy. Logs cost money. Define how long to keep them (an ILM sketch follows the tiers):
• Hot (queryable, fast) — 7-30 days
• Warm (queryable, slow) — 30-90 days
• Cold (archived, S3 Glacier) — 1+ year for compliance
• Delete after that
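If your logs live in Elasticsearch, the tiers map onto an ILM (index lifecycle management) policy — a sketch, with illustrative ages:

PUT _ilm/policy/app-logs
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_age": "7d" } } },
      "warm":   { "min_age": "30d", "actions": { "set_priority": { "priority": 50 } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}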
2. Sampling for high-volume info logs. If your service handles 10k req/sec, logging every request is overkill:
import random

if random.random() < 0.01:  # 1% sample
    logger.info("request_succeeded", ...)
3. Every error log gets investigated within 24 hours OR is downgraded. If it's not actionable, it shouldn't be ERROR.
4. Build a log dashboard, not just ad-hoc searches. Pre-build queries for the common questions: "errors in the last hour", "5xx by endpoint", "auth failures by IP".
5. Test your logging. Write a test that triggers an error and asserts the log includes the relevant context. Otherwise you'll learn at 3 AM that your error logs lack the user_id you need.
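A sketch using structlog's test helper — process_payment is a hypothetical function that logs a payment_failed event:

from structlog.testing import capture_logs

def test_payment_failure_logs_context():
    with capture_logs() as logs:
        process_payment(order_id=12345, amount_cents=4999)  # hypothetical call
    failures = [e for e in logs if e["event"] == "payment_failed"]
    assert failures, "expected a payment_failed log entry"
    assert failures[0]["order_id"] == 12345  # the context you need at 3 AM
    assert "trace_id" in failures[0]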
The next lesson covers traces — connecting individual log lines into a coherent picture of one request's journey.