Distributed Tracing
OpenTelemetry, Jaeger, Tempo — seeing how a single request flows across many services.
What Tracing Solves
In a microservices system, one user click might involve:
User ──► API Gateway ──► Auth Service ──► User Service ──► DB
                     ├──► Cart Service ──► Inventory Service ──► Cache
                     └──► Pricing Service ──► DB
Five services, two databases, one cache, three concurrent calls. The user complains "page is slow." Where's the bottleneck?
Logs alone won't tell you. You'd need to grep across five log streams, correlate by request ID (if you have one), and manually piece together the timing.
Distributed tracing does this automatically. Each request gets a unique trace ID. Every operation within that request becomes a "span" with timing. You see the entire request as a tree:
[Trace abc123: total 1.2s]
├── [API Gateway: 1.18s]
│   ├── [Auth check: 50ms]
│   ├── [User lookup: 80ms]
│   │   └── [DB query: 60ms]
│   ├── [Cart fetch: 200ms]
│   │   └── [Inventory check: 180ms]   ← BOTTLENECK
│   │       ├── [Cache miss: 20ms]
│   │       └── [DB query: 150ms]
│   └── [Pricing: 100ms]
└── [Send response: 20ms]
Suddenly the bottleneck is obvious. Inventory's DB query is slow. Now you fix the right thing.
OpenTelemetry — The Standard
OpenTelemetry (OTel) is the CNCF standard for observability data. As of 2026 it's the default: every major language has an SDK, and every major backend (Datadog, New Relic, Jaeger, Tempo) ingests OTel data.
Key components:
• APIs and SDKs to instrument your code
• A specification for trace, metric, and log formats
• A "Collector" — a daemon that receives, processes, and ships data
The goal: instrument once, send to any backend.
Auto-instrumentation — many languages support automatic tracing of common libraries (HTTP clients, DB drivers, frameworks) without code changes:
# Python
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
opentelemetry-instrument python myapp.py
Done. Your Flask/Django/FastAPI requests get traced, your DB queries become spans, your HTTP outbound calls get tracked.
For custom instrumentation:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("fetch_inventory"):
            inventory = check_inventory(order_id)

        with tracer.start_as_current_span("charge_payment"):
            payment_id = charge(order_id)

        span.set_attribute("payment.id", payment_id)
        return payment_id
Each start_as_current_span becomes a node in the trace. Attributes attach metadata.
Spans, Traces, and Context
The data model:
Trace — a tree of spans representing one logical operation (typically one user request).
Span — a single unit of work within a trace. Has start time, end time, attributes, events.
Each span has:
• trace_id — same for all spans in a trace
• span_id — unique to this span
• parent_span_id — links to the parent (forms the tree)
• name — what operation
• attributes — key-value metadata
• events — timestamped log entries within the span (see the sketch after this list)
• status — OK or ERROR
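Attributes and status already appear in the process_order example above; events are the one piece not shown yet. A minimal sketch (the event name and attribute key are made up for illustration):
from opentelemetry import trace

span = trace.get_current_span()
# An event is a timestamped annotation inside the current span. It is cheaper than a
# child span for point-in-time facts like a cache miss or a retry.
span.add_event("cache.miss", {"cache.key": "user:42"})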
Critical concept: trace context propagation
When service A calls service B, A's HTTP request must carry the trace_id (and parent span_id) so B can attach its spans to the same trace. This happens via HTTP headers:
traceparent: 00-abc123def456...-1a2b3c4d5e6f7890-01
OpenTelemetry SDKs handle this automatically when you use instrumented HTTP clients. Without proper context propagation, you get DISCONNECTED traces (each service has its own trace, no relation visible).
Common pitfall: writing custom HTTP clients that don't propagate trace context. If a service shows up disconnected from the rest of the trace, this is usually why.
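If you do hand-roll an outbound call, the fix is one line with the OTel propagation API: inject() writes the current span's W3C traceparent header into whatever dict you hand it, and the receiving service's instrumentation extracts it back into a parent context. A sketch, assuming the standard requests library (the inventory URL is a placeholder):
import requests
from opentelemetry.propagate import inject

def check_inventory(order_id):
    headers = {}
    inject(headers)   # adds traceparent (plus any tracestate/baggage) for the current span
    return requests.get(
        f"https://inventory.internal/check/{order_id}",   # placeholder URL
        headers=headers,
        timeout=5,
    )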
Backends — Where Traces Go
Where you store and view traces:
Jaeger — CNCF, popular open-source backend. Easy to run.
docker run -d -p 16686:16686 -p 4317:4317 jaegertracing/all-in-one
Then point your OTel exporter at localhost:4317. Traces appear at http://localhost:16686.
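Auto-instrumentation picks the endpoint up from environment variables; if you wire the SDK by hand instead, it looks roughly like this (a sketch: the service name is an example, and it assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages installed earlier):
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Send spans over OTLP/gRPC to the Jaeger container started above.
provider = TracerProvider(resource=Resource.create({"service.name": "orders-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)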
Tempo — Grafana's tracing backend. Like Loki but for traces. Cheap to run, integrates with Grafana.
Zipkin — older, predates OTel. Still around.
Cloud-managed:
• AWS X-Ray — built into AWS, automatic for Lambda
• GCP Cloud Trace — built into GCP
• Datadog APM — pricey but excellent UX
• Honeycomb — events-based, novel UX, great for debugging weird issues
• Lightstep / Splunk Observability
For most teams: start with Tempo (if using Grafana) or Jaeger (if not). Move to managed when scale or features warrant it.
Sampling — you can't trace 100% of requests at scale (too expensive). Common approaches:
• Head-based sampling — decide at request start, before knowing if anything went wrong. 1-10% typical.
• Tail-based sampling — decide at request end. Sample 100% of errors and slow requests, sample less of healthy ones. Smarter but harder.
The OTel Collector handles both. Tail-based is the right pattern for production but requires more infrastructure.
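In the Python SDK, head-based sampling is a one-liner on the TracerProvider; a sketch at 10% (ParentBased makes downstream services honor whatever the first service decided, so a trace is either kept everywhere or dropped everywhere):
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of new traces; child services follow the caller's sampling decision.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))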
Practical Tracing Patterns
Patterns that pay off:
1. Always include trace_id in your logs
from opentelemetry import trace

span = trace.get_current_span()
trace_id = format(span.get_span_context().trace_id, "032x")   # 32-char hex, matches the traceparent header
logger.info("order_processed", trace_id=trace_id, order_id=order_id)   # structured logger (e.g. structlog) that accepts key-value fields
Now you can pivot from a log line to its trace and back.
2. Add semantic attributes
span.set_attribute("http.method", "POST")
span.set_attribute("http.route", "/orders")
span.set_attribute("user.id", user_id)
span.set_attribute("order.amount_cents", amount)
Standard names from OTel's semantic conventions (https://opentelemetry.io/docs/specs/semconv/) make traces queryable across services.
3. Mark errors explicitly
from opentelemetry import trace

span = trace.get_current_span()
try:
    do_work()
except Exception as e:
    span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
    span.record_exception(e)   # attaches the exception and stack trace as a span event
    raise
4. Don't trace everything everywhere
Tracing every internal function call creates noise. Trace at boundaries:
• Incoming HTTP requests
• Outgoing HTTP requests
• DB queries
• Cache operations
• Queue publishes / consumes
• Significant business operations (see the decorator sketch below)
If your trace has 500 spans, you've over-instrumented.
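For that last bullet, a tiny decorator keeps business-level spans cheap to add without scattering with-blocks everywhere; a sketch (the traced helper and apply_discounts are made-up names):
import functools
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def traced(name):
    # Wrap a function in a span named after the business operation it performs.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(name):
                return fn(*args, **kwargs)
        return wrapper
    return decorator

@traced("apply_discounts")
def apply_discounts(cart):
    ...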
5. Use the trace as the SOURCE OF TRUTH for "how did this request go?"
A trace is more useful than any single log line. Build your debugging workflow around trace IDs.
Putting It Together
The 2026 cloud-native observability stack:
Your apps emit:
• Metrics (Prometheus format)
• Logs (structured JSON)
• Traces (OpenTelemetry)
              │
              ▼
    OpenTelemetry Collector
 (parses, enriches, fans out)
              │
      ┌───────┼───────┐
      ▼       ▼       ▼
 Prometheus  Loki   Tempo
      │       │       │
      └───────┴───────┘
              │
              ▼
           Grafana
      (one UI for all 3)
This stack is open source, integrates seamlessly, and can be self-hosted or managed (Grafana Cloud has a generous free tier for all three).
For each request you can:
1. See request rate / error rate / latency in Prometheus dashboards
2. Click into a problem timeframe → see logs in Loki
3. Click on a log line → see the full trace in Tempo
4. Identify the slow span → know which service to optimize
This investment compounds. Teams with mature observability ship faster, recover faster, and have less mystery production behavior. The next lesson covers SRE — the engineering discipline that takes observability and turns it into reliability practices.