Distributed Tracing
OpenTelemetry, Jaeger, Tempo — seeing how a single request flows across many services.
What Tracing Solves
In a microservices system, one user click might involve:
User ──► API Gateway ──► Auth Service ──► User Service ──► DB
                     ├──► Cart Service ──► Inventory Service ──► Cache
                     └──► Pricing Service ──► DB
Five services, two databases, one cache, three concurrent calls. The user complains "page is slow." Where's the bottleneck?
Logs alone won't tell you. You'd need to grep across five log streams, correlate by request ID (if you have one), and manually piece together the timing.
Distributed tracing does this automatically. Each request gets a unique trace ID. Every operation within that request becomes a "span" with timing. You see the entire request as a tree:
[Trace abc123: total 1.2s]
├── [API Gateway: 1.18s]
│   ├── [Auth check: 50ms]
│   ├── [User lookup: 80ms]
│   │   └── [DB query: 60ms]
│   ├── [Cart fetch: 200ms]
│   │   └── [Inventory check: 180ms]   ← BOTTLENECK
│   │       ├── [Cache miss: 20ms]
│   │       └── [DB query: 150ms]
│   └── [Pricing: 100ms]
└── [Send response: 20ms]
Suddenly the bottleneck is obvious. Inventory's DB query is slow. Now you fix the right thing.
OpenTelemetry — The Standard
OpenTelemetry (OTel) is the CNCF standard for observability data. As of 2026 it's the default: every major language has an SDK, and every major backend (Datadog, New Relic, Jaeger, Tempo) ingests OTel data.
Key components:
• APIs and SDKs to instrument your code
• A specification for trace, metric, and log formats
• A "Collector" — a daemon that receives, processes, and ships data
The goal: instrument once, send to any backend.
Auto-instrumentation — many languages support automatic tracing of common libraries (HTTP clients, DB drivers, frameworks) without code changes:
# Python
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
opentelemetry-instrument python myapp.py
Done. Your Flask/Django/FastAPI requests get traced, your DB queries become spans, your HTTP outbound calls get tracked.
For custom instrumentation:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("fetch_inventory"):
            inventory = check_inventory(order_id)

        with tracer.start_as_current_span("charge_payment"):
            payment_id = charge(order_id)

        span.set_attribute("payment.id", payment_id)
        return payment_id
Each start_as_current_span becomes a node in the trace. Attributes attach metadata.
Spans, Traces, and Context
The data model:
Trace — a tree of spans representing one logical operation (typically one user request).
Span — a single unit of work within a trace. Has start time, end time, attributes, events.
Each span has:
• trace_id — same for all spans in a trace
• span_id — unique to this span
• parent_span_id — links to the parent (forms the tree)
• name — what operation
• attributes — key-value metadata
• events — timestamped log entries within the span (see the sketch after this list)
• status — OK or ERROR
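Attributes and status already appear in the process_order example above; events are the one piece not shown yet. A minimal sketch (the event name and attribute key are made up for illustration):
from opentelemetry import trace

span = trace.get_current_span()
# An event is a timestamped annotation inside the current span. It is cheaper than a
# child span for point-in-time facts like a cache miss or a retry.
span.add_event("cache.miss", {"cache.key": "user:42"})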
Critical concept: trace context propagation
When service A calls service B, A's HTTP request must carry the trace_id (and parent span_id) so B can attach its spans to the same trace. This happens via HTTP headers:
traceparent: 00-abc123def456...-1a2b3c4d5e6f7890-01
OpenTelemetry SDKs handle this automatically when you use instrumented HTTP clients. Without proper context propagation, you get DISCONNECTED traces (each service has its own trace, no relation visible).
Common pitfall: writing custom HTTP clients that don't propagate trace context. If a service shows up disconnected from the rest of the trace, this is usually why.
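If you do hand-roll an outbound call, the fix is one line with the OTel propagation API: inject() writes the current span's W3C traceparent header into whatever dict you hand it, and the receiving service's instrumentation extracts it back into a parent context. A sketch, assuming the standard requests library (the inventory URL is a placeholder):
import requests
from opentelemetry.propagate import inject

def check_inventory(order_id):
    headers = {}
    inject(headers)   # adds traceparent (plus any tracestate/baggage) for the current span
    return requests.get(
        f"https://inventory.internal/check/{order_id}",   # placeholder URL
        headers=headers,
        timeout=5,
    )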
Backends — Where Traces Go
Where you store and view traces:
Jaeger — CNCF, popular open-source backend. Easy to run.
docker run -d -p 16686:16686 -p 4317:4317 jaegertracing/all-in-one
Then point your OTel exporter at localhost:4317. Traces appear at http://localhost:16686.
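Auto-instrumentation picks the endpoint up from environment variables; if you wire the SDK by hand instead, it looks roughly like this (a sketch: the service name is an example, and it assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages installed earlier):
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Send spans over OTLP/gRPC to the Jaeger container started above.
provider = TracerProvider(resource=Resource.create({"service.name": "orders-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)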
Tempo — Grafana's tracing backend. Like Loki but for traces. Cheap to run, integrates with Grafana.
Zipkin — older, predates OTel. Still around.
Cloud-managed:
• AWS X-Ray — built into AWS, automatic for Lambda
• GCP Cloud Trace — built into GCP
• Datadog APM — pricey but excellent UX
• Honeycomb — events-based, novel UX, great for debugging weird issues
• Lightstep / Splunk Observability
For most teams: start with Tempo (if using Grafana) or Jaeger (if not). Move to managed when scale or features warrant it.
Sampling — you can't trace 100% of requests at scale (too expensive). Common approaches:
• Head-based sampling — decide at request start, before knowing if anything went wrong. 1-10% typical.
• Tail-based sampling — decide at request end. Sample 100% of errors and slow requests, sample less of healthy ones. Smarter but harder.
The OTel Collector handles both. Tail-based is the right pattern for production but requires more infrastructure.
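In the Python SDK, head-based sampling is a one-liner on the TracerProvider; a sketch at 10% (ParentBased makes downstream services honor whatever the first service decided, so a trace is either kept everywhere or dropped everywhere):
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of new traces; child services follow the caller's sampling decision.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))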
Practical Tracing Patterns
Patterns that pay off:
1. Always include trace_id in your logs
from opentelemetry import trace

span = trace.get_current_span()
trace_id = format(span.get_span_context().trace_id, "032x")   # 32-char hex, matches the traceparent header
logger.info("order_processed", trace_id=trace_id, order_id=order_id)   # structured logger (e.g. structlog) that accepts key-value fields
Now you can pivot from a log line to its trace and back.
2. Add semantic attributes
span.set_attribute("http.method", "POST")
span.set_attribute("http.route", "/orders")
span.set_attribute("user.id", user_id)
span.set_attribute("order.amount_cents", amount)
Standard names from OTel's semantic conventions (https://opentelemetry.io/docs/specs/semconv/) make traces queryable across services.
3. Mark errors explicitly
from opentelemetry import trace

span = trace.get_current_span()
try:
    do_work()
except Exception as e:
    span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
    span.record_exception(e)   # attaches the exception and stack trace as a span event
    raise
4. Don't trace everything everywhere
Tracing every internal function call creates noise. Trace at boundaries:
• Incoming HTTP requests
• Outgoing HTTP requests
• DB queries
• Cache operations
• Queue publishes / consumes
• Significant business operations (see the decorator sketch below)
If your trace has 500 spans, you've over-instrumented.
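For that last bullet, a tiny decorator keeps business-level spans cheap to add without scattering with-blocks everywhere; a sketch (the traced helper and apply_discounts are made-up names):
import functools
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def traced(name):
    # Wrap a function in a span named after the business operation it performs.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(name):
                return fn(*args, **kwargs)
        return wrapper
    return decorator

@traced("apply_discounts")
def apply_discounts(cart):
    ...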
5. Use the trace as the SOURCE OF TRUTH for "how did this request go?"
A trace is more useful than any single log line. Build your debugging workflow around trace IDs.
Putting It Together
The 2026 cloud-native observability stack:
Your apps emit:
• Metrics (Prometheus format)
• Logs (structured JSON)
• Traces (OpenTelemetry)
              │
              ▼
    OpenTelemetry Collector
 (parses, enriches, fans out)
              │
      ┌───────┼───────┐
      ▼       ▼       ▼
 Prometheus  Loki   Tempo
      │       │       │
      └───────┴───────┘
              │
              ▼
           Grafana
      (one UI for all 3)
This stack is open source, integrates seamlessly, and can be self-hosted or managed (Grafana Cloud has a generous free tier for all three).
For each request you can:
1. See request rate / error rate / latency in Prometheus dashboards
2. Click into a problem timeframe → see logs in Loki
3. Click on a log line → see the full trace in Tempo
4. Identify the slow span → know which service to optimize
This investment compounds. Teams with mature observability ship faster, recover faster, and have less mystery production behavior. The next lesson covers SRE — the engineering discipline that takes observability and turns it into reliability practices.