In production, everything is eventually broken. Not in a dramatic outage sense — in the everyday sense that some request is slow, some error rate is creeping up, some consumer is lagging behind. The question isn’t whether things go wrong; it’s whether you can see it, localize it, and explain it before users complain. This article is about how to build that visibility — the signals you need, the tools that compose them, and the operational discipline that makes telemetry actually useful instead of just expensive.

Monitoring vs. observability

Two words people use interchangeably but shouldn’t:

  • Monitoring answers known questions. “Is CPU above 80%? Is the error rate above 1%?” You write the check, the alert fires.
  • Observability lets you ask new questions about your system without shipping new code. “What did all requests from user X in the last 10 minutes do, and where did they spend time?”

Monitoring is a subset of observability. You need both. Observability is what saves you at 3 AM when the incident doesn’t match any alert you wrote.

The three signals

Everything you need boils down to three types of data, each with a distinct role:

  ┌──────────┬──────────────────────────────────┬──────────────────────┐
  │  Signal  │              Answers             │     Storage          │
  ├──────────┼──────────────────────────────────┼──────────────────────┤
  │ Metrics  │  "is something broken?"          │  time-series DB      │
  │          │  aggregated numbers over time    │  Prometheus          │
  ├──────────┼──────────────────────────────────┼──────────────────────┤
  │  Traces  │  "where is it broken?"           │  trace store         │
  │          │  per-request span graph          │  Tempo / Jaeger      │
  ├──────────┼──────────────────────────────────┼──────────────────────┤
  │   Logs   │  "why is it broken?"             │  log store           │
  │          │  structured events with context  │  Loki / ES           │
  └──────────┴──────────────────────────────────┴──────────────────────┘

They aren’t interchangeable. Metrics are cheap to store but can’t tell you about one specific request. Traces are expensive but show exactly which span slowed down. Logs are rich but drown you if you grep them.

The magic happens when you correlate them — a trace ID in a log line, an exemplar in a metric pointing to a trace, all navigable in one click.

Signal 1 — Metrics

What to measure

Tom Wilkie’s RED method (often confused with Google’s four golden signals) gives you the minimum viable metric set per service:

  • Rate — requests per second
  • Errors — error rate
  • Duration — latency distribution (p50, p95, p99)

For infrastructure, apply Brendan Gregg’s USE method:

  • Utilization — how busy is the resource
  • Saturation — how much queued work is backing up
  • Errors — failures

Ship RED for every HTTP/gRPC endpoint, every Kafka consumer, every background job. That alone catches ~80% of problems.
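A minimal sketch of RED instrumentation with Micrometer — one Timer per endpoint yields all three signals (OrderService, OrderCommand, and OrderResult are hypothetical names):

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

class OrderEndpoint {
    private final MeterRegistry registry;
    private final OrderService orders;          // hypothetical business service

    OrderEndpoint(MeterRegistry registry, OrderService orders) {
        this.registry = registry;
        this.orders = orders;
    }

    OrderResult place(OrderCommand cmd) {
        Timer.Sample sample = Timer.start(registry);    // Rate: the timer's count
        String outcome = "success";
        try {
            return orders.place(cmd);
        } catch (Exception e) {
            outcome = "error";                           // Errors: split by outcome tag
            throw e;
        } finally {
            sample.stop(Timer.builder("http.server.requests")
                .tag("route", "/orders")
                .tag("outcome", outcome)
                .publishPercentileHistogram()            // Duration: buckets for any quantile
                .register(registry));
        }
    }
}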

Histograms, not averages

Averages lie. p50 = 50 ms with p99 = 3 s is a different world from p50 = 100 ms with p99 = 110 ms, yet the averages can look nearly identical. Always plot percentiles, computed from histogram buckets.

Prometheus histogram in Java via Micrometer:

import io.micrometer.core.instrument.Timer;

Timer timer = Timer.builder("http.server.requests")
    .tag("method", "POST")
    .tag("uri", "/orders")
    .publishPercentileHistogram()   // export bucket counts, not precomputed quantiles
    .register(meterRegistry);

publishPercentileHistogram() exposes bucket data, so you can compute any quantile at query time, not just the ones you hardcoded.
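For example, p99 per endpoint in PromQL, assuming Micrometer’s default Prometheus naming:

histogram_quantile(
  0.99,
  sum by (le, uri) (rate(http_server_requests_seconds_bucket[5m]))
)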

Cardinality discipline

The #1 way to blow up a metrics store: high-cardinality labels.

// BAD — one time-series per user
Counter.builder("orders.placed").tag("userId", userId).register(registry);

// BAD — one time-series per URL
Counter.builder("http.requests").tag("path", req.getPath()).register(registry);

// GOOD — bounded cardinality
Counter.builder("orders.placed").tag("tier", user.tier()).register(registry);
Counter.builder("http.requests").tag("route", matchedRoute).register(registry);

Rule of thumb: a label’s cardinality must be bounded and small (tens, maybe hundreds). User IDs, order IDs, request paths with dynamic segments — all forbidden as metric labels. Put them in traces and logs instead.

SLOs are the real dashboard

Dashboards with 50 panels don’t tell you if the service is healthy. SLOs do:

  • SLI — the measurement (“fraction of requests under 300 ms”)
  • SLO — the target (“99.5% over rolling 30 days”)
  • Error budget — what you have left to burn (“0.5% = 3.6 hours of downtime per month”)

When the error budget is exhausted, feature work stops and reliability work starts. That’s the contract between engineering and product, enforced by a number.
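Concretely, the “fraction of requests under 300 ms” SLI falls out of the same histogram buckets — a PromQL sketch, assuming Micrometer’s Prometheus naming and a 0.3 s bucket boundary (in practice you’d precompute this with recording rules rather than a raw 30-day range query):

sum(rate(http_server_requests_seconds_bucket{uri="/checkout", le="0.3"}[30d]))
  /
sum(rate(http_server_requests_seconds_count{uri="/checkout"}[30d]))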

Signal 2 — Distributed tracing

Why tracing matters more as systems grow

In a monolith, a slow request has one timeline. In microservices, a single user request fans out across 8 services, 20 DB calls, 4 Kafka publishes. Tracing stitches those into one view:

  /checkout [850ms]
  ├── auth.verify [15ms]
  ├── cart.load [45ms]
  │   └── redis.get [3ms]
  ├── pricing.calculate [180ms]
  │   ├── catalog.fetch [40ms]
  │   └── tax.compute [135ms]      ← slow!
  ├── payments.charge [280ms]
  │   └── stripe.create [270ms]
  └── orders.persist [110ms]
      ├── postgres.insert [85ms]
      └── kafka.publish [20ms]

Without this, “checkout is slow” is a guessing game. With it, you see exactly which span is the culprit and whether it’s your code, your DB, or an external API.

Context propagation

The mechanism that makes this work: every request carries a trace_id and a span_id in its headers (W3C traceparent). When service A calls service B, A’s client library writes the headers; B’s server library reads them and continues the trace.
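On the wire that’s one header — version, trace-id, parent span-id, and flags (01 = sampled):

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01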

The standard today is OpenTelemetry. One library, one wire format (OTLP), works with every major vendor as a backend.

Java setup that gets you tracing for free:

<!-- pom.xml -->
<dependency>
  <groupId>io.opentelemetry.instrumentation</groupId>
  <artifactId>opentelemetry-spring-boot-starter</artifactId>
</dependency>

# application.yml
otel:
  exporter:
    otlp:
      endpoint: http://otel-collector:4317
  service:
    name: orders-service
  traces:
    sampler: parentbased_traceidratio
    sampler.arg: 0.1

That alone gives you traces for every HTTP request, gRPC call, Kafka publish/consume, and JDBC query — without touching business code.

Manual spans when auto-instrumentation isn’t enough

For business-level operations, add explicit spans:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

Tracer tracer = GlobalOpenTelemetry.getTracer("orders");

Span span = tracer.spanBuilder("order.place")
    .setAttribute("order.id", orderId.toString())
    .setAttribute("order.items", items.size())
    .setAttribute("customer.tier", customer.tier())
    .startSpan();

try (Scope s = span.makeCurrent()) {
    return orderService.place(cmd);
} catch (Exception e) {
    span.recordException(e);
    span.setStatus(StatusCode.ERROR);
    throw e;
} finally {
    span.end();
}

Span attributes are where you put high-cardinality data that was forbidden on metrics — user IDs, order IDs, request parameters. Traces are designed to handle it.

Sampling without losing the interesting traces

Keeping every trace is expensive. But uniform random sampling loses the slow and failing traces — exactly the ones you want. Two strategies that work:

  • Head sampling — decide at the root span (e.g. keep 10%). Cheap and simple, but the decision is made before you know whether the request will fail or be slow, so bias it where you can: force-sample retries, canary traffic, requests carrying a debug header.
  • Tail sampling — the OpenTelemetry Collector buffers complete traces, then keeps 100% of the errored and slow ones and a small fraction of the rest. More accurate, more complex to run — see the sketch below.

Most teams start with head sampling and graduate to tail sampling when volume demands it.
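A sketch of tail sampling in the Collector, assuming the contrib build’s tail_sampling processor — keep every errored trace, every trace over 1 s, and 10% of the rest:

processors:
  tail_sampling:
    decision_wait: 10s                  # buffer this long before judging a trace
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: sample-the-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 10}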

Signal 3 — Logs

Structured or nothing

Free-text logs are a dead end at scale. Ship JSON:

{
  "ts": "2026-04-22T08:14:02.417Z",
  "level": "WARN",
  "service": "orders",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "event": "order.declined",
  "order_id": "o-8421",
  "customer_id": "c-12",
  "reason": "insufficient_funds",
  "amount_cents": 4499
}

Every field is queryable in Loki / Elasticsearch. You can answer “show me every order declined for insufficient funds in the last hour” with one query instead of grepping gigabytes of free text.
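For instance, counting the last hour’s declines by reason in LogQL — assuming service is indexed as a stream label:

sum by (reason) (
  count_over_time({service="orders"} | json | event="order.declined" [1h])
)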

Logback with a JSON encoder, one-time setup:

<appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
  <encoder class="net.logstash.logback.encoder.LogstashEncoder">
    <includeMdcKeyName>trace_id</includeMdcKeyName>
    <includeMdcKeyName>span_id</includeMdcKeyName>
  </encoder>
</appender>

Trace ID in every log line

Without trace_id in logs, traces and logs live in separate worlds and you pivot between them by guessing. With it, one click in the trace view filters the logs to exactly that request:

// conceptually, what the agent's MDC instrumentation does for you:
MDC.put("trace_id", Span.current().getSpanContext().getTraceId());
MDC.put("span_id", Span.current().getSpanContext().getSpanId());

The OpenTelemetry Java agent does this automatically. Don’t write it by hand.

Log levels with intent

A rule that survives contact with production:

  • ERROR — something broke, a human needs to see this. Alert-worthy in aggregate.
  • WARN — something unexpected, we handled it, but worth knowing. Never used for routine conditions.
  • INFO — major business events (order placed, payment settled). Low volume.
  • DEBUG — developer-level detail. Off in prod, on when investigating.
  • TRACE — method-level detail. Effectively never on.

The common failure is logging ERROR for things that are actually expected (like “user not found” when that’s a valid code path). Errors become noise, alerts get ignored, and real errors get missed.
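A sketch of the distinction in code (UserRepository and User are hypothetical):

import java.util.Optional;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class UserLookup {
    private static final Logger log = LoggerFactory.getLogger(UserLookup.class);
    private final UserRepository repository;   // hypothetical data access

    UserLookup(UserRepository repository) {
        this.repository = repository;
    }

    Optional<User> find(String userId) {
        Optional<User> user = repository.find(userId);
        if (user.isEmpty()) {
            // a valid code path, not a failure: INFO at most, never ERROR
            log.info("event=user.not_found user_id={}", userId);
        }
        return user;
    }
}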

Correlation — making the three signals one picture

The goal isn’t three separate stacks. It’s one investigation flow:

  Alert fires: p99 latency > SLO
        ↓
  Dashboard shows the spike, exemplar link to a sample trace
        ↓
  Trace shows tax.compute taking 3 s
        ↓
  Click logs for that trace_id, see the upstream 5xx from tax-api
        ↓
  Root cause in 90 seconds

This flow requires deliberate integration:

  • Metrics export exemplars (a trace_id attached to sample data points)
  • Traces and logs share trace_id
  • UI (Grafana) links across all three

If your observability setup requires the on-call engineer to copy IDs manually between three tabs, it’s costing more than it saves.
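And the pieces have to be switched on deliberately: Prometheus, for instance, only stores exemplars behind a feature flag:

prometheus --enable-feature=exemplar-storage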

OpenTelemetry: the one thing to standardize on

If you take one architectural decision away from this article, let it be this: instrument once with OpenTelemetry, export to whatever backend you want.

   [ Services instrumented with OTel SDK ]
                    │ OTLP
                    ▼
         [ OpenTelemetry Collector ]
      ┌─────────────┼──────────────┐
      ▼             ▼              ▼
 [Prometheus]    [Tempo]        [Loki]
      ▼             ▼              ▼
  [   Grafana (single pane of glass)   ]

Why this matters: vendor lock-in in observability is real and expensive. OTel decouples instrumentation from backend. Switching from Datadog to an open-source stack — or the reverse — becomes a collector config change, not a codebase-wide rewrite.
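A sketch of the matching Collector pipeline — contrib build, illustrative endpoints:

receivers:
  otlp:
    protocols:
      grpc: {}

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  otlp/tempo:
    endpoint: tempo:4317
    tls: {insecure: true}
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    metrics: {receivers: [otlp], exporters: [prometheusremotewrite]}
    traces:  {receivers: [otlp], exporters: [otlp/tempo]}
    logs:    {receivers: [otlp], exporters: [loki]}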

The operational side

Alerting on symptoms, not causes

Alert on things users experience. An alert that fires because “disk is 85% full” wakes you up for something that might never affect anyone. An alert that fires because “5xx rate for /checkout exceeded SLO” wakes you up for something users are currently hitting.

Four rules that keep alerting sane:

  1. Every alert is actionable. If the response is “check, then go back to sleep,” the alert is wrong.
  2. Every alert has a runbook. A link in the alert message to the markdown that says exactly what to do.
  3. Alerts are owned. One team, one Slack channel, one rotation. Ownerless alerts rot.
  4. Noisy alerts are incidents. Treat alert fatigue with the same urgency as a real outage — because it causes them.

SLO-based alerting

Better than “p99 > 500 ms”: alert on error budget burn rate. If you’re burning budget fast enough that the monthly target will be missed, that’s urgent. If you’re burning it slowly, that’s a ticket, not a page.

Standard multi-window burn rate alert (from the Google SRE workbook):

- alert: HighErrorBudgetBurn
  # error_rate_* are recording rules; slo_target is the SLO as a fraction (e.g. 0.999)
  expr: |
    (error_rate_1h > 14.4 * (1 - slo_target))
      and
    (error_rate_5m > 14.4 * (1 - slo_target))

This fires when you’re burning error budget at 14.4× the sustainable rate — (1 − slo_target) is the budget, 0.001 for a 99.9% SLO — i.e. you’d exhaust a month’s budget in about 2 days if it continued. Fast enough to be urgent, slow enough not to be noisy.
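Pair it with the workbook’s slow-burn companion — burn rate 1× over 3-day/6-hour windows, which consumes ~10% of the monthly budget — routed to a ticket instead of a page (same placeholder recording rules):

- alert: SlowErrorBudgetBurn
  expr: |
    (error_rate_3d > 1 * (1 - slo_target))
      and
    (error_rate_6h > 1 * (1 - slo_target))
  labels:
    severity: ticket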

Incident response needs data at fingertips

During an incident, the on-call engineer has 30 seconds of attention for each tool before frustration sets in. That means:

  • The status dashboard has 5 panels, not 50 — request rate, error rate, p99, queue depth, DB connections
  • The runbook links directly from the alert, opens in one click
  • The deploy dashboard shows “was there a deploy in the last hour?” prominently
  • The change log across all services is queryable — “what went to prod today?”

The single most common cause of prolonged incidents isn’t hard-to-find root causes. It’s that the data to find them exists but isn’t reachable in a hurry.

Cost control

Observability at scale is expensive. Three levers that usually keep it in check:

  1. Cardinality hygiene — discussed above. This is the biggest lever by far.
  2. Sampling — 100% of errors, 10% of successes. Often a 10× cost reduction with minimal visibility loss.
  3. Retention tiering — 7 days hot, 30 days warm, 1 year cold in object storage. Most investigations need 7 days.

A common failure: all-or-nothing thinking that leads to disabling tracing entirely because “it’s too expensive.” Smart sampling keeps the cost manageable without going blind.

Checklist: is your observability production-grade?

  • RED metrics for every service endpoint
  • USE metrics for every shared resource (DB, cache, queue)
  • Histograms, not averages, for all latency metrics
  • Cardinality audited — no user IDs or paths as metric labels
  • Structured JSON logs with trace_id and span_id in every line
  • Distributed tracing via OpenTelemetry, auto-instrumented
  • Manual spans on business-critical operations with attributes
  • Head or tail sampling tuned to keep errors and slow traces
  • Traces and logs correlate via trace_id in the UI
  • SLOs defined per critical user journey
  • Alerting on SLO burn rate, not on raw thresholds
  • Every alert has an owner and a runbook
  • Exemplar metrics link dashboards to traces
  • Log retention tiered by value, not uniform
  • OpenTelemetry Collector in front of all backends

Closing thought

Observability isn’t a tool you buy — it’s a property of the system, like security or reliability. Every service, every call, every log line either makes your system more observable or less. The teams that run highload systems smoothly aren’t the ones with the most dashboards; they’re the ones where every signal has a purpose, every alert has an owner, and the path from “something is wrong” to “here is the line of code” is short enough to walk in the middle of the night.