“Average response time is 80ms” is the metric most teams start with. It’s also almost useless. This article covers why percentiles matter, how to read them, and how to turn them into SLOs that actually reflect user experience.
Why averages lie
Imagine 100 requests:
- 99 take 50ms
- 1 takes 5000ms
Average: 99.5ms, effectively 100ms. “Looks fine, about our target.”
That one user waited 5 seconds. If your service has a million requests an hour, 10,000 users had 5-second experiences. At 100 million requests, 1 million bad experiences. The “fine” average hides thousands of angry users.
Percentiles don’t.
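A minimal sketch of that arithmetic in plain Java (nothing here is specific to any metrics library): the average barely moves, while looking past the 99th percentile exposes the 5-second request.

import java.util.Arrays;

public class AverageHidesTheTail {
    public static void main(String[] args) {
        double[] latenciesMs = new double[100];
        Arrays.fill(latenciesMs, 0, 99, 50.0); // 99 requests at 50 ms
        latenciesMs[99] = 5000.0;              // 1 request at 5000 ms
        Arrays.sort(latenciesMs);

        double average = Arrays.stream(latenciesMs).average().orElse(0);
        double p50 = latenciesMs[49];               // median
        double slowestOnePercent = latenciesMs[99]; // everything beyond the 99th percentile

        // average ≈ 99.5 ms ("about our target"); the slowest 1% waited 5000 ms
        System.out.printf("avg=%.1f ms  p50=%.1f ms  slowest 1%%=%.1f ms%n",
                average, p50, slowestOnePercent);
    }
}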
What percentiles mean
- p50 (median) — half of requests are faster than this, half slower. A useful summary of typical behavior.
- p95 — 95% of requests are faster. The top 5% are slower.
- p99 — 99% are faster. The slowest 1%.
- p999 (three nines) — 99.9% are faster. The slowest 0.1%.
- Max — the single worst request. Usually noise but sometimes diagnostic.
A healthy-looking service might have: p50 = 40ms, p95 = 120ms, p99 = 300ms, p999 = 900ms. The tail is always longer than the median.
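For intuition about where those numbers come from, a nearest-rank percentile over a raw sample is only a few lines. This is a sketch for understanding the definition, not how you should measure in production (histograms, covered below, are the right tool there):

import java.util.Arrays;

public class Percentiles {
    // Nearest-rank percentile: q in (0, 100]; samples need not be pre-sorted.
    static double percentile(double[] samples, double q) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q / 100.0 * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        double[] ms = {40, 42, 45, 51, 60, 85, 120, 150, 300, 900};
        System.out.println(percentile(ms, 50)); // 60.0  — the typical request
        System.out.println(percentile(ms, 95)); // 900.0 — where the slow tail starts
    }
}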
Why the tail matters
Users don’t experience the average. They experience each of their individual requests. Most users make many requests — a web session might include 30 API calls.
If your service has a p99 of 300ms, the odds that at least one request in a 30-call session exceeds 300ms are about 26% (1 − 0.99³⁰ ≈ 0.26). A quarter of sessions hit the tail. If p99 is 2 seconds, a quarter of sessions experience a 2-second pause.
The tail is not an edge case. It’s what users notice.
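The arithmetic behind that ~26% figure, as a small sketch (the 30 calls per session is the assumption from above):

public class SessionTailOdds {
    public static void main(String[] args) {
        int callsPerSession = 30;
        double pUnderP99 = 0.99; // by definition, 99% of requests are faster than p99

        // Probability that at least one of the 30 calls lands in the slowest 1%
        double pAtLeastOneSlow = 1 - Math.pow(pUnderP99, callsPerSession);
        System.out.printf("%.0f%% of sessions hit the tail%n", pAtLeastOneSlow * 100); // ~26%
    }
}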
Percentiles compound
Service A calls B calls C. Each has p99 of 200ms.
For service A’s end-to-end request, the p99 is not 200ms. A waits for B, which waits for C, so their latencies add, and the chance that all three stay under their own p99 at once is only 0.99³ ≈ 97%. Rough math: A’s end-to-end p99 lands somewhere around 500-600ms, depending on the shape of the distributions.
This is why deep call graphs are dangerous. Every layer’s tail adds to every layer above. Shallow call graphs and fewer synchronous dependencies give better tail latency than any tuning can.
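A quick Monte Carlo sketch of the compounding effect. The log-normal latency distribution and its parameters are purely illustrative assumptions, not anyone’s real numbers; the point is only that the end-to-end p99 comes out well above any single service’s p99:

import java.util.Arrays;
import java.util.Random;

public class TailCompounding {
    public static void main(String[] args) {
        Random rng = new Random(42);
        int n = 200_000;
        double[] singleService = new double[n];
        double[] endToEnd = new double[n];

        for (int i = 0; i < n; i++) {
            double a = sample(rng), b = sample(rng), c = sample(rng);
            singleService[i] = a;
            endToEnd[i] = a + b + c; // A waits for B, which waits for C
        }

        System.out.printf("one service's p99:  %.0f ms%n", p99(singleService));
        System.out.printf("A's end-to-end p99: %.0f ms%n", p99(endToEnd)); // well above the single-service p99
    }

    // Synthetic log-normal latency: ~40 ms median with a long right tail (illustrative only)
    static double sample(Random rng) {
        return 40 * Math.exp(0.6 * rng.nextGaussian());
    }

    static double p99(double[] xs) {
        double[] s = xs.clone();
        Arrays.sort(s);
        return s[(int) Math.ceil(0.99 * s.length) - 1];
    }
}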
Measuring correctly
Use histograms, not averages. Prometheus/Micrometer histograms let you compute any percentile at query time:
Timer.builder("http.server.requests")
.publishPercentileHistogram()
.register(registry);publishPercentileHistogram exposes bucket data; Prometheus can compute p95/p99/p999 with histogram_quantile().
Measure at the right layer. The only latency that matters is the wall-clock time the user experiences. Server-side time is a proxy; include everything the user waits for: network, queue time, processing.
Measure per endpoint. Aggregating across all endpoints hides which ones are slow. Histogram per route.
Include failures. Requests that timed out or errored often make up the worst of the tail. Don’t exclude them unless you’re deliberately measuring the latency of successful requests only.
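A sketch of what per-endpoint, failure-inclusive timing can look like with Micrometer. The tag names (route, outcome) and the record method here are illustrative conventions, not a prescribed API; frameworks that auto-instrument http.server.requests already attach similar tags:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.time.Duration;

public class RequestTimer {
    private final MeterRegistry registry;

    RequestTimer(MeterRegistry registry) {
        this.registry = registry;
    }

    // Record one request: a histogram per route, with failures kept in via the outcome tag.
    void record(String route, String outcome, Duration elapsed) {
        Timer.builder("http.server.requests")
                .tag("route", route)      // per endpoint, so slow routes aren't averaged away
                .tag("outcome", outcome)  // "success", "error", "timeout" — don't drop failures
                .publishPercentileHistogram()
                .register(registry)       // register is idempotent per name+tags
                .record(elapsed);
    }
}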
SLO targets
SLOs should be derived from user experience. Rough guide:
- Interactive user-facing APIs (web page load, mobile app): p95 under 300ms, p99 under 800ms, p999 under 2s
- Backend service-to-service: p99 under 200ms for fast calls, under 500ms for moderate
- Batch processing: percentile matters less; throughput matters more
These are starting points; adjust based on what your users actually tolerate.
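If your registry supports it, you can also publish explicit bucket boundaries at the SLO targets, so “fraction of requests under target” can be read off directly. A Micrometer sketch, using the example thresholds from the list above:

import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;

public class SloBuckets {
    public static void main(String[] args) {
        SimpleMeterRegistry registry = new SimpleMeterRegistry();

        Timer timer = Timer.builder("http.server.requests")
                .publishPercentileHistogram()
                // Extra bucket boundaries at the SLO targets (example values from the list above).
                // In older Micrometer versions this builder method is named sla(...).
                .serviceLevelObjectives(Duration.ofMillis(300), Duration.ofMillis(800), Duration.ofSeconds(2))
                .register(registry);

        timer.record(Duration.ofMillis(120)); // counts toward the "under 300 ms" bucket
    }
}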
Common mistakes
Picking “three nines” for everything. p999 is rarely achievable cheaply; demanding it across every service multiplies cost without clear user benefit. Pick the percentile that matches user experience.
Measuring p99 on low-volume endpoints. With 100 requests, p99 is effectively the single worst request. Statistical noise. Use p95 or wait for more data.
Ignoring small percentages at scale. 100 RPS with p99 = 1s means one request per second takes longer than a second. Per hour: 3,600 slow experiences. Per day: 86,400. Small percentages × large scale = many angry users.
Optimizing averages. Teams celebrate reducing average from 100ms to 80ms while p99 gets worse. Users feel the second metric, not the first.
Reducing tail latency
Standard techniques:
- Hedging. Send a second copy of a request that’s still outstanding after the p50 latency; use whichever copy returns first (see the sketch after this list). Cuts the tail at the cost of modest extra load.
- Request duplication. For reads, some systems fire the same request at two replicas and take the first response.
- Timeouts matched to percentiles. Timeout at p99 + margin; give up on the tail, retry or serve degraded.
- Load shedding. Drop excess requests during load spikes rather than queue them.
- Reducing dependency count. Each synchronous call adds to the tail.
- Caching the slow paths. Sometimes simple.
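Here’s the hedging sketch referenced above, using CompletableFuture. The hedge delay and the call supplier are placeholders you’d wire to your own client and your observed p50; the losing request isn’t cancelled in this sketch:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class HedgedCall {
    // Fire the request; if it hasn't finished within hedgeDelayMs (e.g. your observed p50),
    // fire a second copy and complete with whichever copy returns first.
    static <T> CompletableFuture<T> hedge(Supplier<CompletableFuture<T>> call, long hedgeDelayMs) {
        CompletableFuture<T> first = call.get();

        Executor afterDelay = CompletableFuture.delayedExecutor(hedgeDelayMs, TimeUnit.MILLISECONDS);
        CompletableFuture<T> second = CompletableFuture.supplyAsync(() -> (Void) null, afterDelay)
                .thenCompose(ignored -> first.isDone() ? first : call.get());

        return first.applyToEither(second, result -> result);
    }
}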
The p99 dashboard
A useful production dashboard shows for each endpoint:
- Request rate
- Error rate
- p50 (trend — median health)
- p99 (trend — tail behavior)
- p99 dependency breakdown (which downstream is the slow one)
That’s enough to tell at a glance whether the service is healthy and where tail problems come from.
Multi-window burn rate
For SLOs, alert not on absolute threshold breaches but on rate of error budget consumption. From Google’s SRE book:
Alert when the error rate over the last hour exceeds 14.4× the error budget (the allowed error rate, 1 − SLO) AND the 5-minute rate does as well. That means budget is burning fast enough to exhaust a monthly target in about 2 days. Urgent without being noisy.
Purely threshold-based alerts (“p99 > 500ms”) fire on brief spikes; burn-rate alerts fire on sustained issues.
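The same check as plain arithmetic, in a sketch. The window error rates here are placeholder numbers; in practice they come from your metrics backend:

public class BurnRateAlert {
    public static void main(String[] args) {
        double slo = 0.999;            // 99.9% of requests within target
        double errorBudget = 1 - slo;  // 0.1% of requests may miss it

        double errorRate1h = 0.020;    // fraction of requests missing target, last hour (placeholder)
        double errorRate5m = 0.018;    // same, last five minutes (placeholder)

        double burn1h = errorRate1h / errorBudget; // 20x: a 30-day budget gone in ~1.5 days
        double burn5m = errorRate5m / errorBudget;

        // Page only when both windows burn faster than 14.4x: sustained, not a blip
        boolean page = burn1h > 14.4 && burn5m > 14.4;
        System.out.printf("burn 1h=%.1fx, 5m=%.1fx, page=%b%n", burn1h, burn5m, page);
    }
}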
Closing note
Latency percentiles look like a basic metric; using them well takes practice. Pick targets based on user experience, measure with histograms at every service boundary, alert on sustained burn rather than instantaneous breach, and never settle for averages alone. Services that nail tail latency feel fast to users; ones that only optimize the median feel slow despite impressive averages. The difference is entirely in how you measure.