Retries are the simplest resilience pattern and the one most often misused. A retry helps when the failure is transient; it hurts when the failure is systemic. This article is about telling the two apart and implementing retries that don’t turn small incidents into outages.
Retries make outages worse
The counter-intuitive failure mode: a downstream service slows from 50ms to 3s. Upstream retries twice on timeout. Each user request now creates three downstream calls. Downstream load triples at the exact moment it’s struggling. It dies. Upstream retries continue to the dead downstream until its breaker trips. Meanwhile, users see errors for 10 minutes instead of 30 seconds.
This is a retry storm. The cure: retry only what’s retriable, back off when you do, and bound the total attempts.
What’s retriable
Retriable:
- Network timeouts
- 5xx errors (specifically 502, 503, 504)
- Connection resets
- Known transient DB errors (serialization, deadlock)
- Rate limit responses with Retry-After
Not retriable:
- 4xx errors other than 429 (the request is bad; retrying doesn’t fix it)
- Non-idempotent operations that may have already succeeded
- Timeouts where the request might have side-effected
- Errors with explicit “do not retry” signals
Most retry frameworks default to retrying on any exception. That’s usually wrong. Configure explicitly which exceptions and status codes should be retried.
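A minimal sketch of that kind of explicit classification, assuming Spring’s HttpStatusCodeException hierarchy for HTTP failures; the RetryClassifier class and isRetriable helper are illustrative names, not library API:

import java.io.IOException;
import java.net.SocketTimeoutException;
import org.springframework.web.client.HttpStatusCodeException;

final class RetryClassifier {

    // Illustrative helper: decide whether a failed call is worth retrying at all.
    static boolean isRetriable(Throwable t) {
        if (t instanceof SocketTimeoutException) {
            return true;                                   // network timeout
        }
        if (t instanceof HttpStatusCodeException http) {
            int code = http.getStatusCode().value();
            // 502/503/504 and 429 are transient; every other 4xx is a bad request, not a blip
            return code == 502 || code == 503 || code == 504 || code == 429;
        }
        if (t instanceof IOException) {
            return true;                                   // connection reset and friends
        }
        return false;                                      // business exceptions: never retry
    }
}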
Exponential backoff
Retrying immediately after a failure is the worst strategy — the downstream is probably still overloaded. Exponential backoff waits progressively longer:
- 1st retry: after 100ms
- 2nd retry: after 400ms
- 3rd retry: after 1.6s
Multiplying the wait by 4 each time in this example (doubling is the other common choice). Gives the downstream time to recover.
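A sketch of that schedule, with the base delay and multiplier as the two knobs (the Backoff class and delay method are illustrative names):

import java.time.Duration;

final class Backoff {
    // Base 100ms with multiplier 4 gives 100ms, 400ms, 1.6s for retries 1..3.
    static Duration delay(int retryNumber, Duration base, double multiplier) {
        return Duration.ofMillis((long) (base.toMillis() * Math.pow(multiplier, retryNumber - 1)));
    }
}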
Jitter
Without jitter, every client retries at the same moment. Thousands of clients → coordinated wave of retries → downstream load spikes.
With jitter, each client adds randomness:
delay = base * 2^attempt * random(0.5, 1.5)

Clients retry at different moments. Downstream sees smooth rate instead of spikes.
“Full jitter” is a simpler variant — the delay is a random value between 0 and the computed max. Usually works as well in practice.
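A sketch of both variants, assuming attempt is 0-based at the first retry (ThreadLocalRandom stands in for whatever RNG the client already uses; the Jitter class name is illustrative):

import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

final class Jitter {
    // Multiplicative jitter from the formula above: base * 2^attempt * random(0.5, 1.5).
    static Duration jittered(int attempt, Duration base) {
        double factor = Math.pow(2, attempt) * ThreadLocalRandom.current().nextDouble(0.5, 1.5);
        return Duration.ofMillis((long) (base.toMillis() * factor));
    }

    // Full jitter: a uniform random delay between 0 and the computed exponential maximum.
    static Duration fullJitter(int attempt, Duration base) {
        long max = (long) (base.toMillis() * Math.pow(2, attempt));
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(0, max + 1));
    }
}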
Bounded total time
Attempts alone aren’t enough. 5 attempts with exponential backoff can mean 30+ seconds of waiting. The user is gone by then.
Bound the total retry duration:
maxAttempts = 3
maxTotalDuration = 5s

Whichever comes first, stop retrying. User gets a prompt error instead of a timeout.
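A minimal loop combining both bounds, in the spirit of the isRetriable helper sketched earlier (all names here are illustrative, not a framework API):

import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

final class BoundedRetry {
    // Stop at maxAttempts or when the total time budget would be exceeded, whichever comes first.
    static <T> T call(Supplier<T> operation, int maxAttempts, Duration maxTotalDuration)
            throws InterruptedException {
        Instant deadline = Instant.now().plus(maxTotalDuration);
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                Duration delay = Duration.ofMillis(100L * (1L << (attempt - 1)));   // 100ms, 200ms, 400ms...
                boolean budgetSpent = Instant.now().plus(delay).isAfter(deadline);
                if (attempt >= maxAttempts || budgetSpent /* or !isRetriable(e) */) {
                    throw e;                               // give the caller a prompt error
                }
                Thread.sleep(delay.toMillis());
            }
        }
    }
}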
Idempotency is required
Retries for non-idempotent operations are almost always a bug. If POST /payments charges the user, a retry might charge again.
Solutions:
- Make the endpoint idempotent via idempotency keys
- Only retry on explicit “the call never reached the server” signals (e.g. connection refused before the request was sent); in practice this is rarely a safe assumption
- Don’t retry non-idempotent operations at all
Most real systems end up with the first option. Every write endpoint accepts an idempotency key; retries reuse it.
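A client-side sketch with java.net.http, assuming a hypothetical payments endpoint and an Idempotency-Key request header (the header name is a common convention, not something every service supports):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.UUID;

final class PaymentClient {
    private final HttpClient http = HttpClient.newHttpClient();

    // One key per logical payment; every retry reuses it so the server can deduplicate.
    HttpResponse<String> charge(String paymentJson) throws Exception {
        String idempotencyKey = UUID.randomUUID().toString();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://payments.example.com/payments"))
                .header("Idempotency-Key", idempotencyKey)         // same key on attempt 1, 2, 3...
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(paymentJson))
                .build();
        // The retry loop (not shown) resends this exact request, key included.
        return http.send(request, HttpResponse.BodyHandlers.ofString());
    }
}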
Resilience4j configuration
Typical production config:
resilience4j:
  retry:
    instances:
      payments:
        max-attempts: 3
        wait-duration: 500ms
        enable-exponential-backoff: true   # the multiplier below only applies when this is enabled
        exponential-backoff-multiplier: 2
        randomized-wait-factor: 0.5
        retry-exceptions:
          - java.net.SocketTimeoutException
          - org.springframework.web.client.HttpServerErrorException
          - java.io.IOException
        ignore-exceptions:
          - com.company.payments.PaymentDeclinedException
          - org.springframework.web.client.HttpClientErrorException.BadRequest

Three attempts in total (two retries) with exponential backoff and jitter, only on transient network issues, never on business errors or 4xx.
Server-side hints
Good downstreams tell you how to retry:
- Retry-After header — “wait N seconds before retrying”. Respect it.
- 429 status — rate-limited; retry with backoff.
- 503 + Retry-After — temporary unavailability; retry.
Don’t retry faster than Retry-After asks. That’s the downstream telling you what’s safe.
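A sketch of honoring that hint with java.net.http, handling only the delta-seconds form of Retry-After (the ServerHints class name is illustrative; the HTTP-date form is left out for brevity):

import java.net.http.HttpResponse;
import java.time.Duration;

final class ServerHints {
    // Never wait less than the server asked for; otherwise fall back to our own backoff.
    static Duration retryDelay(HttpResponse<?> response, Duration computedBackoff) {
        int status = response.statusCode();
        if (status == 429 || status == 503) {
            return response.headers().firstValue("Retry-After")
                    .map(v -> Duration.ofSeconds(Long.parseLong(v.trim())))   // throws on the HTTP-date form
                    .map(hint -> hint.compareTo(computedBackoff) > 0 ? hint : computedBackoff)
                    .orElse(computedBackoff);
        }
        return computedBackoff;
    }
}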
Per-request deadlines (budget)
Instead of per-call retries + timeouts, propagate a deadline:
- Gateway sets deadline: 5 seconds
- Passes grpc-timeout or custom header to downstream
- Each layer checks remaining time before retrying
- If deadline is 200ms away, skip retry — fail now
Budget-based handling prevents the “each layer adds its own 5s retry, total wait is 25s” problem.
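A sketch of the check each layer performs, assuming the deadline arrives as an absolute timestamp (the Deadline record and allowsRetry name are illustrative; gRPC’s grpc-timeout carries a relative duration instead):

import java.time.Duration;
import java.time.Instant;

// Illustrative: the gateway creates this from its 5s budget; downstream layers
// reconstruct it from the propagated header.
record Deadline(Instant expiresAt) {
    Duration remaining() {
        return Duration.between(Instant.now(), expiresAt);
    }

    // Retry only if the backoff plus a typical call would still finish inside the budget.
    boolean allowsRetry(Duration nextBackoff, Duration expectedCallDuration) {
        return remaining().compareTo(nextBackoff.plus(expectedCallDuration)) > 0;
    }
}

At the gateway that would be new Deadline(Instant.now().plus(Duration.ofSeconds(5))); a layer with 200ms left skips the retry and fails immediately.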
Circuit breaker + retry
The combo for most real systems:
@CircuitBreaker(name = "payments")
@Retry(name = "payments")
public PaymentResult charge(...) { ... }

With Resilience4j’s default aspect order, the retry wraps the circuit breaker, so every attempt (the first call and each retry) passes through the breaker. When the circuit is closed, retries happen as usual. When it’s open, calls fail fast with CallNotPermittedException, and because that exception isn’t in the retry-exceptions list, no retries follow.
This prevents retry storms against a dead downstream — the breaker short-circuits the retries.
Don’t retry in the wrong layer
Avoid multiple layers each with their own retries:
- Client → Gateway (retries 3×)
- Gateway → Service A (retries 3×)
- Service A → Service B (retries 3×)
A single downstream failure becomes 27 attempts. Pick one layer — usually the gateway or the outermost service — to own retries. Inner layers fail fast and let the outer layer retry.
Metrics
Per retry instance:
- Attempts per successful operation (ideally 1; drift upward = downstream issues)
- Retries per second (sudden spike = incident brewing)
- Retry-exhausted count (every one is a user-visible failure)
Without these, retry behavior is invisible until it’s catastrophic.
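Resilience4j can publish these through Micrometer; a sketch assuming the resilience4j-micrometer module is on the classpath (Spring Boot’s starter wires this up automatically):

import io.github.resilience4j.micrometer.tagged.TaggedRetryMetrics;
import io.github.resilience4j.retry.RetryRegistry;
import io.micrometer.core.instrument.MeterRegistry;

final class RetryMetrics {
    // Exposes resilience4j.retry.calls counters tagged by outcome
    // (successful_without_retry, successful_with_retry, failed_with_retry, failed_without_retry),
    // which is enough to derive attempts-per-success and retry-exhausted rates.
    static void bind(RetryRegistry retries, MeterRegistry meters) {
        TaggedRetryMetrics.ofRetryRegistry(retries).bindTo(meters);
    }
}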
Closing note
A good retry strategy is unobtrusive — users never notice the transient blip, downstreams never notice the additional load. A bad retry strategy turns a 10-second downstream hiccup into an hour-long outage. The difference is discipline: only retry what’s retriable, back off with jitter, bound attempts and total duration, respect server hints, wire a circuit breaker around the whole thing. Default frameworks give you most of this; the work is configuring them deliberately instead of accepting defaults that were tuned for demo purposes.