Retries are the simplest resilience pattern and the one most often misused. A retry helps when the failure is transient; it hurts when the failure is systemic. This article is about telling the two apart and implementing retries that don’t turn small incidents into outages.
Retries make outages worse
The counter-intuitive failure mode: a downstream service slows from 50ms to 3s. Upstream retries twice on timeout. Each user request now creates three downstream calls. Downstream load triples at the exact moment it’s struggling. It dies. Upstream retries continue to the dead downstream until its breaker trips. Meanwhile, users see errors for 10 minutes instead of 30 seconds.
This is a retry storm. The cure: retry only what’s retriable, back off when you do, and bound the total attempts.
What’s retriable
Retriable:
- Network timeouts
- 5xx errors (specifically 502, 503, 504)
- Connection resets
- Known transient DB errors (serialization, deadlock)
- Rate limit responses with Retry-After
Not retriable:
- 4xx errors other than 429 (the request is bad; retrying doesn’t fix it)
- Non-idempotent operations that may have already succeeded
- Timeouts where the request might have side-effected
- Errors with explicit “do not retry” signals
Most retry frameworks default to retrying on any exception. That’s usually wrong. Configure explicitly which exceptions and status codes should be retried.
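A minimal sketch of that kind of explicit classification, assuming Spring’s HttpStatusCodeException hierarchy for HTTP failures; the RetryClassifier class and isRetriable helper are illustrative names, not library API:

import java.io.IOException;
import java.net.SocketTimeoutException;
import org.springframework.web.client.HttpStatusCodeException;

final class RetryClassifier {

    // Illustrative helper: decide whether a failed call is worth retrying at all.
    static boolean isRetriable(Throwable t) {
        if (t instanceof SocketTimeoutException) {
            return true;                                   // network timeout
        }
        if (t instanceof HttpStatusCodeException http) {
            int code = http.getStatusCode().value();
            // 502/503/504 and 429 are transient; every other 4xx is a bad request, not a blip
            return code == 502 || code == 503 || code == 504 || code == 429;
        }
        if (t instanceof IOException) {
            return true;                                   // connection reset and friends
        }
        return false;                                      // business exceptions: never retry
    }
}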
Exponential backoff
Retrying immediately after a failure is the worst strategy — the downstream is probably still overloaded. Exponential backoff waits progressively longer:
- 1st retry: after 100ms
- 2nd retry: after 400ms
- 3rd retry: after 1.6s
Multiplying the wait by 4 each time in this example (doubling is the other common choice). Gives the downstream time to recover.
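A sketch of that schedule, with the base delay and multiplier as the two knobs (the Backoff class and delay method are illustrative names):

import java.time.Duration;

final class Backoff {
    // Base 100ms with multiplier 4 gives 100ms, 400ms, 1.6s for retries 1..3.
    static Duration delay(int retryNumber, Duration base, double multiplier) {
        return Duration.ofMillis((long) (base.toMillis() * Math.pow(multiplier, retryNumber - 1)));
    }
}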
Jitter
Without jitter, every client retries at the same moment. Thousands of clients → coordinated wave of retries → downstream load spikes.
With jitter, each client adds randomness:
delay = base * 2^attempt * random(0.5, 1.5)

Clients retry at different moments. Downstream sees smooth rate instead of spikes.
“Full jitter” is a simpler variant — the delay is a random value between 0 and the computed max. Usually works as well in practice.
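A sketch of both variants, assuming attempt is 0-based at the first retry (ThreadLocalRandom stands in for whatever RNG the client already uses; the Jitter class name is illustrative):

import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

final class Jitter {
    // Multiplicative jitter from the formula above: base * 2^attempt * random(0.5, 1.5).
    static Duration jittered(int attempt, Duration base) {
        double factor = Math.pow(2, attempt) * ThreadLocalRandom.current().nextDouble(0.5, 1.5);
        return Duration.ofMillis((long) (base.toMillis() * factor));
    }

    // Full jitter: a uniform random delay between 0 and the computed exponential maximum.
    static Duration fullJitter(int attempt, Duration base) {
        long max = (long) (base.toMillis() * Math.pow(2, attempt));
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(0, max + 1));
    }
}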
Bounded total time
Attempts alone aren’t enough. 5 attempts with exponential backoff can mean 30+ seconds of waiting. The user is gone by then.
Bound the total retry duration:
maxAttempts = 3
maxTotalDuration = 5s

Whichever comes first, stop retrying. User gets a prompt error instead of a timeout.
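A minimal loop combining both bounds, in the spirit of the isRetriable helper sketched earlier (all names here are illustrative, not a framework API):

import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

final class BoundedRetry {
    // Stop at maxAttempts or when the total time budget would be exceeded, whichever comes first.
    static <T> T call(Supplier<T> operation, int maxAttempts, Duration maxTotalDuration)
            throws InterruptedException {
        Instant deadline = Instant.now().plus(maxTotalDuration);
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                Duration delay = Duration.ofMillis(100L * (1L << (attempt - 1)));   // 100ms, 200ms, 400ms...
                boolean budgetSpent = Instant.now().plus(delay).isAfter(deadline);
                if (attempt >= maxAttempts || budgetSpent /* or !isRetriable(e) */) {
                    throw e;                               // give the caller a prompt error
                }
                Thread.sleep(delay.toMillis());
            }
        }
    }
}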
Idempotency is required
Retries for non-idempotent operations are almost always a bug. If POST /payments charges the user, a retry might charge again.
Solutions:
- Make the endpoint idempotent via idempotency keys
- Only retry on explicit “the call never reached the server” signals (e.g. connection refused before the request was sent); in practice this is rarely a safe assumption
- Don’t retry non-idempotent operations at all
Most real systems end up with the first option. Every write endpoint accepts an idempotency key; retries reuse it.
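A client-side sketch with java.net.http, assuming a hypothetical payments endpoint and an Idempotency-Key request header (the header name is a common convention, not something every service supports):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.UUID;

final class PaymentClient {
    private final HttpClient http = HttpClient.newHttpClient();

    // One key per logical payment; every retry reuses it so the server can deduplicate.
    HttpResponse<String> charge(String paymentJson) throws Exception {
        String idempotencyKey = UUID.randomUUID().toString();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://payments.example.com/payments"))
                .header("Idempotency-Key", idempotencyKey)         // same key on attempt 1, 2, 3...
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(paymentJson))
                .build();
        // The retry loop (not shown) resends this exact request, key included.
        return http.send(request, HttpResponse.BodyHandlers.ofString());
    }
}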
Resilience4j configuration
Typical production config:
resilience4j:
  retry:
    instances:
      payments:
        max-attempts: 3
        wait-duration: 500ms
        enable-exponential-backoff: true   # the multiplier below only applies when this is enabled
        exponential-backoff-multiplier: 2
        randomized-wait-factor: 0.5
        retry-exceptions:
          - java.net.SocketTimeoutException
          - org.springframework.web.client.HttpServerErrorException
          - java.io.IOException
        ignore-exceptions:
          - com.company.payments.PaymentDeclinedException
          - org.springframework.web.client.HttpClientErrorException.BadRequest

Three attempts in total (two retries) with exponential backoff and jitter, only on transient network issues, never on business errors or 4xx.
Server-side hints
Good downstreams tell you how to retry:
- Retry-After header — “wait N seconds before retrying”. Respect it.
- 429 status — rate-limited; retry with backoff.
- 503 + Retry-After — temporary unavailability; retry.
Don’t retry faster than Retry-After asks. That’s the downstream telling you what’s safe.
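A sketch of honoring that hint with java.net.http, handling only the delta-seconds form of Retry-After (the ServerHints class name is illustrative; the HTTP-date form is left out for brevity):

import java.net.http.HttpResponse;
import java.time.Duration;

final class ServerHints {
    // Never wait less than the server asked for; otherwise fall back to our own backoff.
    static Duration retryDelay(HttpResponse<?> response, Duration computedBackoff) {
        int status = response.statusCode();
        if (status == 429 || status == 503) {
            return response.headers().firstValue("Retry-After")
                    .map(v -> Duration.ofSeconds(Long.parseLong(v.trim())))   // throws on the HTTP-date form
                    .map(hint -> hint.compareTo(computedBackoff) > 0 ? hint : computedBackoff)
                    .orElse(computedBackoff);
        }
        return computedBackoff;
    }
}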
Per-request deadlines (budget)
Instead of per-call retries + timeouts, propagate a deadline:
- Gateway sets deadline: 5 seconds
- Passes grpc-timeout or custom header to downstream
- Each layer checks remaining time before retrying
- If deadline is 200ms away, skip retry — fail now
Budget-based handling prevents the “each layer adds its own 5s retry, total wait is 25s” problem.
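A sketch of the check each layer performs, assuming the deadline arrives as an absolute timestamp (the Deadline record and allowsRetry name are illustrative; gRPC’s grpc-timeout carries a relative duration instead):

import java.time.Duration;
import java.time.Instant;

// Illustrative: the gateway creates this from its 5s budget; downstream layers
// reconstruct it from the propagated header.
record Deadline(Instant expiresAt) {
    Duration remaining() {
        return Duration.between(Instant.now(), expiresAt);
    }

    // Retry only if the backoff plus a typical call would still finish inside the budget.
    boolean allowsRetry(Duration nextBackoff, Duration expectedCallDuration) {
        return remaining().compareTo(nextBackoff.plus(expectedCallDuration)) > 0;
    }
}

At the gateway that would be new Deadline(Instant.now().plus(Duration.ofSeconds(5))); a layer with 200ms left skips the retry and fails immediately.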
Circuit breaker + retry
The combo for most real systems:
@CircuitBreaker(name = "payments")
@Retry(name = "payments")
public PaymentResult charge(...) { ... }

With Resilience4j’s default aspect order, the retry wraps the circuit breaker, so every attempt (the first call and each retry) passes through the breaker. When the circuit is closed, retries happen as usual. When it’s open, calls fail fast with CallNotPermittedException, and because that exception isn’t in the retry-exceptions list, no retries follow.
This prevents retry storms against a dead downstream — the breaker short-circuits the retries.
Don’t retry in the wrong layer
Avoid multiple layers each with their own retries:
- Client → Gateway (retries 3×)
- Gateway → Service A (retries 3×)
- Service A → Service B (retries 3×)
A single downstream failure becomes 27 attempts. Pick one layer — usually the gateway or the outermost service — to own retries. Inner layers fail fast and let the outer layer retry.
Metrics
Per retry instance:
- Attempts per successful operation (ideally 1; drift upward = downstream issues)
- Retries per second (sudden spike = incident brewing)
- Retry-exhausted count (every one is a user-visible failure)
Without these, retry behavior is invisible until it’s catastrophic.
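Resilience4j can publish these through Micrometer; a sketch assuming the resilience4j-micrometer module is on the classpath (Spring Boot’s starter wires this up automatically):

import io.github.resilience4j.micrometer.tagged.TaggedRetryMetrics;
import io.github.resilience4j.retry.RetryRegistry;
import io.micrometer.core.instrument.MeterRegistry;

final class RetryMetrics {
    // Exposes resilience4j.retry.calls counters tagged by outcome
    // (successful_without_retry, successful_with_retry, failed_with_retry, failed_without_retry),
    // which is enough to derive attempts-per-success and retry-exhausted rates.
    static void bind(RetryRegistry retries, MeterRegistry meters) {
        TaggedRetryMetrics.ofRetryRegistry(retries).bindTo(meters);
    }
}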
Closing note
A good retry strategy is unobtrusive — users never notice the transient blip, downstreams never notice the additional load. A bad retry strategy turns a 10-second downstream hiccup into an hour-long outage. The difference is discipline: only retry what’s retriable, back off with jitter, bound attempts and total duration, respect server hints, wire a circuit breaker around the whole thing. Default frameworks give you most of this; the work is configuring them deliberately instead of accepting defaults that were tuned for demo purposes.