Most engineers learn circuit breakers from the same diagram — closed, open, half-open — and then use them wrong in production. This article is about the second 90% — tuning, interaction with other patterns, and the failure modes nobody mentions.

The quick recap

  • Closed — normal; calls flow through
  • Open — failing; calls fail fast immediately
  • Half-open — probing; a few test calls go through to see if things recovered

The breaker trips (closed → open) when failures exceed a threshold. It waits a configured duration, then enters half-open to test. A successful probe closes the circuit; a failure re-opens it.
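
To make the mechanics concrete, here is a minimal hand-rolled sketch of that state machine in Java. Illustrative only: the class and names are made up, it uses a consecutive-failure counter instead of the failure rate over a sliding window that real implementations (like Resilience4j) use, and it ignores thread safety.

import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Illustrative only: not thread-safe, no sliding window, made-up names.
class NaiveCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    private final int failureThreshold;   // consecutive failures before tripping
    private final Duration waitInOpen;    // how long to stay open before probing

    NaiveCircuitBreaker(int failureThreshold, Duration waitInOpen) {
        this.failureThreshold = failureThreshold;
        this.waitInOpen = waitInOpen;
    }

    <T> T call(Supplier<T> downstream) {
        if (state == State.OPEN) {
            if (Duration.between(openedAt, Instant.now()).compareTo(waitInOpen) < 0) {
                throw new IllegalStateException("circuit open: failing fast");
            }
            state = State.HALF_OPEN;       // wait elapsed: this call becomes the probe
        }
        try {
            T result = downstream.get();
            consecutiveFailures = 0;
            state = State.CLOSED;          // successful call (or probe) closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;        // trip, or re-open after a failed probe
                openedAt = Instant.now();
            }
            throw e;
        }
    }
}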

Tuning parameters that matter

A baseline Resilience4j config (Spring Boot properties — reasonable starting values, not all of them library defaults):

resilience4j:
  circuitbreaker:
    instances:
      payments:
        failure-rate-threshold: 50                # % of calls that are failures to trip
        sliding-window-size: 100                  # calls in the window
        minimum-number-of-calls: 10               # min calls before the rate matters
        wait-duration-in-open-state: 60s          # how long to stay open
        permitted-number-of-calls-in-half-open-state: 5   # test calls when half-open
        slow-call-duration-threshold: 2s          # calls slower than this count as slow
        slow-call-rate-threshold: 100             # % of slow calls that trips the breaker

What each actually does:

sliding-window-size + minimum-number-of-calls. You want enough samples to be statistically meaningful. Tripping after 3 failed calls is overreactive; tripping after 10,000 is too slow. For typical service-to-service calls at moderate load: 50-100 window, 10-20 minimum.

failure-rate-threshold. 50% is a common default. For critical dependencies (payment, auth), 30% might be right — trip earlier. For non-critical (recommendations, analytics), 70% is fine — don’t be overly sensitive.

wait-duration-in-open-state. Long enough for the downstream to recover. Short enough that users don’t suffer. 30-60s is typical for transient downstream issues. For downstreams known to take longer (deploys, restarts), 2-5 minutes.

slow-call-duration-threshold. The underrated setting. A downstream that takes 10 seconds per call but “succeeds” is just as deadly as one that errors. Set the slow-call thresholds so slow calls trip the breaker too.
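
If you configure the breaker in code rather than YAML, the same knobs map onto Resilience4j's CircuitBreakerConfig builder. A sketch with the values from the config above (the instance name payments is just the earlier example):

import java.time.Duration;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                           // trip at 50% failures...
    .slidingWindowSize(100)                             // ...over the last 100 calls...
    .minimumNumberOfCalls(10)                           // ...once at least 10 calls are recorded
    .waitDurationInOpenState(Duration.ofSeconds(60))    // stay open for 60s
    .permittedNumberOfCallsInHalfOpenState(5)           // then let 5 probe calls through
    .slowCallDurationThreshold(Duration.ofSeconds(2))   // calls over 2s count as slow
    .slowCallRateThreshold(100)                         // trip when 100% of calls are slow
    .build();

CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
CircuitBreaker payments = registry.circuitBreaker("payments");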

Interaction with retries

This combination kills more services than any other:

@Retry(name = "payments")              // maxAttempts comes from the resilience4j.retry config
@CircuitBreaker(name = "payments")
public Result call(Request r) { ... }

Order matters. With the Spring annotations, the nesting is governed by Resilience4j's aspect order (by default Retry is the outermost decorator, wrapping CircuitBreaker; adjustable via the retryAspectOrder / circuitBreakerAspectOrder properties), not by the order you write the annotations; with the programmatic Decorators builder you choose the order yourself (sketched below). If retry wraps the circuit breaker, every retry attempt passes through the breaker: each failed attempt counts toward the failure rate, so flaky calls trip the breaker faster, and retrying once the breaker is open is pointless. If the circuit breaker wraps retry, the breaker sees the whole retried sequence as a single call: a request that succeeds on the third attempt is recorded as one success, so retries hide flakiness from the breaker.

Usually you want circuit breaker outside retry: the breaker sees each user request as one call. The retry only fires inside the breaker’s closed state. On breaker open, no retry happens (fail fast).
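
Programmatically the layering is explicit rather than implied. A sketch using Resilience4j's Decorators builder — each with* call wraps the previous layer, so the last one named ends up outermost; client.call(request) stands in for the downstream call, and the names are illustrative:

import java.util.function.Supplier;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

CircuitBreaker breaker = CircuitBreaker.ofDefaults("payments");
Retry retry = Retry.of("payments", RetryConfig.custom().maxAttempts(3).build());

// Retry applied first (inner), circuit breaker last (outer): the breaker sees one
// call per request, and when it is open the request fails fast with no retries.
Supplier<Result> decorated = Decorators.ofSupplier(() -> client.call(request))
    .withRetry(retry)
    .withCircuitBreaker(breaker)
    .decorate();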

Fallbacks are load-bearing

A circuit breaker without a fallback just moves the error. What the fallback does is the design question:

Return a cached value. Great when staleness is acceptable. “Last known product price” is fine for a catalog; “last known account balance” isn’t.

Return a default. return emptyList(); is safe if the caller tolerates empty. Usually a degraded UX is better than an error.

Queue for retry. Accept the operation, persist it, process later. Good for writes that aren’t on the user’s critical path.

Fail with a specific code. Let upstream decide what to do. Minimum viable fallback.

Proxy to a backup. If there’s an alternative, fall back to it. Rare but useful.

A fallback that silently returns wrong data is worse than an error. Know which you’re returning.
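
As one concrete shape, here is a cached-value fallback sketched in the Spring-annotation style. The names (catalogClient, priceCache, meterRegistry) are made up; the fallback method needs the same signature plus an exception parameter, and it should record that it ran so the degradation stays visible:

@CircuitBreaker(name = "catalog", fallbackMethod = "lastKnownPrice")
public Price price(String productId) {
    return catalogClient.price(productId);           // normal path
}

// Invoked when the breaker is open or the call fails. Returns possibly-stale
// data — acceptable for a catalog price — and counts every invocation.
private Price lastKnownPrice(String productId, Throwable cause) {
    meterRegistry.counter("fallback.catalog.price").increment();
    return priceCache.get(productId);
}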

Per-instance vs per-endpoint

A circuit breaker per dependency is the default. Sometimes more granularity helps — one breaker per endpoint of the dependency:

paymentsClient has breakers for:
  charge       (critical, tight thresholds)
  authorize    (critical, tight thresholds)
  listMethods  (non-critical, loose thresholds)

Breaking on listMethods shouldn’t block charge if charge is still healthy.

Too much granularity, and monitoring becomes a mess. Group by criticality.
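
In Resilience4j this maps onto per-endpoint breakers sharing a small number of configs — a sketch, reusing the imports from the config example above; the breaker names mirror the endpoints and are illustrative:

CircuitBreakerConfig critical = CircuitBreakerConfig.custom()
    .failureRateThreshold(30)                          // trip early for critical endpoints
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .build();

CircuitBreakerConfig nonCritical = CircuitBreakerConfig.custom()
    .failureRateThreshold(70)                          // tolerate more noise
    .waitDurationInOpenState(Duration.ofSeconds(60))
    .build();

CircuitBreakerRegistry registry = CircuitBreakerRegistry.ofDefaults();
CircuitBreaker charge      = registry.circuitBreaker("payments-charge", critical);
CircuitBreaker authorize   = registry.circuitBreaker("payments-authorize", critical);
CircuitBreaker listMethods = registry.circuitBreaker("payments-listMethods", nonCritical);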

Circuit breakers and bulkheads

Complementary. Circuit breakers prevent cascading failures from a dead dependency. Bulkheads prevent one slow dependency from starving shared resources (threads, connections):

Service calls:
  Payments (bulkhead: 30 concurrent, breaker: 50% threshold)
  Catalog  (bulkhead: 100 concurrent, breaker: 60% threshold)
  Shipping (bulkhead: 50 concurrent, breaker: 50% threshold)

When Catalog is slow, it can use up to 100 slots; the other 80 (of 180 total) are still available for Payments and Shipping. If Catalog then fails outright (say 70% of calls 5xx, above its 60% threshold), the breaker opens and Catalog calls fail fast.
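
Composing the two in Resilience4j is again a matter of decorator layering. A sketch for the Catalog client (catalogClient and CatalogPage are made up): the bulkhead caps in-flight calls, the breaker fails fast once Catalog is known-bad:

import java.util.function.Supplier;
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.decorators.Decorators;

Bulkhead catalogBulkhead = Bulkhead.of("catalog", BulkheadConfig.custom()
    .maxConcurrentCalls(100)                          // at most 100 in-flight Catalog calls
    .build());

CircuitBreaker catalogBreaker = CircuitBreaker.of("catalog", CircuitBreakerConfig.custom()
    .failureRateThreshold(60)                         // matches the 60% threshold above
    .build());

// Bulkhead inner (limits concurrency of the real call), breaker outer
// (fails fast without even requesting a bulkhead slot once the circuit is open).
Supplier<CatalogPage> decorated = Decorators.ofSupplier(() -> catalogClient.frontPage())
    .withBulkhead(catalogBulkhead)
    .withCircuitBreaker(catalogBreaker)
    .decorate();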

What goes wrong

Breaker tuned for the 99% case. A burst of 100 real failures trips the breaker for legitimate reasons. Usually fine. But a daily flaky-DNS spike during deploys trips it for no reason. Investigate the failures before tuning.

No monitoring of breaker state. The breaker tripped silently for hours and users saw errors the whole time. Dashboards should show per-breaker state transitions and open-duration histograms.

No observability on fallbacks. Fallback returns empty list; nobody notices that the “products” page now shows zero items for an hour.

Too-aggressive tuning. 3-failure trip means any network blip trips the breaker. Users see errors for 60s because of a 2s network glitch.

Metrics to alert on

Per breaker:

  • State (closed/open/half-open)
  • State transition count per hour
  • Time spent in open state
  • Failure rate when closed (leading indicator)
  • Fallback invocation count (when fallbacks ≠ 0, something is degraded)

Alerting:

  • Open state lasting > 5 minutes → page
  • Repeated transitions (flapping) → page
  • Fallback rate climbing → warn
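
Resilience4j exposes most of these metrics out of the box. A sketch of the wiring — meterRegistry is whatever Micrometer registry your metrics stack already provides, and log is assumed to be an SLF4J logger:

import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.github.resilience4j.micrometer.tagged.TaggedCircuitBreakerMetrics;
import io.micrometer.core.instrument.MeterRegistry;

// Publishes per-breaker state, failure rate, slow-call rate and call counts
// as tagged Micrometer metrics; build the dashboards and alerts on top of these.
TaggedCircuitBreakerMetrics
    .ofCircuitBreakerRegistry(circuitBreakerRegistry)
    .bindTo(meterRegistry);

// Log every state transition so "the breaker tripped silently for hours" cannot happen.
circuitBreakerRegistry.circuitBreaker("payments")
    .getEventPublisher()
    .onStateTransition(event ->
        log.warn("circuit breaker {}: {}",
            event.getCircuitBreakerName(),
            event.getStateTransition()));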

Closing note

Circuit breakers are load-bearing infrastructure disguised as a one-line annotation. Default settings work most of the time; when they don’t, the cost is either missed protection or false alarms. Measure the failure patterns of your real dependencies before tuning. Wire fallbacks that degrade intentionally. Monitor the breakers themselves — not just the calls they wrap. Do all three and the pattern becomes invisible until it saves you.