Architecture diagrams in books are always tidy. Real systems look messier — compromises, half-migrations, workarounds that became permanent. This article walks through five case studies from fintech, banking, and consumer-grade highload systems. Each one is a composite of real projects (names anonymized), chosen because the engineering trade-off is one I’ve seen teams genuinely wrestle with. Every case includes the code or config that actually solved it.

Case 1 — Picking between Zuul, Spring Cloud Gateway, and Envoy

The setting

A fintech with 18 services, growing to 30, serving both a mobile app and a web console. They needed one thing in front: auth, rate limiting, routing, request shaping. Three candidates were on the table, each with real advocates on the team.

The options

Netflix Zuul 1.x — the historical choice. Blocking I/O model, one thread per connection. Easy to extend with Java filters. Deprecated for new development since Spring Cloud dropped Zuul in favor of Gateway.

Spring Cloud Gateway — Spring’s replacement. Reactive, non-blocking, built on Project Reactor and Netty. Native integration with the Spring ecosystem (service discovery, circuit breakers, metrics).

Envoy — C++, runs as a sidecar or edge proxy. Configuration via xDS API. Excellent performance, used by huge infrastructures. Not Java, not Spring, but integrates with anything.

How they decided

A weighted comparison, scored on criteria the team actually cared about:

Criterion                           | Zuul 1.x | Spring Cloud Gateway | Envoy
------------------------------------|----------|----------------------|--------
Throughput per instance             | Low      | High                 | Highest
In-house Java skill reuse           | High     | High                 | Low
Integration with existing stack     | Fair     | Excellent            | Fair
Operational complexity              | Low      | Low                  | Medium
Active development / future-proof   | Dead     | Active               | Active
Custom filter complexity            | Simple   | Moderate (reactive)  | Complex

The team picked Spring Cloud Gateway. Reasons: existing Spring expertise meant custom filters were a day of work instead of a week; throughput was an order of magnitude better than Zuul; Envoy’s benefits weren’t worth leaving the JVM for a 30-service system.

What they actually shipped

Gateway config, pulling it all together — rate limiting per API key, circuit breaker with fallback, and JWT validation via a custom filter:

import org.springframework.cloud.gateway.filter.ratelimit.KeyResolver;
import org.springframework.cloud.gateway.filter.ratelimit.RedisRateLimiter;
import org.springframework.cloud.gateway.route.RouteLocator;
import org.springframework.cloud.gateway.route.builder.RouteLocatorBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import reactor.core.publisher.Mono;

@Configuration
public class GatewayConfig {

    @Bean
    RouteLocator routes(RouteLocatorBuilder builder,
                        RedisRateLimiter rateLimiter,
                        KeyResolver apiKeyResolver,
                        JwtAuthFilter jwt) {
        return builder.routes()
            .route("accounts", r -> r.path("/api/accounts/**")
                .filters(f -> f
                    .stripPrefix(1)
                    .filter(jwt)
                    .requestRateLimiter(c -> c
                        .setRateLimiter(rateLimiter)
                        .setKeyResolver(apiKeyResolver))
                    .circuitBreaker(c -> c
                        .setName("accounts")
                        .setFallbackUri("forward:/fallback/accounts")))
                .uri("lb://accounts-service"))
            .route("payments", r -> r.path("/api/payments/**")
                .filters(f -> f
                    .stripPrefix(1)
                    .filter(jwt)
                    .circuitBreaker(c -> c
                        .setName("payments")
                        .setFallbackUri("forward:/fallback/payments")))
                .uri("lb://payments-service"))
            .build();
    }

    @Bean
    RedisRateLimiter rateLimiter() {
        return new RedisRateLimiter(100, 200); // 100 rps steady state, bursts up to 200
    }

    @Bean
    KeyResolver apiKeyResolver() {
        // justOrEmpty: a request without X-API-Key resolves to an empty key,
        // which the rate limiter rejects by default (denyEmptyKey = true)
        return exchange -> Mono.justOrEmpty(
            exchange.getRequest().getHeaders().getFirst("X-API-Key"));
    }
}
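
The config references a custom JwtAuthFilter that the case doesn't show. A minimal sketch of what such a filter can look like, assuming Spring Security's JwtDecoder for token validation (everything beyond the class name is illustrative):

import org.springframework.cloud.gateway.filter.GatewayFilter;
import org.springframework.cloud.gateway.filter.GatewayFilterChain;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpStatus;
import org.springframework.security.oauth2.jwt.JwtDecoder;
import org.springframework.security.oauth2.jwt.JwtException;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

@Component
public class JwtAuthFilter implements GatewayFilter {
    private final JwtDecoder jwtDecoder; // e.g. NimbusJwtDecoder pointed at the IdP's JWKS

    public JwtAuthFilter(JwtDecoder jwtDecoder) {
        this.jwtDecoder = jwtDecoder;
    }

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        String auth = exchange.getRequest().getHeaders().getFirst(HttpHeaders.AUTHORIZATION);
        if (auth == null || !auth.startsWith("Bearer ")) {
            return reject(exchange);
        }
        try {
            jwtDecoder.decode(auth.substring(7)); // throws JwtException if invalid or expired
        } catch (JwtException e) {
            return reject(exchange);
        }
        return chain.filter(exchange); // token OK, pass through to the route
    }

    private Mono<Void> reject(ServerWebExchange exchange) {
        exchange.getResponse().setStatusCode(HttpStatus.UNAUTHORIZED);
        return exchange.getResponse().setComplete();
    }
}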

Lesson

The “right” gateway is the one that fits your team’s existing skills and scale, not the one with the most benchmarks. At 30 services, Spring Cloud Gateway is fine. Past a few hundred services with polyglot backends, Envoy starts earning its complexity.

Case 2 — A digital bank’s payment fan-out

The setting

A retail bank’s payment-initiation service. One user action (confirm payment) had to trigger: the actual payment rails, fraud check, SMS notification, push notification, ledger write, regulatory reporting, BI event. Seven downstreams. If any was slow, the whole payment UX felt slow.

The wrong first attempt

The original code called all seven sequentially inside the HTTP handler. The p99 latency was 3.2 seconds — each call added its own tail. On slow days, the timeout on the mobile app expired before the last step completed, leading to “ghost payments” where the money moved but the user saw failure.

What actually worked

Split the seven into two classes:

  • On the critical path (must succeed before responding): payment rails + fraud check + ledger write
  • Fire-and-forget (user doesn’t wait): SMS, push, BI, regulatory

The critical path runs synchronously, in parallel where dependencies allow (here the fraud check gates the rails call, so the steps are sequential). The rest is written to a transactional outbox as a PaymentConfirmed event and handled by separate consumers.

@Service
public class PaymentService {
    // Constructor injection omitted for brevity
    private final PaymentRailsClient rails;
    private final FraudClient fraud;
    private final LedgerRepository ledger;
    private final OutboxRepository outbox;
    private final Clock clock;

    @Transactional
    public PaymentResult confirm(ConfirmRequest req) {
        // Fraud must pass before money moves, so these calls stay sequential
        FraudCheckResult fraudCheck = fraud.check(req);
        if (!fraudCheck.approved()) {
            return PaymentResult.rejected(fraudCheck.reason());
        }

        RailsResult result = rails.execute(req.toRailsRequest());
        if (!result.success()) {
            return PaymentResult.failed(result.reason());
        }

        LedgerEntry entry = ledger.save(new LedgerEntry(
            req.paymentId(), req.fromAccount(), req.toAccount(),
            req.amount(), clock.instant()));

        // The outbox row commits in the same transaction as the ledger write,
        // so the event is published if and only if the ledger entry exists
        outbox.save(OutboxMessage.forEvent(
            "payment.confirmed",
            req.paymentId(),
            new PaymentConfirmedEvent(req, entry.getId(), clock.instant())));

        return PaymentResult.ok(entry.getId());
    }
}

The consumers on the other end of payment.confirmed:

@Component
public class SmsOnPaymentConfirmed {
    private final SmsService smsService; // constructor injection omitted

    @KafkaListener(topics = "payment.confirmed", groupId = "sms")
    public void send(PaymentConfirmedEvent e) {
        smsService.sendReceipt(e.customerPhone(), e.amount(), e.currency());
    }
}

@Component
public class BiOnPaymentConfirmed {
    private final BiPipeline biPipeline; // constructor injection omitted

    @KafkaListener(topics = "payment.confirmed", groupId = "bi")
    public void record(PaymentConfirmedEvent e) {
        biPipeline.push("payments_confirmed", e.toBiRow());
    }
}
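
The snippets above leave one piece implicit: how outbox rows actually reach Kafka. A minimal polling relay as a sketch, with hypothetical repository and message methods (Spring Kafka 3.x assumed, where send() returns a CompletableFuture; CDC via Debezium is the usual alternative to polling):

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class OutboxRelay {
    // Constructor injection omitted for brevity
    private final OutboxRepository outbox;
    private final KafkaTemplate<String, Object> kafka;

    @Scheduled(fixedDelay = 500)
    public void publishPending() {
        // Oldest-first preserves per-payment ordering; keying by aggregate id
        // keeps all events for one payment on the same partition
        for (OutboxMessage msg : outbox.findUnsentOldestFirst(100)) {
            kafka.send(msg.topic(), msg.aggregateId(), msg.payload())
                 .join(); // wait for the broker ack before marking the row sent
            outbox.markSent(msg.id());
        }
    }
}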

The numbers

  • p99 before: 3,200 ms
  • p99 after: 640 ms
  • Critical-path failure modes reduced from seven to three
  • “Ghost payment” incidents eliminated — ledger and rails either both succeeded or both didn’t

Lesson

In a regulated domain, deciding what must be on the critical path is a product conversation, not just an engineering one. “Is the SMS late by 2 minutes a problem?” The team had to ask product and compliance. The answer was “no, as long as the ledger is correct and the SMS eventually arrives.” That unlocked the architecture.

Case 3 — Distributed transactions at a clearing house

The setting

A clearing-house system had to coordinate settlement across three internal services (positions, limits, reporting) plus an external counterparty API. All four had to reflect the same trade or none of them did. XA transactions weren’t an option — one of the parties was external. What actually worked?

The approach: orchestrated saga with explicit state

A saga coordinator stores state in its own DB. Each step either commits locally or is undone by a compensating action. State transitions are logged, queryable, and restartable after crashes.

Saga state table:

CREATE TABLE saga_instance (
    id            UUID PRIMARY KEY,
    trade_id      TEXT NOT NULL,
    state         TEXT NOT NULL,
    last_error    TEXT,
    created_at    TIMESTAMPTZ NOT NULL,
    updated_at    TIMESTAMPTZ NOT NULL
);

-- Partial index covers only in-flight sagas (both terminal states excluded)
CREATE INDEX idx_saga_state ON saga_instance (state)
    WHERE state NOT IN ('COMPLETED', 'COMPENSATED');

Saga coordinator code (simplified):

@Service
public class SettlementSaga {
    // Constructor injection omitted for brevity
    private final PositionsClient positions;
    private final LimitsClient limits;
    private final ReportingClient reporting;
    private final CounterpartyClient counterparty;
    private final SagaRepository sagaRepo;

    // Deliberately NOT @Transactional: each transitionTo() commits in its own
    // short transaction, so a crashed coordinator resumes from the last
    // persisted state instead of losing the whole saga to a rollback.
    public SagaResult settle(Trade trade) {
        SagaInstance saga = sagaRepo.create(trade.id());
        String positionId = null, limitHold = null;

        try {
            saga.transitionTo(State.RESERVING_LIMIT);
            limitHold = limits.reserve(trade.account(), trade.amount());

            saga.transitionTo(State.BOOKING_POSITION);
            positionId = positions.book(trade);

            saga.transitionTo(State.CONFIRMING_COUNTERPARTY);
            counterparty.confirm(trade.externalRef());

            saga.transitionTo(State.REPORTING);
            reporting.submit(trade);

            saga.transitionTo(State.COMPLETED);
            return SagaResult.completed();

        } catch (CounterpartyRejectedException e) {
            // Undo the steps that succeeded, in reverse order
            saga.transitionTo(State.COMPENSATING, e.getMessage());
            positions.unbook(positionId);
            limits.release(limitHold);
            saga.transitionTo(State.COMPENSATED);
            return SagaResult.rejected(e.getMessage());

        } catch (ReportingFailedException e) {
            // Reporting failure is a soft failure — retry asynchronously
            saga.transitionTo(State.REPORTING_PENDING_RETRY, e.getMessage());
            return SagaResult.completedWithDelayedReporting();
        }
    }
}

A cron job retries any saga stuck in REPORTING_PENDING_RETRY with exponential backoff.
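
A sketch of what that job can look like. The attempts counter, recordAttempt(), and findByState() are assumptions not shown in the case; the schema above would need an attempts column to support the backoff:

import java.time.Duration;
import java.time.Instant;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ReportingRetryJob {
    // Constructor injection omitted for brevity
    private final SagaRepository sagaRepo;
    private final ReportingClient reporting;

    @Scheduled(fixedDelay = 60_000) // scan every minute
    public void retryPendingReporting() {
        for (SagaInstance saga : sagaRepo.findByState(State.REPORTING_PENDING_RETRY)) {
            // Exponential backoff from the attempt count: 1, 2, 4 ... capped at 64 minutes
            Duration backoff = Duration.ofMinutes(1L << Math.min(saga.attempts(), 6));
            if (saga.updatedAt().plus(backoff).isAfter(Instant.now())) {
                continue; // not due for retry yet
            }
            try {
                reporting.submit(saga.trade());
                saga.transitionTo(State.COMPLETED);
            } catch (ReportingFailedException e) {
                saga.recordAttempt(e.getMessage()); // bumps attempts and updated_at
            }
        }
    }
}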

The operational payoff

The fact that saga state is in a queryable table means:

  • A stuck saga is visible in one SQL query (see the sketch after this list)
  • A crashed coordinator resumes by reading state and picking up where it left off
  • Compliance can audit a specific trade’s saga without tracing through logs
  • The DLQ for manual intervention is a table, not a mystery
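
For example, the one-query view of stuck sagas against the table above, a sketch that treats anything untouched for ten minutes as stuck:

-- All in-flight sagas that haven't moved in ten minutes, oldest first
SELECT id, trade_id, state, last_error, updated_at
FROM saga_instance
WHERE state NOT IN ('COMPLETED', 'COMPENSATED')
  AND updated_at < now() - interval '10 minutes'
ORDER BY updated_at;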

Lesson

For multi-step workflows with regulatory weight, saga state as an explicit, queryable artifact is the feature that makes the operation sustainable. Implicit state hidden in code and logs becomes a black box you can’t explain in an audit.

Case 4 — Migrating a monolithic fraud engine without downtime

The setting

A consumer fintech had a 200k-line fraud engine embedded in their monolith. It scored every transaction in real time. The team wanted to extract it — faster iteration on rules, independent scaling, separate data store — without any measurable impact on the existing flow.

The strangler migration

They used a dual-scoring pattern:

  1. Phase 1 (shadow mode): the extracted service runs alongside the old engine. Every transaction is scored by both. Results from the old engine are used; results from the new engine are logged and compared. The team tunes the new engine based on divergence.

  2. Phase 2 (canary mode): a small percentage of transactions (1%, then 5%, 10%) use the new engine’s result as the decision. Error rates and false-positive rates are watched closely.

  3. Phase 3 (cut-over): 100% of decisions come from the new engine. The old one still runs for 4 weeks as a sanity net, flagged if it disagrees.

  4. Phase 4 (retire): old engine code deleted.

The dual-scoring filter, a few lines:

@Component
public class FraudDecisionFilter {
    // Constructor injection omitted for brevity
    private final LegacyFraudEngine legacy;
    private final ModernFraudService modern;
    private final FraudRollout rollout;
    private final DivergenceLogger divergence;

    public FraudDecision decide(Transaction tx) {
        // Legacy always scores: it decides in shadow and canary mode, and
        // serves as the silent sanity net after cut-over
        FraudDecision legacyResult = legacy.score(tx);

        if (rollout.modernEnabled(tx.id())) {
            FraudDecision modernResult = modern.score(tx);
            if (!legacyResult.equals(modernResult)) {
                divergence.record(tx, legacyResult, modernResult);
            }
            // Whether the modern result becomes the decision is a separate
            // flag from whether the modern engine runs at all
            return rollout.useModern(tx.id()) ? modernResult : legacyResult;
        }

        return legacyResult;
    }
}

rollout.modernEnabled and rollout.useModern were feature flags hitting a config service. The migration was reversible at any step — one flag flip and traffic was back on the old engine.
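
The case doesn't show FraudRollout's internals; one plausible sketch uses percentage flags from the config service with stable hashing, so a transaction keeps its bucket as the rollout ramps (the ConfigClient and flag keys are hypothetical):

import org.springframework.stereotype.Component;

@Component
public class FraudRollout {
    // Constructor injection omitted for brevity
    private final ConfigClient config;

    // Should the modern engine score this transaction at all (shadow or canary)?
    public boolean modernEnabled(String txId) {
        return bucket(txId) < config.getInt("fraud.modern.score-percent", 0);
    }

    // Should the modern engine's result be the actual decision?
    public boolean useModern(String txId) {
        return bucket(txId) < config.getInt("fraud.modern.decide-percent", 0);
    }

    // Stable 0-99 bucket: the same transaction id always lands in the same
    // bucket, so ramping 1% -> 5% -> 10% only ever adds transactions
    private int bucket(String txId) {
        return Math.floorMod(txId.hashCode(), 100);
    }
}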

Why it worked

Three boring things:

  1. Dual-scoring in shadow mode before any cut-over. The team had weeks of divergence data to debug before the new engine made real decisions.
  2. Feature-flag-driven rollout, not deploys. Reverting was a config change, not a deploy. This matters at 3 AM.
  3. Hard deletion only after 4 weeks of 100% traffic on the new engine. The old engine as a silent sanity check caught two bugs in those four weeks.

Lesson

The strangler pattern works. The part people skip is the silent verification phase, which is the part that actually makes it safe. Don’t cut over to the new service the moment tests pass — run both, compare, then cut.

Case 5 — When the “monolith” was actually the right answer

The setting

A B2B SaaS team of 12 engineers, 6 months into building the product. Management pushed for microservices because “that’s how modern products are built.” The team dutifully planned an 8-service architecture and started extracting.

Three months in, shipping velocity had dropped by half. Everyone was doing cross-service work, everyone was debugging inter-service timeouts, no one was shipping features. They paused and made a different call.

The re-decision

They merged the services back into one codebase, but structured it as a modular monolith. One deployable. Strict internal boundaries between modules. Each module had its own package, its own tests, its own owner, but shared process and DB.

com.company.app/
├── payments/
│   ├── api/            (public: only API types)
│   ├── domain/         (internal)
│   ├── persistence/    (internal)
│   └── PaymentsFacade  (the only class other modules may call)
├── accounts/
│   ├── api/
│   ├── domain/
│   ├── persistence/
│   └── AccountsFacade
├── notifications/
│   └── ...
└── common/
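
What a facade boundary looks like in code, as a sketch with hypothetical names: parameters and return values come only from the module's api package, never from domain or persistence:

import org.springframework.stereotype.Component;

@Component
public class PaymentsFacade {
    // Internal domain service; never exposed outside the module
    private final PaymentProcessor processor;

    public PaymentsFacade(PaymentProcessor processor) {
        this.processor = processor;
    }

    // Accepts and returns api/ types only, so other modules never
    // compile against payments internals
    public PaymentReceipt initiate(PaymentRequest request) {
        return processor.process(request.toCommand()).toReceipt();
    }
}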

An ArchUnit test enforces that modules only talk to each other via facades:

import com.tngtech.archunit.junit.AnalyzeClasses;
import com.tngtech.archunit.junit.ArchTest;
import com.tngtech.archunit.lang.ArchRule;

import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.classes;

@AnalyzeClasses(packages = "com.company.app")
class ArchitectureTest {
    // The internal packages of the payments module may only be touched from
    // inside payments; everyone else must go through PaymentsFacade and the
    // types in payments.api
    @ArchTest
    static final ArchRule payments_internals_isolated =
        classes().that().resideInAnyPackage(
                "com.company.app.payments.domain..",
                "com.company.app.payments.persistence..")
            .should().onlyBeAccessed().byAnyPackage("com.company.app.payments..");
}

Break the rule in a PR, CI fails. Over six months, the modules became cleanly separated — but in the same process, with one database.

What they got back

  • Shipping velocity returned to pre-split levels within a month
  • Deploy time: under 4 minutes end-to-end
  • Debugging: one process, one log stream, one trace
  • Testing: integration tests just called facades; no service startup dance
  • On-call: one rotation, one paging channel

They still have a path to microservices — the module boundaries are clean enough that extracting one is straightforward when the team grows. The rule became: extract a module to a service when there’s a team-level reason, not an architectural fashion reason.

Lesson

Microservices solve team-scaling problems. At 12 engineers, you don’t have that problem. A modular monolith gives you almost all the benefits of clean boundaries without any of the distribution tax. Wait until the team size or independent-scaling needs justify the split — and make the modular monolith easy to split when that day comes.

Patterns that showed up across all five cases

Even though the domains are different, the same handful of moves kept appearing:

  1. Event-driven side effects via outbox. The moment the system needed to do more than one thing in response to a user action, the outbox pattern showed up.
  2. Feature flags as the real deployment mechanism. Every risky change was decoupled from deploy via a flag, so rollback was instant.
  3. Strangler migrations, never big-bang rewrites. Every extraction ran alongside the original for weeks before cut-over.
  4. Saga state as a queryable table. Every distributed workflow had state stored somewhere you could SQL.
  5. “How fast can we revert?” as the primary resilience question. Every decision was evaluated on how quickly it could be undone.
  6. Synchronous critical path, asynchronous everything else. The most impactful architectural move in almost every case.

Checklist: applying these lessons

  • Gateway options are scored against your actual scale and team, not raw benchmarks
  • Critical path and fire-and-forget side effects are explicitly classified per endpoint
  • Multi-step distributed workflows have saga state stored in a queryable table
  • Strangler migrations run dual-mode with divergence logging before cut-over
  • Rollouts gated by feature flags, not by deploys
  • Modular monolith is an option, not a failure, when team size doesn’t justify services
  • Every risky change can be reverted in under 60 seconds
  • Outbox pattern used for every DB write that produces downstream effects
  • Integration tests exercise facades / API contracts, not internal classes
  • Architectural boundaries enforced by CI (ArchUnit, dependency rules)

Closing thought

The most useful thing about real cases isn’t the specific solution — it’s seeing that the team had to choose. Every one of these stories had three viable options. The winning option wasn’t always the most technically elegant; it was the one that matched the team’s size, the domain’s constraints, and the operational reality of running it at 3 AM. Architecture is a series of trade-offs, not a menu of best practices. The teams that get this right are the ones willing to choose the simpler option when it fits, and the more complex one only when the complexity is earned.