Most articles about scaling talk about the end state — the elegant multi-region deployment with exactly the right caching layer. This one is about the journey. Specifically, about taking a reasonably successful Java monolith from roughly 500 requests per second at its first visible strain, through a two-year path to 25,000 requests per second as a set of microservices. The patterns that mattered, the experiments that failed, and the handful of decisions I’d make differently.
It’s a composite of real work, not a single system, but every number and every failure mode is one I’ve seen.
The starting point
A Spring Boot monolith on six application servers behind an nginx load balancer, talking to a single Postgres primary with one asynchronous read replica. Redis for sessions. RabbitMQ for a modest amount of async work. Roughly 300k lines of code. Two engineering teams, one product.
At 500 RPS everything is fine. At 1,500 RPS the symptoms start:
- p99 latency climbs from 120 ms to 800 ms
- Postgres CPU sits at 70% with frequent spikes to 95%
- The six app servers are at 30% CPU but thread pools occasionally saturate
- Every deploy takes 18 minutes and causes a noticeable blip
Nothing is on fire, but the trajectory is clear.
Stage 1 — What we did first (the cheap wins)
The temptation at this stage was to start breaking the monolith apart. We didn’t. Instead, we spent two months on things that didn’t change architecture at all, and they gave us a year of runway.
Fix the queries, not the system
A Postgres slow-query log enabled for queries over 200 ms, turned on for a week, produced a ranked list. The top 20 queries accounted for 80% of DB CPU. Most of them were missing a single index, or had an unnecessary ORDER BY in an N+1 loop, or returned 500 rows when the UI only displayed 20.
What worked:
- CREATE INDEX CONCURRENTLY on the 12 indexes the access patterns clearly wanted
- Pagination everywhere — we found an endpoint returning the user’s full order history on every page load
- Killing N+1 queries with explicit JOIN FETCH in JPA, or switching to jOOQ for a few hot paths
- Read-only transactions routed to the replica via AbstractRoutingDataSource (sketch below)
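The routing class itself is only a few lines. A minimal sketch, assuming two DataSource beans for the primary and the replica; the class name, lookup keys, and wiring are illustrative, and the LazyConnectionDataSourceProxy wrapper is there so the physical connection is fetched only after the transaction's read-only flag is known:

public class ReadWriteRoutingDataSource extends AbstractRoutingDataSource {
    @Override
    protected Object determineCurrentLookupKey() {
        // read-only @Transactional methods go to the replica, everything else to the primary
        return TransactionSynchronizationManager.isCurrentTransactionReadOnly()
            ? "replica" : "primary";
    }
}

ReadWriteRoutingDataSource routing = new ReadWriteRoutingDataSource();
Map<Object, Object> targets = new HashMap<>();
targets.put("primary", primaryDataSource);   // assumed bean for the Postgres primary
targets.put("replica", replicaDataSource);   // assumed bean for the read replica
routing.setTargetDataSources(targets);
routing.setDefaultTargetDataSource(primaryDataSource);
// defer connection acquisition until the read-only flag has been applied to the transaction
DataSource dataSource = new LazyConnectionDataSourceProxy(routing);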
Result: Postgres CPU dropped from 70% average to 25%. p99 latency halved. We’d done no architectural work.
Size Hikari correctly
The pool was set to 80 connections per instance × 6 instances = 480 potential connections. Postgres was configured for 400. Connection starvation events showed up in logs. The fix was counterintuitive: decrease the pool to 20 per instance. Fewer connections, less contention, better throughput. The DB was the bottleneck; making more clients fight for it didn’t help.
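For reference, the final pool settings looked roughly like this; a sketch in plain HikariCP API (Spring Boot users would set the same values under spring.datasource.hikari.*, and the timeout values here are illustrative):

HikariConfig config = new HikariConfig();
config.setMaximumPoolSize(20);             // 6 instances x 20 = 120 connections, well under Postgres' 400
config.setMinimumIdle(20);                 // fixed-size pool, no connection churn under load
config.setConnectionTimeout(3_000);        // fail fast instead of queueing callers indefinitely
config.setLeakDetectionThreshold(30_000);  // log any connection held longer than 30 s
HikariDataSource dataSource = new HikariDataSource(config);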
Cache the right things
Not entities. We cached response projections — the fully composed JSON that the UI actually asked for. A two-level cache, Caffeine in-process for the hot keys, Redis behind it for a shared warm layer. TTLs of 30–60 seconds, never complex invalidation.
@Service
public class ProductViewService {

    private final RedisTemplate<String, ProductView> redis;
    private final ProjectionService projectionService;

    // level 1: in-process Caffeine cache for the hottest keys
    private final Cache<String, ProductView> local =
        Caffeine.newBuilder()
            .maximumSize(20_000)
            .expireAfterWrite(Duration.ofSeconds(30))
            .recordStats()
            .build();

    public ProductViewService(RedisTemplate<String, ProductView> redis,
                              ProjectionService projectionService) {
        this.redis = redis;
        this.projectionService = projectionService;
    }

    public ProductView get(String id) {
        return local.get(id, this::fetch);
    }

    // level 2: shared Redis layer, then the projection build (DB reads) on a full miss
    private ProductView fetch(String id) {
        ProductView cached = redis.opsForValue().get("pv:" + id);
        if (cached != null) return cached;
        ProductView fresh = projectionService.build(id);
        redis.opsForValue().set("pv:" + id, fresh, Duration.ofMinutes(5));
        return fresh;
    }
}

Local cache hit rate landed at 88%. Redis caught another 11%. DB reads for the product endpoint dropped to 1% of their previous volume.
Lessons at this stage
The first wave of scaling is almost always unglamorous. Query plans, pool sizes, response caching, pagination. Doing these first buys you the time to do architectural work thoughtfully instead of in a panic.
Stage 2 — The async pivot
Stage 1 got us through an 8× traffic increase. At 4,000 RPS we hit a different wall: write amplification on sync paths. Every order placement touched 12 tables, sent 3 emails, fired 5 webhooks, wrote to 2 analytics tables. Each write in that chain had its own latency tail, and the p99 of the whole was dominated by whichever one was slowest at that moment.
Splitting the critical path from the secondary one
For every endpoint, we classified each side-effect as either:
- On the critical path — user can’t proceed without this (persist the order, charge the card)
- Side effect — needs to happen eventually, user doesn’t wait (send email, update analytics)
Everything in category 2 moved to an event queue. The pattern was the same everywhere — transactional outbox:
@Transactional
public Order placeOrder(CreateOrderRequest req) {
    Order order = orderRepo.save(Order.from(req));
    // the order row and its outbox row commit atomically in the same transaction
    outboxRepo.save(OutboxMessage.forEvent(
        "order.placed",
        order.getId().toString(),
        OrderPlacedEvent.from(order)
    ));
    return order;
}

@Scheduled(fixedDelay = 500)
@Transactional
public void drainOutbox() {
    List<OutboxMessage> batch = outboxRepo.findUnpublishedLimit(100);
    for (OutboxMessage msg : batch) {
        try {
            // block until the broker acks, so a message is never marked published prematurely
            kafka.send(msg.topic(), msg.key(), msg.payload()).get();
            outboxRepo.markPublished(msg.id());
        } catch (Exception e) {
            // leave the remaining messages unpublished; the next scheduled run retries them
            break;
        }
    }
}

Order-placement p99 dropped from 800 ms to 220 ms because the HTTP request no longer waited for the email, the analytics write, or the search index update.
Switching from RabbitMQ to Kafka
At modest scale, RabbitMQ was fine. But we were building a pattern of “publish a fact, let multiple consumers react,” and that’s Kafka’s model, not RabbitMQ’s. We migrated order/payment/shipment events to Kafka over three months. The wins:
- Each consumer group processes at its own pace
- Replayability for new consumers catching up from history
- Native support for high-fanout events without overhead per consumer
The cost:
- Operational complexity (Kafka clusters need real ops investment)
- Idempotency discipline became non-negotiable — every handler checks “have I processed this?” against a local table
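The check itself is mundane, which is the point. A minimal sketch with spring-kafka, assuming a processed_event table keyed by event id; the repository, entity, and handler names are illustrative:

@KafkaListener(topics = "order.placed")
@Transactional
public void onOrderPlaced(ConsumerRecord<String, String> record) {
    String eventId = record.key(); // or a dedicated event-id header if keys aren't unique per event
    if (processedEventRepo.existsById(eventId)) {
        return; // redelivered or replayed message: already handled, just acknowledge
    }
    applyOrderPlaced(record.value());                     // the actual side effects
    processedEventRepo.save(new ProcessedEvent(eventId)); // committed in the same transaction
}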
Lessons at this stage
Moving from synchronous chains to event-driven flows is the single biggest architectural leverage you get before true microservices. It’s also a place where you pay a permanent complexity tax: eventual consistency, idempotency, dead-letter handling. Earn it — don’t reach for events when a simple synchronous call does the job.
Stage 3 — The great unbundling
At 8,000 RPS the real constraint became the team, not the system. Two teams of 8 engineers sharing a monolith meant every deploy was a cross-team coordination exercise. The system’s performance was fine; the throughput of changes wasn’t.
We started extracting services — carefully, one at a time, with a clear rule: extract along business boundaries, never along technical ones.
The order of extraction matters
Our first attempt was to extract “the payment logic” as a service, because it was the most logically bounded module. That failed. The payment logic was called from 40 places across the codebase, with 40 different sets of context. Extracting it meant freezing design decisions too early.
What worked instead:
- Extract by team ownership first. The team that owned notifications extracted Notifications. The team that owned search extracted Search. Each team got independent deploys, without needing to touch anyone else’s code.
- Extract along Conway boundaries. The service boundary matched the team boundary. If two teams shared a service, we’d see merge conflicts; if one team owned two services, communication overhead grew.
- Strangler pattern, not big-bang rewrite. New features of the extracted domain were built in the new service. Existing features were migrated one at a time, with a feature flag switching traffic per-call.
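The per-call flag routing from the last bullet looked roughly like this; the flag service and client names are placeholders rather than our real ones:

public void sendNotification(NotificationRequest req) {
    // evaluated per call, so traffic can shift gradually (1% -> 10% -> 100%) and roll back instantly
    if (featureFlags.isEnabled("notifications.use-new-service", req.userId())) {
        notificationsClient.send(req);        // HTTP call to the extracted service
    } else {
        legacyNotificationService.send(req);  // old in-process code path
    }
}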
A representative extraction — Notifications — looked like:
Month 1: new service stood up, no traffic
Month 2: new event types published, notifications service consumes, reads old DB via JDBC
Month 3: own database, data migrated, old code calls new service via HTTP
Month 4: old code deleted, dual-write turned off
Month 5: load-tested, on-call rotation confirmed with the owning team

Five months for one extraction, done carefully, beats a three-month “migration sprint” that leaves the old code around forever.
The hard part: shared data
Every extraction has the same obstacle — the new service needs data that lived in the monolith’s DB. Three approaches, in order of preference:
- Copy-via-events. The monolith publishes events for the relevant domain; the new service subscribes and builds its own view. Eventual consistency is the price. This was our default.
- Sync RPC back to the monolith. The new service calls the monolith for data it doesn’t own. Easy to do, creates tight coupling, and should be a stepping stone, not a destination.
- Shared database. Tempting, always a trap. We ruled this out at the start.
For reference data that rarely changes (countries, product categories), we just materialized a copy in every service that needed it, refreshed daily. Simple and stable.
Lessons at this stage
The benefit of microservices is mostly about team autonomy, not performance. If you’re extracting services to make the system faster, you’ll probably be disappointed — you’ll add network hops that weren’t there before. Extract to decouple the pace of change across teams; everything else is a side effect.
Stage 4 — The highload realities
At 15,000 RPS we had maybe 20 services, a respectable Kafka footprint, and our first real highload incidents. A catalog of the ones that taught the most:
The retry storm
Payments service had a 3-second timeout, with 3 retries. Our gateway had a 5-second timeout, with 3 retries. When payments slowed from 200 ms to 2.5 s one afternoon, each request turned into up to 3 gateway attempts × 3 payments attempts = effectively 9× the load on payments. Payments dies. Gateway dies. Everything dies.
The fix was layered:
- Exponential backoff with jitter on every retry — not flat retries
- Budget-based timeouts — gateway’s 5s gets split: 2.5s for payments, 1s for orders, leaving slack
- Circuit breaker that opens after N failures — fail fast, don’t keep hammering
- Bounded retry counts per hop — 1 or 2, not 3, and never on non-idempotent operations
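In code, most of this is configuration rather than logic. A sketch of the posture with Resilience4j; the thresholds are illustrative, not our exact production values:

RetryConfig retryConfig = RetryConfig.custom()
    .maxAttempts(2)                                // the original call plus one retry
    .intervalFunction(IntervalFunction
        .ofExponentialRandomBackoff(100, 2.0))     // exponential backoff with jitter
    .build();

CircuitBreakerConfig breakerConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                      // open once half the calls in the window fail
    .slidingWindowSize(20)
    .waitDurationInOpenState(Duration.ofSeconds(10))
    .build();

TimeLimiterConfig timeLimiterConfig = TimeLimiterConfig.custom()
    .timeoutDuration(Duration.ofMillis(2_500))     // payments' share of the gateway's 5 s budget
    .build();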
The cache stampede
The product page cache had a 5-minute TTL. At peak hour, the top 10 products were on every home feed. When the TTL expired on product #1, 3,000 concurrent requests missed the cache simultaneously, all did the DB fetch, all thundered into Postgres. Postgres saturated, everything slowed, cascading timeouts followed.
Fixed with request coalescing — Caffeine’s AsyncLoadingCache ensures that only one request does the DB fetch while everyone else waits for its result:
AsyncLoadingCache<String, ProductView> cache = Caffeine.newBuilder()
    .maximumSize(20_000)
    .expireAfterWrite(Duration.ofMinutes(5))
    .buildAsync(this::loadProduct);

Plus a small random jitter on TTL, so not everything expires at the same instant.
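The jitter can live in the cache itself. A sketch using Caffeine's Expiry in place of the fixed expireAfterWrite; the 5-minute base and 30-second jitter window are illustrative:

AsyncLoadingCache<String, ProductView> cache = Caffeine.newBuilder()
    .maximumSize(20_000)
    .expireAfter(new Expiry<String, ProductView>() {
        @Override
        public long expireAfterCreate(String key, ProductView value, long currentTime) {
            long jitterMs = ThreadLocalRandom.current().nextLong(30_000);
            return TimeUnit.MILLISECONDS.toNanos(Duration.ofMinutes(5).toMillis() + jitterMs);
        }
        @Override
        public long expireAfterUpdate(String key, ProductView value, long currentTime, long currentDuration) {
            return expireAfterCreate(key, value, currentTime); // refreshed entries get a new jittered TTL
        }
        @Override
        public long expireAfterRead(String key, ProductView value, long currentTime, long currentDuration) {
            return currentDuration; // reads don't extend the TTL
        }
    })
    .buildAsync(this::loadProduct);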
The consumer lag incident
A deploy to the notifications consumer introduced a 500 ms per-message processing cost (an accidentally-enabled debug call to an external API). Peak traffic was 1,200 messages/sec; the consumer could process 100. Lag grew at ~1,100 msg/sec. Within an hour, lag was 3 million messages. Nobody had an alert on consumer lag growth rate.
Lessons that stuck:
- Alert on lag growth rate, not absolute lag. Absolute lag always exists; rapid growth is the signal.
- Know your per-message processing cost. It’s easy to benchmark (see the sketch after this list); everyone should.
- Have a runbook for “the consumer is behind.” Scale consumer replicas? Skip ahead to current offset and accept data loss? Pause producers? Decide before the incident, not during it.
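A cheap way to keep the per-message cost visible is to time the handler with Micrometer; the metric and handler names here are placeholders:

public void onMessage(ConsumerRecord<String, String> record) {
    Timer.Sample sample = Timer.start(meterRegistry);
    try {
        handle(record); // the actual per-message work
    } finally {
        // p99 of this timer is your per-message cost; compare it against incoming
        // messages per second to know how close the consumer is to falling behind
        sample.stop(meterRegistry.timer("consumer.process.time", "topic", record.topic()));
    }
}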
The deploy-at-peak disaster
One Friday at 4:45 PM, a small config change went out. Something about the way the new config was loaded caused every service to reload its connection pool, which caused a burst of connection churn to Postgres, which caused Postgres to briefly refuse connections, which caused every service to retry, which looked to monitoring like a coordinated outage. We brought everything down for 12 minutes.
From that day: no deploys in the last 2 hours of the workweek. A cultural rule enforced by the deploy pipeline itself.
Stage 5 — The things we didn’t do
Equally instructive — the architectural moves we considered and rejected:
We didn’t move to Kubernetes until year 2. Not because it’s bad, but because our VMs-with-Ansible setup was working, and the migration cost was huge. We moved when our number of services crossed ~15 and deploy variance became a real problem.
We didn’t adopt a service mesh. Istio was fashionable at the time. The problems it solved (mTLS, retries, circuit breaking) were being solved by Resilience4j and standard libraries already. The operational complexity was higher than the value.
We didn’t go reactive. We evaluated Project Reactor and WebFlux for the hot-path services. The code became harder to read, debugging was harder, and the benchmarks only showed real gains on absurdly I/O-bound services. When virtual threads landed in JDK 21, that debate ended — blocking code with virtual threads gets most of the performance with none of the complexity.
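For completeness, the virtual-thread version of "just write blocking code": Spring Boot 3.2+ exposes it as the spring.threads.virtual.enabled property, and in plain Java it is a one-line executor swap. The collection and fetch method below are placeholder names:

// JDK 21+: each submitted task runs on its own virtual thread; blocking JDBC or HTTP
// calls park the virtual thread instead of tying up an OS thread
try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
    for (String id : productIds) {
        executor.submit(() -> fetchBlocking(id)); // plain blocking code, no reactive operators
    }
} // close() waits for the submitted tasks to finish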
We didn’t shard the main Postgres. We partitioned the three biggest tables with declarative partitioning, added more read replicas, and upgraded hardware. At ~15,000 writes per second, a well-tuned single Postgres primary was still cheaper and simpler than sharding. Shard only when the single instance genuinely can’t keep up — not before.
What actually moved the needle
Ranked by impact, looking back:
- Query and index work in stage 1 — bought more time than any architectural change
- Outbox + event-driven flows — made the system tolerant of downstream slowness
- Team-aligned service extraction — unblocked parallel development
- Two-level caching with stampede protection — largest single reduction in DB load
- Budgeted timeouts and circuit breakers everywhere — prevented the worst incidents
- Canary deploys with automated rollback — prevented bad deploys from becoming outages
- SLOs and burn-rate alerting — made reliability a tracked number, not an opinion
- Virtual threads (once available) — doubled per-service capacity with zero code change
What didn’t move the needle, despite the hype:
- Custom reactive code in most services
- Service mesh
- Premature sharding
- Custom caching layers
- GraphQL (adopted then deprecated — the BFF pattern served better)
If I were starting over today
A greenfield version of this journey, with the lessons learned, would look different in several places:
- Start as a modular monolith, not a single blob. Packages by bounded context, strict dependency rules. Extraction later is straightforward.
- Outbox from day one. Even if you’re not publishing events yet, the table costs nothing and removes the biggest consistency footgun when you start.
- OpenTelemetry from day one. Retrofitting observability into a mature system is many times more expensive than building it in.
- Postgres partitioning on the biggest tables from day one. Backfilling partitions on a live 500 GB table is a project; doing it on an empty table is a line of DDL.
- JDK 21+ with virtual threads on, always. There’s no reason in 2026 to write a new Java backend without them.
- Kafka or cloud-native equivalent from day one, even if oversized for the workload. Once you’re stuck on RabbitMQ / SQS and outgrow it, migration is painful.
- CI/CD with canary and auto-rollback before the first production deploy. The pipeline is the safety net for everything else.
Checklist: the production-scaling playbook
- Slow-query log enabled, top 20 queries reviewed weekly until clean
- Indexes added via CREATE INDEX CONCURRENTLY only
- HikariCP sized to DB capacity (usually 10–20), leak detection on
- Two-level caching (Caffeine + Redis) with stampede protection
- Transactional outbox for every DB write that produces an event
- Budgeted timeouts with deadlines propagated across services
- Circuit breakers and bulkheads on every external call
- Events via Kafka with schemas in a registry
- Idempotency on every write endpoint and every event handler
- Strangler pattern for service extractions (no big-bang rewrites)
- Service boundaries aligned to team boundaries
- Canary deploys with SLO-based automated rollback
- Consumer lag growth-rate alerts in addition to absolute lag
- OpenTelemetry instrumentation from the first line of code
- JDK 21+ with virtual threads enabled
Closing thought
Scaling a Java backend isn’t a single project — it’s a sequence of stages, each with its own dominant constraint. Query plans first, async flows next, service boundaries after that, highload defenses when the scale earns them. The mistake I’ve made — and watched others make — is reaching for the late-stage tools early. Sharding before indexing. Microservices before understanding the domain. Service mesh before needing it. The systems that scale gracefully aren’t the ones that used the most advanced tools earliest; they’re the ones that solved the problem that was actually in front of them, then moved on to the next. Keep the system boring as long as possible, and only add complexity when you’ve earned the right.