Most articles about scaling talk about the end state — the elegant multi-region deployment with exactly the right caching layer. This one is about the journey. Specifically, about taking a reasonably successful Java monolith from roughly 500 requests per second at its first visible strain, through a two-year path to 25,000 requests per second as a set of microservices. The patterns that mattered, the experiments that failed, and the handful of decisions I’d make differently.
It’s a composite of real work, not a single system, but every number and every failure mode is one I’ve seen.
The starting point
A Spring Boot monolith on six application servers behind an nginx load balancer, talking to a single Postgres primary with one asynchronous read replica. Redis for sessions. RabbitMQ for a modest amount of async work. Roughly 300k lines of code. Two engineering teams, one product.
At 500 RPS everything is fine. At 1,500 RPS the symptoms start:
- p99 latency climbs from 120 ms to 800 ms
- Postgres CPU sits at 70% with frequent spikes to 95%
- The six app servers are at 30% CPU but thread pools occasionally saturate
- Every deploy takes 18 minutes and causes a noticeable blip
Nothing is on fire, but the trajectory is clear.
Stage 1 — What we did first (the cheap wins)
The temptation at this stage was to start breaking the monolith apart. We didn’t. Instead, we spent two months on things that didn’t change architecture at all, and they gave us a year of runway.
Fix the queries, not the system
A Postgres slow-query log enabled for queries over 200 ms, turned on for a week, produced a ranked list. The top 20 queries accounted for 80% of DB CPU. Most of them were missing a single index, or had an unnecessary ORDER BY in an N+1 loop, or returned 500 rows when the UI only displayed 20.
What worked:
- CREATE INDEX CONCURRENTLY on the 12 indexes the access patterns clearly wanted
- Pagination everywhere — we found an endpoint returning the user’s full order history on every page load
- Killing N+1 queries with explicit JOIN FETCH in JPA, or switching to jOOQ for a few hot paths
- Read-only transactions routed to the replica via AbstractRoutingDataSource (sketch below)
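The routing class itself is only a few lines. A minimal sketch, assuming two DataSource beans for the primary and the replica; the class name, lookup keys, and wiring are illustrative, and the LazyConnectionDataSourceProxy wrapper is there so the physical connection is fetched only after the transaction's read-only flag is known:

public class ReadWriteRoutingDataSource extends AbstractRoutingDataSource {
    @Override
    protected Object determineCurrentLookupKey() {
        // read-only @Transactional methods go to the replica, everything else to the primary
        return TransactionSynchronizationManager.isCurrentTransactionReadOnly()
            ? "replica" : "primary";
    }
}

ReadWriteRoutingDataSource routing = new ReadWriteRoutingDataSource();
Map<Object, Object> targets = new HashMap<>();
targets.put("primary", primaryDataSource);   // assumed bean for the Postgres primary
targets.put("replica", replicaDataSource);   // assumed bean for the read replica
routing.setTargetDataSources(targets);
routing.setDefaultTargetDataSource(primaryDataSource);
// defer connection acquisition until the read-only flag has been applied to the transaction
DataSource dataSource = new LazyConnectionDataSourceProxy(routing);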
Result: Postgres CPU dropped from 70% average to 25%. p99 latency halved. We’d done no architectural work.
Size Hikari correctly
The pool was set to 80 connections per instance × 6 instances = 480 potential connections. Postgres was configured for 400. Connection starvation events showed up in logs. The fix was counterintuitive: decrease the pool to 20 per instance. Fewer connections, less contention, better throughput. The DB was the bottleneck; making more clients fight for it didn’t help.
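For reference, the final pool settings looked roughly like this; a sketch in plain HikariCP API (Spring Boot users would set the same values under spring.datasource.hikari.*, and the timeout values here are illustrative):

HikariConfig config = new HikariConfig();
config.setMaximumPoolSize(20);             // 6 instances x 20 = 120 connections, well under Postgres' 400
config.setMinimumIdle(20);                 // fixed-size pool, no connection churn under load
config.setConnectionTimeout(3_000);        // fail fast instead of queueing callers indefinitely
config.setLeakDetectionThreshold(30_000);  // log any connection held longer than 30 s
HikariDataSource dataSource = new HikariDataSource(config);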
Cache the right things
Not entities. We cached response projections — the fully composed JSON that the UI actually asked for. A two-level cache, Caffeine in-process for the hot keys, Redis behind it for a shared warm layer. TTLs of 30–60 seconds, never complex invalidation.
@Service
public class ProductViewService {

    private final RedisTemplate<String, ProductView> redis;
    private final ProjectionService projectionService;

    // level 1: in-process Caffeine cache for the hottest keys
    private final Cache<String, ProductView> local =
        Caffeine.newBuilder()
            .maximumSize(20_000)
            .expireAfterWrite(Duration.ofSeconds(30))
            .recordStats()
            .build();

    public ProductViewService(RedisTemplate<String, ProductView> redis,
                              ProjectionService projectionService) {
        this.redis = redis;
        this.projectionService = projectionService;
    }

    public ProductView get(String id) {
        return local.get(id, this::fetch);
    }

    // level 2: shared Redis layer, then the projection build (DB reads) on a full miss
    private ProductView fetch(String id) {
        ProductView cached = redis.opsForValue().get("pv:" + id);
        if (cached != null) return cached;
        ProductView fresh = projectionService.build(id);
        redis.opsForValue().set("pv:" + id, fresh, Duration.ofMinutes(5));
        return fresh;
    }
}

Local cache hit rate landed at 88%. Redis caught another 11%. DB reads for the product endpoint dropped to 1% of their previous volume.
Lessons at this stage
The first wave of scaling is almost always unglamorous. Query plans, pool sizes, response caching, pagination. Doing these first buys you the time to do architectural work thoughtfully instead of in a panic.
Stage 2 — The async pivot
Stage 1 got us through an 8× traffic increase. At 4,000 RPS we hit a different wall: write amplification on sync paths. Every order placement touched 12 tables, sent 3 emails, fired 5 webhooks, wrote to 2 analytics tables. Each write in that chain had its own latency tail, and the p99 of the whole was dominated by whichever one was slowest at that moment.
Splitting the critical path from the secondary one
For every endpoint, we classified each side-effect as either:
- On the critical path — user can’t proceed without this (persist the order, charge the card)
- Side effect — needs to happen eventually, user doesn’t wait (send email, update analytics)
Everything in category 2 moved to an event queue. The pattern was the same everywhere — transactional outbox:
@Transactional
public Order placeOrder(CreateOrderRequest req) {
    Order order = orderRepo.save(Order.from(req));
    // the order row and its outbox row commit atomically in the same transaction
    outboxRepo.save(OutboxMessage.forEvent(
        "order.placed",
        order.getId().toString(),
        OrderPlacedEvent.from(order)
    ));
    return order;
}

@Scheduled(fixedDelay = 500)
@Transactional
public void drainOutbox() {
    List<OutboxMessage> batch = outboxRepo.findUnpublishedLimit(100);
    for (OutboxMessage msg : batch) {
        try {
            // block until the broker acks, so a message is never marked published prematurely
            kafka.send(msg.topic(), msg.key(), msg.payload()).get();
            outboxRepo.markPublished(msg.id());
        } catch (Exception e) {
            // leave the remaining messages unpublished; the next scheduled run retries them
            break;
        }
    }
}

Order-placement p99 dropped from 800 ms to 220 ms because the HTTP request no longer waited for the email, the analytics write, or the search index update.
Switching from RabbitMQ to Kafka
At modest scale, RabbitMQ was fine. But we were building a pattern of “publish a fact, let multiple consumers react,” and that’s Kafka’s model, not RabbitMQ’s. We migrated order/payment/shipment events to Kafka over three months. The wins:
- Each consumer group processes at its own pace
- Replayability for new consumers catching up from history
- Native support for high-fanout events without overhead per consumer
The cost:
- Operational complexity (Kafka clusters need real ops investment)
- Idempotency discipline became non-negotiable — every handler checks “have I processed this?” against a local table
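The check itself is mundane, which is the point. A minimal sketch with spring-kafka, assuming a processed_event table keyed by event id; the repository, entity, and handler names are illustrative:

@KafkaListener(topics = "order.placed")
@Transactional
public void onOrderPlaced(ConsumerRecord<String, String> record) {
    String eventId = record.key(); // or a dedicated event-id header if keys aren't unique per event
    if (processedEventRepo.existsById(eventId)) {
        return; // redelivered or replayed message: already handled, just acknowledge
    }
    applyOrderPlaced(record.value());                     // the actual side effects
    processedEventRepo.save(new ProcessedEvent(eventId)); // committed in the same transaction
}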
Lessons at this stage
Moving from synchronous chains to event-driven flows is the single biggest architectural leverage you get before true microservices. It’s also a place where you pay a permanent complexity tax: eventual consistency, idempotency, dead-letter handling. Earn it — don’t reach for events when a simple synchronous call does the job.
Stage 3 — The great unbundling
At 8,000 RPS the real constraint became the team, not the system. Two teams of 8 engineers sharing a monolith meant every deploy was a cross-team coordination exercise. The system’s performance was fine; the throughput of changes wasn’t.
We started extracting services — carefully, one at a time, with a clear rule: extract along business boundaries, never along technical ones.
The order of extraction matters
Our first attempt was to extract “the payment logic” as a service, because it was the most logically bounded module. That failed. The payment logic was called from 40 places across the codebase, with 40 different sets of context. Extracting it meant freezing design decisions too early.
What worked instead:
- Extract by team ownership first. The team that owned notifications extracted Notifications. The team that owned search extracted Search. Each team got independent deploys, without needing to touch anyone else’s code.
- Extract along Conway boundaries. The service boundary matched the team boundary. If two teams shared a service, we’d see merge conflicts; if one team owned two services, communication overhead grew.
- Strangler pattern, not big-bang rewrite. New features of the extracted domain were built in the new service. Existing features were migrated one at a time, with a feature flag switching traffic per-call.
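The per-call flag routing from the last bullet looked roughly like this; the flag service and client names are placeholders rather than our real ones:

public void sendNotification(NotificationRequest req) {
    // evaluated per call, so traffic can shift gradually (1% -> 10% -> 100%) and roll back instantly
    if (featureFlags.isEnabled("notifications.use-new-service", req.userId())) {
        notificationsClient.send(req);        // HTTP call to the extracted service
    } else {
        legacyNotificationService.send(req);  // old in-process code path
    }
}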
A representative extraction — Notifications — looked like:
Month 1: new service stood up, no traffic
Month 2: new event types published, notifications service consumes, reads old DB via JDBC
Month 3: own database, data migrated, old code calls new service via HTTP
Month 4: old code deleted, dual-write turned off
Month 5: load-tested, on-call rotation confirmed with the owning team

Five months for one extraction, done carefully, beats a three-month “migration sprint” that leaves the old code around forever.
The hard part: shared data
Every extraction has the same obstacle — the new service needs data that lived in the monolith’s DB. Three approaches, in order of preference:
- Copy-via-events. The monolith publishes events for the relevant domain; the new service subscribes and builds its own view. Eventual consistency is the price. This was our default.
- Sync RPC back to the monolith. The new service calls the monolith for data it doesn’t own. Easy to do, creates tight coupling, and should be a stepping stone, not a destination.
- Shared database. Tempting, always a trap. We ruled this out at the start.
For reference data that rarely changes (countries, product categories), we just materialized a copy in every service that needed it, refreshed daily. Simple and stable.
Lessons at this stage
The benefit of microservices is mostly about team autonomy, not performance. If you’re extracting services to make the system faster, you’ll probably be disappointed — you’ll add network hops that weren’t there before. Extract to decouple the pace of change across teams; everything else is a side effect.
Stage 4 — The highload realities
At 15,000 RPS we had maybe 20 services, a respectable Kafka footprint, and our first real highload incidents. A catalog of the ones that taught the most:
The retry storm
Payments service had a 3-second timeout, with 3 retries. Our gateway had a 5-second timeout, with 3 retries. When payments slowed from 200 ms to 2.5 s one afternoon, each request turned into up to 3 gateway attempts × 3 payments attempts = effectively 9× the load on payments. Payments dies. Gateway dies. Everything dies.
The fix was layered:
- Exponential backoff with jitter on every retry — not flat retries
- Budget-based timeouts — gateway’s 5s gets split: 2.5s for payments, 1s for orders, leaving slack
- Circuit breaker that opens after N failures — fail fast, don’t keep hammering
- Bounded retry counts per hop — 1 or 2, not 3, and never on non-idempotent operations
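In code, most of this is configuration rather than logic. A sketch of the posture with Resilience4j; the thresholds are illustrative, not our exact production values:

RetryConfig retryConfig = RetryConfig.custom()
    .maxAttempts(2)                                // the original call plus one retry
    .intervalFunction(IntervalFunction
        .ofExponentialRandomBackoff(100, 2.0))     // exponential backoff with jitter
    .build();

CircuitBreakerConfig breakerConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                      // open once half the calls in the window fail
    .slidingWindowSize(20)
    .waitDurationInOpenState(Duration.ofSeconds(10))
    .build();

TimeLimiterConfig timeLimiterConfig = TimeLimiterConfig.custom()
    .timeoutDuration(Duration.ofMillis(2_500))     // payments' share of the gateway's 5 s budget
    .build();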
The cache stampede
The product page cache had a 5-minute TTL. At peak hour, the top 10 products were on every home feed. When the TTL expired on product #1, 3,000 concurrent requests missed the cache simultaneously, all did the DB fetch, all thundered into Postgres. Postgres saturated, everything slowed, cascading timeouts followed.
Fixed with request coalescing — Caffeine’s AsyncLoadingCache ensures that only one request does the DB fetch while everyone else waits for its result:
AsyncLoadingCache<String, ProductView> cache = Caffeine.newBuilder()
    .maximumSize(20_000)
    .expireAfterWrite(Duration.ofMinutes(5))
    .buildAsync(this::loadProduct);

Plus a small random jitter on TTL, so not everything expires at the same instant.
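The jitter can live in the cache itself. A sketch using Caffeine's Expiry in place of the fixed expireAfterWrite; the 5-minute base and 30-second jitter window are illustrative:

AsyncLoadingCache<String, ProductView> cache = Caffeine.newBuilder()
    .maximumSize(20_000)
    .expireAfter(new Expiry<String, ProductView>() {
        @Override
        public long expireAfterCreate(String key, ProductView value, long currentTime) {
            long jitterMs = ThreadLocalRandom.current().nextLong(30_000);
            return TimeUnit.MILLISECONDS.toNanos(Duration.ofMinutes(5).toMillis() + jitterMs);
        }
        @Override
        public long expireAfterUpdate(String key, ProductView value, long currentTime, long currentDuration) {
            return expireAfterCreate(key, value, currentTime); // refreshed entries get a new jittered TTL
        }
        @Override
        public long expireAfterRead(String key, ProductView value, long currentTime, long currentDuration) {
            return currentDuration; // reads don't extend the TTL
        }
    })
    .buildAsync(this::loadProduct);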
The consumer lag incident
A deploy to the notifications consumer introduced a 500 ms per-message processing cost (an accidentally-enabled debug call to an external API). Peak traffic was 1,200 messages/sec; the consumer could process 100. Lag grew at ~1,100 msg/sec. Within an hour, lag was 3 million messages. Nobody had an alert on consumer lag growth rate.
Lessons that stuck:
- Alert on lag growth rate, not absolute lag. Absolute lag always exists; rapid growth is the signal.
- Know your per-message processing cost. It’s easy to benchmark (see the sketch after this list); everyone should.
- Have a runbook for “the consumer is behind.” Scale consumer replicas? Skip ahead to current offset and accept data loss? Pause producers? Decide before the incident, not during it.
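A cheap way to keep the per-message cost visible is to time the handler with Micrometer; the metric and handler names here are placeholders:

public void onMessage(ConsumerRecord<String, String> record) {
    Timer.Sample sample = Timer.start(meterRegistry);
    try {
        handle(record); // the actual per-message work
    } finally {
        // p99 of this timer is your per-message cost; compare it against incoming
        // messages per second to know how close the consumer is to falling behind
        sample.stop(meterRegistry.timer("consumer.process.time", "topic", record.topic()));
    }
}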
The deploy-at-peak disaster
One Friday at 4:45 PM, a small config change went out. Something about the way the new config was loaded caused every service to reload its connection pool, which caused a burst of connection churn to Postgres, which caused Postgres to briefly refuse connections, which caused every service to retry, which looked to monitoring like a coordinated outage. We brought everything down for 12 minutes.
From that day: no deploys in the last 2 hours of the workweek. A cultural rule enforced by the deploy pipeline itself.
Stage 5 — The things we didn’t do
Equally instructive — the architectural moves we considered and rejected:
We didn’t move to Kubernetes until year 2. Not because it’s bad, but because our VMs-with-Ansible setup was working, and the migration cost was huge. We moved when our number of services crossed ~15 and deploy variance became a real problem.
We didn’t adopt a service mesh. Istio was fashionable at the time. The problems it solved (mTLS, retries, circuit breaking) were being solved by Resilience4j and standard libraries already. The operational complexity was higher than the value.
We didn’t go reactive. We evaluated Project Reactor and WebFlux for the hot-path services. The code became harder to read, debugging was harder, and the benchmarks only showed real gains on absurdly I/O-bound services. When virtual threads landed in JDK 21, that debate ended — blocking code with virtual threads gets most of the performance with none of the complexity.
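For completeness, the virtual-thread version of "just write blocking code": Spring Boot 3.2+ exposes it as the spring.threads.virtual.enabled property, and in plain Java it is a one-line executor swap. The collection and fetch method below are placeholder names:

// JDK 21+: each submitted task runs on its own virtual thread; blocking JDBC or HTTP
// calls park the virtual thread instead of tying up an OS thread
try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
    for (String id : productIds) {
        executor.submit(() -> fetchBlocking(id)); // plain blocking code, no reactive operators
    }
} // close() waits for the submitted tasks to finish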
We didn’t shard the main Postgres. We partitioned the three biggest tables with declarative partitioning, added more read replicas, and upgraded hardware. At ~15,000 writes per second, a well-tuned single Postgres primary was still cheaper and simpler than sharding. Shard only when the single instance genuinely can’t keep up — not before.
What actually moved the needle
Ranked by impact, looking back:
- Query and index work in stage 1 — bought more time than any architectural change
- Outbox + event-driven flows — made the system tolerant of downstream slowness
- Team-aligned service extraction — unblocked parallel development
- Two-level caching with stampede protection — largest single reduction in DB load
- Budgeted timeouts and circuit breakers everywhere — prevented the worst incidents
- Canary deploys with automated rollback — prevented bad deploys from becoming outages
- SLOs and burn-rate alerting — made reliability a tracked number, not an opinion
- Virtual threads (once available) — doubled per-service capacity with zero code change
What didn’t move the needle, despite the hype:
- Custom reactive code in most services
- Service mesh
- Premature sharding
- Custom caching layers
- GraphQL (adopted then deprecated — the BFF pattern served better)
If I were starting over today
A greenfield version of this journey, with the lessons learned, would look different in several places:
- Start as a modular monolith, not a single blob. Packages by bounded context, strict dependency rules. Extraction later is straightforward.
- Outbox from day one. Even if you’re not publishing events yet, the table costs nothing and removes the biggest consistency footgun when you start.
- OpenTelemetry from day one. Retrofitting observability into a mature system is many times more expensive than building it in.
- Postgres partitioning on the biggest tables from day one. Backfilling partitions on a live 500 GB table is a project; doing it on an empty table is a line of DDL.
- JDK 21+ with virtual threads on, always. There’s no reason in 2026 to write a new Java backend without them.
- Kafka or cloud-native equivalent from day one, even if oversized for the workload. Once you’re stuck on RabbitMQ / SQS and outgrow it, migration is painful.
- CI/CD with canary and auto-rollback before the first production deploy. The pipeline is the safety net for everything else.
Checklist: the production-scaling playbook
- Slow-query log enabled, top 20 queries reviewed weekly until clean
- Indexes added via CREATE INDEX CONCURRENTLY only
- HikariCP sized to DB capacity (usually 10–20), leak detection on
- Two-level caching (Caffeine + Redis) with stampede protection
- Transactional outbox for every DB write that produces an event
- Budgeted timeouts with deadlines propagated across services
- Circuit breakers and bulkheads on every external call
- Events via Kafka with schemas in a registry
- Idempotency on every write endpoint and every event handler
- Strangler pattern for service extractions (no big-bang rewrites)
- Service boundaries aligned to team boundaries
- Canary deploys with SLO-based automated rollback
- Consumer lag growth-rate alerts in addition to absolute lag
- OpenTelemetry instrumentation from the first line of code
- JDK 21+ with virtual threads enabled
Closing thought
Scaling a Java backend isn’t a single project — it’s a sequence of stages, each with its own dominant constraint. Query plans first, async flows next, service boundaries after that, highload defenses when the scale earns them. The mistake I’ve made — and watched others make — is reaching for the late-stage tools early. Sharding before indexing. Microservices before understanding the domain. Service mesh before needing it. The systems that scale gracefully aren’t the ones that used the most advanced tools earliest; they’re the ones that solved the problem that was actually in front of them, then moved on to the next. Keep the system boring as long as possible, and only add complexity when you’ve earned the right.