Scalability talks usually sound the same — diagrams of horizontally-scaled stateless services with a magic cache in front of a magic database, and a bullet point saying “profile and measure.” Building it is less glamorous. This article is about what actually happens when a Java backend grows from a single-box MVP to a system handling tens of thousands of requests per second — and the patterns that made the difference in real projects.
The three scaling walls
Almost every backend hits the same three walls, in this order:
- The synchronous wall. The first slow dependency turns every request into a queue.
- The database wall. One Postgres instance can do a lot, until it can’t.
- The consistency wall. You stop being able to pretend the system has a single source of truth.
Each wall demands a different set of patterns. Recognizing which wall you’re hitting is half the work.
Wall 1 — Synchronous bottlenecks
Symptom
Latency climbs. CPU is bored. Thread pool is exhausted. Heap looks fine. Threads are parked on Socket.read against some downstream service.
Root cause
Your service is doing synchronous I/O on a bounded thread pool. Every slow call holds a thread. Once you run out of threads, every new request queues.
Fixes, in order of effort
1. Virtual threads (JDK 21+)
The single biggest win for I/O-bound Java services in a decade. One configuration property:
spring:
  threads:
    virtual:
      enabled: true

Tomcat now runs each request on a virtual thread, so blocking I/O no longer pins a platform thread. A service that topped out at ~200 concurrent requests routinely handles 10k+ without code changes.
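Outside Spring, the same effect is available straight from the JDK: run blocking work on a virtual-thread-per-task executor. A minimal sketch, assuming JDK 21+ (the sleeping task stands in for a blocking downstream call):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class VirtualThreadDemo {
    public static void main(String[] args) {
        // Each task gets its own virtual thread; blocking inside the task
        // parks the virtual thread without tying up a platform thread.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                executor.submit(() -> {
                    Thread.sleep(100);   // stand-in for a blocking I/O call
                    return null;
                });
            }
        }   // close() waits for the submitted tasks to finish
    }
}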
2. Timeouts on every network call
A missing timeout is the most common cause of production incidents in Java backends. Default HTTP clients don’t time out. Set timeouts explicitly:
HttpClient client = HttpClient.newBuilder()
    .connectTimeout(Duration.ofSeconds(2))
    .build();

HttpRequest req = HttpRequest.newBuilder(URI.create(url))
    .timeout(Duration.ofSeconds(3))
    .GET()
    .build();

For JDBC, HikariCP has four separate timeouts — connectionTimeout, validationTimeout, maxLifetime, idleTimeout. Tune all four.
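A sketch of what tuning those four looks like with HikariCP's programmatic API; the values are illustrative starting points, not recommendations, and the JDBC URL is a placeholder:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import javax.sql.DataSource;

public final class PoolFactory {
    static DataSource newPool(String jdbcUrl) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);
        config.setConnectionTimeout(2_000);    // max ms to wait for a connection from the pool
        config.setValidationTimeout(1_000);    // max ms for the connection aliveness check
        config.setMaxLifetime(1_800_000);      // retire every connection after 30 minutes
        config.setIdleTimeout(600_000);        // close connections idle for more than 10 minutes
        return new HikariDataSource(config);
    }
}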
3. Bulkheads
Separate connection pools or thread pools per downstream. One slow dependency shouldn’t starve the others:
resilience4j:
  bulkhead:
    instances:
      payments:
        max-concurrent-calls: 50
      search:
        max-concurrent-calls: 200

4. Move slow work off the request path
If a request triggers an email, a webhook, or an analytics write, it should not wait for those. Publish an event, return 200, let consumers handle it asynchronously:
@Transactional
public Order placeOrder(CreateOrderRequest req) {
    Order order = repo.save(new Order(...));
    outboxRepo.save(new OutboxMessage("order.placed", order.toEvent()));
    return order;
}

A separate outbox poller publishes to Kafka after the transaction commits. The user sees a fast response; the side effects happen reliably.
Wall 2 — The database
The database is almost always where scalability goes to die.
Read the query plan before you write the feature
A query that is fast at 100 rows can be catastrophic at 10 million. Check the plan:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders
WHERE customer_id = $1 AND created_at > NOW() - INTERVAL '30 days'
ORDER BY created_at DESC LIMIT 20;

If you see Seq Scan on a large table for a production query path, you need an index:
CREATE INDEX CONCURRENTLY idx_orders_customer_created
  ON orders (customer_id, created_at DESC);

Note CONCURRENTLY — standard CREATE INDEX locks writes, which is how people take prod down with a migration.
Connection pool sizing
The counter-intuitive rule: fewer connections usually mean more throughput. A pool larger than cores × 2 + effective_spindles just creates contention on the DB side.
spring:
  datasource:
    hikari:
      maximum-pool-size: 20
      minimum-idle: 5
      connection-timeout: 2000
      max-lifetime: 1800000

If you need more, you probably need a connection pooler in front of Postgres (PgBouncer in transaction mode), not a bigger pool.
Read replicas and routing
For read-heavy workloads, split reads from writes:
@Configuration
public class DataSourceConfig {

    @Bean
    @Primary
    public DataSource routingDataSource(
            @Qualifier("writerDs") DataSource writer,
            @Qualifier("readerDs") DataSource reader) {

        Map<Object, Object> targets = Map.of(
            DataSourceRole.WRITE, writer,
            DataSourceRole.READ, reader
        );

        AbstractRoutingDataSource rds = new AbstractRoutingDataSource() {
            @Override
            protected Object determineCurrentLookupKey() {
                return TransactionSynchronizationManager.isCurrentTransactionReadOnly()
                    ? DataSourceRole.READ
                    : DataSourceRole.WRITE;
            }
        };
        rds.setTargetDataSources(targets);
        rds.setDefaultTargetDataSource(writer);
        return rds;
    }
}

Now @Transactional(readOnly = true) methods hit the replica. Replicas are eventually consistent — never read back what you just wrote unless you explicitly read from the primary.
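Usage is then just the annotation; the repository and finder below are hypothetical. One caveat worth knowing: some transaction manager setups grab the JDBC connection before the read-only flag is bound, in which case wrapping the routing data source in Spring's LazyConnectionDataSourceProxy defers the lookup until the connection is actually used.

// Read path: the read-only flag makes the router pick the replica.
@Transactional(readOnly = true)
public List<Order> recentOrders(String customerId) {
    return orderRepo.findByCustomerIdOrderByCreatedAtDesc(customerId);  // hypothetical finder
}

// Write path: default read-write transaction, routed to the primary.
@Transactional
public Order placeOrder(CreateOrderRequest req) {
    return orderRepo.save(new Order(req));
}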
Caching that actually helps
Caches only help if you cache the right thing. Rules of thumb:
- Cache outputs, not entities. Cache the fully-rendered response for /products/{id} rather than the Product row.
- Short TTLs beat clever invalidation. Most systems are fine with 30-second cache staleness.
- Two-level caches win. Caffeine in-process + Redis distributed. In-process catches the hot keys; Redis catches the warm ones.
@Service
public class ProductService {

    private final Cache<String, ProductView> local =
        Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(Duration.ofSeconds(30))
            .build();

    private final RedisTemplate<String, ProductView> redis;
    private final ProductRepository repo;

    public ProductView get(String id) {
        return local.get(id, k -> {
            ProductView cached = redis.opsForValue().get("product:" + k);
            if (cached != null) return cached;
            ProductView fresh = ProductView.from(repo.findById(k).orElseThrow());
            redis.opsForValue().set("product:" + k, fresh, Duration.ofMinutes(5));
            return fresh;
        });
    }
}

Watch out for thundering herds. When a hot key expires, a thousand requests miss simultaneously and all hit the DB. Use Caffeine’s AsyncLoadingCache or explicit locking per key to collapse concurrent misses into one DB call.
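A minimal sketch of that collapsing behavior with Caffeine's AsyncLoadingCache; ProductView here is a stand-in record and the loader is a placeholder for the Redis-then-database lookup above:

import com.github.benmanes.caffeine.cache.AsyncLoadingCache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;

public class ProductCache {
    // Stand-in for the view type used above.
    record ProductView(String id, String name) {}

    private final AsyncLoadingCache<String, ProductView> cache =
        Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(Duration.ofSeconds(30))
            // buildAsync collapses concurrent misses: the first caller starts the
            // load, later callers for the same key wait on the same future.
            .buildAsync(this::loadProduct);

    public CompletableFuture<ProductView> get(String id) {
        return cache.get(id);
    }

    // Placeholder for the real lookup (Redis first, then the repository).
    private ProductView loadProduct(String id) {
        return new ProductView(id, "stub");
    }
}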
When to shard
Sharding is a last resort. You need it when a single primary can’t handle writes — usually in the tens of thousands of writes per second range. Before sharding, try in order:
- Bigger instance (vertical scaling works further than people think)
- Move hot tables to separate databases
- Offload reads to replicas
- Partition the largest tables (Postgres declarative partitioning)
Only then shard. Sharding adds a class of bugs that don’t exist otherwise: cross-shard queries, hotspots, rebalancing pain.
Wall 3 — Consistency across services
Once your system spans more than one service, @Transactional no longer saves you. You need to design explicitly for eventual consistency.
The outbox pattern
The default way to write to DB and publish an event safely:
@Entity
@Table(name = "outbox")
public class OutboxMessage {
    @Id private UUID id;
    private String topic;
    private String payload;
    private Instant createdAt;
    private Instant publishedAt;
}

@Service
public class OutboxPublisher {

    private final OutboxRepository repo;
    private final KafkaTemplate<String, String> kafka;

    @Scheduled(fixedDelay = 500)
    @Transactional
    public void publish() {
        List<OutboxMessage> batch = repo.findUnpublishedLimit(100);
        for (OutboxMessage msg : batch) {
            kafka.send(msg.getTopic(), msg.getPayload());
            msg.setPublishedAt(Instant.now());
        }
    }
}

Business code writes both the domain row and the outbox row in one transaction. Either both commit or neither does. The publisher picks up the outbox rows and sends them. If Kafka is down, messages accumulate — fine, they’ll be sent when it recovers.
Idempotency
Every handler that writes must be idempotent, because every message broker delivers at least once in the real world:
@Transactional
public void onPaymentSettled(PaymentSettled event) {
    if (shipmentRepo.existsByOrderId(event.orderId())) {
        return;
    }
    shipmentRepo.save(new Shipment(event.orderId(), ShipmentStatus.SCHEDULED));
}

For REST write endpoints, accept an Idempotency-Key header and persist it:
@PostMapping("/payments")
public PaymentResponse charge(
        @RequestHeader("Idempotency-Key") String key,
        @RequestBody ChargeRequest req) {
    return idempotencyStore.findByKey(key)
        .orElseGet(() -> {
            PaymentResponse result = paymentService.charge(req);
            idempotencyStore.save(key, result);
            return result;
        });
}

Sagas for multi-step business workflows
Don’t try to emulate XA transactions across services. Model workflows as compensating steps:
place-order ─▶ reserve-inventory ─▶ charge-card ─▶ schedule-shipping
                       │                 │                 │
                       ▼                 ▼                 ▼
                  on failure:       on failure:       on failure:
                    (none)       release-inventory  release-inventory,
                                                       refund-card

Orchestrated sagas (a coordinator drives the flow) are easier to debug. Choreographed sagas (services react to events) couple less tightly. Pick based on whether you value observability or decoupling more.
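A stripped-down sketch of the orchestrated variant, using hypothetical step interfaces; a real coordinator would also persist saga state so it can resume or compensate after a crash:

interface InventoryService { void reserve(String orderId); void release(String orderId); }
interface PaymentService   { void charge(String orderId);  void refund(String orderId); }
interface ShippingService  { void schedule(String orderId); }

public class OrderSagaOrchestrator {
    private final InventoryService inventory;
    private final PaymentService payments;
    private final ShippingService shipping;

    public OrderSagaOrchestrator(InventoryService inventory, PaymentService payments, ShippingService shipping) {
        this.inventory = inventory;
        this.payments = payments;
        this.shipping = shipping;
    }

    public void placeOrder(String orderId) {
        inventory.reserve(orderId);          // step 1: nothing to compensate if this fails
        try {
            payments.charge(orderId);        // step 2
        } catch (RuntimeException e) {
            inventory.release(orderId);      // compensate step 1
            throw e;
        }
        try {
            shipping.schedule(orderId);      // step 3
        } catch (RuntimeException e) {
            inventory.release(orderId);      // compensate step 1
            payments.refund(orderId);        // compensate step 2
            throw e;
        }
    }
}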
Observability: measure or hallucinate
You cannot tune what you can’t see. The minimum viable stack:
- Metrics — Micrometer + Prometheus. Track the RED trio per endpoint: Rate, Errors, Duration (p50/p95/p99).
- Tracing — OpenTelemetry with trace context propagation. Every slow request has a trace; every trace names the guilty span.
- Logs — structured JSON logs with a traceId field. Ship to Loki or Elasticsearch. Never grep individual pods in prod.
- Profiler — async-profiler or Java Flight Recorder. Turn it on when CPU is unexpectedly high. It will point at the line.
A custom timer you should wrap around every external call:
Timer.Sample sample = Timer.start(meterRegistry);
try {
    return client.charge(orderId, amount);
} finally {
    sample.stop(Timer.builder("external.call")
        .tag("service", "payments")
        .tag("method", "charge")
        .register(meterRegistry));
}

Now your Grafana dashboard tells you exactly which downstream is eating your p99. Guessing ends, engineering begins.
JVM-level knobs that actually matter
Most JVM tuning is folklore. These three are real:
Pick the right garbage collector.
- G1 (default) — good general-purpose, predictable pause times up to large heaps.
- ZGC — sub-millisecond pauses at the cost of roughly 5–15% throughput. Use for latency-sensitive services with heaps > 8 GB.
- Parallel — highest throughput, longer pauses. Only for batch / throughput jobs.
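For reference, switching collectors is a single startup flag (G1 needs none, being the default; exact flags vary by JDK version):

-XX:+UseZGC          # ZGC (add -XX:+ZGenerational on JDK 21/22 for the generational mode)
-XX:+UseParallelGC   # Parallel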
Size the heap with container awareness.
-XX:InitialRAMPercentage=60 -XX:MaxRAMPercentage=75

The JVM inside a container must know it’s in a container (JDK 17+ does by default). Leave 25–40% of the pod memory for metaspace, direct buffers, and the OS page cache.
Enable JFR in production.
-XX:StartFlightRecording=filename=recording.jfr,duration=5m,settings=profile

Zero meaningful overhead, huge diagnostic value when something goes wrong.
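If the service wasn't started with the flag, a recording can also be started and dumped on a running JVM with jcmd (a sketch; <pid> is the target process id):

jcmd <pid> JFR.start name=adhoc settings=profile duration=5m filename=/tmp/adhoc.jfr
jcmd <pid> JFR.dump name=adhoc filename=/tmp/adhoc-partial.jfr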
Load tests that tell you the truth
Three mistakes that make load tests lie:
- Testing with an empty database. Production has 50 GB; your test has 50 MB. Query plans are different. Seed realistic data.
- Ramping too fast. Jumping from 0 to 10k RPS in 5 seconds measures the warm-up path, not steady state. Ramp over 5–10 minutes.
- Measuring from a laptop. Network latency from your machine dwarfs service latency. Run load generators in the same region as the service.
A k6 script that does it right:
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  stages: [
    { duration: '5m', target: 500 },
    { duration: '15m', target: 500 },
    { duration: '5m', target: 0 },
  ],
  thresholds: {
    http_req_duration: ['p(95)<300', 'p(99)<800'],
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  http.get('https://api.example.com/products/42');
  sleep(0.2);
}

If p99 blows past your threshold, the report pinpoints when. Correlate with your metrics dashboard — you’ll see which dependency broke first.
The non-technical part: team topology
Conway’s Law doesn’t care about your architecture diagram. If three teams share a codebase, you’ll have a distributed monolith no matter how many services you draw. Scalable backends usually require:
- Service ownership — one team owns each service end-to-end (code, on-call, roadmap).
- Platform team — someone maintains the shared infrastructure (CI/CD, observability, base images). Without this, every team reinvents the same wheels.
- API contracts as first-class artifacts — versioned, documented, breaking-change-reviewed. Changes to public APIs cross team boundaries; they need a process.
Checklist: is your Java backend actually scalable?
- Virtual threads enabled on I/O-bound services
- Every network call has explicit timeouts (connect + read)
- Every write endpoint accepts an idempotency key
- Outbox or CDC for DB writes that publish events
- Circuit breakers with fallbacks around every external dependency
- Connection pool sized for DB capacity, not JVM optimism
- Read-only transactions route to replicas where appropriate
- Two-level cache (in-process + distributed) for hot reads
- RED metrics per endpoint, exported to Prometheus
- Distributed tracing with correlation IDs in logs
- Realistic load test with production-scale data
- A runbook per failure mode you’ve seen
Closing thought
Scaling a Java backend isn’t about picking the fanciest framework — it’s about removing synchronous dependencies one by one until the system can absorb failure instead of amplifying it. Every pattern in this article is a different way to answer the same question: what does this service do when the thing it depends on is slow, broken, or missing? Services that answer that well stay up. Services that don’t, don’t. Build for failure, measure ruthlessly, and keep the default path boring — the interesting part of a scalable backend is the shape of the graph, not the code at each node.