Scalability talks usually sound the same — diagrams of horizontally-scaled stateless services with a magic cache in front of a magic database, and a bullet point saying “profile and measure.” Building it is less glamorous. This article is about what actually happens when a Java backend grows from a single-box MVP to a system handling tens of thousands of requests per second — and the patterns that made the difference in real projects.
The three scaling walls
Almost every backend hits the same three walls, in this order:
- The synchronous wall. The first slow dependency turns every request into a queue.
- The database wall. One Postgres instance can do a lot, until it can’t.
- The consistency wall. You stop being able to pretend the system has a single source of truth.
Each wall demands a different set of patterns. Recognizing which wall you’re hitting is half the work.
Wall 1 — Synchronous bottlenecks
Symptom
Latency climbs. CPU is bored. Thread pool is exhausted. Heap looks fine. Threads are parked on Socket.read against some downstream service.
Root cause
Your service is doing synchronous I/O on a bounded thread pool. Every slow call holds a thread. Once you run out of threads, every new request queues.
Fixes, in order of effort
1. Virtual threads (JDK 21+)
The single biggest win for I/O-bound Java services in a decade. One configuration property:
spring:
  threads:
    virtual:
      enabled: true

Tomcat now runs each request on a virtual thread, so blocking I/O no longer pins a platform thread. A service that topped out at ~200 concurrent requests routinely handles 10k+ without code changes.
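Outside Spring, the same effect is available straight from the JDK: run blocking work on a virtual-thread-per-task executor. A minimal sketch, assuming JDK 21+ (the sleeping task stands in for a blocking downstream call):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class VirtualThreadDemo {
    public static void main(String[] args) {
        // Each task gets its own virtual thread; blocking inside the task
        // parks the virtual thread without tying up a platform thread.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                executor.submit(() -> {
                    Thread.sleep(100);   // stand-in for a blocking I/O call
                    return null;
                });
            }
        }   // close() waits for the submitted tasks to finish
    }
}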
2. Timeouts on every network call
A missing timeout is the most common cause of production incidents in Java backends. Default HTTP clients don’t time out. Set timeouts explicitly:
HttpClient client = HttpClient.newBuilder()
    .connectTimeout(Duration.ofSeconds(2))
    .build();

HttpRequest req = HttpRequest.newBuilder(URI.create(url))
    .timeout(Duration.ofSeconds(3))
    .GET()
    .build();

For JDBC, HikariCP has four separate timeouts — connectionTimeout, validationTimeout, maxLifetime, idleTimeout. Tune all four.
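A sketch of what tuning those four looks like with HikariCP's programmatic API; the values are illustrative starting points, not recommendations, and the JDBC URL is a placeholder:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import javax.sql.DataSource;

public final class PoolFactory {
    static DataSource newPool(String jdbcUrl) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);
        config.setConnectionTimeout(2_000);    // max ms to wait for a connection from the pool
        config.setValidationTimeout(1_000);    // max ms for the connection aliveness check
        config.setMaxLifetime(1_800_000);      // retire every connection after 30 minutes
        config.setIdleTimeout(600_000);        // close connections idle for more than 10 minutes
        return new HikariDataSource(config);
    }
}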
3. Bulkheads
Separate connection pools or thread pools per downstream. One slow dependency shouldn’t starve the others:
resilience4j:
  bulkhead:
    instances:
      payments:
        max-concurrent-calls: 50
      search:
        max-concurrent-calls: 200

4. Move slow work off the request path
If a request triggers an email, a webhook, or an analytics write, it should not wait for those. Publish an event, return 200, let consumers handle it asynchronously:
@Transactional
public Order placeOrder(CreateOrderRequest req) {
    Order order = repo.save(new Order(...));
    outboxRepo.save(new OutboxMessage("order.placed", order.toEvent()));
    return order;
}

A separate outbox poller publishes to Kafka after the transaction commits. The user sees a fast response; the side effects happen reliably.
Wall 2 — The database
The database is almost always where scalability goes to die.
Read the query plan before you write the feature
A query that is fast at 100 rows can be catastrophic at 10 million. Check the plan:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders
WHERE customer_id = $1 AND created_at > NOW() - INTERVAL '30 days'
ORDER BY created_at DESC LIMIT 20;

If you see Seq Scan on a large table for a production query path, you need an index:
CREATE INDEX CONCURRENTLY idx_orders_customer_created
  ON orders (customer_id, created_at DESC);

Note CONCURRENTLY — standard CREATE INDEX locks writes, which is how people take prod down with a migration.
Connection pool sizing
The counter-intuitive rule: fewer connections usually mean more throughput. A pool larger than cores × 2 + effective_spindles just creates contention on the DB side.
spring:
  datasource:
    hikari:
      maximum-pool-size: 20
      minimum-idle: 5
      connection-timeout: 2000
      max-lifetime: 1800000

If you need more, you probably need a connection pooler in front of Postgres (PgBouncer in transaction mode), not a bigger pool.
Read replicas and routing
For read-heavy workloads, split reads from writes:
@Configuration
public class DataSourceConfig {

    @Bean
    @Primary
    public DataSource routingDataSource(
            @Qualifier("writerDs") DataSource writer,
            @Qualifier("readerDs") DataSource reader) {

        Map<Object, Object> targets = Map.of(
            DataSourceRole.WRITE, writer,
            DataSourceRole.READ, reader
        );

        AbstractRoutingDataSource rds = new AbstractRoutingDataSource() {
            @Override
            protected Object determineCurrentLookupKey() {
                return TransactionSynchronizationManager.isCurrentTransactionReadOnly()
                    ? DataSourceRole.READ
                    : DataSourceRole.WRITE;
            }
        };
        rds.setTargetDataSources(targets);
        rds.setDefaultTargetDataSource(writer);
        return rds;
    }
}

Now @Transactional(readOnly = true) methods hit the replica. Replicas are eventually consistent — never read back what you just wrote unless you explicitly read from the primary.
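Usage is then just the annotation; the repository and finder below are hypothetical. One caveat worth knowing: some transaction manager setups grab the JDBC connection before the read-only flag is bound, in which case wrapping the routing data source in Spring's LazyConnectionDataSourceProxy defers the lookup until the connection is actually used.

// Read path: the read-only flag makes the router pick the replica.
@Transactional(readOnly = true)
public List<Order> recentOrders(String customerId) {
    return orderRepo.findByCustomerIdOrderByCreatedAtDesc(customerId);  // hypothetical finder
}

// Write path: default read-write transaction, routed to the primary.
@Transactional
public Order placeOrder(CreateOrderRequest req) {
    return orderRepo.save(new Order(req));
}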
Caching that actually helps
Caches only help if you cache the right thing. Rules of thumb:
- Cache outputs, not entities. Cache the fully-rendered response for /products/{id} rather than the Product row.
- Short TTLs beat clever invalidation. Most systems are fine with 30-second cache staleness.
- Two-level caches win. Caffeine in-process + Redis distributed. In-process catches the hot keys; Redis catches the warm ones.
@Service
public class ProductService {

    private final Cache<String, ProductView> local =
        Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(Duration.ofSeconds(30))
            .build();

    private final RedisTemplate<String, ProductView> redis;
    private final ProductRepository repo;

    public ProductView get(String id) {
        return local.get(id, k -> {
            ProductView cached = redis.opsForValue().get("product:" + k);
            if (cached != null) return cached;
            ProductView fresh = ProductView.from(repo.findById(k).orElseThrow());
            redis.opsForValue().set("product:" + k, fresh, Duration.ofMinutes(5));
            return fresh;
        });
    }
}

Watch out for thundering herds. When a hot key expires, a thousand requests miss simultaneously and all hit the DB. Use Caffeine’s AsyncLoadingCache or explicit locking per key to collapse concurrent misses into one DB call.
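A minimal sketch of that collapsing behavior with Caffeine's AsyncLoadingCache; ProductView here is a stand-in record and the loader is a placeholder for the Redis-then-database lookup above:

import com.github.benmanes.caffeine.cache.AsyncLoadingCache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;

public class ProductCache {
    // Stand-in for the view type used above.
    record ProductView(String id, String name) {}

    private final AsyncLoadingCache<String, ProductView> cache =
        Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(Duration.ofSeconds(30))
            // buildAsync collapses concurrent misses: the first caller starts the
            // load, later callers for the same key wait on the same future.
            .buildAsync(this::loadProduct);

    public CompletableFuture<ProductView> get(String id) {
        return cache.get(id);
    }

    // Placeholder for the real lookup (Redis first, then the repository).
    private ProductView loadProduct(String id) {
        return new ProductView(id, "stub");
    }
}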
When to shard
Sharding is a last resort. You need it when a single primary can’t handle writes — usually in the tens of thousands of writes per second range. Before sharding, try in order:
- Bigger instance (vertical scaling works further than people think)
- Move hot tables to separate databases
- Offload reads to replicas
- Partition the largest tables (Postgres declarative partitioning)
Only then shard. Sharding adds a class of bugs that don’t exist otherwise: cross-shard queries, hotspots, rebalancing pain.
Wall 3 — Consistency across services
Once your system spans more than one service, @Transactional no longer saves you. You need to design explicitly for eventual consistency.
The outbox pattern
The default way to write to DB and publish an event safely:
@Entity
@Table(name = "outbox")
public class OutboxMessage {
    @Id private UUID id;
    private String topic;
    private String payload;
    private Instant createdAt;
    private Instant publishedAt;
}

@Service
public class OutboxPublisher {

    private final OutboxRepository repo;
    private final KafkaTemplate<String, String> kafka;

    @Scheduled(fixedDelay = 500)
    @Transactional
    public void publish() {
        List<OutboxMessage> batch = repo.findUnpublishedLimit(100);
        for (OutboxMessage msg : batch) {
            kafka.send(msg.getTopic(), msg.getPayload());
            msg.setPublishedAt(Instant.now());
        }
    }
}

Business code writes both the domain row and the outbox row in one transaction. Either both commit or neither does. The publisher picks up the outbox rows and sends them. If Kafka is down, messages accumulate — fine, they’ll be sent when it recovers.
Idempotency
Every handler that writes must be idempotent, because every message broker delivers at least once in the real world:
@Transactional
public void onPaymentSettled(PaymentSettled event) {
    if (shipmentRepo.existsByOrderId(event.orderId())) {
        return;
    }
    shipmentRepo.save(new Shipment(event.orderId(), ShipmentStatus.SCHEDULED));
}

For REST write endpoints, accept an Idempotency-Key header and persist it:
@PostMapping("/payments")
public PaymentResponse charge(
        @RequestHeader("Idempotency-Key") String key,
        @RequestBody ChargeRequest req) {
    return idempotencyStore.findByKey(key)
        .orElseGet(() -> {
            PaymentResponse result = paymentService.charge(req);
            idempotencyStore.save(key, result);
            return result;
        });
}

Sagas for multi-step business workflows
Don’t try to emulate XA transactions across services. Model workflows as compensating steps:
place-order ─▶ reserve-inventory ─▶ charge-card ─▶ schedule-shipping
                       │                 │                 │
                       ▼                 ▼                 ▼
                  on failure:       on failure:       on failure:
                    (none)       release-inventory  release-inventory,
                                                       refund-card

Orchestrated sagas (a coordinator drives the flow) are easier to debug. Choreographed sagas (services react to events) couple less tightly. Pick based on whether you value observability or decoupling more.
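A stripped-down sketch of the orchestrated variant, using hypothetical step interfaces; a real coordinator would also persist saga state so it can resume or compensate after a crash:

interface InventoryService { void reserve(String orderId); void release(String orderId); }
interface PaymentService   { void charge(String orderId);  void refund(String orderId); }
interface ShippingService  { void schedule(String orderId); }

public class OrderSagaOrchestrator {
    private final InventoryService inventory;
    private final PaymentService payments;
    private final ShippingService shipping;

    public OrderSagaOrchestrator(InventoryService inventory, PaymentService payments, ShippingService shipping) {
        this.inventory = inventory;
        this.payments = payments;
        this.shipping = shipping;
    }

    public void placeOrder(String orderId) {
        inventory.reserve(orderId);          // step 1: nothing to compensate if this fails
        try {
            payments.charge(orderId);        // step 2
        } catch (RuntimeException e) {
            inventory.release(orderId);      // compensate step 1
            throw e;
        }
        try {
            shipping.schedule(orderId);      // step 3
        } catch (RuntimeException e) {
            inventory.release(orderId);      // compensate step 1
            payments.refund(orderId);        // compensate step 2
            throw e;
        }
    }
}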
Observability: measure or hallucinate
You cannot tune what you can’t see. The minimum viable stack:
- Metrics — Micrometer + Prometheus. Track the RED trio per endpoint: Rate, Errors, Duration (p50/p95/p99).
- Tracing — OpenTelemetry with trace context propagation. Every slow request has a trace; every trace names the guilty span.
- Logs — structured JSON logs with a traceId field. Ship to Loki or Elasticsearch. Never grep individual pods in prod.
- Profiler — async-profiler or Java Flight Recorder. Turn it on when CPU is unexpectedly high. It will point at the line.
A custom timer you should wrap around every external call:
Timer.Sample sample = Timer.start(meterRegistry);
try {
    return client.charge(orderId, amount);
} finally {
    sample.stop(Timer.builder("external.call")
        .tag("service", "payments")
        .tag("method", "charge")
        .register(meterRegistry));
}

Now your Grafana dashboard tells you exactly which downstream is eating your p99. Guessing ends, engineering begins.
JVM-level knobs that actually matter
Most JVM tuning is folklore. These three are real:
Pick the right garbage collector.
- G1 (default) — good general-purpose, predictable pause times up to large heaps.
- ZGC — sub-millisecond pauses at the cost of roughly 5–15% throughput. Use for latency-sensitive services with heaps > 8 GB.
- Parallel — highest throughput, longer pauses. Only for batch / throughput jobs.
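For reference, switching collectors is a single startup flag (G1 needs none, being the default; exact flags vary by JDK version):

-XX:+UseZGC          # ZGC (add -XX:+ZGenerational on JDK 21/22 for the generational mode)
-XX:+UseParallelGC   # Parallel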
Size the heap with container awareness.
-XX:InitialRAMPercentage=60 -XX:MaxRAMPercentage=75

The JVM inside a container must know it’s in a container (JDK 17+ does by default). Leave 25–40% of the pod memory for metaspace, direct buffers, and the OS page cache.
Enable JFR in production.
-XX:StartFlightRecording=filename=recording.jfr,duration=5m,settings=profile

Zero meaningful overhead, huge diagnostic value when something goes wrong.
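If the service wasn't started with the flag, a recording can also be started and dumped on a running JVM with jcmd (a sketch; <pid> is the target process id):

jcmd <pid> JFR.start name=adhoc settings=profile duration=5m filename=/tmp/adhoc.jfr
jcmd <pid> JFR.dump name=adhoc filename=/tmp/adhoc-partial.jfr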
Load tests that tell you the truth
Three mistakes that make load tests lie:
- Testing with an empty database. Production has 50 GB; your test has 50 MB. Query plans are different. Seed realistic data.
- Ramping too fast. Jumping from 0 to 10k RPS in 5 seconds measures the warm-up path, not steady state. Ramp over 5–10 minutes.
- Measuring from a laptop. Network latency from your machine dwarfs service latency. Run load generators in the same region as the service.
A k6 script that does it right:
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  stages: [
    { duration: '5m', target: 500 },
    { duration: '15m', target: 500 },
    { duration: '5m', target: 0 },
  ],
  thresholds: {
    http_req_duration: ['p(95)<300', 'p(99)<800'],
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  http.get('https://api.example.com/products/42');
  sleep(0.2);
}

If p99 blows past your threshold, the report pinpoints when. Correlate with your metrics dashboard — you’ll see which dependency broke first.
The non-technical part: team topology
Conway’s Law doesn’t care about your architecture diagram. If three teams share a codebase, you’ll have a distributed monolith no matter how many services you draw. Scalable backends usually require:
- Service ownership — one team owns each service end-to-end (code, on-call, roadmap).
- Platform team — someone maintains the shared infrastructure (CI/CD, observability, base images). Without this, every team reinvents the same wheels.
- API contracts as first-class artifacts — versioned, documented, breaking-change-reviewed. Changes to public APIs cross team boundaries; they need a process.
Checklist: is your Java backend actually scalable?
- Virtual threads enabled on I/O-bound services
- Every network call has explicit timeouts (connect + read)
- Every write endpoint accepts an idempotency key
- Outbox or CDC for DB writes that publish events
- Circuit breakers with fallbacks around every external dependency
- Connection pool sized for DB capacity, not JVM optimism
- Read-only transactions route to replicas where appropriate
- Two-level cache (in-process + distributed) for hot reads
- RED metrics per endpoint, exported to Prometheus
- Distributed tracing with correlation IDs in logs
- Realistic load test with production-scale data
- A runbook per failure mode you’ve seen
Closing thought
Scaling a Java backend isn’t about picking the fanciest framework — it’s about removing synchronous dependencies one by one until the system can absorb failure instead of amplifying it. Every pattern in this article is a different way to answer the same question: what does this service do when the thing it depends on is slow, broken, or missing? Services that answer that well stay up. Services that don’t, don’t. Build for failure, measure ruthlessly, and keep the default path boring — the interesting part of a scalable backend is the shape of the graph, not the code at each node.