Or: why concurrent updates fail in ways your staging environment can’t reproduce.
The ticket came through normal customer support and got escalated to engineering with the urgency that usually means the customer was important enough to address quickly. Subject:
“User reports exposure mismatch!!!!!”
Five exclamation marks. The number scaled with how much money the user had bet, apparently.
The complaint was simple. The user had placed several bets in rapid succession, knew exactly how much he’d staked (about 1000 in our currency), and was looking at his exposure dashboard which showed approximately 600. He wanted to know which number was real.
I pulled his account. The bets were there, all of them, in the bets table, totaling 1024. The aggregator table — which held running per-user-per-round totals used for risk management and the UI — showed 612.
So the bets were correct. The aggregator was lying.
I want to write about the next month, because what looked like a simple concurrency bug taught me something about the gap between knowing concurrency primitives and knowing which one your data actually needs. The two attempts I made before landing on the right approach were both reasonable choices. They were both wrong for our specific shape of contention. The lesson was less “use this primitive” and more “understand the data before choosing.”
What the aggregator did
The service computes running totals as bets are placed in real time: total stake, total exposure, number of bets, all keyed by (userId, roundId). The output drives the user’s UI, the risk-management dashboard, and the post-round settlement service.
Architecturally:
flowchart LR
P[Bet Placement Service] -->|Kafka topic: place-bet<br/>partitioned by userId| T[(Kafka Topic)]
T --> A1[Aggregator Pod #1]
T --> A2[Aggregator Pod #2]
T --> A3[Aggregator Pod #3]
T --> A4[Aggregator Pod #4]
A1 --> DB[(Postgres<br/>aggregator table)]
A2 --> DB
A3 --> DB
A4 --> DBMultiple pods, each with concurrency 10, all writing to the same aggregator table. Standard horizontal scale-out. Each bet event triggered a read-modify-write cycle:
@KafkaListener
public void onBet(BetEvent event) {
Aggregator agg = aggregatorRepo.findByUserAndRound(
event.getUserId(), event.getRoundId()
).orElseGet(this::createNew);
agg.addBet(event.getAmount());
agg.recalculateExposure();
aggregatorRepo.save(agg);
}Read. Mutate. Write. The textbook shape, and the textbook problem.
The failure mode I drew on the whiteboard
sequenceDiagram
participant A as Pod A
participant B as Pod B
participant DB as Postgres
Note over A,B: same (user, round) row
A->>DB: SELECT exposure → 100
B->>DB: SELECT exposure → 100
Note over A,B: both saw the same baseline
A->>DB: WRITE exposure = 100 + 50 = 150
B->>DB: WRITE exposure = 100 + 30 = 130
Note over DB: final = 130<br/>expected = 180<br/>Pod A's +50 lostThis is the classic lost-update anti-pattern. Two transactions read the same baseline. Each computes against it independently. Each writes back. Last write wins.
The default Postgres isolation level (READ_COMMITTED) doesn’t prevent this — it prevents dirty reads, which is a different anomaly. Lost update is only prevented by SERIALIZABLE isolation (expensive) or by explicit locking (cheap if scoped right, expensive if not).
I reproduced the bug on demand with a test rig: 50 concurrent bets for the same user/round produced an aggregator value of about 50–60% of expected. Reliably, every time. The more concurrent writers, the more likely two would overlap, and the more updates were lost.
For most users, who place 1–3 bets per round, contention was effectively zero and the bug was invisible. For high-rollers placing 30–50 bets in rapid bursts, contention was nearly guaranteed. We had been silently losing high-roller exposure data every time a serious bettor played.
The first attempt
Optimistic locking was the obvious starting point. Add a version column:
UPDATE aggregator
SET exposure = ?, version = version + 1
WHERE user_id = ? AND round_id = ? AND version = ?;If the WHERE matches no rows, someone else updated the row in the meantime. Catch the resulting conflict, re-read, recompute, retry with bounded attempts.
The reasoning was deliberate. Optimistic locking outperforms pessimistic locking when actual conflicts are rare, because it avoids holding any lock at all. Each writer proceeds independently and only pays for coordination when an actual conflict occurs. For a workload where most writes target different rows or hit the same row at different times, this is a meaningful win.
I’d validated my mental model against staging. Synthetic load with realistic-looking distribution: thousands of users, varying bet rates, occasional bursts. Throughput stable. p99 latency under target. Conflicts logged at low rate, all resolved within one retry. On the basis of this evidence, I shipped to production.
By the end of the first day in production, the metrics were diverging from staging in a way I hadn’t anticipated.
Most user segments looked identical. But there was a partition of users — small in count, large in activity — where the consumer was struggling. Kafka lag was climbing on those specific partitions. Throughput on aggregator updates for that segment had dropped to roughly a fifth of the rest. The retry counter on optimistic conflicts was firing at rates that didn’t match anything I’d seen in load tests.
I pulled the production write distribution and held it next to my staging synthetic load. The shapes were almost completely different.
In staging, I’d modeled writes as a smooth Poisson distribution per user — average 5 bets per round, evenly distributed in time. In production, the actual distribution was bimodal: 99% of users were close to my Poisson model, but a tiny fraction — perhaps fifty users out of fifty thousand — had a burst pattern. Thirty to fifty bets in under a second, all on the same (userId, roundId). For these users, optimistic locking degraded into a contention storm. Reader A reads version 1, computes new value, writes version 2. Meanwhile readers B through L have already read version 1; they all conflict, retry, conflict again with version 3, retry, conflict, retry. By the time the dust settled, the system had done ten times more work than necessary.
I’d known about this failure mode in the abstract — the textbook caveat is that optimistic locking doesn’t suit “hot key” workloads. But abstract knowledge wasn’t enough; the question I should have asked, before choosing optimistic, was what specifically does our hot key distribution look like. Mean tells you nothing about hot keys; the tail tells you everything. I’d modeled the mean.
Six hours after the deploy, I rolled back. We were back to silent data loss instead of loud throughput collapse. Of those two failure modes, silent loss was operationally less painful in real time, even though it was strictly worse for data integrity.
The mistake wasn’t choosing optimistic locking. It was a defensible default given the workload as I’d characterized it. The mistake was not characterizing the workload before choosing. I had a year of production data and chose to model it with synthetics. Five minutes spent looking at the actual histogram of writes-per-key would have shown me the bimodal shape, which would have eliminated optimistic locking from consideration.
The second attempt
Next: pessimistic database locking. SELECT ... FOR UPDATE acquires a row lock at the database level, held for the duration of the transaction. No retries needed. Writers wait their turn. Lost updates impossible by construction.
I tested it. Correctness was perfect. Operational profile was bad in a different way.
FOR UPDATE locks are held for the duration of the database transaction. For our use case, that meant the lock was held while the aggregator row was read, the application performed the calculation (which occasionally included an outbound call to a validation service), and the new value was written back. Tens to hundreds of milliseconds per critical section, holding a database connection the whole time.
With 40 concurrent listener threads across 4 pods, and high-roller events landing on the same row, the database connection pool got saturated quickly. Threads queued for connections that were waiting for locks. The waiting wasn’t just on the lock — it was on the full lifecycle of whatever transaction was holding the lock, including the third-party calls that occasionally took a second or more.
p99 latency in load test went from 200 ms (with the broken optimistic version) to 2.4 seconds (with FOR UPDATE). At sustained high-roller load, the connection pool exhausted within minutes. The system was correct but unresponsive.
This was actually worse than optimistic locking, just along a different axis. Optimistic failed under contention but kept the system responsive for non-contended users. Pessimistic serialized contention correctly but starved the connection pool, affecting all users.
I needed the correctness of pessimistic locking with the resource profile of optimistic locking. Specifically: locking had to happen outside the database transaction, so connections wouldn’t be held during waits. Locking had to be scoped to the actual contention key (userId+roundId), not to the database connection or the entire row. Locking had to be cheap to acquire and release — the existing patterns were both too coarse.
This is exactly what distributed locks are for, and we had Redis in the stack already.
The realization, and the implementation
The aha moment was reframing the question: what’s the smallest possible critical section?
The actual conflict was between two independent pods both trying to update the same aggregator row. The bet processing pipeline as a whole didn’t need to be serialized. Just the update of the aggregator value needed to be serialized — and only with respect to other updates of the same key.
I split bet processing into two stages.
Stage 1, synchronous: persist the bet to the bets table, no aggregator update. This is fast, contention-free, and maps directly to what the customer cares about (“did my bet go through?”). The Kafka offset gets committed here.
Stage 2, asynchronous: a separate executor, triggered by an in-process event after Stage 1 commits, acquires a Redis lock on userId:roundId, reads the aggregator, updates it, writes back, releases the lock. If contention is high, Stage 2 serializes — but only for that specific user+round, and without holding any database connection during the wait.
Stage 1:
@KafkaListener(topics = "place-bet", concurrency = "10")
@Transactional
public void onBetPlacement(BetEvent event, Acknowledgment ack) {
if (betRepository.existsByExternalId(event.getExternalId())) {
ack.acknowledge();
return;
}
Bet bet = betRepository.save(toEntity(event));
appEventPublisher.publishEvent(new BetPersistedEvent(bet));
ack.acknowledge();
}Stage 2:
@EventListener
@Async("aggregatorExecutor")
public void onBetPersisted(BetPersistedEvent event) {
String lockKey = "lock:aggregator:" + event.getUserId() + ":" + event.getRoundId();
RLock lock = redissonClient.getLock(lockKey);
try {
if (!lock.tryLock(10, 30, TimeUnit.SECONDS)) {
log.error("Could not acquire lock: {}", lockKey);
return;
}
Aggregator agg = aggregatorRepo.findByUserAndRound(...).orElseGet(this::createNew);
agg.addBet(event.getAmount());
agg.recalculateExposure();
aggregatorRepo.save(agg);
kafkaTemplate.send("aggregator-state", agg);
} finally {
lock.unlock();
}
}The lock TTL is 30 seconds — long enough to outlive any reasonable processing duration, short enough to release a stranded lock if a pod dies mid-update. Redisson’s watchdog mechanism extends the lease automatically as long as the holder is alive, so we don’t accidentally release a still-active lock.
The flow:
sequenceDiagram
participant K as Kafka topic
participant L as KafkaListener<br/>(Stage 1)
participant DB as Postgres
participant Bus as Spring AppEvent
participant Ex as Async Executor
participant R as Redis (RLock)
K-->>L: poll() → bet event
L->>DB: INSERT bet (idempotent on externalId)
L->>Bus: publishEvent(BetPersisted)
L->>K: ack offset
Bus-->>Ex: dispatch (Stage 2)
Ex->>R: tryLock(user:round, ttl=30s)
R-->>Ex: ACQUIRED
Ex->>DB: read aggregator
Ex->>DB: write aggregator (recomputed)
Ex->>R: unlockThe async executor is sized larger than the listener concurrency, so high-roller events queue inside the executor (cheaply, in JVM memory) rather than back-pressuring Kafka:
@Bean("aggregatorExecutor")
public Executor aggregatorExecutor() {
var executor = new ThreadPoolTaskExecutor();
executor.setCorePoolSize(50);
executor.setMaxPoolSize(150);
executor.setQueueCapacity(200);
executor.initialize();
return executor;
}What surprised me after deploy
Latency dropped, didn’t rise. I’d budgeted some latency increase from adding a network round-trip to Redis. Instead p99 end-to-end dropped from ~800 ms (under the broken optimistic version) to ~120 ms. The explanation became clear once I looked at flame graphs: the optimistic version was burning CPU on retry loops, and that CPU was contended across all bet processing, slowing everything down. The new version held the lock for 5–15 ms, did the work, released. No retries. No CPU thrashing. The serialization happened cheaply in Redis instead of expensively in the JVM.
Lock contention was extremely concentrated. The “hot keys” turned out to be exactly three users — the platform’s most active high-rollers. These three users represented less than 0.001% of the user base but generated 30% of all aggregator lock acquisitions. The rest of the userbase rarely waited for a lock at all. This data shape was exactly what should have ruled out optimistic locking in the first place — and now I could see it clearly because I’d added the dashboards I should have built before the original deploy.
Bet placement throughput tripled. Stage 1 was now non-contending and fast, so we could increase Kafka consumer concurrency from 10 to 30 per pod without saturating the database. The aggregator stage was still serialized, but the upstream pipeline ran much faster. A win we hadn’t planned for.
What I’d do differently
The biggest miss was profiling. If I’d looked at write-frequency-per-key in production data before choosing a strategy, optimistic locking would have been disqualified in five minutes. Instead I burned a week on it, including the rollback. There’s a general principle: when choosing between concurrency strategies, the distribution of contention matters more than the average — most strategies handle low contention; only some handle the long tail.
I’d also instrument lock acquisition latency from day one. The current dashboards show acquisition times by percentile — useful for spotting hot keys and detecting Redis issues. I added them after shipping. They should have been part of the original spec.
And one architectural thing: I’d think harder about whether a single aggregator row per (user, round) is the right shape, or whether sharding on a finer key (e.g., bet category) would have spread contention naturally. We didn’t go there because the aggregator was used for round-level totals and re-aggregating across shards added complexity downstream. But in retrospect, eliminating hot keys at the data model level is a stronger long-term solution than fighting them at the lock level. The two-stage pattern works for our scale; at 10× scale I’d expect to need the data-model fix as well.
The take-home
The thing I now believe more strongly: default database isolation does not protect against concurrent updates. I’d nominally known this, but I’d been operating with a vague sense that “Postgres handles concurrency.” It does — for some definitions of “handle.” It prevents dirty reads, phantom reads, certain anomalies. It does not prevent two transactions from independently overwriting each other’s writes. That’s lost update, and you have to defend against it explicitly.
The other thing: the right concurrency primitive depends on the contention shape. Optimistic locking is great when contention is rare. Database FOR UPDATE is great when contention is moderate and operations are short. Distributed locks at the application layer are great when contention is concentrated on hot keys and operations need fast resource release. None of these is universally “right.” The shape of your data determines the shape of your fix — which means measuring the shape of your data should come before choosing the fix.
The aggregator service has been running with the two-stage pattern for over a year. Zero lost-update incidents. The hot keys are still hot — three high-rollers continue to be three high-rollers — but the system handles them invisibly. From the outside it’s just a service that always reports the right number.
That, I’ve come to think, is what reliable distributed systems look like. Not the absence of contention. The careful, invisible handling of it.
If you want the broader companion piece — how this aggregator pattern fits with the cross-cutting idempotency framework I built around the same time — I wrote that one up in The Friday Afternoon Slack Message That Killed My Weekend. And if you’ve been bitten by lost update — especially the version where the bug had been quietly running for months before you noticed — I’d love to commiserate. There’s a special dread in finding a year-old data-integrity bug, and I don’t think we talk enough about that particular kind of engineering grief.