The Friday Afternoon Slack Message That Killed My Weekend

Or: how I learned that idempotency is a system property, not a bug fix.

The message arrived at 16:30 on a Friday. From the finance team lead. The kind of subject line that ruins weekends:

“We have a problem. Reconciliation showing duplicate payouts. ~30 this week. Can we talk?”

I clicked. Spreadsheet attached. Red cells in a column labeled Duplicate detected. A column of dollar amounts next to it. The sum at the bottom was the kind of number where you stop being annoyed and start being scared.

The duplicates weren’t subtle. Same user. Same round. Same prize. Two payouts. Sometimes minutes apart, sometimes seconds. The audit log confirmed two distinct INSERT records with the same business identifier — the field we used everywhere as the canonical “this is a unique settlement event” key.

I want to write about the next two weeks, because the eventual fix is well-known but the path to it taught me something about the level at which you treat a bug. The technical details aren’t novel. The framing is what I think is worth sharing.

If you’ve ever gotten a duplicate-data report from a downstream team and treated it as a bug in one specific service, this might save you a week.

What the system looked like

The platform processes high-volume financial transactions: bets placed, rounds settled, prizes paid. The relevant pipeline:

flowchart TD
    P[Settlement Producer] -->|Kafka topic: settlement-events| T[(Kafka Topic)]
    T --> A[Wallet Consumer<br/>credits user balance]
    T --> B[Notification Consumer<br/>sends UI push]
    T --> C[Audit Consumer<br/>writes ledger]
    A --> DB[(Postgres)]
    B --> WS[/WebSocket fan-out/]
    C --> LDG[(Audit store)]

Three independent consumers, each with its own consumer group. The wallet service was the one writing payouts to user balance. That’s where finance had spotted the duplicates.

Volume: a few thousand settlements per second at peak. The platform had been live for years. Idempotency had been “discussed” several times but never formalized — there were UNIQUE constraints on a few critical tables, an existsBy check in some services, but no consistent pattern across consumers. Each team had done what felt right at the time their service was written, often by engineers no longer on the team.

The duplicates had presumably been happening for a while. Finance had only caught them now because their reconciliation tool had been updated to flag same-day same-user payouts.

The investigation

I started where you’d expect: pulling the wallet service logs for one specific duplicate from the spreadsheet.

User had been credited at 14:23:11 for a winning of 250. Then again at 14:23:15. Same external_id. Same amount. Two distinct database records, four seconds apart.

Two processSettlement invocations for the same external_id, in different threads. The consumer had genuinely received the message twice and processed it twice. I pulled the consumer-group offset history and saw the rebalance: at 14:23:13, partition reassignment, and the offset of the first message was reread by the new owner.

So far, a normal at-least-once anomaly: rebalance during processing, message reprocessed, consumer not idempotent, duplicate in DB.

Then I noticed something in the producer logs: a send timeout retry exactly one second before the original 14:23:11 processing. Which meant the producer had also retried. I checked Kafka. Yes — two distinct offsets in the topic with the same payload, ~1 second apart.

The duplicate I was looking at had two upstream causes that had compounded. Producer retry created the first duplicate (in Kafka). Consumer rebalance created the second (in processing). Neither alone would have produced the four-second gap I saw in DB; both together did.

sequenceDiagram
    participant Prod as Producer
    participant K as Kafka topic
    participant C1 as Consumer (owner #1)
    participant C2 as Consumer (owner #2)
    participant DB as Postgres

    Prod->>K: send(externalId=X) [in flight]
    Note over Prod,K: ack lost on the wire
    Prod->>K: retry(externalId=X)
    Note over K: two distinct offsets,<br/>same payload

    K-->>C1: poll() returns offset 100
    C1->>DB: INSERT settlement(externalId=X)
    Note over C1: rebalance triggered<br/>before offset commit
    K-->>C2: poll() returns offset 100 again
    C2->>DB: INSERT settlement(externalId=X) [DUP]
    K-->>C2: poll() returns offset 101 (producer retry)
    C2->>DB: INSERT settlement(externalId=X) [DUP]

This was already enough to act on. But before I started designing a fix, I wanted to know whether this was the only place it was happening.

The first thing I tried (and what it taught me)

The triage took three hours. By Friday evening I had a working theory and a tactical fix. The wallet service was double-processing. Add an existsByExternalId check before the credit. Wrap the underlying INSERT in a UNIQUE constraint as a safety net. Use INSERT ... ON CONFLICT DO NOTHING semantics so that the duplicate path was a clean no-op rather than an exception that flooded the logs. Tested, code-reviewed, deployed by Monday afternoon.

For the wallet path, this worked. Finance reconciliation the following Friday showed zero duplicate payouts from that pipeline. I considered the ticket closed.

On Tuesday of the second week, finance escalated again — different problem, same shape. Push notifications were duplicating. Some users had received the same “you won” toast twice; one had complained. I pulled logs and saw the same pattern: rebalance during processing, message reprocessed, no application-layer dedup, two notifications sent.

That was when I realized I’d framed the problem wrong.

The wallet service hadn’t been a buggy consumer. It had been an example of a category. We had something like a dozen Kafka consumers across the platform, and almost none of them had idempotency in any meaningful sense. They had been written over several years by different engineers, each one assuming the consumer above and below it was handling duplicates somehow. None of them were. Wallet was just the consumer where duplicates had financial impact, so finance noticed. Notification, ledger, audit — they were all silently double-processing too. Nobody had been measuring.

The cost of having framed it wrong: the wallet-only fix I’d shipped didn’t generalize. The existsByExternalId pattern was tightly coupled to the wallet domain — it took a wallet-specific repository, returned a wallet-specific result. To fix notifications I’d have to write something similar from scratch. To fix ledger, again. Each consumer would get its own slightly-different idempotency code, each a new place for bugs to hide.

The right framing was: idempotency is a cross-cutting concern, like authentication or logging. It needs a unified pattern enforced consistently across consumers, not bespoke implementations. That meant going back, designing the framework properly, then retrofitting wallet into it — undoing some of what I’d shipped the previous Friday.

I lost about a week to the local-fix framing. That’s the cost I’d most want back. The lesson here, in retrospect, isn’t about UNIQUE constraints or existsBy checks. Both of those things are fine; both ended up in the final framework. The lesson is about the level at which you treat a problem. When finance flags duplicates from one service, the question isn’t what’s wrong with that service. It’s why didn’t every service have a guard against this. And if the answer is “we never required them to” — that’s the actual bug, and it’s not in any service’s code.

The other approach I evaluated

Once I had the right framing, I needed to pick the layer at which idempotency should live. The most ambitious option was Kafka’s exactly-once semantics. With transactional.id on the producer, read_committed on the consumer, and sendOffsetsToTransaction for the offset commit, you get end-to-end exactly-once for any pipeline that’s Kafka in, Kafka out. The pitch is exactly what I wanted: each message is processed exactly once, period.

I read the docs carefully. It took about an hour to talk myself out of it.

Our pipeline wasn’t Kafka-to-Kafka. The wallet service read from Kafka and wrote to Postgres. The Postgres write was outside Kafka’s transaction; if the Kafka commit succeeded but the Postgres INSERT failed, we’d lose money. If the Postgres INSERT succeeded but the Kafka commit failed (rare but possible during failover), we’d duplicate it on the retry. EOS in Kafka guarantees exactly-once for Kafka-internal operations. Anything that touches external state — a database, a third-party API, a cache — sits outside that guarantee.

For our shape, exactly-once Kafka semantics would solve a problem we didn’t have, at the cost of significant performance overhead and operational complexity (managing transactional IDs, dealing with ProducerFencedException on restarts, etc.). The actual problem — making the pipeline tolerant of duplicate delivery to external state — required idempotency in the application layer regardless. EOS would be a duplicative protection.

This was the moment the framework started taking shape. At-least-once is the contract everywhere, including with EOS configured. Idempotency is the application’s half of the contract and has to live where the external state lives. The Kafka-level configurations are useful for reducing the rate of duplicates — fewer producer retries means fewer duplicates to catch — but they can’t be the only line of defense.

The shape of the fix

The mental model I landed on was: each layer plays a specific role in reducing duplicates, but the application layer is the only one that can guarantee idempotency.

Producer: enable idempotent producer mode. Sequence numbers per partition mean the broker dedupes producer retries within a session. This is essentially free and removes the largest source of upstream duplicates.
Broker: ensure that what’s written stays written. replication.factor=3, min.insync.replicas=2, acks=all. Without this, you can have phantom messages that vanish during failover, leading to inconsistent reprocessing.
Consumer: commit offset after processing completes, not before. We accept that crashes will cause reprocessing — that’s fine, because the next layer makes it safe. The default auto-commit was a holdover from earlier, simpler services and got removed everywhere.
Application: explicit idempotency check on a business identifier, with a UNIQUE constraint as the safety net. The pattern uses a propagated X-Request-Id header (set at the original request) carried through every Kafka header, with INSERT ... ON CONFLICT DO NOTHING semantics where possible.

The producer config:

acks: all
enable.idempotence: true
retries: 2147483647 # MAX_INT
max.in.flight.requests.per.connection: 5

The consumer config:

enable.auto.commit: false
isolation.level: read_committed

The listener pattern:

@KafkaListener(topics = "settlement-events", containerFactory = "manualAckFactory")
@Transactional
public void onSettlement(SettlementEvent event, Acknowledgment ack) {
    int rowsInserted = walletRepository.insertIfNotExists(
        event.getExternalId(),
        event.getUserId(),
        event.getAmount()
    );

    if (rowsInserted == 0) {
        log.debug("Duplicate settlement skipped: {}", event.getExternalId());
        ack.acknowledge();
        return;
    }

    walletService.creditUser(event.getUserId(), event.getAmount());
    ack.acknowledge();
}

insertIfNotExists wraps a Postgres INSERT ... ON CONFLICT DO NOTHING, which returns affected row count. Zero meant “we’ve seen this external_id before, skip.” No exception, no log noise. The pattern was extracted into a small IdempotentInsertSupport library that every consumer used the same way, with a generic existsBy fallback for cases where ON CONFLICT didn’t apply (e.g., compound business keys spanning multiple tables).

The whole framework took two weeks to roll out across all financial consumers. Most of that time wasn’t writing code — it was finding consumers, understanding what their natural idempotency keys should be, and adding the X-Request-Id header propagation in the producers that hadn’t been carrying it.

What I didn’t expect

A few surprises after the rollout.

Throughput went up, not down. I’d budgeted a 5–10% drop from acks=all plus the extra dedup check. Instead, peak-hour throughput on critical pipelines went up by roughly 3–5%. The reason became clear once I looked at retry metrics: pre-fix, the producer’s retry logic had been firing constantly during routine network blips, and each retry cascaded — failed retries triggered more retries as the system tried to converge. Once enable.idempotence=true cut spurious retries near zero, the system spent more time doing actual work and less time recovering from itself. The acks=all overhead was real but smaller than the savings.

We caught duplicates we didn’t know existed. In the first week, the “duplicates blocked at app layer” counter (which I’d added two days into the rollout, after realizing I had no way to measure my own fix) showed unexpectedly high numbers from a service we hadn’t suspected. Turned out a different consumer had been running with broken auto-commit semantics and silently double-processing about 0.1% of messages for over a year. We’d been losing money on it the whole time, in amounts too small for finance to flag with their old reconciliation rules. The new framework caught it on day one of that consumer’s rollout.

Rebalances stopped being an event. Pre-fix, a Kafka rebalance was a low-grade panic moment — someone had to monitor for “bad effects.” Post-fix, rebalances became routine. The system tolerated them at the application layer, so the operational cost of monitoring them dropped to zero. We started doing rolling deploys without watching the dashboards.

What I’d do differently

The biggest miss: I shipped without the “duplicates blocked at app layer” metric. I added it later, when I needed to convince finance the problem was actually fixed. It should have been part of the spec from the start. There’s a general principle: when you ship a fix for a class of bug, ship the measurement of the fix at the same time. Otherwise you’ve built something whose effectiveness you can’t prove.

I’d also push for a Schema Registry from the start. The X-Request-Id header propagation was a coordination problem — every producer needed to set it, every consumer needed to use it. Without compile-time enforcement, this became a tribal-knowledge thing that we had to chase down service by service. With Schema Registry, headers can be part of the topic contract, and producers without them fail at deploy time rather than silently in production.

Finally, the organizational thing: I should have written the runbook before shipping. We had three on-call incidents in the first month where engineers saw “duplicate skipped” log lines and treated them as bugs. They weren’t bugs — they were the system working as designed. But without documentation, on-call had no way to know.

The take-home

The thing I now believe more strongly than before: at-least-once is not a setting you turn on or off. It’s the default contract of every distributed messaging system, including Kafka, including the systems built on top. Idempotency isn’t a fix for at-least-once — it’s the application’s half of the contract. The temptation is to look for the setting that fixes duplicates: acks=all, enable.idempotence, read_committed. These all help. None of them are sufficient. Idempotency in the application layer is a property your code has or doesn’t, and no Kafka configuration can reach into your database.

The other thing: defense in depth isn’t paranoia, it’s the architecture of distributed reliability. Each layer catches a different failure mode. Producer idempotence catches retry storms. Broker replication catches phantom messages. Manual commit catches premature offset commits. Application-layer dedup catches everything else. Removing any single layer leaves most of the problem; together, they reduce duplicate-induced incidents to functionally zero.

Six months after the rollout, finance reconciliation has flagged zero duplicate payouts. Not “fewer.” Zero. Not because the system never tries to send duplicates — it tries constantly, especially during deploys and broker maintenance. But by the time those duplicates reach the application, four layers of guards have already caught them.

This is what reliable distributed systems look like, I’ve come to think: not the absence of failures, but the routine, invisible handling of them.

If you want a deeper dive on the idempotency-key pattern itself — server-side storage, hashing strategies, retention windows — I wrote one earlier: Idempotency — A Practical Guide for Backend Engineers. And if you’ve fought a similar battle — especially the part where the first fix worked locally and made the systemic problem invisible until it hit somewhere else — I’d love to hear about it. There’s a particular kind of engineering humility that comes from realizing your “fix” was just a tactical patch, and I think it’s worth talking about more.

The Friday Afternoon Slack Message That Killed My Weekend

What the system looked like

The investigation

The first thing I tried (and what it taught me)

The other approach I evaluated

The shape of the fix

What I didn’t expect

What I’d do differently

The take-home

Related essays

Idempotency — A Practical Guide for Backend Engineers

The Customer Support Ticket That Taught Me to Profile Before Designing

When the Backend Spoiled the Animation