Or: how often the answer to “I need to delay processing” lives in primitives I’d been ignoring.

The product manager messaged me on a Wednesday afternoon with a screen recording.

“This looks like a bug. The result toast appears before the animation finishes.”

I watched it. The animation — a spinning visualization that resolves into a result — was running for its full two-second duration. About a quarter-second before the spin completed, a toast notification appeared in the corner of the screen: “you won 250.” The toast told you the answer before the spin had visually arrived at it.

It was a small thing. The kind of thing nobody notices in any individual instance. But once you saw it, you couldn’t unsee it. It was a spoiler. The whole point of the animation was the moment of resolution; revealing the answer beforehand defeated the design.

I want to write about the design discussion that followed, because it’s an example of a problem where the initial framing — “we need a delay” — points at several reasonable approaches, most of which are wrong for this specific shape of problem. Choosing the right one taught me something about how often the answer to “I need to delay processing” lives in primitives I’d been ignoring.

If you’ve ever needed sub-minute event delays in a Kafka pipeline, this might save you from over-engineering.

The system

The platform pushes real-time game events from backend to frontend over WebSocket:

flowchart LR
    S[Game State Service] -->|Kafka topic: round-results| K[(Kafka)]
    K --> D[Real-Time Delivery Service]
    D -->|WebSocket frame| B[Browser]
    B --> Anim[animation runs] --> Reveal[result reveal]

Backend processing was variable. Sometimes a result was published immediately when a round closed; sometimes it took a few hundred milliseconds for downstream services to settle their part. The animation duration was fixed at two seconds.

The race between backend completion time and animation duration determined whether the toast spoiled the result. If the backend was fast (~100 ms), the toast arrived 1.9 seconds before the animation ended, a clear spoiler. If the backend was slow (~1.8 s), the toast arrived 0.2 seconds early: still a spoiler, just briefer.

Conceptually, the fix was straightforward: deliver the toast two seconds after the round closes, and never earlier, regardless of backend processing time. If the backend is fast, wait. If the backend is slow, deliver as soon as the result is ready (you’ll be late to the animation, but that’s a different problem class).
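That rule is one line of arithmetic. Here is a minimal sketch of it; the names (`ToastTiming`, `remainingWaitMillis`) are made up for illustration and are not from our codebase, and it assumes the animation starts the moment the round closes.

```java
import java.time.Duration;

/** Illustrative only: how long to hold a result so it lands when the animation ends. */
final class ToastTiming {

    // Assumption: the animation starts when the round closes and lasts a fixed 2 s.
    static final Duration ANIMATION = Duration.ofSeconds(2);

    /**
     * @param roundClosedAtMillis when the round closed
     * @param resultReadyAtMillis when the backend finished producing the result
     * @return millis to wait before delivering; 0 if the backend was already late
     */
    static long remainingWaitMillis(long roundClosedAtMillis, long resultReadyAtMillis) {
        long deliverAt = roundClosedAtMillis + ANIMATION.toMillis();
        return Math.max(0L, deliverAt - resultReadyAtMillis);
    }

    public static void main(String[] args) {
        System.out.println(remainingWaitMillis(0, 100));   // fast backend: 1900
        System.out.println(remainingWaitMillis(0, 1_800)); // slow backend: 200
        System.out.println(remainingWaitMillis(0, 2_300)); // late backend: 0
    }
}
```

The interesting part is not the arithmetic but where the wait is allowed to live, which is what the rest of the discussion was about.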

The implementation question was: where does the wait live, and how do we make it durable?

The first approach I evaluated

The first concrete proposal in design discussion came from a teammate: an external scheduling service. Postgres table with execute_at timestamps, a scheduled poller scanning for due events, due events published into Kafka:

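CREATE TABLE delayed_events (
    id BIGSERIAL PRIMARY KEY,
    payload JSONB NOT NULL,
    execute_at TIMESTAMPTZ NOT NULL,  -- time-zone aware, so comparisons with Instant.now() are unambiguous
    status VARCHAR DEFAULT 'PENDING'
);

CREATE INDEX idx_due ON delayed_events (execute_at) WHERE status = 'PENDING';

@Scheduled(fixedDelay = 100)  // next run starts 100 ms after the previous one finishes
@Transactional                // row locks taken by findDueAndLock are held until commit
public void publishDueEvents() {
    // findDueAndLock is expected to run SELECT ... FOR UPDATE SKIP LOCKED with a limit of 100
    List<DelayedEvent> due = repository.findDueAndLock(Instant.now(), 100);
    for (var event : due) {
        // send() is asynchronous; a crash before commit republishes the row (at-least-once)
        kafkaTemplate.send("round-results", event.getPayload());
        repository.markSent(event.getId());
    }
}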

The proposal had merit. Persistent storage means delayed events survive a pod restart. Per-event delay supports arbitrary durations. Multiple workers can compete for due rows via SELECT ... FOR UPDATE SKIP LOCKED, a proven way to scale the poller horizontally. We had similar patterns elsewhere in the platform for delayed jobs that ran in the minutes-to-days range.

I sat with the proposal for half an hour and ruled it out. The reasons stack up against it for this specific problem.

Operational complexity disproportionate to the problem. Adding a Postgres-backed scheduler means adding a new persistent surface to monitor, back up, recover, and reason about. For a 2-second UI sync delay, that’s significant overhead for marginal benefit. The cost-benefit only works if you have several use cases that justify the new dependency, and at the time we didn’t. The scheduler would be sitting there doing one job, costing operational attention that could go elsewhere.

Latency precision tied to polling cadence. With a 100 ms polling interval, an event scheduled for exactly T+2s goes out anywhere between T+2s and about T+2.1s, depending on where the due time falls between polls: up to 100 ms of jitter on top of the target. For our use case (synchronizing with a 2-second visual animation), 100 ms of unpredictable jitter was more disruptive than a constant offset would have been. Users would notice the variation more than a stable extra wait. To reduce jitter we’d shorten the polling interval, which increases DB load. The shape of the trade-off didn’t fit the shape of the problem.
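To make the jitter concrete, here is a small sketch under a simplifying assumption: the poller ticks at exact multiples of the period and publishes an event on the first tick at or after its due time. Query and publish time are ignored, and a real `fixedDelay` poller drifts a little because the delay is measured from the end of the previous run. `PollingJitter` is a hypothetical name.

```java
/** Illustrative only: lateness added by a fixed-cadence poller. */
final class PollingJitter {

    /**
     * Assumes ticks at 0, period, 2*period, ... and that an event is published
     * on the first tick at or after its due time.
     */
    static long latenessMillis(long dueAtMillis, long periodMillis) {
        long remainder = dueAtMillis % periodMillis;
        return remainder == 0 ? 0 : periodMillis - remainder;
    }

    public static void main(String[] args) {
        long period = 100;
        System.out.println(latenessMillis(2_000, period)); // due exactly on a tick: 0
        System.out.println(latenessMillis(2_001, period)); // just missed a tick: 99
        System.out.println(latenessMillis(2_050, period)); // mid-interval: 50
    }
}
```

Under that model the lateness is anywhere from 0 up to just under one period, and which value you get depends on when the round happened to close.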

Architectural mismatch with the rest of the pipeline. Our event pipeline was end-to-end async: Kafka producer → Kafka topic → Kafka consumer → WebSocket out. Inserting a synchronous DB write plus a polling loop in the middle introduced an architecturally different element — one with different failure modes, different tracing, different testing requirements. Not a deal-breaker individually, but the value of architectural uniformity for long-term reasoning was higher than the value of using the most expedient pattern.

Right tool, wrong problem. This is the framing I keep coming back to. The Postgres-poller pattern is excellent for delays in minutes-to-days, where durability is essential and precision is loose. Our use case was the opposite: delays in seconds, where precision was critical and durability was already provided by Kafka itself. Reaching for the heavier tool would have been over-engineering.

The second approach (and a Thread.sleep that didn’t ship)

Once we’d ruled out the external scheduler, the discussion turned to keeping the delay logic in-pipeline. Two ideas surfaced.

Producer-side scheduling

Schedule the publish, not the processing. The producer holds the message in a ScheduledExecutorService and sends it after the delay:

public void publishResult(RoundResult result) {
    scheduledExecutor.schedule(
        () -> kafkaTemplate.send("round-results", result),
        2, TimeUnit.SECONDS
    );
}

This sidesteps consumer-side complexity entirely. The consumer sees no delay, the producer just waits before sending.

The problem is durability. The scheduled task lives in pod memory. If the pod is killed, drained for a deploy, or OOM’d in the window between scheduling and sending, the message is lost. There’s no record of it anywhere durable. For UI animation timing you might initially say “fine, we’ll lose a few” — but the failure mode is correlated, not random. A deploy doesn’t kill one task; it kills all pending tasks on every pod that gets rolled. We’d lose hundreds at once during routine maintenance. That’s not a UI annoyance; that’s a customer-visible incident.
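The loss is easy to reproduce without Kafka at all. The sketch below uses only the JDK: a counter increment stands in for `kafkaTemplate.send(...)`, and `shutdownNow()` stands in for the pod being killed. `InMemoryDelayLoss` and `scheduleThenKill` are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: delayed sends held in memory vanish when the process stops. */
final class InMemoryDelayLoss {

    /** Schedules `count` fake sends 2 s out, then stops immediately. Returns {sent, dropped}. */
    static int[] scheduleThenKill(int count) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger sent = new AtomicInteger();
        Runnable fakeSend = sent::incrementAndGet; // stand-in for the real Kafka send
        for (int i = 0; i < count; i++) {
            executor.schedule(fakeSend, 2, TimeUnit.SECONDS);
        }
        // shutdownNow() hands back the tasks that never started; they will not run.
        List<Runnable> dropped = executor.shutdownNow();
        return new int[] {sent.get(), dropped.size()};
    }

    public static void main(String[] args) {
        int[] result = scheduleThenKill(3);
        System.out.println("sent=" + result[0] + " dropped=" + result[1]); // sent=0 dropped=3
    }
}
```

A graceful shutdown hook could wait for pending tasks to drain, but that only covers clean stops. An OOM kill gives you no such hook.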

The whole point of putting events in Kafka is to make them durable. Bypassing Kafka for the delay defeats the architecture.

Thread.sleep in the listener

This came up next, suggested casually as the obvious thing if we were keeping the delay on the consumer side. I had to push back on it explicitly.

The issue isn’t that Thread.sleep doesn’t work in isolation. It does. The issue is max.poll.interval.ms, the maximum wall-clock time a consumer is allowed between two poll() calls before the group considers it failed. With max.poll.records=500 (the default) and a 2-second sleep per record, a single full batch takes a thousand seconds, nearly seventeen minutes, between polls. The default max.poll.interval.ms is five minutes. The consumer gets evicted from the group, the partition is reassigned, and the new consumer picks up the same backlog and starts sleeping through it too. You’ve created a rebalance cascade that can take the partition offline for tens of minutes while it converges.
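The numbers are worth checking by hand. This sketch just multiplies them out, using the Kafka defaults quoted above and ignoring real processing time; `PollBudget` is a hypothetical helper.

```java
/** Illustrative only: worst-case time between poll() calls with a per-record sleep. */
final class PollBudget {

    static long batchMillis(int maxPollRecords, long sleepPerRecordMillis) {
        return (long) maxPollRecords * sleepPerRecordMillis;
    }

    static boolean exceedsPollInterval(int maxPollRecords, long sleepPerRecordMillis,
                                       long maxPollIntervalMillis) {
        return batchMillis(maxPollRecords, sleepPerRecordMillis) > maxPollIntervalMillis;
    }

    public static void main(String[] args) {
        int maxPollRecords = 500;              // default max.poll.records
        long sleepMillis = 2_000;              // the 2-second delay
        long maxPollIntervalMillis = 300_000;  // default max.poll.interval.ms (5 minutes)

        System.out.println(batchMillis(maxPollRecords, sleepMillis)); // 1000000
        System.out.println(
            exceedsPollInterval(maxPollRecords, sleepMillis, maxPollIntervalMillis)); // true
        // Largest full batch that still fits inside the interval:
        System.out.println(maxPollIntervalMillis / sleepMillis); // 150
    }
}
```

So even before touching max.poll.records, a full default batch overshoots the poll interval by more than a factor of three.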

You can mitigate this by lowering max.poll.records to 1, which makes each batch contain one record and keeps the per-poll time to 2 seconds plus processing. That’s defensible, but it costs you all the throughput benefits of batched polling, makes Kafka commit overhead per-record dominant, and is a fragile tuning that breaks the moment someone increases the delay parameter without realizing the implication. You can also raise max.poll.interval.ms to something larger than your worst-case sleep time, but that changes how Kafka detects genuinely-stuck consumers, making it slower to evict actually-broken pods and triggering rebalance later in genuine failure cases. You’re trading a clear failure (rebalance during normal sleep) for an obscure one (slow detection of real failures).

I’ve worked with engineers who’ve operated systems with Thread.sleep-based delays in production. It’s possible. It’s also a tuning minefield where every parameter change interacts with everything else, and the failure mode under load is a rebalance cascade that’s expensive to recover from. Not a debate I wanted to have with future on-call engineers when a simpler option existed.

We ruled out Thread.sleep. We ruled out producer-side scheduling. What we needed was a way to tell the Kafka consumer “stop reading from this partition for a while, then come back to where I was” — without blocking the consumer thread and without losing messages.

That’s when I started looking at consumer.pause().

The realization

consumer.pause(partitions) is a Kafka client API that tells the client to stop fetching from those partitions on subsequent poll() calls. The consumer itself keeps polling (just gets back empty results); heartbeats continue to be sent; the consumer stays alive in the group. No rebalance.

consumer.resume(partitions) re-enables fetching. The next poll() returns whatever’s been waiting in the partition.

Combined with consumer.seek(partition, offset) to reset the read position, these three primitives compose into a clean pattern: “stop reading this partition until time T, then come back to where I was.”

The pattern is:

  1. Consumer pulls a message via normal poll().
  2. A custom adapter checks: is the message’s intended delivery time in the future?
  3. If yes:
    • Pause the partition.
    • Seek back to the message’s offset (so we’ll re-read it after resume).
    • Schedule a resume task with a TaskScheduler, with the appropriate delay.
  4. If no (the time has arrived):
    • Invoke the actual listener. Process normally.

The message stays in Kafka the whole time. The consumer thread stays unblocked. Heartbeats continue. The partition just isn’t read until the right moment.
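Before the real implementation, here is the control flow on its own. This is a toy in-memory model of a single partition driven by a virtual clock, not the Kafka client: the `paused` flag, the `position` field, and `resumeAt` stand in for pause(), seek(), and the scheduled resume task. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one partition; NOT the Kafka client, only the decision logic. */
final class PauseSeekResumeModel {

    private final long[] recordTimestamps;  // index = offset
    private int position = 0;               // next offset to fetch
    private boolean paused = false;
    private long resumeAt = Long.MAX_VALUE; // stands in for the scheduled resume task

    PauseSeekResumeModel(long... recordTimestamps) {
        this.recordTimestamps = recordTimestamps;
    }

    /** Advances a virtual clock in 100 ms ticks; returns "offset@deliveryTime" entries. */
    List<String> run(long delayMillis, long untilMillis) {
        List<String> delivered = new ArrayList<>();
        for (long now = 0; now <= untilMillis; now += 100) {
            if (paused && now >= resumeAt) {
                paused = false;                         // resume(partition)
            }
            // poll(): a paused partition yields nothing, but the loop keeps turning.
            while (!paused && position < recordTimestamps.length
                    && recordTimestamps[position] <= now) {
                int offset = position++;                // record fetched
                long due = recordTimestamps[offset] + delayMillis;
                if (now < due) {
                    paused = true;                      // pause(partition)
                    position = offset;                  // seek(partition, offset)
                    resumeAt = due;                     // schedule the resume
                } else {
                    delivered.add(offset + "@" + now);  // invoke the real listener
                }
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        // Two records produced at t=0 and t=100, each held for 2000 ms.
        System.out.println(new PauseSeekResumeModel(0, 100).run(2_000, 3_000));
        // prints [0@2000, 1@2100]
    }
}
```

The loop never blocks. While the partition is paused it simply has nothing to do, which is the property that keeps a real consumer inside max.poll.interval.ms.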

The implementation

The custom adapter wraps the actual listener and intercepts each record:

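import java.time.Duration;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;
import org.springframework.kafka.listener.AcknowledgingConsumerAwareMessageListener;
import org.springframework.kafka.listener.KafkaConsumerBackoffManager;
import org.springframework.kafka.support.Acknowledgment;

public class DelayedMessageListenerAdapter<K, V>
        implements AcknowledgingConsumerAwareMessageListener<K, V> {

    // Typed as the three-argument listener so the call at the bottom reaches a real
    // implementation rather than a default method that throws.
    private final AcknowledgingConsumerAwareMessageListener<K, V> delegate;
    private final KafkaConsumerBackoffManager backoffManager;
    private final String listenerId;
    private final Duration delay;

    public DelayedMessageListenerAdapter(
            AcknowledgingConsumerAwareMessageListener<K, V> delegate,
            KafkaConsumerBackoffManager backoffManager,
            String listenerId,
            Duration delay) {
        this.delegate = delegate;
        this.backoffManager = backoffManager;
        this.listenerId = listenerId;
        this.delay = delay;
    }

    @Override
    public void onMessage(ConsumerRecord<K, V> record,
                          Acknowledgment ack,
                          Consumer<?, ?> consumer) {
        // A record without a usable timestamp is delivered immediately. Falling back to
        // "now" would push the due time forward on every re-read and never deliver.
        long deliveryTime = record.timestamp() > 0
            ? record.timestamp() + delay.toMillis()
            : 0L;

        if (System.currentTimeMillis() < deliveryTime) {
            // Throws KafkaBackoffException if the record is still early. If the due time
            // passed in the meantime it returns normally and we fall through to deliver.
            backoffManager.backOffIfNecessary(
                backoffManager.createContext(
                    deliveryTime,
                    listenerId,
                    new TopicPartition(record.topic(), record.partition()),
                    consumer
                )
            );
        }

        delegate.onMessage(record, ack, consumer);
    }
}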

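Spring Kafka’s KafkaConsumerBackoffManager encapsulates the pause and resume scheduling. As far as I can tell from the source, when the ContainerPartitionPausingBackOffManager implementation is invoked with a future timestamp, it pauses the partition, registers a resume task with a TaskScheduler, and then throws a KafkaBackoffException. The listener container’s error handler catches that exception and seeks back to the record so that it is fetched again after the resume. The wiring: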

@Bean
public TaskScheduler kafkaDelayTaskScheduler() {
    var scheduler = new ThreadPoolTaskScheduler();
    scheduler.setPoolSize(5);
    scheduler.setThreadNamePrefix("kafka-delay-");
    scheduler.initialize();
    return scheduler;
}

@Bean
public KafkaConsumerBackoffManager backoffManager(
        KafkaListenerEndpointRegistry registry,
        @Qualifier("kafkaDelayTaskScheduler") TaskScheduler scheduler) {
    return new ContainerPartitionPausingBackOffManager(
        registry,
        new ContainerPausingBackOffHandler(
            new ListenerContainerPauseService(registry, scheduler))
    );
}

When the resume task fires, the partition resumes. The next poll() returns the same record, because the consumer’s position was reset to its offset. The adapter checks again: now the timestamp is past, the delegate is invoked, the message is processed.

The flow:

sequenceDiagram
    participant K as Kafka topic
    participant C as Consumer
    participant Adapter as Delayed Adapter
    participant Sched as TaskScheduler
    participant Real as Real Listener

    Note over K,Real: T=0
    K-->>C: poll() returns record (t=0)
    C->>Adapter: onMessage(record)
    Note over Adapter: now=0, target=2 → too early
    Adapter->>C: pause(partition)
    Adapter->>C: seek(record.offset)
    Adapter->>Sched: schedule resume in 2s

    Note over C: partition paused,<br/>heartbeats continue

    Note over Sched: 2 seconds pass...

    Sched->>C: resume(partition)
    K-->>C: poll() returns same record
    C->>Adapter: onMessage(record)
    Note over Adapter: now=2, target=2 → execute
    Adapter->>Real: invoke delegate
    Real->>Real: deliver toast to frontend

We isolated the delayed flow on its own dedicated topic so that other (non-delayed) messages couldn’t be incidentally blocked by a paused partition. This was a deliberate architectural choice and turned out to matter.

What surprised me after deploy

The animation-toast sync became invisible. Pre-fix, the bug was something users would notice if you pointed it out, but most wouldn’t have on their own. Post-fix, nobody — including users — said anything about it. The fix worked exactly as intended: the toast appeared at the moment the animation resolved, which is what users expected, so they didn’t think about it. Good UX is invisible by design.

The pattern got reused for something completely different. Six months later, we needed to rate-limit calls to a partner API that allowed us 10 requests per second. Instead of building a separate rate limiter, we used the same pause/resume pattern: the listener checked an in-memory token bucket, and if the bucket was empty, it called backOffIfNecessary to pause the partition until the bucket refilled. Same primitive, different problem. We extracted the underlying pattern into a small library of “Kafka delivery throttle patterns” that other teams started using. I hadn’t planned this generalization; it emerged organically.
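The piece that makes a token bucket fit this pattern is that it can say when the next token arrives, because that timestamp is what gets handed over as the due time. Below is a minimal sketch of such a bucket. It is not our library code: `TokenBucket` and `acquireOrDueAt` are hypothetical names, and the clock is passed in explicitly to keep the example deterministic.

```java
import java.util.OptionalLong;

/** Illustrative only: a token bucket that reports when the next token is due. */
final class TokenBucket {

    private final int capacity;
    private final long refillIntervalMillis; // one token is added per interval
    private int tokens;
    private long lastRefillMillis;

    TokenBucket(int capacity, long refillIntervalMillis, long nowMillis) {
        this.capacity = capacity;
        this.refillIntervalMillis = refillIntervalMillis;
        this.tokens = capacity;
        this.lastRefillMillis = nowMillis;
    }

    /**
     * Takes a token if one is available and returns empty. Otherwise returns the
     * timestamp at which the next token arrives, which a caller could use as the
     * due time for pausing the partition.
     */
    synchronized OptionalLong acquireOrDueAt(long nowMillis) {
        long elapsedIntervals = (nowMillis - lastRefillMillis) / refillIntervalMillis;
        if (elapsedIntervals > 0) {
            tokens = (int) Math.min(capacity, tokens + elapsedIntervals);
            lastRefillMillis += elapsedIntervals * refillIntervalMillis;
        }
        if (tokens > 0) {
            tokens--;
            return OptionalLong.empty();
        }
        return OptionalLong.of(lastRefillMillis + refillIntervalMillis);
    }

    public static void main(String[] args) {
        // 10 requests per second: capacity 10, one new token every 100 ms.
        TokenBucket bucket = new TokenBucket(10, 100, 0);
        for (int i = 0; i < 10; i++) {
            bucket.acquireOrDueAt(0);                  // drains the bucket
        }
        System.out.println(bucket.acquireOrDueAt(0));   // OptionalLong[100]
        System.out.println(bucket.acquireOrDueAt(100)); // OptionalLong.empty
    }
}
```

An in-memory bucket like this is per consumer instance, so with several instances the combined rate is the sum of the buckets. Whether that matters depends on how the partner counts requests.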

Head-of-line blocking turned out to be more subtle than I’d thought. The pattern’s clear cost is that pausing a partition pauses all messages in that partition, not just the one we wanted to delay. With a 2-second delay on a dedicated topic, this was negligible — the next message in the partition was usually less than 100 ms behind, so we delayed it by 2 seconds when it should have been delayed by 1.9 seconds. Functionally identical. But this is a real limitation, and it created an interesting follow-on incident I’ll come back to.

The trap I left for the team (and what I’d do differently)

Four months after we shipped, a junior engineer on a different team wanted to use the pattern for a 30-minute delayed batch report. He’d seen our adapter, understood the high-level idea, and applied it to his use case.

Within an hour of his deploy, our alerts fired. His partition had been paused for 30 minutes — which was working as intended for his batch report — but every other message landing in that partition during those 30 minutes was also delayed. The downstream pipeline that depended on those messages was 30 minutes behind real time. By the time we’d rolled back, his consumer had a 47,000-message backlog.
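The cost to everything else in the partition is simple to state. This sketch assumes the bystander record needs no delay of its own, so any waiting it does is purely collateral; `HeadOfLineDelay` is a hypothetical name.

```java
import java.time.Duration;

/** Illustrative only: how long a bystander record waits behind a paused head record. */
final class HeadOfLineDelay {

    /**
     * @param headDueAtMillis          when the delayed head record becomes due
     * @param bystanderArrivedAtMillis when an unrelated record landed in the same partition
     * @return extra millis the bystander waits only because the partition is paused
     */
    static long collateralDelayMillis(long headDueAtMillis, long bystanderArrivedAtMillis) {
        return Math.max(0L, headDueAtMillis - bystanderArrivedAtMillis);
    }

    public static void main(String[] args) {
        long thirtyMinutes = Duration.ofMinutes(30).toMillis();
        // Arrives 1 s after a head record that is held for 30 minutes.
        System.out.println(collateralDelayMillis(thirtyMinutes, 1_000));     // 1799000
        // Arrives after the head was released: no extra wait.
        System.out.println(collateralDelayMillis(thirtyMinutes, 1_900_000)); // 0
    }
}
```

With a 2-second hold the same formula tops out at two seconds, which is why the original use case never noticed.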

The fix was twofold. First, we documented the pattern with an explicit warning: “this pattern is appropriate for delays under approximately 10 seconds. For longer delays, use [the alternative external-scheduler pattern].” Second, we built a separate pattern for arbitrary-duration delays, based on the external-scheduler approach we’d ruled out earlier — because for the long-delay case, that was the right pattern, just not for our original 2-second use case.

The lesson, which I should have anticipated at the time of the original design: when you build a primitive that has scaling limits, you have to make those limits visible in the API or in documentation. Otherwise the next engineer who reaches for it will extend it past the breaking point, often without realizing.
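One way to make the limit visible in the API is to refuse to construct a delay the pattern can’t handle. The sketch below is hypothetical (`ShortDelay` is not a class from our library), and the 10-second ceiling is an assumption taken from the guidance we later documented.

```java
import java.time.Duration;

/** Illustrative only: a delay value that rejects durations the pattern can't handle. */
final class ShortDelay {

    // Assumption: 10 s mirrors the documented guidance; choose your own ceiling.
    static final Duration MAX = Duration.ofSeconds(10);

    private final Duration value;

    private ShortDelay(Duration value) {
        this.value = value;
    }

    static ShortDelay of(Duration requested) {
        if (requested.isNegative() || requested.compareTo(MAX) > 0) {
            throw new IllegalArgumentException(
                "Pause-based delay supports 0 to " + MAX.getSeconds() + " s, got " + requested
                + ". Pausing blocks the whole partition; use an external scheduler instead.");
        }
        return new ShortDelay(requested);
    }

    Duration value() {
        return value;
    }

    public static void main(String[] args) {
        System.out.println(ShortDelay.of(Duration.ofSeconds(2)).value()); // PT2S
        try {
            ShortDelay.of(Duration.ofMinutes(30));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

If the adapter’s constructor took a type like this instead of a raw Duration, a 30-minute delay would fail at startup with a message that explains why, rather than in production an hour later.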

This is the thing I’d most want back from the original design. If I’d written even a one-paragraph README — “this pattern works for short delays because of head-of-line blocking; for longer delays use X” — the junior engineer would have made a different choice. Instead, the codebase suggested the pattern was generic, and he reasonably trusted it.

I’d also have set up metrics on partition pause duration from the start. They’d have caught the blocking issue much earlier. We added them after the incident; they should have been part of the original spec.
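The metric that would have helped is not “how long was the last pause” but “which partitions are paused right now, and for how long”. Here is a minimal sketch of that bookkeeping using only the JDK. It is not what we shipped: `PauseDurationTracker` is a hypothetical name, the partition labels are invented, and the wiring to whatever pause and resume hooks your container exposes is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: remembers when each partition was paused so long pauses stand out. */
final class PauseDurationTracker {

    // Key is a partition label such as "round-results-0"; TreeMap keeps output ordered.
    private final Map<String, Long> pausedSince = new TreeMap<>();

    synchronized void onPause(String partition, long nowMillis) {
        pausedSince.putIfAbsent(partition, nowMillis);
    }

    /** Returns how long the partition was paused, or -1 if it was not paused. */
    synchronized long onResume(String partition, long nowMillis) {
        Long since = pausedSince.remove(partition);
        return since == null ? -1 : nowMillis - since;
    }

    /** Partitions that are still paused and have been for longer than the threshold. */
    synchronized List<String> pausedLongerThan(long thresholdMillis, long nowMillis) {
        List<String> offenders = new ArrayList<>();
        for (Map.Entry<String, Long> entry : pausedSince.entrySet()) {
            if (nowMillis - entry.getValue() > thresholdMillis) {
                offenders.add(entry.getKey());
            }
        }
        return offenders;
    }

    public static void main(String[] args) {
        PauseDurationTracker tracker = new PauseDurationTracker();
        tracker.onPause("round-results-0", 0);
        tracker.onPause("batch-reports-0", 0);
        System.out.println(tracker.onResume("round-results-0", 2_000)); // 2000
        System.out.println(tracker.pausedLongerThan(10_000, 60_000));   // [batch-reports-0]
    }
}
```

Checking the still-paused set on a timer matters because a duration recorded only at resume time arrives after the damage is done.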

The take-home

The thing I now believe more strongly: consumer.pause() is one of the most underrated primitives in the Kafka client API. It has been part of the Java consumer since the 0.9 client, yet most people I talk to don’t know it exists. It enables a category of patterns (short delays, throttling, backpressure, rate limiting, conditional consumption) that don’t fit cleanly into any other Kafka primitive. Once you have it in your toolbox, you start seeing places to use it.

The other thing: the right delay mechanism depends on the duration. For short delays (under ~10 seconds), Kafka-native primitives plus a TaskScheduler is clean and correct. For long delays (minutes to days), an external scheduler with persistent storage is the better tool. The boundary between these isn’t a sharp line — somewhere around 10–30 seconds you can argue either way — but the trade-offs flip clearly outside those ranges. Knowing which side of the line you’re on is more valuable than knowing either pattern in isolation.

The animation-toast sync has been working for over a year. The same adapter has been extended to throttle three different external API integrations. About 80 lines of code in a small library that two other teams now depend on. We’ve never modified it after the original implementation, beyond adding metrics and the documentation that should have been there from the start.

That’s how I know the design fits the problem: not because of the metrics (though those are good), but because nothing has tried to escape its boundaries.

If consumer.pause() is new to you, the Kafka Consumer Rebalancing post pairs well with this one — it covers the rebalance dynamics that make pause() such a clean alternative to Thread.sleep. And if you’ve used consumer.pause() in production for something interesting, I’d love to hear about it. The pattern is more general than the documentation suggests, and I suspect there are use cases I haven’t thought of.