Kafka consumer rebalancing is one of those “simple in theory” features that cause a disproportionate share of production incidents. Understanding what actually happens during a rebalance — and why the default behavior is often wrong — separates smooth-running systems from ones that flap under load.

What a rebalance is

A consumer group is a set of consumers sharing the work of reading a topic’s partitions. Kafka decides which consumer gets which partitions. A rebalance is the process of re-deciding.

Triggers:

  • New consumer joins the group
  • Consumer leaves (crash, stop, network timeout)
  • Consumer session timeout (heartbeat not received)
  • Partition count changes (rare)
  • max.poll.interval.ms exceeded

During rebalance (classic protocol), all consumers pause. Partitions are redistributed. Consumers resume. The world stops for a moment.
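
You can watch this happen from inside the client: the consumer API lets you pass a ConsumerRebalanceListener when subscribing. A minimal logging sketch (topic name illustrative):

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;
import java.util.Collection;

// Logs the two phases of a rebalance. Under the classic ("eager") protocol,
// onPartitionsRevoked fires with ALL currently owned partitions, and nothing
// is consumed until onPartitionsAssigned completes.
public class LoggingRebalanceListener implements ConsumerRebalanceListener {

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // The usual place to commit offsets for partitions being taken away.
        System.out.println("Revoked, consumption paused: " + partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        System.out.println("Assigned, consumption resumes: " + partitions);
    }
}

// Usage: consumer.subscribe(List.of("orders"), new LoggingRebalanceListener());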

Why rebalancing is painful

A rebalance takes time — 1-30 seconds is typical. During this time:

  • Consumption stops entirely
  • Lag accumulates
  • Users see delayed processing

Worse, if rebalances happen frequently (flapping), consumption never catches up. Lag grows unbounded. Queue backs up. Eventually someone is paged.

What causes flapping

max.poll.interval.ms (default 5 minutes). If the consumer doesn’t call poll() within this interval, Kafka considers it dead and rebalances. Common cause: consumer is processing a message slowly — GC pause, slow DB call, long-running work.
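
To make the failure mode concrete, a sketch against the plain consumer API (slowDatabaseCall is a hypothetical stand-in for whatever blocks):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;

public class SlowConsumerLoop {
    // If the loop body blocks past max.poll.interval.ms (default 300 000 ms),
    // the next poll() arrives too late, the coordinator assumes this consumer
    // is dead, and a rebalance starts even though the process is healthy.
    void consume(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                slowDatabaseCall(record.value()); // hypothetical long blocking call
            }
            // poll() is only reached again after the entire batch is processed
        }
    }

    private void slowDatabaseCall(String value) { /* ... */ }
}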

Session timeout too short. Consumer heartbeat fails for 10s due to network blip; group kicks it out. Default 45s is usually OK; tune based on your network stability.

Consumer restarts during deploys. A rolling deploy restarts 10 pods in sequence → at least 10 rebalances in quick succession (each leave and each rejoin can trigger one).

Autoscaling consumers. Aggressive scale-up/down triggers constant rebalances. Keep consumer count stable.

Cooperative rebalancing (Kafka 2.4+)

The original protocol (“eager”) had all consumers drop all partitions, then reassign. The new protocol (“cooperative”) only moves the partitions that actually need to move:

  • Consumer A had partitions 0, 1, 2
  • Consumer B joins; plan says A gets 0, 1; B gets 2
  • Cooperative: A keeps 0, 1 (no interruption); only partition 2 is moved

Result: most consumers never stop. Only the affected partitions pause briefly. Rebalances become nearly invisible.

Enable via:

spring:
  kafka:
    consumer:
      properties:
        partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor

Huge improvement. If your Kafka is 2.4+ and you’re not using cooperative rebalancing, turn it on. One caveat: migrating a live consumer group from eager to cooperative safely takes two rolling restarts (first list CooperativeStickyAssignor alongside the old assignor, then remove the old one).

Static membership

Another Kafka 2.3+ feature: setting group.instance.id gives each consumer a stable identity:

spring:
  kafka:
    consumer:
      properties:
        group.instance.id: ${HOSTNAME}

With static membership, brief disconnects (pod restart, network blip) don’t trigger a rebalance: Kafka holds the instance’s partitions and waits, up to session.timeout.ms, for that specific instance to come back. One caveat: ${HOSTNAME} only provides a stable identity if the hostname survives restarts (e.g., a Kubernetes StatefulSet); with plain Deployments, derive the id from something that does. Combined with cooperative rebalancing, rebalances on normal operations become rare.

Tuning the knobs

Sensible defaults for typical Spring Kafka:

spring:
  kafka:
    consumer:
      properties:
        session.timeout.ms: 45000          # how long before "consumer is dead"
        heartbeat.interval.ms: 15000       # how often consumer says "alive"
        max.poll.interval.ms: 300000       # how long between poll() calls
        max.poll.records: 100              # messages per poll
        partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor
        group.instance.id: ${HOSTNAME}     # static membership

For slow consumers, increase max.poll.interval.ms — but ideally, speed up processing or split into smaller batches.

The slow consumer problem

Suppose processing a single message takes 10s. With max.poll.records=500, a single poll batch takes 5000s (far beyond the 300s default max.poll.interval.ms), so the consumer is evicted and a rebalance starts. Options:

  • Reduce batch size. max.poll.records=50, processing completes inside the interval.
  • Process async. poll() returns quickly; work happens in a thread pool (sketched below).
  • Increase max.poll.interval.ms. Only if you really need long-running processing.

Async processing is tricky — you must ack messages only after work completes, handle backpressure, and reason about ordering. Most teams get better results by making processing itself faster or batches smaller.
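
For reference, here is a minimal sketch of the pause/resume flavor of async processing against the plain consumer API. It is an illustration under simplifying assumptions (single in-flight batch, no error handling), and process() is a hypothetical stand-in for your slow work:

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// The poll thread stays fast (max.poll.interval.ms is never exceeded) while
// a worker thread does the slow processing; offsets are committed only after
// the whole batch completes.
public class PauseResumeLoop {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    void run(KafkaConsumer<String, String> consumer) {
        Future<?> inFlight = null;
        while (true) {
            // While paused, poll() fetches nothing but still counts as a poll,
            // so the consumer stays alive in the group.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(250));
            if (!records.isEmpty()) {
                consumer.pause(consumer.assignment());           // stop fetching
                final ConsumerRecords<String, String> batch = records;
                inFlight = worker.submit(() -> process(batch));  // slow work off the poll thread
            }
            if (inFlight != null && inFlight.isDone()) {
                consumer.commitSync();                           // ack only after work finished
                consumer.resume(consumer.assignment());
                inFlight = null;
            }
        }
    }

    private void process(ConsumerRecords<String, String> batch) {
        // hypothetical slow per-record work
    }
}

The sketch skips exactly the hard parts the paragraph above warns about: a rebalance while a batch is in flight, per-partition ordering, and bounded queues for backpressure.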

Monitoring

Metrics worth alerting on:

  • Rebalance rate — more than a few per hour is a signal
  • Consumer lag per partition — lagging partitions reveal slow consumers
  • Time since last poll per consumer — the client exposes this as last-poll-seconds-ago; if it climbs toward max.poll.interval.ms, processing is too slow
  • Rebalance duration — if climbing, group is unstable

A dashboard showing rebalances + lag + throughput makes the health of a Kafka consumer group obvious.
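
For consumer lag specifically, a minimal sketch using the Kafka AdminClient (group id and bootstrap address are placeholders):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

// Computes per-partition lag for one group: committed offset vs. end of log.
public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("my-group")    // placeholder group id
                    .partitionsToOffsetAndMetadata().get();

            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(committed.keySet().stream()
                            .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                    .all().get();

            committed.forEach((tp, meta) -> System.out.printf(
                    "%s lag=%d%n", tp, ends.get(tp).offset() - meta.offset()));
        }
    }
}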

Deploy patterns

Rolling deploys + sticky partition assignment + static membership = smooth deploys. A consumer pod restart doesn’t trigger a full group rebalance; the pod picks its old partitions back up when it returns, provided it rejoins within session.timeout.ms.

Rolling deploys without those features = rebalance per pod = flapping = lag. Always configure sticky+static for production consumer groups.

Kafka Streams note

Rebalancing for Kafka Streams is even more painful because state stores must be migrated and restored along with their tasks. num.stream.threads=1 minimizes churn per pod. Use Kafka Streams 2.6+, which added warmup replicas (KIP-441): state is copied to a task’s new owner before ownership moves, hiding most of the rebalance cost.
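
The relevant knobs, as a sketch (the three config constants are real Streams settings; the values are illustrative, not prescriptive):

import org.apache.kafka.streams.StreamsConfig;
import java.util.Properties;

// KIP-441 task-warmup settings.
public class StreamsWarmupConfig {
    static Properties warmupProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);          // keep a warm standby per store
        props.put(StreamsConfig.MAX_WARMUP_REPLICAS_CONFIG, 2);           // copies allowed to warm up at once
        props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 10_000L); // hand over only when nearly caught up
        return props;
    }
}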

Closing note

Rebalancing is unavoidable but should be rare and painless. Cooperative rebalancing + static membership + sensible timeouts make most rebalances a non-event. Skip those and you’ll chase mysterious lag spikes forever. For production Kafka in 2026, the default defensive configuration is worth the 10 minutes of tuning — saves hours of incident response later.