Introduction
When a message queue cluster undergoes partition rebalancing -- triggered by broker additions, removals, or consumer group membership changes -- consumer lag can grow rapidly. During rebalance, all consumers in the group temporarily pause processing while partition ownership is redistributed, creating a processing gap that accumulates backlogged messages.
Symptoms
- Consumer lag metrics increase sharply during or immediately after a rebalance event
- Consumer group state shows `PreparingRebalance` or `CompletingRebalance` for extended periods
- Processing throughput drops to near zero during the rebalance window
- Downstream services experience delayed data or stale dashboards
- Rebalance storms -- rapid consecutive rebalances -- compound the lag accumulation
Common Causes
- Consumer group coordinator triggering frequent rebalances due to session timeout misconfiguration
- Large partition counts requiring extended assignment computation and state transfer
- Slow consumers failing heartbeat checks, triggering cascading rebalance cycles
- Offset commits deferred until after rebalance completion, losing progress tracking
- Network partition between consumers and the group coordinator causing timeout-driven rebalances
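The first cause above is often simple arithmetic: the usual Kafka guidance is to keep `heartbeat.interval.ms` at no more than one third of `session.timeout.ms`, and `max.poll.interval.ms` must cover the worst-case time to process one poll batch. A minimal standalone sketch of that sanity check (the `validateTimeouts` helper and its thresholds are illustrative, not part of any Kafka API):

```java
import java.util.ArrayList;
import java.util.List;

public class RebalanceConfigCheck {
    // Returns warnings for timeout settings that commonly cause spurious
    // rebalances: heartbeat should be <= session/3, and the poll interval
    // must exceed worst-case batch processing time.
    static List<String> validateTimeouts(long sessionTimeoutMs,
                                         long heartbeatIntervalMs,
                                         long maxPollIntervalMs,
                                         long maxPollRecords,
                                         long perRecordMs) {
        List<String> warnings = new ArrayList<>();
        if (heartbeatIntervalMs > sessionTimeoutMs / 3) {
            warnings.add("heartbeat.interval.ms should be <= session.timeout.ms / 3");
        }
        long worstCaseBatchMs = maxPollRecords * perRecordMs;
        if (worstCaseBatchMs > maxPollIntervalMs) {
            warnings.add("max.poll.interval.ms too small for a full batch ("
                    + worstCaseBatchMs + " ms worst case)");
        }
        return warnings;
    }

    public static void main(String[] args) {
        // 45 s session, 15 s heartbeat: right at the 1/3 boundary, no warning.
        System.out.println(validateTimeouts(45_000, 15_000, 600_000, 500, 100));
        // 500 records at 2 s each = 1,000,000 ms > 600,000 ms poll interval.
        System.out.println(validateTimeouts(45_000, 15_000, 600_000, 500, 2_000));
    }
}
```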
Step-by-Step Fix
1. Identify the rebalance trigger: Check consumer group coordinator logs for the root cause of the rebalance event.

```bash
# For Kafka-based systems, check consumer group state
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-consumer-group
```

2. Increase session timeout to reduce false-positive rebalances: Extend the heartbeat interval and session timeout to tolerate temporary network hiccups.

```properties
session.timeout.ms=45000
heartbeat.interval.ms=15000
max.poll.interval.ms=600000
```

3. Enable incremental cooperative rebalancing: Switch from eager stop-the-world rebalancing to the incremental cooperative protocol, which only moves affected partitions.

```properties
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```

4. Process remaining records before committing offsets during shutdown: Implement a graceful shutdown hook that drains the current poll batch.

```java
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    // Interrupts a blocking poll(); the poll loop should catch
    // WakeupException, commit offsets, then close the consumer.
    consumer.wakeup();
}));
```

5. Monitor lag recovery rate after rebalance completes: Verify consumers catch up at an acceptable rate.

```bash
# Sum the LAG column (column 6) across all partitions, skipping the header row
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-consumer-group | awk '{print $6}' | tail -n +2 | paste -sd+ | bc
```
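Once the total lag from step 5 is in hand, a back-of-the-envelope check tells you whether consumers are actually catching up: lag only drains when the consume rate exceeds the produce rate. A minimal sketch (the `estimateCatchUpSeconds` helper is illustrative, not a Kafka API):

```java
public class LagRecovery {
    // Estimates seconds until lag reaches zero, assuming steady rates.
    // Returns -1 when consumers are not catching up (consume <= produce).
    static long estimateCatchUpSeconds(long lagMessages,
                                       double consumeRatePerSec,
                                       double produceRatePerSec) {
        double drainRate = consumeRatePerSec - produceRatePerSec;
        if (drainRate <= 0) {
            return -1; // lag is flat or growing; scale out or tune consumers
        }
        return (long) Math.ceil(lagMessages / drainRate);
    }

    public static void main(String[] args) {
        // 120k backlogged messages, consuming 5k/s against 3k/s incoming:
        // a net drain of 2k/s clears the backlog in 60 seconds.
        System.out.println(estimateCatchUpSeconds(120_000, 5_000, 3_000)); // 60
        System.out.println(estimateCatchUpSeconds(120_000, 2_000, 3_000)); // -1
    }
}
```

If the estimate comes back negative or unacceptably long, that is the signal to add consumer instances (up to the partition count) rather than wait out the backlog.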
Prevention
- Use cooperative sticky partition assignment to minimize rebalance scope and duration
- Tune `max.poll.records` to ensure consumers can process a batch within `max.poll.interval.ms`
- Deploy consumer health checks that alert on rebalance frequency, not just lag magnitude
- Size consumer groups to match partition counts, avoiding over-provisioning that triggers unnecessary rebalances
- Implement static group membership with `group.instance.id` to prevent transient restarts from triggering rebalances
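The last two prevention items combine naturally in consumer configuration. A sketch, assuming a deployment where each consumer instance is assigned a distinct, stable `group.instance.id` (the `payments-consumer-1` value is a placeholder):

```properties
# Static membership: restarts completing within session.timeout.ms
# do not trigger a rebalance
group.instance.id=payments-consumer-1
session.timeout.ms=45000

# Cooperative rebalancing: only reassigned partitions pause
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```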