Introduction

When a message queue cluster undergoes partition rebalancing -- triggered by partition count changes, consumer additions or removals, or consumer failures -- consumer lag can grow rapidly. Under the default eager rebalance protocol, every consumer in the group pauses processing while partition ownership is redistributed, creating a processing gap that accumulates backlogged messages.

Symptoms

  • Consumer lag metrics increase sharply during or immediately after a rebalance event
  • Consumer group state shows PreparingRebalance or CompletingRebalance for extended periods
  • Processing throughput drops to near zero during the rebalance window
  • Downstream services experience delayed data or stale dashboards
  • Rebalance storms -- rapid consecutive rebalances -- compound the lag accumulation

Common Causes

  • Consumer group coordinator triggering frequent rebalances due to session timeout misconfiguration
  • Large partition counts requiring extended assignment computation and state transfer
  • Slow consumers failing heartbeat checks, triggering cascading rebalance cycles
  • Offsets not committed before partition revocation, forcing reprocessing of messages after reassignment
  • Network partition between consumers and the group coordinator causing timeout-driven rebalances

Step-by-Step Fix

  1. Identify the rebalance trigger: Check consumer group coordinator logs for the root cause of the rebalance event.

     ```bash
     # For Kafka-based systems, check consumer group state
     kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-consumer-group
     ```

  2. Increase session timeout to reduce false-positive rebalances: Extend the heartbeat interval and session timeout to tolerate temporary network hiccups.

     ```properties
     session.timeout.ms=45000
     heartbeat.interval.ms=15000
     max.poll.interval.ms=600000
     ```

  3. Enable incremental cooperative rebalancing: Switch from the eager stop-the-world protocol to incremental cooperative rebalancing, which only moves the affected partitions.

     ```properties
     partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
     ```

  4. Process remaining records before committing offsets during shutdown: Implement a graceful shutdown hook that drains the current poll batch.

     ```java
     Runtime.getRuntime().addShutdownHook(new Thread(() -> {
         // wakeup() makes a blocked poll() throw WakeupException;
         // the poll loop can then finish its current batch, commit
         // offsets, and close the consumer cleanly.
         consumer.wakeup();
     }));
     ```

  5. Monitor lag recovery rate after rebalance completes: Verify consumers catch up at an acceptable rate. Summing the LAG column gives the total backlog:

     ```bash
     kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-consumer-group | awk '{print $6}' | tail -n +2 | paste -sd+ | bc
     ```

Prevention

  • Use cooperative sticky partition assignment to minimize rebalance scope and duration
  • Tune max.poll.records to ensure consumers can process a batch within max.poll.interval.ms
  • Deploy consumer health checks that alert on rebalance frequency, not just lag magnitude
  • Size consumer groups to match partition counts; consumers beyond the partition count sit idle yet still trigger rebalances when they join or leave
  • Implement static group membership with group.instance.id to prevent transient restarts from triggering rebalances
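Static group membership can be sketched as a client-side configuration; the instance ID below is a hypothetical example and must be unique and stable for each consumer instance (for example, derived from the pod or host name):

```properties
# Hypothetical instance ID -- unique and stable per consumer instance.
group.instance.id=orders-consumer-0
# With static membership, a member that restarts and rejoins within
# session.timeout.ms keeps its partitions without triggering a rebalance.
session.timeout.ms=45000
```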