Introduction

Kafka consumer group rebalancing redistributes partition ownership among consumers when members join, leave, or fail. Under the eager rebalance protocol, every consumer in the group stops processing until the new assignment is complete, which creates a throughput gap. When rebalances happen frequently or take too long, the cumulative pause time increases message processing latency and consumer lag.

Symptoms

  • Consumer processing throughput drops to zero periodically with no apparent cause
  • Kafka consumer group state oscillates between Stable and PreparingRebalance
  • Consumer logs show "The coordinator is not aware of this member" or "Rebalance in progress"
  • Consumer lag grows steadily despite adequate consumer capacity
  • Application health checks fail during extended rebalance windows

Common Causes

  • max.poll.interval.ms set too low, so slow consumers are removed from the group, which triggers a rebalance
  • Consumer processing takes longer than the poll interval during peak message volume
  • Network instability between consumers and the group coordinator causing heartbeat failures
  • Rolling deployments removing and adding consumers without static group membership
  • GC pauses in JVM-based consumers exceeding session timeout thresholds
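The first two causes come down to simple arithmetic: a full poll() batch must finish before max.poll.interval.ms expires. The sketch below estimates a conservative worst-case batch time as the p99 per-record processing time multiplied by max.poll.records. The function names and the sample timings are hypothetical and only illustrate the calculation; real timings would come from your own consumer metrics.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def worst_case_batch_ms(per_record_ms, max_poll_records, pct=99):
    """Conservative estimate: every record in the batch takes the pct-th percentile time."""
    return percentile(per_record_ms, pct) * max_poll_records


def exceeds_poll_interval(per_record_ms, max_poll_records, max_poll_interval_ms):
    """True if the estimated batch time is longer than max.poll.interval.ms."""
    return worst_case_batch_ms(per_record_ms, max_poll_records) > max_poll_interval_ms


# Illustrative per-record timings in ms: mostly fast, with a slow tail.
samples = [20] * 98 + [900, 1500]

print(sum(samples) * 500 / len(samples))            # average-based estimate: 21800.0
print(worst_case_batch_ms(samples, 500))            # p99-based estimate: 450000
print(exceeds_poll_interval(samples, 500, 300000))  # True
print(exceeds_poll_interval(samples, 500, 600000))  # False
```

With these made-up numbers the average suggests a batch takes about 22 seconds, while the p99-based estimate is 450 seconds, which is longer than a 300000 ms poll interval. The estimate is deliberately pessimistic, so treat it as an upper bound rather than a prediction.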

Step-by-Step Fix

  1. Identify rebalance frequency and duration from broker logs: Check how often rebalances occur.

     ```bash
     # Case-insensitive match so both "rebalance" and "PreparingRebalance" entries are included
     grep -i "rebalance" /var/log/kafka/server.log | awk '{print $1, $2, $NF}' | tail -50
     ```

  2. Increase max.poll.interval.ms to accommodate processing time: Ensure consumers have enough time to process their assigned batch.

     ```properties
     # Maximum time allowed between poll() calls before the consumer is removed from the group
     max.poll.interval.ms=600000
     # Lower this value if a full batch still cannot be processed within the interval
     max.poll.records=500
     session.timeout.ms=45000
     # Keep the heartbeat interval at no more than one third of session.timeout.ms
     heartbeat.interval.ms=15000
     ```

  3. Enable static group membership to avoid rebalances during rolling restarts: Use a stable group instance ID. Each consumer instance needs its own unique ID that stays the same across restarts, and a restart must complete within session.timeout.ms to avoid a rebalance.

     ```properties
     group.instance.id=consumer-1
     ```

  4. Switch to cooperative sticky partition assignment: Minimize partition movement during rebalances.

     ```properties
     partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
     ```

  5. Verify rebalance stabilization: Monitor group state after applying changes. The group should remain in the Stable state.

     ```bash
     kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
       --describe --group my-consumer-group --state
     ```
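Before rolling out the settings from the steps above, it can help to sanity-check them together. The following is a minimal sketch, not an official Kafka tool: parse_properties and check_consumer_config are hypothetical helper names, the fallback values of 45000 and 3000 assume Kafka 3.x client defaults, and the checks only cover the settings discussed in this guide.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_consumer_config(props):
    """Return a list of warnings for the rebalance-related settings."""
    warnings = []
    # Fallbacks assume Kafka 3.x client defaults (an assumption of this sketch).
    session = int(props.get("session.timeout.ms", 45000))
    heartbeat = int(props.get("heartbeat.interval.ms", 3000))
    if heartbeat * 3 > session:
        warnings.append(
            "heartbeat.interval.ms should be at most one third of session.timeout.ms"
        )
    if not props.get("group.instance.id"):
        warnings.append("group.instance.id is unset, so static membership is disabled")
    if "CooperativeStickyAssignor" not in props.get("partition.assignment.strategy", ""):
        warnings.append(
            "partition.assignment.strategy does not include CooperativeStickyAssignor"
        )
    return warnings


config_text = """
max.poll.interval.ms=600000
max.poll.records=500
session.timeout.ms=45000
heartbeat.interval.ms=15000
group.instance.id=consumer-1
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
"""

print(check_consumer_config(parse_properties(config_text)))  # []
```

An empty list means none of the three checks fired. Raising heartbeat.interval.ms to 20000 in the sample text would produce the heartbeat warning.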

Prevention

  • Size consumer poll intervals based on p99 processing time, not average
  • Use static group membership (group.instance.id) for all production consumer deployments
  • Implement graceful consumer shutdown that commits offsets before leaving the group
  • Monitor rebalance frequency as a key SLO metric, alerting when it exceeds baseline
  • Use the incremental cooperative rebalancing protocol (available to consumers since Kafka 2.4 through CooperativeStickyAssignor) to reduce pause duration
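Alerting on rebalance frequency can be sketched as a comparison of hourly rebalance counts against a baseline. The helper names below are hypothetical, and the code assumes you already have rebalance event timestamps, for example extracted from broker logs or from your metrics system. The timestamps and the baseline of 2 per hour are illustrative only.

```python
from collections import Counter
from datetime import datetime


def rebalances_per_hour(event_times):
    """Count rebalance events in each clock hour."""
    return Counter(
        t.replace(minute=0, second=0, microsecond=0) for t in event_times
    )


def hours_over_baseline(event_times, baseline_per_hour):
    """Return the hours whose rebalance count exceeds the baseline, in time order."""
    counts = rebalances_per_hour(event_times)
    return sorted(hour for hour, n in counts.items() if n > baseline_per_hour)


# Illustrative rebalance timestamps, not real data.
events = [
    datetime(2024, 5, 1, 10, 5),
    datetime(2024, 5, 1, 10, 20),
    datetime(2024, 5, 1, 10, 40),
    datetime(2024, 5, 1, 11, 10),
]

for hour in hours_over_baseline(events, baseline_per_hour=2):
    print("Rebalance rate above baseline in hour starting", hour.isoformat())
```

Here only the 10:00 hour is reported, because it has three rebalances against a baseline of two. In practice the baseline would be derived from a period when the group was known to be healthy.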