Introduction
Kafka consumer group rebalancing redistributes partition ownership among consumers when members join, leave, or fail. During a rebalance under the eager protocol, every consumer in the group stops processing until partitions are reassigned, which creates a throughput gap. When rebalances happen frequently or take too long, the cumulative pause time increases message processing latency and consumer lag.
Symptoms
- Consumer processing throughput drops to zero periodically with no apparent cause
- Kafka consumer group state oscillates between `Stable` and `PreparingRebalance`
- Consumer logs show `The coordinator is not aware of this member` or `Rebalance in progress`
- Consumer lag grows steadily despite adequate consumer capacity
- Application health checks fail during extended rebalance windows
Common Causes
- `max.poll.interval.ms` set too low, causing slow consumers to be kicked out and trigger a rebalance
- Consumer processing takes longer than the poll interval during peak message volume
- Network instability between consumers and the group coordinator causing heartbeat failures
- Rolling deployments removing and adding consumers without static group membership
- GC pauses in JVM-based consumers exceeding session timeout thresholds
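The first two causes can be checked against a configuration before deploying it. The sketch below is a minimal, hedged example: `check_consumer_timeouts` and its `per_record_ms` argument are hypothetical names introduced here, and the worst-case estimate assumes every record in a full batch takes the same time. The heartbeat check follows the usual guidance that `heartbeat.interval.ms` should be no more than one third of `session.timeout.ms`.

```python
def check_consumer_timeouts(config, per_record_ms):
    """Return warnings for timeout settings that are likely to cause rebalances.

    config: dict mapping consumer property names to integer values.
    per_record_ms: assumed worst-case processing time per record (an
        estimate supplied by the caller, not something Kafka reports).
    """
    warnings = []
    max_poll_interval = config["max.poll.interval.ms"]
    max_poll_records = config["max.poll.records"]
    session_timeout = config["session.timeout.ms"]
    heartbeat_interval = config["heartbeat.interval.ms"]

    # Worst case: a full batch in which every record takes per_record_ms.
    worst_case_batch_ms = max_poll_records * per_record_ms
    if worst_case_batch_ms >= max_poll_interval:
        warnings.append(
            "a full batch may take %d ms, which is not below "
            "max.poll.interval.ms=%d" % (worst_case_batch_ms, max_poll_interval)
        )

    # Guidance: heartbeat interval at most one third of the session timeout.
    if heartbeat_interval * 3 > session_timeout:
        warnings.append(
            "heartbeat.interval.ms=%d is more than one third of "
            "session.timeout.ms=%d" % (heartbeat_interval, session_timeout)
        )
    return warnings
```

If the first warning fires, either raise `max.poll.interval.ms` or lower `max.poll.records` so that a full batch fits inside the interval.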
Step-by-Step Fix
1. Identify rebalance frequency and duration from broker logs: Check how often rebalances occur.
   ```bash
   grep "Rebalance" /var/log/kafka/server.log | awk '{print $1, $2, $NF}' | tail -50
   ```
2. Increase max.poll.interval.ms to accommodate processing time: Ensure consumers have enough time to process their assigned batch.
   ```properties
   max.poll.interval.ms=600000
   max.poll.records=500
   session.timeout.ms=45000
   heartbeat.interval.ms=15000
   ```
3. Enable static group membership to avoid rebalances during rolling restarts: Give each consumer instance a stable group instance ID that is unique within the group.
   ```properties
   group.instance.id=consumer-1
   ```
4. Switch to cooperative sticky partition assignment: Minimize partition movement during rebalances.
   ```properties
   partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
   ```
5. Verify rebalance stabilization: Monitor group state after applying changes.
   ```bash
   kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
     --describe --group my-consumer-group --state
   ```
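To turn the broker log lines from the first step into a frequency figure, the matches can be bucketed by hour. This is a rough sketch under stated assumptions: it assumes each log line begins with a timestamp of the form `[YYYY-MM-DD HH:MM:SS,mmm]`, and it matches the word "rebalance" case-insensitively, which is broader than the `grep` pattern above. The function name `rebalances_per_hour` and the lines in `SAMPLE_LINES` are made up for illustration and are not real broker output.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed timestamp prefix, e.g. "[2024-05-01 10:02:11,120]".
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\]")


def rebalances_per_hour(lines):
    """Count log lines mentioning a rebalance, grouped by hour."""
    counts = Counter()
    for line in lines:
        if "rebalance" not in line.lower():
            continue
        match = TIMESTAMP.match(line)
        if match is None:
            continue  # skip lines without the assumed timestamp prefix
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        counts[ts.strftime("%Y-%m-%d %H:00")] += 1
    return dict(counts)


# Invented sample lines, for illustration only.
SAMPLE_LINES = [
    "[2024-05-01 10:02:11,120] INFO Preparing to rebalance group my-consumer-group",
    "[2024-05-01 10:40:03,007] INFO Preparing to rebalance group my-consumer-group",
    "[2024-05-01 11:15:45,530] INFO Preparing to rebalance group my-consumer-group",
    "[2024-05-01 11:16:00,001] INFO Stabilized group my-consumer-group",
]

if __name__ == "__main__":
    for hour, count in sorted(rebalances_per_hour(SAMPLE_LINES).items()):
        print(hour, count)
```

Comparing the hourly counts before and after the configuration changes gives a simple check that the fix actually reduced rebalance frequency.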
Prevention
- Size consumer poll intervals based on p99 processing time, not average
- Use static group membership (`group.instance.id`) for all production consumer deployments
- Implement graceful consumer shutdown that commits offsets before leaving the group
- Monitor rebalance frequency as a key SLO metric, alerting when it exceeds baseline
- Use the incremental cooperative rebalancing protocol (available to consumers since Kafka 2.4) to reduce pause duration
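Sizing the poll interval from p99 rather than the average can be sketched as follows. This is an illustrative example, not a prescribed method: the function names, the nearest-rank percentile definition, and the `headroom` multiplier of 2.0 are all assumptions chosen for the sketch, and the input is a list of measured per-batch processing durations in milliseconds that the application would need to collect itself.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def suggest_max_poll_interval_ms(batch_durations_ms, headroom=2.0):
    """Suggest max.poll.interval.ms as p99 batch duration times a safety factor."""
    p99 = percentile(batch_durations_ms, 99)
    return int(math.ceil(p99 * headroom))
```

For batch durations spread evenly from 1 to 100 seconds, the average is about 50 seconds while p99 is 99 seconds, so an interval sized from the average would eject the consumer on roughly half of its batches.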