Introduction
When a message queue cluster undergoes partition rebalancing -- triggered by broker additions, removals, or consumer group membership changes -- consumer lag can grow rapidly. During rebalance, all consumers in the group temporarily pause processing while partition ownership is redistributed, creating a processing gap that accumulates backlogged messages.
Symptoms
- Consumer lag metrics increase sharply during or immediately after a rebalance event
- Consumer group state shows `PreparingRebalance` or `CompletingRebalance` for extended periods
- Processing throughput drops to near zero during the rebalance window
- Downstream services experience delayed data or stale dashboards
- Rebalance storms -- rapid consecutive rebalances -- compound the lag accumulation
Common Causes
- Consumer group coordinator triggering frequent rebalances due to session timeout misconfiguration
- Large partition counts requiring extended assignment computation and state transfer
- Slow consumers failing heartbeat checks, triggering cascading rebalance cycles
- Offset commits deferred until after rebalance completion, losing progress tracking
- Network partition between consumers and the group coordinator causing timeout-driven rebalances
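The first cause above is often simple arithmetic: the usual Kafka guidance is to keep `heartbeat.interval.ms` at no more than one third of `session.timeout.ms`, and `max.poll.interval.ms` must cover the worst-case time to process one poll batch. A minimal standalone sketch of that sanity check (the `validateTimeouts` helper and its thresholds are illustrative, not part of any Kafka API):

```java
import java.util.ArrayList;
import java.util.List;

public class RebalanceConfigCheck {
    // Returns warnings for timeout settings that commonly cause spurious
    // rebalances: heartbeat should be <= session/3, and the poll interval
    // must exceed worst-case batch processing time.
    static List<String> validateTimeouts(long sessionTimeoutMs,
                                         long heartbeatIntervalMs,
                                         long maxPollIntervalMs,
                                         long maxPollRecords,
                                         long perRecordMs) {
        List<String> warnings = new ArrayList<>();
        if (heartbeatIntervalMs > sessionTimeoutMs / 3) {
            warnings.add("heartbeat.interval.ms should be <= session.timeout.ms / 3");
        }
        long worstCaseBatchMs = maxPollRecords * perRecordMs;
        if (worstCaseBatchMs > maxPollIntervalMs) {
            warnings.add("max.poll.interval.ms too small for a full batch ("
                    + worstCaseBatchMs + " ms worst case)");
        }
        return warnings;
    }

    public static void main(String[] args) {
        // 45 s session, 15 s heartbeat: right at the 1/3 boundary, no warning.
        System.out.println(validateTimeouts(45_000, 15_000, 600_000, 500, 100));
        // 500 records at 2 s each = 1,000,000 ms > 600,000 ms poll interval.
        System.out.println(validateTimeouts(45_000, 15_000, 600_000, 500, 2_000));
    }
}
```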
Step-by-Step Fix
1. Identify the rebalance trigger: Check consumer group coordinator logs for the root cause of the rebalance event.

```bash
# For Kafka-based systems, check consumer group state
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-consumer-group
```

2. Increase session timeout to reduce false-positive rebalances: Extend the heartbeat interval and session timeout to tolerate temporary network hiccups.

```properties
session.timeout.ms=45000
heartbeat.interval.ms=15000
max.poll.interval.ms=600000
```

3. Enable incremental cooperative rebalancing: Switch from eager stop-the-world rebalancing to the incremental cooperative protocol, which only moves affected partitions.

```properties
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```

4. Process remaining records before committing offsets during shutdown: Implement a graceful shutdown hook that drains the current poll batch.

```java
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    // Interrupts a blocking poll(); the poll loop should catch
    // WakeupException, commit offsets, then close the consumer.
    consumer.wakeup();
}));
```

5. Monitor lag recovery rate after rebalance completes: Verify consumers catch up at an acceptable rate.

```bash
# Sum the LAG column (column 6) across all partitions, skipping the header row
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-consumer-group | awk '{print $6}' | tail -n +2 | paste -sd+ | bc
```
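Once the total lag from step 5 is in hand, a back-of-the-envelope check tells you whether consumers are actually catching up: lag only drains when the consume rate exceeds the produce rate. A minimal sketch (the `estimateCatchUpSeconds` helper is illustrative, not a Kafka API):

```java
public class LagRecovery {
    // Estimates seconds until lag reaches zero, assuming steady rates.
    // Returns -1 when consumers are not catching up (consume <= produce).
    static long estimateCatchUpSeconds(long lagMessages,
                                       double consumeRatePerSec,
                                       double produceRatePerSec) {
        double drainRate = consumeRatePerSec - produceRatePerSec;
        if (drainRate <= 0) {
            return -1; // lag is flat or growing; scale out or tune consumers
        }
        return (long) Math.ceil(lagMessages / drainRate);
    }

    public static void main(String[] args) {
        // 120k backlogged messages, consuming 5k/s against 3k/s incoming:
        // a net drain of 2k/s clears the backlog in 60 seconds.
        System.out.println(estimateCatchUpSeconds(120_000, 5_000, 3_000)); // 60
        System.out.println(estimateCatchUpSeconds(120_000, 2_000, 3_000)); // -1
    }
}
```

If the estimate comes back negative or unacceptably long, that is the signal to add consumer instances (up to the partition count) rather than wait out the backlog.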
Prevention
- Use cooperative sticky partition assignment to minimize rebalance scope and duration
- Tune `max.poll.records` to ensure consumers can process a batch within `max.poll.interval.ms`
- Deploy consumer health checks that alert on rebalance frequency, not just lag magnitude
- Size consumer groups to match partition counts, avoiding over-provisioning that triggers unnecessary rebalances
- Implement static group membership with `group.instance.id` to prevent transient restarts from triggering rebalances
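The last two prevention items combine naturally in consumer configuration. A sketch, assuming a deployment where each consumer instance is assigned a distinct, stable `group.instance.id` (the `payments-consumer-1` value is a placeholder):

```properties
# Static membership: restarts completing within session.timeout.ms
# do not trigger a rebalance
group.instance.id=payments-consumer-1
session.timeout.ms=45000

# Cooperative rebalancing: only reassigned partitions pause
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```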