Introduction

The in-sync replica (ISR) set contains all replicas that are fully caught up with the leader. When a follower broker replicates data more slowly than producers write it, the follower falls behind and is removed from the ISR. A shrinking ISR increases the risk of data loss: if the leader fails while it is the only member of the ISR, no other replica is guaranteed to have the latest data.

Symptoms

  • ISR size decreases from expected count, visible in topic describe output
  • Broker metrics show UnderReplicatedPartitions increasing
  • Producer with acks=all experiences increased latency as fewer brokers acknowledge
  • Follower broker logs show Failed to fetch or replica lag warnings
  • Alert: ISR for partition my-topic-0 has shrunk to 1 replicas
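As a quick triage aid, the output of `kafka-topics.sh --describe` can be parsed to flag partitions whose ISR is smaller than their replica set. The helper below is a sketch: `check_isr` is a name of our own, and the field layout assumed in the comment should be verified against your Kafka version's describe output.

```bash
# Flag partitions whose ISR has fewer members than the replica set.
# Assumed line format from kafka-topics.sh --describe:
#   Topic: my-topic  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
check_isr() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i == "Replicas:") replicas = $(i+1)
      if ($i == "Isr:")      isr      = $(i+1)
    }
    nr = split(replicas, a, ",")   # size of the full replica set
    ni = split(isr, b, ",")        # size of the current ISR
    if (ni < nr) print $2, $4, "ISR=" ni "/" nr
  }'
}

# Example with one healthy partition and one shrunken ISR:
printf 'Topic: my-topic Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3\nTopic: my-topic Partition: 1 Leader: 2 Replicas: 1,2,3 Isr: 2\n' | check_isr
# prints: my-topic 1 ISR=1/3
```

Piping the live describe output through the same function turns an eyeball check into something you can run from cron or an alert hook.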

Common Causes

  • Follower broker on slower disk or network connection than the leader
  • Network bandwidth saturation between follower and leader broker
  • Follower broker under heavy consumer load, starving fetch replica threads
  • replica.lag.time.max.ms set too aggressively, removing followers prematurely
  • Leader writing at a rate that exceeds follower disk write capacity
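The last cause is worth quantifying: if the leader's ingest rate exceeds the follower's sustained disk write speed, replica lag grows without bound and no lag-time threshold will save the ISR. A back-of-envelope check, using illustrative numbers rather than measurements from any real cluster:

```bash
# Illustrative (not measured) numbers: compare leader ingest rate
# against the follower's sustained sequential write speed.
leader_write_mb_s=120      # producer ingest observed at the leader
follower_disk_mb_s=90      # disk write speed measured on the follower
deficit=$((leader_write_mb_s - follower_disk_mb_s))
echo "follower falls behind by ${deficit} MB/s"
# prints: follower falls behind by 30 MB/s
```

A positive deficit means the follower can never catch up while the write rate holds; the fix is faster disks or less leader load, not a larger lag threshold.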

Step-by-Step Fix

  1. Check ISR status for all partitions: identify which partitions have a shrunken ISR.

     ```bash
     kafka-topics.sh --bootstrap-server localhost:9092 --describe | grep "Isr:" | grep -v "Isr: 0,1,2"
     ```

  2. Identify the lagging follower broker: check replica lag metrics.

     ```bash
     # Via JMX or a Kafka metrics endpoint
     curl -s http://broker:9999/metrics | grep kafka_server_ReplicaManager_IsrShrinksPerSec
     ```

  3. Increase the replica lag time threshold: give slow followers more time to catch up before removal.

     ```properties
     replica.lag.time.max.ms=60000
     ```

  4. Investigate follower broker resource constraints: check disk I/O, network, and CPU on the lagging broker.

     ```bash
     # Check disk write speed on the follower
     dd if=/dev/zero of=/var/lib/kafka/data/test bs=1M count=1024 oflag=direct 2>&1 | tail -1
     # Check network throughput
     iperf3 -c leader-broker -t 10
     ```

  5. Rebalance partition leadership to distribute load: reduce pressure on the overloaded broker.

     ```bash
     kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
       --execute --reassignment-json-file reassignment.json
     ```
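The `reassignment.json` file passed to `kafka-reassign-partitions.sh` might look like the following sketch; the topic name, partition, and broker IDs are illustrative and must match your cluster.

```json
{
  "version": 1,
  "partitions": [
    { "topic": "my-topic", "partition": 0, "replicas": [2, 3, 4] }
  ]
}
```

The first broker listed in `replicas` becomes the preferred leader, so ordering the lists across partitions is how leadership gets spread away from the overloaded broker.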

Prevention

  • Use homogeneous hardware across all broker nodes to prevent individual follower bottlenecks
  • Set replica.lag.time.max.ms to at least 30000ms (30 seconds) to tolerate temporary slowdowns
  • Monitor ISR size per partition and alert when it drops below the expected count
  • Ensure network bandwidth between brokers is sufficient for peak replication traffic
  • Use dedicated disks for Kafka log data with consistent IOPS guarantees
  • Balance partition leadership across brokers to prevent any single broker from becoming a replication bottleneck
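The monitoring bullet can be made concrete with an alert rule. The sketch below assumes a Prometheus setup scraping a JMX exporter; the exact metric name (here modeled on the `kafka_server_ReplicaManager_*` naming used earlier) varies by exporter configuration and should be checked against yours.

```yaml
# Hypothetical Prometheus alerting rule; metric name depends on
# your JMX exporter's naming scheme.
groups:
  - name: kafka-isr
    rules:
      - alert: UnderReplicatedPartitions
        expr: kafka_server_ReplicaManager_UnderReplicatedPartitions > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Broker {{ $labels.instance }} has under-replicated partitions"
```

The `for: 5m` hold-off keeps brief ISR shrink/expand cycles (e.g. during rolling restarts) from paging anyone.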