## Introduction
The in-sync replica (ISR) set contains all replicas that are fully caught up with the leader. When a follower broker replicates data more slowly than the producer write rate, it falls behind and is removed from the ISR. A shrinking ISR increases the risk of data loss: if the leader fails while the ISR has only one member, no other replica has the latest data.
## Symptoms

- ISR size decreases from the expected count, visible in topic describe output
- Broker metrics show `UnderReplicatedPartitions` increasing
- Producers with `acks=all` experience increased latency as fewer brokers acknowledge each write
- Follower broker logs show `Failed to fetch` or `replica lag` warnings
- Alert: `ISR for partition my-topic-0 has shrunk to 1 replicas`
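To confirm the symptom quickly, `kafka-topics.sh` can list only the affected partitions (the bootstrap address is an assumption for your environment):

```shell
# List only partitions whose ISR is smaller than the full replica set.
# Requires a reachable broker; localhost:9092 is an assumed address.
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
```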
## Common Causes

- Follower broker on a slower disk or network connection than the leader
- Network bandwidth saturation between follower and leader brokers
- Follower broker under heavy consumer load, starving replica fetcher threads
- `replica.lag.time.max.ms` set too aggressively, removing followers prematurely
- Leader writing at a rate that exceeds the follower's disk write capacity
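The last cause reduces to a simple capacity check: the follower's sustained disk write rate must meet or exceed the leader's ingest rate. A minimal sketch of that comparison, where both rates are assumed example numbers (measure the real ones from producer metrics and the `dd` test in the fix steps below):

```shell
# Sketch: compare leader ingest rate with follower disk write throughput.
# Both values are illustrative assumptions, not measurements.
ingest_mb_s=120   # leader write rate in MB/s, from producer metrics
disk_mb_s=95      # follower sustained disk write rate in MB/s, from dd

if [ "$disk_mb_s" -lt "$ingest_mb_s" ]; then
  echo "follower cannot keep up: deficit $((ingest_mb_s - disk_mb_s)) MB/s"
else
  echo "follower disk can sustain the leader's write rate"
fi
```

If the deficit is persistent rather than a transient spike, no lag-timeout tuning will fix it; the follower's hardware or the partition placement has to change.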
## Step-by-Step Fix

1. Check ISR status for all partitions: Identify which partitions have a shrunken ISR.

   ```bash
   kafka-topics.sh --bootstrap-server localhost:9092 --describe | grep "Isr:" | grep -v "Isr: 0,1,2"
   ```

2. Identify the lagging follower broker: Check replica lag metrics.

   ```bash
   # Via JMX or a Kafka metrics endpoint
   curl -s http://broker:9999/metrics | grep kafka_server_ReplicaManager_IsrShrinksPerSec
   ```

3. Increase the replica lag time threshold: Give slow followers more time to catch up before removal.

   ```properties
   replica.lag.time.max.ms=60000
   ```

4. Investigate follower broker resource constraints: Check disk I/O, network, and CPU on the lagging broker.

   ```bash
   # Check disk write speed on the follower
   dd if=/dev/zero of=/var/lib/kafka/data/test bs=1M count=1024 oflag=direct 2>&1 | tail -1
   # Check network throughput to the leader
   iperf3 -c leader-broker -t 10
   ```

5. Rebalance partition leadership to distribute load: Reduce pressure on the overloaded broker.

   ```bash
   kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
     --execute --reassignment-json-file reassignment.json
   ```
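The reassignment step reads a `reassignment.json` file. A minimal sketch of its format, where the topic name and broker IDs are assumptions (real plans are usually produced with `kafka-reassign-partitions.sh --generate`):

```shell
# Write a minimal reassignment plan. Topic and broker IDs are assumptions;
# the first broker listed in "replicas" becomes the preferred leader.
cat > reassignment.json <<'EOF'
{
  "version": 1,
  "partitions": [
    { "topic": "my-topic", "partition": 0, "replicas": [1, 2, 0] }
  ]
}
EOF
```

Moving the preferred leader of hot partitions onto a lightly loaded broker shifts both produce traffic and replication fan-out away from the overloaded node.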
## Prevention
- Use homogeneous hardware across all broker nodes to prevent individual follower bottlenecks
- Set `replica.lag.time.max.ms` to at least 30000 (30 seconds) to tolerate temporary slowdowns
- Monitor ISR size per partition and alert when it drops below the expected count
- Ensure network bandwidth between brokers is sufficient for peak replication traffic
- Use dedicated disks for Kafka log data with consistent IOPS guarantees
- Balance partition leadership across brokers to prevent any single broker from becoming a replication bottleneck
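The per-partition ISR monitoring above can be sketched as a small script over `kafka-topics.sh --describe` output. The expected count and the sample input below are assumptions for illustration; in production, pipe in live describe output instead:

```shell
#!/usr/bin/env bash
# Sketch: print every partition whose ISR has fewer members than expected.
# EXPECTED and the sample input below are assumptions for illustration.
EXPECTED=3

check_isr() {
  # Reads describe output on stdin; prints one line per shrunken partition.
  awk -v expected="$EXPECTED" '
    /Isr:/ {
      # Grab the comma-separated broker list that follows "Isr:"
      for (i = 1; i <= NF; i++) if ($i == "Isr:") isr = $(i + 1)
      if (split(isr, brokers, ",") < expected) print $0
    }'
}

# Sample describe output standing in for a live cluster
check_isr <<'EOF'
Topic: my-topic  Partition: 0  Leader: 0  Replicas: 0,1,2  Isr: 0,1
Topic: my-topic  Partition: 1  Leader: 1  Replicas: 1,2,0  Isr: 0,1,2
EOF
```

Wiring this into a cron job or exporter gives an alert the moment any partition drops below its expected replica count, rather than waiting for producer latency to surface the problem.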