Introduction

The in-sync replica (ISR) set contains all replicas that are fully caught up with the leader. When a follower broker replicates data more slowly than producers write it, the follower falls behind and is removed from the ISR. A shrinking ISR increases the risk of data loss: if the leader fails while it is the only member of the ISR, no other replica is guaranteed to have the latest data.

Symptoms

  • ISR size decreases from expected count, visible in topic describe output
  • Broker metrics show UnderReplicatedPartitions increasing
  • Producer with acks=all experiences increased latency as fewer brokers acknowledge
  • Follower broker logs show Failed to fetch or replica lag warnings
  • Alert: ISR for partition my-topic-0 has shrunk to 1 replicas
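As a quick triage aid, the output of `kafka-topics.sh --describe` can be parsed to flag partitions whose ISR is smaller than their replica set. The helper below is a sketch: `check_isr` is a name of our own, and the field layout assumed in the comment should be verified against your Kafka version's describe output.

```bash
# Flag partitions whose ISR has fewer members than the replica set.
# Assumed line format from kafka-topics.sh --describe:
#   Topic: my-topic  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
check_isr() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i == "Replicas:") replicas = $(i+1)
      if ($i == "Isr:")      isr      = $(i+1)
    }
    nr = split(replicas, a, ",")   # size of the full replica set
    ni = split(isr, b, ",")        # size of the current ISR
    if (ni < nr) print $2, $4, "ISR=" ni "/" nr
  }'
}

# Example with one healthy partition and one shrunken ISR:
printf 'Topic: my-topic Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3\nTopic: my-topic Partition: 1 Leader: 2 Replicas: 1,2,3 Isr: 2\n' | check_isr
# prints: my-topic 1 ISR=1/3
```

Piping the live describe output through the same function turns an eyeball check into something you can run from cron or an alert hook.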

Common Causes

  • Follower broker on slower disk or network connection than the leader
  • Network bandwidth saturation between follower and leader broker
  • Follower broker under heavy consumer load, starving fetch replica threads
  • replica.lag.time.max.ms set too aggressively, removing followers prematurely
  • Leader writing at a rate that exceeds follower disk write capacity
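The last cause is worth quantifying: if the leader's ingest rate exceeds the follower's sustained disk write speed, replica lag grows without bound and no lag-time threshold will save the ISR. A back-of-envelope check, using illustrative numbers rather than measurements from any real cluster:

```bash
# Illustrative (not measured) numbers: compare leader ingest rate
# against the follower's sustained sequential write speed.
leader_write_mb_s=120      # producer ingest observed at the leader
follower_disk_mb_s=90      # disk write speed measured on the follower
deficit=$((leader_write_mb_s - follower_disk_mb_s))
echo "follower falls behind by ${deficit} MB/s"
# prints: follower falls behind by 30 MB/s
```

A positive deficit means the follower can never catch up while the write rate holds; the fix is faster disks or less leader load, not a larger lag threshold.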

Step-by-Step Fix

  1. Check ISR status for all partitions: identify which partitions have a shrunken ISR.

     ```bash
     kafka-topics.sh --bootstrap-server localhost:9092 --describe | grep "Isr:" | grep -v "Isr: 0,1,2"
     ```

  2. Identify the lagging follower broker: check replica lag metrics.

     ```bash
     # Via JMX or a Kafka metrics endpoint
     curl -s http://broker:9999/metrics | grep kafka_server_ReplicaManager_IsrShrinksPerSec
     ```

  3. Increase the replica lag time threshold: give slow followers more time to catch up before removal.

     ```properties
     replica.lag.time.max.ms=60000
     ```

  4. Investigate follower broker resource constraints: check disk I/O, network, and CPU on the lagging broker.

     ```bash
     # Check disk write speed on the follower
     dd if=/dev/zero of=/var/lib/kafka/data/test bs=1M count=1024 oflag=direct 2>&1 | tail -1
     # Check network throughput
     iperf3 -c leader-broker -t 10
     ```

  5. Rebalance partition leadership to distribute load: reduce pressure on the overloaded broker.

     ```bash
     kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
       --execute --reassignment-json-file reassignment.json
     ```
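The `reassignment.json` file passed to `kafka-reassign-partitions.sh` might look like the following sketch; the topic name, partition, and broker IDs are illustrative and must match your cluster.

```json
{
  "version": 1,
  "partitions": [
    { "topic": "my-topic", "partition": 0, "replicas": [2, 3, 4] }
  ]
}
```

The first broker listed in `replicas` becomes the preferred leader, so ordering the lists across partitions is how leadership gets spread away from the overloaded broker.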

Prevention

  • Use homogeneous hardware across all broker nodes to prevent individual follower bottlenecks
  • Set replica.lag.time.max.ms to at least 30000ms (30 seconds) to tolerate temporary slowdowns
  • Monitor ISR size per partition and alert when it drops below the expected count
  • Ensure network bandwidth between brokers is sufficient for peak replication traffic
  • Use dedicated disks for Kafka log data with consistent IOPS guarantees
  • Balance partition leadership across brokers to prevent any single broker from becoming a replication bottleneck
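The monitoring bullet can be made concrete with an alert rule. The sketch below assumes a Prometheus setup scraping a JMX exporter; the exact metric name (here modeled on the `kafka_server_ReplicaManager_*` naming used earlier) varies by exporter configuration and should be checked against yours.

```yaml
# Hypothetical Prometheus alerting rule; metric name depends on
# your JMX exporter's naming scheme.
groups:
  - name: kafka-isr
    rules:
      - alert: UnderReplicatedPartitions
        expr: kafka_server_ReplicaManager_UnderReplicatedPartitions > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Broker {{ $labels.instance }} has under-replicated partitions"
```

The `for: 5m` hold-off keeps brief ISR shrink/expand cycles (e.g. during rolling restarts) from paging anyone.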