Introduction

When a partition leader becomes unavailable, Kafka must elect a new leader from the in-sync replica set (ISR). If the election process takes too long -- due to controller overload, ISR shrinkage, or metadata propagation delays -- the partition remains leaderless, blocking all produce and consume operations for that partition.

Symptoms

  • Producers receive NOT_LEADER_OR_FOLLOWER or LEADER_NOT_AVAILABLE errors
  • Consumer fetch requests hang or fail with partition metadata errors
  • kafka-topics --describe shows partitions with Leader: -1
  • Controller logs show Leader election timed out or failed to complete
  • Partition unavailability duration exceeds the configured election timeout

Common Causes

  • Controller broker overloaded with metadata operations, delaying election processing
  • ISR empty or containing only the failed leader, requiring unclean election
  • Network latency between controller and broker nodes slowing metadata propagation
  • Large number of simultaneous partition failures overwhelming the controller
  • ZooKeeper latency affecting controller election and metadata updates

Step-by-Step Fix

  1. 1.Identify leaderless partitions: Find all partitions without a leader.
  2. 2.```bash
  3. 3.kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions
  4. 4.`
  5. 5.Trigger manual leader election for stuck partitions: Force a preferred replica election.
  6. 6.```bash
  7. 7.kafka-leader-election.sh --bootstrap-server localhost:9092 \
  8. 8.--election-type preferred --topic my-topic --partition 0
  9. 9.`
  10. 10.Check controller broker health: Verify the controller is responsive.
  11. 11.```bash
  12. 12.# Find the active controller
  13. 13.zookeeper-shell.sh localhost:2181 <<< "get /controller" 2>/dev/null | grep brokerid
  14. 14.# Check controller broker logs
  15. 15.grep "Controller" /var/log/kafka/server.log | tail -30
  16. 16.`
  17. 17.If ISR is empty, perform unclean leader election as last resort: Accept potential data loss to restore availability.
  18. 18.```bash
  19. 19.kafka-leader-election.sh --bootstrap-server localhost:9092 \
  20. 20.--election-type unclean --topic my-topic --partition 0
  21. 21.`
  22. 22.Verify partition recovery: Confirm all partitions have elected leaders.
  23. 23.```bash
  24. 24.kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-topic | grep -v "Leader: -1"
  25. 25.`

Prevention

  • Ensure ISR has at least 2 members with min.insync.replicas=2 to maintain election candidates
  • Monitor controller broker CPU, memory, and request queue depth
  • Set leader.imbalance.check.interval.seconds to detect and correct leader skew proactively
  • Distribute partition leaders evenly across brokers using auto.leader.rebalance.enable=true
  • Keep Kafka controller on a dedicated, well-resourced broker for metadata-heavy clusters
  • Monitor election duration and alert when it exceeds 10 seconds