Introduction
When a partition leader becomes unavailable, Kafka must elect a new leader from the in-sync replica set (ISR). If the election process takes too long -- due to controller overload, ISR shrinkage, or metadata propagation delays -- the partition remains leaderless, blocking all produce and consume operations for that partition.
Symptoms
- Producers receive
NOT_LEADER_OR_FOLLOWERorLEADER_NOT_AVAILABLEerrors - Consumer fetch requests hang or fail with partition metadata errors
kafka-topics --describeshows partitions withLeader: -1- Controller logs show
Leader election timed outorfailed to complete - Partition unavailability duration exceeds the configured election timeout
Common Causes
- Controller broker overloaded with metadata operations, delaying election processing
- ISR empty or containing only the failed leader, requiring unclean election
- Network latency between controller and broker nodes slowing metadata propagation
- Large number of simultaneous partition failures overwhelming the controller
- ZooKeeper latency affecting controller election and metadata updates
Step-by-Step Fix
- 1.Identify leaderless partitions: Find all partitions without a leader.
- 2.```bash
- 3.kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions
- 4.
` - 5.Trigger manual leader election for stuck partitions: Force a preferred replica election.
- 6.```bash
- 7.kafka-leader-election.sh --bootstrap-server localhost:9092 \
- 8.--election-type preferred --topic my-topic --partition 0
- 9.
` - 10.Check controller broker health: Verify the controller is responsive.
- 11.```bash
- 12.# Find the active controller
- 13.zookeeper-shell.sh localhost:2181 <<< "get /controller" 2>/dev/null | grep brokerid
- 14.# Check controller broker logs
- 15.grep "Controller" /var/log/kafka/server.log | tail -30
- 16.
` - 17.If ISR is empty, perform unclean leader election as last resort: Accept potential data loss to restore availability.
- 18.```bash
- 19.kafka-leader-election.sh --bootstrap-server localhost:9092 \
- 20.--election-type unclean --topic my-topic --partition 0
- 21.
` - 22.Verify partition recovery: Confirm all partitions have elected leaders.
- 23.```bash
- 24.kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-topic | grep -v "Leader: -1"
- 25.
`
Prevention
- Ensure ISR has at least 2 members with
min.insync.replicas=2to maintain election candidates - Monitor controller broker CPU, memory, and request queue depth
- Set
leader.imbalance.check.interval.secondsto detect and correct leader skew proactively - Distribute partition leaders evenly across brokers using
auto.leader.rebalance.enable=true - Keep Kafka controller on a dedicated, well-resourced broker for metadata-heavy clusters
- Monitor election duration and alert when it exceeds 10 seconds