Introduction
A split-brain condition in a message broker cluster occurs when a network partition divides the cluster into two or more isolated groups of nodes, each of which continues to elect leaders and accept writes as if it were the authoritative cluster. This can cause divergent message states, lost writes, and data inconsistency that requires careful manual reconciliation after the partition heals.
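The quorum arithmetic explains why only one side of a partition should be allowed to keep running: a group of nodes may act authoritatively only if it holds a strict majority of the cluster. A minimal sketch of that arithmetic (the cluster sizes are illustrative):

```bash
# Quorum for an n-node cluster is a strict majority: floor(n/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# A 3-node cluster split 2/1: the 2-node side meets quorum (2 >= 2),
# the isolated node does not, so only one side may elect a leader.
quorum 3   # -> 2

# A 4-node cluster can split 2/2, leaving *neither* side with quorum (3);
# this is one reason odd cluster sizes are preferred.
quorum 4   # -> 3
```

Split brain arises when an implementation fails to enforce this rule, typically because each side still believes it can reach a majority.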
Symptoms
- Different broker nodes report different leader elections for the same partitions
- Producers connected to the minority partition receive acknowledgments for writes that the majority partition never recorded
- Consumer groups show offset divergence between nodes in different partition segments
- Broker logs show `NetworkPartitionException` or "partition isolated" messages
- Cluster metadata shows inconsistent partition assignments across nodes
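A quick way to confirm the first symptom from a shell is to ask each broker who it thinks the leader is and compare the answers. In the sketch below, `describe_via` is a stand-in that returns canned, illustrative `--describe` output; on a live cluster it would instead run `kafka-topics.sh --bootstrap-server "$1:9092" --describe --topic my-topic` (broker and topic names are placeholders):

```bash
# Stand-in for querying each broker; the canned lines mimic
# kafka-topics.sh --describe output during a split brain.
describe_via() {
  case "$1" in
    broker-1|broker-2) echo "Topic: my-topic Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2" ;;
    broker-3)          echo "Topic: my-topic Partition: 0 Leader: 3 Replicas: 1,2,3 Isr: 3" ;;
  esac
}

leaders=""
for b in broker-1 broker-2 broker-3; do
  leader=$(describe_via "$b" | grep -oE 'Leader: [0-9]+' | cut -d' ' -f2)
  echo "$b reports leader $leader"
  leaders="$leaders $leader"
done

# More than one distinct leader for the same partition means split brain.
distinct=$(echo $leaders | tr ' ' '\n' | sort -u | wc -l)
[ "$distinct" -gt 1 ] && echo "SPLIT BRAIN SUSPECTED" || echo "leaders agree"
```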
Common Causes
- Network switch or router failure isolating a subset of broker nodes
- Cloud provider availability zone outage severing inter-node communication
- Firewall rule changes inadvertently blocking inter-broker communication ports
- DNS resolution failures preventing brokers from discovering each other
- Asymmetric network partition where some broker pairs can communicate but others cannot
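The first four causes can be triaged quickly from any broker host. A sketch, with `HOSTS` and `PORT` as placeholders (defaulting to `localhost` and `9092` here) for the real broker list and inter-broker listener port:

```bash
# Triage network-level causes: DNS resolution first, then TCP reachability.
HOSTS="${HOSTS:-localhost}"   # replace with your broker hostnames
PORT="${PORT:-9092}"          # inter-broker listener port

for h in $HOSTS; do
  if getent hosts "$h" >/dev/null 2>&1; then
    dns=ok
  else
    dns=FAIL    # DNS resolution failure
  fi
  if timeout 3 bash -c "exec 3<>/dev/tcp/$h/$PORT" 2>/dev/null; then
    tcp=ok
  else
    tcp=FAIL    # port blocked, route down, or host isolated
  fi
  echo "$h dns=$dns tcp=$tcp"
done
```

Run it from every broker in both directions: an asymmetric partition (the last cause above) only shows up when A-to-B and B-to-A are tested separately.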
Step-by-Step Fix
1. Confirm the network partition: Verify connectivity between all broker nodes.

   ```bash
   # From each broker, test connectivity to all others
   for broker in broker-1 broker-2 broker-3; do
     nc -zv $broker 9092 && echo "$broker: reachable" || echo "$broker: unreachable"
   done
   ```

2. Identify the majority partition and isolate the minority: Determine which side of the partition holds the quorum.

   ```bash
   # Check which brokers form the majority
   zookeeper-shell.sh localhost:2181 <<< "ls /brokers/ids" 2>/dev/null | grep -o "[0-9]\+"
   ```

3. Shut down minority partition brokers: Prevent further divergence from the isolated nodes.

   ```bash
   # On minority nodes only
   systemctl stop kafka
   ```

4. Force leader election on the majority partition: Ensure the majority partition re-establishes leadership.

   ```bash
   kafka-leader-election.sh --bootstrap-server broker-1:9092 \
     --election-type unclean --topic my-topic --partition 0
   ```

5. After network recovery, rejoin minority nodes and resync data: Restart the isolated brokers and allow the in-sync replica (ISR) set to rebuild.

   ```bash
   systemctl start kafka
   kafka-topics.sh --bootstrap-server broker-1:9092 --describe --topic my-topic
   ```
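After the final step, it helps to verify that the in-sync replica set has converged back to the full replica set before declaring recovery. A sketch that parses `--describe` output (the sample lines are illustrative; on a live cluster you would pipe the real `kafka-topics.sh --bootstrap-server broker-1:9092 --describe --topic my-topic` output into it instead):

```bash
# A partition is fully resynced when its Isr list equals its Replicas list.
# (Simplification: this compares the lists as strings, so it assumes the
# broker lists them in the same order, which kafka-topics.sh normally does.)
check_resync() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i == "Replicas:") replicas = $(i+1)
      if ($i == "Isr:")      isr      = $(i+1)
    }
    if (replicas == isr) print "resynced"; else print "UNDER-REPLICATED"
  }'
}

# Illustrative describe output while broker 3 is still catching up:
echo "Topic: my-topic Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2" | check_resync
# -> UNDER-REPLICATED
# ...and after the ISR has rebuilt:
echo "Topic: my-topic Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3" | check_resync
# -> resynced
```

Alternatively, `kafka-topics.sh --describe --under-replicated-partitions` reports only the partitions whose ISR has not yet caught up; empty output means the resync is complete.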
Prevention
- Deploy broker nodes across at least 3 failure domains (availability zones, racks) to maintain quorum
- Configure `unclean.leader.election.enable=false` to prevent data loss from out-of-sync leaders
- Use `min.insync.replicas=2` to ensure writes are acknowledged by multiple brokers before committing
- Monitor inter-node network latency and packet loss, alerting before partitions form
- Implement automated network partition detection with runbook-guided recovery procedures
- Test partition scenarios regularly in staging using chaos engineering tools like Litmus or Chaos Mesh
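The broker-side settings above combine into a small config fragment; a sketch of the relevant `server.properties` lines (the replication factor of 3 is an illustrative choice, sized to the three failure domains recommended above):

```properties
# Refuse to elect a leader that was not in the in-sync replica set,
# trading availability for consistency after a partition.
unclean.leader.election.enable=false

# Require acknowledgment from at least 2 replicas before a write commits
# (producers must also set acks=all for this to take effect).
min.insync.replicas=2

# With 3 replicas and min.insync.replicas=2, the cluster tolerates the
# loss of one failure domain without losing acknowledged writes.
default.replication.factor=3
```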