## Introduction
RabbitMQ quorum queues use the Raft consensus algorithm to replicate messages across cluster nodes. During a network partition, if the partition divides the cluster such that no single partition has a majority of nodes, the quorum queue cannot elect a leader. This split brain scenario makes the queue unavailable for both publishing and consuming until the partition heals and a majority can be re-established.
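The majority requirement is plain arithmetic: a queue with N replicas needs floor(N/2) + 1 of them reachable to elect a leader. A minimal sketch:

```bash
# Quorum (majority) size for a queue with N replicas: floor(N/2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 3   # → 2: a 3-replica queue survives the loss of one node
quorum 5   # → 3: a 5-replica queue survives the loss of two nodes
```

This is also why even replica counts buy nothing: `quorum 4` is 3, giving the same fault tolerance as 3 replicas while adding a node that can fail.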
## Symptoms
- Quorum queue reports no leader available in management UI
- Producers receive `NO_ROUTE` or connection errors when publishing to quorum queues
- Consumer connections hang waiting for a leader to become available
- RabbitMQ logs show `Raft election timeout` and `no leader elected`
- Error message: `Quorum queue my-queue has no leader, operations are blocked`
## Common Causes
- Network partition splitting the cluster so that no partition holds a majority of a queue's members (e.g. a 3-node cluster split 1+1+1; note that in a clean 2+1 split, the two-node side retains quorum and can still elect a leader)
- Cloud provider availability zone outage taking down a majority of quorum queue members
- Asymmetric network partition where different node pairs have different connectivity
- Quorum queue members distributed unevenly across failure domains
- Node crashes during active leader election, reducing the available quorum
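Whether a given split blocks the queue follows from applying the majority rule to each side of the partition separately. A small sketch (node counts assume a 3-member queue):

```bash
# Does one side of a partition retain quorum?
# TOTAL = queue members, REACHABLE = members on this side of the split.
has_quorum() { [ "$2" -ge $(( $1 / 2 + 1 )) ] && echo yes || echo no; }

has_quorum 3 2   # 2+1 split, larger side: yes — a leader can be elected here
has_quorum 3 1   # 1+1+1 split, every side: no — the queue is fully unavailable
```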
## Step-by-Step Fix
1. Check quorum queue status and leader state: Identify the affected queues.

   ```bash
   rabbitmqctl list_queues name type state leader
   ```
2. Diagnose the network partition: Verify connectivity between nodes.

   ```bash
   rabbitmqctl cluster_status
   # Check which nodes can communicate
   for node in node1 node2 node3; do
     rabbitmqctl ping -n rabbit@$node
   done
   ```
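One-way pings from a single workstation can miss an asymmetric partition, where node A reaches B but B cannot reach A. A sketch of an all-pairs connectivity matrix; `can_reach` is a hypothetical stub standing in for a real check executed *on* node `$1` toward node `$2` (e.g. `rabbitmqctl ping` over ssh, or a TCP probe of the inter-node port 25672):

```bash
# can_reach A B: stub for a reachability check from A to B.
# The hardcoded one-way failure below is purely illustrative.
can_reach() { case "$1:$2" in node1:node3) return 1 ;; *) return 0 ;; esac; }

nodes="node1 node2 node3"
for a in $nodes; do
  for b in $nodes; do
    [ "$a" = "$b" ] && continue
    if can_reach "$a" "$b"; then echo "$a -> $b ok"; else echo "$a -> $b BROKEN"; fi
  done
done
```

In this illustrative run, `node1 -> node3` is broken while `node3 -> node1` is fine, which is exactly the asymmetric case listed under Common Causes.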
3. After the partition heals, wait for automatic leader election: Raft self-heals once a majority of members can communicate again.

   ```bash
   # Monitor election progress
   rabbitmqctl list_queues name state leader
   # Wait for the state to change from 'no_leader' to 'running'
   ```
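While waiting, it can help to reduce the listing to just the queues still missing a leader. A sketch that filters the output; the two sample lines below stand in for real `rabbitmqctl list_queues name state` output, and the exact state strings may vary by RabbitMQ version:

```bash
# Print only queues whose state is not 'running' (i.e. still leaderless).
stuck_queues() { awk '$2 != "running" { print $1 }'; }

# Illustrative input, not captured from a live cluster:
printf 'orders running\npayments no_leader\n' | stuck_queues   # → payments
```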
4. Force quorum queue recovery if automatic election fails: As a last resort, when the lost members will never return, shrink quorum queue membership down to the surviving node the command runs on. On RabbitMQ 3.10 or later this override is available via `rabbitmqctl eval`; it bypasses Raft safety and can lose unreplicated messages.

   ```bash
   rabbitmqctl eval 'rabbit_quorum_queue:force_all_queues_shrink_member_to_current_member().'
   ```
5. Verify queue consistency after recovery: Check that messages are intact.

   ```bash
   rabbitmqctl list_queues name messages
   ```
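A raw message count says little on its own; comparing against a snapshot taken before the incident is more telling. One way to do that, sketched below with illustrative file names and numbers (`join` requires both files sorted by queue name, and any header lines from the real command would need trimming first):

```bash
# before.txt / after.txt hold "queue count" pairs, e.g. captured with:
#   rabbitmqctl list_queues name messages | sort > before.txt
printf 'orders 120\npayments 45\n' > before.txt
printf 'orders 120\npayments 30\n' > after.txt

# Report queues whose depth changed across the recovery.
join before.txt after.txt | awk '$2 != $3 { print $1 ": " $2 " -> " $3 }'
# → payments: 45 -> 30
```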
## Prevention
- Deploy quorum queue members across at least 3 failure domains (nodes, zones, racks)
- Use odd-numbered cluster sizes (3, 5, 7) to ensure a clear majority is always possible
- Configure `quorum_commands_soft_timeout` and `quorum_commands_hard_timeout` appropriately
- Monitor quorum queue leader status and alert on the `no_leader` state
- Test network partition scenarios in staging to verify quorum queue behavior
- Avoid placing all quorum queue members on nodes that share a common network dependency
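The leader-status alert suggested above can be as simple as a cron job that exits non-zero when any quorum queue is leaderless. A sketch, with the queue listing stubbed by a `printf`; in practice the input would be piped from `rabbitmqctl list_queues name type state`:

```bash
# Succeed only if no queue reports the no_leader state.
check_leaders() { ! grep -q 'no_leader'; }

if printf 'orders quorum running\n' | check_leaders; then
  echo "all quorum queues have leaders"
else
  echo "ALERT: leaderless quorum queue detected"
fi
```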