Introduction

RabbitMQ quorum queues use the Raft consensus algorithm to replicate messages across cluster nodes. If a network partition divides the cluster so that no side holds a majority of a queue's members, the queue cannot elect a leader. Raft never elects a leader on a minority side, so this situation does not produce a true split brain with two leaders accepting writes. Instead, the queue becomes unavailable for both publishing and consuming until the partition heals and a majority is re-established.
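The majority rule can be made concrete with a little arithmetic. The sketch below is illustrative only: `quorum_size` and `has_quorum` are hypothetical helper names, and the splits in the loop are made-up examples rather than data from a real cluster.

```bash
#!/usr/bin/env bash
# Majority required by a Raft group with the given number of members.
quorum_size() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# Succeeds (exit 0) when a partition side that can reach $1 of the $2
# members still holds a majority and can therefore elect a leader.
has_quorum() {
  local reachable=$1 members=$2
  [ "$reachable" -ge "$(quorum_size "$members")" ]
}

# Example splits, written as "reachable total".
for split in "3 3" "2 3" "1 3" "2 4" "3 5"; do
  set -- $split
  if has_quorum "$1" "$2"; then
    echo "$1 of $2 members reachable: leader election possible"
  else
    echo "$1 of $2 members reachable: no quorum, queue unavailable"
  fi
done
```

Note that an even split of an even-sized group (2 of 4) leaves neither side with a majority.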

Symptoms

  • Quorum queue reports no leader available in the management UI
  • Producers receive NO_ROUTE or connection errors when publishing to quorum queues
  • Consumer connections hang waiting for a leader to become available
  • RabbitMQ logs show Raft election timeout and no leader elected
  • Error message: Quorum queue my-queue has no leader, operations are blocked

Common Causes

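  • Network partition that leaves no side with a majority, for example a 3-node cluster split 1+1+1 (in a 2+1 split, the two-node side keeps its majority and the queue stays available there)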
  • Cloud provider availability zone outage taking down a majority of quorum queue members
  • Asymmetric network partition where different node pairs have different connectivity
  • Quorum queue members distributed unevenly across failure domains
  • Node crashes during active leader election, reducing the available quorum

Step-by-Step Fix

  1. Check quorum queue status and leader state: Identify the affected queues.

     ```bash
     rabbitmqctl list_queues name type state leader
     ```

  2. Diagnose the network partition: Verify connectivity between nodes.

     ```bash
     rabbitmqctl cluster_status
     # Check which nodes can communicate
     for node in node1 node2 node3; do
       rabbitmqctl ping -n "rabbit@$node"
     done
     ```

  3. After the partition heals, wait for automatic leader election: Raft recovers on its own once a majority is restored.

     ```bash
     # Monitor election progress
     rabbitmqctl list_queues name state leader --formatter pretty_table
     # Wait for state to change from 'no_leader' to 'running'
     ```

  4. Force quorum queue recovery if automatic election fails: Use the Raft safety override only as a last resort, after confirming that the function below exists in your RabbitMQ version.

     ```bash
     # Forcing an election bypasses Raft's safety guarantees and can discard
     # messages that were never replicated to this node.
     rabbitmqctl eval 'rabbit_raft_registry:force_vote(rabbit@node1, <<"my-queue">>).'
     ```

  5. Verify queue consistency after recovery: Check that messages are intact.

     ```bash
     rabbitmqctl list_queues name messages
     ```
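
The waiting in step 3 can be automated with a small polling loop. This is a minimal sketch, not an official tool: `leaderless_queues` and `wait_for_leaders` are hypothetical helper names, and the sketch assumes that `rabbitmqctl list_queues` prints tab-separated rows and leaves the `leader` column empty for a quorum queue without a leader. Check both assumptions against the output of your RabbitMQ version before relying on it.

```bash
#!/usr/bin/env bash
# Reads tab-separated "name<TAB>type<TAB>leader" rows on stdin and prints
# the names of quorum queues whose leader column is empty.
# Assumption: an empty leader column means "no leader elected".
leaderless_queues() {
  awk -F '\t' '$2 == "quorum" && $3 == "" { print $1 }'
}

# Polls until every quorum queue reports a leader.
# $1 = timeout in seconds (default 300), $2 = poll interval (default 5).
# Returns 0 once all queues have a leader, 1 on timeout.
wait_for_leaders() {
  local timeout=${1:-300} interval=${2:-5} waited=0 rows missing
  while [ "$waited" -lt "$timeout" ]; do
    if ! rows=$(rabbitmqctl list_queues --quiet --no-table-headers name type leader); then
      # Do not treat a failed CLI call as "all queues healthy".
      echo "rabbitmqctl failed; the node may be unreachable" >&2
    else
      missing=$(printf '%s\n' "$rows" | leaderless_queues)
      if [ -z "$missing" ]; then
        echo "all quorum queues have a leader"
        return 0
      fi
      echo "still waiting for:" $missing
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  return 1
}

# Example: wait up to 5 minutes, polling every 5 seconds.
# wait_for_leaders 300 5 || echo "timed out; consider step 4" >&2
```

The parsing is kept in its own function so it can be checked against captured output without a running broker.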

Prevention

  • Deploy quorum queue members across at least 3 failure domains (nodes, zones, racks)
  • Use odd cluster sizes (3, 5, 7) so that any two-way partition leaves one side with a majority
  • Configure quorum_commands_soft_timeout and quorum_commands_hard_timeout appropriately
  • Monitor quorum queue leader status and alert on no_leader state
  • Test network partition scenarios in staging to verify quorum queue behavior
  • Avoid placing all quorum queue members on nodes that share a common network dependency
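
The placement advice above can be checked mechanically. The sketch below assumes a hand-maintained inventory of "member failure-domain" pairs for one queue; the `risky_domains` helper, the node names, and the zone names are all hypothetical. It lists every failure domain whose loss would leave the remaining members without a majority.

```bash
#!/usr/bin/env bash
# Reads "member failure_domain" lines on stdin. Prints each failure domain
# whose loss would drop the surviving members below a majority.
# Exit status: 0 = no risky domain, 1 = at least one, 2 = no input.
risky_domains() {
  awk '
    NF >= 2 { count[$2]++; total++ }
    END {
      if (total == 0) exit 2
      quorum = int(total / 2) + 1
      bad = 0
      for (d in count) {
        if (total - count[d] < quorum) { print d; bad = 1 }
      }
      exit bad
    }'
}

# Hypothetical placement: two of three members share zone-a.
placement='rabbit@node1 zone-a
rabbit@node2 zone-a
rabbit@node3 zone-b'

if risky=$(printf '%s\n' "$placement" | risky_domains); then
  echo "placement tolerates the loss of any single failure domain"
else
  echo "losing any of these domains costs quorum:" $risky
fi
```

With one member per zone across three zones, the same check reports no risky domain.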