The Problem
A Redis cluster node has failed and your cluster reports a degraded state. Applications receive CLUSTERDOWN errors, or specific slots show as failing. The cluster might have lost quorum or simply have unreachable nodes. Recovery depends on whether the failed node's data is recoverable and on how many nodes failed.
Immediate Assessment
Check Cluster State
```bash
redis-cli -c CLUSTER INFO
```

Look for these critical indicators:

```
cluster_state:ok          # Should be "ok"; if "fail", the cluster is down
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0     # Slots flagged as potentially failing
cluster_slots_fail:0      # Slots confirmed failed
cluster_known_nodes:6
cluster_size:3            # Number of master nodes serving slots
```

Identify Failed Nodes
```bash
redis-cli -c CLUSTER NODES
```

Output shows each node's state:

```
nodeid1 10.0.0.1:6379@16379 myself,master - 0 1609459200000 1 connected 0-5460
nodeid2 10.0.0.2:6379@16379 master - 0 1609459201000 2 connected 5461-10922
nodeid3 10.0.0.3:6379@16379 master - 0 1609459202000 3 connected 10923-16383
nodeid4 10.0.0.4:6379@16379 slave nodeid1 0 1609459200000 1 connected
nodeid5 10.0.0.5:6379@16379 slave nodeid2 0 1609459201000 2 connected
nodeid6 10.0.0.6:6379@16379 fail,master - 0 1609459202000 4 disconnected
```

Look for the flags fail, fail?, handshake, noaddr, and noflags.
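If you are scanning this output repeatedly, a small helper can pull out just the failing entries. A minimal parsing sketch (the function name is mine; it assumes the standard CLUSTER NODES field layout shown above, and matches both fail and fail?):

```shell
# find_failed_nodes: read CLUSTER NODES output on stdin and print the
# ID and address of every node whose flags field contains "fail" or "fail?".
find_failed_nodes() {
  awk '$3 ~ /fail/ { print $1, $2 }'
}
```

Usage: `redis-cli -c CLUSTER NODES | find_failed_nodes`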
Check Node Reachability
```bash
redis-cli -h <node-ip> -p <node-port> PING
```

Recovery Scenarios
Scenario 1: Single Slave Node Failed
If only a replica failed, the cluster remains operational. Remove the failed node from every surviving node's view (otherwise gossip from a node that still remembers it will re-add it):

```bash
# Run on each surviving node
redis-cli -c CLUSTER FORGET <failed-node-id>
```

Then add a new replica when ready:

```bash
# On the new replica node, join the cluster and start replicating
redis-cli -h <new-replica-ip> -p <new-replica-port> CLUSTER MEET <existing-node-ip> <existing-node-port>
redis-cli -h <new-replica-ip> -p <new-replica-port> CLUSTER REPLICATE <master-node-id>
```

Scenario 2: Master Node Failed, Replica Available
The replica should fail over automatically within the cluster-node-timeout window. If it does not, trigger a manual failover:

```bash
# On the replica that should become master. Plain CLUSTER FAILOVER
# requires the old master to be reachable; use FORCE when it is down.
redis-cli -h <replica-ip> -p <replica-port> CLUSTER FAILOVER FORCE
```

For immediate takeover without waiting for agreement from the other masters (use with caution):

```bash
redis-cli -h <replica-ip> -p <replica-port> CLUSTER FAILOVER TAKEOVER
```

Verify promotion:

```bash
redis-cli -c CLUSTER NODES | grep <promoted-node-id>
```

Scenario 3: Master Node Failed, No Replica
This is the critical case: with no replica, the data in that master's slots is lost unless the node itself can be recovered. If no replica exists and the master is truly down:
```bash
# First, try to recover the failed node
redis-cli -h <failed-master-ip> -p <failed-master-port> PING
```

If the node responds, check why it was marked failed:

```bash
redis-cli -h <failed-master-ip> -p <failed-master-port> CLUSTER INFO
```

If truly unrecoverable, you must create an empty node and reshard:

```bash
# Start a fresh Redis instance on the failed node's hardware,
# then join it to the cluster as an empty master
redis-cli -c CLUSTER MEET <new-node-ip> <new-node-port>

# Assign slots from other masters (this will move data)
redis-cli --cluster reshard <any-cluster-node>:6379
```
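`--cluster reshard` and `CLUSTER FORGET` both want node IDs rather than addresses. A small helper, sketched here against the standard CLUSTER NODES field layout (the function name is mine), looks an ID up by host:port:

```shell
# node_id_for_addr: print the node ID whose address matches the given
# host:port, reading CLUSTER NODES output on stdin.
node_id_for_addr() {
  awk -v addr="$1" '{ split($2, a, "@"); if (a[1] == addr) print $1 }'
}
```

Usage: `redis-cli -c CLUSTER NODES | node_id_for_addr 10.0.0.2:6379`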
Scenario 4: Multiple Masters Failed (Cluster Down)
If a majority of masters failed, the cluster is completely down:

```
CLUSTERDOWN The cluster is down
```

Emergency recovery steps:

1. Stop all cluster nodes
2. Identify which nodes have the most recent data
3. Start the nodes one by one, beginning with the most recent
4. Force cluster reassembly:

```bash
redis-cli --cluster fix <any-node-ip>:<port>
```

Step-by-Step Recovery Procedure
Step 1: Document Current State
Before making changes:
```bash
# Save cluster configuration
redis-cli -c CLUSTER NODES > cluster_state_backup.txt
redis-cli -c CLUSTER INFO > cluster_info_backup.txt
```

Step 2: Verify Network Connectivity
```bash
# Test connectivity between all nodes
for node in node1:6379 node2:6379 node3:6379; do
  echo "Testing $node"
  redis-cli -h ${node%:*} -p ${node#*:} PING
done
```

Step 3: Check Node Health Individually
```bash
redis-cli -h <each-node-ip> -p <each-node-port> INFO replication
```

Look for:
- role:master or role:slave
- master_link_status:up (for replicas)
- connected_slaves:X (for masters)
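That check can be scripted by grepping the relevant fields out of the INFO replication output. A minimal sketch (the function name is mine; it reads the INFO output on stdin and succeeds only when the node is a master or a replica with a healthy link):

```shell
# replica_link_ok: exit 0 if INFO replication output on stdin shows
# role:master, or a replica whose master_link_status is up; exit 1 otherwise.
replica_link_ok() {
  grep -Eq '^(role:master|master_link_status:up)'
}
```

Usage: `redis-cli -h <node-ip> -p <node-port> INFO replication | replica_link_ok || echo "replication broken"`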
Step 4: Remove Failed Nodes
```bash
# Forget the failed node on every surviving node
for node in surviving_node1:6379 surviving_node2:6379; do
  redis-cli -h ${node%:*} -p ${node#*:} CLUSTER FORGET <failed-node-id>
done
```

Step 5: Add Replacement Node
```bash
# On the new node
redis-cli CLUSTER MEET <surviving-node-ip> <surviving-node-port>
```

Step 6: Rebalance Slots (if needed)
```bash
redis-cli --cluster rebalance --cluster-threshold 1 <any-node>:6379
```

Preventing Future Failures
Configure Proper Timeouts
In redis.conf:
```
cluster-node-timeout 5000
cluster-require-full-coverage yes
cluster-migration-barrier 1
```

Ensure Sufficient Replicas
Each master should have at least 1 replica:
```bash
redis-cli -c CLUSTER NODES | grep slave | wc -l
```

Set Up Monitoring
Monitor these metrics:
```bash
# Cluster health check script
redis-cli -c CLUSTER INFO | grep -E "cluster_state|cluster_slots_fail|cluster_slots_pfail"
```

Configure Persistent Configuration
Ensure cluster-config-file is set:
```
cluster-enabled yes
cluster-config-file nodes-6379.conf
cluster-node-timeout 5000
```

Verification
After recovery, verify cluster health:
```bash
# Should show "ok"
redis-cli -c CLUSTER INFO | grep cluster_state

# All 16384 slots should be assigned
redis-cli -c CLUSTER INFO | grep cluster_slots_assigned

# Test writes across slot ranges
redis-cli -c SET test1 value1
redis-cli -c SET test2 value2
redis-cli -c SET test3 value3
```
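The cluster_slots_assigned figure can also be cross-checked against the slot ranges in CLUSTER NODES. A sketch that sums the assigned ranges (the function name is mine; it ignores importing/migrating annotations); a fully covered cluster totals 16384:

```shell
# slots_assigned_total: read CLUSTER NODES output on stdin and print the
# total number of hash slots claimed by master nodes (ranges like 0-5460
# and single slots; slot fields start at field 9 of each line).
slots_assigned_total() {
  awk '
    $3 ~ /master/ {
      for (i = 9; i <= NF; i++) {
        if ($i ~ /^[0-9]+-[0-9]+$/) { split($i, r, "-"); total += r[2] - r[1] + 1 }
        else if ($i ~ /^[0-9]+$/)   { total += 1 }
      }
    }
    END { print total + 0 }
  '
}
```

Usage: `redis-cli -c CLUSTER NODES | slots_assigned_total`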