The Problem

A Redis cluster node has failed and your cluster reports a degraded state. Applications receive CLUSTERDOWN errors, or specific slots show as failing. The cluster might have lost quorum or simply have unreachable nodes. Recovery depends on whether the node data is recoverable and how many nodes failed.

Immediate Assessment

Check Cluster State

bash
redis-cli -c CLUSTER INFO

Look for these critical indicators:

bash
cluster_state:ok          # Must be "ok"; "fail" means the cluster is down
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0     # Potentially failing slots
cluster_slots_fail:0      # Actually failed slots
cluster_known_nodes:6
cluster_size:3            # Number of master nodes

Identify Failed Nodes

bash
redis-cli -c CLUSTER NODES

Output shows each node's state:

bash
nodeid1 10.0.0.1:6379@16379 myself,master - 0 1609459200000 1 connected 0-5460
nodeid2 10.0.0.2:6379@16379 master - 0 1609459201000 2 connected 5461-10922
nodeid3 10.0.0.3:6379@16379 master - 0 1609459202000 3 connected 10923-16383
nodeid4 10.0.0.4:6379@16379 slave nodeid1 0 1609459200000 1 connected
nodeid5 10.0.0.5:6379@16379 slave nodeid2 0 1609459201000 2 connected
nodeid6 10.0.0.6:6379@16379 slave,fail nodeid3 0 1609459202000 3 disconnected

Look for these flags: fail (failure confirmed by a majority of masters), fail? (PFAIL, failure suspected by a single node), handshake, noaddr, and noflags.
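To pull just the problem nodes out of that output, a small filter helps. This is an illustrative helper (the function name is mine, not a Redis command), assuming the standard CLUSTER NODES column layout where the flags are the third field:

```shell
# cluster_failed_nodes: read CLUSTER NODES output on stdin and print the
# id, address, and flags of every node whose flags include a failure state
cluster_failed_nodes() {
    awk '$3 ~ /fail|handshake|noaddr|noflags/ {print $1, $2, $3}'
}

# Typical usage against a live cluster:
#   redis-cli -c CLUSTER NODES | cluster_failed_nodes
```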

Check Node Reachability

bash
redis-cli -h <node-ip> -p <node-port> PING

Recovery Scenarios

Scenario 1: Single Slave Node Failed

If only a replica failed, the cluster remains operational. Remove the failed node; note that CLUSTER FORGET only affects the node that receives it, so run it against every surviving node:

bash
# Repeat against each surviving node (FORGET is not propagated)
redis-cli -c CLUSTER FORGET <failed-node-id>

Then add a new replica when ready:

bash
# On the new replica node, join cluster and replicate
redis-cli -h <new-replica-ip> -p <new-replica-port> CLUSTER MEET <existing-node-ip> <existing-node-port>
redis-cli -h <new-replica-ip> -p <new-replica-port> CLUSTER REPLICATE <master-node-id>
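Before trusting the new replica, confirm its replication link is actually up. A small stdin-based check along these lines (the helper name is mine; it only parses the standard INFO replication fields):

```shell
# check_replica: read `INFO replication` output on stdin; succeed only if
# the node reports itself as a replica with an established master link
check_replica() {
    local info
    info=$(cat)
    printf '%s\n' "$info" | grep -q 'role:slave' &&
        printf '%s\n' "$info" | grep -q 'master_link_status:up'
}

# Typical usage against a live node:
#   redis-cli -h <new-replica-ip> -p <new-replica-port> INFO replication \
#       | check_replica && echo "replica synced"
```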

Scenario 2: Master Node Failed, Replica Available

The replica should fail over automatically once the failure is detected. If it does not, trigger a manual failover. A plain CLUSTER FAILOVER requires the master to be reachable, so with a dead master use the FORCE option:

bash
# On the replica that should become master
redis-cli -h <replica-ip> -p <replica-port> CLUSTER FAILOVER FORCE

If FORCE also fails (for example, the remaining masters cannot agree on a configuration epoch), TAKEOVER promotes the replica unilaterally. Use it with caution; it can cause lost writes:

bash
redis-cli -h <replica-ip> -p <replica-port> CLUSTER FAILOVER TAKEOVER

Verify promotion:

bash
redis-cli -c CLUSTER NODES | grep <promoted-node-id>

Scenario 3: Master Node Failed, No Replica

This is the worst case: the keys in that master's slot range are lost unless you can restore them from persistence files (RDB/AOF) or a backup. If no replica exists and the master is truly down:

bash
# First, try to recover the failed node
redis-cli -h <failed-master-ip> -p <failed-master-port> PING

If the node responds, check why it was marked failed:

bash
redis-cli -h <failed-master-ip> -p <failed-master-port> CLUSTER INFO

If the node is truly unrecoverable, create an empty node and re-cover its slots:

```bash
# Start a fresh Redis instance on the failed node's hardware,
# then join it to the cluster as an empty master
redis-cli -c CLUSTER MEET <new-node-ip> <new-node-port>

# Re-cover the orphaned slots; --cluster fix assigns unowned slots
redis-cli --cluster fix <any-cluster-node>:6379

# Alternatively, move slots from the surviving masters (this moves data);
# reshard also runs non-interactively with
# --cluster-from/--cluster-to/--cluster-slots/--cluster-yes
redis-cli --cluster reshard <any-cluster-node>:6379
```

Scenario 4: Multiple Masters Failed (Cluster Down)

If a majority of masters have failed, the cluster loses quorum and stops serving entirely. Clients see:

bash
(error) CLUSTERDOWN The cluster is down

Emergency recovery steps:

1. Stop all cluster nodes
2. Identify which nodes have the most recent data (check RDB/AOF file timestamps)
3. Start nodes one by one, beginning with the most recent
4. Force cluster reassembly:

bash
# Audit slot coverage first, then repair
redis-cli --cluster check <any-node-ip>:<port>
redis-cli --cluster fix <any-node-ip>:<port>

Step-by-Step Recovery Procedure

Step 1: Document Current State

Before making changes:

bash
# Save cluster configuration
redis-cli -c CLUSTER NODES > cluster_state_backup.txt
redis-cli -c CLUSTER INFO > cluster_info_backup.txt

Step 2: Verify Network Connectivity

bash
# Test connectivity between all nodes
for node in node1:6379 node2:6379 node3:6379; do
    echo "Testing $node"
    redis-cli -h ${node%:*} -p ${node#*:} PING
done

Step 3: Check Node Health Individually

bash
redis-cli -h <each-node-ip> -p <each-node-port> INFO replication

Look for:
- role:master or role:slave
- master_link_status:up (for replicas)
- connected_slaves:X (for masters)

Step 4: Remove Failed Nodes

bash
# First, forget the failed node from all surviving nodes
for node in surviving_node1:6379 surviving_node2:6379; do
    redis-cli -h ${node%:*} -p ${node#*:} CLUSTER FORGET <failed-node-id>
done
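Because FORGET is per-node, it is worth verifying afterwards that no surviving node still lists the removed id. A tiny stdin-based check (the helper name is mine):

```shell
# node_known: read CLUSTER NODES output on stdin; succeed if the given
# node id still appears at the start of any line
node_known() {
    grep -q "^$1 "
}

# After the loop above, this should print nothing for any surviving node:
#   redis-cli -h <node-ip> -p <node-port> CLUSTER NODES \
#       | node_known <failed-node-id> && echo "still known on <node-ip>"
```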

Step 5: Add Replacement Node

bash
# On new node
redis-cli CLUSTER MEET <surviving-node-ip> <surviving-node-port>

Step 6: Rebalance Slots (if needed)

bash
# --cluster-use-empty-masters lets the new (empty) node receive slots
redis-cli --cluster rebalance --cluster-use-empty-masters --cluster-threshold 1 <any-node>:6379

Preventing Future Failures

Configure Proper Timeouts

In redis.conf:

bash
cluster-node-timeout 5000
cluster-require-full-coverage yes
cluster-migration-barrier 1
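Note that cluster-require-full-coverage yes (the default) stops the whole cluster from serving as soon as any single slot is uncovered. If serving the surviving slot ranges during an outage is preferable to a full stop, it can be set to no instead; this is a tradeoff toward partial availability, not a blanket recommendation:

```bash
# Optional tradeoff: keep serving the slots that are still covered
cluster-require-full-coverage no
```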

Ensure Sufficient Replicas

Each master should have at least 1 replica:

bash
# Total replica count; should be at least the number of masters (cluster_size)
redis-cli -c CLUSTER NODES | grep slave | wc -l
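An aggregate count can hide a master with no replica at all. This stdin-based helper (the name is mine) flags such masters, assuming the standard CLUSTER NODES columns where field 4 of a replica line is its master's id:

```shell
# masters_without_replicas: read CLUSTER NODES output on stdin and print
# the id of every healthy master with no healthy replica attached
masters_without_replicas() {
    awk '
        $3 ~ /master/ && $3 !~ /fail/ { masters[$1] = 1 }
        $3 ~ /slave/  && $3 !~ /fail/ { replicas[$4]++ }
        END { for (id in masters) if (replicas[id] == 0) print id }
    '
}

# Typical usage:
#   redis-cli -c CLUSTER NODES | masters_without_replicas
```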

Set Up Monitoring

Monitor these metrics:

bash
# Cluster health check script
redis-cli -c CLUSTER INFO | grep -E "cluster_state|cluster_slots_fail|cluster_slots_pfail"
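For cron or an external monitor, an exit-code wrapper around those same fields is more useful than raw grep output. A sketch (the function name is mine):

```shell
# cluster_healthy: read CLUSTER INFO output on stdin; exit 0 only when the
# state is ok and no slots are failed or suspected failing
cluster_healthy() {
    local info
    info=$(cat)
    printf '%s\n' "$info" | grep -q 'cluster_state:ok' &&
        printf '%s\n' "$info" | grep -q 'cluster_slots_fail:0' &&
        printf '%s\n' "$info" | grep -q 'cluster_slots_pfail:0'
}

# Cron-friendly usage:
#   redis-cli -c CLUSTER INFO | cluster_healthy || echo "cluster degraded" >&2
```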

Configure Persistent Configuration

Ensure cluster-config-file is set:

bash
cluster-enabled yes
cluster-config-file nodes-6379.conf
cluster-node-timeout 5000

Verification

After recovery, verify cluster health:

```bash
# Should show "ok"
redis-cli -c CLUSTER INFO | grep cluster_state

# All 16384 slots should be assigned
redis-cli -c CLUSTER INFO | grep cluster_slots_assigned

# Smoke-test writes (the keys hash to different slots, though not
# necessarily one per master)
redis-cli -c SET test1 value1
redis-cli -c SET test2 value2
redis-cli -c SET test3 value3
```
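As a final check that no slot gap remains, the slot ranges reported by CLUSTER NODES can be summed. This parsing sketch (the helper name is mine) assumes plain slot ranges starting at column 9 and will miscount while slots are mid-migration:

```shell
# covered_slots: read CLUSTER NODES output on stdin and print the total
# number of hash slots served by non-failed masters (16384 when complete)
covered_slots() {
    awk '
        $3 ~ /master/ && $3 !~ /fail/ {
            # slot fields start at column 9: ranges like "0-5460" or single slots
            for (i = 9; i <= NF; i++) {
                n = split($i, r, "-")
                total += (n == 2) ? r[2] - r[1] + 1 : 1
            }
        }
        END { print total + 0 }
    '
}

# Typical usage:
#   [ "$(redis-cli -c CLUSTER NODES | covered_slots)" -eq 16384 ] || echo "slot gap"
```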