# Redis Cluster Node Failing

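## Error Messages

```text
CLUSTERDOWN The cluster is down
```

Or:

```text
MOVED 1234 10.0.0.1:6379
```

Or:

```text
ASK 1234 10.0.0.2:6379
```

Or:

```text
Node 10.0.0.1:6379 is not empty
```

`MOVED` and `ASK` are redirections rather than failures. `MOVED` means the slot is permanently served by the named node. `ASK` means the slot is being migrated and only the next query should go to the named node. The "is not empty" message appears when a node being added to a cluster already holds keys or already knows other nodes.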
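To make the distinction between these replies concrete, here is a minimal sketch of a classifier. The function name `classify_cluster_reply` and its output labels are hypothetical, chosen for illustration; they are not part of `redis-cli`.

```bash
# Hypothetical helper: map a Redis Cluster reply to a short category label.
classify_cluster_reply() {
    local reply=$1
    # Drop the "(error) " prefix that redis-cli prints in interactive mode.
    reply=${reply#"(error) "}
    case $reply in
        MOVED\ *)         echo "redirect-permanent" ;;  # slot lives on another node
        ASK\ *)           echo "redirect-temporary" ;;  # slot is mid-migration
        CLUSTERDOWN\ *)   echo "cluster-down" ;;        # cluster_state is fail
        *"is not empty"*) echo "node-not-empty" ;;      # node has keys or known peers
        *)                echo "unknown" ;;
    esac
}
```

For example, `classify_cluster_reply "ASK 1234 10.0.0.2:6379"` prints `redirect-temporary`. Only the `cluster-down` and `node-not-empty` categories call for the diagnosis steps in this guide.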

## Root Causes

1. Node network partition - A node is isolated from the rest of the cluster
2. Master failure without replica - No replica is available to promote
3. Incomplete slot coverage - Not all 16384 hash slots are assigned to a master
4. Configuration mismatch - Nodes hold conflicting views of the cluster configuration
5. Too many failed nodes - The remaining masters cannot form a majority
6. Manual resharding errors - A slot migration was interrupted or performed incorrectly

## Diagnosis Steps

### Step 1: Check Cluster Status

```bash
# Check cluster info
redis-cli -c -h <any_node> -p 6379 CLUSTER INFO

# Key fields:
# cluster_state:ok/fail
# cluster_slots_assigned:16384
# cluster_slots_ok:16384
# cluster_known_nodes:6
# cluster_size:3
```
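The key fields can also be checked programmatically. The sketch below reads `CLUSTER INFO` output from standard input; the function names `cluster_info_field` and `summarize_cluster_info` are hypothetical, and the "healthy" rule (state ok, all 16384 slots assigned and ok) is an assumption based on the fields listed above.

```bash
# Print the value of one "name:value" field from CLUSTER INFO output on stdin.
cluster_info_field() {
    local field=$1
    # Replies end in CRLF on the wire, so strip carriage returns first.
    tr -d '\r' | awk -F: -v f="$field" '$1 == f { print $2 }'
}

# Summarize CLUSTER INFO output on stdin as "healthy" or "degraded: ...".
summarize_cluster_info() {
    local info state assigned ok
    info=$(cat)
    state=$(cluster_info_field cluster_state <<<"$info")
    assigned=$(cluster_info_field cluster_slots_assigned <<<"$info")
    ok=$(cluster_info_field cluster_slots_ok <<<"$info")
    if [ "$state" = "ok" ] && [ "$assigned" = "16384" ] && [ "$ok" = "16384" ]; then
        echo "healthy"
    else
        echo "degraded: state=$state assigned=$assigned ok=$ok"
    fi
}
```

Against a live node this would be used as `redis-cli -h <any_node> -p 6379 CLUSTER INFO | summarize_cluster_info`.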

### Step 2: Check Node Status

```bash
# List all nodes
redis-cli -c CLUSTER NODES

# Output format (one line per node):
# <node_id> <ip:port@cport> <flags> <master_id> <ping_sent> <pong_recv> <config_epoch> <link_state> <slots>
```

Flags to watch:

- master - Node is a master
- slave - Node is a replica
- fail? - Node is in PFAIL state: the queried node cannot reach it, but the failure is not yet confirmed by a majority of masters
- fail - Node is in FAIL state: a majority of masters agree it is unreachable
- handshake - New node that is still joining the cluster
- noaddr - Node address is unknown
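A plain `grep fail` over this output also matches the unrelated `nofailover` flag, so it can help to split the flags field properly. The following is a sketch under the assumption that the input uses the field layout shown above; `failed_nodes` is a hypothetical name.

```bash
# Read CLUSTER NODES output on stdin and print nodes flagged fail or fail?.
# Output: <node_id> <address> <FAIL|PFAIL>
failed_nodes() {
    awk '{
        n = split($3, flags, ",")           # field 3 is the comma-separated flag list
        for (i = 1; i <= n; i++) {
            if (flags[i] == "fail")  print $1, $2, "FAIL"
            if (flags[i] == "fail?") print $1, $2, "PFAIL"
        }
    }'
}
```

Typical use would be `redis-cli -c CLUSTER NODES | failed_nodes`. An empty result means no node is currently suspected or confirmed down from the queried node's point of view.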

### Step 3: Check Slot Coverage

```bash
# Check which slots each master handles (slot ranges are the trailing fields)
redis-cli -c CLUSTER NODES | grep master

# Or use the CLUSTER SLOTS command (deprecated since Redis 7.0 in favor of CLUSTER SHARDS)
redis-cli -c CLUSTER SLOTS

# Verify all 16384 slots are assigned
redis-cli -c CLUSTER INFO | grep cluster_slots_assigned
```

### Step 4: Test Node Connectivity

```bash
# Ping each node individually
for node in node1:6379 node2:6379 node3:6379; do
    echo "Testing $node"
    redis-cli -h "${node%:*}" -p "${node##*:}" ping
done
```

### Step 5: Check Cluster Meet Status

```bash
# Verify nodes know each other (-w keeps "disconnected" from matching)
redis-cli -c CLUSTER NODES | grep -cw "connected"

# Should equal the total number of expected nodes
```

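## Solutions

### Solution 1: Fix Network Partition

```bash
# Check network connectivity between nodes
ping <failed_node_ip>

# Nodes also talk over the cluster bus port (data port + 10000 by default),
# so 16379 must be reachable as well as 6379

# If the node is reachable but still marked as fail, force forget.
# Send this to every remaining node within 60 seconds,
# otherwise gossip re-adds the forgotten node.
redis-cli -c CLUSTER FORGET <failed_node_id>

# Re-add the node
redis-cli -c CLUSTER MEET <node_ip> <node_port>

# Wait for the cluster to sync
sleep 5
redis-cli -c CLUSTER NODES
```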
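Because `CLUSTER FORGET` only affects the node that receives it, the command has to be repeated on each remaining node. The sketch below prints those commands without running them. `forget_commands` is a hypothetical helper, and it assumes addresses of the form `host:port@cport` with no extra colons (so not raw IPv6).

```bash
# Read CLUSTER NODES output on stdin and print one CLUSTER FORGET command
# per remaining node. Nothing is executed; review the output before running it.
forget_commands() {
    local failed_id=$1
    awk -v id="$failed_id" '
    function has_flag(flags, name,    parts, n, i) {
        n = split(flags, parts, ",")
        for (i = 1; i <= n; i++) if (parts[i] == name) return 1
        return 0
    }
    # Skip the node being forgotten and nodes that cannot be contacted.
    $1 != id && !has_flag($3, "fail") && !has_flag($3, "noaddr") {
        split($2, addr, "@")          # drop the @cport suffix
        split(addr[1], hp, ":")       # host and port
        printf "redis-cli -h %s -p %s CLUSTER FORGET %s\n", hp[1], hp[2], id
    }'
}
```

One way to use it is `redis-cli -c CLUSTER NODES | forget_commands <failed_node_id>`, then run the printed commands promptly so that all nodes forget the node inside the 60-second window.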

### Solution 2: Replace Failed Master with Replica

```bash
# Identify failed masters
redis-cli -c CLUSTER NODES | grep "fail" | grep "master"

# Find replicas of the failed master
redis-cli -c CLUSTER NODES | grep <failed_master_id>

# On the replica, promote it without waiting for the unreachable master
# (this still needs votes from a majority of masters)
redis-cli -h <replica_ip> -p <replica_port> CLUSTER FAILOVER FORCE

# Or take over immediately without cluster agreement (last resort)
redis-cli -h <replica_ip> -p <replica_port> CLUSTER FAILOVER TAKEOVER
```

### Solution 3: Add New Node to Cluster

```bash
# First, ensure the new node is empty
redis-cli -h <new_node_ip> -p <new_node_port> FLUSHALL
redis-cli -h <new_node_ip> -p <new_node_port> CLUSTER RESET HARD

# Introduce it to the cluster (run against an existing cluster node)
redis-cli -c CLUSTER MEET <new_node_ip> <new_node_port>

# Add as replica (CLUSTER REPLICATE must be sent to the new node itself)
redis-cli -h <new_node_ip> -p <new_node_port> CLUSTER REPLICATE <master_node_id>
```

### Solution 4: Fix Incomplete Slot Coverage

```bash
# Check how many slots are assigned
redis-cli -c CLUSTER INFO | grep "cluster_slots_assigned"

# If fewer than 16384, find which slots are missing
redis-cli -c CLUSTER SLOTS

# Resharding only moves slots that already have an owner,
# so assign unowned slots with the fix subcommand first
redis-cli --cluster fix <any_node>:6379

# Then reshard interactively if the distribution needs adjusting
redis-cli --cluster reshard <any_node>:6379

# Example: reshard 1000 slots to a node
redis-cli --cluster reshard <node>:6379 \
    --cluster-from all \
    --cluster-to <target_node_id> \
    --cluster-slots 1000 \
    --cluster-yes
```

### Solution 5: Rebalance Cluster

```bash
# Rebalance slots evenly across masters
redis-cli --cluster rebalance <any_node>:6379

# With specific options
redis-cli --cluster rebalance <node>:6379 \
    --cluster-weight <node1_id>=1 <node2_id>=1 <node3_id>=1 \
    --cluster-use-empty-masters \
    --cluster-yes
```

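### Solution 6: Fix Stalled Resharding

If resharding is interrupted:

```bash
# Check for slots stuck in migrating ([slot->-node_id]) or importing ([slot-<-node_id]) state
redis-cli -c CLUSTER NODES | grep -E '\[[0-9]+-[<>]-'

# Cancel the stalled import/export on the node that reports it
redis-cli -c CLUSTER SETSLOT <slot> STABLE

# Or reset the node and reshard again
# (CLUSTER RESET is refused on a master that still holds keys)
redis-cli -h <node_ip> -p <node_port> CLUSTER RESET SOFT
```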

### Solution 7: Handle Majority Loss

When a majority of masters are down:

```bash
# Check how many masters are down
redis-cli -c CLUSTER NODES | grep "fail" | grep "master" | wc -l

# If the majority is lost and cannot be recovered, rebuild from scratch.
# WARNING: this loses all data. CLUSTER RESET is refused on a master that
# still holds keys, so each node is flushed first.
redis-cli -h <node_ip> -p <node_port> FLUSHALL
redis-cli -h <node_ip> -p <node_port> CLUSTER RESET HARD

# Recreate cluster
redis-cli --cluster create <node1>:6379 <node2>:6379 <node3>:6379 \
    <node4>:6379 <node5>:6379 <node6>:6379 \
    --cluster-replicas 1
```

### Solution 8: Fix Configuration Epoch Issues

```bash
# Check epoch values (seventh field of each line)
redis-cli -c CLUSTER NODES

# If epochs are inconsistent, advance the config epoch of the connected node
redis-cli -c CLUSTER BUMPEPOCH

# Or on a specific node
redis-cli -h <node_ip> -p <node_port> CLUSTER BUMPEPOCH
```

## Common Scenarios

### Scenario: Node Marked as FAIL but is Reachable

```bash
# Node is up but still marked as fail (network partition resolved)
# Wait for the cluster to auto-recover
sleep 30
redis-cli -c CLUSTER NODES

# If still marked as fail, manually forget and re-meet
redis-cli -c CLUSTER FORGET <node_id>
redis-cli -c CLUSTER MEET <node_ip> <node_port>
```

### Scenario: Slot Migration Stuck

```bash
# Check slot migration status
redis-cli -c CLUSTER NODES

# Look for slots in migrating state: [1234->-<node_id>]
# or importing state: [1234-<-<node_id>]

# Complete the migration manually by assigning the slot to its new owner
redis-cli -c CLUSTER SETSLOT <slot> NODE <target_node_id>

# On source node
redis-cli -h <source_ip> -p <source_port> CLUSTER SETSLOT <slot> NODE <target_node_id>

# On target node
redis-cli -h <target_ip> -p <target_port> CLUSTER SETSLOT <slot> NODE <target_node_id>
```

### Scenario: Cluster is Down (CLUSTERDOWN)

```bash
# Check state
redis-cli -c CLUSTER INFO

# If cluster_state:fail, find the cause:
# 1. Check slot coverage
# 2. Check master availability
# 3. Check majority

# Quick fix for missing slots
redis-cli --cluster fix <any_node>:6379

# Or with more aggressive repair
redis-cli --cluster fix <any_node>:6379 --cluster-search-multiple-owners
```

## Cluster Management Commands

```bash
# Create cluster
redis-cli --cluster create node1:6379 node2:6379 node3:6379 node4:6379 node5:6379 node6:6379 --cluster-replicas 1

# Add node
redis-cli --cluster add-node new_node:6379 existing_node:6379

# Add node as replica
redis-cli --cluster add-node new_node:6379 existing_node:6379 --cluster-slave --cluster-master-id <master_id>

# Remove node
redis-cli --cluster del-node node:6379 <node_id>

# Reshard
redis-cli --cluster reshard node:6379

# Rebalance
redis-cli --cluster rebalance node:6379

# Check cluster
redis-cli --cluster check node:6379

# Fix cluster
redis-cli --cluster fix node:6379

# Info
redis-cli --cluster info node:6379
```

## Monitoring Script

```bash
#!/bin/bash
# redis_cluster_monitor.sh

NODE="localhost:6379"
HOST=${NODE%:*}
PORT=${NODE#*:}

# Get cluster state
STATE=$(redis-cli -c -h "$HOST" -p "$PORT" CLUSTER INFO | grep cluster_state | cut -d: -f2 | tr -d '\r')

if [ "$STATE" != "ok" ]; then
    echo "CRITICAL: Cluster state is $STATE"
    redis-cli -c -h "$HOST" -p "$PORT" CLUSTER INFO
    exit 2
fi

# Check slot coverage
SLOTS=$(redis-cli -c -h "$HOST" -p "$PORT" CLUSTER INFO | grep cluster_slots_assigned | cut -d: -f2 | tr -d '\r')

if [ "$SLOTS" != "16384" ]; then
    echo "WARNING: Only $SLOTS slots covered"
    exit 1
fi

# Check failed nodes (matches the fail and fail? flags, but not nofailover)
FAILED=$(redis-cli -c -h "$HOST" -p "$PORT" CLUSTER NODES | grep -cE '[ ,]fail\??[ ,]')

if [ "$FAILED" -gt 0 ]; then
    echo "WARNING: $FAILED nodes marked as fail"
    exit 1
fi

echo "OK: Cluster healthy"
exit 0
```

## Prevention

### 1. Proper Cluster Configuration

```bash
# Recommended: 3 masters + 3 replicas minimum
redis-cli --cluster create \
    master1:6379 master2:6379 master3:6379 \
    replica1:6379 replica2:6379 replica3:6379 \
    --cluster-replicas 1
```

### 2. Monitor Cluster Health

```bash
# Set up regular monitoring
redis-cli -c CLUSTER INFO | grep cluster_state
```

### 3. Balanced Slot Distribution

```bash
# After adding nodes, rebalance
redis-cli --cluster rebalance <node>:6379
```

### 4. Document Node IDs and Roles

Keep documentation of:

- Node IDs
- Master-replica relationships
- Slot assignments
- IP addresses and ports

## Related Errors

- [Redis Replication Broken](./fix-redis-replication-broken)
- [Redis Connection Refused](./fix-redis-connection-refused)
