# Redis Cluster Node Failing

## Error Messages

```bash
CLUSTERDOWN The cluster is down
```

Or:

```bash
MOVED 1234 10.0.0.1:6379
```

Or:

```bash
ASK 1234 10.0.0.2:6379
```

Or:

```bash
Node 10.0.0.1:6379 is not empty
```
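Of the messages above, `MOVED` and `ASK` are redirections rather than failures: `MOVED` means the slot is served by another node, and `ASK` means the slot is being migrated and only the next query should go to the indicated node. A client started with `redis-cli -c` follows them automatically. The helper below is a minimal sketch for scripts that see the raw reply; the name `parse_redirect` is hypothetical, and it assumes the reply text has already been captured without the `(error)` prefix.

```bash
# parse_redirect "MOVED 1234 10.0.0.1:6379"
# Prints "<kind> <slot> <host> <port>", or returns 1 if the reply is not a redirect.
parse_redirect() {
    # Word splitting of the unquoted argument is intentional here.
    set -- $1
    case "$1" in
        MOVED|ASK) ;;
        *) return 1 ;;
    esac
    [ $# -eq 3 ] || return 1
    # Split host and port at the last colon.
    echo "$1 $2 ${3%:*} ${3##*:}"
}
```

For instance, `parse_redirect 'ASK 1234 10.0.0.2:6379'` prints `ASK 1234 10.0.0.2 6379`, while a `CLUSTERDOWN` reply makes the function return a non-zero status.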

## Root Causes

1. Node network partition - A node is isolated from the rest of the cluster
2. Master failure without replica - No replica is available to promote
3. Slot coverage incomplete - Not all 16384 slots are covered by nodes
4. Configuration mismatch - Nodes have conflicting cluster config
5. Too many failed nodes - The remaining masters cannot form a majority
6. Manual resharding errors - Improper slot migration
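The "majority" in cause 5 is simple arithmetic: more than half of the masters must stay reachable for failure detection and replica promotion to work. The following sketch spells that out; the function names are made up for this example and are not part of any Redis tooling.

```bash
# masters_needed N: smallest number of reachable masters that still
# forms a majority out of N masters.
masters_needed() {
    echo $(( $1 / 2 + 1 ))
}

# masters_can_fail N: how many of N masters may fail while a majority remains.
masters_can_fail() {
    echo $(( $1 - ($1 / 2 + 1) ))
}
```

With three masters, `masters_needed 3` prints `2` and `masters_can_fail 3` prints `1`, which is why losing two of three masters at once leaves the cluster unable to fail over on its own.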

## Diagnosis Steps

### Step 1: Check Cluster Status

```bash
# Check cluster info
redis-cli -c -h <any_node> -p 6379 CLUSTER INFO

# Key fields:
# cluster_state:ok/fail
# cluster_slots_assigned:16384
# cluster_slots_ok:16384
# cluster_known_nodes:6
# cluster_size:3
```
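When these fields are checked from a script, it helps to pull out one value at a time. The helpers below are a small sketch that reads `CLUSTER INFO` output on standard input; `cluster_info_field` and `cluster_info_healthy` are hypothetical names, and "healthy" is assumed here to mean `cluster_state:ok` with all 16384 slots in the ok state.

```bash
# cluster_info_field NAME: print the value of field NAME from CLUSTER INFO
# output read on stdin (prints nothing if the field is absent).
cluster_info_field() {
    tr -d '\r' | awk -F: -v name="$1" '$1 == name { print $2 }'
}

# cluster_info_healthy: exit 0 only if cluster_state is ok and
# cluster_slots_ok is 16384.
cluster_info_healthy() {
    tr -d '\r' | awk -F: '
        $1 == "cluster_state"    { state = $2 }
        $1 == "cluster_slots_ok" { ok = $2 }
        END { exit !(state == "ok" && ok == 16384) }'
}
```

A possible invocation is `redis-cli -h <any_node> -p 6379 CLUSTER INFO | cluster_info_field cluster_state`. The `tr -d '\r'` step removes the carriage returns that Redis puts at the end of each line.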

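### Step 2: Check Node Status

```bash
# List all nodes
redis-cli -c CLUSTER NODES

# Output format (Redis 4.0 and later print the address as ip:port@cport):
# <node_id> <ip:port> <flags> <master_id> <ping_sent> <pong_recv> <config_epoch> <link_state> <slots>
```

Flags to watch:

- `master` - Node is a master
- `slave` - Node is a replica
- `fail?` - Node is suspected down: the node you queried cannot reach it, but the failure is not yet confirmed
- `fail` - Node is confirmed down: a majority of masters agree it is unreachable
- `handshake` - New node that is still completing the cluster handshake
- `noaddr` - Node address unknown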
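A plain `grep fail` cannot tell `fail` from `fail?`, because the flags field is a comma-separated list. The sketch below matches one flag exactly; `nodes_with_flag` is a hypothetical helper name, and it assumes `CLUSTER NODES` output is piped in on standard input.

```bash
# nodes_with_flag FLAG: print "<node_id> <address>" for every node whose
# comma-separated flag list (third field) contains FLAG exactly.
nodes_with_flag() {
    awk -v flag="$1" '{
        n = split($3, flags, ",")
        for (i = 1; i <= n; i++)
            if (flags[i] == flag) { print $1, $2; break }
    }'
}
```

For example, `redis-cli -c CLUSTER NODES | nodes_with_flag 'fail?'` would list only the nodes that are suspected but not yet confirmed down.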

### Step 3: Check Slot Coverage

```bash
# Check which slots each node handles (-w keeps "disconnected" from matching)
redis-cli -c CLUSTER NODES | grep -w "connected"

# Or use cluster slots command
redis-cli -c CLUSTER SLOTS

# Verify all 16384 slots are covered
redis-cli -c CLUSTER INFO | grep cluster_slots_assigned
```
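If `cluster_slots_assigned` is below 16384, the next question is which slots are missing. One way to answer it, sketched here under the assumption that `CLUSTER NODES` output is piped in, is to mark every slot that some node claims and print the gaps. The name `missing_slots` is hypothetical.

```bash
# missing_slots: read CLUSTER NODES output on stdin and print every slot
# range that no node claims, one "start-end" (or single slot) per line.
missing_slots() {
    awk '
    {
        # Slot entries start at field 9; skip [slot->-id] and [slot-<-id] markers.
        for (i = 9; i <= NF; i++) {
            if ($i ~ /^\[/) continue
            n = split($i, r, "-")
            lo = r[1] + 0
            hi = (n == 2 ? r[2] : r[1]) + 0
            for (s = lo; s <= hi; s++) owned[s] = 1
        }
    }
    END {
        start = -1
        # s = 16384 is a sentinel that flushes a gap ending at slot 16383.
        for (s = 0; s <= 16384; s++) {
            if (s < 16384 && !(s in owned)) {
                if (start < 0) start = s
                continue
            }
            if (start >= 0) {
                if (start == s - 1) print start
                else print start "-" (s - 1)
                start = -1
            }
        }
    }'
}
```

Empty output means every slot is assigned. Slots claimed by a master that is marked `fail` still count as assigned here, so compare the result with `cluster_slots_ok` as well.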

### Step 4: Test Node Connectivity

```bash
# Ping each node individually
# (nodes also talk to each other on the cluster bus port, client port + 10000 by default)
for node in node1:6379 node2:6379 node3:6379; do
    echo "Testing $node"
    redis-cli -h "${node%:*}" -p "${node##*:}" ping
done
```

### Step 5: Check Cluster Meet Status

```bash
# Verify nodes know each other (-w keeps "disconnected" from being counted)
redis-cli -c CLUSTER NODES | grep -cw "connected"

# Should equal total expected nodes
```

## Solutions

### Solution 1: Fix Network Partition

```bash
# Check network connectivity between nodes
ping <failed_node_ip>

# If node is reachable but marked as fail, force forget.
# Send FORGET to every other node within 60 seconds,
# otherwise gossip re-adds the forgotten node.
redis-cli -c CLUSTER FORGET <failed_node_id>

# Re-add the node
redis-cli -c CLUSTER MEET <node_ip> <node_port>

# Wait for cluster to sync
sleep 5
redis-cli -c CLUSTER NODES
```

### Solution 2: Replace Failed Master with Replica

```bash
# Identify failed master
redis-cli -c CLUSTER NODES | grep "fail" | grep "master"

# Find replica of failed master
redis-cli -c CLUSTER NODES | grep <failed_master_id>

# On the replica node, promote it without contacting the failed master
# (still needs authorization from a majority of masters)
redis-cli -h <replica_ip> -p <replica_port> CLUSTER FAILOVER FORCE

# Or takeover immediately, without cluster agreement
# (use only when a majority of masters is unreachable)
redis-cli -h <replica_ip> -p <replica_port> CLUSTER FAILOVER TAKEOVER
```
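Grepping for the master ID also matches the master's own line. A slightly more precise sketch compares the ID against the fourth field only, which holds the master of each replica. The name `replicas_of` is hypothetical, and `CLUSTER NODES` output is assumed on standard input.

```bash
# replicas_of MASTER_ID: print "<replica_id> <address> <link_state>" for
# each node whose master field (fourth field) equals MASTER_ID.
replicas_of() {
    awk -v master="$1" '$4 == master { print $1, $2, $8 }'
}
```

For example, `redis-cli -c CLUSTER NODES | replicas_of <failed_master_id>` lists the candidates for `CLUSTER FAILOVER`. A replica whose link state is `disconnected` is a poor choice.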

### Solution 3: Add New Node to Cluster

```bash
# First, ensure new node is empty
redis-cli -h <new_node_ip> -p <new_node_port> FLUSHALL
redis-cli -h <new_node_ip> -p <new_node_port> CLUSTER RESET HARD

# Meet the cluster (send this to a node that is already in the cluster)
redis-cli -c CLUSTER MEET <new_node_ip> <new_node_port>

# Add as replica (send this to the new node itself)
redis-cli -h <new_node_ip> -p <new_node_port> CLUSTER REPLICATE <master_node_id>
```

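### Solution 4: Fix Incomplete Slot Coverage

```bash
# Find uncovered slots
redis-cli -c CLUSTER INFO | grep "cluster_slots_assigned"

# If less than 16384, find which slots are missing
redis-cli -c CLUSTER SLOTS

# Assign the uncovered slots to reachable masters
redis-cli --cluster fix <any_node>:6379

# Or assign specific slots by hand on the node that should own them
redis-cli -h <node_ip> -p <node_port> CLUSTER ADDSLOTS <slot> [<slot> ...]

# Reshard only moves slots that are already assigned; run it once coverage is restored
redis-cli --cluster reshard <any_node>:6379

# Example: reshard 1000 slots to a node
redis-cli --cluster reshard <node>:6379 --cluster-from all --cluster-to <target_node_id> --cluster-slots 1000 --cluster-yes
```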

### Solution 5: Rebalance Cluster

```bash
# Rebalance slots evenly across masters
redis-cli --cluster rebalance <any_node>:6379

# With specific options
redis-cli --cluster rebalance <node>:6379 \
    --cluster-weight <node1_id>=1 <node2_id>=1 <node3_id>=1 \
    --cluster-use-empty-masters \
    --cluster-yes
```

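### Solution 6: Fix Stalled Resharding

If resharding is interrupted:

```bash
# Check for migrating/importing slots, shown as [slot->-node_id] or [slot-<-node_id]
redis-cli -h <node_ip> -p <node_port> CLUSTER NODES | grep -E '\[[0-9]+-[<>]-'

# Cancel stalled import/export (run on the node that shows the open slot)
redis-cli -h <node_ip> -p <node_port> CLUSTER SETSLOT <slot> STABLE

# Or reset the node and re-reshard (RESET is refused on a master that still holds keys)
redis-cli -h <node_ip> -p <node_port> CLUSTER RESET SOFT
```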
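To see exactly which slots are stuck and in which direction, the bracketed entries in `CLUSTER NODES` can be split into their parts. The following is a sketch with a hypothetical helper name, `open_slots`. It relies on the `[slot->-node_id]` (migrating) and `[slot-<-node_id]` (importing) notation, which a node prints only on its own `myself` line, so each master has to be queried in turn.

```bash
# open_slots: read CLUSTER NODES output on stdin and print one line per
# slot that is still migrating or importing:
#   <node_id> <slot> migrating|importing <other_node_id>
open_slots() {
    awk '{
        for (i = 9; i <= NF; i++) {
            if ($i !~ /^\[.*\]$/) continue
            entry = substr($i, 2, length($i) - 2)   # strip [ and ]
            if (split(entry, p, "->-") == 2)
                print $1, p[1], "migrating", p[2]
            else if (split(entry, p, "-<-") == 2)
                print $1, p[1], "importing", p[2]
        }
    }'
}
```

A possible invocation is `redis-cli -h <node_ip> -p <node_port> CLUSTER NODES | open_slots`. Each printed slot is a candidate for `CLUSTER SETSLOT <slot> STABLE` or for completing the migration.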

### Solution 7: Handle Majority Loss

When a majority of masters is down:

```bash
# Check how many masters are down
redis-cli -c CLUSTER NODES | grep "fail" | grep "master" | wc -l

# If majority lost and cannot recover:
# Reset cluster (WARNING: loses all data). Repeat on every node.
# On masters, run FLUSHALL first: RESET is refused while the node holds keys.
redis-cli -h <node_ip> -p <node_port> FLUSHALL
redis-cli -h <node_ip> -p <node_port> CLUSTER RESET HARD

# Recreate cluster
redis-cli --cluster create <node1>:6379 <node2>:6379 <node3>:6379 \
    <node4>:6379 <node5>:6379 <node6>:6379 \
    --cluster-replicas 1
```

### Solution 8: Fix Configuration Epoch Issues

```bash
# Check epoch values (seventh field of each line)
redis-cli -c CLUSTER NODES

# If epochs are inconsistent, force update
# (BUMPEPOCH advances the config epoch of the node that receives it)
redis-cli -c CLUSTER BUMPEPOCH

# Or on specific node
redis-cli -h <node_ip> -p <node_port> CLUSTER BUMPEPOCH
```
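Reading epochs by eye gets tedious on larger clusters. One check that can be automated is whether two masters report the same config epoch, which can happen after manual slot changes. Treating that as the sign of an epoch problem is an assumption of this sketch, and `duplicate_epochs` is a hypothetical name. Replicas are skipped because they report the epoch of their master.

```bash
# duplicate_epochs: read CLUSTER NODES output on stdin and print every
# config epoch (seventh field) shared by more than one master, followed
# by the IDs of the masters that share it.
duplicate_epochs() {
    awk '$3 ~ /(^|,)master(,|$)/ {
        count[$7]++
        ids[$7] = ids[$7] " " $1
    }
    END {
        for (e in count)
            if (count[e] > 1) print e ":" ids[e]
    }'
}
```

Empty output from `redis-cli -c CLUSTER NODES | duplicate_epochs` means no two masters share an epoch in the view of the node you queried.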

## Common Scenarios

### Scenario: Node Marked as FAIL but is Reachable

```bash
# Node is up but marked as fail (network partition resolved)
# Wait for cluster to auto-recover
sleep 30
redis-cli -c CLUSTER NODES

# If still marked as fail, manually forget (on every other node) and re-meet
redis-cli -c CLUSTER FORGET <node_id>
redis-cli -c CLUSTER MEET <node_ip> <node_port>
```

### Scenario: Slots Migration Stuck

```bash
# Check slot migration status
redis-cli -c CLUSTER NODES

# Look for slots with migration state: [1234->-<node_id>]
# Or import state: [1234-<-<node_id>]

# Complete the migration manually, once the slot's keys have been moved
# (the source refuses to give up a slot it still holds keys for)

# On target node
redis-cli -h <target_ip> -p <target_port> CLUSTER SETSLOT <slot> NODE <target_node_id>

# On source node
redis-cli -h <source_ip> -p <source_port> CLUSTER SETSLOT <slot> NODE <target_node_id>

# On the other masters, so they learn the new owner without waiting for gossip
redis-cli -c CLUSTER SETSLOT <slot> NODE <target_node_id>
```

### Scenario: Cluster is Down (CLUSTERDOWN)

```bash
# Check state
redis-cli -c CLUSTER INFO

# If cluster_state:fail, find the cause:
# 1. Check slot coverage
# 2. Check master availability
# 3. Check majority

# Quick fix for missing slots
redis-cli --cluster fix <any_node>:6379

# Or with more aggressive repair
redis-cli --cluster fix <any_node>:6379 --cluster-search-multiple-owners
```

## Cluster Management Commands

```bash
# Create cluster
redis-cli --cluster create node1:6379 node2:6379 node3:6379 node4:6379 node5:6379 node6:6379 --cluster-replicas 1

# Add node
redis-cli --cluster add-node new_node:6379 existing_node:6379

# Add node as replica
redis-cli --cluster add-node new_node:6379 existing_node:6379 --cluster-slave --cluster-master-id <master_id>

# Remove node
redis-cli --cluster del-node node:6379 <node_id>

# Reshard
redis-cli --cluster reshard node:6379

# Rebalance
redis-cli --cluster rebalance node:6379

# Check cluster
redis-cli --cluster check node:6379

# Fix cluster
redis-cli --cluster fix node:6379

# Info
redis-cli --cluster info node:6379
```

## Monitoring Script

```bash
#!/bin/bash
# redis_cluster_monitor.sh
# Exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL

NODE="localhost:6379"
HOST="${NODE%:*}"
PORT="${NODE#*:}"

# Get cluster state
STATE=$(redis-cli -c -h "$HOST" -p "$PORT" CLUSTER INFO | grep cluster_state | cut -d: -f2 | tr -d '\r')

if [ "$STATE" != "ok" ]; then
    echo "CRITICAL: Cluster state is $STATE"
    redis-cli -c -h "$HOST" -p "$PORT" CLUSTER INFO
    exit 2
fi

# Check slot coverage
SLOTS=$(redis-cli -c -h "$HOST" -p "$PORT" CLUSTER INFO | grep cluster_slots_assigned | cut -d: -f2 | tr -d '\r')

if [ "$SLOTS" != "16384" ]; then
    echo "WARNING: Only $SLOTS slots covered"
    exit 1
fi

# Check failed nodes (counts both fail and fail? flags)
FAILED=$(redis-cli -c -h "$HOST" -p "$PORT" CLUSTER NODES | grep -c "fail")

if [ "$FAILED" -gt 0 ]; then
    echo "WARNING: $FAILED nodes marked as fail"
    exit 1
fi

echo "OK: Cluster healthy"
exit 0
```

## Prevention

### 1. Proper Cluster Configuration

```bash
# Recommended: 3 masters + 3 replicas minimum
redis-cli --cluster create \
    master1:6379 master2:6379 master3:6379 \
    replica1:6379 replica2:6379 replica3:6379 \
    --cluster-replicas 1
```

### 2. Monitor Cluster Health

```bash
# Set up regular monitoring
redis-cli -c CLUSTER INFO | grep cluster_state
```

### 3. Balanced Slot Distribution

```bash
# After adding nodes, rebalance
redis-cli --cluster rebalance <node>:6379
```

### 4. Document Node IDs and Roles

Keep documentation of:

- Node IDs
- Master-replica relationships
- Slot assignments
- IP addresses and ports
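Most of this inventory can be generated from `CLUSTER NODES` instead of being maintained by hand. The helper below is a rough sketch under that assumption. `node_inventory` is a hypothetical name, and the tab-separated column layout is only one possible choice.

```bash
# node_inventory: read CLUSTER NODES output on stdin and print one
# tab-separated row per node: id, address, role, master id, slots.
# The master id is "-" for masters; slots is "-" when the node owns none.
node_inventory() {
    awk 'BEGIN { OFS = "\t" } {
        role = ($3 ~ /(^|,)master(,|$)/ ? "master" : "replica")
        slots = ""
        for (i = 9; i <= NF; i++)
            slots = slots (slots == "" ? "" : " ") $i
        print $1, $2, role, $4, (slots == "" ? "-" : slots)
    }'
}
```

Running `redis-cli -c CLUSTER NODES | node_inventory > cluster-inventory.tsv` after each topology change would keep a dated record. The file name is only an example.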

- [Redis Replication Broken](./fix-redis-replication-broken)
- [Redis Connection Refused](./fix-redis-connection-refused)