The Problem

Your Redis Cluster reports cluster_state:fail and commands return errors such as "CLUSTERDOWN The cluster is down". Applications cannot access data, and some keys return MOVED or ASK redirections that then fail.

Checking cluster status reveals the problem:

```bash
redis-cli -c CLUSTER INFO
cluster_state:fail
cluster_slots_assigned:16384
cluster_slots_ok:15360
cluster_slots_pfail:1024
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
```

Or attempting operations:

```bash
redis-cli -c GET mykey
(error) CLUSTERDOWN The cluster is down
```

The cluster is unhealthy because 1024 slots belong to a node in pfail state (possibly failing): the node you queried can no longer reach their owner, so those slots are at risk of losing coverage.
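The slot arithmetic behind that reading can be checked with a small helper. This is a minimal sketch, not a redis-cli feature: `slots_not_ok` is a hypothetical function name, and it assumes CLUSTER INFO output is piped in on standard input.

```bash
# slots_not_ok: read CLUSTER INFO output on stdin and print how many of the
# 16384 hash slots are NOT counted in cluster_slots_ok.
# (Hypothetical helper for illustration; not part of redis-cli.)
slots_not_ok() {
  # redis-cli output may carry CR line endings, so strip them first
  tr -d '\r' | awk -F: '$1 == "cluster_slots_ok" { print 16384 - $2 }'
}

# Usage against a live node (assumes redis-cli can reach it):
#   redis-cli CLUSTER INFO | slots_not_ok
```

For the output shown above, this prints 1024 (16384 - 15360), which matches the cluster_slots_pfail count.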

Why Cluster State Fails

By default (cluster-require-full-coverage yes), a Redis Cluster requires all 16384 hash slots to be covered by available nodes. The cluster enters the fail state when:

  1. Master node down - A master serving slots is unreachable
  2. No replica available - A master failed without a promotable replica
  3. Slots migrating - A slot migration is incomplete or stuck
  4. Network partition - Nodes cannot communicate (split-brain)
  5. Configuration mismatch - Nodes have conflicting cluster config
  6. Insufficient majority - A node cannot reach a majority of the masters

Diagnosis Steps

Check Overall Cluster State

```bash
# Check cluster info on any node
redis-cli -c CLUSTER INFO

# Key fields to examine:
# cluster_state:ok or fail
# cluster_slots_ok: should be 16384
# cluster_slots_pfail: slots probably failing
# cluster_slots_fail: slots definitely failing
```

List All Nodes and Their Status

```bash
# Show all cluster nodes
redis-cli -c CLUSTER NODES

# Output format per line:
# node_id ip:port@bus_port flags master_id ping_sent pong_recv config_epoch link_state slots

# Look for flags like:
# master - master node
# slave - replica node
# fail? - probably failing (pfail)
# fail - definitely failing
# noaddr - address unknown
#
# The link_state field is either connected or disconnected
```

Example output showing a problem:

```bash
a1b2c3d4 10.0.1.1:6379@16379 master - 0 1704067200 1 connected 0-5460
e5f6g7h8 10.0.1.2:6379@16379 master - 0 1704067195 2 connected 5461-10922
i9j0k1l2 10.0.1.3:6379@16379 master,fail? - 1704067190 1704067190 3 disconnected 10923-16383
m3n4o5p6 10.0.1.4:6379@16379 slave a1b2c3d4 0 1704067200 1 connected
q7r8s9t0 10.0.1.5:6379@16379 slave e5f6g7h8 0 1704067195 2 connected
u1v2w3x4 10.0.1.6:6379@16379 slave i9j0k1l2 0 1704067190 3 fail?
```

Node i9j0k1l2, which owns slots 10923-16383, is marked fail? and disconnected.
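Picking such nodes out by eye gets tedious on larger clusters. The sketch below filters CLUSTER NODES output for nodes whose flags field contains fail or fail?. The name `failing_nodes` is hypothetical, and the function assumes the standard field order (flags in field 3, slot ranges from field 9 onward).

```bash
# failing_nodes: read CLUSTER NODES output on stdin and print
# "node_id address flags slots" for every node flagged fail or fail?.
# (Hypothetical helper for illustration; not part of redis-cli.)
failing_nodes() {
  awk '$3 ~ /(^|,)fail[?]?(,|$)/ {
    # Match the exact flag so that "nofailover" is not mistaken for "fail"
    slots = ""
    for (i = 9; i <= NF; i++) slots = slots (slots == "" ? "" : " ") $i
    print $1, $2, $3, (slots == "" ? "-" : slots)
  }'
}

# Usage against a live node (assumes redis-cli can reach it):
#   redis-cli -h 10.0.1.1 CLUSTER NODES | failing_nodes
```

Nodes that own no slots (replicas, for instance) are printed with "-" in the last column.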

Identify Uncovered Slots

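```bash
# Find which slots have no coverage
redis-cli -c CLUSTER NODES | grep -E "fail|disconnected" | awk '{print $NF}'

# Check specific slot coverage
redis-cli -c CLUSTER KEYSLOT "mykey"
# Returns slot number, e.g., 12568

# Check which node should serve this slot (slots are listed as ranges,
# so compare numerically rather than grepping for the number)
redis-cli -c CLUSTER NODES | awk -v s=12568 '{
  for (i = 9; i <= NF; i++) {
    n = split($i, r, "-")
    if ($i !~ /^\[/ && s >= r[1] + 0 && s <= r[n] + 0) print $1, $2
  }
}'
```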

Check Node Connectivity

```bash
# Test connectivity to each node
redis-cli -h 10.0.1.1 -p 6379 PING
redis-cli -h 10.0.1.2 -p 6379 PING
redis-cli -h 10.0.1.3 -p 6379 PING

# Check if nodes can reach each other: print nodes whose
# link state (field 8) is not "connected"
redis-cli -h 10.0.1.1 CLUSTER NODES | awk '$8 != "connected"'
```

Verify Replication Status

```bash
# On working master, check replicas
redis-cli -h 10.0.1.1 INFO replication

# Check replica readiness to take over
redis-cli -h 10.0.1.6 INFO replication
```

Solutions

Solution 1: Failover to Replica

If a master is down but has a healthy replica, trigger manual failover:

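```bash
# On the replica you want to promote (requires the master to be reachable)
redis-cli -h 10.0.1.6 CLUSTER FAILOVER

# Or, if the master is unreachable, fail over without its cooperation
# (still needs votes from a majority of masters)
redis-cli -h 10.0.1.6 CLUSTER FAILOVER FORCE

# Or take over without cluster consensus, when a majority of masters is unreachable
redis-cli -h 10.0.1.6 CLUSTER FAILOVER TAKEOVER
```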
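Which variant to run depends on what is still reachable. The sketch below encodes that decision as a function, based on the documented behavior of the three CLUSTER FAILOVER modes. The name `failover_command` and its yes/no arguments are hypothetical. You still have to establish reachability yourself, for example from CLUSTER NODES.

```bash
# failover_command MASTER_UP MAJORITY_UP: print the CLUSTER FAILOVER variant
# that fits the situation. Both arguments are "yes" or "no".
# (Hypothetical helper for illustration; it only prints the command.)
failover_command() {
  master_up=$1
  majority_up=$2
  if [ "$master_up" = "yes" ]; then
    # Coordinated failover: the master pauses clients and hands over
    echo "CLUSTER FAILOVER"
  elif [ "$majority_up" = "yes" ]; then
    # Master is gone, but the other masters can still vote
    echo "CLUSTER FAILOVER FORCE"
  else
    # No majority to vote: the replica promotes itself unilaterally
    echo "CLUSTER FAILOVER TAKEOVER"
  fi
}

# Usage (run the printed command on the replica to promote):
#   redis-cli -h 10.0.1.6 $(failover_command no yes)
```

TAKEOVER bypasses cluster agreement, so treat it as a last resort.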

The replica becomes the new master and claims the slots.

Solution 2: Fix Network Connectivity

If nodes are disconnected due to network issues:

```bash
# Check firewall rules
sudo iptables -L -n
sudo ufw status

# Redis Cluster uses two ports per node:
# - Main port (6379)
# - Cluster bus port (6379 + 10000 = 16379)

# Open both ports
sudo ufw allow from 10.0.1.0/24 to any port 6379
sudo ufw allow from 10.0.1.0/24 to any port 16379

# For firewalld
sudo firewall-cmd --add-port=6379/tcp --permanent
sudo firewall-cmd --add-port=16379/tcp --permanent
sudo firewall-cmd --reload
```
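When nodes run on non-default ports, the bus port changes with them. The sketch below derives both ports from the client port using the port + 10000 rule. `cluster_ports` is a hypothetical helper name, and the rule is an assumption that fails if the bus port has been overridden in the node's configuration.

```bash
# cluster_ports PORT: print the client port and the default cluster bus port
# (client port + 10000) for one node.
# (Hypothetical helper; assumes the bus port has not been overridden.)
cluster_ports() {
  echo "$1 $(( $1 + 10000 ))"
}

# Usage: open both ports for a node listening on 6380
#   for p in $(cluster_ports 6380); do
#     sudo ufw allow from 10.0.1.0/24 to any port "$p"
#   done
```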

Solution 3: Rejoin Disconnected Node

If a node lost its cluster config, rejoin it:

```bash
# On the disconnected node, check its view
redis-cli -h 10.0.1.3 CLUSTER NODES

# If it shows empty or wrong cluster, meet it back
redis-cli -h 10.0.1.1 CLUSTER MEET 10.0.1.3 6379

# Wait for gossip to propagate
sleep 5

# Verify node rejoined
redis-cli -h 10.0.1.1 CLUSTER NODES | grep 10.0.1.3
```
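A fixed sleep can be too short on a busy network and needlessly long on a quiet one. One alternative is to poll until the node shows up, as in the sketch below. `wait_until` is a hypothetical generic retry helper, and the retry count and delay in the usage comment are arbitrary assumptions.

```bash
# wait_until TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES
# times, sleeping DELAY seconds between attempts. Returns 0 on success and
# 1 if every attempt failed.
# (Hypothetical helper for illustration.)
wait_until() {
  _wu_tries=$1
  _wu_delay=$2
  shift 2
  while [ "$_wu_tries" -gt 0 ]; do
    "$@" && return 0
    _wu_tries=$((_wu_tries - 1))
    if [ "$_wu_tries" -gt 0 ]; then sleep "$_wu_delay"; fi
  done
  return 1
}

# Usage: wait up to ~10 seconds for 10.0.1.3 to appear in the cluster view
#   node_rejoined() { redis-cli -h 10.0.1.1 CLUSTER NODES | grep -q "10.0.1.3:6379"; }
#   wait_until 10 1 node_rejoined || echo "node did not rejoin"
```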

Solution 4: Complete Stuck Migration

If slot migration is incomplete:

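```bash
# Check for open (migrating/importing) slots; they appear in brackets
# on the line of the node you are connected to (flagged "myself")
redis-cli -h 10.0.1.2 CLUSTER NODES | grep -E '\[[0-9]+-[<>]-'

# Example stuck migration:
# [10923->-i9j0k1l2]   (slot 10923 migrating to node i9j0k1l2)

# On the node showing the open slot, roll back by clearing the state
redis-cli -h 10.0.1.2 CLUSTER SETSLOT 10923 STABLE

# Or complete the migration by assigning the slot to its new owner
redis-cli -h 10.0.1.2 CLUSTER SETSLOT 10923 NODE i9j0k1l2
```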
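The bracketed entries are easier to read once decoded. The sketch below parses CLUSTER NODES output and reports each open slot with its direction, assuming the documented `[slot->-node_id]` (migrating) and `[slot-<-node_id]` (importing) notation. `open_slots` is a hypothetical helper name.

```bash
# open_slots: read CLUSTER NODES output on stdin and describe every slot
# that is still marked as migrating or importing.
# (Hypothetical helper for illustration; not part of redis-cli.)
open_slots() {
  awk '{
    for (i = 9; i <= NF; i++) {
      # Open slots are the bracketed entries after the regular slot ranges
      if ($i !~ /^\[.*\]$/) continue
      entry = substr($i, 2, length($i) - 2)
      if (split(entry, p, "->-") == 2)
        print $1, "slot", p[1], "migrating to", p[2]
      else if (split(entry, p, "-<-") == 2)
        print $1, "slot", p[1], "importing from", p[2]
    }
  }'
}

# Usage: run against each node, since a node only lists its own open slots
#   redis-cli -h 10.0.1.2 CLUSTER NODES | open_slots
```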

Solution 5: Add Missing Replica

If a master has no replica and fails, the cluster goes down. Add replicas:

```bash
# Start a new Redis instance as replica
redis-server --port 6380 --cluster-enabled yes

# On existing master, meet the new node
redis-cli -h 10.0.1.1 CLUSTER MEET 10.0.1.7 6380

# On new node (listening on port 6380), become replica of master
redis-cli -h 10.0.1.7 -p 6380 CLUSTER REPLICATE a1b2c3d4
```

Solution 6: Reshard Slots to Healthy Node

If a node is permanently lost and has no replica:

```bash
# Reshard slots from failed node to healthy node
# (10923-16383 is 5461 slots; the source node must be reachable
# so that its keys can be migrated)
redis-cli --cluster reshard 10.0.1.1:6379 \
  --cluster-from i9j0k1l2 \
  --cluster-to a1b2c3d4 \
  --cluster-slots 5461 \
  --cluster-yes

# Or use interactive mode
redis-cli --cluster reshard 10.0.1.1:6379
```
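The value passed to --cluster-slots should match the number of slots you intend to move, and slot ranges are inclusive on both ends. The sketch below counts the slots in one or more ranges copied from CLUSTER NODES output. `count_slots` is a hypothetical helper name.

```bash
# count_slots RANGE...: print the total number of hash slots covered by the
# given ranges. Each range is "start-end" (inclusive) or a single slot number.
# (Hypothetical helper for illustration.)
count_slots() {
  total=0
  for range in "$@"; do
    start=${range%-*}   # text before the dash (whole string if no dash)
    end=${range#*-}     # text after the dash (whole string if no dash)
    total=$((total + end - start + 1))
  done
  echo "$total"
}

# Usage: slots owned by the failed node in the example output
#   count_slots 10923-16383
```

For the range 10923-16383 this prints 5461, and the three example masters together account for all 16384 slots.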

Solution 7: Reset and Rebuild Cluster

For severe corruption, rebuild the cluster:

```bash
# WARNING: this deletes all data. Back up RDB/AOF files first.
# Each node must be running to accept the commands below,
# so restart any node that is stopped or hung
sudo systemctl restart redis-server

# Reset cluster config on each node
redis-cli -h 10.0.1.1 FLUSHALL
redis-cli -h 10.0.1.1 CLUSTER RESET HARD

# Repeat for all nodes
redis-cli -h 10.0.1.2 FLUSHALL
redis-cli -h 10.0.1.2 CLUSTER RESET HARD

# Recreate cluster
redis-cli --cluster create \
  10.0.1.1:6379 10.0.1.2:6379 10.0.1.3:6379 \
  10.0.1.4:6379 10.0.1.5:6379 10.0.1.6:6379 \
  --cluster-replicas 1

# Verify cluster
redis-cli -c CLUSTER INFO
```

Solution 8: Fix Split-Brain Scenario

When the cluster partitions, both sides may claim to be valid:

```bash
# Check each side's node count
redis-cli -h <partition1-node> CLUSTER NODES | wc -l
redis-cli -h <partition2-node> CLUSTER NODES | wc -l

# Majority side (more masters) continues
# Minority side should stop writes

# Fix by ensuring minority nodes connect to majority
redis-cli -h <minority-node> CLUSTER MEET <majority-node-ip> 6379

# Reset minority side if needed
redis-cli -h <minority-node> CLUSTER RESET SOFT
```
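Raw line counts do not show which side holds the majority, because every node still lists the peers it can no longer reach. A more telling check is how many slot-owning masters a node considers healthy. The sketch below computes that from one node's CLUSTER NODES view, using the majority rule of floor(masters / 2) + 1. `partition_quorum` is a hypothetical helper name, and the result reflects only the view of the node that was queried.

```bash
# partition_quorum: read one node's CLUSTER NODES output on stdin and report
# whether that node sees a majority of slot-owning masters as healthy.
# (Hypothetical helper for illustration; not part of redis-cli.)
partition_quorum() {
  awk '
    # Only masters that own at least one slot (field 9 onward) count
    $3 ~ /(^|,)master(,|$)/ && NF >= 9 {
      total++
      if ($3 !~ /(^|,)fail[?]?(,|$)/) reachable++
    }
    END {
      needed = int(total / 2) + 1
      printf "%d/%d masters reachable, quorum %d: %s\n",
        reachable, total, needed, (reachable >= needed ? "majority" : "minority")
    }'
}

# Usage: run on one node from each side of the partition
#   redis-cli -h <partition1-node> CLUSTER NODES | partition_quorum
```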

Cluster Configuration Best Practices

```ini
# /etc/redis/redis.conf for cluster mode

cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 15000
cluster-replica-validity-factor 10
cluster-migration-barrier 1
cluster-require-full-coverage yes

# Network settings
# (listens on all interfaces without protected mode, so restrict
# access with a firewall and authentication)
bind 0.0.0.0
protected-mode no
port 6379

# Persistence
appendonly yes
appendfsync everysec
```
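The timeout and validity factor interact. According to the comments in the stock redis.conf, a replica will not attempt a failover if it has been out of touch with its master for longer than (cluster-node-timeout * cluster-replica-validity-factor) + repl-ping-replica-period. The sketch below works that out. `max_replica_disconnect_s` is a hypothetical helper name, and the 10-second ping period is an assumed default that does not appear in the configuration above.

```bash
# max_replica_disconnect_s TIMEOUT_MS FACTOR PING_PERIOD_S: print the longest
# master disconnection (in seconds) after which a replica is still allowed
# to attempt a failover.
# (Hypothetical helper; the formula follows the redis.conf comments.)
max_replica_disconnect_s() {
  echo $(( $1 * $2 / 1000 + $3 ))
}

# Usage with the settings above and an assumed 10 s repl-ping-replica-period:
#   max_replica_disconnect_s 15000 10 10
```

With a 15000 ms timeout and a factor of 10, the window is 160 seconds. A replica disconnected for longer than that is considered too stale to promote.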

Monitoring Script

```bash
#!/bin/bash
# cluster_monitor.sh

while true; do
  STATE=$(redis-cli -c CLUSTER INFO | grep cluster_state | cut -d: -f2 | tr -d '\r')

  if [ "$STATE" != "ok" ]; then
    echo "CRITICAL: Cluster state is $STATE"
    redis-cli -c CLUSTER INFO
    redis-cli -c CLUSTER NODES | grep -E "fail|disconnected"

    # Alert notification
    # send_alert "Redis cluster down"
  fi

  # Check slot coverage
  SLOTS_OK=$(redis-cli -c CLUSTER INFO | grep cluster_slots_ok | cut -d: -f2 | tr -d '\r')
  if [ "$SLOTS_OK" != "16384" ]; then
    echo "WARNING: Only $SLOTS_OK slots covered"
  fi

  sleep 10
done
```
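If the check feeds a scheduler or monitoring agent rather than a terminal, an exit code is often more useful than an endless loop. The sketch below is one possible variant: it reads CLUSTER INFO output on standard input and returns 0, 1, or 2 for OK, WARNING, and CRITICAL. `cluster_health` is a hypothetical name and the exit-code convention is an assumption, not something Redis defines.

```bash
# cluster_health: read CLUSTER INFO output on stdin, print a one-line status,
# and return 0 (ok), 1 (slots missing), or 2 (cluster_state is not ok).
# (Hypothetical helper for illustration; not part of redis-cli.)
cluster_health() {
  info=$(tr -d '\r')
  state=$(printf '%s\n' "$info" | awk -F: '$1 == "cluster_state" { print $2 }')
  slots_ok=$(printf '%s\n' "$info" | awk -F: '$1 == "cluster_slots_ok" { print $2 }')

  if [ "$state" != "ok" ]; then
    echo "CRITICAL: cluster_state=${state:-unknown}"
    return 2
  fi
  if [ "$slots_ok" != "16384" ]; then
    echo "WARNING: only ${slots_ok:-0} slots covered"
    return 1
  fi
  echo "OK: all 16384 slots covered"
  return 0
}

# Usage, e.g. from cron (assumes redis-cli can reach the node):
#   redis-cli CLUSTER INFO | cluster_health || echo "cluster needs attention"
```

Reading from standard input keeps the function testable against saved output, without a live cluster.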

Prevention Checklist

  • [ ] Each master has at least one replica
  • [ ] Cluster bus port (16379) open on firewall
  • [ ] Set appropriate cluster-node-timeout
  • [ ] Monitor cluster_state continuously
  • [ ] Test failover scenarios regularly
  • [ ] Place replicas in different failure zones
  • [ ] Use cluster-require-full-coverage based on needs
  • [ ] Keep cluster config backups
  • [ ] Document node IDs and slot assignments