Redis Cluster Node Failure During Resharding - Recovery Guide

Introduction When a Redis Cluster node fails during resharding, the cluster can be left in an inconsistent state where some slots are marked as `migrating` or `importing` but the migration never completed. This causes `MOVED` redirect errors and potential data loss for keys in the affected slots.

Symptoms - `CLUSTER INFO` shows `cluster_state:fail` or `cluster_slots_assigned` less than 16384 - `CLUSTER NODES` shows failed node as `fail` with slots in transitioning state - Clients receive `CLUSTERDOWN The cluster is not ok` errors - Some keys return `-MOVED` redirects to a non-existent node - `redis-cli --cluster check` reports "Not all 16384 slots are covered"

Common Causes - Node crashes due to OOM killer during memory-intensive migration - Network partition isolating the target node mid-migration - Disk failure on the node being migrated to - Kubernetes pod eviction terminating a Redis node during reshard - Hardware failure on bare-metal Redis cluster nodes

Step-by-Step Fix 1. Assess cluster state and identify failed node: ```bash redis-cli -p 7000 CLUSTER INFO redis-cli -p 7000 CLUSTER NODES | grep fail redis-cli --cluster check 127.0.0.1:7000 ```

1.Forget the failed node from all remaining nodes:
2.```bash
3.# Get the failed node ID
4.FAILED_NODE_ID=$(redis-cli -p 7000 CLUSTER NODES | grep fail | awk '{print $1}')

# Forget from each remaining node redis-cli -p 7000 CLUSTER FORGET $FAILED_NODE_ID redis-cli -p 7001 CLUSTER FORGET $FAILED_NODE_ID redis-cli -p 7002 CLUSTER FORGET $FAILED_NODE_ID redis-cli -p 7003 CLUSTER FORGET $FAILED_NODE_ID redis-cli -p 7004 CLUSTER FORGET $FAILED_NODE_ID ```

1.Identify slots in migrating/importing state and reset them:
2.```bash
3.# Check each node for stuck slots
4.for port in 7000 7001 7002 7003 7004; do
5.echo "=== Port $port ==="
6.redis-cli -p $port CLUSTER NODES | grep -E "migrating|importing"
7.done

# For each stuck slot, determine where the data actually is # and set the slot to the correct node redis-cli -p 7000 CLUSTER SETSLOT <slot> STABLE redis-cli -p 7000 CLUSTER SETSLOT <slot> NODE <correct_node_id> ```

1.Recover data from the failed node if possible:
2.```bash
3.# If the node can be restarted, start it and let it rejoin
4.redis-server /etc/redis/redis.conf --port 7005

# Once it rejoins, check its slots redis-cli -p 7005 CLUSTER NODES redis-cli -p 7005 DBSIZE

# If data is needed, dump keys from the recovered node redis-cli -p 7005 --scan | head -100 ```

1.Repair the cluster by redistributing affected slots:
2.```bash
3.redis-cli --cluster fix 127.0.0.1:7000
4.redis-cli --cluster check 127.0.0.1:7000
5.`
6.Add a replacement node if needed:
7.```bash
8.redis-server /etc/redis/7005.conf --port 7005 --cluster-enabled yes
9.redis-cli --cluster add-node 127.0.0.1:7005 127.0.0.1:7000
10.`

Prevention - Never reshard during maintenance windows for other cluster nodes - Monitor cluster health with `redis-cli --cluster check` on a schedule - Set up alerting on `cluster_state` changing from `ok` - Use at least 3 masters and 3 replicas for production clusters - Configure proper memory limits to prevent OOM during migration - Test failure scenarios regularly: kill nodes during resharding to validate recovery procedures - Keep a current backup of all node RDB files before any resharding operation

Redis Cluster Node Failure During Active Resharding

Introduction When a Redis Cluster node fails during resharding, the cluster can be left in an inconsistent state where some slots are marked as `migrating` or `importing` but the migration never completed. This causes `MOVED` redirect errors and potential data loss for keys in the affected slots.

Step-by-Step Fix 1. **Assess cluster state and identify failed node**: ```bash redis-cli -p 7000 CLUSTER INFO redis-cli -p 7000 CLUSTER NODES | grep fail redis-cli --cluster check 127.0.0.1:7000 ```

Share this guide

More Redis Troubleshooting Guides

Redis Persistence Disabled Warning

Redis Client Output Buffer Exceeded

Redis Slow Log Not Logging

Redis AOF Load Error

Redis Loading RDB Error

Redis TLS Handshake Failed

Step-by-Step Fix 1. Assess cluster state and identify failed node: ```bash redis-cli -p 7000 CLUSTER INFO redis-cli -p 7000 CLUSTER NODES | grep fail redis-cli --cluster check 127.0.0.1:7000 ```