Introduction

When a Redis Cluster node fails during resharding, the cluster can be left in an inconsistent state where some slots are marked as `migrating` or `importing` but the migration never completed. This causes `MOVED` redirect errors and potential data loss for keys in the affected slots.

Symptoms

- `CLUSTER INFO` shows `cluster_state:fail` or `cluster_slots_assigned` less than 16384
- `CLUSTER NODES` shows the failed node flagged as `fail`, with slots left in a transitional (migrating or importing) state
- Clients receive `CLUSTERDOWN The cluster is down` errors
- Some keys return `-MOVED` redirects to a non-existent node
- `redis-cli --cluster check` reports "Not all 16384 slots are covered"
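The first symptom can be checked from a script. The following is a minimal sketch, assuming `CLUSTER INFO` output is piped in on stdin; the helper name `cluster_health` is hypothetical and it inspects only the two fields named above.

```bash
# Sketch: summarize cluster_state and cluster_slots_assigned from CLUSTER INFO.
# cluster_health is a hypothetical helper name, not part of redis-cli.
cluster_health() {
  # CLUSTER INFO lines are "key:value" terminated by CRLF; drop the \r first.
  tr -d '\r' | awk -F: '
    $1 == "cluster_state"          { state = $2 }
    $1 == "cluster_slots_assigned" { assigned = $2 }
    END {
      if (state == "ok" && assigned == 16384) { print "healthy"; exit 0 }
      printf "unhealthy state=%s slots_assigned=%s\n",
             (state == "" ? "unknown" : state),
             (assigned == "" ? "unknown" : assigned)
      exit 1
    }'
}
```

A typical call would be `redis-cli -p 7000 CLUSTER INFO | cluster_health`; the non-zero exit status on an unhealthy cluster makes it usable in a larger script.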

Common Causes

- Node crashes due to OOM killer during memory-intensive migration
- Network partition isolating the target node mid-migration
- Disk failure on the node being migrated to
- Kubernetes pod eviction terminating a Redis node during reshard
- Hardware failure on bare-metal Redis cluster nodes

Step-by-Step Fix

1. **Assess cluster state and identify the failed node**:
   ```bash
   redis-cli -p 7000 CLUSTER INFO
   redis-cli -p 7000 CLUSTER NODES | grep fail
   redis-cli --cluster check 127.0.0.1:7000
   ```

2. **Forget the failed node from all remaining nodes**:
   ```bash
   # Get the failed node ID: match the exact "fail" flag in column 3,
   # not "fail?" (PFAIL, which only marks a suspected failure)
   FAILED_NODE_ID=$(redis-cli -p 7000 CLUSTER NODES | awk '$3 ~ /(^|,)fail(,|$)/ {print $1}')

   # Forget from each remaining node. CLUSTER FORGET bans the ID for only
   # 60 seconds, so run it on every node within that window.
   for port in 7000 7001 7002 7003 7004; do
     redis-cli -p "$port" CLUSTER FORGET "$FAILED_NODE_ID"
   done
   ```

3. **Identify slots in migrating/importing state and reset them**:
   ```bash
   # Check each node for stuck slots. A node lists them on its own "myself"
   # line as [slot->-node_id] (migrating) or [slot-<-node_id] (importing).
   for port in 7000 7001 7002 7003 7004; do
     echo "=== Port $port ==="
     redis-cli -p "$port" CLUSTER NODES | grep myself | grep -oE '\[[0-9]+-[<>]-[0-9a-f]+\]'
   done

   # For each stuck slot, determine where the data actually is
   # and set the slot to the correct node
   # (replace <slot> and <correct_node_id> with real values)
   redis-cli -p 7000 CLUSTER SETSLOT <slot> STABLE
   redis-cli -p 7000 CLUSTER SETSLOT <slot> NODE <correct_node_id>
   ```

4. **Recover data from the failed node if possible**:
   ```bash
   # If the node can be restarted, start it and let it rejoin
   redis-server /etc/redis/redis.conf --port 7005

   # Once it rejoins, check its slots
   redis-cli -p 7005 CLUSTER NODES
   redis-cli -p 7005 DBSIZE

   # If data is needed, list keys on the recovered node (first 100 shown)
   redis-cli -p 7005 --scan | head -100
   ```

5. **Repair the cluster by redistributing affected slots**:
   ```bash
   redis-cli --cluster fix 127.0.0.1:7000
   redis-cli --cluster check 127.0.0.1:7000
   ```
6. **Add a replacement node if needed**:
   ```bash
   redis-server /etc/redis/7005.conf --port 7005 --cluster-enabled yes
   redis-cli --cluster add-node 127.0.0.1:7005 127.0.0.1:7000
   ```
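The check for slots stuck in a migrating or importing state can be wrapped in a small helper that prints one line per stuck slot. This is a sketch, assuming `CLUSTER NODES` output arrives on stdin; `stuck_slots` is a hypothetical name, and the parsing relies on the `[slot->-node_id]` and `[slot-<-node_id]` markers that a node prints on its own `myself` line.

```bash
# Sketch: list slots stuck mid-migration from CLUSTER NODES output on stdin.
# stuck_slots is a hypothetical helper name, not part of redis-cli.
# Output format: "<slot> <migrating|importing> <peer_node_id>"
stuck_slots() {
  awk '
    $3 ~ /myself/ {
      # Slot entries start at field 9 of a CLUSTER NODES line.
      for (i = 9; i <= NF; i++) {
        if ($i ~ /^\[[0-9]+->-/)      state = "migrating"
        else if ($i ~ /^\[[0-9]+-<-/) state = "importing"
        else continue
        entry = substr($i, 2, length($i) - 2)   # drop the surrounding brackets
        split(entry, parts, /-[<>]-/)           # parts[1]=slot, parts[2]=peer id
        print parts[1], state, parts[2]
      }
    }'
}
```

It could be run per node, for example `redis-cli -p 7000 CLUSTER NODES | stuck_slots`. To judge where a slot's keys actually live before choosing the owner for `CLUSTER SETSLOT ... NODE`, one option is to compare `redis-cli -p <port> CLUSTER COUNTKEYSINSLOT <slot>` on the source and target nodes.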

Prevention

- Never reshard during maintenance windows for other cluster nodes
- Monitor cluster health with `redis-cli --cluster check` on a schedule
- Set up alerting on `cluster_state` changing from `ok`
- Use at least 3 masters and 3 replicas for production clusters
- Configure proper memory limits to prevent OOM during migration
- Test failure scenarios regularly: kill nodes during resharding to validate recovery procedures
- Keep a current backup of all node RDB files before any resharding operation
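The memory-limit advice can be turned into a pre-reshard check. The sketch below assumes `INFO memory` output is piped in on stdin and reads only the `used_memory` and `maxmemory` fields; the helper name `memory_headroom_ok` and the 25% default threshold are hypothetical choices, not Redis recommendations.

```bash
# Sketch: pre-reshard memory check from "INFO memory" output on stdin.
# memory_headroom_ok and the 25% default are illustrative assumptions.
# Usage: redis-cli -p 7000 INFO memory | memory_headroom_ok [min_free_percent]
memory_headroom_ok() {
  # INFO lines are "key:value" terminated by CRLF; drop the \r first.
  tr -d '\r' | awk -F: -v min="${1:-25}" '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      # maxmemory:0 means no limit is configured on this node.
      if (max + 0 == 0) { print "no maxmemory limit set"; exit 1 }
      free_pct = int((max - used) * 100 / max)
      printf "free=%d%% (minimum %d%%)\n", free_pct, min
      exit (free_pct >= min ? 0 : 1)
    }'
}
```

Running it against every master before a reshard, and skipping the reshard when any node exits non-zero, is one way to reduce the chance of an OOM kill mid-migration.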