## Introduction

When a Redis Cluster node fails during resharding, the cluster can be left in an inconsistent state where some slots are marked as `migrating` or `importing` but the migration never completed. This causes `MOVED` redirect errors and potential data loss for keys in the affected slots.

## Symptoms

- `CLUSTER INFO` shows `cluster_state:fail` or `cluster_slots_assigned` less than 16384
- `CLUSTER NODES` shows the failed node as `fail` with slots in a transitioning state
- Clients receive `CLUSTERDOWN The cluster is not ok` errors
- Some keys return `-MOVED` redirects to a non-existent node
- `redis-cli --cluster check` reports "Not all 16384 slots are covered"
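
The first symptom can be checked mechanically. The following is a minimal sketch, assuming `CLUSTER INFO` output is piped in on stdin; the function name `check_cluster_info` is hypothetical and not part of `redis-cli`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: summarize the two CLUSTER INFO fields listed above.
# Reads CLUSTER INFO output on stdin, prints one line per problem found,
# and returns non-zero if any problem was found.
check_cluster_info() {
  # CLUSTER INFO lines may end in CRLF, so strip \r before parsing.
  tr -d '\r' | awk -F: '
    $1 == "cluster_state"          { state = $2 }
    $1 == "cluster_slots_assigned" { assigned = $2 }
    END {
      bad = 0
      if (state != "ok")     { print "cluster_state is " state; bad = 1 }
      if (assigned != 16384) { print "only " assigned " of 16384 slots assigned"; bad = 1 }
      exit bad
    }'
}

# Usage against a live node (port 7000 is the example port used in this guide):
#   redis-cli -p 7000 CLUSTER INFO | check_cluster_info
```

Because the function only parses text, it can be tried on saved `CLUSTER INFO` output before pointing it at a live cluster.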

## Common Causes

- Node crashes due to the OOM killer during a memory-intensive migration
- Network partition isolating the target node mid-migration
- Disk failure on the node being migrated to
- Kubernetes pod eviction terminating a Redis node during a reshard
- Hardware failure on bare-metal Redis cluster nodes

## Step-by-Step Fix

1. **Assess cluster state and identify failed node**:
   ```bash
   redis-cli -p 7000 CLUSTER INFO
   redis-cli -p 7000 CLUSTER NODES | grep fail
   redis-cli --cluster check 127.0.0.1:7000
   ```
2. **Forget the failed node on all remaining nodes**:
   ```bash
   # Get the failed node ID (field 3 of CLUSTER NODES holds the flags;
   # this assumes exactly one node is flagged "fail")
   FAILED_NODE_ID=$(redis-cli -p 7000 CLUSTER NODES | awk '$3 ~ /(^|,)fail(,|$)/ {print $1}')

   # Forget it on each remaining node. Run all of these within 60 seconds,
   # otherwise gossip from a node that still knows the ID re-adds it.
   for port in 7000 7001 7002 7003 7004; do
     redis-cli -p "$port" CLUSTER FORGET "$FAILED_NODE_ID"
   done
   ```
3. **Identify slots in migrating/importing state and reset them**:
   ```bash
   # Check each node for stuck slots. CLUSTER NODES marks them on the node's
   # own line as [slot->-node_id] (migrating) or [slot-<-node_id] (importing).
   for port in 7000 7001 7002 7003 7004; do
     echo "=== Port $port ==="
     redis-cli -p "$port" CLUSTER NODES | grep -E '\[[0-9]+-[<>]-'
   done

   # For each stuck slot, determine where the data actually is,
   # then clear the migration state and assign the slot to the correct node
   redis-cli -p 7000 CLUSTER SETSLOT <slot> STABLE
   redis-cli -p 7000 CLUSTER SETSLOT <slot> NODE <correct_node_id>
   ```
4. **Recover data from the failed node if possible**:
   ```bash
   # If the node can be restarted, start it and let it rejoin
   redis-server /etc/redis/redis.conf --port 7005

   # Once it rejoins, check its slots
   redis-cli -p 7005 CLUSTER NODES
   redis-cli -p 7005 DBSIZE

   # If data is needed, list a sample of the keys held by the recovered node
   redis-cli -p 7005 --scan | head -100
   ```
5. **Repair the cluster by redistributing affected slots**:
   ```bash
   redis-cli --cluster fix 127.0.0.1:7000
   redis-cli --cluster check 127.0.0.1:7000
   ```
6. **Add a replacement node if needed**:
   ```bash
   redis-server /etc/redis/7005.conf --port 7005 --cluster-enabled yes
   redis-cli --cluster add-node 127.0.0.1:7005 127.0.0.1:7000
   ```
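
When deciding which slots need `CLUSTER SETSLOT`, it can help to turn the bracketed slot markers in `CLUSTER NODES` output into a plain list. The following is a rough sketch, assuming the standard `CLUSTER NODES` line layout (slot entries from field 9 onward, open slots written as `[slot->-node_id]` or `[slot-<-node_id]`); the function name `list_stuck_slots` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one line per slot left open by an interrupted
# migration, in the form:  <slot> migrating|importing <peer-node-id>
# Reads CLUSTER NODES output on stdin.
list_stuck_slots() {
  tr -d '\r' | awk '
    {
      # Slot entries start at field 9 of each CLUSTER NODES line.
      for (i = 9; i <= NF; i++) {
        if ($i ~ /^\[[0-9]+->-[0-9a-f]+\]$/) {
          # Strip the surrounding brackets, then split slot from node ID.
          split(substr($i, 2, length($i) - 2), p, "->-")
          print p[1], "migrating", p[2]
        } else if ($i ~ /^\[[0-9]+-<-[0-9a-f]+\]$/) {
          split(substr($i, 2, length($i) - 2), p, "-<-")
          print p[1], "importing", p[2]
        }
      }
    }'
}

# Usage against a live node (example port from this guide):
#   redis-cli -p 7000 CLUSTER NODES | list_stuck_slots
```

Each output line names the slot, the direction, and the peer node ID recorded for the migration, which is the information needed to choose `<slot>` and `<correct_node_id>` in step 3. An empty result on every node suggests no migration state is left open.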