Introduction

Redis Cluster node synchronization failures occur when new nodes cannot join the cluster, replica nodes cannot sync with their masters, or nodes lose synchronization after a network partition. Redis Cluster uses a gossip-based failure-detection and configuration-propagation protocol in which nodes periodically exchange cluster state information. When nodes fail to sync, the cluster may operate in degraded mode, reject writes, or become completely unavailable if quorum cannot be established.

Synchronization can fail at several stages: the initial CLUSTER MEET handshake, gossip packet loss preventing slot-map propagation, a full resync (SYNC) failing due to memory constraints, a partial resync (PSYNC) failing due to replication backlog overflow, replication buffer overflow disconnecting replicas, slot migration getting stuck during rebalancing, or a blocked cluster bus port cutting off inter-node communication. Common causes include a firewall blocking the cluster bus port (16379 by default), cluster-announce-ip misconfigured for NAT/container environments, a replication backlog too small for the write volume, client output buffer limits disconnecting slow replicas, a master under heavy load unable to serve the replication stream, a cluster node timeout too short for the network latency (causing false failure detection), DNS resolution failures between nodes, TLS certificate validation failures on an encrypted cluster bus, and a full disk preventing the RDB snapshot needed for full sync.

The fix requires understanding Redis Cluster architecture, gossip protocol mechanics, replication strategies (full vs. partial), buffer management, and cluster healing procedures. This guide provides production-proven troubleshooting for Redis Cluster across bare metal, Kubernetes, Docker, and cloud-managed Redis deployments.

Symptoms

  • CLUSTER MEET command returns OK but node never joins cluster
  • CLUSTER NODES shows node in fail or disconnected state
  • Replica node shows master_link_status: down
  • CLUSTER INFO shows cluster_state:fail
  • Slots not served (cluster_state:fail with slot coverage gaps)
  • Write operations return CLUSTERDOWN Hash slot not served
  • Replica stuck in LOADING or SYNC state
  • Cluster bus port connection refused between nodes
  • Gossip messages not propagating across cluster
  • Slot migration stuck mid-transfer
  • MIGRATE command timeout during rebalance
  • Node marked as PFAIL (Probably Fail) then FAIL
  • Cluster quorum lost during network partition
  • Partial resync falls back to full sync repeatedly

Common Causes

  • Cluster bus port (16379) blocked by firewall
  • cluster-announce-ip not set for NAT/container environments
  • Replication backlog overflow causing full resync
  • Client output buffer limit disconnecting replicas
  • Master node memory pressure during full sync
  • Network partition isolating cluster nodes
  • cluster-node-timeout too short for network latency
  • TLS certificate mismatch on cluster bus
  • Disk full preventing RDB snapshot for sync
  • DNS resolution failure between cluster nodes

Step-by-Step Fix

### 1. Diagnose cluster sync issues

Check cluster status:

```bash
# Connect to any cluster node
redis-cli -h node1.example.com -p 6379

# Check overall cluster state
CLUSTER INFO

# Key metrics:
# cluster_state:ok (or fail)
# cluster_slots_assigned:16384
# cluster_slots_ok:16384
# cluster_known_nodes:6
# cluster_size:3

# If cluster_state:fail, check slot coverage
CLUSTER SLOTS

# Shows which nodes serve which slots
# Expected: all 16384 slots covered

# Check node relationships
CLUSTER NODES

# Output format:
# <node_id> <ip:port@cport> <flags> <master> <ping-sent> <pong-recv> <config-epoch> <link-state> <slot>...

# Flags explain node state:
# myself  = this node
# master  = master node
# slave   = replica node
# fail?   = PFAIL (Probably Fail)
# fail    = FAIL (confirmed fail)
# noflags = healthy, no special state

# Check a specific node's own entry
redis-cli -h node2.example.com -p 6379 CLUSTER NODES | grep myself
```

Check replication status:

```bash
# On replica node
redis-cli -h replica1.example.com -p 6379 INFO replication

# Key fields:
# role:slave
# master_host:master1.example.com
# master_port:6379
# master_link_status:up (or down)
# master_sync_in_progress:0 (or 1 if syncing)
# master_last_io_seconds_ago:0
# master_sync_left_bytes:<bytes remaining, only shown during a sync>

# If master_link_status:down:
# - Network issue to master
# - Master crashed
# - Authentication failure

# On master node
redis-cli -h master1.example.com -p 6379 INFO replication

# Shows connected replicas:
# connected_slaves:2
# slave0:ip=192.168.1.11,port=6379,state=online,offset=123456,lag=0
# slave1:ip=192.168.1.12,port=6379,state=online,offset=123450,lag=1

# If lag is high or offset not advancing:
# - Replica falling behind
# - Network latency
# - Replica under heavy load
```

Check cluster bus connectivity:

```bash
# Test cluster bus port between nodes
# Default cluster bus port = client port + 10000
# If client port is 6379, cluster bus is 16379

telnet node1.example.com 16379
# Connected = port open
# Connection refused = port blocked

# Or using nc (netcat)
nc -zv node1.example.com 16379
# Connection to node1.example.com 16379 port [tcp/*] succeeded!

# Check firewall rules
sudo iptables -L -n | grep 16379
# Should allow traffic from cluster nodes

# On Kubernetes, check NetworkPolicy
kubectl get networkpolicy -n redis-cluster
kubectl describe networkpolicy redis-cluster-policy

# Check if cluster bus traffic is encrypted
redis-cli -h node1.example.com -p 6379 CONFIG GET tls-cluster
# Returns: tls-cluster "yes" (if TLS enabled)
```

### 2. Fix CLUSTER MEET failures

Manually add node to cluster:

```bash
# Add new node to existing cluster
redis-cli -h existing-node.example.com -p 6379 \
  CLUSTER MEET new-node.example.com 6379

# For a cluster bus on a non-default port (optional third argument, Redis 4.0+)
redis-cli -h existing-node.example.com -p 6379 \
  CLUSTER MEET new-node.example.com 6379 16380

# Check if node joined
redis-cli -h existing-node.example.com -p 6379 CLUSTER NODES | grep new-node

# If the node doesn't appear after 30 seconds:
# 1. Check cluster bus connectivity
# 2. Verify cluster-announce-ip configuration
# 3. Check node timeout settings
```
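Rather than re-running CLUSTER NODES by hand, a small helper can poll until the new node appears. This is a sketch; the hostnames and the 30-second default are assumptions, and the first argument is any command that prints CLUSTER NODES output.

```shell
#!/bin/bash
# wait_for_join NODES_CMD PATTERN [TIMEOUT_SECS]
# Polls the output of NODES_CMD once per second until a line matching
# PATTERN appears, or gives up after TIMEOUT_SECS (default 30).
wait_for_join() {
  local nodes_cmd="$1" pattern="$2" timeout="${3:-30}" elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if eval "$nodes_cmd" | grep -q "$pattern"; then
      echo "joined"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "not-joined"
  return 1
}

# Example (hypothetical hosts):
# wait_for_join "redis-cli -h existing-node.example.com -p 6379 CLUSTER NODES" \
#               "new-node.example.com" 30
```

If the helper reports `not-joined`, move on to the cluster-bus and cluster-announce-ip checks above before retrying CLUSTER MEET.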

Configure cluster-announce-ip for NAT:

```bash
# For Docker/Kubernetes/NAT environments
# Edit redis.conf on each node

# What the node should announce to the cluster
cluster-announce-ip 192.168.1.100
cluster-announce-port 6379
cluster-announce-bus-port 16379
```

In Docker Compose:

```yaml
services:
  redis-node-1:
    image: redis:7-alpine
    command: >
      redis-server --cluster-enabled yes
      --cluster-announce-ip 10.0.0.1
      --cluster-announce-port 6379
      --cluster-announce-bus-port 16379
    ports:
      - "6379:6379"
      - "16379:16379"
```

In a Kubernetes StatefulSet:

```yaml
spec:
  containers:
    - name: redis
      command:
        - redis-server
        - --cluster-enabled
        - "yes"
        - --cluster-announce-ip
        - $(POD_IP)   # expanded from the environment variable below
      env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
```

Use CLUSTER FORGET to remove stale nodes:

```bash
# If a node was removed but still appears in the cluster,
# get the node ID to forget
redis-cli -h node1.example.com -p 6379 CLUSTER NODES | grep stale-node

# Forget the node (run on ALL cluster nodes)
redis-cli -h node1.example.com -p 6379 \
  CLUSTER FORGET <stale-node-id>

# Wait for gossip to propagate (2x node-timeout)
# Node should disappear from CLUSTER NODES output

# IMPORTANT: Must run on every node within 60 seconds,
# or the forgotten node will reappear via gossip
```
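Because the FORGET must reach every node inside the 60-second window, it helps to generate all the commands up front and execute them in one pass. A minimal dry-run sketch (the host:port list is an assumption; pipe the output to `sh` to actually run it):

```shell
#!/bin/bash
# forget_everywhere STALE_NODE_ID host:port [host:port ...]
# Emits one "redis-cli ... CLUSTER FORGET" command per cluster node,
# so the same node ID can be forgotten everywhere quickly.
forget_everywhere() {
  local stale_id="$1"
  shift
  local node host port
  for node in "$@"; do
    host="${node%:*}"   # part before the colon
    port="${node#*:}"   # part after the colon
    echo "redis-cli -h $host -p $port CLUSTER FORGET $stale_id"
  done
}

# Usage (hypothetical hosts):
# forget_everywhere <stale-node-id> \
#   node1.example.com:6379 node2.example.com:6379 node3.example.com:6379 | sh
```

The dry-run pattern lets you review the exact commands before executing them against production nodes.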

### 3. Fix full resync failures

Increase replication backlog:

```bash
# Check current backlog size
redis-cli -h master.example.com -p 6379 CONFIG GET repl-backlog-size
# Returns: repl-backlog-size "1mb"

# Check how long the backlog is kept after the last replica disconnects
redis-cli -h master.example.com -p 6379 CONFIG GET repl-backlog-ttl
# Returns: repl-backlog-ttl "3600"

# Increase backlog for high-write workloads
redis-cli -h master.example.com -p 6379 \
  CONFIG SET repl-backlog-size 256mb

# Make permanent in redis.conf
# repl-backlog-size 256mb
# repl-backlog-ttl 3600

# Backlog sizing guidelines:
# - Low write (<10KB/s): 1MB
# - Medium write (10-100KB/s): 64MB
# - High write (>100KB/s): 256MB+
# The backlog should hold at least 60 seconds of writes
```
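The "60+ seconds of writes" rule can be turned into arithmetic: sample `master_repl_offset` twice, derive the write rate, and multiply by the hold time. A sketch, with the sampling commands shown only as comments (the master hostname is a placeholder):

```shell
#!/bin/bash
# backlog_bytes_needed OFFSET1 OFFSET2 INTERVAL_SECS HOLD_SECS
# Given two master_repl_offset samples taken INTERVAL_SECS apart,
# return a backlog size that holds HOLD_SECS seconds of writes.
backlog_bytes_needed() {
  local offset1="$1" offset2="$2" interval_secs="$3" hold_secs="$4"
  local rate=$(( (offset2 - offset1) / interval_secs ))   # bytes/sec
  echo $(( rate * hold_secs ))
}

# Sampling sketch (hypothetical host):
# o1=$(redis-cli -h master.example.com INFO replication | \
#      awk -F: '/^master_repl_offset/{print $2}' | tr -d '\r')
# sleep 10
# o2=$(redis-cli -h master.example.com INFO replication | \
#      awk -F: '/^master_repl_offset/{print $2}' | tr -d '\r')
# backlog_bytes_needed "$o1" "$o2" 10 60
```

Round the result up to the next power-of-two-ish value (e.g. 64mb, 256mb) when setting `repl-backlog-size`.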

Increase client output buffer limits:

```bash
# Check current buffer limits
redis-cli -h master.example.com -p 6379 CONFIG GET client-output-buffer-limit

# Returns format:
# client-output-buffer-limit "normal 0 0 0 slave 256mb 64mb 60 pubsub 32mb 8mb 60"
# Format: class hard-limit soft-limit soft-seconds
# slave: 256MB hard, 64MB soft, 60 seconds

# If replicas disconnect during full sync,
# increase the slave buffer limits
redis-cli -h master.example.com -p 6379 \
  CONFIG SET client-output-buffer-limit "slave 512mb 128mb 120"

# Make permanent in redis.conf
# client-output-buffer-limit slave 512mb 128mb 120

# Monitor buffer usage on the MASTER, where the replica
# connections (flags=S) and their output buffers live
redis-cli -h master.example.com -p 6379 CLIENT LIST | grep flags=S
# Look at the omem (output memory) value
```

Debug full resync process:

```bash
# Raise log verbosity on the replica to see replication events
redis-cli -h replica.example.com -p 6379 \
  CONFIG SET loglevel verbose

# Watch INFO replication during sync
watch -n 1 'redis-cli INFO replication | grep -E "role|master_sync|repl"'

# During full sync:
# master_sync_in_progress:1
# master_sync_left_bytes:<decreasing>

# If sync fails, check:
# - Disk space for the RDB file
# - Memory for copy-on-write
# - Network bandwidth
```

### 4. Fix partial resync failures

Understand PSYNC flow:

```bash
# Check if replica supports PSYNC
redis-cli -h replica.example.com -p 6379 INFO server | grep redis_version
# PSYNC requires Redis 2.8+

# Check master replication ID
redis-cli -h master.example.com -p 6379 INFO replication | grep master_replid
# Returns: master_replid:<40-char-hex>
#          master_replid2:<40-char-hex, or zeros if unused>

# When a replica reconnects, it sends:
# PSYNC <replication_id> <offset>

# Master responds with one of:
# +CONTINUE                  (partial resync)
# +FULLRESYNC <id> <offset>  (full resync needed)

# Full resync is triggered when:
# - Replication ID mismatch (new master)
# - Offset not in backlog (too old)
# - First-time replication
```
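The decision the master makes can be mirrored as a quick local check: compare the replication IDs and see whether the replica's offset still falls inside the backlog window. A sketch only; it takes values you have already pulled from INFO replication, and it ignores Redis 4.0's secondary-ID handling:

```shell
#!/bin/bash
# can_psync MASTER_REPLID REPLICA_REPLID MASTER_OFFSET REPLICA_OFFSET BACKLOG_BYTES
# Predicts whether a reconnecting replica could partially resync.
can_psync() {
  local master_replid="$1" replica_replid="$2"
  local master_offset="$3" replica_offset="$4" backlog_size="$5"
  if [ "$master_replid" != "$replica_replid" ]; then
    echo "FULLRESYNC (replid mismatch)"
  elif [ $((master_offset - replica_offset)) -gt "$backlog_size" ]; then
    echo "FULLRESYNC (offset outside backlog)"
  else
    echo "CONTINUE (partial resync possible)"
  fi
}

# Example values (illustrative only):
# can_psync "$m_id" "$r_id" "$m_off" "$r_off" 1048576
```

If this says the offset is outside the backlog while the replica was only briefly disconnected, the backlog is undersized for your write rate.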

Fix partial resync issues:

```bash
# If a replica constantly falls back to full sync:
# 1. Increase replication backlog size
CONFIG SET repl-backlog-size 512mb

# 2. Increase backlog TTL
CONFIG SET repl-backlog-ttl 7200   # 2 hours

# 3. Check replica offset lag
redis-cli -h replica.example.com -p 6379 INFO replication | grep offset

# Compare master and replica offsets:
# master:  master_repl_offset:1234567
# replica: slave_repl_offset:1234500
# Lag = 67 bytes

# If the lag grows continuously:
# - Replica under too much load
# - Network bandwidth limited
# - Master producing writes faster than the network can carry

# Check the reason for partial sync failures in the master log
tail -f /var/log/redis/redis-server.log | grep -E "PSYNC|FULLRESYNC"
```

Monitor replication buffer:

```bash
# On master, check replica connection buffers
redis-cli -h master.example.com -p 6379 CLIENT LIST

# Find replica entries:
# addr=192.168.1.11:45678 fd=10 name= age=3600 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=replconf

# Key fields:
# flags=S = replica connection
# obl     = output buffer length
# omem    = output memory used

# If omem approaches the client-output-buffer-limit:
# the replica is falling behind and needs a larger buffer or a faster sync
```
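Picking `omem` out of CLIENT LIST output by eye is error-prone; an awk filter can flag replica connections over a threshold. A sketch that works on the text format shown above (the threshold value and the master hostname in the usage note are assumptions):

```shell
#!/bin/bash
# flag_big_omem THRESHOLD_BYTES
# Reads CLIENT LIST output on stdin and prints the first field plus
# omem for every replica connection (flags=S) whose omem exceeds
# the threshold.
flag_big_omem() {
  local threshold="$1"
  awk -v t="$threshold" '
    /flags=S/ {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^omem=/) {
          split($i, kv, "=")
          if (kv[2] + 0 > t) print $1, kv[2]
        }
    }'
}

# Usage (hypothetical host; 128MB threshold):
# redis-cli -h master.example.com -p 6379 CLIENT LIST | flag_big_omem 134217728
```

Run this alongside the soft limit you configured so alerts fire before Redis disconnects the replica.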

### 5. Fix slot migration stuck issues

Resume stuck slot migration:

```bash
# Check migration state
redis-cli -h source-node.example.com -p 6379 CLUSTER NODES

# Look for migrating slots:
# <node_id> ... connected 0-5460 [1->-<target-node-id>]
# [<slot>->-<node-id>] = slot migrating to that target

# If migration is stuck:
# 1. Check that both nodes respond
redis-cli -h source-node.example.com -p 6379 PING
redis-cli -h target-node.example.com -p 6379 PING

# 2. Check network bandwidth during migration:
#    MIGRATE sends key data over the network;
#    large keys can take minutes to transfer

# 3. Cancel the stuck migration
redis-cli -h source-node.example.com -p 6379 \
  CLUSTER SETSLOT <slot-number> STABLE

# Then retry the migration
```

Configure MIGRATE timeout:

```bash
# MIGRATE's timeout is given in milliseconds;
# increase it for large keys

# Manual MIGRATE with custom timeout
redis-cli -h source-node.example.com -p 6379 \
  MIGRATE target-host 6379 "" 0 60000 COPY KEYS key1 key2 key3

# Timeout: 60000 ms (60 seconds)
# COPY = don't delete from source
# KEYS = migrate multiple keys (the empty "" key argument is required)

# For an entire slot migration:
redis-cli -h source-node.example.com -p 6379 \
  CLUSTER SETSLOT <slot> MIGRATING <target-node-id>

# Then use redis-cli --cluster reshard
redis-cli --cluster reshard source-node.example.com:6379 \
  --cluster-from <source-node-id> \
  --cluster-to <target-node-id> \
  --cluster-slots <num-slots> \
  --cluster-timeout 300000 \
  --cluster-pipeline 100
```

Fix key size issues during migration:

```bash
# Find large keys that may cause migration timeouts
redis-cli -h source-node.example.com -p 6379 --bigkeys

# Output shows the largest keys by type:
# [00.00%] Biggest string found so far '"user:12345:session"' - 1048576 bytes
# [00.00%] Biggest list   found so far '"queue:jobs"' - 50000 elements

# For large keys:
# 1. Increase the MIGRATE timeout
# 2. Migrate during a low-traffic period
# 3. Consider splitting large keys

# Check memory fragmentation
redis-cli -h source-node.example.com -p 6379 INFO memory | grep fragmentation
# High fragmentation can slow migration
```

### 6. Fix gossip protocol issues

Understand gossip timing:

```bash
# Gossip parameters in redis.conf
# cluster-node-timeout 15000          (15 seconds, the default)
# cluster-slave-validity-factor 10
# cluster-migration-barrier 1
# cluster-require-full-coverage yes

# Each node pings a random peer roughly every second
# PFAIL is flagged after cluster-node-timeout without a PONG
# FAIL is broadcast once a majority of masters agree on PFAIL

# Check gossip packet stats (reported by CLUSTER INFO)
redis-cli -h node.example.com -p 6379 CLUSTER INFO | grep messages

# Key metrics:
# cluster_stats_messages_sent
# cluster_stats_messages_received
```
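To see whether gossip is actually flowing, sample the cumulative send counter twice and compute a messages-per-second rate; a rate of zero on a supposedly connected node points to bus trouble. A sketch assuming the counter is named `cluster_stats_messages_sent` in CLUSTER INFO output (the hostnames are placeholders):

```shell
#!/bin/bash
# msgs_sent: extract the cumulative gossip send counter from
# CLUSTER INFO text read on stdin.
msgs_sent() {
  awk -F: '/^cluster_stats_messages_sent:/ { gsub(/\r/, "", $2); print $2 }'
}

# gossip_rate SENT1 SENT2 INTERVAL_SECS: messages per second
# between the two samples.
gossip_rate() {
  echo $(( ($2 - $1) / $3 ))
}

# Sampling sketch (hypothetical host):
# s1=$(redis-cli -h node.example.com -p 6379 CLUSTER INFO | msgs_sent)
# sleep 10
# s2=$(redis-cli -h node.example.com -p 6379 CLUSTER INFO | msgs_sent)
# gossip_rate "$s1" "$s2" 10
```

As a rough expectation, the per-node gossip rate grows with cluster size, so establish a baseline before alerting on it.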

Fix gossip packet loss:

```bash
# If nodes are not receiving gossip:
# 1. Check network MTU (MTU mismatches along the path can
#    fragment or drop gossip packets)
ip link show eth0 | grep mtu
# Standard: mtu 1500
# Jumbo:    mtu 9000

# 2. Keep gossip volume in mind:
#    more nodes = more gossip traffic per node

# 3. Increase cluster-node-timeout for high-latency networks
redis-cli -h node.example.com -p 6379 \
  CONFIG SET cluster-node-timeout 30000

# Make permanent in redis.conf
# cluster-node-timeout 30000

# For cross-datacenter clusters, use a 30-60 second timeout
```

Handle network partitions:

```bash
# During a network partition:
# - Nodes on the minority side stop accepting writes
# - Nodes on the majority side continue operating
# - cluster_state shows fail on the minority side

# Check partition state
redis-cli -h node1.example.com -p 6379 CLUSTER INFO

# If CLUSTER NODES shows several peers flagged fail? or fail,
# suspect a partition

# After the partition heals:
# 1. Nodes rejoin automatically via gossip
# 2. Replicas may need a full sync if the partition
#    outlasted the replication backlog

# Force a node to rejoin the cluster
redis-cli -h healed-node.example.com -p 6379 \
  CLUSTER MEET existing-node.example.com 6379

# Check the cluster recovered
redis-cli -h node1.example.com -p 6379 CLUSTER INFO | grep cluster_state
# Should return: cluster_state:ok
```

### 7. Fix TLS cluster bus issues

Configure TLS for cluster bus:

```bash
# Enable TLS cluster communication
# redis.conf
tls-port 6379
port 0                      # optionally disable the plaintext port
tls-cert-file /etc/redis/tls/redis.crt
tls-key-file /etc/redis/tls/redis.key
tls-ca-cert-file /etc/redis/tls/ca.crt

# Encrypt the cluster bus as well
tls-cluster yes

# Restart Redis after TLS config changes
sudo systemctl restart redis

# Check TLS is working
redis-cli -h node1.example.com -p 6379 --tls \
  --cacert /etc/redis/tls/ca.crt \
  --cert /etc/redis/tls/redis.crt \
  --key /etc/redis/tls/redis.key \
  CLUSTER INFO
```

Debug TLS handshake failures:

```bash
# Test the TLS connection
openssl s_client -connect node1.example.com:16379 \
  -CAfile /etc/redis/tls/ca.crt \
  -cert /etc/redis/tls/redis.crt \
  -key /etc/redis/tls/redis.key

# Check certificate validity
openssl x509 -in /etc/redis/tls/redis.crt -text -noout | \
  grep -E "Subject|Issuer|Not Before|Not After"

# Common TLS issues:
# 1. Certificate expired
# 2. CN/SAN doesn't match hostname
# 3. CA chain incomplete
# 4. Certificate not trusted

# Fix: regenerate certificates with the correct SAN
openssl req -new -key redis.key -out redis.csr \
  -subj "/CN=redis-node1.example.com" \
  -addext "subjectAltName=DNS:redis-node1.example.com,DNS:node1"
```
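Since expired certificates are the most common of those four issues, it is worth alerting before they lapse. A sketch that converts openssl's `notAfter` date into days remaining (assumes GNU `date -d`; the certificate path in the usage note is an example):

```shell
#!/bin/bash
# days_until_expiry "NOT_AFTER_DATE"
# Takes a date string as printed by "openssl x509 -enddate"
# (e.g. "Dec 31 23:59:59 2030 GMT") and prints the number of
# whole days from now until that date (negative if already past).
days_until_expiry() {
  local not_after="$1"
  local now target
  now=$(date +%s)
  target=$(date -d "$not_after" +%s)
  echo $(( (target - now) / 86400 ))
}

# Usage (path is an example):
# na=$(openssl x509 -enddate -noout -in /etc/redis/tls/redis.crt | cut -d= -f2)
# days_until_expiry "$na"
```

Wire the result into your monitoring and alert when it drops below, say, 30 days, so cluster-bus TLS never breaks silently.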

### 8. Fix Kubernetes Redis Cluster sync issues

Configure StatefulSet for cluster:

```yaml
# redis-cluster-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cluster
spec:
  serviceName: redis-cluster
  replicas: 6
  selector:
    matchLabels:
      app: redis-cluster
  template:
    metadata:
      labels:
        app: redis-cluster
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          command:
            - redis-server
            - --cluster-enabled
            - "yes"
            - --cluster-config-file
            - /data/nodes.conf
            - --cluster-announce-ip
            - $(POD_IP)
            - --cluster-announce-port
            - "6379"
            - --cluster-announce-bus-port
            - "16379"
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          ports:
            - name: client
              containerPort: 6379
            - name: cluster
              containerPort: 16379
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

Create cluster after StatefulSet ready:

```bash
# Wait for all pods to be ready
kubectl wait --for=condition=Ready pod -l app=redis-cluster --timeout=300s

# Create the cluster using redis-cli
kubectl exec -it redis-cluster-0 -- redis-cli \
  --cluster create \
  redis-cluster-0.redis-cluster:6379 \
  redis-cluster-1.redis-cluster:6379 \
  redis-cluster-2.redis-cluster:6379 \
  redis-cluster-3.redis-cluster:6379 \
  redis-cluster-4.redis-cluster:6379 \
  redis-cluster-5.redis-cluster:6379 \
  --cluster-replicas 1 \
  --cluster-yes

# Check cluster status
kubectl exec -it redis-cluster-0 -- redis-cli CLUSTER INFO
```

Fix Kubernetes network policy:

```yaml
# Allow cluster bus traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: redis-cluster-policy
spec:
  podSelector:
    matchLabels:
      app: redis-cluster
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: redis-cluster
      ports:
        - protocol: TCP
          port: 6379
        - protocol: TCP
          port: 16379
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: redis-cluster
      ports:
        - protocol: TCP
          port: 6379
        - protocol: TCP
          port: 16379
```

### 9. Monitor cluster sync health

Create monitoring script:

```bash
#!/bin/bash
# monitor-redis-cluster-sync.sh

CLUSTER_NODES=(
  "node1.example.com:6379"
  "node2.example.com:6379"
  "node3.example.com:6379"
)

ALERT_EMAIL="ops@example.com"

check_cluster_health() {
  for node in "${CLUSTER_NODES[@]}"; do
    host=$(echo "$node" | cut -d: -f1)
    port=$(echo "$node" | cut -d: -f2)

    # Check cluster state
    state=$(redis-cli -h "$host" -p "$port" CLUSTER INFO | \
      grep cluster_state | cut -d: -f2 | tr -d '\r')

    if [ "$state" != "ok" ]; then
      echo "ALERT: Cluster state is FAIL on $node"
      echo "Cluster state check failed on $node at $(date)" | \
        mail -s "Redis Cluster Alert: $node" "$ALERT_EMAIL"
    fi

    # Check for PFAIL/FAIL nodes
    failed_nodes=$(redis-cli -h "$host" -p "$port" CLUSTER NODES | \
      grep -cE "fail\?|fail ")

    if [ "$failed_nodes" -gt 0 ]; then
      echo "ALERT: $failed_nodes nodes in FAIL state"
    fi

    # Check slot coverage
    slots_ok=$(redis-cli -h "$host" -p "$port" CLUSTER INFO | \
      grep cluster_slots_ok | cut -d: -f2 | tr -d '\r')

    if [ "$slots_ok" != "16384" ]; then
      echo "ALERT: Incomplete slot coverage on $node: $slots_ok/16384"
    fi

    # Check replica link status
    link=$(redis-cli -h "$host" -p "$port" INFO replication | \
      grep master_link_status | cut -d: -f2 | tr -d '\r')

    if [ "$link" == "down" ]; then
      echo "ALERT: Replica link down on $node"
    fi
  done
}

check_cluster_health
```

A note on Redis Sentinel:

```bash
# Sentinel monitors standalone master-replica setups, NOT Redis Cluster;
# Cluster has its own failure detection and failover.
# For Cluster health checks, use instead:
redis-cli --cluster check node1.example.com:6379

# Example sentinel.conf for a non-cluster master-replica deployment:
# sentinel monitor mymaster node1.example.com 6379 2
# sentinel down-after-milliseconds mymaster 30000
# sentinel parallel-syncs mymaster 1
# sentinel failover-timeout mymaster 180000

# Start Sentinel
redis-sentinel /etc/redis/sentinel.conf
```

### 10. Automated cluster healing

Configure automatic failover:

```bash
# Redis Cluster handles failover automatically:
# when a master is marked FAIL, a replica takes over

# Tune failover sensitivity
# redis.conf
cluster-node-timeout 15000   # default 15 seconds
# Lower  = faster failover, more false positives
# Higher = slower failover, fewer false positives

# For latency-sensitive workloads on reliable networks
cluster-node-timeout 5000    # 5 seconds

# For cross-DC clusters
cluster-node-timeout 30000   # 30 seconds

# Replica validity factor:
# how long a replica can be disconnected and still be eligible for failover
# cluster-slave-validity-factor 10
# Window = (cluster-node-timeout * 10) + repl-ping-replica-period

# Migration barrier:
# minimum replicas a master must keep before one can migrate away
# cluster-migration-barrier 1
```
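The validity-factor formula above is easy to get wrong because it mixes milliseconds and seconds; a tiny helper makes the resulting window explicit. A sketch (the default ping period of 10 seconds is the usual `repl-ping-replica-period` value):

```shell
#!/bin/bash
# validity_window_ms NODE_TIMEOUT_MS FACTOR [PING_PERIOD_SECS]
# Computes the replica failover-eligibility window in milliseconds:
# (cluster-node-timeout * validity-factor) + repl-ping-replica-period.
validity_window_ms() {
  local node_timeout_ms="$1" factor="$2" ping_period_s="${3:-10}"
  echo $(( node_timeout_ms * factor + ping_period_s * 1000 ))
}

# With the defaults (15000 ms timeout, factor 10, 10 s ping period),
# a replica disconnected longer than this window will not attempt failover:
# validity_window_ms 15000 10
```

If that window is shorter than the network blips you expect (e.g. cross-DC failovers), raise the factor or the node timeout accordingly.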

Script automatic cluster healing:

```bash
#!/bin/bash
# heal-redis-cluster.sh

# Check for nodes in FAIL state
FAILED_NODES=$(redis-cli -h healthy-node.example.com -p 6379 \
  CLUSTER NODES | grep "fail " | awk '{print $1}')

for node_id in $FAILED_NODES; do
  echo "Node $node_id is in FAIL state"

  # Try to re-meet the failed node
  redis-cli -h healthy-node.example.com -p 6379 \
    CLUSTER MEET failed-node.example.com 6379

  # Wait for gossip
  sleep 30

  # Check if the node recovered
  state=$(redis-cli -h healthy-node.example.com -p 6379 \
    CLUSTER NODES | grep "$node_id" | awk '{print $3}')

  if [[ $state == *"fail"* ]]; then
    echo "Node still failed; forcing failover"
    # Run CLUSTER FAILOVER TAKEOVER on a REPLICA of the failed master
    redis-cli -h replica-of-failed.example.com -p 6379 \
      CLUSTER FAILOVER TAKEOVER
  fi
done

# Check for slots not covered
redis-cli -h healthy-node.example.com -p 6379 \
  CLUSTER INFO | grep cluster_slots_ok
```

Prevention

  • Monitor replication lag continuously with alerting at 10+ seconds
  • Size replication backlog for 60+ seconds of write volume
  • Set cluster-node-timeout appropriate for network latency
  • Test cluster failover quarterly with chaos engineering
  • Document cluster topology and recovery procedures
  • Use Redis Cluster proxy or client-side clustering for large deployments
  • Regular backup of cluster configuration (nodes.conf)
Common Error Messages

  • **CLUSTERDOWN Hash slot not served**: Cluster not operational
  • **MIGRATE failed**: Slot migration error
  • **master_link_status: down**: Replica cannot sync with master
  • **Connection refused**: Cluster bus port blocked
  • **CLUSTER MEET timeout**: Node cannot join cluster