Fix Redis Sentinel Failover Problems - Complete Troubleshooting Guide

The Problem

Redis Sentinel is supposed to fail over automatically when your master goes down, but something goes wrong: failover never triggers, Sentinel nodes disagree on the master, or the application connects to the wrong node after failover. These issues can cause extended outages or, worse, split-brain scenarios where multiple masters accept writes.

Understanding Sentinel Architecture

Sentinel relies on three key concepts:

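1. Quorum - Number of Sentinels that must agree a master is down before it is flagged as objectively down
2. Majority - Number of Sentinels (more than half of all known Sentinels) that must authorize a failover
3. Parallel-syncs - Number of replicas reconfigured to sync with the new master at the same time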

Misunderstanding these leads to most Sentinel issues.
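The difference between quorum and majority is easier to see with numbers. The following is a minimal sketch, not Sentinel's own code: the function names are hypothetical, and it assumes "reachable" counts the Sentinels that can still talk to each other, including the one doing the counting. Detection needs quorum votes; authorization needs a majority of all known Sentinels, or the quorum if that is larger.

```python
def failover_requirements(total_sentinels: int, quorum: int) -> dict:
    """Votes needed to detect a failure and to authorize a failover."""
    majority = total_sentinels // 2 + 1
    return {
        "detect": quorum,                     # Sentinels that must agree the master is down
        "authorize": max(quorum, majority),   # votes the failover leader must collect
    }


def can_fail_over(total_sentinels: int, reachable: int, quorum: int) -> bool:
    """True if the reachable Sentinels can both detect and authorize."""
    needed = failover_requirements(total_sentinels, quorum)
    return reachable >= needed["detect"] and reachable >= needed["authorize"]


# 3 Sentinels, quorum 2: losing one Sentinel is survivable, losing two is not.
print(can_fail_over(total_sentinels=3, reachable=2, quorum=2))
print(can_fail_over(total_sentinels=3, reachable=1, quorum=2))
```

With 5 Sentinels and quorum 2, two reachable Sentinels can flag the master as down but cannot fail over, because authorization still needs 3 votes.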

Diagnosis Commands

Check Sentinel Status

bash

redis-cli -p 26379 SENTINEL master mymaster

Key fields to examine:

bash

name: mymaster
ip: 10.0.0.1
port: 6379
flags: master
num-other-sentinels: 2
quorum: 2
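When its output is piped, redis-cli prints each field name and its value on alternating lines rather than as name: value pairs. A small sketch of turning that output into a dictionary for scripting follows; parse_flat_reply is a hypothetical helper name, and the sample text reuses the values shown above.

```python
def parse_flat_reply(text: str) -> dict:
    """Pair up alternating name/value lines from piped redis-cli output."""
    lines = text.splitlines()
    # Even-indexed lines are field names, odd-indexed lines are their values.
    return dict(zip(lines[0::2], lines[1::2]))


SAMPLE = (
    "name\nmymaster\n"
    "ip\n10.0.0.1\n"
    "port\n6379\n"
    "flags\nmaster\n"
    "num-other-sentinels\n2\n"
    "quorum\n2\n"
)

fields = parse_flat_reply(SAMPLE)
print(fields["flags"], fields["quorum"])
```

All values come back as strings, so convert numeric fields such as quorum with int() before comparing them.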

Check Sentinel Discovery

bash

redis-cli -p 26379 SENTINEL sentinels mymaster

Should list every other Sentinel instance monitoring this master.

Check Current Master

bash

redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster

Check Replication Info

bash

redis-cli -h <master-ip> INFO replication

Common Failover Failures

Failure 1: Failover Never Triggers

Symptoms: Master is down but no failover occurs.

Diagnosis:

```bash
# Check if Sentinel sees the master as down
# (redis-cli prints each field name and its value on separate lines, hence -A1)
redis-cli -p 26379 SENTINEL master mymaster | grep -A1 '^flags$'

# If showing "s_down" or "o_down", Sentinel detects the issue
# Check if quorum is reached
redis-cli -p 26379 SENTINEL ckquorum mymaster
```

Common Causes:

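1. Insufficient Sentinels:

If fewer Sentinels are usable than the quorum or the majority requires (for example, quorum 2 with only 1 usable Sentinel), SENTINEL ckquorum returns a NOQUORUM error. Sentinel can't fail over in this state.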

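Fix: Ensure all Sentinels can see each other:

conf

# Sentinels discover each other automatically through the monitored master,
# so every Sentinel must monitor the same master under the same name:
sentinel monitor mymaster <ip> <port> <quorum>
# Set these when NAT or multiple interfaces make the auto-detected address unreachable:
sentinel announce-ip <this-sentinel-ip>
sentinel announce-port 26379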

2. Network Partition:

Sentinels can't reach each other:

bash

# From each Sentinel, test connectivity to others
redis-cli -h <other-sentinel-ip> -p 26379 PING

3. Quorum Too High:

If quorum is set higher than the number of available Sentinels:

```bash
# Check current config (field name and value are printed on separate lines)
redis-cli -p 26379 SENTINEL master mymaster | grep -A1 '^quorum$'

# Reduce quorum if needed
redis-cli -p 26379 SENTINEL SET mymaster quorum 2
```
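The connectivity test described under Network Partition can also be scripted for hosts where redis-cli is not installed. This is a minimal sketch under stated assumptions: the Sentinels accept an unauthenticated inline PING over plain TCP (no TLS, no requirepass), and sentinel_ping and unreachable_sentinels are hypothetical helper names.

```python
import socket


def sentinel_ping(host, port=26379, timeout=1.0):
    """Send an inline PING; return the first reply line, or None if unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return None
    return reply.decode("ascii", errors="replace").strip() or None


def unreachable_sentinels(hosts, port=26379):
    """Return the hosts that did not answer PING with +PONG."""
    return [host for host in hosts if sentinel_ping(host, port) != "+PONG"]
```

Run it from each Sentinel host in turn, for example unreachable_sentinels(["sentinel1", "sentinel2", "sentinel3"]). A partition often blocks traffic in one direction only, so a check from a single host is not enough.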

Failure 2: Split-Brain After Failover

Symptoms: Multiple masters exist, data diverges.

Diagnosis:

bash

# Check what each Sentinel thinks is the master
for sentinel in sentinel1 sentinel2 sentinel3; do
    echo "Sentinel $sentinel sees:"
    redis-cli -h $sentinel -p 26379 SENTINEL get-master-addr-by-name mymaster
done
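Once the answers from each Sentinel are collected, grouping them by reported address makes disagreement obvious. The sketch below assumes the answers have already been gathered into a dictionary keyed by Sentinel name; both function names are hypothetical.

```python
from collections import defaultdict


def group_master_views(views):
    """Map each reported (ip, port) to the Sentinels that report it."""
    groups = defaultdict(list)
    for sentinel, addr in views.items():
        groups[tuple(addr)].append(sentinel)
    return dict(groups)


def sentinels_agree(views):
    """True only if every Sentinel reports the same master address."""
    return len(group_master_views(views)) == 1


views = {
    "sentinel1": ("10.0.0.1", 6379),
    "sentinel2": ("10.0.0.2", 6379),
    "sentinel3": ("10.0.0.2", 6379),
}
print(sentinels_agree(views))
print(group_master_views(views))
```

In this example sentinel1 is the odd one out, which suggests checking whether it is the Sentinel that was cut off from the others.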

Root Cause: Network partition isolated some Sentinels from others.

Prevention:

Run an odd number of Sentinels (at least three) and set the quorum to a majority of them:

bash

# For 3 Sentinels, quorum should be 2
# For 5 Sentinels, quorum should be 3
sentinel monitor mymaster 10.0.0.1 6379 2

Recovery:

1. Identify the correct master (the one with the most recent data, usually the highest master_repl_offset):

bash

redis-cli -h <candidate-master> INFO replication | grep master_repl_offset

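2. Demote the incorrect master so it resyncs from the correct one (this discards any writes only it received; use SLAVEOF on Redis versions before 5):

bash

redis-cli -h <wrong-master> REPLICAOF <correct-master-ip> <correct-master-port>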

3. Reset Sentinel state on every Sentinel so each one re-discovers the topology:

bash

redis-cli -p 26379 SENTINEL RESET mymaster
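For the first recovery step, comparing master_repl_offset by hand gets error-prone with several nodes. The following sketch picks the self-declared master with the highest offset from saved INFO replication output. The function names and sample values are hypothetical, and the result is a heuristic only: after a split-brain both masters may hold writes the other lacks, so the offset does not say which data matters more.

```python
def parse_info(text):
    """Parse 'key:value' lines from INFO output, skipping section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def most_advanced_master(info_by_node):
    """Return the node with role:master and the highest master_repl_offset."""
    candidates = {}
    for node, text in info_by_node.items():
        info = parse_info(text)
        if info.get("role") == "master":
            candidates[node] = int(info.get("master_repl_offset", 0))
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


info_by_node = {
    "redis1": "# Replication\nrole:master\nmaster_repl_offset:1500\n",
    "redis2": "# Replication\nrole:master\nmaster_repl_offset:900\n",
    "redis3": "# Replication\nrole:slave\nmaster_repl_offset:2000\n",
}
print(most_advanced_master(info_by_node))
```

Nodes that report role:slave are ignored even if their offset is higher, since the goal here is to choose between nodes that both claim to be master.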

Failure 3: Failover Loop

Symptoms: Failover happens repeatedly.

Diagnosis:

bash

redis-cli -p 26379 SENTINEL master mymaster | grep -A1 -E '^(num-slaves|flags)$'

Root Cause: Often caused by:

- Old master recovering and claiming master role
- Network flapping between nodes
- Inconsistent Sentinel configurations

Fix:

1. Check down-after-milliseconds:

bash

redis-cli -p 26379 SENTINEL master mymaster | grep -A1 '^down-after-milliseconds$'

Set consistent values in sentinel.conf:

conf

sentinel down-after-milliseconds mymaster 30000
sentinel failover-timeout mymaster 180000

2. Ensure all Sentinels have identical monitor configuration:

bash

# Should be same across all Sentinels
sentinel monitor mymaster 10.0.0.1 6379 2
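Checking several sentinel.conf files by eye is easy to get wrong. The sketch below compares the per-master settings that should match across Sentinels, assuming the file contents have already been read into strings; the function names and the COMPARED list are hypothetical choices. Lines that Sentinel writes for its own bookkeeping are deliberately ignored because they legitimately differ between instances.

```python
COMPARED = ("monitor", "down-after-milliseconds", "failover-timeout", "parallel-syncs")


def parse_sentinel_conf(text, master="mymaster"):
    """Collect 'sentinel <option> <master> <value...>' lines for one master."""
    settings = {}
    for line in text.splitlines():
        parts = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(parts) >= 3 and parts[0] == "sentinel" and parts[2] == master:
            settings[parts[1]] = " ".join(parts[3:])
    return settings


def find_mismatches(confs, master="mymaster"):
    """Return {option: {sentinel: value}} for options that differ."""
    parsed = {name: parse_sentinel_conf(text, master) for name, text in confs.items()}
    mismatches = {}
    for option in COMPARED:
        values = {name: settings.get(option) for name, settings in parsed.items()}
        if len(set(values.values())) > 1:
            mismatches[option] = values
    return mismatches


confs = {
    "sentinel1": "sentinel monitor mymaster 10.0.0.1 6379 2\n"
                 "sentinel down-after-milliseconds mymaster 30000\n",
    "sentinel2": "sentinel monitor mymaster 10.0.0.1 6379 2\n"
                 "sentinel down-after-milliseconds mymaster 5000\n",
}
print(find_mismatches(confs))
```

Here only down-after-milliseconds is reported, since both files carry the same monitor line.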

Failure 4: Application Not Notified

Symptoms: Failover completes but application still connects to old master.

Diagnosis:

Check if application uses Sentinel for discovery:

bash

# Correct approach - query Sentinel for current master
redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster

Fix:

The application must use a Sentinel-aware client or poll Sentinel for the current master:

```python
# Python example with redis-py
from redis.sentinel import Sentinel

sentinel = Sentinel([
    ('sentinel1', 26379),
    ('sentinel2', 26379),
    ('sentinel3', 26379),
], socket_timeout=0.1)

master = sentinel.master_for('mymaster', socket_timeout=0.1)
```
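For applications that cannot use a Sentinel-aware client, the polling alternative might look like the sketch below. It is an illustration under assumptions rather than a recommended implementation: watch_master is a hypothetical name, and lookup is any callable that returns the current (host, port), for example a wrapper around redis-py's sentinel.discover_master('mymaster').

```python
import time


def watch_master(lookup, on_change, polls, interval=1.0, sleep=time.sleep):
    """Poll lookup() and call on_change(old, new) when the address changes."""
    current = None
    for i in range(polls):
        try:
            addr = lookup()
        except Exception:
            # Sentinel unreachable or master unknown: keep the last known address.
            addr = None
        if addr is not None and addr != current:
            on_change(current, addr)
            current = addr
        if i < polls - 1:
            sleep(interval)
    return current


# Demonstration with a fake lookup that reports a failover on the third poll.
answers = iter([("10.0.0.1", 6379), ("10.0.0.1", 6379), ("10.0.0.2", 6379)])
changes = []
final = watch_master(
    lookup=lambda: next(answers),
    on_change=lambda old, new: changes.append((old, new)),
    polls=3,
    sleep=lambda seconds: None,  # skip real waiting in the demonstration
)
print(changes)
```

In a real application, on_change would close the existing connections and reconnect to the new address. The polling interval bounds how long the application keeps talking to the old master after a failover.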

Step-by-Step Failover Recovery

Step 1: Assess Current State

bash

# Check all Sentinels
redis-cli -p 26379 SENTINEL masters
redis-cli -p 26379 SENTINEL slaves mymaster
redis-cli -p 26379 SENTINEL sentinels mymaster

Step 2: Identify the Actual Master

bash

# Check replication info on each Redis node
for node in redis1 redis2 redis3; do
    echo "=== $node ==="
    redis-cli -h $node INFO replication | grep -E "role|master_host|master_port"
done

Step 3: Resolve Conflicts

If multiple nodes think they're master:

bash

# On nodes that should be replicas
redis-cli -h <replica-ip> REPLICAOF <master-ip> <master-port>

Step 4: Reset Sentinel State

If Sentinel is confused:

```bash
# Reset specific master (run on every Sentinel that holds stale state)
redis-cli -p 26379 SENTINEL RESET mymaster

# This clears:
# - Master state
# - Known replicas
# - Known sentinels
# Sentinel then re-discovers replicas and other Sentinels through the monitored master
```

Step 5: Force Failover (If Needed)

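To force a failover without waiting for the master to fail:

bash

redis-cli -p 26379 SENTINEL FAILOVER mymaster

This starts a failover as if the master were unreachable, without asking the other Sentinels for agreement. Sentinel picks the replica to promote itself; the command does not let you name one.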

Sentinel Configuration Best Practices

Minimum Configuration

conf

# sentinel.conf
port 26379
sentinel monitor mymaster 10.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000

Critical Settings Explained

1. down-after-milliseconds: How long a master must be unreachable before it is marked as down
   - Too low: False failovers during brief network hiccups
   - Too high: Extended outage before failover
2. parallel-syncs: Number of replicas to sync simultaneously after failover
   - Too high: Overwhelms new master
   - Too low: Slow recovery of redundancy
3. failover-timeout: How long before retrying failed failover
   - Should be longer than expected failover duration

Network Configuration

Sentinels must be able to reach each other and all Redis nodes:

conf

# Required when hosts have multiple IPs
sentinel announce-ip <public-ip>
sentinel announce-port 26379

Verification After Recovery

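```bash
# Check that quorum and failover authorization can be reached
redis-cli -p 26379 SENTINEL ckquorum mymaster

# Expected with 3 Sentinels: OK 3 usable Sentinels. Quorum and failover authorization can be reached

# Test failover readiness (field names and values are printed on separate lines)
redis-cli -p 26379 SENTINEL master mymaster | grep -A1 -E '^(flags|num-slaves|num-other-sentinels)$'

# Should show these values:
# flags: master
# num-slaves: 2
# num-other-sentinels: 2
```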
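The readiness check above can be turned into a pass/fail test for scripts or alerts. The sketch assumes the SENTINEL master fields have already been loaded into a dictionary of strings; readiness_problems is a hypothetical name, and the expected counts default to the 3-node setup used throughout this guide.

```python
def readiness_problems(fields, expected_replicas=2, expected_other_sentinels=2):
    """Return a list of human-readable problems; empty means ready."""
    problems = []
    flags = fields.get("flags", "")
    if flags.split(",") != ["master"]:
        # Any extra flag (s_down, o_down, disconnected, ...) is worth a look.
        problems.append(f"unexpected flags: {flags!r}")
    replicas = int(fields.get("num-slaves", 0))
    if replicas < expected_replicas:
        problems.append(f"only {replicas} replicas, expected {expected_replicas}")
    others = int(fields.get("num-other-sentinels", 0))
    if others < expected_other_sentinels:
        problems.append(f"only {others} other sentinels, expected {expected_other_sentinels}")
    return problems


healthy = {"flags": "master", "num-slaves": "2", "num-other-sentinels": "2"}
degraded = {"flags": "master,s_down", "num-slaves": "1", "num-other-sentinels": "2"}
print(readiness_problems(healthy))
print(readiness_problems(degraded))
```

An empty list means this Sentinel sees a healthy master with the expected number of replicas and peers. Run the check against every Sentinel, since each one keeps its own view.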

Monitoring Sentinel Health

Key metrics to track:

```bash
# Sentinel sees correct number of slaves
redis-cli -p 26379 SENTINEL master mymaster | grep -A1 '^num-slaves$'

# Sentinel sees other sentinels
redis-cli -p 26379 SENTINEL master mymaster | grep -A1 '^num-other-sentinels$'

# No pending failover (no output means neither down flag is set)
redis-cli -p 26379 SENTINEL master mymaster | grep -E "s_down|o_down"
```
