The Problem
Redis Sentinel is supposed to fail over automatically when your master goes down, but sometimes something goes wrong: failover never triggers, Sentinel nodes disagree on the master, or the application connects to the wrong node after failover. These issues can cause extended outages or, worse, split-brain scenarios where multiple masters accept writes.
Understanding Sentinel Architecture
Sentinel relies on three key concepts:
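1. Quorum - Number of Sentinels that must agree the master is unreachable before it is marked objectively down (o_down)
2. Majority - Number of Sentinels (more than half of all known Sentinels) that must authorize the failover
3. Parallel-syncs - Number of replicas that can be reconfigured to sync from the new master at the same time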
Misunderstanding these leads to most Sentinel issues.
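The difference between quorum and majority is easiest to see as arithmetic. The following is a minimal sketch, not Sentinel's own code: the function names are made up for illustration, and it assumes the documented rule that a failover needs votes from the majority of all known Sentinels, or from as many Sentinels as the quorum if the quorum is set higher than the majority.

```python
def majority(total_sentinels: int) -> int:
    """Smallest count that is more than half of all known Sentinels."""
    return total_sentinels // 2 + 1


def can_fail_over(total_sentinels: int, quorum: int, reachable: int) -> bool:
    """True if `reachable` Sentinels can flag the master down AND authorize a failover."""
    # The stricter of the two thresholds decides.
    votes_needed = max(quorum, majority(total_sentinels))
    return reachable >= votes_needed


# 3 Sentinels with quorum 2: losing one Sentinel is tolerated, losing two is not.
print(can_fail_over(3, 2, reachable=2))
print(can_fail_over(3, 2, reachable=1))
```

Note that a low quorum does not lower the bar for the failover itself: with 5 Sentinels and quorum 1, a single Sentinel can flag the master as down, but 3 must still be reachable to authorize the failover.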
Diagnosis Commands
Check Sentinel Status
redis-cli -p 26379 SENTINEL master mymaster
Key fields to examine:
name: mymaster
ip: 10.0.0.1
port: 6379
flags: master
num-other-sentinels: 2
quorum: 2
Check Sentinel Discovery
redis-cli -p 26379 SENTINEL sentinels mymaster
Should show all other Sentinel instances (the Sentinel you query does not list itself).
Check Current Master
redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster
Check Replication Info
redis-cli -h <master-ip> INFO replication
Common Failover Failures
Failure 1: Failover Never Triggers
Symptoms: Master is down but no failover occurs.
Diagnosis:
```bash
# Check if Sentinel sees the master as down
# (when piped, redis-cli prints each field name and its value on separate lines, hence -A1)
redis-cli -p 26379 SENTINEL master mymaster | grep -A1 flags

# If the flags include "s_down" or "o_down", Sentinel has detected the issue
# Check if quorum is reached
redis-cli -p 26379 SENTINEL ckquorum mymaster
```
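Because redis-cli prints the reply as alternating field-name and value lines when its output is piped, it can be convenient to turn that output into a dictionary before inspecting the flags. This is a small sketch under that assumption; `parse_flat_reply` and `down_state` are hypothetical helper names, not part of any Redis tooling.

```python
from typing import Dict


def parse_flat_reply(raw: str) -> Dict[str, str]:
    """Pair up alternating field-name / value lines into a dict."""
    lines = raw.strip().splitlines()
    return dict(zip(lines[0::2], lines[1::2]))


def down_state(fields: Dict[str, str]) -> str:
    """Summarize the master's state from the comma-separated flags field."""
    flags = set(fields.get("flags", "").split(","))
    if "o_down" in flags:
        return "objectively down"
    if "s_down" in flags:
        return "subjectively down"
    return "up"


# Abbreviated example of piped `SENTINEL master mymaster` output.
sample = "name\nmymaster\nflags\nmaster,s_down\nquorum\n2\n"
print(down_state(parse_flat_reply(sample)))
```

A master that stays "subjectively down" without ever becoming "objectively down" usually means this Sentinel cannot gather enough agreement from the others.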
Common Causes:
1. Insufficient Sentinels:
Not enough Sentinels are usable to reach both the quorum and the majority, e.g. (quorum 2) (majority 1). Sentinel can't fail over in this state.
Fix: Ensure all Sentinels can see each other:
# Sentinels discover each other automatically through the monitored master,
# so other Sentinels are not listed here; every Sentinel needs the same monitor line:
sentinel monitor mymaster <ip> <port> <quorum>
# If a Sentinel is behind NAT or has several IPs, also announce a reachable address:
sentinel announce-ip <this-sentinel-ip>
sentinel announce-port 26379
2. Network Partition:
Sentinels can't reach each other:
# From each Sentinel, test connectivity to others
redis-cli -h <other-sentinel-ip> -p 26379 PING
3. Quorum Too High:
If quorum is set higher than the number of available Sentinels:
```bash
# Check current config (the value is printed on the line after the field name)
redis-cli -p 26379 SENTINEL master mymaster | grep -A1 quorum

# Reduce quorum if needed
redis-cli -p 26379 SENTINEL SET mymaster quorum 2
```
Failure 2: Split-Brain After Failover
Symptoms: Multiple masters exist, data diverges.
Diagnosis:
# Check what each Sentinel thinks is the master
for sentinel in sentinel1 sentinel2 sentinel3; do
echo "Sentinel $sentinel sees:"
redis-cli -h $sentinel -p 26379 SENTINEL get-master-addr-by-name mymaster
done
Root Cause: Network partition isolated some Sentinels from others.
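Once the answers from each Sentinel have been collected, spotting a disagreement is a matter of grouping Sentinels by the address they report. The sketch below assumes the answers are already gathered into a dictionary (how they are collected is left out), and the function names are illustrative only.

```python
from typing import Dict, List, Optional, Tuple

Address = Tuple[str, int]


def group_by_reported_master(
    views: Dict[str, Optional[Address]]
) -> Dict[Address, List[str]]:
    """Group Sentinel names by the master address each one reports.

    A value of None means that Sentinel could not be reached and is skipped.
    """
    groups: Dict[Address, List[str]] = {}
    for sentinel, addr in views.items():
        if addr is not None:
            groups.setdefault(addr, []).append(sentinel)
    return groups


def sentinels_disagree(views: Dict[str, Optional[Address]]) -> bool:
    """True if reachable Sentinels report more than one master address."""
    return len(group_by_reported_master(views)) > 1


views = {
    "sentinel1": ("10.0.0.1", 6379),
    "sentinel2": ("10.0.0.2", 6379),
    "sentinel3": ("10.0.0.2", 6379),
}
print(sentinels_disagree(views), group_by_reported_master(views))
```

A brief disagreement right after a failover can be normal while the new configuration propagates; a disagreement that persists points to a partition.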
Prevention:
Set an appropriate quorum and run an odd number of Sentinels (at least three):
# For 3 Sentinels, quorum should be 2
# For 5 Sentinels, quorum should be 3
sentinel monitor mymaster 10.0.0.1 6379 2
Recovery:
1. Identify the correct master (one with most recent data):
redis-cli -h <candidate-master> INFO replication | grep master_repl_offset
2. Demote the incorrect master by making it a replica of the correct one (writes it accepted on its own are discarded when it resyncs; use SLAVEOF on Redis versions before 5):
redis-cli -h <wrong-master> REPLICAOF <correct-master-ip> <correct-master-port>
3. Reset Sentinel state (run on every Sentinel):
redis-cli -p 26379 SENTINEL RESET mymaster
Failure 3: Failover Loop
Symptoms: Failover happens repeatedly.
Diagnosis:
redis-cli -p 26379 SENTINEL master mymaster | grep -A1 -E "num-slaves|flags"
Root Cause: Often caused by:
- Old master recovering and claiming master role
- Network flapping between nodes
- Inconsistent Sentinel configurations
Fix:
1. Check down-after-milliseconds:
redis-cli -p 26379 SENTINEL master mymaster | grep -A1 down-after-milliseconds
Set consistent values in sentinel.conf:
sentinel down-after-milliseconds mymaster 30000
sentinel failover-timeout mymaster 180000
2. Ensure all Sentinels have identical monitor configuration:
# Should be same across all Sentinels
sentinel monitor mymaster 10.0.0.1 6379 2
Failure 4: Application Not Notified
Symptoms: Failover completes but application still connects to old master.
Diagnosis:
Check if application uses Sentinel for discovery:
# Correct approach - query Sentinel for current master
redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster
Fix:
The application must use a Sentinel-aware client or poll Sentinel for the current master:
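```python
# Python example with redis-py
from redis.sentinel import Sentinel

# List every Sentinel so discovery still works if one of them is down
sentinel = Sentinel(
    [('sentinel1', 26379), ('sentinel2', 26379), ('sentinel3', 26379)],
    socket_timeout=0.1,
)

# Client that asks Sentinel for the current master when it connects
master = sentinel.master_for('mymaster', socket_timeout=0.1)
```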
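For clients without built-in Sentinel support, the polling alternative can be sketched as a small tracker that remembers the last known master and fires a callback when the address changes. This is an illustrative sketch only: `MasterTracker` is a hypothetical class, and the resolver is passed in as a plain callable so the logic does not depend on any particular client library.

```python
from typing import Callable, Optional, Tuple

Address = Tuple[str, int]


class MasterTracker:
    """Remembers the last master address and reports when it changes."""

    def __init__(
        self,
        resolve: Callable[[], Address],
        on_change: Callable[[Optional[Address], Address], None],
    ) -> None:
        self._resolve = resolve
        self._on_change = on_change
        self.current: Optional[Address] = None

    def poll_once(self) -> Address:
        """Ask the resolver for the master; call on_change if it moved.

        The callback also fires on the first poll, with None as the old address.
        """
        addr = tuple(self._resolve())
        if addr != self.current:
            previous, self.current = self.current, addr
            self._on_change(previous, addr)
        return addr
```

With redis-py, the resolver could be something like `lambda: sentinel.discover_master('mymaster')`, called from a timer or background thread; the callback is where the application would rebuild its connection to the new address.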
Step-by-Step Failover Recovery
Step 1: Assess Current State
# Check all Sentinels
redis-cli -p 26379 SENTINEL masters
redis-cli -p 26379 SENTINEL slaves mymaster
redis-cli -p 26379 SENTINEL sentinels mymaster
Step 2: Identify the Actual Master
# Check replication info on each Redis node
for node in redis1 redis2 redis3; do
echo "=== $node ==="
redis-cli -h $node INFO replication | grep -E "role|master_host|master_port"
done
Step 3: Resolve Conflicts
If multiple nodes think they're master:
# On nodes that should be replicas
redis-cli -h <replica-ip> REPLICAOF <master-ip> <master-port>
Step 4: Reset Sentinel State
If Sentinel is confused:
```bash
# Reset specific master (run this on every Sentinel)
redis-cli -p 26379 SENTINEL RESET mymaster

# This clears:
# - Master state (including any failover in progress)
# - Known replicas
# - Known sentinels
# Sentinel then rediscovers replicas and other Sentinels through the master
```
Step 5: Force Failover (If Needed)
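To force a failover (Sentinel chooses which replica to promote; the command does not take a target replica):
redis-cli -p 26379 SENTINEL FAILOVER mymaster
This starts a failover as if the master were unreachable, without asking the other Sentinels for agreement.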
Sentinel Configuration Best Practices
Minimum Configuration
# sentinel.conf
port 26379
sentinel monitor mymaster 10.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000
Critical Settings Explained
1. down-after-milliseconds: How long the master must be unreachable before a Sentinel marks it as down
   - Too low: False failovers during brief network hiccups
   - Too high: Extended outage before failover
2. parallel-syncs: Number of replicas to sync simultaneously after failover
   - Too high: Overwhelms new master
   - Too low: Slow recovery of redundancy
3. failover-timeout: How long before retrying a failed failover
   - Should be longer than expected failover duration
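Since inconsistent Sentinel configurations are one of the failover-loop causes listed earlier, it can help to compare these settings across Sentinels mechanically. The sketch below assumes you have already copied each Sentinel's sentinel.conf contents into a string; the function names and the choice of which options to compare are illustrative, not an official tool.

```python
from typing import Dict

# Options that should match on every Sentinel. Lines that Sentinel rewrites
# per instance (known-sentinel, known-replica, epochs) are ignored on purpose.
TUNABLES = {"monitor", "down-after-milliseconds", "parallel-syncs", "failover-timeout"}


def parse_sentinel_conf(text: str, master: str = "mymaster") -> Dict[str, str]:
    """Collect `sentinel <option> <master> <value...>` lines for one master."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split("#", 1)[0].split()  # drop trailing comments
        if len(parts) >= 4 and parts[0] == "sentinel" and parts[2] == master:
            if parts[1] in TUNABLES:
                settings[parts[1]] = " ".join(parts[3:])
    return settings


def conf_differences(
    confs: Dict[str, str], master: str = "mymaster"
) -> Dict[str, Dict[str, str]]:
    """Return {option: {host: value}} for options that differ between hosts."""
    parsed = {host: parse_sentinel_conf(text, master) for host, text in confs.items()}
    diffs: Dict[str, Dict[str, str]] = {}
    for option in sorted(TUNABLES):
        values = {host: p.get(option, "<unset>") for host, p in parsed.items()}
        if len(set(values.values())) > 1:
            diffs[option] = values
    return diffs
```

An empty result means the compared options agree on every host; any entry shows which Sentinel holds which value.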
Network Configuration
Sentinels must be able to reach each other and all Redis nodes:
# Required when hosts have multiple IPs
sentinel announce-ip <public-ip>
sentinel announce-port 26379
Verification After Recovery
```bash
# Check that quorum and failover authorization can be reached
redis-cli -p 26379 SENTINEL ckquorum mymaster

# Expected with 3 Sentinels:
# OK 3 usable Sentinels. Quorum and failover authorization can be reached

# Test failover readiness (-A1 prints the value line that follows each field name)
redis-cli -p 26379 SENTINEL master mymaster | grep -A1 -E "flags|num-slaves|num-other-sentinels"

# Should show:
# flags: master
# num-slaves: 2
# num-other-sentinels: 2
```
Monitoring Sentinel Health
Key metrics to track:
```bash
# Sentinel sees correct number of slaves
redis-cli -p 26379 SENTINEL master mymaster | grep -A1 num-slaves

# Sentinel sees other sentinels
redis-cli -p 26379 SENTINEL master mymaster | grep -A1 num-other-sentinels

# Master not flagged down (no output means neither s_down nor o_down is set)
redis-cli -p 26379 SENTINEL master mymaster | grep -E "s_down|o_down"
```
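The three checks above can be folded into a single function for a monitoring script. This is a minimal sketch, assuming the master's fields are already available as a dictionary of strings (field name to value, as printed by redis-cli); `health_problems` and its parameters are hypothetical names.

```python
from typing import Dict, List


def health_problems(
    fields: Dict[str, str],
    expected_replicas: int,
    expected_other_sentinels: int,
) -> List[str]:
    """Return human-readable problems; an empty list means healthy."""
    problems: List[str] = []

    # Down flags appear inside the comma-separated "flags" value.
    flags = fields.get("flags", "").split(",")
    for flag in ("s_down", "o_down"):
        if flag in flags:
            problems.append(f"master flagged {flag}")

    replicas = int(fields.get("num-slaves", "0"))
    if replicas < expected_replicas:
        problems.append(f"only {replicas} of {expected_replicas} replicas visible")

    others = int(fields.get("num-other-sentinels", "0"))
    if others < expected_other_sentinels:
        problems.append(
            f"only {others} of {expected_other_sentinels} other Sentinels visible"
        )
    return problems


healthy = {"flags": "master", "num-slaves": "2", "num-other-sentinels": "2"}
print(health_problems(healthy, expected_replicas=2, expected_other_sentinels=2))
```

Alerting on a non-empty result from any Sentinel catches a shrinking replica or Sentinel count before a failover is actually needed.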