The Problem

You notice your Redis replica is lagging behind the master. Applications reading from the replica get stale data, or monitoring alerts show increasing replication delay. The INFO replication command shows lag growing over time instead of staying near zero.

Typical indicators:

```bash
# On master
redis-cli INFO replication
# slave0:ip=10.0.1.2,port=6379,state=online,offset=9543210,lag=45

# On replica
redis-cli INFO replication
# master_link_status:up
# master_last_io_seconds_ago:3
# master_sync_in_progress:0
# slave_repl_offset:9540000
# slave_priority:100
```

That lag=45 means the replica last acknowledged the master 45 seconds ago, so it is roughly 45 seconds behind the master's write stream.

Why This Happens

Replication lag occurs when the replica cannot process write commands as fast as the master generates them. Common causes:

  1. Network bandwidth constraints - The replication stream exceeds available bandwidth
  2. Replica under-provisioned - CPU or memory insufficient to apply writes quickly
  3. Large write bursts - Sudden spikes in write volume overwhelm replication
  4. High latency network - Geographic distance between master and replica
  5. Disk I/O bottleneck - Replica persisting data slower than receiving it
  6. Output buffer limits - Master throttling replication stream to protect memory

Diagnosis Steps

Check Current Replication Status

```bash
# On master - see all replicas and their lag
redis-cli INFO replication

# Look for:
# connected_slaves:2
# slave0:ip=10.0.1.2,port=6379,state=online,offset=10000000,lag=2
# slave1:ip=10.0.1.3,port=6379,state=online,offset=9999000,lag=15
```

The offset is the replica's position, in bytes, in the replication stream. The lag is the number of seconds since the replica last sent an acknowledgment.
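Because offsets are byte positions in the same stream, the replica's byte-level lag falls out of a simple subtraction. A minimal sketch (the offset values below are placeholders; in practice both come from `INFO replication`):

```shell
# Byte lag = master_repl_offset - slave_repl_offset.
# In practice, read both values from `redis-cli INFO replication`;
# the numbers below are illustrative placeholders.
byte_lag() {
  local master_offset=$1 replica_offset=$2
  echo $(( master_offset - replica_offset ))
}

byte_lag 10000000 9999000   # prints 1000: replica is 1000 bytes behind
```

Byte lag complements the seconds-based `lag` field: a large byte gap with a small time lag suggests a write burst rather than a stuck replica.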

Check Network Bandwidth

```bash
# Measure actual bandwidth between master and replica
# On master
iftop -i eth0

# Or use iperf3
# On replica
iperf3 -s
# On master
iperf3 -c <replica-ip> -t 60

# Check current replication throughput
redis-cli INFO stats | grep total_net_output_bytes
```
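Note that total_net_output_bytes is a cumulative counter, so throughput has to be estimated from two samples taken an interval apart. A sketch of the arithmetic (the byte counts below are placeholders; in practice each would be read from `INFO stats` before and after a sleep):

```shell
# Estimate output throughput in KB/s from two cumulative byte counters.
# In practice each sample comes from:
#   redis-cli INFO stats | grep -oP 'total_net_output_bytes:\K[0-9]+'
# taken before and after `sleep <interval>`; values below are placeholders.
throughput_kbps() {
  local bytes_before=$1 bytes_after=$2 interval_s=$3
  echo $(( (bytes_after - bytes_before) / interval_s / 1024 ))
}

throughput_kbps 9000000000 9031457280 10   # prints 3072 (KB/s over a 10 s window)
```

If the estimated throughput sits near the iperf3-measured link capacity, bandwidth is the bottleneck.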

Check Replication Buffer Limits

```bash
# Check output buffer limits for replica clients
redis-cli CONFIG GET client-output-buffer-limit

# Default is often:
# client-output-buffer-limit replica 256mb 64mb 60
# This means: hard limit 256MB, soft limit 64MB for 60 seconds
```

Check Replica Resource Usage

```bash
# On replica - check if it's struggling
redis-cli INFO memory | grep used_memory
redis-cli INFO stats | grep instantaneous_ops_per_sec

# Check system resources
top -p $(pgrep redis-server)
```

Identify the Replication Stream Size

```bash
# Check replication backlog size
redis-cli INFO replication | grep repl_backlog

# Current write rate on master
redis-cli INFO stats | grep instantaneous_input_kbps
```

Solutions

Solution 1: Increase Replication Output Buffer

When the master disconnects replicas due to buffer overflow, increase the limits:

```bash
# Check current buffer usage for replica connection
redis-cli CLIENT LIST | grep replica

# Increase buffer limits
redis-cli CONFIG SET client-output-buffer-limit "replica 512mb 128mb 120"

# For permanent change, add to redis.conf:
# client-output-buffer-limit replica 512mb 128mb 120
```

The format is: hard-limit soft-limit soft-limit-duration. The replica connection is closed immediately if its buffer exceeds the hard limit, or if it stays above the soft limit for the specified duration.
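Current buffer usage per replica connection is visible in the omem field of CLIENT LIST (replica connections carry flags=S). A sketch that extracts it; the sample line below only mimics CLIENT LIST output, so in production you would pipe `redis-cli CLIENT LIST | grep 'flags=S'` in instead:

```shell
# Extract the output-buffer size (omem, in bytes) from a CLIENT LIST line.
# flags=S marks a replica connection. The sample line is illustrative;
# in production: redis-cli CLIENT LIST | grep 'flags=S'
sample='id=7 addr=10.0.1.2:6379 name= age=3600 flags=S omem=134217728 cmd=psync'

echo "$sample" | grep -oP 'omem=\K[0-9]+'   # prints 134217728 (128 MB buffered)
```

If omem is hovering near the hard limit, the master is about to drop the replica and force a resync.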

Solution 2: Increase Replication Backlog

The backlog allows replicas to reconnect without full sync:

```bash
# Increase backlog size (default 1MB, often too small)
redis-cli CONFIG SET repl-backlog-size 100mb

# In redis.conf:
# repl-backlog-size 100mb
```

Larger backlog helps replicas resume after temporary disconnections without expensive full resync.
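A rule of thumb for sizing the backlog: multiply the master's write rate by the longest disconnection you want replicas to survive without a full resync. A sketch of the arithmetic (the write rate would come from instantaneous_input_kbps; the 500 KB/s figure below is a placeholder):

```shell
# backlog bytes >= write rate (KB/s) * 1024 * tolerated disconnect (s)
# In practice: rate_kbps=$(redis-cli INFO stats | grep -oP 'instantaneous_input_kbps:\K[0-9]+')
backlog_bytes() {
  local rate_kbps=$1 disconnect_s=$2
  echo $(( rate_kbps * 1024 * disconnect_s ))
}

backlog_bytes 500 120   # prints 61440000 (~59 MB for 500 KB/s over 120 s)
```

Round up generously; backlog memory is allocated once and shared by all replicas.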

Solution 3: Optimize Network Configuration

For high-latency or bandwidth-constrained networks:

```bash
# Increase TCP buffer sizes on both master and replica
# On Linux
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# Make permanent in /etc/sysctl.conf:
# net.core.rmem_max=16777216
# net.core.wmem_max=16777216
```

Solution 4: Use Diskless Replication

When disk I/O is the bottleneck during full synchronization, use diskless replication: the master streams the RDB snapshot directly to replicas over the socket instead of writing it to disk first:

```bash
# Enable diskless replication
redis-cli CONFIG SET repl-diskless-sync yes
redis-cli CONFIG SET repl-diskless-sync-delay 5

# In redis.conf:
# repl-diskless-sync yes
# repl-diskless-sync-delay 5
```

The delay allows multiple replicas to connect before starting the transfer.

Solution 5: Reduce Write Load on Master

Temporarily reduce write volume to let replicas catch up:

```bash
# Identify high-write keys (Redis 4.0.3+; requires an LFU maxmemory-policy)
redis-cli --hotkeys

# Or monitor write commands (MONITOR is expensive; run it only briefly)
redis-cli MONITOR | grep -E "SET|HSET|LPUSH|SADD|ZADD" | head -100

# If using write-heavy operations, consider:
# - Batching writes
# - Moving some writes to a different Redis instance
# - Using pipelining to reduce command overhead
```
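Batching can be sketched with redis-cli's pipe mode, which sends many commands to the server in a single round trip. The command generation below is self-contained; the final --pipe call assumes a reachable Redis instance, so it is left commented out:

```shell
# Generate a batch of SET commands and send them in one round trip.
# Key names here are illustrative placeholders.
batch_file=$(mktemp)
for i in $(seq 1 1000); do
  echo "SET batch:key:$i value$i"
done > "$batch_file"

wc -l < "$batch_file"   # prints 1000: commands queued in the batch

# Assumes a reachable instance:
# redis-cli --pipe < "$batch_file"
```

Fewer, larger round trips reduce per-command overhead on the master, which in turn shrinks the replication stream's command rate.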

Solution 6: Scale the Replica

If the replica cannot keep up due to resource constraints:

```bash
# Check replica's write application rate
redis-cli INFO stats | grep instantaneous_ops_per_sec

# If significantly lower than master's, upgrade the replica:
# - More CPU cores
# - More memory (for faster write application)
# - Faster disk (if using persistence)

# Or add more replicas for read scaling
# and distribute read load across them
```

Solution 7: Implement Read After Write Consistency

For critical data, read from master to avoid stale reads:

```javascript
// Node.js - write to the master, then read critical data back from it
async function getCriticalData(key, value) {
  // Write goes to master
  await masterRedis.set(key, value);

  // For critical reads, use master to guarantee read-after-write
  return await masterRedis.get(key);
}

// For less critical reads, use the replica
async function getCachedData(key) {
  return await replicaRedis.get(key);
}
```

Solution 8: Use Redis Sentinel for Automatic Failover

If the master itself fails, Sentinel automatically promotes the most up-to-date replica:

```ini
# sentinel.conf
sentinel monitor mymaster 10.0.1.1 6379 2
sentinel down-after-milliseconds mymaster 30000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1

# During failover, Sentinel prefers the replica with the lowest
# replica-priority and the most advanced replication offset;
# there is no per-replica lag-threshold directive.
```

Monitoring Replication Lag

```bash
#!/bin/bash
# replication_monitor.sh

LAG_THRESHOLD=10  # seconds

while true; do
    LAG=$(redis-cli INFO replication | grep -oP 'lag=\K[0-9]+' | head -1)

    # Default to 0 if no replica line was found
    if [ "${LAG:-0}" -gt "$LAG_THRESHOLD" ]; then
        echo "WARNING: Replication lag is ${LAG} seconds"
        echo "Master offset: $(redis-cli INFO replication | grep master_repl_offset | cut -d: -f2)"

        # Check buffer usage
        redis-cli CLIENT LIST | grep replica | awk '{print $2, $4}'
    fi

    sleep 5
done
```

Production Configuration

```ini
# /etc/redis/redis.conf on master

# Replication settings
repl-backlog-size 256mb
repl-backlog-ttl 3600
repl-diskless-sync yes
repl-diskless-sync-delay 5

# Output buffer limits
client-output-buffer-limit replica 512mb 128mb 120

# TCP keepalive
repl-timeout 60
tcp-keepalive 300

# Disable writes if no replicas connected (optional)
min-replicas-to-write 1
min-replicas-max-lag 10
```
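With min-replicas-to-write set, the master rejects writes when too few replicas are within min-replicas-max-lag. The gating logic can be sketched by parsing per-replica lag values out of INFO replication; the variable below only mimics that output, so in production you would pipe `redis-cli INFO replication` in instead:

```shell
# Count replicas whose lag is within the configured maximum.
# The sample text mimics `redis-cli INFO replication` output on the master.
max_lag=10
info='slave0:ip=10.0.1.2,port=6379,state=online,offset=10000000,lag=2
slave1:ip=10.0.1.3,port=6379,state=online,offset=9999000,lag=15'

healthy=$(echo "$info" | grep -oP 'lag=\K[0-9]+' | awk -v m="$max_lag" '$1 <= m' | wc -l)
echo "$healthy"   # prints 1: only one replica within the 10 s lag limit
```

With min-replicas-to-write 1, this master would still accept writes; if both replicas exceeded the limit, writes would be rejected until one caught up.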

Prevention Checklist

  • [ ] Provision replicas with equal or better resources than master
  • [ ] Monitor replication lag continuously
  • [ ] Configure adequate output buffer limits
  • [ ] Use diskless replication for fast networks
  • [ ] Set appropriate backlog size for your write volume
  • [ ] Place replicas in same datacenter for low latency
  • [ ] Use connection pooling to reduce command overhead
  • [ ] Implement read-after-write for critical data
Related Guides

  • [Redis Connection Refused](./fix-redis-connection-refused)
  • [Redis Persistence Failed](./fix-redis-persistence-failed)
  • [Redis Max Clients Reached](./fix-redis-max-clients)