Introduction

When Redis runs `BGSAVE` or `BGREWRITEAOF`, it forks a child process to write the RDB snapshot or rewrite the AOF. After the fork, parent and child share memory pages through copy-on-write (COW): every page the primary modifies while the child is running must be copied first. If the dataset changes rapidly, that COW overhead can grow large enough to put the host under memory pressure, which slows the primary and reduces its ability to serve replication traffic. The result is a spike in replication lag.
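To see how expensive the last fork actually was, the fork time and COW size can be read from `INFO`. The following is a minimal sketch, not a definitive tool: `info_field` and `fork_report` are hypothetical helper names, and it assumes `latest_fork_usec` (from `INFO stats`) and `rdb_last_cow_size` (from `INFO persistence`) are present in the output piped in.

```bash
# info_field NAME: read `redis-cli INFO` output on stdin and print the value of NAME.
# redis-cli output lines end in CRLF, so carriage returns are stripped first.
info_field() {
  tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2 }'
}

# fork_report: summarize the cost of the last fork from INFO output on stdin.
# Missing fields are treated as 0 rather than causing an arithmetic error.
fork_report() {
  local info fork_usec cow_bytes
  info=$(cat)
  fork_usec=$(printf '%s\n' "$info" | info_field latest_fork_usec)
  cow_bytes=$(printf '%s\n' "$info" | info_field rdb_last_cow_size)
  echo "last fork: $(( ${fork_usec:-0} / 1000 )) ms, last RDB COW: $(( ${cow_bytes:-0} / 1024 / 1024 )) MiB"
}

# Usage against a live primary (port 6379 assumed, as elsewhere in this guide):
#   { redis-cli -p 6379 INFO stats; redis-cli -p 6379 INFO persistence; } | fork_report
```

A long fork time or a COW size that is a large fraction of `used_memory` suggests the snapshot itself is the source of the pressure.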
Symptoms

- Replica reports `master_link_status:down` and a growing `master_link_down_since_seconds` after a BGSAVE
- `INFO replication` shows a widening gap between the primary's `master_repl_offset` and the replica's `slave_repl_offset`
- Primary's `instantaneous_ops_per_sec` drops while BGSAVE is running
- Memory usage on the primary spikes during snapshot creation
- Application read queries on replicas return stale data for 30-120 seconds
Common Causes

- High write rate during BGSAVE, causing excessive copy-on-write memory allocation
- Insufficient RAM for COW overhead (typically 20-30% extra memory is needed, and more under heavy writes)
- `repl-backlog-size` too small, forcing a full resync instead of a partial resync once the replica falls behind
- OOM killer terminating the BGSAVE child process
- Slow disk I/O extending the BGSAVE duration
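The 20-30% headroom figure can be turned into a quick check. This is a sketch under stated assumptions: `cow_required_bytes` and `cow_headroom_ok` are hypothetical helper names, the default of 30% is only the upper end of the rule of thumb above, and a write-heavy workload may need a larger percentage.

```bash
# cow_required_bytes USED [PCT]: memory needed if COW adds PCT percent on top of USED bytes.
# PCT defaults to 30, an assumed rule-of-thumb value rather than a measured one.
cow_required_bytes() {
  local used=$1 pct=${2:-30}
  echo $(( used + used * pct / 100 ))
}

# cow_headroom_ok USED TOTAL [PCT]: succeed when TOTAL bytes of RAM cover
# the dataset plus the assumed COW headroom.
cow_headroom_ok() {
  local used=$1 total=$2 pct=${3:-30}
  [ "$(cow_required_bytes "$used" "$pct")" -le "$total" ]
}

# Usage against a live primary, reading both values from INFO memory:
#   used=$(redis-cli INFO memory | grep '^used_memory:' | cut -d: -f2 | tr -d '\r')
#   total=$(redis-cli INFO memory | grep '^total_system_memory:' | cut -d: -f2 | tr -d '\r')
#   cow_headroom_ok "$used" "$total" || echo "not enough RAM headroom for BGSAVE"
```

Note that `total_system_memory` is the whole host's RAM, so the check is optimistic if other processes share the machine.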
Step-by-Step Fix

1. **Check replication status on both primary and replica**:
   ```bash
   # On primary
   redis-cli -p 6379 INFO replication

   # On replica
   redis-cli -p 6380 INFO replication
   # master_sync_in_progress is reported in the replication section, not in stats
   redis-cli -p 6380 INFO replication | grep master_sync
   ```
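2. **Check memory usage and COW overhead during BGSAVE**:
   ```bash
   redis-cli INFO memory | grep -E "used_memory|rss"
   # COW counters are reported in the persistence section, not in memory
   redis-cli INFO persistence | grep cow
   # Look for: rdb_last_cow_size / aof_last_cow_size (memory copied during the last fork)
   # used_memory_rss - used_memory mostly reflects allocator overhead and fragmentation
   ```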
3. **Increase the replication backlog to allow partial resyncs**:
   ```bash
   redis-cli CONFIG SET repl-backlog-size 256mb
   redis-cli CONFIG SET repl-backlog-ttl 3600
   # Make persistent: append to the config file, or run `redis-cli CONFIG REWRITE`
   # if Redis was started with a config file
   echo "repl-backlog-size 268435456" >> /etc/redis/redis.conf
   ```
4. **Schedule BGSAVE during lower write periods**:
   ```bash
   # Disable automatic saves during peak hours
   redis-cli CONFIG SET save ""

   # Manually trigger during off-peak
   redis-cli BGSAVE

   # Re-enable after maintenance window
   redis-cli CONFIG SET save "900 1 300 10 60 10000"
   ```
5. **Reduce memory fragmentation to minimize COW overhead**:
   ```bash
   # Check fragmentation ratio
   redis-cli INFO memory | grep mem_fragmentation_ratio
   # If > 1.5, consider enabling active defragmentation (requires a jemalloc build)
   redis-cli CONFIG SET activedefrag yes
   ```
6. **Monitor replication lag in real-time**:
   ```bash
   # Script to check replication lag every 5 seconds
   while true; do
     master_offset=$(redis-cli -p 6379 INFO replication | grep '^master_repl_offset:' | cut -d: -f2 | tr -d '\r')
     # The replica reports its own position as slave_repl_offset
     replica_offset=$(redis-cli -p 6380 INFO replication | grep '^slave_repl_offset:' | cut -d: -f2 | tr -d '\r')
     lag=$((master_offset - replica_offset))
     echo "$(date): Replication lag: $lag bytes"
     sleep 5
   done
   ```
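The primary's replication offset can also be used to choose a backlog size instead of guessing. The sketch below assumes two `master_repl_offset` samples taken a known number of seconds apart; `write_rate` and `backlog_bytes` are hypothetical helper names, and the safety factor of 2 is an assumption, not a Redis recommendation.

```bash
# write_rate OFFSET_START OFFSET_END SECONDS: growth of the replication stream in bytes/sec,
# computed from two master_repl_offset samples taken SECONDS apart.
write_rate() {
  echo $(( ($2 - $1) / $3 ))
}

# backlog_bytes RATE SECONDS [FACTOR]: backlog size needed to cover SECONDS of replica lag
# at RATE bytes/sec, multiplied by an assumed safety FACTOR (default 2).
backlog_bytes() {
  local rate=$1 seconds=$2 factor=${3:-2}
  echo $(( rate * seconds * factor ))
}

# Usage: sample the offset twice during peak writes, then size for the worst observed lag,
# for example the 120 seconds at the upper end of the stale-read window listed under Symptoms:
#   rate=$(write_rate "$offset_at_start" "$offset_after_60s" 60)
#   redis-cli CONFIG SET repl-backlog-size "$(backlog_bytes "$rate" 120)"
```

If the computed value is larger than the 256mb set in the backlog step, the replica is likely to need a full resync after a long BGSAVE.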