

Redis Replication Lag After Fork Causing Replica Staleness

Diagnose and fix Redis replication lag spikes that occur after fork-based RDB snapshot creation on the primary.



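Introduction

When Redis runs `BGSAVE` or `BGREWRITEAOF`, it forks a child process that writes the RDB snapshot or the rewritten AOF. The fork call itself blocks the primary briefly while page tables are copied. After that, parent and child share memory pages through copy-on-write (COW), so every page the primary modifies while the child is alive has to be copied first. If the dataset changes rapidly, the extra memory and page-copying work can put the host under memory pressure and slow the primary, reducing its ability to serve replication traffic. The result is a spike in replication lag.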

Symptoms

- Replica's `master_link_down_since_seconds` appears and increases after each BGSAVE (the field is only reported while the link to the primary is down)
- The gap between the primary's `master_repl_offset` and the replica's `slave_repl_offset` in `INFO replication` keeps growing
- Primary's `instantaneous_ops_per_sec` drops during BGSAVE execution
- Memory usage on the primary host spikes during snapshot creation
- Application read queries on replicas return stale data for 30-120 seconds
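The link-related symptoms can be checked mechanically. The following is a minimal sketch, not a definitive health check: `info_field` and `replica_link_state` are hypothetical helper names, and the 10-second idle threshold is an assumption to tune for your workload. It only reads fields that `INFO replication` reports on a replica (`master_link_status`, `master_last_io_seconds_ago`).

```bash
#!/usr/bin/env bash
# Extract one field from INFO-style "key:value" text on stdin.
# $1 = field name. Strips the trailing \r that redis-cli output carries.
info_field() {
  tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2; exit }'
}

# Classify a replica's link health from its INFO replication text.
# $1 = INFO text
# $2 = max tolerated seconds since last I/O from the primary (assumed default: 10)
replica_link_state() {
  local info="$1" max_idle="${2:-10}"
  local status idle
  status=$(printf '%s\n' "$info" | info_field master_link_status)
  idle=$(printf '%s\n' "$info" | info_field master_last_io_seconds_ago)
  if [ "$status" != "up" ]; then
    echo "down"
  elif [ "${idle:--1}" -gt "$max_idle" ]; then
    echo "stalled"
  else
    echo "healthy"
  fi
}
```

A typical call would be `replica_link_state "$(redis-cli -p 6380 INFO replication)"`, run before, during, and after a BGSAVE on the primary.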

Common Causes

- High write rate during BGSAVE causing excessive copy-on-write memory allocation
- Insufficient RAM for COW overhead (typically needs 20-30% extra memory)
- `repl-backlog-size` too small, requiring a full resync instead of a partial resync after lag
- OOM killer terminating the BGSAVE child process
- Slow disk I/O extending the BGSAVE duration
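The RAM headroom cause above is simple arithmetic, so it can be checked before a snapshot starts. This is a rough sketch under stated assumptions: `cow_headroom_ok` is a hypothetical helper, the 30% default comes from the rule of thumb in this article, and real COW usage depends on how many pages are written while the child is alive.

```bash
#!/usr/bin/env bash
# cow_headroom_ok USED_BYTES AVAILABLE_BYTES [PERCENT]
# Prints the required and available bytes, and exits 0 only when the
# available memory covers PERCENT (assumed default: 30) of the dataset size.
cow_headroom_ok() {
  local used="$1" avail="$2" pct="${3:-30}"
  local needed=$(( used * pct / 100 ))
  echo "needed=${needed} available=${avail}"
  [ "$avail" -ge "$needed" ]
}

# Example wiring (not executed here), assuming a Linux host:
#   used=$(redis-cli INFO memory | tr -d '\r' | awk -F: '$1 == "used_memory" { print $2 }')
#   avail=$(( $(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo) * 1024 ))
#   cow_headroom_ok "$used" "$avail" 30 || echo "Not enough headroom for BGSAVE"
```

Passing the inputs as arguments keeps the check independent of where the numbers come from, so the same function works with values taken from a monitoring system.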

Step-by-Step Fix

1. Check replication status on both primary and replica:
```bash
# On primary (full vs. partial resync counters are kept in the stats section)
redis-cli -p 6379 INFO replication
redis-cli -p 6379 INFO stats | grep -E "sync_full|sync_partial"

# On replica
redis-cli -p 6380 INFO replication
```

2. Check memory usage and COW overhead during BGSAVE:
```bash
redis-cli INFO memory | grep -E "used_memory|rss"
# COW cost of the most recent snapshot or rewrite is in the persistence section
redis-cli INFO persistence | grep cow_size
```
3. Increase the replication backlog to allow partial resyncs:
```bash
redis-cli CONFIG SET repl-backlog-size 256mb
redis-cli CONFIG SET repl-backlog-ttl 3600
# Make persistent: write the running values back to redis.conf
# (requires that the server was started with a config file)
redis-cli CONFIG REWRITE
```
4. Schedule BGSAVE during lower write periods:
```bash
# Disable automatic saves during peak hours
redis-cli CONFIG SET save ""

# Manually trigger during off-peak
redis-cli BGSAVE

# Re-enable after maintenance window
redis-cli CONFIG SET save "900 1 300 10 60 10000"
```

5. Reduce memory fragmentation to minimize COW overhead:
```bash
# Check fragmentation ratio
redis-cli INFO memory | grep mem_fragmentation_ratio
# If > 1.5, consider enabling active defragmentation (needs a jemalloc build)
redis-cli CONFIG SET activedefrag yes
```
6. Monitor replication lag in real time:
```bash
# Print the byte lag between primary and replica every 5 seconds
while true; do
  primary_offset=$(redis-cli -p 6379 INFO replication | grep '^master_repl_offset:' | cut -d: -f2 | tr -d '\r')
  replica_offset=$(redis-cli -p 6380 INFO replication | grep '^slave_repl_offset:' | cut -d: -f2 | tr -d '\r')
  lag=$((primary_offset - replica_offset))
  echo "$(date): Replication lag: $lag bytes"
  sleep 5
done
```
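The byte lag printed by the monitoring loop is most useful when compared with the backlog size, because a replica that reconnects with an offset older than what the backlog still holds cannot do a partial resync. The sketch below illustrates that comparison. `resync_risk` is a hypothetical helper and the 80% warning threshold is an assumption. The result only matters if the link actually drops, since a connected replica keeps streaming regardless of backlog size.

```bash
#!/usr/bin/env bash
# resync_risk LAG_BYTES BACKLOG_BYTES [WARN_PERCENT]
# Compares replication lag with repl-backlog-size and prints one of:
#   ok                  lag is well inside the backlog
#   warning             lag has reached WARN_PERCENT of the backlog (assumed default: 80)
#   full-resync-likely  lag is at or beyond the backlog size
resync_risk() {
  local lag="$1" backlog="$2" warn_pct="${3:-80}"
  if [ "$lag" -ge "$backlog" ]; then
    echo "full-resync-likely"
  elif [ $(( lag * 100 )) -ge $(( backlog * warn_pct )) ]; then
    echo "warning"
  else
    echo "ok"
  fi
}
```

For example, `resync_risk "$lag" 268435456` inside the loop compares each sample against the 256mb backlog configured in step 3.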

Prevention

- Ensure at least 30% free RAM beyond the dataset size for COW overhead
- Set `repl-backlog-size` to at least the amount of data written during the longest BGSAVE at peak write rate
- Use `no-appendfsync-on-rewrite yes` to reduce I/O during AOF rewrite
- Monitor `mem_fragmentation_ratio` and keep it below 1.5
- Consider upgrading to Redis 7.0 or later, which reduces the COW impact of the RDB child process
- Use diskless replication (`repl-diskless-sync yes`) on fast networks to reduce disk I/O contention
- Deploy replicas in the same availability zone as the primary to minimize network latency
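The backlog sizing rule in the list above can be turned into a small calculation. This is a sketch under stated assumptions: `backlog_bytes_needed` is a hypothetical helper, the write rate is estimated from two samples of the primary's `master_repl_offset`, and the default safety factor of 2 is an arbitrary margin rather than a Redis recommendation.

```bash
#!/usr/bin/env bash
# backlog_bytes_needed OFFSET_START OFFSET_END INTERVAL_S BGSAVE_S [SAFETY_FACTOR]
# OFFSET_START / OFFSET_END: master_repl_offset sampled INTERVAL_S seconds apart
# BGSAVE_S: duration of a BGSAVE in seconds
# Prints the suggested repl-backlog-size in bytes.
backlog_bytes_needed() {
  local start="$1" end="$2" interval="$3" bgsave="$4" safety="${5:-2}"
  [ "$interval" -gt 0 ] || return 1            # avoid division by zero
  local rate=$(( (end - start) / interval ))   # replication stream bytes per second
  echo $(( rate * bgsave * safety ))
}
```

Sampling the offsets during peak traffic gives the most conservative result. The BGSAVE duration can be taken from `rdb_last_bgsave_time_sec` in `INFO persistence`.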