## Introduction
Redis persistence and memory errors occur when Redis cannot save data to disk (RDB/AOF failures), runs out of memory (OOM errors), or encounters issues during background operations (fork failures, copy-on-write exhaustion). Redis provides two persistence mechanisms: RDB (point-in-time snapshots) and AOF (an append-only file that logs every write). Memory issues arise when Redis hits its `maxmemory` limit, causing OOM errors or triggering eviction policies.

Common causes include a full disk preventing RDB/AOF writes, fork() failures on memory-constrained systems, AOF file corruption, background saves disabled after previous fork failures, memory fragmentation exceeding thresholds, eviction policy misconfiguration causing unexpected key removal, and replication buffer exhaustion.

Fixing these issues requires understanding Redis persistence architecture, memory management, background save mechanics, and proper configuration for production workloads. This guide provides production-proven troubleshooting for Redis persistence and memory issues across standalone and clustered deployments.
## Symptoms
- `MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk`
- `Can't save in background: fork: Cannot allocate memory`
- `ERROR: Out of memory`
- `OOM command not allowed when used memory > 'maxmemory'`
- `Background AOF fsync error`
- `AOF fsync is taking too long`
- Redis rejects writes with `used_memory > maxmemory`
- Keys disappearing unexpectedly (eviction)
- RDB save taking too long, blocking commands
- AOF file growing unbounded
- Data missing after restart (persistence lost)
- Replica sync failing due to master persistence issues
## Common Causes
- Disk space exhausted, can't write RDB/AOF files
- Insufficient memory for fork() copy-on-write
- `stop-writes-on-bgsave-error` enabled with failing saves
- AOF fsync every second causing latency spikes
- Memory fragmentation from allocations/deallocations
- `maxmemory` set too low for dataset size
- Eviction policy `noeviction` causing OOM rejects
- Big keys causing large fork memory usage
- AOF rewrite failing due to memory pressure
- Transparent Huge Pages (THP) causing fork issues
- Permission issues on Redis data directory
- AOF truncation or corruption after crash
## Step-by-Step Fix
### 1. Diagnose persistence errors
Check Redis server status:
```bash
# Connect to Redis
redis-cli

# Check server info
INFO persistence

# Output:
# # Persistence
# loading:0
# current_cow_size:0
# rdb_changes_since_last_save:1523
# rdb_bgsave_in_progress:0
# rdb_last_save_time:1711900000
# rdb_last_bgsave_status:ok
# rdb_last_bgsave_time_sec:2
# rdb_last_cow_size:524288
# aof_enabled:1
# aof_rewrite_in_progress:0
# aof_rewrite_scheduled:0
# aof_last_rewrite_time_sec:5
# aof_current_rewrite_time_sec:-1
# aof_last_bgrewrite_status:ok
# aof_last_write_status:ok
# aof_last_cow_size:1048576

# Key fields to check:
# rdb_last_bgsave_status:  Should be "ok"
# aof_last_write_status:   Should be "ok"
# aof_last_bgrewrite_status: Should be "ok"

# Check for errors in Redis logs
tail -f /var/log/redis/redis-server.log

# Common error patterns:
# "Can't save in background: fork: Cannot allocate memory"
# "Background AOF fsync error: Disk quota exceeded"
# "Writing AOF on disk is slowing down"
```
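In scripts, the two status fields can be extracted rather than eyeballed. A minimal sketch; `check_persistence` and the sample INFO text below are illustrative, not part of Redis:

```shell
# check_persistence reads INFO persistence output on stdin and
# prints "ok" or the first failing status field.
check_persistence() {
  tr -d '\r' | awk -F: '
    /^rdb_last_bgsave_status:/ { rdb = $2 }
    /^aof_last_write_status:/  { aof = $2 }
    END {
      if (rdb != "ok") { print "rdb_last_bgsave_status=" rdb; exit 1 }
      if (aof != "ok") { print "aof_last_write_status=" aof; exit 1 }
      print "ok"
    }'
}

# Made-up sample showing a failing AOF write:
printf 'rdb_last_bgsave_status:ok\naof_last_write_status:err\n' | check_persistence
```

Against a live server this would be `redis-cli INFO persistence | check_persistence`; the `tr -d '\r'` strips redis-cli's CRLF line endings.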
Check disk space:
```bash
# Check Redis data directory
df -h /var/lib/redis

# If > 90% full, persistence may fail
# Clean up old RDB files or expand disk

# Check RDB/AOF file sizes
ls -lh /var/lib/redis/

# Typical files:
# dump.rdb        # RDB snapshot
# appendonly.aof  # AOF log

# Check file permissions
ls -la /var/lib/redis/
# Should be owned by redis user
chown redis:redis /var/lib/redis/*
```
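The disk check lends itself to automation. A sketch assuming GNU `df` (`--output=pcent`); the function names and the 90% threshold are arbitrary choices, not Redis defaults:

```shell
# Print filesystem usage percent for a path (numeric only).
disk_usage_pct() {
  df --output=pcent "$1" | tail -n 1 | tr -d ' %'
}

# Warn when the data directory's filesystem is nearly full.
check_redis_disk() {
  pct=$(disk_usage_pct "$1")
  if [ "$pct" -ge 90 ]; then
    echo "WARNING: $1 is ${pct}% full; RDB/AOF writes may start failing"
    return 1
  fi
  echo "OK: $1 at ${pct}%"
}

check_redis_disk /
```

On a Redis host the argument would be the data directory, e.g. `check_redis_disk /var/lib/redis`.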
### 2. Fix RDB persistence failures
RDB configuration:
```conf
# redis.conf - RDB persistence settings

# Save points (trigger BGSAVE when conditions met)
save 900 1      # Save after 900 sec if at least 1 key changed
save 300 10     # Save after 300 sec if at least 10 keys changed
save 60 10000   # Save after 60 sec if at least 10000 keys changed

# RDB filename
dbfilename dump.rdb

# Working directory
dir /var/lib/redis

# Compress RDB files (LZF compression)
rdbcompression yes

# RDB checksum
rdbchecksum yes

# Stop accepting writes if BGSAVE fails
stop-writes-on-bgsave-error yes

# For development/testing, RDB can be disabled:
# save ""
```
Fix fork failures:
```bash
# Error: "Can't save in background: fork: Cannot allocate memory"

# Cause: fork() requires memory for copy-on-write
# Redis forks; the child writes the RDB while the parent continues serving

# Solution 1: Enable overcommit memory (Linux)
sysctl vm.overcommit_memory=1
# Add to /etc/sysctl.conf for persistence

# Solution 2: Increase swap
swapon --show
# If no swap, add some:
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Solution 3: Disable THP (Transparent Huge Pages)
# THP makes fork slow and memory-intensive
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Add to /etc/rc.local for persistence

# Solution 4: Reduce Redis memory usage
# Lower maxmemory to leave room for fork
# redis-cli CONFIG SET maxmemory 2gb

# Solution 5: Schedule BGSAVE during low-traffic periods
# Disable automatic saves, use manual BGSAVE
# save ""
# Then cron job: 0 3 * * * redis-cli BGSAVE
```
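The two kernel prerequisites above (overcommit and THP) can be verified in one place. A sketch; the helper names are illustrative, and they take the file contents as arguments so they run unprivileged and are testable:

```shell
# Expects the contents of /proc/sys/vm/overcommit_memory
# ("1" lets fork() succeed without free memory for a full copy).
overcommit_ok() {
  [ "$1" = "1" ]
}

# Expects the contents of /sys/kernel/mm/transparent_hugepage/enabled;
# the kernel shows the active mode in brackets.
thp_disabled() {
  case "$1" in
    *"[never]"*) return 0 ;;
    *)           return 1 ;;
  esac
}

# On a live host:
#   overcommit_ok "$(cat /proc/sys/vm/overcommit_memory)" || echo "BGSAVE fork may fail"
#   thp_disabled "$(cat /sys/kernel/mm/transparent_hugepage/enabled)" || echo "disable THP"
```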
Manual RDB operations:
```bash
# Trigger manual save (blocks Redis!)
redis-cli SAVE

# Trigger background save (non-blocking)
redis-cli BGSAVE

# Check BGSAVE status
redis-cli LASTSAVE
# Returns Unix timestamp of last successful save

redis-cli INFO persistence | grep rdb_last

# Wait for BGSAVE to complete
watch 'redis-cli INFO persistence | grep rdb_bgsave'

# Defer a BGSAVE that can't start right now (Redis 3.2.2+)
redis-cli BGSAVE SCHEDULE
# Runs the save once a conflicting child (e.g. AOF rewrite) finishes
```
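To block a deploy script until a BGSAVE actually lands, poll LASTSAVE until it advances past the timestamp taken before the save was issued. A sketch; `wait_for_save` takes the command that prints the current timestamp as trailing arguments so it can be exercised without a live server:

```shell
# args: starting LASTSAVE timestamp, then the command that
# prints the current one.
wait_for_save() {
  start="$1"
  shift
  tries=0
  while [ "$tries" -lt 60 ]; do
    now=$("$@")
    if [ "$now" -gt "$start" ]; then
      return 0          # a save completed after our starting point
    fi
    tries=$(( tries + 1 ))
    sleep 1
  done
  return 1              # timed out after ~60 seconds
}
```

Against a real server: `start=$(redis-cli LASTSAVE); redis-cli BGSAVE; wait_for_save "$start" redis-cli LASTSAVE`.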
### 3. Fix AOF persistence issues
AOF configuration:
```conf
# redis.conf - AOF persistence settings

# Enable AOF
appendonly yes

# AOF filename
appendfilename "appendonly.aof"

# AOF fsync policy:
# always:   fsync every write (safest, slowest)
# everysec: fsync every second (recommended)
# no:       let OS decide when to fsync (fastest, riskiest)
appendfsync everysec

# Whether to skip fsync during rewrites; "yes" reduces latency
# at the cost of losing recent writes if Redis crashes mid-rewrite
no-appendfsync-on-rewrite no

# Auto-rewrite AOF when it grows too large
auto-aof-rewrite-percentage 100   # Rewrite when 100% larger than after last rewrite
auto-aof-rewrite-min-size 64mb    # Minimum size to trigger rewrite

# Handle truncated AOF on load
aof-load-truncated yes

# Use RDB preamble in AOF (faster loading, Redis 4.0+)
aof-use-rdb-preamble yes
```
Fix AOF fsync errors:
```bash
# Error: "Background AOF fsync error" or "AOF fsync is taking too long"

# Check disk I/O performance
iostat -x 1 5

# If %util is consistently > 80%, disk is saturated
# Solutions:
# 1. Use SSD for Redis data directory
# 2. Move AOF to separate disk from RDB
# 3. Change appendfsync to "no" (risk of data loss)

# Temporary fix: Change fsync policy
redis-cli CONFIG SET appendfsync no

# Or reduce rewrite triggers
redis-cli CONFIG SET auto-aof-rewrite-percentage 200
redis-cli CONFIG SET auto-aof-rewrite-min-size 128mb
```
AOF rewrite issues:
```bash
# Check if rewrite is in progress
redis-cli INFO persistence | grep aof_rewrite

# Manually trigger AOF rewrite
redis-cli BGREWRITEAOF

# If rewrite fails with a fork error, apply the same fixes as for RDB fork issues

# Check AOF file integrity (read-only check)
redis-check-aof /var/lib/redis/appendonly.aof

# Truncate corrupted AOF (last resort!)
# Stop Redis first
systemctl stop redis

# Check and fix
redis-check-aof --fix /var/lib/redis/appendonly.aof

# Start Redis
systemctl start redis
```
AOF recovery procedure:
```bash
# If AOF is corrupted and Redis won't start:

# Option 1: Use redis-check-aof to fix
redis-check-aof --fix /var/lib/redis/appendonly.aof

# Option 2: Recover from RDB only
# Stop Redis
systemctl stop redis

# Rename AOF (backup)
mv /var/lib/redis/appendonly.aof /var/lib/redis/appendonly.aof.bak

# Start Redis (will load RDB)
systemctl start redis

# Re-enable AOF
redis-cli CONFIG SET appendonly yes

# Option 3: Manual AOF truncation
# Edit the AOF file and remove the incomplete command at the end
# The AOF is plain Redis protocol (RESP) and can be edited carefully
```
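Since `redis-check-aof --fix` truncates the file in place, take a timestamped copy before any repair. A minimal helper; the name and backup suffix are arbitrary:

```shell
# Copy the AOF aside before a destructive repair; prints the
# backup path on success.
backup_aof() {
  src="$1"
  dest="${src}.$(date +%Y%m%d%H%M%S).bak"
  cp -p "$src" "$dest" && echo "$dest"
}

# Example:
#   backup_aof /var/lib/redis/appendonly.aof \
#     && redis-check-aof --fix /var/lib/redis/appendonly.aof
```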
### 4. Fix memory errors
Check memory usage:
```bash
# Check memory info
redis-cli INFO memory

# Key fields:
# used_memory:             Actual memory used
# used_memory_human:       Human readable (e.g., "1.50G")
# used_memory_rss:         Memory from OS perspective
# used_memory_peak:        Maximum memory used
# maxmemory:               Configured limit
# maxmemory_human:         Human readable limit
# maxmemory_policy:        Eviction policy
# mem_fragmentation_ratio: RSS / used_memory (ideal: 1.0-1.5)

# If mem_fragmentation_ratio > 1.5, consider restart or active defrag
# If mem_fragmentation_ratio < 1.0, Redis is swapping (bad!)

# Detailed memory breakdown
redis-cli MEMORY STATS

# Check individual key sizes
redis-cli --bigkeys

# Output shows largest keys by type:
# # Largest keys found
# [00.00%] Largest string key: 15.23KB (key:user:session:abc123)
# [00.00%] Largest hash key: 1.25MB (key:product:catalog)
# [00.00%] Largest list key: 500KB (key:queue:jobs)
```
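`used_memory` as a percentage of `maxmemory` is the figure worth alerting on. A parsing sketch over INFO memory output; `memory_pct` and the sample byte counts are illustrative:

```shell
# Reads INFO memory on stdin; prints used memory as a percentage
# of maxmemory, or "n/a" when no limit is set.
memory_pct() {
  tr -d '\r' | awk -F: '
    /^used_memory:/ { used = $2 }
    /^maxmemory:/   { max  = $2 }
    END { if (max > 0) print int(used * 100 / max); else print "n/a" }'
}

# 1.5 GiB used of a 2 GiB limit -> 75
printf 'used_memory:1610612736\nmaxmemory:2147483648\n' | memory_pct
```

Against a live server: `redis-cli INFO memory | memory_pct`.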
Fix OOM errors:
```bash
# Error: "ERROR: Out of memory" or "OOM command not allowed"

# Option 1: Increase maxmemory
redis-cli CONFIG SET maxmemory 4gb

# Make permanent in redis.conf:
# maxmemory 4gb

# Option 2: Change eviction policy
# Current policy
redis-cli CONFIG GET maxmemory-policy

# Eviction policies:
# noeviction:      Return errors on writes (default)
# allkeys-lru:     Evict least recently used keys (recommended for cache)
# volatile-lru:    Evict LRU keys with TTL set
# allkeys-lfu:     Evict least frequently used keys
# volatile-lfu:    Evict LFU keys with TTL set
# allkeys-random:  Evict random keys
# volatile-random: Evict random keys with TTL set
# volatile-ttl:    Evict keys with shortest TTL

# Set appropriate policy
redis-cli CONFIG SET maxmemory-policy allkeys-lru

# Option 3: Delete large or unnecessary keys
redis-cli --bigkeys
redis-cli DEL key:user:session:old

# Option 4: Set TTL on keys to auto-expire
redis-cli EXPIRE key:temp 3600
```
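When sizing `maxmemory` against `used_memory`, note that Redis reads the `kb`/`mb`/`gb` suffixes as powers of 1024 (plain `k`/`m`/`g` are powers of 1000). A converter for the 1024-based suffixes; the function name is illustrative:

```shell
# Convert "4gb" / "512mb" / "64kb" (or a bare byte count) to bytes.
to_bytes() {
  echo "$1" | awk '
    /^[0-9]+gb$/ { sub(/gb/, ""); printf "%.0f\n", $0 * 1073741824; next }
    /^[0-9]+mb$/ { sub(/mb/, ""); printf "%.0f\n", $0 * 1048576; next }
    /^[0-9]+kb$/ { sub(/kb/, ""); printf "%.0f\n", $0 * 1024; next }
    { printf "%.0f\n", $0 + 0 }'
}

to_bytes 4gb     # 4294967296
to_bytes 512mb   # 536870912
```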
Memory defragmentation:
```bash
# Enable active defragmentation (Redis 4.0+)
# redis.conf:
# activedefrag yes
# active-defrag-ignore-bytes 100mb    # Start if fragmented bytes > 100MB
# active-defrag-threshold-lower 10    # Start if fragmentation > 10%
# active-defrag-threshold-upper 100   # Maximum effort above 100% fragmentation
# active-defrag-cycle-min 5           # Min CPU percentage for defrag
# active-defrag-cycle-max 75          # Max CPU percentage for defrag

# Check if defrag is running
redis-cli INFO stats | grep defrag

# Manual defrag via restart
# Sometimes restarting Redis is the quickest defrag
systemctl restart redis
```
### 5. Fix replication buffer issues
Replication configuration:
```conf
# redis.conf - Replication buffer settings

# Client output buffer limits
# Format: client-output-buffer-limit <class> <hard> <soft> <seconds>

# Normal clients (regular commands)
client-output-buffer-limit normal 0 0 0

# Replica clients (data sent to replicas)
client-output-buffer-limit replica 256mb 64mb 60
# Hard limit: Disconnect if buffer exceeds 256MB
# Soft limit: Disconnect if buffer exceeds 64MB for 60 seconds

# Pub/Sub clients
client-output-buffer-limit pubsub 32mb 8mb 60

# Replication timeout
repl-timeout 60

# Ping interval to replicas
repl-ping-replica-period 10
```
Diagnose replication issues:
```bash
# Check replication status
redis-cli INFO replication

# Output (master):
# role:master
# connected_slaves:2
# slave0:ip=10.0.0.2,port=6379,state=online,offset=123456
# slave1:ip=10.0.0.3,port=6379,state=online,offset=123450

# If a replica shows "buffer is full" or disconnects frequently:
# Increase replica buffer limits

# Check replica-side issues
# On the replica:
redis-cli INFO replication

# Output (replica):
# role:slave
# master_host:10.0.0.1
# master_port:6379
# master_link_status:up
# master_sync_in_progress:0
# master_last_io_seconds_ago:1

# If master_link_status:down, check network and master availability
```
Fix replication buffer exhaustion:
```bash
# Error: "Client closed connection" or replica keeps disconnecting

# Cause: Replication buffer exceeded during heavy writes or slow replica

# Solution 1: Increase buffer limits
redis-cli CONFIG SET client-output-buffer-limit "replica 512mb 128mb 120"

# Solution 2: Reduce write load or batch writes
# Use pipelining instead of individual commands

# Solution 3: Check replica performance
# A slow replica may not drain the buffer fast enough
redis-cli -h <replica-host> SLOWLOG GET 10

# Solution 4: Use partial resynchronization
# Redis 2.8+ supports PSYNC for efficient reconnection
# Ensure the master has a large enough backlog (redis.conf):
# repl-backlog-size 64mb   # Buffer for partial resync
# repl-backlog-ttl 3600    # How long to keep backlog
```
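Replica lag in bytes is the master's replication offset minus each replica's acknowledged offset, both visible in INFO replication on the master. A parsing sketch assuming the `slaveN:` line layout shown above; the sample output is made up:

```shell
# Reads the master's INFO replication on stdin; prints one lag
# figure (bytes behind the master) per connected replica.
replica_lag() {
  tr -d '\r' | awk -F'[:,=]' '
    /^master_repl_offset:/ { master = $2 }
    /^slave[0-9]+:/ {
      for (i = 1; i <= NF; i++) if ($i == "offset") offs[n++] = $(i + 1)
    }
    END { for (j = 0; j < n; j++) print master - offs[j] }'
}

printf '%s\n' \
  'slave0:ip=10.0.0.2,port=6379,state=online,offset=123456,lag=0' \
  'master_repl_offset:123500' | replica_lag   # prints 44
```

Live usage: `redis-cli INFO replication | replica_lag`.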
### 6. Fix slow persistence operations
Monitor persistence performance:
```bash
# Check RDB/AOF timing
redis-cli INFO persistence | grep -E "time|cow"

# Key metrics:
# rdb_last_bgsave_time_sec:  Last RDB save duration
# aof_last_rewrite_time_sec: Last AOF rewrite duration
# rdb_last_cow_size:         Copy-on-write memory used during RDB save
# aof_last_cow_size:         Copy-on-write memory used during AOF rewrite

# If bgsave time > 5 seconds, investigate big keys
# If cow_size is very large, there is memory pressure during fork

# Check for clients blocked during persistence
redis-cli INFO clients | grep blocked_clients
```
Optimize persistence:
```conf
# redis.conf optimization

# For latency-sensitive workloads, disable RDB during peak hours
# Use AOF only, or schedule RDB during off-peak

# Disable RDB (use AOF only)
save ""

# Or reduce RDB frequency
save 300 100
save 60 1000

# For AOF, use everysec (balance of safety and performance)
appendfsync everysec

# Skip fsync during rewrite (prevents double I/O; risks losing
# the most recent writes if Redis crashes mid-rewrite)
no-appendfsync-on-rewrite yes

# Set reasonable rewrite thresholds
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
```
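The two auto-rewrite settings combine: a rewrite fires only once the AOF exceeds the minimum size *and* has grown past the post-rewrite base by the configured percentage. A sketch of that check, as an approximation of Redis's internal logic (sizes in bytes):

```shell
# args: current AOF size, size right after last rewrite,
#       auto-aof-rewrite-percentage, auto-aof-rewrite-min-size
should_rewrite() {
  [ "$1" -ge "$4" ] && [ "$1" -ge $(( $2 + $2 * $3 / 100 )) ]
}

# With a 64 MiB base, 100% growth and a 64 MiB minimum, a
# 128 MiB AOF is due for rewrite:
should_rewrite 134217728 67108864 100 67108864 && echo "rewrite would trigger"
```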
Big key impact on persistence:
```bash
# Big keys slow down RDB/AOF operations
# Find big keys
redis-cli --bigkeys
redis-cli --scan --pattern '*' | xargs -n 1 redis-cli MEMORY USAGE

# Or use redis-rdb-tools for detailed analysis
pip install rdbtools
rdb --command memory /var/lib/redis/dump.rdb > memory-report.txt

# Delete or split big keys
# Hash with many fields -> split into multiple hashes
# List with many elements -> use multiple smaller lists

# Example: Split a large hash
# Before: HSET user:123 field1 value1 ... field10000 value10000
# After:  HSET user:123:part1 field1 value1 ... field1000 value1000
#         HSET user:123:part2 field1001 value1001 ... field2000 value2000
```
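Splitting only helps if readers can still find the right part. One common scheme hashes the field name into a fixed number of buckets; a sketch where the key names, bucket count, and `bucket_for` helper are examples, not a Redis feature:

```shell
# Map a field name to one of N sub-hashes by summing its byte
# values; the same field always lands in the same bucket.
bucket_for() {   # args: base key, field name, bucket count
  sum=$(printf '%s' "$2" | od -An -tu1 | tr -s ' ' '\n' \
        | awk '{ s += $1 } END { print s }')
  echo "$1:part$(( sum % $3 ))"
}

bucket_for user:123 field1 4   # user:123:part1
```

Writes and reads then target the derived key, e.g. `redis-cli HSET "$(bucket_for user:123 name 4)" name value`.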
### 7. Monitor Redis health
Prometheus metrics:
```yaml
# Redis exporter
# https://github.com/oliver006/redis_exporter
#
# Run the exporter:
#   docker run -d --name redis-exporter \
#     -p 9121:9121 \
#     oliver006/redis_exporter \
#     --redis.addr=redis://localhost:6379

# Prometheus scrape config
scrape_configs:
  - job_name: 'redis'
    static_configs:
      - targets: ['localhost:9121']

# Key metrics:
# redis_memory_used_bytes
# redis_memory_max_bytes
# redis_memory_fragmentation_ratio
# redis_connected_clients
# redis_blocked_clients
# redis_evicted_keys_total
# redis_keyspace_hits_total
# redis_keyspace_misses_total
# redis_persistence_rdb_bgsave_in_progress
# redis_persistence_aof_rewrite_in_progress
# redis_persistence_last_bgsave_status
# redis_persistence_aof_last_write_status
```
Grafana alert rules:
```yaml
groups:
  - name: redis_health
    rules:
      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage above 90%"
          description: "{{ $value | humanizePercentage }} of max memory used"

      - alert: RedisPersistenceFailed
        expr: redis_persistence_last_bgsave_status == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Redis RDB persistence failing"

      - alert: RedisAOFLastWriteFailed
        expr: redis_persistence_aof_last_write_status == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Redis AOF write failing"

      - alert: RedisFragmentationHigh
        expr: redis_memory_fragmentation_ratio > 1.5
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Redis memory fragmentation high"
          description: "Fragmentation ratio at {{ $value }}"

      - alert: RedisReplicationDown
        expr: redis_connected_slaves < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Redis has no connected replicas"
```
Health check script:
```bash
#!/bin/bash
# Redis health check

REDIS_HOST="${REDIS_HOST:-localhost}"
REDIS_PORT="${REDIS_PORT:-6379}"

# Check connectivity
if ! redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" PING | grep -q "PONG"; then
  echo "CRITICAL: Redis not responding"
  exit 2
fi

# Check memory
MEMORY_INFO=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" INFO memory)
MEMORY_USED=$(echo "$MEMORY_INFO" | grep "used_memory:" | cut -d: -f2 | tr -d '\r')
MEMORY_MAX=$(echo "$MEMORY_INFO" | grep "maxmemory:" | cut -d: -f2 | tr -d '\r')

if [ "$MEMORY_MAX" -gt 0 ]; then
  MEMORY_PCT=$((MEMORY_USED * 100 / MEMORY_MAX))
  if [ "$MEMORY_PCT" -gt 90 ]; then
    echo "WARNING: Memory usage at ${MEMORY_PCT}%"
  fi
fi

# Check persistence
PERSISTENCE_INFO=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" INFO persistence)
RDB_STATUS=$(echo "$PERSISTENCE_INFO" | grep "rdb_last_bgsave_status:" | cut -d: -f2 | tr -d '\r')
AOF_STATUS=$(echo "$PERSISTENCE_INFO" | grep "aof_last_write_status:" | cut -d: -f2 | tr -d '\r')

if [ "$RDB_STATUS" != "ok" ]; then
  echo "CRITICAL: RDB persistence failing"
  exit 2
fi

if [ "$AOF_STATUS" != "ok" ]; then
  echo "CRITICAL: AOF write failing"
  exit 2
fi

# Check replication
ROLE=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" INFO replication | grep "role:" | cut -d: -f2 | tr -d '\r')
if [ "$ROLE" == "slave" ]; then
  LINK_STATUS=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" INFO replication | grep "master_link_status:" | cut -d: -f2 | tr -d '\r')
  if [ "$LINK_STATUS" != "up" ]; then
    echo "CRITICAL: Replica link to master is down"
    exit 2
  fi
fi

echo "OK: Redis healthy"
exit 0
```
## Prevention
- Enable both RDB and AOF for durability
- Set `vm.overcommit_memory=1` for reliable fork()
- Disable Transparent Huge Pages (THP)
- Configure appropriate `maxmemory` with an eviction policy
- Monitor memory fragmentation and restart if needed
- Use `allkeys-lru` or `allkeys-lfu` eviction for cache workloads
- Size replication buffers for expected write volume
- Schedule RDB saves during low-traffic periods
- Use SSDs for Redis data directory
- Implement proper alerting for persistence failures
## Related Errors
- **WRONGTYPE Operation against a key holding the wrong kind of value**: Type mismatch
- **NOAUTH Authentication required**: Missing authentication
- **EXECABORT Transaction discarded**: Transaction error
- **MOVED Cluster slot not served by this node**: Cluster redirection
- **ASK Cluster slot migrating**: Cluster slot migration in progress