## Introduction

When a MongoDB secondary falls behind the primary by more than the oplog window, it cannot catch up through normal replication and must perform a full initial sync. This is extremely disruptive for large datasets as it requires copying all data over the network and rebuilding all indexes.

## Symptoms

- Secondary logs show `repl: X replSetReinitializing` or `too stale to catch up`
- `rs.status()` shows secondary in `RECOVERING` state with `lastHeartbeatMessage: "too stale"`
- `rs.printReplicationInfo()` shows oplog window smaller than secondary lag
- `rs.printSecondaryReplicationInfo()` shows one or more secondaries significantly behind
- Secondary requires full resync after being offline for maintenance
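
The lag-related symptoms above can be checked in one pass. The following is a minimal sketch rather than an official tool: `findStaleCandidates` is a hypothetical helper that takes a document shaped like the output of `rs.status()` (it reads only `members[].name`, `stateStr`, and `optimeDate`) together with the oplog window in seconds, and reports members whose lag is close to or beyond that window. The sample data and the 80% warning threshold are made up for illustration.

```javascript
// Hypothetical helper: flag members whose replication lag approaches or
// exceeds the primary's oplog window. `status` is shaped like rs.status().
function findStaleCandidates(status, oplogWindowSecs, warnRatio = 0.8) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) {
    throw new Error("no PRIMARY found in replica set status");
  }
  return status.members
    // Skip the primary and members that report no optimeDate
    .filter((m) => m !== primary && m.optimeDate instanceof Date)
    .map((m) => {
      // Date subtraction yields milliseconds
      const lagSecs = Math.max(0, (primary.optimeDate - m.optimeDate) / 1000);
      let risk = "ok";
      if (lagSecs >= oplogWindowSecs) risk = "too stale";
      else if (lagSecs >= warnRatio * oplogWindowSecs) risk = "at risk";
      return { name: m.name, state: m.stateStr, lagSecs, risk };
    })
    .filter((m) => m.risk !== "ok");
}

// Example with made-up data; in mongosh, pass rs.status() instead.
const sampleStatus = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY", optimeDate: new Date("2024-01-02T12:00:00Z") },
    { name: "db2:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-02T11:59:58Z") },
    { name: "db3:27017", stateStr: "RECOVERING", optimeDate: new Date("2024-01-01T06:00:00Z") },
  ],
};
const flagged = findStaleCandidates(sampleStatus, 24 * 3600);
console.log(flagged); // only db3, which is 30 hours behind a 24-hour window
```

Comparing lag to the window is a heuristic. Strictly, a member is too stale once its last applied operation is older than the oldest entry still held in its sync source's oplog.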

## Common Causes

- Oplog size too small for the write volume of the application
- Secondary taken offline for maintenance longer than the oplog window
- Network issue causing replication to stall for an extended period
- Burst of write operations rapidly filling the oplog
- Secondary disk I/O too slow to keep up with replication apply rate
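
Several of these causes reduce to the same arithmetic: the oplog has a capped size, so its window is roughly that size divided by the rate at which writes fill it. As a rough sketch under that simplifying assumption (the helper names, the 50% safety margin, and all figures are illustrative, not taken from MongoDB), this shows how a write burst shrinks the window and whether a planned outage would still fit:

```javascript
// Simplified model: window (hours) = oplog size / oplog write rate.
// Both helpers are hypothetical; real oplog entries vary in size.
function oplogWindowHours(oplogSizeMB, writeRateMBPerHour) {
  if (writeRateMBPerHour <= 0) return Infinity; // no writes, nothing ages out
  return oplogSizeMB / writeRateMBPerHour;
}

// True if a member can be offline for `downtimeHours` and still catch up,
// keeping a safety margin (default: use at most half of the window).
function downtimeFits(oplogSizeMB, writeRateMBPerHour, downtimeHours, maxWindowFraction = 0.5) {
  const windowHours = oplogWindowHours(oplogSizeMB, writeRateMBPerHour);
  return downtimeHours <= windowHours * maxWindowFraction;
}

const sizeMB = 10240; // illustrative 10 GB oplog
console.log(oplogWindowHours(sizeMB, 128));  // 80 (normal load)
console.log(oplogWindowHours(sizeMB, 2048)); // 5 (bulk-load burst)
console.log(downtimeFits(sizeMB, 128, 6));   // true
console.log(downtimeFits(sizeMB, 2048, 6));  // false
```

The same six-hour maintenance that is safe under normal load becomes a guaranteed resync if it overlaps a bulk load, which is why the window should be measured at peak write volume rather than on a quiet day.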

## Step-by-Step Fix

1. **Check oplog window and secondary lag**:

```javascript
// On the primary
rs.printReplicationInfo()
// Look for: "log length start to end" - this is your oplog window (reported in seconds and hours)

// Check secondary lag
rs.printSecondaryReplicationInfo()
// Look for: "syncedTo" and "behind the primary" for each secondary
```

2. **Calculate required oplog size**:

```javascript
// db.getReplicationInfo() reports logSizeMB, usedMB, timeDiff (seconds) and timeDiffHours
var oplog = db.getReplicationInfo();
// Average oplog write rate in MB per hour over the current window
var mbPerHour = oplog.usedMB / oplog.timeDiffHours;

// If you need 72 hours of oplog:
var neededHours = 72;
print("Current window: " + oplog.timeDiffHours + " hours");
print("Needed window: " + neededHours + " hours");
print("Oplog should be: " + Math.ceil(mbPerHour * neededHours) + " MB");
```
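
3. **Resize the oplog**:

```javascript
// replSetResizeOplog is available from MongoDB 3.6; minRetentionHours requires 4.4+
// size is in megabytes. Run this on each replica set member; the change is not replicated.
db.adminCommand({ replSetResizeOplog: 1, size: Double(51200), minRetentionHours: 72 });

// Verify the change
rs.printReplicationInfo();
```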

4. **Perform initial sync for a too-stale secondary**:

```bash
# On the stale secondary
sudo systemctl stop mongod

# Remove all data files (WARNING: this deletes all local data)
# Adjust the path to match storage.dbPath in your mongod configuration
sudo rm -rf /var/lib/mongodb/*

# Restart - MongoDB will perform an initial sync from another replica set member
sudo systemctl start mongod

# Monitor sync progress
mongosh --eval "rs.status()"
mongosh --eval "db.currentOp()"
```
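
5. **Use `db.copyDatabase` for faster recovery on smaller datasets** (legacy versions only):

```javascript
// NOTE: db.copyDatabase() was deprecated in MongoDB 4.0 and removed in 4.2.
// It writes to the node it runs on, so it fails on a member that is currently
// a secondary, because secondaries reject direct writes.
// On MongoDB 4.2 and later, use the initial sync procedure above instead.

// On the stale node
// This can be faster than initial sync for small databases
db.copyDatabase("mydb", "mydb", "primary-host:27017")
```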
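
To follow the resync started in the initial sync step above, it helps to know that a member normally reports `STARTUP2` in `rs.status()` while the initial sync runs and moves to `SECONDARY` once it has caught up. Below is a minimal sketch of a check that could be polled from a script. `describeSyncState` and the host names are hypothetical, and the function reads only `members[].name` and `stateStr`.

```javascript
// Hypothetical helper: summarize where a resyncing member stands, based on
// the state string that rs.status() reports for it.
function describeSyncState(status, memberName) {
  const member = status.members.find((m) => m.name === memberName);
  if (!member) return "not found in replica set status";
  switch (member.stateStr) {
    case "STARTUP2":
      return "initial sync in progress";
    case "RECOVERING":
      return "recovering (not yet readable)";
    case "SECONDARY":
      return "synced and replicating";
    default:
      return "state: " + member.stateStr;
  }
}

// Example with made-up data; in mongosh, pass rs.status() instead.
const exampleStatus = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY" },
    { name: "db3:27017", stateStr: "STARTUP2" },
  ],
};
console.log(describeSyncState(exampleStatus, "db3:27017")); // initial sync in progress
```

A member that stays in `RECOVERING` instead of reaching `SECONDARY` is worth checking against the symptoms listed earlier, since that is also the state a too-stale member reports.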