## Introduction

When a MongoDB secondary falls behind the primary by more than the oplog window, it cannot catch up through normal replication and must perform a full initial sync. This is extremely disruptive for large datasets, as it requires copying all data over the network and rebuilding all indexes.

## Symptoms

- Secondary logs show `repl: X replSetReinitializing` or `too stale to catch up`
- `rs.status()` shows the secondary in `RECOVERING` state with `lastHeartbeatMessage: "too stale"`
- `rs.printReplicationInfo()` shows an oplog window smaller than the secondary's lag
- `rs.printSecondaryReplicationInfo()` shows one or more secondaries significantly behind
- Secondary requires a full resync after being offline for maintenance
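The lag that `rs.printSecondaryReplicationInfo()` reports is just the difference between the primary's last optime and each member's last applied optime. The sketch below reproduces that arithmetic in plain JavaScript against a fabricated array shaped like `rs.status().members` (field names match the real output, but the hosts and timestamps are illustrative and it assumes a primary is present):

```javascript
// Compute per-member lag the way rs.printSecondaryReplicationInfo() does:
// primary's last optime minus each non-primary member's applied optime.
function secondaryLagSeconds(members) {
  const primary = members.find((m) => m.stateStr === "PRIMARY");
  return members
    .filter((m) => m.stateStr !== "PRIMARY")
    .map((m) => ({
      name: m.name,
      // Date subtraction yields milliseconds; convert to seconds
      lagSeconds: (primary.optimeDate - m.optimeDate) / 1000,
    }));
}

// Fabricated example: one healthy secondary, one badly lagging member
const members = [
  { name: "db1:27017", stateStr: "PRIMARY",    optimeDate: new Date("2024-01-01T12:00:00Z") },
  { name: "db2:27017", stateStr: "SECONDARY",  optimeDate: new Date("2024-01-01T11:59:50Z") },
  { name: "db3:27017", stateStr: "RECOVERING", optimeDate: new Date("2024-01-01T08:00:00Z") },
];
console.log(secondaryLagSeconds(members));
// db2 is 10 s behind; db3 is 14400 s (4 h) behind - compare that
// figure against your oplog window, not against a fixed number
```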

## Common Causes

- Oplog size too small for the write volume of the application
- Secondary taken offline for maintenance longer than the oplog window
- Network issue causing replication to stall for an extended period
- Burst of write operations rapidly filling the oplog
- Secondary disk I/O too slow to keep up with the replication apply rate
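The "burst of writes" cause is worth quantifying: the oplog window is simply oplog size divided by the rate at which oplog entries are written, so a bulk load can collapse the window in hours. A minimal sketch with illustrative numbers (your real rates come from monitoring, not from these constants):

```javascript
// Oplog window in hours = capacity / write rate.
// Both inputs are illustrative assumptions, not measured values.
function oplogWindowHours(oplogSizeMB, writeRateMBPerHour) {
  return oplogSizeMB / writeRateMBPerHour;
}

// A 10 GB oplog at a steady 500 MB/h gives ~20 hours of window...
console.log(oplogWindowHours(10240, 500));  // 20.48
// ...but a bulk load writing 5 GB/h shrinks it to 2 hours
console.log(oplogWindowHours(10240, 5120)); // 2
```

Any secondary offline or stalled for longer than the current window becomes too stale, which is why write bursts and undersized oplogs appear together in this list.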

## Step-by-Step Fix

1. **Check the oplog window and secondary lag**:

```javascript
// On the primary
rs.printReplicationInfo()
// Look for "log length start to end" - this is your oplog window in hours

// Check secondary lag
rs.printSecondaryReplicationInfo()
// Look for "behind the primary" for each secondary
```

2. **Calculate the required oplog size**:

```javascript
// On the primary
var oplog = db.getReplicationInfo();

// Current oplog window in hours (time between first and last oplog entry)
var currentHours = oplog.timeDiffHours;

// If you need 72 hours of oplog:
var neededHours = 72;
print("Current window: " + currentHours + " hours");
print("Needed window: " + neededHours + " hours");
// Scale the space the oplog currently uses up to the needed window
print("Oplog should be at least: " + Math.ceil(neededHours / currentHours * oplog.usedMB) + " MB");
```
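As a sanity check, the sizing arithmetic can be run standalone: scale the space the oplog currently uses up to the desired retention window. The numbers below are illustrative, not read from a live server:

```javascript
// neededMB = (neededHours / currentWindowHours) * usedMB, rounded up.
// Inputs correspond to db.getReplicationInfo()'s usedMB and timeDiffHours.
function requiredOplogMB(usedMB, currentWindowHours, neededHours) {
  return Math.ceil((neededHours / currentWindowHours) * usedMB);
}

// 12 GB of oplog consumed over a 24-hour window -> 72 hours needs ~36 GB
console.log(requiredOplogMB(12288, 24, 72)); // 36864
```

Treat the result as a floor, not a target: write rates fluctuate, so adding headroom (for example 25-50%) on top of the computed size is prudent.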

3. **Resize the oplog**:

```javascript
// MongoDB 4.4+ (minRetentionHours requires 4.4); run on each member
db.adminCommand({ replSetResizeOplog: 1, size: 51200, minRetentionHours: 72 });

// Verify the change
rs.printReplicationInfo();
```

4. **Perform an initial sync on a too-stale secondary**:

```bash
# On the stale secondary
sudo systemctl stop mongod

# Remove all data files (WARNING: this deletes all local data)
sudo rm -rf /var/lib/mongodb/*

# Restart - MongoDB will perform an initial sync from the primary
sudo systemctl start mongod

# Monitor sync progress
mongosh --eval "rs.status()"
mongosh --eval "db.currentOp()"
```
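Before starting an initial sync on a large dataset, it helps to have a rough duration estimate so you can plan the maintenance window. This back-of-envelope sketch models it as network copy time plus index rebuild time; every input here is an illustrative assumption you would replace with your own measurements:

```javascript
// Crude estimate: hours to copy the data over the network plus hours
// to rebuild indexes. Real syncs vary with load, document size, etc.
function initialSyncHoursEstimate(dataGB, networkMBps, indexBuildGBPerHour) {
  const copyHours = (dataGB * 1024) / (networkMBps * 3600);
  const indexHours = dataGB / indexBuildGBPerHour;
  return copyHours + indexHours;
}

// 2 TB over a 100 MB/s link, with index builds at 200 GB/h,
// comes out to roughly 16 hours
console.log(initialSyncHoursEstimate(2048, 100, 200));
```

If the estimate exceeds your oplog window, grow the oplog first (step 3), or the freshly synced member may fall too stale again before it finishes catching up.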

5. **For faster recovery, seed the secondary with data files from another member.** (`db.copyDatabase`, sometimes suggested for this, was removed in MongoDB 4.2 and in any case cannot rebuild a replica set member.) Copying files is typically faster than an initial sync because indexes do not need to be rebuilt:

```bash
# On a healthy secondary: stop mongod (or use a filesystem snapshot)
# so the data files are consistent
sudo systemctl stop mongod

# Copy its data directory to the stale member (whose mongod is also
# stopped); "stale-secondary" is a placeholder - adjust hosts and paths
rsync -a /var/lib/mongodb/ stale-secondary:/var/lib/mongodb/

# Restart both members; the seeded secondary catches up via the oplog
sudo systemctl start mongod
```

## Prevention

- Size the oplog to hold at least 2-3x your longest expected maintenance window
- Set `minRetentionHours` to ensure the oplog is never smaller than needed
- Monitor secondary lag continuously, with alerting at 50% of the oplog window
- Use `replSetResizeOplog` proactively before the window becomes critical
- Test maintenance procedures to ensure secondaries can catch up within the window
- For large datasets, consider using `mongodump`/`mongorestore` instead of initial sync
- Deploy secondaries in the same region as the primary to minimize network latency
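The "alert at 50% of the oplog window" rule is easy to encode in whatever monitoring system you use. A minimal sketch, assuming you already collect lag (from `rs.status()` optimes) and the window (from `rs.printReplicationInfo()`); the threshold and inputs below are illustrative:

```javascript
// Fire an alert when a secondary's lag exceeds the given fraction
// of the current oplog window (default 50%).
function shouldAlert(lagSeconds, oplogWindowHours, threshold = 0.5) {
  return lagSeconds > oplogWindowHours * 3600 * threshold;
}

console.log(shouldAlert(7200, 24));  // false: 2 h lag vs a 12 h threshold
console.log(shouldAlert(50000, 24)); // true: ~13.9 h lag vs a 12 h threshold
```

Alerting at half the window leaves time to resize the oplog or fix the lagging member before it crosses the point of no return and needs a full resync.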