Introduction
MongoDB secondaries can only catch up if the oplog still contains the operations they missed. When a secondary has been offline too long, is severely lagged, or the oplog is too small for the current write volume, the member reports RS102 with a message like too stale to catch up in lastHeartbeatMessage. At that point, normal replication catch-up is no longer possible, and you need to confirm the gap and resync the member.
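To make the staleness condition concrete, here is a minimal sketch in plain Python of the check replication effectively depends on: the secondary can only catch up if its last applied operation is still at or after the oldest entry remaining in the primary's oplog. The timestamps are made up for illustration and stand in for oplog optimes.

```python
from datetime import datetime

# Hypothetical timestamps standing in for oplog optimes.
oldest_oplog_entry = datetime(2024, 5, 1, 12, 0)      # oldest op still in the primary's oplog
secondary_last_applied = datetime(2024, 5, 1, 9, 30)  # last op the secondary applied before it stopped

def can_catch_up(oldest_in_oplog, last_applied):
    # The secondary needs every op after last_applied; if the oplog has
    # already rolled past that point, normal replication cannot succeed.
    return last_applied >= oldest_in_oplog

if can_catch_up(oldest_oplog_entry, secondary_last_applied):
    print("within oplog window: normal catch-up possible")
else:
    print("too stale: oplog rolled past the secondary; full resync required")
```

Here the secondary's last applied op (09:30) is older than the oldest retained oplog entry (12:00), so the only remaining option is the full resync described below.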
Symptoms
- rs.status() shows lastHeartbeatMessage with too stale to catch up
- The secondary remains in RECOVERING or another unhealthy state
- Replication lag was high before the member dropped out
- The problem started after a long outage, disk issue, or heavy write spike
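When watching several members at once, it helps to scan the members array from rs.status() programmatically. The sketch below is plain Python over hardcoded documents whose field names (stateStr, lastHeartbeatMessage) mirror real rs.status() output; the host names and values are invented for the example.

```python
# Illustrative member documents shaped like the "members" array from
# rs.status(); the hosts and values here are made up.
members = [
    {"name": "db1:27017", "stateStr": "PRIMARY",    "lastHeartbeatMessage": ""},
    {"name": "db2:27017", "stateStr": "SECONDARY",  "lastHeartbeatMessage": ""},
    {"name": "db3:27017", "stateStr": "RECOVERING",
     "lastHeartbeatMessage": "too stale to catch up"},
]

def stale_members(members):
    # Flag any member whose heartbeat message reports staleness.
    return [m["name"] for m in members
            if "too stale" in m.get("lastHeartbeatMessage", "")]

print(stale_members(members))  # ['db3:27017']
```

A member flagged this way will not recover on its own, which is why the steps below start by confirming the oplog window rather than waiting.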
Common Causes
- The secondary was offline longer than the oplog retention window
- Write volume increased enough that the oplog rolled past the missing entries
- Disk or network issues delayed replication until the node fell too far behind
- The oplog was sized too small for normal recovery windows
Step-by-Step Fix
1. Confirm the exact replica set status

Check both the heartbeat message and the current member state before deciding on a re-sync.

```
rs.status()
rs.printSecondaryReplicationInfo()
```

2. Measure the oplog window on the primary

The oplog must cover the period the secondary missed. If it does not, catch-up cannot succeed.

```
use local
db.oplog.rs.find().sort({ $natural: 1 }).limit(1)
db.oplog.rs.find().sort({ $natural: -1 }).limit(1)
```

3. Decide whether the secondary can recover or must be resynced

If the missing range is outside the oplog window, do not keep waiting for normal replication.

```
rs.printReplicationInfo()
rs.printSecondaryReplicationInfo()
```

4. Resync the stale member cleanly

Remove the stale data path or rebuild the member using initial sync, then watch it rejoin from scratch.

```
systemctl stop mongod
mv /var/lib/mongo /var/lib/mongo.stale.$(date +%s)
mkdir -p /var/lib/mongo
chown -R mongod:mongod /var/lib/mongo
systemctl start mongod
```

Prevention
- Size the oplog for realistic outage and maintenance windows
- Watch replication lag and oplog window together, not as separate metrics
- Investigate secondaries that stay behind before they cross the stale threshold
- Reevaluate oplog size after workload growth or migration events
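Sizing the oplog for realistic windows is back-of-envelope arithmetic: the retention window is roughly oplog size divided by the rate at which the workload generates oplog data. The sketch below is a hypothetical helper, not a MongoDB API; the write rate and window figures are illustrative, and a headroom factor is assumed to absorb spikes.

```python
def required_oplog_gb(write_gb_per_hour, target_window_hours, headroom=1.5):
    # Size the oplog so it retains at least target_window_hours of writes,
    # with extra headroom for write spikes (assumed 50% here).
    return write_gb_per_hour * target_window_hours * headroom

# e.g. 2 GB/h of oplog churn and a 24 h outage/maintenance window:
print(required_oplog_gb(2, 24))  # 72.0
```

If the current oplog is smaller than this estimate, resize it (on modern MongoDB versions, the replSetResizeOplog admin command changes oplog size without restarting the member) before the next outage forces another full resync.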