Introduction

MongoDB secondaries can only catch up if the primary's oplog still contains the operations they missed. When a secondary has been offline too long, is severely lagged, or the oplog is too small for the current write volume, the member reports RS102 with a message like "too stale to catch up" in lastHeartbeatMessage. At that point, normal replication catch-up is no longer possible and the member must be resynced from scratch.

Symptoms

  • rs.status() shows lastHeartbeatMessage with too stale to catch up
  • The secondary remains in the RECOVERING state or shows as not healthy
  • Replication lag was high before the member dropped out
  • The problem started after a long outage, disk issue, or heavy write spike

Common Causes

  • The secondary was offline longer than the oplog retention window
  • Write volume increased enough that the oplog rolled past the missing entries
  • Disk or network issues delayed replication until the node fell too far behind
  • The oplog was sized too small for normal recovery windows

Step-by-Step Fix

  1. Confirm the exact replica set status. Check both the heartbeat message and the current member state before deciding on a re-sync.

```javascript
rs.status()
rs.printSecondaryReplicationInfo()
```
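To see what to look for in that output, here is a minimal sketch that picks the stale members out of an rs.status()-shaped document. The sample object and the findStaleMembers helper are illustrative, not real server output; the stateStr and lastHeartbeatMessage fields mirror the ones rs.status() reports per member.

```javascript
// Sketch: filter an rs.status()-like document down to members that are
// stuck in RECOVERING or report a "too stale" heartbeat message.
// findStaleMembers is a hypothetical helper, not a mongosh built-in.
function findStaleMembers(status) {
  return status.members.filter(
    (m) =>
      m.stateStr === "RECOVERING" ||
      /too stale/i.test(m.lastHeartbeatMessage || "")
  );
}

// Illustrative sample shaped like rs.status() output.
const sampleStatus = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY", lastHeartbeatMessage: "" },
    {
      name: "db2:27017",
      stateStr: "RECOVERING",
      lastHeartbeatMessage: "too stale to catch up",
    },
  ],
};

console.log(findStaleMembers(sampleStatus).map((m) => m.name));
// → [ 'db2:27017' ]
```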
  2. Measure the oplog window on the primary. The oplog must cover the period the secondary missed; if it does not, catch-up cannot succeed.

```javascript
use local
db.oplog.rs.find().sort({ $natural: 1 }).limit(1)
db.oplog.rs.find().sort({ $natural: -1 }).limit(1)
```
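The window is simply the time between those two entries. A minimal sketch of the arithmetic, assuming plain { t: seconds } objects stand in for the BSON Timestamps the real oplog documents carry in their ts field:

```javascript
// Sketch: oplog window in hours, from the first (oldest retained) and
// last (newest) oplog entries. In real oplog documents, ts is a BSON
// Timestamp whose t field is seconds since the epoch.
function oplogWindowHours(firstEntry, lastEntry) {
  return (lastEntry.ts.t - firstEntry.ts.t) / 3600;
}

const first = { ts: { t: 1700000000 } }; // oldest retained operation
const last = { ts: { t: 1700086400 } };  // newest operation, 24 h later
console.log(oplogWindowHours(first, last)); // → 24
```

rs.printReplicationInfo() reports the same figure directly as "log length start to end".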
  3. Decide whether the secondary can recover or must be resynced. If the missing range is outside the oplog window, do not keep waiting for normal replication.

```javascript
rs.printReplicationInfo()
rs.printSecondaryReplicationInfo()
```
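The decision itself reduces to one comparison: the secondary can catch up only if its last applied optime is at or after the oldest entry still retained in the primary's oplog. A sketch with epoch-second numbers standing in for optimes (needsResync is a hypothetical helper):

```javascript
// Sketch: a secondary whose last applied optime predates the oldest
// retained oplog entry has missed operations that no longer exist,
// so only a full resync can recover it.
function needsResync(secondaryLastAppliedSec, oldestOplogEntrySec) {
  return secondaryLastAppliedSec < oldestOplogEntrySec;
}

console.log(needsResync(1700000000, 1700040000)); // → true  (fell outside the window)
console.log(needsResync(1700050000, 1700040000)); // → false (still inside the window)
```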
  4. Resync the stale member cleanly. Remove the stale data path (or rebuild the member) so the node performs a fresh initial sync on restart, then watch it rejoin from scratch.

```bash
systemctl stop mongod
mv /var/lib/mongo /var/lib/mongo.stale.$(date +%s)
mkdir -p /var/lib/mongo
chown -R mongod:mongod /var/lib/mongo
systemctl start mongod
```
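While the member rebuilds, its stateStr in rs.status() normally moves through STARTUP2 (initial sync in progress) before reaching SECONDARY. A small sketch of that interpretation, with resyncPhase as a hypothetical helper:

```javascript
// Sketch: classify a member's stateStr from rs.status() while it
// rejoins after a clean resync.
function resyncPhase(stateStr) {
  if (stateStr === "STARTUP2") return "initial sync running";
  if (stateStr === "SECONDARY") return "caught up";
  return "check member health";
}

console.log(resyncPhase("STARTUP2"));  // → initial sync running
console.log(resyncPhase("SECONDARY")); // → caught up
```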

Prevention

  • Size the oplog for realistic outage and maintenance windows
  • Watch replication lag and oplog window together, not as separate metrics
  • Investigate secondaries that stay behind before they cross the stale threshold
  • Reevaluate oplog size after workload growth or migration events
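For the sizing bullet, a back-of-envelope sketch: take the oplog churn rate you observe (derivable from the size and window that rs.printReplicationInfo() reports), multiply by the longest outage you want to survive, and add a safety factor for write spikes. The function name and the factor of 2 are assumptions for illustration, not a MongoDB recommendation.

```javascript
// Sketch: minimum oplog size needed to cover an outage window,
// given observed churn in GB per hour. safetyFactor pads for spikes.
function minOplogSizeGB(gbPerHour, outageWindowHours, safetyFactor = 2) {
  return gbPerHour * outageWindowHours * safetyFactor;
}

console.log(minOplogSizeGB(1.5, 24)); // → 72
```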