Introduction
The oplog (operations log) is the heartbeat of MongoDB replication. It records every write operation in a capped collection (`local.oplog.rs`) that secondaries tail to stay consistent with the primary. When oplog errors occur, replication can halt entirely, secondaries can fall irrecoverably behind, and emergency intervention may be needed to restore replica set integrity.
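For orientation, each oplog entry is a BSON document describing one operation. A minimal illustrative shape (the field values below are hypothetical) looks like:

```javascript
// Illustrative oplog entry shape (hypothetical values).
// ts = operation timestamp, op = operation type ("i" insert, "u" update, "d" delete),
// ns = namespace (database.collection), o = the operation document.
const oplogEntry = {
  ts: { t: 1700000000, i: 1 }, // Timestamp(seconds, increment)
  op: "i",
  ns: "shop.orders",
  o: { _id: 1, item: "widget", qty: 2 }
};

console.log(oplogEntry.op, oplogEntry.ns);
```

Secondaries replay these entries in order; every error discussed below is ultimately about entries like this being missing, oversized, or unreadable.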
Symptoms
Oplog errors manifest with distinct patterns:
```text
# Oplog exhaustion (secondary too far behind)
Error: Oplog CursorMinKeyNotFound
Replication halted: oplog no longer contains required entries
MongoServerError: cannot sync, oplog too far behind

# Oplog corruption
WiredTiger error reading from oplog
BSONObjectTooLarge: oplog entry exceeds size limit

# Oplog query errors
CursorNotFound: oplog cursor expired
OperationFailed: oplog query failed

# In logs
{"msg":"Replication halt","attr":{"reason":"Oplog Position Lost"}}
{"msg":"Secondary cannot catch up","attr":{"minValid":{"$timestamp":{"t":100,"i":1}}}}

# In rs.status()
"syncSourceHost": "",
"lastHeartbeatMessage": "error RS102 too stale to catch up",
"optimeDate": significantly behind primary
```
Common Causes
1. Oplog size too small - Retention window shorter than secondary downtime
2. Secondary extended downtime - Offline longer than oplog coverage
3. High write volume - Oplog fills faster than secondaries consume
4. Oplog corruption - WiredTiger corruption in oplog collection
5. Large transactions - Single oplog entry exceeds 16MB BSON limit
6. Network instability - Intermittent connectivity causing cursor expiration
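Causes 1-3 are all variations of one arithmetic problem: the oplog's retention window is its size divided by the rate at which writes churn through it. A minimal sketch (assuming a steady churn rate, which real bursty workloads will undershoot):

```javascript
// Rough oplog retention window estimate. Treat the result as a lower
// bound: write spikes shrink the window without warning.
function oplogWindowHours(maxSizeGB, churnGBPerHour) {
  if (churnGBPerHour <= 0) throw new Error("churn rate must be positive");
  return maxSizeGB / churnGBPerHour;
}

// A 10 GB oplog filling at 0.5 GB/hour covers about 20 hours:
console.log(oplogWindowHours(10, 0.5)); // → 20
```

If a secondary is offline longer than this window, it becomes "too stale" and must be resynced.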
Step-by-Step Fix
Step 1: Diagnose Oplog State
Check oplog status on primary:
```javascript
// Connect to the primary: mongosh --host primary:27017
use local

// Check oplog size and stats
db.oplog.rs.stats()

// Calculate time window covered
let first = db.oplog.rs.find().sort({ ts: 1 }).limit(1).next()
let last = db.oplog.rs.find().sort({ ts: -1 }).limit(1).next()
let hoursCovered = (last.ts.t - first.ts.t) / 3600
print("Oplog covers: " + hoursCovered.toFixed(2) + " hours")

// Check oplog size
print("Size: " + (db.oplog.rs.stats().size / 1024 / 1024 / 1024).toFixed(2) + " GB")
print("Max size: " + (db.oplog.rs.stats().maxSize / 1024 / 1024 / 1024).toFixed(2) + " GB")
```
Check secondary's required position:
```javascript
// Connect to the secondary: mongosh --host secondary:27017
use local

// What oplog position the secondary has applied
db.oplog.rs.find().sort({ ts: -1 }).limit(1).next()

// Check whether the required entries still exist on the primary:
// compare the secondary's lastApplied with the primary's oldest entry
```
Step 2: Check Oplog Exhaustion
When a secondary is "too stale":
```javascript
// On primary - check oldest oplog entry
use local
let oldest = db.oplog.rs.find().sort({ ts: 1 }).limit(1).next()
printjson(oldest.ts)

// On secondary - check needed position
// Look for "lastApplied" or "minValid" in rs.status()
rs.status().members.find(m => m.name.includes("secondary"))
```
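With both timestamps in hand, the staleness comparison can be sketched as a small helper (pure JavaScript, with timestamps as `{ t, i }` pairs mirroring mongosh's `Timestamp(seconds, increment)`):

```javascript
// Returns true when the secondary's last applied entry is older than the
// primary's oldest retained oplog entry, i.e. the secondary is too stale.
function isTooStale(secondaryLastApplied, primaryOldest) {
  if (secondaryLastApplied.t !== primaryOldest.t) {
    return secondaryLastApplied.t < primaryOldest.t;
  }
  // Same second: compare the increment counter
  return secondaryLastApplied.i < primaryOldest.i;
}

console.log(isTooStale({ t: 1234567890, i: 1 }, { t: 1234567900, i: 1 })); // → true
```

A `true` result means the primary has already discarded entries the secondary still needs, so only a resync (Step 4) can recover it.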
If the secondary needs entries older than the primary's oldest retained entry, it is too stale:

```text
Secondary needs:  { ts: Timestamp(1234567890, 1) }
Primary's oldest: { ts: Timestamp(1234567900, 1) }
# Secondary's position is 10 seconds older than the primary's oldest entry = stale
```

Step 3: Resize Oplog (Immediate Fix)
Increase oplog size to extend retention window:
```javascript
// On primary (MongoDB 4.0+); size is given in megabytes
db.adminCommand({ replSetResizeOplog: 1, size: 10240 }) // 10 GB

// Check result
use local
db.oplog.rs.stats()

// This takes effect immediately without a restart
```
For older MongoDB versions:
```bash
# Stop primary (careful - this will trigger an election)
sudo systemctl stop mongod

# Edit mongod.conf
sudo nano /etc/mongod.conf

# Add the oplogSizeMB setting:
#   replication:
#     oplogSizeMB: 10240

# Restart
sudo systemctl start mongod
```
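Choosing the new size is the same arithmetic as the retention window, run in reverse. A sketch (the 1.5x safety factor is an assumption, not an official recommendation - tune it to your spike tolerance):

```javascript
// Pick a replSetResizeOplog target (in MB) from a desired retention
// window and an observed churn rate, with headroom for write spikes.
function targetOplogSizeMB(hoursToCover, churnMBPerHour, safetyFactor = 1.5) {
  return Math.ceil(hoursToCover * churnMBPerHour * safetyFactor);
}

// Covering 24 hours at 256 MB/hour with a 1.5x margin:
console.log(targetOplogSizeMB(24, 256)); // → 9216
```

Measure the churn rate from the window calculation in Step 1 (current size divided by hours currently covered) rather than guessing.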
Step 4: Resync Stale Secondary
When resize doesn't help (secondary already too far behind):
```bash
# On the stale secondary
sudo systemctl stop mongod

# Remove all data (double-check the dbPath first!)
sudo rm -rf /var/lib/mongodb/*

# Restart - the member will perform an initial sync
sudo systemctl start mongod

# Monitor sync progress
mongosh --eval "rs.status()"
```
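Before committing to an initial sync, it is worth estimating how long it will take; the oplog window must out-last the sync, or the member goes stale again mid-sync. A back-of-envelope sketch (assumes copy time dominates; index builds add more):

```javascript
// Rough initial sync duration: data size divided by effective copy
// throughput (network or disk, whichever is slower).
function initialSyncHours(dataGB, throughputMBps) {
  const seconds = (dataGB * 1024) / throughputMBps;
  return seconds / 3600;
}

// 500 GB at an effective 100 MB/s is roughly 1.4 hours of copy time:
console.log(initialSyncHours(500, 100).toFixed(2)); // → "1.42"
```

If the estimate approaches the oplog window from Step 1, resize the oplog first or clone from a snapshot instead.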
Alternative: Clone from another secondary:
```bash
# Stop the stale secondary
sudo systemctl stop mongod

# Copy data FROM a healthy (stopped or fsync-locked) secondary TO this node
sudo rsync -avz secondary2:/var/lib/mongodb/ /var/lib/mongodb/
# Or use an LVM snapshot, mongodump, etc.

# Restart with the copied data
sudo systemctl start mongod
```
Step 5: Handle Oplog Corruption
Check for corruption:
```javascript
// Validate the oplog: mongosh --host primary:27017
use local
db.oplog.rs.validate()

// Check for errors in the output ("valid: true" means no corruption found)
```
If corruption found:
```bash
# Stop primary
sudo systemctl stop mongod

# Run repair
mongod --repair --dbpath /var/lib/mongodb

# Or more targeted: extract the oplog and recreate it.
# This is complex - consider resyncing the entire member instead.
```
Step 6: Handle Large Transactions
Find oversized oplog entries:
```javascript
use local
db.oplog.rs.find({
  $where: function() { return Object.bsonsize(this) > 16 * 1024 * 1024 }
}).forEach(o => {
  print("Large entry at " + o.ts.t + " size: " + Object.bsonsize(o))
})

// Typically caused by large array operations or multi-document updates
```
Prevent future large entries:
```javascript
// Avoid one huge multi-update that produces a massive oplog entry:
//   db.collection.updateMany({}, { $set: { field: "value" } })
// Split the work into smaller batches instead.

// Use bulk operations with bounded batch sizes
let bulk = db.collection.initializeOrderedBulkOp()
let count = 0
db.collection.find().forEach(doc => {
  bulk.find({ _id: doc._id }).updateOne({ $set: { field: "value" } })
  count++
  if (count % 1000 === 0) {
    bulk.execute()
    bulk = db.collection.initializeOrderedBulkOp()
  }
})
// Flush the final partial batch
if (count % 1000 !== 0) bulk.execute()
```
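The batching idea itself is independent of the driver API. A minimal, self-contained sketch of splitting a work list into fixed-size chunks (the list of ids here is hypothetical):

```javascript
// Generic batching helper: split items into fixed-size chunks so each
// update (and the oplog entry it generates) stays small.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const batches = chunk([1, 2, 3, 4, 5], 2);
console.log(batches.length); // → 3
```

Each chunk would then drive one bulk operation, keeping every oplog entry well under the BSON size limit.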
Step 7: Monitor Oplog Health
Set up ongoing monitoring:
```javascript
// Script to check oplog health
function checkOplogHealth() {
  let stats = db.oplog.rs.stats()
  let first = db.oplog.rs.find().sort({ ts: 1 }).limit(1).next()
  let last = db.oplog.rs.find().sort({ ts: -1 }).limit(1).next()

  let hours = (last.ts.t - first.ts.t) / 3600
  let usage = stats.size / stats.maxSize

  return {
    hoursCovered: hours,
    sizeGB: stats.size / 1e9,
    maxSizeGB: stats.maxSize / 1e9,
    percentUsed: usage * 100
  }
}

checkOplogHealth()
```
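Monitoring is only useful with alert thresholds attached. A sketch of the classification logic (the 24-hour and 8-hour cutoffs follow the recommendations in this guide; adjust them to your own downtime tolerance):

```javascript
// Classify an oplog window measurement into an alert level.
// Thresholds: < 8 hours critical, < 24 hours warning, otherwise ok.
function oplogAlertLevel(hoursCovered) {
  if (hoursCovered < 8) return "critical";
  if (hoursCovered < 24) return "warning";
  return "ok";
}

console.log(oplogAlertLevel(30)); // → "ok"
console.log(oplogAlertLevel(5));  // → "critical"
```

Feed it `checkOplogHealth().hoursCovered` on a schedule and page on "critical" before a secondary actually goes stale.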
Verification
Verify oplog functioning:
```javascript
// 1. Oplog window sufficient (> 24 hours recommended)
use local
let newest = db.oplog.rs.find().sort({ ts: -1 }).limit(1).next().ts.t
let oldest = db.oplog.rs.find().sort({ ts: 1 }).limit(1).next().ts.t
print("Oplog window: " + ((newest - oldest) / 3600) + " hours")

// 2. No corruption
db.oplog.rs.validate()

// 3. Secondaries catching up
rs.status() // All secondaries should have a recent optimeDate

// 4. Replication lag minimal
rs.printSecondaryReplicationInfo() // Should show lag < 10 seconds
// (rs.printSlaveReplicationInfo() on older versions)

// 5. No oversized entries
db.oplog.rs.find({ $where: "Object.bsonsize(this) > 16777216" }).count() // Should be 0
```
Common Pitfalls
- Sizing the oplog by available disk rather than write rate - Size must match write volume and downtime tolerance
- Not monitoring oplog window - Can silently shrink during traffic spikes
- Resyncing during peak hours - Initial sync consumes resources heavily
- Forgetting to resize after capacity planning - Default size may be insufficient
- Using initial sync for all recoveries - Sometimes cloning is faster
Best Practices
- Size oplog to cover at least 24-72 hours of operations
- Monitor oplog window with alerts at < 8 hours remaining
- Plan for write spikes when sizing oplog
- Document recovery procedures for stale secondary scenarios
- Test oplog resize procedure before needing it
- Use point-in-time recovery for critical data protection
- Schedule initial sync during low-traffic windows
Related Issues
- MongoDB Replica Set Error
- MongoDB Initial Sync Failed
- MongoDB Chunk Migration Error
- MongoDB WiredTiger Error