## Introduction

MongoDB replica sets use a Raft-based consensus protocol to elect a primary node. When the primary becomes unreachable, the remaining members hold an election, and a candidate needs votes from a majority of the voting members to win. If the election times out (typically because of a network partition, too few reachable voting members, or slow disk I/O), the replica set has no primary and all writes fail until a new primary is elected.
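## Symptoms

- `NotWritablePrimary` errors on all write operations (reads sent to a secondary without a suitable read preference fail with `NotPrimaryNoSecondaryOk`)
- `rs.status()` shows `"stateStr": "RECOVERING"` for members, or no member in the `PRIMARY` state
- MongoDB logs show `election timeout` or `election failed` messages
- Every connection in the application's connection pool returns `NotMaster` errors (the legacy name for `NotWritablePrimary`, still reported by older servers and drivers)
- The `electionDate` field in the primary's entry of `rs.status().members` is missing or stale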
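The `rs.status()` symptoms above can be checked in one pass over the document that command returns. The following is a minimal sketch, not a MongoDB API: the function name `summarizeReplicaSet`, the shape of its return value, and the sample hostnames are all assumptions made for illustration. It reads only the `members`, `name`, `stateStr`, and `health` fields.

```javascript
// Hypothetical helper (not a MongoDB API): summarizes an rs.status() document.
// It only reads a plain object, so it runs in mongosh or in plain Node.js.
function summarizeReplicaSet(status) {
  const members = status.members || [];
  const primary = members.find((m) => m.stateStr === "PRIMARY");
  return {
    hasPrimary: primary !== undefined,
    primary: primary ? primary.name : null,
    // health is 1 for reachable members and 0 for unreachable ones
    unreachable: members.filter((m) => m.health !== 1).map((m) => m.name),
    recovering: members
      .filter((m) => m.stateStr === "RECOVERING")
      .map((m) => m.name),
  };
}

// Made-up input mimicking a three-member set that has lost its primary.
const sampleStatus = {
  members: [
    { name: "db1.example.net:27017", stateStr: "(not reachable/healthy)", health: 0 },
    { name: "db2.example.net:27017", stateStr: "SECONDARY", health: 1 },
    { name: "db3.example.net:27017", stateStr: "RECOVERING", health: 1 },
  ],
};

console.log(summarizeReplicaSet(sampleStatus));
```

In mongosh, the same function could be called as `summarizeReplicaSet(rs.status())` on any reachable member.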
## Common Causes

- A network partition isolates the primary from a majority of voters
- The replica set has only two members, so no majority is possible if one fails
- Slow disks on voting members cause heartbeat timeouts
- `electionTimeoutMillis` is set too low for the network latency
- All remaining members have stale data and cannot win an election
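The majority requirement behind the first two causes is simple arithmetic, sketched below. The helper names `votesNeeded`, `faultTolerance`, and `canElectPrimary` are hypothetical, not MongoDB APIs. The sketch assumes that `rs.conf()` members carry `host` and `votes` fields and that `rs.status()` members carry `name` and `health`. A reachable majority is necessary for an election but not sufficient, since priorities and stale data can still block a candidate.

```javascript
// Hypothetical helpers (not MongoDB APIs) for the majority arithmetic.

// Votes a candidate needs to win among `votingMembers` voters.
function votesNeeded(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// Number of voting members that can be lost while a majority still exists.
function faultTolerance(votingMembers) {
  return votingMembers - votesNeeded(votingMembers);
}

// Given rs.conf()-style and rs.status()-style documents, report whether the
// reachable voters still form a majority.
function canElectPrimary(conf, status) {
  // A member votes unless its votes field is explicitly 0.
  const voters = conf.members.filter((m) => m.votes !== 0).map((m) => m.host);
  const healthy = new Set(
    status.members.filter((m) => m.health === 1).map((m) => m.name)
  );
  const reachableVoters = voters.filter((host) => healthy.has(host)).length;
  const needed = votesNeeded(voters.length);
  return {
    voters: voters.length,
    reachableVoters: reachableVoters,
    needed: needed,
    electable: reachableVoters >= needed,
  };
}

// 2 voters tolerate 0 failures, 3 tolerate 1, 5 tolerate 2.
for (const n of [2, 3, 5]) {
  console.log(n + " voters: need " + votesNeeded(n) +
    " votes, tolerate " + faultTolerance(n) + " failure(s)");
}
```

In mongosh, `canElectPrimary(rs.conf(), rs.status())` would apply the same check to a live set. The loop shows why a two-member set cannot survive the loss of either member.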
## Step-by-Step Fix

1. **Check the current replica set status**:

   ```javascript
   rs.status()
   // Look for:
   // - Members in SECONDARY, RECOVERING, or UNKNOWN state
   // - "stateStr" values
   // - "health" field (1 = healthy, 0 = unreachable)
   ```
2. **Check election candidate status**:

   ```javascript
   rs.status().members.forEach(function (m) {
     print(m.name + " | state: " + m.stateStr + " | health: " + m.health +
       " | optime: " + JSON.stringify(m.optimeDate));
   });
   ```
3. **Force reconfigure if a majority cannot be reached**:

   ```javascript
   // Connect to a surviving secondary
   cfg = rs.conf();
   // Remove unreachable members temporarily
   cfg.members = cfg.members.filter(function (m) {
     return m.host !== "unreachable-node:27017";
   });
   cfg.version++;
   // force: true bypasses the majority requirement and can cause committed
   // writes to roll back, so use it only when no primary can be elected
   rs.reconfig(cfg, { force: true });
   ```
4. **Adjust election timeout for high-latency networks**:

   ```javascript
   // Run on the primary (a non-forced reconfig requires one)
   cfg = rs.conf();
   cfg.settings.electionTimeoutMillis = 20000; // Default is 10000
   cfg.settings.heartbeatTimeoutSecs = 15; // Default is 10
   rs.reconfig(cfg);
   ```
5. **Step down the current primary gracefully if it is partially degraded**:

   ```javascript
   // On the primary
   rs.stepDown(120, 60) // Step down for 120s, give 60s for others to catch up
   ```
6. **Restart a member that is stuck in RECOVERING state**:

   ```bash
   sudo systemctl restart mongod
   # Then check if it rejoins
   mongosh --eval "rs.status()"
   ```
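Before forcing a reconfiguration or restarting members, it can help to see which surviving secondary holds the most recent data, because a member with stale data is unlikely to win an election. The sketch below ranks reachable secondaries by the `optimeDate` field that `rs.status()` reports for each member. The function name `rankCandidates`, the `lagSecs` output field, and the sample hostnames and timestamps are assumptions for illustration, not part of MongoDB.

```javascript
// Hypothetical helper (not a MongoDB API): orders reachable secondaries from
// most to least up to date, using the optimeDate field of rs.status().
function rankCandidates(status) {
  return status.members
    .filter((m) => m.health === 1 && m.stateStr === "SECONDARY")
    .map((m) => ({ name: m.name, optimeDate: new Date(m.optimeDate) }))
    .sort((a, b) => b.optimeDate - a.optimeDate)
    .map((m, i, all) => ({
      name: m.name,
      // Seconds behind the most recent optime among the candidates
      lagSecs: (all[0].optimeDate - m.optimeDate) / 1000,
    }));
}

// Made-up input: one unreachable member and two secondaries 6 seconds apart.
const candidateStatus = {
  members: [
    { name: "db1.example.net:27017", stateStr: "(not reachable/healthy)",
      health: 0, optimeDate: "1970-01-01T00:00:00Z" },
    { name: "db2.example.net:27017", stateStr: "SECONDARY",
      health: 1, optimeDate: "2024-01-01T00:00:10Z" },
    { name: "db3.example.net:27017", stateStr: "SECONDARY",
      health: 1, optimeDate: "2024-01-01T00:00:04Z" },
  ],
};

rankCandidates(candidateStatus).forEach((c) => {
  console.log(c.name + " | lag: " + c.lagSecs + "s");
});
```

In mongosh, `rankCandidates(rs.status())` would produce the same ranking for a live set. The first entry is the member least likely to lose data if it becomes primary.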