MongoDB Replica Set Election Timeout Causing Write Unavailability

How to diagnose and recover from MongoDB replica set election timeouts that cause temporary write unavailability.

Introduction

MongoDB replica sets elect a primary using a consensus protocol based on Raft. When the primary becomes unreachable, the remaining members hold an election. If no candidate can win within the election timeout (typically because of a network partition, too few reachable voting members, or slow disk I/O), the replica set has no primary and all writes fail until a new primary is elected.

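Symptoms

- Write operations fail with `NotWritablePrimary` (reported as `NotMaster` by older servers and drivers); reads sent to a secondary without a secondary-capable read preference fail with `NotPrimaryNoSecondaryOk`
- `rs.status()` shows no member with `"stateStr": "PRIMARY"`; members may be in `SECONDARY` or `RECOVERING`, or be unreachable
- MongoDB logs contain election timeout or failed election messages
- Every connection in the application's connection pool returns the same not-primary error
- No entry in `rs.status().members` carries an `electionDate`, or the most recent one is stale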
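
The status checks above can be condensed into a small helper. This is a minimal sketch, not an official utility: `summarizeReplicaSet` and the sample data are hypothetical names introduced for illustration, and the function only assumes the `members[].name`, `members[].stateStr`, and `members[].health` fields that `rs.status()` returns.

```javascript
// Summarize an rs.status()-shaped document: which member is primary
// (if any) and which members are reported as unhealthy.
function summarizeReplicaSet(status) {
  const members = status.members || [];
  const primary = members.find((m) => m.stateStr === "PRIMARY");
  const unhealthy = members.filter((m) => m.health !== 1).map((m) => m.name);
  return {
    primary: primary ? primary.name : null, // null means no member is PRIMARY
    writable: Boolean(primary),
    unhealthy,
  };
}

// Hypothetical sample data standing in for real rs.status() output.
const sampleStatus = {
  members: [
    { name: "db1:27017", stateStr: "(not reachable/healthy)", health: 0 },
    { name: "db2:27017", stateStr: "SECONDARY", health: 1 },
    { name: "db3:27017", stateStr: "RECOVERING", health: 1 },
  ],
};

console.log(summarizeReplicaSet(sampleStatus));
```

In mongosh, the same function could be called on live data with `summarizeReplicaSet(rs.status())`. Note that the result reflects only the view of the member you are connected to.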

Common Causes

- Network partition isolating the primary from a majority of voters
- Replica set with only 2 members (no majority is possible if one fails)
- Slow disk on voting members causing heartbeat timeouts
- `electionTimeoutMillis` set too low for the network latency
- All remaining members have stale data and cannot win an election
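
The majority arithmetic behind the two-member problem is easy to check by hand. The sketch below uses two hypothetical helper names, `votesNeeded` and `faultTolerance`, and assumes every member counted is a voting member.

```javascript
// Votes required for a strict majority among `votingMembers` voters.
function votesNeeded(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// Number of voters that can be lost while an election can still succeed.
function faultTolerance(votingMembers) {
  return votingMembers - votesNeeded(votingMembers);
}

for (const n of [2, 3, 4, 5]) {
  console.log(
    `${n} voters: need ${votesNeeded(n)}, tolerate ${faultTolerance(n)} failure(s)`
  );
}
```

A two-member set needs both votes and tolerates no failures, while three members tolerate one. Four members still tolerate only one, which is why adding an even-numbered member does not improve availability.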

Step-by-Step Fix

1. Check the current replica set status:
   ```javascript
   rs.status()
   // Look for:
   // - Members in SECONDARY, RECOVERING, or UNKNOWN state
   // - "stateStr" values
   // - "health" field (1 = healthy, 0 = unreachable)
   ```

2. Check election candidate status:
   ```javascript
   // Print one line per member: name, state, health, and last applied optime
   rs.status().members.forEach(function (m) {
     print(m.name + " | state: " + m.stateStr + " | health: " + m.health +
           " | optime: " + JSON.stringify(m.optimeDate));
   });
   ```
3. Force a reconfiguration if a majority cannot be reached. Treat this as a last resort, because a forced reconfig can cause majority-committed writes to be rolled back:
   ```javascript
   // Connect to a surviving secondary
   cfg = rs.conf();
   // Remove unreachable members temporarily
   cfg.members = cfg.members.filter(function (m) {
     return m.host !== "unreachable-node:27017";
   });
   cfg.version++;
   rs.reconfig(cfg, { force: true });
   ```
4. Raise the election timeout for high-latency networks:
   ```javascript
   // Run on the primary: a non-forced reconfig requires one
   cfg = rs.conf();
   cfg.settings.electionTimeoutMillis = 20000; // Default is 10000
   cfg.settings.heartbeatTimeoutSecs = 15; // Default is 10
   rs.reconfig(cfg);
   ```
5. Step down the current primary gracefully if it is partially degraded:
   ```javascript
   // On the primary
   rs.stepDown(120, 60) // Step down for 120s, give 60s for others to catch up
   ```
6. Restart a member that is stuck in RECOVERING state:
   ```bash
   sudo systemctl restart mongod
   # Then check if it rejoins
   mongosh --eval "rs.status()"
   ```
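
For the forced reconfiguration step above, the list of members to drop can be derived from the status document instead of being hardcoded. This is a sketch under stated assumptions: `withoutUnreachable` and the sample documents are hypothetical, and the function relies on `rs.status().members[].name` matching `rs.conf().members[].host`.

```javascript
// Return a copy of `cfg` without the members that `status` reports
// as unreachable (health === 0). The input config is not mutated.
function withoutUnreachable(cfg, status) {
  const down = new Set(
    status.members.filter((m) => m.health === 0).map((m) => m.name)
  );
  return { ...cfg, members: cfg.members.filter((m) => !down.has(m.host)) };
}

// Hypothetical sample documents standing in for rs.conf() and rs.status().
const sampleCfg = {
  _id: "rs0",
  version: 4,
  members: [
    { _id: 0, host: "db1:27017" },
    { _id: 1, host: "db2:27017" },
    { _id: 2, host: "db3:27017" },
  ],
};
const sampleStatus = {
  members: [
    { name: "db1:27017", health: 0 },
    { name: "db2:27017", health: 1 },
    { name: "db3:27017", health: 1 },
  ],
};

console.log(withoutUnreachable(sampleCfg, sampleStatus).members.map((m) => m.host));
// In mongosh, after reviewing the result by hand:
//   rs.reconfig(withoutUnreachable(rs.conf(), rs.status()), { force: true });
```

Printing and reviewing the filtered member list before passing it to `rs.reconfig()` is a deliberate choice here: a member that looks unreachable from one node may still be healthy from another.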

Prevention

- Use odd-numbered replica set sizes (3, 5, 7) to ensure majority voting
- Deploy members across at least 3 failure domains (availability zones)
- Add an arbiter only when cost constraints prevent a full third data-bearing member
- Monitor replica set heartbeats and alert on `pingMs` exceeding 50% of `electionTimeoutMillis`
- Set `electionTimeoutMillis` to at least 3x the typical network RTT
- Use `writeConcern: { w: "majority" }` for critical writes to detect unavailability early
- Regularly test failover by running `rs.stepDown()` during maintenance windows
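
The `pingMs` alert above could be prototyped as follows. This is only a sketch: `slowHeartbeats` is a hypothetical helper, the 50% threshold comes from the guideline above, and the sample values are invented for illustration.

```javascript
// List members whose heartbeat round trip (pingMs) exceeds a fraction
// of the election timeout. Members without a numeric pingMs (such as
// the member you are connected to) are skipped.
function slowHeartbeats(status, electionTimeoutMillis, fraction = 0.5) {
  const limitMs = electionTimeoutMillis * fraction;
  return status.members
    .filter((m) => typeof m.pingMs === "number" && m.pingMs > limitMs)
    .map((m) => ({ name: m.name, pingMs: m.pingMs, limitMs }));
}

// Hypothetical sample data standing in for real rs.status() output.
const sampleStatus = {
  members: [
    { name: "db1:27017" }, // self: no pingMs reported
    { name: "db2:27017", pingMs: 12 },
    { name: "db3:27017", pingMs: 6200 },
  ],
};

console.log(slowHeartbeats(sampleStatus, 10000));
```

In mongosh, a call such as `slowHeartbeats(rs.status(), rs.conf().settings.electionTimeoutMillis)` would apply the same check to the live replica set.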