## Introduction

The MongoDB balancer automatically migrates chunks between shards to keep data evenly distributed. When the balancer gets stuck (often because of a failed migration, a config server inconsistency, or an active maintenance lock), the cluster can become severely imbalanced, with one shard handling a disproportionate share of read and write traffic.

## Symptoms

- `sh.isBalancerRunning()` returns `true` for hours without completing
- Chunk distribution is heavily skewed across shards
- Balancer log shows repeated `migration failed` for the same chunk
- `config.locks` shows a balancer lock that is not being released
- `mongos` logs show `balancer: could not acquire balancer lock`

## Common Causes

- A previous migration left a chunk in a transitional state or flagged as `jumbo`
- The config server replica set is not healthy, which prevents lock management
- The balancer window is too narrow to leave enough time for migrations to complete
- A network partition exists between `mongos` and the config servers
- A manual `moveChunk` operation is conflicting with the balancer

## Step-by-Step Fix

1. **Check balancer state and current migrations**:

```javascript
sh.isBalancerRunning()
sh.getBalancerState()

// Check for active migrations
db.getSiblingDB("config").locks.find({ _id: "balancer" })

// Check chunk distribution
db.getSiblingDB("config").chunks.aggregate([
  { $group: { _id: "$shard", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])
```
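To put a number on "heavily skewed", the per-shard counts returned by the chunk distribution query can be summarized. The following is a minimal sketch: `summarizeChunkSkew` and the sample counts are hypothetical and not part of MongoDB, and what counts as too much spread is left to the operator.

```javascript
// Hypothetical helper: summarize per-shard chunk counts.
// Input shape matches the $group output: [{ _id: "<shard>", count: <n> }, ...]
function summarizeChunkSkew(counts) {
  if (counts.length === 0) {
    return { total: 0, spread: 0, most: null, least: null };
  }
  // Sort a copy so the caller's array is left untouched.
  const sorted = [...counts].sort((a, b) => b.count - a.count);
  const most = sorted[0];
  const least = sorted[sorted.length - 1];
  const total = sorted.reduce((sum, s) => sum + s.count, 0);
  return {
    total,
    spread: most.count - least.count, // gap between fullest and emptiest shard
    most: most._id,
    least: least._id,
  };
}

// Sample data, made up for illustration.
const sampleCounts = [
  { _id: "shard1", count: 120 },
  { _id: "shard2", count: 30 },
  { _id: "shard3", count: 30 },
];
console.log(summarizeChunkSkew(sampleCounts));
```

In a shell session, the same function could be fed the live result by calling `.toArray()` on the aggregation cursor.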

2. **Stop and restart the balancer**:

```javascript
// Stop the balancer
sh.stopBalancer()

// Verify it stopped
sh.getBalancerState()   // Should be false
sh.isBalancerRunning()  // Should be false

// Clear any stale migration state (only once the balancer is confirmed stopped)
db.getSiblingDB("config").locks.deleteOne({ _id: "balancer" })

// Restart the balancer
sh.startBalancer()
```
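Before touching the lock document, it helps to be sure the balancer has actually gone idle. The sketch below polls a check until it passes or a timeout expires. `waitForBalancerIdle` is a hypothetical helper; the check and the sleep function are injected, so the logic can be tried without a cluster.

```javascript
// Hypothetical helper: poll until the balancer reports idle or the timeout expires.
// `isRunning` returns true while a balancing round is in progress.
// `sleepFn` pauses for the given number of milliseconds.
function waitForBalancerIdle(isRunning, sleepFn, options) {
  const intervalMs = (options && options.intervalMs) || 5000;
  const timeoutMs = (options && options.timeoutMs) || 600000;
  let waitedMs = 0;
  while (isRunning()) {
    if (waitedMs >= timeoutMs) {
      return { idle: false, waitedMs };
    }
    sleepFn(intervalMs);
    waitedMs += intervalMs;
  }
  return { idle: true, waitedMs };
}

// Stubbed demo: report "running" twice, then idle.
let remaining = 2;
const result = waitForBalancerIdle(
  () => remaining-- > 0,
  () => {}, // no real sleeping in the demo
  { intervalMs: 1000, timeoutMs: 10000 }
);
console.log(result);
```

In a shell connected to `mongos`, the shell's built-in `sleep` could be passed as `sleepFn`. Depending on the shell version, `sh.isBalancerRunning()` may return a status document rather than a boolean, in which case the check would need to read a field such as `inBalancerRound` instead.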

3. **Clear jumbo chunks that block migration**:

```javascript
db.getSiblingDB("config").chunks.updateMany(
  { jumbo: true },
  { $unset: { jumbo: "" } }
);
```
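As an alternative to editing `config.chunks` directly, newer server versions provide a `clearJumboFlag` admin command that clears the flag for one chunk at a time. The sketch below only builds the command documents so they can be reviewed first. `buildClearJumboCommands` and the sample chunks are hypothetical, and whether the command is available depends on the server version.

```javascript
// Hypothetical helper: build one clearJumboFlag command per jumbo chunk.
// Each chunk is expected to carry `min` and `max` bounds, as in config.chunks.
function buildClearJumboCommands(namespace, chunks) {
  return chunks
    .filter((chunk) => chunk.jumbo === true)
    .map((chunk) => ({
      clearJumboFlag: namespace,
      bounds: [chunk.min, chunk.max],
    }));
}

// Sample chunk documents, made up for illustration.
const sampleChunks = [
  { min: { userId: 0 }, max: { userId: 1000 }, jumbo: true },
  { min: { userId: 1000 }, max: { userId: 2000 } },
];
const commands = buildClearJumboCommands("mydb.mycollection", sampleChunks);
console.log(JSON.stringify(commands, null, 2));

// In a shell connected to mongos, each command could then be run with:
// commands.forEach((cmd) => printjson(db.adminCommand(cmd)));
```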
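4. **Manually move chunks from overloaded shards**:

```javascript
// Identify chunks to move.
// Note: on MongoDB 5.0+, config.chunks is keyed by collection UUID rather than `ns`.
var chunks = db.getSiblingDB("config").chunks.find({
  ns: "mydb.mycollection",
  shard: "shard1"
}).limit(5);

// Move each chunk to the target shard, identifying it by its exact bounds
// (bounds also works for hashed shard keys, where `find` on chunk.min does not).
chunks.forEach(function(chunk) {
  db.adminCommand({
    moveChunk: "mydb.mycollection",
    bounds: [chunk.min, chunk.max],
    to: "shard2",
    _secondaryThrottle: true
  });
});
```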
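When moving several chunks by hand, it is usually safer to stop at the first failure than to keep issuing commands against a cluster that is rejecting them. The sketch below separates building the `moveChunk` commands from running them. Both helper names are hypothetical, and the runner is injected so the loop can be exercised with a stub.

```javascript
// Hypothetical helper: build moveChunk commands for a batch of chunks.
function buildMoveChunkCommands(namespace, chunks, targetShard) {
  return chunks.map((chunk) => ({
    moveChunk: namespace,
    bounds: [chunk.min, chunk.max],
    to: targetShard,
    _secondaryThrottle: true,
  }));
}

// Hypothetical helper: run commands in order and stop at the first failure.
// `runFn` is expected to return a result document with an `ok` field.
function runUntilFailure(commands, runFn) {
  let moved = 0;
  for (const cmd of commands) {
    const res = runFn(cmd);
    if (!res || res.ok !== 1) {
      return { moved, failedCommand: cmd, result: res };
    }
    moved += 1;
  }
  return { moved, failedCommand: null, result: null };
}

// Stubbed demo: the second move fails.
const demoChunks = [
  { min: { userId: 0 }, max: { userId: 1000 } },
  { min: { userId: 1000 }, max: { userId: 2000 } },
  { min: { userId: 2000 }, max: { userId: 3000 } },
];
const moveCommands = buildMoveChunkCommands("mydb.mycollection", demoChunks, "shard2");
let calls = 0;
const outcome = runUntilFailure(moveCommands, () => ({ ok: ++calls === 2 ? 0 : 1 }));
console.log(outcome.moved, JSON.stringify(outcome.failedCommand));
```

Against a real cluster, `runFn` could be `(cmd) => db.adminCommand(cmd)` in a shell connected to `mongos`. Note that some shell versions throw on command failure instead of returning `ok: 0`, so a `try`/`catch` around the call may also be needed.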

5. **Check config server replica set health**:

```javascript
// Connect the shell directly to a config server member (not mongos), then run:
use config
rs.status()
// Ensure all config servers are healthy and one is PRIMARY
```
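The "all healthy, one PRIMARY" check can be automated against the document that `rs.status()` returns. This is a minimal sketch: `checkReplicaSetHealth` and the host names in the sample are hypothetical, and only the `members[].name`, `members[].health`, and `members[].stateStr` fields are inspected.

```javascript
// Hypothetical helper: inspect an rs.status() result document.
function checkReplicaSetHealth(status) {
  const members = status.members || [];
  const primaries = members.filter((m) => m.stateStr === "PRIMARY");
  const unhealthy = members.filter((m) => m.health !== 1).map((m) => m.name);
  return {
    ok: primaries.length === 1 && unhealthy.length === 0,
    primary: primaries.length === 1 ? primaries[0].name : null,
    unhealthy,
  };
}

// Sample status, made up for illustration.
const sampleStatus = {
  members: [
    { name: "cfg1.example.net:27019", health: 1, stateStr: "PRIMARY" },
    { name: "cfg2.example.net:27019", health: 1, stateStr: "SECONDARY" },
    { name: "cfg3.example.net:27019", health: 0, stateStr: "(not reachable/healthy)" },
  ],
};
console.log(checkReplicaSetHealth(sampleStatus));
```

In a shell connected to a config server member, the live check would be `checkReplicaSetHealth(rs.status())`.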

## Prevention

- Monitor balancer state with automated checks every 5 minutes
- Keep the balancer window wide open (24/7) unless there is a specific reason to restrict it
- Monitor chunk distribution regularly with `sh.status()`
- Ensure the config server replica set has an odd number of members and is healthy
- Test balancer behavior during maintenance by running `sh.stopBalancer()` and `sh.startBalancer()`
- Use shard keys that distribute data evenly from the start
- Avoid manual `moveChunk` operations that can conflict with the balancer
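To judge whether a configured balancer window is too narrow, its length can be computed from the `activeWindow` setting, which stores `start` and `stop` as `HH:MM` strings in the `balancer` document of `config.settings`. The helper below is a hypothetical sketch, and the sample times are made up.

```javascript
// Hypothetical helper: length of a balancer window in minutes.
// Handles windows that wrap past midnight, e.g. "23:00" to "6:00".
// Identical start and stop times are treated as a full 24-hour window.
function balancerWindowMinutes(start, stop) {
  const toMinutes = (hhmm) => {
    const [hours, minutes] = hhmm.split(":").map(Number);
    return hours * 60 + minutes;
  };
  const diff = toMinutes(stop) - toMinutes(start);
  return diff > 0 ? diff : diff + 24 * 60;
}

console.log(balancerWindowMinutes("23:00", "6:00"));
console.log(balancerWindowMinutes("01:00", "03:00"));

// In a shell connected to mongos, the current window (if any) can be read with:
// db.getSiblingDB("config").settings.findOne({ _id: "balancer" })
// and removed, so the balancer may run at any time, with:
// db.getSiblingDB("config").settings.updateOne(
//   { _id: "balancer" },
//   { $unset: { activeWindow: true } }
// )
```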