Introduction

MongoDB sharding distributes data across multiple shards to handle large datasets and high throughput. When sharding errors occur, they can prevent proper data distribution, cause query routing failures, or lead to severely unbalanced cluster states. These errors require understanding the sharding architecture: mongos routers, config servers, and shard replica sets.

Symptoms

Sharding errors manifest across the cluster:

```text # Balancer errors Balancer is not running Cannot balance collection: balancer locked Chunk migration failed: data transfer error

# Config server errors Config server unreachable Could not contact config server Config server read error

# Shard key issues Shard key pattern invalid Cannot shard collection: index not found Chunk split failed: shard key value out of range

# Query routing errors Cannot determine shard for query Mongos routing table stale

# In sh.status() balancer is not running chunks severely unbalanced across shards "draining" shards showing ```

Common Causes

  1. 1.Balancer stopped or locked - Migration process halted
  2. 2.Config server connectivity - mongos cannot reach config servers
  3. 3.Missing shard key index - Collection lacks required shard key index
  4. 4.Unbalanced chunk distribution - Hot shard due to poor shard key choice
  5. 5.Jumbo chunks - Chunks too large to migrate (> 64MB)
  6. 6.Zone violation - Chunks not respecting zone sharding rules
  7. 7.Version mismatch - mongos and config server version incompatible

Step-by-Step Fix

Step 1: Check Cluster Status

Get comprehensive sharding status:

```javascript mongosh --host mongos:27017

// Full cluster status sh.status()

// Key information: // - Sharding version // - Shards list and their states // - Balancer state // - Databases and their sharding status // - Chunk distribution per collection

// Check individual components sh._adminCommand("listShards") sh._adminCommand({ balancerStatus: 1 }) ```

Check config server connectivity:

```javascript // Verify config servers use config db.shards.find() db.mongos.find()

// Check config server replica set status sh._adminCommand({ getShardVersion: "mydb.mycollection" }) ```

Step 2: Fix Balancer Issues

Check and enable balancer:

```javascript // Check balancer status sh.getBalancerState() sh.isBalancerRunning()

// If stopped, enable it sh.startBalancer()

// Check for locks preventing balance use config db.locks.find({ state: 1 })

// Clear stuck locks (caution!) db.locks.update({ _id: "balancer" }, { $set: { state: 0 } }) ```

Balancer settings:

```javascript // Check balancer configuration sh.getBalancerState()

// Adjust balance window (if needed) use config db.settings.update( { _id: "balancer" }, { $set: { activeWindow: { start: "02:00", stop: "06:00" } } } )

// Or disable window for continuous balancing db.settings.update( { _id: "balancer" }, { $unset: { activeWindow: 1 } } ) ```

Step 3: Enable Sharding for Collection

If collection is not yet sharded:

```javascript // Enable sharding on database sh.enableSharding("mydb")

// Create shard key index first (required) use mydb db.mycollection.createIndex({ userId: 1 })

// Shard the collection sh.shardCollection("mydb.mycollection", { userId: 1 })

// For hashed shard key (better distribution) db.mycollection.createIndex({ userId: "hashed" }) sh.shardCollection("mydb.mycollection", { userId: "hashed" }) ```

If sharding fails due to missing index:

text
Error: cannot shard collection mydb.mycollection without shard key index

Create the index:

```javascript // Check existing indexes db.mycollection.getIndexes()

// Create shard key index db.mycollection.createIndex({ userId: 1 })

// Retry sharding sh.shardCollection("mydb.mycollection", { userId: 1 }) ```

Step 4: Fix Chunk Imbalance

Check chunk distribution:

```javascript // Detailed chunk info use config db.chunks.aggregate([ { $group: { _id: "$shard", count: { $sum: 1 }, avgSize: { $avg: "$size" } }}, { $sort: { count: -1 } } ])

// For specific collection db.chunks.find({ ns: "mydb.mycollection" }).count() db.chunks.aggregate([ { $match: { ns: "mydb.mycollection" } }, { $group: { _id: "$shard", count: { $sum: 1 } } } ]) ```

If severely imbalanced:

```javascript // Check if balancer is running sh.isBalancerRunning()

// Force immediate balance check sh._adminCommand({ balancerStart: 1 })

// Monitor migrations db.changelog.find({ what: "balance" }).sort({ time: -1 }).limit(10) ```

Step 5: Handle Jumbo Chunks

Jumbo chunks (> 64MB) cannot be auto-migrated:

```javascript // Find jumbo chunks use config db.chunks.find({ ns: "mydb.mycollection", jumbo: true }).forEach(c => printjson(c))

// Try to split jumbo chunk let chunk = db.chunks.findOne({ ns: "mydb.mycollection", jumbo: true }) sh.splitAt("mydb.mycollection", chunk.min) sh.splitAt("mydb.mycollection", chunk.max)

// If cannot split (single document > 64MB) // Need to reconsider shard key or document structure ```

Manual chunk management:

```javascript // Find split point db.mycollection.find().sort({ userId: 1 }).skip(10000).limit(1)

// Split at specific value sh.splitAt("mydb.mycollection", { userId: "user12345" })

// Move chunk manually sh.moveChunk("mydb.mycollection", { userId: "user12345" }, "shard02") ```

Step 6: Fix Zone Sharding Issues

Check zone configuration:

```javascript // List zones sh._adminCommand({ listTags: 1 })

// Zone ranges use config db.tags.find()

// Verify zone-shard assignment db.shards.find({ tags: { $exists: true } }) ```

Fix zone issues:

```javascript // Add missing tag range sh.addTagRange("mydb.mycollection", { region: "us-east" }, { region: "us-west" }, "US")

// Remove incorrect tag sh.removeTagRange("mydb.mycollection", { region: "eu" }, { region: "eu-z" }, "EU")

// Assign shard to zone sh.addShardTag("shard01", "US") sh.removeShardTag("shard01", "EU") ```

Step 7: Recover from Config Server Failure

If config servers are unreachable:

```bash # Check config server status (it's a replica set) mongosh --host config-primary:27019 --eval "rs.status()"

# Restart config servers sudo systemctl restart mongod-config ```

From mongos:

```javascript // Verify config server connectivity sh._adminCommand("listShards")

// If mongos cached config is stale // Restart mongos to refresh ```

Restart mongos router:

bash
sudo systemctl restart mongos
mongosh --eval "sh.status()"

Step 8: Handle Draining Shard

When removing a shard:

```javascript // Start draining sh.removeShard("shard03")

// Monitor progress sh.status()

// Check draining status use config db.shards.find({ draining: true })

// Chunks being moved db.changelog.find({ what: "moveChunk.commit", ns: "mydb.*" }).sort({ time: -1 }).limit(10) ```

Verification

Verify healthy sharding:

```javascript // 1. All shards healthy sh._adminCommand("listShards") // All shards should have state: 1

// 2. Balancer running sh.isBalancerRunning()

// 3. Chunk distribution balanced // Check sh.status() for even distribution

// 4. Queries route correctly db.mycollection.find({ userId: "test" }).explain() // Should show shards targeted

// 5. Write distributes db.mycollection.insertOne({ userId: "newUser", data: "test" }) // Check which shard received it ```

Check from shards directly:

bash
# Connect to each shard and verify chunks
mongosh --host shard01:27018 --eval "db.mycollection.countDocuments()"
mongosh --host shard02:27018 --eval "db.mycollection.countDocuments()"

Common Pitfalls

  • Choosing poor shard key - Low cardinality causes hot shards
  • Not pre-splitting - Initial data goes to single shard
  • Ignoring chunk size limits - Large documents become jumbo chunks
  • Balancer window misconfiguration - Balance window may not cover load
  • Removing shards improperly - Drain may take hours for large data

Best Practices

  • Choose shard keys with high cardinality and even distribution
  • Use hashed shard keys for uniform distribution when order irrelevant
  • Pre-split chunks before bulk loading data
  • Monitor chunk distribution with alerts for imbalance > 20%
  • Keep balancer running during low-traffic periods
  • Test zone sharding rules before production deployment
  • Document shard key rationale for future reference
  • MongoDB Chunk Migration Error
  • MongoDB Oplog Error
  • MongoDB Replica Set Error
  • MongoDB WiredTiger Error