Introduction
Chunk migration is the core mechanism for balancing data across shards in a MongoDB sharded cluster. When migrations fail, the cluster becomes unbalanced, leading to hot shards, query performance degradation, and potential capacity exhaustion on overloaded shards. Understanding the migration process and its failure modes is essential for maintaining cluster health.
Symptoms
Chunk migration errors appear in logs and sharding status:
```text
# Migration timeout
Migration aborted: took too long
Chunk migration failed: network error

# Jumbo chunk issues
Chunk marked as jumbo: size exceeds threshold
Cannot migrate chunk: exceeds maximum chunk size

# Concurrent migration conflict
Migration conflict detected
Cannot start migration: another migration in progress

# Destination shard errors
Destination shard rejected chunk
Shard does not have enough disk space

# In logs
{"msg":"Migration aborted","attr":{"reason":"exceeded migrationLockTimeout"}}
{"msg":"MoveChunk error","attr":{"error":"could not acquire collection lock"}}
{"msg":"Jumbo chunk","attr":{"ns":"mydb.mycollection","min":{...},"max":{...}}}

# In sh.status()
chunks on shard01: 1500 (overloaded)
chunks on shard02: 200 (underloaded)
```
Common Causes
1. Network latency or instability - Migration exceeds the timeout threshold
2. Jumbo chunks - Chunks exceeding the maximum chunk size (64 MB by default before MongoDB 6.0, 128 MB from 6.0 on) cannot be auto-migrated
3. Concurrent migration limits - Each shard can participate in at most one migration at a time
4. Disk capacity exhaustion - Destination shard lacks free space
5. Collection locks - Operations blocking migration locks
6. High write volume during migration - Catchup phase cannot complete
7. Shard key distribution issues - Chunks cannot be split further
Step-by-Step Fix
Step 1: Check Migration Status
Identify current migration state:
```javascript
// Connect to the cluster: mongosh --host mongos:27017

// Check balancer and active migrations
sh.isBalancerRunning()

// Detailed migration status
use config
db.locks.find({ state: 1 })

// Check for failed migrations in the changelog
db.changelog.find({ what: "moveChunk.error" }).sort({ time: -1 }).limit(10)

// Current chunk distribution
db.chunks.aggregate([
  { $match: { ns: "mydb.mycollection" } },
  { $group: { _id: "$shard", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])
```
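The distribution returned by the aggregation above can be checked for skew programmatically. A minimal sketch in plain JavaScript (the 20% threshold mirrors the alerting guideline later in this article; shard names are illustrative):

```javascript
// Given [{ _id: shardName, count: chunkCount }, ...] from the aggregation
// above, report whether the chunk distribution exceeds a skew threshold.
function chunkImbalance(dist, thresholdPct = 20) {
  const counts = dist.map(d => d.count);
  const max = Math.max(...counts);
  const min = Math.min(...counts);
  const avg = counts.reduce((a, b) => a + b, 0) / counts.length;
  const skewPct = ((max - min) / avg) * 100;
  return { max, min, avg, skewPct, imbalanced: skewPct > thresholdPct };
}

// Example with the counts from the Symptoms section (1500 vs 200 chunks)
const report = chunkImbalance([
  { _id: "shard01", count: 1500 },
  { _id: "shard02", count: 200 },
]);
```

Feeding the aggregation's `toArray()` output straight into this function gives a quick go/no-go signal before digging into individual chunks.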
Step 2: Diagnose Specific Failure
Find the error type in logs:
```bash
# On the mongos router
grep -i "moveChunk" /var/log/mongodb/mongos.log | tail -50

# On affected shards
grep -i "migration" /var/log/mongodb/mongod.log | tail -50

# Common patterns:
# "migration aborted" -> timeout or network
# "jumbo chunk"       -> size issue
# "lock timeout"      -> concurrent operation
```
Step 3: Fix Network Timeout Issues
Adjust migration parameters:
```javascript
// Enable secondary throttle so each migration waits for secondaries
// before copying the next batch (reduces timeout risk on slow networks)
use config
db.settings.update(
  { _id: "balancer" },
  { $set: { _secondaryThrottle: true } },
  { upsert: true }
)

// Check network connectivity between shards
// Test from mongos
db.runCommand({ ping: 1 })
```
Test shard connectivity:
```bash
# Test connectivity between all shard pairs
mongosh --host shard01-primary:27018 --eval "db.runCommand({ping:1})"
mongosh --host shard02-primary:27018 --eval "db.runCommand({ping:1})"

# Check latency
ping shard01
ping shard02
```
Step 4: Handle Jumbo Chunks
Find and address jumbo chunks:
```javascript
// Find jumbo chunks
use config
db.chunks.find({ jumbo: true }).forEach(c => {
  print("Jumbo chunk: " + c.ns + " on " + c.shard)
  print("Min: " + JSON.stringify(c.min))
  print("Max: " + JSON.stringify(c.max))
})

// Count jumbo chunks
db.chunks.countDocuments({ jumbo: true })
```
Attempt to split jumbo chunks:
```javascript
// Get jumbo chunk details
use config
let jumboChunk = db.chunks.findOne({ jumbo: true })

// Inspect the data around the chunk bounds
use mydb
db.mycollection.find({
  userId: { $gte: jumboChunk.min.userId, $lt: jumboChunk.max.userId }
}).sort({ userId: 1 }).limit(100)

// Find the key range actually present in the chunk
let range = db.mycollection.aggregate([
  { $match: { userId: { $gte: jumboChunk.min.userId, $lt: jumboChunk.max.userId } } },
  { $group: { _id: null, min: { $min: "$userId" }, max: { $max: "$userId" } } }
]).next()

// Try a manual split at a value inside that range
sh.splitAt("mydb.mycollection", { userId: "midpoint_value" })

// If the split succeeds, clear the jumbo flag
// (on MongoDB 4.2.3+ prefer the admin command:
//  db.adminCommand({ clearJumboFlag: "mydb.mycollection", find: jumboChunk.min }))
use config
db.chunks.update(
  { _id: jumboChunk._id },
  { $unset: { jumbo: true } }
)
```
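For numeric shard keys, a concrete split point can be derived from the chunk bounds instead of the placeholder `midpoint_value` above. A minimal sketch (assumes a single-field numeric key; MongoDB's own auto-split chooses points from sampled key frequencies, not a simple midpoint):

```javascript
// Choose a midpoint split value for a numeric single-field shard key,
// given a chunk's min/max bounds (e.g. from config.chunks).
function midpointSplitKey(chunk, field) {
  const lo = chunk.min[field];
  const hi = chunk.max[field];
  if (typeof lo !== "number" || typeof hi !== "number") {
    throw new Error("midpoint split only works for numeric keys");
  }
  const mid = Math.floor(lo + (hi - lo) / 2);
  if (mid <= lo || mid >= hi) {
    throw new Error("chunk range too narrow to split");
  }
  return { [field]: mid };
}

// e.g. sh.splitAt("mydb.mycollection", midpointSplitKey(jumboChunk, "userId"))
const split = midpointSplitKey(
  { min: { userId: 1000 }, max: { userId: 5000 } },
  "userId"
);
```

If every document in the chunk shares a single key value, no split point exists and the function correctly refuses, which is exactly the unsplittable case handled next.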
If chunk cannot be split (single document issue):
```javascript
// Find documents approaching the 16 MB BSON document limit inside the chunk
use mydb
db.mycollection.find({
  userId: { $gte: jumboChunk.min.userId, $lt: jumboChunk.max.userId }
}).forEach(doc => {
  let size = Object.bsonsize(doc)
  if (size > 15 * 1024 * 1024) { // close to the 16 MB limit
    print("Large document: " + doc._id + " size: " + size / 1024 / 1024 + "MB")
  }
})

// Solution options:
// 1. Refactor document structure (split into multiple docs)
// 2. Change shard key to distribute differently
// 3. Manually move the chunk (with the force flag)
```
Step 5: Resolve Concurrent Migration Conflicts
Check migration locks:
```javascript
use config
db.locks.find({ state: { $ne: 0 } })

// Find what's holding locks acquired in the last hour
db.locks.find({ when: { $gt: new Date(Date.now() - 3600000) } })
```
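Before clearing anything, it helps to decide mechanically whether a lock is actually stale. A sketch, assuming lock documents carry the `state` and `when` fields queried above (the one-hour threshold is an assumption, not a MongoDB default):

```javascript
// Decide whether a config.locks document looks stale: still held
// (state != 0) but acquired longer ago than maxAgeMs.
function isStaleLock(lock, now = Date.now(), maxAgeMs = 3600000) {
  if (lock.state === 0) return false; // not held at all
  const heldForMs = now - lock.when.getTime();
  return heldForMs > maxAgeMs;
}

const now = Date.now();
const fresh = isStaleLock({ state: 2, when: new Date(now - 60000) }, now);   // 1 min old
const stale = isStaleLock({ state: 2, when: new Date(now - 7200000) }, now); // 2 hours old
```

Only locks this check flags as stale are candidates for the manual clear shown next; a fresh lock usually means a migration is genuinely in flight.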
Clear stuck locks (caution):
```javascript
// Only if the migration is truly stuck
// First verify no active migration
sh.isBalancerRunning()

// Clear the lock
db.locks.update(
  { _id: "mydb.mycollection", state: 2 },
  { $set: { state: 0 } }
)

// Restart the balancer
sh.startBalancer()
```
Step 6: Handle Destination Shard Issues
Check shard capacity:
```javascript
// List shards from mongos
use admin
db.adminCommand({ listShards: 1 }).shards.forEach(s => {
  print(s._id + ": " + s.host)
})

// Reported size on disk per database (as seen from mongos)
db.adminCommand({ listDatabases: 1 }).databases.forEach(d => {
  print(d.name + ": " + (d.sizeOnDisk / 1024 / 1024).toFixed(1) + " MB")
})

// Or check the shard version for a specific collection
db.runCommand({ getShardVersion: "mydb.mycollection" })
```
Add capacity or move chunks manually:

```bash
# Check disk on each shard host
df -h /var/lib/mongodb
```

Move chunks to a shard with more capacity:
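Before moving a chunk, it is worth confirming the destination actually has headroom. A sketch (free-space figures would come from `df` or `db.stats()`; the 2x safety factor covering the temporary copy plus catchup overhead is an assumption):

```javascript
// Check whether a destination shard has room for an incoming chunk,
// with a safety factor for the migration's temporary copy + overhead.
function hasRoomForChunk(freeBytes, chunkBytes, safetyFactor = 2) {
  return freeBytes >= chunkBytes * safetyFactor;
}

const chunkBytes = 64 * 1024 * 1024;                     // 64 MB chunk
const ok = hasRoomForChunk(10 * 1024 ** 3, chunkBytes);  // 10 GB free
const tight = hasRoomForChunk(100 * 1024 ** 2, chunkBytes); // 100 MB free
```

A shard that fails this check should get more disk (or have other data drained) before being used as a migration destination.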
```javascript
// Manual chunk move
sh.moveChunk("mydb.mycollection", { userId: "specific_value" }, "shard03")

// Monitor move progress (changelog entries are moveChunk.start/commit/error)
use config
db.changelog.find({
  what: /^moveChunk/,
  ns: "mydb.mycollection"
}).sort({ time: -1 }).limit(5)
```
Step 7: Handle High Write Volume During Migration
Reduce write pressure during migration:
```javascript
// Check collection stats
db.mycollection.stats()

// Temporarily reduce write rate or batch writes.
// Migration includes a catchup phase that replays writes made during
// the copy; a sustained high write rate can prevent it from completing.
```
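The catchup constraint can be made concrete: the migration only commits once the destination drains the write backlog faster than new writes arrive. A back-of-the-envelope sketch (rates are illustrative, not measured values):

```javascript
// Estimate catchup time for a migration: the catchup phase completes
// only if the transfer drains the backlog faster than writes accumulate.
function catchupSeconds(backlogOps, writeOpsPerSec, drainOpsPerSec) {
  const netDrain = drainOpsPerSec - writeOpsPerSec;
  if (netDrain <= 0) return Infinity; // catchup can never complete
  return backlogOps / netDrain;
}

const busy = catchupSeconds(10000, 5000, 4000); // writes outpace the drain
const calm = catchupSeconds(10000, 1000, 3000); // net drain of 2000 ops/s
```

This is why migrations that hang during business hours often succeed overnight: the same backlog with a lower write rate yields a finite catchup time.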
Adjust migration settings:
```javascript
// Secondary throttle - wait for secondaries during migration
use config
db.settings.update(
  { _id: "balancer" },
  { $set: { _secondaryThrottle: true } },
  { upsert: true }
)

// Pause the balancer during peak writes
sh.stopBalancer()
// Resume during off-hours
sh.startBalancer()
```
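Rather than stopping and starting the balancer by hand, a balancing window can be configured so migrations only run off-peak. This uses the documented `activeWindow` balancer setting (the times shown are illustrative; they are interpreted in the time zone of the config server primary):

```javascript
// Restrict balancing to a nightly window (HH:MM)
use config
db.settings.update(
  { _id: "balancer" },
  { $set: { activeWindow: { start: "01:00", stop: "05:00" } } },
  { upsert: true }
)
```

Remove the window later with `$unset: { activeWindow: true }` on the same document.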
Step 8: Force Manual Migration
When automatic balancing fails:
```javascript
// Find the overloaded shard
use config
db.chunks.aggregate([
  { $match: { ns: "mydb.mycollection" } },
  { $group: { _id: "$shard", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])

// Get a specific chunk to move
let chunk = db.chunks.findOne({ shard: "shard01", ns: "mydb.mycollection" })

// Force the move
sh.moveChunk("mydb.mycollection", chunk.min, "shard02")
```
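The manual move above can be repeated until chunk counts converge. A sketch of a naive plan: repeatedly move one chunk from the fullest to the emptiest shard (the real balancer additionally weighs zones, draining shards, and per-shard migration limits):

```javascript
// Plan chunk moves to even out a distribution: repeatedly shift one
// chunk from the most-loaded shard to the least-loaded one.
function planMoves(dist) {
  const counts = { ...dist };
  const moves = [];
  for (;;) {
    const shards = Object.keys(counts);
    shards.sort((a, b) => counts[b] - counts[a]); // most loaded first
    const from = shards[0];
    const to = shards[shards.length - 1];
    if (counts[from] - counts[to] <= 1) break;    // balanced enough
    counts[from]--;
    counts[to]++;
    moves.push({ from, to });
  }
  return moves;
}

const moves = planMoves({ shard01: 5, shard02: 1 });
```

Each planned entry corresponds to one `sh.moveChunk` call; stopping at a difference of one chunk avoids ping-ponging the last chunk back and forth.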
Verification
Verify successful migrations:
```javascript
// 1. Check the balancer is running
sh.isBalancerRunning()

// 2. Verify balanced distribution
sh.status()  // Check the "chunks" section for balanced counts

// 3. No jumbo chunks remaining
use config
db.chunks.countDocuments({ jumbo: true })  // Should be 0 or decreasing

// 4. No stuck locks
db.locks.find({ state: { $ne: 0 } }).count()  // Should be 0

// 5. Recent successful migrations
db.changelog.find({ what: "moveChunk.commit" }).sort({ time: -1 }).limit(10)
```
System verification:
```bash
# Check disk space balanced across shards
for shard in shard01 shard02 shard03; do
  echo "=== $shard ==="
  ssh $shard "df -h /var/lib/mongodb"
done

# Check mongos logs for errors
grep -i "moveChunk" /var/log/mongodb/mongos.log | grep -i "error"
```
Common Pitfalls
- Ignoring jumbo chunks - Will accumulate and prevent balancing
- Forcing migration without split - May create new jumbo chunks
- Not checking destination capacity - Move may fail due to disk space
- Running migrations during peak traffic - Catchup phase may timeout
- Disabling balancer permanently - Cluster becomes unbalanced over time
Best Practices
- Monitor chunk distribution with alerts for > 20% imbalance
- Size chunks appropriately (the default is usually fine: 64 MB before MongoDB 6.0, 128 MB from 6.0 on)
- Schedule balancing during low-traffic windows
- Proactively split large chunks before they become jumbo
- Document shard key rationale to prevent future distribution issues
- Test migration procedures before production issues arise
- Keep shards at similar capacity levels
Related Issues
- MongoDB Sharding Error
- MongoDB Oplog Error
- MongoDB WiredTiger Error
- MongoDB Disk Full