Introduction
Chunk migration is the core mechanism for balancing data across shards in a MongoDB sharded cluster. When migrations fail, the cluster becomes unbalanced, leading to hot shards, query performance degradation, and potential capacity exhaustion on overloaded shards. Understanding the migration process and its failure modes is essential for maintaining cluster health.
Symptoms
Chunk migration errors appear in logs and sharding status:
```text
# Migration timeout
Migration aborted: took too long
Chunk migration failed: network error

# Jumbo chunk issues
Chunk marked as jumbo: size exceeds threshold
Cannot migrate chunk: exceeds maximum chunk size

# Concurrent migration conflict
Migration conflict detected
Cannot start migration: another migration in progress

# Destination shard errors
Destination shard rejected chunk
Shard does not have enough disk space

# In logs
{"msg":"Migration aborted","attr":{"reason":"exceeded migrationLockTimeout"}}
{"msg":"MoveChunk error","attr":{"error":"could not acquire collection lock"}}
{"msg":"Jumbo chunk","attr":{"ns":"mydb.mycollection","min":{...},"max":{...}}}

# In sh.status()
chunks on shard01: 1500 (overloaded)
chunks on shard02: 200 (underloaded)
```
Common Causes
1. Network latency or instability - Migration exceeds the timeout threshold
2. Jumbo chunks - Chunks exceeding the maximum chunk size (64 MB by default before MongoDB 6.0, 128 MB from 6.0 on) cannot be auto-migrated
3. Concurrent migration limits - Each shard can participate in at most one migration at a time
4. Disk capacity exhaustion - Destination shard lacks free space
5. Collection locks - Operations blocking migration locks
6. High write volume during migration - Catchup phase cannot complete
7. Shard key distribution issues - Chunks cannot be split further
Step-by-Step Fix
Step 1: Check Migration Status
Identify current migration state:
```javascript
// Connect to the cluster: mongosh --host mongos:27017

// Check balancer and active migrations
sh.isBalancerRunning()

// Detailed migration status
use config
db.locks.find({ state: 1 })

// Check for failed migrations in the changelog
db.changelog.find({ what: "moveChunk.error" }).sort({ time: -1 }).limit(10)

// Current chunk distribution
db.chunks.aggregate([
  { $match: { ns: "mydb.mycollection" } },
  { $group: { _id: "$shard", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])
```
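The distribution returned by the aggregation above can be checked for skew programmatically. A minimal sketch in plain JavaScript (the 20% threshold mirrors the alerting guideline later in this article; shard names are illustrative):

```javascript
// Given [{ _id: shardName, count: chunkCount }, ...] from the aggregation
// above, report whether the chunk distribution exceeds a skew threshold.
function chunkImbalance(dist, thresholdPct = 20) {
  const counts = dist.map(d => d.count);
  const max = Math.max(...counts);
  const min = Math.min(...counts);
  const avg = counts.reduce((a, b) => a + b, 0) / counts.length;
  const skewPct = ((max - min) / avg) * 100;
  return { max, min, avg, skewPct, imbalanced: skewPct > thresholdPct };
}

// Example with the counts from the Symptoms section (1500 vs 200 chunks)
const report = chunkImbalance([
  { _id: "shard01", count: 1500 },
  { _id: "shard02", count: 200 },
]);
```

Feeding the aggregation's `toArray()` output straight into this function gives a quick go/no-go signal before digging into individual chunks.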
Step 2: Diagnose Specific Failure
Find the error type in logs:
```bash
# On the mongos router
grep -i "moveChunk" /var/log/mongodb/mongos.log | tail -50

# On affected shards
grep -i "migration" /var/log/mongodb/mongod.log | tail -50

# Common patterns:
# "migration aborted" -> timeout or network
# "jumbo chunk"       -> size issue
# "lock timeout"      -> concurrent operation
```
Step 3: Fix Network Timeout Issues
Adjust migration parameters:
```javascript
// Enable secondary throttle so each migration waits for secondaries
// before copying the next batch (reduces timeout risk on slow networks)
use config
db.settings.update(
  { _id: "balancer" },
  { $set: { _secondaryThrottle: true } },
  { upsert: true }
)

// Check network connectivity between shards
// Test from mongos
db.runCommand({ ping: 1 })
```
Test shard connectivity:
```bash
# Test connectivity between all shard pairs
mongosh --host shard01-primary:27018 --eval "db.runCommand({ping:1})"
mongosh --host shard02-primary:27018 --eval "db.runCommand({ping:1})"

# Check latency
ping shard01
ping shard02
```
Step 4: Handle Jumbo Chunks
Find and address jumbo chunks:
```javascript
// Find jumbo chunks
use config
db.chunks.find({ jumbo: true }).forEach(c => {
  print("Jumbo chunk: " + c.ns + " on " + c.shard)
  print("Min: " + JSON.stringify(c.min))
  print("Max: " + JSON.stringify(c.max))
})

// Count jumbo chunks
db.chunks.countDocuments({ jumbo: true })
```
Attempt to split jumbo chunks:
```javascript
// Get jumbo chunk details
use config
let jumboChunk = db.chunks.findOne({ jumbo: true })

// Inspect the data around the chunk bounds
use mydb
db.mycollection.find({
  userId: { $gte: jumboChunk.min.userId, $lt: jumboChunk.max.userId }
}).sort({ userId: 1 }).limit(100)

// Find the key range actually present in the chunk
let range = db.mycollection.aggregate([
  { $match: { userId: { $gte: jumboChunk.min.userId, $lt: jumboChunk.max.userId } } },
  { $group: { _id: null, min: { $min: "$userId" }, max: { $max: "$userId" } } }
]).next()

// Try a manual split at a value inside that range
sh.splitAt("mydb.mycollection", { userId: "midpoint_value" })

// If the split succeeds, clear the jumbo flag
// (on MongoDB 4.2.3+ prefer the admin command:
//  db.adminCommand({ clearJumboFlag: "mydb.mycollection", find: jumboChunk.min }))
use config
db.chunks.update(
  { _id: jumboChunk._id },
  { $unset: { jumbo: true } }
)
```
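For numeric shard keys, a concrete split point can be derived from the chunk bounds instead of the placeholder `midpoint_value` above. A minimal sketch (assumes a single-field numeric key; MongoDB's own auto-split chooses points from sampled key frequencies, not a simple midpoint):

```javascript
// Choose a midpoint split value for a numeric single-field shard key,
// given a chunk's min/max bounds (e.g. from config.chunks).
function midpointSplitKey(chunk, field) {
  const lo = chunk.min[field];
  const hi = chunk.max[field];
  if (typeof lo !== "number" || typeof hi !== "number") {
    throw new Error("midpoint split only works for numeric keys");
  }
  const mid = Math.floor(lo + (hi - lo) / 2);
  if (mid <= lo || mid >= hi) {
    throw new Error("chunk range too narrow to split");
  }
  return { [field]: mid };
}

// e.g. sh.splitAt("mydb.mycollection", midpointSplitKey(jumboChunk, "userId"))
const split = midpointSplitKey(
  { min: { userId: 1000 }, max: { userId: 5000 } },
  "userId"
);
```

If every document in the chunk shares a single key value, no split point exists and the function correctly refuses, which is exactly the unsplittable case handled next.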
If chunk cannot be split (single document issue):
```javascript
// Find documents approaching the 16 MB BSON document limit inside the chunk
use mydb
db.mycollection.find({
  userId: { $gte: jumboChunk.min.userId, $lt: jumboChunk.max.userId }
}).forEach(doc => {
  let size = Object.bsonsize(doc)
  if (size > 15 * 1024 * 1024) { // close to the 16 MB limit
    print("Large document: " + doc._id + " size: " + size / 1024 / 1024 + "MB")
  }
})

// Solution options:
// 1. Refactor document structure (split into multiple docs)
// 2. Change shard key to distribute differently
// 3. Manually move the chunk (with the force flag)
```
Step 5: Resolve Concurrent Migration Conflicts
Check migration locks:
```javascript
use config
db.locks.find({ state: { $ne: 0 } })

// Find what's holding locks acquired in the last hour
db.locks.find({ when: { $gt: new Date(Date.now() - 3600000) } })
```
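Before clearing anything, it helps to decide mechanically whether a lock is actually stale. A sketch, assuming lock documents carry the `state` and `when` fields queried above (the one-hour threshold is an assumption, not a MongoDB default):

```javascript
// Decide whether a config.locks document looks stale: still held
// (state != 0) but acquired longer ago than maxAgeMs.
function isStaleLock(lock, now = Date.now(), maxAgeMs = 3600000) {
  if (lock.state === 0) return false; // not held at all
  const heldForMs = now - lock.when.getTime();
  return heldForMs > maxAgeMs;
}

const now = Date.now();
const fresh = isStaleLock({ state: 2, when: new Date(now - 60000) }, now);   // 1 min old
const stale = isStaleLock({ state: 2, when: new Date(now - 7200000) }, now); // 2 hours old
```

Only locks this check flags as stale are candidates for the manual clear shown next; a fresh lock usually means a migration is genuinely in flight.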
Clear stuck locks (caution):
```javascript
// Only if the migration is truly stuck
// First verify no active migration
sh.isBalancerRunning()

// Clear the lock
db.locks.update(
  { _id: "mydb.mycollection", state: 2 },
  { $set: { state: 0 } }
)

// Restart the balancer
sh.startBalancer()
```
Step 6: Handle Destination Shard Issues
Check shard capacity:
```javascript
// List shards from mongos
use admin
db.adminCommand({ listShards: 1 }).shards.forEach(s => {
  print(s._id + ": " + s.host)
})

// Reported size on disk per database (as seen from mongos)
db.adminCommand({ listDatabases: 1 }).databases.forEach(d => {
  print(d.name + ": " + (d.sizeOnDisk / 1024 / 1024).toFixed(1) + " MB")
})

// Or check the shard version for a specific collection
db.runCommand({ getShardVersion: "mydb.mycollection" })
```
Add capacity or move chunks manually:

```bash
# Check disk on each shard host
df -h /var/lib/mongodb
```

Move chunks to a shard with more capacity:
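Before moving a chunk, it is worth confirming the destination actually has headroom. A sketch (free-space figures would come from `df` or `db.stats()`; the 2x safety factor covering the temporary copy plus catchup overhead is an assumption):

```javascript
// Check whether a destination shard has room for an incoming chunk,
// with a safety factor for the migration's temporary copy + overhead.
function hasRoomForChunk(freeBytes, chunkBytes, safetyFactor = 2) {
  return freeBytes >= chunkBytes * safetyFactor;
}

const chunkBytes = 64 * 1024 * 1024;                     // 64 MB chunk
const ok = hasRoomForChunk(10 * 1024 ** 3, chunkBytes);  // 10 GB free
const tight = hasRoomForChunk(100 * 1024 ** 2, chunkBytes); // 100 MB free
```

A shard that fails this check should get more disk (or have other data drained) before being used as a migration destination.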
```javascript
// Manual chunk move
sh.moveChunk("mydb.mycollection", { userId: "specific_value" }, "shard03")

// Monitor move progress (changelog entries are moveChunk.start/commit/error)
use config
db.changelog.find({
  what: /^moveChunk/,
  ns: "mydb.mycollection"
}).sort({ time: -1 }).limit(5)
```
Step 7: Handle High Write Volume During Migration
Reduce write pressure during migration:
```javascript
// Check collection stats
db.mycollection.stats()

// Temporarily reduce write rate or batch writes.
// Migration includes a catchup phase that replays writes made during
// the copy; a sustained high write rate can prevent it from completing.
```
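The catchup constraint can be made concrete: the migration only commits once the destination drains the write backlog faster than new writes arrive. A back-of-the-envelope sketch (rates are illustrative, not measured values):

```javascript
// Estimate catchup time for a migration: the catchup phase completes
// only if the transfer drains the backlog faster than writes accumulate.
function catchupSeconds(backlogOps, writeOpsPerSec, drainOpsPerSec) {
  const netDrain = drainOpsPerSec - writeOpsPerSec;
  if (netDrain <= 0) return Infinity; // catchup can never complete
  return backlogOps / netDrain;
}

const busy = catchupSeconds(10000, 5000, 4000); // writes outpace the drain
const calm = catchupSeconds(10000, 1000, 3000); // net drain of 2000 ops/s
```

This is why migrations that hang during business hours often succeed overnight: the same backlog with a lower write rate yields a finite catchup time.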
Adjust migration settings:
```javascript
// Secondary throttle - wait for secondaries during migration
use config
db.settings.update(
  { _id: "balancer" },
  { $set: { _secondaryThrottle: true } },
  { upsert: true }
)

// Pause the balancer during peak writes
sh.stopBalancer()
// Resume during off-hours
sh.startBalancer()
```
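Rather than stopping and starting the balancer by hand, a balancing window can be configured so migrations only run off-peak. This uses the documented `activeWindow` balancer setting (the times shown are illustrative; they are interpreted in the time zone of the config server primary):

```javascript
// Restrict balancing to a nightly window (HH:MM)
use config
db.settings.update(
  { _id: "balancer" },
  { $set: { activeWindow: { start: "01:00", stop: "05:00" } } },
  { upsert: true }
)
```

Remove the window later with `$unset: { activeWindow: true }` on the same document.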
Step 8: Force Manual Migration
When automatic balancing fails:
```javascript
// Find the overloaded shard
use config
db.chunks.aggregate([
  { $match: { ns: "mydb.mycollection" } },
  { $group: { _id: "$shard", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])

// Get a specific chunk to move
let chunk = db.chunks.findOne({ shard: "shard01", ns: "mydb.mycollection" })

// Force the move
sh.moveChunk("mydb.mycollection", chunk.min, "shard02")
```
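The manual move above can be repeated until chunk counts converge. A sketch of a naive plan: repeatedly move one chunk from the fullest to the emptiest shard (the real balancer additionally weighs zones, draining shards, and per-shard migration limits):

```javascript
// Plan chunk moves to even out a distribution: repeatedly shift one
// chunk from the most-loaded shard to the least-loaded one.
function planMoves(dist) {
  const counts = { ...dist };
  const moves = [];
  for (;;) {
    const shards = Object.keys(counts);
    shards.sort((a, b) => counts[b] - counts[a]); // most loaded first
    const from = shards[0];
    const to = shards[shards.length - 1];
    if (counts[from] - counts[to] <= 1) break;    // balanced enough
    counts[from]--;
    counts[to]++;
    moves.push({ from, to });
  }
  return moves;
}

const moves = planMoves({ shard01: 5, shard02: 1 });
```

Each planned entry corresponds to one `sh.moveChunk` call; stopping at a difference of one chunk avoids ping-ponging the last chunk back and forth.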
Verification
Verify successful migrations:
```javascript
// 1. Check the balancer is running
sh.isBalancerRunning()

// 2. Verify balanced distribution
sh.status()  // Check the "chunks" section for balanced counts

// 3. No jumbo chunks remaining
use config
db.chunks.countDocuments({ jumbo: true })  // Should be 0 or decreasing

// 4. No stuck locks
db.locks.find({ state: { $ne: 0 } }).count()  // Should be 0

// 5. Recent successful migrations
db.changelog.find({ what: "moveChunk.commit" }).sort({ time: -1 }).limit(10)
```
System verification:
```bash
# Check disk space balanced across shards
for shard in shard01 shard02 shard03; do
  echo "=== $shard ==="
  ssh $shard "df -h /var/lib/mongodb"
done

# Check mongos logs for errors
grep -i "moveChunk" /var/log/mongodb/mongos.log | grep -i "error"
```
Common Pitfalls
- Ignoring jumbo chunks - Will accumulate and prevent balancing
- Forcing migration without split - May create new jumbo chunks
- Not checking destination capacity - Move may fail due to disk space
- Running migrations during peak traffic - Catchup phase may timeout
- Disabling balancer permanently - Cluster becomes unbalanced over time
Best Practices
- Monitor chunk distribution with alerts for > 20% imbalance
- Size chunks appropriately (the default is usually fine: 64 MB before MongoDB 6.0, 128 MB from 6.0 on)
- Schedule balancing during low-traffic windows
- Proactively split large chunks before they become jumbo
- Document shard key rationale to prevent future distribution issues
- Test migration procedures before production issues arise
- Keep shards at similar capacity levels
Related Issues
- MongoDB Sharding Error
- MongoDB Oplog Error
- MongoDB WiredTiger Error
- MongoDB Disk Full