Introduction
Node.js cluster mode creates worker processes that share listening ports to handle incoming connections. When a worker crashes, the primary respawns it; if the replacement crashes for the same reason, the cluster enters a rapid crash-restart loop. This loop consumes CPU, generates massive log output, and leaves the application with fewer healthy workers than configured. The root cause is typically an unhandled exception in startup code, a resource that cannot be initialized (a database connection, a file lock), or a memory leak that pushes the worker past its memory limit shortly after starting.
Symptoms
Cluster logs show a rapid restart loop:
Worker 12345 disconnected
Worker 12346 started
Worker 12346 disconnected (code: 1, signal: null)
Worker 12347 started
Worker 12347 disconnected (code: 1, signal: null)
Worker 12348 started
... (loop continues)

CPU usage spikes from constant forking:
$ top -p $(pgrep -d',' node)
PID USER %CPU %MEM
12345 app 180 2.1 <-- Multiple workers consuming CPU in crash loop
12346 app 45 0.5
12347 app 35 0.4

Or the master process log:
Worker 42 exited with code 1, signal null
Forking new worker to replace 42
Worker 43 forked
Worker 43 exited with code 1, signal null
Forking new worker to replace 43
... (thousands of restarts per minute)

Common Causes
- Unhandled exception in startup code: Worker throws during initialization
- Database connection fails: Worker cannot connect to database and crashes
- Missing environment variable: Required config not passed to forked workers
- Port already in use from previous worker: Old worker did not release the port
- Memory limit too low: Worker OOMs immediately after starting
- Infinite loop in worker code: Worker consumes all CPU and is killed by health check
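The exit code and signal reported on the worker's `exit` event help narrow these causes down. As a sketch, a small helper (hypothetical, not part of Node's API) can make the mapping explicit in the master's logs:

```javascript
// Hypothetical helper: map a worker's exit code/signal to a likely cause.
// Pass the (code, signal) arguments from the worker's 'exit' event.
// Note: when a worker is killed by a signal, code is null and signal is set.
function diagnoseExit(code, signal) {
  if (signal === 'SIGKILL') return 'killed externally, possibly the OOM killer';
  if (signal !== null) return `terminated by signal ${signal}`;
  if (code === 0) return 'clean exit';
  return 'crashed with an error (check startup code, config, and connections)';
}

console.log(diagnoseExit(1, null));
console.log(diagnoseExit(null, 'SIGKILL'));
```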
Step-by-Step Fix
Step 1: Add restart delay and limit
```javascript
const cluster = require('cluster');
const os = require('os');

if (cluster.isPrimary) {
  let restartCount = 0;
  const maxRestartsPerMinute = 10;
  let restartTimestamps = [];

  function forkWorker() {
    const now = Date.now();

    // Check the restart rate over the last minute.
    restartTimestamps = restartTimestamps.filter(t => now - t < 60000);
    if (restartTimestamps.length >= maxRestartsPerMinute) {
      console.error('Too many worker restarts. Stopping cluster.');
      process.exit(1);
    }

    const worker = cluster.fork();
    restartTimestamps.push(now);

    worker.on('exit', (code, signal) => {
      console.log(`Worker ${worker.process.pid} exited: code=${code}, signal=${signal}`);
      if (code !== 0 && signal === null) {
        // Worker crashed - wait before restarting (exponential backoff, capped at 30s).
        const delay = Math.min(1000 * Math.pow(2, restartCount), 30000);
        console.log(`Restarting worker in ${delay}ms (attempt ${restartCount + 1})`);
        restartCount++;
        setTimeout(() => {
          forkWorker();
          restartCount = Math.max(0, restartCount - 1);
        }, delay);
      } else {
        // Normal exit or killed by signal - restart immediately.
        forkWorker();
      }
    });

    return worker;
  }

  // Fork one worker per CPU core.
  const numCPUs = os.cpus().length;
  for (let i = 0; i < numCPUs; i++) {
    forkWorker();
  }
}
```
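The backoff schedule can be checked in isolation. This sketch extracts the delay formula from Step 1; the attempt numbers are only for illustration:

```javascript
// Delay formula from Step 1: doubles per consecutive crash, capped at 30s.
function backoffDelay(restartCount) {
  return Math.min(1000 * Math.pow(2, restartCount), 30000);
}

for (let attempt = 0; attempt < 7; attempt++) {
  console.log(`attempt ${attempt + 1}: wait ${backoffDelay(attempt)}ms`);
}
// Delays: 1000, 2000, 4000, 8000, 16000, then capped at 30000
```

The cap matters: without it, a worker stuck in a persistent failure would eventually wait hours between attempts and never recover once the underlying cause (e.g. a database outage) is fixed.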
Step 2: Handle worker startup errors gracefully
```javascript
// worker.js
// db, loadConfig, and app are your application's own modules.
async function startWorker() {
  try {
    // Initialize database connection.
    await db.connect();

    // Load configuration.
    const config = await loadConfig();

    // Start server.
    const server = app.listen(config.port, () => {
      console.log(`Worker ${process.pid} listening on port ${config.port}`);
      // Notify master that we are ready.
      if (process.send) {
        process.send({ type: 'ready', pid: process.pid });
      }
    });

    // Handle graceful shutdown.
    process.on('SIGTERM', () => {
      console.log(`Worker ${process.pid} received SIGTERM`);
      server.close(() => {
        db.disconnect();
        process.exit(0);
      });
      // Force exit if connections do not drain within 10s.
      setTimeout(() => process.exit(1), 10000);
    });
  } catch (err) {
    console.error(`Worker ${process.pid} failed to start:`, err.message);
    // Notify master of the failure.
    if (process.send) {
      process.send({ type: 'error', error: err.message });
    }
    // Exit with a non-zero code to indicate a crash.
    process.exit(1);
  }
}

startWorker();
```
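The worker reports `ready` and `error` messages, but nothing in Step 1 reads them. A sketch of the missing receiving side (the message shapes match Step 2; `handleWorkerMessage` and the `state` object are hypothetical names):

```javascript
// Interpret the { type, ... } messages a worker sends via process.send().
function handleWorkerMessage(msg, state) {
  if (msg.type === 'ready') {
    state.readyPids.add(msg.pid);
  } else if (msg.type === 'error') {
    // Keep the last startup error so a crash loop can be explained in logs.
    state.lastStartupError = msg.error;
  }
  return state;
}

// In the master, after cluster.fork():
//   worker.on('message', msg => handleWorkerMessage(msg, state));
const state = { readyPids: new Set(), lastStartupError: null };
handleWorkerMessage({ type: 'ready', pid: 101 }, state);
handleWorkerMessage({ type: 'error', error: 'ECONNREFUSED' }, state);
console.log(state.readyPids.has(101), state.lastStartupError);
```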
Prevention
- Add exponential backoff to worker restarts to prevent rapid crash loops
- Set a maximum restart rate and stop the cluster if exceeded
- Log the full error stack trace before the worker exits
- Use process.send() to communicate worker readiness to the master
- Implement graceful shutdown with SIGTERM handling
- Add a health check endpoint that verifies database and dependency connectivity
- Monitor worker restart rate in production monitoring with alerts
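The health check from the list above can be sketched as a probe runner. The `/healthz` path and the `db` probe below are placeholder names; the probe functions are whatever your dependencies actually expose:

```javascript
// Run each dependency probe; report 503 if any fails so a load balancer
// or orchestrator can see an unhealthy worker before it crash-loops.
async function runHealthChecks(checks) {
  const results = {};
  let healthy = true;
  for (const [name, probe] of Object.entries(checks)) {
    try {
      await probe();
      results[name] = 'ok';
    } catch (err) {
      results[name] = err.message;
      healthy = false;
    }
  }
  return { status: healthy ? 200 : 503, results };
}

// Wiring it up (illustrative, assuming an Express-style app):
// app.get('/healthz', async (req, res) => {
//   const { status, results } = await runHealthChecks({ db: () => db.ping() });
//   res.status(status).json(results);
// });
```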