Introduction

Node.js cluster mode forks worker processes that share listening sockets so incoming connections can be distributed across them. When a worker crashes and is automatically respawned, but the replacement crashes for the same reason, the cluster enters a crash-restart loop. The loop consumes CPU through constant forking, generates massive log output, and leaves the application with fewer healthy workers than configured. The root cause is typically an unhandled exception in startup code, a resource that cannot be initialized (a database connection, a file lock), or a memory leak that pushes the worker past its memory limit shortly after starting.

Symptoms

Cluster logs show a rapid restart loop:

```bash
Worker 12345 disconnected
Worker 12346 started
Worker 12346 disconnected (code: 1, signal: null)
Worker 12347 started
Worker 12347 disconnected (code: 1, signal: null)
Worker 12348 started
... (loop continues)
```

CPU usage spikes from constant forking:

```bash
$ top -p $(pgrep -d',' node)
PID    USER   %CPU  %MEM
12345  app    180   2.1   <-- Multiple workers consuming CPU in crash loop
12346  app    45    0.5
12347  app    35    0.4
```

Or the primary process's log:

```bash
Worker 42 exited with code 1, signal null
Forking new worker to replace 42
Worker 43 forked
Worker 43 exited with code 1, signal null
Forking new worker to replace 43
... (thousands of restarts per minute)
```

Common Causes

  • Unhandled exception in startup code: Worker throws during initialization
  • Database connection fails: Worker cannot connect to database and crashes
  • Missing environment variable: Required config not passed to forked workers
  • Port already in use from previous worker: Old worker did not release the port
  • Memory limit too low: Worker OOMs immediately after starting
  • Infinite loop in worker code: Worker consumes all CPU and is killed by a health check
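Several of these causes, missing environment variables in particular, are easier to catch with a fail-fast check at the top of the worker so the first crash prints one clear error instead of scrolling in a restart loop. A minimal sketch, where `DATABASE_URL` and `PORT` are hypothetical variable names:

```javascript
// Fail fast on missing configuration. checkEnv is an illustrative helper;
// the required names come from your application's own config.
function checkEnv(required, env = process.env) {
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
}

// Simulate a forked worker that received PORT but not DATABASE_URL.
try {
  checkEnv(['DATABASE_URL', 'PORT'], { PORT: '3000' });
} catch (err) {
  console.log(err.message);
  // Missing required environment variables: DATABASE_URL
}
```

Calling this before any other initialization makes cause three on the list above immediately visible in the very first worker's output.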

Step-by-Step Fix

Step 1: Add restart delay and limit

```javascript
const cluster = require('cluster');
const os = require('os');

if (cluster.isPrimary) {
  const maxRestartsPerMinute = 10;
  let restartCount = 0;
  let restartTimestamps = [];

  function forkWorker() {
    const now = Date.now();

    // Check the restart rate over the last minute
    restartTimestamps = restartTimestamps.filter((t) => now - t < 60000);
    if (restartTimestamps.length >= maxRestartsPerMinute) {
      console.error('Too many worker restarts. Stopping cluster.');
      process.exit(1);
    }

    const worker = cluster.fork();
    restartTimestamps.push(now);

    worker.on('exit', (code, signal) => {
      console.log(`Worker ${worker.process.pid} exited: code=${code}, signal=${signal}`);

      if (code !== 0 && signal === null) {
        // Worker crashed - wait with exponential backoff before restarting
        const delay = Math.min(1000 * Math.pow(2, restartCount), 30000);
        console.log(`Restarting worker in ${delay}ms (attempt ${restartCount + 1})`);
        restartCount++;

        setTimeout(() => {
          forkWorker();
          restartCount = Math.max(0, restartCount - 1);
        }, delay);
      } else {
        // Normal exit or signal - restart immediately
        forkWorker();
      }
    });

    return worker;
  }

  // Fork one worker per CPU core
  const numCPUs = os.cpus().length;
  for (let i = 0; i < numCPUs; i++) {
    forkWorker();
  }
}
```
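The backoff formula in the exit handler doubles the delay on each consecutive crash and caps it at 30 seconds. Extracted as a small function for a quick sanity check:

```javascript
// Exponential backoff with a 30s cap, matching the formula in Step 1.
function restartDelay(restartCount) {
  return Math.min(1000 * Math.pow(2, restartCount), 30000);
}

// Delays for the first seven consecutive crashes, in milliseconds:
console.log([0, 1, 2, 3, 4, 5, 6].map(restartDelay).join(' '));
// prints: 1000 2000 4000 8000 16000 30000 30000
```

Because `restartCount` is decremented again after each successful delayed restart, a worker that eventually stays up lets the delay decay back toward one second.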

Step 2: Handle worker startup errors gracefully

```javascript
// worker.js
// Note: db, app, and loadConfig stand in for the application's own
// database client, HTTP app, and configuration loader.
async function startWorker() {
  try {
    // Initialize the database connection
    await db.connect();

    // Load configuration
    const config = await loadConfig();

    // Start the server
    const server = app.listen(config.port, () => {
      console.log(`Worker ${process.pid} listening on port ${config.port}`);

      // Notify the primary that we are ready
      if (process.send) {
        process.send({ type: 'ready', pid: process.pid });
      }
    });

    // Handle graceful shutdown
    process.on('SIGTERM', () => {
      console.log(`Worker ${process.pid} received SIGTERM`);
      server.close(() => {
        db.disconnect();
        process.exit(0);
      });

      // Force exit if shutdown does not finish within 10 seconds
      setTimeout(() => process.exit(1), 10000);
    });
  } catch (err) {
    console.error(`Worker ${process.pid} failed to start:`, err.message);

    // Notify the primary of the failure
    if (process.send) {
      process.send({ type: 'error', error: err.message });
    }

    // Exit with a non-zero code so the primary treats this as a crash
    process.exit(1);
  }
}

startWorker();
```
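The primary in Step 1 does not yet consume the `ready`/`error` messages the worker sends. A primary-side handler can be sketched as below; `handleWorkerMessage` and `readyWorkers` are illustrative names, and the message shapes match what Step 2 sends via `process.send()`:

```javascript
// Track which workers have completed startup, based on the messages
// sent by worker.js in Step 2.
const readyWorkers = new Set();

function handleWorkerMessage(workerId, msg) {
  if (msg && msg.type === 'ready') {
    readyWorkers.add(workerId);
    return `worker ${msg.pid} ready`;
  }
  if (msg && msg.type === 'error') {
    return `worker startup failed: ${msg.error}`;
  }
  return 'ignored';
}

// In the primary, attach the handler to each forked worker:
//   const worker = cluster.fork();
//   worker.on('message', (msg) => console.log(handleWorkerMessage(worker.id, msg)));

console.log(handleWorkerMessage(1, { type: 'ready', pid: 12345 }));
console.log(handleWorkerMessage(2, { type: 'error', error: 'ECONNREFUSED' }));
```

With this in place the primary can distinguish a worker that died before ever becoming ready (a startup failure worth backing off on) from one that crashed after serving traffic.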

Prevention

  • Add exponential backoff to worker restarts to prevent rapid crash loops
  • Set a maximum restart rate and stop the cluster if exceeded
  • Log the full error stack trace before the worker exits
  • Use process.send() to communicate worker readiness to the primary
  • Implement graceful shutdown with SIGTERM handling
  • Add a health check endpoint that verifies database and dependency connectivity
  • Monitor the worker restart rate in production and alert when it exceeds a threshold