Introduction

PM2's cluster mode forks multiple Node.js processes that share the same server port, distributing incoming requests across workers. When a worker gets stuck in the "forking" or "launching" state and never reaches "online", the application is partially available -- some workers handle requests while the stuck workers consume resources but serve nothing. This commonly happens due to port conflicts, native addon incompatibilities with cluster mode, or application code that blocks the event loop during startup.

Symptoms

PM2 list shows stuck workers:

```bash
$ pm2 list
┌────┬────────┬─────────┬───────────┬──────┬────────┬──────────┐
│ id │ name   │ mode    │ status    │ cpu  │ memory │ watching │
├────┼────────┼─────────┼───────────┼──────┼────────┼──────────┤
│ 0  │ myapp  │ cluster │ online    │ 0%   │ 85.2mb │ disabled │
│ 1  │ myapp  │ cluster │ online    │ 0%   │ 84.8mb │ disabled │
│ 2  │ myapp  │ cluster │ launching │ 0%   │ 45.1mb │ disabled │
│ 3  │ myapp  │ cluster │ errored   │ 0%   │ 0mb    │ disabled │
└────┴────────┴─────────┴───────────┴──────┴────────┴──────────┘
```

PM2 logs show the issue:

```bash
$ pm2 logs myapp --lines 50
0|myapp  | Error: listen EADDRINUSE: address already in use :::3000
1|myapp  | Server listening on port 3000
2|myapp  | (stuck - no output)
3|myapp  | Error: Cannot find module './build/Release/addon.node'
```

Common Causes

  • Port already in use: Another process is bound to the same port
  • Native addons not compiled for cluster mode: Some native modules do not work with cluster.fork()
  • Application code blocks startup: Synchronous file I/O or database migration blocks the fork
  • PM2 max memory restart loop: Worker exceeds memory limit, restarts, exceeds again
  • Missing environment variables: Forked process does not inherit required environment
  • File descriptor limit: Too many open files prevent the fork from creating sockets
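
Several of these causes, missing environment variables in particular, produce a worker that hangs rather than crashes. A fail-fast guard at the top of the entry script turns a silent hang into an immediate, visible error. This is a sketch; the function name and variable list are illustrative, not part of PM2:

```javascript
// Hypothetical startup guard: throw immediately if required configuration
// is missing, so a misconfigured worker lands in "errored" (with a clear
// log line) instead of hanging in "launching".
function checkRequiredEnv(env, required) {
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
}

// Call this before the server starts listening, e.g.:
// checkRequiredEnv(process.env, ['NODE_ENV', 'PORT', 'DATABASE_URL']);
```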

Step-by-Step Fix

Step 1: Check for port conflicts

```bash
# Find what is using the port
lsof -i :3000
# OR
ss -tlnp | grep 3000

# Kill the conflicting process
kill -9 $(lsof -t -i:3000)

# Restart PM2
pm2 restart myapp
```

Step 2: Use PM2 ecosystem file with proper configuration

```javascript
// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'myapp',
    script: 'server.js',
    instances: 4,
    exec_mode: 'cluster',

    // Environment variables for all workers
    env: {
      NODE_ENV: 'production',
      PORT: 3000,
    },

    // Restart configuration
    max_memory_restart: '500M',
    restart_delay: 3000,
    max_restarts: 10,

    // Logging
    error_file: '/var/log/pm2/myapp-error.log',
    out_file: '/var/log/pm2/myapp-out.log',
    merge_logs: true,

    // Worker timeouts
    kill_timeout: 5000,
    listen_timeout: 8000, // How long to wait for the 'listening' event
  }],
};
```

Step 3: Fix native addon compatibility

If the app uses native addons that misbehave under cluster.fork(), run it under PM2 in fork mode (exec_mode: 'fork') and manage the cluster yourself, so addons are loaded only in worker processes:

```javascript
// server.js
const cluster = require('cluster');

if (cluster.isPrimary) {
  // Primary process - do not load native addons here
  const numCPUs = require('os').cpus().length;

  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  cluster.on('exit', (worker, code, signal) => {
    console.log(`Worker ${worker.process.pid} died. Restarting...`);
    cluster.fork();
  });
} else {
  // Worker process - load native addons here
  const nativeAddon = require('./build/Release/addon.node');
  const app = require('./app');
  app.listen(process.env.PORT || 3000);
}
```
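
The "Cannot find module './build/Release/addon.node'" error from the symptoms can also be softened at require time, assuming a pure-JS fallback for the addon exists. A hedged sketch; requireWithFallback and the paths are illustrative:

```javascript
// Sketch: fall back to a JS implementation when the native build is
// missing, so workers come online instead of crash-looping.
function requireWithFallback(modulePath, fallbackFactory) {
  try {
    return require(modulePath);
  } catch (err) {
    if (err.code === 'MODULE_NOT_FOUND') {
      console.warn(`${modulePath} not found, using JS fallback`);
      return fallbackFactory();
    }
    throw err; // real load failures should still crash loudly
  }
}

// Hypothetical usage:
// const addon = requireWithFallback('./build/Release/addon.node', () => require('./addon-js'));
```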

Step 4: Debug stuck workers

```bash
# Get detailed info on a stuck worker
pm2 describe myapp

# Check worker logs
pm2 logs myapp --raw

# Monitor worker memory
pm2 monit

# If stuck, delete and recreate
pm2 delete myapp
pm2 start ecosystem.config.js
```
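
For scripted checks, pm2 jlist prints the full process table as JSON, which can be filtered for workers that never reached "online". The filter below assumes the jlist shape (pm_id, name, pm2_env.status); the sample record in the test is hand-written, not real pm2 output:

```javascript
// Sketch: given the parsed JSON from `pm2 jlist`, return the workers whose
// status is anything other than "online" (launching, errored, stopped, ...).
function findStuckWorkers(processList) {
  return processList
    .filter((proc) => proc.pm2_env.status !== 'online')
    .map((proc) => ({
      id: proc.pm_id,
      name: proc.name,
      status: proc.pm2_env.status,
    }));
}
```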

Prevention

  • Use PM2 ecosystem files instead of command-line arguments for reproducible configuration
  • Set listen_timeout explicitly so PM2 marks workers that never emit the listening event as errored instead of leaving them in launching
  • Monitor worker restart rate with pm2 monit and alert on frequent restarts
  • Avoid native addons in cluster mode, or load them only in worker processes
  • Ensure the application emits the listening event on the server object
  • Use merge_logs: true to combine logs from all workers for easier debugging
  • Set max_restarts to prevent infinite restart loops on broken deployments