Introduction
PM2's cluster mode forks multiple Node.js processes that share the same server port, distributing incoming requests across workers. When a worker gets stuck in the "forking" or "launching" state and never reaches "online", the application is partially available -- some workers handle requests while the stuck workers consume resources but serve nothing. This commonly happens due to port conflicts, native addon incompatibilities with cluster mode, or application code that blocks the event loop during startup.
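For reference, a typical cluster-mode launch looks like this (assuming an entry point named server.js):

```bash
# Start four workers in cluster mode sharing one port
pm2 start server.js -i 4 --name myapp

# Or start one worker per CPU core
pm2 start server.js -i max --name myapp
```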
Symptoms
PM2 list shows stuck workers:

```bash
$ pm2 list
┌────┬───────┬─────────┬───────────┬──────┬────────┬──────────┐
│ id │ name  │ mode    │ status    │ cpu  │ memory │ watching │
├────┼───────┼─────────┼───────────┼──────┼────────┼──────────┤
│ 0  │ myapp │ cluster │ online    │ 0%   │ 85.2mb │ disabled │
│ 1  │ myapp │ cluster │ online    │ 0%   │ 84.8mb │ disabled │
│ 2  │ myapp │ cluster │ launching │ 0%   │ 45.1mb │ disabled │
│ 3  │ myapp │ cluster │ errored   │ 0%   │ 0mb    │ disabled │
└────┴───────┴─────────┴───────────┴──────┴────────┴──────────┘
```

PM2 logs show the issue:
```bash
$ pm2 logs myapp --lines 50
0|myapp | Error: listen EADDRINUSE: address already in use :::3000
1|myapp | Server listening on port 3000
2|myapp | (stuck - no output)
3|myapp | Error: Cannot find module './build/Release/addon.node'
```

Common Causes
- Port already in use: Another process is bound to the same port
- Native addons not compiled for cluster mode: Some native modules do not work with `cluster.fork()`
- Application code blocks startup: Synchronous file I/O or a database migration blocks the fork before the server can listen (see the sketch after this list)
- PM2 max memory restart loop: Worker exceeds the memory limit, restarts, and exceeds it again
- Missing environment variables: The forked process does not inherit a required environment variable
- File descriptor limit: Too many open files prevent the fork from creating sockets
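To illustrate the blocked-startup cause: the sketch below (with `runMigrations` as a hypothetical stand-in for app-specific startup work) calls `listen()` before heavy initialization, so PM2 sees the `listening` event within `listen_timeout`:

```javascript
// Sketch: listen first, then do heavy startup work asynchronously.
// runMigrations is a placeholder for app-specific initialization.
const http = require('http');

function runMigrations() {
  // Stand-in for real startup work; keep it async so the event loop stays free
  return new Promise((resolve) => setTimeout(resolve, 2000));
}

const server = http.createServer((req, res) => res.end('ok'));

// PM2 marks the worker "online" once this 'listening' callback fires
server.listen(process.env.PORT || 3000, () => {
  console.log(`Worker ${process.pid} listening`);
  runMigrations().catch((err) => {
    console.error('Startup task failed:', err);
    process.exit(1); // exit so PM2 restarts the worker rather than serving half-initialized
  });
});
```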
Step-by-Step Fix
Step 1: Check for port conflicts
```bash
# Find what is using the port
lsof -i :3000
# OR
ss -tlnp | grep 3000

# Kill the conflicting process
kill -9 $(lsof -t -i:3000)

# Restart PM2
pm2 restart myapp
```
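If the conflicting process has a graceful shutdown path, sending SIGTERM before SIGKILL gives it a chance to clean up. A minimal sketch of that sequence (an optional refinement, not part of the original steps):

```bash
# Try a graceful shutdown first; force-kill only if the process survives
PID=$(lsof -t -i:3000)
if [ -n "$PID" ]; then
  kill -TERM "$PID"                              # ask the process to exit cleanly
  sleep 5
  kill -0 "$PID" 2>/dev/null && kill -9 "$PID"   # still alive? force it
fi
```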
Step 2: Use a PM2 ecosystem file with proper configuration
```javascript
// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'myapp',
    script: 'server.js',
    instances: 4,
    exec_mode: 'cluster',

    // Environment variables for all workers
    env: {
      NODE_ENV: 'production',
      PORT: 3000,
    },

    // Restart configuration
    max_memory_restart: '500M',
    restart_delay: 3000,
    max_restarts: 10,

    // Logging
    error_file: '/var/log/pm2/myapp-error.log',
    out_file: '/var/log/pm2/myapp-out.log',
    merge_logs: true,

    // Worker timeouts
    kill_timeout: 5000,
    listen_timeout: 8000, // How long to wait for 'listening' event
  }],
};
```
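Typical commands to apply the file (the filename matches the config above):

```bash
# Start the app from the ecosystem file
pm2 start ecosystem.config.js

# Apply configuration changes to an already-running app
pm2 reload ecosystem.config.js

# Persist the process list across reboots
pm2 save
```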
Step 3: Fix native addon compatibility
If using native addons that do not support cluster mode:
```javascript
// server.js
const cluster = require('cluster');

// cluster.isPrimary requires Node.js 16+; on older versions use cluster.isMaster
if (cluster.isPrimary) {
  // Primary process - do not load native addons here
  const numCPUs = require('os').cpus().length;

  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  cluster.on('exit', (worker, code, signal) => {
    console.log(`Worker ${worker.process.pid} died. Restarting...`);
    cluster.fork();
  });
} else {
  // Worker process - load native addons here
  const nativeAddon = require('./build/Release/addon.node');
  const app = require('./app');
  app.listen(process.env.PORT || 3000);
}
```
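Relatedly, if a worker legitimately needs time to initialize, PM2's `wait_ready` option (set `wait_ready: true` in the ecosystem file) makes PM2 wait for an explicit `process.send('ready')` instead of the `listening` event. A minimal worker-side sketch, assuming an Express-style `./app` module:

```javascript
// server.js (worker side): requires wait_ready: true and exec_mode: 'cluster'
// in the ecosystem file; PM2 then waits for process.send('ready')
const app = require('./app');

app.listen(process.env.PORT || 3000, () => {
  // process.send is only defined when the process was forked (as under PM2 cluster mode)
  if (process.send) {
    process.send('ready');
  }
});
```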
Step 4: Debug stuck workers
```bash
# Get detailed info on a stuck worker
pm2 describe myapp

# Check worker logs
pm2 logs myapp --raw

# Monitor worker memory
pm2 monit

# If stuck, delete and recreate
pm2 delete myapp
pm2 start ecosystem.config.js
```
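If a worker hangs with no output at all, you can attach the Node.js inspector to the live process: sending SIGUSR1 makes a running Node.js process open its inspector port. A sketch, with `<pid>` as a placeholder for the stuck worker's OS PID:

```bash
# Get the OS PIDs of the app's workers
pm2 pid myapp

# SIGUSR1 tells a running Node.js process to enable its inspector (default port 9229)
kill -USR1 <pid>

# Attach the built-in CLI debugger to the now-open inspector port
node inspect 127.0.0.1:9229
```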
Prevention
- Use PM2 ecosystem files instead of command-line arguments for reproducible configuration
- Set `listen_timeout` to detect stuck workers (default 8000ms)
- Monitor worker restart rate with `pm2 monit` and alert on frequent restarts (a scripted check is sketched after this list)
- Avoid native addons in cluster mode, or load them only in worker processes
- Ensure the application emits the `listening` event on the server object
- Use `merge_logs: true` to combine logs from all workers for easier debugging
- Set `max_restarts` to prevent infinite restart loops on broken deployments
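A sketch of that restart-rate check, reading restart counts from `pm2 jlist` (the `pm2_env.restart_time` field is PM2's cumulative restart counter; the `jq` dependency and the threshold value are assumptions):

```bash
#!/usr/bin/env bash
# Alert when any PM2 process has restarted more than THRESHOLD times.
# Requires jq; 10 is an arbitrary example threshold.
THRESHOLD=10

pm2 jlist | jq -r '.[] | "\(.name) \(.pm2_env.restart_time)"' |
while read -r name restarts; do
  if [ "$restarts" -gt "$THRESHOLD" ]; then
    echo "ALERT: $name has restarted $restarts times"
  fi
done
```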