Introduction

Zombie processes are terminated child processes whose exit status has not been collected by their parent via wait(). A zombie consumes no CPU and almost no memory, and its own file descriptors are closed when it exits, but it keeps its process table entry and its PID. The parent, meanwhile, often still holds the pipe or socket ends it opened to talk to each child. When thousands of zombies accumulate, the system can run out of PIDs and the parent can run out of file descriptors, which prevents new process creation.
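
To make the definition concrete, the following sketch deliberately creates one short-lived zombie and then looks for it. The function name make_zombie and the sleep durations are illustrative assumptions, and the ps options assume the procps version of ps found on most Linux systems.

```bash
#!/usr/bin/env bash
# Create one short-lived zombie for demonstration purposes.
make_zombie() {
  # The subshell starts `sleep 0.2` in the background, then replaces itself
  # with `sleep 3`. That process never calls wait(), so the finished child
  # stays in Z state until `sleep 3` exits.
  ( sleep 0.2 & exec sleep 3 ) &
  parent=$!
}

make_zombie
sleep 1   # give the child time to exit

# Count children of $parent whose state starts with Z.
zombies=$(ps -eo ppid,stat | awk -v p="$parent" '$1 == p && $2 ~ /^Z/' | wc -l)
echo "zombie children of PID $parent: $zombies"

# When the parent exits, the zombie is re-parented and normally reaped.
wait "$parent"
```

The zombie disappears on its own here only because its parent exits after a few seconds; a long-running parent that never calls wait() keeps every such entry indefinitely.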

Symptoms

  • ps aux | grep defunct shows many zombie (Z state) processes
  • fork: retry: Resource temporarily unavailable errors
  • Total number of processes and threads approaching the limit in /proc/sys/kernel/pid_max
  • lsof -p <parent-pid> shows a high open file descriptor count for the parent process
  • New processes fail to spawn with Cannot allocate memory (despite free RAM)
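
As a rough way to check the PID and file-handle symptoms above in one pass, the sketch below compares current usage against the kernel limits. The helper name percent_used is an assumption made for illustration, and the script assumes a Linux /proc filesystem and procps ps.

```bash
#!/usr/bin/env bash
# Integer percentage of a used value against its limit.
percent_used() {
  echo $(( $1 * 100 / $2 ))
}

if [ -r /proc/sys/kernel/pid_max ]; then
  pid_max=$(cat /proc/sys/kernel/pid_max)
  # Threads consume PIDs too, so count every thread (-L), not just processes.
  tasks=$(ps -eLo pid= | wc -l)
  # file-nr fields: allocated handles, allocated-but-unused handles, maximum.
  read -r fd_allocated _ fd_max < /proc/sys/fs/file-nr
  echo "PIDs in use: $tasks of $pid_max ($(percent_used "$tasks" "$pid_max")%)"
  echo "File handles: $fd_allocated of $fd_max ($(percent_used "$fd_allocated" "$fd_max")%)"
fi
```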

Common Causes

  • Parent process neither handles SIGCHLD nor calls wait() to reap its children
  • Application spawns children in a fork-bomb-like pattern without a wait loop
  • Long-running daemon with a buggy child process management implementation
  • Containers whose PID 1 (the init process of the PID namespace) does not reap orphans
  • Python subprocess.Popen without wait() or communicate() calls
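
Most of these causes come down to a missing wait for each spawned child. The sketch below shows the shape of such a wait loop in shell, purely as an illustration: Bash normally reaps its own background jobs, so the pattern matters most in C, Python, and similar code, where the same record-then-wait structure applies. The `( exit "$code" )` subshells are stand-ins for real work.

```bash
#!/usr/bin/env bash
# Start three children and remember their PIDs.
pids=""
for code in 0 1 2; do
  ( exit "$code" ) &          # stand-in for real work
  pids="$pids $!"
done

# Wait for every child so that each exit status is collected.
failed=0
for pid in $pids; do
  wait "$pid" || failed=$((failed + 1))
done
echo "children that exited non-zero: $failed"
```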

Step-by-Step Fix

  1. Count zombie processes and identify their parents:

     ```bash
     # List the first 20 zombies with their parent PIDs
     ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/' | head -20
     # Count zombies per parent PID, largest first
     ps -eo ppid,stat | awk '$2 ~ /^Z/ {print $1}' | sort | uniq -c | sort -rn | head -10
     ```

  2. Check file descriptor usage system-wide:

     ```bash
     cat /proc/sys/fs/file-nr
     # Output: allocated, allocated-but-unused, maximum
     cat /proc/sys/fs/file-max
     ```

  3. Check the per-process file descriptor count:

     ```bash
     # Replace 1234 with the PID to inspect
     ls /proc/1234/fd/ | wc -l
     # Rank processes matching "myapp" by number of open FDs
     for pid in $(pgrep -f "myapp"); do
       echo "$pid: $(ls /proc/$pid/fd/ 2>/dev/null | wc -l) fds"
     done | sort -t: -k2 -n -r | head -10
     ```

  4. Signal the parent to reap its zombies. This only helps if the parent has a SIGCHLD handler that calls wait(); a parent without a handler ignores the signal:

     ```bash
     sudo kill -s CHLD <parent-pid>
     ```

  5. If the parent is unresponsive, kill it so that init adopts and reaps the zombies:

     ```bash
     sudo kill -TERM <parent-pid>
     # If it does not respond:
     sudo kill -9 <parent-pid>
     ```

  6. Increase the system-wide file descriptor limit as temporary relief. Either command works; both set the same kernel parameter, and the change does not survive a reboot or remove any zombies:

     ```bash
     echo 1048576 | sudo tee /proc/sys/fs/file-max
     sudo sysctl -w fs.file-max=1048576
     ```
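
The first steps and the SIGCHLD decision can be combined into one triage pass. The sketch below ranks parents by zombie count and reads the SigCgt mask from /proc/<pid>/status to see whether each parent has installed a SIGCHLD handler, which is a reasonable hint as to whether signalling it is worth trying. All function names are illustrative assumptions, and a caught signal does not guarantee that the handler actually calls wait().

```bash
#!/usr/bin/env bash
# Read "PPID STAT" lines on stdin; print "<zombie-count> <ppid>", largest first.
zombies_by_parent() {
  awk '$2 ~ /^Z/ {count[$1]++} END {for (p in count) print count[p], p}' |
    sort -rn
}

# Succeed if bit <signum> is set in a hex signal mask from /proc/<pid>/status.
mask_has_signal() {
  [ $(( (0x$1 >> ($2 - 1)) & 1 )) -eq 1 ]
}

# Succeed if the given PID has installed a handler for SIGCHLD.
catches_sigchld() {
  mask=$(awk '/^SigCgt:/ {print $2}' "/proc/$1/status" 2>/dev/null)
  # `kill -l CHLD` prints the signal number for this platform (Bash builtin).
  [ -n "$mask" ] && mask_has_signal "$mask" "$(kill -l CHLD)"
}

ps -eo ppid,stat | zombies_by_parent | head -10 |
while read -r count ppid; do
  if catches_sigchld "$ppid"; then
    verdict="catches SIGCHLD, signalling it may trigger a reap"
  else
    verdict="no SIGCHLD handler found, signalling it is unlikely to help"
  fi
  echo "PID $ppid: $count zombies ($verdict)"
done
```

On a healthy system the loop prints nothing, because no process has zombie children.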

Prevention

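  • Ensure all code that spawns child processes either calls wait() for each child or installs a SIGCHLD handler that does
  • Use prctl(PR_SET_CHILD_SUBREAPER, 1) in long-running daemons so that orphaned descendants are re-parented to the daemon, which must then wait() for them
  • Monitor zombie count: watch "ps -eo stat | grep -c '^Z'"
  • Set the per-service open file limit (the equivalent of ulimit -n) in systemd unit files with LimitNOFILE=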
  • Use process supervisors like systemd, supervisord, or runit that properly manage child lifecycle
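
For unattended monitoring, the zombie count can be wrapped in a small check that a cron job or monitoring agent could run. This is a minimal sketch: the function names, the ZOMBIE_THRESHOLD environment variable, and the default of 100 are all assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# Count zombie entries in a column of ps STAT codes read from stdin.
count_zombies() {
  awk '$1 ~ /^Z/ {n++} END {print n + 0}'
}

# Print a one-line status; return 1 when the count exceeds the threshold.
check_zombies() {
  threshold=$1
  count=$(ps -eo stat | count_zombies)
  if [ "$count" -gt "$threshold" ]; then
    echo "WARNING: $count zombie processes (threshold $threshold)"
    return 1
  fi
  echo "OK: $count zombie processes"
}

# ZOMBIE_THRESHOLD is a hypothetical variable; 100 is an arbitrary default.
check_zombies "${ZOMBIE_THRESHOLD:-100}"
```

The non-zero return value lets a scheduler or alerting wrapper treat a threshold breach as a failed check.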